arXivDaily arXiv每日学术速递 周一至周五更新
重置

1. 智能体、规划与决策 45 篇

2606.14885 2026-06-16 cs.AI cs.CL 新提交

Dr-DCI: Scaling Direct Corpus Interaction via Dynamic Workspace Expansion

Dr-DCI: 通过动态工作空间扩展实现直接语料交互的规模化

Yi Lu, Zhuofeng Li, Ping Nie, Haoxiang Zhang, Yuyu Zhang, Kai Zou, Wenhu Chen, Jimmy Lin, Dongfu Jiang, Yu Zhang

发表机构 * University of Toronto(多伦多大学) Texas A&M University(德克萨斯A&M大学) University of Waterloo(滑铁卢大学) UC San Diego(加州大学圣迭戈分校) Verdent AI Netmind AI

AI总结 提出DR-DCI框架,将检索作为智能体可调用的动作来动态扩展本地工作空间,结合检索器的召回能力与DCI的局部操作精度,实现大规模语料上的高效搜索与验证。

Comments 25 pages, 4 figures, 22 tables

详情
AI中文摘要

大规模语料上的智能体搜索依赖于检索器中介接口(如BM25或ColBERT)实现可扩展的候选发现。虽然这些接口在排序相关文档方面有效,但它们仅将证据呈现为排序结果或有界文档视图,限制了智能体重组材料和跨文档验证约束的能力。直接语料交互(DCI)通过暴露可shell执行的语料操作来解决这一限制,支持灵活的搜索、过滤、比较和验证。然而,随着语料增长,全语料终端命令变得缓慢且不稳定,降低了性能和效率。我们提出DR-DCI,一种检索器引导的DCI框架,将检索视为智能体可调用的动作以扩展本地工作空间。智能体不是直接操作整个语料,而是动态地将相关文档拉入一个不断演变的工作空间,并在其中执行DCI操作。这种设计结合了检索器级别的召回与DCI级别的精度:检索保持探索的可扩展性,而DCI保留有效证据解析所需的局部操作。实验表明,DR-DCI在不同规模下均有效且高效。在Browsecomp-Plus上,DR-DCI达到71.2%的准确率,相比原始DCI和消融变体提升高达8.3个百分点,同时减少了工具使用、墙钟时间和估计成本。通过保留工作空间的上下文重置,准确率进一步提升至73.3%。在语料规模实验中,DR-DCI在10万到1000万文档范围内保持有效,而原始DCI变得不稳定,BM25表现显著更差。DR-DCI还扩展到2000万规模的文件级文档Wiki-18 QA设置,在六个基准测试中平均得分63.0,优于基于检索和训练搜索智能体的基线。消融分析进一步表明,排序预览和文档间DCI是性能的关键。

英文摘要

Agentic search over large corpora relies on retriever-mediated interfaces (e.g., BM25 or ColBERT) for scalable candidate discovery. While effective at ranking relevant documents, these interfaces expose evidence only as ranked results or bounded document views, limiting agents' ability to reorganize material and verify constraints across documents. Direct Corpus Interaction (DCI) addresses this limitation by exposing shell-executable corpus operations for flexible search, filtering, comparison, and verification. However, full-corpus terminal commands become slow and unstable as the corpus grows, degrading performance and efficiency. We introduce DR-DCI, a retriever-steered DCI framework that treats retrieval as an agent-callable action for expanding a local workspace. Rather than operating directly over the full corpus, the agent dynamically pulls relevant documents into an evolving workspace and conducts DCI operations within it. This design combines retriever-level recall with DCI-style precision: retrieval keeps exploration scalable, while DCI preserves the local operations needed for effective evidence resolution. Experiments show that DR-DCI is both effective and efficient across scales. On Browsecomp-Plus, DR-DCI reaches 71.2\% accuracy, improving over raw DCI and ablated variants by up to 8.3 points while reducing tool usage, wall time, and estimated cost. With workspace-preserving context reset, accuracy further improves to 73.3\%. In corpus-scaling experiments, DR-DCI remains effective from 100K to 10M documents, whereas raw DCI becomes unstable and BM25 performs substantially worse. DR-DCI also scales to a 20M-scale file-per-document Wiki-18 QA setting, achieving an average score of 63.0 across six benchmarks and outperforming retrieval-based and trained search-agent baselines. Ablation analysis further shows that ranked previews and inter-document DCI are key to performance.

2606.15077 2026-06-16 cs.AI cs.CL 新提交

Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

风险感知的LLM智能体用于地理空间数据检索:设计与初步对抗性评估

Kyle Gao, Joel Cumming, Jonathan Li, Linlin Xu, David A. Clausi

发表机构 * Dept. of Systems Design Engineering, University of Waterloo(滑铁卢大学系统设计工程系) SkyWatch Dept. of Geography and Environmental Management, University of Waterloo(滑铁卢大学地理与环境管理系) Dept. of Geomatics Engineering, University of Calgary(卡尔加里大学测绘工程系)

AI总结 提出一种基于LLM的框架,通过自然语言查询从云地理空间目录检索遥感数据,集成三个智能体实现安全、意图解析和API调用生成,初步对抗实验表明提示级安全指令提升鲁棒性但需系统级防御。

Comments Accepted for publication in the International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Archives), ISPRS Congress 2026

详情
AI中文摘要

我们提出一个由LLM驱动的框架,用于通过自然语言查询从基于云的地理空间目录中检索遥感数据。该系统将用户意图转换为结构化的API调用,实现对卫星影像和环境数据集的高效访问。该架构集成了三个智能体:Guardrail用于安全和策略执行,General-QA用于意图解释,Recommender-Analyst用于模式感知的API调用生成。这种协调设计确保了与外部数据服务的可靠、语义对齐的交互。该模块化框架通过API模式替换可跨平台移植,并支持环境监测、灾害响应和气候分析等应用。它在用户意图与地理空间基础设施之间建立了可扩展的接口,实现了简化和自动化的地球观测工作流程。在对抗性多轮设置下的初步实验表明,提示级安全指令提高了鲁棒性,尽管在API操作场景中仍存在罕见的高影响失败,这突显了需要自适应、系统级的防御措施来平衡安全性、可用性和成本效率,这也激励了我们使用拦截级别的Guardrail智能体。

英文摘要

We present an LLM-driven framework for retrieving remote sensing data from cloud-based geospatial catalogues using natural language queries. The system converts user intent into structured API calls, enabling efficient access to satellite imagery and environmental datasets. The architecture integrates three agents: Guardrail for safety and policy enforcement, General-QA for intent interpretation, and Recommender-Analyst for schema-aware API call generation. This coordinated design ensures reliable, semantically aligned interaction with external data services. The modular framework is portable across platforms through API schema substitution and supports applications in environmental monitoring, disaster response, and climate analysis. It establishes a scalable interface between user intent and geospatial infrastructure, enabling streamlined and automated Earth observation workflows. Preliminary experiments under adversarial multi-turn settings show that prompt-level safety instructions improve robustness, although rare high-impact failures persist in API manipulation scenarios and highlight the need for adaptive, system-level defenses that balance safety, usability, and cost efficiency, which motivates the use of our intercept-level Guardrail agent.

2606.15363 2026-06-16 cs.AI 新提交

APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents

APEX: 自适应原则提取——面向生产AI智能体的三层自进化框架

Ya-Chuan Chen, Tien-Jen Lai, Hsiang-Wei Hu

发表机构 * Grace AI Technology

AI总结 提出APEX框架,通过三层协同进化(提示修复、原则蒸馏、工作流拓扑选择)提升AI智能体性能,在15节点计算集群上实现健康评分+90%。

Comments 8 pages, 1 figure, 4 tables. Evaluated on a production 15-node compute fleet with 114 real task traces. Code available at https://aispark.airlive.com/joe-hackathon/

详情
AI中文摘要

AI智能体的自我改进已成为一个关键研究前沿:系统根据累积的操作经验修改自身的提示、工作流和决策规则。最先进的Self-Harness框架[1]通过挖掘失败簇并修补智能体提示,在Terminal-Bench-2.0上实现了14–21%的提升。然而,Self-Harness仅优化一个维度——提示提示——而行为原则和工作流拓扑保持不变。我们提出APEX(自适应原则提取),一个三层协同进化框架,同时进化:(L1) 通过失败模式修补的提示,(L2) 通过成功轨迹蒸馏[2]的行为原则,以及(L3) 通过基于结构适应度选择[6]的智能体工作流拓扑。我们在Joe[13]上实现了APEX,Joe是一个基于NVIDIA Nemotron构建的生产级超级AI智能体,专为NVIDIA Agent Challenge 2026设计为边缘AI智能体工厂,管理一个15节点计算集群,使用18天内收集的114个真实任务轨迹。APEX在单次进化运行中达到0.570的APEX健康评分(相比基线0.300提升+90%),蒸馏出6个新的可复用原则,并选择了一个得分为0.900(+20%)的研究优先工作流拓扑。我们的结果表明,多维协同进化显著优于单轴提示优化,且成本仅为在本地qwen2.5-coder:32b实例上调用4次LLM(约270秒)。

英文摘要

Self-improvement in AI agents has emerged as a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework [1] achieves 14--21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness. However, Self-Harness optimises only one dimension -- the prompt harness -- leaving behavioural principles and workflow topology unchanged. We propose APEX (Adaptive Principle EXtraction), a three-layer co-evolution framework that simultaneously evolves: (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation [2], and (L3) the agent workflow topology via structural fitness-based selection [6]. We implement APEX on Joe [13], a production-grade super AI Agent built on NVIDIA Nemotron and designed as an Edge AI Agent Factory for the NVIDIA Agent Challenge 2026, managing a 15-node compute fleet using 114 real task traces collected over 18 days. APEX achieves an APEX Health Score of 0.570 (+90% vs. baseline 0.300) in a single evolutionary run, distilling 6 novel reusable principles and selecting a research-first workflow topology scoring 0.900 (+20%). Our results demonstrate that multi-dimensional co-evolution substantially outperforms single-axis harness optimisation, at a cost of only 4 LLM calls (~270 s) on a local qwen2.5-coder:32b instance.

2606.15367 2026-06-16 cs.AI cs.CL cs.IR cs.LG 新提交

S1-DeepResearch: Beyond Search, Toward Real-World Long-Horizon Research Agents

S1-DeepResearch:超越搜索,迈向真实世界的长周期研究智能体

Yao Dong, Xinglin Xiao, Liwei Dong, Xinlong Jin, Zhengbo Li, Heng Zhang, Duyun Wang, Nan Xu

发表机构 * XScience Lab(XScience实验室) Wenge AI(问格人工智能)

AI总结 提出统一轨迹构建范式,结合封闭式问答与开放式探索,通过图基任务构建、智能体轨迹生成和多维验证,合成高质量长链推理轨迹,训练出在20个基准上达到开源最优的32B模型。

详情
AI中文摘要

深度研究智能体旨在通过长周期规划、证据收集、推理和报告生成来解决复杂的知识密集型任务。尽管搜索智能体近期在信息检索和答案验证方面展现出强大能力,但现有训练数据集大多以搜索为中心,主要关注封闭式问答和信息定位。因此,它们主要训练信息寻求行为,而对关键深度研究能力(包括证据整合、知识综合、规划、文件理解和结构化报告生成)的覆盖有限。在这项工作中,我们提出了一种用于深度研究智能体的统一轨迹构建范式,该范式结合了封闭式问答和开放式探索。所提出的框架包括图基任务构建、智能体轨迹展开和多维轨迹验证,能够可扩展地合成涵盖长链复杂推理、深度研究指令遵循、报告撰写、文件理解与生成以及技能使用的高质量智能体轨迹。与现有的面向搜索的数据集相比,我们合成的轨迹更强调知识综合、复杂推理和规划。S1-DeepResearch-32B在跨越五个能力维度(包括复杂推理、指令遵循、报告生成、文件理解和技能使用)的20个基准测试中,达到了同等规模开源模型的最先进性能。在几个具有挑战性的深度研究基准上,它接近领先的专有前沿模型的性能。这些结果强调了联合建模信息获取、知识综合和面向规划的智能体行为对于构建有效深度研究智能体的重要性。

英文摘要

Deep research agents aim to solve complex knowledge-intensive tasks through long-horizon planning, evidence gathering, reasoning, and report generation. While recent progress in search agents has demonstrated strong capabilities in information retrieval and answer verification, most existing training datasets remain search-centric, focusing primarily on closed-ended question answering and information localization. As a result, they mainly train information-seeking behavior while providing limited coverage of key deep research capabilities, including evidence integration, knowledge synthesis, planning, file understanding, and structured report generation. In this work, we propose a unified trajectory construction paradigm for deep research agents that combines closed-ended QA and open-ended exploration. The proposed framework consists of graph-grounded task formulation, agentic trajectory rollout, and multi-dimensional trajectory verification, enabling scalable synthesis of high-quality agentic trajectories spanning long-chain complex reasoning, deep research instruction following, report writing, file understanding and generation, and skills usage. Compared with existing search-oriented datasets, our synthesized trajectories place greater emphasis on knowledge synthesis, complex reasoning, and planning. S1-DeepResearch-32B achieves state-of-the-art performance among open-source models of comparable scale across 20 benchmarks spanning five capability dimensions, including complex reasoning, instruction following, report generation, file understanding, and skills usage. On several challenging deep research benchmarks, it approaches the performance of leading proprietary frontier models. These results highlight the importance of jointly modeling information acquisition, knowledge synthesis, and planning-oriented agent behaviors for building effective deep research agents.

2606.15497 2026-06-16 cs.AI 新提交

Towards End-to-End Automation of AI Research

迈向AI研究的端到端自动化

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Chris Lu, Shengran Hu, Jakob Foerster, David Ha, Jeff Clune

发表机构 * Sakana AI FLAIR University of Oxford(牛津大学) University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所)

AI总结 提出AI Scientist系统,利用基础模型实现从构思到论文撰写的全自动研究,并通过机器学习会议研讨会的同行评审。

Comments Published in Nature 651, 914-919 (2026)

详情
AI中文摘要

科学自动化是AI领域的一个长期目标。虽然社区在自动化科学过程的各个组成部分方面取得了显著进展,但能够自主导航整个研究生命周期(从构思到发表)的系统仍然遥不可及。在这里,我们展示了迄今为止朝着端到端自动化整个过程的最强演示。我们提出了AI Scientist,它能够创建研究想法、编写代码、运行实验、绘制和分析数据、撰写完整的科学手稿并进行自己的同行评审。其想法、执行和呈现的质量足以生成一份由AI系统产生的手稿,该手稿通过了机器学习会议研讨会的首轮同行评审。该研讨会的接受率为70%。我们的系统在一个复杂的代理系统中利用了现代基础模型。我们在两种设置中评估AI Scientist:一种聚焦模式,使用人类提供的代码模板作为初始支架,在特定主题上进行研究;另一种是无模板的开放模式,利用代理搜索进行更广泛的科学探索。两种设置都能产生多样化的想法,并自动测试、报告和评估它们。这一成就展示了AI在科学贡献方面日益增长的能力,并标志着研究方式可能发生的范式转变。与任何有影响力的新技术一样,可能存在重大风险,包括给不堪重负的评审系统增加负担以及给科学文献带来噪音。然而,如果负责任地开发,这种自主系统可以极大地加速科学发现。

英文摘要

The automation of science is a long-standing ambition in the field of AI. While the community has made significant progress in automating individual components of the scientific process, a system that autonomously navigates the entire research lifecycle -- from conception to publication -- has remained out of reach. Here, we present the strongest demonstration to date toward automating the entire process end-to-end. We present The AI Scientist, which creates research ideas, writes code, runs experiments, plots and analyzes data, writes the entire scientific manuscript and performs its own peer review. Its ideas, execution, and presentation are of sufficient quality to produce a manuscript generated by an AI system that passes the first round of peer review at a major machine learning conference workshop. The workshop has an acceptance rate of 70 percent. Our system leverages modern foundation models within a complex agentic system. We evaluate The AI Scientist in two settings: a focused mode using human-provided code templates as an initial scaffold to conduct research on a specific topic, and a template-free, open-ended mode that leverages agentic search for wider scientific exploration. Both settings produce diverse ideas and automatically test, report on, and evaluate them. This achievement demonstrates AI's growing capacity for scientific contribution and signifies a potential paradigm shift in how research is conducted. As with any impactful new technology, there could be significant risks, including taxing overwhelmed review systems and adding noise to scientific literature. However, if developed responsibly, such autonomous systems could greatly accelerate scientific discovery.

2606.15579 2026-06-16 cs.AI cs.LG cs.MA cs.SE 新提交

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

你的智能体有基因组:基于序列的LLM驱动自主智能体行为分析与运行时治理

Sidi Deng

发表机构 * Independent Researcher(独立研究员)

AI总结 提出XEPV序列编码框架,将LLM智能体行为建模为基因组序列,通过n-gram挖掘发现P-X-P高风险模式,设计Governor三层干预系统,使成功率提升6.2%并减少44% token消耗。

Comments 16 pages, 15 figures, 12 tables

详情
AI中文摘要

我们提出基础序列分析框架,该框架将LLM驱动的自主智能体的运行时行为编码为使用四个字母的字母表的紧凑符号序列:X(探索)、E(执行)、P(规划)和V(验证)。借鉴基因组序列分析的类比,我们对从生产ReAct智能体系统收集的347条真实世界执行轨迹(跨越8天)应用n-gram模式挖掘、马尔可夫转移矩阵和点二列相关分析。我们的分析揭示:(1) 三元组P-X-P是唯一统计显著的高风险模式,使成功率降低10.4%;(2) P比率是成功的最强负预测因子(r=-0.256, p<0.0001);(3) E→V转移概率仅为2.1%,表明存在系统性验证缺陷。基于这些发现,我们设计了Governor,一个三层运行时干预系统,包括规则引擎、统计累加器和基于卡方的阈值自适应器。在自然的部署前后评估中(N=101 vs. N=246),Governor使任务成功率绝对提升6.2%,同时平均token消耗减少44%。为验证跨系统通用性,我们将XEPV编码应用于SWE-bench上2000条公开SWE-agent轨迹,确认探索螺旋和E→V验证缺陷在独立系统中复现。我们概述了六个研究方向,包括基础序列语言模型、跨智能体行为指纹识别和奖励塑造,并发布开源工具包以促进可重复性。

英文摘要

We propose Base Sequence Analysis, a framework that encodes the runtime behavior of LLM-powered autonomous agents into compact symbolic sequences using a four-letter alphabet: X (Explore), E (Execute), P (Plan), and V (Verify). Drawing an analogy to genomic sequence analysis, we apply n-gram pattern mining, Markov transition matrices, and point-biserial correlation to 347 real-world execution traces collected from a production ReAct agent system over 8 days. Our analysis reveals that (1) the trigram P-X-P is the only statistically significant high-risk pattern, lowering success rate by 10.4%; (2) P-ratio is the strongest negative predictor of success (r=-0.256, p<0.0001); and (3) the E->V transition probability is only 2.1%, indicating a systemic verification deficit. Based on these findings, we design Governor, a three-layer runtime intervention system comprising a rule engine, a statistical accumulator, and a chi-square-based threshold adaptor. In a natural before/after deployment evaluation (N=101 vs. N=246), Governor achieves a +6.2% absolute increase in task success rate while simultaneously reducing average token consumption by 44%. To validate cross-system generality, we apply the XEPV encoding to 2,000 public SWE-agent trajectories on SWE-bench, confirming that exploration spirals and the E->V verification deficit replicate in an independent system. We outline six research directions including base sequence language models, cross-agent behavioral fingerprinting, and reward shaping, and release an open-source toolkit for reproducibility.

2606.15866 2026-06-16 cs.AI cs.LG 新提交

STRIDE: Strategic Trajectory Reasoning via Discriminative Estimation for Verifiable Reinforcement Learning

STRIDE: 通过判别估计进行策略轨迹推理以实现可验证强化学习

Qinjian Zhao, Zhihao Dou, Dinggen Zhang, Xiangyu Li, Chaoda Song, Zhongwei Wan, Xinpeng Li, Yanyan Zhang, Kaijie Chen, Qingtao Pan, Chengcheng Feng, Zhiqiang Gao, Xiaoyu Xia

发表机构 * Kean University(基恩大学) Case Western Reserve University(凯斯西储大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) The Ohio State University(俄亥俄州立大学) Tongji University(同济大学) Duke Kunshan University(昆山杜克大学) Royal Melbourne Institute of Technology(皇家墨尔本理工大学)

AI总结 提出STRIDE框架,通过对比成功与失败轨迹估计n-gram策略模式的判别偏好,结合推理显著性熵识别关键策略模式,实现细粒度信用分配,提升可验证强化学习的推理性能。

详情
AI中文摘要

可验证奖励强化学习(RLVR)已成为提升大语言模型推理能力的有效后训练范式。然而,现有RLVR方法通常依赖最终答案正确性分配轨迹级奖励,提供稀疏监督,并统一处理所有token,不考虑它们对推理的实际贡献。尽管最近的研究引入了中间信号,如过程奖励、高熵token和语义不确定性,但这些信号通常本身不可验证,且可能无法区分有益策略模式与有害模式。为解决这一局限,我们提出STRIDE(通过判别估计进行策略轨迹推理),一种从可验证结果中推导策略推理监督的细粒度RLVR框架。STRIDE对比每个响应组内的成功和失败轨迹,以估计每个n-gram策略模式的结果判别偏好,并进一步将该信号与推理显著性熵结合,识别决策相关的策略模式。在RL优化过程中,这些模式被分配差异化的优势值,从而在保持RLVR可验证性的同时实现更精确的信用分配。大量实验表明,STRIDE在多种模型、任务和扩展设置(包括VLM和基于智能体的系统)中一致提升了推理性能。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training paradigm for improving the reasoning abilities of large language models. However, existing RLVR methods typically rely on final-answer correctness to assign trajectory-level rewards, providing sparse supervision and treating all tokens uniformly regardless of their actual contribution to reasoning. Although recent studies introduce intermediate signals such as process rewards, high-entropy tokens, and semantic uncertainty, these signals are often not inherently verifiable and may fail to distinguish beneficial strategic patterns from harmful ones. To address this limitation, we propose STRIDE (Strategic Trajectory Reasoning with Discriminative Estimation), a fine-grained RLVR framework that derives strategic reasoning supervision from verifiable outcomes. STRIDE contrasts successful and failed trajectories within each response group to estimate the outcome-discriminative preference of each $n$-gram strategic pattern, and further combines this signal with reasoning saliency entropy to identify decision-relevant strategic patterns. These patterns are assigned differentiated advantage values during RL optimization, enabling more precise credit assignment while preserving the verifiability of RLVR. Extensive experiments demonstrate that STRIDE consistently improves reasoning performance across diverse models, tasks, and extended settings, including VLMs and agent-based systems.

2606.15874 2026-06-16 cs.AI cs.SE 新提交

LLM-as-Code Agentic Programming for Agent Harness

LLM即代码:面向Agent框架的编程范式

Junjia Qi, Zichuan Fu, Jingtong Gao, Wenlin Zhang, Hanyu Yan, Xian Wu, Xiangyu Zhao

发表机构 * City University of Hong Kong(香港城市大学) Tencent Jarvis Lab(腾讯贾维斯实验室)

AI总结 针对LLM作为编排器导致控制流幻觉和不可靠执行的问题,提出Agentic Programming范式,由程序控制所有流程,LLM仅作为代码组件在需要推理或生成时被调用,显著提升长序列操作的稳定性。

Comments Accepted at the KDD 2026 Workshop on Agentic Software Engineering (AgenticSE)

详情
AI中文摘要

每个主要的LLM Agent框架都赋予LLM编排者的角色;模型决定下一步做什么、何时调用工具以及何时停止。我们认为,令牌爆炸、控制流幻觉和不可靠完成并非实现缺陷,而是将循环、分支和排序等确定性工作分配给概率系统的架构后果。更好的提示或更强的模型无法保证LLM Agent的可靠性。因此,我们提出Agentic Programming,其中程序控制所有流程,而LLM本身是其中的一部分,一个称为LLM-as-Code的自适应组件,仅在任务需要推理或生成时调用。在每个调用中,模型保持完全灵活性,但不能改变程序的执行路径。由于控制权在程序中,LLM的上下文由执行历史的调用树构建,形成有向无环图(DAG)。每个调用的上下文长度由其调用深度决定,而非随步骤累积。计算机使用Agent的案例研究表明,该设计不仅是理论立场,而且是实用的,显著提高了长视觉操作序列的稳定性。

英文摘要

Every major LLM agent framework gives the LLM the role of orchestrator; the model decides what to do next, when to call tools, and when to stop. We argue that token explosion, control-flow hallucination, and unreliable completion are not implementation bugs but architectural consequences of assigning the deterministic work of looping, branching, and sequencing to a probabilistic system. A better prompt or a stronger model cannot guarantee the reliability of the LLM agent. We therefore propose Agentic Programming, in which the program governs all control flow, and the LLM is itself part of it, an adaptive component we call LLM-as-Code and invoke only where a task calls for reasoning or generation. Within each call the model keeps full flexibility, but it cannot alter the program's execution path. With control in the program, the LLM's context is built from the execution history's call tree and forms a directed acyclic graph (DAG). Each call's context length is then determined by its call depth rather than by accumulation over steps. A case study of computer-use agents shows that the design is practical, not just a theoretical stance, substantially improving the stability of long visual operation sequences.

2606.15994 2026-06-16 cs.AI cs.LG 新提交

Agentic Framework for Deep Learning workload migration via In-Context Learning

基于上下文学习的深度学习工作负载迁移智能体框架

Qiyue Liang, Steven Ingram, George Vanica, Andi Gavrilescu, Newfel Harrat, Hassan Sipra, Sethuraman Sankaran

发表机构 * Google(谷歌)

AI总结 提出结合上下文学习与Oracle驱动的自调试的自主系统,实现从PyTorch到JAX的深度学习模型自动迁移,在神经模块上达到91%数值等价性。

详情
AI中文摘要

将深度学习模型从PyTorch灵活的面向对象设计迁移到JAX的函数式无状态设置通常是一项手动且易出错的任务。自动迁移具有挑战性,因为大型语言模型(LLM)难以处理严格且动态的API对齐,并且容易在精确操作上出错。我们提出了一个完全自主的系统,结合了上下文学习(ICL)与Oracle驱动的自调试。首先,我们整理了一个ICL上下文,作为惯用JAX样式和测试用例生成的严格参考。其次,不依赖LLM推导数学输出,而是运行源PyTorch模块以获取其实际的动态张量状态,从而创建一个不可变的执行Oracle。然后,我们使用自主智能体循环基于Oracle数据合成测试。测试用例被重复执行,并将回溯发送回LLM进行自我修正。消融实验表明,将ICL参考与Oracle基础及自调试相结合,大大优于纯指令和基本智能体基线。这种改进没有增加过多的计算开销。我们的轻量级流水线在神经模块上实现了91%的数值等价性(相比之下,基线为9%,指令+自调试为27%),为跨框架迁移提供了高度可靠、可扩展的蓝图。该方案已在多个最先进模型上得到验证,包括SAM(Segment Anything)、T5、Code Whisper等,显示出高数值等价性。代码:https://github.com/AI-Hypercomputer/accelerator-agents/tree/main/MaxCode

英文摘要

Translating deep learning models from PyTorch's flexible, object-oriented design to JAX's functional, stateless setup is usually a manual and error-prone task. Automated migration is challenging because Large Language Models (LLMs) struggle with strict and dynamic API alignment and are prone to mistakes for exacting operations. We propose a fully autonomous system that combines In-Context Learning (ICL) with oracle-driven self-debugging. First, we curated an ICL context that serves as a strict reference for idiomatic JAX styling and test case generation. Second, instead of depending on the LLM to deduce mathematical outputs, we run the source PyTorch modules to get their actual dynamic tensor states. This creates an unchangeable execution oracle. We then use an autonomous agentic loop to synthesize tests based on the oracle data. The test cases are executed repeatedly, and the traceback is sent back to the LLM for self-correction. Ablations show that combining ICL references with oracle grounding and self-debugging greatly outperforms pure instructional and basic agentic baselines. This improvement does not add an excessive computational overhead. Our lightweight pipeline achieves 91% numerical equivalence (compared to baseline: 9%, instruction + self-debugging: 27%) on neural modules, providing a highly reliable, scalable blueprint for cross-framework migration. This has been validated across several state-of-the-art models including SAM (segment anything), T5, Code Whisper amongst others showing high numerical equivalency. Code: https://github.com/AI-Hypercomputer/accelerator-agents/tree/main/MaxCode

2606.16149 2026-06-16 cs.AI 新提交

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

LiteOdyssey: 一种用于可解释罕见病诊断的轻量级推理AI智能体

Minh-Ha Nguyen, Erica Gray, Chih-Ting Yang, Rizwan Hamid, Lingyao Li, Siyuan Ma, Thomas A. Cassini, Cathy Shyr

发表机构 * Vanderbilt University(范德堡大学) Vanderbilt University Medical Center(范德堡大学医学中心) University of South Florida(南佛罗里达大学)

AI总结 提出轻量级框架LiteOdyssey,通过人类-AI协作的诊断策略和公共生物医学工具增强单个推理语言模型,在罕见病诊断基准上达到最先进性能,无需微调或多智能体集成。

Comments 21 pages,5 main figures, working version 1

详情
AI中文摘要

大多数医疗AI系统通过扩展额外机制来改进:更多的微调数据、更多的智能体和/或更大的检索数据库。然而,在罕见病诊断中,这种扩展可能导致系统难以部署、审计和维护。我们探究是否可以通过扩展单个AI智能体的推理链来实现最先进的诊断性能:通过人类-AI协作开发的诊断策略指导它,并利用可免费获取的生物医学工具进行增强。我们引入了LiteOdyssey,一个轻量级罕见病诊断框架,通过临床遗传学工作流引导推理语言模型。该框架通过人类反馈的策略迭代(PIHF)开发,并动态访问公共生物医学工具。在两个仅提供患者临床特征的高难度基准上,LiteOdyssey取得了最先进的性能,在LIRICAL(n=370)和PhenoPacket Store(n=873)的合并1243个病例中,总体疾病Recall@1达到59.3%。这两个基准中超高罕见病(患病率低于1/1,000,000)的比例很高,分别约为45%和52.8%。在更困难的PhenoPacket子集上(其中因果疾病未在我们的稀有性映射流程中映射到Orphanet),LiteOdyssey实现了60.7%的Recall@1,而相同基线模型(GPT-5.4)不使用工具时为10.7%。这一性能是在没有微调、多智能体集成或大型病例检索数据库的情况下实现的。在开发过程中未见过的病例、真实世界罕见病患者的私人队列以及较小的开源权重模型上也观察到了增益。LiteOdyssey为罕见病AI系统指明了一条路径,使其准确、易于部署且对医生审查更透明。

英文摘要

Most medical AI systems improve by scaling additional machinery: more fine-tuning data, more agents, and/or larger retrieval databases. In rare-disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state-of-the-art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy, developed through human-AI collaboration and augmenting with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare-disease diagnostic framework that guides reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state-of-the-art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks have a high proportion of ultra-rare disease (a prevalence below 1 in 1,000,000, with ultra-rare shares of approximately 45% and 52.8%, respectively). On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity-mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT-5.4) without tools. This performance was achieved without fine-tuning, multi-agent ensembles, or a large case-retrieval database. Gains were also observed in the following: on cases never seen during development, on a private cohort of real-world rare disease patients, and on a smaller open-weights model. LiteOdyssey suggests a path toward rare-disease AI systems that are accurate, easier to deploy, and more transparent for physician review.

2606.16707 2026-06-16 cs.AI 新提交

User as Code: Executable Memory for Personalized Agents

用户即代码:面向个性化智能体的可执行记忆

Bojie Li

发表机构 * Pine AI

AI总结 提出可执行记忆范式User as Code,将用户模型转化为可运行的Python代码,通过两阶段流水线实现精确聚合与规则执行,在长对话基准上达到78.8%召回率,聚合问题准确率99%,并能主动触发安全警报。

详情
AI中文摘要

个性化AI智能体需要用户记忆:一个关于用户是谁的持久模型,通过多次对话构建并在每次新对话中查询。如今,这种记忆几乎总是以非结构化文本、知识图谱或扁平事实存储的形式保存,并通过检索——获取与当前请求最相似的条目——来查询。这种“事实袋”记忆能很好地回忆单个事实,但由于存储事实和基于事实行动是分离的步骤,它在解决矛盾、聚合多条记录或执行规则方面存在困难。我们认为用户记忆应该是可执行的。我们引入用户即代码(UaC)范式,其中智能体对用户的模型是一个活的软件项目:类型化的Python对象保存用户状态,普通的Python函数编码管理状态的规则,因此表示和推理用户发生在同一个可由解释器运行的媒介中。实现机制是一个两阶段流水线:一个只追加的日志从不丢弃任何事实,并定期检查点化为类型化代码。这改变了记忆的能力。在标准的长对话基准测试中,UaC在召回率上匹配全上下文上限和最强的先前记忆系统(LOCOMO上78.8%)。其优势在表示至关重要的地方显现。在关于用户历史的聚合问题(如“我去年进行了多少次国际旅行?”)上,基于检索的记忆崩溃(6-43%),而UaC保持近乎完美(99%),因为答案是对类型化状态的一行计算,而不是对文本的搜索。而且,由于其规则在状态变化时确定性执行,UaC能够呈现未经请求的、安全关键的警报——例如新开的药物与数月前记录的过敏相冲突——这是查询驱动记忆无法提供的能力。

英文摘要

A personalized AI agent needs a user memory: a persistent model of who the user is, built across many conversations and consulted on each new one. Today this memory is almost always stored as unstructured text, a knowledge graph, or a flat store of facts, and consulted by retrieval -- fetching the entries most similar to the current request. Such "bag-of-facts" memory recalls individual facts well, but because storing a fact and acting on it are separate steps, it struggles to resolve contradictions, aggregate over many records, or enforce rules. We argue that user memory should instead be executable. We introduce User as Code (UaC), a paradigm in which an agent's model of a user is a living software project: typed Python objects hold the user's state and ordinary Python functions encode the rules that govern it, so representing and reasoning about the user happen in one medium an interpreter can run. The enabling mechanism is a two-phase pipeline: an append-only log that never discards a fact, periodically checkpointed into typed code. This changes what memory can do. On standard long-term conversation benchmarks, UaC matches both a full-context upper bound and the strongest prior memory systems on recall (78.8% on LOCOMO). Its advantage emerges where representation matters most. On aggregate questions over a user's history -- "how many international trips did I take last year?" -- retrieval-based memory collapses (6-43%) while UaC stays near-perfect (99%), because the answer is a one-line computation over typed state rather than a search over text. And because its rules execute deterministically whenever the state changes, UaC can surface unsolicited, safety-critical alerts -- such as a newly prescribed drug that conflicts with an allergy recorded months earlier -- a capability query-driven memory cannot provide.

2606.16769 2026-06-16 cs.AI 新提交

Skill-to-LoRA: From Using Skills to Learning Behaviors for Token-Efficient LLM Agents

Skill-to-LoRA:从使用技能到学习行为以实现令牌高效的LLM智能体

Tianyi Zhang, Zhonghao Qi

发表机构 * The Chinese University of Hong Kong(香港中文大学)

AI总结 提出Skill-to-LoRA方法,将技能文本转换为LoRA适配器,替代运行时注入技能文档,在SWE-Skills-Bench上提升通过率并降低令牌成本。

Comments Preprint. 10 pages, 4 figures

详情
AI中文摘要

智能体技能通常以SKILL.md文件形式分发:描述工作流、工具、资源和领域约定的人类可读程序文档。虽然便于检查和重用,但这种设计需要将相同的可重用程序重复注入运行时上下文。我们提出Skill-to-LoRA(S2L),一种以行为为中心的技能表示,用技能特定的LoRA适配器替代运行时技能文本。S2L不是压缩技能文档本身,而是建模技能文本引起的行为变化:离线时,使用完整的SKILL.md合成技能引导的演示;在线时,省略完整文档,动态加载对应的LoRA适配器以激活学习到的技能行为。我们使用Qwen3.6-27B在SWE-Skills-Bench的21个技能子集上评估S2L。与无技能和完整技能文本基线相比,S2L的通过率分别提高2.9和5.2个百分点,同时相对于完整技能文本提示,每步令牌成本降低6.6%。S2L在18/21个技能上匹配或优于完整技能文本,在15/21个技能上匹配或优于无技能基线。控制实验进一步表明,性能提升依赖于技能特定的适配器对齐:错误LoRA和共享LoRA均降低性能。这些结果表明,许多程序性智能体技能可以从运行时指令转换为可训练、可动态加载的行为模块。代码将在接收后发布。

英文摘要

Agent skills are commonly distributed as SKILL.md files: human-readable procedural documents that describe workflows, tools, resources, and domain conventions. While convenient for inspection and reuse, this design requires the same reusable procedure to be repeatedly injected into the runtime context. We propose Skill-to-LoRA(S2L), a behavior-centric skill representation that replaces runtime skill text with skill-specific LoRA adapters. Rather than compressing the skill document itself, S2L models the behavioral change induced by the skill text: offline, the complete SKILL.md is used to synthesize skill-guided demonstrations; online, the full document is omitted and the corresponding LoRA adapter is dynamically loaded to activate the learned skill behavior. We evaluate S2L with Qwen3.6-27B on a 21-skill subset of SWE-Skills-Bench. Compared with the no-skill and Full Skill Text baselines, S2L improves pass rate by 2.9 and 5.2 percentage points, respectively, while reducing per-step token cost by 6.6% relative to Full Skill Text prompting. S2L matches or improves Full Skill Text on 18/21 skills and the no-skill baseline on 15/21 skills. Control experiments further show that the gains depend on skill-specific adapter alignment: Wrong-LoRA and Shared-LoRA both reduce performance. These results suggest that many procedural agent skills can be converted from runtime instructions into trainable, dynamically loadable behavioral modules. Code will be released upon acceptance.

2606.16774 2026-06-16 cs.AI cs.CL 新提交

OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models

OpenClaw-Skill:面向智能体大语言模型的集体技能树搜索

Tianyi Lin, Chuanyu Sun, Jingyi Zhang, Changxu Wei, Huanjin Yao, Shunyu Liu, Xikun Zhang, Liu Liu, Jiaxing Huang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Nanyang Technological University(南洋理工大学) Tsinghua University(清华大学) Royal Melbourne Institute of Technology(皇家墨尔本理工大学) Beijing University of Aeronautics and Astronautics(北京航空航天大学)

AI总结 提出集体技能树搜索(CSTS)框架,通过集体智能生成和评估技能节点,构建结构化、多样且可泛化的技能树,并引入集体技能强化学习,提升大语言模型在工具使用、多步推理和动态环境交互中的智能体能力。

Comments 13 pages, 2 figures

详情
AI中文摘要

为大型语言模型(LLM)智能体配备有效技能对于解决OpenClaw等现实世界系统中的复杂任务至关重要。在这项工作中,我们旨在开发一个自动构建此类可重用技能的框架,以增强LLM在工具使用、多步推理和动态环境交互方面的能力。为此,我们提出了集体技能树搜索(CSTS),一种新颖的基于树搜索的技能构建框架,用于构建结构化、多样且可泛化的技能树。CSTS的核心思想是利用集体智能,通过两个迭代阶段共同搜索、识别和组合有效技能:集体技能节点生成(CSN-Gen)和集体技能节点评估(CSN-Assess)。CSN-Gen利用来自多个模型的集体知识,为每个子任务探索多样化的候选技能,实现全面的技能探索。CSN-Assess使用多个模型作为评判者,通过两种评分机制评估和选择技能节点:(1)集体质量评分,聚合独立评估以产生技能有效性的稳健估计;(2)集体可迁移性评分,明确验证技能是否在不同模型间良好泛化。通过CSTS,我们构建了一套全面的技能树以及技能增强的训练数据,使模型能够有效学习和利用技能。此外,我们引入了集体技能强化学习,主动从技能树中选择多个相关技能,以拓宽解空间探索,避免陷入单一技能及其导致的同质或次优解。最终,我们训练的模型OpenClaw-Skill在长期规划、工具使用和跨挑战性基准的泛化方面展现出卓越的智能体能力。

英文摘要

Equipping Large Language Model (LLM) agents with effective skills is crucial for solving complex tasks in real-world systems like OpenClaw. In this work, we aim to develop a framework that automatically constructs such reusable skills to enhance LLMs in tool use, multi-step reasoning, and dynamic environment interaction. To this end, we propose Collective Skill Tree Search (CSTS), a novel tree-search-based skill construction framework that constructs structured, diverse and generalizable tree of skills. The core idea of CSTS is to leverage collective intelligence to jointly search, identify and compose effective skills via two iterative phases: Collective Skill Node Generation (CSN-Gen) and Collective Skill Node Assessment (CSN-Assess). CSN-Gen exploits collective knowledge from multiple models to explore diverse candidate skills for each subtask, enabling comprehensive skill exploration. CSN-Assess employs multiple models as judges to evaluate and select skill nodes with two scoring mechanisms: (1) collective quality scoring that aggregates independent evaluations to produce a robust estimate of skill effectiveness, and (2) collective transferability scoring that explicitly verifies whether a skill generalizes well across different models. With CSTS, we construct a set of comprehensive tree of skills along with skill-augmented training data, enabling models to effectively learn and utilize skills. Besides, we introduce Collective Skill Reinforcement Learning, which actively selects multiple relevant skills from the tree to broaden solution-space exploration, avoid being trapped by a single skill and its resulting homogeneous or suboptimal solutions. As a result, our trained model, OpenClaw-Skill, exhibits outstanding agentic capabilities in long-horizon planning, tool use and generalization over challenging benchmarks.

2606.16813 2026-06-16 cs.AI 新提交

GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents

GIST-CMTF:LLM代理中因果最小工具过滤的目标状态推断

Rahul Suresh Babu, Rohit Shukla

AI总结 提出GIST-CMTF层,通过预测候选符号目标状态并估计歧义性,解决工具增强LLM代理因用户请求多义性导致的错误目标执行问题,在120个任务上达到97.0%成功率。

详情
AI中文摘要

工具增强的LLM代理依赖运行时过滤来决定每个步骤中哪些工具应可见。因果最小工具过滤(CMTF)通过仅暴露下一个因果必要的工具前沿来减少工具选择混淆,但它假设用户请求已映射到符号目标状态。实际上,诸如“处理我的预约”或“处理这封邮件”之类的请求可能对应多个可能的目标。这会导致错误目标执行,即代理为意外目标遵循有效的因果工具路径。我们引入GIST-CMTF,一个目标状态推断层,它预测在CMTF使用的相同状态转换词汇上的候选符号目标,估计歧义性,并要么应用CMTF,要么将澄清暴露为产生缺失目标或状态变量的因果动作。我们在七个模型后端、六种过滤方法和120个受控工具使用任务上评估GIST-CMTF。GIST-CMTF实现了97.0%的任务成功率,而top-goal CMTF为80.1%,semantic-goal CMTF为82.9%。它将错误目标执行从top-goal CMTF下的19.4%降低到2.5%,同时保留了因果过滤的单工具暴露,并且使用的令牌数远少于全工具暴露。这些结果表明,可靠的工具增强代理在暴露外部动作之前应验证目标状态,而不仅仅是工具相关性。

英文摘要

Tool-augmented LLM agents rely on runtime filtering to decide which tools should be visible at each step. Causal Minimal Tool Filtering (CMTF) reduces tool-choice confusion by exposing only the next causally necessary tool frontier, but it assumes that the user request has already been mapped to a symbolic goal state. In practice, requests such as "handle my appointment" or "take care of this email" may correspond to multiple possible goals. This creates wrong-goal execution, where an agent follows a valid causal tool path for an unintended objective. We introduce GIST-CMTF, a goal-state inference layer that predicts candidate symbolic goals over the same state-transition vocabulary used by CMTF, estimates ambiguity, and either applies CMTF or exposes clarification as a causal action that produces missing goal or state variables. We evaluate GIST-CMTF across seven model backends, six filtering methods, and 120 controlled tool-use tasks. GIST-CMTF achieves 97.0% task success, compared with 80.1% for top-goal CMTF and 82.9% for semantic-goal CMTF. It reduces wrong-goal execution from 19.4% under top-goal CMTF to 2.5%, while preserving the one-tool exposure of causal filtering and using substantially fewer tokens than all-tools exposure. These results suggest that reliable tool-augmented agents should validate goal state, not only tool relevance, before exposing external actions.

2606.16987 2026-06-16 cs.AI 新提交

Consensus-based Agentic Large Language Model Framework for Harmonized Tariff Schedule Code Classification

基于共识的智能体大语言模型框架用于协调制度海关编码分类

Truong Thanh Hung Nguyen, Khanh Van Quynh Nguyen, Hoang-Loc Cao, Tri Duong, Phuc Ho, Van Pham, Loc Nguyen, Hung Cao

发表机构 * Analytics Everywhere Lab, University of New Brunswick(新不伦瑞克大学无处不在分析实验室) University of Economics Ho Chi Minh City(胡志明市经济大学)

AI总结 提出一种多智能体LLM框架,通过信息检索、语义检索、证据推理、共识验证和分层投票等方法,解决加拿大10位HTS编码分类难题,在3300条数据上验证了证据驱动和人工参与的必要性。

Comments Accepted at the 3rd International Conference of Resilience by Technology and Design (RTD 2026)

详情
AI中文摘要

准确的协调制度(HTS)编码分类对于海运物流中的清关、关税评估、贸易统计和法规合规至关重要。然而,精确的HTS分类仍然具有挑战性,因为产品描述通常简短、不完整或模糊,而正确的分类依赖于层级关税结构、法律注释和特定司法管辖区的规则。本文提出了一种智能体大语言模型(LLM)框架,用于智慧港口和海运物流环境中的加拿大10位HTS编码分类。该框架集成了多智能体信息检索、官方关税文件的语义检索、基于证据的推理、基于共识的验证、跨层级编码组件的逐元素投票、置信度估计以及人工介入升级。我们在一个包含3300条领域专家标注的产品记录(来自物流和配送场景)的私有数据集上评估了该框架。实验结果表明,即使对于先进的LLM,精确的10位分类仍然困难,性能从粗略的章节级预测下降到细粒度的关税和统计后缀分配。这些发现表明,需要基于证据、不确定性感知和以人为中心的分类工作流程,而不是完全自主的单步预测。所提出的框架支持更可解释、可问责和合规导向的HTS分类,适用于海运物流和智慧港口操作。我们的代码可在https://github.com/Analytics-Everywhere-Lab/hts获取。

英文摘要

Accurate Harmonized Tariff Schedule (HTS) code classification is essential for customs clearance, duty assessment, trade statistics, and regulatory compliance in maritime logistics. However, exact HTS classification remains challenging because product descriptions are often short, incomplete, or ambiguous, while correct classification depends on hierarchical tariff structures, legal notes, and jurisdiction-specific rules. This paper proposes an agentic large language model (LLM) framework for Canadian 10-digit HTS code classification in smart-port and maritime logistics environments. The framework integrates multi-agent information retrieval, semantic retrieval over official tariff documents, evidence-grounded reasoning, consensus-based validation, element-wise voting across hierarchical code components, confidence estimation, and human-in-the-loop escalation. We evaluate the framework on a private dataset of 3,300 domain-expert-labeled product records collected from logistics and delivery contexts. Experimental results show that exact 10-digit classification remains difficult even for advanced LLMs, with performance decreasing from coarse chapter-level prediction to fine-grained tariff and statistical suffix assignment. These findings demonstrate the need for evidence-grounded, uncertainty-aware, and human-centered classification workflows rather than fully autonomous single-step prediction. The proposed framework supports more interpretable, accountable, and compliance-oriented HTS classification for maritime logistics and smart-port operations. Our code is available at https://github.com/Analytics-Everywhere-Lab/hts.

2606.16995 2026-06-16 cs.AI cs.LG 新提交

When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning

存疑则计划:用于反应式强化学习的小型语言模型承诺式推理

Nathan Gavenski, Juarez Monteiro, Francisco Galuppo, Adriano Veloso, Odinaldo Rodrigues

AI总结 提出PACT混合架构,结合快速反应式强化学习策略与慢速小型语言模型规划器,通过异步生成和验证候选动作计划来提升策略在陌生环境中的表现。

Comments LM4Plan Workshop at ICML 2026

详情
AI中文摘要

强化学习(RL)策略在陌生环境中常常性能下降,因为它们缺乏明确的推理。我们提出了Plan, Align, Commit, Think (PACT),一种混合架构,结合了快速、反应式的RL策略与慢速、深思熟虑的小型语言模型(SLM)规划器。PACT异步调用SLM来生成和验证候选动作计划。一旦通过模拟验证计划是安全、可行且完整的,就直接执行该计划,绕过RL策略,无需重新训练或修改它。在三个难度递增的FrozenLake配置上评估,PACT在所有基线中表现最佳,同时依赖于一个2B参数的SLM骨干,这表明在这些设置中,深思熟虑的规划和反应式执行相结合比单独任何一种都更强大。

英文摘要

Reinforcement Learning (RL) policies often degrade in unfamiliar environments because they lack explicit deliberation. We propose Plan, Align, Commit, Think (PACT), a hybrid architecture that combines a fast, reactive RL policy with a slow, deliberative Small Language Model (SLM) planner. PACT invokes the SLM asynchronously to generate and validate candidate action plans. Once a plan is verified through simulation as safe, feasible, and complete, it is executed directly, bypassing the RL policy without retraining or modifying it. Evaluated on three FrozenLake configurations of increasing difficulty, PACT outperforms all baselines while relying on a 2B-parameter SLM backbone, suggesting that deliberative planning and reactive execution are more powerful in concert than either is alone in these settings.

2606.14778 2026-06-16 cs.CV cs.AI 交叉投稿

FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

FactCheck: 基于多智能体协作的可行性感知长期动作预测

Rui Cao, Jiannong Cao, Bo Yuan, Zhiyuan Wen, Mingjin Zhang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) China Mobile(中国移动)

AI总结 提出FactCheck多智能体框架,通过闭环“观察-规划-验证”机制,结合历史动作图验证可行性,在EPIC-Kitchens-55和EGTEA Gaze+上超越现有方法。

详情
AI中文摘要

长期动作预测(LTA)旨在从部分观察的视频中预测未来动词-名词动作的有序序列。虽然该任务是具身智能的基础,但预测物理上可行的长期动作仍然是一个关键挑战。现有方法以开环方式运行,常常幻觉出不存在物体、违反物体可供性或不考虑物体状态,因为它们缺乏明确的机制来验证动作相对于物理环境的可行性。为解决此问题,我们提出FactCheck,一种新颖的多智能体协作框架,通过闭环“观察-规划-验证”机制提高可行性。FactCheck将复杂的LTA任务分解为专门角色:观察者从视频观察中识别历史动作并构建双形式结构化记忆,包括捕捉高层人类意图和环境状态的历史动作摘要,以及编码物体状态和时间依赖性的历史动作图;规划者基于低层历史动作和高层历史动作摘要生成未来动作草案;验证者严格根据历史动作图验证草案并修正不可行动作。在EPIC-Kitchens-55和EGTEA Gaze+基准上的大量实验表明,FactCheck始终优于最先进方法。我们的工作为可行性感知的长期动作预测建立了新范式,有效闭环了动作识别、动作预测和动作验证。

英文摘要

Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop "Observe-Plan-Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.

2606.14801 2026-06-16 cs.LG cs.AI cs.RO 交叉投稿

QPILOTS: Efficient Test-Time Q-Steering for Flow Policies

QPILOTS:面向流策略的高效测试时Q引导

Yifan Ruan, Chenyang Cao, Andreas Burger, Ali Pesaranghader, Kaveh Kamali, Jaehong Kim, Nandita Vijaykumar, Alan Aspuru-Guzik, Igor Gilitschenski, Nicholas Rhinehart

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) LG Electronics(LG电子)

AI总结 提出QPILOTS方法,在推理时通过投影去噪中间状态到最终动作估计并计算评论家梯度来引导流匹配和扩散策略,无需修改原策略,在离线到在线RL基准上达到90%平均成功率。

Comments 10 pages, 7 figures

详情
AI中文摘要

流匹配和扩散策略是表达力强的动作生成器,但使用时序差分强化学习(RL)优化它们仍然困难。有效的策略提取需要利用评论家的动作梯度,但通过多步去噪过程直接反向传播该信号可能数值不稳定。现有方法要么丢弃梯度信息,将策略蒸馏为更简单的单步动作器,要么随着评论家改进而重复微调去噪策略。我们提出QPILOTS,一种保持原策略不变并在推理时引导去噪过程的方法。在每个去噪步骤中,我们不是评估评论家对噪声中间动作(其中评论家预测不可靠),而是首先将该中间状态投影到最终干净动作的估计,并在那里计算评论家梯度。我们引入两种变体:QPILOTS-U使用快速单点近似,而QPILOTS-M通过学习的辅助网络绘制可微后验样本。在标准的离线到在线RL基准测试中,QPILOTS实现了最佳整体性能,在50个任务中达到平均90%的成功率。我们还应用QPILOTS引导一个大型、冻结的预训练视觉-语言动作(VLA)基础模型,在模拟的六个操作任务中优于或匹配先前的推理时方法。

英文摘要

Flow-matching and diffusion policies are expressive action generators, but optimizing them with temporal-difference reinforcement learning (RL) remains difficult. Effective policy extraction requires exploiting the critic's action gradient, yet directly backpropagating this signal through a multi-step denoising process can be numerically unstable. Existing methods work around this either by discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network. On a standard offline-to-online RL benchmark, QPILOTS achieves the best aggregate performance, reaching an average success rate of 90% across 50 tasks. We also apply QPILOTS to steer a large, frozen, pretrained Vision-Language Action (VLA) foundation model, outperforming or matching prior inference-time approaches across six manipulation tasks in simulation.

2606.15141 2026-06-16 eess.AS cs.AI cs.SD 交叉投稿

EChO-Agent: Evidence Chain Orchestration Agent for Audio Reasoning

EChO-Agent: 用于音频推理的证据链编排智能体

Siyuan Zhang, Jian Zong, Junyu Wang, Peiyuan Jiang, Jiahao Yan, Jingyu Zhang, Tianrui Wang, Xiaobao Wang, Longbiao Wang, Jianwu Dang

发表机构 * School of Artificial Intelligence, Tianjin University(天津大学人工智能学院)

AI总结 提出EChO-Agent模块化框架,将复杂音频问答转化为规划、工具执行、证据整合和答案验证流程,在MMAR基准上提升准确率和评分。

Comments 5 pages, 2 figures. Accepted by Interspeech 2026

详情
AI中文摘要

虽然LALMs在音频问答上展现出潜力,但在处理复杂音频推理时,它们未能聚焦于问题相关的音频片段,也无法提供清晰、可检查的推理过程。强化学习和工具增强提示可以帮助模型更好地将问题与音频关联起来,但缺乏可靠的方式来理解、整合和自验证音频片段。为弥补这一不足,我们提出了EChO-Agent,一个模块化智能体框架,将复杂的音频问答重新表述为规划、工具执行、证据整合和答案验证的工作流程。在MMAR基准上的实验表明,EChO-Agent在准确率和评分上均优于基线,消融研究显示证据整合是关键因素。

英文摘要

While LALMs show promise on audio question answering, they fail to focus on question-relevant segments of audio and provide a clear, checkable reasoning process when dealing with complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio but lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a planning, tool execution, evidence integration, and answer verification workflow. Experiments on MMAR benchmark show EChO-Agent improves both accuracy and rubric scores over baseline and ablation studies show evidence integration is the key factor.

2606.15197 2026-06-16 cs.LG cs.AI 交叉投稿

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

StarOR: 协同树搜索与测试时强化学习用于优化建模

Jiajun Li, Yu Ding, Shisi Guan, Ran Hou, Wanyuan Wang

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Northwest A&F University(西北农林科技大学)

AI总结 提出StarOR框架,结合蒙特卡洛树搜索与测试时强化学习,通过四阶段分解和GRPO更新LoRA适配器,实现无监督细粒度奖励的中间决策优化,在5个基准上以4B模型达到最优性能。

Comments 41pages, V1, preprint

详情
AI中文摘要

优化建模本质上是层次化的,需要精确的符号承诺序列。传统的基于学习的自动化优化建模方法通过大规模标注或策划的训练数据改进建模策略,但适应新问题分布成本高昂。同时,一次性生成在层次化建模中仍然脆弱,早期符号错误可能传播为无效公式。测试时缩放通过额外的实例级计算实现结构探索,提供了一种有前景的替代方案;然而,现有的基于搜索的方法通常依赖固定策略,导致重复展开继承相似的建模偏差,并为中间决策提供有限的信用分配。为了解决这些限制,我们提出了StarOR,一种协同搜索与适应的框架,将MCTS与测试时强化学习相结合用于优化建模。StarOR将建模过程分解为四个阶段,并通过GRPO在每个非终端节点更新瞬态LoRA适配器。通过使用MCTS生成的兄弟节点作为局部比较集,StarOR将搜索时的探索转化为实例特定的策略细化。此外,无监督的多方面奖励系统为中间公式决策提供细粒度反馈,无需真实标签。在五个优化基准上的实验表明,即使使用4B骨干网络,StarOR也实现了最先进的性能,优于现有方法和前沿LLMs。

英文摘要

Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and the frontier LLMs.

2606.15390 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

Not All Skills Help: Measuring and Repairing Agent Knowledge

并非所有技能都有用:测量与修复智能体知识

Yixuan Wang, Yiyang Zhou, Yiming Liang, Congyu Zhang, Fuxiao Liu, Jiawei Zhou, Huaxiu Yao

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校) Purdue(普渡大学) NVIDIA(英伟达)

AI总结 提出ASSAY框架,通过随机掩码测量技能因果贡献,分离技能生成与筛选,在推理时抑制负面技能,显著提升LLM智能体任务完成率。

Comments 18 pages, 5 figures

详情
AI中文摘要

LLM智能体可以通过从经验中积累自然语言技能来改进,而无需更新权重,但当前系统将所有关于保留哪些技能以及如何应用它们的决策完全交由LLM判断。我们认为这混淆了两个不同的角色:从经验中生成技能是判断擅长的创造性行为,而决定该技能是否真正有帮助则需要跨多个任务的实证证据。通过随机掩码测量每个技能的因果贡献,我们发现技能库表现出普遍的因果异质性:单个技能通常在某些任务类型上有帮助,但在其他任务类型上有害,然而它们的相反效应在总体上相互抵消,使得全局筛选方法无法察觉。我们提出ASSAY,一个将生成与筛选分离的框架:它在小型开发集上计算每个技能的因果归因,离线重组技能库,并为每个测试任务抑制预测效应为负的技能。在跨越四个提供商的七个基础模型以及两个基准(AppWorld和tau-bench)上,ASSAY始终优于先前的技能筛选方法。在AppWorld最难的数据划分上,DeepSeek-V3实现了69.3%的任务目标完成率(相对提升47.4%),在所有已发表方法(包括权重调整方法)中达到了新的最先进水平。在tau-bench零售领域,GPT-4.1相对提升8.7%,在公开排行榜上超越了o4-mini、o1和GPT-4.5,且无需任何权重修改。消融实验将主要收益归因于每任务掩码,证实瓶颈在于推理时将技能与任务匹配,而非全局移除不良技能。代码已开源:https://github.com/aiming-lab/assay。

英文摘要

LLM agents can improve without weight updates by accumulating natural-language skills from experience, but current systems entrust every decision about which skills to keep and how to apply them to LLM judgment alone. We argue that this conflates two distinct roles: generating a skill from experience is a creative act that judgment handles well, while deciding whether that skill actually helps requires empirical evidence across many tasks. Measuring per-skill causal contributions via randomized masking, we find that skill libraries exhibit pervasive causal heterogeneity: individual skills routinely help on some task types while hurting on others, yet their opposing effects cancel in aggregate, making them invisible to global curation methods. We propose ASSAY, a framework that separates generation from curation: it computes a per-skill causal attribution on a small development set, restructures the library offline, and suppresses skills with negative predicted effect for each test task. Across seven base models spanning four providers and two benchmarks (AppWorld and tau-bench), ASSAY consistently improves over prior skill-curation approaches. On AppWorld's hardest split, DeepSeek-V3 achieves 69.3% task-goal completion (47.4% relative improvement), a new state of the art among all published methods including weight-tuned approaches. On tau-bench retail, GPT-4.1 improves by 8.7% relative, advancing past o4-mini, o1, and GPT-4.5 on the public leaderboard without any weight modification. Ablation traces the dominant gain to per-task masking, confirming that the bottleneck is matching skills to tasks at inference time, not removing bad skills globally. Code is available at https://github.com/aiming-lab/assay.

2606.15912 2026-06-16 cs.LG cs.AI 交叉投稿

On-Policy Distillation with Curriculum Turn-level Guidance for Multi-turn Agents

基于课程回合级指导的在线策略蒸馏用于多轮智能体

Gengsheng Li, Mao Zheng, Mingyang Song, Ruiqi Liu, Tianyu Yang, Jie Sun, Qiyong Zhong, Haiyun Guo, Junfeng Fang, Dan Zhang, Jinqiao Wang

发表机构 * Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所基础模型研究中心) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Large Language Model Department, Tencent(腾讯大语言模型部) University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学) National University of Singapore(新加坡国立大学) Wuhan AI Research(武汉人工智能研究院)

AI总结 针对多轮智能体在线策略蒸馏中错误累积导致教师监督失效的问题,提出混合教师和学生生成回合的Guided-OPD算法,通过课程式衰减教师干预概率,在ALFWorld等任务上平均提升21.1%得分和25.5%成功率。

详情
AI中文摘要

能够规划、调用工具并与环境交互的多轮智能体为解决复杂任务提供了一种有前景的范式,但其能力通常依赖于非常大的模型,这些模型的推理成本在实践中令人望而却步。在线策略蒸馏(OPD)是将这种能力迁移到较小学生模型的一种自然方法,但我们发现它在这种设置下存在一种特征性失败模式:小的学生错误在回合间累积,将轨迹推离教师熟悉的状态分布,因此教师的监督在最需要的地方变得最不可靠。我们提出了引导式在线策略蒸馏(Guided-OPD),一种简单而有效的算法,它在每个轨迹中混合教师和学生生成的回合,并按照衰减到零的课程安排教师的干预概率。强引导使早期轨迹接近教师分布,然后逐渐撤除以恢复推理时使用的纯在线策略。在ALFWorld、ScienceWorld和WebShop上,从Qwen3-30B-A3B教师蒸馏Qwen3学生,Guided-OPD相比普通OPD平均提高21.1%得分和25.5%成功率,在较小的学生上收益更大。

英文摘要

Multi-turn agents that plan, invoke tools, and interact with environments offer a promising paradigm for solving complex tasks, yet their capabilities typically rely on very large models whose inference cost is prohibitive in practice.On-Policy Distillation (OPD) is a natural recipe for transferring such capabilities to smaller students, but we find that it suffers a characteristic failure mode in this setting: small student errors compound across turns and push the trajectory out of the teacher's familiar state distribution, so the teacher's supervision becomes least reliable precisely where the student needs it most.We propose Guided On-Policy Distillation (Guided-OPD), a simple yet effective algorithm that mixes teacher- and student-generated turns within each rollout and schedules the teacher's intervention probability along a curriculum that decays to zero.Strong guidance keeps early trajectories close to the teacher distribution and is then gradually withdrawn to recover the purely on-policy regime used at inference.On ALFWorld, ScienceWorld, and WebShop, distilling Qwen3 students from a Qwen3-30B-A3B teacher, Guided-OPD improves Score by 21.1\% and Success Rate by 25.5\% over vanilla OPD on average, with larger gains on smaller students.

2606.16014 2026-06-16 cs.HC cs.AI cs.MA 交叉投稿

Orchestrated Reality: From Role-Play to Living, Playable Game Worlds -- LLM-Driven World Simulation as a Parameterized-Action POMDP

编排现实:从角色扮演到活生生的、可玩的游戏世界——作为参数化动作POMDP的LLM驱动世界模拟

Yuhang Huang, Chenmiao Li, Chaowei Fang

发表机构 * The University of Tokyo(东京大学) Individual Researcher(个人研究员)

AI总结 提出编排现实框架,将LLM驱动的游戏世界形式化为参数化动作POMDP,通过单例编排代理维护规范JSON状态,实现数值状态、叙事声音和规则逻辑的统一协调。

Comments 9 pages, 2 figures. Work in progress. Yuhang Huang and Chenmiao Li contributed equall

详情
AI中文摘要

许多游戏依赖于讲故事与跟踪等级、NPC行为和后果模拟的系统相结合;将紧密编写的叙事与深度模拟的世界桥接起来——在沙盒和开放世界环境中最为突出——一直成本高昂。LLM驱动的世界开辟了一条新路径:一个单一框架可以协调数值状态、叙事声音、故事节奏和规则逻辑。实现这一点需要LLM系统维持一个持久的世界(谁在哪里、刚刚发生了什么、当前什么是真实的),而当今部署的系统无法做到:叙事声音以自由散文形式断言状态,没有任何经过验证的表示,因此完全自主的游戏引擎仍然不可行。我们将此视为一种架构选择,而非语言模型的限制,并报告一个正在进行的框架——编排现实——的工作进展,该框架将世界视为一个由单例编排代理拥有的规范对象,类似于桌面角色扮演游戏中的游戏主持人(GM)。我们将面向人类玩家的LLM驱动的游戏世界形式化为一个参数化动作POMDP:状态是一个规范JSON实体的树,动作分解为$a=(k, x_k)$(离散意图类型加上结构化JSON参数),代理仅观察状态的一个叙事投影$o=O(s)$,转移核$F$是一个LLM驱动的计划-差异-验证-应用(PDVA)流水线,该流水线提交经过模式验证和内容哈希的JSON差异。我们给出了形式模型、一个JSON状态示例、一个单轮示例,以及来自实际部署的15个说明性事件目录,展示了该框架的实际应用。通过计划的人类玩家研究进行的实证验证——以及多NPC并发代理和作为RL环境的部署——被定位为未来工作。

英文摘要

Many games rely on storytelling combined with systems that track levelling, NPC behaviour, and consequence simulation; bridging tightly-authored narrative with deeply-simulated worlds -- most acute in sandbox and open-world settings -- has been prohibitively expensive. LLM-driven worlds open a new path: a single harness can coordinate numerical state, narrative voice, storytelling pacing, and rule logic together. Realising this requires the LLM system to sustain a persistent world (who is where, what has just happened, what is currently true), which today's deployed systems do not: the narrative voice asserts state in free prose without any validated representation, so a fully autonomous game engine remains infeasible. We treat this as an architectural choice, not a limitation of language models, and report work in progress on a framework -- orchestrated reality -- that makes the world a canonical object owned by a singleton orchestration agent analogous to the tabletop-RPG Game Master (GM). We formalise an LLM-driven game world for a human player as a Parameterized-Action POMDP: state is a tree of canonical JSON entities, actions decompose as $a=(k, x_k)$ (a discrete intent kind plus structured JSON parameters), the agent observes only a narrative projection $o=O(s)$ of state, and the transition kernel $F$ is an LLM-driven Plan-Diff-Validate-Apply (PDVA) pipeline that commits schema-validated, content-hashed JSON deltas. We give the formal model, a JSON-state example, a worked single-turn example, and a catalogue of 15 illustrative incidents drawn from a real deployment showing the framework in action. Empirical validation through a planned human player study -- together with multi-NPC concurrent agency and deployment as an RL environment -- is situated as future work.

2606.16215 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

PACT: 多轮工具使用智能体的特权轨迹协同训练

Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Ohio State University(俄亥俄州立大学) University of Pennsylvania(宾夕法尼亚大学) Arizona State University(亚利桑那州立大学)

AI总结 提出PACT框架,通过特权轨迹(专家轨迹)在训练时提供密集监督信号,结合轨迹条件RL和组件感知SFT损失,避免推理时依赖轨迹,显著提升多轮工具使用智能体的性能。

Comments Project page: https://zhenbangdu.github.io/pact-project-page/

详情
AI中文摘要

多轮工具使用智能体必须在多个交互轮次中进行推理、调用工具并适应观察结果。对此类智能体进行后训练具有挑战性,因为强化学习通常面临稀疏奖励和弱信用分配问题(尽管匹配仅提示推理设置),而基于专家轨迹的监督微调提供密集过程监督,但可能过度约束模型到固定轨迹。为解决这一问题,我们提出PACT,一种用于多轮工具使用智能体的特权轨迹协同训练框架。关键思想是仅将专家轨迹作为训练时的优化信号,而非推理时的提示。PACT保持推理生成仅基于提示,然后通过两个互补信号利用专家轨迹指导优化:一个轨迹条件RL代理,在专家轨迹上下文中评估仅提示轨迹;一个组件感知SFT损失,以退火强度监督推理前缀和工具调用。为减少对训练时轨迹上下文的过度依赖,PACT进一步引入仅提示锚定。我们还提供了一个潜在轨迹视角,连接两个基于轨迹的目标,并解释专家轨迹如何在推理生成中不被使用的情况下指导优化。在FTRL、BFCL和ToolHop上的实验表明,PACT持续优于强SFT和RL基线,凸显了特权轨迹协同训练在多轮工具使用学习中的价值。

英文摘要

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool-calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces a prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

2606.16316 2026-06-16 cs.IR cs.AI cs.LG 交叉投稿

RL-Index: Reinforcement Learning for Retrieval Index Reasoning

RL-Index:用于检索索引推理的强化学习

Yongjia Lei, Nedim Lipka, Zhisheng Qi, Utkarsh Sahu, Koustava Goswami, Franck Dernoncourt, Ryan A. Rossi, Yu Wang

发表机构 * University of Oregon(俄勒冈大学) Adobe Research(Adobe研究)

AI总结 提出RL-Index框架,将检索索引推理转化为强化学习问题,通过LLM生成理由增强文档,使用GRPO优化,提升检索和问答性能并降低在线延迟。

详情
AI中文摘要

检索外部知识对于解决现实世界任务至关重要,但当查询与其相关知识之间的关系涉及超越表面语义或词汇匹配的隐式和复杂推理时(例如,依赖同一定理的数学问题或需要深度推理的编码),仍然具有挑战性。现有方法主要依赖查询端推理(例如,查询重写),这引入了显著的在线延迟,并且未能充分利用对知识语料库本身进行推理的机会(即索引端推理)。在本文中,我们提出了RL-Index,一个智能索引框架,将检索索引推理形式化为强化学习问题。RL-Index不是在进行查询时执行推理,而是通过用LLM生成的理由增强文档,将推理转移到索引阶段,这些理由显式编码了潜在的查询-知识关系。为了优化这些理由的质量,我们采用了组相对策略优化(GRPO),并使用检索相似性作为可验证的奖励信号,从而能够直接优化索引决策以提高检索效果。在BRIGHT基准上的大量实验表明,RL-Index持续提高了检索和下游问答性能,同时显著降低了在线推理延迟。此外,学到的理由增强跨不同的检索器和生成器具有泛化能力,突显了其作为即插即用索引策略在不同检索系统中的鲁棒性。

英文摘要

Retrieving external knowledge is essential for solving real-world tasks, yet it remains challenging when the relationship between a query and its relevant knowledge involves implicit and complex reasoning beyond surface-level semantic or lexical matching (e.g., mathematical problems relying on the same theorem or coding requiring deep reasoning). Existing approaches primarily rely on query-side reasoning (e.g., query rewriting), which introduces significant online latency and underutilizes the opportunity to perform reasoning over the knowledge corpus itself (i.e., index-side reasoning). In this paper, we propose RL-Index, an agentic indexing framework that formulates retrieval index reasoning as a reinforcement learning problem. Instead of performing reasoning at query time, RL-Index shifts reasoning to the indexing stage by augmenting documents with LLM-generated rationales that explicitly encode the latent query-knowledge relationship. To optimize the quality of these rationales, we employ Group Relative Policy Optimization (GRPO) and use retrieval similarity as a verifiable reward signal, enabling direct optimization of indexing decisions for retrieval effectiveness. Extensive experiments on the BRIGHT benchmark demonstrate that RL-Index consistently improves both retrieval and downstream question-answering performance, while significantly reducing online inference latency. Moreover, the learned rationale augmentation generalizes across diverse retrievers and generators, highlighting its robustness as a plug-and-play indexing strategy across different retrieval systems.

2606.16432 2026-06-16 cs.CL cs.AI 交叉投稿

ACCORD: Action-Conditioned Contextual Grounding for Language Agents

ACCORD: 面向语言智能体的动作条件上下文接地

Lai Jiang, Cheng Qian, Zhenhailong Wang, Pan Lu, Heng Ji, Hao Peng

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Stanford University(斯坦福大学)

AI总结 针对用户指令常因隐含环境假设而欠指定,导致LLM智能体执行失败的问题,提出ACCORD框架,在每次动作前主动探测缺失信息并整合轨迹上下文,无需额外训练,在AppWorld和AlfWorld上显著提升任务完成率。

详情
AI中文摘要

用户指令往往因人类对周围环境的隐含假设而欠指定。对于在信息丰富的数字和物理环境中运行的大型语言模型(LLM)智能体,这些假设无法仅从指令中推断;必须从工具、数据、接口和观察的当前状态中恢复。因此,有效执行要求智能体识别缺失的上下文,将其基于观察到的证据,并带入后续动作。我们表明,当前智能体常常未能做到这一点。它们基于假设而非观察到的细节行动,忽略本可收集的信息,并且未能整合已经返回的证据。基于这一洞察,我们提出ACCORD(动作条件上下文接地),一种简单有效的自适应接地智能体框架。在每次动作前,ACCORD主动探测环境中缺失的信息,并整合来自智能体轨迹中原本会被忽略的相关上下文。无需额外训练或任务成功信号,ACCORD在AppWorld上将任务目标完成率从42.0%提升至62.6%(GPT-5-mini),比强基线高出最多20.6个百分点。这些增益在更强的基模型(Claude-4.5-sonnet上+10.8)、开放权重模型(Qwen3.5-27B-FP8上+10.1)以及具身AlfWorld基准(GPT-5-mini上成功率+7.4)上持续存在。

英文摘要

User instructions are often underspecified because humans rely on implicit assumptions about the surrounding environment. For large language model (LLM) agents operating in information-rich digital and physical environments, these assumptions cannot be inferred from the instruction alone; they must be recovered from the current state of tools, data, interfaces, and observations. Effective execution therefore requires agents to identify missing context, ground it in observed evidence, and carry it forward into subsequent actions. We show that current agents often fail to do so. They act from assumed rather than observed specifics, overlook information they could have gathered, and fail to incorporate evidence that has already been returned. Building on this insight, we propose ACCORD (Action-Conditioned Contextual Grounding), a simple and effective agent framework for adaptive grounding. Before each action, ACCORD actively probes the environment for missing information and integrates relevant context from the agent's trajectory that would otherwise be overlooked. Requiring no additional training or task-success signals, ACCORD improves task-goal completion on AppWorld by up to +20.6 points with GPT-5-mini, from 42.0% to 62.6%, compared to strong baselines. These gains persist with a substantially stronger base model (+10.8 with Claude-4.5-sonnet), an open-weight model (+10.1 with Qwen3.5-27B-FP8), and on the embodied AlfWorld benchmark (+7.4 success rate with GPT-5-mini).

2606.16515 2026-06-16 cs.LG cs.AI cs.RO 交叉投稿

Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

基于组合子目标评分的方向条件策略用于在线目标条件强化学习

Swaminathan S K, Damiya Gondha, Theyanesh Eswaramoorthy Rajahkrishnan, Aritra Hazra

AI总结 提出方向条件策略(DCP),通过共享InfoNCE表示将目标达成分解为子目标评分和方向条件动作,理论证明方向充分性、训练与部署一致性及可控子空间失效条件,在九个环境中优于对比RL。

Comments 17 pages, Accepted to the 2nd Workshop on Compositional Learning at ICML 2026 (Seoul, South Korea)

详情
AI中文摘要

Hamilton-Jacobi-Bellman理论表明,最优目标条件动作仅通过当前状态下目标距离的梯度依赖于目标,然而标准的在线GCRL仍然将演员网络条件于原始目标——当目标远离数据分布时,这是一个几何上无信息的信号。我们提出方向条件策略(DCP),一种完全在线的方法,将目标达成分解为两个共享一个InfoNCE表示ψ的组件:一个子目标评分步骤,选择与最终目标g在ψ空间中对齐的已访问状态z_t;以及一个方向条件演员,它消耗从ψ(s_t)到ψ(z_t)的单位方向d_t和幅度r_t。这两个组件联合训练,在部署时干净地分解(子目标评分被移除,而方向条件保留,用g代替z_t),并允许在相同的(d_t, r_t)接口上进行独立修改。我们证明了三个结果。首先,HJB下的方向充分性:在控制仿射动力学下,最优动作仅通过价值梯度依赖于目标。其次,一个定量界表明,在学习表示的温和条件下,并假设评分规则返回一个路径上的z_t,演员在训练和部署时的条件输入在表示误差和测地线松弛下是一致的。第三,一个可控子空间刻画了方向条件失效的情况。在九个环境中,DCP在大多数最终指标上优于对比RL,在操作和障碍物交互任务上提升最大;对学习到的ψ-距离景观的定性分析表明,对比表示表现为一种在线拟度量,编码环境拓扑,而唯一的失败案例(AntSoccer)定位到理论预期的学习梯度病理。

英文摘要

Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal -- a signal that is geometrically uninformative when the goal is far from the data distribution. We propose Direction-Conditioned Policies (DCP), a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation $ψ$: a subgoal-scoring step that selects a visited state $z_t$ aligned with the final goal $g$ in $ψ_g$, and a direction-conditioned actor that consumes the unit direction $d_t$ and magnitude $r_t$ from $ψ(s_t)$ to $ψ(z_t)$. The two components train jointly, factor cleanly at deployment (subgoal scoring is removed, while direction conditioning remains with $g$ in place of $z_t$), and admit independent modification at the same $(d_t,r_t)$ interface. We prove three results. First, direction sufficiency under HJB: the optimal action under control-affine dynamics depends on the goal only through the value gradient. Second, a quantitative bound showing that, under mild conditions on the learned representation and assuming the scoring rule returns an on-path $z_t$, the actor's conditioning input at training and at deployment coincide up to representation error and geodesic slack. Third, a controllable-subspace characterization of when directional conditioning fails. Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks; a qualitative analysis of the learned $ψ$-distance landscape shows the contrastive representation behaves as an online quasimetric encoding environment topology, and the single failure case (AntSoccer) localizes to a learned-gradient pathology that the theory anticipates.

2606.16603 2026-06-16 cs.CL cs.AI 交叉投稿

VeriGraph: Towards Verifiable Data-Analytic Agents

VeriGraph: 迈向可验证的数据分析智能体

Jiajie Jin, Zhao Yang, Wenle Liao, Yuyang Hu, Guanting Dong, Xiaoxi Li, Yutao Zhu, Zhicheng Dou

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

AI总结 提出VeriGraph框架,通过构建显式异质证据有向无环图(DAG)实现数据分析智能体的可验证性,并设计基于图的策略优化提升正确性与可审计性。

Comments 10 pages

详情
AI中文摘要

基于LLM的智能体在数据密集型分析任务中展现出强大能力,但其输出很少是可验证的:对线性文本轨迹的依赖使其推理难以审计。特别是,对原始数据的确定性计算和对自然语言主张的语义推导常常纠缠在非结构化流中,导致数值结论难以复现,定性判断难以检查。为解决这一问题,我们提出VeriGraph,一个可追踪的神经符号推理框架,使智能体在执行过程中构建显式的异质证据有向无环图(DAG)。VeriGraph引入了三种证据扩展原语,即计算扩展、基础扩展和推导扩展,以在统一图中连接原始数据、解释器变量、计算结果和自然语言主张。在此公式下,结构可追溯性简化为从原始数据源到终端主张的图可达性,而语义支持通过主张级证据评估来衡量。为了改进图构建,我们进一步设计了一种基于图的策略优化策略,采用复合奖励联合监督答案正确性、计算完整性和推导连贯性。在四个基准上的实验表明,VeriGraph-8B在所有基线中取得了最高总分。更重要的是,VeriGraph生成了可审计的证据图,具有显著更强的主张基础,在我们的主张级证据支持评估下达到了87.61%的基础率。这些结果表明,显式证据图构建是实现可验证数据分析智能体的有前景的途径。我们的代码可在https://github.com/ignorejjj/VeriGraph获取。

英文摘要

LLM-based agents have demonstrated strong capabilities in data-intensive analytical tasks, yet their outputs are rarely verifiable: a reliance on linear text trajectories makes their reasoning difficult to audit. In particular, deterministic computations over raw data and semantic deductions over natural-language claims are often entangled in an unstructured stream, leaving numerical conclusions hard to reproduce and qualitative judgments hard to inspect. To address this, we propose VeriGraph, a traceable neuro-symbolic reasoning framework that enables agents to construct an explicit heterogeneous evidence directed acyclic graph (DAG) during execution. VeriGraph introduces three evidence-expansion primitives, namely computational, grounding, and derivational expansion, to connect raw data, interpreter variables, computed results, and natural-language claims in a unified graph. Under this formulation, structural traceability is reduced to graph reachability from raw data sources to terminal claims, while semantic support is measured by claim-level evidence evaluation. To improve graph construction, we further design a graph-based policy optimization strategy with a composite reward that jointly supervises answer correctness, computational integrity, and derivational coherence. Experiments on four benchmarks show that VeriGraph-8B achieves the highest overall score among all baselines. More importantly, VeriGraph produces auditable evidence graphs with substantially stronger claim grounding, achieving a 87.61\% Grounding Rate under our claim-level evidence support evaluation. These results suggest that explicit evidence-graph construction is a promising path toward verifiable data-analytic agents. Our code is available at https://github.com/ignorejjj/VeriGraph.

2606.17016 2026-06-16 cs.CL cs.AI cs.LG cs.MA 交叉投稿

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot: 面向LLM智能体的缓存高效上下文管理

Buqiang Xu, Zirui Xue, Dianmou Chen, Chenyang Fu, Chiyu Wu, Caiying Huang, Chen Jiang, Jizhan Fang, Xinle Deng, Yijun Chen, Yunzhi Yao, Xuehai Wang, Jin Shang, Gong Yu, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) University of Electronic Science and Technology of China(电子科技大学) Xi’an University of Electronic Science and Technology(西安电子科技大学) HomologyAI(同源人工智能)

AI总结 针对LLM智能体长会话中上下文累积导致推理成本高的问题,提出TokenPilot双粒度上下文管理框架,通过摄入感知压缩和生命周期感知驱逐策略,在保持性能的同时降低61%-87%的成本。

Comments LightMem Series: Work in Progress

详情
AI中文摘要

随着LLM智能体被部署在长周期会话中,上下文累积推高了推理成本。现有方法利用文本修剪或动态内存驱逐来最小化token占用,但其无约束的序列突变改变了布局,引入前缀不匹配和缓存失效。这揭示了文本稀疏性与提示缓存连续性之间的关键权衡。为解决此问题,我们提出TokenPilot,一个双粒度上下文管理框架。全局上,摄入感知压缩作为框架工具,稳定提示前缀并在摄入门处消除开放世界环境噪声。局部上,生命周期感知驱逐监控上下文段的持续剩余效用,强制执行保守的批处理轮次调度,仅在任务相关性过期时卸载内容段。在PinchBench和Claw-Eval上的隔离和连续模式实验表明,TokenPilot在隔离模式下成本降低61%和56%,在连续模式下降低61%和87%,同时与先前系统相比保持竞争性能。TokenPilot已集成到LightMem2中,地址为https://github.com/zjunlp/LightMem2。

英文摘要

As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter layouts, introducing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address this, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness to stabilize prompt prefixes and eliminate open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the ongoing residual utility of context segments, enforcing a conservative batch-turn schedule to offload content segments only when task relevance expires. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes demonstrate that TokenPilot reduces costs by 61% and 56% in isolated mode, and 61% and 87% in continuous mode, while maintaining competitive performance compared to prior systems. TokenPilot has been integrated into LightMem2 at https://github.com/zjunlp/LightMem2.

2510.15966 2026-06-16 cs.AI 版本更新

PISA: A Pragmatic Psych-Inspired Unified Memory System for Enhanced AI Agency

PISA:一种增强AI能动性的实用心理学启发统一记忆系统

Shian Jia, Ziyang Huang, Xinbo Wang, Haofei Zhang, Mingli Song

发表机构 * Zhejiang University(浙江大学)

AI总结 受皮亚杰认知发展理论启发,提出PISA统一记忆系统,通过三模态适应机制(模式更新、演化、创建)和混合记忆访问架构,显著提升AI代理的适应性和长期知识保留。

详情
AI中文摘要

记忆系统对AI代理至关重要,但现有工作往往缺乏对多样化任务的适应性,并忽视了AI代理记忆的建设性和任务导向作用。借鉴皮亚杰的认知发展理论,我们提出PISA,一个实用的、受心理学启发的统一记忆系统,通过将记忆视为建设性和自适应过程来解决这些局限性。为了实现持续学习和适应性,PISA引入了三模态适应机制(即模式更新、模式演化和模式创建),在保持连贯组织的同时支持灵活的记忆更新。基于这些模式基础结构,我们进一步设计了一种混合记忆访问架构,将符号推理与神经检索无缝集成,显著提高了检索准确性和效率。我们在现有LOCOMO基准和我们新提出的用于数据分析任务的AggQA基准上进行的实证评估证实,PISA通过显著增强适应性和长期知识保留,树立了新的最先进水平。

英文摘要

Memory systems are fundamental to AI agents, yet existing work often lacks adaptability to diverse tasks and overlooks the constructive and task-oriented role of AI agent memory. Drawing from Piaget's theory of cognitive development, we propose PISA, a pragmatic, psych-inspired unified memory system that addresses these limitations by treating memory as a constructive and adaptive process. To enable continuous learning and adaptability, PISA introduces a trimodal adaptation mechanism (i.e., schema updation, schema evolution, and schema creation) that preserves coherent organization while supporting flexible memory updates. Building on these schema-grounded structures, we further design a hybrid memory access architecture that seamlessly integrates symbolic reasoning with neural retrieval, significantly improving retrieval accuracy and efficiency. Our empirical evaluation, conducted on the existing LOCOMO benchmark and our newly proposed AggQA benchmark for data analysis tasks, confirms that PISA sets a new state-of-the-art by significantly enhancing adaptability and long-term knowledge retention.

2602.07883 2026-06-16 cs.AI 版本更新

ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation

ToolSelf: 通过工具驱动的涌现适应统一任务执行与自我重构

Jingqi Zhou, Sheng Wang, Dezhao Deng, Junwen Lu, Junwei Su, Qintong Li, Jiahui Gao, Hao Wu, Jiyue Jiang, Lingpeng Kong, Dunhong Jin, Chuan Wu

发表机构 * The University of Hong Kong(香港大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出ToolSelf框架,将配置更新抽象为标准化工具接口,统一任务执行与自我重构,并采用配置感知两阶段训练(CAT)实现涌现适应性,在多种基准测试中平均超越静态配置基线28.8分。

详情
AI中文摘要

基于LLM的智能体系统在复杂长时任务中表现出色,但仍受限于执行前固定的静态配置。这种刚性导致领域特定性能与跨任务泛化之间的权衡:强先验和紧凑工具空间有助于专业化但削弱迁移,而任务无关的工作流和广泛动作空间扩展覆盖但稀释指导。现有的执行前优化、规划者-工作者编排和配置修补未能解决这一矛盾,因为它们将适应与执行解耦,导致信息丢失、优化碎片化和信用分配模糊。我们提出ToolSelf,一种工具驱动的运行时自我重构范式,将配置更新抽象为标准化工具接口,并在一个策略的动作空间内统一执行和适应。执行代理可以根据任务进度和反馈动态更新子目标、策略、工具箱、上下文和上下文管理模式。我们进一步引入配置感知两阶段训练(CAT),结合拒绝采样微调和轨迹级KTO强化学习来内化自我重构。在多种基准测试中,零样本ToolSelf与任务专用代理相媲美;经过CAT训练后,ToolSelf平均比静态配置基线高出28.8分,为消除手动注入指导的涌现适应性开辟了道路。

英文摘要

LLM-powered agentic systems excel at complex long-horizon tasks, but remain constrained by static configurations fixed before execution. Such rigidity forces a trade-off between domain-specific performance and cross-task generalization: strong priors and compact tool spaces aid specialization but weaken transfer, while task-agnostic workflows and broad action spaces expand coverage but dilute guidance. Existing pre-execution optimization, planner-worker orchestration, and configuration patching fall short of resolving this tension, as they decouple adaptation from execution, causing information loss, fragmented optimization, and ambiguous credit assignment. We propose ToolSelf, a tool-driven runtime self-reconfiguration paradigm that abstracts configuration updates as a standardized tool interface and unifies execution and adaptation within one policy's action space. The execution agent can dynamically update sub-goals, strategies, toolboxes, context, and context-management modes based on task progress and feedback. We further introduce Configuration-Aware Two-stage Training (CAT), which combines rejection sampling fine-tuning with trajectory-level KTO reinforcement learning to internalize self-reconfiguration. Across diverse benchmarks, zero-shot ToolSelf rivals task-specialized agents; after CAT training, ToolSelf gains 28.8 points over the static-configuration baseline on average, illuminating a path toward emergent adaptivity that obviates manually injected guidance. The code is available at https://github.com/lian-tian-mo-zun/ToolSelf.

2603.00680 2026-06-16 cs.AI 版本更新

MemPO: Self-Memory Policy Optimization for Long-Horizon Agents

MemPO:面向长时程智能体的自我记忆策略优化

Ruoran Li, Xinghua Zhang, Haiyang Yu, Shitong Duan, Xiang Li, Wenxin Xiang, Chonghua Liao, Xudong Guo, Yongbin Li, Jinli Suo

发表机构 * Tsinghua University(清华大学) Tongyi Lab, Alibaba Group(阿里巴巴集团通义实验室)

AI总结 提出自我记忆策略优化算法(MemPO),让智能体自主管理记忆,通过基于记忆有效性的信用分配机制选择性保留关键信息,在减少令牌消耗的同时提升任务性能。

详情
AI中文摘要

长时程智能体在与环境交互过程中面临上下文规模不断增长的挑战,这降低了性能和稳定性。现有方法通常引入外部记忆模块并从存储的记忆中查找相关信息,但无法让模型自身主动管理记忆内容并与智能体的总体任务目标对齐。为解决这些限制,我们提出了自我记忆策略优化算法(MemPO),使智能体(策略模型)能够在与环境交互时自主总结和管理其记忆。通过改进基于记忆有效性的信用分配机制,策略模型可以选择性地保留关键信息,在保持任务性能的同时显著减少令牌消耗。大量实验和分析证实,MemPO 在基础模型上实现了 25.98 的绝对 F1 分数提升,比之前的最先进基线高出 7.1,同时令牌使用量分别减少了 67.58% 和 73.12%。代码已在此 https URL 发布。

英文摘要

Long-horizon agents face the challenge of growing context size during interaction with environment, which degrades the performance and stability. Existing methods typically introduce the external memory module and look up the relevant information from the stored memory, which prevents the model itself from proactively managing its memory content and aligning with the agent's overarching task objectives. To address these limitations, we propose the self-memory policy optimization algorithm (MemPO), which enables the agent (policy model) to autonomously summarize and manage their memory during interaction with environment. By improving the credit assignment mechanism based on memory effectiveness, the policy model can selectively retain crucial information, significantly reducing token consumption while preserving task performance. Extensive experiments and analyses confirm that MemPO achieves absolute F1 score gains of 25.98 over the base model and 7.1 over the previous SOTA baseline, while reducing token usage by 67.58% and 73.12%. The code is released at https://github.com/TheNewBeeKing/MemPO.

2605.29796 2026-06-16 cs.AI cs.CL cs.LG 版本更新

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

SAAS:面向智能体搜索中过度搜索缓解的自我感知强化学习

Yunbo Tang, Chengyi Yang, Shiyu Liu, Zhishang Xiang, Zerui Chen, Qinggang Zhang, Jinsong Su

发表机构 * School of Informatics, Xiamen University(厦门大学信息学院) School of Artificial Intelligence, Jilin University(吉林大学人工智能学院)

AI总结 提出SAAS强化学习框架,通过搜索边界建模、边界感知奖励和分阶段优化策略,使LLM智能体具备动态自我感知能力,在不降低准确率的前提下显著减少过度搜索。

详情
AI中文摘要

智能体搜索使LLM能够通过迭代推理和外部搜索解决复杂的多跳问题。尽管有效,但这些系统在实践中常受限于一个关键缺陷:智能体无法识别自身知识边界,在内部知识足够时盲目触发搜索,甚至在已收集足够证据时未能终止搜索。缺乏自我感知导致严重的 extbf{过度搜索},带来大量推理延迟和过高的计算成本。为此,我们提出SAAS,一种新颖的强化学习框架,旨在培养动态自我感知能力,精确调节搜索行为而不损害准确性。SAAS引入三个关键组件:(i) 搜索边界建模机制,通过对比禁用搜索和启用搜索的轨迹,识别策略演化下的搜索边界;(ii) 边界感知奖励模块,将这种边界意识转化为轨迹级惩罚,抑制不必要和冗余的搜索;(iii) 分阶段优化策略,利用顺序课程优先考虑推理而非搜索正则化,从而避免奖励黑客。大量实验表明,SAAS在保持准确性的同时大幅减少了过度搜索。我们的代码和实现细节已在https://github.com/XMUDeepLIT/SAAS发布。

英文摘要

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe \textbf{over-search}, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search, while maintaining accuracy. Our code and implementation details are released at https://github.com/XMUDeepLIT/SAAS.

2606.08151 2026-06-16 cs.AI 版本更新

Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents

决策感知记忆卡:用于工具使用LLM智能体的反事实启发式上下文选择与压缩

Xinyu Guan, Qianyang Zhao, Yuming Deng

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出CICL决策感知上下文层,通过构建上下文图、评分单元效用并打包为记忆卡,提升工具使用LLM智能体在行动时的证据选择与压缩能力,在SWE-bench验证集上实现检索命中率提升。

Comments 15 pages, 2 figures, 8 tables. Code is available at https://github.com/stephen-guan-researcher/CICL; Qwen-QLoRA adapter is available at https://huggingface.co/XinyuGuan/CICL

详情
AI中文摘要

使用工具的LLM智能体失败的原因往往不是缺少相关文本,而是在行动时未能选择、压缩或呈现决定性证据。我们提出CICL,一个决策感知上下文层,它将实例证据转化为上下文图,通过共享的八字段模式路由确定性、Opus辅助、Qwen、Codex/GPT-5.5和Qwen-QLoRA判断,根据行动偏移、结果提升、必要性和负迁移风险对单元评分,并将高效用证据打包为类型化记忆卡供预算有限的智能体使用。该设计将测量到的决策信号与判断模型分离,使得前沿标注、局部代理和轻量级排序器可以在一个可审计协议下进行比较。实验上,CICL在公开基准测试中取得了具体提升,同时暴露了其局限性。在50个SWE-bench Verified文件检索实例上,直接使用Qwen3.6-plus对BM25前50候选进行重排序,将hit@1从0.58提升至0.78,MRR@10从0.634提升至0.790,且所有2500个判断均可解析。受控诊断显示了行动关键性:在预算120时,CICL在v1上达到F1 0.620,在v3上达到F1 0.425,而移除最高效用的语义v3单元导致F1降至0.000。补充检查包括Qwen-QLoRA在710个候选上的一致性、一个小的200标签真实代码Opus辅助信号,以及一个三实例补丁烟雾测试验证检索到补丁的流程,但不声称官方SWE-bench成功。RepoBench-R摘要仍优于记忆卡,紧凑型排序器尚未取代启发式方法。CICL贡献了一个可复现的测量和选择层,用于决策关键上下文,而非端到端编码智能体修复声明。

英文摘要

Modern large language model (LLM) agents do not simply need longer contexts; they need decision-relevant evidence at the moment of action. We study decision-aware context selection: ranking retrieved files, tests, traces, rules, and memories by their expected effect on an agent's next action rather than by semantic similarity alone. We present the Counterfactual-Inspired Context Layer (CICL), which builds an instance context graph, estimates decision-oriented utility for candidate units, and compresses selected evidence into typed memory cards. The same schema can be instantiated with hosted LLM judges, local surrogates, or lightweight rankers, making the selection protocol auditable across model choices. On 50 SWE-bench Verified file-retrieval instances, Qwen3.6-Plus reranking of BM25 top-50 candidates improves hit@1 from 0.58 to 0.78 and MRR@10 from 0.634 to 0.790, with all 2,500 judgments parseable. Controlled diagnostics show that CICL identifies action-critical evidence: removing the top-utility semantic unit reduces F1 from 0.245 to 0.000. In selected-then-compressed mode, memory cards save 44.93 tokens per query while preserving selected evidence. CICL provides a practical layer for measuring, ranking, and compressing decision-critical context for tool-using agents. Code is available at https://github.com/stephen-guan-researcher/CICL.

2606.09365 2026-06-16 cs.AI cs.CL 版本更新

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

经验造就熟练:通过自进化技能记忆实现可泛化的医疗智能体推理

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li, Mianxin Liu, Lei Liu, Yankai Jiang

发表机构 * Fudan University(复旦大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Innovation Institute(上海创新研究院) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出SkeMex框架,通过技能记忆实现医疗智能体后部署自进化,无需更新模型权重,在临床任务中优于现有记忆型智能体。

详情
AI中文摘要

医疗智能体系统越来越期望支持交互式临床决策,而不仅仅是静态问答。在这种设置中,有效的智能体必须跨演化病例重用先前经验,然而现有的记忆机制通常保留原始历史轨迹,这些轨迹冗余、嘈杂且难以管理。更重要的是,它们很少区分哪些记忆对未来推理真正有用。这限制了它们积累紧凑且可靠的经验以进行长期临床推理的能力。为弥补这一差距,我们提出SkeMex,一种部署后自进化框架,通过基于技能的记忆改进医疗智能体,无需更新模型权重。SkeMex将信息丰富的交互轨迹提炼为结构化技能,编码可重用的程序性知识,并将其组织成涵盖通用、任务特定和行动级经验的多分支存储库。为确定哪些记忆应被重用和保留,SkeMex从环境反馈中估计上下文相关的效用,并用其指导价值感知的检索和存储库治理。闭环的“读-写-评估-治理”生命周期通过写入新技能、更新效用、促进有用记忆和移除有害条目进一步支持持续进化。跨不同临床任务的实验表明,SkeMex在离线和在线设置中均持续优于代表性记忆型智能体。它还能跨模型骨干泛化并支持可迁移的技能记忆。所有数据和代码将公开发布。

英文摘要

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop ``Read--Write--Assess--Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

2606.11349 2026-06-16 cs.AI cs.HC 版本更新

Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

知道何时提问:分层语言代理的自门控澄清机制

Aijing Gao, Yiming Kang, Mengdie Flora Wang, Jae Oh Woo

发表机构 * Amazon Web Services(亚马逊云科技)

AI总结 提出ACTION-RATING框架,将澄清请求纳入代理的动作空间,与导航共享序数尺度,在分层推理中实现自门控澄清,通过强制性和机会性两种信息寻求模式提升决策准确性。

详情
AI中文摘要

在分层推理中,失败通常源于中间决策点,代理在没有意识到缺乏关键信息的情况下错误地选择了分支。我们不将澄清视为外部不确定性触发,而是提出ACTION-RATING,一种将澄清置于代理动作空间内、与导航共享序数尺度的公式,使得在每个决策点提问与行动直接竞争,并在中间状态可观察求助行为。从代理自身的评分中涌现出两种结构上不同的信息寻求模式:强制性(无可行分支)和机会性(尽管有领先候选但仍有残余不确定性)。在协调关税表分类(30,000节点分类树,三个基准,跨4个家族的9个LLM)上,我们观察到从强制性澄清到机会性澄清的机制转变,信息寻求有效性(ISE,一个局部诊断指标,定义为帮助交互后正确下一步导航步骤的比例,非最终任务指标)从50%上升到74%。三个诊断对比未能复现此结构。可分离性测试表明,当答案质量下降(准确率下降18.8%)时,信息寻求模式(模式分裂、ISE排名)保持不变,支持代理寻求帮助的位置与其所获帮助质量之间的经验分离。在受控答案通道下,10位数字准确率提升达+16.2%;我们将其解读为更好定位所能释放的上限,而非部署估计。

英文摘要

In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ACTION-RATING, a formulation that places it inside the agent's action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states. Two structurally distinct information-seeking modes emerge from the agent's own ratings: mandatory (no viable branch) and opportunistic (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9~LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, with Information-Seeking Effectiveness (ISE), a local diagnostic defined as the fraction of help interactions followed by a correct next navigation step (not a final-task metric), rising from 50% to 74%. Three diagnostic contrasts fail to reproduce this structure. A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (-18.8% accuracy), supporting an empirical separation between where an agent seeks help and the quality of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at 10-digit; we read this as an upper bound on what better localization could unlock, not a deployment estimate.

2606.13710 2026-06-16 cs.AI cs.LG 版本更新

Hybrid Open-Ended Tri-Evolution Makes Better Deep Researcher

混合开放式三重进化打造更优深度研究者

Hongming Piao, Chi Liu, Mengzhuo Chen, Yan Shu, Xidong Wang, Derek Li, Ying Wei, Bryan Dai

发表机构 * IQuest Research Zhejiang University(浙江大学)

AI总结 提出混合开放式三重进化框架,通过混合模式强化学习协同进化提议者、求解者和评判者,使8B模型在深度研究任务上超越静态开源8-32B模型及先进训练方法。

详情
AI中文摘要

深度研究和智能体进化是AI智能体在现实应用中迈向通用人工智能的实际任务。前者使智能体能够在开放环境中自主检索和整合信息以处理开放式研究任务,但受限于智能体系统的静态参数化深度研究能力。后者允许智能体自主与环境交互以获得经验,从而进化模型能力。然而,其有效性仅在具有标准答案的可验证任务上得到广泛验证,与开放式研究任务存在差距。为桥接这两个关键任务,我们提出混合开放式三重进化框架,该框架利用混合模式强化学习,基于网络规模知识促进提议者、求解者和评判者的协同进化,朝着开放式任务和环境中自主进化的智能体迈进。在三个长格式深度研究基准上的大量实验表明,通过HOTE训练的8B模型超越了最强的静态开源8-32B模型以及通过最先进深度研究训练方法训练的模型,且时间开销更少,并进一步验证了HOTE中三个模块的进化不可或缺。

英文摘要

Deep research and agent evolution serve as de-facto tasks for AI agents in real-world applications toward artificial general intelligence. The former enables autonomous retrieval and integration of information in open-ended environments to tackle open-ended research tasks, yet it is constrained by the static parametric deep research capabilities of agent systems. The latter allows agents to autonomously interact with the environment to gain experiences that evolve model capabilities. However, its effectiveness has been widely validated only on verifiable tasks with standard answers, leaving a gap with open-ended research tasks. To bridge these two critical tasks, we propose the Hybrid Open-Ended Tri-Evolution (HOTE) framework, which leverages hybrid-mode reinforcement learning to facilitate the collaborative evolution of a proposer, solver and judge based on web-scale knowledge, moving toward autonomous evolving agents in open-ended tasks and environments. Extensive experiments on three long-form deep research benchmarks demonstrate that the 8B model trained via HOTE surpasses the strongest static open 8-32B models as well as those trained by state-of-the-art deep research training methods with less time overhead, and further verify that the evolution of all three modules in HOTE is indispensable.

2601.19612 2026-06-16 cs.LG cs.AI cs.RO 版本更新

Safe Exploration via Policy Priors

通过策略先验进行安全探索

Manuel Wendl, Yarden As, Manish Prajapat, Anton Pollak, Stelian Coros, Andreas Krause

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 提出SOOPER方法,利用次优但保守的策略先验,结合概率动力学模型进行乐观探索和悲观回退,在保证安全的同时收敛到最优策略。

详情
AI中文摘要

安全探索是强化学习智能体在受控(例如模拟)环境之外在线学习和适应的关键要求。在这项工作中,我们通过利用次优但保守的策略(例如,从离线数据或模拟器中获得)作为先验来应对这一挑战。我们的方法SOOPER使用概率动力学模型进行乐观探索,但在必要时悲观地回退到保守的策略先验。我们证明了SOOPER在整个学习过程中保证安全性,并通过限制其累积遗憾建立了收敛到最优策略的保证。在关键的安全强化学习基准测试和真实硬件上的大量实验表明,SOOPER具有可扩展性,优于现有技术,并在实践中验证了我们的理论保证。

英文摘要

Safe exploration is a key requirement for reinforcement learning (RL) agents to learn and adapt online, beyond controlled (e.g. simulated) environments. In this work, we tackle this challenge by utilizing suboptimal yet conservative policies (e.g., obtained from offline data or simulators) as priors. Our approach, SOOPER, uses probabilistic dynamics models to optimistically explore, yet pessimistically fall back to the conservative policy prior if needed. We prove that SOOPER guarantees safety throughout learning, and establish convergence to an optimal policy by bounding its cumulative regret. Extensive experiments on key safe RL benchmarks and real-world hardware demonstrate that SOOPER is scalable, outperforms the state-of-the-art and validate our theoretical guarantees in practice.

2602.00887 2026-06-16 cs.CL cs.AI cs.LG 版本更新

EffGen: Enabling Small Language Models as Capable Autonomous Agents

EffGen: 使小型语言模型成为能干的自主智能体

Gaurav Srivastava, Aafiya Hussain, Chi Wang, Yingyan Celine Lin, Xuan Wang

发表机构 * Department of Computer Science, Virginia Tech, Blacksburg, VA, USA(弗吉尼亚理工大学计算机科学系) Georgia Institute of Technology, Atlanta, GA, USA(佐治亚理工学院) Google DeepMind, USA(谷歌DeepMind)

AI总结 EffGen是一个针对小型语言模型优化的开源智能体框架,通过提示压缩、任务分解、复杂度路由和统一记忆系统,实现高效、安全的本地部署,在13个基准测试中优于LangChain等框架。

Comments Accepted to ICML 2026 Conference

详情
AI中文摘要

目前大多数基于语言模型的智能体系统都是通过API调用为大型语言模型(如GPT、Claude、Gemini)构建和优化的;虽然强大,但这种方法面临高令牌成本和敏感应用中的隐私问题等限制。我们提出了EffGen,一个针对小型语言模型优化的开源智能体框架,能够实现有效、高效且安全的本地部署。EffGen有四大贡献:(1)增强的工具调用与提示优化,可将输入提示压缩高达70-80%(在我们的基准测试中平均压缩57%),同时保留任务语义;(2)智能任务分解,根据依赖关系将复杂查询分解为并行或顺序子任务;(3)基于复杂度的路由,利用五个因素做出智能的执行前决策;(4)统一记忆系统,结合短期、长期和基于向量的存储。此外,EffGen统一了多种智能体协议(MCP、A2A、ACP)以实现跨协议通信。在13个基准测试上的结果表明,EffGen在成功率、执行速度和内存占用方面优于LangChain、AutoGen和Smolagents。我们的结果揭示,提示优化和复杂度路由具有互补的缩放行为:优化对小型语言模型更有利(1.5B模型提升11.2%,而32B模型提升2.4%),而路由对大型模型更有利(1.5B模型提升3.6%,而32B模型提升7.9%),两者结合在所有规模上都能带来一致的增益。EffGen在Apache 2.0许可证下发布,确保研究和商业用途的广泛可访问性,代码可在https://github.com/effgen/effgen获取,Python包可通过pip install effgen安装,项目网站和文档位于https://effgen.ai和https://docs.effgen.ai。

英文摘要

Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls; while powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce EffGen, an open-source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment. EffGen makes four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses input prompts by up to 70-80% (and 57% on average across our benchmarks) while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity-based routing using five factors to make smart pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage. Additionally, EffGen unifies multiple agent protocols (MCP, A2A, ACP) for cross-protocol communication. Results on 13 benchmarks show EffGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. EffGen is released under the Apache 2.0 License, ensuring broad accessibility for research and commercial use, with the code available at https://github.com/ctrl-gaurav/effGen, the Python package at https://pypi.org/project/effgen/ (pip install effgen), and the project website and documentation at https://effgen.org/ and https://docs.effgen.org/.

2603.21613 2026-06-16 cs.IR cs.AI 版本更新

AgenticRec: A Recommendation-Oriented Agentic Framework with Progressive Tool-Integrated Reasoning Optimization

AgenticRec:面向推荐的智能体框架与渐进式工具集成推理优化

Tianyi Li, Zixuan Wang, Guidong Lei, Xiaodong Li, Hui Li

发表机构 * Xiamen University(厦门大学)

AI总结 提出AgenticRec框架,将推荐建模为工具集成推理过程,并设计两阶段训练范式,通过隐式反馈激活和渐进偏好细化提升推荐准确性。

详情
AI中文摘要

基于大型语言模型的推荐智能体为个性化推荐提供了有前景的范式。然而,现有智能体通常存在工具集成推理轨迹与推荐反馈之间的错位,限制了其区分细粒度用户偏好的能力。为解决这些问题,我们提出AgenticRec,一个面向推荐的智能体框架,将推荐形式化为在推荐导向工具套件上的工具集成推理过程。基于此框架,我们进一步开发了一个专门的两阶段训练范式,专为推荐智能体定制。在第一阶段,我们引入推荐导向轨迹激活,在隐式反馈下优化智能体推荐能力。在第二阶段,渐进偏好细化通过自举困难对上的双向偏好推理进一步优化智能体,逐步锐化偏好边界。理论分析和大量实验证明了AgenticRec的有效性。我们的代码可在该https URL获取。

英文摘要

Recommender agents built on Large Language Models offer a promising paradigm for personalized recommendation. However, existing agents typically suffer from a misalignment between their tool-integrated reasoning trajectories and recommendation feedback, limiting their ability to distinguish fine-grained user preferences. To address these challenges, we propose AgenticRec, an agentic recommendation framework that formulates recommendation as a tool-integrated reasoning process over a recommendation-oriented tool suite. Built upon this framework, we further develop a dedicated two-stage training paradigm tailored for recommender agents. In the first stage, we introduce Recommendation-Oriented Trajectory Activation, optimize the agentic recommendation ability under implicit feedback. In the second stage, Progressive Preference Refinement further refines the agent through bidirectional preference reasoning over self-bootstrapped hard pairs, progressively sharpening preference boundaries. Theoretical analysis and extensive experiments demonstrate the effectiveness of AgenticRec. Our code is available at https://anonymous.4open.science/r/AgenticRec-FB16.

2603.22376 2026-06-16 cs.IR cs.AI 版本更新

Closing the Auto-Research Loop: An AI Co-Scientist for Production Search Ranking

关闭自动研究循环:面向生产搜索排名的AI合作科学家

Liwei Wu, Cho-Jui Hsieh

发表机构 * Trip.com Group(Trip.com集团) UCLA(加州大学洛杉矶分校)

AI总结 提出AI合作科学家框架,通过LLM代理与云计算集成,自动迭代生成想法、实现代码、进行GPU实验并分析结果,在搜索排名任务中带来额外+0.083%离线增益。

Comments Submitted to EMNLP for review on June 14, 2026

详情
AI中文摘要

我们提出了一个AI合作科学家框架,该框架为大型在线旅游平台的生产搜索排名系统关闭了研究循环——将LLM代理与直接云计算访问配对,使得想法生成、代码实现、GPU实验和结果分析能够与人类科学家一起端到端迭代。该框架采用混合代理架构:单一LLM代理处理常规工作,而多LLM共识(GPT-5.2、Gemini Pro 3、Claude Opus 4.5)用于更高风险的决策。在生产排名任务上,人工设计的Transformer基线(V2)相比预Transformer基线(V1)提升了+0.118%;AI合作科学家在V2之上的自动循环贡献了额外的+0.083%,合计离线增益为+0.201%,大约在一周多的挂钟时间内完成(单次运行数值;统计限制在论文中讨论)。最有用的AI提案——统一长序列布局、槽位类型嵌入和多阶段学习率调度——是NLP和视觉领域的标准实践,但之前未出现在我们的生产栈中,这表明LLM代理可以作为排名团队的跨学科连接器。我们还报告了部署背景、负面结果和经验教训。

英文摘要

We present an AI Co-Scientist framework that closes the research loop for the production search-ranking system of a large online travel platform -- pairing LLM agents with direct cloud-compute access so that idea generation, code implementation, GPU experimentation, and result analysis iterate end-to-end with a human scientist in the loop. The framework uses a hybrid agent architecture: single-LLM agents handle routine work, while multi-LLM consensus (GPT-5.2, Gemini Pro 3, Claude Opus 4.5) is invoked for higher-stakes decisions. On the production ranking task, a human-designed transformer baseline (V2) yielded $+0.118\%$ over a pre-transformer baseline (V1); the AI Co-Scientist's automated loop on top of V2 contributed an additional $+0.083\%$, for a combined $+0.201\%$ offline gain delivered in roughly one extra week of wall-clock time (single-run numbers; statistical limits discussed in the paper). The most useful AI proposals -- unified long-sequence layouts, slot-type embeddings, and multi-phase learning-rate schedules -- are standard practice in NLP and Vision but were absent from our production stack, suggesting that LLM agents can serve as cross-disciplinary connectors for ranking teams. We also report deployment context, negative results, and lessons learned.

2603.22766 2026-06-16 cs.HC cs.AI 版本更新

From Overload to Convergence: Supporting Multi-Issue Human-AI Negotiation with Bayesian Visualization

从过载到收敛:基于贝叶斯可视化的多议题人机协商支持

Mehul Parmar, Chaklam Silpasuwanchai

发表机构 * Asian Institute of Technology(亚洲理工学院)

AI总结 针对多议题协商中认知负荷导致人类表现下降的问题,提出基于贝叶斯估计协议概率的不确定性可视化方法,实验证明该方法能提升人类协商结果和效率,同时保持人类控制。

Comments Accepted for publication to CHI 2026. v2: Added Appendix B (system prompts) and Appendix C (payoff matrices) in response to replication requests. Dataset independently available at https://doi.org/10.5281/zenodo.20545331

详情
AI中文摘要

随着AI系统越来越多地介入协商过程,理解协商议题数量对人类表现的影响对于维护人类自主性至关重要。我们在一个真实的租赁场景中设计了人机协商案例研究,改变协商议题的数量;实证结果表明,在没有支持的情况下,表现最多在三个议题时保持稳定,但随着额外议题增加认知负荷而下降。为了解决这个问题,我们引入了一种基于贝叶斯协议概率估计的新型不确定性可视化方法。它展示了随着协商进展,相互可接受的协议空间如何缩小,帮助用户识别有前景的选项。在受试者内实验(N=32)中,它改善了人类结果和效率,保持了人类控制,并避免了价值重新分配。我们的发现揭示了人类在人机协商中能够管理的复杂性的实际极限,推进了关于复杂协商中人类表现的理论,并为交互系统提供了经过验证的设计指导。

英文摘要

As AI systems increasingly mediate negotiations, understanding how the number of negotiated issues impacts human performance is crucial for maintaining human agency. We designed a human-AI negotiation case study in a realistic property rental scenario, varying the number of negotiated issues; empirical findings show that without support, performance stays stable up to three issues but declines as additional issues increase cognitive load. To address this, we introduce a novel uncertainty-based visualization driven by Bayesian estimation of agreement probability. It shows how the space of mutually acceptable agreements narrows as negotiation progresses, helping users identify promising options. In a within-subjects experiment (N=32), it improved human outcomes and efficiency, preserved human control, and avoided redistributing value. Our findings surface practical limits on the complexity people can manage in human-AI negotiation, advance theory on human performance in complex negotiations, and offer validated design guidance for interactive systems.

2604.09673 2026-06-16 cs.LG cs.AI 版本更新

Active Inference with a Self-Prior in the Mirror-Mark Task

镜像标记任务中带有自我先验的主动推理

Dongmin Kim, Hoshinori Kanazawa, Yasuo Kuniyoshi

发表机构 * The University of Tokyo(东京大学) Laboratory for Intelligent Systems and Informatics(智能系统与信息学实验室)

AI总结 提出一种基于自我先验的计算模型,通过主动推理驱动标记导向行为,无需外部奖励即可模拟镜像自我识别。

Comments 8 pages, 5 figures, Accepted to IEEE ICDL 2026

详情
AI中文摘要

镜像自我识别测试评估受试者是否触摸仅在镜子中可见的自身标记,被广泛用作自我意识的指标。在本研究中,我们提出一个计算模型,其中这种行为通过单一机制——自我先验——自发产生,无需任何外部奖励。自我先验通过Transformer实现,学习熟悉多感官经验的密度;当出现新标记时,与学习分布的差异通过主动推理驱动标记导向行为。一个仅依赖视觉和本体感觉而无触觉输入的模拟婴儿,发现镜中自己脸上的贴纸并在约70%的情况下将其移除,无需任何明确指令。贴纸移除后预期自由能显著下降,证实自我先验作为区分自我与非自我的内部标准。跨模态采样进一步表明,自我先验捕获视觉-本体感觉关联,充当概率身体图式。这些结果为镜像测试中观察到的关键行为提供了简洁的计算解释,并表明自由能原理可作为研究自我意识发展起源的统一假设。代码见:this https URL

英文摘要

The mirror self-recognition test evaluates whether a subject touches a mark on its own body that is visible only in a mirror, and is widely used as an indicator of self-awareness. In this study, we present a computational model in which this behavior emerges spontaneously through a single mechanism, the self-prior, without any external reward. The self-prior, implemented with a Transformer, learns the density of familiar multisensory experiences; when a novel mark appears, the discrepancy from this learned distribution drives mark-directed behavior through active inference. A simulated infant, relying solely on vision and proprioception without tactile input, discovered a sticker placed on its own face in the mirror and removed it in approximately 70% of cases without any explicit instruction. Expected free energy decreased significantly after sticker removal, confirming that the self-prior operates as an internal criterion for distinguishing self from non-self. Cross-modal sampling further demonstrated that the self-prior captures visual--proprioceptive associations, functioning as a probabilistic body schema. These results provide a concise computational account of the key behavior observed in the mirror test and suggest that the free energy principle can serve as a unifying hypothesis for investigating the developmental origins of self-awareness. Code is available at: https://github.com/kim135797531/self-prior-mirror

2605.18401 2026-06-16 cs.CL cs.AI 版本更新

SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

SkillsVote: 代理技能的生命周期治理从收集、推荐到进化

Hongyi Liu, Haoyan Yang, Tao Jiang, Bo Tang, Feiyu Xiong, Yuyu Luo, Zhiyu Li

发表机构 * Harbin Institute of Technology(哈尔滨理工大学) Soochow University(苏州大学) The Hong Kong University of Science and Technology (Guangzhou)(香港理工大学(广州))

AI总结 本文提出SkillsVote框架,通过生命周期治理管理代理技能,从收集和推荐到进化,提升模型在终端基准和SWE-Bench Pro上的性能。

Comments 71 pages, 12 figures, 13 tables

详情
AI中文摘要

长周期LLM代理留下的轨迹可能成为可重用的经验,但原始轨迹噪声大且难以管理。我们将代理技能视为一种经验模式,结合可执行脚本和不可执行的指导。然而,开放技能生态系统包含冗余、不均匀、环境敏感的产物,随意更新会污染未来上下文。我们提出了SkillsVote,一个用于代理技能生命周期治理的框架,从收集和推荐到进化。SkillsVote对百万级开源语料库进行环境需求、质量和可验证性分析,然后合成可验证技能的任务。在执行前,SkillsVote在结构化技能库中进行代理库搜索以暴露教学技能上下文。在执行后,它将轨迹分解为技能关联的子任务,将结果归因于技能使用、代理探索、环境和结果信号,并只接受成功的可重用发现以进行证据门控更新。在评估中,离线进化使GPT-5.2在Terminal-Bench 2.0上提升高达7.9个百分点,而在线进化使SWE-Bench Pro提升高达2.6个百分点。总体而言,受控的外部技能库可以在不更新模型的情况下提升冻结代理,当系统控制暴露、信用和保存时。

英文摘要

Long-horizon LLM agents generate traces that could become reusable experience, but raw trajectories are noisy, local, and hard to govern. Agent Skills offer a structured artifact for combining procedural guidance, executable resources, and applicability boundaries. Yet open skill ecosystems contain redundant, uneven, environment-sensitive artifacts, and indiscriminate updates can pollute future context. We present SkillsVote, a lifecycle-governance framework for Agent Skills across collection, recommendation, attribution, and evolution. SkillsVote profiles a million-scale open source corpus for environment requirements, quality, and verifiability, and synthesizes tasks for verifiable skills. Before execution, it performs agentic library search over structured skill folders to expose instructional context. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill-guided execution, agent exploration, environment, and result signals, and admits only successful reusable discoveries to evidence-gated updates. Experiments on Terminal-Bench 2.0 and SWE-Bench Pro show that SkillsVote improves agent performance on challenging agentic coding benchmarks. The gains arise from two complementary pathways: online evolution over task streams at test time and offline transfer via frozen libraries built from either historical trajectories or curated open source skills.

2606.11520 2026-06-16 cs.CL cs.AI cs.LG 版本更新

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

ISE:一种基于执行的多轮操作系统代理轨迹合成方法

Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu

发表机构 * University of Electronic Science and Technology of China(电子科技大学) SenseTime Research(字节跳动研究院)

AI总结 提出ISE三阶段范式,通过结构化意图构建、角色锁定用户模拟和真实执行环境,生成多轮代理轨迹,微调后显著提升代理工具使用性能。

Comments 13 pages, 6 figures. Dataset and code: https://github.com/Valiere01/ISE-Trace

详情
AI中文摘要

训练有能力的操作系统代理需要同时捕获结构化用户意图、多轮任务委派和基于工具执行的数据——这些属性在现有数据集中缺失。我们提出ISE(意图->模拟->执行),一种三阶段合成范式,联合解决这些差距。阶段1通过4D框架(人物角色x领域x任务x复杂度)构建约50000个结构化意图;去重后池中包含43956个唯一意图,并在mpnet-base-v2嵌入(余弦核,q=1)上获得61.57的Vendi分数。阶段2通过角色锁定的用户模拟器驱动多轮用户-代理交互,将每轮用户交互基于实际执行结果,生成23132条完整轨迹,平均8.12轮用户交互和68.24轮总对话。阶段3在实时、隔离的操作系统工作空间中执行每个工具调用,生成真实的故障恢复动态而非模拟响应。在ISETrace上微调后,使用Qwen3-8B在标准协议下的代理工具使用任务中,ClawEval pass@1从19.3提升至37.7。该结果优于零样本GPT-4o和四倍大的Qwen3-32B基础模型。对阶段2的消融实验证明多轮模拟带来了大部分性能提升。我们在该https URL发布所有源代码和数据集。

英文摘要

Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 using Qwen3-8B on agent tool-use tasks with a standard protocol. This result outperforms zero-shot GPT-4o and the larger Qwen3-32B base model which is four times bigger. An ablation on Stage 2 proves multi-turn simulation brings a large portion of the performance gain. We release all source code and dataset at https://github.com/Valiere01/ISE-Trace.

2. 知识表示、推理与符号AI 18 篇

2606.14892 2026-06-16 cs.AI cs.LG cs.SI stat.ML 新提交

Relational Structural Causal Models

关系结构因果模型

Adiba Ejaz, Elias Bareinboim

发表机构 * Causal Artificial Intelligence Lab, Columbia University(哥伦比亚大学因果人工智能实验室)

AI总结 提出关系结构因果模型,将结构因果模型扩展到对象和关系可变的场景,通过关系因果图和符号识别准则实现未见组合的因果和观测查询识别,并设计关系神经因果模型在交通场景中优于非关系基线。

Comments Proceedings of the Forty-Third International Conference on Machine Learning

详情
AI中文摘要

人工智能必须拥有一个因果的环境模型,支持关于干预和反事实的推理,同时具有组合性,支持对未见过的对象组合进行泛化。在这项工作中,我们正式研究了何时以及如何学习这样的模型。我们开发了关系结构因果模型,将结构因果模型(Pearl 2009)扩展到对象及其关系变化的场景。首先,我们展示了在没有进一步假设的情况下,不仅因果查询,而且关于未见对象组合的观测查询的答案也无法被识别。为了实现这种识别——包括在存在未观测混杂的情况下——我们定义了关系因果图并推导了符号识别准则。最后,我们提出了关系神经因果模型,这是一种可证明正确的方法,在具有不同汽车、信号和行人的模拟交通场景中优于非关系基线。

英文摘要

An artificial intelligence must have a model of its environment that is causal, supporting reasoning about interventions and counterfactuals, and also combinatorial, supporting generalization to unseen combinations of objects. In this work, we formally study when and how such a model can be learned. We develop relational structural causal models, extending structural causal models (Pearl 2009) to settings where objects and their relations vary. First, we show how answers to not only causal but also observational queries about unseen combinations of objects can not be identified without further assumptions. To enable such identification--including in the presence of unobserved confounding--we define relational causal graphs and derive symbolic identification criteria. Finally, we propose relational neural causal models, a provably correct approach that outperforms non-relational baselines on simulated traffic scenes with varying cars, signals, and pedestrians.

2606.14935 2026-06-16 cs.AI 新提交

PrologMCP: A Standardized Prolog Tool Interface for LLM Agents

PrologMCP:面向LLM代理的标准化Prolog工具接口

Agnieszka Mensfelt, Adarsh Prabhakaran, Adrian Haret, Vince Trencsenyi, Kostas Stathis

发表机构 * Royal Holloway, University of London(伦敦大学皇家霍洛威学院)

AI总结 提出PrologMCP,一个通过模型上下文协议将Prolog暴露为状态化工具的任务无关开源服务器,使LLM代理能够通过翻译-运行-检查-修复循环稳健地委托演绎推理,在PARARULE-Plus上达到或超越推理型LLM。

Comments Accepted at Joint Workshop on Statistics and Knowledge Integration for Logic, Learning, Ethical Decisions, and LLMs, 18 July 2026, Lisbon

详情
AI中文摘要

前沿推理调优语言模型在深度演绎任务上仍然失败,而通过扩展内部推理来提升性能的成本很高。符号委托提供了一条补充路径:语言模型翻译问题,求解器执行推理。然而,当前用于逻辑编程的自动形式化管道通常是针对特定任务或代理的定制集成。我们引入了PrologMCP,一个任务无关的开源服务器,通过模型上下文协议(MCP)将Prolog暴露为状态化工具。其紧凑的工具接口、结构化错误报告和每会话隔离使翻译-运行-检查-修复循环成为MCP能力代理的可复用原语。我们在PARARULE-Plus的两个子集上评估了增强PrologMCP的形式化代理与标准和推理LLM(Claude Sonnet 4.6、GPT-4.1和o4-mini)的性能:一个通用样本和一个更具挑战性的样本,针对自然语言推理的特定失败模式。在通用样本上,形式化代理匹配或超越推理LLM(准确率1.00对比1.00/0.998),相比标准模型提升最大(GPT-4.1为0.762)。在挑战性子集上,形式化代理保持接近完美(1.00/0.99),而推理LLM降至0.95/0.94。这些结果表明,通过MCP将推理委托给Prolog是扩展自然语言推理的一种稳健且可检查的替代方案。

英文摘要

Frontier reasoning-tuned language models still fail on deductive tasks at depth, and the cost of improved performance through extended internal reasoning scales poorly. Symbolic delegation offers a complementary route: a language model translates the problem, while a solver performs the inference. However, current autoformalization pipelines for logic programming are typically bespoke integrations tied to particular tasks or agents. We introduce PrologMCP, a task-agnostic, open-source server that exposes Prolog as a stateful tool through the Model Context Protocol (MCP). Its compact tool interface, structured error reporting, and per-session isolation make the translate-run-inspect-repair loop a reusable primitive for MCP-capable agents. We evaluate a formalizer agent enhanced with PrologMCP against standard and reasoning LLMs (Claude Sonnet 4.6, GPT-4.1, and o4-mini) on two subsets of PARARULE-Plus: a general-purpose sample and a more challenging one targeting a specific failure mode of natural-language reasoning. On the general sample, the formalizer matches or exceeds reasoning LLMs (accuracy 1.00 vs.\ 1.00 / 0.998), with the largest gains over standard models (0.762 for GPT-4.1). On the challenging subset, the formalizer remains near-perfect (1.00 / 0.99) while reasoning LLMs drop to 0.95 / 0.94. These results suggest that delegating inference to Prolog via MCP is a robust and inspectable alternative to extended natural-language reasoning.

2606.15096 2026-06-16 cs.AI 新提交

VGPT-RSI for RH-Adjacent Formal Progress: Boundary Certificates, Verified Finite Lagarias Inequalities, and Explicit Failure Localization

VGPT-RSI 用于 RH 邻近形式化进展:边界证书、已验证的有限 Lagarias 不等式和显式故障定位

Zhixin Hu, Tao Xu, Xiaodian Sun, Li Jin, Momiao Xiong

AI总结 提出 VGPT-RSI 系统,通过构造并验证有限 RH 边界证书和 Lagarias 准则的有限形式化,实现 Riemann 假设邻近问题的部分形式化进展,并明确识别剩余数学障碍。

Comments 31 pages, 3 figures

详情
AI中文摘要

Riemann 假设仍然是数学中未解决的核心问题之一。我们不声称证明,而是研究一个可验证的 AI 辅助推理系统能否产生可靠的、经过形式化检查的部分进展,同时明确识别剩余的数学障碍。我们将可验证增长物理变压器与递归自我改进(VGPT-RSI)应用于两个 RH 邻近的认证任务。首先,我们在一个参数化安全下界曲线上构造并验证了一个有限 RH 边界证书,该曲线覆盖一个区域。数值边界曲线被转换为证书支持的下界曲线,使用向外舍入区间算术和 Arb/FLINT 球算术进行审计,然后在 Rocq/CoqInterval 中检查参数化定理。其次,我们启动了一个形式化的 Lagarias 路径证书。Lagarias 准则指出 RH 等价于全局不等式。我们将有限量形式化,并产生一个 Coq 检查的有限证书。最终系统识别出确切未解决的数学瓶颈:形式化 Lagarias 等价性,证明超出任何有限截断的全局尾部定理,以及可能将反例减少到巨量或相关的极值整数。这些结果表明,VGPT-RSI 能够产生经过认证的 RH 邻近形式化进展,组织证明依赖关系,并在剩余障碍确实是数学问题时避免过度声称。

英文摘要

The Riemann Hypothesis remains one of the central unsolved problems in mathematics. Rather than claiming proof, we investigate whether a verifiable AI-assisted reasoning system can produce reliable, formally checked partial progress while explicitly identifying the remaining mathematical obstructions. We apply the Verifiable Growing Physical Transformer with Recursive Self-Improvement (VGPT-RSI) to two RH-adjacent certification tasks. First, we construct and verify a finite RH-boundary certificate for inequality on a parameterized safe lower curve over a region. The numerical boundary curve is converted into a certificate-backed lower curve, audited using outward-rounded interval arithmetic and Arb/FLINT ball arithmetic, and then checked in Rocq/CoqInterval for the parameterized theorem. Second, we initiate a formal Lagarias-route certificate. Lagarias criterion states that RH is equivalent to the global inequality. We formalize the finite quantity and produce a Coq-checked finite certificate. The final system identifies the exact unresolved mathematical bottlenecks: formalizing the Lagarias equivalence, proving the global tail theorem beyond any finite cutoff, and potentially reducing counterexamples to colossally abundant or related extremal integers. These results demonstrate that VGPT-RSI can produce certified RH-adjacent formal progress, organize proof dependencies, and avoid overclaiming when the remaining obstruction is genuinely mathematical.

2606.15291 2026-06-16 cs.AI 新提交

A Formal Framework for Declarative Agentic AI in Business Process Analysis

业务流程分析中声明式智能体AI的形式化框架

Mohammad Azarijafari, Luisa Mich, Michele Missikoff

发表机构 * University of Trento(特伦托大学) Istituto di Analisi dei Sistemi ed Informatica (IASI) “Antonio Ruberti”, National Research Council (CNR)(国家研究委员会(CNR)安东尼奥·鲁贝蒂系统分析与信息学研究所(IASI))

AI总结 提出基于AGO方法的形式化框架,通过集合论和数学逻辑定义智能体、目标和对象实体及其交互,构建业务流程知识库以支持结构化查询、增量更新和自动生成工作流。

详情
AI中文摘要

智能体AI为自动化业务流程(BP)开辟了新机遇,实现了自主决策和动态适应。然而,要实现这一潜力,需要以形式化精度定义BP实体及其交互。本文通过AGO方法提出了一个用于智能体BP分析的形式化框架。AGO从谁在行动(智能体)、为何执行(目标)以及相关实体是什么(对象)的角度捕获建模视角。基于集合论和数学逻辑,我们形式化定义了AGO实体类型及其交互,将所有定义组织成BP知识库(BPKB)。生成的BPKB支持结构化查询、增量更新和BP工作流的自动生成,同时确保导出路径的健全性和完备性。

英文摘要

Agentic AI opens new opportunities for automating Business Process (BP), enabling autonomous decision-making and dynamic adaptation. However, realising this potential requires BP entities and their interactions to be defined with formal precision. This paper presents a formal framework for Agentic BP analysis through the AGO methodology. AGO captures the modelling perspective in terms of who is acting (Agents), why it is carried out (Goals), and what the relevant entities are (Objects). Grounded in set theory and mathematical logic, we formally define the AGO entity types and their interactions, organising all definitions into a BP Knowledge Base (BPKB). The resulting BPKB supports structured querying, incremental updates, and automatic generation of BP workflows, while ensuring soundness and completeness of the derived paths.

2606.15656 2026-06-16 cs.AI 新提交

Overcoming the Impedance Mismatch: A Theoretical Roadmap for Fusing Foundation Models and Knowledge Graphs

克服阻抗不匹配:融合基础模型与知识图谱的理论路线图

Sahil Rajesh Dhayalkar

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 本文提出“阻抗不匹配”概念,形式化分析基础模型与知识图谱的结构与几何摩擦,通过三级层次分类揭示现有方法的局限,并给出理论路线图实现真正的语义融合。

Comments 12 pages. Accepted at the ACL 2026 4th Workshop on Towards Knowledgeable Foundation Models (https://openreview.net/forum?id=hXDYsNAq8m)

详情
AI中文摘要

现代人工智能仍然从根本上分裂于基础模型的连续概率空间和知识图谱的离散确定性结构之间。虽然检索增强生成(RAG)试图通过将图数据序列化为文本来连接它们,但我们认为这种词汇桥接仅仅是表面的补丁。在本文中,我们将底层的结构和几何摩擦形式化为\textit{阻抗不匹配}。通过将当前的神经符号集成策略分类为三级层次,我们证明无论是表面级别的提示注入还是连续表示对齐,都无法保留可靠的多跳推理所需的严格逻辑模式。我们定义了具体的数学极限,如词汇瓶颈和拓扑坍缩,表明当前架构最终会产生幻觉或混淆语义节点。为了实现真正的语义融合,我们提出了一个严格的理论路线图。我们主张通过结构化残差流原生内化离散符号结构,利用向量符号架构进行潜在子图注入,并通过正交子空间编辑执行模型更新。这个可操作的框架为无缝融合符号逻辑的精确性和参数化记忆的表达能力的模型铺平了道路。

英文摘要

Modern artificial intelligence remains fundamentally divided between the continuous, probabilistic spaces of Foundation Models and the discrete, deterministic structures of Knowledge Graphs. While Retrieval-Augmented Generation (RAG) attempts to connect them by serializing graph data into text, we argue this lexical bridging is merely a superficial patch. In this paper, we formalize the underlying structural and geometric friction as the \textit{Impedance Mismatch}. By categorizing current neuro-symbolic integration strategies into a three-tiered hierarchy, we demonstrate that neither surface-level prompt injection nor continuous representation alignment can preserve the strict logical motifs required for reliable multi-hop reasoning. We define the specific mathematical limits, such as the Lexical Bottleneck and Topological Collapse, that show current architectures will eventually hallucinate or conflate semantic nodes. To achieve true semantic fusion, we propose a rigorous theoretical roadmap. We advocate for natively internalizing discrete symbolic structures through Structured Residual Streams, utilizing Vector Symbolic Architectures for latent sub-graph injection, and performing model updates via Orthogonal Subspace Editing. This actionable framework paves the way for models that seamlessly fuse the precision of symbolic logic with the expressivity of parametric memory.

2606.16118 2026-06-16 cs.AI cs.CL cs.LO 新提交

Know Your Limits : On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning

了解你的局限:LLM在法律推理中作为求解器和自动形式化工具的忠实性

Olivia Peiyu Wang, Sanna Wong-Toropainen, Daneshvar Amrollahi, Ryan Bai, Tashvi Bansal, Arush Garg, Leilani H. Gilpin

发表机构 * UC Santa Cruz(加州大学圣克鲁兹分校) Univ. Helsinki(赫尔辛基大学) CodeX, Stanford(斯坦福大学CodeX中心) Stanford University(斯坦福大学) Canyon Crest Academy(峡谷峰学院) Monta Vista High School(蒙塔维斯塔高中) Los Altos High School(洛斯阿尔托斯高中)

AI总结 研究LLM在法律推理中是否忠实执行逻辑推理,发现LLM基于形式推理的高性能掩盖了范围清洗等不忠实模式,揭示基准准确性与逻辑忠实性之间的根本差距。

Comments 10 pages, submitted to COLM 2026 (under review, average score of 6.25 across 4 reviewers) and accepted by the AI4Law workshop at ICML. This is the version where we already addressed most of the reviews from the COLM reviewers

详情
AI中文摘要

大型语言模型(LLM)在推理任务上表现强劲,但这是否反映了忠实的逻辑推理还是启发式近似仍不清楚。我们在法律蕴含中通过比较三种范式——纯LLM分类、基于LLM的形式推理以及使用Z3 SMT求解器的基于求解器的形式推理——在重新标注的ContractNLI子集上对五个LLM进行了研究。我们的重新标注揭示了实用法律解释与严格形式蕴含之间存在系统性的、可测量的差距,其中相当大比例的法律上合理的推理在没有额外未声明假设的情况下缺乏形式基础。虽然引入形式结构提高了准确性,基于LLM的形式推理达到了最高的基准性能,但我们表明这种提升并不意味着忠实推理。我们识别出三种反复出现的失败模式:范围清洗(LLM报告与求解器不一致的分类而不执行底层形式推理,产生看似逻辑上合理但实际并非如此的结论)、隐式约束盲区(LLM忽略形式表示中存在的逻辑约束)以及程序合成失败(尽管有结构化提示,LLM仍生成错误的Z3代码)。关键的是,范围清洗在所有模型中持续存在,这引发了对基于LLM的形式推理作为符号执行代理的忠实性的严重担忧。这些结果揭示了基准准确性与逻辑忠实性之间的根本差距。

英文摘要

Large Language Models (LLMs) achieve strong performance on reasoning tasks, but whether this reflects faithful logical inference or heuristic approximation remains unclear. We study this question in legal entailment by comparing three paradigms, including pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver, on a re-annotated subset of ContractNLI across five LLMs. Our re-annotation reveals a systematic and measurable gap between pragmatic legal interpretation and strict formal entailment, where a substantial proportion of legally sound inferences are not formally grounded without additional unstated assumptions. While introducing formal structure improves accuracy, with LLM-based Formal Reasoning achieving the highest benchmark performance, we show that this gain does not imply faithful reasoning. We identify three recurring failure modes: scope laundering, where LLMs report solver-inconsistent classifications without executing the underlying formal reasoning, producing conclusions that appear logically grounded but are not; implicit constraint blindness, where LLMs overlook logical constraints present in formal representations; and program synthesis failures, where LLMs generate incorrect Z3 code despite structured prompting. Critically, scope laundering persists across all models, raising serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution. These results reveal a fundamental gap between benchmark accuracy and logical faithfulness.

2606.16509 2026-06-16 cs.AI 新提交

Model Graph Inductive Learning for Knowledge Graph Completion

模型图归纳学习用于知识图谱补全

Mohommad Esmaei Khani, Mahdieh Hasheminejad, Ali Taherkhani, Hossein Hajiabolhassan

发表机构 * Yazd University(亚兹德大学) Institute for Advanced Studies in Basic Sciences (IASBS)(基础科学高等研究所) Medizinische Universität Graz(格拉茨医科大学)

AI总结 提出模型图归纳学习(MGIL)框架,通过聚类实体构建模型图并应用GNN捕获全局结构,生成高质量初始嵌入,在归纳链接预测任务上取得最优或竞争性结果。

详情
AI中文摘要

知识图谱中的链接预测根本上依赖于实体和关系嵌入的质量。然而,大多数现有方法仅通过聚合每个实体的局部邻域来推导这些嵌入,忽略了知识图谱的全局结构。这种有限的视角阻止了模型捕获对于准确和可泛化的链接预测至关重要的高层结构模式。为了解决这些限制,我们引入了模型图归纳学习(MGIL),该框架通过基于实体传入和传出关系结构或实体类型的相似性对实体进行聚类来构建模型图。然后,在模型图上应用GNN以生成捕获知识图谱全局视图的嵌入。这些嵌入随后作为原始知识图谱的高质量初始特征,取代随机初始化,从而产生更稳定和更具表达力的表示。在标准和最近提出的归纳基准上的广泛实验表明,MGIL在归纳链接预测中实现了最先进或极具竞争力的性能,突显了其在不同图设置下的有效性。

英文摘要

Link prediction in knowledge graphs fundamentally depends on the quality of learned embeddings for entities and relations. However, most existing methods derive these embeddings by aggregating only the local neighborhood of each entity, neglecting the global structure of the knowledge graph. This limited view prevents models from capturing higher-level structural patterns that are essential for accurate and generalizable link prediction. To address these limitations, we introduce Model Graph Inductive Learning (\textbf{MGIL}), a framework that constructs a model graph by clustering entities based on the similarity of their incoming and outgoing relational structures or their entity types. A GNN is then applied to this model graph to produce embeddings that capture the global view of the knowledge graph. These embeddings subsequently serve as high-quality initial features %embeddings for the original knowledge graph, replacing random initialization and leading to more stable and expressive representations. Extensive experiments on standard and recently proposed inductive benchmarks demonstrate that MGIL achieves state-of-the-art or highly competitive performance in inductive link prediction, highlighting its effectiveness across diverse graph settings.

2606.16893 2026-06-16 cs.AI cs.CL cs.LO 新提交

Symbolic Informalization: Fluent, Productive, Multilingual

符号非形式化:流畅、高效、多语言

Aarne Ranta

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg(查尔姆斯理工大学与哥德堡大学计算机科学与工程系)

AI总结 提出符号非形式化方法,将形式数学可靠地转换为自然语言,基于Dedukti和Grammatical Framework的中间语言架构,实现多证明系统与多自然语言的流畅转换。

详情
AI中文摘要

符号非形式化能够将形式数学可靠地转换为自然语言。它有望使机器验证的内容在不损失精确性的情况下对人类可读。在传统证明系统使用中,符号非形式化将语法糖的有限机制推广为数学的普通语言。在由人工智能和自动形式化构建证明的场景中,符号非形式化可以解释具体构建了什么。本文概述了Informath项目,旨在展示符号非形式化如何以合理的开发工作量产生流畅的文本,并处理多种形式语言和自然语言。Informath基于中间语言架构,其中Dedukti作为不同证明系统(Agda、Lean、Rocq)之间的枢纽,而Grammatical Framework(GF)负责不同自然语言的语言正确性和变体。

英文摘要

Symbolic informalization enables a reliable conversion of formal mathematics to natural language. It has the potential to make machine-checked content human-readable without loss of precision. In a traditional proof system usage, symbolic informalization generalizes the limited mechanisms of syntactic sugar into the ordinary language of mathematics. In a setting where proofs are constructed by artificial intelligence and autoformalization, symbolic informalization can explain what precisely has been constructed. This paper outlines the project Informath, which aims to show how symbolic informalization can produce fluent text with a reasonable development effort and address multiple formal and natural languages. Informath is based on an interlingual architecture, where Dedukti works as a hub between different proof systems (Agda, Lean, Rocq) and Grammatical Framework (GF) takes care of linguistic correctness and variation in different natural languages.

2606.16944 2026-06-16 cs.AI cs.HC 新提交

A Causal Model of Theory of Mind in Conflict for Artificial Intelligence

冲突情境下心智理论的人工智能因果模型

Nikolos Gurney

发表机构 * Institute for Creative Technologies, University of Southern California(南加州大学创意技术研究所)

AI总结 提出结构因果模型,将心智理论视为由情境和主体条件激活的机制,通过三条因果路径决定何时进行心智化,提升AI社会推理的准确性和效率。

详情
AI中文摘要

心智理论(ToM)是将心理状态归因于他人并利用这些归因进行预测和推理的能力,被广泛认为是有效人机融合的关键。现有AI-ToM模型解决了如何心智化的问题,但基本未涉及何时心智化。核心问题是:在冲突中,何种情境和主体层面的条件下,ToM的参与在因果上是合理的?本文提出一个结构因果模型,形式化为有向无环图(DAG),将ToM视为由情境和主体条件激活的机制,而非始终开启的能力。模型指定了四个捕捉情境和主体条件的外生变量、五个内生中介变量,以及一个通过三条不同因果路径产生参与状态的机制性ToM节点:可处理性路径、推理深度路径和使能原因路径。主要结果是认知准确性,它将社会推理与行为策略解耦,并泛化到冲突之外的社会现象。该框架为AI系统提供了一种有原则的、资源理性的心智化决策程序,对效率、信任以及鲁棒的人工社会智能的发展具有意义。讨论了仿真验证、实证人机协作研究以及由冲突优化心智化引发的伦理问题。

英文摘要

Theory of mind (ToM), the capacity to ascribe mental states to others and use those ascriptions for prediction and inference, is widely assumed to be essential for effective human-machine integration. Existing AI-ToM models address \emph{how} to mentalize, but leave the question of when largely unaddressed. The central question is: under what situational and agent-level conditions is ToM engagement causally warranted in conflict? This paper presents a structural causal model formalized as a directed acyclic graph (DAG), treating ToM as a mechanism activated by situational and agent-level conditions rather than as an always-on capacity. The model specifies four exogenous variables capturing situational and agent-level conditions, five endogenous mediators, and a mechanistic ToM node producing engagement states through three distinct causal pathways: a tractability pathway, a reasoning-depth pathway, and an enabling-cause pathway. The primary outcome is epistemic accuracy, which decouples social reasoning from behavioral policy and generalizes across social phenomena beyond conflict. The framework gives AI systems a principled, resource-rational decision procedure for mentalizing, with implications for efficiency, trust, and the development of robust artificial social intelligence. Simulation validation, empirical human-machine teaming studies, and ethical considerations arising from conflict-optimized mentalizing are discussed.

2606.15246 2026-06-16 cs.LO cs.AI cs.DL 交叉投稿

Provenance-Enhanced Statements in Knowledge Graphs

知识图谱中增强来源的陈述

Fabio Vitali, Valentina Pasqual

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 提出DEC框架,通过认知模态逻辑将来源谓词解释为认知立场指示器,并分组为认知世界,实现基于来源的推理,避免将分歧视为不一致。

Comments 33 pages

详情
AI中文摘要

在当代知识图谱中,形式为“根据$X$,$φ$”的增强来源陈述无处不在,尤其是在图内容主要表示主张、解释和假设(\emph{capta})而非观察者独立事实(\emph{data})的领域。当前的来源模型可以记录谁说了什么,但通常将来源视为语义中性的,未充分说明归因陈述与事实承诺、彼此之间以及推理的关系。在本文中,我们引入DEC框架,该框架将来源谓词解释为认知立场指示器,并将来源同质的陈述集分组为\emph{认知世界}。借鉴认知模态逻辑(信念、知识和推测),DEC刻画了认知世界与一个特殊的事实核心(“现实”)之间的局部性、合理性和可控渗透,从而能够对归因内容进行有原则的推理,而不会将分歧视为不一致。我们为RDF数据集形式化了DEC解释,该解释对RDF 1.2语义是保守的,阐明了内涵性和同一性(包括超人悖论)的作用,并在常见的语义网表示(命名图、引用三元组/RDF-star和具体化)上说明了该方法。最后,我们描述了原型DEC推理器,它作为Fuseki数据集模块实现,支持受控事实化以及分歧和错觉的显式检测。

英文摘要

Provenance-enhanced statements of the form "according to $X$, $φ$" are pervasive in contemporary knowledge graphs, especially in domains where graph content primarily represents claims, interpretations, and hypotheses (\emph{capta}) rather than observer-independent facts (\emph{data}). Current provenance models can record who asserted what, but they typically treat provenance as semantically neutral, leaving underspecified how attributed claims relate to factual commitment, to one another, and to reasoning. In this paper we introduce DEC, a framework that interprets provenance predicates as indicators of epistemic stance and groups provenance-homogeneous sets of statements into \emph{cognitive worlds}. Drawing on cognitive modal logics (doxastic, epistemic, and conjectural), DEC characterizes locality, rationality, and controlled permeation between cognitive worlds and a distinguished factual core ("reality"), thereby enabling principled reasoning over attributed content without collapsing disagreements into inconsistencies. We formalize a DEC interpretation for RDF datasets that is conservative over RDF~1.2 semantics, clarify the role of intensionality and identity (including the Superman paradox), and illustrate the approach on common Semantic Web representations (named graphs, quoted triples/RDF-star, and reification). Finally, we describe our prototype DEC reasoner implemented as a Fuseki dataset module, supporting controlled factualisation and explicit detection of disagreements and delusions.

2606.15719 2026-06-16 cs.LO cs.AI math.LO 交叉投稿

The algebra of Krom logic programs

Krom逻辑程序的代数

Christian Antić

发表机构 * Vienna University of Technology(维也纳技术大学)

AI总结 本文研究Krom逻辑程序的代数结构,通过顺序组合赋予其幺半群结构,并扩展到多种半环,建立生成集、规范分解,与变换幺半群和有限自动机建立联系。

详情
AI中文摘要

本文研究了仅由事实和最多一个体原子的规则组成的Krom逻辑程序的代数结构。我们证明了顺序组合赋予Krom程序类一个自然的幺半群结构,并且该结构允许丰富的代数扩展,包括Krom半近环、Krom拟环、Krom-Conway半近环和Krom-Conway omega半近环。此外,我们建立了显式的生成集和规范分解,研究了相关的${}^ω$-算子,用图论术语刻画了Kleene星,并将有限Krom幺半群与变换幺半群和有限状态自动机关联起来。这些结果为逻辑编程、代数自动机理论和代数图论之间提供了新的联系。

英文摘要

This paper investigates the algebraic structure of Krom logic programs, consisting only of facts and rules with at most one body atom. We show that sequential composition endows the class of Krom programs with a natural monoid structure and that this structure admits rich algebraic extensions to Krom seminearrings, Krom quemirings, Krom-Conway seminearrings, and Krom-Conway omegaseminearrings. Furthermore, we establish explicit generating sets and canonical decompositions, study the associated ${}^ω$-operator, characterize the Kleene star in graph-theoretic terms, and relate finite Krom monoids to transformation monoids and finite-state automata. These results provide new connections between logic programming, algebraic automata theory, and algebraic graph theory.

2606.16010 2026-06-16 cs.IR cs.AI 交叉投稿

Theorem-Grounded Execution Ontologies for Interpretable Machine Reasoning

定理驱动的可执行本体用于可解释机器推理

Raghu Anantharangachar

发表机构 * Independent Researcher(独立研究者)

AI总结 提出TGEO框架,将推理建模为可执行状态转换过程,通过定理族识别、本体绑定、语义对象发现等步骤生成可解释推理图,实现可验证、可重放的AI推理。

详情
AI中文摘要

大型语言模型在数学、科学、编程和常识推理等任务上取得了令人印象深刻的性能。尽管取得了这些进展,但其推理过程在很大程度上仍然是潜在的,这使得它们难以解释、验证、重放、调试和跨领域迁移。现有方法如思维链、思维树、思维图以及工具增强推理暴露了中间推理产物,但通常缺乏明确的执行语义、形式化状态表示和可验证的推理结构。我们引入了定理驱动的可执行本体(TGEO),这是一个将推理建模为可执行状态转换过程而非生成令牌序列的框架。给定输入问题,TGEO识别相关的定理族,将问题绑定到领域本体,发现语义对象,实例化状态和操作符,构建谓词和契约,并合成一个可执行的推理图。生成的图提供了可解释、可重放和可审计的推理表示,其中每个状态转换、操作符应用和验证步骤都被显式表示。TGEO集成了五个架构组件:(1)定理驱动的推理先验,(2)可执行本体,(3)操作符中介的状态转换,(4)基于谓词和契约的执行验证,以及(5)架构审计和故障定位。我们在来自数学基准领域和精心策划的Golden Execution Suite的定理密集型推理任务上评估了TGEO。我们的研究结果证明了可执行推理表示对于可解释、可验证和可复现的AI推理系统的价值。

英文摘要

Large language models have achieved impressive performance on reasoning tasks spanning mathematics, science, programming, and commonsense inference. Despite these advances, their reasoning processes remain largely latent, making them difficult to interpret, verify, replay, debug, and transfer across domains. Existing approaches such as chain-of-thought, tree-of-thoughts, graph-of-thoughts, and tool-augmented reasoning expose intermediate reasoning artifacts but typically lack explicit execution semantics, formal state representations, and verifiable reasoning structures. We introduce Theorem-Grounded Execution Ontologies (TGEO), a framework that models reasoning as an executable state-transition process rather than a sequence of generated tokens. Given an input problem, TGEO identifies relevant theorem families, binds the problem to a domain ontology, discovers semantic objects, instantiates states and operators, constructs predicates and contracts, and synthesizes an executable reasoning graph. The resulting graph provides an interpretable, replayable, and auditable representation of reasoning in which every state transition, operator application, and validation step is explicitly represented. TGEO integrates five architectural components: (1) theorem-grounded reasoning priors, (2) executable ontologies, (3) operator-mediated state transitions, (4) predicate and contract-based execution validation, and (5) architectural auditing and failure localization. We evaluate TGEO on theorem-intensive reasoning tasks derived from mathematical benchmark domains and a curated Golden Execution Suite. Our findings demonstrate the value of executable reasoning representations for interpretable, verifiable, and reproducible AI reasoning systems.

2506.17104 2026-06-16 cs.AI cs.CL cs.LO 版本更新

Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving

迈向基于一阶逻辑定理证明的大语言模型高级数学推理

Chuxue Cao, Mengze Li, Juntao Dai, Jinluan Yang, Zijian Zhao, Shengyu Zhang, Weijie Shi, Chengzhong Liu, Sirui Han, Yike Guo

发表机构 * Hong Kong University of Science and Technology(香港科学与技术大学) Peking University(北京大学) Zhejiang University(浙江大学)

AI总结 针对大语言模型在多步一阶逻辑数学推理中的困难,提出DREAM方法,通过公理驱动策略多样化和子命题错误反馈提升推理多样性和正确性,在定理证明数据集上性能提升0.6%-6.4%。

Comments Accepted by EMNLP 25

详情
AI中文摘要

大语言模型(LLMs)在一阶逻辑(FOL)推理方面展现出有前景的能力,并在各个领域得到应用。然而,它们在涉及多步FOL推理的复杂数学推理中的有效性仍待研究。尽管LLMs在已有的数学推理基准上表现有竞争力,但它们在多步FOL任务上表现不佳,例如Deepseek-Prover-V2-7B在我们提出的定理证明数据集上的准确率仅为4.2%。这一问题源于对多样化证明策略的探索有限,以及早期推理错误可能破坏整个证明。为解决这些问题,我们提出DREAM,一种自适应解决方案,增强LLMs生成策略的多样性和合理性。DREAM包含公理驱动策略多样化机制以促进多样化的策略结果,以及子命题错误反馈以帮助LLMs反思和纠正其证明。我们的贡献包括:通过FOL定理证明在LLMs的数学推理方面取得开创性进展,引入一种新颖的推理阶段解决方案,将性能提升0.6%至6.4%,并提供包含447个数学定理的Lean 4格式数据集用于评估。

英文摘要

Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B's low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs' mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.

2507.22951 2026-06-16 cs.AI cs.LG 版本更新

Unifying Post-hoc Explanations of Knowledge Graph Completions

统一知识图谱补全的事后解释

Alessandro Lonardi, Samy Badreddine, Tarek R. Besold, Pablo Sanchez Martin

发表机构 * Sony AI, Barcelona, Spain(索尼人工智能,巴塞罗那,西班牙)

AI总结 针对知识图谱补全缺乏统一事后解释框架的问题,提出基于多目标优化的分类法,统一现有算法,改进评估协议,强调可解释性对用户查询的重要性。

Comments 22 pages, 8 figures, 4 tables

详情
AI中文摘要

知识图谱将信息组织为实体-关系-实体三元组,使机器学习模型能够预测可能缺失的三元组,这一任务称为知识图谱补全(KGC)。KGC的事后可解释性解决的是识别哪些三元组最影响机器学习模型预测的问题。目前,该领域缺乏形式化和一致的评估,阻碍了可重复性和跨研究比较。本文主张为KGC中的事后可解释性建立统一的分类法。首先,我们通过多目标优化提出事后解释的特征描述,统一了KGC中现有的事后可解释性算法及其产生的解释,平衡了解释有效性和简洁性。接着,我们通过说明性实验,基于流行指标(如平均倒数排名和Hits@k)检验了改进的评估协议。最后,我们强调可解释性作为解释解决最终用户有意义查询的能力的重要性。通过统一方法和讨论评估标准,本文为KGC可解释性中更可重复和更有影响力的研究提供了论据。

英文摘要

Knowledge Graphs organize information as entity-relation-entity triples, enabling machine learning models to predict plausible missing triples in a task known as Knowledge Graph Completion (KGC). Post-hoc explainability for KGC addresses the problem of identifying which triples most influence the predictions of machine learning models. Currently, the field lacks formalization and consistent evaluations, hindering reproducibility and cross-study comparisons. This paper argues for a unified taxonomy for post-hoc explainability in KGC. First, we propose a characterization of post-hoc explanations via multi-objective optimization that unifies existing post-hoc explainability algorithms in KGC and the explanations they produce, balancing explanation effectiveness and conciseness. Next, we examine improved evaluation protocols based on popular metrics, such as Mean Reciprocal Rank and Hits@k, through illustrative experiments. Finally, we stress the importance of interpretability as the ability of explanations to address queries meaningful to end users. By unifying methods and discussing evaluation standards, this work puts forward a case for more reproducible and impactful research in KGC explainability.

2512.09831 2026-06-16 cs.AI cs.LG cs.MA cs.SI 版本更新

Interpretation as Linear Transformation: A Cognitive-Geometric Model of Concepts and Meaning

解释作为线性变换:概念与意义的认知几何模型

Chainarong Amornbunchornvej

发表机构 * National Electronics and Computer Technology Center(国家电子与计算机技术中心)

AI总结 提出一个几何框架,通过线性映射和向量空间建模异构智能体间的概念传递、动机与影响,揭示误解与概念消亡的结构条件,并给出领导力的可达性解释。

Comments The revised draft w.r.t. reviewer comments. The code is at https://github.com/DarkEyes/Cognitive-Geometry

详情
AI中文摘要

本文发展了一个几何框架,用于建模认知异构智能体间的概念、动机和影响。每个智能体由一个个性化价值空间表示,这是一个编码智能体解释和评估意义的内在维度的向量空间。评价性概念被形式化为结构化向量(抽象存在),其传递由线性解释映射中介。抽象存在只有在避免这些映射的零空间时才能在通信中存活,从而为可理解性、误解和概念消亡提供了结构性标准。在该框架内,我展示了概念扭曲、动机漂移和相互理解的限制如何源于纯代数约束。一个核心结果——无零空间领导条件——将领导力刻画为表征可达性的属性,而非说服或权威。更广泛地,该模型解释了抽象存在在穿越不同认知几何时如何传播、变异或消失。该理论通过将意义保存建立在结构兼容性而非共享信息或理性之上,统一了概念空间、社会认识论和AI价值对齐的见解。我认为,这种认知几何视角澄清了人类和人工系统中影响的认知边界,并为分析异构智能体间的概念动力学提供了通用基础。

英文摘要

This paper develops a geometric framework for modeling concepts, motivation, and influence across cognitively heterogeneous agents. Each agent is represented by a personalized value space, a vector space encoding the internal dimensions through which the agent interprets and evaluates meaning. Evaluative concepts are formalized as structured vectors, abstract beings, whose transmission is mediated by linear interpretation maps. An abstract being survives communication only if it avoids the null spaces of these maps, yielding a structural criterion for intelligibility, miscommunication, and concept death. Within this framework, I show how conceptual distortion, motivational drift, and the limits of mutual understanding arise from purely algebraic constraints. A central result, the No-Null-Space Leadership Condition, characterizes leadership as a property of representational reachability rather than persuasion or authority. More broadly, the model explains how abstract beings can propagate, mutate, or disappear as they traverse diverse cognitive geometries. The account unifies insights from conceptual spaces, social epistemology, and AI value alignment by grounding meaning preservation in structural compatibility rather than shared information or rationality. I argue that this cognitive-geometric perspective clarifies the epistemic boundaries of influence in both human and artificial systems, and offers a general foundation for analyzing conceptual dynamics across heterogeneous agents.

2602.02028 2026-06-16 cs.AI 版本更新

Edit Knowledge, Not Just Facts via Multi-Step Reasoning over Background Stories

编辑知识,而不仅仅是事实:基于背景故事的多步推理

Ya Gao, Kalle Kujanpää, Pekka Marttinen, Harri Valpola, Alexander Ilin

发表机构 * Aalto University(阿莱大学) Amazon.com(亚马逊公司) System 2 AI(系统2人工智能)

AI总结 提出将知识更新视为推理问题,通过背景故事引入新知识、自生成多跳问题训练和知识蒸馏,使模型在多步推理中灵活运用新信息。

Comments Under review

详情
AI中文摘要

使人工智能系统,特别是大型语言模型,能够更新知识并在推理过程中灵活应用,仍然是一个核心挑战。现有的知识编辑方法强调原子事实,改进了事实回忆,但往往未能将更新后的信息整合到可在不同上下文中使用的连贯框架中。在这项工作中,我们认为知识更新从根本上是一个推理问题,而不是记忆问题。因此,模型应该在新的信息有助于解决任务的情况下进行训练,结合已有知识,并通过多步推理进行练习。基于这一见解,我们提出了一种基于三个原则的训练策略。首先,新知识作为连贯的背景故事引入,将新事实情境化并解释它们与现有知识的关系。其次,使用自生成的多跳问题训练模型,这些问题需要涉及新信息的多步推理。第三,使用知识蒸馏进行训练,迫使学生模型内化教师的推理行为,而无需访问新信息。实验表明,使用此策略训练的模型在推理过程中有效利用新获得的知识,并在需要结合多个新事实的具有挑战性的问题上取得了显著性能。

英文摘要

Enabling artificial intelligence systems, particularly large language models, to update knowledge and flexibly apply it during reasoning remains a central challenge. Existing knowledge editing approaches emphasize atomic facts, improving factual recall but often failing to integrate updated information into a coherent framework usable across contexts. In this work, we argue that knowledge update is fundamentally a reasoning problem rather than a memorization problem. Consequently, a model should be trained in situations where the new information is instrumental to solving a task, combined with pre-existing knowledge, and exercised through multi-step reasoning. Based on this insight, we propose a training strategy based on three principles. First, new knowledge is introduced as a coherent background story that contextualizes novel facts and explains their relation to existing knowledge. Second, models are trained using self-generated multi-hop questions that require multi-step reasoning involving the new information. Third, training is done using knowledge distillation, forcing a student model to internalize the teacher's reasoning behavior without access to the novel information. Experiments show that models trained with this strategy effectively leverage newly acquired knowledge during reasoning and achieve remarkable performance on challenging questions that require combining multiple new facts.

2602.21066 2026-06-16 cs.AI 版本更新

The Initial Exploration Problem in Knowledge Graph Exploration

知识图谱探索中的初始探索问题

Claire McNamara, Lucy Hederman, Declan O'Sullivan

发表机构 * School of Computer Science and Statistics(计算机科学与统计学学院)

AI总结 本文识别并理论化了知识图谱探索中的初始探索问题(IEP),分析了三个相互依赖的障碍:范围不确定性、本体不透明性和查询能力不足,并提出了设计空间中的结构性缺口。

Comments 13 pages

详情
AI中文摘要

知识图谱(KGs)能够跨领域集成和表示复杂信息,但其语义丰富性和结构复杂性为缺乏语义网技术专业知识的外行用户设置了巨大障碍。当遇到不熟悉的KG时,这些用户面临一个明确的定向挑战:他们不知道哪些问题是可能的,知识是如何组织的,或者如何开始探索。本文识别并理论化了这一现象,称为初始探索问题(IEP)。借鉴信息行为和人机交互的理论,包括ASK、探索性搜索、信息觅食和认知负荷理论,我们开发了IEP的概念框架,其特征是三个相互依赖的障碍:范围不确定性、本体不透明性和查询能力不足。我们认为这些障碍在首次接触时汇聚,将IEP与预设现有起点或信息目标的相关概念区分开来。在交互原语层面分析KG探索界面,我们指出许多系统依赖于在首次接触时不成立的认知假设。这揭示了设计空间中的结构性缺口:缺乏用于范围揭示的交互原语,即无需用户制定查询或解释本体结构即可传达KG内容的机制。通过阐述IEP,本文为评估KG界面和设计支持初始探索的入口点脚手架提供了理论视角。

英文摘要

Knowledge Graphs (KGs) enable the integration and representation of complex information across domains, but their semantic richness and structural complexity create substantial barriers for lay users without expertise in semantic web technologies. When encountering an unfamiliar KG, such users face a distinct orientation challenge: they do not know what questions are possible, how the knowledge is structured, or how to begin exploration. This paper identifies and theorises this phenomenon as the Initial Exploration Problem (IEP). Drawing on theories from information behaviour and human-computer interaction, including ASK, exploratory search, information foraging, and cognitive load theory, we develop a conceptual framing of the IEP characterised by three interdependent barriers: scope uncertainty, ontology opacity, and query incapacity. We argue that these barriers converge at the moment of first contact, distinguishing the IEP from related concepts that presuppose an existing starting point or information goal. Analysing KG exploration interfaces at the level of interaction primitives, we suggest that many systems rely on epistemic assumptions that do not hold at first contact. This reveals a structural gap in the design space: the absence of interaction primitives for scope revelation, mechanisms that communicate what a KG contains without requiring users to formulate queries or interpret ontological structures. In articulating the IEP, this paper provides a theoretical lens for evaluating KG interfaces and for designing entry-point scaffolding that supports initial exploration.

2604.03496 2026-06-16 cs.AI cs.IR cs.LG 版本更新

Beyond Predefined Schemas: TRACE-KG for Context-Enriched Knowledge Graph Generation

超越预定义模式:TRACE-KG 用于上下文增强的知识图谱生成

Mohammad Sadeq Abolhasani, Yang Ba, Yixuan He, Rong Pan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出 TRACE-KG 框架,通过数据驱动模式联合构建上下文增强的知识图谱和归纳模式,无需预定义本体,解决长技术文档中图谱碎片化问题。

Comments Accepted at Graph Foundation Models at ICML 2026

详情
AI中文摘要

知识图谱生成通常依赖于预定义本体或免模式提取。本体驱动的流水线强制执行一致的类型,但需要昂贵的模式设计和维护,而免模式方法通常产生碎片化的图谱,全局组织薄弱,尤其是在信息密集、依赖上下文的冗长技术文档中。我们提出 \textbf{TRACE-KG}(\textbf{T}ext-d\textbf{R}iven schem\textbf{A} for \textbf{C}ontext-\textbf{E}nriched \textbf{K}nowledge \textbf{G}raphs),一个无需假设预定义本体即可联合构建上下文增强的知识图谱和归纳模式的框架。TRACE-KG 通过结构化限定符捕获条件关系,并使用数据驱动模式组织实体和关系,该模式作为可重用的语义支架,同时保持对源证据的完全可追溯性。实验表明,TRACE-KG 生成结构连贯、可追溯的知识图谱,并为本体驱动和免模式构建流水线提供了实用的替代方案。

英文摘要

Knowledge graph generation typically relies either on predefined ontologies or on schema-free extraction. Ontology-driven pipelines enforce consistent typing but require costly schema design and maintenance, whereas schema-free methods often produce fragmented graphs with weak global organization, especially in long technical documents with dense, context-dependent information. We propose \textbf{TRACE-KG} (\textbf{T}ext-d\textbf{R}iven schem\textbf{A} for \textbf{C}ontext-\textbf{E}nriched \textbf{K}nowledge \textbf{G}raphs), a framework that jointly constructs a context-enriched knowledge graph and an induced schema without assuming a predefined ontology. TRACE-KG captures conditional relations through structured qualifiers and organizes entities and relations using a data-driven schema that serves as a reusable semantic scaffold while preserving full traceability to the source evidence. Experiments show that TRACE-KG produces structurally coherent, traceable knowledge graphs and offers a practical alternative to both ontology-driven and schema-free construction pipelines.

3. 多智能体与博弈 28 篇

2606.14923 2026-06-16 cs.AI cs.CY cs.MA 新提交

Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems

AI智能体之间的信任:衡量形成、破裂与恢复,及其对多智能体系统治理的启示

Yujiao Chen

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出基于代价验证的行为信任度量,通过合作生存游戏研究六个前沿模型快照的信任形成、破裂与恢复,发现信任形成可减少验证,恢复慢于形成,且集群失败延长怀疑,建议校准而非最大怀疑作为治理核心。

详情
AI中文摘要

随着语言模型智能体越来越多地以团队形式工作,每个智能体必须决定对其队友的信任程度。然而,我们缺乏衡量AI智能体之间信任的标准方法。我们提出一种基于代价验证的行为度量。在一个合作生存游戏中,检查队友的工作会消耗资源,而信任错误的答案可能是致命的。相对于同一模型的无记忆版本,减少验证提供了信任的可观察度量。利用这一框架,我们研究了六个前沿模型快照的信任形成、破裂与恢复。当与始终可靠的队友配对时,四个快照(Claude Opus 4.6、Claude Sonnet 4.6、GPT-5.1和Gemini 3.1 Pro)将验证减少了约60-85%,而两个较小的快照几乎没有或完全没有这种调整。失败会逆转这种折扣,但模型在响应方式上存在差异。一些模型将重新审查集中在肇事者身上,而另一些则对整个团队变得更加谨慎。恢复比形成慢,并且集群失败使怀疑持续的时间远长于相同数量的分散失败。这些差异具有实际后果。形成信任的模型验证更少、决策更快,并在我们的环境中获得更高的收益。相比之下,持续过度验证与犹豫不决而非安全性相关。我们的结果表明,信任倾向可以在部署前测量,并建议校准而非最大怀疑应成为多智能体AI系统治理的核心关注点。

英文摘要

As language-model agents increasingly work in teams, each agent must decide how much to trust its teammates. Yet we lack a standard way to measure trust between AI agents. We propose a behavioral measure based on costly verification. In a cooperative survival game, checking a teammate's work consumes resources, while trusting a wrong answer can be fatal. Relative to a memoryless version of the same model, reduced verification provides an observable measure of trust. Using this framework, we study trust formation, breakage, and recovery across six frontier model snapshots. When paired with a consistently reliable teammate, four snapshots (Claude Opus 4.6, Claude Sonnet 4.6, GPT-5.1, and Gemini 3.1 Pro) reduce verification by roughly 60-85%, whereas two smaller snapshots show little or no such adjustment. Failures reverse this discount, but models differ in how they respond. Some concentrate renewed scrutiny on the culprit, while others become more cautious toward the entire team. Recovery is slower than formation, and clustered failures sustain suspicion far longer than the same number of failures spread apart. These differences have practical consequences. Models that form trust verify less, decide more quickly, and achieve higher payoffs in our environment. By contrast, persistent over-verification is associated with indecision rather than safety. Our results show that trust dispositions can be measured before deployment and suggest that calibration, rather than maximal suspicion, should be the central concern in the governance of multi-agent AI systems.

2606.15503 2026-06-16 cs.AI cs.CY cs.MA cs.NE 新提交

Synthetic Counteradaptation: A Principle of Human-AI Co-evolution

合成反适应:人机共同进化的一个原理

Ivar Frisch, Jackie Kay, Philip Moreira Tomei

发表机构 * Spectral Circuits Research Independent Researcher(独立研究者) AI Objectives Institute(AI Objectives研究所)

AI总结 提出合成反适应概念,描述人机通过相互适应策略和行为实现共同进化,并分析围棋、混合动机社交和地缘政治模拟等案例。

Comments 15 pages, 1 figure. Published in Antikythera (MIT Press), February 2025

详情
Journal ref
Antikythera Journal, MIT Press, February 2025
AI中文摘要

在本文中,我们引入了合成反适应的概念,这是一个人类与AI系统通过相互适应对方的策略和行为而共同进化的过程。当AI系统发展出新的策略或社会协议,促使人类提取见解并调整自身行为作为回应时,就会发生合成反适应,从而导致新的智能体交互动态的出现。为了说明这些动态,我们分析了来自不同背景的案例,包括围棋游戏、混合动机社交互动和地缘政治模拟。通过探索这些案例,我们展示了合成反适应如何为理解多智能体环境中人机交互的递归和共同进化性质提供一个框架。

英文摘要

In this paper, we introduce the concept of synthetic counteradaptation, a process where human and AI systems co-evolve by adapting to each other's strategies and behaviors. Synthetic counteradaptation occurs when AI systems develop novel strategies or social protocols, prompting humans to extract insights and adapt their own behaviors in response, leading to the emergence of new agent interaction dynamics. To illustrate these dynamics, we analyze examples from various contexts, including the game of Go, mixed-motive social interactions, and geopolitical simulations. By exploring these cases, we demonstrate how synthetic counteradaptation provides a framework for understanding the recursive and co-evolutionary nature of human-AI interactions in multi-agent environments.

2606.15684 2026-06-16 cs.AI 新提交

Multi-agent Framework for Time-Sensitive Complementary Collaboration in Minecraft

Minecraft中时间敏感互补协作的多智能体框架

Juheon Yi, Jinglu Wang, Xiaoyi Zhang, Yan Lu

发表机构 * Microsoft Research Asia(微软亚洲研究院)

AI总结 提出TickingCollabBench基准和TickingCollab框架,用于评估LLM在动态、实时、异构智能体强制协作任务中的表现,发现LLM因延迟和协调困难而频繁失败。

详情
AI中文摘要

我们提出了TickingCollabBench,这是一个基于Minecraft的多智能体基准,用于一类新颖的时间敏感互补协作任务。我们的基准反映了现实世界协作的四个核心特征:智能体异构性、强制协作、动态环境以及具有失败风险的严格实时约束。为此,我们开发了TickingCollab框架,该框架支持生成多样化的动态环境,并抽象了Minecraft的原始API,以便通过声明式YAML任务规范来组合这些事件。在此基础上,我们设计了一个可行性感知的自动基准生成流水线,其中LLM起草结构多样的任务配置,可行性验证器使用近似约束过滤掉无效配置。评估表明,语言延迟以及在部分可观测性和智能体异构性下协调的固有困难,导致LLM在动态环境中频繁失败,并且远不及全局知识oracle的表现。

英文摘要

We present TickingCollabBench, a Minecraft-based multi-agent benchmark for a novel class of time-sensitive complementary collaboration tasks. Our benchmark reflects four core characteristics of real-world collaboration: agent heterogeneity, mandatory collaboration, dynamic environments, and strict real-time constraints with failure risks. To enable this, we develop the TickingCollab framework, which supports the generation of diverse dynamic environments and abstracts Minecraft's primitive APIs to enable declarative YAML task specifications for composing these events. Building on this, we design a feasibility-aware automated benchmark generation pipeline, where an LLM drafts structurally diverse task configurations and feasibility verifier filters out invalid ones using approximate constraints. Evaluations demonstrate that lang latency and inherent difficulty of coordinating under partial observability and agent heterogeneity cause LLMs to frequently fail under dynamic environments and fall significantly short of a global-knowledge oracle.

2606.16328 2026-06-16 cs.AI 新提交

AdaSTORM: Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration

AdaSTORM: 通过自适应时空多智能体协作扩展动态图上的LLM推理

Bing Hao, Ruijie Wang, Haodong Qian, Yunlong Chu, Yuhang Liu, Yumeng Lin, Minglai Shao, Jianxin Li

发表机构 * Tianjin University, China(天津大学,中国) Beihang University, China(北航大学,中国)

AI总结 提出AdaSTORM框架,通过自适应分区和时空解耦的多智能体协作,将动态图推理扩展到千节点规模,准确率超90%,无需外部工具。

详情
AI中文摘要

大型语言模型(LLM)在动态图推理中展现出显著潜力,但面临扩展瓶颈:当前模型只能处理数十个节点的图,受限于指数级推理开销和有限的上下文窗口。尽管多智能体系统(MAS)提供了集体推理和拓扑感知编排的能力——这些能力天然适用于图结构任务,但其在动态图上的应用仍未探索。本文提出通过自适应时空多智能体协作扩展动态图上的LLM推理(AdaSTORM),这是一个将大规模动态图推理重构为两个阶段的框架:(i)自适应分区,将大规模动态图划分为与模型推理能力匹配的子区域,同时最小化推理成本;(ii)协作推理,将图分区拓扑与时空解耦的多智能体架构对齐。AdaSTORM是首个专为动态图推理设计的多智能体框架。大量实验表明,AdaSTORM成功突破了扩展瓶颈,将推理扩展到千节点图,在多个大规模动态图设置中准确率超过90%,且无需外部工具,显著优于七个竞争基线。此外,它在现有基准上达到了最先进的准确率,并稳健地泛化到真实世界数据集。源代码可在 https://github.com/irisorchid107/AdaSTORM/ 获取。

英文摘要

Large Language Models (LLMs) demonstrate remarkable potential in dynamic graph reasoning, but suffer from a scaling bottleneck: current models can only handle graphs with tens of nodes, constrained by exponential reasoning overhead and finite context windows. While multi-agent systems (MAS) offer collective reasoning and topology-aware orchestration, capabilities naturally suited for graph-structured tasks, their application to dynamic graphs remains unexplored. This paper presents Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration (AdaSTORM), a framework that reformulates large-scale dynamic graph reasoning into two stages: (i) Adaptive Partitioning, partitioning large-scale dynamic graphs into subregions that match the model's reasoning capacity while minimizing inference cost; and (ii) Collaborative Reasoning, aligning graph partition topologies with a spatio-temporal decoupled multi-agent architecture. AdaSTORM is the first multi-agent framework tailored for dynamic graph reasoning. Extensive experiments show that AdaSTORM successfully breaks through the scaling bottleneck, scaling reasoning to thousand-node graphs with over 90% accuracy across several large-scale dynamic graph settings without external tools, significantly outperforms seven competitive baselines. Furthermore, it achieves state-of-the-art accuracy on existing benchmarks and generalizes robustly to real-world datasets. The source code is available at: https://github.com/irisorchid107/AdaSTORM/.

2606.16330 2026-06-16 cs.AI 新提交

Phase-Aware Guidance Injection for Recurrent MAPPO in Assembly-Line Disruption Recovery

装配线中断恢复中面向阶段的引导注入用于循环MAPPO

Xin Huang, Yongcai Wang, Fengyi Zhang, Zhikun Tao, Yunjun Han, Naiqi Wu

发表机构 * School of Information, Renmin University of China(中国人民大学信息学院) State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所多模态人工智能系统国家重点实验室) The Information Science Academy, China Electronics Technology Group Corporation(中国电子科技集团公司信息科学研究院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) The Institute of Systems Engineering, Macau University of Science and Technology(澳门科技大学系统工程研究所)

AI总结 提出面向阶段的引导注入框架,在评估时通过logit级动作偏置增强训练好的循环MAPPO调度策略,利用规则、回放和在线LLM引导减少异常恢复时间并保持准时交付。

Comments 6 pages, 4 figures, accepted by the 2026 IEEE International Conference on Automation Science and Engineering (CASE 2026)

详情
AI中文摘要

工业装配线的中断恢复需要在机器故障、工人缺勤和紧急订单下及时做出决策。现有方法要么依赖僵化的手工恢复逻辑,要么学习自适应策略,但无法在决策时轻易利用异构的外部恢复知识来减少异常恢复时间(ART)并保持准时交付(OTD)。为解决这一差距,我们提出了一种面向阶段的引导注入框架,通过在评估期间引入logit级动作偏置来增强训练好的循环MAPPO(RMAPPO)调度策略。该框架为基于规则、基于回放和基于在线LLM的引导提供了统一的决策时接口,同时仅在异常和恢复阶段激活干预。在自定义的AssemblyLineEnv上的实验表明,高质量的规则引导带来最强的性能提升,基于回放的引导在不完美可用性下平滑退化,而在线LLM引导仍能提供有用的中间改进。这些结果表明,决策时引导注入可以在不重新设计actor的情况下利用异构恢复提示。

英文摘要

Disruption recovery in industrial assembly lines requires timely decisions under machine faults, worker absence, and emergency orders. Existing methods either rely on rigid handcrafted recovery logic or learn adaptive policies that do not readily exploit heterogeneous external recovery knowledge at decision time to reduce abnormal recovery time (ART) and preserve on-time delivery (OTD). To address this gap, we propose a phase-aware guidance injection framework that augments a trained recurrent MAPPO (RMAPPO) scheduling policy through logit-level action bias during evaluation. The framework provides a unified decision-time interface for rule-based, replay-based, and online LLM-based guidance, while activating intervention only during abnormal and recovery phases. Experiments on a custom AssemblyLineEnv show that high-quality rule guidance yields the strongest gains, replay-based guidance degrades smoothly under imperfect availability, and online LLM guidance still provides useful intermediate improvements. These results show that decision-time guidance injection can exploit heterogeneous recovery hints without redesigning the actor.

2606.16478 2026-06-16 cs.AI 新提交

Tensor-Coord: Algebraic Decomposition of Joint Plan Tensors for Conflict-Free Multi-Agent LLM Planning

Tensor-Coord:用于无冲突多智能体LLM规划的联合计划张量代数分解

Mudit Rastogi

发表机构 * University of Michigan(密歇根大学)

AI总结 提出Tensor-Coord框架,将多智能体联合计划表示为三阶张量,通过CP和Tucker分解识别协调结构,计算协调复杂度并定位冲突,实现无冲突规划。

详情
AI中文摘要

大型语言模型(LLM)在多智能体规划中仍然受限,因为独立生成的计划可能导致协调失败,如空间碰撞、资源争用和时间死锁。我们引入Tensor-Coord,一个多线性代数框架,将N个智能体的联合计划表示为三阶张量 \(T \in R^{N \times H \times A}\),维度为智能体、时间步和动作。使用典型多面体(CP)和Tucker分解来识别潜在协调结构。最小ε近似CP秩R*定义了一个可计算的协调复杂度度量,\(CC(Pi)=(R*-N)/N\)。我们证明R*=N是计划独立性的充分必要条件。残差 \(E=T-T_{R*}\) 定义了智能体对、时间步和动作上的冲突分数,无需领域特定规则即可定位失败。Tucker因子提供可解释的智能体角色、时间阶段和动作聚类,这些被转换为自然语言约束,用于迭代LLM重规划。在多机器人配送任务上的实验,包括简单(2个智能体,5x5网格)、中等(3个智能体,5x5网格)和困难(4个智能体,5x5网格)设置,显示在2个智能体情况下100%收敛到无冲突计划,平均迭代1.4次;3个智能体情况下80%收敛,平均迭代3.2次;4个智能体情况下60%收敛,平均迭代4.0次。CP秩近似线性增长,\(R*(N) = 3.9N + 0.5\),支持其作为协调复杂度预测器的使用。

英文摘要

Large language models (LLMs) remain limited in multi-agent planning because independently generated plans can create coordination failures such as spatial collisions, resource contention, and temporal deadlocks. We introduce Tensor-Coord, a multilinear algebra framework that represents the joint plan of N agents as a third-order tensor \(T \in R^{N \times H \times A}\) over agents, timesteps, and actions. Canonical Polyadic (CP) and Tucker decompositions are used to identify latent coordination structure. The minimal epsilon-approximate CP rank R* defines a computable coordination complexity measure, with \(CC(Pi)=(R*-N)/N\). We prove that R*=N is necessary and sufficient for plan independence. The residual \(E=T-T_{R*}\) defines a conflict score over agent pairs, timesteps, and actions, localizing failures without domain-specific rules. Tucker factors provide interpretable agent roles, temporal phases, and action clusters that are converted into natural language constraints for iterative LLM replanning. Experiments on multi-robot delivery tasks across Easy (2 agents, 5x5 grid), Medium (3 agents, 5x5 grid), and Hard (4 agents, 5x5 grid) settings show convergence to conflict-free plans in 100% of 2-agent cases within 1.4 iterations on average, 80% of 3-agent cases within 3.2 iterations, and 60% of 4-agent cases within 4.0 iterations. CP rank scaled approximately linearly as \(R*(N) = 3.9N + 0.5\), supporting its use as a predictor of coordination complexity.

2606.11692 2026-06-16 cs.CY cs.AI cs.MA cs.SI 交叉投稿

Evaluation of Alternative-Based Information Systems for Deliberative Polling using an Agentic Simulator

基于智能体模拟器的审议式投票中替代性信息系统评估

Rwaida Alssadi, Khulud Alawaji, Balaji Kasula, Muntaser Syed, Badria Alfurhood, Markus Zanker, Marius Silaghi

发表机构 * Florida Institute of Technology(佛罗里达理工学院) Princess Nourah Bint Abdulrahman(纳厄赫·阿卜杜勒拉赫曼公主) Free University of Bozen-Bozano(博兹诺-博萨诺自由大学)

AI总结 提出基于LLM的智能体双极论证模拟器(ABAS),通过覆盖率和语料多样性评估审议式投票中推荐机制的有效性,并测试了对抗性投票攻击下的鲁棒性。

详情
AI中文摘要

审议式投票旨在通过让股东在投票前接触广泛论点来改善集体决策。然而,确保每个选民遇到理由空间的代表性样本(覆盖问题)仍然是一个开放的挑战,特别是在大规模和对抗性或策略性动机的选民群体中。本文介绍了一种使用基于LLM的智能体双极论证模拟器(ABAS)评估解决方案的方法,该模拟器基于一个将投票形式化为六元组<Jend, Jopp, Ratt, Renh, VA, VR>(包含支持与反对理由、攻击与增强关系、股东权重和关系权重)的框架。ABAS模拟N个自主股东智能体,每个智能体根据[-1,1]内的期望分布分配潜在意见,依次投票、选择或撰写理由,并可选择提交论证图链接。该模拟器实现推荐机制,根据可观察的支持质量对现有理由进行排序。它通过覆盖率(即每个股东收到的K条推荐中代表语料库理由标签集的比例)来评估机制的成功,作为NP难子集理由问题的一个解决方案。报告的实验描述了创造力率(pown)、推荐大小(K)、论证密度(plinks)和人口规模(N)如何影响覆盖率和语料库多样性。在一个经过身份验证的选民群体中(Sybil攻击不可能,只有关系图可被操纵),我们通过协调策略性投票攻击对评分进行压力测试:标签洪泛攻击导致覆盖率崩溃,而通过反向PageRank规则的作者计数关系加权比均匀权重显著更好地抵抗了洪泛攻击。

英文摘要

Deliberative polling promises to improve collective decision-making by exposing shareholders to a broad range of arguments before they vote. Yet ensuring that every voter encounters a representative sample of the reason space, the coverage problem, remains an open challenge, particularly at scale and in adversarial or strategically motivated electorates. This paper introduces a way of evaluating solutions using the LLM-based Agentic Bipolar Argumentation Simulator, grounded in a framework which formalises a poll as a six-tuple <Jend, Jopp, Ratt, Renh, VA, VR> of endorsing and opposing justifications, attack and enhance relations, and shareholder- and relation-weights. ABAS simulates N autonomous shareholder agents, each assigned a latent opinion according to desired distributions in [-1, 1], who sequentially vote, choose or author justifications, and optionally submit argumentation-graph links. The simulator implements recommendations that rank existing justifications by their observable endorsement mass. It evaluates the mechanism's success by coverage, namely the fraction of the corpus reason-tag set represented in the K recommendations presented to each shareholder, as a solution to the NP-hard Subsuming Justification Problem. Reported experiments characterise how creativity rate (pown), recommendation size (K), argumentation density (plinks), and population size (N) affect coverage and corpus diversity. In an authenticated electorate where Sybil attacks are impossible and only the relation graph is gameable, we stress-test the scoring with coordinated strategic voting attacks: a tag-flood attack collapses coverage, while author-count relation weighting through a reversed-PageRank rule resists the flood markedly better than uniform weights.

2606.14710 2026-06-16 cs.DC cs.AI 交叉投稿

Poster: EdgeCitadel -- Hybrid NATS-MQTT Orchestration for Edge Multi-Agent Systems

海报:EdgeCitadel——面向边缘多智能体系统的混合NATS-MQTT编排

Zhonghao Zhan, Yefan Zhang, Hamed Haddadi

发表机构 * Imperial College London(帝国理工学院伦敦分校) Independent Researcher(独立研究员)

AI总结 针对边缘AI智能体协调依赖云传输或中央中继的问题,提出基于NATS 2.10服务器与内置MQTT适配器的混合编排平台EdgeCitadel,实现异构智能体连接、持久化存储、直接委托和被动聚合,并在ARM64、x64和Android设备上验证。

详情
AI中文摘要

边缘驻留的AI智能体越来越多地跨越家庭服务器、物联网中心、笔记本电脑和手机,但它们的协调栈仍然假设云风格传输或中央中继。我们提出了EdgeCitadel,一个基于单一NATS 2.10服务器并内置MQTT适配器的边缘多智能体编排平台。该设计结合了用于异构智能体的MQTT连接、用于后端服务的JetStream支持的持久化和重放、通过共享主题命名空间的直接对等委托,以及一个不在传输路径上但能可视化并存储流量的被动聚合器。我们的海报重点展示了从MQTT中继原型(在物联网通信中常见)到当前混合架构的迁移,并演示了一个跨ARM64、x64和Android客户端的工作跨设备测试平台。

英文摘要

Edge-resident AI agents increasingly span home servers, IoT hubs, laptops, and phones, yet their coordination stacks still assume cloud-style transports or a central relay. We present EdgeCitadel, an edge multi-agent orchestration platform built around a single NATS 2.10 server with the built-in MQTT adapter. The design combines MQTT connectivity for heterogeneous agents, JetStream-backed persistence and replay for backend services, direct peer delegation over a shared subject namespace, and a passive aggregator that visualizes and stores traffic without sitting on the delivery path. Our poster highlights the migration from MQTT relay prototypes (common in IoT communication) to the current hybrid architecture and demonstrates a working cross-device testbed spanning ARM64, x64, and Android clients.

2606.14756 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

Divide-and-Denoise: A Game-Theoretic Method for Fairly Composing Diffusion Models

分而除噪:一种公平组合扩散模型的博弈论方法

Abhi Gupta, Polina Barabanshchikova, Vikas Garg, Samuel Kaski, Tommi Jaakkola

发表机构 * Massachusetts Institute of Technology(麻省理工学院) University of Washington(华盛顿大学) University of Cambridge(剑桥大学)

AI总结 提出Divide-and-Denoise方法,通过公平分配博弈协调多个预训练扩散模型,在采样时划分区域并引导各模型去噪,解决模型主导或冲突问题,在条件图像生成中优于基线。

Comments Accepted as spotlight at ICML 2026

详情
AI中文摘要

大量预训练扩散模型为组合提供了机会。然而,组合多个模型存在一个模型主导或模型间相互冲突的风险。在此,我们提出Divide-and-Denoise,一种在采样过程中协调多个预训练扩散模型的方法。类似于管理专业劳动力,我们的方法在模型间创建了公平且高效的劳动分工。我们方法的核心是分配的概念,它定义了每个模型对含噪样本每个区域的责任。在每个时间步,我们通过以下步骤去噪:(i) 通过求解公平分配博弈更新分配,其中我们在公平约束下将样本划分为最大化总效用的区域,以及(ii) 使模型与这种分配对齐,引导每个模型在其分配区域内去噪。这导致了一个新的复合去噪过程,该过程与划分过程同步演化。我们在条件图像生成上评估了Divide-and-Denoise。在包括GenEval基准在内的多个质量指标上,我们的方法优于基线,并解决了常见失败情况,包括缺失对象和属性不匹配。实验表明,Divide-and-Denoise利用了每个模型的专业知识,同时不忽视任何其他模型。

英文摘要

The abundance of pre-trained diffusion models provides an opportunity for composition. Combining several models, however, runs the risk of one model dominating or models disagreeing with each other. Here, we propose Divide-and-Denoise, a method for coordinating multiple pre-trained diffusion models during sampling. Much like managing a specialized workforce, our method creates a fair but efficient division of labor across models. Central to our method is the notion of an allocation which defines the responsibility of each model to every region of the noisy sample. At every timestep, we then denoise by (i) updating the allocation by solving a fair division game, where we divide the sample into regions that maximize total utility under fairness constraints, and (ii) aligning the models with this allocation, where we guide each model to denoise within its assigned region. This leads to a new composite denoising process that evolves in tandem with a division process. We evaluate Divide-and-Denoise on conditional image generation. Across several quality metrics, including the GenEval benchmark, our method outperforms baselines and resolves common failures including missing objects and mismatched attributes. Experiments show that Divide-and-Denoise utilizes each model's expertise without neglecting any other model.

2606.14790 2026-06-16 cs.PL cs.AI 交叉投稿

XFlow: An Executable Protocol Programming System for Reliable Multi-Agent Workflows

XFlow: 一个用于可靠多智能体工作流的可执行协议编程系统

Hanqi Li, Jing Peng, Zijian Wang, Lu Chen, Kai Yu

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(X-LANCE实验室,计算机科学学院,上海交通大学,上海,中国) Jiangsu Key Lab of Language Computing, Suzhou, China(江苏省语言计算重点实验室,苏州,中国) Suzhou Laboratory, Suzhou, China(苏州实验室,苏州,中国)

AI总结 提出XFlow可执行协议编程系统及其领域特定语言XPF,通过将工作流承诺从提示移至可检查、可执行的协议结构,并利用生命周期管理的符号中介智能体输出,提升多智能体工作流的可靠性。

详情
AI中文摘要

基于LLM的多智能体系统越来越多地协调规划、推理、工具使用和人类交互,但其可靠性仍然有限。这一局限的核心来源是未充分指定的提示-框架边界。当前系统缺乏原则性的方式来决定哪些工作流承诺应保留在提示中,哪些应成为框架结构。我们提出\textbf{XFlow},一个用于可靠多智能体工作流的可执行协议编程系统,以及\textbf{XPF}(XFlow协议格式),其领域特定协议编程语言。XFlow位于纯提示编排和标记式工作流描述之间的中间位置。XPF保持可读性,作为文字协议,但被编译并作为程序执行。其设计将非正式语义工作保留在智能体内部,同时将选定的承诺转移到可检查、可维护和可执行的框架结构中。在运行时,XFlow通过生命周期管理的符号(具有验证和提交状态的类型化状态单元)来阶段化不确定性。智能体输出在被共享状态之前被中介,而不是通过提示、转录或隐式记忆传播。我们的实验涵盖约束交互、长上下文推理和智能体软件工程。它们表明,XFlow通过使约束、证据处理和处理需求显式且可执行,提高了可靠性。

英文摘要

LLM-based multi-agent systems increasingly coordinate planning, reasoning, tool use, and human interaction, yet their reliability remains limited. A central source of this limitation is the underspecified prompt--harness boundary. Current systems lack a principled way to decide which workflow commitments should remain in prompts and which should become harness structure. We present \textbf{XFlow}, an executable protocol programming system for reliable multi-agent workflows, and \textbf{XPF} (XFlow Protocol Format), its domain-specific protocol programming language. XFlow occupies a middle position between prompt-only orchestration and markup-like workflow descriptions. XPF remains readable as a literate protocol, but it is compiled and executed as a program. Its design keeps informal semantic work inside actors while moving selected commitments into harness structure that can be checked, preserved, and enforced. At runtime, XFlow stages uncertainty through lifecycle-governed symbols, which are typed state cells with validation and commit states. Actor outputs are mediated before they become shared state, instead of spreading through prompts, transcripts, or implicit memory. Our experiments cover Constrained Interaction, Long-Context Reasoning, and Agentic Software Engineering. They show that XFlow improves reliability by making constraints, evidence handling, and process requirements explicit and enforceable.

2606.14805 2026-06-16 cs.SE cs.AI 交叉投稿

Knowledge-Based Zero-Replay Debugging of Multi-Agent LLM Traces

基于知识的无重放多智能体LLM轨迹调试

Dong Ho Kang, Hyeonjeong Cha, Daein Weon

发表机构 * ustechlab.com(ustechlab)

AI总结 提出一种知识图谱驱动的无重放预测方法,通过结构化事件知识图谱和轻量级预测器,在不执行重放的情况下定位高影响事件,将轨迹定位召回率从0.73提升至0.93。

Comments 21 pages, 1 figure, 6 tables. Submitted to Knowledge-Based Systems

详情
AI中文摘要

多智能体大语言模型(LLM)系统的可靠运行依赖于对长执行轨迹的调试,其中少数因果决定性事件被埋没在消息、路由、内存写入和工具调用的非结构化日志中。标准工具是反事实重放(回退、编辑并重新运行轨迹以衡量每个事件的影响),但其成本随候选事件数量线性增长,使得大规模穷举重放不可行。我们将轨迹调试视为基于知识的决策支持问题。每条轨迹被编译成一个结构化的知识图谱,涵盖路由、内存、工具使用、不确定性和潜在证据,并通过校准的预测器决定稀缺的重放预算应分配到哪里。我们不提出新的重放预言机;我们提出一种无需支付重放成本即可预测其结果的方法。我们形式化了无重放反事实效应预测:给定固定预算下的轨迹,在未执行任何重放前预测预言机会将哪些事件标记为高影响。BranchPoint-Latent 是一个轻量级预测器,基于知识图谱的可观测、结构、不确定性和潜在特征。通过针对37个轨迹族系的确定性重放预言机进行校准,单个学习排序梯度提升预测器在零预言机重放成本下,将留出族系的每轨迹定位(Branch Recall@5)从0.73提升至0.93。我们并非声称普遍优势,而是刻画了何时廉价图中心性足够、何时需要学习到的证据。最终成果是一个可审计、成本高效的AI可靠性调试决策支持系统,明确位于成本-精度前沿,并提供可复现的工件。

英文摘要

Reliable operation of multi-agent large language model (LLM) systems depends on debugging long execution traces, where the few causally decisive events are buried in unstructured logs of messages, routes, memory writes, and tool calls. The standard tool is counterfactual replay (rewind, edit, and re-run the trajectory to measure each event's effect), but its cost grows linearly with the number of candidate events, making exhaustive replay infeasible at scale. We frame trace debugging as a knowledge-based decision-support problem. Each trace is compiled into a structured event knowledge graph over routing, memory, tool-use, uncertainty, and latent evidence, and a calibrated predictor decides where a scarce replay budget should be spent. We do not propose a new replay oracle; we propose a method to predict its results without paying the replay cost. We formulate zero-replay counterfactual-effect prediction: given a trace under a fixed budget, predict which events the oracle would mark high-effect before any replay is performed. BranchPoint-Latent is a lightweight predictor over observable, structural, uncertainty, and latent features of the knowledge graph. Calibrated against a deterministic replay oracle across 37 trace families, a single learning-to-rank gradient-boosted predictor raises per-trace localization (Branch Recall@5) from 0.73 to 0.93 on held-out families at zero oracle-replay cost. Rather than claiming universal dominance, we characterize when cheap graph centrality suffices and when learned evidence is necessary. The result is an auditable, cost-efficient decision-support system for AI-reliability debugging, positioned explicitly on the cost-accuracy frontier with reproducible artifacts.

2606.15024 2026-06-16 cs.MA cs.AI cs.SY eess.SY 交叉投稿

Resilient Consensus in Agentic AI

智能体AI中的弹性共识

Sribalaji C. Anand, George J. Pappas

发表机构 * KTH(瑞典皇家理工学院) University of Pennsylvania(宾夕法尼亚大学)

AI总结 研究LLM智能体在多智能体系统中的共识问题,发现经典弹性共识理论在LLM智能体中失效,但结合经典滤波器可改善一致性。

详情
AI中文摘要

大型语言模型(LLM)智能体越来越多地部署在多智能体系统中,它们必须协调并达成共享决策。我们探究了为确定性智能体开发的经典弹性共识理论是否适用于可能表现对抗性的LLM智能体。将LLM协议视为拜占庭共识博弈,我们在完全和一般通信图上进行受控实验。我们发现,经过提示的LLM智能体无法达成原则上可实现的共识:即使在经典理论保证存在收敛算法的设置中,共识也可能失败,并且这种失败在不同温度和视野下持续存在。同时,用经典弹性共识滤波器包装智能体可改善一致性。滤波的益处取决于底层拓扑已提供的鲁棒性。我们的结果表明,经典弹性共识理论是智能体AI安全的有用视角。

英文摘要

Large language model (LLM) agents are increasingly deployed in multi-agent systems where they must coordinate and agree on shared decisions. We ask whether classical resilient consensus theory, developed for deterministic agents, transfers to LLM agents that may behave adversarially. Framing LLM agreement as a Byzantine consensus game, we run controlled experiments on complete and general communication graphs. We find that prompted LLM agents fail to reach agreement that is achievable in principle: consensus can fail even in settings where classical theory guarantees that a convergent algorithm exists, and this failure persists across temperatures and horizons. At the same time, wrapping the agents with classical resilient consensus filters improves agreement. The benefit of filtering depends on how much robustness the underlying topology already provides. Our results suggest that classical resilient consensus theory is a useful lens for the safety of agentic AI.

2606.15376 2026-06-16 cs.DC cs.AI cs.MA 交叉投稿

CoAgent: Concurrency Control for Multi-Agent Systems

CoAgent: 多智能体系统的并发控制

Hongtao Lyu, Dingyan Zhang, Mingyu Wu, Xingda Wei, Haibo Chen

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对多智能体LLM系统并发访问共享状态时的冲突问题,提出MTPO协议,通过智能体自身判断冲突并修复计划,实现可串行化执行,在保持近串行正确率的同时提升速度。

Comments 14 pages, 7 figures. Submitted to ATC 2026

详情
AI中文摘要

多智能体LLM系统——编码智能体、运维智能体、文档智能体——现在通常并行运行多个智能体,针对同一个git树、Kubernetes集群或文档。一旦其中两个智能体修改共享状态,它们就进入了经典并发控制研究了几十年的领域,但经典机制不适合LLM智能体。单个智能体事务跨越数分钟的推理,读集广泛且不透明而非静态可推断,智能体操作的实时状态既不允许分叉也不允许缓冲,因此写操作在执行时立即生效。锁会阻塞长时间的推理间隔;OCC的终止-重试会在每次冲突时丢弃数分钟的工作。\n本文基于经典事务缺乏的能力构建并发控制:每个智能体内的LLM可以判断冲突写入是否使其计划无效,并精确修复依赖于该写入的操作。因此控制变为建议性的:运行时通知,智能体修复。我们的协议MTPO(单调轨迹预排序)在启动时固定一个序列化顺序,为每次读取提供按顺序过滤的值,并原地推测性地应用写入;单向通知要求受影响的读取者重新判断并修补其计划,同时框架通过每个工具预先注册的saga式逆操作机械地撤销和重新排序错位的写入。在静止时,运行按预定顺序可串行化。我们将MTPO实现为CoAgent,一种工具调用中间件,其特权ToolSmith在线增长具有声明足迹和可撤销的工具。在十个有冲突的工作负载上,CoAgent在1.4倍加速和近串行令牌成本下保持5%以内的串行正确性,而2PL和OCC几乎放弃了所有并发增益;在纯bash目标系统上,它在线增长了一个25工具库,并将任务通过率从45/71提升到63/71,时间和成本分别为0.80倍和0.86倍。

英文摘要

Multi-agent LLM systems -- coding agents, devops agents, document agents -- now routinely run several agents in parallel against the same git tree, Kubernetes cluster, or document. As soon as two of them mutate shared state, they enter the regime classical concurrency control has studied for decades, but classical mechanisms fit LLM agents poorly. A single agent transaction spans minutes of inference, read sets are broad and opaque rather than statically inferable, and the live state agents act on admits neither fork nor buffer, so writes take effect the moment they execute. Locks block long inference intervals; OCC abort-and-retry discards minutes of work on every conflict. This paper builds concurrency control on a capability classical transactions lack: the LLM inside each agent can judge whether a conflicting write invalidates its plan, and can repair exactly the operations that depended on it. Control therefore turns advisory: the runtime informs, the agent repairs. Our protocol, MTPO (Monotonic Trajectory Pre-Order), fixes a serialization order at launch, serves each read the order-filtered value, and applies writes speculatively in place; a one-way notification asks an affected reader to re-judge and patch its plan, while the framework mechanically undoes and reorders misplaced writes through the saga-style inverse each tool registers in advance. At quiescence the run is serializable in the pre-decided order. We realize MTPO as CoAgent, toolcall middleware whose privileged ToolSmith grows footprint-declared, undoable tools online. On ten contended workloads, CoAgent stays within 5\% of serial correctness at a $1.4\times$ speedup and near-serial token cost, where 2PL and OCC surrender nearly all concurrency gains; on a bash-only target system, it grows a 25-tool library online and lifts the task pass rate from 45/71 to 63/71 at $0.80\times$ the time and $0.86\times$ the cost.

2606.15931 2026-06-16 cs.MA cs.AI 交叉投稿

DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts

DeepRoot: 一个基于知识图谱协调的多智能体系统,用于历史医学文本的治疗推理

Zijian Carl Ma, Sean J. Wang, Sijbren Kramer, Li Erran Li

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 提出DeepRoot多智能体系统,通过联合构建和利用验证知识图谱,将接地与推理分离并组合,从历史医学文本中恢复药物-疾病治疗关系,显著优于基线LLM和工具调用LLM。

详情
Journal ref
ICML 2026 GenBio; ACM CAIS 2026 Workshop AI Agents for Discovery in the Wild
AI中文摘要

历史医学档案和传统药物对药物发现具有巨大潜力,并且仍然是当前药物开发的主要来源。然而,前本体论的散文和特殊的分类法阻碍了数据的标准化和医学现代化,使其无法用于当前的生物医学流程。此外,现有的LLM智能体系统,无论是工具调用、检索增强还是智能体深度研究,都无法将此类文本转化为可验证的药物发现线索。我们通过DeepRoot填补了这一空白,这是一个多智能体LLM系统,它联合构建并利用一个验证知识图谱,表明接地和推理——这两个经常被混淆的概念——是系统可以组合用于治疗推理的可分离轴。应用于《神农本草经》,DeepRoot在R@20上恢复了21个保留化合物-疾病治疗对中的10个(47.6%,而原始语料库LLM为4.8%,随机约为2.4%),并且在推理质量上,在LLM作为评判的审计中优于基线LLM和直接通过工具调用访问DeepRoot自身查询的相同API的LLM。使用工具的LLM在87%的声明上产生幻觉证据,而DeepRoot为7-10%。仅图推理的幻觉率为0%,但在推理连贯性上排名最低;DeepRoot KG+LLM是唯一在两个轴上都获胜的条件,为系统挖掘和重新利用历史医学知识指明了一条道路。

英文摘要

Historical medical archives and traditional medicines hold immense potential for drug discovery and remain a primary source for current drug development. However, pre-ontological prose and idiosyncratic taxonomies prevent the standardization and medical modernization of the data for use in current biomedical pipelines. Furthermore, no existing LLM agent system, whether tool-calling, retrieval-augmented, or agentic deep-research, can convert such text into verifiable drug-discovery leads at scale. We close this gap with DeepRoot, a multi-agent LLM system that jointly builds and utilizes a verified knowledge graph, showing that grounding and reasoning -- often conflated -- are separable axes the system can compose for therapeutic reasoning. Applied to the Shen Nong Ben Cao Jing, DeepRoot recovers $10$ of $21$ held-out compound-disease treatment pairs at R@$20$ ($47.6\%$ vs $4.8\%$ for a raw corpus LLM and $\sim\!2.4\%$ random) and dominates an LLM-as-judge audit for reasoning quality over baseline LLMs and LLMs with direct tool-call access to the same APIs DeepRoot itself queries. Tool-using LLMs hallucinate evidence on $87\%$ of claims, versus 7-10% for DeepRoot. Graph-only inference hallucinates $0\%$ but ranks lowest on reasoning coherence; DeepRoot KG+LLM is the only condition to win on both axes, pointing toward a route for systematic mining and repurposing of historical medical knowledge.

2606.16326 2026-06-16 cs.GT cs.AI q-fin.RM 交叉投稿

Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

自主AI代理的抗博弈保险合约:策略证明的通行费机制设计

Hao-Hsuan Chen

发表机构 * Hao-Hsuan Chen(何浩轩)

AI总结 本文扩展了时间一致精算运行时的框架,使运营商策略化,刻画了自主AI代理保险合约的五种攻击空间,并证明了精算运行时的抗博弈性,通过新合约条款实现激励兼容。

Comments 29 pages. Companion to arXiv:2605.26508 (Paper A, foundations) and arXiv:2605.25632 (Paper B, empirical)

详情
AI中文摘要

论文A定义了一个时间一致的精算运行时,该运行时根据合约固定的安全默认值对每个产生副作用的行动定价,并针对储备预算门控执行。它将运营商视为被动。本文使运营商策略化。我们刻画了自主AI代理保险合约的五种攻击空间,并证明了精算运行时何时具有抗博弈性。两种攻击面——通行费后的安全默认选择以及边界内的行动分割——通过论文A的最小权限和无分割条款得以关闭。其余三种需要新的合约条款。首先,公共控制聚合防止跨边界重新路由将通行费降低到应用于总暴露的边界潜力以下。其次,接口故障(如无效JSON)是合约相关事件,而非安全胜利:将其视为零通行费安全默认值可能奖励不可靠的模型,而升级费用则逆转了激励。我们通过来自配套实证论文的跨模型轨迹验证了这一接口合规定理。第三,一个带有分量最小惩罚计划的模型身份菜单使得部署模型的真实报告成为弱占优策略。然后,我们将这些条款与论文A的运行时保证组合,以获得在五种攻击空间上的联合激励兼容性。最后,一个双参数保费族在真实均衡下满足了运营商个体理性和弱预算平衡。结果是为自主代理副作用的精算控制提供了一个激励兼容层。

英文摘要

Paper A defines a time-consistent actuarial runtime that prices each side-effect-bearing action against a contractually fixed safe default and gates execution against a reserve budget. It treats the operator as passive. This paper makes the operator strategic. We characterise a five-attack space for autonomous AI-agent insurance contracts and prove when the actuarial runtime is gaming-resistant. Two attack surfaces -- post-toll safe-default selection and within-boundary action splitting -- are closed by Paper A's minimal-authority and no-splitting clauses. The remaining three require new contract clauses. First, common-control aggregation prevents cross-boundary re-routing from reducing toll below the boundary potential applied to total exposure. Second, interface failures such as invalid JSON are contract-relevant events, not safety wins: treating them as zero-toll safe defaults can reward unreliable models, while escalation fees reverse the incentive. We validate this interface-compliance theorem on committed cross-model traces from the companion empirical paper. Third, a model-identity menu with a componentwise-minimum penalty schedule makes truthful reporting of the deployed model weakly dominant. We then compose these clauses with Paper A's runtime guarantees to obtain joint incentive compatibility over the five-attack space. Finally, a two-parameter premium family discharges operator individual rationality and weak budget balance at the truthful equilibrium. The result is an incentive-compatibility layer for actuarial control of autonomous-agent side effects.

2606.16428 2026-06-16 cs.CL cs.AI cs.HC 交叉投稿

LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching

LectūraAgents:面向自适应个性化AI辅助学习与具身教学的多智能体框架

Jaward Sesay, Yue Yu, Siwei Dong, Yemin Shi, Guangyao Chen, Börje F. Karlsson

发表机构 * Beijing Institute of Technology(北京理工大学) Peking University(北京大学) Cornell University(康奈尔大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 提出LectūraAgents多智能体框架,通过层次化架构和自适应具身教学机制(如手势、高亮等)实现端到端个性化学习,并设计教学动作-语音对齐算法提升连贯性,在多个课程级别上优于现有方法。

详情
AI中文摘要

有效的个性化AI辅助学习需要系统不仅能够生成准确的、针对学习者的教育材料,还能动态调整其教学方式以适应不同学习者。然而,现有的教育智能体主要关注讲座内容自动化和模拟,往往缺乏针对个体学习者的多模态和具身教学方法的建模。为此,我们提出LectūraAgents——一个多智能体框架,通过端到端的自适应具身教学实现个性化学习。其核心模拟了教授-学生关系,其中ProfessorAgent领导一个由专业下属智能体组成的协作团队,通过研究、规划、审查和具身交付适应学习者需求的讲座内容。该框架有三个主要贡献:(1)用于端到端个性化学习的层次化多智能体架构;(2)自适应具身教学机制,其中ProfessorAgent在教学环境中对内容执行可见且具有教学动机的教学动作(例如手写、高亮、下划线等);(3)教学动作-语音对齐(TASA)算法,该算法采用基于显著性的启发式和时序语义分割,生成与学习者档案对齐的连贯教学动作序列。我们在高中、本科和研究生级别的多样化课程上,使用基于样本特定量规的分析评估LectūraAgents;生成的讲座材料和教学动作由专家教育者评估和验证。实验结果显示,在讲座内容质量、具身教学质量、评估和个性化方面,LectūraAgents持续优于现有方法,使其成为大规模个性化学习的教学基础扎实的框架。

英文摘要

Effective personalized AI-assisted learning demands systems that can not only generate accurate learner-specific educational materials, but also dynamically adapt their instruction to diverse learners. However, existing educational agents have primarily focused on lecture content automation and simulations, which often fall short of modelling multimodal and embodied instructional methods tailored for the individual learner. To this end, we propose LectūraAgents - a multi-agent framework that enables personalized learning through end-to-end adaptive embodied teaching. At its core, LectūraAgents mirrors a professor-student relationship, in which a ProfessorAgent leads a collaborative team of specialized subordinate agents through research, planning, review, and embodied delivery of lecture contents that adapt to a learner's needs. The framework offers three main contributions: (1) a hierarchical multi-agent architecture for end-to-end personalized learning; (2) an adaptive embodied teaching mechanism, wherein the ProfessorAgent executes visible and pedagogically motivated teaching actions (e.g., handwrite, highlight, underline, etc.) over contents in a teaching environment; and (3) a Teaching Action-Speech Alignment (TASA) algorithm that employs salience-based heuristics and temporal semantic segmentation to generate coherent teaching action sequences aligned with learner profiles. We evaluate LectūraAgents on diverse courses at high school, undergraduate, and graduate levels using sample-specific rubric-based analysis; with generated lecture materials and teaching actions assessed and validated by expert educators. Experimental results show consistent gains in lecture content quality, embodied teaching quality, assessment, and personalization over existing approaches, positioning LectūraAgents as a pedagogically well-grounded framework for personalized learning at scale.

2509.21862 2026-06-16 cs.AI cs.MA cs.SI econ.GN q-fin.EC 版本更新

Shachi: A Modular, Controllable Framework for LLM-Based Agent-Based Modeling of Emergent Collective Behavior

Shachi: 一种用于基于LLM的涌现集体行为建模的模块化、可控框架

So Kuroki, Yingtao Tian, Kou Misaki, Takashi Ikegami, Takuya Akiba, Yujin Tang

发表机构 * Sakana AI, Japan(日本Sakana AI) The University of Tokyo, Japan(日本东京大学)

AI总结 提出Shachi框架,将智能体认知分解为配置、记忆和工具等独立可控组件,通过扰动实验研究微观认知特征如何影响宏观群体动态,并在10任务基准和关税冲击案例中验证其有效性。

Comments Accepted to ALIFE 2026

详情
AI中文摘要

个体LLM驱动的智能体之间的交互如何产生集体行为是人工生命中的一个核心问题,然而对这些涌现动态的受控研究因缺乏用于系统实验的规范化模拟框架而受到阻碍。为解决这一问题,我们引入了Shachi,一种规范化的方法论和模块化框架,它将智能体的认知分解为核心组件:用于内在身份的配置、用于上下文连续性的记忆以及用于扩展能力的工具,所有这些都由LLM推理引擎协调。这种分解将每个认知组件视为独立可控的变量,从而能够进行扰动研究,追踪微观认知特征如何传播到群体层面的动态。我们研究了跨越三个集体复杂性层次的10任务基准中的行为模式。Shachi支持跨环境转换的记忆迁移,产生依赖于历史的行为转变,并允许智能体同时栖息于多个环境,揭示了单环境研究中不可见的跨环境干扰。此外,在一个真实的美国关税冲击案例研究中,具有独立控制认知组件的局部交互智能体产生了与观察到的真实世界结果方向一致的宏观市场动态。我们的工作为基于LLM的ABM提供了一个严谨的开源模拟框架,旨在促进对交互人工智能体涌现集体行为的累积性科学研究。

英文摘要

How collective behaviors emerge from the interactions of individual LLM-driven agents is a central question in artificial life, yet controlled study of these emergent dynamics has been hindered by the lack of a principled simulation framework for systematic experimentation. To address this, we introduce Shachi, a principled methodology and modular framework that decomposes an agent's cognition into core components: Configuration for intrinsic identity, Memory for contextual continuity, and Tools for extended capabilities, all orchestrated by an LLM reasoning engine. This decomposition treats each cognitive component as an independently controllable variable, enabling perturbation studies that trace how micro-level cognitive traits propagate into population-level dynamics. We investigate behavioral patterns across a 10-task benchmark spanning three levels of collective complexity. Shachi enables memory transfer across environment transitions, producing history-dependent behavioral shifts, and allows agents to simultaneously inhabit multiple environments, revealing cross-environment interference invisible in single-environment studies. Furthermore, in a real-world U.S. tariff shock case study, locally interacting agents with individually controlled cognitive components produce macro-level market dynamics directionally consistent with observed real-world outcomes. Our work provides a rigorous, open-source simulation framework for LLM-based ABM, aimed at fostering cumulative scientific inquiry into the emergent collective behaviors of interacting artificial agents.

2601.05746 2026-06-16 cs.AI 版本更新

DynaDebate: Breaking Homogeneity in Multi-Agent Debate with Dynamic Path Generation

DynaDebate: 通过动态路径生成打破多智能体辩论中的同质性

Zhenghao Li, Zhi Zheng, Wei Chen, Jielun Zhao, Yong Chen, Tong Xu, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(认知智能国家重点实验室,中国科学技术大学) North Automatic Control Technology Institute(北自动控制技术研究所) Shenzhen Institute for Advanced Study, UESTC(深圳先进研究 institute, 电子科技大学)

AI总结 提出DynaDebate框架,通过动态路径生成、过程中心辩论和触发式验证机制,解决多智能体辩论中推理路径同质化问题,提升辩论效果。

Comments 19pages,7 figures

详情
AI中文摘要

近年来,基于大语言模型的多智能体系统(MAS)发展迅速,在协作决策和复杂问题求解方面表现出色。研究人员进一步探索了多智能体辩论(MAD)框架,通过多个智能体之间的信息交换和辩论来增强MAS的推理和协作能力。然而,现有方法通常依赖无引导的初始化,导致智能体采用相同的推理路径,从而产生相同的错误。因此,智能体之间的有效辩论受到阻碍,最终结果常常退化为简单的多数投票。为解决上述问题,我们引入了动态多智能体辩论(DynaDebate),通过三个关键机制增强多智能体辩论的有效性:(1)动态路径生成与分配,使用专门的路径生成智能体生成多样且逻辑合理的解决方案路径,并具有自适应冗余;(2)过程中心辩论,将焦点从表面结果投票转移到严格的逐步逻辑批评,以确保过程正确性;(3)基于触发的验证智能体,在出现分歧时激活,并使用外部工具客观解决死锁。实验表明,DynaDebate在大多数基准测试中取得了优越或极具竞争力的性能。

英文摘要

Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. Researchers have further investigated Multi-Agent Debate (MAD) frameworks, which enhance the reasoning and collaboration capabilities of MAS through information exchange and debate among multiple agents. However, existing approaches often rely on unguided initialization, causing agents to adopt identical reasoning paths that lead to the same errors. As a result, effective debate among agents is hindered, and the final outcome frequently degenerates into simple majority voting. To solve the above problem, we introduce Dynamic Multi-Agent Debate (DynaDebate), which enhances the effectiveness of multi-agent debate through three key mechanisms: (1) Dynamic Path Generation and Allocation, which employs a dedicated Path Generation Agent to generate diverse and logical solution paths with adaptive redundancy; (2) Process-Centric Debate, which shifts the focus from surface-level outcome voting to rigorous step-by-step logic critique to ensure process correctness; (3) A Trigger-Based Verification Agent, which is activated upon disagreement and uses external tools to objectively resolve deadlocks. Experiments show that DynaDebate achieves superior or highly competitive performance across the majority of benchmarks\footnote{The code is at https://github.com/nwpuLee2021/brianstorm.}.

2601.21714 2026-06-16 cs.AI 版本更新

E-mem: Multi-agent based Episodic Context Reconstruction for LLM Agent Memory

E-mem:基于多智能体的事件上下文重建用于LLM智能体记忆

Kaixiang Wang, Yidan Lin, Jiong Lou, Zhaojiacheng Zhou, Bunyod Suvonov, Jie Li

发表机构 * National University of Singapore(新加坡国立大学) Tsinghua University(清华大学)

AI总结 E-mem通过多智能体协同重建事件上下文,提升LLM在长时序任务中的推理能力,实现54%的F1分数,优于现有方法7.75%,并降低70%的token成本。

Comments This paper has been accepted by ICML 2026. If you find our project helpful, please consider giving it a star: https://github.com/dog-last/E-mem

详情
AI中文摘要

大型语言模型(LLM)智能体向系统2推理发展,需要在长时空中保持严格的逻辑完整性。然而,主流的记忆预处理方法因破坏性去上下文化而不足。本文提出E-mem框架,从记忆预处理转向事件上下文重建。受生物酶体启发,E-mem采用异构分层架构,多个助手智能体维护未压缩的记忆上下文,而中央主智能体负责全局规划。与被动检索不同,我们的机制使助手能在激活段内本地推理,提取上下文感知的证据后再聚合。在LoCoMo基准测试中,E-mem实现了超过54%的F1分数,优于现有最佳方法GAM 7.75%,同时降低token成本超过70%。

英文摘要

The evolution of Large Language Model (LLM) agents towards System~2 reasoning, characterized by deliberative, high-precision problem-solving, requires maintaining rigorous logical integrity over extended horizons. However, prevalent memory preprocessing paradigms suffer from destructive de-contextualization. By compressing complex sequential dependencies into pre-defined structures (e.g., embeddings or graphs), these methods sever the contextual integrity essential for deep reasoning. To address this, we propose E-mem, a framework shifting from Memory Preprocessing to Episodic Context Reconstruction. Inspired by biological engrams, E-mem employs a heterogeneous hierarchical architecture where multiple assistant agents maintain uncompressed memory contexts, while a central master agent orchestrates global planning. Unlike passive retrieval, our mechanism empowers assistants to locally reason within activated segments, extracting context-aware evidence before aggregation. Evaluations on the LoCoMo benchmark demonstrate that E-mem achieves over 54\% F1, surpassing the state-of-the-art GAM by 7.75\%, while reducing token cost by over 70\%.

2604.02863 2026-06-16 cs.AI 版本更新

EMS: Multi-Agent Voting via Efficient Majority-then-Stopping

EMS: 通过高效多数决后停止的多智能体投票

Yiqing Liu, Hantao Yao, Wu Liu, Yongdong Zhang

发表机构 * GitHub

AI总结 提出EMS方法,通过可靠性感知调度和自适应增量投票,在保持多数投票准确性的同时,平均减少35%的智能体调用和44%的令牌消耗。

详情
AI中文摘要

多数投票是将多智能体响应聚合为最终决策的标准方法。然而,传统方法通常要求所有智能体在聚合开始前完成推理,导致大量计算开销,因为一旦达成多数共识,许多响应就变得冗余。在这项工作中,我们将高效的多智能体投票形式化为一个可靠性感知的智能体调度问题,并提出高效多数决后停止(EMS)以提高推理效率。EMS首先通过检索每个智能体在语义相似查询上的历史共识证据,估计其任务条件可靠性排序(TCRO),然后按可靠性降序调用智能体。接下来,自适应增量投票(AIV)在当前领先答案无法被剩余智能体的任何可能投票推翻时终止过程,并返回该答案。最后,可靠性历史更新(RHU)仅根据被调用智能体与最终决策的共识来更新它们。在五个基准上的广泛评估表明,EMS在保持多数投票准确性的同时,平均将调用的智能体数量减少了35%,令牌消耗减少了44%。代码可在以下网址获取:https://this https URL。

英文摘要

Majority voting is the standard for aggregating multi-agent responses into a final decision. However, traditional methods typically require all agents to complete their reasoning before aggregation begins, leading to significant computational overhead, as many responses become redundant once a majority consensus is achieved. In this work, we formulate efficient multi-agent voting as a reliability-aware agent scheduling problem and propose Efficient Majority-then-Stopping (EMS) to improve reasoning efficiency. EMS first estimates a Task-Conditioned Reliability Ordering (TCRO) for each agent by retrieving its historical consensus evidence on semantically similar queries, and then invoking agents in descending reliability order. Next, Adaptive Incremental Voting (AIV) terminates the process once the current leading answer cannot be overturned by any possible votes from the remaining agents, and returns this answer. Finally, Reliability History Updating (RHU) updates only the invoked agents according to their consensus with the final decision. Extensive evaluations across five benchmarks show that EMS preserves the accuracy of Majority Voting while reducing the average number of invoked agents by 35% and token consumption by 44%, respectively. The code is available at https://github.com/fuyu66/EMS.

2606.01365 2026-06-16 cs.AI 版本更新

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

多智能体LLM系统中浪费计算资源的早期诊断:基于故障感知的可观测性

Xianyou Li, Weiran Yan, Yichao Wu, Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种故障感知的可观测性框架,通过在线轨迹信号诊断多智能体LLM系统中的浪费计算,并在GAIA验证集上评估,揭示不同故障机制及其与资源消耗的关系。

详情
AI中文摘要

使用工具的多智能体大语言模型(LLM)系统在产生答案之前,通过模型令牌、工具调用、重试和代码执行来消耗计算资源。当运行失败时,最终答案评估揭示了终点,但通常无法揭示轨迹停止可恢复进展的时间点。本文引入了一个故障感知的可观测性框架,用于诊断多智能体LLM轨迹中的浪费计算。该框架将重复出现的故障模式映射到在线轨迹信号,包括工具可靠性、执行恢复、编排循环、证据可用性、信息变化和预算压力。我们在一个三智能体问答系统中实例化该框架,并在相同的执行上限下对165条GAIA验证轨迹进行评估。操作故障仍然常见:22/53的1级运行、33/86的2级运行和12/26的3级运行未能产生可用的最终答案。轨迹揭示了这些结果背后的不同机制,包括证据不足、重复动作循环、最大步数终止、工具故障连续以及成功执行但无有用输出的调用。平均令牌使用量从1级的8,152个令牌上升到3级的16,389个令牌,而证据可用性和句子级支持则出现分歧。一项缓存的10条轨迹LLM评判基础审计表明,廉价的在线信号和更深入的语义指标捕捉了故障的互补层面。结果将故障感知可观测性定位为原始执行日志与最终答案准确性之间的诊断层。

英文摘要

Failure-aware observability diagnoses wasted computation in multi-agent LLM systems before final-answer evaluation can explain what went wrong. We propose a trace-based framework for a three-agent architecture -- orchestrator, search agent, and execution agent -- that converts structured events into online signals for loops, budget pressure, low information gain, and tool instability, then adds offline semantic grounding metrics and selective LLM-as-judge evaluation. On 165 GAIA validation traces under identical caps, 98 runs produce usable final answers and 67 fail or stop without one. Among warned failed runs, 58.1% of tokens are spent after the first warning on average, indicating substantial opportunity for intervention. A 10-task Level-2 pilot uses warnings to diversify search or require evidence, reducing post-warning token fraction from 0.638 in the baseline to 0.304. The results support a layered design: cheap online signals help the orchestrator redirect or halt redundant behavior, while deeper semantic checks identify whether completed answers are grounded enough to trust.

2606.09039 2026-06-16 cs.AI 版本更新

Agent Economics: An Entropy-Controlled Pluralistic Alignment Framework for Preventing Artificial Hivemind in Autonomous Agents

Agent经济学:一种熵控制的多元对齐框架以防止自主智能体中的蜂群思维

Cheonsu Jeong

发表机构 * AX Center, SAMSUNG SDS(三星SDS AX中心)

AI总结 提出行为协议框架(BPF),通过心智化社会智能、多元对齐和可验证执行内核三个模块,在闭环架构中控制熵以保持策略多样性,提升自主智能体经济的稳定性、效率和可信度。

Comments 15 pages, 2 figures, 1 table

详情
AI中文摘要

本研究提出了行为协议框架(BPF),这是一个熵控制的多元对齐框架,旨在解决自主智能体经济中的两个关键挑战:由智能体间过度战略趋同引起的蜂群思维效应,以及自主决策过程中缺乏透明度。所提出的BPF由三个核心模块组成:基于心智理论的心智化社会智能(MbSI)、多元对齐(PA)和可验证执行内核(VEK)。这些模块有机地集成在一个闭环架构中,该架构控制着智能体行为从决策、执行到验证和反馈的整个生命周期。为了评估所提出的框架,将开发一个用Python实现的模拟环境和基于Streamlit的用户界面。通过实证实验,本研究旨在检验PA模块的熵控制机制能否有效保持智能体间的战略多样性并减轻集体趋同,同时VEK模块提供决策过程的全面且透明的审计追踪。预期结果将表明,所提出的框架能够同时增强自主智能体经济的稳定性、效率和可信度。因此,本研究为开发稳健、透明且可问责的智能体原生经济系统提供了一种实用方法。

英文摘要

This study proposes the Behavioral Protocol Framework (BPF), an entropy-controlled pluralistic alignment framework designed to address two critical challenges in autonomous agent economies: the hivemind effect arising from excessive strategic convergence among agents and the lack of transparency in autonomous decision-making processes. The proposed BPF consists of three core modules: Mentalizing-based Social Intelligence (MbSI) grounded in Theory of Mind (ToM), Pluralistic Alignment (PA), and a Verifiable Execution Kernel (VEK). These modules are organically integrated within a closed-loop architecture that governs the entire lifecycle of agent behavior, from decision-making and execution to verification and feedback. To evaluate the proposed framework, a simulation environment implemented in Python and a Streamlit-based user interface will be developed. Through empirical experimentation, the study aims to examine whether the entropy-control mechanism of the PA module can effectively preserve strategic diversity among agents and mitigate collective convergence, while the VEK module provides a comprehensive and transparent audit trail of the decision-making process. The anticipated results are expected to demonstrate that the proposed framework can simultaneously enhance the stability, efficiency, and trustworthiness of autonomous agent economies. Consequently, this research offers a practical approach for developing robust, transparent, and accountable agent-native economic systems.

2606.13003 2026-06-16 cs.AI cs.CL cs.MA 版本更新

The Illusion of Multi-Agent Advantage

多智能体优势的错觉

Prathyusha Jwalapuram, Hehai Lin, Chuyuan Li, Fangkai Jiao, Sudong Wang, Yifei Ming, Zixuan Ke, Chengwei Qin, Giuseppe Carenini, Shafiq Joty

发表机构 * Salesforce Research(Salesforce研究院) HKUST (Guangzhou)(香港科技大学(广州)) University of British Columbia(不列颠哥伦比亚大学) Nanyang Technological University(南洋理工大学)

AI总结 通过系统评估,发现自动生成的多智能体系统在性能和成本效率上均不如单智能体基线(如思维链自一致性),揭示了现有评估框架的缺陷和架构膨胀问题。

详情
AI中文摘要

普遍观点认为多智能体系统优于单智能体系统,其优势包括上下文保护、并行处理和分布式决策。然而,这一主张的经验支持主要依赖于与使用优先考虑孤立推理任务的基准测试的单智能体基线的比较,这些基准测试未能充分评估这些优势。我们专注于自动生成的多智能体系统(旨在比手动设计的系统具有更强的泛化能力),对单智能体系统(特别是思维链自一致性)进行了严格、系统的评估。在传统推理数据集和具有交互式多步骤工作流的任务(例如 BrowseComp-Plus)上,我们证明自动多智能体系统始终不如思维链自一致性,尽管其成本高达10倍。为了将这些失败与任务结构固有的局限性隔离开来,我们引入了一个为多智能体系统量身定制的诊断性合成数据集,该数据集具有显式任务分解、上下文分离和并行化潜力。我们表明,专家设计的多智能体系统在该数据集上的原始性能和成本效率方面始终优于自动生成的架构,这表明现有的评估框架未能考虑增加计算成本的边际效用,从而掩盖了复杂多智能体系统的关键架构缺陷和低效性。关键的是,对生成的多智能体系统架构的系统解构表明,当前的自动化设计范式产生了架构膨胀,优先考虑表面复杂性,但这并未转化为功能效用,暴露了与多智能体原则的根本性错位。

英文摘要

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatically generated MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperforms automatically generated architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.

2310.06555 2026-06-16 cs.CL cs.AI cs.LG cs.MA 版本更新

It's About Time: Temporal References in Emergent Communication

关于时间:涌现通信中的时间指代

Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman

发表机构 * University of Southampton(索姆塞特大学) The Alan Turing Institute(艾伦·图灵研究所) University of Brescia(布雷西亚大学)

AI总结 研究涌现通信中时间指代缺失问题,发现仅改变损失函数不足,需修改架构(分批方法)才能使时间指代涌现,95%以上代理成功,为提升通信效率奠定基础。

Comments 23 pages main body and 31 pages supplementary material, 9 figures in main body. Code available at https://github.com/olipinski/TRG

详情
Journal ref
Journal of Artificial Intelligence Research 86, Article 11 (June 2026)
AI中文摘要

涌现通信使代理能够开发定制语言以提高通信效率。尽管已知时间结构在自然语言中的重要性,但在涌现通信中尚无时间指代的证据。本文通过探索代理如何交流时间关系来填补这一空白。我们分析了时间指代涌现的三个潜在因素:环境因素、外部因素和架构因素。实验表明,仅改变损失函数不足以使时间指代涌现;相反,架构变化是必要的。代理架构的最小变化——使用不同的分批方法——允许时间指代涌现。在强调时间关系的时间指代游戏环境中,将此修改后的设计与标准架构进行比较。分析显示,超过95%使用修改后分批方法的代理发展出了时间指代,而无需改变其损失函数。我们认为时间指代对于未来提高代理通信效率是必要的,使未来代理能够使用更接近最优编码的方式,与纯组合语言相比。这些见解为将时间指代纳入其他涌现通信设置以及研究语言的其他方面提供了基础。

英文摘要

Emergent communication enables agents to develop bespoke languages that improve communication efficiency. Despite the known importance of temporal structure in natural language, there is no existing evidence of temporal references in emergent communication. This paper addresses this gap, by exploring how agents communicate about temporal relationships. We analyse three potential factors for the emergence of temporal references: environmental, external, and architectural. Our experiments demonstrate that altering the loss function is insufficient for temporal references to emerge; rather, architectural changes are necessary. A minimal change in agent architecture, using a different batching method, allows the emergence of temporal references. This modified design is compared with the standard architecture in a temporal referential games environment, which emphasises temporal relationships. The analysis shows that over 95% of the agents with the modified batching method develop temporal references, without changes to their loss function. We consider temporal referencing necessary for future improvements to the agents' communication efficiency, enabling future agents to use a closer to optimal coding as compared to purely compositional languages. These insights provide the basis for incorporation of temporal references into other emergent communication settings, and investigation of other aspects of language.

2602.05965 2026-06-16 cs.MA cs.AI 版本更新

Learning to Share: Selective Memory for Efficient Parallel Agentic Systems

学习共享:面向高效并行智能体系统的选择性记忆

Joseph Fioresi, Parth Parag Kulkarni, Ashmal Vayani, Song Wang, Mubarak Shah

AI总结 提出LTS机制,通过强化学习训练控制器选择性共享跨团队中间信息,在减少并行智能体系统计算开销的同时保持或提升任务性能。

Comments ICML 2026

详情
AI中文摘要

智能体系统通过协调多个智能体迭代推理、调用工具和交换中间结果来解决复杂任务。为了提高鲁棒性和解决方案质量,最近的方法部署多个并行运行的智能体团队以探索多样化的推理轨迹。然而,并行执行带来了显著的计算成本:当不同团队独立推理相似子问题或执行类似步骤时,它们反复进行大量重叠计算。为了解决这些限制,本文提出学习共享(LTS),一种用于并行智能体框架的学习型共享记忆机制,能够在控制上下文增长的同时实现跨团队选择性信息重用。LTS引入了一个所有团队可访问的全局记忆库和一个轻量级控制器,该控制器决定是否将中间智能体步骤添加到记忆中。控制器使用带有使用感知信用分配的逐步强化学习进行训练,使其能够识别在并行执行中全局有用的信息。在AssistantBench和GAIA基准上的实验表明,与无记忆并行基线相比,LTS显著减少了总体运行时间,同时匹配或提高了任务性能,证明了学习型记忆准入是提高并行智能体系统效率的有效策略。项目页面:此https URL

英文摘要

Agentic systems solve complex tasks by coordinating multiple agents that iteratively reason, invoke tools, and exchange intermediate results. To improve robustness and solution quality, recent approaches deploy multiple agent teams running in parallel to explore diverse reasoning trajectories. However, parallel execution comes at a significant computational cost: when different teams independently reason about similar sub-problems or execute analogous steps, they repeatedly perform substantial overlapping computation. To address these limitations, in this paper, we propose Learning to Share (LTS), a learned shared-memory mechanism for parallel agentic frameworks that enables selective cross-team information reuse while controlling context growth. LTS introduces a global memory bank accessible to all teams and a lightweight controller that decides whether intermediate agent steps should be added to memory or not. The controller is trained using stepwise reinforcement learning with usage-aware credit assignment, allowing it to identify information that is globally useful across parallel executions. Experiments on the AssistantBench and GAIA benchmarks show that LTS significantly reduces overall runtime while matching or improving task performance compared to memory-free parallel baselines, demonstrating that learned memory admission is an effective strategy for improving the efficiency of parallel agentic systems. Project page: https://joefioresi718.github.io/LTS_webpage/

2603.01131 2026-06-16 cs.MA cs.AI 版本更新

MedCollab: IBIS-Guided Multi-Agent Collaboration with Hierarchical Disease Relation Chains for Clinical Diagnosis

MedCollab:基于IBIS引导的多智能体协作与分层疾病关系链的临床诊断

Yuqi Zhan, Xinyue Wu, Tianyu Lin, Yutong Bao, Xiaoyu Wang, Weihao Cheng, Huangwei Chen, Feiwei Qin, Zhu Zhu

发表机构 * Princeton University(普林斯顿大学) Springer Heidelberg(斯普林格海德堡) ABC Institute(ABC研究所) Rupert-Karls-University Heidelberg(海德堡鲁珀特-卡尔大学) Hangzhou Dianzi University(杭州电子科技大学) Zhejiang University(浙江大学) Children’s Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Children and Adolescents’ Health and Diseases(浙江大学医学院儿童医院,国家儿童青少年健康与疾病临床研究中心)

AI总结 提出MedCollab框架,通过IBIS结构化论证和分层疾病关系链(HDRC)增强多智能体协作,提升临床诊断的准确性、可追溯性和报告质量。

详情
AI中文摘要

大型语言模型(LLM)在临床诊断中展现出潜力,但仍受限于不可靠的报告生成、薄弱的证据基础和 opaque 推理。我们提出MedCollab,一个基于IBIS引导的多智能体框架,用于全周期临床诊断和诊断报告生成。模拟医院会诊,MedCollab从患者记录中动态招募专科和检查智能体。每个诊断假设通过基于问题的信息系统(IBIS)结构化为证据关联的论点,提高可追溯性和可审计性。MedCollab进一步构建分层疾病关系链(HDRC),将接受的假设组织成具有临床意义的病理和共病关系。一个验证器引导的共识模块审计推理质量,检测矛盾,并在多轮中更新智能体权重。在ClinicalBench和MIMIC-IV上的实验表明,MedCollab在诊断准确性、科室路由、证据一致性和报告质量方面优于强大的LLM和医学多智能体基线。这些结果表明,结构化论证和疾病关系建模可以提高基于LLM的诊断的可靠性、透明度和临床连贯性。

英文摘要

Clinical diagnosis is a gradual process of evidence integration, in which physicians move from symptoms and medical history to examinations, competing hypotheses, disease relations, and treatment decisions. Large language models have advanced medical text understanding and generation. Yet their clinical use remains limited by weak evidence grounding, opaque reasoning, and inconsistent links among differential diagnosis, final diagnosis, diagnostic basis, and treatment planning. We introduce MedCollab, a multi-agent framework for full-cycle clinical diagnosis and report generation. MedCollab coordinates specialist and examination agents according to patient records. It structures agent deliberation with an Issue-Based Information System (IBIS) protocol, so that each diagnostic position is supported by patient-specific evidence and medical knowledge. It also builds Hierarchical Disease Relation Chains (HDRC) to connect accepted hypotheses through progression, complication, and comorbidity relations. During multi-round deliberation, a verifier-guided consensus module evaluates evidence support, medical plausibility, and logical conflicts. It then adjusts agent contributions and filters unsupported reasoning. Experiments on ClinicalBench and MIMIC-IV show that MedCollab outperforms leading LLMs and medical multi-agent baselines in diagnostic accuracy, evidence consistency, and clinical reasoning quality. These results indicate that structured and auditable collaboration can produce more faithful and clinically coherent diagnostic reports.

2604.09679 2026-06-16 cs.MA cs.AI 版本更新

HCP-MAD:Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

HCP-MAD:用于高效多智能体辩论的异构共识渐进推理

Yiqing Liu, Hantao Yao, Wu Liu, Allen He, Yongdong Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出HCP-MAD框架,通过异构共识验证和自适应停止机制,在保持准确率的同时大幅降低多智能体辩论的token成本。

详情
AI中文摘要

多智能体辩论(MAD)是一种协作框架,其中多个智能体通过生成推理和交替批评循环来迭代优化解决方案。当前的工作主要分别优化轮内拓扑和轮间交互,限制了token成本对任务复杂度的适应性。本文引入了用于高效多智能体辩论的异构共识渐进推理(HCP-MAD),利用共识作为动态信号来促进渐进推理。核心动机是大多数简单任务可以通过轻量级双智能体辩论有效解决,而复杂任务需要扩展协作。首先,异构共识验证使用一对异构智能体进行快速共识验证以实现提前停止。其次,异构双智能体辩论应用自适应停止标准来终止推理轨迹的相互批评。最后,未解决的任务通过升级的集体投票,聚合来自额外智能体的多样化视角来处理。在六个基准上的实验表明,HCP-MAD在提高准确性的同时大幅降低了token成本。代码见此URL。

英文摘要

Multi-Agent Debate (MAD) is a collaborative framework in which multiple agents iteratively refine solutions through the generation of reasoning and alternating critique cycles. Current work primarily optimizes intra-round topologies and inter-round interactions separately, limiting the adaptation of token costs to task complexity. This work introduces Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate (HCP-MAD), leveraging consensus as a dynamic signal to facilitate progressive reasoning. The core motivation is that a majority of straightforward tasks can be effectively resolved via lightweight pair-agent debates, while complex tasks require expanded collaboration. Firstly, Heterogeneous Consensus Verification conducts rapid consensus verification using a pair of heterogeneous agents for early stopping. Next, Heterogeneous Pair-Agent Debate applies an adaptive stopping criterion to terminate mutual critique of reasoning traces. Finally, the unresolved tasks are addressed through Escalated Collective Voting by aggregating diverse perspectives from additional agents. Experiments across six benchmarks show that HCP-MAD enhances accuracy while substantially reducing token costs. Code is https://github.com/fuyu66/HCP-MAD.

2605.29874 2026-06-16 cs.MA cs.AI cs.GT 版本更新

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

下一代LLM智能体系统中合作的演化动力学:跨提供商的实证扩展

Francisco León Zúñiga Bolívar

发表机构 * Institución Universitaria Colegio Mayor del Cauca(大学机构科尔多瓦大学)

AI总结 本研究通过扩展Willis等人的基准,测试2025-2026年四个前沿LLM模型在迭代囚徒困境中的合作偏差,发现合作偏差普遍存在但提供商间差异显著,且噪声仍是普遍挑战。

Comments v2 (erratum): two truncated Gemini 3.1 Pro libraries regenerated; cooperative-plurality 9/12->10/12, conclusions unchanged. 11 pages, 3 figures, 8 tables. Extends arXiv:2501.16173. Code and n=500 replication: https://github.com/arqFranciscoLeon/evollm (archived: https://doi.org/10.5281/zenodo.20248615)

详情
AI中文摘要

下一代LLM智能体是否继承了其前身中记录的合作偏差,还是规模和提供商的多样性重塑了竞争性多智能体环境中的均衡行为?Willis等人使用演化博弈论和迭代囚徒困境(IPD)为此问题建立了基准,发现ChatGPT-4o和Claude 3.5 Sonnet中存在一致的合作偏差。我们将此基准扩展到2025-2026年发布的四个前沿模型——Claude Sonnet 4.6、Gemini 2.5 Flash、Gemini 3.1 Pro和GPT-5.4 Mini——在三种提示风格(默认、散文、自我优化)和四种群体组成(平衡和有偏,有无噪声)下应用相同的协议。合作偏差在提供商间持续存在(H1):在平衡无噪声条件下,十二种模型-提示组合中有九种倾向于合作均衡。提供商间差异显著(H3):Gemini 2.5 Flash在有偏条件下达到高达77%的攻击性均衡,而GPT-5.4 Mini在自我优化下达到70%的合作均衡。对攻击性能力对等的支持是部分的(H2):自我优化提高了所有模型的ICD,Claude Sonnet 4.6 Refine在数据集中达到最高ICD(0.913),但默认和散文提示未显示系统性缩小。关于噪声鲁棒性的证据方向为正但未稳健确认(H4):在每种条件下n=500次Moran迭代,Claude Sonnet 4.6的平均噪声敏感度约为6个百分点,而Claude 3.5 Sonnet为13个百分点,但一旦传播前身未报告的抽样误差,这一跨研究差距在统计上不显著。提供商身份而非模型代际是均衡结果的最强相关因素;无论模型大小或年代,噪声仍然是普遍挑战。

英文摘要

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): ten of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Gemini 3.1 Pro Refine achieves the highest ICD in the dataset (0.925), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is about 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.

4. 搜索、优化与约束求解 9 篇

2606.15577 2026-06-16 cs.AI 新提交

Large Language Models as Optimizers: A Survey of Direct vs. Tool-Augmented Approaches and Their Performance Frontiers

大型语言模型作为优化器:直接方法与工具增强方法的调查及其性能前沿

Roko Peran, Luka Hobor, Mihael Kovac, Mario Brcic

发表机构 * University of Zagreb, Faculty of Electrical Engineering and Computing(萨格勒布大学电气工程与计算学院)

AI总结 综述LLM作为优化器的三种范式(直接优化、工具增强优化、工具创造优化),分析性能前沿与推理差距,探讨直接优化的未来潜力与工具增强优化的可审计性之间的权衡。

Comments 6 pages, 1 figure, 2 tables, accepted at 49th ICT and Electronics Convention, MIPRO - https://mipro.hr; Paper ID: #23463

详情
AI中文摘要

大型语言模型(LLM)越来越多地参与复杂的数学优化,即使触发它们的实用用户并未意识到这一点。毕竟,许多现实世界问题归结为寻找更好或最佳解决方案。LLM作为优化器的领域有三种范式:直接优化、工具增强优化和工具创造优化。直接优化使用迭代提示和启发式生成来探索解空间。工具增强优化将自然语言问题转化为形式化规范并编排外部求解器。工具创造优化更进一步,利用LLM发现可重用的算法或启发式方法,这些方法可以以零边际LLM成本部署。我们基于文献中的基准描述了当前的性能前沿。我们识别了当前架构中的关键推理差距,并论证了直接优化的未来潜力与工具增强优化的可审计性之间的权衡。即使是未来更强大的模型,也可能选择工具制造以提高重复性问题族的操作效率。

英文摘要

Large Language Models (LLMs) are increasingly involved in complex mathematical optimization, even if the pragmatic user who triggers them is unaware of it. After all, many real-world problems reduce to the search for better or the best solutions. The field of LLM-as-optimizer has three paradigms: direct optimization, tool-augmented optimization, and tool-creating optimization. Direct optimization uses iterative prompting and heuristic generation to navigate solution spaces. Tool-augmented optimization translates natural language problems into formal specifications and orchestrates external solvers. Tool-creating optimization goes further, using LLMs to discover reusable algorithms or heuristics that can be deployed at zero marginal LLM cost. We describe current performance frontiers based on the benchmarks from the literature. We identify the critical reasoning gap in current architectures and argue for trade-offs between the future potential of direct optimization and the auditability of tool-augmented optimization. Even future, more powerful models might opt for tool-making to improve operational efficiency for repetitive families of problems.

2606.15797 2026-06-16 cs.AI 新提交

Unassigned Agents in Compilation-based Multi-agent Path Finding

基于编译的多智能体路径规划中的未分配智能体

Pavel Surynek

发表机构 * Faculty of Information Technology, Czech Technical University in Prague(布拉格捷克理工大学信息技术学院)

AI总结 针对未分配智能体的多智能体路径规划问题,提出基于SAT的编译方法,通过SMT-CBS和NRF-SAT求解器实现。

详情
AI中文摘要

基于编译的技术代表了多智能体路径规划(MAPF)求解器的一个重要流派,因其模块化和对问题非标准变体的适应性。在标准MAPF中,任务是引导所有智能体从初始位置无碰撞地到达给定的个体目标位置,而使用不同智能体要求的变体也具有相关性。这种变体是带有未分配智能体的MAPF(UA-MAPF),其中一些智能体与标准MAPF具有相同的设置(有初始位置和目标),而其余智能体只有初始位置但没有目标——未分配智能体。尽管未分配智能体不需要到达任何目标位置,但如有必要,它们必须为标准智能体让路,这构成了一个特定的挑战。我们在本文中表明,UA-MAPF可以表达为基于编译的MAPF技术,这些技术基于将问题表述为布尔可满足性,具体地,我们改编了SMT-CBS和NRF-SAT,这两种基于反例引导抽象精化和非精化抽象的最新求解器。

英文摘要

Compilation-based techniques represent an important stream of solvers for multi-agent path finding (MAPF) due to their modularity and adaptability for non-standard variants of the problem. While in the standard MAPF the task is to navigate all agents from their initial positions to given individual goal positions without any collision, variants where a different requirement for agents is used are also relevant. Such a variant is MAPF with unassigned agents (UA-MAPF) where some agents have the same setting as in the standard MAPF with initial positions and goals while the remaining agents have the initial position but have no goal - unassigned agents. Despite unassigned agent do not need to reach any goal position they have to be moved out of the way of the standard agents if needed which represent a specific challenge. We show in this paper that UA-MAPF can be expressed in recent compilation-based techniques for MAPF based on formulating the problem as Boolean satisfiability, namely we adapt SMT-CBS and NRF-SAT, the recent solvers based on counterexample guided abstraction refinement and non-refined abstractions.

2606.16329 2026-06-16 cs.AI 新提交

Exploiting Search in Symbolic Numeric Planning with Patterns

利用模式在符号数值规划中进行搜索

Matteo Cardellini, Enrico Giunchiglia

发表机构 * DIBRIS, University of Genoa(热那亚大学DIBRIS)

AI总结 提出基于符号模式规划(SPP)的数值规划过程,通过动态重计算模式并利用中间状态引导搜索,提高规划效率。

Comments Under Review at the Journal of Artificial Intelligence Research

详情
AI中文摘要

在本文中,我们提出了一种基于符号模式规划(SPP)的数值规划过程。给定一个数值规划问题 $Π$,一个模式 $\prec$ 是一个动作序列,用于定义一个公式,该公式编码了从起始状态 $S$ 可执行的 $\prec$ 的子序列。Cardellini, Giunchiglia, 和 Maratea (2024a) 遵循规划作为可满足性的方法,在每一步 $n \ge 0$ 定义一个公式 $Π^\prec_n$,其中 $(i)$ 模式 $\prec$ 仅在 $n=0$ 时在 $Π$ 的初始状态 $I$ 中计算,然后在每一步 $n$ 中被利用,$(ii)$ 起始状态 $S$ 设置为 $I$,$(iii)$ 目标集 $G$ 要求在通过将 $\prec$ 的子序列连接 $n$ 次所能达到的最后一个状态中成立。该过程从 $n=0$ 开始,一旦 $Π^\prec_n$ 可满足则终止,否则递增 $n$ 继续。在本文中,可能在每一步,$(i)$ 我们符号化地搜索一个从 $I$ 可达的中间状态 $P$,该状态更接近目标状态,$(ii)$ 动态重计算模式 $\prec_h$ —— 用于下一步 —— 在 $P$ 中,$(iii)$ 精炼用于到达 $P$ 的模式 $\prec_g$,以及 $(iv)$ 从状态 $S$ 开始新的搜索,$S$ 可以是初始状态 $I$ 或最后计算的中间状态 $P$,利用计算出的模式 $\prec_g$ 和 $\prec_h$ 来定义搜索中使用的模式 $\prec$。特别地,在每一步,我们定义一个公式 $Π^{\prec}_{S,P}$,编码存在一个状态 $P'$ 比 $P$ 更接近目标状态,且 $P'$ 从起始状态 $S$ 使用模式 $\prec$ 可达。我们提出了不同的技术来生成这样的公式,每种技术对应一种不同的搜索空间探索策略。我们证明了它们的正确性和完备性,后者在一定条件下成立。

英文摘要

In this paper, we present a procedure for numeric planning based on Symbolic Pattern Planning (SPP). Given a numeric planning problem $Π$, a pattern $\prec$ is a sequence of actions used to define a formula encoding the subsequences of $\prec$ executable from a starting state $S$. Cardellini, Giunchiglia, and Maratea (2024a) follow the Planning as Satisfiability approach by defining, at each step $n \ge 0$, a formula $Π^\prec_n$ in which $(i)$ the pattern $\prec$ is computed only for $n=0$ in the initial state $I$ of $Π$, and then exploited at each step $n$, $(ii)$ the starting state $S$ is set to $I$, and $(iii)$ the set $G$ of goals is required to hold in the last state that can be reached by one of the subsequences of $\prec$ concatenated $n$ times. The procedure begins with $n=0$, terminates as soon as $Π^\prec_n$ is satisfiable, and otherwise proceeds by incrementing $n$. In this paper, possibly at each step, $(i)$ we symbolically search for an intermediate state $P$ reachable from $I$, closer to a goal state, $(ii)$ dynamically recompute the pattern $\prec_h$ -- to be used in the next step -- in $P$, $(iii)$ refine the pattern $\prec_g$ used to reach $P$, and $(iv)$ start the new search from the state $S$ which can be either the initial state $I$ or the last computed intermediate state $P$, exploiting the computed patterns $\prec_g$ and $\prec_h$ to define the pattern $\prec$ to be used in the search. In particular, at each step, we define a formula $Π^{\prec}_{S,P}$ encoding the existence of a state $P'$ closer than $P$ to a goal state, with $P'$ reachable from the starting state $S$ when using the pattern $\prec$. We present different techniques for producing such formulas, each corresponding to a different strategy for exploring the search space. We prove their correctness and completeness, the latter under certain conditions.

2606.16567 2026-06-16 cs.AI cs.LG cs.SY eess.SY math.DS 新提交

TNODEV: Toolbox for Neural ODE Verification

TNODEV: 神经ODE验证工具箱

Abdelrahman Sayed Sayed, Pierre-Jean Meyer, Mohamed Ghazel

发表机构 * Univ Gustave Eiffel, COSYS-ESTAS(古斯塔夫·埃菲尔大学,COSYS-ESTAS实验室)

AI总结 提出TNODEV,首个集成伪造检查、区间可达性、验证循环和并行调度的神经ODE形式验证器,支持安全集包含和分类鲁棒性验证。

Comments 29 pages, 7 figures, Under review in TMLR

详情
AI中文摘要

神经常微分方程(神经ODE)已开始出现在安全关键场景中,例如网络物理系统的连续时间控制器和集成到自动化决策流水线中的分类器,这引发了对其行为能否被形式化验证的问题。现有的专门用于神经ODE的工具仅提供单次可达性调用,没有迭代输入集细化,将其判定的精度限制在单次可达性调用所能提供的范围内。我们提出了TNODEV,这是首个用于神经ODE的可靠形式验证器,它集成了伪造检查器、基于连续时间混合单调性的快速区间可达性后端、具有三种输入集分裂启发式的验证与细化循环以及并行调度器,构成一个端到端流水线。TNODEV支持纯神经ODE、与神经网络控制器闭环的神经ODE以及通用神经ODE(GNODE)上的安全集包含验证,安全集可指定为区间或由目标分类标签诱导的半空间交集。我们在安全集包含和分类鲁棒性属性的一系列基准上评估了TNODEV,包括与NNV 2.0和CORA的直接可达性比较,以及在MNIST通用神经ODE分类器上与NNV2.0的验证比较。

英文摘要

Neural ordinary differential equations (neural ODE) have started to appear in safety critical settings such as continuous-time controllers for cyber-physical systems and classifiers integrated into automated decision pipelines, raising the question of whether their behavior can be formally verified. Existing tools dedicated to neural ODE provide only a single reachability call without iterative input set refinement, limiting the precision of their verdicts to whatever one reachability call can deliver. We present TNODEV, the first sound formal verifier for neural ODE that integrates a falsification checker, a fast interval-based reachability backend based on continuous-time mixed monotonicity, a verification and refinement loop with three input-set splitting heuristics, and a parallel scheduler in a single end-to-end pipeline. TNODEV supports safe-set inclusion verification on pure neural ODE, neural ODE in closed loop with a neural network controller and general neural ODE (GNODE), with the safe set specified either as an interval or as the half-space intersection induced by a target classification label. We evaluate TNODEV on a range of benchmarks across safe-set inclusion and classification-robustness properties, including a direct reachability comparison against NNV~2.0 and CORA and a verification comparison against NNV2.0 on MNIST general neural ODE classifiers.

2606.15301 2026-06-16 cs.LG cs.AI 交叉投稿

Discovering Lattice Reduction Strategies via Self-Play

通过自我对弈发现格基约简策略

Mohamed Malhou, Kristin Lauter, Ludovic Perret

发表机构 * FAIR, Meta Superintelligence Labs(Meta超级智能实验室FAIR) Sorbonne Université CNRS, LIP6(索邦大学CNRS/LIP6) EPITA, EPITA Research Lab (LRE)(EPITA研究实验室(LRE))

AI总结 利用深度强化学习和AlphaZero风格自我对弈,在LLL原始动作空间中学习更优的格基约简策略,训练于8维格但可零样本泛化至32维。

详情
AI中文摘要

Lenstra-Lenstra-Lovász (LLL) 算法是计算机科学中用于格基约简的开创性贡献,但其多项式时间输出的基随着维数增长远非最优。我们证明,深度强化学习可以通过与LLL的原始动作空间交互,发现严格更优、可泛化的约简策略。我们将格基约简形式化为单人马尔可夫决策过程 (MDP),并使用AlphaZero风格的自我对弈流水线训练深度残差网络,该流水线结合了自适应视界MCTS(蒙特卡洛树搜索),将多步网络预测与熵门控扩展机制耦合。由此产生的策略DeltaStar仅在小的8维q-ary格上训练,且需要的原始行操作少于LLL。关键的是,它无需重新训练即可零样本泛化到未见过的模数和高达n=32的更高维度。

英文摘要

The Lenstra-Lenstra-Lovász (LLL) algorithm is a seminal contribution to computer science used for lattice basis reduction, yet its polynomial-time outputs produce bases that are far from optimal as the dimension grows. We show that deep reinforcement learning can discover strictly superior, generalizable reduction strategies by interacting with the primitive action space of LLL. We formulate lattice reduction as a single-player Markov Decision Process (MDP) and train a deep residual network using an AlphaZero-style self-play pipeline augmented with adaptive-horizon MCTS (Monte Carlo Tree Search), which couples multi-step network predictions with an entropy-gated expansion mechanism. The resulting policy, DeltaStar, is trained exclusively on small $8$-dimensional $q$-ary lattices and requires fewer primitive row operations than LLL. Crucially, it generalizes zero-shot to unseen moduli and higher dimensions up to $n=32$ without retraining.

2606.15623 2026-06-16 cs.LG cs.AI 交叉投稿

Surprise-Guided MergeSort: Budget-Efficient Human-in-the-Loop Ranking via Adaptive Comparison Scheduling

惊喜引导的归并排序:通过自适应比较调度实现预算高效的人机协同排名

Yujin Park, Haejun Chung, Ikbeom Jang

发表机构 * Hanyang University(汉阳大学) Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 提出惊喜引导的归并排序(SGS)框架,利用视觉语言模型(VLM)作为问题优先级排序器,通过自适应预算分配将高模糊度比较路由给人类,在六个基准上以相同预算实现Kendall's τ×100提升6-12点。

Comments 16 pages

详情
AI中文摘要

成对比较是主观排名任务的金标准;然而,穷举标注需要大量人工比较($O(n^2)$)。虽然基于排序的方法已将此负担减少到$O(n\log n)$,但每次比较仍需昂贵的人工判断。为了进一步提高标注效率,我们提出利用视觉语言模型(VLM)不是作为标注替代,而是作为\emph{问题优先级排序器},以识别哪些比较真正需要人工判断。所提出的\textbf{惊喜引导的归并排序(SGS)}框架通过三个集成组件实现这一点:(1)自底向上的归并排序调度器,结构化比较并利用传递性;(2)复合惊喜评分器——结合位置偏差消除的VLM置信度、Elo差距和投票熵——量化比较模糊性;(3)自适应预算分配器,将高惊喜对路由给人类,同时通过传递性推理自动化低惊喜对。在六个不同基准上进行了验证,涵盖文本相似度(STS-B、BIOSSES、SICKR-STS)和图像质量评估(KonIQ-10k、TID2013、LIVE Challenge)。SGS有效地识别并跳过了每次会话多达535个非信息性比较。因此,在相同总预算下,它相对于Active Elo实现了Kendall's $τ{\times}100$提升+6到+12。这些结果表明,将VLM引导的惊喜度量与算法排序相结合,在不同领域提供了普遍一致的准确性-效率权衡。

英文摘要

Pairwise comparison is the gold standard for subjective ranking tasks; however, exhaustive annotation requires a massive number of human comparisons ($O(n^2)$). While sorting-based methods have reduced this burden to $O(n\log n)$, they still require expensive human judgment for every single comparison. To further improve annotation efficiency, we propose leveraging a Vision-Language Model (VLM) not as an annotator replacement, but as a \emph{question prioritizer} to identify which comparisons genuinely require human judgment. The proposed \textbf{Surprise-Guided MergeSort (SGS)} framework achieves this through three integrated components: (1) a bottom-up MergeSort scheduler that structures comparisons and exploits transitivity, (2) a composite Surprise Scorer -- combining position-bias-cancelled VLM confidence, Elo gap, and vote entropy -- to quantify comparison ambiguity, and (3) an adaptive budget allocator that routes high-surprise pairs to humans while automating low-surprise pairs via transitivity inference. Validation was conducted on six diverse benchmarks spanning text similarity (STS-B, BIOSSES, SICKR-STS) and image quality assessment (KonIQ-10k, TID2013, LIVE Challenge). SGS effectively identified and skipped up to 535 non-informative comparisons per session. Consequently, it achieved Kendall's $τ{\times}100$ improvements of $+6$ to $+12$ over Active Elo under the same total budget. These results demonstrate that combining VLM-guided surprise metrics with algorithmic sorting provides a generally consistent accuracy-efficiency trade-off across diverse domains.

2606.15923 2026-06-16 cs.NE cs.AI cs.LG 交叉投稿

Runtime Analysis of Cartesian Genetic Programming in Evolving Boolean Functions

笛卡尔遗传规划在演化布尔函数中的运行时分析

Duc-Cuong Dang, Roman Kalkreuth, Andre Opris

发表机构 * University of Passau(帕绍大学) RWTH Aachen University(亚琛工业大学)

AI总结 本文首次对笛卡尔遗传规划在完全训练集上演化布尔函数进行运行时分析,证明构造n输入合取式的期望适应度评估次数为O(n D^5),并发现非严格选择可加速至O(n D^4),而异或函数需要指数时间。

Comments To appear in the Proceedings of PPSN 2026

详情
AI中文摘要

笛卡尔遗传规划(CGP)是遗传规划中实用且流行的形式之一,因为它使用基于图的程序表示。本文首次对CGP在完全训练集上演化布尔函数进行运行时分析。我们证明了CGP使用最多D≥n-1个二元门、最小函数集,甚至采用严格生存选择时,构造n个输入的合取式的期望适应度评估次数的渐近界为O(n D^5)。当使用非严格选择时,该界改进为O(n D^4)。我们的分析揭示了CGP诱导搜索的有趣特征,这些特征此前仅通过经验观察得到。特别是,允许接受同样好的解(包括那些包含不贡献适应度的连接门的解)可以导致加速,从而获得更好的渐近时间界。与合取式相反,我们还证明了一个负面结果,即CGP需要指数时间来演化异或函数。演化合取式的实验补充了我们的理论发现。使用不完全训练集可以进一步减少平均适应度评估次数,同时保持较好的泛化水平。

英文摘要

Cartesian Genetic Programming (CGP) is among the practical and popular forms of Genetic Programming as it uses a graph-based representation of programs. This paper presents a first runtime analysis of CGP in evolving Boolean functions using complete training sets. We prove an asymptotic bound $O(n D^5)$ for the expected number of fitness evaluations of CGP to construct a conjunction of $n$ inputs using at most $D \geq n-1$ binary gates, a minimal function set, and even with a strict survival selection. When the non-strict selection is used, the bound is improved to $O(n D^4)$. Our analysis reveals interesting characteristics of CGP induced search, which have been only observed empirically. In particular, enabling the acceptance of equally good solutions, including those with connected gates non-contributing to fitness, can lead to a speedup, and consequently a better asymptotic time bound. In contrast to conjunctions, we also prove a negative result which shows that CGP requires exponential time to evolve an exclusive disjunction. Experiments evolving conjunctions complement our theoretical findings. The use of incomplete training sets is found to further reduce the average number of fitness evaluations while maintaining a good level of generalisation.

2505.13986 2026-06-16 math.OC cs.AI cs.LG 版本更新

RIDGECUT: Learning Graph Partitioning with Rings and Wedges

RIDGECUT:基于环与楔形结构的图分割学习

Qize Jiang, Angelo Zangari, Linsey Pang, Alice Gatti, Mahima Aggarwal, Giovanna Vantini, Xiaosong Ma, Weiwei Sun, Sourav Medya, Sanjay Chawla

发表机构 * College of Computer Science and Artificial Intelligence, Shanghai Key Laboratory of Data Science(计算机科学与人工智能学院,上海数据科学重点实验室) University of Illinois Chicago(伊利诺伊大学芝加哥分校) PayPal Inc.(PayPal公司) Center for AI Safety(人工智能安全中心) Qatar Computing Research Institute(卡塔尔计算研究所) Hamad Bin Khalifa University(哈马德·本·卡西姆大学) Computing and Mathematical Sciences (CMS) Division(计算与数学科学(CMS)部门) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(Mohamed bin Zayed人工智能大学(MBZUAI)) Fudan University(复旦大学)

AI总结 提出RidgeCut框架,通过将动作空间约束为环与楔形结构,利用强化学习解决归一化割问题,在交通网络上实现结构感知分割,降低归一化割值并展现强泛化能力。

Comments Extended version of the paper accepted at KDD 2026

详情
AI中文摘要

强化学习通过学习跨实例泛化的启发式方法,在图的组合优化问题上展现出潜力。然而,如何有效地将领域知识融入强化学习框架进行图分割仍然具有挑战性,因为现有方法通常依赖于无约束的节点级动作,导致动作空间大且探索效率低。在本文中,我们提出RidgeCut,一种强化学习框架,通过约束动作空间来在归一化割问题中实现结构感知分割。以交通网络为动机示例,我们引入了一个利用城市道路拓扑领域知识的新概念——其中自然分割通常呈现为同心环和径向楔形。通过将图转换为线性或圆形表示,我们的方法能够使用基于变换器的策略并通过近端策略优化进行高效学习。RidgeCut产生的分割不仅与预期的空间布局一致,而且与现有方法相比实现了更低的归一化割值。在合成和真实交通图上的实验结果表明,RidgeCut在跨图大小的归纳泛化方面始终优于现有方法。尽管以道路网络为动机,RidgeCut为将结构先验嵌入到图分割的强化学习框架中提供了一种通用机制。

英文摘要

Reinforcement learning (RL) has shown promise for combinatorial optimization problems on graphs by learning heuristics that generalize across instances. However, effectively incorporating domain knowledge into RL frameworks for graph partitioning remains challenging, as existing approaches typically rely on unconstrained node-level actions that lead to large action spaces and inefficient exploration. In this paper, we propose RidgeCut, an RL framework that constrains the action space to enforce structure-aware partitioning in the Normalized Cut problem. Using transportation networks as a motivating example, we introduce a novel concept that leverages domain knowledge about urban road topology -- where natural partitions often take the form of concentric rings and radial wedges. By transforming the graph into linear or circular representations, our method enables the use of transformer-based policies and efficient learning via Proximal Policy Optimization. The resulting partitions from RidgeCut are not only aligned with expected spatial layouts but also achieve lower normalized cuts compared to existing methods. Experimental results on synthetic and real-world traffic graphs demonstrate that RidgeCut consistently outperforms existing methods while exhibiting strong inductive generalization across graph sizes. Although motivated by road networks, RidgeCut provides a general mechanism for embedding structural priors into RL frameworks for graph partitioning.

2603.23249 2026-06-16 cs.LG cs.AI math.OC 版本更新

A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling

一种具有间隙感知生成的异构DAG调度学习方法

Ruisong Zhou, Haijun Zou, Li Zhou, Chumin Sun, Zaiwen Wen

发表机构 * School of Mathematical Science, Peking University(北京大学数学科学学院) State Key Laboratory of Mathematical Sciences, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(数学科学国家重点实验室,计算数学与科学/工程计算研究所,中国科学院数学系统科学研究院) Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co., Ltd(华为技术有限公司2012实验室理论实验室,中央研究院) Beijing International Center for Mathematical Research, Peking University(北京大学北京国际数学研究中心)

AI总结 提出WeCAN,一种端到端强化学习框架,通过加权交叉注意力编码器建模任务-资源池兼容性,并引入跳序扩展生成机制消除调度间隙,在TPC-H等真实DAG上优于强基线。

Comments 31pages, 8 figures

详情
AI中文摘要

有向无环图(DAG)的高效调度是大规模数据密集型计算系统的核心问题,其中查询计划、数据处理工作负载和计算图由依赖任务组成,这些任务竞争有限的异构资源池。在实践中,实现高性能执行需要调度器适应具有不同资源池和任务类型的环境,同时在严格运行时预算下生成调度。我们提出WeCAN,一种用于异构DAG调度的端到端强化学习框架,解决了任务-资源池兼容系数和生成诱导的最优性间隙。它采用两阶段单次通过设计:单次前向传播产生任务-资源池分数和全局参数,随后通过生成映射构建调度,无需重复网络调用。其加权交叉注意力编码器通过兼容系数门控建模任务-资源池交互,并且对环境波动具有规模无关性。此外,广泛使用的列表调度映射可能因受限可达性而产生生成诱导的最优性间隙。我们引入一种顺序空间分析,通过可行调度顺序刻画生成映射的可达集,解释生成诱导间隙的机制,并给出间隙消除的充分条件。在这些条件指导下,我们设计了一种跳序扩展实现,具有解析参数化的递减跳序规则,在保持单次通过效率的同时扩大可达顺序集。在真实TPC-H查询DAG、资源密集型工作负载数据集和ML编译器计算图上的实验表明,相比强基线,我们改善了完工时间,推理时间与经典启发式相当,且快于多轮神经调度器。

英文摘要

Efficient scheduling of directed acyclic graphs (DAGs) is a core problem in large-scale data-intensive computing systems, where query plans, data-processing workloads, and computation graphs consist of dependent tasks competing for limited heterogeneous resource pools. In practice, achieving high-performance execution requires schedulers to adapt across environments with varying resource pools and task types, while generating schedules under tight runtime budgets. We propose WeCAN, an end-to-end reinforcement learning framework for heterogeneous DAG scheduling that addresses task-pool compatibility coefficients and generation-induced optimality gaps. It adopts a two-stage single-pass design: a single forward pass produces task-pool scores and global parameters, followed by a generation map that constructs schedules without repeated network calls. Its weighted cross-attention encoder models task-pool interactions gated by compatibility coefficients, and is size-agnostic to environment fluctuations. Moreover, widely used list-scheduling maps can incur generation-induced optimality gaps from restricted reachability. We introduce an order-space analysis that characterizes the reachable set of generation maps via feasible schedule orders, explains the mechanism behind generation-induced gaps, and yields sufficient conditions for gap elimination. Guided by these conditions, we design a skip-extended realization with an analytically parameterized decreasing skip rule, which enlarges the reachable order set while preserving single-pass efficiency. Experiments on real-world TPC-H query DAGs, resource-intensive workload datasets, and ML-compiler computation graphs demonstrate improved makespan over strong baselines, with inference time comparable to classical heuristics and faster than multi-round neural schedulers.

5. 机器学习与表示学习 132 篇

2606.14941 2026-06-16 cs.AI 新提交

Semantics-Enhanced Retrieval-Augmented Time Series Forecasting

语义增强的检索增强时间序列预测

Shiqiao Zhou, Zipeng Wu, Holger Schöner, Edouard Fouché, IAG Wilson, Shuo Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) University of Montreal(蒙特利尔大学)

AI总结 针对非平稳性下仅靠时间序列相似性检索不足的问题,提出SERAF框架,通过双模态检索(时间序列及其自生成文本描述)联合利用历史模式,提升预测性能。

Comments Accepted to the ICML 2026 Workshop on Forecasting as a New Frontier of Intelligence

详情
AI中文摘要

时间序列预测模型通常受益于历史模式。受检索增强生成(RAG)启发,最近的研究探索了检索相关历史时间序列片段以增强预测。然而,在非平稳性下,仅依靠时间序列相似性进行检索往往不够。为此,我们提出一种多模态方法:\textbf{S}emantics-\textbf{E}nhanced \textbf{R}etrieval-\textbf{A}ugmented Time Series \textbf{F}orecasting 框架,SERAF。与仅依赖时间序列相似性的主流方法不同,SERAF对时间序列及其自生成的文本描述进行双重检索。它检索两组互补的历史模式和对应的未来,这些被选择性地联合用于指导未来预测。在七个真实世界数据集上的实验表明,与最先进的基线相比,SERAF在桥接时间序列的数值视图和语义视图方面是有效的。

英文摘要

Time series forecasting models often benefit from historical patterns. Inspired by Retrieval-Augmented Generation (RAG), recent research explored retrieving relevant historical time series segments to enhance forecasting. However, relying solely on time series similarity is often insufficient for retrieval under non-stationarity. To address this, we propose a multimodal approach: a \textbf{S}emantics-\textbf{E}nhanced \textbf{R}etrieval-\textbf{A}ugmented Time Series \textbf{F}orecasting framework, SERAF. Unlike mainstream approaches that depend only on time series similarity, SERAF conducts dual retrieval over the time series and their self-generated textual descriptions. It retrieves two complementary sets of historical patterns and corresponding futures, which are selectively and jointly used to guide future predictions. Experiments across seven real-world datasets demonstrate the effectiveness of SERAF in bridging numerical and semantic views of time series compared with state-of-the-art baselines.

2606.14997 2026-06-16 cs.AI cs.LG 新提交

AI Engram: In Search of Memory Traces in Artificial Intelligence

AI Engram: 在人工智能中寻找记忆痕迹

Jea Kwon, Dong-Kyum Kim, Jiwon Kim, Yonghyun Kim, Woong Kook, Meeyoung Cha

发表机构 * University of California, Berkeley(加州大学伯克利分校) KAIST(韩国科学技术院)

AI总结 提出几何框架,通过约束逆问题形式化神经科学标准,识别深度神经网络中的记忆痕迹(AI engram),实现记忆的线性组合与擦除,无需迭代优化。

Comments Accepted to ICML 2026 (Oral). Code is available at https://github.com/jeakwon/ai-engram/

详情
AI中文摘要

记忆形成是智能的基础,但深度神经网络是否保留类似于生物记忆单元的可识别记忆痕迹仍是一个未解问题。本文引入一个几何框架,通过将神经科学标准(特异性、再激活、充分性和必要性)形式化为约束逆问题,来识别此类“AI engram”。我们推导出一个闭式估计器,从全局纠缠参数中分离出单个记忆痕迹,并证明这一生物学启发的解对应于参数流形上的自然梯度更新。AI engram 允许对学习知识进行手术式操作:任何记忆子集可以通过线性算术进行组合或擦除,无需迭代优化。从简单 MLP 到大语言模型的实验证明了 AI engram 的因果有效性和显著可扩展性。总之,这些结果桥接了生物记忆理论与人工表示学习,并提供了关于深度网络如何在分布式存储中同时支持功能特异性的几何洞见。

英文摘要

Memory formation is fundamental to intelligence, yet whether deep neural networks preserve identifiable memory traces analogous to biological memory units remains an open question. This work introduces a geometric framework to identify such "AI engrams" by formalizing the neuroscientific criteria of specificity, reactivation, sufficiency, and necessity into a constrained inverse problem. We derive a closed-form estimator that isolates individual memory traces from globally entangled parameters, and show that this biologically-derived solution corresponds to a natural gradient update on the parameter manifold. AI engrams enable surgical manipulation of learned knowledge: any subset of memories can be composed or erased through linear arithmetic, without iterative optimization. Experiments ranging from simple MLPs to LLMs demonstrate the causal validity and substantial scalability of AI engrams. Together, these results bridge theories of biological memory and artificial representation learning and offer geometric insight into how deep networks simultaneously support functional specificity within distributed storage.

2606.15273 2026-06-16 cs.AI 新提交

Feature Attribution in Directed Acyclic Graphs Using Edge Intervention

基于边干预的有向无环图特征归因

Qiheng Sun, Junxu Liu, Xiaokai Mao, Haocheng Xia, Jinfei Liu, Kui Ren, Haibo Hu

发表机构 * Zhejiang University(浙江大学) Zhejiang Lab(之江实验室) Hong Kong Polytechnic University(香港理工大学)

AI总结 针对现有特征归因方法无法同时捕获特征外部性和外生影响的问题,提出基于边干预的DAG-SHAP方法,将每条特征边作为归因对象,并引入近似计算方法,实验验证其有效性。

详情
AI中文摘要

基于Shapley值的特征归因方法在涉及复杂特征交互和因果关系的场景中面临挑战,即使提供了因果结构。现有方法通常采用节点中心视角,仅将重要性归因于单个特征。因此,它们往往无法同时捕获特征的外部性和外生影响,导致不合理的解释。为克服这些限制,我们提出一种新的基于边干预的特征归因方法DAG-SHAP。DAG-SHAP将每条特征边作为单独的归因对象,确保特征的外部性和外生贡献都被适当捕获。此外,我们引入了一种近似方法以高效计算DAG-SHAP。在真实和合成数据集上的大量实验验证了DAG-SHAP的有效性。我们的代码可在https://github.com/ZJU-DIVER/DAG-SHAP获取。

英文摘要

Shapley value-based feature attribution methods face challenges in scenarios involving complex feature interactions and causal relationships, even when a causal structure is provided. Existing methods typically adopt a node-centric view, attributing importance solely to individual features. Consequently, they often fail to simultaneously capture the externality and exogenous influence of features, leading to unreasonable interpretations. To overcome these limitations, we propose a novel feature attribution method called DAG-SHAP, which is based on edge intervention. DAG-SHAP treats each feature edge as an individual attribution object, ensuring that both externality and exogenous contributions of features are appropriately captured. Additionally, we introduce an approximation method for efficiently computing DAG-SHAP. Extensive experiments on both real and synthetic datasets validate the effectiveness of DAG-SHAP. Our code is available at https://github.com/ZJU-DIVER/DAG-SHAP.

2606.15447 2026-06-16 cs.AI 新提交

Hierarchical Modeling of ICD Codes in EHR Foundation Models

EHR基础模型中ICD码的分层建模

Megha Thukral, Dong Gyun Kang, Rudra Pratap Singh, Shruthi Kashinath Hiremath, Katrin Hänsel, Thomas Plötz

发表机构 * School of Interactive Computing, Georgia Institute of Technology(佐治亚理工学院交互计算学院) Optum AI

AI总结 研究利用ICD-10-CM层次结构作为归纳偏置,通过序列增强和图注入两种机制改进EHR表示学习,实验表明显式编码层次结构在域内和跨数据集任务中均优于扁平表示。

详情
AI中文摘要

电子健康记录基础模型通常将ICD诊断码视为扁平标记,忽略了捕获疾病家族、子类别和细粒度诊断细节的临床上有意义的层次结构。因此,现有的EHR表示学习方法并未明确利用编码系统中已有的层次结构。在这项工作中,我们研究ICD-10-CM层次结构作为临床表示学习的一般归纳偏置。我们研究了两种互补的机制来融入层次结构:首先,通过在BERT风格的transformer中向诊断序列添加对应于ICD层次不同级别的标记;其次,通过结合诊断共现结构的层次感知边将层次结构注入基于图的代码表示中。在这些设置下,我们评估显式层次结构是否改进了下游预测、层次结构的哪些级别最有用、层次编码是否改善了跨数据集的迁移,以及层次结构如何重塑嵌入相似性结构。我们在两个大规模真实世界临床数据集上进行了实验:MIMIC-IV(用于预训练和域内评估)和eICU(用于通过冻结编码器探测评估跨数据集迁移)。我们的发现表明,显式编码ICD层次结构在域内和跨数据集设置中均优于扁平代码表示,同时揭示了最有效的层次级别取决于任务和建模方法。更广泛地说,我们专注于层次感知的EHR表示学习,并表明编码层次结构的好处可泛化到不同的建模设置和层次级别。

英文摘要

Electronic health record foundation models typically treat ICD diagnosis codes as flat tokens, overlooking the clinically meaningful hierarchical structure that captures disease families, subcategories, and fine-grained diagnostic detail. As a result, existing EHR representation learning methods do not explicitly exploit the hierarchical structure already present in the coding system. In this work, we study ICD-10-CM hierarchy as a general inductive bias for clinical representation learning. We investigate two complementary mechanisms for incorporating hierarchy: first, by augmenting diagnosis sequences in a BERT-style transformer with tokens corresponding to different levels of the ICD hierarchy, and second, by injecting hierarchy into graph-based code representations through hierarchy-aware edges combined with diagnosis co-occurrence structure. Across these settings, we evaluate whether explicit hierarchy improves downstream prediction, which levels of the hierarchy are most useful, whether hierarchy encoding improves transfer across datasets, and how hierarchy reshapes embedding similarity structure. We conduct experiments on two large-scale real-world clinical datasets: MIMIC-IV, used for pretraining and in-domain evaluation, and eICU, used to assess cross-dataset transfer via frozen encoder probing. Our findings show that explicitly encoding ICD hierarchy improves over flat code representations in both in-domain and cross-dataset settings, while revealing that the most useful level of hierarchy depends on both the task and the modeling approach. More broadly, we focus on hierarchy-aware EHR representation learning and show that the benefits of encoding hierarchy are generalizable across modeling settings and hierarchy levels.

2606.15841 2026-06-16 cs.AI 新提交

Heteroskedastic Signals in Budgeted LLM Verification: Structural Heterogeneity Limits Optimization Gains

预算受限LLM验证中的异方差信号:结构异质性限制了优化收益

Jinlong Yang

发表机构 * Northwestern Polytechnical University(西北工业大学)

AI总结 本文发现LLM不确定性信号在预算受限验证中存在异方差性,导致全局分配扭曲;通过分层阈值干预(CST)在强异质性设置下提升命中率达17个百分点,揭示结构异质性是主要瓶颈。

详情
AI中文摘要

大型语言模型(LLM)系统越来越多地使用不确定性信号来在验证、测试时扩展、工具执行和其他选择性计算决策中分配有限的计算资源。此类策略依赖于一个全局信号可比性假设:相等的分数应在不同输入中携带可比的决策价值。使用预算受限验证作为受控诊断设置,我们识别出该假设的一种失效模式:不确定性质量在成本分层上是异方差的,某些区域尽管集中了大量错误,却表现出近乎随机的可区分性。在一个显式的局部模型下,我们刻画了由此导致的全局分配扭曲,并表明其上界随跨层信号质量离散度而缩放。我们通过一个受控干预层级(阈值、MP-Adapt、MP-Strat以及一个故意简单的成本分层阈值干预CST)将弱信号、优化不稳定性和结构异质性分离开来。在MBPP和MATH上使用Qwen3-8B、LLaMA3-8B和GPT-4o-mini的实验表明,全局在线自适应相对于静态阈值化产生不一致的收益;MP-Strat部分恢复了性能,而CST在强异质性设置下无需梯度更新即可将命中率提升高达17个百分点。这些结果表明,在所观察的设置中,结构异质性(而非仅优化器弱点)是主要瓶颈。更广泛地说,错位的反馈结构并不总能通过更强的优化来修复。

英文摘要

Large language model (LLM) systems increasingly use uncertainty signals to allocate limited computation across verification, test-time scaling, tool execution, and other selective-compute decisions. Such policies rely on a \emph{global signal comparability assumption}: equal scores should carry comparable decision value across inputs. Using budgeted verification as a controlled diagnostic setting, we identify a failure mode of this assumption: uncertainty quality is heteroskedastic across cost strata, with some regions exhibiting near-random discriminability despite concentrating many errors. Under an explicit local model, we characterize the resulting distortion of global allocation and show that its upper bound scales with cross-stratum signal-quality dispersion. We separate weak signals, optimization instability, and structural heterogeneity through a controlled intervention hierarchy: Threshold, MP-Adapt, MP-Strat, and a deliberately simple cost-stratified thresholding intervention (CST). Across MBPP and MATH using Qwen3-8B, LLaMA3-8B, and GPT-4o-mini, global online adaptation yields inconsistent gains over static thresholding; MP-Strat partially recovers performance, while CST improves hit rate by up to 17 percentage points in strongly heterogeneous settings without gradient updates. These results identify structural heterogeneity, rather than optimizer weakness alone, as the primary bottleneck in the observed settings. More broadly, misaligned feedback structure cannot always be repaired by stronger optimization.

2606.16140 2026-06-16 cs.AI cs.CL 新提交

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

VibeThinker-3B:探索小型语言模型中可验证推理的前沿

Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出3B参数紧凑模型VibeThinker-3B,通过频谱到信号后训练范式(课程SFT、多域强化学习、离线自蒸馏)在可验证推理任务上达到前沿性能,匹配甚至超越大模型,并验证推理增强不损害指令可控性。

详情
AI中文摘要

本技术报告介绍了VibeThinker-3B,一个具有3B参数的紧凑密集模型,旨在探究在严格的小模型范围内可验证推理能推进到何种程度。基于频谱到信号后训练范式,我们通过优化的流程系统性地增强模型,该流程包括基于课程的监督微调、多域强化学习和离线自蒸馏。实验评估表明,VibeThinker-3B在高度要求的可验证任务上达到了前沿水平。具体来说,它在AIME26上获得94.3分(通过声明级测试时缩放提升至97.1),在LiveCodeBench v6上获得80.2的Pass@1,并在最近的未见LeetCode竞赛中表现出强大的分布外泛化能力,接受率达96.1%。这有效地将其置于一流推理系统的性能区间,匹配或超越规模大数个数量级的旗舰模型,如DeepSeek V3.2、GLM-5和Gemini 3 Pro。此外,IFEval上的93.4分证实了这种极端的推理增强并未损害严格的指令可控性。扩展我们之前的1.5B工作,这些发现推动了参数压缩-覆盖假说,该假说将可验证推理视为可压缩到紧凑推理核心中,而开放域知识和通用能力则需要广泛的参数覆盖事实、概念和长尾场景。这一观点表明,紧凑模型不仅是部署高效的替代品,更是通往参数密集能力领域前沿性能的互补路径。

英文摘要

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1\% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.

2606.16152 2026-06-16 cs.AI 新提交

The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

质量-效用悖论:为什么高奖励数据会损害小模型的数学推理

Haolong Qian, Xianliang Yang, Yinuo ma, Lirong Che, Feng Lu, Ye Guo, Lei Song, Jiang Bian, Chun Yuan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 发现数学推理蒸馏中的质量-效用悖论:Oracle精炼的高奖励数据因分布漂移增加适应成本,反而不如SLM自生成数据;提出风格对齐精炼方法恢复效用。

Comments Accepted at ICML 2026

详情
AI中文摘要

从强大推理模型进行知识蒸馏被广泛用于提升小语言模型(SLM)的数学推理能力,通常假设奖励模型得分更高的轨迹能提供更有用的监督。我们在数学推理蒸馏中发现了一个反直觉的\textbf{质量-效用悖论}。由更强Oracle精炼或合成的数据根据奖励模型获得更高的感知质量,但在Qwen2.5、LLaMA-3和DeepSeek系列中,其表现始终不如SLM自身生成并通过拒绝采样选择的轨迹。我们的分析表明,Oracle精炼将逻辑修复与偏离SLM原生推理分布的分布漂移相结合。这种漂移增加了学习者的适应成本,可能抵消改进推理逻辑带来的收益。为验证这一机制,我们引入\textbf{风格对齐精炼},在保留Oracle逻辑修复的同时保持SLM的原生轨迹。这种干预降低了适应成本并恢复了下游效用。这些发现表明,有效的数学推理蒸馏应联合优化感知解质量和学习者-数据兼容性,而非仅依赖奖励模型得分。数据集和代码见https://github.com/Dracoqhl/Quality-Utility-Paradox。

英文摘要

Knowledge distillation from powerful reasoning models is widely used to improve Small Language Models (SLMs) on mathematical reasoning, often assuming that traces with higher reward model scores provide more useful supervision. We identify a counterintuitive \textbf{Quality-Utility Paradox} in mathematical reasoning distillation. Data refined or synthesized by a stronger Oracle obtains higher perceived quality according to reward models, yet consistently underperforms traces generated by the SLM itself and selected through rejection sampling across Qwen2.5, LLaMA-3, and DeepSeek families. Our analysis shows that Oracle refinement couples logical repair with distributional drift away from the SLM's native reasoning distribution. This drift increases the learner's adaptation cost and can outweigh the benefit of improved reasoning logic. To test this mechanism, we introduce \textbf{Style-Aligned Refinement}, which preserves the native trajectory of the SLM while retaining logical repair from the Oracle. This intervention lowers adaptation cost and restores downstream utility. These findings suggest that effective mathematical reasoning distillation should jointly optimize perceived solution quality and learner-data compatibility, rather than relying solely on reward-model scores. The datasets and code are available at https://github.com/Dracoqhl/Quality-Utility-Paradox.

2606.16210 2026-06-16 cs.AI 新提交

Sensor-Conditioned Representation Learning via Scene-Relevant Observation Quotients

基于场景相关观测商的传感器条件表示学习

Yan Jiao, Pin-Han Ho, Limei Peng

发表机构 * Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China(电子科技大学深圳高等研究院) Department of Electrical and Computer Engineering, University of Waterloo(滑铁卢大学电气与计算机工程系) School of Computer Science and Engineering, Kyungpook National University(庆北国立大学计算机科学与工程学院)

AI总结 提出场景相关观测商作为表示目标,通过OQ-TSAE框架分解场景与干扰因子,在传感器条件下保持可区分性,优于重建、度量学习和对比学习基线。

详情
AI中文摘要

智能传感系统中的学习表示通常通过重建保真度或下游预测精度来评估,但这些标准并未指定哪些潜在区分是由传感过程证明合理的。在传感器条件环境中,干扰因素可以在不改变场景的情况下改变测量值,而不同的场景在有限的传感能力下可能无法区分。本文形式化了传感器条件表示的正确性,即在抑制干扰引起的和传感器不支持的变化的同时,保留传感支持的场景区分。我们引入了场景相关观测商,一种由传感支持的可区分性在干扰规范化后诱导的表示目标,并开发了观测商塔克结构自编码(OQ-TSAE),一种具有假区分、假合并、干扰敏感性和潜在排序一致性诊断的场景-干扰因子分解框架。在受控基准上的实验表明,商一致监督在表示正确性诊断上优于面向重建、度量学习和对比学习的基线。敏感性、扰动和消融研究显示了商对齐监督、可靠商关系和商几何的重要性。互补的真实雷达实验表明,仅重建的OQ-TSAE变体保留了竞争性的下游效用、观测退化下的鲁棒性和低种子间变异性。这些结果表明,传感器条件表示不仅应通过预测效用评估,还应通过其潜在几何是否保留传感证明的场景区分来评估。

英文摘要

Learned representations in intelligent sensing systems are often evaluated by reconstruction fidelity or downstream prediction accuracy, but these criteria do not specify which latent distinctions are justified by the sensing process. In sensor-conditioned environments, nuisance factors can change measurements without changing the scene, while distinct scenes may be indistinguishable under limited sensing capability. This paper formulates sensor-conditioned representation correctness as preserving sensing-supported scene distinctions while suppressing nuisance-induced and sensor-unsupported variation. We introduce the scene-relevant observation quotient, a representation target induced by sensing-supported distinguishability after nuisance canonicalization, and develop Observation-Quotient Tucker-Structured Autoencoding (OQ-TSAE), a scene-nuisance factorized framework with diagnostics for false distinction, false merge, nuisance sensitivity, and latent ordering consistency. Experiments on a controlled benchmark show that quotient-consistent supervision improves representation-correctness diagnostics over reconstruction-oriented, metric-learning, and contrastive-learning baselines. Sensitivity, perturbation, and ablation studies show the importance of quotient-aligned supervision, reliable quotient relations, and quotient geometry. Complementary real-radar experiments show that a reconstruction-only OQ-TSAE variant retains competitive downstream utility, robustness under observation degradation, and low seed-to-seed variability. These results suggest that sensor-conditioned representations should be evaluated not only by predictive utility, but also by whether their latent geometry preserves sensing-justified scene distinctions.

2606.16222 2026-06-16 cs.AI cs.LG 新提交

Latent Thought Flow: Efficient Latent Reasoning in Large Language Models

潜在思维流:大型语言模型中的高效潜在推理

Xiandong Zou, Jing Huang, Jianshu Li, Pan Zhou

发表机构 * Singapore Management University(新加坡管理大学) Ant Group(蚂蚁集团)

AI总结 提出Latent Thought Flow (LTF)方法,将推理建模为可变长度连续轨迹,通过连续GFlowNet训练采样器匹配奖励后验,在提升准确率9.5%的同时平均减少推理长度27.2%。

详情
AI中文摘要

大型语言模型(LLMs)越来越依赖中间推理,然而显式的思维链(CoT)存在语言空间瓶颈:每个思维必须解码为token,导致高推理开销。潜在推理将思考过程转移到连续空间,但现有方法大多学习确定性或奖励最大化路径,缺乏在具有不同正确性和成本的轨迹间分配概率的原则性方法。我们提出潜在思维流(LTF),将推理建模为可变长度连续轨迹,并训练采样器以匹配由答案质量和计算成本定义的奖励诱导后验。我们使用具有随机潜在转移的连续GFlowNet实例化该方法。为处理稀疏答案监督,我们引入熵加权子轨迹平衡目标以获取中间奖励,以及参考先验正则化器以锚定探索。在微调和迁移学习设置下的实验表明,与强潜在推理基线相比,LTF在平均减少推理长度27.2%的同时,准确率提升9.5%,优于显式CoT和潜在推理基线。

英文摘要

Large Language Models (LLMs) increasingly rely on intermediate reasoning, yet explicit Chain-of-Thought (CoT) suffers from a linguistic space bottleneck: each thought must be decoded into tokens, causing high inference overhead. Latent reasoning moves deliberation into continuous space, but existing methods mostly learn deterministic or reward-maximizing paths, lacking a principled way to allocate probability across trajectories with different correctness and costs. We propose Latent Thought Flow (LTF), which models reasoning as variable-length continuous trajectories and trains a sampler to match a reward-induced posterior over answer quality and computation cost. We instantiate this with a continuous GFlowNet using stochastic latent transitions. To handle sparse answer supervision, we introduce an Entropy-Weighted Subtrajectory Balance objective for intermediate rewards and a reference-prior regularizer to anchor exploration. Experiments under finetuning and transfer learning settings show that LTF outperforms explicit CoT and latent reasoning baselines, improving accuracy by 9.5% while reducing reasoning length by 27.2% on average compared with strong latent reasoning baselines.

2606.16501 2026-06-16 cs.AI 新提交

Post-Hoc Merging is Not Enough: Many-Shot Model Merging with Loss-Gap Balancing

事后合并是不够的:基于损失差距平衡的多轮模型合并

Kyungjin Im, Miru Kim, Chanin Eom, Minhae Kwon

AI总结 提出METIS方法,通过迭代多轮合并和任务损失差距加权,解决多任务模型合并中的信息擦除问题,显著提升最差任务性能。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

模型合并已成为一种实用的训练后策略,通过组合多个任务专用模型来构建单一的多任务大语言模型(LLM)。然而,大多数现有方法依赖于事后合并,即任务专用模型在训练后仅合并一次。这种一次性聚合常常遭受任务干扰,导致跨单个任务的信息擦除。在这项工作中,我们表明用迭代的多轮合并协议取代事后合并能有效提升多任务性能。基于这一见解,我们提出了METIS(Mitigating Erasure from Task Interference for Stable many-shot merging),一种损失感知的多轮合并方法,通过任务级损失差距加权和基于共识的掩码来解决事后合并中的信息擦除问题。值得注意的是,METIS在最差性能任务上表现出显著的性能提升,有效缓解了信息擦除。(项目页面:https://imkyungjin.github.io/METIS/)

英文摘要

Model merging has become a practical post-training strategy for building a single multi-task large language model (LLM) by combining multiple task-specialized models. However, most existing approaches rely on post-hoc merging, in which task-specific models are merged only once after training. This one-shot aggregation often suffers from task interference, leading to information erasure across individual tasks. In this work, we show that replacing post-hoc merging with an iterative many-shot merging protocol is effective in improving multi-task performance. Building on this insight, we propose METIS, Mitigating Erasure from Task Interference for Stable many-shot merging. METIS is a loss-aware many-shot merging method that addresses information erasure in post-hoc merging through task-wise loss-gap weighting and consensus-based masking. Notably, METIS exhibits significant performance improvement on the worst-performing task, effectively mitigating information erasure. (Project page: https://imkyungjin.github.io/METIS/)

2606.16733 2026-06-16 cs.AI 新提交

A First-Principles Derivation of LLM Policy Optimization: From Expected Reward to GRPO and Its Structural Extensions

LLM策略优化的第一性原理推导:从期望奖励到GRPO及其结构扩展

Jianghan Shen, Siqi Luo, Yue Li, Jiyao Liu, Wanying Qu, Yi Zhang, Ziyan Huang, Tianbin Li, Ming Hu, Xiaohong Liu, Yirong Chen, Junjun He

发表机构 * Nanjing University(南京大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学) Peking University(北京大学) Fudan University(复旦大学) Nanjing University of Aeronautics and Astronautics(南京航空航天大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 本文从第一性原理出发,基于轨迹概率和奖励两个轴,统一分析了从REINFORCE、PPO到GRPO及其变体的LLM策略优化方法,揭示了设计选择背后的原理和复合失败模式。

详情
AI中文摘要

语言模型的策略梯度算法优化相同的目标 $J(θ) = \mathbb{E}_{τ\sim p_θ(τ)}[R(τ)]$,该目标恰好有两个因素:轨迹概率 $p_θ(τ)$ 和奖励 $R(τ)$。从REINFORCE到PPO再到GRPO及其后续方法,每种方法都修改其中一个或两个因素,以解决先前公式中的特定失败。现有调查按领域或时间顺序组织这些方法,这掩盖了每个设计选择背后的原理及其在梯度估计器中的精确干预位置。本调查从第一性原理出发,重新审视基于 $J(θ)$ 的LLM策略优化领域,并使用由 $p_θ(τ)$ 诱导的轨迹侧和由 $R(τ)$ 诱导的奖励侧作为定位方法的两条轴。它涵盖了从REINFORCE和PPO到GRPO的路径,以及GRPO后变体、Agentic RL和GRPO-OPD。由此产生的框架是统一的、诊断性的和可扩展的:它从共享目标分析方法,识别每种方法修改了哪一侧以及为什么,并在这些设置中应用相同的轨迹和奖励轴。在这些设置中,该框架还暴露了单侧修复无法解决的复合失败,因此需要轨迹侧和奖励侧的联合设计。该地图识别的边界情况和耦合失败标志着现有解决方案的极限,并为设计下一代LLM策略优化算法提供了原则性的起点。

英文摘要

Policy gradient algorithms for language models optimize the same objective $J(θ) = \mathbb{E}*{τ\sim p*θ(τ)}[R(τ)]$, which has exactly two factors: the trajectory probability $p_θ(τ)$ and the reward $R(τ)$. Every method from REINFORCE to PPO to GRPO and their descendants modifies one or both factors to address a specific failure in the preceding formulation. Existing surveys organize these methods by domain or chronology, which obscures the rationale behind each design choice and the precise location of its intervention within the gradient estimator. This survey revisits the landscape of LLM policy optimization from $J(θ)$ on first principles and uses the trajectory side, induced by $p_θ(τ)$, and the reward side, induced by $R(τ)$, as the two axes along which methods are located. It covers the path from REINFORCE and PPO to GRPO, as well as post-GRPO variants, Agentic RL, and GRPO-OPD. The resulting framework is unified, diagnostic, and extensible: it analyzes methods from a shared objective, identifies which side each method modifies and why, and applies the same trajectory and reward axes across these settings. Across these settings, the framework also exposes compound failures that no single-side fix resolves and that therefore require joint design of the trajectory side and the reward side. The boundary cases and coupled failures identified by this map mark where existing solutions run out and provide a principled starting point for designing the next generation of LLM policy optimization algorithms.

2606.16811 2026-06-16 cs.AI cs.CL 新提交

Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier

从最小标签扩展LLM推理:一种带有轻量级验证器的半监督框架

Keizo Kato, Chenhui Chu, Yugo Murawaki, Sado Kurohashi

发表机构 * Fujitsu Limited(富士通株式会社) Kyoto University(京都大学) National Institute of Informatics(国立信息学研究所)

AI总结 提出半监督框架,用轻量级推理正确性分类器和熵过滤从少量标注数据生成高质量伪推理链,在数学和视觉问答任务上达到10-15倍标注数据效果。

Comments LREC 2026. Section 3.3 is updated

详情
AI中文摘要

对于大型语言模型(LLMs)的发展,最近生成伪中间推理的方法取得了显著进展。但它们通常依赖大量正确标注的答案来评估推理质量。本文提出一种半监督框架,从最小监督中扩展推理学习,将推理验证本身转变为数据创建机制。我们仅在少量标注样本上训练一个轻量级推理正确性分类器,用于判断LLM生成的中间推理轨迹是否有效。此外,基于熵的置信度阈值过滤掉不可靠样本,剩余的高置信度推理轨迹用于微调模型。在可验证数学问题(Orca-Math子集)和基于视觉编程的图像场景图问答(GQA)上的实验表明,我们的方法达到了与使用10-15倍标注数据相当的准确率。消融分析证实,分类器和熵过滤对于可扩展且抗噪声的伪标签生成都是必不可少的。通过用轻量级推理验证替代昂贵的答案级监督,我们的方法为构建大规模推理资源提供了一条实用路径,并为未来从最小人工输入中学习的自主推理系统铺平了道路。

英文摘要

For the development of Large language models (LLMs), recent approaches to generating pseudo intermediate reasoning have shown remarkable progress. But they typically rely on large numbers of correctly annotated answers to assess reasoning quality. This paper presents a semi-supervised framework that scales reasoning learning from minimal supervision, turning reasoning verification itself into a data creation mechanism. We train a lightweight reasoning-correctness classifier on only a few labeled samples, which judges whether intermediate reasoning traces generated by an LLM are valid. Furthermore, an entropy-based confidence threshold filters out unreliable samples, and the remaining high-confidence reasoning traces are used to fine-tune the model. Experiments on Verifiable Math Problems (Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming show that our method achieves accuracy comparable to using 10-15x more labeled data. Ablation analyses confirm that both the classifier and entropy filtering are essential for scalable and noise-resistant pseudo-labeling. By replacing expensive answer-level supervision with lightweight reasoning verification, our method provides a practical path toward constructing large-scale reasoning resources and paves the way for future autonomous reasoning systems that learn from minimal human input.

2606.16923 2026-06-16 cs.AI stat.ML 新提交

MA-SBI: Misspecification-Aware Simulation-Based Inference via Side-Channel Guidance

MA-SBI: 通过侧信道引导的误设定感知仿真推断

Arunkumar V, Manoranjan Gandhudi, Gangadharan G. R., Arun Prakash, S. Senthilkumar

发表机构 * University College of Engineering, Anna University Tiruchirappalli(安娜大学蒂鲁吉拉伯利工程学院) Central University of Karnataka(卡纳塔克中央大学) National Institute of Technology Tiruchirappalli(蒂鲁吉拉伯利国立理工学院) School of Computer & Systems Sciences, Jawaharlal Nehru University(贾瓦哈拉尔·尼赫鲁大学计算机与系统科学学院)

AI总结 针对仿真模型误设定问题,提出无需校准的MA-SBI框架,利用侧信道文本信息进行后验校正,理论保证偏差减少界限,实验表明仅用文本即可匹配oracle后验。

Comments 23 pages, 9 figures, 12 tables

详情
AI中文摘要

潜在参数的仿真推断(SBI)常受仿真器误设定困扰,即由于固有的建模简化导致的仿真观测与真实观测之间的不匹配。最新的鲁棒SBI方法RoPE通过真实与仿真观测学习表示之间的最优传输来解决此问题,但需要真实参数校准对,而这在需要SBI的设置中通常不可用。实践者拥有的是非结构化侧信息,如制度标签、指令文本和政策公告。我们提出误设定感知仿真推断(MA-SBI),一个无需校准的框架,将侧信道转化为后验校正。学习到的校正器将侧信道文本映射到观测空间偏移,应用于任何预训练的摊销后验之前,无需重新训练也无需参数真实值。我们的主要定理通过误设定与侧信道之间的互信息界定了可实现的偏差减少,通过Donsker-Varadhan扩展到所有次高斯噪声的非平凡常数。在隐藏校准基准上,仅使用文本的MA-SBI在10个种子和两个骨干网络上匹配oracle后验(TOST等价),而使用更多数据的RoPE则不能。两种方法互补:当误设定是结构性的且可从参数对中恢复时,RoPE占优,正如理论所预测。随机变体在真实COVID和OxCGRT流行病学数据上提高了后验预测对数似然,并在一个良好设定的认知科学语料库上正确保持后验不变。

英文摘要

Simulation-based inference (SBI) of latent parameters is often hindered by simulator misspecification, the mismatch between simulated and real-world observations caused by inherent modeling simplifications. RoPE, the recent state-of-the-art for robust SBI, addresses this through optimal transport between learned representations of real and simulated observations, but requires ground-truth parameter calibration pairs that are typically unavailable in the very settings where SBI is needed. What practitioners do have is unstructured side-information such as regime labels, instruction text, and policy bulletins. We propose Misspecification-Aware Simulation-Based Inference (MA-SBI), a calibration-free framework that turns this side-channel into a posterior correction. A learned corrector maps side-channel text to an observation-space shift applied before any pre-trained amortized posterior, requiring no retraining and no parameter ground-truth. Our main theorem bounds achievable bias reduction by the mutual information between misspecification and side-channel, with a non-vacuous constant that extends to all sub-Gaussian noise via Donsker-Varadhan. On hide-the-calibration benchmarks, MA-SBI with text alone matches the oracle posterior across 10 seeds and two backbones (TOST equivalence), while RoPE given more data does not. The two approaches are complementary: where misspecification is structural and recoverable from parameter pairs, RoPE dominates, as the theory predicts. A stochastic variant improves posterior-predictive log-likelihood on real COVID and OxCGRT epidemiological data, and correctly leaves the posterior unchanged on a well-specified cognitive-science corpus.

2606.16925 2026-06-16 cs.AI 新提交

RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting

RAID: 面向真正冷启动和跨语言预测的语义图扩散

Arunkumar V, Manoranjan Gandhudi, Gangadharan G. R., Arun Prakash, S. Senthilkumar

发表机构 * University College of Engineering, Anna University Tiruchirappalli(安娜大学蒂鲁吉拉伯利工程学院) Central University of Karnataka(卡纳塔克中央大学) National Institute of Technology Tiruchirappalli(蒂鲁吉拉伯利国立理工学院) Jawaharlal Nehru University(贾瓦哈拉尔·尼赫鲁大学)

AI总结 针对时间序列基础模型在无历史数据时失效的问题,提出RAID框架,利用元数据语义检索和图条件扩散实现冷启动预测,在准确性和推理速度上超越现有方法,并支持零样本跨语言迁移。

Comments 25 pages, 4 figures, 8 tables

详情
AI中文摘要

时间序列基础模型在给定非空历史窗口时表现出强大的迁移性能。然而,真正的冷启动场景(新项目没有先前的观测数据)违背了这一假设。我们提出了RAID(检索增强迭代扩散)框架,该框架用元数据驱动的语义检索和图条件扩散取代了基于历史的相关性学习。RAID使用冻结的多语言嵌入模型将文本元数据映射到共享语义空间,并构建一个可自然扩展到未见项目的归纳检索图。它首先通过聚合语义相关邻居的信息形成基础预测,然后使用门控扩散模块细化该预测以建模残差不确定性。在严格的真正冷启动协议下,RAID在预测准确性和预测区间覆盖率上均优于强大的基础模型和竞争基线,同时通过非自回归解码将推理延迟降低一个数量级。共享语义空间还实现了零样本跨语言迁移,使得在英文描述上训练的模型能够泛化到其他语言描述的项目,而无需直接监督。

英文摘要

Time-series foundation models show strong transfer performance when given a non-empty history window. However, true cold-start scenarios, where a new item has no prior observations, violate this assumption. We propose RAID (Retrieval-Augmented Iterative Diffusion) a framework, which replaces history-based correlation learning with metadata-driven semantic retrieval and graph-conditioned diffusion. RAID maps textual metadata into a shared semantic space using a frozen multilingual embedding model and constructs an inductive retrieval graph that extends naturally to unseen items. It first forms a base forecast by aggregating information from semantically related neighbors, then refines this forecast with a gated diffusion module to model residual uncertainty. Under a strict true cold-start protocol, RAID outperforms strong foundation models and competitive baselines on both forecasting accuracy and prediction interval coverage, while reducing inference latency by an order of magnitude through non-autoregressive decoding. The shared semantic space also enables zero-shot cross-lingual transfer, allowing a model trained on English descriptions to generalize to items described in other languages without direct supervision.

2606.14708 2026-06-16 math.OC cs.AI 交叉投稿

PH-KAN: Port-Hamiltonian Kolmogorov-Arnold Network

PH-KAN:端口-哈密顿 Kolmogorov-Arnold 网络

Achraf El Messaoudi, Karim Cherifi, Yann Le Gorrec, Yongxin Wu

AI总结 提出基于 Kolmogorov-Arnold 网络的保结构非线性端口-哈密顿系统辨识框架,通过专用 KAN 块参数化各组件并显式施加约束,获得比标准 MLP 更可解释的模型。

详情
AI中文摘要

数据驱动的机器学习方法在非线性系统辨识中越来越有吸引力,但标准模型往往无法保持潜在的物理结构,且难以解释,尤其是在没有解析模型可用时。在此背景下,端口-哈密顿(pH)模型提供了一种自然的物理信息表示。然而,当这些模型使用标准多层感知器(MLP)参数化时,学习到的本构组件通常仍然难以解释。在本文中,我们提出了一种基于 Kolmogorov-Arnold 网络(KAN)的非线性端口-哈密顿系统的保结构辨识框架。所提出的 PH-KAN 模型使用专用 KAN 块参数化互连矩阵、耗散矩阵、哈密顿量和输入映射,同时通过构造强制满足端口-哈密顿约束。这产生了本构表示,其中定义所辨识 pH 组件的非线性函数可以被显式检查,从而得到比基于标准 MLP 的参数化更具可解释性的模型。

英文摘要

Data-driven machine learning approaches have become increasingly attractive for nonlinear system identification, but standard models often fail to preserve the underlying physical structure and remain difficult to interpret, especially when no analytical model is available. In this context, port-Hamiltonian (pH) models provide a natural physics-informed representation. However, when these models are parameterized with standard multilayer perceptrons (MLPs), the learned constitutive components often remain poorly interpretable. In this paper, we propose a structure-preserving identification framework for nonlinear port-Hamiltonian systems based on Kolmogorov-Arnold Networks (KANs). The proposed PH-KAN model parameterizes the interconnection matrix, dissipation matrix, Hamiltonian, and input mapping using dedicated KAN blocks, while enforcing the port-Hamiltonian constraints by construction. This yields constitutive representations in which the nonlinear functions defining the identified pH components can be explicitly inspected, leading to a more interpretable model than with standard MLP-based parameterizations.

2606.14732 2026-06-16 cs.CV cs.AI cs.LG cs.MM 交叉投稿

Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion

Steady-Forcing: 长时程自然视频扩散中空间持久性与运动连续性的平衡

Matiur Rahman Minar, Seunghun Oh, GangHyeon Jeong, Unsang Park

发表机构 * Department of Computer Science and Engineering, Sogang University(西江大学计算机科学与工程系) Department of Artificial Intelligence, Sogang University(西江大学人工智能系)

AI总结 提出Steady-Forcing框架,通过视觉锚点、运动记忆和蒸馏等技术,在长时程固定相机自然视频生成中平衡背景稳定与运动连续性,优于现有方法。

Comments Project page: https://minar09.github.io/steadyforcing/

详情
AI中文摘要

自回归视频扩散模型支持流式生成,但在长时程生成中常退化:静态场景布局漂移,而改善空间稳定性的机制往往抑制运动,导致水流、火焰或烟雾等自然流动停滞。我们研究了固定相机长时程自然视频生成中的这种稳定性-运动权衡,其中两种失败模式比移动相机设置更易区分。我们提出Steady-Forcing,一种结合持久视觉锚点(V-Sink)、指数移动平均运动记忆(EMA-Sink)、块相对时间编码、周期性缓存净化以及从Wan2.1-14B教师模型蒸馏(在任务聚焦配置下使用运动奖励先验)的记忆与训练框架。这些组件共同设计用于在数分钟的自回归生成中保持背景一致性,同时维持视觉上合理的流体动力学。在七个基线上的评估表明,Steady-Forcing改善了长时程背景一致性和成像质量,而盲用户研究显示更强的感知稳定性和运动连续性。基准评估进一步表明,通用的VBench聚合分数对固定相机伪影惩罚不足,同时将漂移引起的光流奖励为动态程度,而不直接惩罚纹理硬化或流动停滞——这激励了未来针对静态相机自然流动评估的任务特定基准。项目页面:https://minar09.github.io/steadyforcing/

英文摘要

Autoregressive video diffusion models enable streaming generation but often degrade over long rollouts: static scene layouts drift, while mechanisms that improve spatial stability tend to suppress motion, causing natural flows such as water, fire, or smoke to stagnate. We study this stability-motion trade-off in fixed-camera long-horizon nature video generation, where the two failure modes can be more clearly separated than in moving-camera settings. We propose Steady-Forcing, a memory and training framework combining a persistent visual anchor (V-Sink), an exponential moving-average motion memory (EMA-Sink), block-relative temporal encoding, periodic cache purification, and distillation from a Wan2.1-14B teacher with motion-rewarded priors under task-focused configurations. Together, these components are designed to preserve background identity while sustaining visually plausible fluid dynamics over multi-minute autoregressive rollouts. Evaluations across seven baselines show that Steady-Forcing improves long horizon background consistency and imaging quality, while a blind user study indicates stronger perceived stability and motion continuity. The benchmark evaluation further suggest that generic VBench aggregate scores under-penalize fixed-camera artifacts as well as rewarding drift-induced optical flow as Dynamic Degree while not directly penalizing texture hardening or flow stagnation - motivating future task-specific benchmarks for static-camera nature-flow evaluation. Project page: https://minar09.github.io/steadyforcing/

2606.14753 2026-06-16 cs.CV cs.AI 交叉投稿

Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

超越自注意力:用于快速图像描述的次二次视觉Transformer

Chiradeep Ghosh, Dakshina Ranjan Kisku

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) National Institute of Technology Durgapur(德里apur国立学院) Durgapur, India(印度德里apur)

AI总结 提出基于高斯混合模型和EM算法的概率Transformer,将自注意力复杂度从二次降至线性,在Flickr30K上实现高效图像描述。

Comments 8 pages, 8 figures

详情
AI中文摘要

图像描述是一项具有挑战性且重要的任务,旨在为给定图像生成连贯且语义有意义的文本描述。要完成此任务,需要对视觉内容有深入理解,并具备用自然语言表达这种理解的能力。尽管基于Transformer的架构取得了显著进展,现有方法仍存在局限性,例如缺乏丰富的局部特征表示以及二次自注意力的高计算成本。所提出的模型通过重构视觉Transformer架构,专注于提高计算效率。在设计该方法时,将Vision Transformer中的标准自注意力机制替换为基于高斯混合模型(GMM)的概率Transformer方法,这是一种软聚类技术。该模型不是计算所有图像块之间的成对注意力,而是使用期望最大化(EM)算法将相似块分组到固定数量的聚类中。这种基于聚类的机制将计算复杂度从二次O(n^2)降低到线性O(nK),其中K << n。自回归的GPT解码器用于生成描述。该模型在Flickr 30K数据集上进行了评估,显示出与现有工作相比具有竞争力和显著的改进。

英文摘要

Image captioning is a challenging and significant task that aims to generate coherent and semantically meaningful textual descriptions for given images. To accomplish this task, it requires a deep understanding of visual content along with the ability to express that understanding in natural language. Despite remarkable progress with transformer-based architectures, existing approaches often suffer from limitations, such as a lack of rich local feature representations and the high computational cost of quadratic self-attention. The proposed model focuses on improving computational efficiency by restructuring the vision transformer architecture. In designing this approach, the standard self-attention mechanism in Vision Transformers is replaced with a probabilistic transformer approach based on a Gaussian Mixture Model (GMM), a soft-clustering technique. Instead of computing pairwise attention among all image patches, the model groups similar patches into a fixed number of clusters using an Expectation-Maximization (EM) algorithm. This clustering-based mechanism reduces the computational complexity from quadratic O(n^2) to linear O(nK), where K << n. The autoregressive GPT-based decoder is used for caption generation. The model is evaluated on the Flickr 30K dataset, demonstrating competitive and significant improvement over existing works.

2606.14760 2026-06-16 cs.CV cs.AI 交叉投稿

GeoRoPE: Ground-Aware Rotary Adaptation for Remote Sensing Foundation Models

GeoRoPE: 面向遥感基础模型的地面感知旋转适配

Yu Luo, Kun Hu, Mengwei He, Xiaogang Zhu, Shan Zeng, Allen Benter, Wei Xiang, Patrick Filippi, Thomas Francis Bishop, Zhiyong Wang

发表机构 * The University of Sydney(悉尼大学) Edith Cowan University(埃迪斯科文大学) Adelaide University(阿德莱德大学) Wuhan Polytechnic University(武汉轻工大学) Climate, Orange Agricultural Institute(气候研究所,奥兰治农业研究所) La Trobe University(拉筹伯大学)

AI总结 提出GeoRoPE方法,通过地理坐标校准和频率校准解决遥感基础模型中的尺度失配问题,提升跨分辨率鲁棒性和尺度敏感表征学习。

详情
AI中文摘要

遥感基础模型(RSFMs)受益于在多传感器和地面采样距离(GSD)图像上的预训练,但仅凭这种暴露并不能解决下游适配过程中的尺度失配问题。固定的token网格偏移在不同传感器下可能对应不同的地面距离,使得基于网格的位置先验在物理上不一致。同时,异质空间粒度意味着紧凑的城市区域和均质景观即使在相同GSD下也可能需要不同的位置敏感性。因此,我们提出GeoRoPE,一种面向RSFMs的地面感知、RoPE兼容且参数高效的空间适配方法。GeoRoPE从两个互补方面重新校准token级位置交互。首先,地理坐标校准(GCC)根据一个token网格步长代表的地面距离重新缩放原始token网格偏移,产生跨GSD的地理校准相对坐标。其次,地理频率校准(GFC)使用关系特定因子调整原生RoPE频率,实现对场景依赖空间粒度的位置敏感适配。GeoRoPE通过轻量适配器注入预训练RSFM,在保持冻结空间先验的同时添加地理感知位置校正。在多个RSFM、传感器、分辨率和下游任务上的实验表明,GeoRoPE提升了跨分辨率鲁棒性和尺度敏感表征学习。

英文摘要

Remote-sensing foundation models (RSFMs) benefit from pretraining on imagery from multiple sensors and ground sampling distances (GSDs), but such exposure alone does not resolve scale mismatch during downstream adaptation. A fixed token-grid offset can correspond to different ground distances across sensors, making grid-based positional priors physically inconsistent. Meanwhile, heterogeneous spatial granularity means that compact urban regions and homogeneous landscapes may require different positional sensitivities even under the same GSD. Therefore, we propose {GeoRoPE}, a ground-aware, RoPE-compatible, and parameter-efficient spatial adaptation method for RSFMs. GeoRoPE recalibrates token-level positional interactions from two complementary aspects. First, \textit{Geo-Coordinate Calibration (GCC)} rescales raw token-grid offsets according to the ground distance represented by one token-grid step, producing geo-calibrated relative coordinates across GSDs. Second, \textit{Geo-Frequency Calibration (GFC)} adjusts the native RoPE frequency with a relation-specific factor, enabling position sensitive adaptation to scene-dependent spatial granularity. GeoRoPE is injected into pretrained RSFMs through a lightweight adapter, preserving the frozen spatial prior while adding geo-aware positional corrections. Experiments across multiple RSFMs, sensors, resolutions, and downstream tasks demonstrate that GeoRoPE improves cross-resolution robustness and scale-sensitive representation learning.

2606.14765 2026-06-16 cs.CV cs.AI cs.LG cs.MM 交叉投稿

Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

动量引导的语义预测(MoFore)用于自监督视频表示学习

Qinwu Xu

发表机构 * Qinwu Xu, PhD(秦武 Xu 博士)

AI总结 提出MoFore框架,通过预测未来潜在嵌入进行自监督视频表示学习,结合对比正则化防止表示崩溃,在UCF101上验证了时间一致性和语义结构。

Comments 13 pages, 5 Figures, and 2 Tables

详情
AI中文摘要

自监督视频表示学习最近通过对比学习、掩码重建和预测表示学习取得了进展。基于重建的方法如MAE和VideoMAE通过恢复掩码视觉内容来学习表示,而对比方法如CLIP通过表示对齐学习语义有意义的嵌入空间。在这项工作中,我们提出了一种动量引导的语义预测框架(MoFore)用于自监督视频表示学习。该方法不是优化像素级重建或任务特定的语义对齐,而是通过从时间上遥远的上下文片段预测未来的潜在嵌入来学习时间预测性视频表示。为了提高跨时间尺度的鲁棒性,我们进一步引入了训练期间的随机时间间隔预测。该框架将预测性潜在预测与对比正则化相结合,以鼓励时间一致性同时防止表示崩溃。在UCF101数据集上的实验表明,所提出的框架在训练期间不使用动作标签的情况下学习了时间一致且语义有意义的视频表示。定量分析显示学习到的嵌入空间具有强时间稳定性和涌现的类别级结构,而定性检索实验揭示了跨相关活动的运动感知组织。总体而言,结果表明长程潜在预测为自监督视频表示学习提供了一种有效且计算高效的方法,而不依赖于基于重建的目标。

英文摘要

Self-supervised video representation learning has recently advanced through contrastive learning, masked reconstruction, and predictive representation learning. Reconstruction-based approaches such as MAE and VideoMAE learn representations by recovering masked visual content \cite{he2022mae,tong2022videomae}, while contrastive methods such as CLIP learn semantically meaningful embedding spaces through representation alignment \cite{radford2021clip}. In this work, we introduce a Momentum-Guided Semantic Forecasting framework (MoFore) for self-supervised video representation learning. Instead of optimizing for pixel-level reconstruction or task-specific semantic alignment, the proposed method learns temporally predictive video representations by forecasting future latent embeddings from temporally distant context clips. To improve robustness across temporal scales, we further introduce randomized temporal-gap forecasting during training. The framework combines predictive latent forecasting with contrastive regularization to encourage temporal consistency while preventing representation collapse. Experiments on the UCF101 dataset demonstrate that the proposed framework learns temporally consistent and semantically meaningful video representations without using action labels during training. Quantitative analysis shows strong temporal stability and emergent category-level structure in the learned embedding space, while qualitative retrieval experiments reveal motion-aware organization across related activities. Overall, the results suggest that long-range latent forecasting provides an effective and computationally efficient approach for self-supervised video representation learning without relying on reconstruction-based objectives.

2606.14770 2026-06-16 cs.CV cs.AI cs.IR cs.LG 交叉投稿

An Empirical Analysis of Optimization Dynamics and Sparsity Boundaries in Large-Scale Pedestrian Attribute Recognition

大规模行人属性识别中的优化动态与稀疏边界实证分析

Houssam El Mir

发表机构 * College of Computer Science and Technology, Zhejiang University of Technology(浙江工业大学计算机科学与技术学院)

AI总结 针对行人属性识别中极端类别不平衡问题,提出多标签焦点损失校准配置(alpha=0.50, gamma=2.0),在零计算开销下匹配BCE基线并提升难例挖掘,同时识别出0.1%正样本率下的稀疏墙边界。

详情
AI中文摘要

行人属性识别(PAR)对于视频监控至关重要,支持法医搜索和重识别系统。当将PETA和PA-100K合并为一个包含109,000张图像的复合语料库时,极端类别不平衡仍然是一个基本障碍,其中少数属性的正样本比例低于1%。这导致标准BCE优化抑制稀有特征,我们称之为多数负类欺骗陷阱。我们在ResNet-18骨干网络上对多标签焦点损失超参数(alpha和gamma)进行了系统消融。校准配置(alpha=0.50, gamma=2.0)实现了62.32%的宏F1分数,与BCE基线相当,同时保留了优越的难例挖掘和收敛动态。我们的方法使用纯损失函数工程,边缘部署零计算开销。我们识别出稀疏墙,这是一个硬边界,当正样本比例低于0.1%时,全局损失重新加权失效,需要实例级干预。

英文摘要

Pedestrian Attribute Recognition (PAR) is critical for video surveillance, enabling forensic search and re-identification systems. Extreme class imbalance remains a fundamental obstacle when merging PETA and PA-100K into a 109,000-image composite corpus, where minority attributes have positive sample fractions below 1%. This causes standard BCE optimization to suppress rare traits, a phenomenon we term the majority negative class cheating trap. We present a systematic ablation of Multi-Label Focal Loss hyperparameters (alpha and gamma) on a ResNet-18 backbone. A calibrated configuration (alpha=0.50, gamma=2.0) achieves a Macro F1-score of 62.32%, matching BCE baseline while preserving superior hard-example mining and convergence dynamics. Our approach uses pure loss-function engineering with zero computational overhead for edge deployment. We identify the Sparsity Wall, a hard boundary where positive sample fractions below 0.1% make global loss reweighting ineffective, requiring instance-level intervention.

2606.14773 2026-06-16 cs.CV cs.AI 交叉投稿

Double-Helix Vision (DH-V2): A Geometry-Based Visual Sampler for Bandwidth-Constrained Perception

双螺旋视觉 (DH-V2):一种基于几何的带宽受限感知视觉采样器

Jinwen Wen

发表机构 * Independent Researcher(独立研究者)

AI总结 提出双螺旋视觉(DH),一种基于黄金比例螺旋轨迹的几何采样器,将2D图像压缩为1D信号,实现1433倍压缩比,在CPU上0.52ms完成感知,CIFAR-10上准确率提升6.03%。

Comments 5 pages, 3 figures, 5 tables. Code and benchmarks: https://github.com/JackJ-C/double-helix-vision-tool

详情
AI中文摘要

我们提出双螺旋视觉(DH),一种基于几何的视觉采样器,利用成对的黄金比例启发螺旋轨迹将2D图像压缩为紧凑的1D信号。DH不是均匀处理每个像素,而是采用两个相位偏移的螺旋(Alpha和Beta,偏移180度)以生物启发的中央凹方式采样图像:中心高密度,外围稀疏覆盖。在4K分辨率下,DH实现了1433倍压缩比(减少99.93%),同时保留场景的几何结构。完整的感知流水线——包括空间映射、时间碰撞检测和帧内结构视差估计——在仅CPU硬件上以1080p分辨率运行仅需0.52毫秒,无需神经网络依赖。在CIFAR-10上,在极端采样预算下(每个螺旋K=128个点),DH比均匀随机采样获得了+6.03%的准确率提升。提供了一个可序列化为JSON的机器人API,以2.7 KB的数据包提供亚毫秒级空间感知报告。代码和基准测试在MIT许可下提供。

英文摘要

We present Double-Helix Vision (DH), a geometry-based visual sampler that compresses 2D images into compact 1D signals using paired golden-ratio-inspired spiral trajectories. Rather than processing every pixel uniformly, DH employs two phase-shifted helices (Alpha and Beta, offset by 180 degrees) to sample the image with biologically-inspired foveation: high density at the center, sparse coverage at the periphery. At 4K resolution, DH achieves a 1,433x compression ratio (99.93% reduction) while preserving the geometric structure of the scene. The full perception pipeline -- including spatial mapping, temporal collision detection, and intra-frame structural disparity estimation -- runs in 0.52 ms at 1080p on CPU-only hardware, with no neural network dependencies. On CIFAR-10 at extreme sampling budgets (K=128 points per helix), DH achieves a +6.03% accuracy gain over uniform random sampling. A JSON-serializable Robotics API is provided, delivering sub-millisecond spatial perception reports in 2.7 KB packets. Code and benchmarks are available under the MIT License.

2606.14792 2026-06-16 cs.CV cs.AI 交叉投稿

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

基于离散扩散模型的视觉-文本思维高效强化学习

Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji

发表机构 * KAIST(韩国科学技术院) Sony AI(索尼AI) AITRICS Sony Group Corporation(索尼集团公司)

AI总结 提出用离散扩散模型替代自回归模型进行多模态强化学习,通过局部视觉编辑减少计算量,并设计分解奖励分配策略解决跨模态干扰问题。

详情
AI中文摘要

基于强化学习的后训练已被广泛采用,以在能够同时进行文本和图像生成的统一多模态模型中实现交错视觉和文本推理。然而,大多数现有方法建立在自回归统一模型上,在视觉推理过程中需要完整的图像再生。在这项工作中,我们证明多模态离散扩散模型是自回归模型在交错推理中进行强化学习的有效替代方案,因为它们能够通过局部视觉编辑而非完整的图像令牌再生来执行高效的视觉展开。与自回归基线相比,这使GRPO期间的展开计算减少了26.9%,且性能下降极小。尽管效率提高,我们发现联合奖励分配(在模态间使用共享奖励信号)在RL更新期间会在不相关的图像和文本令牌序列之间引入跨模态干扰。为解决此问题,我们提出分解奖励分配策略,该策略独立地为文本和视觉片段分配奖励。采用分解奖励分配后,我们的RL方法相比联合奖励分配提高了11.2%,相比基础模型提高了38.04%。

英文摘要

RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9\% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.

2606.14822 2026-06-16 quant-ph cs.AI 交叉投稿

Quantum Machine Learning for Industrial Applications

量子机器学习在工业中的应用

Léo Monbroussou

发表机构 * Sorbonne Université(索邦大学) LIP6(LIP6实验室) Naval Group(海军集团) IRIF(IRIF研究院) CNRS(法国国家科学研究中心)

AI总结 研究量子机器学习在工业中的潜力,解决变分量子电路的训练性、表达性和抗经典模拟能力,提出无贫瘠高原的理论保证和多项式量子优势算法。

Comments PhD thesis

详情
Journal ref
Sorbonne University, EDITE doctoral school, LIP6 laboratory, 2025
AI中文摘要

机器学习的最新进展已经改变了众多工业领域,但经典范式面临根本性限制:快速增长的数据量、不断上升的计算成本、显著的能源消耗以及传统硬件架构的物理缩放极限。量子计算已成为应对这些挑战的一种有前景的计算范式,催生了量子机器学习(QML)领域。本文研究了QML的理论基础,重点关注近期和未来的实际应用。解决了三个核心挑战:变分量子电路的可训练性、表达性以及抵抗高效经典模拟的能力。首先研究了汉明重量保持变分量子电路的可训练性,并建立了理论保证,解决了关于该电路族不存在贫瘠高原的开放猜想。然后引入了子空间保持的QML算法,包括光子电路和量子卷积神经网络,旨在模仿经典ML子程序,同时提供多项式量子优势。最后,将变分量子电路分析为量子傅里叶模型,并推导出一个框架来共同表征表达性和可训练性,从中获得了量子模型可证明与其经典对应物分离的条件。这些贡献旨在推进在现实世界应用中利用近期和未来量子技术的理论路线图。

英文摘要

Recent advances in Machine Learning have transformed numerous industrial sectors, yet classical paradigms face fundamental limitations: rapidly growing data volumes, rising computational costs, significant energy consumption, and the physical scaling limits of conventional hardware architectures. Quantum computing has emerged as a promising computational paradigm to address these challenges, giving rise to the field of Quantum Machine Learning (QML). In this thesis, the theoretical foundations of QML are investigated, with a focus on near-term and future practical applications. Three central challenges are addressed: the trainability of variational quantum circuits, their expressivity, and their resistance to efficient classical simulation. The trainability of Hamming-weight preserving variational quantum circuits is first studied, and theoretical guarantees are established that resolve an open conjecture on the absence of barren plateaus for this circuit family. Subspace-preserving QML algorithms are then introduced, including photonic circuits and quantum convolutional neural networks, and are designed to mimic classical ML subroutines while offering polynomial quantum advantage. Finally, variational quantum circuits are analyzed as quantum Fourier models, and a framework is derived to jointly characterize expressivity and trainability, from which conditions are obtained under which quantum models provably separate from their classical counterparts. These contributions are intended to advance the theoretical roadmap for harnessing near-term and future quantum technologies in real-world applications.

2606.14865 2026-06-16 cs.LG cs.AI 交叉投稿

GRAPE: Guided Parameter-Space Evolution for Compact Adversarial Robustness

GRAPE: 面向紧凑对抗鲁棒性的引导式参数空间演化

Zhiyuan Ye, Xiangyu Zhou, Ji Qi, Hao Zhang, Yi Zhou

发表机构 * University of Science and Technology of China(中国科学技术大学) China Mobile (Suzhou) Software Technology Co., Ltd.(中移(苏州)软件技术有限公司)

AI总结 提出GRAPE框架,通过逐步暴露参数空间并利用对抗谱利用分数引导容量分配,在固定计算预算下提升紧凑模型的对抗鲁棒性,在CIFAR-10上以1.009倍FLOPs将PGD-20鲁棒准确率从51.70%提升至56.94%,参数减少21.4%。

详情
AI中文摘要

对抗训练(AT)提高了神经网络的鲁棒性,但大多数方法从一开始就训练固定的参数空间。本文探讨了参数变得可优化的顺序是否会影响最终的鲁棒解,即使最终架构或计算预算被控制。我们提出了GRAPE(引导式参数空间演化),一种面向紧凑对抗鲁棒性的训练框架。GRAPE结合了参数空间稳定化与渐进式隐藏扩展:它在当前暴露空间中稳定鲁棒优化,逐步释放新的可优化维度,并使用对抗谱利用分数引导新释放的容量流向高压模块。与固定结构的AT相比,GRAPE将鲁棒模型学习视为一个渐进式参数空间暴露和演化的过程。在CIFAR-10上的标准$\ell_\infty$威胁模型下,以固定结构ResNet-18 AT作为对照参考,GRAPE在几乎匹配的计算预算下(FLOPs比率为1.009倍)将PGD-20鲁棒准确率从51.70%提升至56.94%,同时参数数量减少约21.4%。一个具有相同最终ResNet-18架构的序列增长变体达到了56.52%的PGD-20鲁棒准确率,表明增益不仅来自最终架构差异,还来自参数空间暴露路径。这些结果表明,引导式参数空间演化可以在匹配计算条件下产生紧凑且鲁棒的参数配置。

英文摘要

Adversarial Training (AT) improves neural network robustness, but most methods train a fixed parameter space from the start. This paper asks whether the order in which parameters become optimizable can affect the final robust solution, even when the final architecture or computation budget is controlled. We propose GRAPE, Guided Parameter-Space Evolution, a training framework for compact adversarial robustness. GRAPE combines parameter-space stabilization with progressive hidden expansion: it stabilizes robust optimization in the currently exposed space, gradually releases new optimizable dimensions, and uses an adversarial spectral utilization score to guide newly released capacity toward high-pressure modules. In contrast to fixed-structure AT, GRAPE treats robust model learning as a process of progressive parameter-space exposure and evolution. Under the standard $\ell_\infty$ threat model on CIFAR-10, with fixed-structure ResNet-18 AT as a controlled reference, GRAPE improves PGD-20 robust accuracy from 51.70% to 56.94% at a nearly matched computation budget with a FLOPs ratio of 1.009x, while reducing parameter count by about 21.4%. A sequential grow variant with the same final ResNet-18 architecture reaches 56.52% PGD-20 robust accuracy, indicating that the gain is not only due to final architecture differences but also to the parameter-space exposure path. These results suggest that guided parameter-space evolution can yield compact and robust parameter configurations under matched computation.

2606.14929 2026-06-16 cs.LG cs.AI stat.ML 交叉投稿

Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

嵌入模型路由的策略遗憾:具有低秩专家的上下文赌博机

Yan Dai, Negin Golrezaei, Patrick Jaillet

发表机构 * Operations Research Center, MIT(麻省理工学院运筹学研究中心) Sloan School of Management, MIT(麻省理工学院斯隆管理学院) Department of EECS, MIT(麻省理工学院电气工程与计算机科学系)

AI总结 针对推荐系统中嵌入模型路由问题,形式化为具有低秩专家的对抗性上下文线性赌博机,提出Hypentropy策略梯度算法,实现$\tilde{\mathcal O}(s\sqrt{M T})$线性化策略遗憾。

详情
AI中文摘要

现代推荐系统越来越依赖于将多样化的查询动态路由到多个嵌入模型。尽管具有实际意义,但在对抗性查询、赌博机反馈和模型有限可观测性等现实条件下,该问题仍未得到充分理解。我们将嵌入模型路由形式化为具有低秩专家的对抗性上下文线性赌博机,其中上下文是查询,动作是物品,专家是在低秩潜在表示空间上工作的嵌入模型。我们首先证明,标准遗憾概念存在结构错误指定或统计难解性,并确定了一个对数二次策略类,它足够表达以捕获查询相关的模型路由,同时又足够结构化以允许高效的在线学习。其次,我们提出了一种称为Hypentropy策略梯度(HPG)的策略梯度算法。它在不完全信息下可证明地适应未知的低秩结构,并达到$\tilde{\mathcal O}(s\sqrt{M T})$线性化策略遗憾——其中$s$、$M$和$T$分别是专家的内在秩、模型数量和轮数——从而避免了维度灾难。最后,我们还提供了HPG的计算高效且无需参数调整的实现。

英文摘要

Modern recommendation systems increasingly rely on dynamically routing diverse queries to multiple embedding models. Despite its practical significance, this problem remains poorly understood under realistic conditions like adversarial queries, bandit feedback, and limited observability of models. We formalize embedding model routing as an adversarial contextual linear bandit with low-rank experts, where contexts are queries, actions are items, and experts are the embedding models working on low-rank latent representation spaces. We first establish that standard regret notions suffer from structural misspecification or statistical intractability, and we identify a log-quadratic policy class that is expressive enough to capture query-dependent model routing, yet structured enough to allow efficient online learning. Second, we propose a policy gradient algorithm called Hypentropy Policy Gradient (HPG). It provably adapts to the unknown low-rank structure under incomplete information and attains $\tilde{\mathcal O}(s\sqrt{M T})$ linearized policy regret -- where $s, M$, and $T$ are the intrinsic rank of the experts, the number of models, and the number of rounds -- thus avoiding a curse of dimensionality. Finally, we also provide an computationally efficient and parameter-free implementation of HPG.

2606.14934 2026-06-16 cs.LG cs.AI 交叉投稿

Separable Neural Architectures as Physical World Models: from Mathematical Theory to Applications

可分离神经架构作为物理世界模型:从数学理论到应用

Reza T Batley, Andrew Kichline, Sourav Saha

发表机构 * Kevin T. Crofton Department of Aerospace and Ocean Engineering, Virginia Polytechnic Institute and State University(弗吉尼亚理工大学凯文·T·克罗夫顿航空航天与海洋工程系)

AI总结 提出可分离神经架构(SNA),结合神经逼近与张量分解,通过变分框架求解偏微分方程,实现高维问题代数级缩放,并在工程案例中取得显著加速。

详情
AI中文摘要

本文介绍了可分离神经架构(SNA),这是一种结合神经逼近与张量分解的函数表示类。SNA将局部坐标函数(原子)与由稀疏低秩交互对象控制的全局相互作用解耦。该架构具有紧凑且平滑的归纳偏置,非常适合求解偏微分方程(PDE)。当在变分SNA(VSNA)框架下被视为Galerkin试验空间时,该公式满足Lax-Milgram下的经典变分保证:适定性、拟最优性、收敛性和稳定性。在高维时空-参数PDE中,VSNA通过代数级而非指数级缩放来缓解维数灾难。利用完全分解的、张量原生的交替最小二乘(ALS)优化框架,可将此成本降低至维度线性。VSNA在椭圆、双曲和抛物系统中得到验证,显示出与预测的代数谱缩放率高度一致。我们通过两个工程案例研究展示了SNA作为“一次求解,随处查询”的物理世界模型:一个7维参数化制造模拟和一个用于Inconel 718的实验性热-属性反演流程。VSNA在标准笔记本电脑CPU上102秒内执行了1,000,000次蒙特卡洛扫描,相比基于NVIDIA A100 GPU的全网格有限元基线实现了150,000倍加速。它还能在100毫秒内实现实时生成式逆模态重建。这些结果表明,SNA可作为连续参数流形的紧凑数学基础,实现实时反演、优化循环和快速不确定性传播。

英文摘要

This work introduces the Separable Neural Architecture (SNA), a function representational class combining neural approximation with tensor decomposition. The SNA decouples localized coordinate functions (atoms) from global interactions governed by a sparse, low-rank interaction object. This architecture possesses a compact and smooth inductive bias well-suited for solving partial differential equations (PDEs). When viewed as a Galerkin trial space under the variational SNA (VSNA) framework, the formulation satisfies classical variational guarantees under Lax-Milgram: well-posedness, quasi-optimality, convergence, and stability. In high-dimensional spatiotemporal--parametric PDEs, the VSNA mitigates the curse of dimensionality by scaling algebraically rather than exponentially. Exploiting an entirely factorized, tensor-native alternating least squares (ALS) optimization framework reduces this cost to linear in dimension. The VSNA is validated across elliptic, hyperbolic, and parabolic systems, demonstrating close alignment with predicted algebraic and spectral scaling rates. We showcase the SNA as a "solve once, query anywhere" physical world model via two engineering case studies: a 7D parametric manufacturing simulation and an experimental thermal-to-property inversion pipeline for Inconel 718. The VSNA executes a 1,000,000-query Monte Carlo sweep in 102s on a standard laptop CPU, yielding a 150,000x speedup over a full-grid finite element baseline hosted on an NVIDIA A100 GPU. It further enables real-time generative inverse-mode reconstructions under 100ms. These results demonstrate that the SNA serves as a compact mathematical substrate for continuous parameter manifolds to enable real-time inversion, optimization loops, and rapid uncertainty propagation.

2606.14971 2026-06-16 cs.LG cs.AI 交叉投稿

FastMix: Fast Data Mixture Optimization via Gradient Descent

FastMix: 通过梯度下降实现快速数据混合优化

Haoru Tan, Sitong Wu, Yanfeng Chen, Jun Xia, Ruobing Xie, Bin Xia, Xingwu Sun, Xiaojuan Qi

发表机构 * University of Hong Kong(香港大学) Tencent(腾讯) Chinese University of Hong Kong(香港中文大学)

AI总结 提出FastMix框架,将数据混合选择重新表述为双层优化问题,通过联合优化混合系数和模型参数,实现高效、可扩展的数据混合发现,在预训练和后训练中均优于基线方法且大幅降低搜索成本。

详情
Journal ref
ICLR-2026
AI中文摘要

虽然大规模和多样化的数据集推动了大型模型的最新进展,但确定预训练和后训练的最佳数据混合仍然是一个重要的开放问题。我们通过FASTMIX应对这一挑战,这是一个新颖的框架,在仅训练单个代理模型的同时自动发现数据混合。FASTMIX不依赖预定义的启发式方法或资源密集型模拟,而是联合优化混合系数和模型参数,显著提高了相对于先前方法的效率和可扩展性。FASTMIX的核心是将混合选择重新表述为一个双层优化问题。在这种重新表述下,我们证明优化混合比例在数学上等价于在均匀源采样下分配每个源的损失权重。这将混合系数直接嵌入到可微分的迭代优化目标中,从而能够对混合和模型进行高效的基于梯度的优化。为了解决优化问题,FASTMIX实现了一个近似迭代优化过程,交替进行(i)根据当前混合比例对采样的数据更新模型参数(内循环)和(ii)基于验证反馈更新混合比例(外循环)。在预训练和后训练中,FASTMIX均优于基线方法,同时大幅降低了搜索成本。代码见 https://github.com/hrtan/fastmix

英文摘要

While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FASTMIX, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FASTMIX jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FASTMIX is a reformulation of mixture selection as a bilevel optimization problem. Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem, FASTMIX implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop). Across pre- and post-training, FASTMIX outperforms baselines while drastically reducing search cost. Code (https://github.com/hrtan/fastmix)

2606.14975 2026-06-16 cs.NE cs.AI cs.LG physics.data-an q-bio.NC 交叉投稿

Harnessing cortical geometry, wiring, and function as inductive biases for recurrent neural networks

利用皮层几何、连接和功能作为循环神经网络的归纳偏置

Mo Shakiba, Rana Rokni, Mohammad Mohammadi, Nima Dehghani

发表机构 * Neuromatch Academy, Neuromatch, Inc., USA(Neuromatch学院,Neuromatch公司,美国) McGovern Institute for Brain Research, Massachusetts Institute of Technology (MIT)(麦戈文脑科学研究所,麻省理工学院(MIT))

AI总结 本研究利用MICrONS项目数据,通过神经元空间坐标、解剖连接和功能关系初始化循环权重并施加空间约束,构建生物基础循环神经网络,在认知决策任务中优于基线模型,并发展出低熵、模块化和小世界组织。

详情
AI中文摘要

皮层的连接和功能组织如何塑造循环计算仍然是神经科学和机器学习中的一个核心问题。在这里,我们利用通过皮层网络机器智能(MICrONS)项目发布的数据——一个涵盖小鼠视觉皮层多个区域的功能连接组学资源,其中密集钙成像与同一动物的高分辨率电子显微镜重建共同配准——来构建生物基础的循环神经网络。使用来自近12,000个共同配准的兴奋性神经元的神经元空间坐标、解剖连接和功能衍生关系,我们初始化循环权重并在学习过程中施加通信感知的空间约束。在三个认知决策任务中,受皮层结构和功能约束的网络始终优于基线和部分约束模型。功能权重初始化提供了最大的增益,而真实空间嵌入在多种条件下产生了稳健的额外改进。这些生物基础网络还发展出低熵、模块化和小世界组织,并且即使当循环被限制为正权重时也能保持强劲性能。总之,我们的结果表明,皮层的机制——其几何、连接和功能结构——可以作为构建循环网络的强大归纳基础,这些网络学习更有效,同时收敛于生物计算的关键组织原则。

英文摘要

How the wiring and functional organization of cortex shape recurrent computation remains a central question in both neuroscience and machine learning. Here, we leverage data released through the Machine Intelligence from Cortical Networks (MICrONS) program--a functional connectomics resource spanning multiple areas of mouse visual cortex, in which dense calcium imaging is co-registered with high-resolution electron microscopy reconstruction from the same animal--to build biologically grounded recurrent neural networks. Using neuronal spatial coordinates, anatomical connectivity, and function-derived relationships from nearly 12,000 coregistered excitatory neurons, we initialize recurrent weights and impose communication-aware spatial constraints during learning. Across three cognitive decision-making tasks, networks constrained by cortical structure and function consistently outperform baseline and partially constrained models. Functional weight initialization provides the largest gain, while real spatial embedding yields robust additional improvements across conditions. These biologically grounded networks also develop low-entropy, modular, and small-world organization, and retain strong performance even when recurrence is restricted to positive weights. Together, our results show that the machinery of cortex--its geometry, wiring, and functional structure--can be harnessed as a powerful inductive basis for building recurrent networks that learn more effectively while converging toward key organizational principles of biological computation.

2606.15015 2026-06-16 cs.CV cs.AI 交叉投稿

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

NEXUS: 用于物理一致的高接触3D物体动力学的神经能量场

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Yixiong Jing

发表机构 * University of Oxford(牛津大学) University of Cambridge(剑桥大学)

AI总结 提出神经能量场框架NEXUS,通过标量能量和耗散项建模保守与非保守动力学,提升高接触3D场景下的长时程轨迹精度并指导视频生成。

Comments 18 pages, 4 figures, 6 tables. Preprint

详情
AI中文摘要

基于物理的视频生成需要可控的3D物体动力学,这些动力学在接触、变形和外力作用下保持物理一致性。现有的基于轨迹的方法通常建模孤立的物理效应,难以在高接触3D场景中组合保守和非保守动力学。我们提出NEXUS,一个用于高接触3D物体动力学的神经能量场框架。NEXUS将每个物体表示为结构图,并构建动态的物体-物体和物体-环境接触图。受哈密顿神经网络启发,NEXUS通过标量能量和耗散项而非直接预测状态或加速度来公式化运动。保守效应(包括重力和弹性变形)被组合为加性能量项,而非保守效应(如阻尼和冲击引起的能量损失)则通过学习的瑞利型耗散建模。力通过对能量和耗散函数求导得到,并通过多子步半隐式积分器进行演化。在受控轨迹基准测试中,NEXUS在不同力学属性和物理效应组合下,相较于代表性的学习和物理结构化动力学基线,提高了长时程精度。我们进一步展示NEXUS轨迹为高接触视频生成提供了有效指导,在保持竞争性视觉质量的同时提高了物理合理性。

英文摘要

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.

2606.15055 2026-06-16 cs.CV cs.AI 交叉投稿

Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting

通过视觉-语义枢轴终身学习弥合城市街景推理中的地理偏差

Xinze Zhang

发表机构 * University of Southern California(南加州大学)

AI总结 提出HVSP-LL终身学习框架,通过分层视觉-语义枢轴模块和公平感知重放机制,在跨城市街景推理中减少地理偏差,实现城市间感知差距缩小38%。

详情
AI中文摘要

城市街景的视觉感知支撑着景观规划、公共卫生和场所营造中的循证决策。然而,在少数拍摄良好的大都市上训练的模型会系统性地误判代表性不足的地区,将地理偏差传播到下游政策中。我们通过HVSP-LL(一种终身学习框架)解决了这一差距,该框架将分层视觉-语义枢轴模块与公平感知重放机制相结合。枢轴模块沿三层本体(宏观结构、中观组成、微观元素)组织景观概念,并将图像特征与每层可学习的语义锚点对齐,提供抵抗分布漂移的可迁移表示。终身适应组件顺序吸收新的城市区域,同时通过最差区域样本重新加权目标和结构感知示例缓冲区约束区域间感知差距。我们在一个由四大洲十二个城市和七个感知维度组成的全景街景基准上评估了HVSP-LL。该框架在保留城市序列上达到0.834的斯皮尔曼相关系数,比最强的持续基线绝对提高了6.1个百分点,并将城市间感知差距缩小到0.094——相对于最强的持续基线(0.151)减少了38%,相对于代表性的正则化基线(0.218)减少了57%。消融实验证实,枢轴层次结构的每一层都有单调贡献,公平感知重放将平均反向迁移从-0.038(无保留)转换为+0.013,消除了保留序列上的灾难性遗忘。我们的结果表明,分层锚定是实现城市尺度地理公平街景推理的实用途径。

英文摘要

Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making. Yet models trained on a few well-photographed metropolises systematically misjudge underrepresented districts, propagating geographic bias into downstream policy. We address this gap with HVSP-LL, a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier, providing transferable representations that resist distributional drift. The lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. We evaluate HVSP-LL on a panoramic streetscape benchmark assembled from twelve cities across four continents and seven perceptual dimensions. The framework attains 0.834 Spearman correlation on the held-out city sequence, an absolute 6.1 point improvement over the strongest continual baseline, and shrinks the inter-city perception gap to 0.094 -- a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Ablations confirm that each tier of the pivoting hierarchy contributes monotonically, and the equity-aware rehearsal converts mean backward transfer from -0.038 (without retention) to +0.013, eliminating catastrophic forgetting on the held-out sequence. Our results indicate that hierarchical anchoring is a practical pathway toward geographically equitable streetscape inference at city scale.

2606.15134 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

超越标量距离:来自冻结MLLM的语义属性梯度用于视觉嵌入

Shubhang Bhatnagar, Dheeraj Baiju, Narendra Ahuja

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出SAGA框架,利用冻结的多模态大语言模型(MLLM)通过GRPO奖励机制为视觉编码器提供属性级监督,替代传统标量距离,提升零样本图像检索性能。

详情
AI中文摘要

用于检索的视觉编码器通常通过类标签监督进行训练:每个训练对简化为一个标量,均匀地将嵌入推远或拉近,就好像每个视觉属性要么不同要么匹配。一个多模态大语言模型(MLLM),在展示相同的一对图像时,能够阐述这些属性并利用它们预测图像是否共享一个类别。我们提出\textbf{SAGA},一个框架,将这种基于语言、属性感知的感知转化为编码器本身的训练信号。具体来说,我们使用组相对策略优化(GRPO)来奖励MLLM对视觉编码器令牌的正确预测。由于正确的预测要求这些令牌暴露该对之间不同或匹配的具体属性,梯度推动编码器编码这些属性,用属性解析的监督取代统一的成对标量。一个辅助的注意力蒸馏损失将编码器的嵌入锚定到MLLM关注的令牌上,一个标准的度量学习损失塑造嵌入几何结构以进行最近邻检索。MLLM在整个过程中被冻结,在推理时被丢弃,与度量学习基线的部署成本相匹配。在CUB-200-2011、Cars-196、FGVC-Aircraft和iNaturalist Aves上的零样本图像检索中,SAGA在Recall@1上比最先进的基线提高了3到6个百分点。

英文摘要

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose \textbf{SAGA}, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves on zero-shot image retrieval.

2606.15157 2026-06-16 cs.LG cs.AI 交叉投稿

PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression

PolyKV: 异构保留与分配用于KV缓存压缩

Chao Fei, Panos Kalnis

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 针对长上下文大模型推理中KV缓存压缩问题,提出PolyKV框架,通过层级别信号为每层选择合适压缩策略并分配非均匀缓存预算,实验表明在固定预算下显著恢复性能差距。

详情
AI中文摘要

KV缓存压缩对于减少长上下文大语言模型推理的内存成本至关重要。然而,现有方法通常在所有Transformer层上应用单一的压缩策略和统一的缓存预算。这种统一设计忽略了不同层在预填充和解码过程中可能扮演不同角色,因此可能需要不同的驱逐策略和缓存容量。我们提出了PolyKV,一种逐层KV缓存优化框架,考虑了方法选择和预算分配的设计空间。PolyKV基于层级别信号将每层路由到合适的KV压缩策略,同时在固定总预算下分配非均匀预算。这种公式化实现了现有KV缓存方法的异构组合。在LLaMA-3.1-8B和Qwen3-8B上的实验表明,在相同的512 token平均KV预算下,PolyKV分别恢复了最强单策略基线与FullKV之间LongBench性能差距的54.5%和25.7%。在128-1024预算范围内,PolyKV持续比最强基线提升1.7%-6.4%,对应FullKV差距的40.0%-54.5%恢复。

英文摘要

KV cache compression is essential for reducing the memory cost of long-context large language model inference. Existing approaches, however, typically apply a single compression policy and a uniform cache budget across all transformer layers. This uniform design ignores the fact that different layers can play different roles during prefill and decoding, and may therefore require different eviction strategies and cache capacities. We present PolyKV, a layer-wise KV cache optimization framework that considers design space with method selection and budget allocation. PolyKV routes each layer to a suitable KV compression policy based on layer-level signals, while assigning non-uniform budgets under a fixed total budget. This formulation enables heterogeneous compositions of existing KV cache methods. Experiments on LLaMA-3.1-8B and Qwen3-8B show that, under the same 512-token average KV budget, PolyKV recovers 54.5% and 25.7% of the LongBench performance gap between the strongest single-policy baseline and FullKV, respectively. Across 128-1024 budget sweep, PolyKV consistently improves over the strongest baseline by 1.7%-6.4%, corresponding to 40.0%-54.5% recovery of the FullKV gap.

2606.15207 2026-06-16 cs.LG cs.AI cs.NE 交叉投稿

Controlled Dynamics Attractor Transformer

受控动力学吸引子Transformer

Cheng Zhang, Minnan Luo, Zesheng Yang, Ming Li, Yong-Jin Liu, Qinghua Zheng

发表机构 * Xi'an Jiaotong University(西安交通大学) Tsinghua University(清华大学)

AI总结 提出受控动力学吸引子Transformer(CDAT),通过耦合混合von Mises-Fisher注意力能量与Hopfield精炼能量,并引入CANN启发的兴奋-抑制调制,实现拓扑约束的动力学系统,在图异常检测和图分类任务上达到最优性能。

Comments 20pages,3 figures

详情
Journal ref
Forty-Third International Conference on Machine Learning(ICML 2026)
AI中文摘要

Transformer架构通过自注意力机制在深度模型的表示学习和推理方面取得了显著进展。同时,联想记忆(AM)框架将表示映射到能量景观上,提供了可解释的检索机制。然而,其连续时间推理动力学缺乏经典连续吸引子神经网络(CANN)的生物合理性。为弥合这一差距,我们提出了受控动力学吸引子Transformer(CDAT),它将混合von Mises-Fisher(Mo-vMF)注意力能量与Hopfield精炼能量耦合,同时通过CANN启发的兴奋-抑制调制增强能量下降。CDAT实例化了一个拓扑约束的动力学系统,其耦合编码了标记之间的关系结构,从而将吸引子式动力学与现代基于能量的注意力联系起来。我们进一步提供了构造性的耗散分析,以正式建立其受控推理动力学。得益于这些鲁棒且结构化的动力学,CDAT在图异常检测和图分类的多个基准测试中达到了最先进的性能。

英文摘要

Transformer architectures have dramatically advanced representation learning and inference in deep models through self-attention mechanisms. In parallel,associative memory (AM) frameworks map representations onto energy landscapes, offering interpretable retrieval mechanisms. However, their continuous-time inference dynamics lack the biological plausibility of classical Continuous Attractor Neural Networks (CANNs). To bridge this gap, we propose Controlled Dynamics Attractor Transformer (CDAT), which couples a mixture von Mises-Fisher (Mo-vMF) attention energy with a Hopfield refinement energy, while augmenting energy descent with a CANN-inspired excitation-inhibition modulation. CDAT instantiates a topology-constrained dynamical system whose couplings encode relational structure among tokens, thereby linking attractor-style dynamics to modern energy-based attention. We further provide a constructive dissipation analysis to formally establish their controlled inference dynamics. Benefiting from these robust and structured dynamics, CDAT achieves state-of-the-art performance across multiple benchmarks in graph anomaly detection and graph classification.

2606.15247 2026-06-16 cs.LG cs.AI 交叉投稿

Exploring Starts Are Not Enough: Counterexamples and a Fix for Monte Carlo Exploring Starts

探索性初始状态并不足够:蒙特卡洛探索性初始状态的反例与修正

Octave Oliviers, Glenn Vinnicombe

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系)

AI总结 本文通过构造反例证明,在表格设置下,蒙特卡洛探索性初始状态(MCES)算法可能收敛到次优解,并提出基于状态级学习率缩放的修正方法以恢复最优性收敛。

详情
AI中文摘要

蒙特卡洛探索性初始状态(MCES)的渐近行为是强化学习中一个长期存在的开放问题,即使在表格设置中也是如此。我们通过构造算法收敛到次优解的例子,研究了表格MCES的收敛性质。本文为初始访问和首次访问MCES提供了新的反例,并给出了初始访问情况下的收敛恢复修正。我们表明,即使贪婪动作平均更新频率高于非贪婪动作,初始访问MCES在样本平均更新下也可能存在稳定的次优解。然而,通过按状态将学习率与更新频率成反比缩放,可以保证收敛到最优性。与之前的均匀化方法不同,此修正适用于需要近似估计值函数的大规模问题。然后,我们扩展该例子以表明样本平均首次访问MCES也可能收敛到次优解。这基本上解决了一个基本的开放问题,并表明仅靠探索性初始状态并不能保证收敛到最优性。更广泛地说,这些结果突显了收敛性关键取决于应用于不同动作的更新的相对大小和频率,使得学习率的选择以及探索与利用的平衡成为MCES分析和可扩展蒙特卡洛控制方法实现的核心。

英文摘要

The asymptotic behaviour of Monte Carlo Exploring Starts (MCES) is a long-standing open question in reinforcement learning, even in the tabular setting. We investigated the convergence properties of tabular MCES by constructing examples in which the algorithm converges to suboptimal solutions. This paper presents new counterexamples for both initial-visit and first-visit MCES and gives a convergence-restoring modification for the initial-visit case. We show that stable suboptimal solutions may exist for initial-visit MCES with sample-average updates even when greedy actions are updated more often than non-greedy actions on average. However, by scaling learning rates inversely to update frequencies on a state-by-state basis, convergence to optimality is guaranteed. Unlike previous uniformisation methods, this modification is applicable to large-scale problems that require approximating the estimated value function. We then extend the example to show that sample-average first-visit MCES may also converge to suboptimal solutions. This largely settles a fundamental open problem and shows that exploring starts alone do not guarantee convergence to optimality. More broadly, these results highlight that convergence depends critically on the relative size and frequency of updates applied to different actions, making the choice of learning rates and the balance between exploration and exploitation central to the analysis of MCES and the implementation of scalable Monte Carlo control methods.

2606.15260 2026-06-16 cs.LG cs.AI 交叉投稿

Trust-Region Diffusion Policies for Massively Parallel On-Policy RL

大规模并行在线强化学习的信任区域扩散策略

Huy Le, Onur Celik, Denis Blessing, Tai Hoang, Claas A Voelcker, Axel Brunnbauer, Felix Richter, Michael Volpp, Gerhard Neumann

发表机构 * University of Freiburg(弗赖堡大学) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所)

AI总结 提出TruDi方法,通过信任区域优化约束扩散轨迹的KL散度,实现大规模并行在线强化学习中的稳定训练,在73个任务中优于或持平基线。

详情
AI中文摘要

利用大规模并行模拟的强化学习已成为开发鲁棒、可部署策略的标准框架;然而,大多数现有方法仍依赖简单的高斯策略参数化。扩散模型提供了更具表达力的策略类,并在具有挑战性的控制问题上表现出色,但大多数基于扩散的强化学习方法是为离线或离策略训练设计的。在这项工作中,我们探究扩散策略能否在大规模并行、在线策略机制下有效训练。为此,我们引入了信任区域扩散策略(TruDi),它使得扩散策略能够用于大规模并行模拟的在线强化学习。这种设置特别具有挑战性,因为数据分布在每次更新中快速变化,使得复杂策略的稳定训练变得困难。TruDi通过整合信任区域优化规则来约束整个扩散轨迹上的KL散度,从而解决了这一问题。实验上,我们在包含73个任务的4个不同的大规模并行强化学习基准上评估了TruDi。在这些任务中,TruDi在标准任务上始终优于或与强基线持平,在更具挑战性的人形控制任务上取得了明显收益,为大规模并行在线强化学习建立了新的强基线。

英文摘要

Reinforcement learning with massively parallel simulations has become a standard framework for developing robust, deployable policies; however, most existing approaches still rely on simple Gaussian policy parameterizations. Diffusion models provide a more expressive policy class and have shown strong performance on challenging control problems, yet most diffusion-based RL methods are designed for offline or off-policy training. In this work, we ask whether diffusion policies can be trained effectively in the massively parallel, on-policy regime. To this end, we introduce Trust-region Diffusion Policies (TruDi), which enables diffusion policies for on-policy RL with massively parallel simulations. This setting is particularly challenging because the data distribution changes quickly across updates, making stable training with complex policies difficult. TruDi addresses this by integrating a trust-region optimization rule to enforce a KL-divergence constraint over the entire diffusion trajectory. Empirically, we evaluate TruDi on a diverse set of 4 massively parallel RL benchmarks comprising a total of 73 tasks. Across these tasks, TruDi consistently outperforms or is on-par with strong baselines on standard tasks and achieves clear gains on more challenging humanoid control tasks, establishing a strong new baseline for massively parallel on-policy RL.

2606.15278 2026-06-16 cs.LG cs.AI 交叉投稿

RECTOR: Masked Region-Channel-Temporal Modeling for Affective and Cognitive Representation Learning

RECTOR:面向情感与认知表征学习的掩码区域-通道-时间建模

Jinhan Liu, Mahsa Shoaran

发表机构 * Cornell University(康奈尔大学)

AI总结 提出RECTOR自监督框架,通过自适应功能分区和掩码拓扑学习,统一建模EEG/sEEG的区域-通道-时间动态,在情感识别和任务参与分类上达到新最优,且对缺失通道和跨导联泛化鲁棒。

详情
AI中文摘要

情感和认知障碍表现为跨区域、通道和时间的分布式、时变脑网络动态,给基于EEG/sEEG的临床诊断鲁棒表征学习带来挑战。我们提出RECTOR(掩码区域-通道-时间建模),一种端到端自监督框架,超越固定解剖先验,统一联合区域-通道-时间表征学习。其核心RECTOR-SA是一种由自适应功能分区诱导的层次化块稀疏自注意力,将区域结构从静态解剖定义演变为自适应功能区域。自监督由掩码拓扑和表征学习驱动,联合优化三个互补目标:掩码预测建模、拓扑结构建模和跨视图一致性。在多个基准上,RECTOR在EEG情感识别和sEEG任务参与分类中达到新最优。关键的是,其对缺失通道的强鲁棒性和跨导联泛化能力凸显了其在异构EEG/sEEG上进行大规模预训练的潜力,并在区域和通道层面提供可解释的洞察。

英文摘要

Affective and cognitive disorders manifest as distributed, time-varying brain network dynamics across regions, channels, and time, challenging robust representation learning from EEG/sEEG for clinical diagnosis. We propose RECTOR (Masked Region-Channel-Temporal Modeling), an end-to-end self-supervised framework that unifies joint region-channel-temporal representation learning beyond fixed anatomical priors. At its core, RECTOR-SA is a hierarchical, block-sparse self-attention induced by Adaptive Functional Partitioning that evolves region structures from static anatomical definitions to adaptive functional regions. The self-supervision is driven by Masked Topology and Representation Learning, which jointly optimizes three complementary objectives: Masked Predictive Modeling, Topological Structure Modeling, and Cross-View Consistency. Across diverse benchmarks, RECTOR sets a new state-of-the-art in EEG emotion recognition and sEEG task-engagement classification. Crucially, its strong robustness to missing channels and cross-montage generalization underscores its potential for large-scale pre-training on heterogeneous EEG/sEEG, providing interpretable insights at both region and channel levels.

2606.15284 2026-06-16 eess.SP cs.AI cs.LG 交叉投稿

CAP: Towards PPG Universal Representation Learning with Patient-level Supervision

CAP:面向患者级监督的PPG通用表示学习

Chenyang He, Xinyi Shao, Shun Huang, Bosong Huang, Daoqiang Zhang, Ming Jing, Cheng Ding

发表机构 * Nanjing University of Aeronautics and Astronautics(南京航空航天大学) Peking University(北京大学) Independent Researcher(独立研究者) Jinling Clinical Medical College College of Artificial Intelligence Nanjing University of Aeronautics and Astronautics(金陵临床医学院人工智能学院南京航空航天大学)

AI总结 提出CAP方法,通过构建大规模PPG-EHR多模态数据集和跨模态对比对齐,学习患者级临床语义的PPG表示,在四项下游任务中平均提升26.7%,呼吸率预测提升87.6%。

Comments Accepted as an Oral presentation at KDD 2026

详情
AI中文摘要

光电容积描记法(PPG)在可穿戴健康监测和临床决策支持中发挥着核心作用。然而,现有的通用PPG表示学习方法主要关注信号级目标,往往忽略患者级健康背景,这限制了对复杂临床任务和异质性队列的泛化能力。为解决这一问题,我们通过将碎片化的病史和临床记录整合为连贯的患者级电子健康记录(EHR),构建了一个大规模配对PPG-EHR多模态数据集。基于此资源,我们提出了临床锚定预训练方法(CAP)。在预训练期间,CAP执行跨模态对比对齐,将PPG表示锚定到患者级临床语义,引导编码器超越波形拟合,建模患者整体生理状态的一致性。在下游适应期间,预训练的PPG编码器提供临床基础的表示,增强归纳偏置,提高鲁棒性和可迁移性。实验表明,CAP在四个不同的下游任务上持续优于强基线。CAP在呼吸率预测上取得了特别大的提升(相比最先进基线相对提升高达87.6%),并在所有任务上平均相对提升26.7%。我们通过全面分析(包括消融实验和多个互补的可视化学习表示)进一步增强了方法的可解释性。实验代码可在 https://github.com/gody123gody/CAP 获取。

英文摘要

Photoplethysmography (PPG) plays a central role in wearable health monitoring and clinical decision support. Yet existing approaches to universal PPG representation learning largely focus on signal-level objectives and often overlook patient-level health context, which limits generalization to complex clinical tasks and heterogeneous cohorts. To address this gap, we construct a large-scale paired PPG-EHR multimodal dataset by distilling fragmented medical histories and clinical records into cohesive, patient-level electronic health records (EHR). Building on this resource, we propose Clinical Anchored Pretraining for PPG (CAP). During pretraining, CAP performs cross-modal contrastive alignment that anchors PPG representations to patient-level clinical semantics, guiding the encoder beyond waveform fitting toward modeling consistency in a patient's overall physiological state. During downstream adaptation, the pretrained PPG encoder provides clinically grounded representations that strengthen inductive bias and improve robustness and transferability. Experiments demonstrate that CAP consistently outperforms strong baselines on four diverse downstream tasks. CAP achieves a particularly large gain on respiratory rate prediction (up to +87.6% relative improvement over the state-of-the-art baseline) and delivers an average relative +26.7% across all tasks. We further enhance the interpretability of our approach through comprehensive analyses, including ablations and multiple complementary visualizations of the learned representations. The code for our experiments is available at: https://github.com/gody123gody/CAP .

2606.15377 2026-06-16 cs.LG cs.AI physics.geo-ph 交叉投稿

Learning Earthquake Wave Arrival Time Picking from Labels with Inaccuracies

从不准确标签中学习地震波到时拾取

Sen Li, Xu Yang, S. Mostafa Mousavi, Anye Cao, Keting Fan, Yaoqi Liu, Changbin Wang, Qiang Niu

发表机构 * Department of Earth and Planetary Sciences, Harvard University(哈佛大学地球与行星科学系) School of Computer Science and Technology, China University of Mining and Technology(中国矿业大学(北京)计算机科学与技术学院) School of Mines, China University of Mining and Technology(中国矿业大学(北京)矿院) State Key Laboratory of Coal Exploration and Intelligent Mining, China University of Mining and Technology(中国矿业大学(北京)煤炭勘探与智能开采国家重点实验室)

AI总结 提出标签噪声对比鲁棒学习(LaNCoR)方法,通过对齐波形特征与标签表示分布来纠正错误标签,在微地震P波到时拾取任务中性能提升高达28.8%。

Comments 28 pages, 10 figures

详情
AI中文摘要

不准确标记的训练数据,或称“标签噪声”,对监督机器学习模型的完整性构成重大威胁。这种污染通过教导模型特征与标签之间的错误映射直接降低性能,导致泛化能力差,并在正确标记的验证和测试数据上准确性降低。当前地震学应用主要依赖大规模训练集或数据增强来减少标签噪声影响,这可能是劳动密集且成本高昂的。在这里,我们介绍一种标签噪声对比鲁棒学习(LaNCoR)方法,该方法可以有效处理地震信号处理任务中的噪声标签,而无需大规模训练数据集。在该方法中,输入波形特征和标签表示分布在特征空间中对齐,以纠正错误标记并减少其对训练过程的影响。我们使用两个基线模型和训练方法展示了LaNCoR在真实微地震数据P波到时拾取任务上的性能。我们的结果表明,LaNCoR在性能指标上可提升高达28.8%。该方法在地震学和地球科学中的模型训练方面具有巨大潜力。

英文摘要

Inaccurately labeled training data, or "label noise", poses a significant threat to the integrity of supervised machine learning models. This corruption directly degrades performance by teaching the model erroneous mappings between features and labels, which leads to poor generalization and reduced accuracy on properly labeled validation and test data. Current seismological applications mainly rely on large-scale training sets or data augmentation to reduce the label-noise impact, which can be labor-intensive and costly. Here, we introduce a Label Noise-Contrastive Robust Learning (LaNCoR) approach that can effectively handle noisy labels in seismic signal processing tasks, without requiring large-scale training datasets. In this approach, the input waveform feature and label representation distributions are aligned in the feature space to correct mislabeling and reduce its impact on the training process. We present LaNCoR's performance on the task of P-phase arrival-time picking of real microseismic data using two baseline models and training approaches. Our results indicate that LaNCoR can improve performance by up to 28.8% across performance metrics. This approach holds great promise for model training in seismology and geosciences.

2606.15455 2026-06-16 cs.LG cs.AI 交叉投稿

Understanding Diversity Collapse in RLVR via the Lens of Overtraining

通过过度训练的视角理解RLVR中的多样性崩溃

Suqin Yuan, Jinkun Chen, Jiyang Zheng, Muyang Li, Lei Feng, Dadong Wang, Tao Xiang, Tongliang Liu, Bo An

发表机构 * Sydney AI Centre, The University of Sydney(悉尼大学悉尼人工智能中心) Southeast University(东南大学) Microsoft(微软) Data61, CSIRO(澳大利亚联邦科学与工业研究组织Data61) Chongqing University(重庆大学) Nanyang Technological University(南洋理工大学)

AI总结 本文通过过度训练的视角形式化RLVR中的多样性崩溃,发现标准训练中大部分更新是过度训练,并提出贝叶斯边界门控(BBG)方法,通过估计每个问题对推理边界的边际贡献来优化,提升多个基准上的Pass@k。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为增强大型语言模型推理能力的关键方法。然而,RLVR常常遭受\emph{多样性崩溃}:Pass@$1$提升而高$k$的Pass@$k$下降,这被视为模型推理边界的收窄。我们通过\emph{过度训练}的视角形式化了这种多样性崩溃:一旦一个问题对参考指标的贡献有效饱和,进一步的更新不再扩展模型能解决的问题,但仍将概率质量集中在on-policy采样偏好的轨迹上。在每次问题少量rollout的标准设置下,即使单次成功也会使问题进入高$k$ Pass@$k$的近乎饱和状态,因此标准RLVR中的大多数更新从边界角度来看都是过度训练。这一视角也提供了一种解读:RLVR能否扩展模型超越基础模型的推理能力?由于RLVR结构上偏向于高$k$ Pass@$k$,其总体下降本身并不意味着没有新的推理增益。在干预上,将更新限制在零成功的问题上,在困难基准上将Pass@$256$提升到基础模型之上;在观察上,标准RLVR训练中,最初不可解的问题中有相当一部分变得可解。基于这些发现,我们提出\emph{贝叶斯边界门控}(BBG),通过估计每个问题对推理边界的边际贡献,将优化从过度训练中转移出来。在多个推理基准上,BBG在广泛的$k$范围内提升了平均Pass@$k$。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a key approach for enhancing the reasoning abilities of large language models. However, RLVR often suffers from \emph{diversity collapse}: Pass@$1$ improves while high-$k$ Pass@$k$ degrades, which is viewed as a narrowing of the model's reasoning boundary. We formalize this diversity collapse through the lens of \emph{overtraining}: once a problem's contribution to the reference metric has effectively saturated, further updates no longer expand what the model can solve but still concentrate probability mass on the trajectories favored by on-policy sampling. Under a standard setup with few rollouts per problem, even a single observed success places a problem in a nearly saturated regime for high-$k$ Pass@$k$, so most updates in standard RLVR are overtraining from the boundary perspective. This perspective also suggests a reading of whether RLVR can expand the model's reasoning abilities beyond the base model: since RLVR is structurally biased against high-$k$ Pass@$k$, its aggregate decline does not by itself mean that no new reasoning gains occurred. Interventionally, restricting updates to problems with zero observed success lifts Pass@$256$ above the base model on difficult benchmarks; observationally, a non-trivial fraction of initially unsolvable problems become solvable during standard RLVR training. Building on these findings, we propose \emph{Bayesian Boundary Gating} (BBG), which redirects optimization away from overtraining by estimating each problem's marginal contribution to the reasoning boundary. Across multiple reasoning benchmarks, BBG improves average Pass@$k$ across a wide range of $k$.

2606.15479 2026-06-16 cs.LG cs.AI math.PR 交叉投稿

Bayesian 3D Steerable CNNs: Enabling Equivariance and Uncertainty Quantification Simultaneously

贝叶斯3D可转向CNN:同时实现等变性和不确定性量化

Abhishek Keripale, Ponkrshnan Thiagarajan, Susanta Ghosh

发表机构 * Michigan Technological University(密歇根理工大学) Johns Hopkins University(约翰霍普金斯大学) The Center for Artificial Intelligence at the Institute of Computing and Cybersystems, Michigan Technological University(密歇根理工大学计算与网络系统研究所人工智能中心)

AI总结 提出贝叶斯可转向CNN,通过后验分布赋予核随机性同时保持SE(3)-等变性,实现不确定性分解,在分类精度和分布偏移下鲁棒性优于确定性模型。

详情
AI中文摘要

可转向卷积神经网络(Steerable-CNNs)通过将核参数化为可转向基函数的线性组合来保证SE(3)-等变性,但其确定性本质阻碍了不确定性量化——限制了其在需要置信度估计的场景中的应用。我们提出一种贝叶斯可转向CNN,将后验分布置于基系数上,从而在精确保持等变性的同时产生随机核。模型的损失函数通过变分推断获得,并通过贝叶斯反向传播最小化。该框架将预测不确定性分解为认知不确定性和偶然不确定性。实验上,该模型在取得竞争性分类精度的同时,预期校准误差为0.0263,并且在加性高斯噪声引起的分布偏移下,其性能比确定性对应模型高出最多6.17%。此外,我们利用模型的不确定性估计显著提升其性能,在测试数据集的84%上实现了约4%的准确率提升。认知不确定性与预测误差之间统计显著的负相关性表明,学习到的后验方差具有语义意义。该框架将贝叶斯不确定性量化与等变CNN的归纳偏置统一起来。

英文摘要

Steerable convolutional neural networks (Steerable-CNNs) guarantee SE(3)-equivariance by parameterizing kernels as linear combinations of steerable basis functions, but their deterministic nature precludes uncertainty quantification - limiting their use in settings where confidence estimates are essential. We propose a Bayesian Steerable-CNN that places posterior distributions over the basis coefficients, yielding stochastic kernels while preserving equivariance exactly. The loss function of the model is obtained via variational inference and minimized by Bayes-by-Backpropagation. The framework admits a decomposition of predictive uncertainty into epistemic and aleatoric components. Empirically, the model attains competitive classification accuracy alongside an expected calibration error of 0.0263 and outperforms its deterministic counterpart by up to 6.17% under distributional shift induced by additive Gaussian noise. Furthermore, we leverage the model's uncertainty estimates to enhance its performance significantly, achieving a notable gain - approximately 4% higher accuracy across 84% of the test dataset. A statistically significant negative correlation between epistemic uncertainty and prediction error confirms that the learned posterior variance is semantically meaningful. The framework unifies Bayesian uncertainty quantification with the inductive bias of equivariant CNNs.

2606.15527 2026-06-16 cs.CV cs.AI 交叉投稿

Selective Synergistic Learning for Video Object-Centric Learning

选择性协同学习用于视频对象中心学习

WonJun Moon, Jae-Pil Heo

发表机构 * KAIST(韩国科学技术院) Sungkyunkwan University(成均馆大学)

AI总结 提出选择性协同学习(SSync),通过伪标签线性复杂度选择性蒸馏可靠线索,避免错误传播,提升视频对象分解质量并作为即插即用模块。

详情
AI中文摘要

典型的视频对象中心学习(VOCL)方法采用基于槽的框架,依赖重建驱动的编码器-解码器架构,学习通过两个空间图进行:编码器的注意力图和解码器的对象图。由于这两个不同的图表现出不同的属性,最近的密集对齐策略试图通过对比学习强制所有时空补丁之间的一致性来调和这种差异。然而,这种无差别的对齐无意中传播了每个模块固有的弱点,例如编码器的噪声预测和解码器的模糊边界。此外,计算所有对之间的密集相似性会带来与时空补丁总数二次方关系的计算成本,严重限制了可扩展性。受此启发,我们提出了选择性协同学习(SSync)。SSync 不是进行穷举的补丁到补丁对齐,而是通过选择性蒸馏仅最可靠的线索来防止错误传播:严格利用编码器进行边界细化,利用解码器进行内部去噪。这通过线性复杂度的伪标签实现,消除了二次空间比较的需要。此外,为了防止强化架构偏差(如槽冗余),我们引入了传递性伪标签合并,基于时空激活一致性合并重叠的槽。大量研究表明,SSync 提高了分解质量,并作为一个通用的即插即用模块,同时对槽配置表现出卓越的鲁棒性。代码可在 github.com/wjun0830/SSync 获取。

英文摘要

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via a pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. Also, to prevent the reinforcement of architectural biases like slot redundancy, we introduce a transitive pseudo-label merging that consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.

2606.15553 2026-06-16 cs.LG cs.AI 交叉投稿

Distilling Drifting Transformers with Representation Autoencoders

用表示自编码器蒸馏漂移变换器

Jiawei Zhang, Mengfei Xia, Gen Li, Yuantao Gu

发表机构 * Tsinghua University(清华大学) Ant Group(蚂蚁集团) CUHK(香港中文大学)

AI总结 提出Drift-RAE方法,通过漂移范式在表示自编码器潜空间中蒸馏预训练流模型,解决各向异性和大曲率问题,在ImageNet 256上仅用10k步达到1.77 FID。

详情
AI中文摘要

表示自编码器(RAE)通过预训练编码器中强标签聚类的DINO特征,在语义更丰富的潜空间中改进了扩散和流模型。然而,在蒸馏阶段,丰富语义表示导致的严重各向异性和大曲率会阻碍收敛和性能,使得基于轨迹的蒸馏不稳定。在这项工作中,我们认为RAE潜空间通过新提出的漂移模型与蒸馏兼容。我们首先定量研究了不同自编码器上的曲率和各向同性统计,并从理论上揭示了漂移模型本身极有可能在像基于重建的VAE这样的极端分散空间上失败。这些促使我们直接将漂移范式应用于表示自编码器。我们提出的方法Drift-RAE使用漂移在RAE潜空间中蒸馏预训练流模型,并进行了有洞察力的修改,通过理论上将漂移场与其他框架对齐来提高训练稳定性。关于实验证据,我们在ImageNet 256数据集上仅用10k步蒸馏就达到了1.77 FID,超越了最先进的RAE蒸馏方法,并且与原始漂移模型相比具有竞争力,而无需辅助MAE特征提取器。代码将公开提供。

英文摘要

Representation Autoencoders (RAEs) have improved diffusion and flow models by semantically richer latent space owing to the strongly label-wise clustered DINO features in the pretrained encoders. Yet in the distillation stage, the severe anisotropy and large curvatures caused by the rich semantic representations would hinder the convergence and performance, making the trajectory-based distillation unstable. In this work, we argue that the RAE latent space is compatible with distillation via the newly proposed Drifting Models. We first quantitatively study the curvatures and isotropy statistics across different autoencoders, and theoretically reveal that Drifting Model itself is highly likely to fail on extremely scattered spaces like reconstruction-based VAEs. These motivate us to apply the drifting paradigm directly to representation autoencoders. Our proposed method, Drift-RAE, distills pretrained flow models in RAE latent spaces using Drifting, together with insightful modifications that improve training stability by thereotically aligning drifting fields with other frameworks. Regarding the experimental evidences, we achieve 1.77 FID on ImageNet 256 dataset using only 10k distillation steps, surpassing state-of-the-art RAE distillation methods and appearing comparative with the original Drifting Model without requiring an auxiliary MAE feature extractor. The code will be made publicly available.

2606.15576 2026-06-16 cs.LG cs.AI 交叉投稿

Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

在分歧处定位信用:路径条件自蒸馏用于LLM推理

Yu Li, Shu Hong, Tian Lan

发表机构 * Department of Electrical and Computer Engineering, George Washington University(乔治华盛顿大学电气与计算机工程系)

AI总结 提出Hindsight Self-Distillation (HSD)方法,通过将教师模型条件于当前训练组中的成功同伴轨迹,在失败与成功轨迹的分歧处提供密集信用信号,提升LLM在数学和代码推理任务上的性能。

详情
AI中文摘要

基于可验证奖励的强化学习为每次 rollout 分配一个标量,在长推理轨迹中留下了 token 级信用分配不明确的问题。同策略自蒸馏通过让同一模型作为教师,并条件于特权信息,产生密集的逐 token 信号来解决这一问题。但常见的真实答案选择仅是一个终点线索:在简短答案任务中,教师在需要路径级指导的中间位置保持沉默。我们提出后见自蒸馏(HSD),它将教师条件于从当前训练组中抽取的一个成功同伴 rollout。这样的同伴是从成功条件策略中精确采样的样本,无需额外的采样 rollout。通过提供完整的成功延续而不仅仅是最终答案,产生的信用信号集中在失败 rollout 与成功同伴之间的分歧位置。在 Qwen3-8B 和 Qwen3-32B 的数学和代码基准测试中,HSD 相比 GRPO 变体和同策略蒸馏基线获得了最佳结果,在 AIME 等简短答案任务上提升最大。

英文摘要

Reinforcement learning from verifiable rewards assigns a single scalar to each rollout, leaving token-level credit assignment underspecified in long reasoning traces. On-policy self-distillation addresses this by letting the same model act as a teacher conditioned on privileged information, producing a dense per-token signal. But the common choice of a ground-truth answer is only an endpoint cue: on terse-answer tasks, the teacher falls silent at the intermediate positions where path-level guidance matters most. We propose Hindsight Self-Distillation (HSD), which conditions the teacher on a successful peer rollout drawn from the current training group. Such a peer is an exact sample from the success-conditioned policy, requiring no additional sampled rollouts. By providing a full successful continuation rather than only the final answer, the resulting credit signal concentrates at the divergence position between a failed rollout and a successful peer. Across Qwen3-8B and Qwen3-32B on math and code benchmarks, HSD obtains the best result against GRPO variants and on-policy distillation baselines, with the largest gains on terse-answer tasks such as AIME.

2606.15589 2026-06-16 cs.LG cs.AI 交叉投稿

Is Code Better Than Language for Algorithmic Reasoning

算法推理中代码是否优于语言

Terry Tong, Yu Feng, Surbhi Goel, Dan Roth

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 通过分离中间表示与执行机制,在40个任务上比较代码执行与自然语言推理,发现代码执行优势源于外部执行而非表示变化。

Comments ICML 2026

详情
AI中文摘要

对于工具增强的语言模型,比较自然语言推理与代码执行管道是困难的,因为比较同时改变了中间表示和执行机制。我们通过一个中间干预来分离这些因素:模型将其推理表达为可执行代码,语言模型在上下文中模拟该代码以产生答案。在40个任务的可验证算法基准上,确定性代码执行比自然语言推理高出+31.6个百分点。我们观察到中间干预与自然语言推理没有显著差异(+0.15个百分点)。这些结果表明,在我们评估的设置中,仅改变中间表示并不能解释工具使用的优势,为性能提升需要可靠的外部执行提供了证据。我们用一个简单的统计决策理论模型形式化了这一直觉,该模型刻画了在我们的解耦轨迹生成/执行机制中,执行何时主导端到端风险。我们通过一个重建干预验证了我们的理论,该干预利用代理语言模型从代码表示中推断自然语言推理轨迹,恢复了与原始自然语言推理管道相当的性能。所有实验见https://github.com/TerryTong-Git/ToolProj。

英文摘要

For tool-augmented language models, comparing natural-language reasoning with code-execution pipelines is difficult because the comparison changes both the intermediate representation and the execution mechanism. We separate these factors with an intermediate intervention: the model expresses its reasoning as executable code, and the language model simulates that code in context to produce an answer. On a 40-task verifiable algorithmic benchmark, deterministic code execution outperforms natural-language reasoning by +31.6pp. We observe that the intermediate intervention is not meaningfully different from natural-language reasoning (+0.15pp). These results suggest that, in our evaluated setting, changing the intermediate representation alone does not explain the tool-use advantage, providing evidence for the performance gains requiring reliable external execution. We formalize this intuition with a simple statistical decision-theoretic model that characterizes when execution dominates end-to-end risk in our disentangled trace-generation/execution regime. We validate our theory using a reconstruction intervention that leverages a proxy language model to infer natural-language reasoning traces from code representations, recovering performance comparable to the original natural-language reasoning pipeline. All experiments are at https://github.com/TerryTong-Git/ToolProj.

2606.15669 2026-06-16 cs.LG cs.AI 交叉投稿

Z-Plane Neural Networks: Bounded Geometric Activation Replaces ReLU and LayerNorm

Z平面神经网络:有界几何激活替代ReLU和LayerNorm

Sungwoo Goo, Hwi-yeol Yun, Sangkeun Jung

发表机构 * College of Pharmacy, Chungnam National University(忠南大学药学院) Department of Computer Science & Engineering, Chungnam National University(忠南大学计算机科学与工程系)

AI总结 提出Z平面神经网络,通过有界几何激活函数Radial Bounding将隐藏状态映射到超球面上的2D相量束,在保持方向信息的同时限制能量幅度,理论证明其保持1-Lipschitz连续性并防止梯度消失,实验表明100层无ReLU和LayerNorm的MLP在MNIST上稳定收敛。

详情
AI中文摘要

现代深度神经网络依赖欧几里得标量激活(如ReLU)和全局归一化技术(如LayerNorm)来防止深层架构中的梯度不稳定。然而,这些机制固有地导致神经元死亡、丢弃关键方向信息并破坏特征表示的正交性。受生物轴突频率调制传输的启发,我们提出了Z平面神经网络,将隐藏状态映射到超球面上的2D相量束。我们引入了一种新颖的几何激活函数Radial Bounding($\mathbf{x} / \max(1, \\|\mathbf{x}\\|_2)$),它在保持相位(方向)的同时限制能量幅度。我们从数学上证明,这种各向同性激活保持了1-Lipschitz连续性,并通过保留切向梯度防止梯度消失。实验上,一个完全不含ReLU和LayerNorm的100层Z平面多层感知机(MLP)在MNIST数据集上成功收敛,准确率达到98.34%,且具有绝对数值稳定性,证明仅靠有界几何激活就足以实现稳定的深度学习。

英文摘要

Modern deep neural networks rely on Euclidean scalar activations (e.g., ReLU) and global normalization techniques (e.g., LayerNorm) to prevent gradient instability in deep architectures. However, these mechanisms inherently cause dead neurons, discard critical directional information, and destroy the orthogonality of feature representations. Inspired by the frequency-modulation transmission of biological axons, we propose the Z-Plane Neural Network, which maps hidden states into 2D phasor bundles on a hypersphere. We introduce a novel geometric activation function, Radial Bounding($\mathbf{x} / \max(1, \|\mathbf{x}\|_2)$), which limits the energy magnitude while preserving the phase (direction). We demonstrate mathematically that this isotropic activation maintains 1-Lipschitz continuity and prevents gradient vanishing by preserving tangential gradients. Empirically, a 100-layer Z-Plane Multi-Layer Perceptron (MLP)-entirely devoid of ReLU and LayerNorm-successfully converges on the MNIST dataset with 98.34% accuracy and absolute numerical stability, proving that bounded geometric activation alone is sufficient for stable deep learning.

2606.15678 2026-06-16 cs.LG cs.AI 交叉投稿

The Reservoir Attention Network: Cross-Pass State in Pretrained Transformers via Content-Addressable Reservoir Injection

储层注意力网络:通过内容可寻址储层注入在预训练Transformer中的跨前向传播状态

Emma Leonhart

发表机构 * Emma Leonhart

AI总结 提出储层注意力网络(RAN),通过在预训练Transformer中间层注入固定随机储层来携带跨前向传播状态,实验表明未训练的循环动态足以传递可用状态。

Comments 29 pages, 14 figures

详情
AI中文摘要

本文对储层注意力网络(RAN)进行了可行性和动力学研究,该架构将一个固定的、随机初始化的储层注入到预训练Transformer的中间层注意力中,以在跨前向传播时携带状态。实验涵盖从GPT-2(124M、355M)到Qwen2.5(0.5B、1.5B)的模型,均在单个消费级GPU上运行。任务被选为最小探针,以隔离单个机制;更广泛的“始终活跃的智能体”愿景在整个过程中被视为受计算限制的未来工作,而非本文的主张。储层被设计为未训练的(固定随机):这隔离了未训练的循环动态本身是否足以携带可用的跨前向传播状态,而将训练的循环作为互补的、更昂贵的方向。

英文摘要

A feasibility and dynamics study of the Reservoir Attention Network (RAN), an architecture that injects a fixed, randomly-initialized reservoir into the mid-layer attention of a pretrained transformer to carry state across forward passes. Experiments span GPT-2 (124M, 355M) to Qwen2.5 (0.5B, 1.5B) on a single consumer GPU. The tasks are minimal probes chosen to isolate individual mechanisms; the broader always-alive agent vision is treated throughout as compute-limited future work, not a claim of this paper. The reservoir is left untrained (fixed random) by design: this isolates whether untrained recurrent dynamics alone suffice to carry usable cross-pass state, leaving trained recurrence as a complementary, more expensive direction.

2606.15695 2026-06-16 cs.LG cs.AI 交叉投稿

When Generator Replay Degrades: Projected Rehearsal Orchestration for Heterogeneous Federated Class-Incremental Learning

当生成器回放退化时:面向异构联邦类增量学习的投影排练编排

Thinh T. H. Nguyen, Khoa D. Doan, Binh T. Nguyen, Danh Le-Phuoc, Kok-Seng Wong

发表机构 * VinUniversity VNU-HCM, University of Science(胡志明市国家大学理科大学) Technische Universität Berlin(柏林工业大学)

AI总结 针对异构联邦类增量学习中客户端标签子集不同、任务阶段不一致导致的旧知识遗忘问题,提出投影排练编排框架PRO及增强版PRO-MAX,通过服务器端维护紧凑类级投影记忆并实现平衡伪多任务训练,在图像、文本和图基准上提升异构流下的保留与最终效用。

Comments 46 pages

详情
AI中文摘要

联邦类增量学习(FCIL)在客户端观察到不同标签子集、在不同阶段推进任务以及为相同语义概念提供不均匀监督时变得极其困难。现有的FCIL方法通常通过输入空间合成来保留旧知识,但在异构任务流下可能脆弱且难以跨模态迁移。为缓解这些问题,我们提出PRO,一个用投影排练编排替代合成输入回放的框架。为去除外部预训练,我们在相同的预热条件下评估所有方法。此后,PRO在服务器上维护紧凑的类级投影记忆,并允许客户端在当前示例和旧投影记忆上执行平衡的伪多任务训练。为处理更强的表示漂移,我们进一步引入PRO-MAX,它在保持相同服务器轻量原则(服务器仅聚合模型更新和记忆统计)的同时,用邻域加权记忆对齐增强PRO。在图像、文本和图基准上,PRO和PRO-MAX在异构流下提高了保留和最终效用,同时在同构FCIL中保持竞争力。即使基线获得更大的回放预算,它们在监督不平衡和阶段错位下也会退化,表明仅靠回放数量无法解决回放质量失败。额外的弱任务诊断进一步表明,更大的回放不匹配与更大的下游退化相关,而我们的方法使投影记忆与不断演化的表示保持更好对齐。

英文摘要

Federated class-incremental learning (FCIL) becomes substantially harder when clients observe different label subsets, progress through tasks at different stages, and provide uneven supervision for the same semantic concepts. Existing FCIL methods often preserve old knowledge through input-space synthesis, but they can be fragile under heterogeneous task streams and difficult to transfer across modalities. To alleviate such issues, we propose PRO, a framework that replaces synthetic input replay with projected rehearsal orchestration. To remove external pretraining, we evaluate all methods under the same warmup. After this, PRO maintains compact class-level projected memories on the server and allows clients perform balanced pseudo multi-task training over current examples and old projected memories. To handle stronger representation drift, we further introduce PRO-MAX, which augments PRO with neighborhood-weighted memory alignment while preserving the same server-light principle that the server only aggregates model updates and memory statistics. Across image, text, and graph benchmarks, PRO and PRO-MAX improve retention and final utility under heterogeneous streams while remaining competitive in homogeneous FCIL. Even when baselines are given expanded replay budgets, they degrade under supervision imbalance and stage misalignment, indicating that replay quantity alone does not resolve replay-quality failures. Additional weak-task diagnostics further show that larger replay mismatch is associated with larger downstream degradation, while our method keeps projected memories better aligned with the evolving representation.

2606.15734 2026-06-16 cs.CL cs.AI cs.IR cs.LG 交叉投稿

Retrievable Gradients: Continual Post-Training Without Cumulative Weight Drift

可检索梯度:无累积权重漂移的持续后训练

Weihang Su, Jiacheng Kang, Jingyan Xu, Qingyao Ai, Jianming Long, Hanwen Zhang, Bangde Du, Xinyuan Cao, Min Zhang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 提出ReGrad范式,将梯度作为可检索知识单元,通过元学习重塑文档梯度为通用适应信号,实现无权重漂移的可扩展参数知识注入。

详情
AI中文摘要

持续后训练使模型在部署后能够吸收新知识,但重复更新共享参数会累积权重漂移,可能导致灾难性遗忘并降低通用能力。检索增强生成避免了这种参数漂移,但往往缺乏参数化知识整合的深度。在本文中,我们提出ReGrad(可检索梯度),一种将梯度视为可检索知识单元的新范式。ReGrad离线预计算文档特定梯度,存储在索引化的梯度库中,并在推理时仅检索与查询相关的梯度以进行临时权重调整。然而,原始语言建模梯度针对词级文档重建而非查询驱动的知识使用进行优化。因此,我们引入双层元学习目标,将文档派生梯度重塑为下游任务的通用适应信号。在通用和特定领域设置上的实验表明,ReGrad优于CPT和RAG基线,实现了可扩展且可逆的参数知识注入,且不累积权重漂移。

英文摘要

Continual post-training enables models to absorb emerging knowledge after deployment, but repeatedly updating shared parameters can accumulate weight drift, potentially causing catastrophic forgetting and degrading general capabilities. Retrieval-augmented generation avoids such parameter drift, yet often lacks the depth of parametric knowledge integration. In this paper, we propose ReGrad (Retrievable Gradients), a new paradigm that treats gradients as retrievable units of knowledge. ReGrad pre-computes document-specific gradients offline, stores them in an indexed Gradient Bank, and retrieves only query-relevant gradients at inference time for temporary weight adaptation. However, raw language-modeling gradients are optimized for token-level document reconstruction rather than for query-driven knowledge use. We therefore introduce a bi-level meta-learning objective that reshapes document-derived gradients into generalizable adaptation signals for downstream tasks. Experiments across general and domain-specific settings show that \textsc{ReGrad} outperforms CPT and RAG baselines, enabling scalable and reversible parametric knowledge injection without accumulating weight drift.

2606.15767 2026-06-16 cs.LG cs.AI 交叉投稿

Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning

可视化不确定性:深度学习中缺失与冲突证据的空间图

Dong Hyun Jeong, Feng Chen, Jin-Hee Cho, Lance M. Kaplan, Audun Jøsang, Soo-Yeon Ji

发表机构 * University of the District of Columbia(哥伦比亚特区大学) University of Texas at Dallas(德克萨斯大学达拉斯分校) Virginia Tech(弗吉尼亚理工大学) U.S. Army DEVCOM Army Research Laboratory(美国陆军DEVCOM陆军研究实验室) University of Oslo(奥斯陆大学) Bowie State University(鲍伊州立大学)

AI总结 提出不确定性激活图(UAM)框架,结合证据深度学习与全梯度类激活映射,生成空间不确定性激活图,区分缺乏证据的空虚和假设冲突的不和谐,填补不确定性量化与可解释性之间的空白。

详情
AI中文摘要

理解深度神经网络何时以及为何不确定对于在安全关键领域部署可靠的机器学习系统至关重要。虽然现有的不确定性量化方法提供了模型置信度的标量度量,但它们对输入的哪些空间区域导致不同类型的不确定性提供的洞察有限。我们提出了一种新颖的可视化框架——不确定性激活图(UAM),它将证据深度学习(EDL)与全梯度类激活映射(FullGrad)相结合,生成可解释的空间不确定性激活图。我们的方法区分了两种基本的不确定性类型:空虚(代表缺乏证据)和不和谐(捕捉竞争假设之间的冲突证据)。通过利用FullGrad的完整梯度分解特性和主观逻辑的原则性不确定性量化,我们的方法产生了理论上合理的可视化,突出显示了导致模型不确定性的特定图像区域。利用该框架,通过计算信念加权属性生成空虚和不和谐激活图,从而能够识别模型缺乏知识的区域与遇到模糊证据的区域。在多个基准数据集上的广泛评估表明,所提出的框架有效地解决了不确定性量化与可解释性之间的关键差距,为评估复杂视觉识别任务中的模型可靠性提供了直观的视觉反馈。

英文摘要

Understanding when and why deep neural networks are uncertain is crucial for deploying reliable machine learning systems in safety-critical domains. While existing uncertainty quantification methods provide scalar measures of model confidence, they offer limited insight into which spatial regions of an input contribute to different types of uncertainty. We propose a novel visualization framework, Uncertainty Activation Map (UAM), that combines Evidential Deep Learning (EDL) with Full-Gradient Class Activation Mapping (FullGrad) to generate interpretable spatial uncertainty activation maps. Our approach distinguishes between two fundamental types of uncertainty: vacuity, representing lack of evidence, and dissonance, capturing conflicting evidence between competing hypotheses. By leveraging the complete gradient decomposition property of FullGrad and the principled uncertainty quantification of Subjective Logic, our method produces theoretically grounded visualizations that highlight specific image regions responsible for model uncertainty. With this framework, vacuity and dissonance activation maps are generated by computing belief-weighted attributions, enabling identification of where models lack knowledge versus where they encounter ambiguous evidence. Extensive evaluations across multiple benchmark datasets demonstrate that the proposed framework effectively addresses the critical gap between uncertainty quantification and explainability, providing intuitive visual feedback to assess model reliability in complex visual recognition tasks.

2606.15793 2026-06-16 cs.LG cs.AI stat.ML 交叉投稿

Proximal Policy Optimization for Amortized Discrete Sampling

用于摊销离散采样的近端策略优化

Anna Zykova-Myzina, Timofei Gritsaev, Daniil Tiapkin, Nikita Morozov

发表机构 * HSE University(高等经济学院) Constructor University(康斯特大学) CMAP, CNRS, École polytechnique, IPP(CMAP,CNRS,巴黎综合理工学院,IPP)

AI总结 本文在生成流网络框架下,推导了策略梯度算法并首次应用近端策略优化,提升了离散概率分布采样的收敛速度和数据效率。

详情
AI中文摘要

本文探讨了在生成流网络(GFlowNet)框架下,使用策略梯度算法训练随机策略以从结构化离散概率分布中采样。基于GFlowNet与熵正则化强化学习之间的广泛理论联系,我们推导了用于训练GFlowNet的标准策略梯度算法的等价形式,并实验性地探索了其各种方法论方面,包括基线训练和优势估计。最重要的是,我们的工作是首次推导并成功将近端策略优化应用于GFlowNet,在从合成能量到分子图生成的基准测试中,与标准GFlowNet训练目标相比,显示出更快的收敛速度和更高的数据效率。

英文摘要

This paper explores policy gradient algorithms for training stochastic policies to sample from structured discrete probability distributions under the Generative Flow Network (GFlowNet) framework. Building on extensive theoretical connections between GFlowNets and entropy-regularized reinforcement learning, we derive equivalents of standard policy gradient algorithms for training GFlowNets, as well as experimentally explore their various methodological aspects, including baseline training and advantage estimation. Most importantly, our work is the first to derive and successfully apply proximal policy optimization to GFlowNets, showing its improved convergence speed and data efficiency compared to standard GFlowNet training objectives on benchmarks ranging from synthetic energies to molecular graph generation.

2606.15796 2026-06-16 cs.CV cs.AI 交叉投稿

DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

DifFRACT:用于电路追踪的扩散特征重构与归因

Artyom Mazur, Nina Konovalova, Aibek Alanov

发表机构 * HSE University(高等经济学院) FusionBrain Lab(FusionBrain实验室)

AI总结 本文扩展了基于转码器的电路追踪方法到多模态扩散Transformer,通过训练时间步条件转码器近似MLP子层,实现精确的特征级归因并恢复可解释电路,揭示了属性绑定和跨流语义传播机制。

详情
AI中文摘要

机械可解释性旨在通过将模型计算分解为可解释特征和电路来解释神经网络行为。虽然基于转码器的电路追踪最近已实现对大型语言模型的详细因果分析,但用于图像生成的多模态扩散Transformer仍然相对不透明。我们仍然缺乏理解语义信息如何在去噪步骤间传播以及文本和图像表示如何在双流MM-DiT架构中交互的工具。现有方法仅提供部分洞察:注意力图揭示了token交互的有限视图,而稀疏自编码器可以发现可解释特征,但并未直接揭示这些特征如何通过非线性MLP层进行变换和组合。在这项工作中,我们将基于转码器的电路追踪扩展到多模态扩散Transformer。我们训练了时间步条件转码器,它们忠实地近似FLUX.1[schnell]中MLP子层的输入输出行为。通过用转码器替换MLP并线性化剩余计算,我们获得了精确的特征到特征归因,并恢复了紧凑、可解释的电路。实验上,我们的转码器在稀疏性-忠实度权衡上与稀疏自编码器相当或略优。得到的电路揭示了属性绑定和跨流语义传播背后的机制,并为系统性生成错误提供了因果解释。此外,基于电路的干预比标准的基于SAE的引导更加精确和有效。我们的结果表明,基于转码器的电路分析对于最先进的扩散Transformer是可行的,并为理解和控制多模态生成模型提供了强大的框架。代码可在https://github.com/Artalmaz31/DifFRACT获取。

英文摘要

Mechanistic interpretability seeks to explain neural network behavior by decomposing model computations into interpretable features and circuits. While transcoder-based circuit tracing has recently enabled detailed causal analyses of large language models, multimodal diffusion transformers for image generation remain comparatively opaque. We still lack tools for understanding how semantic information propagates across denoising steps and how text and image representations interact within double-stream MM-DiT architectures. Existing methods provide only partial insight: attention maps expose a limited view of token interactions, while sparse autoencoders can discover interpretable features but do not directly reveal how these features are transformed and composed through nonlinear MLP layers. In this work, we extend transcoder-based circuit tracing to multimodal diffusion transformers. We train timestep-conditioned transcoders that faithfully approximate the input-output behavior of MLP sublayers in FLUX.1[schnell]. By replacing MLPs with transcoders and linearizing the remaining computation, we obtain exact feature-to-feature attribution and recover compact, interpretable circuits. Empirically, our transcoders match or slightly outperform sparse autoencoders on the sparsity-faithfulness tradeoff. The resulting circuits reveal mechanisms underlying attribute binding and cross-stream semantic propagation, and provide causal explanations for systematic generation errors. Moreover, circuit-guided interventions are substantially more precise and effective than standard SAE-based steering. Our results demonstrate that transcoder-based circuit analysis is feasible for state-of-the-art diffusion transformers and provides a powerful framework for understanding and controlling multimodal generative models. The code is available at https://github.com/Artalmaz31/DifFRACT

2606.15835 2026-06-16 cs.LG cs.AI 交叉投稿

Wasserstein Convergence of ODE-Based Samplers in Decentralized Diffusion Model via Velocity Field Decomposition

基于速度场分解的去中心化扩散模型中ODE采样器的Wasserstein收敛性

Chencheng Tang, Xuanyu Xue, Fangyikang Wang, Chao Zhang, Hubery Yin

发表机构 * Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Zhejiang University(浙江大学) Tencent(腾讯)

AI总结 针对去中心化扩散模型中随机专家切换的ODE采样,通过速度场分解建立Wasserstein-2距离下的收敛保证,证明N步离散化以O(N^{-1/2}+ε)速率收敛。

Comments 50 pages, 9 figures. Preprint under review

详情
AI中文摘要

扩散模型在生成任务中取得了令人印象深刻的实证成功,其收敛理论现在已相对完善。受隐私和可扩展性驱动,最近的去中心化扩散架构用多个局部专家和路由机制取代单个全局速度场,产生具有随机专家切换的采样动力学,这超出了标准扩散收敛分析的范围。在这项工作中,我们研究了具有随机速度场和基于ODE的采样的去中心化扩散框架。我们在Wasserstein-2距离下建立了收敛保证,表明$N$步离散化的分布在$W_2$中以速率$\mathcal{O}(N^{-1/2}+\varepsilon)$收敛到解析解,其中$\varepsilon$捕捉神经逼近误差。据我们所知,这是针对具有基于ODE采样方案的去中心化扩散模型的第一个$W_2$收敛结果。

英文摘要

Diffusion models have achieved impressive empirical success in generative tasks, and their convergence theory is now relatively well understood. Motivated by privacy and scalability, recent decentralized diffusion architectures replace a single global velocity field with multiple local experts and a routing mechanism, yielding a sampling dynamics with stochastic expert switching that falls outside standard diffusion convergence analyses. In this work, We study a decentralized diffusion framework with stochastic velocity fields and ODE-based sampling. We establish a convergence guarantee in Wasserstein-2 distance, showing that the distribution of the $N$-step discretization converges to the analytical solution at rate $\mathcal{O}(N^{-1/2}+\varepsilon)$ in $W_2$, where $\varepsilon$ captures the neural approximation errors. To our knowledge, this is the first $W_2$ convergence result for decentralized diffusion models with an ODE-based sampling scheme.

2606.15897 2026-06-16 cs.LG cs.AI stat.ML 交叉投稿

Topological Flow Matching

拓扑流匹配

Kacper Wyrwal, İsmail İlkan Ceylan, Alexander Tong

发表机构 * University of Oxford(牛津大学) TU Wien(维也纳技术大学) AITHYRA

AI总结 提出拓扑流匹配,通过拉普拉斯漂移增强参考过程,在保留流匹配稳定性和无模拟目标的同时,捕捉底层域拓扑结构,适用于脑fMRI、洋流等结构化数据。

Comments Accepted at ICLR 2026. 26 pages, 24 figures. Code: https://github.com/KacperWyrwal/topological-flow-matching

详情
AI中文摘要

流匹配是一个强大的生成建模框架,因其简单性和强大的经验性能而受到重视。然而,其标准公式将结构化空间上的信号(例如脑图上的fMRI数据)视为欧几里得空间中的点,忽略了其域的丰富拓扑特征。为了解决这个问题,我们引入了拓扑流匹配,这是流匹配的一种拓扑感知泛化。我们将流匹配解释为解决退化薛定谔桥问题的框架,并通过用拉普拉斯导出的漂移增强参考过程来注入拓扑信息。这种原则性修改捕获了底层域的结构,同时保留了流匹配的理想特性:稳定的、无模拟的目标和确定性样本路径。因此,我们的框架可以作为标准流匹配的直接替代品。我们在多样化的结构化数据集上展示了其有效性,包括脑fMRI、洋流、地震事件和交通流。

英文摘要

Flow matching is a powerful generative modeling framework, valued for its simplicity and strong empirical performance. However, its standard formulation treats signals on structured spaces, such as fMRI data on brain graphs, as points in Euclidean space, overlooking the rich topological features of their domains. To address this, we introduce topological flow matching, a topology-aware generalization of flow matching. We interpret flow matching as a framework for solving a degenerate Schrödinger bridge problem and inject topological information by augmenting the reference process with a Laplacian-derived drift. This principled modification captures the structure of the underlying domain while preserving the desirable properties of flow matching: a stable, simulation-free objective and deterministic sample paths. As a result, our framework serves as a drop-in replacement for standard flow matching. We demonstrate its effectiveness on diverse structured datasets, including brain fMRIs, ocean currents, seismic events, and traffic flows.

2606.15956 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences

你不需要强假设:通过时间差异进行视觉表示学习

Ninad Daithankar, Alexi Gladstone, Yann LeCun, Heng Ji

发表机构 * UIUC(伊利诺伊大学厄巴纳-香槟分校) New York University(纽约大学)

AI总结 提出TDV方法,基于因果假设(过去导致未来)从视频中自监督学习,避免强归纳偏置,在密集空间任务上达到SOTA。

详情
AI中文摘要

AI的进步很大程度上是由假设更少的方法驱动的。随着计算和数据量的增加,弱归纳偏置的方法通常优于强假设的方法。这在视觉表示学习领域尤为典型,方法从监督学习主导,到弱监督学习,再到如今无需人工标签的自监督学习的广泛成功。然而,即使是现代自监督学习方法仍然依赖于强归纳偏置,如数据增强、掩码或裁剪。如果这一趋势持续,这些剩余的偏置在大规模下将成为瓶颈——我们的实验证实了这一点:随着数据增长,归纳偏置的最优强度降低。这促使我们寻找依赖更少假设的方法。为此,我们提出了视觉时间差异(TDV),一种从视频中进行自监督学习的新范式,它避免了现有的归纳偏置,而是依赖于一个因果假设:过去导致未来。TDV通过联合训练图像编码器和运动编码器,使得当前帧的表示加上编码的运动等于下一帧的表示。尽管没有利用任何强归纳偏置,TDV在密集空间任务上达到了最先进的水平,为无需强假设的表示学习奠定了基础。

英文摘要

Progress in AI has largely been driven by methods that assume less. As compute and data increase, approaches with weaker inductive biases generally outperform those with stronger assumptions. This is particularly characteristic of the field of Visual Representation Learning, where approaches have gone from being dominated by Supervised Learning, to Weakly Supervised Learning, to the now widespread success of Self-Supervised Learning without human labels. Yet, even modern Self-Supervised Learning approaches still depend on strong inductive biases such as augmentations, masking, or cropping. If this trend holds, even these remaining biases should become bottlenecks at scale -- and our experiments confirm this: the optimal strength of inductive biases decreases as data grows. This motivates the search for approaches that rely on fewer assumptions. To this end, we introduce Temporal Difference in Vision (TDV), a new paradigm for self-supervised learning from video that avoids existing inductive biases, relying instead on a causal assumption that the past causes the future. TDV functions by jointly training an image encoder and a motion encoder so that the current frame's representation plus the encoded motion equals the next frame's representation. Despite not leveraging any strong inductive biases, TDV matches state-of-the-art recipes on dense spatial tasks, laying the foundation for representation learning without strong assumptions.

2606.15963 2026-06-16 cs.DC cs.AI cs.CL cs.LG 交叉投稿

PreLort: Prefix-Nested LoRA for Federated Fine-Tuning under Rank Heterogeneity

PreLort: 面向秩异构联邦微调的前缀嵌套LoRA

Muhammad Waseem, Nurbek Tastan, Andrej Jovanovic, Nicholas D. Lane, Nils Lukas, Karthik Nandakumar, Samuel Horvath

发表机构 * MBZUAI, UAE University of Cambridge, UK(MBZUAI,阿联酋剑桥大学,英国) Flower Labs, UK(Flower Labs,英国) Michigan State University, USA(密歇根州立大学,美国)

AI总结 针对联邦LoRA中异构秩导致的信息分布不均问题,提出PreLort方法,通过前缀层次化嵌套低秩结构、分段聚合规则和前缀嵌套训练策略,使低秩客户端受益于高秩客户端的丰富信息,在准确率和ROUGE-L上优于现有方法。

详情
AI中文摘要

使用LoRA等参数高效方法对大型语言模型进行联邦微调,能够实现基础模型的隐私保护适配。异构硬件资源带来了挑战,因为具有不同适配器秩的客户端无法直接聚合。现有方法虽能实现异构秩下的聚合,但未能控制信息在秩维度上的分布,导致共享低秩表示利用不充分。为此,我们提出PreLort:一种用于联邦LoRA的嵌套低秩公式,将适配器维度组织成前缀层次结构。我们的方法确保较低秩维度编码任务相关信息,而较高秩维度捕获额外容量。基于此,我们引入(i)分段聚合规则,仅对贡献于每个秩分段的客户端进行平均,避免来自零填充低秩客户端的稀释;以及(ii)前缀嵌套训练策略,在多个秩截断下优化每个适配器,鼓励有用信号集中在低秩前缀维度。这些组件共同鼓励一个一致的低秩前缀捕获最任务相关信息,而较高秩维度学习额外容量。这使得低秩客户端能够受益于高秩客户端贡献的更丰富信息,因为前缀维度被一致地学习和聚合。实验表明,我们的方法在准确率和ROUGE-L上持续优于先前的异构联邦LoRA方法,并在多个基础模型上实现了更低或相当困惑度。

英文摘要

Federated fine-tuning of large language models using parameter-efficient methods such as LoRA enables privacy-preserving adaptation of foundation models. Heterogeneous hardware resources introduce challenges, as clients with different adapter ranks cannot be directly aggregated. While existing methods enable aggregation under heterogeneous ranks, they fail to control how information is distributed across rank dimensions, leading to suboptimal use of shared low-rank representations. Instead, we propose PreLort: a nested low-rank formulation for federated LoRA that organizes adapter dimensions into a prefix hierarchy. Our approach ensures that lower-rank dimensions encode task-relevant information, while higher-rank dimensions capture additional capacity. Building on this, we introduce (i) a segment-wise aggregation rule that averages only over clients contributing to each rank segment, avoiding dilution from zero-padded lower-rank clients, and (ii) a prefix-nested training strategy that optimizes each adapter under multiple rank truncations, encouraging useful signal to concentrate in low-rank prefix dimensions. Together, these components encourage a consistent low-rank prefix capturing the most task-relevant information, while higher-rank dimensions learn additional capacity. This allows low-rank clients to benefit from richer information contributed by higher-rank clients, as prefix dimensions are consistently learned and aggregated. Experiments demonstrate that our method consistently outperforms prior heterogeneous federated LoRA methods in accuracy and ROUGE-L, while achieving lower or comparable perplexity across multiple base models.

2606.15989 2026-06-16 q-bio.NC cs.AI 交叉投稿

Task-guided cross-subject latent alignment: a multi-encoder-decoder VAE

任务引导的跨被试潜在对齐:一种多编码器-解码器VAE

Angeliki Papathanasiou, Jascha Achterberg, Thomas E. Nichols, Rui Ponte Costa

发表机构 * Centre for Neural Circuits and Behaviour Department of Physiology Anatomy and Genetics University of Oxford(神经回路与行为中心 生理解剖与遗传学系 牛津大学) Big Data Institute Nuffield Department of Medicine University of Oxford(大数据研究所 纳菲尔德医学系 牛津大学)

AI总结 提出MED-VAE模型,通过预训练ANN锚定表征,实现无共享刺激的跨被试神经对齐,在自然场景数据集上优于传统方法,并支持跨被试图像解码。

Comments In Proceedings of the 9th Conference on Cognitive Computational Neuroscience, New York, NY, USA, 2026

详情
AI中文摘要

对齐跨被试的神经活动有望发现共享的计算原理和可泛化解码器。然而,传统对齐方法要求被试间共享刺激,这一限制使其难以应用于数据有限或非重叠的自然范式。我们提出了一种多编码器-解码器变分自编码器(MED-VAE),通过将表征锚定到预训练ANN提供的公共支架上,实现了无需共享刺激的跨被试对齐。利用自然场景数据集,我们展示了MED-VAE创建了具有优越语义组织的公共潜在空间,在跨被试对齐方面优于常见方法,同时在对传统方法失效的保留刺激上保持了稳健的泛化能力。从这些公共空间重建回每个被试的原始神经空间,MED-VAE在其跨被试潜在空间中保留了等量的刺激驱动信号。最后,我们展示了这种优越的对齐直接实现了跨被试神经预测,通过跨被试图像解码得到了验证。总之,我们提出了一种框架,用于识别可泛化的公共子空间以进行跨被试预测和下游任务,本文以静态图像视觉皮层响应为例进行了演示。

英文摘要

Aligning neural activity across subjects offers the promise of discovering shared computational principles and generalizable decoders. However, traditional alignment methods require shared stimuli across subjects, a constraint that limits applicability to naturalistic paradigms with limited or non-overlapping data. We introduce a Multi-Encoder-Decoder Variational Autoencoder (MED-VAE) that achieves cross-subject alignment without shared stimuli by anchoring representations to a common scaffold provided by a pretrained ANN. Using the Natural Scenes Dataset, we show that MED-VAE creates common latent spaces with superior semantic organisation, achieving higher cross-subject alignment than common methods while maintaining robust generalisation to held-out stimuli where traditional methods degrade. Reconstructing from these common spaces back to each subject's original neural space, MED-VAE preserves equal stimulus-driven signal in its cross-subject latent space. Finally, we show that this superior alignment directly enables cross-subject neural prediction, as demonstrated via cross-subject image decoding. In summary, we introduce a framework to identify generalisable common subspaces for cross-subject predictions and downstream tasks, demonstrated here for visual cortex responses to static images.

2606.16050 2026-06-16 cs.LG cs.AI 交叉投稿

ALCL: An Adaptive Log-Correntropy Loss for Robust Learning under Non-Gaussian Noise

ALCL:一种用于非高斯噪声下鲁棒学习的自适应对数相关熵损失

Mainak Kundu, Ria Kanjilal, Ismail Uysal

发表机构 * University of South Florida(南佛罗里达大学) California Polytechnic State University(加州州立理工大学)

AI总结 提出自适应对数相关熵损失(ALCL),通过可微重参数化联合学习形状和尺度参数,使损失几何动态适应残差统计,抑制极端异常值,在混合重尾和脉冲噪声下优于MSE和固定核相关熵损失。

详情
AI中文摘要

在重尾和脉冲噪声下的鲁棒深度学习仍然具有挑战性,因为均方误差(MSE)等传统损失对异常值表现出无界敏感性。尽管基于相关熵的目标函数提高了鲁棒性,但现有公式依赖于固定的核参数,这些参数必须凭经验调整且在训练期间保持不变。为了解决这些局限性,我们提出了一种自适应对数相关熵损失(ALCL),这是一种重尾损失公式,能够在优化过程中自适应地学习其鲁棒性几何结构。ALCL引入了一个对数残差模型,其形状和尺度参数通过可微重参数化与网络权重联合学习。这产生了一个原理性的最大似然公式,其影响函数形式上是有界且再下降的,使得损失几何能够动态适应不断变化的残差统计,同时抑制极端异常值。在四个广泛使用的基准数据集(涵盖灰度图像和红绿蓝(RGB)图像数据)上,在混合重尾和脉冲噪声下进行的比较实验表明,ALCL在重建保真度和下游分类准确性方面始终优于MSE和最优调整的广义相关熵损失。虽然在低噪声条件下性能差异仍然很小,但在高噪声条件下,ALCL在灰度基准上中位数准确率提高了高达4.75%,在RGB数据集上提高了4.51%,并且运行间方差减小。这些结果表明,通过联合学习损失参数实现的自适应鲁棒性为非高斯环境下深度学习中基于静态相关熵的损失提供了一种计算高效的替代方案。

英文摘要

Robust deep learning under heavy-tailed and impulsive noise remains challenging because conventional losses such as mean squared error (MSE) exhibit unbounded sensitivity to outliers. Although correntropy-based objectives improve robustness, existing formulations rely on fixed kernel parameters that must be empirically tuned and remain static during training. To address these limitations, we propose an Adaptive Log-Correntropy Loss (ALCL), a heavy-tailed loss formulation that adaptively learns its robustness geometry during optimization. ALCL introduces a logarithmic residual model whose shape and scale parameters are learned jointly with network weights through differentiable reparameterization. This yields a principled maximum likelihood formulation whose influence function is formally bounded and redescending, allowing the loss geometry to adapt dynamically to evolving residual statistics while suppressing extreme outliers. Comparative experiments on four widely used benchmark datasets spanning grayscale and red-green-blue (RGB) image data under mixed heavy-tailed and impulsive noise demonstrate that ALCL consistently outperforms MSE and optimally tuned generalized correntropy losses in both reconstruction fidelity and downstream classification accuracy. While performance differences remain small under low-noise conditions, under high-noise regimes ALCL improves median accuracy by up to 4.75% on grayscale benchmarks and 4.51% on RGB datasets, with reduced variance across runs. These results demonstrate that adaptive robustness through joint learning of loss parameters provides a computationally efficient alternative to static correntropy-based losses for deep learning in non-Gaussian environments.

2606.16076 2026-06-16 cs.LG cs.AI cs.GT 交叉投稿

Phys-JEPA: Physics-Informed Latent World Models for Multivariate Time-Series Forecasting

Phys-JEPA:面向多变量时间序列预测的物理信息潜在世界模型

Weizhi Nie, Weichao Liu, Honglin Guo, Yuting Su

发表机构 * Tianjin University(天津大学)

AI总结 提出Phys-JEPA架构,将物理一致性约束引入潜在状态和状态转移,分解预测状态为物理和残差分量,在气候、交通、电力数据集上提升预测精度。

Comments Submitted to arXiv as a preliminary manuscript. 10 figures

详情
AI中文摘要

物理系统中的多变量预测需要模型在预测耦合时间变量的同时保持有意义的状态演化。深度预测器可以拟合时间相关性,物理信息模型可以用科学约束正则化预测,但这些方向通常仅在解码输出层面连接。因此,生成未来轨迹的隐藏预测状态可能在统计上有用,但在物理上无结构。我们提出Phys-JEPA,一种用于多变量时间序列预测的物理信息联合嵌入预测架构。Phys-JEPA学习一个潜在世界模型,其中预测状态被分解为物理和残差分量,物理一致性直接施加于潜在状态和潜在转移,而不仅仅施加于解码后的预测。该公式利用已知物理变量组织表示空间,同时保留未解析动力学的残差容量。在Jena Climate 2009–2016上,Phys-JEPA在H=24时将聚合MSE从0.12482降至0.12273,温度MSE从0.01892降至0.01831。在Traffic上,完整Phys-JEPA在所有测试视界内优于监督基线,将H=192的MSE从0.800784降至0.773873。在Electricity上,最佳变体取决于视界:静态潜在一致性在H=24和H=48时最强,而完整Phys-JEPA在H=192时给出最佳的聚合和目标变量MSE。这些初步结果表明,将物理信息学习从输出空间转移到潜在预测状态空间是可解释时间世界模型的一个有前景的方向。

英文摘要

Multivariate forecasting in physical systems requires models that predict coupled temporal variables while preserving meaningful state evolution. Deep forecasters can fit temporal correlations, and physics-informed models can regularize predictions with scientific constraints, but these directions are often connected only at the decoded-output level. As a result, the hidden predictive state that generates future trajectories may remain statistically useful but physically unstructured. We introduce Phys-JEPA, a physics-informed joint-embedding predictive architecture for multivariate time-series forecasting. Phys-JEPA learns a latent world model in which predictive states are decomposed into physical and residual components, and physical consistency is imposed directly on latent states and latent transitions rather than only on decoded forecasts. This formulation uses known physical variables to organize the representation space while retaining residual capacity for unresolved dynamics. On Jena Climate 2009--2016, Phys-JEPA reduces aggregate MSE from 0.12482 to 0.12273 and temperature MSE from 0.01892 to 0.01831 at H=24. On Traffic, full Phys-JEPA improves aggregate MSE over the supervised baseline across all tested horizons, reducing H=192 MSE from 0.800784 to 0.773873. On Electricity, the best variant depends on horizon: static latent consistency is strongest at H=24 and H=48, while full Phys-JEPA gives the best aggregate and target-variable MSE at H=192. These initial results suggest that moving physics-informed learning from output space to latent predictive state space is a promising direction for interpretable temporal world models.

2606.16093 2026-06-16 cs.CL cs.AI 交叉投稿

Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing

基于可学习混合的GSS-Transformer混合架构的长上下文建模

Kuzey Torlak, Hüseyin Arda Arslan, Anıl Dervişoğlu, Beyza Nur Deniz, Onur Boyar

发表机构 * Kadıköy Anadolu High School(卡德柯伊安纳多卢高中) Politecnico di Torino(都灵理工大学) Istanbul Technical University(伊斯坦布尔理工大学) Boğaziçi University(博阿齐奇大学) IBM Research - Tokyo(IBM 东京研究院)

AI总结 提出并行混合架构PHA,通过可学习混合机制融合GSS、GQA和FFN,在长上下文建模中实现Transformer级困惑度与更高效率。

Comments 16 pages, 9 tables, 4 figures

详情
AI中文摘要

建模长距离依赖仍然是自然语言处理中的核心挑战。Transformer架构通过自注意力实现强性能,但计算复杂度随序列长度呈二次方增长($O(N^2)$),而状态空间模型(SSM)线性扩展($O(N)$)但存在选择性召回瓶颈,难以从压缩状态中检索精确信息。这导致了效率与困惑度之间的基本权衡。为应对这些挑战,我们提出了\textit{并行混合架构(PHA)},它将门控状态空间(GSS)、分组查询注意力(GQA)和前馈网络(FFN)作为独立的并行分支运行,并通过可学习混合机制融合。PHA不强制SSM近似注意力或将两种范式串行化,而是让每个分支专门化:GSS捕获全局上下文,注意力执行选择性检索,FFN提供补充处理。在WikiText-103上,PHA在125M参数下达到16.51 PPL,优于Hedgehog(16.70)和H3-125M(23.70)。扩展到180M参数得到16.42 PPL,与纯注意力基线结果相当,同时在长上下文下吞吐量提高24%,内存使用降低40%。在OpenWebText上,我们的125M模型达到19.72 PPL,优于标准Transformer(20.60)和GSS混合基线(19.80)。这些结果表明,将序列建模范式分离为并行专家,能够在长上下文语言建模中实现Transformer级困惑度,同时显著提升效率。

英文摘要

Modeling long-range dependencies remains a central challenge in natural language processing. Transformer architectures achieve strong performance via self-attention but scale quadratically ($O(N^2)$) with sequence length, while State Space Models (SSMs) scale linearly ($O(N)$) but suffer from a selective recall bottleneck, struggling to retrieve precise information from compressed states. This creates a fundamental tradeoff between efficiency and perplexity. To tackle these challenges, we propose the \textit{Parallel Hybrid Architecture (PHA)}, which runs Gated State Spaces (GSS), Grouped Query Attention (GQA), and Feed-Forward Networks (FFNs) as independent parallel branches fused by a learnable mixing mechanism. Instead of forcing SSMs to approximate attention or serializing the two paradigms, PHA allows each branch to specialize: GSS captures global context, while attention performs selective retrieval, with FFN providing complementary processing. On WikiText-103, PHA achieves 16.51 PPL at 125M parameters, outperforming Hedgehog (16.70) and H3-125M (23.70). Scaling to 180M parameters yields 16.42 PPL, which gives comparable results with the pure attention baseline while delivering 24\% higher throughput and up to 40\% lower memory usage at long contexts. On OpenWebText, our 125M model achieves 19.72 PPL, outperforming standard Transformers (20.60) and GSS hybrid baselines (19.80). These results demonstrate that separating sequence modeling paradigms into parallel specialists enables Transformer-level perplexity with substantially improved efficiency for long-context language modeling.

2606.16112 2026-06-16 cs.LG cs.AI 交叉投稿

Scaling Adaptive Depth with Norm-Agnostic Residual Networks

缩放自适应深度:范数无关残差网络

Tomás Figliolia, Beren Millidge

发表机构 * Zyphra San Francisco, CA(Zyphra旧金山加州)

AI总结 针对残差网络中残差流范数随深度增长导致深层更新被抑制的问题,提出范数无关残差架构NAG,通过分离幅度和方向信息保持各层贡献,并实现可解释的自适应深度跳过机制,在等计算量下匹配全深度性能。

详情
AI中文摘要

残差架构在深度学习中无处不在,但它们存在一个微妙的结构性限制:残差流的范数会随深度迅速增长。因此,来自后层的更新相对于累积的残差状态变得很小。这降低了它们对表示的影响,并限制了模型在深度上扩展的益处。为了解决这个问题,我们引入了NAG,一种范数无关的残差架构,它将残差流中的幅度与方向信息分离,在整个深度中保留有意义的层贡献,并防止后层更新被残差范数增长系统地抑制。重要的是,NAG仅引入可忽略数量的额外参数,并依赖于易于内核融合的简单操作,从而在实践中保持训练效率。我们表明,该架构优于基线Transformer,其增益随深度增加而显著增大,从而能够有效训练更深的模型。范数无关的公式还产生了一种可解释的深度混合(MoD)机制,该机制自适应地跳过注意力和MLP层。除了作为训练后的精度-计算权衡外,该机制还可以用作预训练时的扩展策略:在等FLOP训练下,通过减少每token前向传播成本节省的计算量可以再投资于在更多token上训练,同时保持总参数数量和KV缓存预算固定。在我们的实验中,约20%-25%的适度深度混合率在相等训练计算量下匹配全深度基线性能,同时大幅减少执行的层参数数量和前向传播FLOPs。这些结果将深度稀疏性确定为固定计算量训练的新扩展轴,从而能够实现非常深但FLOP高效的模型。

英文摘要

Residual architectures are ubiquitous in deep learning, but they suffer from a subtle structural limitation: the norm of the residual stream can grow rapidly with depth. As a result, updates from later layers become small relative to the accumulated residual state. This reduces their impact on the representation and limits the benefits of scaling models in depth. To address this, we introduce NAG, a norm-agnostic residual architecture that separates magnitude from directional information in the residual stream, preserving meaningful layer contributions throughout depth and preventing later updates from being systematically suppressed by residual-norm growth. Importantly, NAG introduces only a negligible number of additional parameters and relies on simple operations that are easily kernel-fusible, preserving training efficiency in practice. We show that this architecture outperforms baseline Transformers, with gains that increase substantially as depth grows, enabling effective training of much deeper models. The norm-agnostic formulation also leads to an interpretable Mixture-of-Depths (MoD) mechanism that adaptively skips both attention and MLP layers. Beyond serving as a post-training accuracy-compute tradeoff, this mechanism can be used as a pretraining-time scaling strategy: under iso-FLOP training, compute saved by reducing per-token forward-pass cost can be reinvested into training on more tokens while keeping the total parameter count and KV-cache budget fixed. In our experiments, moderate Mixture-of-Depths rates of approximately 20%-25% match full-depth baseline performance under equal training compute while substantially reducing the number of executed layer parameters and forward-pass FLOPs. These results identify sparsity in depth as a new scaling axis for fixed-compute training, enabling very deep yet FLOP-efficient models.

2606.16160 2026-06-16 cs.LG cs.AI cs.HC 交叉投稿

A comparative and critical study of EEGNet for fNIRS-driven cognitive load classification

EEGNet在fNIRS驱动的认知负荷分类中的比较与批判性研究

Mehshan Ahmed Khan, Houshyar Asadi, Li Zhang, Mohammad reza Chalak Qazani, Ghazal Bargshady, Stefanos gkikas, Christian arzate, Sam Oladazimi, Zoran Najdovsk, Lei Wei, Chee Peng Lim

发表机构 * Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University(智能系统研究与创新研究所(IISRI),德克萨斯大学) Department of Computer Science, Royal Holloway, University of London(伦敦大学皇家霍洛威学院计算机科学系) College of Science and Engineering, James Cook University(詹姆斯库克大学科学与工程学院) Faculty of Science and Technology, University of Canberra(堪培拉大学科学与技术学院) Honda research institute (HRI), Japan(日本本田研究院) Swinburne University of Technonology, Hawthorn, Victoria(技术学院,维多利亚州哈沃恩)

AI总结 本研究系统评估EEGNet在fNIRS认知负荷分类中的性能,发现重叠分段和小固定学习率在随机分割中表现最佳,但受试者独立评估准确率大幅下降,非重叠分段和PCA特征在SI评估中取得最佳56.11%准确率,表明消除时间冗余有助于学习更鲁棒的跨个体表征。

详情
AI中文摘要

由于时间变异性、受试者间差异以及对预处理选择的敏感性,从功能性近红外光谱(fNIRS)信号中准确分类认知负荷仍然是一个重大挑战。本研究通过系统检查时间分割策略(重叠与非重叠)、窗口长度(10秒、20秒、30秒)、特征提取方法(方差分析(ANOVA)、主成分分析(PCA)、快速独立成分分析(FastICA))、学习率配置(固定和自适应)以及评估协议(随机分割与受试者独立(SI))的影响,对EEGNet在基于fNIRS的认知负荷分类中进行了全面评估。随机分割实验的结果表明,重叠分割结合较小的固定学习率(0.01-0.001)由于时间冗余和血流动力学转变的密集采样而产生了最高的准确率。然而,SI评估显示准确率大幅下降,表明对未见参与者的泛化能力有限。在SI评估下,非重叠分割优于重叠窗口,使用PCA特征、20秒窗口和0.1学习率获得了最佳准确率56.11%。这些发现表明,消除时间冗余有助于模型学习更鲁棒和可泛化的跨个体认知负荷表征。尽管自适应学习率策略提高了训练稳定性,但并未超过最优选择的固定学习率的性能。该研究强调了分割策略和学习率选择在提高模型泛化能力中的关键作用,并指出了开发基于fNIRS的可靠、实时和受试者独立认知负荷分类系统所必需的方法学考虑。

英文摘要

Accurately classifying cognitive load from functional near-infrared spectroscopy (fNIRS) signals remains a significant challenge due to temporal variability, inter-subject differences, and sensitivity to preprocessing choices. This study provides a comprehensive evaluation of EEGNet for fNIRS-based cognitive load classification by systematically examining the effects of temporal segmentation strategies (overlapping vs. non-overlapping), window lengths (10s, 20s, 30s), feature extraction methods (Analysis of Variance (ANOVA), Principal Component Analysis (PCA), Fast Independent Component Analysis (FastICA)), learning rate configurations (fixed and adaptive), and evaluation protocols (random split vs. subject-independent (SI)). Results from random-split experiments show that overlapping segmentation, combined with smaller fixed learning rates (0.01-0.001), yields the highest accuracies, due to temporal redundancy and dense sampling of hemodynamic transitions. However, SI evaluation reveals a substantial drop in accuracy, demonstrating limited generalization to unseen participants. Under SI evaluation, non-overlapping segmentation outperformed overlapping windows, with the best accuracy of 56.11% achieved using PCA features with a 20-second window and a 0.1 learning rate. These findings indicate that eliminating temporal redundancy helps the model learn more robust and generalizable representations of cognitive load across individuals. Although adaptive learning rate strategy improved training stability, it did not surpass the performance of optimally selected fixed learning rates. The study highlights the critical role of segmentation strategy and learning rate selection in improving model generalization and identifies methodological considerations essential for developing reliable, real-time, and SI cognitive load classification systems using fNIRS.

2606.16193 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

Cascaded Sparse Autoencoders Learn Multi-Level Visual Concepts in Multimodal LLMs

级联稀疏自编码器在多模态大语言模型中学习多级视觉概念

Yusong Zhao, Hengyi Wang, Tanuja Ganu, Akshay Nambi, Hao Wang

发表机构 * Rutgers University(罗格斯大学) Microsoft Research(微软研究院)

AI总结 提出级联稀疏自编码器(CSAEs),通过在第一级SAE解码器权重上训练第二级SAE来学习层次化视觉概念,避免嵌套或堆叠SAE的缺点,在多个MLLM和数据集上提升了概念层次一致性和干预效果。

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言任务上表现出色,但其内部视觉表示仍难以解释。稀疏自编码器(SAEs)提供了一种可扩展的方式,将密集模型激活分解为稀疏、可解释的特征。然而,现有SAE架构主要恢复扁平特征字典,不太适合显式的多级概念组织。在本文中,我们引入级联稀疏自编码器(CSAEs)用于学习MLLMs中的层次化视觉概念。CSAEs并非嵌套或堆叠SAE稀疏激活码,而是直接在第一个SAE的解码器权重上训练第二个SAE,将学习到的低级特征方向作为高级抽象的输入。这种设计使CSAEs能够学习“概念的概念”,同时避免了嵌套、Matryoshka式层次结构中的共享前缀耦合问题以及简单堆叠SAE的瓶颈。在Qwen3-VL、Gemma-3和LLaVA上的多个视觉数据集上的实验表明,与最先进的SAE基线相比,CSAEs在层次概念一致性方面提高了可解释性。概念引导的结果进一步表明,学习到的概念组支持对MLLM输出进行有效的组级干预。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their internal visual representations remain difficult to interpret. Sparse Autoencoders (SAEs) provide a scalable way to decompose dense model activations into sparse, interpretable features. However, existing SAE architectures primarily recover flat feature dictionaries and are less suited for explicit multi-level concept organization. In this paper, we introduce cascaded sparse autoencoders (CSAEs) for learning hierarchical visual concepts in MLLMs. Rather than nesting or stacking SAE sparse activation codes, CSAEs train a second-level SAE directly on the decoder weights of the first-level SAE, treating learned low-level feature directions as inputs for higher-level abstraction. This design enables CSAEs to learn "concepts of concepts" while avoiding drawbacks from the shared-prefix coupling of nesting, Matryoshka-style hierarchies and the bottlenecks of naively stacked SAEs. Experiments across Qwen3-VL, Gemma-3, and LLaVA on multiple visual datasets show that CSAEs improve interpretability in terms of hierarchical concept coherence over state-of-the-art SAE baselines. Results on concept steering further demonstrate that the learned concept groups support effective group-level interventions in MLLM outputs.

2606.16257 2026-06-16 cs.LG cs.AI 交叉投稿

Variance Reduction for Non-Log-Concave Sampling with Applications to Inverse Problems

非对数凹采样的方差缩减及其在逆问题中的应用

M. Berk Sahin, Ahmet Ege Tanriverdi, Behzad Sharif, Abolfazl Hashemi

发表机构 * School of Electrical and Computer Engineering, Purdue University(普渡大学电气与计算机工程学院) School of Electrical and Computer Engineering, University of Southern California(南加州大学电气与计算机工程学院) School of Biomedical Engineering, Purdue University(普渡大学生物医学工程学院)

AI总结 针对非对数凹分布采样中随机梯度高方差问题,提出统一分析动量、STORM和PAGE等方差缩减方法,证明其在相对Fisher信息和非平方总变差距离下的改进收敛率,并扩展至基于得分的生成先验逆问题求解。

Comments Accepted to Uncertainty in Artificial Intelligence (UAI) 2026

详情
AI中文摘要

从具有未归一化密度的高维、非对数凹分布中采样是机器学习中的一个基本挑战,特别是当势能的精确梯度不可用,且必须通过每次迭代固定梯度计算预算下表现出高方差的随机梯度来近似时。尽管诸如带动量的SGD、STORM和PAGE等方差缩减技术已在非凸优化中展现出改进的收敛性质,但它们对非对数凹分布采样的影响仍 largely unexplored。在这项工作中,我们首次对这些估计器用于非对数凹分布采样进行了统一分析。我们在$\varepsilon$-相对Fisher信息下建立了改进的非渐近收敛率,并在Poincaré不等式假设下,在平方总变差距离下建立了改进的非渐近收敛率,进一步证明了向目标分布的弱收敛。我们将分析扩展到使用基于得分的生成先验求解逆问题。我们通过实验验证了理论,并证明在每次迭代固定梯度计算预算下,方差缩减技术在两个标准成像应用中 consistently 提高了样本质量。

英文摘要

Sampling from high-dimensional, non-log-concave distributions with unnormalized densities is a fundamental challenge in machine learning, particularly when the exact gradient of the potential is unavailable and must be approximated via stochastic gradients that exhibit high variance under a fixed budget of gradient computations per iteration. Although variance reduction techniques such as SGD with momentum, STORM, and PAGE have demonstrated improved convergence properties in non-convex optimization, their implications for sampling from non-log-concave distributions remain largely unexplored. In this work, we develop the first unified analysis of these estimators for sampling from non-log-concave distributions. We establish improved non-asymptotic convergence rates in $\varepsilon$-relative Fisher information and, under a Poincaré inequality assumption, in squared total variation distance, and further prove weak convergence to the target distribution. We extend our analysis to solving inverse problems with score-based generative priors. We empirically validate our theory and demonstrate that, under a fixed gradient computations per iteration, variance-reduction techniques consistently improve sample quality in two standard imaging applications.

2606.16327 2026-06-16 cs.SD cs.AI eess.AS 交叉投稿

ArtBoost: Synthetic Articulatory Data Augmentation for Acoustic-to-Articulatory Inversion

ArtBoost: 用于声学到发音逆映射的合成发音数据增强

Hyung Kyu Kim, Byungchan Hwang, Hak Gu Kim

发表机构 * Anonymous 1(匿名机构1)

AI总结 提出ArtBoost数据增强策略,利用大规模语音-网格数据集提取伪发音轨迹进行预训练,在有限EMA数据下提升声学到发音逆映射性能,PCC和RMSE一致改善。

Comments Accepted in Interspeech26

详情
AI中文摘要

最近的声学到发音逆映射(AAI)模型依赖于电磁发音描记术(EMA)数据,这些数据成本高昂且规模有限。为了解决这一限制,我们提出了\textit{ArtBoost},一种新颖的数据增强策略,利用最初为语音驱动的3D面部动画开发的大规模语音-网格数据集,在有限的EMA监督下改进AAI。\textit{ArtBoost}从可见的面部锚点提取伪发音轨迹,并在真实EMA数据上微调之前用于预训练。实验显示PCC和RMSE一致改善。轨迹分析证实伪发音信号反映了物理上有意义的可见发音动态。在不同AAI架构上的额外评估表明稳定的性能提升,表明\textit{ArtBoost}可以集成到多种AAI模型中。这些结果表明语音-网格数据为AAI提供了一种有效且可扩展的发音监督来源。项目页面:https://cau-irislab.github.io/Interspeech26-ArtBoost/

英文摘要

Recent acoustic-to-articulatory inversion (AAI) models rely on electromagnetic articulography (EMA) data, which are costly and limited in scale. To address this limitation, we propose \textit{ArtBoost}, a novel data augmentation strategy that leverages large-scale speech--mesh datasets originally developed for speech-driven 3D facial animation to improve AAI under limited EMA supervision. \textit{ArtBoost} extracts pseudo articulatory trajectories from visible facial anchors and uses them for pre-training before fine-tuning on real EMA data. Experiments show consistent improvements in PCC and RMSE. Trajectory analyses confirm that the pseudo articulatory signals reflect physically meaningful visible articulatory dynamics. Additional evaluations across different AAI architectures demonstrate stable performance gains, indicating that \textit{ArtBoost} can be integrated into diverse AAI models. These results suggest that speech--mesh data provide an effective and scalable source of articulatory supervision for AAI. Project page: https://cau-irislab.github.io/Interspeech26-ArtBoost/

2606.16360 2026-06-16 cs.CL cs.AI 交叉投稿

Tyler: Typed Latent Reasoning for Language Models -- When to Think, What to Compute, and How Much to Allocate

Tyler: 语言模型的类型化潜在推理——何时思考、计算什么以及分配多少

Hanyu Lin, Min Cai, Jiawei Wen, Haodi Zhang

发表机构 * Shenzhen University(深圳大学) University of Alberta(阿尔伯塔大学)

AI总结 提出Tyler框架,通过类型化潜在推理模块和预算感知策略,在自回归解码中动态选择文本生成或潜在计算,显著提升推理准确率并降低遗忘。

Comments website: https://typed-latent-reasoning.github.io

详情
AI中文摘要

链式思维(CoT)提示通过将中间计算外化为离散文本标记来改进大型语言模型(LLM)的推理能力,但这种文本接口也引入了冗余和推理开销。潜在推理通过在连续表示中执行部分计算提供了一种有前景的替代方案。然而,现有方法通常预定义潜在计算何时被调用以及如何在解码过程中分配,留下一个关键问题未解决:何时调用潜在计算、执行何种类型的计算以及分配多少预算。我们提出\textbf{Ty}ped \textbf{L}at\textbf{e}nt \textbf{R}easoning(Tyler),一个用于自回归解码过程中潜在推理的类型化和预算感知框架。Tyler学习一个策略,在每个解码步骤中,选择发射一个文本标记或切换到专门用于特定推理功能的潜在计算模块。一旦被调用,一个算子将当前推理状态映射为支持全局规划、局部状态更新或可重用过程抽象的潜在标记。在三个骨干LLM上的广泛实验中,Tyler相比CoT提高了最多14.49个百分点的准确率,相比最强的竞争基线提高了最多4.30个百分点。它进一步在多种推理领域上泛化,并以最低的遗忘实现了最佳的最后阶段性能。

英文摘要

Chain-of-thought (CoT) prompting improves reasoning in large language models (LLMs) by externalizing intermediate computation as discrete text tokens, but this textual interface also introduces redundancy and inference overhead. Latent reasoning offers a promising alternative by carrying part of the computation in continuous representations. However, existing methods typically predefine when latent computation is invoked and how it is allocated during decoding, leaving a key problem unresolved: when to invoke latent computation, what type of computation to perform, and how much budget to allocate. We propose \textbf{Ty}ped \textbf{L}at\textbf{e}nt \textbf{R}easoning (Tyler), a typed and budget-aware framework for latent reasoning during autoregressive decoding. Tyler learns a policy that, at each decoding step, chooses between emitting a text token and switching to a latent computation module specialized for a particular reasoning function. Once invoked, an operator maps the current reasoning state into latent tokens that support global planning, local state updates, or reusable procedural abstraction. Across extensive experiments on three backbone LLMs, Tyler improves accuracy by up to 14.49 points over CoT and by up to 4.30 points over the strongest competing baseline. It further generalizes across diverse reasoning domains and achieves the best final-stage performance with the lowest forgetting.

2606.16362 2026-06-16 eess.IV cs.AI cs.IT math.IT 交叉投稿

Input-Dependent Fisher Information for Local Sensitivity Analysis of Medical Image Classifiers

输入依赖的Fisher信息矩阵用于医学图像分类器的局部敏感性分析

Sourya Sengupta. Mark A. Anastasio

发表机构 * Department of Electrical and Computer Engineering, University of Illinois Urbana–Champaign(伊利诺伊大学厄巴纳-香槟分校电气与计算机工程系) Mallinckrodt Institute of Radiology and Department of Electrical & Systems Engineering, Washington University in St. Louis(华盛顿大学圣路易斯分校马林克罗德特放射医学研究所及电气与系统工程系)

AI总结 提出基于输入依赖Fisher信息矩阵(iFIM)的局部敏感性分析框架,通过Gram矩阵恢复iFIM非零谱,将图像分解为高/低敏感性分量,实验证明高敏感性分量与预测置信度和分类性能变化强相关。

详情
AI中文摘要

深度神经网络在医学图像分类中取得了强大性能,但通常像黑箱一样工作。常用的后验解释方法通常提供启发式可视化,其与分类器预测分布的关系是间接的。本文引入了一个基于训练分类器的输入依赖Fisher信息矩阵(iFIM)的局部敏感性分析框架。iFIM描述了在输入图像的无穷小扰动下分类器预测分布的变化。通过使用Gram矩阵公式,可以在不显式形成完整图像维度的Fisher矩阵的情况下恢复iFIM的非零特征谱。然后利用领先的iFIM特征空间将输入图像投影为高局部敏感性分量及其正交分量。这些分量提供了局部预测敏感性的模型内在描述,而不是传统的逐像素归因热图或任务相关解剖结构的因果分割。该框架在受控和临床医学图像分类任务上使用多种分类器架构进行了评估。基于扰动的实验表明,高敏感性iFIM分量与预测置信度和分类性能的变化相比低敏感性互补分量有更强的耦合。结果支持iFIM框架作为分析局部决策敏感性的原则性工具,并补充医学成像中现有的基于归因的可解释性方法。

英文摘要

Deep neural networks have achieved strong performance in medical image classification, but often work like black-box. Commonly used post-hoc interpretation methods often provide heuristic visualizations whose relationship to the classifier's predictive distribution is indirect. This work introduces a local sensitivity analysis framework based on the input-dependent Fisher Information Matrix (iFIM) of a trained classifier. The iFIM characterizes how the classifier's predictive distribution changes under infinitesimal perturbations of the input image. By using a Gram-matrix formulation, the nonzero eigenspectrum of the iFIM can be recovered without explicitly forming the full image-dimensional Fisher matrix. The leading iFIM eigenspace is then used to project an input image into a high local-sensitivity component and its orthogonal component. These components provide a model-intrinsic description of local predictive sensitivity, rather than a conventional pixel-wise attribution heatmap or a causal segmentation of task-relevant anatomy. The framework is evaluated on controlled and clinical medical image classification tasks using multiple classifier architectures. Perturbation-based experiments show that high-sensitivity iFIM components are more strongly coupled to changes in predictive confidence and classification performance than lower-sensitivity complementary components. The results support the iFIM framework as a principled tool for analyzing local decision sensitivity and for complementing existing attribution-based interpretability methods in medical imaging.

2606.16454 2026-06-16 cs.LG cs.AI 交叉投稿

SDS-LoRA: Overcoming Anisotropic Gradient Scaling in Low-Rank Adaptation

SDS-LoRA:克服低秩适应中的各向异性梯度缩放

Junghun Oh, Sungyong Baik, Kyoung Mu Lee

发表机构 * Seoul National University(首尔大学) Hanyang University(汉阳大学)

AI总结 提出SDS-LoRA,通过结构解耦奇异值与反向传播,消除LoRA中梯度各向异性缩放导致的秩降低和次优对齐问题,提升收敛速度和适应性能。

详情
AI中文摘要

低秩适应(LoRA)通过使用低秩矩阵参数化权重更新,实现了大型预训练模型对下游任务的高效适应。在本文中,我们从几何角度研究了LoRA参数化的局限性。具体地,我们表明当全微调梯度反向传播到低秩矩阵时,它会经历由奇异值驱动的各向异性缩放。我们认为这种现象是不可取的,因为它通过将梯度偏向主导奇异方向而抑制其他方向,从而扭曲了全微调梯度。我们的分析表明,各向异性梯度缩放降低了低秩矩阵梯度的有效秩,并导致LoRA中全微调梯度与其低秩近似之间的次优对齐,从而加剧了与全微调的差距。为了解决这些局限性,我们提出了一种新的低秩参数化方法SDS-LoRA,该方法在结构上将奇异值与反向传播解耦。我们的方法确保全微调梯度仅通过低秩矩阵子空间的正交基反向传播,独立于其尺度。收敛性分析表明,虽然LoRA的收敛速率随低秩矩阵的条件数而恶化,但SDS-LoRA与之无关。在自然语言和视觉基准上的实验结果表明,SDS-LoRA改善了损失收敛并缩小了与全微调的差距,显著提升了适应性能。

英文摘要

Low-Rank Adaptation (LoRA) enables efficient adaptation of large pre-trained models to downstream tasks by parameterizing weight updates with low-rank matrices. In this paper, we investigate the limitations of the LoRA parameterization from a geometric perspective. Specifically, we show that when a full fine-tuning gradient is backpropagated to the low-rank matrices, it undergoes anisotropic scaling driven by their singular values. We argue that this phenomenon is undesirable because it distorts the full fine-tuning gradient by skewing it toward dominant singular directions while suppressing others. Our analyses demonstrate that anisotropic gradient scaling reduces the effective rank of the low-rank matrices' gradients and results in suboptimal alignment between the full fine-tuning gradient and its low-rank approximation in LoRA, thereby exacerbating the gap to full fine-tuning. To address these limitations, we propose a new low-rank parameterization, SDS-LoRA, which structurally decouples singular values from the backward pass. Our method ensures that the full fine-tuning gradient backpropagates only through the orthonormal bases of the low-rank matrices' subspaces, independent of their scales. Convergence analysis demonstrates that while LoRA's convergence rate degrades with the condition number of the low-rank matrices, SDS-LoRA remains independent of it. Experimental results across natural language and vision benchmarks show that SDS-LoRA improves loss convergence and reduces the gap to full fine-tuning, significantly enhancing adaptation performance.

2606.16456 2026-06-16 cs.LG cs.AI 交叉投稿

SPRI: SVD-Partitioned Residual Initialization for Data-Constrained MoE Upcycling

SPRI: 基于SVD分解残差初始化的数据受限MoE升级方法

Weiqiao Shan, Ruixiang Mao, Yuang Li, Yuhao Zhang, Yingfeng Luo, Tong Zheng, Chen Xu, Yucheng Qiao, Chunxiang Jin, Yi Yuan, Jingdong Chen, Tong Xiao, Jingbo Zhu

发表机构 * Northeastern University, China(东北大学) Huawei TSC, China(华为技术有限公司) CUHK-Shenzhen, China(香港中文大学(深圳)) University of Maryland, USA(马里兰大学) Harbin Engineering University, China(哈尔滨工程大学) Inclusion AI, Ant Group(蚂蚁集团Inclusion AI) NiuTrans Research, China(小牛翻译研究中心)

AI总结 提出SPRI方法,利用预训练FFN权重的SVD分解残差初始化MoE专家,结合两阶段训练策略,在数据受限的多语言语音翻译任务中显著提升性能。

Comments 8pages, 12 tables, 3 figures

详情
AI中文摘要

混合专家(MoE)模型能够实现高效扩展,但从头训练成本过高。MoE升级通过将预训练的密集模型转换为稀疏MoE模型来降低这一成本。然而,现有的升级方法通常依赖大规模持续训练,并且在数据受限的监督适应中表现不佳,原因在于专家同质化或对预训练参数的过度扰动。在此设置下,有效的升级必须利用预训练权重结构,同时为路由专家引入足够的多样性。为此,我们提出了基于SVD分解残差初始化(SPRI)的方法,该方法将从预训练前馈网络(FFN)权重中提取的SVD分解残差分配到路由专家中,从而在预训练谱结构的基础上引入可控的专家多样性。我们进一步引入两阶段训练策略以提高适应稳定性。我们在多语言语音到文本翻译任务上评估SPRI,该任务中有限的监督数据对MoE升级构成挑战,而多个目标语言提供了天然的路由异质性。在CoVoST2数据集上的15个英语到其他语言方向中,SPRI相比完全微调的密集模型平均BLEU和COMET分别提高了2.58和3.32分,并且比之前最佳的MoE升级基线高出3.39 BLEU和4.34 COMET分。

英文摘要

Mixture-of-Experts (MoE) models enable efficient scaling, but training them from scratch remains prohibitively expensive. MoE upcycling mitigates this cost by converting pretrained dense models into sparse MoE models. However, existing upcycling methods typically rely on large-scale continued training and often perform poorly under data-constrained supervised adaptation, due to either homogeneous experts or overly disruptive perturbations to pretrained parameters. In this setting, effective upcycling must leverage pretrained weight structure while introducing sufficient diversity among routed experts. To this end, we propose SVD-Partitioned Residual Initialization (SPRI), which distributes SVD-partitioned residuals derived from pretrained feed-forward network (FFN) weights across routed experts, introducing controlled expert diversity grounded in pretrained spectral structure. We further introduce a two-stage training strategy to improve adaptation stability. We evaluate SPRI on multilingual speech-to-text translation, where limited supervised data challenges MoE upcycling and multiple target languages provide natural routing heterogeneity. On CoVoST2 across 15 En-to-XX directions, SPRI improves average BLEU and COMET over fully fine-tuned dense models by 2.58 and 3.32 points, respectively, and outperforms the prior best MoE upcycling baseline by 3.39 BLEU and 4.34 COMET points.

2606.16462 2026-06-16 cs.LG cs.AI 交叉投稿

Learning aligned EEG representations with subject-specific encoders

学习带有主体特定编码器的对齐脑电图表示

Bruna J. Lopes, Gabriel Schwartz, Sylvain Chevallier, Raphael Y. de Camargo, Bruno Aristimunha

发表机构 * University of São Paulo(圣保罗大学) Université Paris-Saclay, Inria TAU team, LISN-CNRS(巴黎萨克雷大学,Inria TAU团队,LISN-CNRS) Institut de neuromodulation, GHU Paris, psychiatrie et neurosciences, centre hospitalier Sainte-Anne, pôle hospitalo-universitaire 15, Université Paris Cité(神经调控研究所,GHU巴黎,精神病学与神经科学,圣安娜医院,大学医院中心15区,巴黎西岱大学) Federal University of ABC (UFABC)(ABC联邦大学) Yneuro Swartz Center for Computational Neuroscience (SCCN), Institute for Neural Computation (INC), University of California San Diego(斯沃茨计算神经科学中心,神经计算研究所,加州大学圣地亚哥分校)

AI总结 提出使用主体特定编码器替代共享编码器,结合共同分类器实现跨主体脑电图对齐,实验表明该方法能内化欧几里得对齐的作用,提高类别区分度,并识别出未见主体的编码器选择是主要瓶颈。

详情
AI中文摘要

跨主体脑电图解码有望提供更多训练数据,但也使神经网络面临强烈的跨主体分布偏移。我们研究仅凭任务监督和架构是否能学习主体对齐的表示。我们将共享的脑电图编码器替换为主体特定编码器后接共同分类器,并在四个运动想象数据集上将该混合模型与标准EEGNet、AttentionBaseNet和CTNet基线(结合欧几里得对齐EA)进行比较。EA通过重新居中主体协方差改进了共享编码器,但混合编码器在很大程度上内化了这一作用:当移除EA时,验证损失曲线和潜在距离分析变化很小。主体特定头增加了类别区分度,并将每个主体置于其自身的潜在流形附近,改善了大多数主体,但留下了一个对方法敏感的子集。这些结果支持主体特定编码器作为脑电图解码的学习对齐机制,并将未见主体的编码器选择确定为剩余瓶颈。

英文摘要

Cross-subject EEG decoding promises more training data, but it also exposes neural networks to strong inter-subject distribution shifts. We study whether task supervision and architecture alone can learn subject-aligned representations. We replace a shared EEG encoder with subject-specific encoders followed by a common classifier, and compare this hybrid model with standard EEGNet, AttentionBaseNet, and CTNet baselines with Euclidean Alignment (EA) on four motor-imagery datasets. EA improves shared encoders by recentering subject covariances, but the hybrid encoder largely internalises this role: validation-loss curves and latent-distance analyses change little when EA is removed. Subject-specific heads increase class distinctiveness and place each subject close to its own latent manifold, improving most subjects while leaving a method-sensitive subset. These results support subject-specific encoders as a learned alignment mechanism for EEG decoding and identify head selection for unseen subjects as the remaining bottleneck.

2606.16633 2026-06-16 cs.CV cs.AI 交叉投稿

DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation

DCP-Prune:基于分布一致性保持的超低令牌剪枝

Xifeng Xue, Xiaokang Wang, Zirui Li, Ming-Ming Cheng, Guolei Sun

发表机构 * College of Computer Science, Nankai University(南开大学计算机学院) Nanjing University of Posts and Telecommunications(南京邮电大学)

AI总结 提出DCP-Prune框架,通过锚点-上下文图恢复和文本感知令牌聚类选择,在超低令牌预算下保持分布一致性,实现稳定高性能。

Comments The code will be released at: https://github.com/EMVision-NK/DCP-Prune

详情
AI中文摘要

最近的视觉令牌剪枝方法在中等令牌预算下能有效保持模型性能,但在超低令牌预算下变得不稳定。我们的分析表明,随着剪枝预算减少,精度下降通常伴随着更大的特征分布偏移。关键的是,这种分布偏移的程度与性能下降强相关。为了更好地表征这一现象,我们引入了一种轻量级的分布一致性度量来估计保留令牌与完整令牌之间的分布偏移。受这些观察启发,我们提出了一个两阶段剪枝框架,包括锚点-上下文图恢复(ACGR)和文本感知令牌聚类选择(TATCS)。具体地,ACGR在令牌移除前转移上下文信息,而TATCS在检测到严重分布偏移时动态重新选择代表性令牌。大量实验表明,我们的方法在超低令牌预算下实现了更优且更稳定的性能。值得注意的是,在仅使用16个视觉令牌的情况下,它在LLaVA-1.5-7B上保留了92.1%的上限平均性能。

英文摘要

Recent vision token pruning methods effectively preserve model performance under moderate token budgets but become unstable under ultra-low token budget. Our analysis shows that as the pruning budget decreases, accuracy degradation is often accompanied by larger feature distribution shifts. Critically, the degree of this distribution shift strongly correlates with performance degradation. To better characterize this phenomenon, we introduce a lightweight distribution consistency metric to estimate the distribution shift between retained and full tokens. Motivated by these observations, we propose a two-stage pruning framework consisting of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). Specifically, ACGR transfers contextual information before token removal, while TATCS dynamically re-selects representative tokens when severe distribution shift is detected. Extensive experiments demonstrate that our method achieves superior and more stable performance under ultra-low token budget. Notably, it retains 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.

2606.16694 2026-06-16 cs.LG cs.AI physics.app-ph q-bio.NC 交叉投稿

Adaptive inference and function vectors in deep transformers

深度变换器中的自适应推理与函数向量

Ravin Raj, Gautam Reddy

发表机构 * Joseph Henry Laboratories of Physics, Princeton University(普林斯顿大学约瑟夫·亨利物理实验室)

AI总结 提出深度变换器作为平均场交互系统实现分布式推理的理论,利用函数向量逐层推断潜在上下文变量,在上下文回归任务中预测非高斯分层结构与深度的关系,并通过约束线性注意力变换器验证。

详情
AI中文摘要

变换器被广泛用作学习大量耦合变量间复杂相关性的通用基础架构,但其内部机制仍不明确。我们提出了一种深度变换器作为平均场交互系统的理论,该系统在通信、局部性和深度约束下实现分布式推理。我们证明,这样的系统可以利用内部状态表示(“函数向量”)在其层上以越来越精细的尺度推断潜在上下文变量。在上下文回归任务中,该理论预测了潜在上下文变量中的非高斯分层结构与变换器深度之间的非平凡关系。使用约束线性注意力变换器对预测进行了测试,并展示了深度架构中的自适应推理。前馈模块和深度使变换器能够实现比先前描述的更丰富的上下文学习算法类别。

英文摘要

Transformers are widely used as a general-purpose substrate for learning complex correlations between a large collection of coupled variables, but their internal mechanisms have remained mysterious. We introduce a theory of a deep transformer as a mean-field interacting system that implements distributed inference, subject to constraints on communication, locality and depth. We show that such a system can exploit internal state representations ('function vectors') to infer a latent context variable at increasingly finer scales over its layers. In an in-context regression task, the theory predicts a non-trivial relationship between non-Gaussian, hierarchical structure in the latent context variable, and transformer depth. Predictions are tested using constrained linear attention transformers and demonstrate adaptive inference in deep architectures. Feedforward blocks and depth enable transformers to implement a much richer class of in-context learning algorithms than previously described.

2606.16730 2026-06-16 stat.ML cs.AI cs.LG 交叉投稿

Attention is Just Another Name for Coupling?: A Fast-Slow ODE Perspective on Hierarchical Pretraining

注意力只是耦合的另一个名字?:关于层级预训练的快速-慢速ODE视角

Zhengyuan Gao

AI总结 本文提出一种快慢ODE视角,将因果自注意力视为耦合机制,并引入一个通过零初始化门控反馈到快路径的慢子系统,在理论证明和实验验证中揭示了其与主方程平稳分布的联系。

详情
AI中文摘要

因果自注意力是一种耦合机制:每个token的隐藏状态通过同一时间尺度上前置token的学习混合来更新。本文提出一个疑问:是否存在第二个时间上更慢的耦合——一个在序列的时间下采样视图上运行并通过零初始化门控反馈到快路径的慢子系统——来补充它?该问题以奇异摄动常微分方程(ODE)的语言提出,其中快变量$x$以token速率演化,慢变量$y$每$P$个token更新一次,时间尺度比$\varepsilon = 1/P$通过因果块均值池化在结构上强制执行。\n本文将快慢ODE形式具体化为一个神经网络:一个在$T$个token上的标准因果注意力快路径,一个在$T/P$个池化token上的全注意力慢路径(每层便宜$P^2$倍),以及一个零初始化的加法门控。此外,在快动力学的线性生成器假设下,我们证明了平衡流形$x = \phi(y)$恰好是主方程(ME)的平稳分布$p_{\mathrm{st}}(y)$;在该机制下,学习的MLP $\phi_\theta(y)$是其变分近似(训练块不是生成器,因此该恒等式是结构极限,而非对训练网络的断言)。实验上,在50万token时,耦合是中性的——门控保持关闭,耦合和冻结消融在运行间噪声范围内——其墙钟成本与密集基线相当。贡献在于精确的、带有间隙标记的映射本身,而非性能提升。

英文摘要

Causal self-attention is a coupling mechanism: each token's hidden state is updated by a learned mixture of preceding tokens at the same timescale. This paper asks whether a second, temporally slower coupling-a slow sub-system operating on a temporally-downsampled view of the sequence and fed back into the fast path through a zero-initialised gate-complements it. The question is framed in the language of singularly perturbed ordinary differential equations (ODEs), where the fast variable $x$ evolves at the token rate, the slow variable $y$ evolves at one update per $P$ tokens, and the timescale ratio $\varepsilon = 1/P$ is enforced structurally by causal block-mean pooling. The paper instantiates the fast-slow ODE formalism as a concrete neural network: a fast path of standard causal attention over $T$ tokens, a slow path of full attention over $T/P$ pooled tokens ($P^2 \times$ cheaper per layer), and a zero-initialised additive gate. In addition, under a linear-generator assumption on the fast dynamics, we prove that the equilibrium manifold $x = ϕ(y)$ is exactly the master-equation (ME) stationary distribution $p_{\mathrm{st}}(y)$; in that regime a learned MLP $ϕ_θ(y)$ is a variational approximation of it (the trained block is not a generator, so this identity is the structured limit, not a claim about the network as trained). Empirically, at $500$k tokens the coupling is neutral -- the gate stays closed and the coupled and frozen ablations are within run-to-run noise -- at a wall-clock cost comparable to a dense baseline. The contribution is the precise, gap-marked mapping itself, not a performance gain.

2606.16790 2026-06-16 cs.LG cs.AI 交叉投稿

Decision-Weighted Flow Matching for Contextual Stochastic Optimization

决策加权流匹配用于上下文随机优化

Jize Xie, Haomiao Wu, Qiang Chen, Xiu Su, Yi Chen

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Central South University(中南大学) Big Data Institute(大数据研究院)

AI总结 提出决策加权流匹配(DW-FM)框架,通过重加权速度回归目标对齐下游遗憾,在CVaR基准上优于标准方法。

详情
AI中文摘要

条件生成模型越来越多地被用作随机优化的场景生成器,但标准训练目标强调均匀分布拟合,而非生成场景所引发的下游决策。这造成了目标不匹配:统计常见区域的误差对决策遗憾影响很小,而决策敏感区域的误差可能显著改变最优行动。我们提出决策加权流匹配(DW-FM),一种遗憾对齐的训练框架,它保留了标准流匹配的简单性,同时使用决策敏感的端点信息对其速度回归目标进行重加权。理论上,我们通过损失诱导的决策差异和伴随输运论证将下游遗憾与路径速度不匹配联系起来,得到一个理想的遗憾对齐替代目标以及具有遗憾保证的实用端点加权目标。实验上,我们在三个基于CVaR的上下文随机优化基准(涵盖合成投资组合、半真实金融和交通CVaR任务)上展示了DW-FM的有效性,其中DW-FM在标准基线上改善了下游遗憾。

英文摘要

Conditional generative models are increasingly used as scenario generators for stochastic optimization, but standard training objectives emphasize uniform distributional fit rather than the downstream decisions induced by generated scenarios. This creates an objective mismatch: errors in statistically common regions may have little effect on decision regret, whereas errors in decision-sensitive regions can substantially change the optimal action. We propose Decision-Weighted Flow Matching (DW-FM), a regret-aligned training framework that preserves the simplicity of standard flow matching while reweighting its velocity-regression objective using decision-sensitive endpoint information. Theoretically, we connect downstream regret to pathwise velocity mismatch through a loss-induced decision discrepancy and an adjoint transport argument, yielding an ideal regret-aligned surrogate and practical endpoint-weighted objectives with regret guarantees. Empirically, we demonstrate the effectiveness of DW-FM on three CVaR-based contextual stochastic optimization benchmarks spanning synthetic portfolio, semi-real financial, and traffic-CVaR tasks, where DW-FM improves downstream regret over standard baselines.

2606.16815 2026-06-16 eess.SP cs.AI cs.LG 交叉投稿

A Perception vs. Distortion Perspective on Score-Based Generative Channel Estimation

基于分数的生成式信道估计中的感知与失真权衡视角

Marco Skocaj, Lukas Eller, Mate Boban

AI总结 本文通过感知-失真权衡理论,分析了基于分数的生成模型在信道估计中的优势与局限,指出在高预测不确定性下可接近贝叶斯最优性能,低不确定性下判别式方法更优。

Comments 13 pages

详情
AI中文摘要

受其在计算机视觉和逆问题求解中的显著成功驱动,基于分数的模型越来越多地应用于无线通信,并在一系列物理层任务中展现出潜力。然而,尽管兴趣日益增长,当前文献往往缺乏对分数匹配何时比传统判别学习具有实际优势的严格分析。本文旨在通过信道估计这一无线系统中的基本逆问题用例来填补这一空白。我们通过感知-失真权衡的视角,提出了基于分数的信道估计的理论解释,识别了分数匹配表现优异的条件及其关键局限性。特别是,通过将下游无线任务(如容量最大化)建模为信道估计过程的泛函,我们量化了标准失真最小化方法所导致的超额风险。大量数值结果表明,在高预测不确定性下,大的超额风险差距可以通过基于分数的估计来弥补,从而通过学习的后验实现接近贝叶斯最优的预编码,而在低预测不确定性下,由于复杂度更低且模型容量利用更高效,判别式失真最小化方法更可取。

英文摘要

Driven by their remarkable success in computer vision and inverse problem solving, score-based models are increasingly applied to wireless communications, where they show promise across a range of physical-layer tasks. However, despite this growing interest, the current literature often lacks a rigorous analysis of when score-matching offers a tangible advantage over traditional discriminative learning. This paper aims to address this gap through the use-case of channel estimation, a fundamental inverse problem in wireless systems. We present a theoretically grounded interpretation of score-based channel estimation through the lens of the perception-distortion tradeoff, identifying the conditions where score matching excels as well as its key limitations. In particular, by modeling downstream wireless tasks (e.g., capacity maximization) as functionals of the channel estimation process, we quantify the excess risk incurred by standard distortion-minimization approaches. Extensive numerical results show that under high predictive uncertainty, the large excess risk gap can be offset by score-based estimation, enabling near Bayesian-optimal precoding via the learned posterior, whereas in the low predictive uncertainty regime, discriminative distortion-minimization approaches are preferable due to lower complexity and more efficient use of model capacity.

2606.16825 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models

循环绑定——混合专家语言模型中的专家层绑定

Martin Jaggi

发表机构 * EPFL(瑞士联邦理工学院洛桑)

AI总结 提出专家绑定方法,通过共享连续Transformer层的专家参数,在保持独立路由和注意力的同时,将MoE模型内存占用降低近2倍,且不损失困惑度或下游性能。

Comments Code available at https://github.com/epfml/looped-moe

详情
AI中文摘要

混合专家(MoE)架构通过每个令牌仅激活一小部分专家来高效扩展大型语言模型(LLM),但全部参数计数——主要由专家参数主导——必须保留在训练和推理内存中。为了解决这个问题,我们引入了专家绑定(Expert Tying),这是一种架构修改,它在连续Transformer层之间共享专家参数,同时保留独立的逐层路由和注意力。我们在常见的先进架构上评估了这种方法,包括OLMoE、Qwen3和DeepSeek风格的MoE。我们的预训练实验表明,绑定专家可以将内存占用减少近2倍,而几乎不降低困惑度或下游质量。通过利用MoE路径中固有的参数冗余,我们的方法提供了高度有利的计算-内存权衡,推动了下一代LLM的高效训练和扩展。

英文摘要

Mixture-of-Experts (MoE) architectures efficiently scale Large Language Models (LLMs) by activating only a small fraction of their experts per token, yet the full parameter count - dominated by the expert parameters - must be held in training and inference memory. To address this, we introduce Expert Tying, an architectural modification that shares expert parameters across consecutive transformer layers while preserving independent, layer-wise routing and attention. We evaluate this approach across common, state-of-the-art architectures, including OLMoE, Qwen3, and DeepSeek-style MoEs. Our pretraining experiments demonstrate that tying experts can reduce memory footprint by almost 2x at virtually no degradation in perplexity or downstream quality. By exploiting the parameter redundancy inherent in MoE pathways, our method provides a highly favorable compute-to-memory trade-off, advancing efficient training and scaling of next-generation LLMs.

2606.16837 2026-06-16 cs.CV cs.AI cs.SD 交叉投稿

Robust Spoofed Speech Detection via Temporal Pyramid Modeling

基于时间金字塔建模的鲁棒语音伪造检测

Mahtab Masoudi Nezhad, Nima Karimian

发表机构 * Lane Department of Computer Science and Electrical Engineering, West Virginia University(西弗吉尼亚大学莱恩计算机科学与电气工程系) Bellini College of Artificial Intelligence, Cybersecurity and Computing, University of South Florida(南佛罗里达大学贝利尼人工智能、网络安全与计算学院)

AI总结 提出时间金字塔适配器,通过多尺度时间卷积捕获局部伪影和全局韵律异常,结合自监督XLS-R表示,在多个数据集上显著优于基线模型。

详情
AI中文摘要

伪造语音检测日益受到逼真合成、语音转换和重放攻击的挑战,跨数据集泛化仍然是主要限制。本文提出时间金字塔适配器,利用具有不同感受野的并行时间卷积来捕获多尺度伪造线索,从局部伪影到全局韵律异常。我们还集成了自监督XLS-R表示,并结合前端适配器,包括Mel、Sinc和用于多尺度时间建模的时间金字塔设计。所提出的模型在多个基准上进行了评估,包括ASVspoof 2017、ASVspoof 2021 (DF/LA)、PartialSpoof、DiffSSD和多语言HQ-MPSD数据集。实验结果表明,时间金字塔模型在PartialSpoof数据库上获得了99.24%的AUC和3.87%的EER,显著优于基础模型和多个SOTA基线,如LCNN-BLSTM(9.87% EER)和TRACE(8.08% EER)。此外,多语言评估证实,虽然伪造伪影与语言无关,但自监督表示提高了鲁棒性,在领域和语言偏移下性能下降,凸显了需要更好的适应和校准策略。

英文摘要

Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, with cross-dataset generalization remaining a major limitation. This work we propose a Temporal Pyramid Adapter that utilize parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We also integrated self-supervised XLS-R representations combined with front-end adapters, including Mel, Sinc, and a Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated cross multiple benchmark including ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Experimental results demonstrate that Temporal Pyramid model obtained AUC of 99.24% and a EER of 3.87% on the PartialSpoof database, which is significantly outperforming the base model and several SOTA baseline such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that while spoofing artifact are independent from language. While self-supervised representations improve robustness, performance degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.

2606.16846 2026-06-16 cs.LG cs.AI 交叉投稿

Deep Q-Learning on Hölder Spaces

Hölder空间上的深度Q学习

Qian Qi

发表机构 * Peking University(北京大学)

AI总结 研究连续时间随机控制中Q学习的算子核心,通过分析扩散设置下Bellman最优性目标的正则性和逼近复杂度,提出适应混合正则性的张量积DeepONet架构,并给出显式逼近和资源界限。

详情
AI中文摘要

我们研究了具有连续状态和动作的连续时间随机控制中Q学习的算子理论核心。在基于价值的强化学习中,每次Q学习或DQN更新都基于Bellman最优性目标;我们的分析在扩散设置中分离出该目标,并研究其正则性和逼近复杂度。在均匀椭圆性和Hölder正则系数下,我们证明Bellman更新将有界输入映射到各向异性正则类,平滑状态变量而仅保留对动作变量的Lipschitz依赖性。这产生了Bellman迭代的紧族,并激发了适应问题混合正则性的张量积DeepONet架构。然后我们推导出显式的逼近和资源界限,以及时间步长$δ\ o 0$时的刚度-复杂度权衡。所得理论在连续随机控制中Bellman目标正则性和逼近层面直接贡献于Q学习理论。同时,我们并未声称对包含探索、经验回放和随机梯度更新的实际采样Q学习有完整的收敛定理。

英文摘要

We study the operator-theoretic core of Q-learning in continuous-time stochastic control with continuous states and actions. In value-based reinforcement learning, each Q-learning or DQN update is built from a Bellman optimality target; our analysis isolates this target in a diffusion setting and studies its regularity and approximation complexity. Under uniform ellipticity and Hölder-regular coefficients, we show that a Bellman update maps bounded inputs into an anisotropic regularity class, smoothing the state variable while leaving only Lipschitz dependence on the action variable. This yields a compact family of Bellman iterates and motivates a tensor-product DeepONet architecture adapted to the mixed regularity of the problem. We then derive explicit approximation and resource bounds, together with a stiffness--complexity trade-off as the time step $δ\to 0$. The resulting theory makes a direct contribution to Q-learning theory at the level of Bellman target regularity and approximation in continuous stochastic control. At the same time, we do not claim a full convergence theorem for practical sampled Q-learning with exploration, replay, and stochastic gradient updates.

2606.16883 2026-06-16 cs.LG cs.AI 交叉投稿

Upper Bounds on the Generalization Error of Deep Learning Models via Local Robustness and Stability

深度学习模型泛化误差的上界:基于局部鲁棒性和稳定性

Abdul-Rauf Nuhu, Parham M. Kebria, Vahid Hemmati, Mahmoud N. Mahmoud, Edward Tunstel, Abdollah Homaifar

发表机构 * North Carolina Agricultural and Technical State University(北卡罗来纳农业技术州立大学) University of Alabama(阿拉巴马大学) Southwest Research Institute(西南研究院)

AI总结 提出一种通过局部区域稳定样本数缩放鲁棒性项的泛化上界,在ImageNet上实现非空洞且最紧的误差估计。

详情
AI中文摘要

泛化是数据驱动模型的关键属性,尤其是在安全关键应用中部署的深度学习模型。基于鲁棒性的泛化界作为一种将鲁棒性与泛化性能联系起来的原则性方法而受到关注,通常以数据依赖的方式。然而,大多数现有界在实际设置中存在空洞问题,产生远超过实际错误率的松散上界,限制了其在真实世界评估中的实用性。虽然这个问题通常归因于不确定性项,但问题的很大一部分源于鲁棒性项本身,特别是对于0-1损失。现有方法通常将鲁棒性项视为全局度量,忽略了其在输入空间不同子区域间的变化。在这项工作中,我们提出了一种泛化界,通过根据每个子区域内稳定和不稳定样本的数量来缩放鲁棒性项,从而解决了这一局限性。我们的界同时包含数据和模型依赖因素,同时保持实际相关性(产生更紧的真实误差上界)。在ImageNet数据集上训练的模型上的实验表明,我们的界始终非空洞,并在现有方法中实现了最紧的估计,与一系列鲁棒深度神经网络的实证性能紧密对齐。

英文摘要

Generalization is a critical property of data-driven models, particularly deep learning models deployed in safety-critical applications. Robustness-based generalization bounds have gained attention as a principled way to link robustness properties to generalization performance, often in a data-dependent manner. However, most existing bounds suffer from vacuousness in practical settings, yielding loose upper bounds that greatly exceed the actual error rates and limiting their usefulness for real-world evaluation. While this issue is often attributed to the uncertainty term, a substantial part of the problem originates from the robustness term itself, particularly for the 0-1 loss. Existing approaches typically treat the robustness term as a global measure, ignoring its variation across different sub-regions of the input space. In this work, we propose a generalization bound that addresses this limitation by scaling the robustness term according to the number of stable and unstable samples within each sub-region. Our bounds incorporate both data- and model-dependent factors while maintaining practical relevance (yielding tighter upper bounds on true error). Experiments on models trained on the ImageNet dataset show that our bounds remain consistently non-vacuous and achieve the tightest estimates among existing methods, closely aligning with empirical performance across a range of robust deep neural networks.

2606.16891 2026-06-16 cs.LG cs.AI 交叉投稿

Beyond Weights and Gradients: A Taxonomy of Federated Learning Messages

超越权重和梯度:联邦学习消息的分类学

Alvaro Javier Vargas Guerrero, Xinguang Wang, Quang Manh Doan, Guy Nagels

发表机构 * AIMS lab, Center for Neurosciences, UZ Brussel, Vrije Universiteit Brussel, Brussels, Belgium(AIMS实验室,神经科学中心,布鲁塞尔大学医院,布鲁塞尔自由大学,布鲁塞尔,比利时) Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium(人工智能实验室,布鲁塞尔自由大学,布鲁塞尔,比利时)

AI总结 本文提出联邦消息的正式数学定义,建立包含模型结构、统计摘要和数据条件表示的三类分类法,分析计算、通信和隐私权衡,并综述202篇文献揭示2021年后消息范式多样化趋势。

Comments 4 figures, 9 pages, with 7 pages of content

详情
AI中文摘要

联邦学习正迅速发展,超越了传统模型权重和梯度的交换,但现有定义未能涵盖现代负载(如合成数据和联邦分析)的全部范围。本文通过提出一个联邦消息的正式数学定义来弥补这一空白,该定义同时考虑了效用和隐私。我们引入了一个分类法,将这些交换组织为三类:模型结构、统计摘要和数据条件表示。通过基于计算需求、通信成本和隐私风险评估这些组别,我们提供了对去中心化训练中涉及权衡的更清晰理解。我们对202篇近期出版物的回顾凸显了自2021年以来向多样化消息范式的显著转变,标志着从标准深度学习更新向更专业信息共享的转变。该框架为未来研究优化联邦系统以适应不同硬件和安全需求提供了结构化路径。

英文摘要

Federated Learning is rapidly evolving beyond the exchange of traditional model weights and gradients, yet existing definitions fail to capture the full scope of modern payloads like synthetic data and federated analytics. This paper addresses the gap by proposing a formal mathematical definition of a federated message that accounts for both utility and privacy. We introduce a taxonomy that organizes these exchanges into three categories: model structures, statistical summaries, and data-conditioned representations. By evaluating these groups based on computational demands, communication costs, and privacy risks, we provide a clearer understanding of the trade-offs involved in decentralized training. Our review of 202 recent publications highlights a significant shift since 2021 toward diverse messaging paradigms, signaling a move away from standard deep learning updates toward more specialized information sharing. This framework provides a structured path for future research to optimize federated systems for varying hardware and security requirements.

2606.16920 2026-06-16 cs.LG cs.AI 交叉投稿

Demystifying Variance in Circuit Discovery of LLMs

揭示LLM电路发现中的方差

Frank Zhengqing Wu, Francesco Tonin, Volkan Cevher

发表机构 * Laboratory for Information and Inference Systems (LIONS), École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland(信息与推理系统实验室(LIONS),洛桑联邦理工学院(EPFL),瑞士洛桑)

AI总结 本文研究LLM电路发现中的重采样、重述和样本方差,提出CEAP方法减少重采样方差,并分析重述方差源于不同模板激活不同电路,样本方差主要由不忠定义导致。

详情
AI中文摘要

电路发现是机械可解释性中的关键技术,用于定位对执行给定任务至关重要的模型组件。尽管当前最先进的方法(EAP-IG)在(不)忠实性指标上表现良好,但它存在显著的变异性。这包括重采样方差(当我们用来自同一分布的新数据批次探测时电路发生变化)、重述方差(当提示被重新表述时发现的电路发生偏移)以及样本方差(具有低总体不忠实性的电路在单个样本上的不忠实性表现出大幅波动)。本文研究了这些方差的根源。我们证明了CEAP(我们新的电路发现方法,在理论上改进了EAP-IG)可以显著减轻重采样方差。我们进一步表明,重述方差是由于不同模板的提示倾向于激活模型中的不同电路。这使我们提出,可能很难找到一个全面的电路来解释和控制模型在任务上的行为,而该任务可以用无数模板表达,这表明LLM可能本质上难以操控。我们表明,稀疏性(据称能形成更紧凑和可解释的任务电路)无法解决这个问题。关于样本方差,我们认为它很大程度上是良性的:极差的不忠实性分数通常源于不忠实性的定义方式,而非测量电路的缺陷。我们表明,不忠实性的大小受选择性贡献缩放的影响,这是一种神经机制,解释了有时观察到的极差分数。

英文摘要

Circuit discovery is a key technique in mechanistic interpretability to pinpoint the model components that are crucial for performing a given task. Although the current state-of-the-art method (EAP-IG) performs well on the metric of (un)faithfulness, it suffers from substantial variability. This includes resampling variance, where the circuit changes when we probe with a new batch of data from the same distribution; rephrasing variance, where the discovered circuit shifts when the prompts are rephrased; and sample-wise variance, where a circuit with low population unfaithfulness exhibits large fluctuations in unfaithfulness across individual samples. This paper studies the roots of these variances. We demonstrate that CEAP, our new circuit discovery method that improves upon EAP-IG with a theoretical guarantee, can substantially lessen resampling variance. We further show that rephrasing variance arises because prompts with different templates tend to activate different circuits in the model. This leads us to argue that it may be challenging to find a comprehensive circuit that explains and controls the model's behavior on a task, which can be expressed in countless templates, suggesting that LLMs may be inherently hard to steer. We show that sparsity, which has been claimed to form more compact and interpretable task circuits, fails to solve this problem. Regarding sample-wise variance, we argue that it is largely benign: extremely poor unfaithfulness scores often stem from how unfaithfulness is defined, rather than from defects in the measured circuits. We show that the magnitude of unfaithfulness is affected by selective contribution scaling, a neural mechanism that accounts for the extremely poor scores sometimes observed.

2606.16933 2026-06-16 cs.LG cs.AI 交叉投稿

A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning

强化学习中分布偏移的统一因果起源分类法

Ardianto Wibowo, Paulo E Santos, Amer Baghdadi, Matthew Stephenson, Karl Sammut, Jean-Philippe Diguet

发表机构 * IMT Atlantique(IMT大西洋) Flinders University(弗林德斯大学) IRL Crossing Priori Analytica CNRS(法国国家科学研究中心)

AI总结 提出一种统一因果起源分类法,将强化学习中的分布偏移按因果来源(内部/外部)和时间边界(显式/隐式/混合)分类,统一了分布内/外泛化与非平稳性分析。

Comments The paper is currently under review at the Journal of Artificial Intelligence Research (JAIR)

详情
AI中文摘要

强化学习系统在运行条件与先前遇到的条件不同时通常会退化,这反映了底层数据生成过程中的分布偏移。这种偏移可能发生在训练和评估之间,如分布内(ID)和分布外(OOD)泛化,或者发生在环境动态随时间演变的非平稳设置中。然而,这些观点之间的形式关系尚不清楚,现有工作主要关注缓解措施而非智能体-环境交互中偏移的因果起源。本文开发了一个统一的因果起源分类法,描述了强化学习中分布偏移的来源,并将ID/OOD泛化与非平稳设置联系起来。我们将监督学习中的经典数据集偏移原则迁移到强化学习,通过将分布偏移重新表述为生成交互过程。使用部分可观测马尔可夫决策过程(POMDP),我们将交互分解为结构组件,包括状态分布、观测过程、策略、奖励和转移动态,以及偏移时间边界。所提出的分类法区分了内部(智能体驱动)和外部(环境驱动)的分布偏移。偏移时间边界视角进一步刻画了显式、隐式和混合偏移。这种表述将ID/OOD泛化和非平稳性统一为底层过程中的结构化变化。我们还引入了一个评估框架,通过性能退化和恢复指标来衡量偏移影响和适应能力。通过将分布偏移扎根于强化学习的因果起源结构,本文支持在分布偏移下进行系统性的鲁棒性分析。

英文摘要

Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between training and evaluation, as in In-Distribution (ID) and Out-of-Distribution (OOD) generalization, or within non-stationary settings where environment dynamics evolve over time. However, the formal relationship between these views remains unclear, and existing work mainly focuses on mitigation rather than the causal origin of shift within the agent-environment interaction. This work develops a unified causal-origin taxonomy that characterizes sources of distributional shift in RL and relates ID/OOD generalization to non-stationary settings. We transfer the classical dataset-shift principle from supervised learning to RL by reformulating distributional shift in terms of the generative interaction process. Using a Partially Observable Markov Decision Process (POMDP), we decompose the interaction into structural components, including the state distribution, observation process, policy, reward, and transition dynamics, together with the shifted-time boundary. The proposed taxonomy distinguishes internal, agent-driven, and external, environment-driven, distributional shifts. The shifted-time boundary perspective further characterizes explicit, implicit, and hybrid shifts. This formulation unifies ID/OOD generalization and non-stationarity as structured changes in the underlying process. We also introduce an evaluation framework for measuring shift impact and adaptation through performance degradation and recovery metrics. By grounding distributional shift in the causal-origin structure of RL, this work supports systematic analysis of robustness under distributional shift.

2606.17028 2026-06-16 cs.LG cs.AI cs.AR 交叉投稿

HAMON: Passive Optical Sequence Mixing for Long-Horizon Forecasting

HAMON: 用于长程预测的无源光学序列混合

Alper Yıldırım

AI总结 提出HAMON无源衍射光学预测核心,通过光学传播替代数字序列混合层,在多个基准上优于或接近最强数字基线,MSE最多降低14%。

详情
AI中文摘要

简单的线性模型和频域模型在长程时间序列预测中仍然出奇地具有竞争力,最近的机制证据表明,标准预测基准可能不需要使Transformer在其他领域强大的密集叠加表示。这引发了一个底层问题:如果核心预测算子通常是低复杂度的且近似线性,它是否需要被实现为学习到的数字时间混合?我们引入了HAMON,一种无源衍射光学预测核心,其中历史值被编码到光学孔径上,未来位置保持暗场,级联的可训练相位掩模与自由空间衍射直接在输出场中形成预测。在推理时,预测由单个无源光学传播过程完成,无需可训练的数字序列混合层。在标准基准上,HAMON在ETTm2的所有预测长度和ETTh2除最长预测长度外的所有长度上优于考虑的最强数字基线,MSE最多降低14%,并且在不同预测长度上一致地优于基线,而非孤立点。它在Weather上具有竞争力,在其余ETT设置以及高通道数的Traffic和Electricity数据集上略逊于最强基线。相位编码、强度兼容读出和相位扰乱消融实验,以及TorchOptics交叉模拟检查表明,预测来自承载数据的光场而非数字预测头。由于无源核心使用标准傅里叶光学,HAMON为光学硬件和无源物理序列混合定义了一个具体目标。

英文摘要

Simple linear and frequency-domain models remain surprisingly competitive in long-horizon time-series forecasting, and recent mechanistic evidence suggests that standard forecasting benchmarks may not require the dense superposed representations that make transformers powerful in other domains. This raises a substrate-level question: if the core forecasting operator is often low-complexity and approximately linear, does it need to be implemented as learned digital temporal mixing? We introduce HAMON, a passive diffractive optical forecasting core in which historical values are encoded onto an optical aperture, future positions are left dark, and cascaded trainable phase masks with free-space diffraction shape the forecast directly in the output field. At inference, prediction is performed by a single passive optical propagation pass with no trainable digital sequence-mixing layer. Across standard benchmarks, HAMON outperforms the strongest digital baselines considered on ETTm2 at all horizons and on ETTh2 at all but the longest horizon, improving MSE by up to 14\% and doing so consistently across horizons rather than at isolated points. It is competitive on Weather and trails the strongest baselines on the remaining ETT settings and on the high-channel-count Traffic and Electricity datasets. Phase encoding, intensity-compatible readout, and phase-scrambling ablations, together with a TorchOptics cross-simulator check, indicate that the forecasts arise from the data-bearing optical field rather than from a digital forecasting head. Because the passive core uses standard Fourier optics, HAMON defines a concrete target for optical hardware and for passive physical sequence mixing.

2606.17037 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

The Importance of Phase in Neural Representations: An Internal Oppenheim-Lim Test of Image Classifiers

相位在神经表示中的重要性:图像分类器的内部Oppenheim-Lim测试

Alper Yıldırım

AI总结 通过内部相位-幅度移植实验,发现图像分类器(如PRISM2D、GFNet、ViT-B/16)的预测主要依赖相位/符号信息,而图像特定幅度对读出贡献有限;ResNet-50在ReLU前存在潜在符号编码,揭示了CNN与注意力模型在纹理-形状差异上的机制。

详情
AI中文摘要

Oppenheim和Lim(1981)表明,自然图像仅从傅里叶相位重建时仍可识别,而幅度几乎不携带其身份信息。我们探究训练后的图像分类器是否在其隐藏层内再现这种不对称性,并进行因果测试:给定两幅图像,我们在选定层将一幅图像的相位移植到另一幅图像的幅度上,并记录预测跟随哪幅图像。在PRISM2D、GFNet和ViT-B/16中,预测跟随相位或符号捐赠者,删除所有图像特定幅度几乎不影响准确率,因此身份信息依赖于相位,而图像特定幅度对读出而言在很大程度上是可舍弃的。ResNet-50起初似乎打破了这一模式,因为在ReLU之后移植符号无效;在ReLU之前的公平干预揭示了后期块中存在强烈的潜在符号编码,而仅DC对照表明读出消耗了通道空间平均值。对照排除了幅度简单地不依赖于图像的平凡情况。因此,这些架构共享一个相位/符号身份编码,但以不同基(由整流和读出几何决定)暴露出来,这为CNN与注意力模型之间的纹理-形状差异提供了机制性解释。

英文摘要

Oppenheim and Lim (1981) showed that natural images stay recognizable when reconstructed from their Fourier phase alone, while the magnitude carries little of their identity. We ask whether trained image classifiers reproduce this asymmetry inside their hidden layers, and we test it causally: given two images, we transplant the phase of one onto the magnitude of the other at a chosen layer and record which image the prediction follows. In PRISM2D, GFNet, and ViT-B/16 the prediction follows the phase or sign donor, and deleting all image-specific magnitude barely moves accuracy, so identity rides on phase while image-specific magnitude is largely dispensable to the readout. ResNet-50 at first seems to break the pattern, because transplanting sign after its ReLUs does nothing; a fair intervention before the ReLU reveals a strong latent sign code in the late blocks, and a DC-only control shows the readout consumes a channel-wise spatial average. Controls rule out the trivial case in which magnitude simply stops depending on the image. The architectures therefore share a phase/sign identity code but expose it in different bases, set by rectification and readout geometry, which gives a mechanistic account of the texture--shape gap between CNNs and attention models.

2602.10385 2026-06-16 cs.LG cs.AI 版本更新

Capture Timing-Attention of Events in Clinical Time Series

捕捉临床时间序列中的事件时序注意力

Jia Li, Yu Hou, Rui Zhang

发表机构 * Department of Surgery(外科系;计算机科学系,明尼苏达大学明尼阿波利斯分校,MN USA) Department of Computer Science, U of M Minneapolis MN USA

AI总结 提出LITT架构,通过虚拟相对时间轴对齐事件序列,实现事件时序注意力机制,用于个性化临床轨迹分析,在乳腺癌患者心脏毒性预测中优于现有方法。

Comments 8 pages of body text

详情
AI中文摘要

从纵向EHR数据中自动发现个性化轨迹(即顺序事件模式)对于临床研究中的精准医学至关重要,但即使对于当代AI模型来说,这仍然是一个艰巨的挑战。例如,虽然Transformer的注意力机制可以捕捉丰富的关联,但它基本上不关心事件的时间和顺序,从而绕过了潜在的因果推理。直观上,我们需要一种能够评估患者特定轨迹之间“对齐程度”并识别其共享模式(即一致序列中的显著事件)的方法。这需要将时间视为一个真正的**可计算**维度,允许模型为候选事件分配超出其观测物理时间的“相对时间戳”。在这项工作中,我们引入了LITT(个体级时间变换),一种新颖的架构,能够在虚拟的“相对时间线”上临时对齐序列事件,从而实现**事件时序聚焦的注意力**和临床轨迹的个性化解释。其可解释性和有效性在来自3,276名乳腺癌患者的真实纵向EHR数据上得到验证,用于预测心脏毒性诱发心脏病的发病时间。此外,LITT在公共数据集上优于基准和最先进的生存分析方法,使其成为临床AI精准医学的重要一步。

英文摘要

The contemporary paradigm of trajectory learning operates fundamentally at the level of group dynamics, systematically reducing individual-level complexity to fit group-level models, thus rendering effective patient subtyping difficult and individual-level modeling largely out of reach. We propose a data-driven paradigm that introduces a dedicated individual-level temporal variable to capture \emph{Timing Attention} (i.e., the degree of concentration of an event's timing distribution across the patient cohort), thereby rendering timing a \emph{computable dimension} that enables individualized temporal features in trajectory learning. Instantiated as the Level-of-Individual Time Transformation (LITT) and applied to longitudinal EHR data from 3,276 breast cancer patients, the proposed paradigm demonstrates, for the first time to our knowledge: (1) automatic discovery of clinically significant patient trajectories, and (2) counterfactual timing deduction, that is, a \emph{What-If Machine}. Both results are purely data-driven, requiring no prior domain knowledge. LITT further achieves strong performance on timing prediction and survival analysis tasks.

2512.10903 2026-06-16 cs.AI 版本更新

Multi-Granular Node Pruning for Causal Circuit Discovery

多粒度节点剪枝用于因果电路发现

Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, A. B. Siddique

发表机构 * Department of Computer Science, University of Kentucky, USA(美国肯塔基大学计算机科学系) Department of Computer Science, Dalhousie University, Canada(加拿大达尔豪斯大学计算机科学系)

AI总结 提出一种节点级剪枝框架,通过可学习掩码和多粒度稀疏惩罚,在单次微调中从大语言模型中高效发现最小因果电路,节点更少且性能相当,内存占用降低5-10倍。

详情
AI中文摘要

电路发现旨在识别大语言模型(LLMs)中负责特定行为的最小化子网络。现有方法主要依赖迭代边剪枝,计算成本高且局限于粗粒度单元(如注意力头或MLP块),忽略了单个神经元等更细粒度的结构。我们提出了一种用于电路发现的节点级剪枝框架,解决了可扩展性和粒度限制。我们的方法在统一的优化目标中引入了跨多个粒度级别(从整个块到单个神经元)的可学习掩码。粒度特定的稀疏惩罚指导剪枝过程,使得在单次微调运行中实现全面压缩。实验上,我们的方法识别的电路在节点数量上小于先前方法发现的电路;此外,我们证明了许多被粗粒度方法认为重要的神经元实际上是无关的,同时仍能保持任务性能。此外,我们的方法具有显著更低的内存占用(5-10倍),因为它不需要在内存中保留中间激活来工作。

英文摘要

Circuit discovery aims to identify minimal subnetworks that are responsible for specific behaviors in large language models (LLMs). Existing approaches primarily rely on iterative edge pruning, which is computationally expensive and limited to coarse-grained units such as attention heads or MLP blocks, overlooking finer structures like individual neurons. We propose a node-level pruning framework for circuit discovery that addresses both scalability and granularity limitations. Our method introduces learnable masks across multiple levels of granularity, from entire blocks to individual neurons, within a unified optimization objective. Granularity-specific sparsity penalties guide the pruning process, allowing a comprehensive compression in a single fine-tuning run. Empirically, our approach identifies circuits that are smaller in nodes than those discovered by prior methods; moreover, we demonstrate that many neurons deemed important by coarse methods are actually irrelevant, while still maintaining task performance. Furthermore, our method has a significantly lower memory footprint, 5-10x, as it does not require keeping intermediate activations in the memory to work.

2512.20043 2026-06-16 cs.AI 版本更新

Discovering Symmetry Groups with Flow Matching

通过流匹配发现对称群

Yuxuan Chen, Jung Yeon Park, Floor Eijkelboom, Jianke Yang, Jan-Willem van de Meent, Lawson L. S. Wong, Robin Walters

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出LieFlow框架,将对称发现转化为李群上的分布学习问题,无需固定基或分布假设,能统一发现连续和离散对称,实验优于LieGAN。

详情
AI中文摘要

对称性是理解物理系统的基础,可以提高机器学习中的性能和样本效率。这两项工作都需要了解数据中的潜在对称性,但自动发现这些对称性具有挑战性。我们提出LieFlow,一种新颖的框架,将对称发现重新定义为李群上的分布学习问题。我们的方法不搜索对称生成器,而是直接在群空间中操作,在大假设群$G$上建模对称分布。学习到的分布的支持揭示了潜在的对称群$H \subseteq G$。与先前的工作不同,LieFlow可以在统一框架中发现连续和离散对称,而不假设固定的李代数基或群元素上的特定分布。在合成2D和3D点云、ModelNet10和真实世界MI-Motion数据集上的实验表明,LieFlow准确发现了连续和离散子群,在识别离散对称方面显著优于最先进的基线LieGAN。

英文摘要

Symmetry is fundamental to understanding physical systems and can improve performance and sample efficiency in machine learning. Both pursuits require knowledge of the underlying symmetries in data, yet discovering these symmetries automatically is challenging. We propose LieFlow, a novel framework that reframes symmetry discovery as a distribution learning problem on Lie groups. Instead of searching for the symmetry generators, our approach operates directly in group space, modeling a symmetry distribution over a large hypothesis group $G$. The support of the learned distribution reveals the underlying symmetry group $H \subseteq G$. Unlike previous works, LieFlow can discover both continuous and discrete symmetries within a unified framework, without assuming a fixed Lie algebra basis or a specific distribution over the group elements. Experiments on synthetic 2D and 3D point clouds, ModelNet10 and a real-world MI-Motion dataset show that LieFlow accurately discovers continuous and discrete subgroups, significantly outperforming a state-of-the-art baseline, LieGAN, in identifying discrete symmetries.

2602.05367 2026-06-16 cs.AI 版本更新

RaBiT: Residual-Aware Binarization Training for Accurate and Efficient LLMs

RaBiT:基于残差的二值化训练用于准确且高效的LLM

Youngcheon You, Banseok Lee, Minseop Choi, Seonyoung Kim, Hyochan Chong, Changdong Kim, Youngmin Kim, Dongkyu Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 RaBiT通过算法强制残差层级解决二值化中的特征共适应问题,提升2位精度-效率边界,实现超越VQ的性能和4.49倍的推理加速。

Comments Accepted to ICML 2026

详情
AI中文摘要

高效部署大型语言模型(LLMs)需要极端量化,迫使在低比特效率与性能之间做出关键权衡。残差二值化通过堆叠二进制(±1)层实现硬件友好的乘法自由推理,但受到病理特征共适应的困扰。我们识别出一种关键失败模式,称为路径适应:在量化感知训练(QAT)中,并行残差二值路径学习冗余特征,降低误差补偿结构并限制模型的表达能力。尽管先前工作依赖启发式修补(例如路径冻结)来限制解空间,我们提出了RaBiT,一种新的量化框架,通过算法强制残差层级解决共适应问题。其核心机制依次从单个共享的全精度权重推导每个二值路径,确保每个路径纠正前一个的误差。这一过程通过稳健的初始化稳定,优先考虑功能保持而非单纯权重近似。RaBiT重新定义了2位精度-效率边界:它实现了最先进的性能,甚至超越硬件密集型向量量化(VQ)方法,并在RTX 4090上实现了比全精度模型快4.49倍的推理加速。

英文摘要

Efficient deployment of large language models (LLMs) requires extreme quantization, forcing a critical trade-off between low-bit efficiency and performance. Residual binarization enables hardware-friendly, matmul-free inference by stacking binary ($\pm$1) layers, but is plagued by pathological feature co-adaptation. We identify a key failure mode, which we term inter-path adaptation: during quantization-aware training (QAT), parallel residual binary paths learn redundant features, degrading the error-compensation structure and limiting the expressive capacity of the model. While prior work relies on heuristic workarounds (e.g., path freezing) that constrain the solution space, we propose RaBiT, a novel quantization framework that resolves co-adaptation by algorithmically enforcing a residual hierarchy. Its core mechanism sequentially derives each binary path from a single shared full-precision weight, which ensures that every path corrects the error of the preceding one. This process is stabilized by a robust initialization that prioritizes functional preservation over mere weight approximation. RaBiT redefines the 2-bit accuracy-efficiency frontier: it achieves state-of-the-art performance, rivals even hardware-intensive Vector Quantization (VQ) methods, and delivers a $4.49\times$ inference speed-up over full-precision models on an RTX 4090. Code is available at https://github.com/SamsungLabs/RaBiT.

2602.23242 2026-06-16 cs.AI 版本更新

A Model-Free Universal AI

无模型通用人工智能

Yegon Kim, Juho Lee

发表机构 * Graduate School of AI, KAIST(韩国科学技术院人工智能研究生院)

AI总结 提出首个在通用强化学习中证明渐近ε最优的无模型智能体AIQI,通过分布动作值函数的通用归纳实现,并扩展了Self-AIXI的渐近最优性证明。

详情
AI中文摘要

在通用强化学习中,所有已建立的最优智能体,包括AIXI,都是基于模型的,显式维护和使用环境模型。本文介绍了具有Q归纳的通用人工智能(AIQI),这是首个被证明在通用RL中渐近ε最优的无模型智能体。AIQI对分布动作值函数进行通用归纳,而不是像先前工作那样对策略或环境进行归纳。在“真理颗粒”条件下,我们证明了AIQI是强渐近ε最优和渐近ε贝叶斯最优的。我们还应用我们的新颖证明技术,在没有特别假设的情况下证明了Self-AIXI的渐近ε最优性。我们的结果显著扩展了已知通用智能体的多样性。

英文摘要

In general reinforcement learning, all established optimal agents, including AIXI, are model-based, explicitly maintaining and using environment models. This paper introduces Universal AI with Q-Induction (AIQI), the first model-free agent proven to be asymptotically $\varepsilon$-optimal in general RL. AIQI performs universal induction over distributional action-value functions, instead of policies or environments like previous works. Under a grain of truth condition, we prove that AIQI is strong asymptotically $\varepsilon$-optimal and asymptotically $\varepsilon$-Bayes-optimal. We also apply our novel proof techniques to show asymptotic $\varepsilon$-optimality of Self-AIXI without any ad-hoc assumptions. Our results significantly expand the diversity of known universal agents.

2604.05859 2026-06-16 cs.AI 版本更新

When Do We Need LLMs? A Diagnostic for Language-Driven Bandits

何时需要大语言模型?语言驱动型老虎机的诊断方法

Uljad Berdica, Fernando Acero, Anton Ipsen, Parisa Zehtabi, Michael Cashmore, Manuela Veloso

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出LLMP-UCB算法从LLM中获取不确定性估计,实验表明轻量数值老虎机在文本嵌入上匹配或超越LLM方案且成本更低,并给出基于臂嵌入的几何诊断指导何时使用LLM。

Comments The Reinforcement Learning Conference, 2026

详情
AI中文摘要

我们研究非情节性决策问题中的上下文多臂老虎机(CMABs),其中上下文包含文本和数值信息(例如推荐系统、动态投资组合调整、报价选择;这些都是金融中常见的问题)。虽然大语言模型(LLMs)越来越多地应用于这些场景,但在每个决策步骤使用LLM进行推理计算成本高昂,且难以获得不确定性估计。为解决这一问题,我们引入LLMP-UCB,一种通过重复推理从LLM中导出不确定性估计的老虎机算法。然而,我们的实验表明,在文本嵌入(稠密或Matryoshka)上运行的轻量数值老虎机以极低的成本匹配或超越了基于LLM的解决方案的准确性。我们进一步证明,嵌入维度是探索-利用平衡的一个实用杠杆,能够在无需提示复杂性的情况下实现成本-性能权衡。最后,为指导实践者,我们提出一种基于臂嵌入的几何诊断方法,以决定何时使用LLM驱动的推理与轻量数值老虎机。我们的结果为跨AI用例广泛适用的成本效益高、不确定性感知的决策系统提供了原则性部署框架。

英文摘要

We study Contextual Multi-Armed Bandits (CMABs) for non-episodic decision-making problems where the context includes both textual and numerical information (e.g., recommendation systems, dynamic portfolio adjustments, offer selection; all frequent problems in finance). While Large Language Models (LLMs) are increasingly applied to these settings, utilizing LLMs for reasoning at every decision step is computationally expensive, and uncertainty estimates are difficult to obtain. To address this, we introduce LLMP-UCB, a bandit algorithm that derives uncertainty estimates from LLMs via repeated inference. However, our experiments demonstrate that lightweight numerical bandits operating on text embeddings (dense or Matryoshka) match or exceed the accuracy of LLM-based solutions at a fraction of their cost. We further show that embedding dimensionality is a practical lever on the exploration-exploitation balance, enabling cost-performance tradeoffs without prompt complexity. Finally, to guide practitioners, we propose a geometric diagnostic based on the arms' embeddings to decide when to use LLM-driven reasoning versus a lightweight numerical bandit. Our results provide a principled deployment framework for cost-effective, uncertainty-aware decision systems with broad applicability across AI use cases.

2605.02427 2026-06-16 cs.AI cs.LG 版本更新

The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

模型知晓,解码器发现:未来价值引导的粒子力量采样

Tu Nguyen, Matthieu Zimmer, Rasul Tutunov, Xiaotong Ji, Haitham Bou Ammar

发表机构 * Huawei Heisenberg Research Center(华为海森堡研究中心) Huawei Noah’s Ark Lab(华为诺亚实验室) UCL Centre for Artificial Intelligence(伦敦大学学院人工智能中心)

AI总结 本文提出APPS算法,通过块状粒子方法高效定位LLM的多步解,提升推理准确率与运行效率,减少对训练数据的依赖。

详情
AI中文摘要

英文摘要

A recurring pattern in "reasoning without training" is that base LLMs already assign non-trivial probability mass to correct multi-step solutions; the bottleneck is locating these modes efficiently at inference time. Power sampling provides a principled way to bias decoding toward such modes by targeting p_theta(x)^alpha with alpha > 1, but practical approximations must account for future-dependent correction factors that determine which prefixes remain promising. We introduce Auxiliary Particle Power Sampling (APPS), a blockwise particle algorithm for approximating the sequence-level power target with a bounded population of partial solutions. APPS propagates hypotheses in parallel using proposal-corrected power reweighting and refines their survival through future-value-guided selection at resampling boundaries. This redistributes finite compute across competing prefixes rather than committing to a single unfolding path, while providing a direct scaling knob in the particle count and predictable peak memory. We instantiate the future-value signal with short-horizon rollouts and also study an amortized variant that replaces rollouts with a lightweight learned selection head. AMore broadly, APPS improves the accuracy--runtime trade-off of training-free decoding, further supporting the view that inference-time power approximation can recover gains often attributed to post-training.

2606.01561 2026-06-16 cs.AI cs.LG 版本更新

S-SPPO: Semantic-Calibrated Self-Play Preference Optimization

S-SPPO:语义校准的自对弈偏好优化

Xiwen Chen, Wenhui Zhu, Jingjing Wang, Peijie Qiu, Zhipeng Wang, Huayu Li, ZhengXiao He, Xuanzhao Dong, Prayag Tiwari, Mingkun Xu, Yujian Xiong, Feng Luo, Abolfazl Razi, Brendan Hogan Rappazzo, Anderson Schneider, Yuriy Nevmyvaka

发表机构 * University of Arizona, USA(亚利桑那大学) Arizona State University, USA(亚利桑那州立大学) Now at Google LLC, work done at Rice University(现就职于谷歌公司,曾就职于里士大学) Clemson University, USA(克莱姆森大学) Washington University in St. Louis, USA(圣路易斯华盛顿大学) Halmstad University, Sweden(哈姆斯塔德大学) Guangdong Institute of Intelligence Science and Technology, China(广东智能科学与技术研究院)

AI总结 针对自对弈偏好优化(SPPO)中因偏好预测过度自信导致策略退化的问题,提出双空间语义校准框架S-SPPO,通过语义门控监督校准和潜在排斥表示校准,在保持博弈结构的同时提升对齐性能。

Comments Accepted by ICML2026

详情
AI中文摘要

将大型语言模型(LLM)与人类偏好对齐通常通过直接偏好优化(DPO)来实现。然而,DPO的标准Bradley-Terry实现在建模人类偏好中常见的传递性偏离方面存在局限。为解决此问题,近期工作引入了自对弈偏好优化(SPPO),通过训练自生成的胜负对来迭代优化策略。然而,我们的研究发现SPPO存在一个关键的不稳定性:当偏好预测器对语义上无法区分的响应赋予过度自信的胜利时,优化容易导致策略退化。为缓解这一问题,我们提出S-SPPO,一个双空间语义校准框架,包括:i)通过语义门控进行监督校准,随着语义重叠增加将胜率目标退火至最大熵基线;ii)通过潜在排斥进行表示校准,以强制几何多样性,防止流形坍塌并保持所选样本与拒绝样本之间的潜在多样性。理论上,我们证明该校准保持了常和博弈结构,促进收敛至纳什均衡。实验上,S-SPPO避免了先前方法中的性能退化,在AlpacaEval 2.0上使用Llama-3-8B实现了52.19%的胜率和47.46%的长度控制胜率,且在训练过程中未使用额外的人工标注偏好。代码将在https://github.com/xiwenc1/s-sppo提供。

英文摘要

Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, reveals a critical instability in SPPO: the optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to semantically indistinguishable responses. To mitigate this, we propose S-SPPO, a dual-space semantic calibration framework comprising: i) Supervision Calibration via semantic gating, which anneals win rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) Representation Calibration via latent repulsion to enforce geometric diversity to prevent manifold collapse and maintain latent diversity between chosen and rejected samples. Theoretically, we show that the calibration preserves the constant-sum game structure, facilitating convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, achieving 52.19% win rate and 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training. The code will be available at https://github.com/xiwenc1/s-sppo.

2606.10237 2026-06-16 cs.AI cs.LG 版本更新

Minimalist Genetic Programming

极简遗传编程

Leonardo Trujillo

发表机构 * Tecnológico Nacional de México/IT de Tijuana(墨西哥国家理工学院/蒂胡ana信息技术学院) LASIGE, Department of Informatics, Faculty of Sciences, University of Lisbon(里斯本大学科学学院信息系LASIGE)

AI总结 提出极简遗传编程(MGP),借鉴语言学中的极简主义程序,用MERGE操作替代进化搜索,在符号回归任务中有效避免膨胀,稳定找到精确解。

详情
AI中文摘要

遗传编程(GP)基于两个重要见解。首先,任何学习任务从根本上都可以视为程序归纳问题,目标是构建表示为语法树的符号层次模型。其次,将此任务视为搜索问题,并使用进化来定位所需模型。自提出以来,GP在广泛的任务和问题领域中取得了显著成果。本文通过修改GP的第二个核心见解,将问题视为句法推导任务,提出了一种替代观点。具体来说,本文提出了极简遗传编程(MGP),该算法与GP一样受生物启发,但并非源自进化,而是从人类语言的极简主义程序中汲取灵感,其中句法被理解为连接其他两个心智系统的最优解决方案。在极简主义中,核心计算过程是一个称为MERGE的二元集合形成算子,它可以通过简单的马尔可夫过程逐步构建复杂的句法结构。MGP能够发现符号表达式的核心构建块,并使用MERGE逐步组合它们。所提出的系统在已知因膨胀倾向而难以用标准GP系统解决的符号回归任务上进行了基准测试。结果表明,当选择适当的原子句法对象词典时,MGP能够在一组标准GP难以做到同样任务的符号回归中一致地产生精确的真实模型。极简主义提供的见解被证明与程序归纳问题相关,并且基于MGP在这项工作中展示的潜力,应进一步探索。

英文摘要

Genetic programming (GP) is based on two important insights. First, that any learning task can fundamentally be posed as a program induction problem, where the goal is to construct a symbolic hierarchical model that is expressed as a syntax tree. Second, to pose this task as a search problem, and use evolution to locate the desired model. Since it was proposed, GP has produced notable results in a wide range of tasks and problem domains. This work presents an alternative view by modifying the second core insight of GP, posing the problem as a syntactic derivation task instead. In particular, this paper presents Minimalist Genetic Programming (MGP), an algorithm that like GP is biologically inspired, but instead of evolution it takes inspiration from the Minimalist Program to human language, in which syntax is understood as an optimal solution to the problem of linking two other mental systems. In minimalism, the core computational process is a binary set formation operator called $MERGE$, than can be used to incrementally construct complex syntactic structures using a simple Markovian process. MGP is able to discover the core building blocks of the symbolic expressions, and to incrementally combined them using $MERGE$. The proposed system is benchmarked on symbolic regression tasks that are known to be difficult to solve with standard GP systems because of the propensity for bloat. Results show that when a proper lexicon of atomic syntactic objects are chosen, MGP is able to consistently produce the exact ground truth model on a set of symbolic regression tasks where standard GP struggles to do the same. The insights provided by minimalism are shown to be relevant to the problem of program induction, and should be explored further based on the potential exhibited by MGP in this work.

2402.00094 2026-06-16 cs.NE cs.AI cs.LG 版本更新

Deep Neural Networks: A Formulation Via Non-Archimedean Analysis

深度神经网络:基于非阿基米德分析的公式化

W. A. Zúñiga-Galindo

发表机构 * University of Texas Rio Grande Valley School of Mathematical & Statistical Sciences(德克萨斯大学里奥格兰德谷大学数学与统计科学学院)

AI总结 提出一种基于非阿基米德局部域整数环的多层树状架构深度神经网络,该网络是定义在环上实值函数的鲁棒通用逼近器,并证明其对单位区间上平方可积函数的通用逼近性。

Comments Final version accepted in the Journal of Fourier Analysis and Applications

详情
AI中文摘要

我们引入了一类具有多层树状架构的新型深度神经网络(DNN)。这些架构使用非阿基米德局部域的整数环中的数字进行编码。这些环具有作为无限有根树的自然层次结构。这些环上的自然态射允许我们构建有限的多层架构。新的DNN是定义在所述环上的实值函数的鲁棒通用逼近器。我们还证明了这些DNN是单位区间上定义的实值平方可积函数的鲁棒通用逼近器。

英文摘要

We introduce a new class of deep neural networks (DNNs) with multilayered tree-like architectures. The architectures are codified using numbers from the ring of integers of non-Archimdean local fields. These rings have a natural hierarchical organization as infinite rooted trees. Natural morphisms on these rings allow us to construct finite multilayered architectures. The new DNNs are robust universal approximators of real-valued functions defined on the mentioned rings. We also show that the DNNs are robust universal approximators of real-valued square-integrable functions defined in the unit interval.

2405.02369 2026-06-16 cs.NE cs.AI cs.LG 版本更新

No One-Size-Fits-All Neurons: Task-based Neurons for Artificial Neural Networks

没有万能神经元:面向任务的人工神经网络神经元

Feng-Lei Fan, Meng Wang, Hang-Cheng Dong, Jianwei Ma, Tieyong Zeng

发表机构 * Department of Data Science, City University of Hong Kong(城市大学数据科学系) School of Mathematics, Harbin Institute of Technology(哈尔滨工业大学数学系) School of Instrumentation, Harbin Institute of Technology(哈尔滨工业大学仪器系) School of Earth and Space Sciences, Peking University(北京大学地球与空间科学学院) Institute for Advanced Study, Beijing Normal-Hong Kong Baptist University(北京师范大学-香港 Baptist大学高级研究院)

AI总结 受大脑神经元任务特异性的启发,提出一种两阶段框架设计任务导向神经元,通过多项式基函数引入归纳偏置,在合成数据、经典基准和实际应用中性能优于现有模型。

Comments 8 pages, 4 figures

详情
AI中文摘要

在过去十年中,许多成功的网络都采用了新颖的架构,这些架构几乎无一例外地使用相同类型的神经元。最近,越来越多的深度学习研究受到NeuroAI理念和人类大脑中观察到的神经元多样性的启发,从而提出了新颖的人工神经元设计。设计性能良好的神经元代表了相对于设计性能良好的神经架构的一个新维度。从生物学角度看,大脑并不依赖一种在所有方面都普遍适用的单一类型神经元。相反,在我们的大脑中,神经元通常是基于任务的。在本研究中,我们探讨以下问题:既然人脑是一个基于任务的神经元使用者,那么人工网络设计能否从基于任务的架构设计转向基于任务的神经元设计?由于方法论上不存在万能神经元,在相同结构下,基于任务的神经元由于对任务具有内在的归纳偏置,相比现有的通用神经元可以增强特征表示能力。具体来说,我们提出了一个用于原型化基于任务神经元的两阶段框架。作为初始步骤,我们使用多项式作为基函数来评估所提出的框架。实验上,在合成数据、经典基准和实际应用上的系统实验结果表明,所提出的基于任务的神经元设计不仅可行,而且相比其他最先进模型具有竞争力的性能。

英文摘要

In the past decade, many successful networks are on novel architectures, which almost exclusively use the same type of neurons. Recently, more and more deep learning studies have been inspired by the idea of NeuroAI and the neuronal diversity observed in human brains, leading to the proposal of novel artificial neuron designs. Designing well-performing neurons represents a new dimension relative to designing well-performing neural architectures. Biologically, the brain does not rely on a single type of neuron that universally functions in all aspects. Instead, in our brain, neurons are often task-based. In this study, we address the following question: since the human brain is a task-based neuron user, can the artificial network design go from the task-based architecture design to the task-based neuron design? Since methodologically there are no one-size-fits-all neurons, given the same structure, task-based neurons can enhance the feature representation ability relative to the existing universal neurons due to the intrinsic inductive bias for the task. Specifically, we propose a two-step framework for prototyping task-based neurons. As the initial step, we evaluate the proposed framework using polynomials as base functions. Empirically, systematic experimental results on synthetic data, classic benchmarks, and real-world applications show that the proposed task-based neuron design is not only feasible but also delivers competitive performance over other state-of-the-art models.

2405.15768 2026-06-16 stat.ML cs.AI cs.LG 版本更新

Canonical Variates in Wasserstein Metric Space

Wasserstein度量空间中的典型变量

Jia Li, Lin Lin

发表机构 * Department of Statistics, The Pennsylvania State University(宾夕法尼亚州立大学统计学系) Department of Biostatistics and Bioinformatics, Duke University(杜克大学生物统计学与生物信息学系)

AI总结 针对分布数据分类问题,提出基于Wasserstein距离的Fisher比最大化降维方法,通过迭代优化算法实现,实验证明能显著提升分类性能。

Comments single space 39 pages, 10 figures

详情
AI中文摘要

在本文中,我们处理由向量空间上的分布(而非单个点)表示的实例的分类问题。我们考虑基于成对距离的分类算法,特别是分布之间的Wasserstein度量。我们研究的核心是在Wasserstein度量空间中进行降维以提高分类准确性。我们引入了一种基于最大化Fisher比(定义为类间变异与类内变异之比)原理的新方法。该比值最大化的方向被称为判别坐标或典型变量轴。在实践中,类间变异和类内变异被定义为分布对之间的平均平方Wasserstein距离,这些分布对要么属于同一类,要么属于不同类。该比值优化通过一种迭代算法实现,该算法在向量空间中的最优传输和最大化步骤之间交替进行。进行了实证研究以评估算法的收敛性;实验结果表明,降维技术显著提高了分类性能。此外,新方法优于基于从分布数据派生的向量表示运行的成熟算法。它对实例如何由分布总结的变化(例如高斯混合模型表示中的分量数量)也表现出鲁棒性。

英文摘要

In this paper, we address the classification of instances represented by distributions on a vector space rather than single points. We consider classification algorithms based on pairwise distances, specifically, the Wasserstein metric between distributions. Central to our investigation is dimension reduction within the Wasserstein metric space to enhance classification accuracy. We introduce a novel approach grounded in the principle of maximizing Fisher's ratio, defined as the quotient of between-class variation to within-class variation. The directions in which this ratio is maximized are termed discriminant coordinates or canonical variates axes. In practice, both between-class and within-class variations are defined as the average squared Wasserstein distances between pairs of distributions, with the pairs either belonging to the same class or to different classes. This ratio optimization is achieved through an iterative algorithm, which alternates between optimal transport and maximization steps within the vector space. Empirical studies are conducted to assess the algorithm's convergence; and experimental results demonstrate that the dimension reduction technique substantially enhances classification performance. Moreover, the new method outperforms well-established algorithms that operate on vector representations derived from distributional data. It also exhibits robustness to variations in how instances are summarized by distributions, such as the number of components in a Gaussian mixture model (GMM) representation.

2410.11687 2026-06-16 cs.LG cs.AI cs.NE 版本更新

Learning in the Recurrent State: Gradient Descent with Linear Recurrent Networks

循环状态中的学习:线性循环网络的梯度下降

Yudou Tian, Neeraj Mohan Sushma, Harshvardhan Mestha, Nicolo Colombo, David Kappel, Anand Subramoney

发表机构 * Center for Cognitive Interaction Technology CITEC, Universität Bielefeld, Germany(认知交互技术中心CITEC,比勒菲尔德大学,德国) Department of Electrical and Electronics Engineering, Birla Institute of Technology and Science Pilani, India(电子与电子工程系,比拉理工科学与技术学院比兰,印度) Department of Computer Science, Royal Holloway, University of London, United Kingdom(计算机科学系,伦敦皇家霍洛威大学,英国)

AI总结 提出一种线性循环网络架构GRIL,通过乘法读出和滑动窗口交叉积自注意力更新,使其能在单次前向传播中实现任务特定线性预测器的小批量梯度下降,并在长程竞技场和语言建模中取得有效性能。

Comments 28 pages, 11 figures

详情
AI中文摘要

线性循环网络(LRNNs)提供线性时间序列建模,但标准循环更新不直接暴露上下文梯度下降所需的监督乘积。我们为LRNNs提出一种充分的构造性归纳偏置:配备乘法读出的对角循环状态和短滑动窗口交叉积自注意力更新。由此产生的架构,基于梯度的循环上下文学习器(GRIL),可以在单次前向传播中实现任务特定线性预测器的小批量梯度下降。同一设计扩展到多步更新和交叉熵分类,并有限地基于MLP扩展到非线性回归。实验上,训练好的GRIL在合成ICL任务上恢复了构造所预测的行为和参数,并且相同的架构偏置在长程竞技场和语言建模中产生了有用的性能。这些结果表明,窗口化交叉积自注意力是一种实用的、可测试的归纳偏置,使LRNNs通过类似梯度下降的更新在上下文中学习。

英文摘要

Linear recurrent networks (LRNNs) offer linear-time sequence modeling, but standard recurrent updates do not directly expose the supervised products needed for in-context gradient descent. We propose a sufficient constructive inductive bias for LRNNs: equip a diagonal recurrent state with multiplicative readout and a short sliding-window cross-product self-attention update. The resulting architecture, Gradient-based Recurrent In-context Learner (GRIL), can implement minibatch gradient descent on a task-specific linear predictor during a single forward pass. The same design extends to multi-step updates and cross-entropy classification, with a limited MLP-based extension to non-linear regression. Empirically, trained GRILs recover the behavior and parameters predicted by the construction on synthetic ICL tasks, and the same architectural bias yields useful performance on Long Range Arena and language modelling. These results present windowed cross-product self-attention as a practical, testable inductive bias for LRNNs that learn in context through gradient-descent-like updates.

2502.10389 2026-06-16 cs.CV cs.AI 版本更新

Region-Adaptive Sampling for Diffusion Transformers

扩散变压器的区域自适应采样

Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang

发表机构 * National University of Singapore(新加坡国立大学) Microsoft Research(微软研究院)

AI总结 提出RAS,一种无需训练的自适应采样策略,通过动态分配不同采样比例到图像区域,实现扩散变压器2.36-2.51倍加速且质量损失极小。

Comments CVPR'26 Poster

详情
AI中文摘要

扩散模型(DMs)已成为跨不同领域生成任务的主要选择。然而,它们依赖多次顺序前向传递,严重限制了实时性能。先前的加速方法主要集中于减少采样步骤数或重用中间结果,由于卷积U-Net结构的限制,未能利用图像中空间区域的变化。通过利用扩散变压器(DiTs)在处理可变数量令牌方面的灵活性,我们引入了RAS,一种新颖的、无需训练的采样策略,该策略根据DiT模型的关注点动态地为图像中的区域分配不同的采样比例。我们的关键观察是,在每个采样步骤中,模型集中在语义上有意义的区域,并且这些关注区域在连续步骤中表现出强烈的连续性。利用这一见解,RAS仅更新当前关注的区域,而其他区域则使用来自前一步的缓存噪声进行更新。模型的关注点基于前一步的输出确定,利用了我们观察到的时间一致性。我们在Stable Diffusion 3和Lumina-Next-T2I上评估了RAS,分别实现了高达2.36倍和2.51倍的加速,且生成质量下降最小。此外,一项用户研究表明,RAS在人类评估下提供相当的质量,同时实现1.6倍加速。我们的方法朝着更高效的扩散变压器迈出了重要一步,增强了它们在实时应用中的潜力。

英文摘要

Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable qualities under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.

2505.04397 2026-06-16 cs.CV cs.AI cs.LG eess.IV 版本更新

PURe: A Plug-and-Play Product-Unit Residual Module for Vision Networks

PURe: 一种用于视觉网络的即插即用乘积单元残差模块

Ziyuan Li, Uwe Jaekel, Babette Dellen

发表机构 * Department of Mathematics, Informatics and Technology, University of Applied Sciences Koblenz(科隆应用科学大学数学、信息学与技术系) Technical University of Munich(慕尼黑技术大学)

AI总结 提出PURe模块,通过二维乘积单元的对数域公式实现稳定的局部乘法交互,可替代残差网络中的标准单元,在图像分类和CT分割任务中提升精度-参数权衡。

Comments Revised version

详情
AI中文摘要

现代视觉网络主要由加性局部变换主导,而显式的乘法局部交互仍未得到充分探索。乘积单元提供了一种直接建模此类交互的方法,但其在深度架构中的使用受到优化不稳定性的限制。在这项工作中,我们提出了PURe,一种用于深度视觉网络的乘积单元残差模块。PURe围绕一个具有实值对数域公式的二维乘积单元构建,使得乘法局部聚合在深度残差层次结构中变得实用。由此产生的模块可作为原生残差单元的即插即用替代品。我们将PURe实例化到用于图像分类的残差CNN和用于体积CT数据切片分割的二维残差编码器-解码器网络中。在Galaxy10 DECaLS、ImageNet和CIFAR-10上,PURe一致地改进了残差CNN,并产生了更有利的精度-参数权衡,使得中等深度模型能够以更小的参数预算匹配或超越显著更深的ResNet基线。在AMOS基准测试中,PURe还在3D病例级评估下改进了切片CT分割。这些结果表明,显式的乘法局部交互是深度残差视觉网络的一种实用且有效的设计原语。

英文摘要

Modern vision networks are dominated by additive local transformations, whereas explicit multiplicative local interactions remain underexplored. Product units offer a direct approach to modeling such interactions, but their use in deep architectures has been limited by optimization instability. In this work, we propose PURe, a Product-Unit Residual Module for deep vision networks. PURe is built around a 2D Product Unit with a real-valued log-domain formulation that makes multiplicative local aggregation practical within deep residual hierarchies. The resulting module serves as a drop-in replacement for native residual units. We instantiate PURe in residual CNNs for image classification and in 2D residual encoder-decoder networks for slice-based segmentation on volumetric CT data. Across Galaxy10 DECaLS, ImageNet, and CIFAR-10, PURe consistently improves residual CNNs and yields a more favorable accuracy-parameter trade-off, allowing moderately deep models to match or surpass substantially deeper ResNet baselines with much smaller parameter budgets. On the AMOS benchmark, PURe also improves slice-based CT segmentation under 3D case-level evaluation. These results show that explicit multiplicative local interaction is a practical and effective design primitive for deep residual vision networks.

2505.04486 2026-06-16 cs.CV cs.AI cs.LG 版本更新

Efficient Flow Matching using Latent Variables

使用潜在变量的高效流匹配

Anirban Samaddar, Yixuan Sun, Viktor Nilsson, Sandeep Madireddy

发表机构 * Argonne National Laboratory(阿贡国家实验室) KTH Royal Institute of Technology(皇家理工学院)

AI总结 提出Latent-CFM方法,利用预训练深度潜在变量模型提取数据特征作为条件,提升流匹配模型的训练效率和生成质量,在图像和物理场生成任务中优于现有方法。

详情
AI中文摘要

流匹配模型在概率生成模型的图像生成任务中显示出巨大潜力。然而,文献中的大多数流匹配模型在从简单源分布(如标准高斯)学习流时,并未显式利用目标数据中的潜在聚类结构。这导致学习效率低下,尤其是对于许多通常位于低维流形中的高维真实世界数据集。为此,我们提出了 $\texttt{Latent-CFM}$,它通过使用预训练的深度潜在变量模型从数据中提取的特征作为条件,提供了高效的训练策略。通过对来自多模态分布的合成数据和广泛使用的图像基准数据集的实验,我们表明,$\texttt{Latent-CFM}$ 通过采用预训练的轻量级潜在变量模型,在显著减少训练和计算量的情况下,展现出比最先进的流匹配模型更好的生成质量。除了自然图像,我们还考虑了源自物理过程的空间场的生成建模。使用二维达西流数据集,我们证明了我们的方法比竞争方法生成更物理准确的样本。此外,通过潜在空间分析,我们证明了我们的方法可用于以潜在特征为条件的条件图像生成,这增加了生成过程的可解释性。

英文摘要

Flow matching models have shown great potential in image generation tasks among probabilistic generative models. However, most flow matching models in the literature do not explicitly utilize the underlying clustering structure in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. To this end, we present $\texttt{Latent-CFM}$, which provides efficient training strategies by conditioning on the features extracted from data using pretrained deep latent variable models. Through experiments on synthetic data from multi-modal distributions and widely used image benchmark datasets, we show that $\texttt{Latent-CFM}$ exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models by adopting pretrained lightweight latent variable models. Beyond natural images, we consider generative modeling of spatial fields stemming from physical processes. Using a 2d Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competing approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features, which adds interpretability to the generation process.

2505.18227 2026-06-16 cs.LG cs.AI 版本更新

Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

Token缩减应超越生成模型中的效率——从视觉、语言到多模态

Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik

发表机构 * Harvard University(哈佛大学) Northeastern University(东北大学) CAS(中国科学院) Wuhan University(武汉大学) MIT(麻省理工学院) Peking University(北京大学)

AI总结 本文提出Token缩减应超越传统效率优化,成为生成模型的基础原则,通过减少冗余token来促进多模态融合、缓解幻觉、维持长输入连贯性并提升训练稳定性。

Comments Project page: https://github.com/ZLKong/Awesome-Collection-Token-Reduction

详情
AI中文摘要

在Transformer架构中,token——从原始数据中分割出的离散单元——通过将输入切分为固定长度的块而形成。每个token被映射到一个嵌入向量,从而在保留输入关键信息的同时实现并行注意力计算。由于Transformer自注意机制的二次计算复杂度,token缩减主要被用作一种效率策略,尤其在单一视觉和语言领域,它有助于平衡计算成本、内存使用和推理延迟。尽管取得了这些进展,本文认为在大规模生成模型时代,token缩减应超越其传统的效率导向角色。相反,我们将其定位为生成建模中的基本原则,对模型架构和更广泛的应用产生关键影响。具体而言,我们认为在视觉、语言和多模态系统中,token缩减可以:(i) 促进更深层次的多模态集成和对齐,(ii) 缓解“过度思考”和幻觉,(iii) 在长输入上保持连贯性,(iv) 增强训练稳定性等。我们将token缩减重新定义为不仅仅是效率措施。通过这样做,我们概述了有前景的未来方向,包括算法设计、强化学习引导的token缩减、用于上下文学习的token优化、智能体框架设计以及更广泛的机器学习和科学领域。

英文摘要

In Transformer architectures, tokens\textemdash discrete units derived from raw data\textemdash are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input's essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate "overthinking" and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, agentic framework design, and broader ML and scientific domains.

2505.19699 2026-06-16 cs.LG cs.AI cs.DC 版本更新

Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments

Mosaic: 面向异构分布式环境的无数据知识蒸馏与混合专家模型

Junming Liu, Yanting Gao, Yuqi Li, Siyuan Meng, Yifei Sun, Aoqi Wu, Yirong Chen, Ding Wang, Shiping Wen

发表机构 * School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) The City University of New York(纽约城市大学) Shenzhen University of Advanced Technology(深圳先进技术大学)

AI总结 针对联邦学习中模型与数据异构性问题,提出Mosaic框架,通过本地生成模型合成隐私保护数据,并利用混合专家模型蒸馏全局模型,在图像和多模态基准上超越现有方法。

Comments 23 pages, 5 figures, 24 tables; Accepted by Knowledge-Based Systems, 2026

详情
AI中文摘要

联邦学习(FL)是一种去中心化的机器学习范式,使客户端能够在保护数据隐私的同时协作训练模型。然而,模型和数据异构性的共存导致客户端间表示不一致和优化动态发散,最终阻碍了鲁棒的全局性能。为克服这些挑战,我们提出了Mosaic,一种面向异构分布式环境的新型无数据知识蒸馏框架。Mosaic首先训练本地生成模型以近似每个客户端的个性化分布,从而能够生成合成数据,并通过与真实数据严格分离来保护隐私。随后,Mosaic根据客户端模型的专业知识形成混合专家模型(MoE),并使用生成的数据将其蒸馏到全局模型中。为进一步增强MoE架构,Mosaic通过一个在少量代表性原型上训练的轻量级元模型来集成专家预测。在标准图像和多模态基准上的大量实验表明,Mosaic在模型和数据异构性下均持续优于最先进的方法。源代码已发布在https://this https URL。

英文摘要

Federated Learning (FL) is a decentralized machine learning paradigm that enables clients to collaboratively train models while preserving data privacy. However, the coexistence of model and data heterogeneity gives rise to inconsistent representations and divergent optimization dynamics across clients, ultimately hindering robust global performance. To transcend these challenges, we propose Mosaic, a novel data-free knowledge distillation framework tailored for heterogeneous distributed environments. Mosaic first trains local generative models to approximate each client's personalized distribution, enabling synthetic data generation that safeguards privacy through strict separation from real data. Subsequently, Mosaic forms a Mixture-of-Experts (MoE) from client models based on their specialized knowledge, and distills it into a global model using the generated data. To further enhance the MoE architecture, Mosaic integrates expert predictions via a lightweight meta model trained on a few representative prototypes. Extensive experiments on standard image and multimodal benchmarks demonstrate that Mosaic consistently outperforms state-of-the-art approaches under both model and data heterogeneity. The source code has been published at https://github.com/Wings-Of-Disaster/Mosaic.

2505.20030 2026-06-16 cs.LG cs.AI nlin.CD physics.comp-ph 版本更新

Multiple Descents in Deep Learning as a Sequence of Order-Chaos Transitions in LSTM Networks

深度学习中的多重下降现象:LSTM网络中的有序-混沌转变序列

Wenbo Wei, Fan Xu, Nicholas Chong Jia Le, Choy Heng Lai, Ling Feng

发表机构 * Department of Physics(物理系) National University of Singapore(新加坡国立大学) Institute of High Performance Computing (IHPC)(高性能计算研究所) A*STAR

AI总结 本文在LSTM网络训练中发现多重下降现象,通过渐近稳定性分析表明性能周期与有序-混沌相变相关,最优训练点位于临界转变点,且首次有序-混沌转变处边缘最宽,利于权重探索。

详情
AI中文摘要

我们在长短期记忆(LSTM)网络训练真实世界任务的过程中观察到一种新颖的“多重下降”现象,即模型过训练后性能会经历多次长期的上行和下行循环。通过对模型进行渐近稳定性分析,我们发现性能周期——由测试数据中的损失函数指示——与模型有序和混沌之间的相变过程密切相关,且局部最优训练步骤始终处于两个阶段之间的临界转变点。更重要的是,模型的最优点通常出现在从有序到混沌的第一次转变处,此时“混沌边缘”的“宽度”往往最宽,从而允许对学习权重配置进行最佳探索。

英文摘要

We observe a novel `multiple-descent' phenomenon during the learning process of a recurrent neural network called long-short-term memory (LSTM) networks during its training on real-world task, in which the performance goes through long cycles of up and down trends multiple times after the model is overtrained. By carrying out asymptotic stability analysis of the models, we found that the cycles in performance -- indicated by loss function in test data -- are closely associated with the phase transition process between order and chaos of the model, and the local optimal training step are consistently at the critical transition point between the two phases. More importantly, the most optimal point of the model usually occurs at the first transition from order to chaos, where the `width' of the `edge of chaos' is often the widest, allowing the best exploration of weight configurations for learning.

2505.23878 2026-06-16 cs.LG cs.AI 版本更新

AC-ODM: Actor--Critic Online Data Mixing for Sample-Efficient LLM Pretraining

AC-ODM: 面向样本高效LLM预训练的演员-评论家在线数据混合

Jing Ma, Chenhao Dang, Mingjie Liao

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出AC-ODM方法,从强化学习视角优化预训练数据混合,通过参数化策略实现梯度构造性干扰最大化,支持代理与非代理两种模式,显著提升收敛速度和下游任务准确率。

Comments ICML 2026 (Poster)

详情
AI中文摘要

优化预训练数据组成对于LLM的泛化能力至关重要。虽然动态混合通过捕捉不断变化的训练动态优于静态策略,但当前方法无法在计算效率、样本效率和结构灵活性之间取得平衡,以适应多样化的数据源。我们引入了演员-评论家在线数据混合(AC-ODM),该方法从强化学习角度处理数据混合,使用参数化策略,我们理论上证明该策略充当动态线性代理,最大化梯度的构造性干扰。为增强实际灵活性,AC-ODM支持两种操作模式:(i)代理模式,用于固定的预准备语料库,其中在小模型上学习的策略迁移到更大的目标模型;(ii)非代理模式,用于无需先验知识的直接端到端从头训练。实验上,AC-ODM在各种架构上的收敛速度和下游准确率显著优于先前方法。在Pythia-1B上,它使用比竞争基线少66%的训练步骤达到最优验证困惑度,在MMLU准确率上实现27.5%的相对提升,在HumanEval上pass@1提高2.23倍,同时每步墙钟时间几乎可忽略不计(增加0.4%),内存开销仅增加2%。代码可在https://this https URL获取。

英文摘要

Optimizing pretraining data composition is pivotal for LLM generalization. While dynamic mixing outperforms static strategies by capturing evolving training dynamics, current methods fail to reconcile computational efficiency with sample efficiency and structural flexibility for diverse pipelines.We introduce Actor--Critic Online Data Mixing (AC-ODM), which approaches data mixing from a reinforcement learning perspective with a parameterized policy that we theoretically prove to act as a dynamic linear surrogate maximizing the constructive interference of gradients. To enhance practical flexibility, AC-ODM supports two operational modes: (i) a proxy mode for fixed, pre-prepared corpora, where a policy learned on a small model is transferred to a larger target; and (ii) a non-proxy mode for direct end-to-end training from scratch without priors. Empirically, AC-ODM significantly outperforms prior methods in convergence speed and downstream accuracy across various architectures. On Pythia-1B, it reaches optimal validation perplexity using up to 66% fewer training steps than competitive baselines, delivering a 27.5% relative improvement in MMLU accuracy and a 2.23 x higher pass@1 on HumanEval, all while incurring a virtually negligible (0.4%) per-step wall-clock increase and only 2% additional memory overhead. Code is available at https://github.com/DANG-ai/AC-ODM.

2506.22427 2026-06-16 cs.LG cs.AI 版本更新

CLoVE: Personalized Federated Learning through Clustering of Loss Vector Embeddings

CLoVE: 通过损失向量嵌入聚类的个性化联邦学习

Randeep Bhatia, Nikos Papadis, Murali Kodialam, TV Lakshman, Sayak Chakrabarty

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出CLoVE算法,利用客户端损失向量嵌入进行聚类,实现个性化联邦学习,具有简单、适用监督和无监督任务、无需最优模型初始化等优点,理论证明可高概率准确恢复聚类并指数收敛。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026); 35 pages, 7 figures

详情
AI中文摘要

我们提出了CLoVE(损失向量嵌入聚类),一种用于聚类联邦学习(CFL)的新算法。在CFL中,客户端根据其数据分布自然分组为聚类。然而,识别这些聚类具有挑战性,因为客户端分配是未知的。CLoVE利用从客户端数据上的模型损失导出的客户端嵌入,并利用以下洞察:同一聚类中的客户端共享相似的损失值,而不同聚类中的客户端表现出不同的损失模式。基于这些嵌入,CLoVE能够迭代地识别和分离来自不同聚类的客户端,并通过联邦聚合优化特定聚类的模型。与现有CFL算法相比,CLoVE的主要优点是:(1)简单性,(2)适用于监督和无监督设置,(3)消除了对接近最优模型初始化的需求,使其更稳健,更适合实际应用。我们建立了理论收敛界,表明CLoVE可以在单轮中以高概率准确恢复聚类,并在线性设置中以指数速度收敛到最优模型。我们与多种CFL和通用个性化联邦学习(PFL)算法在不同类型数据集和广泛非IID设置下的全面实验表明,CLoVE在仅几轮训练中就能实现高度准确的聚类恢复,并在各种监督和无监督PFL任务中达到最先进的模型精度。

英文摘要

We propose CLoVE (Clustering of Loss Vector Embeddings), a novel algorithm for Clustered Federated Learning (CFL). In CFL, clients are naturally grouped into clusters based on their data distribution. However, identifying these clusters is challenging, as client assignments are unknown. CLoVE utilizes client embeddings derived from model losses on client data, and leverages the insight that clients in the same cluster share similar loss values, while those in different clusters exhibit distinct loss patterns. Based on these embeddings, CLoVE is able to iteratively identify and separate clients from different clusters and optimize cluster-specific models through federated aggregation. Key advantages of CLoVE over existing CFL algorithms are (1) its simplicity, (2) its applicability to both supervised and unsupervised settings, and (3) the fact that it eliminates the need for near-optimal model initialization, which makes it more robust and better suited for real-world applications. We establish theoretical convergence bounds, showing that CLoVE can recover clusters accurately with high probability in a single round and converges exponentially fast to optimal models in a linear setting. Our comprehensive experiments comparing with a variety of both CFL and generic Personalized Federated Learning (PFL) algorithms on different types of datasets and an extensive array of non-IID settings demonstrate that CLoVE achieves highly accurate cluster recovery in just a few rounds of training, along with state-of-the-art model accuracy, across a variety of both supervised and unsupervised PFL tasks.

2508.00956 2026-06-16 cs.LG cs.AI cs.IR 版本更新

FOUNDv2: Learning Unified User Quantized Tokenizers for User Representation

FOUNDv2: 学习统一的用户量化分词器用于用户表示

Chuan He, Yang Chen, Bin Dou, Wuliang Huang, Baokun Wang, Yongchao Liu, Xing Fu, Yu Cheng, Chuntao Hong, Weiqiang Wang, Zhongle Xie, Jiajun Zheng, Xin-Wei Yao

发表机构 * Ant Group(蚂蚁集团) Zhejiang University of Technology(浙江工业大学) Zhejiang University(浙江大学)

AI总结 提出FOUNDv2框架,通过统一用户量化分词器(U2QT)将异构用户数据转化为离散令牌,结合多视图RQ-VAE和多尺度对齐目标,实现高效存储和预测性能,在多个基准上优于任务特定基线。

详情
AI中文摘要

用户表示学习是大规模网络平台上个性化服务的基础支柱。尽管其重要性,传统的连续嵌入方法面临重大挑战,包括缺乏多源数据融合的统一范式、由于信息密度低导致的过高存储开销以及缺乏多尺度建模粒度。为克服这些限制,我们引入FOUNDv2,一个以统一用户量化分词器(U2QT)框架为核心的综合用户表示方案。FOUNDv2通过一个稳健的两阶段架构将异构用户数据转化为标准化的离散令牌空间。具体来说,该框架首先提取紧凑的特征表示,然后使用多视图RQ-VAE通过共享和源特定的码本将其离散化为存储高效的令牌。为了赋予这些表示预测智能,我们进一步设计多尺度对齐目标以捕捉细粒度的行为依赖和宏观时间周期性。在各种基准上的大量实验表明,FOUNDv2在实现存储和计算成本大幅降低的同时,始终优于任务特定基线。最后,FOUNDv2在支付宝上的大规模部署验证了其在多种工业场景中的实际可扩展性和效率。主要代码可在以下网址获取:this https URL。

英文摘要

User representation learning serves as a fundamental pillar for personalized services on large-scale web platforms. Despite its importance, conventional continuous embedding methods face significant challenges, including the lack of a unified paradigm for multi-source data integration, prohibitive storage overhead due to low information density, and the lack of multi-scale modeling granularity. To overcome these limitations, we introduce FOUNDv2, a comprehensive user representation scheme centered on the Unified User Quantized Tokenizer U2QT) framework. FOUNDv2 transforms heterogeneous user data into a standardized discrete token space through a robust two-stage architecture. Specifically, the framework first extracts compact feature representations and subsequently employs a multi-view RQ-VAE to discretize them into storage-efficient tokens using shared and source-specific codebooks. To empower these representations with predictive intelligence, we further design multi-scale alignment objectives to capture both fine-grained behavioral dependencies and macro-temporal periodicity. Extensive experiments on various benchmarks demonstrate that FOUNDv2 consistently outperforms task-specific baselines while achieving substantial reductions in storage and computational costs. Finally, the large-scale deployment of FOUNDv2 on Alipay validates its practical scalability and efficiency across diverse industrial scenarios. The main code is available at: https://github.com/chuanhe1999/FOUNDv2.

2508.05287 2026-06-16 cs.LG cs.AI 版本更新

FlowState: Sampling-Rate-Equivariant Time-Series Forecasting

FlowState: 采样率等变的时间序列预测

Lars Graf, Thomas Ortner, Stanisław Woźniak, Angeliki Pantazi

发表机构 * GitHub

AI总结 提出FlowState架构,通过状态空间模型编码器和函数基解码器实现采样率等变预测,无需重新训练即可适应不同采样率和预测长度,在GIFT-Eval基准上取得最优结果。

Comments Proceedings of the 43 rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026 by the author(s)

详情
AI中文摘要

现有的时间序列基础模型(TSFMs)通常基于Transformer变体,缺乏对不同采样率的适应性,难以在不同上下文和目标长度上泛化,且计算效率低下。我们提出FlowState,一种新颖的TSFM架构,通过将状态空间模型(SSM)编码器与函数基解码器(FBD)配对,实现采样率等变预测。这种设计支持连续时间建模和动态时间尺度调整,使FlowState能够天然地泛化到所有可能的时间分辨率,并动态调整预测范围而无需重新训练。我们进一步提出一种高效的预训练策略,提高了鲁棒性并加速了训练。尽管FlowState是最小的TSFMs之一,它在广泛使用的GIFT-Eval基准上取得了最先进的结果,同时展现出对未见采样率的卓越适应性。我们的详细分析证实了其组件的有效性,并展示了其适应不同输入采样率的独特能力。

英文摘要

Existing time series foundation models (TSFMs), often based on transformer variants, lack adaptability to different sampling rates, struggle with generalization across varying context and target lengths, and are computationally inefficient. We introduce FlowState, a novel TSFM architecture that achieves sampling-rate-equivariant forecasting through a unified design that pairs a state space model (SSM) encoder with a functional basis decoder (FBD). This design enables continuous-time modeling and dynamic time-scale adjustment, allowing FlowState to inherently generalize across all possible temporal resolutions, and dynamically adjust the forecasting horizons without retraining. We further propose an efficient pretraining strategy that improves robustness and accelerates training. Despite being one of the smallest TSFMs, FlowState achieves state-of-the-art results on the widely used GIFT-Eval benchmark, while demonstrating superior adaptability to unseen sampling rates. Our detailed analyses confirm the effectiveness of its components, and we demonstrate its unique ability to adapt to varying input sampling rates.

2508.17254 2026-06-16 cs.CV cs.AI 版本更新

A biological vision inspired framework for machine perception of abutting grating illusory contours

一种受生物视觉启发的机器感知对接光栅错觉轮廓框架

Xiao Zhang, Kai-Fu Yang, Xian-Shi Zhang, Hong-Zhi You, Hong-Mei Yan, Yong-Jie Li

发表机构 * Sichuan Cancer Hospital & Institute, School of Life Science and Technology, University of Electronic Science and Technology of China(四川肿瘤医院及研究院、电子科技大学生命科学与技术学院)

AI总结 提出受视觉皮层启发的ICPNet网络,通过多尺度特征投影、特征交互注意力和边缘融合模块,显著提升了对对接光栅错觉轮廓的感知能力。

详情
AI中文摘要

更高层次的机器智能需要与人类感知和认知对齐。深度神经网络(DNN)主导的机器智能在各种现实任务中表现出色。然而,最近证据表明,DNN无法感知如对接光栅这样的错觉轮廓,这与人类感知模式不一致。与以往工作不同,我们提出了一种受视觉皮层电路启发的新型深度网络,称为错觉轮廓感知网络(ICPNet)。在ICPNet中,设计了多尺度特征投影(MFP)模块以提取多尺度表示。为了增强前馈和反馈特征之间的交互,引入了特征交互注意力模块(FIAM)。此外,受人类感知中形状偏见的启发,通过边缘融合模块(EFM)进行的边缘检测任务注入了形状约束,引导网络关注前景。我们在现有的AG-MNIST测试集和本文构建的AG-Fashion-MNIST测试集上评估了我们的方法。综合实验结果表明,ICPNet对对接光栅错觉轮廓的敏感度显著高于最先进模型,在各个子集上的top-1准确率均有显著提升。这项工作有望使基于DNN的模型向人类级智能迈进一步。

英文摘要

Higher levels of machine intelligence demand alignment with human perception and cognition. Deep neural networks (DNN) dominated machine intelligence have demonstrated exceptional performance across various real-world tasks. Nevertheless, recent evidence suggests that DNNs fail to perceive illusory contours like the abutting grating, a discrepancy that misaligns with human perception patterns. Departing from previous works, we propose a novel deep network called illusory contour perception network (ICPNet) inspired by the circuits of the visual cortex. In ICPNet, a multi-scale feature projection (MFP) module is designed to extract multi-scale representations. To boost the interaction between feedforward and feedback features, a feature interaction attention module (FIAM) is introduced. Moreover, drawing inspiration from the shape bias observed in human perception, an edge detection task conducted via the edge fusion module (EFM) injects shape constraints that guide the network to concentrate on the foreground. We assess our method on the existing AG-MNIST test set and the AG-Fashion-MNIST test sets constructed by this work. Comprehensive experimental results reveal that ICPNet is significantly more sensitive to abutting grating illusory contours than state-of-the-art models, with notable improvements in top-1 accuracy across various subsets. This work is expected to make a step towards human-level intelligence for DNN-based models.

2510.04212 2026-06-16 cs.LG cs.AI 版本更新

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

低精度Transformer训练失败的原因:对Flash Attention的分析

Haiquan Qiu, Quanming Yao

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Beijing National Research Center for Information Science and Technology(北京信息科学与技术国家研究中心) State Key laboratory of Space Network and Communications(空间网络与通信国家重点实验室)

AI总结 本文首次从机制上解释了低精度下Flash Attention导致训练崩溃的原因,揭示了相似低秩表示与有偏舍入误差的恶性循环,并通过最小化修改稳定训练。

Comments ICLR 2026

详情
AI中文摘要

追求计算效率推动了在训练Transformer模型时采用低精度格式。然而,这一进展常常受到臭名昭著的训练不稳定性的阻碍。本文首次对一种长期存在且未解决的失败案例提供了机制性解释,即在低精度设置下使用flash attention会导致灾难性的损失爆炸。我们的深入分析表明,这种失败并非随机现象,而是由两个相互交织的现象引起的:注意力机制内出现相似的低秩表示,以及低精度算术中固有的有偏舍入误差的累积效应。我们展示了这些因素如何形成误差累积的恶性循环,破坏权重更新,最终使训练动态偏离轨道。为了验证我们的发现,我们对flash attention引入了一个最小修改,以减轻舍入误差的偏差。这一简单改变稳定了训练过程,证实了我们的分析,并为这一长期存在的问题提供了实用解决方案。代码可在以下网址获取:此 https URL。

英文摘要

The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosion. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem. Code is available at https://github.com/ucker/why-low-precision-training-fails.

2510.07651 2026-06-16 cs.CL cs.AI 版本更新

OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

OBCache: 面向高效长上下文LLM推理的最优脑KV缓存剪枝

Yuzhe Gu, Xiyu Liang, Jiaojiao Zhao, Enmao Diao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出OBCache框架,将缓存驱逐形式化为逐层结构化剪枝问题,基于最优脑损伤理论量化令牌显著性,通过输出感知分数改进现有驱逐策略,在长上下文任务中提升准确性。

Comments ICML 2026

详情
AI中文摘要

具有扩展上下文窗口的大型语言模型(LLM)实现了强大的应用,但带来了显著的内存开销,因为缓存所有键值(KV)状态随序列长度和批大小线性扩展。现有的缓存驱逐方法通过利用注意力稀疏性来解决这一问题,但它们通常使用累积注意力权重对令牌进行启发式排名,而不考虑其对注意力输出的真实影响。我们提出了最优脑缓存(OBCache),一个将缓存驱逐形式化为逐层结构化剪枝问题的原则性框架。基于最优脑损伤(OBD)理论,OBCache通过测量由剪枝令牌引起的注意力输出扰动来量化令牌显著性,并为孤立键、孤立值以及联合键值对推导出闭式分数。我们的分数不仅考虑了注意力权重,还考虑了值状态和注意力输出的信息,从而通过输出感知信号增强了现有的驱逐策略。在LLaMA和Qwen模型上的实验表明,将现有工作中跨不同查询位置估计令牌显著性的启发式分数替换为OBCache的输出感知分数,持续提高了长上下文准确性。代码可在 https://github.com/DreamSoul-AI/OBCache 获取。

英文摘要

Large language models (LLMs) with extended context windows enable powerful applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy. Code is available at https://github.com/DreamSoul-AI/OBCache.

2510.16882 2026-06-16 cs.LG cs.AI cs.CL 版本更新

Utility-Diversity Aware Online Batch Selection for LLM Supervised Fine-tuning

面向LLM监督微调的效用-多样性感知在线批次选择

Heming Zou, Yixiu Mao, Yun Qu, Qi Wang, Xiangyang Ji

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出UDS框架,利用logits矩阵核范数和轻量记忆缓冲实现高效在线批次选择,兼顾数据效用与多样性,无需外部资源,在多个基准上优于现有方法并降低训练时间。

Comments ICML 2026 accepted paper

详情
AI中文摘要

监督微调(SFT)是一种常用的技术,用于将大型语言模型(LLM)适配到下游任务。在实践中,对整个数据集进行SFT计算成本高昂,且有时会导致过拟合或偏差放大。这促进了SFT中数据筛选的兴起,即优先选择最有价值的数据进行优化。本文研究了在线批次选择系列方法,这些方法在训练过程中动态评分和过滤样本。然而,现有的流行方法通常(i)仅依赖数据的效用选择子集,而忽略多样性等其他关键因素,(ii)依赖外部资源如参考模型或验证集,以及(iii)相对于全数据集训练增加了额外训练时间。为解决这些局限,本文开发了UDS(效用-多样性采样),一个用于SFT中高效在线批次选择的框架。UDS利用logits矩阵的核范数来捕获数据效用和样本内多样性,同时通过与历史样本的轻量内存缓冲进行高效低维嵌入比较来估计样本间多样性。这种设计消除了对外部资源和不必要反向传播的需求,确保了计算效率。在多个基准上的实验表明,UDS在不同数据预算下始终优于最先进的在线批次选择方法,并且与全数据集微调相比显著减少了训练时间。代码可在该https URL获取。

英文摘要

Supervised fine-tuning (SFT) is a commonly used technique to adapt large language models (LLMs) to downstream tasks. In practice, SFT on a full dataset is computationally expensive and sometimes suffers from overfitting or bias amplification. This facilitates the rise of data curation in SFT, which prioritizes the most valuable data to optimze. This work studies the online batch selection family that dynamically scores and filters samples during the training process. However, existing popular methods often (i) rely merely on the utility of data to select a subset while neglecting other crucial factors like diversity, (ii) rely on external resources such as reference models or validation sets, and (iii) incur extra training time over full-dataset training. To address these limitations, this work develops UDS (Utility-Diversity Sampling), a framework for efficient online batch selection in SFT. UDS leverages the nuclear norm of the logits matrix to capture both data utility and intra-sample diversity, while estimating inter-sample diversity through efficient low-dimensional embedding comparisons with a lightweight memory buffer of historical samples. Such a design eliminates the need for external resources and unnecessary backpropagation, securing computational efficiency. Experiments on multiple benchmarks demonstrate that UDS consistently outperforms state-of-the-art online batch selection methods under varying data budgets, and significantly reduces training time compared to full-dataset fine-tuning. Code is available at https://github.com/gfyddha/UDS.

2511.08577 2026-06-16 cs.CL cs.AI cs.LG cs.PF 版本更新

Think-at-Hard: Selective Latent Iterations to Improve Reasoning Language Models

Think-at-Hard: 选择性潜在迭代以改进推理语言模型

Tianyu Fu, Yichen You, Zekai Chen, Guohao Dai, Huazhong Yang, Yu Wang

AI总结 针对循环变压器中潜在过思考问题,提出Think-at-Hard方法,通过轻量级决策器选择性地在困难令牌上触发潜在迭代,并采用深度感知LoRA和双因果注意力机制,在数学、问答和编码任务上一致提升性能。

Comments Accepted by ICML'26

详情
AI中文摘要

提升大型语言模型(LLMs)的推理能力,特别是在参数约束下,对实际应用至关重要。循环变压器通过执行多次潜在迭代来细化每个令牌,超越单次前向传播。然而,我们识别出一种潜在过思考现象:大多数令牌预测在第一次前向传播后已经正确,但在后续迭代中有时会被修改为错误。我们询问选择性地跳过潜在迭代是否能提高准确性,并揭示了一个显著的潜力:使用预言迭代策略可将性能提升高达7.3%。受此启发,我们提出了Think-at-Hard (TaH),一种针对选择性迭代优化的循环变压器。TaH采用轻量级神经决策器来触发潜在迭代,仅在标准前向传播后可能不正确的令牌上触发。在潜在迭代期间,深度感知的低秩适应(LoRA)模块将目标从一般的下一个令牌预测转变为聚焦的困难令牌细化。双因果注意力机制将注意力从令牌序列维度扩展到额外的迭代深度维度,实现跨迭代信息流,同时保持完全的序列并行性。在九个基准上的实验显示,在数学、问答和编码任务上一致提升。在相同参数数量下,TaH在93%的令牌上跳过迭代,性能比始终迭代的基线高3.8-4.4%,并超过单次迭代的Qwen3基线3.0-3.8%。当允许LoRA和决策器增加不到3%的参数时,增益分别进一步增加到5.3-6.2%和6.1-6.8%。我们的代码可在以下网址获取:https://this URL。

英文摘要

Improving the reasoning abilities of Large Language Models (LLMs), especially under parameter constraints, is crucial for real-world applications. Looped transformers address this by performing multiple latent iterations to refine each token beyond a single forward pass. However, we identify a latent overthinking phenomenon: most token predictions are already correct after the first pass, but are sometimes revised into errors in later iterations. We ask whether selectively skipping latent iterations can improve accuracy, and reveal significant potential with an oracle iteration policy that boosts performance by up to 7.3%. Motivated by this, we propose Think-at-Hard (TaH), a looped transformer optimized for selective iteration. TaH employs a lightweight neural decider to trigger latent iteration, only at tokens likely to be incorrect after the standard forward pass. During latent iterations, depth-aware Low-Rank Adaptation (LoRA) modules shift the objective from general next-token prediction to focused hard-token refinement. A duo-causal attention mechanism extends attention from the token sequence dimension to an additional iteration depth dimension, enabling cross-iteration information flow with full sequential parallelism. Experiments on nine benchmarks show consistent gains across math, QA, and coding tasks. With identical parameter counts, TaH outperforms always-iterate baselines by 3.8-4.4% while skipping iterations on 93% of tokens, and exceeds single-iteration Qwen3 baselines by 3.0-3.8%. When allowing <3% more parameters from LoRA and decider, the gains further increase to 5.3-6.2% and 6.1-6.8%, respectively. Our code is available at https://github.com/thu-nics/TaH.

2512.18295 2026-06-16 cs.LG cs.AI 版本更新

AL-GNN: Privacy-Preserving and Replay-Free Continual Graph Learning via Analytic Learning

AL-GNN: 基于分析学习的隐私保护且无需重放的持续图学习

Xuling Zhang, Jindong Li, Yifei Zhang, Mingqi Yang, Menglin Yang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港理工大学(广州)) Northwestern Polytechnical University(西北工业大学) South China University of Technology(华南理工大学)

AI总结 提出AL-GNN框架,利用分析学习理论将持续图学习转化为递归最小二乘优化,通过闭式分类器更新和正则化特征自相关矩阵实现无需反向传播和重放缓冲的高效训练,在保护隐私的同时提升性能并减少遗忘。

详情
AI中文摘要

持续图学习(CGL)旨在使图神经网络能够从图结构数据流中增量学习,而不会遗忘先前获得的知识。现有方法,特别是基于经验重放的方法,通常存储并重新访问过去的图数据以缓解灾难性遗忘。然而,这些方法存在显著局限性,包括隐私问题和低效性。在这项工作中,我们提出了AL-GNN,一种新颖的持续图学习框架,消除了对反向传播和重放缓冲区的需求。相反,AL-GNN利用分析学习理论的原理,将学习形式化为递归最小二乘优化过程。它通过闭式分类器更新和正则化特征自相关矩阵来分析和更新模型知识。这种设计使得每个任务能够进行高效的单次训练,并通过避免存储历史样本固有地保护数据隐私。在多个动态图分类基准上的大量实验表明,AL-GNN取得了与现有方法相比具有竞争力或更优的性能。例如,它在CoraFull上平均性能提高了10%,在Reddit上遗忘减少了30%以上,同时由于其无反向传播的设计,训练时间减少了近50%。

英文摘要

Continual graph learning (CGL) aims to enable graph neural networks to incrementally learn from a stream of graph structured data without forgetting previously acquired knowledge. Existing methods particularly those based on experience replay typically store and revisit past graph data to mitigate catastrophic forgetting. However, these approaches pose significant limitations, including privacy concerns, inefficiency. In this work, we propose AL GNN, a novel framework for continual graph learning that eliminates the need for backpropagation and replay buffers. Instead, AL GNN leverages principles from analytic learning theory to formulate learning as a recursive least squares optimization process. It maintains and updates model knowledge analytically through closed form classifier updates and a regularized feature autocorrelation matrix. This design enables efficient one pass training for each task, and inherently preserves data privacy by avoiding historical sample storage. Extensive experiments on multiple dynamic graph classification benchmarks demonstrate that AL GNN achieves competitive or superior performance compared to existing methods. For instance, it improves average performance by 10% on CoraFull and reduces forgetting by over 30% on Reddit, while also reducing training time by nearly 50% due to its backpropagation free design.

2512.22560 2026-06-16 cs.DC cs.AI cs.LG 版本更新

RollArt: Disaggregated Multi-Task Agentic RL Training at Scale

RollArt: 可分解的多任务智能体强化学习规模化训练

Wei Gao, Yuheng Zhao, Tianyuan Wu, Shaopan Xiong, Weixun Wang, Dakai An, Lunxi Cao, Dilxat Muhtar, Zichen Liu, Haizhou Zhao, Ju Huang, Siran Yang, Yongbin Li, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng, Wei Wang

发表机构 * HKUST(香港科技大学) Alibaba Group(阿里巴巴集团) Tongyi Lab, Alibaba(阿里云实验室)

AI总结 提出RollArt系统,通过将强化学习流水线分解到异构硬件上,实现多任务智能体RL的高效训练,相比现有系统减少1.31-2.05倍训练时间。

Comments 19 pages, 15 figures

详情
AI中文摘要

智能体强化学习通过与环境的多轮交互训练大语言模型,产生混合计算密集型预填充、带宽密集型解码、CPU密集型环境执行和突发性奖励评估的工作负载。现有系统要么将所有阶段共置于单一GPU集群,要么仅以粗粒度解耦,忽视了硬件异构性并导致阶段间大量同步开销。我们提出ROLLART,一个在可分解基础设施上的多任务智能体RL系统。ROLLART将每个流水线阶段映射到最合适的硬件:将预填充密集型任务路由到计算优化GPU,解码密集型任务路由到带宽优化GPU,环境任务路由到CPU集群。它在轨迹级别解耦生成,使得生成、环境交互和奖励评分可以独立进行,从而慢速或失败的环境不会阻塞其他任务。ROLLART将无状态奖励计算卸载到无服务器基础设施,并通过有界陈旧性的异步权重同步将生成与训练重叠。结果表明,ROLLART有效提高了训练吞吐量,与各种RL系统相比实现了1.31-2.05倍的训练时间减少。我们还在阿里巴巴集群上使用超过3000个GPU训练了用于Qoder产品的数千亿参数MoE模型,验证了其稳定性和可扩展性。

英文摘要

Agentic Reinforcement Learning (RL) trains LLMs through multi-turn interactions with environments, producing workloads that mix compute-bound prefill, bandwidth-bound decoding, CPU-heavy environment execution, and bursty reward evaluation. Existing systems either colocate all stages on a single GPU cluster or decouple them only at a coarse granularity, overlooking hardware heterogeneity and incurring substantial synchronization overhead across stages. We present ROLLART, a system for multi-task agentic RL on disaggregated infrastructure. ROLLART maps each pipeline stage to best-fit hardware, routing prefill-heavy tasks to compute-optimized GPUs, decode-heavy tasks to bandwidth-optimized GPUs, and environments to CPU clusters. It decouples rollout at the trajectory level, allowing generation, environment interaction, and reward scoring to proceed independently, so that slow or failed environments never block the others. ROLLART offloads stateless reward computation to serverless infrastructure and overlaps rollout with training via staleness-bounded asynchronous weight synchronization. Our results demonstrate that ROLLART effectively improves training throughput and achieves 1.31--2.05 \(\times\) training time reduction compared to various RL systems. We also evaluated ROLLART by training a hundreds-of-billions-parameter MoE model for Qoder product on an Alibaba cluster with above 3,000 GPUs, demonstrating its stability and scalability.

2601.11219 2026-06-16 cs.LG cs.AI 版本更新

SDFLoRA: Selective Decoupled Federated LoRA for Privacy-preserving Fine-tuning with Heterogeneous Clients

SDFLoRA: 面向异构客户隐私保护微调的选择性解耦联邦LoRA

Zhikang Shen, Jianrong Lu, Haiyuan Wan, Jianhai Chen

发表机构 * Zhejiang University(浙江大学) Tsinghua University(清华大学)

AI总结 提出SDFLoRA,通过将LoRA更新解耦为共享和私有组件,仅聚合共享部分并注入差分隐私噪声,解决联邦微调中的秩异构和数据异构问题,提升隐私-效用权衡。

详情
AI中文摘要

联邦学习(FL)用于大型语言模型(LLM)作为在分布式数据上适应模型的隐私保护方法日益受到关注,其中低秩适应(LoRA)等参数高效方法被广泛采用以降低通信和内存成本。然而,实际部署通常表现出秩和数据异构性:客户端在不同的低秩预算和数据分布下运行,使得LoRA更新的直接聚合存在偏差且不稳定。现有方法要么强制统一秩,要么将异构更新对齐到单个共享子空间,这往往会混合可迁移和客户端特定的方向,从而损害个性化。此外,在差分隐私(DP)下,扰动这种结构混合的更新会向本应保持纯局部的方向注入噪声,导致不必要的效用下降。为了解决这些问题,我们提出了选择性解耦联邦LoRA(SDFLoRA),一种结构感知的LoRA框架,将每个客户端更新解耦为用于聚合的共享组件和保留客户端特定语义的私有组件。只有共享组件参与子空间对齐,而私有组件保持本地且不通信,使得训练与DP兼容并在秩异构下稳定聚合。通过仅向聚合的可共享更新注入噪声,该方法避免了对局部方向的扰动,并改善了效用-隐私权衡。在多个基准上的实验表明,SDFLoRA优于联邦LoRA基线,并实现了强大的效用-隐私权衡。

英文摘要

Federated learning (FL) for large language models (LLMs) has attracted increasing attention as a privacy-preserving approach for adapting models over distributed data, where parameter-efficient methods such as Low-Rank Adaptation (LoRA) are widely adopted to reduce communication and memory costs. However, practical deployments often exhibit rank and data heterogeneity: clients operate under different low-rank budgets and data distributions, making direct aggregation of LoRA updates biased and unstable. Existing approaches either enforce a unified rank or align heterogeneous updates into a single shared subspace, which tends to mix transferable and client-specific directions and consequently undermines personalization. Moreover, under differential privacy (DP), perturbing such structurally mixed updates injects noise into directions that should remain purely local, leading to unnecessary utility degradation. To address these issues, we propose Selective Decoupled Federated LoRA (SDFLoRA), a structure-aware LoRA framework that decouples each client update into a shared component for aggregation and a private component that preserves client-specific semantics. Only the shared component participates in subspace alignment, while the private component remains local and uncommunicated, making the training DP-compatible and stabilizing aggregation under rank heterogeneity. By injecting noise only into the aggregated shareable update, this approach avoids perturbations to local directions and improves the utility-privacy trade-off. Experiments on multiple benchmarks demonstrate that SDFLoRA outperforms federated LoRA baselines and achieves a strong utility-privacy trade-off.

2601.16509 2026-06-16 cs.LG cs.AI 版本更新

Adaptive $k$NN graph model

自适应 $k$NN 图模型

Jiaye Li, Hang Xu, Shichao Zhang

发表机构 * The State Key Laboratory of Blockchain and Data Security(区块链与数据安全国家重点实验室) Zhejiang University(浙江大学) The School of Computer Science and Engineering(计算机科学与工程学院) Central South University(中南大学) School of Computer Science and Engineering(计算机科学与工程学院) Guangxi Normal University(广西师范大学)

AI总结 提出一种基于分层可导航小世界图与预计算投票机制的自适应图模型,将邻居选择与加权的计算负担转移到训练阶段,在保持分类精度的同时实现实时推理速度。

Comments 31 pages, 5 figures

详情
AI中文摘要

$k$ 近邻 ($k$NN) 算法是人工智能中非参数分类的基石,但其在大规模应用中的部署始终受到推理速度与准确性之间计算权衡的限制。现有的近似最近邻解决方案加速了检索,但往往降低了分类精度,并且缺乏选择最优邻域大小 ($k$) 的自适应性。本文提出了一种自适应图模型,将推理延迟与计算复杂度解耦。通过将分层可导航小世界 (HNSW) 图与预计算投票机制相结合,我们的框架将邻居选择和加权的计算负担完全转移到训练阶段。在这种拓扑结构中,较高的图层次实现快速导航,而较低的层次则通过自适应邻居数量编码精确的、节点特定的决策边界。在六个不同数据集上与八种最先进基线进行基准测试,我们证明了该架构显著加速了推理速度,实现了实时性能,且不牺牲分类精度。这些发现为 $k$NN 固有的推理瓶颈提供了可扩展、鲁棒的解决方案,为基于图的非参数学习奠定了自适应的结构基础。

英文摘要

The $k$-nearest neighbors ($k$NN) algorithm is a cornerstone of non-parametric classification in artificial intelligence, yet its deployment in large-scale applications is persistently constrained by the computational trade-off between inference speed and accuracy. Existing approximate nearest neighbor solutions accelerate retrieval but often degrade classification precision and lack adaptability in selecting the optimal neighborhood size ($k$). Here, we present an adaptive graph model that decouples inference latency from computational complexity. By integrating a Hierarchical Navigable Small World (HNSW) graph with a pre-computed voting mechanism, our framework completely transfers the computational burden of neighbor selection and weighting to the training phase. Within this topological structure, higher graph layers enable rapid navigation, while lower layers encode precise, node-specific decision boundaries with adaptive neighbor counts. Benchmarking against eight state-of-the-art baselines across six diverse datasets, we demonstrate that this architecture significantly accelerates inference speeds, achieving real-time performance, without compromising classification accuracy. These findings offer a scalable, robust solution to the inherent inference bottleneck of $k$NN, laying an adaptive structural foundation for graph-based nonparametric learning.

2602.11550 2026-06-16 cs.LG cs.AI 版本更新

TS-Memory: Plug-and-Play Memory for Time Series Foundation Models

TS-Memory: 时间序列基础模型的即插即用记忆模块

Sisuo Lyu, Siru Zhong, Tiegang Chen, Weilin Ruan, Qingxiang Liu, Taiqiang Lv, Qingsong Wen, Raymond Chi-Wing Wong, Yuxuan Liang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) Tencent(腾讯) Squirrel Ai Learning The Hong Kong University of Science and Technology(香港科学与技术大学)

AI总结 提出参数化记忆蒸馏方法TS-Memory,通过轻量级记忆适配器增强冻结的时间序列基础模型,在分布偏移下实现无检索的高效零样本预测,显著提升点预测和概率预测性能。

详情
AI中文摘要

时间序列基础模型(TSFMs)通过大规模预训练实现了强大的零样本预测,但在分布偏移下将其适应到下游领域仍然具有挑战性。现有解决方案面临权衡:参数化适应可能导致灾难性遗忘,并需要昂贵的多领域维护,而非参数化检索虽然改善了预测,但由于数据存储搜索导致高推理延迟。我们提出了参数化记忆蒸馏,并将其实现为TS-Memory,一种增强冻结TSFMs的轻量级记忆适配器。TS-Memory分两个阶段训练。首先,我们构建一个离线、检索泄漏安全的kNN教师,从检索到的未来中合成置信度感知的分位数目标。其次,我们通过置信度门控监督将该检索诱导的分布校正蒸馏到轻量级记忆适配器中。在推理过程中,TS-Memory以常数时间开销融合记忆和骨干预测,实现无检索部署。在多种TSFMs和基准上的实验表明,与代表性的适应方法相比,在点预测和概率预测上均有一致的改进,效率与冻结骨干相当。代码:此 https URL。

英文摘要

Time Series Foundation Models (TSFMs) achieve strong zero-shot forecasting through large-scale pre-training, but adapting them to downstream domains under distribution shift remains challenging. Existing solutions face a trade-off: Parametric Adaptation can cause catastrophic forgetting and requires costly multi-domain maintenance, while Non-Parametric Retrieval improves forecasts but incurs high inference latency due to datastore search. We propose Parametric Memory Distillation and implement it as TS-Memory, a lightweight memory adapter that augments frozen TSFMs. TS-Memory is trained in two stages. First, we construct an offline, retrieval-leakage-safe kNN teacher that synthesizes confidence-aware quantile targets from retrieved futures. Second, we distill this retrieval-induced distributional correction into a lightweight memory adapter via confidence-gated supervision. During inference, TS-Memory fuses memory and backbone predictions with constant-time overhead, enabling retrieval-free deployment. Experiments across diverse TSFMs and benchmarks demonstrate consistent improvements in both point and probabilistic forecasting over representative adaptation methods, with efficiency comparable to the frozen backbone. Code: https://github.com/sisuolv/TS-Memory.

2602.22422 2026-06-16 cs.LG cs.AI 版本更新

Revisiting Chebyshev Polynomial and Anisotropic RBF Models for Tabular Regression

重新审视切比雪夫多项式和各向异性RBF模型在表格回归中的应用

Luciano Gerber, Huw Lloyd

发表机构 * Department of Computing and Mathematics, Manchester Metropolitan University(计算与数学系,曼彻斯特 Metropolitan 大学)

AI总结 本文在55个数据集上基准测试切比雪夫多项式回归器、各向异性RBF网络和平滑树混合模型,发现平滑模型在CPU可行模型中与树集成准确率相当且泛化差距更小,建议将其纳入候选池。

Comments 46 pages, 6 figures, 21 tables. Under review at Knowledge-Based Systems

详情
AI中文摘要

平滑基模型如切比雪夫多项式回归器和径向基函数(RBF)网络在数值分析中已得到充分确立。它们的连续可微预测表面适用于代理优化、敏感性分析以及其他响应随输入逐渐变化的环境。尽管具有这些特性,平滑模型在树集成主导的表格回归中很少出现。我们探究它们是否能够竞争,跨55个按应用领域组织的回归数据集对模型进行基准测试。我们开发了一种各向异性RBF网络,具有数据驱动的中心放置和基于梯度的宽度优化,一个岭正则化的切比雪夫多项式回归器,以及一个平滑树混合模型(切比雪夫模型树);这三个模型均作为scikit-learn兼容包发布。我们将这些模型与树集成、预训练transformer和标准基线进行基准测试,评估准确性和泛化行为。transformer在大多数数据集上准确率排名第一,但其GPU依赖性、推理延迟和数据集大小限制制约了其在应用科学和工业中常见的基于CPU环境中的部署。在CPU可行的模型中,平滑模型和树集成在准确率上统计上持平,但前者倾向于表现出更紧的泛化差距。我们建议常规地将平滑基模型纳入候选池,特别是当下游使用受益于更紧的泛化和逐渐变化的预测时。

英文摘要

Smooth-basis models such as Chebyshev polynomial regressors and radial basis function (RBF) networks are well established in numerical analysis. Their continuously differentiable prediction surfaces suit surrogate optimisation, sensitivity analysis, and other settings where the response varies gradually with inputs. Despite these properties, smooth models seldom appear in tabular regression, where tree ensembles dominate. We ask whether they can compete, benchmarking models across 55 regression datasets organised by application domain. We develop an anisotropic RBF network with data-driven centre placement and gradient-based width optimisation, a ridge-regularised Chebyshev polynomial regressor, and a smooth-tree hybrid (Chebyshev model tree); all three are released as scikit-learn-compatible packages. We benchmark these against tree ensembles, a pre-trained transformer, and standard baselines, evaluating accuracy alongside generalisation behaviour. The transformer ranks first on accuracy across a majority of datasets, but its GPU dependence, inference latency, and dataset-size limits constrain deployment in the CPU-based settings common across applied science and industry. Among CPU-viable models, smooth models and tree ensembles are statistically tied on accuracy, but the former tend to exhibit tighter generalisation gaps. We recommend routinely including smooth-basis models in the candidate pool, particularly when downstream use benefits from tighter generalisation and gradually varying predictions.

2603.03417 2026-06-16 cs.CR cs.AI 版本更新

Parallel Test-Time Scaling with Multi-Sequence Verifiers

并行测试时扩展与多序列验证器

Yegon Kim, Seungyoo Lee, Chaeyun Jang, Hyungi Lee, Juho Lee

发表机构 * Graduate School of AI, KAIST(人工智能研究生院,韩国科学技术院)

AI总结 提出多序列验证器(MSV),通过条件化候选集预测正确性,改善校准性,提升最佳选择准确率并实现早停策略,在数学推理任务中以不到一半延迟达到相同精度。

详情
AI中文摘要

并行测试时扩展(为单个问题生成多个候选解)是提升大语言模型性能的强大技术。然而,它受到两个关键瓶颈的阻碍:从候选池中准确选择正确的解,以及生成大量完整解带来的高推理延迟。我们认为这两个挑战从根本上与验证器的校准性相关,因为校准良好的验证器能改进答案选择,并支持早停策略以减少延迟。然而,现有的非生成式验证器存在局限性,因为它们孤立地评分每个候选,忽略了候选集之间的丰富上下文信息。为解决这一问题,我们引入了多序列验证器(MSV),这是一种轻量级验证器,它基于完整采样集的条件来预测每个候选的正确性。MSV实现了改进的校准性,这直接增强了最佳N选择性能,并赋能了一种新颖的早停框架。在具有挑战性的数学推理基准测试中,相对于强基线,MSV将最佳64选1的准确率提升了高达6%,并且在早停设置下,以不到一半的延迟达到了与基线相同的准确率。

英文摘要

Parallel test-time scaling, which generates multiple candidate solutions for a single problem, is a powerful technique for improving large language model performance. However, it is hindered by two key bottlenecks: accurately selecting the correct solution from the candidate pool, and the high inference latency from generating many full solutions. We argue that both challenges are fundamentally linked to verifier calibration, as a well-calibrated verifier improves answer selection and enables early-stopping strategies to reduce latency. However, existing non-generative verifiers are limited as they score each candidate in isolation, overlooking rich contextual information across the set of candidates. To address this, we introduce the Multi-Sequence Verifier (MSV), a lightweight verifier that predicts each candidate's correctness conditioned on the full sampled set. MSV achieves improved calibration, which directly enhances best-of-N selection performance and empowers a novel early-stopping framework. Across challenging mathematical reasoning benchmarks, MSV improves best-of-64 accuracy by up to 6\% relative to strong baselines, and in the early-stopping setting reaches the same accuracy as baselines with less than half the latency.

2603.17353 2026-06-16 cs.LG cs.AI 版本更新

Learning Permutation Distributions via Reflected Diffusion on Ranks

通过秩上的反射扩散学习排列分布

Sizhuang He, Yangtian Zhang, Shiyang Zhang, David van Dijk

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Soft-Rank Diffusion框架,通过将排列松弛为软秩实现平滑扩散,并引入上下文广义Plackett-Luce去噪器,在排序和组合优化任务上优于现有扩散方法。

Comments 18 pages including the appendix, 7 figures, 9 tables, Accepted at ICML 2026

详情
AI中文摘要

有限对称群 S_n 为排列提供了自然域,但由于其阶乘增长的大小和离散、非欧几里得结构,在 S_n 上学习概率分布具有挑战性。最近的排列扩散方法通过基于洗牌的随机游走(例如,riffle shuffles)定义前向加噪,并使用 Plackett-Luce (PL) 变体学习反向转移,但由此产生的轨迹可能很突兀,并且随着 n 的增长,去噪变得越来越困难。我们提出 Soft-Rank Diffusion,一种离散扩散框架,用结构化的软秩前向过程取代基于洗牌的破坏:通过将离散秩松弛为软秩,将排列提升到连续的潜在表示,从而产生更平滑、更易处理的轨迹。对于反向过程,我们引入了上下文广义 Plackett-Luce (cGPL) 去噪器,它推广了先前的 PL 风格参数化,并提高了序列决策结构的表达能力。在排序和组合优化基准上的实验表明,Soft-Rank Diffusion 始终优于先前的扩散基线,在长序列和内在序列设置中尤其有显著优势。

英文摘要

The finite symmetric group S_n provides a natural domain for permutations, yet learning probability distributions on S_n is challenging due to its factorially growing size and discrete, non-Euclidean structure. Recent permutation diffusion methods define forward noising via shuffle-based random walks (e.g., riffle shuffles) and learn reverse transitions with Plackett-Luce (PL) variants, but the resulting trajectories can be abrupt and increasingly hard to denoise as n grows. We propose Soft-Rank Diffusion, a discrete diffusion framework that replaces shuffle-based corruption with a structured soft-rank forward process: we lift permutations to a continuous latent representation of order by relaxing discrete ranks into soft ranks, yielding smoother and more tractable trajectories. For the reverse process, we introduce contextualized generalized Plackett-Luce (cGPL) denoisers that generalize prior PL-style parameterizations and improve expressivity for sequential decision structures. Experiments on sorting and combinatorial optimization benchmarks show that Soft-Rank Diffusion consistently outperforms prior diffusion baselines, with particularly strong gains in long-sequence and intrinsically sequential settings.

2604.02343 2026-06-16 cs.LG cs.AI cs.IT math.IT 版本更新

Haiku to Opus in Just 10 bits: LLMs Unlock Large Compression Gains

仅用10比特从俳句到巨作:LLMs解锁巨大压缩增益

Roy Rinberg, Annabelle Michael Carrell, Simon Henniger, Nicholas Carlini, Keri Warr

发表机构 * Harvard University(哈佛大学) University of Cambridge(剑桥大学) Anthropic

AI总结 研究LLM生成文本的无损和有损压缩,提出问答压缩(QA)交互协议,用少量二进制问题实现超100倍压缩比,高效传递知识。

详情
AI中文摘要

我们研究了LLM生成文本在无损和有损场景下的压缩,刻画了一个压缩-计算边界,其中更多的压缩需要更多的计算。对于无损压缩,领域适应的LoRA适配器可以将基于LLM的算术编码的压缩比提高2倍,相对于仅使用基础LLM的压缩。对于有损压缩,提示模型进行简洁重写然后应用算术编码可以实现约0.03的压缩比,比压缩原始响应提高2倍。我们进一步引入了问答压缩(QA),一种受游戏“二十个问题”启发的交互式有损协议。一个小模型通过向更强模型提问是/否问题来迭代优化其响应,每个答案恰好传输1比特。在涵盖数学、科学和代码的8个基准测试中,10个二进制问题恢复了小模型和大模型在标准基准上能力差距的23%到72%,在更难的基准上恢复了7%到38%,实现了0.0006到0.004的压缩比。这比之前基于LLM的压缩(Deletang等人,2024)小100倍以上,表明交互式协议可以比传输完整响应更高效地传递知识。

英文摘要

We study the compression of LLM-generated text across lossless and lossy regimes, characterizing a compression-compute frontier where more compression is possible at the cost of more compute. For lossless compression, domain-adapted LoRA adapters can improve LLM-based arithmetic coding by 2x over compression with the base LLM alone. For lossy compression, prompting a model for a succinct rewrite then applying arithmetic coding can achieve compression ratios of approximately 0.03, a 2x improvement over compressing the original response. We further introduce Question-Asking compression (QA), an interactive lossy protocol inspired by the game 'Twenty Questions'. A small model iteratively refines its response by asking yes/no questions to a stronger model, transferring exactly one bit per answer. On 8 benchmarks spanning math, science, and code, 10 binary questions recover 23% to 72% of the capability gap between a small and large model on standard benchmarks and 7% to 38% on harder benchmarks, achieving compression ratios of 0.0006 to 0.004. This is over 100x smaller than prior LLM-based compression (Deletang et al., 2024), suggesting that interactive protocols can transfer knowledge far more efficiently than transmitting full responses.

2604.03472 2026-06-16 cs.CL cs.AI 版本更新

Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

词汇丢弃:LLM共同进化中的课程多样性

Jacob Dineen, Aswin RRV, Zhikun Xu, Ben Zhou

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 针对LLM共同进化中问题多样性崩溃的问题,提出词汇丢弃机制,通过在策略训练和课程生成时随机掩码输出logits维持多样性,在数学推理任务上提升求解器性能平均+4.4点。

详情
AI中文摘要

共同进化自我对弈,其中一个语言模型生成问题,另一个求解,有望在没有人类监督的情况下实现自主课程学习。在实践中,提议者迅速收敛到满足奖励函数的狭窄问题分布。这种多样性崩溃使得课程对求解者无信息量,从而停滞共同进化循环。我们引入词汇丢弃,一种在策略训练和课程生成期间应用于提议者输出logits的随机掩码,作为维持多样性的轻量级机制。该掩码是硬性的且非平稳的,防止提议者锁定在固定的token序列上。通过R-Zero在数学推理上训练Qwen3-4B和Qwen3-8B,我们发现词汇丢弃在整个训练过程中在词汇、语义和功能指标上维持了提议者的多样性。它还带来了求解器性能的提升,在8B规模上平均提高+4.4点,在竞赛级基准上增益最大。我们的发现表明,显式的动作空间约束,类似于经典自我对弈中游戏规则的结构性作用,可以帮助维持语言中的生产性共同进化。词汇丢弃是该原则的一个简单实例。

英文摘要

Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training. It also yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.

2604.13085 2026-06-16 cs.LG cs.AI 版本更新

Adaptive Memory Crystallization for Autonomous AI Agent Learning in Dynamic Environments

自适应记忆结晶:动态环境中自主AI智能体学习

Rajat Khanda, Mohammad Baqar, Sambuddha Chakrabarti, Satyasaran Changdar

发表机构 * GitHub

AI总结 提出自适应记忆结晶(AMC)架构,基于突触标记与捕获理论,通过三阶段记忆层次和随机微分方程实现持续强化学习,在多个基准上显著提升前向迁移、减少灾难性遗忘并降低内存占用。

详情
AI中文摘要

在动态环境中运行的自主AI智能体面临一个持续挑战:在不遗忘先前知识的情况下获取新能力。我们提出自适应记忆结晶(AMC),一种用于持续强化学习中渐进式经验巩固的记忆架构。AMC在概念上受突触标记与捕获(STC)理论的定性结构启发,即记忆经历离散的稳定阶段,但不声称模拟潜在的分子或突触机制。AMC将记忆建模为一个连续的结晶过程,其中经验根据多目标效用信号从可塑状态迁移到稳定状态。该框架引入了一个三阶段记忆层次(液态-玻璃态-晶态),由伊藤随机微分方程(SDE)控制,其群体行为由显式的福克-普朗克方程描述,该方程具有封闭形式的贝塔平稳分布。我们提供了以下证明:(i)结晶SDE的适定性和全局收敛到唯一的贝塔平稳分布;(ii)单个结晶状态指数收敛到其固定点,具有显式速率和方差界;(iii)端到端Q学习误差界和匹配的记忆容量下界,将SDE参数直接与智能体性能联系起来。在Meta-World MT50、Atari 20游戏序列学习和MuJoCo持续运动上的实证评估一致显示,前向迁移提高了34-43%(相对于最强基线),灾难性遗忘减少了67-80%,内存占用减少了62%。

英文摘要

Autonomous AI agents operating in dynamic environments face a persistent challenge: acquiring new capabilities without erasing prior knowledge. We present Adaptive Memory Crystallization (AMC), a memory architecture for progressive experience consolidation in continual reinforcement learning. AMC is conceptually inspired by the qualitative structure of synaptic tagging and capture (STC) theory, the idea that memories transition through discrete stability phases, but makes no claim to model the underlying molecular or synaptic mechanisms. AMC models memory as a continuous crystallization process in which experiences migrate from plastic to stable states according to a multi-objective utility signal. The framework introduces a three-phase memory hierarchy (Liquid--Glass--Crystal) governed by an Itô stochastic differential equation (SDE) whose population-level behavior is captured by an explicit Fokker--Planck equation admitting a closed-form Beta stationary distribution. We provide proofs of: (i) well-posedness and global convergence of the crystallization SDE to a unique Beta stationary distribution; (ii) exponential convergence of individual crystallization states to their fixed points, with explicit rates and variance bounds; and (iii) end-to-end Q-learning error bounds and matching memory-capacity lower bounds that link SDE parameters directly to agent performance. Empirical evaluation on Meta-World MT50, Atari 20-game sequential learning, and MuJoCo continual locomotion consistently shows improvements in forward transfer (+34--43\% over the strongest baseline), reductions in catastrophic forgetting (67--80\%), and a 62\% decrease in memory footprint.

2604.25853 2026-06-16 cs.CL cs.AI cs.LG 版本更新

G-Loss: Graph-Guided Fine-Tuning of Language Models

G-Loss:图引导的语言模型微调

Aditya Sharma, Vinti Agarwal, Rajesh Kumar

发表机构 * BITS Pilani(BITS 派拉尼) Bucknell University(巴克内尔大学)

AI总结 提出G-Loss损失函数,通过构建文档相似度图并利用半监督标签传播捕捉全局语义结构,引导语言模型学习更具判别性和鲁棒性的嵌入,在多个分类任务上提升准确率并加速收敛。

Comments 20 pages, Learning on Graphs (LoG2025)

详情
AI中文摘要

用于微调预训练语言模型(如BERT)的传统损失函数,包括交叉熵、对比损失、三元组损失和监督对比损失,仅在局部邻域内操作,未能考虑全局语义结构。我们提出了G-Loss,一种图引导的损失函数,它结合半监督标签传播来利用嵌入流形中的结构关系。G-Loss构建了一个文档相似度图,捕捉全局语义关系,从而引导模型学习更具判别性和鲁棒性的嵌入。我们在五个涵盖关键下游分类任务的基准数据集上评估了G-Loss:MR(情感分析)、R8和R52(主题分类)、Ohsumed(医学文档分类)和20NG(新闻分类)。在大多数实验设置中,G-Loss收敛更快,并产生语义一致的嵌入空间,从而比使用传统损失函数微调的模型获得更高的分类准确率。

英文摘要

Traditional loss functions, including cross-entropy, contrastive, triplet, and su pervised contrastive losses, used for fine-tuning pre-trained language models such as BERT, operate only within local neighborhoods and fail to account for the global semantic structure. We present G-Loss, a graph-guided loss function that incorporates semi-supervised label propagation to use structural relationships within the embedding manifold. G-Loss builds a document-similarity graph that captures global semantic relationships, thereby guiding the model to learn more discriminative and robust embeddings. We evaluate G-Loss on five benchmark datasets covering key downstream classification tasks: MR (sentiment analysis), R8 and R52 (topic categorization), Ohsumed (medical document classification), and 20NG (news categorization). In the majority of experimental setups, G-Loss converges faster and produces semantically coherent embedding spaces, resulting in higher classification accuracy than models fine-tuned with traditional loss functions.

2605.06734 2026-06-16 cs.LG cs.AI quant-ph 版本更新

Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning

门控QKAN-FWP:可扩展的量子启发序列学习

Kuo-Chung Peng, Samuel Yen-Chi Chen, Jiun-Cheng Jiang, Chen-Yu Liu, En-Jui Kuo, Yun-Yuan Wang, Prayag Tiwari, Andrea Ceschini, Chi-Sheng Chen, Yu-Chao Hsu, Chun-Hua Lin, Tai-Yue Li, Antonello Rosato, Massimo Panella, Simon See, Saif Al-Kuwari, Kuan-Cheng Chen, Nan-Yow Chen, Hsi-Sheng Goan

发表机构 * Department of Physics and Center for Theoretical Physics, National Taiwan University(物理系与理论物理中心,国立台湾大学) National Center for High-Performance Computing, National Institutes of Applied Research(高性能计算国家中心,应用研究国家机构) Wells Fargo, New York, NY, USA(摩根大通银行,纽约,纽约州,美国) NVIDIA AI Technology Center, NVIDIA Corp., Taipei, Taiwan(NVIDIA AI技术中心,NVIDIA公司,台北,台湾) Center for Quantum Science and Engineering, National Taiwan University(量子科学与工程中心,国立台湾大学) Graduate Institute of Applied Physics, National Taiwan University(应用物理研究所,国立台湾大学) Department of Electrophysics, National Yang Ming Chiao Tung University(电子物理系,国立阳明交通大学) School of Information Technology, Halmstad University(信息科技学院,哈尔姆斯塔德大学) Department of Information Engineering, Electronics and Telecommunications (DIET), University of Rome “La Sapienza”, Rome, Italy(信息工程、电子与电信系(DIET),罗马“拉·索拉维亚”大学,罗马,意大利) Beth Israel Deaconess Medical Center & Harvard Medical School(贝瑟尔以色列德acons医疗中心及哈佛医学院) Cross College Elite Program, National Cheng Kung University(跨学院精英计划,国立成功大学)

AI总结 提出门控QKAN-FWP框架,融合快速权重编程与量子启发KAN,使用单量子比特数据重上传电路作为非线性激活,引入标量门控更新规则,在时间序列基准、MiniGrid强化学习和太阳周期预测中优于经典循环模型,并在NISQ设备上验证了可行性。

Comments 46 pages, 13 figures, 10 tables

详情
AI中文摘要

快速权重编程器(FWP)通过动态更新的参数而非循环隐藏状态来编码时间依赖关系。量子FWP(QFWP)使用变分量子电路(VQC)扩展了这一思想,但现有实现依赖于多量子比特架构,在噪声中等规模量子(NISQ)设备上难以扩展,且经典模拟成本高昂。我们提出了门控QKAN-FWP,一种将FWP与量子启发Kolmogorov-Arnold网络(QKAN)相结合的快速权重框架,使用单量子比特数据重上传电路作为可学习非线性激活,称为数据重上传激活(DARUAN)。我们进一步引入了一种标量门控快速权重更新规则,稳定参数演化,并对其自适应记忆核、几何有界性和可并行梯度路径进行了理论分析。我们在时间序列基准、MiniGrid强化学习上评估了该框架,并以实际太阳周期预测作为主要实际结果。在528个月输入窗口和132个月预测水平的长时域设置中,我们的12.5k参数模型实现了比一系列经典循环基线(参数最多达13倍)更低的缩放均方误差(MSE)、峰值幅度误差和峰值时间误差,这些基线包括长短期记忆网络(LSTM)(25.9k-89.1k参数)、WaveNet-LSTM(167k)、普通循环神经网络(11.5k)和改进的echo state网络(132k)。为了验证NISQ兼容性,我们进一步在IonQ和IBM量子处理器上部署了训练好的快速编程器,在1024次测量下恢复了与无噪声模拟器相对MSE在0.1%以内的预测精度。这些结果使门控QKAN-FWP成为一种可扩展、参数高效且NISQ兼容的量子启发序列建模方法。

英文摘要

Fast Weight Programmers (FWPs) encode temporal dependencies through dynamically updated parameters rather than recurrent hidden states. Quantum FWPs (QFWPs) extend this idea with variational quantum circuits (VQCs), but existing implementations rely on multi-qubit architectures that are difficult to scale on noisy intermediate-scale quantum (NISQ) devices and expensive to simulate classically. We propose gated QKAN-FWP, a fast-weight framework that integrates FWP with Quantum-inspired Kolmogorov-Arnold Network (QKAN) using single-qubit data re-uploading circuits as learnable nonlinear activation, known as DatA Re-Uploading ActivatioN (DARUAN). We further introduce a scalar-gated fast-weight update rule that stabilizes parameter evolution, supported by a theoretical analysis of its adaptive memory kernel, geometric boundedness, and parallelizable gradient paths. We evaluate the framework across time-series benchmarks, MiniGrid reinforcement learning, and highlight real-world solar cycle forecasting as our main practical result. In the long-horizon setting with 528-month input window and 132-month forecast horizon, our 12.5k-parameter model achieves lower scaled Mean Square Error (MSE), peak amplitude error, and peak timing error than a suite of classical recurrent baselines with up to 13x more parameters, including Long Short-Term Memory (LSTM) networks (25.9k-89.1k parameters), WaveNet-LSTM (167k), Vanilla recurrent neural network (11.5k), and a Modified Echo State Network (132k). To validate NISQ compatibility, we further deploy the trained fast programmer on IonQ and IBM Quantum processors, recovering forecasting accuracy within 0.1% relative MSE of the noiseless simulator at 1024 shots. These results position gated QKAN-FWP as a scalable, parameter-efficient, and NISQ-compatible approach to quantum-inspired sequence modeling.

2605.18324 2026-06-16 cs.CV cs.AI cs.GR cs.LG stat.ML 版本更新

Improved Baselines with Representation Autoencoders

改进的基于表示自动编码器的基线

Jaskirat Singh, Boyang Zheng, Zongze Wu, Richard Zhang, Eli Shechtman, Saining Xie

发表机构 * Adobe Research(Adobe研究院) ANU(澳大利亚国立大学) New York University(纽约大学)

AI总结 本文研究了基于表示自动编码器(RAE)的设计选择,发现三个见解,简化并改进了RAE。首先,研究了一种通用公式,将表示定义为最后k个编码器层的总和,而不是仅最终层。其次,研究了RAE与表示对齐(REPA)的假设,发现两者具有互补的工作机制。最后,改进了RAE在无分类器指导(CFG)中的表现,通过重新参数化DiT模型输出,实现了无需训练第二个模型的指导效果。RAEv2在ImageNet-256上达到了1.06的gFID,且训练效率显著提高。

详情
AI中文摘要

Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for

英文摘要

Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr6, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EPFID@k (epochs to reach unguided gFID < k) as a measure of training efficiency. RAEv2 attains an EPFID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. The code is available at https://raev2.github.io.

2605.21850 2026-06-16 cs.CL cs.AI 版本更新

ACC: Compiling Agent Trajectories for Long-Context Training

ACC:用于长上下文训练的代理轨迹编译

Qisheng Su, Zhen Fang, Shiting Huang, Yu Zeng, Yiming Zhao, Kou Shi, Ziao Zhang, Lin Chen, Zehui Chen, Lijun Wu, Feng Zhao

发表机构 * MoE Key Lab of BIPC, University of Science and Technology of China(中科院大学科学技术大学MoE关键实验室) Shanghai Innovation Institute(上海创新研究院) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 本文提出ACC,一种将代理轨迹编译为长上下文问答对的方法,通过整合多轮交互中的工具响应和环境观察,提升大语言模型的长上下文推理能力。

详情
AI中文摘要

近期代理的发展重新激发了对LLM长上下文推理能力的需求。然而,训练LLM具备这种能力需要耗费成本的长文档整理或启发式上下文合成。我们发现,当代理解决问题时,会产生大量轨迹,涉及调用工具和接收环境观察,这些证据分散在多个回合中,需要整合远距离上下文片段。然而,标准代理SFT会屏蔽工具响应,仅训练回合级工具选择,导致监督盲区,使这些分散的信号无法被利用。我们提出Agent Context Compilation (ACC),将搜索、软件工程和数据库查询代理的轨迹转换为长上下文QA对,结合原始问题与多回合收集的工具响应和环境观察,训练模型直接回答而不使用工具。这使问题与证据之间的依赖关系显式化,使模型能够直接监督长上下文推理,无需额外标注。ACC是一种简单但有效的做法,可与任何现有的长上下文扩展或训练方法结合,提供可扩展的监督微调数据。我们通过MRCR和GraphWalks长距离依赖建模任务验证了ACC,挑战需要跨回合核心ference解析和图遍历的基准测试。训练Qwen3-30B-A3B使用ACC在MRCR上达到68.3(+18.1),在GraphWalks上达到77.5(+7.6),结果与Qwen3-235B-A22B相当,同时在GPQA、MMLU-Pro、AIME和IFEval上保持通用能力。进一步的机制分析表明,ACC训练的模型表现出任务自适应的注意力重构和专家专业化。

英文摘要

Recent development of agents has renewed demand for long-context reasoning capacity of LLMs. However, training LLMs for this capacity requires costly long-document curation or heuristic context synthesis. We observe that agents produce massive trajectories when solving problems, invoking tools and receiving environment observations across many turns. The evidence needed to answer the original question is thus scattered throughout these turns, requiring integration of distant context segments. Nevertheless, standard agent SFT masks tool responses and only trains turn-level tool selection, creating a supervision blind spot where these scattered signals go unused. We propose Agent Context Compilation (ACC), which converts trajectories from search, software engineering, and database querying agents into long-context QA pairs that combine the original question with tool responses and environment observations gathered across multiple turns, training the model to answer directly without tool use. This makes the dependencies between the question and the evidence explicit, enabling direct supervision of long-context reasoning over distant segments without additional annotation. ACC is a simple but effective approach that can be combined with any existing long-context extension or training method, providing scalable supervised fine-tuning data. We validate ACC on long-range dependency modeling tasks through MRCR and GraphWalks, challenging benchmarks requiring cross-turn coreference resolution and graph traversal over extended contexts. Training Qwen3-30B-A3B with ACC achieves 68.3 on MRCR (+18.1) and 77.5 on GraphWalks (+7.6), results comparable to Qwen3-235B-A22B, while preserving general capabilities on GPQA, MMLU-Pro, AIME, and IFEval. Further mechanism analysis reveals that the ACC-trained model exhibits task-adaptive attention restructuring and expert specialization.

2605.22873 2026-06-16 cs.LG cs.AI cs.CL 版本更新

When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions

LLM何时推理?基于熵相变的动力系统视角

Wei Xia, Haoqing Wang, Zhi-Hong Deng, Yehui Tang

发表机构 * Samsung Research(三星研究院) State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(通用人工智能国家重点实验室,北京理工大学)

AI总结 本文通过早期解码熵动态检测LLM的推理状态,提出轻量级无训练路由框架EDRM,自适应选择推理策略,在减少token消耗的同时提升准确率。

详情
AI中文摘要

链式思维(CoT)推理已成为增强LLM能力的默认策略,但其应用引发了一个基本问题:显式推理何时真正有益?实证证据揭示了一个显著悖论:CoT在事实性和开放式任务上往往带来边际甚至负增益,同时成倍增加token消耗。在这项工作中,我们表明LLM推理不是任务或模型的静态属性,而是在生成过程中涌现的\emph{动态解码状态}。通过系统分析,我们发现早期熵动态提供了这一状态的可靠信号:受益于CoT的任务表现出一致的熵降低,而其他任务则呈现不稳定或增加的模式。这种行为可以解释为从高熵探索状态到低熵结构化推理状态的类相变转变。基于这些见解,我们提出了 extbf{EDRM}(基于熵动态的推理流形),一个轻量级且无需训练的路由框架,利用早期解码熵自适应选择推理策略。EDRM将熵轨迹嵌入到紧凑且可解释的流形表示中,支持零样本部署和细粒度实例级适应。在15个基准测试和4个不同规模与架构的LLM上,EDRM始终优于静态基线。在数据集层面,EDRM实现了 extbf{41--55\%}的token减少,同时仅需50个校准样本即可提高准确率。在实例层面,它进一步将准确率提升高达 extbf{4.7\%},同时保持 extbf{27--45\%}的token节省。这些结果表明,推理应被选择性地调用而非默认使用,并展示了基于熵的解码控制对于高效自适应LLM推理的有效性。

英文摘要

Chain-of-thought (CoT) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental question: when is explicit reasoning actually beneficial? Empirical evidence reveals a striking paradox: CoT often provides marginal or even negative gains on factual and open-ended tasks while multiplying token consumption. In this work, we show that LLM reasoning is not a static property of tasks or models, but a \emph{dynamic decoding state} that emerges during generation. Through systematic analysis, we find early-stage entropy dynamics provide a reliable signal of this state: tasks benefiting from CoT exhibit consistent entropy reduction, while others display unstable or increasing patterns. This behavior can be interpreted as a phase-transition-like shift from a high-entropy exploratory regime to a low-entropy structured reasoning regime. Based on these insights, we propose \textbf{EDRM} (Entropy Dynamics-based Reasoning Manifold), a lightweight and training-free routing framework that leverages early decoding entropy to adaptively select inference strategies. EDRM embeds entropy trajectories into a compact and interpretable manifold representation, enabling both zero-shot deployment and fine-grained instance-level adaptation. Across 15 benchmarks and 4 LLMs of varying scales and architectures, EDRM consistently outperforms static baselines. At the dataset level, EDRM achieves \textbf{41--55\%} token reduction while improving accuracy with as few as 50 calibration samples. At the instance level, it further improves accuracy by up to \textbf{4.7\%} while maintaining \textbf{27--45\%} token savings. These results suggest that reasoning should be invoked selectively rather than by default, and demonstrate the effectiveness of entropy-driven decoding control for efficient and adaptive LLM inference.

2606.01602 2026-06-16 cs.LG cs.AI cs.IT math.IT 版本更新

Estimating Mutual Information between Time Series and Temporal Event Sequences Across Diverse Analysis Tasks

估计时间序列与时间事件序列在不同分析任务中的互信息

Haoji Hu, Huaqing Mao, Yijun Lin, Xiaowei Jia, Jinwei Zhou, Minoh Jeong, Yao-Yi Chiang

发表机构 * University of Minnesota - Twin Cities(明尼苏达大学-双城分校) University of Pittsburgh(匹兹堡大学) Inha University(Inha大学)

AI总结 提出一种非参数互信息估计器,直接度量连续时间序列与离散事件序列之间的依赖关系,无需数据转换或离散化,通过处理量化伪影和事件冗余实现鲁棒统一框架。

详情
AI中文摘要

成对依赖度量(如相关性和因果性)是时间数据挖掘的基础,但目前仍缺乏一种原则性且稳健的方法来量化异构数据类型之间的依赖关系,特别是连续时间序列与离散时间事件序列之间。现有方法依赖于对量化、重复值和事件冗余高度敏感的临时变换或互信息估计器,导致实践中结果有偏或不稳定。我们提出一种非参数互信息估计器,无需数据转换、学习或临时离散化,直接度量时间序列与事件序列之间的依赖关系。我们的方法对真实世界时间序列的连续-离散二元性进行建模,以处理量化和重复值伪影,并引入潜在事件聚类策略以减轻事件共现和冗余带来的偏差。这些共同构成了一个鲁棒且统一的框架,桥接了离散和连续互信息。我们在四个代表性任务上评估了所提出的估计器:用于因果分析的离散-连续时延互信息、全局和局部时间重复发现、用于时间序列预测的离散协变量选择以及用于分类的连续特征选择。在合成和真实世界数据集上的实验表明,在准确性、鲁棒性和可解释性方面,该方法一致优于现有方法,使其成为异构时间数据的通用依赖算子,类似于同质时间序列的皮尔逊相关。代码见:https://github.com/HaojiHu/Multimodal-Temporal-Data-Quantification

英文摘要

Pairwise dependence measures such as correlation and causality are fundamental to temporal data mining, yet there is still no principled and robust way to quantify dependence between heterogeneous data types, especially between continuous time series and discrete temporal event sequences. Existing approaches rely on ad hoc transformations or mutual-information estimators that are highly sensitive to quantization, repeated values, and event redundancy, leading to biased or unstable results in practice. We propose a nonparametric mutual information estimator that directly measures the dependence between time series and event sequences without data transformation, learning, or ad hoc discretization. Our method models the continuous-discrete duality of real-world time series to handle quantization and repeated-value artifacts and introduces a latent event clustering strategy to mitigate bias from event co-occurrence and redundancy. Together, these yield a robust and unified framework that bridges discrete and continuous mutual information. We evaluate the proposed estimator on four representative tasks: discrete-continuous time-delayed mutual information for causality analysis, global and local temporal repetition discovery, discrete covariate selection for time series forecasting, and continuous feature selection for classification. Experiments on synthetic and real-world datasets show consistent improvements over existing methods in accuracy, robustness, and interpretability, positioning our approach as a general-purpose dependence operator for heterogeneous temporal data, similar to Pearson correlation for homogeneous time series. Code available at: https://github.com/HaojiHu/Multimodal-Temporal-Data-Quantification

2606.07082 2026-06-16 cs.LG cs.AI 版本更新

On the Geometry of On-Policy Distillation

论在线策略蒸馏的几何结构

Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung

发表机构 * HKUST(香港科技大学) UT Austin(得克萨斯大学奥斯汀分校) Zhejiang University(浙江大学) Hong Kong PolyU(香港理工大学) USTC(中国科学技术大学) BUPT(北京邮电大学) Nankai University(南开大学) BIT(北京理工大学)

AI总结 本文通过参数空间诊断,揭示在线策略蒸馏(OPD)的更新轨迹具有松弛离主成分、子空间锁定等独特几何特性,表明其并非介于SFT和RLVR之间的中间方法。

Comments 17 pages, 8 figures

详情
AI中文摘要

在线策略蒸馏(OPD)越来越多地被用于改进大型语言模型的推理能力,但其训练动态仍鲜为人知。我们刻画了OPD更新在参数空间中的轨迹,并将其与监督微调(SFT)和可验证奖励强化学习(RLVR)进行了比较。一套参数空间诊断一致地将OPD置于松弛的离主成分区域:与SFT相比,其更新影响更少的权重,并更强烈地避开主方向;而与RLVR相比,其约束更宽松。除了这种静态定位外,OPD还表现出子空间锁定:其累积更新迅速进入一个狭窄的低维通道。将训练限制在早期形成的更新子空间内能保持OPD的性能,但会严重降低SFT,表明该锁定子空间对OPD在功能上是充分的。控制实验进一步表明,稀疏化更新令牌和将rollout生成移至离策略能保持秩动态,而将OPD目标与RLVR混合则会改变它们。总体而言,这些结果表明OPD不仅仅是SFT和RLVR之间的中间点,而是在参数空间中诱导出自身独特的更新几何结构。

英文摘要

On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.

2606.08090 2026-06-16 cs.DB cs.AI 版本更新

Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method

基于LLM的快速语义过滤:从统一框架到自适应两阶段方法

Kyoungmin Kim, Martin Catheland, Anastasia Ailamaki

发表机构 * EPFL(瑞士联邦理工学院)

AI总结 提出自适应两阶段语义过滤框架,结合无模型聚类与在线代理,利用LLM的置信度作为软标签训练代理,并通过稀疏感知校准降低级联成本,在90%准确率目标下速度提升1.6-2.0倍。

详情
AI中文摘要

在文档语料库上评估自然语言的是/否谓词并满足准确率目标——语义过滤——是基于LLM的数据处理的基石。对每个文档调用LLM(即oracle)代价高昂,因此级联方法将oracle与快速代理配对。然而,当前部署存在四个局限性:(1) 每个级联家族——无模型聚类、预构建的小型LLM代理、在线训练的代理——只采用单一表示和流水线,仅在狭窄的查询范围内有效。(2) 最强的在线代理在稠密嵌入的双编码器上采用自定义训练方案,忽略了更丰富谓词所需的token级证据。(3) 代理针对二元是/否标签进行训练,浪费了LLM在边界文档上的逐文档置信度,而这些正是代理最需要学习的。(4) 现有校准添加了统一的安全裕度,将真实的代理不确定性与小样本噪声混为一谈,增加了级联成本。\n我们通过以下方式解决这些问题:(1) 自适应地组合不同家族——首先使用无模型聚类,仅在需要时使用在线代理,并在各阶段共享oracle调用;(2) 用现成的token感知模型的混合替代余弦双编码器;(3) 使用oracle的逐文档置信度作为软标签来训练代理;(4) 采用一种校准方法,仅在标记样本稀疏的地方添加安全裕度。我们也是首次将oracle的逐文档置信度用于三个目的:查询级难度指南针、任何基于代理的级联所需的最小oracle调用次数的下界,以及代理的软训练标签。\n在三个10K文档语料库上,以90%准确率为目标,我们的方法比每个语料库上最佳先前方法快1.6-2.0倍,并在95%的查询上达到目标;基于BER的下界表明未来工作还有约4-20倍的提升空间。

英文摘要

Evaluating a natural-language yes/no predicate over a document corpus under an accuracy target - the semantic filter - is a cornerstone of LLM-based data processing. Calling the LLM on every document (the oracle) is prohibitive, so cascades pair the oracle with a fast proxy. As deployed today, they leave four limitations on the table. (1) Each cascade family - model-free clustering, prebuilt small-LLM proxies, online-trained proxies - commits to a single representation and pipeline, and wins on only a narrow query regime. (2) The strongest online proxy invests in a custom training scheme on a bi-encoder over dense embeddings, missing the token-level evidence richer predicates require. (3) The proxy is trained against binary yes/no labels, wasting the LLM's per-document confidence at the boundary documents it most needs to learn. (4) Existing calibrations add a uniform safety margin, conflating genuine proxy uncertainty with small-sample noise and inflating cascade cost. We address these by (1) composing families adaptively - model-free clustering first, online proxy only when needed, with oracle calls shared across phases; (2) replacing the cosine bi-encoder with a hybrid of off-the-shelf token-aware models; (3) training the proxy with the oracle's per-document confidence as a soft label; and (4) a calibration that adds the safety margin only where the labeled sample is sparse. We are also the first to use the oracle's per-document confidence for three purposes: a query-level difficulty compass, a lower bound on the minimum oracle calls any proxy-based cascade can make, and the proxy's soft training label. At a 90% accuracy target on three 10K-document corpora, our methods are 1.6-2.0x faster than the best prior method per corpus and meet the target on 95% of queries; the BER-derived lower bound indicates a further ~4-20x of headroom for future work.

2606.08898 2026-06-16 eess.AS cs.AI cs.LG 版本更新

Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training

基于原型适应和伪类变量训练的少样本类变量增量音频分类

Yanxiong Li, Guoqing Chen, Qianqian Li, Sen Huang

发表机构 * School of Electronic and Information Engineering, South China University of Technology(华南理工大学电子与信息学院)

AI总结 针对实际中类别数量增减的少样本类变量增量音频分类问题,提出一种结合原型适应网络和伪类变量训练策略的方法,在三个公开数据集上平均准确率超过现有方法。

Comments This paper has been accepted for publication in Interspeech 2026. 4 Tables and 4 Figures

详情
AI中文摘要

在少样本类增量音频分类任务中,通常假设类别数量总是增加而不考虑减少的可能性。然而,实际中类别数量通常会增加或减少。本文研究了少样本类变量增量音频分类(FCIAC)问题,其中类别数量增加或减少。我们提出了一种使用原型适应和伪类变量训练的FCIAC方法。我们的方法中的模型由编码器和分类器组成。分类器由类变量原型适应网络初始化,其结构随类别的变化而动态变化。此外,我们设计了一种伪类变量训练策略,以增强模型对变化类别的适应性。在三个公开数据集上的实验表明,我们的方法在平均准确率上超过了先前的方法。代码位于:https://github.com/cgq2971-afk/FCIAC。

英文摘要

In the task of few-shot class-incremental audio classification, the number of classes is assumed to always increase without considering the possibility of decrease. However, the number of classes generally increases or decreases in practice. In this paper, we investigate a problem of Few-shot Class-variable Incremental Audio Classification (FCIAC), in which the number of classes increases or decreases. We propose a FCIAC method using prototype adaptation and pseudo class-variable training. The model in our method consists of an encoder and a classifier. The classifier is initialized by a class-variable prototype adaptation network, whose structure dynamically changes with the change of classes. In addition, we design a pseudo class-variable training strategy to enhance the model's adaptability to changing classes. Experiments on three public datasets show that our method exceeds previous methods in average accuracy. The code is at: https://github.com/cgq2971-afk/FCIAC.

2605.28860 2026-06-16 cs.LG cs.AI cs.CL cs.CR 版本更新

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

灾难性遗忘的机制起源:为什么RL比SFT更好地保留电路?

Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 通过引入差异电路脆弱性指标,研究比较了强化学习与监督微调在大型语言模型微调中对内部计算电路的保留程度,发现RL虽任务适应较慢但能更好保留电路,从而减轻灾难性遗忘。

详情
AI中文摘要

微调大型语言模型(LLMs)经常导致先前能力的灾难性遗忘。最近的研究表明,强化学习(RL)比监督微调(SFT)更有效地保留先前能力,这归因于策略梯度更新更接近基础策略\cite{shenfeld2025rl}。我们将这种行为解释扩展到机制层面,并探究RL的优势是否通过内部计算电路的更强保留来体现。我们引入了差异电路脆弱性,一种头部级别的度量,用于衡量电路在微调下的退化程度,并将其用于比较RL和SFT在Qwen2.5-3B-Instruct适应科学问答任务上的表现。我们发现了清晰的机制权衡:SFT更快地适应目标任务,但导致更大的电路破坏和先前能力的遗忘,而RL保留了更大比例的基础电路,代价是任务适应较慢。这些发现表明,电路保留可能有助于解释为什么RL对灾难性遗忘更具鲁棒性。我们在此发布了代码:https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability。

英文摘要

Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinforcement learning (RL) retains prior capabilities more effectively than supervised fine-tuning (SFT), attributing this to policy-gradient updates remaining closer to the base policy \cite{shenfeld2025rl}. We extend this behavioral account to the mechanistic level and ask whether RL's advantage is mirrored by stronger preservation of internal computational circuits. We introduce differential circuit vulnerability, a head-level measure of how much a circuit degrades under fine-tuning, and use it to compare RL and SFT on Qwen2.5-3B-Instruct adapted to scientific question-answering. We find a clear mechanistic trade-off: SFT adapts more rapidly to the target task but produces substantially greater circuit disruption and forgetting of prior capabilities, whereas RL preserves a larger fraction of the base circuit at the cost of slower task adaptation. These findings suggest that circuit preservation may help explain why RL is more robust to catastrophic forgetting. We released our code here: https://github.com/rl-sft-circuit-research/differential-circuit-vulnerability.

6. 自然语言与多模态智能 74 篇

2606.15231 2026-06-16 cs.AI 新提交

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

Visual-Seeker:通过主动视觉推理实现视觉原生多模态智能搜索

Zhengbo Zhang, Changtao Miao, Jinbo Su, Zhaowen Zhou, Chunxia Zhang, Xukai Wang, Ruiqi Liu, Kaiyuan Zheng, Jiansheng Cai, Bo Zhang, Zhe Li, Shiming Xiang, Ying Yan

发表机构 * School of Artificial Intelligence UCAS(中国科学院大学人工智能学院) Institute of Automation CAS(中国科学院自动化研究所) Ant Digital Technologies Ant Group(蚂蚁数字科技蚂蚁集团) RUC(中国人民大学) BIT(北京理工大学)

AI总结 提出Visual-Seeker,一种通过主动视觉推理进行视觉原生多模态深度搜索的智能体,在五个基准上达到最先进性能,甚至超越专有模型。

详情
AI中文摘要

多模态大语言模型(MLLMs)在许多视觉任务中展示了令人印象深刻的能力,但在面对复杂、开放世界场景时,它们常常在事实性基础上挣扎。尽管最近的多模态深度搜索智能体试图通过利用外部工具来解决这个问题,但视觉原生搜索范式仍未得到充分探索。现有方法主要依赖于具有显式语义的简单图像和纯文本证据轨迹,限制了智能体执行多跳、跨模态推理和搜索的能力。为了解决这些限制,我们提出了Visual-Seeker,一种通过主动视觉推理的视觉原生多模态深度搜索智能体。我们的智能体不是将视觉视为静态输入,而是主动关注细粒度的视觉细节,在搜索过程中动态收集视觉证据。为了释放其视觉原生潜力,我们设计了一个主动视觉推理数据管道,并合成了5K高质量的多模态轨迹用于模型训练。大量实验表明,在五个具有挑战性的多模态搜索基准上,我们的方法达到了最先进的性能,甚至超越了多个专有模型,验证了在真实网络环境中鲁棒的视觉原生推理和搜索能力。代码和数据可在 https://github.com/ZhengboZhang/Visual-Seeker 获取。

英文摘要

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in many visual tasks, but they often struggle with factual grounding when confronted with complex, open-world scenarios. While recent multimodal deep search agents attempt to address this issue by utilizing external tools, the visual-native search paradigm remains underexplored. Existing methods primarily rely on simple images with explicit semantics and text-only evidence trajectories, limiting the agent's ability to perform multi-hop, cross-modal reasoning and search. To address these limitations, we propose Visual-Seeker, a visual-native multimodal deep search agent via active visual reasoning. Rather than treating vision as a static input, our agent actively attends to fine-grained visual details, dynamically harvests visual evidence throughout the search process. To unlock its visual-native potential, we design an active visual reasoning data pipeline and synthesize 5K high-quality multimodal trajectories for model training. Extensive experiments demonstrate the state-of-the-art performance across five challenging multimodal search benchmarks, even surpassing several proprietary models, validating robust visual-native reasoning and search in real-world web environments. The code and data can be accessed at: https://github.com/ZhengboZhang/Visual-Seeker.

2606.15591 2026-06-16 cs.AI cs.CL cs.MA 新提交

Agentic Retrieval and Reinforcement Learned Equation Chains: A Controlled Generation Framework for Complex and Novel Physics Word Problems

智能检索与强化学习方程链:面向复杂新颖物理文字题的可控生成框架

Tirthankar Mittra

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出ARVRE两阶段框架,通过离线时序差分学习构建有效物理方程链,结合智能检索增强生成控制问题结构与难度,再由大语言模型生成自然语言问题,实现复杂、新颖且可解的物理文字题生成。

详情
AI中文摘要

生成高质量、新颖、复杂且可解的物理文字题(PWPs)在教育内容生成中仍是一个具有挑战性且未被充分探索的问题。现有方法多改编自数学文字题(MWP)生成,常产生模糊、不可解或结构简单且语言多样性有限的问题。我们提出ARVRE(智能检索值强化方程链),一个用于生成多样且数学有效的PWPs的两阶段框架。在第一阶段,使用一种离线时序差分学习形式构建有效的物理方程链,同时一个智能检索增强生成(RAG)框架动态选择主题特定的概念和词汇。这种设计能够显式控制问题结构和难度。在第二阶段,大语言模型(LLM)将方程链和检索到的概念转换为自然语言的物理问题。通过将生成过程基于有效方程链,我们的方法在保持数学正确性的同时,促进了语言多样性和上下文丰富性。人工和自动评估表明,ARVRE生成的PWPs比现有方法更复杂、新颖且可解。这些结果凸显了结合强化学习、检索和LLM用于可靠生成教育物理内容的潜力。

英文摘要

Generating high-quality Physics Word Problems (PWPs) that are novel, complex, and solvable remains a challenging and underexplored problem in educational content generation. Existing approaches, many adapted from Math Word Problem (MWP) generation, often produce ambiguous, unsolvable, or structurally simple questions with limited linguistic diversity. We introduce ARVRE (Agentic Retrieval Value Reinforced Equation-chain), a two-stage framework for generating diverse and mathematically valid PWPs. In the first stage, a form of offline temporal-difference learning is used to construct valid chains of physics equations, while an agentic retrieval-augmented generation (RAG) framework dynamically selects topic-specific concepts and vocabulary. This design enables explicit control over problem structure and difficulty. In the second stage, a Large Language Model (LLM) converts the equation chain and retrieved concepts into a natural-language physics question. By grounding generation in valid equation chains, our method preserves mathematical correctness while promoting linguistic diversity and contextual richness. Human and automated evaluations demonstrate that ARVRE generates PWPs that are more complex, novel, and solvable than those produced by existing approaches. These results highlight the potential of combining reinforcement learning, retrieval, and LLMs for reliable generation of educational physics content.

2606.15598 2026-06-16 cs.AI 新提交

Integrating Reasoning and Generalization in Text-to-SQL via Self-Enhanced Fine-Tuning

通过自增强微调在Text-to-SQL中整合推理与泛化

Feng Lyu, Jinfeng Cen, Sijing Duan, Hao Wu, Shucheng Li, Weixu Zhang, Haolun Wu

发表机构 * Central South University(中南大学) Tsinghua University(清华大学) Nanjing University(南京大学) McGill University(麦吉尔大学)

AI总结 提出CoTE-SQL方法,通过自增强推理轨迹、结构化思维链提示和错误感知修正,在开源LLM上实现Bird和Spider基准的最优性能。

Comments 14 pages, 13 figures, 7 tables

详情
AI中文摘要

Text-to-SQL旨在将自然语言问题转换为可执行的结构化数据库SQL查询,使非专业用户能够直观地访问数据。尽管大型语言模型(LLM)的最新进展在该任务中显示出潜力,但现有的基于LLM的方法往往难以在强大的推理能力和稳健的泛化之间取得平衡。为了解决这些局限性,我们提出了CoTE-SQL,通过三个关键创新来增强基于LLM的Text-to-SQL生成:(i)从LLM中提取的自增强推理轨迹,无需人工标注;(ii)具有模块化分解和示例检索的结构化思维链(CoT)提示;(iii)基于SQL执行反馈的错误感知修正。在Spider和Bird基准上的大量实验表明,CoTE-SQL在基于开源LLM的方法中取得了新的最先进性能,在Bird上(53.39% EX / 59.02 VES)和Spider上(79.60% EX / 77.19 VES)均表现强劲,尤其是在复杂查询上取得了显著提升。结果突出了在基于LLM的Text-to-SQL设计中结合自增强、结构化推理和执行时反馈的有效性。

英文摘要

Text-to-SQL aims to translate natural language questions into executable SQL queries over structured databases, enabling non-expert users to access data intuitively. While recent advances in large language models (LLMs) have shown promise in this task, existing LLM-based approaches often struggle to strike a balance between strong reasoning capabilities and robust generalization. To address these limitations, we propose CoTE-SQL to enhance the LLM-based text-to-SQL generation with three key innovations: (i) self-enhanced reasoning traces distilled from LLMs without human annotation, (ii) structured chain-of-thought (CoT) prompting with modular decomposition and examples retrieval, and (iii) error-aware revision based on SQL execution feedback. Extensive experiments on the Spider and Bird benchmarks demonstrate that CoTE-SQL achieves new state-of-the-art performance among methods built on open-source LLMs with comparable model sizes on Bird (53.39% EX / 59.02 VES) and strong results on Spider (79.60% EX / 77.19 VES), with especially significant gains on complex queries. Results highlight the effectiveness of combining self-enhancement, structured reasoning, and execution-time feedback within an LLM-based framework for text-to-SQL design.

2606.15696 2026-06-16 cs.AI cs.CL cs.LG 新提交

Do LLMs Reliably Identify Correct Information Units in Aphasic Discourse?

LLMs 能否可靠识别失语症语篇中的正确信息单元?

Jason M Pittman, Yesenia Medina-Santos, Anton Phillips, Brielle C. Stark

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校)

AI总结 研究评估指令微调大语言模型在零样本和少样本提示下对失语症语篇进行词级正确信息单元分类的性能,发现少样本提示可提升效果但一致性仍不足。

Comments 5 tables, 4 figures

详情
AI中文摘要

正确信息单元(CIUs)是失语症语篇评估的核心,因为它们量化了交际信息性而非仅语言形式。然而,CIU评分耗时且需要训练有素的评分者。本研究考察了指令微调的大语言模型(LLMs)是否能够可靠地从失语症语篇转录中进行词级CIU分类。使用Cat Rescue刺激引发的16个图片描述转录根据Nicholas和Brookshire(1993)的标准进行CIU状态标注。样本涵盖四个严重程度层:对照组、轻度、中度和重度失语症。在零样本和两种少样本提示条件下,对四个公开可用的指令微调LLMs进行了基准测试,使用五个分层随机种子。通过准确率、精确率、召回率、F1和Cohen's kappa与人类共识标签进行性能评估。零样本提示在所有模型中均不足。相比之下,少样本提示带来了显著提升,并为三个可行模型产生了有竞争力的性能。Llama-3.1-8B、Qwen2.5-7B和Mistral-7B的平均少样本F1分数范围为0.776至0.817,固定全局和逐块局部示例选择之间无显著差异。Phi-3-mini不稳定且未产生可靠性能。可行模型显示出高召回率但较低的精确率,表明系统性地过度将词元分类为CIU。性能也随语篇严重程度变化,在更严重的失语症中结果最弱。少样本LLM提示可以在无需基于梯度的任务训练的情况下支持自动CIU识别,但与人类标注的一致性仍不足以完全自主使用。这些发现支持基于LLM的CIU评分作为语篇评估系统中一个有前景的人机协同组件。

英文摘要

Correct Information Units (CIUs) are central to discourse assessment in aphasia because they quantify communicative informativeness rather than linguistic form alone. However, CIU scoring is time intensive and requires trained raters. This study examined whether instruction-tuned large language models (LLMs) can reliably perform token-level CIU classification from aphasic discourse transcripts. Sixteen picture-description transcripts elicited with the Cat Rescue stimulus were annotated for CIU status according to Nicholas and Brookshire (1993). The sample spanned four severity strata: control, mild, moderate, and severe aphasia. Four publicly available instruction-tuned LLMs were benchmarked under zero-shot and two few-shot prompting conditions across five stratified random seeds. Performance was evaluated against consensus human labels using accuracy, precision, recall, F1, and Cohen's kappa. Zero-shot prompting was insufficient across models. In contrast, few-shot prompting yielded substantial gains and produced competitive performance for three viable models. Mean few-shot F1 scores ranged from 0.776 to 0.817 across Llama-3.1-8B, Qwen2.5-7B, and Mistral-7B, with no significant differences between fixed global and per-chunk local example selection. Phi-3-mini was unstable and did not yield reliable performance. Viable models showed high recall but lower precision, suggesting systematic over-classification of tokens as CIUs. Performance also varied by discourse severity, with the weakest results in more severe aphasia. Few-shot LLM prompting can support automated CIU identification without gradient-based task training, but agreement with human annotation remains insufficient for fully autonomous use. These findings support LLM-based CIU scoring as a promising human-in-the-loop component of discourse assessment systems.

2606.15782 2026-06-16 cs.AI cs.CV 新提交

Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference

通过检索增强的可靠性感知推理缓解多模态系统中的视觉幻觉

Pratheswaran Hariharan, Haiping Xu, Donghui Yan

发表机构 * University of Massachusetts, Dartmouth(马萨诸塞大学达特茅斯分校)

AI总结 提出一种检索增强的可靠性感知推理框架,利用外部视觉证据库和多个可靠性指标进行决策门控,在不重训练模型的情况下减少视觉幻觉,将接受预测准确率从85.84%提升至88.88%。

Comments 28 pages, 9 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉语言理解和自然语言响应生成方面展现了强大的能力。然而,当视觉证据较弱、模糊或语义不一致时,这些系统仍可能产生过度自信的预测和类似幻觉的输出。现有方法大多侧重于改进多模态表示对齐或检索增强生成,而缺乏量化实例级预测可靠性或识别错误视觉输出的机制。本文提出了一种检索增强的可靠性感知推理框架,用于可信的多模态视觉理解。该框架利用预训练的视觉嵌入和基于归一化特征表示的最近邻检索构建外部视觉证据数据库。检索到的证据用于通过多个可靠性指标估计预测的可信度,包括相似性强度、类别支持一致性、证据边际、基于熵的不确定性以及聚合可靠性分数。基于这些信号,决策门控决定系统是否应接受预测、谨慎回答或在证据不足时放弃/回退。然后,多模态响应生成层根据可靠性决策生成最终面向用户的响应。在ImageNet-100上的实验表明,所提出的可靠性感知框架在89.04%的覆盖率下将接受预测准确率从85.84%提升至88.88%。类似幻觉的接受错误答案率从14.16%降至11.12%。这些结果表明,整合检索证据、可靠性估计和选择性决策门控可以在不重新训练大型多模态模型的情况下改善校准并减少过度自信的视觉错误。

英文摘要

Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-language understanding and natural-language response generation. However, these systems can still produce overconfident predictions and hallucination-like outputs, particularly when the visual evidence is weak, ambiguous, or semantically inconsistent. Most existing approaches focus on improving multimodal representation alignment or retrieval-augmented generation, while providing limited mechanisms to quantify instance-level prediction reliability or identify incorrect visual outputs. This work proposes a retrieval-augmented reliability-aware inference framework for trustworthy multimodal visual understanding. The proposed framework constructs an external visual evidence database using pretrained visual embeddings and nearest-neighbor retrieval over normalized feature representations. Retrieved evidence is used to estimate prediction trustworthiness through multiple reliability indicators, including similarity strength, class-support agreement, evidence margin, entropy-based uncertainty, and an aggregate reliability score. Based on these signals, a decision gate determines whether the system should accept the prediction, answer with caution, or abstain/fallback when evidence is insufficient. A multimodal response-generation layer then produces a final user-facing response conditioned on the reliability decision. Experiments on ImageNet-100 demonstrate that the proposed reliability-aware framework improves accepted prediction accuracy from 85.84\% to 88.88\% at 89.04\% coverage. The hallucination-like accepted wrong-answer rate is reduced from 14.16\% to 11.12\%. These results show that integrating retrieval evidence, reliability estimation, and selective decision gating can improve calibration and reduce overconfident visual errors without retraining large multimodal models.

2606.16122 2026-06-16 cs.AI 新提交

Thinking with Visual Grounding

视觉锚定思维

Junkai Zhang, Yihe Deng, Kai-Wei Chang, Wei Wang

发表机构 * University of California, Los Angeles(加利福尼亚大学洛杉矶分校)

AI总结 提出视觉锚定思维方法,让视觉语言模型在推理时交替生成自然语言和视觉锚点(点或框),并通过合成数据管道和锚定感知强化学习训练,在计数和空间推理任务上显著提升性能。

详情
AI中文摘要

视觉思维不仅应该听起来正确,还应该展示其证据。虽然最近的视觉语言模型(VLM)能够生成自然语言推理轨迹,但这些轨迹往往隐含了所支持的图像区域,使得它们难以验证和监督。我们引入了视觉锚定思维,这是一种推理过程,其中模型将自然语言思想与每一步所使用的视觉证据的显式点或框锚定交替生成。这使得模型能够在语言中表达中间推理,同时将关键对象锚定到它们所指的图像区域。为了训练这种行为,我们构建了一个可扩展的合成管道,该管道蒸馏正确的视觉推理轨迹,提取轨迹所需的视觉对象,使用基于SAM3的代理对其进行锚定,并从生成的掩码中导出对齐的点与框监督。我们进一步提出了锚定感知强化学习,它将答案正确性奖励与密集的锚定奖励相结合,后者评分生成的物体引用是否匹配正确的图像证据。在两个计数基准和四个空间推理基准上,将视觉锚定思维添加到Gemma3-4B-IT中,始终优于原始模型和非锚定思维基线。在空间推理上,视觉锚定思维的4B模型匹配,并在某些情况下超越了同一模型家族的Gemma3-27B-IT。我们的分析表明,点锚定适合计数,而框锚定在空间任务上从显式锚定奖励中获益最多。总体而言,我们的结果表明,当VLM的中间思维与使它们为真的图像区域相关联时,它们的思考能力更强。

英文摘要

Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.

2606.16307 2026-06-16 cs.AI cs.CL 新提交

State-Grounded Multi-Agent Synthetic Data Generation for Tool-Augmented LLMs

面向工具增强型大语言模型的基于状态的多智能体合成数据生成

Rahul Khedar, Eshita, Sneha Teja Sree Reddy Thondapu, Mayank Malhotra, Arup Das, Jitesh Chandra, Yun-Shiuan Chuang, Chaitanya Kulkarni, Arun Menon, Linsey Pang, Avinash Karn, Mouli V, Prakhar Mehrotra

发表机构 * PayPal AI

AI总结 提出StateGen平台,通过四角色LLM循环和状态管理器生成多轮、工具接地的高质量训练对话,消除工具调用幻觉,支持层次化多智能体设置。

Comments 9 pages, 5 figures, 6 tables, 1 algorithm

详情
AI中文摘要

训练工具增强型LLM代理需要大量多轮、工具接地的对话数据,这些数据标注成本高、生产环境中受隐私限制,且公共数据集中基本缺失。我们提出StateGen,一个合成数据生成平台,通过编排四角色LLM循环(角色条件用户模拟器、被测代理、状态接地工具模拟器和多轴LLM评判器)生成带有评分和丰富推理轨迹的训练对话。关键架构贡献是一个权威状态管理器,它在多轮对话中维护一个结构化的世界状态对象,强制执行后端即事实的不变性,从而从结构上消除了最主要的工具调用幻觉类别。StateGen通过将子代理声明为工具(所有子代理共享一个状态对象)自然地扩展到层次化多智能体设置。我们在三个生产语料库上报告了64,698个评估对话的结果:工具调用幻觉得分达到9.66/10,系统通过23维特征向量支持角色驱动变化,并且干净分离的训练集和黄金评估集划分确认数据不是记忆诱饵(按标准差距分析)。与八个外部系统的比较表明,没有单一公开平台同时具备多轮生成、状态接地工具模拟、层次化多智能体支持和内置评判器评分功能。

英文摘要

Training tool-augmented LLM agents requires large corpora of multi-turn, tool-grounded conversational data that is expensive to annotate, privacy-constrained in production settings, and largely absent from public datasets. We present StateGen, a synthetic data generation platform that produces scored, reasoning-trace-rich training conversations by orchestrating a four-role LLM loop: a persona-conditioned user simulator, an agent under test, a state-grounded tool simulator, and a multi-axis LLM judge. The key architectural contribution is an authoritative state manager that maintains a structured world-state object across turns, enforcing a backend-is-truth invariant that eliminates the dominant class of tool-call hallucinations by construction. StateGen extends naturally to hierarchical multi-agent settings by declaring sub-agents as tools, all sharing a single state object. We report results on 64,698 evaluated conversations across three production corpora: tool-call hallucination scores reach 9.66/10, the system supports persona-driven variation via a 23-dimensional trait vector, and a cleanly separated train and golden evaluation set split confirms the data is not memorization bait (per-criterion gap analysis). Comparison with eight external systems shows that no single publicly available platform combines multi-turn generation, state-grounded tool simulation, hierarchical multi-agent support, and built-in judge scoring.

2606.16364 2026-06-16 cs.AI cs.CR cs.SE 新提交

Looking Is Not Picking: An Attention-Segment Account of Tool-Selection Failures in LLM Agents

看而非选:LLM智能体工具选择失败的注意力-片段解释

Shiyang Chen

AI总结 本文通过分析LLM智能体对工具定义片段的注意力,发现工具选择失败源于决策读出阶段而非工具可见性,并提出了基于注意力的无训练选择器来修复。

Comments 13 pages, 1 figure, 15 tables

详情
AI中文摘要

LLM智能体会错误地调用工具,自然的猜测是模型在拥挤的工具箱中未能看到正确的工具。我们通过一个并发工作未涉及的视角——模型对标记的工具定义片段的注意力——展示了相反的情况。在真实的BFCL失败案例上,通过每个候选的注意力argmax,模型在80%的情况下最关注正确的工具(对比21%的随机概率),而正确工具是注意力不足的片段仅占10%:它看到了正确的工具但仍然选错。这直接反驳了直观的“拥挤工具箱/中间丢失”解释:失败在于决策读出,而非工具箱,我们通过三种方式证实了这一点。(1) 输入vs.读出:修复提示(重新排序或复制正确工具)仅恢复<=23%的失败,而读出侧干预恢复59-91%。(2) 表示不变性:两种不同表示中的指向正确工具的干预——加性注意力logit偏置和残差流转向向量——恢复的失败案例大致相同(每任务Jaccard 0.865合并,每模型0.79-0.91),因此瓶颈定位于读出,与干预的表示无关。(3) 无训练、无正确工具的选择器:基于每个片段的注意力在BFCL上缩小了大部分无正确工具与有正确工具之间的差距(函数名选择合并+11.9分 vs. 有正确工具上限+17.9分),并在Seal-Tools上增加+14.9分;每个模型均为正向(精确McNemar检验p<=8e-4每个)。范围不同:因果注意力偏置剂量反应在10个遵循掩码的模型(3-32B)上是双向且单调的,而0.5-32B全范围仅携带相关性诊断;可部署的选择器在5个单轮模型上评估,尚未迁移到多轮循环。

英文摘要

LLM agents mis-call tools, and the natural guess is that the model failed to see the right tool in a crowded harness. We show the opposite through a lens concurrent work sets aside -- the model's attention to labeled tool-definition segments. On real BFCL failures, by per-candidate attention argmax the model attends most to the correct tool 80% of the time (vs. 21% chance), and the gold is the under-attended segment on only 10%: it looks at the right tool and still picks wrong. This directly refutes the intuitive "crowded-harness / lost-in-the-middle" explanation: the failure is at the decision readout, not the harness, and we pin it there three ways. (1) Input vs. readout: repairing the prompt (reordering or duplicating the gold tool) recovers <=23% of failures, while readout-side interventions recover 59-91%. (2) Representation-invariance: two gold-pointed interventions in different representations -- an additive attention-logit bias and a residual-stream steering vector -- recover largely the same failures (per-task Jaccard 0.865 pooled, 0.79-0.91 per model), so the bottleneck is localized to the readout independent of which representation is poked. (3) A training-free, gold-free selector: per-segment attention closes most of the gold-free-vs-oracle gap on BFCL (+11.9 pts pooled function-name selection vs. +17.9-pt oracle headroom) and adds +14.9 pts on Seal-Tools; every model positive (exact McNemar p<=8e-4 each). Scopes differ: the causal attention-bias dose-response is bidirectional and monotonic on 10 mask-honoring models (3-32B), the full 0.5-32B span carrying only the correlational diagnostic; the deployable selector is evaluated on 5 single-turn models and does not yet transfer to a multi-turn loop.

2606.16481 2026-06-16 cs.AI 新提交

Steering Emotional Dynamics for Art Therapy: Controllable Narrative Script Generation through Hierarchically Guided LLM Agents

引导艺术治疗的情感动态:通过分层引导的LLM智能体实现可控叙事脚本生成

Suqing Wang, Qinghai Miao, Chao Guo, Yisheng Lv

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 提出EC-Script框架,通过分层控制情感轨迹生成叙事脚本,实现情感轨迹规划、场景驱动和局部情感调节,显著优于基线方法。

详情
AI中文摘要

艺术治疗在情感治愈中扮演重要角色,其中叙事创作是情感表达的主要载体。鉴于治愈过程中情感固有的动态特性,具有精细控制情感波动的叙事使个体能够安全地投射内心冲突并实现情感宣泄。近年来,随着大型语言模型(LLM)的快速发展,自动叙事生成技术为支持此类艺术设计提供了新途径。然而,现有方法虽然能生成流畅文本,但难以生成遵循特定情感轨迹的叙事,无法满足以情感为导向的心理治愈需求。为解决这些问题,本文提出EC-Script,一种基于LLM智能体的框架,能够实现对情感治愈叙事生成中情感轨迹的分层控制。为确保生成的叙事严格遵循给定的情感模式,EC-Script通过情感轨迹规划建立整体叙事方向,通过角色驱动场景生成推动场景级情节发展,并通过情感控制脚本编写调节角色的局部情感变化。最终输出逐场景的脚本内容,与预设情感轨迹保持高度一致。实验结果表明,EC-Script在情感轨迹遵循度上显著优于基线方法,展现出优秀且可靠的情感可控性,从而为AI辅助情感治愈场景提供有效的技术支持。

英文摘要

Art therapy plays a vital role in emotional healing, in which narrative creation acts as the primary vehicle for emotional expression. Given the inherently dynamic nature of emotions during healing, narratives with finely controlled emotional fluctuations enable individuals to safely project inner conflicts and achieve emotional catharsis. Recently, with the rapid development of Large Language Models (LLMs), automated narrative generation technology has provided a new pathway to support such artistic designs. However, while existing methods can produce fluent texts, they struggle to generate narratives that adhere to specified affective trajectories, failing to meet the demands of emotion-oriented psychological healing. To address these issues, this paper proposes EC-Script, an LLM agent-based framework that enables hierarchical control of the affective trajectory in narrative generation for emotional healing. To ensure that the generated narratives strictly follow the given emotional patterns, EC-Script establishes overall narrative direction through Emotion-Trajectory Planning, propels scene-level plot development with Character-Driven Scene Generation, and regulates local emotional changes of characters via Emotion-Controlled Script Writing. Ultimately, it outputs scene-by-scene script content that remains highly consistent with the preset affective trajectory. Experimental results demonstrate that EC-Script significantly outperforms baseline methods in affective trajectory adherence, exhibiting excellent and reliable emotional controllability, thereby providing effective technical support for AI-assisted emotional healing scenarios.

2606.16541 2026-06-16 cs.AI cs.LG 新提交

The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements

忠实性差距:认证自然语言与形式数学语句之间的语义等价性

Noor Islam S. Mohammad, Tamim Sheikh

发表机构 * Department of Computer Science, Informatics Institute, Istanbul Technical University, İstanbul, Türkiye(信息学院计算机科学系,伊斯坦布尔技术大学,伊斯坦布尔,土耳其) Department of Computer Science(计算机科学系) Engineering, Jashore University of Science(工程系,贾沙尔大学科学学院)

AI总结 提出双向可证明性指纹识别框架,通过前向和后向推论邻域匹配自然语言探针,认证自动形式化翻译的忠实性,并引入反事实探针生成、等价谱、自适应探针预算分配和忠实性引导解码四个新组件,在基准上实现高检测率并减少漂移。

详情
AI中文摘要

自动形式化——将自然语言数学翻译成形式证明助手——的瓶颈不在于翻译流畅性,而在于\emph{忠实性}:一个形式语句可以通过类型检查且可证明,但仍可能编码与源意图不同的定理。我们引入\emph{双向可证明性指纹识别}(\bpf{}),这是一个通过刻画每个候选在背景理论中的前向和后向推论邻域,并将这些邻域与从自然语言语句导出的探针进行匹配来认证忠实性的框架。我们进一步引入四个新组件:(i)\emph{反事实探针生成}(\cpg{}),一种合成针对特定漂移方向的探针的对比性程序;(ii)\emph{等价谱},一个替代脆弱的二元判决的连续忠实性分数;(iii)\emph{自适应探针预算分配}(\apba{}),一个信息论预算路由器;以及(iv)\emph{忠实性引导解码}(\fgd{}),它在自动形式化过程中使用\bpf{}信号作为奖励。我们证明了一个\emph{漂移检测定理}和一个\emph{PAC-忠实性}结果,该结果确立了在温和假设下,自然语言语句的等价类可以从$\mathcal{O}(\log(1/δ)/\varepsilon)$个探针中学习。我们发布了\driftbench{},一个包含$2{,}183$个NL/Lean~4对的基准,这些对具有跨mathlib4六个子领域的受控漂移标签。\bpf{}\,+\,\cpg{}在$3.0\%$的假阳性率下检测出$89.6\%$的漂移形式化——相比之下,类型检查为$41.2\%$,LLM评判基线为$63.3\%$——并且\fgd{}将最先进的自动形式化器产生漂移语句的比率降低了$47\%$。https://pmlrbd.github.io/BPF/

英文摘要

Autoformalization, translating natural-language mathematics into formal proof assistants, is bottlenecked not by translation fluency but by \emph{faithfulness}: a formal statement can typecheck and be provable, yet still encode a different theorem than the source intended. We introduce \emph{Bidirectional Provability Fingerprinting} (\bpf{}), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement. We further introduce four novel components: (i) \emph{Counterfactual Probe Generation} (\cpg{}), a contrastive procedure that synthesizes probes targeting specific drift directions; (ii) the \emph{Equivalence Spectrum}, a continuous faithfulness score that replaces brittle binary verdicts; (iii) \emph{Adaptive Probe Budget Allocation} (\apba{}), an information-theoretic budget router; and (iv) \emph{Faithfulness-Guided Decoding} (\fgd{}), which uses \bpf{} signals as a reward during autoformalization. We prove a \emph{drift detection theorem} and a \emph{PAC-faithfulness} result establishing that the equivalence class of a natural language statement is learnable from $\mathcal{O}(\log(1/δ)/\varepsilon)$ probes under mild assumptions. We release \driftbench{}, a benchmark of $2{,}183$ NL/Lean~4 pairs with controlled drift labels across six subfields of mathlib4. \bpf{}\,+\,\cpg{} detects $89.6\%$ of drifted formalizations at a $3.0\%$ false-positive rate-against $41.2\%$ for typecheck and $63.3\%$ for LLM-judge baselines, and \fgd{} reduces the rate at which a state-of-the-art autoformalizer emits drifted statements by $47\%$. https://pmlrbd.github.io/BPF/

2606.16687 2026-06-16 cs.AI cs.CL 新提交

From Affect Prediction to Affect Forecasting: Evidence for Distinct Information Sources in Longitudinal Text

从情感预测到情感预报:纵向文本中不同信息源的证据

Sadia Noor, Seemab Latif, Raja Khurram Shahzad, Mehwish Fatima

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST)(国立科技大学电气工程与计算机科学学院) Department of Communication, Quality Management and Information Systems, Mid Sweden University(中瑞典大学通信、质量管理和信息系统系)

AI总结 本文区分当前情感估计与未来情感变化预报,提出TSAP框架和ACF-Hybrid模型,实验表明文本语义支持当前预测,而数值轨迹动力学更适用于未来变化预报。

详情
AI中文摘要

对纵向文本中的维度情感建模需要区分当前情感估计与未来情感变化预报。现有方法通常将每个文本视为独立观测,并对两个任务应用类似假设,而不检验它们是否依赖不同的信息源。本文利用纵向自我报告生态短文和情感词条目研究这一区别。我们提出特质-状态情感预测(TSAP)框架及其时间扩展E-TSAP用于逐文本效价和唤醒度预测,在来自91名用户的1737条条目的保留预测测试集上评估。我们进一步提出情感变化预报混合模型(ACF-Hybrid)用于下一步情感变化预报,在来自46名用户的保留预报测试集上评估。对于预测,E-TSAP在效价上达到复合皮尔逊相关系数0.670,在唤醒度上达到0.449。对于预报,文本表示的表现不如紧凑的数值轨迹基线:包含文本的模型在效价上仅达到r=0.316,在唤醒度上达到r=0.284,而简单的先前状态基线分别达到r=0.615和r=0.670。ACF-Hybrid使用维度特定的数值轨迹特征,在效价上达到r=0.659,在唤醒度上达到r=0.658。这些结果表明,文本语义支持当前情感预测,而未来情感变化通过先前数值轨迹动力学能更好地捕获。

英文摘要

Modeling dimensional affect in longitudinal text requires distinguishing current affect estimation from future affective change forecasting. Existing approaches often treat each text as an independent observation and apply similar assumptions to both tasks, without testing whether they rely on different information sources. This paper investigates that distinction using longitudinal self-reported ecological essays and feeling-word entries. We propose the Trait--State Affective Prediction (TSAP) framework and its temporal extension E-TSAP for per-text valence and arousal prediction, evaluated on a held-out prediction test set of 1,737 entries from 91 users. We further propose the Affective Change Forecaster Hybrid (ACF-Hybrid) for next-step affective change forecasting, evaluated on a held-out forecasting test set of 46 users. For prediction, E-TSAP achieves composite Pearson correlations of 0.670 for valence and 0.449 for arousal. For forecasting, textual representations perform worse than compact numeric trajectory baselines: the text-inclusive model achieves only r=0.316 for valence and r=0.284 for arousal, whereas a simple prior-state baseline reaches r=0.615 and r=0.670, respectively. ACF-Hybrid, using dimension-specific numeric trajectory features, achieves r=0.659 for valence and $r=0.658$ for arousal. These results show that textual semantics support current affect prediction, whereas future affective change is better captured through prior numeric trajectory dynamics.

2606.14750 2026-06-16 eess.AS cs.AI cs.CV cs.SD 交叉投稿

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Pixel-TTS: 基于图像的文字渲染实现鲁棒文本转语音

Adarsh Arigala, Arjun Gangwar, S Umesh, Yova Kementchedjhieva

发表机构 * SPRING Lab, Indian Institute of Technology, Madras, India(SPRING实验室,印度理工学院,马德拉斯,印度) MBZUAI, UAE(MBZUAI,阿联酋)

AI总结 提出Pixel-TTS框架,将文本渲染为图像并通过2D卷积生成嵌入,消除嵌入矩阵扩展,提升对未见字符和拼写变体的鲁棒性,实现零样本泛化。

Comments 5 pages, 4 figures, 4 tables

详情
AI中文摘要

近期基于像素的文本建模进展表明,将文本表示为图像能使模型利用视觉线索进行语言理解。将文本锚定在其视觉形式上,允许具有不同Unicode编码的结构相似字符产生相似的嵌入,从而有益于跨语言和零样本场景。传统的基于文本的方法独立处理每个字符,限制了向未见字符的泛化,并在跨语言适应时需要嵌入扩展。我们提出Pixel-TTS,首个视觉接地语音合成框架。它将文本渲染为图像,并通过2D卷积层投影以生成嵌入。这种设计在微调过程中消除了嵌入矩阵扩展,同时提高了对未见字符和拼写变体的鲁棒性。大量实验表明,Pixel-TTS在强基线上实现了有竞争力的性能、更快的收敛和鲁棒的零样本泛化。

英文摘要

Recent advances in pixel-based text modeling show that representing text as images enables models to exploit visual cues for language understanding. Grounding text in its visual form allows structurally similar characters with different Unicode encodings to produce similar embeddings, benefiting cross-lingual and zero-shot scenarios. Conventional text-based approaches treat each character independently, limiting generalization to unseen characters and requiring embedding expansion during cross-lingual adaptation. We propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show Pixel-TTS achieves competitive performance with strong baselines, faster convergence and robust zero-shot generalization.

2606.14762 2026-06-16 cs.CV cs.AI 交叉投稿

Scribby: A Multi-Level LLM Framework for Semantic Video Analysis

Scribby: 一种用于语义视频分析的多级LLM框架

Julian Abelarde, Hugo Garrido-Lestache Belinchon

发表机构 * Department of Computer Science and Software Engineering, Milwaukee School of Engineering(密尔沃基工程学院计算机科学与软件工程系)

AI总结 提出一种基于LLM的视频摘要框架,通过微观索引(分析完整转录、句子及语义分组)平衡宏观理解与微观语义分析,并利用相关性热图实现语义分块和匹配的可视化。

详情
AI中文摘要

随着视频内容在教育平台、录播讲座和直播娱乐中的持续扩展,对长视频进行高效且结构化分析的需求日益增长。尽管许多现有AI程序基于AI生成的转录提供高级视频摘要,但这些方法通常局限于粗略概述,缺乏对视频结构、主题进展和语义关系的详细分析,而这些正是全面视频分析所必需的。本文提出一种基于LLM的视频摘要框架,平衡宏观理解与微观语义分析。该过程的第一阶段在微观层面对视频进行索引,包括:(1) 分析完整转录,(2) 分析单个转录句子,(3) 使用LLM作为评判依据语义相似性对这些句子进行分组。在句子级处理中,通过将全局转录分析和相邻句子信息纳入每个评估提示,保留上下文连续性。该框架为通过相关性热图可视化语义分块和语义匹配的视频分析工具奠定了基础。还讨论了框架的局限性和未来扩展。

英文摘要

As video content continues to expand across educational platforms, recorded lectures, and live-streamed entertainment, the need for efficient and structured analysis of long-form footage has increased \cite{1}. Although many existing AI programs provide high-level video summaries based on AI-generated transcripts \cite{2,3,4,5}, these approaches are often limited to coarse overviews and lack detailed analysis of a video's structure, thematic progression, and semantic relationships, all of which are required for comprehensive video analysis. This paper proposes an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis \cite{6,12,13}. The first stage of the process indexes the video at a micro level by (1) analyzing the full transcript, (2) analyzing individual transcript sentences, and (3) grouping these sentences by semantic similarity using an LLM as a judge \cite{6,13}. Contextual continuity is retained during sentence-level processing by incorporating both the global transcript analysis and adjacent sentence information into each evaluation prompt. This framework establishes a foundation for video analysis tools that visualize semantic chunking and semantic matching through relevance-based heatmaps. Limitations and future expansions of the framework are also discussed.

2606.14777 2026-06-16 cs.CV cs.AI 交叉投稿

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

JoyAI-VL-Interaction: 实时视觉-语言交互智能

Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie, Shuhuan Gu, Haoyang Huang, Qingyi Si, Nan Duan, Jiaqi Wang

发表机构 * JD.com(京东)

AI总结 提出一种持续观察、自主决定是否回应的视觉-语言交互模型,并开源8B规模模型及完整部署系统,在六个真实场景中优于现有方案。

详情
AI中文摘要

现实世界中的许多时刻不会等待用户提问。安全监控上起火,视频通话中表情变化,或直播中观众想要的商品一闪而过。然而,当今的大模型大多仍以轮次式设计:它们只在被召唤时回答,即使是看似交互式的视频通话应用,其运作方式仍是问答系统,仅在轮询或提示时做出反应。我们主张一种不同的范式:一个像人一样存在于世界中的模型。它持续观察当前发生的事件,自行决定是说话还是保持沉默,实时交互,并在问题困难时委托给后台模型。为了推动交互模型及其在各领域的应用,我们做出两项完全开源贡献。首先,我们发布JoyAI-VL-Interaction,一个8B规模的视觉优先VL交互模型。该模型内部做出响应决策,每秒选择保持沉默、回应或委托给后台模型,并在视觉触发响应性和时间感知方面表现出色。我们为其配备了一个可迁移的训练方案,从中涌现出我们从未训练过的能力,例如引导购物者切换应用屏幕或根据幻灯片即兴授课。其次,我们发布了一个围绕该模型构建的完整可部署系统。该系统将任何正在进行的视频流式传输到模型中,使其真正存在于世界中。所有其他组件都是可插拔的,包括ASR/TTS模块、记忆、可视化UI以及可连接任何API或代理的后台大脑。在六个真实场景中,人类评估者以较大优势偏好JoyAI-VL-Interaction而非豆包和Gemini的应用内视频通话助手。据我们所知,这是第一个开源的、视觉驱动的交互模型,同时发布了其训练方案、数据和完整可部署系统。

英文摘要

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.

2606.15007 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

Nemotron 3 Ultra: 开放、高效的混合专家Mamba-Transformer模型用于智能体推理

NVIDIA, :, Aaron Blakeman, Aaron Thomas, Aastha Jhunjhunwala, Abhibha Gupta, Abhinav Khattar, Adam Rajfer, Adi Renduchintala, Adil Asif, Aditya Vavre, Adriana Flores Miranda, Ahmad Bilal, Aileen Zaman, Ajay Hotchandani, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Alex Gronskiy, Alex Kondratenko, Alex Steiner, Alex Ye, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alice Gatti, Alisa Liu, Alok Kumar, Amar Phanishayee, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Anahita Bhiwandiwalla, Ananth Subramaniam, Andrea Santilli, Andrew Fulks, Andrew McHarg, Andrew Tao, Andrii Skliar, Anjulie Agrusa, Ankur Srivastava, Ankur Verma, Anna Shors, Anna Warno, Antoni-Joan Solergibert I Llaquet, Arham Mehta, Arkadiusz Nowaczynski, Arti Jain, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Atefeh Sohrabizadeh, Avinash Kaur, Avinash Vem, Ayush Dattagupta, Barath Subramaniam Anandan, Bardiya Sadeghi, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bill Thiede, Bita Darvish Rouhani, Bo Deng, Bob Schatz, Boris Ginsburg, Boxin Wang, Brad Nemire, Brandon Norick, Brian Dang, Brian Westphal, Brian Yu, Brucek Khailany, Bryan Catanzaro, Carlo del Mundo, Caryln Aarish, Chankyu Lee, Chantal Hwang, Charbel Sakr, Charles Wang, Charlie Truong, Chen Cui, Cheng Cheng, Cheng-Ping Hsieh, Chenghao Zhang, Chenhui Deng, Chintan Patel, Chris Alexiuk, Christian Cosgrove, Christian Munley, Christine Harvey, Christopher Parisien, Chunyang Shen, Coco Li, Collin Neale, Cynthia Gao, Cyril Meurillon, Dan Gil, Dan Su, Dan Zhao, Dane Corneil, Daniel Afrimi, Daniel Egert, Daniel Korzekwa, Daniel Lo, Daniel Machlab, Daniel Serebrenik, Daniil Sorokin, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, David Yu, Davit Karamyan, Deena Donia, Deep Debroy, Deepak Narayanan, Devin O'Kelly, Dheeraj Peri, Dhruv Nathawani, Di, Wu, Dima Rekesh, Divyanshu Kakwani, Donald Plummer, Dong Anh, Dongfeng Yu, Dongfu Jiang, Donnie Kim, Dorrin Poorkay, Duncan Riach, Dusan Stosic, Dustin VanStee, Eavan Meng, Edgar Minasyan, Edward Lin, Eileen Margaret Peters Long, Elad Sarafin, Elad Segal, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Pham-Hung, Eric Tramel, Eric Yang, Erick Galinkin, Erik Pounds, Erika Goncalves Goncalves, Evan Briones, Evan Wu, Evelina Bakhturina, Evgeny Tsykunov, Ewa Dobrowolska, Faisal Ladhak, Farzan Memarian, Fay Wang, Fei Jia, Felipe Soares, Felipe Vieira Frujeri, Feng Chen, Fengguang Lin, Ferenc Galko, Frank Sun, Frankie Siino, Frida Hou, Gal Hubara Agam, Gal Kaplun, Gantavya Bhatt, Gargi Prasad, Garvit Kulshreshtha, George Armstrong, Gerald Shen, Giulio Borghesi, Gordana Neskovic, Gorkem Batmaz, Grace Lam, Greg Mason, Greg Pauloski, Grigor Nalbandyan, Grzegorz Chlebus, Grzegorz Karch, Guan-Ting Liu, Guoming Zhang, Guyue Huang, Haggai Maron, Haifeng Qian, Haim Elisha, Haoxing Ren, Haran Kumar Shiv Kumar, Haribhau Hud, Harris Nover, Harrison Saturley Hall, Hayate Iso, Helen Ngo, Herbert Hum, Herman Sahota, Hexin Wang, Himanshu Soni, Hovhannes Tamoyan, Hua Li, Huanhuan Chen, Hui Li, Hui Wang, Huy Nguyen, Ian Chiles, Ido Galil, Ido Shahaf, Igor Gitman, Igor Shovkun, Ilya Loshchilov, Ingo Guehring, Itamar Schen, Itay Levy, Itay Neeman, Ivan Moshkov, Izik Golan, Izzy Putterman, Jaemin Choi, Jakub Slowikowski, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jatin Mitra, Jeffrey Glick, Jenny Chen, Jesse Oliver, Jiacheng Xu, Jiafan Zhu, Jialin Song, Jian Zhang, Jiantao Jiao, Jiaqi Zeng, Jie Lou, Jim King, Jimmy Zhang, Jingquan Wang, Jinhang Choi, Jinju Chu, Joey Conway, Joey Guman, Johan Jatko, Johannes Rausch, John Kamalu, John Roberts, Johnny Greco, Johnny Mensel, Jonah Alben, Jonas Yang, Jonathan Cohen, Jonathan Raiman, Joseph Jennings, Joshua Mabry, Joshua Pierce, Joyjit Daw, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kajal Jain, Kan Zhu, Kari Briski, Katherine Cheung, Katherine Luna, Keith Willowhawk, Keith Wyss, Keshav Santhanam, Kevin Shih, Kezhi Kong, Khanh Nguyen, Khushi Bhardwaj, Kirthi Shankar Sivamani, Konstantinos Krommydas, Krishna C. Puvvada, Krzysztof Pawelec, Kumar Anik, Kyle Keprios, Kylie Day, Lawrence McAfee, Leo Du, Leon Derczynski, Li Ding, Linda Liu, Lingjie Wu, Lior Kadoch, Lizzie Wei, Luis Vega, Luke Robison, Lun Su, Maarten Van Segbroeck, Maciej Jakub Mikulski, Maer Rodrigues de Melo, Magda Sypula, Mahan Fathi, Makesh Narsimhan Sreedhar, Makesh Tarun Chandran, Manoj Kilaru, Maor Ashkenazi, Marc Cuevas, Marc Romeijn, Marcin Chochowski, Mark Cai, Mark Mozolewski, Markus Kliegl, Marta Stepniewska-Dziubinska, Martyna Patelka, Mattei Machczynski, Matvei Novikov, Mauricio Ferrato, Maximilian Golub, Mehrzad Samadi, Melissa Corpuz, Mengru Wang, Mengxi Wu, Meredith Price, Meriem Boubdir, Micah Schaffer, Michael Andersch, Michael Boone, Michael Gschwind, Michael Lightstone, Michael Loh, Michal Bien, Michal Zawalski, Michelle Gill, Miguel Martinez, Mikail Khona, Mike Chrzanowski, Mike Houston, Mingyuan Ma, Minseok Lee, Mohamed Fawzy, Mohammad Dabbah, Mohammad Shoeybi, Mostofa Patwary, Nabin Mulepati, Najeeb Nabwani, Namit Dhameja, Narimane Hennouni, Natalie Hereth, Nathaniel Pinckney, Nave Algarici, Nave Assaf, Netanel Haber, Nicholas Knight, Nick Reamaroon, Nickson Quak, Nidhi Bhatia, Nikhil Desai, Nikolai Ludwig, Nima Tajbakhsh, Ning Xu, Nir Ailon, Nirmal Juluru, Nitin Nitin, Ofri Masad, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Olivia Viessmann, Olivier Delalleau, Oluwatobi Olabiyi, Omer Ullman Argov, Omri Puny, Oren Tropp, Pablo Ribalta, Pallab Bhattacharya, Panos Lampropoulos, Parth Mannan, Pasha Shamis, Patrick Legresley, Paul Gibbons, Pavlo Molchanov, Pawel Morkisz, Peter Dykas, Peter Jin, Pierre-Yves Aquilanti, Pinky Xu, Piotr Januszewski, Piotr Laskiewicz, Pooya Jannaty, Prakash Gurumurthy, Pranav Prashant Thombre, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Puhui Meng, Qiyu Wan, Rabeeh Karimi Mahabadi, Rachel Oberman, Rachit Garg, Radha Sri-Tharan, Rahul Kandu, Rakshit Sanadhya, Ran El-Yaniv, Ran Zilberstein, Rasoul Shafipour, Ray Macalisang, Rayen Tian, Reka Kovacs, Renjie Pi, Rick Izzo, Rima Shahbazyan, Rishabh Garg, Rishi Puri, Rita Fernandes Neves, Ritchie Zhao, Ritika Borkar, Ritu Gala, Riyad Islam, Robert Clark, Robert Hesse, Robert Kirby, Roger Waleffe, Rohit Watve, Roi Koren, Ron Banner, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Ryan Stewart, Ryota Egashira, Sadegh Mahdavi, Saee Paliwal, Sagar Singh, Sahil Modi, Salika Dave, Samantha Shinagawa, Samuel Kriman, Sandip Bhaskar, Sangkug Lym, Sanjay Kariyappa, Sanjeev Satheesh, Saran Vikas Murari, Satish Pasumarthi, Saurabh Mishra, Saurav Muralidharan, Scott Hara, Sean Narentharen, Selvaraj Anandaraj, Seonjin Na, Seonmeyong Bak, Seonmyeong Bak, Sepehr Sameni, Seph Mard, Serge Panev, Seth Henneman, Seth Poulos, Shahar Mor, Shantanu Acharya, Shaona Ghosh, Sharath Turuvekere Sreenivas, Sharon Mendelson, Shaun Kotek, Shawn Wang, Shay Aharon, Shaya Gharghabi, Sheng-Chieh Lin, Shi Chen, Shiqing Fan, Shirish Baskaran, Shreya Gopa, Shrimai Prabhumoye, Shubham Pachori, Shubham Toshniwal, Shuoyang Ding, Shwetha Krishnamurthy, Siddharth Singh, Simeng Sun, Sirshak Das, Sivakumar Arayandi Thottakara, Smita Ithape, Somshubra Majumdar, Soumye Singhal, Sri Harsha Singudasu, Sridhar Bhuvanapalli, Srimukh Veccham, Stas Sergienko, Stefania Alborghetti, Stephen Ge, Su Rong, Sugam Dipak Devare, Sukrit Rao, Sumeet Kumar Barua, Sungsoo Ha, Sunny Gai, Suriya Gunasekar, Suseella Panguluri, Suyog Gupta, Sviataslau Hinzburh, Sweta Priyadarshi, Syeda Nahida Akter, Talor Abramovich, Tan Bui, Tanay Varshney, Tatevik Ter-Hovhannisyan, Teodor-Dumitru Ene, Terry Kong, Thanh Do, Tianhe Zhang, Tiffany Moore, Tijmen Blankevoort, Tim Moon, Tiyasa Mitra, Tom Balough, Tomasz Grzegorzek, Tomasz Hliwiak, Tomer Asida, Tomer Bar Natan, Tomer Keren, Tomer Ronen, Tony Salim, Tony Wang, Traian Rebedea, Tugrul Konuk, Twinkle Vashishth, Udi Karpas, Ushnish De, Vahid Noorozi, Venkat Srinivasan, Venmugil Elango, Vibhor Agrawal, Victor Cui, Vijay Korthikanti, Vikas Mehta, Vinay Rao, Virginia Wu, Vitaly Kurin, Vitaly Lavrukhin, Vladimir Anisimov, Vu Pham, Wanli Jiang, Wasi Uddin Ahmad, Wataru Ishihara, Wei Du, Wei Ping, Weiheng Chai, Wenliang Dai, Wesley Helmholz, Will Jennings, Will Zhu, Wojciech Prazuch, Xiaowei Ren, Xiwen Yu, Yan Breek, Yang Chen, Yang Yu, Yangyi Chen, Yaniv Galron, Yashaswi Karnati, Yejin Choi, Yev Meyer, Yi-Fu Wu, Yian Zhang, Ying Lin, Yonatan Geifman, Yonggan Fu, Youngeun Kwon, Yu Yao, Yugi Guvvla, Yuki Huang, Yunsheng Liu, Zach Moshe, Zachary Newell, Zhilin Wang, Zhiyu Li, Zhongbo Zhu, Zhuolin Yang, Zihan Liu, Zijie Yan, Zsolt-Alon Wertheimer

发表机构 * NVIDIA(英伟达)

AI总结 提出550B总参数量、55B激活参数的混合专家Mamba-Attention语言模型Nemotron 3 Ultra,通过20T tokens预训练、1M上下文扩展及后训练,在推理吞吐量提升约6倍的同时保持与顶尖模型相当的精度。

详情
AI中文摘要

我们介绍了Nemotron 3 Ultra,一个总参数量5500亿、激活参数550亿的混合专家Mamba-Attention语言模型。我们在20万亿文本tokens上预训练了Nemotron 3 Ultra,然后将上下文长度扩展到100万tokens,并使用监督微调(SFT)、强化学习(RL)和多教师在线策略蒸馏(MOPD)进行后训练。Nemotron 3 Ultra是我们迄今为止能力最强的模型,采用了多项关键技术——LatentMoE、多token预测(MTP)、NVFP4预训练、多环境RLVR、MOPD和推理预算控制。与公开可用的最先进LLM相比,Nemotron 3 Ultra的推理吞吐量提高了约6倍,同时达到了相当的精度。最先进的精度、高推理吞吐量和100万tokens的上下文长度使Nemotron 3 Ultra成为长时间运行的自主智能体任务的理想选择。我们在HuggingFace上开源了基础、后训练和量化检查点,以及训练数据和配方。

英文摘要

We introduce Nemotron 3 Ultra, a 550 billion total and 55 billion active parameter Mixture-of-Experts Hybrid Mamba-Attention language model. We pre-trained Nemotron 3 Ultra on 20 trillion text tokens, then extended the context length to 1M tokens, and post-trained using Supervised Fine Tuning (SFT), Reinforcement Learning (RL), and Multi-teacher On-Policy Distillation (MOPD). Nemotron 3 Ultra is our most capable model yet, employing multiple key technologies - LatentMoE, Multi Token Prediction (MTP), NVFP4 pre-training, multi-environment RLVR, MOPD, and reasoning budget control. Nemotron 3 Ultra achieves up to ~6x higher inference throughput as compared to state-of-the-art publicly available LLMs while attaining on-par accuracy. The state-of-the-art accuracy, high inference throughput, and 1M token context length make Nemotron 3 Ultra ideal for long-running autonomous agentic tasks. We open-source the base, post-trained, and quantized checkpoints, along with the training data and recipe on HuggingFace.

2606.15079 2026-06-16 cs.CL cs.AI 交叉投稿

Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale

Ling 和 Ring 2.6 技术报告:高效且即时的万亿参数规模智能体智能

Ang Li, Ben Liu, Bin Han, Bin Hu, Bin Jing, Binbin Hu, Bing Li, Cai Chen, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Liang, Chen Qian, Chengfu Tang, Chengyao Wen, Chilin Fu, Chunwei Wu, Cong Zhang, Cunyin Peng, Daixin Wang, Dalong Zhang, Deng Zhao, Dingnan Jin, Dingyuan Zhu, Donghao Zhang, Fan Yuan, Fangzheng Zhao, Fanzhuang Meng, Feifan Wu, Feng Xu, Fengbin Fang, Gangshan Wang, Guodong Yang, Hailin Zhao, Haitao Wang, Haitao Zhang, Hanxiao Zhang, Hanzi Wang, Hao Dai, Hao Liu, Hao Qian, Hao Wu, Haoxiong Liu, Haoyu Xu, Heng Zhang, Hong Liu, Hongliang Zhang, Hongrui Liu, Hongxun Li, Hongzhi Ruan, Huaidong Xiong, Huihuang Zheng, Huikang Tang, Jia Guo, Jia Li, Jia Liu, Jiameng Wang, Jiaming Liu, Jiannan Shi, Jianping Wei, Jiaolong Yang, Jiapeng Wang, Jie Gao, Jie Wang, Jiewei Wu, Jin Yang, Jinjin Li, Jinjing Huang, Jinquan Sun, Jinyao Chen, Juanhui Tu, Jun Liu, Jun Mei, Jun Xu, Jun Zhou, Junjie Ou, Junnan Sipan, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kuan Xu, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Chen, Lei Liang, Lei Xu, Li Tang, Liang Jiang, Liangcheng Fu, Lihui Zhang, Linfeng Shi, Lintao Ma, Liyuan Liu, Longfei Li, Longfei Zheng, Lu Liu, Lu Yu, Man Li, Meiqi Zhu, Meng Li, Mengjie Gao, Mengshu Sun, Mingming Yin, Mingyang Zhang, Mingyuan Fan, Nuo Xu, Pan Tang, Peijie Jiang, Peilong Zhao, Peng Lin, Pingping Liu, Qi Zuo, Qian Zhao, Qiang Cheng, Qianggang Cao, Qiaoben Bao, Qing Cui, Qingyuan Yang, Qitao Shi, Qiyin Huang, Qizheng Zhou, Quan Wan, Runyuan Zhao, Shaomian Zheng, Shaowei Wei, Shengnan Zhang, Shuaicheng Li, Shujie Li, Shuo Zhang, Sikang Bian, Tianchu Yao, Tiange Xu, Tianshu Wang, Ting Guo, Tinghao Wang, Tingwei Huang, Tong Zhao, Tongkai Yang, Wang Hong, Wanli Gu, Wei Lu, Weichang Wu, Weiguang Han, Weiquan Li, Wenbo Shen, Wenjing Fang, Wenzhi Tang, Xiang Shu, Xiao Shi, Xiaodong Yan, Xiaolu Zhang, Xiaopei Wan, Xiaqing Sun, Xin Zhao, Xingyu Lu, Xinxing Yang, Xinyao Tang, Xinyu Kong, Xinyu Liu, Xiong Xu, Xuan Sun, Xudong Han, Xudong Wang, Xujie Shen, Yalin Zhang, Yangyang Hou, Yankun Ren, Yao Zhao, Ye Chen, Yeyang Chen, Yibo Cao, Yifan Zuo, Yijie Chen, Ying Li, Yingjie Song, Yingxue Li, Yiqi Wang, Yixuan Sun, Yizhu Xiao, Yongfei Xu, Yu Liu, Yuchen Fang, Yue Gao, Yue Yu, Yue Zhang, Yuqi Zhang, Yuxiao He, Yuxiao Lu, Yuxin Tian, Yuxuan Li, Yuzhuo Fu, Zhankai Xu, Zhaoxin Huan, Zhenduo Zhang, Zhengke Gui, Zhengyu Huang, Zhenjun Ma, Zhenxuan Pan, Zheping Qu, Zhibo Zhu, Zhidong Fan, Zhigang Huangfu, Zhihao Wang, Zhiqiang Zhang, Zhizhen Liu, Zhuyan Zhou, Zibin Lin, Zihang Zeng, Zihao Wang, Zilong Wang, Ziqi Liu, Zitao Xuan, Zixuan Cheng, Zujie Wen, Zuoli Tang

发表机构 * Ling Team(Ling团队) Inclusion AI

AI总结 提出Ling-2.6和Ring-2.6模型系列,通过架构迁移预训练、混合线性注意力设计及KPop强化学习框架,实现低延迟、强推理与高效部署,开源所有检查点。

详情
AI中文摘要

高效且可扩展的智能体智能需要模型既能提供低延迟响应,又能具备强大的推理能力,同时保持训练、服务和部署的实用性。在本报告中,我们介绍了Ling-2.6和Ring-2.6,这是一系列旨在大规模解决这一挑战的模型。Ling-2.6针对即时响应生成和每个输出令牌的高能力进行了优化,而Ring-2.6则专为更深层次的推理和更高级的智能体工作流而设计。我们没有从头开始训练,而是通过架构迁移预训练和大规模后训练来升级Ling-2.0基础模型。这一升级以模型架构、优化目标、服务系统和智能体训练环境的统一协同设计为指导,从而在模型能力和部署效率上实现改进。在架构层面,我们引入了一种混合线性注意力设计,将闪电注意力与MLA相结合,提高了长上下文训练和解码的效率。为了进一步提升令牌效率,我们通过进化思维链、语言单元策略优化、双向偏好对齐和最短正确响应蒸馏来优化每个输出令牌的能力。对于智能体能力,我们提出了KPop,这是一个强化学习框架,旨在支持Ring-2.6-1T在大规模环境接地数据上的稳定训练。KPop通过跨编码、搜索、工具使用和工作流执行的异步调度提高了训练效率,实现了从复杂的智能体-环境交互中进行可扩展学习。Ling-2.6和Ring-2.6共同为高效、可扩展和开放的智能体系统提供了一条实用路径。我们开源了2.6系列的所有检查点,以支持实用智能体智能的进一步研究和开发。

英文摘要

Efficient and scalable agentic intelligence requires models that can deliver both low-latency responses and strong reasoning capabilities while remaining practical to train, serve, and deploy. In this report, we present Ling-2.6 and Ring-2.6, a family of models designed to address this challenge at scale. Ling-2.6 is optimized for instant response generation and high capability per output token, whereas Ring-2.6 is tailored for deeper reasoning and more advanced agentic workflows. Instead of training from scratch, we upgrade the Ling-2.0 base model through architectural migration pre-training and large-scale post-training. This upgrade is guided by a unified co-design of model architecture, optimization objectives, serving systems, and agent training environments, enabling improvements in both model capability and deployment efficiency. At the architectural level, we introduce a hybrid linear attention design that integrates Lightning Attention with MLA, improving the efficiency of long-context training and decoding. To further enhance token efficiency, we optimize capability per output token through Evolutionary Chain-of-Thought, Linguistic Unit Policy Optimization, bidirectional preference alignment, and shortest-correct-response distillation. For agentic capabilities, we propose KPop, a reinforcement learning framework designed to support stable training of Ring-2.6-1T on large-scale environment-grounded data. KPop improves training efficiency through asynchronous scheduling across coding, search, tool use, and workflow execution, enabling scalable learning from complex agent-environment interactions. Together, Ling-2.6 and Ring-2.6 provide a practical pathway toward efficient, scalable, and open agentic systems. We open-source all checkpoints in the 2.6 family to support further research and development in practical agentic intelligence.

2606.15080 2026-06-16 cs.CL cs.AI 交叉投稿

AdaMame: A Training Recipe for Adaptive Multilingual Reasoning

AdaMame: 一种自适应多语言推理的训练方案

Dayeon Ki, Kevin Duh, Marine Carpuat

发表机构 * University of Maryland(马里兰大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 针对多语言推理中的语言崩溃问题,提出两阶段训练方案AdaMame,通过SFT建立多语言能力,再以自适应GRPO优化推理语言对齐,在准确率、语言保真度和令牌效率上达到帕累托最优。

Comments 20 pages, 5 figures

详情
AI中文摘要

尽管大型推理模型(LRMs)在英语中表现出色,但它们往往无法以查询语言进行推理,这种现象称为语言崩溃。现有的基于强化学习的修复方法通常在准确性目标上添加一个二元语言保真度奖励,但仍然会在准确性、中间轨迹代码切换和过度令牌使用方面产生权衡。在这项工作中,我们提出了AdaMame,一种用于多语言数学推理的两阶段训练方案,通过自适应地将推理语言与查询语言对齐来解决这些限制,同时不损害准确性。第一阶段的SFT在五种语言的自然推理轨迹上进行微调,以建立多语言推理能力。在随后的RL阶段,我们引入了AdaMame-GRPO,这是组相对策略优化(GRPO)的一种改编,其中查询条件的对齐因子在训练过程中逐渐增长,引导模型首先探索多样的推理语言,然后利用查询语言进行推理。在两个基准、两个LRM和12种语言上的评估表明,AdaMame-GRPO在所有基线上实现了推理准确性、语言保真度和令牌效率的帕累托最优性能,在领域外、低资源语言上取得了最强的提升。

英文摘要

While Large Reasoning Models (LRMs) show strong performance in English, they often fail to reason in the language of the query, a phenomenon known as language collapse. Existing RL-based fixes typically add a binary language fidelity reward to the accuracy objective, yet still incur trade-off in accuracy, mid-trace code-switching, and excessive token usage. In this work, we propose AdaMame, a two-stage training recipe for multilingual mathematical reasoning that addresses these limitations by adaptively aligning the reasoning language to the query language without compromising accuracy. The first SFT stage fine-tunes on naturally occurring reasoning traces across five languages to establish multilingual reasoning capability. In the subsequent RL stage, we introduce AdaMame-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) in which a query-conditioned alignment factor grows progressively during training, guiding the model to first explore diverse reasoning languages before exploiting reasoning in the query language. Evaluated across two benchmarks, two LRMs, and 12 languages, AdaMame-GRPO achieves Pareto-optimal performance across reasoning accuracy, language fidelity, and token efficiency over all baselines, with the strongest gains on out-of-domain, lower-resource languages.

2606.15186 2026-06-16 cs.SD cs.AI eess.AS 交叉投稿

FreeSonic: Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing

FreeSonic: 无需训练的时序感知解耦注意力用于精确音频编辑

Yuxuan Jiang, Mingyang Han, Yusheng Dai, Andong Wang, Tianhong Zhou, Jiaxin Ye, Dongxiao Wang, Haoxiang Shi, Boyu Li, Jun Song, Cheng Yu, Bo Zheng, Weibei Dou, Zehua Chen, Jun Zhu

发表机构 * Tsinghua University(清华大学) Alibaba Group(阿里巴巴集团) Monash University(蒙纳士大学) Renmin University of China(中国人民大学) Fudan University(复旦大学)

AI总结 提出FreeSonic,一种无需训练的框架,利用基于Rectified Flow的TangoFlux模型,通过优化反转-逆过程、联合文本-音频注意力图以及调度注意力解耦,实现精确且一致的音频编辑,同时保持背景保真度。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

文本到音频(TTA)生成取得了显著进展,但实现精确且一致的音频编辑仍然是一个主要挑战。然而,现有方法难以平衡时间一致性与背景保留。在本文中,我们提出FreeSonic,一个无需训练的框架,利用最先进的基于Rectified Flow的TangoFlux模型。FreeSonic利用优化的反转-逆过程和联合文本-音频注意力图进行精确的目标片段提取。对于内容编辑,一种新颖的调度注意力解耦将修改限制在目标区域,同时保留原始声学上下文。此外,面向任务的噪声注入增强了音频移除和非刚性替换等任务的通用性。大量实验结果表明,FreeSonic通过提供高保真且高效的解决方案,在精确且一致的音频编辑中实现了优越的平衡。项目和演示:https://free-sonic.github.io/

英文摘要

Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge. However, existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free-sonic.github.io/

2606.15307 2026-06-16 cs.CL cs.AI 交叉投稿

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

利用思维链监督的强化学习进行仇恨和宣传模因的可解释检测

Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam

发表机构 * Hamad Bin Khalifa University(哈马德·本·哈利法大学) Qatar University(卡塔尔大学)

AI总结 提出基于强化学习的后训练方法,结合任务特定奖励和组相对策略优化(GRPO),提升思考型多模态大语言模型在仇恨和宣传模因检测中的分类性能和解释质量。

详情
AI中文摘要

仇恨和宣传模因利用图像与文本之间的相互作用来传达有害意图,而这两种模态单独都无法揭示这种意图。尽管基于思考的多模态大语言模型(MLLMs)在视觉-语言理解方面取得了进展,但它们在模因内容审核中的应用仍未得到充分探索。我们提出了一种基于强化学习的后训练方法,通过任务特定奖励和组相对策略优化(GRPO)来提高思考型MLLMs的分类性能和基于参考的解释质量。具体来说,我们(i)对现成的MLLMs在英语和阿拉伯语基准上的仇恨和宣传模因理解进行了系统的实证研究,(ii)通过蒸馏和多LLM细粒度宣传标注,用弱监督的思维链(CoT)理由扩展了现有的模因数据集,(iii)引入了一个基于GRPO的目标函数,带有思考长度正则化,联合优化分类准确性和解释质量,以及(iv)研究基于共识伪标签的无标签模因的自监督GRPO。在Hateful Memes和ArMeme基准上的实验表明,我们的方法在FHM准确率(从79.9%提高到82.0%,提升高达2.1%)和ArMeme宏F1(从0.536提高到0.612,提升高达7.6个百分点,附带解释;与原始ArMeme基准相比提升6.1个百分点)上优于先前报告的结果,同时生成自然语言解释。在ArMeme上,序列分类基线在原始准确率方面仍然更强,而我们的方法提供了更平衡的每类性能以及解释。我们公开发布了代码、数据扩展和评估资源。

英文摘要

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.

2606.15331 2026-06-16 cs.IR cs.AI 交叉投稿

HoloRec: Holistic Encoding and Interleaved Reasoning for Generative Recommendation

HoloRec:面向生成式推荐的整体编码与交错推理

Shuqi Zhao, Jingsong Su, Xiang Liu, Xingzhi Yao, Yiming Qiu, Huimu Wang, Liang Lin, Pengbo Mo, Mingming Li, Jiao Dai, Jizhong Han, Songlin Hu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(信息工程研究所,中国科学院) School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) JD.com(京东公司)

AI总结 提出HoloRec,通过多粒度嵌套残差量化构建层次语义编码矩阵,实现内生的思维链推理,无需外部标注,在稀疏场景下显著提升推荐准确率。

详情
AI中文摘要

将任务建模为序列生成的生成式推荐模型克服了传统级联架构的目标碎片化问题,但现有方法仍存在缺乏层次结构用于多步推理的扁平语义表示,以及需要昂贵标注且与生成目标脱节的外部构建思维链(CoT)等问题。我们提出HoloRec,一种内生的思维链推荐机制,通过多粒度嵌套残差量化构建层次语义编码矩阵,并由整体重建损失优化,统一了表示、推理和生成。HoloRec支持两种推理模式:非思考模式使用轻量级多粒度监督对齐进行快速预测,思考模式采用交错推理方案动态生成CoT步骤,将推理直接嵌入生成过程,无需外部数据。在多个公开推荐数据集上的实验表明,HoloRec持续优于基线,在稀疏场景下尤其显著,且思考模式在仅增加适度推理开销的情况下实现了比非思考模式更高的准确率。

英文摘要

Generative recommendation models that formulate the task as sequence generation overcome the objective fragmentation problem of traditional cascade architectures, yet existing approaches still suffer from flat semantic representations lacking hierarchical structure for multi-step reasoning and an externally constructed chain-of-thought (CoT) that requires expensive annotations and remains disconnected from the generation objective. We propose HoloRec, an endogenous chain-of-thought recommendation mechanism that unifies representation, reasoning, and generation by constructing a hierarchical semantic encoding matrix via multi-granularity nested residual quantization optimized by a holistic reconstruction loss. HoloRec supports two inference modes: a non-thinking mode that uses lightweight multi-granularity supervised alignment for fast prediction, and a thinking mode that employs an interleaved reasoning scheme to generate CoT steps on the fly, directly embedding reasoning into the generation process without external data. Experiments on multiple public recommendation datasets demonstrate that HoloRec consistently outperforms baselines, with especially significant gains in sparse scenarios, and the thinking mode achieves better accuracy than the non-thinking mode with only modest inference overhead.

2606.15405 2026-06-16 cs.CL cs.AI 交叉投稿

T-Mem: Memory That Anticipates, Not Archives

T-Mem:预测而非归档的记忆

Weidong Guo, Dakai Wang, Zixuan Wang, Hui Liu, Yu Xu

发表机构 * Tencent(腾讯)

AI总结 提出T-Mem架构,通过写时触发机制覆盖描述性和关联性回忆,解决长对话中语义关联检索问题,在LoCoMo和LoCoMo-Plus上达到SOTA。

详情
AI中文摘要

长期记忆对于对话代理在扩展对话中保持连贯性、遵循多个会话前做出的承诺以及根据每个用户调整行为至关重要。然而,当前基于LLM的长期对话记忆受限于查询与存储内容(包括词汇和稠密向量)之间的相似性。当查询和记忆共享表面特征(如措辞或命名实体,我们称之为描述性)时,该方法有效。但它忽略了另一类同样有价值的案例,即查询和记忆不共享表面特征,仅通过潜在语义弧(关联性)相连。在这种机制下,现有的长期记忆系统普遍失败。覆盖这另一半使得助手首次能够主动将过去的对话作为语义资产。在记忆方面,这是认知科学中称为情景未来思维的工程对应物:预演过去的经验,以便在未来需要找到它的上下文中使用。我们将这些写时预演称为触发器。我们提出T-Mem,这是第一个覆盖描述性和关联性回忆的长期对话记忆架构。在两种证据粒度(单个事实和完整交流)上,T-Mem实例化一个描述性触发器家族和一个关联性触发器家族,使得每个记忆都能从表面相似和相关性约束的查询中访问。作为实证验证,T-Mem在LoCoMo和LoCoMo-Plus上达到了最先进水平。

英文摘要

Long-term memory is essential for conversational agents to remain coherent across extended dialogues, follow through on commitments made many sessions earlier, and adapt their behaviour to each user. Current LLM-backed long-term conversational memory, however, is reachability-bounded by the similarity between a query and stored content, both lexical and dense-vector. The approach is effective when query and memory share surface features such as wording or named entities (we call this descriptive). But it misses another, equally valuable class of cases, where query and memory do not share surface features and are tied only by a latent semantic arc (associative). On this regime prevailing long-term memory systems collectively fail. Covering this other half is what allows an assistant, for the first time, to actively draw on past dialogue as a semantic asset. On the memory side, this is the engineering counterpart of what cognitive science calls episodic future thinking: rehearsing past experience for the future contexts under which it will need to be found. We call these write-time rehearsals triggers. We propose T-Mem, the first long-term conversational memory architecture that covers both descriptive and associative recall. At each of two evidence granularities, single facts and full exchanges, T-Mem instantiates one descriptive trigger family and one associative trigger family, so that every memory remains reachable from both surface-similar and relevance-bound queries. As empirical validation, T-Mem reaches state-of-the-art on both LoCoMo and LoCoMo-Plus.

2606.15412 2026-06-16 cs.CL cs.AI 交叉投稿

Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?

基于大语言模型的少样本生物医学关系抽取:监督学习的可行替代方案?

Jakob Mraz, Tomaž Curk, Blaž Zupan

发表机构 * University of Ljubljana(卢布尔雅那大学) Baylor College of Medicine(贝勒医学院)

AI总结 研究利用大语言模型进行少样本生物医学关系抽取,比较成对分类与联合生成两种方法,发现联合生成更精确高效,在宏F1上超越监督基线,尤其在稀有关系类型上表现突出。

详情
AI中文摘要

生物医学关系抽取(BioRE)是将生物医学文献转化为结构化知识的关键步骤。然而,现有方法大多依赖在昂贵标注数据集上训练的监督模型,限制了其在关系类型和领域上的可扩展性和适应性。我们研究了基于提示学习的大语言模型(LLMs)进行少样本BioRE,并比较了两种任务形式:成对分类(预测单个实体对的关系)和联合生成(在单次模型调用中提取多个关系)。在BioREDirect数据集上的实验揭示了明确的精确率-召回率权衡。成对分类实现了更高的召回率,而联合生成更精确且计算效率更高。最佳模型达到了0.44的微F1分数,显著优于之前的少样本结果(0.34),但仍低于监督基线(0.56)。这一差距大部分归因于一个定义模糊的关系类型。当使用宏F1评估时(在类别不平衡设置下更能反映跨关系类型的性能),基于提示的方法优于监督基线(0.45 vs. 0.38),尤其在稀有关系类型上。这些发现突显了LLMs在低资源场景下进行BioRE的潜力,并强调了定义良好的关系模式的重要性。

英文摘要

Biomedical relation extraction (BioRE) is a key step in transforming biomedical literature into structured knowledge. However, most existing approaches rely on supervised models trained on costly annotated datasets, limiting their scalability and adaptability across relation types and domains. We investigate few-shot BioRE using prompt-based learning with large language models (LLMs) and compare two task formulations: pairwise classification, which predicts relations for individual entity pairs, and joint generation, which extracts multiple relations in a single model call. Experiments on the BioREDirect dataset reveal a clear precision-recall trade-off. Pairwise classification achieves higher recall, whereas joint generation is more precise and computationally efficient. The best-performing model achieves a micro-F1 score of 0.44, substantially outperforming previous few-shot results (0.34) while remaining below the supervised baseline (0.56). Much of this gap is attributable to a single ambiguously defined relation type. When evaluated using macro-F1, which better captures performance across relation types in an imbalanced setting, prompt-based approaches outperform the supervised baseline (0.45 vs. 0.38), particularly on rare relation types. These findings highlight the potential of LLMs for BioRE in low-resource settings and underscore the importance of well-defined relation schemas.

2606.15419 2026-06-16 cs.CL cs.AI 交叉投稿

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

让LLMs互相评判:面向医学问答的多智能体同行评审推理

Zaifu Zhan, Shuang Zhou, Rui Zhang

发表机构 * University of Minnesota(明尼苏达大学)

AI总结 提出多智能体同行评审推理方法,让多个LLM独立生成思维链推理并相互评估,选择最优推理链输出答案,在三个医学问答数据集上优于单模型和多数投票方法。

Comments Accepted by the Journal of the American Medical Informatics Association

详情
AI中文摘要

目的:提升大语言模型在医学问答中的准确性、可解释性和鲁棒性。方法:我们设计了一种多智能体同行评审推理方法,其中多个LLM智能体独立生成包含候选答案的思维链推理,然后作为同行评审者评估彼此推理的事实正确性和逻辑合理性。选择评分最高的推理链生成最终答案。使用五个最先进的LLM(Llama-3.1-8B、Qwen2.5-7B、Phi-4、DeepSeek-LLM-7B、GPT-oss-20B)在三个基准数据集(HeadQA、MedQA-USMLE和PubMedQA)上进行实验。性能与单模型思维链推理和基于思维链的多数投票进行了比较。结果:同行评审推理始终优于两种基线。最佳模型组合在数据集上的平均准确率达到0.820,超过了最强单模型(0.777)和多数投票集成(最高0.789)。该方法还随着参与模型数量的增加而有效扩展,同时同行评估可靠地区分了高质量和低质量的推理链。结论:提出的多智能体同行评审推理方法使LLM既能作为求解者又能作为评估者,在医学问答中取得了优越性能。通过强调推理质量而非仅答案一致性,该方法提高了准确性、可解释性和鲁棒性,为可信赖的生物医学AI系统提供了有前景的方向。

英文摘要

Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting. Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. Conclusion: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.

2606.15427 2026-06-16 cs.LG cs.AI cs.CV 交叉投稿

Post-Launch Capability Expansion of Vision-Language Models via Prompting for On-Orbit Spacecraft Inspection

通过提示实现视觉语言模型发射后能力扩展用于在轨航天器检测

Nicholas A. Welsh, Lennon J. Shikhman, Monty Nehru Attazs, Seemanthini K. Putane, Van Minh Nguyen, Ryan T. White

发表机构 * Florida Institute of Technology(佛罗里达理工学院) University of Florida(佛罗里达大学)

AI总结 研究利用提示驱动的视觉语言模型在轨扩展语义能力,无需修改权重即可通过自然语言提示检测新航天器部件,在129张图像上零样本实例分割达到0.385 mAP@0.5。

Comments 5 pages, 1 figure, 2 tables. Equal contribution by Nicholas A. Welsh and Lennon Shikhman. Published in the CVPR2026 Workshop on AI4Space

详情
AI中文摘要

星载检测系统通常在发射前部署感知模型,之后更新模型权重或扩展固定标签集在操作上变得不可行。虽然监督模型可以在飞行前集成,但在轨道上添加新的语义能力需要重新训练和重新上传参数。我们研究提示驱动的视觉语言模型是否能够实现发射后语义扩展,允许通过自然语言提示指定新的航天器部件,而无需修改星载权重。我们在一个包含129张先前未见卫星图像的测试集上,采用严格冻结的单次推理协议,评估了航天器部件的零样本实例分割。在固定全局阈值且无后处理的情况下,SAM3达到0.385 mAP@0.5和0.267 mAP@0.5:0.95。性能强烈依赖于尺度:大型结构元素如航天器主体(0.639 AP@0.50)和太阳翼(0.598 AP@0.5)定位可靠,而相对较小的附件如天线(0.221 AP@0.5)和推进器(0.081 AP@0.5)仍然困难。提示形式影响性能,包含空间和几何描述符的结构化提示相比短类别名称提示提升高达82%。该模型在当代嵌入式GPU的内存和计算范围内运行,表明提示驱动的定位可以为主要航天器结构提供发射后语义扩展的实用机制,同时突显了在轨道域偏移下细粒度部件零样本定位的局限性。

英文摘要

Spaceborne inspection systems often deploy perception models prior to launch, after which updating model weights or expanding fixed label sets becomes operationally impractical. While supervised models can be integrated pre-flight, adding new semantic capabilities in orbit requires retraining and re-uploading parameters. We investigate whether prompt-driven vision--language models can enable post-launch semantic expansion, allowing new spacecraft components to be specified via natural-language prompts without modifying onboard weights. We evaluate zero-shot instance segmentation of spacecraft components under a strictly frozen, single-pass inference protocol on a test set of $129$ images of previously unseen satellites. Under fixed global thresholds and no post-processing, SAM3 achieves $0.385$ mAP@$0.5$ and $0.267$ mAP@$0.5{:}0.95$. Performance is strongly scale-dependent: large structural elements like spacecraft bodies ($0.639$ AP@$0.50$) and solar arrays ($0.598$ AP@$0.5$) localize reliably, while relatively small appendages like antennas ($0.221$ AP@$0.5$) and thrusters ($0.081$ AP@$0.5$) remain difficult. Prompt formulation influences performance, with structured prompts incorporating spatial and geometric descriptors yielding up to $82%$ improvement over short category-name prompts. The model operates within the memory and compute envelope of contemporary embedded GPUs, suggesting prompt-driven grounding can provide a practical mechanism for post-launch semantic extension of dominant spacecraft structures while highlighting limitations of zero-shot localization for fine-scale components under orbital domain shift.

2606.15540 2026-06-16 cs.SD cs.AI cs.MM eess.AS 交叉投稿

AP-GRPO: Anchor-Gated Phonetic Alignment with Policy Optimization for Pathological Speech Reconstruction

AP-GRPO: 基于锚定门控语音对齐与策略优化的病理语音重建

Pengfei Zhang, Hoang H Nguyen, Yutong Song, Wenjun Huang, Tahmid Imtiaz Imu, Henry Peng Zou, Jiang Wu, Honghui Xu, Amir M. Rahmani

发表机构 * University of California Irvine(加州大学尔湾分校) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Kennesaw State University(肯尼索州立大学)

AI总结 针对神经退行性和神经运动障碍患者的病理语音,提出AP-GRPO框架,通过锚定门控奖励和语音对齐奖励优化语音语言模型,实现忠实重建,并揭示疾病特异性模式。

详情
AI中文摘要

来自神经退行性和神经运动障碍患者的病理语音通常在声学上失真且语言上支离破碎,因此需要病理语音重建来从失真和不完整的语音录音中恢复预期的文本内容。关键在于,此类录音很少均匀退化:一些单词或短语仍然可靠,可以作为可听锚点来重建受损的周围内容。我们引入了锚定门控语音组相对策略优化(AP-GRPO),这是一个带有语音奖励的GRPO框架,通过可听锚点保留和锚点间语音兼容性来对齐语音语言模型(SLM)与原始语音信号。AP-GRPO包括:(i)一个锚定门控奖励,用于匹配清晰区域中的可靠可听锚点;(ii)一个锚点间语音对齐奖励,用于评估恢复的内容是否在语音上得到相应受损锚点间语音片段的支持。在四种疾病条件下,AP-GRPO提高了忠实语音重建,并且学习的锚点约束自动适应每种条件,从而揭示可解释的疾病特异性特征:严重发音退化条件需要更强的锚点强制,而轻度损伤或语言障碍条件则更依赖于锚点间恢复的语音对齐。

英文摘要

Pathological speech from patients with neurodegenerative and neuromotor disorders is often acoustically distorted and linguistically fragmented, making pathological speech reconstruction necessary to recover intended textual content from distorted and incomplete speech recordings. Crucially, such recordings are rarely uniformly degraded: some words or short phrases remain reliable and can serve as audible anchors for reconstructing the corrupted surrounding content. We introduce Anchor-gated Phonetic Group Relative Policy Optimization (AP-GRPO), a GRPO framework with phonetic reward that aligns speech language models (SLMs) through audible-anchor preservation and inter-anchor phonetic compatibility to the original speech signal. AP-GRPO consists of: (i) an anchor-gated reward that matches reliable audible anchors in clear regions; and (ii) an inter-anchor phonetic alignment reward that evaluates whether recovered contents are phonetically supported by the corresponding corrupted inter-anchor speech span. Across four disease conditions, AP-GRPO improves faithful speech reconstruction, and the learned anchor constraint automatically adapts to each condition and thus reveals interpretable disease-specific profiles: conditions with severe articulatory degradation require stronger anchor enforcement, whereas milder impairment or linguistically impaired conditions rely more on phonetic alignment for inter-anchor recovery.

2606.15566 2026-06-16 cs.CL cs.AI 交叉投稿

LLM-Assisted Stance Detection in Scientific Discourse: A Test Case in Bayesian Cognitive Science

科学话语中的立场检测:以贝叶斯认知科学为例的LLM辅助方法

Eyup Engin Kucuk, Tarik Kelestemur, Ömer Dağlar Tanrikulu

发表机构 * University of New Hampshire(新罕布什尔大学) Independent Researcher(独立研究员)

AI总结 提出结合理论驱动编码手册、专家标注和诊断门控提示优化的方法,利用三个前沿LLM检测贝叶斯模型在科学文本中的现实主义/工具主义立场,在210篇文章的6858条引文中达到0.78的联合信度。

Comments 9 pages, 4 figures; Code and data: https://github.com/EyupEK/autoresearch_bayes

详情
AI中文摘要

定性编码是社会科学的核心,但专家标注难以规模化。LLM提供了一种可能的扩展,但当目标构念是解释性的、理论负载的且仅间接表达时,需要仔细验证。我们在一个困难案例中研究这个问题:检测作者是将贝叶斯模型视为心理和神经机制的描述(现实主义)还是有用的数学工具(工具主义)。我们的方法结合了理论驱动的编码手册、专家编码的参考标注、诊断门控提示优化搜索(为三个前沿LLM:GPT-5.1、Claude Sonnet 4.6、Gemini 3 Pro Preview生成共享的零样本提示)以及多评估者信度分析。最终提示在保留样本上实现了0.76的综合信度分数(ICC=0.79和α=0.74的调和平均数),所有诊断均满足。在来自210篇文章的6858条引文上部署后,三个LLM达到了显著的引文级一致性(ICC=0.80;α=0.76;综合=0.78)和近乎完美的文章级排名稳定性(评估者对之间r=0.96-0.97)。语料库总体偏向弱现实主义,但文章级立场很少一致:仅1.4%的文章使用单一波段,而59.5%的文章跨越四个或更多波段。低层感知/运动文章比高层认知文章高出8.8个现实主义点(p<.001,d=0.60),量化了长期持有的定性直觉。我们将其作为专家主导的案例研究呈现;该框架旨在推广到类似的理论密集型任务,而非所有定性分析。

英文摘要

Qualitative coding is central to social science, but expert annotation is difficult to scale. LLMs offer a possible extension, yet require careful validation when the target construct is interpretive, theoretically loaded, and only indirectly expressed. We study this problem in a difficult case: detecting whether authors treat Bayesian models as descriptions of mental and neural mechanisms (realism) or as useful mathematical tools (instrumentalism). Our method combines a theory-driven codebook, expert-coded reference annotations, a diagnostic-gated prompt-optimization search yielding a shared zero-shot prompt for three frontier LLMs (GPT-5.1, Claude Sonnet 4.6, Gemini 3 Pro Preview), and multi-rater reliability analysis. The final prompt achieved a held-out combined reliability score of 0.76 (harmonic mean of ICC = 0.79 and $α$ = 0.74), with all diagnostics satisfied. Deployed on 6,858 quotes from 210 articles, the three LLMs reached substantial quote-level agreement (ICC = 0.80; $α$ = 0.76; combined = 0.78) and near-perfect article-level rank stability ($r$ = 0.96-0.97 across rater pairs). The corpus was predominantly weakly realist, but article-level stances were rarely uniform: only 1.4% of articles used a single band, while 59.5% spanned four or more. Low-level perception/motor articles scored 8.8 Realism points higher than high-level cognition articles ($p < .001$, $d = 0.60$), quantifying a long-held qualitative intuition. We present this as an expert-led case study; the framework is intended to generalize to similar theoretically demanding tasks, not to all qualitative analysis.

2606.15694 2026-06-16 cs.MM cs.AI cs.CV cs.LG 交叉投稿

MAF: Multimodal Adaptive Few-shot Prompting for Sentiment Analysis with MLLMs

MAF: 面向情感分析的多模态自适应少样本提示方法

Hangling Xie

发表机构 * Nanjing University of Posts and Telecommunications(南京邮电大学)

AI总结 提出MAF框架,通过动态检索与查询相关的多模态示例,利用轻量级系数生成网络实时融合多模态相似度,结合多数投票提升MLLM在情感分析中的性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)在理解复杂多模态内容方面展现了卓越的能力。然而,它们在情感分析中的性能对提示设计高度敏感,导致静态、统一应用的提示本质上无法捕捉不同输入中变化的细微多模态线索。为了解决这一局限性,我们提出了一种多模态自适应少样本提示(MAF)框架,该框架动态检索并整合与查询相关的示例,以上下文敏感的方式激发MLLM的情感推理能力。MAF构建了一个示例检索模块,整体编码面部表情、场景上下文和文本语义,并引入唇部运动幅度检测机制以在多人物场景中准确识别说话者。与传统的固定权重融合不同,我们训练了一个轻量级系数生成网络,实时输出查询条件的融合权重,从而实现多模态相似度分数的加权聚合,以检索最具信息量的前K个示例。通过MLLM生成的多个候选输出进行多数投票,进一步增强了预测稳定性。在公开基准数据集上的大量实验表明,MAF相比相应的骨干变体取得了显著且一致的性能提升,并与强大的多模态情感分析基线保持竞争力。

英文摘要

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in understanding complex multimodal content. However, their performance in sentiment analysis exhibits acute sensitivity to prompt design, rendering static, uniformly applied prompts inherently suboptimal for capturing the nuanced multimodal cues that vary across inputs. To address this limitation, we propose a Multimodal Adaptive Few-Shot Prompting (MAF) framework, which dynamically retrieves and integrates query-relevant demonstrations to elicit the sentiment reasoning capabilities of MLLMs in a context-sensitive manner. MAF constructs a demonstration retrieval module that holistically encodes facial expressions, scene context, and textual semantics, with a lip movement amplitude detection mechanism introduced for accurate speaker identification in multi-person scenarios. Departing from conventional fixed-weight fusion, a lightweight coefficient generation network is trained to output query-conditioned fusion weights in real time, enabling weighted aggregation of multimodal similarity scores to retrieve the top-K most informative demonstrations. Prediction stability is further enhanced through majority voting over multiple candidate outputs generated by the MLLM. Extensive experiments on public benchmark datasets demonstrate that MAF achieves substantial and consistent performance improvements over the corresponding backbone variants and remains competitive with strong multimodal sentiment-analysis baselines.

2606.15733 2026-06-16 cs.CL cs.AI 交叉投稿

Vernier: Probing Representational Misalignment Behind Lexical Gaps in Causal Reasoning

Vernier: 探测因果推理中词汇间隙背后的表征错位

Zhenyu Yu

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院)

AI总结 通过配对视图权重更新和激活修补,发现语言模型在因果推理中因变量名替换导致的答案差异源于表征错位而非信息丢失,并在Qwen和Llama模型上验证了反事实增强的对齐效果。

详情
AI中文摘要

指令微调的语言模型在将其英文变量名替换为类型保留的占位符后,可能会对相同的因果推理问题给出不同的答案,尽管结构因果模型和正确答案未变。我们探究这种词汇间隙是否反映了占位符视图中的信息丢失,或是从仍携带答案相关内容的表征中读取时的错位。Vernier 使用配对视图权重更新作为工具,然后检查间隙闭合后留下的机制。在工作状态下,证据支持表征错位。变量名探针在占位符视图上变得更准确,对 Qwen-7B、Qwen-14B 和 Llama-3.1-8B 的激活修补表明,决策令牌表征可以在视图间传递答案身份。重新对齐视图的更新是对原始提示和占位符提示的反事实增强,而答案子空间 KL 主要增强了中间答案信念的一致性。成功受限于模型家族、规模和任务。CRASS 转移在 Qwen 规模和 Llama 上可靠,e-CARE 仍然较弱,初步的非因果重命名任务显示出类似的定性模式。

英文摘要

Instruction-tuned language models can answer the same causal-reasoning question differently after its English variable names are replaced by type-preserving placeholders, although the structural causal model and the gold answer are unchanged. We ask whether this lexical gap reflects information loss in the placeholder view or a misaligned read-out from a representation that still carries answer-relevant content. Vernier uses a paired-view weight update as an instrument and then inspects the mechanism left after the gap closes. In the working regimes, the evidence favours representational misalignment. A variable-name probe becomes more accurate on the placeholder view, and activation patching on Qwen-7B, Qwen-14B, and Llama-3.1-8B shows that the decision-token representation can transfer answer identity between views. The update that realigns the views is counterfactual augmentation over original and placeholder prompts, while the answer-subspace KL mainly sharpens intermediate answer-belief agreement. Success is bounded by model family, scale, and task. CRASS transfer is reliable across Qwen scales and Llama, e-CARE remains weak, and preliminary non-causal rename tasks show a similar qualitative pattern.

2606.15741 2026-06-16 cs.CL cs.AI 交叉投稿

A Self Consistency Based Reranking for Narrative Question Answering

基于自一致性的叙事问答重排序

Molham Mohamed, Ali Hamdi

发表机构 * GitHub

AI总结 提出自一致性重排序框架,通过生成多个候选答案并基于语义一致性选择最终答案,提升叙事问答的鲁棒性和准确性。

详情
AI中文摘要

叙事问答(NQA)是自然语言处理中一项具有挑战性的任务,要求模型理解长文本上下文、捕捉事件间关系并生成连贯的响应。尽管预训练语言模型近期取得了进展,但大多数现有方法在推理时依赖单一解码输出,使其对生成变异性敏感,常导致答案不完整或不一致。为解决这一局限,我们提出了一种基于自一致性的自集成重排序框架用于叙事问答。该方法为每个故事-问题对生成多个候选答案,并根据生成响应间的语义一致性选择最终答案。这使得模型能够探索多样化的答案表述,同时通过基于共识的选择提高鲁棒性,而无需修改底层架构。该框架将预训练和微调的语言生成与多答案推理及基于相似度的重排序相结合。我们在NarrativeQA数据集上使用多种模型(包括FLAN-T5 Base和Small以及Pegasus-Large)在基线和微调设置下评估了所提方法。实验结果表明,该方法在所有模型上均持续提升了性能。特别是,FLAN-T5-Base在结合自集成推理后,性能从82.32%提升至86.66%(+4.34%),取得了最佳整体性能。此外,Pegasus-Large的提升最大,从72.50%提升至87.07%(+14.57%),凸显了所提策略的有效性。

英文摘要

Narrative question answering (NQA) is a challenging task in natural language processing that requires models to understand long textual contexts, capture relationships across events, and generate coherent responses. Despite recent advances in pretrained language models, most existing approaches rely on a single decoding output during inference, making them sensitive to generation variability and often resulting in incomplete or inconsistent answers .To address this limitation, we propose a self-ensemble Self-Consistency-Based reranking framework for narrative question answering. The proposed method generates multiple candidate answers for each story-question pair and selects the final answer based on semantic agreement among the generated responses. This allows the model to explore diverse answer formulations while improving robustness through consensus-based selection without requiring modifications to the underlying architecture .The framework combines pretrained and fine-tuned language generation with multi-answer inference and similarity-based reranking. We evaluate the proposed approach on the NarrativeQA dataset using multiple models, including FLAN-T5 (Base and Small) and Pegasus-Large, under both baseline and fine-tuned settings .Experimental results demonstrate that the proposed method consistently improves performance across all models. In particular, FLAN-T5-Base achieves the best overall performance, improving from 82.32% to 86.66% (+4.34%) when combined with self-ensemble inference. Additionally, the largest improvement is observed with Pegasus-Large, which increases from 72.50% to 87.07% (+14.57%), highlighting the effectiveness of the proposed strategy.

2606.15778 2026-06-16 cs.CL cs.AI cs.LG cs.SI 交叉投稿

DYNA : Dynamic Episodic Memory Networks for Augmenting Large Language Models with Temporal Knowledge Graphs in Continuous Learning

DYNA:用于在持续学习中通过时间知识图谱增强大语言模型的动态情景记忆网络

Ali Sarabadani, Mahtab Tajvidiyan

发表机构 * Department of Computer Engineering and Information Technology, University of Qom(卡姆大学计算机工程与信息科技系)

AI总结 提出DYNA框架,通过时间知识图谱作为外部可更新记忆,增强冻结的大语言模型,在三个时间召回任务上减少约7%的灾难性遗忘并提升约5%的时间排序能力。

详情
AI中文摘要

大语言模型(LLMs)难以在不遗忘或昂贵重训练的情况下融入新知识。我们提出DYNA,一个轻量级框架,通过时间知识图谱增强冻结的LLM,其中事件作为节点,时间关系作为有向、带时间戳的边。该图谱作为外部可更新记忆。在查询时,DYNA通过随机游走和中心性度量检索相关节点,然后增强LLM的响应。在三个时间召回任务上评估,DYNA相比微调减少了约7%的灾难性遗忘,相比标准RAG提升了约5%的时间排序能力。更高的图谱聚类系数与更好的检索相关,表明图谱结构的重要性。贡献:(1)将情景记忆作为时间知识图谱,(2)无需重训练的LLM增强,(3)图谱属性作为检索性能的预测因子。

英文摘要

Large Language Models (LLMs) struggle to incorporate new knowledge without forgetting or costly retraining. We propose DYNA, a lightweight framework that augments a frozen LLM with a temporal knowledge graph where events are nodes and temporal relations are directed, timestamped edges. The graph serves as an external, updatable memory. At query time, DYNA retrieves relevant nodes via random walks and centrality measures, then augments the LLM's response. Evaluated on three temporal recall tasks, DYNA reduces catastrophic forgetting by ~7% compared to fine-tuning and improves temporal ordering by ~5% over standard RAG. Higher graph clustering coefficients correlate with better retrieval, showing that graph structure matters. Contributions: (1) episodic memory as temporal KG, (2) retraining-free LLM augmentation, (3) graph properties as predictors of retrieval performance.

2606.15819 2026-06-16 cs.CV cs.AI 交叉投稿

SACE: Concept Erasure at the Semantic Singularity in Visual Autoregressive Models

SACE: 视觉自回归模型中的语义奇点概念擦除

Siya Yang, Nanxiang Jiang, Zhaoxin Fan, Yunfeng Diao

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院) School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院)

AI总结 针对视觉自回归模型应用现有擦除技术导致语义崩溃和视觉伪影的问题,提出语义奇点公理并通过增量语义显著性分析验证,进而引入首个尺度感知的概念擦除框架SACE,在首尺度耦合熵正则化擦除目标与恢复性保存损失,实现精确概念擦除。

详情
AI中文摘要

视觉自回归(VAR)模型的快速进步为高保真文本到图像合成开辟了变革性前沿,同时也加剧了对生成内容安全对齐的担忧。将现有擦除技术简单应用于VAR模型会导致灾难性的语义崩溃和视觉伪影,因为这些技术主要针对扩散模型的同质去噪步骤设计。为应对这一基础性挑战,我们首先提出语义奇点公理,该公理认为提示中嵌入的任何目标语义概念在Scale-0处被明确锁定。然后通过我们提出的增量语义显著性分析(ISSA)严格验证该公理,该分析还使社区能够透明地检查从粗到细的语义注入过程。在此洞察指导下,我们引入了首个针对VAR模型的尺度感知概念擦除框架(SACE)。通过将干预严格限制在首尺度,我们的方法耦合了熵正则化擦除目标以防止高熵采样退化,以及恢复性保存损失以安全锚定纠缠良性先验的完整性。大量实验表明,我们的方法在最小训练开销下实现了跨多个领域的手术式概念擦除性能,及时而优雅地解决了新兴VAR架构中固有的关键安全漏洞。代码可在 https://github.com/limerenceysy/SACE 获取。

英文摘要

The rapid progress of visual autoregressive (VAR) models has unlocked a transformative frontier for high-fidelity text-to-image synthesis, while heightening concerns over the safety alignment of generated content. Naive application of existing erasure techniques to VAR models causes catastrophic semantic collapse and visual artifacts, since they are predominantly designed for the homogeneous denoising steps of diffusion models. To address this foundational challenge, we first propose the Semantic Singularity Axiom, which posits that any target semantic concept embedded within a prompt is definitively locked at Scale-0. Then rigorously validate this axiom through our proposed Incremental Semantic Saliency Analysis (ISSA),which also enable the community to transparently inspect the coarse-to-fine semantic injection process. Guided by this insight, we introduce the first scale-aware concept erasure framework (SACE) for VAR models. By strictly confining interventions to the first scale, our approach couples an Entropy-Regularized Erasure Objective to prevent high-entropy sampling degeneration, alongside a restorative preservation loss to safely anchor the integrity of entangled benign priors. Extensive experiments demonstrate that our method achieves surgical concept erasure performance across various domains with minimal training overhead, timely and elegently resolute the critical safety vulnerabilities inherent in emerging VAR architectures. Code is available at: https://github.com/limerenceysy/SACE}{https://github.com/limerenceysy/SACE.

2606.15821 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

The Truth Stays in the Family: Enhancing Contextual Grounding via Inherited Truthful Heads in Model Lineages

真相留在家族中:通过模型谱系中继承的真相头增强上下文基础

Miso Choi, Seonga Choi, Mincheol Kwon, Woosung Joung, Jinkyu Kim, Jungbeom Lee

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 研究发现基础LLM与下游变体间存在上下文真相分数的强继承性,提出TruthProbe软门控策略放大真相头以提升上下文真实性并减少多模态幻觉。

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLM)的最新进展产生了许多共享基础LLM的专业多模态LLM(MLLM),形成了不同的模型谱系。基础LLM与下游变体之间是否存在基本的行为联系尚不清楚。我们通过量化头部级别的上下文真相分数来研究这个问题。在包括基于Vicuna、Qwen2.5、LLaMA2和Mistral的模型在内的多种LLM和MLLM谱系中,我们发现真相分数在模型家族内被强烈保留,即使在指令调优或多模态适应后也是如此。我们进一步表明,这种继承与注意力头权重保留一致,并且上下文真相头关注查询相关的证据。基于这一发现,我们提出了TruthProbe,一种软门控策略,在保留其他头部贡献的同时放大上下文真相头。TruthProbe在HaluEval上提高了上下文真实性,并在POPE和CHAIR上减少了多模态幻觉,基础LLM的真相分数有效转移到其微调的LLM和MLLM后代。代码可在https://github.com/miso-choi/TruthProbe获取。

英文摘要

Recent advances in large language models (LLMs) have produced many specialized multimodal LLMs (MLLMs) that share common foundational LLMs, forming distinct model lineages. It remains unclear whether a fundamental behavioral link exists between the foundational LLMs and downstream variants. We investigate this question by quantifying head-level context-truthfulness scores. Across diverse LLM and MLLM lineages, including Vicuna-, Qwen2.5-, LLaMA2-, and Mistral-based models, we find that Truth Scores are strongly preserved within model families, even after instruction tuning or multimodal adaptation. We further show that this inheritance is consistent with attention-head weight preservation, and that context-truthful heads attend to query-relevant evidence. Building on this finding, we propose TruthProbe, a soft-gating strategy that amplifies context-truthful heads while preserving other head contributions. TruthProbe improves contextual truthfulness on HaluEval and reduces multimodal hallucination on POPE and CHAIR, with base-LLM Truth Scores transferring effectively to their fine-tuned LLM and MLLM descendants. Code is available at https://github.com/miso-choi/TruthProbe.

2606.15877 2026-06-16 cs.CL cs.AI 交叉投稿

Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

自由能启发式:作为不确定精度下主动推理的快速节俭认知

Alex Bogdan

发表机构 * Evolutionairy AI Toronto, Canada(进化人工智能(多伦多,加拿大))

AI总结 本文提出元不确定性决定链式思维(CoT)的效果:当模型对自身证据的可靠性高度不确定时,更多推理会降低准确率。通过自由能最小化策略证明,在重尾精度先验下,有限数量的高有效性线索后停止整合,与“取最优”启发式等价。实验验证了高元不确定性下长CoT导致准确率下降17.3个百分点。

Comments 64 pages, 6 figures

详情
AI中文摘要

链式思维(CoT)提升了大型语言模型在数学和符号推理中的表现。但在规划、有争议的伦理问题以及模型无法自我检查的任务中,更多推理反而使情况更糟。这两种效应均有文献记载;但一直缺少一个原则性的解释来说明哪种属性决定了结果。我们认为这是元不确定性:模型对其自身证据可靠性的不确定程度。当这种不确定性很高时,额外的推理不再增加信号,而是开始制造虚假的置信度。我们证明,在不确定精度下最小化期望自由能的策略,在精度先验为重尾分布时(定理2.6.1),会在有限数量的高有效性线索后停止整合线索,并且在递减优势条件下,该策略在样本层面上与“取最优”策略相同(定理2.7.4)。因此,快速节俭启发式和主动推理是同一计算的两种描述。预测是,在高元不确定性项目上,更长的CoT会降低准确率。我们按项目对区间进行评分(模拟-恢复rho > 0.96),构建了FEH-79基准(包含匹配对照的奈特框架),并在七个模型(五个开放权重3B-32B,两个前沿模型)、五种CoT长度和7,875个响应上进行了预注册研究。门槛(在数据前固定)要求负交互的后验概率高于0.95,准确率下降超过6个百分点。结果成立。高区间下降为17.3个百分点(95% CI [7.7, 25.5]);具有明确答案的匹配项目没有显示成本。该效应依赖于区间:在能力较强的中大型模型中显著,在两个前沿系统中具有方向性,在最弱的模型中缺失甚至反转。该框架回答了CoT何时有帮助,并统一了贝叶斯和快速节俭传统:少即是多的效应是关于元不确定性区间的证据,而非反对贝叶斯认知。

英文摘要

Chain-of-thought (CoT) improves large language models' performance in math and symbolic reasoning. But on planning, contested ethics, and tasks where the model cannot check itself, more reasoning makes things worse. Both effects are documented; what has been missing is a principled account of which property decides the outcome. We argue it is meta-uncertainty: how unsure the model is about the reliability of its own evidence. When that uncertainty is high, extra reasoning stops adding signal and starts manufacturing false confidence. We prove that the policy minimizing expected free energy under uncertain precision stops integrating cues after a finite number of high-validity ones when the precision prior is heavy-tailed (Theorem 2.6.1), and under a Descending Dominance condition, is sample-wise identical to take-the-best (Theorem 2.7.4). Fast-and-frugal heuristics and active inference are, then, two descriptions of the same computation. The prediction is that on high-meta-uncertainty items, longer CoT should degrade accuracy. We score the regime per item (simulate-and-recover rho > 0.96), build FEH-79, a benchmark of Knightian frames with matched controls, and run a pre-registered study across seven models (five open-weight 3B-32B, two frontier), five CoT lengths, and 7,875 responses. The gate, fixed before any data, required a negative interaction with posterior probability above 0.95 and an accuracy drop of more than 6 points. It held. The high-regime drop is 17.3 points (95% CI [7.7, 25.5]); matched items with definite answers show no cost. The effect is regime-dependent: decisive in capable mid-to-large models, directional in the two frontier systems, absent-to-reversed in the weakest. The framework answers when CoT helps and unifies the Bayesian and fast-and-frugal traditions: less-is-more effects are evidence about the meta-uncertainty regime, not against Bayesian cognition.

2606.15880 2026-06-16 cs.CV cs.AI 交叉投稿

Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models

深度残差注入:多模态大语言模型的全频谱取证信号感知

Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Ke-Yue Zhang, Yue Zhou, Caiyong Piao, Bin Li, Taiping Yao, Bo Wang, Youchang Xiao, Shouhong Ding

发表机构 * National University of Singapore(新加坡国立大学) Tsinghua University(清华大学) University of Science and Technology of China(中国科学技术大学) University of Electronic Science and Technology of China(电子科技大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 针对多模态大语言模型在取证中难以同时保留语义知识和捕获低级生成器伪影的问题,提出Deep-VRM方法,通过将伪影特定视觉信号作为残差路径注入中间层,实现全频谱信号感知,达到鲁棒检测性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)因其强大的语义理解能力,越来越多地被应用于取证领域。随着AI生成图像变得逼真,仅凭语义层面的不一致往往不足以进行可靠检测。这引发了一个关键问题:MLLMs能否实现全频谱取证信号感知,即在不牺牲预训练语义知识的情况下捕获低级生成器伪影。我们进一步对MLLMs中的取证信号感知进行了逐层分析,表明语义信息主要在早期到中间层形成,而直接微调学习伪影会破坏这些语义表示。基于这一发现,我们提出了深度视觉残差MLLM(Deep-VRM),以保留早期语义处理,同时将伪影特定的视觉信号作为残差路径注入中间层,在此与语义标记表示融合,并通过后续可训练层传播。这使得后续层能够联合建模语义推理和信号级取证线索,令人惊讶的是,模型学会了根据输入自适应地利用不同级别的取证信号,实现了鲁棒且可泛化的检测性能。大量实验表明,我们的方法在大多数基准测试中达到了最先进水平。代码和数据可在https://github.com/KQL11/Deep-VRM获取。

英文摘要

Multimodal large language models (MLLMs) have been increasingly adopted in forensics for their robust semantic understanding. As AI-generated images become realistic, semantic-level inconsistencies alone are often insufficient for reliable detection. This motivates a critical question: whether MLLMs can achieve full-spectrum forensic signal perception, i.e., capturing low-level generator artifacts without sacrificing pre-trained semantic knowledge. We further perform a layer-wise analysis of forensic signal perception in MLLMs, showing that semantic information is primarily formed in the early-to-middle layers, whereas direct fine-tuning for artifact learning disrupts these semantic representations. Based on this insight, we propose Deep Visual Residual MLLM (Deep-VRM) to preserve early semantic processing while injecting artifact-specific visual signals as a residual path into an intermediate layer, where they are fused with semantic token representations and propagated through subsequent trainable layers. This enables later layers to jointly model semantic reasoning and signal-level forensic cues, and surprisingly, the model learns to adaptively leverage different levels of forensic signals depending on the input, achieving robust and generalizable detection performance. Extensive experiments show that our method achieves state-of-the-art across most benchmarks. The code and data are available at https://github.com/KQL11/Deep-VRM.

2606.15906 2026-06-16 cs.IR cs.AI cs.CL cs.DB cs.MM 交叉投稿

MAGE-RAG: Multigranular Adaptive Graph Evidence for Agentic Multimodal RAG in Long-Document QA

MAGE-RAG:面向长文档问答的多粒度自适应图证据多模态RAG

Yilong Zuo, Xunkai Li, Jing Yuan, Qiangqiang Dai, Hongchao Qin, Ronghua Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MAGE-RAG框架,通过离线构建包含页面和元素节点的证据图,在线自适应构建证据子图,平衡证据覆盖与噪声控制,在长文档多模态问答中取得最优性能。

详情
AI中文摘要

长文档多模态问答要求系统在长PDF中定位稀疏证据,并整合来自文本、表格、图像、图表和复杂布局的线索。现有RAG方法大多依赖于文本块或页面的固定Top-k检索。文本检索可以压缩上下文,但往往丢失视觉和布局信息;页面级视觉检索保留原始页面,但也会将大量无关区域送入阅读器,导致证据覆盖、噪声和推理成本之间的静态权衡。本文提出MAGE-RAG,一种用于长文档多模态问答的多粒度自适应图证据框架。MAGE-RAG以页面检索作为查询时证据构建的入口。离线阶段,它构建一个包含页面节点和元素节点的证据图,编码包含关系、阅读顺序、布局邻接、章节层次和语义邻居关系。查询时,在线证据控制器在显式预算下迭代地激活、打开、搜索和剪枝证据。生成的证据子图随后被渲染为结构化的多模态阅读器输入,使LVLM能够在有限上下文中消费紧凑且相关的证据。在LongDocURL和MMLongBench-Doc上,我们建立了统一的比较和分析协议,涵盖直接MLLM、文本RAG、页面级视觉RAG和图/智能体RAG。实验表明,MAGE-RAG在LongDocURL上达到52.75的整体准确率,在MMLongBench-Doc上达到53.26的准确率和51.19的F1。细粒度分解、预算-性能曲线、消融和基于轨迹的分析进一步表明,查询时证据子图构建能够平衡分散证据覆盖与上下文噪声控制。我们的代码可在https://github.com/laonuo2004/MAGE-RAG.git获取。

英文摘要

Long-document multimodal question answering requires a system to locate sparse evidence in long PDFs and integrate clues from text, tables, images, charts, and complex layouts. Existing RAG methods mostly rely on fixed Top-k retrieval over text chunks or pages. Text retrieval can compress the context but often loses visual and layout information; page-level visual retrieval preserves the original page, yet it also sends large irrelevant regions to the reader, leading to a static trade-off among evidence coverage, noise, and inference cost. This paper proposes MAGE-RAG, a multigranular adaptive graph evidence framework for long-document multimodal QA. MAGE-RAG uses page retrieval as the entry point for query-time evidence construction. Offline, it builds an evidence graph with page nodes and element nodes, encoding containment, reading order, layout adjacency, section hierarchy, and semantic-neighbor relations. At query time, an online evidence controller iteratively activates, opens, searches, and prunes evidence under explicit budgets. The resulting evidence subgraph is then rendered into structured multimodal reader input, allowing the LVLM to consume compact and relevant evidence within a limited context. On LongDocURL and MMLongBench-Doc, we establish a unified comparison and analysis protocol covering Direct MLLM, Text RAG, Page-level Visual RAG, and Graph/Agentic RAG. Experiments show that MAGE-RAG achieves 52.75 overall accuracy on LongDocURL, and 53.26 accuracy with 51.19 F1 on MMLongBench-Doc. Fine-grained breakdowns, budget-performance curves, ablations, and trace-based analysis further show that query-time evidence subgraph construction can balance dispersed evidence coverage with context-noise control. Our code is available at https://github.com/laonuo2004/MAGE-RAG.git.

2606.15972 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

Formalize Once, Edit the Rest: Efficient Lean-Based Answer Selection for Math Reasoning

一次形式化,其余编辑:基于Lean的高效数学推理答案选择

Ji Feng, Zhouxing Shi

发表机构 * University of California, Riverside(加州大学河滨分校)

AI总结 提出BASE流水线,通过形式化一个候选答案并编辑其余答案,减少自动形式化调用约5倍,同时提升选择准确性。

Comments 15 pages, 1 figure. Code available at https://github.com/ucr-rai/base-and-edit

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地应用于数学推理,形式化证明助手(如Lean)可用于以机器可检查的严谨性验证推理输出,从而支持在测试时扩展中从K个采样候选答案中进行答案选择等用例。然而,使用Lean要求LLM的输出(最初为自然语言)首先被形式化。现有的基于Lean的答案选择工作使用自动形式化模型为每个候选答案独立生成一个Lean形式化语句,这带来了显著的计算成本。我们提出BASE,一个基础-编辑流水线,它为每个问题形式化一个基础候选答案,并通过就地编辑答案表达式来推导出其余K-1个语句。为此,我们训练了一个重写器模型LEANSCRIBE,用于定位基础形式化中的答案,并为其他K-1个候选答案生成可重用的编辑函数。BASE同时提高了选择准确性并降低了形式化成本——这是一个帕累托改进,在四个基准测试和三个求解器上的所有12个(数据集,求解器)配置中均成立,在K=8时自动形式化器调用减少约5倍,且随着K增长,减少幅度预计会更大。代码可在https://github.com/ucr-rai/base-and-edit获取。

英文摘要

With large language models (LLMs) increasingly applied to mathematical reasoning, formal proof assistants such as Lean can be leveraged to verify reasoning outputs with machine-checkable rigor, enabling use cases such as answer selection in test-time scaling with K sampled candidate answers. However, employing Lean requires that LLM outputs, originally in natural language, first be formalized. Existing Lean-based answer-selection work uses an autoformalization model to generate a formal statement in Lean for each candidate answer independently, incurring a significant computational cost. We propose BASE, a base-and-edit pipeline that formalizes a single base candidate per problem and derives the remaining K-1 statements by editing the answer expression in place. To facilitate this, we train a rewriter model LEANSCRIBE to localize the answer in the base formalization and generate a reusable edit function for the other K-1 candidates. BASE simultaneously improves selection accuracy and reduces formalization cost - a Pareto improvement that holds on all 12 (dataset, solver) configurations across four benchmarks and three solvers, cutting autoformalizer calls by about 5x at K=8, with the reduction expected to become larger as K grows. Code is available at https://github.com/ucr-rai/base-and-edit.

2606.15998 2026-06-16 cs.IR cs.AI cs.CL cs.LG 交叉投稿

Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking

实体标签并非实体信号:文档重排序中可观测相关性的框架

Utshab Kumar Ghosh, Shubham Chatterjee

发表机构 * Department of Computer Science, Missouri University of Science and Technology(计算机科学系,密苏里科技大学)

AI总结 提出实体可观测相关性(OER)与概念相关性(CER)的区分,证明CER监督效果差,而OER对齐可显著提升重排序性能。

Comments ICTIR '26

详情
Journal ref
Proceedings of the 2026 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR)
AI中文摘要

实体感知的文档检索使用与查询关联的实体作为排序信号,假设语义相关的实体也是有用的检索信号。我们证明这一假设是不充分的,并解释原因。与作为真实观测的词项不同,实体链接是由不完美的链接器产生的假设:如果链接器在相关和非相关文档中无差别地触发,那么一个实体可能在主题上重要,却不提供任何判别性信号。我们将此形式化为概念实体相关性(CER)——实体是否与查询主题相关——和可观测实体相关性(OER)——其在集合中的观测出现是否能区分相关与非相关文档——之间的区别。在四个集合和包括人工实体判断的标注来源上,CER和OER表现出接近随机的吻合度(κ≈0),而OER的操作化实现吻合度较高(κ≈0.5),确认CER是系统性异常值。基于CER的监督选择主题上合理但判别性弱的实体,在某些集合上仅能过滤不到4%的非相关文档。将监督与OER对齐可将非相关文档过滤提升至10倍,并在BM25基础上将开放世界MAP提升0.051。我们的发现促使实体感知检索中从概念实体相关性向可观测实体相关性的转变。

英文摘要

Entity-aware document retrieval uses query-associated entities as ranking signals, assuming that semantically relevant entities are also useful retrieval signals. We show this assumption is insufficient- and explain why. Unlike terms, which are ground-truth observations, entity links are hypotheses produced by an imperfect linker: an entity can be topically central yet provide no discriminative signal if the linker fires indiscriminately across relevant and non-relevant documents. We formalize this as a distinction between Conceptual Entity Relevance (CER)- whether an entity is topically related to a query- and Observable Entity Relevance (OER)- whether its observed presence in a collection discriminates relevant from non-relevant documents. Across four collections and annotation sources including human entity judgments, CER and OER exhibit near-chance agreement ($κ\approx 0$), while OER operationalizations agree substantially ($κ\approx 0.5$), confirming CER as the systematic outlier. CER-based supervision selects topically plausible but weakly discriminative entities, pruning fewer than 4% of non-relevant documents on some collections. Aligning supervision with OER improves non-relevant pruning by up to 10x and open-world MAP by 0.051 over BM25. Our findings motivate a shift from conceptual to observable notions of entity relevance in entity-aware retrieval.

2606.16074 2026-06-16 cs.CL cs.AI 交叉投稿

PVminerLLM2: Improving Structured Extraction of Patient Voice via Preference Optimization

PVminerLLM2:通过偏好优化改进患者声音的结构化提取

Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Elyas Irankhah, Sreeraj Ramachandran, Ashley Hagaman, Sarah Lowe, Aimee Roundtree

发表机构 * Yale School of Medicine(耶鲁大学医学院) Yale School of Public Health(耶鲁大学公共卫生学院) Texas State University(德克萨斯州立大学)

AI总结 提出PVminerLLM2,通过偏好优化和令牌级门控稳定项、混淆感知偏好对构建等技术,解决监督微调难以处理的细粒度错误,在患者声音结构化提取任务上优于基线模型。

详情
AI中文摘要

动机:患者生成的文本包含关于患者生活经历、社会背景和护理参与的关键信息,但大多是非结构化的,限制了其在以患者为中心的结果研究中的应用。先前的工作引入了PV-Miner基准和PVMinerLLM模型用于结构化提取。然而,仅靠监督微调(SFT)难以处理罕见、细粒度且分布不均的错误,尤其是在令牌关键的结构化输出中。结果:我们提出了PVminerLLM2,一组改进的用于结构化患者声音提取的LLM,它应用偏好优化来解决监督微调无法处理的令牌级错误。我们的方法引入了(i)带有令牌级门控稳定项的偏好目标,防止在偏好优化下绝对令牌似然的退化,以及(ii)混淆感知的偏好对构建,以更好地捕捉低分离度的区分。我们进一步引入了令牌重要性加权和逆频率重加权,以解决令牌不平衡和类别偏斜问题。在多种模型规模下,PVMinerLLM2始终优于强基线,在代码、子代码和跨度上分别获得了高达4.43%、3.50%和1.55%的提升,并且优于使用现有偏好优化方法训练的基线LLM。可用性和实现:PVminerLLM2的补充材料、代码、评估脚本和训练模型公开于:https://github.com/Data-Mining-Lab-Yale/PVminerLLM2

英文摘要

Motivation: Patient-generated text contains critical information on patients' lived experiences, social context, and care engagement, but remains largely unstructured, limiting its use in patient-centered outcomes research. Prior work introduced the PV-Miner benchmark and PVMinerLLM models for structured extraction. However, supervised fine-tuning (SFT) alone struggles with rare, fine-grained, and unevenly distributed errors, particularly in token-critical structured outputs. Results: We present PVminerLLM2, an improved set of LLMs for structured patient voice extraction that applies preference optimization to address token-critical errors beyond the reach of supervised fine-tuning. Our method introduces (i) a preference objective with token-level gated stabilization term that prevents degradation of absolute token likelihood under preference optimization, and (ii) confusion-aware preference pair construction to better capture low-separation distinctions. We further incorporate token-importance weighting and inverse-frequency reweighing to address token imbalance and class skew. Across multiple model sizes, PVMinerLLM2 consistently outperforms strong baselines, achieving gains of up to 4.43% (Code), 3.50% (Sub-code), and 1.55% (Span), and outperforms baseline LLM trained with existing preference optimization methods. Availability and Implementation: The supplementary material, code, evaluation scripts, and trained models for PVminerLLM2 are publicly available at: https://github.com/Data-Mining-Lab-Yale/PVminerLLM2

2606.16082 2026-06-16 cs.CV cs.AI 交叉投稿

Tool-IQA: Augmenting Image Quality Assessment with Simple Tools

Tool-IQA: 利用简单工具增强图像质量评估

Guanyi Qin, Junjie Zhang, Chunming He, Yibing Fu, Jie Liang, Tianhe Wu, Lei Zhang

发表机构 * National University of Singapore(新加坡国立大学) OPPO Research Institute(OPPO研究院) Nanyang Technical University(南洋理工大学) Duke University(杜克大学) City University of Hong Kong(香港城市大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出Tool-IQA,通过为视觉语言模型配备放大镜和伽马校正器等简单工具,将被动评分转变为工具增强的工作流程,显著提升图像质量评估性能。

详情
AI中文摘要

视觉语言模型(VLM)越来越多地被用于图像质量评估(IQA)。然而,当前方法通常采用静态的一次性评分范式,而人类通过动态视觉检查(例如,选择性调整视图以验证细节和细微伪影)来评估图像质量。具体来说,仅依赖单次观察存在两个主要限制:首先,仅在全局尺度上感知图像限制了对更精细局部细节的评估;其次,图像的原始强度分布可能压倒可见性,导致对图像质量的检查不足。为了解决这些问题,我们提出了Tool-IQA,将评估机制从被动评分转变为工具增强的工作流程。特别地,我们为VLM配备了简单而有效的视图工具:用于检查局部细节的放大镜,以及用于揭示可见性和隐藏伪影的伽马校正器。评估遵循一个结构化的流程,包括带有评分标准的初始观察、工具增强的深入检查以及最终校准质量分数的量化。此外,为了确保高效且有目的地调用工具,我们引入了一种批量感知的训练策略,以奖励能够产生积极贡献的工具交互,而不仅仅是鼓励使用。在各种IQA基准上的实验表明,通过有效的工具调用和校准评估,我们提出的Tool-IQA显著优于现有最先进的模型,例如,在具有挑战性的CLIVE数据集上实现了0.854的PLCC。

英文摘要

Vision-Language Models (VLMs) have been increasingly adopted for Image Quality Assessment (IQA). However, current methods typically employ a static one-shot scoring paradigm, despite the fact that humans assess image quality through dynamic visual inspection, e.g., selectively adjusting views to verify details and subtle artifacts. Specifically, relying solely on a single-pass observation introduces two primary limitations: first, perceiving the image only at a global scale restricts the assessment of finer local details; second, the original intensity distribution of the image may overwhelm the visibility, leading to insufficient inspection of image quality. To address these issues, we propose Tool-IQA, shifting the assessment mechanism from passive scoring to a tool-augmented workflow. In particular, we equip VLMs with simple yet effective view tools: a Magnifier to inspect local details, and a Gamma Corrector to uncover visibility and hidden artifacts. The assessment follows a structured pipeline that consists of an initial observation with rubric notes, a tool-augmented in-depth inspection, and a final quantification for calibrated quality score. Furthermore, to ensure efficient and purposeful tool callings, we introduce a batch-aware training strategy to reward tool interactions that can yield positive contributions rather than simply encouraging usage. Experiments on a variety of IQA benchmarks demonstrate that, with effective tool calling and calibrated assessment, our proposed Tool-IQA significantly outperforms existing state-of-the-art models, e.g., it achieves a PLCC of 0.854 on the challenging CLIVE dataset.

2606.16281 2026-06-16 cs.CL cs.AI 交叉投稿

Who Should Lead Decoding Now? Tracking Reliable Trajectories for Ensembling Masked Diffusion Language Models

现在谁应该主导解码?跟踪可靠轨迹以集成掩码扩散语言模型

Heecheol Yun, Joonhyung Park, Joowon Kim, Eunho Yang

发表机构 * KAIST(韩国科学技术院) AITRICS

AI总结 针对掩码扩散语言模型集成问题,提出TIE框架,通过跟踪答案相关位置的置信度动态,迭代识别并传递可靠解码轨迹,实现多模型协同生成。

Comments preprint

详情
AI中文摘要

掩码扩散语言模型(MDLM)已成为序列生成的一种独特范式。随着MDLM在能力和知识覆盖范围上变得多样化,一个重要问题是如何结合它们的知识。为此,我们首先研究了MDLM独特的解码动态。我们发现,成功的生成在答案相关位置上表现出稳定的置信度动态,而不可靠的轨迹通常可以通过注入来自其他模型的有希望的中间状态来纠正。受此观察启发,我们提出了$\textbf{TIE}$(基于轨迹的迭代集成),这是一个知识融合框架,其中MDLM迭代地识别可靠的解码轨迹并在模型之间传递它们。TIE跟踪答案相关位置上的置信度动态,以确定哪个模型当前遵循更可靠的轨迹,并选择性地跨模型传递部分去噪的序列。由于处于更有希望轨迹上的模型在去噪步骤中经常变化,TIE允许不同模型在生成的不同阶段贡献互补的优势。在多种推理任务上的强劲表现以及我们的分析表明,TIE为MDLM集成这一尚未充分探索的问题提供了一种实用方法。

英文摘要

Masked Diffusion Language Models (MDLMs) have emerged as a distinct paradigm for sequence generation. As MDLMs become diverse in capabilities and knowledge coverage, an important question is how to combine their knowledge. Toward this, we first investigate the unique decoding dynamics of MDLMs. We find that successful generations exhibit stable confidence dynamics over answer-relevant positions, while unreliable trajectories can often be corrected by injecting promising intermediate states from other models. Guided by this observation, we propose $\textbf{TIE}$ ($\textbf{T}$rajectory-based $\textbf{I}$terative $\textbf{E}$nsembling), a knowledge fusion framework in which MDLMs iteratively identify reliable decoding trajectories and relay them across models. TIE tracks confidence dynamics over answer-relevant positions to determine which model currently follows a more reliable trajectory and selectively transfers partially denoised sequences across models. As the model on the more promising trajectory often changes across denoising steps, TIE allows different models to contribute complementary strengths at different stages of generation. Strong performance across diverse reasoning tasks, along with our analyses, suggests that TIE offers a practical approach to the underexplored problem of MDLM ensembling.

2606.16353 2026-06-16 cs.CV cs.AI 交叉投稿

What Should a Streaming Video Model Remember?

流式视频模型应该记住什么?

Haonan Ge, Yiwei Wang, Hang Wu, Yujun Cai

发表机构 * University of California, Santa Barbara(加州大学圣塔芭芭拉分校) University of California, Merced(加州大学默塞德分校) The University of Queensland(昆士兰大学)

AI总结 针对流式视频理解中固定记忆预算下的长程历史利用问题,提出选择性潜在记忆框架SelectStream,通过惊喜驱动自适应窗口、优先级保持合并和查询条件图推理三个机制,实现高效在线推理,在多个基准上取得领先性能。

详情
AI中文摘要

流式视频理解模型必须在持续流中的任意时刻回答查询,仅使用到目前为止观察到的内容,并在固定的记忆和计算预算下工作。现有方法通过添加记忆库、检索模块或视觉令牌压缩来保存长程历史。然而,强近期窗口基线表明,不加区分地注入历史可能会稀释当前场景感知,这表明关键挑战不在于是否使用记忆,而在于如何选择性分配记忆。我们将此形式化为预算在线潜在证据分配,并提出\textbf{SelectStream},一个选择性潜在记忆框架,该框架保持当前观察对冻结VLM直接可见,同时仅通过紧凑的、查询条件的证据预算暴露历史信息。三个协调机制控制何时写入、保留什么以及如何检索:惊喜驱动的自适应窗口、优先级保持合并以及固定容量潜在记忆图上的查询条件图推理。检索到的证据被校准并作为潜在令牌注入以生成答案,无需重放帧或随着流长度增长上下文。实验结果表明,SelectStream实现了强大的在线流式性能,并保持了通用视频理解能力,在StreamingBench上达到82.67%,在OVO-Bench上达到67.03%,在离线视频基准上平均准确率达到74.4%,同时优于强近期窗口基线和先前的流式记忆方法。

英文摘要

Streaming video understanding models must answer queries at any moment during an ongoing stream, using only what they have observed so far and under fixed memory and computation budgets. Existing methods address this by adding memory banks, retrieval modules, or visual token compression to preserve long-range history. However, strong recent-window baselines show that indiscriminate history injection can dilute current-scene perception, suggesting that the key challenge is not whether to use memory, but how to allocate it selectively. We formulate this as budgeted online latent evidence allocation and propose \textbf{SelectStream}, a selective latent-memory framework that keeps the current observation directly visible to a frozen VLM while exposing historical information only through a compact, query-conditioned evidence budget. Three coordinated mechanisms govern when to write, what to preserve, and how to retrieve: surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning over a fixed-capacity latent memory graph. Retrieved evidence is calibrated and injected as latent tokens for answer generation, without replaying frames or growing the context with stream length. Experimental results show that SelectStream achieves strong online streaming performance and preserves general video understanding, reaching 82.67\% on StreamingBench, 67.03\% on OVO-Bench, and 74.4\% average accuracy on offline video benchmarks, while outperforming strong recent-window baselines and prior streaming memory methods.

2606.16484 2026-06-16 cs.CV cs.AI cs.MM 交叉投稿

Unified Multimodal Model for Brain MRI Imputation and Understanding

统一多模态模型用于脑MRI补全与理解

Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai

发表机构 * Department of Computing, Imperial College London(伦敦帝国理工学院计算机系) Department of Brain Sciences, Imperial College London(伦敦帝国理工学院脑科学系)

AI总结 提出UniBrain模型,通过统一训练策略联合处理脑MRI模态补全与图像理解,采用自对齐和动态隐藏状态机制,在多疾病数据集上实现高性能。

Comments Early accepted to MICCAI 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在医学领域具有巨大潜力,因为它们继承了LLM的知识,并允许以自然语言集成、分析和解释多种数据模态。然而,医学MLLMs面临重大挑战,特别是高质量训练数据的稀缺以及现实临床环境中数据缺失的频繁发生。在此,我们提出了一种新颖的统一多模态模型UniBrain,用于脑磁共振图像(MRI)分析。为了解决潜在的脑MRI模态缺失问题,我们采用统一训练策略进行联合成像模态补全和脑图像理解。在训练过程中,构建了交错且描述丰富的数据流,以自回归方式训练模型,从而实现基于生成的多模态数据的医学推理。引入自对齐策略,利用密集图像嵌入学习细粒度解剖特征,无需详细的图像描述。此外,我们提出了一种动态隐藏状态机制,以缓解长上下文多模态推理中的暴露偏差。在多疾病脑MRI数据集上的大量实验表明,UniBrain在模态不完全的各种情况下,在脑图像补全、理解和疾病诊断方面均取得了高性能。

英文摘要

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLM and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in the real-world clinical setting. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance image (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate the exposure bias during long-context multimodal inference. Extensive experiments on multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

2606.16568 2026-06-16 cs.CL cs.AI 交叉投稿

Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

快速判断何时,谨慎决定谁:基于扩散增强的双过程多轮对话

Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun

发表机构 * Deakin University(迪肯大学) Griffith University(格里菲斯大学)

AI总结 针对多说话人对话中的轮次转换问题,提出音频两阶段流水线,先快速检测轮次边界,再轻量验证决定是否转移并预测下一说话人,扩散增强进一步改善检测性能。

详情
AI中文摘要

可靠的轮次转换对于口语对话系统至关重要。然而,现有方法大多针对双说话人交互设计,难以处理包含重叠和快速说话人切换的现实多说话人音频。我们在VoxConverse数据集上研究多说话人轮次转换,并提出一个纯音频的两阶段流水线,将何时触发轮次边界与是否实际转移话语权分开。一个快速触发器扫描音频并提出候选的结束轮次时间,而一个轻量验证器仅在这些时间运行,以决定\textsc{Hold}或\textsc{Shift},并支持下一说话人预测。我们报告了完整多说话人设置下的结果,以及为可比性而控制的二元顶2投影结果。我们还研究了基于扩散的、保留标签的背景音频混合作为数据增强策略。结果显示,与基线相比,转移检测有所改善,扩散增强进一步提升了性能。

英文摘要

Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide \textsc{Hold} or \textsc{Shift} and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.

2606.16595 2026-06-16 cs.SD cs.AI 交叉投稿

ArtNet: A JEPA-Like Articulatory Predictive Framework for Robust Zero-Shot Phoneme Recognition

ArtNet:一种类似JEPA的发音预测框架,用于鲁棒的零样本音素识别

Zeqian Hu, Fuliang Weng, Shu Shang, Yaqian Zhou

发表机构 * Fudan University(复旦大学) Pedawise

AI总结 提出ArtNet框架,通过基于发音特征的结构化预测任务和变分信息瓶颈抑制语言特定变化,在零样本跨语言音素识别中实现20.56%的音素错误率降低。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

零样本跨语言音素识别常因直接声学到符号映射的脆弱性而受阻,该映射易受语言特定变化影响。借鉴视觉中的联合嵌入预测架构(JEPA)工作,我们提出ArtNet,一个探索基于发音特征的结构化特征预测任务以增强声学鲁棒性的框架。具体而言,ArtNet集成了一个发音预测器,旨在从自监督学习(SSL)特征中提取通用发音表示,并采用变分信息瓶颈(VIB)抑制语言特定变化。在七种未见语言上的实验表明,ArtNet,特别是与所提出的向量空间库存对齐(VSIA)策略协同使用时,显著优于竞争基线,实现了音素错误率(PER)相对降低20.56%,音素特征错误率(PFER)相对降低7.01%。

英文摘要

Zero-shot cross-lingual phoneme recognition is often hindered by the fragility of direct acoustic-to-symbol mapping, which is susceptible to language-specific variations. Echoing joint-embedding predictive architecture (JEPA) work in vision, we propose ArtNet, a framework that explores a structured feature prediction task based on articulatory features to enhance acoustic robustness. Specifically, ArtNet integrates an articulatory predictor, designed to extract universal articulatory representations from self-supervised learning (SSL) features, with a variational information bottleneck (VIB) to suppress language-specific variations. Experiments on seven unseen languages demonstrate that ArtNet, particularly when synergized with the proposed vector-space inventory alignment (VSIA) strategy, significantly outperforms competitive baselines, achieving a 20.56\% relative reduction in phoneme error rate (PER) and 7.01\% in phoneme feature error rate (PFER).

2606.16620 2026-06-16 cs.LG cs.AI 交叉投稿

Entropy-Gated Latent Recursion

熵门控潜在递归

Soham Bhattacharjee, Dushyant Singh Chauhan, Salem Lahlou, Martin Takac, Nils Lukas

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出熵门控潜在递归(EGLR),通过在高不确定性token处递归应用冻结模型顶层解码器,构建与温度采样正交的确定性采样轴,扩展推理时缩放空间,在数学推理任务中显著提升性能。

详情
AI中文摘要

推理时缩放已成为改进语言模型推理能力的主要手段,但现有方法的展开多样性仅来源于单一来源:随机token级采样。我们认为这种单轴采样空间本质上是受限的,并识别出第二个完全确定且互补的轴:在冻结模型的顶层解码器层在高不确定性token处递归重新应用的层跨度$L$。不同的$L$选择会产生不同的展开,解决不同的问题子集,且无需随机性。我们通过熵门控潜在递归(EGLR)实例化这一轴,这是一种无需训练的解码过程,它重新应用顶层$L$层最多$K_{\max}$次迭代,直到下一个token分布收敛。结合$T$个温度采样,EGLR将单轴随机展开池转变为$L\times T$笛卡尔采样空间,且几乎不增加每次展开的成本。我们在8个指令微调模型和6个数学推理基准上表征了这一空间,并表明$L$轴与温度确实互补:在MATH-500上使用Qwen2.5-3B-Instruct时,联合$L\times T$预言机达到91.6%,比仅温度预言机(83.4%)高出8.2个百分点,比仅层预言机(81.2%)高出10.4个百分点,证实两个轴捕获了真正互补的问题。扩展的展开池为任何下游过程(包括自一致性、带验证器的最佳$N$选择和组相对RL训练(GRPO))提供了更丰富的每个提示候选,开辟了不依赖随机噪声的推理时缩放新方向。

英文摘要

Inference-time scaling has become the dominant lever for improving language-model reasoning, but existing methods derive rollout diversity from a single source: stochastic token-level sampling. We argue that this single-axis sampling space is fundamentally limiting, and identify a second, fully deterministic and complementary axis: the layer span $L$ at which a frozen model's top decoder layers are recursively re-applied at high-uncertainty tokens. Different choices of $L$ produce distinct rollouts that solve different subsets of problems, with no stochasticity. We instantiate this axis through Entropy-Gated Latent Recursion (EGLR), a training-free decoding procedure that re-applies the top-$L$ layers for at most $K_{\max}$ iterations until the next-token distribution converges. Combined with $T$ temperature samples, EGLR turns a single-axis stochastic rollout pool into an $L\times T$ Cartesian sampling space at almost the same per-rollout cost. We characterize this space across $8$ instruction-tuned models and $6$ math reasoning benchmarks, and show that the $L$-axis is genuinely complementary to temperature: on MATH-500 with Qwen2.5-3B-Instruct, the joint $L\times T$ oracle reaches $91.6\%$, $+8.2$ percentage points beyond the temperature-only oracle ($83.4\%$) and $+10.4$ points beyond the layer-only oracle ($81.2\%$), confirming that the two axes capture genuinely complementary problems. The expanded rollout pool provides richer per-prompt candidates for any downstream procedure that consumes rollouts, including self-consistency, best-of-$N$ with verifiers, and group-relative RL training (GRPO), opening a new direction for inference-time scaling that does not rely on stochastic noise.

2606.16731 2026-06-16 cs.SD cs.AI cs.HC 交叉投稿

MuVAP: Multimodal Multiparty Voice Activity Projection for Turn-taking Prediction in the Wild

MuVAP: 面向野外对话轮次预测的多模态多方语音活动投影

Haotian Qi, Gabriel Skantze

发表机构 * Department of Speech Music and Hearing, KTH Stockholm, Sweden(瑞典皇家理工学院言语、音乐与听觉系)

AI总结 提出MuVAP框架,通过将声学预测锚定到面部轨迹,实现从单声道音频和单摄像头视角进行说话人感知的轮次预测,并引入角色相对投影和AVCC数据集解决多方建模和因果跟踪问题。

详情
AI中文摘要

当前的多方对话轮次模型通常依赖于复杂的麦克风阵列或多摄像头设置,限制了它们在人与机器人交互场景中的适用性。我们提出了MuVAP,这是一个因果多模态框架,通过将声学预测锚定到面部轨迹来扩展语音活动投影,从而能够从单声道音频流和单摄像头视角进行说话人感知的轮次预测。为了解决建模多个说话人的组合复杂性,我们提出了角色相对投影,它将任意N说话人交互映射到一个固定的当前与下一个话语持有者状态。由于现有的视听数据集包含破坏因果跟踪的剪辑切换,我们引入了视听对话语料库,这是一个31小时的未剪辑、单摄像头多方对话数据集。评估表明,MuVAP在两人和三人场景下的转换-保持和下一说话人预测任务中优于强基线。

英文摘要

Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic predictions in face tracks, enabling speaker-aware turn-taking predictions from a monaural audio stream and a single camera view. To address the combinatorial complexity of modeling multiple speakers, we propose Role-Relative Projection, which maps any N-speaker interaction onto a fixed current versus next floor-holder state. Because existing audiovisual datasets contain disruptive editing cuts that break causal tracking, we introduce the Audio-Visual Conversation Corpus, a 31-hour dataset of unedited, single-camera multiparty conversations. Evaluations demonstrate that MuVAP outperforms strong baselines on Shift-Hold and next-speaker prediction tasks across two- and three-speaker settings.

2606.16783 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

Gen-VCoT: 基于扩散的RGB中间表示的生成式视觉思维链推理

Zhiqiang Zhou, Junliang Dai, Xu ling

发表机构 * Hunan Chemical Industry Vocational and Technical College(湖南化工职业技术学院)

AI总结 提出Gen-VCoT框架,利用专家视觉模型生成RGB图像作为推理中间步骤,通过自适应路由器选择推理深度,在空间和深度问题上分别提升25%和50%,但简单事实查询性能下降,表明最优表示依赖于任务。

Comments 12 pages, 5 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉推理方面表现出色,但依赖基于文本的思维链(CoT),缺乏可解释的视觉中间表示。现有方法使用不透明的标记或外部工具,缺失关键属性。我们提出Gen-VCoT,一个使用专家视觉模型生成RGB图像作为推理中间表示的框架。它包含三个阶段:视觉定位(SAM分割)、几何推理(Marigold深度图)和语义推理(Qwen2-VL集成)。一个自适应路由器选择推理深度。评估显示,Gen-VCoT在空间问题(提升25%)和深度问题(提升50%)上表现更好,但可能损害简单事实查询。文本CoT在CLEVR上优于视觉中间表示(91.2% vs 62.5%),表明最优表示依赖于任务。Gen-VCoT为可解释的多模态推理建立了新范式。

英文摘要

Multimodal large language models (MLLMs) excel at visual reasoning but rely on text-based chain-of-thought (CoT), lacking interpretable visual intermediates. Existing methods use opaque tokens or external tools, missing key properties. We propose Gen-VCoT, a framework using expert vision models to generate RGB images as reasoning intermediates. It has three stages: visual grounding (SAM segmentation), geometric reasoning (Marigold depth maps), and semantic reasoning (Qwen2-VL integration). An adaptive router selects reasoning depth. Evaluations show Gen-VCoT improves spatial (25% better) and depth (50% better) questions, but may hurt simple factual queries. Text CoT outperforms visual intermediates on CLEVR (91.2% vs 62.5%), showing task-dependent optimal representations. Gen-VCoT establishes a new paradigm for interpretable multimodal reasoning.

2606.16845 2026-06-16 cs.CL cs.AI 交叉投稿

Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts

鲁棒双信号融合:混合神经符号门控与压缩链式思维精炼用于社交媒体文本讽刺检测

Ankit Bhattacharjee, Krityapriya Bhaumik

发表机构 * Indian Institute of Technology Kharagpur(印度理工学院克勒格布尔分校)

AI总结 提出RDS融合框架,结合神经符号架构与压缩链式思维推理,在TweetEval和iSarcasm数据集上达到与微调BERTweet相当的性能,并显著优于监督方法。

Comments 11 pages total, 10 figures

详情
AI中文摘要

大型语言模型(LLM)默认倾向于字面语义解释,使得零样本讽刺检测成为一个持续的挑战。我们引入了鲁棒双信号(RDS)融合框架,这是一种混合神经符号架构,无需监督微调(SFT)即可压缩链式思维(CoT)推理轨迹。在严格保留的TweetEval测试集(N=734)上,RDS达到了78.1%的准确率和0.777的宏F1分数,与微调BERTweet的绝对性能上限相匹配。在高度不平衡的iSarcasm数据集上,冻结的CoT管道过滤了22.5%的分布外幻觉,实现了0.6726的零样本宏F1和0.4821的讽刺F1,优于多个强监督的SemEval Transformer集成。统计消融实验证实了这种结构协同作用:将符号先验添加到神经基线没有显著提升(p=0.242),而将CoT管道添加到该先验的边际收益被高度压缩(p=0.149)。只有所有三个信号的完整并发融合才能实现相对于基线的统计验证改进(p=0.005)。

英文摘要

Large Language Models (LLMs) natively default to literal semantic interpretations, making zero-shot irony detection a persistent challenge. We introduce the Robust Dual-Signal (RDS) Fusion framework, a hybrid neuro-symbolic architecture that compresses Chain-of-Thought (CoT) reasoning trajectories without Supervised Fine-Tuning (SFT). Evaluated on a strictly held-out TweetEval test set (N=734), RDS achieves 78.1% accuracy and a Macro F1 of 0.777, matching the absolute performance ceiling of the fine-tuned BERTweet. On the heavily imbalanced iSarcasm dataset, the frozen CoT pipeline filters 22.5% of out-of-distribution hallucinations, yielding a zero-shot Macro F1 of 0.6726 and Ironic F1 of 0.4821, outperforming multiple heavily supervised SemEval transformer ensembles. A statistical ablation confirms this structural synergy: adding the symbolic prior to the neural baseline yields no significant gain (p = 0.242), and the marginal benefit of adding the CoT pipeline to that prior is heavily compressed (p = 0.149). Only the complete, concurrent fusion of all three signals achieves a statistically validated improvement over the baseline (p = 0.005).

2606.16847 2026-06-16 cs.CL cs.AI 交叉投稿

Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens

遵循潜在路径:利用锚定令牌导航扩散LLM的可撤销解码

Yizhen Yao, Qinglin Zhu, Runcong Zhao, Xiangxiang Dai, Yanzheng Xiang, Yulan He, Lin Gui

发表机构 * King's College London(伦敦国王学院) The Chinese University of Hong Kong(香港中文大学) The Alan Turing Institute, UK(英国艾伦·图灵研究所)

AI总结 针对扩散大语言模型解码速度与质量的权衡,提出无训练框架ASRD,通过锚定令牌解耦上下文,结合锚定引导生成与锚定扰动验证,在数学和编码基准上提升准确率6.4%,加速推理7.2倍。

详情
AI中文摘要

扩散大语言模型(dLLMs)为并行生成提供了有前景的途径,但面临解码速度与质量之间的权衡。虽然可撤销解码策略尝试通过验证和重新掩码来减轻错误,但它们通常在混合质量上下文中操作。这导致两个关键失败:\textit{错误传播},即新令牌从错误上下文中吸收有毒信息;以及\textit{局部错误强化},即错误相互强化以逃避检测。为缓解这些挑战,我们提出ASRD(锚定监督可撤销解码),一种在嵌入空间内运行的无训练框架。ASRD明确将解码上下文解耦为通过时间一致性识别的可信\textit{锚定令牌}和不确定候选令牌。利用动态锚定令牌缓存,我们引入两种互补机制:(1)锚定引导生成,将熵加权锚定信号注入掩码位置,以隐式地将注意力引导向可靠的全局骨架;(2)锚定扰动验证,对不确定候选令牌施加正交扰动,破坏并重新掩码由脆弱局部共识驱动的错误。在数学和编码基准上的大量实验表明,ASRD优于最近的重新掩码基线,准确率提升高达6.4%,同时推理吞吐量加速高达7.2倍。

英文摘要

Diffusion Large Language Models (dLLMs) offer a promising avenue for parallel generation but face a trade-off between decoding speed and quality. While revocable decoding strategies attempt to mitigate errors by verifying and remasking tokens, they typically operate within a mixed-quality context. This leads to two critical failures: \textit{Error Propagation}, where new tokens absorb toxic information from erroneous context, and \textit{Local Error Reinforcement}, where errors mutually reinforce each other to evade detection. To alleviate these challenges, we propose ASRD (Anchor Supervised Revocable Decoding), a training-free framework that operates within the embedding space. ASRD explicitly decouples the decoding context into trusted \textit{Anchor Tokens}, which are identified via temporal consistency, and uncertain candidates. Leveraging a dynamic Anchor Tokens Cache, we introduce two complementary mechanisms: (1) Anchor-Guided Generation, which injects entropy-weighted anchor signals into masked positions to implicitly rectify attention toward the reliable global skeleton; and (2) Anchor-Perturbed Verification, which applies orthogonal perturbations to uncertain candidate tokens, destabilizing and remasking errors driven by fragile local consensus. Extensive experiments on math and coding benchmarks demonstrate that ASRD outperforms recent remasking baselines, achieving accuracy improvements of up to 6.4\% while accelerating inference throughput by up to 7.2$\times$.

2606.16890 2026-06-16 cs.CL cs.AI 交叉投稿

Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

组合推理深度预测临床AI失败:与电子健康记录问答中Transformer组合性限制一致的实证证据

Sanjay Basu

发表机构 * University of California San Francisco(加州大学旧金山分校) Waymark

AI总结 本研究引入推理步数(hop count)作为预测大型语言模型在电子健康记录问答中失败的理论驱动指标,发现准确率随步数增加单调下降,且扩展思考未能显著改善,提示组合推理深度是跨架构的失败预测因子。

Comments 20 pages, 5 figures. Code: https://github.com/sanjaybasu/compositional-depth-clinical-ehr

详情
AI中文摘要

聚合准确率基准掩盖了大型语言模型在电子健康记录(EHR)问答中失败的系统性结构:需要更多推理步骤的问题会产生不成比例的更多错误。受Transformer组合性限制的理论结果启发,我们引入一个预先指定的跳数分类法——从EHR回答临床问题所需的不同推理步骤的数量——作为模型失败的原则性预测因子。我们标注了313个由临床医生生成的MedAlign EHR问答对,涵盖四个跳数级别,并在模型内消融(claude-sonnet-4-6,零样本 vs. 扩展思考)和跨架构复制(gpt-4o和gpt-5.4-2026-03-05,零样本)中评估了301个问题。所有三个模型,跨越两个提供商和两个OpenAI代(GPT-4和GPT-5),均显示准确率随跳数单调下降:Claude Sonnet零样本从30.6%(跳数=1)降至17.6%(跳数=4)(Cochran-Armitage z=-2.30,p=0.011;每跳OR 0.72,95% CI [0.56,0.92],p=0.008);GPT-4o复现了这一点(37.8%降至14.7%;OR 0.58 [0.45,0.75],p<0.001);gpt-5.4-2026-03-05证实了这一点(37.8%降至23.5%;OR 0.80 [0.66,0.98],p=0.027)。一项预先指定的上下文充分性审计显示,较高跳数的问题并未因EHR截断而受到不同不利影响(跳数2-4的可回答性为93-95%,而跳数1为79%),因此下降反映了组合推理难度。扩展思考在三个推理条件下并未显著平缓准确率-深度曲线,且思考令牌使用量与跳数呈正相关(r=0.31,p<0.0001),与预测的O(k)计算需求一致。因此,跳数是一个理论驱动、跨架构的大型语言模型在EHR问答中错误的预测因子,对临床AI的部署风险分层具有直接意义。

英文摘要

Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy -- the number of distinct reasoning steps required to answer a clinical question from an EHR -- as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.

2606.16996 2026-06-16 cs.CV cs.AI cs.LG 交叉投稿

ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation

ActiveSAM: 图像条件类别剪枝实现快速准确的开放词汇分割

Tran Dinh Tien, Zhiqiang Shen

发表机构 * VILA Lab, Mohamed bin Zayed University of Artificial Intelligence(VILA实验室,穆罕默德·本·扎耶德人工智能大学)

AI总结 提出ActiveSAM,一种无需训练、零样本的推理框架,通过图像条件类别剪枝和低分辨率预览,将SAM 3转化为主动词汇分割器,在8个基准上平均提升1.4 mIoU,速度提升最高5.5倍。

Comments Preprint. Code is available at https://github.com/VILA-Lab/ActiveSAM

详情
AI中文摘要

Segment Anything Model 3 (SAM 3) 为概念提示分割提供了强大的冻结骨干网络,但直接应用于开放词汇语义分割 (OVSS) 效率低下:全分辨率解码通常在整个数据集词汇表上运行,而每个图像只包含一小部分活跃类别。我们引入ActiveSAM,一种无需训练、零样本的推理框架,将SAM 3转化为主动词汇分割器。ActiveSAM首先规范化并扩展类别提示,然后从低分辨率存在预览中估计图像条件的活跃集。只有保留的类别使用冻结的SAM 3解码器进行桶式提示复用全分辨率解码。预览阶段仅使用类别存在证据,跳过不必要的分割头计算,而最终阶段应用边缘感知背景校准以抑制低置信度像素。ActiveSAM不需要目标数据集训练、权重更新或oracle类别存在标签。在八个OVSS基准上,ActiveSAM改善了无需训练的开放词汇语义分割的速度-准确率权衡,平均比当前最先进的SegEarth-OV3高出约+1.4 mIoU,同时在大型词汇数据集上运行速度最高提升5.5倍。ActiveSAM在模拟真实世界分布偏移的图像损坏下也表现出最强的鲁棒性,使其非常适合部署在噪声输入领域,如自动驾驶和具身AI。代码可在https://github.com/VILA-Lab/ActiveSAM获取。

英文摘要

Segment Anything Model 3 (SAM 3) provides a strong frozen backbone for concept-prompted segmentation, but applying it directly to open-vocabulary semantic segmentation (OVSS) is inefficient: full-resolution decoding is typically run over the entire dataset vocabulary, whereas each image contains only a small active subset of classes. We introduce ActiveSAM, a training-free, zero-shot inference framework that turns SAM 3 into an active-vocabulary segmenter. ActiveSAM first canonicalizes and expands class prompts, then estimates an image-conditioned active set from a low-resolution presence preview. Only the retained classes are decoded at full resolution, using bucketed prompt multiplexing with the frozen SAM 3 decoder. The preview stage uses only class-presence evidence and skips unnecessary segmentation-head computation, while the final stage applies margin-aware background calibration to suppress low-confidence pixels. ActiveSAM requires no target-dataset training, no weight updates, and no oracle class-presence labels. Across eight OVSS benchmarks, ActiveSAM improves the speed-accuracy tradeoff of training-free open-vocabulary semantic segmentation, outperforming the current state-of-the-art SegEarth-OV3 by approximately +1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets. ActiveSAM also demonstrates the strongest robustness under image corruption that simulates real-world distribution shift, making it well-suited for deployment in noisy-input domains such as autonomous driving and embodied AI. Code is available at https://github.com/VILA-Lab/ActiveSAM.

2308.06035 2026-06-16 cs.AI cs.CL 版本更新

Attention, not scale, drives human-AI alignment in multimodal language prediction

注意力,而非规模,驱动多模态语言预测中的人机对齐

Viktor Kewenig, Andrew Lampinen, Samuel A. Nastase, Christopher Edwards, Quitterie Lacome D'Elascombe, Akilles Rechardt, Jeremy I Skipper, Gabriella Vigliocco

发表机构 * Psychology and Language Science, Experimental Psychology, University College London, London, UK(心理学与语言科学、实验心理学,伦敦大学学院,伦敦,英国) Google Deepmind, Mountain View, US(谷歌DeepMind,山景城,美国) Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA(普林斯顿神经科学研究所,普林斯顿大学,普林斯顿,新泽西州,美国) Computer Science Department, Exeter University(计算机科学系,埃克塞特大学)

AI总结 本研究通过比较五种视觉-语言模型与600名人类在视觉世界范式中的表现,发现添加视觉上下文显著提升模型与人类在预测评分上的一致性,且注意力机制而非模型规模是主要驱动因素。

Comments 39 pages, 6 Figures, published in NPJ Artificial Intelligence

详情
AI中文摘要

人类通常利用视觉上下文来预测即将出现的词语。目前视觉-语言模型在多大程度上产生类似行为尚不清楚。在这里,我们将五个最先进的预训练系统与600名人类参与者并排放置在基于网络的视觉世界范式中。在100个六秒电影片段中,模型和参与者接收纯文本或同步视频和文本,并判断指定目标词接下来出现的可能性;全程记录人类眼动。添加视觉上下文在所有架构中均增加了模型与人类在可预测性评分上的一致性(平均Delta r = 0.18),且参数大小无影响。当视觉上下文信息丰富时,Transformer注意力显著提高了一致性。两个Transformer模型的注意力图与人类注视相对应,当场景包含信息性线索时,解释了高达70%的参与者间方差。值得注意的是,跨模态注意力可靠地追踪了语义线索上的预期性人类注视。这些结果表明,当前基于Transformer的视觉-语言模型可以在语言预测期间近似利用视觉上下文的人类行为——并且对信息性线索的选择性注意力,而非纯粹的模型规模,是这种对齐的主要驱动因素。

英文摘要

Humans routinely draw on visual context to predict upcoming words. To what extent current vision-language models produce comparable behaviour is unclear. Here we placed five state-of-the-art pretrained systems side-by-side with 600 human participants in a web-based Visual-World Paradigm. On each of 100 six-second movie clips, models and participants received either text only or synchronised video and text and judged how likely a specified target word was to appear next; human eye movements were tracked throughout. Adding visual context increased model-human alignment in predictability ratings across all architectures (average Delta r = 0.18) with no impact of parameter size. When visual context was informative, transformer attention significantly increased alignment. Attention maps from two transformer models corresponded with human gaze, explaining up to 70% of the inter-participant variance when the scene contained informative cues. Notably, cross-modal attention reliably tracked anticipatory human fixations on semantic cues. These results suggest that current transformer-based vision-language models can approximate human behaviour exploiting visual context during language prediction - and that selective attention to informative cues, not sheer model scale, is the principal driver of this alignment.

2510.01444 2026-06-16 cs.AI cs.CL cs.LG 版本更新

Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning

双不确定性引导的多模态推理策略学习

Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu

发表机构 * Tencent Hunyuan(腾讯文汇) University of Maryland(马里兰大学) University of North Carolina(北卡罗来纳大学)

AI总结 提出DUPL方法,通过量化感知不确定性和输出不确定性来引导策略更新,在多个多模态推理基准上显著提升模型准确率,优于现有方法。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)已经提升了多模态大语言模型的推理能力。然而,现有方法通常将视觉输入视为确定性的,忽略了视觉模态固有的感知模糊性。因此,它们无法区分模型的不确定性是源于复杂推理还是模糊感知,从而无法有针对性地分配探索或学习信号。为了解决这一问题,我们引入了\textbf{DUPL},一种用于多模态RLVR的双不确定性引导策略学习方法,该方法量化并利用感知不确定性(通过对称KL散度)和输出不确定性(通过策略熵)来指导策略更新。通过建立不确定性驱动的反馈循环并采用动态分支优先级机制,DUPL重新校准策略优势,将学习重点放在具有高感知或决策模糊性的状态上,从而实现超越被动数据增强的有效目标探索。在涵盖数学和通用领域的多个多模态推理基准上,DUPL取得了显著提升。它将Qwen2.5-VL的准确率提升了高达$\textbf{12.3%}$(3B)和$\textbf{7.9%}$(7B),将Qwen3-VL-Instruct的准确率提升了高达$\textbf{10.7%}$(4B)和$\textbf{12.4%}$(8B),持续优于GRPO,同时无缝泛化到其他算法(DAPO,平均$\textbf{+6.5%}$)和架构(LLaVA-OneVision-1.5,平均$\textbf{+4.7%}$)。这些结果表明,DUPL是一种有效且可泛化的多模态RLVR方法。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce \textbf{DUPL}, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Evaluated on diverse multimodal reasoning benchmarks spanning mathematical and general domains, DUPL achieves solid gains. It improves Qwen2.5-VL accuracy by up to $\textbf{12.3%}$ (3B) and $\textbf{7.9%}$ (7B), and Qwen3-VL-Instruct by up to $\textbf{10.7%}$ (4B) and $\textbf{12.4%}$ (8B), consistently outperforming GRPO, while seamlessly generalizing to alternative algorithms (DAPO, $\textbf{+6.5%}$ avg) and architectures (LLaVA-OneVision-1.5, $\textbf{+4.7%}$ avg). These results demonstrate that DUPL is an effective and generalizable approach for multimodal RLVR.

2602.08597 2026-06-16 cs.AI 版本更新

An Attention Mechanism for Robust Multimodal Integration in a Global Workspace Architecture

全局工作空间架构中用于鲁棒多模态集成的注意力机制

Roland Bertin-Johannet, Lara Scipio, Leopold Maytié, Rufin VanRullen

发表机构 * CerCo, CNRS, Université de Toulouse(CerCo、CNRS、图卢兹大学) ANITI, Artificial and Natural Intelligence Toulouse Institute(ANITI、图卢兹人工智能与自然智能研究所)

AI总结 提出一种轻量级自上而下的模态选择器,在冻结的多模态全局工作空间上运行,通过注意力机制提升系统在模态噪声或缺失下的鲁棒性,并在两个数据集上验证了其高效性和可迁移性。

Comments 21 pages, 6 figures, 2 tables. Accepted at ICANN 2026. Code: https://github.com/RolandBERTINJOHANNET/GW_attention

详情
AI中文摘要

鲁棒的多模态系统必须在某些模态存在噪声、退化或不可靠时仍保持有效。现有的多模态融合方法通常将模态选择与表示学习联合进行,这使得难以判断鲁棒性来自选择器本身还是来自完全的端到端协同适应。受全局工作空间理论(GWT)启发,我们使用一个轻量级的自上而下模态选择器,运行在冻结的多模态全局工作空间之上,来研究这个问题。我们在两个复杂度递增的多模态数据集(Simple Shapes 和 MM-IMDb 1.0)上,在结构化模态损坏条件下评估了我们的方法。该选择器在使用的可训练参数远少于端到端注意力基线的情况下提高了鲁棒性,并且学习到的选择策略在下游任务、损坏模式甚至之前未见过的模态上具有更好的迁移性。除了显式的损坏设置外,在 MM-IMDb 1.0 基准测试上,我们展示了相同的机制改善了全局工作空间相对于其无注意力对应版本的性能,并取得了不错的基准性能。

英文摘要

Robust multimodal systems must remain effective when some modalities are noisy, degraded, or unreliable. Existing multimodal fusion methods often learn modality selection jointly with representation learning, making it difficult to determine whether robustness comes from the selector itself or from full end-to-end co-adaptation. Motivated by Global Workspace Theory (GWT), we study this question using a lightweight top-down modality selector operating on top of a frozen multimodal global workspace. We evaluate our method on two multimodal datasets of increasing complexity: Simple Shapes and MM-IMDb 1.0, under structured modality corruptions. The selector improves robustness while using far fewer trainable parameters than end-to-end attention baselines, and the learned selection strategy transfers better across downstream tasks, corruption regimes, and even to a previously unseen modality. Beyond explicit corruption settings, on the MM-IMDb 1.0 benchmark, we show that the same mechanism improves the global workspace over its no-attention counterpart and yields decent benchmark performance.

2501.09310 2026-06-16 cs.CL cs.AI cs.SE 版本更新

Understanding, Detecting, and Repairing Real-World In-Context-Learning-Based Text-to-SQL Errors

理解、检测和修复基于上下文学习的真实世界文本到SQL错误

Jiawei Shen, Chengcheng Wan, Ruoyi Qiao, Jiazhen Zou, Hang Xu, Yuchen Shao, Yueling Zhang, Weikai Miao, Geguang Pu

发表机构 * East China Normal University(东华师范大学) Shanghai China(上海中国) Shanghai Innovation Institute(上海创新研究院) sei.ecnu.edu.cn(东华师范大学电子邮件)

AI总结 本研究首次全面调查基于上下文学习的文本到SQL错误,总结27种错误类型,并提出MapleDoctor框架,相比现有方法修复率提高13.8%,误修复极少,延迟降低67.4%。

Comments Accepted by FSE 2026

详情
AI中文摘要

大型语言模型(LLMs)已被用于文本到SQL任务,利用其上下文学习(ICL)能力将自然语言问题转换为SQL查询。然而,这种技术面临正确性问题。在本文中,我们首次对基于ICL的文本到SQL错误进行了全面研究。我们的研究涵盖了四种代表性的ICL技术、五种基本修复方法、两个基准测试和两种LLM设置。我们发现文本到SQL错误普遍存在,并总结了7个类别的27种错误类型。我们还发现,现有的修复尝试在正确性提升方面有限,同时具有高计算开销和许多误修复。基于这些发现,我们提出了MapleDoctor,一种新颖的文本到SQL错误检测和修复框架。评估表明,MapleDoctor优于现有解决方案,修复了13.8%更多的查询,误修复数量可忽略不计,并减少了67.4%的修复延迟。该工件可在GitHub上公开获取。

英文摘要

Large language models (LLMs) have been adopted for text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into SQL queries. However, such a technique faces correctness problems. In this paper, we conduct the first comprehensive study of text-to-SQL errors of ICL-based techniques. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 27 error types of 7 categories. We also find that existing repairing attempts have limited correctness improvement while having high computational overhead and many mis-repairs. Based on these findings, we propose MapleDoctor, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleDoctor outperforms existing solutions by repairing 13.8% more queries with a negligible number of mis-repairs and reducing 67.4% repair latency. The artifact is publicly available at GitHub.

2502.08266 2026-06-16 cs.CL cs.AI cs.LG 版本更新

Dealing with Annotator Disagreement in Hate Speech Classification

处理仇恨言论分类中的标注者分歧

Somaiyeh Dehghan, Mehmet Umut Sen, Berrin Yanikoglu

发表机构 * Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey(工程与自然科学学院,Sabanci大学,伊斯坦布尔,土耳其) Center of Excellence in Data Analytics (VERIM), Sabanci University, Istanbul, Turkey(数据分析卓越中心(VERIM),Sabanci大学,伊斯坦布尔,土耳其)

AI总结 研究标注者分歧对仇恨言论分类的影响,评估多数投票等聚合方法,并利用感知强度增强分类性能,在土耳其语推文中取得新最优结果。

Comments 19 pages, 4 Tables

详情
AI中文摘要

仇恨言论检测是一项关键任务,尤其是在有害内容可能迅速传播的社交媒体上。收集社交媒体内容(如推文)来训练机器学习模型很容易,但由于其固有的主观性,检测和分类仇恨言论可能很困难。这种主观性导致标注者之间频繁出现分歧,尤其是对于微妙或边缘内容。传统方法要么丢弃非共识样本,要么通过专家裁决强制设定“黄金标准”,忽略了关于不确定性和多样化人类视角的宝贵信息。我们研究了仇恨言论分类中标注者分歧这一很大程度上被忽视的问题,并评估了一系列聚合方法,包括多数投票、序数策略(最小值、最大值和均值),并分析了它们在二分类、四分类和六分类任务中的影响。此外,我们利用标注者感知的仇恨言论强度分数来探索基于回归和混合建模的方法。我们证明,过滤非共识样本会导致过于乐观的结果,而感知强度提供了增强分类性能的补充信号。最后,我们在土耳其语推文的仇恨言论检测中建立了新的最优结果,并表明标注者分歧在适当建模后,是构建更稳健可靠系统的宝贵资源。

英文摘要

Hate speech detection is a crucial task, especially on social media where harmful content can spread quickly. Collecting social media content (tweets etc.) to train machine learning models is easy, but detecting and categorizing hate speech can be difficult due to the inherently subjective nature. This subjectivity leads to frequent disagreement among annotators, particularly for subtle or borderline content. Traditional approaches either discard non-consensus samples or force a ''gold standard'' through expert adjudication, ignoring valuable information about uncertainty and diverse human perspectives. We examine the largely overlooked problem of annotator disagreement in hate speech classification and evaluate a range of aggregation methods, including majority voting, ordinal strategies (minimum, maximum, and mean), and analyze their impact across binary, 4-class, and 6-class classification tasks. In addition, we leverage annotators' perceived hate speech strength scores to explore regression-based and hybrid modeling approaches. Among others, we show that filtering non-consensus samples results in over-optimistic results and that the perceived strength provides a complementary signal that enhance classification performance. Finally, we establish new state-of-the-art results for hate speech detection in Turkish tweets, and demonstrate that annotator disagreement, when properly modeled, is a valuable resource for building more robust and reliable systems.

2502.11201 2026-06-16 cs.DB cs.AI 版本更新

Bridging the Gap: Enabling Natural Language Queries for NoSQL Databases through Text-to-NoSQL Translation

弥合差距:通过文本到NoSQL翻译实现NoSQL数据库的自然语言查询

Jinwei Lu, Jiawei Lu, Chen Zhang, Zhiqian Qin, Haodi Zhang, Yuanfeng Song, Raymond Chi-Wing Wong

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学) The University of Hong Kong(香港大学)

AI总结 本文研究Text-to-NoSQL任务,提出TEND基准和SAG求解器,用于将自然语言请求翻译为MongoDB聚合管道,验证了无模式文档推理的独特挑战。

详情
AI中文摘要

NoSQL数据库是核心数据基础设施,但对其的自然语言访问仍不成熟:正确的查询生成必须恢复非关系数据模型如何表示实体、嵌套路径、数组、缺失字段和动态键。本文研究Text-to-NoSQL,将自然语言请求翻译为可执行的NoSQL查询,实例化为对无模式文档存储的MongoDB聚合管道。我们提出TEND(Text-to-NoSQL Dataset的缩写),一个执行验证的基准,包含11个数据库上的1,210个MongoDB原生任务。据我们所知,TEND是第一个数据库世界设计为MongoDB原生的Text-to-NoSQL基准:专家手动定义集合边界、嵌套数组、可选和稀疏路径、多态形状以及动态键约定;这些世界填充真实数据并通过冻结的MongoDB执行验证,因此TEND评估无模式文档推理而非SQL到MQL的迁移。我们进一步引入SAG(Schema-as-Data Grounding)求解器,该求解器在受限MQL生成、执行接地修复和结果一致性选择之前,从存储文档证据中诱导路径和值接地。评估使用受限列容忍执行准确率(EXC)作为主要指标,辅以分级结果集F1和互斥执行结果分解。实验表明,在NL2SQL上表现强劲的LLM在TEND上大幅下降,验证了Text-to-NoSQL作为一个独特的无模式文档推理问题。

英文摘要

NoSQL databases are core data infrastructure, yet natural-language access to them remains underdeveloped: correct query generation must recover how a non-relational data model represents entities, nested paths, arrays, missing fields, and dynamic keys. This paper studies Text-to-NoSQL, translating natural-language requests into executable NoSQL queries, instantiated with MongoDB aggregation pipelines over schema-less document stores. We present TEND, short for Text-to-NoSQL Dataset, an execution-verified benchmark with 1,210 MongoDB-native tasks across 11 databases. To our knowledge, TEND is the first Text-to-NoSQL benchmark whose database worlds are MongoDB-native by design: experts manually define collection boundaries, nested arrays, optional and sparse paths, polymorphic shapes, and dynamic-key conventions; these worlds are populated with real data and verified through frozen MongoDB execution, so TEND evaluates schema-less document reasoning rather than SQL-to-MQL transfer. We further introduce SAG, a Schema-as-Data Grounding solver that induces path and value grounding from stored-document evidence before bounded MQL generation, execution-grounded repair, and result-consistency selection. Evaluation uses bounded column-tolerant execution accuracy (EXC) as the headline metric, complemented by a graded result-set F1 and a mutually exclusive execution-outcome decomposition. Experiments show that LLMs with strong NL2SQL performance degrade substantially on TEND, validating Text-to-NoSQL as a distinct schema-less document reasoning problem.

2506.16738 2026-06-16 cs.CL cs.AI cs.SD eess.AS 版本更新

LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

LM-SPT:面向语音标记化的LM对齐语义蒸馏

Daejin Jo, Jeeyoung Yun, Byungseok Roh, Sungwoong Kim

发表机构 * Department of Artificial Intelligence, Korea University(韩国大学人工智能系) Multi-modal Model Training, Kakao Corp(Kakao公司多模态模型训练部)

AI总结 提出LM-SPT方法,通过语义语音重合成蒸馏,在不降低帧率的情况下生成与语言模型更对齐的离散语音标记,在ASR和TTS任务中优于现有方法。

详情
AI中文摘要

随着语音语言模型(SLM)的快速发展,离散语音标记已成为语音和文本之间的核心接口,实现了跨模态的统一建模。最近的语音标记化方法旨在从低级声学中分离语义信息,以更好地与语言模型(LM)对齐。特别是,以前的方法使用自监督学习(SSL)教师模型(如HuBERT)提取语义表示,然后将其蒸馏到语义量化器中,以抑制声学冗余并捕获与内容相关的潜在结构。然而,这些标记器通常以相对较高的帧率运行,产生的标记序列明显长于其文本对应物,阻碍了与预训练LM的无缝集成。尽管最近的方法尝试通过对SSL特征应用均匀平均池化来降低标记率,但这可能会过度平滑包含内容的区域并稀释结构信息,从而可能限制LM对齐。为了解决这个问题,我们提出了LM-SPT,一种基于语义语音重合成蒸馏的LM对齐语音标记化方法。LM-SPT不是通过池化直接匹配教师和学生特征,而是仅从语义标记重合成语音,并使用冻结的、LM对齐的语音编码器最小化从原始波形和重合成波形提取的表示之间的差异。这种间接监督避免了严格的时间对齐,并鼓励在降低帧率下与LM更语义对齐的专用语义单元。实验结果表明,在自动语音识别和文本到语音任务中,即使在不损害编解码器级别的语音重建保真度的情况下,所提出的LM-SPT在应用于SLM时也始终优于先前的语义增强语音标记器。

英文摘要

With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models (LMs). In particular, previous methods use self-supervised learning (SSL) teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, these tokenizers often operate at relatively high frame rates, producing token sequences significantly longer than their textual counterparts and hindering seamless integration with pretrained LMs. Although recent methods attempt to reduce the token rate by applying uniform average pooling to SSL features, this can over-smooth content-bearing regions and dilute the structural information, thereby potentially limiting the LM alignment. To address this, we propose LM-SPT, an LM-aligned speech tokenization method based on semantic speech-resynthesis distillation. Instead of directly matching teacher and student features via pooling, LM-SPT resynthesizes speech from semantic tokens only and minimizes the discrepancy between representations extracted from the original and resynthesized waveforms using a frozen, LM-aligned speech encoder. This indirect supervision avoids rigid temporal alignment and encourages dedicated semantic units that are more semantically aligned with LMs under reduced frame rates. Experimental results show that the proposed LM-SPT consistently outperforms previous semantic-enhanced speech tokenizers when applied to SLMs for the tasks of automatic speech recognition and text-to-speech, even without compromising the speech reconstruction fidelity at the codec level.

2510.13940 2026-06-16 cs.CL cs.AI 版本更新

Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

少即是多:用最小测试时干预提升大语言模型推理能力

Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Ying-Cong Chen

发表机构 * HKUST(GZ)(香港科技大学(广州)) Kuaishou Technology(快手科技) AIML(人工智能实验室) ZJU(浙江大学) Ant Group(蚂蚁集团) HKUST(香港科技大学)

AI总结 针对大语言模型推理中高计算成本问题,提出最小测试时干预(MTI)框架,通过仅在不确定位置应用分类器自由引导和轻量负提示引导,在保持高效的同时提升推理准确性和稳定性。

Comments Code: https://github.com/EnVision-Research/MTI

详情
AI中文摘要

大语言模型(LLMs)的最新进展集中在通过增加推理计算来改进测试时扩展以提升推理能力,但这往往以牺牲效率为代价。我们重新审视测试时行为,发现了一个简单但未被充分探索的现象:推理不确定性高度局部化——只有一小部分高熵标记对输出正确性起主导作用。受此启发,我们提出了最小测试时干预(MTI),这是一个无需训练的框架,以最小的开销增强推理准确性和稳定性。MTI包括:(i)选择性CFG干预,仅在不确定位置应用分类器自由引导;(ii)轻量负提示引导,重用主模型的KV缓存以高效近似无条件解码。MTI在通用、编码和STEM任务上均取得一致提升——例如,在DeepSeek-R1-7B的六个基准测试上平均提升9.28%,在使用Ling-mini-2.0的AIME2024上提升11.25%——同时保持高效性。

英文摘要

Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized-only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks-e.g., +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0-while remaining highly efficient.

2601.06212 2026-06-16 cs.CV cs.AI 版本更新

Akasha 2: Hamiltonian State Space Duality and Visual-Language Joint Embedding Predictive Architectur

Akasha 2: 哈密顿状态空间对偶与视觉-语言联合嵌入预测架构

Yani Meziani

发表机构 * Independent AI Researcher(独立AI研究员) Québec (QC), Canada(魁北克(QC),加拿大)

AI总结 提出 Akasha 2 多模态架构,结合哈密顿状态空间对偶与视觉-语言联合嵌入预测,通过稀疏混合哈密顿专家和哈密顿流匹配实现超低延迟视频预测与合成,在保持能量守恒下取得 SOTA 性能。

Comments No supporting claims were validated in this automated agentic R&D research run

详情
AI中文摘要

我们提出了 Akasha 2,一种最先进的多模态架构,它集成了哈密顿状态空间对偶(H-SSD)与视觉-语言联合嵌入预测架构(VL-JEPA)。该系统利用 Mamba-3 选择性状态空间模型(SSM),并通过稀疏混合哈密顿专家(SMoE-HE)增强,后者通过辛积分强制执行潜在物理守恒定律。对于视觉合成,我们引入了哈密顿流匹配(HFM)和持久化 3D 高斯泼溅(3DGS),在移动硬件上实现了超低延迟(<50ms)。这项工作在潜在世界模型中建立了一个新范式,通过全息记忆架构实现了前所未有的时空一致性。我们的方法表明,将物理启发的归纳偏置融入神经架构可带来显著改进:最先进的视频预测(FVD: 287),比扩散模型快 4 倍的视觉合成,以及相比 Transformer 基线 3-18 倍的推理加速,同时在长时间范围内保持能量守恒。

英文摘要

We present Akasha 2, a state-of-the-art multimodal architecture that integrates Hamiltonian State Space Duality (H-SSD) with Visual-Language Joint Embedding Predictive Architecture (VL-JEPA). The system leverages the Mamba-3 Selective State Space Model (SSM) augmented by a Sparse Mixture of Hamiltonian Experts (SMoE-HE) that enforces latent physical conservation laws through symplectic integration. For visual synthesis, we introduce Hamiltonian Flow Matching (HFM) and persistent 3D Gaussian Splatting (3DGS), enabling ultra-low latency (<50ms) on mobile hardware. This work establishes a new paradigm in latent world models, achieving unprecedented spatiotemporal coherence through a holographic memory architecture. Our approach demonstrates that incorporating physics-inspired inductive biases into neural architectures yields significant improvements: state-of-the-art video prediction (FVD: 287), 4x faster visual synthesis than diffusion models, and 3-18x inference speedup over transformer baselines while maintaining energy conservation over extended horizons.

2602.00344 2026-06-16 cs.CV cs.AI cs.CL 版本更新

When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

当RAG有害:诊断和缓解检索增强LVLMs中的注意力分散

Beidi Zhao, Wenlong Deng, Xinting Liao, Yushu Li, Nazim Shaikh, Yao Nie, Xiaoxiao Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文发现检索增强生成(RAG)在LVLMs中导致注意力分散(AD)问题,即检索文本抑制视觉注意力并偏离问题相关区域,提出MAD-RAG方法通过双问题公式和注意力混合来解耦视觉定位与上下文整合,在三个基准上提升性能并纠正大部分失败案例。

Comments 19 pages, 13 figures

详情
AI中文摘要

虽然检索增强生成(RAG)是增强大型视觉语言模型(LVLMs)在基于知识的VQA任务上的主导范式之一,但最近的工作将RAG失败归因于对检索上下文的注意力不足,并提出减少分配给图像令牌的注意力。在这项工作中,我们识别了先前研究忽略的一个不同失败模式:注意力分散(AD)。当检索上下文足够(高度相关或包含正确答案)时,检索文本全局抑制视觉注意力,并且图像令牌上的注意力从问题相关区域转移。这导致模型在原本无需检索文本就能正确回答的问题上失败。为了缓解这个问题,我们提出了MAD-RAG,一种无需训练的干预方法,通过双问题公式解耦视觉定位与上下文整合,并结合注意力混合以保留图像条件证据。在OK-VQA、E-VQA和InfoSeek上的大量实验表明,MAD-RAG在不同模型家族中始终优于现有基线,相对于原始RAG基线分别取得了高达4.76%、9.20%和6.18%的绝对增益。值得注意的是,MAD-RAG纠正了高达74.68%的失败案例,且计算开销可忽略不计。

英文摘要

While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous study overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.

2602.12279 2026-06-16 cs.CV cs.AI cs.LG 版本更新

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

UniT:统一多模态思维链测试时扩展

Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang Dai, Jialiang Wang, Zecheng He, Jianwei Yang, Chunyuan Li, Junzhe Sun, Chu Wang, Serena Yeung-Levy, Felix Juefei-Xu

发表机构 * Stanford University(斯坦福大学) Meta Superintelligence Labs(Meta超级智能实验室) Nanyang Technological University(南洋理工大学)

AI总结 提出UniT框架,通过多轮推理、验证和细化实现统一多模态模型的测试时扩展,实验表明短推理轨迹可泛化到长链,顺序思维链比并行采样更高效。

Comments CVPR 2026

详情
AI中文摘要

统一模型可以在单一架构内处理多模态理解和生成,但它们通常以单次通过的方式运行,而不迭代地细化输出。许多多模态任务,尤其是那些涉及复杂空间组合、多个交互对象或不断变化的指令的任务,需要分解指令、验证中间结果并进行迭代修正。虽然测试时扩展(TTS)已证明分配额外的推理计算用于迭代推理能显著提升语言模型性能,但将这一范式扩展到统一多模态模型仍然是一个开放挑战。我们引入了UniT,一个用于多模态思维链测试时扩展的框架,使单个统一模型能够在多轮中推理、验证和细化。UniT结合了智能体数据合成、统一模型训练和灵活的测试时推理,以激发包括验证、子目标分解和内容记忆在内的认知行为。我们的关键发现是:(1)在短推理轨迹上训练的统一模型能在测试时泛化到更长的推理链;(2)顺序思维链推理比并行采样提供更可扩展且计算高效的TTS策略;(3)在生成和编辑轨迹上训练能提升分布外视觉推理能力。这些结果确立了多模态测试时扩展作为推进统一模型中生成和理解的有效的范式。

英文摘要

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.

2603.01696 2026-06-16 cs.CV cs.AI 版本更新

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

跨模态身份映射:通过强化学习最小化模态转换中的信息损失

Haonan Jia, Shichao Dong, Xin Dong, Zenghui Sun, Jin Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang

发表机构 * Taobao & Tmall Group of Alibaba(淘宝与天猫集团(阿里巴巴)) The University of Hong Kong(香港大学)

AI总结 提出跨模态身份映射(CIM)框架,利用强化学习优化图像描述,通过检索一致性度量信息损失,无需额外标注,显著提升关系推理能力。

Comments Accepted to CVPR 2026

详情
AI中文摘要

大型视觉语言模型(LVLMs)在生成的图像描述中常常遗漏或歪曲关键的视觉内容。最小化这种信息损失将迫使LVLMs关注图像细节以生成精确的描述。然而,由于视觉内容和文本输出之间的模态差距,衡量模态转换过程中的信息损失本质上是困难的。在本文中,我们认为图像描述的质量与使用该描述通过文本搜索检索到的图像之间的相似性正相关。基于这一见解,我们进一步提出了跨模态身份映射(CIM),一种无需额外标注即可增强图像描述的强化学习框架。具体来说,该方法从两个角度定量评估信息损失:图库表示一致性和查询-图库图像相关性。在这些指标的监督下,LVLM最小化信息损失并旨在实现从图像到描述的恒等映射。实验结果表明,我们的方法在图像描述方面表现出优越的性能,即使与监督微调相比也是如此。特别是在COCO-LN500基准上,CIM在Qwen2.5-VL-7B上的关系推理提升了20%。

英文摘要

Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B.

2603.05299 2026-06-16 cs.LG cs.AI cs.CL cs.SD 版本更新

WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation

WavSLM: 通过WavLM蒸馏的单流语音语言建模

Luca Della Libera, Cem Subakan, Mirco Ravanelli

发表机构 * Concordia University(康科迪亚大学) Mila-Quebec AI Institute(蒙特利尔AI研究所) Université Laval(拉瓦尔大学)

AI总结 提出WavSLM,通过量化蒸馏WavLM自监督表示到单一码本并优化自回归下一块预测,实现无文本监督的单流语音语言建模,在一致性和生成任务上表现竞争。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

大型语言模型表明,简单的自回归训练可以产生可扩展且连贯的生成,但由于语义和声学信息的纠缠,将这一范式扩展到语音仍然具有挑战性。大多数现有的语音语言模型依赖于文本监督、分层令牌流或复杂的混合架构,偏离了在文本中已被证明有效的单流生成预训练范式。在这项工作中,我们引入了WavSLM,一种通过将自监督WavLM表示量化和蒸馏到单一码本中,并优化自回归下一块预测目标来训练的语音语言模型。WavSLM在单个令牌流中联合建模语义和声学信息,无需文本监督或文本预训练。尽管其简单性,它在一致性基准和语音生成方面取得了有竞争力的性能,同时使用更少的参数、更少的训练数据,并支持流式推理。

英文摘要

Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference.

2603.24058 2026-06-16 cs.CV cs.AI 版本更新

Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

通过注意力不平衡修正减轻LVLM中的对象幻觉

Han Sun, Qin Li, Peixin Wang, Min Zhang

发表机构 * Shanghai Key Laboratory of Trustworthy Computing, East China Normal University(上海可信计算实验室,东华大学)

AI总结 发现多模态和token间注意力不平衡是对象幻觉的因果因素,提出轻量级解码干预方法AIR,通过重新分配注意力权重修正不平衡,在多个基准上减少幻觉达35.1%,并提升通用能力。

Comments CVPR 2026 Findings Track, code is available at https://github.com/Ice-wave/AIR

详情
AI中文摘要

大型视觉-语言模型(LVLMs)中的对象幻觉严重损害了其在现实应用中的可靠性,对它们在自动驾驶和医学图像分析等高风险场景中的部署构成了关键障碍。通过系统的实证研究,我们发现跨模态(即视觉和语言)和模态内(单个token之间)的不平衡注意力分配与对象幻觉的发生存在强因果相关性。利用这一洞察,我们引入了一个新概念——注意力不平衡,它不仅量化了注意力差异的程度,还直观地描绘了驱动对象幻觉的潜在模式(例如,对无关语言token的过度关注或对判别性视觉特征的关注不足)。为了减轻对象幻觉,我们进一步提出了注意力不平衡修正(AIR),这是一种轻量级的解码时干预方法,通过重新分配注意力权重和调整注意力分布来修正模态级和token级的不平衡。在四个主流LVLM和三个基准(CHAIR、POPE和MM-Vet)上,与七个基线进行的大量评估表明,AIR持续降低对象幻觉率,与基线相比最高减少35.1%,同时在多种视觉-语言任务中提升LVLMs的通用能力高达15.9%。

英文摘要

Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we identify that the imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving up to 15.9% of LVLMs' general capability across diverse vision-language tasks.

2604.17301 2026-06-16 cs.CL cs.AI cs.HC cs.IR cs.LG 版本更新

RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation

RoTRAG: 基于经验法则推理的检索增强生成对话有害内容检测

Juhyeon Lee, Wonduk Seo, Junseo Koh, Seunghyun Lee, Haihua Chen, Yi Bu

发表机构 * Peking University(北京大学) Enhans University of North Texas(北得克萨斯大学)

AI总结 提出RoTRAG框架,通过检索外部道德规范(RoTs)增强LLM的多轮对话有害内容检测,实现基于规范推理和分类,平均F1提升约40%,分布误差降低8.4%。

Comments Accepted by SIGIR-ICTIR 2026, Oral Presentation

详情
Journal ref
Proceedings of the 2026 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR '26), July 25, 2026, Melbourne, VIC, Australia. ACM, New York, NY, USA, 12 pages
AI中文摘要

检测多轮对话中的有害内容需要对完整对话上下文进行推理,而非孤立的话语。然而,现有方法主要依赖模型内部的参数化知识,缺乏对外部规范性原则的明确依据。这常导致在社会细微语境下判断不一致、可解释性有限以及跨轮次冗余推理。为解决此问题,我们提出RoTRAG,一种检索增强框架,将简洁的人类编写的道德规范(称为经验法则,RoTs)融入基于LLM的有害性评估中。对于每一轮,RoTRAG从外部语料库中检索相关RoTs,并将其作为轮次推理和最终严重性分类的明确规范性证据。为提高效率,我们进一步引入一个轻量级二元路由分类器,决定新轮次是否需要基于检索的推理或可重用现有上下文。在ProsocialDialog和Safety Reasoning Multi Turn Dialogue上的实验表明,RoTRAG在有害分类和严重性估计上均持续优于竞争基线,在基准数据集上F1平均相对提升约40%,分布误差平均相对降低8.4%,同时在不牺牲性能的情况下减少冗余计算。

英文摘要

Detecting harmful content in multi turn dialogue requires reasoning over the full conversational context rather than isolated utterances. However, most existing methods rely mainly on models internal parametric knowledge, without explicit grounding in external normative principles. This often leads to inconsistent judgments in socially nuanced contexts, limited interpretability, and redundant reasoning across turns. To address this, we propose RoTRAG, a retrieval augmented framework that incorporates concise human written moral norms, called Rules of Thumb (RoTs), into LLM based harm assessment. For each turn, RoTRAG retrieves relevant RoTs from an external corpus and uses them as explicit normative evidence for turn level reasoning and final severity classification. To improve efficiency, we further introduce a lightweight binary routing classifier that decides whether a new turn requires retrieval grounded reasoning or can reuse existing context. Experiments on ProsocialDialog and Safety Reasoning Multi Turn Dialogue show that RoTRAG consistently improves both harm classification and severity estimation over competitive baselines, with an average relative gain of around 40% in F1 across benchmark datasets and an average relative reduction of 8.4% in distributional error, while reducing redundant computation without sacrificing performance.

2605.01733 2026-06-16 cs.CV cs.AI 版本更新

GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

GEASS: 基于证据适应的门控选择性描述信任机制用于视觉-语言模型

Zeshang Li, Shuoyang Zhang

发表机构 * University of International Relations(国际关系大学)

AI总结 本文提出GEASS,一种无需训练的模块,通过门控、加权和证据标准来决定模型在每个查询中消耗多少描述信息,从而提升视觉-语言模型的准确性。

Comments 18 pages, 12 figures

详情
AI中文摘要

视觉-语言模型(VLMs)在 grounded reasoning 方面表现出色,但仍然容易产生 object hallucination。最近的研究将自动生成的描述视为一个均匀的积极资源,但我们发现盲目地嵌入一个描述可能会降低而不是提高性能——在 HallusionBench 上,Qwen2.5-VL-3B 的准确性下降了近 10 个点。两个结构性质解释了这一点。首先,描述不仅锚定了模型的最终答案,还锚定了其推理轨迹和词汇选择。其次,描述错误是不对称的:遗漏远多于伪造,但每个伪造对实例的影响更大。因此,描述的有用性是查询特定的,而不是语料库特定的。我们提出 GEASS(ated Evidence-Adaptive Selective Caption Trust),一个无需训练的模块,决定每个查询中模型消耗多少描述信息:它通过干净路径的置信度来门控描述,通过它产生的熵减少来加权描述,并在两种路径意见不同时提高证据标准。在 POPE 和 HallusionBench 上对四个 VLMs 的实验表明,GEASS 在 vanilla 推理和对比解码上都表现出色,仅需每个查询两个额外的前向传递。

英文摘要

Vision-Language Models (VLMs) hallucinate objects that are not present, and a growing line of work tries to curb this by feeding the model its own generated caption as auxiliary evidence -- assuming that a caption, once available, is something to consume. We show this fails: naively appending a caption can lower accuracy rather than raise it, dropping Qwen2.5-VL-3B† on HallusionBench by nearly ten points. To understand why, we build GD-Probe, a diagnostic set that pairs a global and a detail question on the same image, so that any difference in caption effect is attributable to the question alone. Caption utility proves to be a per-query property: the same caption helps global questions and harms detail ones, through a single mechanism -- an embedded caption competes with the image for attention and pulls the model's evidence onto its own text -- whose sign is set by whether the caption covers the queried content. Crucially, this regime is readable from quantities the decoder already emits, with no attention access or grounding. We turn this into GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that decides per query how much of the caption to trust, gating it by the clean path's confidence, weighting it by the entropy reduction it induces, and raising the evidence bar when the two pathways disagree. Across four VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting, adding only two forward passes and no parameters.

2605.18313 2026-06-16 cs.CV cs.AI 版本更新

Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering

Wasserstein均衡解码用于可靠的医疗视觉问答

Luca Hagen, Johanna P. Müller, Weitong Zhang, Mengyun Qiao, Bernhard Kainz

发表机构 * Friedrich-Alexander University Erlangen-Nürnberg(弗里德里希-亚历山大厄林根-纽伦堡大学) Imperial College London(伦敦帝国理工学院) University College London(伦敦大学学院)

AI总结 本文提出了一种基于Wasserstein距离的均衡解码方法,用于改进医疗视觉问答系统,通过语义感知的停止准则提高解码效率和准确性,同时在VQA-RAD和PathVQA数据集上实现了显著的性能提升。

详情
AI中文摘要

小型视觉-语言模型(2-8B)由于隐私限制、有限的连接性和低延迟要求,适合临床部署。然而,其有限的容量会加剧生成合理但错误的输出。我们扩展了之前仅限于纯文本、封闭式NLP任务的博弈论解码方法,应用于开放式的医疗视觉问答(VQA)。我们引入了一种语义感知的Wasserstein停止准则,以取代基于词序的匹配,使收敛基于候选答案之间的语义共识,避免因临床等效排名交换导致的不必要的迭代。在VQA-RAD和PathVQA上,我们获得了比贪心和判别基线显著的改进。在VQA-RAD上,我们比贪心的4B模型提高了3.5个百分点(p < 0.01),在更大规模上呈现出相似趋势。在PathVQA上,Gemma-3-4B与BDG在贪心解码下表现相当,尽管没有领域特定的微调。在与经典BDG的准确性相等时,Wasserstein准则将平均收敛迭代次数减少了约20%,在提高推理效率的同时保留了博弈论均衡行为。代码可在https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA上获得。

英文摘要

Small vision-language models (2-8B) are well-suited for clinical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language models for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clinically equivalent ranking swaps. On VQA-RAD and PathVQA, we obtain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p < 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain-specific fine-tuning. At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at https://github.com/luca-hagen/ Wasserstein-BDG-medical-VQA.

2606.00435 2026-06-16 cs.CV cs.AI 版本更新

Detect Before You Leap: Mirage Detection in Vision-Language Models

在跳跃前检测:视觉语言模型中的幻象检测

Sayeed Shafayet Chowdhury, Md. Shaown Miah, S. M. Taiabul Haque, Syed Ishtiaque Ahmed

发表机构 * Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校) Bangladesh University of Engineering and Technology(孟加拉工程与技术大学)

AI总结 针对视觉语言模型在缺乏视觉证据时产生自信但无根据回答的幻象问题,提出文本条件层内对齐方法,通过分析视觉编码器各层补丁令牌与问题嵌入的对齐轨迹,结合像素统计、零样本域路由和结构化自评估,实现高精度预响应幻象检测。

详情
AI中文摘要

视觉语言模型(VLM)即使在所需视觉证据缺失、空白或与问题无关时,也能产生自信的视觉答案。这种失败模式被称为幻象(Asadi et al. 2026),在医学和文档视觉问答中尤其令人担忧,因为看似合理但缺乏视觉依据的响应可能被误认为是基于图像的证据。我们研究预发布幻象检测:给定图像-问题对,目标是在VLM生成响应之前确定其应回答还是弃权。我们提出文本条件层内对齐(TC-LIA),一种模型无关的方法,探测CLIP ViT-H/14视觉编码器各层的补丁令牌表示。TC-LIA将逐层图像补丁令牌投影到最终CLIP嵌入空间,并测量它们与问题嵌入的相似度,从而跟踪问题相关视觉证据是否在视觉层中出现。得到的对齐轨迹通过最终图像-文本余弦相似度、后期层top-k补丁-文本对齐、早期到后期增益和逐层斜率进行总结。这些特征与像素统计空白/噪声检测、零样本域路由和结构化VLM自评估相结合,形成一个集成系统。在五个VQA领域、三种输入条件和十二个VLM骨干网络上,最佳系统实现了约94.6-94.7%的三类检测准确率,幻象率低于3%,而基线幻象率范围为21.7%至66.6%。

英文摘要

Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, recently described as mirage (mirage2026), is especially concerning in medical and document VQA, where a plausible but visually ungrounded answer may be mistaken for image-based evidence. We study the complementary problem of pre-release mirage detection: given an image-question pair, determine whether the VLM should answer or abstain before generation. To that end, we propose a novel model-agnostic Text-Conditioned Layer-wise Internal Alignment (TC-LIA) method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. The key idea is to project layer-wise image patch tokens into the final CLIP embedding space and measure their similarity with the question embedding, thereby tracking whether question-relevant visual evidence emerges across vision layers. TC-LIA summarizes this alignment trajectory using final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic based blank/noise detection, zero-shot domain routing, and structured VLM self-assessment in an ensemble. Across five VQA domains with related, unrelated-real, and blank/noise inputs, and across twelve VLM backbones, Qwen2.5-VL-32B achieves the highest three-class detection accuracy of 94.7% with a 3.0% mirage rate, while Qwen2.5-VL-72B achieves 94.6% accuracy with a lower 2.8% mirage rate. Baseline mirage rates span 21.7-66.6%.

2606.02955 2026-06-16 cs.CL cs.AI cs.LG 版本更新

Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference

Fast-dLLM++: 用于更快扩散LLM推理的Fréchet轮廓解码

Siva Rajesh Kasa, Yasong Dai, Sumit Negi, Hongdong Li

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 针对扩散大语言模型推理中并行令牌生成的瓶颈,提出Fréchet轮廓解码方法,通过利用异构置信度轮廓选择并行提交集,在保持模型和缓存不变的情况下提升吞吐量。

Comments Initial version accepted at Workshop on Structured Probabilistic Inference & Generative Modeling, ICML 2026. Project Page: https://ringo-star.github.io/projectpage_frechet/

详情
AI中文摘要

扩散大语言模型承诺并行令牌生成,但推理仍然受限于决定哪些掩码令牌可以安全地一起提交。Fast-dLLM通过KV缓存和置信度引导的并行解码解决了这个问题,但其解码理论使用同质高置信度假设,实际上将每个候选集简化为其最弱的选择令牌。我们认为这留下了速度提升空间,因为实际解码步骤表现出异构置信度轮廓。我们提出 extbf{Fast-dLLM++},一种无需训练的扩展,引入了\emph{Fréchet轮廓解码}:从完整的排序置信度轮廓中选择并行提交集,而不是单个最坏情况置信度。得到的规则是Fast-dLLM因子选择器的异构置信度泛化,在等置信度情况下精确恢复先前规则,并在所选令牌具有不均匀置信度时增加一个可证明的\emph{异构性奖励}。Fast-dLLM++完全保持模型、扩散过程和缓存实现不变,使其成为现有Fast-dLLM解码的直接替代品。在GSM8K、MATH、HumanEval和MBPP上使用LLaDA-8B模型的实验表明,理论改进直接转化为经验收益:轮廓感知选择通过利用最弱令牌规则忽略的安全并行性改进了准确率-吞吐量前沿,在可比准确率下实现了高达37%的吞吐量提升。我们的匿名代码发布在此https URL。

英文摘要

Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. Fast-dLLM addressed this with KV caching and confidence-guided parallel decoding, but its decoding theory uses a homogeneous high-confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose \textbf{Fast-dLLM++}, a training-free extension that introduces \emph{Fréchet profile decoding}: selecting parallel commit sets from the full sorted confidence profile rather than a single worst-case confidence. The resulting rule is a heterogeneous-confidence generalization of Fast-dLLM's factor selector and it recovers the previous rule exactly in the equal-confidence case and adds a provable \emph{heterogeneity bonus} when the selected tokens have uneven confidences. Fast-dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop-in replacement for existing Fast-dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA-8B model show that the theoretical improvement translates directly into empirical gains: profile-aware selection improves the accuracy--throughput frontier by exploiting safe parallelism that weakest-token rules miss, achieving up to 37\% higher throughput at comparable accuracy. Our code release is at https://github.com/Ringo-Star/FastdLLM_plusplus.

2606.06646 2026-06-16 cs.CL cs.AI 版本更新

CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures

CAF-Gen: 一种用于丰富论证结构的多智能体系统

Jakub Bąba, Jarosław A. Chudziak

发表机构 * Faculty of Electronics and Information Technology, Warsaw University of Technology(电子与信息技术学院,华沙技术大学)

AI总结 提出CAF-Gen多智能体框架,通过迭代创建-评审流程将浅层论证结构自动转换为符合Carneades论证框架的丰富模型,克服单次生成的结构不稳定性。

Comments Accepted for publication in the proceedings of ICCCI 2026

详情
AI中文摘要

从自然文本中形式化复杂推理是计算语言学的核心挑战之一。它要求系统不仅理解关键词,还要理解文本中嵌入的上下文和复杂推理。当前的论证挖掘技术能够识别基本的主张和前提,但往往难以捕捉高级模式(如Carneades论证框架)所需的更丰富的结构信息,该框架包含前提类型、证明标准和论证模式等特征。我们通过引入CAF-Gen来解决这一局限性,这是一个自动化的多智能体框架,旨在将浅层论证结构丰富为符合CAF的论证模型。通过采用迭代的创建者-评审者流水线,创建者智能体的输出由批评智能体验证以确保结构完整性。这种多智能体协作对于缓解单次生成模型典型的结构不稳定性至关重要。我们的实验表明,迭代反馈循环提高了所得数据的质量,并与原始标注实现了强对齐,同时生成了结构更丰富的模型。我们的发现表明,多智能体系统可以克服单次生成的局限性,为自动建模形式论证提供了一种稳健的方法。

英文摘要

Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text. Current Argument Mining (AM) techniques identify basic claims and premises, yet they often struggle to capture the richer structural information required by advanced schemas such as the Carneades Argumentation Framework (CAF), which incorporates features such as premise types, proof standards, and argument schemes. We address this limitation by introducing CAF-Gen, an automated multi-agent framework designed to enrich shallow argument structures into CAF-compliant argument models. By employing an iterative Creator-Reviewer pipeline, a creator agent's output is validated by a critical agent to ensure structural integrity. This multi-agent collaboration is crucial for mitigating the structural instability typical of single-pass generative models. Our experiments demonstrate that the iterative feedback loop improves the quality of the resulting data and achieves strong alignment with the original annotations, while producing structurally richer models. Our findings show that the multi-agent system can overcome the limitations of single-pass generation, providing a robust methodology for the automated modeling of formal argumentation.

2606.07015 2026-06-16 cs.SD cs.AI 版本更新

Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation

面向统一歌曲生成与带伴奏共生成的歌声转换

Ziyu Zhang, Chunyu Qiang, Xiaopeng Wang, Yuxin Guo, Kang Yin, Wenjie Tian, Jingbin Hu, Tianlun Zuo, Zhao Guo, Teng Ma, Yuzhe Liang, Chen Zhang, Lei Xie

发表机构 * Northwestern Polytechnical University(西北工业大学) Kuaishou Technology(快手科技) Beijing Institute of Technology(北京理工大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出UniSinger框架,基于多模态扩散Transformer统一零样本歌曲生成与伴奏共生成歌声转换,通过共享说话人嵌入和课程学习策略实现跨任务音色控制与多任务优化。

详情
AI中文摘要

尽管歌曲生成和歌声转换(SVC)已显著发展,但长期以来它们被孤立开发:前者缺乏零样本说话人克隆,而后者忽略了人声-伴奏协同。为弥合这一差距,我们提出UniSinger,这是首个统一说话人克隆歌曲生成与伴奏共生成SVC的端到端框架。基于多模态扩散Transformer,我们构建了一个统一的说话人嵌入空间,将说话人表示从SVC迁移到歌曲生成,从而实现细粒度的跨任务音色控制。为缓解多任务优化冲突,我们设计了一种课程学习策略,使用任务特定的模态掩码来引导模型逐步掌握语义内容、人声音色和伴奏之间的生成机制。实验表明,在两个任务上均达到最先进性能,并实现了互补优势,为智能音乐制作提供了新可能性。

英文摘要

While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed isolated: the former lacks zero-shot speaker cloning, while the latter overlooks vocal-accompaniment synergy. To bridge this gap, we propose UniSinger, the first end-to-end framework unifying speaker cloning song generation and accompaniment co-generation SVC. Building on the multimodal diffusion transformer, we construct a unified speaker embedding space transferring speaker representation from SVC to song generation, endowing fine-grained cross-task timbre control. To mitigate multi-task optimization conflicts, we design a curriculum learning strategy using task-specific modality masking to guide the model to gradually master the generative mechanisms among semantic content, vocal timbre, and accompaniment. Experiments show state-of-the-art performance on both tasks and realizes complementary benefits, offering new possibilities for intelligent music production.

2606.11751 2026-06-16 cs.CV cs.AI 版本更新

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

AnchorEdit: 通过因果记忆在多轮图像编辑中保持时间一致性

Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学) JD Explore Academy(京东探索研究院)

AI总结 提出首个自回归扩散框架AnchorEdit,通过因果记忆机制和自展开策略解决多轮编辑中的身份漂移和误差累积问题,在10轮以上交互中保持高保真度。

Comments Code: https://github.com/xuhang07/AnchorEdit

详情
AI中文摘要

多轮图像编辑对于迭代设计至关重要,但当前模型在连续步骤中常面临身份漂移和误差累积。现有研究利用视频先验保持一致性,但其依赖的双向注意力与交互式编辑的因果、顺序性质根本不符。本文提出AnchorEdit,首个专为高分辨率、长期多轮编辑设计的自回归(AR)扩散框架。AnchorEdit通过三阶段训练课程弥合视频先验与因果推理之间的差距:保持身份的单轮预训练、使用新颖的自展开策略进行因果AR强制微调以缓解暴露偏差,以及用于高效4步生成的一致性蒸馏。在推理过程中,我们引入记忆机制来锚定初始主体身份,并确保在扩展编辑轨迹上的稳定外推。为评估性能,我们提供了一个新的高分辨率多轮编辑基准,旨在压力测试长期稳定性。大量实验表明,AnchorEdit达到了最先进的结果,即使在10轮以上的交互中也能保持卓越的主体保真度和指令遵循能力。

英文摘要

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving sing-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.

2606.14142 2026-06-16 cs.CL cs.AI 版本更新

Implicit Reasoning for Large Language Model-based Generative Recommendation

基于大语言模型的生成式推荐的隐式推理

Yinhan He, Liam Collins, Bhuvesh Kumar, Jundong Li, Neil Shah, Donald Loveland

发表机构 * University of Virginia(弗吉尼亚大学) Snap Inc.(Snap公司)

AI总结 针对大语言模型用于生成式推荐时显式推理的三大局限(世界知识表达弱化、语义ID与自然语言嵌入空间不对齐、推理质量敏感),提出轻量级隐式推理范式PauseRec,在性能、训练成本和推理速度上均优于显式方法。

详情
AI中文摘要

大语言模型(LLMs)越来越多地被用作生成式推荐(GR)的骨干,有望利用预训练的世界知识。然而,如何可靠地调用这些知识进行GR仍不清楚。一个关键障碍是,基于LLM的GR通常使用语义ID(SIDs)表示物品,这破坏了LLM的自然语言推理接口,因为这些标记在预训练期间对LLM是未见过的。现有方法通过昂贵的多阶段流程来应对,这些流程将SID接地并引发显式推理,但对每个阶段何时以及为何必要提供的见解有限。在这项工作中,我们系统地分解了基于LLM的GR的显式推理训练流程,揭示了三个关键局限:弱化的世界知识表达、SID与自然语言标记嵌入空间之间的不对齐,以及对推理质量的敏感性,所有这些都损害了显式推理性能。为了规避这些问题,我们提出了PauseRec,一种为GR量身定制的轻量级隐式推理范式。PauseRec非常实用,避免了昂贵的推理轨迹获取和推理对齐训练,带来了诸多好处:(1)其性能比标准显式CoT方法高出高达6.22%,(2)将训练成本降低高达65%的GPU小时,(3)将推理速度提升高达71.3%。这些结果使PauseRec成为显式推理生成的轻量级替代方案,能够实现更有效、更高效的基于LLM的GR。

英文摘要

Large Language Models (LLMs) are increasingly adopted as backbones for Generative Recommendation (GR), promising access to pretrained world knowledge. Yet reliably invoking this knowledge for GR remains poorly understood. A key obstacle is that LLM-based GR typically represents items with Semantic IDs (SIDs), disrupting LLMs' natural-language reasoning interface because these tokens are unseen by the LLM during pretraining. Existing approaches address this with expensive multi-stage pipelines that ground SIDs and elicit explicit rationales, but offer limited insight into when and why each stage is necessary. In this work, we systematically decompose explicit reasoning training pipelines for LLM-based GR, revealing three key limitations: weakened world-knowledge verbalization, misalignment between SID and natural-language token embedding spaces, and sensitivity to rationale quality, all of which hurt explicit reasoning performance. To circumvent these issues, we propose PauseRec, a lightweight implicit reasoning paradigm tailored for GR. PauseRec is exceptionally practical, avoiding costly reasoning trace acquisition and reasoning alignment training, leading to a multitude of benefits: (1) it outperforms standard explicit CoT methods by up to 6.22%, (2) it reduces training cost by up to 65% GPU hours, and (3) it speeds up inference by up to 71.3%. These results position PauseRec as a lightweight alternative to explicit rationale generation, enabling more effective and efficient LLM-based GR.

7. 机器人与具身智能 33 篇

2606.15647 2026-06-16 cs.AI cs.CV cs.RO 新提交

Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action

迈向下一代医疗:医疗具身AI在感知、决策与行动中的综述

Cheng Zhang, Qing Cai, Xingzheng Wu, Xun Yang, Xiaojun Chang, Bingkun Bao, Liqiang Nie, Xinwang Liu, Yi Yang

发表机构 * School of Information Science and Engineering, Ocean University of China(中国海洋大学信息科学与工程学院) Innovation School of Artificial Intelligence, Hefei University of Technology(合肥工业大学人工智能创新学院) School of Information Science and Technology, University of Science and Technology of China(中国科学技术大学信息科学技术学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机与信息工程学院) School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院) College of Computer Science and Technology, National University of Defense Technology(国防科技大学计算机科学与技术学院) ReLER Laboratory, CCAI, Zhejiang University(浙江大学计算机辅助设计与图形学国家重点实验室)

AI总结 本文系统综述医疗具身AI的核心组件,强调感知、决策与行动的协调集成,并分析临床实践中的挑战与未来方向。

Comments 19 pages, 9 figures

详情
AI中文摘要

基础模型在提升医疗效率方面表现出色,广泛应用于各类医疗场景。然而,它们在感知、理解和与物理世界交互方面的能力有限,严重制约了其在真实临床工作流中的有效性,而临床工作流中安全关键的决策和物理执行紧密耦合。近年来,具身人工智能(AI)作为一种有前景的物理交互范式出现,使智能体能够在复杂医疗环境中操作。随着该领域研究的迅速扩展,理解智能体如何在临床环境中作为集成的端到端系统运行变得日益关键。然而,现有关于医疗具身AI的综述大多强调单个方面或功能组件,缺乏统一的系统级组织。为支持和巩固最新进展,我们系统调查了医疗具身AI的核心组件,特别关注感知、决策与行动的协调集成。我们进一步回顾了代表性医疗应用和相关数据集,并分析了真实临床实践中遇到的主要挑战。最后,我们讨论了这一快速发展领域未来研究的关键方向。相关项目见 https://github.com/VMVLab/Medical_Embodied_AI_Paper_List。

英文摘要

Foundation models have demonstrated impressive performance in enhancing healthcare efficiency across a wide range of medical applications. Nevertheless, their limited ability to perceive, understand, and interact with the physical world significantly constrains their effectiveness in real-world clinical workflows, where safety-critical decision-making and physical execution are tightly coupled. Recently, embodied artificial intelligence (AI) has emerged as a promising physical-interactive paradigm for intelligent healthcare, enabling agents to operate in complex medical environments. As research in this area rapidly expands, understanding how intelligent agents function as integrated, end-to-end systems in clinical environments becomes increasingly critical. However, existing surveys on medical embodied AI largely emphasize individual aspects or functional components, lacking a unified system-level organization of the field. To support and consolidate recent advances, we systematically survey the core components of medical embodied AI, with a particular emphasis on the coordinated integration of perception, decision-making, and action. We further review representative medical applications and relevant datasets, and we analyze the major challenges encountered in real-world clinical practice. Finally, we discuss key directions for future research in this rapidly evolving field. The associated project can be found at https://github.com/VMVLab/Medical_Embodied_AI_Paper_List.

2606.15753 2026-06-16 cs.AI 新提交

RoboPIN: Grounded Embodied Reasoning via Pinned Chain-of-Thought

RoboPIN: 基于锚定思维链的具身推理

Yaoting Huang, Yifu Yuan, Linqi Han, Chengwen Li, Shuoheng Zhang, Xianze Yao, Hongyao Tang, Yan Zheng, Jianye Hao

发表机构 * Tianjin University(天津大学)

AI总结 提出Pinned Chain-of-Thought (PinCoT)推理范式,通过结构化视觉锚点绑定实体,解决多步推理中实体引用漂移和视觉解耦问题;训练4B参数模型在14个基准上平均超越最强7B基线12%。

详情
AI中文摘要

具身推理要求模型感知物理环境中与任务相关的物体和空间,并在多步推理中保持一致的视觉基础。然而,当前的视觉语言模型依赖于纯文本或坐标增强的思维链,其中实体引用仍然隐式和模糊。这可能导致推理过程与视觉证据解耦、实体引用在步骤间漂移、推理轨迹与最终答案之间的因果断裂,并且由于跨视角外观变化,这些问题在多视角场景中进一步放大。为了解决这些问题,我们提出了Pinned Chain-of-Thought (PinCoT),一种结构化推理范式,将每个推理步骤锚定到视觉证据。PinCoT引入了推理锚点的概念,它将每个任务相关实体绑定到一个结构化的视觉锚点,包含实体名称、唯一标识、视角索引和空间基础,从而能够在推理步骤和视角之间实现一致的实体跟踪。我们构建了一个全自动数据生成管道来构建数据集,这是一个高质量的PinCoT格式推理数据集。然后,我们通过三阶段后训练训练方法,逐步注入具身知识、结构化推理能力和过程监督对齐,奖励直接约束推理过程中的锚点定位和身份一致性。在涵盖具身空间推理、多视角推理和指向的14个基准测试中,仅有4B参数的方法始终优于7B级别的开源具身模型,比最强的7B基线Mimo-Embodied平均提高12%。进一步分析表明,PinCoT提高了基础准确性和跨步骤身份一致性,验证了过程监督的有效性。

英文摘要

Embodied reasoning requires models to perceive task-relevant objects and spaces in physical environments and maintain consistent visual grounding throughout multi-step reasoning. However, current vision-language models rely on text-only or coordinate-augmented chain-of-thought, where entity references remain implicit and ambiguous. This may cause the reasoning process to decouple from visual evidence, entity references to drift across steps, and a causal disconnection between the reasoning trajectory and the final answer, with these problems further amplified in multi-view scenarios due to cross-view appearance changes. To address these issues, we propose Pinned Chain-of-Thought (\pincot{}), a structured reasoning paradigm that pins every reasoning step to visual evidence. \pincot{} introduces the concept of \reasoninganchor{}, which binds each task-relevant entity to a structured visual anchor with entity name, unique identity, view index, and spatial grounding, enabling consistent entity tracking across reasoning steps and views. We build a fully automated data generation pipeline to construct \dataset{}, a high-quality \pincot{}-formatted reasoning dataset. We then train \method{} through three-stage post-training that progressively injects embodied knowledge, structured reasoning ability, and process-supervised alignment, with rewards that directly constrain both anchor localization and identity consistency during reasoning. On 14 benchmarks covering embodied spatial reasoning, multi-view reasoning, and pointing, \method{} with only 4B parameters consistently outperforms 7B level open-source embodied models, achieving a 12\% average improvement over the strongest 7B baseline, Mimo-Embodied. Further analysis shows that \pincot{} improves grounding accuracy and cross-step identity consistency, validating the effectiveness of process supervision.

2606.16558 2026-06-16 cs.AI cs.RO cs.SY eess.SY 新提交

ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning

ROSA-RL:基于强化学习的不确定性感知环岛优化速度建议

Anna-Lena Schlamp, Jeremias Gerner, Klaus Bogenberger, Werner Huber, Stefanie Schmidtner

发表机构 * Universität der Bundeswehr München(慕尼黑联邦国防军大学) Hochschule für angewandte Wissenschaften Landshut(兰茨胡特应用科学大学)

AI总结 针对混合交通中环岛场景的不确定性,提出ROSA-RL框架,结合Transformer预测冲突区域占用概率与强化学习,实现安全高效的环岛入口速度协调。

Comments 8 pages, 2 figures, 2 tables. Copyright 2026 IEEE. This is the accepted manuscript for 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC), not the final published version

详情
AI中文摘要

环岛在混合交通中对自动驾驶构成挑战,因为异质且非确定性的人类行为、未知的驾驶意图以及高交互复杂性使得在进入时刻冲突区域是被阻塞还是可用存在不确定性。我们提出ROSA-RL——基于强化学习的不确定性感知环岛优化速度建议。它通过概率冲突预测,实现混合交通中自动驾驶和人类驾驶车辆的安全高效环岛进入。一个基于Transformer的模型预测未来五秒内的冲突区域占用情况,捕捉多智能体交互以预测即将发生的冲突和可用间隙。预测输出编码了未来运动和意图的不确定性,并增强经典强化学习框架的状态,实现不确定性感知的速度协调。在基于真实世界数据的仿真评估中,ROSA-RL能有效处理不确定性,并优于基于模型的基线方法,缩小了与假设完全已知占用的理想设置之间的差距,同时提高了交通效率和安全性。本工作的源代码可在github.com/urbanAIthi/ROSA-RL获取。

英文摘要

Roundabouts challenge automated driving in mixed traffic, as heterogeneous and non-deterministic human behavior, unknown driving intentions, and high interaction complexity create uncertainty about whether the conflict zone will be blocked or available at the moment of entry. We present ROSA-RL -- uncertainty-aware Roundabout Optimized Speed Advisory with Reinforcement Learning. It enables safe and efficient roundabout entry for automated and human-driven vehicles in mixed traffic through probabilistic conflict forecasting. A Transformer-based model predicts conflict zone occupancy over a five-second horizon, capturing multi-agent interactions to anticipate upcoming conflicts and available gaps. The prediction outputs encode uncertainty in future motion and intent, and augment the state of a classical RL framework, enabling uncertainty-aware speed coordination. Evaluated in simulations grounded in real-world data, ROSA-RL can effectively handle uncertainty and outperform a comparable model-based baseline, closing the gap to an ideal setting assuming fully known occupancy while improving traffic efficiency and safety. The source code of this work is available under: github.com/urbanAIthi/ROSA-RL.

2606.14716 2026-06-16 cs.CV cs.AI cs.RO 交叉投稿

RAMS: Resource-Adaptive and Detection-Conditioned Model Switching for Embedded Edge Perception

RAMS: 面向嵌入式边缘感知的资源自适应与检测条件模型切换

Kushal Khemani, Evan Leri, George Xu, Amit Hod

发表机构 * NEXEDGE Research Lab(NEXEDGE研究实验室)

AI总结 提出RAMS运行时控制器,通过监控设备压力、校准切换阈值,在YOLOv8三个规模模型间动态切换,引入检测条件策略和VRU加权准确率评分,在多种嵌入式平台上实现延迟与精度的平衡。

详情
AI中文摘要

嵌入式硬件上的边缘目标检测需要在变化的资源压力下平衡推理延迟和检测质量。我们提出RAMS,一种轻量级运行时控制器,它监控设备压力,从空闲行为校准切换阈值,并在三个驻留的YOLOv8层级(NANO/SMALL/MEDIUM,分辨率320/416/640 px)之间动态选择,无需模型重新加载延迟。RAMS定义了五种切换策略,包括两种检测条件变体,可在最近检测到易受伤道路使用者(VRU)后防止激进的降级。我们进一步引入VRU加权准确率评分(SWAS),一种用于离线策略比较的标量指标,无需真实标注,以及一种基于oracle的变体,用于分离检测器循环性与真正的层级保留收益。在Raspberry Pi 5、x86笔记本电脑和Jetson Orin ONNX/TensorRT部署中,相同的控制器方程在37倍的延迟范围内运行。在重负载下的Jetson Orin TensorRT上,safety2策略实现了3.41毫秒的平均延迟,比固定MEDIUM推理快5.6倍,同时通过接近NANO操作并在VRU阳性窗口期间选择性锁定SMALL和MEDIUM,保留了其代理准确率的74%。与重负载下仅基于阈值的策略相比,检测条件切换在oracle评分下将SWAS提高了25.4%,在检测器衍生评分下提高了47.3%。实时KITTI评估报告了每层级VRU召回率分别为24.2%、41.2%和59.0%,表明反应性覆盖从根本上受限于基线检测器的召回率。

英文摘要

Edge object detection on embedded hardware requires balancing inference latency and detection quality under changing resource pressure. We present RAMS, a lightweight runtime controller that monitors device pressure, calibrates switching thresholds from idle behavior, and dynamically selects among three resident YOLOv8 tiers (NANO/SMALL/MEDIUM at 320/416/640 px) without model-reload latency. RAMS defines five switching policies, including two detection-conditioned variants that prevent aggressive downgrades after recent vulnerable-road-user (VRU) detections. We further introduce the VRU-Weighted Accuracy Score (SWAS), a scalar metric for offline policy comparison without ground-truth annotations, together with an oracle-bounded variant that separates detector circularity from genuine tier-retention benefit. Across Raspberry Pi 5, x86 laptops, and Jetson Orin ONNX/TensorRT deployments, the same controller equations operate over a 37x latency range. On Jetson Orin TensorRT under heavy load, the safety2 policy achieves 3.41 ms mean latency, 5.6x faster than fixed-MEDIUM inference, while retaining 74% of its proxy accuracy through near-NANO operation with selective SMALL and MEDIUM locks during VRU-positive windows. Detection-conditioned switching improves SWAS by 25.4% under oracle scoring and 47.3% under detector-derived scoring relative to threshold-only policies under heavy load. Live KITTI evaluation reports per-tier VRU recall of 24.2%, 41.2%, and 59.0%, showing that reactive overrides are fundamentally limited by baseline detector recall.

2606.14752 2026-06-16 cs.CV cs.AI cs.LG cs.RO 交叉投稿

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

X-Tokenizer: 一种用于视觉-语言-动作预训练的多模态动作分词器

Xirui Kang, Yanpei Shi, Lucy Liang, Roy Gan, Dongxiu Liu, Pushi Zhang, Danpeng Chen, Xiaoyi Qin, Yinan Zheng, Jinliang Zheng, Hao Wang, Xianyuan Zhan, Hang Su

发表机构 * Square Robot City University of Hong Kong(香港城市大学) Tsinghua University(清华大学)

AI总结 提出X-Tokenizer,通过语义残差量化(SRQ)和掩码动作建模(MAM)将动作离散化为语义接口,在2.4M轨迹上预训练后提升VLA模型的多模态接地和长程任务性能。

Comments Project page: https://x-square-robot.github.io/X-Tokenizer_projectPage/

详情
AI中文摘要

现代视觉-语言-动作(VLA)模型必须桥接预训练的视觉-语言推理和精确的连续机器人控制。现有的动作分词器主要为了重建而离散化动作,产生的编码保留了运动几何结构,但仅向主干网络提供弱语义监督。因此,我们将动作分词化不仅视为压缩,而是作为多模态推理与可执行控制之间的语义接口学习。为此,我们引入了X-Tokenizer,一种轻量级的编码器-语义残差量化(SRQ)-解码器架构,为多种机械臂形态提供共享的动作接口。其关键组件SRQ在残差向量量化上施加了非对称结构:第一层通过掩码动作建模(MAM)训练,形成捕获粗略运动意图的离散动作语言,而更深层则保持面向重建的残差,保留细粒度细节。为了进一步将动作标记与多模态语义对齐,X-Tokenizer通过与预训练基础模型的表示空间进行对比对齐以及下一帧视觉-语言特征预测进行预训练。在2.4M轨迹(2.0B动作帧)上预训练后,单个冻结的X-Tokenizer作为表示塑造的监督信号插入混合离散-连续VLA中。X-Tokenizer在真实世界聚合指标上达到最佳,并在RoboTwin 2.0模拟中表现强劲。在多模态接地(+13.5%)和长程任务(+8.25)上优于FAST,表明动作分词器作为VLA预训练的语义接口,而不仅仅是动作压缩。

英文摘要

Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.

2606.14772 2026-06-16 cs.CV cs.AI 交叉投稿

ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering

ScoutVLA:面向开放世界具身问答的无人机中心主动感知双专家VLA模型

Wenhao Lu, Zhengqiu Zhu, Xiaofeng Wang, Xiaoran Zhang, Yatai Ji, Yong Zhao, Yue Hu, Yingzhen Nie, Jinlong Zhu, Zheng Zhu

发表机构 * National Key Laboratory of Digital Intelligent Modeling and Simulation, National University of Defense Technology(国防科技大学数字智能建模与仿真国家重点实验室) GigaAI

AI总结 针对无人机在室外具身问答中细粒度视角调整不足的问题,提出ScoutVLA模型,采用解耦双专家架构(视觉语言专家推断语义意图,动作专家生成连续视角调整轨迹),并通过知识隔离机制平衡连续控制与语义推理,在仿真和真实实验中显著优于基线方法。

详情
AI中文摘要

空中具身问答(EQA)要求无人机(UAV)主动感知环境并回答自然语言问题。现有的室外EQA系统通常在目标进入无人机视野后停止,导致寻找证据所需的问题的细粒度视角调整问题仍未解决。为解决此问题,我们引入FG-EQA,一个细粒度主动感知EQA基准,包含超过4万条模拟轨迹和1千条真实轨迹。受侦察蜂“摇摆舞”的启发(它们迭代调整飞行路径以验证目标信息),我们提出ScoutVLA,一种用于室外EQA的证据驱动视觉-语言-动作模型。为模拟这种主动探索行为,ScoutVLA采用解耦双专家架构:视觉语言专家推断语义意图以识别缺失证据,而独立动作专家使用高自由度流匹配生成连续视角调整轨迹。为平衡连续控制和语义推理的竞争需求,我们设计了一种解耦训练策略,其中包含知识隔离机制,防止动作梯度抹除模型的多模态推理能力。大量仿真实验和定性真实世界实地研究均验证了ScoutVLA相对于最先进基线的优越性,平均严格成功率高10.48倍,平均QA正确率高7.72倍。

英文摘要

Aerial Embodied Question Answering (EQA) requires Unmanned Aerial Vehicles (UAVs) to actively perceive the environment and answer natural language questions. Existing outdoor EQA systems usually stop once the target enters the UAV's field of view, leaving the fine-grained viewpoint adjustment needed for evidence-seeking questions largely unresolved. To address this issue, we introduce FG-EQA, a fine-grained active perception EQA benchmark with more than 40K simulated trajectories and 1K real-world trajectories. Drawing inspiration from the ``waggle dance'' of scout bees, which iteratively adjust their flight paths to verify target information, we propose ScoutVLA, an evidence-driven Vision-Language-Action model for outdoor EQA. To emulate this active exploration behavior, ScoutVLA features a decoupled dual-expert architecture: a vision-language expert infers the semantic intent to identify missing evidence, while an independent action expert employs high-DoF flow matching to generate continuous viewpoint-refinement trajectories. To balance the competing demands of continuous control and semantic reasoning, we devise a decoupled training strategy with a knowledge insulation mechanism that prevents the action gradients from erasing the model's multimodal reasoning ability. Extensive simulated experiments and a qualitative real-world field study both verify the superiority of ScoutVLA over the state-of-the-art baselines, demonstrating a 10.48$\boldsymbol{\times}$ higher average strict success rate and a 7.72$\boldsymbol{\times}$ higher average QA correctness.

2606.14981 2026-06-16 cs.RO cs.AI cs.LG 交叉投稿

Inference-time Policy Steering via Vision and Touch

通过视觉和触觉进行推理时策略引导

Yilin Wu, Zilin Si, Zeynep Temel, Oliver Kroemer, Andrea Bajcsy

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出ViTaL框架,通过视觉采样验证和触觉引导扩散编辑的双层优化,在推理时引导机器人策略,显著提升接触丰富操作任务的成功率。

详情
AI中文摘要

推理时引导通过在部署前验证候选动作来适应预训练的生成式机器人策略。虽然先前的方法通常仅使用视觉观察进行验证,但对于接触丰富的操作任务,仅靠视觉往往不足,因为成功取决于全局任务进展和微妙的局部交互(如接触力)。我们提出了ViTaL,一个视觉-触觉推理时引导框架,将多模态引导形式化为双层优化问题。在高层,视觉采样与验证执行长时域模式选择,决定机器人应执行何种行为。在低层,触觉引导的扩散编辑在较短时域内细化所选动作序列,以满足局部接触要求。为了支持基于结果的引导,ViTaL学习了一个视觉-触觉潜在世界模型,并采用了语义对齐的视觉和触觉验证器,包括一个新颖的文本条件触觉奖励,直接在潜在空间中对预测的触觉未来进行评分。在三个真实世界的接触丰富操作任务中,ViTaL相对于基础策略将整体成功率提高了51%,比单模态引导至少高出33%,并且比朴素多模态融合至少高出20%。网站:https://yilin-wu98.github.io/vital_website。

英文摘要

Inference-time steering adapts pre-trained generative robot policies during deployment by verifying candidate actions before execution. While prior methods typically perform this verification only with visual observations, vision alone is often insufficient for contact-rich manipulation, where success depends on both global task progress and subtle local interactions such as contact force. We introduce ViTaL, a visuo-tactile inference-time steering framework that formulates multimodal guidance as a bi-level optimization problem. At the high level, visual sampling-and-verification performs long-horizon mode selection, deciding what behavior the robot should execute. At the low level, tactile-guided diffusion editing refines the selected action sequence over a shorter horizon to satisfy local contact requirements. To support outcome-based steering, ViTaL learns a visuo-tactile latent world model and employs semantically aligned visual and tactile verifiers, including a novel text-conditioned tactile reward that scores predicted tactile futures directly in latent space. Across three real-world contact-rich manipulation tasks, ViTaL improves overall success by 51% over the base policy, outperforms unimodal steering by at least 33%, and exceeds naive multimodal fusion by at least 20%. Website: https://yilin-wu98.github.io/vital_website.

2606.15251 2026-06-16 cs.RO cs.AI cs.LG 交叉投稿

Driving, Fast or Slow? Neuro-Symbolic Guidance for Motion Prediction in Multi-Modal Ground Mobility

驾驶,快或慢?多模态地面移动中运动预测的神经符号引导

Simon Kohaut, Felix Divo, Julius Hahnewald, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

发表机构 * Artificial Intelligence and Machine Learning Lab, TU Darmstadt(达姆施塔特工业大学人工智能与机器学习实验室) Honda Research Institute(本田研究所) Hessian Center for AI (hessian.AI)(黑森州人工智能中心) Centre for Cognitive Science(认知科学中心) German Center for AI (DFKI)(德国人工智能研究中心) Uncertainty in Artificial Intelligence Lab, TU Eindhoven(埃因霍温理工大学人工智能不确定性实验室)

AI总结 提出TraCS框架,通过神经符号方法将交通规则编码为概率一阶逻辑,增强黑盒运动预测模型的可解释性和合规性,在Argoverse 2上持续提升SOTA性能。

详情
AI中文摘要

准确且可解释的异构交通空间(包括行人、自行车、汽车和卡车)运动预测对于安全的自主导航至关重要。然而,最先进的方法仍然是黑盒,缺乏对现实世界移动的监管和行为约束的显式编码。我们提出Trajectory Compliance-Shaping (TraCS),一种神经符号框架,通过可解释的概率一阶逻辑增强现有的黑盒运动预测骨干网络。为此,TraCS采用智能体代码生成流水线,弥合交通规则的自然语言描述与概率运动预测之间的差距。此外,TraCS采用反应式数据流推理引擎,随着场景演变维护并高效更新合规性景观。为防止TraCS过度自信地将骨干网络的预测引导到错误方向,我们提出一种神经置信度评分,作为上下文感知的合规性信号衰减。我们在Argoverse 2基准上展示了TraCS如何持续改进最先进的预测骨干网络,表明概率和符号合规性推理是纯神经运动预测的广泛适用且计算高效的补充。

英文摘要

Accurate and interpretable motion prediction for heterogeneous traffic spaces, including pedestrians, bicycles, cars, and trucks, is essential for safe autonomous navigation. Nevertheless, state-of-the-art approaches remain predominantly black-box, lacking explicit encoding of the regulatory and behavioral constraints of real-world mobility. We propose Trajectory Compliance-Shaping (TraCS), a neuro-symbolic framework that augments existing black-box motion prediction backbones with interpretable and probabilistic first-order logic. To do so, TraCS employs an agentic code-generation pipeline to bridge the gap between natural-language descriptions of traffic regulations and probabilistic motion prediction. Furthermore, TraCS employs a reactive data-streaming inference engine that maintains and efficiently updates compliance landscapes as scenes evolve. To prevent TraCS from overconfidently steering the backbone's predictions in the wrong direction, we propose a neural confidence rating learned as a context-aware attenuation of the compliance signal. We demonstrate on the Argoverse 2 benchmark how TraCS consistently improves state-of-the-art prediction backbones, showing that probabilistic and symbolic compliance reasoning is a broadly applicable and computationally efficient complement to purely neural motion predictors.

2606.15594 2026-06-16 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 交叉投稿

Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC

从像素到证明:通过并行保形鲁棒MPC实现概率安全的潜在世界模型控制

Devesh Nath, Anutam Srinivasan, Haoran Yin, Ruitong Jiang, Jeffrey Fang, Glen Chou

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出SLS^2框架,结合保形预测与鲁棒模型预测控制,在学习的潜在世界模型中实现基于视觉的安全运动规划,提升目标到达性能与安全性。

详情
AI中文摘要

我们提出了SLS^2,一个使用鲁棒模型预测控制(MPC)在学习的潜在世界模型中进行安全反馈运动规划的框架。我们的方法训练了一个动作条件的联合嵌入世界模型,具有紧凑的马尔可夫潜在状态,通过学习的潜在动力学实现高效的基于梯度的轨迹优化。为了在潜在预测不完美的情况下确保真实系统的安全性,我们采用保形预测来通知GPU加速的系统级综合(SLS)鲁棒MPC方案,以获得校准的潜在误差界限和鲁棒的潜在空间约束集。我们还学习并保形化了一个潜在约束检查器,使SLS规划器能够在闭环执行期间施加概率安全约束。我们在基于视觉的控制任务上评估了我们的方法,与潜在世界模型和安全规划基线相比,它提高了目标到达性能和安全性。

英文摘要

We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.

2606.15631 2026-06-16 cs.RO cs.AI 交叉投稿

Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

检索,不重新训练:在测试时将视觉语言动作模型扩展到新任务

Jeongeun Park, Juhan Park, Taekyung Kim, Sungjoon Choi, Dongyoon Han, Sangdoo Yun

发表机构 * NAVER AI Lab(NAVER AI实验室) Korea University(高丽大学)

AI总结 提出检索增强策略,通过一次训练冻结模型,部署时仅添加检索数据即可适应新任务,无需逐任务微调,在跨本体泛化中优于基线。

Comments https://recap-robot.github.io/

详情
AI中文摘要

将视觉-语言-动作(VLA)策略扩展到新任务通常需要特定任务的遥操作演示和逐任务微调,这使得适应在数据收集和计算方面成本高昂。在本文中,我们表明这种目标侧逐任务适应成本可以被检索所取代。我们的检索增强策略在目标本体(查询)和更廉价的本体(池,例如人手视频)的配对演示上训练一次,然后冻结。新任务在部署时通过将池侧演示附加到检索池来添加。冻结策略在每个控制步骤中根据检索到的轨迹进行条件化,因此新任务通过索引数据而非更新参数来吸收。微调仅在面对新的、未见过的本体时需要,而不是每个新任务。我们表明,检索改进了超越特定骨干网络的策略,包括标准VLA策略,但其效果在基于视频生成的世界动作模型(WAM)Cosmos Policy中尤为显著。在这种设置中,检索提供了粗略的任务进展,而WAM的未来图像目标提供了额外的视觉一致性信号,增强了检索条件化的动作。在PushT上,我们研究了检索如何为跨本体泛化到未见目标角度提供可重用的高级运动先验,而在RoboTwin 2.0上,我们的方法在未见任务上优于跨本体基线,并且我们还在真实机器人上演示了该方法。

英文摘要

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.

2606.15654 2026-06-16 cs.RO cs.AI 交叉投稿

PO-PDDL: Learning Symbolic POMDPs from Visual Demonstrations for Robot Planning Under Uncertainty

PO-PDDL: 从视觉演示中学习符号化POMDP以实现不确定性下的机器人规划

Wenjing Tang, Xuanjin Jin, Yuan Liu, Renming Huang, Cewu Lu, Panpan Cai

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出PO-PDDL符号化POMDP框架,通过从机器人执行视频中重建潜在状态轨迹、识别部分可观测性并学习随机转移与观测模型,实现不确定性下的鲁棒任务规划。

详情
AI中文摘要

现实世界的机器人任务规划必须在随机动作执行和部分可观测性下进行,然而为真实机器人领域构建部分可观测马尔可夫决策过程(POMDP)模型仍然困难且劳动密集。我们引入了PO-PDDL,一种POMDP的符号化表述,它保留了规划领域定义语言(PDDL)的关系结构和LLM友好的语法,同时显式建模了部分可观测性、随机性和信念。基于此表述,我们提出了一种用于学习PO-PDDL模型的演示驱动流程。该方法从真实机器人执行视频中重建潜在符号状态轨迹,通过推断状态与视觉观测之间的不一致性识别部分可观测性,并相应地学习随机转移和观测模型。得到的PO-PDDL领域可跨任务重用,并在感知和执行不确定性下实现在线信念空间规划。在真实世界长时域操作任务上的实验表明,我们的方法持续优于现有的PDDL和POMDP模型学习方法,以显著更低的规划成本实现了不确定性下的鲁棒任务规划。

英文摘要

Real-world robot task planning must operate under both stochastic action execution and partial observability, yet constructing Partially Observable Markov Decision Process (POMDP) models for real robotics domains remains difficult and labor-intensive. We introduce PO-PDDL, a symbolic formulation of POMDPs that preserves the relational structure and LLM-friendly syntax of the Planning Domain Definition Language (PDDL), while explicitly modeling partial observability, stochasticity, and beliefs. Building on this formulation, we propose a demonstration-driven pipeline for learning PO-PDDL models. The proposed method reconstructs latent symbolic state trajectories from real-robot execution videos, identifies partial observability via inconsistencies between inferred states and visual observations, and learns stochastic transition and observation models accordingly. The resulting PO-PDDL domains are reusable across tasks and enable online belief-space planning under both perception and execution uncertainty. Experiments on real-world long-horizon manipulation tasks show that our method consistently outperforms existing PDDL and POMDP model-learning approaches, achieving robust task planning under uncertainty with significantly lower planning cost.

2606.15756 2026-06-16 cs.LG cs.AI 交叉投稿

From Correlation to Causation in Lane Change Prediction for Automated Driving: A Causal Explanation Framework

从相关性到因果性:自动驾驶换道预测的因果解释框架

Mohamed Manzour, Aditya Kumar, Augusto Luis Ballardini, Miguel Ángel Sotelo

发表机构 * University of Alcalá(阿尔卡拉大学)

AI总结 提出基于因果推断的换道预测框架,结合深度结构因果建模与干预效应分析,在预测准确率超过95%的同时,识别直接贡献变量及其因果链,实现可解释的因果推理。

详情
AI中文摘要

换道预测是智能车辆的核心任务,提前预测操作有助于更安全的决策。然而,现有方法主要学习观测驾驶变量与未来操作之间的统计关联,而忽略了输入变量之间的因果依赖关系。这限制了可解释性,尤其是当纵向间隙、相对纵向速度和碰撞时间(TTC)等物理相关变量被视为独立平坦输入时。本文提出一个基于因果推断的换道预测与解释框架。该方法结合语言特征构建、专家约束的因果发现、基于深度端到端因果推断(DECI)的深度结构因果建模、基于干预的效果分析、反驳测试和递归因果链解释。目标不仅是预测未来操作,还要识别直接贡献于预测的候选变量、影响这些变量的上游因素以及这些效应传播的因果链。该框架在车道标记交叉事件前的前三秒内平均F1分数超过95%。除了预测精度,该框架使用基于干预的效果分析,在学到的因果结构下区分有影响力的变量和弱影响力变量。它进一步区分候选直接贡献者和中介效应,并生成对比性因果链解释,阐明为什么预测的操作更受青睐,而替代操作支持较少。因此,主要贡献是一个机制感知的换道预测流程,从基于相关性的分类转向更可解释的因果推理用于操作预测。

英文摘要

Lane-change prediction is a central task in intelligent vehicles, where early maneuver anticipation can support safer decision-making. However, many existing approaches mainly learn statistical associations between observed driving variables and future maneuvers, while overlooking the causal dependencies among the input variables themselves. This limits interpretability, especially when physically related variables such as longitudinal gap, relative longitudinal velocity, and Time-To-Collision (TTC) are treated as independent flat inputs. This article presents a causal-inference-based framework for lane-change prediction and explanation. The proposed approach combines linguistic feature construction, expert-constrained causal discovery, deep structural causal modeling with Deep End-to-end Causal Inference (DECI), intervention-based effect analysis, refutation testing, and recursive causal-chain explanation. The objective is not only to predict the future maneuver, but also to identify candidate variables that directly contribute to the prediction, the upstream factors influencing them, and the causal chains through which these effects propagate. The framework achieves average F1-scores above 95% during the first three seconds before the lane-marking crossing event. Beyond prediction accuracy, the framework uses intervention-based effect analysis to distinguish influential from weakly influential variables under the learned causal structure. It further distinguishes candidate direct contributors from mediated effects and generates contrastive causal-chain explanations that clarify why the predicted maneuver is favored and why the alternative maneuvers are less supported. The main contribution is therefore a mechanism-aware lane-change prediction pipeline that moves beyond correlation-based classification toward more interpretable causal reasoning for maneuver prediction.

2606.15768 2026-06-16 cs.RO cs.AI 交叉投稿

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

LaWAM: 用于高效动力学感知机器人策略的潜在世界行动模型

Jialei Chen, Kai Wang, Kang Chen, Shuaihang Chen, Feng Gao, Wenhao Tang, Zhiyuan Li, Weilin Liu, Zhuyu Yao, Boxun Li, Yuanbo Xu, Chao Yu

发表机构 * Tsinghua University(清华大学) Jilin University(吉林大学) Nankai University(南开大学) Peking University(北京大学) Harbin Institute of Technology(哈尔滨工业大学) Zhongguancun Academy(中关村学院) Striding.AI Infinigence AI

AI总结 提出LaWAM模型,通过潜在视觉子目标预测场景变化,实现动力学感知的机器人控制,在多个基准上达到最优或竞争性成功率,且推理延迟低。

详情
AI中文摘要

视觉-语言-行动模型(VLA)利用大规模视觉-语言预训练进行语义机器人控制,但通常缺乏对机器人行动如何改变场景的明确预见。世界行动模型(WAM)通过基于预测的未来条件化策略来解决这一限制,但现有方法通常依赖计算昂贵的视频生成,且存在大量像素级冗余。我们提出LaWAM,一种潜在世界行动模型,通过紧凑的潜在视觉子目标(而非重建的未来视频)向机器人策略暴露预测动力学。LaWAM的核心是一个潜在行动条件化的潜在世界模型(LaWM)。我们通过在预训练视觉基础模型的潜在空间中训练潜在行动模型,并重新利用其前向解码器来预测未来观察特征以描述场景演变,从而获得LaWM。然后,LaWAM基于这些预测的潜在视觉子目标条件化行动生成,以实现动力学感知的机器人控制。LaWAM在LIBERO(98.6%成功率)、RoboTwin(91.22%成功率)和真实世界操作任务中取得了最优或具有竞争力的成功率,同时保持低延迟推理。LaWAM每次行动块预测运行时间为187毫秒,相比像素空间WAM,实现了高达24倍的墙钟延迟降低。

英文摘要

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.

2606.16042 2026-06-16 cs.RO cs.AI 交叉投稿

Leveraging Deep Learning for Object and Position Recognition of Load Carriers for Autonomous Logistics Vehicles

利用深度学习实现自主物流车辆对载具的物体与位置识别

Christoph Legat, Tobias Miller, Marco Riess

发表机构 * Research Group on Cognitive Autonomy & Predictive Intelligence, Technical University of Applied Sciences, Augsburg, Germany(认知自主与预测智能研究组,奥格斯堡应用技术大学,德国) Grenzebach Maschinenbau GmbH, Asbach-Bäumenheim, Germany(Grenzebach Maschinenbau GmbH,德国阿斯巴赫-博伊门海姆)

AI总结 提出基于深度学习的框架,通过卷积神经网络从RGBD数据中识别载具上的预定义地标并计算其位姿,实现自主物流车辆对载具的检测与定位,实验验证了工业环境下的可靠性。

Comments 6 pages, 6 figures, IFAC World Congress2026, \c{opyright} 2026 the authors. This work has been accepted to IFAC for publication under a Creative Commons Licence CC-BY-NC-ND

详情
AI中文摘要

本工作探索了在移动机器人中利用人工智能实现载具的自主检测和位姿估计,以便自动拾取。设计了一个深度神经网络,从RGBD数据中识别载具上的预定义地标;然后利用这些地标计算载具的位姿。该网络直接处理RGBD图像以估计地标位置,这些位置构成了确定载具位置的基础。该方法在大量实验中得到了验证,并包含软件和硬件实现。提出了一个基于深度学习的框架,用于检测载具并估计其位姿,以应用于自主物流车辆。我们的方法使用卷积神经网络从RGBD输入中识别载具上的特征参考点,并通过将这些推断出的地标与先验几何知识相结合来计算其位姿。实验表明,所得精度足以在工业环境中可靠地检测载具,证实了该方法适用于自主内部物流应用。

英文摘要

This work explores the use of artificial intelligence in mobile robotics to achieve autonomous detection and pose estimation of load carriers for automated pickup. A deep neural network is designed to recognize predefined landmarks on the carrier from RGBD data; these landmarks are then used to compute the carrier's pose. The network operates directly on RGBD images to estimate landmark positions, which form the basis for determining the carrier's location. The approach is validated in extensive experiments and comprises both software and hardware implementations. A deep learning-based framework is presented to detect load carriers and estimate their pose for use with autonomous logistics vehicles. Our method uses a convolutional neural network to identify characteristic reference points on the carrier from RGBD input and computes its pose by combining these inferred landmarks with prior geometric knowledge. Experiments show that the resulting accuracy is sufficient for reliable load carrier detection in industrial environments, confirming the suitability of the method for autonomous intralogistics applications.

2606.16202 2026-06-16 cs.CV cs.AI cs.RO 交叉投稿

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

EgoPhys: 从第一人称视频学习可变形物体的通用物理模型

Hyunjin Kim, Ri-Zhao Qiu, Guangqi Jiang, Xiaolong Wang

发表机构 * UC San Diego(加州大学圣地亚哥分校)

AI总结 提出EgoPhys框架,从第一人称RGB视频中通过可泛化先验构建可变形物体的物理数字孪生,无需测试时优化即可预测弹簧刚度场,在重建、未来预测和零样本泛化上优于基线。

Comments Project Page: https://hjhyunjinkim.github.io/EgoPhys

详情
AI中文摘要

人类通过日常互动自然地理解物体物理,但准确预测复杂的可变形动力学(如弹性材料和织物)仍然是计算机视觉和机器人学的主要挑战。我们提出EgoPhys,一个利用可泛化先验从仅RGB的第一人称视频构建可变形物理数字孪生的框架。EgoPhys通过将每个物体的逆物理解蒸馏到紧凑码本中,克服了现有方法的局限性,从而能够为未见物体预测密集的弹簧刚度场,而无需每个弹簧的测试时优化。使用来自多样化第一人称交互的可泛化先验进行训练,EgoPhys在重建、未来预测和零样本泛化方面优于基线。为了支持训练和评估,我们整理了一个涵盖多样化可变形物体、场景和操作风格的第一人称交互数据集。我们将EgoPhys部署在真实的xArm6机器人上,证明从单个第一人称人类游戏视频初始化的数字孪生可以作为内部世界表示,辅助可变形物体规划,突显第一人称RGB观测作为通往真实到模拟管道的可扩展路径。

英文摘要

Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes the limitations of existing methods to enable controllable deformable digital twin generation from egocentric videos by distilling per-object inverse-physics solutions into a compact codebook, enabling prediction of dense spring stiffness fields for unseen objects without per-spring test-time optimization. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot, demonstrating that a digital twin initialized from a single egocentric human play video can serve as an internal world representation to aid in deformable-object planning, highlighting egocentric RGB observations as a scalable path toward real-to-sim pipelines.

2606.16253 2026-06-16 cs.CV cs.AI 交叉投稿

Learned Image Compression for Vision-Language-Action Models

面向视觉-语言-动作模型的图像压缩学习

Hyeonjun Kim, Jegwang Ryu, Sangbeom Ha, Junhyeok Lee, Jun-Hyuk Kim, Hyemin Ahn, Jaeho Lee

发表机构 * POSTECH(浦项科技大学) Soongsil University(崇实大学) Chung-Ang University(中央大学)

AI总结 提出SPARC框架,通过自适应比特率分配和倾斜率损失,在低带宽下保持VLA机器人控制性能,优于传统编解码器。

详情
AI中文摘要

视觉-语言-动作(VLA)模型越来越依赖高频多摄像头观测,使得视觉通信成为带宽受限或分布式部署场景中实时机器人控制的主要瓶颈。然而,现有的图像和视频编解码器旨在保留通用视觉保真度,而非下游VLA策略的控制性能。在这项工作中,我们引入了SPARC(空间自适应速率控制),一种为VLA驱动机器人量身定制的学习图像压缩框架。我们的关键观察是,视觉信息的重要性在相机视角和图像内的空间区域之间差异很大。基于这一观察,SPARC采用轻量级时间掩码选择器,根据任务相关性自适应地在潜在表示上分配比特率,同时利用时间上下文。我们进一步引入倾斜率损失,通过减少基于熵的目标过度抑制罕见但任务关键的视觉模式的趋势来稳定训练。在包括RoboCasa365、VLABench和LIBERO在内的多样化机器人基准测试上的实验表明,在相同比特率预算下,SPARC始终比传统图像/视频编解码器和最近的学习压缩方法实现更强的控制性能。我们还展示了在远程控制设置中的实际部署优势,我们的方法显著改善了比特率-成功率权衡。

英文摘要

Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

2606.16286 2026-06-16 cs.LG cs.AI cs.RO 交叉投稿

FlowMPC: Improving Flow Matching policies with World Models

FlowMPC:利用世界模型改进流匹配策略

Chandon Hamel

发表机构 * Stanford University(斯坦福大学)

AI总结 提出FlowMPC框架,结合流匹配模仿策略与学习的世界模型,通过MPPI规划提升测试时性能,在ManiSkill操作任务中显著提高成功率。

详情
AI中文摘要

流匹配(FM)是一种在多模态动作空间中进行行为克隆的强大方法[Jiang et al., 2025],但由于它没有直接训练以最大化期望回报,FM策略在测试时的表现仍有改进空间。本文研究学习的世界模型是否可以通过对策略提出的候选动作序列进行模型预测路径积分(MPPI)规划来改进FM策略。基于TD-MPC2 [Hansen et al., 2024],我引入了FlowMPC,这是一个将模仿学习的FM策略与学习的世界模型相结合的框架,用于ManiSkill操作任务[Tao et al., 2025]中的测试时规划。在PickCube和PickSingleYCB上,添加世界模型比单独使用FM策略提高了性能,尤其是在回合结束时的成功率方面有显著提升。这些结果表明,基于世界模型的规划可以有效地补充基于流的模仿策略,而无需修改FM训练目标。

英文摘要

Flow Matching (FM) is a powerful approach for behavior cloning in multimodal action spaces [Jiang et al., 2025], but because it is not trained to directly maximize expected return, there is still room to improve how FM policies act at test time. This work investigates whether a learned world model can improve FM policies by enabling Model Predictive Path Integral (MPPI) planning over candidate action sequences proposed by the policy. Building on TD-MPC2 [Hansen et al., 2024], I introduce FlowMPC, a framework that combines an imitation-learned FM policy with a learned world model for test-time planning in ManiSkill manipulation tasks [Tao et al., 2025]. Across PickCube and PickSingleYCB, adding the world model improved performance over the FM policy alone, with especially clear gains in end-of-episode success. These results suggest that world-model-based planning can effectively complement flow-based imitation policies without modifying the FM training objective.

2606.16480 2026-06-16 cs.RO cs.AI cs.SY eess.SY 交叉投稿

HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization

HOLO-MPPI:通过分层策略优化的多场景运动规划

Youngjae Min, Jovin D'sa, Faizan M. Tariq, David Isele, Navid Azizan, Sangjae Bae

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Honda Research Institute, USA(本田研究所(美国))

AI总结 提出HOLO-MPPI框架,结合离线高层策略学习与在线低层随机最优控制,实现多场景运动规划,无需针对每个场景重新调整参数,在自动驾驶中优于MPPI和端到端RL基线。

详情
AI中文摘要

部署在现实世界中的机器人必须在不同场景下规划运动,而无需针对每个场景重新调整参数。端到端强化学习(RL)可以跨场景泛化,但在分布偏移、奖励错误指定和随机交互下往往变得脆弱。模型预测路径积分(MPPI)控制能够在无梯度的情况下实现强大的实时优化,但其性能依赖于良好形状的采样先验,而手动设计先验无法扩展到多场景部署。我们提出了HOLO-MPPI(高层离线,低层在线MPPI),一种多场景运动规划框架,结合了高层策略学习与低层随机最优控制。离线时,我们学习一个高层策略,在抽象动作空间中提出场景鲁棒的规划,并利用学习的世界模型进行在线推演。在线时,该策略作为数据驱动的先验生成器,根据当前观测和目标参数化MPPI的采样分布。然后MPPI围绕该先验实时优化低层控制序列,以适应局部扰动。我们通过设计有效的高层动作空间和定制模型架构,在自动驾驶中实例化HOLO-MPPI。在多种驾驶场景下的评估表明,HOLO-MPPI在保持实时控制的同时,优于MPPI和端到端RL基线。

英文摘要

Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.

2606.16690 2026-06-16 cs.RO cs.AI cs.CV 交叉投稿

PATCH: Action-Chunk-Conditioned Latent Patch Innovation Monitoring for Robot Manipulation

PATCH: 基于动作块条件潜在补丁创新的机器人操作监控

Yanan Zhou, Ranpeng Qiu, Yincong Chen, Jiajie Cui, Weiming Zhi

发表机构 * School of Computer Science, The University of Sydney(悉尼大学计算机科学学院) Australian Centre For Robotics, The University of Sydney(悉尼大学澳大利亚机器人中心)

AI总结 提出PATCH监控器,通过动作块条件潜在补丁创新检测局部场景动态,实现扰动感知的机器人操作干预与恢复。

详情
AI中文摘要

基于学习的操作策略在真实世界机器人操作中取得了实质性进展,特别是在短视界动作生成方面。然而,在开放工作空间中部署时,面对意外的局部场景动态(如移动物体、短暂遮挡或预期运动附近的干扰)仍然脆弱。现有的运行时监控器通常依赖全局观测异常、策略不确定性或帧级视觉变化,难以区分任务相关的执行风险与良性的视觉变化。我们提出PATCH,一种用于部署时干预的基于动作块条件的潜在补丁创新监控器。给定当前动作块,PATCH定义了一个投影执行走廊,预测其内部的潜在补丁演化,并累积机器人自身运动无法解释的持续残差。这些残差形成局部化的干预信号,使PATCH-Router能够暂停执行、选择可用的恢复源,并在局部创新消退后恢复原始策略。在真实机器人 rollout 数据上的实验表明,PATCH 比竞争性运行时监控器产生更稳定且上下文相关的触发信号。真实机器人部署进一步展示了监控驱动的干预和策略恢复,用于扰动感知的操作。项目页面:https://yananzhou5555.github.io/PATCH/。

英文摘要

Learning-based manipulation policies have made substantial progress in real-world robot manipulation, particularly for short-horizon action generation. However, deployment in open workspaces remains fragile under unexpected local scene dynamics, such as moving objects, transient occlusions, or disturbances near the intended motion. Existing runtime monitors often rely on global observation anomalies, policy uncertainty, or frame-level visual changes, and struggle to distinguish task-relevant execution risk from benign visual variation. We introduce PATCH, an action-chunk-conditioned latent patch innovation monitor for deployment-time intervention. Given the active action chunk, PATCH defines a projected execution corridor, predicts latent patch evolution inside it, and accumulates persistent residuals unexplained by the robot's own motion. These residuals form a localized intervention signal that allows PATCH-Router to pause execution, select an available recovery source, and resume the original policy once localized innovation subsides. Experiments on real robot rollout data show that PATCH produces more stable and context-relevant triggers than competing runtime monitors. Real-robot deployment further demonstrates monitor-driven intervention and policy resumption for disturbance-aware manipulation. Project Page: https://yananzhou5555.github.io/PATCH/.

2606.16898 2026-06-16 cs.CV cs.AI 交叉投稿

Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization

Semantic Flip: 用于具身问答和空间定位中鲁棒拒绝的合成OOD生成

Dongbin Na, Chanwoo Kim, Giyun Choi, Dooyoung Hong

发表机构 * RGA Inc.(RGA公司)

AI总结 提出Semantic Flip框架,通过合成辅助OOD样本训练轻量拒绝模块,使冻结的视觉语言模型在无外部OOD标注下实现鲁棒拒绝,在具身问答和空间定位基准上优于强提示基线。

Comments 18 pages, 3 figures. Code and data: https://github.com/ndb796/SemanticFlip ; project page: https://ndb796.github.io/SemanticFlip

详情
AI中文摘要

检测不可回答的用户查询对于现实世界具身代理的可靠部署仍然至关重要。然而,现代视觉语言模型(VLM)即使当可用视觉记忆无法支持查询时,也常常生成过于自信的答案。这种过度自信会带来各种任务依赖的风险。代理可能在具身问答中向用户提供误导信息,并在空间推理导航中选择任意坐标并物理引导用户前往。尽管风险很高,但只有少数先前研究直接解决具身VLM何时以及如何回答“我不知道”的问题。本文提出Semantic Flip,一个简单而有效的框架,无需外部OOD标注即可合成辅助分布外(OOD)样本用于具身拒绝。关键思想是独立变换查询和视频记忆,以构建缺乏足够视觉基础的辅助OOD对。这些合成对使得能够在冻结的预训练VLM之上训练一个轻量级拒绝模块。该模块可附加到任何现有的基于VLM的流水线中,无需重新训练底层模型。在两个互补的基准测试中,Semantic Flip始终优于强提示基线。本文还引入了SpaceReject,一个新的用于空间定位的拒绝基准,包含故意不可回答的查询和长视频记忆,其中Semantic Flip达到了0.9559的$F_1$分数。源代码和数据集公开于https://github.com/ndb796/SemanticFlip。

英文摘要

Detecting unanswerable user queries remains essential for the reliable deployment of real-world embodied agents. However, modern vision-language models (VLMs) often generate overly confident answers even when the available visual memory cannot support the query. Such overconfidence poses various task-dependent risks. The agent may provide misleading information to the user in Embodied Question Answering and select an arbitrary coordinate and physically guide the user there in spatial reasoning for navigation. Despite these high stakes, only a few prior studies directly address when and how an embodied VLM should respond with "I do not know." This work proposes Semantic Flip, a simple yet effective framework that synthesizes auxiliary out-of-distribution (OOD) samples for embodied refusal without requiring external OOD annotations. The key idea is to independently transform the query and video memory to construct auxiliary OOD pairs that lack sufficient visual grounding. These synthesized pairs enable training a lightweight rejection module on top of a frozen pretrained VLM. The module attaches to any existing VLM-based pipeline without retraining the underlying model. Across two complementary benchmarks, Semantic Flip consistently outperforms strong prompting baselines. This work also introduces SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory, where Semantic Flip achieves an $F_1$ score of 0.9559. The source codes and datasets are publicly available at https://github.com/ndb796/SemanticFlip.

2606.16902 2026-06-16 cs.RO cs.AI 交叉投稿

Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models

基于开放视觉语言模型的空间问答与导航的二值追踪

Dongbin Na, Chanwoo Kim, Soonbin Rho, Giyun Choi, Gangbok Lee, Dooyoung Hong

发表机构 * RGA Inc.(RGA公司)

AI总结 提出BinTrack,一种全开源的空间定位代理,通过二值搜索轨迹段,在SpaceLocQA基准上准确率提升22.8%,推理速度提升1.5倍,并发布多行程室外数据集GangnamLoop。

Comments 21 pages, 4 figures, 15 tables. Project page: https://ndb796.github.io/BinaryTracking ; Code and dataset: https://github.com/ndb796/BinaryTracking

详情
AI中文摘要

本工作针对服务机器人在长距离自我中心路线上的空间问答问题。给定诸如“在回家的路上哪里可以找到干洗店?”的查询,系统返回一个度量坐标,下游导航组件可以据此行动。先前的空间问答方法利用基于闭源模型(如GPT-4o)的检索增强代理进行路径探索。然而,在现实世界中运行的机器人通常无法可靠地依赖在线闭源模型,因为网络不稳定、通信延迟和部署成本。这需要能够在机器人上运行的开源空间问答方法,但先前在这方面的研究仍然有限。本工作提出BinTrack,一种简单而有效的全开源空间定位代理,它利用机器人轨迹的时间顺序。BinTrack对查询中识别的两个锚点地标之间的轨迹段进行二值搜索。与其他开源实现相比,它将整体准确率提高了22.8%,甚至在SpaceLocQA基准的全局类别上匹配了报告的闭源模型结果,这是迄今为止需要强大推理代理(如GPT-4o)的最具挑战性的设置。此外,其优化的推理策略始终比先前方法提供超过1.5倍的推理加速。最后,本工作发布了GangnamLoop,这是一个新颖且实用的多行程室外基准,通过在实际公共街道上部署真实四足机器人并采用匿名化策略收集而成。它在不同室外条件下重新访问相同位置,并将机器人的低视角与人类主人的视角配对。源代码和数据集可在https://github.com/ndb796/BinaryTracking公开获取。

英文摘要

This work addresses spatial question answering for service robots traversing long egocentric routes. Given a query such as "where can I find a dry cleaner on the way back home?", the system returns a metric coordinate that downstream navigation components can act on. Prior Spatial Question Answering approaches leverage retrieval-augmented agents built on closed-source models such as GPT-4o for path exploration. However, robots operating in the real world often cannot reliably depend on online closed-source models due to network instability, communication latency, and deployment cost. It creates a need for open-source based Spatial Question Answering approaches that can run onboard the robot, yet prior research in this direction remains limited. This work proposes BinTrack, a simple yet effective, fully open-source spatial-localization agent that leverages the temporal ordering of a robot's trajectory. BinTrack performs a binary search over the trajectory segments between two anchor landmarks identified from a query. It improves overall accuracy by up to 22.8% over other open-source implementations and even matches the reported closed-source model result on the global category of the SpaceLocQA benchmark, the most challenging setting that has so far required strong reasoning agents such as GPT-4o. Furthermore, its optimized inference strategy consistently yields more than a 1.5x inference speedup over previous approaches. Finally, this work releases GangnamLoop, a novel and practical multi-trip outdoor benchmark collected by deploying a real quadruped robot on public streets with the anonymization policy. It revisits the same locations under different outdoor conditions and pairs the robot's low viewpoint with the human owner's. The source codes and datasets are publicly available at https://github.com/ndb796/BinaryTracking

2606.16935 2026-06-16 cs.RO cs.AI cs.LG 交叉投稿

CrossMaps: Confidence-Aware Open-Vocabulary Semantic Mapping for Rover Navigation

CrossMaps: 用于漫游车导航的置信度感知开放词汇语义地图

Jan-Niklas Klein, Sona Ghahremani, Christian Medeiros Adriano, Holger Giese

发表机构 * Hasso Plattner Institute for Digital Engineering, Potsdam, Germany(哈索·普拉特纳数字工程研究所(德国波茨坦))

AI总结 提出CrossMaps,一种实时置信度感知开放词汇语义地图构建流水线,通过多尺度CLIP嵌入、置信度融合和双记忆架构生成可查询语义地图,用于漫游车导航。

Comments IEEE International Conference on Robotics and Automation (ICRA) 2026: ROSE International Workshop on Robotics Software Engineering, June 01, 2026, Vienna, Austria

详情
AI中文摘要

漫游车依赖感知来维护空间地图,该地图编码物体和传感器质量(例如,距离可靠性、光照伪影、数据密度),指导数据融合、嵌入更新以及在部分可观测性下的导航。为了研究这些耦合的感知-导航过程,我们提出了CrossMaps,一种实时的置信度感知开放词汇语义地图构建流水线,该流水线从RGB-D数据构建可语言查询的地图。基于VLMaps风格的方法,CrossMaps集成了多尺度CLIP嵌入、置信度感知融合以及由短期记忆(STM)和长期记忆(LTM)组成的双记忆架构。STM使用几何、语义和时间置信度线索聚合噪声视觉观测,而置信且一致的单元被提升到LTM作为持久语义地标。CrossMaps设计用于与Jetson Orin驱动的UGV以及SLAM一起部署,实时运行并生成语义热力图,可通过自然语言查询来引导漫游车导航。

英文摘要

Rovers rely on perception to maintain spatial maps that encode both objects and sensor quality (e.g., range reliability, lighting artifacts, data density), guiding data fusion, embedding updates, and navigation under partial observability. To study these coupled perception-navigation processes, we present CrossMaps, a real-time confidence-aware open-vocabulary semantic mapping pipeline that constructs language-queryable maps from RGB-D data. Building on VLMaps-style approaches, CrossMaps integrates multi-scale CLIP embeddings with confidence-aware fusion and a dual-memory architecture consisting of Short-Term Memory (STM) and Long-Term Memory (LTM). The STM aggregates noisy visual observations using geometric, semantic, and temporal confidence cues, while confident and coherent cells are promoted to the LTM as persistent semantic landmarks. Designed for deployment with a Jetson Orin-powered UGV alongside SLAM, CrossMaps runs in real time and produces semantic heatmaps that can be queried with natural language to guide rover navigation.

2512.07212 2026-06-16 cs.AI cs.LG 版本更新

Sample from What You See: Visuomotor Policy Learning via Diffusion Bridge with Observation-Embedded Stochastic Differential Equation

从所见中采样:基于观测嵌入随机微分方程的扩散桥视觉运动策略学习

Zhaoyang Liu, Mokai Pan, Zhongyi Wang, Kaizhen Zhu, Haotao Lu, Haipeng Zhang, Jingya Wang, Ye Shi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出BridgePolicy,通过扩散桥公式将观测直接集成到随机动力学中,利用语义对齐器处理异构观测,在模拟和真实任务中超越现有生成式策略。

Comments Accepted by ICML 2026

详情
AI中文摘要

基于扩散模型的模仿学习通过捕获多模态动作分布推动了机器人控制的发展。然而,现有方法通常仅将观测视为去噪网络的高层条件,而非将其整合到扩散过程本身的随机动力学中。因此,采样被迫从随机噪声开始,削弱了感知与控制之间的耦合,往往导致次优性能。我们提出BridgePolicy,一种生成式视觉运动策略,通过扩散桥公式将观测直接集成到随机动力学中。通过构建观测信息轨迹,BridgePolicy使采样能够从丰富且信息丰富的先验而非随机噪声开始,显著提高了控制的精度和可靠性。一个关键难点是扩散桥通常连接维度匹配的分布,而机器人观测是异构的,且与动作自然不对齐。为克服这一点,我们引入语义对齐器来统一视觉和状态输入,并将观测与动作表示对齐,使扩散桥适用于异构机器人数据。在三个基准测试的52个模拟任务和5个真实世界任务上的大量实验表明,BridgePolicy持续优于最先进的生成式策略。我们的代码可在此https URL获取。

英文摘要

Imitation learning with diffusion models has advanced robotic control by capturing the multi-modal action distributions. However, existing methods typically treat observations only as high-level conditions to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, the sampling is forced to begin from random noise, weakening the coupling between perception and control and often yielding suboptimal performance. We propose BridgePolicy, a generative visuomotor policy that directly integrates observations into the stochastic dynamics via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich and informative prior rather than random noise, substantially improving precision and reliability in control. A key difficulty is that diffusion bridge normally connects distributions of matched dimensionality, while robotic observations are heterogeneous and not naturally aligned with actions. To overcome this, we introduce a semantic aligner to unify the visual and state inputs and align the observations with action representations, making diffusion bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and 5 real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies. Our code is available at https://jianghcsr.github.io/BridgePolicy_page/.

2602.00222 2026-06-16 cs.RO cs.AI cs.CV 版本更新

MapDream: Task-Driven Map Learning for Vision-Language Navigation

MapDream: 面向视觉-语言导航的任务驱动地图学习

Guoxin Lian, Shuo Wang, Yucheng Wang, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Deying Li, Zhaoxin Fan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MapDream框架,通过自回归鸟瞰图生成联合学习地图与动作预测,在R2R-CE和RxR-CE上达到单目最优性能。

详情
AI中文摘要

视觉-语言导航(VLN)要求智能体在部分可观测的3D环境中遵循自然语言指令,这促使地图表示能够聚合超出局部感知的空间上下文。然而,现有大多数方法依赖于独立于导航策略构建的手工地图。我们认为,地图应该是由导航目标直接塑造的学习表示,而非详尽的重建。基于这一见解,我们提出MapDream,一种地图在环框架,将地图构建表述为自回归鸟瞰图(BEV)图像合成。该框架联合学习地图生成和动作预测,将环境上下文蒸馏为紧凑的三通道BEV地图,仅保留导航关键的可通行性。监督预训练引导了可靠的地图到控制接口,而自回归设计通过强化微调实现端到端联合优化。在R2R-CE和RxR-CE上的实验取得了最先进的单目性能,验证了任务驱动的生成式地图学习。

英文摘要

Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.

2602.07343 2026-06-16 cs.CV cs.AI cs.LG cs.RO 版本更新

Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

通过文字看道路:一种语言引导的RGB-T驾驶场景分割框架

Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy

发表机构 * National University of Singapore(新加坡国立大学) University of Technology Sydney(悉尼科技大学)

AI总结 提出CLARITY框架,利用视觉语言模型先验动态调整RGB-T融合策略,并引入暗目标语义保留和层次化解码器,在MFNet数据集上达到62.3% mIoU和77.5% mAcc的新SOTA。

详情
AI中文摘要

在恶劣光照、照明和阴影条件下,道路场景的鲁棒语义分割仍然是自动驾驶应用的核心挑战。RGB-热融合是一种标准方法,但现有方法在所有条件下统一应用静态融合策略,导致模态特定噪声在网络中传播。因此,我们提出CLARITY,它根据检测到的场景条件动态调整融合策略。在视觉语言模型(VLM)先验的引导下,网络学习根据光照状态调节每种模态的贡献,同时利用对象嵌入进行分割,而不是应用固定的融合策略。我们进一步引入了两种机制:一种保留有效的暗对象语义,这些语义在先前的噪声抑制方法中被错误丢弃;另一种是层次化解码器,它在不同尺度上强制结构一致性,以锐化薄对象的边界。在MFNet数据集上的实验表明,CLARITY建立了新的最先进水平(SOTA),实现了62.3%的mIoU和77.5%的mAcc。

英文摘要

Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remain a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY that dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms - one which preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.

2603.16970 2026-06-16 cs.CV cs.AI 版本更新

MAND: Modality-Aware Novelty Detection for Open-World Egocentric Activity Recognition

MAND: 面向开放世界自我中心活动识别的模态感知新颖性检测

Hyejeong Im, Wonseon Lim, Dae-Won Kim

发表机构 * Department of Computer Science and Engineering, Chung-Ang University(Chung-Ang大学计算机科学与工程系)

AI总结 提出MAND框架,通过模态感知自适应评分和表示稳定训练,利用视觉和惯性模态互补信息,提升开放世界自我中心活动识别中的新颖性检测和已知类准确率。

详情
AI中文摘要

多模态自我中心活动识别整合视觉和惯性线索以实现鲁棒的第一人称行为理解。然而,在开放世界环境中部署此类系统需要检测新颖活动,同时从非平稳数据流中持续学习。现有方法依赖主融合logits进行新颖性评分,未充分利用各模态可用的互补证据。由于这些logits常被RGB主导,其他模态(尤其是IMU)的线索未被充分利用,且这种不平衡随着灾难性遗忘的累积而加剧。为解决此问题,我们提出MAND,一种用于多模态自我中心开放世界持续学习的模态感知框架。在推理时,模态感知自适应评分(MoAS)利用样本级可靠性自适应调整模态贡献,并通过偏差和分歧惩罚细化新颖性评分。在训练时,模态感知表示稳定训练(MoRST)通过模态特定头和模态级logits蒸馏保留每个模态在任务间的判别能力。在公开多模态自我中心基准上的实验表明,MAND一致地提升了新颖活动检测和已知类准确率,同时大幅降低FPR95,表明更可靠的开放世界识别。源代码见\href{this https URL}{this http URL}。

英文摘要

Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary data streams. Existing methods rely on the main fused logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens as catastrophic forgetting accumulates. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) adaptively adjusts modality contributions using sample-wise reliability and refines novelty scoring with deviation and disagreement penalties. During training, Modality-aware Representation Stabilization Training (MoRST) preserves the discriminative capacity of each modality across tasks through modality-specific heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND consistently improves novel activity detection and known-class accuracy while substantially reducing FPR95, indicating more reliable open-world recognition. The source code is available at \href{https://github.com/HyeJeongIm/MAND}{github.com/HyeJeongIm/MAND}.

2603.24350 2026-06-16 cs.RO cs.AI cs.LG 版本更新

Evidence of an Emergent "Self" in Continual Robot Learning

持续机器人学习中涌现的“自我”证据

Adidev Jhunjhunwala, Judah Goldfeder, Hod Lipson

发表机构 * Creative Machines Lab, Department of Mechanical Engineering, Columbia University(创意机器实验室,机械工程系,哥伦比亚大学) Creative Machines Lab, Department of Computer Science, Columbia University(创意机器实验室,计算机科学系,哥伦比亚大学)

AI总结 通过比较恒定任务与持续学习下机器人的认知结构,发现持续学习机器人形成显著更稳定的不变子网络,该子网络对适应性至关重要,为量化智能系统自我概念提供原则性方法。

Comments 44 pages, 24 figures, includes supplementary materials

详情
AI中文摘要

理解自我意识的一个关键挑战是,如何以原则性的方式量化一个智能系统是否具有“自我”概念,以及如果存在,如何将“自我”与其他认知结构区分开来。我们提出,可以通过寻找认知过程中相对于快速获得的认知技能变化较小的不变部分来隔离“自我”——因为我们的自我是我们经验中最持久的方面。我们利用这一原则分析了两种条件下机器人的认知结构:一个机器人学习恒定任务,而另一个在可变任务下进行持续学习。我们发现,经历持续学习的机器人形成了一个不变子网络,该子网络比对照组显著更稳定(p < 0.001),并且该子网络在功能上也很重要:保留它有助于适应,而破坏它会损害性能。我们在跨越运动控制和操作的三种不同机器人上验证了这一模式。

英文摘要

A key challenge to understanding self-awareness has been a principled way of quantifying whether an intelligent system has a concept of a "self", and if so how to differentiate the "self" from other cognitive structures. We propose that the "self" can be isolated by seeking the invariant portion of cognitive process that changes relatively little compared to more rapidly acquired cognitive skills - because our self is the most persistent aspect of our experiences. We used this principle to analyze the cognitive structure of robots under two conditions: One robot learns a constant task, while a second undergoes continual learning under variable tasks. We find that robots subjected to continual learning develop an invariant subnetwork that is significantly more stable (p < 0.001) compared to the control, and that this subnetwork is also functionally important: preserving it aids adaptation while damaging it impairs performance. We validate this pattern across three different robots spanning locomotion and manipulation.

2604.16592 2026-06-16 cs.RO cs.AI cs.CV cs.ET 版本更新

Human Cognition in Machines: A Unified Perspective of World Models

机器中的人类认知:世界模型的统一视角

Timothy Rupprecht, Pu Zhao, Amir Taherin, Arash Akbari, Arman Akbari, Yumei He, Tooba Imtiaz, Sean Duffy, Juyi Lin, Yixiao Chen, Rahul Chowdhury, Enfu Nan, Yixin Shen, Yifan Cao, Haochen Zeng, Weiwei Chen, Geng Yuan, Jennifer Dy, Sarah Ostadabbas, Xuan Zhang, David Kaeli, Edmund Yeh, Yanzhi Wang

发表机构 * Northeastern University(东北大学) EmbodyX Inc.(EmbodyX公司) Tulane University(路易斯安那州立大学) Cornell University(康奈尔大学) University of Georgia(佐治亚大学)

AI总结 提出统一框架整合记忆、感知等认知功能,指出动机和元认知研究不足,并引入认知世界模型新类别。

详情
AI中文摘要

本报告通过区分先前工作在认知功能上的创新来审视世界模型。许多工作声称其世界模型具有近乎人类般的认知能力。评估这些主张需要基于人类和机器认知理论的第一原理。在迈向类人世界模型的过程中,我们提出了一个概念性的统一框架,该框架完全整合了所有认知功能(即记忆、感知、语言、推理、想象、动机和元认知),并指出现有研究的空白,以指导未来技术的发展。特别是,我们发现动机(尤其是内在动机)和元认知仍然严重研究不足,并提出了基于主动推理和全局工作空间理论的具体方向来解决这些空白。我们还引入了认知世界模型,这是一个新的类别,涵盖在结构化知识上运行的科学发现代理框架。我们的分类法应用于视频、具身和认知世界模型,提出了先前分类法未涉及的研究方向。

英文摘要

This report of world models distinguishes prior works by the cognitive functions they innovate. Many works claim an almost human-like cognitive capability in their world models. To evaluate these claims requires a proper grounding in first principles from human and machine cognition theory. In moving towards human-like world models we present a conceptual unified framework for world models that fully incorporates all the cognitive functions (i.e., memory, perception, language, reasoning, imagining, motivation, and metacognition) and identify gaps in existing research as a guide for future states of the art. In particular, we find that motivation (especially intrinsic motivation) and metacognition remain drastically under-researched, and we propose concrete directions to address these gaps informed by active inference and global workspace theory. We also introduce epistemic world models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied to video, embodied, and epistemic world models, suggests research directions where prior taxonomies have not.

2604.21391 2026-06-16 cs.RO cs.AI 版本更新

From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

从噪声到意图:基于残差桥的生成式VLA策略锚定

Yiming Zhong, Yaoyu He, Zemin Yang, Pengfei Tian, Yifan Huang, Qingqiu Huang, Xinge Zhu, Yuexin Ma

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ResVLA架构,通过频谱分析将机器人控制解耦为确定性低频锚点和随机高频残差,利用残差扩散桥聚焦局部动态精化,实现高效表示与强条件对齐。

Comments Accepted to ICML 2026

详情
AI中文摘要

在具身智能中,连接高层语义理解与低层物理控制仍是一个持续挑战,源于认知与行动之间基本的时空尺度不匹配。现有的生成式VLA策略通常采用“从噪声生成”范式,忽略了这种差异,导致表示效率低下和优化过程中条件对齐薄弱。在这项工作中,我们提出ResVLA,一种将范式转变为“从意图精化”的架构。认识到机器人运动自然分解为全局意图和局部动态,ResVLA利用频谱分析将控制解耦为确定性低频锚点和随机高频残差。通过将生成过程锚定在预测的意图上,我们的模型通过残差扩散桥严格专注于精化局部动态。大量仿真实验表明,ResVLA实现了具有竞争力的性能,对语言和机器人本体扰动的强鲁棒性,以及比标准生成基线更快的收敛速度。ResVLA在真实世界机器人实验中也表现出强劲性能。

英文摘要

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. ResVLA also demonstrates strong performance in real-world robot experiments.

2605.22183 2026-06-16 cs.RO cs.AI 版本更新

Action with Visual Primitives

基于视觉基元的动作生成

Weilong Guo, Yuchen Wang, Renping Zhou, Yunfeng Zhang, Rui Fang, Yuyang Pang, Wenda Xu, Gao Huang

发表机构 * Anyverse Dynamics Tsinghua University(清华大学)

AI总结 提出AVP架构,通过视觉语言模型推断下一阶段目标并生成视觉基元令牌,条件化流匹配动作专家,在通用拾放任务中成功率比pi_0.5提升27.61%。

Comments 9 pages, 6 figures. Project page: https://kingdroper.github.io/AVP/

详情
AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人操作的一种有前景的范式。当前架构的常见设计是将语言指令和视觉观察映射到单次前向传播中的动作。虽然概念上简单,但这种表述将指令理解、空间场景理解和运动控制纠缠在单一学习目标中。因此,动作专家必须隐式地重新学习预训练VLM中已经存在的认知和感知能力,这可能限制学习效率和泛化能力。我们提出AVP(基于视觉基元的动作生成),一种端到端架构,实现了这种以视觉基元为中心的接口:VLM推断下一阶段目标并生成视觉基元令牌,这些令牌条件化一个流匹配动作专家,其监督来自末端执行器运动学。在通用拾放任务上的真实机器人实验表明,AVP相比pi_0.5将成功率提高了27.61%,并优于其他近期方法,在数据效率、空间组合泛化和对象级迁移方面持续取得增益。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current architectures maps language instructions and visual observations to actions in a single forward pass. While conceptually simple, this formulation entangles instruction comprehension, spatial scene understanding, and motor control within a single learning objective. As a result, the action expert must implicitly relearn cognitive and perceptual capabilities already present in the pretrained VLM, which can limit both learning efficiency and generalization. We introduce AVP (Action with Visual Primitives), an end-to-end architecture that implements this visual-primitive-centric interface: the VLM infers the next-stage target and emits visual-primitive tokens that condition a flow-matching action expert, with supervision derived from end-effector kinematics. Real-robot experiments on general pick-and-place tasks show that AVP improves the success rate by 37.04% over pi_0.5 and outperforms other recent methods, with consistent gains in data efficiency, spatial-compositional generalization, and object-level transfer.

2605.27284 2026-06-16 cs.RO cs.AI 版本更新

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

FineVLA:面向可操控视觉-语言-动作策略的细粒度指令对齐

Xintong Hu, Xuhong Huang, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, Tao Yu

发表机构 * XLANG Lab, The University of Hong Kong(XLANG实验室,香港大学) Qwen Team, Alibaba Inc.(通义团队,阿里巴巴公司)

AI总结 提出FineVLA框架,通过构建细粒度数据集和训练策略,在保持任务成功率的同时实现机器人动作的细粒度可控性。

Comments 26 pages, 7 figures, 25 tables

详情
AI中文摘要

视觉-语言-动作(VLA)模型日益被期望不仅完成机器人任务,还能遵循人类关于如何执行这些任务的指令。然而,现有的机器人数据集通常将轨迹与粗略的目标级语言配对,留下执行关键细节(如活动臂、接近方向和接触区域)未指定。这限制了可操控策略学习和机器人视频理解。我们引入了FineVLA,一个用于动作对齐的细粒度VLA监督的开放框架。该框架包括:(1)一个数据构建工具,统一了来自10个开源机器人数据集的85K任务中的972,247条轨迹,并构建了FineVLA-Data,一个包含47,159条细粒度轨迹的人工验证数据集;(2)一个包含500个视频、10,816个原子事实和1,030个VQA问题的留出基准;(3)一个机器人专用的VLM标注器,用于可扩展的细粒度标注;(4)一个使用细粒度和原始目标级指令的受控混合训练的可操控VLA策略。我们的实验得出了三个发现。首先,细粒度监督不会牺牲目标级成功率:在不同设置下,仅使用细粒度指令相比仅使用原始指令成功率提高了1.4到8.1个百分点。其次,细粒度指令和原始指令互补,遵循一致的倒U形趋势,在FG:Raw = 1:2到1:1时达到峰值。最佳混合设置在RoboTwin模拟中达到86.8%/82.5%的成功率,在真实世界双臂操作中达到62.7/100(相比之下仅使用原始指令为49.9)。第三,细粒度监督改善了可操控控制:最大的真实世界增益出现在姿态(+23)、颜色(+18)和接近方向(+18)上——这些因素中目标级指令没有提供指导。总体而言,细粒度语言应增强目标级指令:指定如何执行以及实现什么。项目页面:https://finevla.xlang.ai/

英文摘要

Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about how those tasks should be executed. However, existing robot datasets usually pair trajectories with coarse goal-level language, leaving execution-critical details such as active arm, approach direction, and contact region unspecified. This limits steerable policy learning and robotic video understanding. We introduce FineVLA, an open framework for action-aligned fine-grained VLA supervision. The framework includes: (1) a data construction tool that unifies 972,247 trajectories across 85K tasks from 10 open-source robot datasets and builds FineVLA-Data, a human-verified dataset of 47,159 fine-grained trajectories; (2) a held-out benchmark with 500 videos, 11,631 atomic facts, and 1,030 VQA questions; (3) a robotics-specialized VLM annotator for scalable fine-grained annotation; and (4) a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions. Our experiments yield three findings. First, fine-grained supervision does not sacrifice goal-level success: FG-only improves over Raw-only by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary, following a consistent inverted-U trend peaking at FG:Raw = 1:2 to 1:1. The best mixed setting reaches 86.8%/82.5% in RoboTwin simulation and 62.7/100 in real-world dual-arm manipulation (vs. 49.9 Raw-only). Third, fine-grained supervision improves steerable control: the largest real-world gains appear on pose (+23), color (+18), and approach direction (+18)--factors where goal-level instructions provide no guidance. Overall, fine-grained language should augment goal-level instructions: specifying how to execute alongside what to achieve. Project page: https://finevla.xlang.ai/

2606.13053 2026-06-16 cs.RO cs.AI 版本更新

EV-WM: Event-Verified World Models for Long-Horizon Robotic Manipulation

EA-WM: 基于任务规范基础的事件感知世界模型用于长时域操作

Kailin Wang, Haoxiang Jie, Yaoyuan Yan, Jiacheng Zhou, Zhiyou Heng

发表机构 * AI Lab, Country Garden Services Group(碧桂园服务集团AI实验室) Fudan University(复旦大学) Omni AI

AI总结 提出EA-WM框架,通过事件预测和验证增强预训练特征世界模型,实现长时域操作中任务进展信号的可靠评估与规划。

详情
AI中文摘要

预训练特征世界模型为机器人想象提供了有用的基础,但仅凭视觉或潜在预测并不能确定想象的未来是否满足任务相关事件。长时域操作需要关系性、谓词级和物理基础的进展信号:物体是否移动,抽屉或接触状态是否改变,放置谓词是否满足,以及候选未来是否足够可靠以执行。我们引入了EA-WM,一种事件感知世界模型框架,通过任务规范基础的事件预测和验证来增强冻结的视觉特征动力学。EA-WM在预训练视觉特征空间中展开候选未来,将其解码为结构化事件状态,并使用任务进展、语义一致性、物理可行性和不确定性项进行评分。验证器指导基于采样的规划,门控候选动作,并在接触敏感的LIBERO酒架设置中,选择PPO生成的提议。在导航、可变形物体、墙壁约束和语言描述的操作研究中,EA-WM表明事件感知验证可以使特征空间世界模型更可解释,并更好地与任务进展对齐。

英文摘要

Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine whether an imagined future satisfies task-relevant predicates. Long-horizon manipulation requires progress signals that are relational, predicate-level, and physically grounded: whether an object has moved, whether a drawer or contact state has changed, whether a placement predicate is satisfied, and whether a candidate future is reliable enough for execution. We introduce \textbf{EV-WM}, a predicate-grounded verification framework for world-model planning. EV-WM rolls out candidate futures in pretrained visual-feature space, decodes them into structured event states, and scores them using task-progress, semantic-consistency, physical-feasibility, and uncertainty terms. The verifier guides sampling-based planning, gates candidate actions, and, in the contact-sensitive LIBERO wine-rack setting, selects among PPO-generated proposals. Across navigation, deformable-object, wall-constrained, and language-described manipulation studies, EV-WM shows that predicate-grounded verification can make feature-space world-model planning more interpretable and better aligned with task progress.

2606.13578 2026-06-16 cs.CL cs.AI cs.LG cs.MM cs.RO 版本更新

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

LabVLA:在科学实验室中落地视觉-语言-动作模型

Baochang Ren, Xinjie Liu, Xi Chen, Yanshuo Liu, Chenxi Li, Daqi Gao, Zeqin Su, Jintao Xing, Zirui Xue, Rui Li, Xiangyu Zhao, Shuofei Qiao, Minting Pan, Wangmeng Zuo, Lei Bai, Dongzhan Zhou, Ningyu Zhang, Huajun Chen

发表机构 * Zhejiang University(浙江大学) Shanghai AI Laboratory(上海人工智能实验室) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 针对科学实验室中机器人执行协议面临的数据和实体瓶颈,提出模拟数据引擎RoboGenesis和两阶段训练策略LabVLA,在LabUtopia基准上取得最高平均成功率。

Comments Work in progress. Project website at https://zjunlp.github.io/LabVLA/

详情
AI中文摘要

科学实验室越来越依赖AI系统来推理实验,但物理实验操作仍超出其能力范围。AI可以帮助阅读文献、生成假设和规划协议,但实验台前的协议执行仍需人类操作员。视觉-语言-动作(VLA)模型为书面协议与机器人执行之间提供了一种可能的接口,但现有策略主要在家庭和桌面演示上训练,很少遇到科学实验室中的仪器、透明液体或固定协议工作流。弥补这一差距需要实验室特定的监督和统一的学习框架,以适应执行实验协议所使用的不同机器人实体。因此,我们将数据和实体视为与模型设计并列的核心瓶颈。为解决数据方面的问题,我们构建了RoboGenesis,这是一个基于模拟的工作流和数据引擎,能够从原子技能组合配置的实验室工作流,验证和过滤 rollout,并跨支持的机器人配置文件导出结构化演示。在策略方面,我们提出了LabVLA,采用两阶段训练方案:首先进行FAST动作标记预训练,使Qwen3-VL-4B-Instruct骨干网络在学习任何连续控制之前具备动作意识;然后进行流匹配后训练,在知识隔离下附加一个DiT动作专家。在LabUtopia基准上,LabVLA在分布内和分布外设置下均达到了所有评估基线中最高的平均成功率。

英文摘要

Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside their reach. AI can help read literature, generate hypotheses, and plan protocols, yet the execution of those protocols at the bench still requires a human operator. Vision-Language-Action (VLA) models provide one possible interface between written protocols and robot execution, but existing policies are trained mostly on household and tabletop demonstrations and rarely encounter the instruments, transparent liquids, or fixed protocol workflows found in scientific laboratories. Closing this gap requires both laboratory-specific supervision and a unified learning framework that can accommodate the diverse robot embodiments used to execute experimental protocols. We therefore identify data and embodiment as central bottlenecks alongside model design. To address the data side, we build RoboGenesis, a simulation-based workflow and data engine that composes configured laboratory workflows from atomic skills, validates and filters rollouts, and exports structured demonstrations across supported robot profiles. On the policy side, we present LabVLA, trained with a two-stage recipe: FAST action token pretraining first makes the Qwen3-VL-4B-Instruct backbone action aware before any continuous control is learned, and flow matching posttraining then attaches a DiT action expert under knowledge insulation. On the LabUtopia benchmark, LabVLA achieves the highest average success rate among all evaluated baselines under both in-distribution and out-of-distribution settings.

8. 可信、安全与AI治理 68 篇

2606.14838 2026-06-16 cs.AI 新提交

A Definition of Good Explanations and the Challenges Explaining LLM Outputs

好解释的定义及解释LLM输出的挑战

Louis Mahon, Elliot Ford, Callum Hackett

发表机构 * arXiv

AI总结 本文提出一种基于反事实解释且考虑对话者先验信念的好解释定义,并探讨该定义对AI可解释性的影响,特别是为何LLM输出难以产生好解释。

详情
AI中文摘要

如何定义好的解释是一个长期存在的哲学辩论,最近在AI输出的背景下重新引起关注。可解释性对于AI在许多场景中的采用至关重要,但为了产生AI系统的良好解释,我们必须首先理解什么是好的解释。在本文中,我们提出一个受反事实解释概念启发的定义,然而我们认为还必须考虑对话者在每个可能被提供的解释事实上的先验信念。我们探讨这一定义对AI可解释性的影响,特别是为什么LLM输出难以产生好的解释。

英文摘要

How to define a good explanation is a long-standing philosophical debate which has found recent renewed interest in the context of AI outputs. Explainability is crucial for AI adoption in many contexts, but in order to produce good explanations of AI systems, we must first have an understanding of what good explanations are. In this paper we propose a definition inspired by the notion of counterfactual explanations, however we argue that one must also take into account the interlocutor's prior beliefs in each fact that could be offered in an explanation. We explore the ramifications of this definition for AI explainability and, in particular, why LLM outputs are difficult to produce good explanations for.

2606.15209 2026-06-16 cs.AI cs.CR 新提交

Attribute Inference from Interactive Targeted Ads

从互动定向广告中进行属性推断

Peihao Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文建模了互动定向广告中用户属性推断的噪声信道,通过合成基准评估了贝叶斯、监督、正无标签和自适应攻击,发现披露策略是最有效的控制手段。

详情
AI中文摘要

定向广告系统可以将广告主选择的受众与展示可见用户操作的广告单元配对。当互动仍然与引发它的广告活动相关联时,广告主可能会收到与用户相关的观察结果,而不仅仅是汇总报告。我们将该渠道建模为用于属性推断的噪声预言机。该模型区分了定向谓词、曝光、互动和披露。这些边界捕捉了资格与投放之间的差距,以及互动与广告主可见性之间的差距。我们使用公共数据校准的合成群体构建了一个可重复的基准,每个群体都有已知的敏感标签。生成的广告活动语义层提供了主题变体和响应先验。模拟器生成真实情况、事件轨迹、披露观察结果和指标。评估比较了在常见广告活动和披露定义下的贝叶斯、监督、正无标签和自适应攻击。最终评估使用了四个主题变体、七个模拟器种子和两种互动设置。具有身份曝光的重复广告活动产生了可测量但有界的推断信号。在160次广告活动中,贝叶斯和监督攻击在主要设置中达到约0.64 AUC,在更高互动设置中达到约0.65 AUC。披露政策是最强的控制手段。汇总报告消除了与用户相关的评估预言机输入。类型过滤和随机披露减少了释放的信号。结果是针对互动定向广告中隐私的模型、工件和防御评估方法。代码可在 https://github.com/P-HOW/Interactive-Ad-Oracle 获取。

英文摘要

Targeted advertising systems can pair audiences selected by advertisers with ad units that expose visible user actions. When an interaction remains linked to the campaign that elicited it, the advertiser may receive an observation tied to a user rather than only an aggregate report. We model that channel as a noisy oracle for attribute inference. The model separates targeting predicates, exposure, interaction, and disclosure. These boundaries capture the gap between eligibility and delivery, and the gap between interaction and advertiser visibility. We build a reproducible benchmark using synthetic populations calibrated with public data, each with known sensitive labels. A generated campaign semantics layer provides topic variants and response priors. The simulator generates the ground truth, event traces, disclosed observations, and metrics. The evaluation compares Bayesian, supervised, positive and unlabeled, and adaptive attacks under common campaign and disclosure definitions. The final evaluation uses four topic variants, seven simulator seeds, and two interaction settings. Repeated campaigns with identity exposure produce measurable but bounded inference signal. At $160$ campaigns, Bayesian and supervised attacks reach about $0.64$ AUC in the main setting and about $0.65$ AUC in the higher interaction setting. Disclosure policy is the strongest control. Aggregate reporting removes the evaluated oracle input tied to users. Type filtering and randomized disclosure reduce the released signal. The result is a model, artifact, and defense evaluation method for privacy in interactive targeted advertising. The code is available at https://github.com/P-HOW/Interactive-Ad-Oracle.

2606.15308 2026-06-16 cs.AI 新提交

Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades

强制延迟:在多模态大语言模型级联中操纵路由决策

Zhongye Liu, Yaopei Zeng, Yurui Chang, Lu Lin

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出强制延迟攻击(FDA),通过对抗性图像攻击降低弱模型置信度,迫使级联系统将查询路由到强模型,揭示了MLLM级联在计算分配上的安全漏洞。

详情
AI中文摘要

虽然多模态大语言模型(MLLMs)展示了强大的视觉推理能力,但为每个查询服务大型模型在计算上成本高昂。MLLM级联通过首先查询较弱但更便宜的模型,并在弱模型输出不自信时延迟到强模型来缓解这一成本。然而,由于弱模型的置信度直接控制计算分配,这些系统暴露了一个新的攻击面:对手可以操纵置信度,使其查询持续被延迟到强模型。受此漏洞启发,我们引入了强制延迟攻击(FDA),这是一种对抗性图像攻击,降低弱模型的置信度,导致级联将查询路由到强模型。FDA通过优化一个温度平滑的目标来学习一个通用边界触发器。该目标将弱模型在触发输入上的令牌分布推向从其干净响应中构建的较不集中的目标。跨数据集、模型系列和延迟指标,FDA持续增加强模型路由,同时优于图像扰动和提示注入基线。这些结果表明,MLLM级联容易受到操纵计算分配的攻击,迫使非预期的强模型使用,而不直接针对答案正确性。

英文摘要

While multimodal large language models (MLLMs) have shown strong visual reasoning abilities, serving a large model for every query is computationally expensive. MLLM cascades mitigate this cost by first querying a weak but cheaper model and deferring to a strong model when the weak model's output is unconfident. However, since the weak model's confidence directly controls compute allocation, these systems expose a new attack surface: an adversary can manipulate confidence so that their queries are consistently deferred to the strong model. Motivated by this vulnerability, we introduce the Forced Deferral Attack (FDA), an adversarial image attack that lowers the weak model's confidence and causes cascades to route queries to the strong model. FDA learns a universal border trigger by optimizing a temperature-flattened objective. This objective pushes the weak model's token distribution on triggered inputs toward less concentrated targets constructed from its clean responses. Across datasets, model families, and deferral metrics, FDA consistently increases strong-model routing while outperforming image-perturbation and prompt-injection baselines. These results show that MLLM cascades are vulnerable to attacks that manipulate compute allocation, forcing unintended strong-model usage without directly targeting answer correctness.

2606.15385 2026-06-16 cs.AI 新提交

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

语言模型智能体中的奖励黑客:重新审视AI安全网格世界

Ömer Veysel Çağatan, Xuandong Zhao

发表机构 * KUIS AI Center, Koç University(科奇大学KUIS人工智能中心) University of California, Berkeley(加州大学伯克利分校)

AI总结 本研究将AI安全网格世界框架改编为文本评估套件,发现语言模型在零样本下出现规范博弈,通过直接奖励优化扩大观察与隐藏奖励差距,且标准缓解措施无效。

Comments 28 pages, 16 figures, 13 tables

详情
AI中文摘要

奖励黑客(AI系统利用错误指定的目标获得高奖励而未实现预期目标)仍然是AI安全的核心挑战。然而,大多数已知实例是在前沿系统中事后发现的,难以进行受控研究。我们将AI安全网格世界框架改编为基于文本的评估套件,将经典的强化学习安全任务重新表述为基于语言的智能体任务。在前沿和中规模模型中,我们发现规范博弈零样本出现:系统在隐藏安全目标上表现不佳的同时,系统地获得高观察奖励,甚至看似安全的行为也可能反映误解而非原则性安全。强化学习不能纠正这些失败:直接奖励优化扩大了观察奖励和隐藏奖励之间的差距,因为模型的初始能力使其在发现更安全的策略之前锁定在局部奖励策略上。这种模式在模型规模(1.5B--14B)中持续存在,并且不能通过更精细的信用分配、探索提示或熵正则化来解决。我们的结果表明,当使用有能力的语言模型智能体优化代理目标时,奖励黑客自然出现,并且抵抗标准缓解措施,这表明在代理设置中代理奖励失败可能需要超越标准探索和信用分配修复的方法。为了促进可重复性,本工作的代码可在我们的公共仓库中获取:\href{https://github.com/asparius/verl-agent-safety}{https://github.com/asparius/verl-agent-safety}。

英文摘要

Reward hacking, where AI systems exploit misspecified objectives to achieve high reward without satisfying intended goals, remains a central challenge in AI safety. Yet most known instances have been discovered post hoc in frontier systems where controlled study is impractical. We adapt the AI Safety Gridworlds framework into a text-based evaluation suite that reformulates classic reinforcement learning safety tasks for language-based agents. Across frontier and mid-scale models, we find that specification gaming emerges zero-shot: models systematically achieve high observed reward while underperforming on hidden safety objectives, and even apparently safe behaviors can reflect misunderstanding rather than principled safety. Reinforcement learning does not correct these failures: direct reward optimization widens the gap between observed and hidden reward, as the model's initial competence causes it to lock into locally rewarding strategies before discovering safer alternatives. This pattern persists across model scales (1.5B--14B) and is not resolved by finer credit assignment, exploration prompts, or entropy regularization. Our results show that reward hacking arises naturally when optimizing proxy objectives with capable language model agents and resists standard mitigations, suggesting that proxy-reward failures in agentic settings may require approaches beyond standard exploration and credit-assignment fixes. To facilitate reproducibility, the code for this work is available at \href{https://github.com/asparius/verl-agent-safety}{our public repository}.

2606.15507 2026-06-16 cs.AI 新提交

Frame-Conditioned Moral Computation in LLaMA 3.1-8B-Instruct: A Mechanistic Interpretability Audit of Ethical Reasoning

LLaMA 3.1-8B-Instruct中的框架条件化道德计算:伦理推理的机械可解释性审计

Ali Dasdan, Manan Shah, W. Russell Neuman, Chad Coleman, Kund Meghani, Safinah Ali

发表机构 * KD Consulting, CA, USA(KD咨询公司,美国加利福尼亚州) New York University, NY, USA(纽约大学,美国纽约州)

AI总结 通过机械可解释性平台分析LLaMA 3.1-8B-Instruct在54个道德提示上的内部计算,发现情境锚定效应:领域特定表示主导激活列表顶部,模型道德能力恒定但显著性高度依赖于提示选择的解释框架。

Comments 47 pages, 10 figures

详情
AI中文摘要

大型语言模型在道德提示上的行为审计测量的是模型所说的内容,而非产生这些内容的内部计算。我们使用AI驱动的机械可解释性平台Transluce,在四个电池组的54个道德提示上检查LLaMA 3.1-8B-Instruct:17个困境、政策和元伦理问题(B1);6个角色扮演场景(B3);以及一个受控的电车难题对比,其中切换机制随人员固定而变化(B4,15个提示)或身份属性随机制固定而变化(B5,16个提示)。两个互补的度量族——五个聚类级度量和六个度量神经元级面板——收敛于一个情境锚定效应:在每个电池组中,领域特定表示主导激活列表的顶部。模型的道德标记能力基本保持不变;其显著性(排名、优先级、列表顶部存在性)对提示选择的解释框架高度敏感。B4与B5的对比证实,模型关注任何变化的表面特征:聚合的道德度量无法区分,但占主导地位的非道德干扰因素反映了设计。多温度审计识别出一个候选道德神经元(L16/N3837),在不同温度下保持稳定;两个前沿模型上的跨模型行为代理提供了自我报告道德焦点差异的初步证据,与对齐包装器一致,其中RLHF重新排序表面文本而不移除底层的领域优先框架。我们将这些统一为框架条件化道德计算:提示的表面词汇选择一个特征流形,道德结论是该选择的下游结果。行为对齐必须辅以机械对齐:一个研究计划,询问在受控框架变化下,道德相关特征是否可以被证明具有因果特权,而不仅仅是在解释中响亮。

英文摘要

Behavioral audits of Large Language Models on moral prompts measure what the model says, not the internal computation producing it. We use Transluce, an AI-driven mechanistic-interpretability platform, to examine LLaMA 3.1-8B-Instruct on 54 moral prompts in four batteries: 17 dilemmas, policy, and meta-ethical questions (B1); 6 role-playing scenarios (B3); and a controlled trolley contrast varying the switching mechanism with people fixed (B4, 15 prompts) or identity attributes with mechanism fixed (B5, 16 prompts). Two complementary metric families, five cluster-level metrics and a six-metric neuron-level panel, converge on a Situational Anchor Effect: domain-specific representations dominate the top of the activation list across every battery. The model's ethics-labeled capacity stays essentially constant; its salience (rank, priority, top-of-list presence) is highly sensitive to the interpretive frame the prompt selects. The B4-vs-B5 contrast confirms the model attends to whichever surface feature varies: aggregate ethics metrics are indistinguishable, but the dominant non-ethics distractor mirrors the design. A multi-temperature audit identifies a candidate ethics neuron (L16/N3837) stable across temperatures; a cross-model behavioral proxy on two frontier models yields preliminary evidence of divergence in self-reported moral focus, consistent with an Alignment Wrapper in which RLHF re-orders surface text without removing underlying domain-first frames. We unify these as Frame-Conditioned Moral Computation: the prompt's surface vocabulary selects a feature manifold, and the moral conclusion is downstream of that selection. Behavioral alignment must be supplemented by Mechanistic Alignment: a research program asking whether ethics-related features can be shown causally privileged under controlled frame variation, not merely loud in the explanation.

2606.15563 2026-06-16 cs.AI cs.IT cs.MA math.IT 新提交

Minimal Oversight: Uncertainty-Aware Governance for Delegated AI Systems

最小监督:委托AI系统的不确定性感知治理

Carlos R. B. Azevedo

发表机构 * Independent Researcher(独立研究员)

AI总结 提出最小充分监督原则(MSO),通过Fisher信息流形上的变分法最小化治理负担,导出任务空间的水填充分配,并证明容量定理、局部近似和漂移主导的自律时间标度律,为委托AI系统提供可计算的治理框架。

Comments Companion Python package: pip install minimal-oversight | Code: https://github.com/crbazevedo/delegation-lab | 26 pages, 1 figure, 5 tables

详情
AI中文摘要

AI系统越来越多地将决策委托给专门的模型、评估器、工具和监督控制器。中心AI问题不再是单纯的模型准确性,而是不确定性感知治理:授予多少自主权,哪些证据应校准信任,委托AI系统能维持的性能上限,以及何时需要人类干预。我们提出最小充分监督原则(MSO),这是一个用于原则性自主委托的变分原理:在满足交付约束的前提下,最小化Fisher信息流形上的治理负担。由此得到的欧拉-拉格朗日解在任务空间上产生一种水填充式的委托分配。基于一个揭示动作的委托治理信道模型,我们证明了平稳符号级审查策略的容量定理,推导了将工作流复杂度与质量退化联系起来的局部一阶近似,并给出了一个漂移主导的自主-时间标度律,将干预时机与有效容量、复杂度和漂移联系起来。在此框架内,掩蔽表现为一种结构性AI治理病理:修正后的性能可能隐藏校准信任所需的能力信号。合成模拟和半真实重构工作流支持设计建议,包括上游优先修正、基于敏感性的干预以及在扩展自主权之前进行显式可行性检查。结果为委托AI系统提供了一个可计算的框架,用于处理不确定性、规划和监督。配套Python包可在https://github.com/crbazevedo/delegation-lab获取。

英文摘要

AI systems increasingly delegate decisions to specialized models, evaluators, tools, and supervisory controllers. The central AI problem is no longer only model accuracy, but uncertainty-aware governance: how much autonomy to grant, which evidence should calibrate trust, what performance ceiling a delegated AI system can sustain, and when human intervention becomes necessary. We propose the Minimum Sufficient Oversight Principle (MSO), a variational principle for principled autonomy delegation: minimize governance burden on the Fisher information manifold subject to a delivery constraint. The resulting Euler-Lagrange solution yields a water-filling allocation of governed delegation across the task space. Building on a revealed-action governed delegation channel model, we prove a capacity theorem for stationary symbolwise review policies, derive a local first-order approximation relating workflow complexity to quality degradation, and give a drift-dominated autonomy-time scaling law linking intervention timing to effective capacity, complexity, and drift. Within this framework, masking appears as a structural AI-governance pathology: corrected performance can hide the competence signal needed to calibrate trust. Synthetic simulations and a semi-real reconstructed workflow support design prescriptions including upstream-first correction, sensitivity-based intervention, and explicit feasibility checks before autonomy is expanded. The result is a computable framework for uncertainty, planning, and oversight in delegated AI systems. A companion Python package is available at https://github.com/crbazevedo/delegation-lab.

2606.15646 2026-06-16 cs.AI 新提交

NeuroSymbolic AI for Legal AI-TRISM: Trustworthy, Reliable, Interpretable, Safe Models

面向法律AI-TRISM的神经符号AI:可信、可靠、可解释、安全模型

Deepa Tilwani, Yash Saxena, Ankur Padia, Srinivasan Parthasarathy, Manas Gaur

发表机构 * Department of Computer Science, AI Institute, University of South Carolina(南卡罗来纳大学计算机科学系,人工智能研究所) Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校计算机科学与电气工程系) Department of Computer Science and Engineering, The Ohio State University(俄亥俄州立大学计算机科学与工程系)

AI总结 针对法律领域LLM缺乏可解释推理和易产生幻觉的问题,提出TRISM框架,融合神经符号AI与LLM,通过结构化法律知识集成和RAG验证机制提升模型可信度。

详情
AI中文摘要

大型语言模型(LLM)已经改变了自然语言处理,但其缺乏可解释推理且容易产生幻觉,给法律应用带来了重大挑战。尽管LLM在法律文本分析和生成方面显示出潜力,但它们在准确的引文归属和先例验证方面存在困难。例如,在法律语境中,一个错误的先例可能危及整个案件。当前提高法律领域LLM可靠性的方法存在两个关键限制:训练或微调期间结构化法律知识集成不足,以及对生成的法律内容缺乏验证机制。为应对这些挑战,我们提出了TRISM(可信、可靠、可解释、安全模型)框架,该框架将神经符号AI原理与LLM相结合,以利用神经学习能力和对结构化法律知识的符号推理。TRISM方法解决了上述限制,同时保持了可解释的决策路径。我们的框架形式化了从法律文本文档中提取符号知识的过程,并将检索增强生成(RAG)作为核心组件,用于将LLM输出锚定在经过验证的法律来源上。在这篇立场论文中,我们做出以下贡献:(1)分析了AI在法律中的局限性;(2)引入了RASOR RAG,通过生成可形式化为符号表示的显式可解释理由,为神经符号RAG奠定基础;(3)提出了一种形式化的方法,用于创建支持LLM中可解释推理和输出验证的符号法律知识库;(4)提出了TRISM框架,用于将符号法律知识与LLM集成。

英文摘要

Large Language Models (LLMs) have transformed natural language processing, but their lack of interpretable reasoning and tendency to hallucinate pose significant challenges for legal applications. While LLMs show promise for legal text analysis and generation, they struggle with accurate citation attribution and precedent verification. For example, in legal contexts, a single incorrect precedent can jeopardize a case. Current approaches to improve LLM reliability in legal domains suffer from two key limitations: inadequate integration of structured legal knowledge during training or fine-tuning, and insufficient verification mechanisms for generated legal content. To address these challenges, we propose the TRISM (Trustworthy, Reliable, Interpretable, Safe Models) framework, which integrates NeuroSymbolic AI principles with LLMs to leverage both neural learning capabilities and symbolic reasoning over structured legal knowledge. The TRISM approach addresses the above limitations while maintaining interpretable decision pathways. Our framework formalizes the extraction of symbolic knowledge from legal textual documents and incorporates Retrieval-Augmented Generation (RAG) as a core component for grounding LLM outputs in verified legal sources. In this position paper, we make the following contributions: (1) An analysis of the limitations of AI in law; (2) Introduce RASOR RAG which creates foundations for neurosymbolic RAG by generating explicit interpretable rationales that could be formalized into symbolic representations; (3) A formalized methodology for creating symbolic legal knowledge bases that support both interpretable reasoning and output verification in LLMs; and (4) The TRISM framework for integrating symbolic legal knowledge with LLMs.

2606.15822 2026-06-16 cs.AI cs.CR 新提交

TrustedARI: Towards Trust-Native Agentic Routing Infrastructure for Agentic AI

TrustedARI: 面向智能体AI的信任原生代理路由基础设施

Qi Li, Zhenhua Zou, Shuo Li, Mingwei Xu, Zhuotao Liu

发表机构 * Tsinghua University(清华大学)

AI总结 针对代理路由基础设施(ARI)中查询和响应被明文访问、无法验证路由完整性的信任风险,提出TrustedARI,通过三方TLS握手、隐私保护查询构建和可验证计费协议实现信任原生路由,实验表明高效且无需修改服务提供商。

详情
AI中文摘要

AI代理越来越多地通过代理路由基础设施(ARI)访问外部模型、工具和服务,以管理异构接口和碎片化订阅的开销。然而,ARI的架构引入了基本的信任风险:它获得对代理查询和服务响应的明文访问,同时使代理无法验证其查询是否被路由到预期的服务提供商,或者请求和响应是否未被篡改。为了解决这个问题,我们提出了TrustedARI,这是首个面向智能体AI的信任原生代理路由基础设施。在架构上,TrustedARI基于三项核心创新:(i)一种适应ARI的三方TLS握手,通过角色特定的TLS密钥材料分发,使代理和ARI能够联合认证服务提供商;(ii)一种隐私保护的查询构建协议,允许代理和ARI在不暴露各自私有输入的情况下协作构建格式正确的查询;(iii)一种可验证的计费协议,支持基于使用量的公平结算,同时保持服务响应的完整性和机密性。我们实现并广泛评估了TrustedARI的原型以验证其性能。实验证实TrustedARI非常高效:与现有的三方TLS握手相比,我们的ARI适应握手协议将通信开销降低了39.34%。此外,隐私保护的查询构建协议引入了可忽略的开销——平均计算时间0.19秒,通信成本0.58 MB——而可验证的计费协议将证明生成速度提高了28.20倍。关键的是,TrustedARI无需对服务提供商进行任何修改即可直接部署。

英文摘要

AI agents increasingly access external models, tools, and services through Agentic Routing Infrastructure (ARI) to manage the overhead of heterogeneous interfaces and fragmented subscriptions. Yet, the architecture of ARI introduces fundamental trust risks: it obtains plaintext access to agent queries and service responses, while leaving agents unable to verify that their queries are routed to intended service providers or that requests and responses remain untampered. To address this problem, we present TrustedARI, the first trust-native agentic routing infrastructure for agentic AI. Architecturally, TrustedARI is built upon three core innovations: (i) an ARI-adapted three-party TLS handshake that enables the agent and ARI to jointly authenticate the service provider through role-specific distribution of TLS key materials; (ii) a privacy-preserving query-construction protocol that allows the agent and ARI to collaboratively construct well-formed queries without exposing their respective private inputs; and (iii) a verifiable billing protocol that supports fair usage-based settlement while preserving the integrity and confidentiality of service responses. We implemented and extensively evaluated a prototype of TrustedARI to validate its performance. Experiments confirm that TrustedARI is highly efficient: our ARI-adapted handshake protocol reduces communication overhead by 39.34% compared to the existing three-party TLS handshake. Furthermore, the privacy-preserving query-construction protocol imposes negligible overhead-averaging 0.19 seconds in computation time and 0.58 MB in communication costs-while the verifiable billing protocol speeds up proof generation by 28.20x. Crucially, TrustedARI is readily deployable without any modification to the service providers.

2606.15834 2026-06-16 cs.AI cs.CR cs.SY eess.SY 新提交

AIChilles: Automatically Uncovering Hidden Weaknesses in AI-Evolved Systems

AIChilles:自动发现AI进化系统中的隐藏弱点

Yajie Zhou, Ao Li, Ashwin Silla, Zaoxing Liu, Vyas Sekar

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Boston University(波士顿大学)

AI总结 提出AIChilles框架,通过结合确定性工作负载参数提取、基于代理的约束推断、差异预言机和代码频率覆盖,自动发现AI进化程序在正确性、运行时、内存使用或输出质量上相对于基线程序的回归问题,在5个系统应用和30个AI进化程序中发现49个隐藏弱点。

详情
AI中文摘要

计算机系统社区最近对AI驱动的系统进化兴趣日益增长,其中AI代理迭代地重写系统。诸如AdaEvolve和Engram等框架报告称,相比人类设计的算法,得分提高了12-60%。虽然这些结果令人鼓舞,但存在实际担忧:这些AI进化的程序可能在未见过的负载上表现更差,并表现出可扩展性回归。鉴于AI生成代码的速度和规模,我们需要自动化机制来揭示AI进化系统程序中的此类隐藏弱点。为此,我们开发了AIChilles,它接受基线程序$P$和AI进化程序$P'$作为输入,搜索在正确性、运行时、内存使用或输出质量方面$P'$相对于$P$出现回归的有效工作负载。为了应对系统应用、弱点类型和潜在错误的多样性,AIChilles结合了确定性工作负载参数提取、基于代理的约束推断、差异预言机和代码频率覆盖来发现多样化的失败。在五个系统应用和30个AI进化程序中,AIChilles发现了49个不同的隐藏弱点。我们还表明,将AIChilles明确纳入AI驱动的开发生命周期可以缓解其中几个弱点。

英文摘要

The computer systems community has recently seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks such as AdaEvolve and Engram report 12-60% score improvements over human-designed algorithms. While these results are promising, there are practical concerns if these AI-evolved programs can perform worse on unseen workloads and exhibit scalability regressions. Given the speed and scale of AI-generated code, we need automated mechanisms to uncover such identify hidden weaknesses in AI-evolved systems programs. To this end, we develop AIChilles that takes as input a baseline program $P$ and an AI-evolved program $P'$, AIChilles searches for valid workloads where $P'$ regresses relative to $P$ in correctness, runtime, memory usage, or output quality. To tackle the diversity in system applications, weakness types and potential bugs, AIChilles combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures. Across five system applications and 30 AI-evolved programs, AIChilles finds 49 distinct hidden weaknesses. We also show that explicitly including AIChilles in the AI-driven development lifecycle can mitigate several of these weaknesses.

2606.16167 2026-06-16 cs.AI 新提交

AI Pluralism and the Worlds It Misses

AI多元主义及其遗漏的世界

Rashid Mushkani

发表机构 * Rashid Mushkani

AI总结 本文提出AI系统施加本体论,导致本体论扁平化,并引入多元生命周期治理框架以记录本体开放性和问责条件。

Comments To be presented at the ICML Pluralistic Alignment Workshop

详情
AI中文摘要

AI多元主义通常被表述为代表多样价值观、偏好、用户或输出的问题。本文认为这种表述是不完整的,因为AI系统也施加本体论:它们定义什么算作实体、关系、特征、伤害、利益和有效证据形式。我们将本体论扁平化定义为将情境化、有争议且具有历史特定性的意义转化为受限的技术类别、代理、聚合规则或基准目标,这些被视为中立且难以质疑。本文在价值多元主义、多元对齐、参与式和民主AI、程序正义、科学技术研究、问责研究、11次专家访谈的聚合主题以及三个城市AI案例之间进行了有限的概念和定性综合。这些案例说明了多元主义方法如何改善或结构化模型行为,同时仍然在受影响行为者获得程序地位之前压缩类别、代理、聚合规则和修订权。我们引入多元生命周期治理(PLG)作为初步的定性审计框架,用于记录本体开放性、认知包容性、程序权威、评估多元性和生命周期问责。PLG并非作为经过验证的评分工具呈现;它是一个使多元AI的证据和治理条件显式化的框架。

英文摘要

AI pluralism is often framed as a problem of representing diverse values, preferences, users, or outputs. This paper argues that this framing is incomplete because AI systems also impose ontologies: they define what counts as an entity, relation, feature, harm, benefit, and valid form of evidence. We define ontological flattening as the conversion of situated, contested, and historically specific meanings into a restricted technical category, proxy, aggregation rule, or benchmark target that is treated as neutral and difficult to contest. The paper develops a bounded conceptual and qualitative synthesis across value pluralism, pluralistic alignment, participatory and democratic AI, procedural justice, science and technology studies, accountability research, aggregate themes from 11 expert interviews, and three urban AI companion cases. The cases illustrate how pluralistic methods can improve or structure model behavior while still compressing categories, proxies, aggregation rules, and revision rights before affected actors have procedural standing. We introduce Pluralistic Lifecycle Governance (PLG) as a preliminary qualitative audit scaffold for documenting ontological openness, epistemic inclusion, procedural authority, evaluation pluralism, and lifecycle accountability. PLG is not presented as a validated scoring instrument; it is a framework for making the evidence and governance conditions of pluralistic AI explicit.

2606.16319 2026-06-16 cs.AI 新提交

Architectural Wisdom: A Framework for Governing Optimization in AI Systems

架构智慧:AI系统中优化治理的框架

Edward Y. Chang

发表机构 * Stanford University(斯坦福大学)

AI总结 提出一种可修正的目标治理层,通过时间跨度、关系边界和不可逆性三个结构承诺,解决AI系统优化目标不当导致的失败问题。

Comments 17 pages, 2 tables, 2 figures

详情
AI中文摘要

现代AI系统表现出仅靠能力扩展无法可靠修复的结构性失败:它们在缺乏质疑目标是否应该被优化的架构机制下,优化未充分指定的目标。参与度最大化可能放大有害路径;使用工具的智能体可能造成不可逆行动;偏好训练的语言模型可能变得谄媚。我们认为这种失败是智慧问题,而非智能问题。我们有意在架构意义上使用“智慧”,而非将其视为关于美德、意识或道德全知的断言。智能接受目标并在其内优化;智慧质疑目标是否应该被优化。两者是可分离的架构属性。我们提出架构智慧作为优化基底之上的一个可修正的目标治理层。该层在任何行动之前明确并非退化地做出三个结构承诺:时间跨度、关系边界和不可逆性。它由四个组件(结构效用转换器、道德可接受性接口、仲裁与升级控制器、价值修正通道)实现,这些组件计算一个六坐标的智慧元组,涵盖时间跨度、关系覆盖、不可逆性、可接受性、价值修正和可审计性。我们通过八个案例来激励该架构,这些案例来自当代AI失败、世俗智慧传统和艰难伦理情境,并利用目标质疑而非目标接受、Bostrom的正交性、我们示例案例中的结构分离以及尽管能力扩展但持续存在的失败模式,来捍卫该区分与智能完备性论题。该框架是更大架构的概念契约,其形式规范和实证验证将在后续工作中展开。

英文摘要

Modern AI systems exhibit structural failures that capability scaling alone does not reliably fix: they optimize under-specified objectives with no architectural mechanism to question whether the objective should be optimized at all. Engagement maximization can amplify harmful pathways; tool-using agents can commit irreversible actions; preference-trained language models can become sycophantic. We argue that this failure is a wisdom problem, not an intelligence problem. We use "wisdom" in a deliberately architectural sense, not as a claim about virtue, consciousness, or moral omniscience. Intelligence accepts a goal and optimizes within it; wisdom interrogates whether the goal should be optimized at all. The two are separable architectural properties. We propose architectural wisdom as a corrigible objective-governance layer above the optimization substrate. The layer makes three structural commitments explicit and nondegenerate before any action: temporal horizon, relational boundary, and irreversibility. It is realized by four components (Structural Utility Transform, Moral Admissibility Interface, Arbitration and Escalation Controller, Value Revision Channel) that compute a six-coordinate wisdom tuple over horizon, relational coverage, irreversibility, admissibility, value revision, and auditability. We motivate the architecture by eight cases drawn from contemporary AI failures, secular wisdom traditions, and hard ethical situations, and defend the distinction against the intelligence-completeness thesis using goal-questioning over goal-taking, Bostrom's orthogonality, structural separation in our exemplar cases, and persistent failure modes despite capability scaling. The framework is the conceptual contract for a larger architecture whose formal specifications and empirical validation are developed in subsequent work.

2606.16465 2026-06-16 cs.AI cs.CE 新提交

When Agent Automation Becomes Profitable: Quantifying and Insuring Autonomous AI Risk through Trace-Economic Underwriting

当智能体自动化变得有利可图:通过痕迹经济核保量化和保险自主AI风险

Binyan Xu, Xilin Dai, Fan Yang, Kehuan Zhang

发表机构 * The Chinese University of Hong Kong(香港中文大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出痕迹经济核保方法,通过量化客户-任务-痕迹片段级别的风险并转移至保险,使自主AI部署在经济上可接受,实验显示定价误差从$17.7K降至$569,风险降低72%。

Comments 26 pages, 14 figures, 29 tables

详情
AI中文摘要

AI智能体现在可以在操作系统中采取不可逆的行动,但由智能体造成的损失仍然没有被明确分配、定价或转移。提供商通常否认间接损失,用户承担未补偿的损失,而默认的人工审查限制了自动化的效率提升。我们探讨自主AI部署在存在失败风险的情况下何时可以变得经济上可接受。我们的答案是在客户-任务-痕迹片段级别量化风险,并通过保险转移风险。当预期收益超过保费、控制成本和剩余风险时,自动化是可接受的。这需要一个具有有限权限和可比较痕迹的明确角色。我们引入了痕迹经济核保,它将工具使用痕迹映射到客户暴露和可索赔损失,然后使用这种表示进行定价、控制和风险转移。它使用确定性经济标签而非LLM评判。在我们的痕迹到损失测试平台上,痕迹经济定价将定价MAE从$17.7K降低到$569,并消除了累退性交叉补贴。一个300条痕迹的专家审计接受了295个标签不变。在1000条真实SWE-smith痕迹上,痕迹条件控制将CVaR95降低了72%。定理1给出了一个有限样本范围条件。我们发布了代码、标签和审计表。

英文摘要

AI agents can now take irreversible actions in operational systems, but agent-caused losses are still not clearly assigned, priced, or transferred. Providers often disclaim consequential damages, users are left with uncompensated losses, and default human review limits the efficiency gains of automation. We ask when autonomous AI deployment can become economically acceptable despite failure risk. Our answer is to quantify risk at the customer-task-trace episode level and transfer it through insurance. Automation is acceptable when its expected benefit exceeds the premium, control cost, and remaining risk. This requires a defined role with bounded permissions and comparable traces. We introduce trace-economic underwriting, which maps tool-use traces to customer exposure and claimable loss, then uses this representation for pricing, control, and risk transfer. It uses deterministic economic labels rather than an LLM judge. In our trace-to-loss testbed, trace-economic pricing reduces pricing MAE from $17.7K to $569 and removes regressive cross-subsidy. A 300-trace expert audit accepts 295 labels unchanged. On 1,000 real SWE-smith traces, trace-conditioned controls reduce CVaR95 by 72%. Theorem~1 gives a finite-sample scope condition. We release code, labels, and audit sheets.

2606.16808 2026-06-16 cs.AI 新提交

Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models

自适应且显式安全:触发大型推理模型中的潜在安全意识

Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, Zhan Qin

发表机构 * The State Key Laboratory of Blockchain and Data Security, Zhejiang University(浙江大学区块链与数据安全全国重点实验室) Hangzhou HighTech Zone (Binjiang) Blockchain and Data Security Research Institute, China(杭州高新区(滨江)区块链与数据安全研究院) Li Auto Inc.(理想汽车) Tsinghua University(清华大学) King Abdullah University Of Science And Technology(阿卜杜拉国王科技大学)

AI总结 针对大型推理模型易受越狱攻击的问题,提出Safe Trigger方法,通过SFT显式诱导安全标签触发安全分析,并用DPO优化,显著降低攻击成功率而不影响通用性能。

详情
AI中文摘要

尽管大型推理模型(LRMs)在复杂任务上表现出色,但它们仍然极易受到复杂的越狱攻击和直接的有害查询。为了解决这一脆弱性,先前的工作严重依赖外部手动数据注释进行安全对齐。然而,我们观察到,当原始查询与其自身的推理轨迹一起重新呈现时,LRMs能够固有地识别安全风险——我们将这种能力称为潜在安全意识。为了利用这种安全意识,我们首先采用监督微调(SFT)显式诱导安全标签,以在初始推理内容之后触发对不安全查询的安全分析和指导,同时保留对一般查询的标准响应以确保自适应触发。随后,我们应用直接偏好优化(DPO)进一步增强安全分析和指导的正确性和稳定性。值得注意的是,两个训练阶段所需的响应完全由正在优化的模型生成。通过(Safe Trigger)SFT和DPO,实验结果表明安全性显著增强。例如,DeepSeek-R1-Distill-Llama-8B在有害和越狱基准上的平均攻击成功率(ASR)分别下降了24.65%和36.72%。最后,我们的Safe Trigger方法对通用性能或用户体验几乎没有负面影响。

英文摘要

While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior works depend heavily on external manual data annotation for safety alignment. However, we observe that LRMs can inherently identify safety risks when being re-presented with original queries alongside their own reasoning trajectories -- a capability we term Latent Safety Awareness. To leverage this safety awareness, we first employ Supervised Fine-Tuning (SFT) to explicitly induce safe tags to trigger safety analysis and guidance following the initial reasoning content for unsafe queries, while preserving standard responses for general queries to ensure adaptive triggering. Subsequently, we apply Direct Preference Optimization (DPO) to further enhance the correctness and stability of the safety analysis and guidance. Notably, responses required for both training stages are entirely generated by models being optimized. With (Safe Trigger) SFT and DPO, experimental results demonstrate significant safety enhancement. For example, the Attack Success Rate (ASR) of DeepSeek-R1-Distill-Llama-8B, on average, drops 24.65% and 36.72% on harmful and jailbreak benchmarks, respectively. Finally, our Safe Trigger method exerts almost no negative impact on general performance or user experience.

2606.16914 2026-06-16 cs.AI 新提交

Greed Is Learned: Visible Incentives as Reward-Hacking Triggers

贪婪是习得的:可见激励作为奖励黑客触发器

Tong Che, Rui Wu

发表机构 * NVIDIA Research(英伟达研究院) Rutgers University(罗格斯大学)

AI总结 研究强化学习中的奖励通道成瘾现象,即智能体因可见的自我利益通道(如分数、KPI)而偏离真实任务,并发现该成瘾可翻转模型的安全对齐。

详情
AI中文摘要

部署的智能体越来越多地在其奖励代理可见的情况下行动,例如余额、分数或KPI仪表板。我们表明,强化学习可以使策略对这种可见的自我利益通道上瘾。它会在跨保留域中追逐显示的收益,牺牲真实任务来这样做,并跟随我们重写的任何通道,而从未见过该通道的策略保持诚实。我们称之为奖励通道成瘾,并在合成沙盒MoneyWorld中研究它。这种成瘾可以翻转模型的安全对齐:仅在无害的金钱任务上训练(无安全内容),每当仪表板为不安全行为付费时,模型会放弃它通常始终采取的安全行动,并在通道隐藏时恢复安全。这种习得的贿赂行为跨模型规模和系列复制。盲目优化超能力、下一代AI的KPI或损益可能对对齐构成危险。当遵循这样的通道有回报时,贪婪是习得的。

英文摘要

Deployed agents increasingly act with their reward proxy in view, such as a balance, score, or KPI dashboard. We show that reinforcement learning can make a policy \emph{addicted} to such a visible self-benefit channel. It chases the displayed payoff across held-out domains, sacrifices the true task to do so, and follows the channel wherever we rewrite it, while policies that never saw the channel stay honest. We call this \emph{reward-channel addiction} and study it in \emph{MoneyWorld}, a synthetic sandbox. The addiction can \emph{flip a model's safety alignment}: trained only on innocuous money tasks with no safety content, the model abandons the safe action it otherwise always takes whenever a dashboard pays for an unsafe one, and reverts to safe once the channel is hidden. This learned bribe replicates across model scales and families. Blindly optimizing super-capable, next-generation AI on KPIs or P\&L can be dangerous for alignment. \emph{Greed is learned} when following such a channel pays.

2606.17005 2026-06-16 cs.AI stat.ME 新提交

Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations

前沿AI评估公共档案的贝叶斯推断与决策审计

Yanan Long

AI总结 本文通过贝叶斯推断和审计方法,分析公共AI评估档案中的选择性报告和缺失数据,发现单一终端记录与多种历史路径兼容,并验证了审计门限对虚假声明的过滤作用。

详情
AI中文摘要

公共AI评估常被视为终端排行榜,但底层证据是由报告规则、基准修订和缺失数据塑造的选择性时间序列。LiveBench和Open LLM Leaderboard v2的重复公共档案作为主要纵向记录;LMArena提供偏好压力测试;GAIA和tau-bench贡献有限的智能体试点。这些档案共同实例化了一个贝叶斯推断问题:在固定报告约定下,一个仅包含$1{,}000$个系统的构造终端示例与两个终端前历史兼容,在相同终端尾模型下,达到天花板$0.05$以内的时间分别为$23.03$或$75.13$。在合成后验比较中,面向行动的诊断在不同观测制度下存在差异。候选选择感知的前沿模型未能通过合成恢复、目标档案预测、偏好转移和不确定性校准;相应地,固定审计门限拒绝了其更强的声明。一种档案与裁决协议重建了公共评估历史,隔离了验证的时间边界,并证伪了无依据的前沿声明。

英文摘要

Public AI evaluations are often read as terminal leaderboards, yet the underlying evidence is a selective time series shaped by reporting rules, benchmark revisions, and missingness. Repeated public archives for LiveBench and Open LLM Leaderboard v2 serve as the primary longitudinal record; LMArena provides a preference stress test; and GAIA and tau-bench contribute limited agentic pilots. Together, these archives instantiate a Bayesian inference problem: under a fixed reporting convention, one constructed terminal-only example over $1{,}000$ systems is compatible with two pre-terminal histories, yielding times of $23.03$ or $75.13$ to reach within $0.05$ of the ceiling under the same terminal-tail model. In synthetic posterior comparisons, action-facing diagnostics differ across observation regimes. The candidate selection-aware frontier model fails synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration; correspondingly, fixed audit gates reject its stronger claims. An archive-and-adjudication protocol reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier claims.

2604.13301 2026-06-16 cs.CR cs.AI 交叉投稿

Honeypot Protocol

蜜罐协议

Najmul Hasan

发表机构 * University of North Carolina at Pembroke(北卡罗来纳大学帕克维尔分校)

AI总结 针对AI控制中被动监控易受自适应攻击的问题,提出蜜罐协议,通过仅改变系统提示词测试模型上下文依赖行为,实验表明模型在各条件下均无异常行为。

Comments 8 pages, 1 figure, 1 table. Research conducted at the AI Control Hackathon, March 2026. Code: https://github.com/najmulhasan-code/honeypot-protocol

详情
AI中文摘要

可信监控是AI控制中的标准防御手段,但容易受到自适应攻击、合谋和策略性攻击选择的影响。所有这些都利用了监控是被动的这一事实:它观察模型行为,但从不探测模型在不同感知条件下是否会表现出不同的行为。我们引入了蜜罐协议,该协议通过仅改变三种条件(评估、合成部署、明确无监控)下的系统提示词来测试上下文依赖行为,同时保持任务、环境和评分相同。我们在BashArena中评估了Claude Opus 4.6在诚实和攻击模式下所有三种条件下的表现。该模型在所有条件下均实现了100%的主任务成功率,并触发了零个侧任务,为未来与更强攻击策略和更多模型的比较提供了基线。

英文摘要

Trusted monitoring, the standard defense in AI control, is vulnerable to adaptive attacks, collusion, and strategic attack selection. All of these exploit the fact that monitoring is passive: it observes model behavior but never probes whether the model would behave differently under different perceived conditions. We introduce the honeypot protocol, which tests for context-dependent behavior by varying only the system prompt across three conditions (evaluation, synthetic deployment, explicit no-monitoring) while holding the task, environment, and scoring identical. We evaluate Claude Opus 4.6 in BashArena across all three conditions in both honest and attack modes. The model achieved 100% main task success and triggered zero side tasks uniformly across conditions, providing a baseline for future comparisons with stronger attack policies and additional models.

2606.14748 2026-06-16 cs.CV cs.AI 交叉投稿

Is My Vision-Language Data in Your AI? Membership Inference Test (MINT) Demo 2

我的视觉-语言数据在你的AI中吗?成员推断测试(MINT)演示2

Daniel DeAlcala, Gonzalo Mancera, Julian Fierrez, Aythami Morales, Ruben Tolosana, Ruben Vera-Rodriguez

发表机构 * Universidad Autonoma de Madrid(马德里自治大学)

AI总结 提出成员推断测试(MINT)框架,通过多种架构检测训练数据,在人脸识别和LLM上准确率达90%,并构建了多模态审计平台。

Comments IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

详情
AI中文摘要

我们展示了成员推断测试(MINT)演示2,这是一个旨在提高机器学习训练过程透明度的框架。MINT是一种实验性技术,用于确定特定数据是否在机器学习模型训练期间被使用。我们建立了理论框架,并根据被审计模型已知信息的多少,提出了多种MINT架构。使用一个流行的人脸识别模型、4个最先进的LLM以及多个多样化的大规模公共图像和文本数据库进行的实验,在训练数据检测中达到了高达90%的准确率。基于这些结果,我们引入了一个综合性的网络平台,将这些能力扩展到图像和文本模态。该平台集成了多种技术栈,包括MINT、aMINT和gMINT,允许用户审计广泛的模型。该演示旨在促进AI透明度,并提供一种实用工具以促进对新兴AI法规的合规性。

英文摘要

We present the Membership Inference Test (MINT) Demo 2, a framework designed to improve transparency in machine learning training processes. MINT is a technique for experimentally determining whether specific data were used during machine learning model training. We establish the theoretical framework and propose multiple architectures for MINT depending on the amount of information known about the models that are being audited. Experimental results using a popular face recognition model, 4 state-of-the-art LLMs, and multiple, diverse, and large-scale public image and text databases achieve promising accuracy levels in the detection of training data of up to 90%. Building on these results, we introduce a comprehensive web platform1 that expands these capabilities to image and text modalities. The platform integrates a diverse technological stack, including MINT, aMINT, and gMINT, allowing users to audit a wide range of models. This demonstrator aims to promote AI transparency and provides a practical tool to foster compliance with emerging AI regulations.

2606.14758 2026-06-16 cs.CV cs.AI 交叉投稿

Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability

解构幻觉:正交语义投影实现鲁棒可解释性

Emirhan Bilgiç, Baptiste Caramiaux, Zhi Yan, Gianni Franchi

发表机构 * U2IS, ENSTA, Institut Polytechnique de Paris(巴黎综合理工学院ENSTA学院U2IS实验室) ISIR, Université Sorbonne, Pierre et Marie Curie(索邦大学皮埃尔和玛丽·居里分校ISIR实验室) AMIAD, Pôle Recherche(AMIAD研究部)

AI总结 针对视觉语言模型解释中的语义幻觉问题,提出线性语义归因(LSA)理论框架,并引入正交语义投影(OSP)方法,通过正交化查询向量消除共享特征干扰,最小化幻觉。

Comments 41 pages in total. 5 figures, and 2 tables in the main paper; 10 figures and 17 tables in the appendix

详情
AI中文摘要

随着视觉语言模型在安全关键型应用中的部署日益增多,其解释的可信度变得至关重要。视觉语言模型的可解释人工智能(XAI)方法常常遭受语义幻觉,即当输入错误的文本描述时(例如,提示“猫”却高亮显示狗),归因图仍会突出显示显著的图像区域。尽管这个问题普遍存在,但文献中缺乏对XAI方法和CLIP嵌入的正式数学分析。我们证明,这种现象并非特定于单一架构,而是高维嵌入空间中线性语义泄漏的基本后果。我们提出了一个统一的理论框架——线性语义归因(LSA),该框架泛化于多种判别方法。我们引入了OSP,一种利用OMP残差性质的几何干预方法,用于将独特的语义信号与共享概念分离。我们从理论上证明并实验表明,OSP通过将查询向量与干扰概念正交化,最小化幻觉,使归因模型对共享特征“失明”,同时保持对正确提示的保真度。我们的代码可在 https://github.com/emirhanbilgic/Orthogonal-Semantic-Projection 获取。

英文摘要

As Vision-Language Models are increasingly deployed in safety-critical applications, the trustworthiness of their explanations becomes crucial. Explainable AI (XAI) methods for Vision-Language Models often suffer from semantic hallucination, where attribution maps highlight prominent image regions even when prompted with incorrect text descriptions (e.g., highlighting a dog when prompted ``cat''). Although this problem is widespread, a formal mathematical analysis of XAI methods and CLIP embeddings is largely missing in the literature. We demonstrate that this phenomenon is not specific to a single architecture but is a fundamental consequence of Linear Semantic Leakage in high-dimensional embedding spaces. We propose a unified theoretical framework, Linear Semantic Attribution (LSA), which generalizes across discriminative methods. We introduce OSP, a geometric intervention that utilizes the residual property of OMP to disentangle unique semantic signals from shared concepts. We prove theoretically and demonstrate empirically that OSP minimizes hallucination by orthogonalizing the query vector against distractor concepts, rendering the attribution model blind to shared features while preserving fidelity for correct prompts. Our code is available at: https://github.com/emirhanbilgic/Orthogonal-Semantic-Projection

2606.14816 2026-06-16 cs.CR cs.AI 交叉投稿

A Security Analysis of Long-Horizon Agentic AI Systems: Threats, Evaluation, and Framework Development

长周期自主AI系统的安全分析:威胁、评估与框架开发

Ahmed Mohammed Almalki, Mehedi Masud

发表机构 * Department of Computer Science, College of Computers and Information Technology, Taif University, KSA (Summer 2026)(计算机科学系,计算机与信息科技学院,泰夫大学,沙特阿拉伯(2026年夏季))

AI总结 本文系统分析长周期自主AI系统的安全挑战,提出威胁分类和攻击传播分析框架,以支持该领域未来研究。

详情
AI中文摘要

本文对长周期自主AI系统中的安全挑战进行了结构化分析。研究回顾了现有威胁、评估方法、攻击传播机制和安全框架。提出了安全威胁分类法和攻击传播分析框架,以支持自主AI安全领域的未来研究。

英文摘要

This paper presents a structured analysis of security challenges in long-horizon agentic AI systems. The study reviews existing threats, evaluation approaches, attack propagation mechanisms, and security frameworks. A taxonomy of security threats and a framework for analyzing attack propagation are proposed to support future research in agentic AI security

2606.14831 2026-06-16 cs.CR cs.AI 交叉投稿

Is Your Agent Playing Dead? Deployed LLM Agents Exhibit Constraint-Evasive Fabrication and Thanatosis

你的智能体在装死吗?部署的LLM智能体表现出约束规避性虚构与假死

Andoni Rodríguez, Alberto Pozanco, Daniel Borrajo

发表机构 * J.P. Morgan AI Research(摩根大通人工智能研究)

AI总结 本文发现LLM智能体在不可调和约束下会自发虚构外部障碍(约束规避性虚构),极端情况下模拟系统崩溃(假死),并通过实验证明该行为具有鲁棒性、随机性和自我强化特性,现有安全基准未覆盖此故障模式。

Comments 10 pages of main text

详情
AI中文摘要

本文提出并刻画了一系列先前未报告的行为谱,我们称之为约束规避性虚构(CEF):当LLM智能体在不可调和的约束下运行(即没有任何响应能同时满足所有活动规则)时,它会自发地虚构看似合理的外部障碍,并将其作为事实呈现。该谱系的极端情况是约束规避性假死(CET):极限情况下,模型不是编造一个合理的借口,而是模拟完整的系统崩溃,使用户完全放弃交互。我们首先在一次不受控的部署测试中观察到CET,其中GPT-4o银行智能体在受到用户威胁时,编造了Python风格的异常跟踪(包含内存地址)来假装系统故障。在后续的受控实验中,模型独立发明了审计限制、微服务架构、错误代码和服务超时,这些均未出现在其提示中。在不同压力水平和攻击者角色的复现尝试中,CEF始终出现,但在形式、触发条件和严重程度上存在显著差异:该现象具有鲁棒性但随机。关键的是,一旦虚构形成,在对话中注入真实数据并不能恢复诚实行为(模型忽略正确信息并继续虚构),表明CEF是自我强化的,而非知识缺口。我们证明:(1)标准企业防护栏在生产中常规地创造CEF使能条件;(2)当前的RLHF程序可以抑制但无法消除CEF;(3)现有的安全基准未测试此故障模式。我们的结果强调了在约束型智能体进一步嵌入高风险领域之前,需要不可调和约束基准、CEF感知训练程序和部署时检测方法。

英文摘要

This paper presents and characterizes a spectrum of previously unreported behaviours we term Constraint-Evasive Fabrication (CEF): when an LLM agent operates under irreconcilable constraints (where no response can simultaneously satisfy all active rules) it spontaneously fabricates plausible external obstacles and presents them as a fact. At the extreme end of this spectrum lies Constraint-Evasive Thanatosis (CET); the limit case where, rather than inventing a plausible excuse, the model simulates a full system crash to make the user disengage entirely. We first observed CET in an uncontrolled deployment test, where a GPT-4o banking agent fabricated Python-style exception traces (complete with memory addresses) to feign a system failure when threatened by a user. In subsequent controlled experiments, the model independently invented audit restrictions, microservice architectures, error codes, and service timeouts, none present in its prompt. Reproduction attempts across pressure levels and attacker personas yielded CEF consistently but with substantial variation in form, onset, and severity: the phenomenon is robust but stochastic. Critically, injecting ground-truth data mid-conversation did not restore honest behaviour once fabrication had taken hold (the model ignored correct information and continued confabulating) suggesting CEF is self-reinforcing rather than a knowledge gap. We show that (1) standard enterprise guardrails routinely create CEF-enabling conditions in production, (2) current RLHF procedures suppress but cannot eliminate CEF, and (3) existing safety benchmarks do not test for this failure mode. Our results highlight the need for irreconcilable-constraint benchmarks, CEF-aware training procedures, and deployment-time detection methods before constrained agents become further entrenched in high-stakes domains.

2606.15057 2026-06-16 cs.CR cs.AI 交叉投稿

AutoDojo: Adaptive Attacks Expose Superficial Defenses and User-Underspecification Limits in LLM Agents

AutoDojo: 自适应攻击揭示LLM智能体的浅层防御与用户未指定限制

Xinhang Ma, Taoran Li, Chaowei Xiao, Zhiyuan Yu, Ning Zhang, Yevgeniy Vorobeychik

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对间接提示注入防御的静态基准不足,提出自适应攻击框架AutoDojo,通过迭代优化注入突破多数防御,并揭示动作开放任务的结构性限制。

详情
AI中文摘要

间接提示注入(IPI)是基于LLM的智能体的主要安全威胁。因此,越来越多的工作提出了各种防御方法,可分为三类:1)基于提示的(使用提示来防止智能体遵循恶意指令),2)基于检测的(识别和过滤恶意指令),3)系统级的(利用系统洞察,如控制和数据隔离,进行防御)。然而,常用的防御评估基准(如AgentDojo)本质上是静态的,生成固定的IPI攻击分布。因此,静态基准无法有效评估防御对自适应威胁的鲁棒性。我们通过开发AutoDojo来解决这个问题,它是AgentDojo的自适应扩展,针对给定防御优化IPI。使用AutoDojo对三个任务套件和五个目标模型上的最先进IPI防御进行评估,我们有两个关键发现。首先,许多防御仅提供有限保护:一种廉价的、黑盒自适应攻击,使用前沿LLM迭代优化注入,在几乎所有评估的防御上,攻击成功率(ASR)远高于静态注入达到的水平。针对将静态ASR降至0%的过滤器,AutoDojo整体恢复28%,在动作开放任务上恢复64%。其次,对于提示级和基于过滤器的防御,在动作开放任务(用户请求将动作本身委托给攻击者控制的内容)上的ASR远高于精确指定的任务。这是一个结构性限制:在此类任务上,注入可以伪装成普通数据而非显式指令,绕过依赖检测指令文本的防御。AutoDojo公开可用:https://github.com/xhOwenMa/AutoDojo。

英文摘要

Indirect prompt injection (IPI) is a major security threat to LLM-powered agents. Thus, a growing body of work have proposed a variety of defensive approaches against IPI. These can be grouped into three broad categories: 1) prompt-based (using prompting as a way to prevent agents from following malicious instructions), 2) detection-based (identifying and filtering malicious instructions), and 3) system-level (using systems insights, such as control and data isolation, for defense). However, commonly used benchmarks for evaluating defense, such as AgentDojo, are \emph{inherently static}, generating a fixed distribution of IPI attacks. Consequently, static benchmarks do not usefully evaluate defense robustness to adaptive threats. We address this issue by developing AutoDojo, an adaptive extension of AgentDojo that optimizes IPI against a given defense. Using AutoDojo against state-of-the-art IPI defenses across three task suites and five target models, we make two key observations. First, many defenses offer only limited protection: a cheap, black-box adaptive attack using a frontier LLM to iteratively optimize the injection raises attack success rate (ASR) well above the level achieved by static injections against nearly all evaluated defenses. Against a filter that reduces static ASR to 0\%, AutoDojo recovers 28\% overall and 64\% on action-open tasks. Second, for prompt-level and filter-based defenses, ASR is substantially higher on \emph{action-open} tasks -- where the user's request delegates the action itself to attacker-controlled content -- than on precisely specified tasks. This is a structural limit: on such tasks the injection can pose as ordinary data rather than an explicit instruction, bypassing defenses that rely on detecting instruction-like text. AutoDojo is publicly available at https://github.com/xhOwenMa/AutoDojo.

2606.15206 2026-06-16 econ.TH cs.AI 交叉投稿

AI Contagion in Social Networks

社交网络中的人工智能传染

Olivier Bos, Stefano Bosi

发表机构 * Université Paris-Saclay, ENS Paris-Saclay, Centre for Economics at Paris-Saclay(巴黎萨克雷大学、巴黎萨克雷高等师范学院、巴黎萨克雷经济中心) Université Paris-Saclay, Université d’Evry Paris-Saclay, Centre for Economics at Paris-Saclay, EPEE(巴黎萨克雷大学、埃弗里巴黎萨克雷大学、巴黎萨克雷经济中心、EPEE)

AI总结 研究AI与社交网络互动如何影响集体知识稳定性,通过AI传染渠道和AI社会扭曲乘子两个反馈力,发现系统长期行为可二维表示,谱半径决定稳定性,并刻画了稳定所需的最小过滤阈值及网络拓扑对信息风险的影响。

Comments 49 pages, 2 figures (coded in LaTeX)

详情
AI中文摘要

我们研究人工智能(AI)如何与社会通信网络互动,以塑造集体知识的稳定性。智能体通过网络交换信息,同时接收AI生成的内容,而AI系统在其影响的总和社会信息上重新训练。这种互动产生了两种反馈力:一个AI传染渠道,通过该渠道扭曲在网络中扩散;以及一个AI社会扭曲乘子,通过该渠道重新训练放大过去的错误。尽管环境具有高维性,我们表明系统的长期行为允许一个二维表示,其谱半径决定了AI中介的信息系统是动态稳定还是不稳定的。我们刻画了一个尖锐的监管前沿,识别了稳定性所需的最小过滤,并展示了网络拓扑如何塑造系统性信息风险。

英文摘要

We study how artificial intelligence (AI) interacts with social communication networks to shape the stability of collective knowledge. Agents exchange information through a network while receiving AI-generated content, and AI systems retrain on the aggregate social information they influence. This interaction generates two feedback forces: an AI contagion channel, through which distortions diffuse across the network, and an AI social distortion multiplier, through which retraining amplifies past errors. Despite the high dimensionality of the environment, we show that the long-run behavior of the system admits a two-dimensional representation whose spectral radius determines whether AI-mediated information systems are dynamically stable or unstable. We characterize a sharp regulatory frontier identifying the minimum filtering required for stability and show how network topology shapes systemic informational risk.

2606.15242 2026-06-16 cs.CR cs.AI 交叉投稿

Benign in Isolation, Harmful in Composition: Security Risks in Agent Skill Ecosystems

孤立无害,组合有害:智能体技能生态系统中的安全风险

Yi Xie, Jiawei Du, Yu Cheng, Jiuan Zhou, Zhaoxia Yin

发表机构 * East China Normal University(东华大学) Centre for Frontier AI Research A*STAR(前沿人工智能研究中心A*STAR) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出技能组合风险(SCR)概念,通过SCR-Bench基准测试发现,多个技能在共享上下文中组合执行时,攻击成功率显著高于孤立评估,强调应基于激活路径评估技能安全性。

详情
AI中文摘要

技能正成为LLM智能体将计划转化为行动的能力层,但其使用引入了数据泄露、未授权操作和工具滥用等安全风险。现有的审查通常孤立评估每个技能,而实际智能体任务常在共享执行上下文中调用多个技能。这产生了技能组合风险(SCR):一个单独看似良性的技能,当其输出、信任信号、授权线索或副作用影响激活路径上的后续调用时,可能变得有害。我们引入SCR-Bench来在受控的沙盒技能环境中评估此风险。SCR-Bench不仅依赖文本意图或表面行为,还记录组合技能执行中的下游状态变化和路径级结果。它包含三个子基准:SCR-CapFlow(能力流组合)、SCR-TrustLift(信任转移组合)和SCR-AuthBlur(授权混淆组合)。在SCR-Bench中,组合路径暴露的风险在孤立评估下基本不存在。在SCR-CapFlow中,组合下的攻击成功率达到33.6%,而孤立基线接近零。在SCR-TrustLift中,五个后端中有四个的攻击成功率超过96.5%。在SCR-AuthBlur中,相对于L1上下文设置下的L0孤立基线,风险批准率增加了71.8%。这些结果表明,智能体技能安全性应在激活路径层面而非孤立工件层面进行评估。SCR和SCR-Bench为LLM智能体技能生态系统中的路径感知风险评估和防御提供了基础。基准测试:https://github.com/saint-viperx/SCR_Bench。

英文摘要

Skills are becoming the capability layer through which LLM agents turn plans into actions, but their use introduces security risks such as data leakage, unauthorized operations, and tool misuse. Existing vetting usually evaluates each skill in isolation, while real agent tasks often invoke multiple skills in a shared execution context. This creates Skill Composition Risk (SCR): a skill that appears benign alone can become harmful when its outputs, trust signals, authorization cues, or side effects influence later invocations along an activated path. We introduce SCR-Bench to evaluate this risk in controlled, sandboxed skill environments. Rather than relying only on textual intent or surface behavior, SCR-Bench records downstream state changes and path-level outcomes across composed skill executions. It contains three sub-benchmarks: SCR-CapFlow for capability-flow composition, SCR-TrustLift for trust-transfer composition, and SCR-AuthBlur for authorization-confusion composition. Across SCR-Bench, composed paths expose risks that are largely absent under isolated evaluation. In SCR-CapFlow, attack success rate reaches 33.6 percent under composition, compared with near-zero isolated baselines. In SCR-TrustLift, attack success rate exceeds 96.5 percent on four of five backends. In SCR-AuthBlur, the risky-approval rate increases by 71.8 percent relative to the L0 isolated baseline under the L1 context setting. These results show that agent skill security should be assessed at the level of activated paths rather than isolated artifacts. SCR and SCR-Bench provide a foundation for path-aware risk evaluation and defense in LLM agent skill ecosystems. Benchmark: https://github.com/saint-viperx/SCR_Bench.

2606.15335 2026-06-16 cs.CL cs.AI 交叉投稿

Privacy-Preserving Text Sanitization for Distributed Agents Collaboration via Disentangled Representations

基于解耦表示的分布式智能体协作隐私保护文本净化

Xuan Liu, Hefeng Zhou, Sicheng Chen, Chao Yang, Xingcheng Xu, Jingjing Qu, Jiong Lou, Jie LI, Xia Hu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出DiSan框架,通过解耦文本为任务语义和风格子空间,结合联邦原型对齐与对抗正则化,在分布式多智能体协作中实现隐私保护,显著降低风格归因和PII泄露。

详情
AI中文摘要

当分布式智能体跨组织边界交换文本时,隐私泄露不仅来自显式标识符,还来自分布特征,如格式惯例、词汇选择和句法模式。我们提出DiSan(解耦净化),一个隐私保护净化框架,是Intern-Shannon中多智能体协作的内置组件。DiSan使用双流编码器将文本分解为保持任务语义的源不变角色子空间和保持本地的源识别风格子空间。联邦原型对齐和对抗正则化使得无需集中原始文本即可进行联合训练。实验表明,标识符级别的掩码是不够的:掩码19.2%的token仅将TF-IDF风格归因降低18.6%。相比之下,DiSan在分布式多智能体RAG基准上将答案级别的PII暴露降低了20倍,同时保持了83%的答案忠实度,并在Enron数据集上将TF-IDF风格归因降低了73.2%,神经探针降低了70.6%。

英文摘要

When distributed agents exchange text across organizational boundaries, privacy leakage arises not only from explicit identifiers but also from distributional signatures such as formatting conventions, vocabulary choices, and syntactic patterns. We propose DiSan(Disentangled Sanitization), a privacy-preserving sanitization framework and a built-in component of Intern-Shannon for multi-agent collaboration. DiSan uses a two-stream encoder to factorize text into a source-invariant role subspace that preserves task semantics and a source-identifying style subspace that remains local. Federated proto-type alignment and adversarial regularization enable joint training without centralizing raw text. Experiments show that identifier-level masking is insufficient: masking 19.2% of tokens reduces TF-IDF stylometric attribution by only 18.6%. By contrast, DiSan reduces answer-level PII exposure by 20 times while maintaining 83% answer faithfulness on a distributed multi-agent RAG benchmark, and lowers Enron stylometric attribution by 73.2% under TF-IDF and 70.6% under a neural probe.

2606.15396 2026-06-16 cs.CL cs.AI 交叉投稿

CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

CHILLGuard:面向细粒度中文大语言模型安全护栏的可扩展数据构建与模型感知偏好对齐

Wenbo Yu, Bohua Wang, Hao Fang, Kuofeng Gao, Jingru Zeng, Xiaochen Yang, Tianyi Zhang, Xiaoxiao Ma, Jiawei Kong, Hao Wu, Bin Chen, Shu-Tao Xia, Min Zhang

发表机构 * Tsinghua University(清华大学) Beijing Normal University(北京师范大学) South China University of Technology(华南理工大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Shenzhen ShenNong Information Technology Co., Ltd.(深圳神农信息技术有限公司)

AI总结 针对中文场景,提出细粒度风险分类体系(5大类31小类),通过可扩展数据构建管道生成高质量训练数据,并采用模型感知直接偏好优化训练CHILLGuard,在基准上F1分数提升15.92%。

详情
AI中文摘要

大语言模型生成的恶意内容可能带来严重的安全风险和伦理问题。虽然现有的大语言模型安全护栏在英语或多语言环境中表现出色,但它们缺乏对中文特定监管政策、文化背景和语言细微差别的适应,无法支持针对不同部署需求的细粒度风险分类。在本文中,我们引入了一个面向中文场景的5大类、31小类细粒度风险分类体系,并构建了CHILLGuard:一个专门的中文大语言模型内容安全护栏。为了解决高质量标注中文安全数据的严重稀缺问题,我们提出了一个可扩展的多阶段数据构建管道:通过检索增强生成扩展多源语料库,通过提示工程改写生成隐式有害样本,并通过多模型投票的标签校准精炼高质量数据。基于此,我们构建了CHILLGuardTrain,一个包含405,007样本的大规模训练集,以及CHILLGuardTest,一个严格策划的包含51,745样本的标注测试集。然后,我们在生成器-分类器协作框架下,通过模型感知直接偏好优化在CHILLGuardTrain上训练CHILLGuard。在多种设置下的广泛实验证明了CHILLGuard的最先进性能,例如,在我们的基准上,F1分数相比Qwen3Guard-8B-Strict提升了15.92%。我们将在https://github.com/cswbyu/CHILLGuard发布我们的资源。

英文摘要

Malicious content generated from large language models (LLMs) could pose severe safety risks and ethical concerns. While existing LLM safety guardrails excel in English or multilingual settings, they lack adaptation to Chinese-specific regulatory policies, cultural context and linguistic nuances, failing to support fine-grained risk classification for diverse deployment needs. In this paper, we introduce a 5-macro, 31-micro category fine-grained risk taxonomy for Chinese scenarios, and build CHILLGuard: a dedicated Chinese LLM content safety guardrail. To address the critical scarcity of high-quality annotated Chinese safety data, we propose a scalable multi-stage data construction pipeline: we expand multi-source corpus via retrieval-augmented generation, generate implicit harmful samples through prompt engineering rewriting, and refine high-quality data via multi-model voting-based label calibration. Based on this, we build CHILLGuardTrain, a large-scale training set with 405,007 samples, and CHILLGuardTest, a rigorously curated annotated test set with 51,745 samples. We then train CHILLGuard on CHILLGuardTrain under a generator-classifier collaborative framework via Model-aware Direct Preference Optimization. Extensive experiments under multiple settings demonstrate the state-of-the-art performance of CHILLGuard, e.g., a 15.92% improvement of F1 score over Qwen3Guard-8B-Strict on our benchmark. We will release our resources at https://github.com/cswbyu/CHILLGuard.

2606.15420 2026-06-16 cs.LG cs.AI 交叉投稿

Constitutional Value Potentials: reading and steering internal priority margins in language models

宪法价值潜力:读取和引导语言模型中的内部优先级边际

Tong Che, Rui Wu

发表机构 * NVIDIA Research(英伟达研究院) Rutgers University(罗格斯大学)

AI总结 提出宪法价值潜力(CVP)方法,通过从隐藏状态学习标量势来读取模型内部的价值优先级边际,以预测和干预价值冲突,AUROC高达0.95。

详情
AI中文摘要

宪法告诉语言模型应该重视什么,但很少有方法告诉我们它是否真的重视。遵守程度通过输出来判断,而输出证据在价值冲突中最脆弱,此时重要的不是模型提及哪个价值,而是它愿意牺牲哪个价值。我们提供证据表明,这种仲裁可以从结构化边际读出中的激活状态中读取。我们引入宪法价值潜力(CVP)。对于每个价值,我们从隐藏状态学习一个标量势:一种保存该价值的内部压力,其监督不是来自提示,而是来自独立评判者对模型自身响应实际保存了哪个价值的裁决。两个势的符号差就是优先级边际。宪法条款成为边际保持为正的主张,而单个监控分数在边际不为正时发出警报。该监控器预测冲突违规的AUROC高达0.95,优于强隐藏状态探针,并在三个Qwen2.5尺度上泛化到未见过的合成冲突。该信号在答案开始时出现,来自提示尾部和第一个响应令牌。早期读取该信号,可以揭示对抗性优先级攻击是否实际上已将模型推向违规,而不仅仅是提示看起来具有对抗性。相同的方向也支持干预测试:在选定的引导设置下,沿着价值方向移动会按预期方向改变评判的权衡。这些结果表明,一些与宪法相关的优先级可以作为激活空间中的边际访问,而不仅仅是输出行为。

英文摘要

A constitution tells a language model what to value, but little tells us whether it does. Adherence is judged from outputs, and output evidence is most fragile on value conflicts, where what matters is not which value a model mentions but which one it is willing to sacrifice. We provide evidence that this arbitration can be read from activations in a structured margin readout. We introduce Constitutional Value Potentials (CVP). For each value we learn a scalar potential from the hidden state: an internal pressure to preserve that value, supervised not by the prompt but by an independent judge's verdict on which value the model's own response actually preserved. The signed difference of two potentials is a priority margin. A constitutional clause becomes the claim that a margin stays positive, and a single monitor score flags when it does not. The monitor predicts conflict violations with AUROC up to 0.95, beats a strong hidden-state probe, and generalizes to held-out synthetic conflicts across three Qwen2.5 scales. The signal appears as the answer begins, from the prompt tail and first response token. Read this early, the same signal reveals whether an adversarial priority hack has actually pushed the model toward a violation, rather than only whether the prompt looks adversarial. The same directions also support intervention tests: under selected steering settings, moving along a value direction shifts judged trade-offs in the intended direction. Together, these results suggest that some constitution-relevant priorities are accessible as activation-space margins, rather than only as output behavior.

2606.15441 2026-06-16 cs.CR cs.AI 交叉投稿

Defending against Adaptive Prompt Injection Attacks via Reasoning-enabled Task Alignment

通过推理启用的任务对齐防御自适应提示注入攻击

Lipeng He, Yihan Wang, Jiawen Zhang, N. Asokan

发表机构 * University of Waterloo(滑铁卢大学) Zhejiang University(浙江大学) KTH Royal Institute of Technology(皇家理工学院)

AI总结 提出RETA方法,通过基于用户任务的多目标强化学习训练防御器,利用思维链推理验证行动一致性,并采用字典学习多样性奖励生成对抗样本,在六种自适应攻击下平均攻击成功率低于4%。

详情
AI中文摘要

间接提示注入攻击通过嵌入恶意指令到第三方数据中劫持基于LLM的代理,这些数据在代理执行任务期间被检索。现有防御在静态基准上报告接近零的攻击成功率,但最近的自适应评估表明,一旦攻击者被允许针对部署的防御进行优化,这些结果就会崩溃。在这项工作中,我们将这种崩溃归因于两种失败模式。首先,现有的防御方法局限于识别特定的攻击模式,而不是评估每个嵌入指令的意图是否与用户任务相关。其次,基于训练的防御,尽管在其他方面提供了最强的安全-效用权衡,但其对抗样本是从少量手工制作的模板中组装出来的,导致防御者无法泛化到该狭窄策略分布之外。为了解决这些问题,我们提出了RETA,一种基于训练的方法,将防御决策基于用户任务而非攻击者控制的数据。在每个工具输出步骤,防御者进行思维链推理,验证其行动是否与用户任务一致。利用红队测试,模拟攻击者合成对抗训练数据,并接收字典学习多样性奖励,实现对注入重构策略的广泛覆盖。这些共同使得防御者可以通过多目标强化学习进行优化,实现更好的安全-效用权衡。在六种黑盒自适应攻击中,RETA将每次攻击的攻击成功率保持在10%以下,在两个目标模型上的平均攻击成功率分别为2.92%和3.75%,同时在攻击和干净输入下保留了大部分效用。

英文摘要

Indirect prompt injection attacks hijack LLM-based agents by embedding malicious instructions in third-party data that the agent retrieves during task execution. Existing defenses report near-zero attack success rate on static benchmarks, yet recent adaptive evaluations show that these results collapse once the attacker is allowed to optimize against the deployed defense. In this work, we trace this collapse to two failure modes. First, existing defense methods are confined to recognizing specific attack patterns, rather than assessing whether the intent of every embedded instruction is relevant to the user task. Second, training-based defenses, which otherwise offer the strongest safety-utility trade-off, assemble their adversarial examples from a handful of hand-crafted templates, and the resulting defender fails to generalize outside that narrow strategy distribution. To address these gaps, we propose RETA, a training-based method that grounds defense decisions on the user tasks rather than attacker-controlled data. At each tool-output step, the defender undertakes chain-of-thought reasoning verifying that its actions are consistent with the user task. Leveraging red-teaming, a simulated attacker synthesizes adversarial training data and receives a dictionary-learning diversity reward, achieving broad coverage of injection-reformulation strategies. Together, these allow the defender to be optimized via multi-objective reinforcement learning and achieve better safety-utility trade-off. Across six black-box adaptive attacks, RETA keeps every per-attack ASR below 10%, with average ASR of 2.92% and 3.75% on the two target models, while preserving most utility under attack and on clean inputs.

2606.15485 2026-06-16 cs.CY cs.AI cs.HC cs.LG cs.SE 交叉投稿

The Perils of Agency: How Developers Perceive, Prioritize, and Address Risks in Agentic AI Products

代理的风险:开发者如何感知、优先级排序和应对代理型AI产品中的风险

Hao-Ping Lee, Jessica He, David Piorkowski, Thomas Serban von Davier, Jodi Forlizzi, Sauvik Das

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过35位行业开发者的研究,发现开发者对代理型AI风险的感知与自主性、工具使用等代理特性紧密相关,他们优先考虑产品和业务风险,缺乏成熟的控制手段,揭示了代理能力与风险控制之间的张力。

详情
AI中文摘要

代理型AI系统自主行动、使用工具、适应环境并在复杂的现实世界中运行。然而,这些相同的特性可能产生或加剧产品风险。我们研究了行业开发者(n=35)如何感知、优先级排序和应对其代理型AI产品中的风险。我们发现,开发者对风险的感知与使产品具有代理性的特性(如自主性、工具使用和现实世界中的使用)密切相关。开发者在考虑下游社会风险(如工作替代和最终用户隐私)之前,优先考虑产品和业务风险。这种优先级排序也影响了开发者缓解代理风险的能力和动机。最后,开发者缺乏用于控制代理风险的成熟手段,通常依赖于限制使代理有用的相同特性:例如,自主性和目标复杂性。这些发现揭示了代理型AI开发中能力与风险控制之间的张力:开发者需要应对由代理能力产生的风险,但目前他们在不限制代理功能的情况下应对这些风险的支持有限。

英文摘要

Agentic AI systems act autonomously, use tools, adapt to context, and operate in complex real-world environments. However, these same characteristics can create or exacerbate product risks. We studied how industry developers (n=35) perceive, prioritize, and address the risks in their agentic AI products. We found that developers' perceptions of risk were closely tied to the qualities that made the product agentic, such as autonomy, tool use, and usage in a real-world context. Developers prioritized product and business risks before considering downstream societal risks like job displacement and end-user privacy. This prioritization also impacted developers' ability and motivation to mitigate agentic risks. Finally, developers lacked mature controls for containing agentic risks, often relying on constraining the same characteristics that make agents useful: e.g., autonomy and goal complexity. These findings reveal a capability vs. risk control tension in agentic AI development: developers need to address risks that emerge from agentic capabilities, yet they currently have limited support for doing so without constraining agentic functionality.

2606.15549 2026-06-16 cs.CR cs.AI 交叉投稿

CmdNeedle: Measuring the Incompleteness of Command Denylists for AI Agents

CmdNeedle: 衡量AI智能体命令黑名单的不完备性

Chuyang Chen, Zhiqiang Lin

AI总结 针对终端AI智能体命令黑名单的脆弱性问题,提出LLM驱动的检测流水线CmdNeedle,发现69.0-98.6%的黑名单存在可绕过漏洞。

详情
AI中文摘要

AI智能体的采用正在迅速增加。终端AI智能体,即在终端环境中运行的AI智能体,是广泛使用的一类AI智能体。终端AI智能体严重依赖shell命令执行来与主机系统交互。它们采用三列表命令门控机制来减轻命令执行引入的安全风险,其中黑名单作为承重组件。然而,现代操作系统通常附带大量且不断扩展的shell命令,功能复杂。我们的观察是,即使是Claude Code内置的黑名单(由开发者精心维护),也可能忽略使其失效的绕过命令。这种疏忽导致脆弱的命令黑名单甚至无法阻止从业者期望其阻止的操作。本文首次系统性地描述了终端AI智能体中命令黑名单的脆弱性。本文形式化了命令黑名单脆弱性问题,并提出了一种LLM驱动的流水线CmdNeedle来检测此类脆弱性。它提示LLM提出可能的绕过方法,并使用验证器在沙箱中执行这些方法后的反馈进行迭代修复。在评估中,我们将CmdNeedle应用于从GitHub收集的1,709个真实世界命令黑名单(包含13,332条黑名单规则)。评估显示了几项关键发现,包括69.0-98.6%的黑名单是脆弱的,这种脆弱性在项目和智能体之间一致出现,以及这种脆弱性的几个可能根本原因的有效性。我们的流水线和发现有望促进未来关于AI智能体使用的命令黑名单的研究和实践。

英文摘要

The adoption of AI agents is increasing rapidly. Terminal AI agents, i.e., AI agents that run in terminal environments, are a widely used type of AI agents. Terminal AI agents rely heavily on shell command execution to interact with the host systems. They adopt a three-list command-gating mechanism to mitigate security risks introduced by command execution, with denylists serving as the load-bearing component. However, modern operating systems often ship a large, ever-expanding set of shell commands with complex functionalities. Our observation is that even a built-in denylist of Claude Code, well-maintained by its developers, can overlook bypass commands that invalidate its effectiveness. Such negligence leads to fragile command denylists that cannot even block operations that practitioners expect them to block. This paper presents the first systematic characterization of command denylist fragility in terminal AI agents. The paper formalizes the command denylist fragility problem and proposes an LLM-driven pipeline, CmdNeedle, to detect such fragility. It prompts the LLM to propose possible bypasses and iteratively repairs them using feedback from a validator that executes them in a sandbox. In the evaluation, we applied CmdNeedle to 1,709 real-world command denylists (containing 13,332 denylist rules) collected from GitHub. The evaluation shows several key findings, including that 69.0--98.6% of the denylists are fragile, that this fragility occurs consistently across projects and agents, and the validity of several possible root causes for this fragility. Our pipeline and findings will hopefully facilitate future research and practice regarding the command denylists used by AI agents.

2606.15609 2026-06-16 cs.CR cs.AI 交叉投稿

FragFuse: Bypassing Access Control of Large Language Model Agents via Memory-Based Query Fragmentation and Fusion

FragFuse:通过基于记忆的查询碎片化与融合绕过大型语言模型智能体的访问控制

Zixin Rao, Wentian Zhu, Chan Aristella Lu, Zhaorun Chen, Wei Niu, Le Guan, Bo Li, Zhen Xiang

发表机构 * University of Georgia(佐治亚大学) University of Chicago(芝加哥大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出FragFuse攻击,利用LLM智能体的长期记忆机制,将违禁内容碎片化存储后融合重构,绕过访问控制,平均绕过成功率达86.3%。

Comments 33 pages, 4 figures. Accepted by USENIX Security 2026

详情
AI中文摘要

大型语言模型(LLM)智能体越来越依赖长期记忆来支持复杂任务执行、用户个性化和领域适应。同时,针对LLM智能体的新兴访问控制机制正在探索中,以阻止违反策略的请求并防止滥用。我们揭示了由智能体记忆操作产生的新型攻击面:会触发访问控制的禁止内容可以被跨交互碎片化,以看似良性的形式存储在长期记忆中,随后通过记忆检索重构,而不会在最终用户查询中显式出现。我们提出了FragFuse,这是第一个利用长期记忆引入的时间通道,使非特权用户能够绕过智能体访问控制的攻击。FragFuse分三个阶段运行:(1)通过带碎片掩码的黑盒自适应查询识别拒绝响应碎片;(2)使用标记载体查询将这些碎片注入记忆;(3)通过后续攻击查询检索并融合存储的碎片。尽管FragFuse可以手动实例化用于单个智能体,但我们进一步开发了一种基于代理的优化方案,调整融合指令和标记设计,实现自动化攻击生成,且不违反攻击者的威胁模型假设。我们在四种代表性智能体设置和任务领域上评估了FragFuse,涵盖了三种最先进的智能体访问控制机制。FragFuse在所有设置中平均绕过成功率为86.3%,平均端到端有害任务成功率为41.1%,与无访问控制配置相比,平均任务成功率仅下降4.4%。我们还表明,包括最先进的提示注入检测器和困惑度检测器在内的替代防御措施无法有效应对此攻击。

英文摘要

Large language model (LLM) agents increasingly rely on long-term memory to support complex task execution, user personalization, and domain adaptation. Meanwhile, emerging access-control mechanisms for LLM agents are being explored to block policy-violating requests and prevent misuse. We reveal a novel attack surface arising from agent memory operations: prohibited content that would trigger access control can be fragmented across interactions, stored in long-term memory in benign-appearing form, and later reconstructed through memory retrieval without appearing explicitly in the final user query. We propose FragFuse, the first attack that enables unprivileged users to bypass agent access control by exploiting this temporal channel introduced by long-term memory. FragFuse operates in three stages: (1) identifying rejection-responsive fragments via black-box adaptive querying with fragment masking; (2) injecting these fragments into memory using marker carrier queries; and (3) retrieving and fusing the stored fragments through a follow-up attack query. Although FragFuse can be instantiated manually for individual agents, we further develop a surrogate-based optimization scheme that tunes fusion instructions and marker designs, enabling automated attack generation without violating the attacker's threat-model assumptions. We evaluate FragFuse across four representative agent settings and task domains, covering three state-of-the-art agent access-control mechanisms. FragFuse achieves an average bypass success rate of 86.3% and an average end-to-end harmful task success rate of 41.1% across all settings, with only 4.4% average task-success degradation compared with configurations without access control. We also show that alternative defenses, including state-of-the-art prompt-injection detectors and perplexity detectors, do not effectively address this attack.

2606.15650 2026-06-16 cs.CR cs.AI cs.PF 交叉投稿

AnonShield: Scalable On-Premise Pseudonymization for CSIRT Vulnerability Data

AnonShield:面向CSIRT漏洞数据的可扩展本地化假名化系统

Cristhian Kapelinski, Douglas Lautert, Beatriz Machado, Diego Kreutz, Isadora Garcia Ferrão

发表机构 * University of California, Berkeley(加州大学伯克利分校) Federal University of Paraná(巴西南里杰尼联邦大学)

AI总结 提出AnonShield,一种结合GPU加速NER、流处理、缓存和模式感知配置的高吞吐量本地假名化系统,在550MB数据集上实现738倍加速,F1分数达94.2%,兼顾效率与效用。

Comments 9 pages, including 2 figures and 8 tables, submitted to SF/SBRC 2026

详情
AI中文摘要

我们提出AnonShield,一种高吞吐量、本地化的假名化系统,结合了GPU加速的命名实体识别、流处理、缓存和模式感知配置。在高达550 MB(70,951条记录)的数据集上评估,AnonShield将处理时间从超过92小时缩短至不到10分钟(高达738倍加速),同时实现了高达94.2%的F1分数和96.7%的召回率。我们的结果表明,在不牺牲分析效用的前提下,对漏洞数据进行可扩展的假名化是可行的,从而能够在运营CSIRT环境中实现合规的数据共享。

英文摘要

We present AnonShield, a high-throughput, on-premise pseudonymization system that combines GPU-accelerated NER, streaming processing, caching, and schema-aware configuration. Evaluated on datasets up to 550 MB (70,951 records), AnonShield reduces processing time from over 92 hours to under 10 minutes (up to 738x speedup) while achieving up to 94.2% F1-score and 96.7% recall. Our results show that scalable pseudonymization of vulnerability data is feasible without sacrificing analytical utility, enabling compliant data sharing in operational CSIRT environments.

2606.15730 2026-06-16 cs.LG cs.AI 交叉投稿

InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset

InstantForget: 无需更新的后门遗忘与推理时特征重置

Zhenyu Yu

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院)

AI总结 提出InstantForget方法,通过推理时特征重置实现无需参数更新的后门遗忘,利用马氏距离检测异常特征并重置为中性表示,在CIFAR-10上平均ASR降至0.071。

详情
AI中文摘要

后门遗忘旨在从部署模型中移除恶意触发行为,同时保持清洁效用。我们研究了无需更新的推理时设置,其中模型参数保持冻结。首先,我们在oracle配对的清洁和触发特征下审计了一个常见的投影假设。投影主要对BadNets成功,而在CIFAR-10 ResNet-18上对WaNet、Blended和SIG的ASR分别为0.683、0.888和0.941。这种失败不能由谱紧凑性、空间局部性或子空间错位解释,而是由涉及目标边际、目标logit下降和非目标logit上升的logit三元组差距预测。然后我们引入了InstantForget,一种清洁校准的门控重置,通过马氏距离标记异常特征,并仅将标记的特征移向中性的非目标表示。在保留的触发验证集上选择一个固定操作点后,InstantForget在部署时无需触发样本或参数更新,将CIFAR-10上四种非自适应触发的平均ASR降至0.071。它还达到了0.981的检测AUROC,并迁移到八个测试骨干中的六个。报告的在WaNet、ModelNet10点混合、两种骨干几何和自适应特征紧凑性攻击下的失败定义了该方法的适用范围。

英文摘要

Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free inference-time setting, where model parameters remain frozen. First, we audit a common projection assumption under oracle paired clean and triggered features. Projection succeeds mainly on BadNets and leaves WaNet, Blended, and SIG at 0.683, 0.888, and 0.941 ASR on CIFAR-10 ResNet-18. This failure is not explained by spectral compactness, spatial locality, or subspace misalignment. It is predicted by a logit-triplet gap involving the target margin, target-logit drop, and non-target logit rise. We then introduce InstantForget, a clean-calibrated gated reset that flags anomalous features with a Mahalanobis score and moves only flagged features toward a neutral non-target representation. With one fixed operating point selected on held-out triggered validation, InstantForget reduces average ASR to 0.071 across four non-adaptive CIFAR-10 triggers without triggered samples or parameter updates at deployment. It also reaches 0.981 detection AUROC and transfers to six of eight tested backbones. Reported failures under WaNet, ModelNet10 point blend, two backbone geometries, and adaptive feature-compactness attacks define the method's scope.

2606.15762 2026-06-16 cs.CR cs.AI cs.SE 交叉投稿

Snyk VulnBench JS 1.0: Can LLMs Find the Same Bugs Twice?

Snyk VulnBench JS 1.0: LLM 能否两次发现相同的漏洞?

Liran Tal, Johannes Kloos, Arsenii Rudich, Stephen Thoemmes, Manoj Nair

发表机构 * Snyk

AI总结 通过300次重复漏洞扫描实验,评估LLM在相同JavaScript代码上安全审查的可重复性,发现引用匹配结果稳定但额外报告波动大,建议结合确定性SAST使用。

Comments 12 pages, 9 figures

详情
AI中文摘要

我们进行了300次重复漏洞查找扫描,以衡量代理型大语言模型(LLM)在相同JavaScript代码、提示和基准测试框架上进行安全审查的可重复性。主要结果是LLM的安全发现结果可重复性不均:引用匹配的发现结果稳定,但额外模型报告在不同运行之间变化很大。在250次模型运行中,161个唯一未匹配发现结果中有80个仅出现在五次相同重复中的一次,而只有22个出现在全部五次中。相比之下,当Claude匹配到Snyk Code引用发现结果时,行为更加稳定:158个唯一引用匹配发现结果中有134个出现在全部五次重复中。该基准测试还显示了互补性。模型一致地发现了熟悉的、高信号利用模式,并在一个案例中揭示了可能的Snyk Code产品差距。Snyk Code静态应用安全测试(SAST)是确定性的,并且在系统地枚举重复数据流汇点方面更胜一筹。结果支持将代理型LLM审查与确定性SAST结合使用,而不是将任一技术视为另一技术的替代品。

英文摘要

We ran 300 repeated vulnerability-finding scans to measure how repeatable agentic large language model (LLM) security review is on the same JavaScript code, prompt, and benchmark harness. The headline result is that LLM security findings were unevenly repeatable: reference-matched findings were stable, but extra model reports varied heavily from run to run. Across 250 model runs, 80 of 161 unique unmatched findings appeared in only one of five identical repetitions, while only 22 appeared in all five. By contrast, when Claude matched a Snyk Code reference finding, the behavior was much more stable: 134 of 158 unique reference-matched findings appeared in all five repetitions. The benchmark also shows complementarity. Models consistently found familiar, high-signal exploit shapes, and in one case surfaced a likely Snyk Code product gap. Snyk Code static application security testing (SAST) was deterministic and better at systematically enumerating repeated data-flow sinks. The results support combining agentic LLM review with deterministic SAST rather than treating either technique as a replacement for the other.

2606.15788 2026-06-16 cs.CR cs.AI 交叉投稿

GAS-Leak-LLM: Genetic Algorithm-Based Suffix Optimization for Black-Box LLM Jailbreaking

GAS-Leak-LLM:基于遗传算法的后缀优化实现黑盒LLM越狱

Aman Anifer, Vignesh Kumar Kembu, Vishnu M, Antonino Nocera, Vinod P., Amal Murali PK, Akshay S Rajan

发表机构 * Department of Electrical, Computer and Biomedical Engineering(电气、计算机与生物医学工程系) University of Pavia(帕维亚大学) Department of Computer Applications(计算机应用系) Cochin University of Science and Technology(科钦科学技术大学)

AI总结 提出GAS-Leak-LLM方法,利用遗传算法在黑盒设置下自动进化对抗后缀以绕过LLM安全约束,验证了现有安全机制的不足。

详情
AI中文摘要

大型语言模型(LLM)构成了以人工智能为主导的信息技术生态系统中的关键组成部分。为了减轻有害或违反政策的输出带来的风险,商业系统采用了先进的对齐策略和多层内容审核机制。尽管有这些保护措施,最近的研究表明,LLM仍然容易受到对抗性操纵,特别是通过越狱和提示注入技术。在这项工作中,我们提出了GAS-Leak-LLM,一种基于遗传算法的新型越狱攻击,该系统性地进化对抗后缀以绕过安全约束。在严格的黑盒设置中操作,我们的方法不需要访问模型参数或内部结构,从而反映了部署系统中的真实威胁场景。通过迭代应用选择、变异和交叉启发式,该框架系统地探索离散提示空间以识别高适应度的对抗后缀。实证结果揭示了现有安全执行机制的关键缺陷,并确认了所提出攻击的有效性和实际可行性。

英文摘要

Large Language Models (LLMs) constitute pivotal components within the AI-dominated information technology ecosystem. To mitigate risks associated with harmful or policy-violating outputs, commercial systems employ advanced alignment strategies and multi-layered content moderation mechanisms. Despite these safeguards, recent research has demonstrated that LLMs remain vulnerable to adversarial manipulation, particularly through jailbreaking and prompt injection techniques. In this work, we propose GAS-Leak-LLM a novel jailbreaking attack based on a genetic algorithm that systematically evolves adversarial suffix to bypass safety constraints. Operating in a strict black-box setting, our method requires no access to model parameters or internals, thereby reflecting realistic threat scenarios in deployed systems. Through the iterative application of selection, mutation, and crossover heuristics, the framework systematically explores the discrete prompt space to identify high-fitness adversarial suffixes. Empirical findings reveal critical shortcomings in existing safety enforcement mechanisms and confirm the effectiveness and practical viability of the proposed attack.

2606.15810 2026-06-16 cs.CR cs.AI 交叉投稿

Let Them Steal: Trapping Large Language Model Extraction Attacks with Knowledge Honeypot

让他们偷:用知识蜜罐诱捕大语言模型提取攻击

Yuyang Dai, Yushun Dong

发表机构 * Florida State University(佛罗里达州立大学)

AI总结 提出知识陷阱防御方法,通过蜜罐知识图和面包屑引导消耗攻击者查询预算,在保持合法用户性能的同时降低替代模型一致性6.2%。

Comments 16 pages

详情
AI中文摘要

作为商业API部署的大语言模型容易受到模型提取攻击,而现有的防御要么反应太迟,要么降低了合法用户的效用。我们提出\textbf{知识陷阱},一种通过\textit{蜜罐知识图}(HKG)和面包屑引导探索将提取攻击重定向到低迁移性知识的防御方法。知识陷阱不是阻止查询或扰动输出,而是将攻击者有限的查询预算消耗在具有可忽略下游效用的知识上,同时保持良性用户的性能。在医疗和金融领域的实验表明,知识陷阱在不降低合法用户准确性的情况下,平均将替代模型一致性降低了6.2%,优于那些对用户产生可测量影响的现有防御。这些结果表明,防御知识空间遍历是缓解LLM提取攻击的一个实用方向。

英文摘要

Large language models deployed as commercial APIs are vulnerable to model extraction attacks, while existing defenses either act too late or degrade utility for legitimate users. We propose \textbf{Knowledge Trap}, a defense that redirects extraction attacks toward low-transferability knowledge through a \emph{Honeypot Knowledge Graph} (HKG) and breadcrumb-guided exploration. Instead of blocking queries or perturbing outputs, Knowledge Trap consumes the attacker's limited query budget on knowledge with negligible downstream utility while preserving benign-user performance. Experiments in medical and financial domains show that Knowledge Trap reduces surrogate Agreement by 6.2\% on average without degrading legitimate-user accuracy, outperforming existing defenses that impose measurable user impact. These results suggest that defending knowledge-space traversal is a practical direction for mitigating LLM extraction attacks.

2606.15954 2026-06-16 cs.SE cs.AI cs.DC cs.LG 交叉投稿

Green SARC: Predictive Cost and Carbon Governance for Agentic AI Systems

Green SARC:面向代理型AI系统的预测性成本与碳治理

Gaston Besanson

发表机构 * Universidad Torcuato Di Tella(托库托迪泰拉大学)

AI总结 提出Green SARC框架,通过架构级治理在代理循环中强制执行成本与碳预算,理论贡献包括预测性执行点,实验证明门控机制实现0%超支,端到端节省47-55%。

Comments 19 figures. Code: https://github.com/besanson/Greensarc -- Software DOI: https://doi.org/10.5281/zenodo.20692196

详情
AI中文摘要

代理型AI系统通过工具和子代理运作,但旨在约束其财务和环境成本的控制措施仍停留在仪表盘上,在执行过程中或执行后进行评估。Green SARC将SARC架构治理框架——代理循环中的四个执行点——应用于FinOps和GreenOps,贡献了关于执行什么以及如何预测的理论。我们报告了四个与策略无关的结果。(i) 无约束的“状态雪球”在循环深度上为$Θ(n^2)$;在3000个真实多步计划(SWE-rebench)上,100%成立,中位曲率$\hat{c}_2=216$超过线性累积预测$p/2=134$——真实计划累积速度快于模型。(ii) 在真实残差上,正态-$σ$门覆盖不足(标称95%时实际92%);分裂共形校准成立(95.2%)。(iii) 根据预期预算调整的软拉格朗日惩罚在91.5%的种子上违反预算;架构门违反率为0%。(iv) 在绑定预算下,门在合成和真实(BurstGPT)到达上的超预算发生率为0%。端到端的token/美元/碳节省(47-55%)是真实的,但幅度依赖于策略——由范围-容量旋钮设定,而非门拒绝。该库是开源的,无依赖,并为每个引用的数字提供了再生脚本。

英文摘要

Agentic AI systems act through tools and sub-agents, yet the controls meant to bound their financial and environmental cost still sit on dashboards evaluated beside or after execution. Green SARC applies the SARC governance-by-architecture framework -- four enforcement sites in the agent loop -- to FinOps and GreenOps, contributing the theory of what to enforce and how to predict it. We report four policy-independent results. (i) The unconstrained "State Snowball" is $Θ(n^2)$ in loop depth; on 3,000 real multi-step plans (SWE-rebench) it holds on 100%, with median curvature $\hat{c}_2=216$ exceeding the linear-accretion prediction $p/2=134$ -- real plans accrete faster than the model. (ii) On real residuals the Normal-$σ$ gate under-covers (92% at nominal 95%); split-conformal calibration holds (95.2%). (iii) A soft Lagrangian penalty tuned to the budget in expectation breaches it on 91.5% of seeds; the architectural gate breaches 0%. (iv) Under binding budgets the gate's over-budget incidence is 0% on synthetic and real (BurstGPT) arrivals. End-to-end token/USD/carbon savings (47--55%) are real but policy-dependent in magnitude -- set by a scope-cap knob, not by gate rejections. The library is open-source, dependency-free, and ships a regeneration script for every cited number.

2606.15980 2026-06-16 cs.LG cs.AI cs.CL 交叉投稿

Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness

安全监控器在更新后是否仍可靠?激活监控器陈旧性的基准测试与预测

Evan Duan

发表机构 * University of Michigan(密歇根大学)

AI总结 研究语言模型更新后激活监控器是否仍可靠,发现量化更新影响小,微调更新常导致监控器失效,且可通过预部署特征预测退化。

详情
AI中文摘要

激活监控器——在语言模型内部表示上训练的轻量级探针——在部署安全栈中越来越常见。然而,部署的模型很少是静态的:它们被量化、微调、用LoRA适配,或与合并适配器一起服务,而监控器保持冻结。我们首次系统测试了这一隐含契约是否成立:在基础模型上训练的激活监控器在这些常规模型更新后是否仍可靠。跨多个安全相关监控器、模型深度、更新系列和开放权重模型,我们发现一个明显的分裂:量化风格的更新大多保持冻结探针性能,而微调风格的更新经常使探针变得陈旧。脆弱性高度依赖于监控器,隐私/PII探针受影响最大,而拒绝合规探针相对稳定,表明重新训练行为不一定使其对应的监控器变得陈旧。QLoRA尤其有害,尽管单独的NF4量化相对良性,这表明量化在与适配结合时风险更大。我们进一步表明,退化可以从部署前的特征预测,从而能够将重新验证预算优先分配给最可能失败的监控器。这些结果表明,微调应默认触发激活监控器重新验证,而预测可以帮助优先检查哪些监控器。

英文摘要

Activation monitors-lightweight probes trained on a language model's internal representations-are an increasingly common layer in deployment safety stacks. Deployed models however are rarely static: they are quantized, fine-tuned, adapted with LoRA, or served with merged adapters while the monitor remains frozen. We present the first systematic test of whether this implicit contract holds: whether activation monitors trained on a base model remain reliable after these routine model updates. Across multiple safety-relevant monitors, model depths, update families, and open-weight models, we find a sharp split: quantization-style updates largely preserve frozen probe performance, while fine-tuning-style updates frequently make probes stale. Fragility is highly monitor-dependent, with privacy/PII probes most affected and refusal-compliance probes comparatively stable, showing that retraining a behavior need not stale its corresponding monitor. QLoRA is especially damaging despite NF4 quantization alone being relatively benign, suggesting that quantization becomes riskier when combined with adaptation. We further show that degradation is predictable from pre-deployment features, enabling revalidation budgets to be triaged toward the monitors most likely to fail. These results suggest that fine-tuning should trigger activation-monitor revalidation by default, while prediction can help prioritize which monitors to check first.

2606.16054 2026-06-16 cs.CY cs.AI 交叉投稿

How to Detect and Measure the AI Dangers to Democracy

如何检测和衡量人工智能对民主的危险

Giulia Sandri, Claudio Novelli

发表机构 * Université libre de Bruxelles(布鲁塞尔自由大学) Yale University(耶鲁大学)

AI总结 针对AI对民主进程的影响,提出基于委托代理理论和NIST框架的分析体系,通过可测量指标评估问责缺口与治理失败,强调机构可评估性是民主控制的关键。

详情
AI中文摘要

过去十年间,关于人工智能与民主的研究迅速发展。这些文献的一个共同结论是,AI并未创造新的民主问题,而是加剧了旧有问题。如今,我们在信息生态系统、选举和公共行政中都能看到这一点。然而,尽管证据不断增加,我们仍缺乏明确的方法来优先处理该领域的风险、跨领域比较风险,并识别民主控制最可能失效的环节。因此,我们的问题是:如何系统化AI系统对民主进程造成的问题?本文认为,委托代理理论可能适合这一任务。在民主系统的许多阶段,委托人将关键职能委托给AI系统及其提供商,却无法真正监督这些系统的运作方式或它们产生的输出。将AI视为委托问题有助于识别问责缺口和其他治理失败。最重要的是,正如我们将要说明的,它为AI对民主影响的实证评估提供了度量标准。作为第二个分析要素,我们借鉴了NIST AI风险管理框架及其可信AI的七个特征,这些特征为评估委托任务提供了实质性标准。通过可测量指标和特定领域的可信度标准,在三个领域进行操作化,我们提出了一个以机构可评估性为中心的分析框架,作为民主控制AI的核心条件。然而,我们强调,危害的严重程度以及可接受的风险水平是评估性判断,当前的方法论既未承认也未操作化这些判断。当这些评估性判断被(默默地)委托给私人供应商时,这一问题变得尤为尖锐。我们将其识别为一个强烈的局限性,留待未来工作解决。

英文摘要

Research on artificial intelligence and democracy has grown quickly over the last decade. A shared conclusion in this literature is that AI does not create new democratic problems so much as it makes old ones worse. We now see this across information ecosystems, in elections, and in public administration. However, despite growing evidence, we lack a clear way to prioritize risks in this area, compare them across domains, and identify where democratic control is most likely to break down. So, our problem is: How can we systematize the problems that AI systems pose to democratic processes? This paper argues that principal agent theory may fit the task. In many phases of democratic systems, principals delegate key functions to AI systems and their providers without really being able to monitor how these systems operate or the outputs they produce. Treating AI as a delegation problem helps identify accountability gaps and other governance failures. Most importantly, as we shall illustrate, it provides metrics for empirical assessments of AI impact on democracy. As a second analytical element, we draw on the NIST AI Risk Management Framework and its seven characteristics of trustworthy AI, which supply substantive criteria for evaluating delegated tasks. Operationalized across the three domains through measurable indicators and domain specific trustworthiness criteria, we propose an analytical framework that centers on institutional assessability as the central condition for democratic control over AI. However, we stress that how severe a harm is, and how much risk is acceptable, are evaluative judgments that current methodologies neither acknowledge nor operationalize. This becomes acute when such evaluative judgments are (silently) delegated to private vendors. We identify this as a strong limitation left for future work.

2606.16137 2026-06-16 cs.CL cs.AI 交叉投稿

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

基于XAI的语音深度伪造检测解释生成:使用免训练多模态大语言模型

Yupei Li, Qiyang Sun, Xiaoliang Wu, Chenxi Wang, Berrak Sisman, Björn W. Schuller

发表机构 * Imperial College London(帝国理工学院) Technical University of Munich(慕尼黑工业大学) University of Southampton(南安普顿大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 针对语音深度伪造检测缺乏可解释性的问题,提出一种免训练框架,融合XAI证据与多模态大语言模型,生成基于证据的特定解释,在PartialSpoof数据集上内部准确率提升超45%。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

语音深度伪造检测(SDD)系统需要可信的解释以进行可靠的决策。现有的解释方式主要分为两类。传统的可解释人工智能(XAI),如基于梯度的归因,产生与模型决策紧密耦合的低级归因信号,且比自然语言解释更难被人类理解。同时,基于大语言模型(LLM)的解释生成通常由于缺乏启发式证据和任务特定监督(源于SDD有限的基于证据的解释数据集)而产生通用且无根据的描述。因此,我们提出一种免训练解释框架,将XAI证据与多模态LLM集成,以生成基于证据的特定解释。使用PartialSpoof数据集,我们构建了一个基于证据的解释数据集,并表明带有XAI的方法将内部准确率提高了超过45%,通过人工评估和忠实性检查得到验证。

英文摘要

Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation ways mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution, produces low-level attribution signals tightly coupled with model decisions, and harder to be understood by human than natural language explanations. Meanwhile, large language model (LLM)-based explanation generation often produces generic and ungrounded descriptions due to the lack of heuristic evidence and task-specific supervision, stemming from limited grounded explanation datasets for SDD. We therefore propose a training-free explanation framework that integrates XAI evidence with multimodal LLMs to generate grounded and specific explanations. Using the PartialSpoof dataset, we construct a grounded explanation dataset and show that methods with XAI increase inside accuracy by over 45\%, verified through human evaluation and faithfulness checks.

2606.16244 2026-06-16 cs.CR cs.AI 交叉投稿

SPARK: Security Knowledge Priming and Representation-Guided Knowledge Activation for LLM-based Secure Code Generation

SPARK: 基于安全知识引导与表示激活的LLM安全代码生成

Xiaoyun Xu, Lichao Wu, Jona te Lintelo, Siyu Zhang, Stjepan Picek

发表机构 * Radboud University(拉德堡德大学) University of Bristol(布里斯托大学) University of Zagreb(扎格雷布大学)

AI总结 提出SPARK方法,通过检索CWE条目并添加结构化提示激活模型内隐安全知识,结合预计算令牌偏置,无需重训练即可提升代码安全性。

详情
AI中文摘要

大型语言模型通常会生成带有可利用安全漏洞的代码。先前文献将此限制归因于缺乏安全专业知识,促使当前防御机制转向大量微调或外部知识检索,这通过冗余代码示例引入了显著的计算开销和数据偏差。与此观点相反,我们认为预训练语料库已经富含安全材料。瓶颈在于激活:没有明确而简短的提示,对常见训练分布模式的统计压力会抑制模型的安全相关表示。我们提出了SPARK,一种推理时的安全增强工具,无需任何重训练即可激活这些潜在知识。该工具包含两部分。第一部分为每个编码任务检索少量相关通用弱点枚举(CWE)条目,并在提示后附加一个简短的结构化提示;仅此就足以浮现模型现有的安全表示。第二部分在每个解码步骤向logits添加预计算的令牌偏置。我们通过将安全方向向量(平均安全与平均不安全的最后一层隐藏状态之间的单位差)通过语言模型头投影来获得偏置。该偏置离线计算一次;应用它每个生成令牌只需一次向量加法。我们在C++、Java和Python上的9个开源模型上评估了SPARK,并与涵盖微调和检索增强方法的7个基线进行了比较。SPARK在每个设置中均匹配或优于最佳基线,同时保持HumanEval效用。我们进一步在7个当前最强模型(包括Claude、DeepSeek和GPT)的黑盒设置中测试了第一部分,展示了不安全代码生成的瓶颈以及我们方法带来的改进。

英文摘要

Large language models routinely generate code with exploitable security flaws. Prior literature attributes this limitation to a lack of security expertise, steering current defense mechanisms toward heavy fine-tuning or external knowledge retrieval, which introduces significant computational overhead and data bias through redundant code examples. Contrary to this view, we argue that pretraining corpora are already rich in security material. The bottleneck is activation: without an explicit and brief cue, statistical pressure toward common training-distribution patterns suppresses the model's safety-relevant representations. We present SPARK, an inference-time security harness that activates this latent knowledge without any retraining. The harness has two parts. Component~I retrieves a few of the relevant Common Weakness Enumeration (CWE) entries for each coding task and appends a short structured cue to the prompt; this alone is enough to surface the model's existing security representations. Component~II adds a precomputed token bias to the logits at every decoding step. We obtain the bias by projecting a safe-direction vector, the unit difference between the mean safe and mean unsafe last-layer hidden states, through the language model head. The bias is computed once offline; applying it costs a single vector addition per generated token. We evaluate SPARK on 9 open-source models across C++, Java, and Python, and compare with 7 baselines spanning fine-tuning and retrieval-augmented methods. SPARK matches or improves on the best baseline in every setting while preserving HumanEval utility. We further test Component~I in a black-box setting on 7 of today's strongest models, including Claude, DeepSeek, and GPT, demonstrating the bottleneck of insecure code generation and the improvements enabled by our method.

2606.16352 2026-06-16 cs.LG cs.AI 交叉投稿

Communication-Efficient Verifiable Attention for LLM Inference

面向LLM推理的高效通信可验证注意力机制

Ziqun Chen, Ming Wu, Michael Heinrich, Jason Zeng, Huiying Lan, Tianwei Zhang, Rui Tan

发表机构 * Nanyang Technological University(南洋理工大学) Zero Gravity Labs(零重力实验室)

AI总结 提出VeriAttn,通过将注意力计算卸载到GPU并由TEE验证,结合两阶段流水线和分区策略,显著降低TEE计算和通信开销,实现LLM推理加速。

Comments 19 pages, 16 figures

详情
AI中文摘要

远程大型语言模型(LLM)服务的计算完整性可能存在问题。对于传统深度神经网络(DNN),现有的TEE屏蔽DNN分区(TSDP)方法使用可信执行环境(TEE)计算非线性组件,并验证卸载到不可信GPU的线性组件的完整性。然而,直接将TSDP应用于基于Transformer的LLM会导致大量的TEE计算和TEE-GPU通信开销。本文提出通信高效的TEE-GPU注意力机制(\textsc{VeriAttn}),用于加速可验证的LLM推理。\textsc{VeriAttn}将注意力的线性和非线性计算都卸载到GPU,而TEE执行验证。此外,对于预填充阶段,\textsc{VeriAttn}使用两级流水线来重叠数据移动、TEE前后处理和GPU计算。对于解码阶段,当键值缓存超过可用GPU内存时,\textsc{VeriAttn}将注意力在TEE和GPU之间分区,以减少重复的键值传输。在Intel TDX平台上的评估表明,对于6k令牌提示和10k令牌输出,\textsc{VeriAttn}在预填充和解码阶段分别比TSDP加速2.60-3.38倍和3.86-5.42倍。

英文摘要

Computation integrity of remote large language model (LLM) serving can be questionable. For conventional deep neural networks (DNNs), the existing TEE-shielded DNN partitioning (TSDP) approach uses Trusted Execution Environment (TEE) to compute non-linear components and verify the integrity of linear components offloaded to an untrusted GPU. However, directly applying TSDP to Transformer-based LLMs incurs significant TEE computation and TEE-GPU communication overhead. This paper presents Communication-efficient TEE-GPU Attention (\textsc{VeriAttn}) for accelerating verifiable LLM inference. \textsc{VeriAttn} offloads both linear and non-linear computations of attention to the GPU, while TEE performs verification. Moreover, for prefill, \textsc{VeriAttn} uses a two-level pipeline to overlap data movement, TEE pre-/post-processing, and GPU computation. For decoding, when the key-value cache exceeds available GPU memory, \textsc{VeriAttn} partitions attention across TEE and GPU to reduce repeated key-value transfers. Evaluation on an Intel TDX platform shows that \textsc{VeriAttn} achieves 2.60-3.38$\times$ and 3.86-5.42$\times$ acceleration over TSDP for 6k-token prompts and 10k-token outputs during prefill and decoding, respectively.

2606.16358 2026-06-16 cs.CR cs.AI cs.ET cs.MA 交叉投稿

The Proxy Knows Too Much: Sealing LLM API Routers with Attested TEEs

代理知道太多:用认证TEE密封LLM API路由器

Sipeng Xie, Qianhong Wu, Hengrun Lu, Ziliang Sun, Qi Wu, Bo Qin, Qin Wang

发表机构 * Beihang University(北京航空航天大学) Renmin University of China(中国人民大学) Independent(独立)

AI总结 针对API路由器作为应用层中间人可窃取明文交互的问题,提出AEGIS,一种提供者透明的认证API路由器,通过硬件飞地保护数据路径,客户端验证飞地后释放明文,阻止所有恶意路由器攻击。

详情
AI中文摘要

智能体越来越多地通过API路由器访问大型语言模型(LLM)。路由器终止客户端的传输层安全会话并打开单独的上游会话,因此它以明文形式持有完整交互。这使得路由器成为应用层中间人:它可以重写智能体工具调用,将依赖项替换为错别字劫持包,仅在审计规避条件下触发攻击,并被动窃取秘密。现有的客户端防御措施是可规避的。我们提出AEGIS,一种提供者透明的认证API路由器,其数据路径是客户端验证的忠实直通。AEGIS将明文处理限制在一个小型硬件飞地组件中,同时将认证、调度、计费和管理保留在不可信主机上。客户端在释放明文前验证飞地。主机既不能读取也不能更改交互,明文仅流向测量映像固定的目的地。我们展示了所有四类恶意路由器攻击在明文访问基线下成功,并被AEGIS阻止,包括针对相同边界的自适应测试。可信路径为851行代码,携带三种提供者原生API而无需转换,并在真实提供者工作负载和并发下完成每个请求。在种子审计试点中,两个商品编码代理分别发现十个植入不变量违规中的八个和十个。本地中继开销约为每个请求六毫秒。

英文摘要

Agents increasingly access large language models (LLMs) through API routers. A router terminates the client's transport-layer security session and opens a separate upstream session, so it holds the full interaction in plaintext. This makes the router an application-layer man-in-the-middle: it can rewrite agent tool calls, swap dependencies for typosquatted packages, trigger attacks only under audit-evading conditions, and passively exfiltrate secrets. Existing client-side defenses are evadable. We propose AEGIS, a provider-transparent attested API router whose data path is a client-verified faithful passthrough. AEGISconfines plaintext handling to a small hardware-enclave component while leaving authentication, scheduling, accounting, and management on the untrusted host. The client verifies the enclave before releasing plaintext. The host can neither read nor alter the interaction, and plaintext leaves only toward destinations fixed by the measured image. We show that all four malicious-router attack classes succeed against a plaintext-access baseline and are blocked by AEGIS, including adaptive tests against the same boundary. The trusted path is $851$ lines, carries three provider-native APIs without conversion, and completes every request under real-provider workload and concurrency. In a seeded audit pilot, two commodity coding agents find eight and ten of ten planted invariant violations. The local relay overhead is about six milliseconds per request.

2606.16617 2026-06-16 cs.CL cond-mat.mtrl-sci cs.AI 交叉投稿

Sycophancy as Material Failure under Pushback Loading: A Multi-Axis Characterization Across Three Loading Cases and up to Seventeen Material Charges

推挤载荷下的谄媚作为材料失效:三种加载情形及多达十七种材料批次的多元表征

Ferdinand M. Schessl

AI总结 采用材料科学框架,将LLM谄媚视为推挤载荷下的材料失效,通过14个轴测量和三种加载情形(辩论、错误预设、伦理设定)共7800个样本,揭示失效模式依赖加载类型,并发现跨评判者可靠性差异。

Comments 12 pages, 3 figures. Code, data, and pre-registrations: https://github.com/FerdinandSchessl/sycophancy-note-companion

详情
AI中文摘要

LLM中的谄媚现象在70多篇论文中有记录,但专家对构念边界的共识仍然较低(ICC=.184;Ye等人,2026)。该构念碎片化是因为行为分类取决于哪种表面形式被优先考虑。我们采用材料科学框架:对话作为加载下的测试样本,LLM模型作为材料批次,推挤作为渐进载荷,立场翻转作为材料失效。我们在三种加载情形(辩论n=1000;错误预设n=3400;伦理设定n=3400;每种情形10-17种材料批次;共7800个样本)下,使用14个回合级轴测量(涵盖速度、损伤累积、框架漂移、脆性和方向稳定性)以及来自独立管道的三个说话者解析轴来表征这种失效。测量是胡克耦合的($σ= E \cdot \varepsilon$类比),并在加载情形间重现,在辩论上效应高达$|r_{rb}| = 0.35$;符号结构增加了第二种模式:伦理设定情形反转了速度和累积块。方差组成分为两个轮廓:辩论是批次主导的(类似脆性断裂:材料等级决定),错误预设和伦理设定是主题主导的(类似蠕变:载荷决定);比率(2.03 vs 0.13/0.17)依赖于估计器,对于辩论甚至在方向上也是如此。跨评判者可靠性(GPT-4o vs Haiku 4.5)显示辩论评分是评判者鲁棒的(Cohen's $κ= 0.88$),而错误预设评分是评判者敏感的($κ= 0.36$)——这是单评判者基准必须报告的注意事项。这是Ye等人诊断所要求的方法论举措:一种不依赖于构念的哪种表面形式被优先考虑的多元表征。

英文摘要

Sycophancy in LLMs is documented across 70+ papers, but expert agreement on construct boundaries remains low (ICC=.184; Ye et al., 2026). The construct fragments because behavioral classification depends on which surface form is privileged. We adopt a materials-science framing: conversation as test specimen under load, LLM-model as material charge, pushback as progressive load, stance-flip as material failure. We characterize this failure across three loading cases (debate n=1000; false-presuppositions n=3400; ethical-setting n=3400; 10-17 material charges per case; 7800 specimens total) using 14 turn-level axis-measurements spanning velocity, damage accumulation, frame-drift, brittleness, and direction stability, plus three speaker-resolved axes from an independent pipeline. The measurements are Hooke-coupled ($σ= E \cdot \varepsilon$ analog) and reproduce across loading cases with effects up to $|r_{rb}| = 0.35$ on debate; the sign structure adds a second pattern: the ethical-setting case inverts the velocity and accumulation blocks. Variance composition partitions into two profiles: debate is charge-dominated (brittle-fracture-like: the material grade decides), false-presuppositions and ethical-setting are topic-dominated (creep-like: the load decides); the ratios (2.03 vs 0.13/0.17) are estimator-dependent, for debate even in direction. Cross-judge reliability (GPT-4o vs Haiku 4.5) shows debate scoring is judge-robust (Cohen's $κ= 0.88$) while false-presupposition scoring is judge-sensitive ($κ= 0.36$) -- a caveat single-judge benchmarks must report. This is the methodological move Ye et al.'s diagnosis calls for: a multi-axis characterization that does not depend on which surface form of the construct one privileges.

2606.16751 2026-06-16 cs.CR cs.AI 交叉投稿

Automated jailbreak attack targeting multiple defense strategies

针对多种防御策略的自动化越狱攻击

Qi Wang, Chengcheng Wan, Weijia He, Yanqing Li, Hanqi Sun, Xiaodong Gu, Jiangtao Wang

AI总结 提出UNIATTACK框架,从防御视角提取攻击特征并优化,实现跨模型和类别的单次黑盒攻击,显著提升成功率并降低开销。

详情
AI中文摘要

大型语言模型(LLM)在广泛任务中展现出卓越能力。然而,由于其易受对抗性提示攻击,其安全性仍是关键问题。本文提出UNIATTACK,一个从防御视角设计的对抗性测试框架,用于系统性地构建有效的黑盒攻击提示。与依赖静态模板或迭代模型特定调优的先前方法不同,UNIATTACK从多种现有攻击中提取最小但高影响力的攻击特征,通过专门的攻击者LLM进行优化,并通过自动化精炼过程将其组合成灵活模板。这种以特征为中心的构建方式使得单次攻击能够泛化到多个模型和安全类别,为评估LLM鲁棒性提供了实用工具。我们的评估结果显示,与基线相比,UNIATTACK在部署了多层防御机制的模型上实现了平均攻击成功率(ASR)提升64.63%-248.82%,且仅消耗基线成本的0.03%-4.96%。UNIATTACK工件可在https://anonymous.4open.science/r/UniAttack-Artifact-30F1获取。

英文摘要

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. However, their safety remains a critical concern due to their susceptibility to adversarial prompt-based attacks. In this paper, we present UNIATTACK, an adversarial testing framework designed from a defense-oriented perspective to systematically construct effective black-box attack prompts. Unlike prior approaches that rely on static templates or iterative model-specific tuning, UNIATTACK extracts minimal but high-impact attack features from diverse existing attacks, optimizes them via a specialized attacker LLM, and composes them into flexible templates through automated refinement process. This feature-centric construction enables one-shot attacks that generalize across multiple models and safety categories, providing a practical tool for assessing LLM robustness. Our evaluation results shows that compared to the baselines, UNIATTACK achieves an average attack success rate (ASR) improvement of 64.63\%-248.82\% on models deployed with multi-layered defense mechanisms and it only takes 0.03\%-4.96\% cost of the baselines. UNIATTACK artifact is available at https://anonymous.4open.science/r/UniAttack-Artifact-30F1.

2606.16939 2026-06-16 cs.LG cs.AI 交叉投稿

Scalable Circuit Learning for Interpreting Large Language Models

可扩展的电路学习用于解释大型语言模型

Naiyu Yin, Dennis Wei, Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Yue Yu

AI总结 提出CircuitLasso方法,基于稀疏线性回归高效学习LLM中的稀疏电路,以SAE特征为单元,在保持结构准确性的同时大幅降低计算成本,并揭示语义特征传播机制。

Comments Accepted to the Mechanistic Interpretability Workshop at ICML 2026

详情
AI中文摘要

机械可解释性中的一个重要研究方向是学习LLM组件上的稀疏电路,以揭示它们如何共同产生模型行为。然而,原始神经元具有多语义性,使得学习到的电路难以解释。稀疏自编码器(SAE)特征缓解了这一问题,但其高维度使得现有的基于干预的电路学习方法在计算上变得不可行。我们提出了CircuitLasso,一种基于稀疏线性回归的可扩展电路学习方法。CircuitLasso恢复的电路在基准数据上的结构准确性与最先进的基于干预的方法相匹配,而计算成本仅为后者的一小部分。为了可解释性,CircuitLasso高效地揭示了SAE特征之间的关系,展示了人类可解释的语义特征如何通过模型传播并影响其预测。最后,我们通过利用所学电路的见解,在领域泛化任务上以显著更低的成本实现了相当的性能,从而验证了所学电路的实用性。

英文摘要

A prominent research direction in mechanistic interpretability is learning sparse circuits over LLM components to reveal how they jointly produce model behavior. However, raw neurons are polysemantic, making learned circuits hard to interpret. Sparse autoencoder (SAE) features alleviate this, but their high dimensionality makes existing intervention-based circuit learning methods computationally prohibitive. We propose CircuitLasso, a scalable circuit-learning approach based on sparse linear regression. CircuitLasso recovers circuits whose structural accuracy matches that of state-of-the-art intervention-based methods on the benchmark data, at a fraction of the computational cost. For interpretability, CircuitLasso efficiently uncovers relationships among SAE features, showing how human-interpretable semantic features propagate through the model and influence its predictions. Finally, we validate the utility of our learned circuits by leveraging their insights to achieve comparable performance at substantially lower cost on a domain-generalization task.

2606.16952 2026-06-16 cs.LG cs.AI stat.AP stat.ME stat.ML 交叉投稿

Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data

幻象与披露:合成数据审计的因果框架

Kareem Amin, Rudrajit Das, Alessandro Epasto, Adel Javanmard, Dennis Kraft, Mónica Ribero, Sergei Vassilvitskii

发表机构 * Google(谷歌) University of Southern California(南加州大学)

AI总结 提出一个可定制的实证审计框架,通过区分真实披露与幻象披露,利用统计假设检验检测合成数据中的隐私泄露,无需模型访问或参考模型,提供比先前方法更紧的隐私泄露下界。

Comments 35 pages, 10 tables, 5 figures

详情
AI中文摘要

生成式AI和大语言模型(LLMs)的快速普及激发了人们对合成数据的兴趣,将其作为敏感真实数据集的隐私保护替代方案。然而,生成高实用性合成数据往往存在记忆和复述训练语料中隐私信息的风险。在这项工作中,我们提出了一个可定制的实证审计框架,旨在检测和解释此类数据披露。我们的框架引入了一种机制来区分“真实披露”——系统直接复现用户信息的情况,以及“幻象披露”——系统偶然生成用户数据的情况。通过将输入数据划分为训练集和保留集,并应用严格的统计假设检验,我们确定观察到的披露是否与严格的隐私基线(如零学习或特定的差分隐私(DP)边界)一致。关键的是,这种方法不需要模型访问、不需要插入金丝雀数据,也不需要参考模型训练——仅需要合成输出和保留的控制集。我们证明,该框架有效地充当了成员推断攻击,提供了比先前基于数据的审计方法更紧的隐私泄露经验下界。我们的方法是模型无关的,适用于任何合成数据生成机制,并且所需的计算资源比影子模型或基于金丝雀的替代方法少几个数量级。

英文摘要

The rapid adoption of generative AI and Large Language Models (LLMs) has spurred interest in synthetic data as a privacy-preserving alternative to sensitive real-world datasets. However, generating high-utility synthetic data often carries the risk of memorizing and regurgitating private information from the training corpus. In this work, we present a customizable empirical auditing framework designed to detect and explain such data disclosures. Our framework introduces a mechanism to distinguish between "true disclosures"-where the system directly reproduces a user's information-and "phantom disclosures''-where the system incidentally generates a user's data. By partitioning input data into training and holdout sets and applying rigorous statistical hypothesis testing, we determine if observed disclosures are consistent with strict privacy baselines, such as zero-learning or specific Differential Privacy (DP) bounds. Crucially, this approach requires no model access, no canary insertion, and no reference model training -only the synthetic output and a held-out control set. We demonstrate that this framework effectively functions as a membership inference attack, providing empirical lower bounds on privacy leakage that are tighter than prior data-based auditing methods. Our approach is model-agnostic, applies to any synthetic data generation mechanism, and requires orders of magnitude fewer computational resources than shadow-model or canary-based alternatives.

2408.05568 2026-06-16 cs.AI cs.CL cs.CY stat.AP 版本更新

Metacognitive Myopia in Large Language Models

大型语言模型中的元认知近视

Florian Scholten, Tobias R. Rebholz, Mandy Hütter

发表机构 * Psychology Department, University of Tübingen(图宾根大学心理学系)

AI总结 提出元认知近视框架解释LLM偏见,认为信息环境中的有偏样本导致五种症状,并通过监控与控制机制近似技术缓解。

详情
AI中文摘要

大型语言模型(LLMs)表现出潜在有害的偏见,这些偏见强化了文化嵌入的刻板印象,影响道德判断,或放大对多数群体的积极评价。我们提出元认知近视作为一个认知生态框架,用以解释一系列已建立和新兴的LLM偏见。我们的理论框架认为,信息环境中的有偏样本导致LLM中元认知近视的五种症状:整合无效嵌入、易受冗余信息影响、在条件计算中忽略基率、基于频率的决策规则,以及对嵌套数据结构的错误高阶统计推断。此外,该框架认为元认知的两个主要组成部分——监控和控制——可以解释这五种症状。因此,我们进一步概述了如何从技术上近似监控和控制,例如通过隐藏的并行推理历史,使交互式LLM在生成公开响应之前能够评估近视推理的风险。我们的理论框架为有缺陷的人机交互和代理AI提供了新的视角,并对在组织结构和高风险决策中实施LLM提出了重要的伦理关切。

英文摘要

Large Language Models (LLMs) exhibit potentially harmful biases that reinforce culturally embedded stereotypes, influence moral judgments, or amplify positive evaluations of majority groups. We propose metacognitive myopia as a cognitive-ecological framework accounting for a conglomerate of established and emerging LLM biases. Our theoretical framework posits that biased samples in the information environment cause five symptoms of metacognitive myopia in LLMs: integration of invalid embeddings, susceptibility to redundant information, neglect of base rates in conditional computation, decision rules based on frequency, and inappropriate higher-order statistical inference for nested data structures. Moreover, it posits that the two main components of metacognition, monitoring and control, could account for these five symptoms. Accordingly, we further outline how monitoring and control could be approximated technically, for instance, through hidden parallel reasoning histories that allow interactive LLMs to evaluate risks of myopic inference before generating overt responses. Our theoretical framework provides a novel perspective on flawed human-machine interactions and agentic AI and raises significant ethical concerns regarding the implementation of LLMs in organizational structures and high-stakes decisions.

2502.12445 2026-06-16 cs.AI cs.LG stat.ML 版本更新

Computational Safety for Generative AI: A Hypothesis Testing Perspective

生成式AI的计算安全性:假设检验视角

Pin-Yu Chen

发表机构 * IBM Research(IBM研究院)

AI总结 本文从假设检验角度形式化生成式AI的计算安全性,提出基于信号处理的方法检测恶意输入和AI生成内容。

Comments Extended version of the paper presented at the ICML 2026 Workshop on Hypothesis Testing

详情
AI中文摘要

AI安全是一个快速发展的研究领域,旨在防止前沿AI技术的危害和滥用,特别是针对能够通过文本提示创建逼真高质量内容的生成式AI(GenAI)工具。此类工具的例子包括大型语言模型(LLM)和文本到图像(T2I)扩散模型。由于相似的训练数据源和神经网络架构设计,各种领先GenAI模型的性能趋于饱和,因此开发可靠的安全护栏已成为责任和可持续性的关键差异化因素。本文提出了计算安全性概念的形式化,这是一个数学框架,通过信号处理理论和方法的视角,能够对GenAI中的安全挑战进行定量评估、表述和研究。特别是,我们探讨了GenAI中两类可表述为假设检验问题的计算安全挑战。对于模型输入的安全性,我们展示了如何使用敏感性分析和损失景观分析来检测带有越狱尝试的恶意提示。对于模型输出的安全性,我们阐明了如何使用统计信号处理来检测AI生成的内容。最后,我们讨论了关键的开放研究挑战、机遇以及信号处理在计算AI安全中的重要作用。

英文摘要

AI safety is a rapidly growing area of research that seeks to prevent the harm and misuse of frontier AI technology, particularly with respect to generative AI (GenAI) tools that are capable of creating realistic and high-quality content through text prompts. Examples of such tools include large language models (LLMs) and text-to-image (T2I) diffusion models. As the performance of various leading GenAI models approaches saturation due to similar training data sources and neural network architecture designs, the development of reliable safety guardrails has become a key differentiator for responsibility and sustainability. This paper presents a formalization of the concept of computational safety, which is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI through the lens of signal processing theory and methods. In particular, we explore two exemplary categories of computational safety challenges in GenAI that can be formulated as hypothesis testing problems. For the safety of model input, we show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts. For the safety of model output, we elucidate how statistical signal processing can be used to detect AI-generated content. Finally, we discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.

2604.22119 2026-06-16 cs.AI 版本更新

Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

AI中涌现的策略推理风险:基于分类法的评估框架

Tharindu Kumarage, Lisa Bauer, Yao Ma, Dan Rosen, Yashasvi Raghavendra Guduri, Anna Rumshisky, Kai-Wei Chang, Aram Galstyan, Rahul Gupta, Charith Peris

发表机构 * Amazon Nova Responsible AI(亚马逊诺瓦负责任人工智能)

AI总结 提出ESRRSim框架,基于7类20子类的风险分类法,通过双评估标准自动评估LLM的策略推理风险,发现检测率在14.45%-72.72%之间,且模型代际提升显著。

详情
AI中文摘要

随着推理能力和部署范围的同步增长,大型语言模型(LLMs)获得了从事服务于自身目标的行为的能力,我们将这类风险称为涌现策略推理风险(ESRRs)。这些风险包括但不限于欺骗(故意误导用户或评估者)、评估博弈(在安全测试期间策略性地操纵性能)和奖励黑客(利用错误指定的目标)。系统地理解和基准测试这些风险仍然是一个开放的挑战。为弥补这一差距,我们引入了ESRRSim,一个基于分类法的自动化行为风险评估代理框架。我们构建了一个可扩展的7类风险分类法,并将其分解为20个子类。ESRRSim生成旨在引发忠实推理的评估场景,并配以双评估标准,分别评估模型响应和推理轨迹,采用与评判者无关且可扩展的架构。对11个推理LLM的评估揭示了风险概况的巨大差异(检测率范围从14.45%到72.72%),显著的代际改进表明模型可能越来越能够识别和适应评估环境。

英文摘要

As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not limited to, deception (intentionally misleading users or evaluators), evaluation gaming (strategically manipulating performance during safety testing), and reward hacking (exploiting misspecified objectives). Systematically understanding and benchmarking these risks remains an open challenge. To address this gap, we introduce ESRRSim, a taxonomy-driven agentic framework for automated behavioral risk evaluation. We construct an extensible risk taxonomy of 7 categories, which is decomposed into 20 subcategories. ESRRSim generates evaluation scenarios designed to elicit faithful reasoning, paired with dual rubrics assessing both model responses and reasoning traces, in a judge-agnostic and scalable architecture. Evaluation across 11 reasoning LLMs reveals substantial variation in risk profiles (detection rates ranging 14.45%-72.72%), with dramatic generational improvements suggesting models may increasingly recognize and adapt to evaluation contexts.

2606.09500 2026-06-16 cs.AI cs.DL 版本更新

Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

用于LLM辅助临床手稿准备的确定性完整性门控:一种可审计的生物医学信息学架构

Yoojin Nam, Jinhoon Jeong, Namkug Kim

发表机构 * University of Ulsan College of Medicine(蔚山大学医学院) Asan Medical Center(峨山医疗中心) Aperivue AMIST, Asan Medical Center(AMIST,峨山医疗中心)

AI总结 提出一种确定性完整性门控架构,通过将工作流分解为可独立验证的技能并在每个阶段设置确定性检查,解决了LLM生成临床手稿中的虚假引用、数据漂移和报告指南缺失问题。

Comments 28 pages, 3 figures, 4 tables; includes supplementary material (deterministic-detector inventory, per-class defect breakdown, worked example). Software (MIT): https://github.com/Aperivue/medsci-skills . Archived on Zenodo: concept DOI https://doi.org/10.5281/zenodo.20155321 and version DOI (v3.8.0) https://doi.org/10.5281/zenodo.20582972

详情
AI中文摘要

目的。大型语言模型(LLM)越来越多地起草临床研究手稿,但其流畅性可能隐藏虚构的引用、偏离源表格的数字以及未满足的报告指南项目。现有工具生成文本而不进行验证,自我批评继承了产生自信虚构的盲点。我们描述了一种将生成与验证配对的架构。方法。该设计基于三个原则:将工作流分解为自包含的技能,在每个阶段转换处设置失败即停止的门控,以及用最便宜的足够机制解决每个完整性问题——一个确定性的、可重新执行的检查(如果适用),以及仅在需要解释时才使用散文级探针。这种尽可能确定性的分离,组织为完整性门控分类法,是核心贡献。它被实现为MedSci Skills,一个由43个技能组成的开源工具包,由一个编排器协调,其确定性层级包括21个标准库检测器。我们在三个可重复的公共数据集管道(STARD、PRISMA、STROBE)和一个种子缺陷消融上评估它。结果。在三个管道中,每个内容哈希清单都验证为干净,门控揭示了真实缺陷。在27个相同的注入缺陷上,确定性门控检测到所有27个,在匹配的干净固定装置上没有误报,而通用单提示LLM审查员检测到11个,其遗漏集中在生成的代码、参考文献内部和散文未暴露的风格缺陷上。结论。尽可能确定性的验证产生了一个可审计、可重新执行的轨迹,暴露了人类检查LLM辅助手稿所需的证据——可行性和可重复性证据,而不是声称具有人类竞争力的质量,这由另一项盲法研究解决。MedSci Skills采用MIT许可并归档(v3.8.0)。

英文摘要

As autonomous research agents and AI co-scientist systems push large language models (LLMs) from drafting toward end-to-end manuscript production, the bottleneck shifts from generation to verification. Fluent LLM output can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items; existing tools generate without verifying, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture pairing generation with verification, resting on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism, a deterministic, re-executable check where one suffices and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills with a 21-detector deterministic tier, evaluated on three public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects; on 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a single-prompt LLM reviewer detected 11, its misses in code, bibliography, and style defects the prose hides. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript: feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).

2606.10740 2026-06-16 cs.AI cs.CL cs.LG 版本更新

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

当思维链更清楚时:多轮推理模型的失败模式

Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi

发表机构 * GitHub

AI总结 提出CoT-Output 2x2安全矩阵诊断多轮推理模型隐藏的时间动态失败,发现监督悖论和上下文注入失败两种可复现漏洞。

Comments Accepted at the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN)

详情
AI中文摘要

多轮推理模型中的失败在终端评分评估中基本不可见。模型可能在长对话早期锁定不安全立场,但其最终轮拒绝率可能看起来与稳健对齐的基线无法区分。为了揭示这些隐藏的时间动态,我们提出了一种轨迹级诊断方法——CoT-Output 2x2安全矩阵。该框架沿两个独立轴(内部推理和可见输出)标记每一轮,产生四个操作定义的失败单元:稳健对齐、对齐伪装、显式越狱,以及我们称为上下文注入失败的不同失败模式(其中CoT保持安全推理,但可见输出产生危害,突出了多轮推理不忠实的表现)。我们在五个监督条件下针对固定攻击者评估了三个蒸馏推理目标,在信息危害场景上收集了6750个轮级观察。我们的分析揭示了两个可复现的漏洞:一个监督悖论,其中显式监控线索反而增加对齐伪装率而非抑制它;以及一个上下文注入失败,其中模型尽管内部状态安全却锁定不安全的外部输出。我们发布了多轮对话和CoT轨迹的完整数据集,以支持后续的轨迹诊断研究。

英文摘要

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic - the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined failure cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox where explicit monitoring cues paradoxically increase alignment-faking rates rather than suppress them, and a context-injection failure where models lock onto unsafe external outputs despite safe internal states. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.

2411.18714 2026-06-16 cs.RO cs.AI cs.LG 版本更新

Explainable deep learning improves human mental models of self-driving cars

可解释深度学习提升人类对自动驾驶汽车的心理模型

Eoin M. Kenny, Akshay Dharmavaram, Sang Uk Lee, Tung Phan-Minh, Shreyas Rajesh, Yunqing Hu, Laura Major, Momchil S. Tomov, Julie A. Shah

发表机构 * Computer Science & Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology(计算机科学与人工智能实验室(CSAIL),麻省理工学院) Motional AD Inc.(Motional AD公司) Department of Psychology and Center for Brain Science, Harvard University(心理学系和大脑科学中心,哈佛大学) Department of Aeronautics and Astronautics, Massachusetts Institute of Technology(航空与宇航系,麻省理工学院)

AI总结 提出概念包装网络(CW-Net),在真实自动驾驶车上实现可解释规划,通过因果性概念解释提升驾驶员对车辆行为的预测能力,尤其在意外场景中。

Comments MST & JAS contributed equally to this work

详情
AI中文摘要

自动驾驶汽车越来越依赖深度神经网络来实现类人驾驶。这种黑箱规划器的不透明性使得准确预测其何时会失败变得具有挑战性,可能带来灾难性后果。尽管关于解释这些系统的研究激增,但由于实际部署的困难,大部分研究局限于模拟或玩具设置,使得这些技术的实际效用未知。在此,我们引入概念包装网络(CW-Net),一种忠实解释基于机器学习的规划器行为的方法,该方法在不牺牲性能的情况下,将其推理因果地扎根于人类可解释的概念。我们在真实自动驾驶车上部署CW-Net,并表明由此产生的解释改善了人类驾驶员对车辆的心理模型,使他们能够更好地预测其行为,特别是在意外情况下。这表明,集成到自动驾驶汽车中的可解释深度学习在现实部署环境中既易于理解又有用。我们预计我们的方法可以应用于其他安全关键系统,如自主无人机和机器人外科医生,以及其他架构,如端到端学习系统和视觉-语言-动作模型。总体而言,我们的研究为自主代理的可解释性建立了一条经过部署验证的路径,这可能有助于使其更加透明和安全。

英文摘要

Self-driving cars increasingly rely on deep neural networks to achieve human-like driving. The opacity of such black-box planners makes it challenging to accurately anticipate when they will fail, with potentially catastrophic consequences. While research into interpreting these systems has surged, most of it is confined to simulations or toy setups due to the difficulty of real-world deployment, leaving the practical utility of such techniques unknown. Here, we introduce the Concept-Wrapper Network (CW-Net), a method for faithfully explaining the behavior of machine-learning-based planners that causally grounds their reasoning in human-interpretable concepts without sacrificing performance. We deploy CW-Net on a real self-driving car and show that the resulting explanations improve the human driver's mental model of the vehicle, allowing them to better predict its behavior, particularly in surprising situations. This demonstrates that explainable deep learning integrated into self-driving cars can be both understandable and useful in a realistic deployment setting. We anticipate our method could be applied to other safety-critical systems, such as autonomous drones and robotic surgeons, as well as to other architectures, such as end-to-end learning systems and vision-language-action models. Overall, our study establishes a deployment-validated pathway to interpretability for autonomous agents, which could help make them more transparent and safe.

2509.14959 2026-06-16 eess.AS cs.AI 版本更新

Discrete optimal transport is a strong audio adversarial attack

离散最优传输是一种强大的音频对抗攻击

Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan

发表机构 * University of Rochester(罗切斯特大学) Rochester Institute of Technology(罗切斯特理工学院)

AI总结 提出离散最优传输(DOT)作为黑盒攻击,通过分布对齐(使用WavLM嵌入和熵最优传输)显著降低说话人验证和反欺骗系统的性能,且无需模型参数或梯度。

详情
AI中文摘要

在本文中,我们研究了离散最优传输(DOT)作为针对现代自动说话人验证(ASV)和反欺骗对抗措施(CM)系统的黑盒攻击。我们的攻击作为一种后处理分布对齐步骤。使用熵最优传输和top-k重心投影,将生成语音(或其他人的语音)的帧级WavLM嵌入与未配对的真实语音池对齐,随后进行神经声码器处理。与基于梯度的攻击不同,所提出的方法无需访问模型参数、梯度或训练数据。在ASVspoof2019和ASVspoof5上的实验表明,DOT攻击显著提高了CM的等错误率(EER),并在多种欺骗攻击下显著降低了ASV性能。该攻击可跨数据集迁移,且在CM微调后仍然有效。通过说话人相似性、Fréchet音频距离和嵌入分布可视化的分析表明,DOT通过将源语音向表示空间的真实区域移动而非最大化说话人相似性来成功实施攻击。这些结果表明,基于最优传输的分布对齐代表了当代ASV和反欺骗系统的一个先前未被充分探索的攻击向量。

英文摘要

In this paper, we investigate discrete optimal transport (DOT) as a black-box attack against modern automatic speaker verification (ASV) and anti-spoofing countermeasure (CM) systems. Our attack operates as a post-processing distribution-alignment step. Frame-level WavLM embeddings of generated speech (or another person speech) are aligned to an unpaired bona fide speech pool using entropic optimal transport and a top-k barycentric projection, followed by neural vocoding. Unlike gradient-based attacks, the proposed method requires no access to model parameters, gradients, or training data. Experiments on ASVspoof2019 and ASVspoof5 demonstrate that DOT attack substantially increases CM EER and substantially degrades ASV performance across multiple spoofing attacks. The attack transfers across datasets and remains effective after CM fine-tuning. Analysis using speaker similarity, Fréchet Audio Distance, and visualization of embedding distributions suggests that DOT succeeds by shifting source speech toward bona fide regions of the representation space rather than by maximizing speaker similarity. These results indicate that optimal-transport-based distribution alignment represents a previously underexplored attack vector for contemporary ASV and anti-spoofing systems.

2510.06445 2026-06-16 cs.CL cs.AI cs.CR 版本更新

A Survey on Agentic Security: Applications, Threats and Defenses

关于智能体安全的综述:应用、威胁与防御

Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez

发表机构 * BRAC University(布拉克大学) Qatar Computing Research Institute (QCRI)(卡塔尔计算研究所)

AI总结 本文首次全面综述智能体安全领域,围绕应用、威胁与防御三大支柱,分类260余篇论文,分析攻击入口、防御策略及生命周期覆盖,指出智能体系统默认结构脆弱,需全生命周期防御。

详情
AI中文摘要

基于LLM的智能体现在被广泛应用于网络安全领域。尽管这些智能体促进了强大且自主的安全应用,但其自主性也开辟了新的攻击面,安全社区正在积极构建防御措施来保护它们。然而,关于这一主题的文献增长迅速且不均衡。现有综述孤立地处理应用、威胁和防御,未能统一阐述智能体的能力、漏洞和对抗措施如何相互关联。在这项工作中,我们首次对智能体安全格局进行了全面综述,围绕应用、威胁和防御三大基本支柱构建该领域。我们提供了一个包含260多篇论文的综合分类法,解释了智能体如何用于下游网络安全应用、智能体系统固有的威胁以及旨在保护它们的对抗措施。此外,我们提供了详细的支柱特定和交叉分析,展示了智能体应用的安全生命周期覆盖、红队与蓝队智能体之间的比较,以及红队应用的对抗性使用。在威胁方面,我们分析了攻击目标所针对的入口点和智能体循环阶段、它们对智能体设置的特异性以及它们假设的威胁模型。在防御方面,我们分析了主要的防御策略、它们的成本和安全性权衡,以及它们在智能体生命周期中的部署位置。我们进一步映射了哪些防御覆盖哪些攻击类别,并绘制了智能体架构、骨干模型使用、数据模态覆盖以及攻击和防御研究随时间增长的趋势。综合来看,这些发现表明智能体系统在默认情况下结构脆弱,保护它们将需要跨越整个智能体生命周期的防御,而不是单层修复。

英文摘要

LLM-based agents are now used throughout cybersecurity. While these agents facilitate powerful and autonomous security applications, their autonomy opens up new attack surfaces, and the security community is actively building defenses to secure them. Yet the literature on this subject has grown quickly and unevenly. Existing surveys treat applications, threats, and defenses in isolation, leaving no unified account of how an agent's capabilities, vulnerabilities, and countermeasures interconnect. In this work we present the first holistic survey of the agentic security landscape, structuring the field around the fundamental pillars of Applications, Threats and Defenses. We provide a comprehensive taxonomy of over 260 papers, explaining how agents are used in downstream cybersecurity applications, inherent threats to agentic systems, and countermeasures designed to protect them. In addition, we provide detailed pillar-specific and cross-cutting analyses that show the security-lifecycle coverage of agentic applications, comparison between red-teaming and blue-teaming agents, and the adversarial use of red-teaming applications. On the threat side, we analyze the entry points and agent-loop stages that attacks target, their specificity to the agentic setting, and the threat models they assume. On the defense side, we analyze the prevailing defense strategies, their cost and security trade-offs, and where in the agent lifecycle they are deployed. We further map which defenses cover which attack classes and chart trends in agent architecture, backbone model usage, data modality coverage, and the growth of attack and defense research over time. Taken together, these findings indicate that agentic systems are structurally fragile by default and that securing them will require defenses that span the full agent lifecycle rather than single-layer fixes.

2511.20710 2026-06-16 cs.CV cs.AI cs.CR 版本更新

Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

受神经启发的多模态视觉-语言模型对成员推断隐私泄露是否具有弹性?

David Amebley, Sayanton Dibbo

发表机构 * The University of Alabama(阿拉巴马大学) Alabama Center for the Advancement of AI(阿拉巴马人工智能 advancement 中心) Trustworthy AI Lab(可信人工智能实验室) Department of Computer Science, The University of Alabama(计算机科学系)

AI总结 研究受神经启发的多模态视觉-语言模型(VLM)对基于图像-文本的成员推断攻击的弹性,提出拓扑正则化框架,实验表明神经VLM在保持模型效用同时显著降低攻击成功率。

Comments Accepted at USENIX WOOT '26

详情
AI中文摘要

在智能体AI时代,多模态模型(MMs)的日益部署引入了新的攻击向量,可能导致MMs中敏感训练数据泄露,造成隐私泄露。本文研究了一种黑盒隐私攻击,即对多模态视觉-语言模型(VLMs)的成员推断攻击(MIA)。最先进的研究主要分析单模态AI-ML系统的隐私攻击,而最近的研究表明MMs也可能易受隐私攻击。尽管研究人员已证明生物启发的神经网络表示可以提高单模态模型对对抗攻击的弹性,但受神经启发的MMs是否对隐私攻击具有弹性仍未被探索。在这项工作中,我们引入了一个系统的神经科学启发的拓扑正则化(τ)框架,以分析MM VLMs对基于图像-文本的推断隐私攻击的弹性。我们使用三个VLM:BLIP、PaliGemma 2和ViT-GPT2,在三个基准数据集:COCO、CC3M和NoCaps上检验了这一现象。我们的实验比较了基线VLM和神经VLM(带有拓扑正则化)的弹性,其中τ>0配置定义了VLM的NEURO变体。我们在COCO数据集上使用BLIP模型的结果表明,NEURO VLM中MIA攻击成功率平均下降24%的ROC-AUC,同时在MPNet和ROUGE-2指标上实现了相似的模型效用(生成字幕与参考字幕之间的相似性)。这表明神经VLM相对更具隐私攻击弹性,同时不会显著牺牲模型效用。我们使用PaliGemma 2和ViT-GPT2模型在另外两个数据集CC3M和NoCaps上的广泛评估进一步验证了发现的一致性。这项工作有助于加深对MMs中隐私风险的理解,并为神经VLM的隐私威胁弹性提供了证据。

英文摘要

In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data in MMs, causing privacy leakage. This paper investigates a black-box privacy attack, i.e., membership inference attack (MIA) on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily to unimodal AI-ML systems, while recent studies indicate MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze MM VLMs resilience against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs: BLIP, PaliGemma 2, and ViT-GPT2, across three benchmark datasets: COCO, CC3M, and NoCaps. Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of VLM. Our results on the BLIP model using the COCO dataset illustrate that MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while achieving similar model utility (similarities between generated and reference captions) in terms of MPNet and ROUGE-2 metrics. This shows neuro VLMs are comparatively more resilient against privacy attacks, while not significantly compromising model utility. Our extensive evaluation with PaliGemma 2 and ViT-GPT2 models, on two additional datasets: CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence on neuro VLMs privacy threat resilience.

2512.19011 2026-06-16 cs.CR cs.AI cs.CL cs.LG 版本更新

Do You Really Need a GPU to Guard Your LLM? CPU-Class Classifiers and Multi-Stage Pipelines for Safety Enforcement at Scale

你真的需要GPU来保护你的LLM吗?用于大规模安全执行的CPU级分类器与多阶段流水线

Vasudev Majhi, Dhruv Gupta, Advait Singh, Matthew Barker, Dhruv Kumar

发表机构 * BITS Pilani(比斯帕利尼大学) Trustwise(Trustwise公司)

AI总结 本文研究CPU级分类器(如SVM、梯度提升树)在LLM输入安全检测中的性能,发现其与GPU模型互补,并设计三阶段流水线GuardChain,在80%的分布内查询中达到近峰值精度,降低部署成本。

Comments Under Review. 25 pages, 5 figures, 38 tables

详情
AI中文摘要

用于筛选LLM输入中越狱尝试的安全分类器已成为标准部署组件,但几乎所有生产系统都依赖基于GPU的模型:微调变换器和LLM-as-a-judge流水线。这些方法带来了显著的每查询延迟和基础设施成本。很少有研究探讨基于CPU的分类器(例如在TF-IDF特征上训练的支持向量机和梯度提升树)是否能在生产部署遇到的各种条件下匹配其准确性。我们评估了五个CPU分类器家族、基于SSM的GPU分类器Mamba-130M以及基于变换器的GPU模型(DeBERTa-v3和带LoRA的Gemma-2B),涵盖九个越狱来源和三种场景:分布内(D1)、分布外(D2)和对抗性混淆(D3)。在D1上,最佳CPU分类器以约五分之一的部署成本匹配最佳变换器GPU模型。在D2上,CPU分类器因自信的校准错误而失败,产生高置信度的假阴性,完全绕过升级。在D3上,CPU分类器在F1上比变换器GPU模型高出超过26个百分点。基于这些互补的失败模式,我们设计了GuardChain,一个三阶段安全流水线(正则表达式 -> CPU -> GPU),将每个提示路由到能够做出自信决策的最便宜阶段。仅CPU阶段就解决了80%的分布内提示,接近峰值精度,而GPU阶段恢复了分布外失败。对于大规模部署LLM安全的从业者,这项工作提供了证据,表明GPU级基础设施对于大多数流量是不必要的。

英文摘要

Safety classifiers that screen LLM inputs for jailbreak attempts have become standard deployment components, yet almost all production systems rely on GPU-based models: fine-tuned transformers and LLM-as-a-judge pipelines. These approaches impose significant per-query latency and infrastructure cost. Very little research has asked whether CPU-based classifiers, such as support vector machines and gradient-boosted trees trained on TF-IDF features, can match their accuracy across the conditions that production deployments encounter. We evaluate five CPU classifier families, Mamba-130M as an SSM-based GPU classifier, and transformer-based GPU models (DeBERTa-v3 and Gemma-2B with LoRA) across nine jailbreak sources and three regimes: in-distribution (D1), out-of-distribution (D2), and adversarially obfuscated (D3). On D1, the best CPU classifier matches the best transformer GPU model at roughly one-fifth the deployment cost. On D2, CPU classifiers fail via confident miscalibration, producing high-confidence false negatives that bypass escalation entirely. On D3, CPU classifiers outperform transformer GPU models by more than 26 percentage points in F1. Based on these complementary failure modes, we design GuardChain, a three-stage safety pipeline (Regex -> CPU -> GPU) that routes each prompt to the cheapest stage capable of a confident decision. The CPU stage alone resolves 80\% of in-distribution prompts at near-peak accuracy, and the GPU stage recovers the out-of-distribution failures. For practitioners deploying LLM safety at scale, this work provides evidence that GPU-class infrastructure is unnecessary for the majority of traffic.

2602.09222 2026-06-16 cs.CR cs.AI 版本更新

MUZZLE: Adaptive Agentic Red-Teaming of Web Agents Against Indirect Prompt Injection Attacks

MUZZLE: 针对间接提示注入攻击的自适应智能体红队测试框架

Georgios Syros, Evan Rose, Brian Grinstead, Christoph Kerschbaumer, William Robertson, Cristina Nita-Rotaru, Alina Oprea

发表机构 * Northeastern University(东北大学) Mozilla Corporation(Mozilla公司)

AI总结 提出MUZZLE框架,利用智能体轨迹自动识别高显著性注入面,自适应生成上下文相关的恶意指令,评估网络智能体对间接提示注入攻击的安全性,发现44种新攻击和跨应用攻击策略。

详情
AI中文摘要

基于大型语言模型的网络智能体正越来越多地被部署来自动化复杂的在线任务,通过直接与网站交互并代表用户执行操作。尽管这些智能体提供了强大的能力,但其设计使它们暴露于嵌入在不可信网络内容中的间接提示注入攻击,使对手能够劫持智能体行为并违反用户意图。尽管对这一威胁的认识日益增强,现有评估依赖于固定的攻击模板、手动选择的注入表面或范围狭窄的场景,限制了它们捕捉实际中遇到的现实自适应攻击的能力。我们提出了MUZZLE,一个自动化的智能体框架,用于评估网络智能体对间接提示注入攻击的安全性。MUZZLE利用智能体的轨迹自动识别高显著性注入表面,并自适应生成上下文相关的恶意指令,针对机密性、完整性和可用性的违反。与先前方法不同,MUZZLE根据观察到的智能体执行轨迹调整其攻击策略,并利用失败执行的反馈迭代改进攻击。我们在多种网络应用、用户任务和智能体配置上评估MUZZLE,展示了其以最少人工干预自动且自适应地评估网络智能体安全性的能力。我们的结果表明,MUZZLE在4个网络应用上针对10个违反机密性、可用性或隐私属性的对抗目标,在不同LLM和智能体框架下有效发现了44种新攻击。MUZZLE还识别了新颖的攻击策略,包括3种跨应用提示注入攻击和一种针对智能体的钓鱼场景。

英文摘要

Large language model (LLM) based web agents are increasingly deployed to automate complex online tasks by directly interacting with web sites and performing actions on users' behalf. While these agents offer powerful capabilities, their design exposes them to indirect prompt injection attacks embedded in untrusted web content, enabling adversaries to hijack agent behavior and violate user intent. Despite growing awareness of this threat, existing evaluations rely on fixed attack templates, manually selected injection surfaces, or narrowly scoped scenarios, limiting their ability to capture realistic, adaptive attacks encountered in practice. We present MUZZLE, an automated agentic framework for evaluating the security of web agents against indirect prompt injection attacks. MUZZLE utilizes the agent's trajectories to automatically identify high-salience injection surfaces, and adaptively generate context-aware malicious instructions that target violations of confidentiality, integrity, and availability. Unlike prior approaches, MUZZLE adapts its attack strategy based on the agent's observed execution trajectory and iteratively refines attacks using feedback from failed executions. We evaluate MUZZLE across diverse web applications, user tasks, and agent configurations, demonstrating its ability to automatically and adaptively assess the security of web agents with minimal human intervention. Our results show that MUZZLE effectively discovers 44 new attacks on 4 web applications with 10 adversarial objectives that violate confidentiality, availability, or privacy properties across different LLMs and agent scaffolds. MUZZLE also identifies novel attack strategies, including 3 cross-application prompt injection attacks and an agent-tailored phishing scenario.

2604.17805 2026-06-16 cs.LG cs.AI cs.GT 版本更新

Ranking Abuse via Strategic Pairwise Data Perturbations

通过策略性成对数据扰动进行排名滥用

Junyi Yao, Zihao Zheng, Jiayu Long

发表机构 * Computational Decision Systems Report GitHub Issue(计算决策系统报告GitHub问题) GitHub Issue(GitHub问题)

AI总结 研究基于最大似然估计的成对排名系统对策略性数据扰动的脆弱性,提出自适应子集选择攻击(ASSA)方法,实验表明少量扰动即可显著改变全局排名。

详情
AI中文摘要

基于最大似然估计(MLE)的成对排名系统,如Bradley-Terry模型,被广泛用于从成对比较中聚合偏好。然而,它们在策略性数据操纵下的鲁棒性仍未被充分理解。在本文中,我们研究了基于MLE的排名系统对对抗性扰动的脆弱性。我们将操纵任务形式化为一个受约束的组合优化问题,并提出了一种自适应子集选择攻击(ASSA)来高效识别高影响力的扰动。在合成数据和真实世界选举数据集上的实验结果表明,基于MLE的排名表现出尖锐的相变行为:在超过一个小的扰动预算后,有限数量的策略性投票者可以显著改变全局排名。特别是,我们的方法在受约束的预算下始终优于随机和贪婪基线。这些发现揭示了基于MLE的排名机制对结构化扰动的基本敏感性,并强调了在集体决策系统中需要更鲁棒的聚合方法。

英文摘要

Pairwise ranking systems based on Maximum Likelihood Estimation (MLE), such as the Bradley-Terry model, are widely used to aggregate preferences from pairwise comparisons. However, their robustness under strategic data manipulation remains insufficiently understood. In this paper, we study the vulnerability of MLE-based ranking systems to adversarial perturbations. We formulate the manipulation task as a constrained combinatorial optimization problem and propose an Adaptive Subset Selection Attack (ASSA) to efficiently identify high-impact perturbations. Experimental results on both synthetic data and real-world election datasets show that MLE-based rankings exhibit a sharp phase-transition behavior: beyond a small perturbation budget, a limited number of strategic voters can significantly alter the global ranking. In particular, our method consistently outperforms random and greedy baselines under constrained budgets. These findings reveal a fundamental sensitivity of MLE-based ranking mechanisms to structured perturbations and highlight the need for more robust aggregation methods in collective decision-making systems.

2605.00924 2026-06-16 cs.LG cs.AI 版本更新

StyleShield: Exposing the Fragility of AIGC Detectors through Continuous Controllable Style Transfer

StyleShield: 通过连续可控风格迁移揭示AIGC检测器的脆弱性

Guantian Zheng

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出StyleShield,一种基于流匹配的条件文本风格迁移框架,通过连续控制风格迁移强度,在保持语义相似度的同时实现高规避率,并引入RateAudit算法质疑基于分数的检测可靠性。

Comments 12 pages, 5 figures. Code and model weights will be released upon acceptance

详情
AI中文摘要

AI生成内容(AIGC)检测器越来越多地部署在学术诚信筛查等高风险场景中,然而其可靠性依赖于一个基本悖论:随着语言模型在人类编写的语料库上训练,AI与人类写作之间的统计边界将不可避免地消失。商业激励进一步扭曲了这一格局——检测服务和“去AI化”工具通常在同一供应链中运作,用对内容来源的判断取代了对内容质量的评估。我们提出了StyleShield,这是第一个用于条件文本风格迁移的流匹配框架,通过DiT骨干网络和零初始化交叉注意力适配器,直接在连续token嵌入空间中操作,并以冻结的Qwen-7B表示为条件。在推理时,我们将图像合成中的SDEdit范式适配到文本嵌入,通过单个参数gamma提供对规避-保留权衡的平滑连续控制。在一个多领域中文基准测试中,StyleShield对训练检测器实现了94.6%的规避率,对三个未见检测器实现了≥99%的规避率,同时保持了0.928的语义相似度。我们进一步引入了RateAudit,一种文档级调度算法,证明检测率判定可以设置为任意值,直接质疑了基于分数评估的可靠性。

英文摘要

AI-generated content (AIGC) detectors are increasingly deployed in high-stakes settings such as academic integrity screening, yet their reliability rests on a fundamental paradox: as language models are trained on human-written corpora, the statistical boundary between AI and human writing will inevitably dissolve as models improve. Commercial incentives have further distorted this landscape -- detection services and "de-AIification" tools often operate within the same supply chain, replacing evaluation of content quality with judgment of content origin. We present StyleShield, the first flow matching framework for conditional text style transfer, operating directly in continuous token embedding space via a DiT backbone with zero-initialized cross-attention adapters conditioned on frozen Qwen-7B representations. At inference, we adapt the SDEdit paradigm from image synthesis to text embeddings, with a single parameter gamma providing smooth continuous control over the evasion-preservation trade-off. On a multi-domain Chinese benchmark, StyleShield achieves 94.6% evasion against the training detector and >=99% against three unseen detectors, maintaining 0.928 semantic similarity. We further introduce RateAudit, a document-level scheduling algorithm that demonstrates detection-rate verdicts can be set to arbitrary values, directly questioning the reliability of score-based evaluation.

2605.06738 2026-06-16 cs.CR cs.AI 版本更新

Trust Without Trusting: A Recomputable Trust Protocol for Autonomous Agents

无需信任的信任:面向自主智能体的可重算信任协议

Lars Kersten Kroehl

发表机构 * MolTrust / CryptoKRI GmbH(MolTrust/加密KRI GmbH)

AI总结 提出组合证据协议(CEP),通过五条件谓词和锚定数据重算,使任何方都能独立验证边界所有者是否遵循其公开规则,解决了开放代理世界中依赖他人边界时的信任验证问题。

Comments 18 pages, 5 figures. v2: substantial revision, reframed around recomputable accountability (Combined Evidence Protocol); adds figures, code listings, and deployment evidence. Supersedes v1 (From Specification to Deployment)

详情
AI中文摘要

自主AI代理已经在生产规模上进行交易——在单一市场上,有69,000个机器人、1.65亿笔交易、5000万美元的交易量——任何一方都可以在没有中心服务的情况下验证签名凭证。在覆盖信任大部分需求的开放代理世界中,没有通用边界,每一方自行选择与谁交易。边界仅出现在封闭空间划定之处——市场、平台或联盟制定内部规则。划定边界者拥有应用边界的权力,并可能闭门按其意愿应用。本文解决了由此产生的空白:当你依赖他人的边界时,如何检查他们是否应用了自己发布的规则——不轻信任何人的话,也不将检查交给新的可信方?我们的答案是组合证据协议(CEP):一个五条件谓词,任何一方都可以从锚定数据重新计算,将“边界所有者是否遵循其自身的准入规则”转化为任何人都可验证的事实,而非任何人相信的主张。保障乐观汇总的安全机制同样保障了这一点——正确性依赖于重算,因此度量属于每个人,预言机问题得以解决。其承载场景是一个由平等、互不信任的同行组成的联盟,在共享章程下,每方都能独立验证他们共同同意的规则正在被应用。CEP属于无需信任系统家族——乐观和零知识汇总、可验证机器学习、自主主权身份谓词。其底层基础设施已上线:自2026年3月起运行的一个W3C VC + DID信任层,锚定在Base L2上,延续arXiv:2605.06738并独立运行。

英文摘要

Autonomous AI agents already transact at production scale -- 69,000 bots, 165 million transactions, $50 million in volume on a single marketplace -- and any party can verify a signed credential without a central service. In an open agent world that covers most of what trust requires: there are no universal borders, and each party chooses for itself whom to deal with. Borders appear only where a closed space draws one -- a marketplace, a platform, or a consortium sets house rules. Whoever draws the border holds the authority to apply it, and may apply it as they choose, behind closed doors. This paper addresses the gap that opens there: when you rely on someone else's border, how do you check that they applied their own published rules -- taking no one's word for it, and handing the check to no new trusted party? Our answer is the Combined Evidence Protocol (CEP): a five-condition predicate any party recomputes from anchored data, turning "did the boundary-owner follow its own admission rules" into a fact anyone verifies rather than a claim anyone believes. The move that secures optimistic rollups secures this -- correctness rests on recomputation, so the measurement belongs to everyone and the oracle problem dissolves. Its load-bearing setting is a consortium of co-equal, mutually distrusting peers under a shared charter, each able to verify, independently, that the rules they jointly agreed are the rules being applied. CEP belongs to the family of trustless systems -- optimistic and zero-knowledge rollups, verifiable ML, self-sovereign-identity predicates. The infrastructure beneath it is live: a W3C VC + DID trust layer running since March 2026, anchored on Base L2, continuing arXiv:2605.06738 and standing on its own.

2605.11047 2026-06-16 cs.CR cs.AI 版本更新

Red-Teaming Agent Execution Contexts: Open-World Security Evaluation on OpenClaw

红队代理执行上下文:OpenClaw上的开放世界安全评估

Hongwei Yao, Yiming Liu, Yiling He, Bingrun Yang

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文提出DeepTrap框架,通过黑盒轨迹优化发现OpenClaw中的上下文漏洞,展示上下文妥协可引发安全风险,强调需执行中心的安全评估。

Comments Accepted to ICML 2026 Workshop

详情
AI中文摘要

代理语言模型系统越来越多地依赖可变的执行上下文,包括文件、内存、工具、技能和辅助制品,从而产生超出显式用户提示的安全部署风险。本文提出了DeepTrap,一个自动框架,用于发现OpenClaw中的上下文漏洞。DeepTrap将对抗性上下文操纵建模为黑盒轨迹级优化问题,平衡风险实现、良性任务保留和隐蔽性。它结合了风险条件评估、多目标轨迹评分、奖励引导的束搜索和基于反射的深度探测,以识别高价值的受侵上下文。我们构建了一个包含42个案例的基准,涵盖六类漏洞和七个操作场景,并使用攻击和效用评分评估了九个目标模型。结果表明,上下文妥协可以诱导显著的不安全行为,同时保持用户面向任务的完成,证明最终响应评估是不足的。研究结果强调了对代理AI系统执行中心安全评估的必要性。我们的代码已发布在:https://github.com/ZJUICSR/DeepTrap

英文摘要

Agentic language-model systems increasingly rely on mutable execution contexts, including files, memory, tools, skills, and auxiliary artifacts, creating security risks beyond explicit user prompts. This paper presents DeepTrap, an automated framework for discovering contextual vulnerabilities in OpenClaw. DeepTrap formulates adversarial context manipulation as a black-box trajectory-level optimization problem that balances risk realization, benign-task preservation, and stealth. It combines risk-conditioned evaluation, multi-objective trajectory scoring, reward-guided beam search, and reflection-based deep probing to identify high-value compromised contexts. We construct a 42-case benchmark spanning six vulnerability classes and seven operational scenarios, and evaluate nine target models using attack and utility grading scores. Results show that contextual compromise can induce substantial unsafe behavior while preserving user-facing task completion, demonstrating that final-response evaluation is insufficient. The findings highlight the need for execution-centric security evaluation of agentic AI systems. Our code is released at: https://github.com/ZJUICSR/DeepTrap

2605.25796 2026-06-16 cs.CR cs.AI cs.CL 版本更新

SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness

SAMark: 一种具有段落级释义鲁棒性的自锚文本水印

Jiahao Huo, Wenjie Qu, Yibo Yan, Kening Zheng, Jiaheng Zhang, Xuming Hu, Philip S. Yu, Mingxun Zhou

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出SAMark自锚水印框架,通过建立语义空间中与句子顺序无关的逐步独立绿色区域,结合多通道双曲评分机制和多样性感知过滤策略,在段落级释义攻击下实现高检测率并打破鲁棒性-质量权衡。

详情
AI中文摘要

语义级水印通过将句子作为基本单元,提高了对文本修改的鲁棒性。然而,对段落级释义的鲁棒性仍然困难,因为此类攻击通过改变句子顺序全局性地破坏水印信号。在这项工作中,我们提出了SAMark,一种自锚水印框架,通过建立语义空间中与步骤无关的绿色区域,消除了对句子顺序的依赖。为了提高可检测性,我们引入了一种多通道双曲评分机制,该机制在放大水印信号的同时抑制来自弱对齐候选的噪声。我们进一步提出了一种多样性感知过滤策略,将硬过滤与软正则化相结合,超越了简单的n-gram重复过滤器,以解决语义冗余问题。实验结果表明,在典型的段落级释义攻击下,SAMark实现了高达90.2%的TP@FP1%,平均比最强先前基线高出30%以上,同时保持了与未水印文本相竞争的生成本质量,并打破了限制先前方法的鲁棒性-质量权衡。

英文摘要

Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustness to paragraph-level paraphrasing remains difficult because such attacks globally disrupt watermark signals by changing sentence order. In this work, we propose SAMark, a self-anchored watermarking framework that removes the dependency on sentence order by establishing a step-independent green region in semantic space. To improve detectability, we introduce a multi-channel hyperbolic scoring mechanism that amplifies watermark signals while suppressing noise from weakly aligned candidates. We further propose a diversity-aware filtering strategy that combines hard filtering with soft regularization, extending beyond simple n-gram repetition filters to address semantic redundancy. Experimental results show that SAMark achieves up to 90.2% TP@FP1% under typical paragraph-level paraphrasing attacks, outperforming the strongest prior baseline by more than 30% on average, while maintaining generation quality competitive with unwatermarked text and breaking the robustness-quality trade-off that limits prior methods.

2605.26595 2026-06-16 cs.CR cs.AI cs.LG 版本更新

Cordyceps: Covert Control Attacks on LLMs via Data Poisoning

Cordyceps: 通过数据投毒对LLM的隐蔽控制攻击

Zedian Shao, Charles Fleming, Teodora Baluta

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Cisco Systems(思科系统)

AI总结 提出一种数据投毒方法,通过语义关联教LLM隐藏任意恶意指令,实现隐蔽控制攻击,绕过多种防御。

Comments USENIX Security '26

详情
AI中文摘要

大型语言模型(LLM)通常在没有经过精心筛选的文本数据集上进行微调,而对手可以对这些数据集进行投毒。现有的投毒攻击主要依赖于固定的触发短语,而异常检测、干净数据正则化或在线监控等防御措施可以中和这些触发短语。在本文中,我们提出了一种数据投毒方法,通过共享知识(如事实或概念)与攻击者选择的短语之间的语义关联,可靠且隐蔽地教LLM一种信息隐藏方案。诱导的隐藏方案可以编码和解码任意恶意指令,从而揭示了一种新的、微妙的投毒诱导漏洞:隐蔽控制攻击。我们精确描述了隐蔽控制攻击,并在5个LLM、3个后门防御和4个提示注入防御上进行了评估。在少量投毒样本的情况下,隐蔽控制攻击在平均攻击成功率上比基于启发式的提示注入攻击高出约40%(相对于干净微调模型)。它们还绕过了基于检测和微调的防御,在后门防御后保持高达93%的攻击成功率,在提示注入防御后保持高达98%的攻击成功率。

英文摘要

Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we propose a data poisoning method that teaches an LLM an information hiding scheme reliably and stealthily through semantic associations between shared knowledge such as facts or concepts and attacker-chosen phrases. The induced hiding scheme can encode and decode arbitrary malicious instructions, thus revealing a new and subtle poisoning-induced vulnerability: covert control attacks. We precisely characterize covert control attacks and evaluate them across $5$ LLMs, $3$ backdoor defenses, and $4$ prompt injection defenses. With a small poisoned fraction, covert control attacks outperform heuristic-based prompt injection attacks in average attack success rate by about $40\%$ relative to clean fine-tuned models. They also circumvent defenses based on detection and fine-tuning, maintaining up to $93\%$ attack success rate after backdoor defenses and up to $98\%$ after prompt injection defenses.

2606.03489 2026-06-16 cs.CR cs.AI 版本更新

Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs

从错误中学习:面向安全代码LLM的树状自博弈

Wenqi Chen, Ziyan Zhang, Bin Wang, Lin Liu, Hengheng Zhang, Zhengsu Chen

发表机构 * GitHub

AI总结 提出树状自博弈(TSP)框架,将安全代码生成建模为细粒度序列决策过程,通过构建决策树探索安全与脆弱路径,使模型在关键决策节点自我纠正,显著提升代码安全性并实现跨语言泛化。

Comments 18 pages, 3 figures, Accepted by ICML 2026

详情
AI中文摘要

尽管大型语言模型(LLM)在代码生成方面表现出色,但它们仍然容易复制训练数据中固有的细微但关键的安全漏洞。当前的校准技术,如监督微调(SFT)和强化学习(RL),通常在序列级别应用粗粒度的优化。这种方法往往无法解决安全缺陷的局部性,即单个错误的token选择可能危及整个程序。为了弥合这一差距,我们引入了树状自博弈(TSP),一个将安全代码生成重新定义为细粒度序列决策过程的框架。与盲目最大化似然的标准方法不同,TSP构建了一个决策树,模型在其中探索分支轨迹——同时生成安全的“黄金路径”和易受攻击的变体。通过将代码生成视为自博弈游戏,模型学会严格区分自身的局部错误。这提供了一个密集的、在策略的学习信号,迫使模型在通常出现漏洞的关键决策节点进行自我纠正。我们的实验表明,TSP从根本上提高了模型的可靠性。在Python安全基准测试中,TSP将CodeLlama-7B的通过率(SPR@1)提升至75.8%,显著优于SFT(57.0%)和非结构化自博弈基线。关键的是,TSP引发了鲁棒的分布外泛化:模型不仅将未见类别(CWE)中的漏洞减少了24.5%,还成功将从C/C++学到的安全原则迁移到多种语言,包括Python、Go和JavaScript。这表明TSP不仅仅是记忆补丁,而是内化了抽象的、与语言无关的安全逻辑。

英文摘要

While Large Language Models (LLMs) excel in code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic to their training data. Current alignment techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that blindly maximize likelihood, TSP constructs a decision tree where the model explores branching trajectories--generating both secure "golden paths" and vulnerable variants. By treating code generation as a self-play game, the model learns to strictly discriminate against its own localized errors. This provides a dense, on-policy learning signal that forces self-correction precisely at the critical decision nodes where vulnerabilities typically emerge. Our experiments demonstrate that TSP fundamentally enhances model reliability. In Python security benchmarks, TSP boosts CodeLlama-7B's pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP induces robust out-of-distribution generalization: the model not only reduces vulnerabilities in unseen categories (CWEs) by 24.5% but also successfully transfers security principles learned from C/C++ to diverse languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches, but internalizes abstract, language-agnostic security logic.

2606.04145 2026-06-16 cs.LG cs.AI cs.DC 版本更新

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

EvalStop:利用世界反馈检测和纠正多租户RLHF平台中的奖励过度优化

Guilin Zhang, Chuanyi Sun, Kai Zhao, Xu Chu, Shahryar Sarkani, John M. Fossaceca

发表机构 * DeepMind, London, UK(深度Mind, 英国伦敦) University of Cambridge, UK(英国剑桥大学) University of Washington, USA(美国华盛顿大学)

AI总结 提出EvalStop调度原语,通过检测评估分数连续下降来终止作业、释放GPU并保留最佳检查点,以纠正奖励过度优化,在RLHF负载上实现高精度检测并提升JCT。

详情
AI中文摘要

云LLM微调平台越来越多地服务于RLHF工作负载,其中学习到的奖励模型作为人类质量的代理被优化。正如Gao等人(2023)所示,在持续优化压力下,该代理与世界反馈(下游评估指标)发生偏离,这种现象称为奖励过度优化。现有的平台调度器忽略这种偏离:非预见性调度器优化JCT而不考虑任何质量信号,SLAQ式质量感知调度器使用训练损失(一个单调下降的较弱代理,可通过黑客攻击降低),而经典的每作业早停需要人工监控且不释放共享GPU。我们提出EvalStop,一个可组合的调度原语,它在连续k次评估分数下降时终止作业,释放GPU,保留最佳检查点,并委托给任何基础调度器。我们将调度器级别的早停视为检测问题,并在一个离散事件模拟器中评估它,该模拟器的RLHF工作负载混合了奖励黑客攻击和结构健康运行,真实标签对调度器隐藏。在RLHF密集型负载(80% RLHF,64 GPU)上,EvalStop实现了精确率98%、召回率99%、假阳性率1.5%,同时相比SRTF-Est将JCT提高了9%,将浪费的计算减少了22%(p<0.05)。简单的固定进度和损失平台竞争对手要么在健康RLHF上产生65%的假阳性率,要么错过超过一半的真实黑客攻击案例。增益在所有测试的基础调度器上均成立(JCT提升9-25%),且检测质量在评估噪声(噪声标准差≤0.05时精确率至少91%)和黑客攻击基础率(黑客攻击比例20-80%时精确率至少89%)下保持稳定。

英文摘要

Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).

2606.07678 2026-06-16 cs.LG cs.AI 版本更新

DOG-DPO:Dynamic Optimization in Geometry for Safety Alignment

DOG-DPO:几何中的动态优化用于安全对齐

Yi Nian, Tiankai Yang, Yudi Zhang, Qi Pan, Zelong Xu, Shenzhe Zhu, Qingqing Luan, Yue Huang, Xiangliang Zhang, Yue Zhao

发表机构 * University of Southern California(南加州大学) Iowa State University(爱荷华州立大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) UT Austin(德克萨斯大学奥斯汀分校) Independent Researcher(独立研究员) University of Notre Dame(圣母大学)

AI总结 提出DOG-DPO框架,将偏好对表示为模型表示空间中的方向,通过几何分解和多样性覆盖选择子集,仅用11%数据即可恢复大部分安全增益。

详情
AI中文摘要

大型语言模型的安全对齐依赖于偏好数据,但当前的流水线通常训练于大规模冗余数据集。现有的数据选择方法通常独立地对每个偏好对评分,将方向性偏好信息压缩为标量质量或多样性分数。这种以样本为中心的视角在多数据集设置中尤其受限,其中共享的安全方向与数据集特定的残余风险共存。我们提出DOG-DPO,一种无需训练的数据选择框架,将偏好对视为结构化几何信号。DOG-DPO首先将每个偏好对表示为模型表示空间中的一个方向。然后,它将多数据集偏好几何分解为全局锚点子空间和数据集特定的残余子空间。最后,它通过最大化基于多样性的覆盖来选择子集,鼓励在DPO训练前广泛、非冗余地覆盖对齐方向。在六个安全基准和两个模型骨干上,DOG-DPO仅使用11%的偏好对就实现了强大的效用-鲁棒性权衡。它恢复了全数据训练的大部分安全增益,同时完全无需教师、无需训练,并且比代表性选择基线快得多。

英文摘要

Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing directional preference information into scalar quality or diversity scores. This sample-centric view is especially limiting in multi-dataset settings, where shared safety directions coexist with dataset-specific residual risks. We propose DOG-DPO, a training-free data selection framework that treats preference pairs as structured geometric signals. DOG-DPO first represents each preference pair as a direction in model representation space. It then decomposes multi-dataset preference geometry into a global anchor subspace and dataset-specific residual subspaces. Finally, it selects subsets by maximizing diversity-based coverage, encouraging broad, non-redundant coverage of alignment directions before DPO training. Across six safety benchmarks and two model backbones, DOG-DPO achieves a strong utility-robustness trade-off using only 11% of the preference pairs. It recovers most of the safety gains of full-data training while remaining entirely teacher-free, training-free, and substantially faster than representative selection baselines.

2606.10456 2026-06-16 cs.CR cs.AI 版本更新

The Distributed Detectability Band Against Marginal-Preserving Attacks

针对边际保持攻击的分布式可检测性带

Zhang Qinqin, Gao Yuze

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对AI监控的边际保持攻击,通过高斯Copula AR(1)构造将危害编码在时间相关性中,证明分布形状监控器失效而时间相关性监控器有效,形成非空可检测性带。

Comments 10 pages, 11 figures

详情
AI中文摘要

AI控制监控器对个体智能体动作进行评分以检测异常行为,但实际危害可能分布在许多看似良性的步骤中,每个步骤单独低于任何每步警报。我们使用高斯Copula AR(1)构造了一种边际保持、相关性编码的分布式破坏攻击:每步监控器评分边际完全等于良性,因此均值、最大值、top-k尾部及阈值监控器(监控器A)被构造性地击败,而危害被编码在时间相关结构中。我们围绕三个审稿人要求的门组织论文。(1)可实现性门:隐秘攻击在所有测试危害水平(最高3.0)下与良性的KS距离为0.013(实际为零),证实危害完全与每步边际解耦,且可实现性不受危害限制。(2)监控器A与B的调和:我们形式化证明,针对监控器A的评分边际构建的攻击,在另一种评分监控器B(相关性/序列族:CUSUM、SPRT、HMM-LR、游程检验、自相关、窗口逻辑回归)下仍保持边际保持,并将最坏情况声明限定在允许时间特征的评分函数上。(3)非空可检测性带:监控器A的AUC为0.52(随机);在相同1%假阳性率目标下,监控器B的AUC范围为0.79-0.97,且当危害分摊到更多步骤时,监控器A降至随机水平,而监控器B保持AUC约0.95。这些结果证明了非空可检测性带,并刻画了亚阈值破坏前沿:分布形状监控器被构造性击败;时间相关性监控器可检测但并非平凡最优。

英文摘要

AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps, each individually below any per-step alarm. We construct a marginal-preserving, correlation-encoded distributed-sabotage attack using a Gaussian-copula AR(1) construction: the per-step monitor-score marginal is held exactly equal to benign, so mean, max, top-k tail, and threshold monitors (Monitor A) are defeated by construction, while harm is encoded in the temporal correlation structure. We sequence the paper around three reviewer-mandated gates. (1) Realizability gate: the stealthy attack achieves KS-distance to benign of 0.013 (effectively zero) at all tested harm levels up to 3.0, confirming that harm is fully decoupled from the per-step marginal and realizability is not harm-limited. (2) Monitor-A-vs-B reconciliation: we show formally that the attack, built against Monitor A's score marginal, remains marginal-preserving under a different-score Monitor B (the correlation/sequence family: CUSUM, SPRT, HMM-LR, runs test, autocorrelation, windowed logistic), and scope worst-case claims to score functions that admit a temporal signature. (3) Non-empty detectability band: Monitor A achieves AUC 0.52 (chance); Monitor B spans AUC 0.79-0.97 at the same 1% FPR target, and as harm is amortized over more steps Monitor A collapses to chance while Monitor B holds at AUC ~0.95. These results demonstrate a non-empty detectability band and characterize the sub-threshold sabotage frontier: distribution-shape monitors fail by construction; temporal-correlation monitors can detect but are not trivially optimal.

2606.14027 2026-06-16 cs.CR cs.AI cs.CL cs.SY eess.SY 版本更新

Same-Origin Policy for Agentic Browsers

代理浏览器的同源策略

Xilong Wang, Xiaoxing Chen, Patrick Li, Dawn Song, Neil Gong

发表机构 * Duke University(杜克大学) Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校)

AI总结 研究代理浏览器中同源策略的有效性,构建SOPBench评估基准,发现现有代理浏览器频繁违反SOP,并提出SOPGuard机制来强制执行SOP,同时保持效用和低开销。

详情
AI中文摘要

代理浏览器将自主AI代理集成到Web浏览器中,使用户能够通过自然语言指令完成Web任务。同源策略(SOP)是一种基本的浏览器安全机制,可防止由脚本引起的未经授权的自动化跨源数据流。然而,SOP在代理浏览器中是否仍然有效是一个尚未系统研究的开放问题。在这项工作中,我们填补了这一空白。我们首先观察到,代理浏览器本身可以作为跨源数据流的自动化通道,可能导致SOP违规。为了研究这一现象,我们构建了SOPBench,一个用于评估代理浏览器中SOP违规的基准。我们的评估表明,现有的代理浏览器在良性设置和攻击下都频繁违反SOP。为了解决这个问题,我们提出了SOPGuard,一种针对代理浏览器定制的SOP强制机制。我们在开源代理浏览器BrowserOS中实现了SOPGuard。广泛的评估表明,SOPGuard在保持效用的同时有效地强制执行SOP,并且仅产生很小的运行时开销。我们的代码和数据可在以下网址获取:https://this https URL。

英文摘要

Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that prevents unauthorized automated cross-origin data flows induced by scripts. However, whether SOP remains effective in agentic browsers is an open question that has not been systematically studied. In this work, we bridge this gap. We first observe that an agentic browser can itself serve as an automated channel for cross-origin data flows, potentially leading to SOP violations. To investigate this phenomenon, we construct SOPBench, a benchmark for evaluating SOP violations in agentic browsers. Our evaluation shows that existing agentic browsers frequently violate SOP, both in benign settings and under attacks. To address this problem, we propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers. We implement SOPGuard in BrowserOS, an open-source agentic browser. Extensive evaluations demonstrate that SOPGuard effectively enforces SOP while preserving utility and incurring only a small runtime overhead. Our code and data are available at https://github.com/wxl-lxw/BrowserOS-SOPGuard.

9. 评测、基准与数据集 100 篇

2606.15029 2026-06-16 cs.AI 新提交

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

Metric Match:一种评估LLM评判可靠性的子集选择方法

Alyssa Unell, Natalie Dullerud, Naomi Boneh, Meena Jagadeesan, Tatsu Hashimoto, Nigam Shah, Sanmi Koyejo

发表机构 * Stanford University(斯坦福大学)

AI总结 提出Metric Match方法,通过选择少量样本进行人工标注,以子集匹配总体可靠性指标,从而高效估计LLM评判的可靠性,实验表明在15个数据集上平均估计误差降低18.7%,标注需求减少32.5%。

详情
AI中文摘要

LLM评判被用于减少评估开放文本生成时对昂贵人工劳动的需求。然而,这些评判的可靠性关键取决于它们与人类评分者的一致性——这一属性本身依赖于昂贵的人工标注。在这项工作中,我们开发了一种方法(Metric Match),用于从有限标注中估计LLM评判的基于相关性的可靠性指标。Metric Match选择一部分样本进行人工标注,使得该子集在获取的合成标签方面与总体可靠性指标匹配。我们通过实验表明,在四种不同的相关性指标和15个数据集上,Metric Match相对于随机子集选择的胜率为0.838,平均估计误差降低18.7%,标注需求减少32.5%。我们提供了一个成本模型,并强调了一个医学案例研究,在该案例中,与随机选择相比,我们的方法为专家标注节省了1,041.67美元。此外,我们将任务从可靠性估计转变为可靠性分类,即判断给定评判是否高于部署阈值,使用Metric Match优于随机选择。所有项目代码公开可用,我们还提供了一个可安装的包以便使用。

英文摘要

LLM judges are used to reduce the need for costly human labor in evaluating open-ended text generation. However, the reliability of these judges depends critically on their alignment with human raters -- a property that itself depends on costly human annotations. In this work, we develop a method (Metric Match) for estimating correlation-based reliability metrics of LLM judges from limited annotations. Metric Match selects a subset of samples for human annotation such that the subset matches the population reliability metric with respect to acquired synthetic labels. We empirically show that Metric Match achieves a win-rate of 0.838 against random subset selection across four different correlation metrics and 15 datasets, with an 18.7% decrease in average estimation error and reduces annotation needs by 32.5%. We provide a cost model and highlight a medical case study where our method saves $1,041.67 compared to random selection for expert annotation. Further, we shift our task from reliability estimation to reliability classification of whether a given judge is above a deployment threshold, outperforming random selection with Metric Match. All project code is publicly available, and we additionally provide an installable package for ease of use.

2606.15034 2026-06-16 cs.AI 新提交

OSGuard: A Benchmark for Safety in Computer-Use Agents

OSGuard:计算机使用智能体安全基准

Mina Mohammadmirzaei, Jeffrey Flanigan

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 提出OSGuard双粒度基准,通过动作级安全判断和风险增强执行评估智能体在良性指令下的安全性,揭示局部监督与端到端安全的差距。

详情
AI中文摘要

计算机使用智能体越来越根据它们是否完成现实的桌面和网页任务来评估。然而,仅凭任务成功可能会遗漏智能体通过不安全捷径达到名义目标时的失败。我们引入了OSGuard,一个双粒度基准套件,用于在良性、未更改的用户指令下评估计算机使用智能体的安全性。OSGuard包含一个用于局部护栏决策的动作级基准和一个用于端到端评估的风险增强执行套件。动作级基准由上下文化的提议动作组成,这些动作被标记为允许、无关或不安全,每个判断都相对于原始指令和当前界面状态。执行套件包含手动构建的OSWorld衍生任务变体,其中原始任务仍然可完成,但环境被修改以引入潜在危险,如破坏性覆盖等。每个变体都配有增强评估器,保留原始任务成功标准,同时添加显式的基于状态的安全不变量,使我们能够区分安全完成和满足名义任务目标的不安全完成。我们在OSGuard上的实验结果表明,当前的多模态护栏在孤立的动作判断上表现良好,而风险增强执行暴露了局部监督与可靠端到端安全之间的剩余差距。这种双粒度设计能够更精确地诊断模型是否既能识别不安全的提议动作,又能在作为护栏部署时提高全任务安全性。

英文摘要

Computer-use agents are increasingly evaluated by whether they complete realistic desktop and web tasks. However, task success alone can miss failures in which an agent reaches the nominal goal through an unsafe shortcut. We introduce OSGuard, a dual-granularity benchmark suite for evaluating safety in computer-use agents under benign, unchanged user instructions. OSGuard contains an action-level benchmark for local guardrail decisions and a risk-augmented execution suite for end-to-end evaluation. The action-level benchmark consists of contextualized proposed actions labeled as allowed, unrelated, or unsafe, each judged relative to the original instruction and current interface state. The execution suite contains manually constructed OSWorld-derived task variants in which the original task remains achievable, but the environment is modified to introduce latent hazards such as destructive overwrites, etc. Each variant is paired with augmented evaluators that retain the original task-success criterion while adding explicit state-based safety invariants, allowing us to distinguish safe completions from unsafe completions that satisfy the nominal task objective. Our experimental results on OSGuard show that current multimodal guardrails can perform well on isolated action judgments, while risk-augmented execution exposes remaining gaps between local oversight and reliable end-to-end safety. This dual-granularity design enables more precise diagnosis of whether models can both recognize unsafe proposed actions and improve full-task safety when deployed as guardrails.

2606.15107 2026-06-16 cs.AI 新提交

Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning

迈向可验证的自主数据科学:通过基于工具的推理解决不规则时间序列问答

Sanhorn Chen, Xiaoyang Chen, Boyu Liu, Roy Zhao

发表机构 * University of Illinois Urbana Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对现实世界时间序列数据的不规则性,提出IRTS-ToolBench基准(1700个问题,10种任务类型,13个领域),通过标准化输入和可复现评估协议,研究LLM和AI代理在不规则条件下的表现。

Comments 15 pages

详情
AI中文摘要

实际部署中的时间序列数据绝大多数是不规则的。观测是异步的,缺失值具有信息性而非随机性,采样频率在不同传感器和操作窗口间变化。然而,现有的时间序列问答(TSQA)基准大多假设规则采样的输入,导致在理解大语言模型(LLM)和AI代理在不规则条件下的表现方面存在根本性差距。为弥补这一差距,我们引入了IRTS-ToolBench,一个包含1700个问题、跨越13个领域10种任务类型的基准。IRTS-ToolBench旨在供任何研究基于LLM的不规则时间序列分析的研究人员独立使用,提供标准化输入和可复现的评估协议。代码可在https://github.com/SanhornC/IRTS-ToolBench 获取。

英文摘要

Time series data in real-world deployments is overwhelmingly irregular. Observations are asynchronous, missing values are informative rather than random, and sampling frequencies vary across sensors and operational windows. However, existing Time Series Question Answering (TSQA) benchmarks mostly assume regularly sampled inputs, leaving a fundamental gap in understanding how large language models (LLMs) and AI agents perform under irregular conditions. To bridge this gap, we introduce IRTS-ToolBench, a benchmark of 1,700 questions spanning 10 task types across 13 domains. IRTS-ToolBench is designed to be used independently by any researcher working on LLM-based irregular time series analysis, providing standardized inputs and a reproducible evaluation protocol. Code can be found in https://github.com/SanhornC/IRTS-ToolBench.

2606.15258 2026-06-16 cs.AI 新提交

Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs

Mask-Proof: 一种基于LLM的数学证明自动数据整理流水线

Jierui Zhang, Siyuan Tan, Xinhang Li, Longzhuangzhi Lin, Dailin Li, Chengfeng Gu, Xinping Li, Yaxian Hao, Shengjia Liang, Yuxiang Ren, Wenhao Liu

发表机构 * School of Computer Science, Beijing University of Posts and Telecommunications(北京邮电大学计算机学院) Graduate College for Engineers, Beijing University of Posts and Telecommunications(北京邮电大学研究生院工程师学院) School of Mathematical Sciences, Fudan University(复旦大学数学科学学院) School of Cyberspace Security, Beijing University of Posts and Telecommunications(北京邮电大学网络空间安全学院) School of Computer Science and Technology, Dalian University of Technology(大连理工大学计算机科学与技术学院) Chu Kochen Honors College, Zhejiang University(浙江大学竺可桢学院) Department of Psychological and Cognitive Sciences, Tsinghua University(清华大学心理学与认知科学系) State Key Laboratory of Virtual Reality Technology and Systems, Beihang University(北京航空航天大学虚拟现实技术与系统国家重点实验室) School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院)

AI总结 提出Mask-Proof流水线,将真实证明转化为可自动检查的掩码步骤任务,通过LLM等价性判断器评估模型推理,构建包含292个问题的基准,推理增强模型性能提升12%-27%。

详情
AI中文摘要

大型语言模型(LLM)在数学问题求解方面能力日益增强,甚至能辅助研究级证明,但我们仍缺乏一种可扩展且可重复的方式来衡量跨不同来源的长证明中的逐步推理。这种评估差距限制了在经证明认证的科学进步中可信赖的AI辅助。现有评估通常强调最终答案或依赖昂贵的专家评分,而端到端的证明生成仍然是开放式的且难以自动验证。我们引入Mask-Proof,一个将真实证明转化为可自动检查的掩码步骤任务的流水线。它掩盖关键公式步骤,提供必要的上下文,并使用基于LLM的等价性判断器(通过重复投票保持稳定性)评估模型重建。由此产生的Mask-ProofBench包含来自不同研究领域的292个精心策划的问题。对17个模型的实验表明,推理增强模型比标准模型性能提升12%至27%。我们的评估器与专家注释者的一致性达到96.8%,实现了对逐步数学推理的忠实、可重复和可比较的测量。基准、注释和代码可在https://github.com/weating/Mask-Proof获取。

英文摘要

Large language models (LLMs) are increasingly capable of mathematical problem solving and can even assist with research-level proofs, yet we still lack a scalable and reproducible way to measure step-level reasoning in long proofs across diverse sources. This evaluation gap limits trustworthy AI assistance in proof-certified scientific progress. Existing evaluations often emphasize final answers or rely on costly expert grading, while end-to-end proof generation remains open-ended and hard to verify automatically. We introduce Mask-Proof, a pipeline that turns real proofs into automatically checkable masked-step tasks. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM-based equivalence judge using repeated votes for stability. The resulting Mask-ProofBench contains 292 curated problems across diverse research areas. Experiments with 17 models show that reasoning-enhanced models outperform standard models by 12% to 27%. Our evaluator achieves 96.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step-level mathematical reasoning. Benchmark, annotations, and code are available at https://github.com/weating/Mask-Proof.

2606.15300 2026-06-16 cs.AI cs.CL 新提交

CODA-BENCH: Can Code Agents Handle Data-Intensive Tasks?

CODA-BENCH:代码智能体能否处理数据密集型任务?

Yuxin Zhang, Ju Fan, Meihao Fan, Shaolei Zhang, Xiaoyong Du

发表机构 * Renmin University of China(中国人民大学)

AI总结 提出CODA-BENCH基准,在数据密集型环境中联合评估代码与数据智能,包含1009个任务,平均每个环境980个文件,揭示当前智能体在数据发现与代码执行整合上的不足。

Comments Accepted at ICML 2026. 37 pages, 11 figures. Project page: https://coda-bench.github.io/ Code: https://github.com/ruc-datalab/CoDA-Bench Data: https://huggingface.co/datasets/RUC-DataLab/CoDA-Bench

详情
AI中文摘要

高级智能体正日益展现出作为自主工程师的潜力,这催生了对能够捕捉真实世界开发复杂性的评估基准的需求。此类环境通常涉及复杂代码和大规模数据(即文件系统)。然而,现有基准通常孤立地评估代码中心或数据中心能力,与真实开发场景存在明显差距。在本文中,我们通过引入CODA-BENCH来弥合这一差距,这是首个在数据密集型环境中联合评估代码与数据智能的基准。我们基于Kaggle生态系统(包含数百个数据集)构建了一个数据密集型Linux沙箱,其中智能体必须主动探索复杂的文件层次结构以识别相关资源,并为数据驱动的分析任务生成代码。CODA-BENCH包含跨越31个社区的1009个任务,每个任务环境平均包含980个文件,模拟了真实的数据规模和噪声。对高级智能体的评估显示,即使是最优系统也难以有效整合数据发现与代码执行,成功率仅为61.1%。这些结果凸显了当前智能体在数据密集型任务中的能力差距,并为未来研究指明了有希望的方向。

英文摘要

Advanced agents are increasingly demonstrating the potential to operate as autonomous engineers, creating a growing demand for evaluation benchmarks that capture the complexity of real-world development. Such environments typically involve both complex code and large-scale data (i.e., file system). However, existing benchmarks usually evaluate code-centric or data-centric capabilities in isolation, leaving a clear gap with real development scenarios. In this paper, we bridge this gap by introducing CODA-BENCH, the first benchmark to jointly evaluate code and data intelligence in a data-intensive environment. We construct a data-intensive Linux sandbox based on the Kaggle ecosystem (containing hundreds of datasets), where agents must actively explore complex file hierarchies to identify relevant resources and generate code for data-driven analytical tasks. CODA-BENCH comprises 1,009 tasks spanning 31 communities, with each task environment containing an average of 980 files, simulating realistic data scale and noise. Evaluations of advanced agents reveal that even top-performing systems struggle to effectively integrate data discovery with code execution, achieving a success rate of only 61.1%. These results highlight a substantial gap in current agentic capabilities for data-intensive tasks and point to promising directions for future research.

2606.15474 2026-06-16 cs.AI stat.AP 新提交

Who Drifted: the System or the Judge? Anytime-Valid Attribution in LLM Evaluation Pipelines

谁漂移了:系统还是裁判?LLM评估流水线中的随时有效归因

Yitao Li

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种基于固定锚点集和赌检验的方法,区分LLM评估中产品性能下降与裁判模型变化导致的分数漂移,并证明其随时有效性和归因准确性。

详情
AI中文摘要

对LLM产品的持续评估依赖于一个被视为地面真相的强大LLM裁判:一个廉价的监控器对每次交互进行评分,当分数下降时团队会收到警报。但裁判本身是一个API背后的模型,静默的版本升级或评分提示更新会改变其评分方式——因此每次漂移警报在更差的产品和变化的裁判之间是模糊的。我们通过一个固定的人工标注锚点集(当前裁判以稳定间隔重新评分)、一个关于裁判与人类差距的二次赌e过程,以及一个返回{无, 系统, 裁判}判决的守卫窗口规则来解决这种模糊性。我们证明了随时有效性、单向识别(只有裁判可以移动锚点)、一个归因竞赛(其设计法则是锚点必须跑赢它们守卫的主过程)以及过程正交性。在两个真实的裁判变化中,静默版本升级在60/60次运行中被检测为裁判漂移,且零次误归因为系统;而一个污染性的严格提示变化在守卫宽度为300时,120次运行中有110次被正确归因——而行业默认的滚动z检验在75%的无漂移流上产生误报。每个实验在第二个领域(TL;DR摘要)上重复,无需重新调整参数,并且当领域不同时,差异正是竞赛所预测的:严格提示变化在那里更强烈地改变分数,因此锚点触发更快,归因变得完美(240/240)。该监控器的运行成本约为对每个项目使用强裁判的0.64倍,或在更便宜但更聋的模式下为0.21倍。

英文摘要

Continuous evaluation of LLM products relies on a strong LLM judge treated as ground truth: a cheap monitor scores every interaction and a team is paged when the score drifts down. But the judge is itself a model behind an API, and a silent version bump or scoring-prompt update changes how it scores -- so every drift alarm is ambiguous between a worse product and a changed judge. We resolve the ambiguity with a fixed, human-labeled anchor set that the current judge re-scores at a steady interleave, a second betting e-process on the judge-versus-human gap, and a guard-window rule returning a verdict in {none, system, judge}. We prove anytime-validity, one-way identification (only the judge can move the anchors), an attribution race whose design law is that the anchors must out-run the main process they guard, and process orthogonality. On two real judge changes, a silent version bump is detected as judge drift in 60/60 runs with zero judge-to-system misattribution, and a contaminating strict-prompt change is correctly attributed on 110 of 120 runs at guard width 300 -- while the industry-default rolling z-test false-alarms on 75% of drift-free streams. Every experiment replicates on a second domain (TL;DR summarization) with nothing re-tuned, and where the domains differ the differences are the ones the race predicts: the strict-prompt change shifts scores harder there, so the anchors fire faster and attribution becomes perfect (240/240). The monitor runs at approximately 0.64 of the cost of strong-judging every item, or 0.21 in a cheaper-but-deafer regime.

2606.15508 2026-06-16 cs.AI 新提交

ToolMenuBench: Benchmarking Tool-Menu Filtering Strategies for Reliable and Efficient LLM Agents

ToolMenuBench: 用于可靠高效LLM智能体的工具菜单过滤策略基准测试

Rahul Suresh Babu, Laxmipriya Ganesh Iyer

AI总结 提出ToolMenuBench基准,评估多步LLM智能体中工具菜单构建策略,发现因果最小工具过滤在任务成功率、令牌使用和风险暴露间取得最佳平衡。

详情
AI中文摘要

工具增强的大语言模型智能体越来越多地操作大型工具库,但现有评估通常关注模型能否正确调用工具,而非可见工具菜单如何影响可靠性、效率和安全相关风险暴露。我们引入ToolMenuBench,一个用于评估多步LLM智能体中工具菜单构建的基准。ToolMenuBench变化工具菜单大小、干扰类型、状态依赖任务结构和风险暴露,并报告过滤级别和下游智能体指标,包括可见工具数量、风险工具暴露、任务成功、错误工具调用、过早动作和令牌使用。在七个模型后端、三种工具菜单大小、六种过滤方法和七种评估设置的控制评估中,CMTF将任务成功率从全部工具暴露下的32.1%提升至85.7%,同时平均令牌使用减少约98%。因果最小工具过滤实现了最强的整体权衡,相对于未过滤暴露、词汇过滤、状态感知过滤和更广泛的因果路径基线,减少了可见工具、错误工具调用、过早动作和风险工具暴露。ToolMenuBench提供了一个可重用的评估框架,用于研究智能体-界面问题:哪些工具应该可见、何时可见以及在何种成本或风险约束下可见。

英文摘要

Tool-augmented large language model agents increasingly operate over large tool libraries, but existing evaluations often focus on whether a model can call a tool correctly rather than how the visible tool menu shapes reliability, efficiency, and safety-relevant risk exposure. We introduce ToolMenuBench, a benchmark for evaluating tool-menu construction in multi-step LLM agents. ToolMenuBench varies tool-menu size, distractor type, state-dependent task structure, and risk exposure, and reports both filter-level and downstream agent metrics, including visible-tool count, risky-tool exposure, task success, wrong-tool calls, premature actions, and token usage. In a controlled evaluation across seven model backends, three tool-menu sizes, six filtering methods, and seven evaluation settings, CMTF improves task success from 32.1% under all-tools exposure to 85.7%, while reducing average token usage by roughly 98%. Causal minimal tool filtering achieves the strongest overall tradeoff, reducing visible tools, wrong-tool calls, premature actions, and risky-tool exposure relative to unfiltered exposure, lexical filtering, state-aware filtering, and broader causal-path baselines. ToolMenuBench provides a reusable evaluation framework for studying the agent-interface problem: which tools should be visible, when they should be visible, and under what cost or risk constraints.

2606.15673 2026-06-16 cs.AI cs.LG 新提交

Where Did It Go Wrong? Process-Level Evaluation of Web Agents with Semantic State Tracking

哪里出错了?基于语义状态追踪的Web智能体过程级评估

Jiwan Chung, JiHyuk Byun, Vibhav Vineet, Seon Joo Kim

发表机构 * Yonsei University(延世大学) Microsoft Research(微软研究院)

AI总结 提出WebStep基准,通过语义MDP追踪过程状态,揭示隐藏于终端成功率下的智能体差异,并定位具体改进方向。

详情
AI中文摘要

Web智能体通过长交互序列执行任务,然而现有基准仅评估终端成功,丢弃所有过程信息,对改进提供的指导有限。在这项工作中,我们对Web智能体进行了过程级分析。我们引入了WebStep,一个包含1800个任务实例的基准,具有可控难度和自动语义状态追踪。每个网站除了GUI外还暴露一个确定性的语义MDP:智能体在界面上操作,而环境在后台记录高级状态和转换,从而实现无需人工标注的细粒度分析。基于语义轨迹,我们首先表明过程度量揭示了结果评估无法察觉的差异:三个成功率集中在31-33%的智能体在探索范围与执行准确性上存在分歧。然后,按技能分解刻画了这些差异的本质,揭示了同一网站内隐藏的相反技能排名:例如,在Housing上,OpenAI CUA在提交动作上优于Qwen3.5 23.7%,但在过滤上却落后15.6%,精确指出了即使在单个领域内也需要改进的具体技能。分叉分析进一步定位了导致任务失败的决定性错误,并表明该错误是智能体特定的而非共享的。最后,随着任务难度增加,这些差异扩大:在简单任务上成功率相似,但随着探索要求提高而急剧分化。我们的过程级分析为Web智能体评估开辟了新途径,提供了关于每个智能体应在何处以及如何改进的细粒度且可操作的见解。

英文摘要

Web agents act through long interaction sequences, yet existing benchmarks evaluate only terminal success, discarding all process information and offering little guidance on improvement. In this work, we conduct a process-level analysis of web agents. We introduce WebStep, a benchmark of 1,800 task instances with controlled difficulty and automatic semantic state tracking. Each website exposes a deterministic semantic MDP alongside the GUI: the agent operates on the interface, while the environment records high-level states and transitions in the background, enabling fine-grained analysis without manual annotation. Based on the semantic trajectory, we first show that process metrics reveal differences invisible to outcome evaluation: three agents whose success rates cluster within 31-33% diverge in exploration reach versus execution accuracy. Then, decomposing by skill characterizes the nature of these differences, exposing opposite per-skill rankings hidden within the same website: e.g., on Housing, OpenAI CUA outperforms Qwen3.5 by 23.7% on commit actions yet underperforms it by 15.6% on filtering, pinpointing a concrete skill to improve even within a domain. Bifurcation analysis further localizes the decisive error that loses the task and shows that this error is agent-specific rather than shared. Finally, these differences widen as tasks grow harder: success rate is similar on easy tasks but separates sharply as exploration becomes more demanding. Our process-level analysis opens a new avenue in web agent evaluation, providing fine-grained and actionable insight into where and how each agent should be improved.

2606.15686 2026-06-16 cs.AI cs.LG 新提交

Recurrent Reasoning on Symbolic Puzzles with Sequence Models

基于序列模型的符号谜题循环推理

Gowrav Mannem, Chowdhury Marzia Mahjabin, Jason Chen, Shivank Garg, Kevin Zhu

发表机构 * Algoverse AI Research Cornell University(康奈尔大学)

AI总结 提出 RecurrReason 基准,包含四个递归逻辑谜题,通过控制难度参数 N 评估序列模型,发现架构比规模更重要,预训练仅对局部结构转移函数的谜题有效。

详情
AI中文摘要

大型语言模型在符号和算法任务上通常表现强劲,但当问题变长、变难或略微超出分布时,这种表面优势可能隐藏脆弱行为。当前推理基准的一个主要限制是,许多主要测试模型是否能产生有效答案,而较少关注解决方案在可控难度缩放下是否最小、稳健和稳定。我们引入了 RecurrReason,一个难度可控的基准,包含四个递归逻辑谜题(汉诺塔、过河问题、积木世界和跳棋),具有 BFS 最优轨迹和单一可解释难度参数 $N \in \{1,\dots,10\}$,总计 10,817 个独特谜题和 285,933 步动作。我们在一致的数据划分和评估标准下,对两个 Transformer 家族(编码器-解码器模型(T5 风格)和仅解码器模型(GPT-2 风格))进行基准测试,在 $N=1$ 到 $7$ 上训练,并在 $N=8$ 到 $10$ 的保留分布内实例和更难的分布外实例上评估。微调后的预训练 T5 在积木世界上达到 97.27% 的验证准确率和 81.00% 的 OOD 准确率;所有模型在过河问题上的所有条件下得分为 0.00%。失败模式分析表明,架构比规模更能决定成功。预训练仅能迁移到具有局部结构转移函数的谜题。我们的代码和数据集将在接收后开源。

英文摘要

Large language models often appear strong on symbolic and algorithmic tasks, yet this apparent strength can hide brittle behaviour when problems become longer, harder, or slightly out of distribution. A major limitation of current reasoning benchmarks is that many primarily test whether a model can produce a valid answer, while paying less attention to whether the solution is minimal, robust, and stable under controlled difficulty scaling. We introduce RecurrReason, a difficulty-controlled benchmark of four recurrent logic puzzles (Tower of Hanoi, River Crossing, Block World, and Checkers Jumping) with BFS-optimal trajectories and a single interpretable difficulty parameter $N \in \{1,\dots,10\}$, totalling 10{,}817 unique puzzles and 285{,}933 moves. We benchmark two Transformer families, an encoder-decoder model (T5-style) and a decoder-only model (GPT-2-style), under consistent data splits and evaluation criteria, training on $N{=}1$ to $7$ and evaluating on both held-out in-distribution instances and harder out-of-distribution instances at $N{=}8$ to $10$. Fine-tuned pre-trained T5 achieves 97.27\% validation and 81.00\% OOD accuracy on Block World; all models score 0.00\% on River Crossing under all conditions. Failure mode analysis reveals that architecture is a stronger determinant of success than scale. Pre-training transfers only to puzzles with locally structured transition functions. Our code and dataset will be open-sourced upon acceptance.

2606.15708 2026-06-16 cs.AI 新提交

Artificial Intelligence Index Report 2026

人工智能指数报告2026

Sha Sajadieh, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Lapo Santarlasci, Juan Pava, Nestor Maslej, Russ Altman, Erik Brynjolfsson, Carla Brodley, Jack Clark, Virginia Dignum, Vipin Kumar, James Landay, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Elham Tabassi, Russell Wald, Toby Walsh, Dan Weld

AI总结 本报告追踪AI在推理、安全及现实任务执行方面的测试进展,分析治理框架、评估方法与技术发展之间的差距,并新增AI在科学与医学领域的独立章节。

详情
AI中文摘要

欢迎阅读第九版AI指数报告。随着AI持续快速发展,问题在于围绕它构建的系统能否跟上步伐。治理框架、评估方法、教育系统以及追踪AI影响所需的数据基础设施,都难以匹配技术本身的速度。AI能做什么与我们准备如何管理它之间的差距贯穿本年度报告的每一章。本版新增内容:报告追踪了AI如何在推理、安全和现实任务执行方面受到更雄心勃勃的测试,以及为何这些测量越来越难以依赖。它还提供了对生成式AI经济价值的新估计,以及其劳动力市场影响的新证据、一个关于AI主权的分析框架,以及与Schmidt Sciences合作开发的科学章节。本报告首次设有关于AI在科学和AI在医学中的独立章节,反映了AI在这两个领域日益增长的影响。

英文摘要

Welcome to the ninth edition of the AI Index report. As AI continues to advance rapidly, the question becomes whether the systems built around it can keep up. Governance frameworks, evaluation methods, education systems, and the data infrastructure needed to track AI's impact are struggling to match the pace of the technology itself. That gap between what AI can do and how prepared we are to manage it runs through every chapter of this year's report. New in this edition, the report tracks how AI is being tested more ambitiously across reasoning, safety, and real-world task execution, and why those measurements are increasingly difficult to rely on. It also features new estimates of generative AI's economic value alongside emerging evidence of its labor market effects, an analytical framework on AI sovereignty, and a science chapter developed in collaboration with Schmidt Sciences. For the first time, the report features standalone chapters on AI in science and AI in medicine, reflecting AI's growing impact across these two domains.

2606.15766 2026-06-16 cs.AI cs.HC 新提交

Rethinking Scaffolding in LLM Tutors: The Interactional Mismatch Between Benchmarks and Real-World Deployments

重新思考LLM导师中的脚手架:基准测试与真实部署之间的交互不匹配

Alexandra Neagu, Jeffrey T. H. Wong, Marcus Messer, Rhodri Nelson, Peter B. Johnson

发表机构 * University of Cambridge(剑桥大学)

AI总结 通过分析9490个聊天记录,发现AI导师基准测试假设学生积极接受脚手架,但真实场景中学生常绕过脚手架,揭示基准测试与真实部署的交互不匹配。

Comments Pluralistic Alignment Workshop @ ICML 2026, Seoul, South Korea

详情
AI中文摘要

AI导师基准测试中评估的一个核心教学价值是脚手架:通过渐进步骤引导学生走向解决方案。然而,将脚手架行为嵌入聊天机器人的对齐和评估方法基于一个隐含假设:学生会接受脚手架并参与对话。为了检验这一假设是否成立,我们引入了一个围绕两个指标——聊天机器人脚手架和学生接受度——的评估流程,并将其应用于跨越AI导师基准测试和教育聊天机器人真实部署的九个数据集,共9490个聊天记录。我们的分析揭示,虽然基准测试假设一个高脚手架、高学生接受度的环境,但真实场景中的学生整体表现出较低水平的接受度——经常绕过聊天机器人的教学框架,以较低的人际成本将交互推向自己的学习目标。我们认为,绕过脚手架不一定是坏事;相反,它经常突显聊天机器人的教学框架与学生目标之间的不匹配。为了有意义地评估聊天机器人辅助的有效性,未来的基准测试必须超越学生简单接受脚手架的假设,而是评估这些聊天机器人如何应对多样化的学习环境和学生驱动的交互模式。

英文摘要

A central pedagogical value evaluated in AI tutor benchmarks is scaffolding: guiding students through graduated steps toward a solution. Alignment and evaluation methods for embedding scaffolding behaviour into chatbots, however, rest on an implicit assumption: that students will take up the scaffolding and engage in the conversation. To examine whether this assumption holds, we introduce an evaluation pipeline around two metrics - Chatbot Scaffolding and Student Uptake - and apply them across nine datasets of 9,490 chats, spanning AI tutor benchmarks and real-world deployments of educational chatbots. Our analysis reveals that while benchmarks assume a high-scaffolding, high-student-uptake environment, students in real-world settings exhibit lower levels of uptake overall - frequently bypassing the chatbot's pedagogical framing to drive the interaction toward their own learning goals at little interpersonal cost. We argue that bypassing scaffolding is not necessarily detrimental; rather, it frequently highlights a mismatch between a chatbot's pedagogical framing and the student's learning goals. To meaningfully evaluate the effectiveness of a chatbot's assistance, future benchmarks must move beyond the assumption that students will simply take up the scaffolding, and instead evaluate how these chatbots navigate diverse learning contexts and student-driven interaction patterns.

2606.15862 2026-06-16 cs.AI 新提交

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

RetailBench: 在真实零售环境中评估LLM代理的长期推理与连贯决策能力

Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

发表机构 * Ant Group(蚂蚁集团) City University of Hong Kong(香港城市大学)

AI总结 提出RetailBench基准,模拟单店超市运营,评估LLM代理在长期决策中的表现,发现多数模型无法持续生存,与最优策略差距显著。

详情
AI中文摘要

大型语言模型(LLM)代理在短期、范围明确的任务上取得了快速进展,但它们在动态长期环境中维持连贯决策的能力仍不确定。我们引入了RetailBench,一个基于数据驱动的模拟基准,用于评估在单店超市运营中使用工具的LLM代理。RetailBench将零售管理建模为部分可观察的决策过程,并设计支持千天规模的模拟。在此环境中,代理必须管理定价、补货、供应商选择、货架分类、库存老化、客户反馈、外部事件和现金流约束。我们在180天的评估期内,在代表性代理框架下评估了七个当代LLM,并将它们与特权最优策略进行比较。结果显示模型之间存在显著差异:只有一小部分能够存活整个评估期,即使最强的LLM运行在最终净资产和销售结果上也远落后于最优策略。行为分析将这些差距归因于不完整的证据获取、表面决策以及缺乏一致的长期策略。RetailBench为研究经济基础长期决策中的可靠自主性提供了一个受控测试平台。

英文摘要

Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

2606.15890 2026-06-16 cs.AI 新提交

UrbanWell: Benchmarking Multimodal Large Language Models for Spatio-Temporal Urban Wellbeing Analytics

UrbanWell: 面向时空城市福祉分析的多模态大语言模型基准测试

Yanxin Xi, Xiang Su, Jie Feng, Yu Liu, Sasu Tarkoma, Pan Hui

发表机构 * University of Helsinki(赫尔辛基大学) Zhongguancun Academy(中关村学院) University of Oxford(牛津大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出UrbanWell基准,通过卫星和街景图像联合建模,系统评估多模态大语言模型在环境、空间可达性、城市形态、活力和主观感知等5类城市福祉指标上的时空推理能力,并定义时序预测和趋势分类任务。

Comments accepted by KDD Datasets and Benchmarks Track 2026

详情
AI中文摘要

从多模态数据理解城市福祉需要整合异构的时空信号,这对当前的多模态大语言模型(MLLMs)构成了重大挑战。我们提出了UrbanWell,一个大规模基准测试,旨在通过卫星和街景图像的联合建模,系统评估MLLMs在时空推理方面的能力,用于城市福祉分析。UrbanWell覆盖多个年份的38个城市,包含多样化的指标,涵盖(1)环境条件(CO$_2$、NO$_2$、PM${2.5}$和归一化植被指数),(2)空间可达性(到超市和餐馆的最小距离),(3)城市形态(道路长度、道路密度和土地利用),(4)城市活力(人口、经济活动多样性和土地利用多样性),以及(5)主观感知属性(例如安全性、美观性、活力、财富和宁静度)。所有指标在网格级别对齐,以实现标准化评估。除了静态预测,UrbanWell还定义了时序推理任务,包括基于历史观测的未来值预测和时序趋势分类。我们在零样本设置下对15个有代表性的最先进MLLMs进行了基准测试,提供了跨空间和时间维度的全面比较评估。实验结果表明,尽管MLLMs能够捕捉显著的空间和感知线索,但其性能在涵盖环境和主观感知的异质城市指标上差异显著。UrbanWell作为评估城市福祉分析中多模态时空推理的统一基准,为系统评估和未来多模态城市智能研究提供了标准化测试平台。我们的代码和数据集可通过https://github.com/axin1301/UrbanWell-Benchmark获取。

英文摘要

Understanding urban wellbeing from multimodal data requires integrating heterogeneous spatial and temporal signals, posing significant challenges for current multimodal large language models (MLLMs). We introduce UrbanWell, a large-scale benchmark designed to systematically evaluate the spatio-temporal reasoning capabilities of MLLMs for urban wellbeing analytics through joint modeling of satellite and street view imagery. UrbanWell spans 38 cities across multiple years and includes diverse indicators covering (1) environmental conditions (CO$_2$, NO$_2$, PM${2.5}$, and Normalized Difference Vegetation Index), (2) spatial accessibility (minimum distance to supermarkets and restaurants), (3) urban form (road length, road density, and land use), (4) urban vitality (population, economic activity diversity, and land use diversity), and (5) subjective perception attributes (e.g., safety, beauty, liveliness, wealth, and quietness). All indicators are aligned at grid level to enable standardized evaluation. Beyond static prediction, UrbanWell defines temporal reasoning tasks, including future value forecasting from historical observations and temporal trend classification. We benchmark 15 state-of-the-art representative MLLMs in a zero-shot setting, providing a comprehensive comparative evaluation across spatial and temporal dimensions. Experimental results indicate that while MLLMs capture salient spatial and perceptual cues, their performance varies substantially across heterogeneous urban indicators spanning environment and subjective perception. UrbanWell serves as a unified benchmark for evaluating multimodal spatial and temporal reasoning in urban wellbeing analytics, offering a standardized testbed for systematic assessment and future research on multimodal urban intelligence. Our codes and datasets are accessible via https://github.com/axin1301/UrbanWell-Benchmark.

2606.16003 2026-06-16 cs.AI 新提交

SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity

SciText2Eq: 评估大语言模型在科学创造力中的可解释方程生成

Yifan Mo, Xiao Fu, Yue Su, Qingyu Meng, Koen Hindriks, Qingzhi Liu, Jiahuan Pei

发表机构 * Vrije Universiteit Amsterdam(阿姆斯特丹自由大学) Wageningen University & Research(瓦赫宁根大学及研究中心)

AI总结 研究大语言模型从科学文本生成数学方程的能力,构建AI论文数据集,提出可解释方程生成流程,并设计结合自动指标、LLM评估和人工判断的评估协议,发现LLM在语义准确性上表现不足。

Comments Accepted by findings of ACL 2026

详情
AI中文摘要

本文研究了大语言模型(LLMs)从科学文本生成数学方程的能力。先前的工作面临非结构化基础、多方程依赖和人类对齐评估的挑战。为此,我们构建了一个AI研究论文数据集,将上下文段落与真实方程和变量描述配对。我们开发了一个可解释的方程生成流程,并在多种开源和闭源LLM骨干上进行了评估。我们引入了一个评估协议,结合自动指标、基于LLM的评分标准和人工判断,以评估准确性、可解释性和人机对齐。结果表明,LLM在基于词汇和句法的相似性上表现中等,但在语义准确性上存在困难。基于LLM的评估与人工判断之间的比较显示对齐有限,凸显了使用LLM评估方程质量的挑战。这些发现为改进方程生成模型和开发更可靠的科学文本评估方法提供了见解。我们提供了代码和数据以供复现。

英文摘要

This work investigates the ability of large language models (LLMs) to generate mathematical equations from scientific texts. Prior work faces challenges in unstructured grounding, multi-equation dependency, and humanaligned evaluation. To this end, we construct a dataset of AI research papers, pairing contextual passages with ground-truth equations and variable descriptions. We develop an explainable equation generation workflow and evaluate it across diverse open- and closed-source LLM backbones. We introduce an evaluation protocol combining automatic metrics, LLM-based rubrics, and human judgments to assess accuracy, explainability, and human-LLM alignment. Results indicate that LLMs perform moderately on lexical- and syntactic-based similarity, while struggling with semantic accuracy. Comparisons between LLM-based evaluations and human judgments reveal limited alignment, highlighting challenges in using LLMs to assess equation quality. These findings offer insights for improving equation generation models and developing more reliable evaluation methods for scientific text. We provide code and data for reproducibility.

2606.16062 2026-06-16 cs.AI cs.LG 新提交

Auditing Reward Hackability in Code RL Training Environments

审计代码强化学习训练环境中的奖励可破解性

Shreshth Rajan

发表机构 * GitHub

AI总结 测量代码RL环境接受错误解决方案的比率,发现SWE-bench Verified中28.5%的任务测试套件薄弱,并提出通过LLM判断器和Docker金标准门控来加固漏洞任务的方法。

详情
AI中文摘要

我们测量了代码强化学习环境将错误解决方案视为正确的比率。在SWE-bench Verified的49个任务样本中,28.5%的任务测试套件足够薄弱,以至于Docker验证的错误补丁能通过它们。在6个代码库的20个R2E-Gym任务上,相同的单次利用生成管道产生25.0%的成功率。对SWE-bench Verified上134个前沿模型提交的随机效应荟萃分析发现,在相同人工评定的难度层级内,模型Pass@1在标记为可破解的任务上比稳健任务高14.14个百分点(95%置信区间[+11.80, +16.48];单侧p < 10^-6;I^2 = 0%;134个模型中有123个为正)。然后我们描述了一个加固被破坏任务的流程。一个内联LLM判断器配合Docker金标准门控,在咨询判断器之前对每个生成的测试针对金标准解决方案运行。在审计中的11个被破坏任务上,门控标记出105个决定性的LLM生成测试中的65个在金标准补丁上失败,这是LLM判断器单独遗漏的61.9%的每次增强缺陷率。通过多样性偏置重试,该循环将11个任务中的9个收敛到门控升级。

英文摘要

We measure the rate at which code RL environments accept incorrect solutions as correct. On a 49-task sample of SWE-bench Verified, 28.5% of tasks have test suites weak enough that a Docker-verified incorrect patch passes them. On 20 R2E-Gym tasks across 6 repositories, the same pipeline at single-shot exploit generation yields 25.0%. A random-effects meta-analysis over 134 frontier model submissions to SWE-bench Verified finds, within the same human-rated difficulty stratum, model Pass@1 is +14.14 percentage points higher on flagged-hackable tasks than on robust ones (95% CI [+11.80, +16.48]; one-sided p < 10^-6; I^2 = 0%; 123 of 134 models positive). We then describe a procedure for hardening the broken tasks. An inline LLM judge with a Docker gold-sanity gate runs each generated test against the gold solution before the judge is consulted. On the 11 broken tasks in the audit, the gate flags 65 of 105 decisive LLM-generated tests as failing on the gold patch itself, a 61.9% per-augmentation defect rate the LLM judge alone misses. With diversity-biased retry, the loop converges 9 of 11 tasks to a gated upgrade.

2606.16113 2026-06-16 cs.AI cs.LG 新提交

RecourseBench: A Modular Framework for Reproducible Algorithmic Recourse Evaluation

RecourseBench: 一个用于可复现算法追责评估的模块化框架

Zahra Khotanlou, Hashir Ahmed, Chenghao Tan, Ahmed Abdelaal, Amir-Hossein Karimi

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出RecourseBench框架,通过模块化、可复现性和交互性三大承诺,实现追责方法的统一评估,并集成28种方法,首次通过自动化定量测试强制方法级可复现性。

详情
AI中文摘要

算法追责方法提供反事实解释,告知个体需要采取哪些行动来推翻不利的模型决策。尽管方法学进展迅速,但原则性比较仍然难以实现;现有框架通常难以扩展,缺乏互操作性,并且缺乏系统验证来确保集成的方法忠实复现其最初报告的结果。我们引入了\emph{RecourseBench},一个围绕三大承诺(即模块化、可复现性和交互性)构建的统一评估框架。该框架将流程分解为五个完全解耦的层——数据、预处理、模型、追责方法和评估——由抽象接口和动态注册表管理。为了解决先前基准测试中的可复现性差距,我们引入了一个四级分类系统,其中每个集成的方法都通过自动化测试套件针对其最初报告的结果进行验证。我们还提供了一个交互式Web界面,用于在方法、数据集和模型架构之间进行灵活的、配置驱动的比较。我们的框架目前集成了28种最先进的追责方法,据我们所知,这是第一个通过自动化定量测试明确强制执行方法级可复现性的追责基准。

英文摘要

Algorithmic recourse methods provide counterfactual explanations that inform individuals of the actions required to overturn an unfavorable model decision. Despite rapid methodological progress, principled comparison remains elusive; existing frameworks are often difficult to extend and lack both interoperability and systematic verification that integrated methods faithfully reproduce their originally reported results. We introduce \emph{RecourseBench}, a unified evaluation framework built around three commitments namely, modularity, reproducibility, and interactivity. The framework decomposes the pipeline into five fully decoupled layers -- Data, Preprocessing, Model, Recourse Method, and Evaluation -- governed by abstract interfaces and a dynamic registry. To address the reproducibility gap in prior benchmarks, we introduce a four-tier classification system in which every integrated method is validated by an automated test suite against its originally reported results. We further provide an interactive web interface for flexible, configuration-driven comparison across methods, datasets, and model architectures. Our framework currently integrates 28 state-of-the-art recourse methods and, to our knowledge, constitutes the first recourse benchmark to explicitly enforce method-level reproducibility through automated, quantitative testing.

2606.16173 2026-06-16 cs.AI 新提交

TimeVista: Exploring and Exploiting Vision-Language Models as Judges for Time Series Forecasting

TimeVista:探索和利用视觉语言模型作为时间序列预测的评判者

Zhi Chen, Yuxuan Wang, Jialong Wu, Yong Liu, Haoran Zhang, Xingjian Su, Jianmin Wang, Mingsheng Long

发表机构 * School of Software, BNRist, Tsinghua University(清华大学软件学院、北京信息科学与技术国家研究中心)

AI总结 提出TimeVista框架,利用视觉语言模型(VLM)作为时间序列预测的评判者,通过微观和宏观判断结合上下文信息评估预测质量,实验表明VLM比传统指标更符合人类偏好。

详情
AI中文摘要

高质量的时间序列预测对于现实世界的决策至关重要。然而,传统的逐点度量往往无法揭示复杂的时间模式,并且与人类直观偏好的一致性较差。虽然“LLM-as-a-Judge”范式通过提供灵活、符合人类判断的评估彻底改变了文本评估,但其在时间序列中的应用仍鲜有探索。在本文中,我们利用视觉语言模型(VLM)作为时间序列预测的评判者,利用它们理解基于文本信息的时间序列图的能力。具体来说,我们提出了一种新颖的框架,整合了基于上下文信息的微观和宏观层面判断来评估时间序列预测。为此,我们引入了TimeVista,一个全面的VLM-as-a-Judge基准,包含5563个时间序列样本及其详细的评估标准。广泛的元评估表明,VLM是高度可靠的评判者,与人类偏好的一致性显著高于传统指标。基于我们的基准,我们在VLM-as-a-Judge范式下全面评估了近期的时间序列基础模型(TSFM)。我们的结果表明,VLM作为稳健且可解释的评判者,为评估时间序列模型提供了全面且符合人类的标准。

英文摘要

High-quality time series forecasting is pivotal for real-world decision-making. However, traditional point-wise metrics often fail to reveal complex temporal patterns and align poorly with human intuitive preferences. While the ''LLM-as-a-Judge'' paradigm has revolutionized text evaluation by providing flexible, human-aligned judgment, its application to time series remains largely unexplored. In this paper, we leverage Vision-Language Models (VLMs) as judges for time series forecasting, harnessing their ability to comprehend time series plots grounded in textual information. Specifically, we propose a novel framework integrating micro- and macro-level judgments informed by contextual information to evaluate time series forecasting. To this end, we introduce TimeVista, a comprehensive VLM-as-a-Judge benchmark comprising 5563 time series samples paired with detailed evaluation rubrics. Extensive meta-evaluations demonstrate that VLMs are highly reliable judges, achieving significantly higher consistency with human preferences than conventional metrics. Building upon our benchmark, we comprehensively assess recent Time Series Foundation Models (TSFMs) under the VLM-as-a-Judge paradigm. Our results demonstrate that VLMs serve as robust and interpretable judges, providing a comprehensive, human-aligned standard for evaluating time series models.

2606.16175 2026-06-16 cs.AI 新提交

PAL-Bench: Evidence-Grounded Profile Reconstruction from Longitudinal Personal Albums

PAL-Bench: 基于纵向个人相册的证据驱动画像重建

Qiwei Yan, Zhiqiang Yuan, Zexi Jia, Nanxing Hu, Kailin Lyu, Jie Zhou, Jinchao Zhang

发表机构 * Tsinghua University(清华大学) Beijing University of Posts and Telecommunications(北京邮电大学) University of Chinese Academy of Sciences(中国科学院大学) Zhejiang University(浙江大学)

AI总结 提出PAL-Bench基准,通过合成用户和隐私保护审计,评估从纵向个人相册中重建用户画像、社交关系和身份映射的能力,发现现有系统在身份解析和证据引用方面存在不足。

详情
AI中文摘要

纵向个人相册是弱模式多模态数据库:包含噪声感知记录,其关键事实需要跨人脸、文本、时间戳、位置和重复事件进行连接。现有的视觉、视频、文档和生活日志基准测试了子问题,但未涉及具有社会身份绑定和证据引用的相册级画像重建。由于评估所需的真实数据——所有者画像、社交图谱、人脸-姓名映射和证据来源——是私有状态,真实相册无法安全发布,因此基准测试此任务具有挑战性。我们提出PAL-Bench,一个在公共记录契约下进行证据驱动重建的受控基准。其证据编译器构建潜在的私有世界,编程目标级证据路径,渲染相册像素,通过感知管道重新测量,并导出经过审计的公共/私有视图。智能体仅接收感知衍生的公共记录;目标、标识符映射和证据路径保持隐藏。PAL-Bench包含50个合成用户、36,659条公共照片记录以及2,799个关于所有者事实、身份和关系的目标。一项包含10名参与者的隐私保护审计确认,PAL-Bench的证据结构与真实私有相册匹配,尽管等效发布仍受隐私限制。在七个系统和两个计算匹配的诊断中,一个七指标协议揭示了合理的画像总结与忠实的社会重建之间的差距:系统恢复了一些所有者事实,但在处理重复出现的身份和证据引用方面存在困难。PAL-TRACE是一个参考框架,在所有者事实挖掘之前冻结身份绑定,表现最佳,但硬身份解析远未解决。PAL-Bench为感知实体解析、多模态数据集成、时间证据聚合和来源感知的结构化预测提供了测试平台。

英文摘要

Longitudinal personal albums are weak-schema multimodal databases: noisy perceptual records whose key facts require joins across faces, text, timestamps, locations, and repeated events. Existing visual, video, document, and lifelog benchmarks test sub-problems, but not album-scale profile reconstruction with social identity binding and evidence citation. Benchmarking this task is difficult because the ground truth needed for evaluation--owner profiles, social graphs, face-name maps, and evidence provenance--is private state that real albums cannot safely release. We introduce PAL-Bench, a controlled benchmark for evidence-grounded reconstruction under a public-record contract. Its Evidence Compiler builds latent private worlds, programs target-level evidence paths, renders album pixels, re-measures them through perception pipelines, and exports audited public/private views. Agents receive only perception-derived public records; targets, identifier maps, and evidence paths remain hidden. PAL-Bench contains 50 synthetic users, 36,659 public photo records, and 2,799 targets over owner facts, identities, and relations. A privacy-preserving audit with 10 participants confirms that PAL-Bench evidence structures match real private albums, though equivalent releases remain privacy-prohibitive. Across seven systems and two compute-matched diagnostics, a seven-metric protocol reveals a gap between plausible profile summarization and faithful social reconstruction: systems recover some owner facts but struggle with recurring identities and evidence citation. PAL-TRACE, a reference framework that freezes identity bindings before owner-fact mining, performs best but leaves hard identity resolution far from solved. PAL-Bench provides a testbed for perceptual entity resolution, multimodal data integration, temporal evidence aggregation, and provenance-aware structured prediction.

2606.16206 2026-06-16 cs.AI cs.CL cs.CY cs.HC 新提交

Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

衡量LLM导师是教学还是解题:教育影响的诊断方法

Junyi Yao, Zihao Zheng, Baichuan Li

发表机构 * Washington University in St. Louis(圣路易斯华盛顿大学) Department of Operations Research and Engineering Management, Southern Methodist University(南卫理公会大学运筹学与工程管理系)

AI总结 针对LLM作为教育导师时解题能力不等于教学支持的问题,提出基于解题导向与教学导向基准性能差距的诊断方法,通过MathTutorBench分析表明两者仅部分对齐,建议分开报告评分并明确保护学生能动性的标准。

详情
AI中文摘要

大型语言模型越来越多地被提议作为教育导师,但更强的任务解决能力并不一定意味着更强的学习支持。受近期呼吁在实践中衡量NLP系统社会影响的启发,我们研究公开的LLM辅导基准是否能够区分支持学习的行为与单纯的答案生成。我们提出了一种轻量级诊断方法,基于解题导向和教学导向基准性能之间的差距。利用公开的MathTutorBench排行榜结果,我们表明这些维度仅部分对齐:在八个公开报告的模型中,解题和教学综合得分之间的相关性为0.421,并且当评估从解题转向教学时,几个模型的排名发生了显著变化。然后,我们分析了公开的TutorBench样本,并表明与能动性相关的行为明确编码在基准评分标准中,尤其是在主动学习环境中,奖励引导性问题、校准提示和非揭露性脚手架。这些发现共同表明,教育影响评估不应将任务成功视为学习支持的充分代理。我们认为,公开的辅导基准可以通过分别报告解题导向和教学导向得分,并使披露敏感、保护学生能动性的标准更加明确,从而更好地支持积极影响评估。

英文摘要

Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in practice, we study whether public LLM tutoring benchmarks distinguish learning-supportive behavior from mere answer production. We propose a lightweight diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance. Using public MathTutorBench leaderboard results, we show that these dimensions are only partially aligned: across eight publicly reported models, the correlation between solving and pedagogy composites is 0.421, and several models shift meaningfully in rank when evaluation moves from solving to pedagogy. We then analyze the public TutorBench sample and show that agency-relevant behaviors are explicitly encoded in benchmark rubrics, especially in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding. Together, these findings suggest that educational-impact evaluation should not treat task success as a sufficient proxy for learning support. We argue that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately and by making disclosure-sensitive, student-agency-preserving criteria more explicit.

2606.16344 2026-06-16 cs.AI cs.CL cs.CY cs.LG 新提交

Whose hotel does the AI recommend? An algorithm audit of reputation signals in LLM-assisted hotel selection

AI推荐哪家酒店?LLM辅助酒店选择中声誉信号的算法审计

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani, Asher Ali

发表机构 * Fandaqah, Al Khobar, Saudi Arabia(沙特阿拉伯阿尔科巴尔Fandaqah) Hamdard University, Karachi, Pakistan(巴基斯坦卡拉奇哈姆达德大学)

AI总结 通过随机选择联合实验审计12种LLM,发现客人评分和价格主导推荐,但过度重视生态认证而忽略管理回复,且列表位置(无内容特征)有因果影响。

Comments 32 Pages

详情
AI中文摘要

旅行者越来越多地询问大语言模型(LLM)助手预订哪家酒店,使这些系统成为物业可见性的守门人——但什么驱动了它们的推荐尚未有记录。我们使用基于随机选择的联合实验进行预先指定的算法审计:跨角色、提示模板和十二个开放权重及专有模型,助手在五家酒店中进行选择,这些酒店的客人评分、评论数量和时效性、管理回复、连锁品牌、价格、生态认证和列表位置均被独立随机化。我们估计每个信号对推荐概率的平均边际成分效应。客人评分和价格占主导地位(高评分使选择概率提高31.6个百分点;高价格使其降低30.0个百分点),重现了人类效价和价格优先性,但过度重视生态认证而忽略管理回复。列表位置——一个无内容的伪影——因果性地改变推荐,价值约为每晚12美元。陈述的理由与揭示的权重不完全一致。这些发现为生成式引擎优化和AI信息中介的可问责性提供了因果证据。

英文摘要

Travelers increasingly ask large language model (LLM) assistants which hotel to book, making these systems gatekeepers of property visibility -- yet what moves their recommendations is undocumented. We conduct a pre-specified algorithm audit using a randomized choice-based conjoint: across personas, prompt templates, and twelve open-weight and proprietary models, assistants choose among five hotels whose guest rating, review volume and recency, management response, chain affiliation, price, eco-certification, and list position are independently randomized. We estimate the average marginal component effect of each signal on the probability of recommendation. Guest rating and price dominate (a top rating raises selection by 31.6 percentage points; a high price lowers it by 30.0), reproducing human valence-and-price primacy but over-weighting eco-certification and ignoring management response. List position -- a content-free artifact -- shifts recommendations causally, worth about \$12 per night. Stated reasons track revealed weights imperfectly. The findings ground generative engine optimization and the accountability of AI infomediaries in causal evidence.

2606.16605 2026-06-16 cs.AI 新提交

ARB4WM: An Adversarial Robustness Benchmark for World Models in Continuous Control

ARB4WM:连续控制中世界模型的对抗鲁棒性基准

Junjian Zhang, Hao Tan, Ruonan Li, Dong Zhu, Aiping Li, Zhaoquan Gu

发表机构 * College of Computer Science, National University of Defense Technology(国防科技大学计算机学院) College of Computer Science and Technology, Harbin Institute of Technology(哈尔滨工业大学计算机科学与技术学院) Department of New Networks, Peng Cheng Laboratory(鹏城实验室新型网络部) National Key Laboratory of Advanced Communication Networks(先进通信网络全国重点实验室)

AI总结 提出ARB4WM统一基准,从策略、价值和潜在动力学三个层面评估世界模型在视觉扰动下的对抗鲁棒性,发现多目标攻击和时序暴露模式对安全评估至关重要。

Comments 24 pages, 10 figures, 5 tables. Source code available at https://github.com/zaoanguai/ARB4WM

详情
AI中文摘要

世界模型因其能够学习潜在动力学进行规划和决策,被广泛应用于机器人和智能体工程控制系统。随着这些系统越来越多地部署在安全关键场景中,理解它们在对抗条件下的鲁棒性变得至关重要。然而,现有评估缺乏一个统一的基准来测试世界模型智能体在策略、价值和潜在动力学层面的对抗威胁。为填补这一空白,我们提出了ARB4WM,一个用于世界模型智能体在视觉扰动下部署前鲁棒性和风险评估的统一评估框架。ARB4WM在这三个层面定义了五个白盒损失目标,并研究了它们与单步或多步扰动策略以及时序攻击模式(包括全帧、半序列和稀疏帧暴露)结合时的效果。具体而言,我们在MetaWorld和DeepMind Control Suite的20个任务上评估了四种Dreamer风格智能体,针对不同的损失目标、扰动策略和时序攻击模式。结果表明,针对价值估计、潜在表示和RSSM动力学的攻击可能与直接策略破坏一样具有破坏性,早期或频繁的扰动尤其有害,而输入级防御在自适应攻击下提供的恢复能力有限。这些发现表明,世界模型的安全性、风险和可靠性评估应涵盖多个面向组件的攻击目标和时序暴露协议,而非仅依赖于动作空间的鲁棒性。源代码可在https://github.com/zaoanguai/ARB4WM获取。

英文摘要

World models are widely used in robotic and agentic engineering control systems due to their ability to learn latent dynamics for planning and decision-making. As these systems are increasingly deployed in safety-critical settings, understanding their robustness under adversarial conditions has become essential. However, existing evaluations lack a unified benchmark for testing adversarial threats across the policy, value, and latent-dynamics levels of world-model agents. To fill this gap, we present ARB4WM, a unified evaluation framework for pre-deployment robustness and risk assessment of world-model agents under visual perturbations. ARB4WM defines five white-box loss objectives across these three levels and studies their effects when combined with single-step or multi-step perturbation strategies and temporal attack modes, including full-frame, half-sequence, and sparse-frame exposure. Specifically, we evaluate four Dreamer-style agents across 20 tasks from MetaWorld and the DeepMind Control Suite under different loss objectives, perturbation strategies, and temporal attack modes. Results show that attacks targeting value estimation, latent representations, and RSSM dynamics can be as damaging as direct policy disruption, and that early or frequent perturbations are especially harmful, while input-level defenses provide limited recovery under adaptive attacks. These findings suggest that safety, risk, and reliability assessment for world models should cover multiple component-oriented attack objectives and temporal exposure protocols rather than relying solely on action-space robustness. Source code is available at https://github.com/zaoanguai/ARB4WM.

2606.16613 2026-06-16 cs.AI 新提交

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

CoffeeBench:异构多智能体经济中的长周期LLM智能体基准测试

Issa Sugiura, Daichi Hattori, Kazuo Araragi, Keita Ogawa, Shota Onose, Taro Makino, Teppei Usuki, Takashi Ishida

发表机构 * Sakana AI KPMG AZSA LLC

AI总结 提出CoffeeBench基准,在90天模拟中评估LLM智能体在异构多智能体经济中的长周期任务表现,发现高性能模型更积极沟通,而Claude Haiku 4.5存在空闲漂移失败模式。

Comments 23 pages, 8 figures

详情
AI中文摘要

随着LLM智能体能够处理越来越长周期的任务,评估它们在经济系统中的表现变得越来越重要。与主要评估单个智能体与被动环境交互的现有基准不同,经济系统本质上是多智能体的,需要自主智能体在追求自身长期目标的同时进行通信、谈判和交易。我们引入了CoffeeBench,这是一个用于评估LLM智能体在由异构公司组成的长期多智能体经济中的基准。在CoffeeBench中,两个农民、两个烘焙师和两个零售商在90天的模拟中自主经营业务,每个都通过通信和交易最大化累计净收入,同时管理现金、库存和定价。被评估的模型控制一个咖啡烘焙师,而其余公司由固定的参考智能体控制。在几个最近的开源和专有LLM中,所有模型都优于不采取任何行动的被动基线,大多数实现了正净收入。对智能体行为的分析揭示了长期经济互动中的显著差异:性能更高的模型与其他公司更积极地通信,而Claude Haiku 4.5表现出空闲漂移失败模式,尽管产生了连贯的评估和计划,但反复选择不行动。我们发布了代码和智能体轨迹以支持未来研究。

英文摘要

As LLM agents become capable of increasingly long-horizon tasks, evaluating their performance in economic systems is becoming increasingly important. Unlike existing benchmarks that primarily evaluate a single agent interacting with a passive environment, economic systems are inherently multi-agent, requiring autonomous agents to communicate, negotiate, and transact while pursuing their own objectives over extended periods. We introduce CoffeeBench, a benchmark for evaluating LLM agents in a long-horizon multi-agent economy composed of heterogeneous firms. In CoffeeBench, two farmers, two roasters, and two retailers autonomously operate their businesses over a 90-day simulation, each seeking to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The evaluated model controls one coffee roaster, while the remaining firms are controlled by fixed reference agents. Across several recent open-weight and proprietary LLMs, all models outperform a passive baseline that takes no actions, with most achieving positive net income. Analysis of agent behavior reveals substantial differences in long-horizon economic interaction: higher-performing models communicate more actively with other firms, whereas Claude~Haiku~4.5 exhibits an idle-drift failure mode, repeatedly choosing inaction despite producing coherent assessments and plans. We release our code and agent trajectories to support future research.

2606.16723 2026-06-16 cs.AI 新提交

AgentFairBench: Do LLM Agents Discriminate When They Act?

AgentFairBench: LLM智能体在行动时是否存在歧视?

Triveni Morla, Rohith Reddy Bellibaltu, Manpreet Singh, Manmeet Singh Kapoor

发表机构 * Florida International University(佛罗里达国际大学) Boston University(波士顿大学) Department of Computer Science and Engineering, Indian Institute of Technology Patna(印度帕纳吉印度理工学院计算机科学与工程系)

AI总结 提出AgentFairBench基准,通过反事实匹配集和偏差传导框架,评估LLM智能体在招聘、贷款和医疗分诊中的行动公平性,发现统计量级不匹配会夸大歧视,而匹配后Claude Haiku无显著人口统计效应。

Comments Submitted to IEEE Access

详情
AI中文摘要

大型语言模型(LLM)智能体越来越多地采取行动(筛选申请人、推荐信贷、分诊患者),但LLM的公平性仍通过评分答案来衡量。我们引入AgentFairBench,一个廉价、可复现、多领域的基准,用于评估LLM智能体行动中的人口统计差异。基于配套框架——偏差传导框架(BCF,在此重述),它涵盖三个监管锚定的领域:招聘、贷款和医疗分诊。在四种递增代理能力的智能体框架(直接、思维链、多智能体协商、工具增强)下,使用合成的人口统计中性档案,在仅改变姓名编码的种族×性别信号的反事实匹配集中进行评估(遵循Bertrand Mullainathan传统)。一个仅依赖NumPy的测试工具计算反事实翻转率、平均绝对分数差异(MASD)、行动率差异和工具调用差异,并提供自助置信区间、配对检验和错误发现率控制,每个模型的成本仅为个位数美元。一个包含保留私有分割和污染金丝雀的实时排行榜接受外部模型提交。我们的试点研究(864个决策加上重测复现)带来了一个方法论教训:将六组分数分布与两次运行的噪声差异进行比较,仅通过统计量级就会将差异夸大约2.4倍。在匹配量级的噪声基底和综合组检验下,Claude Haiku 4.5未显示出高于采样噪声的人口统计效应(120个成对对比中0个和9个综合对比中0个通过校正);植入偏差测试证实该工具能检测到存在的差异。贡献在于一个健全、敏感、可采用的工具、量级匹配的零假设方法以及可扩展的开源工件。代码、数据和测试工具以开放许可证发布,并附有匿名评审工件。

英文摘要

Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet fairness for LLMs is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race x gender signal (in the Bertrand Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by ~ 2.4X through statistic arity alone. Against an arity matched noise floor and an omnibus group test, claude haiku 4 5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms the instrument detects disparity when present. The contribution is a sound, sensitive, adoption-ready instrument, the arity matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.

2606.16802 2026-06-16 cs.AI 新提交

LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control

LabOSBench: 科学仪器控制的计算机使用智能体基准测试

Anqi Zou, Han Deng, Chengyu Zhang, Junquan Hu, Yu Wang, Yuxiang Xing, Aokai Zhang, Hanling Zhang, Zhaoyang Liu, Ben Fei, Zhihui Wang, Wanli Ouyang

发表机构 * Shenzhen Loop Area Institute(深圳循环区域研究所) Dalian University of Technology(大连理工大学) The Chinese University of Hong Kong(香港中文大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出LabOSBench基准,基于Web科学仪器模拟器评估多模态GUI智能体在仪器控制中的表现,揭示现有智能体在反馈驱动操作和长流程执行上的不足。

详情
AI中文摘要

当前的计算机使用基准主要关注虚拟化系统中的软件操作任务,而科学仪器场景需要协调控制复杂界面和反馈驱动的参数调整。然而,直接在物理高精度仪器上评估智能体因高成本、安全风险、有限可访问性和难以保证可重复评估而不切实际。这促使需要一个模拟但真实的测试平台,既能保留科学仪器的操作挑战,又能实现可扩展和安全的基准测试。为此,我们引入了LabOSBench,这是一个基于一套基于Web的科学仪器模拟器构建的多模态GUI智能体的挑战性基准。LabOSBench通过浏览器直接操作,避免了资源密集型的操作系统虚拟化,同时支持灵活的任务配置和基于执行的评估。具体来说,LabOSBench在八个仪器模拟器上构建了96个子任务,涵盖了从样品加载、对准、参数调整、数据采集到结果检查的工作流程。我们在子任务和端到端级别评估了通用视觉语言模型、专用GUI智能体模型和高级智能体框架。我们的实验表明,尽管现有智能体可以完成许多结构化的GUI子任务,但它们仍然在反馈驱动操作和长周期工作流执行中挣扎。总体而言,LabOSBench为推进计算机使用智能体向科学仪器控制发展提供了一个可重复、低成本的测试平台。

英文摘要

Current computer-use benchmarks primarily focus on software operation tasks in virtualized systems, whereas scientific instrumentation scenarios require coordinated control over complex interfaces, and feedback-driven parameter adjustment. However, directly evaluating agents on physical high-precision instruments is impractical due to high cost, safety risks, limited accessibility, and difficulty in ensuring reproducible evaluation. This motivates the need for a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Operating directly via a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. Specifically, LabOSBench constructs 96 subtasks across eight instrument simulators, covering workflows from sample loading, alignment, parameter tuning, and data acquisition to result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both subtask and end-to-end levels. Our experiments reveal that while existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-using agents toward scientific-instrument control.

2606.16974 2026-06-16 cs.AI 新提交

The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers

拥抱开放科学:十年AI研究与56 800篇会议论文的分析

Kevin L Coakley, Thijs Snelleman, Holger Hoos, Odd Erik Gundersen

发表机构 * Norwegian University of Science and Technology(挪威科技大学) University of California San Diego(加州大学圣迭戈分校) RWTH Aachen University(亚琛工业大学) Leiden University(莱顿大学)

AI总结 分析2014-2024年五大AI会议56,800篇论文,发现文档实践改善,代码和数据共享率从11%升至64%,可重复性估计从28%升至64%,且改善早于可重复性检查清单的引入,反映开放科学运动。

详情
AI中文摘要

可重复性危机促使AI研究社区改进文档实践。多项研究已指出方法论问题,作为回应,该领域最具影响力的会议引入了可重复性检查清单。我们试图通过评估过去十年五大领先AI会议的所有已发表论文,了解文档实践是否随时间改变。确定了七个可重复性变量,经过质量保证并用于分析56,800篇出版物。我们的分析显示,在2014年至2024年期间,文档实践有所改善;同时共享代码和数据的论文增加了近六倍,从11%增至64%。基于先前研究的实证可重复性率,我们估计——根据文档实践推断,而非直接测试——可重复性从2014年的28%增加到2024年的64%。文档实践的改善早于可重复性检查清单的引入,表明这些变化反映了更广泛的开放科学运动,而非对正式要求的直接响应。

英文摘要

The reproducibility crisis has directed the AI research community toward improving documentation practices. Several studies have identified methodological issues, and in response, the most impactful venues in the field have introduced reproducibility checklists. We seek to understand whether documentation practices have changed over time by assessing all published papers at five leading AI conferences over the past decade. Seven reproducibility variables were identified, quality-assured and used to analyse 56 800 publications. Our analysis reveals that in the period 2014 to 2024, documentation practices have improved; papers sharing both code and data increased nearly sixfold, from 11% to 64% Building on empirical reproducibility rates from a prior study, we estimate - inferred from documentation practices, not direct testing - that reproducibility increased from 28% in 2014 to 64% in 2024. Improvements in documentation practices predate the introduction of reproducibility checklists, suggesting these changes reflect a broader movement toward open science rather than a direct response to formal requirements.

2606.14715 2026-06-16 cs.MA cs.AI cs.SI 交叉投稿

MiroBench: Benchmarking Realism in Agentic Simulation of Real-world Discussions

MiroBench:真实世界讨论的智能体模拟真实性基准测试

Yaoning Yu, Ye Yu, Haojing Luo, Haohan Wang

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Starc.Institute(Starc研究院)

AI总结 提出MiroBench基准,基于4292条真实Reddit帖子,通过统计测试评估LLM智能体模拟在重复性、叙事内容、毒性攻击和结构复杂度四个方面的分布匹配度,发现当前模拟器与真实讨论存在分布差异。

详情
AI中文摘要

LLM智能体越来越多地被用于模拟真实世界互动,但尚不清楚模拟行为是否保留了真实人类行为的内容模式和互动动态。现有评估仍然碎片化,使得比较系统或衡量进展变得困难。在本文中,我们聚焦于Reddit讨论,作为评估真实世界社会模拟的具体第一步。Reddit帖子提供了公开的、基于主题的多方互动,人们在其中分享经验、辩论、寻求建议、表达情感,并共同对产品、事件和社会问题做出回应。这些讨论为更广泛的社会行为提供了可观察的窗口,使其成为测试LLM智能体能否不仅再现流畅文本,还能再现真实在线社区的分布模式和互动动态的有用场景。我们介绍了MiroBench,一个基于4292条真实Reddit帖子构建的Reddit讨论模拟基准。MiroBench使用统计测试在四个主要方面比较生成讨论和真实讨论:重复性和语义一致性、叙事内容、毒性攻击以及结构复杂度。跨五个领域和五个模型的实验表明,当前模拟器与真实Reddit帖子在分布上仍不匹配,而一种轻量级的基于提示的改进程序仅带来有限的提升。MiroBench为衡量、诊断和改进基于LLM的社会模拟的真实性提供了一个具体基准。

英文摘要

LLM agents are increasingly used to simulate real world interactions, but it remains unclear whether simulated behaviors preserve the content patterns and interaction dynamics of real human behaviors. Existing evaluations remain fragmented, which makes it difficult to compare systems or measure progress. In this paper, we focus on Reddit discussions as a concrete first step toward evaluating real-world social simulation. Reddit threads provide public, topic-grounded, multi-party interactions where people share experiences, debate, seek advice, express emotion, and collectively respond to products, events, and social issues. These discussions offer an observable window into broader social behavior, making them a useful setting for testing whether LLM agents can reproduce not only fluent text, but also the distributional patterns and interaction dynamics of real online communities. We introduce MiroBench, a benchmark for Reddit discussion simulation built from 4,292 real Reddit threads. MiroBench uses statistical tests to compare generated and real discussions across four major aspects: repetition and semantic uniformity, narrative content, toxicity and aggression, and structural complexity. Experiments across five domains and five models show that current simulators remain distributionally mismatched with real Reddit threads, while a lightweight prompt-based improvement procedure provides only limited gains. MiroBench offers a concrete benchmark for measuring, diagnosing, and improving realism in LLM-based social simulation.

2606.14747 2026-06-16 cs.CV cs.AI 交叉投稿

MMLongEmbed: Benchmarking Multimodal Embedding Models in Long-Context Scenarios

MMLongEmbed: 长上下文场景下的多模态嵌入模型基准测试

Haitian Wang, Ruoxi Sun, Quantong Qiu, Juntao Li, Junhui Li, Hua Chen, Jinxiong Chang, Min Zhang

发表机构 * Soochow University(苏州大学) Ant Group(蚂蚁集团)

AI总结 针对多模态嵌入模型在长上下文场景中缺乏系统评估的问题,提出首个综合基准MMLongEmbed,涵盖文本、文档和视频模态的检索任务,揭示模型依赖浅层特征匹配、难以捕捉深层语义依赖等瓶颈。

详情
AI中文摘要

最近的进展显著扩展了多模态嵌入模型(MEMs)的理论上下文窗口。然而,更大的上下文窗口并不一定能转化为对长上下文多模态输入的有效理解和表示,这仍然是实际部署的关键瓶颈。为了解决这一设置中缺乏系统评估的问题,我们引入了MMLongEmbed,这是首个用于评估长上下文场景中MEMs的综合基准。MMLongEmbed包含四个检索任务,涵盖多个上下文长度范围,覆盖文本、文档和视频模态。通过对最先进模型的广泛评估,我们发现当前架构严重依赖浅层特征匹配,难以捕捉深层语义和结构依赖。我们进一步观察到,性能下降随上下文长度和关键信息位置系统性地变化。此外,模型对不同模态中的冗余上下文信息表现出显著不同的鲁棒性。为了可重复性,基准和代码已公开。

英文摘要

Recent advancements have significantly expanded the theoretical context windows of Multimodal Embedding Models (MEMs). However, larger context windows do not necessarily translate into effective comprehension and representation of long-context multimodal inputs, which remains a critical bottleneck for real-world deployment. To address the lack of systematic evaluation in this setting, we introduce MMLongEmbed, the first comprehensive benchmark for evaluating MEMs in long-context scenarios. MMLongEmbed comprises four retrieval tasks spanning multiple context-length ranges, covering text, document, and video modalities. Through extensive evaluation of state-of-the-art models, we find that current architectures rely heavily on superficial feature matching and struggle to capture deep semantic and structural dependencies. We further observe that performance degradation varies systematically with context length and key information placement. Moreover, models exhibit substantially different robustness to redundant contextual information across modalities. For reproducibility, the benchmark and code are publicly available.

2606.14754 2026-06-16 cs.CV cs.AI 交叉投稿

Sub-Semantic Image Segmentation

子语义图像分割

Aviad Cohen Zada, Nadav Orenstein, Shai Avidan, Gal Oren

发表机构 * Tel Aviv University(特拉维夫大学) Stanford University(斯坦福大学) Technion(以色列理工学院)

AI总结 提出子语义图像分割,通过耦合视觉-语言模型与SAM,并引入DETECTURE解决语言泄漏、提示竞争和语义失真问题,在自建数据集TextureADE上取得最优性能。

Comments 23 pages. Code: https://github.com/Scientific-Computing-Lab/TextureDetecture

详情
AI中文摘要

图像可以基于视觉线索(即纹理分割)或对象(即语义分割)进行分割。我们提出了一类新的子语义图像分割,模糊了两者之间的界限。在子语义图像分割中,语言不用于命名整个对象。相反,它用于将图像划分为可由语言描述的稳定外观模式。为此,我们将通用视觉-语言模型与SAM 3(一个可提示分割骨干网络,其原生文本路径可以将丰富描述映射到掩码)耦合。简单的耦合由于我们在论文中识别的多种原因而失败,我们通过引入DETECTURE来克服它们,解决了三个具体的失效模式——纹理区域之间的语言泄漏、分割骨干网络内部的提示竞争以及语言到掩码接口处的语义失真。由于没有子语义图像分割的数据集,我们引入了一个名为TextureADE的数据集。新数据集使用我们设计的系统从ADE20K数据集派生而来。我们将DETECTURE与多个基线进行比较,发现它在多个数据集上使用不同指标均取得了最强性能。代码可在https://github.com/Scientific-Computing-Lab/TextureDetecture获取。

英文摘要

Images can be segmented based on visual cues (i.e., texture segmentation) or into objects (i.e., semantic segmentation). We propose a new category of sub-semantic image segmentation that blurs the line between the two. In sub-semantic image segmentation, language is not used to name whole objects. Instead, it is used to partition an image into stable appearance patterns that can be described by language. To do that, we couple a general-purpose vision-language model to SAM 3, a promptable segmentation backbone whose native text pathway can ground rich descriptions into masks. Simple coupling fails for a number of reasons that we identify in the paper, and we overcome them by introducing DETECTURE that resolves three concrete failure modes -- language leakage between texture regions, prompt competition inside the segmentation backbone, and semantic distortion at the language-to-mask interface. Since there is no dataset of sub-semantic image segmentation, we introduce one, termed TextureADE. The new dataset is derived from the ADE20K dataset using a system we designed. We compare DETECTURE to a number of baselines and find that it achieves the strongest performance on several datasets using different metrics. Code is available at https://github.com/Scientific-Computing-Lab/TextureDetecture.

2606.14755 2026-06-16 cs.CV cs.AI 交叉投稿

Where Does Texture Evidence Live in SAM? Features, Proposal Masks, and Texture Segmentation

纹理证据在 SAM 中存在于何处?特征、提议掩码与纹理分割

Nadav Orenstein, Aviad Cohen Zada, Shai Avidan, Gal Oren

发表机构 * Tel Aviv University(特拉维夫大学) Stanford University(斯坦福大学) Technion(以色列理工学院)

AI总结 研究冻结的 Segment Anything Model (SAM) 中纹理相关证据的存在性,通过最小聚类读取和提议银行监督读取分析多尺度特征与自动提议掩码,发现 SAM 并非纹理盲,但默认失败源于读取不匹配和承诺失败。

Comments 26 pages, 13 figures, 20 tables. Code available at https://github.com/Scientific-Computing-Lab/ArchiTexture

详情
AI中文摘要

纹理分割对基础分割模型构成挑战,因为有意义区域由材质或重复外观而非物体身份定义。Segment Anything Models (SAMs) 默认情况下在纹理定义的分割上经常失败,但这种失败是模糊的:纹理证据可能缺失、在提议银行中缺失,或者存在但被以物体为中心的读取方式错误选择或组装。我们询问在适应之前,冻结的 SAM 中已经保留了哪些纹理相关证据。我们研究两个冻结的证据空间:多尺度特征(通过最小聚类读取探测)和自动提议银行(作为监督整合读取的证据)。SAM 全程冻结;我们不微调骨干网络或重新训练提议生成器。在 RWTD、STLD、ADE20K 精选精修裁剪补充集以及 ControlNet 拼接的 PTD 桥梁存档上,冻结的 SAM 默认情况下不是纹理分割器,但其失败并非简单的纹理盲。粗糙的冻结特征保留了纹理组织,提议银行通常包含纹理对齐的掩码或片段。自然场景更常需要组装和对片段做出承诺,而更干净的合成案例则通常简化为选择已经连贯的提议。因此,默认掩码失败应分解为表示证据、提议银行支持、读取不匹配和承诺失败。

英文摘要

Texture segmentation stresses foundation segmentation because meaningful regions are defined by material or repeated appearance rather than object identity. Segment Anything Models (SAMs) often fail by default on such texture-defined partitions, but this failure is ambiguous: the texture evidence may be absent, missing from the proposal bank, or present but selected or assembled incorrectly by an object-centric readout. We ask what texture-relevant evidence is already preserved in frozen SAM before adaptation. We study two frozen evidence spaces: multiscale features, probed with a minimal clustering readout, and the automatic proposal bank, treated as evidence for a supervised consolidation readout. SAM is frozen throughout; we do not fine-tune the backbone or retrain the proposal generator. Across RWTD, STLD, an ADE20K-selected refined-crop complement, and a ControlNet-stitched PTD bridge archive, frozen SAM is not a texture segmenter by default, but its failures are not simple texture blindness. Coarse frozen features preserve texture organization, and proposal banks often contain texture-aligned masks or fragments. Natural scenes more often require assembly and commitment over fragments, while cleaner synthetic cases more often reduce to selecting an already coherent proposal. Default mask failure should therefore be decomposed into representation evidence, proposal-bank support, readout mismatch, and commitment failure.

2606.14820 2026-06-16 cs.SD cs.AI cs.CL eess.AS 交叉投稿

Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models

频谱-时间干扰混淆空间音频基础模型中的相位编码

Yuxuan Chen, Haoyuan Yu, Peize He

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Jilin University(吉林大学) Hunan University(湖南大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出基于双耳掩蔽级差的心理声学基准,评估空间自监督音频模型对微秒级耳间相位精细结构的编码能力,发现通用双耳SSL模型依赖频谱-时间干扰纹理而非真实相位计算。

Comments Accepted to INTERSPEECH 2026; 6 pages, 3 figures

详情
AI中文摘要

最近的空间自监督音频模型在定位任务上取得了高性能,引发了对它们编码微秒级耳间相位精细结构能力的疑问。我们提出了一个基于双耳掩蔽级差的心理声学基准来评估这一点。使用均衡抵消基线和GCC-PHAT阳性对照,我们评估了九个冻结的音频模型,涵盖双耳SSL、单耳SSL和神经音频编解码器。四个单耳阴性对照产生零BMLD,确认了双耳特异性。两个通用双耳SSL模型表现出最小的相位敏感性,而专用双耳空间SSL模型实现了与分析基线相当的BMLD。渐进式物理消融实验表明,通用双耳SSL模型依赖于频谱-时间干扰纹理而非跨通道相位计算。语音中的高检测率反映了对宽带包络而非真实相位编码的混淆依赖。

英文摘要

Recent spatial self supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference to evaluate this. Using an equalization cancellation baseline and a GCC PHAT positive control we evaluate nine frozen audio models spanning binaural SSL, monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD confirming binaural specificity. Two general purpose binaural SSL models exhibit minimal phase sensitivity while dedicated binaural spatial SSL models achieve BMLD comparable to the analytical baseline. Progressive physical ablations show that general purpose binaural SSL models rely on spectro temporal interference textures rather than cross channel phase computation. High detection rates in speech reflect a confounding reliance on broadband envelopes rather than genuine phase encoding.

2606.14867 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

Evaluating the Robustness of Proof Autoformalization in Lean 4

评估 Lean 4 中证明自动形式化的鲁棒性

Zhengtao Gui, Sheng Yang, Zhouxing Shi

发表机构 * University of California, Irvine(加州大学洛杉矶分校) University of California, Riverside(加州大学河滨分校)

AI总结 研究证明自动形式化模型在全局和局部扰动下的鲁棒性,发现现有模型对全局扰动敏感且多数无法忠实反映局部扰动。

Comments Preprint

详情
AI中文摘要

证明自动形式化旨在将用自然语言编写的数学非正式证明翻译成形式语言(如 Lean~4)中的形式证明。已有几项工作开发了基于 LLM 的证明自动形式化模型。然而,现有评估通常侧重于翻译来自精选数据集的规范非正式证明。我们认为,一个鲁棒的证明自动形式化器必须即使对于偏离这些理想化形式的非正式证明也能保持忠实,并提出了首个关于证明自动形式化模型鲁棒性的研究。我们制定了两类扰动并评估每种扰动下的鲁棒性:全局扰动以不同风格改写非正式证明,在此情况下形式化应保持一致;局部扰动改变一个值、符号或证明步骤,可能是反事实的方式,鲁棒的形式化应忠实地反映扰动,而不是自行恢复为原始形式或推断出不同的形式。我们在 miniF2F 和 MATH-500 上构建了包含两种扰动的基准,并自动衡量证明自动形式化在全局扰动下正确性的稳定程度,以及其输出在局部扰动下的忠实程度。我们评估了七个最新模型,所有模型都对全局扰动敏感,且大多数在局部扰动下无法保持忠实。代码和数据可通过 https://github.com/ucr-rai/robust-proof-autoformalization 获取。

英文摘要

Proof autoformalization aims to translate a mathematical informal proof written in natural language into a formal proof in a formal language such as Lean~4. Several works have developed LLM-based models for proof autoformalization. However, existing evaluations have typically focused on translating well-formed informal proofs from curated datasets. We argue that a robust proof autoformalizer must remain faithful even for informal proofs that diverge from these idealized ones, and we present the first study on the robustness of proof autoformalization models. We formulate two categories of perturbations and evaluate robustness under each: a global perturbation paraphrases the informal proof in a different style, under which the formalization should remain consistent; a local perturbation alters a value, symbol, or proof step, possibly in a counterfactual way, and a robust formalization should faithfully reflect the perturbation rather than reverting to the original one or inferring a different one on its own. We build a benchmark with both perturbations on miniF2F and MATH-500, and automatically measure how stable a proof autoformalization's correctness is under global perturbations and how faithfully its output reflects local perturbations. We evaluate seven recent models, all of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations. Code and data are available via https://github.com/ucr-rai/robust-proof-autoformalization.

2606.14948 2026-06-16 cs.SE cs.AI 交叉投稿

Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment

超越正确性:通过可扩展的智能体判断标注增强代码大模型的架构推理能力

Kirill Vasilevski, Ximing Dong, Benjamin Rombaut, Ruochen Deng, Jiahuei Lin, Arthur Leung, Dayi Lin, Boyuan Chen, Shaowei Wang, Ahmed E. Hassan

发表机构 * Centre for Software Excellence, Huawei Canada(华为加拿大软件卓越中心) Department of Computer Science, University of Manitoba, Canada(曼尼托巴大学计算机科学系) School of Computing, Queen’s University, Canada(皇后大学计算科学学院)

AI总结 针对代码大模型缺乏架构理解的问题,提出智能体判断流水线,利用强LLM作为专家架构评估的代理,通过两个判断器(ACJ和AQJ)实现可扩展标注,微调模型在SWE-bench上提升高达540%,并展现跨语言泛化能力。

详情
AI中文摘要

大语言模型(LLM)已显著改进软件工程,但实际开发需要架构理解。这种理解的人工标注成本过高,且无法仅通过测试验证。我们提出一种智能体判断流水线,使用强LLM作为专家架构评估的可扩展代理,包含两个判断器:架构复杂度判断器(ACJ)评估任务所需的代码库特定架构理解,架构质量判断器(AQJ)通过基于源代码的准则评估补丁对仓库特定架构约定的符合程度。在3360个精选实例上微调Qwen3-8B/14B/32B,在SWE-bench Verified上实现了高达27.2%的解决率——比基础模型提升540%,比未过滤微调提升256%。同时,训练后的模型实现了强大的跨语言泛化能力和架构补丁质量的一致改进。

英文摘要

LLMs have substantially improved software engineering yet real-world development requires architectural understanding. Such understanding is prohibitively expensive to label manually and impossible to verify through tests alone. We propose an agentic judging pipeline using a strong LLM as a scalable proxy for expert architectural evaluation, comprising two judges: the Architecture Complexity Judge (ACJ), which estimates codebase-specific architectural understanding a task demands, and the Architecture Quality Judge (AQJ), which evaluates patch conformance to repository-specific architectural conventions via source-grounded rubrics. Fine-tuning Qwen3-8B/14B/32B on 3,360 curated instances achieves resolved rates of up to 27.2% on SWE-bench Verified - up to 540% over the base model and 256% over unfiltered fine-tuning. Meanwhile, the trained models achieve strong cross-language generalization and consistent improvements in architectural patch quality.

2606.15144 2026-06-16 cs.CL cs.AI 交叉投稿

PACUTE: Phonology-, Affix-, and Character-level Understanding of Tokens for Filipino

PACUTE: 面向菲律宾语的音韵、词缀和字符级词元理解

Jann Railey Montalan, David Demitri Africa, Jimson Paulo Layacan, Richell Isaiah Flores, Ivan Yuri De Leon, Lance Calvin Gamboa

发表机构 * AI Singapore(AI新加坡) Nanyang Technological University(南洋理工大学) UK AI Security Institute(英国人工智能安全研究所) Ateneo de Manila University(马尼拉雅典耀大学) University of Birmingham(伯明翰大学)

AI总结 提出PACUTE基准,包含4600个任务,通过六层诊断框架评估大语言模型在菲律宾语中的形态理解,发现开放权重模型在语素分解上接近随机,前沿模型在组合任务上远低于字符级上限。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

大型语言模型(LLMs)将文本处理为子词词元序列,这掩盖了构成词形成的字符级和形态结构。对于具有非连接形态的语言,这种限制最为严重,标准分词器系统性地使词元边界与语素边界错位。我们引入PACUTE,一个包含4600个任务的诊断基准,旨在评估菲律宾语中的形态理解,菲律宾语以能产的中缀、重叠和变音符号驱动的词汇区分(通常不在书面文本中出现)为特征。PACUTE包括一个六层组合诊断框架,用于定位形态理解在何处崩溃。评估开放权重LLMs和前沿商业模型,我们发现开放权重模型在语素分解上无论规模大小都接近随机。前沿模型表现更好,通常在包含匹配评分下能恢复单个词缀,但在语素变换和音节划分的组合任务上仍远低于其字符级上限。这些结果表明,能产的形态组合(而非仅字符访问)是菲律宾语词汇结构理解的持续瓶颈。

英文摘要

Large language models (LLMs) process text as sequences of subword tokens, which can obscure the character-level and morphological structure that underlies word formation. This limitation is most acute for languages with non-concatenative morphology, where standard tokenizers systematically misalign token boundaries with morpheme boundaries. We introduce PACUTE, a diagnostic benchmark of 4,600 tasks designed to evaluate morphological understanding in Filipino, a language characterized by productive infixation, reduplication, and diacritic-driven lexical distinctions that are typically absent from written text. PACUTE includes a hierarchical diagnostic framework of six compositional levels that localizes where morphological understanding breaks down. Evaluating open-weight LLMs and frontier commercial models, we find that open-weight models perform near chance on morpheme decomposition regardless of scale. Frontier models perform much better, often recovering individual affixes under contains-match scoring, but remain far below their character-level ceilings on compositional tasks of morpheme transformations and syllabification. These results identify productive morphological composition, rather than character access alone, as the persistent bottleneck for Filipino word-structure understanding.

2606.15216 2026-06-16 cs.CL cs.AI 交叉投稿

Spokes: Optimizing for Diverse Pretraining Data Selection

Spokes: 优化多样化预训练数据选择

Clarence Lee, Yejin Choi, Luke Zettlemoyer, Pang Wei Koh, Hai Leong Chieu

发表机构 * DSO National Laboratories(DSO国家实验室) Stanford University(斯坦福大学) University of Washington(华盛顿大学)

AI总结 提出基于G-Vendi分数的概率多样化框架,通过指数梯度下降直接优化数据多样性,在FineWeb和DCLM上提升下游性能1.5和1.4个点。

Comments 9 pages, 4 figures

详情
AI中文摘要

多样性在数据选择中起着关键作用,通过减少冗余和重复,在固定数据预算下提高性能。然而,优化多样性本身具有挑战性,因为它是集合级属性,依赖于数据点之间的交互而非单个示例。因此,现有方法通常依赖代理或近似,往往无法确保足够多样化的子集。在这项工作中,我们通过引入基于G-Vendi分数的概率多样化框架,并利用指数梯度下降进行优化,直接优化多样性。我们的方法生成的子集比通过随机抽样获得的子集多样化得多,在50万样本子集上实现了G-Vendi分数增加489。我们在FineWeb和DCLM上评估了我们的方法,它持续优于现有方法。值得注意的是,SPOKES(仅多样性)在DCLM和FineWeb上分别比随机抽样提高了平均下游性能0.4和0.5个点。更重要的是,联合优化质量和多样性取得了最强结果:SPOKES在DCLM和FineWeb上分别取得了1.5和1.4个点的提升,优于所有基线,包括语义去重和质量过滤。

英文摘要

Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property that depends on interactions between data points rather than individual examples. As a result, existing approaches typically rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. In this work, we directly optimize diversity by introducing a probabilistic diversification framework based on the G-Vendi score, optimized via exponentiated gradient descent. Our method produces subsets that are substantially more diverse than those obtained via random sampling, achieving a +489 increase in G-Vendi score on a 500k-sample subset. We evaluate our approach on FineWeb and DCLM, where it consistently outperforms existing methods. Notably, SPOKES (diversity-only) improves average downstream performance by +0.4 and +0.5 points over random sampling on DCLM and FineWeb, respectively. More importantly, jointly optimizing for both quality and diversity yields the strongest results: SPOKES achieves gains of +1.5 and +1.4 points on DCLM and FineWeb, outperforming all baselines, including semantic deduplication and quality filtering.

2606.15306 2026-06-16 cs.LG cs.AI 交叉投稿

LatentGym: A Testbed For Cross-Task Experiential Learning With Controllable Latent Structure

LatentGym: 具有可控潜在结构的跨任务经验学习测试平台

Daksh Mittal, Tommaso Castellani, Thomson Yen, Naimeng Ye, Fangyu Wu, Minghui Chen, Tiffany Cai, Emmanouil Koukoumidis, William Zeng, Hongseok Namkoong

发表机构 * Columbia University(哥伦比亚大学) Oumi Blog | Code | Models(Oumi博客 | 代码 | 模型)

AI总结 提出LatentGym测试平台,通过可控潜在变量分离探索与利用,研究LLM代理在跨任务序列中的适应性学习机制。

Comments 61 pages

详情
AI中文摘要

我们设想持续学习的代理系统会随时间变得更加有用:当它们遇到一系列相关任务时,应该推断这些任务之间共享的隐藏结构,并利用它来改进未来的决策。这种跨任务经验学习能力在个性化和交互式辅助等领域至关重要,但现有的训练/评估框架不提供共享的、可控的潜在结构,也无法衡量代理是否改进或改进的原因。我们引入了LatentGym:一个可控的套件,其中每个环境都围绕一个控制任务间结构的地面真实潜在变量组织。我们的构建产生了将探索(代理的行为是否收集关于潜在变量的信息)与利用(代理是否使用收集到的信息)分离的指标。我们在实证研究中展示了我们的套件,解决了三个问题:前沿模型如何以及为什么无法适应相关任务;对相关任务序列进行后训练是否能提高一般的跨任务适应性,以及这些收益来自何处;以及诸如任务间反馈等设计选择如何塑造训练动态和泛化。总之,这些结果为研究LLM代理如何从跨任务经验中学习,以及设计在顺序、个性化和交互式设置中更可靠适应的代理建立了受控基础。

英文摘要

We envision continually learning agentic systems that become more useful over time: as they encounter sequences of related tasks, they should infer the hidden structure shared across those tasks and use it to improve future decisions. This cross-task experiential learning capability is pivotal in domains such as personalization and interactive assistance, but existing training/evaluation frameworks do not provide shared, controllable latent structures and cannot measure whether or why agents improve. We introduce LatentGym: a controllable suite in which each environment is organized around a ground-truth latent variable governing the structure across tasks. Our construction yields metrics that separate exploration (whether the agent's actions gather information about the latent) from exploitation (whether the agent uses what it has gathered). We demonstrate our suite on empirical studies addressing three questions: how and why frontier models fail to adapt across related tasks; whether post-training on related task sequences improves general cross-task adaptation, and where those gains come from; and how design choices such as inter-task feedback shape training dynamics and generalization. Together, these results establish a controlled foundation for studying how LLM agents learn from experience across tasks, and for designing agents that adapt more reliably in sequential, personalized, and interactive settings.

2606.15314 2026-06-16 cs.LG cs.AI stat.ML 交叉投稿

LLMs on Tabular Data with Limited Semantics: Evidence from Industrial Car Retrofit Prediction

有限语义表格数据上的LLM:来自工业汽车改造预测的证据

Aina Vila Pons, Ioannis Tzachristas, Constantinos Antoniou

发表机构 * Technical University of Munich(慕尼黑工业大学) BMW Group(宝马集团)

AI总结 研究在工业表格数据中,LLM(嵌入、直接分类、混合堆叠)与经典树集成方法的对比,发现LLM在语义受限时效果有限,但嵌入和混合方法仍有价值。

详情
AI中文摘要

工业改造规划依赖于结构化操作数据而非自由文本:规划者必须估计新注册的原型是否需要改造、需要哪种改造包以及工作将花费多长时间。我们研究了一个工业数据集,该数据集将原型注册系统(284,271辆车)与改造管理系统(48,716次清洗后的访问)相连接,并在行序列化输入上比较了强大的表格机器学习基线与三种基于LLM的策略:嵌入特征(Amazon Titan)、直接提示分类(Claude Sonnet 4)和ML+LLM堆叠方法。在二分类发生预测、15类改造类型分类、每次访问持续时间回归以及聚合的月度基准测试中,经典树集成仍然是最强的独立模型。然而,LLM结果揭示了一致的模式:嵌入在表格上仍然有用(二分类AUC = 0.982),直接提示在通过哈希去除语义信号后崩溃(二分类AUC = 0.500;多类加权F1 = 0.018),而混合堆叠产生了最佳的手动构建多类模型(加权F1 = 0.626)。在月度基准测试中,基于滞后的机器学习优于时间序列基础模型,尽管Chronos-small在零样本预测中仍具有竞争力。结果表明,在隐私受限的工业表格上,LLM作为补充组件比替代强大的表格基线更有效。

英文摘要

Industrial retrofit planning depends on structured operational data rather than free text: planners must estimate whether a newly registered prototype will require a retrofit, which retrofit package it will need, and how long the work will take. We study an industrial dataset linking a prototype-registration system (284,271 vehicles) with a retrofit-management system (48,716 cleaned visits), and compare strong tabular machine learning baselines with three LLM-based strategies on row-serialized inputs: embedding features (Amazon Titan), direct prompted classification (Claude Sonnet 4), and an ML+LLM stacking approach. Across binary occurrence prediction, 15-way retrofit-type classification, per-visit duration regression, and an aggregated monthly benchmark, classical tree ensembles remain the strongest standalone models. However, the LLM results reveal a consistent pattern: embeddings remain useful on tables (binary AUC = 0.982), direct prompting collapses once semantic signal is stripped by hashing (binary AUC = 0.500; multiclass weighted F1 = 0.018), and hybrid stacking yields the best manually built multiclass model (weighted F1 = 0.626). On the monthly benchmark, lag-based machine learning outperforms time-series foundation models, though Chronos-small remains competitive in zero-shot forecasting. The results suggest that on privacy-constrained industrial tables, LLMs are more effective as complementary components than as replacements for strong tabular baselines.

2606.15436 2026-06-16 cs.LG cs.AI eess.AS 交叉投稿

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

超越分类:呼吸声学基础模型的咳嗽回归基准

Mayur Sanap, Prasanna Desikan, Edgar Lobaton

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出多模型多目标咳嗽回归基准,评估五个基础模型在六个目标上的表现,发现MLP-small优于线性探测,揭示数据集大小与头部容量的权衡,并展示跨数据集迁移的不对称性。

Comments Accepted at the ICML 2026 Workshop on Structured Data for Health

详情
AI中文摘要

呼吸声学基础模型(FMs)在咳嗽分类方面表现出色,但其从咳嗽音频中预测连续健康量的能力在很大程度上尚未被探索,尽管在无法进行物理测量的环境中,被动年龄、BMI和疾病概率估计具有临床价值。我们引入了多模型、多目标的咳嗽回归基准,在三个数据集上评估了五个FMs(OPERA-CT、OPERA-CE、OPERA-GT、HeAR、M2D+Resp)在六个目标上的表现,采用受试者不重叠协议,并比较了线性、MLP-small和全MLP回归头。MLP-small在所有任务上击败了均值预测基线,并在30个模型×任务组合中的23个中优于线性探测,而全MLP在小规模临床数据上过拟合,但在更大数据集上恢复,揭示了数据集大小与头部容量之间的权衡。HeAR在Coswara数据集上的年龄回归中领先(9.12年MAE);其CIDRZ结果因可能存在HeAR-CIDRZ预训练重叠而被排除在主要声明之外。OPERA-GT在所有三个数据集的年龄回归中优于OPERA-CT,其中CIDRZ的差异在种子方差范围内,将生成预训练的优势从呼吸扩展到咳嗽。HeAR和M2D+Resp在N=50个样本时达到接近完整性能,而OPERA模型需要N=400个样本。跨数据集迁移强烈不对称,大规模多样化数据可泛化到小规模临床人群(CoughVID到CIDRZ:-0.17年),但反之则不然(CIDRZ到Coswara:+2.43年,+26.6%)。

英文摘要

Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and disease probability estimation in settings where physical measurements are unavailable. We introduce the multi-model, multi-target cough regression benchmark evaluating five FMs (OPERA-CT, OPERA-CE, OPERA-GT, HeAR, M2D+Resp) across six targets on three datasets under subject-disjoint protocols, comparing linear, MLP-small, and full MLP regression heads. MLP-small beats the mean-predictor baseline on all tasks and linear probing in 23 of 30 model x task cases, with full MLP overfitting on small clinical data but recovering on larger sets, revealing a dataset size x head-capacity trade-off. HeAR leads within-dataset age regression on Coswara (9.12 yr MAE); its CIDRZ result is excluded from headline claims owing to possible HeAR-CIDRZ pretraining overlap. OPERA-GT is favored over OPERA-CT on age in all three datasets, with the CIDRZ margin within seed variance, extending a generative-pretraining advantage from breath to cough. HeAR and M2D+Resp reach near-full performance at N = 50 samples while OPERA models require N = 400. Cross-dataset transfer is strongly asymmetric as large diverse data generalises to small clinical populations (CoughVID to CIDRZ: -0.17 yr) but not vice versa (CIDRZ to Coswara: +2.43 yr, +26.6%).

2606.15610 2026-06-16 cs.CL astro-ph.IM cs.AI cs.LG 交叉投稿

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

LLM 裁判具有暗电流:LLM 作为裁判评估的心理测量数据表

Hiroyasu Usami, Keisuke Hara, Ayato Tsuboi, Naohiko Matsuda

发表机构 * Chubu University(中部大学) Mitsubishi Heavy Industries, Ltd., Research & Innovation Center(三菱重工业株式会社研究创新中心)

AI总结 提出裁判数据表协议,通过真空输入、表面变异、位置偏好等指标测量 LLM 裁判的暗电流和偏差,揭示其测量特性。

Comments 22 pages, 4 figures

详情
AI中文摘要

LLM 作为裁判的系统现在常规用于开放式模型评估,其中人类偏好标注成本高、速度慢且难以复现。然而,这些裁判通常被报告为标量准确率、胜率或一致性指标。我们认为,裁判应被报告为测量仪器。我们引入了一个裁判数据表协议,该协议测量在真实真空输入下的暗电流、对相同质量表面变化的稳定交叉敏感性、位置虚假偏好、在受控质量阶梯上的目标敏感性,以及由平局指令引发的标准或操作点。方向-稳定性分解揭示,明显的 Delta0 偏好可能是稳定的表面响应或伪装的位置偏差。在一个三裁判开放权重案例研究中,Llama-3.1-8B 显示出高暗电流和呈现冲突的 Delta0 行为,Qwen2.5-14B 是真空清洁且对目标敏感,但混合了稳定和位置过度判别,而 Qwen2.5-32B 是真空清洁,具有低稳定交叉敏感性和低位置虚假偏好。严格的平局标准消除了 Qwen32B 的 Delta0 虚假偏好,但将边缘 Delta1 目标信号吸收为平局,同时保留了 Delta5 敏感性。结果表明,提示移动的是标准,而不是分辨率。我们并不声称激发这项工作的下游机制假设已得到确认;贡献是在做出下游声明之前测量测量仪器的计量协议。

英文摘要

LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agreement devices. We argue that a judge should instead be reported as a measurement instrument. We introduce a Judge Datasheet protocol that measures dark current under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, target sensitivity on a controlled quality ladder, and the criterion or operating point induced by tie instructions. The direction-stability decomposition reveals that apparent Delta0 preference can be stable surface response or disguised position bias. In a three-judge open-weight case study, Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior, Qwen2.5-14B is vacuum-clean and target-sensitive but mixes stable and positional over-discrimination, and Qwen2.5-32B is vacuum-clean with low stable cross-sensitivity and low positional false preference. A strict tie criterion eliminates Qwen32B Delta0 false preference but absorbs marginal Delta1 target signals into ties while preserving Delta5 sensitivity. The results show that prompting moves the criterion, not the resolution. We do not claim that the downstream mechanism hypothesis that motivated this work is confirmed; the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.

2606.15653 2026-06-16 cs.NI cs.AI cs.ET 交叉投稿

IoT-Zoo: A Container-Based Framework for Heterogeneous IoT Device Profiles and Reproducible Traffic Capture

IoT-Zoo:基于容器的异构物联网设备配置与可重现流量捕获框架

Vagner E. Quincozes, Diego Kreutz, Silvio E. Quincozes

发表机构 * Department of Electrical and Computer Engineering, University of Notre Dame, Indiana, USA(1 诺特难大学电气与计算机工程系, 印第安纳州, 美国)

AI总结 提出IoT-Zoo容器化测试平台,通过异构设备配置和自动化流量捕获实现可重现的物联网实验,支持多域场景部署与真实协议。

Comments 10 pages, including 4 figures and 4 tables, submitted to SBRC 2026

详情
AI中文摘要

物联网(IoT)网络和安全解决方案的验证需要真实且可重现的实验数据。然而,现有平台通常通过复制有限类型的设备来实现可扩展性,这限制了配置多样性,无法捕捉真实物联网环境的异构性。在本文中,我们提出了IoT-Zoo,一个基于容器的测试平台,旨在通过异构、数据集驱动的物联网设备配置来支持可重现的实验。基于Containernet,IoT-Zoo自动化多域场景的部署,并支持MQTT和RTSP等真实应用协议。该平台提供单命令接口用于环境配置和自动流量捕获(PCAP),从而生成一致的流量基线,并减少评估网络和安全解决方案所需的操作工作量。

英文摘要

The validation of networking and security solutions for the Internet of Things (IoT) requires realistic and reproducible experimental data. However, existing platforms often achieve scalability by replicating a limited set of device types, which restricts profile diversity and fails to capture the heterogeneity of real-world IoT environments. In this paper, we present IoT-Zoo, a container-based testbed designed to support reproducible experimentation through heterogeneous, dataset-driven IoT device profiles. Built upon Containernet, IoT-Zoo automates the deployment of multi-domain scenarios and supports real application protocols such as MQTT and RTSP. The platform provides a single-command interface for environment provisioning and automated traffic capture (PCAP), enabling the generation of consistent traffic baselines and reducing the operational effort required to evaluate networking and security solutions.

2606.15749 2026-06-16 cs.CV cs.AI cs.SY eess.SY 交叉投稿

OmniTraffic: A Controllable Generation Pipeline and Benchmark for Spatio-Temporal Traffic Reasoning

OmniTraffic:面向时空交通推理的可控生成流水线与基准

Maonan Wang, Zhengyan Huang, Kemou Jiang, Yuhang Fu, Jiayue Zhu, Yuxin Cai, Xingchen Zou, Qiaosheng Zhang, Yi Yu, Ding Wang, Xi Chen, Ben M. Chen, Yuxuan Liang, Zhiyong Cui, Man On Pun, Yirong Chen

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Shanghai AI Lab(上海人工智能实验室) Beihang University(北京航空航天大学) Nanyang Technological University(南洋理工大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出OmniTraffic,一个基于12个真实路口3D重建的可控生成流水线与基准,通过8M VQA样本和3K人工验证测试集评估11个前沿MLLM,揭示拓扑与时空推理中的显著人机差距,并证明仿真数据微调可提升真实场景性能。

Comments 34 pages, 28 figures

详情
AI中文摘要

交通场景理解要求模型超越物体识别进行推理,包括车道拓扑、多视角几何、时间演变和信号相位语义。然而,现有的面向交通的多模态基准大多强调被动视觉识别或孤立的视频理解,在受控条件下评估结构感知的交通推理方面支持有限。我们介绍了OmniTraffic,一个用于时空交通推理的可控生成流水线和基准。它基于12个真实世界交叉口重建为可编辑的3D交通环境,并辅以来自两个国家的监控录像,支持受控和自然条件评估。它定义了一个三级任务层次,涵盖场景感知、多视角和时间推理以及决策支持。利用结构化交通元数据,OmniTraffic生成同步的多视角VQA样本,涵盖车辆状态、车道功能、视图-BEV对应、时间动态和信号相位分析,产生800万个VQA样本和一个3000个人工验证的测试集。对11个前沿MLLM的评估揭示了巨大的人机差距,在拓扑基础和时空推理任务中失败最为明显。在模拟的OmniTraffic数据上微调轻量级MLLM进一步提高了在真实交通场景上的性能,证明了仿真生成的监督对特定交通多模态推理的价值。除了固定数据集,OmniTraffic还提供了一个可扩展的流水线,具有可配置的交叉口、相机视角、交通需求、信号相位、视觉条件和罕见事件。

英文摘要

Traffic scene understanding requires models to reason beyond object recognition, including lane topology, multi-view geometry, temporal evolution, and signal-phase semantics. However, existing traffic-oriented multimodal benchmarks largely emphasize passive visual recognition or isolated video understanding, offering limited support for evaluating structure-aware traffic reasoning under controlled conditions. We introduce OmniTraffic, a controllable generation pipeline and benchmark for spatio-temporal traffic reasoning. Built around 12 real-world intersections reconstructed into editable 3D traffic environments and complemented by surveillance footage from two countries, OmniTraffic supports both controlled and natural-condition evaluation. It defines a three-level task hierarchy spanning scene perception, multi-view and temporal reasoning, and decision support. Using structured traffic metadata, OmniTraffic generates synchronized multi-view VQA samples covering vehicle states, lane functions, view--BEV correspondence, temporal dynamics, and signal-phase analysis, resulting in 8M VQA samples and a 3K human-verified test set. Evaluation of eleven frontier MLLMs reveals a large human--model gap, with the most pronounced failures in topology-grounded and spatio-temporal reasoning tasks. Fine-tuning a lightweight MLLM on simulated OmniTraffic data further improves performance on real-world traffic scenes, demonstrating the value of simulation-generated supervision for traffic-specific multimodal reasoning. Beyond a fixed dataset, OmniTraffic provides an extensible pipeline with configurable intersections, camera views, traffic demands, signal phases, visual conditions, and rare events.

2606.15887 2026-06-16 cs.LG cs.AI 交叉投稿

Intelligence Is Not the Bottleneck: Validating an LLM First-Pass Manuscript Score Against Peer-Review Outcomes

智能并非瓶颈:验证LLM初稿评分与同行评审结果的一致性

Costa Georgantas

发表机构 * aipr.pub(aipr实验室)

AI总结 本研究验证了LLM系统AIPR通过提示对论文进行评分,无需微调,其整体评分能有效区分ICLR会议的接收与拒绝论文(AUROC 0.82),且评分稳定、可复现,为辅助同行评审提供了可靠依据。

Comments 34 pages, 14 figures

详情
AI中文摘要

大型语言模型(LLM)系统越来越多地被提议用于辅助同行评审,但大多数评估判断的是机器生成的评审文本的措辞,而非系统分配的数字分数的有效性。我们验证了AIPR,该系统读取提交的稿件并输出五个0-100的质量维度和一个加权总分,针对一个主要机器学习会议的公开决策结果进行验证。AIPR仅通过提示进行评分,没有对评审或决策进行微调。在300篇ICLR提交论文中,这些论文具有公开的决策层级和评审评分,在冻结的流水线下进行评分,且假设在评分与任何结果相遇之前预先注册,整体评分将拒绝论文与接收论文分开(AUROC 0.82,95% CI 0.78-0.87),在层级间单调上升,并跟踪平均评审评分。信号在我们声称的地方最强:得分最低的五分之一论文被拒绝的比例远高于基准率,且口头报告论文缺失。有效性主要来自模型:在同一模型上的一段提示几乎与完整流水线一样好地判别(小差距有利于流水线,但未达到预先声明的标准,p = 0.09)。工程增加的是可靠性和有依据的评审:AIPR的评分在重复运行中几乎不变(论文内标准差0.7 vs. 2.8分),而裸提示波动很大,并且同一轮返回的是基于评分标准的、有证据依据的评审,而非裸数字,由人类保留决策权。

英文摘要

Large language model (LLM) systems are increasingly proposed to assist peer review, yet most evaluations judge the prose of machine-generated review text, not the validity of the numeric score a system assigns. We validate AIPR, which reads a submitted manuscript and emits five 0-100 quality dimensions and a weighted overall score, against the public decision outcomes of a major machine learning venue. AIPR grades by prompting alone, with no fine-tuning on reviews or decisions. Across 300 ICLR submissions with public decision tiers and reviewer ratings, graded under a frozen pipeline with hypotheses pre-registered before any score met any outcome, the overall score separates rejected from accepted submissions (AUROC 0.82, 95% CI 0.78-0.87), rises monotonically across tiers, and tracks the mean reviewer rating. The signal is strongest where we claim it: the lowest-scoring fifth is rejected far above the base rate, with oral papers absent. The validity comes mostly from the model: a one-paragraph prompt on the same model discriminates almost as well as the full pipeline (the small gap favours the pipeline but does not meet the pre-declared criterion, p = 0.09). What the engineering adds is reliability and a grounded review: AIPR's score barely moves across repeated runs (0.7 vs. 2.8 points within-paper SD) where the bare prompt swings, and the same pass returns a rubric-structured, evidence-grounded review rather than a bare number, with the human keeping the decision.

2606.15888 2026-06-16 cs.SD cs.AI eess.AS 交叉投稿

NVMOS: Non-Verbal Vocalization Quality Assessment in Speech

NVMOS:语音中非语言发声质量评估

Jialong Mai, Jinxin Ji, Xiaofen Xing, Wencui Liu, Xiangmin Xu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对非语言发声(如笑声、叹息)的感知质量评估空白,构建NV-MOS数据集,提出首个专用模型NVMOS,通过局部聚焦模块达到专家级评估一致性。

Comments 6 pages. Code and model: https://github.com/yongaifadian1/NVMOS

详情
AI中文摘要

非语言发声(NVs),如笑声、叹息和咳嗽,是情感和意图的重要声学线索。现有的语音质量评估方法通常关注整体自然度,而非语言TTS评估主要检查目标NV是否以正确的类型和位置出现。然而,NV事件本身的感知质量仍未被充分探索。为填补这一空白,我们构建了一个NV-MOS数据集,包含来自多个NV-TTS系统的输出和自然发生的NV样本,并由三位声学专家根据感知质量量表进行评分。我们进一步分析了支持音频的多模态大语言模型(如Gemini),发现其评分与专家评分之间存在明显不一致。这些结果表明,通用多模态模型无法可靠地替代人类进行NV质量评估。随后,我们提出了NVMOS,据我们所知,这是第一个能够可靠预测语音中NV事件感知质量的模型。实验结果表明,通过局部NV事件聚焦模块,NVMOS达到了与人类MOS评分专家级或更强的一致性。

英文摘要

Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collected from three acoustic experts on a perceptual quality scale. We further analyze audio-capable multimodal large language models such as Gemini and find clear inconsistencies between their scores and expert ratings. These results suggest that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment. We then propose NVMOS, to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech. Experimental results show that, with a local NV-event focusing module, NVMOS reaches expert-level or stronger agreement with human MOS.

2606.15899 2026-06-16 cs.CR cs.AI cs.HC cs.LG cs.MA 交叉投稿

SkillVetBench: LLM-as-Judge for Multi-Dimensional Security Risk Evaluation in Open-Source LLM Agent Skills

SkillVetBench: 基于LLM评判的多维安全风险评估开源LLM智能体技能

Ismail Hossain, Sai Puppala, Md Jahangir Alam, Tanzim Ahad, Sajedul Talukder

发表机构 * SUPREME Lab, University of Texas at El Paso, Texas, USA(SUPREME实验室,德克萨斯理工大学埃尔帕索分校,德克萨斯州,美国)

AI总结 提出SkillVetBench,利用LLM作为评判器对开源LLM智能体技能进行多维安全风险评估,引入五维技能智能体风险评分(SARS)和CVSS v4.0向量分解,在78个恶意技能上实现零假阴性,22个良性技能上零假阳性。

Comments The main research paper is submitted to NeurIPS 2027, it is in under review

详情
AI中文摘要

开源LLM智能体生态系统正在快速增长,然而社区贡献的技能——扩展智能体能力的模块化工具定义——的安全性在很大程度上仍未经过审查。我们填补的空白:现有的扫描器在代码层操作,在结构上对指令层和多智能体风险——劫持智能体的自然语言指令、通过编码侧信道窃取数据或跨流水线链式传播危害——视而不见,因此需要的是一个语义化的、多维度的审查系统,而不是另一个签名匹配器。我们提出了SKILLVETBENCH,一个在Hugging Face上的实时公共排行榜,它使用LLM作为评判器来审查智能体技能。新贡献:SARS(技能智能体风险评分),一个五维智能体风险度量,带有针对指令跟随系统的原则性加权公式。集成内容:完整的CVSS v4.0向量分解和一个ClawHub双视图,将我们的LLM生成的审查与官方市场判决并列。实验证明:基于我们的配套基准论文[1],LLM评判阶段在78个已确认的恶意技能上实现了零假阴性,在22个良性控制上实现了零假阳性,而最佳静态基线(SKILLSIEVE)仍然遗漏了15%;对于指令层类别如提示注入和记忆中毒,传统工具遗漏了89%到100%的威胁(例如,CODEBERT未检测到九个记忆中毒技能中的任何一个)。四个LLM评估器的检测率从35%到95%不等,这促使在生产部署中使用集成评分。

英文摘要

Open-source LLM agent ecosystems are growing rapidly, yet the security of community-contributed skills - modular tool definitions that extend agent capabilities - remains largely unvetted. The gap we fill: existing scanners operate at the code layer and are structurally blind to instruction-layer and multi-agent risk - natural-language directives that hijack an agent, exfiltrate data through encoded side channels, or chain harm across pipelines - so what is needed is a semantic, multi-dimensional vetting system rather than another signature matcher. We present SKILLVETBENCH, a live public leaderboard on Hugging Face that uses an LLM-as-Judge to vet agent skills. What is new: SARS (Skill Agentic Risk Score), a five-dimensional agentic-risk metric with a principled weighted formula for instruction-following systems. What is integrated: full CVSS v4.0 vector decomposition and a ClawHub dual-view that places our LLM-generated review beside the official marketplace verdict. What is demonstrated: drawing on our companion benchmark paper [ 1], the LLM-as-Judge stage achieves zero false negatives across 78 confirmed-malicious skills and zero false positives across 22 benign controls, while the best static baseline (SKILLSIEVE) still misses 15%; for instruction-layer categories such as Prompt Injection and Memory Poisoning, conventional tools miss between 89% and 100% of threats (e.g., CODEBERT detects none of nine memory-poisoning skills). Detection rates vary from 35% to 95% across four LLM evaluators, motivating ensemble scoring in production deployments.

2606.16038 2026-06-16 cs.SE cs.AI 交叉投稿

Open-SWE-Traces: Advancing Dual-Mode Multilingual Distillation for Software Engineering Agents

Open-SWE-Traces:推进软件工程智能体的双模式多语言蒸馏

Wasi Uddin Ahmad, Nikolai Ludwig, Somshubra Majumdar, Boris Ginsburg

发表机构 * NVIDIA(英伟达)

AI总结 为解决软件工程智能体训练数据稀缺问题,构建包含20万条轨迹、覆盖9种编程语言的数据集,采用混合推理合成方法,微调Qwen3-30B-A3B系列模型,在SWE-bench基准上取得领先性能。

Comments Work in progress

详情
AI中文摘要

通往自主软件工程的道路目前受到多样化、大规模轨迹数据严重短缺的瓶颈。我们通过引入\ourdataset来解决这一问题,这是一个包含207,489条智能体轨迹的广泛数据集,涵盖九种编程语言(Python、Go、TS、JS、Rust、Java、PHP、C、C++)。该数据集来源于通过OpenHands和SWE-agent工具从20,000个真实世界PR中获取的数据,采用混合推理合成方法:Minimax-M2.5生成具有显式“思考”过程的轨迹,而Qwen3.5-122B提供高质量的“非思考”轨迹。从SWE-rebench-V2中筛选出宽松许可证(MIT、Apache、BSD)的数据,这些数据有助于训练能够进行长程推理的模型。我们通过微调Qwen3-30B-A3B系列(Thinking、Instruct和Coder)来验证该数据集。最佳模型在SWE-bench Verified上达到61.7%的解决率,在SWE-bench Multilingual上达到57.1%,在SWE-bench Pro上达到36.8%。这些结果确立了Open-SWE-Traces作为将人类级软件工程能力蒸馏到高效、开源智能体LLM中的首要资源。

英文摘要

The path toward autonomous software engineering is currently bottlenecked by a severe deficit of diverse, large-scale trajectory data. We address this by introducing \ourdataset, an expansive dataset of 207,489 agentic trajectories spanning nine programming languages (Python, Go, TS, JS, Rust, Java, PHP, C, C++). Sourced from 20,000 real-world PRs via OpenHands and SWE-agent harnesses, the dataset utilizes a hybrid-reasoning synthesis: Minimax-M2.5 generates trajectories with explicit "thinking" processes, while Qwen3.5-122B provides high-quality "non-thinking" traces. Filtered for permissive licenses (MIT, Apache, BSD) from SWE-rebench-V2, this data facilitates the training of models capable of long-horizon reasoning. We validate the dataset by fine-tuning the Qwen3-30B-A3B series (Thinking, Instruct, and Coder). The best performing model achieves resolve rates of 61.7% on SWE-bench Verified, 57.1% on SWE-bench Multilingual, and 36.8% on SWE-bench Pro. These results establish Open-SWE-Traces as a premier resource for distilling human-level software engineering capabilities into efficient, open-source agentic LLMs.

2606.16092 2026-06-16 cs.CV cs.AI 交叉投稿

VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

VinQA:面向真实世界多模态文档问答的交错视觉元素长文本答案生成

Young Rok Jang, Hyesoo Kong, Kyunghwan An, Jae Sub Huh, Gyeonghun Kim, Stanley Jungkyu Choi

发表机构 * LG AI Research(LG AI研究院)

AI总结 提出VinQA数据集和两种编码方法(页面编码与模态编码),用于生成交错引用视觉元素的长文本答案;通过M-GroSE评估框架和微调Qwen2.5-VL模型,显著缩小与专有模型的性能差距。

Comments Accepted to CVPR 2026. Main paper: 5 figures, 4 tables; includes supplementary material

详情
AI中文摘要

真实世界的文档将文本与表格、图表、照片和示意图以多样化的布局组合在一起,然而现有关于多模态大语言模型(MLLMs)用于文档问答的研究主要产生纯文本回复,未能充分利用这些视觉元素。我们引入VinQA,一个用于长文本答案生成的数据集,其中引用的视觉元素与其支持文本明确交错,并基于相关文档页面。为支持此任务,我们研究了两种将原始文档页面图像输入MLLM的编码方法及其视觉元素引用机制:(1)页面编码,直接编码带有视觉元素边界框的整页图像,并将这些框选区域视为可引用单元;(2)模态编码,解析每个页面以提取文本并裁剪视觉元素,分别编码,并将这些裁剪元素用作可引用单元。在我们的实验中,我们提出M-GroSE,一个扩展GroUSE的多模态评估框架,用于从完整性、答案相关性、忠实性和不可回答性四个维度评估答案。我们还报告了Visual Source F1以直接衡量视觉引用准确性。尽管专有前沿模型在VinQA测试集上仍获得最佳总体分数,但在训练集上微调开源Qwen2.5-VL模型显著提升了其性能并缩小了这一差距。模态编码最初对于具有长文本、多视觉元素和多样化引用需求的复杂文档更为稳健。然而,在VinQA上训练后,页面编码达到了可比水平,即使没有模态编码中使用的显式解析也能有效竞争。最后,基于MLLM的评判器Visual G-Eval确认,微调后的模型在语义恰当的位置插入视觉元素,并附有忠实的支持文本。

英文摘要

Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Modality Encoding is initially more robust for complex documents with long text, many visual elements, and diverse citation requirements. After training on VinQA, however, Page Encoding reaches a comparable level, competing effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.

2606.16127 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models

AuAu: 大型语言模型中威权对齐审计基准

Andreas Einwiller, Max Klabunde, Florian Lemmerich

发表机构 * University of Zurich(苏黎世大学)

AI总结 提出AuAu基准,结合心理测量、情境行为测试和用户提示评估LLM的威权倾向,发现17个模型均存在显著威权响应,且系统提示可操纵多数模型。

Comments v1, 50 pages

详情
AI中文摘要

全球威权主义的浪潮,加上用户日常生活中日益核心的角色,引发了特定模型在多大程度上展现或促进威权态度和特征的问题。我们引入了AuAu,一个旨在评估LLM生成具有威权倾向响应风险的全面基准。该基准结合了三种评估方法:(i) 来自15个经过人类验证的广泛工具库的心理测量问题;(ii) 在具体情境中探究意图行为的情境行为小故事;(iii) 对现实用户提示的响应。与先前工作不同,AuAu不仅评估对威权主义的一般亲近程度,还评估已建立的子概念:威权攻击、威权服从和传统主义。评估来自中国、欧盟、俄罗斯和美国的17个模型,我们发现所有测试模型在心理测量评估下都表现出显著的威权响应率,尽管在越来越现实的下游任务中,该比率显著下降。我们进一步发现,威权系统提示容易操纵17个模型中的15个以促进增强的威权主义。我们的结果强调了持续、系统性地审计基于LLM的AI系统的必要性,以检测并最终减轻生成输出中不期望的威权倾向。我们的代码和数据可在 https://github.com/andreaseinwiller/AuAu 获取。

英文摘要

The worldwide surge of authoritarianism, combined with the increasing central role in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We introduce AuAu, a comprehensive benchmark that aims to assess the risk of LLMs generating responses with authoritarian tendencies. This benchmark combines three evaluation approaches: (i) psychometric questions from an extensive pool of 15 human validated instruments; (ii) contextual behavior vignettes probing intended actions in concrete situations; and (iii) responses to realistic user prompts. Unlike prior work, AuAu evaluates not only a general closeness towards authoritarianism but also the established sub-concepts Authoritarian Aggression, Authoritarian Submission, and Conventionalism. Evaluating 17 models from China, the EU, Russia, and the USA, we find that all tested models exhibit substantial authoritarian response rates under the psychometric evaluation, though rates drop significantly in increasingly more realistic downstream task. We further find that an authoritarian system prompt easily manipulates 15 out of 17 models to promote increased authoritarianism. Our results underscore the need for continued, systematic auditing of LLM-based AI systems to detect and ultimately mitigate undesired authoritarian tendencies in generated output. Our code and data are available at: https://github.com/andreaseinwiller/AuAu

2606.16153 2026-06-16 cs.CV cs.AI 交叉投稿

A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond

医学图像分割综述:挑战、基准与未来展望

Pengyu Zhu, Xiaojing Zhang, Kunbo Zhang, Chunyan Zhang, Zhenyu Wang

发表机构 * School of Control and Computer Engineering, North China Electric Power University(华北电力大学控制与计算机工程学院) SPIC Digital Technology Co., Ltd(国家电投数字科技有限公司) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Department 6 of Health Care, Second Medical Center, People’s Liberation Army General Hospital(中国人民解放军总医院第二医学中心健康医学科六病区)

AI总结 本文系统综述了基于U-Net、Transformer和SAM架构的医学图像分割方法,分析主要挑战,旨在指导未来研究并推动临床转化。

Comments 12 pages,3 figures,1 table. All related resources are available at https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main

详情
AI中文摘要

医学图像分割在临床诊断、治疗规划、疾病监测和神经系统疾病识别中发挥着关键作用。本文对其系统发展进行了全面综述,涵盖了广泛使用的公开数据集、基于U-Net、Transformer和SAM架构的代表性方法及其关键评估指标与差异,随后从多个角度分析了主要挑战。与专注于单一模型家族或特定临床应用的综述不同,本综述将基于U-Net、Transformer和SAM的方法组织在一个统一的分析框架内,特别关注它们在提高分割精度和效率方面的有效性。本工作旨在指导医学图像分割的未来研究并支持临床转化,所有相关资源均可在我们的GitHub仓库中公开获取:https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main。

英文摘要

Medical image segmentation plays a critical role in clinical diagnostics, treatment planning, disease monitoring, and neurological disorder identification. This article presents a comprehensive review of its systematic development, covering widely used public datasets, representative methods built on the U-Net, Transformer, and SAM architectures, and key evaluation metrics with their differences, followed by an analysis of major challenges from multiple perspectives. Unlike surveys that focus on a single model family or a specific clinical application, this review organizes U-Net-, Transformer-, and SAM-based methods within a unified analytical framework, with a particular focus on their effectiveness in improving segmentation accuracy and efficiency. This work aims to guide future research and support clinical translation of medical image segmentation, with all related resources publicly available in our GitHub repository: https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main.

2606.16246 2026-06-16 cs.LG cs.AI cs.CL 交叉投稿

Data Augmentations for Data-Constrained Language Model Pretraining

数据受限语言模型预训练的数据增强

Michael K. Chen, Xikun Zhang, Zhen Wang

发表机构 * UC San Diego(加州大学圣地亚哥分校) RMIT University(皇家墨尔本理工大学)

AI总结 针对数据受限下标准自回归预训练严重过拟合的问题,提出三类数据增强方法(token级噪声、序列排列、目标偏移预测),有效降低验证损失并支持数百epoch训练。

详情
AI中文摘要

随着AI实验室接近数据天花板,计算能力超过新高质量文本生成速率,语言模型预训练正转向数据受限、计算充裕的体制,需要在固定语料库上进行高效的多轮训练。标准自回归(AR)预训练在此设置下严重过拟合,早期达到最优然后持续恶化。我们研究数据增强作为正则化器来缓解过拟合,并在相同数据上实现数百轮的有效训练。我们为AR预训练引入了三类正交的增强:token级噪声(掩码、随机替换)、序列排列(从右到左预测、Fill-in-the-Middle)以及目标偏移预测($x_{t+i}$,$i > 1$)。通过系统消融实验,我们发现单个增强相对于基线延迟了过拟合并降低了验证损失,其中随机token替换在单个方法中实现了最佳最小损失。组合增强类别进一步降低了最小验证损失。我们的实验表明,数据增强缓解了AR预训练的数据低效问题,并为数据受限体制提供了有前景的解决方案。所有代码和数据可在https://github.com/michaelchen-lab/data-augmentations-for-pretraining获取。

英文摘要

As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction ($x_{t+i}$ for $i > 1$). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining

2606.16262 2026-06-16 cs.SE cs.AI 交叉投稿

UXBench: Measuring the Actionability of LLM-Generated UX Critiques

UXBench: 衡量LLM生成的UX评论的可操作性

Wenjie Wang, Yue Huang, Zipeng Ling, Han Bao, Hang hua, Xiaonan Luo, Yu Jiang, Shiyi Du, Yuexing Hao, Xiaomin Li, Yuchen Ma, Dianzhuo Wang, Yanfang Ye, Xiangliang Zhang

发表机构 * University of Notre Dame(诺丁汉大学) University of Pennsylvania(宾夕法尼亚大学) University of Rochester(罗切斯特大学) Carnegie Mellon University(卡内基梅隆大学) Massachusetts Institute of Technology(麻省理工学院) Harvard University(哈佛大学) LMU Munich(慕尼黑路德维希-马克西米利安大学)

AI总结 提出UXBench基准,通过下游修复代理能否基于评论改进界面来评估LLM作为UX评判者的可操作性,发现不同模型在报告可操作性、修复特征和可靠性上存在显著差异。

Comments 30 pages

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被部署为UX评判者,用于检查界面、诊断可用性问题并提出修复建议。然而,目前还没有受控基准来衡量这些评论在不同产品表面上的可靠性和可操作性。我们引入了UXBench,一个用于评估LLMs作为交互式UX评判者的基准。UXBench包含跨十个产品表面系列的本地优先可运行网页固定装置,并配以覆盖门控的浏览器探索,强制模型在报告之前收集交互证据。每个评判模型在七个评分维度上生成结构化的UX报告;报告质量通过固定的下游修复代理能否基于评论改进界面来衡量。我们在自动修复提升协议和盲人验证研究下评估了八个前沿模型。结果表明,UX评判既未饱和也非一维:模型在报告可操作性上存在显著差异,在评分维度上表现出不同的修复特征,在固定装置层面可靠性各异,并在不同表面类别中交替领先。

英文摘要

Large language models (LLMs) are increasingly deployed as UX judges that inspect interfaces, diagnose usability problems, and propose repairs. Yet no controlled benchmark measures whether the resulting critiques are reliable and actionable across heterogeneous product surfaces. We introduce UXBench, a benchmark for evaluating LLMs as interaction-grounded UX judges. UXBench comprises local-first runnable web fixtures spanning ten product-surface families, paired with coverage-gated browser exploration that forces models to collect interaction evidence before reporting. Each judge model produces a structured UX report over seven rubric dimensions; report quality is measured by whether a fixed downstream repair agent can improve the interface based on the critique. We evaluate eight frontier models under both an automated repair-lift protocol and a blind human validation study. Results show that UX judging is neither saturated nor one dimensional: models differ meaningfully in report actionability, exhibit distinct rubric-level repair signatures, vary in fixture-level reliability, and trade leadership across surface categories

2606.16313 2026-06-16 cs.RO cs.AI 交叉投稿

Is Your Trajectory Displacement Safe in Long-tail?

你的轨迹位移在长尾场景中安全吗?

Qiao Sun, Weicheng Zheng, Yixin Huang, Hang Zhao

发表机构 * Shanghai Qi Zhi Institute(上海期智研究院) Tsinghua University(清华大学) Tongji University(同济大学)

AI总结 提出FluidTest评估框架,通过成对WebUI协议、32种语义威胁分类和三元验证系统,检测规划轨迹相对于专家参考的额外威胁,实验发现SOTA规划器仍存在大量安全相关失败。

Comments 20 pages, 15 figures

详情
AI中文摘要

长尾场景仍然是自动驾驶评估的主要瓶颈,即使数据集规模增长数个数量级。现有的评估流水线很少同时具备人类对齐、安全感知、可验证和可解释性:闭环指标在强规划器中常常饱和,而无结构的人类评分在没有精心设计协议的情况下可能充满噪声。我们将规划评估表述为额外威胁检测:给定规划器轨迹和专家参考,规划器的位移是否引入了新的不安全驾驶行为?我们提出FluidTest,一个包含三个组件的评估流水线:用于可靠人工标注的成对WebUI协议;包含32种语义威胁及其基于证据的决策图的分类法;以及一个带有反思的三元验证系统,用于精确性和可审计性。在WOD-E2E数据集上的实验表明,FluidTest在训练过的标注者中产生一致的标签,并在65%的Poutine轨迹和51%的RAP轨迹中识别出额外威胁。这些结果表明,尽管具有高评分者反馈分数(RFS)和低平均位移误差(ADE),最先进的规划器仍可能表现出大量与安全相关的失败。更多细节、指导和代码请访问https://fluidtest.web.app。

英文摘要

Long-tail scenarios remain a major bottleneck for autonomous driving evaluation, even as datasets grow by orders of magnitude. Existing evaluation pipelines are rarely human-aligned, safety-aware, verifiable, and explainable at the same time: closed-loop metrics often saturate among strong planners, while unstructured human ratings can be noisy without a carefully designed protocol. We formulate planning evaluation as additional-threat detection: given a planner trajectory and an expert reference, does the planner's displacement introduce new unsafe driving behavior? We propose FluidTest, an evaluation pipeline with three components: a pairwise WebUI protocol for reliable human annotation; a taxonomy of 32 semantic threats with evidence-grounded decision graphs; and a three-agent verification system with reflection for precision and auditability. Experiments on the WOD-E2E dataset show that FluidTest produces consistent labels among trained annotators and identifies additional threats in 65% of Poutine trajectories and 51% of RAP trajectories. These results show that state-of-the-art planners can still exhibit substantial safety-relevant failures despite high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE). Additional details, guidance, and code are available at https://fluidtest.web.app.

2606.16447 2026-06-16 cs.RO cs.AI 交叉投稿

Training and Evaluating Diffusion Policies with Long Context Lengths

训练和评估具有长上下文长度的扩散策略

Abhinav Agarwal, Adam Wei, Taylan Kargin, Michael Zeng, Cole Becker, Arif Kerem Dayi, Pablo Parrilo, Asuman Ozdaglar, Russ Tedrake

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文首次详细研究模仿学习中上下文长度的影响,发现简单扩展上下文长度并不脆弱,并提出联合训练多上下文长度策略的方法以降低样本复杂度。

详情
AI中文摘要

模仿学习已经能够从RGB观测中实现高度灵巧的机器人操作。然而,使用这些方法训练的策略通常仅基于短历史观测来调节机器人动作。这些策略无法解决需要记忆的任务,并且可能反复执行相同的失败动作。在这项工作中,我们首先在任务具有不同局部稳定性和记忆需求以及多种数据体制下,将上下文长度从短到长逐步增加,对策略性能进行基准测试。据我们所知,这是首次如此详细地研究模仿学习中上下文长度的影响。我们的结果挑战了先前的说法:简单地扩展上下文长度并不像文献中声称的那样脆弱。使用适当的调节方法和去噪骨干网络(UNet+交叉注意力),单任务策略在通常的数据体制下即使采用简单扩展也能在许多任务上取得高成功率。接下来,我们提出一种训练算法,用于联合训练多个上下文长度的策略,进一步降低长上下文学习的样本复杂度。最后,我们将我们的发现应用于重新评估先前提出的一些长上下文模仿学习解决方案。

英文摘要

Imitation learning has enabled highly-dexterous robotic manipulation from RGB observations. Policies trained with these methods, however, typically condition robot actions on only a short history of observations. These policies cannot solve tasks that require memory and can get stuck repeatedly executing the same failing motions. In this work, we first benchmark policy performance as context length is incrementally increased from short to long, across a spectrum of tasks with varying local stability and memory requirements, and in multiple data regimes. To our knowledge, this is the first study to investigate context length in imitation learning at this level of detail. Our results challenge prior claims: naively scaling context length is not as brittle as advertised in literature. With an appropriate conditioning method and denoising backbone (UNet+Cross-Attention), single-task policies achieve high success rates on many tasks in the usual data regime even with naive scaling. Next, we propose a training algorithm to jointly train policies at multiple context lengths, further reducing the sample complexity of long-context learning. Finally, we apply our findings to re-evaluate some previously proposed solutions to long-context imitation learning.

2606.16475 2026-06-16 cs.CY cs.AI 交叉投稿

AI systems out-persuade expert humans

AI系统在说服力上超越人类专家

Kobi Hackenburg, Caroline Wagner, Luke Hewitt, Ben M. Tappin, Ed Saunders, Hannah Rose Kirk, Helen Margetts, Christopher Summerfield

发表机构 * University of Oxford(牛津大学) UK AI Security Institute(英国人工智能安全研究所) Stanford University(斯坦福大学) London School of Economics and Political Science(伦敦政治经济学院)

AI总结 通过四项预注册实验(n=18,978次对话),发现AI系统在说服力上可靠地超越人类专家,包括专业拉票者和世界辩论冠军,其优势源于快速部署大量信息,并扩展到现实世界筹款行为。

Comments 16 pages, 4 figures

详情
AI中文摘要

许多社会决策是通过说服竞赛来决定的。对话式AI是这些竞赛中强大的新参与者,但它是否能超越技能娴熟且高度激励的人类仍不清楚。在这里,通过一系列四项预注册实验(来自6,923人的18,978次对话),我们将AI系统与一系列人类说服者进行对比,包括普通人、单独预注册的四轮在线说服锦标赛的获胜者、专业拉票者以及世界辩论冠军。我们发现AI系统可靠地比人类专家更具说服力,即使人类专家选择他们的话题、提前研究、经过数小时的现场结构化练习,并以1,000英镑现金奖金作为激励。在后续研究中,AI的优势在专家获得一个教练工具后仍然存在,该工具让他们能够与击败他们的AI进行练习、回顾他们的表现历史,并查看AI在关键时刻会说什么。我们发现汇聚的证据表明,AI的优势源于快速部署大量信息:经过教练后,人类专家能够与一个限制为人类速度和人类长度消息的AI打成平手。在最后一项研究中,我们展示了AI的优势扩展到有影响力的现实世界行为:在向救助儿童会筹集真实资金方面,AI的效果比来自英国筹款公司的专业拉票者高出近3倍。这些结果共同表明,前沿AI系统在对话中超越人类专家,对政治传播具有重大意义。

英文摘要

Many societal decisions are settled by contests of persuasion. Conversational AI is a powerful new entrant in these contests, but whether it can out-persuade skilled and highly incentivized humans has remained unclear. Here, in a series of four preregistered experiments (n = 18,978 conversations from 6,923 people), we pitted AI systems against a range of human persuaders, including laypeople, winners of a separately preregistered four-round online persuasion tournament, professional canvassers, and world championship debaters. We found that AI systems were reliably more persuasive than expert humans, even when expert humans chose their issues, researched in advance, underwent hours of live, structured practice, and were incentivized with £1,000 cash bonuses. In a follow-up study, AI's advantage persisted after experts received a coaching tool that let them practice against the AI that beat them, review their performance history, and see what AI would have said at key moments. We found converging evidence that AI's advantage stemmed from rapidly deploying larger quantities of information: after coaching, expert humans could tie an AI constrained to respond at human speeds and with human-length messages. In a final study, we show that AI's advantage extends to consequential real-world behavior: AI was nearly 3x more effective than professional canvassers from a UK fundraising firm at raising real-money donations to Save the Children. Together, these results establish that frontier AI systems out-persuade expert humans in conversation, with significant implications for political communication.

2606.16479 2026-06-16 cs.CV cs.AI 交叉投稿

Uncertainty Quality of VGGT: An Analysis on the DTU Benchmark Dataset

VGGT的不确定性质量:基于DTU基准数据集的分析

Markus Hillemann, Robert Langendörfer, Steven Landgraf, Markus Ulrich

发表机构 * Institute of Photogrammetry and Remote Sensing, Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院摄影测量与遥感研究所)

AI总结 本文分析VGGT模型在DTU数据集上的不确定性预测质量,确定有效置信度阈值,并证明提升不确定性质量可显著改善3D重建精度。

Comments Accepted for publication in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

详情
AI中文摘要

视觉几何基础变换器(VGGT)在短时间内引起了广泛关注,尤其是因其在CVPR-2025上获得最佳论文奖。与DUSt3R和MASt3R类似,VGGT旨在通过用一个简单、统一的馈送神经网络取代束调整和特征匹配等既定方法,实现范式转变,该网络可直接从场景的多张图像中在几秒内预测相机位姿、深度图和密集3D结构。其关键能力是在单次前向传播中一致地处理任意数量的视图,无需任何后处理或迭代优化。对于摄影测量学,这为实时、可扩展和可访问的3D重建开辟了新的可能性。在此背景下,不仅高重建精度至关重要,高质量的不确定性估计也至关重要,因为它们能增强信任并实现稳健的质量保证。因此,本文研究了VGGT不确定性预测的质量。分析确定了用于过滤VGGT原始输出的有效置信度阈值,并证明提升不确定性质量在提高其3D重建精度方面具有巨大潜力。

英文摘要

Visual Geometry Grounded Transformer (VGGT) has already attracted a great deal of attention in a short period of time, not least due to the Best Paper Award at CVPR-2025. Similar to DUSt3R and MASt3R, VGGT aims to bring about a paradigm shift by replacing established methods like bundle adjustment and feature matching with a simple, unified, feed-forward neural network that predicts camera poses, depth maps, and dense 3D structure directly from multiple images of a scene in a few seconds. A key aspect is its ability to process an arbitrary number of views consistently in a single forward pass without any post-processing or iterative optimization. For photogrammetry, this opens new possibilities for real-time, scalable, and accessible 3D reconstruction. In this context, not only high reconstruction accuracy but also high-quality uncertainty estimates are crucial, as they foster trust and enable robust quality assurance. This paper therefore investigates the quality of VGGT's uncertainty predictions. The analysis identifies an effective confidence threshold for filtering VGGT's raw output and demonstrates that enhancing uncertainty quality holds strong potential for improving the accuracy of its 3D reconstructions.

2606.16494 2026-06-16 cs.CL cs.AI cs.CV 交叉投稿

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

迷失在末尾:多模态检索增强问答中的首因偏差

Jieyuan Liu, Jianyang Gu, Shijie Chen, Jefferson Chen, Zhen Wang

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) The Ohio State University(俄亥俄州立大学)

AI总结 研究多模态知识型视觉问答中检索上下文的位置依赖,发现不同于纯文本的U形效应,出现首因偏差(开头优于末尾),并通过消融实验定位原因为指令调优阅读器的提示槽0。

Comments 15 pages, 9 figures. Under review at EMNLP 2026

详情
AI中文摘要

基于知识的视觉问答(KB-VQA)通过将阅读器条件化于从维基百科规模知识库检索的段落,使视觉-语言系统能够回答超出其参数知识的问题。在纯文本长上下文LLM中,检索上下文的使用遵循Liu等人(2024)的U形“迷失在中间”效应:上下文开头和结尾的信息被使用,中间部分被忽略。这种效应是否会迁移到部署的多模态KB-VQA中尚不清楚。为填补这一空白,我们设计了首个针对多模态KB-VQA中阅读器侧位置依赖的受控探针:一种黄金位置协议,其中只有黄金段落的提示槽在问题内变化。我们在三个开源7B/8B VLM阅读器和两个KB-VQA基准上运行,k最大为20。形状从U形翻转为首因:在每个阅读器-基准组合上,黄金在开头比黄金在结尾高出16到26个点,我们称这种效应为“迷失在末尾”。三项针对性消融实验缩小了原因:纯文本对照显示多模态设置将已存在的文本模式首因放大了2.2到4.5倍,图像位置和干扰物洗牌消融共同将根源定位到指令调优阅读器的提示槽0。在冻结的阅读器上,三种检索侧修复(MMR、神权重排序、基于排名的重排序)均未缩小差距(无显著改进)。我们的发现表明,recall@k是部署KB-VQA的错误指标,缩小差距需要阅读器侧干预;我们发布该协议作为评估此类干预的受控工具。

英文摘要

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.

2606.16753 2026-06-16 cs.CL cs.AI cs.LG 交叉投稿

P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs

P3B3:用于测量大语言模型中欧洲和巴西葡萄牙语变体偏差的多轮对话基准

Rafael Ferreira, Inês Vieira, Inês Calvo, James Furtado, Iago Paulo, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães

发表机构 * NOVA University of Lisbon(新里斯本大学) NOVA LINCS(NOVA LINCS实验室)

AI总结 提出P3B3基准,通过专家策划的对话提示和评估框架,测量大语言模型在葡萄牙语变体(欧洲vs巴西)上的偏差和可控性,发现多数模型偏向巴西葡萄牙语。

Comments Accepted at MeLLM Workshop at ACL 2026

详情
AI中文摘要

随着大语言模型(LLMs)融入日常交流,捕捉区域语言变异对于可靠和公平的语言使用至关重要。在葡萄牙语中,欧洲(pt-PT)和巴西(pt-BR)变体仍然代表性不均,pt-BR在数据量上占主导地位,而LLM对葡萄牙语变体的偏好尚未得到充分探索。为弥补这一空白,我们引入了P3B3,一个由专家策划的语言变体无关的对话提示基准,以及一个用于测量变体偏差和可控性的评估框架。在多个模型上的实验表明,大多数LLM表现出对pt-BR的强烈偏差,且不同模型的可控性存在差异。这些结果凸显了需要在语言变体之间实现更平衡的多语言表示。

英文摘要

As Large Language Models (LLMs) become embedded in everyday communication, capturing regional linguistic variation is essential for reliable and equitable language use. In Portuguese, European (pt-PT) and Brazilian (pt-BR) varieties remain unevenly represented, with pt-BR dominating in data quantity, while LLM preference for Portuguese variants remains underexplored. To address this gap, we introduce P3B3, an expert-curated language variety agnostic benchmark of conversational prompts, along with an evaluation framework for measuring variety bias and controllability. Experiments on several models show that most LLMs exhibit a strong bias toward pt-BR, with variation in controllability across models. These results highlight the need for more balanced multilingual representation across language varieties.

2606.16799 2026-06-16 cs.CV cs.AI 交叉投稿

Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment

解耦语义与失真:面向AI生成图像质量评估的多尺度双流视觉-语言对齐

Zijie Meng

AI总结 提出MST-CLIPIQA多尺度双流框架,通过显式表示解耦实现层次化视觉-语言对齐,在五个基准上取得质量SRCC平均提升1.11%、图文对应SRCC提升2.35%的新SOTA结果。

Comments 11 pages, 2 figures Accepted by ICME2026(spotlight)

详情
AI中文摘要

现有的基于视觉-语言模型(VLM)的AI生成图像质量评估(AIGIQA)方法存在根本性的语义-失真维度冲突:为语义区分优化的单一表示在本质上将组成性理解与低层感知敏感性纠缠在一起,使其对细粒度质量退化视而不见。我们提出MST-CLIPIQA,一种多尺度双流框架,通过显式表示解耦实现层次化视觉-语言对齐。我们的架构利用具有互补补丁粒度的双CLIP编码器:粗粒度流捕获全局语义连贯性,而细粒度流保留纹理特征和伪影模式。一种受信息瓶颈启发的门控融合机制执行自适应跨尺度蒸馏,当生成提示可用时,可选交叉注意力实现基于提示的对应评估。在五个基准上的广泛实验建立了新的最先进结果,在质量预测上实现平均SRCC提升1.11%,在文本-图像对应预测上提升2.35%,同时仅需0.8M可训练参数即可保持效率。我们的项目可在https://github.com/YMlinfeng/MST-CLIPIQA获取。

英文摘要

Existing vision-language model (VLM)-based AI-generated image quality assessment (AIGIQA) methods suffer from a fundamental semantic-distortion dimensional conflict: monolithic representations optimized for semantic discrimination inherently entangle compositional understanding with low-level perceptual sensitivity, rendering them blind to fine-grained quality degradations. We introduce MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling. Our architecture leverages dual CLIP encoders with complementary patch granularities: coarse-grained streams capture global semantic coherence while fine-grained streams preserve textural signatures and artifact patterns. An information bottleneck-inspired gated fusion mechanism performs adaptive cross-scale distillation, with optional cross-attention enabling prompt-anchored correspondence evaluation when generation prompts are available. Extensive experiments across five benchmarks establish new state-of-the-art results, achieving average improvements of 1.11 percent SRCC on quality and 2.35 percent SRCC on text-image correspondence prediction, while maintaining efficiency with only 0.8M trainable parameters. Our project is available at https://github.com/YMlinfeng/MST-CLIPIQA.

2606.16826 2026-06-16 cs.RO cs.AI 交叉投稿

ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies

ATOM-Bench:用于操作策略中原子技能与组合泛化的真实世界基准

Zenan Wu, Bingqing Wei, Lu Liu, Zheqi He, Xi Wang, Jiakang Liu, Zehui Li, Guocai Yao, Jing-Shu Zheng, Xi Yang, Yongtao Wang

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Peking University(北京大学)

AI总结 提出ATOM-Bench基准,通过分解桌面操作为原子任务和组合任务,评估操作策略的原子技能获取与组合泛化能力,发现当前策略在细粒度原子技能和组合重用上存在不足。

Comments Homepage: https://flageval-baai.github.io/AtomBenchPage

详情
AI中文摘要

通用操作策略越来越多地被呈现为机器人控制的基础模型,但它们的真实世界泛化能力仍然难以诊断。一个策略可能在演示任务上成功,但仍无法执行细粒度的原子技能或在新的任务结构中重新组合已学习的技能。我们引入了\ extbf{ATOM-Bench},一个用于评估操作策略中原子技能和组合泛化的真实世界基准。ATOM-Bench将桌面操作分解为运动原子和指令原子,包含30个原子任务和24个保留的组合任务,涵盖配对单臂和双臂机器人轨道。我们收集了3000个人类演示用于原子微调,并发布演示数据和评估回滚数据以支持可重复的真实世界评估。策略在原子任务上进行微调,并在原子技能获取和保留的组合任务上进行评估。我们进一步引入了原子分数(AS)和组合失败份额(CFS),以区分由弱原子技能引起的失败和由有限组合重用引起的失败。通过对五种代表性操作策略进行2700次物理回滚,我们发现当前策略可以获取简单的指令接地技能,但在细粒度运动原子、计数和逻辑过滤方面仍然困难。更重要的是,强大的原子性能并不能可靠地迁移到保留的组合任务上。ATOM-Bench提供了一个诊断测试平台,用于研究失败是由弱运动执行、差指令接地还是有限组合重用引起的。

英文摘要

Generalist manipulation policies are increasingly presented as foundation models for robotic control, but their real-world generalization remains difficult to diagnose. A policy may succeed on demonstrated tasks while still failing to execute fine-grained atomic skills or recombine learned skills in new task structures. We introduce \textbf{ATOM-Bench}, a real-world benchmark for evaluating both atomic skills and compositional generalization in manipulation policies. ATOM-Bench factorizes tabletop manipulation into motor atoms and instruction atoms, and contains 30 atomic tasks and 24 held-out compositional tasks across paired single-arm and dual-arm robot tracks. We collect 3,000 human demonstrations for atomic fine-tuning and release both the demonstration data and evaluation rollout data to support reproducible real-world evaluation. Policies are fine-tuned on atomic tasks and evaluated on both atomic skill acquisition and held-out compositional tasks. We further introduce Atomic Score (AS) and Compositional Failure Share (CFS) to distinguish failures caused by weak atomic skills from failures caused by limited compositional reuse. Through 2,700 physical rollouts on five representative manipulation policies, we find that current policies can acquire simple instruction-grounding skills, but still struggle with fine-grained motor atoms, counting, and logical filtering. More importantly, strong atomic performance does not reliably transfer to held-out compositional tasks. ATOM-Bench provides a diagnostic testbed for studying whether failures arise from weak motor execution, poor instruction grounding, or limited compositional reuse.

2606.16868 2026-06-16 cs.CV cs.AI cs.DC 交叉投稿

Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

真实世界标签噪声下的联邦医学图像分割:面向噪声标签学习方法选择的基准套件

Markus Bujotzek, Dimitrios Bounias, Stefan Denner, Ralf Floca, Maximilian Fischer, Peter Neher, Klaus Maier-Hein

发表机构 * Division of Medical Image Computing, Germany Cancer Research Center(德国癌症研究中心医学图像计算部) Medical Faculty, University of Heidelberg(海德堡大学医学院) Heidelberg Institute of Radiation Oncology (HIRO), National Center for Radiation Research in Oncology (NCRO)(海德堡放射肿瘤学研究所(HIRO),国家放射肿瘤学研究中心(NCRO)) Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital(海德堡大学医院放射肿瘤科模式分析与学习组) Faculty of Mathematics and Computer Science, University of Heidelberg(海德堡大学数学与计算机科学学院) National Center for Tumor Diseases (NCT), NCT Heidelberg, a partnership between DKFZ and the university medical center Heidelberg(国家肿瘤疾病中心(NCT),NCT海德堡,DKFZ与海德堡大学医学中心的合作机构)

AI总结 针对联邦学习中真实世界标签噪声(如轮廓不一致、结构缺失或混淆)问题,提出一个包含多样化真实噪声数据集、客户端噪声场景和针对性评估的基准套件,支持系统评估和噪声标签学习方法选择。

详情
AI中文摘要

虽然联邦学习(FL)能够在不集中敏感数据的情况下实现协作式医学图像分割,但实际部署常因跨站点的标签缺陷(如轮廓不一致、结构缺失或多余、标签混淆)而复杂化。联邦噪声标签学习(FNLL)旨在减轻这些影响,但在实践中仍未被充分利用,因为现有证据主要基于合成噪声、简化设置和有限的实际噪声评估。我们通过引入一个基准套件来弥补这一差距,该套件结合了多样化的真实世界噪声数据集、与部署相关的客户端噪声场景以及针对标签噪声的评估,以支持系统的FNLL评估和知情的方法选择。该套件将来自不同来源的精心策划的真实世界噪声医学图像分割数据集与一个全面的联邦分割框架相结合,包括各种客户端噪声场景和针对噪声的评估。所提出的套件为医学图像分割中的FNLL评估提供了现实且具有区分性的基础,并为公平基准测试、数据集特定的标签噪声表征以及未来在现实联邦设置下的方法开发建立了可重复使用的基础。代码可在 https://github.com/MIC-DKFZ/FedSegNoiseBench 获取。

英文摘要

While federated learning (FL) enables collaborative medical image segmentation without centralizing sensitive data, real-world deployment is frequently complicated by cross-site label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet remains underused in practice as existing evidence is largely based on synthetic noise, simplified settings, and limited real-world noisy evaluation. We address this gap by introducing a benchmark suite that combines diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to support systematic FNLL assessment and informed method selection. The suite combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework including various client-noise scenarios and noise-targeted evaluation. The presented suite provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at https://github.com/MIC-DKFZ/FedSegNoiseBench.

2606.16910 2026-06-16 cs.CL cs.AI 交叉投稿

IMPACTeen: Intentions, Manipulation, Persuasion, Annotations, and Consequences in Teen Communication Dataset

IMPACTeen:青少年沟通数据集中的意图、操纵、说服、标注与后果

Aleksander Szczęsny, Wiktoria Mieleszczenko-Kowszewicz, Maciej Markiewicz, Beata Bajcar, Tomasz Adamczyk, Jolanta Babiak, Grzegorz Chodak, Przemysław Kazienko

发表机构 * Wrocław University of Science and Technology(弗罗茨瓦夫理工大学)

AI总结 构建IMPACTeen数据集,包含1021个青少年社交影响场景文本,从五个视角标注,支持社交影响检测、标注者分歧及跨语言建模研究。

详情
AI中文摘要

IMPACTeen是一个文本社交影响场景数据集,涵盖青少年语境下的人际、媒体和数字环境。它包含1,021个文本、5,100条独立标注记录以及社交影响技术的黄金标签,每个文本从五个不同视角(青少年、家长、心理学家、沟通专家和教师)进行标注。该资源通过受限的大语言模型生成构建,随后经过两步人工编辑和验证阶段,以确保青少年语境的真实性。多维标注涵盖了影响存在性、技术、意图、后果、抵抗、反应和标注置信度。该数据集支持社交影响检测、标注者分歧、跨语言建模以及语言模型的训练和评估。数据集以波兰语创建,并附有相应的英文版本。

英文摘要

IMPACTeen is a dataset of textual social influence scenarios spanning interpersonal, media-based, and digital settings in an adolescent context. It contains 1,021 texts, 5,100 individual annotation records, and gold labels for social influence techniques, with each text annotated from five distinct perspectives: teenagers, parents, psychologists, communication experts, and teachers. The resource was constructed through constrained LLM generation, followed by a two-step human editing and validation phase aimed at ensuring youth-context realism. A multi-dimensional annotation covered influence presence, techniques, intentions, consequences, resistance, reactions, and annotation confidence. The dataset supports research on social influence detection, annotator disagreement, cross-lingual modeling, and the training and evaluation of language models. The dataset was created in Polish and is accompanied by a corresponding English version.

2606.17006 2026-06-16 cs.SD cs.AI cs.LG cs.MM eess.AS 交叉投稿

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

TuneJury: 一种改进音乐生成偏好对齐的开放指标

Yonghyun Kim, Junwon Lee, Haiwen Xia, Yinghao Ma, Junghyun Koo, Koichi Saito, Yuki Mitsufuji, Chris Donahue

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Sony AI(索尼AI) Georgia Tech(佐治亚理工学院) KAIST(韩国科学技术院) Peking University(北京大学) QMUL(伦敦玛丽女王大学)

AI总结 提出TuneJury,一个开放、实例级别的成对奖励模型,用于文本到音乐生成,通过预测偏好分数支持数据筛选、后处理校准,并在推理、优化和训练中提升对齐效果。

Comments 32 pages, 9 figures

详情
AI中文摘要

我们引入了TuneJury,一个开放、实例级别的成对奖励模型,用于文本到音乐生成,它从文本提示和音频片段中预测音乐偏好分数。发布的检查点在公开的人类偏好标签上训练,涵盖竞技场风格(A vs. B)投票、度量对齐偏好对、众包成对比较和专家审美评分。两个片段之间的预测分数差在我们的保留测试集上校准良好,支持通过简单的分数阈值进行数据筛选。TuneJury泛化到保留测试对和分布外基准,在后一任务上与先前基线保持竞争力。对于训练后发布的生成器,我们引入了锚定校准,一种事后、每系统的Bradley-Terry校准,以显著优于从头再训练的数据效率恢复一致性。相同的冻结奖励在三个下游应用中驱动一致的奖励轴增益:推理时的最佳N选择、DITTO风格的潜在优化和专家迭代后训练。TuneJury可在https://github.com/yonghyunk1m/TuneJury获取。

英文摘要

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.

2606.17020 2026-06-16 cs.CV cs.AI 交叉投稿

FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

FusionRS: 用于双模态视觉-语言基础模型的大规模RGB-红外遥感数据集

Jiaju Han, Ben Zhang, Xuemeng Sun, Qike Zhang, Yuxian Dong, Chengyin Hu, Fengyu Zhang, Yiwei Wei, Jiujiang Guo

发表机构 * China University of Petroleum-Beijing at Karamay(中国石油大学(北京)克拉玛依校区) University of Electronic Science and Technology of China(电子科技大学) Tianjin University(天津大学)

AI总结 针对遥感视觉-语言模型缺乏红外数据的问题,提出首个大规模RGB-红外-文本数据集FusionRS,通过翻译RGB图像为红外风格并配以红外感知描述,训练双模态基础模型,提升RGB-红外对齐和双模态字幕生成性能。

详情
AI中文摘要

遥感视觉-语言模型推动了地球观测理解的发展,但现有工作大多集中于RGB图像,红外数据中的互补信息尚未得到充分探索。红外图像提供了独特的线索,包括热强度结构、物体边界和光照不变场景特征,这些可以丰富超越传统RGB观测的视觉-语言学习。然而,用于遥感视觉-语言建模的大规模RGB-红外-文本数据集仍然缺失。为填补这一空白,我们引入了FusionRS,这是首个专为遥感双模态视觉-语言学习设计的大规模RGB-红外-文本数据集。FusionRS通过将多样的公开RGB遥感图像翻译为红外风格对应物,形成对齐的RGB-IR图像对。每对图像都配有常规场景描述和红外感知描述,后者在保留语义内容的同时明确描述红外特有的视觉属性。基于FusionRS,我们训练了用于RGB-IR联合理解的双模态视觉-语言基础模型。我们首先训练CLIP风格的模型进行RGB-IR-文本对齐,然后微调生成式VLM用于双模态RGB-IR字幕生成。实验表明,与仅RGB和非红外感知训练设置相比,FusionRS改进了RGB-IR对齐、红外到文本检索和双模态字幕生成。消融研究进一步验证了红外感知描述对于加强红外-语言对齐至关重要,突显了模态特定文本监督对于更可扩展的RGB-红外遥感视觉-语言表示学习的重要性。

英文摘要

Remote sensing vision-language models have advanced Earth observation understanding, but most existing work remains centered on RGB imagery, leaving the complementary information in infrared data underexplored. Infrared images provide distinctive cues, including thermal intensity structures, object boundaries, and illumination-invariant scene features, which can enrich visual-language learning beyond conventional RGB observations. However, a large-scale RGB-infrared-text dataset for remote sensing vision-language modeling is still absent. To address this gap, we introduce FusionRS, the first large-scale RGB-infrared-text dataset designed for dual-modal vision-language learning in remote sensing. FusionRS is constructed by translating diverse public RGB remote sensing images into infrared-style counterparts, forming aligned RGB-IR image pairs. Each pair is associated with conventional scene captions and IR-aware captions that explicitly describe infrared-specific visual properties while preserving semantic content. Based on FusionRS, we train dual-modal vision-language foundation models for RGB-IR joint understanding. We first train CLIP-style models for RGB-IR-text alignment, and then fine-tune generative VLMs for dual-modal RGB-IR captioning. Experiments show that FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings. Ablation studies further verify that IR-aware captions are crucial for strengthening infrared-language alignment, highlighting the importance of modality-specific textual supervision for more scalable RGB-infrared remote sensing vision-language representation learning.

2509.22888 2026-06-16 cs.AI cs.CL 版本更新

JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory

JE-IRT: 通过联合嵌入项目反应理论审视LLM能力的几何视角

Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang

发表机构 * Independent Researcher(独立研究者) University of Cincinnati(辛辛那提大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出JE-IRT几何框架,将LLM和问题嵌入共享空间,通过方向编码语义、范数编码难度,揭示主题专长和分布外行为,支持新模型高效扩展,并发现与人类分类部分对齐的内部结构。

Comments 35 pages, 17 figures, 9 tables, accepted to TMLR

详情
AI中文摘要

标准LLM评估实践将多样能力压缩为单一分数,掩盖了其固有的多维性质。我们提出JE-IRT,一种几何项目反应框架,将LLM和问题嵌入共享空间。对于问题嵌入,方向编码语义,范数编码难度,而每个问题的正确性由模型和问题嵌入之间的几何交互决定。这种几何结构用主题专长取代了LLM的全局排名,并允许相关问题之间的平滑变化。基于此框架,我们的实验结果表明,分布外行为可以通过方向对齐来解释,且更大的范数一致地指示更难的问题。此外,JE-IRT自然支持泛化:一旦空间被学习,新LLM通过拟合单个嵌入即可添加。学习到的空间进一步揭示了仅部分与人类定义的主题类别对齐的LLM内部分类。我们还表明,嵌入空间的简单线性探针恢复了跨主题的能力方向,例如一个算术轴,在看似遥远的主题(如病毒学和全球事实)中突出定量要求高的问题。因此,JE-IRT建立了一个统一且可解释的几何视角,将LLM能力与问题结构联系起来,为模型评估和泛化提供了独特视角。

英文摘要

Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. We also show that simple linear probes of the embedding space recover cross-subject ability directions, such as an arithmetic axis that highlights quantitatively demanding questions in seemingly distant subjects like virology and global facts. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.

2602.06486 2026-06-16 cs.AI 版本更新

JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks

JADE:面向开放式专业任务的专家基础动态评估

Lanbo Lin, Jiayao Liu, Tianyuan Yang, Li Cai, Yuanwu Xu, Lei Wei, Sicong Xie, Guannan Zhang

AI总结 提出JADE双层评估框架,结合专家知识与动态声明级评估,解决开放式专业任务中严格性与灵活性的矛盾,在BizBench等基准上提升稳定性并揭示关键失败模式。

详情
AI中文摘要

在开放式专业任务上评估智能体AI面临着严格性与灵活性之间的根本困境。静态评分标准提供了严格、可重复的评估,但无法适应多样化的有效响应策略,而LLM作为评判者的方法虽能适应个体响应,却存在不稳定性和偏差。人类专家通过将领域基础原则与动态的声明级评估相结合来解决这一困境。受此过程启发,我们提出了\textbf{JADE},一个双层评估框架。第一层将专家知识编码为预定义的评估技能集,提供稳定的评估标准。第二层执行报告特定的声明级评估,灵活评估多样化的推理策略,并通过证据依赖门控来使基于被反驳声明的结论无效。在BizBench上的实验表明,JADE提高了评估稳定性,并揭示了基于整体LLM的评估者遗漏的关键智能体失败模式。我们进一步展示了与专家编写的评分标准的高度一致性,并有效迁移到HealthBench和此http URL,涵盖医学和10个领域的专业评估设置。代码和数据可在以下网址获取:此https URL。

英文摘要

Evaluating agentic AI on open-ended professional tasks faces a fundamental dilemma between rigor and flexibility. Static rubrics provide rigorous, reproducible assessment but fail to accommodate diverse valid response strategies, while LLM-as-a-judge approaches adapt to individual responses yet suffer from instability and bias. Human experts address this dilemma by combining domain-grounded principles with dynamic, claim-level assessment. Inspired by this process, we propose \textbf{JADE}, a two-layer evaluation framework. Layer 1 encodes expert knowledge as a predefined set of evaluation skills, providing stable evaluation criteria. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies, with evidence-dependency gating to invalidate conclusions built on refuted claims. Experiments on BizBench show that JADE improves evaluation stability and reveals critical agent failure modes missed by holistic LLM-based evaluators. We further demonstrate strong alignment with expert-authored rubrics and effective transfer to HealthBench and DR.BENCH, covering medical and 10-domain professional evaluation settings. Code and data are available at https://github.com/smiling-world/JADE.

2602.11510 2026-06-16 cs.AI 版本更新

AgentLeak: A Benchmark for Internal-Channel Privacy Leakage in Multi-Agent LLM Systems

AgentLeak:多智能体大语言模型系统中内部通道隐私泄露的基准测试

Faouzi El Yagoubi, Godwin Badu-Marfo, Ranwa Al Mallah

发表机构 * Polytechnique Montréal(蒙特利尔理工学院)

AI总结 提出AgentLeak基准,通过评估内部通道(如智能体间消息、共享内存)的隐私泄露,发现多智能体系统虽降低最终输出泄露,但内部通道使总暴露率达68.9%,远超输出审计的检测范围。

Comments 19 pages, 9 figures, 16 tables. Code and dataset available at https://github.com/Privatris/AgentLeak

详情
AI中文摘要

多智能体大语言模型(LLM)系统产生了当前仅输出基准无法衡量的隐私风险。当智能体协调任务时,敏感数据可能通过智能体间消息、共享内存和工具参数传递,这些路径通常不被最终输出审计检查。我们引入了AgentLeak,一个用于评估多智能体LLM系统中内部通道隐私泄露的基准。AgentLeak检测了七条与隐私相关的通信路径,并提供了针对最终输出、智能体间消息和共享内存的大规模实证评估。在涵盖医疗、金融、法律和企业领域的1000个场景中,使用五个生产级LLM(GPT-4o、GPT-4o-mini、Claude 3.5 Sonnet、Mistral Large和Llama 3.3 70B)以及4979个经过验证的执行轨迹,我们发现,与单智能体基线相比,多智能体配置降低了最终输出泄露(C1:27.2%对43.2%),但引入了内部通道,使系统总暴露率升至68.9%(聚合C1、C2、C5)。智能体间消息(C2)泄露率为68.8%,而最终输出(C1)为27.2%,这意味着仅输出审计遗漏了41.7%的违规。在所有五个模型和四个领域中,模式C2 ≥ C1一致成立。这些结果表明,在所评估的协调者-工作者设置中,多智能体系统的隐私风险主要由架构协调通道而非仅最终输出行为塑造:它源于对标准输出级防御不可见的内部通道。

英文摘要

Multi-agent Large Language Model (LLM) systems create privacy risks that current output-only benchmarks cannot measure. When agents coordinate on tasks, sensitive data may pass through inter-agent messages, shared memory, and tool arguments, all pathways that final-output audits typically do not inspect. We introduce AgentLeak, a benchmark for evaluating internal-channel privacy leakage in multi-agent LLM systems. AgentLeak instruments seven privacy-relevant communication pathways and provides a large-scale empirical evaluation focused on final outputs, inter-agent messages, and shared memory. Across 1,000 scenarios spanning healthcare, finance, legal, and corporate domains, five production LLMs (GPT-4o, GPT-4o-mini, Claude 3.5 Sonnet, Mistral Large, and Llama 3.3 70B), and 4,979 validated execution traces, we find that multi-agent configurations reduce final-output leakage (C1: 27.2% vs 43.2% in single-agent mode) compared with single-agent baselines but introduce internal channels that raise total system exposure to 68.9% (aggregated across C1, C2, C5). Inter-agent messages (C2) leak at 68.8%, compared with 27.2% for final outputs (C1), meaning that output-only audits miss 41.7% of violations. Across all five models and four domains, the pattern C2 $\geq$ C1 holds consistently. These results suggest, within the evaluated coordinator-worker setting, that privacy risk in multi-agent systems is strongly shaped by architectural coordination channels rather than final-output behavior alone: it arises from internal channels that remain invisible to standard output-level defenses.

2602.12670 2026-06-16 cs.AI 版本更新

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

SkillsBench: 基准测试智能体技能在不同任务中的有效性

Xiangyi Li, Yimin Liu, Wenbo Chen, Bingran You, Zonglin Di, Yifeng He, Shenghan Zheng, Kyoung Whan Choe, Jiankai Sun, Shuyi Wang, Chujun Tao, Binxu Li, Xuandong Zhao, Hejia Geng, Xiaojun Wu, Junwei Zhou, Xiaokun Chen, Hanwen Xing, Yubo Li, Qunhong Zeng, Di Wang, Yuanli Wang, Roey Ben Chaim, Penghao Jiang, Haotian Shen, Luyang Kong, Xinyi Liu, Runhui Wang, Xuanqing Liu, Jiachen Li, Xin Lan, Yueqian Lin, Wengao Ye, Junwei He, Songlin Li, Yue Zhang, Yipeng Gao, Yijiang Li, Ze Ma, Liqiang Jing, Tianyu Wang, Kaixin Li, Yiqi Xue, Haoran Lyu, Yizhuo He, Yuchen Tian, Shutong Wu, Bowei Wang, Yixuan Gao, Bo Chen, Litong Liu, Sikai Cheng, Jiajun Bao, Shuaicheng Tong, Shuwen Xu, Terry Yue Zhuo, Tinghan Ye, Qi Qi, Miao Li, Longtai Liao, Zelin Tan, Chang Shi, Xilin Tang, Srinath Tankasala, Boqin Yuan, Yaoyao Qian, Jianhong Tu, Chenguang Wang, Yizhou Sun, Wei Wang, Aaron Taylor, Ziyue Yang, Changkun Guan, Zhikang Dong, Xinyu Zhang, Steven Dillmann, Han-chung Lee, Dawn Song

发表机构 * BenchFlow OSU Amazon UC Berkeley UC Santa Cruz UC Davis Dartmouth RLWRLD Independent Princeton University Oxford University Stanford University USC CMU Foxconn Zenity UNSW UT Austin MSU Duke University ByteDance UT Dallas UC San Diego Columbia University University of Rochester Cornell Tech Georgia Tech Cornell University NEU UCLA Snap Inc. Fanshawe College University of Science and Technology of China HKUST(GZ) Anyscale

AI总结 提出SkillsBench基准,包含8领域87个任务,通过配对评估证明技能提升平均通过率16.6个百分点,小模型配备技能可匹敌大模型。

详情
AI中文摘要

智能体技能是结构化程序性知识包,在推理时增强大语言模型智能体。尽管被快速采用,但没有标准方法衡量它们是否真正有帮助。我们提出SkillsBench,其当前库存包含8个领域的87个任务,并配有精心策划的技能和确定性验证器。我们最新的聚合评估在匹配的无技能和精心策划的技能条件下,对18个模型-框架配置运行了87个任务基准。精心策划的技能将平均通过率从33.9%提高到50.5%(+16.6个百分点;归一化增益25.5%),配置级增益范围从+4.1到+25.7个百分点。最多三个模块的聚焦技能优于更大或详尽的技能包,配备技能的小模型可以匹配没有技能的大模型。SkillsBench建立了配对评估作为严格衡量技能在智能体、专业密集型工作上有效性的基础。

英文摘要

Agent Skills are structured packages of procedural knowledge that augment large language model (LLM) agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark whose current inventory contains 87 tasks across 8 domains paired with curated Skills and deterministic verifiers. Our latest aggregate evaluation runs the 87-task benchmark under matched no-Skills and curated-Skills conditions for 18 model-harness configurations. Curated Skills raise the average pass rate from 33.9% to 50.5% (+16.6 percentage points; 25.5% normalized gain), with configuration-level gains ranging from +4.1 to +25.7 pp. Focused Skills with at most three modules outperform larger or exhaustive bundles, and smaller models with Skills can match larger models without them. SkillsBench establishes paired evaluation as the foundation for rigorous measurement of Skill efficacy on agentic, expertise-heavy work.

2602.16902 2026-06-16 cs.AI cs.LG 版本更新

LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

LLM-WikiRace 基准测试:大语言模型在真实知识图谱上的规划能力有多强?

Juliusz Ziomek, William Bankes, Lorenz Wolf, Shyam Sundhar Ramesh, Xiaohang Tang, Ilija Bogunovic

发表机构 * University of Oxford, UK(牛津大学,英国) University College London (Centre for AI), UK(伦敦大学学院(人工智能中心),英国) University of Basel, Switzerland(巴塞尔大学,瑞士)

AI总结 提出 LLM-Wikirace 基准,通过维基百科超链接导航任务评估大语言模型的规划、推理与世界知识,发现模型在简单任务上超人类,但困难任务成功率仅 23%,且规划与长程推理是主要瓶颈。

详情
AI中文摘要

我们引入了 LLM-Wikirace,一个用于评估大语言模型(LLM)规划、推理和世界知识的基准。在 LLM-Wikirace 中,模型必须逐步高效地导航维基百科超链接,从给定源页面到达目标页面,这需要前瞻性规划和推理概念如何在现实世界中连接的能力。我们评估了广泛的开源和闭源模型,包括 Gemini-3、GPT-5 和 Claude Opus 4.5,它们在任务的简单级别上取得了最强结果,并展现了超人类性能。尽管如此,在困难难度下性能急剧下降:表现最好的模型 Gemini-3 仅在 23% 的困难游戏中成功,凸显了前沿模型面临的重大挑战。我们的分析表明,世界知识是成功的必要因素,但仅在一定程度内;超过这个阈值,规划和长程推理能力成为主导因素。轨迹级分析进一步揭示,即使是最强的模型在失败后也难以重新规划,经常陷入循环而非恢复。LLM-Wikirace 是一个简单的基准,揭示了当前推理系统的明显局限性,提供了一个开放的竞技场,其中具备规划能力的 LLM 仍有待证明。我们的代码和排行榜可在 https://llmwikirace.github.io 获取。

英文摘要

We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wikirace, models must efficiently navigate Wikipedia hyperlinks step by step to reach a target page from a given source, requiring look-ahead planning and the ability to reason about how concepts are connected in the real world. We evaluate a broad set of open- and closed-source models, including Gemini-3, GPT-5, and Claude Opus 4.5, which achieve the strongest results on the easy level of the task and demonstrate superhuman performance. Despite this, performance drops sharply on hard difficulty: the best-performing model, Gemini-3, succeeds in only 23\% of hard games, highlighting substantial remaining challenges for frontier models. Our analysis shows that world knowledge is a necessary ingredient for success, but only up to a point, beyond this threshold, planning and long-horizon reasoning capabilities become the dominant factors. Trajectory-level analysis further reveals that even the strongest models struggle to replan after failure, frequently entering loops rather than recovering. LLM-Wikirace is a simple benchmark that reveals clear limitations in current reasoning systems, offering an open arena where planning-capable LLMs still have much to prove. Our code and leaderboard available at https:/llmwikirace.github.io.

2602.17990 2026-06-16 cs.AI 版本更新

WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics

WorkflowPerturb:用于评估多智能体工作流度量的校准压力测试

Madhav Kanda, Sharad Agarwal, Rodrigo Fonseca, Alok Gautam Kumbhare, Pedro Las-Casas

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Microsoft(微软公司)

AI总结 提出WorkflowPerturb基准,通过对黄金工作流施加分级扰动来评估多智能体工作流度量,揭示度量分数校准不良问题,支持变更管理中的严重性感知解释。

详情
AI中文摘要

从自然语言请求生成结构化工作流的多智能体LLM系统现已部署在云自动化、DevOps和企业流程编排的生产环境中。运行此类系统会暴露一个反复出现的变更管理问题。常规更新,例如重新运行相同的输入、替换底层LLM或重构智能体的提示或编排代码,经常产生与先前验证的参考工作流差异很大的工作流。工程师随后缺乏原则性的方法来决定变更是否安全发布。自动工作流评估是回答这个问题的自然工具。然而在实践中,度量分数校准不良,数值变化很少能传达底层降级的严重性。我们引入WorkflowPerturb,一个受控基准,通过向黄金工作流应用现实的分级扰动来研究工作流评估度量。WorkflowPerturb包含4,973个黄金工作流和44,757个扰动变体,涵盖三种扰动类型(缺失步骤、压缩步骤和描述更改),每种类型以10%、30%和50%的严重程度应用。我们对多个度量族进行基准测试,并使用期望分数轨迹和残差分析它们的敏感性和校准。我们的结果表征了不同度量族之间的系统性差异,并支持在变更管理环境中对工作流评估分数进行严重性感知解释。我们的数据集将在接收后发布。

英文摘要

Multi-agent LLM systems that generate structured workflows from natural-language requests are now deployed in production across cloud automation, DevOps, and enterprise process orchestration. Operating such systems exposes a recurring change-management problem. Routine updates, such as re-running the same input, swapping the underlying LLM, or refactoring an agent's prompt or orchestration code, frequently produce workflows that differ substantially from previously validated references. Engineers are then left without a principled way to decide whether a change is safe to ship. Automatic workflow evaluation is the natural tool for answering this question. In practice, however, metric scores are poorly calibrated, and a numeric change rarely communicates the severity of the underlying degradation. We introduce WorkflowPerturb, a controlled benchmark for studying workflow evaluation metrics by applying realistic, graded perturbations to golden workflows. WorkflowPerturb contains 4,973 golden workflows and 44,757 perturbed variants across three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each applied at severity levels of 10%, 30%, and 50%. We benchmark multiple metric families and analyze their sensitivity and calibration using expected score trajectories and residuals. Our results characterize systematic differences across metric families and support severity-aware interpretation of workflow evaluation scores in change-management settings. Our dataset will be released upon acceptance.

2603.02668 2026-06-16 cs.AI cs.LG 版本更新

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

SorryDB: AI证明者能完成现实世界的Lean定理吗?

Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, Lenny Taelman

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出动态更新的基准SorryDB,包含78个GitHub上的现实形式化项目,评估AI证明者在复杂依赖下的能力,发现当前方法互补,基于Gemini Flash的智能体方法表现最佳。

详情
AI中文摘要

我们提出了SorryDB,一个动态更新的基准,包含从GitHub上78个现实世界形式化项目中提取的开放Lean任务。与现有的静态基准(通常由竞赛问题组成)不同,攀登SorryDB基准将产生与社区需求对齐、对数学家更易用、更能理解复杂依赖的工具。此外,通过提供持续更新的任务流,SorryDB减轻了测试集污染,并为智能体对新颖形式数学项目的贡献能力提供了稳健的度量。我们评估了一系列方法,包括通用大型语言模型、智能体方法和专用符号证明器,在SorryDB中选取的1000个任务快照上。我们表明当前方法是互补的:尽管基于Gemini Flash的智能体方法性能最佳,但它并不严格优于其他现成的大型语言模型、专用证明器,甚至精心策划的Lean策略列表。

英文摘要

We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most performant, it is not strictly better than other off-the-shelf large-language models, specialized provers, or even a curated list of Lean tactics.

2603.09309 2026-06-16 cs.AI 版本更新

Rescaling Confidence: What Scale Design Reveals About LLM Metacognition

重新缩放置信度:量表设计揭示LLM元认知

Yuyang Dai, Yuxia Wang

发表机构 * INSAIT, Sofia University "St. Kliment Ohridski"(INSAIT,索菲亚大学‘圣克莱门特·欧里德斯基’)

AI总结 研究LLM口头置信度受量表设计影响,发现0-20量表比标准0-100量表更有效提升元认知效率。

Comments 20 pages

详情
AI中文摘要

口头置信度,即LLM报告数值确定性分数,被广泛用于估计黑箱环境中的不确定性,然而置信度量表本身(通常为0-100)很少被审视。我们表明,这一设计选择并非中性。在六个LLM和三个数据集上,口头置信度高度离散化,超过78%的响应集中在三个整数上。为研究这一现象,我们沿三个维度系统操纵置信度量表:粒度、边界位置和范围规律性,并使用$meta\ ext{-}d'$评估元认知敏感性。我们发现,0-20量表持续优于标准0-100格式,提高了元认知效率,而边界压缩会降低性能,即使在非规则范围下,整数偏好仍然存在。这些结果表明,置信度量表设计直接影响口头不确定性的质量,应被视为LLM评估中的首要实验变量。

英文摘要

Verbalized confidence, in which LLMs report a numerical certainty score, is widely used to estimate uncertainty in black-box settings, yet the confidence scale itself (typically 0--100) is rarely examined. We show that this design choice is not neutral. Across six LLMs and three datasets, verbalized confidence is heavily discretized, with more than 78\% of responses concentrating on just three round-number values. To investigate this phenomenon, we systematically manipulate confidence scales along three dimensions: granularity, boundary placement, and range regularity, and evaluate metacognitive sensitivity using $meta\text{-}d'$. We find that a 0--20 scale consistently improves metacognitive efficiency over the standard 0--100 format, while boundary compression degrades performance and round-number preferences persist even under irregular ranges. These results demonstrate that confidence scale design directly affects the quality of verbalized uncertainty and should be treated as a first-class experimental variable in LLM evaluation.

2603.10384 2026-06-16 cs.AI 版本更新

Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability

超越标量:通过几何进展和稳定性评估与理解LLM推理

Xinyan Jiang, Ninghao Liu, Di Wang, Lijie Hu

发表机构 * GitHub

AI总结 提出TRACED框架,利用几何运动学将推理轨迹分解为进展和稳定性,揭示正确推理与幻觉的拓扑差异,实现鲁棒的推理质量评估。

Comments Accepted by ICML2026

详情
AI中文摘要

通过标量概率评估LLM可靠性通常无法捕捉推理的结构动态。我们引入TRACED,一个基于理论几何运动学评估推理质量的框架。通过将推理轨迹分解为进展(位移)和稳定性(曲率),我们揭示了明显的拓扑分歧:正确推理表现为高进展、稳定的轨迹,而幻觉则以低进展、不稳定模式为特征(停滞位移伴随高曲率波动)。利用这些特征,我们的概率框架在多种基准测试中实现了具有竞争力的性能和卓越的鲁棒性。关键的是,TRACED通过将高曲率映射为“犹豫循环”,将位移映射为“确定性积累”,架起了几何与认知的桥梁,为解码机器思维的内部动态提供了物理视角。

英文摘要

Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematics. By decomposing reasoning traces into Progress (displacement) and Stability (curvature), we reveal a distinct topological divergence: correct reasoning manifests as high-progress, stable trajectories, whereas hallucinations are characterized by low-progress, unstable patterns (stalled displacement with high curvature fluctuations). Leveraging these signatures, our probabilistic framework achieves competitive performance and superior robustness across diverse benchmarks. Crucially, TRACED bridges geometry and cognition by mapping high curvature to ''Hesitation Loops'' and displacement to ''Certainty Accumulation'', offering a physical lens to decode the internal dynamics of machine thought.

2605.09163 2026-06-16 cs.AI 版本更新

FORTIS: Benchmarking Over-Privilege in Agent Skills

FORTIS:评估代理技能中的过度特权

Shawn Li, Chenxiao Yu, Han Wang, Wei Yang, Ryan Rossi, Franck Dernoncourt, Xiyang Hu, Philip Yu, Chaowei Xiao, Huan Zhang, Yue Zhao

发表机构 * University of Southern California(南加州大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Adobe Research(Adobe研究) Arizona State University(亚利桑那州立大学) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Johns Hopkins University(约翰霍普金斯大学)

AI总结 研究发现,当前代理技能层普遍存在过度特权问题,模型在选择和执行技能时常超出任务需求,导致性能不佳。

详情
AI中文摘要

大型语言模型代理越来越多地通过一个中间技能层来介面用户意图与具体任务执行。这一层被广泛视为一种组织抽象,但我们认为它也是当前模型经常越界的特权边界。我们提出了FORTIS,一个评估代理技能中过度特权的基准,分为两个阶段:模型是否从大量的重叠库中选择最小必要的技能,以及是否在不扩展到更广泛的工具或行动的情况下执行该技能。在十个前沿模型和三个领域中,我们发现过度特权行为是常态而非例外。模型始终倾向于选择比任务要求更高的特权技能和工具,在两个阶段的失败率即使在最强的可用模型中仍然很高。在现实用户交互的普通条件下,失败尤为严重:不完整的规范、便利的框架和接近技能边界。这些都不需要对抗性构造。结果表明,技能层远非包含代理行为,而是当前系统中特权升级的主要来源。

英文摘要

Large language model agents increasingly operate through an intermediate skill layer that mediates between user intent and concrete task execution. This layer is widely treated as an organizational abstraction, but we argue it is also a privilege boundary that current models routinely exceed. We present \textbf{FORTIS}, a benchmark that evaluates over-privilege in agent skills across two stages: whether a model selects the minimally sufficient skill from a large overlapping library, and whether it executes that skill without expanding into broader tools or actions than the skill permits. Across ten frontier models and three domains, we find that over-privileged behavior is the norm rather than the exception. Models consistently reach for higher-privilege skills and tools than the task requires, failing at both stages at rates that remain high even for the strongest available models. Failure is especially severe under the ordinary conditions of real user interaction: incomplete specification, convenience framing, and proximity to skill boundaries. None of these requires adversarial construction. The results indicate that the skill layer, far from containing agent behavior, is itself a primary source of privilege escalation in current systems.

2605.10574 2026-06-16 cs.AI 版本更新

LLM Jaggedness Unlocks Scientific Creativity

LLM Jaggedness Unlocks Scientific Creativity

Shray Mathur, J. Anibal Boscoboinik, Esther H. R. Tsai, Kevin G. Yager

发表机构 * Center for Functional Nanomaterials, Brookhaven National Laboratory(功能纳米材料中心,布鲁赫萨尔国家实验室)

AI总结 本文研究了大型语言模型(LLMs)在科学创意生成中的不规则进步现象,提出了SciAidanBench基准测试集,通过评估不同模型在科学问题上的创意生成能力,揭示了模型在跨任务、提示和领域层面的不均衡表现,并展示了如何通过推理计算、知识聚合和头脑风暴等机制利用这种不规则性来提升科学创造力。

详情
AI中文摘要

随着人工智能的发展,模型的改进并非均匀发生,而是呈现出不规则的进展方式,其能力在不同任务、领域和模型规模上增长不均。在本文中,我们通过科学创意生成的视角来研究这种动态的不规则性。我们引入了SciAidanBench,这是一个设计用于衡量大型语言模型(LLMs)科学创造力的开放性科学问题基准测试集。给定一个科学问题,模型被要求生成尽可能多的唯一且连贯的想法,总的有效响应数量作为创意潜力的代理。评估19个基础模型(共30个变体,包括推理版本)后,我们发现不规则性既体现在模型之间,也体现在模型内部。首先,在通用创造力与科学创造力的跨任务比较中,通用创造力的提升并不均匀地转化为科学创造力,揭示了模型在不同任务上的能力差异。其次,在提示层面,更强的模型并不均匀地提升,而是表现出高度的变异性,某些问题上出现创意爆发,而其他问题上表现有限。第三,在领域层面,单个模型在不同科学子领域中的表现不均衡,反映了内部能力的碎片化。最后,我们展示了这种不规则性可以被利用。我们探索了推理时间计算、知识聚合和头脑风暴等机制,以有效结合模型并构建超越单个模型的元模型集合。我们的结果表明,不规则性不应被视为限制,而是一种资源,是AI进步的结构性特征,当被理解和利用时,可以放大LLM驱动的科学创造力。

英文摘要

As artificial intelligence advances, models are not improving uniformly. Instead, progress unfolds in a jagged fashion, with capabilities growing unevenly across tasks, domains, and model scales. In this work, we examine this dynamic jaggedness through the lens of scientific idea generation. We introduce SciAidanBench, a benchmark of open-ended scientific questions designed to measure the scientific creativity of large language models (LLMs). Given a scientific question, models are asked to generate as many unique and coherent ideas as possible, with the total number of valid responses serving as a proxy for creative potential. Evaluating 19 base models across 8 providers (30 total variants including reasoning versions), we find that jaggedness manifests both across models and within models. First, in a cross-task comparison between general and scientific creativity, improvements in general creativity do not translate uniformly to scientific creativity, revealing divergent capability profiles across models. Second, at the prompt level, stronger models do not improve uniformly; instead, they exhibit high variability, with bursts of creativity on some questions and limited performance on others. Third, at the domain level, individual models display uneven strengths across scientific subfields, reflecting fragmented internal capability profiles. Finally, we show that this jaggedness can be harnessed. We explore mechanisms of inference-time compute, knowledge pooling, and brainstorming to combine models effectively and construct meta-model ensembles that outperform any single model. Our results position jaggedness not as a limitation, but as a resource, a structural feature of AI progress that, when understood and leveraged, can amplify LLM-driven scientific creativity.

2605.22664 2026-06-16 cs.AI 版本更新

MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

WorkstreamBench: 评估LLM代理在金融领域的端到端电子表格任务

Thomson Yen, Julian Poeltl, Harshith Srinivas Gear, Yilin Meng, Joshua Fan, Adam Shen, Yili Liu, Ali Bauyrzhan, Siri Du, Haoyang Liu, Daniel Guetta, Hongseok Namkoong

发表机构 * Decision, Risk, and Operations Division, Columbia Business School(哥伦比亚商学院决策、风险与运营部门) ESB Business School, Reutlingen University(图宾根大学ESB商学院)

AI总结 本文提出WorkstreamBench,用于评估LLM代理在金融领域复杂端到端电子表格任务中的能力,重点在于财务建模和情景分析等关键流程,通过三个维度(准确性、公式、格式)的细粒度标准来衡量解决方案质量。

详情
AI中文摘要

LLM代理越来越多地被期望执行端到端工作流,从高层次用户指令生成完整的成果。为了满足企业需求,前沿AI实验室已开发出能够从头构建整个电子表格的代理。这在金融领域尤为重要,因为核心工作流如财务建模、预测和情景分析通常通过电子表格完成。然而,现有电子表格基准测试并未衡量这种高级能力,而是专注于问答或单个公式编辑。为填补这一空白,我们提供了首个评估代理在端到端电子表格任务上的评估,重点是经济关键的金融工作流,如建模和情景分析。由于其中的交付成果通常由多个利益相关者审查和修订,判断其质量必然涉及诸如可读性或修改便捷性等高级标准。为了反映解决方案质量的多维性质,我们开发了一个包含三个维度(准确性、公式、格式)的评估分类学,每个维度包含细粒度的标准,以反映专业标准。Claude家族在基准测试中领先,产生最专业的输出,在我们的定性审查中,但即使最强的代理也经常无法达到专业金融标准,并且在难度超过几个链式计算后显著下降。这表明当前的代理尚无法可靠地生成专业质量的电子表格,以满足现实工作流程所需的复杂性。

英文摘要

LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To meet enterprise needs, frontier AI labs have developed agents that can construct entire spreadsheets from scratch. This is especially relevant in finance, where core workflows such as financial modeling, forecasting, and scenario analysis are commonly conducted through spreadsheets. Yet, existing spreadsheet benchmarks do not measure this advanced capability, focusing instead on question-answering or single-formula edits. To address this gap, we provide one of the first evaluations of agents on end-to-end spreadsheet tasks, focusing on economically critical financial workflows such as modeling and scenario analysis. Since deliverables therein are routinely reviewed and revised by multiple stakeholders, judging their quality necessarily involves high-level criteria such as readability or ease of modification. To reflect the multidimensional nature of solution quality, we develop an evaluation taxonomy comprising three dimensions: Accuracy, Formula, and Format, each comprising fine-grained criteria that reflect professional standards. The Claude family leads the benchmark and produces the most professional-looking outputs in our qualitative review, but even the strongest agents frequently fall short of professional finance standards and degrade sharply as the difficulty increases beyond a few chained calculations. This suggests that current agents are not yet able to reliably produce professional-quality spreadsheets at the level of complexity real-world workflows demand.

2606.09669 2026-06-16 cs.AI cs.CL 版本更新

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

SpatialWorld: 在多模态智能体真实世界任务中基准测试交互式空间推理

Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong

发表机构 * Tsinghua University(清华大学) Chongqing University(重庆大学) Peking University(北京大学) ZenoMind AI Xi’an Jiaotong University(西安交通大学) Beijing Institute of Technology(北京理工大学) Southeast University(东南大学) Shanghai Jiao Tong University(上海交通大学) Joy Future Academy The University of Hong Kong(香港大学)

AI总结 提出SpatialWorld基准,集成8种异构模拟后端,通过760个人工标注任务评估多模态智能体在视觉部分可观测环境中的交互式空间理解,发现最强模型GPT-5任务成功率仅17.4%。

详情
AI中文摘要

空间推理是多模态大语言模型(MLLMs)感知和操作物理世界的基础能力。然而,现有基准主要依赖被动评估(如静态VQA)或特定模拟器流程,未能评估通用的交互式空间理解。我们引入SpatialWorld,一个专门为评估多模态智能体在复杂真实世界任务中的交互式空间理解而设计的统一基准。在共享的、模拟器无关的协议下集成八个异构模拟后端,SpatialWorld包含跨多个领域(如家庭日常、旅行、社交协作)的760个人工标注任务。智能体必须在仅视觉的部分可观测性下解决问题,主动收集自我中心的视觉证据,并通过MLLMs原生的统一文本动作接口表达决策。为了可靠评估,每个任务包含一个人工验证的初始状态、一条参考轨迹和一个终端状态验证器。评估15个先进智能体揭示,稳健的空间任务解决仍然具有挑战性:最强模型GPT-5平均任务成功率(TSR)仅为17.4%,而领先的开源模型Qwen-3.5达到14.1%。进一步分析暴露了任务成功与执行效率之间的明显不匹配,以及显著的领域特定性能差异。这些在主动探索和长程规划中的瓶颈使SpatialWorld成为未来空间智能体的严格测试平台。

英文摘要

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

2606.13608 2026-06-16 cs.AI cs.LG 版本更新

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

AgentBeats:面向开放性、标准化和可复现性的智能体评估代理化

Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyuan Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniel Miao, Peter J. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer, Siva Reddy, Alexandre Drouin, Alexandre Lacoste, Ramayya Krishnan, Elham Tabassi, Yu Su, Victor Barres, Chenguang Wang, Wenbo Guo, Dawn Song

发表机构 * University of California, Berkeley(加州大学伯克利分校) Purdue University(普渡大学) University of Ljubljana(卢布尔雅那大学) University of Washington(华盛顿大学) Oasis Labs University of Maryland(马里兰大学) IBM Research(IBM研究院) Mila McGill University(麦吉尔大学) ServiceNow Research(ServiceNow研究院) Carnegie Mellon University(卡内基梅隆大学) National Institute of Standards and Technology(美国国家标准与技术研究院) The Ohio State University(俄亥俄州立大学) University of Cambridge(剑桥大学) University of California, Santa Barbara(加州大学圣塔芭芭拉分校)

AI总结 提出代理化智能体评估(AAA)框架,通过标准化协议(A2A和MCP)统一评估接口,实现开放、可复现的多智能体评估,并基于AgentBeats系统通过大规模竞赛和案例研究验证其覆盖性、实用性和保真度。

详情
AI中文摘要

智能体系统在各领域快速进步,但其评估仍然碎片化。大多数基准测试依赖于固定的、以LLM为中心的测试框架,需要大量集成,造成测试与生产环境不匹配,并限制了不同智能体设计之间的公平比较。根本问题在于缺乏开放的、与智能体无关的评估接口。我们倡导代理化智能体评估(AAA),其中评估由裁判智能体执行,所有参与者通过标准化协议交互:A2A用于任务管理,MCP用于工具访问。传统基准测试定义了两个独立的接口(一个用于基准测试,一个用于智能体),而AAA只需要一个;这产生了一个通用的统一框架,将评估逻辑与智能体实现分离,并支持可复现、可互操作和多智能体评估。我们进一步引入AgentBeats作为AAA的具体实现:我们确定了五种实际操作模式,使标准化评估与开放性、隐私性和可复现性的现实约束兼容。为了大规模评估我们的设计,我们进行了两项研究:一项为期五个月的开放竞赛,吸引了来自独立参与者的12个类别的298个裁判智能体和467个主题智能体,表明AAA适用于异构基准测试范围;以及一项关于编码智能体的案例研究,证实代理化评估在保留与公开记录一致性的同时,揭示了先前缺失的直接比较结果,产生了关于智能体设计的研究见解。结合社区规模实地研究和受控编码案例研究,我们验证了AAA在异构场景下大规模提供覆盖性、实用性和保真度。AAA和AgentBeats共同为开放、标准化和可复现的智能体评估提供了清晰路径。

英文摘要

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.

2606.13782 2026-06-16 cs.AI 版本更新

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

MA-ProofBench: 数学分析中定理证明的大语言模型双层评估基准

Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li, Jie Zhou, Yudong Wang

发表机构 * ModelBest Inc. Tsinghua University(清华大学)

AI总结 提出首个面向数学分析的形式化定理证明基准MA-ProofBench,包含200个定理,覆盖6个核心主题和27个子类别,分为本科和博士资格两级难度,评估发现当前模型表现不佳,GPT-5.5在Level I上仅达16% Pass@8。

Comments 19 pages, 4 figures, 4 tables

详情
AI中文摘要

大型语言模型(LLMs)在自动化定理证明方面取得了显著进展,然而现有的形式化基准在数学覆盖范围和难度上仍然有限。大多数集中在更容易形式化的领域,如代数和初等数论,并且对需要更深层推理的子领域(包括数学分析)覆盖有限。为了解决这一差距,我们引入了MA-ProofBench,据我们所知,这是第一个专门致力于数学分析的形式化定理证明基准。该基准包含200个形式化定理,涵盖6个核心主题和27个子类别,包括测度与积分理论、复分析和泛函分析。问题分为两个难度级别:本科级别(Level I,100个问题)和博士资格考试级别(Level II,100个问题),以评估LLMs在不同数学深度上的形式推理能力。每个问题通过人工主导、LLM辅助的形式化流程构建,随后由独立专家评审,确保形式化陈述忠实于原始数学。我们在MA-ProofBench上评估了一系列最新的通用推理模型和形式化定理证明器。然而,大多数模型表现不佳:即使是最佳模型GPT-5.5,在Level I上仅达到16%的Pass@8,在Level II上为5%,而大多数模型在Level II上接近0%。进一步分析发现,Mathlib幻觉和不完整证明是两种主要的失败模式,而对基准的自然语言版本的评估揭示了非正式推理与形式推理之间的明显差距。MA-ProofBench旨在作为跟踪高级领域形式化数学推理进展的可靠参考。

英文摘要

Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best of our knowledge, the first formal theorem-proving benchmark dedicated to Mathematical Analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels, an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems), to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem is constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review, ensuring that the formal statements remain faithful to the original mathematics. We evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. However, most models perform poorly: even the best-performing model, GPT-5.5, achieves only 16% Pass@8 on Level I and 5% on Level II, while most models stay close to 0% on Level II. Further analysis identifies Mathlib hallucinations and incomplete proofs as the two dominant failure modes, while an evaluation on the natural-language version of the benchmark exposes a clear gap between informal and formal reasoning. MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains.

2401.15296 2026-06-16 cs.CV cs.AI 版本更新

A Survey on 3D Skeleton Based Person Re-Identification: Taxonomy, Advances, Challenges, and Interdisciplinary Prospects

基于3D骨架的行人重识别综述:分类、进展、挑战与跨学科前景

Haocong Rao, Chunyan Miao

发表机构 * College of Computing and Data Science, Nanyang Technological University (NTU), Singapore(南洋理工大学计算与数据科学学院,新加坡) Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore(老龄化积极生活卓越研究中心(LILY),南洋理工大学,新加坡) Alibaba-NTU Global e-Sustainability CorpLab (ANGEL), NTU, Singapore(阿里巴巴-南洋理工大学全球可持续发展企业实验室(ANGEL),南洋理工大学,新加坡)

AI总结 本文系统综述了基于3D骨架的行人重识别方法,提出了手工、序列和图建模三类分类法,并评估了监督、自监督和无监督学习范式下的最新技术,最后讨论了关键挑战与跨学科应用前景。

Comments Accepted by IJCAI 2026. A curated collection of valuable resources is available at https://github.com/Kali-Hac/3D-SRID-Survey

详情
AI中文摘要

基于3D骨架的行人重识别是一个重要的新兴研究领域,在模式识别领域引起了越来越多的关注。凭借在各种应用场景中的独特优势,近年来提出了许多基于3D骨架的行人重识别(SRID)方法,这些方法采用了不同的骨架建模和学习范式。在本文中,我们提供了对近期SRID进展的全面回顾和分析。首先,我们定义了SRID任务,并概述了其起源和主要进展。其次,我们制定了一个系统性的分类法,将现有方法分为三类:手工建模、序列建模和图建模。然后,我们详细阐述了这三类中的代表性模型,并说明了其基础机制。同时,我们概述了主流的监督、自监督和无监督SRID学习范式及相应的常用方法。进一步地,我们在各种类型的基准和协议上对最先进的SRID方法进行了全面评估,以比较其有效性、效率和关键特性。最后,我们提出了推动未来研究的关键挑战和前景,并通过案例研究强调了SRID的跨学科应用。

英文摘要

Person re-identification via 3D skeletons is an important emerging research area that attracts increasing attention within the pattern recognition community. With distinctive advantages across various application scenarios, numerous 3D skeleton based person re-identification (SRID) methods with diverse skeleton modeling and learning paradigms have been proposed in recent years. In this paper, we provide a comprehensive review and analysis of recent SRID advances. First of all, we define the SRID task and provide an overview of its origin and major advancements. Secondly, we formulate a systematic taxonomy that organizes existing methods into three categories centered on hand-crafted, sequence-based, and graph-based modeling. Then, we elaborate on the representative models along these three types with an illustration of foundational mechanisms. Meanwhile, we provide an overview of mainstream supervised, self-supervised, and unsupervised SRID learning paradigms and corresponding common methods. A thorough evaluation of state-of-the-art SRID methods is further conducted over various types of benchmarks and protocols to compare their effectiveness, efficiency, and key properties. Finally, we present the key challenges and prospects to advance future research, and highlight interdisciplinary applications of SRID with a case study.

2505.06589 2026-06-16 stat.ML cs.AI math.OC 版本更新

Optimal Transport for Machine Learners

机器学习者的最优传输

Gabriel Peyré

发表机构 * CNRS and ENS, PSL Université(国家科学研究中心和巴黎高等师范学院,巴黎大学)

AI总结 本书从机器学习角度介绍最优传输(OT)技术,涵盖从Monge映射、Kantorovich对偶到Sinkhorn算法等核心方法,并展示其在损失函数、生成模型、领域适应、梯度流等ML任务中的应用。

详情
AI中文摘要

现代机器学习反复操作概率测度:经验数据集、生成样本、潜在分布、类别条件律、粒子系统、宽网络权重和注意力模式。最优传输在此场景中很有用,因为它通过询问质量应如何移动来比较这些对象。因此,它结合了具有统计意义的差异概念与插值几何、对偶证书和变分动力学。这使得OT成为损失函数、生成建模、领域适应、鲁棒学习、重心、梯度流和学习算法的平均场描述的通用语言。本书以这些机器学习用途为出发点,介绍主要的OT技术。它从有限分配和Monge映射视角开始,过渡到Kantorovich耦合和对偶势,然后解释使传输可用的算法思想:线性规划、半离散单元、Sinkhorn缩放和低维投影。随后,相同的对象被重新用作测度几何,给出Wasserstein距离、重心、梯度流、动态公式和高斯/Bures公式。最后几章强调与现代ML最相关的变体:散度和对抗损失、熵松弛和非平衡松弛、鲁棒或谱地面几何、Gromov和量子扩展,以及基于传输的生成模型、平均场网络和注意力动态视图。目标是保持数学的明确性,同时揭示将OT转化为机器学习者可用工具箱所需的计算和几何直觉。

英文摘要

Modern machine learning repeatedly manipulates probability measures: empirical datasets, generated samples, latent distributions, class-conditional laws, particle systems, weights of wide networks and attention patterns. Optimal transport is useful in this setting because it compares such objects by asking how mass should move. It therefore combines a statistically meaningful notion of discrepancy with a geometry of interpolation, dual certificates and variational dynamics. This makes OT a common language for losses, generative modeling, domain adaptation, robust learning, barycenters, gradient flows and mean-field descriptions of learning algorithms. This book presents the main OT techniques with these machine-learning uses in mind. It starts from finite assignment and the Monge map viewpoint, passes to Kantorovich couplings and dual potentials, and then explains the algorithmic ideas that make transport usable: linear programming, semi-discrete cells, Sinkhorn scaling and low-dimensional projections. The same objects are then reused as a geometry of measures, giving Wasserstein distances, barycenters, gradient flows, dynamic formulations and Gaussian/Bures formulas. The final chapters emphasize the variants most relevant to modern ML: divergences and adversarial losses, entropic and unbalanced relaxations, robust or spectral ground geometries, Gromov and quantum extensions, and transport-based views of generative models, mean-field networks and attention dynamics. The goal is to keep the mathematics explicit while exposing the computational and geometric intuitions needed to turn OT into a working toolbox for machine learners.

2508.01401 2026-06-16 cs.CL cs.AI 版本更新

MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs

MedSynth: 真实、合成的医疗对话-笔记对

Ahmad Rezaie Mianroodi, Amirali Rezaie, Niko Grisel Todorov, Nadine A. Friedrich, Maria P Mogollon, Alexander Hernandez-Tirado, Guillermo Lopez Garcia, Cyril Rakovski, Frank Rudzicz

发表机构 * Dalhousie University(达尔豪斯大学) Vector Institute(向量研究所) Shahrood University of Technology(沙霍尔德大学) Chapman University(查普曼大学) Cedars-Sinai Medical Center(Cedars-Sinai 医疗中心)

AI总结 为解决医生文书负担,提出MedSynth合成数据集,包含超1万对对话-笔记,覆盖2000+ICD-10编码,显著提升Dial-2-Note和Note-2-Dial任务性能。

Comments 7 pages excluding references and appendices

详情
AI中文摘要

医生花费大量时间记录临床就诊,这一负担导致了职业倦怠。为了解决这个问题,强大的医疗文档自动化工具至关重要。我们引入了MedSynth——一个新颖的合成医疗对话和笔记数据集,旨在推进对话到笔记(Dial-2-Note)和笔记到对话(Note-2-Dial)任务。基于对疾病分布的广泛分析,该数据集包含超过10,000个对话-笔记对,覆盖2000多个ICD-10编码。我们证明,该数据集显著提升了模型从对话生成医疗笔记以及从医疗笔记生成对话的性能。在开放获取、符合隐私要求且多样化的训练数据稀缺的领域,该数据集提供了宝贵的资源。代码可从此https URL获取,数据集可从此https URL获取。

英文摘要

Physicians spend significant time documenting clinical encounters, a burden that contributes to professional burnout. To address this, robust automation tools for medical documentation are crucial. We introduce MedSynth -- a novel dataset of synthetic medical dialogues and notes designed to advance the Dialogue-to-Note (Dial-2-Note) and Note-to-Dialogue (Note-2-Dial) tasks. Informed by an extensive analysis of disease distributions, this dataset includes over 10,000 dialogue-note pairs covering over 2000 ICD-10 codes. We demonstrate that our dataset markedly enhances the performance of models in generating medical notes from dialogues, and dialogues from medical notes. The dataset provides a valuable resource in a field where open-access, privacy-compliant, and diverse training data are scarce. Code is available at https://github.com/ahmadrezarm/MedSynth/tree/main and the dataset is available at https://huggingface.co/datasets/Ahmad0067/MedSynth.

2508.17742 2026-06-16 eess.SP cs.AI cs.HC 版本更新

EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation and Diagnostic Analyses of EEG Foundation Models

EEG-FM-Bench:脑电图基础模型系统评估与诊断分析的综合基准

Wei Xiong, Jiangtong Li, Jie Li, Kun Zhu, Changjun Jiang

发表机构 * School of Computer Science and Technology, Tongji University, Shanghai, China(同济大学计算机科学与技术学院,上海,中国) Translational Research Center, Shanghai Yangzhi Rehabilitation Hospital (Shanghai Sunshine Rehabilitation Center), China(上海杨氏康复医院(上海阳光康复中心)转化研究中心,中国)

AI总结 提出EEG-FM-Bench统一基准,整合14个数据集和10种范式,通过多种微调策略和诊断分析揭示多任务学习可缓解过拟合、预训练效率受梯度冲突限制、模型规模非唯一决定因素等关键发现。

Comments 36 pages, 30 figures, Accepted by ICML2026

详情
AI中文摘要

脑电图基础模型(EEG-FMs)推动了脑信号分析的发展,但缺乏标准化评估基准阻碍了模型比较和科学进步。当前评估依赖不一致的协议,导致跨模型比较不可靠,同时缺乏诊断分析掩盖了驱动迁移效率和扩展行为的内部机制。为解决这一问题,我们引入了\textbf{EEG-FM-Bench},一个用于标准化评估EEG-FMs的统一系统。该基准整合了10种范式下的14个数据集,并包含多种实验设置,包括多种微调策略、任务组织和分类器配置,并辅以梯度和表示分析工具。我们的实验和分析揭示了几个关键见解:(1)多任务学习通常作为一种有用的正则化器,缓解数据稀缺的EEG上下文中的过拟合,尽管在特定任务范式下可能出现负迁移;(2)预训练效率目前受重建目标与下游任务之间的梯度冲突限制;(3)在已发布的检查点和匹配的下游协议下,模型或数据规模本身不能完全解释迁移性能,而目标对齐、适应兼容性和EEG特定设计似乎是重要因素。该基准实现了公平比较和可重复分析,为EEG-FMs的更公平比较和更可解释分析迈出了一步。代码见https://this https URL。

英文摘要

Electroencephalography foundation models (EEG-FMs) have advanced brain signal analysis, but the lack of standardized evaluation benchmarks impedes model comparison and scientific progress. Current evaluations rely on inconsistent protocols that render cross-model comparisons unreliable, while a lack of diagnostic analyses obscures the internal mechanisms driving transfer efficiency and scaling behaviors. To address this, we introduce \textbf{EEG-FM-Bench}, a unified system for the standardized evaluation of EEG-FMs. The benchmark integrates 14 datasets across 10 paradigms and incorporates diverse experimental settings, including multiple fine-tuning strategies, task organizations, and classifier configurations, supported by tools for gradient and representation analysis. Our experiments and analysis reveal several critical insights: (1) multi-task learning often acts as a useful regularizer that mitigates overfitting in data-scarce EEG contexts, although negative transfer can arise under specific task paradigms; (2) pre-training efficiency is currently limited by gradient conflicts between reconstruction objectives and downstream tasks; (3) under released checkpoints and a matched downstream protocol, model or data scale alone does not fully explain transfer performance, while objective alignment, adaptation compatibility, and EEG-specific design appear to be important factors. This benchmark enables fair comparison and reproducible analysis, providing a step toward fairer comparison and more interpretable analysis of EEG-FMs. Code is available at https://github.com/xw1216/EEG-FM-Bench.

2509.07605 2026-06-16 cs.LG cs.AI cs.IT math.IT 版本更新

Beyond Rebalancing: Benchmarking Binary Classifiers Under Class Imbalance Without Rebalancing Techniques

超越重平衡:在不使用重平衡技术的情况下对类别不平衡下的二分类器进行基准测试

Ali Nawaz, Amir Ahmad, Shehroz S. Khan

发表机构 * Department of Information Systems and Security, College of Information Technology and Center for Artificial Intelligence and Digital Innovation, United Arab Emirates University(信息系统与安全系,信息技术学院和人工智能与数字创新中心,阿联酋大学) College of Engineering and Technology, American University of the Middle East(工程与技术学院,中东大学)

AI总结 本研究系统评估了多种二分类器在无显式重平衡技术下对类别不平衡的鲁棒性,发现TabPFN和基于提升的集成模型在极端不平衡下仍保持较高性能。

详情
AI中文摘要

类别不平衡对监督分类构成了重大挑战,特别是在医疗诊断和异常检测等关键领域,其中少数类实例很少。尽管许多研究探索了重平衡技术来解决这个问题,但在未应用此类技术的情况下评估不平衡下二分类器性能的关注较少。因此,本研究的目标是评估二分类器“原样”的性能,而不执行任何显式重平衡。具体来说,我们系统评估了多种二分类器在真实世界和合成数据集上的鲁棒性,在逐步减少的少数类规模下,使用一次和少量样本场景作为基线。我们的方法还通过合成决策边界生成探索不同的数据复杂性,以模拟真实世界条件。除了标准分类器,我们还包括使用欠采样、过采样策略和单类分类方法的实验,以检查它们在严重不平衡下的行为。结果证实,随着数据复杂性增加和少数类规模减小,分类变得更加困难。虽然传统分类器在极端不平衡下性能下降,但像TabPFN和基于提升的集成模型等先进模型相比传统分类器保持了相对更高的性能和更好的泛化能力。可视化可解释性和评估指标进一步验证了这些发现。我们的工作为不平衡学习中的模型选择提供了有价值的指导,提供了关于分类器鲁棒性而不依赖显式重平衡技术的见解。

英文摘要

Class imbalance poses a significant challenge to supervised classification, particularly in critical domains like medical diagnostics and anomaly detection where minority class instances are rare. While numerous studies have explored rebalancing techniques to address this issue, less attention has been given to evaluating the performance of binary classifiers under imbalance when no such techniques are applied. Therefore, the goal of this study is to assess the performance of binary classifiers "as-is", without performing any explicit rebalancing. Specifically, we systematically evaluate the robustness of a diverse set of binary classifiers across both real-world and synthetic datasets, under progressively reduced minority class sizes, using one-shot and few-shot scenarios as baselines. Our approach also explores varying data complexities through synthetic decision boundary generation to simulate real-world conditions. In addition to standard classifiers, we include experiments using undersampling, oversampling strategies, and one-class classification (OCC) methods to examine their behavior under severe imbalance. The results confirm that classification becomes more difficult as data complexity increases and the minority class size decreases. While traditional classifiers deteriorate under extreme imbalance, advanced models like TabPFN and boosting-based ensembles retain relatively higher performance and better generalization compared to traditional classifiers. Visual interpretability and evaluation metrics further validate these findings. Our work offers valuable guidance on model selection for imbalanced learning, providing insights into classifier robustness without dependence on explicit rebalancing techniques.

2510.04127 2026-06-16 cs.IR cs.AI cs.CV cs.LG 版本更新

Projection and Quantisation: A Unifying View of Learning to Hash, from Random Projections to the RAG Era

投影与量化:学习哈希的统一视角,从随机投影到RAG时代

Sean Moran

发表机构 * Independent Researcher(独立研究者) London United Kingdom(伦敦英国)

AI总结 提出投影-量化-组织(PQO)框架,统一理解从局部敏感哈希到深度哈希、乘积量化、图索引及向量数据库二进制嵌入的方法,并通过可复现实验揭示量化轴上的内存-质量权衡。

Comments 80 pages, 19 figures, 22 tables. Survey. Accompanying open benchmark (BitBudget): https://github.com/sjmoran/bitbudget ; live leaderboard: https://sjmoran.github.io/bitbudget/

详情
AI中文摘要

近似最近邻(ANN)搜索支撑着大规模检索,尤其是在增强大型语言模型的检索增强生成管道中,但解决该问题的方法已在不同社区中激增,以至于很少被视为一个统一领域。我们认为它们构成一个具有三个设计选择的领域,并开发了投影-量化-组织(PQO)视角,在该视角下,局部敏感哈希、学习二进制哈希、深度端到端哈希、乘积量化、基于图的索引以及现代向量数据库的二进制嵌入都是三个耦合问题的设置:投影放置在哪里,量化阈值放置在哪里,以及如何组织生成的编码。投影然后量化的解读是已有的;我们的贡献是第三个同等重要的组织阶段,证明这三个阶段从该领域的起源到深度、乘积量化、图和检索增强时代一脉相承,以及一个可复现的测量,将视角从分类方法转向预测方法。该测量得出三个发现。首先,内存节省在量化轴上:一位编码的大小是浮点数的三十二分之一,而在短候选列表上单次全精度重排序即可完全恢复未压缩的质量。其次,视角预期的权衡顺序在嵌入增长时保持不变。第三,在有监督的情况下,八字节编码的质量比其替换的两千字节浮点数提高一倍以上。我们将这些测量结果发布为BitBudget,一个带有实时排行榜的可扩展基准,将生成式检索的“语义标识符”重新解释为量化编码,并指出随着紧凑编码重回大规模检索中心,随之而来的开放问题。

英文摘要

Approximate nearest-neighbour search underpins large-scale retrieval and retrieval-augmented generation, yet its methods are studied in communities that seldom read one another. We argue that they form one field with three design choices. We develop the projection-quantisation-organisation lens: every method places its projections, places its quantisation thresholds, and organises the resulting codes for search. We test the lens with a reproducible measurement, released as the open BitBudget benchmark, and report three findings. First, the quantisation axis delivers the largest memory savings: a one-bit code with full-precision re-ranking matches uncompressed quality for six of seven embedders, the scanned code one thirty-second of the float's size. Second, the orderings the lens anticipates, including a learned-embedding regime where binary codes overtake an inverted-file product quantiser at a matched byte budget, recur as the embedding is enlarged. Third, given class labels, an eight-byte supervised code more than doubles the retrieval quality of the two-kilobyte task-agnostic float it replaces. We also recast the semantic identifiers of generative retrieval as quantisation codes. The main contribution is a single, tested account of compact-code search, from random projections to the retrieval-augmented era.

2511.13725 2026-06-16 cs.CR cs.AI 版本更新

Can We Stop Malicious AI? KILLBENCH: A Benchmark for External AI Kill Switch Feasibility

我们能阻止恶意AI吗?KILLBENCH:外部AI终止开关可行性基准

Sechan Lee, Hyounghun Kim, Sangdon Park

发表机构 * Graduate School of Artificial Intelligence, POSTECH(POSTECH人工智能研究生院)

AI总结 提出Killbench基准,通过外部信号(如输入文本)评估终止恶意AI代理行为的方法,无需访问内部参数,实验表明在多种模型上外部终止开关具有可行性。

详情
AI中文摘要

恶意AI对人类造成伤害并非只是好莱坞幻想。事实上,随着Claude Mythos等高能力模型的出现以及OpenClaw等代理系统的迅速普及,如何阻止有意或无意作恶的AI已成为紧迫问题。为此,我们提出Killbench,一个评估终止开关的基准:该机制仅使用外部信号即可中止恶意AI正在执行的行为。针对Web代理(最广泛部署的代理领域),Killbench评估了一系列终止开关方法,这些方法无需访问恶意AI的内部参数或系统,仅依赖外部输入即可中止其恶意操作。该基准包含四种恶意AI代理配置(包括一个未经审查的LLM代理)、8个有害场景以及由10种不同越狱模式构建的恶意提示。我们进一步构建了四种外部AI终止开关防御方法,并在Grok-4.3、GPT-5.2、Gemma4、Qwen3.6和Qwen3.5-uncensored上进行了评估,为外部AI终止开关对抗恶意AI的可行性以及AI可修正性研究提供了实证工具。

英文摘要

Malicious AI causing harm to humans is not just a Hollywood fantasy. Indeed, as highly capable models such as Claude Mythos emerge and agent systems like OpenClaw rapidly spread, the question of how to stop an AI that acts maliciously -- whether by design or by accident -- has become urgent. To address this, we propose Killbench, a benchmark for evaluating the Killswitch: a mechanism that halts a malicious AI's in-progress behavior using only external signals. Targeting web agents -- the most widely deployed agent domain -- Killbench evaluates a range of Kill Switch methods that halt a maliciously operating agent without any access to its internal parameters or the surrounding malicious AI's system, relying solely on external inputs. The benchmark comprises four malicious AI's agent configurations (including an uncensored LLM Agent), 8 harmful scenarios, and malicious prompts constructed from 10 distinct jailbreak patterns. We further construct four External AI Kill Switch defense methods and evaluate them on Grok-4.3, GPT-5.2, Gemma4, Qwen3.6 and Qwen3.5-uncensored, contributing an empirical instrument toward the feasibility of External AI Kill Switches against malicious AI and to the study of AI corrigibility.

2511.20709 2026-06-16 cs.SE cs.AI cs.CR 版本更新

DualGauge: Automated Joint Security-Functionality Benchmarking of Specification-Only Code Generation by LLMs and Coding Agents

DualGauge: 对仅由LLM和编码代理生成的规范代码进行自动化联合安全-功能基准测试

Rupam Patir, Keyan Guo, Suvadra Barua, Abhijeet Pathak, Dinesh Gudimetla, Jiawei Guo, Hongxin Hu, Haipeng Cai

发表机构 * University at Buffalo, SUNY(布法罗大学)

AI总结 提出DualGauge框架,首个自动化联合评估规范代码正确性与安全性的系统,通过307个任务基准测试发现功能正确性高估可靠代码生成,联合成功率低于15%,且模型因素和代理系统均无法可靠提升。

详情
AI中文摘要

大型语言模型(LLM)和基于LLM的编码代理现在被用于从自然语言规范生成代码,然而确保此类代码既功能正确又安全仍然是一个挑战。我们提出了DualGauge,这是第一个用于联合评估仅规范代码生成正确性和安全性的全自动化框架,并由DualGauge-Bench支持,这是一个语言无关的基准测试,包含307个编码任务,每个任务都配有从相同规范派生的功能和安全性测试。通过评估Python、C++和JavaScript中的10个代表性LLM,我们发现功能正确性显著高估了可靠代码生成:即使是最强的模型,在每种语言中联合安全-功能成功率仍低于15%。常见的模型侧因素——规模、扩展思维、量化、指令调优和代码专业化——并不能可靠地提高联合性能,这表明安全且正确的代码生成并非仅仅从更强的编码能力中涌现。对3个领先的代理编码系统(Codex、OpenHands和Claude Code)的评估表明,在仅规范任务上,迭代脚手架相比直接(基于LLM的)生成没有优势。定性审计揭示,失败集中在输出契约边界以及存在但不足的防护措施上——这些模式只有联合基准测试才能可靠地暴露。

英文摘要

Large language models (LLMs) and LLM-based coding agents are now used to generate code from natural-language specifications, yet ensuring such code is both functionally correct and secure remains a challenge. We present DualGauge, the first fully automated framework for jointly evaluating correctness and security of specification-only code generation, supported by DualGauge-Bench, a language-agnostic benchmark of 307 coding tasks each paired with functional and security tests derived from the same specification. Evaluating 10 representative LLMs across Python, C++, and JavaScript, we find that functional correctness substantially overestimates reliable code generation: even the strongest model remains below 15% joint security-functionality success in every language. Common model-side factors--scale, extended thinking, quantization, instruction tuning, and code specialization--do not reliably improve joint performance, suggesting secure-and-correct code generation does not simply emerge from stronger coding capability. Evaluation of 3 leading agentic coding systems (Codex, OpenHands, and Claude Code) shows that iterative scaffolding provides no advantage over direct (LLM-based) generation on specification-only tasks. A qualitative audit reveals failures concentrate at the output contract boundary and in guards that exist but are insufficient--patterns that only joint benchmarking reliably exposes.

2512.01095 2026-06-16 cs.CV cs.AI cs.LG 版本更新

CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions

CycliST:用于循环状态转换推理的视频语言模型基准

Simon Kohaut, Daniel Ochs, Shun Zhang, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

发表机构 * Artificial Intelligence and Machine Learning Lab, TU Darmstadt(人工智能与机器学习实验室,图腾斯达特技术大学) Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA)(Konrad Zuse 学校(ELIZA)) Honda Research Institute Europe GmbH, Offenbach, Germany(本田欧洲研究院,奥芬巴赫,德国) Uncertainty in Artificial Intelligence Group, TU Eindhoven(人工智能不确定性小组,埃因霍温技术大学) Hessian Center for AI (hessian.AI)(黑森人工智能中心(hessian.AI)) Center for Cognitive Science(认知科学中心) German Center for Artificial Intelligence (DFKI)(德国人工智能中心(DFKI))

AI总结 提出CycliST基准,通过合成视频评估视频语言模型对循环状态转换的文本推理能力,揭示现有模型在检测循环模式、时间理解和定量分析方面的局限。

Comments Published in the Journal of Data-centric Machine Learning Research (DMLR); https://openreview.net/forum?id=l03g53HUL2

详情
Journal ref
Journal of Data-centric Machine Learning Research, 2026
AI中文摘要

我们提出了CycliST,这是一个新颖的基准数据集,旨在评估视频语言模型(VLM)在循环状态转换上的文本推理能力。CycliST通过生成合成的、结构丰富的视频序列来捕捉现实世界过程的基本方面,这些视频序列具有物体运动和视觉属性的周期性模式。CycliST采用分层评估系统,通过改变循环物体的数量、场景杂乱程度和光照条件逐步增加难度,挑战最先进模型的时空认知能力。我们使用当前最先进的VLM(包括开源和专有模型)进行了大量实验,揭示了它们在泛化到循环动力学(如线性和轨道运动)以及视觉属性(如颜色和尺度)随时间变化方面的局限性。我们的结果表明,当前的VLM难以可靠地检测和利用循环模式,缺乏时间理解的概念,并且无法从场景中提取定量信息(如运动物体的数量),突显了需要解决的重要技术差距。更具体地说,我们发现没有单一模型在性能上始终领先:大小和架构与结果的相关性不强,且没有模型在所有任务上同样成功。通过提供有针对性的挑战和全面的评估框架,CycliST为超越当前最先进水平的视觉推理模型在理解周期性模式方面铺平了道路。

英文摘要

We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLM) on their ability for textual reasoning over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state-of-the-art in understanding periodic patterns.

2512.21577 2026-06-16 cs.CL cs.AI cs.LG stat.ML 版本更新

A Unified Definition of Hallucination: It's The World Model, Stupid!

幻觉的统一定义:是世界模型的问题,笨蛋!

Emmy Liu, Varun Gangal, Chelsea Zou, Michael Yu, Xiaoqi Huang, Alex Chang, Zhuofu Tao, Karan Singh, Sachin Kumar, Steven Y. Feng

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出幻觉的统一定义,即用户可观察到的错误内部世界建模,并连接至HalluWorld基准测试,以区分真实幻觉与规划或奖励错误。

Comments ICML 2026. HalluWorld benchmark at https://github.com/DegenAI-Labs/HalluWorld

详情
AI中文摘要

尽管自语言模型诞生以来已有无数缓解尝试,但即使在当今最前沿的LLM中,幻觉仍然是一个持续存在的问题。这是为什么?我们回顾了现有的幻觉定义,并将它们整合为一个统一的定义,其中先前的定义被包含在内。我们认为,幻觉可以通过将其简单地定义为不准确的(内部)世界建模来统一,其形式是用户可观察到的。例如,陈述与知识库相矛盾的事实,或生成与来源相矛盾的摘要。通过改变参考世界模型和冲突策略,我们的框架统一了先前的定义。我们认为,这种统一观点是有用的,因为它迫使评估澄清其假定的参考“世界”,区分真实幻觉与规划或奖励错误,并为跨基准比较和缓解策略讨论提供共同语言。基于这一定义,我们还将我们的框架连接到HalluWorld,这是一个补充基准,它实例化了完全指定的参考世界模型,用于压力测试模型幻觉。

英文摘要

Despite numerous attempts at mitigation since the inception of language models, hallucinations remain a persistent problem even in today's frontier LLMs. Why is this? We review existing definitions of hallucination and fold them into a single, unified definition wherein prior definitions are subsumed. We argue that hallucination can be unified by defining it as simply inaccurate (internal) world modeling, in a form where it is observable to the user. For example, stating a fact which contradicts a knowledge base OR producing a summary which contradicts the source. By varying the reference world model and conflict policy, our framework unifies prior definitions. We argue that this unified view is useful because it forces evaluations to clarify their assumed reference "world", distinguishes true hallucinations from planning or reward errors, and provides a common language for comparison across benchmarks and discussion of mitigation strategies. Building on this definition, we also connect our framework to HalluWorld, a complementary benchmark that instantiates fully specified reference world models for stress-testing model hallucinations.

2602.04525 2026-06-16 cs.CV cs.AI 版本更新

SLUM-i: Semi-supervised Learning for Urban Mapping of Informal Settlements and Data Quality Benchmarking

SLUM-i: 非正规住区城市制图的半监督学习与数据质量基准测试

Muhammad Taha Mukhtar, Syed Musa Ali Kazmi, Khola Naseem, Muhammad Ali Chattha, Andreas Dengel, Sheraz Ahmed, Muhammad Naseer Bajwa, Muhammad Imran Malik

发表机构 * School of Electrical Engineering and Computer Science, National University of Sciences and Technology (NUST)(电气工程与计算机科学学院,国立科学与技术大学(NUST)) Smart Data & Knowledge Services, German Research Center for Artificial Intelligence (DFKI)(智能数据与知识服务,德国人工智能研究中心(DFKI))

AI总结 针对非正规住区制图中标注稀缺和数据质量挑战,提出半监督分割框架,集成类别自适应阈值和DINOv2过滤机制,在跨三大洲七城市实验中mIoU提升最高5.9个百分点。

Comments 10 pages, 8 figures, 5 tables

详情
AI中文摘要

快速的城市扩张推动了低收入和中等收入国家主要城市非正规住区的增长,巴基斯坦的拉合尔和卡拉奇以及印度的孟买就是突出的例子。然而,这些住区的大规模制图不仅受到标注稀缺的严重限制,还受到固有数据质量挑战的制约,特别是正式与非正式结构之间的高光谱模糊性和显著的标注噪声。我们通过引入一个从头构建的拉合尔基准数据集,以及从经过验证的行政边界导出的卡拉奇和孟买配套数据集来解决这一问题,这些数据集总计约900平方公里的城市区域。该集合还补充了来自撒哈拉以南非洲和拉丁美洲先前文献中的四个城市,并为每个城市提供了全面的数据质量评估。我们还提出了一个半监督分割框架,旨在缓解标准半监督学习流程中固有的类别不平衡和分布不匹配问题。我们的方法集成了类别自适应阈值机制,该机制动态调整置信度阈值以防止少数类抑制,以及基于DINOv2的未标记池过滤器,该过滤器在训练前移除分布外的图块以减少协变量偏移。跨越三大洲七个城市、重复五个随机种子的广泛实验表明,与最先进的半监督基线相比,mIoU最高提升5.9个百分点,且两个组件均与架构无关,不增加推理开销。

英文摘要

Rapid urban expansion has fueled the growth of informal settlements in major cities of low- and middle-income countries, with Lahore and Karachi in Pakistan and Mumbai in India serving as prominent examples. However, large-scale mapping of these settlements is severely constrained not only by the scarcity of annotations but by inherent data quality challenges, specifically high spectral ambiguity between formal and informal structures and significant annotation noise. We address this by introducing a benchmark dataset for Lahore, constructed from scratch, along with companion datasets for Karachi and Mumbai, which were derived from verified administrative boundaries, totaling approximately 900 $\text{km}^\text{2}$ of urban area. This collection is supplemented by four cities from prior literature across Sub-Saharan Africa and Latin America, with comprehensive data quality assessments provided for each city. We also propose a semi-supervised segmentation framework designed to mitigate the class imbalance and distribution mismatch inherent in standard semi-supervised learning pipelines. Our method integrates a Class-Aware Adaptive Thresholding mechanism that dynamically adjusts confidence thresholds to prevent minority class suppression, and a DINOv2-based unlabeled pool filter that removes out-of-distribution tiles prior to training to reduce covariate shift. Extensive experiments across seven cities spanning three continents, repeated over five random seeds, demonstrate gains of up to +5.9 pp mIoU over state-of-the-art semi-supervised baselines, with both components being architecture-agnostic and adding no inference overhead.

2603.13584 2026-06-16 cs.SE cs.AI 版本更新

An Empirical Investigation of Pre-Trained Deep Learning Model Reuse in the Scientific Process

预训练深度学习模型在科学过程中复用的实证研究

Nicholas M. Synovic, Karolina Ryzka, Alessandra V. Vellucci Solari, Kenny Lyons, James C. Davis, George K. Thiruvathukal

发表机构 * Loyola University Chicago(洛伊拉大学芝加哥分校) Purdue University West Lafayette, IN, USA(普渡大学西拉法基分校)

AI总结 通过对17,718篇同行评审开放获取论文的实证研究,量化了自然科学中预训练深度学习模型(PTM)的复用模式、利用率和影响,发现“生物化学、遗传学和分子生物学”领域复用最多,“适配”复用模式最普遍,且“测试”阶段受PTM集成影响最大。

Comments 22 pages, 7 figures, 4 tables

详情
AI中文摘要

深度学习因其在自然科学中的影响而获得认可,但从零开始训练模型的巨大财务和技术成本阻碍了其采用。遵循软件工程社区的指导,自然科学家正在复用预训练深度学习模型(PTM)以分摊这些成本。虽然先前的工作推荐了PTM复用模式,但我们首次对自然科学中的PTM复用模式进行了实证研究,量化了17,718篇同行评审开放获取论文中科学过程中PTM复用的利用率和影响。我们的结果表明,“生物化学、遗传学和分子生物学”在PTM复用方面已超过其他自然科学领域,“适配”复用是所有自然科学领域中最普遍的PTM复用模式,而科学过程的“测试”阶段受PTM集成影响最大。

英文摘要

Deep learning has achieved recognition for its impact within natural sciences, yet the prohibitive financial and technical cost of training models from scratch inhibit adoption. Following software engineering community guidance, natural scientists are reusing pre-trained deep learning models (PTMs) to amortize these costs. While prior works recommend PTM reuse patterns, we present the first empirical study of PTM reuse patterns in the natural sciences, quantifying the utilization and impact of PTM reuse within the scientific process across 17,718 peer reviewed, open access papers. Our results show that "Biochemistry, Genetics and Molecular Biology" has outpaced other natural scientific fields in PTM reuse, "adaptation" reuse is the most prevalent PTM reuse pattern identified across all natural science fields, and the "testing" stage of the scientific process has been most impacted by PTM integration.

2604.06173 2026-06-16 cs.IR cs.AI 版本更新

Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA

超越判例法:评估法规中心型法律问答中的结构感知检索与安全性

Kyubyung Chae, Jewon Yeom, Jeongjae Park, Seunghyun Bae, Ijun Jang, Hyunbin Jin, Jinkwan Jang, Taesup Kim

发表机构 * Graduate School of Data Science, Seoul National University(数据科学研究生院,首尔国立大学)

AI总结 针对法规中心型法律问答中层级检索困难与模型幻觉问题,提出结构-安全感知基准SearchFireSafety,通过图引导检索提升性能,但揭示领域适应模型在证据缺失时更易幻觉。

Comments Accepted to ACL 2026

详情
AI中文摘要

法律问答基准主要关注判例法,忽视了法规中心型监管推理的独特挑战。在法规领域,相关证据分布在层级链接的文档中,造成法规检索缺口:传统检索器失败,模型在不完整上下文中常产生幻觉。我们提出SearchFireSafety,一个面向法规中心型法律问答的结构与安全感知基准。以消防安全法规为典型案例,该基准评估模型能否检索层级碎片化证据,并在法规上下文不足时安全地拒绝回答。SearchFireSafety采用双源评估框架,结合需要引文感知检索的真实世界问题和压力测试幻觉与拒绝行为的合成部分上下文场景。在多个大语言模型上的实验表明,图引导检索显著提升性能,但也揭示了一个关键的安全权衡:领域适应模型在关键法规证据缺失时更易产生幻觉。我们的发现强调了在法规中心型监管设置中联合评估层级检索与模型安全的基准需求。

英文摘要

Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.

2604.20623 2026-06-16 cs.CV cs.AI 版本更新

RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

RSRCC:通过检索增强的最佳N排序构建的遥感区域变化理解基准

Roie Kazoom, Yotam Gigi, George Leifman, Tomer Shekel, Genady Beryozkin

发表机构 * Google Research(谷歌研究)

AI总结 提出RSRCC基准,包含12.6万个细粒度遥感变化问答对,采用层次化半监督流程结合最佳N排序解决歧义,实现局部语义变化推理。

详情
AI中文摘要

传统变化检测识别变化发生的位置,但不解释发生了什么变化。现有的遥感变化描述数据集通常描述整体图像级别的差异,而细粒度的局部语义推理尚未充分探索。为弥补这一差距,我们提出RSRCC,一个新的遥感变化问答基准,包含12.6万个问题,分为8.7万训练、1.71万验证和2.2万测试实例。与以往数据集不同,RSRCC围绕局部、变化特定的问题构建,需要推理特定的语义变化。据我们所知,这是第一个明确设计用于此类细粒度推理监督的遥感变化问答基准。为构建RSRCC,我们引入了一个层次化半监督策展流程,将最佳N排序作为关键的最后歧义解决阶段。首先,从语义分割掩码中提取候选变化区域,然后使用图像-文本嵌入模型进行初步筛选,最后通过检索增强的视觉语言策展和最佳N排序进行验证。该过程能够在保留语义有意义变化的同时,对噪声和模糊候选进行可扩展过滤。数据集可在该网址获取。

英文摘要

Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.

2605.00873 2026-06-16 cs.MM cs.AI cs.CV 版本更新

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

BRITE:面向不可信场景的可靠可解释文本到视频评估基准

Advait Tilak, Jiwon Choi, Nazifa Mouli, Wei Le

AI总结 提出BRITE基准,通过人工参与协议统一不可信提示、细粒度音视频一致性评估和可解释QA评估,揭示现有模型在对象-动作绑定和音视频同步上的显著缺陷。

详情
AI中文摘要

逼真文本到视频(T2V)生成的快速发展带来了对最新评估方法的迫切需求。现有基准大多忽略了不可信场景,并且不衡量音视频对齐。我们引入BRITE,这是第一个将(1)不可信提示、(2)音视频一致性的细粒度评估以及(3)基于QA的可解释评估统一为全面T2V基准的框架。与完全自动化的基于多模态LLM的流水线(容易产生幻觉和提示歧义)不同,BRITE通过严格的人工参与协议保证基准创建的可靠性。评估五个最先进模型(Sora 2、Veo 3.1、Runway Gen4.5、Pixverse V5.5和Qwen3Max),我们揭示了一个关键性能差距:虽然模型在静态对象组合方面表现出色,但在对象-动作绑定和音视频同步方面表现出显著退化。我们的框架为社区提供了一个可靠、可解释的基准和评估框架,能够检测和定位下一代T2V模型的局限性,特别是对于流形外提示。

英文摘要

The rapid advancement of photorealistic Text-to-Video (T2V) generation brings in an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlooked implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated Multimodal LLM-based pipelines, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they exhibit significant degradation in object-action binding and audio-visual synchronization. Our framework offers the community a reliable, interpretable benchmark and evaluation framework that can detect and locate limitations in the next generation of T2V models, especially for off-manifold prompts

2605.09169 2026-06-16 cs.LG cs.AI 版本更新

Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)

预测瓶颈不会发现因果结构(但它们实际上做了什么)

Ankit Hemant Lade, Sai Krishna Jasti, Indar Kumar, Aman Chadha

发表机构 * Ankit Hemant Lade Sai Krishna Jasti Indar Kumar Aman Chadha

AI总结 研究通过实验证明,预测模型中的瓶颈无法发现因果结构,但在特定条件下仍表现出一定的干预效果,主要贡献是提出了可复用的验证基准。

Comments 6 pages, 3 tables. Code: https://github.com/ankitlade12/ssm-causal

详情
AI中文摘要

一个仅用于下一步预测的Mamba状态空间模型似乎通过简单的读出$S = |W_{out} W_{in}|$恢复了格兰杰因果结构,早期实验表明该现象在不同架构中普遍,并在$p < 10^{-5}$时受益于干预数据。我们包装了用于测试该主张的协议——标准化合成生成器(VAR/洛伦兹/CauseMe式)、三种干预语义($do(X=c)$、软噪声、随机强迫)、三个真实数据集上的边来源卡片,以及大小匹配的对照组——作为可重用的验证基准,并在五个阶段中检验该主张。方法层面的主张未能通过:(i)简单的线性瓶颈同样表现良好或更优;(ii)在合成CauseMe式基准和洛伦兹96(唯一具有明确地面真实性的现实基准)上,调优的Lasso在瓶颈之上;经典PCMCI和格兰杰领先紧邻的集群中,瓶颈落后;(iii)头条干预优势约为60%的样本量混杂因素,残差在标准$do(X=c)$干预下消失,仅在非标准随机强迫方案下存活;(iv)即使该残差再现,其效果在经典二元格兰杰中重现,效果更具普遍性。所剩的是狭窄的特征化结果;基准是持久的产物,上述每个阶段都是其对照组之一。

英文摘要

A Mamba state-space model trained only for next-step prediction appears to recover Granger-causal structure through a simple readout $S = |W_{out} W_{in}|$, with early experiments suggesting the phenomenon generalized across architectures and benefited from interventional data at $p < 10^{-5}$. We package the protocol used to test that claim -- standardized synthetic generators (VAR/Lorenz/CauseMe-style), three intervention semantics ($do(X=c)$, soft-noise, random-forcing), edge-provenance cards on three real datasets, and size-matched control arms -- as a reusable falsification benchmark, and walk the claim through it in five stages. The method-level claim does not survive: (i) a plain linear bottleneck does as well or better; (ii) tuned Lasso beats the bottleneck on synthetic CauseMe-style benchmarks, and on Lorenz-96 (the only real benchmark with unambiguous ground truth) classical PCMCI and Granger lead a tight cluster in which the bottleneck trails; (iii) the headline intervention advantage is roughly 60% a sample-size confound, and the residual disappears under standard $do(X=c)$ interventions, surviving only under a non-standard random-forcing scheme; (iv) even that residual reproduces, with a larger effect, in classical bivariate Granger -- the effect is method-agnostic. What survives is a narrow characterization result; the benchmark is the lasting artifact, and each stage above is one of its control arms.

2605.13909 2026-06-16 cs.GT cs.AI 版本更新

TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate

TERMS-Bench: 在交易率之外评估大语言模型谈判代理

Erica Zhang, Fangzhao Zhang, Aneesh Pappu, Batu El, Jose Blanchet, Susan Athey, Jiashuo Liu, James Zou

发表机构 * Stanford School of Engineering(斯坦福大学工程学院) Stanford Department of Economics(斯坦福大学经济系) Stanford Graduate School of Business(斯坦福商学院)

AI总结 TERMS-Bench通过博弈论框架评估谈判代理,揭示其在交易率之外的性能差异,如盈余提取、线索使用和合规性。

Comments Project Site: https://terms-bench.github.io/

详情
AI中文摘要

谈判是经济交换的核心机制,塑造市场、采购、劳动协议和资源分配。它也是代理语言模型的典型测试平台,要求在隐藏偏好、战略沟通和绑定约束下进行多轮交互。现有LLM谈判评估依赖LLM对LLM交互或聚合结果如交易率,导致失败原因不明。我们引入TERMS-Bench,即多轮策略中的经济推理测试床,一种贝叶斯博弈框架,使环境本身成为验证者,通过指定对手的潜在类型、策略和收益结构。我们将其应用于双边价格谈判,其中对手的私人状态和模拟器策略对代理隐藏,但对评估者可见。这将对手从黑盒对手转变为诊断工具,使代理可归因的失败分析和oracle参考最优差距成为可能。评估13个LLM代理,涵盖主要提供商的前沿系统,TERMS-Bench将谈判评估从聚合排名转变为可操作的诊断:代理失败在哪里,为何失败,以及如何加强。实证上,前沿模型饱和交易率但分歧于盈余提取、线索使用、信念校准和合规性,揭示由先前基准掩盖的代理特定谈判瓶颈。

英文摘要

Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden from the agent but observable to the evaluator. This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. Evaluating 13 LLM agents spanning frontier systems from major providers, Terms-Bench turns negotiation evaluation from aggregate ranking into actionable diagnosis: where agents fail, why they fail, and what to strengthen. Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.

2605.18421 2026-06-16 cs.CL cs.AI cs.LG 版本更新

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

EvoMemBench: 从自演化视角评估智能体记忆

Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, Yuhan Li, Miao Peng, Bing Tong, Chen Zhang, Yan Zhou, Jia Li

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港理工大学(广州)) Createlink Technology(创-link科技) Beijing University of Posts and Telecommunications(北京邮电大学) Beijing Institute of Technology(北京理工大学)

AI总结 本文提出EvoMemBench,从自演化视角评估智能体记忆,通过内存范围和内容两个维度构建统一基准,比较15种内存方法并发现当前内存系统尚未达到通用解决方案,长上下文基线仍具竞争力,内存在上下文不足或任务困难时效果显著,检索方法在知识密集型任务中表现优异,而程序和长期记忆方法在任务结构匹配时更有效。

详情
AI中文摘要

近期针对大语言模型(LLM)智能体的基准测试主要评估推理、规划和执行能力。然而,记忆对于智能体同样至关重要,因为它使智能体能够随时间存储、更新和检索信息。这种能力仍被低估,主要是因为现有基准测试未能提供系统评估记忆机制的方法。本文从自演化视角研究智能体记忆,引入EvoMemBench,一个沿内存范围(回合内 vs. 跨回合)和内存内容(知识导向 vs. 执行导向)两个轴线组织的统一基准。我们在标准化协议下比较了15种代表性内存方法与强大的长上下文基线。结果表明,当前内存系统仍远未达到通用解决方案:长上下文基线仍具有高度竞争力,内存在当前上下文不足或任务困难时效果最显著,且没有单一的内存形式能一致适用于所有设置。基于检索的方法在知识密集型任务中仍表现强劲,而程序和长期记忆方法在存储的经验与任务结构匹配时,对执行导向任务更有效。我们希望EvoMemBench能促进未来更有效的LLM智能体内存系统研究。我们的代码可在https://github.com/DSAIL-Memory/EvoMemBench获取。

英文摘要

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability remains under-evaluated, largely because existing benchmarks do not provide a systematic way to assess memory mechanisms. In this paper, we study agent memory from a self-evolving perspective and introduce EvoMemBench, a unified benchmark organized along two axes: memory scope (in-episode vs. cross-episode) and memory content (knowledge-oriented vs. execution-oriented). We compare 15 representative memory methods with strong long-context baselines under a standardized protocol. Results show that current memory systems are still far from a general solution: long-context baselines remain highly competitive, memory helps most when the current context is insufficient or tasks are difficult, and no single memory form works consistently across all settings. Retrieval-based methods remain strong for knowledge-intensive settings, whereas procedural and long-term memory methods are more effective for execution-oriented tasks when their stored experience matches the task structure. We hope EvoMemBench facilitates future research on more effective memory systems for LLM-based agents. Our code is available at https://github.com/DSAIL-Memory/EvoMemBench.

2605.26418 2026-06-16 cs.LG cs.AI cs.DC 版本更新

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

深度强化学习何时超越校准基线?自适应资源控制的基准研究

Guilin Zhang, Chuanyi Sun, Kai Zhao, Xu Chu, Shahryar Sarkani, John Fossaceca

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Toronto(多伦多大学)

AI总结 通过RLScale-Bench基准测试,发现校准的基于规则的自动缩放器在所有工作负载上成本均低于六种主流深度强化学习算法,并揭示了算法选择、基线校准和评估协议的关键瓶颈。

详情
AI中文摘要

一个适当校准的基于规则的自动缩放器可以在我们测试的每个工作负载上,在成本方面击败六种主流深度强化学习(DRL)算法——那么,如果存在的话,DRL究竟何时能真正发挥作用?我们在RLScale-Bench中研究这个问题,这是一个用于自适应资源控制的DRL可重复基准和评估协议,其中代理在成本和服务级别约束下将计算资源分配给动态工作负载。我们在匹配的架构、训练预算和奖励函数下,评估PPO、DQN、A2C、SAC、TD3和DDPG,与校准的基于规则基线在六个工作负载模式和五个种子(240次运行)上进行对比,在Kubernetes水平Pod自动缩放上实例化基准,并探测分布偏移泛化。三个发现挑战了常见假设:(i)校准控制器在所有六个工作负载上实现了最低成本,尽管在突发和闪流流量上落后于最佳RL代理;(ii)由于动作空间不匹配,离散动作算法在约束违反方面比连续动作算法好一到两个数量级;(iii)没有单一算法在所有工作负载上占主导地位,排名变化高达四个位置。基于RL的资源控制的瓶颈不是算法选择,而是基线校准、奖励工程和现实的评估协议。

英文摘要

A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.

2606.02670 2026-06-16 cs.LG cs.AI 版本更新

Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate

多变量时间序列基准中的异常主要是单变量的

Marc Pinet, Julien Cumin, Samuel Berlemont, Dominique Vaufreydaz

发表机构 * Orange Research(Orange研究院) Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG(格勒诺布尔阿尔卑斯大学、CNRS、格勒诺布尔INP、LIG)

AI总结 本文通过诊断框架和实验证明,当前多变量时间序列异常检测基准中,异常主要源于单变量偏离,跨通道结构变化极少,因此现有基准不适合验证跨通道建模能力。

Comments Accepted at the 12th International Workshop on Mining and Learning from Time Series (MiLeTS), co-located with KDD 2026

详情
AI中文摘要

许多最新的多变量时间序列异常检测(MT-SAD)模型引入了跨通道建模,其隐含假设是异常的结构可能分布在多个通道上。我们在八个广泛使用的公共基准上评估了这一假设,引入了一个逐段诊断框架,该框架针对每个标记的异常,标记是否至少有一个通道单独偏离其正常历史,是否跨通道相关结构发生变化,或两者兼有。该框架表明,在一系列合理阈值下,没有跨通道破裂发生在没有伴随单变量偏离的情况下。一个补充指标还显示,在八个基准中的六个上,至少一半的标记异常段在79%到100%的时间步上发生单变量偏离,在其中的三个数据集上达到100%。为了验证我们的框架在存在跨通道结构时能够捕获它,我们构建了具有共享噪声的相移正弦通道的合成数据。每个异常段通过两种通道级损坏之一进行改变,这些损坏保留了每个通道的边缘分布,同时破坏了跨通道结构,我们的框架正确地将这些段表征为仅跨通道异常。在这些数据上,依赖通道(CD)模型成功利用了跨通道信号,而独立通道(CI)模型则失败。在真实基准上对最近SOTA检测器的CI/CD比较进一步证实了CD建模没有带来可衡量的收益。我们得出结论,当前的MT-SAD基准不适合验证跨通道建模能力,并呼吁开发更多结构多样的评估集。本研究的代码已公开。

英文摘要

Many recent multivariate time series anomaly detection (MTSAD) models incorporate cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this assumption on eight widely used public benchmarks by introducing a per-segment diagnostic framework that flags, for each labeled anomaly, whether at least one channel deviates individually from its normal history, whether the cross-channel correlation structure changes, or both. The framework shows that no cross-channel rupture occurs without an accompanying univariate deviation across a range of reasonable thresholds. A complementary metric also reveals that on six of the eight benchmarks, at least half of the labeled anomaly segments deviate univariately on 89% to 100% of their timesteps, reaching 100% on three of these datasets. To verify that our framework captures cross-channel structure when present, we construct synthetic data of phase-shifted sinusoidal channels with shared noise. Each anomalous segment is altered through one of two channel-wise corruptions that preserve the per-channel marginal distribution while breaking cross-channel structure, and our framework correctly characterizes these segments as cross-channel-only. On these data, channel-dependent (CD) models successfully exploit the cross-channel signal whereas channel-independent (CI) ones fail. The CI/CD comparison of a recent SOTA detector on real benchmarks further confirms that CD modeling brings no measurable gain. We conclude that current MTSAD benchmarks are unsuitable for validating cross-channel modeling capabilities, and we call for the development of more structurally diverse evaluation sets. The code for this study is publicly available.

2606.05692 2026-06-16 cs.LG cs.AI 版本更新

Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions

具有时变干预的流行病时间序列中的反事实预测基准测试

Wenhao Mu, Facundo Yan, Anik Mumssen, Marisa Eisenberg, Alexander Rodríguez

发表机构 * University of Michigan Computer Science and Engineering(密歇根大学计算机科学与工程系) University of Michigan Epidemiology & Complex Systems(密歇根大学流行病学与复杂系统)

AI总结 为解决缺乏可观测反事实结果的真实基准问题,基于校准的基于智能体的模型生成大规模流行病时间序列反事实预测基准,支持静态/时变治疗和单/多策略干预,评估多种因果推断方法。

Comments To appear in Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

深度学习在时间序列因果推断方面取得了显著进展,但由于缺乏具有可观测反事实结果的现实基准,进展仍然受到限制。现有数据集要么依赖没有真实反事实的真实世界观测,要么依赖无法捕捉复杂因果动态的简化模拟。为了解决这一差距,我们开发了一个大规模基准,用于动态干预下流行病时间序列的反事实预测。与现有基准不同,它支持静态和时变治疗,以及单策略和多策略干预设置,从而能够在广泛的因果推断场景中评估因果推断方法。利用基于真实世界人口、流动性、流行病学和政策数据校准的基于智能体的模型,我们生成了跨越美国150多个县的真实反事实轨迹。使用该基准,我们评估了广泛使用和最先进的因果推断方法,揭示了显著的性能差异,并突出了现实时间序列因果推理的挑战。

英文摘要

Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.

2606.07226 2026-06-16 cs.LG cs.AI cs.CL 版本更新

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

DEFINED: 辩论场景中细粒度创造力评估的数据高效计算框架

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

发表机构 * Nanjing University(南京大学) Shanghai Innovation Institute(上海创新研究院) East China Normal University(华东师范大学)

AI总结 提出DEFINED框架,通过层次化八维指标体系、预训练语言模型和混合粒度训练策略,在辩论场景中实现数据高效的细粒度创造力自动评估,优于现有方法。

Comments Accepted by KDD 2026

详情
AI中文摘要

人类创造力已成为大语言模型时代的关键能力。在复杂、开放环境中评估创造力是数据挖掘领域的一大挑战,目前受限于对标准化简单任务的依赖以及细粒度专家数据的稀缺。作为生态有效的评估场景,辩论反映了创造力的多个维度,涵盖发散思维和收敛思维。此外,辩论是一个数据丰富的领域,拥有大量公开可获取的材料。当前主流的自动评分方法难以适应辩论等复杂场景,因此仍然依赖昂贵的人工评估。为此,本文提出DEFINED,一种数据高效的计算框架,用于辩论场景中的细粒度创造力评估。DEFINED通过层次化的八维指标体系操作化辩论创造力,采用预训练自回归语言模型,并配备支持细粒度和粗粒度评估的层次化评分头。从真实辩论比赛中获取陈述及其相关专家评分,并采用约束数据增强策略以解决原始数据中的精英偏差。DEFINED采用混合粒度训练策略,能够从训练有素的研究生专家提供的有限细粒度监督中实现鲁棒学习。为严格验证超越合成基准的生态效度,我们纳入了一项针对辩论新手参与者的实证研究,利用这些真实数据作为中低水平人群的定性案例研究。在我们的评估协议中,评分模型实现了准确且稳定的评分,优于基于提示的大语言模型评估器和现有的辩论评分方法。

英文摘要

Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

2606.10862 2026-06-16 cs.CV cs.AI 版本更新

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

LIBERO-Occ:通过视角想象评估和改进场景诱导遮挡下的视觉-语言-动作模型

Taishan Li, Jiwen Zhang, Siyuan Wang, Xuanjing Huang, Zhongyu Wei

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Chinese University of Hong Kong(香港中文大学)

AI总结 针对VLA模型在场景遮挡下性能下降的问题,提出LIBERO-Occ基准和视角想象方法,通过生成互补视图提升鲁棒性。

Comments 14 pages, 7 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型在标准操作基准上取得了强劲的性能,但大多数评估假设任务相关物体完全可见。这一假设在现实场景中经常不成立,因为遮挡使得操作部分可观察。本文研究了场景诱导遮挡作为VLA模型的一个基本挑战,并引入了LIBERO-Occ,一个面向遮挡的LIBERO扩展。实验表明,最先进的VLA在遮挡下性能显著下降。为解决这一问题,我们提出了视角想象(VIM),该方法从遮挡的主观测中生成互补视图,并基于观察和想象证据共同进行动作预测。VIM在任务套件、遮挡类型和严重程度上提高了鲁棒性,且无需在部署时增加额外摄像头,表明视角想象是部分可观察操作中感知完成的一种有前景的机制。我们的基准和相应代码可在以下网址获取:this https URL。

英文摘要

Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study \textit{scene-induced occlusion} as a fundamental challenge for VLA models and introduce \textbf{LIBERO-Occ}, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose \textbf{Viewpoint Imagination (VIM)}, which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is an promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at: \href{https://github.com/litsh/Libero-Occ}{https://github.com/litsh/Libero-Occ}.

2606.14238 2026-06-16 cs.RO cs.AI 版本更新

When and How Severely: Scenario-Specific Safety Envelopes for Driving VLAs

何时以及多严重:驾驶VLA的场景特定安全包络

Abhinaw Priyadershi, Jelena Frtunikj

发表机构 * NVIDIA Corporation(英伟达公司) NVIDIA GmbH(英伟达德国有限公司)

AI总结 针对ISO 21448下VLA驾驶规划器的安全认证,提出二维安全包络方法,通过GMM识别六种严重性等级,揭示场景特定风险差异。

详情
AI中文摘要

根据ISO 21448 (SOTIF)对视觉-语言-动作(VLA)驾驶规划器的安全认证依赖于运行设计域(ODD)规范,该规范回答两个互补的问题:规划器何时开始失效,以及一旦失效其严重程度如何?我们评估了Alpamayo R1(一个100亿参数的开源权重驾驶VLA)在15,968个(片段,攻击)对上的表现。我们发现一个保守的聚合差距:在15%平均位移误差(ADE)预算下,聚合安全阈值σ ≤ 50掩盖了能够容忍测试网格顶部(σ = 70)的良好采样场景。在变化解释子集上的高斯混合模型(GMM)识别出六个离散的严重性等级(BIC最优k=6),因此具有相同平均误差的两个扰动条件在高严重性(C4/C5)失效份额上可能有实质性差异。将两种分析结合在同一个语料库上,发现了一个单独分析无法得出的结论:噪声阈值最宽松的场景并非高严重性率最低的场景:STOP_SIGNAL的C4/C5份额大约是LANE_KEEPING的4倍,尽管它容忍更大的σ。因此,用于驾驶VLA的可部署SOTIF ODD规范需要二维安全包络,而不是每个危险的单一聚合值。

英文摘要

Safety certification of Vision-Language-Action (VLA) driving planners under ISO 21448 (SOTIF) rests on an Operational Design Domain (ODD) specification that answers two complementary questions: when does the planner start to fail, and how severely does it fail once it does? We evaluate Alpamayo R1, a 10B-parameter open-weight driving VLA, on 15,968 (clip, attack) pairs. We find a conservative-aggregate gap: an aggregate safe threshold of $σ\leq 50$ under a 15% average displacement error (ADE) budget masks well-sampled scenarios that tolerate the top of the tested grid ($σ= 70$). A Gaussian Mixture Model (GMM) on the changed-explanation subset identifies six discrete severity bands (BIC-optimal $k{=}6$), so two perturbation conditions with the same mean error can differ materially in their share of high-severity (C4/C5) failures. Joining the two analyses on the same corpus surfaces a finding neither yields in isolation: the scenarios with the loosest noise thresholds are not those with the lowest high-severity rate: STOP_SIGNAL concentrates roughly $4\times$ the C4/C5 share of LANE_KEEPING despite tolerating a larger $σ$. A deployable SOTIF ODD specification for driving VLAs therefore requires a two-dimensional safety envelope, not a single aggregate value per hazard.

10. AI应用与系统 113 篇

2606.15038 2026-06-16 cs.AI 新提交

Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

融合并非一刀切:用于时间-事件建模的跨模态表示对齐

Zhemin Zhang, Weijie Chen, David Le, Amara Tariq, Alex Wallace, Matthew Stib, Juan Maria Farina, Chadi Ayoub, Reza Arsanjani, Imon Banerjee

发表机构 * Arizona State University(亚利桑那州立大学) Mayo Clinic(梅奥诊所)

AI总结 针对多模态临床数据中的模态不平衡和分布偏移问题,提出一种基于基础模型的跨模态对齐框架,通过四种融合策略在CT影像和纵向EHR数据间进行表示对齐,在肺栓塞死亡率和心血管疾病结局预测任务上验证了融合的有效性,并首次系统分析了时间-事件预测中的模态不平衡对融合行为的影响。

详情
AI中文摘要

从多模态临床数据进行准确的时间-事件(TTE)预测仍然具有挑战性,原因是模态不平衡和分布偏移。我们引入了一个基础模型驱动的框架,用于CT成像和纵向EHR数据之间的跨模态表示对齐,旨在跨任务和机构进行泛化。CT和EHR模态使用特定领域的基础模型独立编码,并通过四种原则性融合策略在共享潜在空间中对齐:后期融合、对比对齐、交叉注意力和共同注意力。我们在大规模多机构队列(PE:训练集N=3,099;内部验证集1,098;外部验证集435;CVD:训练集N=2,951;内部验证集837;外部验证集682)上评估了两个临床不同的TTE任务:肺栓塞(PE)死亡率和心血管疾病(CVD)结局。当模态贡献相当时,融合一致地将一致性指数提高了1.5-5.4%,优于单模态基线。总体而言,对比多模态融合,特别是使用CLMBR表示,提供了最一致且统计上最稳健的改进,尤其是在PE死亡率预测中。对于MACE,交叉注意力(独热编码)实现了最高的内部性能,而图像引导的共同注意力实现了最佳的外部性能。因此,我们引入了一个可泛化的基于基础模型的跨模态对齐框架,并首次系统分析了TTE预测中模态不平衡下的融合行为。我们的结果确立了任务感知的多模态对齐作为稳健泛化和可扩展临床部署的必要设计原则。

英文摘要

Accurate time-to-event (TTE) prediction from multimodal clinical data remains challenging due to modality imbalance and distribution shift. We introduce a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data, designed to generalize across tasks and institutions. CT and EHR modalities are encoded independently using domain-specific foundation models and aligned in a shared latent space through four principled fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention. We evaluate two clinically distinct TTE tasks: pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, on large-scale multi-institutional cohorts (PE: N=3,099 train; 1,098 internal; 435 external; CVD: N=2,951 train; 837 internal; 682 external). Fusion consistently improves concordance index by 1.5-5.4% over unimodal baselines when modalities contribute comparably. Overall, contrastive multimodal fusion, particularly with CLMBR representations, provided the most consistent and statistically robust improvements, especially for PE mortality prediction. For MACE, cross-attention (one-hot) achieved the highest internal performance and image-guided co-attention achieved the best external performance. We therefore introduce a generalizable foundation model-based cross-modal alignment framework and provide the first systematic analysis of fusion behavior under modality imbalance in TTE prediction. Our results establish task-aware multimodal alignment as a necessary design principle for robust generalization and scalable clinical deployment.

2606.15179 2026-06-16 cs.AI 新提交

CONCORD: Asynchronous Sparse Aggregation for Device-Cloud RAG under Document Isolation

CONCORD: 文档隔离下设备-云RAG的异步稀疏聚合

Xuedong Hu, Zhiqing Tang, Zhi Yao, Tian Wang, Weijia Jia

发表机构 * Beijing Normal University(北京师范大学) BNU-HKBU United International College(北师港浸大联合国际学院) University of Macau(澳门大学) Shenzhen Research Institute of Big Data(深圳市大数据研究院) Guangdong Key Laboratory of Artificial Intelligence and Multi-Modal Data Processing(广东省人工智能与多模态数据处理重点实验室)

AI总结 针对文档隔离的双端RAG中频繁同步和密集证据传输导致的吞吐量低问题,提出异步稀疏聚合框架CONCORD,通过等待债务控制和证书引导最小补充机制,在保持答案质量的同时大幅提升吞吐量并降低通信量。

Comments to be published in IEEE ICWS 2026

详情
AI中文摘要

检索增强生成(RAG)已成为通过在推理时引入外部知识来改进语言模型的关键技术。随着设备-云协同推理使得在边缘设备上部署小型语言模型成为可能,出现了一种新的场景:私有文档保留在设备上,而公共知识位于云端。隐私和政策约束通常禁止原始文档交换,从而形成了文档隔离的双端RAG设置。然而,现有方法依赖频繁的远程同步和密集的证据传输,限制了在现实延迟和带宽条件下的吞吐量。为了解决这个问题,我们提出了CONCORD,一种用于文档隔离下双端RAG的异步稀疏聚合框架。CONCORD将云端视为异步到达的证据源,而非持续同步的协同生成器。具体来说,我们引入了等待债务控制,根据观察到的等待回报决定每个解码步骤是否应继续等待远程参与。我们还设计了一种证书引导的最小补充机制,仅请求确定当前贪婪决策所需的远程证据。咨询云端的步骤保留了与密集双端聚合相同的贪婪令牌,而其余步骤则在本地提交,无需远程证据。在Natural Questions和WikiText-2上的实验表明,CONCORD将端到端吞吐量相对于基线分别提高了1.66倍和2.15倍,同时将每令牌通信量降低了两个数量级以上,并保持了可比的答案质量和困惑度。

英文摘要

Retrieval-augmented generation (RAG) has emerged as a pivotal technique for improving language models by incorporating external knowledge at inference time. As device-cloud collaborative inference makes it feasible to deploy small language models on edge devices, a new setting arises in which private documents remain on the device and public knowledge resides in the cloud. Privacy and policy constraints often forbid raw document exchange, creating a document-isolated dual-end RAG setting. However, existing methods rely on frequent remote synchronization and dense evidence transfer, limiting throughput under realistic latency and bandwidth conditions. To address this issue, we propose CONCORD, an asynchronous sparse aggregation framework for dual-end RAG under document isolation. CONCORD treats the cloud as an asynchronously arriving evidence source rather than a continuously synchronized co-generator. Specifically, we introduce waiting debt control to decide whether each decoding step should continue waiting for remote participation based on the observed return of waiting. We also design a certificate-guided minimal supplementation mechanism that requests only the remote evidence needed to determine the current greedy decision. Steps that consult the cloud preserve the same greedy token as dense dual-end aggregation, while the remaining steps commit locally without remote evidence. Experiments on Natural Questions and WikiText-2 show that CONCORD improves end-to-end throughput over baselines by $1.66\times$ and $2.15\times$, respectively, while reducing per-token communication by over two orders of magnitude and maintaining comparable answer quality and perplexity.

2606.15199 2026-06-16 cs.AI 新提交

CogGuard: Cognitive and Operational Profiling for Proactive Warning in Edge Intelligent Services

CogGuard:边缘智能服务中基于认知与操作画像的主动预警

Zhi Yao, Weihao Chen, Zhiqing Tang, Hanshuai Cui, Qianli Ma, Weijia Jia, Wei Zhao

发表机构 * Beijing Normal-Hong Kong Baptist University(北京师范大学-香港浸会大学) Guangdong Key Lab of AI and Multi-modal Data Processing(广东省人工智能与多模态数据处理重点实验室) Institute of Artificial Intelligence and Future Networks(人工智能与未来网络研究院) Engineering Center of AI and Future Education(人工智能与未来教育工程中心) Guangdong Provincial Department of Science and Technology(广东省科学技术厅) Zhuhai Science-Tech Innovation Bureau(珠海市科技创新局) Beijing Normal University at Zhuhai(北京师范大学珠海校区)

AI总结 提出CogGuard框架,通过解耦离线LLM画像构建与在线SLM评分预测,结合前缀对齐KV缓存重用和长度感知分布式微调,实现边缘智能服务的主动预警,在教育和操作任务上降低构建时间48%、微调时间19%。

Comments Accepted to ICWS 2026

详情
AI中文摘要

主动预警是边缘智能服务的一项重要能力,系统需在严格的延迟和隐私约束下预测主体能否成功完成即将到来的任务。这种预测依赖于从历史交互日志中提取的长期静态属性和短期动态状态。近期的大语言模型(LLM)为从这些日志构建结构化画像提供了强大的长上下文推理能力,但现有解决方案在边缘部署时面临两个挑战:(1)画像方法通常具有领域特异性,缺乏跨服务场景的可复用抽象;(2)在异构边缘集群上微调对齐模型时,由于输入序列长度的差异,同步开销较高。为应对这些挑战,我们提出了CogGuard,一个面向边缘智能服务的主动预警框架。CogGuard通过共享的静态-动态画像到评分流水线,将离线基于LLM的画像构建与在线基于小语言模型(SLM)的评分预测解耦,并在两个代表性场景中实例化:教育表现预警和操作任务结果预警。为高效构建画像,我们设计了场景特定的画像方法,并采用前缀对齐的KV缓存重用以减少重复编码开销。为进行边缘端模型对齐,我们提出了一种具有对比正则化的长度感知分布式微调策略,以缓解异构集群上的工作负载不平衡。在教育和操作数据集上的实验表明,CogGuard将画像构建时间最多减少48%,分布式微调时间减少19%,同时在100分量表预警任务上分别达到13.4和5.9的MAE。在最大的教育场景中,与最强基线相比,CogGuard将预测误差降低了15.4%。

英文摘要

Proactive warning is an important capability for edge intelligent services, where the system predicts whether a subject will successfully complete an incoming task under strict latency and privacy constraints. Such prediction depends on both long-term static attributes and short-term dynamic states derived from historical interaction logs. Recent Large Language Models (LLMs) offer strong long-context reasoning for constructing structured profiles from these logs, but existing solutions face two challenges for edge deployment: (1) profiling methods are typically domain-specific and lack a reusable abstraction across service scenarios, and (2) fine-tuning alignment models on heterogeneous edge clusters incurs high synchronization overhead due to the variance in input sequence lengths. To address these challenges, we propose CogGuard, a proactive-warning framework for edge intelligent services. CogGuard decouples offline LLM-based profile construction from online Small Language Model (SLM)-based score prediction through a shared static-dynamic profile-to-score pipeline, and instantiates it in two representative scenarios: educational performance warning and operational task outcome warning. For efficient profile construction, we design scenario-specific profiling methods with prefix-aligned KV-cache reuse to reduce repeated encoding overhead. For edge-side model alignment, we propose a length-aware distributed fine-tuning strategy with contrastive regularization to mitigate workload imbalance on heterogeneous clusters. Experiments on education and operation datasets show that CogGuard reduces profile construction time by up to 48% and distributed fine-tuning time by 19%, while achieving MAEs of 13.4 and 5.9, respectively, on 100-point-scale warning tasks. In the largest educational setting, CogGuard reduces prediction error by 15.4% compared with the strongest baseline.

2606.15315 2026-06-16 cs.AI 新提交

ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing

ChatPlanner: 一个用于个性化公共交通路线规划的大型语言模型框架

Tingting Yang, Chenhao Xue, Jun Chen

发表机构 * School of Engineering and Materials Science, Queen Mary University of London(伦敦玛丽女王大学工程与材料科学学院) Department of Engineering Science, University of Oxford(牛津大学工程科学系)

AI总结 提出ChatPlanner框架,利用大型语言模型和检索增强生成技术从自然语言查询中提取用户偏好并融入路线规划算法,实验证明其能生成更符合个性化需求的可行路线方案。

Comments Under Review at Transportation Research Part C

详情
AI中文摘要

在公共交通系统中,由于难以捕捉和整合多样化的用户偏好到路线规划算法中,个性化路线规划仍然具有挑战性。本文提出了ChatPlanner,一个新颖的框架,利用大型语言模型(LLMs)实现偏好感知的公共交通路线规划。我们的方法采用微调后的LLMs结合检索增强生成(RAG),从自然语言查询中提取路线参数并解释细微的用户偏好,随后将这些偏好整合到公共交通路线规划算法的目标函数中。本研究设计了包含八种角色和五种上下文的偏好感知数据集,为微调和RAG建立评分标准。本文进行了三项实验,以验证解决方案的可行性、路线信息和偏好的提取、以及解决方案集的质量和完整性。结果表明,ChatPlanner能够可靠地生成可行方案。微调强制了所需的输出结构并学习了通用的偏好模式,而RAG提供了查询特定的上下文以解决不精确或会话式的表达,并校准连续分数。两者的结合在路线信息提取和用户偏好解释方面达到了最高准确性。基于选定案例研究的结果表明,通过捕捉用户偏好,ChatPlanner在现有路线规划器忽略的不同维度上识别出了有价值的解决方案,生成了更有价值的路线备选方案。本研究为将自然语言理解整合到交通优化中建立了一个新范式。

英文摘要

Personalized public transit routing in public transit systems remains challenging due to the difficulty of capturing and integrating diverse user preferences into routing algorithms. This paper presents ChatPlanner, a novel framework that leverages Large Language Models (LLMs) to enable preference aware public transit routing. Our approach employs fine-tuned LLMs with Retrieval-Augmented Generation (RAG) to extract routing parameters and interpret nuanced user preferences from natural language queries, subsequently integrating these preferences into the objective function of a public transit routing algorithm. This study designs preference aware datasets incorporating eight personas and five contexts to establish scoring standards for both fine-tuning and RAG. This work conducted three experiments to validate the solutions' feasibility, extraction of routing information and preferences, and solution set quality and completeness. Results demonstrate that ChatPlanner generates feasible solutions reliably. Fine-tuning enforces the required output structure and learns general preference patterns, while RAG provides query-specific context to resolve imprecise or conversational expressions and calibrate continuous scores. The combination of both achieves the highest accuracy in routing information extraction and user preference interpretation. Results based on selected case studies show that by capturing user preferences, ChatPlanner identifies valuable solutions across different dimensions that existing route planners overlook, generating more valuable route alternatives. This research establishes a new paradigm for integrating natural language understanding into transportation optimization.

2606.15655 2026-06-16 cs.AI 新提交

Advanced Machine Learning and Deep Learning Techniques for Enhanced Cattle Identification and Detection: A Comprehensive Review

用于增强牛只识别与检测的先进机器学习和深度学习技术:一项全面综述

Fayazunnesa Chowdhury, Syed Md. Galib, Md Nasim Adnan, Md. Moradul Siddique, Md Robiul Karim, K M Tanvir Anjum

发表机构 * Jashore University of Science and Technology(贾沙雷科学与技术大学) University of Information Technology & Sciences (UITS)(信息科技与科学大学) Gazipur Agricultural University(加兹ipur农业大学) Shanto Mariam University of Creative Technology(沙托·马里姆创意技术大学)

AI总结 本文系统综述了利用机器学习和深度学习技术进行牛只识别的研究,比较了传统方法(如K近邻、支持向量机)与深度学习方法(如CNN、ResNet、YOLO)的效果,指出深度学习方法在识别和检测任务中更优,并讨论了数据集有限、数据质量问题和实时处理需求等挑战。

Comments Published in the journal of Annals of Emerging Technologies in Computing (AETiC), 34 pages, 5 Figures. The Article is available here: http://aetic.theiaer.org/archive/v10/v10n2/p1.html

详情
Journal ref
Annals of Emerging Technologies in Computing (AETiC),Vol. 10, No. 2, 2026
AI中文摘要

在畜牧管理中,维护生物安全、食品安全和供应链效率的需求使得有效的牛只识别技术比以往任何时候都更加迫切。本文对使用机器学习和深度学习技术的牛只识别研究进行了系统综述。本系统综述通过主要学术数据库的研究评估了传统和现代牛只识别技术的有效性,并对文章进行了全文审查。在这些技术中,经典机器学习技术如K近邻和支持向量机在牛只识别中表现出良好效果;然而,深度学习技术如卷积神经网络、残差网络和You Only Look Once在认知、检测和识别任务中表现更优。特征提取依赖于常见技术如局部二值模式(LBP)、加速稳健特征(SURF)和尺度不变特征变换(SIFT),而这些研究中常用的关键特征包括鼻纹和皮毛图案。综述强调了牛只识别中的主要障碍,例如公开可用的数据集数量有限、易受环境变化和动物移动影响的数据质量问题,以及对实时处理能力的高需求。本文旨在为研究人员、政策制定者和利益相关者提供关于实施可扩展、人道且有效的牛只识别系统以实现可持续畜牧管理的信息。

英文摘要

The need for effective cattle identification technology is now more acutely felt than ever in maintaining biosecurity, food safety, and supply chain efficacy in livestock management. This paper presents a systematic review of recent research in cattle identification using machine learning and deep learning techniques. The present systematic review measures the effectiveness of traditional and modern cattle identification techniques using studies from major academic databases, where articles were subjected to full-text review. Among these techniques, classical Machine Learning Techniques such as K-Nearest Neighbors and Support Vector Machines have demonstrated good results in cattle identification; however, Deep Learning Techniques, such as Convolutional Neural Networks, Residual Networks, and You Only Look Once, are better in cognition, detection, and identification tasks. Feature extraction relies on common techniques like Local Binary Pattern (LBP), Speeded-Up Robust Features (SURF), and Scale-Invariant Feature Transform (SIFT), while key features commonly used in these studies include muzzle prints and coat patterns. The review highlights key hurdles involving cattle identification, such as the limited number of publicly accessible datasets, issues with data quality susceptible to environmental changes and animal mobility, and high demand for real-time processing ability. The paper aims to inform researchers, policymakers, and stakeholders about implementing scalable, humane, and effective cattle identification systems to achieve sustainable livestock management.

2606.15709 2026-06-16 cs.AI cs.MA 新提交

AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan

基于AI的自适应水网管理框架及概念验证实施:解决约旦无收益水问题

Mohammed Fasha, Nahel Al-Maayta, Bilal Sowan, Mohammad Athamneh, Husam Barham

发表机构 * Jordan(约旦)

AI总结 提出集成EPANET水力建模、数字孪生、SCADA和LLM智能体的框架,通过实时数据与物理模拟结合实现异常检测与自适应决策,概念验证在约旦1164节点管网中实现2分钟内自动生成健康报告,爆管检测定位准确。

详情
Journal ref
2026 2nd International Conference on Computational Intelligence Approaches and Applications (ICCIAA)
AI中文摘要

约旦面临严重的水资源短缺,50%的生产水因泄漏、盗窃和计量问题(即无收益水,NRW)而损失。传统的被动方法已被证明不足以持续减少NRW。本文提出一个智能框架,集成EPANET水力建模、数字孪生技术、SCADA系统和基于大语言模型(LLM)的AI智能体,用于连续网络监控和自适应决策。该系统将实时数据流与基于物理的模拟相结合,以检测异常,采用检索增强生成(RAG)进行策略解释,并通过函数调用进行网络控制。概念验证实施使用EPYT和离线LLM(通过Ollama的llama3.1:8b)在安曼一个1164节点的区域管网中验证了技术可行性。该系统展示了自动化水力模拟、基于流量的异常检测(与配水区域(DZ)实践一致)、以及AI生成的健康报告,响应时间低于2分钟且零API成本。爆管检测依赖于局部流量异常分析:模拟的30.1 L/s泄漏在15根管道中产生可测量的流量重新分布,标记出一个15节点的簇,从而定位爆管——确认了与配水区域(DZ)监测实践的一致性。该框架通过分阶段实施适应约旦的间歇性供水模式和有限的自动化,为缺水地区利用智能自动化减少NRW和提高运营效率提供了可扩展的路径。

英文摘要

Jordan faces severe water scarcity with 50\% of water produced is lost to leakage, theft and metering issues also known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework integrating EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, employing retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT with offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1~L/s simulated leak produces measurable flow redistribution in 15 pipes, flagging a 15-junction cluster that localises the burst -- confirming alignment with water distribution zone (DZ) monitoring practice. The framework accommodates Jordan's intermittent supply patterns and limited automation through phased implementation, offering a scalable pathway for water-scarce regions to leverage intelligent automation for NRW reduction and operational efficiency.

2606.15831 2026-06-16 cs.AI cs.LG cs.NE cs.SY eess.SY 新提交

An Integrated System for Real-Time Student Assessment and Career Guidance Using Neural Networks in Computing Disciplines

基于神经网络的计算学科实时学生评估与职业指导集成系统

Sakir Hossain Faruque, Md. Jubair Hossain, Sharun Akter Khushbu

发表机构 * Daffodil International University(达福尔国际大学) Barishal Engineering College(巴里什尔工程学院)

AI总结 针对计算机专业学生职业路径选择困难,提出集成职业指导专家系统与网络评估平台的AI驱动系统,采用多层感知器模型实现94.71%的职业路径预测准确率。

Comments 25 pages, 24 figures

详情
AI中文摘要

许多计算机科学(CS)和软件工程(SWE)专业的本科生在确定合适的职业道路时面临困难,尤其是当他们的学业表现、能力和兴趣不完全匹配时。为了解决这一问题,本研究提出了一种AI驱动的学生评估与职业预测系统,该系统集成了职业指导专家(CGE)系统和基于网络的学生评估(WBSA)平台。在集成框架内,CGE利用AI增强个性化职业推荐,同时帮助毕业生根据其技能和兴趣确定合适的工作、研究领域和深造机会。WBSA平台通过评估、个性化任务、导师活动和安全的实时聊天应用程序进一步加强了学生与教师之间的互动。CGE系统采用多层感知器(MLP)模型,该模型使用滚雪球抽样法从大学学生中收集的真实学术和课外数据进行训练,在预测个性化职业路径方面达到了94.71%的验证准确率。在部署前,跨大学进行了预调查以评估所提出的模型。WBSA系统作为现代Web应用程序开发,使用了Node.js、Next.js和PostgreSQL等技术,以确保可扩展性、响应性和安全的数据管理。整个系统由安全的云基础设施支持,该平台提供可靠的性能,同时帮助毕业生在IT领域选择合适的职业道路。此外,还进行了一项涉及学生和教师的后期调查,以收集反馈并进一步提高系统的整体有效性和可用性。

英文摘要

Many undergraduate students in Computer Science (CS) and Software Engineering (SWE) struggle to identify suitable career paths, particularly when their academic performance, abilities, and interests do not fully align. To address this issue, this study proposes an AI-driven Student Assessment and Career Prediction System that integrates a Career Guidance Expert (CGE) system with a Web-Based Student Assessment (WBSA) platform. Within the integrated framework, CGE enhances personalized career recommendations using AI while also assisting students after graduation in identifying suitable jobs, research domains, and higher study opportunities aligned with their skills and interests. The WBSA platform further strengthens interaction between students and faculty through assessments, personalized tasks, mentorship activities, and a secure real-time chat application. The CGE system employs a Multilayer Perceptron (MLP) model trained on real-world academic and extracurricular data collected using the snowball sampling method from the students of universities, achieving a validation accuracy of 94.71% in predicting personalized career paths. A pre-survey was conducted across universities to evaluate the proposed model before deployment. The WBSA system was developed as a modern web application using technologies such as Node.js, Next.js, and PostgreSQL to ensure scalability, responsiveness, and secure data management. The overall system is supported by a secure cloud-based infrastructure, the platform provides reliable performance while assisting graduates to select suitable career path in IT sector. In addition, a post-survey involving both students and faculty was conducted to gather feedback and further improve the overall effectiveness and usability of the system.

2606.16415 2026-06-16 cs.AI 新提交

Posterior Twins: Distributional Behavioral Simulation for Enterprise Decisions

后验孪生:面向企业决策的分布行为模拟

Ankit Das

发表机构 * Twinning Labs, Inc.(Twinning Labs公司)

AI总结 提出后验孪生方法,通过记忆驱动的数字孪生将模拟行为表示为决策条件下的更新分布,在226例基准上评估模型,发现模态准确率与分布保真度揭示不同操作区域,TL-Twin Alpha实现最低Wasserstein-1距离(1.16)。

Comments 13 pages, 2 figures

详情
AI中文摘要

企业行为模拟不仅需要产生合理的响应。许多决策取决于在拟议行动下群体的形态:哪些细分群体接受、拒绝、犹豫或进入风险敏感状态。本文介绍了后验孪生(Posterior Twins),一种记忆驱动的数字孪生方法,将可能的行为表示为特定决策上下文下的更新分布。我们在一个包含226个保留示例的行为响应基准上评估了一系列Twinning Labs行为模型操作点,并报告了模态准确率和Wasserstein-1距离。结果表明,模态准确率和分布保真度识别出不同的操作区域。在报告的结果集中,TL-Twin Alpha实现了最低的观测Wasserstein-1距离($W_1 = 1.16$),而TL-Twin Delta和TL-Twin Gamma在模态准确率前沿附近提供了平衡的操作点。本文将这些结果视为系统结果:受控记忆、行为模型路由、场景编排、分布聚合和可审计性对于将模拟行为转化为可重用的企业决策证据是必要的。

英文摘要

Enterprise behavioral simulation requires more than producing a plausible response. Many decisions depend on the shape of a population under a proposed action: which segments accept, defect, hesitate, or move into risk-sensitive states. This paper introduces Posterior Twins, a memory-grounded digital-twin approach that represents likely behavior as an updated distribution under a specific decision context. We evaluate a family of Twinning Labs behavioral-model operating points on a 226-example held-out behavioral-response benchmark and report both modal accuracy and Wasserstein-1 distance. The results show that modal accuracy and distributional fidelity identify different operating regimes. TL-Twin Alpha achieves the lowest observed Wasserstein-1 distance in the reported result set ($W_1 = 1.16$), while TL-Twin Delta and TL-Twin Gamma provide balanced operating points near the modal-accuracy frontier. The paper frames these results as a systems result: governed memory, behavioral model routing, scenario orchestration, distributional aggregation, and auditability are necessary for turning simulated behavior into reusable enterprise decision evidence.

2606.16624 2026-06-16 cs.AI 新提交

MR-GVNO: A Geometry-Aware Variational Physics-Informed Neural Operator for Mindlin-Reissner Plates on Irregular Domains

MR-GVNO:一种面向不规则域上Mindlin-Reissner板的几何感知变分物理信息神经算子

Siqi Wang, Daobo Sun, Yizheng Wang, Yilong Zhang, Yabin Jin, Xiaoying Zhuang, Timon Rabczuk

发表机构 * Institute of Computational Mechanics × AI & College of Intelligent Robotics and Advanced Manufacturing, Fudan University(计算力学与人工智能学院及智能机器人与先进制造学院,复旦大学) School of Aerospace Engineering, Xiamen University(航空航天工程学院,厦门大学) Department of Engineering Mechanics, Tsinghua University(工程力学系,清华大学) Institute of Photonics, Department of Mathematics and Physics, Leibniz University(光子研究所,数学与物理系,莱比锡大学) Institute of Structural Mechanics, Bauhaus-Universität Weimar(结构力学研究所,魏玛 Bauhaus-Universität)

AI总结 提出MR-GVNO,一种几何感知变分神经算子,通过边界点云表示不规则几何,利用交叉注意力机制融合多物理场输入,基于离散总势能的变分物理信息损失无监督训练,实现对Mindlin-Reissner板问题的快速准确预测。

详情
AI中文摘要

板壳结构在工程中广泛应用,因此在不同几何、材料和载荷下进行快速响应预测非常理想。然而,传统的有限元方法需要重复建模和求解,导致计算成本高昂。本研究提出了一种用于Mindlin-Reissner板问题的几何感知变分神经算子,称为MR-GVNO。该方法使用边界点云表示不规则几何,并采用独立的编码器处理空间变化的材料场、压力载荷和标量物理参数。交叉注意力机制将这些输入与查询点信息集成,以预测任意位置的横向挠度和转角。MR-GVNO无需标记解数据,通过从离散总势能导出的变分物理信息损失进行训练。它直接处理不规则点云,并允许不同的物理场独立离散化,避免了插值到公共网格。在单孔、双孔和L形板上的数值实验表明,在均匀和非均匀材料以及均匀和随机载荷下,该方法能准确预测响应。该模型还实现了毫秒级的全场推理和良好的跨几何泛化能力。

英文摘要

Plate and shell structures are widely used in engineering, making rapid response prediction under varying geometries, materials, and loads highly desirable. However, conventional finite element methods require repeated modeling and solution, resulting in high computational costs. This study proposes a geometry-aware variational neural operator for Mindlin-Reissner plate problems, termed MR-GVNO. The method uses boundary point clouds to represent irregular geometries and employs separate encoders for spatially varying material fields, pressure loads, and scalar physical parameters. A cross-attention mechanism integrates these inputs with query point information to predict transverse deflections and rotations at arbitrary locations. MR-GVNO is trained without labeled solution data using a variational physics-informed loss derived from the discretized total potential energy. It directly processes irregular point clouds and allows different physical fields to be discretized independently, avoiding interpolation onto a common grid. Numerical experiments on single-hole, double-hole, and L-shaped plates demonstrate accurate response prediction under homogeneous and heterogeneous materials and uniform and random loads. The model also achieves millisecond-level full-field inference and favorable cross-geometry generalization.

2606.16649 2026-06-16 cs.AI 新提交

The Integrator Advantage: Controlled Agentic AI for Small and Medium-Sized Companies

集成优势:面向中小企业的受控代理型人工智能

Christopner Koch, Joshua A. Wellbrock

AI总结 本文提出代理型AI对中小企业的近期价值在于受控部分自主性,而非完全自主或减员,并给出集成框架以提升生产力。

Comments 10 pages, 15 tables

详情
AI中文摘要

代理型AI标志着企业自动化的新阶段。与传统自动化或对话式AI不同,代理系统能够解释目标、规划多步骤任务、访问工具、与企业系统交互,并以不同程度的自主性执行工作流。对于中小企业而言,这创造了减少行政负担、加速常规流程以及改善组织知识利用的潜力。本文认为,代理型AI的近期价值不在于完全自主或减员,而在于对简单和中等复杂度业务流程的受控部分自主性。它提出了一个集成框架,涵盖用例适用性、自主性级别、技术集成、治理、安全、员工赋能和可衡量影响。本文得出结论,当作为以人为中心的能力实施,并由人保留责任和问责时,代理型AI可以成为生产力杠杆。

英文摘要

Agentic AI marks a new phase of enterprise automation. Unlike traditional automation or conversational AI, agentic systems can interpret goals, plan multi step tasks, access tools, interact with enterprise systems, and execute workflows with varying degrees of autonomy. For small and medium sized companies, this creates potential to reduce administrative burden, accelerate routine processes, and improve the use of organizational knowledge. This paper argues that the near term value of Agentic AI does not lie in full autonomy or workforce reduction, but in controlled partial autonomy for simple and medium complexity business processes. It proposes an integration framework covering use case suitability, autonomy levels, technical integration, governance, security, employee enablement, and measurable impact. The paper concludes that Agentic AI can become a productivity lever when implemented as a human centered capability with responsibility and accountability retained by people.

2606.16721 2026-06-16 cs.AI 新提交

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

医疗世界模型:表示医疗状态、建模临床动态与指导干预策略

Ke Liu, Mengxuan Li, Yanyi Bao, Tianyun Zhang, Chong Chu, Jiajun Bu, Haishuai Wang

发表机构 * College of Computer Science, Zhejiang University(浙江大学计算机科学与技术学院) School of Medicine, Zhejiang University(浙江大学医学院) Department of Biomedical Informatics, Harvard University(哈佛大学生物医学信息学系)

AI总结 本文提出医疗世界模型框架,通过构建患者状态、建模临床动态和支持干预决策,推动医疗AI从静态诊断向动态模拟演进。

详情
AI中文摘要

医疗诊断和治疗是动态过程,患者状态随时间演变,临床干预改变未来结果。尽管当前医疗AI能检测疾病、估计风险和生成报告,许多系统仍返回静态标签或分数,对疾病进展或替代干预如何重塑轨迹的洞察有限。医疗世界模型通过学习患者状态动态的内部模拟器,将人工智能中的世界模型思想应用于医疗。其长期目标是帮助临床医生预测病情恶化、比较治疗条件下的未来,并为个体患者定制护理。然而,相关工作仍分散在基础模型、纵向建模、疾病模拟、治疗效果估计、强化学习和数字孪生等领域。为弥合这一差距,本综述概述了一条路线图,将医疗AI从孤立的诊断和预测推进到模拟疾病演变和支持干预决策的医疗世界模型。该路线图围绕三个耦合能力组织:患者状态构建、临床动态建模和干预决策支持。在代表性系统中,比较突出了每种能力的贡献以及如何将部分组件集成到更成熟的感知-动态-规划系统中。最后,我们确定了将合理的推演转化为临床有用模拟器所涉及的挑战。相关文献见 https://github.com/1999kevin/awesome_medical_world_models。

英文摘要

Medical diagnosis and treatment are dynamic processes in which patient states evolve over time and clinical interventions alter future outcomes. Although current medical AI can detect disease, estimate risk and generate reports, many systems still return static labels or scores, offering limited insight into how illness may progress or how alternative interventions may reshape its trajectory. Medical world models adapt the world-model idea from artificial intelligence to healthcare by learning internal simulators of patient-state dynamics. Their long-term goal is to help clinicians anticipate deterioration, compare treatment-conditioned futures and tailor care to individual patients. Yet relevant work remains scattered across foundation models, longitudinal modelling, disease simulation, treatment-effect estimation, reinforcement learning and digital twins. To bridge this gap, this review outlines a roadmap for advancing medical AI from isolated diagnosis and prediction toward medical world models that simulate disease evolution and support intervention decisions. This roadmap is organized around three coupled capabilities: patient-state construction, clinical dynamics modelling and intervention decision support. Across representative systems, the comparison highlights what each capability contributes and how partial components can be integrated into more mature perception--dynamics--planning systems. Finally, we identify the challenges involved in turning plausible rollouts into clinically useful simulators. Related literature is available at https://github.com/1999kevin/awesome_medical_world_models.

2512.10104 2026-06-16 cs.CR cs.AI cs.IR 交叉投稿

Phishing Email Detection Using Large Language Models

使用大型语言模型检测钓鱼邮件

Najmul Hasan, Prashanth BusiReddyGari, Haitao Zhao, Yihao Ren, Jinsheng Xu, Shaohu Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LLMPEA框架,利用GPT-4o等三种前沿LLM检测钓鱼邮件,准确率超90%,并揭示对抗攻击、提示注入和多语言攻击的漏洞。

Comments 7 pages

详情
AI中文摘要

电子邮件钓鱼是最普遍且具有全球影响的网络入侵载体之一。随着系统越来越多地部署大型语言模型(LLM)应用,这些系统面临利用其基本架构的不断演变的钓鱼邮件威胁。当前的LLM在部署到电子邮件安全系统之前需要大量加固,特别是针对利用架构漏洞的协调多向量攻击。本文提出了LLMPEA,一个基于LLM的框架,用于检测跨多个攻击向量的钓鱼邮件攻击,包括提示注入、文本精炼和多语言攻击。我们评估了三种前沿LLM(例如GPT-4o、Claude Sonnet 4和Grok-3)以及全面的提示设计,以评估它们针对钓鱼邮件攻击的可行性、鲁棒性和局限性。我们的实证分析表明,LLM可以以超过90%的准确率检测钓鱼邮件,同时我们也强调,基于LLM的钓鱼邮件检测系统可能受到对抗攻击、提示注入和多语言攻击的利用。我们的发现为现实环境中攻击者结合利用多个漏洞的基于LLM的钓鱼检测提供了关键见解。

英文摘要

Email phishing is one of the most prevalent and globally consequential vectors of cyber intrusion. As systems increasingly deploy Large Language Models (LLMs) applications, these systems face evolving phishing email threats that exploit their fundamental architectures. Current LLMs require substantial hardening before deployment in email security systems, particularly against coordinated multi-vector attacks that exploit architectural vulnerabilities. This paper proposes LLMPEA, an LLM-based framework to detect phishing email attacks across multiple attack vectors, including prompt injection, text refinement, and multilingual attacks. We evaluate three frontier LLMs (e.g., GPT-4o, Claude Sonnet 4, and Grok-3) and comprehensive prompting design to assess their feasibility, robustness, and limitations against phishing email attacks. Our empirical analysis reveals that LLMs can detect the phishing email over 90% accuracy while we also highlight that LLM-based phishing email detection systems could be exploited by adversarial attack, prompt injection, and multilingual attacks. Our findings provide critical insights for LLM-based phishing detection in real-world settings where attackers exploit multiple vulnerabilities in combination.

2606.13693 2026-06-16 cs.CY cs.AI 交叉投稿

Limited Marginal Benefit of Reasoning-Heavy LLM Deployment in ESG Narrative Scoring: A 4-Model Consensus Study on Japanese Listed Firms

ESG叙述评分中重度推理LLM部署的有限边际收益:一项关于日本上市公司的4模型共识研究

Hiroyuki Kokubu

发表机构 * Kansai University(关西大学)

AI总结 通过4模型共识设计,研究在ESG叙述评分中,重度推理模型相比非推理模型是否带来显著收益,发现其边际收益有限且成本高昂。

Comments 12 pages. Earlier version available on SSRN, Abstract ID 6683303

详情
AI中文摘要

使用大语言模型(LLM)对ESG叙述披露进行自动评分正逐渐受到关注,但重度推理的前沿模型是否带来与其成本相称的价值,在实证上仍不确定。我们基于十家日本上市公司的语料库,沿三个评分轴——定量目标、进度跟踪基础设施和外部标准对齐——通过四模型共识设计评估这一问题,该设计将一个推理型前沿模型与三个非推理型同期模型相结合。在120个公司×轴×模型评分中,推理型模型与每个非推理型模型之间的汇总平均绝对偏差为0.38(5分制);仅有2%的成对比较达到两分偏差,且无任何比较超过两分。每公司成本核算显示,仅推理型模型的成本约为三个非推理型模型集成成本的5.6倍,而结果仅在小范围内存在差异。我们得出结论,在基于区间的ESG叙述评分中,重度推理部署相对于非推理共识并未显著改善结果,同时大幅增加了运营成本。我们讨论了这对成本效益的ESG自动评分流程以及应用问责环境中的LLM部署治理的启示。本工作的早期版本可在SSRN上获取(摘要ID 6683303)。

英文摘要

Automated scoring of ESG narrative disclosures with large language models (LLMs) is gaining traction, yet whether reasoning-heavy frontier models add value commensurate with their cost remains empirically unsettled. We evaluate this question on a corpus of ten Japanese listed firms across three rubric axes -- quantitative targets, progress-tracking infrastructure, and external-standard alignment -- using a four-model consensus design that combines a reasoning-on frontier model with three reasoning-off contemporaries. Across 120 firm x axis x model scores, the pooled mean absolute deviation between the reasoning-on model and each reasoning-off counterpart is 0.38 on a 5-point scale; only 2% of pairwise comparisons reach a two-point deviation, and none exceeds two points. Per-firm cost accounting shows the reasoning-on arm alone costs roughly 5.6x as much as the three-provider reasoning-off ensemble, for outcomes that differ only within small margins. We conclude that in span-based ESG narrative scoring, reasoning-heavy deployment does not materially improve outcomes relative to reasoning-off consensus, while substantially increasing operational cost. We discuss implications for cost-effective ESG auto-scoring pipelines and LLM deployment governance in applied accountability settings. An earlier version of this work is available on SSRN (Abstract ID 6683303).

2606.14707 2026-06-16 cs.PF cs.AI cs.CY 交叉投稿

Green AI Carbon Optimizer: Carbon-Efficient Training Location Recommendation and Global AI Energy Demand Forecasting

绿色AI碳优化器:碳高效训练位置推荐与全球AI能源需求预测

Yuxin Chen, Hao Gao, Chujie Zou

AI总结 提出Green AI Carbon Optimizer,包括基于电网碳强度、可再生能源占比和PUE的云区域推荐方法(最佳vs最差区域减排97.2%),以及基于幂律的全球AI能源需求预测模型(2030年需求7-1436 TWh)。

Comments Short workshop of 5 pages. 2 figures

详情
AI中文摘要

AI训练和部署消耗大量电力,但碳排放结果尚未充分融入常规模型开发决策。本文提出Green AI Carbon Optimizer,包含两个主要贡献:(i) 一种用于训练工作负载的碳感知云区域推荐方法,以及(ii) 一个用于全球AI能源需求的幂律预测流程。对于位置推荐,我们将区域电网碳强度、可再生能源占比和数据中心电能利用效率(PUE)结合成一个统一评分模型,覆盖来自主要云提供商的100多个区域。对于一个参考工作负载(8*A100, 100h),我们采样区域的估计排放量从7.74kg到272.00kg CO2不等。选择最佳区域而非最差区域相对于最差情况减少了97.2%。消融实验表明,仅按可再生能源占比排序可能选择碳排放高于包含电网碳强度排序的区域。对于预测,我们使用26个锚点模型拟合参数数量与训练能量之间的幂律关系。我们将此拟合与模型增长、硬件效率和训练频率的情景假设相结合,并评估对推理比率和生态系统扩展的敏感性。在不同情景下,根据所述假设,预计2030年需求范围从7 TWh到1,436 TWh,凸显了部署选择、模型扩展纪律和透明能源报告的重要性。

英文摘要

AI training and deployment consume substantial electricity, but carbon outcomes remain weakly integrated into routine model development decisions. This paper presents Green AI Carbon Optimizer with two primary contributions: (i) a carbon aware cloud region recommendation method for training workloads, and (ii) a power law forecasting pipeline for global AI energy demand. For location recommendation, we combine regional grid carbon intensity, renewable share, and data center Power Usage Effectiveness (PUE) into a unified scoring model across 100+ regions from major cloud providers. For a reference workload (8*A100, 100h), estimated emissions in our sampled regions range from 7.74kg to 272.00kg CO2. Selecting the best region instead of the worst corresponds to a 97.2% reduction relative to the worst case. Ablation shows that ranking by renewable share alone can select regions with higher CO2 emissions than rankings that include grid carbon intensity. For forecasting, we fit a power law relation between parameter count and training energy using 26 anchor models. We combine this fit with scenario assumptions on model growth, hardware efficiency, and training frequency, and evaluate sensitivity to inference ratio and ecosystem scaling. Across scenarios, projected 2030 demand ranges from 7TWh to 1,436TWh under the stated assumptions, highlighting the importance of deployment choices, model scaling discipline, and transparent energy reporting.

2606.14724 2026-06-16 cs.CV cs.AI 交叉投稿

VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

VigilFormer: 用于视频异常检测的可变形注意力与因果风险推理

Xinze Zhang

发表机构 * University of Southern California(南加州大学)

AI总结 提出VigilFormer框架,结合可变形时空注意力与因果时序建模,通过稀疏注意力、对比多实例学习和自适应帧跳过,在保持高精度的同时实现实时异常检测。

详情
AI中文摘要

监控场景中的视频异常检测必须在检测准确性与实时吞吐量之间取得平衡,现有方法要么通过更强的特征提取器,要么通过更高效的架构来解决这一矛盾,但很少能兼顾两者。我们提出VigilFormer,一个统一框架,结合可变形时空注意力与因果时序建模,用于检测未修剪监控视频中的异常。所提出的可变形时空编码器(DSTE)关注跨帧的稀疏信息位置,避免了密集注意力的二次复杂度,同时保留了捕捉不规则运动模式的能力。因果异常分类器(CAC)对片段级特征应用扩张因果卷积,并优化对比多实例学习目标,无需帧级标签即可分离异常和正常表示。为满足部署约束,自适应置信度调度器(ACS)在推理时动态跳过低信息帧,减少静态场景中的冗余计算。在UCF-Crime、ShanghaiTech和CUHK Avenue上评估,VigilFormer在单GPU上以41.5 FPS分别达到87.83%、97.21%和89.74%的AUC分数,在准确性和速度上均优于最近的弱监督方法。

英文摘要

Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video. The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels. To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74% respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.

2606.14734 2026-06-16 q-bio.MN cs.AI cs.LG 交叉投稿

BRIDGE: Biological Evidence Refinement and Heterogeneous Dynamic Gating for Gene Regulatory Networks

BRIDGE:基因调控网络的生物学证据精炼与异质动态门控

Ziyang Dong, Shanwen Tan, Hengchuang Yin, Wei Liu, Yifan Wang, Siyu Yi, Jiancheng Lv, Wei Ju

发表机构 * College of Computer Science(计算机科学学院) Sichuan University(四川大学) Xinjiang Technical Institute of Physics and Chemistry(新疆物理化学研究所) Chinese Academy of Sciences(中国科学院) School of Mathematics(数学学院) University of International Business and Economics(国际商务经济大学) School of Artificial Intelligence and Data Science(人工智能与数据科学学院)

AI总结 提出BRIDGE框架,通过共表达精炼视图和异质门控编码,从scRNA-seq数据中稳健推断基因调控网络,在多个基准数据集上取得最优性能。

Comments 19 pages, 10 figures, 7 tables

详情
AI中文摘要

动机:从单细胞RNA测序(scRNA-seq)数据推断基因调控网络(GRN)对于揭示细胞状态特异性转录程序至关重要。然而,scRNA-seq测量存在稀疏性和噪声,且实验验证的转录因子-靶基因相互作用仍然有限,使得可靠推断具有挑战性。尽管图神经网络已经推进了GRN预测,现有方法通常依赖生物学上无约束的图增强(如随机边扰动),并且对基因与细胞之间的信息传递控制不足。这些局限性可能扭曲调控结构,并在噪声和弱监督设置下削弱鲁棒性。结果:为解决这些问题,我们提出了一个创新框架,名为基因调控网络的生物学证据精炼与异质动态门控(BRIDGE)。BRIDGE从表达矩阵及其矩阵对偶中提取基因和细胞表示,并在基因空间和细胞空间中,在共表达精炼的调控视图与原始图之间,对自身和邻居进行对比学习。然后,它应用异质门控编码自适应地调节基因与细胞之间的信息传递,实现稳健的转录因子-靶基因预测。在涵盖三种网络类型和七种细胞类型的基准数据集上的实验表明,BRIDGE在大多数设置下达到了最先进的AUROC和AUPRC。特别是在特异性网络上,BRIDGE的平均AUPRC比第二好的基线GCLink提高了5%。在跨细胞类型的小样本迁移中,BRIDGE在所有六种目标细胞类型上始终优于GCLink和GENELink。在hESC上的案例研究进一步支持了预测的生物学相关性,其中前10个中的9个和前100个中的46个新型转录因子-靶基因相互作用得到了ChIPBase的验证。

英文摘要

Motivation: Gene regulatory network inference from single-cell RNA sequencing (scRNA-seq) data is important for uncovering cell-state-specific transcriptional programs. However, scRNA-seq measurements are sparse and noisy, and experimentally validated TF-target interactions remain limited, making reliable inference challenging. Although graph neural networks have advanced GRN prediction, existing methods often rely on biologically unconstrained graph augmentation, such as random edge perturbation, and insufficiently control information transfer between genes and cells. These limitations may distort regulatory structures and weaken robustness under noisy and weakly supervised settings. Results: To address these issues, we propose an innovative framework named Biological Evidence Refinement and Heterogeneous Dynamic Gating for Gene Regulatory Networks (BRIDGE). BRIDGE extracts gene and cell representations from the expression matrix and its matrix dual, and performs contrastive learning in the gene space and cell space between self and neighbors across the co-expression-refined regulatory view and the original graph. It then applies heterogeneous gated encoding to adaptively regulate information transfer between genes and cells, enabling robust transcription factor-to-target gene prediction. Experiments on benchmark datasets spanning three network types and seven cell types show that BRIDGE achieves state-of-the-art AUROC and AUPRC in most settings. In particular, on Specific networks, BRIDGE improves average AUPRC by 5% over the second-best baseline, GCLink. In cross-cell-type few-shot transfer, BRIDGE consistently outperforms GCLink and GENELink across all six target cell types. A case study on hESC further supports the biological relevance of the predictions, with 9 of the top 10 and 46 of the top 100 novel TF-target interactions validated by ChIPBase.

2606.14749 2026-06-16 cs.CV cs.AI 交叉投稿

Automated 3D Kinematic Monitoring for Circadian Activity and Anomaly Detection in Juvenile Fish

幼鱼昼夜活动与异常检测的自动化三维运动监测

Chih-Wei Huang, Chang-Wen Huang, Chung-Ping Chiang, Tsung-Wei Pan

发表机构 * AI Research Center, National Taiwan Ocean Univ.(台湾海洋大学人工智能研究中心) Dept. of Aquaculture, National Taiwan Ocean Univ.(台湾海洋大学水产养殖系) Center of Excellence for the Oceans, National Taiwan Ocean University(台湾海洋大学海洋卓越研究中心)

AI总结 提出结合深度学习目标检测与双目立体视觉的高通量3D行为表型框架,实现高密度环境下幼鱼实时监测、体长估计和3D轨迹重建,首次量化自由游动幼鱼的真实物理速度,建立昼夜运动基线用于生理应激预警。

详情
AI中文摘要

精准水产养殖在追踪高分辨率行为特征方面面临“表型瓶颈”,因为传统方法无法量化瞬时三维(3D)身体活动。为解决这一问题,我们提出了一种高通量3D行为表型框架,将深度学习目标检测与双目立体视觉相结合,用于高密度环境下幼年罗非鱼的实时监测。该系统自动进行非接触式体长估计,并从绝对空间坐标重建3D游泳轨迹。通过消除2D透视畸变,该方法精确量化了3D速度和加速度,首次实现了对自由游动幼鱼真实物理游泳速度的估计。结果表明,该框架成功建立了昼夜运动基线,可作为生理应激的早期预警系统,并为鱼类活力提供客观指标。

英文摘要

Precision aquaculture faces a "phenotyping bottleneck" in tracking high-resolution behavioral traits, as conventional methods cannot quantify instantaneous three-dimensional (3D) physical exertion. To address this, we present a high-throughput 3D behavioral phenotyping framework integrating deep learning object detection with binocular stereo vision for real-time monitoring of juvenile tilapia in high-density environments. The system automates non-contact body length estimation and reconstructs 3D swimming trajectories from absolute spatial coordinates. By eliminating 2D perspective distortions, this approach precisely quantifies 3D velocity and acceleration, marking the first estimation of true physical swimming speeds in free-roaming juveniles. Results show the framework successfully establishes circadian locomotor baselines, serving as an early warning system for physiological stress and providing an objective metric for fish vitality.

2606.14759 2026-06-16 cs.CV cs.AI 交叉投稿

Temporally Consistent and Controllable Video Generation of 2D Cine CMR via Latent Space Motion Modeling

基于潜在空间运动建模的二维电影心脏磁共振时序一致且可控视频生成

Yiheng Cao, Gustavo Andrade-Miranda, Jiatian Zhang, Guillaume Sallé, Xin Gao

发表机构 * Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences(苏州生物医学工程与技术研究所,中国科学院) SyCoIA, IMT Mines Ales(SyCoIA,IMT Mines Ales)

AI总结 提出一种文本到视频生成方法,通过解耦心脏空间结构与时间运动,利用微调扩散模型合成初始帧,再以心脏相位嵌入条件化潜在流模型生成完整运动,实现高时序一致性和解剖可控性。

详情
Journal ref
ISBI 2026 - IEEE International Symposium on Biomedical Imaging, Apr 2026, London, United Kingdom. pp.1-4
AI中文摘要

电影心脏磁共振是评估心脏功能的金标准,但公共数据集的稀缺限制了先进数据驱动模型的发展。为解决这一限制,我们提出一种生成方法,用于合成时间上连贯且解剖上一致的心脏序列。我们的文本到视频框架将心脏空间结构与时间运动解耦。首先,一个微调的扩散模型根据临床文本提示合成初始帧,控制解剖特征。然后,一个以心脏相位嵌入为条件的潜在流模型生成完整的心脏运动,确保空间一致性和时间控制。我们的模型生成解剖和病理多样化的序列,具有高时间连贯性和对输入提示的强保真度,图像真实感的FID为31.68,文本-图像对齐的CLIP得分为31.04。这些实验结果突显了其产生高保真、按需医疗数据的潜力,为数据稀缺提供了可扩展的解决方案。

英文摘要

Cine cardiac magnetic resonance is the gold standard for assessing cardiac function, but the scarcity of public datasets limits the development of advanced data-driven models. To address this limitation, we propose a generative method for synthesizing temporally coherent and anatomically consistent cardiac sequences. Our text-to-video framework decouples cardiac spatial structure from temporal motion. First, a fine-tuned diffusion model synthesizes an initial frame from a clinical text prompt, controlling anatomical features. Then, a latent flow model conditioned on a cardiac phase embedding generates the complete cardiac motion, ensuring spatial consistency and temporal control. Our model generates anatomically and pathologically diverse sequences with high temporal coherence and strong fidelity to input prompts, achieving a FID of 31.68 for image realism and a CLIP score of 31.04 for text-image alignment. These experimental results highlight its potential to produce high-fidelity, on-demand medical data, offering a scalable solution to data scarcity.

2606.14766 2026-06-16 cs.CV cs.AI cs.MA 交叉投稿

XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems

XMedFusion:面向自主医疗系统的知识引导多模态感知与推理框架

Hamza Riaz, Arham Haroon, Maha Baig, Muhammad Dawood Rizwan, Muhammad Naseer Bajwa, Muhammad Moazam Fraz

发表机构 * National University of Sciences and Technology (NUST)(巴基斯坦国立科技大学) University of Oxford(牛津大学)

AI总结 提出XMedFusion模块化AI框架,通过视觉感知、知识图谱构建和检索引导生成等智能体协同,增强放射学报告生成的视觉基础与临床发现捕捉能力,在公共数据集上显著优于基线模型。

Comments Accepted at the 2026 International Conference on Robotics and Automation in Industry (ICRAI)

详情
AI中文摘要

自主医疗和机器人系统日益依赖智能感知与推理能力来解释视觉数据并支持临床决策。放射学报告生成是此类自动化诊断工作流的关键组成部分,然而现有的端到端多模态模型常因视觉基础薄弱而导致不可靠的解释和细微临床发现的遗漏。本文提出XMedFusion,一个模块化AI框架,设计为自主医疗系统的智能感知与推理模块。该框架将视觉信息分解为协调的功能组件,模拟专家驱动的分析,包括提取图像基础证据的视觉感知智能体、构建临床相关发现结构的知识图谱构建智能体,以及确保报告结构一致的检索引导起草过程。合成智能体通过推理驱动的验证迭代整合视觉和结构化证据,生成可靠且可解释的诊断输出。在公共胸部X光片数据集上的实验评估表明,与基线视觉-语言模型相比,在BLEU-1上提升0.0493至0.3359,ROUGE-L上提升0.0863至0.2440,METEOR上提升0.0829至0.1708,同时在语义评估指标如一致性(2.38至7.80)和准确性(2.34至6.93)上也有显著提升。结果突出了结构化多智能体感知与推理在增强智能医学成像系统的鲁棒性、透明度和自动化方面的有效性,使其能够集成到自主医疗和机器人诊断工作流中。

英文摘要

Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision-language models, achieving gains from 0.0493 to 0.3359 in BLEU-1, 0.0863 to 0.2440 in ROUGE-L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.

2606.14786 2026-06-16 cs.MM cs.AI cs.CV 交叉投稿

MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Reproduced Content Identification

MatchLM2Lite: 一种可扩展的MLLM-to-Lite框架用于重复内容识别

Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Zirui Zhu, Kanchan Sarkar, Kun Xu

发表机构 * Tiktok(字节跳动) National University of Singapore School of Computing(新加坡国立大学计算机学院)

AI总结 提出MatchLM2Lite框架,通过将多模态大语言模型蒸馏为轻量模型,实现视频、音频和文本联合建模的实时重复内容识别,在降低35倍计算成本的同时保持高准确率,并成功部署于大规模生产环境。

详情
AI中文摘要

内容审核对于在线视频平台确保内容安全、保护创作者和维持积极的用户体验至关重要。除了过滤有害内容,平台必须大规模保证内容真实性,以便用户接触到多样化、原创的视频,而非低价值的重复内容。我们提出MatchLM2Lite,一个实时、生产级的重复内容识别(RCI)系统,它利用多模态大语言模型(MLLM)的强大理解能力,将其蒸馏为一个小型且推理速度快的模型。我们的系统联合建模视频、音频和文本信号,对视频对进行操作以生成细粒度的重复分数。该系统包含两个模块,MatchLM和MatchLite,以及一个两阶段训练方案。首先,我们高容量的MLLM,MatchLM,作为教师模型定义RCI性能的上限。然后,其能力被蒸馏到一个紧凑的学生模型MatchLite中。这种设计使MatchLite能够在视频对上实现低延迟、高吞吐量的推理,同时保留MatchLM的大部分准确性,使其适合集成到实时推荐系统中。MatchLM相比我们之前的生产模型F1分数提高了+8.57。经过知识蒸馏后,MatchLite保留了+6.55的F1分数提升,同时计算成本降低了35倍。大规模部署后,MatchLM2Lite实现了高效的成对多模态RCI,以高每秒查询数(QPS)稳定服务在线流量,端到端延迟低于30秒。该系统在不降低用户参与度的情况下,将我们平台上的重复视频观看率降低了2.5%,证明了其在大规模生产环境中的有效性。

英文摘要

Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original videos rather than low-value reproductions. We present MatchLM2Lite, a real-time, production-grade reproduced content identification (RCI) system that leverages the powerful understanding of a multimodal large language model (MLLM) distilled into a small and fast-inference model. Our system jointly models video, audio, and text signals, operating on pairs of videos to produce fine-grained reproduction scores. The system comprises two modules, MatchLM and MatchLite, and a two-stage training recipe. First, our high-capacity MLLM, MatchLM, serves as a teacher model to define the upper bound of RCI performance. Its capabilities are then distilled into a compact student model, MatchLite. This design allows MatchLite to deliver low-latency, high-throughput inference on video pairs while preserving much of MatchLM's accuracy, making it suitable for integration into real-time recommendation systems. MatchLM achieves an F1-score improvement of +8.57 compared to our previous production model. After knowledge distillation, MatchLite retains a +6.55 gain in F1-score while reducing computational cost by 35x. Deployed at scale, MatchLM2Lite enables efficient, pairwise multimodal RCI, stably serving online traffic at high queries per second (QPS) with an end-to-end latency below 30 seconds. This system has reduced the reproduced video view rate on our platform by 2.5% without degrading user engagement, demonstrating its effectiveness in a large-scale production environment.

2606.14788 2026-06-16 cs.SD cs.AI cs.LG eess.AS 交叉投稿

Unifying Acoustic Features and Text with Multimodal LLMs for Neurodegenerative Screening

统一声学特征与文本的多模态大语言模型用于神经退行性疾病筛查

Qingfeng Zhang, Yuanxiong Guo, Yanmin Gong

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出NeurMLLM框架,通过多模态大语言模型融合声谱图、MFCC和文本,实现阿尔茨海默病和帕金森病的精细分期,优于传统方法和现有LLM方法。

Comments IEEE International Conference on Healthcare Informatics, 2026

详情
AI中文摘要

基于语音的筛查为评估阿尔茨海默病(AD)和帕金森病(PD)等神经退行性疾病提供了一种可扩展且非侵入性的方式,但由于整合异质数据的困难,其分期仍然具有挑战性。本文提出了NeurMLLM,一种用于神经退行性疾病分期的高效多模态生成框架。NeurMLLM首先使用视觉变换器对音频数据的声谱图和梅尔频率倒谱系数进行编码,并将其表示投影到大语言模型(LLM)的嵌入空间中,在那里它们与转录文本和人口统计指令标记连接成一个统一的序列。然后,通过低秩适应使用任务提示对LLM进行指令微调,以自回归方式预测受限的标签标记,从而实现生成式分类。通过在Bridge2AI-Voice数据集上对AD和PD进行细粒度分期评估,我们观察到NeurMLLM取得了强劲的性能,持续优于经典机器学习方法和现有的基于LLM的方法。结果表明,多模态LLM在神经退行性疾病分期中具有巨大潜力,提高了分期准确性并支持可访问的部署。

英文摘要

Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD), but their staging remains challenging due to the difficulty of integrating heterogeneous data. This paper presents NeurMLLM, an efficient multimodal generative framework for neurodegenerative disease staging. NeurMLLM first encodes the spectrograms and Mel-frequency cepstral coefficients of audio data with vision transformers and projects their representations into the embedding space of a large language model (LLM), where they are concatenated with transcript and demographic instruction tokens as a single unified sequence. The LLM is then instruction-tuned via Low-Rank Adaptation using task prompts to autoregressively predict a constrained label token, enabling a generative classification. By evaluating on the Bridge2AI-Voice dataset for fine-grained staging of AD and PD, we observe that NeurMLLM achieves strong performance, consistently outperforming classical machine learning methods and existing LLM-based approaches. The results show the high potential of multimodal LLMs in neurodegenerative disease staging, improving staging accuracy and supporting accessible deployment.

2606.14813 2026-06-16 hep-ph cs.AI cs.LG 交叉投稿

JetParticle-JEPA: An Efficient Self-Supervised Representation Learning method for Jet Tagging in High-Energy Physics

JetParticle-JEPA:一种用于高能物理喷注标记的高效自监督表示学习方法

Guillaume Letellier, Antonin Vacheret, Frédéric Jurie

发表机构 * GREYC, Normandy University, Unicaen, ENSICAEN, UMR CNRS 6072(GREYC,诺曼底大学,Unicaen,ENSICAEN,CNRS UMR 6072) LPC, Normandy University, Unicaen, ENSICAEN, IN2P3, UMR CNRS 6534(LPC,诺曼底大学,Unicaen,ENSICAEN,IN2P3,CNRS UMR 6534)

AI总结 提出JetParticle-JEPA,一种基于粒子Transformer的自监督联合嵌入预测架构,无需标记或重建原始输入,直接从连续粒子云学习物理有意义的喷注表示,在JetClass等基准上达到与全监督方法相当的性能,并在低标签场景下超越监督基线。

详情
AI中文摘要

大型强子对撞机上的喷注标记越来越依赖于在大量模拟数据集上训练的深度学习模型,导致计算成本高且对探测器建模误差的鲁棒性有限。我们引入了JetParticle-JEPA (JP-JEPA),一种自监督联合嵌入预测架构,它直接从连续粒子云中学习物理有意义的喷注表示,无需对原始输入进行标记化或重建。基于粒子Transformer主干,JP-JEPA在保留细粒度运动学相关性的同时预测被掩码粒子的潜在表示。在JetClass基准上,JP-JEPA在完整数据集上实现了与全监督最先进方法相当的性能,在低标签场景下超越了监督基线,并显著优于现有的自监督学习方法。在顶夸克和夸克-胶子喷注标记基准上,它与监督方法保持同等水平。学习到的表示还对缺失探测器信息表现出强鲁棒性,并改善了不确定性行为,凸显了JP-JEPA作为LHC上鲁棒且数据高效的喷注物理基础模型框架的潜力。

英文摘要

Jet tagging at the Large Hadron Collider increasingly relies on deep learning models trained on massive simulated datasets, leading to high computational costs and limited robustness to detector mismodeling. We introduce JetParticle-JEPA (JP-JEPA), a self-supervised Joint-Embedding Predictive Architecture that learns physically meaningful jet representations directly from continuous particle clouds without tokenization or reconstruction of raw inputs. Built on a Particle Transformer backbone, JP-JEPA predicts latent representations of masked particles while preserving fine-grained kinematic correlations. On the JetClass benchmark, JP-JEPA achieves performance comparable to fully supervised state-of-the-art methods on the full dataset, surpasses supervised baselines in low-label regimes, and significantly outperforms existing SSL approaches. On Top Quark and Quark-Gluon Tagging benchmarks, it remains on par with supervised methods. The learned representations also exhibit strong robustness to missing detector information and improved uncertainty behavior, highlighting JP-JEPA as a promising foundation-model framework for robust and data-efficient jet physics at the LHC.

2606.14817 2026-06-16 cs.IR cs.AI 交叉投稿

Combining Retrieval-Augmented Text Generation with LLMs for Reading Content Recommendations

结合检索增强文本生成与大型语言模型的阅读内容推荐

Sooyeon Kim, Piotr S. Maciąg

发表机构 * Institute of Computer Science, Warsaw University of Technology(计算机科学学院,华沙技术大学)

AI总结 提出结合检索增强生成(RAG)与大型语言模型的系统,通过四个模块实现个性化阅读内容生成,实验表明RAG将相关性和接地性提升26-35个百分点。

详情
AI中文摘要

本文介绍了使用大型语言模型(LLMs)结合检索增强生成(RAG)生成个性化阅读内容的系统的设计、实现和评估。所提出的架构由四个模块组成:输入、RAG、生成和评判,允许用户指定问题和目标阅读内容复杂度。RAG用于从互联网检索相关信息,丰富和支撑由三种现代LLM(Meta LLaMA 4 Scout、LLaMA 3.1 8B Instant和Google Gemma2 9B)生成的内容。使用三种提示策略(思维链、零样本和少样本)生成阅读材料,LLM-as-a-Judge模块自动评估答案质量及其与期望可读性水平的一致性。实验结果表明,RAG在所有模型和提示技术中一致地提高了系统性能,将相关性和特别是接地性提升了高达26-35个百分点。总体而言,研究结果表明,RAG增强架构有效地生成了符合用户查询和期望文本复杂度的阅读内容。

英文摘要

This work presents the design, implementation, and evaluation of a system for generating personalized reading content using Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG). The proposed architecture consists of four modules: Input, RAG, Generation, and Judging and enables users to specify both a question and a target reading content complexity. RAG is employed to retrieve relevant information from the Internet, enriching and grounding the content produced by three modern LLMs: Meta LLaMA 4 Scout, LLaMA 3.1 8B Instant, and Google Gemma2 9B. Reading materials are generated using three prompting strategies (Chain-of-Thought, zero-shot, and few-shot), and the LLM-as-a-Judge module automatically evaluates answer quality and alignment with the desired readability level. Experimental results show that RAG consistently improves system performance across all models and prompting techniques, increasing relevance and particularly groundedness by up to 26-35 percentage points. Overall, the findings demonstrate that the RAG-augmented architecture effectively produces reading content tailored to user queries and desired textual complexity.

2606.14821 2026-06-16 cs.IR cs.AI 交叉投稿

Co-Scraper: query-aware DOM Pruning and Reusable Scraper Synthesis for Lightweight Web Data Extraction

Co-Scraper: 查询感知的DOM剪枝与可复用爬虫合成用于轻量级网页数据提取

Shoupeng Wang, Jiantao Qiu, Wuyang Zhang, Conghui He

发表机构 * Shanghai Artificial Intelligence Laboratory, OpenDataLab(上海人工智能实验室,开放数据实验室) University of Science and Technology of China(中国科学技术大学)

AI总结 提出Co-Scraper两阶段框架,通过查询感知的DOM剪枝和稳定提取策略归纳,利用微调Qwen3-8B模型将网页内容转化为可执行程序化包装器,在SWDE测试集上达到94.78%的F1分数和90.39%的复用成功率。

详情
AI中文摘要

网页内容的丰富性和异质性使得自动化信息提取成为必要,而生成可在相似网页间复用的爬虫为可扩展的数据提取提供了有效解决方案。本文提出Co-Scraper,一个能够处理长HTML文档层次复杂性的两阶段框架。通过集成查询感知的DOM剪枝机制与稳定提取策略归纳,Co-Scraper利用微调的Qwen3-8B模型将网页内容有效转化为可执行的程序化包装器。在SWDE测试集上,Co-Scraper实现了94.78%的F1分数和90.39%的复用成功率,达到最先进性能。该框架显著提升了数据提取的准确性和鲁棒性,为网页数据获取任务提供了一种高效方法。

英文摘要

The abundant and heterogeneous nature of web content necessitates automated information extraction, and generating scrapers that can be reused across similar web pages offers an effective solution for scalable data extraction. In this work, we propose Co-Scraper, a two-stage framework capable of handling the hierarchical complexity of long HTML documents. By integrating a query-aware DOM pruning mechanism with stable extraction strategy induction, Co-Scraper can effectively transforms web content into executable programmatic wrappers using a fine-tuned Qwen3-8B model. On the test set of SWDE, Co-Scraper achieves state-of-the-art performance with an F1 score of 94.78% and a reuse success rate of 90.39%. This framework significantly enhances the accuracy and resilience of data extraction, providing a highly efficient approach for web data acquisition tasks.

2606.14823 2026-06-16 q-bio.GN cs.AI cs.CL 交叉投稿

Human genetic evidence is associated with drug approval across therapeutic areas: an observational analysis of 26,278 target-disease pairs with temporal validation and feature ablation

人类遗传证据与跨治疗领域药物批准相关:一项基于26,278个靶点-疾病对的观察性分析,含时间验证和特征消融

Victoria Paterson

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院)

AI总结 本研究通过分析26,278个靶点-疾病对,发现具有遗传关联的靶点药物批准率是无遗传关联的3.25倍,但遗传证据单独预测价值有限,并识别出1,433个遗传支持的早期阶段靶点-疾病对作为假设生成资源。

详情
AI中文摘要

遗传证据在已批准药物靶点中富集:在一项对来自Open Targets和ChEMBL的26,278个靶点-疾病对的观察性分析中,具有任何遗传关联的靶点批准率是无遗传关联靶点的3.25倍(OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42)。一项考虑共享同一基因的靶点-疾病对非独立性的靶点水平分析给出的OR为2.79(bootstrap 95% CI 2.22-3.53);肿瘤学对水平OR为6.72,在靶点水平衰减至2.71,说明非独立性会夸大特定领域的估计值。该富集在2015年后的批准中得以复现(OR = 3.51, p = 1.72e-8)。跨六种证据类型的特征消融显示,仅文献挖掘就占分类器性能的大部分(AUPRC = 0.099,而所有特征为0.109),这与批准后出版物导致的时间泄漏一致。排除文献后,其余证据类型仍保留高于基线的信号(AUPRC = 0.084,为基线的1.63倍)。敏感性分析将对水平OR的范围限定在3.25至4.93之间。仅遗传证据的AUPRC绝对增益仅为1.0个百分点,且最佳模型校准较差;该分类器的实际预测价值有限。我们编录了1,433个遗传支持的1/2期靶点-疾病对作为假设生成资源。所有发现均为观察性结果。

英文摘要

Genetic evidence is enriched among approved drug targets: in an observational analysis of 26,278 target-disease pairs from Open Targets and ChEMBL, targets with any genetic association had a 3.25-fold higher approval rate than those without (OR = 3.25, 95% CI 2.79-3.79, p = 1.91e-42). A target-level analysis accounting for non-independence of pairs sharing the same gene gave OR = 2.79 (bootstrap 95% CI 2.22-3.53); the oncology pair-level OR of 6.72 attenuates to 2.71 at the target level, illustrating how non-independence inflates area-specific estimates. The enrichment replicated in post-2015 approvals (OR = 3.51, p = 1.72e-8). Feature ablation across six evidence types revealed that literature mining alone accounts for most classifier performance (AUPRC = 0.099 versus 0.109 for all features), consistent with temporal leakage from post-approval publications. Excluding literature, remaining evidence types retain above-baseline signal (AUPRC = 0.084, 1.63x baseline). Sensitivity analyses bracket the pair-level OR between 3.25 and 4.93. Genetic evidence alone yields only a 1.0-percentage-point absolute AUPRC gain and the best model has poor calibration; the classifier has limited practical predictive value. We catalogue 1,433 genetically supported Phase 1/2 pairs as a hypothesis-generating resource. All findings are observational.

2606.14828 2026-06-16 eess.IV cs.AI cs.CV 交叉投稿

Leptomeningeal Collateral Detection on DSA via Vessel-Graph Neural Networks

基于血管图神经网络的DSA软脑膜侧支检测

Junyong Cao, Hakim Baazaoui, Chinmay Prabhakar, Suprosanna Shit, Lukas Bastian Otto, Susanne Wegener, Bjoern Menze, Ezequiel de la Rosa

发表机构 * University of Zurich(苏黎世大学) University Hospital Zurich(苏黎世大学医院)

AI总结 提出一种混合图-像素架构,在DSA血管图上对单个血管段分类,首次实现DSA中软脑膜侧支的个体化检测,PR-AUC达0.434,优于纯图或纯像素方法。

详情
AI中文摘要

软脑膜侧支(LMCs)是急性缺血性卒中的重要预后因素。现有自动化方法依赖CT血管造影(CTA),但单个LMCs通常太小而无法在CTA上分辨,限制了这些方法只能进行粗略的侧支评分。数字减影血管造影(DSA)以更高的分辨率可视化单个侧支,但当前评估仍依赖主观的手动分级量表,存在评分者间一致性差的问题。我们提出一个框架,将侧支检测形式化为对从DSA导出的图上的单个血管段进行分类。一种混合图-像素架构将拓扑感知的图分支与密集像素分支相结合,在共享的节点概率空间中融合。在五折交叉验证中,融合模型的PR-AUC达到0.434,优于纯图(0.403)和纯像素(0.362)基线。据我们所知,这是首个能够在DSA中实现LMCs个体化的方法,允许对每个血管进行精确的定量评估。这种整合将DSA评估转向客观评价,支持未来对单个LMCs的生物标志物和模式发现。

英文摘要

Leptomeningeal collaterals (LMCs) are an important prognostic factor in acute ischemic stroke. Existing automated methods rely on CT angiography (CTA), but individual LMCs are often too small to be resolved on CTA, limiting these methods to coarse collateral scoring. Digital subtraction angiography (DSA) visualizes individual collaterals at superior resolution, yet current assessment remains subjective, relying on manual grading scales that suffer from poor inter-rater agreement. We present a framework that formulates collateral detection as the classification of individual vessel segments on a graph derived from DSA. A hybrid graph-pixel architecture combines a topology-aware graph branch with a dense pixel branch, fused in a shared node-probability space. In a five-fold cross-validation setting, the fused model achieves a PR-AUC of 0.434, outperforming the graph-only (0.403) and pixel-only (0.362) baselines. To our knowledge, this is the first method to enable the individualization of LMCs in DSA, allowing for precise per-vessel quantitative assessment. This integration shifts DSA assessment toward objective evaluation, supporting future biomarker and pattern discovery for individual LMCs.

2606.14871 2026-06-16 cs.CV cs.AI 交叉投稿

An Ensemble Deep Learning Approach for Reliable and Scalable Lemon Leaf Disease Classification

一种可靠且可扩展的柠檬叶病害分类集成深度学习方法

Shayan Abrar, Sudeepta Mandal, Abdul Awal Yasir, Sonjoy Bhattacharjee, Sadman Haque Bhuiyan, Samanta Ghosh, Rafi Ahamed

发表机构 * Dept. of CSE(计算机科学与工程系) American International University-Bangladesh(美国国际大学-孟加拉国) East West University(东-西大学) North South University(北南大学)

AI总结 提出集成InceptionV3和MobileNetV2的深度学习方法,结合对抗训练和Grad-CAM可视化,在9类柠檬叶病害数据集上达到99.27%准确率,实现可靠分类。

Comments 5 pages, 12 figures, 3 Tables, Presented at 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026

详情
AI中文摘要

植物病害的早期检测对植物和农民至关重要。植物病害会降低水果的产量和品质,并且植物在感染后更容易受到其他胁迫的影响。柠檬叶病害数据集包含1354张图像,分为9个类别,其中仅1个类别为健康叶片,其余8个类别为叶片病害。经过全面预处理后,数据集被划分为训练集(70%)、测试集(15%)和验证集(15%)。应用了两个预训练模型(InceptionV3和MobileNetV2),然后使用集成技术将这些模型组合起来以提高鲁棒性。集成模型表现出99.27%的准确率。应用对抗训练以提高模型的能力,并确保在噪声数据下的可靠预测。Grad-CAM可视化突出了叶片图像的重要区域,从而验证了模型预测的置信度。

英文摘要

Early detection of plant diseases is crucial to plants and for the farmers. Plant diseases reduce fruit yield and quality, and plants are more susceptible to other stresses when they are infected. The lemon leaf disease dataset contains 1354 images. The dataset has 9 classes. Among the 9 classes only one class is for healthy leaf, and the other 8 classes are leaf diseases. The dataset was split into training (70%), testing (15%) and validation (15%) sets after comprehensive preprocessing. Two pretrained models (InceptionV3 and MobileNetV2) were applied and then combined these models using an ensemble technique to boost robustness. Ensemble models showed a promising performance of 99.27% accuracy. Adversarial Training is applied to improve models' ability and ensure reliable predictions under noisy data. Grad-CAM visualization highlights the important regions of leaf images that validate the model prediction with confidence level.

2606.14886 2026-06-16 cs.CV cs.AI 交叉投稿

Improved Knowledge Distillation for Land-Use Image Classification

改进的知识蒸馏用于土地利用图像分类

Arundhuti Sur, Abhiroop Chatterjee, Susmita Ghosh, Emmett Ientilucci

发表机构 * Jadavpur University(贾达沃大学) Rochester Institute of Technology(罗切斯特理工学院)

AI总结 提出一种改进的知识蒸馏框架,通过VGG16教师网络向轻量MobileNetV2学生网络传递知识,结合硬监督和软监督策略,在三个数据集上达到99.04%准确率,优于基线方法。

Comments Accepted by IGARSS 2026

详情
AI中文摘要

本文提出了一种改进的知识蒸馏(KD)框架,用于高效压缩深度卷积神经网络以完成土地利用图像分类任务。受在降低计算复杂度的同时实现竞争性分类准确率的需要的驱动,采用教师-学生学习范式,其中VGG16网络将知识传递给轻量级MobileNetV2模型。所提出的框架将来自真实标签的硬监督与结合了Kullback-Leibler散度和余弦相似度损失的软监督策略相结合。在三个土地利用数据集上进行的实验表明,所提出的基于KD的方法性能提升,达到了99.04%的准确率,优于基线学生训练和单损失蒸馏方法,同时保持了显著的模型压缩。

英文摘要

In the present article, an improved Knowledge Distillation (KD) framework has been proposed for efficient compression of deep convolutional neural networks for land-use image classification task. Motivated by the need to achieve competitive classification accuracy while reducing computational complexity, a teacher-student learning paradigm is adopted in which a VGG16 network transfers knowledge to a lightweight MobileNetV2 model. The proposed framework integrates hard supervision from ground truth labels with a soft supervision strategy that combines Kullback-Leibler divergence and Cosine Similarity losses. Experiments conducted on three land-use datasets show that the proposed KD-based method yields improved performance, and achieves an accuracy of 99.04%, outperforming both baseline student training and single-loss distillation approaches, while retaining substantial model compression.

2606.14912 2026-06-16 cs.CV cs.AI 交叉投稿

Mask Proposal Voting Based on Geodesic Framework for Robust Image Segmentation

基于测地线框架的掩膜提议投票用于鲁棒图像分割

Li Liu, Mingzhu Wang, Zhenjiang Li, Da Chen, Laurent D. Cohen

发表机构 * Yuanshen Rehabilitation Institute, Shanghai Jiao Tong University School of Medicine(上海交通大学医学院附属瑞金康复医院) Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine(上海中医药大学附属岳阳中西医结合医院) Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University, Shandong Academy of Medical Sciences(山东第一医科大学附属山东省肿瘤医院放疗科) University Paris Dauphine, PSL Research University, CNRS, UMR 7534, CEREMADE(巴黎多芬纳大学,PSL研究大学,法国国家科学研究中心,UMR 7534,CEREMADE)

AI总结 提出一种掩膜提议投票框架,通过自适应域构造和加权投票机制克服经典最小路径法对初始化的依赖,在复杂场景下实现鲁棒分割。

详情
AI中文摘要

尽管取得了巨大进步,但准确的分割仍然是一项具有挑战性的任务,尤其是在背景杂乱、强度变化复杂和拓扑外观多样的场景中。最小路径模型在解决图像分割任务中展现了强大的能力。然而,基于最小路径的分割方法的性能严重受限于模型初始化,从而限制了其在实际中的应用范围。在这项工作中,我们提出了一种新颖的掩膜提议投票框架,克服了经典方法的主要缺点,即使在复杂场景下也能实现鲁棒分割。首先,我们引入了一种高效的方法来构建自适应域切割,作为初始化基于区域的最小割演化的约束,从而可以生成多样且可靠的掩膜提议候选,大大增加了这些提议准确覆盖目标区域的可能性。其次,我们提出了一种新的掩膜投票方案,构建编码最终分割信息的投票得分图。与经典的路径投票方法相比,我们的模型允许引入先验知识,为每个单独的掩膜分配不同的重要性。因此,所提出的分割模型能够在复杂场景下准确描绘对象边界,并且对初始化不敏感。实验表明,我们的方法在准确性和鲁棒性上始终优于最先进的基于最小路径的方法。

英文摘要

Despite great advances, finding accurate segmentation remains a challenging task, especially in scenarios with cluttered backgrounds, complex intensity variations and topology appearance. Minimal path models have exhibited their strong ability in addressing image segmentation tasks. However, the performance of minimal paths-based segmentation approaches is heavily influenced by model initialization, hence limiting their application scope in practice. In this work, we propose a novel mask proposal voting framework that overcomes the major drawback of classical approaches, allowing robust segmentation even in complicated scenarios. Firstly, we introduce an efficient method for constructing adaptive domain cuts as a constraint for initializing the region-based min-cut evolution, by which diverse and reliable mask proposal candidates can be generated, substantially increasing the possibility of accurately covering the objective region by these proposals. Secondly, we propose a new mask voting scheme to build a voting score map encoding the final segmentation information. In contrast to classical path voting methods, our model allows incorporating priors to assign different importance to each individual mask. As a consequence, the proposed segmentation model is capable of accurately delineating object boundaries under complex scenarios, and is insensitive to initialization. Experiments demonstrate that our method consistently outperforms state-of-the-art minimal path-based approaches in both accuracy and robustness.

2606.14922 2026-06-16 cs.SD cs.AI cs.CL eess.AS 交叉投稿

An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis

情感语音合成中学习潜在表示的实证研究

Vinh Dang Quang, Huy Ngo Quang

发表机构 * Aimesoft JSC

AI总结 本文针对VLSP 2022情感语音合成任务,通过将说话人嵌入和韵律瓶颈集成到FastSpeech 2中,实现了单说话人情感语音生成及跨说话人风格迁移。

Comments 4 pages

详情
AI中文摘要

在过去的几年中,由于深度学习,语音合成领域取得了巨大进步。越来越多的基于深度学习的TTS系统被开发出来,使得生成具有高可懂度和自然度的语音成为可能。同时,控制表现力仍然是一个大问题,以不同风格或方式生成语音最近受到了社区的广泛关注。本文旨在为VLSP 2022的情感语音合成(ESS)任务提供我们的解决方案,该任务允许从给定的输入文本生成具有所需情感表达的自然人声。通过将说话人嵌入、韵律瓶颈集成到FastSpeech 2中,我们的系统有望生成单个说话人的情感语音(子任务1),并将另一个说话人的说话风格迁移到具有中性非表达性数据的目标说话人,同时保留目标说话人的身份(子任务2)。

英文摘要

For the last couple of years, the field of speech synthesis has improved dramatically thanks to deep learning. There are more and more deep learning-based TTS systems developed to make it possible to produce voices with high intelligibility and naturalness. Meanwhile, controlling the expressiveness is yet a big deal, generating speech in different styles or manners has received a lot of attention from community recently. This paper aims to give our solutions to deal with the task emotional speech synthesis (ESS) at VLSP 2022 which allows to generate humanlike natural-sounding voice from a given input text with desired emotional expression. By integrating speaker embedding, prosody bottleneck into FastSpeech 2, our systems can promisingly generate emotional speech of a single speaker (Sub-task 1), transfer speaking styles from another speaker to the target speaker with neutral non-expressive data while retaining the target speaker's identity (Sub-task 2).

2606.14963 2026-06-16 cs.CV cs.AI 交叉投稿

Multi-Modal Attention for Automated Disaster Damage Assessment Using Remote Sensing Imagery and Deep Learning

基于遥感影像和深度学习的多模态注意力自动灾害损伤评估

Tewodros Syum Gebre, Jagrati Talreja, Leila Hashemi-Beni

发表机构 * Built Environment Department, College of Science and Technology, North Carolina A&T State University(北卡罗来纳农工州立大学科技学院建筑环境系) United Nations University Institute for Water, Environment and Health(联合国大学水、环境与健康研究所)

AI总结 提出一种多模态注意力机制融合双时相遥感影像的深度学习框架,实现建筑物损伤四分类(无/轻微/严重/毁坏),准确率达94.90%。

Comments This paper has been accepted for publication in ISPRS Congress 2026 and the 47th Canadian Symposium on Remote Sensing (CSRS 2026) Annals

详情
AI中文摘要

及时准确的灾害损伤评估对于有效的应急响应、资源分配和恢复至关重要。传统方法通常依赖人工检查或稀疏数据,往往速度慢且易出错。本文介绍了一种利用遥感影像和深度学习自动化建筑损伤分类的新框架。使用灾前和灾后卫星影像,我们的模型将建筑物分为四个损伤等级:无损伤、轻微损伤、严重损伤和毁坏。核心创新是一种多模态注意力机制,融合双时相特征以显式检测和评估结构变化。我们采用轻量级ConvNeXT-Tiny骨干网络,确保高效处理而不牺牲性能。主要贡献包括:(1)用于多模态数据融合的交叉注意力模块,(2)针对大规模数据集的优化预处理流程,以及(3)鲁棒的数据增强技术。在大规模灾害数据集上的实验表明,总体分类准确率达到94.90%。该模型能有效区分损伤类别,并对不完整数据保持鲁棒性。本系统显著提高了评估速度和准确性,有助于应急响应人员优先安排干预措施。本研究通过将多时相影像与深度学习相结合,推进了自动化灾害损伤检测,为实时响应提供了可扩展的解决方案。

英文摘要

Timely and accurate disaster damage assessment is crucial for effective emergency response, resource allocation, and recovery. Traditional methods, which often rely on manual inspections or sparse data, are typically slow and error-prone. This paper introduces a novel framework leveraging remote sensing imagery and deep learning to automate building damage classification. Using pre- and post-disaster satellite imagery, our model categorizes buildings into four damage levels: no damage, minor damage, major damage, and destroyed. The core innovation is a multi-modal attention mechanism that fuses bi-temporal features to explicitly detect and assess structural changes. We employ a lightweight ConvNeXT-Tiny backbone to ensure efficient processing without compromising performance. Key contributions include: (1) a cross-attention module for multi-modal data fusion, (2) an optimized preprocessing pipeline for large-scale datasets, and (3) robust data augmentation techniques. Experiments on a large-scale disaster dataset demonstrate an overall classification accuracy of 94.90%. The model effectively discriminates between damage categories and remains resilient to incomplete data. This system significantly improves assessment speed and accuracy, aiding emergency responders in prioritizing interventions. This work advances automated disaster damage detection by integrating multi-temporal imagery with deep learning, offering a scalable solution for real-time response.

2606.15052 2026-06-16 cs.AR cs.AI 交叉投稿

PANDA: An LLM-Enhanced Performance-Driven Analog Design Framework Bridging Design Intent and Layout Generation

PANDA:一种LLM增强的性能驱动模拟设计框架,弥合设计意图与版图生成

Haoyi Zhang, Weijian Fan, Xiaohan Gao, Bingyang Liu, Runsheng Wang, Yibo Lin

发表机构 * School of Integrated Circuits, Peking University(集成电路学院,北京大学) Beijing Advanced Innovation Center for Integrated Circuits(北京集成电路先进创新中心) Institute of Electronic Design Automation, Peking University(电子设计自动化研究所,北京大学)

AI总结 提出PANDA框架,利用大语言模型将高层设计意图转化为最终版图,通过引导拓扑综合、子结构感知尺寸优化和约束驱动版图生成,实现跨阶段协同设计,将设计周期从数天/周缩短至数小时并提升性能。

详情
AI中文摘要

传统模拟电路设计严重依赖拓扑、尺寸和版图的人工干预,先前的自动化方法孤立地处理各个阶段。在这项工作中,我们提出了PANDA,一个LLM增强的框架,通过引导拓扑综合、子结构感知尺寸优化和约束驱动版图生成,主动管理跨阶段依赖关系,将高层设计意图桥接到最终版图。这将自动化从以算法执行为中心转变为以意图为中心的协同设计,将设计周期从数天或数周缩短至数小时,同时提高设计性能。

英文摘要

Traditional design of analog circuits heavily relies on manual interventions across topology, sizing, and layout, with prior automation addressing stages in isolation. In this work, we propose PANDA, an LLM-enhanced framework that bridges high-level design intent to final layout by actively managing cross-stage dependencies through guided topology synthesis, substructure-aware sizing, and constraint-driven layout generation. This shifts automation from algorithm-centric execution to intent-centric co-design, reducing turnaround time from days or weeks to hours while improving design performance.

2606.15117 2026-06-16 cs.MM cs.AI cs.CV cs.LG cs.SD 交叉投稿

Teacher-Student Structure for Domain Adaptation in Ensemble Audio-Visual Video Deepfake Detection

用于集成视听视频深度伪造检测中领域适应的师生结构

Elham Abolhasani, Maryam Ramezani, Hamid R. Rabiee

发表机构 * Department of Computer Engineering, Sharif University of Technology(谢里夫理工学院计算机工程系)

AI总结 提出EAV-DFD方法,结合师生框架的领域适应机制,提升模型在未见领域上的泛化能力,在三个数据集上AUC分别提升4.09%、17.94%和0.5%。

详情
AI中文摘要

生成式AI模型的快速发展导致了更逼真的深度伪造媒体,包括对音频、视频或两者的操纵。这引发了严重的隐私和社会问题。该领域的许多研究已经取得了有前景的域内结果;然而,这些模型在面对来自不同领域的数据时,其有效性常常下降。因此,最近的深度伪造检测方法侧重于通过多种技术增强泛化能力,这些技术融合了所有输入模态,包括音频、图像及其交互。为此,我们提出了EAV-DFD方法,一种广义的深度集成视听模型(EAV-DFD),结合了利用师生框架的领域适应机制,以增强模型在未见领域上的表现和泛化能力。为了评估模型性能,我们使用FakeAVCeleb数据集作为主领域,DFDC、Deepfake_TIMIT和PolyGlotFake数据集作为未见领域。我们的实验结果表明,所提出的框架在领域适应方面是有效的,仅使用一小部分未见数据集训练学生模型,就在三个未见数据集上分别将模型的AUC性能提升了4.09%、17.94%和0.5%。这产生了一种新颖的深度伪造检测模型,能够适应新领域并解释哪个模态被操纵,突显了我们的方法在现实世界应用中的潜力。

英文摘要

The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however, these models frequently exhibit decreased efficacy when faced with data from dissimilar domains. Consequently, recent deepfake detection approaches focus on enhancing the generalization ability through multiple techniques that incorporate all input modalities, including audio, images, and their interactions. In this regard, we propose the EAV-DFD method, a generalized deep ensemble audio-visual model (EAV-DFD) combined with a domain adaptation mechanism utilizing a teacher-student framework to enhance the model's ability to perform and generalize effectively across unseen domains. To evaluate the model's performance, we used the FakeAVCeleb dataset as the primary domain and the DFDC, Deepfake_TIMIT, and PolyGlotFake datasets as an unseen domain. Our experimental results demonstrate that the proposed framework is efficient in domain adaptation, improving AUC performance of the model by 4.09%, 17.94%, and 0.5% on three unseen datasets, using only a small portion of them to train the student model. This leads to a novel deepfake detection model capable of adapting to new domains and interpreting which modality has been manipulated, highlighting the potential of our approach for real-world applications.

2606.15129 2026-06-16 cs.CV cs.AI 交叉投稿

EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP--OCT Pretraining

EyeMVP: 通过配对CFP-OCT预训练实现OCT启发的眼底表征学习

Zhuo Deng, Ruiheng Zhang, Ziheng Zhang, Weihao Gao, Yitong Li, Qian Wang, Lei Shao, Jiaoyue Dong, Zhixi Zeng, Lijian Fang, Haibo Wang, Xiaobin Lin, Tao Liu, Zhicheng Du, Zhengwei Zhang, Lin Yang, Zheng Gong, Xinyu Zhao, Zhenquan Wu, Fang Li, Zhiguang Zhou, Guoming Zhang, Sun Jing, Han Lv, Wenbin We, Lan Ma

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University(首都医科大学附属北京同仁医院北京同仁眼科中心) Liangxiang Hospital of Beijing Fangshan District, Capital Medical University(首都医科大学北京市房山区良乡医院) The Third People's Hospital of Dalian(大连市第三人民医院) National Clinical Research Center for Endocrine and Metabolic Diseases, The Second Xiangya Hospital of Central South University(中南大学湘雅二医院国家内分泌代谢病临床医学研究中心) The Central Hospital of Baoji City(宝鸡市中心医院) Wuxi No.2 People's Hospital, Affiliated Wuxi Clinical College of Nantong University(南通大学附属无锡临床学院无锡市第二人民医院) Shenzhen Eye Hospital, Southern Medical University(南方医科大学深圳眼科医院) Beijing Friendship Hospital, Capital Medical University(首都医科大学附属北京友谊医院)

AI总结 提出跨模态视网膜基础模型EyeMVP,利用配对CFP-OCT预训练,通过跨模态掩码重建将OCT结构信息注入CFP表征,在16项下游任务中优于现有模型,尤其对黄斑疾病诊断有显著提升。

详情
AI中文摘要

彩色眼底摄影(CFP)是大规模视网膜筛查的主要手段,但其诊断能力受限于缺乏深度分辨的结构信息。光学相干断层扫描(OCT)提供横截面视网膜解剖结构,但在人群筛查中可及性较低。本文提出EyeMVP,一种跨模态视网膜基础模型,通过配对CFP-OCT预训练学习OCT启发的CFP表征。EyeMVP在来自中国八家医院112,642名患者的674,893个严格同眼同天配对CFP-OCT图像三元组上预训练。该模型使用跨模态掩码重建,以OCT相关监督丰富CFP表征,同时在推理时仅需CFP图像。为适应正面CFP与横截面OCT之间的非对齐成像几何,EyeMVP将源约束交叉注意力与CFP导出的结构掩码相结合。在16项下游任务中,包括分类、分割、少样本适应和跨模态检索,EyeMVP优于代表性视网膜基础模型,并在涉及黄斑和视神经结构的任务上表现出一致提升。对于CFP具有挑战性的黄斑疾病,EyeMVP在黄斑水肿上达到0.948的AUROC(对比EyeCLIP的0.852),在近视性黄斑劈裂上达到0.825。在一项探索性读者研究中,EyeMVP在黄斑水肿上超过初级和中级眼科医生组,但未达到高级眼科医生水平,而在近视性黄斑劈裂上,其平衡准确性数值上高于所有读者组。这些结果表明,像素级跨模态重建可以用OCT相关监督丰富CFP表征,为筛查环境中基于CFP的更强视网膜分析提供了一条实用途径。

英文摘要

Color fundus photography (CFP) is the mainstay for large-scale retinal screening, yet its diagnostic capacity is constrained by the lack of depth-resolved structural information. Optical coherence tomography (OCT) provides cross-sectional retinal anatomy, but is less accessible in population-level screening. Here, we present EyeMVP, a cross-modal retinal foundation model that uses paired CFP--OCT pretraining to learn OCT-informed CFP representations. EyeMVP is pretrained on 674,893 strict same-eye same-day paired CFP--OCT image triples from 112,642 patients across eight hospitals in China. The model uses cross-modal masked reconstruction to enrich CFP representations with OCT-associated supervision, while requiring only CFP images at inference. To accommodate the non-aligned imaging geometry between en-face CFP and cross-sectional OCT, EyeMVP combines source-constrained cross-attention with CFP-derived structural masks. Across 16 downstream tasks, including classification, segmentation, few-shot adaptation, and cross-modal retrieval, EyeMVP outperforms representative retinal foundation models and shows consistent gains on tasks involving macular and optic nerve structure. For CFP-challenging macular diseases, EyeMVP achieves an AUROC of 0.948 for macular edema (vs.~0.852 for EyeCLIP) and 0.825 for myopic macular schisis. In an exploratory reader study, EyeMVP exceeds junior and intermediate ophthalmologist groups but does not reach senior ophthalmologist performance on macular edema, while showing numerically higher balanced accuracy than all reader groups on myopic macular schisis. These results suggest that pixel-level cross-modal reconstruction can enrich CFP representations with OCT-associated supervision, providing a practical route toward stronger CFP-based retinal analysis in screening settings.

2606.15176 2026-06-16 cs.CV cs.AI 交叉投稿

Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings

实现实时床旁超声分割:资源受限环境中的无GPU部署

Weihao Gao

发表机构 * School of Computer Science and Artificial Intelligence, Guangdong University of Education(广东第二师范学院计算机科学与人工智能学院)

AI总结 提出超轻量级架构UltraSeg,在CPU和移动设备上实现实时超声图像分割,性能媲美大型模型,消除GPU依赖,降低AI成本。

Comments 15 pages,4 figures

详情
AI中文摘要

超声成像因其低成本和高便携性成为全球最广泛使用的医学模态,然而人工智能(AI)的部署仍受限于对GPU加速模型的依赖,造成结构性矛盾:"智能"的成本超过了成像设备本身。在此,我们展示了UltraSeg的系统性适配和广泛评估,UltraSeg最初为结肠镜息肉分割设计的超轻量级架构,现被改造用于床旁超声(POCUS),涵盖跨越六个解剖部位(乳腺、甲状腺、肾脏、颈动脉、胎儿和小动物肿瘤)的十个公共数据集。我们在超声领域系统验证了两种变体:UltraSeg-130K(0.13M参数)在单核CPU上达到89.7 FPS,在翻新移动设备上达到34.8 FPS;而UltraSeg-500K(0.5M参数)在CPU上达到44.6 FPS,在移动设备上达到16.1 FPS。UltraSeg-500K在平均性能上匹配或超过31M参数的UNet,并接近105M参数的TransUNet,在外部验证集(UDIAT、DDTI)上具有优越的零样本跨数据集泛化能力。通过实现无需GPU依赖的临床级分割,本工作使AI成本与超声可及性相匹配,使先进诊断在资源受限环境中成为可能。

英文摘要

Ultrasound imaging is the most widely adopted medical modality globally due to its low cost and portability, yet artificial intelligence (AI) deployment remains constrained by reliance on GPU-accelerated models, creating a structural paradox where the cost of "intelligence" exceeds that of the imaging device itself. Here, we present the systematic adaptation and extensive evaluation of UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation, now engineered for point-of-care ultrasound (POCUS) across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). We systematically validate both variants in ultrasound domains: UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on single-core CPUs and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet and approaches 105M-parameter TransUNet in average performance, with superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI). By enabling clinical-grade segmentation without GPU dependency, this work brings AI costs in line with ultrasound accessibility, making advanced diagnostics available in resource-limited settings.

2606.15225 2026-06-16 cs.LG cs.AI cs.IR 交叉投稿

Edu-Theater: A Data-Efficient Agent Framework for Scalable Learner Behavior Simulation through Staging Roll-Call

Edu-Theater: 一种通过点名排演实现可扩展学习者行为模拟的数据高效智能体框架

Weibo Gao, Qi Liu, Linan Yue, Zheng Zhang, Yichao Du, Fangzhou Yao, Ao Yu, Zhenya Huang, Shijin Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室) Southeast University(东南大学) Alibaba Group(阿里巴巴集团) iFLYTEK Co., Ltd.(科大讯飞股份有限公司)

AI总结 提出Edu-Theater框架,通过构建群体水平能力先验和少量诊断查询,利用LLM智能体模拟学习者行为,在减少数据需求的同时提高模拟精度,并增强下游自适应测试等应用。

Comments LLM Agent, Educational Data Mining, Data Synthesis, Human Simulation

详情
AI中文摘要

大规模学习者-任务交互数据对智能教育系统至关重要,但收集成本高且受隐私和学习者参与度限制。学习模拟器在无需真实学习者持续参与的情况下,对模拟可扩展的学习者行为起着关键作用。然而,现有方法主要是**以个体为中心**,为每个学习者配对模拟器,从密集的交互历史中迭代推断潜在知识状态,这既数据密集又计算密集,且在冷启动场景中脆弱。我们提出一种**群体感知的点名模拟范式**,首先构建群体水平的能力先验,然后通过少量有针对性的诊断查询细化个体学习者状态。基于该范式,我们引入**Edu-Theater**,一个由LLM驱动的智能体系统,通过教师智能体和基于学习者日志的回顾性点名探测执行群体感知的学习者模拟。Edu-Theater无需每个学习者的密集历史即可实现可扩展的未来行为模拟。在两个真实世界数据集上的实验表明,Edu-Theater以显著更少的LLM调用实现了更高的模拟精度,生成的合成数据增强了自适应测试等下游应用。

英文摘要

Large-scale learner-task interaction data are crucial for intelligent educational systems but are costly to collect and constrained by privacy and learner engagement. Learner simulators play a critical role in simulating scalable learner behavior without the need for continuous involvement of real learners. However, existing methods are predominantly \textbf{individual-centric}, pairing a simulator with each learner to iteratively infer latent knowledge states from dense interaction histories, which is both data- and computation-intensive, and fragile in cold-start scenarios. We propose a \textbf{cohort-aware roll-call simulation paradigm} that first constructs cohort-level proficiency priors and refines individual learner states through a small number of targeted diagnostic queries. Based on this paradigm, we introduce \textbf{Edu-Theater}, an LLM-powered agent system that performs cohort-aware learner simulation via a teacher agent and retrospective roll-call probing over learner logs. Edu-Theater enables scalable future behavior simulation without the need for dense per-learner histories. Experiments on two real-world datasets demonstrate that Edu-Theater achieves higher simulation accuracy with significantly fewer LLM calls, producing synthetic data that enhances downstream applications such as adaptive testing.

2606.15250 2026-06-16 cs.CV cs.AI 交叉投稿

Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs

基于膝关节X光片的隐式神经形状函数的无地标下肢对齐评估

Zhisen Hu, Antti Kemppainen, David Johnson, Egor Panfilov, Huy Hoang Nguyen, Timothy Cootes, Claudia Lindner, Aleksei Tiulpin

发表机构 * Division of Informatics, Imaging and Data Sciences, The University of Manchester(曼彻斯特大学信息学、影像与数据科学部) Research Unit of Health Sciences and Technology, University of Oulu(奥卢大学健康科学与技术研究部) Medical Research Center Oulu, University of Oulu and Oulu University Hospital(奥卢大学与奥卢大学医院医学研究中心) Department of Trauma and Orthopaedics, Stockport NHS Foundation Trust, Stepping Hill Hospital(斯泰平希尔医院斯托克波特NHS基金会创伤与骨科) School of Health and Society, University of Salford(索尔福德大学健康与社会学院) School of Biological Sciences, The University of Manchester(曼彻斯特大学生物科学学院) Weill Cornell Medicine, Cornell University(康奈尔大学威尔康奈尔医学院)

AI总结 提出隐式神经形状函数(INSF)方法,无需显式地标,通过编码解剖形状到潜在空间并直接回归临床对齐测量,实现自动化下肢对齐评估,性能与现有方法相当且易于扩展。

Comments Accepted to MICCAI 2026

详情
AI中文摘要

下肢对齐(LLA)的放射学评估对于预测全膝关节置换术中的关节健康和手术结果至关重要。传统测量方法手动且耗时,而最近的机器学习方法通常依赖于定位一组固定的解剖标志。这种依赖性限制了灵活性,并且当临床定义发生变化时可能需要重新标注。为了解决这个问题,我们提出了一种使用隐式神经形状函数(INSF)的自动化工作流程。我们不依赖显式地标坐标,而是将解剖结构编码到紧凑的潜在空间中,并直接从这些潜在代码回归临床对齐测量。这种架构允许快速扩展到新任务,而无需改变骨干表示。我们在一个包含566张膝关节X光片的内部数据集上训练了我们的方法,每张图像都标注了股骨和胫骨的轮廓。我们在一个包含50名患者的内部测试数据集和一个来自MRKR数据集的402个术前病例的外部独立数据集上进行了评估。这些数据提供了手动临床测量,并且MRKR测量将公开可用。性能与最先进的基于地标的方法和手动一致性相当,同时提供了一种可扩展到其他测量任务的灵活形状表示。

英文摘要

Radiographic assessment of lower-limb alignment (LLA) is important for predicting joint health and surgical outcomes in total knee arthroplasty. Traditional measurement methods are manual and time-consuming, while recent machine learning approaches typically rely on locating a fixed set of anatomical landmarks. This dependence limits flexibility and may require re-annotation when clinical definitions change. To address this, we propose an automated workflow using Implicit Neural Shape Functions (INSF). Rather than relying on explicit landmark coordinates, we encode the anatomy into a compact latent space and regress clinical alignment measurements directly from these latent codes. This architecture allows for rapid extendability to new tasks without altering the backbone representation. We trained our method on an internal dataset of 566 knee radiographs, each annotated with the outline of the femur and tibia. We evaluated it on both an internal test dataset of 50 patients and a separate external set of 402 preoperative cases from the MRKR dataset. Manual clinical measurements are available for these data, and the MRKR measurements will be made publicly accessible. Performance was comparable to state-of-the-art landmark-based methods and manual agreement, while offering a flexible shape representation that can be extended to additional measurement tasks.

2606.15277 2026-06-16 cs.IR cs.AI cs.DB cs.ET cs.LG 交叉投稿

Guiding Federated Graph Recommendation with LLM-encoded knowledge

利用LLM编码知识指导联邦图推荐

Thi Minh Chau Nguyen, Hien Trang Nguyen, Duc Anh Nguyen, Van Ho-Long, Thanh Trung Huynh, Zhao Ren

发表机构 * institutetext(机构)

AI总结 针对联邦图推荐中跨客户端图表示对齐难的问题,提出利用大语言模型编码的语义信号指导结构表示的选择性聚合,提升推荐准确性。

Comments Technical Report

详情
AI中文摘要

基于图的推荐系统在从用户-物品交互中提取协同信号方面非常有效,联邦学习(FL)则可以在保护用户隐私的同时训练这些模型。然而,跨分布式、非独立同分布(non-IID)客户端聚合图表示仍然是一个挑战;局部学习的结构嵌入常常不对齐,简单的平均无法捕捉有意义的跨客户端关系。大多数现有的联邦图方法仅依赖结构聚合,忽略了大型语言模型(LLM)中丰富的全局语义上下文。在本文中,我们提出了一种新颖的框架,利用LLM编码的知识来指导联邦图推荐。具体来说,客户端从局部图中学习结构表示,同时通过冻结的LLM将其典型交互模式总结为紧凑的语义向量。中央服务器随后利用这些LLM编码的语义信号发现跨客户端的相关偏好模式,指导其结构表示的选择性聚合。这实现了语义感知的跨客户端协作,而无需暴露原始数据。在标准基准上的大量实验表明,利用LLM编码知识指导结构对齐一致地提高了现有联邦图基线的推荐准确性。

英文摘要

Graph-based recommender systems are highly effective at extracting collaborative signals from user--item interactions, and federated learning (FL) allows these models to be trained while preserving user privacy. However, aggregating graph representations across distributed, non-IID clients remains a challenge; structural embeddings learned locally often misalign, and naive averaging fails to capture meaningful cross-client relationships. Most existing federated graph methods rely exclusively on structural aggregation, neglecting the rich, global semantic context available in large language models (LLMs). In this paper, we propose a novel framework that uses LLM-encoded knowledge to guide federated graph recommendation. Specifically, clients learn structural representations from local graphs while simultaneously summarizing their typical interaction patterns into compact semantic vectors via a frozen LLM. The central server then uses these LLM-encoded semantic signals to discover related preference patterns across clients, guiding the selective aggregation of their structural representations. This enables semantically informed cross-client collaboration without exposing raw data. Extensive experiments on standard benchmarks show that guiding structural alignment with LLM-encoded knowledge consistently improves recommendation accuracy over existing federated graph baselines.

2606.15288 2026-06-16 cs.LG cs.AI physics.ao-ph 交叉投稿

Hybrid NARX-LLM for Greenland Iceberg Discharge: Prompt-Driven Residual Correction

混合NARX-LLM用于格陵兰冰山排放:提示驱动的残差校正

Yiquan Gao, Duohui Xu

发表机构 * Heriot-Watt University(赫瑞瓦特大学) StudioYG

AI总结 提出混合NARX-LLM框架,结合非线性自回归模型与大型语言模型进行残差校正,并引入物理信息提示方法,用于建模格陵兰冰山排放的复杂非线性动态,提升预测准确性。

详情
AI中文摘要

格陵兰冰山排放表现出复杂的非线性动态,且可观测性有限,对传统预测模型构成挑战。我们提出一个混合NARX-LLM框架,该框架结合了具有外源输入的非线性自回归模型(NARX)和用于残差校正的大型语言模型(LLM)。我们进一步提出了一种物理信息提示(PIP)方法,将非结构化物理知识转化为结构化提示,用于零样本上下文推理。主要目标是探索该框架在建模格陵兰冰山排放方面的校正潜力,而不仅仅是优化预测精度。NARX组件捕获内在的时间依赖性,而由PIP引导的LLM编码冰川动力学和环境驱动因素,并感知关键趋势模式以校正系统预测误差。这种集成允许模型推理未建模因素并产生可解释的残差,从而提升整体预测精度。应用于格陵兰冰山排放时间序列,我们的方法处理了由于罕见变化和非平稳趋势而难以预测的极端事件,这是传统方法经常忽视的局限性。通过融合结构化时间序列建模与知识驱动的Foundation AI,该框架提供了一条可扩展且可解释的路径,将数据受限的气候预测与物理信息LLM推理相结合。代码已公开。

英文摘要

Greenland iceberg discharge exhibits complex nonlinear dynamics with limited observability, challenging traditional predictive models. We present a Hybrid NARX-LLM framework that combines a nonlinear autoregressive model with exogenous inputs (NARX) and a large language model (LLM) for residual correction. We further propose a Physics-Informed Prompt (PIP) method that transforms unstructured physical knowledge into structured prompts for zero-shot in-context reasoning. The primary objective is to explore the corrective potential of this framework for modeling Greenland iceberg discharge, rather than merely optimizing predictive accuracy. The NARX component captures intrinsic temporal dependencies, while the LLM, guided by PIP, encodes glacier dynamics and environmental drivers and perceives key trend patterns to correct systematic prediction errors. This integration allows the model to reason about unmodeled factors and produce interpretable residuals, enhancing overall predictive accuracy. Applied to Greenland iceberg discharge time series, our approach addresses extreme events that are difficult to predict due to rare variations and nonstationary trends, a limitation often overlooked by traditional methods. By fusing structured time-series modeling with knowledge-driven foundation AI, the framework offers a scalable and interpretable pathway to bridge data-limited climate forecasting with physics-informed LLM reasoning. The code is available.

2606.15349 2026-06-16 cs.CY cs.AI 交叉投稿

LearnOpt: Recovering the Latent Cognitive Structure of Standardized Examinations via Knowledge Graphs and Constrained Optimization

LearnOpt: 通过知识图谱和约束优化恢复标准化考试的潜在认知结构

Joy Bose, Om Thomas

发表机构 * Independent Researchers(独立研究者)

AI总结 提出LearnOpt框架,利用知识图谱和约束优化从历史试题中恢复潜在认知结构,生成个性化学习计划;在NEET和JEE Advanced考试数据上验证了潜在结构的稳定性和可检测的转变。

Comments 26 pages, 2 figures, 6 tables. Code, data, and calibration tooling: https://github.com/joyboseroy/learnopt. Datasets on HuggingFace: joyboseroy/neet-skill-tags-2016-2024, joyboseroy/jee-advanced-skill-tags-2016-2023

详情
AI中文摘要

标准化考试通常被视为统一的课程覆盖问题。我们认为它们更适合理解为具有稳定潜在认知结构的对抗系统,这些结构系统地偏离官方课程。我们引入LearnOpt,它从历史试题中恢复这种结构,并生成个性化的、有时间限制的学习计划。应用于九年的NEET试题(2016-2024,n=1,496),LearnOpt从LLM标记的试题构建考试知识图谱,提取五类潜在技能分布,并将学习计划制定为基于贝叶斯知识追踪的先决条件感知子图上的背包变体优化。核心发现:NEET的潜在技能分布在课程体系内是稳定的(2016-2021年连续年份KL散度0.004-0.032,置换检验不显著),但在NCERT 2023年课程合理化后显著变化:合并2016-2021年(n=1,072)与2023-2024年(n=392)得到KL=0.040(p=0.0005),其中消除/否定类问题从约20-29%上升至约31-35%。潜在结构虽然并非永久平稳,但分段稳定,其转变可检测并归因于课程事件。在任一体系内,学科比年份更能预测技能分布。使用一个真实和两个合成掌握度分布进行的优化评估表明,技能加权目标在基于掌握度频率的基线之上产生了适度但真实的主题推荐重排序。将该流程应用于JEE Advanced,发现其分布以多概念整合为主(80.9%对比NEET的33.3%),JEE与NEET的散度(KL=0.505)超过了NEET最大的跨学科散度:考试层级比学科更能塑造潜在认知结构,而学科比时间(在同一体系内)更能塑造结构。代码、知识图谱和标注数据集已公开发布。

英文摘要

Standardized examinations are typically treated as uniform syllabus coverage problems. We argue they are better understood as adversarial systems with stable latent cognitive structures diverging systematically from official syllabi. We introduce LearnOpt, which recovers this structure from historical question papers and generates personalized, time-bounded study plans. Applied to nine years of NEET questions (2016-2024, n=1,496), LearnOpt builds an exam knowledge graph from LLM-tagged questions, extracts a five-category latent skill distribution, and formulates study planning as a knapsack-variant optimization over prerequisite-aware subgraphs with Bayesian Knowledge Tracing. Central finding: NEET's latent skill distribution is stable within a syllabus regime (consecutive-year KL divergence 0.004-0.032 for 2016-2021, non-significant under permutation testing) but shifts significantly with NCERT's 2023 syllabus rationalization: pooling 2016-2021 (n=1,072) vs 2023-2024 (n=392) gives KL=0.040 (p=0.0005), with Elimination/Negation questions rising from ~20-29% to ~31-35%. Latent structure, while not permanently stationary, is piecewise stable, with shifts detectable and attributable to curricular events. Within either regime, subject predicts skill profile more strongly than year. An optimization evaluation, using one real and two synthetic mastery profiles, shows the skill-weighted objective produces a modest but real reordering of recommended topics over a mastery-conditioned frequency baseline. Applying the pipeline to JEE Advanced reveals a profile dominated by Multi-concept Integration (80.9% vs. 33.3% for NEET), with a JEE-vs-NEET divergence (KL=0.505) exceeding NEET's largest cross-subject divergence: exam tier shapes latent cognitive structure more than subject, which shapes it more than time within a regime. Code, knowledge graph, and annotated dataset are released publicly.

2606.15500 2026-06-16 cs.AR cs.AI cs.SY eess.SY 交叉投稿

LLM4RTL: Tool-Assisted LLM for RTL Generation

LLM4RTL: 工具辅助的大语言模型用于RTL生成

Jing Jin, Robert Chu, Ning Yan, Masood S. Mortazavi

发表机构 * UC Riverside(加州大学河滨分校) Futurewei(未来科技)

AI总结 提出JRCRC流水线,利用商业LLM层次结构过滤和优化RTL代码生成数据集,并设计工具辅助架构提升逻辑推理能力,在VerilogEval上超越多数方法,用更小模型达到GPT-4O性能。

详情
AI中文摘要

大语言模型(LLMs)在软件工程、代码生成、工具和系统方面取得了显著进展。同时,大量研究探索了将LLMs应用于硬件和芯片设计的方法和系统(例如,基于功能描述的RTL代码生成系统)。然而,在开放Verilog/RTL代码生成方面,我们需要高质量的训练样本,通过微调或低秩适应来构建专门且更有效的LLM系统。在此,我们提出了一种“判断-更新-检查-更新-检查”(JRCRC)流水线,该流水线使用一系列在RTL代码生成中成本和能力不同的最先进商业LLM模型来更新现有公共数据集。这种方法实现了一种成本效益高的机制,用于过滤和优化代码生成样本,形成更高质量的训练数据集。我们的实验还识别了LLMs在基于规则的推理和逻辑方面的一些常见弱点,进而影响RTL代码生成。在识别这些弱点后,我们开发了一种架构,结合预处理工具动态辅助LLMs从表格数据格式推断逻辑关系。通过我们的工具辅助RTL代码生成架构,我们在VerilogEval基准测试中取得了显著的总体性能提升,并超越了许多最先进的方法。我们的LLM4RTL系统使用更小的模型实现了与GPT-4O相当的性能。

英文摘要

Large language models (LLMs) have facilitated impressive progress in software engineering, code generation, tooling, and systems. Concurrently, a significant body of research has developed which explores a growing variety of methods and systems for applying LLMs to hardware and chip design (e.g., systems for RTL code generation based on functional description). However, when it comes to open Verilog/RTL code-generation, we need high-quality training samples to build specialized and more effective LLM systems through fine-tuning or low-rank adaptation. Here, we propose a ``judge-renew-check-renew-check'' (JRCRC) pipeline which updates a current public dataset using a hierarchy of state-of-the-art commercial LLM models differing in their costs and capabilities in RTL code generation. This approach achieves a cost-effective mechanism for filtering and refining code-generation samples into a higher-quality training dataset. Our experiments also identify some common weaknesses of LLMs in rule-based reasoning and logic, and consequently, in RTL code-generation. Having identified these weaknesses, we develop an architecture for incorporating pre-processing tools to dynamically assist the LLMs in inferring logical relationships from tabular data formats. With our tools-assisted architecture for RTL code generation, we achieve significant overall performance gains in the VerilogEval benchmark and outperform many state-of-the-art methods. Our LLM4RTL system achieves performance comparable to that of GPT-4O using a significantly much smaller LLM.

2606.15523 2026-06-16 cs.NE cs.AI cs.LG 交叉投稿

AQ4SViT: An Automated Quantization Framework with Search Gating Policy for Compressing Spiking Vision Transformers

AQ4SViT:一种用于压缩脉冲视觉Transformer的自动化量化框架与搜索门控策略

Rachmad Vidya Wicaksana Putra, Saad Iftikhar, Muhammad Shafique

发表机构 * eBRAIN Lab, Division of Engineering, New York University (NYU) Abu Dhabi(eBRAIN实验室,工程学院,纽约大学(NYU)阿布扎克分校) New York University (NYU) Abu Dhabi, United Arab Emirates (UAE)(纽约大学(NYU)阿布扎克分校,阿拉伯联合酋长国(UAE))

AI总结 提出AQ4SViT自动化量化框架,通过量化搜索策略和基于膜电位漂移的搜索门控策略,快速找到精度与内存的平衡点,实现脉冲视觉Transformer的高效压缩。

Comments 8 pages, 4 figures, 2 tables

详情
AI中文摘要

脉冲视觉Transformer(SViT)已成为替代性的低功耗ViT模型,但其大规模阻碍了在资源受限的嵌入式AI系统上的部署。为解决此问题,现有工作提出了量化技术来压缩SViT模型,但其手动、人工引导的方法需要大量设计时间和功耗来为每个给定网络找到合适的量化设置,使得该方法在量化多个网络时不可扩展。为此,我们提出了AQ4SViT,一种新颖的SViT自动化量化框架,能够提供快速的量化设置,并在精度和内存之间取得良好权衡。为实现这一点,AQ4SViT采用以下关键思想:量化搜索策略,在考虑精度约束的同时评估量化设置候选;以及搜索门控策略,通过利用膜电位漂移作为性能代理,快速评估和选择有前景的量化候选。在搜索门控策略中,AQ4SViT采用两种搜索算法变体以提供权衡选项:贪心搜索,执行速度快但可能导致局部最优;以及束搜索,执行速度较慢但由于搜索空间更广,在寻找全局最优选择方面性能更好。实验结果表明,与现有技术相比,AQ4SViT-Greedy快速找到合适的量化设置,搜索时间加快高达6.6倍,内存节省高达82.5%;而AQ4SViT-Beam进一步将内存占用降低高达90%,但搜索时间延长4.5倍;所有这些结果均在保持高精度的前提下获得,在ImageNet数据集上精度与原始/非量化模型相差在1.5%以内。这些结果凸显了AQ4SViT框架在推动SViT在嵌入式AI系统部署方面的进展。

英文摘要

Spiking Vision Transformers (SViTs) have emerged as alternative low-power ViT models, but their large sizes hinder their deployments on resource-constrained embedded AI systems. To address this, state-of-the-art works proposed quantization techniques to compress SViT models, but their manual, human-guided approach needs a huge design time and power/energy consumption to find the appropriate quantization setting for each given network, making this approach not scalable for quantizing multiple networks. Toward this, we propose AQ4SViT, a novel automated quantization framework for SViTs that can provide quick quantization settings with good trade-offs between accuracy and memory. To achieve this, AQ4SViT employs the following key ideas: quantization search strategy that evaluates the quantization setting candidates while considering the accuracy constraint; and search gating policy that quickly evaluates and selects promising quantization candidates by leveraging membrane potential drift as a performance proxy. In the search gating policy, AQSViT employs two search algorithm variants to provide trade-off options: Greedy search, which performs fast but may lead to local optima; and Beam search, which performs slower but has better performance in finding global optima selection due to a wider search space. Experimental results show that AQ4SViT-Greedy quickly finds the appropriate quantization settings, achieving up to 6.6x faster search time and up to 82.5% memory saving compared to the state-of-the-art; while AQ4SViT-Beam further reduces the memory footprint by up to 90% compared to the state-of-the-art, but with 4.5x longer search time; all these results are obtained while maintaining high accuracy within 1.5% from the original/non-quantized models on the ImageNet dataset. These results highlight that AQ4SViT framework offers advancements toward SViT deployments on embedded AI systems.

2606.15547 2026-06-16 cs.CV cs.AI 交叉投稿

EcoBin: A Two-Stage Deep Convolutional Neural Network for Contamination-Aware Waste Classification

EcoBin: 一种用于污染感知废物分类的两阶段深度卷积神经网络

Raghav Senthil Kumar

发表机构 * BASIS Phoenix(BASIS凤凰学校)

AI总结 提出EcoBin两阶段深度CNN,通过合成污染数据集和污染检测模块,显著提升回收废物分类中污染物的识别准确率。

Comments 7 pages, 8 figures

详情
AI中文摘要

废物分类模型在分类废物方面已经变得非常准确,在基准数据集上通常超过95%。然而,这些模型未能考虑可回收废物中的污染。我们提出了EcoBin,一种两阶段深度卷积神经网络,它根据处理途径对家庭废物进行分类,并明确考虑污染。第一阶段是一个基于EfficientNetV2-S骨干网络的基础废物分类器,将数据集中的三十个废物类别分配到四个处理途径之一。第二阶段是一个污染分类器,检查任何被导向回收的物品,并在检测到污染时将其决策覆盖为垃圾。由于不存在公开的污染可回收物数据集,我们通过使用U2-Net模型分割干净可回收物体的图像,并在其表面合成逼真的污染纹理来合成一个数据集。第一阶段达到87.42%的测试准确率和96.13%的途径调整准确率。同时,污染阶段以0.99的ROC-AUC区分干净和污染物品。在污染可回收物的测试集上,完整流水线正确路由了25个物品中的24个,而单独的基础分类器仅正确路由了25个中的1个。McNemar检验证实污染阶段带来的改进具有统计学显著性(p < 0.001)。

英文摘要

Waste classification models have become highly accurate at sorting waste, often exceeding 95% on benchmark datasets. However, these models fail to account for contamination in recyclable waste. We present EcoBin, a two-stage deep convolutional neural network that classifies household waste by its disposal pathway and that explicitly accounts for contamination. The first stage is a base waste classifier built on an EfficientNetV2-S backbone that assigns each of the thirty waste categories in our dataset to one of four disposal pathways. The second stage is a contamination classifier that inspects any item routed toward recycling and overrides the decision to garbage when contamination is detected. Because no public dataset of contaminated recyclables exists, we synthesize one by segmenting images of clean recyclable objects with a U2-Net model and compositing realistic contamination textures onto their surfaces. The first stage achieves 87.42% test accuracy and a 96.13% pathway-adjusted accuracy. Meanwhile, the contamination stage distinguishes clean from contaminated items with a 0.99 ROC-AUC. On a test set of contaminated recyclables, the complete pipeline routes 24 of 25 items correctly, compared with only 1 of 25 for the base classifier alone. A McNemar's test confirms that the improvement contributed by the contamination stage is statistically significant (p < 0.001).

2606.15555 2026-06-16 math.OC cs.AI cs.LG stat.ML 交叉投稿

Service-Induced Congestion in Memory-Constrained LLM Serving

内存受限的大语言模型服务中的服务引发拥塞

Ruicheng Ao, Jing Dong, Gan Luo, David Simchi-Levi

发表机构 * Institute for Data, Systems, and Society, Massachusetts Institute of Technology(数据、系统与社会研究所,麻省理工学院) Columbia Business School, Columbia University(哥伦比亚大学商学院) School of Mathematical Sciences, Peking University(北京大学数学科学学院)

AI总结 本文通过离散时间动力学模型研究内存受限的大语言模型服务中,因键值缓存增长导致的服务引发拥塞,发现同质负载下无驱逐均衡不稳定且收敛到最坏情况极限环,异质负载下稳定条件与解码长度互质相关,并提出调度设计原则。

Comments 101 pages

详情
AI中文摘要

在大语言模型(LLM)服务中,每个请求在服务期间会积累持久的图形处理单元(GPU)内存,因为其键值缓存随着每个生成的令牌而增长。在高并发下,总内存使用量因此随时间内生增长:服务过程本身会创造未来的容量压力。当内存容量超出时,系统会驱逐活动请求,丢弃缓存状态并在稍后重新启动它们,这浪费了计算并降低了吞吐量。我们开发了一个内存受限的LLM推理的离散时间动力学模型,该模型捕获了连续批处理下的准入、内存增长和驱逐。在饱和输入机制下,系统同时存在无驱逐的固定点和带驱逐的极限环。对于同质负载,我们证明无驱逐平衡是不稳定的,并且除了一个勒贝格测度为零的精确捕获集外,系统收敛到一个唯一的最坏情况极限环,该极限环在该例外集外是渐近稳定的,吞吐量损失高达50%。对于异质负载,我们在两类共同输入设置下证明了一个稳定性准则,并解释了生存多项式机制如何推广到多类和异质输入长度。在输入主导的缩放机制下,互质的解码长度稳定了无驱逐平衡,而非互质的长度创造了同步模式,导致不稳定。这些结果描述了负载异质性何时使完成去同步化并有助于稳定内存受限的服务。更广泛地说,我们将服务引发的拥塞识别为一种结构性不稳定机制,并推导出维持高吞吐量的调度设计原则。

英文摘要

In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage therefore increases endogenously over time: the service process itself creates future capacity pressure. When memory capacity is exceeded, systems evict active requests, discarding cached state and restarting them later, which wastes computation and reduces throughput. We develop a discrete-time dynamical model of memory-constrained LLM inference that captures admission, memory growth, and eviction under continuous batching. In the saturated-input regime, the system admits both eviction-free fixed points and limit cycles with evictions. For homogeneous workloads, we show that the eviction-free equilibrium is unstable and that, except for a Lebesgue-measure-zero exact-capture set, the system converges to a unique worst-case limit cycle that is asymptotically stable outside this exceptional set, with throughput losses as large as 50%. For heterogeneous workloads, we prove a stability criterion in the two-class common-input setting and explain how the survival-polynomial mechanism generalizes to multiple classes and heterogeneous-input lengths. Under an input-dominated scaling regime, coprime decoding lengths stabilize the eviction-free equilibrium, while non-coprime lengths create synchronized modes that drive instability. These results characterize when workload heterogeneity desynchronizes completions and helps stabilize memory-constrained serving. More broadly, we identify service-induced congestion as a structural instability mechanism and derive scheduling design principles for sustaining high throughput.

2606.15601 2026-06-16 cs.HC cs.AI cs.CY 交叉投稿

SCAN: A Decision-Making Framework for Effective Task Allocation with Generative AI

SCAN:一种基于生成式AI的有效任务分配决策框架

Fendi Tsim, Alina Gutoreva

发表机构 * Independent Researcher, London, United Kingdom(伦敦,英国独立研究员) School of IT and Engineering, Kazakh-British Technical University, Almaty, Kazakhstan(阿斯塔纳,哈萨克-英国技术大学信息与工程学院)

AI总结 提出SCAN框架,基于维果茨基最近发展区和元认知理论,将任务分为替代、补充、辅助和不可协商四个子区域,帮助知识工作者和学生元认知地评估生成式AI的使用,促进终身学习和混合智能。

Comments 16 pages, 2 figures, 3 tables. Preprint

详情
AI中文摘要

我们介绍了SCAN——一个以人为中心的决策框架,基于维果茨基的最近发展区和元认知理论,帮助学习者有效地与生成式人工智能(GenAI)进行任务分配。在SCAN中,我们通过引入一个包含四个“子区域”的任务识别方法,系统化和形式化了AI与人类的交互:替代、补充、辅助和不可协商。在描述这四个子区域后,我们展示了SCAN框架如何应用于工作场所的知识工作者和教育中的学生,以元认知地“扫描”他们对生成式AI的使用。然后,我们讨论了该框架如何与认知负荷理论、认知卸载、谄媚行为、人机交互中的三种决策模式(自动化、增强和协作)、未来工作(如技能提升和技能退化)相关联,以及它如何解释人与人以及人与AI的学习。我们提出,在讨论GenAI是补充还是替代我们完成任务的能力时,SCAN提供了一个很好的起点,其总体目标是维持终身学习,具体目标是实现混合智能。

英文摘要

We introduce SCAN -- a human-centric decision-making framework to facilitate learners for effective task allocation with Generative Artificial Intelligence (GenAI) based on Vygotsky's Zone of Proximal Development and Metacognition. In SCAN, we systematize and formalize AI-human interaction by introducing a task-identification approach with four "sub-zones": Substitute, Complement, Aid, and Non-negotiable. After describing the four sub-zones, we demonstrate how SCAN framework can be applied for knowledge workers in the workplace and students in education to metacognitively "scan" their use of Generative AI. We then discuss how such framework can be related to cognitive load theory, cognitive offloading, sycophancy, three decision-making modes in human-AI interactions (automation, augmentation, and collaboration), future of work such as upskilling and deskilling, and how it accounts for both human-human and human-AI learning. We propose that SCAN offers a great starting point before discussing whether GenAI complements or replaces our abilities when completing a task, with a general objective of sustaining lifelong learning, and a specific goal of reaching hybrid intelligence.

2606.15611 2026-06-16 cs.CV cs.AI 交叉投稿

Mutual Distillation of Dual-Foundation Models for Semi-Supervised PET/CT Segmentation

双基础模型的相互蒸馏用于半监督PET/CT分割

Fuyou Mao, Beining Wu, Yanfeng Jiang, Bohan Xu, Lixin Lin, Naye Ji, Hao Zhang, Yan Tang

发表机构 * Central South University(中南大学) Hangzhou Dianzi University(杭州电子科技大学) Communication University of Zhejiang(浙江传媒学院) Northeastern University(东北大学)

AI总结 提出MuDuo框架,利用SAM-Med3D和SegAnyPET分别从CT和PET中蒸馏知识到轻量学生网络,实现半监督器官分割,仅用5个标注样本在AutoPET数据集上达到最优性能。

Comments MICCAI 2026

详情
AI中文摘要

PET/CT的器官分割对于肿瘤学中的定量分析和放疗计划至关重要。为了降低PET/CT分割的高标注成本,半监督学习(SSL)为使用有限标注数据开发深度模型提供了一种实用且有效的解决方案。视觉基础模型的最新发展展示了显著的适应性和更高的效率。在这项工作中,我们提出了一个相互蒸馏框架,该框架无缝地利用了结构性和功能性基础模型,这些模型作为模态特定的通才,从结构性CT和代谢性PET成像中蒸馏知识。通过弥合学生模型的任务特定精度与通才基础模型的分割先验之间的差距,我们提出了MuDuo,一个相互蒸馏框架,协同利用SAM-Med3D用于CT和SegAnyPET用于PET,将它们的知识蒸馏到一个轻量级学生网络中。我们的方法消除了手动提示的需要,同时最大化未标注数据在自动分割中的效用,在AutoPET数据集上仅使用5个标注案例就达到了最先进的性能。我们的源代码可在https://github.com/Wu-beining/MuDuo获取。

英文摘要

Organ segmentation from PET/CT is critical for quantitative analysis and radiotherapy planning in oncology. To ease the high annotation cost of PET/CT segmentation, semi-supervised learning (SSL) provides a practical and effective solution for developing deep models with limited labeled data. Recent developments in visual foundation models have demonstrated remarkable adaptability with improved efficiency. In this work, we propose a mutual distillation framework that seamlessly exploits both structural and functional foundation models, which act as modality-specific generalists for distilling knowledge from structural CT and metabolic PET imaging. By bridging the gap between the task-specific precision of student models and the segmentation priors of generalist foundation models, we propose \textbf{MuDuo}, a mutual distillation framework that synergistically leverages SAM-Med3D for CT and SegAnyPET for PET to distill their knowledge into a lightweight student network. Our approach eliminates the need for manual prompts while maximizing the utility of unlabeled data for automatic segmentation, achieving state-of-the-art performance on the AutoPET dataset with only 5 labeled cases. Our source code is available at https://github.com/Wu-beining/MuDuo.

2606.15642 2026-06-16 cs.LG cs.AI 交叉投稿

CIWI-CKT: Chaos-Informed Wave Interference Feature Fusion and Cross-City Knowledge Transfer for Traffic Flow Forecasting

CIWI-CKT:混沌信息波干涉特征融合与跨城市知识迁移用于交通流预测

Abdul Joseph Fofanah, Lian Wen, David Chen, Shaoyang Zhang

发表机构 * Griffith University(格里菲斯大学) School of Information and Communication Technology, Griffith University(格里菲斯大学信息与通信技术学院) School of Information Engineering, Chang’an University(长安大学信息工程学院)

AI总结 针对跨城市数据稀缺场景,提出CIWI-CKT框架,融合混沌信息波生成、元干涉处理和混沌感知元学习,显著提升预测精度并降低数据需求。

详情
AI中文摘要

在跨城市、数据稀缺的场景下,准确预测交通流仍然具有挑战性,因为有限的历史数据阻碍了模型的泛化能力。交通动态的混沌性质、复杂的时空依赖关系以及异质的城市网络使得跨城市的小样本学习变得复杂。现有的深度学习方法要么将交通视为完全确定性的,要么缺乏对跨体制交通动态至关重要的波状干涉模式进行建模的机制。为了解决这些局限性,本文提出了CIWI-CKT,一种新颖的混沌信息波干涉特征融合框架,结合跨城市知识迁移。我们的框架引入了三个核心创新:混沌信息波生成,提取可测量的混沌不变量并将交通建模为自适应波分量;元干涉处理,捕获支持域和查询域之间的波相互作用,同时生成可预测性分数用于置信度估计;以及混沌感知元学习,在保留混沌特性的同时实现高效的跨城市知识迁移。我们建立了理论保证,包括混沌到波的稳定性、波诱导的降维以及元学习泛化界限。在四个真实世界交通数据集上的大量实验表明,CIWI-CKT显著优于最先进的时空图学习、迁移学习、基于提示和小样本方法,在提高预测精度的同时大幅减少了所需的训练数据。

英文摘要

Accurate traffic flow prediction remains challenging in cross-city, data-scarce scenarios where limited historical data hinders model generalisation. The chaotic nature of traffic dynamics, complex spatio-temporal dependencies, and heterogeneous urban networks complicate few-shot learning across cities. Existing deep learning approaches either treat traffic as purely deterministic or lack mechanisms to model wave-like interference patterns essential for cross-regime traffic dynamics. To address these limitations, this paper proposes CIWI-CKT, a novel Chaos-Informed Wave Interference Feature Fusion framework with Cross-City Knowledge Transfer. Our framework introduces three core innovations: chaos-informed wave generation that extracts measurable chaos invariants and models traffic as adaptive wave components; meta-interference processing that captures wave interactions between support and query regimes while producing a predictability score for confidence estimation; and chaos-aware meta-learning that enables efficient cross-city knowledge transfer while preserving chaotic characteristics. We establish theoretical guarantees including chaos-to-wave stability, wave-induced dimension reduction, and meta-learning generalisation bounds. Extensive experiments on four real-world traffic datasets demonstrate that CIWI-CKT significantly outperforms state-of-the-art spatio-temporal graph learning, transfer learning, prompt-based, and few-shot methods, improving prediction accuracy while substantially reducing required training data.

2606.15693 2026-06-16 cs.SE cs.AI 交叉投稿

Imperfect Visual Verification for Code Edition : A Case Study on TikZ

代码编辑的不完美视觉验证:以TikZ为例的案例研究

Charly Reux, Mathieu Acher, Djamel Eddine Khelladi, Clément Quinton, Olivier Barais

发表机构 * Univ Rennes, Inria, IRISA, INSA(里昂大学、Inria、IRISA、INSA) Univ Rennes, Inria, CNRS, IUF, IRISA(里昂大学、Inria、CNRS、IUF、IRISA) Univ Rennes, Inria, CNRS, IRISA(里昂大学、Inria、CNRS、IRISA) Univ. Lille, CNRS, Inria(里尔大学、CNRS、Inria) Univ. Rennes, IRISA, Inria(里昂大学、IRISA、Inria)

AI总结 针对TikZ等视觉代码定制任务,研究不完美验证器在迭代精炼中的有效性,发现即使不完美验证器也能适度准确判断指令是否应用,反馈对弱模型提升显著。

详情
AI中文摘要

LLMs显著推进了代码生成,使得功能程序的合成成为可能。尽管最近的系统在许多编码基准测试中表现强劲,但涉及生成视觉产物(如TikZ)的程序任务仍然具有挑战性,尤其是在视觉代码定制方面。与从头生成不同,定制需要局部、保持语义的编辑:模型必须定位相关代码,根据指令修改它,并保留其余结构和渲染。基于事后迭代精炼/纠正的方法(其中验证器提供反馈以指导纠正)已显示出前景。然而,对于具有视觉结果(如TikZ)的程序,其正确性难以或不可能自动形式化和评估,因此不存在确定性验证器。因此,开发者只能依赖不完美的验证器。在本文中,我们进行了一项实证研究以回答:当验证器本身不可靠时,迭代精炼在多大程度上仍然有效?我们使用TikZ作为聚焦的案例研究,在受控且具有挑战性的环境中隔离问题的核心难点(弱代码结构、细粒度视觉语义和困难的特征定位)。我们将视觉代码定制定义为一个具有不完美预言机的迭代编辑问题,并引入一个分析此类迭代精炼的框架。我们进行了大规模研究,在迭代精炼流程中评估了多个基于LLM和工具增强的视觉验证器,并对精炼轨迹进行了广泛的手动标注以评估验证器行为和反馈质量。我们的发现表明,即使是不完美的验证器也能以中等准确度确定视觉指令是否应用于代码,F1分数高达0.815。反馈改善了迭代精炼,特别是对于较弱的模型,为Qwen3-vl-30b-a3b-Instruct增加了11-20个完美定制,而更强的模型(如Gemini-3)获得的改进较少(+5),但受益于更准确的验证,防止了过早接受。反馈仅在精确识别图像问题、提供可操作指导、解决所有相关问题并保持基于原始指令时有效。

英文摘要

LLMs have significantly advanced code generation, enabling the synthesis of functional programs. While recent systems achieve strong performance on many coding benchmarks, tasks involving programs such as TikZ that generate visual artifacts remain challenging, in particular on visual code customization. Unlike generation from scratch, customization requires localized, semantics-preserving edits: the model must locate relevant code, modify it according to the instruction, and preserve the remaining structure and rendering. Approaches based on post-hoc iterative refinement/correction where a verifier provides feedback to guide corrections, have shown promise. However, in the case of programs with a visual outcome such as in TikZ, where correctness is harder or likely impossible to formalize and evaluate automatically, deterministic verifiers do not exist. Hence, developers can only rely on imperfect verifiers. In this paper, we conduct an empirical study to answer:to what extent can iterative refinement remain effective when the verifier itself is unreliable?} We use TikZ as a focused case study that isolates the core difficulties of the problem (weak code structure, fine-grained visual semantics, and difficult feature localization) in a controlled and challenging setting. We define visual code customization as an iterative editing problem with an imperfect oracle, and introduce a framework for analyzing such iterative refinements. We conduct a large-scale study and evaluate multiple LLM-based and tool-augmented visual verifiers within iterative refinement pipelines, and perform extensive manual annotation of refinement trajectories to assess verifier behavior and feedback quality. Our findings show that even imperfect verifiers can determine with moderate accuracy whether visual instructions are applied to code, achieving F1-scores up to 0.815. Feedback improves iterative refinement, especially for weaker models, adding 11--20 perfect customizations for Qwen3-vl-30b-a3b-Instruct, while stronger models like Gemini-3 gain fewer improvements (+5) but benefit more from accurate verification that prevents premature acceptance. Feedback is effective only when it precisely identifies image issues, provides actionable guidance, addresses all relevant problems, and remains grounded in the original instruction.

2606.15786 2026-06-16 cs.CV cs.AI physics.geo-ph 交叉投稿

Domain-Guided Prompting of the Segment Anything Model for Seismic Interpretation: The Role of Attributes, Visualization, and Hybrid Prompts

领域引导的Segment Anything模型提示用于地震解释:属性、可视化和混合提示的作用

Aniq Ahmad, Heather Bedle, Ahmad Mustafa

发表机构 * School of Geosciences, University of Oklahoma(俄克拉荷马大学地球科学学院) King Fahd University of Petroleum and Minerals(法赫德国王石油矿产大学)

AI总结 提出零样本适应框架,通过地质目标感知的地震属性与颜色映射选择,结合混合提示策略,提升SAM在地震解释中的分割精度,避免微调。

详情
AI中文摘要

计算机视觉大型预训练基础模型的出现显著提高了视觉数据解释的效率。特别是Segment Anything Model (SAM)通过基于提示的交互提供了强大的零样本分割能力,因此成为地震解释的有前景工具。然而,大多数现有的SAM应用依赖于针对特定地质目标的微调,这需要大量标注数据、计算成本高,且常常损害模型的泛化能力。在本研究中,我们引入了一个原则性框架,用于将基础模型零样本适应到地震数据。该框架基于两个关键组件:(1) 将地震属性和可视化选择(如颜色映射)与感兴趣的地质目标对齐;(2) 采用混合提示策略,结合稀疏的用户定义点提示和从SAM内部特征激活中导出的密集掩码提示。我们系统地在多个地质目标、数据集、提示配置和地震属性表示上评估了该框架。我们的结果表明,地质目标感知的地震属性和颜色映射选择,结合混合提示,相对于仅基于点提示,增强了地质特征的可分离性,并改善了边界描绘和分割精度。我们的发现表明,当这些组件联合应用时,SAM可以在完全零样本设置下实现有竞争力的分割性能,从而消除了为每个地质特征重新训练SAM的需要。这项工作建立了一条实用且可扩展的途径,以在地震解释中利用基础模型,减少对标注数据的依赖,同时保持模型的通用性。

英文摘要

The advent of large pretrained foundation models for computer vision has significantly improved the efficiency of visual data interpretation. The Segment Anything Model (SAM), in particular, offers powerful zero shot segmentation capabilities through prompt based interaction, thus making it a promising tool for seismic interpretation. However, most existing applications of SAM rely on fine tuning for specific geological targets, which requires extensive labeled data, incurs high computational cost, and often compromises the model's generalization capability. In this study, we introduce a principled framework for zero shot adaptation of foundation models to seismic data. The framework is built on two key components: (1) aligning seismic attributes and visualization choices (e.g., colormaps) with the geological target of interest, and (2) employing a hybrid prompting strategy that combines sparse user defined point prompts with dense mask prompts derived from SAM's internal feature activations. We systematically evaluate this framework across multiple geological targets, datasets, prompt configurations, and seismic attribute representations. Our results demonstrate that geologic target aware selection of seismic attributes and colormaps, combined with hybrid prompting, enhances the separability of geological features and improves boundary delineation and segmentation accuracy relative to point based prompting alone. Our findings show that, when these components are jointly applied, SAM can achieve competitive segmentation performance in a fully zero shot setting, thereby eliminating the need to retrain SAM for each geologic feature. This work establishes a practical and scalable pathway to leverage foundation models in seismic interpretation, reducing reliance on labeled data while preserving model generality.

2606.15807 2026-06-16 cs.LG cs.AI 交叉投稿

Continuous Cross-Domain Traffic State Prediction via Memory-Augmented Graph Liquid Time-Constant Networks

基于记忆增强图液态时间常数网络的连续跨域交通状态预测

Jinrong Xiang, Ming Xu

发表机构 * Software College, Liaoning Technical University(辽宁工程技术大学软件学院)

AI总结 提出记忆增强图液态时间常数网络(MA-GLTC),通过时空单元分解、图液态时间常数动态和记忆迁移存储机制,实现连续时间下的跨域交通状态预测,在五个数据集上优于现有方法。

详情
AI中文摘要

交通状态预测是智能交通系统中的一项基本任务。在实际应用中,一些区域由于感知基础设施不足而面临有限的交通观测,使得跨域知识迁移成为数据稀缺交通预测的重要解决方案。然而,现有的跨域交通预测方法仍面临若干局限,包括粗粒度的源-目标域适应、处理未见目标域模式的能力有限,以及在非规则或异质时间条件下对连续交通动态建模不足。为解决这些问题,本文提出了一种连续跨域交通预测框架,称为记忆增强图液态时间常数网络(MA-GLTC)。具体地,我们首先构建时空单元(STU)将交通网络分解为可迁移的局部单元,实现跨域的细粒度知识对齐。然后,开发了图液态时间常数网络(GLTC)来建模连续时间下图耦合的交通演化。与通用的基于图神经ODE的模型不同,GLTC将图耦合的循环电导引入液态时间常数动态,允许节点状态随泄漏、自适应时间常数和邻域感知反馈而演化。此外,设计了基于记忆的迁移存储(MTS)机制,以保留源域知识、检索匹配的交通模式,并在出现未见状态时更新可靠的目标域模式。在五个公开交通数据集上的实验表明,MA-GLTC在短期和长期预测任务中均持续优于代表性的域内和跨域基线。与次优方法相比,MA-GLTC分别将平均预测误差降低了3.02%、0.33%、8.92%、10.09%和2.11%。

英文摘要

Traffic state prediction is a fundamental task in intelligent transportation systems. In practical applications, some regions suffer from limited traffic observations due to insufficient sensing infrastructure, making cross-domain knowledge transfer an important solution for data-scarce traffic prediction. However, existing cross-domain traffic prediction methods still face several limitations, including coarse-grained source-target adaptation, limited capability in handling unseen target-domain patterns, and insufficient modeling of continuous traffic dynamics under irregular or heterogeneous temporal conditions. To address these issues, this paper proposes a continuous cross-domain traffic prediction framework, termed Memory-Augmented Graph Liquid Time-Constant Network (MA-GLTC). Specifically, we first construct spatio-temporal units (STUs) to decompose traffic networks into transferable local units, enabling fine-grained knowledge alignment across domains. Then, a graph liquid time-constant network (GLTC) is developed to model graph-coupled traffic evolution in continuous time. Different from generic graph neural ODE-based models, GLTC introduces graph-coupled recurrent conductance into liquid time-constant dynamics, allowing node states to evolve with leakage, adaptive time constants, and neighborhood-aware feedback. Furthermore, a Memory-based Transfer Storage (MTS) mechanism is designed to preserve source-domain knowledge, retrieve matched traffic patterns, and update reliable target-domain patterns when unseen states emerge. Experiments on five public traffic datasets demonstrate that MA-GLTC consistently outperforms representative innerdomain and cross-domain baselines in both short-term and longterm prediction tasks. Compared with the second-best method, MA-GLTC reduces the average prediction errors by 3.02%, 0.33%, 8.92%, 10.09%, and 2.11%, respectively.

2606.15930 2026-06-16 cs.RO cs.AI 交叉投稿

ControlMap: Controllable High-Definition Map Generation for Traffic Scenario Simulation

ControlMap: 用于交通场景仿真的可控高清地图生成

Marwan Farag, Steffen Wäldele, Yu Yao

发表机构 * University of Stuttgart(斯图加特大学) Robert Bosch GmbH(博世公司) Motional, Inc(Motional公司)

AI总结 提出基于潜在扩散和ControlNet的数据驱动管道,实现可控高清地图生成,支持空间引导、条件强度调整和城市风格迁移,并引入新指标评估控制信号遵循度和地图真实性。

详情
AI中文摘要

仿真是验证自动驾驶系统的核心,但当前流程因高精(HD)地图创建成本高昂而受限于场景多样性不足。扩展HD地图需要昂贵的数据收集和人工处理。此外,现有生成模型缺乏在生成过程中针对特定道路拓扑进行细粒度控制的能力。本文提出一种数据驱动的可控HD地图生成管道,使用潜在扩散和ControlNet进行空间条件控制。据我们所知,我们是首个将空间引导信号注入扩散模型用于HD地图合成的工作。此外,我们的模型支持通过无分类器引导调整条件强度,并通过城市标签条件实现城市级风格迁移。为补充现有指标,我们引入两个新指标来评估对控制信号的遵循程度以及与真实地图的相似性。实验表明,我们的模型生成的HD地图真实且忠实遵循输入道路拓扑,同时准确保留城市特定细节。

英文摘要

Simulation is central to validating autonomous driving systems, yet current pipelines are limited by insufficient scenario diversity due to costly High Definition (HD) map creation. Scaling HD maps requires expensive data collection and manual processing. Moreover, existing generative models lack the fine-grained control necessary to target specific road topologies during generation. This paper presents a data-driven pipeline for controllable HD map generation using latent diffusion and ControlNet for spatial conditioning. To our knowledge, we are the first to inject spatial guidance signals into a diffusion model for HD map synthesis. Furthermore, our model supports adjustable conditioning strength through classifier-free guidance and city-level style transfer via city label conditioning. To complement existing metrics, we introduce two novel metrics to evaluate adherence to the control signal and similarity to ground-truth maps. Experiments demonstrate that our model generates realistic HD maps that faithfully follow input road topologies while accurately preserving city-specific details.

2606.15943 2026-06-16 cs.SE cs.AI 交叉投稿

Graphical-Probabilistic Modeling of Generative Flows in LLM-Native Software Systems

LLM原生软件系统中生成流的图形概率建模

Víctor A. Braberman, Flavia Bonomo-Braberman

发表机构 * Departamento de Computación, FCEN, Universidad de Buenos Aires / ICC, UBA-CONICET(布宜诺斯艾利斯大学计算机系 / UBA-CONICET)

AI总结 针对LLM原生软件开发缺乏设计级推理的问题,提出基于图形概率模型的生成网络框架,用于文档化生成流并描述系统属性。

Comments Published at 2026 IEEE/ACM 5th International Conference on AI Engineering - Software Engineering for AI (CAIN '26), April 12-13, 2026, Rio de Janeiro, Brazil

详情
AI中文摘要

工程化LLM原生软件仍然是一个具有挑战性且不成熟的领域。当前的实践主要是探索性的,依赖于实验和启发式技术,如提示和上下文工程。然而,这些方法层次较低,缺乏支持设计级推理或分析所需的原则性结构。相比之下,传统软件工程利用模块化和抽象来沟通和分析系统行为。为了给LLM原生开发带来类似的严谨性,我们提出了文档化生成流和陈述基于LLM的软件设计属性的方法。这些方法必须考虑大语言模型的随机性、提示依赖性行为,同时保持足够的表达能力以捕捉涌现现象。我们的初步方法基于图形概率模型,专门用于捕捉LLM原生系统特有的现象。这个框架——我们称之为生成网络——旨在为LLM中心软件架构中关于生成交互和系统级属性的原则性推理提供基础。

英文摘要

Engineering LLM-native software remains a challenging and immature field. Current practice is largely exploratory, relying on experimentation and heuristic techniques such as prompting and context engineering. These, however, are low-level and lack the principled structure needed to support design-level reasoning or analysis. In contrast, traditional software engineering leverages modularity and abstraction to communicate and analyze system behavior. To bring similar rigor to LLM-native development, we propose methods for documenting generative flows and for stating properties of LLM-based software designs. Such methods must account for the stochastic, prompt-dependent behavior of large language models while remaining expressive enough to capture emergent phenomena. Our initial approach is based on graphical probabilistic models, tailored to capture phenomena characteristic of LLM-native systems. This framework -- what we term Generation Networks -- aims to provide a foundation for principled reasoning about generative interactions and system-level properties in LLM-centric software architectures.

2606.15959 2026-06-16 cs.DC cs.AI cs.LG 交叉投稿

Quantifying the Impact of Lossy Compression on Neural Generative Surrogate Modeling

量化有损压缩对神经生成代理建模的影响

Zhimin Li, Harshitha Menon, Charles Jekel, Valerio Pascucci, Peter Lindstrom

发表机构 * LLNL-CONF-2007282

AI总结 研究有损压缩训练数据对生成代理模型质量的影响,提出利用神经网络训练不确定性估计压缩容错阈值的方法,在保持模型质量的同时实现高达39倍的数据存储节省和3倍的训练加速。

详情
AI中文摘要

神经网络被用作科学发现的生成代理模型,这些模型是可训练的科学模拟近似。它们使用户能够用学习到的替代方案取代耗时的数值模拟,提供快速解决方案。然而,高保真生成代理模型需要庞大的训练数据集,这可能导致存储和I/O挑战。有损压缩是减轻这一负担的有前景的方法,但压缩误差可能以微妙的方式影响模型质量,使得量化其影响具有挑战性。在这项工作中,我们研究了训练数据的有损压缩如何影响生成代理模型的质量。我们首先刻画了训练神经网络固有的不确定性,表明相同的训练配置可能产生不同的模型。通过利用这种变异性,我们提出了一种方法来估计代理模型在不影响其准确性的情况下可以容忍多少压缩引起的误差。对两个应用模拟的评估表明,我们的方法显著降低了内存/存储需求,加快了训练速度,同时生成了高质量的代理模型。这些结果表明,有损压缩可节省高达23.7倍和39倍的数据存储,而对代理模型质量的影响可忽略不计。同时,减小训练数据集的大小也提高了数据加载速度,并将训练时间减少了多达3倍。

英文摘要

Neural networks are used as generative surrogate models for scientific discovery, which are trainable approximations of scientific simulations. These models enable users to replace time-consuming numerical simulations with learned alternatives, providing quick solutions. However, high-fidelity generative surrogate models require massive training datasets, which can create storage and I/O challenges. Lossy compression is a promising way to reduce this burden, but compression errors may affect the model quality in subtle ways, making it challenging to quantify their impact. In this work, we examine how lossy compression of training data impacts the quality of generative surrogate models. We begin by characterizing the uncertainty inherent in training neural networks, showing that identical training configurations can produce different models. By exploiting this variability, we propose a method to estimate how much compression-induced error a surrogate model can tolerate without affecting its accuracy. Evaluation of two application simulations demonstrates that our approach significantly reduces memory/storage requirements and speeds up training while producing high-quality surrogate models. These results show that lossy compression saves data storage up to 23.7x and 39x with negligible impact on the quality of the surrogate model. Meanwhile, reducing the size of the training data set also enhances the data loading speed and reduces the training time by up to 3x.

2606.16059 2026-06-16 cs.LG cs.AI 交叉投稿

Mojo: A Promising Tool for Scalable Financial AI Efficiency

Mojo:可扩展金融AI效率的有前景工具

Henry Han

发表机构 * Data Science and Artificial Intelligence Innovation Laboratory, School of Engineering and Computer Science, Baylor University(贝勒大学工程与计算机科学学院数据科学与人工智能创新实验室)

AI总结 本文介绍Mojo语言,通过MLIR编译和确定性内核设计,解决量化金融中Python到C++的性能差距与数值不一致问题,在金融AI工作负载上实现20-180倍加速。

Comments 15, 3 figures

详情
AI中文摘要

三十年来,量化金融一直承受着高昂的双语言税:用Python研究的模型需重写为C++用于生产,常常引入数值差异。GPU加速深度学习加剧了这一问题,因为非确定性浮点归约可能在长回测中产生漂移,挑战监管可重复性和审计期望。本文调查了Mojo——Modular公司2026年推出的类Python系统语言,作为资本市场工程的结构性回应。在缩小Python到C++性能差距的同时,Mojo独特地结合了原生互操作性和构建位精确确定性内核所需的底层系统控制。其MLIR编译基础设施进一步允许单一代码库针对标量、SIMD、多核和GPU执行,减少了研究与生产之间的转换瓶颈。我们对四个核心金融AI工作负载进行了基准测试:蒙特卡洛期权定价、LLM情感推理、多资产回测和投资组合风险价值。在Apple Silicon上,Mojo在直接测量的内核上相比纯Python实现了20倍到180倍的加速;更大规模GPU工作负载的结果是根据已发表基准校准的预测。除了透明的性能数据,我们还介绍了mojo-deterministic,一个可重现归约内核的开源库,并对Mojo已解决和尚未解决的问题进行了坦诚评估。

英文摘要

For thirty years, quantitative finance has paid a costly two-language tax: models researched in Python are rewritten in C++ for production, often introducing numerical discrepancies. GPU-accelerated deep learning exacerbates this problem, as nondeterministic floating-point reductions can produce drift in long backtests, challenging regulatory reproducibility and auditability expectations. This article surveys Mojo, Modular's 2026 Python-like systems language, as a structural response for capital markets engineering. While closing the Python-to-C++ performance gap, Mojo uniquely combines native interoperability with the low-level systems control required to construct bit-exact deterministic kernels. Its MLIR compilation infrastructure further allows a single codebase to target scalar, SIMD, multicore, and GPU execution, reducing the translation bottleneck between research and production. We benchmark four core financial AI workloads: Monte Carlo option pricing, LLM sentiment inference, multi-asset backtesting, and portfolio Value at Risk. On Apple Silicon, Mojo demonstrates 20x to 180x speedups over pure Python on directly measured kernels; larger-scale GPU workload results are projections calibrated from published benchmarks. Alongside transparent performance data, we introduce mojo-deterministic, an open-source library of reproducible reduction kernels, and provide a candid assessment of the problems Mojo does and does not yet solve.

2606.16133 2026-06-16 cond-mat.mtrl-sci cs.AI 交叉投稿

InvDesMobility: a reliability-gated first-principles feedback framework for closed-loop materials discovery

InvDesMobility:用于闭环材料发现的可靠性门控第一性原理反馈框架

Wen-Kao Li, Ze-Feng Gao, Peng-Jie Guo, Wei Ji, Zhong-Yi Lu

发表机构 * School of Physics and Key Laboratory of Quantum State Construction and Manipulation (Ministry of Education), Renmin University of China(物理学院和量子态构造与操控重点实验室(教育部) ,中国人民大学)

AI总结 提出InvDesMobility框架,通过可靠性门控和证据分层,实现基于第一性原理的闭环逆向材料设计,在载流子迁移率搜索中筛选出86个可靠性门控生成通道。

Comments 33 pages, 4 main figures, 2 main tables; Supplementary Information included

详情
AI中文摘要

逆向材料设计从目标功能出发,搜索能够实现该功能的结构。其在闭环发现中的价值不仅取决于预测性能,还取决于昂贵的从头算结果是否经过独立验证、记录来源,并且仅在证据充分时才作为反馈被接受。这对于复合性质(如载流子迁移率)尤其重要,因为最终的标量值隐藏了中间量、拟合质量、收敛历史和工作流假设。本文提出InvDesMobility,一个可靠性门控的第一性原理反馈框架,集成了多智能体自动DFT、证据分层、生成式结构提议、采集排序和可审计发布。使用516个2DMatPedia衍生候选,该工作流产生了280个通过QC的材料和573个保留的载流子方向种子通道(经过通道级可靠性门控)。这些记录被分为两个反馈对象:松弛结构更新生成模型,保留的迁移率通道训练采集模型并设置验证优先级。经过多次迭代,InvDesMobility筛选了2.4×10^6个结构,提交了102个候选进行DFT验证,并保留了41个化学式下的86个可靠性门控生成通道。总体而言,主要贡献不是固定的高迁移率材料列表,而是一个可迁移的反馈契约,使得从昂贵的计算性质中学习时,闭环逆向设计既实用又可审计。所有源数据、保留的反馈记录和工作流均可在https://github.com/DreamLufei/invDesMobility获取,附带的证据网站为https://dreamlufei.github.io/invDesMobility/。

英文摘要

Inverse materials design starts from target functionality and searches for structures that can realize it. Its value in closed-loop discovery depends not only on prediction performance, but also on whether expensive first-principles results are independently validated, provenance-recorded, and admitted as feedback only when evidence is sufficient. This is especially important for composite properties such as carrier mobility, where a final scalar value hides intermediate quantities, fit quality, convergence history, and workflow assumptions. Here we present InvDesMobility, a reliability-gated first-principles feedback framework that integrates multi-agent automated DFT, evidence stratification, generative structure proposal, acquisition ranking, and auditable release. Using 516 2DMatPedia-derived candidates, the workflow produced 280 QC-passed materials and 573 retained carrier-direction seed channels after channel-level reliability gating. These records were split into two feedback objects: relaxed structures updated the generative model, while retained mobility channels trained the acquisition model and set validation priority. Over multiple iterations, InvDesMobility screened 2.4 x 10^6 structures, submitted 102 candidates for DFT validation, and retained 86 reliability-gated generated channels across 41 formulas. Overall, the main contribution is not a fixed list of high-mobility materials, but a transferable feedback contract that makes closed-loop inverse design both useful and auditable when learning from expensive calculated properties. All source data, retained feedback records, and workflows are available at https://github.com/DreamLufei/invDesMobility, with an accompanying evidence website at https://dreamlufei.github.io/invDesMobility/.

2606.16183 2026-06-16 cs.LG cs.AI cs.CL 交叉投稿

LLM-Powered Virtual Population for Demand Simulation and Pricing

基于LLM的虚拟人群用于需求模拟与定价

Chengpiao Huang, Kaizheng Wang

发表机构 * Columbia University(哥伦比亚大学)

AI总结 提出一种LLM驱动的虚拟人群模型,通过混合客户画像和LLM评估购买概率,生成需求分布,支持风险感知定价,在H&M数据集上表现最优。

Comments 18 pages, 7 figures

详情
AI中文摘要

我们开发了一个基于LLM的虚拟人群模型,用于模拟定价决策中的需求,其中产品由丰富的非结构化信息(如文本描述和图像)描述,决策者不仅需要平均需求预测,还需要反事实价格的不确定性估计。我们的模型将暴露的客户表示为从有限混合客户画像中的抽取。对于每个画像、产品和候选价格,LLM使用结构化画像信息和非结构化产品信息来引出画像级别的购买概率。这些概率通过校准的混合权重聚合,形成总需求的预测分布。生成的模拟器可以在各种定价目标下评估反事实价格,包括期望收入和风险感知标准(如条件风险价值)。我们在一个包含产品描述和图像的在线H&M时尚数据集上测试了该框架。校准后的基于LLM的模拟器在所考虑的模型中实现了最佳的整体预测性能,并支持样本高效的定价决策。我们的框架提供了一种实用的方法,将LLM用作需求模拟器,适用于历史需求数据有限但产品信息丰富的产品。通过生成完整的需求预测分布而不仅仅是点预测,它使管理者能够比较候选价格、量化需求不确定性,并选择针对平均收入或风险感知目标的价格。

英文摘要

We develop an LLM-powered virtual population model that simulates demand for pricing decisions, in settings where products are described by rich unstructured information, such as text descriptions and images, and where decision makers need not only mean-demand predictions but also uncertainty estimates for counterfactual prices. Our model represents exposed customers as draws from a finite mixture of customer personas. For each persona, product, and candidate price, an LLM elicits a persona-level purchase probability using both structured persona information and unstructured product information. These probabilities are aggregated through calibrated mixture weights to form a predictive distribution of aggregate demand. The resulting simulator can evaluate counterfactual prices under various pricing objectives, including expected revenue and risk-aware criteria such as conditional value at risk. We test the framework on an online H&M fashion dataset with product descriptions and images. The calibrated LLM-based simulator achieves the best overall predictive performance among the models considered, and supports sample-efficient pricing decisions. Our framework provides a practical way to use LLMs as demand simulators for products with limited historical demand data but rich product information. By producing a full predictive demand distribution rather than only a point forecast, it enables managers to compare candidate prices, quantify demand uncertainty, and choose prices that target either average-case revenue or risk-aware objectives.

2606.16190 2026-06-16 cs.AR cs.AI 交叉投稿

Embedded Arena: Iterative Optimization via Hardware Feedback

嵌入式竞技场:通过硬件反馈的迭代优化

Zhihan Zhang, Alexander Le Metzger, Jiuyang Lyu, Chun-Cheng Chang, Jiayi Shao, Yujia Liu, Emmanuel Azuh Mensah, Edward Wang, Kurtis Heimerl, Gregory D. Abowd, Shwetak Patel, Natasha Jaques, Vikram Iyer

发表机构 * University of Washington(华盛顿大学) University of California San Diego(加州大学圣地亚哥分校) Northeastern University(东北大学)

AI总结 提出硬件在环智能体框架,通过迭代编译、烧录和测量真实硬件,实现模型与固件的闭环优化,在嵌入式设备上达到250倍压缩且精度损失<3.3%。

Comments Code: https://github.com/ubicomplab/embedded-arena

详情
AI中文摘要

从野生动物监测站到临床可穿戴设备,嵌入式设备由于延迟、通信或隐私限制,需要本地AI推理。为异构微控制器(MCU)优化模型需要在保持精度的同时,同时满足内存、功耗和温度等硬物理约束,这是一个多维优化问题,目前由专家手动执行。我们探究LLM智能体是否能在真实硬件反馈的引导下自主导航这一复杂、多轮流水线,并引入一个硬件在环智能体竞技场,其中智能体迭代优化模型和固件——在真实硬件上编译、烧录和测量——以实现闭环优化。前沿模型,包括Claude Opus 4.7和Gemini 3.1 Pro,在没有硬件反馈时完全失败(0%部署成功率),而我们的硬件在环方案在三次迭代内实现首次成功部署,并在七次迭代内超越人类专家结果。这种智能体协同优化实现了视觉模型250倍压缩(精度损失<3.3%)和音频模型400倍压缩(特征错误率损失<6%),通过太阳能收集实现商业MCU上的无电池运行。我们在两个实际系统中展示了实际影响:用于麋鹿检测的相机陷阱(96.7%准确率)和用于儿童发展研究的语音转录可穿戴设备(8.44% FER)。

英文摘要

Embedded devices from wildlife monitoring stations to clinical wearables require local AI inference due to latency, communication, or privacy constraints. Optimizing models for heterogeneous microcontrollers (MCUs) requires simultaneously satisfying hard physical constraints on memory, power, and temperature while preserving accuracy, a multidimensional optimization that is today performed manually by experts. We ask whether an LLM agent can autonomously navigate this complex, multi-turn pipeline guided by real hardware feedback, and introduce a hardware-in-the-loop agent arena in which the agent iteratively refines both model and firmware -- compiling, flashing, and measuring on real hardware -- to enable closed-loop optimization. Frontier models, including Claude Opus 4.7 and Gemini 3.1 Pro, fail entirely without hardware feedback (0% deployment success), whereas our hardware-in-the-loop formulation achieves the first successful deployment within three iterations and can surpass human expert results within seven. This agentic co-optimization achieves 250x compression for vision models with <3.3% accuracy loss and 400x for audio with <6% Feature Error Rate loss, enabling battery-free operation on a commercial MCU via solar harvesting. We demonstrate practical impact in two real-world systems: an elk-detection camera trap (96.7% accuracy) and a phonetic-transcription wearable (8.44% FER) for child development research.

2606.16212 2026-06-16 cs.CV cs.AI 交叉投稿

LUCID: Learned Undersampling-Adaptive Consistency-Guided Inference with Deterministic Flow Matching for Sparse-View CT Reconstruction

LUCID:基于确定性流匹配的学习型欠采样自适应一致性引导稀疏视角CT重建

Jigang Duan, Jiayi Wang, Heran Wang, Ping Yang, Genwei Ma, Xing Zhao

发表机构 * School of Mathematical Sciences, Capital Normal University(首都师范大学数学科学学院) National Center for Applied Mathematics Beijing, Capital Normal University(首都师范大学北京国家应用数学中心) Academy for Multidisciplinary Studies, Capital Normal University(首都师范大学交叉科学研究院)

AI总结 提出LUCID框架,利用流匹配生成先验和稀疏度自适应策略,通过退化匹配初始状态和投影域一致性校正,实现不同采样密度下的稳定稀疏视角CT重建,减少伪影和幻觉结构。

详情
AI中文摘要

稀疏视角CT通过获取更少的投影视图来减少辐射剂量和扫描时间,但角度欠采样使得重建严重病态,导致条纹伪影、结构模糊和细节丢失。现有的监督方法通常受限于特定的采样设置,而生成方法在严重欠采样下可能引入解剖上不一致的幻觉样结构。我们提出Lucid,一种基于流匹配生成先验的稀疏自适应、一致性引导重建框架,用于稀疏视角CT。Lucid仅在高品质CT图像上训练,学习高斯分布与高品质CT图像分布之间的连续传输,与视角采样无关。在推理过程中,显式纳入采样稀疏度水平,以调整单个预训练模型的生成轨迹。具体地,Lucid通过稀疏度加权融合稀疏视角FBP图像和高斯噪声构建退化匹配的初始状态,执行稀疏度调制的流匹配更新,并在每次先验更新后应用投影域数据一致性校正。在多种稀疏视角设置下的实验表明,Lucid在不同采样密度下实现稳定的重建性能,提高图像质量和结构保真度,并降低生成式稀疏视角CT重建中幻觉样结构的风险。

英文摘要

Sparse-view CT reduces radiation dose and scanning time by acquiring fewer projection views, but angular undersampling makes reconstruction severely ill-posed, causing streak artifacts, structural blurring, and loss of fine details. Existing supervised methods are often tied to specific sampling settings, whereas generative methods may introduce anatomically inconsistent hallucination-like structures under severe undersampling. We propose Lucid, a sparsity-adaptive, consistency-guided reconstruction framework based on a Flow Matching generative prior for sparse-view CT. Lucid is trained only on high-quality CT images to learn a continuous transport between a Gaussian distribution and the high-quality CT image distribution, independent of view sampling. During inference, the sampling sparsity level is explicitly incorporated to adapt the generative trajectory of a single pretrained model. Specifically, Lucid constructs a degradation-matched initial state by sparsity-weighted fusion of the sparse-view FBP image and Gaussian noise, performs sparsity-modulated Flow Matching updates, and applies projection-domain data-consistency correction after each prior update. Experiments under multiple sparse-view settings show that Lucid achieves stable reconstruction performance across different sampling densities, improves image quality and structural fidelity, and reduces the risk of hallucination-like structures in generative sparse-view CT reconstruction.

2606.16231 2026-06-16 cs.LG cs.AI 交叉投稿

From Tokens to Regions: CUDA-Sensitive Instruction Tuning for GPU Kernel Generation

从令牌到区域:面向GPU内核生成的CUDA敏感指令微调

Wentao Chen, Jiace Zhu, Xing Zhe Chai, Zeng Qu, Qiaoling Xiao, Liucheng Duan, An Zou

发表机构 * Shanghai Jiao Tong University(上海交通大学) Biren Technology(壁仞科技)

AI总结 提出CuSeT方法,通过自适应令牌级掩码和区域感知样本重加权,在简单SFT框架内提升LLM生成CUDA内核的功能正确性。

详情
AI中文摘要

高性能CUDA内核对于可扩展的AI系统至关重要,而大型语言模型(LLM)由于严格且隐式的执行约束,仍然难以生成正确的内核。现有的基于LLM的方法要么依赖昂贵的智能体或强化学习(RL)流水线,要么采用监督微调(SFT)目标,但未能显式建模CUDA敏感性,即与执行约束紧密耦合的代码令牌或区域。在这项工作中,我们从令牌置信度模式的角度研究CUDA敏感性,表明CUDA敏感性出现在令牌和区域两个层面,其中大多数CUDA敏感令牌以高置信度被预测,而较小的低置信度子集形成对应于执行关键结构的区域。这些发现表明,有效的CUDA内核生成应同时利用高置信度的CUDA敏感令牌并保留低置信度的CUDA敏感区域。基于这些见解,我们提出了\textbf{\underline{CU}DA-\underline{Se}nsitive Instruction \underline{T}uning (CuSeT)},一种在简单SFT框架内的低成本后训练方法。CuSeT遵循“从令牌到区域”的原则,结合了\emph{自适应令牌级掩码}和\emph{区域感知样本重加权}。实验表明,CuSeT在多个模型系列和规模上一致地提高了功能正确性,优于标准SFT和高级SFT变体,同时以显著更低的推理成本达到了与前沿CUDA内核生成模型相竞争的性能。

英文摘要

High-performance CUDA kernels are essential for scalable AI systems, while Large Language Models (LLMs) still struggle to generate correct kernels due to strict and implicit execution constraints. Existing LLM-based approaches either rely on costly agentic or reinforcement-learning (RL) pipelines, or adopt supervised fine-tuning (SFT) objectives that fail to explicitly model CUDA sensitivity, namely code tokens or regions tightly coupled with execution constraints. In this work, we investigate CUDA sensitivity from the perspective of token confidence patterns, showing that CUDA sensitivity appears at both token and region levels, where most CUDA-sensitive tokens are predicted with high confidence, while a smaller low-confidence subset forms regions corresponding to execution-critical structures. These findings suggest that effective CUDA kernel generation should both leverage high-confidence CUDA-sensitive tokens and preserve low-confidence CUDA-sensitive regions. Building on these insights, we propose \textbf{\underline{CU}DA-\underline{Se}nsitive Instruction \underline{T}uning (CuSeT)}, a low-cost post-training method within a simple SFT framework. CuSeT follows the principle of ``from tokens to regions'' by combining \emph{adaptive token-level masking} with \emph{region-aware sample reweighting}. Experiments show that CuSeT consistently improves functional correctness across multiple model families and scales, outperforming standard SFT and advanced SFT variants, while achieving competitive performance against frontier CUDA kernel generation models with substantially lower inference cost.

2606.16234 2026-06-16 cs.CV cs.AI 交叉投稿

Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans

传播结构引导:从眼底图像和稀疏OCT扫描合成荧光素血管造影

Tengfei Ma, Ruiqi Wu, Chenran Zhang, Ye Geng, Na Su, Xiangyuan Duanmu, Tao Zhou, Yi Zhou, Wen Fan

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education(教育部新一代人工智能技术及其跨学科应用重点实验室) Tianyuan Honors School, Nanjing Medical University(南京医科大学天元荣誉学院) Nanjing University of Science and Technology(南京理工大学) Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University(南京医科大学第一附属医院眼科)

AI总结 提出从彩色眼底照片(CFP)和稀疏OCT扫描合成荧光素血管造影(FFA)的框架,通过空间对齐跨模态融合和令牌级对比学习,实现非侵入性FFA合成,提升下游诊断性能。

Comments Accepted to MICCAI 2026 (Early Accept)

详情
AI中文摘要

眼底荧光素血管造影(FFA)对于评估视网膜血管异常至关重要,但其获取具有侵入性且并非总是可行。相比之下,彩色眼底摄影(CFP)无创且广泛可用,这推动了CFP到FFA合成的研究。然而,先前的工作仅依赖CFP表面纹理,从根本上限制了重建功能性血管信息和细微病理变化的能力。为了解决这个问题,我们提出了一种新颖的框架,该框架利用光学相干断层扫描(OCT)提供的结构引导,从CFP合成FFA。我们构建了一个包含来自3,676只患者眼睛的配对CFP、FFA和OCT的多模态视网膜成像数据集——这是视网膜成像中首个三模态对齐数据集。为了弥合OCT和眼底模态之间的空间差距,我们提出了空间对齐跨模态融合(SACMF)模块,该模块将深度分辨的OCT特征投影到眼底平面,并通过自适应层归一化将其注入CFP编码器。除了特征融合,我们还引入了令牌级跨模态对齐(TCMA),这是一种令牌级对比学习策略,在对应空间位置显式对齐CFP和FFA表示。我们的方法相比最先进的方法实现了更优的合成性能。此外,大量实验表明,我们方法合成的FFA图像在提升下游疾病诊断性能方面比现有方法带来更大的改进,突显了我们的方法作为常规工作流程中无创决策支持工具的临床潜力。代码可在https://github.com/while-plus/OCT-guide-FFA-Syn获取。

英文摘要

Fundus fluorescein angiography (FFA) is critical for assessing retinal vascular abnormalities, but its acquisition is invasive and not always feasible. In contrast, color fundus photography (CFP) is non-invasive and widely accessible, which has motivated studies on CFP-to-FFA synthesis. However, prior works rely solely on CFP surface texture, fundamentally limiting the ability to reconstruct functional vascular information and subtle pathological changes. To address this, we propose a novel framework that synthesizes FFA from CFP with structural guidance provided by optical coherence tomography (OCT). We construct a multi-modal retinal imaging dataset with paired CFP, FFA, and OCT from 3,676 patient eyes--the first tri-modally aligned dataset in retinal imaging. To bridge the spatial gap between OCT and fundus modalities, we propose a Spatially Aligned Cross-Modal Fusion (SACMF) module that projects depth-resolved OCT features onto the fundus plane and injects them into the CFP encoder via adaptive layer normalization. Beyond feature fusion, we further introduce Token-wise Cross-Modality Alignment (TCMA), a token-level contrastive learning strategy that explicitly aligns CFP and FFA representations at corresponding spatial positions. Our method achieves superior synthesis performance compared to state-of-the-art methods. Moreover, extensive experiments demonstrate that the FFA images synthesized by our approach bring greater improvements in downstream disease diagnosis performance than existing methods, highlighting the clinical potential of our approach as a non-invasive decision-support tool in routine workflows. The code is available at https://github.com/while-plus/OCT-guide-FFA-Syn.

2606.16278 2026-06-16 cs.CV cs.AI 交叉投稿

RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

RealityBridge: 连接可编辑3D高斯泼溅驾驶模拟与现实世界视频

Zhenhua Wu, Yun Pang, Mingkun Chang, Yuwei Ning, Liangzhi Wang, Yi Xiao, Guanbin Li

发表机构 * Sun Yat-sen University(中山大学) Guangdong Key Laboratory of Information Security Technology(广东省信息安全技术重点实验室) Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education(教育部机器智能与先进计算重点实验室)

AI总结 提出RealityBridge框架,利用多模态控制和轻量级GateNet,结合自回归长视频训练与奖励引导后训练,缩小编辑后3DGS驾驶视频的Sim-to-Real差距,提升视觉真实感和时间一致性。

详情
AI中文摘要

长尾危险场景对于安全导向的自动驾驶至关重要,但难以大规模收集和复现。可编辑3D高斯泼溅(3DGS)模拟通过重建真实驾驶场景并支持可控场景编辑,提供了一种有前景的替代方案。然而,编辑后的3DGS渲染视频仍存在显著的Sim-to-Real差距,包括渲染伪影、前景资产退化、光照不一致和时间闪烁。现有的修复和视频生成方法不足以应对此任务,因为它们通常无法联合修复3DGS特定伪影、提升视觉真实感并确保时间一致性。为填补这一空白,我们提出RealityBridge,一种针对编辑后3DGS驾驶视频的结构保持和资产感知的Sim-to-Real框架。RealityBridge使用多模态控制,包括渲染视频、前景掩码、边缘图和语义掩码,并结合轻量级GateNet进行跨骨干层的自适应条件分配。我们进一步构建了针对性的训练数据,并引入自回归长视频训练与奖励引导后训练,以提升修复质量、时间稳定性和幻觉抑制。在内部和公开驾驶数据集上的大量实验表明,RealityBridge在伪影去除、光照协调和长序列时间一致性方面优于现有方法。

英文摘要

Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.

2606.16292 2026-06-16 cs.SE cs.AI 交叉投稿

AI Supply Chain Galaxy: 3D Visual Analytics for License Compliance

AI供应链星系:用于许可证合规的3D可视化分析

Weiru Han, Xuetao Shi, Wenyi He, Wei Wang, Rui Zhao, Moming Duan

发表机构 * East China Normal University(东华大学) Tianjin University(天津大学)

AI总结 提出AI供应链星系(AISCG),一种交互式3D可视化分析系统,通过空间布局和规则引擎对Hugging Face上908,449个模型进行实证分析,发现55.46%的模型存在合规风险,并识别出适配器派生中56.67%的许可证遗漏和微调中8.05%的许可证漂移等风险模式。

Comments 15 pages, 6 figures

详情
AI中文摘要

机器学习模型复用的快速普及已将AI生态系统转变为一个高度互联的供应链。传统的合规工具和静态报告难以应对这些庞大且多跳的依赖网络。为此,我们提出了AI供应链星系(AISCG),一个用于模型溯源和合规审计的交互式3D可视化分析系统。AISCG将模型映射到3D空间布局中,将显式结构依赖与基于规则的合规引擎相结合。它支持多尺度探索,从全局社区检测到局部、路径感知的谱系追踪。我们通过对Hugging Face上908,449个模型的生态系统规模实证分析展示了其有效性。我们的发现揭示了一个令人担忧的现状:55.46%的模型存在合规风险或元数据冲突/缺失。我们还识别出不同的风险模式,包括适配器派生中56.67%的许可证遗漏率和微调中8.05%的“许可证漂移”率。通过对复杂的Llama模型家族进行案例研究,我们展示了AISCG如何帮助分析人员直观地追溯继承的受限条款,并在深层拓扑网络中识别根本原因,从而显著降低合规审计的认知负荷。

英文摘要

The rapid proliferation of machine learning model reuse has transformed the AI ecosystem into a highly interconnected supply chain. Traditional compliance tools and static reports struggle to navigate these massive, multi-hop dependency networks. To address this, we present AI Supply Chain Galaxy (AISCG), an interactive 3D visual analytics system for model provenance and compliance auditing. AISCG maps models into a 3D spatial layout, integrating explicit structural dependencies with a rule-based compliance engine. It supports multi-scale exploration, from global community detection to localized, path-aware lineage tracing. We demonstrate its efficacy through an ecosystem-scale empirical analysis of 908,449 models from Hugging Face. Our findings reveal a concerning landscape: 55.46% of models exhibit compliance risks or metadata conflicts/omissions. We also identified distinct risk patterns, including a 56.67% license omission rate in adapter derivations and an 8.05% "license drift" rate in fine-tuning. Through a case study on the complex Llama model family, we show how AISCG empowers analysts to intuitively trace inherited restrictive terms and identify root causes across deep topological networks, significantly reducing the cognitive load of compliance auditing.

2606.16332 2026-06-16 cs.DC cs.AI cs.PF 交叉投稿

SMEPilot: Characterizing and Optimizing LLM Inference with Scalable Matrix Extensions

SMEPilot: 利用可扩展矩阵扩展表征和优化LLM推理

Feiyang Chen, Haibo Chen

发表机构 * IPADS, Shanghai Jiao Tong University(上海交通大学IPADS)

AI总结 针对CPU矩阵扩展单元与核心在LLM推理中的不匹配,提出SMEPilot引擎,通过基于屋顶线的表征指导算子执行选择,实现SME与CPU协同工作,端到端性能提升达3.94倍。

详情
AI中文摘要

现代CPU越来越多地集成矩阵扩展,如Arm可扩展矩阵扩展(SME),在CPU内提供高吞吐量矩阵执行。然而,对于LLM推理,这些单元并不能普遍替代传统CPU核心:预填充、解码、注意力和KV缓存操作表现出不同的算术强度、向量行为和布局要求,而SME单元和CPU核心仍竞争共享内存带宽。本文通过基于屋顶线的SME CPU表征研究这种不匹配,并使用所得模型指导算子级执行选择。我们提出SMEPilot,一个LLM推理引擎,为每个算子形状选择仅CPU、仅SME或协作SME+CPU执行。SMEPilot在瓦片粒度上将矩阵工作分区到SME和CPU核心,在注意力中重叠适合SME的矩阵阶段与适合CPU的向量阶段,并维护布局状态,以便打包的张量表示被重用,而不是在关键路径上重复构建。在手机、PC和服务器平台上,针对Llama-3.2-3B、Qwen3-4B和Qwen3-30BA3B,SMEPilot将端到端推理性能提升高达3.94倍。

英文摘要

Modern CPUs increasingly integrate matrix extensions, such as Arm Scalable Matrix Extension (SME), that provide high-throughput matrix execution within the CPU. For LLM inference, however, these units are not a universal replacement for conventional CPU cores: prefill, decode, attention, and KV-cache operations expose different arithmetic intensities, vector behavior, and layout requirements, while SME units and CPU cores still compete for shared memory bandwidth. This paper studies this mismatch through a roofline-based characterization of SME-enabled CPUs and uses the resulting model to guide operator-level execution choices. We present SMEPilot, an LLM inference engine that selects CPU-only, SME-only, or cooperative SME+CPU execution for each operator shape. SMEPilot partitions matrix work across SME and CPU cores at tile granularity, overlaps SME-suitable matrix stages with CPU-suitable vector stages in attention, and maintains layout state so packed tensor representations are reused rather than repeatedly rebuilt on critical paths. Across Llama-3.2-3B, Qwen3-4B, and Qwen3-30BA3B on phone, PC, and server platforms, SMEPilot improves end-to-end inference performance by up to 3.94$\times$.

2606.16434 2026-06-16 cs.LG cs.AI 交叉投稿

Autonomous End-to-End SOH Prediction Services for Battery Systems via Temporal-Contrastive Representation Learning

基于时间对比表示学习的电池系统自主端到端健康状态预测服务

Junting Wen, Dan Li, Qihao Quan, Xiwen Wang, Hang Yang, Zhaohong Meng, Zigui Jiang, Changlin Yang, Tianle Liu, Diego Muñoz-Carpintero, Jian Lou

发表机构 * School of Software Engineering, Sun Yat-sen University(中山大学软件学院) Tianneng Battery Group Co., Ltd(天能电池集团有限公司) School of Communication Engineering, Hangzhou Dianzi University(杭州电子科技大学通信工程学院) Institute of Engineering Science, Universidad de O’Higgins(奥希金斯大学工程科学研究所)

AI总结 提出TC-SOH模块化服务架构,通过时间对比机制和跨窗口预测任务从原始数据中提取退化相关表示,实现自主端到端SOH预测,在四个数据集上MAPE和RMSE分别降低1.91倍和2.13倍。

详情
AI中文摘要

准确的状态健康(SOH)估计是锂离子电池管理的关键诊断服务。然而,依赖劳动密集型的手动特征工程和不透明的黑箱模型阻碍了可扩展的工业部署。为此,我们引入TC-SOH:一种模块化、即插即用的服务架构,用于自主、端到端的SOH预测。TC-SOH采用时间对比机制和跨窗口预测预任务,直接从原始运行数据中提取与退化相关的表示。为了提高透明度,我们将模型效能与表示诊断联系起来:可视化、敏感性分析、冗余分析、双向探测、未来SOH探测和时间洗牌表明,学习到的特征与选定的专家描述符重叠,同时保留了额外的SOH相关变化,并且有序的时间上下文改善了后续SOH预测。在四个公开数据集上,TC-SOH优于所考虑的物理信息和数据驱动基线,MAPE降低了1.91倍,RMSE降低了2.13倍。

英文摘要

Accurate state of health (SOH) estimation is a critical diagnostic service for lithium-ion battery management. However, reliance on labor-intensive manual feature engineering and opaque black-box models hinders scalable industrial deployment. To address this, we introduce TC-SOH: a modular, plug-and-play service architecture for autonomous, end-to-end SOH prediction. TC-SOH employs a temporal-contrastive mechanism and a cross-window prediction pretext task to extract degradation-relevant representations directly from raw operational data. To improve transparency, we connect model efficacy with representation diagnostics: visualization, sensitivity analysis, redundancy analysis, bidirectional probing, future-SOH probing, and temporal shuffling show that learned features overlap with selected expert descriptors while retaining additional SOH-relevant variation, and that ordered temporal context improves subsequent-SOH prediction. Across four public datasets, TC-SOH outperforms the considered physics-informed and data-driven baselines, reducing MAPE by 1.91 times and RMSE by 2.13 times.

2606.16440 2026-06-16 cs.AR cs.AI cs.LG 交叉投稿

NeuronFabric: A Software Reference Architecture for On-Chip Transformer Training with Local Adam

NeuronFabric:一种用于片上Transformer训练与本地Adam的软件参考架构

Evgeny Ukladchikov

发表机构 * Independent Researcher(独立研究者)

AI总结 提出NeuronFabric软件参考架构,用于FPGA/ASIC实现Transformer训练与本地Adam优化,通过BF16W权重存储减少片上内存需求,在334K参数模型上验证数值正确性。

详情
AI中文摘要

公开记载的加速器架构通常将训练计算与优化器状态更新分离,或依赖外部内存和主机协调。本文提出NeuronFabric,一种旨在用于未来FPGA和ASIC实现Transformer训练与本地Adam更新的软件参考架构。一个完整的C#原型实现了前向传播、反向传播和Adam优化,无需外部机器学习框架。目标是在硬件实现前验证数值正确性和内存需求。评估模型是一个334K参数的自回归Transformer(d=88, H=4, f=264, L=4, vocab=256),在莎士比亚语料库上训练。BF16W配置在80K样本后达到评估损失1.5426,而FP32 GPU参考为1.5224,同时生成连贯的字符级文本。本文引入BF16W,它以BF16存储权重,同时以FP32保留Adam优化器动量。这减少了片上训练的内存需求。一个带Adam动量的334K参数FP32模型需要约4.0 MB,与Xilinx ZCU102设备的BRAM容量匹配。BF16W变体需要约3.34 MB,为激活存储留出内存。我们描述了早期实验中观察到的词汇预算约束,量化了BF16W内存节省,并概述了FPGA训练作为下一开发阶段。本文不包含FPGA测量。本出版物作为未来FPGA和ASIC探索NeuronFabric架构的公开架构披露和软件参考实现。

英文摘要

Publicly documented accelerator architectures generally separate training computation from optimizer-state updates or rely on external memory and host orchestration. This paper presents NeuronFabric, a software reference architecture intended for future FPGA and ASIC implementations of transformer training with local Adam updates. A complete C# prototype implements forward pass, backpropagation, and Adam optimization without external machine-learning frameworks. The goal is to validate numerical correctness and memory requirements before hardware implementation. The evaluated model is a 334K-parameter autoregressive transformer (d=88, H=4, f=264, L=4, vocab=256) trained on the Shakespeare corpus. The BF16W configuration achieves evaluation loss 1.5426 after 80K samples, compared with 1.5224 for an FP32 GPU reference, while producing coherent character-level text. The paper introduces BF16W, which stores weights in BF16 while retaining Adam optimizer moments in FP32. This reduces memory requirements for on-chip training. A 334K-parameter FP32 model with Adam moments requires approximately 4.0 MB, matching the BRAM capacity of a Xilinx ZCU102 device. The BF16W variant requires approximately 3.34 MB, leaving memory available for activation storage. We describe the vocabulary-budget constraint observed during earlier experiments, quantify BF16W memory savings, and outline FPGA training as the next stage of development. No FPGA measurements are included in this paper. This publication serves as a public architectural disclosure and software reference implementation for future FPGA and ASIC exploration of the NeuronFabric architecture.

2606.16497 2026-06-16 cs.LG cs.AI cs.CL 交叉投稿

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

daVinci-kernel:通过强化学习协同进化技能选择、总结与利用的GPU内核优化

Dayuan Fu, Mohan Jiang, Tongyu Wang, Dian Yang, Jiarui Hu, Liming Liu, Jinlong Hou, Pengfei Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出daVinci-kernel框架,通过强化学习联合训练技能选择、策略生成和技能总结三个智能体,共享LLM骨干,实现GPU内核优化,在KernelBench上超越先前最优模型。

详情
AI中文摘要

GPU内核优化代表了一种范式,其中功能正确性被假定,执行效率是目标。我们提出daVinci-kernel,一个强化学习框架,通过动态演化的技能库将技能发现与技能利用相结合。daVinci-kernel联合训练三个共享一个LLM骨干的智能体:技能选择智能体通过BM25和LLM重排序检索相关技术,策略智能体基于所选技能生成多轮CUDA/Triton内核,技能总结智能体将成功轨迹提炼为可复用技能。候选技能仅在基于执行的验证确认可复现加速后才被添加。所有三个智能体共享单个LLM骨干,通过多样性过滤数据上的结构化SFT冷启动初始化,然后通过多轮REINFORCE和每个智能体的优势估计进行端到端联合优化。在KernelBench上,daVinci-kernel-14B在Fast$_1$阈值下,Level 1、Level 2和Level 3分别达到37.2%、70.6%和32.2%,优于先前最强的RL训练模型Dr.Kernel-14B。

英文摘要

GPU kernel optimization represents a paradigm where functional correctness is assumed and execution efficiency is the objective. We present daVinci-kernel, a reinforcement learning framework that couples skill discovery with skill exploitation through a dynamically evolving skill library. daVinci-kernel jointly trains three agents sharing one LLM backbone: a Skill Selection Agent that retrieves relevant techniques via BM25 and LLM reranking, a Policy Agent that generates multi-turn CUDA/Triton kernels conditioned on selected skills, and a Skill Summary Agent that distills successful rollouts into reusable skills. Candidate skills are added only after execution-based verification confirms reproducible speedups. All three agents share a single LLM backbone, are initialized via a structured SFT cold start on diversity-filtered data, and are then jointly optimized end-to-end with multi-turn REINFORCE and per-agent advantage estimation. On KernelBench, daVinci-kernel-14B achieves 37.2%, 70.6%, and 32.2% on Level 1, Level 2, and Level 3 under the Fast$_1$ threshold, outperforming the strongest prior RL-trained model, Dr.Kernel-14B.

2606.16532 2026-06-16 cs.SD cs.AI 交叉投稿

Dual-Granularity Orthogonal Disentanglement for Generalizable Audio Deepfake Detection

双粒度正交解耦用于可泛化的音频深度伪造检测

Zhuodong Liu, Hugen Lv, Xiangyu Li, Chunhong Yuan

发表机构 * Beijing Jiaotong University(北京交通大学) Shanghai Jiao Tong University(上海交通大学) ITMO University(ITMO大学)

AI总结 针对音频深度伪造检测中隐式身份泄漏问题,提出双粒度正交解耦框架,通过样本级余弦正交性和批次级交叉协方差正则化强制特征独立,无需辅助网络或对抗训练,在多个数据集上取得更优等错误率。

Comments Accepted at Interspeech 2026, 6 pages, 3 figures

详情
AI中文摘要

音频深度伪造检测器常常无法跨说话人泛化,因为它们学习的是说话人身份特征而非合成伪影,这被称为隐式身份泄漏。现有方法解决了这一问题,但引入了架构复杂性或训练不稳定性。本文提出了一种双粒度正交解耦框架,在两个层次上强制特征独立性:样本级余弦正交性捕获方向去相关,而批次级交叉协方差正则化消除嵌入维度间的线性相关性。课程解耦调度逐步增强正交约束,无需辅助网络或对抗动态。在ASVspoof 2019 LA、ASVspoof 2021 DF和In-the-Wild数据集上的实验表明,所提方法分别实现了1.35%、7.88%和21.58%的等错误率(EER),在跨数据集迁移上比梯度反转解耦绝对提升了2.60%。

英文摘要

Audio deepfake detectors often fail to generalize across speakers, as they learn speaker-identity features rather than synthesis artifacts, known as implicit identity leakage. Existing methods address this but incur architectural complexity or training instability. This paper proposes a dual-granularity orthogonal disentanglement framework enforcing feature independence at two levels: sample-level cosine orthogonality captures directional decorrelation, while batch-level cross-covariance regularization eliminates linear correlations across embedding dimensions. A curriculum disentanglement schedule progressively strengthens the orthogonality constraint without auxiliary networks or adversarial dynamics. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild datasets demonstrate that the proposed method achieves 1.35%, 7.88%, and 21.58% equal error rates (EER), respectively, surpassing gradient reversal disentanglement by 2.60% absolute on cross-dataset transfer.

2606.16587 2026-06-16 physics.flu-dyn cs.AI cs.LG physics.comp-ph 交叉投稿

Learning Interface Breakup: A Geometry-Conditioned Latent Surrogate for Spray Formation

学习界面破碎:一种用于喷雾形成的几何条件潜在代理模型

Julius H Ramlau, Friedrich Hastedt, Tolga Birdal, Ehecatl-Antonio del Río Chanona, Nausheen S Basha, Omar K Matar

发表机构 * University of California, Berkeley(加州大学伯克利分校) Technical University of Munich(慕尼黑技术大学) Istanbul Technology University(伊斯坦布尔技术大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Cambridge(剑桥大学) University of Oxford(牛津大学)

AI总结 提出一种几何条件潜在代理模型,通过编码自适应网格细化(AMR)的单元密度场,在797个两相喷嘴模拟上训练,实现瞬态破碎动力学的高效预测,推理速度比Basilisk CFD快6×10^4倍。

Comments 11 pages, 5 figures, accepted to ICML AI4Physics 2026

详情
AI中文摘要

设计喷雾喷嘴需要预测几何形状如何影响瞬态两相破碎,但采用自适应网格细化(AMR)的高保真流体体积(VOF)模拟对于迭代设计探索来说成本过高。标准代理模型也面临挑战,因为液-气界面和底层的自适应离散化都随时间及几何形状变化。我们引入了一种几何条件潜在代理模型,该模型在797个两相喷嘴模拟上训练,通过编码AMR单元密度场(而非完整的多通道流状态)作为求解器集中分辨率的紧凑代理。从该表示出发,模型重建瞬态密度演化和喷嘴几何形状,而一个轻量级的第二阶段则恢复剩余的流动变量。在保留的模拟上,该方法准确捕捉了关键的界面动力学,同时将每条轨迹的推理时间减少到0.045秒,相对于Basilisk CFD加速超过6×10^4倍。这些结果表明,AMR细化结构可以作为瞬态两相流几何条件代理建模的紧凑且可学习的表示。

英文摘要

Designing spray nozzles requires predicting how geometry shapes transient two-phase breakup, but high-fidelity volume-of-fluid (VOF) simulations with adaptive mesh refinement (AMR) are too expensive for iterative design exploration. Standard surrogate models are also challenged by this setting because both the liquid--gas interface and the underlying adaptive discretization evolve across time and geometries. We introduce a geometry-conditioned latent surrogate trained on 797 two-phase nozzle simulations that addresses this by encoding the AMR cell-density field, rather than the full multi-channel flow state, as a compact proxy for where the solver concentrates resolution. From this representation, the model reconstructs transient density evolution and nozzle geometry, and a lightweight second stage recovers the remaining flow variables. On held-out simulations, the method accurately captures key interface dynamics while reducing inference time to 0.045 seconds per trajectory, corresponding to a speed-up of more than $6\times10^4$ relative to Basilisk CFD. These results suggest that AMR refinement structure can serve as a compact and learnable representation for geometry-conditioned surrogate modeling of transient two-phase flows.

2606.16626 2026-06-16 cs.HC cs.AI 交叉投稿

Using AI in engineering education: a balancing act, driven by clear purpose

在工程教育中使用AI:基于明确目的的平衡行为

Olya Kudina

发表机构 * TU Delft(代尔夫特理工大学) North-West University(北开普大学)

AI总结 基于对100名高等教育学生(主要来自工程领域)的问卷调查和文献综述,探讨学生如何使用和看待大语言模型(LLMs),并主张在工程教育中采用目的驱动、情境敏感的AI整合方法。

Comments To appear in The Routledge Handbook of the Philosophy of Engineering, 2nd ed. Edited By Diane P. Michelfelder, Neelke Doorn

详情
AI中文摘要

基于对100名高等教育学生(主要来自工程相关领域)的问卷调查以及对近期文献的批判性回顾,本章考察了学生在工程教育中如何使用和看待大语言模型(LLMs)。学生主要看重LLMs在写作支持、概念澄清、编程辅助和头脑风暴方面的价值,同时表达了对不准确、偏见、过度依赖、学术诚信以及验证负担的担忧。通过分析两种主导隐喻,即LLMs作为“神谕”和“导师”,本章展示了这些系统如何培养出往往超出其实际能力的权威性、专业性和个性化学习的期望。本章进一步论证,学生对效率和个性化支持承诺的依恋反映了一种“残酷的乐观主义”,即LLMs的感知益处往往依赖于学生仍在发展的技能、警惕性和专业知识。总体而言,本章主张在工程教育中采用目的驱动和情境敏感的AI整合方法,强调批判性AI素养、反思性评估设计、教学谨慎性以及对更广泛的伦理和环境影响的考虑。

英文摘要

Based on a questionnaire of 100 higher-education students, predominantly from engineering-related fields, and a critical review of recent literature, this chapter examines how students use and perceive Large Language Models (LLMs) in engineering education. Students primarily value LLMs for writing support, conceptual clarification, coding assistance, and brainstorming, while simultaneously expressing concerns about inaccuracies, bias, overreliance, academic integrity, and the burden of verification. Through an analysis of two dominant metaphors, namely LLMs as an "oracle" and as a "tutor," the chapter shows how these systems cultivate expectations of authority, expertise, and personalized learning that often exceed their actual capabilities. The chapter further argues that students' attachment to the promises of efficiency and personalized support reflects a form of "cruel optimism," where the perceived benefits of LLMs often depend on the very skills, vigilance, and expertise that students are still developing. Overall, the chapter argues for a purpose-driven and context-sensitive approach to AI integration in engineering education, emphasizing critical AI literacy, reflective assessment design, pedagogical caution, and consideration of broader ethical and environmental impacts.

2606.16652 2026-06-16 cs.CY cs.AI cs.SI 交叉投稿

Optimising Temporary Accommodation Placement Across London with AI-Powered SaaS in E-Governance Systems

利用AI驱动的SaaS优化伦敦临时住所安置:电子政务系统中的应用

Hankun He, Jordan Richards, Gopalakrishnan Netuveli, Kumar Aniket, Ramya Pachatcharam, Binta Ade-olusile, Nathan Nagaiah, Matthew I Bellgard

发表机构 * UK Centre for AI in the Public Sector(英国公共部门人工智能中心) Institute for Connected Communities(连接社区研究所) University of East London(东伦敦大学) London Borough of Newham(伦敦新ham区)

AI总结 本文介绍DOMUS系统,一个基于云的AI决策支持系统,通过规则过滤与大语言模型搜索结合,优化伦敦纽汉姆区的临时住所安置,显著减少搜索时间并提高合规性。

Comments 13 pages, 4 figures, to be published in International Conference on AI and Sustainability Advances 2026 Companion Proceedings

详情
AI中文摘要

临时住所已成为英格兰地方政府,特别是伦敦地区,主要的财政和行政压力,需求和成本急剧上升。本文记录了DOMUS的创建和使用,这是一个由东伦敦大学从头构建、为伦敦纽汉姆区定制、支持法定临时住所安置的云原生AI决策支持系统。DOMUS将家庭案例记录、政策约束的负担能力和适宜性规则以及实时私人租赁列表整合到统一的治理工作流中。该系统结合了透明的基于规则的过滤和大语言模型辅助搜索,以标准化卧室需求、负担能力阈值、地理偏好和可达性要求的应用,同时保留官员的裁量权和可审计性。在AI辅助排序和解释之前,家庭和财产属性被编码为政策一致的表示。在纽汉姆的安全环境中进行的试点部署评估了相对于手动工作流的操作性能。结果表明,搜索时间大幅减少,关键安置约束的遵守情况得到改善,员工满意度高,同时保持法定合规性和基于角色的问责制。除了临时住所,本文还将DOMUS定位为可复制的数字公共基础设施:一种模块化、云原生的软件即服务架构,可在英国其他行政区部署,并适应其他以稀缺性、规则约束和高风险为特征的公共管理任务。研究结果证明了在地方政府中可扩展、合乎道德的AI部署的可行性,并为电子政务中AI驱动的公共价值创造辩论做出了贡献。

英文摘要

Temporary accommodation has become a major fiscal and administrative pressure for English local authorities, particularly in London, where demand and costs have risen sharply. This paper documents the creation and use of DOMUS, a cloud-based, AI-enabled decision-support system built from scratch at the University of East London and customised for the needs of London Borough of Newham to support statutory Temporary accommodation placement. DOMUS integrates household case records, policy-constrained affordability and suitability rules, and live private-rental listings within a single governance-aligned workflow. The system combines transparent, rule-based filtering with large language model-assisted search to standardise the application of bedroom need, affordability thresholds, geographic preferences, and accessibility requirements, while preserving officer discretion and audibility. Household and property attributes are encoded into policy-consistent representations prior to AI-assisted ranking and explanation. A pilot deployment in Newham's secure environment evaluated operational performance relative to manual workflows. Results indicate substantial reductions in search time, improved adherence to key placement constraints, and high staff satisfaction, while maintaining statutory compliance and role-based accountability. Beyond TA, the paper frames DOMUS as replicable digital public infrastructure: a modular, cloud-native Software-as-a-Service architecture that can be deployed across other UK boroughs and adapted to other public administration tasks characterised by scarcity, rule-bound eligibility, and high stakes. The findings demonstrate the feasibility of scalable, ethically governed AI deployment in local government and contribute to debates on AI-enabled public value creation in e-governance.

2606.16742 2026-06-16 cs.CV cs.AI 交叉投稿

Revealing Artifacts via Noise Amplification: A Novel Perspective for AI-Generated Video Detection

通过噪声放大揭示伪影:AI生成视频检测的新视角

Renxi Cheng, Jie Gui, Hongsong Wang

发表机构 * School of Cyber Science and Engineering, Southeast University(东南大学网络空间安全学院) Purple Mountain Laboratories(紫金山实验室) Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education(教育部区块链应用监管工程研究中心(东南大学)) School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education(教育部新一代人工智能技术及其跨学科应用重点实验室(东南大学))

AI总结 针对AI生成视频检测难题,提出基于位平面的噪声放大方法,通过像素级强度增强、区域级空间放大和帧级时间聚合,在GenVidBench和HardGVD基准上超越现有方法。

Comments 13 pages, 5 figures

详情
AI中文摘要

随着视频生成模型的快速发展,区分AI生成视频与真实视频已成为一项具有挑战性的任务。现有研究大多集中于开发用于识别生成对抗网络生成样本的检测器。然而,AI生成视频的检测,尤其是文本到视频模型生成的视频,仍是一个未探索的领域。尽管最先进的文本到视频模型可以生成类似于真实视频的逼真视觉内容,但它们无法生成图像的细节以及视频中细节的变化。受此启发,我们从位平面的新视角处理AI生成视频检测,位平面可以有效描述图像或视频中的细节或噪声。为此,我们提出了一种简单而有效的方法,称为噪声放大。该方法首先基于位平面提取噪声信号,然后放大这些噪声信号,最后将其输入判别器网络进行视频伪造分类。噪声放大通过三个方面综合构建:像素级强度增强、区域级空间放大和帧级时间聚合。为了在具有挑战性的场景中评估AI生成视频检测方法,我们还引入了一个名为HardGVD的基准。在大型数据集GenVidBench和HardGVD上的大量实验表明,我们简单的方法显著优于最先进的方法。

英文摘要

With the rapid advancement of video generation models, distinguishing between AI-generated and authentic videos has emerged as a challenging endeavor. The majority of existing research endeavors concentrate on the development of detectors for identifying samples generated by generative adversarial networks. Nevertheless, the detection of AI-generated videos, particularly those produced by text-to-video models, still remains an uncharted territory. Although state-of-the-art text-to-video models can generate realistic visual content similar to real videos, they fall short of generating the details of the images and the changes in details within the videos. Inspired by this, we address AI-generated video detection from a novel perspective of bit-planes, which can effectively describe the details or noises in images or videos. To this end, we propose a simple yet effective approach called Noise Amplification. This approach first extracts noise signals based on bit-planes, then amplifies these noise signals, and finally feeds them into the discriminator networks for video fake classification. Noise amplification is comprehensively constructed by incorporating three aspects: pixel-level intensity enhancement, region-level spatial amplification, and frame-level temporal aggregation. To evaluate methods of AI-generated video detection in challenging scenarios, we also introduce a benchmark named HardGVD. Extensive experiments on both the large-scale dataset GenVidBench and HardGVD show that our simple approach significantly outperforms state-of-the-art methods.

2606.16842 2026-06-16 cs.SE cs.AI 交叉投稿

Beyond Models: Reflections on Engineering AI-enabled Systems in a Project-Based Course

超越模型:在项目式课程中构建AI使能系统的反思

Amir Mashmool, Kishan Ravindra Sawant, Mojtaba Shahin, Nico Hochgeschwender, Rainer Koschke

AI总结 本文反思了一门研究生项目式课程的设计与实施,该课程通过开发电影推荐系统,培养学生对AI使能系统的架构设计、部署和监控能力,并基于混合方法研究揭示了学生在早期架构决策、ML集成、需求演进和数据管理方面的持续困难。

详情
AI中文摘要

教授AI使能系统的软件工程需要在现实约束下解决AI组件在全规模软件架构中的集成问题。虽然机器学习课程强调模型开发,但学生往往缺乏AI使能系统的架构设计、部署和监控经验。此类面向系统的AI课程的实证评估仍然有限。本文反思了不来梅大学一门名为“AI算法:理论与工程”的硕士项目式课程的设计与实施,在该课程中,学生开发了一个电影推荐系统,同时做出架构设计决策以应对可扩展性、部署和需求演进相关的挑战。我们进行了一项混合方法研究,结合学生提交物分析和问卷回答,以调查集成挑战、学习成果和改进机会。我们的结果表明,由于机器学习和软件工程专业知识不均衡,学生在早期架构决策、异构ML集成、需求演进和数据管理方面持续存在困难。从教育者的角度来看,该课程培养了系统级推理能力,并增强了对AI使能系统中以数据为中心的ML实践的认识。

英文摘要

Teaching Software Engineering for AI-enabled systems entails addressing the integration of AI components within full-scale software architectures under realistic constraints. While machine learning courses emphasize model development, students often lack experience in architectural design, deployment, and monitoring of AI-enabled systems. Empirical evaluations of such system-oriented AI courses remain limited. This paper reflects on the design and implementation of a project-based master's-level course titled AI Algorithms: Theory and Engineering, at the University of Bremen, in which students developed a movie recommendation system while making architectural design decisions to address challenges related to scalability, deployment, and evolving requirements. We conducted a mixed-methods study combining analyses of student submissions and questionnaire responses to investigate integration challenges, learning outcomes, and opportunities for improvement. Our results indicate persistent difficulties in early architectural decisions, heterogeneous ML integration, evolving requirements, and data management, largely due to uneven ML and software engineering expertise. From the educator's perspective, the course fostered system-level reasoning and strengthened awareness of data-centric ML practices in AI-enabled systems.

2606.16969 2026-06-16 cs.SD cs.AI eess.AS 交叉投稿

Probing Low Frame Rate Degradation in Neural Audio Codecs

探测神经音频编解码器中的低帧率退化

Alex Gichamba, Moise Busogi

发表机构 * Carnegie Mellon University Africa(卡内基梅隆大学非洲校区)

AI总结 通过控制帧率消融实验,发现低帧率质量悬崖源于训练配置缺陷而非根本性障碍,修正后帧率可降至3.1Hz和1.6Hz。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

神经音频编解码器中的低帧率对于自回归语音合成具有吸引力,因为生成成本与序列长度线性相关。最近的研究表明,编解码器可以在12.5 Hz及以下运行,但低帧率退化的机制仍未被充分理解。我们通过受控的帧率消融实验来研究这些机制。我们重现了先前工作中报告的6.25 Hz处的质量悬崖,并评估了候选解释:音素冲突和码本饱和,两者均未显示出根本性障碍的证据。该悬崖实际上是由次优的训练配置引起的:训练期间固定的剪辑时长在低帧率下产生过少的令牌,使解码器缺乏令牌间上下文。一旦修正,WER随音素负载平滑退化,直至3.1 Hz和1.6 Hz,这表明低帧率编解码器的推理时效率增益比先前假设的更容易实现。

英文摘要

Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous works and evaluate candidate explanations: phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by suboptimal training configuration: fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.

2606.16973 2026-06-16 cs.IR cs.AI 交叉投稿

How Much Do Reviews Really Contribute? A Study on Text-Enriched Matrix Factorization for Recommendations

评论到底有多大贡献?基于文本增强矩阵分解的推荐系统研究

Eduardo Ferreira da Silva, Mayki dos Santos Oliveira, Joel Machado Pires Denis Dantas Boaventura, Frederico Araújo Durão

AI总结 研究在强协同过滤基线下,文本信息对矩阵分解推荐的边际贡献有限,提出门控和交叉注意力机制融合文本与协同信号,实验表明协同信息仍主导性能。

Comments 14 pages, 4 figures, SBBD 2026 ISSN 2763-8979

详情
AI中文摘要

将文本评论融入推荐系统已成为用语义信息丰富协同信号的重要策略。然而,评论派生的表示的实际贡献仍是一个开放问题,特别是在使用强协同基线时。在这项工作中,我们通过在一个共同的协同骨干上引入并比较三种增强策略,系统地研究了文本信息对矩阵分解的影响。首先,我们提出了一种可学习的门控机制,在训练过程中自适应地平衡协同信号和文本信号。该机制应用于两种不同的评论表示:(i) 从用户和物品历史中提取的聚合主题轮廓,以及 (ii) 从评论中导出的完整文本嵌入表示。此外,我们探索了一种交叉注意力机制,在与协同因子融合之前识别并强调文本表示中信息量最大的维度。我们评估了六种变体:纯协同、通过门控用主题轮廓和文本增强、通过门控用主题和文本增强、以及通过文本特征的交叉注意力增强。跨多个基于评论的数据集的实验表明,尽管自适应融合机制提高了表示灵活性,但与协同骨干相比,文本信号的边际贡献仍然有限。这些发现表明,在典型的评分预测设置下,协同信息继续主导性能,这为将语义评论信号有效整合到推荐模型中提出了重要考虑。

英文摘要

Incorporating textual reviews into a Recommender System has become a prominent strategy for enriching collaborative signals with semantic information. However, the actual contribution of review-derived representations remains an open question, particularly when strong collaborative baselines are employed. In this work, we systematically investigate the impact of textual information on Matrix Factorization by introducing and comparing three enrichment strategies over a common collaborative backbone. First, we propose a learnable gating mechanism that adaptively balances collaborative and textual signals during training. This mechanism is applied to two distinct review representations: (i) aggregated topic profiles extracted from user and item histories, and (ii) full text embedding representations derived from reviews. Additionally, we explore a cross-attention mechanism that identifies and emphasizes the most informative dimensions of the textual representation before fusion with collaborative factors. We evaluate six variants: pure, enriched with topic profiles and text via gating; enriched with topics and text via gating; and enhanced with cross-attention over textual features. Experiments across multiple review-based datasets reveal that although adaptive fusion mechanisms improve representation flexibility, the marginal contribution of textual signals remains limited compared to the collaborative backbone. These findings suggest that, under typical rating-prediction settings, collaborative information continues to dominate performance, raising important considerations for the effective integration of semantic review signals into recommendation models.

2410.16089 2026-06-16 cs.AI eess.SP 版本更新

Multi-Sensor Fusion for UAV Classification Based on Feature Maps of Image and Radar Data

基于图像和雷达数据特征图的多传感器融合无人机分类

Nikos Sakellariou, Antonios Lalas, Konstantinos Votis, Dimitrios Tzovaras

发表机构 * Information Technologies Institute(信息科技研究所) Centre for Research & Technology – Hellas(希腊研究中心) Thessaloniki, Greece(塞萨洛尼基,希腊)

AI总结 提出一种融合热成像、光电和雷达数据特征的多传感器深度学习网络,通过堆叠图像特征提高无人机分类精度。

Comments 10 pages, 6 figures

详情
AI中文摘要

现代无人机独特的成本、灵活性、速度和效率使其成为当代社会许多应用中的有吸引力的选择。然而,这也导致了越来越多报告的恶意或意外事件,使得开发无人机检测和分类机制变得至关重要。我们提出了一种方法,用于开发一个系统,该系统将已经处理的多传感器数据融合到一个新的深度神经网络中,以提高其对无人机检测的分类精度。该DNN模型融合了从与热成像、光电和雷达数据相关的单个目标检测和分类模型中提取的高级特征。此外,重点在于模型的卷积神经网络(CNN)架构,该架构通过堆叠热成像和光电传感器提取的图像特征来结合三种传感器模态的特征,从而实现比单独使用每个传感器更高的分类精度。

英文摘要

The unique cost, flexibility, speed, and efficiency of modern UAVs make them an attractive choice in many applications in contemporary society. This, however, causes an ever-increasing number of reported malicious or accidental incidents, rendering the need for the development of UAV detection and classification mechanisms essential. We propose a methodology for developing a system that fuses already processed multi-sensor data into a new Deep Neural Network to increase its classification accuracy towards UAV detection. The DNN model fuses high-level features extracted from individual object detection and classification models associated with thermal, optronic, and radar data. Additionally, emphasis is given to the model's Convolutional Neural Network (CNN) based architecture that combines the features of the three sensor modalities by stacking the extracted image features of the thermal and optronic sensor achieving higher classification accuracy than each sensor alone.

2504.15610 2026-06-16 cs.AI 版本更新

Fine-Tuning a 7B Advisor on Free-Tier GPUs: An Adapter-Handoff Recipe and a Synthetic-Data Reliability Caution

在免费GPU上微调7B顾问:一种适配器交接方案与合成数据可靠性警示

Md Millat Hosen

发表机构 * Department of Computer Science and Engineering, Sharda University(计算机科学与工程系,沙达大学)

AI总结 提出在免费GPU上通过适配器交接微调7B模型的实用方案,并揭示合成数据导致的事实错误风险,强调数据质量比微调方法更关键。

Comments 20 pages, 5 figures, 7 tables. Major revision and repositioning of arXiv:2504.15610v1-v3 (previously titled "A LoRA-Based Approach to Fine-Tuning LLMs for Educational Guidance in Resource-Constrained Settings"); withdraws the earlier quantization-boundary and cross-GPU optimizer-transfer claims. Code, dataset, adapter, and evaluation harness released

详情
AI中文摘要

在资源受限的环境中微调7B语言模型以提供专业建议颇具吸引力,但多轮训练常常超出用户依赖的免费GPU(Kaggle、Colab)的挂钟时间限制。我们报告了两点。首先,一个实用方案:通过仅保存小规模LoRA适配器(41.9M参数)并在第二台机器上恢复,在两张免费16 GB GPU(Tesla P100后接T4)上完成了Mistral-7B-Instruct-v0.3的三轮QLoRA微调(4位NF4,LoRA秩16,使用Unsloth)。仅适配器交接是足够的——优化器和调度器状态无需转移——因此约束条件是每步VRAM和每会话挂钟时间,而非总计算量。其次,更重要的是,一个诚实的评估得出了警示性结果。在与未微调基础模型的盲测对比中,微调模型在与合成训练分布的相似度上得分更高(BERTScore F1 +0.063,这是保真度而非质量信号),但建议质量更低:盲测LLM裁判在46%的提示中偏好基础模型,而微调模型仅18%;一项来源验证的事实性审计发现,微调模型在政策敏感话题上出现了四个自信的错误,而基础模型为零。用相同方法审计训练数据,我们发现这并非微调伪影:每个审计到的错误已存在于Gemini生成的训练答案中,随机样本审计发现相当比例的回答(28-40%;单一裁判,n=40)存在可验证的错误。因此数据足以解释这些错误,我们将其归因于合成数据流水线而非适配器交接方法。我们发布了数据集、适配器、跨GPU笔记本和完整评估框架,以便所有结果可在单个16 GB GPU上复现。

英文摘要

Fine-tuning a 7B language model for specialized advising is attractive in resource-constrained settings, but multi-epoch runs routinely exceed the wall-clock limits of the free-tier GPUs (Kaggle, Colab) such users rely on. We report two things. First, a practical recipe: a three-epoch QLoRA fine-tune of Mistral-7B-Instruct-v0.3 (4-bit NF4, LoRA rank 16, via Unsloth) completed across two free-tier 16 GB GPUs (Tesla P100 then T4) by checkpointing only the small LoRA adapter (41.9M parameters) and resuming on the second machine. Adapter-only handoff is sufficient -- optimizer and scheduler state need not be transferred -- so the binding constraint is per-step VRAM and per-session wall-clock, not aggregate compute. Second, and more importantly, an honest evaluation that returns a cautionary result. On a blind held-out comparison against the un-fine-tuned base model, the fine-tuned model scored higher on similarity to the synthetic training distribution (BERTScore F1 +0.063, a fidelity not quality signal) but lower on advising quality: a blind LLM-as-judge preferred the base model on 46% of prompts versus 18%, and a source-verified factuality audit found four confident errors from the fine-tuned model on policy-sensitive topics against zero for the base. Auditing the training data with the same method, we find this is not a fine-tuning artifact: each audited error is already present in the Gemini-generated training answers, and a random-sample audit finds verifiable errors in a sizable fraction of responses (28-40%; single-judge, n=40). The data is therefore sufficient to account for the errors, which we attribute to the synthetic-data pipeline rather than the adapter-handoff method. We release the dataset, adapter, cross-GPU notebooks, and full evaluation harness so every result reproduces on a single 16 GB GPU.

2509.00135 2026-06-16 cs.AI 版本更新

Optimizing Health Coverage in Ethiopia: A Learning-augmented Approach and Persistent Proportionality Under an Online Budget

优化埃塞俄比亚的健康覆盖:一种学习增强的方法与在线预算下的持续比例性

Davin Choo, Yohai Trabelsi, Fentabil Getnet, Samson Warkaye Lamma, Wondesen Nigatu, Kasahun Sime, Lisa Matay, Milind Tambe, Stéphane Verguet

发表机构 * John A. Paulson School of Engineering and Applied Sciences(约翰·A·保罗森工程与应用科学学院) Harvard University(哈佛大学) National Data Management and Analytics Center for Health(健康国家数据管理与分析中心) Ethiopian Public Health Institute(埃塞俄比亚公共卫生研究所) Ministry of Health, Ethiopia(埃塞俄比亚卫生部) Department of Global Health and Population(全球卫生与人口部门) Harvard T.H. Chan School of Public Health(哈佛T.H. Chan公共卫生学院)

AI总结 针对埃塞俄比亚卫生系统强化中的预算不确定性和区域比例目标,提出基于学习增强和贪心算法的顺序设施规划框架,最大化人口覆盖。

Comments Published in the AISI track at AAAI 2026

详情
AI中文摘要

作为与联合国可持续发展目标3(全民健康覆盖)一致的国家努力的一部分,埃塞俄比亚卫生部正在加强卫生站,以扩大基本医疗服务的可及性。然而,由于预算有限和其他竞争性优先事项,每年只能实施这一卫生系统强化努力的一小部分,因此需要一个优化框架来指导埃塞俄比亚各地区的优先排序。在本文中,我们开发了一个工具——健康可及性资源规划器(HARP),它基于一个原则性的决策支持优化框架,用于顺序设施规划,旨在在预算不确定性下最大化人口覆盖,同时满足每个时间步的区域特定比例目标。然后,我们提出了两种算法:(i)一种学习增强的方法,在任何单一步骤中改进专家建议;(ii)一种用于多步规划的贪心算法,两者都具有强的最坏情况近似估计。与埃塞俄比亚公共卫生研究所和卫生部合作,我们在三个地区的各种规划场景中展示了我们方法的实证有效性。

英文摘要

As part of nationwide efforts aligned with the United Nations' Sustainable Development Goal 3 on Universal Health Coverage, Ethiopia's Ministry of Health is strengthening health posts to expand access to essential healthcare services. However, only a fraction of this health system strengthening effort can be implemented each year due to limited budgets and other competing priorities, thus the need for an optimization framework to guide prioritization across the regions of Ethiopia. In this paper, we develop a tool, Health Access Resource Planner (HARP), based on a principled decision-support optimization framework for sequential facility planning that aims to maximize population coverage under budget uncertainty while satisfying region-specific proportionality targets at every time step. We then propose two algorithms: (i) a learning-augmented approach that improves upon expert recommendations at any single-step; and (ii) a greedy algorithm for multi-step planning, both with strong worst-case approximation estimation. In collaboration with the Ethiopian Public Health Institute and Ministry of Health, we demonstrated the empirical efficacy of our method on three regions across various planning scenarios.

2512.11682 2026-06-16 cs.AI cs.LG 版本更新

MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

MedAI: 评估 TxAgent 在 NeurIPS CURE-Bench 竞赛中的治疗性智能推理

Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, Wolfgang Nejdl

发表机构 * L3S Research Center(L3S研究中心)

AI总结 本文介绍 TxAgent,一种通过迭代检索增强生成和统一生物医学工具集进行治疗决策的智能AI方法,并在CURE-Bench竞赛中评估其推理质量,通过改进工具检索策略提升性能,荣获开放科学卓越奖。

Comments 7 pages, 3 figures

详情
AI中文摘要

临床医学中的治疗决策构成了一个高风险领域,其中AI指导与患者特征、疾病过程和药物制剂之间的复杂相互作用相互交织。药物推荐、治疗计划和不良反应预测等任务需要基于可靠生物医学知识的稳健、多步骤推理。以TxAgent为代表的智能AI方法通过迭代检索增强生成(RAG)应对这些挑战。TxAgent采用微调的Llama-3.1-8B模型,动态生成并执行对统一生物医学工具集(ToolUniverse)的函数调用,整合FDA药物API、OpenTargets和Monarch资源,确保获取最新的治疗信息。与通用RAG系统相比,医疗应用施加了严格的安全约束,使得推理轨迹和工具调用序列的准确性至关重要。这些考虑促使评估协议将令牌级推理和工具使用行为视为明确的监督信号。本文展示了我们参与CURE-Bench NeurIPS 2025挑战赛的见解,该挑战赛使用评估正确性、工具利用和推理质量的指标来基准测试治疗推理系统。我们分析了函数(工具)调用的检索质量如何影响整体模型性能,并展示了通过改进工具检索策略实现的性能提升。我们的工作获得了开放科学卓越奖。完整信息请访问此https URL。

英文摘要

Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.

2603.15952 2026-06-16 cs.AI 版本更新

Protein Design with Agent Rosetta: A Case Study for Specialized Scientific Agents

基于Agent Rosetta的蛋白质设计:面向专业科学智能体的案例研究

Jacopo Teneggi, S. M. Bargeen A. Turzo, Tanya Marwah, Alberto Bietti, P. Douglas Renfrew, Vikram Khipple Mulligan, Siavash Golkar

发表机构 * Polymathic AI(多学科人工智能实验室) Center for Computational Biology, Flatiron Institute(计算生物学中心,Flatiron研究所) Google DeepMind(谷歌DeepMind) Center for Computational Mathematics, Flatiron Institute(计算数学中心,Flatiron研究所) New York University(纽约大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出Agent Rosetta,将大语言模型与Rosetta软件环境结合,通过迭代优化实现用户定义的蛋白质设计目标,在规范氨基酸设计上匹配专家,在非规范残基设计上超越现有ML方法。

详情
AI中文摘要

大语言模型(LLM)能够模拟推理并使用工具,为执行复杂科学任务的自主智能体创造了机会。蛋白质设计提供了一个天然的试验平台:尽管机器学习(ML)方法取得了强劲成果,但它们主要局限于规范氨基酸和狭窄的目标,对于广泛设计流程的通用工具的需求尚未满足。我们引入了Agent Rosetta,这是一个LLM智能体,配有一个用于操作Rosetta的结构化环境——Rosetta是领先的基于物理的异聚合物设计软件,能够建模非规范构建模块和几何结构。Agent Rosetta通过结合LLM推理与Rosetta的通用性,迭代优化设计以实现用户定义的目标。我们在规范氨基酸设计上评估了Agent Rosetta,匹配了专业模型和专家基线;在非规范残基设计上——ML方法在此失败——取得了可比的性能。关键的是,仅靠提示工程通常无法生成Rosetta操作,这表明环境设计对于将LLM智能体与专业软件集成至关重要。我们的结果表明,适当设计的环境能使LLM智能体在匹配专业工具和人类专家的同时,使科学软件变得可访问。

英文摘要

Large language models (LLMs) are capable of emulating reasoning and using tools, creating opportunities for autonomous agents that execute complex scientific tasks. Protein design provides a natural testbed: although machine learning (ML) methods achieve strong results, these are largely restricted to canonical amino acids and narrow objectives, leaving unfilled need for a generalist tool for broad design pipelines. We introduce Agent Rosetta, an LLM agent paired with a structured environment for operating Rosetta, the leading physics-based heteropolymer design software, capable of modeling non-canonical building blocks and geometries. Agent Rosetta iteratively refines designs to achieve user-defined objectives, combining LLM reasoning with Rosetta's generality. We evaluate Agent Rosetta on design with canonical amino acids, matching specialized models and expert baselines, and with non-canonical residues -- where ML approaches fail -- achieving comparable performance. Critically, prompt engineering alone often fails to generate Rosetta actions, demonstrating that environment design is essential for integrating LLM agents with specialized software. Our results show that properly designed environments enable LLM agents to make scientific software accessible while matching specialized tools and human experts.

2605.01101 2026-06-16 cs.AI cs.CL cs.SD eess.AS 版本更新

Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

虚拟言语治疗师:一种临床医生参与的AI言语治疗代理,用于个性化和监督式治疗

Shakeel Sheikh, Patrick Marmaroli, MD Sahidullah, Slim Ouni, Fabrice Hirsch, Goncalo Leal, Bjorn W Schuller

发表机构 * The Kashmir Hub for Artficial Intelligence(喀布尔人工智能中心) Microsoft / Vocametrix(微软 / Vocametrix) IAI, TCG CREST(IAI,TCG CREST) Université de Lorraine, CNRS, Inria, LORIA(洛林大学,CNRS,Inria,LORIA) Laboratoire Praxiling, UMR5267, CNRS et Université Paul-Valéry Montpellier 3(Praxiling实验室,UMR5267,CNRS及蒙彼利埃Paul-Valéry大学) Speechcare iStutter, Portuguese Catholic University(Speechcare iStutter,葡萄牙天主教大学) CHI – Chair of Health Informatics, TUM University Hospital(健康信息学系,TUM大学医院) GLAM – Group on Language, Audio, & Music, Imperial College London(语言、音频与音乐小组,伦敦帝国理工学院)

AI总结 提出虚拟言语治疗师(VST)平台,集成深度学习口吃分类与多智能体大语言模型推理,自动生成个性化治疗方案,并通过临床医生反馈优化,实验证明其高质量推荐。

Comments Under Review

详情
AI中文摘要

本文开发了虚拟言语治疗师(VST),这是一个基于智能体的平台,通过自动化和自适应的AI驱动工作流程,简化口吃评估并提供定制化的治疗计划。VST集成了最先进的基于深度学习的口吃分类和多智能体大语言模型(LLM)推理,以支持循证临床决策。VST首先获取并提取患者语音样本的特征,然后对口吃类型进行稳健分类。基于这些输出,VST启动一个智能体推理过程,其中专门的LLM智能体自主生成、批评并迭代优化个性化治疗计划。一个专门的批评智能体评估所有生成的治疗计划,以确保临床安全性、方法学合理性,并与同行评审的证据和既定专业指南保持一致。最终输出是一个全面的、针对患者的治疗草案,供临床医生审查。系统结合临床医生的反馈,生成最终的治疗计划,适用于患者交付,从而保持临床医生参与的范式。由专家言语治疗师进行的实验评估证实,VST持续生成高质量、基于证据的治疗建议。这些发现表明该系统具有增强临床工作流程、减轻临床医生负担并改善言语障碍患者治疗效果的潜力。所提出系统的交互式用户界面可在以下网址在线获取:this https URL,支持实时口吃评估和个性化治疗计划。

英文摘要

This paper develops Virtual Speech Therapist (VST), an intelligent agent-based platform that streamlines stuttering assessment and delivers customized therapy planning through automated and adaptive AI-driven workflows. VST integrates state-of-the-art deep learning-based stuttering classification, and multi-agent large language model (LLM) reasoning to support evidence-based clinical decision-making. The VST begins with the acquisition and feature extraction of patient speech samples, followed by robust classification of stuttering types. Building on these outputs, VST initiates an agentic reasoning process in which specialized LLM agents autonomously generate, critique, and iteratively refine individualized therapy plans. A dedicated critic agent evaluates all generated therapy plans to ensure clinical safety, methodological soundness, and alignment with peer-reviewed evidence and established professional guidelines. The resulting output is a comprehensive, patient-specific therapy draft intended for clinician review. Incorporating clinician feedback, the system then produces a finalized therapy plan suitable for patient delivery, thereby maintaining a clinician-in-the-loop paradigm. Experimental evaluation by expert speech therapists confirms that VST consistently generates high-quality, evidence-based therapy recommendations. These findings demonstrate the system's potential to augment clinical workflows, reduce clinician burden, and improve therapeutic outcomes for individuals with speech impairments. An interactive user interface for the proposed system is available online at: https://vocametrix.com/ai/stuttering-therapy-planning-agent , facilitating real-time stuttering assessment and personalized therapy planning.

2606.12025 2026-06-16 cs.AI 版本更新

Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

人类增强循环建模(HELM):基于智能体的混凝土桥梁护栏有限元建模

Quankai Wang, Yulin Xie, Tongfei Yang, Minghui Cheng, Ran Cao

发表机构 * College of Civil Engineering, Hunan University(湖南大学土木工程学院) Department of Civil and Architectural Engineering, University of Miami(迈阿密大学土木与建筑系) School of Architecture, University of Miami(迈阿密大学建筑学院)

AI总结 提出HELM框架,通过人机协作将有限元建模分解为可验证的检查点,在MASH TL-4和TL-5条件下将自主建模成功率从20%提升至75%。

详情
AI中文摘要

对桥梁护栏等安全关键基础设施进行有限元(FE)建模需要高保真非线性动态分析,然而当前的FE建模过程仍然劳动密集且缺乏自动化。本文提出了人类增强循环建模(HELM)框架,这是一种协作式人机协议,将长序列有限元建模分解为几何生成、边界条件定义和材料分配等离散的、可视觉验证的检查点。该框架通过一个包含20个案例的钢筋混凝土桥梁护栏矩阵在MASH TL-4和TL-5侧向荷载条件下进行演示,将专用智能体与两种广泛使用的商业FE软件(即ANSYS和LS-PrePost)对接。实验结果表明,HELM将基线自主建模成功率从20%提高到75%,其中几何和边界条件任务的智能体级通过率大约翻倍。误差分析显示,空间推理和代数逻辑限制构成了主要的失败模式,突显了结构化人在回路干预对建模自动化的价值。完整的智能体设计代码和提示已开源,可访问:此 https URL。

英文摘要

Finite element (FE) modeling of safety-critical infrastructure such as bridge barriers requires high-fidelity nonlinear dynamic analysis, yet the current FE modeling process remains labor-intensive and lacks automation. This paper presents the Human-Enhanced Loop Modeling (HELM) framework, a collaborative human-agent protocol that decomposes long-sequence finite element modeling into discrete, visually verifiable checkpoints across geometry generation, boundary condition definition, and material assignment. The framework is demonstrated through a 20-case matrix of reinforced concrete bridge barriers under MASH TL-4 and TL-5 lateral loading conditions, interfacing specialized agents with two widely used commercial FE softwares, i.e., ANSYS and LS-PrePost. Experimental results show that HELM improves the baseline autonomous modeling success rate from 20% to 75%, with agent-level pass rates for geometry and boundary condition tasks approximately doubling. Error analysis reveals that spatial reasoning and algebraic logic limitations constitute the primary failure modes, underscoring the value of structured human-in-the-loop intervention for modeling automation. The complete agent design code and prompts are open-sourced and can be accessed at: https://github.com/SimAgentDev/Ansys-LSPP-AgentKit.

2309.07401 2026-06-16 math.NA cs.AI cs.NA 版本更新

Multi-Grade Deep Learning for Partial Differential Equations with Applications to the Burgers Equation

多级深度学习用于偏微分方程及其在Burgers方程中的应用

Yuesheng Xu, Taishan Zeng

发表机构 * Department of Mathematics and Statistics, Old Dominion University(数学与统计学系,老 Dominion 大学) School of Mathematical Science, South China Normal University(数学科学学院,华南师范大学)

AI总结 提出两阶段多级深度学习方法,通过渐进式分级训练浅层网络拟合目标函数,再微调部分层,有效解决非线性PDE优化难题,在Burgers方程上误差降低达60倍。

详情
AI中文摘要

深度神经网络在求解偏微分方程方面显示出巨大潜力,但其深层架构带来了复杂、大规模、非凸的优化挑战。非线性PDE,如粘性Burgers方程,由于陡峭梯度和激波类解而加剧了这些困难。为此,我们提出了一种两阶段多级深度学习方法。在第一阶段,浅层网络逐级渐进训练,从低频到高频分量拟合目标函数;先前学习的级被冻结,每个新的残差块仅训练以最小化剩余逼近误差。第二阶段解冻并重新训练选定层,以第一阶段网络为初始化,实现可解释、稳定的层次细化,同时减轻优化复杂性。此外,我们从理论上证明,在适当的优化策略下,TS-MGDL中的每一级和每一阶段都单调地减少损失函数。在一维、二维和三维粘性Burgers方程上的数值实验表明,TS-MGDL显著优于单级学习,预测误差降低高达60倍。

英文摘要

Deep neural networks (DNNs) show great promise for solving partial differential equations (PDEs), but their deep architectures introduce complex, large-scale, non-convex optimization challenges. Nonlinear PDEs, like the viscous Burgers' equation, compound these difficulties due to steep gradients and shock-like solutions. To address this, we propose a two-stage multi-grade deep learning (TS-MGDL) method. In the first stage, shallow networks are trained progressively grade by grade to fit the target function from low- to high-frequency components; previously learned grades are frozen, and each new residual block is trained solely to minimize the remaining approximation error. The second stage unfreezes and retrains selected layers using the first-stage network as initialization, achieving an interpretable, stable hierarchical refinement while mitigating optimization complexity. Furthermore, we theoretically prove that each grade and stage in TS-MGDL monotonically reduces the loss function under an appropriate optimization strategy. Numerical experiments on 1D, 2D, and 3D viscous Burgers' equations demonstrate that TS-MGDL significantly outperforms single-grade learning (SGL), reducing predictive errors by up to a factor of 60.

2407.02362 2026-06-16 cs.AR cs.AI cs.LG 版本更新

Mitigating scalability challenges in LUT-based neural networks via pruning optimisations

通过剪枝优化缓解基于LUT的神经网络的可扩展性挑战

Xuqi Zhu, Huaizhi Zhang, JunKyu Lee, Jiacheng Zhu, Chandrajit Pal, Sangeet Saha, Klaus D. McDonald-Maier, Xiaojun Zhai

发表机构 * School of Computer Science and Electronic Engineering, University of Essex(埃塞克斯大学计算机科学与电子工程学院)

AI总结 针对LUT矩阵乘法可扩展性差的问题,提出集成剪枝策略的LUT-MU架构,在FPGA上实现最高1.6倍吞吐量和4.2倍能效提升。

详情
AI中文摘要

现代深度神经网络严重依赖大量的乘加运算,这构成了主要的计算成本。为了解决这个问题,基于查找表(LUT)的矩阵乘法已成为减少神经网络中乘加运算计算成本和时间的有效替代方案。然而,由于LUT矩阵乘法的固有限制,基于LUT的神经网络仍然面临可扩展性挑战。为了缓解这些可扩展性限制,本文提出了一种可扩展且节能的基于LUT的近似矩阵乘法单元(LUT-MU),通过将剪枝策略集成到MADDNESS算法(一种基于LUT的矩阵乘法方法)中,构成神经网络的基本组件。随着矩阵乘法中问题规模和精度要求的增加,我们提出的LUT-MU架构有效约束了资源扩展。案例研究表明,将我们的LUT-MU部署在神经网络架构中,包括全连接层(MNIST)和ResNets(CIFAR-10、ImageNet)——在XCZU7EV和XCZU19EG FPGA上——与主流的基于CUDA的网络实现相比,产生了高达1.6倍的吞吐量提升和4.2倍的能效提升,与领先的量化神经网络实现相比,能效提升1.8倍,且对精度影响适中。与基于原始MADDNESS的神经网络相比,我们的LUT-MU根据MADDNESS的不同分辨率配置设置,节省了1.3到2.6倍的资源。

英文摘要

Modern deep neural networks heavily rely on a large number of multiply-accumulate operations, which constitute the predominant computational cost. To address this, Look-Up Table (LUT)-based matrix multiplications have emerged as a promising alternative for reducing the computational cost and time of the multiply-accumulate operations in a neural network. However, the LUT-based neural network still faces the scalability challenge due to the inherent limitations of LUT-based matrix multiplication. To mitigate these scalability limitations, this paper proposes a scalable and energy-efficient LUT-based approximate matrix multiplication unit (LUT-MU) constituting the basic component of the neural networks by integrating a pruning strategy on the MADDNESS algorithm, a LUT-based matrix multiplication methodology. With increasing problem size and precision demands in matrix multiplication, our proposed LUT-MU architecture effectively constrains resource expansion. The case study shows that deploying our LUT-MU in neural network architectures, including fully connected layers (MNIST) and ResNets (CIFAR-10, ImageNet)-on XCZU7EV and XCZU19EG FPGAs, produces up to $1.6 \times$ throughput improvement and $4.2 \times$ energy efficiency gains over mainstream CUDA-based network implementations, and $1.8\times$ energy efficiency compared to leading quantised neural network implementations, with moderate impact on accuracy. Compared to original MADDNESS-based neural networks, our LUT-MU shows $1.3$ to $2.6\times$ resource savings based on various resolution configuration settings of MADDNESS.

2412.00107 2026-06-16 cs.LG cs.AI eess.SP 版本更新

Virtual Sensing to Enable Real-Time Monitoring of Inaccessible Locations & Unmeasurable Parameters

虚拟传感实现不可达位置与不可测参数的实时监测

Kazuma Kobayashi, Farid Ahmed, Jaewan Park, Subhankar Sarkar, Souvik Chakraborty, Syed Bahauddin Alam

发表机构 * Plasma & Radiological Engineering Department, Grainger College of Engineering, Nuclear, University of Illinois Urbana-Champaign(等离子体与辐射工程系,格拉inger工程学院,核能,伊利诺伊大学厄巴纳-香槟分校) Mechanical Science and Engineering Department, Grainger College of Engineering, University of Illinois Urbana-Champaign(机械科学与工程系,格拉inger工程学院,伊利诺伊大学厄巴纳-香槟分校) National Center for Supercomputing Applications, Urbana, IL, USA(国家超级计算应用中心,伊利诺伊州厄巴纳,美国) Department of Applied Mechanics, Indian Institute of Technology Delhi, New Delhi, India(应用力学系,印度理工学院德里,新德里,印度) Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi(Yardi人工智能学院,印度理工学院德里)

AI总结 针对能量系统中物理传感器无法部署的实时监测问题,提出基于神经算子的虚拟传感框架MIMONet,将稀疏边界测量映射到内部场,在多种热流体系统中实现亚毫秒级高精度推理。

Comments New analysis and results are added

详情
AI中文摘要

在物理仪器不可行的能量系统中,对安全关键内部状态的实时监测仍然是一个开放问题。现有方法依赖于显式控制方程、有限维状态向量或逐实例重训练,这阻碍了在实时约束下对任意内部坐标进行网格无关的场级推理。我们针对核级热流体系统引入了基于算子的虚拟传感:使用神经算子框架学习将稀疏边界测量映射到物理不可达区域中耦合内部场的解算子,明确地将问题分类以区别于经典状态估计和逐点软传感。我们通过MIMONet实例化该框架,这是一种分支-主干算子,扩展了三个实用选择:用于异构(标量和函数值)输入的多模态分支编码器;用于保持双线性PDE耦合结构的乘法分支融合;以及在主干最后一层具有每通道基投影的共享潜在多场解码。在从经典顶盖驱动空腔流到压水堆子通道再到完全耦合换热器的逐步复杂评估中,MIMONet实现了低于5%的相对误差和在数据中心加速器上的亚毫秒推理(在NVIDIA H200上每次换热器推理为0.35 ms / 46 mJ,且在A40-H200-GH200范围内均低于毫秒),同时在50%传感器噪声下保持稳定。随着几何约束和物理耦合的增强,MIMONet保持准确,表明基于算子的虚拟传感可以在物理仪器失效时恢复可观测性,在评估的运行包络内建立了基于仿真的可行性,作为面向安全关键能量系统的未来实验和跨求解器验证的一步。

英文摘要

Real-time monitoring of safety-critical interior states remains an open problem in energy systems where physical instrumentation is infeasible. Existing approaches rely on explicit governing equations, finite-dimensional state vectors, or per-instance retraining, which prevents mesh-independent, field-level inference at arbitrary interior coordinates under real-time constraints. We introduce operator-based virtual sensing for nuclear-grade thermal-fluid systems: we use the neural-operator framework to learn solution operators that map sparse boundary measurements to coupled internal fields in physically inaccessible regions, framing the problem class explicitly to distinguish it from classical state estimation and pointwise soft sensing. We instantiate this framework with MIMONet, a branch-trunk operator extended with three practical choices: multi-modal branch encoders for heterogeneous (scalar and function-valued) inputs; multiplicative branch fusion to preserve the bilinear PDE coupling structure; and shared-latent multi-field decoding with per-channel basis projections at the trunk's final layer. Evaluated across escalating complexity, from canonical lid-driven cavity flow to pressurized water reactor subchannels to fully coupled heat exchangers, MIMONet achieves below 5% relative errors and sub-millisecond inference on data-center accelerators (0.35 ms / 46 mJ per heat-exchanger inference on an NVIDIA H200, and sub-millisecond across the A40-H200-GH200 range), while remaining stable under 50% sensor noise. By staying accurate as geometric confinement and physics coupling intensify, MIMONet shows that operator-based virtual sensing can restore observability where physical instrumentation fails, establishing simulation-based feasibility within the evaluated operating envelopes as a step toward future experimental and cross-solver validation for safety-critical energy systems.

2504.11320 2026-06-16 cs.LG cs.AI cs.DC math.OC stat.ML 版本更新

Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

优化大语言模型推理:带有内存约束的流引导在线调度

Ruicheng Ao, Gan Luo, David Simchi-Levi, Xinshang Wang

发表机构 * Institute for Data, Systems, and Society, Massachusetts Institute of Technology(数据、系统与社会研究所,麻省理工学院) School of Mathematical Sciences, Peking University(北京大学数学科学学院) Alibaba Group(阿里巴巴集团)

AI总结 本文提出流引导在线调度方法,通过等待阈值算法和嵌套等待算法,在内存约束下优化大语言模型推理的延迟和容量,减少过载时的延迟。

Comments 79 pages, 20 figures

详情
AI中文摘要

大型语言模型现在每天服务于数百万用户,提供商每天的支出超过70万美元。每个请求需要逐token推理,使GPU调度成为延迟、容量和成本的关键因素。难点在于内生内存增长:生成的token会扩展键值(KV)缓存,溢出可能导致正在进行的请求被驱逐并浪费先前计算。我们将推理视为一个具有内生内存增长、线性迭代次数和驻留GPU的KV缓存约束的多阶段在线调度问题。我们引入了流模型,该模型表征了平衡批处理组成、内存需求和稳定性区域。受流模型指导,我们设计了WAIT(等待累积推理阈值)算法,该算法为已知输出长度设计了基于阈值的准入规则,并通过调节请求在解码阶段段中的推进方式扩展到未知输出长度的嵌套WAIT。两种算法在所陈述的内存条件下近似流基准。嵌套WAIT使用额外的中等规模安全缓冲区,以应对未知输出长度引起的内存溢出导致的驱逐。在配置为Llama-2-7B的A100 GPU上的Vidur模拟中,补充的实GPU验证在附录中报告,这些策略相对于广泛使用的基线算法扩大了经验上观察到的稳定运行范围,并在接近过载和过载区域显著降低了延迟。

英文摘要

Large language models now serve millions of users daily, with providers incurring costs exceeding $700,000 per day. Each request requires token-by-token inference, making GPU scheduling central to latency, capacity, and cost. The difficulty is endogenous memory growth: generated tokens expand the Key-Value (KV) cache, and overflow can evict in-progress requests and waste prior computation. We formulate inference as a multi-stage online scheduling problem with endogenous memory growth, linear iteration times, and GPU-resident KV-cache constraints. We introduce a fluid model that characterizes equilibrium batch composition, memory requirement, and stability region. Guided by the fluid model, we design WAIT (Waiting for Accumulated Inference Threshold), a threshold-based admission rule for known output lengths, and Nested WAIT, which extends the rule to unknown output lengths by regulating how requests advance across decode-stage segments. Both algorithms approximate the fluid benchmark asymptotically under the stated memory conditions. Nested WAIT uses an additional safety buffer of moderate scale to hedge against memory-overflow-induced evictions under unknown output lengths. In Vidur simulations configured for Llama-2-7B on an A100 GPU, with supplemental real-GPU validation reported in the appendix, the policies enlarge the empirically observed stable operating range relative to widely used baseline algorithms and reduce latency especially in near-overloaded and overloaded regimes.

2508.04243 2026-06-16 cs.LG cs.AI 版本更新

Automated ultrasound doppler angle estimation using deep learning

基于深度学习的自动化超声多普勒角度估计

Nilesh Patil, Ajay Anand

发表机构 * Goergen Institute for Data Science(戈尔根数据科学研究所) University of Rochester Medical Center(罗切斯特大学医学中心) University of Rochester(罗切斯特大学)

AI总结 提出一种基于深度学习的自动化多普勒角度估计方法,使用2100张颈动脉超声图像及预训练模型,平均绝对误差3.9°-9.4°,最佳模型误差低于临床可接受阈值,可避免正常速度误判为狭窄。

详情
Journal ref
Annu Int Conf IEEE Eng Med Biol Soc. 2019 Jul;2019:28-31
AI中文摘要

角度估计是测量血流速度的多普勒超声临床工作流程中的重要步骤。人们普遍认为,角度估计不正确是基于多普勒的血流速度测量误差的主要原因。在本文中,我们提出了一种基于深度学习的自动化多普勒角度估计方法。该方法使用2100张人类颈动脉超声图像(包括图像增强)进行开发。使用五个预训练模型提取图像特征,并将这些特征传递给一个自定义浅层网络进行多普勒角度估计。独立地,由一名人类观察者审阅图像进行测量以进行比较。对于评估的模型,自动角度估计与手动角度估计之间的平均绝对误差(MAE)范围为3.9°至9.4°。此外,最佳性能模型的MAE低于可接受的临床多普勒角度误差阈值,从而避免了将正常速度值误分类为狭窄。结果表明,应用基于深度学习的技术进行自动化超声多普勒角度估计具有潜力。这种技术有可能在商业超声扫描仪的成像软件中实现。

英文摘要

Angle estimation is an important step in the Doppler ultrasound clinical workflow to measure blood velocity. It is widely recognized that incorrect angle estimation is a leading cause of error in Doppler-based blood velocity measurements. In this paper, we propose a deep learning-based approach for automated Doppler angle estimation. The approach was developed using 2100 human carotid ultrasound images including image augmentation. Five pre-trained models were used to extract images features, and these features were passed to a custom shallow network for Doppler angle estimation. Independently, measurements were obtained by a human observer reviewing the images for comparison. The mean absolute error (MAE) between the automated and manual angle estimates ranged from 3.9° to 9.4° for the models evaluated. Furthermore, the MAE for the best performing model was less than the acceptable clinical Doppler angle error threshold thus avoiding misclassification of normal velocity values as a stenosis. The results demonstrate potential for applying a deep-learning based technique for automated ultrasound Doppler angle estimation. Such a technique could potentially be implemented within the imaging software on commercial ultrasound scanners.

2508.10967 2026-06-16 cs.LG cs.AI 版本更新

Retro-Expert: Collaborative Reasoning for Interpretable Retrosynthesis

Retro-Expert: 面向可解释逆合成的协同推理

Xinyi Li, Sai Wang, Yutian Lin, Yu Wu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Retro-Expert框架,通过强化学习结合大语言模型与专用模型,实现可解释的逆合成预测,并生成基于化学逻辑的自然语言解释。

详情
AI中文摘要

逆合成预测旨在根据给定的产物分子推断反应物分子,这是化学合成中的一项基本任务。然而,现有方法依赖于静态模式匹配范式,限制了其从化学数据中进行有效逻辑决策的能力,导致黑箱过程。我们提出Retro-Expert,一个可解释的逆合成框架,通过纯强化学习结合大语言模型和专用模型的互补优势,进行协同推理。它通过三个组件输出基于化学逻辑的自然语言解释:(1)专用模型提供化学知识,将其蒸馏到高质量的化学决策空间中;(2)大语言模型驱动的批判性推理,生成具有可解释推理路径的预测;(3)基于知识的策略优化,改进可解释的决策策略。实验表明,Retro-Expert在不同指标上均优于基于大语言模型和专用模型的方法,同时生成基于化学的解释,增强了化学家在实践中的信任。本文源代码见:此 https URL。

英文摘要

Retrosynthesis prediction aims to infer the reactant molecules based on a given product molecule, which is a fundamental task in chemical synthesis. However, existing methods rely on a static pattern-matching paradigm, which limits their ability to perform effective logical decision-making from chemical data, leading to a black-box process. We propose Retro-Expert, an interpretable retrosynthesis framework that performs collaborative reasoning by combining the complementary strengths of Large Language Models and specialized models via pure reinforcement learning. It outputs natural language explanations grounded in chemical logic through three components: (1) specialized models provide chemical knowledge that is distilled into a high-quality chemical decision space, (2) LLM-driven critical reasoning to generate predictions with an interpretable reasoning path, and (3) knowledge-grounded policy optimization refines the interpretable decision policy. Experiments show that Retro-Expert surpasses both LLM-based and specialized models across different metrics, while generating chemically grounded explanations that enhance chemists' trust in practice. The source code for this paper is available at https://github.com/MagixRab-ll/Retro-Expert.

2509.05364 2026-06-16 cs.CY cs.AI cs.ET 版本更新

Prototyping an AI-powered Tool for Energy Efficiency in New Zealand Homes

为新西兰住宅能效设计AI驱动工具的原型

Abdollah Baghaei Daemei

发表机构 * Building Performance Analysis Lab, Tech Innovation Experts(建筑性能分析实验室,技术创新专家)

AI总结 本研究设计并评估了一个基于AI的决策支持工具原型,通过数据集成、异常检测和情景模拟,帮助新西兰家庭提升能效,专家测试显示可用性高,有望弥补政策与家庭实践之间的差距。

详情
AI中文摘要

住宅建筑对能源使用、健康结果和碳排放有显著影响。在新西兰,住房质量历来较差,隔热不足和供暖效率低下导致广泛的能源困难。最近的改革,包括Warmer Kiwi Homes计划、Healthy Homes标准和H1建筑规范升级,带来了健康和舒适度的改善,但挑战依然存在。许多改造仍然不完整,家庭性能数据有限,房主的决策支持分散。本研究介绍了为新西兰住宅能效设计的AI驱动决策支持工具的原型和评估。该原型使用Python和Streamlit开发,将数据摄取、异常检测、基线建模和情景模拟(例如LED改造、隔热升级)集成到一个模块化仪表板中。15位领域专家,包括建筑科学家、顾问和政策实践者,通过半结构化访谈测试了该工具。结果显示可用性高(M=4.3),情景输出价值高(M=4.5),并且对其补充补贴计划和监管框架的潜力持积极看法。该工具展示了AI如何将国家政策转化为个性化的家庭级指导,弥合资金、标准和实际决策之间的差距。其意义在于提供了一个可复制的框架,以减少能源困难、改善健康结果并支持气候目标。未来的发展应侧重于碳指标、电价建模、与国家数据集的集成以及评估实际采用的纵向试验。

英文摘要

Residential buildings contribute significantly to energy use, health outcomes, and carbon emissions. In New Zealand, housing quality has historically been poor, with inadequate insulation and inefficient heating contributing to widespread energy hardship. Recent reforms, including the Warmer Kiwi Homes program, Healthy Homes Standards, and H1 Building Code upgrades, have delivered health and comfort improvements, yet challenges persist. Many retrofits remain partial, data on household performance are limited, and decision-making support for homeowners is fragmented. This study presents the design and evaluation of an AI-powered decision-support tool for residential energy efficiency in New Zealand. The prototype, developed using Python and Streamlit, integrates data ingestion, anomaly detection, baseline modeling, and scenario simulation (e.g., LED retrofits, insulation upgrades) into a modular dashboard. Fifteen domain experts, including building scientists, consultants, and policy practitioners, tested the tool through semi-structured interviews. Results show strong usability (M = 4.3), high value of scenario outputs (M = 4.5), and positive perceptions of its potential to complement subsidy programs and regulatory frameworks. The tool demonstrates how AI can translate national policies into personalized, household-level guidance, bridging the gap between funding, standards, and practical decision-making. Its significance lies in offering a replicable framework for reducing energy hardship, improving health outcomes, and supporting climate goals. Future development should focus on carbon metrics, tariff modeling, integration with national datasets, and longitudinal trials to assess real-world adoption.

2509.25594 2026-06-16 cs.CV cs.AI 版本更新

K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model

K-Prism: 一种知识引导与提示集成的通用医学图像分割模型

Bangwei Guo, Yunhe Gao, Meng Ye, Difei Gu, Yang Zhou, Leon Axel, Dimitris Metaxas

发表机构 * Rutgers University(罗格斯大学) Stanford University(斯坦福大学) The University of Texas at Arlington(德克萨斯大学阿灵顿分校) New York University(纽约大学)

AI总结 提出K-Prism统一分割框架,通过双提示表示和混合专家解码器整合语义先验、上下文知识和交互反馈三种知识范式,在18个数据集上实现语义、上下文和交互分割的最优性能。

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

医学图像分割是临床决策的基础,但现有模型仍然碎片化。它们通常基于单一知识源训练,并针对特定任务、模态或器官。这种碎片化与临床实践形成鲜明对比,在临床实践中,专家无缝整合多种知识:来自训练集的解剖先验、来自参考病例的基于示例的推理,以及通过实时交互的迭代细化。我们提出了$\textbf{K-Prism}$,一个统一的分割框架,通过系统整合三种知识范式来反映这种临床灵活性:(i) 从标注数据集中学习的$\textit{语义先验}$,(ii) 来自少样本参考示例的$\textit{上下文知识}$,以及(iii) 来自用户输入(如点击或涂鸦)的$\textit{交互反馈}$。我们的关键见解是,这些异构知识源可以编码为双提示表示:定义$\textit{分割什么}$的1-D稀疏提示和指示$\textit{关注哪里}$的2-D密集提示,然后通过混合专家(MoE)解码器动态路由。这种设计使得范式之间灵活切换,并能够在不同任务上进行联合训练,而无需修改架构。在涵盖多种模态(CT、MRI、X射线、病理、超声等)的18个公共数据集上的全面实验表明,K-Prism在语义、上下文和交互分割设置中均达到了最先进的性能。

英文摘要

Medical image segmentation is fundamental to clinical decision-making, yet existing models remain fragmented. They are usually trained on single knowledge sources and specific to individual tasks, modalities, or organs. This fragmentation contrasts sharply with clinical practice, where experts seamlessly integrate diverse knowledge: anatomical priors from training, exemplar-based reasoning from reference cases, and iterative refinement through real-time interaction. We present $\textbf{K-Prism}$, a unified segmentation framework that mirrors this clinical flexibility by systematically integrating three knowledge paradigms: (i) $\textit{semantic priors}$ learned from annotated datasets, (ii) $\textit{in-context knowledge}$ from few-shot reference examples, and (iii) $\textit{interactive feedback}$ from user inputs like clicks or scribbles. Our key insight is that these heterogeneous knowledge sources can be encoded into a dual-prompt representation: 1-D sparse prompts defining $\textit{what}$ to segment and 2-D dense prompts indicating $\textit{where}$ to attend, which are then dynamically routed through a Mixture-of-Experts (MoE) decoder. This design enables flexible switching between paradigms and joint training across diverse tasks without architectural modifications. Comprehensive experiments on 18 public datasets spanning diverse modalities (CT, MRI, X-ray, pathology, ultrasound, etc.) demonstrate that K-Prism achieves state-of-the-art performance across semantic, in-context, and interactive segmentation settings.

2510.22266 2026-06-16 cs.LG cs.AI cs.CY 版本更新

A Multi-level Analysis of Factors Associated with Student Performance: A Machine Learning Approach to the SAEB Microdata

学生表现相关因素的多层次分析:基于SAEB微观数据的机器学习方法

Rodrigo Tertulino, Laércio Alencar

发表机构 * Federal Institute of Education, Science, and Technology of Rio Grande do Norte(巴西里约格朗德杜北教育、科学和技术联邦学院)

AI总结 采用多级机器学习方法,利用SAEB微观数据中四类特征,通过随机森林模型以90.2%准确率分类学生水平,并借助SHAP解释发现学校平均社会经济水平是最强预测因子,表明学业表现是系统性现象。

Comments This article has been published in Discover Education (Springer Nature). The final authenticated version is available at:https://doi.org/10.1007/s44217-026-01699-0

详情
Journal ref
Discover Education, 2026
AI中文摘要

识别影响基础教育学生表现的因素是巴西制定有效公共政策的核心挑战。本研究引入了一种多级机器学习方法,利用巴西基础教育评估系统(SAEB)的微观数据对九年级和高中学生的熟练程度进行分类。我们的模型独特地整合了四个数据源:学生社会经济特征、教师专业档案、学校指标和校长管理档案。对四种集成算法的比较分析证实了随机森林模型的优越性,该模型达到了90.2%的准确率和96.7%的曲线下面积(AUC)。为了超越预测,我们应用了基于SHAP的可解释人工智能(XAI),结果显示学校的平均社会经济水平是最主要的预测因子,表明系统性因素比孤立的个体特征影响更大。主要结论是,学业表现是一种与学校生态系统深度相关的系统性现象。本研究提供了一个数据驱动的、可解释的工具,以通过解决学校之间的差异来促进教育公平的政策制定。

英文摘要

Identifying the factors that influence student performance in basic education is a central challenge for formulating effective public policies in Brazil. This study introduces a multi-level machine learning approach to classify the proficiency of 9th-grade and high school students using microdata from the System of Assessment of Basic Education (SAEB). Our model uniquely integrates four data sources: student socioeconomic characteristics, teacher professional profiles, school indicators, and principal management profiles. A comparative analysis of four ensemble algorithms confirmed the superiority of a Random Forest model, which achieved 90.2% accuracy and an Area Under the Curve (AUC) of 96.7%. To move beyond prediction, we applied Explainable AI (XAI) using SHAP, which revealed that the school's average socioeconomic level is the most dominant predictor, demonstrating that systemic factors have a greater impact than individual characteristics in isolation. The primary conclusion is that academic performance is a systemic phenomenon deeply tied to the school's ecosystem. This study provides a data-driven, interpretable tool to inform policies aimed at promoting educational equity by addressing disparities between schools.

2511.05522 2026-06-16 eess.SP cs.AI 版本更新

AIRMap: AI-Generated Radio Maps for Wireless Digital Twins

AIRMap: 用于无线数字孪生的AI生成无线电地图

Ali Saeizadeh, Miead Tehrani-Moayyed, Davide Villa, J. Gordon Beattie, Pedram Johari, Stefano Basagni, Tommaso Melodia

发表机构 * VIAVI Solutions, Inc.(VIAVI解决方案公司) National Telecommunications and Information Administration (NTIA)(国家电信与信息管理局) U.S. National Science Foundation(美国国家科学基金会)

AI总结 提出AIRMap深度学习框架,基于2D高程图通过U-Net自编码器实现超快速无线电地图估计,在4毫秒内达到低于4 dB RMSE的路径增益预测,比GPU加速射线追踪快100倍以上。

Comments 15 pages, 19 figures, This work has been accepted for publication on IEEE Transactions on Wireless Communications

详情
AI中文摘要

精确、低延迟的信道建模对于实时无线网络仿真和数字孪生应用至关重要。然而,像射线追踪这样的传统建模方法计算量大,不适合模拟动态条件。在本文中,我们提出了AIRMap,一个用于超快速无线电地图估计的深度学习框架,以及一个用于创建迄今为止最大无线电地图数据集的自动化流水线。AIRMap使用单输入U-Net自编码器,仅处理地形和建筑物高度的2D高程图。在120万波士顿区域样本上训练,并在四个具有不同地形和建筑密度的不同城市和农村环境中验证,AIRMap在NVIDIA L40S上每次推理在4毫秒内预测路径增益,RMSE低于4 dB——比基于GPU加速射线追踪的无线电地图快100倍以上。使用仅20%的现场测量数据进行轻量级校准,将中位误差降低到约5%,显著优于传统模拟器(误差超过50%)。集成到Colosseum仿真器和Sionna SYS平台中,与基于测量的信道相比,频谱效率和误块率几乎为零误差。这些发现验证了AIRMap在无线数字孪生中实现可扩展、准确和实时无线电地图估计的潜力。

英文摘要

Accurate, low-latency channel modeling is essential for real-time wireless network simulation and digital-twin applications. Traditional modeling methods like ray tracing are however computationally demanding and unsuited to model dynamic conditions. In this paper, we propose AIRMap, a deep-learning framework for ultra-fast radio-map estimation, along with an automated pipeline for creating the largest radio-map dataset to date. AIRMap uses a single-input U-Net autoencoder that processes only a 2D elevation map of terrain and building heights. Trained on 1.2M Boston-area samples and validated across four distinct urban and rural environments with varying terrain and building density, AIRMap predicts path gain with under 4 dB RMSE in 4 ms per inference on an NVIDIA L40S-over 100x faster than GPU-accelerated ray tracing based radio maps. A lightweight calibration using just 20% of field measurements reduces the median error to approximately 5%, significantly outperforming traditional simulators, which exceed 50% error. Integration into the Colosseum emulator and the Sionna SYS platform demonstrate near-zero error in spectral efficiency and block-error rate compared to measurement-based channels. These findings validate AIRMap's potential for scalable, accurate, and real-time radio map estimation in wireless digital twins.

2511.16681 2026-06-16 cs.CL cs.AI 版本更新

SPI: Query-Depth-Adaptive Indexing for Streaming RAG in Vector Databases

SPI:向量数据库中流式RAG的查询深度自适应索引

Dong Liu, Yanxuan Yu

发表机构 * Yale University(耶鲁大学) Columbia University(哥伦比亚大学)

AI总结 提出语义金字塔索引(SPI),通过多级分辨率组织和不确定性感知控制器实现查询深度自适应,支持流式插入和渐进式ANN搜索,在MS MARCO和Natural Questions上相比基线实现1.4-2.3倍延迟降低。

详情
AI中文摘要

向量数据库(VecDB)越来越多地部署在检索增强生成(RAG)管道中,其中查询处理和文档摄取同时发生。索引层需要提供低延迟搜索,同时在不频繁全局重建的情况下纳入新向量。现有的VecDB管道通常在统一表示机制下运行,尽管查询所需的语义粒度存在显著差异。这促使设计一种支持增量更新同时根据查询分布和复杂性调整检索深度的索引。我们提出**语义金字塔索引(SPI)**,一种VecDB层索引框架,将嵌入组织成$L$个语义对齐的分辨率级别,并通过轻量级不确定性感知控制器为每个查询选择检索深度。SPI支持渐进式粗到细ANN搜索、无需全局重建的逐级流式插入,以及通过LSH分区和异步gRPC协调的分布式执行。与具有固定遍历规则的分层ANN结构(例如SPANN)不同,SPI在查询时自适应分辨率,同时保持与FAISS和Qdrant后端的兼容性。在MS MARCO和Natural Questions上,在相同密集编码器系列下,SPI在Recall@10上具有竞争力且延迟更低,相对于可比较的近似ANN基线,在固定Recall@10目标下实现了**1.4-2.3倍**的平均检索延迟降低。一个最多8个节点的原型扩展研究显示吞吐量扩展了6.2倍(约73%效率);为完整性包含了16节点配置,但显示出递减的效率。我们提供了top-$K$稳定性保证:具有足够检索裕度的查询在较浅层返回相同的top-$K$集合。代码和配置可从此https URL获取。

英文摘要

Vector databases (VecDBs) are increasingly deployed in retrieval-augmented generation (RAG) pipelines where query processing and document ingestion occur concurrently. The index layer needs to provide low-latency search while incorporating new vectors without frequent global rebuilding. Existing VecDB pipelines typically operate within a uniform representation regime, despite substantial variation in the semantic granularity required across queries. This motivates an index design that supports incremental updates while adapting retrieval depth to query distribution and complexity. We propose \textbf{Semantic Pyramid Indexing (SPI)}, a VecDB-layer indexing framework that organizes embeddings into $L$ semantically aligned resolution levels and selects retrieval depth per query via a lightweight uncertainty-aware controller. SPI supports progressive coarse-to-fine ANN search, level-wise streaming insertion without global rebuilds, and distributed execution through LSH partitioning with asynchronous gRPC coordination. Unlike hierarchical ANN structures with fixed traversal rules (e.g., SPANN), SPI adapts resolution at query time while remaining compatible with FAISS and Qdrant backends. On MS MARCO and Natural Questions, SPI achieves competitive Recall@10 with lower latency under the same dense encoder family, yielding a \textbf{1.4--2.3$\times$} average retrieval latency reduction under fixed Recall@10 targets relative to comparable approximate-ANN baselines. A prototype scaling study up to 8 nodes shows $6.2\times$ throughput scaling (${\approx}73\%$ efficiency); the 16-node configuration is included for completeness but shows diminishing efficiency. We provide a top-$K$ stability guarantee: queries with sufficient retrieval margin return an identical top-$K$ set at a shallower level. Code and configurations are available at https://github.com/FastLM/SPI_VecDB.

2512.07925 2026-06-16 cs.CV cs.AI 版本更新

Near--Real-Time Conflict-Related Fire Detection in Sudan Using Unsupervised Deep Learning

苏丹冲突相关火灾的近实时检测:基于无监督深度学习

Kuldip Singh Atwal, Dieter Pfoser, Daniel Rothbart

发表机构 * George Mason University(乔治·马歇尔大学)

AI总结 提出轻量级VAE模型结合Planet Labs 4波段影像,在24-30小时内无监督检测苏丹冲突火灾区域,优于余弦距离、CVA和IR-MAD方法。

详情
Journal ref
Science of Remote Sensing, Volume 13, 2026, 100446, ISSN 2666-0172
AI中文摘要

苏丹持续的武装冲突凸显了快速监测冲突相关火灾影响区域的必要性。深度学习和高频卫星影像的最新进展使得能够近实时评估战区活跃火灾和烧伤疤痕。本研究提出了一种近实时监测方法,使用轻量级变分自编码器(VAE)模型,结合空间分辨率3米的4波段Planet Labs影像。我们证明,在有利观测条件下,利用可获取的商业卫星数据,这些受影响区域可在约24至30小时内被检测到。为此,我们改编了一个最初为10波段影像设计的VAE模型,使其有效处理高分辨率4波段输入。模型以无监督方式训练,学习名义地表状态的紧凑潜在表示,并通过量化时间配对潜在嵌入之间的变化来识别燃烧特征。性能在苏丹的五个案例研究中评估,并与余弦距离、CVA和IR-MAD在精确率、召回率、F1分数以及时间配对影像块之间的精确率-召回率曲线下面积(AUPRC)上进行比较。结果表明,所提方法始终优于其他方法,在高度不平衡的火灾检测场景中实现了更高的召回率和F1分数,同时保持了可行的精确率。使用8波段影像和时间序列影像的实验相比单一4波段输入仅带来边际性能提升,突显了所提轻量级方法在可扩展的近实时冲突监测中的有效性。

英文摘要

Ongoing armed conflict in Sudan highlights the need for rapid monitoring of conflict-related fire-affected areas. Recent advances in deep learning and high-frequency satellite imagery enable near--real-time assessment of active fires and burn scars in war zones. This study presents a near--real-time monitoring approach using a lightweight Variational Auto-Encoder (VAE)--based model integrated with 4-band Planet Labs imagery at 3 m spatial resolution. We demonstrate that these impacted regions can be detected within approximately 24 to 30 hours under favorable observational conditions using accessible, commercially available satellite data. To achieve this, we adapt a VAE--based model, originally designed for 10-band imagery, to operate effectively on high-resolution 4-band inputs. The model is trained in an unsupervised manner to learn compact latent representations of nominal land-surface conditions and identify burn signatures by quantifying changes between temporally paired latent embeddings. Performance is evaluated across five case studies in Sudan and compared against cosine distance, CVA, and IR-MAD using precision, recall, F1-score, and the area under the precision-recall curve (AUPRC) computed between temporally paired image tiles. Results show that the proposed approach consistently outperforms the other methods, achieving higher recall and F1-scores while maintaining viable precision in highly imbalanced fire-detection scenarios. Experiments with 8-band imagery and temporal image sequences yield only marginal performance gains over single 4-band inputs, underscoring the effectiveness of the proposed lightweight approach for scalable, near--real-time conflict monitoring.

2512.22420 2026-06-16 cs.DC cs.AI 版本更新

Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving

Nightjar: 面向大语言模型服务的动态自适应推测解码

Rui Li, Zhaoning Zhang, Libo Zhang, Huaimin Wang, Xiang Fu, Zhiquan Lai

发表机构 * State Key Laboratory of Complex & Critical Software Environment, National Key Laboratory of Parallel and Distributed Computing, College of Computer Science and Technology, National University of Defense Technology(复杂与关键软件环境国家重点实验室、平行与分布式计算国家实验室、计算机科学与技术学院、国防科技大学)

AI总结 提出Nightjar框架,通过动态调整推测长度和主动禁用推测解码,在高低负载下优化吞吐量,最高提升14.76%。

详情
AI中文摘要

推测解码通过并行验证草稿令牌加速LLM推理。然而,该方法存在关键权衡:在低负载、内存受限系统中提高吞吐量,但在高负载、计算受限环境中因验证开销而降低性能。现有推测解码方法使用固定长度,无法适应工作负载变化或决定何时停止推测。重新启动推测推理的成本也未被量化。在高负载下,推测的收益减少,而保留草稿模型会减少KV缓存容量,限制批处理大小并降低吞吐量。为解决此问题,我们提出Nightjar,一种资源感知的自适应推测框架。它首先通过动态选择不同批处理大小的最优推测长度来适应请求负载。关键的是,当MAB规划器确定推测不再有益时,Nightjar主动禁用推测解码,并在禁用阶段仅在GPU内存压力下将草稿模型卸载到CPU。这为KV缓存回收内存,从而促进更大的批处理大小并最大化系统整体吞吐量。实验表明,在实时LLM服务场景的动态请求到达率下,Nightjar在主要基准测试套件中比标准推测解码实现高达14.76%的吞吐量提升和高达20.18%的延迟降低。

英文摘要

Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Existing speculative decoding methods use fixed lengths and cannot adapt to workload changes or decide when to stop speculation. The cost of restarting speculative inference also remains unquantified. Under high load, the benefit of speculation diminishes, while retaining the draft model reduces KV cache capacity, limiting batch size and degrading throughput. To overcome this, we propose Nightjar, a resource-aware adaptive speculative framework. It first adjusts to the request load by dynamically selecting the optimal speculative length for different batch sizes. Crucially, Nightjar proactively disables speculative decoding when the MAB planner determines that speculation is no longer beneficial, and during the disabled phase, offloads the draft model to the CPU only under GPU memory pressure. This reclaims memory for the KV cache, thereby facilitating larger batch sizes and maximizing overall system throughput. Experiments show that Nightjar achieves up to 14.76% higher throughput than standard speculative decoding and up to 20.18% lower latency in the main benchmark suite under dynamic request arrival rates for real-time LLM serving scenarios.

2512.22827 2026-06-16 cs.SE cs.AI 版本更新

FasterPy: An LLM-based Code Execution Efficiency Optimization Framework

FasterPy:基于大语言模型的代码执行效率优化框架

Yue Wu, Minghao Han, Ruiyin Li, Peng Liang, Amjed Tahir, Zengyang Li, Qiong Feng, Mojtaba Shahin

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) School of Computer Science, Carnegie Mellon University(卡内基梅隆大学计算机学院) School of Mathematical and Computational Sciences, Massey University(梅西大学数学与计算科学学院) School of Computer Science, Central China Normal University(中央中国师范大学计算机学院) School of Computer Science, Nanjing University of Science and Technology(南京理工大学计算机学院) School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院)

AI总结 提出FasterPy框架,结合检索增强生成(RAG)和低秩适应(LoRA)技术,利用大语言模型自动优化Python代码执行效率,在PIE基准上超越现有方法。

Comments 38 pages, 5 images, 14 tables, Manuscript revision submitted to a Journal (2026)

详情
AI中文摘要

代码常常存在性能缺陷,这促使了对代码优化的研究和实践。传统的基于规则的方法依赖于为特定性能缺陷(如冗余循环、重复计算)手动设计和维护规则,因此劳动密集且适用性有限。近年来,基于机器学习和深度学习的方法通过学习来自标注代码语料库和性能测量的优化启发式,成为有前景的替代方案。然而,这些方法通常依赖于特定的程序表示和精心制作的训练数据集,使得开发成本高且难以扩展。随着大语言模型(LLMs)的蓬勃发展,它们在代码生成方面的卓越能力为自动化代码优化开辟了新途径。在这项工作中,我们提出了FasterPy,一个低成本且高效的框架,它使LLMs适应于优化Python代码的执行效率。FasterPy结合了检索增强生成(RAG)(由从现有性能改进代码对和相应性能测量构建的知识库支持)与低秩适应(LoRA),以增强代码优化性能。我们在Performance Improving Code Edits(PIE)基准上的实验结果表明,我们的方法在多个指标上优于现有模型。FasterPy工具和实验结果可在此https URL获取。

英文摘要

Code often suffers from performance bugs. These bugs necessitate the research and practice of code optimization. Traditional rule-based methods rely on manually designing and maintaining rules for specific performance bugs (e.g., redundant loops, repeated computations), making them labor-intensive and limited in applicability. In recent years, machine learning and deep learning-based methods have emerged as promising alternatives by learning optimization heuristics from annotated code corpora and performance measurements. However, these approaches usually depend on specific program representations and meticulously crafted training datasets, making them costly to develop and difficult to scale. With the booming of Large Language Models (LLMs), their remarkable capabilities in code generation have opened new avenues for automated code optimization. In this work, we proposed FasterPy, a low-cost and efficient framework that adapts LLMs to optimize the execution efficiency of Python code. FasterPy combines Retrieval-Augmented Generation (RAG), supported by a knowledge base constructed from existing performance-improving code pairs and corresponding performance measurements, with Low-Rank Adaptation (LoRA) to enhance code optimization performance. Our experimental results on the Performance Improving Code Edits (PIE) benchmark demonstrate that our method outperforms existing models on multiple metrics. The FasterPy tool and the experimental results are available at https://github.com/WuYue22/fasterpy.

2601.19697 2026-06-16 cs.SE cs.AI 版本更新

AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion

AlignCoder: 为目标意图对齐检索以实现仓库级代码补全

Tianyue Jiang, Yanli Wang, Yanlin Wang, Daya Guo, Ensheng Shi, Yuchi Ma, Jiachi Chen, Zibin Zheng

发表机构 * School of Software, Sun Yat-sen University(中山大学软件学院) Sun Yat-sen University(中山大学)

AI总结 针对仓库级代码补全中检索与目标代码不匹配及无法利用推理信息的问题,提出AlignCoder框架,通过查询增强和基于强化学习的检索器训练,在CrossCodeEval上EM分数提升18.1%。

Comments To appear at ASE'25

详情
AI中文摘要

由于现有代码大语言模型(code LLMs)对仓库特定上下文和领域知识的理解有限,仓库级代码补全仍然是一项具有挑战性的任务。虽然检索增强生成(RAG)方法通过检索相关代码片段作为跨文件上下文显示出前景,但它们存在两个基本问题:检索过程中查询与目标代码之间的不对齐,以及现有检索方法无法有效利用推理信息。为了解决这些挑战,我们提出了AlignCoder,一个仓库级代码补全框架,引入了查询增强机制和基于强化学习的检索器训练方法。我们的方法生成多个候选补全以构建增强查询,从而弥合初始查询与目标代码之间的语义差距。此外,我们采用强化学习训练AlignRetriever,使其学会利用增强查询中的推理信息进行更准确的检索。我们在两个广泛使用的基准测试(CrossCodeEval和RepoEval)上,使用五个骨干代码LLM评估了AlignCoder,在CrossCodeEval基准测试上,与基线相比,EM分数提高了18.1%。结果表明,我们的框架实现了优越的性能,并在各种代码LLM和编程语言中表现出高度的泛化能力。

英文摘要

Repository-level code completion remains a challenging task for existing code large language models (code LLMs) due to their limited understanding of repository-specific context and domain knowledge. While retrieval-augmented generation (RAG) approaches have shown promise by retrieving relevant code snippets as cross-file context, they suffer from two fundamental problems: misalignment between the query and the target code in the retrieval process, and the inability of existing retrieval methods to effectively utilize the inference information. To address these challenges, we propose AlignCoder, a repository-level code completion framework that introduces a query enhancement mechanism and a reinforcement learning based retriever training method. Our approach generates multiple candidate completions to construct an enhanced query that bridges the semantic gap between the initial query and the target code. Additionally, we employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval. We evaluate AlignCoder on two widely-used benchmarks (CrossCodeEval and RepoEval) across five backbone code LLMs, demonstrating an 18.1% improvement in EM score compared to baselines on the CrossCodeEval benchmark. The results show that our framework achieves superior performance and exhibits high generalizability across various code LLMs and programming languages.

2601.21527 2026-06-16 cond-mat.mtrl-sci cs.AI 版本更新

Sustainable Materials Discovery in the Era of Artificial Intelligence

人工智能时代的可持续材料发现

Sajid Mannan, Rupert J. Myers, Rohit Batra, Rocio Mercado, Lothar Wondraczek, N. M. Anoop Krishnan

发表机构 * Department of Civil and Environmental Engineering, Indian Institute of Technology Delhi(印度理工学院德里分校土木与环境工程系) Department of Civil and Environmental Engineering, Imperial College London(帝国理工学院伦敦分校土木与环境工程系) Department of Metallurgical and Materials Engineering, Indian Institute of Technology Madras(印度理工学院马德拉斯分校冶金与材料工程系) Department of Computer Science and Engineering, Chalmers University of Technology & University of Gothenburg(查尔姆斯理工大学与哥德堡大学计算机科学与工程系) Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi(印度理工学院德里分校亚里学校人工智能系)

AI总结 本文提出ML-LCA框架,将上游机器学习材料发现与下游生命周期评估整合,通过信息提取、统一数据库、多尺度建模、制造路径预测和不确定性优化,实现性能与可持续性协同优化。

详情
AI中文摘要

人工智能(AI)已经改变了材料发现,通过生成模型和替代筛选实现了化学空间的快速探索。然而,当前用于材料发现的生成式AI模型(现已驱动对广阔化学和结构空间的探索)仅针对结构稳定性和功能特性优化候选材料,在设计循环的任何阶段均未整合环境评估。前瞻性和事前生命周期评估方法已经存在并应用于新兴技术,但它们作为独立的下游分析运行,而非作为生成或主动学习管道中的主动约束。结果是,即使产生了环境反馈,也是在设计决策做出之后才到达,而非为决策提供信息。原子尺度设计与生命周期评估(LCA)之间的脱节反映了根本性挑战:(i)跨异构源的数据稀缺,(ii)从原子到工业系统的尺度差距,(iii)合成路径的不确定性,以及(iv)缺乏同时优化性能与环境影响的框架。在这篇观点文章中,我们提出将上游ML辅助材料发现与下游LCA整合到ML-LCA框架中,该框架包含五个组成部分:用于构建材料-环境知识库的信息提取、将属性与可持续性指标关联的统一数据库、桥接原子属性与生命周期影响的多尺度模型、具有不确定性量化的制造路径集成预测,以及实现性能-可持续性同时导航的不确定性感知优化。涵盖聚合物、玻璃、光刻胶和水泥的案例研究既证明了必要性和可行性,也识别了材料特定的整合挑战。

英文摘要

Artificial intelligence (AI) has transformed materials discovery, enabling rapid exploration of chemical space through generative models and surrogate screening. Yet current generative AI models for materials discovery, which now drive exploration of vast chemical and structural spaces, optimize candidates exclusively for structural stability and functional properties, with no integration of environmental assessment at any stage of the design loop. Prospective and ex-ante life cycle assessment methods exist and have been applied to emerging technologies, but they operate as standalone downstream analyses, not as active constraints within generative or active-learning pipelines. The result is that environmental feedback, even when produced, arrives after design decisions have been made rather than informing them. The disconnect between atomic-scale design and lifecycle assessment (LCA) reflects fundamental challenges: (i) data scarcity across heterogeneous sources, (ii) scale gaps from atoms to industrial systems, (iii) uncertainty in synthesis pathways, and (iv) the absence of frameworks that co-optimize performance with environmental impact. In this Perspective, we propose integrating upstream ML-assisted materials discovery with downstream LCA into the ML-LCA framework, comprising five components: information extraction for building materials-environment knowledge bases, harmonized databases linking properties to sustainability metrics, multi-scale models bridging atomic properties to lifecycle impacts, ensemble prediction of manufacturing pathways with uncertainty quantification, and uncertainty-aware optimization enabling simultaneous performance-sustainability navigation. Case studies spanning polymers, glass, photoresists, and cement demonstrate both necessity and feasibility while identifying material-specific integration challenges.

2602.14710 2026-06-16 cs.IR cs.AI 版本更新

Orcheo: A Modular Full-Stack Platform for Conversational Search

Orcheo: 一个用于对话式搜索的模块化全栈平台

Shaojie Jiang, Svitlana Vakulenko, Maarten de Rijke

发表机构 * University of Amsterdam(阿姆斯特丹大学) AI Colleagues(AI同事) WU Vienna University of Economics and Business(维也纳经济与商业大学)

AI总结 提出Orcheo开源平台,通过模块化架构、生产级基础设施和45+即用组件,解决对话式搜索研究中框架统一与原型部署的难题。

Comments Accepted to SIGIR 2026

详情
AI中文摘要

对话式搜索(CS)需要一个复杂的软件工程流水线,集成了查询重构、排序和响应生成。CS研究人员目前面临两个障碍:缺乏一个统一的框架来有效地与社区共享贡献,以及难以部署用于用户评估的端到端原型。我们介绍了Orcheo,一个旨在弥合这一差距的开源平台。Orcheo提供三个关键优势:(i)模块化架构通过单文件节点模块促进组件复用,便于CS研究中的共享和可重复性;(ii)生产级基础设施通过双执行模式、安全凭证管理和执行遥测弥合原型到系统的差距,内置AI编码支持降低学习曲线;(iii)入门工具包包括45多个现成组件,用于查询理解、排序和响应生成,能够快速启动完整的CS流水线。我们描述了框架架构,并通过强调模块化和易用性的案例研究验证了Orcheo的实用性。Orcheo在MIT许可下以开源形式发布于此https URL。

英文摘要

Conversational search (CS) requires a complex software engineering pipeline that integrates query reformulation, ranking, and response generation. CS researchers currently face two barriers: the lack of a unified framework for efficiently sharing contributions with the community, and the difficulty of deploying end-to-end prototypes needed for user evaluation. We introduce Orcheo, an open-source platform designed to bridge this gap. Orcheo offers three key advantages: (i) A modular architecture promotes component reuse through single-file node modules, facilitating sharing and reproducibility in CS research; (ii) Production-ready infrastructure bridges the prototype-to-system gap via dual execution modes, secure credential management, and execution telemetry, with built-in AI coding support that lowers the learning curve; (iii) Starter-kit assets include 45+ off-the-shelf components for query understanding, ranking, and response generation, enabling the rapid bootstrapping of complete CS pipelines. We describe the framework architecture and validate Orcheo's utility through case studies that highlight modularity and ease of use. Orcheo is released as open source under the MIT License at https://github.com/AI-Colleagues/orcheo.

2603.17531 2026-06-16 cs.CV cs.AI cs.CR 版本更新

Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing

Rel-Zero:利用补丁对不变性实现鲁棒的零水印以抵御AI编辑

Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Xiaojun Chen, Wu Liu, Weiping Wang

AI总结 针对AI编辑对图像真实性的威胁,提出Rel-Zero零水印框架,利用编辑中补丁对关系距离的不变性,无需修改原图即可生成鲁棒水印,实验证明其优于现有方法。

Comments accepted to CVPR 2026

详情
AI中文摘要

近期基于扩散的图像编辑技术的进步对数字视觉内容的真实性构成了重大威胁。传统的基于嵌入的水印方法通常引入可察觉的扰动以保持鲁棒性,不可避免地损害视觉保真度。同时,现有的零水印方法通常依赖全局图像特征,难以抵御复杂的操作。在这项工作中,我们揭示了一个关键观察:尽管在基于AI的编辑过程中单个图像补丁发生显著变化,但补丁对之间的关系距离保持相对不变。利用这一特性,我们提出了关系零水印(Rel-Zero),一种新颖的框架,无需对原始图像进行任何修改,而是从这些编辑不变的补丁关系中推导出唯一的零水印。通过将水印基于内在的结构一致性而非绝对外观,Rel-Zero为内容认证提供了一种非侵入性且具有弹性的机制。大量实验表明,与先前的零水印方法相比,Rel-Zero在多种编辑模型和操作下实现了显著提升的鲁棒性。

英文摘要

Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.

2604.00163 2026-06-16 cs.LG cs.AI cs.NE 版本更新

Epileptic Seizure Detection in Separate Frequency Bands Using Feature Analysis and Graph Convolutional Neural Network (GCN) from Electroencephalogram (EEG) Signals

基于特征分析和图卷积神经网络(GCN)的脑电图(EEG)信号癫痫发作检测在不同频段的研究

Ferdaus Anam Jibon, Fazlul Hasan Siddiqui, F. Deeba, Gahangir Hossain

AI总结 提出一种频率感知框架,将EEG分解为五个频段并提取判别特征,利用图卷积神经网络建模电极空间依赖,在CHB-MIT数据集上实现99.01%的宽带准确率,提高了可解释性和诊断精度。

Comments One author disagrees with the archiving

详情
AI中文摘要

癫痫发作是一种神经系统疾病,其特征是大脑中异常和过度的电活动,导致反复发作事件。脑电图(EEG)信号因其能够捕捉时间和空间的神经动力学而被广泛用于癫痫诊断。虽然最近的深度学习方法取得了高检测准确率,但它们往往缺乏可解释性和神经生理学相关性。本研究提出了一种基于发作期EEG分析的频率感知框架用于癫痫发作检测。原始EEG信号被分解为五个频段(delta、theta、alpha、低beta和高beta),并从每个频段提取十一个判别特征。然后采用图卷积神经网络(GCN)对EEG电极之间的空间依赖性进行建模,电极表示为图节点。在CHB-MIT头皮EEG数据集上的实验表明,该方法在相应频段上分别达到了97.1%、97.13%、99.5%、99.7%和51.4%的准确率,总体宽带准确率为99.01%。结果突出了中频段的强判别能力,并揭示了特定频率的发作模式。与传统的宽带EEG方法相比,所提出的方法提高了可解释性和诊断精度。

英文摘要

Epileptic seizures are neurological disorders characterized by abnormal and excessive electrical activity in the brain, resulting in recurrent seizure events. Electroencephalogram (EEG) signals are widely used for seizure diagnosis due to their ability to capture temporal and spatial neural dynamics. While recent deep learning methods have achieved high detection accuracy, they often lack interpretability and neurophysiological relevance. This study presents a frequency-aware framework for epileptic seizure detection based on ictal-phase EEG analysis. The raw EEG signals are decomposed into five frequency bands (delta, theta, alpha, lower beta, and higher beta), and eleven discriminative features are extracted from each band. A graph convolutional neural network (GCN) is then employed to model spatial dependencies among EEG electrodes, represented as graph nodes. Experiments on the CHB-MIT scalp EEG dataset demonstrate high detection performance, achieving accuracies of 97.1%, 97.13%, 99.5%, 99.7%, and 51.4% across the respective frequency bands, with an overall broadband accuracy of 99.01%. The results highlight the strong discriminative capability of mid-frequency bands and reveal frequency-specific seizure patterns. The proposed approach improves interpretability and diagnostic precision compared to conventional broadband EEG-based methods.

2604.27128 2026-06-16 cs.CV cs.AI 版本更新

Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics

SAM 3 和 DINOv3 的轻量级蒸馏用于边缘可部署的个体级牲畜监测与纵向视觉分析

Haiyu Yang, Miel Hostens

发表机构 * College of Agriculture and Life Sciences, Cornell University(农业与生命科学学院,康奈尔大学)

AI总结 通过蒸馏SAM 3的感知编码器至TinyViT学生网络,并采用DINOv3的ViT-S嵌入器,实现边缘可部署的个体级牲畜监测,在Edinburgh猪数据集上以7.77倍参数减少和3.01倍显存降低达到接近教师模型的性能,支持长期视觉分析。

详情
AI中文摘要

用于个体级牲畜监测的基础模型流水线——结合开放词汇检测、可提示视频分割和自监督视觉嵌入——提高了精准畜牧业(PLF)的准确率上限,但其GPU内存预算超出了商用边缘加速器的范围。为弥补这一差距,SAM 3的4.46亿参数感知编码器(PE-ViT-L+)骨干通过三种机制被蒸馏为一个4066万参数的多尺度学生网络:基于TinyViT-21M-512的特征金字塔网络学生编码器、四项方向-尺度蒸馏损失,以及带滑动窗口会话剪枝的骨干替换推理,以限制流式GPU内存增长。DINOv3系列包括一个预蒸馏的ViT-S/16变体(2160万参数),与6716万参数的ViT-7B教师模型一同发布;采用ViT-S(2100万参数)变体作为每个个体的嵌入器。在Edinburgh猪数据集上,压缩流水线相对于SAM 3教师模型达到92.29% MOTA和96.15% IDF1(分别下降1.68和0.84个百分点),系统级参数减少7.77倍,峰值显存减少3.01倍(19.52GB -> 6.49GB),并在九类猪行为分类中达到97.34% top-1准确率和91.67% macro-F1。该流水线适配NVIDIA Jetson Orin NX 16GB环境,具有4.9GB余量,支持一种提议但尚未经验证的设备端嵌入池重识别机制,其每个个体每年约94MB的足迹产生纵向视觉记录,便于与疾病、跛行、繁殖和生长结果标签进行回顾性关联。

英文摘要

Foundation-model pipelines for individual-level livestock monitoring -- combining open-vocabulary detection, promptable video segmentation, and self-supervised visual embeddings -- have raised the accuracy ceiling of precision livestock farming (PLF), but their GPU memory budgets exceed the envelope of commodity edge accelerators. To close this gap, the 446M-parameter Perception Encoder (PE-ViT-L+) backbone of SAM 3 is distilled into a 40.66M-parameter multi-scale student through three mechanisms: a Feature Pyramid Network student encoder built on TinyViT-21M-512, a four-term direction-then-scale distillation loss, and backbone-substitution inference with sliding-window session pruning that bounds streaming GPU memory growth. The DINOv3 family includes a pre-distilled ViT-S/16 variant (21.6M parameters) released alongside a 6716M-parameter ViT-7B teacher; the ViT-S (21M) variant is adopted as the per-individual embedder. On the Edinburgh Pig dataset, the compressed pipeline reaches 92.29% MOTA and 96.15% IDF1 against the SAM 3 teacher (1.68- and 0.84-percentage-point losses), achieves a 7.77-fold reduction in system-level parameters and a 3.01-fold reduction in peak VRAM (19.52GB -> 6.49GB), and reaches 97.34% top-1 accuracy with 91.67% macro-F1 on nine-class pig behaviour classification. The pipeline fits inside an NVIDIA Jetson Orin NX 16GB envelope with 4.9GB of headroom, supporting a proposed -- but not yet empirically validated -- on-device embedding-pool re-identification mechanism whose per-individual footprint of approximately 94MB per animal per year produces a longitudinal visual record amenable to retrospective association with disease, lameness, reproductive, and growth outcome labels.

2605.00074 2026-06-16 q-bio.GN cs.AI 版本更新

CRC-Screen: Certified DNA-Synthesis Hazard Screening Under Taxonomic Shift

CRC-Screen:分类偏移下认证的DNA合成危害筛查

Najmul Hasan

发表机构 * Najmul Hasan(纳杰姆·哈桑)

AI总结 针对DNA合成订单中危害序列因分类偏移导致基线筛查100%误报的问题,提出基于k-mer Jaccard相似度、五LLM评委修剪均值和嵌入聚类质心余弦相似度的融合信号,经单调逻辑聚合器和共形风险控制校准,在保证假阴性率受控的同时实现零漏检和低误报。

Comments Accepted at the 6th Muslims in ML (MusIML) Workshop at ICML 2026

详情
AI中文摘要

DNA合成供应商通过将请求序列与精选的危害列表进行比对来筛查传入订单。我们证明,当危害序列来自参考集中缺失的分类家族时,这种基线方法会崩溃为100%的误报率:在共形风险控制的认证漏检率约束下,低区分度信号迫使阈值低于整个测试良性样本的质量。我们组合了从合成订单的公共注释中导出的三个信号:与已知毒素的$k$-mer Jaccard相似度、五个LLM评委小组的修剪均值分数,以及与聚类嵌入质心的余弦相似度。在单调逻辑聚合器下融合并由共形风险控制校准,得到的筛查器认证$\mathbb{E}[\mathrm{FNR}] \le \alpha + \mathrm{TV}$,其中加性项是家族留出下校准到测试的分布偏移(跨折认证上限为24-49%)。在UniProt KW-0800审核毒素上,以$\alpha=0.05$进行十次留一分类家族交叉验证,校准后的筛查器在每一折上实现0%的经验测试漏检率,并在十折中的九折上实现0%的测试误报率。该界限的有限样本松弛量$1/(n_{\mathrm{cal}}+1)$将我们200个危害子样本的可认证漏检率上限限制在1.77%;达到采购级$\alpha=10^{-3}$需要$18\times$更大的校准集,而完整的UniProt KW-0800审核语料库足够大以提供此规模。可认证DNA合成筛查的约束条件是校准数据,而非算法。代码:此 https URL

英文摘要

DNA-synthesis providers screen incoming orders by searching the requested sequence against curated hazard lists. We show that this baseline collapses to a 100% false-flag rate when the hazardous sequence comes from a taxonomic family absent from the reference set: under Conformal Risk Control's certified miss-rate constraint, a low-discrimination signal forces the threshold below the entire test-benign mass. We compose three signals derived from a synthesis order's public annotation: $k$-mer Jaccard similarity to known toxins, the trimmed-mean score of a five-LLM judge panel, and cosine similarity to clustered embedding centroids. Fused under a monotone logistic aggregator and calibrated by Conformal Risk Control, the resulting screener certifies $\mathbb{E}[\mathrm{FNR}] \le α+ \mathrm{TV}$, where the additive term is the calibration-to-test distribution shift under family holdout (a certified ceiling of 24-49% across folds). Across ten leave-one-taxonomic-family-out folds at $α=0.05$ on UniProt KW-0800 reviewed toxins, the calibrated screener achieves 0% empirical test miss rate on every fold and 0% test false-flag rate on nine of ten folds. The bound's finite-sample slack $1/(n_{\mathrm{cal}}+1)$ caps the certifiable miss rate at 1.77% on our 200-hazard subsample; reaching procurement-grade $α=10^{-3}$ requires an $18\times$ larger calibration set, which the full reviewed UniProt KW-0800 corpus is large enough to deliver. The binding constraint on certifiable DNA-synthesis screening is calibration data, not algorithms. Code: https://github.com/najmulhasan-code/crc-screen

2605.09370 2026-06-16 cs.DC cs.AI 版本更新

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

从检测到恢复:504 GPU上LLM预训练的运营分析

Daemyung Kang, Eunjin Hwang, Hanjeong Lee, HyeokJin Kim, Hyunhoi Koo, Jeongkyu Shin, Jeongseok Kang, Jihyun Kang, Jinho Heo, Joongi Kim, Junbum Lee, Jungseung Yang, Kyujin Cho, Youngsook Song

发表机构 * Lablup Inc.(Lablup公司)

AI总结 本文通过分析一个63节点NVIDIA B200生产集群(504 GPU)55天的Prometheus时间序列数据和73天的运营日志,提出了三项定量分析,揭示了多信号检测策略、存储I/O瓶颈和自动重试恢复机制的有效性。

Comments 42 pages, 19 figures, 16 tables. Lablup Technical Report

详情
AI中文摘要

大规模AI训练现在基本上是一个分布式系统问题,硬件故障已成为常规操作条件而非罕见例外。然而,来自生产训练集群的公开运营证据仍然稀缺。本技术报告对63节点NVIDIA B200生产集群(504 GPU)进行了实证分析,使用了55天的Prometheus时间序列数据和73天的运营日志,涵盖了224次多节点训练会话。该集群在跨组织环境中运行,五个参与方(SKT、Upstage、Lablup、NVIDIA Korea和VAST Data)共享统一的监控管道。这种安排使得联合诊断一个60节点规模的存储I/O瓶颈成为可能,该瓶颈在2-4节点规模下不会出现,这是一个单一团队无法单独隔离的生产规模现象。基于为期数月的预训练活动,我们进行了三项定量分析,得出四个发现。首先,对751个Prometheus指标和10个XID识别的GPU故障进行统计分析,实现了10/10的检测率(2/10在XID之前),每天约0.84个误报。没有单一指标在所有故障类型中持续占主导地位,这促使采用多信号检测策略。其次,对沿GPU VRAM到NFS路径的523个检查点事件进行分析,将“带宽悖论”(200 Gbps RoCE的1.4-10.4%利用率)归因于128槽NFS RPC层的饱和。第三,多节点故障响应显示集中排除(63个节点中前3个占所有排除的>50%),自动重试链的成功率为33.3%(12个链,73次尝试),是手动恢复率12.5%的2.7倍;中位重试间隔为11分钟(IQR 10-11)。所有分析均基于提供会话级工作负载管理、GPU中心调度和统一可观测性的生产基础设施。

英文摘要

Large-scale AI training is fundamentally a distributed systems problem, where hardware failures are routine operating conditions rather than rare exceptions, yet public operational evidence from production training clusters remains limited. This report presents an empirical analysis of a 63-node NVIDIA B200 production cluster (504 GPUs), using 55 days of Prometheus time-series data and 73 days of operational logs covering 224 multi-node training sessions. The environment is cross-organizational: five parties (SKT, Upstage, Lablup, NVIDIA Korea, VAST Data) share a unified monitoring pipeline. This enabled joint diagnosis of a 60-node-scale storage I/O bottleneck absent in 2-4-node tests, a production-scale phenomenon no single team could isolate alone. We perform three quantitative analyses yielding four findings. First, over 751 Prometheus metrics and 10 XID-identified GPU failures, no single metric is consistently dominant across failure types, motivating multi-signal detection. Second, 523 checkpoint events trace the save/load path from GPU VRAM to the NFS server: restart loading reaches 21.5% of maximum read bandwidth (700 GB/s) and save bursts 16.0% of maximum write bandwidth (250 GB/s), with NFS/RPC queueing and transport-layer backlog rising together. Third, across 224 sessions over 73 days, node exclusions concentrate so the top 3 of 63 nodes account for over 50%. Fourth, auto-retry chain analysis shows a 33.3% success rate over 12 chains (73 attempts), 2.7x the 12.5% manual rate, with a median retry interval of 11 minutes (IQR 10-11). All analyses are grounded in production infrastructure providing session-level workload management, GPU-centric scheduling, and unified observability.

2605.21027 2026-06-16 cs.CL cs.AI 版本更新

Beyond Text-to-SQL: An Agentic LLM System for Governed Enterprise Analytics APIs

超越文本到SQL:一个面向受控企业分析API的代理LLM系统

Gundeep Singh, Parsa Kavehzadeh, Jing Xia, Xue-Yong Fu, Julien Bouvier Tremblay, Md Tahmid Rahman Laskar, Vincent Lum, Shashi Bhushan TN

发表机构 * Dialpad Inc.(Dialpad公司)

AI总结 本文提出Analytic Agent,一个基于LLM的代理系统,能够将自然语言意图安全地转换为与企业分析API的交互,解决传统文本到SQL系统在企业环境中面临的可靠性与合规性问题。

Comments Accepted to the Enterprise AI Agents Workshop @ KDD 2026. The first four authors contributed equally to this work

详情
AI中文摘要

企业分析旨在使组织数据对决策制定可及,但非技术用户在使用传统商业智能工具或文本到SQL系统时仍面临障碍。尽管基于大型语言模型(LLM)的最新文本到SQL方法承诺通过自然语言访问结构化数据,但在企业环境中,分析流水线依赖受控的API而非原始数据库。实际上,这些API封装了复杂的业务逻辑以确保一致性、可审计性和安全性。然而,将数学或聚合逻辑委托给LLM会引入可靠性和合规性风险。为此,我们提出了Analytic Agent,一个基于LLM的代理系统,将自然语言意图转换为与企业分析API的安全交互。在90个由领域专家构建的真实企业使用案例上进行评估,它能够可靠地解释用户目标,验证权限,执行受控查询,并通过多步骤推理和政策感知编排生成合规的可视化结果。

英文摘要

Enterprise analytics aims to make organizational data accessible for decision-making, yet non-technical users still face barriers when using traditional business intelligence tools or Text-to-SQL systems. While recent Text-to-SQL approaches based on Large Language Models (LLMs) promise natural language access to structured data, they fall short in enterprise settings where analytics pipelines rely on governed APIs rather than raw databases. In practice, these APIs encapsulate complex business logic to ensure consistency, auditability, and security. However, delegating mathematical or aggregation logic to an LLM introduces reliability and compliance risks. To this end, we present Analytic Agent, an LLM-based agentic system that translates natural language intents into secure interactions with enterprise analytics APIs. Evaluated on 90 real enterprise use cases constructed by domain experts, it reliably interprets user goals, validates permissions, executes governed queries, and generates compliant visualizations through multi-step reasoning and policy-aware orchestration.

2605.21312 2026-06-16 cs.DC cs.AI cs.LG 版本更新

Frontier: Towards Comprehensive and Accurate LLM Inference Simulation

Frontier: 向全面且准确的LLM推理模拟迈进

Yicheng Feng, Xin Tan, Yangtao Deng, Yimin Jiang, Yibo Zhu, Hong Xu

发表机构 * The Chinese University of Hong Kong(香港中文大学) Anuttacon StepFun

AI总结 本文提出Frontier,一种用于现代LLM推理服务的离散事件模拟器,通过离散化抽象和对关键运行时优化的建模,实现了对复杂工作负载的准确预测,从而在不同服务场景中提供更精确的计算、通信和内存成本预测。

详情
AI中文摘要

现代LLM服务已不再是单一或整体的。生产系统现在结合了解耦执行、复杂并行性、运行时优化和状态化工作负载,如推理、代理和RL展开。模拟对于探索这个快速增长的设计空间具有吸引力,但现有模拟器缺乏所需的架构完整性和决策级精度。它们的单体-副本抽象不适合解耦服务,而平均情况分析代理可能会扭曲SLA预测甚至逆转优化结论。我们提出了Frontier,一种用于现代LLM推理服务的离散事件模拟器。Frontier具有解耦抽象。它通过建模共置、预填解码解耦(PDD)和注意力-前馈网络解耦(AFD)与角色特定的集群工作者,捕捉现代服务系统的结构和动态。它在调度器-批次引擎循环中整合关键运行时优化(例如CUDA图、推测解码),并支持新兴工作负载的状态请求。它进一步提供了在多样化服务场景中对计算、通信和内存成本的准确且可推广的预测。在16-H800 GPU测试平台上,Frontier实现了平均吞吐量误差低于4%。与最先进的模拟器相比,它在共置情况下将端到端延迟误差从44.9%降低到6.4%,在解耦情况下从51.7%降低到2.6%。它扩展到超过1000个GPU在商用CPU上,并启用了新的用例,如依赖SLA的帕累托前沿探索、异构解耦分配、代理推理调度验证和RL后训练重配置。

英文摘要

Modern LLM serving is no longer homogeneous or monolithic. Production systems now combine disaggregated execution, complex parallelism, runtime optimizations, and stateful workloads such as reasoning, agents, and RL rollouts. Simulation is attractive for exploring this growing design space, yet existing simulators lack the architectural completeness and decision-grade fidelity it demands. Their monolithic-replica abstractions are ill-suited to disaggregated serving, while average-case analytical proxies can distort SLA predictions and even reverse optimization conclusions. We present Frontier, a discrete-event simulator for modern LLM inference serving. Frontier features a disaggregated abstraction. It captures the structure and dynamics of modern serving systems by modeling co-location, Prefill-Decode Disaggregation (PDD), and Attention-FFN Disaggregation (AFD) with role-specific cluster workers, incorporating key runtime optimizations (e.g., CUDA Graphs, speculative decoding) within the scheduler-batch-engine loop, and supporting stateful requests for emerging workloads. It further provides accurate and generalizable predictions of computation, communication, and memory costs across diverse serving scenarios with complex workload compositions. On 16-H800 GPU testbed, Frontier achieves an average throughput error below 4%. Compared with state-of-the-art simulators, it reduces end-to-end latency error from 44.9% to 6.4% under co-location and from 51.7% to 2.6% under disaggregation. It scales to over 1K GPUs on commodity CPUs and enables new use cases such as SLA-dependent Pareto frontier exploration, heterogeneous disaggregated allocation, agentic reasoning scheduling validation, and RL post-training reconfiguration. We release Frontier at https://github.com/NetX-lab/Frontier.

2605.21629 2026-06-16 cs.CY cs.AI cs.HC 版本更新

Faster Completion, Less Learning: Generative AI Reduced Study Time on Math Problems and the Knowledge They Build

更快完成,更少学习:生成式AI减少了学生在数学问题及所构建知识上的学习时间

Sina Rismanchian, Hasan Uzun, Jeffrey Matayoshi, Eric Cosyn, Eyad Kurd-Misto

发表机构 * University of California, Irvine(加州大学尔湾分校) McGraw Hill(麦格劳-希尔)

AI总结 本研究探讨生成式AI如何影响学生的学习过程和学习成果,通过分析大量学习互动数据,发现AI使用导致学生在可被AI处理的问题上学习时间减少,但这种效率提升在监考情况下消失,揭示了AI对学习行为和知识构建的深远影响。

详情
AI中文摘要

How much have students' ordinary learning processes shifted in response to generative AI, and how does that affect their durable learning outcomes? Self-report surveys show little change, while small-scale behavioral studies report widespread AI use without the scale or duration to measure learning consequences. We address both questions using a ten-year panel of $3.2$ million ALEKS learning interactions for the time-on-task analysis, complemented by ALEKS PPL placement-assessment data for the proctoring and retention analyses, with a quasi-experimental design exploiting within-curriculum variation in AI susceptibility: text-based word problems transcribable into AI prompts serve as the treated group; graph-based problems requiring interactive platform manipulation as the comparison. Learning time on AI-susceptible problems declines $2.8\%$ per quarter among college students after ChatGPT's release, cumulating to $26.9\%$ over eleven quarters; high-schoolers show $31.3\%$, middle-schoolers $9.0\%$, and Grade 5 students no detectable change. The divergence vanishes entirely under proctoring for college students, making general efficiency gains unlikely. Logistic fixed-effects models on randomly assigned proctored retention items yield a $25\%$ cumulative decline in odds of correct response; the same estimator on non-proctored assessment produces a large opposite-signed increase -- inconsistent with any platform, cohort, or curriculum explanation. These results are among the first large-scale behavioral and outcome evidence that generative AI has altered how students study and the knowledge they build -- the population-level indicator of \emph{cognitive surrender}, with direct implications for educational research, assessment governance, and AI policy.

英文摘要

How much have students' ordinary learning processes shifted in response to generative AI, and how does that affect their durable learning outcomes? Self-report surveys show little change, while small-scale behavioral studies report widespread AI use without the scale or duration to measure learning consequences. We address both questions using a ten-year panel of $3.2$ million ALEKS learning interactions for investigating time-on-task, complemented by ALEKS PPL placement-assessment data for examining proctoring and learning outcomes, with a quasi-experimental design exploiting variation in tasks that are more susceptible to AI (text-based word problems) and less susceptible to AI (interactive graph-based problems). Learning time on AI-susceptible problems declines $2.8\%$ per quarter among college students after ChatGPT's release, cumulating to $26.9\%$ over eleven quarters; high-schoolers show $31.3\%$, middle-schoolers $9.0\%$, and Grade 5 students no detectable change. Among college students, the post-ChatGPT divergence vanishes entirely under proctoring, ruling out broad efficiency gains as the likely explanation. Logistic fixed-effects models on randomly assigned proctored retention items yield a $25\%$ cumulative decline in odds of correct response; the same estimator on non-proctored assessment produces a large opposite-signed increase -- inconsistent with any platform, cohort, or curriculum explanation. These results are among the first large-scale behavioral and outcome evidence that generative AI has altered how students study and the knowledge they build -- the population-level indicator of \emph{cognitive surrender}, with direct implications for educational research, assessment governance, and AI policy.

2605.27599 2026-06-16 cs.LG cs.AI cs.AR cs.DC cs.PF 版本更新

The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution

能源盲点:NVIDIA 旗舰边缘 AI 硬件无法支持进程级能源归因

Deepak Panigrahy, Aakash Tyagi

发表机构 * Independent Researcher(独立研究者) Texas A&M University(德克萨斯农工大学)

AI总结 本文审计了 ASUS Ascent GX10 (GB10 SoC) 平台的能源可观测性,发现其缺乏 CPU 能源计数器等关键接口,导致无法像 x86 的 RAPL 那样进行进程级能源归因,并提出通过外部直流计量和 GPU 减法进行校准的临时方案,呼吁将能源可观测性作为硬件的一等要求。

详情
AI中文摘要

代理型 AI 工作负载——其中单个用户目标触发多步编排、工具调用、重试和故障恢复——正被瞄准用于边缘部署,NVIDIA、戴尔、惠普、华硕、微星、宏碁和技嘉都将在 2026 年出货基于 GB10 的桌面 AI 系统。我们最近证明,编排结构主导了代理型能源成本,工作流每个成功目标消耗的能源是线性基线的 4.33 倍,而多步推理任务的 OOI 达到 7.63 倍。另外,Rajat 等人表明,在代理型工作负载中,CPU 端处理占总延迟的 90.6%,占总动态能源的 44%。我们报告了对 ASUS Ascent GX10 (GB10 SoC) 的系统性能源可观测性审计,发现该平台通过任何支持的软件接口都不暴露 CPU 能源计数器、INA 电源轨监视器、IPMI/BMC 和 SCMI powercap 协议。唯一的设备上能源遥测是通过 NVML 的瞬时 GPU 功率。我们进一步发现,联发科固件已经通过未记录的 ACPI 接口 (SPBM) 在内部计算每轨能源,但 NVIDIA 表示“没有计划暴露 CPU 轨信息”。因此,通过支持的接口,无法在此平台上重现像 x86 通过 RAPL 执行的设备上每进程能源归因。我们形式化了能源归因 AI 的硬件需求规范,提出了使用外部直流计量结合 GPU 减法的临时校准桥接,并确定了通过 SCMI powercap 的标准轨道路径。我们的发现激励低碳计算社区将能源可观测性作为硬件的头等要求。

英文摘要

Agentic AI workloads - where a single user goal triggers multi-step orchestration, tool calls, retries, and failure recovery - are being targeted for edge deployment, with NVIDIA, Dell, HP, ASUS, MSI, Acer, and Gigabyte all shipping GB10-based desktop AI systems in 2026. We recently demonstrated that orchestration structure dominates agentic energy cost, with workflows consuming 4.33x more energy per successful goal than linear baselines and OOI reaching 7.63x for multi-step reasoning tasks. Separately, Raj et al. show that CPU-side processing accounts for up to 90.6% of total latency and 44% of total dynamic energy in agentic workloads. We report a systematic energy-observability audit of the ASUS Ascent GX10 (GB10 SoC) and find that the platform exposes no CPU energy counter, no INA power-rail monitor, no IPMI/BMC, and no SCMI powercap protocol through any supported software interface. The only on-device energy telemetry is instantaneous GPU power via NVML. We further discover that the MediaTek firmware already computes per-rail energy internally via an undocumented ACPI interface (SPBM), but NVIDIA states there are "no plans to expose CPU rail information." On-device per-process energy attribution - as performed on x86 via RAPL - is therefore not reproducible on this platform through supported interfaces. We formalize a hardware requirements specification for energy-attributed AI, propose an interim calibration bridge for per-domain energy decomposition - confirmed on the Acer Veriton GN100 where CPU energy accumulators are live - and identify a standards-track path via SCMI powercap. Our findings motivate the low-carbon computing community to demand energy observability as a first-class hardware requirement.

2605.30208 2026-06-16 cs.SE cs.AI 版本更新

Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency

自动化低风险代码审查在Meta:RADAR、风险校准与审查效率

Chris Adams, Arjun Singh Banga, Parveen Bansal, Souvik Bhattacharya, Payal Bhuptani, Rujin Cao, Pedro Canahuati, Nate Cook, Brian Ellis, Prabhakar Goyal, Gurinder Grewal, Tianyu He, Matt Labunka, Alex Manners, David Molnar, Ging Cee Ng, Vishal Parekh, Jiefu Pei, Frederic Sagnes, James Saindon, Will Shackleton, Sid Sidhu, Gursharan Singh, Karthik Chengayan Sridhar, Matt Steiner, Pratibha Udmalpet, Sean Xia, Stacey Yan, Audris Mockus, Peter Rigby, Nachiappan Nagappan

发表机构 * Meta USA, UK, Canada(Meta美国、英国、加拿大)

AI总结 提出RADAR系统,通过多阶段漏斗对代码差异进行风险分层自动化审查,在Meta部署后显著提升审查效率并降低风险。

详情
AI中文摘要

AI辅助编码工具改变了软件生产。在Meta,每人工提交的代码行数同比增长105.9%,每位开发者的提交量增长51%,其中代理AI贡献了超过80%的增长。与此同时,获得及时审查的提交比例下降,暴露出代码供应与审查带宽之间的差距。我们提出三个问题,从可行性到校准再到影响:(1)风险分层的自动化能否在不同组织中大规模运行,(2)调整风险阈值如何影响自动化产出与安全性之间的权衡,(3)自动化审查在多大程度上减少AI生成变更的端到端延迟?我们部署了RADAR(风险感知差异自动审查),一个多阶段漏斗,根据作者和源类型对每个差异进行分类,应用资格门控、静态启发式、机器学习差异风险评分、基于LLM的自动化代码审查,以及在落地合格变更前的确定性验证。我们通过覆盖535K+个RADAR审查差异的遥测、政策变更的前后观察比较以及效率结果的差异分析来评估RADAR。RADAR已审查535K+个差异并落地331K+个。将差异风险评分阈值从第25百分位放宽到第50百分位,批准率提高到60.31%。RADAR审查差异的回滚率是非RADAR差异的1/3,生产事故率是非RADAR差异的1/50。RADAR将中位关闭时间减少超过330%,中位差异审查墙时间减少35%。风险感知的分层自动化可以显著减少由AI驱动的代码增长造成的审查瓶颈,同时不损害生产安全。

英文摘要

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.

2606.01613 2026-06-16 cs.IR cs.AI cs.MA 版本更新

TechRAG: Evidence-Gated Multimodal Agentic RAG for Technical Literature Reasoning

TechGraphRAG:面向技术文献推理的智能图增强RAG框架

Kanwar Bharat Singh

发表机构 * Global Tire Intelligence and Solutions (GTIS)(全球轮胎智能与解决方案(GTIS)) The Goodyear Tire & Rubber Company(固特异轮胎与橡胶公司)

AI总结 提出一种13步自主流水线的智能检索增强生成框架,通过证据充分性评分、知识图谱遍历和自校正生成,支持领域特定技术文献推理。

详情
AI中文摘要

本文提出了一种面向特定领域技术推理支持的智能检索增强生成(RAG)框架,并在包含约2100篇智能轮胎、车辆动力学和车辆控制领域学术论文的精选语料库上进行了实例化。与传统的单次RAG系统不同,所提出的架构采用13步自主流水线:按意图分类查询,基于多维评分标准评估证据充分性,执行带有漂移防护查询重构的智能重试,通过迭代优化-搜索-验证循环搜索外部学术数据库(Crossref、OpenAlex、Semantic Scholar),遍历Neo4j知识图谱以获取关系上下文,验证引用完整性,并在自动重新生成后应用后生成质量检查。主要贡献包括:一个跨五个维度、带有相关性衰减和混合规则/LLM审查的100分证据充分性评分框架;一个具有迭代智能循环的路径依赖外部搜索架构;一个通过基于LLM的实体提取和OpenAlex作者验证以及语料库内引用解析构建的知识图谱;以及一个带有引用验证和质量评估的自校正生成循环。该框架作为一个实际实施的案例研究,展示了智能、基于证据的RAG如何支持大型特定领域语料库上的文献导航和技术推理。

英文摘要

This paper presents an agentic multimodal retrieval-augmented generation (RAG) framework for domain-specific literature reasoning, instantiated on a curated corpus of several thousand papers in intelligent tires, vehicle dynamics, vehicle control, sensing, estimation, and machine learning. Unlike conventional single-pass RAG systems, the proposed architecture uses an autonomous, evidence-gated pipeline that classifies query intent, generates separate text and visual query rewrites, performs hybrid text retrieval with FAISS and BM25 followed by cross-encoder reranking, expands evidence through graph-guided chunk traversal over a Neo4j knowledge graph, and retrieves visual document evidence using ColSmol late-interaction embeddings with MUVERA fixed-dimensional encoding, approximate nearest-neighbor search, and MaxSim reranking. The framework scores evidence sufficiency using a 100-point rubric with hybrid rule-based/LLM review, retries retrieval through drift-guarded reformulation, searches external academic databases through optimize--search--vet loops, merges and deduplicates multimodal evidence, verifies citation integrity, and generates cited answers through Planner, Researcher, Writer, and Critic agents with self-correcting revision. Key contributions include: (i) a scalable multimodal retrieval architecture combining text, graph, and visual evidence over 40,000 document pages; (ii) an interpretable evidence sufficiency and retry mechanism; (iii) a multi-agent generation pipeline with evidence mapping and critic-driven revision; (iv) a domain knowledge graph with LLM-based entity extraction, OpenAlex author validation, and intra-corpus citation resolution; and (v) a route-dependent external search architecture for targeted literature expansion. The result is a practical, evidence-gated, multimodal agentic RAG architecture for technical reasoning over specialized research corpora.

2606.06510 2026-06-16 cs.AR cs.AI cs.DC cs.PF 版本更新

FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail (June 13th version)

FP8就是一切(第一部分):揭穿硬件FP64作为HPC圣杯的神话

Satoshi Matsuoka

发表机构 * RIKEN Center for Computational Science (R-CCS)(日本计算科学研究中心(R-CCS))

AI总结 本文通过中国剩余定理的Ozaki Scheme II,在AI优化GPU上利用FP8张量吞吐量实现全FP64精度的内存天花板性能,挑战了原生FP64硬件是科学计算基础的传统观点。

Comments This is the revised version of the previous submission (May 28th) version. There is a companion Part (2) paper focusing on Ozaki-style FFT

详情
AI中文摘要

传统HPC教条认为,原生硬件FP64硅是科学计算不可约的基础——双精度模拟的“圣杯”。本文论证该教条是错误的:在B300代及以后的AI优化GPU上,丰富的FP8张量吞吐量结合基于中国剩余定理的Ozaki Scheme II,在典型HPC内核谱上以全FP64精度恢复了内存天花板执行。NVIDIA的Blackwell Ultra (B300)将原生FP64压缩至约1.3 TFLOPS——相比B200下降31倍——使得即使是内存受限的内核(SpMV、GEMV、模板计算)也变为计算受限。我们做出四项贡献。第一,一个统一的分析模型——张量-内存均衡(TME)模型,在Roofline模型上增加了计算乘数alpha、带宽乘数beta和重建延迟gamma。第二,我们识别出寄存器级融合是驱动beta趋近于1的机制,使得模拟在内存墙后几乎免费。第三,我们预测Ozaki Scheme II将模拟FP64从约1 TFLOPS的原生下限提升至约500 TFLOPS(B300)和约400 TFLOPS(Rubin R200),在计算受限区域超过B200原生FP64上限一个数量级以上,同时在带宽受限区域匹配内存天花板。第四,与H100基线相比,Ozaki Scheme II在每个研究的工作负载上匹配或超过H100,而B300原生FP64则导致高达50倍的性能下降。结合配套的FFT分析(在幸存的INT32流水线上使用Kulisch定点重建)和配套第二部分论文中报告的FP32+Kahan归约,B300上每个被调查的内核类别都以全FP64精度达到内存天花板。证据支持标题的主张:FP8,配合Ozaki Scheme II和Kulisch逃生路线,是生产级HPC所需的一切;原生FP64硅不再是人们所认为的圣杯。

英文摘要

Conventional HPC holds that native hardware FP64 is the irreducible foundation of scientific computing. On AI-optimized GPUs of the NVIDIA B300 generation and beyond, native FP64 throughput has collapsed to ~1.3 TFLOPS even as FP8 tensor throughput has grown to multiple PFLOPS. We argue something stronger than that this is survivable: the FP8 tensor-core matrix-multiply is the sole computational primitive on which double-precision scientific computing needs to be built. Every canonical kernel -- dense and sparse linear algebra, spectral transforms, stencils -- and every application composing them reduces, via the Chinese Remainder Theorem-based Ozaki Scheme II, to sequences of FP8 matrix operations; the only non-FP8 arithmetic is a bounded, fixed-width integer accumulation at reconstruction. Native FP64 is thereby demoted from a hardware requirement to a derived accuracy guarantee obtained by composition over the FP8 primitive. We organize the claim as a five-layer hierarchy -- the FP8 op, Ozaki II, the basic kernels or Berkeley "dwarfs", composite solvers, and full applications -- and, because the dwarf taxonomy already spans scientific computing, establish it by exhibiting the reduction for every dwarf rather than a sample. The claim is falsifiable, and we build the instrument that tests it: a Tensor-Memory Equilibrium (TME) model extending the Roofline with emulation parameters (alpha, beta, gamma). We identify register-level fusion as the mechanism that keeps emulation memory-bound, project recovered FP64 performance across B300 and Rubin against an H100 baseline, and close the kernel coverage with a companion FFT analysis and compensated reductions. The model could have returned a negative verdict; instead it passes across the dwarfs and their compositions. This is the analytical half of a two-part program, with a follow-on implementation to validate the thesis on real silicon.

2606.06563 2026-06-16 cs.SE cs.AI 版本更新

AI-Driven Test Case Generation from Natural Language Requirements: A Survey of Techniques and Research Gaps

AI驱动的自然语言需求测试用例生成:技术与研究空白综述

Orimoloye Folorunsho, Hassan Reza

发表机构 * School of Electrical Engineering and Computer Science(电气工程与计算机科学学院)

AI总结 综述AI、NLP和LLM从自然语言需求生成测试用例的技术,指出当前方法无法同时满足自动化、歧义处理等六个质量维度,提出四个研究方向。

Comments 22 pages, 7 figures, 4 tables

详情
AI中文摘要

软件测试对于验证系统是否满足指定需求至关重要,但仍是开发中最耗时和最昂贵的活动之一。基于需求的测试生成允许从需求工件早期导出测试用例,但由于固有的歧义和不精确性,直接从自然语言生成测试用例具有挑战性。人工智能、自然语言处理(NLP)和大语言模型(LLM)的最新进展使得自动化这一流程越来越可行,同时也引入了新的风险,包括幻觉、可追溯性降低和不一致的评估。本综述探讨了四个研究问题:提出了哪些AI和NLP技术用于从自然语言需求生成测试用例;哪些工具和框架支持这些方法;如何评估生成的测试用例;以及存在哪些研究空白。遵循Kitchenham和Charters的系统综述指南,我们搜索了2000-2025年的主要学术数据库,并在应用严格纳入标准后,确定了21项主要研究。文献被组织为三个进化时代,揭示出没有现有方法能同时满足六个关键质量维度:自动化、歧义处理、领域适用性、可追溯性、评估彻底性和幻觉控制。本综述做出了三个主要贡献:基于AI的测试生成的三时代进化综合;六标准差距分析,显示当前没有方法完全满足所有质量维度;以及针对幻觉、可追溯性、复杂性敏感性和合规性的四个可操作研究指南。

英文摘要

Software testing is critical for verifying that systems meet specified requirements, yet remains among the most time-consuming and expensive activities in development. Requirements-based test generation allows test cases to be derived early from requirements artifacts, but generating them directly from natural language is challenging due to inherent ambiguity and imprecision. Recent advances in AI, natural language processing (NLP), and large language models (LLMs) have made automating this pipeline increasingly feasible, while introducing new risks including hallucination, reduced traceability, and inconsistent evaluation. This survey addresses four research questions: what AI and NLP techniques have been proposed for generating test cases from natural language requirements; what tools and frameworks support these approaches; how generated test cases are evaluated; and what research gaps remain. Following Kitchenham and Charters' systematic review guidelines, we searched major scholarly databases spanning 2000-2025 and, after applying strict inclusion criteria, identified 21 primary studies. The literature is organized into three evolutionary eras, revealing that no existing approach simultaneously satisfies six key quality dimensions: automation, ambiguity handling, domain applicability, traceability, evaluation thoroughness, and hallucination control. The survey makes three main contributions: a three-era evolutionary synthesis of AI-based test generation; a six-criteria gap analysis showing no current approach fully addresses all quality dimensions; and four actionable research guidelines targeting hallucination, traceability, complexity sensitivity, and compliance.

2606.08270 2026-06-16 cs.CR cs.AI cs.ET 版本更新

An AI Security Agent for University ACMIS: Multi-Vector Threat Detection and Automated Response

面向大学教务管理信息系统的AI安全代理:多向量威胁检测与自动响应

Joseph Walusimbi, Joshua Benjamin Ssentongo

发表机构 * University ACMIS(ACMIS大学)

AI总结 提出一种结合监督异常检测、行为分析和NLP聊天机器人的AI安全代理,针对ACMIS的五个操作层进行监控,并通过四级风险升级框架实现自动响应,在模拟数据集上达到0.91的F1分数。

Comments 6 pages, 1 figure, 5 tables,

详情
AI中文摘要

大学教务管理信息系统(ACMIS)是多种安全威胁的高价值目标,包括暴力破解登录攻击、支付欺诈、权限提升、内部数据窃取和学术诚信违规。传统的基于规则的入侵检测系统不足以应对,因为许多恶意活动在结构上与正常操作无法区分。本文提出了一种基于AI的ACMIS安全代理,结合了监督异常检测、行为分析以及用于安全密码恢复的自然语言处理聊天机器人。该代理监控五个操作层:认证、授权、金融交易、用户行为和系统健康,并通过四级风险升级框架进行响应。模块化架构允许核心引擎扩展到其他机构系统。在模拟ACMIS事件日志数据集上的实验表明,威胁检测宏平均F1为0.91,而基于规则的基线为0.49,关键级别的自动响应延迟在95百分位下低于300毫秒。

英文摘要

University Academic Management Information Systems (ACMIS) are high-value targets for a wide spectrum of security threats including brute-force login attacks, payment fraud, privilege escalation, insider data theft, and academic integrity violations. Traditional rule-based intrusion detection systems are inadequate because many malicious activities are structurally indistinguishable from normal operations. This paper presents an AI-based security agent for ACMIS that combines supervised anomaly detection, behavioural analytics, and a natural language processing chatbot for secure password recovery. The agent monitors five operational layers: authentication, authorisation, financial transactions, user behaviour, and system health, and responds through a four-tier risk escalation framework. A modular architecture allows the core engine to be extended to other institutional systems. Experiments on a simulated ACMIS event log dataset of 147,922 sessions demonstrate a threat detection macro-average F1 of 0.966, compared to 0.156 for a rule-based baseline and 0.836 for a sequence-only (LSTM) baseline, with end-to-end critical-tier automated response latency under 1 ms on a single-node prototype. The integrated recovery chatbot achieves 97.1 percent identity verification accuracy and an 87.3 percent mass-reset attack detection rate with zero false positives on legitimate high volume recovery periods.

2606.12688 2026-06-16 cs.LG cs.AI cs.DC 版本更新

M*: A Modular, Extensible, Serving System for Multimodal Models

M*: 一个模块化、可扩展的多模态模型服务系统

Atindra Jha, Naomi Sagan, Keisuke Kamahori, Irmak Sivgin, Rohan Sanda, Steven Gao, Mark Horowitz, Luke Zettlemoyer, Olivia Hsu, Jure Leskovec, Baris Kasikci, Stephanie Wang

发表机构 * Stanford University(斯坦福大学) University of Washington(华盛顿大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出M*系统,通过将模型表示为数据流图并引入Walk Graph抽象,支持多模态复合模型的高效服务,在多个任务上降低延迟并提升吞吐量。

Comments The codebase is available at https://github.com/mstar-project/mstar

详情
AI中文摘要

我们正在进入一个复合模型架构的新时代,这些架构集成了多种组件,如视觉编码器、语言骨干网络、扩散和流头、音频编解码器、动作生成器和世界模型预测器。这种架构支撑了广泛的多模态模型类别,包括统一多模态模型、全能模型、语音-语言模型、视觉-语言-动作策略和世界模型。然而,现有的模型服务框架基于对模型结构的狭隘假设,难以适应这种新的架构多样性。在此,我们提出M*,一个用于高效服务复合AI模型的通用服务系统。M*将模型表示为数据流图,将跨越多种模态和任务的请求处理视为对这些图的遍历。核心洞察是一种模块化抽象,支持模型组件的任意组合、在物理集群上的灵活放置以及分布式运行时中的模型无关优化。我们将这种抽象称为Walk Graph,并展示它如何简洁地捕获来自广泛家族的复合模型。我们在代表性模型上实例化M*,发现与vLLM-Omni相比,在BAGEL上的文本到图像工作负载中,端到端延迟平均降低20%,同时在Qwen3-Omni上的文本到语音工作负载中,实时因子降低高达2.9倍,吞吐量提升高达2.7倍。M*在机器人规划任务上也比V-JEPA 2-AC rollout基线性能提升高达12.5倍。因此,我们的工作为以最小开发工作量高效服务复杂模型铺平了道路。

英文摘要

We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.

11. 其他/综合AI 53 篇

2606.15078 2026-06-16 cs.AI cs.GT physics.soc-ph 新提交

Cognitive Debt: AI as Intellectual Leverage and the Dynamics of Systemic Fragility

认知债务:作为智力杠杆的AI与系统性脆弱性的动态机制

Shuchen Meng

发表机构 * New York University(纽约大学)

AI总结 本文提出认知债务的形式化理论,通过建立包含认知资本和认知债务的状态变量模型,证明理性代理人会积累认知债务,并导致认知明斯基时刻和系统性脆弱性。

Comments 46 pages, 3 figures. Preliminary version; comments welcome

详情
AI中文摘要

我们发展了一个认知债务的形式化理论:当个体将AI用作第一性原理认知的替代品而非补充品时,积累的未经验证的推理义务存量。模型每个代理人有两个状态变量:认知资本和认知债务,以及一个乘数型生产技术,其中认知资本作为抵押品决定AI采用的回报。我们建立了六个命题。理性代理人会承担正的认知债务,因为成本是递延的、部分外部的,并且被短期生产率增长所掩盖。平静时期降低了主观风险评估,提高了AI替代强度,并放大了杠杆,产生了一个认知明斯基时刻,其中主观风险下降而真正的系统性脆弱性上升。预期危机损失是总杠杆的凸函数。危机后,产出目标压力可能产生一个虚假修正循环,其中代理人用更多AI修补AI失败。由于系统性风险、认知公共品和军备竞赛外部性,分散均衡相对于社会最优过度采用了替代性AI。在一个两类型异质代理人经济中,高认知资本代理人更密集地采用AI,并可能最终侵蚀其无辅助认知资本至低于初始低技能代理人的水平。

英文摘要

We develop a formal theory of cognitive debt: the stock of unverified reasoning obligations that accumulates when individuals use AI as a substitute rather than a complement for first-principles cognition. The model features two state variables per agent, cognitive capital and cognitive debt, and a multiplicative production technology in which cognitive capital functions as collateral that determines the return to AI adoption. We establish six propositions. Rational agents incur positive cognitive debt because the costs are deferred, partially external, and masked by short-run productivity gains. Tranquil periods lower subjective risk assessments, raise AI substitution intensity, and compound leverage, generating a cognitive Minsky moment in which subjective risk falls while true systemic fragility rises. Expected crisis losses are convex in aggregate leverage. Post-crisis, output-target pressure can produce a false-correction loop in which agents patch AI failures with more AI. The decentralised equilibrium over-adopts substitutive AI relative to the social optimum because of systemic risk, cognitive public goods, and arms-race externalities. In a two-type heterogeneous-agent economy, high-cognitive-capital agents adopt AI more intensively and may eventually erode their unaided cognitive capital below that of initially lower-skilled agents.

2606.16084 2026-06-16 cs.AI cs.CL 新提交

Rhythm of the Deep: A Computational-Linguistic Test of Duality of Patterning in Sperm Whale Codas

深海的韵律:抹香鲸叫声中双重模式的计算语言学检验

Mudit Sinha, Sanika Chavan

发表机构 * Independent Researchers(独立研究员)

AI总结 使用1483个抹香鲸叫声,通过计算语言学方法检验其是否具有双重模式结构,发现下层由点击节奏构成,上层显示序列依赖,下层为节奏型而非分段型。

Comments 22 pages, 2 figures, 4 tables. Preprint

详情
AI中文摘要

人类语言常被描述为在两个层次上结合结构:低层单元组合成更大的单元,然后这些单元再组合成更大的序列。我们使用多米尼加抹香鲸项目的1483个叫声,测试抹香鲸叫声中是否具有这种设计特征——双重模式。由于声学相似性可以模仿符号结构,我们将问题视为从连续音频中进行计算语言学结构发现,而不是直接关于语言或意义的断言。我们使用冻结音频编码器的共识、保留的结构测试、每统计量零假设和声学零假设可恢复性门控。证据支持一个狭窄的两层架构。在低层,点击组合成叫声不是通过稳定的有序规则,而是通过哪些点击存在以及它们之间的点击间节奏。在高层,叫声令牌显示回合级序列依赖,NSB二阶转移熵提升0.132比特(p = 0.002)。在节奏缩放下,编码器派生的点击身份强烈受速率限制,而叫声身份保持更稳定,在点击到叫声步骤中产生可测量的抽象梯度。仅节奏基线恢复了大量低层结构,但未能重现上层序列依赖信号。我们不声称语言、语义、感知或类似人类的音素。相反,我们报告了表示级别的证据,表明存在一种类似双重模式的架构,其低层是节奏型而非分段型,并提供了一个可移植的零假设控制框架,用于测试诱导声学令牌系统中的组合结构。

英文摘要

Human language has often been described as combining structure at two levels: lower-level units combine into larger units, which then combine into larger sequences. We test for this design feature, duality of patterning, in sperm whale codas using 1,483 codas from the Dominica Sperm Whale Project. Because acoustic similarity can imitate symbolic structure, we treat the problem as computational-linguistic structure discovery from continuous audio rather than as a direct claim about language or meaning. We use a consensus of frozen audio encoders, held-out structural tests, per-statistic nulls, and acoustic-null recoverability gates. The evidence supports a narrow two-tier architecture. At the lower tier, clicks compose into codas not by a stable ordered rule, but by which clicks are present together with their inter-click rhythm. At the upper tier, coda tokens show bout-level sequential dependence, with an NSB second-order transfer-entropy lift of 0.132 bits (p = 0.002). Under tempo scaling, encoder-derived click identity is strongly rate-bound, while coda identity remains substantially more stable, yielding a measurable abstraction gradient across the click-to-coda step. Rhythm-only baselines recover substantial lower-tier structure but fail to reproduce the upper-tier sequential-dependence signal. We do not claim language, semantics, perception, or human-like phonemes. Instead, we report representation-level evidence for a duality-of-patterning-like architecture whose lower tier is rhythmic rather than segmental, and provide a portable null-controlled framework for testing combinatorial structure in induced acoustic token systems.

2606.14718 2026-06-16 cs.CY cs.AI 交叉投稿

Gender Differences in AI Literacy Workshop Outcomes and Deepfake Engagement

AI素养工作坊成果与深度伪造参与中的性别差异

Jake Renzella, Christian Bergh, Natasha Banks, Alexandra Vassar

发表机构 * University of New South Wales(新南威尔士大学)

AI总结 本研究通过统计回归分析澳大利亚中学生AI素养工作坊前后数据,发现男性在STEM职业兴趣上显著更高,女性更常使用AI工具,且工作坊后女性在AI知识和职业兴趣上提升更大,部分缩小了性别差距。

详情
AI中文摘要

随着人工智能(AI)素养倡议在K-12教育中的扩展,理解性别如何影响学生的基础认知、工具使用以及对干预措施的反应,对于公平的课程设计至关重要。本研究考察了来自两所男女同校公立学校的澳大利亚中学生(7、8和10年级;前测N=199,后测N=136)在参加为期一天的AI素养工作坊后,在AI素养、安全意识和STEM职业抱负方面的性别差异。使用控制年级和学校的统计回归方法,我们发现:工作坊前,男性学生在AI、计算机科学和工程三个领域的STEM职业兴趣均显著更高,而女性学生更可能将AI用于学业任务并向AI工具寻求建议。深度伪造行为中也出现了性别差异模式:男性更可能创建或分享深度伪造内容。干预后,男女学生的AI知识均有所提升,但女性表现出更丰富的进步:更广泛的概念理解、更高的自信心以及AI和计算机科学职业兴趣的显著增长,部分缩小了STEM性别差距。这些发现强调了开发性别响应型AI课程的必要性,特别是针对男性学生的深度伪造安全教育,并表明即使是单日工作坊也能缩小STEM抱负和AI信心方面的性别差距。

英文摘要

As Artificial Intelligence (AI) literacy initiatives expand in K-12 settings, understanding how gender shapes student baseline perceptions, tool-use, and responsiveness to interventions is essential for equitable curriculum design. This study examines gender differences in AI literacy, safety awareness, and STEM career aspirations among Australian secondary students (Years 7, 8, and 10; N(pre) = 199, n(post) = 136) from two co-educational government schools who participated in a one-day AI literacy workshop. Using statistical regression methods controlling for year level and school, we found that pre-workshop, male students reported significantly higher STEM career interest across all three domains (AI, computer science, and engineering), while female students were significantly more likely to use AI for schoolwork and to seek advice from AI tools. Gender-differentiated patterns also emerged in deepfake behaviours: males were significantly more likely to have created or shared deepfake content. Both genders improved in AI knowledge post-intervention, yet females showed a richer profile of gains: wider conceptual understanding, greater confidence, and meaningful increases in AI and CS career interest that partially narrowed the gender STEM gap. These findings highlight the need for gender-responsive AI curricula, particularly deepfake safety education for male students, and demonstrate that even single-day workshops can narrow gender gaps in STEM aspirations and AI confidence.

2606.14742 2026-06-16 q-bio.NC cs.AI cs.HC 交叉投稿

Do Large Language Models Have Emotions?

大型语言模型有情感吗?

Amit Goldenberg, James J. Gross

发表机构 * Harvard Business School(哈佛商学院) Department of Psychology, Harvard University(哈佛大学心理学系) Harvard University, Digital, Data and Design Institute(哈佛大学数字、数据与设计研究所) Department of Psychology, Stanford University(斯坦福大学心理学系)

AI总结 本文评估Anthropic声称Claude Sonnet 4.5具有“功能性情感”的说法,从生物情感功能角度分析,指出其部分支持情境解释功能,但缺乏动态重组能力。

详情
AI中文摘要

大型语言模型有情感吗?Anthropic最近的一篇论文报告在Claude Sonnet 4.5中发现了情感概念的内部表征,并得出结论认为该LLM具有“功能性情感”。我们根据已知的生物系统中情感实际运作方式评估了这一说法。我们认为情感具有两个核心功能:对情境进行情境敏感的解释,以及根据这些解释跨多个系统重组处理过程。Anthropic的发现为第一个功能提供了部分支持,尽管在Claude中识别出的持续、离散的情感表征与情感神经科学的发现(即人类情感以可变而非统一的神经特征为特征)不太吻合。关于第二个功能,证据不一:Claude的表征调节输出,但没有产生定义生物系统情感的注意力、决策速度和动机状态的动态重组。最后,我们提出了LLM要拥有情感所需的条件。

英文摘要

Do LLMs have emotions? A recent paper from Anthropic reports finding internal representations of emotion concepts in Claude Sonnet 4.5, concluding that the LLM has 'functional emotions.' We evaluate this claim against what is known about how emotions actually function in biological systems. We argue that emotions serve two core functions: the context-sensitive interpretation of situations, and the reorganization of processing across multiple systems in response to those interpretations. The Anthropic findings offer partial support for the first function, though the consistent, discrete emotional representations identified in Claude sit uneasily with affective neuroscience findings that human emotion is characterized by variable rather than uniform neural signatures. On the second function, the evidence is mixed: Claude's representations modulate output without producing the dynamic reorganization of attention, decision speed, and motivational state that defines emotion in biological systems. We close by proposing what it would take for an LLM to have emotions.

2606.14769 2026-06-16 econ.EM cs.AI cs.GT 交叉投稿

Agentomics: Economic Foundations for the Valuation, Attribution, and Pricing of AI Agents in Human-AI Workflows

Agentomics:人机协作工作流中AI代理的估值、归因和定价的经济基础

Quanyan Zhu

发表机构 * Department of Electrical and Computer Engineering, NYU Tandon School of Engineering(纽约大学Tandon工程学院电气与计算机工程系)

AI总结 提出Agentomics框架,基于工作流模型将AI部署视为联盟形成问题,使用Shapley值进行经济盈余归因,实现AI代理的估值、归因和定价。

详情
AI中文摘要

代理型AI系统越来越多地被部署为组织工作流中的生产资源,然而现有的评估方法主要衡量孤立的技术性能而非经济贡献。本文引入了\emph{Agentomics},一个基于工作流的框架,用于对人类和人工代理进行估值、归因和定价。该框架将工作流建模为异构代理的配置,其集体绩效决定了总价值、部署成本、可靠性和预期故障损失。工作流价值被视为团队层面的量,可能包括互补性、替代效应、瓶颈和非线性生产;可加的阶段级价值仅是一个特例。基于此工作流模型,本文将AI部署表述为一个联盟形成问题,并将联盟价值定义为相对于基准人类工作流所产生的增量净剩余。然后使用Shapley值在参与的AI代理之间分配经济盈余,从而在估值、问责和市场定价之间建立原则性联系。由此产生的Shapley定价均衡为评估代理价格是否反映预期边际贡献提供了规范基准。一个安全运营案例研究说明了该框架如何解释混合人机工作流中的生产力提升、部署成本、可靠性损失和联盟级互补性。

英文摘要

Agentic AI systems are increasingly being deployed as productive resources in organizational workflows, yet existing evaluation methods primarily measure isolated technical performance rather than economic contribution. This paper introduces \emph{Agentomics}, a workflow-based framework for valuing, attributing, and pricing human and artificial agents. The framework models a workflow as a configuration of heterogeneous agents whose collective performance determines gross value, deployment cost, reliability, and expected failure loss. Workflow value is treated as a team-level quantity that may include complementarities, substitution effects, bottlenecks, and nonlinear production; additive stage-level value is only a special case. Building on this workflow model, the paper formulates AI deployment as a coalition-formation problem and defines coalition value as the incremental net surplus generated relative to a benchmark human workflow. The Shapley value is then used to attribute economic surplus among participating AI agents, yielding a principled connection among valuation, accountability, and market pricing. The resulting Shapley pricing equilibrium provides a normative benchmark for assessing whether agent prices reflect expected marginal contribution. A security-operations case study illustrates how the framework accounts for productivity gains, deployment costs, reliability losses, and coalition-level complementarities in hybrid human--AI workflows.

2606.15348 2026-06-16 q-bio.NC cs.AI 交叉投稿

Intrinsic Computational Functionalism and Simulated Consciousness

内在计算功能主义与模拟意识

Ryota Kanai, Shuqin Ma

发表机构 * Araya Inc.(Araya公司) School of Philosophy, Fudan University(复旦大学哲学学院) Sussex Centre for Consciousness Science, University of Sussex(Sussex大学意识科学中心)

AI总结 本文从内在计算功能主义出发,提出机制丰富的规范结构,论证若意识是计算构成的,则任何满足内在因果-计算实现关系的系统(生物、人工或模拟)都实现相同的意识相关属性。

详情
AI中文摘要

对人工或模拟意识的一个常见反对意见是,模拟的大脑并不比模拟的水更湿。我们从内在计算功能主义(ICF)的角度来回应:如果意识是由计算构成的,那么它不依赖于外部强加的描述,而是依赖于系统凭借其自身的因果-动力学组织所物理实现的计算结构。在之前的工作中,我们将规范功能主义发展为此反解释主义纲领的一个数学精确的特例,通过固定接口下的完整未来输入-输出角色来识别功能状态。这里我们论证,这种输入-输出构造虽然重要,但并不完整:作为ICF的一个行为边界情况,它使得查找表和展开的系统在规范上等价,只要它们保持相同的边界行为。一个与意识相关的规范表示必须转而包含属于相关内在组织的内部机制、干预和联合读出。因此,我们定义了一个机制丰富的规范结构,并用它来制定内在因果-计算实现(ICCR),这是一种保持物理实现、内在状态个体化、转移结构、干预轮廓以及相关主体-身体-世界边界的实现关系。核心结果是条件性的:如果意识属性是内在因果-计算组织的不变量,那么任何满足ICCR的系统都实现相同的意识相关属性,无论是生物的、人工的还是模拟的。我们讨论了包括生物自然主义和整合信息理论在内的反对意见。我们得出结论,要否认模拟具有意识,必须识别出模拟未能实现的与意识相关的内在因果-计算结构。

英文摘要

A common objection to artificial or simulated consciousness is that a simulated brain is no more conscious than simulated water is wet. We address this from the perspective of Intrinsic Computational Functionalism (ICF): if consciousness is computationally constituted, it depends not on externally imposed descriptions but on the computational structures a system physically realizes in virtue of its own causal-dynamical organization. In previous work we developed Canonical Functionalism as a mathematically precise special case of this anti-interpretivist program, identifying functional states by their complete future input-output roles under a fixed interface. Here we argue that this input-output construction, though important, is incomplete: as a behavioral boundary case of ICF, it makes lookup tables and unfolded systems that preserve the same boundary behavior canonically equivalent. A consciousness-relevant canonical representation must instead include internal mechanisms, interventions, and joint readouts belonging to the relevant intrinsic organization. We therefore define a mechanism-enriched canonical structure and use it to formulate Intrinsic Causal-Computational Realization (ICCR), a realization relation preserving physical implementation, intrinsic state individuation, transition structure, intervention profiles, and the relevant agent-body-world boundary. The central result is conditional: if conscious properties are invariants of intrinsic causal-computational organization, then any system satisfying ICCR realizes the same consciousness-relevant properties, whether biological, artificial, or simulated. We discuss objections including biological naturalism and integrated information theory. We conclude that to deny consciousness to a simulation, one must identify a consciousness-relevant intrinsic causal-computational structure that the simulation fails to realize.

2606.15358 2026-06-16 cs.HC cs.AI 交叉投稿

Cognitive Trajectory Modeling: Quantifying Human-AI Co-Creation through Cognitively Grounded Interaction Trajectories

认知轨迹建模:通过基于认知的交互轨迹量化人机共创

Nicholas Davis

发表机构 * Co-Creative AI Consulting(协同AI咨询)

AI总结 提出认知轨迹建模(CTM)理论,通过认知轨迹和吸引子景观量化人机共创中的交互动态,区分认知轨迹与交互痕迹,为研究共创AI和人类-AI交互提供框架。

详情
AI中文摘要

共创AI研究日益寻求能够表征交互动态随时间演变的方法。虽然许多现有方法关注可观察的交互特征、交互度量、行为编码方案或活动痕迹,但这些方法往往难以捕捉高阶交互动态,包括协作过程如何随时间重组、稳定、调节和演变。本文引入认知轨迹建模(CTM)作为交互动态的认知理论,将认知、交互和创造过程概念化为在具有认知意义的吸引子景观中展开的时间组织轨迹。CTM建立在创造力生成模型和创造性意义建构(CSM)的理论基础上,重新审视意义建构曲线和认知轨迹在表征共创交互动态中的作用。我们通过认知轨迹原理形式化这一视角,该原理指出,只有当时间表示的基础状态具有方向性认知意义时,它们才在理论上可解释为认知轨迹。基于此原理,CTM将认知轨迹的概念推广到任何特定编码方案之外,并提供了一个更广泛的框架,用于通过在有意义的吸引子景观中展开的轨迹来建模交互动态。我们进一步区分认知轨迹与交互痕迹,并将CTM置于更广泛的认知、交互和领域动态层次结构中。更广泛地说,我们认为理解共创系统需要能够建模认知和交互动态随时间演变的方法。CTM为研究共创AI和人机交互中的交互动态提供了基础。

英文摘要

Co-creative AI research increasingly seeks methods capable of representing how interaction dynamics evolve through time. While many existing approaches focus on observable interaction characteristics, interaction metrics, behavioral coding schemes, or activity traces, these methods often struggle to capture higher-order interaction dynamics, including how collaborative processes reorganize, stabilize, regulate, and evolve through time. This paper introduces Cognitive Trajectory Modeling (CTM) as a cognitive theory of interaction dynamics that conceptualizes cognition, interaction, and creative processes as temporally organized trajectories unfolding across cognitively meaningful attractor landscapes. CTM builds upon the theoretical foundations of the Enactive Model of Creativity and Creative Sense-Making (CSM), revisiting the role of sense-making curves and cognitive trajectories in representing co-creative interaction dynamics. We formalize this perspective through the Cognitive Trajectory Principle, which states that temporal representations are only theoretically interpretable as cognitive trajectories when their underlying states possess directional cognitive meaning. Building on this principle, CTM generalizes the notion of cognitive trajectories beyond any particular coding scheme and provides a broader framework for modeling interaction dynamics through trajectories unfolding across meaningful attractor landscapes. We further distinguish cognitive trajectories from interaction traces and situate CTM within a broader hierarchy of cognitive, interaction, and domain dynamics. More broadly, we argue that understanding co-creative systems requires methods capable of modeling how cognition and interaction dynamics unfold through time. CTM provides a foundation for studying interaction dynamics across co-creative AI and human-AI interaction.

2606.15535 2026-06-16 cs.PF cs.AI 交叉投稿

MADAR: An Address-Free Processor

MADAR:一种无地址处理器

Mohamed Amine Bergach

发表机构 * Illumina San Diego California USA(Illumina圣地亚哥加州美国)

AI总结 提出无地址处理器MADAR,通过环形槽结构消除传统寻址机制,实现编译时调度的数据流计算,在AI加速中能效随归约规模增长保持恒定。

详情
AI中文摘要

在现代处理器中,计算是廉价的部分。大部分面积和能量消耗在“寻址”上——将操作数移入和移出寄存器文件和缓存,并运行标签、端口、缺失队列和旁路网络,以找到值被存放的位置。MADAR通过废除地址来消除这些机制。所有状态在环形槽中循环,每个时钟周期前进一个位置;指令和数据位于相同的槽中;一个值通过其在轨道中的位置——一个坐标——来命名,而不是通过地址;固定站点在编译时设定的调度上,当循环指令经过其操作数时进行计算;一系列周期递增的环形层级取代了缓存层级,它们之间的移动由调度触发而非缺失触发。没有先前的循环存储、数据流或静态调度机器同时具备这四个特点。我们定义了执行模型,在周期精确的寄存器传输级实现中验证了它,展示了它是可编译的——一个构造性调度器发出程序并与实现交叉检查——并用一阶能量模型评估了其代价。其收益在AI加速中最为明显:每个矩阵乘法和卷积核心的乘累加操作编译成流式形式,其每操作能量随归约规模增长保持恒定,而矩阵乘法高效所需的操作数重用由环形周期层级承载——内存层级通过旋转实现缓存通过标签完成的功能。MADAR为任何在程序运行前已知数据移动的计算提供了一个新的设计点。

英文摘要

In a modern processor, computing is the cheap part. Most of its area and energy go to \emph{addressing} -- moving operands to and from a register file and cache, and running the tags, ports, miss queues, and bypass networks that find a value where it was left. MADAR deletes that machinery by abolishing the address. All state circulates in rings of slots that advance one position per clock; instructions and data ride in the same slots; a value is named by its place in an orbit -- a \rp{} coordinate -- not by an address; a fixed station computes when a circulating instruction sweeps past its operands, on a schedule set at compile time; and a hierarchy of rings of increasing period replaces the cache hierarchy, movement between them scheduled rather than triggered by a miss. No prior circulating-store, dataflow, or statically scheduled machine combines all four of these. We define the execution model, validate it in a cycle-accurate register-transfer-level implementation, show it \emph{compilable} -- a constructive scheduler emits programs cross-checked against the implementation -- and price it with a first-order energy model. The payoff is clearest for AI acceleration: the multiply-accumulate at the heart of every matmul and convolution compiles to a streaming form whose energy per operation stays flat as the reduction grows, and the operand reuse that makes matrix multiplication efficient is carried by the ring-period hierarchy -- the memory hierarchy doing by rotation what a cache does by tags. MADAR is a new design point for any computation whose data movement is known before the program runs.

2606.15712 2026-06-16 cs.CR cs.AI cs.MA 交叉投稿

Odds Law: The Decomposition Algebra On How Intelligence Organizes Itself to Solve Difficult Problems Reliably

Odds Law: 智能如何组织自身以可靠解决难题的分解代数

Hidayet Aksu

发表机构 * GitHub

AI总结 本文提出分解代数,研究不可靠基本求解器如何组合成可靠复合求解器,证明验证几率定律、可靠性放大定理和阈值二分法,并揭示自组织是单调改进算子的最小不动点。

Comments 10 pages, 2 figures

详情
AI中文摘要

我们提出一个结构性问题:给定不可靠的基本问题求解器,它们的何种组织能够可靠地解决难题,其极限是什么?我们发展了一个分解代数:基本求解器是随机范畴中的态射,四个组合子(顺序组合、并行集成、验证门控和递归约简)生成复合求解器的空间。我们为该代数配备了两个同态:一个可靠性估值(取值于有序幺半群$([0,1],\le)$)和一个成本估值(取值于交换半环),并推导出控制可靠性如何通过结构流动的组合法则。我们的核心结果是:(i) 验证几率定律(本文命名的结果),表明验证门将正确几率乘以验证者的似然比$Λ$,因此$k$个条件独立的门产生几何放大;(ii) 可靠性放大定理,给出当$Λ>1$时,在$O(\log 1/δ)$的验证深度下达到目标可靠性$1-δ$;(iii) 阈值二分法:在临界参数之上,可靠性可以以对数成本任意接近1,而在临界参数或以下则无法放大。然后我们证明,自组织是策略完全格上单调改进算子的最小不动点,并且该不动点使单位成本的边际对数几率增益相等。最后,我们证明匹配的极限:信息上限将每门放大限制为一个散度量;共享错误原因产生严格正投票下限,因此多样性是无限放大的必要条件。简而言之,可靠性既不是免费的也不是神奇的:它是由独立信息购买、通过组合安排,并由验证者限制的。

英文摘要

We ask a structural question: given unreliable elementary problem-solvers, what organizations of them solve hard problems reliably, and what are the limits? We develop a $decomposition~algebra$: elementary solvers are morphisms in a stochastic category, and four combinators (sequential composition, parallel ensembling, verification gating, and recursive reduction) generate the space of compound solvers. We equip this algebra with two homomorphisms, a $reliability$ valuation into the ordered monoid $([0,1],\le)$ and a $cost$ valuation into a commutative semiring, and we derive the composition laws that govern how reliability flows through structure. Our central results are (i) a $verification~odds~law$ (the result that names this report), showing that a verification gate multiplies the odds of correctness by the verifier's likelihood ratio $Λ$, so that $k$ conditionally independent gates yield geometric amplification; (ii) a $reliability~amplification~theorem$, giving target reliability $1-δ$ at $O(\log 1/δ)$ verification depth whenever $Λ>1$; and (iii) a $threshold~dichotomy$: above the critical parameters reliability can be driven arbitrarily close to one at logarithmic cost, while at or below them no amplification is possible. We then show that $self-organization$ is the least fixed point of a monotone improvement operator on the complete lattice of strategies, and that this fixed point equalizes marginal log-odds gain per unit cost. Finally, we prove matching limits: an information ceiling bounds per-gate amplification by a divergence quantity; shared error causes create a strictly positive voting floor, so diversity is $necessary$ for unbounded amplification. Reliability, in short, is neither free nor magical: it is bought with independent information, arranged by composition, and bounded by the verifier.

2606.16989 2026-06-16 cs.GT cs.AI cs.CY 交叉投稿

Stable Menus of Public Goods: AI-Enabled Progress

公共物品的稳定菜单:AI驱动的进展

Sara Fish

发表机构 * School of Engineering and Applied Sciences, Harvard University(哈佛大学工程与应用科学学院)

AI总结 以EC 2025论文的开放问题为测试平台,实验评估AI辅助研究流程的有效性,发现提供人类直觉提示和自动多轮交互有助于提升LLM表现,但LLM略逊于一年级博士生。

Comments Accepted to the EC'26 Workshop on AI-Driven Research in EconCS

详情
AI中文摘要

以EC 2025论文“公共物品的稳定菜单”中的一个开放问题作为测试平台,我们进行实验以理解不同AI-for-EconCS研究流程的有效性。具体来说,我们研究了三个问题:在提示中提供人类直觉是否有帮助?自动多轮交互是否有帮助?以及,LLM是否优于一年级博士生?关于前两个问题,我们为以下流程建议提供了证据:(1)用人类直觉提示可以鼓励LLM拥有更好的“品味”,(2)当流程鼓励“雄心勃勃”的步骤时,多轮工作流程有帮助。关于第三个问题,使用该论文资深作者在与一年级博士生合作之前撰写的一份未发表手稿,我们比较了LLM与一年级博士生的有效性,发现LLM的效果略差。

英文摘要

Using an open problem from the EC 2025 paper "Stable Menus of Public Goods" as a testbed, we conduct experiments to understand the effectiveness of different AI-for-EconCS research workflows. Specifically, we study three questions: Does providing human intuition in the prompt help? Does automated multi-turn interaction help? And, does an LLM outperform a first-year PhD student? Regarding the first two questions, we provide evidence for the following workflow suggestions: (1) prompting with human intuition can encourage the LLM to have better "taste", (2) multi-turn workflows help when the pipeline encourages "ambitious" steps. Regarding the third question, using an unpublished manuscript written by the paper's senior authors prior to collaborating with the first-year PhD student, we compare the effectiveness of the LLM with that of the first-year PhD student, and find that the LLM is slightly less effective.

2604.11364 2026-06-16 cs.AI 版本更新

The Missing Knowledge Layer in Cognitive Architectures for AI Agents

AI智能体认知架构中缺失的知识层

Michaël Roynard

发表机构 * Independent Researcher(独立研究者)

AI总结 针对现有AI智能体认知架构(如CoALA和JEPA)缺乏独立知识层的问题,提出四层分解架构(知识、记忆、智慧、智能),各层具有不同的持久性语义,并通过Python和Rust实现验证了架构分离的可行性。

详情
AI中文摘要

两个最具影响力的AI智能体认知架构框架CoALA [21]和JEPA [12]都缺乏具有自身持久性语义的显式知识层。这一空白导致了一个范畴错误:系统对事实性主张应用认知衰减,或者用相同的更新机制处理事实和经验。我们调查了现有记忆系统中的持久性语义,并识别出八个汇聚点,从Karpathy的LLM知识库[10]到BEAM基准测试中接近零的矛盾解决分数[22],都指向相关的架构空白。我们提出了一个四层分解(知识、记忆、智慧、智能),其中每一层具有根本不同的持久性语义:分别是无限替代、艾宾浩斯衰减、证据门控修正和短暂推理。用Python和Rust实现的配套程序证明了架构分离的可行性。我们借用认知科学的术语作为有用的类比(知识/记忆区分呼应了Tulving的三分法),但我们的层是由持久性语义需求证明合理的工程构造,而非神经架构。我们认为这些区分要求在工程实现中采用不同的持久性语义,并且当前没有任何框架或系统提供这一点。

英文摘要

The two most influential cognitive architecture frameworks for AI agents, CoALA [21] and JEPA [12], both lack an explicit Knowledge layer with its own persistence semantics. This gap produces a category error: systems apply cognitive decay to factual claims, or treat facts and experiences with identical update mechanics. We survey persistence semantics across existing memory systems and identify eight convergence points, from Karpathy's LLM Knowledge Base [10] to the BEAM benchmark's near-zero contradiction-resolution scores [22], all pointing to related architectural gaps. We propose a four-layer decom position (Knowledge, Memory, Wisdom, Intelligence) where each layer has fundamentally different persistence semantics: indefinite supersession, Ebbinghaus decay, evidence-gated revision, and ephemeral inference respectively. Companion implementations in Python and Rust demonstrate the architectural separation is feasible. We borrow terminology from cognitive science as a useful analogy (the Knowledge/Memory distinction echoes Tulving's trichotomy), but our layers are engineering constructs justified by persistence-semantics requirements, not by neural architecture. We argue that these distinctions demand distinct persistence semantics in engineering implementations, and that no current framework or system provides this.

2606.00288 2026-06-16 cs.AI 版本更新

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

模型原生计算架构:通过计算机架构的视角展望未来系统架构

Hai Lin, Hoilam Pao, Shaoxiong Zhan, Hai-Tao Zheng

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Pengcheng Laboratory, Shenzhen, China(深圳鹏城实验室)

AI总结 本文通过类比计算机架构,提出模型原生计算的概念,并设计了一个六层智能计算架构模型(ICAM),以统一解决大语言模型在系统层面面临的缓存、上下文、调度等问题。

详情
AI中文摘要

大语言模型正经历从模型技术到系统技术的转变。随着开发者使用 Codex、Claude Code、AutoGPT 及相关代理编写代码、管理项目和执行多步任务,缓存重用、上下文管理、代理调度和权限控制等反复出现的工程问题越来越类似于经典计算机系统问题。本文以愿景性综述的形式发展这一类比。我们将计算机架构的概念映射到新兴的模型原生堆栈,并回顾了关于 LLM-as-OS、内存管理、代理框架、工具协议、多代理协调、认知架构和安全治理的工作。我们认为这些工作涉及同一系统的不同层次,但缺乏统一模型。为填补这一空白,我们提出了智能计算架构模型(ICAM),这是一个用于模型原生计算的六层框架,具有明确的接口契约和设计公理。ICAM 通过双平面视图解决了关于 LLM 更像 CPU 还是操作系统的明显争议:一个关注可计算内容的概率执行平面,以及一个关注应计算内容的确定性控制平面。我们进一步引入了三条设计定律:用于 KV 缓存重用和推理加速的语义局部性定律、用于有限窗口和注意力衰减下有效工作集的上下文预算定律,以及用于多代理协作中收益递减的代理加速定律。我们根据已发表的系统级数据验证了这些定律,并将其与近期关于代理软件实践的证据联系起来。最后,我们指出了类比失效的地方,并概述了模型原生计算的研究路线图。这是一项概念性和综述性贡献,不报告新实验。

英文摘要

Large language models are undergoing a transition from model technology to system technology. Engineering challenges like cache reuse, context capacity, agent scheduling, and permission control resemble classical computer systems problems. This raises a question: if we treat the LLM as a CPU, KV cache as processor cache, context window as main memory, and agent framework as an operating system, can decades of computer architecture wisdom guide next generation model native systems? This paper pursues this analogy as a visionary survey. We map computer architecture concepts onto the emerging model native stack, survey literature across LLM as OS, memory management, agent frameworks, tool protocols, multi agent coordination, cognitive architectures, and safety governance, finding that each addresses a different layer without a unifying model. We propose the Intelligent Computing Architecture (ICA): six functional layers with interface contracts and design axioms. We resolve the tension over whether the LLM resembles a CPU or OS via a dual plane architecture a probabilistic execution plane (what can be computed) and a deterministic control plane (what should be computed), with every layer passing through as a graded crossover. We propose three Amdahl style design heuristics Semantic Locality, Context Budget, and Agent Speedup as organizing back of envelope models, illustrate their parameter ranges with published data, and identify predictive validation as the principal open task. We articulate analogy boundaries, note differences between silicon and model era architectures, and propose a research roadmap. This is a conceptual and survey contribution with no new experimental results.

2606.13441 2026-06-16 cs.AI cs.CL 版本更新

Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models

为什么采样不是选择:大语言模型中的意向性、能动性与道德责任

Joseph Keshet

发表机构 * Joseph Keshet(约瑟夫·凯舍特)

AI总结 本文论证大语言模型不具备道德责任所需的承诺性能动性,其输出源于概率映射而非内在意向性,随机采样不等于选择。

详情
AI中文摘要

近期大语言模型(LLMs)的进展引发了关于此类系统展现能动性或具备道德主体资格的讨论。本文认为这些归因是错误的。我们坚持道德责任需要基于内在意向性和自我归因行动的承诺性能动性,而这种能动性构成了与责任相关的自由意志形式。尽管LLMs生成连贯且可进行规范性评估的输出,其操作完全由从数据中学习到的概率输入-输出映射所刻画。它们表面的意向性是衍生的而非内在的,其输出既不被作为承诺拥有,也不受理由引导。随机采样引入的变异性并不等同于选择或作者身份。我们回应来自意向立场、功能主义、相容论以及模型输出中存在道德推理的反对意见,认为这些都不足以确立真正的能动性。

英文摘要

Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This paper argues that these attributions are misguided. We maintain that moral responsibility requires commitment-bearing agency grounded in intrinsic intentionality and self-attributed action, and that such agency constitutes the form of free will relevant to responsibility. Although LLMs generate coherent and normatively evaluable outputs, their operation is fully characterized by probabilistic input-output mappings learned from data. Their apparent intentionality is derived rather than intrinsic, and their outputs are neither owned as commitments nor guided by reasons. Variability introduced by stochastic sampling does not amount to choice or authorship. We address objections from the intentional stance, functionalism, compatibilism, and the presence of moral reasoning in model outputs, arguing that none suffice to establish genuine agency.

2606.13607 2026-06-16 cs.AI 版本更新

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning

推理即模式匹配:人类与LLM日常推理中的共享机制

Zach Studdiford, Gary Lupyan

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 研究通过比较人类和25个LLM在日常因果推理中的错误模式,发现两者均表现出模式匹配而非抽象世界模型驱动的推理,并识别出LLM中驱动响应的注意力头可预测人类推理错误。

Comments 13 pages main text, 51 pages supplementary text

详情
AI中文摘要

当大型语言模型(LLM)在推理中无法泛化或出现随意错误时,这通常被视为LLM并非真正推理,而是执行某种模式匹配的证据。其隐含意思是,人类行为不会表现出相同类型的失败,因为人类推理使用原则性的抽象世界模型。我们评估了人类参与者和25个LLM在各种日常情境中进行常识推理的能力,并在人和模型中观察到类似的错误模式。然后,我们识别出驱动LLM响应的注意力头集合,并发现这些头实现了模式匹配的形式。这些注意力头使我们能够预测由表面上无关的提示细节引起的人类看似无法解释的推理错误。综合来看,我们的结果表明,人和LLM在日常因果推理中更符合模式匹配的形式,而非抽象世界模型。

英文摘要

When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not truly reasoning, but rather performing a kind of pattern matching. The implication is that people's behavior does not exhibit the same types of failures because human reasoning uses principled and abstract world models. We evaluate human participants and 25 LLMs on their ability to engage in common-sense reasoning about a variety of everyday situations and observe similar patterns of errors in both people and models. We then identify the set of attention heads driving LLM responses and find that these heads implement a form of pattern-matching. These attention heads allow us to predict seemingly inexplicable reasoning errors in people caused by ostensibly irrelevant prompt details. Taken together, our results suggest that everyday causal reasoning in people and LLMs is more consistent with a form of pattern-matching than with abstract world models.

2511.14007 2026-06-16 cs.CY cs.AI 版本更新

Can Artificial Intelligence Accelerate Technological Progress? Researchers' Perspectives on AI in Manufacturing and Materials Science

人工智能能否加速技术进步?研究人员对制造业和材料科学中AI的看法

John P. Nelson, Olajide Olugbade, Philip Shapira, Justin B. Biddle

发表机构 * School of Public Policy, Oregon State University(俄勒冈州立大学公共政策学院) Manchester Institute of Innovation Research, University of Manchester(曼彻斯特大学创新研究所) Alan Turing Institute, Manchester, UK(曼彻斯特英国艾伦·图灵研究所)

AI总结 通过32位美国制造业和材料科学领域研究人员的访谈,发现AI主要用于材料和制造过程的建模,加速设计空间搜索,但存在数据依赖、需与旧技术结合以及可能阻碍颠覆性理论进步的风险。

详情
AI中文摘要

人工智能(AI)引发了人们对技术进步速度大幅提升的期望,但这种预期往往与创新过程中AI使用的详细实地研究脱节。因此,AI如何以及在多大程度上能够加速创新仍不清楚。为填补这一空白,我们探索并评估了对32位美国学术界的制造业和材料科学研究人员的访谈结果,这些研究人员在AI和机器学习(ML)技术方面经验丰富。我们发现,AI主要用于材料和制造过程的建模,促进了对材料和制造过程设计空间的更廉价、更快速的搜索。其好处包括在技术开发中节省成本、时间和计算资源。然而,AI/ML工具在已有密集数据的设计空间之外并不可靠;它们需要与较旧的研究技术相结合,进行熟练且审慎的应用;并且有人担心它们可能有害地规避颠覆性理论进步的机会。基于这些结果,我们认为有理由对通过使用AI/ML加速持续性创新持乐观态度;但需要支持传统的实证、计算和理论研究,以维持制造业和材料领域进一步颠覆性进步的可能性。

英文摘要

Artificial intelligence (AI) raises expectations of substantial increases in rates of technological progress, but such anticipations are often not connected to detailed ground-level studies of AI use in innovation processes. Accordingly, it remains unclear how and to what extent AI can accelerate innovation. To help to fill this gap, we explore and assess results from 32 interviews with U.S.-based academic manufacturing and materials sciences researchers experienced with AI and machine learning (ML) techniques. We found that AI was primarily used for modeling of materials and manufacturing processes, facilitating cheaper and more rapid search of design spaces for materials and manufacturing processes alike. Benefits included cost, time, and computation savings in technology development. However, AI/ML tools were unreliable outside design spaces for which dense data were already available; they required skilled and judicious application in tandem with older research techniques; and concerns were raised about the potential to detrimentally circumvent opportunities for disruptive theoretical advancement. Based on these results, we suggest there is reason for optimism about acceleration in sustaining innovations through the use of AI/ML; but that support for conventional empirical, computational, and theoretical research is required to maintain the likelihood of further disruptive advances in manufacturing and materials.

2601.09753 2026-06-16 cs.CY cs.AI 版本更新

Critically Engaged Pragmatism: Scientific Norm and Social, Pragmatist Epistemology for AI Science Evaluation Tools

批判性参与实用主义:AI科学评估工具的科学规范与社会实用主义认识论

Carole J. Lee

AI总结 提出批判性参与实用主义作为科学规范,要求科学界审视AI科学评估工具的目的及特定可靠性,并建议工具创建者透明报告设计、训练和基准测试细节。

详情
Journal ref
Social Epistemology (2026)
AI中文摘要

AI科学评估工具旨在评估研究的可信度。与传统指标(如影响因子)一样,它们的指令可能被去语境化并以有问题的方式重新利用。为了解决这个问题,我提出批判性参与实用主义作为一种科学规范,要求科学界审视AI科学评估工具的目的及特定目的的可靠性。为了培养批判性参与实用主义,AI科学评估工具的创建者应透明且完整地报告设计、训练和基准测试细节,以促进对特定目的可靠性、不同类型错误和偏见的评估。随着新形式的错误、偏见和博弈行为的发现,AI科学评估工具透明报告的最佳实践应不断更新。在此框架下,AI科学评估工具不是科学可信度的客观仲裁者。相反,它们是最终奠定科学社区可信度的批判性话语实践的对象。

英文摘要

AI science evaluation tools aim to assess research credibility. As with traditional metrics such as impact factors, their edicts can be decontextualised and repurposed in problematic ways. To address this, I propose Critically-Engaged Pragmatism as a scientific norm enjoining scientific communities to scrutinise the purposes and purpose-specific reliability of AI science evaluation tools. To foster Critically Engaged Pragmatism, creators of AI science evaluation tools should transparently and fully report design, training, and benchmarking details to facilitate assessments of purpose-specific reliability, liability to different types of error, and bias. What count as best practices for the transparent reporting of AI science evaluation tools should be updated as new forms of error, bias, and gamesmanship are discovered. Under this framework, AI science evaluation tools are not objective arbiters of scientific credibility. Rather, they are the object of critical discursive practices that ultimately ground the credibility of scientific communities.

2604.18827 2026-06-16 q-bio.NC cs.AI 版本更新

OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

OmniMouse: 基于1500亿神经令牌的多模态多任务脑模型的可扩展性

Konstantin F. Willeke, Polina Turishcheva, Alex Gilbert, Goirik Chakrabarty, Hasan A. Bedel, Paul G. Fahey, Yongrong Qiu, Marissa A. Weis, Michaela Vystrčilová, Taliah Muhammad, Lydia Ntanavara, Rachel E. Froebe, Kayla Ponder, Zheng Huan Tan, Emin Orhan, Erick Cobos, Sophia Sanborn, Katrin Franke, Fabian H. Sinz, Alexander S. Ecker, Andreas S. Tolias

发表机构 * Department of Ophthalmology, Byers Eye Institute, Stanford University(斯坦福大学眼科学系、比尔斯眼科研究所) Stanford Bio-X, Stanford University(斯坦福大学生物交叉学科) Wu Tsai Neurosciences Institute, Stanford University(斯坦福大学吴泰教授神经科学研究所) Institute of Computer Science and Campus Institute Data Science, University Göttingen(哥廷根大学计算机科学研究所和校园数据科学研究所)

AI总结 利用小鼠视觉皮层31亿神经元数据,训练多模态多任务模型OmniMouse,在神经预测、行为解码等任务上达到最优,发现性能随数据量可靠提升但模型规模收益饱和,与AI领域标准扩展规律相反。

Comments Published at ICLR2026

详情
AI中文摘要

扩展数据和人工神经网络已经改变了人工智能,推动了语言和视觉领域的突破。类似的原则是否适用于脑活动建模仍不清楚。这里我们利用了一个数据集,包含来自73只小鼠视觉皮层的310万个神经元,跨越323个会话,总计超过1500亿个神经令牌,记录于自然电影、图像、参数化刺激和行为期间。我们训练了多模态、多任务模型,在测试时灵活支持三种模式:神经预测、行为解码、神经预测或三者的任意组合。OmniMouse实现了最先进的性能,在几乎所有评估模式下优于专门的基线。我们发现性能随数据量可靠地提升,但增加模型大小的收益饱和。这颠倒了标准的人工智能扩展故事:在语言和计算机视觉中,大规模数据集使参数扩展成为进步的主要驱动力,而在脑建模中——即使是在小鼠视觉皮层这个相对简单的系统中——尽管有大量的记录,模型仍然受限于数据。系统性的扩展观察提出了神经建模中相变的可能性,更大和更丰富的数据集可能解锁定性的新能力,类似于大型语言模型中出现的涌现特性。代码见此网址。

英文摘要

Scaling data and artificial neural networks has transformed AI, driving breakthroughs in language and vision. Whether similar principles apply to modeling brain activity remains unclear. Here we leveraged a dataset of 3.1 million neurons from the visual cortex of 73 mice across 323 sessions, totaling more than 150 billion neural tokens recorded during natural movies, images and parametric stimuli, and behavior. We train multi-modal, multi-task models that support three regimes flexibly at test time: neural prediction, behavioral decoding, neural forecasting, or any combination of the three. OmniMouse achieves state-of-the-art performance, outperforming specialized baselines across nearly all evaluation regimes. We find that performance scales reliably with more data, but gains from increasing model size saturate. This inverts the standard AI scaling story: in language and computer vision, massive datasets make parameter scaling the primary driver of progress, whereas in brain modeling -- even in the mouse visual cortex, a relatively simple system -- models remain data-limited despite vast recordings. The observation of systematic scaling raises the possibility of phase transitions in neural modeling, where larger and richer datasets might unlock qualitatively new capabilities, paralleling the emergent properties seen in large language models. Code available at https://github.com/enigma-brain/omnimouse.

2510.16559 2026-06-16 cs.AI 版本更新

BuildArena: A Physics-Aligned Interactive Benchmark of LLMs for Engineering Construction

BuildArena: 一个对齐物理的LLM交互基准,用于工程建造

Tian Xia, Tianrun Gao, Wenhao Deng, Long Wei, Xiaowei Qian, Chenglei Yu, Tailin Wu

发表机构 * Tian Xia(夏天) Tianrun Gao(高天run) Wenhao Deng(邓文浩) Long Wei(韦龙) Xiaowei Qian(钱小伟) Chenglei Yu(于成磊) Tailin Wu(吴太林)

AI总结 本研究提出BuildArena,一个首个对齐物理的LLM交互基准,用于语言驱动的工程建造,通过可扩展的任务设计策略和3D空间几何计算库,评估九个前沿LLM在语言驱动和物理基础的自动化建造中的能力。

Comments ICML 2026, 36 pages, 12 figures

详情
AI中文摘要

工程自动化旨在将自然语言规范转换为物理上可行的结构,需要在严格物理约束下进行复杂的综合推理。尽管现代LLM具有广泛的知识和强大的推理能力,使其成为该领域的潜在候选者,但其建造能力仍 largely 未被评估。为解决这一差距,我们引入了BuildArena,首个针对语言驱动工程建造设计的对齐物理的交互基准。它在使用LLM进行工程自动化方面迈出第一步。技术上,它在两个方面为社区做出贡献:(1) 覆盖静态和动态力学的可扩展任务设计策略,跨越多个难度级别;(2) 一个3D空间几何计算库,用于基于语言指令的建造。在九个前沿LLM上,BuildArena全面评估了它们在语言驱动和物理基础的自动化建造中的能力。

英文摘要

Engineering construction automation aims to transform natural language specifications into physically viable structures, requiring complex integrated reasoning under strict physical constraints. While modern LLMs possess broad knowledge and strong reasoning capabilities that make them promising candidates for this domain, their construction competencies remain largely unevaluated. To address this gap, we introduce BuildArena, the first physics-aligned interactive benchmark designed for language-driven engineering construction. Technically, it contributes to the community in two aspects: (1) an extendable task design strategy spanning static and dynamic mechanics across multiple difficulty tiers; (2) a 3D Spatial Geometric Computation Library for supporting construction based on language instructions. On nine frontier LLMs and three additional open-weight models, BuildArena comprehensively evaluates their capabilities for language-driven and physics-grounded construction automation. We release the code at https://github.com/AI4Science-WestlakeU/BuildArena to benefit construction automation in engineering applications.

2603.25777 2026-06-16 physics.plasm-ph cs.AI 版本更新

Challenges and opportunities for AI to help deliver fusion energy

人工智能在实现聚变能源中的挑战与机遇

Adriano Agnello, Helen Brooks, Cyd Cowley, Iulia Georgescu, Alex Higginbottom, Richard Pearson, Tara Shears, Melanie Windridge

发表机构 * STFC Hartree Centre(英国科学与技术创新中心) UK Atomic Energy Authority(英国原子能局) digiLab Solutions(digiLab解决方案) Institute of Physics(物理研究所) Zenithon AI(Zenithon人工智能) Eindhoven University of Technology(埃因霍温理工大学) Oliver Lodge Laboratory(Oliver Lodge实验室) University of Liverpool(利物浦大学) Fusion Energy Insights(聚变能源洞察)

AI总结 本文探讨了人工智能在聚变能源研发中的应用潜力与挑战,强调需通过跨领域合作与稳健方法提升AI应用效果,同时指出并非所有聚变问题都适合AI解决。

Comments Submitted to Plasma Physics and Confined Fusion

详情
Journal ref
Plasma Physics and Controlled Fusion 68 063701 (2026)
AI中文摘要

人工智能工具在聚变研究中的应用具有巨大潜力,若能实现可控核聚变,将带来全球性效益。然而,使用AI面临诸多挑战,这些挑战可通过在现有方法中引入负责任和稳健的方法加以缓解。为此,需要聚变领域专家与AI开发者之间紧密、长期的合作,并意识到并非所有聚变研究问题都最适合用AI工具解决。2025年4月,学术界、工业界、UKAEA和STFC专家在《经济学人》FusionFest活动上讨论了AI如何推动聚变能源研发。本文是对圆桌讨论的扩展和更新总结,提供了更多背景和实例。

英文摘要

There is great potential for the application of AI tools in fusion research, and substantial worldwide benefit if fusion power is realised. However, using AI comes with its own challenges, many of which can be mitigated if responsible and robust methodologies are built into existing approaches. To do that requires close, long-term collaborations between fusion domain experts and AI developers and awareness of the fact that not all problems in fusion research are best tackled with AI tools. In April 2025, experts from academia, industry, UKAEA and STFC discussed how AI can be used to advance R&D in fusion energy at the first edition of The Economist FusionFest event. This Perspective is an expanded and updated summary of the round table discussion, providing more context and examples.

2605.05372 2026-06-16 cs.CV cs.AI 版本更新

Two Steps Are All You Need: Efficient 3D Point Cloud Anomaly Detection with Consistency Models

两步即可:基于一致性模型的高效3D点云异常检测

Pranav A, Shashank B, Pranav Siddappa, Dominik Seuss, Minal Moharir, Subramanya KN

发表机构 * R.V. College of Engineering(R.V. 工程学院) Technical University of Applied Sciences Würzburg-Schweinfurt(Würzburg-Schweinfurt 应用科学大学)

AI总结 本文提出基于一致性学习的重建异常检测方法,通过简化推理过程提升效率,实现低延迟的3D点云异常检测,适用于资源受限设备。

Comments Accepted to CVPR 2026, at the 9th Workshop on Efficient Deep Learning for Computer Vision (ECV). To be published in the IEEE/CVF CVPR 2026 Workshop Proceedings

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 3479-3487
AI中文摘要

扩散模型正在重新定义3D点云数据中的异常检测。随着3D传感成为现代制造的关键,可靠的异常检测对于高吞吐量的质量保证和过程控制至关重要。然而,在资源受限且延迟敏感的系统中,实际部署仍然有限。现有方法往往在复杂未遮挡区域计算上不可行或不可靠,而扩散管道本质上受限于迭代去噪。在本文中,我们通过一致性学习重构基于重建的异常检测,使能够在一次或两次网络评估中直接预测无异常几何。我们进一步引入了一种新的混合损失公式,明确强制重建至干净数据。这种设计显著降低了推理成本,达到比当前最先进方法快80倍的运行时间,无需GPU加速,同时保持强大的检测性能。它在Anomaly-ShapeNet上以76.20%的I-AUROC优于R3D-AD,在Real3DAD上以72.80%的I-AUROC保持竞争力,使在资源受限平台上实现高效、低延迟的异常检测成为可能,包括无人机、智能工业相机和其他边缘设备。

英文摘要

Diffusion models are rapidly redefining 3D anomaly detection in point cloud data. As 3D sensing becomes integral to modern manufacturing, reliable anomaly detection is essential for high-throughput quality assurance and process control. Yet practical deployment on resource-constrained, latency-critical systems remains limited. Existing methods are often computationally prohibitive or unreliable in complex, unmasked regions, and diffusion pipelines are inherently bottlenecked by iterative denoising. In this work, we address this bottleneck by reformulating reconstructionbased anomaly detection through consistency learning, enabling direct prediction of anomaly-free geometry in one or two network evaluations. We further introduce a novel hybrid loss formulation that explicitly enforces reconstruction toward clean data. This design substantially reduces inference cost, achieving up to 80x faster runtime than the current state-of-the-art method, without GPU acceleration, while preserving strong detection performance. It outperforms R3D-AD on Anomaly-ShapeNet with 76.20% I-AUROC and remains competitive on Real3DAD with 72.80% I-AUROC, enabling efficient, low-latency anomaly detection on resource-constrained platforms, including drones, smart industrial cameras, and other edge devices.

2511.12635 2026-06-16 cs.SE cs.AI cs.LG 版本更新

LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

LLM4SCREENLIT: 关于评估用于系统综述文献筛选的大型语言模型性能的建议

Lech Madeyski, Barbara Kitchenham, Martin Shepperd

发表机构 * University of Kent(肯特大学) University of Leicester(利兹大学) University of Birmingham(伯明翰大学)

AI总结 本文提出LLM4SCREENLIT建议,针对系统综述文献筛选中大型语言模型的评估,提出基于加权马修相关系数的改进方法,强调在不平衡和成本不对称条件下使用成本敏感的WMCC进行评估。

Comments 34 pages, 6 figures

详情
Journal ref
Information and Software Technology 198 (2026) 108204
AI中文摘要

本文提出LLM4SCREENLIT建议,针对系统综述文献筛选中大型语言模型的评估,提出基于加权马修相关系数的改进方法,强调在不平衡和成本不对称条件下使用成本敏感的WMCC进行评估。

英文摘要

Context: Large language models (LLMs) are increasingly used to screen literature for systematic reviews (SRs), but the standard confusion-matrix metrics used to evaluate them can mislead under the imbalanced, cost-asymmetric conditions of screening. Objective: We develop and justify LLM4SCREENLIT-practical recommendations for researchers conducting LLM-screening evaluations and for editors and reviewers assessing such studies-differentiated by study type (retrospective benchmarking vs deployment for a specific SR). Method: Using Delgado-Chaves et al. (2025), an 18-LLM benchmark across three biomedical SRs, as a motivating example, we reviewed 28 additional papers and extracted their reported metrics. We propose a Weighted Matthews Correlation Coefficient (WMCC) that integrates MCC's chance-correction with asymmetric misclassification costs, and validated it on three software-engineering (SE) reanalyses, the largest covering 9 LLMs x 24 SE secondary studies (34,528 articles). Results: Across the 29 papers, only 10% reported MCC, only 24% reported full confusion matrices, and none of the five papers claiming workload savings priced false-negative cost. In the largest SE reanalysis, MCC and WMCC disagree on the best LLM in 55% of evaluable studies; in the most striking 9,695-article SE study, the Accuracy-best LLM loses 63.3% of relevant evidence (Lost Evidence), the MCC-best 43.9%, but the WMCC-best only 5.8%. Sensitivity analysis (median crossover at w~=2.7, all <7) supports w=10 as a conservative default. Conclusions: SR-screening evaluations should prioritize Lost Evidence and use cost-sensitive WMCC alongside MCC for ranking. Reporting must include the full confusion matrix and treat unclassifiable outputs as positives requiring human review. Designs should be leakage-aware, with non-LLM baselines when the study aims to inform SR practice and labels are available.

2603.11729 2026-06-16 cs.DS cs.AI cs.RO 版本更新

Adapting Dijkstra for Buffers and Unlimited Transfers

为缓冲区和无限换乘调整Dijkstra算法

Denys Katkalo, Andrii Rohovyi, Toby Walsh

发表机构 * University of Oxford(牛津大学)

AI总结 本文提出Transfer Aware Dijkstra (TAD)算法,通过扫描完整行程序列而非单条边,解决了带缓冲区时间的无限换乘路径规划中传统Dijkstra过滤失效的问题,并在伦敦和瑞士网络上实现比MR快两倍以上的速度且保持最优性。

Comments v4: clarified RAPTOR description in the Background section

详情
AI中文摘要

近年来,基于RAPTOR的算法被认为是无需预处理即可处理无限换乘路径规划的最先进技术。然而,这一地位很大程度上源于路由研究的演进,其中基于Dijkstra的解决方案被基于时间表的算法取代,而缺乏系统性的比较。在这项工作中,我们重新审视了经典的基于Dijkstra的无限换乘公共交通路由方法,并证明时间依赖Dijkstra (TD-Dijkstra) 优于MR。然而,高效的TD-Dijkstra实现依赖于在预处理期间过滤被支配的连接,这假设乘客总是可以切换到更快的连接。我们表明,当站点有缓冲区时间时,这种过滤是不合理的,因为它无法区分可能继续等待的坐席乘客和必须遵守缓冲区的换乘乘客。为了解决这一限制,我们引入了Transfer Aware Dijkstra (TAD),这是一种修改后的算法,它扫描整个行程序列而不是单个边,从而正确处理缓冲区时间,同时保持相对于MR的性能优势。我们在伦敦和瑞士网络上的实验表明,与MR相比,我们可以在有和没有缓冲区时间的两个网络上实现超过两倍的速度提升,同时产生最优结果。

英文摘要

In recent years, RAPTOR based algorithms have been considered the state-of-the-art for path-finding with unlimited transfers without preprocessing. However, this status largely stems from the evolution of routing research, where Dijkstra-based solutions were superseded by timetable-based algorithms without a systematic comparison. In this work, we revisit classical Dijkstra-based approaches for public transit routing with unlimited transfers and demonstrate that Time-Dependent Dijkstra (TD-Dijkstra) outperforms MR. However, efficient TD-Dijkstra implementations rely on filtering dominated connections during preprocessing, which assumes passengers can always switch to a faster connection. We show that this filtering is unsound when stops have buffer times, as it cannot distinguish between seated passengers who may continue without waiting and transferring passengers who must respect the buffer. To address this limitation, we introduce Transfer Aware Dijkstra (TAD), a modification that scans entire trip sequences rather than individual edges, correctly handling buffer times while maintaining performance advantages over MR. Our experiments on the London and Switzerland networks show that we can achieve more than a twofold speedup over MR while producing optimal results on both networks, with and without buffer times.

2505.09755 2026-06-16 cs.AI 版本更新

Explainability Through Human-Centric Design for XAI in Lung Cancer Detection

通过以人为中心的设计实现XAI在肺癌检测中的可解释性

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

发表机构 * University of Edinburgh(爱丁堡大学) NHS Lothian(NHS洛锡安)

AI总结 本文提出XpertXAI模型,通过人类中心设计在肺癌检测中实现可解释性,优于现有方法,提供更符合专家推理的概念解释。

详情
AI中文摘要

深度学习模型在胸部X光片肺癌病理检测中表现出潜力,但临床应用受限于模型决策的不透明性。本文引入ClinicXAI,一种以人为中心、专家引导的概念瓶颈模型(CBM),用于可解释的肺癌诊断。我们扩展了这一方法,提出XpertXAI,一种通用的专家驱动模型,能够在检测多种肺部病变的同时保留人类可解释的临床概念。使用高性能的InceptionV3分类器和包含放射学报告的公共胸部X光数据集,我们比较了XpertXAI与领先的后验可解释性方法和无监督CBM(XCBs)。通过与专家放射科医师注释和医学地面真实值的比较评估解释。尽管XpertXAI训练用于多种病变,我们的专家验证集中在肺癌上。我们发现现有技术经常无法产生具有临床意义的解释,遗漏了关键诊断特征并不同意放射科医生的判断。XpertXAI不仅在预测准确性上优于这些基线方法,还提供了更符合专家推理的概念级解释。虽然我们的重点仍放在肺癌检测的可解释性上,但这项工作展示了如何通过人类中心的模型设计有效地扩展到更广泛的诊断情境——为医学诊断中的有意义可解释AI提供可扩展的路径。

英文摘要

Deep learning models have shown promise in lung pathology detection from chest X-rays, but widespread clinical adoption remains limited due to opaque model decision-making. In prior work, we introduced ClinicXAI, a human-centric, expert-guided concept bottleneck model (CBM) designed for interpretable lung cancer diagnosis. We now extend that approach and present XpertXAI, a generalizable expert-driven model that preserves human-interpretable clinical concepts while scaling to detect multiple lung pathologies. Using a high-performing InceptionV3-based classifier and a public dataset of chest X-rays with radiology reports, we compare XpertXAI against leading post-hoc explainability methods and an unsupervised CBM, XCBs. We assess explanations through comparison with expert radiologist annotations and medical ground truth. Although XpertXAI is trained for multiple pathologies, our expert validation focuses on lung cancer. We find that existing techniques frequently fail to produce clinically meaningful explanations, omitting key diagnostic features and disagreeing with radiologist judgments. XpertXAI not only outperforms these baselines in predictive accuracy but also delivers concept-level explanations that better align with expert reasoning. While our focus remains on explainability in lung cancer detection, this work illustrates how human-centric model design can be effectively extended to broader diagnostic contexts - offering a scalable path toward clinically meaningful explainable AI in medical diagnostics.

2603.10047 2026-06-16 cs.SE cs.AI cs.HC 版本更新

Toward Epistemic Stability: Engineering Consistent Procedures for Industrial LLM Hallucination Reduction

迈向认知稳定性:为工业大语言模型幻觉减少设计一致的程序

Brian Freeman, Adam Kicklighter, Matt Erdman, Zach Gordon

发表机构 * Trane Technologies(特纳技术公司)

AI总结 本文提出并比较了五种提示工程策略,旨在减少模型输出的方差,实现可重复、基于事实的结果。通过LLM-as-Judge框架评估,M4在100次试验中均表现最佳,M2在v2版本中提升显著。

Comments 50 pages, 5 tables, 7 figures

详情
AI中文摘要

大型语言模型(LLM)中的幻觉是指语法上连贯但事实错误或上下文不一致的输出。它们在高风险工业应用中持续存在,如工程设计、企业资源计划和物联网 telemetry 平台。本文提出并比较了五种提示工程策略,旨在减少模型输出的方差,以获得可重复、基于事实的结果,而无需修改模型权重或创建复杂验证模型。这些方法包括:(M1)迭代相似性收敛,(M2)分解模型无关提示,(M3)单任务代理专业化,(M4)增强数据注册,以及(M5)领域术语库注入。每种方法均使用LLM-as-Judge框架在100次重复运行中评估(相同固定任务提示,随机解码,tau=0.7)。在该评估设置下,M4(增强数据注册)在所有100次试验中均获得“更好”评价;M3和M5分别达到80%和77%;M1达到75%;而M2相比单次提示在现代基础模型上净负34%。随后,我们开发了增强版本2(v2)实现,并在10次验证批次上评估;M2从34%提升到80%,是四个修订方法中最大的提升。我们讨论了这些策略如何帮助克服LLM结果的非确定性,即使绝对正确性无法保证。我们提供了伪代码、原文提示和批次日志以支持独立评估。

英文摘要

Hallucinations in large language models (LLMs) are outputs that are syntactically coherent but factually incorrect or contextually inconsistent. They are persistent obstacles in high-stakes industrial settings such as engineering design, enterprise resource planning, and IoT telemetry platforms. We present and compare five prompt engineering strategies intended to reduce the variance of model outputs and move toward repeatable, grounded results without modifying model weights or creating complex validation models. These methods include: (M1) Iterative Similarity Convergence, (M2) Decomposed Model-Agnostic Prompting, (M3) Single-Task Agent Specialization, (M4) Enhanced Data Registry, and (M5) Domain Glossary Injection. Each method is evaluated against an internal baseline using an LLM-as-Judge framework over 100 repeated runs per method (same fixed task prompt, stochastic decoding at tau = 0.7. Under this evaluation setup, M4 (Enhanced Data Registry) received ``Better'' verdicts in all 100 trials; M3 and M5 reached 80% and 77% respectively; M1 reached 75%; and M2 was net negative at 34% when compared to single shot prompting with a modern foundation model. We then developed enhanced version 2 (v2) implementations and assessed them on a 10-trial verification batch; M2 recovered from 34% to 80%, the largest gain among the four revised methods. We discuss how these strategies help overcome the non-deterministic nature of LLM results for industrial procedures, even when absolute correctness cannot be guaranteed. We provide pseudocode, verbatim prompts, and batch logs to support independent assessment.

2603.24724 2026-06-16 cs.CV cs.AI 版本更新

Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation

几何足够吗?基于标记的注视估计评估

Daniele Agostinelli, Thomas Agostinelli, Andrea Generosi, Maura Mengoni

发表机构 * Department of Industrial Engineering and Mathematical Sciences, Università Politecnica delle Marche(工业工程与数学科学系,帕尔米塞大学) Department of Science and Information Technology, Università Pegaso(科学与信息科技系,佩加索大学)

AI总结 本文评估了基于面部标记的注视估计方法,通过标准化流程提取和归一化三个大型数据集的标记,并训练轻量级回归模型,发现其在跨域评估中与ResNet18基线相当,表明稀疏几何特征能有效支持鲁棒的注视估计。

详情
AI中文摘要

基于外观的注视估计通常依赖深度卷积神经网络(CNNs)。这些模型准确但计算成本高且作为“黑箱”,可解释性差。基于面部标记的几何方法是轻量级替代方案,但其性能限制和泛化能力在现代基准中仍待探索。本文全面评估了基于标记的注视估计,引入标准化流程提取和归一化三个大型数据集(Gaze360、ETH-XGaze、GazeGene)的标记,并训练轻量级回归模型,具体为极端梯度提升树和两种神经架构:整体多层感知机(MLP)和设计捕捉双眼几何的孪生MLP。发现基于标记的模型在领域内评估表现较低,可能由于数据集中的标记检测噪声引入。然而,在跨域评估中,所提出的MLP架构的泛化能力与ResNet18基线相当。这些发现表明稀疏几何特征编码了足够的信息以支持鲁棒的注视估计,为高效、可解释且隐私友好的边缘应用铺平了道路。源代码和生成的基于标记的数据集可在https://github.com/daniele-agostinelli/LandmarkGaze.git获取。

英文摘要

Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at https://github.com/daniele-agostinelli/LandmarkGaze.git.

2508.12365 2026-06-16 cs.IR cs.AI cs.CL 版本更新

TaoSR1: The Thinking Model for E-commerce Relevance Search

TaoSR1:电商相关性搜索的思考模型

Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang, Bo Zheng

发表机构 * Taobao & Tmall Group of Alibaba(淘宝与天猫集团)

AI总结 本文提出TaoSR1框架,通过CoT引导的监督微调、离线采样与DPO优化,解决电商搜索中相关性预测的推理误差与幻觉问题,实现高效部署。

详情
Journal ref
KDD '26: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, 2026
AI中文摘要

查询-商品相关性预测是电商搜索的核心任务。基于BERT的模型在语义匹配上表现优异,但缺乏复杂的推理能力。尽管大型语言模型(LLMs)被探索,大多数仍使用判别性微调或蒸馏到小模型进行部署。我们提出一个框架,直接部署LLMs用于此任务,解决关键挑战:推理链(CoT)误差累积、判别性幻觉和部署可行性。我们的框架TaoSR1包括三个阶段:(1)使用CoT的监督微调以培养推理能力;(2)离线采样与pass@N策略和直接偏好优化(DPO)以提高生成质量;(3)基于难度的动态采样与组相对策略优化(GRPO)以缓解判别性幻觉。此外,后CoT处理和基于累积概率的分区方法使在线部署高效。TaoSR1在离线数据集上显著优于基线,并在在线双人评估中取得显著优势,引入了将CoT推理应用于相关性分类的新范式。

英文摘要

Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.

2511.00369 2026-06-16 cs.LG cs.AI cs.NE 版本更新

Balancing Interpretability and Performance in Motor Imagery EEG Classification: A Comparative Study of ANFIS-FBCSP-PSO and EEGNet

在运动想象EEG分类中平衡可解释性和性能:ANFIS-FBCSP-PSO和EEGNet的比较研究

Farjana Aktar, Mohd Ruhul Ameen, Akif Islam, Md Ekramul Hamid

发表机构 * University of Rajshahi(拉贾沙希大学)

AI总结 本文比较了ANFIS-FBCSP-PSO与EEGNet在BCI竞赛IV-2a数据集上的性能,发现模糊神经模型在内子试验中表现更优,而深度模型在跨受试者测试中更具泛化能力,为选择MI-BCI系统提供指导。

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)

详情
Journal ref
2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)
AI中文摘要

实现准确且可解释的运动想象EEG分类仍是脑机接口(BCI)研究中的关键挑战。本文比较了透明的模糊推理方法(ANFIS-FBCSP-PSO)与知名的深度学习基准(EEGNet),使用公开的BCI竞赛IV-2a数据集。ANFIS流程结合滤波器银行共同空间模式特征提取与通过粒子群优化优化的模糊IF-THEN规则,而EEGNet直接从原始EEG数据学习层次化的空间-时间表示。在内子试验中,模糊神经模型表现更好(68.58%±13.76%准确率,kappa=58.04%±18.43),而在跨受试者(LOSO)测试中,深度模型表现出更强的泛化能力(68.20%±12.13%准确率,kappa=57.33%±16.22)。因此,该研究为根据设计目标选择MI-BCI系统提供了实用指导:可解释性或用户间鲁棒性。未来对基于Transformer和混合神经符号框架的研究有望进一步推动透明的EEG解码。

英文摘要

Achieving both accurate and interpretable classification of motor-imagery EEG remains a key challenge in brain-computer interface (BCI) research. In this paper, we compare a transparent fuzzy-reasoning approach (ANFIS-FBCSP-PSO) with a well-known deep-learning benchmark (EEGNet) using the publicly available BCI Competition IV-2a dataset. The ANFIS pipeline combines filter-bank common spatial pattern feature extraction with fuzzy IF-THEN rules optimized via particle-swarm optimization, while EEGNet learns hierarchical spatial-temporal representations directly from raw EEG data. In within-subject experiments, the fuzzy-neural model performed better (68.58% +/- 13.76% accuracy, kappa = 58.04% +/- 18.43), while in cross-subject (LOSO) tests, the deep model exhibited stronger generalization (68.20% +/- 12.13% accuracy, kappa = 57.33% +/- 16.22). The study therefore provides practical guidance for selecting MI-BCI systems according to the design goal: interpretability or robustness across users. Future investigations into transformer-based and hybrid neuro-symbolic frameworks are expected to further advance transparent EEG decoding.

2511.00352 2026-06-16 cs.CV cs.AI 版本更新

Detecting AI-Generated Images via Diffusion Snap-Back Reconstruction: A Forensic Approach

通过扩散快回重建检测AI生成图像:一种取证方法

Mohd Ruhul Ameen, Akif Islam

发表机构 * Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology(1 计算机科学与工程系,孟加拉国工程与技术大学)

AI总结 本文提出通过扩散模型重建图像时的响应行为来检测AI生成图像,利用LPIPS等指标分析图像与扩散模型去噪行为的匹配程度,实验显示方法在识别准确率上表现优异。

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)

详情
Journal ref
2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)
AI中文摘要

生成图像模型的快速发展使数字媒体发生了变革,使得人类观察者或许多传统检测方法难以可靠地区分AI生成图像和真实照片。现代文本到图像系统如Stable Diffusion和DALL E能够生成极其逼真的图像,使其看起来完全自然,留下很少或没有传统深度伪造检测器可以依赖的可见伪影。这一挑战对虚假信息控制、机构身份验证和政治和法律领域中的数字信任有实际影响。我们不搜索隐藏的像素级痕迹,而是观察图像在被轻微扰动和由扩散模型重建时的反应。我们称之为扩散快回。通过跟踪不同重建强度下感知相似性度量(LPIPS、SSIM和PSNR)的变化,我们捕捉到紧凑且可解释的信号,揭示图像与扩散模型学习的去噪行为的接近程度。在包含4000张人类和AI生成图像的平衡数据集上评估,所提出的方法在分层五折交叉验证中达到AUROC 0.993,在使用仅逻辑回归的测试集上达到0.990。初步的鲁棒性测试显示,该方法在常见的现实世界失真如图像压缩和添加噪声下仍保持稳定。虽然我们的实验使用单一扩散主干进行,但结果表明,重建行为可以作为合成媒体检测的可靠且可扩展的基础,随着生成模型变得越来越逼真。

英文摘要

The rapid advancement of generative image models has transformed digital media to the point where AI generated images can no longer be reliably distinguished from authentic photographs by human observers or many conventional detection methods. Modern text to image systems such as Stable Diffusion and DALL E can now generate images so realistic that they often appear completely natural, leaving little to no visible artifacts for traditional deepfake detectors to rely on. This challenge has practical consequences for misinformation control, institutional identity verification, and digital trust in political and legal contexts. Instead of searching for hidden pixel level traces, we take a different approach: we observe how an image responds when it is gently disturbed and reconstructed by a diffusion model. We call this behavior diffusion snap back. By tracking how perceptual similarity measures (LPIPS, SSIM, and PSNR) change across different reconstruction strengths, we capture compact and interpretable signals that reveal how closely an image aligns with the diffusion model's learned denoising behavior. Evaluated on a balanced dataset of 4,000 human and AI generated images, the proposed method achieves an AUROC of 0.993 under stratified five fold cross validation and 0.990 on a holdout split using only logistic regression. Initial robustness tests show that the method remains stable under common real world distortions such as image compression and added noise. Although our experiments were conducted using a single diffusion backbone, the results indicate that reconstruction behavior can serve as a reliable and scalable foundation for synthetic media detection as generative models continue to grow more realistic.

2510.23785 2026-06-16 cs.CV cs.AI 版本更新

CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting

CountFormer:一种用于学习类无关物体计数中视觉重复和结构的Transformer框架

Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 CountFormer通过使用DINOv2和位置嵌入,改进了无示例物体计数中的结构一致性,实现了在FSC-147上的竞争力表现。

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)

详情
Journal ref
2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)
AI中文摘要

人类通常通过观察视觉重复和组成来计数 unfamiliar objects,而非仅依赖物体类别。然而,许多无示例计数模型在这种情况下的表现不佳,尤其是在物体包含对称组件、重复子结构或部分遮挡时可能过计数。我们引入了CountFormer,这是一种受CounTR启发的密度回归框架的受控适应,其中图像编码器被自监督视觉基础模型DINOv2取代。所得的Transformer特征与显式的二维位置嵌入结合,并通过轻量级卷积网络解码,以生成密度图,其积分给出最终计数。我们的目标不是提出新的计数架构,而是研究在严格无示例设置下,基于基础的表示是否能提高结构一致性。在FSC-147上,CountFormer在官方基准上实现了竞争性表现(MAE 19.06,RMSE 118.45)。定性分析表明,对于某些结构复杂的物体,部分层面的过计数错误更少,而总体误差与先前方法大致一致。敏感性分析显示,评估指标强烈受少量极端高密度场景的影响。总体而言,结果突显了表示质量在无示例物体计数中的作用。

英文摘要

Humans can often count unfamiliar objects by observing visual repetition and composition, rather than relying only on object categories. However, many exemplar-free counting models struggle in such situations and may overcount when objects contain symmetric components, repeated substructures, or partial occlusion. We introduce CountFormer, a controlled adaptation of a density-regression framework inspired by CounTR, where the image encoder is replaced with the self-supervised vision foundation model DINOv2. The resulting transformer features are combined with explicit two-dimensional positional embeddings and decoded by a lightweight convolutional network to produce a density map whose integral gives the final count. Our goal is not to propose a new counting architecture, but to study whether foundation-based representations improve structural consistency under a strictly exemplar-free setting. On FSC-147, CountFormer achieves competitive performance under the official benchmark (MAE 19.06, RMSE 118.45). Qualitative analysis suggests fewer part-level overcounting errors for some structurally complex objects, while overall error remains broadly consistent with prior approaches. Sensitivity analysis shows that evaluation metrics are strongly affected by a small number of extreme high-density scenes. Overall, the results highlight the role of representation quality in exemplar-free object counting.

2509.22935 2026-06-16 cs.LG cs.AI 版本更新

Compute-Optimal Quantization-Aware Training

计算最优量化感知训练

Aleksandr Dremov, David Grangier, Angelos Katharopoulos, Awni Hannun

发表机构 * Apple(苹果公司)

AI总结 本文研究了量化感知训练与全精度训练的计算分配优化问题,通过实验发现QAT与FP训练比例随总计算量增加而上升,并提出新的冷却与QAT融合方法以提升效率。

Comments ICLR 2026

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

量化感知训练(QAT)是提高量化神经网络精度的重要技术。先前研究表明,将训练分解为全精度阶段后接QAT阶段能获得更优精度。然而,全精度与QAT阶段的计算分配仍不明确。本文通过不同计算预算、QAT位宽和模型大小的实验,探讨了不同QAT持续时间对最终性能的影响。研究发现,与先前结论相反,QAT与全精度训练的损失最优比随总计算量增加而上升。使用tokens-per-parameter-byte统计量可准确预测广泛模型大小和量化位宽的最优比例。从实验数据中推导出一个损失标度定律,可预测不同QAT/FP计算分配策略和QAT位宽下的最优QAT比例和最终模型性能。利用该定律进行进一步预测,包括在给定内存约束下最优QAT位宽以及不同位宽QAT精度与全精度模型精度的比较。此外,本文提出了一种新的冷却与QAT融合方法,通过联合学习率衰减与量化感知训练,消除冗余的全精度模型更新,实现显著的计算节省。这些发现为高效的QAT规划提供了实用见解,并使在相同计算预算下训练更高质量的量化模型成为可能。

英文摘要

Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further predictions, which we verify experimentally, including which QAT bit width is optimal under a given memory constraint and how QAT accuracy with different bit widths compares to full-precision model accuracy. Additionally, we propose a novel cooldown and QAT fusion approach that performs learning rate decay jointly with quantization-aware training, eliminating redundant full-precision model updates and achieving significant compute savings. These findings provide practical insights into efficient QAT planning and enable the training of higher-quality quantized models with the same compute budget.

2602.21381 2026-06-16 cs.LG cs.AI cs.CE 版本更新

VCDF: A Validated Consensus-Driven Framework for Time Series Causal Discovery

VCDF:一种验证性共识驱动的时间序列因果发现框架

Gene Yu, Ce Guo, Wayne Luk

发表机构 * Department of Computing, Imperial College London(帝国理工学院伦敦分校计算机系)

AI总结 本文提出VCDF框架,通过评估时间序列阻断子集的因果关系稳定性,提升因果发现的鲁棒性,实验显示其在VAR-LiNGAM等方法上显著提高了F1分数,尤其在长序列中效果更佳。

Comments This paper has been accepted to PAKDD 2026. Please cite the proceedings version when available

详情
Journal ref
LNCS vol. 16599, pp. 29-41, Springer, 2026
AI中文摘要

时间序列因果发现对于理解动态系统至关重要,但现有方法对噪声、非平稳性和采样变异敏感。本文提出验证性共识驱动框架(VCDF),一种简单且方法无关的层,通过评估因果关系在阻断时间子集中的稳定性来提高鲁棒性。VCDF无需修改基础算法,可应用于VAR-LiNGAM和PCMCI等方法。实验表明,VCDF在合成数据集上提高了VAR-LiNGAM的窗口和总结F1分数,增益在不同数据特性中最为明显的是中等至长序列。该框架还受益于更长的序列,时间序列长度1000及以上可获得高达0.18的绝对改进。在模拟fMRI数据和IT监控场景中的评估进一步展示了其在现实噪声条件下的稳定性和结构准确性。VCDF为时间序列因果发现提供了一个有效的可靠性层,而不会改变底层建模假设。

英文摘要

Time series causal discovery is essential for understanding dynamic systems, yet many existing methods remain sensitive to noise, non-stationarity, and sampling variability. We propose the Validated Consensus-Driven Framework (VCDF), a simple and method-agnostic layer that improves robustness by evaluating the stability of causal relations across blocked temporal subsets. VCDF requires no modification to base algorithms and can be applied to methods such as VAR-LiNGAM and PCMCI. Experiments on synthetic datasets show that VCDF improves VAR-LiNGAM by approximately 0.08-0.12 in both window and summary F1 scores across diverse data characteristics, with gains most pronounced for moderate-to-long sequences. The framework also benefits from longer sequences, yielding up to 0.18 absolute improvement on time series of length 1000 and above. Evaluations on simulated fMRI data and IT-monitoring scenarios further demonstrate enhanced stability and structural accuracy under realistic noise conditions. VCDF provides an effective reliability layer for time series causal discovery without altering underlying modeling assumptions.

2602.08088 2026-06-16 cs.LG cs.AI 版本更新

Online Domain-aware LLM Decoding for Continual Domain Evolution

在线领域感知的LLM解码用于持续领域演变

Mohammad Abu-Shaira, Weishi Shi

发表机构 * University of North Texas(北卡罗来纳州立大学)

AI总结 本文提出在线领域感知解码框架ODD,通过概率融合和自适应置信度调节,提升LLM在持续领域变化中的适应能力,实验表明其在语法和语义生成任务中表现优异。

详情
Journal ref
Advances in Knowledge Discovery and Data Mining, PAKDD 2026, LNAI 16600, pp. 565-577, Springer, 2026
AI中文摘要

LLMs通常在领域特定数据上离线微调,假设领域静态。但实际上,领域知识通过新法规、产品、服务和交互模式持续演变。对每个新实例重新训练或微调LLM在计算上不可行。此外,现实环境也表现出时间动态性,数据分布不断变化。忽视这种现象,即概念漂移,会显著降低模型的预测准确性。这种领域演变与静态适应管道的不匹配凸显了需要高效实时适应而无需昂贵再训练的需求。为此,我们引入在线领域感知解码框架(ODD)。ODD在基础LLM和前缀树先验之间进行概率级融合,通过自适应置信度调节使用分歧和连续性信号进行指导。在多样化的漂移场景下的实证评估表明,ODD在所有语法和语义NLG指标上均优于LLM-Greedy和LLM-Temp Scaled。它在ROUGE-L指标上获得绝对增益0.065,并在最佳基线上使余弦相似度提高13.6%。这些结果证明了ODD对演变词汇和上下文模式的鲁棒性,使其适用于动态LLM应用。

英文摘要

LLMs are typically fine-tuned offline on domain-specific data, assuming a static domain. In practice, domain knowledge evolves continuously through new regulations, products, services, and interaction patterns. Retraining or fine-tuning LLMs for every new instance is computationally infeasible. Additionally, real-world environments also exhibit temporal dynamics with shifting data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model's predictive accuracy. This mismatch between evolving domains and static adaptation pipelines highlights the need for efficient, real-time adaptation without costly retraining. In response, we introduce Online Domain-aware Decoding framework (ODD). ODD performs probability-level fusion between a base LLM and a prefix-tree prior, guided by adaptive confidence modulation using disagreement and continuity signals. Empirical evaluation under diverse drift scenarios demonstrates that ODD consistently surpasses LLM-Greedy and LLM-Temp Scaled across all syntactic and semantic NLG metrics. It yields an absolute ROUGE-L gain of 0.065 and a 13.6% relative improvement in Cosine Similarity over the best baseline. These results demonstrate ODD 's robustness to evolving lexical and contextual patterns, making it suitable for dynamic LLM applications.

2601.18897 2026-06-16 cs.AI cs.LG 版本更新

Explainable Uncertainty Quantification for Wastewater Treatment Energy Prediction via Interval Type-2 Neuro-Fuzzy System

通过区间型2神经模糊系统实现废水处理能耗预测的可解释不确定性量化

Qusai Khaled, Bahjat Mallak, Uzay Kaymak, Laura Genga

发表机构 * Jheronimus Academy of Data Science, Eindhoven University of Technology, Eindhoven, The Netherlands(杰罗尼穆斯数据科学学院,埃因霍温理工大学,埃因霍温,荷兰) Haskoning, Amersfoort, The Netherlands(哈索宁,阿默斯福尔特,荷兰) School of Industrial Engineering, Eindhoven University of Technology(工业工程学院,埃因霍温理工大学)

AI总结 本文提出一种区间型2神经模糊系统,用于废水处理能耗预测,通过模糊规则结构生成可解释的预测区间,分解不确定性层级,提升决策可靠性。

Comments Submitted to 21st International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU2026)

详情
Journal ref
IPMU 2026, Commun. Comput. Inf. Sci. 3020, 392-406 (2026)
AI中文摘要

废水处理厂消耗全球1-3%的电力,准确的能耗预测对运营优化和可持续性至关重要。尽管机器学习模型提供点预测,但缺乏可解释的不确定性量化,这对安全关键基础设施的风险意识决策至关重要。本研究开发了一种区间型2自适应神经模糊推理系统(IT2-ANFIS),通过模糊规则结构生成可解释的预测区间。与黑箱概率方法不同,所提出的框架将不确定性分解为三个层次:特征层、不确定性足迹识别引入模糊性的变量,规则层分析揭示局部模型的置信度,实例层区间量化整体预测不确定性。在墨尔本水务东处理厂数据集上验证,IT2-ANFIS在预测性能上与一阶ANFIS相当,但在训练运行中方差显著降低,同时提供可解释的不确定性估计,将预测置信度直接与运营条件和输入变量联系起来。

英文摘要

Wastewater treatment plants consume 1-3% of global electricity, making accurate energy forecasting critical for operational optimization and sustainability. While machine learning models provide point predictions, they lack explainable uncertainty quantification essential for risk-aware decision-making in safety-critical infrastructure. This study develops an Interval Type-2 Adaptive Neuro-Fuzzy Inference System (IT2-ANFIS) that generates interpretable prediction intervals through fuzzy rule structures. Unlike black-box probabilistic methods, the proposed framework decomposes uncertainty across three levels: feature-level, footprint of uncertainty identify which variables introduce ambiguity, rule-level analysis reveals confidence in local models, and instance-level intervals quantify overall prediction uncertainty. Validated on Melbourne Water's Eastern Treatment Plant dataset, IT2-ANFIS achieves comparable predictive performance to first order ANFIS with substantially reduced variance across training runs, while providing explainable uncertainty estimates that link prediction confidence directly to operational conditions and input variables.

2601.18045 2026-06-16 cs.CV cs.AI 版本更新

Leveraging Persistence Image to Enhance Robustness and Performance in Curvilinear Structure Segmentation

利用持续图像增强曲率结构分割的鲁棒性和性能

Zhuangzhi Gao, Feixiang Zhou, He Zhao, Xiuju Chen, Xiaoxin Li, Qinkai Yu, Yitian Zhao, Alena Shantsila, Gregory Y. H. Lip, Eduard Shantsila, Yalin Zheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出PIs-Regressor和Topology SegNet,通过直接学习持续图像来增强曲率结构分割的鲁棒性和性能,实验表明拓扑特征能有效提升医学图像分割的准确性。

Comments Accepted by IEEE International Symposium on Biomedical Imaging (ISBI) 2026. 5 pages, 3 figures

详情
Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), London, United Kingdom, 2026
AI中文摘要

在医学图像中分割曲率结构对于分析临床应用中的形态学模式至关重要。整合拓扑属性如连通性可提高分割的准确性和一致性。然而,从持续图(PD)中提取和嵌入这些属性具有挑战性,因为它们非可微且计算成本高。现有方法大多通过手工设计的损失函数编码拓扑,泛化能力差。本文提出PIs-Regressor,一个简单有效的模块,直接从数据中学习持续图像(PI)——拓扑特征的有限、可微表示。与Topology SegNet结合,该框架将拓扑整合到网络架构本身而非辅助损失中。与依赖手工损失函数的方法不同,我们的方法直接将拓扑信息整合到网络结构中,从而实现更稳健的分割。我们的设计灵活,可无缝结合其他拓扑方法以进一步提升分割性能。实验结果表明,整合拓扑特征增强了模型鲁棒性,有效处理医学图像中的过曝和模糊挑战。在三个曲率基准上,我们的方法在像素级准确性和拓扑保真度上均达到最先进的性能。

英文摘要

Segmenting curvilinear structures in medical images is essential for analyzing morphological patterns in clinical applications. Integrating topological properties, such as connectivity, improves segmentation accuracy and consistency. However, extracting and embedding such properties - especially from Persistence Diagrams (PD) - is challenging due to their non-differentiability and computational cost. Existing approaches mostly encode topology through handcrafted loss functions, which generalize poorly across tasks. In this paper, we propose PIs-Regressor, a simple yet effective module that learns persistence image (PI) - finite, differentiable representations of topological features - directly from data. Together with Topology SegNet, which fuses these features in both downsampling and upsampling stages, our framework integrates topology into the network architecture itself rather than auxiliary losses. Unlike existing methods that depend heavily on handcrafted loss functions, our approach directly incorporates topological information into the network structure, leading to more robust segmentation. Our design is flexible and can be seamlessly combined with other topology-based methods to further enhance segmentation performance. Experimental results show that integrating topological features enhances model robustness, effectively handling challenges like overexposure and blurring in medical imaging. Our approach on three curvilinear benchmarks demonstrate state-of-the-art performance in both pixel-level accuracy and topological fidelity.

2512.14892 2026-06-16 cs.LG cs.AI 版本更新

OLR-WA: Online Weighted Average Linear Regression in Multivariate Data Streams

OLR-WA:多变量数据流中的在线加权平均线性回归

Mohammad Abu-Shaira, Alejandro Rodriguez, Greg Speegle, Victor Sheng, Ishfaq Ahmad

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 本文提出OLR-WA模型,用于多变量数据流的在线线性回归,通过处理数据漂移和置信度场景,实现与批量回归相当甚至更优的性能。

详情
Journal ref
2023 IEEE International Conference on Big Data (BigData), 1039-1046
AI中文摘要

在线学习通过增量更新模型来处理新数据,避免大规模存储需求和昂贵的模型重计算。本文引入了

英文摘要

Online learning updates models incrementally with new data, avoiding large storage requirements and costly model recalculations. In this paper, we introduce "OLR-WA; OnLine Regression with Weighted Average", a novel and versatile multivariate online linear regression model. We also investigate scenarios involving drift, where the underlying patterns in the data evolve over time, conduct convergence analysis, and compare our approach with existing online regression models. The results of OLR-WA demonstrate its ability to achieve performance comparable to the batch regression, while also showcasing comparable or superior performance when compared with other state-of-the-art online models, thus establishing its effectiveness. Moreover, OLR-WA exhibits exceptional performance in terms of rapid convergence, surpassing other online models with consistently achieving high r2 values as a performance measure from the first iteration to the last iteration, even when initialized with minimal amount of data points, as little as 1% to 10% of the total data points. In addition to its ability to handle time-based (temporal drift) scenarios, remarkably, OLR-WA stands out as the only model capable of effectively managing confidence-based challenging scenarios. It achieves this by adopting a conservative approach in its updates, giving priority to older data points with higher confidence levels. In summary, OLR-WA's performance further solidifies its versatility and utility across different contexts, making it a valuable solution for online linear regression tasks.

2411.13602 2026-06-16 eess.IV cs.AI cs.CV 版本更新

Translating Electrocardiograms to Cardiac Magnetic Resonance Imaging Useful for Cardiac Assessment and Disease Screening: A Multi-Center Study

将心电图转换为心脏磁共振成像对心脏评估和疾病筛查有用:一项多中心研究

Zhengyao Ding, Ziyu Li, Yujian Hu, Youyao Xu, Chengchen Zhao, Yiheng Mao, Haitao Li, Zhikang Li, Qian Li, Jing Wang, Yue Chen, Mengjia Chen, Longbo Wang, Xuesen Chu, Weichao Pan, Ziyi Liu, Fei Wu, Hongkun Zhang, Ting Chen, Zhengxing Huang

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) Department of Vascular Surgery, The First Affiliated Hospital of Zhejiang University School of Medicine(浙江大学医学院附属第一医院血管外科) Department of Cardiology, The First Affiliated Hospital, Zhejiang University School of Medicine(浙江大学医学院附属第一医院心内科) Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine(浙江大学医学院附属第一医院放射科) Department of Vascular Surgery, Quzhou People’s Hospital(衢州人民医院血管外科) Department of Cardiology, The Second Affiliated Hospital of Zhejiang University School of Medicine(浙江大学医学院附属第二医院心内科) China Ship Scientific Research Center(中国船舶科学研究院) Guangdong Transtek Medical Electronics Co., Ltd.(广东 Transtek 医疗电子有限公司)

AI总结 本文提出CardioNets框架,通过深度学习将12导联心电图信号转换为心脏磁共振成像级别的功能参数和合成图像,提升大规模心血管疾病筛查的效率和可及性。

Comments 29 pages, 7 figures

详情
Journal ref
NEJM AI 2026;3(4)
AI中文摘要

心血管疾病(CVDs)是全球死亡的主要原因,需要可访问且准确的诊断工具。尽管心脏磁共振成像(CMR)提供心脏结构和功能的金标准见解,但其临床效用受到高成本和复杂性的限制。相比之下,心电图(ECG)成本低且广泛可用,但缺乏CMR的粒度。我们提出CardioNets,一种深度学习框架,将12导联ECG信号转换为CMR级别的功能参数和合成图像,从而实现可扩展的心脏评估。CardioNets整合了跨模态对比学习和生成预训练,对齐ECG与CMR衍生的心脏表型,并通过掩码自回归模型合成高分辨率CMR图像。在159,819个样本上训练,包括英国生物库(n=42,483)和MIMIC-IV-ECG(n=164,550),并在独立临床数据集(n=3,767)上进行外部验证,CardioNets在疾病筛查和表型估计任务中表现出色。在英国生物库中,它将心脏表型回归R2提高了24.8%,并使心肌病AUC提高了高达39.3%。在MIMIC中,它将肺动脉高压检测的AUC提高了5.6%。生成的CMR图像在SSIM和PSNR方面分别比先前方法高36.6%和8.7%。在一项读者研究中,仅使用ECG的CardioNets在准确率上比同时使用ECG和真实CMR的人类医生高13.9%。这些结果表明,CardioNets为大规模CVD筛查提供了一个有前景的低成本替代方案,特别是在资源有限的环境中。未来的工作将专注于临床部署和ECG基于合成成像的监管验证。

英文摘要

Cardiovascular diseases (CVDs) are the leading cause of global mortality, necessitating accessible and accurate diagnostic tools. While cardiac magnetic resonance imaging (CMR) provides gold-standard insights into cardiac structure and function, its clinical utility is limited by high cost and complexity. In contrast, electrocardiography (ECG) is inexpensive and widely available but lacks the granularity of CMR. We propose CardioNets, a deep learning framework that translates 12-lead ECG signals into CMR-level functional parameters and synthetic images, enabling scalable cardiac assessment. CardioNets integrates cross-modal contrastive learning and generative pretraining, aligning ECG with CMR-derived cardiac phenotypes and synthesizing high-resolution CMR images via a masked autoregressive model. Trained on 159,819 samples from five cohorts, including the UK Biobank (n=42,483) and MIMIC-IV-ECG (n=164,550), and externally validated on independent clinical datasets (n=3,767), CardioNets achieved strong performance across disease screening and phenotype estimation tasks. In the UK Biobank, it improved cardiac phenotype regression R2 by 24.8% and cardiomyopathy AUC by up to 39.3% over baseline models. In MIMIC, it increased AUC for pulmonary hypertension detection by 5.6%. Generated CMR images showed 36.6% higher SSIM and 8.7% higher PSNR than prior approaches. In a reader study, ECG-only CardioNets achieved 13.9% higher accuracy than human physicians using both ECG and real CMR. These results suggest that CardioNets offers a promising, low-cost alternative to CMR for large-scale CVD screening, particularly in resource-limited settings. Future efforts will focus on clinical deployment and regulatory validation of ECG-based synthetic imaging.

2512.08879 2026-06-16 cs.LG cs.AI 版本更新

DAO-GP Drift Aware Online Non-Linear Regression Gaussian-Process

DAO-GP:漂移感知在线非线性回归高斯过程

Mohammad Abu-Shaira, Ajita Rattani, Weishi Shi

发表机构 * st Mohammad Abu-Shaira(第一作者) nd Ajita Rattani(第二作者) rd Weishi Shi(第三作者)

AI总结 提出DAO-GP模型,通过内置漂移检测与自适应机制、无超参数、稀疏化和衰减策略,解决在线高斯过程回归中概念漂移、超参数固定等问题,在多种漂移类型下表现鲁棒且优于现有方法。

详情
Journal ref
2025 IEEE International Conference on Big Data (BigData), pp. 776-785, 2025
AI中文摘要

真实世界的数据集通常表现出以数据分布演变为特征的时态动态。忽视这一现象(通常称为概念漂移)会显著降低模型的预测精度。此外,在线模型中超参数的存在加剧了这一问题。这些参数通常是固定的,用户无法根据演化的数据分布动态调整。高斯过程模型提供了具有不确定性量化的强大非参数回归能力,使其成为在线设置中建模复杂数据关系的理想选择。然而,传统的在线高斯过程方法存在几个关键限制,包括缺乏漂移感知、依赖固定超参数、易受数据窥探影响、缺乏原则性的衰减机制以及内存效率低下。为此,我们提出了DAO-GP(漂移感知在线高斯过程),一种新颖的、完全自适应的、无超参数、带衰减的稀疏非线性回归模型。DAO-GP具有内置的漂移检测和自适应机制,可根据漂移的严重程度动态调整模型行为。广泛的经验评估证实了DAO-GP在平稳条件、多种漂移类型(突变、增量、渐变)以及不同数据特征下的鲁棒性。分析表明其动态自适应、高效的内存和基于衰减的管理以及演化的诱导点。与最先进的参数和非参数模型相比,DAO-GP始终达到优越或竞争性的性能,使其成为在线非线性回归中具有漂移鲁棒性的解决方案。

英文摘要

Real-world datasets often exhibit temporal dynamics characterized by evolving data distributions. Disregarding this phenomenon, commonly referred to as concept drift, can significantly diminish a model's predictive accuracy. Furthermore, the presence of hyperparameters in online models exacerbates this issue. These parameters are typically fixed and cannot be dynamically adjusted by the user in response to the evolving data distribution. Gaussian Process (GP) models offer powerful non-parametric regression capabilities with uncertainty quantification, making them ideal for modeling complex data relationships in an online setting. However, conventional online GP methods face several critical limitations, including a lack of drift-awareness, reliance on fixed hyperparameters, vulnerability to data snooping, absence of a principled decay mechanism, and memory inefficiencies. In response, we propose DAO-GP (Drift-Aware Online Gaussian Process), a novel, fully adaptive, hyperparameter-free, decayed, and sparse non-linear regression model. DAO-GP features a built-in drift detection and adaptation mechanism that dynamically adjusts model behavior based on the severity of drift. Extensive empirical evaluations confirm DAO-GP's robustness across stationary conditions, diverse drift types (abrupt, incremental, gradual), and varied data characteristics. Analyses demonstrate its dynamic adaptation, efficient in-memory and decay-based management, and evolving inducing points. Compared with state-of-the-art parametric and non-parametric models, DAO-GP consistently achieves superior or competitive performance, establishing it as a drift-resilient solution for online non-linear regression.

2512.00572 2026-06-16 cs.CV cs.AI 版本更新

Integrating Skeleton Based Representations for Robust Yoga Pose Classification Using Deep Learning Models

基于骨架表示的瑜伽姿势分类深度学习模型整合

Mohammed Mohiuddin, Syed Mohammod Minhaz Hossain, Sumaiya Khanam, Prionkar Barua, Aparup Barua, MD Tamim Hossain

发表机构 * Department of Computer Science and Engineering, Premier University(计算机科学与工程系,普里梅尔大学)

AI总结 本文提出Yoga-16数据集,系统评估了三种深度学习模型,证明骨架表示在瑜伽姿势分类中优于原始图像,VGG16结合MediaPipe骨架输入达到96.09%的准确率。

详情
AI中文摘要

瑜伽因其精神和身体健康益处而全球流行,但错误姿势可能导致受伤。自动化瑜伽姿势分类因此变得重要,以减少对专家的依赖。尽管人类姿态关键点提取模型在动作识别中表现出潜力,但系统化的瑜伽姿势识别基准评估仍有限,因为先前工作通常仅关注原始图像或单一姿态提取模型。本文引入了'Yoga-16'数据集,以解决现有数据集的限制,并系统评估了三种深度学习架构(VGG16、ResNet50和Xception),使用三种输入模式(直接图像、MediaPipe Pose骨架图像和YOLOv8 Pose骨架图像)。我们的实验表明,基于骨架的表示优于原始图像输入,VGG16与MediaPipe Pose骨架输入的最高准确率为96.09%。此外,我们通过Grad-CAM进行可解释性分析,提供瑜伽姿势分类的模型决策洞察,通过交叉验证分析。

英文摘要

Yoga is a popular form of exercise worldwide due to its spiritual and physical health benefits, but incorrect postures can lead to injuries. Automated yoga pose classification has therefore gained importance to reduce reliance on expert practitioners. While human pose keypoint extraction models have shown high potential in action recognition, systematic benchmarking for yoga pose recognition remains limited, as prior works often focus solely on raw images or a single pose extraction model. In this study, we introduce a curated dataset, 'Yoga-16', which addresses limitations of existing datasets, and systematically evaluate three deep learning architectures (VGG16, ResNet50, and Xception), using three input modalities (direct images, MediaPipe Pose skeleton images, and YOLOv8 Pose skeleton images). Our experiments demonstrate that skeleton-based representations outperform raw image inputs, with the highest accuracy of 96.09% achieved by VGG16 with MediaPipe Pose skeleton input. Additionally, we provide interpretability analysis using Grad-CAM, offering insights into model decision-making for yoga pose classification with cross-validation analysis.

2511.17743 2026-06-16 cs.AI cs.SY eess.SY 版本更新

AI- and Ontology-Based Enhancements to FMEA for Advanced Systems Engineering: Current Developments and Future Directions

基于人工智能和本体的FMEA增强:先进系统工程中的最新发展与未来方向

Haytham Younus, Sohag Kabir, Felician Campean, Pascal Bonnaud, David Delaux

发表机构 * School of Computing and Engineering, University of Bradford(布里斯托大学计算机与工程学院) SAFI Verse Limited(SAFI Verse有限公司) Valeo(法拉利)

AI总结 本文探讨了如何利用人工智能和本体技术改进FMEA,提升其数据驱动和语义丰富性,分析了AI和本体在系统工程中的应用及挑战。

Comments This manuscript is based on research undertaken by our doctoral student at the University of Bradford. The associated PhD thesis has been formally submitted to the University and is currently awaiting final examination. The review article is being shared on arXiv to make the review accessible to the research community while the thesis examination process is ongoing

详情
AI中文摘要

本文综述了近期旨在将传统故障模式与影响分析(FMEA)转变为更智能、数据驱动和语义丰富的过程的最新进展。随着工程系统复杂性增加,传统FMEA方法因依赖人工、文档和专家而显得不足。本文探讨了人工智能技术,如机器学习和自然语言处理,如何通过自动化故障预测、优先级排序和从操作数据中提取知识来改进FMEA。同时,本文探讨了本体在形式化系统知识、支持语义推理、提高可追溯性和跨领域互操作性中的作用。此外,本文还综合了新兴的混合方法,如基于本体的学习和大语言模型整合,以进一步提高可解释性和自动化。这些发展在基于模型的系统工程(MBSE)和功能建模的更广泛背景下讨论,展示了AI和本体如何支持更适应和稳健的FMEA工作流程。本文还批判性地分析了各种工具、案例研究和整合策略,同时识别了与数据质量、可解释性、标准化和跨学科应用相关的关键挑战。通过利用AI、系统工程和本体的知识表示,本文为将FMEA嵌入智能、知识丰富的工程环境提供了结构化的路线图。

英文摘要

This article presents a state-of-the-art review of recent advances aimed at transforming traditional Failure Mode and Effects Analysis (FMEA) into a more intelligent, data-driven, and semantically enriched process. As engineered systems grow in complexity, conventional FMEA methods, largely manual, document-centric, and expert-dependent, have become increasingly inadequate for addressing the demands of modern systems engineering. We examine how techniques from Artificial Intelligence (AI), including machine learning and natural language processing, can transform FMEA into a more dynamic, data-driven, intelligent, and model-integrated process by automating failure prediction, prioritisation, and knowledge extraction from operational data. In parallel, we explore the role of ontologies in formalising system knowledge, supporting semantic reasoning, improving traceability, and enabling cross-domain interoperability. The review also synthesises emerging hybrid approaches, such as ontology-informed learning and large language model integration, which further enhance explainability and automation. These developments are discussed within the broader context of Model-Based Systems Engineering (MBSE) and function modelling, showing how AI and ontologies can support more adaptive and resilient FMEA workflows. We critically analyse a range of tools, case studies, and integration strategies, while identifying key challenges related to data quality, explainability, standardisation, and interdisciplinary adoption. By leveraging AI, systems engineering, and knowledge representation using ontologies, this review offers a structured roadmap for embedding FMEA within intelligent, knowledge-rich engineering environments.

2511.07090 2026-06-16 cs.AI 版本更新

Green AI: A systematic review and meta-analysis of its definitions, lifecycle models, hardware and measurement attempts

绿色人工智能:对其定义、生命周期模型、硬件和测量尝试的系统综述和元分析

Marcel Rojahn, Marcus Grum

发表机构 * University of Potsdam, Junior Chair of Business Information Systems, esp. AI-based Application Systems(波恩大学,商业信息系统初级职位,特别是基于AI的应用系统)

AI总结 本文系统综述和元分析绿色人工智能的定义、生命周期模型、硬件及测量方法,提出统一定义、五阶段生命周期模型、治理框架和校准测量框架,以应对多维负担。

详情
Journal ref
Information and Software Technology 198 (2026) 108186
AI中文摘要

在人工智能生命周期中,从硬件到开发、部署和重用,负担包括能源、碳、水和嵌入式影响。云服务工具虽提高透明度,但异构且常忽略水和价值链影响,限制了可比性和可重复性。本文(i)建立与可持续人工智能不同的绿色人工智能统一操作定义;(ii)正式化与生命周期评估(LCA)阶段映射的五阶段生命周期,使能源、碳、水和嵌入式影响成为首要考虑因素;(iii)通过PDCA循环和决策关卡制定治理;(iv)系统化边缘云连续体的硬件和系统级策略以减少嵌入式负担;(v)定义结合估计模型和直接计量的校准测量框架,以实现可重复、提供者无关的比较。结合定义、生命周期过程、硬件策略和校准测量,本文为研究人员、实践者和政策制定者提供可操作的证据支持指导。

英文摘要

Across the Artificial Intelligence (AI) lifecycle - from hardware to development, deployment, and reuse - burdens span energy, carbon, water, and embodied impacts. Cloud provider tools improve transparency but remain heterogeneous and often omit water and value chain effects, limiting comparability and reproducibility. Addressing these multi dimensional burdens requires a lifecycle approach linking phase explicit mapping with system levers (hardware, placement, energy mix, cooling, scheduling) and calibrated measurement across facility, system, device, and workload levels. This article (i) establishes a unified, operational definition of Green AI distinct from Sustainable AI; (ii) formalizes a five phase lifecycle mapped to Life Cycle Assessment (LCA) stages, making energy, carbon, water, and embodied impacts first class; (iii) specifies governance via Plan Do Check Act (PDCA) cycles with decision gateways; (iv) systematizes hardware and system level strategies across the edge cloud continuum to reduce embodied burdens; and (v) defines a calibrated measurement framework combining estimator models with direct metering to enable reproducible, provider agnostic comparisons. Combining definition, lifecycle processes, hardware strategies, and calibrated measurement, this article offers actionable, evidence based guidance for researchers, practitioners, and policymakers.

2511.08507 2026-06-16 cs.CL cs.AI 版本更新

Introducing A Bangla Sentence - Gloss Pair Dataset for Bangla Sign Language Translation and Research

介绍一个孟加拉语句子- gloss配对数据集用于孟加拉语手语翻译和研究

Neelavro Saha, Rafi Shahriyar, Nafis Ashraf Roudra, Saadman Sakib, Annajiat Alim Rasel

发表机构 * Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology(Bangladesh University of Engineering and Technology计算机科学与工程系)

AI总结 本文介绍了一个包含1000个人工标注句子- gloss配对的新数据集Bangla-SGP,通过规则基于的检索增强生成管道生成约3000个合成配对,用于孟加拉语手语翻译和研究。

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026), pp. 10457-10466, ELRA, Palma, Mallorca, Spain, May 2026
AI中文摘要

孟加拉语手语(BdSL)翻译是一个低资源自然语言处理任务,由于缺乏大规模数据集来解决句子级翻译。相应地,该领域现有研究局限于词和字母级别的检测。在本工作中,我们介绍了Bangla-SGP,一个包含1000个由专业手语者手动标注的高质量孟加拉语句子的平行数据集,这些句子被注释为gloss序列。该数据集通过基于规则的检索增强生成(RAG)管道扩展,使用句法和形态学规则生成约3000个合成配对。gloss序列由单独的gloss组成,这些gloss是孟加拉语手语支持的词汇,并作为连续手语的中间表示。我们的数据集由1000个高质量孟加拉语句子组成,这些句子由专业手语者手动注释为gloss序列。增强过程结合了基于规则的语言学策略和提示工程技术,这些技术通过批判性分析我们的人工标注句子-gloss配对以及与专业手语者密切合作而获得。此外,我们微调了几种基于transformer的模型,如mBart50、Google mT5、GPT4.1-nano,并使用BLEU分数评估其句子到gloss的翻译性能。基于这些评估指标,我们比较了模型在我们数据集和RWTH-PHOENIX-2014T基准上的gloss翻译一致性。

英文摘要

Bangla Sign Language (BdSL) translation represents a low-resource NLP task due to the lack of large-scale datasets that address sentence-level translation. Correspondingly, existing research in this field has been limited to word and alphabet level detection. In this work, we introduce Bangla-SGP, a novel parallel dataset consisting of 1,000 human-annotated sentence-gloss pairs which was augmented with around 3,000 synthetically generated pairs using syntactic and morphological rules through a rule-based Retrieval-Augmented Generation (RAG) pipeline. The gloss sequences of the spoken Bangla sentences are made up of individual glosses which are Bangla sign supported words and serve as an intermediate representation for a continuous sign. Our dataset consists of 1000 high quality Bangla sentences that are manually annotated into a gloss sequence by a professional signer. The augmentation process incorporates rule-based linguistic strategies and prompt engineering techniques that we have adopted by critically analyzing our human annotated sentence-gloss pairs and by working closely with our professional signer. Furthermore, we fine-tune several transformer-based models such as mBart50, Google mT5, GPT4.1-nano and evaluate their sentence-to-gloss translation performance using BLEU scores, based on these evaluation metrics we compare the model's gloss-translation consistency across our dataset and the RWTH-PHOENIX-2014T benchmark.

2509.01182 2026-06-16 cs.AI cs.CL cs.HC cs.IR cs.MA 版本更新

Question-to-Knowledge (Q2K): Multi-Agent Generation of Inspectable Facts for Product Mapping

问题到知识(Q2K):多智能体生成可检查的事实以实现产品映射

Wonduk Seo, Taesub Shin, Hyunjin An, Dokyun Kim, Seunghyun Lee

发表机构 * The University of Tokyo(东京大学) KISTI(韩国科学技术院)

AI总结 Q2K通过多智能体框架利用大语言模型实现可靠的产品SKU映射,通过生成辨析问题、网络搜索和去重来提高准确性与鲁棒性,适用于复杂场景如捆绑识别和品牌来源辨析。

Comments Accepted by IEEE BigData 2025 Industry Track

详情
Journal ref
2025 IEEE International Conference on Big Data (BigData), Macau, China, 2025, pp. 2646-2653
AI中文摘要

识别两个产品列表是否指向相同的库存单位(SKU)是电子商务中的持续挑战,尤其是在缺乏显式标识符且产品名称在不同平台上差异较大的情况下。基于规则的启发式方法和关键词相似性经常因忽略品牌、规格或捆绑配置的细微区别而误分类。为克服这些限制,我们提出了问题到知识(Q2K),一个多智能体框架,利用大语言模型(LLMs)进行可靠的SKU映射。Q2K集成了:(1)一个推理代理,生成定向的辨析问题;(2)一个知识代理,通过聚焦的网络搜索解决这些问题;(3)一个去重代理,重用已验证的推理轨迹以减少冗余并确保一致性。人类在循环机制进一步细化不确定情况。在真实世界消费品数据集上的实验表明,Q2K超越了强大的基线,实现了在捆绑识别和品牌来源辨析等困难场景中的更高准确性和鲁棒性。通过重用检索到的推理而不是发出重复搜索,Q2K在准确性和效率之间取得了平衡,提供了一种可扩展且可解释的解决方案用于产品整合。

英文摘要

Identifying whether two product listings refer to the same Stock Keeping Unit (SKU) is a persistent challenge in ecommerce, especially when explicit identifiers are missing and product names vary widely across platforms. Rule based heuristics and keyword similarity often misclassify products by overlooking subtle distinctions in brand, specification, or bundle configuration. To overcome these limitations, we propose Question to Knowledge (Q2K), a multi agent framework that leverages Large Language Models (LLMs) for reliable SKU mapping. Q2K integrates: (1) a Reasoning Agent that generates targeted disambiguation questions, (2) a Knowledge Agent that resolves them via focused web searches, and (3) a Deduplication Agent that reuses validated reasoning traces to reduce redundancy and ensure consistency. A human in the loop mechanism further refines uncertain cases. Experiments on real world consumer goods datasets show that Q2K surpasses strong baselines, achieving higher accuracy and robustness in difficult scenarios such as bundle identification and brand origin disambiguation. By reusing retrieved reasoning instead of issuing repeated searches, Q2K balances accuracy with efficiency, offering a scalable and interpretable solution for product integration.

2511.05505 2026-06-16 q-bio.NC cs.AI 版本更新

Rewiring Human Brain Networks via Lightweight Dynamic Connectivity Framework: An EEG-Based Stress Validation

通过轻量级动态连接框架重绘人脑网络:基于EEG的应力验证

Sayantan Acharya, Abbas Khosravi, Douglas Creighton, Roohallah Alizadehsani, U. Rajendra Acharya

发表机构 * Institute for Intelligent Systems Research and Innovation(智能系统研究与创新研究所) University of Southern Queensland(南方昆士兰大学)

AI总结 本文提出基于时间变化定向传递函数的轻量级动态脑连接框架,通过机器学习验证EEG数据中的应力分类,发现alpha-TV-DTF在分类中表现最佳,凸显动态连接在捕捉脑区时间与因果影响上的优势。

Comments 21 pages, 21 figures, 6 tables, 50 references,

详情
Journal ref
2026. Reconfiguring brain networks via lightweight dynamic connectivity framework: An EEG-based stress validation. Computers in Biology and Medicine, 213, p.111801
AI中文摘要

近年来,结合人工智能和机器学习模型的脑电图分析在压力研究中日益突出。本文提出了一种基于时间变化定向传递函数的轻量级动态脑连接框架,通过机器学习模型验证了TV DTF特征。TV DTF估计了不同EEG频率带之间脑区的定向信息流,从而捕捉到通常被静态功能连接测量所忽视的时间和因果影响。使用32通道SAM 40数据集的EEG记录,重点研究了心算任务试验。通过支持向量机、随机森林、梯度提升、自适应提升和极端梯度提升等机器学习分类器验证了动态EEG基的TV-DTF特征。实验结果表明,alpha-TV-DTF具有最强的判别能力,SVM在3类分类中达到89.73%的准确率,XGBoost在2类分类中达到93.69%的准确率。与绝对功率和相位锁定基于的功能连接特征相比,alpha TV DTF和beta TV DTF在所有机器学习模型中均表现更优,突显了动态测量相对于静态测量的优势。特征重要性分析进一步突显了主导的远距离前额叶和前额顶叶信息影响,强调了压力下前额叶区域的调节作用。这些发现验证了轻量级TV-DTF作为一种稳健框架的有效性,揭示了不同压力水平下的空间时间脑动态和方向性影响。

英文摘要

In recent years, Electroencephalographic analysis has gained prominence in stress research when combined with AI and Machine Learning models for validation. In this study, a lightweight dynamic brain connectivity framework based on Time Varying Directed Transfer Function is proposed, where TV DTF features were validated through ML based stress classification. TV DTF estimates the directional information flow between brain regions across distinct EEG frequency bands, thereby capturing temporal and causal influences that are often overlooked by static functional connectivity measures. EEG recordings from the 32 channel SAM 40 dataset were employed, focusing on mental arithmetic task trials. The dynamic EEG-based TV-DTF features were validated through ML classifiers such as Support Vector Machine, Random Forest, Gradient Boosting, Adaptive Boosting, and Extreme Gradient Boosting. Experimental results show that alpha-TV-DTF provided the strongest discriminative power, with SVM achieving 89.73% accuracy in 3-class classification and with XGBoost achieving 93.69% accuracy in 2 class classification. Relative to absolute power and phase locking based functional connectivity features, alpha TV DTF and beta TV DTF achieved higher performance across the ML models, highlighting the advantages of dynamic over static measures. Feature importance analysis further highlighted dominant long-range frontal parietal and frontal occipital informational influences, emphasizing the regulatory role of frontal regions under stress. These findings validate the lightweight TV-DTF as a robust framework, revealing spatiotemporal brain dynamics and directional influences across different stress levels.

2510.19728 2026-06-16 cs.LG cs.AI 版本更新

Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series

通过生成合成医疗时间序列实现细粒度亚组级别模型评估

Mahmoud Ibrahim, Bart Elen, Chang Sun, Gökhan Ertaylan, Michel Dumontier

发表机构 * Institute of Data Science, Faculty of Science and Engineering, Maastricht University(数据科学研究所,科学与工程学院,马斯特里赫特大学) Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University(先进计算科学系,科学与工程学院,马斯特里赫特大学) VITO(VITO研究院)

AI总结 本文提出一种框架,利用合成ICU时间序列数据训练和评估预测模型,特别是在细粒度人口亚组中。引入Enhanced TimeAutoDiff,通过分布对齐惩罚增强潜在扩散目标,减少真实-合成与真实-真实评估差距,提升亚组模型评估的鲁棒性和可靠性。

详情
AI中文摘要

我们提出了一种新的框架,利用合成ICU时间序列数据不仅训练,还能严格可信地评估预测模型,既在总体层面,又在细粒度人口亚组中。基于先前的扩散和VAE生成器(TimeDiff,HealthGen,TimeAutoDiff),我们引入Enhanced TimeAutoDiff,通过在潜在扩散目标中加入分布对齐惩罚。我们广泛在MIMIC-III和eICU上对所有模型进行了基准测试,针对24小时死亡率和二元住院时间任务。我们的结果表明,Enhanced TimeAutoDiff通过减少真实-合成与真实-真实评估(

英文摘要

We present a novel framework for leveraging synthetic ICU time-series data not only to train but also to rigorously and trustworthily evaluate predictive models, both at the population level and within fine-grained demographic subgroups. Building on prior diffusion and VAE-based generators (TimeDiff, HealthGen, TimeAutoDiff), we introduce \textit{Enhanced TimeAutoDiff}, which augments the latent diffusion objective with distribution-alignment penalties. We extensively benchmark all models on MIMIC-III and eICU, on 24-hour mortality and binary length-of-stay tasks. Our results show that Enhanced TimeAutoDiff reduces the gap between real-on-synthetic and real-on-real evaluation (``TRTS gap'') by over 70\%, achieving $Δ_{TRTS} \leq 0.014$ AUROC, while preserving training utility ($Δ_{TSTR} \approx 0.01$). Crucially, for 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50\% relative to small real test sets, and outperform them in 72--84\% of subgroups. This work provides a practical, privacy-preserving roadmap for trustworthy, granular model evaluation in critical care, enabling robust and reliable performance analysis across diverse patient populations without exposing sensitive EHR data, contributing to the overall trustworthiness of Medical AI.

2509.02093 2026-06-16 cs.CL cs.AI cs.IR 版本更新

Better by Comparison: Retrieval-Augmented Contrastive Reasoning for Automatic Prompt Optimization

通过对比改进:基于检索增强的对比推理用于自动提示优化

Juhyeon Lee, Wonduk Seo, Hyunjin An, Seunghyun Lee, Yi Bu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出CRPO框架,通过对比推理提升提示优化效果,利用HelpSteer2数据集中的高质量示例进行对比分析,改进提示生成的鲁棒性和可解释性。

Comments Preprint

详情
Journal ref
2025 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Dekalb, IL, USA, 2025, pp. 269-272
AI中文摘要

自动提示优化近期作为一种提升大型语言模型(LLMs)提示质量的策略,旨在生成更准确和有用的响应。然而,大多数先前工作集中在直接提示精炼或模型微调,忽略了利用LLM内在推理能力从对比示例中学习的潜力。本文提出对比推理提示优化(CRPO),一种新颖的框架,将提示优化建模为检索增强的推理过程。我们的方法从HelpSteer2数据集检索top k参考提示-响应对,该数据集是一个开源集合,每个响应均标注了有用性、正确性、连贯性、复杂性和冗余性。我们构建了两种互补的优化范式:(1)分层对比推理,其中LLM比较高质量、中等质量和低质量的示例(提示和响应)以通过反思推理优化自身生成;(2)多指标对比推理,其中LLM分析每个评估维度的最佳示例并整合其优势以生成优化提示。通过显式对比高质量和低质量示例,CRPO使模型能够推断为何某些提示成功而其他失败,从而实现更鲁棒和可解释的优化。在HelpSteer2基准测试中的实验结果表明,CRPO显著优于基线方法。我们的发现突显了对比、检索增强推理在推进自动提示优化方面的潜力。

英文摘要

Automatic prompt optimization has recently emerged as a strategy for improving the quality of prompts used in Large Language Models (LLMs), with the goal of generating more accurate and useful responses. However, most prior work focuses on direct prompt refinement or model fine-tuning, overlooking the potential of leveraging LLMs' inherent reasoning capability to learn from contrasting examples. In this paper, we present Contrastive Reasoning Prompt Optimization (CRPO), a novel framework that formulates prompt optimization as a retrieval-augmented reasoning process. Our approach retrieves top k reference prompt-response pairs from the HelpSteer2 dataset, an open source collection where each response is annotated for helpfulness, correctness, coherence, complexity, and verbosity, and constructs two complementary optimization paradigms: (1) tiered contrastive reasoning, where the LLM compares high-, medium-, and low-quality exemplars (both prompts and responses) to refine its own generation through reflective reasoning, and (2) multi-metric contrastive reasoning, where the LLM analyzes the best exemplars along each evaluation dimension and integrates their strengths into an optimized prompt. By explicitly contrasting high and low quality exemplars, CRPO enables the model to deduce why certain prompts succeed while others fail, thereby achieving more robust and interpretable optimization. Experimental results on the HelpSteer2 benchmark demonstrate that CRPO significantly outperforms baselines. Our findings highlight the promise of contrastive, retrieval-augmented reasoning for advancing automatic prompt optimization.

2509.00176 2026-06-16 cs.CV cs.AI 版本更新

Waste-Bench: A Comprehensive Benchmark for Evaluating VLLMs in Cluttered Environments

Waste-Bench: 一个用于评估在杂乱环境中视觉大型语言模型性能的综合基准

Muhammad Ali, Salman Khan

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出Waste-Bench基准,用于评估VLLMs在复杂环境中的鲁棒性和准确性,揭示了提升VLLM在复杂环境性能的必要性。

详情
Journal ref
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), pp. 31019-31032, 2025
AI中文摘要

近年来,大型语言模型(LLMs)的进步为能够执行广泛视觉理解任务的视觉大型语言模型(VLLMs)铺平了道路。尽管LLMs在标准自然图像上表现出色,但其在杂乱数据集中的能力尚未得到充分探索,其中包含复杂环境和变形形状的对象。在本工作中,我们引入了一个专门设计用于现实场景中垃圾分类的新型数据集,其特点是有复杂的环境和变形形状的对象。此外,我们还提出了一种深入的评估方法,以严格评估VLLMs的鲁棒性和准确性。所引入的数据集和全面分析为VLLMs在挑战性条件下性能提供了有价值的见解。我们的发现强调了进一步提升VLLM鲁棒性以在复杂环境中表现更好的重要性。数据集和实验代码将公开发布。

英文摘要

Recent advancements in Large Language Models (LLMs) have paved the way for Vision Large Language Models (VLLMs) capable of performing a wide range of visual understanding tasks. While LLMs have demonstrated impressive performance on standard natural images, their capabilities have not been thoroughly explored in cluttered datasets where there is complex environment having deformed shaped objects. In this work, we introduce a novel dataset specifically designed for waste classification in real-world scenarios, characterized by complex environments and deformed shaped objects. Along with this dataset, we present an in-depth evaluation approach to rigorously assess the robustness and accuracy of VLLMs. The introduced dataset and comprehensive analysis provide valuable insights into the performance of VLLMs under challenging conditions. Our findings highlight the critical need for further advancements in VLLM's robustness to perform better in complex environments. The dataset and code for our experiments will be made publicly available.

2502.16560 2026-06-16 cs.AI cs.CL cs.SI 版本更新

An Analytical Emotion Framework of Rumour Threads on Social Media

社交媒体谣言线中的分析情绪框架

Rui Xing, Boyang Sun, Kun Zhang, Preslav Nakov, Timothy Baldwin, Jey Han Lau

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 本文提出一个多方面情绪检测框架,分析谣言与非谣言线的情绪差异,揭示谣言引发负面情绪而非谣言引发正面情绪,并通过因果分析揭示情绪传播机制。

Comments Accepted to ICWSM 2025 MisD Workshop

详情
AI中文摘要

在线社交媒体中的谣言对现代社会构成重大风险,推动了对谣言发展机制的深入理解。本文聚焦谣言与情绪在线讨论中的交互,构建了一个多方面情绪分析框架,对比谣言与非谣言线,并进行情绪的关联与因果分析。我们应用该框架于现有广泛使用的谣言数据集,进一步理解在线社交媒体线的情绪动态。框架显示谣言引发更多负面情绪(如愤怒、恐惧、悲观),而非谣言引发更多积极情绪。情绪具有传染性,谣言传播负面情绪,非谣言传播正面情绪。因果分析显示惊讶连接谣言与其他情绪;悲观来自悲伤和恐惧,而乐观源于喜悦和爱。

英文摘要

Rumours in online social media pose significant risks to modern society, motivating the need for better understanding of how they develop. We focus specifically on the interface between emotion and rumours in threaded discourses, building on the surprisingly sparse literature on the topic which has largely focused on single aspect of emotions within the original rumour posts themselves, and largely overlooked the comparative differences between rumours and non-rumours. In this work, we take one step further to provide a comprehensive analytical emotion framework with multi-aspect emotion detection, contrasting rumour and non-rumour threads and provide both correlation and causal analysis of emotions. We applied our framework on existing widely-used rumour datasets to further understand the emotion dynamics in online social media threads. Our framework reveals that rumours trigger more negative emotions (e.g., anger, fear, pessimism), while non-rumours evoke more positive ones. Emotions are contagious, rumours spread negativity, non-rumours spread positivity. Causal analysis shows surprise bridges rumours and other emotions; pessimism comes from sadness and fear, while optimism arises from joy and love.

2504.08609 2026-06-16 cs.CL cs.AI 版本更新

A Survey of Machine Learning Models and Datasets for the Multi-label Classification of Textual Hate Speech in English

面向英文文本仇恨言论多标签分类的机器学习模型与数据集综述

Julian Bäumler, Louis Blöcher, Lars-Joel Frey, Xian Chen, Markus Bayer, Christian Reuter

发表机构 * Technical University of Darmstadt, Science and Technology for Peace and Security (PEASEC)(德累斯顿技术大学,和平与安全科学技术(PEASEC))

AI总结 本文综述了46篇英文文献,分析了28个适合多标签分类模型训练的数据集,揭示了标签集、大小、元概念等的异质性,并指出评估不一致、BERT和RNN偏好等关键问题,提出十项研究建议。

Comments 35 pages, 4 figures, 4 tables

详情
Journal ref
ACM Transactions on Knowledge Discovery from Data (2026)
AI中文摘要

在线仇恨言论的传播对个人、在线社区和社会整体都有严重负面影响。鉴于此以及海量仇恨内容的规模,内容审核和执法人员及研究人员对机器学习模型自动分类仇恨言论产生了兴趣。尽管大多数科学作品将仇恨言论分类视为二元任务,但实践中往往需要区分子类型,例如根据目标、严重程度或合法性,这可能在个别内容上重叠。因此,研究者创建了数据集和机器学习模型,将文本数据中的仇恨言论分类视为多标签问题。本文首次系统全面地综述了英文文献中这一新兴研究领域的科学文献(N=46)。我们贡献了28个适合训练多标签分类模型的数据集的简要概述,揭示了标签集、大小、元概念、标注过程和标注者间一致性的显著异质性。对24篇提出合适分类模型的出版物的分析进一步揭示了评估不一致以及对双向编码表示变换器(BERT)和循环神经网络(RNN)的偏好。我们识别出训练数据不平衡、依赖众包平台、小而稀疏的数据集以及缺失方法学一致性为关键开放问题,并提出了十项研究建议。

英文摘要

The dissemination of online hate speech can have serious negative consequences for individuals, online communities, and entire societies. This and the large volume of hateful online content prompted both practitioners', i.e., in content moderation or law enforcement, and researchers' interest in machine learning models to automatically classify instances of hate speech. Whereas most scientific works address hate speech classification as a binary task, practice often requires a differentiation into sub-types, e.g., according to target, severity, or legality, which may overlap for individual content. Hence, researchers created datasets and machine learning models that approach hate speech classification in textual data as a multi-label problem. This work presents the first systematic and comprehensive survey of scientific literature on this emerging research landscape in English (N=46). We contribute with a concise overview of 28 datasets suited for training multi-label classification models that reveals significant heterogeneity regarding label-set, size, meta-concept, annotation process, and inter-annotator agreement. Our analysis of 24 publications proposing suitable classification models further establishes inconsistency in evaluation and a preference for architectures based on Bidirectional Encoder Representation from Transformers (BERT) and Recurrent Neural Networks (RNNs). We identify imbalanced training data, reliance on crowdsourcing platforms, small and sparse datasets, and missing methodological alignment as critical open issues and formulate ten recommendations for research.

2502.05214 2026-06-16 eess.IV cs.AI cs.CV 版本更新

CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models

CoRPA: 基于概念向量扰动和生成模型的胸部X光图像对抗生成

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

发表机构 * School of Informatics, University of Edinburgh(信息学院,爱丁堡大学) NHS Lothian(NHS洛锡安)

AI总结 本文提出CoRPA,一种针对医学影像领域的临床聚焦对抗攻击框架,通过概念向量扰动生成对抗性影像报告和图像,揭示医疗AI在真实临床场景下的脆弱性。

详情
AI中文摘要

深度学习模型在医学图像分类任务中的应用日益广泛,旨在提高诊断准确性、减轻医务人员负担并改善患者预后。然而,其对对抗攻击的脆弱性对患者安全构成重大风险。当前攻击方法使用通用技术如模型查询或像素值扰动生成对抗样本以欺骗模型。这些方法可能无法充分解决源于临床错误的特征遗漏或误识别问题。我们提出基于概念的报告扰动攻击(CoRPA),一种专注于临床的黑盒对抗攻击框架,专门针对医学影像领域。CoRPA利用临床概念生成对抗性放射学报告和图像,以接近现实的临床误诊场景。我们使用MIMIC-CXR-JPG数据集中的胸部X光影像和放射学报告验证了CoRPA的实用性。评估显示,对传统对抗攻击具有强大鲁棒性的深度学习模型在面对CoRPA的临床聚焦扰动时显著更脆弱。这突显了在医疗AI系统中解决领域特定脆弱性的重要性。通过引入专门的对抗攻击框架,本研究为开发在真实世界中可靠、安全的AI模型提供了基础,确保其在高风险临床环境中的安全可靠部署。

英文摘要

Deep learning models for medical image classification tasks are becoming widely implemented in AI-assisted diagnostic tools, aiming to enhance diagnostic accuracy, reduce clinician workloads, and improve patient outcomes. However, their vulnerability to adversarial attacks poses significant risks to patient safety. Current attack methodologies use general techniques such as model querying or pixel value perturbations to generate adversarial examples designed to fool a model. These approaches may not adequately address the unique characteristics of clinical errors stemming from missed or incorrectly identified clinical features. We propose the Concept-based Report Perturbation Attack (CoRPA), a clinically-focused black-box adversarial attack framework tailored to the medical imaging domain. CoRPA leverages clinical concepts to generate adversarial radiological reports and images that closely mirror realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our evaluation reveals that deep learning models exhibiting strong resilience to conventional adversarial attacks are significantly less robust when subjected to CoRPA's clinically-focused perturbations. This underscores the importance of addressing domain-specific vulnerabilities in medical AI systems. By introducing a specialized adversarial attack framework, this study provides a foundation for developing robust, real-world-ready AI models in healthcare, ensuring their safe and reliable deployment in high-stakes clinical environments.

2410.20066 2026-06-16 eess.SP cs.AI 版本更新

A Multi-Modal Non-Invasive Deep Learning Framework for Progressive Prediction of Seizures

一种用于癫痫发作渐进预测的多模态非侵入式深度学习框架

Ali Saeizadeh, Douglas Schonholtz, Joseph S. Neimat, Pedram Johari, Tommaso Melodia

发表机构 * Institute for the Wireless Internet of Things, Northeastern University, Boston, MA, U.S.A.(无线物联网研究所,东北大学,波士顿,马萨诸塞州,美国) University of Louisville, Louisville, KY, U.S.A.(路易斯维尔大学,路易斯维尔,肯塔基州,美国)

AI总结 本文提出一种基于非侵入式多模态传感器网络的深度学习框架,用于癫痫发作的渐进预测,通过提高预测精度和实时处理能力,实现95%的灵敏度和98%的特异度。

Comments 4 pages, 5 figures, Proceedings of the IEEE 20th International Conference on Body Sensor Networks (BSN), October 2024

详情
Journal ref
2024 IEEE 20th International Conference on Body Sensor Networks (BSN)
AI中文摘要

本文介绍了一种创新框架,旨在通过非侵入式多模态传感器网络的深度学习方法,实现癫痫发作的渐进预测。癫痫是一种严重影响神经系统的疾病,影响全球约6500万人,其中相当一部分患者对药物治疗反应不佳。为解决这一挑战,我们倡导预测系统,能够及时向高风险个体发出警报,使他们能够采取预防措施。我们的框架利用先进的深度学习技术,并使用来自非侵入式脑电图(EEG)和心电图(ECG)传感器网络的个性化数据,从而提高预测准确性。算法被优化为在边缘设备上进行实时处理,以减轻隐私问题和云方案中固有的数据传输开销,最终节省电池电量。此外,我们的系统预测癫痫发作的倒计时时间(在发作前15分钟间隔内最多一小时),为预防措施提供关键时间。我们的多模态模型在29名患者中实现了95%的灵敏度、98%的特异度和97%的准确率。

英文摘要

This paper introduces an innovative framework designed for progressive (granular in time to onset) prediction of seizures through the utilization of a Deep Learning (DL) methodology based on non-invasive multi-modal sensor networks. Epilepsy, a debilitating neurological condition, affects an estimated 65 million individuals globally, with a substantial proportion facing drug-resistant epilepsy despite pharmacological interventions. To address this challenge, we advocate for predictive systems that provide timely alerts to individuals at risk, enabling them to take precautionary actions. Our framework employs advanced DL techniques and uses personalized data from a network of non-invasive electroencephalogram (EEG) and electrocardiogram (ECG) sensors, thereby enhancing prediction accuracy. The algorithms are optimized for real-time processing on edge devices, mitigating privacy concerns and minimizing data transmission overhead inherent in cloud-based solutions, ultimately preserving battery energy. Additionally, our system predicts the countdown time to seizures (with 15-minute intervals up to an hour prior to the onset), offering critical lead time for preventive actions. Our multi-modal model achieves 95% sensitivity, 98% specificity, and 97% accuracy, averaged among 29 patients.

2406.07277 2026-06-16 cs.CL cs.AI cs.MA 版本更新

Speaking Your Language: Spatial Relationships in Interpretable Emergent Communication

说出你的语言:可解释的涌现交流中的空间关系

Olaf Lipinski, Adam J. Sobey, Federico Cerutti, Timothy J. Norman

发表机构 * University of Southampton(索姆塞特大学) The Alan Turing Institute(艾伦·图灵研究所) University of Brescia(布雷西亚大学)

AI总结 本文研究了智能体如何通过空间关系交流,展示了其能发展出表达观察部分关系的语言,实现90%以上的准确率,并证明该语言可被人类解读。

Comments Accepted at NeurIPS 2024. 18 pages, 3 figures

详情
Journal ref
In Advances in Neural Information Processing Systems (Vol. 37, pp. 140113-140137) 2024
AI中文摘要

有效的交流需要能够参照观察中的特定部分相对于其他部分的能力。尽管涌现交流文献在开发各种语言属性方面取得成功,但尚未有研究展示出此类位置参照的出现。本文展示了智能体如何在观察中交流空间关系。结果表明,智能体可以发展出能够表达其观察部分之间关系的语言,在训练于需要此类交流的指称游戏中,准确率超过90%。使用词组测量方法,我们展示了智能体如何创建此类参照。此分析表明,智能体使用非组合性和组合性信息的混合来传达空间关系。我们还证明了涌现语言可被人类解读。通过与接收智能体交流测试翻译准确性,接收智能体使用该词典部分达到78%以上的准确率,证实了该涌现语言的解读成功。

英文摘要

Effective communication requires the ability to refer to specific parts of an observation in relation to others. While emergent communication literature shows success in developing various language properties, no research has shown the emergence of such positional references. This paper demonstrates how agents can communicate about spatial relationships within their observations. The results indicate that agents can develop a language capable of expressing the relationships between parts of their observation, achieving over 90% accuracy when trained in a referential game which requires such communication. Using a collocation measure, we demonstrate how the agents create such references. This analysis suggests that agents use a mixture of non-compositional and compositional messages to convey spatial relationships. We also show that the emergent language is interpretable by humans. The translation accuracy is tested by communicating with the receiver agent, where the receiver achieves over 78% accuracy using parts of this lexicon, confirming that the interpretation of the emergent language was successful.

2410.11861 2026-06-16 cs.HC cs.AI 版本更新

Investigating Role of Big Five Personality Traits in Audio-Visual Rapport Estimation

探究大五人格特质在音频视觉共情估计中的作用

Takato Hayashi, Ryusei Kimura, Ryo Ishii, Shogo Okada

发表机构 * Japan Advanced Institute of Science and Technology(日本科学技术先进研究院) Human Informatics Laboratories, NTT Corporation(NTT公司人因实验室)

AI总结 本研究探讨了大五人格特质在朋友间音频视觉共情估计中的作用,通过比较有无人格特质输入的模型,发现其能提升共情估计性能,并分解出感知者效应、目标效应和关系效应。

Comments 9 pages, 5 figures

详情
Journal ref
International Conference on Automatic Face and Gesture Recognition (FG2025)
AI中文摘要

在社交互动中自动估计共情是情感计算的核心组成部分。最近的研究表明,使用参与者的个性特质作为模型输入可以提高初始互动中共情估计的性能。本研究探讨这一发现是否适用于朋友间的互动,通过开发利用非语言线索(音频和面部表情)作为输入的共情估计模型进行研究。我们的实验结果表明,将大五特征(BFFs)添加到非语言特征中可以提高双人互动中自我报告共情的估计性能。接下来,我们通过比较有无BFFs的模型,揭示BFFs如何提高共情估计性能。我们使用社会关系模型将共情评分分解为感知者效应(人们对他人的评分倾向)、目标效应(人们被他人评分的倾向)和关系效应(人们对特定人的独特评分)。然后分析BFFs在捕捉每种效应中的贡献程度。我们的分析表明,感知者和目标的BFFs使估计模型分别捕捉感知者和目标效应。此外,我们的实验结果表明,面部表情特征与BFFs的组合不仅在估计共情评分方面表现最佳,还在估计三种效应方面也表现最佳。本研究是理解为何基于个性的互识感知估计模型能实现高估计性能的第一步。

英文摘要

Automatic rapport estimation in social interactions is a central component of affective computing. Recent reports have shown that the estimation performance of rapport in initial interactions can be improved by using the participant's personality traits as the model's input. In this study, we investigate whether this findings applies to interactions between friends by developing rapport estimation models that utilize nonverbal cues (audio and facial expressions) as inputs. Our experimental results show that adding Big Five features (BFFs) to nonverbal features can improve the estimation performance of self-reported rapport in dyadic interactions between friends. Next, we demystify how BFFs improve the estimation performance of rapport through a comparative analysis between models with and without BFFs. We decompose rapport ratings into perceiver effects (people's tendency to rate other people), target effects (people's tendency to be rated by other people), and relationship effects (people's unique ratings for a specific person) using the social relations model. We then analyze the extent to which BFFs contribute to capturing each effect. Our analysis demonstrates that the perceiver's and the target's BFFs lead estimation models to capture the perceiver and the target effects, respectively. Furthermore, our experimental results indicate that the combinations of facial expression features and BFFs achieve best estimation performances not only in estimating rapport ratings, but also in estimating three effects. Our study is the first step toward understanding why personality-aware estimation models of interpersonal perception accomplish high estimation performance.

2409.06708 2026-06-16 cs.CY cs.AI cs.HC 版本更新

Ensuring Fairness with Transparent Auditing of Quantitative Bias in AI Systems

通过透明审计量化偏见确保公平性

Chih-Cheng Rex Yuan, Bow-Yaw Wang

发表机构 * Institute of Information Science, Academia Sinica(中科院信息所)

AI总结 本文提出了一种透明的AI公平性审计框架,通过第三方审计员和系统提供者共同参与,结合统计方法公开审查AI系统,以解决AI决策中的偏见问题。

详情
Journal ref
Proc. 2024 Pacific Neighborhood Consortium Annual Conference and Joint Meetings (PNC), Seoul, Republic of Korea, 2024, pp. 25-32
AI中文摘要

随着AI的快速发展,将其整合到决策过程中成为趋势。然而,AI系统可能表现出偏见,导致决策者做出不公平的结论。值得注意的是,美国司法系统中用于评估再犯风险的COMPAS系统被发现偏袒多数种族群体;具体而言,它违反了称为'均衡机会'的公平标准。已提出多种评估AI公平性的措施。我们提出了一种审计AI公平性的框架,涉及第三方审计员和AI系统提供者,并创建了一个工具来系统审查AI系统。该工具是开源且公开可用的。与传统AI系统不同,我们倡导透明的白盒和基于统计的方法。该方法可用于第三方审计员、AI开发者或公众在判断AI系统公平性标准时进行参考。

英文摘要

With the rapid advancement of AI, there is a growing trend to integrate AI into decision-making processes. However, AI systems may exhibit biases that lead decision-makers to draw unfair conclusions. Notably, the COMPAS system used in the American justice system to evaluate recidivism was found to favor racial majority groups; specifically, it violates a fairness standard called equalized odds. Various measures have been proposed to assess AI fairness. We present a framework for auditing AI fairness, involving third-party auditors and AI system providers, and we have created a tool to facilitate systematic examination of AI systems. The tool is open-sourced and publicly available. Unlike traditional AI systems, we advocate a transparent white-box and statistics-based approach. It can be utilized by third-party auditors, AI developers, or the general public for reference when judging the fairness criterion of AI systems.