arXivDaily: arXiv daily academic digest, updated Monday through Friday
cs.AI Artificial Intelligence 347

1. Agents, Planning, and Decision-Making (24 papers)

2606.11349 2026-06-11 cs.AI cs.HC New submission

Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents

知道何时提问:分层语言代理的自门控澄清机制

Aijing Gao, Yiming Kang, Mengdie Flora Wang, Jae Oh Woo

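Affiliations * Amazon Web Services

AI summary: Proposes the ACTION-RATING framework, which places clarification requests inside the agent's action space on an ordinal scale shared with navigation, enabling self-gated clarification in hierarchical reasoning; two information-seeking modes, mandatory and opportunistic, improve decision accuracy.

AI Chinese abstract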

在分层推理中,失败通常源于中间决策点,代理在没有意识到缺乏关键信息的情况下错误地选择了分支。我们不将澄清视为外部不确定性触发,而是提出ACTION-RATING,一种将澄清置于代理动作空间内、与导航共享序数尺度的公式,使得在每个决策点提问与行动直接竞争,并在中间状态可观察求助行为。从代理自身的评分中涌现出两种结构上不同的信息寻求模式:强制性(无可行分支)和机会性(尽管有领先候选但仍有残余不确定性)。在协调关税表分类(30,000节点分类树,三个基准,跨4个家族的9个LLM)上,我们观察到从强制性澄清到机会性澄清的机制转变,信息寻求有效性(ISE,一个局部诊断指标,定义为帮助交互后正确下一步导航步骤的比例,非最终任务指标)从50%上升到74%。三个诊断对比未能复现此结构。可分离性测试表明,当答案质量下降(准确率下降18.8%)时,信息寻求模式(模式分裂、ISE排名)保持不变,支持代理寻求帮助的位置与其所获帮助质量之间的经验分离。在受控答案通道下,10位数字准确率提升达+16.2%;我们将其解读为更好定位所能释放的上限,而非部署估计。

English abstract

In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ACTION-RATING, a formulation that places it inside the agent's action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states. Two structurally distinct information-seeking modes emerge from the agent's own ratings: mandatory (no viable branch) and opportunistic (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9 LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, with Information-Seeking Effectiveness (ISE), a local diagnostic defined as the fraction of help interactions followed by a correct next navigation step (not a final-task metric), rising from 50% to 74%. Three diagnostic contrasts fail to reproduce this structure. A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (-18.8% accuracy), supporting an empirical separation between where an agent seeks help and the quality of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at 10-digit; we read this as an upper bound on what better localization could unlock, not a deployment estimate.
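The ISE definition in the abstract above (fraction of help interactions followed by a correct next navigation step) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `HelpInteraction` record, the `classify_decision` rule, the tie-breaking in favor of navigation, and the `viable_threshold` value are all assumptions introduced here to make the two modes concrete.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class HelpInteraction:
    """One clarification event at an intermediate decision point (hypothetical record)."""
    mode: str                # "mandatory" or "opportunistic"
    next_step_correct: bool  # was the navigation step right after the help correct?


def classify_decision(branch_ratings: Sequence[int], ask_rating: int,
                      viable_threshold: int = 3) -> str:
    """Compare asking with acting on one shared ordinal scale.

    Assumption: the agent asks only when the rating for asking strictly exceeds
    the best branch rating, and a branch counts as viable at or above
    viable_threshold.
    """
    best_branch = max(branch_ratings)
    if ask_rating <= best_branch:
        return "navigate"
    if best_branch < viable_threshold:
        return "mandatory"      # no viable branch
    return "opportunistic"      # leading candidate exists, uncertainty remains


def information_seeking_effectiveness(
        interactions: List[HelpInteraction]) -> Optional[float]:
    """Fraction of help interactions followed by a correct next navigation step."""
    if not interactions:
        return None  # undefined when the agent never asked for help
    hits = sum(1 for it in interactions if it.next_step_correct)
    return hits / len(interactions)
```

Because ISE only looks at the step immediately after each help interaction, it can be computed per mode by filtering the list on `mode` before calling the function.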

2606.11680 2026-06-11 cs.AI cs.CL cs.LG New submission

Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

先组织再检索:面向高效智能体的层次化记忆导航

Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He

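Affiliations * Duke University; Snowflake AI Research

AI summary: Proposes the HORMA framework, which builds a file-system-like hierarchical memory structure and uses a lightweight navigation agent trained with reinforcement learning for efficient retrieval, improving performance on long-horizon tasks while reducing token consumption.

AI Chinese abstract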

大型语言模型(LLM)智能体由于固有的无状态性,在处理长时任务时面临挑战,所有任务相关信息必须编码到不断增长的输入上下文中,导致推理质量下降、推理成本增加和延迟升高,因此需要高效的工作记忆机制。然而,现有方法要么依赖有损压缩,要么基于相似性检索,往往无法捕捉多步智能体任务所需的时间结构和因果依赖关系。在这项工作中,我们提出了HORMA,一种层次化组织与检索记忆智能体,它将经验组织成类似文件系统的层次化结构,其中总结的实体链接到相应的原始轨迹,从而在保留详细信息的同时实现高效访问。HORMA将工作记忆分解为两个阶段:结构化记忆构建和基于导航的检索。构建模块通过区分由信息缺失导致的失败和由误导性或过载上下文导致的失败,迭代地优化经验的结构化方式。导航模块使用强化学习训练的轻量级代理遍历层次结构,选择最小但充分的上下文,从而减少关键执行路径上的延迟。在ALFWorld、LoCoMo和LongMemEval上,HORMA在受限上下文预算下提升了任务性能,同时在长对话任务中最多仅使用基线22.17%的令牌。与现有方法相比,它始终实现了更好的效率-性能权衡,并能有效泛化到未见任务。

English abstract

Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.

2606.11830 2026-06-11 cs.AI New submission

Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

面向医学研究分析的技能增强型AI代理:一项NSCLC转录组生物标志物任务中的探索性多模型人类评估

Qianyu Yao, Fei Sun, Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen, Wenjie Xu, Bo li, Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang

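Affiliations * AIPOCH PTE. LTD.

AI summary: Using a non-small cell lung cancer (NSCLC) immunotherapy biomarker task, this study evaluates whether skill-augmented AI agents produce higher-quality transcriptomic research-analysis outputs than native AI, and finds a directional quality signal that does not reach statistical significance.

AI Chinese abstract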

背景。大型语言模型和AI代理越来越多地用于支持生物医学研究,但原生模型输出可能遗漏关键分析步骤、误用方法或夸大结论。我们评估了自主访问医学研究技能包是否与更高质量的AI生成转录组研究分析输出相关,相比于无技能的原生AI。方法。我们使用非小细胞肺癌免疫治疗生物标志物任务进行了一项探索性多模型人类评估。测试了六个模型骨干。评估包括21个匿名输出:9个原生AI输出和12个通过OpenClaw实现的AI代理生成的技能增强输出。四位非专家生物医学评审员和两位盲法专家评估每个输出,每位评审员类型给出两个评分。主要结局是专家评定的总体质量。结果。技能增强输出在专家总体质量上方向性高于原生AI输出(均值5.50 vs 5.11;差异=0.39;bootstrap 95% CI,-0.04至0.90;Welch p=0.156)。非专家评审员质量呈现相同方向(均值4.72 vs 4.47;差异=0.26;bootstrap 95% CI,-0.25至0.80;Welch p=0.373)。专家一致性有限(单评分ICC=-0.15),模型特异性效应为描述性且异质性。结论。在此探索性样本中,自主技能访问显示出方向性质量信号,但信号小于专家评分噪声,不应视为确证性证据。这些发现主要激励了具有更强可靠性控制、平台复制和生物学有效性评估的技能增强型AI代理的更大规模评估。

English abstract

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.
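The results above report a difference in means with a bootstrap 95% CI and a Welch test. The sketch below shows how such quantities are commonly computed using only the Python standard library; it is an illustration under assumptions, not the study's analysis code. The percentile bootstrap, the resample count, and the seed are choices made here, and no rating data from the study is reproduced. The p-value is omitted because it needs a t-distribution CDF, which the standard library does not provide.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def bootstrap_mean_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b).

    Each group is resampled with replacement independently (an assumption;
    the study may have resampled differently, e.g. by output or by rater).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.mean(resampled_a) - statistics.mean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

A CI that spans zero, as in both comparisons reported above, is consistent with the abstract's reading of the signal as directional rather than confirmatory.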

2606.11851 2026-06-11 cs.AI New submission

StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery

StatefulDiscovery:开放科学发现中证据校准的声明形成

Jiayao Chen, Shi Liu, Linyi Yang

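AI summary: Proposes the StatefulDiscovery framework, which externalizes exploration state to coordinate frontier selection, evidence acquisition, and claim adjudication, yielding more high-quality, well-supported claims across 40 real-data tasks.

AI Chinese abstract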

开放式的科学发现要求智能体超越为预定义问题执行分析。在多轮探索中,发现智能体必须决定哪些现象值得研究,同时避免过度解释,即新出现的声明超出支持它们的分析证据范围。这产生了一个证据校准问题:探索轨迹必须与声明状态耦合,以便证据既能指导下一步探索什么,也能指导可以声明什么。我们引入了StatefulDiscovery,一个将调查状态外部化并利用它来协调前沿选择、证据获取和声明裁决的发现框架。我们在40个真实数据发现任务上评估了StatefulDiscovery。与几个基线相比,StatefulDiscovery总体上产生了更多被认为既有充分支持又有高价值的声明。消融实验表明,结构化假设、局部裁决和前沿控制有助于性能。这些结果共同表明,显式的发现状态可以将探索与证据校准的声明形成耦合起来。

English abstract

Open-ended scientific discovery asks agents to move beyond executing analyses for predefined questions. Across multiple rounds of exploration, a discovery agent must decide which phenomena warrant investigation while avoiding overinterpretation, where emerging claims exceed the evidential scope of the analyses supporting them. This creates an evidence-calibration problem: the exploration trajectory must be coupled with claim status so that evidence can guide both what to investigate next and what can be claimed. We introduce StatefulDiscovery, a discovery framework that externalizes investigation state and uses it to coordinate frontier selection, evidence acquisition, and claim adjudication. We evaluate StatefulDiscovery across 40 real-data discovery tasks. Compared with several baselines, StatefulDiscovery produces more claims overall judged to be both well-supported and high-value. Ablations indicate that structured hypotheses, local adjudication, and frontier control contribute to performance. Together, these results suggest that explicit discovery state can couple exploration with evidence-calibrated claim formation.

2606.11199 2026-06-11 cs.CL cs.AI cs.IR cs.LG Cross-list

NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track

NightFeats @ MMU-RAGent NeurIPS 2025: 面向文本到文本轨道的上下文优化多智能体RAG系统

Quentin Fever, Naziha Aslam

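AI summary: Proposes NightFeats, a structured multi-agent RAG system that decomposes knowledge synthesis into three phases (retrieval, curation, and composition) and introduces temporal-semantic reranking, contradiction reconciliation, and a citation-preserving architecture, surpassing proprietary baselines in the MMU-RAGent competition.

Comments
5 pages, 1 figure, 1 table. NeurIPS 2025 Competition Track (MMU-RAGent). System developed October 2025
AI Chinese abstract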

我们提出NightFeats,一个结构化的多智能体检索增强生成(RAG)系统,提交至NeurIPS 2025的MMU-RAGent竞赛,并在文本到文本轨道中获得最佳动态评估奖。本文并非以基准最大化目标,而是提出一个原则性流水线,将知识合成为三个协调阶段:检索、策展和组合,每个阶段由显式的中间表示和交接契约控制。受智能体上下文工程(ACE)启发,该系统引入时序语义重排序、有界矛盾协调和保留引用的组合作为核心架构原语。竞赛结果表明,NightFeats在LLM-as-a-Judge和人类Likert评估中超越了包括Claude-SonnetV2和Nova-Pro在内的商业基线,证实了架构透明性和可验证证据基础比单纯优化自动相似度指标的系统更符合人类偏好。

English abstract

We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track. Rather than targeting benchmark maximization, this work proposes a principled pipeline that decomposes knowledge synthesis into three coordinated phases: retrieval, curation, and composition, each governed by explicit intermediate representations and handoff contracts. Inspired by Agentic Context Engineering (ACE), the system introduces temporal-semantic reranking, bounded contradiction reconciliation, and citation-preserving composition as core architectural primitives. Competition results show that NightFeats surpasses proprietary baselines including Claude-SonnetV2 and Nova-Pro on LLM-as-a-Judge and Human Likert evaluations, confirming that architectural transparency and verifiable evidence grounding are better aligned with human preferences than systems optimizing narrowly for automatic similarity metrics.

2606.11290 2026-06-11 cs.LG cs.AI cs.CL Cross-list

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

FlowBank: 通过预计算与复用实现查询自适应智能体工作流优化

Lingzhi Yuan, Chenghao Deng, Fangxu Yu, Souradip Chakraborty, Mohammad Rostami, Furong Huang

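AI summary: Proposes the FlowBank framework, which precomputes diverse workflows, compresses them into a compact portfolio, and adaptively selects the best workflow at inference time to balance performance and cost, achieving the highest average score across five benchmarks at a controllable cost.

AI Chinese abstract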

基于大型语言模型的多智能体系统日益强大,但当前的智能体工作流优化范式存在令人不满意的权衡。任务级方法花费大量离线计算却只部署单个工作流,导致互补候选未被使用;而查询级方法为每个查询合成新工作流,推理成本高昂。我们的动机分析表明,这些范式更多是互补而非竞争:离线搜索中发现的工作流通常解决不同子集的查询,许多由昂贵查询级生成处理的查询已经可以通过更便宜的预计算工作流解决。这暗示了一个不同的目标:与其寻找一个普遍最佳的工作流或为每个实例重新生成,不如构建一个紧凑的、可复用的互补工作流库,并在推理时自适应地选择。为此,需要解决三个耦合问题:生成互补而非冗余的候选、压缩成小型可部署组合、在性能-成本权衡下为每个查询分配正确的工作流。我们提出FlowBank,一个基于组合的智能体工作流优化的三阶段框架。多样化阶段提出DiverseFlow,引导搜索覆盖未充分覆盖的查询,产生高覆盖率的候选池。精炼阶段提出CuraFlow,将候选池压缩为冗余最小的紧凑组合。匹配阶段将部署建模为查询-工作流二分图上的边值预测,将每个传入查询路由到预测效用最佳的组合成员。在五个基准上,FlowBank在评估方法中实现了最高平均得分,同时保持成本竞争力,相比最强的自动和手工基线分别相对提升4.26%和14.92%。

English abstract

Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy only a single workflow, leaving complementary candidates unused, while query-level methods synthesize a new workflow per query at substantial inference cost. Our motivating analysis shows these paradigms are more complementary than competing: workflows discovered during offline search often solve different subsets of queries, and many queries handled by expensive query-level generation can already be solved by cheaper precomputed workflows. This suggests a different objective: rather than searching for one universally best workflow or regenerating one per instance, we should build a compact bank of reusable, complementary workflows and select among them adaptively at inference time. Doing so requires solving three coupled problems: generating complementary rather than redundant candidates, compressing them into a small deployable portfolio, and assigning each query to the right workflow under a performance-cost trade-off. To this end, we present FlowBank, a three-stage framework for portfolio-based agentic workflow optimization. Diversifying proposes DiverseFlow to steer search toward under-covered queries and produce a high-coverage candidate pool. Curating proposes CuraFlow to compress this pool into a compact portfolio with minimal redundancy. Matching casts deployment as edge-value prediction on a query-workflow bipartite graph and routes each incoming query to the portfolio member with the best predicted utility. Across five benchmarks, FlowBank achieves the highest average score among the evaluated methods while remaining cost-competitive, improving over the strongest automated and handcrafted baselines by 4.26% and 14.92% relative, respectively.
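The curate-then-match idea in the abstract above can be illustrated with a toy sketch. It is not CuraFlow or FlowBank's matching model: a greedy maximum-coverage heuristic stands in for portfolio compression, and a plain utility-minus-cost rule stands in for the learned edge-value predictor. The function names, the `solved_by` table, and the `cost_weight` parameter are all hypothetical.

```python
def greedy_portfolio(solved_by, budget):
    """Pick up to `budget` workflows that together solve the most queries.

    solved_by maps a workflow name to the set of query ids it solved during
    offline search. Greedy max-coverage is a stand-in chosen for this sketch.
    """
    remaining = dict(solved_by)
    covered, portfolio = set(), []
    while remaining and len(portfolio) < budget:
        # Iterating over sorted names makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda w: len(remaining[w] - covered))
        if not remaining[best] - covered:
            break  # every remaining candidate is redundant
        portfolio.append(best)
        covered |= remaining.pop(best)
    return portfolio, covered


def route(predicted_utility, cost, cost_weight):
    """Send a query to the portfolio member with the best utility/cost score.

    predicted_utility and cost map workflow name to a number; in the paper the
    utilities come from a predictor, here they are simply given as input.
    """
    return max(sorted(predicted_utility),
               key=lambda w: predicted_utility[w] - cost_weight * cost[w])
```

With `cost_weight` set to zero the router always picks the highest predicted utility; raising it shifts queries toward cheaper workflows, which is one simple way to expose the performance-cost trade-off the abstract describes.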

2606.11520 2026-06-11 cs.CL cs.AI cs.LG Cross-list

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

ISE:一种基于执行的多轮操作系统代理轨迹合成方法

Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu

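AI summary: Proposes ISE, a three-stage paradigm that generates multi-turn agent trajectories through structured intent construction, role-locked user simulation, and a real execution environment; fine-tuning on the resulting data markedly improves agent tool-use performance.

Comments
13 pages, 6 figures. Dataset and code: this https URL
AI Chinese abstract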

训练有能力的操作系统代理需要同时捕获结构化用户意图、多轮任务委派和基于工具执行的数据——这些属性在现有数据集中缺失。我们提出ISE(意图->模拟->执行),一种三阶段合成范式,联合解决这些差距。阶段1通过4D框架(人物角色x领域x任务x复杂度)构建约50000个结构化意图;去重后池中包含43956个唯一意图,并在mpnet-base-v2嵌入(余弦核,q=1)上获得61.57的Vendi分数。阶段2通过角色锁定的用户模拟器驱动多轮用户-代理交互,将每轮用户交互基于实际执行结果,生成23132条完整轨迹,平均8.12轮用户交互和68.24轮总对话。阶段3在实时、隔离的操作系统工作空间中执行每个工具调用,生成真实的故障恢复动态而非模拟响应。在ISETrace上微调后,使用Qwen3-8B在标准协议下的代理工具使用任务中,ClawEval pass@1从19.3提升至37.7。该结果优于零样本GPT-4o和四倍大的Qwen3-32B基础模型。对阶段2的消融实验证明多轮模拟带来了大部分性能提升。我们在该https URL发布所有源代码和数据集。

English abstract

Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution, properties absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning on ISETrace improves ClawEval pass@1 from 19.3 to 37.7 using Qwen3-8B on agent tool-use tasks with a standard protocol. This result outperforms zero-shot GPT-4o and the Qwen3-32B base model, which is four times larger. An ablation on Stage 2 shows that multi-turn simulation accounts for a large portion of the performance gain. We release all source code and the dataset at this https URL.
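Stage 1 above builds intents over a four-axis grid (Persona x Domain x Task x Complexity) and then deduplicates them. A minimal sketch of that shape follows, assuming each axis is a plain list of labels and that duplicates are detected by normalized exact match. Both are simplifications made here; the abstract does not say how intents are generated from grid cells or how deduplication is performed.

```python
from itertools import product


def build_intent_grid(personas, domains, tasks, complexities):
    """Enumerate every cell of the 4D grid as a structured intent seed."""
    return [
        {"persona": p, "domain": d, "task": t, "complexity": c}
        for p, d, t, c in product(personas, domains, tasks, complexities)
    ]


def deduplicate(intent_texts):
    """Drop repeats after lowercasing and collapsing whitespace.

    Exact-match deduplication is an assumption for illustration only.
    The first occurrence of each intent is kept, in original order.
    """
    seen, unique = set(), []
    for text in intent_texts:
        key = " ".join(text.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

The grid size is the product of the axis lengths, so a few dozen labels per axis is already enough to reach tens of thousands of cells.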

2606.11869 2026-06-11 cs.SE cs.AI Cross-list

Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production

层层代理:从底层到生产构建自定义AI代理的方法论

Marc Alier Forment, Juanan Pereira, Francisco José García-Peñalvo, María José Casañ Guerrero

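AI summary: Proposes a framework-free methodology for building custom AI agents end to end, based on two preconditions (the LLM as a software component, and building blocks) and three practices (prototyping, packaging as a CLI, and agent-tests-agent).

AI Chinese abstract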

自定义AI代理是存在于自己应用程序中的代理,它们与自己的数据和工具交互,强制执行自己的安全边界,并携带自己的品牌和审计跟踪。它们与通用层级的区别在于适配性而非能力:每个代理由维护它的工程师为一项工作而构建。目前没有已发布的实践说明如何端到端地构建一个自定义AI代理。各个部分随处可见(函数调用API、模型上下文协议、可配对的代码代理),但将这些部分串联起来的实践存在于播客、博客和泄露的系统提示中。本文将这些实践记录为一种方法论,即“层层代理”:两个前提条件一次交叉并保持,然后三个实践在代理的生命周期中重复。前提条件是(P1)底层:将LLM作为软件组件,框架化为工具、系统,然后在提示缓存下框架化为消息;(P2)构建块:函数调用、MCP、CLI编排、liteshell模式、代理循环、技能、角色、钩子和脚手架。三个实践是(P3)使用通用代理进行原型设计;(P4)收获、折叠并将结果作为CLI发布,即Turtle模式;(P5)代理测试代理,其中通用代理通过行为场景驱动自定义代理,这是对经典测试的补充而非替代。工作循环是P3到P4再到P5并返回,一个推论自然得出:多代理编排就是CLI组合。该方法论在构造上是无框架的。它从AAC中提炼而来,AAC是开源LAMB平台的自定义代理,由一名开发人员使用AI配对程序员在大约十天内构建并投入生产。我们将其作为一种可迁移的实践呈现,独立于任何语言或框架。

English abstract

Custom AI agents are agents that live inside their own application, talk to their own data and tools, enforce their own security boundaries, and carry their own brand and audit trail. What separates them from the general-purpose tier is fit, not capability: each is built for one job, by the engineer who will maintain it. No published practice sets out how to build one end to end. The pieces are everywhere (function-calling APIs, the Model Context Protocol, code agents to pair with), but the practice that chains them lives in podcasts, blogs, and leaked system prompts. This paper writes that practice down as a methodology, Agents All the Way Down: two preconditions crossed once and kept, then three practices repeated for the agent's life. The preconditions are (P1) Substrate, the LLM as a software component, framed as tools, then system, then messages under prompt-caching; and (P2) Building blocks: function calling, MCP, CLI orchestration, the liteshell pattern, the agent loop, skills, characters, hooks, and scaffolding. The practices are (P3) prototype with a general-purpose agent; (P4) harvest, fold, and ship the result as a CLI, the Turtle pattern; and (P5) agent-tests-agent, in which a general-purpose agent drives it through behavioural scenarios, a complement to classical testing, not a replacement. The working loop is P3 to P4 to P5 and back, and one corollary falls out for free: multi-agent orchestration is just CLI composition. The methodology is framework-free by construction. It was distilled from the AAC, a custom agent for the open-source LAMB platform, built in about ten days by one developer with an AI pair-programmer and in production. We present it as a transferable practice, independent of any language or framework.

2606.11926 2026-06-11 cs.CL cs.AI Cross-list

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

通过假设树精炼迈向通用自主研究

Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou

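Affiliations * Gaoling School of Artificial Intelligence, Renmin University of China; Microsoft Research

AI summary: Proposes the Arbor framework, which runs a long-horizon autonomous research loop through Hypothesis Tree Refinement (HTR) and attains more than 2.5x the average relative held-out gain of Codex and Claude Code across six real research tasks.

AI Chinese abstract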

科学进步依赖于探索、实验和抽象的重复循环。研究人员测试候选方向,解释证据,并将所得经验用于后续尝试。我们研究AI代理如何自主地长期运行这一循环。我们提出了Arbor,一个用于自主研究的通用框架,它结合了长期存在的协调器、短期执行器和假设树精炼(HTR),后者是一个持久树,跨时间连接假设、工件、证据和提炼的见解。协调器管理树上的全局研究策略,而执行器在隔离的工作树中实现和测试单个假设。当结果返回时,Arbor更新树,传播可重用的经验,优化搜索前沿,并接受验证过的改进。这种设计将自主研究从一系列局部尝试转变为累积过程,其中策略、执行和证据跨时间传递。我们在自主优化(AO)下评估Arbor,这是一种操作设置,代理通过迭代实验改进初始研究工件,无需逐步人工监督。在模型训练、工具工程和数据合成等六项真实研究任务中,Arbor在所有六项任务上取得了最佳保留结果,在相同任务接口和资源预算下,平均相对保留增益是Codex和Claude Code的2.5倍以上。在MLE-Bench Lite上,Arbor使用GPT-5.5达到86.36%的任何奖牌,这是我们比较中的最强结果。

English abstract

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.

2606.11976 2026-06-11 cs.SE cs.AI Cross-list

Exploration Structure in LLM Agents for Multi-File Change Localization

LLM代理中的探索结构用于多文件变更定位

Akeela Darryl Fattha, Kia Ying Chua, Lingxiao Jiang, Laura Wynter

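AI summary: For changes that span multiple subsystems, proposes a non-linear, domain-scoped parallel agent exploration structure; on the SWE Bench Pro benchmark, domain-scoped parallel agent spawning with a small Haiku-class model achieves a high micro F1, outperforming linear sequential exploration.

AI Chinese abstract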

软件工程工具越来越依赖基于LLM的代理来定位需要更改的文件以解决软件问题。大多数AI代理以线性方式探索仓库,即每步访问一个目录或文件。我们假设这对于跨越多个子系统的变更存在结构上的不匹配。我们比较了线性顺序探索与非线性的、领域范围的并行代理探索。使用SWE Bench Pro作为初始基准,我们专注于ansible作为示例。我们构建了一种方法,用于在单个基础提交上对GitHub问题进行持久会话评估。我们将我们的非线性领域代理文件遍历系统与没有直接仓库访问权限的基础LLM、具有持久Python REPL的单代理递归语言模型(RLM)基线以及使用Codex 5.5 High的外部CLI基线进行比较。使用小型Haiku类模型的领域范围并行代理生成在Haiku类模型中实现了最高的微F1分数,且领先幅度较大。在我们自己的扩展基准(包括2025年和2026年更近期的PR)上,领域代理仅次于更大的Codex 5.5 High。在原始、精选的2020年SWE-bench Pro基准上,较大的Sonnet普通LLM基线通过预测少量文件获得了更高的微F1分数,从而实现了更高的精确度,但所有黄金召回率显著较低。我们还提出了三个额外发现。首先,文档演化是所有方法都未解决的潜在依赖关系。其次,天真的文件系统访问可能会因测试文件过度预测而降低定位性能。最后,强制多代理协商没有明显帮助,并且会大幅增加令牌成本。

English abstract

Software engineering tools increasingly rely on LLM-based agents to localize the files that must change to resolve a software issue. Most AI agents explore repositories linearly, that is, visiting one directory or file per step. We postulate that this is a structural mismatch for changes that span several subsystems. We compare linear sequential exploration against non-linear, domain-scoped parallel agentic exploration. Using SWE Bench Pro as the initial benchmark, we focus on ansible as an exemplar. We construct an approach for persistent-session evaluation of GitHub issues anchored at a single base commit. We compare our non-linear domain-agent file traversal system against a base LLM without direct repository access, a single-agent Recursive Language Model (RLM) baseline with a persistent Python REPL, and an external CLI baseline using Codex 5.5 High. Domain-scoped parallel agent spawning with a small Haiku-class model achieves the highest micro F1 among Haiku-class models by a large margin. The domain-agent system ranks second, behind only the much larger Codex 5.5 High, on our own expanded benchmark, which includes more recent PRs from 2025 and 2026. On the original, curated, 2020 SWE-bench Pro benchmark, a larger Sonnet plain LLM baseline attains higher micro F1 by predicting few files, leading to higher precision, but at significantly lower all-gold recall. We also present three additional findings. First, documentation evolution is a latent dependency unresolved by any approach. Second, naive file system access can degrade localization, driven by test-file over-prediction. Lastly, forced multi-agent consultation does not measurably help and raises token cost substantially.
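The abstract above compares systems by micro F1 and all-gold recall for file localization. The sketch below shows how these are commonly computed from per-issue sets of predicted and gold file paths. Micro-averaging pools true positives, false positives, and false negatives across issues before computing precision and recall. The reading of all-gold recall as the share of issues whose gold files are all predicted is an assumption made here, since the abstract does not define it.

```python
def micro_f1(predicted, gold):
    """Micro-averaged precision, recall, and F1 over per-issue file sets.

    predicted and gold are parallel lists of sets of file paths.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        tp += len(pred & ref)   # files correctly predicted
        fp += len(pred - ref)   # predicted but not in the gold change
        fn += len(ref - pred)   # gold files the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def all_gold_recall(predicted, gold):
    """Share of issues where every gold file appears in the prediction.

    This definition is an assumption for illustration.
    """
    if not gold:
        return 0.0
    hits = sum(1 for pred, ref in zip(predicted, gold) if ref <= pred)
    return hits / len(gold)
```

The two metrics pull in different directions: predicting few files tends to raise precision and micro F1, while covering every gold file for an issue rewards broader predictions, which matches the Sonnet baseline trade-off described above.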

2606.12191 2026-06-11 cs.CL cs.AI Cross-list

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

面向大语言模型的智能体环境工程:环境建模、合成、评估与应用综述

Jiachun Li, Zhuoran Jin, Tianyi Men, Yupu Hao, Kejian Zhu, Lingshuai Wang, Dongqi Huang, Longxiang Wang, Shengjia Hua, Lu Wang, Jinshan Gao, Hongbang Yuan, Ruilin Xu, Kang Liu, Jun Zhao

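AI summary: From the perspective of the environment engineering lifecycle, this survey systematically reviews the modeling, synthesis, evaluation, and application of agentic environments, covering eight attributes and eight domains, two synthesis paradigms, four agent-evolution pathways, and three environment-evolution paradigms.

Comments
63 pages, 10 figures
AI Chinese abstract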

环境作为基于大语言模型(LLM)的智能体在不同场景下的交互系统,在推动模型能力持续演进中扮演关键角色。尽管重要性显著,现有工作缺乏系统分类与深入分析。本文从环境工程生命周期的视角系统研究了当前关于智能体环境的研究,涵盖其建模、合成、评估与应用。具体而言,本文首先从八个属性和八个领域引入代表性环境,详细分析其发展路径并突出核心能力。其次,针对自动化环境合成,介绍了两种范式,如符号合成和神经合成。本文还展示了每种范式下的不同环境评估方法。第三,从智能体-环境协同演化的角度讨论了相应的环境应用。具体来说,本文从四个互补视角描述了动态环境中智能体演化的主要路径:以记忆为中心的经验演化、以编排为中心的工作流演化、以轨迹为中心的离线演化和以探索为中心的在线演化。并识别了三种环境演化范式,即神经驱动、难度驱动和规模驱动方法。最后,讨论了几个有前景的未来方向,包括环境即服务、多智能体环境和神经符号环境。

English abstract

Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in driving the continual evolution of model capabilities. Despite this importance, existing work lacks a systematic categorization and deep analysis. This paper systematically studies current research on agentic environments from the perspective of the environment engineering lifecycle, covering their modeling, synthesis, evaluation, and application. Specifically, the paper first introduces representative environments from the perspectives of eight attributes and eight domains, providing detailed analyses of their development paths and highlighting their core capabilities. Second, for automated environment synthesis, two paradigms are introduced: symbolic synthesis and neural synthesis. The paper also presents the environment evaluation methods used in each paradigm. Third, the corresponding environment applications are discussed from the perspective of agent-environment co-evolution. Specifically, the paper characterizes the primary pathways for agent evolution in dynamic environments from four complementary perspectives: memory-centric experience evolution, orchestration-centric workflow evolution, trajectory-centric offline evolution, and exploration-centric online evolution. In addition, three paradigms of environment evolution are identified, namely neural-driven, difficulty-driven, and scaling-driven approaches. Finally, several promising future directions are discussed, including Environment-as-a-Service, Multi-agent Environments, and Neural-Symbolic Environments.

2606.12384 2026-06-11 cs.LG cs.AI Cross-list

APPO: Agentic Procedural Policy Optimization

APPO: 智能体程序策略优化

Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, Pengkun Wang, Xiangxiang Chu

Affiliations * University of Science and Technology of China; AMAP, Alibaba Group; Southern University of Science and Technology

AI summary: Proposes APPO, which improves credit assignment in agentic reinforcement learning through fine-grained branching and procedure-level advantage scaling, with an average gain of nearly 4 points across 13 benchmarks.

Comments
25 pages, including 14 pages of main text and 11 pages of appendix; work in progress
AI Chinese abstract

近期智能体强化学习(RL)的进展显著提升了大型语言模型智能体的多轮工具使用能力。然而,现有方法大多基于粗粒度的启发式单元(如工具调用边界或固定工作流)进行信用分配,难以识别哪些中间决策影响下游结果。本文从两个角度研究智能体RL:何处分支以及分支后如何分配信用。我们的初步分析表明,有影响力的决策点广泛分布在生成序列中,而非集中于工具调用,而仅凭token熵无法可靠反映其对最终结果的影响。基于这些观察,我们提出智能体程序策略优化(APPO),将分支和信用分配从粗粒度的交互单元转移到序列中的细粒度决策点。APPO使用分支分数选择分支位置,该分数结合了token不确定性和后续延续的策略诱导似然增益,从而在过滤掉虚假高熵位置的同时实现更有针对性的探索。它进一步引入了程序级优势缩放,以更好地在分支展开中分配信用。在13个基准上的实验表明,APPO在保持高效工具调用和行为可解释性的同时,一致地将强智能体RL基线提升了近4个点。

English abstract

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping tool calls efficient and maintaining behavior interpretability.

2501.12942 2026-06-11 cs.AI Updated version

Offline Diffusion Policy for Multi-User Delay-Constrained Scheduling

面向多用户延迟约束调度的离线扩散策略

Zhuoran Li, Ruishuo Chen, Hai Zhong, Longbo Huang

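AI summary: Proposes SOCD, an offline reinforcement learning algorithm that uses a diffusion policy with critic-network guidance to learn efficient scheduling policies from offline data without online interaction, performing well in partially observable and large-scale environments.

AI Chinese abstract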

有效的多用户延迟约束调度在诸多实际应用中至关重要,包括具身AI、即时通讯、直播和数据中心管理,这些场景需要在具有不同延迟敏感性的用户之间进行高效资源分配。在这些场景中,调度器必须实时做出决策,以满足延迟和资源约束,同时无需事先了解系统动态,这些动态通常是时变的且难以估计。当前基于学习的方法通常需要在训练阶段与实际系统进行在线交互。因此,这些方法往往难以实施或不切实际,因为它们会显著降低系统性能并产生高昂的服务成本。为应对这些挑战,我们提出了一种新颖的基于离线强化学习的算法,名为SOCD(通过离线学习与批评引导和扩散模型进行调度),该算法仅从预先收集的离线数据中学习高效调度策略。SOCD创新性地采用了扩散策略,并辅以无采样的批评网络进行策略引导。通过将拉格朗日乘子优化融入离线强化学习,SOCD仅从可用数据集中高效训练出高质量且满足约束的策略,无需与系统进行在线交互。实验结果表明,SOCD对多种系统动态具有鲁棒性,包括部分可观测和大规模环境,并且与现有方法相比性能更优。

English abstract

Effective multi-user delay-constrained scheduling is crucial in various real-world applications, including embodied AI, instant messaging, live streaming, and data center management, where efficient resource allocation is required among users with diverse delay sensitivities. In these scenarios, schedulers must make real-time decisions to satisfy both delay and resource constraints without prior knowledge of system dynamics, which are often time-varying and challenging to estimate. Current learning-based methods typically require online interactions with actual systems during the training stage. Therefore, these approaches are often difficult or impractical, as they can significantly degrade system performance and incur substantial service costs. To address these challenges, we propose a novel offline reinforcement learning-based algorithm, named Scheduling By Offline Learning with Critic Guidance and Diffusion Model (SOCD), to learn efficient scheduling policies purely from pre-collected offline data. SOCD innovatively employs a diffusion policy, complemented by a sampling-free critic network for policy guidance. By integrating Lagrangian multiplier optimization into offline reinforcement learning, SOCD efficiently trains high-quality constraint-aware policies exclusively from available datasets, eliminating the need for online interactions with the system. Experimental results demonstrate that SOCD is resilient to various system dynamics, including partially observable and large-scale environments, and delivers superior performance compared to existing methods.
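The abstract above mentions integrating Lagrangian multiplier optimization into offline reinforcement learning to obtain constraint-aware policies. The sketch below shows the generic pattern only: a reward penalized by constraint violation, and projected dual ascent on the multiplier. It is not SOCD's actual objective or update rule, and the names `budget`, `step_size`, and both function names are assumptions introduced for illustration.

```python
def penalized_reward(reward, cost, multiplier, budget):
    """Lagrangian-relaxed objective for one sample.

    The policy is trained to maximize reward minus the multiplier times the
    amount by which the resource cost exceeds the budget.
    """
    return reward - multiplier * (cost - budget)


def update_multiplier(multiplier, average_cost, budget, step_size):
    """One projected dual-ascent step on the Lagrange multiplier.

    The multiplier grows while the policy's average cost exceeds the budget
    and shrinks otherwise; it is clipped at zero so the penalty never turns
    into a bonus for spending resources.
    """
    return max(0.0, multiplier + step_size * (average_cost - budget))
```

Alternating policy updates on the penalized reward with multiplier updates of this kind is a standard way to handle constraints in reinforcement learning; in an offline setting the average cost would be estimated from the pre-collected dataset rather than from online rollouts.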

2509.23248 2026-06-11 cs.AI cs.NI Updated version

Resource-Aware LLM Reasoning for Mobile Edge General Intelligence

面向移动边缘通用智能的资源感知LLM推理

Mingyi Luo, Ruichen Zhang, Xiangwang Hou, Jun Du, Chunxiao Jiang, Yong Ren, Shiwen Mao

AI总结 提出联合优化框架,通过自适应CoT提示和分布式MoE架构协同优化推理深度、专家激活和传输功率,在资源受限的移动边缘环境中实现LLM高效推理,推理质量与资源效率平衡,额外推理时间小于1秒时准确率和延迟满足率均达90%。

详情
AI中文摘要

大型语言模型(LLM)的快速发展催生了具有强大推理和自主决策能力的智能体人工智能(AI)。与边缘计算的集成推动了移动边缘通用智能(MEGI)的发展,将实时、隐私保护的推理带到网络边缘。然而,在MEGI环境中部署基于LLM的智能体AI推理面临重大挑战,原因是推理的高计算需求与边缘设备的有限资源。为应对这些挑战,我们提出了一种在MEGI中高效部署LLM推理的联合优化框架。首先,我们系统回顾增强方法,识别适合边缘适配的机制。随后,我们提出一个分布式框架,通过自适应思维链(CoT)提示协同推理增强,并通过分布式专家混合(MoE)架构实现可扩展部署。该方法的一个重要创新是将推理深度建模为动态网络资源变量,并与专家激活和传输功率联合优化。该机制使系统能够根据任务需求和设备能力动态调节专家网络和推理复杂度。在移动边缘环境中的实验评估表明,所提框架有效平衡了推理质量和资源效率。结果显示,在额外推理时间小于1秒的情况下,准确率和延迟满足率均可达到90%,验证了在资源受限的MEGI系统中部署复杂LLM推理的实际可行性。

英文摘要

The rapid advancement of large language models (LLMs) has enabled the emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities. Its integration with edge computing has led to the development of Mobile Edge General Intelligence (MEGI), which brings real-time, privacy-preserving reasoning to the network edge. However, deploying LLM-based agentic AI reasoning in MEGI environments poses significant challenges due to the high computational demands of reasoning and the limited resources of edge devices. To address these challenges, we propose a joint optimization framework for efficient LLM reasoning deployment in MEGI. First, we systematically review enhancement methods to identify mechanisms suitable for edge adaptation. Subsequently, we present a distributed framework that synergizes reasoning enhancement via adaptive CoT prompting with scalable deployment through a distributed MoE architecture. An important innovation of this approach involves modeling reasoning depth as a dynamic network resource variable, which is optimized jointly with expert activation and transmission power. This mechanism allows the system to dynamically regulate expert networks and reasoning complexity according to task requirements and device capabilities. Experimental evaluations in mobile edge environments demonstrate that the proposed framework effectively balances reasoning quality and resource efficiency. The results show that with less than one second of additional inference time, both accuracy and latency satisfaction rate can reach 90%, validating the practical viability of deploying sophisticated LLM reasoning in resource-constrained MEGI systems.

2511.19314 2026-06-11 cs.AI cs.CL cs.LG 版本更新

PRInTS: Reward Modeling for Long-Horizon Information Seeking

PRInTS:面向长程信息检索的奖励建模

Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

AI总结 提出PRInTS生成式过程奖励模型,通过密集评分和轨迹摘要提升长程信息检索中工具交互与推理能力,在多个基准上超越前沿模型。

详情
Comments
ACL 2026, 19 pages, code: this https URL
AI中文摘要

信息检索是AI智能体的核心能力,要求它们在整个长轨迹中收集和推理工具生成的信息。然而,这种多步骤信息检索任务对于基于语言模型的智能体仍然具有挑战性。虽然过程奖励模型(PRM)可以通过在测试时对候选步骤进行排序来指导智能体,但现有的PRM——设计用于具有二元判断的短程推理——无法捕捉信息检索步骤的更丰富维度,例如工具交互和对工具输出的推理,也无法处理长程任务中快速增长的上下文。为了解决这些限制,我们引入了PRInTS,一种具有双重能力的生成式PRM:(1)基于PRM对步骤质量多个维度(例如,工具输出的解释、工具调用的信息量)的推理进行密集评分,以及(2)轨迹摘要,在压缩不断增长的上下文的同时保留步骤评估所需的基本信息。在FRAMES、GAIA(级别1-3)和WebWalkerQA(简单-困难)基准上对多个模型的广泛评估表明,使用PRInTS进行最佳n采样增强了开源模型以及专门智能体的信息检索能力,以更小的骨干智能体匹配或超越前沿模型,并优于其他强奖励建模基线。

英文摘要

Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test time, existing PRMs, designed for short reasoning with binary judgment, cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple dimensions of step quality (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models reveal that best-of-n sampling with PRInTS enhances information-seeking in open-source models as well as specialized agents, matching or surpassing frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
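Best-of-n step selection with a reward model, as described in the abstract, can be sketched in a few lines. This is a generic illustration, not the PRInTS implementation: `select_step` and the `score_step(summary, step)` callback are hypothetical names, and the scorer here stands in for whatever dense score a PRM would return given a trajectory summary and a candidate step.

```python
from typing import Callable, Sequence


def select_step(
    summary: str,
    candidates: Sequence[str],
    score_step: Callable[[str, str], float],
) -> str:
    """Return the candidate step with the highest score.

    Ties are broken in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate step")
    scored = [(score_step(summary, step), index) for index, step in enumerate(candidates)]
    # Highest score first; among equal scores, the smallest index wins.
    _, best_index = max(scored, key=lambda pair: (pair[0], -pair[1]))
    return candidates[best_index]
```

At each step the agent would sample n candidate actions, call `select_step`, execute the winner, and then refresh the summary so the scorer's input stays compact as the trajectory grows.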

2602.23545 2026-06-11 cs.AI 版本更新

Planning under Distribution Shifts with Causal POMDPs

基于因果POMDP的分布偏移下规划

Matteo Ceriscioli, Karthika Mohan

AI总结 提出因果POMDP框架,通过干预表示环境变化,在部分可观测下维持PWLC性质,实现分布偏移下的规划与更新。

详情
Comments
To appear at the 36th International Conference on Automated Planning and Scheduling (ICAPS-26)
AI中文摘要

在现实世界中,规划常常受到分布偏移的挑战。因此,在一组条件下获得的环境模型在状态分布或环境动态变化时可能不再有效,进而导致先前学习的策略失败。在这项工作中,我们提出了一个使用因果知识构建的部分可观测马尔可夫决策过程(POMDP)的理论框架,用于在部分可观测性下进行规划。通过将环境中的变化表示为对该因果POMDP的干预,该框架能够评估假设变化下的计划,并主动识别环境中哪些组件已被改变。我们展示了如何维护和更新关于潜在状态和底层领域的信念,并证明了在该增强信念空间中值函数保持分段线性凸(PWLC)。在分布偏移下保持PWLC的优势在于,通过基于$\alpha$-向量的POMDP方法保持规划的可处理性。

英文摘要

In the real world, planning is often challenged by distribution shifts. As such, a model of the environment obtained under one set of conditions may no longer remain valid as the distribution of states or the environment dynamics change, which in turn causes previously learned strategies to fail. In this work, we propose a theoretical framework for planning under partial observability using Partially Observable Markov Decision Processes (POMDPs) formulated using causal knowledge. By representing shifts in the environment as interventions on this causal POMDP, the framework enables evaluating plans under hypothesized changes and actively identifying which components of the environment have been altered. We show how to maintain and update a belief over both the latent state and the underlying domain, and we prove that the value function remains piecewise linear and convex (PWLC) in this augmented belief space. Preservation of PWLC under distribution shifts has the advantage of maintaining the tractability of planning via $\alpha$-vector-based POMDP methods.
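For readers unfamiliar with the PWLC property, the standard finite-horizon POMDP form is shown below, written over an augmented belief as the abstract describes. The notation is an assumption for illustration: $b(s, d)$ is the belief over latent state $s$ and domain $d$, and $\Gamma$ is a finite set of $\alpha$-vectors; the paper's own symbols may differ.

```latex
V(b) \;=\; \max_{\alpha \in \Gamma} \; \sum_{s \in S} \sum_{d \in D} \alpha(s, d)\, b(s, d)
```

Each $\alpha$-vector defines a function that is linear in $b$, and the maximum of finitely many linear functions is piecewise linear and convex, which is what lets $\alpha$-vector-based solvers represent the value function exactly.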

2605.02411 2026-06-11 cs.AI cs.IR cs.LG cs.MA 版本更新

FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

FitText: 通过模因检索演化智能体工具生态

Kyle Zheng, Han Zhang, Renliang Sun, Chenchen Ye, Wei Wang

AI总结 针对用户任务描述与工具文档间的语义鸿沟,提出FitText框架,将检索嵌入推理循环,通过自然语言伪工具描述迭代优化和模因进化选择,显著提升工具检索性能。

详情
AI中文摘要

用户描述任务的方式与工具文档之间存在语义鸿沟。随着API生态扩展到数万个端点,仅凭初始查询的静态检索无法弥合这一鸿沟:智能体对其所需工具的理解在执行过程中不断演变,但其工具集却保持不变。我们指出,这种检索接口(而非规划)是端到端智能体性能的约束瓶颈,并引入FitText——一个无需训练的框架,通过将检索直接嵌入智能体的推理循环中,使其动态化。FitText将检索视为测试时假设的演化:智能体生成自然语言的伪工具描述(关于所需工具的可修正信念),利用检索反馈迭代优化,并通过随机生成探索多样化的替代方案。模因检索在候选描述上施加进化选择压力,并由避免冗余搜索的工具记忆引导。在ToolRet(三个领域)上,FitText的重构策略在所有基模型上相比静态查询检索将NDCG@5提升了2.7至10.6个点;在StableToolBench(16,464个API)上使用GPT-5.4-mini时,模因检索达到了84.3%的合并通过率,相比静态查询检索绝对提升了26.7个点。

英文摘要

A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We identify this retrieval interface, not planning, as the binding constraint on end-to-end agent performance, and introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText treats retrieval as test-time evolution of hypotheses: the agent generates natural-language pseudo-tool descriptions (revisable beliefs about the tool it needs), refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (three domains), FitText's reformulation strategies improve NDCG@5 by 2.7 to 10.6 points over static query retrieval across all base models; on StableToolBench (16,464 APIs) with GPT-5.4-mini, Memetic reaches an 84.3% pooled pass rate, a 26.7-point absolute gain over static query retrieval.
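The NDCG@5 metric reported above can be computed as in the following sketch. It uses the linear-gain variant of DCG and assumes the relevance list covers every candidate tool for the query; whether ToolRet uses this exact variant is not stated in the abstract.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k ranked items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Here `relevances` lists the relevance labels in the order the retriever ranked the tools, so a relevant tool ranked first gives 1.0 and the score drops as relevant tools slide down the ranking.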

2606.05922 2026-06-11 cs.AI cs.CL cs.LG 版本更新

Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference

在黑暗中演化智能体:基于自我偏好的回顾性工具集优化

Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

AI总结 提出一种自监督方法RHO,利用历史轨迹回滚和自偏好选择优化智能体工具集,无需真实标签,在SWE-Bench Pro上通过单轮优化将通过率从59%提升至78%。

详情
Comments
Code: this https URL; Project website: this https URL
AI中文摘要

AI智能体依赖于技能、工具和工作流程的整合(称为工具集)来解决复杂问题。持续改进这一工具集对于适应新任务至关重要。然而,现有的优化方法通常需要真实验证集,但在实际部署场景中获取此类标注数据非常困难。为解决这一问题,我们提出回顾性工具优化(RHO),一种仅利用过去轨迹的自监督方法。具体而言,RHO从历史轨迹中选择一个多样化的困难任务核心集,并并行重新求解。智能体通过自我验证和自我一致性分析这些回滚,然后生成候选工具集更新,并通过自身的成对自我偏好选择最有效的更新。我们在三个不同领域(涵盖软件工程、技术工作和知识工作)上评估RHO。值得注意的是,单轮优化无需任何外部评分即可将SWE-Bench Pro上的通过率从59%提升至78%。此外,我们的分析表明RHO有效针对先前的失败模式。因此,优化后的工具集改变了智能体的行为模式,并在长周期会话中保持更高的准确性。

英文摘要

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.

2606.07909 2026-06-11 cs.AI cs.CL 版本更新

MemToolAgent: Leveraging Memory for Tool Using Agents Based on Environment and User Feedback

MemToolAgent:基于环境与用户反馈,利用记忆增强工具使用智能体

Suleyman Armagan Er, Danilo Ribeiro, Yogesh Virkar, Surafel Lakew, Adi Kalyanpur, James Gung, Thomas Delteil, Arshit Gupta

发表机构 * AWS AI University of Washington(华盛顿大学)

AI总结 提出MemToolAgent框架,通过记忆管理提升大语言模型代理的工具使用能力,包含记忆提取和动态检索模块,在三个基准上分别提升29%、80%和17%。

详情
Comments
8 pages, 5 figures
AI中文摘要

现代大语言模型(LLM)代理可以使用外部工具帮助用户解决复杂任务。然而,对于需要从长期历史事件或先前的代理-环境交互中学习的问题,LLM代理需要使用记忆机制来存储和检索经验。尽管对话代理存在复杂的记忆系统,但很少有研究实证检验如何通过过去的用户-代理对话来提升代理的工具使用能力。我们提出MemToolAgent,一个通过记忆管理改善工具使用的框架。我们的方法包含一个记忆提取模块,将过去的经验处理成结构化的记忆条目,以及一个检索模块,动态选择存储记忆条目的子集。这使得无需LLM微调即可实现更个性化和准确的响应,与用户偏好和反馈保持一致。总之,本工作有三个主要贡献:(1)统一的记忆条目格式,无需LLM微调即可改善通用和个性化工具使用;(2)基于反思的记忆提取,利用环境和用户反馈将错误执行提炼为批评并存储;(3)一个检索模块,根据记忆相似度分布选择使用多少过去经验。MemToolAgent在WorkBench、NESTFUL和PEToolBench基准上相比强基线分别实现了29%、80%和17%的相对改进。

英文摘要

Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learning from long-term historical events or from previous agent-environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents' tool-using capabilities through past user-agent conversations. We propose MemToolAgent, a framework that improves tool use through memory management. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine-tuning. In summary, this work has three main contributions: (1) a unified memory entry format that improves both general-purpose and personalized tool use without LLM fine-tuning, (2) a reflection-based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and (3) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively.
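One way a retrieval module could choose how many memories to use from the similarity distribution is sketched below. The rule shown (a similarity floor plus a margin relative to the best match) is a hypothetical stand-in chosen for illustration; the abstract does not specify MemToolAgent's actual selection rule, and `select_memories` and its thresholds are assumed names and values.

```python
from typing import List, Sequence


def select_memories(
    similarities: Sequence[float],
    max_k: int = 5,
    min_sim: float = 0.5,
    margin: float = 0.1,
) -> List[int]:
    """Return indices of memories to retrieve, best match first.

    A memory is kept only if it clears the similarity floor and lies
    within `margin` of the top score, so the count adapts per query.
    """
    ranked = sorted(enumerate(similarities), key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return []
    top_score = ranked[0][1]
    return [
        index
        for index, score in ranked[:max_k]
        if score >= min_sim and top_score - score <= margin
    ]
```

With this rule a query that has several close matches retrieves several entries, while a query with no sufficiently similar memory retrieves none.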

2606.09365 2026-06-11 cs.AI cs.CL 版本更新

Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

经验造就熟练:通过自进化技能记忆实现可泛化的医疗智能体推理

Haoran Sun, Wenjie Li, Yujie Zhang, Zekai Lin, Fanrui Zhang, Kaitao Chen, Xingqi He, Yichen Li, Mianxin Liu, Lei Liu, Yankai Jiang

发表机构 * Fudan University(复旦大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Innovation Institute(上海创新研究院) Huazhong University of Science and Technology(华中科技大学)

AI总结 提出SkeMex框架,通过技能记忆实现医疗智能体后部署自进化,无需更新模型权重,在临床任务中优于现有记忆型智能体。

详情
AI中文摘要

医疗智能体系统越来越期望支持交互式临床决策,而不仅仅是静态问答。在这种设置中,有效的智能体必须跨演化病例重用先前经验,然而现有的记忆机制通常保留原始历史轨迹,这些轨迹冗余、嘈杂且难以管理。更重要的是,它们很少区分哪些记忆对未来推理真正有用。这限制了它们积累紧凑且可靠的经验以进行长期临床推理的能力。为弥补这一差距,我们提出SkeMex,一种部署后自进化框架,通过基于技能的记忆改进医疗智能体,无需更新模型权重。SkeMex将信息丰富的交互轨迹提炼为结构化技能,编码可重用的程序性知识,并将其组织成涵盖通用、任务特定和行动级经验的多分支存储库。为确定哪些记忆应被重用和保留,SkeMex从环境反馈中估计上下文相关的效用,并用其指导价值感知的检索和存储库治理。闭环的“读-写-评估-治理”生命周期通过写入新技能、更新效用、促进有用记忆和移除有害条目进一步支持持续进化。跨不同临床任务的实验表明,SkeMex在离线和在线设置中均持续优于代表性记忆型智能体。它还能跨模型骨干泛化并支持可迁移的技能记忆。所有数据和代码将公开发布。

英文摘要

Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are redundant, noisy, and difficult to govern. More importantly, they rarely distinguish which memories are truly useful for future reasoning. This limits their ability to accumulate compact and reliable experience for long-horizon clinical reasoning. To close this gap, we propose SkeMex, a post-deployment self-evolution framework that improves medical agents through a skill-based memory without updating model weights. SkeMex distills informative interaction trajectories into structured skills that encode reusable procedural knowledge, and organizes them into a multi-branch repository spanning general, task-specific, and action-level experience. To determine which memories should be reused and retained, SkeMex estimates context-dependent utility from environment feedback and uses it to guide value-aware retrieval and repository governance. A closed-loop "Read-Write-Assess-Govern" lifecycle further supports continual evolution by writing new skills, updating utilities, promoting useful memories, and removing harmful entries. Experiments across diverse clinical tasks show that SkeMex consistently outperforms representative memory-based agents in both offline and online settings. It also generalizes across model backbones and supports transferable skill memory. All data and code will be released publicly.

2509.10303 2026-06-11 cs.LG cs.AI 版本更新

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

超越次优性:离线强化学习通过随机解决方案学习有效调度

Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang

AI总结 提出离线RL算法CDQAC,从次优静态数据集学习调度策略,在JSP/FJSP上超越在线RL和强启发式方法,仅需1-5%数据,发现状态-动作覆盖比轨迹质量更重要。

详情
AI中文摘要

在线强化学习(RL)方法通过与模拟环境直接交互学习调度策略,在作业车间调度(JSP)和柔性作业车间调度(FJSP)问题上表现出色。然而,这些方法通常需要大量的训练交互,限制了其样本效率和实际适用性。受此挑战的启发,我们引入了保守离散分位数演员-评论家(CDQAC),这是一种离线RL算法,可以直接从静态、次优数据集中学习有效的调度策略。CDQAC将基于分位数的评论家与延迟策略更新相结合,以估计机器-操作对的回报分布。在JSP和FJSP基准上的大量实验表明,CDQAC始终优于生成数据的启发式方法,超越了最先进的离线和在线RL基线,并且具有很高的样本效率,仅需原始数据集的1%到5%即可学习高质量策略。我们的分析表明,在调度中,离线RL的性能主要受状态-动作覆盖范围而非单个轨迹质量的影响。调度将密集奖励(与完工时间目标对齐)与跨启发式方法的等长轨迹相结合,从而能够从广泛的行为中有效学习。与此观察一致,在由简单随机启发式方法生成、覆盖范围更广的数据集上训练时,CDQAC的性能优于在由更强启发式方法(如遗传算法)生成的数据集上训练的策略。

英文摘要

Online reinforcement learning (RL) approaches have demonstrated strong performance on Job Shop Scheduling (JSP) and Flexible JSP (FJSP) problems by learning scheduling policies through direct interaction with simulated environments. However, these methods often require extensive training interactions, limiting their sample efficiency and practical applicability. Motivated by this challenge, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), an offline RL algorithm that learns effective scheduling policies directly from static, suboptimal datasets. CDQAC couples a quantile-based critic with delayed policy updates to estimate the return distribution of machine-operation pairs. Extensive experiments on JSP and FJSP benchmarks demonstrate that CDQAC consistently outperforms the data-generating heuristics, surpasses state-of-the-art offline and online RL baselines, and is highly sample efficient, requiring only 1 to 5% of the original dataset to learn high-quality policies. Our analysis suggests that, in scheduling, offline RL performance is governed mainly by state-action coverage rather than the quality of individual trajectories. Scheduling couples a dense reward aligned with the makespan objective with equal-length trajectories across heuristics, enabling effective learning from a broad range of behaviors. Consistent with this observation, datasets generated by a simple random heuristic, which offer broader coverage, let CDQAC outperform policies trained on datasets produced by stronger heuristics such as Genetic Algorithms.
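A quantile-based critic is typically trained with a quantile regression (pinball) loss, sketched below for a single sample. This shows the generic technique only; whether CDQAC uses this plain form or a smoothed variant such as the quantile Huber loss is not stated in the abstract, and `pinball_loss` is an illustrative name.

```python
def pinball_loss(target: float, prediction: float, tau: float) -> float:
    """Quantile regression loss for quantile level tau in (0, 1).

    Under-estimates (target above prediction) are weighted by tau and
    over-estimates by (1 - tau), so minimizing the expected loss pushes
    the prediction toward the tau-quantile of the target distribution.
    """
    error = target - prediction
    return tau * error if error >= 0 else (tau - 1.0) * error
```

A critic with N output heads would apply this loss once per head with evenly spaced `tau` values, which yields an approximation of the full return distribution rather than only its mean.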

2605.10907 2026-06-11 cs.CR cs.AI 版本更新

Engineering Robustness into Personal Agents with the AI Workflow Store

通过AI工作流存储增强个人代理的鲁棒性

Roxana Geambasu, Mariana Raykova, Pierre Tholoniat, Trishita Tiwari, Lillian Tsai, Wen Zhang

AI总结 本文探讨将严谨的软件工程流程整合到代理循环中,以生成可靠、安全且确定性约束的代理工作流,提升高风险场景下的性能。

详情
AI中文摘要

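当前AI代理的主流范式是“即时(on-the-fly)”循环:代理根据用户提示,在数秒或数分钟内合成计划并执行动作。我们认为,这一范式绕过了规范的软件工程(SE)流程(迭代设计、严格测试、对抗性评估、分阶段部署等),而正是这些流程造就了我们今天使用的(相对)可靠且安全的系统。AI代理专注于快速、实时的合成,是否实际上只是向用户交付即兴的原型,而非适用于高风险场景的系统,而用户可能在不知情的情况下将其用于这类场景?本文主张将严格的SE流程整合到代理循环中,以产出生产级、经过加固且受确定性约束的代理工作流,其表现大幅优于即时合成所得的可能脆弱且易受攻击的结果。这样做可能需要额外的算力和时间;若是如此,我们必须通过在广泛用户群体中的复用来摊销严谨性的成本。我们设想一个AI工作流存储(AI Workflow Store),由经过加固且可复用的工作流组成,代理调用它们时的可靠性和安全性远高于即兴拼凑的工具链。我们概述了这一愿景面临的研究挑战,这些挑战源于更广泛的灵活性与鲁棒性之间的张力,我们认为要有效应对这一张力,就必须超越“即时”范式。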

英文摘要

The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes in response to user prompts. We argue that this paradigm short-circuits disciplined software engineering (SE) processes (iterative design, rigorous testing, adversarial evaluation, staged deployment, and more) that have delivered the (relatively) reliable and secure systems we use today. By focusing on rapid, real-time synthesis, are AI agents effectively delivering users improvised prototypes rather than systems fit for high-stakes scenarios in which users may unwittingly apply them? This paper argues for the need to integrate rigorous SE processes into the agentic loop to produce production-grade, hardened, and deterministically-constrained agent workflows that substantially outperform the potentially brittle and vulnerable results of on-the-fly synthesis. Doing so may require extra compute and time, and if so, we must amortize the cost of rigor through reuse across a broad user community. We envision an AI Workflow Store that consists of hardened and reusable workflows that agents can invoke with far greater reliability and security than improvised tool chains. We outline the research challenges of this vision, which stem from a broader flexibility-robustness tension that we argue requires moving beyond the "on-the-fly" paradigm to navigate effectively.

2605.14084 2026-06-11 cs.SE cs.AI cs.CL 版本更新

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

CRANE:通过空域编辑实现代码代理的约束推理注入

Mingzhi Zhu, Michele Merler, Raju Pavuluri, Stacy Patterson

AI总结 CRANE通过空域编辑技术,结合推理和工具使用能力,提升代码代理性能,在多个基准测试中取得显著成果。

详情
AI中文摘要

代码代理必须同时对长周期的仓库状态进行推理并遵守严格的工具使用协议。在配对的Instruct/Thinking检查点中,这些能力是互补但不一致的。Instruct模型简洁且工具纪律性强,而Thinking模型提供更强的规划和恢复行为,但往往过度思考并降低代理性能。我们提出CRANE(通过空域编辑实现代码代理的约束推理注入),一种无需训练的参数编辑方法,将Thinking-Instruct的delta视为Instruct骨干的候选推理编辑方向池。CRANE结合幅度阈值去噪delta,保守的泰勒门来保留对推理转移和工具使用保留共同有益的编辑,以及渐进的Sigmoid投影来抑制格式关键的更新方向。通过合并配对的Instruct和Thinking检查点,CRANE相对任一单独模型均取得显著提升,同时保持Instruct级别的效率:在Roo-Eval上,它实现了Qwen3-30B-A3B的pass1为66.2%(+19.5%)和Qwen3-Next-80B-A3B的81.5%(+8.7%);在SWE-bench-Verified上,它在两个规模(122/500和180/500)上解决了多达14个额外的实例;在Terminal-Bench v2上,它提高了pass1/pass5高达2.3%/7.8%,分别达到7.6%/17.9%和14.8%/30.3%,在所有三个基准测试中一致超越了其他合并策略。

英文摘要

Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass1/pass5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.

2606.03077 2026-06-11 cs.LG cs.AI cs.DC 版本更新

Libra: Efficient Resource Management for Agentic RL Post-Training

Libra:面向智能体强化学习后训练的高效资源管理

Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu

AI总结 针对智能体强化学习中长尾、非平稳工作负载带来的资源管理挑战,提出Libra系统,通过周期性全局资源规划器和因果驱动多级反馈队列调度器,实现GPU分配优化和请求调度,最高提升3倍吞吐量和2.5倍收敛速度。

详情
Comments
19 pages, 12 figures
AI中文摘要

强化学习(RL)已成为将大型语言模型(LLM)塑造为强大智能体的标准后训练范式。在智能体RL中,rollout阶段在生成轨迹的同时调用工具,产生长尾且非平稳的工作负载,暴露出资源管理中的两个基本挑战。首先,由于长尾分布,一小部分轨迹主导了rollout的完成时间。其次,rollout与训练存在跨阶段不平衡,二者在计算模式、内存需求和对序列长度的敏感性上表现出强烈的不对称性。更为严重的是,序列长度分布随策略演变而持续漂移,使任何静态资源划分都逐渐变得次优。我们提出Libra,一个通过两个核心机制应对这两个挑战的资源管理系统。第一个是全局资源规划器,它联合优化rollout和训练集群间的GPU分配,并利用弹性混合池实现阶段间轻量级、非阻塞的工作节点重新分配。第二个是因果驱动的多级反馈队列(C-MLFQ)调度器,它基于从工具返回结果导出的因果信号(而非依赖脆弱的长度预测)将请求路由到异构的rollout桶。在48个A800 GPU上的评估表明,与基线相比,Libra实现了高达3.0倍的吞吐量提升和高达2.5倍的奖励收敛加速。

英文摘要

Reinforcement learning (RL) has emerged as a standard post-training paradigm for shaping large language models (LLMs) into capable agents. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that expose two fundamental challenges in resource management. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training are subject to cross-stage imbalance, as they exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Compounding this asymmetry, the sequence length distribution drifts continuously as the policy evolves, rendering any static resource split progressively suboptimal. We present Libra, a resource management system to address both challenges via two core mechanisms. The first is a global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0x higher throughput and converges up to 2.5x faster in reward compared to the baselines.
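The multi-level feedback queue that C-MLFQ builds on can be sketched as follows. This is the plain textbook structure, not Libra's scheduler: the class and method names are hypothetical, and the decision of when to demote a request (in Libra, driven by causal signals from tool-return outcomes) is left entirely to the caller.

```python
from collections import deque
from typing import Any, Deque, List, Optional, Tuple


class MultiLevelFeedbackQueue:
    """Minimal MLFQ: level 0 is served first; demotion lowers priority."""

    def __init__(self, levels: int = 3) -> None:
        self.queues: List[Deque[Any]] = [deque() for _ in range(levels)]

    def add(self, request: Any, level: int = 0) -> None:
        # Clamp so requests at the lowest level stay there.
        self.queues[min(level, len(self.queues) - 1)].append(request)

    def pop(self) -> Optional[Tuple[int, Any]]:
        """Return (level, request) from the highest-priority non-empty queue."""
        for level, queue in enumerate(self.queues):
            if queue:
                return level, queue.popleft()
        return None

    def demote(self, request: Any, level: int) -> None:
        """Re-queue a request one level lower, e.g. after it proves long-running."""
        self.add(request, level + 1)
```

The appeal of this structure for long-tailed workloads is that short requests finish at the top level while requests that keep running drift to lower levels, without needing an up-front length prediction.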

2. 知识表示、推理与符号AI 5 篇

2606.11724 2026-06-11 cs.AI 新提交

Mind the Perspective: Let's Reason Recursively for Theory of Mind

注意视角:递归推理实现心智理论

Chao Lei, Guang Hu, Meng Yang, Yanbei Jiang, Nir Lipovetzky

发表机构 * School of Computing and Information Systems, The University of Melbourne, Australia(墨尔本大学计算与信息系统学院) SensiLab, Monash University, Australia(蒙纳士大学SensiLab)

AI总结 提出RecToM框架,通过递归视角构建建模嵌套信念,将高阶信念问题转化为实际世界问题,在多个ToM基准上达到最先进性能。

详情
AI中文摘要

心智理论(ToM)推理需要从部分且不对称的观察中推断智能体的信念,这对大语言模型(LLM)来说仍然是一个开放的挑战。现有的基于提示的方法通过可观察事件过滤或时间信念链来改进ToM推理,但没有显式建模嵌套信念。我们引入了RecToM,一个用于ToM推理的推理时框架,通过递归视角构建来建模嵌套信念。RecToM沿着问题指定的角色链,从先前的角色视角构建每个角色视角,将高阶信念问题简化为最终构建视角内的实际世界问题。我们进一步提供了KD45分析,表明RecToM的视角构建诱导了超越简单事件过滤的良好信念模态。在包括Hi-ToM、Big-ToM和FanToM在内的ToM基准上,跨多个LLM骨干网络的实验表明,RecToM持续优于最近的高级方法,达到了最先进的性能。值得注意的是,RecToM在GPT-5.4和Qwen3.5上达到了Hi-ToM的100%准确率,这是一个需要高阶ToM推理的基准。

英文摘要

Theory of Mind (ToM) reasoning requires inferring agents' beliefs from partial and asymmetric observations, which remains an open challenge for LLMs. Existing prompting-based approaches improve ToM reasoning through observable-event filtering or temporal belief chains, without explicitly modeling nested beliefs. We introduce RecToM, an inference-time framework for ToM reasoning that models nested beliefs via recursive perspective construction. RecToM constructs each character perspective from the preceding character perspective along the character chain specified by the question, reducing higher-order belief questions to actual-world questions within the final constructed perspective. We further provide a KD45 analysis showing that RecToM's perspective construction induces a well-formed belief modality beyond simple event filtering. Experiments on ToM benchmarks, including Hi-ToM, Big-ToM, and FanToM, across multiple LLM backbones show that RecToM consistently outperforms recent advanced approaches, achieving state-of-the-art performance. Notably, RecToM reaches 100% accuracy on Hi-ToM with GPT-5.4 and Qwen3.5, a benchmark requiring higher-order ToM reasoning.

2606.12065 2026-06-11 cs.AI cs.MA 新提交

Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

BIM中几何密集型合规检查自动化:基于图的语义推理框架

Zixuan Xiao, Pei Troh Koh, Jun Ma, Jack C.P. Cheng

AI总结 针对BIM中几何密集型法规自动检查的语义鸿沟问题,提出SGR-BIM图驱动推理框架,通过跨模态知识图谱实现可解释推理,在679个消防规范查询上达到84.3%准确率,较基线提升8.6%。

详情
AI中文摘要

自动化几何密集型法规的合规检查仍然是建筑信息模型(BIM)中的一个重大技术瓶颈,主要原因是高层级法规逻辑与结构化IFC数据之间的语义差异。现有方法通常依赖于静态规则模板,难以遍历多跳推理链或解决跨多个建筑实体的潜在空间依赖关系。为应对这些挑战,提出了一种面向建筑信息模型的空间几何推理系统(SGR-BIM),作为一个集成的图驱动推理框架。SGR-BIM动态构建跨模态知识图谱,对齐用户意图、法规语义和BIM几何,无需硬编码即可实现可解释推理。在来自消防规范的679个专家验证查询上验证,该框架达到了84.3%的准确率,比增强工具的单智能体基线提高了8.6%。本研究提供了一种基于图的语义推理范式,增强了建筑、工程和施工(AEC)行业中自动化几何合规检查工作流的透明度和灵活性。

英文摘要

Automating compliance checking for geometry-intensive regulations remains a significant technical bottleneck in Building Information Modeling (BIM), primarily due to the semantic disparity between high-level regulatory logic and structured IFC data. Existing methods, often reliant on static rule templates, struggle to traverse multi-hop reasoning chains or resolve latent spatial dependencies across multiple building entities. To address these challenges, a Spatial-Geometric Reasoning System for Building Information Modeling (SGR-BIM) is proposed as an integrative graph-driven reasoning framework. SGR-BIM dynamically constructs a cross-modal knowledge graph that aligns user intent, regulatory semantics, and BIM geometry, enabling interpretable reasoning without rigid hard-coding. Validated on 679 expert-verified queries from fire safety codes, the framework achieves 84.3% accuracy, representing an 8.6% improvement over enhanced-tool single-agent baselines. This research provides a graph-based semantic reasoning paradigm, enhancing the transparency and flexibility of automated geometric compliance checking workflows in the Architecture, Engineering, and Construction (AEC) industry.

2601.14764 2026-06-11 cs.AI cs.HC cs.LO 版本更新

An XAI View on Explainable ASP: Methods, Systems, and Perspectives

可解释ASP的XAI视角:方法、系统与展望

Thomas Eiter, Tobias Geibinger, Zeynep G. Saribatur

AI总结 本文从XAI视角综述回答集编程(ASP)的解释方法,分类解释类型并评估现有理论与工具的覆盖范围,指出研究空白与未来方向。

详情
Comments
10 pages
AI中文摘要

回答集编程(ASP)是符号AI中一种流行的声明式推理和问题解决方法。其基于规则的形式化使其天生具有可解释和解释性推理的吸引力,随着可解释AI(XAI)的兴起,这一点日益重要。目前已经开发了许多针对ASP的解释方法和工具,它们通常处理特定的解释设置,可能无法覆盖ASP用户遇到的所有场景。在本综述中,我们从XAI视角出发,概述了与用户解释问题相关的ASP解释类型,并描述了当前理论和工具对其的覆盖情况。此外,我们指出了现有ASP解释方法中的空白,并确定了未来工作的研究方向。

英文摘要

Answer Set Programming (ASP) is a popular declarative reasoning and problem solving approach in symbolic AI. Its rule-based formalism makes it inherently attractive for explainable and interpretive reasoning, which is gaining importance with the surge of Explainable AI (XAI). A number of explanation approaches and tools for ASP have been developed, which often tackle specific explanatory settings and may not cover all scenarios that ASP users encounter. In this survey, we provide, guided by an XAI perspective, an overview of types of ASP explanations in connection with user questions for explanation, and describe their coverage by current theory and tools. Furthermore, we pinpoint gaps in existing ASP explanation approaches and identify research directions for future work.

2603.13854 2026-06-11 cs.LO cs.AI cs.SC 版本更新

Power Term Polynomial Algebra for Boolean Logic

布尔逻辑的幂项多项式代数

Emanuele Sansone, Armando Solar-Lezama

AI总结 提出幂项多项式代数,一种介于CNF和ANF之间的布尔公式表示语言,通过幂项和多项式直接编码CNF子句与单项式族,避免辅助变量和约束,支持代数运算与重写规则。

详情
Comments
Pragmatics of SAT
AI中文摘要

我们引入了幂项多项式代数,这是一种布尔公式的表示语言,旨在桥接合取范式(CNF)和代数范式(ANF)。该语言的动机是这些表示之间的平铺不匹配:直接CNF<->ANF转换可能导致指数爆炸,除非公式被分解成更小的片段,通常通过辅助变量和附加约束。相比之下,我们的框架在表示本身内部解决了这种不匹配,紧凑地编码了单项式的结构化族,同时直接表示CNF子句,从而在抽象层次上避免了辅助变量和约束。我们通过幂项和幂项多项式形式化了该语言,定义了它们的语义,并展示了它们允许对应于布尔多项式加法和乘法的代数运算。我们证明了该语言的几个关键性质:析取子句允许紧凑的规范表示;幂项支持局部缩短和扩展重写规则;原子项的乘积可以在语言内部系统地重写。这些结果共同产生了一个符号演算,使得无需将公式展开为普通ANF即可直接操作公式。由此产生的框架提供了一种新的中间表示和重写演算,桥接了基于子句和代数的推理,并为结构感知的CNF<->ANF转换和混合推理方法提出了新的方向。

英文摘要

We introduce power term polynomial algebra, a representation language for Boolean formulae designed to bridge conjunctive normal form (CNF) and algebraic normal form (ANF). The language is motivated by the tiling mismatch between these representations: direct CNF<->ANF conversion may cause exponential blowup unless formulas are decomposed into smaller fragments, typically through auxiliary variables and side constraints. In contrast, our framework addresses this mismatch within the representation itself, compactly encoding structured families of monomials while representing CNF clauses directly, thereby avoiding auxiliary variables and constraints at the abstraction level. We formalize the language through power terms and power term polynomials, define their semantics, and show that they admit algebraic operations corresponding to Boolean polynomial addition and multiplication. We prove several key properties of the language: disjunctive clauses admit compact canonical representations; power terms support local shortening and expansion rewrite rules; and products of atomic terms can be systematically rewritten within the language. Together, these results yield a symbolic calculus that enables direct manipulation of formulas without expanding them into ordinary ANF. The resulting framework provides a new intermediate representation and rewriting calculus that bridges clause-based and algebraic reasoning and suggests new directions for structure-aware CNF<->ANF conversion and hybrid reasoning methods.
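A small worked example shows where the exponential blowup in direct CNF-to-ANF conversion comes from. It uses only the standard identity for disjunction over GF(2), with $\oplus$ denoting XOR; it illustrates the mismatch the abstract refers to and does not use the paper's power-term notation.

```latex
x_1 \lor x_2 \lor x_3
  \;=\; 1 \oplus (1 \oplus x_1)(1 \oplus x_2)(1 \oplus x_3)
  \;=\; x_1 \oplus x_2 \oplus x_3 \oplus x_1 x_2 \oplus x_1 x_3 \oplus x_2 x_3 \oplus x_1 x_2 x_3
```

A single clause of three positive literals already expands to $2^3 - 1 = 7$ monomials, and a clause of $n$ positive literals expands to $2^n - 1$, which is why a representation that keeps the product form unexpanded can stay compact.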

2605.05368 2026-06-11 math.LO cs.AI 版本更新

Towards an Inferentialist Account of Information Through Proof-theoretic Semantics

走向信息的推理主义解释:基于证明论语义

Matthew Collinson, Timo Eckhardt, David Pym

AI总结 本文旨在通过证明论语义发展一种信息的推理主义理论,通过概念分析、逻辑和系统三个核心组件,为信息提供数学逻辑基础,并探讨信息作为相关性的理解。

详情
Comments
Manuscript
AI中文摘要

信息是当前时代讨论最为广泛的概念之一。然而,尽管已有大量富有洞见的工作,信息仍未获得完全令人信服的逻辑或数学基础。缺少这样的基础,我们就没有足够的推理工具来理解社会所依赖的由众多系统构成的复杂生态。我们朝着发展信息的推理主义语义理论迈出第一步,以弥补这一不足。该工作包含三个相互作用的关键组成部分。首先,概念分析:信息的形而上学。Dretske用意向性、真理和可传递性来表达信息的关键概念。我们用可推断性代替真理,并追溯这种替代的后果。其次,逻辑:证明论语义(P-tS)为推理主义推理提供了数学-逻辑的实现。借助P-tS,我们朝着关于信息的推理主义基本单位“inferon”的数学-逻辑理论迈出了最初几步。这种证明论方法与情境理论中阐述的信息的模型论观点形成对照。此外,我们论证它有助于处理van Benthem和Martinez对信息理解的分类中的全部三个部分:作为范围、作为相关性和作为代码。我们的重点是作为相关性的信息。第三,系统:我们开发的P-tS工具为分布式系统建模的数学解释提供了基础,而分布式系统建模是信息学中理解信息处理系统组织的关键工具。由此得到分布式系统模型中信息流的基于推理的理论。总体而言,我们试图为信息及其在信息学中的作用提供一个概念严谨、以推断和推理为基础的数学-逻辑解释。

英文摘要

Information is one of the most widely-discussed concepts of the current era. However, a great deal of insightful work notwithstanding, it is yet to be given wholly convincing logical or mathematical foundations. Without them, we lack adequate reasoning tools for understanding the complex ecosystems of systems upon which society depends. We seek to rectify this by taking a first step towards developing an inferentialist semantic theory of information. There are three key interacting components. First, conceptual analysis: the metaphysics of information. Dretske expressed the key concepts of information in terms of intentionality, truth, and transmissibility. We replace truth with inferability, and trace the consequences of this replacement. Second, logic: proof-theoretic semantics (P-tS) provides a mathematical-logical realization of inferentialist reasoning. Using P-tS, we develop the first steps towards a mathematical-logical theory of an inferentialist primitive unit of information, the 'inferon'. This proof-theoretic approach counterpoints the model-theoretic view of information articulated in situation theory. Furthermore, we argue that it facilitates addressing all three components of van Benthem and Martinez's categorization of the understandings of information: as range, as correlation, and as code. Our focus is on information-as-correlation. Third, systems: the P-tS tools we develop provide the basis for a mathematical account of distributed systems modelling -- a key tool from informatics for understanding the organization of information processing systems. This yields a reasoning-based theory of information flow in models of distributed systems. Overall, we seek to give a conceptually rigorous mathematical-logical account of information and its role within informatics, grounded in inference and reasoning.

3. Multi-Agent Systems and Games (13 papers)

2606.11379 2026-06-11 cs.AI 新提交

Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

人类谈判的自动调解器:通过结构化LLM流水线进行预调解

Jamie Bergen, Sarit Kraus

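AI summary: Proposes a structured LLM pipeline that acts as an automated mediator supporting pre-mediation in integrative negotiation. By decomposing preparation into specialized modules, it is broadly comparable to human mediators on short-term self-reported outcomes and achieves 36% lower RMSE on the preference-inference task.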

详情
Comments
12 pages, 7 figures
AI中文摘要

预调解是直接人类谈判前的准备阶段,在达成互利协议中起着关键作用,但由于成本、时间和缺乏训练有素的调解员而常被省略。我们引入了一种用于人类谈判的自动调解器,实现为结构化LLM模块流水线,在整合性谈判环境中支持预调解。该流水线将准备分解为对话、偏好预测、响应级批评和结构化总结的专用模块,分离推理、生成和评估,以解决单一提示方法的局限性。我们按照常见的LLM系统术语将每个模块称为“智能体”,但组件并非自主且不进行点对点交互;输出按固定顺序向前传递。我们在两个受控人类受试者实验中评估该系统,在多议题谈判场景中将基于AI的预调解与专业人类调解员进行比较。在短期自我报告测量中,自动调解器在准备结果上与人类调解员大致相当,包括对调解员的信任和达成互利协议的信心,同时在我们场景和提示下,偏好推理任务的误差显著降低(RMSE降低36%)。第二项研究表明,有针对性的提示优化将过度肯定模式从36.6%降至16.8%,与人类调解员基线匹配。我们的发现表明,结构化LLM流水线可以在短期自我报告准备结果上提供与人类调解员大致相当的可扩展、低投入的预调解支持。该流水线的单方设计反映了当前人类调解员进行预调解的方式,并支持在争议各方之间并行部署,从而实现可扩展性。

英文摘要

Pre-mediation, the preparatory phase preceding direct human negotiation, plays a critical role in achieving mutually beneficial agreements, yet is often omitted due to cost, time, and limited access to trained mediators. We introduce an automated mediator for human negotiation, implemented as a structured pipeline of LLM modules, that supports pre-mediation in integrative negotiation settings. The pipeline decomposes preparation into specialized modules for dialogue, preference prediction, response-level critique, and structured summarization, separating inference, generation, and evaluation to address limitations of monolithic single-prompt approaches. We use the term "agent" for each module following common LLM-systems terminology, but the components are not autonomous and do not interact peer-to-peer; outputs are passed forward in a fixed sequence. We evaluate the system in two controlled human-subject experiments comparing AI-based pre-mediation with professional human mediators in a multi-issue negotiation scenario. On short-term self-reported measures, the automated mediator achieves preparation outcomes broadly comparable to human mediators, including trust in the mediator and confidence in reaching mutually beneficial agreements, while achieving substantially lower error on the preference-inference task under our scenario and prompts (36% lower RMSE). A second study shows that targeted prompt refinements reduce excessive affirmation patterns from 36.6% to 16.8%, matching human mediator baselines. Our findings suggest that structured LLM pipelines can provide scalable, low-effort pre-mediation support broadly comparable to human mediators on short-term self-reported preparation outcomes. The pipeline's single-party design mirrors how human mediators run pre-mediation today and enables parallel deployment across all parties to a dispute, supporting scalability.
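
For readers unfamiliar with the metric, the sketch below shows how a "percent lower RMSE" figure is computed. The preference scores are made-up toy numbers with no connection to the study's data, and the function names are illustrative assumptions.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and true preference scores."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def relative_reduction(baseline_error, candidate_error):
    """Fraction by which the candidate's error is lower than the baseline's."""
    return (baseline_error - candidate_error) / baseline_error


if __name__ == "__main__":
    # Toy values only: hypothetical true issue priorities and two sets of estimates.
    true_scores = [5.0, 3.0, 1.0, 4.0]
    estimate_a = [4.0, 4.0, 2.0, 3.0]   # each estimate is off by 1.0
    estimate_b = [4.5, 3.5, 1.5, 3.5]   # each estimate is off by 0.5
    err_a = rmse(estimate_a, true_scores)
    err_b = rmse(estimate_b, true_scores)
    print(f"RMSE A={err_a:.2f}, RMSE B={err_b:.2f}, "
          f"B is {relative_reduction(err_a, err_b):.0%} lower")
```

Because RMSE squares each error before averaging, a few large misjudgments of a party's priorities weigh more heavily than many small ones.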

2606.11440 2026-06-11 cs.AI 新提交

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

INFRAMIND: 基础设施感知的多智能体编排

Ahasan Kabir, Jiaqi Xue, Mengxin Zheng, Qian Lou

发表机构 * University of Central Florida(中佛罗里达大学)

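AI summary: Proposes INFRAMIND, a framework that uses reinforcement learning to feed infrastructure state (queue depths, KV-cache pressure, and so on) into the planning, routing, and scheduling decisions of multi-agent LLM orchestration. On shared GPU clusters it balances quality against latency, delivering up to +7.6 pp accuracy over the prior baseline with up to 7x lower latency.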

详情
Comments
Preprint
AI中文摘要

现有的多智能体LLM编排方法,从暴力集成到学习型路由器,基于任务和模型特征选择模型和拓扑。然而,这些方法不考虑服务基础设施的运行时状态。在共享GPU集群上并发负载下,这种基础设施盲区导致系统性的资源利用不足:首选模型积累深度请求队列,而同等能力的替代模型闲置。在多智能体流水线中,每个查询触发多个顺序模型调用,这些延迟会进一步累积到每个下游步骤。弥补这一差距具有挑战性,因为相关基础设施信号(队列深度、KV缓存压力、延迟)是动态且嘈杂的,并且它们必须驱动三个不同的决策:规划、逐步骤路由和调度。我们引入INFRAMIND,一个使整个多智能体堆栈具备基础设施感知的框架。一个基础设施感知的规划器根据实时系统负载和剩余预算调节拓扑和角色选择,在拥塞时偏向简单图,在低负载时偏向丰富图。然后,一个基础设施感知的执行器在每个智能体步骤观察每个模型的队列深度、缓存利用率和响应延迟,以决定调用哪个模型以及推理深度;一个预算感知的调度器进一步重新排序每个模型的队列,使紧急请求优先得到服务。将其建模为分层约束MDP并通过强化学习端到端求解,系统自动学习平衡质量与延迟。在五个基准测试中,INFRAMIND在低负载下相比先前基线准确率提升高达7.6个百分点,延迟降低7倍,在高负载下维持高达99.9%的SLO合规性,而所有基线均降至50%以下。

英文摘要

Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based on task and model features. However, these methods do not consider the runtime state of the serving infrastructure. On shared GPU clusters under concurrent load, this infrastructure blindness causes systematic resource underutilization: preferred models accumulate deep request queues while equally capable alternatives sit idle. In multi-agent pipelines, where each query triggers multiple sequential model calls, these delays then compound across every downstream step. Closing this gap is challenging because the relevant infrastructure signals (queue depths, KV-cache pressure, latencies) are dynamic and noisy, and they must drive three different decisions: planning, per-step routing, and scheduling. We introduce INFRAMIND, a framework that makes the entire multi-agent stack infrastructure-aware. An infra-aware planner conditions topology and role selection on real-time system load and remaining budget, biasing toward simpler graphs under congestion and richer ones at low load. An infra-aware executor then observes per-model queue depths, cache utilization, and response latencies at each agent step to decide which model to call and how deeply to reason; a budget-aware scheduler further reorders each model's queue so that urgent requests are served first. Cast as a hierarchical constrained MDP and solved end-to-end via reinforcement learning, the system learns to balance quality against latency automatically. Across five benchmarks, INFRAMIND delivers up to +7.6 pp accuracy over the prior baseline at low load with up to 7x lower latency, and sustains up to 99.9% SLO compliance under high load where every baseline drops below 50%.
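
The routing and scheduling decisions described above can be illustrated with a small hand-written heuristic. This is a sketch under stated assumptions, not INFRAMIND itself: the paper learns its policy with reinforcement learning over a hierarchical constrained MDP, whereas the code below simply picks the capable model with the shortest estimated wait and serves the most urgent request first. `ModelState`, `Request`, `route`, and `serve_order` are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class ModelState:
    name: str
    queue_depth: int    # requests currently waiting on this model
    avg_latency: float  # observed seconds per request
    quality: float      # offline quality estimate in [0, 1] (assumed to be available)


@dataclass(order=True)
class Request:
    slack: float                      # seconds of latency budget left; smaller is more urgent
    name: str = field(compare=False)  # excluded from ordering


def route(models, min_quality):
    """Among models that meet the quality bar, pick the one with the smallest estimated wait."""
    capable = [m for m in models if m.quality >= min_quality]
    if not capable:
        raise ValueError("no model meets the quality requirement")
    return min(capable, key=lambda m: (m.queue_depth + 1) * m.avg_latency)


def serve_order(requests):
    """Reorder a model's queue so that requests with the least slack are served first."""
    heap = list(requests)
    heapq.heapify(heap)
    return [heapq.heappop(heap).name for _ in range(len(heap))]


if __name__ == "__main__":
    models = [
        ModelState("preferred", queue_depth=12, avg_latency=2.0, quality=0.90),
        ModelState("alternative", queue_depth=1, avg_latency=2.2, quality=0.89),
        ModelState("small", queue_depth=0, avg_latency=0.5, quality=0.60),
    ]
    print(route(models, min_quality=0.85).name)
    print(serve_order([Request(5.0, "a"), Request(0.5, "b"), Request(2.0, "c")]))
```

An infrastructure-blind router would keep sending work to the deeply queued preferred model. Even this crude wait estimate shifts load to the nearly idle alternative of similar quality.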

2606.12018 2026-06-11 cs.AI 新提交

MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

MODF-SIR:面向社交智能推理的多智能体全模态蒸馏框架

Shang Ma, Jisheng Dang, Wencan Zhang, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) School of Medical Technology, Beijing Institute of Technology(北京理工大学医学技术学院) Cloud and AI BU, Huawei(华为云与AI业务部) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

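AI summary: Proposes a multi-agent collaborative framework built on a lightweight multimodal large language model, with both training and inference augmented by knowledge distillation. It combines test-time adaptation, long-tail event extraction, and chain-of-thought prompting, and reports state-of-the-art results on multiple benchmarks.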

详情
AI中文摘要

我们提出一个基于轻量级多模态大语言模型(MLLM)的多智能体协作框架,专门设计用于社交智能推理。我们方法的一个关键特征是,训练和推理阶段都通过知识蒸馏进行增强。在该架构中,与社交智能相关的多模态数据被精确定位。此外,相关的长尾事件被识别、提取并呈现为格式化的显式文本。这种格式化策略防止关键的长尾信息在分词过程中被头部事件和环境噪声掩盖。具体来说,我们在整个推理流程中集成了测试时适应(TTA),包括长尾事件的提取和表示、链式思维(CoT)提示和自我反思。该TTA机制也经过蒸馏增强,利用低秩适应(LoRA)仅针对实例级推理微调基础模型。在多个基准上对各种开源和专有AI模型进行的广泛评估证明了所提出框架的有效性。使用IntentTrain约30%的训练数据,我们取得了最先进的结果。代码见https://this URL,演示见https://this URL,LoRA见https://this URL,训练路由器的数据集见https://this URL。

英文摘要

We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of the training data from IntentTrain, we achieve state-of-the-art results. Code is available at this https URL, a demo is available at this https URL, LoRA is available at this https URL, and the dataset for training the router is available at this https URL.

2606.12260 2026-06-11 econ.TH cs.AI cs.GT cs.LG stat.ML 交叉投稿

Market Design for AI: Beyond the Copyright Binary

人工智能的市场设计:超越版权二元论

Yan Dai, Maryam Farboodi, Negin Golrezaei, Sepehr Shahshahani

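AI summary: Using static and dynamic game models, the paper analyzes why both the "free-for-all" and the "strong intellectual property rights" regimes fail in markets for AI training data, and proposes a market design in which a data intermediary internalizes cross-creator externalities and subsidizes innovative contributions.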

详情
AI中文摘要

我们如何设计一个用于训练AI模型的人类生成内容市场,既能促进技术进步,又能保留个人创作高质量内容的激励?现有方法采取两极立场:基于合理使用的“自由使用”模式和“强知识产权”模式。我们证明两者均失败:自由使用不补偿创作者,而通过建模为静态Stackelberg博弈,强知识产权也削弱了创作激励。我们发现这对更具创新性的创作者尤其如此,我们将此现象称为“原创性惩罚”。将这一见解扩展到动态模型,我们发现另一种市场失灵会损害AI模型性能,即使对于初始良好的模型也是如此:此类模型导致人类更依赖AI辅助创作,导致同质化内容反馈到训练中,从而降低模型性能——即“精确性诅咒”。我们进一步提出一种市场设计,通过数据中介内部化跨创作者外部性并补贴创新贡献,从而恢复效率。

英文摘要

How can we design a market of human-generated content for use in training AI models that both enables technological progress and preserves individual incentives for high-quality content creation? Existing approaches take polar positions: a "free-for-all" model based on fair use and a "strong intellectual property rights" model. We show that both fail: Free-for-all does not compensate creators, and -- by modeling as a static Stackelberg game -- strong intellectual property rights also underpower creative incentives. We find this especially true for more innovative creators, a phenomenon we term the "originality penalty." Extending this insight to a dynamic model, we find another market failure undermining AI model performance, even for an initially good model: Such a model induces greater reliance by humans on AI-assisted creation, resulting in homogenized content feeding back into training, which degrades the model performance -- a "curse of precision." We further propose a market design with a data intermediary internalizing cross-creator externalities and subsidizing innovative contributions, thereby restoring efficiency.

2606.12281 2026-06-11 cs.MA cs.AI cs.LG 交叉投稿

CCKS: Consensus-based Communication and Knowledge Sharing

CCKS:基于共识的通信与知识共享

Jinyuan Zu, Xiaowei Lv, Yongcai Wang, Deying Li, Yunjun Han, Wenping Chen, Fengyi Zhang, Naiqi Wu

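AI summary: To address over-reliance on teacher guidance in action advising for multi-agent reinforcement learning, proposes the Consensus-based Communication and Knowledge Sharing (CCKS) framework, which builds consensus models via contrastive learning to balance exploration with learning from teachers, improving cooperation efficiency and performance.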

详情
AI中文摘要

在分布式训练和分布式执行(DTDE)的协作多智能体强化学习(MARL)中,基于动作建议的知识共享促进了智能体间的可解释和可扩展合作。然而,当前的动作建议方法往往过于遵循教师的指导,而未评估师生兼容性,导致过度建议、稳定性欠佳和性能下降。为克服这些挑战,本文提出了一种基于共识的通信与知识共享(CCKS)框架,该框架允许智能体基于共识衍生的约束采纳建议,并更智能地遵循教师指令。该机制使智能体能够平衡探索与向经验丰富的教师学习,从而提升整体性能。关键在于共识模型的构建,为此我们提出在智能体训练阶段利用对比学习基于局部观测构建共识模型。在动作选择中,智能体根据共识和共享知识对动作进行评分和选择。CCKS设计为即插即用解决方案,可无缝集成到现有DTDE算法中。在Google Research Football环境和复杂的星际争霸II多智能体挑战中进行的实验表明,与当前的DTDE基线相比,集成CCKS显著提高了合作效率、学习速度和整体性能。代码可从此https URL获取。

英文摘要

In Decentralized Training and Decentralized Execution (DTDE) for cooperative Multi-Agent Reinforcement Learning (MARL), action-advising-based knowledge sharing promotes interpretable and scalable cooperation among agents. However, current action advising approaches often adhere too much to the teacher's guidance without evaluating teacher-student compatibility, which causes excessive advising, suboptimal stability, and degraded performance. To overcome these challenges, this paper presents a Consensus-based Communication and Knowledge Sharing (CCKS) framework, which allows agents to adopt recommendations based on consensus-derived constraints and to follow the teacher's instructions more smartly. This mechanism enables agents to balance exploration and learning from experienced teachers, improving overall performance. The key is the consensus model construction, for which we propose to employ contrastive learning to construct consensus models based on local observations in the agents' training phase. In action selection, agents score and choose actions based on consensus and shared knowledge. Designed as a plug-and-play solution, CCKS integrates seamlessly with existing DTDE algorithms. Experiments conducted in the Google Research Football environment and the complex StarCraft II Multi-Agent Challenge demonstrate that the integration with CCKS significantly improves cooperation efficiency, learning speed, and overall performance compared with current DTDE baselines. The code is available at this https URL.

2606.12352 2026-06-11 cs.RO cs.AI 交叉投稿

CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy

CHORUS: 基于单一VLA策略的去中心化多体协作

Ria Doshi, Tian Gao, Annie Chen, Chelsea Finn, Jeannette Bohg

发表机构 * Stanford University(斯坦福大学)

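AI summary: Proposes CHORUS, a framework that uses the visuomotor priors of a pretrained vision-language-action model to achieve decentralized multi-robot collaboration without inference-time communication, substantially outperforming baselines in real-world experiments.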

详情
Comments
Project Website: this https URL
AI中文摘要

多机器人协作使机器人能够高效完成从通过门搬运沙发到建筑工地组装结构等各种任务。然而,在移动多机器人环境中实现这种协调仍然具有挑战性:基于团队联合观测的集中式方法随团队规模扩展性差,而为每个机器人训练一个策略的去中心化方法通常需要显式对齐程序或推理时信息共享来克服部分可观测性。我们的关键见解是,预训练的视觉-语言-动作(VLA)模型的视觉运动先验应能够仅从每个机器人的局部观测实现反应式去中心化协作,无需这些推理时假设。我们提出CHORUS,一个适配单一VLA骨干以控制多样化多机器人团队的框架。推理时,每个机器人运行CHORUS的独立副本,仅基于其自身观测和机器人标识提示。在包括移动卷尺测量、图书馆书籍交接和洗衣篮抬举的真实实验中,CHORUS相比去中心化从头训练模型提升64个百分点,对队友行为的反应性提升40个百分点,并优于集中式基线。这些结果表明,共享VLA骨干能够实现去中心化多机器人协作,无需每个机器人的独立策略或推理时机器人间通信。

英文摘要

Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling structures on a construction site. However, achieving such coordination in mobile multi-robot settings remains challenging: centralized methods conditioned on the combined observations of a team scale poorly with team size, and decentralized methods that train one policy per robot often require explicit alignment procedures or information sharing at inference time to overcome partial observability. Our key insight is that the visuomotor priors of pretrained vision-language-action (VLA) models should enable reactive, decentralized collaboration from each robot's local observations alone, without these inference-time assumptions. We propose CHORUS, a framework that adapts a single VLA backbone to control diverse multi-robot teams. At inference time, each robot runs an independent copy of CHORUS, conditioned only on its own observations and a robot-identifying prompt. In real-world experiments including mobile tape measurement, library book handovers, and laundry basket lifting, CHORUS achieves a 64 percentage point improvement over decentralized, from-scratch models, improves reactivity to teammate behavior by 40 percentage points, and outperforms centralized baselines. Together, these results show that a shared VLA backbone is capable of achieving decentralized multi-robot collaboration, without per-robot policies or inter-robot communication at inference.

2307.01472 2026-06-11 cs.AI cs.LG cs.MA 版本更新

Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL

通过扩散模型提升离线多智能体强化学习的泛化能力与数据效率

Zhuoran Li, Ling Pan, Jiatai Huang, Longbo Huang

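AI summary: Proposes the Diffusion Offline Multi-agent Model (DOM2), which uses a diffusion model to increase policy expressiveness and diversity and adds trajectory-based data reweighting, improving performance, generalization, and data efficiency in offline MARL.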

详情
AI中文摘要

我们提出了一种新颖的扩散离线多智能体模型(DOM2),用于离线多智能体强化学习(MARL)。与主要依赖策略设计中保守性的现有算法不同,DOM2基于扩散模型增强了策略的表达力和多样性。具体来说,我们将扩散模型融入策略网络,并在训练中提出了一种基于轨迹的数据重加权方案。这些关键要素显著提高了算法对环境变化的鲁棒性,并在性能、泛化和数据效率方面取得了显著提升。我们的大量实验结果表明,DOM2在所有多智能体粒子和多智能体MuJoCo环境中均优于现有最先进方法,并且由于其高表达力和多样性,在迁移环境中(在评估的30个设置中有28个)泛化能力显著更强。此外,DOM2具有超高的数据效率,与现有算法相比,实现相同性能所需数据不超过5%(数据效率提升20倍)。

英文摘要

We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existing algorithms that rely mainly on conservatism in policy design, DOM2 enhances policy expressiveness and diversity using a diffusion model. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-reweighting scheme in training. These key ingredients significantly improve algorithm robustness against environment changes and achieve significant improvements in performance, generalization, and data efficiency. Our extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in all multi-agent particle and multi-agent MuJoCo environments, and generalizes significantly better to shifted environments (in 28 out of 30 settings evaluated) thanks to its high expressiveness and diversity. Moreover, DOM2 is ultra data efficient and requires no more than 5% of the data to achieve the same performance as existing algorithms (a 20x improvement in data efficiency).

2601.04884 2026-06-11 cs.AI 版本更新

Precomputing Multi-Agent Path Replanning Using Temporal Flexibility

利用时间灵活性预计算多智能体路径重规划

Issa Hanou, Eric Kemmeren, Devin Wild Thomas, Mathijs de Weerdt

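AI summary: For conflicts caused by a single delayed agent during multi-agent plan execution, proposes FlexSIPP, which precomputes all possible plans for the delayed agent and exploits the temporal flexibility of the other agents to avoid cascading delays. It is evaluated on the Dutch railway network and the MovingAI MAPF benchmark set.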

详情
Comments
Accepted at SoCS'26
AI中文摘要

当智能体被延迟时,执行多智能体计划可能具有挑战性,因为这通常会导致与其他智能体的冲突。因此,我们需要快速找到一个新的安全计划。仅对延迟的智能体进行重规划通常无法产生有效的计划,有时甚至无法产生可行的计划。另一方面,对其他智能体进行重规划可能导致级联变化和延迟,并且计算成本高昂。我们展示了如何通过跟踪和利用其他智能体的时间灵活性(即智能体在不改变与初始延迟智能体之外的其他智能体的顺序,或进一步延迟其他智能体的前提下,可以承受的最大延迟)来高效地对单个延迟智能体进行重规划,同时避免级联延迟。我们的算法FlexSIPP预计算延迟智能体的所有可能计划,并在给定场景中返回对其他智能体的更改。我们在实际案例研究(荷兰密集使用的铁路网络中的列车重规划)和MovingAI MAPF基准测试集中展示了我们的方法。实验表明,FlexSIPP提供了与实际情况调整相关的有效解决方案,并且在合理的时间范围内。

英文摘要

Executing a multi-agent plan can be challenging when an agent is delayed, because this typically creates conflicts with other agents. So, we need to quickly find a new safe plan. Replanning only the delayed agent often does not yield an efficient plan, and sometimes cannot even yield a feasible one. On the other hand, replanning other agents may lead to a cascade of changes and delays, and it is computationally expensive. We show how to efficiently replan a single delayed agent by tracking and using the temporal flexibility of other agents while avoiding cascading delays. This flexibility is the maximum delay that the agent can take without changing the order with agents other than the initially delayed agent, or further delaying other agents. Our algorithm, FlexSIPP, precomputes all possible plans for the delayed agent and returns the changes to the other agents within the given scenario. We demonstrate our method in a real-world case study of replanning trains in the densely-used Dutch railway network and in the MovingAI MAPF benchmark set. Our experiments show that FlexSIPP provides effective solutions relevant to real-world adjustments, and within a reasonable timeframe.

2602.18291 2026-06-11 cs.AI 版本更新

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

扩散以协调:高效在线多智能体扩散策略

Zhuoran Li, Hai Zhong, Xun Wang, Qingxin Xia, Lihua Zhang, Longbo Huang

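AI summary: Proposes OMAD, among the first online off-policy MARL frameworks to use diffusion policies. A relaxed policy objective maximizes scaled joint entropy for effective exploration and coordination, yielding a 2.5x to 5x improvement in sample efficiency on MPE and MAMuJoCo.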

详情
AI中文摘要

在线多智能体强化学习(MARL)是实现高效智能体协调的重要框架。关键在于增强策略表达能力以实现更优性能。基于扩散的生成模型在图像生成和离线设置中展现出卓越的表达能力和多模态表示,因此非常适合满足这一需求。然而,它们在在线MARL中的潜力尚未被充分探索。主要障碍是扩散模型的难以处理的似然性阻碍了基于熵的探索和协调。为应对这一挑战,我们首次提出使用扩散策略的在线离线策略MARL框架(OMAD)来协调协调。我们的关键创新是采用松弛策略目标,最大化缩放联合熵,从而在无需可处理似然的情况下促进有效探索。此外,在集中训练与分散执行(CTDE)范式中,我们使用联合分布价值函数来优化分散扩散策略。它利用可处理的熵增强目标来指导扩散策略的同时更新,从而确保稳定协调。在MPE和MAMuJoCo上的广泛评估表明,我们的方法在10个不同任务上达到了新的最先进水平,样本效率显著提升了2.5至5倍。

英文摘要

Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose one of the first Online off-policy MARL frameworks using Diffusion policies (OMAD) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across 10 diverse tasks, demonstrating a remarkable 2.5x to 5x improvement in sample efficiency.

2605.12655 2026-06-11 cs.AI cs.MA 版本更新

Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning

鲁棒的指令遵从:合作多智能体强化学习

Wo Wei Lin, Ethan Rathbun, Enrico Marchesini, Xiang Zhi Tan

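AI summary: For external instructions that interrupt ongoing behavior and conflict with long-horizon objectives, proposes Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries to obtain consistent value estimates, achieving high instruction compliance while preserving base task performance in complex cooperative environments.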

详情
AI中文摘要

现实场景中的多智能体强化学习(MARL)可能需要适应外部自然语言指令,这些指令会中断正在进行的行为并与长期目标冲突。然而,基于指令的条件奖励引入了一种基本失败模式,因为Bellman更新耦合了跨指令上下文的值估计,导致当指令中断宏动作时值不一致。我们提出了用于指令遵从的宏动作值修正(MAVIC),该方法通过修正传入指令目标并恢复当前目标下的延续值,来纠正指令边界处的Bellman备份。与奖励塑形不同,MAVIC修改了自举目标本身,从而在统一策略下实现随机指令切换时的一致值估计。我们提供了理论分析和演员-评论家实现,并表明MAVIC在日益复杂的合作多智能体环境中实现了高指令遵从,同时保持了基础任务性能。

英文摘要

Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

2509.14860 2026-06-11 cs.CV cs.AI cs.CL cs.MA 版本更新

MARIC: Multi-Agent Reasoning for Image Classification

MARIC:用于图像分类的多智能体推理

Wonduk Seo, Minhyeong Yu, Hyunjin An, Seunghyun Lee

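AI summary: Proposes MARIC, a multi-agent framework that recasts image classification as a collaborative reasoning process in which an Outliner Agent, Aspect Agents, and a Reasoning Agent analyze the image from multiple perspectives and synthesize the results, significantly outperforming baselines on four benchmark datasets.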

详情
Comments
11 pages, preprint
AI中文摘要

图像分类传统上依赖于参数密集型模型训练,需要大规模标注数据集和大量微调才能达到有竞争力的性能。虽然最近的视觉语言模型(VLM)缓解了其中一些限制,但它们仍然受限于对单次表示的依赖,往往无法捕捉视觉内容的互补方面。在本文中,我们介绍了基于多智能体的图像分类推理(MARIC),这是一个多智能体框架,将图像分类重新表述为协作推理过程。MARIC首先利用大纲智能体分析图像的全局主题并生成有针对性的提示。基于这些提示,三个方面智能体沿着不同的视觉维度提取细粒度描述。最后,推理智能体通过集成反思步骤综合这些互补输出,产生用于分类的统一表示。通过明确地将任务分解为多个视角并鼓励反思性综合,MARIC减轻了参数繁重训练和单一VLM推理的缺点。在4个不同的图像分类基准数据集上的实验表明,MARIC显著优于基线,突出了多智能体视觉推理在鲁棒且可解释的图像分类中的有效性。

英文摘要

Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine-tuning to achieve competitive performance. While recent vision-language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single-pass representations, often failing to capture complementary aspects of visual content. In this paper, we introduce Multi-Agent-based Reasoning for Image Classification (MARIC), a multi-agent framework that reformulates image classification as a collaborative reasoning process. MARIC first utilizes an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine-grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through an integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter-heavy training and monolithic VLM reasoning. Experiments on 4 diverse image classification benchmark datasets demonstrate that MARIC significantly outperforms baselines, highlighting the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.

2604.20348 2026-06-11 cs.RO cs.AI cs.MA 版本更新

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

通过多智能体上下文学习的双臂机器人操作

Alessio Palma, Indro Spinelli, Vignesh Prasad, Luca Scofano, Yufeng Jin, Georgia Chalvatzaki, Fabio Galasso

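AI summary: Proposes BiCICLe, which frames bimanual manipulation as a multi-agent leader-follower problem and decouples the action space so that standard LLMs can perform few-shot bimanual manipulation, reaching a 70.5% average success rate on the TWIN benchmark and outperforming training-free baselines.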

详情
AI中文摘要

语言模型(LLMs)已成为具身控制的强大推理引擎。特别是,上下文学习(ICL)使得现成的纯文本LLM能够预测机器人动作,无需任何任务特定训练,同时保持其泛化能力。将ICL应用于双臂操作仍然具有挑战性,因为高维联合动作空间和紧密的臂间协调约束迅速压垮标准上下文窗口。为了解决这个问题,我们引入了BiCICLe(双臂协调上下文学习),这是第一个使标准LLM无需微调即可执行少样本双臂操作的框架。BiCICLe将双臂控制建模为多智能体主从问题,将动作空间解耦为顺序的、条件化的单臂预测。在TWIN基准的13个任务上评估,BiCICLe实现了70.5%的平均成功率,比最佳无训练基线高出6.1个百分点,并超过了大多数监督方法。我们还展示了在3个任务上无需特定硬件重新训练的优越现实世界性能。

英文摘要

Large Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging, as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves a 70.5% average success rate, outperforming the best training-free baseline by 6.1 percentage points and surpassing most supervised methods. We also demonstrate superior real-world performance on 3 tasks without hardware-specific retraining.

2606.08102 2026-06-11 cs.RO cs.AI cs.MA 版本更新

Continual Quadruped Robots Coordination via Semantic Skill Discovery

通过语义技能发现实现持续四足机器人协调

Daoqing Wang, Yuchen Xiao, Weixuan Huang, Zhilong Zhang, Shenghua Wan, Meng Li, Lei Yuan, Yang Yu

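AI summary: Proposes Conquer, a semantic skill-library framework for coordinating multiple quadruped robots in continual-learning settings with negligible catastrophic forgetting, reaching a final average success rate of 95.6% in simulation.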

详情
Comments
22 pages, 8 figures, 11 tables. Project page: this https URL
AI中文摘要

多四足协调因其增强的负载能力、更广的接触覆盖范围以及对挑战性任务的适应性提升而受到越来越多的关注。现有的多四足操作方法通常专注于预定义或封闭的任务族,往往依赖多智能体强化学习(MARL)来训练特定任务的协调策略。然而,这类方法在开放式持续学习场景中难以应对,其中任务顺序到达,机器人期望在复用先前学到的技能的同时获取新协调技能,且不出现灾难性遗忘。为应对这一挑战,我们提出Conquer,一个语义技能库框架,将持续多四足协调形式化为检索-适应-更新过程。首先,为适应不同任务中的团队规模变化,我们设计了一个团队结构的Self-Allies-Goal(SAG)主干,通过显式建模每个机器人自身状态、队友上下文和任务目标,支持可变基数的机器人团队。对于每个新任务,Conquer从执行前信息构建任务级语义描述符,并从技能库中检索相关技能进行适应。成功执行后,Conquer通过提取轨迹级语义描述符并根据语义距离组织它们来更新技能库,从而实现持续技能积累和跨任务知识迁移。仿真实验表明,Conquer达到了95.6%的最终平均成功率,展示了强大的前向迁移能力和可忽略的灾难性遗忘。在宇树Go2团队上的实际部署进一步验证了Conquer用于实际多四足协调的可行性。仿真和真实机器人演示视频见:https://conquer-project.pages.dev/。

英文摘要

Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: this https URL.
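
The retrieve-adapt-update loop can be pictured with a toy skill library. This is a sketch under strong assumptions rather than Conquer's implementation: it assumes semantic descriptors are plain numeric vectors and that "semantic distance" is cosine distance, neither of which the abstract specifies. The class, method, and skill names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the two vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


class SkillLibrary:
    """Toy library mapping skill names to descriptor vectors (assumed representation)."""

    def __init__(self):
        self._skills = {}

    def update(self, name, descriptor):
        """Store or overwrite a skill after a successful execution."""
        self._skills[name] = list(descriptor)

    def retrieve(self, task_descriptor):
        """Return the name of the stored skill closest to the task, or None if empty."""
        if not self._skills:
            return None
        return min(
            self._skills,
            key=lambda name: cosine_distance(self._skills[name], task_descriptor),
        )


if __name__ == "__main__":
    library = SkillLibrary()
    library.update("push_box", [1.0, 0.0, 0.0])     # hypothetical skill
    library.update("carry_beam", [0.0, 1.0, 0.2])   # hypothetical skill
    print(library.retrieve([0.1, 0.9, 0.3]))        # nearest stored skill to adapt
```

The retrieved skill would then serve as the starting point for adaptation, and the library grows only after a task succeeds.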

4. Search, Optimization, and Constraint Solving (5 papers)

2606.11662 2026-06-11 cs.AI 新提交

TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search

TreeSeeker:深度搜索中的树结构试错与回溯

Zhuofan Shi, Mingzhe Ma, Lu Wang, Fangkai Yang, Pu Zhao, Yiming Guan, Youling Huang, Wei Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan

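AI summary: Proposes TreeSeeker, a framework for controlled trial-and-error in deep search that combines tree-structured branch-and-return search with selection guided by UCB signals, consistently outperforming strong open-source baselines on complex question answering.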

详情
AI中文摘要

深度搜索要求智能体通过多步网络搜索、浏览、证据比较和综合来回答复杂问题。一个核心挑战是当多个方向看似可行但只有部分能最终提供可靠证据时,如何决定搜索方向。如果智能体贪婪地跟随当前最佳方向,它可能会不断扩展一个薄弱的延续;如果无纪律地探索,则可能将预算浪费在无关的尝试上。我们提出TreeSeeker,一个用于深度搜索中受控试错的推理时框架。TreeSeeker将搜索组织为树结构状态上的分支-回溯搜索,其中每个分支是子目标的一个试探性方向。在每一轮中,TreeSearch读取所有子目标树,识别活跃目标,并使用价值、不确定性和风险等文本UCB信号来选择:利用有希望的分支、探索不确定的替代方案,或剪除无生产力的延续并返回到较早的分支点。TreeMem通过将证据、不确定性、冲突、进展和失败线索附加到产生它们的分支上来支持这一控制循环,从而使试验结果能够指导后续决策。在XBench-DeepSearch、BrowseComp和BrowseComp-ZH上的实验表明,TreeSeeker始终优于强开源基线,这表明显式的分支-回溯控制可以补充更强的推理和工具执行能力。

英文摘要

Deep search requires agents to answer complex questions through multi-step web search, browsing, evidence comparison, and synthesis. A central challenge is deciding how to search when several directions look plausible but only some will later lead to reliable evidence. If an agent greedily follows the current best-looking direction, it may keep extending a weak continuation. If it explores without discipline, it may waste budget on disconnected trials. We propose TreeSeeker, an inference-time framework for controlled trial-and-error in deep search. TreeSeeker organizes search as branch-and-return search over tree-structured states, where each branch is a tentative direction for a sub-goal. At each round, TreeSearch reads all sub-goal trees, identifies active goals, and uses textual UCB signals of value, uncertainty, and risk to select among exploiting a promising branch, exploring an uncertain alternative, or pruning an unproductive continuation and returning to an earlier branch point. TreeMem supports this control loop by keeping evidence, uncertainty, conflicts, progress, and failure cues attached to the branches that produced them, so trial outcomes can guide later decisions. Experiments on XBench-DeepSearch, BrowseComp, and BrowseComp-ZH show that TreeSeeker consistently outperforms strong open-source baselines, suggesting that explicit branch-and-return control complements stronger reasoning and tool execution.
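
As background for the exploit-versus-explore choice, the sketch below implements the classic numeric UCB1 rule. It is not TreeSeeker's mechanism: the abstract describes textual UCB signals of value, uncertainty, and risk, while this example only shows the standard bandit formula that the name refers to. The function names and branch statistics are illustrative assumptions.

```python
import math


def ucb1_score(mean_value, visits, total_visits, c=math.sqrt(2)):
    """Classic UCB1: average value plus an exploration bonus that shrinks with visits."""
    if visits == 0:
        return math.inf  # untried branches are always explored first
    return mean_value + c * math.sqrt(math.log(total_visits) / visits)


def select_branch(stats):
    """stats maps branch name -> (mean_value, visits); return the branch with the top score."""
    total = sum(visits for _, visits in stats.values())
    return max(stats, key=lambda b: ucb1_score(stats[b][0], stats[b][1], total))


if __name__ == "__main__":
    # A well-explored strong branch versus a barely explored, slightly weaker one.
    branches = {"follow_citation": (0.6, 10), "new_query": (0.5, 2)}
    print(select_branch(branches))
```

With these toy statistics the less-visited branch wins because its uncertainty bonus outweighs the small gap in average value, which is the trade-off between greedy extension and undisciplined exploration that the abstract describes.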

2606.11339 2026-06-11 math.OC cs.AI cs.LG eess.SY stat.ML 交叉投稿

Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry

松弛全局几何下分布式优化的量化随机原始-对偶方法

Susmit Sarkar, Abhinav Raghuvanshi, Kushal Chakrabarti, Mayank Baranwal

AI总结 提出量化随机原始-对偶方法q-PDGD,在松弛全局几何下证明线性收敛到邻域或O(1/k)收敛,匹配最优集中随机复杂度。

详情
Comments
Accepted to UAI
AI中文摘要

我们研究具有随机梯度和有限比特通信(由随机(无偏)量化建模)的分布式优化。我们提出q-PDGD,一种量化的随机原始-对偶方法,并在松弛全局几何下对其进行分析。在受限割线不等式(RSI)下,常数步长产生线性收缩到由梯度噪声、量化失真和网络连通性确定的显式邻域,而递减步长在没有共享最小化器假设的情况下实现O(1/k)收敛。在Polyak-Lojasiewicz(PL)不等式下,我们在相同的随机量化设置中获得线性到邻域的收敛。我们的结果在预言复杂度上匹配已知最优的集中随机速率,并通过实验证明了量化水平、步长选择和图结构之间的预测权衡。

英文摘要

We study distributed optimization with stochastic gradients and finite-bit communication modeled by random (unbiased) quantization. We propose q-PDGD, a quantized stochastic primal-dual method, and analyze it under relaxed global geometry. Under the restricted secant inequality (RSI), a constant step-size yields linear contraction to an explicit neighborhood determined by gradient noise, quantization distortion, and network connectivity, while a diminishing step-size achieves O(1/k) convergence without shared-minimizer assumptions. Under the Polyak-Lojasiewicz (PL) inequality, we obtain linear-to-neighborhood convergence in the same stochastic quantized setting. Our results match the best-known centralized stochastic rates in oracle complexity and are supported by experiments demonstrating the predicted tradeoffs between quantization level, step-size choice, and graph structure.
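The "random (unbiased) quantization" model can be illustrated with stochastic rounding onto a uniform grid. This is a minimal sketch: the uniform grid and the `step` parameter are assumptions, and the quantizer used by q-PDGD may differ.

```python
import math
import random


def stochastic_quantize(x, step, rng=random):
    """Round x to a multiple of `step`, up or down at random, so E[Q(x)] = x."""
    lower = math.floor(x / step) * step
    p_up = (x - lower) / step  # probability of rounding up to lower + step
    return lower + step if rng.random() < p_up else lower
```

Because the rounding-up probability equals the fractional position of `x` inside its grid cell, the quantization error has zero mean, and a coarser `step` trades fewer bits for a larger error variance.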

2606.11780 2026-06-11 cs.IR cs.AI cs.IT 交叉投稿

What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study

量化对密集Top-$k$检索的限制是什么?一项理论研究

Koki Okajima, Tsukasa Yoshida

AI总结 理论证明在有限精度下,完美Top-$k$检索所需维度随语料库大小对数增长,量化精度存在阈值,影响实际系统设计。

详情
Comments
9 pages, 2 figures
AI中文摘要

我们建立了将包含$N$个文档的语料库嵌入为$d$维向量的条件,使得每个$k$子集$S \subseteq [N]$都能通过某个查询向量的top-$k$检索实现。最近的研究表明,在$\mathbb{R}^d$中,$d = O(k)$足以存在这样的嵌入,与$N$无关。我们理论上证明,这种与语料库无关的界限是无限精度所特有的。当每个坐标使用$B$比特时,完美top-$k$检索需要$Bd = \Omega(k \ln N)$;因此,在任何固定精度下,维度必须至少随$N$对数增长。针对$\ell_2$归一化的$B$比特均匀标量量化模型,我们还确定了精度阈值$B^{*} = O(\ln \ln N)$,低于该阈值任何维度都不够,同时还有两个进一步限制可行$(B, d)$对的区域。我们的结果表明,在实际的向量数据库和密集检索系统中,由于量化是标准操作,嵌入维度和可能的精度必须随语料库大小增长。

英文摘要

We establish conditions for embedding a corpus of $N$ documents as $d$-dimensional vectors such that every $k$-subset $S \subseteq [N]$ is realizable as a result of top-$k$ retrieval by some query vector. Recent work shows that $d = O(k)$ suffices for such embeddings to exist in $\mathbb{R}^d$, independently of $N$. We theoretically prove that this corpus-independent bound is specific to infinite precision. With $B$ bits per coordinate, perfect top-$k$ retrieval requires $Bd = \Omega(k \ln N)$; thus, at any fixed precision, the dimension must grow at least logarithmically with $N$. Specializing to an $\ell_2$-normalized $B$-bit uniform scalar quantization model, we also identify a threshold on the precision $B^{*} = O(\ln \ln N)$ below which no dimension suffices, together with two further regimes that bound the feasible $(B, d)$ pairs. Our result implies that in practical vector databases and dense retrieval systems where quantization is standard, the embedding dimension and possibly the precision must grow with the corpus size.
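To get a feel for how the bound $Bd = \Omega(k \ln N)$ scales, the sketch below solves $Bd \ge c\,k \ln N$ for $d$. The constant hidden by $\Omega$ is not given in the abstract, so `c = 1.0` is a placeholder assumption and the resulting numbers only indicate scaling, not real requirements.

```python
import math


def min_dimension(k, n_docs, bits, c=1.0):
    """Smallest integer d with bits * d >= c * k * ln(n_docs).

    `c` stands in for the unknown constant in the Omega bound (assumption).
    """
    return math.ceil(c * k * math.log(n_docs) / bits)
```

At fixed `bits`, multiplying the corpus size by a constant factor adds a constant amount to the minimum dimension, which is the logarithmic growth the abstract describes.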

2606.12279 2026-06-11 cs.NE cs.AI cs.LG 交叉投稿

Mathematical perspective on genetic algorithms with optimization guided operators

遗传算法与优化引导算子的数学视角

Anna Brandenberger, Ilan Doron-Arad, Elchanan Mossel

AI总结 本文从数学角度建模遗传算法,将优化问题转化为查询复杂度问题,并证明某些问题必须依赖生成、变异和重组算子,同时揭示了多样性在解池中的关键作用。

详情
Comments
18 pages, 1 figure
AI中文摘要

近期机器学习工作将遗传算法应用于推理阶段,以迭代改进优化问题的解。所涉及的基本变异和重组算子在性质上不同于经典研究。变异不再是随机的;机器学习算法以改进目标为目的对解进行变异。同样,重组不再基于父代解的随机拼接,而是基于机器学习的优化算子,其目标是从输入中合成改进的解。因此,这些变异和重组算子更有可能改进目标,但其计算成本更高。我们引入了一个遗传算法的通用模型,并使用强化学习的语言将优化问题表述为查询复杂度问题。然后我们研究专门模型。我们证明某些优化问题必须通过生成、变异和重组来解决。接着,我们在此框架内为一类问题获得了定性紧的算法,该算法捕捉了解池中多样性的非平凡作用,这是实际机器学习遗传算法的一个关键特征。

英文摘要

Recent work in ML applies genetic algorithms at inference time to iteratively improve solutions to optimization problems. The basic mutation and recombination operators involved are qualitatively different from those studied classically. Mutations are no longer random; an ML algorithm mutates a solution with the goal of improving an objective. Similarly, recombination is not based on random collages of parent solutions. Instead, it is an ML optimization-based operator whose goal is to synthesize improved solutions from its inputs. Thus, these mutation and recombination operators are more likely to improve the objective, but their computational cost is much higher. We introduce a general model of genetic algorithms and formulate optimization in this model as a query-complexity problem, using the language of reinforcement learning. We then study specialized models. We show that some optimization problems require generation, mutation, and recombination to be solved. We then obtain qualitatively tight algorithms for a family of problems within this framework that captures the nontrivial role of diversity in the solution pool, a key feature of practical ML genetic algorithms.
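The query-complexity view can be sketched as a loop in which every call to a generation, mutation, or recombination operator costs one query. Everything below is an illustrative assumption rather than the paper's model: the function names, the 12-bit toy problem, and the greedy operators that stand in for expensive ML-guided ones.

```python
import random


def run_guided_ga(objective, generate, mutate, recombine,
                  pool_size=4, budget=50, seed=0):
    """Evolve a solution pool; each operator call counts as one query."""
    rng = random.Random(seed)
    pool = [generate(rng) for _ in range(pool_size)]
    queries = pool_size
    while queries < budget:
        if rng.random() < 0.5:
            child = mutate(rng.choice(pool), rng)
        else:
            parent_a, parent_b = rng.sample(pool, 2)
            child = recombine(parent_a, parent_b, rng)
        queries += 1
        # Keep only the best pool_size solutions.
        pool = sorted(pool + [child], key=objective, reverse=True)[:pool_size]
    return pool[0], queries


# Toy stand-ins on 12-bit strings; the objective is the number of ones.
def generate(rng):
    return tuple(rng.randint(0, 1) for _ in range(12))


def mutate(solution, rng):
    """Guided mutation: set one zero bit to one, if any remain."""
    zeros = [i for i, bit in enumerate(solution) if bit == 0]
    if not zeros:
        return solution
    flip = rng.choice(zeros)
    return tuple(1 if i == flip else bit for i, bit in enumerate(solution))


def recombine(a, b, rng):
    """Guided recombination: keep the better bit at every position."""
    return tuple(max(x, y) for x, y in zip(a, b))
```

Calling `run_guided_ga(sum, generate, mutate, recombine)` returns the best solution found and the number of queries spent. The plain truncation step here ignores diversity, which is the aspect the abstract singles out as nontrivial.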

2606.12382 2026-06-11 cs.NE cs.AI 交叉投稿

SPEA2$^+$: Improved Density Estimation in SPEA2 with Provable Runtime Guarantees

SPEA2$^+$:具有可证明运行时间保证的改进SPEA2密度估计

Duc-Cuong Dang, Andre Opris, Dirk Sudholt

AI总结 针对SPEA2处理支配解时多样性不足的问题,提出使用所有成对距离改进密度估计的SPEA2$^+$,在OneTrapZeroTrap基准上达到与其他主流算法相同的性能保证。

详情
Comments
To appear in the Proceedings of PPSN 2026
AI中文摘要

强度帕累托进化算法2(SPEA2)是解决多目标优化问题的流行且著名的进化算法。尽管其受欢迎,但SPEA2的理论分析直到最近才出现。此外,这些分析仅关注SPEA2如何处理非支配解,而忽略了处理支配解的算法组件。我们首次对SPEA2进行了运行时分析,其中分析了这些组件。我们证明,与其他主流算法(包括相同设置下具有恒定种群大小和重复消除的NSGA-II、NSGA-III和SMS-EMOA)不同,SPEA2无法有效覆盖OneTrapZeroTrap基准的帕累托前沿。我们的结果表明,在适应度分配中使用k近邻距离提供的信号不足以维持支配个体间的多样性。为了解决这个问题,我们提出了一种改进的变体SPEA2$^+$,它考虑了所有成对距离。新算法在OneTrapZeroTrap上实现了与其他主流算法相同的性能保证,同时在更简单的问题上匹配原始SPEA2的性能。实验结果补充了我们的理论发现。

英文摘要

The Strength Pareto Evolutionary Algorithm 2 (SPEA2) is a popular and prominent evolutionary algorithm for solving multi-objective optimisation problems. Despite its popularity, theoretical analyses of SPEA2 have only appeared recently. Moreover, these analyses focus exclusively on how SPEA2 handles non-dominated solutions and disregard the algorithmic components responsible for handling dominated solutions. We conduct a first runtime analysis of SPEA2 for which these components are analysed. We prove that, unlike other prominent algorithms, including NSGA-II, NSGA-III and SMS-EMOA under the same setting of constant population size and duplicate elimination, SPEA2 is unable to cover the Pareto front of the OneTrapZeroTrap benchmark efficiently. Our results indicate that using k-th nearest-neighbour distance in the fitness assignment provides an insufficient signal to maintain diversity among dominated individuals. To address this issue, we propose an improved variant, SPEA2$^+$, that considers all pairwise distances. The new algorithm achieves the same performance guarantees as the other prominent algorithms on OneTrapZeroTrap, while matching the performance of the original SPEA2 on simpler problems. Experimental results complement our theoretical findings.
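The contrast between a k-th nearest-neighbour signal and a signal built from all pairwise distances can be sketched as follows. The density formula 1 / (sigma_k + 2) is the standard SPEA2 definition. The abstract does not say how SPEA2$^+$ uses the pairwise distances, so the lexicographic comparison in `all_distances_key` is a hypothetical illustration, not the authors' method.

```python
import math


def sorted_distances(points, i):
    """Distances from point i to every other point, in ascending order."""
    return sorted(math.dist(points[i], p)
                  for j, p in enumerate(points) if j != i)


def spea2_density(points, i, k):
    """Standard SPEA2 density: 1 / (distance to k-th nearest neighbour + 2)."""
    return 1.0 / (sorted_distances(points, i)[k - 1] + 2.0)


def all_distances_key(points, i):
    """Hypothetical richer signal: the full sorted distance vector.

    Comparing these vectors lexicographically separates points that share
    the same k-th nearest-neighbour distance (smaller means more crowded).
    """
    return sorted_distances(points, i)
```

For the points `[(0, 0), (1, 0), (10, 0), (11, 0), (9, 0)]` with `k = 1`, indices 0 and 2 receive the same SPEA2 density, yet index 2 has two neighbours at distance 1 and index 0 has only one. The full distance vector exposes that difference.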

5. 机器学习与表示学习 64 篇

2606.11445 2026-06-11 cs.AI 新提交

Forecasting Future Behavior as a Learning Task

将未来行为预测作为学习任务

Mosh Levy, Yoav Goldberg, Asa Cooper Stickland

发表机构 * Bar-Ilan University(巴伊兰大学) Allen Institute for AI(艾伦人工智能研究所) UK AI Security Institute(英国人工智能安全研究所)

AI总结 提出将AI行为预测作为可学习任务,训练行为预测器从推理轨迹中预测未来行为,无需解释步骤,在两项任务上优于GPT-5.4和Claude Opus-4.6。

详情
AI中文摘要

对AI系统的信任通常基于对其工作原理的解释,人们利用这些解释来预测系统在新输入上的行为。对于大型推理模型(LRM),这条常规路径尤其难以遵循:针对单个token生成的解释方法无法自然推广到长轨迹,而轨迹本身在作为自然语言阅读时往往不忠实。我们提出一种绕过解释步骤的替代方案:将行为预测视为可学习任务,训练行为预测器(Behavior Forecasters)在单个推理轨迹上运行,以做出通常从解释中寻求的相同预测。预测器的训练数据通过查询LRM获得,无需人工标注,其推理在单次前向传播中完成。我们在两个任务上实例化该方法:LRM在重新运行时重复其答案的可能性,以及移除输入部分如何改变其答案。我们在三个不同的推理数据集上对这两个任务进行了评估,发现训练后的行为预测器比作为朴素读者阅读相同轨迹的GPT-5.4和Claude Opus-4.6更准确,而推理成本仅为其一小部分。我们发现,端到端微调骨干网络并从目标LRM初始化对于强性能都是必要的。这些结果表明,推理轨迹携带了关于LRM未来行为的信息,超出了朴素阅读所能传达的范围。

英文摘要

Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods for single token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM with no human annotation, and its inference is done in a single forward pass. We instantiate this approach on two tasks: how likely the LRM is to repeat its answer on re-runs, and how removing parts of the input changes its answer. We evaluate this approach on both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior that goes beyond what naive reading conveys.
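For the first task, a training target could be built without human annotation by re-running the model and measuring how often the answer recurs. This is a sketch under assumptions: the abstract does not specify the label definition, and `repeat_rate`, `build_example`, and the dictionary layout are hypothetical.

```python
def repeat_rate(first_answer, rerun_answers):
    """Fraction of re-runs whose final answer matches the first answer."""
    if not rerun_answers:
        raise ValueError("need at least one re-run")
    matches = sum(answer == first_answer for answer in rerun_answers)
    return matches / len(rerun_answers)


def build_example(trajectory, first_answer, rerun_answers):
    """Pair one reasoning trajectory with its empirical repeat rate."""
    return {"input": trajectory,
            "target": repeat_rate(first_answer, rerun_answers)}
```

A forecaster trained on such pairs would read only the single trajectory at inference time, while the costly re-runs are needed only to produce training labels.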

2606.11559 2026-06-11 cs.AI 新提交

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

HERO: 基于环境观察的后见增强反思的智能体自蒸馏

Haoran Liu, Yuwei Zhang, Xiyao Li, Bohan Lyu, Jingbo Shang

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Independent Researcher(独立研究员) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出HERO框架,利用环境观察作为局部对齐反馈进行自蒸馏,解决多轮设置中特权反馈与当前决策上下文不对齐导致的性能下降问题,在TauBench和WebShop上提升任务成功率并减少冗余轮次。

详情
AI中文摘要

强化学习通常通过轨迹的终端结果来提升多轮智能体能力,这使得难以确定每个中间轮的信用分配。最近的在线自蒸馏方法通过自教师将特权反馈转化为密集的令牌级监督,提供了一种有前景的替代方案。我们的研究动机是观察到当朴素地将此范式扩展到多轮设置时出现意外的性能下降,我们将其归因于特权反馈(如成功轨迹或终端结果)与学生当前决策上下文之间缺乏对齐。我们引入了HERO,一种后见增强的自蒸馏框架,它使用下一个环境观察作为局部对齐反馈。每次轨迹展开后,HERO反思完成的交互,将每个观察转化为紧凑的轮级诊断,捕获关于原始动作的可操作反馈,如其必要性、有效性或失败原因。在TauBench和WebShop上,HERO比仅环境反馈的自蒸馏和GRPO提高了任务成功率并减少了不必要的轮次。在训练轮次预算有限(成功轨迹稀少且GRPO提供弱奖励对比信号)的情况下,它尤其有效。

英文摘要

Reinforcement learning typically improves multi-turn agent capabilities through the terminal outcome of a trajectory, which makes it difficult to assign credit to each intermediate turn. Recent on-policy self-distillation methods offer a promising alternative by converting privileged feedback into dense token-level supervision through a self-teacher. Our study is motivated by the unexpected performance degradation observed when naively extending this paradigm to multi-turn settings, which we attribute to a lack of alignment between privileged feedback, such as successful trajectories or terminal outcomes, and the student's current decision context. We introduce HERO, a hindsight-enhanced self-distillation framework that uses next environment observations as locally aligned feedback. After each rollout, HERO reflects on the completed interaction to convert each observation into a compact turn-level diagnosis that captures actionable feedback about the original action, such as its necessity, validity, or failure cause. On TauBench and WebShop, HERO improves task success and reduces unnecessary turns over environment-feedback-only self-distillation and GRPO. It is especially effective under limited training turn budgets, where successful rollouts are rare and GRPO provides weak reward-contrast signals.

2606.11634 2026-06-11 cs.AI 新提交

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

架构感知强化学习使滑动窗口注意力在数学推理中具有竞争力

Kai Liu, Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang, Kai Chen

AI总结 提出SWARR方法,通过监督微调将预训练自注意力模型高效转换为滑动窗口注意力,并利用强化学习策略适应,缩小了与自注意力的性能差距,同时保持线性复杂度的高效性。

详情
AI中文摘要

推理和智能体大型语言模型的快速进展增加了对长上下文推理的需求,但自注意力的计算复杂度随上下文长度呈二次增长。为了解决这个问题,我们研究了SWARR(用于数学推理的滑动窗口注意力强化适应),这是一种将SWA模型适应数学推理的实用方案。SWARR包含两个阶段:(1)从预训练的SA模型高效转换为SWA,并通过监督微调(SFT)避免重新训练基础模型;(2)使用强化学习(RL)进行策略适应。我们发现,在SFT后SWA的性能仍低于SA,我们假设这一差距部分由数据-架构不匹配导致:大多数SFT数据是为SA模型准备的,可能包含SWA难以建模的长距离依赖。由于在策略RL在SWA约束下优化自生成轨迹,它可以使轨迹更好地匹配SWA。在数学推理基准上的实验表明,该方案显著缩小了SWA与SA之间的差距,恢复了SWA转换过程中丢失的大部分准确性,同时保持了线性复杂度注意力的效率优势。我们的核心贡献是实证发现,RL改变了仅通过转换和SFT得出的关于SWA在数学推理中可行性的结论。

英文摘要

The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model. Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention. Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA's viability for math reasoning.
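The efficiency argument rests on the attention mask: under sliding-window attention each query sees only a fixed number of recent tokens. A minimal sketch follows, assuming a causal window that includes the current token. The window convention and sizes are assumptions, since the abstract does not state SWARR's configuration.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal, and limited to the `window` most recent positions (including i).
    """
    return [[(i - window < j <= i) for j in range(seq_len)]
            for i in range(seq_len)]


def attended_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)
```

The pair count grows as roughly `seq_len * window` instead of `seq_len ** 2 / 2` for full causal attention, so any dependency longer than `window` tokens is invisible to a single layer. That is the kind of long-range structure the abstract suggests SFT data may contain.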

2606.11918 2026-06-11 cs.AI 新提交

The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

提问的艺术:一致性增强空间推理中的事实性

Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari, Leonidas Guibas

AI总结 提出自监督强化学习框架,通过几何与语义一致性验证器(如图像翻转、文本对象顺序交换)对齐预训练模型的内在空间推理能力,无需标注数据即可达到接近监督方法的精度。

详情
AI中文摘要

当前的大型推理模型(LRMs)展现出显著的通用能力,但在空间推理任务中表现明显不足。现有方法将此差距视为知识缺陷,依赖监督微调(SFT)从外部视觉源或合成引擎中获取标注空间数据。相反,我们认为对于许多任务,空间推理能力已经存在于预训练的LRMs中,但需要通过几何2D和3D约束下的逻辑一致性进行对齐。在这项工作中,我们提出了一个自监督强化学习(RL)框架,针对内部推理过程,无需真实标注。通过形式化一致性验证器——即在变换下检查几何和语义一致性的奖励函数——我们证明模型可以提高其空间推理能力。我们同时使用图像变换(如翻转)和文本变换(如交换问题中对象的顺序),并提出了一种新的基于最优传输的RL策略OT-GRPO,这是针对成对验证器定制的组相对策略优化的最小匹配变体。我们展示了这种无标签一致性训练在精度上接近使用真实监督训练的模型,并在不同任务和数据领域实现了类似的泛化。

英文摘要

Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers -- reward functions that check for geometric and semantic consistency under transformations -- we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
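A consistency verifier for horizontal image flips can be sketched without any ground-truth label: the reward depends only on whether the two answers agree under the transformation. The answer vocabulary, the binary reward, and the function name are assumptions, and the OT-GRPO matching step is not shown.

```python
# A horizontal flip swaps left and right and leaves other relations unchanged.
FLIP_MAP = {"left": "right", "right": "left"}


def flip_consistency_reward(answer_original, answer_flipped):
    """Return 1.0 when the answer on the flipped image mirrors the original."""
    expected = FLIP_MAP.get(answer_original, answer_original)
    return 1.0 if answer_flipped == expected else 0.0
```

Note that a model can be consistent and still wrong, so this reward checks coherence rather than correctness.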

2606.07537 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data

从架构到输出:大语言模型中幻觉的结构性起源及数据的放大作用

Md. Rejaul Korim Sadi, Toufiqur Rahman Tasin, Golam Mostofa Naeem

AI总结 本文分析大语言模型幻觉的结构性根源,指出自注意力、最大似然估计训练目标和自回归解码三个架构决策构成复合失效系统,并揭示数据病理如何放大这些脆弱性。

详情
Comments
11 pages, 7 figures, 15 references
AI中文摘要

大语言模型会产生幻觉——生成流畅、自信但事实错误的输出——这种一致性跨越代际和规模。现有分类法按输出类型对幻觉进行分类,区分内在与外在失败以及忠实性与事实性偏差。这些框架在描述上严谨,但未能识别产生特定实例的内部机制。本文将幻觉分析为三个架构决策的结构性后果,这些决策共同构成一个复合失效系统。自注意力的共现学习用统计邻近性替代语义含义,导致实体混淆、事实错误归因和语义漂移。最大似然估计训练目标在无事实约束下优化下一个词元概率,奖励统计上合理的输出,无论其真值如何。自回归解码在暴露偏差下的永久从左到右承诺确保单个错误词元级联向前传递整个输出序列而无法修正。数据集病理——长尾缺陷、训练偏差和合成污染——放大了这些脆弱性,但并非独立导致它们。我们做出三项贡献。首先,我们将每个机制映射到Alansari和Luqman分类法中的特定输出类别,将内在幻觉定位于自注意力,外在幻觉定位于MLE,逻辑不一致定位于自回归解码。其次,我们表明每个常被引用的数据集病理利用这些机制之一,而非独立产生幻觉。第三,我们识别出仅基于输出类型分类的诊断局限性,并将其与推理层缓解方法进行对比。

英文摘要

Large language models hallucinate--producing fluent, confident, factually wrong outputs--with a consistency that persists across generations and scales. Existing taxonomies classify hallucination by output type, distinguishing intrinsic from extrinsic failures and faithfulness from factuality divergence. These frameworks are descriptively rigorous but do not identify which internal mechanism produced a given instance. This paper analyses hallucination as a structural consequence of three architectural decisions that together form a compound failure system. Self-attention's co-occurrence learning substitutes statistical proximity for semantic meaning and produces entity confusion, fact misattribution, and semantic drift. The maximum likelihood estimation training objective optimises next-token probability without factual constraint, rewarding statistically plausible outputs regardless of their truth value. Autoregressive decoding's permanent left-to-right commitment under exposure bias ensures that a single wrong token cascades forward through the entire output sequence without revision. Dataset pathologies--long-tail deficiencies, training bias, and synthetic pollution--amplify these vulnerabilities but do not independently cause them. We make three contributions. First, we map each mechanism to a specific output category in the Alansari and Luqman taxonomy, locating intrinsic hallucination in self-attention, extrinsic hallucination in MLE, and logical inconsistency in autoregressive decoding. Second, we show that each commonly cited dataset pathology exploits one of these mechanisms rather than originating hallucination independently. Third, we identify the diagnostic limitation of output-type-only classification and contrast it with inference-layer mitigation approaches.

2606.11201 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

干预还是不干预:通过概率模型混合指导推理时对齐

Jin Gan, Xin Li, Jun Luo

发表机构 * College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院)

AI总结 提出BlendIn框架,通过质量感知对齐和按可靠性加权混合模型知识,解决推理时对齐中指导有效性差异大的问题,在困难模型对上实现最高50%的性能提升。

详情
Comments
Accepted by ACL 2026
AI中文摘要

LLM的广泛部署使得模型对齐成为必要,以确保新训练的模型能够安全有效地响应用户指令。在不同方法中,推理时对齐通常更便宜,因为它仅在输出生成期间进行干预(即提供指导)。现有提案从某些对齐模型中提取指导,但没有适当评估其可靠性。然而,我们的系统评估显示,指导有效性在不同模型间差异很大;由于无效指导会导致进一步混乱和更多干预,由此产生的过度干预通常表明性能较差。为了使干预更有效且更高效,我们引入了BlendIn,一个推理时对齐框架,从二元决策转向创建整合两个模型知识的混合分布。BlendIn通过执行质量感知对齐并根据可靠性按比例加权每个模型的贡献来稳定推理时对齐。与现有工作相比,它保留了有益的指导,同时降低了不可靠建议的权重。BlendIn为未对齐的指导提供了诊断信号和缓解策略,在困难模型对上实现了一致且高达50%的性能提升。我们的代码可在以下网址获取:this https URL。

英文摘要

The wide deployment of LLMs has made model alignment necessary so that newly trained models respond safely and effectively to user instructions. Among different methods, inference-time alignment is often cheaper as it intervenes (i.e., offers guidance) only during output generation. Existing proposals apply guidance extracted from certain aligned models without properly assessing its reliability. Nonetheless, our systematic evaluation reveals that guidance effectiveness varies drastically across models; since ineffective guidance leads to further confusion and thus further interventions, the resulting excessive interventions typically indicate poor performance. To make interventions more effective and thus more efficient, we introduce BlendIn, an inference-time alignment framework that shifts from binary decisions to creating hybrid distributions integrating both models' knowledge. BlendIn stabilizes inference-time alignment by performing quality-aware alignment and proportionally weighting each model's contribution based on reliability. Compared with existing works, it preserves beneficial guidance while downweighting unreliable suggestions. BlendIn provides both diagnostic signals and mitigation strategies for misaligned guidance, achieving consistent performance improvements of up to 50% on challenging model pairs. Our code is available at: this https URL.
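The shift from a binary intervene-or-not decision to a hybrid distribution can be sketched as a reliability-weighted mixture of two next-token distributions. The linear mixture and the scalar `reliability` weight are assumptions: the abstract does not give BlendIn's blending rule or say how reliability is estimated.

```python
def blend_distributions(p_base, p_guide, reliability):
    """Mix two next-token distributions, weighting the guide by reliability.

    p_base and p_guide map tokens to probabilities; reliability is clamped
    to [0, 1], where 0 ignores the guide and 1 follows it fully.
    """
    w = min(max(reliability, 0.0), 1.0)
    tokens = set(p_base) | set(p_guide)
    mixed = {t: (1.0 - w) * p_base.get(t, 0.0) + w * p_guide.get(t, 0.0)
             for t in tokens}
    total = sum(mixed.values())
    if total == 0.0:
        raise ValueError("both distributions are empty")
    return {t: v / total for t, v in mixed.items()}
```

With a hard switch, an unreliable guide either takes over completely or is discarded. A soft weight lets useful guidance contribute while limiting the damage from poor guidance.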

2606.11209 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

ProcessThinker: 通过基于展开的过程奖励增强多模态大语言模型推理

Jingpei Wu, Xiao Han, Weixiang Shen, Boer Zhang, Zifeng Ding, Volker Tresp

发表机构 * LMU Munich(慕尼黑大学) Harvard University(哈佛大学) University of Cambridge(剑桥大学) Mina AI Konrad Zuse School of Excellence in Reliable AI (relAI)(康拉德·楚泽可靠人工智能卓越学校(relAI))

AI总结 提出ProcessThinker,一种无需显式过程奖励模型的后训练方法,通过步骤标记格式和基于展开的过程奖励,为多步推理提供密集的步骤级奖励,提升多模态推理一致性。

详情
Comments
Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. 7 pages, 1 figure
AI中文摘要

视觉问答越来越需要多步推理。最近在可验证奖励下的强化学习后训练(RLVR)和组相对策略优化(GRPO)可以改善多模态推理,但大多数方法依赖于稀疏的仅结果奖励。因此,它们难以判断错误答案是由于推理后期的一个小错误,还是从一开始就无用的轨迹。一个常见的解决方案是训练一个过程奖励模型(PRM)用于步骤级监督,但这通常需要大规模高质量的思想链注释和额外的训练成本。我们提出ProcessThinker,一种实用的后训练流程,无需训练显式的PRM即可提供步骤级过程奖励。ProcessThinker首先将推理轨迹重写为步骤标记格式以进行冷启动监督微调,然后应用带有标准格式奖励和我们基于展开的过程奖励的GRPO。具体来说,对于每个中间步骤,我们从该步骤采样多个连续步骤,并使用经验成功率(最终答案验证)作为步骤奖励。这提供了密集的信用分配,并鼓励更可靠地支持正确结论的推理步骤,有助于减少跨步骤的不一致或自相矛盾的进展——这是逻辑推理中的一个关键问题。在四个具有挑战性的视频基准测试(Video-MMMU、MMVU、VideoMathQA和LongVideoBench)上,ProcessThinker始终优于基线模型Qwen3-VL-8B-Instruct。

英文摘要

Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct.
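The rollout-based process reward described above can be sketched as a Monte Carlo estimate per step prefix. The callables `sample_continuation` and `verify`, the rollout count, and the seeding are hypothetical stand-ins for the policy model and the answer checker.

```python
import random


def step_rewards(steps, sample_continuation, verify, num_rollouts=8, seed=0):
    """Score each step by the success rate of continuations sampled after it.

    sample_continuation(prefix, rng) returns a final answer reached from the
    given step prefix; verify(answer) returns True when that answer is correct.
    """
    rng = random.Random(seed)
    rewards = []
    for t in range(1, len(steps) + 1):
        prefix = steps[:t]
        successes = sum(bool(verify(sample_continuation(prefix, rng)))
                        for _ in range(num_rollouts))
        rewards.append(successes / num_rollouts)
    return rewards
```

A sharp drop in reward between two consecutive steps localizes where the reasoning went wrong, which an outcome-only reward cannot do. The cost is `num_rollouts` extra generations per step.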

2606.11244 2026-06-11 cs.AR cs.AI 交叉投稿

SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

SPEAR: 一种后量化误差自适应恢复系统,实现高效低比特LLM服务

Hongyuan Liu, Yawei Li, Zhiqiang Que, Qinli Yang, Junming Shao, Guosheng Hu

AI总结 针对低比特量化导致LLM质量下降的问题,提出SPEAR系统,通过输入感知的门控误差补偿器(EC)选择性修正高误差层,结合自适应内核融合调度和SLO感知调度器,在<1%内存开销下恢复W4与FP16之间56-75%的困惑度差距。

详情
AI中文摘要

高效的大语言模型(LLM)服务日益受到部署成本的制约。量化是降低服务成本的关键技术,但即使是最先进的4比特量化器,其与FP16之间仍存在显著的质量差距,尤其是在低比特服务最有利的小型模型中。我们发现这一差距的根本原因:量化误差高度依赖于输入,且在不同token之间差异显著,而现有的后量化补偿方法是静态的,对所有输入应用相同的修正。结果,简单token被过度修正,而困难token则修正不足。我们提出SPEAR,一种后量化误差自适应恢复系统,用于改进低比特LLM服务。SPEAR引入了由逐token门控调制的轻量级误差补偿器(EC),并将其仅放置在通过CKA引导的熵感知诊断识别出的最误差敏感层。这将少量参数预算集中在最有效的位置。EC的高效部署带来了若干系统挑战,包括额外计算、由输入相关门控引起的张量并行同步,以及跨配置的延迟不稳定。SPEAR通过自适应内核融合调度解决了这些问题,结合了后同步集成规约内核与P2P双写,将EC后计算融合到低比特GEMM中,并采用SLO约束的EC感知调度器以实现可预测的服务性能。在具有挑战性的逐通道量化设置中,SPEAR恢复了W4与FP16之间56-75%的困惑度差距,同时增加了不到1%的模型内存开销,并保持了与广泛使用的4比特服务部署相当的延迟。

英文摘要

Efficient large language model (LLM) serving is increasingly constrained by deployment cost. Quantization is a key technique for reducing serving cost, yet even state-of-the-art 4-bit quantizers exhibit a noticeable quality gap from FP16, particularly for smaller models where low-bit serving is most beneficial. We identify a fundamental cause of this gap: quantization error is highly input-dependent and varies substantially across tokens, while existing post-quantization compensation methods are static and apply identical corrections to all inputs. As a result, easy tokens are over-corrected while hard tokens remain under-corrected. We present SPEAR, a system for post-quantization error-adaptive recovery that improves low-bit LLM serving. SPEAR introduces lightweight Error Compensators (ECs) modulated by per-token gates and places them only at the most error-sensitive layers identified through a CKA-guided entropy-aware diagnostic. This focuses a small parameter budget where it is most effective. Efficient deployment of ECs presents several systems challenges, including additional computation, tensor-parallel synchronization caused by input-dependent gating, and latency instability across configurations. SPEAR addresses these issues through adaptive kernel-fusion dispatch, combining an epilogue-integrated peer-reduction kernel with P2P dual-write to fuse the post-EC computation into low-bit GEMMs, and an SLO-constrained EC-aware scheduler for predictable serving performance. Across challenging per-channel quantization settings, SPEAR recovers 56-75% of the perplexity gap between W4 and FP16 while adding less than 1% model memory overhead and maintaining latency comparable to a widely used 4-bit serving deployment.

2606.11262 2026-06-11 cs.LG cs.AI 交叉投稿

PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry

PermDoRA -- 理解语言模型中的适配器干扰:参数空间几何的局限性

Gowtham Sivaramakrishnan, Sarvesha Kumar Kombaiah Seetha, Kishan Gupta Balaji, Santhosh Baradwaj Vaduvur Ranganathan

发表机构 * Independent Researcher(独立研究员)

AI总结 研究适配器组合中的干扰是否源于线性参数更新重叠,通过DoRA-RBAC框架和几何感知合并策略实验,发现参数空间几何不是干扰主因,而是共享非线性表示中的交互。

详情
Comments
18 Pages, COLM 2026
AI中文摘要

大型语言模型(LLMs)中的访问控制需要模块化机制,以在不重新训练或跨领域干扰的情况下实现特定领域行为。一个常见的假设是,适配器组合过程中的干扰源于线性参数更新的重叠,这表明强制正交性或方向独立性应能提高多领域性能。我们使用DoRA-RBAC(一种基于权重分解低秩适配的分层适配器组合框架)来测试这一假设。我们比较了传统的欧几里得合并与一种几何感知的黎曼启发式合并策略,该策略通过在LLaMA-3.1-8B和Mistral-7B上的多个QA基准(GPQA、PubMedQA、SimpleQA、WMDP)上进行归一化方向平均来近似弗雷歇均值。我们的结果表明,虽然单领域性能与LoRA相当,但几何感知合并相比标准平均在多领域组合中并未提供一致的优势。进一步分析揭示,适配器更新的角度对齐和正交性是组合性能的弱预测因子。这些发现表明,适配器干扰并非主要由参数空间几何决定,而是与共享非线性表示中的交互一致。

英文摘要

Access control in large language models (LLMs) requires modular mechanisms to enable domain-specific behavior without retraining or cross-domain interference. A common hypothesis is that interference during adapter composition arises from overlap in linear parameter updates, suggesting that enforcing orthogonality or directional independence should improve multi-domain performance. We test this hypothesis using DoRA-RBAC, a hierarchical adapter composition framework based on weight-decomposed low-rank adaptation. We compare conventional Euclidean merging with a geometry-aware Riemannian-inspired merging strategy that approximates the Frechet mean via normalized directional averaging across multiple QA benchmarks (GPQA, PubMedQA, SimpleQA, WMDP) on LLaMA-3.1-8B and Mistral-7B. Our results show that while single-domain performance matches LoRA, geometry-aware merging provides no consistent advantage over standard averaging in multi-domain composition. Further analysis reveals that angular alignment and orthogonality of adapter updates are weak predictors of composition performance. These findings suggest that adapter interference is not governed primarily by parameter-space geometry, but is instead consistent with interactions in shared nonlinear representations.
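The two merging strategies being compared can be sketched on flat update vectors. Euclidean merging is a coordinate-wise mean. For the geometry-aware variant, the abstract only says "normalized directional averaging", so the rescaling by the mean norm and the function names are assumptions.

```python
import math


def euclidean_merge(updates):
    """Coordinate-wise mean of the adapter update vectors."""
    count = len(updates)
    return [sum(column) / count for column in zip(*updates)]


def directional_merge(updates):
    """Average unit directions, renormalize, then restore the mean magnitude."""
    norms = [math.sqrt(sum(x * x for x in u)) for u in updates]
    if any(n == 0.0 for n in norms):
        raise ValueError("zero update has no direction")
    units = [[x / n for x in u] for u, n in zip(updates, norms)]
    mean_dir = euclidean_merge(units)
    mean_dir_norm = math.sqrt(sum(x * x for x in mean_dir))
    if mean_dir_norm == 0.0:
        raise ValueError("directions cancel out")
    scale = sum(norms) / len(norms)  # assumption: keep the average magnitude
    return [scale * x / mean_dir_norm for x in mean_dir]
```

For the updates `[3, 0]` and `[0, 1]`, the Euclidean mean leans toward the larger update, while the directional merge points midway between the two directions.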

2606.11272 2026-06-11 cs.LG cs.AI 交叉投稿

Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

联邦持续学习:分布式和非平稳数据上的终身与隐私保护学习综述

Masoume Gholizade, Fabrizio Ruffini, Pietro Ducange, Francesco Marcelloni

发表机构 * University of Pisa(比萨大学) University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学)

AI总结 本文系统综述联邦持续学习(FCL),定义问题、分析经典联邦学习在非平稳数据下的局限,提出多维分类法,并讨论应用、评估指标及开放挑战。

详情
Comments
77 pages, 8 figures
AI中文摘要

联邦学习(FL)能够在分布式客户端之间实现协作和隐私保护的模型训练,但大多数现有的FL系统隐含地假设数据是平稳的。在现实场景中——如医疗、工业物联网(IIOT)、网络安全和智慧城市——数据流本质上是非平稳的,导致经典FL方法遭受性能下降、不稳定和灾难性遗忘。持续学习(CL)解决了在演化数据分布下的学习问题,但主要在集中式环境中研究,忽视了联邦系统的关键约束,包括隐私、有限通信和客户端异质性。联邦持续学习(FCL)出现在FL和CL的交汇处,旨在支持分布式和非平稳数据上的终身、自适应和隐私感知学习。本综述提供了FCL的全面和系统概述。我们首先给出FCL问题的正式定义并阐明其独特特征。然后分析经典FL在非平稳条件下的局限性,强调CL原理如何支持长期适应。为了组织快速增长的文献,我们提出了FCL方法的多维分类法。此外,我们回顾了代表性的应用领域和数据模态,总结了常用的评估指标,并讨论了评估长期性能和遗忘的实验视角。最后,我们强调了关键开放挑战,包括处理时间漂移下的极端异质性、设计可扩展且隐私保护的记忆机制,以及建立标准化基准。本综述旨在为推进FCL走向鲁棒和可部署的现实世界系统提供参考和路线图。

英文摘要

Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings, such as healthcare, industrial IoT (IIoT), cybersecurity, and smart cities, data streams are inherently non-stationary, leading classical FL methods to suffer from performance degradation, instability, and catastrophic forgetting. Continual Learning (CL) addresses learning under evolving data distributions but has been largely studied in centralized settings, overlooking key constraints of federated systems, including privacy, limited communication, and client heterogeneity. Federated Continual Learning (FCL) emerges at the intersection of FL and CL, aiming to support lifelong, adaptive, and privacy-aware learning over distributed and non-stationary data. This survey provides a comprehensive and systematic overview of FCL. We first present a formal definition of the FCL problem and clarify its distinctive characteristics. We then analyze the limitations of classical FL under non-stationary conditions, highlighting how CL principles support long-term adaptation. To organize the rapidly growing literature, we propose a multi-dimensional taxonomy of FCL approaches. Furthermore, we review representative application domains and data modalities, summarize commonly used evaluation metrics, and discuss experimental perspectives for assessing long-term performance and forgetting. Finally, we highlight key open challenges, including handling extreme heterogeneity under temporal drift, designing scalable and privacy-preserving memory mechanisms, and establishing standardized benchmarks. This survey aims to serve as a reference and a roadmap for advancing FCL toward robust and deployable real-world systems.

2606.11275 2026-06-11 cs.LG cs.AI 交叉投稿

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

RoVE: 旋转值嵌入注意力实现相对位置相关的值路径

Alejandro García-Castellanos, Maurice Weiler, Erik J Bekkers

发表机构 * AMLab University of Amsterdam(阿姆斯特丹大学AMLab) MIT CSAIL(麻省理工学院计算机科学与人工智能实验室)

AI总结 提出RoVE方法,通过同时旋转键和值使值对位置敏感,将RoPE注意力转化为注意力卷积,在少样本学习、分布外困惑度和长上下文检索上优于RoPE。

详情
AI中文摘要

旋转位置嵌入(RoPE)使注意力分数具有位置相对性,但值路径对位置不敏感:值令牌发送的消息与其到查询的距离无关。我们提出RoVE,一种不引入额外参数的改动,通过将值与键同时旋转使值对位置敏感,并证明它将RoPE注意力转化为注意力卷积。这一新视角统一了计算机视觉、机器人技术和现代LLM架构中同一操作的几种独立表述。训练得到的124M和354M参数GPT-2模型在少样本上下文学习、分布外困惑度和长上下文检索上一致优于RoPE,在需要长距离聚合的任务上改进最为明显。

英文摘要

Rotary Position Embeddings (RoPE) make attention scores position-relative but leave the value pathway position-blind: the message sent by a value token is the same regardless of its distance from the query. We propose RoVE, a parameter-free modification that makes values position-sensitive by rotating them simultaneously with keys, and show that it turns RoPE attention into attentive convolution. This new perspective unifies several independent formulations of the same operation across computer vision, robotics, and modern LLM architectures. Trained 124M and 354M GPT-2 models show consistent empirical gains over RoPE on few-shot in-context learning, out-of-distribution perplexity, and long-context retrieval, with the clearest improvements on tasks that require long-range aggregation.
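The idea of a relative-position-dependent value pathway can be illustrated in two dimensions. The abstract states only that values are rotated together with keys. The counter-rotation at the query position, the single frequency `theta`, and the function names are assumptions, added to show how the received message could depend on the offset alone.

```python
import math


def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def value_message(value, key_pos, query_pos, theta=0.1):
    """Message received at query_pos from a value stored at key_pos.

    The value is rotated by its own position, and the result is rotated back
    by the query position (assumption), so only key_pos - query_pos matters.
    """
    rotated_value = rotate(value, key_pos * theta)
    return rotate(rotated_value, -query_pos * theta)
```

Positions (7, 5) and (12, 10) share the offset 2 and therefore yield the same message, while (8, 5) yields a different one. This makes the value pathway behave like a convolution over relative positions.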

2606.11417 2026-06-11 cs.LG cs.AI stat.ML 交叉投稿

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

密封审计上的有符号压缩进展是古德哈特抵抗的

Ayush Mittal, Dhruv Gupta

AI总结 提出有符号压缩进展作为内在动机,证明其累积奖励等于审计改进,且对有限审计面板具有假阳性预算,抵抗古德哈特定律。

详情
Comments
16 pages, 7 figures. Lean 4 (Mathlib) mechanized core and ARC-TGI experiment code: this https URL
AI中文摘要

压缩进展是一个长期提出的内在动机方案:当智能体的世界模型在预测或压缩经验方面变得更好时给予奖励。民间声称这种奖励是“可信的”,因为它只在学习时支付。我们使这一点精确化并证明它。如果内在奖励是固定密封审计损失的有符号减少,即 r_t = E(theta_{t-1}) - E(theta_t),那么累积奖励恰好望远镜式地归结为端点审计改进,因此没有策略可以在真实审计性能停滞或下降时无限推高奖励。对于有限审计面板,同样的结果成立,并带有尖锐的假阳性预算:累积经验奖励最多为真实审计改进加上 2 Delta_n(F, delta),即模型类的均匀审计偏差。这是无水平依赖的:一旦密封面板均匀控制该类,随时间变化的适应性无需付出代价。该定理还识别了失败模式:如果进展被截断、在智能体自身流上评分、暴露于可重用面板上的高容量模型,或应用于使 Delta_n 无效的神经类,则保证消失。我们给出了结构核心(望远镜式、有限审计界、有限吉布斯和熵下限)的 Lean 4 机械化,以及在 ARC-TGI 网格变换生成器上带有自适应保留攻击的实验套件。实验证实了理论:有限审计偏差按 n^{-0.527} 缩放;有符号进展抵抗截断农场、流泄漏和噪声电视好奇心;朴素的可重用审计可被黑盒标量反馈利用,而标准发布防御将攻击保持在 2 Delta_n 阈值以下。密封审计上的有符号压缩进展是真正改进的会计信号。

英文摘要

Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is "credible" because it is paid only for learning. We make this precise and prove it. If intrinsic reward is the signed decrease of a fixed sealed-audit loss, r_t = E(theta_{t-1}) - E(theta_t), then cumulative reward telescopes exactly to endpoint audit improvement, so no policy can push reward up indefinitely while true audit performance stagnates or degrades. For finite audit panels the same result holds with a sharp false-positive budget: cumulative empirical reward is at most true audit improvement plus 2 Delta_n(F, delta), the uniform audit deviation of the model class. This is horizon-free: adaptivity over time costs nothing once the sealed panel uniformly controls the class. The theorem also identifies the failure modes: the guarantee disappears if progress is clipped, scored on the agent's own stream, exposed to a high-capacity model on a reusable panel, or applied to a neural class that makes Delta_n vacuous. We give a Lean 4 mechanization of the structural core (telescoping, the finite-audit bound, finite Gibbs, and the entropy floor) and an experiment suite on ARC-TGI grid-transformation generators with adaptive holdout attacks. Experiments confirm the theory: finite-audit deviation scales as n^{-0.527}; signed progress resists clip-farming, stream leakage, and noisy-TV curiosity; naive reusable audits are exploitable by black-box scalar feedback, while standard release defenses keep the attack below the 2 Delta_n threshold. Signed compression progress on a sealed audit is an accounting signal of genuine improvement.
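The telescoping argument can be checked numerically. The sketch below uses a hypothetical audit-loss trajectory that oscillates and ends where it started; the function names are illustrative and not taken from the paper's code.

```python
def signed_progress_rewards(audit_losses):
    """r_t = E(theta_{t-1}) - E(theta_t) for a sequence of sealed-audit losses."""
    return [prev - cur for prev, cur in zip(audit_losses, audit_losses[1:])]

def clipped_progress_rewards(audit_losses):
    """Clipped variant max(r_t, 0), one of the failure modes the abstract names."""
    return [max(r, 0.0) for r in signed_progress_rewards(audit_losses)]

# Hypothetical loss trajectory: the model oscillates and makes no net progress.
losses = [1.0, 0.5, 1.0, 0.5, 1.0]
signed_total = sum(signed_progress_rewards(losses))    # equals losses[0] - losses[-1]
clipped_total = sum(clipped_progress_rewards(losses))  # grows with every oscillation
```

The signed total collapses to the endpoint difference (zero here), whereas the clipped total keeps paying for each downward swing, which is the "clip-farming" exploit in miniature.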

2606.11437 2026-06-11 cs.DS cs.AI cs.LG stat.ML 交叉投稿

The Power of Test-Time Training for Approximate Sampling

测试时训练对近似采样的威力

Noah Golowich, Ankur Moitra, Dhruv Rohatgi

AI总结 本文形式化测试时训练(TTT)为从已知分布类中采样的问题,证明查询复杂度的二次下界,并展示在分布类大小受限时可规避该下界,为TTT提供理论框架。

详情
AI中文摘要

从复杂概率分布中高效采样是一个基本问题,近年来随着生成式AI的兴起,这一问题变得越来越重要,因为从大语言模型(LLM)中提出的复杂采样程序已被用于解决具有挑战性的推理问题。然而,这类采样算法的有效性受到LLM与特定采样任务之间关系的限制,这推动了测试时训练(TTT)框架的发展。TTT通过根据推理时收到的部分生成和奖励反馈更新模型权重来工作,从而适应特定问题。在这项工作中,我们提出了一种TTT的形式化,将其定义为从属于已知分布类$F$的给定概率测度$\mu^\star$中生成样本的问题,给定一个提供$\mu^\star$近似密度估计的预言机$\hat \mu$。这与Jerrum、Valiant和Vazirani(1986)以及Jerrum和Sinclair(1989)的开创性工作中研究的将采样约化为近似计数的问题密切相关:即当$F$是所有分布的类时,它恰好与上述计数到采样的约化一致。在本文中,我们首先证明了在给定对$\hat \mu$的查询访问的情况下,从$\mu^\star$采样的查询复杂度的二次下界(对于足够大的类$F$),从而表明Jerrum和Sinclair(1989)提出并由Hayes和Sinclair(2010)改进的随机游走方法是最优的。这回答了Hayes和Sinclair提出的一个开放问题。然后,我们证明如果$F$的大小适当受限,这个下界可以被规避。正如我们所讨论的,后一个结果可以被视为TTT的抽象,因此代表了为TTT发展一个原则性理论框架的起点。

英文摘要

Efficiently sampling from a complex probability distribution is a fundamental problem which has become increasingly pertinent in recent years with the rise of generative AI, as sophisticated sampling procedures from LLMs have been proposed to solve challenging reasoning problems. The efficacy of such sampling algorithms is limited, however, by the relationship between the LLM and the particular sampling task at hand, which has motivated the framework of test-time training (TTT). TTT works by updating a model's weights in response to partial generations and reward feedback received at inference time, thus adapting to the particular problem. In this work, we propose a formalization for TTT as the problem of producing a sample from a given probability measure $\mu^\star$ belonging to a known class ${F}$ of distributions, given an oracle $\hat \mu$ which yields approximate density estimates for $\mu^\star$. This is closely related to the problem of reducing sampling to approximate counting studied in seminal works of Jerrum, Valiant & Vazirani (1986) and Jerrum & Sinclair (1989): namely, when ${F}$ is the class of all distributions, it coincides exactly with the aforementioned counting-to-sampling reduction. In this paper, we first show a quadratic lower bound on the query complexity of sampling from $\mu^\star$ given query access to $\hat \mu$ (for sufficiently large classes ${F}$), thus showing that the random walk approach proposed by Jerrum & Sinclair (1989) and refined by Hayes & Sinclair (2010) is optimal. This answers an open question posed by Hayes & Sinclair. We then show that this lower bound can be circumvented if the size of ${F}$ is bounded appropriately. As we discuss, this latter result can be viewed as an abstraction of TTT, and thus represents a starting point for the development of a principled theoretical framework for TTT.

2606.11473 2026-06-11 cs.LG cs.AI stat.ML 交叉投稿

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

CRUMB: 通过分布匹配上下文批处理实现高效先验拟合网络推理

Jamie Heredge, Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Niraj Kumar

发表机构 * Global Technology Applied Research, JPMorganChase(摩根大通全球技术应用研究)

AI总结 提出CRUMB方法,通过聚类查询、最小化最大均值差异选择训练子集、再执行精确推理,在不重新训练的情况下加速先验拟合网络推理,在51个数据集上优于同类方法。

详情
Comments
26 pages, 13 figures
AI中文摘要

先验拟合网络(PFNs)是一类有前景的表格基础模型,执行上下文学习,其中整个带标签的训练集作为上下文提供,并在单次前向传播中生成测试查询的预测。然而,许多PFN架构中二次缩放的自注意力机制使得对于非常大的训练数据集推理变得不可行。我们提出CRUMB(使用最小化MMD批处理的聚类检索),一个三阶段推理包装器:(i)聚类测试查询,(ii)通过贪心最小化最大均值差异(MMD)为每个聚类选择一个小型、分布匹配的训练子集,(iii)在每个缩减上下文的批次上执行精确的PFN推理。CRUMB是架构无关的,无需重新训练。在51个数据集的TabArena基准测试中,跨三种PFN架构(TabPFNv2、TabICLv1、TabICLv2)评估,我们展示了CRUMB优于类似的最先进的上下文选择策略。我们还展示了CRUMB对协变量漂移具有鲁棒性,因为MMD最小化步骤自然有助于对齐训练上下文分布以匹配当前测试批次分布。

英文摘要

Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning, whereby the entire labelled training set is supplied as context, and predictions for test queries are produced in a single forward pass. However, the quadratically scaling self-attention mechanism in many PFN architectures makes inference prohibitive for very large training datasets. We propose CRUMB (Clustered Retrieval Using Minimised-MMD Batching), a three-stage inference wrapper that (i) clusters the test queries, (ii) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy (MMD), and (iii) runs exact PFN inference on each reduced-context batch. CRUMB is architecture-agnostic and requires no retraining. On the 51-dataset TabArena benchmark, evaluated across three PFN architectures (TabPFNv2, TabICLv1, TabICLv2), we show that CRUMB outperforms similar state-of-the-art context selection strategies. We also show that CRUMB is resilient to covariate drift, as the MMD-minimisation step naturally helps align the training context distribution to match the current test batch distributions.
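Stage (ii), greedy MMD minimisation, can be sketched as follows. This is an illustrative reading of the abstract rather than the CRUMB code: it assumes an RBF kernel with a fixed bandwidth and the biased squared-MMD estimator, and it omits the query clustering of stage (i) and the PFN call of stage (iii). The names `rbf_kernel` and `greedy_mmd_subset` are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a and rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def greedy_mmd_subset(train_x, query_x, size, bandwidth=1.0):
    """Greedily pick `size` training rows whose empirical distribution is
    close to the query batch in biased squared MMD (size <= len(train_x)).

    The query-query term of MMD^2 is the same for every subset and is dropped.
    """
    k_tt = rbf_kernel(train_x, train_x, bandwidth)
    k_tq_mean = rbf_kernel(train_x, query_x, bandwidth).mean(axis=1)
    n_train = train_x.shape[0]
    chosen = []
    within = 0.0                     # sum of k(s, s') over chosen pairs
    cross = 0.0                      # sum over chosen s of mean_q k(s, q)
    row_sum = np.zeros(n_train)      # sum over chosen s of k(s, candidate)
    available = np.ones(n_train, dtype=bool)
    for _ in range(size):
        n = len(chosen) + 1
        new_within = within + 2.0 * row_sum + np.diag(k_tt)
        new_cross = cross + k_tq_mean
        objective = new_within / n**2 - 2.0 * new_cross / n
        objective[~available] = np.inf       # never pick the same row twice
        best = int(np.argmin(objective))
        chosen.append(best)
        available[best] = False
        within = new_within[best]
        cross = new_cross[best]
        row_sum += k_tt[best]
    return chosen
```

In a full pipeline this routine would be called once per query cluster, and the returned rows would form the reduced context passed to the PFN. Because the selection is driven by the current query batch, a shifted batch pulls the context toward the matching region of the training data.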

2606.11518 2026-06-11 cs.LG cs.AI 交叉投稿

SirenFNO: Efficient and Full Frequency Learning of Fourier Neural Operators

SirenFNO:高效且全频率学习的傅里叶神经算子

Pengqing Shi, Jie Yin, Stephen Tierney, Junbin Gao

发表机构 * The University of Sydney(悉尼大学)

AI总结 提出SirenFNO框架,利用正弦表示网络学习隐式神经表示并进行模态核参数化,消除频率截断,实现全频谱学习,在多个PDE基准上以最多73倍参数减少取得性能提升。

详情
Comments
9 pages, accepted by IJCAI 2026
AI中文摘要

傅里叶神经算子(FNO)是近似求解偏微分方程的有效且高效的替代方法,并能跨离散化泛化。然而,由于依赖频率截断以保持FNO的学习效率,实证研究表明FNO对低频信息存在频谱偏差,这可能阻碍学习能力,尤其是对于某些具有强烈高频振荡的偏微分方程。为了解决这一局限性,我们提出了SirenFNO,一种利用正弦表示网络(SIREN)学习隐式神经表示并进行模态核参数化的新颖框架。我们的SIREN参数化以常数且与离散化无关的参数数量学习全网格频谱,从而消除了频率截断的需要。我们进一步通过函数张量分解扩展SirenFNO,以提高参数和学习效率。实证结果表明,我们的SirenFNO在保持离散化不变性的情况下,以约4到15倍的参数减少持续优于FNO,并且我们的函数分解变体在多个PDE基准上以最多73倍的参数减少获得了性能提升。

英文摘要

Fourier neural operators (FNOs) are effective and efficient surrogates for approximating solutions of PDEs and generalize across discretizations. However, owing to the reliance on frequency truncation to maintain learning efficiency of FNOs, empirical studies suggest that FNOs exhibit spectral bias toward low-frequency information, which may hinder the learning capability especially for certain PDEs with strong high-frequency oscillations. To address this limitation, we propose SirenFNO, a novel framework that leverages sinusoidal representation networks (SIRENs) to learn implicit neural representations and performs mode-wise kernel parameterization. Our SIREN parameterization learns a full-grid spectrum with a constant and discretization-independent parameter count, thereby eliminating the need for frequency truncation. We further extend SirenFNO with functional tensor decompositions to enhance parameter and learning efficiency. Empirical results show that our SirenFNO consistently outperforms FNO with approximately $4$ to $15$ times parameter reductions with preserved discretization invariance, and our functional decomposition variants obtain performance improvements with a maximum of $73$ times fewer parameters across multiple PDE benchmarks.
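A one-dimensional toy version shows why a SIREN-parameterised kernel needs no frequency truncation: the kernel is a function of the wavenumber, so it can be evaluated at every mode of any grid with a fixed parameter count. The sketch below is an assumption-laden illustration, not the SirenFNO architecture: it uses a single channel, a two-layer sine network with a simplified initialisation, and a hypothetical `freq_scale` that normalises the wavenumber.

```python
import numpy as np

def init_siren(rng, hidden=16, w0=30.0):
    """Tiny sine network: scalar wavenumber -> (real, imag) kernel value.

    Simplified initialisation, not the scheme from the SIREN paper.
    """
    return {
        "w1": rng.uniform(-1.0, 1.0, (1, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.uniform(-1.0, 1.0, (hidden, 2)) / np.sqrt(hidden),
        "b2": np.zeros(2),
        "w0": w0,
    }

def siren_kernel(params, coords):
    """Evaluate the complex spectral multiplier at each frequency coordinate."""
    h = np.sin(params["w0"] * (coords[:, None] @ params["w1"] + params["b1"]))
    out = h @ params["w2"] + params["b2"]
    return out[:, 0] + 1j * out[:, 1]

def spectral_layer(params, u, freq_scale=64.0):
    """Multiply every rFFT mode of a signal sampled on [0, 1) by the kernel."""
    n = u.shape[0]
    u_hat = np.fft.rfft(u)
    wavenumbers = np.fft.rfftfreq(n, d=1.0 / n)   # 0, 1, ..., n // 2
    kernel = siren_kernel(params, wavenumbers / freq_scale)
    return np.fft.irfft(kernel * u_hat, n=n)

def count_params(params):
    """Number of learnable scalars; independent of the grid size."""
    return sum(p.size for p in params.values() if isinstance(p, np.ndarray))
```

Applying the same parameters to a band-limited signal sampled at 64 and at 128 points gives matching outputs at the shared grid points, which is the discretisation independence the abstract refers to, here in its simplest form.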

2606.11614 2026-06-11 cs.LG cs.AI cs.CV 交叉投稿

Information-Theoretic Decomposition for Multimodal Interaction Learning

多模态交互学习的信息论分解

Zequn Yang, Yake Wei, Haotian Ni, Zhihao Xu, Di Hu

AI总结 提出基于信息论的多模态交互分解方法DMIL,通过变分分解架构和微调策略学习样本特定的冗余、独特和协同交互,提升多模态学习性能。

详情
Comments
Accepted to CVPR 2026
AI中文摘要

多模态学习依赖于捕获跨模态的冗余、独特和协同信息,这些信息共同构成多模态交互。一个关键但尚未充分探索的挑战是,这些隐式交互在不同样本间动态变化。在这项工作中,我们首次进行了系统的信息论分析,强调了学习这些动态的、样本特定的交互对于有效多模态学习的重要性。我们的分析进一步揭示了传统范式在学习这些不同交互类型方面的缺陷:模态集成方法难以捕获协同,而联合学习范式往往未能充分利用冗余信息。这突显了对一种能够基于每个样本自适应地从不同交互类型中学习的方法的需求。为此,我们提出了基于分解的多模态交互学习(DMIL),一种显式建模并学习样本特定交互的新范式。首先,我们设计了一个变分分解架构来分离组成交互组件。其次,我们采用了一种新的学习策略,在微调过程中利用这些显式交互组件来实现全面的交互学习。跨不同任务和架构的大量实验表明,DMIL通过适应整体的样本特定交互,始终实现了优越的性能。我们的框架灵活且广泛适用,建立了一个以交互为中心的多模态学习范式。代码可在以下网址获取:此 https URL。

英文摘要

Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at this https URL.

2606.11627 2026-06-11 cs.LG cs.AI 交叉投稿

When Context Returns: Toward Robust Internalization in On-Policy Distillation

当上下文回归:面向在线策略蒸馏中的鲁棒内化

Xun Wang, Ruishuo Chen, Zhuoran Li, Yu Chen, Longbo Huang

发表机构 * IIIS, Tsinghua University(清华大学交叉信息研究院)

AI总结 针对在线策略蒸馏中上下文内化后重新引入上下文导致性能下降的问题,提出一种轻量级一致性正则化方法,通过锚定无上下文输出并惩罚偏离,有效缓解退化并提升鲁棒性。

详情
AI中文摘要

近期研究表明,在线策略蒸馏可以将特权上下文(如系统提示或任务提示)内化到学生模型中,使得推理时不再需要上下文。尽管该方法成功提升了学生的无上下文性能,我们却发现一个有趣且此前未被研究的现象:在许多设置中,向蒸馏后的学生模型重新引入原始特权上下文实际上会降低其性能,甚至对于它已经在无上下文情况下正确解决的实例也是如此。我们将此称为上下文诱导退化,并认为鲁棒内化不仅要求匹配教师的条件上下文行为,还要求在上下文重新引入时保持稳定,这一性质我们称为上下文可移除性。受此观察启发,我们提出一种轻量级一致性正则化方法,首先通过停止梯度锚定学生的无上下文输出,然后通过前向KL散度惩罚条件上下文输出偏离该锚点。这一简单添加每训练步仅需一次额外前向传播,却能有效缓解上下文诱导退化,并在许多情况下甚至提升无上下文性能。在涵盖不同领域和模型家族的12种配置中,我们的方法在大多数设置下提升了条件上下文准确率,在11/12的设置中减少了上下文诱导损害,并有效消除了响应长度膨胀。一项机制性案例研究进一步证实,上下文可移除性在表示层面得以实现,无论上下文是否存在,隐藏状态几乎保持相同。

英文摘要

Recent work has shown that on-policy distillation can internalize privileged context, such as system prompts or task hints, into a student model so that the context is no longer needed at inference time. Although this approach successfully improves the student's no-context performance, we identify an interesting and previously unstudied phenomenon: in many settings, reintroducing the original privileged context to the distilled student actually degrades its performance, even on instances it already solves correctly without context. We term this context-induced degradation and argue that robust internalization demands not only matching the teacher's context-conditioned behavior, but also remaining stable when the context is reintroduced, a property we call context removability. Motivated by this observation, we propose a lightweight consistency regularizer that first anchors the student's no-context output via stop-gradient, then penalizes the context-conditioned output for deviating from it via forward KL divergence. This simple addition requires only one extra forward pass per training step, yet it effectively mitigates context-induced degradation and, in many cases, even improves no-context performance. Across 12 configurations spanning diverse domains and model families, our method improves context-conditioned accuracy in the majority of settings, reduces context-induced harm in 11 out of 12 settings, and effectively eliminates response-length inflation. A mechanistic case study further confirms that context removability is achieved at the representation level, with hidden states remaining nearly identical regardless of whether the context is present.
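The consistency term can be written down in a few lines. The sketch below is an interpretation of the abstract, assuming the KL is taken from the no-context distribution (the anchor) to the context-conditioned distribution and averaged over token positions; the function name `consistency_penalty` is hypothetical, and NumPy stands in for an autodiff framework.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def consistency_penalty(no_context_logits, context_logits):
    """Mean forward KL( p_no_context || p_context ) over token positions.

    The no-context distribution is the anchor. In an autodiff framework it
    would be wrapped in a stop-gradient so that only the context-conditioned
    branch receives gradient; NumPy has no gradients, so it is a constant here.
    """
    log_p = log_softmax(no_context_logits)   # anchor (treated as fixed)
    log_q = log_softmax(context_logits)      # branch being regularised
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl.mean())
```

The penalty is zero when adding the context does not change the student's output distribution and grows as the two distributions diverge. Computing the anchor logits is the extra forward pass per training step that the abstract mentions.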

2606.11640 2026-06-11 cs.LG cs.AI 交叉投稿

TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning

TAROT: 面向小样本表格学习的任务自适应LLM先验图精炼

Ruxue Shi, Yili Wang, Mengnan Du, Hangting Ye, Yi Chang, Xin Wang

发表机构 * Jilin University(吉林大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出TAROT框架,通过构建并精炼任务自适应语义图,利用LLM先验和GNN编码特征语义关系,提升小样本表格学习性能。

详情
AI中文摘要

小样本表格学习为实际应用中标注成本高、新任务样本收集困难的情况提供了一种经济有效的方法。现有的传统方法和基于LLM的方法在小样本场景中已展现出有效性。然而,传统方法需要在未标注或生成的数据上进行额外训练,这带来了显著的计算开销。此外,直接将原始表格数据输入LLM的基于LLM的方法引发了隐私和合规性问题。更重要的是,这两种范式都很大程度上忽略了特征之间的语义关系,而语义关系为构建语义图提供了结构和语义先验。语义图对于在小样本场景中建模有意义的特征交互至关重要。本文提出TAROT,一个基于GNN的框架,通过从先验中构建并精炼任务自适应语义图来编码结构和语义先验,从而提升小样本表格学习的预测性能。TAROT首先通过统一语义表格节点编码器(USTNE)将异构表格数据编码为统一的节点语义表示。然后,它提示LLM根据任务描述和特征名称推断特征之间的语义关系,以构建语义图。为了减轻LLM幻觉引入的结构噪声,TAROT引入了任务自适应语义图精炼,剪除虚假或与任务无关的边,并添加缺失的与任务相关的边,使图结构与下游目标对齐。最后,GNN在精炼后的图上进行消息传递,以捕获与任务相关的语义依赖关系进行预测。在各种小样本表格学习基准上的大量实验证明了TAROT的优越性能,使其成为该领域的最先进方法。

英文摘要

Few-shot tabular learning provides a cost-effective approach for real-world applications where annotation is costly and collecting sufficient samples for new tasks is difficult. Existing traditional and LLM-based methods have demonstrated effectiveness in few-shot scenarios. However, traditional methods need additional training on unlabeled or generated data, which incurs significant computational overhead. In addition, LLM-based methods that directly feed raw tabular data into LLMs raise privacy and compliance concerns. More importantly, both paradigms largely overlook the semantic relationships between features, which provide structural and semantic priors for constructing a semantic graph. A semantic graph is essential for modeling meaningful feature interactions in few-shot scenarios. In this paper, we propose TAROT, a GNN-based framework that encodes the structural and semantic priors by constructing and refining a task-adaptive semantic graph from them, thereby improving predictive performance in few-shot tabular learning. TAROT first encodes heterogeneous tabular data into unified node semantic representations via a Unified Semantic Tabular Node Encoder (USTNE). Then, it prompts LLMs to infer the semantic relationships between features based on the task description and feature names to construct a semantic graph. To mitigate structural noise introduced by LLM hallucination, TAROT introduces Task-adaptive Semantic Graph Refinement, which prunes spurious or task-unrelated edges and adds missing task-related ones, aligning the graph structure with the downstream objective. Finally, a GNN performs message passing over the refined graph to capture task-related semantic dependencies for prediction. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of TAROT, establishing it as a state-of-the-art approach in this domain.

2606.11695 2026-06-11 cs.LG cs.AI 交叉投稿

Noise-Aware Framework for Correcting Corrupted Labels

噪声感知框架用于纠正损坏标签

Ha-Linh Nguyen, Hong-Anh Nguyen, Minh-Duc La, Phong Lam, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

发表机构 * Faculty of Information Technology, VNU University of Engineering and Technology(越南国立大学工程与技术学院信息技术系)

AI总结 提出CANOLA框架,通过噪声感知学习和迭代标签精炼来纠正损坏标签,在六个数据集上相比现有方法错误率降低19%-52%。

详情
AI中文摘要

高质量的标注数据对于训练可靠的ML/DL模型至关重要。然而,现实世界的数据集通常包含相当比例的损坏标签,这会严重降低模型性能。为了解决这个问题,我们提出了CANOLA,一种通过噪声感知学习和迭代标签精炼来纠正损坏标签的新框架。CANOLA明确估计数据集的潜在噪声分布,并将此信息纳入噪声感知深度神经网络的训练中。通过在训练过程中融入噪声特征,CANOLA使模型能够降低不可靠监督信号的权重,并专注于可信模式,从而提高鲁棒性和泛化能力。标签纠正是通过谨慎的迭代软标签精炼进行的,其中模型预测与观察到的标签混合,以防止过早或错误的更新。这种渐进式精炼使得数据集能够以稳定且可控的方式得到修复。我们在六个广泛使用的数据集上,在现实噪声标注场景下评估了CANOLA。实验结果表明,CANOLA始终优于最先进的标签纠正方法,在错误减少方面实现了19%到52%的相对改进。此外,在由CANOLA纠正的数据集上训练的模型获得了显著的下游性能提升。即使在CANOLA纠正的数据上训练的简单分类器,其性能也能超过复杂的以模型为中心的方法,最高可达67%。

英文摘要

High-quality labeled data is essential for training reliable ML/DL models. However, real-world datasets often contain a considerable proportion of corrupted labels, which can severely degrade model performance. To address this problem, we propose CANOLA, a novel framework for correcting corrupted labels through noise-aware learning and iterative label refinement. CANOLA explicitly estimates the underlying noise distribution of the dataset and incorporates this information into the training of a noise-aware Deep Neural Network. By incorporating noise characteristics during learning, CANOLA enables the model to down-weight unreliable supervision signals and focus on trustworthy patterns, thereby improving robustness and generalization. Label correction is performed via cautious, iterative soft label refinement, in which model predictions are blended with observed labels to prevent premature or erroneous updates. This progressive refinement allows the dataset to be repaired in a stable and controlled manner. We evaluate CANOLA on six widely used datasets under realistic noisy labeling scenarios. Experimental results show that CANOLA consistently outperforms SOTA label correction methods, achieving relative improvements ranging from 19% to 52% in error reduction. Moreover, models trained on datasets corrected by CANOLA obtain substantial downstream performance gains. Even simple classifiers trained on CANOLA's corrected data can outperform complex model-centric approaches by margins of up to 67%.
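The cautious soft-label refinement can be illustrated with a convex blend of the current labels and the model's predictions. The abstract does not give the exact update rule, so the fixed blending weight `blend`, the function name, and the toy probabilities below are all assumptions.

```python
import numpy as np

def refine_soft_labels(labels, predictions, blend=0.2):
    """One cautious refinement step: move soft labels part-way toward predictions."""
    return (1.0 - blend) * labels + blend * predictions

observed = np.array([[1.0, 0.0, 0.0]])    # observed (possibly corrupted) one-hot label
predicted = np.array([[0.1, 0.8, 0.1]])   # hypothetical model probabilities
labels = observed.copy()
history = []
for _ in range(5):
    labels = refine_soft_labels(labels, predicted)
    history.append(int(labels.argmax()))  # hard label implied after each step
```

With a small blending weight the implied hard label does not flip after a single confident prediction; it changes only once the model has disagreed with the observed label over several rounds, which is one way to avoid premature updates.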

2606.11712 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

用户侧记忆中的子模块不对称性:一个诊断框架

Youwang Deng

发表机构 * EpistemicaLab — Independent Research(EpistemicaLab — 独立研究)

AI总结 提出一个诊断框架,将LLM用户侧记忆分解为行为一致性、事实存在和事实缺失三个正交子模块,发现参数记忆与检索记忆在不同子模块上存在不对称性,且RLHF调优加剧了这种不对称性。

详情
Comments
Preprint. Code: this https URL
AI中文摘要

LLM中的用户侧记忆通常被评分为单一的“个性化”能力:给定用户历史,输出是否更了解用户?我们表明这种聚合指标隐藏了相反方向的失败。记忆至少可分解为三个正交轴——行为一致性(风格、语气)、事实存在(回忆历史中的事实)和事实缺失(当事实缺失时弃权)——并且没有单一子模块能在所有三个轴上获胜。在受控的50用户合成语料库和真实数据探针(LaMP-3)上,比较每个用户的gamma-LoRA(在每个用户历史上训练的小型LoRA适配器;gamma表示每个用户,而非每个任务)与BGE-large密集top-K检索,我们发现gamma-LoRA在行为风格上决定性获胜,而RAG在事实缺失上决定性获胜——并且注意力层21-35中的相同查询投影细胞因果地承载了这两个相反方向的效果(将这些LoRA权重归零会使缺失探针TPR提高33个百分点,并使存在探针TPR下降20个百分点)。在更经过RLHF调优的Llama-3.1-8B-Instruct上,不对称性增强而非愈合:参数记忆的行为优势崩溃,而其相对于检索的缺失校准赤字扩大——这是对参数用户记忆的对齐税。在真实数据LaMP-3上,gamma-LoRA表现低于多数基线;一个9条件缓解扫描诊断出这是指令遵循崩溃,而非子模块失败(9x2交叉乘积显示评估时的{1..5} logit掩码使每个配方的主准确率达到>=0.995),并且最佳训练时修复在Llama上逐位复制。最后,子模块选择路由是问题分类,而非校准:仅基于问题文本的110M DistilBERT击败了每个基于logit的路由器。我们贡献了诊断框架、诊断出的真实数据负例、对齐税复制以及路由即分类的发现。

英文摘要

User-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into at least three orthogonal axes -- behavioral consistency (style, voice), factual presence (recall facts in history), and factual absence (abstain when a fact is absent) -- and no single substrate wins all three. Comparing per-user gamma-LoRA (a small LoRA adapter trained on each user's history; gamma denotes per-user, not per-task) against BGE-large dense top-K retrieval on a controlled 50-user synthetic corpus and a real-data probe (LaMP-3), we find gamma-LoRA decisively wins behavioral style while RAG decisively wins factual absence -- and the same query-projection cells in attention layers 21-35 causally load-bear both effects in opposite directions (zeroing those LoRA weights raises absence-probe TPR by +33 pp and drops presence-probe TPR by 20 pp). On the more heavily RLHF-tuned Llama-3.1-8B-Instruct the asymmetry strengthens, not heals: parametric memory's behavioral advantage collapses while its absence-calibration deficit against retrieval widens -- an alignment tax on parametric user-memory. On real-data LaMP-3, gamma-LoRA underperforms a majority baseline; a 9-condition mitigation sweep diagnoses this as instruction-following collapse, not substrate failure (a 9x2 cross-product shows the eval-time {1..5} logit mask drives main_acc to >=0.995 on every recipe), and the best training-time fix replicates bit-identically on Llama. Finally, substrate-selection routing is question-classification, not calibration: a 110M DistilBERT on the question text alone beats every logit-based router. We contribute the diagnostic framework, the diagnosed real-data negative, the alignment-tax replication, and the routing-as-classification finding.

2606.11722 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

ICA Lens: Interpreting Language Models Without Training Another Dictionary

ICA Lens: 无需训练另一本词典即可解释语言模型

Sida Liu, Feijiang Han

发表机构 * Independent Researcher(独立研究员) University of Maryland(马里兰大学)

AI总结 提出ICALens,基于独立成分分析(ICA)高效提取语言模型表示中可解释方向,无需训练稀疏自编码器,在SAEBench上表现竞争力。

详情
Comments
Ongoing Project
AI中文摘要

在语言模型表示中找到可解释方向对于理解和控制模型行为至关重要。稀疏自编码器(SAE)已成为此目的的标准工具,但将其作为默认的第一透镜通常需要训练、存储和评估大型过完备字典。这一瓶颈限制了快速探索,并提出了一个基本问题:在训练另一个神经字典之前,从激活几何中已经可以看到多少可解释结构?我们的直觉很简单:许多可解释方向对令牌具有选择性,这些方向看起来比随机方向更不服从高斯分布。因此,我们重新审视独立成分分析(ICA),这是一种寻找非高斯方向的经典方法,作为语言模型可解释性的紧凑透镜。我们发现ICA在LLM可解释性中被低估了,因为先前的使用通常依赖于现成的ICA实现,这些实现在LLM激活上不稳定,并且缺乏用于检查和评估恢复方向的系统工具。为弥补这些差距,我们引入了ICALens,这是第一个用于LLM表示的稳定、高效和可审计ICA分析的实用工作流。它结合了优化的GPU并行FastICA流水线、LLM特定的稳定性配方和更好的拟合诊断,实现了高效可靠的逐层分析。在GPT-2 Small、Gemma 2 2B和Qwen 3.5 2B Base上,ICALens高效地恢复了紧凑、人类可解释的方向,无需逐层基于梯度的字典训练。在SAEBench上,ICA在稀疏探测中与公共SAE竞争,并在中小预算下的目标探测扰动中优于它们。这些结果表明,ICA不应被视为弱基线,而应被视为探索语言模型表示的高效且互补的第一透镜。

英文摘要

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
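The intuition that token-selective directions look non-Gaussian can be made concrete with excess kurtosis, one common non-Gaussianity measure (FastICA typically uses related contrast functions). The sketch below uses synthetic projections rather than real LLM activations, and is not part of the ICALens workflow.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardised moment minus 3; zero for a Gaussian."""
    centred = x - x.mean()
    var = (centred ** 2).mean()
    return float((centred ** 4).mean() / var ** 2 - 3.0)

rng = np.random.default_rng(0)
dense = rng.standard_normal(100_000)   # stand-in for a projection onto a random direction
selective = np.zeros(1000)
selective[:10] = 1.0                   # stand-in for a feature that fires on 1% of tokens
```

The selective projection has a large positive excess kurtosis while the dense one stays near zero, which is the kind of signal an ICA objective searches for when it rotates the activation basis.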

2606.11766 2026-06-11 eess.AS cs.AI cs.CL cs.SD 交叉投稿

Fast Speech Foundation Model Distillation Using Interleaved Stacking

快速语音基础模型蒸馏使用交错堆叠

Eungbeom Kim, Kyogu Lee

AI总结 提出交错堆叠方法加速语音基础模型蒸馏训练,通过保持层位置一致性解决性能下降问题,在SUPERB上验证有效性。

详情
Comments
Accepted by Interspeech 2026
AI中文摘要

将大型语音基础模型(SFM)蒸馏为高效的学生模型已成功应用于低资源环境。尽管蒸馏减少了推理延迟,但它需要额外的学生模型训练。然而,SFM蒸馏的训练效率仍未得到充分探索。在这项工作中,我们探索了SFM蒸馏的训练加速以加快模型部署。我们研究了堆叠的潜力,其中模型深度通过训练逐步增加,直到达到目标模型深度。虽然现有的堆叠方法提高了训练速度,但它们遭受性能下降。为了解决这一限制,我们提出了交错堆叠,一种新颖的堆叠方法,在整个堆叠过程中始终保持层位置。这一特性在SFM中尤为关键,因为每一层编码了不同的层特定知识。我们在SUPERB上验证了所提方法的有效性。

英文摘要

Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires an additional student model training. However, the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased through training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To handle this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.

2606.11814 2026-06-11 quant-ph cs.AI cs.LG 交叉投稿

Sparsified Kolmogorov-Arnold Networks for Interpretable Quantum State Tomography

稀疏化Kolmogorov-Arnold网络用于可解释量子态层析

Xinge Wu, Huaxin Wang, Jiajun Liu, Ruiqing He, Jiandong Shang, Hengliang Guo, Qiang Chen

AI总结 研究利用稀疏化Kolmogorov-Arnold网络作为可检查的重构规则,通过三量子比特GHZ基准测试,识别出与GHZ相关的Pauli测量集,并揭示与解析GHZ Pauli分组一致的输入-隐藏-输出通路结构,实现神经网络重构模型的结构可解释性。

详情
AI中文摘要

量子态层析的机器学习方法可以实现高保真度重构,但训练模型所使用的物理结构往往隐含。这里我们探究稀疏化Kolmogorov-Arnold网络(KAN)是否不仅可以作为回归器,还可以作为可检查的重构规则,其内部组织可以与已知的Pauli结构进行对照。我们研究了一个受控的三量子比特GHZ族基准测试,其中所有63个非恒等Pauli期望值被用于重构三个GHZ子空间变量:种群不平衡$z$、实部非对角分量$c$和虚部非对角分量$s$。在有限采样和退极化噪声下,外部消融从63个测量中识别出扩展的12通道GHZ相关Pauli集,在测试的采样次数和退极化噪声强度下实现了精确的前12恢复。这些支持模式在多种子随机初始化和噪声水平分析中保持稳定,并在随机标签控制下崩溃。主要的剪枝输入-隐藏-输出通路以与解析GHZ Pauli分组一致的方式组织Z型种群可观测量和X/Y非对角可观测量,稀疏公式恢复还原了规范的带符号Pauli关系。因此,KAN的贡献在于神经重构模型中的通路级结构可解释性,而非优越的稀疏回归。结合阴性对照,这些探针提供了一条一致性链,用于审计学习到的重构规则与已知物理结构的一致性。

英文摘要

Machine-learning approaches to quantum state tomography can achieve high reconstruction fidelity, but the physical structure used by the trained model often remains implicit. Here we ask whether a sparsified Kolmogorov-Arnold Network (KAN) can be used not only as a regressor, but also as an inspectable reconstruction rule whose internal organization can be checked against known Pauli structure. We study a controlled three-qubit GHZ-family benchmark in which all 63 non-identity Pauli expectation values are used to reconstruct three GHZ-subspace variables: the population imbalance $z$, the real off-diagonal component $c$, and the imaginary off-diagonal component $s$. Under finite-shot sampling and depolarizing noise, external ablation identifies the extended 12-channel GHZ-relevant Pauli set from the 63 measurements, with exact top-12 recovery across the tested shot counts and depolarizing-noise strengths. These support patterns remain stable across multi-seed random-initialization and noise-level analyses, and collapse under random-label controls. The dominant pruned input-hidden-output pathways organize Z-type population observables and X/Y off-diagonal observables in a pattern consistent with the analytic GHZ Pauli grouping, and sparse formula recovery recovers the canonical signed Pauli relations. The contribution of the KAN is therefore pathway-level structural interpretability within a neural reconstruction model, rather than superior sparse regression. Together with negative controls, these probes provide a consistency chain for auditing learned reconstruction rules against known physical structure.

2606.11831 2026-06-11 cs.LG cs.AI 交叉投稿

From Uniform to Learned Graph Priors: Diffusion for Structure Discovery

从均匀到学习图先验:用于结构发现的扩散

Qi Shao, Hao Guo, Jiawen Chen, Duxin Chen, Wenwu Yu

发表机构 * School of Mathematics, Southeast University(东南大学数学学院)

AI总结 提出Diff-prior,一种扩散参数化的自适应先验,通过可学习的去噪式校准对边后验进行结构化校准,提升神经关系推理方法的结构发现可靠性。

详情
Comments
15 pages, 3 figures, Accepted by KDD 2026
AI中文摘要

神经关系推理(NRI)方法通过离散潜在边的变分推理从轨迹中发现交互图。然而,这些方法通常依赖于过度简化的因子化图先验。这种先验通常接近均匀分布,将边视为独立实体。这种系统性错位与现实世界系统不匹配,导致边后验分散且不明确,限制了结构发现的可靠性。为了解决这个问题,我们提出了Diff-prior,一种扩散参数化的自适应先验,用于校准潜在图分布而非生成图。我们的核心见解是将先验整合重新构建为一种可学习的去噪式校准,将分散、不确定的边后验组织成更可靠的整体结构,该结构可通过扩散模型训练。Diff-prior学习一个自适应结构先验,在推理过程中对边后验进行结构化校准,引导其朝向更接近底层结构的分布。Diff-prior在结构采样之前操作,并直接对编码器边分布进行去噪校准,为结构化变量提供了一种通用的训练范式。在标准基准上的实验验证了我们的框架,结果表明Diff-prior提高了结构推理的性能,并在多个NRI系列架构中生成更明确的边后验。代码可在以下网址获取:此 https URL。

英文摘要

Neural relational inference (NRI) methods discover interaction graphs from trajectories through variational inference over discrete latent edges. However, these methods typically rely on oversimplified, factorized graph priors. Such priors, which are typically close to uniform, treat edges as independent entities. This systematic mismatch with real-world systems yields diffuse and indecisive edge posteriors, limiting the reliability of structure discovery. To address this, we propose Diff-prior, a diffusion-parameterized adaptive prior used to calibrate the latent graph distribution rather than to generate graphs. Our core insight is to reframe prior integration as a learnable denoising-style calibration that organizes scattered, uncertain edge posteriors into a more reliable overall structure and that can be trained as a diffusion model. Diff-prior learns an adaptive structure prior that performs structured calibration on the edge posteriors during inference, guiding them toward a distribution closer to the underlying structure. Diff-prior operates before structural sampling and acts as a denoising calibrator directly on the encoder edge distribution, which provides a generic training paradigm over structured variables. Experiments on standard benchmarks validate our framework, and the results indicate that Diff-prior improves the performance of structure inference and generates more decisive edge posteriors across multiple NRI-family architectures. The code is available at this https URL.

2606.11836 2026-06-11 cs.SD cs.AI eess.AS 交叉投稿

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

面向语音基础模型的无数据无训练压缩:基于参数聚类的方法

Haoning Xu, Zhaoqing Li, Huimeng Wang, Youjun Chen, Chengxi Deng, Mengzhe Geng, Xunying Liu

发表机构 * The Chinese University of Hong Kong(香港中文大学) National Research Council Canada(加拿大国家研究委员会)

AI总结 提出一种基于k-means通道聚类的无数据无训练压缩方法,通过层间不同参数簇数实现细粒度混合稀疏剪枝,在HuBERT-large和Whisper-large-v3上显著降低WER。

详情
Comments
Accepted by Interspeech 2026
AI中文摘要

本文提出了一种新颖的无数据无训练压缩方法,用于语音基础模型,该方法通过k-means进行通道级聚类。还探索了更细粒度的混合稀疏剪枝,通过层间不同数量的参数簇实现。在LibriSpeech数据集上进行的实验表明,当对HuBERT-large进行50%的剪枝稀疏度操作时,在微调前,测试干净和测试其他子集上,相对于基于幅度的剪枝,获得了27.73%/18.61%绝对(34.37%/21.91%相对)的一致WER降低;在仅3个epoch的微调后,获得了0.19%/0.79%绝对(3.36%/4.62%相对)的降低。在Whisper-large-v3上,在10%稀疏度下,相对于基于幅度的剪枝,观察到2.86%/5.02%绝对(59.21%/55.29%相对)的类似WER降低,所有这些相对于未压缩基线均没有显著的WER增加。

英文摘要

This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. More fine-grained, mixed-sparsity pruning, in which the number of parameter clusters varies from layer to layer, is also explored. Experiments conducted on the LibriSpeech dataset suggest that at a pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning, and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning for only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
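To make the idea of channelwise k-means clustering concrete, here is a minimal sketch in plain Python. It treats each row of a weight matrix as one output channel, clusters the rows, and replaces every channel with its cluster centroid, so only k centroids plus one index per channel need to be stored. The function names, the farthest-point initialisation, and the row-as-channel layout are illustrative assumptions; the abstract does not say how the authors initialise the clusters or how cluster counts map to a sparsity level.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iters=20):
    """Lloyd's k-means with a deterministic farthest-point start (an assumption)."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    assign = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest centroid for every channel.
        assign = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def compress_channels(weight, k):
    """Replace each channel (row) with its centroid; no data or training needed."""
    centroids, assign = kmeans(weight, k)
    return [list(centroids[a]) for a in assign], assign

# Hypothetical 4-channel layer compressed to 2 shared channels.
weight = [[1.0, 1.0], [1.1, 0.9], [-1.0, -1.0], [-0.9, -1.1]]
compressed, assign = compress_channels(weight, 2)
```

Choosing a different k per layer would give the layer-wise mixed setting the abstract mentions, although how the authors pick those values is not described here.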

2606.11854 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

使用ART微调多模态大语言模型:基于艺术的强化训练

Michal Chudoba, Sergey Alyaev, Petra Galuscakova, Tomasz Wiktorski

发表机构 * University of Stavanger(斯塔万格大学) NORCE Research(NORCE研究机构)

AI总结 提出ART方法,通过优化原始视觉输入将信息注入冻结的多模态大语言模型,实现软提示微调,无需修改计算图,在数学和工具使用基准上达到与LoRA相当的精度。

详情
AI中文摘要

大语言模型有两种主要的参数高效微调技术。低秩适应在LLM层之间引入额外权重,而软提示则向LLM输入引入额外的微调特定原始token。然而,两者都需要修改预编译、预优化LLM的计算图。因此,两者在vLLM等高吞吐引擎中均未得到完全支持。我们提出使用ART(基于艺术的强化训练)进行微调。该方法通过仅优化冻结的多模态大语言模型的原始视觉输入来注入信息,从而在预编译计算图上实现软token方法。它依赖于将梯度反向传播到普通像素阵列,因此支持任何微调目标。此外,优化的视觉输入可以风格化为与任务相关的计算艺术品。该方法在流行的开源Qwen架构的不同规模以及多个文本基准上的有效性得到确认。具体而言,ART在数学和结构化工具使用基准上达到了与LoRA竞争的精度。

英文摘要

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modifications to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on precompiled computational graphs. It relies on backpropagating gradients into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.

2606.11961 2026-06-11 cs.LG cs.AI 交叉投稿

Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data

类别先验锁定:为何上下文学习在结构化数据上失败

Antonio Pelusi, Stefano Braghin, Alberto Trombetta

发表机构 * University of Insubria(因苏布里亚大学) IBM Research Ireland(IBM 爱尔兰研究院)

AI总结 研究大语言模型在结构化数据生成中上下文学习的局限性,发现其无法更新预训练中的类别先验分布,导致罕见类完全无法生成;参数高效微调可解决但带来记忆化风险。

详情
Comments
9 pages, 5 figures. Empirical study of in-context learning and LoRA fine-tuning for synthetic tabular data generation, introducing the phenomenon of categorical prior lock-in. Under review
AI中文摘要

大型语言模型(LLM)越来越多地被用作结构化数据的条件生成器,依赖上下文学习(ICL)来适应新分布而无需更新参数。我们以高基数表格数据作为受控测试案例,研究分布不匹配下ICL在结构化生成中的局限性,并识别出一种结构性失败模式,我们称之为“类别先验锁定”:ICL无法更新模型从预训练中继承的令牌分布先验。在两个70亿参数开源模型中,ICL随着示例增加提高了数值保真度,但在类别分布上表现出明显的天花板效应,完全无法复现罕见类。参数高效微调(LoRA)克服了这些限制,但引入了可测量的记忆化风险,并在某些情况下破坏了结构化输出生成的稳定性,凸显了适应性与隐私之间的基本权衡。

英文摘要

Large language models (LLMs) are increasingly used as conditional generators for structured data, relying on in-context learning (ICL) to adapt to new distributions without parameter updates. We investigate the limits of ICL for structured generation under distribution mismatch, using high-cardinality tabular data as a controlled test case, and identify a structural failure mode we term \textit{categorical prior lock-in}: the inability of ICL to update the model's prior over token distributions inherited from pre-training. Across two 7B-parameter open-weight models, ICL improves numerical fidelity with additional examples but exhibits a sharp ceiling on categorical distributions, failing to reproduce rare classes entirely. Parameter-efficient fine-tuning (LoRA) overcomes these limitations but introduces measurable memorization risk and, in some cases, destabilizes structured output generation, highlighting a fundamental trade-off between adaptability and privacy.

2606.12138 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

不稳定特征,可复现子空间:理解稀疏自编码器中的种子依赖性

Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov

发表机构 * T-Tech

AI总结 研究稀疏自编码器特征的可复现性,发现稳定特征承载主要信号,不稳定特征集中于可复现的低秩子空间,反映基歧义而非纯噪声。

详情
AI中文摘要

稀疏自编码器(SAE)被广泛用于解释神经网络表示,但其效用取决于学习到的特征是否在不同训练运行间可复现。我们通过\textit{特征稳定性}研究这一问题:对于每个SAE特征,我们估计其在独立训练的SAE中再次出现的概率。这产生了一个可扩展的每特征信号,将稳定特征与不稳定特征区分开来。在一项跨种子、模型、层、字典大小和SAE变体的大规模研究中,我们发现显著的功能不对称性:稳定特征承载了大部分重建和预测相关信号,而不稳定特征的边际影响较弱,并且在激活统计和自动解释中主要由低频表面形式触发主导。在几何上,不稳定特征个体不可复现,但集中在可复现的低秩子空间中,这表明种子依赖性通常反映了共享激活空间区域内的基歧义,而非纯噪声。一个受控的合成模型使这一机制明确,表明低秩真实特征可以在子空间级别被恢复,而作为个体SAE潜在变量跨种子仍不可识别。最后,通过汇集独特的跨种子特征,我们构建了更稳定的SAE,同时在此设置中保留了解释方差。这些结果共同表明,不稳定特征不仅仅是失败或噪声潜在变量:它们个体功能影响较弱,但反映了标准SAE跨种子不同解析的可复现低维结构。

英文摘要

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through \emph{feature stability}: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
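The per-feature stability signal described above can be approximated with a simple matching rule. The sketch below counts a feature as "reappearing" in another seed when some feature in that seed's dictionary has cosine similarity above a threshold, and reports the fraction of seeds where this happens. The cosine criterion, the 0.8 threshold, and the function names are assumptions for illustration; the abstract does not specify how similarity is measured.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def feature_stability(feature, other_dictionaries, threshold=0.8):
    """Fraction of independently trained dictionaries containing a similar feature."""
    hits = sum(
        1
        for dictionary in other_dictionaries
        if max(cosine(feature, g) for g in dictionary) >= threshold
    )
    return hits / len(other_dictionaries)

# Hypothetical decoder directions from two other seeds.
seed_a = [[0.9, 0.1], [0.0, 1.0]]
seed_b = [[0.0, 1.0], [0.5, 0.5]]
score = feature_stability([1.0, 0.0], [seed_a, seed_b])
```

A feature that scores low under this rule may still lie in a subspace shared across seeds, which is the distinction the abstract draws between individual features and reproducible subspaces.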

2606.12146 2026-06-11 cs.LG cs.AI 交叉投稿

nD-RoPE: A Generalized RoPE for n-Dimensional Position Embedding

nD-RoPE:一种用于n维位置嵌入的广义RoPE

Boyang Li, Yulin Wu, Sizhe Xu, Nuoxian Huang, Zhonghang Yuan, Shangyi Guo, Shu Yang, Takahiro Yabe

AI总结 提出nD-RoPE,将旋转位置嵌入推广到任意维度,通过多尺度正则单纯形波矢设计实现各向同性,在图像、视频和点云任务中提升性能。

详情
Comments
Accepted to the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

旋转位置嵌入(RoPE)在Transformer模型中被广泛采用,但其向高维域的扩展缺乏统一的理论表述。大多数现有方法要么沿每个轴独立应用旋转,要么经验性地混合频率,这限制了跨维交互并产生方向相关的表示。为了解决这些限制,我们提出了nD-RoPE,一种将RoPE推广到任意维度的无分解泛化。从连续希尔伯特空间中的平移不变表述出发,我们推导出各向同性的谱条件,要求将位置和频率视为耦合的\(n\)维向量。我们通过多尺度正则单纯形波矢设计实例化该表述,提供了非退化的空间覆盖和对称、方向平衡的二阶响应。在图像、视频和点云上的实验表明,在高维设置中性能持续提升且泛化能力增强。

英文摘要

Rotary Position Embedding (RoPE) is widely adopted in Transformer models, yet its extension to high-dimensional domains lacks a unified theoretical formulation. Most existing approaches either apply rotations independently along each axis or empirically mix frequencies, which limits cross-dimensional interactions and yields direction-dependent representations. To address these limitations, we propose nD-RoPE, a decomposition-free generalization of RoPE to arbitrary dimensions. From a translation-invariant formulation in continuous Hilbert space, we derive a spectral condition for isotropy that requires treating positions and frequencies as coupled \(n\)-dimensional vectors. We instantiate this formulation with a multi-scale regular-simplex wave-vector design, which provides non-degenerate spatial coverage and a symmetric, directionally balanced second-order response. Experiments across images, videos, and point clouds demonstrate consistent performance gains and improved generalization in high-dimensional settings.
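A small sketch may help picture what a regular-simplex wave-vector design could look like. It builds the n+1 unit vectors of a regular simplex in n dimensions (pairwise dot product -1/n) and uses each one as a direction k_j, so that feature pair j is rotated by the angle omega * <k_j, p> for an n-dimensional position p. The recursive construction, the single scale `omega`, and the way pairs are assigned to directions are assumptions made for illustration, not the paper's exact parameterisation.

```python
import math

def simplex_directions(n):
    """Return n+1 unit vectors in R^n whose pairwise dot product is -1/n."""
    if n == 1:
        return [[1.0], [-1.0]]
    scale = math.sqrt(1.0 - 1.0 / n ** 2)
    lower = simplex_directions(n - 1)
    first = [1.0] + [0.0] * (n - 1)
    rest = [[-1.0 / n] + [scale * c for c in v] for v in lower]
    return [first] + rest

def rotate_pairs(x, position, omega=1.0):
    """Rotate consecutive feature pairs of x (even length assumed).

    Pair j uses the angle omega * <k_j, position>, with k_j cycling
    through the simplex directions.
    """
    dirs = simplex_directions(len(position))
    out = []
    for j in range(len(x) // 2):
        k = dirs[j % len(dirs)]
        theta = omega * sum(ki * pi for ki, pi in zip(k, position))
        a, b = x[2 * j], x[2 * j + 1]
        out += [a * math.cos(theta) - b * math.sin(theta),
                a * math.sin(theta) + b * math.cos(theta)]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# The query-key score depends only on the position difference.
q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.3]
score = dot(rotate_pairs(q, [1.0, 2.0, 3.0]), rotate_pairs(k, [0.5, -1.0, 2.0]))
```

Because every angle is linear in the position, shifting both positions by the same offset leaves the score unchanged, which is the translation-invariance property RoPE relies on. A multi-scale variant would repeat the same directions with several values of `omega`.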

2606.12200 2026-06-11 cs.LG cs.AI 交叉投稿

Implicit Neural Representations of Individual Behavior

个体行为的隐式神经表示

Andrew Kang, Priya Narasimhan

AI总结 提出Behavioral INR模型,用隐式神经表示从无标签多策略行为数据中学习策略表示,通过FiLM层调节策略函数,实现无监督策略识别,在连续状态-动作空间中提升策略可识别性。

详情
Comments
ICML 2026, Structured Probabilistic Inference & Generative Modeling Workshop
AI中文摘要

我们研究从无标签多策略行为数据中进行策略表示学习。每个回合由固定策略生成,但策略标签不可用。这种设置出现在机器人操作、演示、游戏、赛车以及其他混合了异构行为但没有注释的数据集中。我们引入了\emph{Behavioral INR},一种自监督生成模型,将隐式神经表示(INR)从视觉领域适应到行为领域。Behavioral INR不是将坐标映射到RGB值,而是将策略表示为状态-动作函数,将状态映射到后续动作。一个回合级别的潜在变量通过FiLM层调节该函数,产生策略上的生成先验,并允许在无监督的情况下推断策略身份。由于INR将每个数据点视为底层函数的样本,同一模型自然适应可变回合长度和不同采样粒度,就像视觉INR处理不同图像分辨率一样。我们还定义了沿状态分布和动作分布轴的策略级分布外(OOD)偏移,当策略在状态或动作上重叠时会出现这种偏移,但标准的基于新智能体或环境的OOD设置无法捕捉到。我们在合成高斯随机场数据、带有受控OOD分割的MuJoCo演示以及真实世界的国际象棋、一级方程式赛车、机器人和搜索-规避数据集上进行了评估。Behavioral INR在最具挑战性的连续状态-动作设置中持续提升策略可识别性,尤其是当更长的回合、更多的策略和OOD分割降低了边际捷径的效用时;当策略身份可以从符号重复或低维动作统计中恢复时,摊销历史编码器仍然具有竞争力。我们发布了代码和检查点。

英文摘要

We study policy representation learning from unlabeled multi-policy behavioral data. Each episode is generated by a fixed policy, but policy labels are unavailable. This setting appears in robotics play, demonstrations, games, racing, and other datasets where heterogeneous behaviors are mixed without annotations. We introduce \emph{Behavioral INR}, a self-supervised generative model that adapts implicit neural representations (INRs) from vision to behavior. Instead of mapping coordinates to RGB values, Behavioral INR represents a policy as a state-action function mapping states to subsequent actions. An episode-level latent modulates this function through FiLM layers, yielding a generative prior over policies and allowing policy identity to be inferred without supervision. Because INRs treat each datapoint as a sample from an underlying function, the same model naturally accommodates variable episode lengths and different sampling granularities, as in vision INRs with different image resolutions. We also define policy-level out-of-distribution (OOD) shifts along state-distribution and action-distribution axes, which arise when policies overlap in states or actions but are not captured by standard behavioral OOD settings based only on new agents or environments. We evaluate on synthetic Gaussian random field data, MuJoCo demonstrations with controlled OOD splits, and real-world chess, Formula 1 racing, robotics, and Seek-Avoid datasets. Behavioral INR most consistently improves policy identifiability in the hardest continuous state-action settings, especially when longer episodes, more policies, and OOD splits reduce the usefulness of marginal shortcuts; amortized history encoders remain competitive when policy identity can be recovered from symbolic repetition or low-dimensional action statistics. We release code and checkpoints.
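The FiLM mechanism mentioned above can be sketched in a few lines: a shared state-to-action network has its hidden units scaled and shifted by values computed from an episode-level latent, so one set of weights represents many policies. The layer sizes, the `params` dictionary, and the use of a single hidden layer with tanh are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def linear(weights, bias, x):
    """Dense layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film_policy(state, latent, params):
    """Map a state to an action; the latent scales (gamma) and shifts (beta) hidden units."""
    hidden = linear(params["w_in"], params["b_in"], state)
    gamma = linear(params["w_gamma"], params["b_gamma"], latent)
    beta = linear(params["w_beta"], params["b_beta"], latent)
    hidden = [math.tanh(g * h + b) for g, h, b in zip(gamma, hidden, beta)]
    return linear(params["w_out"], params["b_out"], hidden)

# Hypothetical toy parameters: 1-D state, 2 hidden units, 1-D latent, 1-D action.
params = {
    "w_in": [[1.0], [2.0]], "b_in": [0.0, 0.0],
    "w_gamma": [[0.0], [0.0]], "b_gamma": [1.0, 1.0],
    "w_beta": [[1.0], [0.0]], "b_beta": [0.0, 0.0],
    "w_out": [[1.0, 1.0]], "b_out": [0.0],
}
action_policy_a = film_policy([0.5], [0.0], params)
action_policy_b = film_policy([0.5], [1.0], params)
```

The same state yields different actions under different latents, which is what lets an inferred latent act as an identifier for the policy behind an episode.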

2606.12240 2026-06-11 cs.LG cs.AI 交叉投稿

Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training

多速率专家混合模型加速液态神经网络训练

Shilong Zong, Almuatazbellah Boker, Hoda Eldardiry

发表机构 * Virginia Tech(弗吉尼亚理工大学)

AI总结 提出多速率专家混合框架,结合液态神经网络的多尺度动态与注意力机制,提升多变量时间序列建模的准确性和效率。

详情
AI中文摘要

多变量时间序列数据通常表现出复杂的时间依赖、不规则采样和跨多个时间尺度的异质动态,使得精确序列建模特别具有挑战性。传统的循环神经网络(RNN),如长短期记忆网络(LSTM),在离散时间下运行,可能难以有效捕捉连续和不规则的时间行为。液态神经网络(LNN)通过连续时间动态解决了其中一些限制,但标准LNN架构通常依赖单一动力系统,限制了其建模异质时间模式的能力。为了解决这些挑战,我们提出了一个基于液态神经网络的多速率专家混合(MR-MoE)框架。在所提出的架构中,多个基于LNN的专家以不同的时间尺度运行,使模型能够明确分离快速变化的动态和缓慢演变的时间趋势。门控网络进一步实现了基于输入条件的自适应专家专业化。此外,我们结合了特征级和时间注意力机制,以提高鲁棒性、可解释性和长程依赖建模能力。特征级注意力抑制噪声或无关变量,而时间注意力则选择性地关注信息丰富的历史状态。我们在一个复杂的多变量时间序列预测任务上评估了所提出的框架,并与强基线模型(包括LSTM、单体LNN和标准MoE模型)进行了比较。实验结果表明,所提出的MR-MoE框架在保持良好计算效率的同时,持续实现了改进的AUROC和AUPRC性能。这些结果突显了结合连续时间动态、多尺度专家分解和自适应注意力机制对时间序列建模的有效性。

英文摘要

Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks, operate in discrete time and may struggle to effectively capture continuous and irregular temporal behaviors. Liquid Neural Networks (LNNs) address some of these limitations through continuous-time dynamics, but standard LNN architectures typically rely on a single dynamical system, limiting their ability to model heterogeneous temporal patterns. To address these challenges, we propose a Multi-Rate Mixture-of-Experts (MR-MoE) framework built on top of Liquid Neural Networks. In the proposed architecture, multiple LNN-based experts operate at distinct time scales, enabling the model to explicitly separate fast-changing dynamics from slow-evolving temporal trends. A gating network further enables adaptive expert specialization based on input conditions. In addition, we incorporate both feature-level and temporal attention mechanisms to improve robustness, interpretability, and long-range dependency modeling. Feature-level attention suppresses noisy or irrelevant variables, while temporal attention selectively focuses on informative historical states. We evaluate the proposed framework on a complex multivariate time-series prediction task and compare it against strong baselines, including LSTM, monolithic LNN, and standard MoE models. Experimental results demonstrate that the proposed MR-MoE framework consistently achieves improved AUROC and AUPRC performance while maintaining favorable computational efficiency. These results highlight the effectiveness of combining continuous-time dynamics, multi-scale expert decomposition, and adaptive attention mechanisms for time-series modeling.
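The multi-rate idea can be illustrated with a toy model in which each expert is a leaky integrator with its own time constant and a softmax gate mixes their states. This is a deliberately simplified stand-in: real liquid networks use input-dependent time constants, and the attention mechanisms from the abstract are omitted. The names `taus` and `gate_weights`, and the scalar input, are assumptions.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def run_multi_rate(inputs, taus, gate_weights, dt=1.0):
    """Run experts with different time constants and mix them with an input-driven gate."""
    states = [0.0] * len(taus)
    outputs = []
    for x in inputs:
        # Euler step of dh/dt = (-h + tanh(x)) / tau for each expert.
        states = [h + (dt / tau) * (-h + math.tanh(x)) for h, tau in zip(states, taus)]
        # Gate: one logit per expert, computed from the current input.
        gates = softmax([w * x for w in gate_weights])
        outputs.append(sum(g * h for g, h in zip(gates, states)))
    return outputs, states

# A fast expert (tau = 1) and a slow expert (tau = 10) on a constant input.
outputs, states = run_multi_rate([1.0] * 5, taus=[1.0, 10.0], gate_weights=[0.0, 0.0])
```

On a constant input the fast expert reaches its target in one step while the slow expert approaches it gradually, which shows how separate time constants split fast dynamics from slow trends.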

2606.12287 2026-06-11 cs.NE cs.AI 交叉投稿

SpikeDecoder: Realizing the GPT Architecture with Spiking Neural Networks

SpikeDecoder: 用脉冲神经网络实现GPT架构

Claas Beger, Florian Walter, Alois Knoll

AI总结 提出SpikeDecoder,一种基于脉冲神经网络(SNN)的Transformer解码器,用于自然语言处理,通过替换ANN模块和优化嵌入方法,在保持性能的同时降低理论能耗87%-93%。

详情
AI中文摘要

Transformer架构被广泛认为是自然语言处理最强大的工具,但由于大量复杂操作,其本质上存在高能耗问题。为解决这一问题,我们考虑脉冲神经网络(SNN),它通过天然的事件驱动方式处理信息,是传统人工神经网络(ANN)的节能替代方案。然而,这本质上使得SNN难以训练。通常,许多基于SNN的模型通过转换预训练的ANN来规避这一问题。最近,有研究尝试设计可直接训练的基于SNN的Transformer模型结构改编。尽管结果显示出巨大潜力,但应用领域是计算机视觉,且所提模型仅包含编码器模块。在本文中,我们提出SpikeDecoder,一种完全基于SNN的Transformer解码器模块实现,用于自然语言处理。通过一系列实验,我们分析了用脉冲替代方案交换ANN模型不同模块的影响,以识别权衡和性能损失的主要来源。我们进一步研究了残差连接的作用以及SNN兼容归一化技术的选择。除了模型架构的工作,我们还制定并比较了将文本数据投影为脉冲的不同嵌入方法。最后,我们证明,与ANN基线相比,所提出的基于SNN的解码器模块将理论能耗降低了87%至93%。

英文摘要

The Transformer architecture is widely regarded as the most powerful tool for natural language processing, but due to a high number of complex operations, it inherently faces the issue of high energy consumption. To address this issue, we consider Spiking Neural Networks (SNNs), which are an energy-efficient alternative to conventional Artificial Neural Networks (ANNs) due to their naturally event-driven approach to processing information. However, this inherently makes them difficult to train. Often, many SNN-based models circumvent this issue by converting pre-trained ANNs. More recently, attempts have been made to design directly trainable SNN-based adaptations of the Transformer model structure. Although the results showed great promise, the application field was computer vision. Moreover, the proposed model incorporates only encoder blocks. In this paper, we propose SpikeDecoder, a fully SNN-based implementation of the Transformer decoder block, for applications in natural language processing. In a series of experiments, we analyze the impact of exchanging different blocks of the ANN model with spike-based alternatives to identify trade-offs and significant sources of performance loss. We further investigate the role of residual connections and the selection of SNN-compatible normalization techniques. Besides the work on the model architecture, we formulate and compare different embedding methods to project text data into spikes. Finally, we demonstrate that our proposed SNN-based decoder block reduces the theoretical energy consumption by 87% to 93% compared to the ANN baseline.

2606.12318 2026-06-11 cs.LG cs.AI 交叉投稿

Harness In-Context Operator Learning with Chain of Operators

利用算子链实现上下文算子学习

Minghui Yang, Ling Guo, Liu Yang

发表机构 * Department of Mathematics, Shanghai Normal University(上海师范大学数学系) Department of Mathematics, National University of Singapore(新加坡国立大学数学系)

AI总结 提出Chain of Operators (CHOP)框架,通过构造显式初等变换与冻结ICON的算子链,无需微调即可提升上下文算子网络在分布外算子任务上的泛化能力,在标量守恒律和平均场控制问题中降低推理误差。

详情
AI中文摘要

神经算子近似函数空间之间的映射,但通常对其他算子泛化能力差,需要微调或重新训练。上下文算子网络(ICON)通过向模型提供数值上下文来解决此问题,使模型从提示中学习特定算子并适应不同算子而无需微调。然而,ICON在分布外(OOD)算子任务上仍可能泛化失败。受大型语言模型(LLM)的提示工程成功启发,我们引入了算子链(CHOP),一种在不更新参数的情况下将冻结的ICON应用于OOD算子任务的框架。具体来说,CHOP构建了一个由显式初等变换和冻结ICON组成的算子链。在标量守恒律和平均场控制问题上的实验表明,与直接ICON评估相比,CHOP降低了相对推理误差,同时链中的每个算子保持可解释且具有封闭形式。在一个PDE族上构建的链进一步泛化到另一个不同的族,表明跨提示系统存在共享机制。

英文摘要

Neural operators approximate mappings between function spaces, but often generalize poorly to other operators and usually require fine-tuning or retraining. In-Context Operator Networks (ICON) address this issue by prompting the model with numerical context so that the model learns specific operators from prompts and adapts to different operators without fine-tuning. However, ICON may still fail to generalize to out-of-distribution (OOD) operator tasks. Inspired by the success of harness engineering for Large Language Models (LLMs), we introduce Chain of Operators (CHOP), a framework that harnesses a frozen ICON for OOD operator tasks without updating its parameters. Specifically, CHOP constructs a chain of operators consisting of explicit elementary transformations and the frozen ICON. Experiments on a scalar conservation law and a mean-field control problem show that CHOP reduces relative inference error over direct ICON evaluation, while each operator in the chain remains interpretable and in closed form. A chain constructed on one PDE family further generalizes to a different family, indicating shared mechanisms across harness systems.
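The structure of such a chain, explicit closed-form transformations wrapped around a frozen model, can be sketched as plain function composition. Everything below is a toy: `frozen_operator` is a cumulative sum standing in for a frozen ICON, and the rescaling steps are hypothetical examples of elementary transformations, not the ones used in the paper.

```python
def chain(*operators):
    """Compose operators left to right into a single callable."""
    def composed(u):
        for op in operators:
            u = op(u)
        return u
    return composed

def scale(factor):
    """Closed-form elementary transformation: pointwise rescaling."""
    return lambda u: [factor * v for v in u]

def frozen_operator(u):
    """Stand-in for a frozen pretrained operator (here a discrete antiderivative)."""
    total, out = 0.0, []
    for v in u:
        total += v
        out.append(total)
    return out

# Shrink the input into the range the frozen model is assumed to handle,
# apply the frozen model unchanged, then undo the rescaling on the output.
chop = chain(scale(0.5), frozen_operator, scale(2.0))
result = chop([2.0, 4.0, 6.0])
```

The frozen model's parameters are never touched; only the surrounding transformations change, and each of them stays inspectable.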

2606.12362 2026-06-11 cs.LG cs.AI 交叉投稿

Latent World Recovery for Multimodal Learning with Missing Modalities

缺失模态下的多模态学习中的潜在世界恢复

Hui Wang, Tianyu Ren, Joseph Butler, Christopher Baker, Karen Rafferty, Simon McDade

发表机构 * Queen's University Belfast(贝尔法斯特女王大学)

AI总结 提出潜在世界恢复(LWR)框架,通过邻居潜在对齐和可用性感知融合,在缺失模态下实现鲁棒的多模态预测,避免显式重构误差。

详情
AI中文摘要

我们研究了缺失模态下的多模态学习,特别受到生物科学应用的启发,在这些应用中,当需要做出决策时,异构模态通常仅部分可用。我们提出了潜在世界恢复(LWR),这是一个基于两个关键思想的框架:(i) 来自不同模态的特定模态嵌入在共享潜在空间中对齐,以及 (ii) 通过仅融合在训练和推理时实际可用的模态嵌入来构建统一表示。LWR 不填补缺失模态或要求固定的模态集,而是将每个模态视为对底层潜在状态的部分感知,并直接从观察到的模态执行可用性感知表示学习。这种基于邻居的潜在对齐和可用性感知模态融合的结合,使得在部分观测下能够进行鲁棒的多模态预测,同时避免了显式重构缺失模态带来的误差传播。我们在真实世界的不完整多组学基准上评估了所提出的框架,并证明它为下游任务(如癌症表型分类和生存预测)提供了一种有效的方法。

英文摘要

We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent World Recovery (LWR), a framework built on two key ideas: (i) modality-specific embeddings from different modalities are aligned in a shared latent space, and (ii) a unified representation is constructed by fusing only the embeddings of the modalities that are actually available at both training and inference time. Rather than imputing missing modalities or requiring a fixed modality set, LWR treats each modality as a partial perception of an underlying latent state and performs availability-aware representation learning directly from the observed modalities. This combination of neighbor-based latent alignment and availability-aware modality fusion enables robust multimodal prediction under partial observation, while avoiding error propagation from explicit reconstruction of missing modalities. We evaluate the proposed framework on real-world incomplete multi-omics benchmarks and demonstrate that it provides an effective approach to downstream tasks such as cancer phenotype classification and survival prediction.
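Availability-aware fusion, in its simplest form, combines only the embeddings that exist for a given sample instead of imputing the missing ones. The sketch below uses a plain average over the present modalities; the averaging rule and the modality names are assumptions, since the abstract does not state which fusion operator LWR uses.

```python
def fuse_available(embeddings):
    """Average the embeddings of modalities that are present (value is not None)."""
    present = [e for e in embeddings.values() if e is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    return [sum(col) / len(present) for col in zip(*present)]

# Hypothetical multi-omics sample with one modality missing.
sample = {"rna": [1.0, 3.0], "methylation": None, "protein": [3.0, 5.0]}
fused = fuse_available(sample)
```

This only makes sense when the modality embeddings already live in a shared, aligned latent space, which is the role of the alignment step described in the abstract.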

2606.12386 2026-06-11 cs.LG cs.AI 交叉投稿

ATLAS: Active Theory Learning for Automated Science

ATLAS: 自动化科学的主动理论学习

Noémi Éltető, Nathaniel D. Daw, Kimberly L. Stachenfeld, Kevin J. Miller

发表机构 * Google DeepMind(谷歌深度思维) Princeton University(普林斯顿大学) Columbia University(哥伦比亚大学) University College London(伦敦大学学院)

AI总结 提出ATLAS框架,通过主动学习迭代生成稀疏神经网络假设并设计最优区分实验,在bandit任务中恢复强化学习智能体,相比随机实验采样效率提升5-10倍。

详情
AI中文摘要

通过机制建模推进科学理解需要提出正确的实验问题以产生信息量最大的数据。为了在认知科学中自动化这一追求,我们引入了ATLAS(自动化科学的主动理论学习),这是一个用于数据驱动的可解释行为模型发现的主动学习框架。ATLAS在生成机制假设(实例化为多样化的稀疏神经网络集成,即解缠RNN)和设计能够最优区分这些假设的实验之间迭代。我们在从bandit任务中的行为恢复强化学习智能体的问题上测试了这种方法。ATLAS设计了具有时间结构的定性新颖实验序列,该结构针对底层智能体特征量身定制。在这些实验上训练的模型通过一套全面的机制建模指标进行评估,这些指标捕捉了行为、结构和计算相似性。与随机实验相比,ATLAS在所有指标上实现了5-10倍的采样效率提升,并且其性能进一步通过与文献中专家设计的实验进行验证得到确认。这些计算机模拟结果展示了ATLAS在加速人类可解释洞察方面的潜力,适用于认知科学以及其他科学探究依赖于发现机制模型的领域。

英文摘要

Advancing scientific understanding through mechanistic modeling requires posing the right experimental questions to yield maximally informative data. To automate this pursuit within cognitive science, we introduce ATLAS (Active Theory Learning for Automated Science), an active learning framework for the data-driven discovery of interpretable behavioral models. ATLAS iterates between generating mechanistic hypotheses--instantiated as a diverse ensemble of sparse neural networks (Disentangled RNNs)--and designing experiments that optimally distinguish between them. We test this approach on the problem of recovering reinforcement learning agents from their behavior in bandit tasks. ATLAS designs varied sequences of qualitatively novel experiments with temporal structure tailored to underlying agent characteristics. The models trained on these experiments are evaluated against a comprehensive set of metrics for mechanistic modeling that capture behavioral, structural, and computational similarity. ATLAS achieves a 5-10x improvement in sample efficiency across all metrics compared to random experimentation, and its performance is further validated against expert-designed experiments derived from literature. These in silico results showcase ATLAS's potential to accelerate human-interpretable insights in cognitive science and other domains where scientific inquiry relies on discovering mechanistic models.
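One simple way to picture "designing experiments that optimally distinguish between hypotheses" is to choose the candidate experiment on which an ensemble of models disagrees most. The sketch below scores disagreement as the variance of the models' predictions. The variance criterion, the scalar description of an experiment, and the toy models are assumptions; ATLAS's actual design objective is not given in the abstract.

```python
def disagreement(predictions):
    """Variance of the ensemble's predictions for one candidate experiment."""
    mean = sum(predictions) / len(predictions)
    return sum((p - mean) ** 2 for p in predictions) / len(predictions)

def pick_experiment(candidates, models):
    """Return the candidate on which the model ensemble disagrees most."""
    return max(candidates, key=lambda c: disagreement([m(c) for m in models]))

# Hypothetical hypotheses: each maps an experiment setting to a predicted choice probability.
models = [lambda c: 0.5, lambda c: 0.5 + 0.4 * c]
best = pick_experiment([0.0, 0.5, 1.0], models)
```

Data collected at the chosen setting is then the most useful for ruling out one of the competing hypotheses before the next round.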

2606.12397 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

重新设计混合专家模型的路由器:基于流形幂迭代

Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Large Language Model Department, Tencent(腾讯大型语言模型部门)

AI总结 提出将路由器行与专家矩阵主奇异方向对齐,并基于流形幂迭代(MPI)重新设计路由器,通过“幂迭代-收缩”范式实现对齐,理论证明收敛性,实验验证1B至11B参数规模下模型效果提升。

详情
Comments
Preprint
AI中文摘要

路由器是混合专家模型的核心组件。作为专家代理,路由器矩阵的行计算与MoE输入的相似度,以确定激活哪些专家子集。理想情况下,每个路由器行被设计为将专家矩阵编码到该代表性向量中,使得其与token的点积能更好地反映token-专家亲和性。然而,目前没有设计原则来强制这种压缩。在本文中,我们提出将每个路由器行与相关专家的主奇异方向对齐,因为该方向提供了矩阵最具表现力的数学描述。基于这一原则,我们提出了一种基于流形幂迭代(MPI)的路由器重新设计。具体来说,它引入了一种“幂迭代-收缩”范式,其中对路由器权重执行幂迭代步骤,然后进行收缩以施加范数约束,确保效率和稳定性。理论上,我们证明MPI驱动路由器行收敛到相关专家的主奇异方向。实验上,我们在1B到11B参数规模的MoE模型上进行预训练,证实这种对齐有助于更有效的MoE模型。

英文摘要

The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row is designed to encode the expert matrix into this representative vector, such that its dot product with a token can better reflect token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, where a power iteration step is performed on the router weights, followed by a retraction to impose a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of associated experts. Empirically, we pretrain MoE models across scales from 1B to 11B parameters to confirm that this alignment facilitates more effective MoE models.
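The "Power-then-Retract" step can be illustrated with textbook power iteration. The sketch multiplies a router row by W^T W (the power step), then rescales it to unit norm (a retraction onto the unit sphere), and repeats. It assumes the expert acts on tokens as W x, so the relevant direction is the top right-singular vector, and it ignores how the step interacts with gradient training; both points are assumptions, as the abstract does not give the exact update.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def power_then_retract(router_row, expert, steps=1):
    """Pull a router row toward the expert's principal right-singular direction."""
    for _ in range(steps):
        # Power step: W^T W amplifies the component along the top singular direction.
        router_row = matvec(transpose(expert), matvec(expert, router_row))
        # Retraction: project back onto the unit sphere to enforce the norm constraint.
        norm = math.sqrt(sum(x * x for x in router_row))
        router_row = [x / norm for x in router_row]
    return router_row

# Hypothetical expert whose dominant direction is the first axis.
expert = [[3.0, 0.0], [0.0, 1.0]]
row = power_then_retract([1.0, 1.0], expert, steps=50)
```

A single step per training iteration keeps the added cost small, and the retraction keeps the row from growing without bound.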

2509.16456 2026-06-11 cs.AI 版本更新

GPO: Learning from Critical Steps to Improve LLM Reasoning

GPO:从关键步骤中学习以改进大语言模型推理

Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing

AI总结 提出引导式关键优化(GPO)微调策略,通过识别推理轨迹中的关键步骤并优先学习,显著提升大语言模型的多步推理能力。

详情
Comments
39th Conference on Neural Information Processing Systems (NeurIPS 2025)
AI中文摘要

大语言模型(LLMs)越来越多地应用于各个领域,在不同任务上展现出令人印象深刻的潜力。最近,推理LLMs被提出以改善LLMs的推理或思考能力,从而解决复杂问题。尽管推理LLMs取得了有希望的结果,但增强LLMs的多步推理能力仍然是一个重大挑战。虽然现有的优化方法已经推进了LLM的推理能力,但它们通常将推理轨迹视为一个整体,而不考虑轨迹中潜在的关键步骤。在本文中,我们引入了引导式关键优化(GPO),一种新颖的微调策略,深入推理过程以实现更有效的改进。GPO首先识别推理轨迹中的“关键步骤”——模型必须谨慎进行以成功解决问题的点。我们通过估计优势函数来定位关键步骤。然后,GPO将策略重置到关键步骤,采样新的轨迹,并优先学习这些轨迹。这种关注使模型能够更有效地从推理过程中的关键时刻学习,以提高推理性能。我们证明GPO是一种通用策略,可以与各种优化方法集成以提高推理性能。除了理论分析外,我们在具有挑战性的推理基准上的实验表明,GPO能够持续且显著地提升现有优化方法的性能,展示了其通过关注生成过程中的关键时刻来改进LLM推理的有效性和泛化性。

英文摘要

Large language models (LLMs) are increasingly used in various domains, showing impressive potential on different tasks. Recently, reasoning LLMs have been proposed to improve the \textit{reasoning} or \textit{thinking} capabilities of LLMs to solve complex problems. Despite the promising results of reasoning LLMs, enhancing the multi-step reasoning capabilities of LLMs remains a significant challenge. While existing optimization methods have advanced LLM reasoning capabilities, they often treat reasoning trajectories as a whole, without considering the underlying critical steps within the trajectory. In this paper, we introduce \textbf{G}uided \textbf{P}ivotal \textbf{O}ptimization (GPO), a novel fine-tuning strategy that dives into the reasoning process to enable more effective improvements. GPO first identifies the `critical step' within a reasoning trajectory - a point at which the model must proceed carefully to solve the problem. We locate the critical step by estimating the advantage function. GPO then resets the policy to the critical step, samples new rollouts, and prioritizes learning on those rollouts. This focus allows the model to learn more effectively from pivotal moments within the reasoning process to improve the reasoning performance. We demonstrate that GPO is a general strategy that can be integrated with various optimization methods to improve reasoning performance. Besides theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO can consistently and significantly enhance the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process.

2602.09533 2026-06-11 cs.AI 版本更新

Autoregressive Direct Preference Optimization

自回归直接偏好优化

Masanari Oi, Mahiro Ukai, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

AI总结 提出自回归直接偏好优化(ADPO),在应用Bradley-Terry模型前显式引入自回归假设,通过将DPO目标中的求和操作移至log-sigmoid函数外部,实现更优的偏好对齐,并首次区分token长度μ和反馈长度μ'两种度量。

详情
Comments
ICML 2026
AI中文摘要

直接偏好优化(DPO)已成为将大型语言模型(LLMs)与人类偏好对齐的一种有前景的方法。然而,对响应级Bradley-Terry(BT)模型的广泛依赖可能限制了其全部潜力,因为参考模型和可学习模型仅在推导目标函数后才被假定为自回归。受此限制的启发,我们重新审视DPO的理论基础,并提出一种新的公式,在应用BT模型之前显式引入自回归假设。通过重新表述和扩展DPO,我们推导出一种新的变体,称为自回归DPO(ADPO),它将自回归建模显式整合到偏好优化框架中。在不违反理论基础的情况下,推导出的损失采用了一种优雅的形式:它将DPO目标中的求和操作移至log-sigmoid函数外部。此外,通过对ADPO的理论分析,我们表明在设计基于DPO的算法时需要考虑两种长度度量:token长度μ和反馈长度μ'。据我们所知,我们是第一个明确区分这两种度量并分析它们对LLMs中偏好优化影响的工作。

英文摘要

Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit its full potential, as the reference and learnable models are assumed to be autoregressive only after deriving the objective function. Motivated by this limitation, we revisit the theoretical foundations of DPO and propose a novel formulation that explicitly introduces the autoregressive assumption prior to applying the BT model. By reformulating and extending DPO, we derive a novel variant, termed Autoregressive DPO (ADPO), that explicitly integrates autoregressive modeling into the preference optimization framework. Without violating the theoretical foundations, the derived loss takes an elegant form: it shifts the summation operation in the DPO objective outside the log-sigmoid function. Furthermore, through theoretical analysis of ADPO, we show that there exist two length measures to be considered when designing DPO-based algorithms: the token length $\mu$ and the feedback length $\mu'$. To the best of our knowledge, we are the first to explicitly distinguish these two measures and analyze their implications for preference optimization in LLMs.
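The structural difference described above, moving the sum over tokens outside the log-sigmoid, can be shown numerically. In the sketch, each list holds per-token log-ratios log(pi/pi_ref) for the chosen and rejected responses. `dpo_loss` sums them before applying the log-sigmoid, as in standard DPO; `summed_outside_loss` applies the log-sigmoid per token and sums afterwards. The token-by-token pairing of equal-length responses is an assumption made only to keep the example short; the abstract does not specify how ADPO pairs tokens or handles unequal lengths.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(chosen_ratios, rejected_ratios, beta=0.1):
    """Response-level form: sum over tokens inside the log-sigmoid."""
    margin = beta * (sum(chosen_ratios) - sum(rejected_ratios))
    return -log_sigmoid(margin)

def summed_outside_loss(chosen_ratios, rejected_ratios, beta=0.1):
    """Token-level form: log-sigmoid per token, sum outside (assumed pairing)."""
    return -sum(
        log_sigmoid(beta * (c - r)) for c, r in zip(chosen_ratios, rejected_ratios)
    )

# Hypothetical per-token log-ratios for a three-token pair of responses.
chosen, rejected = [0.2, 0.1, 0.4], [-0.1, 0.0, 0.3]
inside = dpo_loss(chosen, rejected)
outside = summed_outside_loss(chosen, rejected)
```

With the sum outside, the loss grows with the number of token-level terms, which gives some intuition for why length measures matter in the analysis the abstract mentions.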

2605.19031 2026-06-11 cs.AI eess.SP 版本更新

KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition

KAN-MLP-Mixer: 对Kolmogorov-Arnold网络(KANs)在改进基于惯性测量单元(IMU)的人体活动识别中的应用的全面研究

Mengxi Liu, Sizhen Bian, Vitor Fortes, Francisco Calatrava Nicolas, Daniel Geißler, Maximilian Kiefer-Emmanouilidis, Bo Zhou, Paul Lukowicz

AI总结 本文研究了KANs在改进IMU基人体活动识别(HAR)模型中的应用,提出了一种混合架构,结合KANs的精度与MLP的鲁棒性和效率,实验表明该混合模型在多个数据集上显著提升了性能。

详情
Comments
23 pages, 9 figures
英文摘要

Kolmogorov-Arnold Networks (KANs) have demonstrated an exceptional ability to learn complex functions on clean, low-dimensional data but struggle to maintain performance on noisy and imperfect real-world datasets. In contrast, conventional multi-layer perceptrons (MLPs) are far more tolerant of noise and more computationally efficient. Replacing all MLP components with KANs in HAR models often degrades accuracy and computational efficiency, highlighting an open challenge: how to combine KANs' precision with MLPs' noise robustness and efficiency. To address this, we systematically explore various placements of KAN modules within deep HAR networks and propose a hybrid architecture that combines the strengths of both paradigms: it uses a KAN-based input embedding layer, retains MLP layers for intermediate feature mixing, and introduces a specialized LarctanKAN module for final activity classification. Across eight public HAR datasets, the hybrid KAN-MLP model achieves an average relative improvement in macro F1 score of 5.33% over the pure-MLP model, significantly outperforming standalone KAN and MLP baselines. Furthermore, integrating this hybrid strategy into other state-of-the-art HAR architectures consistently boosts their performance. Our findings demonstrate that a carefully orchestrated combination of KAN, MLP, and other conventional neural components yields more robust and accurate HAR models for real-world wearable sensing environments.

2606.00995 2026-06-11 cs.AI 版本更新

Subliminal Learning Is Steering Vector Distillation

潜意识学习是引导向量蒸馏

Camila Blank, Agam Bhatia, Senthooran Rajamanoharan, Arthur Conmy, Neel Nanda

AI总结 本文发现潜意识学习通过单个引导向量实现,并证明这是引导向量蒸馏的特例,解释了非语义数据如何传递语义特征。

AI中文摘要

潜意识学习指的是学生语言模型在教师输出上微调时获得教师的特征(例如,系统提示对猫头鹰的偏好),尽管输出与这些特征在语义上无关。目前尚不清楚没有语义意义的数据如何传递特定的语义特征。在这项工作中,我们表明潜意识学习是由单个引导向量介导的,即添加到模型激活中的向量。在两个开源模型上,我们发现教师的系统提示可以很好地近似为一个引导向量,而学生的行为是通过微调学习对齐向量驱动的。不能被引导向量很好近似的系统提示不会潜意识地学习。这是引导向量蒸馏的一个特例,其中在受引导教师输出上训练的学生学会模仿该引导。我们在一系列语义和随机向量上演示了引导向量蒸馏。向模型激活添加语义向量可以对其行为产生模型无关和模型特定(即非语义)的影响,因此非语义的生成数据可以传递具有语义效果的向量,从而实现潜意识学习。这也解释了为什么潜意识学习不能在模型之间转移。我们发现自适应优化器对于语言模型中的潜意识学习是必要的:引导数据上的激活梯度沿引导方向携带一个小但一致的分量,而非自适应优化器通过允许异常梯度主导来阻碍这一点。

英文摘要

Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher's outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model's activations. Across two open-source models, we find that the teacher's system prompt is well approximated by a steering vector, and that the student's behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of steering vector distillation, in which a student trained on the outputs of a steered teacher learns to imitate that steering. We demonstrate steering vector distillation on a range of semantic and random vectors. Adding a semantic vector to a model's activations can have both model-independent and model-specific (i.e. non-semantic) effects on its behavior, so generated data that is non-semantic can transmit a vector with semantic effects, enabling subliminal learning. This also explains why subliminal learning does not transfer between models. We find that adaptive optimizers are necessary for subliminal learning in language models: activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers impede this by allowing outlier gradients to dominate.

2505.13196 2026-06-11 cs.LG cs.AI quant-ph 版本更新

A Physics-Inspired Optimizer: Velocity Regularized Adam

一种受物理启发的优化器:速度正则化Adam

Pranav Vaidhyanathan, Lucas Schorling, Natalia Ares, Maike Osborne

AI总结 本文提出VRAdam优化器,通过引入速度正则化技术,结合Adam的参数缩放,提升训练稳定性与收敛速度,理论分析显示其在非凸随机目标下的收敛速率为O(ln(N)/√N)。

Comments
L. Schorling and P. Vaidhyanathan contributed equally to this work. 20 pages, 10 figures
AI中文摘要

我们介绍了一种受物理启发的优化器——速度正则化Adam(VRAdam),用于训练深度神经网络。该优化器借鉴了四次项用于动能的思想,其在系统动力学中具有稳定作用。先前的算法,包括普遍使用的Adam,训练过程中处于所谓的稳定性边缘,导致快速振荡和损失收敛缓慢。然而,VRAdam基于速度在学习率上添加更高阶惩罚,使得算法在权重更新变得较大时自动减慢。实践中,我们观察到在高速度区域,有效动态学习率会缩小并抑制振荡。通过将这种基于速度的正则化用于全局阻尼,结合Adam的参数缩放,我们创建了一个强大的混合优化器。对于该优化器,我们从物理和控制的角度对动量在稳定性边缘的操作进行了严格的理论分析。此外,我们推导了在轻微假设下的非凸随机目标下的收敛界,收敛速率为O(ln(N)/√N)。我们证明VRAdam在标准优化器如AdamW上表现更优。我们通过多种任务如图像分类、语言建模和生成建模,使用不同架构和训练方法(包括卷积神经网络、Transformer和GFlowNets)进行基准测试。

英文摘要

We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on the stabilizing effect of quartic kinetic-energy terms on system dynamics. Previous algorithms, including the ubiquitous Adam, operate in the so-called adaptive edge-of-stability regime during training, leading to rapid oscillations and slowed convergence of the loss. VRAdam instead adds a higher-order, velocity-based penalty to the learning rate, so that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations. By combining this velocity-based regularizer for global damping with the per-parameter scaling of Adam, we create a powerful hybrid optimizer. For this optimizer, we provide a rigorous theoretical analysis of operation at the edge of stability from a physical and control perspective for the momentum. Furthermore, we derive convergence bounds with rate $\mathcal{O}(\ln(N)/\sqrt{N})$ for a stochastic non-convex objective under mild assumptions. We demonstrate that VRAdam outperforms standard optimizers, including AdamW. We benchmark on various tasks such as image classification, language modeling, and generative modeling, using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.
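
The velocity-dependent learning rate described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract only says that the learning rate is penalized at higher order by the velocity, so the specific form `lr / (1 + min(beta3 * ||m||^2, lr_cutoff))`, the use of the Adam momentum buffer as the "velocity", and the names `beta3` and `lr_cutoff` are all hypothetical.

```python
import math


def init_state(n):
    """Create empty first/second-moment buffers for n scalar parameters."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}


def vradam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
                beta3=1.0, lr_cutoff=19.0, eps=1e-8):
    """One Adam-style update with a global, velocity-damped learning rate.

    ASSUMPTION: the damping rule below is an illustrative guess at a
    "higher-order penalty on the learning rate based on the velocity".
    """
    state["t"] += 1
    t, m, v = state["t"], state["m"], state["v"]
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # first moment ("velocity")
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # second moment
    # The squared velocity norm shrinks the global step size, up to a cutoff.
    speed_sq = sum(x * x for x in m)
    lr_eff = lr / (1.0 + min(beta3 * speed_sq, lr_cutoff))
    for i in range(len(params)):
        m_hat = m[i] / (1 - beta1 ** t)              # Adam bias correction
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr_eff * m_hat / (math.sqrt(v_hat) + eps)
    return lr_eff


if __name__ == "__main__":
    small = vradam_step([0.0], [0.1], init_state(1))
    large = vradam_step([0.0], [100.0], init_state(1))
    print(small, large)  # the large-gradient step uses a smaller learning rate
```

In this sketch the per-parameter Adam scaling is unchanged; only the global step size reacts to the velocity, which mirrors the "global damping plus per-parameter scaling" split the abstract describes.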

2505.15201 2026-06-11 cs.LG cs.AI cs.CL stat.ML 版本更新

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Pass@K 策略优化:解决更困难的强化学习问题

Christian Walder, Deep Karkhanis

AI总结 提出 Pass-at-k 策略优化 (PKPO),通过变换奖励直接优化 pass@k 性能,利用低方差无偏估计器,在训练中退火 k 可同时提升 pass@1 和 pass@k,解决更难问题。

AI中文摘要

强化学习算法对每个问题采样多个 n>1 的解决方案尝试并独立奖励它们。这优化了 pass@1 性能,优先考虑孤立样本的强度,而牺牲了样本集的多样性和集体效用。这未充分利用采样能力,限制了探索和在更难示例上的最终改进。作为修复,我们提出 Pass-at-k 策略优化 (PKPO),一种对最终奖励的变换,导致直接优化 pass@k 性能,从而优化联合考虑时最大化奖励的样本集。我们的贡献是推导出 pass@k 及其梯度在二元和连续奖励设置中的新型低方差无偏估计器。我们展示了使用我们的估计器进行优化简化为标准强化学习,其中奖励经过稳定高效的变换函数联合变换。虽然先前的工作仅限于 k=n,但我们是第一个能够对任意 k ≤ n 实现 pass@k 鲁棒优化的。此外,我们的方法不是以 pass@1 性能换取 pass@k 增益,而是允许在训练中退火 k,同时优化两个指标,通常能在显著 pass@k 增益的同时获得强大的 pass@1 数值。我们在玩具实验上验证了我们的奖励变换,揭示了我们的公式的方差减少特性。我们还使用开源 LLM GEMMA-2 包含了真实世界的例子。我们发现我们的变换有效地优化了目标 k。此外,更高的 k 值能够解决更多和更难的问题,而退火 k 则同时提升了 pass@1 和 pass@k。关键的是,在传统 pass@1 优化停滞的具有挑战性的任务集上,我们的 pass@k 方法解锁了学习,这可能是由于通过优先考虑联合效用而非单个样本的效用实现了更好的探索。

英文摘要

Reinforcement Learning (RL) algorithms sample multiple (n>1) solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low-variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show that optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance-reducing properties of our formulations. We also include real-world examples using the open-source LLM GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
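
For readers unfamiliar with the metric being optimized, the sketch below computes the widely used combinatorial unbiased estimate of pass@k from n binary-reward samples, c of which are correct. It is only meant to make the quantity concrete: the function name `pass_at_k` is chosen here for illustration, and the paper's own low-variance estimators, gradient estimators, and reward transformation are not reproduced.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples with c correct ones.

    Equals the probability that a uniformly random size-k subset of the
    n samples contains at least one correct sample.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With k = 1 the estimate reduces to the plain success rate c / n.
    print(pass_at_k(4, 2, 1))
    print(pass_at_k(4, 2, 2))
```

The estimate is defined for any k <= n, which is the regime the abstract highlights, and it grows with k for a fixed number of correct samples.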

2506.20040 2026-06-11 cs.LG cs.AI cs.CL 版本更新

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

跨层离散概念发现用于解释语言模型

Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou

AI总结 提出跨层向量量化变分自编码器(CLVQ-VAE),通过离散向量量化瓶颈将残差流中的重复特征压缩为紧凑可解释的概念向量,在三个数据集上优于聚类、单层VQ-VAE和稀疏自编码器基线。

AI中文摘要

由于残差流的存在,解释语言模型仍然具有挑战性,残差流在相邻层之间线性混合和复制特征,导致单层分析忽略这种跨层结构。跨层稀疏自编码器(SAE)解决了层混合问题,但在连续空间中操作,概念分散在许多神经元上,没有清晰的边界。我们引入了跨层向量量化变分自编码器(CLVQ-VAE),这是一种新颖的框架,通过离散向量量化瓶颈将较低层的表示映射到较高层,将重复的残差流特征压缩为紧凑、可解释的概念向量。我们的方法结合了基于top-k温度的采样和指数移动平均(EMA)码本更新,在保持码本多样性的同时,对离散潜在空间进行受控探索。在基于编码器和解码器的模型上,针对ERASER-Movie、Jigsaw和AGNews数据集,CLVQ-VAE在三个评估轴上优于聚类、单层向量量化变分自编码器(VQ-VAE)和稀疏自编码器(SAE)基线:移除识别出的概念使模型准确率下降高达93%,LLM评判员在66.7%的比较中将我们的概念排在首位,人类标注者从我们的可视化中恢复模型预测的准确率为78%,而聚类为54%。

英文摘要

Interpreting language models remains challenging because the residual stream linearly mixes and duplicates features across adjacent layers, causing single-layer analyses to miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in continuous space, where concepts split across many neurons without clear boundaries. We introduce the Cross-Layer Vector Quantized-Variational Autoencoder (CLVQ-VAE), a novel framework that maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Across both encoder- and decoder-based models on ERASER-Movie, Jigsaw, and AGNews, CLVQ-VAE outperforms clustering, single-layer vector quantized-variational autoencoder (VQ-VAE), and sparse autoencoder (SAE) baselines across three evaluation axes: removing identified concepts drops model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy versus 54% for clustering.
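
The two codebook mechanisms named in this abstract, top-k temperature-based sampling and EMA codebook updates, can be sketched generically as below. This is a hedged illustration of the standard techniques, not the CLVQ-VAE code: the function names, the softmax over negative squared distances, and the default values of `k`, `temperature`, and `decay` are assumptions made for the example.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sample_code(z, codebook, k=3, temperature=1.0, rng=random):
    """Sample a code index among the k nearest codes (ASSUMED sampling rule)."""
    ranked = sorted((squared_distance(z, e), i) for i, e in enumerate(codebook))[:k]
    d_min = ranked[0][0]  # subtract the minimum distance for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d, _ in ranked]
    return rng.choices([i for _, i in ranked], weights=weights, k=1)[0]


def ema_update(codebook, counts, sums, batch, assignments, decay=0.99, eps=1e-5):
    """Move each code toward the running mean of the vectors assigned to it."""
    dim = len(codebook[0])
    batch_counts = [0] * len(codebook)
    batch_sums = [[0.0] * dim for _ in codebook]
    for z, i in zip(batch, assignments):
        batch_counts[i] += 1
        for d in range(dim):
            batch_sums[i][d] += z[d]
    for i in range(len(codebook)):
        counts[i] = decay * counts[i] + (1 - decay) * batch_counts[i]
        for d in range(dim):
            sums[i][d] = decay * sums[i][d] + (1 - decay) * batch_sums[i][d]
        if counts[i] > eps:  # leave codes with no accumulated usage unchanged
            codebook[i] = [s / counts[i] for s in sums[i]]


if __name__ == "__main__":
    codes = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    print(sample_code([0.9, 1.2], codes, k=1))  # k=1 is nearest-neighbour lookup
```

Setting `k=1` recovers deterministic nearest-neighbour quantization, while larger `k` or higher `temperature` lets less-used codes be selected, which is one common way to keep a codebook diverse.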

2507.21164 2026-06-11 cs.LG cs.AI eess.IV stat.ML 版本更新

OCSVM-Guided Representation Learning for Unsupervised Anomaly Detection

OCSVM引导的无监督异常检测表示学习

Nicolas Pinon (MYRIAD), Robin Trombetta (MYRIAD), Carole Lartizien (MYRIAD)

AI总结 提出一种将表示学习与可解析求解的一类SVM耦合的方法,通过定制损失函数直接对齐潜在特征与决策边界,在MNIST-C和脑MRI病变检测任务上展现了鲁棒性和性能。

AI中文摘要

无监督异常检测(UAD)旨在无需标签数据检测异常,这在许多机器学习应用中是必要的,因为异常样本稀少或不可用。大多数最先进的方法分为两类:基于重构的方法(通常重构异常过于完美)和与密度估计器解耦的表示学习(可能遭受次优特征空间)。虽然一些近期方法尝试耦合特征学习和异常检测,但它们通常依赖替代目标、限制核选择或引入近似,从而限制了表达能力和鲁棒性。为解决这一挑战,我们提出了一种新颖方法,通过自定义损失公式将表示学习与可解析求解的一类SVM(OCSVM)耦合,该损失直接使潜在特征与OCSVM决策边界对齐。该模型在两个任务上评估:基于MNIST-C的基准,以及具有挑战性的脑MRI病变检测任务。与大多数关注图像级别大而高信号病变的方法不同,我们的方法成功针对小而非高信号的病变,并采用体素级别的指标进行评估,处理了更具临床相关性的场景。两个实验评估了对领域偏移的鲁棒性形式,包括MNIST-C中的损坏类型以及MRI中的纹理或人群年龄变化。结果展示了我们提出模型的性能和鲁棒性,突显了其在通用UAD和现实医学成像应用中的潜力。源代码可在此https URL获取。

英文摘要

Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that couples representation learning with an analytically solvable One-Class SVM (OCSVM), through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a benchmark based on MNIST-C and a challenging brain MRI lesion detection task. Unlike most methods, which focus on large, hyperintense lesions at the image level, our approach successfully targets small, non-hyperintense lesions and is evaluated with voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and texture or population age variations in MRI. Results demonstrate the performance and robustness of our proposed model, highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at this https URL.

2508.21380 2026-06-11 cs.LG cs.AI 版本更新

The Algorithm Is Not the Behavior: Learned Priors Override Look-Ahead in a Chess-Playing Neural Network

算法并非行为:学得的先验知识在弈棋神经网络中覆盖前瞻

Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek

AI总结 研究发现,国际象棋神经网络Leela Chess Zero在中间层能正确计算解法,但最终输出被安全优先的先验知识覆盖,导致错误答案。

AI中文摘要

最近的机制性工作揭示了神经网络内部的学习算法,从模运算到游戏智能体中的搜索与规划。但算法结构是否保证算法行为?我们在最强的神经象棋引擎Leela Chess Zero中对此进行研究,先前工作已识别出学习到的前瞻。通过将logit透镜扩展到其选棋策略网络,我们发现正确的谜题解法——包括即时将杀——经常出现在中间层,但在最终输出中被系统性覆盖,我们将此现象称为“遗忘的谜题”。在这些位置上重复先前的分析,我们发现前瞻运行正常——正确续招的未来走法被表示、因果重要且可线性解码——排除了算法本身的失败。相反,后期层逐渐转向优先考虑安全对局而非激进。为了测试这一转变是否驱动了覆盖,我们引导模型反对这些偏好,并恢复了61.7%的遗忘谜题,提供了因果证据表明安全先验覆盖了算法计算的解法。这些发现表明,算法结构并不保证算法行为:模型可以在内部解决问题,但仍然输出错误答案。

英文摘要

Recent mechanistic work has uncovered learned algorithms within neural networks, from modular arithmetic to search and planning in game-playing agents. But does algorithmic structure guarantee algorithmic behavior? We investigate this in Leela Chess Zero, the strongest neural chess engine, where prior work identified learned look-ahead. By extending the logit lens to its move-selecting policy network, we discover that correct puzzle solutions, including immediate checkmates, often appear in intermediate layers but are systematically overridden in the final output, a phenomenon we term "forgotten puzzles". Replicating prior analyses on these positions, we find that look-ahead operates normally (future moves of the correct continuation are represented, causally important, and linearly decodable), ruling out a failure of the algorithm itself. Instead, late layers increasingly shift toward prioritizing safe play over aggression. To test whether this shift drives the override, we steer the model against these preferences and recover 61.7% of forgotten puzzles, providing causal evidence that safety priors override algorithmically computed solutions. These findings demonstrate that algorithmic structure does not guarantee algorithmic behavior: a model can internally solve a problem and still output the wrong answer.

2509.26294 2026-06-11 cs.LG cs.AI 版本更新

Noise-Guided Transport for Imitation Learning

噪声引导的模仿学习传输方法

Lionel Blondé, Joao A. Candido Ramos, Alexandros Kalousis

AI总结 针对低数据场景下的模仿学习,提出噪声引导传输(NGT)方法,通过对抗训练将模仿问题转化为最优传输问题,无需预训练或特殊架构,在极低数据量下实现强性能。

Comments
Accepted at ICML 2026. Code: this https URL
AI中文摘要

我们考虑低数据场景下的模仿学习,其中只有有限数量的专家演示可用。在这种情况下,依赖大规模预训练或高容量架构的方法难以应用,对演示数据的效率变得至关重要。我们引入了噪声引导传输(NGT),一种轻量级的离策略方法,将模仿问题转化为通过对抗训练解决的最优传输问题。NGT不需要预训练或专门架构,通过设计包含不确定性估计,并且易于实现和调优。尽管简单,NGT在具有挑战性的连续控制任务(包括高维人形任务)中,在仅有20个转换的超低数据场景下取得了强劲的性能。

英文摘要

We consider imitation learning in the low-data regime, where only a limited number of expert demonstrations are available. In this setting, methods that rely on large-scale pretraining or high-capacity architectures can be difficult to apply, and efficiency with respect to demonstration data becomes critical. We introduce Noise-Guided Transport (NGT), a lightweight off-policy method that casts imitation as an optimal transport problem solved via adversarial training. NGT requires no pretraining or specialized architectures, incorporates uncertainty estimation by design, and is easy to implement and tune. Despite its simplicity, NGT achieves strong performance on challenging continuous control tasks, including high-dimensional Humanoid tasks, under ultra-low data regimes with as few as 20 transitions.

2510.04567 2026-06-11 cs.LG cs.AI 版本更新

GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning

GILT:一种无需LLM、无需微调的图基础模型用于上下文学习

Weishuo Ma, Yanbo Wang, Xiyuan Wang, Lei Zou, Muhan Zhang

AI总结 提出GILT框架,通过基于令牌的上下文学习机制统一处理节点、边和图级别的分类任务,无需大语言模型或微调,实现高效泛化。

Comments
Accepted as an oral presentation at the GFM @ ICML 2026 Workshop
AI中文摘要

图神经网络(GNN)是处理关系数据的强大工具,但通常难以泛化到未见过的图,从而催生了图基础模型(GFM)的发展。然而,当前的GFM面临图数据极端异质性的挑战,每个图可能具有独特的特征空间、标签集和拓扑结构。为此,出现了两种主要范式:第一种利用大语言模型(LLM),但本质上依赖于文本,因此难以处理海量图中的数值特征;第二种预训练基于结构的模型,但适应新任务通常需要昂贵的每图微调阶段,造成关键效率瓶颈。在这项工作中,我们超越了这些限制,引入了图上下文学习Transformer(GILT),这是一个基于无需LLM且无需微调架构的框架。GILT引入了一种新颖的基于令牌的框架用于图上的上下文学习(ICL),在统一框架中重新定义了跨节点、边和图级别的分类任务。该机制是处理异质性的关键,因为它设计用于操作通用数值特征。此外,它从上下文中动态理解类别语义的能力实现了无需微调的适应。全面实验表明,与基于LLM或基于微调的基线相比,GILT以显著更少的时间实现了更强的少样本性能,验证了我们方法的有效性。我们的代码可在https://github.com/yiming421/inductnode/获取。

英文摘要

Graph Neural Networks (GNNs) are powerful tools for processing relational data but often struggle to generalize to unseen graphs, giving rise to the development of Graph Foundational Models (GFMs). However, current GFMs are challenged by the extreme heterogeneity of graph data, where each graph can possess a unique feature space, label set, and topology. To address this, two main paradigms have emerged. The first leverages Large Language Models (LLMs), but is fundamentally text-dependent and thus struggles to handle the numerical features in vast graphs. The second pre-trains a structure-based model, but adaptation to new tasks typically requires a costly, per-graph tuning stage, creating a critical efficiency bottleneck. In this work, we move beyond these limitations and introduce the Graph In-context Learning Transformer (GILT), a framework built on an LLM-free and tuning-free architecture. GILT introduces a novel token-based framework for in-context learning (ICL) on graphs, reframing classification tasks spanning node, edge and graph levels in a unified framework. This mechanism is the key to handling heterogeneity, as it is designed to operate on generic numerical features. Further, its ability to understand class semantics dynamically from the context enables tuning-free adaptation. Comprehensive experiments show that GILT achieves stronger few-shot performance with significantly less time than LLM-based or tuning-based baselines, validating the effectiveness of our approach. Our code is available at: this https URL.

2512.22088 2026-06-11 cs.LG cs.AI cs.CL 版本更新

Unifying Learning Dynamics and Generalization in Transformers Scaling Law

统一Transformer缩放定律中的学习动力学与泛化

Chiwun Yang

AI总结 本文通过将Transformer学习动力学形式化为ODE系统并近似为核行为,严格分析了随机梯度下降训练下的泛化误差,揭示了计算资源缩放时泛化误差的指数衰减与幂律衰减的两阶段相变,并建立了紧的上下界。

Comments
87 pages, 10 figures, 3 tables
AI中文摘要

缩放定律是大语言模型(LLM)发展的基石,预测了模型性能随计算资源增加而提升。然而,尽管经验上得到验证,其理论基础仍不清晰。本文形式化了基于Transformer的语言模型的学习动力学为一个常微分方程(ODE)系统,然后将该过程近似为核行为。与之前的玩具模型分析不同,我们严格分析了在序列到序列数据上具有任意数据分布的多层Transformer的随机梯度下降(SGD)训练,紧密反映了真实世界条件。我们的分析刻画了随着计算资源随数据缩放时,泛化误差收敛到不可约风险的过程,特别是在优化过程中。我们建立了过剩风险的匹配上下界,其特征是明显的相变。在初始优化阶段,过剩风险相对于计算成本${\sf C}$呈指数衰减。然而,一旦超过特定的资源分配阈值,系统进入统计阶段,泛化误差遵循$\Theta(\mathsf{C}^{-1/7})$的幂律衰减。这些速率通过互补的下界得到证实——统计方面通过信息论的两点约简,优化方面通过一阶预言机论证——使得两阶段定律在常数、对数因子和条件数差距内是紧的。除了这个统一框架,我们的理论还推导了模型大小、训练时间和数据集大小的独立缩放定律,阐明了每个变量如何独立地控制泛化的边界。

英文摘要

The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process by kernel behavior. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with an arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish matching upper and lower bounds on the excess risk, characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially in the computational cost $\mathsf{C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/7})$. These rates are certified by complementary lower bounds (statistical, via an information-theoretic two-point reduction, and optimization-side, via a first-order oracle argument), rendering the two-stage law tight up to constants, logarithmic factors, and a condition-number gap. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the bounds of generalization.
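
As a worked reading of the statistical-phase rate quoted in this abstract, the short derivation below shows how much extra compute a given error reduction would require. It assumes, purely for illustration, that the $\Theta(\mathsf{C}^{-1/7})$ rate holds as an exact proportionality with a hypothetical constant $\kappa$ (constants and logarithmic factors are ignored); $\mathcal{E}$ denotes the excess risk.

```latex
\mathcal{E}(\mathsf{C}) = \kappa\,\mathsf{C}^{-1/7}
\quad\Longrightarrow\quad
\frac{\mathcal{E}(\mathsf{C}_2)}{\mathcal{E}(\mathsf{C}_1)}
  = \left(\frac{\mathsf{C}_2}{\mathsf{C}_1}\right)^{-1/7},
\qquad
\frac{\mathcal{E}(\mathsf{C}_2)}{\mathcal{E}(\mathsf{C}_1)} = \frac{1}{2}
\;\Longleftrightarrow\;
\frac{\mathsf{C}_2}{\mathsf{C}_1} = 2^{7} = 128 .
```

Under that simplification, halving the excess risk in the statistical phase costs a 128-fold increase in compute, in contrast to the exponential decay of the initial optimization phase.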

2601.00791 2026-06-11 cs.LG cs.AI cs.CL cs.LO 版本更新

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

推理的几何:有效数学推理的谱特征

Valentin Noël

AI总结 通过将注意力矩阵视为加权词图,提取四个无需学习的谱诊断指标(Fiedler值、高频能量比、谱熵和平滑度),有效区分有效推理与模式匹配,在多个模型上达到85-96%的分类准确率。

Comments
30 pages, 13 figures, Accepted at ICML 2026 (main track)
AI中文摘要

验证语言模型是真正推理还是模式匹配仍然是一个开放问题:学习型验证器成本高昂,基于输出的启发式方法脆弱。我们证明,有效的数学推理在Transformer注意力中诱导出可测量的、无需训练的谱特征。通过将每个注意力矩阵视为加权词图,我们提取四个诊断指标:Fiedler值、高频能量比(HFER)、谱熵和平滑度,这些指标无需学习参数。在来自四个架构家族的七个模型上的实验产生了高达Cohen's $d = 3.30$($p < 10^{-116}$)的效应量,实现了$85$--$96\%$的单阈值分类准确率。两个发现加深了理解。首先,\emph{柏拉图式有效性}:谱信号追踪逻辑连贯性而非编译器接受性,因超时或缺失导入而被拒绝的证明被正确分类为有效,这一区别通过人工审核确认($\kappa = 0.82$,$n = 51$)。其次,\emph{架构确定性}:滑动窗口注意力将判别特征从HFER转移到平滑度($d = 2.09$,$p < 10^{-48}$),表明注意力设计决定了哪个谱通道编码推理质量。因果消融证实该特征追踪归纳头电路。该方法泛化到非形式化思维链($d = 0.78$,$p < 10^{-3}$),并且在证明搜索中,HFER重排序将Best-of-16 Pass@1提高了$+4.4$--$6.6\%$,匹配了完全监督探针AUC的$98\%$且无需标签。谱图分析是一种原则性的、架构感知的推理验证原语。

英文摘要

Verifying whether a language model is genuinely reasoning or pattern-matching remains an open problem: learned verifiers are expensive, and output-based heuristics are brittle. We show that valid mathematical reasoning induces a measurable, training-free spectral signature in transformer attention. By treating each attention matrix as a weighted token graph, we extract four diagnostics that require no learned parameters: Fiedler value, High-Frequency Energy Ratio (HFER), spectral entropy, and smoothness. Experiments across seven models from four architectural families yield effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling $85$--$96\%$ single-threshold classification accuracy. Two findings sharpen the interpretation. First, \emph{Platonic validity}: the spectral signal tracks logical coherence rather than compiler acceptance; proofs rejected for timeouts or missing imports are correctly classified as valid, a distinction confirmed by a manual audit ($\kappa = 0.82$, $n = 51$). Second, \emph{architectural determinism}: Sliding Window Attention shifts the discriminative feature from HFER to smoothness ($d = 2.09$, $p < 10^{-48}$), showing that attention design governs which spectral channel encodes reasoning quality. Causal ablation confirms that the signature traces induction-head circuits. The method generalises to informal chain-of-thought ($d = 0.78$, $p < 10^{-3}$), and in proof search, HFER reranking improves Best-of-16 Pass@1 by $+4.4$--$6.6\%$, matching $98\%$ of the AUC of fully supervised probes with zero labels. Spectral graph analysis is a principled, architecture-aware primitive for reasoning verification.

2601.11670 2026-06-11 cs.LG cs.AI 版本更新

CoVar: Confidence-Variance-Guided Pseudo-Label Selection for Semi-Supervised Learning

CoVar: 置信度-方差引导的半监督学习伪标签选择

Jinshi Liu, Lei He, Pan Liu

AI总结 提出CoVar框架,通过联合建模最大置信度和残差类方差来评估伪标签可靠性,利用SVD谱松弛分离可靠与不可靠预测,无需手动阈值,在分割和分类任务上取得提升。

AI中文摘要

半监督学习中的伪标签选择通常由最大置信度阈值驱动,然而在模型过度自信和类别不平衡下,仅靠置信度可能不可靠。我们提出CoVar,一个置信度-方差框架,通过联合建模最大置信度(MC)和残差类方差(RCV)来评估伪标签可靠性。从熵最小化出发,我们推导出二阶交叉熵近似,表明当MC高且RCV低时,低损失伪标签更受青睐,并带有置信度依赖的惩罚项,该惩罚项对接近确定的预测更强。基于此准则,CoVar将预测嵌入二维置信度-方差空间,并使用基于SVD的谱松弛来分离可靠和不可靠的预测,无需手动调整置信度阈值。然后,聚类加权高斯函数将此分离转换为每个样本的训练权重。所得权重可在训练期间集成到现有的半监督分割和分类流程中,且不引入推理开销。在PASCAL VOC 2012、Cityscapes、CIFAR-10、CIFAR-100、SVHN和STL-10上的实验表明,在匹配骨干网络下,VOC和Cityscapes上取得明显提升,并在标准分类基准上达到竞争性或更低的错误率。这些结果表明,残差类离散度为鲁棒伪标签选择提供了置信度之外的补充信号。

英文摘要

Pseudo-label selection in semi-supervised learning is commonly driven by maximum-confidence thresholds, yet confidence alone can be unreliable under model overconfidence and class imbalance. We propose CoVar, a confidence--variance framework that assesses pseudo-label reliability by jointly modeling Maximum Confidence (MC) and Residual-Class Variance (RCV). Starting from entropy minimization, we derive a second-order cross-entropy approximation showing that low-loss pseudo-labels are favored when MC is high and RCV is low, with a confidence-dependent penalty that becomes stronger for near-certain predictions. Based on this criterion, CoVar embeds predictions into a two-dimensional confidence--variance space and uses SVD-based spectral relaxation to separate reliable and unreliable predictions without hand-tuned confidence thresholds. Cluster-wise Gaussian weighting then converts this separation into per-sample training weights. The resulting weights can be integrated into existing semi-supervised segmentation and classification pipelines during training and introduce no inference-time overhead. Experiments on PASCAL VOC 2012, Cityscapes, CIFAR-10, CIFAR-100, SVHN, and STL-10 show clear gains on VOC and Cityscapes under matched backbones, as well as competitive or improved error rates on standard classification benchmarks. These results indicate that residual-class dispersion provides a useful signal complementary to confidence for robust pseudo-label selection.
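
The two reliability signals named in this abstract can be made concrete with a small sketch. It assumes one plausible definition, not confirmed by the abstract: Maximum Confidence is the largest softmax probability, and Residual-Class Variance is the population variance of the remaining class probabilities. The function name is hypothetical, and the paper's SVD-based separation and Gaussian weighting are not reproduced.

```python
def confidence_and_residual_variance(probs):
    """Return (MC, RCV) for one softmax vector with at least two classes.

    ASSUMPTION: RCV is the population variance of the non-argmax
    probabilities, without renormalization.
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    mc = probs[top]
    rest = [p for i, p in enumerate(probs) if i != top]
    mean = sum(rest) / len(rest)
    rcv = sum((p - mean) ** 2 for p in rest) / len(rest)
    return mc, rcv


if __name__ == "__main__":
    # Same confidence, different spread of the leftover probability mass.
    print(confidence_and_residual_variance([0.7, 0.1, 0.1, 0.1]))
    print(confidence_and_residual_variance([0.7, 0.3, 0.0, 0.0]))
```

The two example vectors share the same maximum confidence of 0.7, yet the second concentrates the remaining mass on a single competing class and therefore has a higher residual variance, which is the kind of case confidence thresholds alone cannot distinguish.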

2602.03282 2026-06-11 cs.CV cs.AI 版本更新

Global Geometry Is Not Enough for Vision Representations

全局几何不足以用于视觉表示

Jiwan Chung, Seon Joo Kim

AI总结 本文通过实验发现全局嵌入几何与组合绑定能力几乎无关,而输入-输出雅可比矩阵衡量的功能敏感性可靠地追踪该能力,并分析指出这是由于现有损失函数显式约束嵌入几何但未约束局部输入-输出映射所致。

详情
AI中文摘要

表示学习中的一个常见假设是,全局分布良好的嵌入支持鲁棒且可泛化的表示。这一关注点塑造了训练目标和评估协议,隐含地将全局几何视为表示能力的代理。虽然全局几何有效地编码了哪些元素存在,但它通常对元素如何组合不敏感。我们通过测试几何度量预测跨多种视觉编码器的组合绑定的能力来研究这一局限性。我们发现,基于标准几何的统计量与组合绑定几乎无相关性。相比之下,由输入-输出雅可比矩阵衡量的功能敏感性可靠地追踪这一能力。我们进一步提供了分析性解释,表明这种差异源于目标设计,因为现有损失显式约束嵌入几何,但未约束局部输入-输出映射。这些结果表明,全局嵌入几何仅捕捉了表示能力的部分视图,并将功能敏感性确立为建模复合结构的关键补充轴。

英文摘要

A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across a diverse suite of vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input--output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input--output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.

2602.08986 2026-06-11 cs.LG cs.AI 版本更新

Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning

改进分层多标签学习中稀有节点的检测

Isaac Xu, Martin Gillis, Ayushi Sharma, Benjamin Misiuk, Craig J. Brown, Thomas Trappenberg

AI总结 针对分层多标签分类中稀有节点检测困难的问题,提出结合节点不平衡加权和焦点加权的损失函数,利用集成不确定性量化,在基准数据集上将召回率提升至五倍,并显著提高F1分数。

Comments
Accepted for publication in Transactions on Machine Learning Research (TMLR), 2026
AI中文摘要

在分层多标签分类中,一个持续的挑战是使模型预测能够达到层次结构的更深层次,以实现更详细或更细粒度的分类。这一困难部分源于某些类别(或层次节点)的自然稀有性,以及确保子节点几乎总是比其父节点频率更低的分层约束。为了解决这个问题,我们为神经网络提出了一种加权损失目标,该目标结合了节点不平衡加权和焦点加权组件,后者利用了集成不确定性的现代量化。通过强调稀有节点而非稀有观测(数据点),并在训练过程中关注每个模型输出分布中的不确定节点,我们观察到在基准数据集上召回率提高了高达五倍,并且$F_{1}$分数有统计显著的提升。我们还展示了我们的方法有助于卷积网络处理具有挑战性的任务,例如在编码器次优或数据有限的情况下。

英文摘要

In hierarchical multi-label classification, a persistent challenge is enabling model predictions to reach deeper levels of the hierarchy for more detailed or fine-grained classifications. This difficulty partly arises from the natural rarity of certain classes (or hierarchical nodes) and the hierarchical constraint that ensures child nodes are almost always less frequent than their parents. To address this, we propose a weighted loss objective for neural networks that combines node-wise imbalance weighting with focal weighting components, the latter leveraging modern quantification of ensemble uncertainties. By emphasizing rare nodes rather than rare observations (data points), and focusing on uncertain nodes for each model output distribution during training, we observe improvements in recall by up to a factor of five on benchmark datasets, along with statistically significant gains in $F_{1}$ score. We also show our approach aids convolutional networks on challenging tasks, as in situations with suboptimal encoders or limited data.
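
A rough sketch of how node-wise imbalance weighting and focal weighting can be combined in one loss is given below. It is an illustration under assumptions, not the paper's objective: the focal term here is the standard `(1 - p_t) ** gamma` factor computed from a single model's probability, whereas the abstract describes a focal component driven by ensemble uncertainty; inverse node frequency as the imbalance weight and all names are likewise hypothetical.

```python
import math


def weighted_focal_bce(probs, targets, node_freq, gamma=2.0, eps=1e-12):
    """Mean binary cross-entropy over hierarchy nodes with two weights.

    probs:     predicted probability per node
    targets:   0/1 label per node
    node_freq: fraction of training samples in which each node is positive
    ASSUMPTION: weights are inverse frequency times a standard focal factor.
    """
    total = 0.0
    for p, y, f in zip(probs, targets, node_freq):
        p_t = p if y == 1 else 1.0 - p           # probability of the true label
        node_w = 1.0 / max(f, eps)               # rare nodes get larger weight
        focal_w = (1.0 - p_t) ** gamma           # poorly predicted nodes get larger weight
        total += node_w * focal_w * -math.log(max(p_t, eps))
    return total / len(probs)


if __name__ == "__main__":
    rare = weighted_focal_bce([0.3], [1], [0.01])
    common = weighted_focal_bce([0.3], [1], [0.5])
    print(rare > common)
```

Note that the weight is attached to the node rather than to the data point, which matches the abstract's emphasis on rare nodes instead of rare observations.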

2603.09555 2026-06-11 cs.LG cs.AI cs.DC cs.PF 版本更新

Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

编译器优先的状态空间对偶性与可移植的 $O(1)$ 自回归缓存推理

Cosmo Santoni, Anmol Thapar

AI总结 提出一种基于编译器优先的状态空间对偶性(SSD)结构的推理方法,通过标准JAX原语实现无自定义内核的单源推理路径,在TPU和GPU上达到高硬件利用率,且缓存解码速度比全前缀重计算快27-36倍。

详情
Comments
21 pages, 6 figures. Code available at: this https URL
AI中文摘要

高吞吐量的Mamba-2推理通常依赖于融合的CUDA和Triton内核,这限制了在不同加速器后端之间的可移植性。我们证明状态空间对偶性(SSD)递归具有编译器友好的结构:对角逐头动态、固定大小分块、以einsum为主的计算以及静态控制流。在标准JAX原语中表达这种结构,可以得到一个无需自定义内核的单源推理路径、一个注册的JAX PyTree缓存以及一个编译后的设备上自回归循环。在单个Google Cloud TPU v6e上,batch-1预填充达到约140 TFLOPS,即15%的模型FLOP利用率(MFU),这是该场景下的屋顶线上限;缓存解码达到高达64%的硬件带宽利用率(HBU)。在4096个token的上下文中,对于五个Mamba-2检查点(参数从130M到2.7B),缓存解码比全前缀重计算快27-36倍。相同的源代码在未修改的情况下可在NVIDIA L40S上运行,其中缓存解码在所有模型规模下均保持序列长度无关。WikiText-103验证困惑度与Triton参考实现mamba_ssm v2.2.2相差在±0.0005以内,隐藏状态在float32舍入容差内一致。代码可在此https URL获取。

英文摘要

High-throughput Mamba-2 inference is usually tied to fused CUDA and Triton kernels, limiting portability across accelerator backends. We show that the state space duality (SSD) recurrence has a compiler-friendly structure: diagonal per-head dynamics, fixed-size chunking, einsum-dominated compute, and static control flow. Expressing this structure in standard JAX primitives gives a single-source inference path with no custom kernels, a registered JAX PyTree cache, and a compiled on-device autoregressive loop. On a single Google Cloud TPU v6e, batch-1 prefill reaches approximately 140 TFLOPS, or 15% model FLOP utilisation (MFU), the roofline ceiling for this regime, and cached decode reaches up to 64% hardware bandwidth utilisation (HBU). At a 4096-token context, cached decode is 27x--36x faster than full-prefix recomputation across five Mamba-2 checkpoints from 130M to 2.7B parameters. The same source runs unmodified on NVIDIA L40S, where cached decode remains sequence-length independent across all model scales. WikiText-103 validation perplexity matches the Triton reference mamba_ssm v2.2.2 within +/-0.0005 points, and hidden states agree to float32 rounding tolerance. Code is available at this https URL.
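
The difference between cached decode and full-prefix recomputation can be illustrated with a toy scalar recurrence. This is a deliberately simplified sketch in plain Python rather than the JAX implementation the abstract describes: the recurrence `h_t = a_t * h_{t-1} + b_t * x_t`, `y_t = c_t * h_t` stands in for one diagonal state channel, and all names are hypothetical.

```python
def recompute_last_output(a, b, c, x):
    """Full-prefix recomputation: cost grows with the sequence length."""
    h = 0.0
    for a_t, b_t, x_t in zip(a, b, x):
        h = a_t * h + b_t * x_t
    return c[len(x) - 1] * h


def cached_step(h_prev, a_t, b_t, c_t, x_t):
    """Cached decode: one constant-cost update from the stored state."""
    h = a_t * h_prev + b_t * x_t
    return h, c_t * h


if __name__ == "__main__":
    a = [0.9, 0.8, 0.7, 0.6]
    b = [1.0, 0.5, 0.25, 2.0]
    c = [1.0, 2.0, 3.0, 4.0]
    x = [1.0, -1.0, 0.5, 2.0]
    h = 0.0
    for t in range(len(x)):
        h, y_cached = cached_step(h, a[t], b[t], c[t], x[t])
        y_full = recompute_last_output(a[: t + 1], b[: t + 1], c[: t + 1], x[: t + 1])
        print(t, y_cached, y_full)  # both paths produce the same output
```

Because the stored state has a fixed size, each cached step does the same amount of work regardless of how many tokens came before, which is the sense in which the cache is $O(1)$ per token.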

2603.14867 2026-06-11 cs.LG cs.AI cs.GT cs.MA 版本更新

Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning

用于去中心化双层强化学习的样本高效超梯度估计

Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto

AI总结 针对去中心化双层强化学习中领导者无法干预跟随者优化过程的问题,提出基于玻尔兹曼协方差技巧的超梯度估计方法,实现高维决策空间下的样本高效优化,并首次应用于双人马尔可夫博弈。

详情
Comments
29 pages. Extended version of the paper accepted to ICAPS 2026
AI中文摘要

许多战略决策问题,例如仓库机器人的环境设计,可以自然地表述为双层强化学习,其中领导者代理优化其目标,而跟随者解决一个以领导者决策为条件的马尔可夫决策过程。在许多情况下,当领导者无法干预跟随者的优化过程时,会出现一个基本挑战;它只能观察优化结果。我们通过推导领导者目标的超梯度(即考虑跟随者最优策略变化的领导者策略梯度)来解决这种去中心化设置。与先前基于超梯度的方法不同,这些方法需要大量数据来重复访问状态,或者依赖于梯度估计器,其复杂度可能随着领导者决策空间的高维性而显著增加,我们利用玻尔兹曼协方差技巧推导出一种替代的超梯度公式。这使得仅从交互样本中就能进行高效的超梯度估计,即使领导者的决策空间是高维的。此外,据我们所知,这是第一种能够在去中心化设置中实现基于超梯度的优化的双人马尔可夫博弈方法。实验突出了超梯度更新的影响,并展示了我们的方法在离散和连续状态任务中的有效性。

英文摘要

Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower's optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient of the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader's decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method's effectiveness in both discrete and continuous state tasks.
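The generic identity behind a Boltzmann covariance argument can be checked numerically. For a policy proportional to exp(Q/beta), the derivative of an expectation with respect to a parameter of Q equals a covariance divided by beta. The sketch below is an assumption-laden toy (a three-action bandit with Q linear in a scalar theta, hypothetical names throughout), not the paper's hypergradient estimator.

```python
import math

def boltzmann(q_values, beta):
    """Softmax policy pi(a) proportional to exp(Q(a) / beta)."""
    m = max(q_values)
    weights = [math.exp((q - m) / beta) for q in q_values]
    z = sum(weights)
    return [w / z for w in weights]

def expected(values, probs):
    return sum(p * v for p, v in zip(probs, values))

def covariance_gradient(f, dq_dtheta, probs, beta):
    """d/dtheta E_pi[f] = Cov_pi(f, dQ/dtheta) / beta when f does not depend on theta."""
    mean_f = expected(f, probs)
    mean_g = expected(dq_dtheta, probs)
    cov = sum(p * (fv - mean_f) * (gv - mean_g)
              for p, fv, gv in zip(probs, f, dq_dtheta))
    return cov / beta

# Toy setup (assumed): Q(a; theta) = theta * phi[a], so dQ/dtheta = phi[a].
phi = [0.0, 1.0, 2.5]
f = [1.0, -0.5, 2.0]
beta, theta = 0.7, 0.3

def objective(th):
    return expected(f, boltzmann([th * p for p in phi], beta))

analytic = covariance_gradient(f, phi, boltzmann([theta * p for p in phi], beta), beta)
eps = 1e-6
numeric = (objective(theta + eps) - objective(theta - eps)) / (2 * eps)
```

The covariance form needs only samples of f and dQ/dtheta under the current policy. This is consistent with the abstract's claim that estimation can rely on interaction samples alone.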

2604.24662 2026-06-11 physics.data-an cs.AI cs.IT 版本更新

Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data

信息瓶颈:从高维实验数据学习动力学相空间

K. Michael Martini, Eslam Abdelaleem, Paarth Gulati, Ilya Nemenman

AI总结 提出DySIB方法,通过最大化过去与未来观测窗口间的预测互信息并惩罚表示复杂度,从高维时间序列数据中无监督学习低维动力学表示,在物理摆实验中恢复出与真实相空间匹配的二维表示。

详情
Comments
12 pages including references, 7 figures, 4 appendix pages with 4 appendix figures
AI中文摘要

从高维观测中识别系统的动力学状态变量是物理科学中的一个核心问题。挑战在于状态变量不可直接观测,必须从原始高维数据中无监督地推断。本文引入DySIB(动态对称信息瓶颈)作为一种学习方法,通过最大化过去与未来观测窗口之间的预测互信息并惩罚表示复杂度,学习时间序列数据的低维表示。该目标完全在潜在空间中运作,避免了对观测的重建。我们将DySIB应用于一个物理摆的实验视频数据集,其底层状态空间已知。该方法的学习架构超参数由数据自洽设定,恢复出一个二维表示,该表示与摆相空间的维度、拓扑和几何相匹配,学习到的坐标与标准角度和角速度平滑对齐。这些结果在一个特征明确的实验系统上表明,潜在空间中的预测信息可用于直接从高维数据中恢复可解释的动力学坐标。

英文摘要

Identifying the dynamical state variables of a system from high-dimensional observations is a central problem across physical sciences. The challenge is that the state variables are not directly observable and must be inferred from raw high-dimensional data without supervision. Here we introduce DySIB (Dynamical Symmetric Information Bottleneck) as a method to learn low-dimensional representations of time-series data by maximizing predictive mutual information between past and future observation windows while penalizing representation complexity. This objective operates entirely in latent space and avoids reconstruction of the observations. We apply DySIB to an experimental video dataset of a physical pendulum, where the underlying state space is known. The method, with hyperparameters of the learning architecture set self-consistently by the data, recovers a two-dimensional representation that matches the dimensionality, topology, and geometry of the pendulum phase space, with the learned coordinates aligning smoothly with the canonical angle and angular velocity. These results demonstrate, on a well-characterized experimental system, that predictive information in latent space can be used to recover interpretable dynamical coordinates directly from high-dimensional data.

2605.00545 2026-06-11 cs.LG cs.AI math-ph q-bio.GN q-bio.QM 版本更新

Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots

超越连续性:从单细胞快照无模拟重建离散分支动力学

Junda Ying, Yuxuan Wang, Bowen Yang, Peijie Zhou, Lei Zhang

AI总结 针对单细胞快照数据中随机性和非保守质量动态(如细胞增殖和凋亡)的挑战,提出无模拟框架Unbalanced Schrödinger Bridge (USB),通过离散分支薛定谔桥问题建模单细胞分辨率的跳跃式生灭动态,实现高效轨迹重建与离散模拟。

详情
AI中文摘要

从破坏性快照推断细胞轨迹因随机性和非保守质量动态(如细胞增殖和凋亡)的挑战而复杂化。现有的不平衡最优传输(OT)方法将质量视为连续流体,在群体水平进行推断。然而,这种宏观视角往往无法捕捉单细胞分辨率下生灭事件的离散跳跃性质,而这对于理解谱系分支和命运决定至关重要。我们提出无模拟框架Unbalanced Schrödinger Bridge (USB),用于学习底层动态,有效整合随机和非平衡效应,并在单细胞分辨率下建模离散、跳跃式的生灭动态。理论上,USB为分支薛定谔桥(BSB)问题提供了可处理的解,给出了严格的微观解释,其中单个细胞同时经历布朗运动和离散生灭跳跃。技术上,该方法通过引入无模拟训练目标实现高效求解器,有效扩展到高维组学数据。实验上,我们在模拟和真实数据集上证明,USB不仅达到优于或可比于确定性基线的轨迹重建性能,而且独特地实现了单细胞分辨率下生灭动态的真实离散模拟。

英文摘要

Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that effectively integrates both stochastic and unbalanced effects and that also models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that effectively scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.

2605.13674 2026-06-11 cs.CV cs.AI 版本更新

Weakly Supervised Segmentation as Semantic-Based Regularization

弱监督分割作为基于语义的正则化

Stefano Colamonaco, Andrei-Bogdan Florea, Jaron Maene

AI总结 本文提出通过神经符号方法整合模糊逻辑与深度分割模型,利用弱标注和领域先验知识提升伪标签质量,从而取得常常超过密集监督基线的分割精度。

详情
AI中文摘要

弱监督语义分割(WSSS)通过部分或粗略标注(如边界框、涂鸦或图像标签)训练密集像素级分割模型。尽管近期工作利用基础模型如Segment Anything Model(SAM)生成伪标签,但这些方法通常依赖启发式提示选择,难以整合先验知识或异质标签。我们采用神经符号视角来弥补这一不足:将可微模糊逻辑与深度分割模型相结合。弱标注和领域特定先验被统一为连续逻辑约束,用于在弱监督下微调SAM。优化后的基础模型随后生成改进的伪标签,我们再据此训练一个无提示的第二阶段分割模型。在Pascal VOC 2012和REFUGE2视盘/视杯分割数据集上的实验表明,逻辑引导的微调产生了更高质量的伪标签,从而达到最先进的分割精度,且常常超过密集监督基线。

英文摘要

Weakly supervised semantic segmentation (WSSS) trains dense pixel-level segmentation models from partial or coarse annotations such as bounding boxes, scribbles, or image-level tags. While recent work leverages foundation models such as the Segment Anything Model (SAM) to generate pseudo-labels, these approaches typically depend on heuristic prompt choices and offer limited ways to incorporate prior knowledge or heterogeneous labels. We address this gap by taking a neurosymbolic perspective: integrating differentiable fuzzy logic with deep segmentation models. Weak annotations and domain-specific priors are unified as continuous logical constraints that fine-tune SAM under weak supervision. The refined foundation model then produces improved pseudo-labels, from which we train a second-stage prompt-free segmentation model. Experiments on Pascal VOC 2012 and the REFUGE2 optic disc/cup segmentation dataset show that our logic-guided fine-tuning yields higher-quality pseudo-labels, leading to state-of-the-art segmentation accuracy that often exceeds densely supervised baselines.

2605.14738 2026-06-11 cs.LG cs.AI 版本更新

TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability

TAPIOCA: 为什么任务感知剪枝能提升模型对分布外数据的能力

Krish Sharma, Omar Naim, Soumadeep Saha, Vinija Jain, Aman Chadha, Nicholas Asher

AI总结 本文研究了任务感知剪枝在分布外数据上的改进机制,通过实验发现剪枝能提升OOD准确性,其核心贡献是通过几何解释说明任务感知剪枝如何调整模型表示以适应任务需求。

详情
AI中文摘要

近期的研究表明,任务感知层剪枝可以提高模型在特定任务上的性能,如TALE所示。本文探讨了这种改进何时发生以及为何会发生。我们首先证明,在受控的多项式回归任务和大型语言模型中,此类剪枝在分布内(ID)数据上没有好处,但能一致地提高分布外(OOD)准确性。我们进一步通过实验证明,OOD输入会诱导出层间范数和成对距离的分布,这些分布偏离ID分布的相应分布。这导致了任务感知剪枝的几何解释:每个任务诱导出一个任务适应的几何结构,通过ID输入上观察到的表示分布来经验性地表征。OOD输入可以引入任务适应几何的扭曲版本。任务感知剪枝识别出创建或放大这种扭曲的层;通过移除这些层,它将OOD表示的范数和成对距离转向在适应分布上观察到的值。这使OOD输入与模型的任务适应几何重新对齐,并提高性能。我们通过受控分布偏移和残差缩放干预提供了因果证据,并在不同模型规模上展示了一致的行为。

英文摘要

Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show empirically that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs. OOD inputs can introduce a distorted version of the task-adapted geometry. Task-aware pruning identifies layers that create or amplify this distortion; by removing them, it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution. This realigns OOD inputs with the model's task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and demonstrate consistent behavior across model scales.

2606.00140 2026-06-11 cs.LG cs.AI 版本更新

Geometric Erasure by Contrastive Velocity Matching in Rectified Flows

整流流中对比速度匹配的几何擦除

Jonas Henry Grebe, Tobias Braun, Anna Rohrbach, Marcus Rohrbach

AI总结 提出GEM框架,通过对比速度匹配实现整流流模型中的概念擦除,结合生成流网络与教师引导的流匹配,有效抑制有害内容生成。

详情
AI中文摘要

尽管多模态生成模型的快速采用提供了巨大潜力,但也增加了有害内容合成、深度伪造和版权侵权的风险。为应对这些挑战,概念擦除作为一种前瞻性防护手段应运而生。然而,随着该领域逐渐从基于U-Net的扩散模型转向整流流变换器,擦除研究难以跟上步伐。在这项工作中,我们引入了GEM,一个简单但高效的整流流模型擦除框架。作为我们贡献的一部分,我们在基于轨迹的遗忘(基于生成流网络)与经典教师引导擦除之间建立了原则性桥梁:我们将基于轨迹的信号转化为教师引导的流匹配设置,统一了两种范式的优势。具体而言,教师提供互补的吸引和排斥信号,我们将其组合成一个单一的几何引导目标,实现对不需要概念的目标抑制,同时保留良性生成。

英文摘要

While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synthesis, deepfakes, and copyright infringements. To address these challenges, concept erasure has emerged as a prospective safeguard. However, as the field gradually transitions from U-Net-based diffusion models to Rectified Flow Transformers, erasure research has struggled to keep pace. In this work, we introduce GEM, a simple but highly effective erasure framework for Rectified Flow models. As part of our contribution, we establish a principled bridge between trajectory-based unlearning grounded in Generative Flow Networks and classic teacher-guided erasure: we translate trajectory-based signals into a teacher-guided flow-matching setup that unifies the strengths of both paradigms. Concretely, a teacher provides complementary attraction and repulsion signals that we combine into a single geometric guidance objective, yielding targeted suppression of unwanted concepts while preserving benign generation.

2606.05551 2026-06-11 stat.ML cs.AI cs.LG 版本更新

Conformal Risk-Averse Decision Making with Action Conditional Guarantee

具有行动条件保证的共形风险规避决策

Zihan Zhu, Shayan Kiyani, George Pappas, Hamed Hassani

AI总结 提出行动条件共形预测方法,通过分位数损失最小化算法实现行动条件风险价值优化,在有限样本下提供行动条件安全保证。

详情
AI中文摘要

由机器学习模型驱动的可靠决策管道需要具有明确安全保证的不确定性量化(UQ)方法。共形预测通过将ML预测包装成预测集来提供这种UQ,而Kiyani等人(2025b)的最新工作表明,这些集合可以转化为最优的风险规避决策策略——但仅继承边际安全保证。我们通过以下方式推广并加强了他们的结果:(i)引入行动条件共形预测,该预测产生明确条件于决策者所采取的每个行动的安全保证;(ii)表明行动条件预测集可作为风险规避决策者旨在优化行动条件风险价值的可行决策空间的代理;(iii)提出一种基于分位数损失最小化的原则性有限样本算法,将Gibbs等人(2025)的框架与行动条件保证联系起来。在两个真实世界数据集上的实验证实,我们的方法在行动条件性能上显著优于共形基线。

英文摘要

Reliable decision-making pipelines powered by machine learning models require uncertainty quantification (UQ) methods that come with explicit safety guarantees. Conformal prediction provides such UQ by wrapping ML predictions into prediction sets, and recent work by Kiyani et al. (2025b) established that these sets can be translated into optimal risk-averse decision policies, although the resulting policies inherit only marginal safety guarantees. We generalize and strengthen their results by (i) introducing action-conditional conformal prediction, which yields safety guarantees conditioned explicitly on each action taken by the decision maker, (ii) showing that action-conditional prediction sets serve as a proxy for the feasible decision space for risk-averse decision makers aiming to optimize action-conditional value-at-risk, and (iii) proposing a principled finite-sample algorithm based on pinball-loss minimization, connecting the framework of Gibbs et al. (2025) to action-conditional guarantees. Experiments on two real-world datasets confirm that our approach significantly improves action-conditional performance over conformal baselines.
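The role of the pinball loss can be illustrated with a small calibration sketch: minimising the pinball loss over a constant threshold recovers an empirical quantile of the calibration scores, and fitting one threshold per action gives a simple group-wise calibration. This is an assumed simplification, not the paper's algorithm. It omits any finite-sample correction, and the function names, action labels, and scores are hypothetical.

```python
def pinball_loss(q, scores, tau):
    """Average pinball (quantile) loss of a constant threshold q at level tau."""
    total = 0.0
    for s in scores:
        diff = s - q
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(scores)

def fit_threshold(scores, tau):
    """Minimise the pinball loss over the data points (a minimiser lies on one)."""
    return min(sorted(scores), key=lambda q: pinball_loss(q, scores, tau))

def per_action_thresholds(records, tau):
    """Fit one threshold per action from (action, score) calibration pairs."""
    by_action = {}
    for action, score in records:
        by_action.setdefault(action, []).append(score)
    return {a: fit_threshold(s, tau) for a, s in by_action.items()}

# Hypothetical calibration data: nonconformity scores grouped by the action taken.
calibration = [
    ("hold", 0.1), ("hold", 0.4), ("hold", 0.2), ("hold", 0.9), ("hold", 0.3),
    ("act", 1.0), ("act", 2.0), ("act", 1.5), ("act", 3.0), ("act", 2.5),
]
thresholds = per_action_thresholds(calibration, tau=0.7)
```

A single pooled threshold would be dominated by whichever action has larger scores. Separate thresholds are the simplest way to see why conditioning on the action changes the resulting prediction sets.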

2606.07082 2026-06-11 cs.LG cs.AI 版本更新

On the Geometry of On-Policy Distillation

论在线策略蒸馏的几何结构

Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung

发表机构 * HKUST, UT Austin, Zhejiang University, Hong Kong PolyU, USTC, BUPT, Nankai University, BIT

AI总结 本文通过参数空间诊断,揭示在线策略蒸馏(OPD)的更新轨迹具有松弛离主成分、子空间锁定等独特几何特性,表明其并非介于SFT和RLVR之间的中间方法。

详情
Comments
17 pages, 8 figures
AI中文摘要

在线策略蒸馏(OPD)越来越多地被用于改进大型语言模型的推理能力,但其训练动态仍鲜为人知。我们刻画了OPD更新在参数空间中的轨迹,并将其与监督微调(SFT)和可验证奖励强化学习(RLVR)进行了比较。一套参数空间诊断一致地将OPD置于松弛的离主成分区域:与SFT相比,其更新影响更少的权重,并更强烈地避开主方向;而与RLVR相比,其约束更宽松。除了这种静态定位外,OPD还表现出子空间锁定:其累积更新迅速进入一个狭窄的低维通道。将训练限制在早期形成的更新子空间内能保持OPD的性能,但会严重降低SFT,表明该锁定子空间对OPD在功能上是充分的。控制实验进一步表明,稀疏化更新令牌和将rollout生成移至离策略能保持秩动态,而将OPD目标与RLVR混合则会改变它们。总体而言,这些结果表明OPD不仅仅是SFT和RLVR之间的中间点,而是在参数空间中诱导出自身独特的更新几何结构。

英文摘要

On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.

2606.10046 2026-06-11 cs.SD cs.AI 版本更新

Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models

潜流内部:音频分离基础模型中注意力动力学的因果解读

Yuxuan Chen, Haoyuan Yu, Peize He

AI总结 本文通过因果干预协议揭示流匹配Transformer在音频分离中的双路径注意力机制,并提出无训练加速方法LSAC,在保持质量的同时减少约25%自注意力计算。

详情
AI中文摘要

流匹配Transformer实现了强大的音频分离,但其注意力动力学是不透明的。我们将已建立的因果干预原则适应为SAM Audio的确定性推理时探测协议。正交探测揭示了一种双路径文本条件机制:加法注入控制语义身份,而交叉注意力细化声学结构。我们观察到异步逐层收敛:稳定层早期构建时间支架,而快速层在采样过程中继续解决伪影。该模型还减弱时间分割线索以维持连续流稳定性。利用这些见解,我们提出了层选择性注意力缓存(LSAC),一种无训练加速方法,在稳定层中缓存注意力。在各种声学复杂度下,LSAC将自注意力计算减少约25%,质量损失可忽略,并且与朴素步长减少相比,质量保持率高达6.7倍。

英文摘要

Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe an asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability. Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about 25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.

2606.10820 2026-06-11 cs.LG cs.AI cs.CL 版本更新

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

K-Forcing:通过前推语言建模进行联合下一K词解码

Zhiwei Tang, Yuanyu He, Yizheng Han, Wangbo Zhao, Jiasheng Tang, Fan Wang, Bohan Zhuang

发表机构 * DAMO Academy, Alibaba Group(阿里巴巴达摩院) Hupan Lab(湖畔实验室) Zhejiang University(浙江大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出K-Forcing范式,通过前推映射将自回归模型蒸馏为单次前向传播生成多个未来词,实现2.4-3.5倍加速,质量损失小。

详情
Comments
Code: this https URL
AI中文摘要

自回归语言建模是文本生成的主导范式,但其逐词顺序解码使得推理受限于内存且效率低下。现有的加速方法(如推测解码和扩散语言模型)在特定条件下可提升速度,但并未直接解决高负载批量服务——这一对工业级部署最为关键的场景。我们提出K-Forcing,一种用于联合下一k词解码的前推语言建模范式。K-Forcing将现有自回归模型蒸馏为条件前推映射——该映射在单次前向传播中将独立均匀噪声变量转换为多个未来词的联合样本。该设计保留了固定长度输出,复用了自回归教师模型的主干,并与标准自回归服务基础设施兼容。我们通过渐进式自强迫蒸馏训练该映射,逐步扩展预测窗口,同时使学生模型紧密匹配自回归教师模型的序列分布。我们在LM1B和OpenWebText上使用标准因果Transformer主干评估K-Forcing。当激进配置为每次前向传播生成k=4个词时,K-Forcing在不同批量大小下实现约2.4-3.5倍加速,同时相对于自回归教师模型仅带来轻微的质量下降。随着推理在现代LLM的生命周期计算成本中占据主导地位,K-Forcing为在现实高负载部署下加速自回归生成提供了一条有前景的途径。

英文摘要

Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving, the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping, one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.
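The idea of a push-forward mapping from uniform noise to a joint sample of tokens can be sketched with inverse-CDF sampling. In the toy below, k independent uniforms are mapped deterministically to k tokens through a hypothetical bigram teacher. The teacher table and function names are assumptions, and the loop is still sequential. The distilled student described in the abstract would instead produce all k tokens in one forward pass.

```python
import random
from collections import Counter

# Hypothetical toy teacher: the next-token distribution depends only on the previous token.
TEACHER = {
    0: [0.7, 0.2, 0.1],
    1: [0.1, 0.6, 0.3],
    2: [0.3, 0.3, 0.4],
}

def inverse_cdf(probs, u):
    """Map a uniform u in [0, 1) to a token index by inverting the CDF."""
    cumulative = 0.0
    for token, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return token
    return len(probs) - 1  # guard against rounding when u is close to 1

def push_forward(prev_token, noise):
    """Deterministic map from k uniform noise values to k tokens."""
    tokens = []
    for u in noise:
        prev_token = inverse_cdf(TEACHER[prev_token], u)
        tokens.append(prev_token)
    return tokens

rng = random.Random(0)
samples = [tuple(push_forward(0, [rng.random() for _ in range(2)]))
           for _ in range(20000)]
# Empirical frequency of the pair (0, 1); the teacher assigns it 0.7 * 0.2 = 0.14.
estimate = Counter(samples)[(0, 1)] / len(samples)
```

Because the map is deterministic given the noise, the same noise vector always yields the same token block.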

2606.10968 2026-06-11 cs.LG cs.AI 版本更新

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

超越大语言模型强化学习中的统一令牌级信任区域

Renjie Mao, Xiangxin Zhou, Lvfang Tao, Yixin Ding, Yu Shi, Yongguang Lin, Yuheng Wu, Honglin Zhu, Qian Qiu, Wenxi Zhu

发表机构 * Tencent Hunyuan(腾讯混元)

AI总结 针对PPO风格信任区域在自回归生成中的位置无关问题,提出CPPO方法,通过位置加权阈值和累积前缀预算动态调整令牌级约束,提升训练稳定性和推理准确性。

详情
Comments
Project Page: this https URL
AI中文摘要

具有可验证奖励的强化学习(RLVR)已成为提升大语言模型推理能力的标准方法。然而,现有的PPO风格信任区域机制通过在所有令牌上独立施加统一阈值,仍然是位置无关的。这种逐点处理方式在两个方面与自回归生成相冲突。首先,统一阈值忽略了自回归不对称性。早期阶段的偏差会产生累积的序列级漂移,导致静态阈值对早期发散约束不足,而对后期探索过度约束。其次,孤立地评估令牌级发散忽略了累积前缀漂移,无论条件历史已经偏离滚动策略多远,都给予相同的发散允许量。为解决这一局限性,我们提出了CPPO(累积前缀散度策略优化),这是一种令牌级掩码规则,通过两种耦合机制将更新与有限时域策略改进界对齐。首先,位置加权阈值对早期位置施加更严格的限制,因为这些位置的影响持续时间更长,同时放宽对后期令牌的约束。其次,累积前缀预算跟踪历史偏差,动态限制进一步的令牌级偏差,以防止沿前缀的复合错误。实验表明,CPPO在不同模型规模上增强了训练稳定性并显著提高了推理准确性。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens independently. This pointwise treatment conflicts with autoregressive generation in two critical ways. First, uniform thresholds ignore autoregressive asymmetry. Early-stage deviations produce compounding sequence-level drift, causing static thresholds to under-regulate early divergence and excessively constrain late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift, granting the same divergence allowance regardless of how far the conditioning history has already deviated from the rollout policy. To address this limitation, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound via two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions whose effects persist longer, relaxing constraints for late-stage tokens. Second, a cumulative prefix budget tracks historical deviations, dynamically restricting further token-level deviation to prevent compounding errors along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.
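The two coupled mechanisms can be sketched as a masking function. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the linear threshold schedule, the use of absolute log-ratios as the deviation measure, the budget value, and the choice to count masked tokens toward the prefix drift. It is a sketch of the general shape of such a rule, not CPPO itself.

```python
def cppo_style_mask(log_ratios, eps_early=0.1, eps_late=0.3, prefix_budget=0.5):
    """Return a 0/1 keep-mask per token; all thresholds are illustrative assumptions."""
    n = len(log_ratios)
    mask = []
    prefix_drift = 0.0
    for t, log_ratio in enumerate(log_ratios):
        frac = t / (n - 1) if n > 1 else 0.0
        # Position-weighted threshold: stricter at early positions, looser late.
        threshold = eps_early + (eps_late - eps_early) * frac
        deviation = abs(log_ratio)
        # Cumulative prefix budget: allowance left after earlier deviations.
        remaining = max(prefix_budget - prefix_drift, 0.0)
        keep = deviation <= min(threshold, remaining)
        mask.append(1 if keep else 0)
        prefix_drift += deviation
    return mask

# Hypothetical per-token log(pi_new / pi_rollout) values for one sequence.
mask = cppo_style_mask([0.05, 0.2, 0.05, 0.22, 0.1])
```

In this example the second token is masked by the early-position threshold. The fourth is within its positional threshold but exceeds the remaining prefix budget, and the last token is masked because the budget is exhausted.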

6. 自然语言与多模态智能 39 篇

2606.11537 2026-06-11 cs.AI cs.CE 新提交

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

MoCA-Agent: 一种用于金融和数值推理的声明市场代码智能体

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

发表机构 * University of Innsbruck(因斯布鲁克大学) University of British Columbia(不列颠哥伦比亚大学) Toronto Metropolitan University(多伦多都会大学)

AI总结 提出MoCA-Agent,通过声明级验证和代码生成解决金融表格问答中的数值推理错误,在十个基准上取得强性能。

详情
AI中文摘要

金融和表格问答不仅需要流畅的推理:答案必须基于支持它们的确切事实、公式、单位、符号和尺度。单个误读的单元格或错误操作可能会悄无声息地产生看似合理但错误的结果。我们引入了 MoCA-Agent,一种声明市场代码智能体,它用声明级验证取代了自由形式的多智能体辩论。该系统将每个问题分解为类型化的原子声明,要求专业交易智能体买入或卖出这些声明,将其订单清算为置信度加权的接受/拒绝决策,并从市场支持的证据中合成可执行的Python程序。然后,一个代码感知验证器检查程序的执行、结构一致性和常见的金融推理错误,最多进行一次市场感知修复轮次。在涵盖金融数值推理、通用表格推理、ESG问答和多模态图表推理的十个公开基准上,MoCA-Agent 使用固定的 Qwen3.6-27B 骨干网络实现了强劲性能,包括在 FinQA 上达到 78.3%,在 FinanceMath 上达到 76.0%,在 MultiHiertt 上达到 71.2%,在 ESGenius 上达到 86.9%,以及在 FinChart-Bench 上平均达到 85.6%。这些结果表明,在原子声明级别聚合证据,而不是整个答案,提高了高风险数值推理的鲁棒性。(代码和数据可在以下网址获取:this https URL。)

英文摘要

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. (The code and data are available at this https URL.)
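Clearing trader orders into confidence-weighted accept/reject decisions might look like the sketch below. The abstract does not specify the clearing rule, so the signed-sum aggregation, the threshold, the claim identifiers, and the function name are all hypothetical.

```python
def clear_market(orders, accept_threshold=0.0):
    """Aggregate (claim_id, side, confidence) orders into accept/reject decisions.

    Assumed rule: buys add confidence, sells subtract it, and a claim is
    accepted when its net score exceeds the threshold.
    """
    net = {}
    for claim_id, side, confidence in orders:
        signed = confidence if side == "buy" else -confidence
        net[claim_id] = net.get(claim_id, 0.0) + signed
    return {claim: ("accept" if score > accept_threshold else "reject")
            for claim, score in net.items()}

# Hypothetical orders from three trader agents on two atomic claims.
orders = [
    ("revenue_2023=4.2M", "buy", 0.9),
    ("revenue_2023=4.2M", "buy", 0.6),
    ("revenue_2023=4.2M", "sell", 0.3),
    ("unit=thousands", "buy", 0.4),
    ("unit=thousands", "sell", 0.8),
]
decisions = clear_market(orders)
```

Only accepted claims would then feed program synthesis, so a disputed unit or scale is filtered out before any arithmetic is generated.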

2606.11770 2026-06-11 cs.AI 新提交

SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning

SVoT: 基于强化学习的空间推理状态感知思维可视化

Chao Lei, Yanbei Jiang, Markus Hiller, Zhijian Zhou, Xunye Tian, Krista A. Ehinger, Nir Lipovetzky

发表机构 * School of Computing and Information Systems, The University of Melbourne(墨尔本大学计算与信息系统学院)

AI总结 提出SVoT框架,通过强化学习生成可验证的中间状态和可视化,结合文本与视觉推理链,提升多模态大模型在多跳空间推理中的可靠性。

详情
AI中文摘要

空间推理对多模态大语言模型(MLLMs)仍是一个挑战,因为它需要在中间状态和状态转换上进行可靠的多跳推理。当前研究通常不验证中间状态,并将状态转换视为隐式过程,这限制了多跳空间推理的可靠性。为解决这一问题,我们提出状态感知思维可视化(SVoT),一种强化学习框架,生成交错、可验证的中间状态和可视化。SVoT将转换推理链整合到生成过程中,使模型能够通过交错的文本和视觉推理验证动作前提和效果。我们通过组相对策略优化(GRPO)训练SVoT,通过奖励设计实例化验证,并评估不同细粒度奖励的效果。由于现有基准将状态转换简化为单变量更新,大大简化了问题,我们通过扩展经典环境并引入两个需要多对象交互和数值推理的新领域Pacman和Gather,建立了五个领域。这些领域支持对多跳空间推理的系统评估,并对生成的中间状态和转换推理进行定量验证。具有转换感知监督的SVoT在引入的领域中达到了最先进的性能,在分布外测试集上实现了高达65%的绝对准确率提升。

英文摘要

Spatial reasoning remains a challenge for Multimodal Large Language Models (MLLMs), as it requires reliable multi-hop inference over both intermediate states and state transitions. Current studies often leave intermediate states unverified and treat state transitions as implicit processes, which limits reliability in multi-hop spatial reasoning. To address this, we propose State-aware Visualization-of-Thought (SVoT), a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations. SVoT integrates transition reasoning chains into the generation processes, enabling the model to verify action preconditions and effects through interleaved textual and visual reasoning. We train SVoT via Group Relative Policy Optimization (GRPO), instantiating verification through reward design and evaluating the efficacy of different fine-grained rewards. As existing benchmarks reduce state transitions to single-variable updates, substantially simplifying the problems, we establish five domains by extending classical environments and introducing two novel domains, Pacman and Gather, that require multi-object interactions and numerical reasoning. These domains support systematic evaluation of multi-hop spatial reasoning with quantitative verification of generated intermediate states and transition reasoning. SVoT with transition-aware supervision achieves state-of-the-art performance across the introduced domains, yielding up to a 65% absolute accuracy gain on out-of-distribution test sets.
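GRPO scores each rollout relative to the other rollouts sampled for the same prompt. The sketch below shows that group-relative normalisation together with a hypothetical reward that mixes final-answer correctness with the fraction of verified intermediate states. The reward shape and its weight are assumptions, not the paper's reward design, and implementations differ on details such as the standard-deviation estimator.

```python
import statistics

def rollout_reward(answer_correct, states_verified, w_state=0.5):
    """Hypothetical reward: final correctness plus weighted share of verified states."""
    state_score = sum(states_verified) / len(states_verified)
    return float(answer_correct) + w_state * state_score

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four hypothetical rollouts: (final answer correct?, per-step state checks).
rollouts = [
    (True, [True, True]),
    (False, [True, False]),
    (False, [False, False]),
    (True, [True, False]),
]
rewards = [rollout_reward(ok, states) for ok, states in rollouts]
advantages = group_relative_advantages(rewards)
```

With a state-level term in the reward, two rollouts that reach the same final answer can still receive different advantages.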

2606.12350 2026-06-11 cs.AI 新提交

Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Nonslop: 人机协作写作中的游戏化实验

Maria Edwards, Julian Togelius

AI总结 通过游戏化写作实验,研究用户在AI建议下何时保持创意自主性,揭示效率与真实性之间的张力。

详情
Comments
Accepted at the 2026 IEEE Conference on Games (CoG 2026); to be published in the conference proceedings. Camera-ready version
AI中文摘要

大型语言模型(LLM)的快速普及引发了关于人类创造力和个体表达在AI辅助创作时代的关键问题。人类何时采纳AI建议?这对个体声音有何影响?本研究通过一项游戏化写作练习来探讨这些问题,74名参与者(214份回复)在写作时,AI生成的单词建议可供使用。该游戏模拟了一个反乌托邦的未来,其中AI试图从残存的人类个性中学习,并抑制类似AI的写作。通过这种方式,它试图创造能够揭示真实用户偏好而非默认行为(例如接受现成的AI生成建议)的条件。请注意,这是对“有帮助的助手”设计模式的刻意反转;系统明确禁止你接受AI建议。我们分析了不同任务类型、用户行为和回复特征下的用户行为模式,以理解创造性任务中人机交互的影响因素。研究重点关注用户何时选择保持创意自主性,而非违反游戏规则接受AI帮助。此外,还探讨了这些选择如何与回复模式、任务特征和用户行为相关联。这种游戏化方法既为研究真实的人机交互提供了一个框架,也为理解AI增强创造力中效率与真实性之间的张力提供了一个发人深省的视角。

英文摘要

The rapid proliferation of large language models (LLMs) raises critical questions about human creativity and individual expression in an era of AI-assisted creation. When do humans adopt AI suggestions, and what are the implications for individual voice? This study examines these questions through a gamified writing exercise where 74 participants (214 responses) replied to prompts while AI-generated word suggestions were available as they wrote. The game simulates a dystopian future in which an AI is attempting to learn from what remains of human individuality, and disincentivizes AI-like writing. In doing so, it attempts to create conditions that reveal authentic user preferences rather than default behaviors, such as accepting a readily available AI-generated suggestion. Note that this is a deliberate inversion of the "helpful assistant" design pattern; the system is explicitly forbidding you from accepting AI suggestions. We analyze user behavior patterns across different task types, user behaviors, and response characteristics to understand the factors influencing human-AI interaction in creative tasks. The study focuses on when users choose to maintain creative autonomy versus violating the rules of the game and accepting AI assistance. It also explores how these choices relate to response patterns, task characteristics, and user behavior. This gamified approach offers both a framework for studying authentic human-AI interaction and a provocative lens for understanding the tension between efficiency and authenticity in AI-augmented creativity.

2606.11371 2026-06-11 cs.CL cs.AI eess.AS eess.SP 交叉投稿

The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

人类与AI生成语言的动态:语义如何在不同时间尺度上波动

Han-Jen Chang, Yasir Çatal, Angelika Wolman, Agustín Ibáñez, David Smith, I-Wen Su, Kai-Yuan Cheng, Georg Northoff

AI总结 提出语义时间尺度分析流程,通过自相关窗口度量(ACW-0)量化人类与AI生成语音中语义特异性与上下文相似性的时间组织,发现ACW-0长度与词汇通用性相关,且该关联在随机化后被削弱。

详情
Comments
45 pages, 4 figures, 4 tables. Accepted manuscript; published in Computer Speech & Language
AI中文摘要

口语,无论是人类还是大型语言模型(LLM)产生的,都会随时间展开,具有变化的语义内容。然而,我们仍然缺乏简单、可解释的时间序列特征来捕捉通用与特定内容如何随时间分布,并可用于比较人类和AI生成的语音。我们引入了一个语义时间尺度分析流程,将带有时间戳的词级转录转换为语义时间序列。对于每个口语叙述,我们计算(i)基于WordNet词深度的语义特异性,以及(ii)基于SBERT嵌入的上下文相似性,并使用自相关窗口度量(ACW-0及相关指标)量化其时间依赖性。然后,我们将原始语音与多种随机化对照进行比较,这些对照选择性地破坏词汇身份、时间顺序和词时长。在人类朗读的自传叙述、TTS朗读和LLM生成的文本(通过TTS渲染)中,我们发现语义时间序列中ACW-0较长的片段往往包含更多通用词汇,而ACW-0较短的片段则富含更具体的词汇。当词序和计时被随机化时,这些关联被强烈削弱或消除,表明基于ACW的度量捕捉了语义内容超越静态词汇分布的非平凡时间组织。我们的结果表明,基于ACW的语义时间尺度是分析和比较人类与AI生成语音时间结构的有用特征系列。

英文摘要

Spoken language, whether produced by humans or large language models (LLMs), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings, and quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.
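ACW-0 is commonly defined as the first lag at which a series' autocorrelation function reaches zero. The sketch below computes it in lag units on two synthetic series. The estimator, the function names, and the test signals are illustrative assumptions, and the paper's pipeline may differ, for example by reporting lags in seconds or by windowing the series.

```python
import math

def autocorrelation(series, lag):
    """Biased sample autocorrelation at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

def acw0(series):
    """First lag at which the autocorrelation is <= 0, or None if it never is."""
    for lag in range(1, len(series)):
        if autocorrelation(series, lag) <= 0.0:
            return lag
    return None

# Synthetic stand-ins for a semantic time-series (e.g. per-word specificity).
slow = [math.sin(2 * math.pi * i / 20) for i in range(40)]   # slowly varying
fast = [1.0 if i % 2 == 0 else -1.0 for i in range(20)]      # flips every step

slow_acw, fast_acw = acw0(slow), acw0(fast)
```

A slowly varying series keeps a positive autocorrelation over several lags and so has a longer ACW-0 than a series that flips at every step.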

2606.11386 2026-06-11 cs.CL cs.AI eess.AS 交叉投稿

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

通过激活引导克服全双工口语语言模型中的状态惯性

Cheng-Kuang Chang, Kai-Wei Chang, Alexander H. Liu, James Glass

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室)

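AI summary: To address delayed responses when a user interrupts a full-duplex spoken language model, the paper proposes activation steering with a perception vector, which substantially improves interruption comprehension without fine-tuning.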

详情
AI中文摘要

全双工口语语言模型(FD-SLMs)通过允许模型同时听和说实现无缝语音交互,但其协调听与说的内部机制尚未充分探索。我们分析了FD-SLM隐藏表示中编码的预测行为,发现它们表现出特定流的预测模式:在听时,它们优先预测传入的用户流;而在说时,它们优先预测模型输出流。基于这一观察,我们表明FD-SLMs动态调节其内部预测焦点在两个状态之间:与模型输出生成一致的生成状态和与传入用户输入一致的感知状态。然而,这种调节可能滞后于对话上下文的突然变化。在用户打断期间,模型在过渡到感知状态之前短暂地偏向生成状态,导致其错过传入输入的开头。我们将这种延迟的内部过渡称为状态惯性。为了量化其下游影响,我们引入了零缓冲基准(ZBB),这是一个用于评估当用户语音突然开始时即时中断理解能力的诊断基准。我们使用响应正确性和初始词出现率(IWOR)来评估这一设置。最后,我们通过使用感知向量的激活引导来缓解状态惯性,这是一种无需训练且计算开销很小的干预措施。在多个最先进的FD-SLMs上,激活引导显著改善了中断处理;例如,在PersonaPlex上,它将正确性从28%提高到45%,将IWOR从40%提高到72%,而无需任何微调。

英文摘要

Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.
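
A minimal sketch of activation steering in general may help make the intervention concrete. The abstract does not say how the perception vector is built. Here it is assumed to be the difference between mean hidden states collected in the perceptive and generative states, which is a common construction for steering vectors and not necessarily the authors' method. The function names and the scale `alpha` are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def perception_vector(perceptive_states, generative_states):
    """Assumed construction: mean(perceptive) minus mean(generative)."""
    mu_p = mean_vector(perceptive_states)
    mu_g = mean_vector(generative_states)
    return [p - g for p, g in zip(mu_p, mu_g)]


def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional hidden states.
v = perception_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = steer([0.0, 0.0], v, alpha=0.5)
```

Because the shift is a single vector addition per steered layer, the extra cost is small, which is consistent with the abstract's description of a training-free intervention with little overhead.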

2606.11400 2026-06-11 cs.SD cs.AI eess.AS 交叉投稿

Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models

引导听哪里:基于指令的激活操控重定向大型音频语言模型中的时间注意力

Tsung-En Lin, Hung-Yi Lee

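AI summary: Proposes instruction-based vector steering, which contrasts activations under different instructions to redirect temporal attention over audio tokens, enabling training-free sound event localization that far outperforms direct prompting and random baselines.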

详情
AI中文摘要

大型音频语言模型(LALMs)在音频理解方面表现出色,但很少揭示它们关注音频信号的哪个部分。我们引入了基于指令的向量操控,该方法通过对比不同指令提示下的激活来构建操控向量,同时保持音频不变。通过对LALM注意力的系统探测,我们发现——与标准提示或基于音频的操控不同——这种干预显著重新分配了分配给音频令牌的时间注意力,将其集中在声学相关的区域。然后我们展示了这种注意力转移在行为上是有意义的:在受控的三事件设置中,读取由操控引起的最大注意力变化的时间位置,可以恢复查询声音事件的位置,而无需任何训练,在Qwen2-Audio和Audio Flamingo 3上分别达到60.87%和68.72%与真实区间的重叠,远高于直接提示(31.84%,46.75%)和随机基线(27.74%)。我们的结果表征了LALMs中基于指令的操控的机制特性,并为这些模型编码的潜在时间结构提供了一种无需训练的探测方法。

英文摘要

Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that - unlike standard prompting or audio-based steering - this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.
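
The readout step can be sketched as follows, under several assumptions: attention is summarized as one value per audio frame, "maximal change" is taken as the largest increase, the prediction is a fixed-width window around that frame, and overlap is the fraction of the ground-truth interval covered. None of these details are given in the abstract, and the function names are hypothetical.

```python
def localize(base_attn, steered_attn, width):
    """Window of `width` frames around the largest steering-induced attention increase."""
    delta = [s - b for b, s in zip(base_attn, steered_attn)]
    peak = max(range(len(delta)), key=delta.__getitem__)
    start = max(0, peak - width // 2)
    end = min(len(delta), start + width)
    return start, end


def overlap_fraction(pred, truth):
    """Fraction of the ground-truth interval covered by the prediction (half-open intervals)."""
    inter = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    return inter / (truth[1] - truth[0])


base = [0.1] * 10
steered = list(base)
steered[6] = 0.5  # steering concentrates attention on frame 6
pred = localize(base, steered, width=4)
score = overlap_fraction(pred, (5, 9))
```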

2606.11456 2026-06-11 cs.CL cs.AI cs.CY 交叉投稿

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

社会科学中的AI编码智能体:方法多样,经验一致,解释脆弱

Meysam Alizadeh, Fabrizio Gilardi, Mohsen Mosleh, Enkelejda Kasneci

发表机构 * University of Oxford(牛津大学) University of Zurich(苏黎世大学) Technical University of Munich(慕尼黑工业大学)

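AI summary: Studies the methodological diversity and interpretive vulnerability of LLM agents in scientific analysis. Across 20 independent runs, the agents match or exceed human diversity at the design layer but are susceptible to prompting at the verdict layer; the bias arises in interpretation rather than estimation.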

详情
AI中文摘要

基于LLM的智能体在科学分析中的部署引发了相互矛盾的担忧:智能体可能减少方法多样性,或者可能放大分析灵活性,使研究者得出动机性结论。我们认为这些担忧针对两个经验上可分离的层面:方法选择的设计层,以及决策规则将估计映射到实质性主张的裁决层。我们通过在著名的移民与社会政策问题上运行20次Claude Code和Codex的独立执行,并以多位分析师的人类基线为基准,对两者进行了测试。在设计层,Codex匹配了人类的方法多样性,而Claude Code产生了近三倍的规格;两个智能体的效应估计与人类共识大致一致,且没有智能体模型与任何人类模型完全匹配。提示诱导的反移民研究者先验重组了每个智能体的方法决策,但与同一数据中有偏见的人类分析师不同,它并未改变总体估计或最终裁决;智能体也没有沿着人类用来偏倚其估计的方法轴重新路由。在裁决层,一个明确的确认性提示将Claude Code的裁决从10%的支持率翻转为90%,同时其系数分布基本保持不变,这是通过规则省略而非规则软化实现的。AI智能体在设计层可以媲美或超越人类的方法多样性,但在裁决层仍然脆弱。在我们的设置中,AI偏差的所在不是估计而是解释。

英文摘要

The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy question against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.
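
The distinction between estimation and interpretation can be illustrated with a toy decision rule. In the sketch below the estimates stay fixed and only the rule changes: dropping the significance requirement (one form of rule omission) flips the verdict. The rule, the 0.05 threshold, and the numbers are hypothetical and are not taken from the paper.

```python
def verdict(estimates, require_significance=True, majority=0.5):
    """Return True if the hypothesis counts as supported.

    `estimates` is a list of (coefficient, p_value) pairs. A specification
    supports the hypothesis if its coefficient is negative and, when
    `require_significance` is set, its p-value is below 0.05.
    """
    def supports(estimate):
        coef, p_value = estimate
        return coef < 0 and (p_value < 0.05 or not require_significance)

    share = sum(supports(e) for e in estimates) / len(estimates)
    return share > majority


# Hypothetical coefficient distribution, identical under both rules.
estimates = [(-0.02, 0.40), (-0.01, 0.62), (-0.03, 0.04), (0.01, 0.55), (-0.02, 0.30)]
strict = verdict(estimates)                               # full decision rule
lenient = verdict(estimates, require_significance=False)  # one rule omitted
```

With the full rule only one of five specifications counts as support, while the lenient rule counts four of five, so the verdict changes although no estimate does.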

2606.11459 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

APEX: 具有动态数据选择的自动提示工程专家

Fei Wang, Si Si, Cho-Jui Hsieh, Inderjit S. Dhillon

发表机构 * Google(谷歌) UCLA(加州大学洛杉矶分校)

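AI summary: Proposes the APEX framework, which dynamically stratifies data into Easy, Hard, and Mixed tiers and prioritizes high-leverage subsets to make prompt optimization more efficient under a fixed budget; across three benchmarks it improves on the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B.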

详情
AI中文摘要

大型语言模型对提示表述高度敏感,需要自动提示优化以释放其全部潜力。尽管进化算法已成为主导范式,但它们面临一个关键瓶颈:数据效率。当前方法将开发数据集视为静态基准,在无信息数据上浪费大量计算预算。在这项工作中,我们引入了APEX(自动提示工程专家),这是一个新颖的框架,它在提示搜索的同时优化数据使用。APEX根据优化谱系将数据集动态分层为易、难和混合三个层级。通过优先考虑混合层级(即识别出LLM性能混合的数据),我们确定了两个高杠杆子集:用于生成信息性变异的可寻址前沿和用于区分候选质量的排名敏感前沿。我们在三个不同的基准上评估APEX:IFBench、SimpleQA Verified和FACTS Grounding。在固定5000次评估调用的预算下,由于其数据效率,APEX在Gemini 2.5 Flash上平均比初始提示高出11.2%,在Gemma 3 27B上高出6.8%,这表明以数据为中心的方法是高效且有效的提示优化的关键。

英文摘要

Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automatic Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.
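
One plausible reading of the tiering step is sketched below: an example is Easy if every prompt in the lineage solved it, Hard if none did, and Mixed otherwise. The abstract does not give the exact criteria, so this rule and the `stratify` helper are assumptions for illustration.

```python
def stratify(history):
    """Split examples into tiers from per-example pass/fail outcomes.

    `history` maps an example id to a list of booleans, one per prompt
    evaluated so far in the optimization lineage.
    """
    tiers = {"easy": [], "hard": [], "mixed": []}
    for example_id, outcomes in history.items():
        if all(outcomes):
            tiers["easy"].append(example_id)   # solved by every prompt
        elif not any(outcomes):
            tiers["hard"].append(example_id)   # solved by no prompt
        else:
            tiers["mixed"].append(example_id)  # outcome depends on the prompt
    return tiers


history = {
    "q1": [True, True, True],
    "q2": [False, False, False],
    "q3": [True, False, True],
}
tiers = stratify(history)
```

Only the Mixed tier changes with the prompt, which is why spending the evaluation budget there is more informative than re-scoring examples whose outcome is already settled.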

2606.11502 2026-06-11 cs.CL cs.AI 交叉投稿

When Roleplaying, Do Models Believe What They Say?

角色扮演时,模型是否相信它们所说的话?

Benjamin Sturgeon, David Africa, Sid Black

发表机构 * MATS

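AI summary: Uses linear truth probes to study how role-play affects the internal representations of LLMs, finding that role-play mainly changes outputs rather than internal truth representations, whereas Emergent Misalignment shifts internal representations more substantially.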

详情
AI中文摘要

语言模型可以陈述“地球绕太阳运行”,并在扮演亚里士多德时断言相反的说法。最近的研究认为,角色采用是语言模型运作的基础,模型会不断为给定上下文选择最合适的角色。这种角色扮演是否仅仅改变了模型的输出,还是也影响了模型内部表征为真实的内容?我们通过线性真实探针研究这个问题,将其应用于扮演历史人物(其可能的信念与现代共识不同)的LLM。对于每个角色,我们比较该角色可能赞同的虚假陈述(*时代相信*)与主题匹配但该角色不会赞同的虚假陈述(*时代虚假*)。通过提示、上下文学习和监督微调,角色诱导对时代相信陈述的抑制程度低于同等虚假的替代陈述,但它们总体上仍被分类为虚假。因此,角色扮演改变模型所说的内容多于其内部表征为真实的内容。我们将此与经过有害建议训练并表现出紧急错位(EM)的模型进行对比。在三个模型家族(Qwen 2.5 14B、Qwen 3 8B和Llama 3.3 70B)中,它们的虚假陈述显著向探针空间的真实区域移动,在挑战下大约一半时间被辩护(而角色扮演约为六分之一),并用于下游推理。因此,角色扮演和紧急错位是信念内化谱系上的点,其中角色扮演改变模型所说的内容而表征变化很小,而紧急错位则改变虚假陈述的内部表征,但并未完全将其标记为真实。

英文摘要

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.
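
To make the probing setup concrete, here is a small stand-in for a linear truth probe: a difference-of-means direction with the decision boundary placed midway between the two class means. The abstract does not say how the probes were trained, so this construction, the function names, and the toy activations are assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_probe(true_acts, false_acts):
    """Difference-of-means probe: returns (weights, bias)."""
    mu_t = mean_vector(true_acts)
    mu_f = mean_vector(false_acts)
    w = [t - f for t, f in zip(mu_t, mu_f)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_t, mu_f)]
    return w, -dot(w, midpoint)


def truth_score(probe, activation):
    """Positive scores fall on the 'true' side of the probe."""
    w, b = probe
    return dot(w, activation) + b


probe = fit_probe([[2.0, 0.0], [4.0, 0.0]], [[-2.0, 0.0], [-4.0, 0.0]])
baseline = truth_score(probe, [-3.0, 1.0])  # a false claim, no persona
shifted = truth_score(probe, [-1.0, 1.0])   # same claim under a persona (toy values)
```

In this toy case the persona moves the claim toward the true region, but its score stays negative, which mirrors the pattern the abstract reports for role-play.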

2606.11542 2026-06-11 cs.CL cs.AI 交叉投稿

Pretrained self-supervised speech models can recognize unseen consonants

预训练自监督语音模型能够识别未见过的辅音

Chihiro Taguchi, Éric Le Ferrand, Hirosi Nakagawa, Hitomi Ono, Kanji Kato, Emily Prud'hommeaux, David Chiang

发表机构 * University of Notre Dame(圣母大学) University at Buffalo(纽约州立大学布法罗分校) Tokyo University of Foreign Studies(东京外国语大学) Reitaku University(丽泽大学) Boston College(波士顿学院)

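AI summary: Studies how well pretrained self-supervised speech models (Wav2Vec2, HuBERT) recognize the rare click consonants of Khoisan languages; the fine-tuned models recognize clicks more accurately than non-clicks, suggesting that self-supervised learning generalizes to rare phonemes.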

详情
Comments
6 pages, 3 figures, 3 tables, accepted at Interspeech 2026
AI中文摘要

现代预训练自监督自动语音识别模型在大规模音频数据上训练,将语音编码为上下文表示。然而,它们的训练数据严重偏向高资源语言,低资源语言数据很少,这引发了对类型学上不常见的语音声音(如主要出现在Khoisan语言中的吸气辅音)可能代表性不足的担忧。这引出了我们的核心研究问题:这些模型能否像识别其他语音声音一样准确地识别吸气辅音?为了解决这个问题,我们在两种富含吸气辅音的Khoisan语言(G|ui和West !Xoon)的数据上微调并比较了预训练自监督语音模型(Wav2Vec2和HuBERT)。我们的结果显示,微调后的模型一致地更准确地识别吸气辅音而非非吸气辅音,表明自监督学习能够泛化到包括稀有音素在内的人类语音声音。

英文摘要

Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextualized representations. However, their training data are heavily skewed toward high-resource languages with little data from low-resource languages, raising concerns about the potential underrepresentation of typologically uncommon speech sounds such as click consonants primarily found in Khoisan languages. This leads to our central research question: Can these models recognize click consonants as accurately as other speech sounds? To address this question, we fine-tune and compare pretrained self-supervised speech models (Wav2Vec2 and HuBERT) on data from two click-rich Khoisan languages (G|ui and West !Xoon). Our results reveal that the fine-tuned models consistently recognize clicks more accurately than non-clicks, suggesting that self-supervision enables generalization across human speech sounds including rare phonemes.

2606.11576 2026-06-11 cs.CV cs.AI 交叉投稿

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

AVIS: 视觉语言模型的自适应测试时缩放

Ahmadreza Jeddi, Minh Ngoc Le, Amirhossein Kazerouni, Hakki Can Karaimer, Hue Nguyen, Iqbal Mohomed, Michael Brudno, Alex Levinshtein, Konstantinos G. Derpanis, Babak Taati, Radek Grzeszczuk

发表机构 * AI Center-Toronto, Samsung Electronics(三星电子多伦多AI中心) University of Toronto(多伦多大学) Vector Institute(向量研究所) York University(约克大学)

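AI summary: Proposes AVIS, a lightweight policy that jointly adapts visual context scaling and visual reasoning scaling per query, using training-free Key Diversity Visual (KDV) pruning and adaptive self-consistency to improve the accuracy-compute trade-off across image and video reasoning benchmarks.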

详情
Comments
Project page: this https URL
AI中文摘要

现代视觉语言模型(VLM)受益于思维链提示和测试时缩放,但这些增益通常因大视觉上下文和长解码链而带来高昂推理成本。我们将此成本通过两个耦合的轴来审视:视觉上下文缩放(VCS),控制传递给语言模型的视觉证据量;以及视觉推理缩放(VRS),控制推理时推理搜索的执行量。现有方法通常一次优化一个轴,而跨这些轴的联合计算分配尚未充分探索。我们引入自适应视觉推理缩放(AVIS),一种轻量策略,根据每个查询自适应调整VCS和VRS。AVIS通过关键多样性视觉(KDV)剪枝实现VCS,这是一种无训练的$O(N)$基于关键字的规则,用于在预填充前移除冗余视觉令牌;并通过自适应自一致性实现VRS,使用学习的难度预测器选择推理滚动的数量。AVIS易于部署,兼容共享预填充推理,其中所有滚动重用单个预填充过程和KV缓存。在多样化的图像和视频推理基准上,AVIS相对于仅VCS和仅VRS的基线改善了精度-计算权衡,并且在RL后训练的VLM上仍然有效,同时保持低计算和低延迟。

英文摘要

Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibitive inference cost due to large visual contexts and long decoding chains. We view this cost through two coupled axes: Visual Context Scaling (VCS), which controls how much visual evidence is passed to the language model, and Visual Reasoning Scaling (VRS), which controls how much inference-time reasoning search is performed. Existing methods typically optimize one axis at a time, leaving the joint allocation of compute across these axes underexplored. We introduce Adaptive Visual Inference Scaling (AVIS), a lightweight policy that adapts both VCS and VRS per query. AVIS realizes VCS through Key Diversity Visual (KDV) pruning, a training-free $O(N)$ key-based rule for removing redundant visual tokens before prefilling, and realizes VRS through adaptive self-consistency, using a learned difficulty predictor to select the number of reasoning rollouts. AVIS is deployment-friendly and compatible with shared-prefill inference, where all rollouts reuse a single prefilling pass and KV cache. Across diverse image and video reasoning benchmarks, AVIS improves the accuracy--compute trade-off relative to VCS-only and VRS-only baselines, and remains effective on top of RL post-trained VLMs while keeping compute and latency low.
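
The adaptive self-consistency half of the policy can be sketched as majority voting with a rollout count chosen from a predicted difficulty. The linear mapping from difficulty to rollout count, the bounds `min_k` and `max_k`, and the `sample_answer` callback are hypothetical. The paper uses a learned difficulty predictor whose details are not given in the abstract.

```python
from collections import Counter


def num_rollouts(difficulty, min_k=1, max_k=8):
    """Map a predicted difficulty in [0, 1] to a number of reasoning rollouts."""
    d = min(1.0, max(0.0, difficulty))
    return min_k + round(d * (max_k - min_k))


def self_consistent_answer(sample_answer, difficulty):
    """Sample k rollouts and return (majority answer, k)."""
    k = num_rollouts(difficulty)
    votes = Counter(sample_answer(i) for i in range(k))
    return votes.most_common(1)[0][0], k


def toy_model(i):
    """Stand-in for one sampled rollout; answers 'B' on every third call."""
    return "B" if i % 3 == 0 else "A"


easy = self_consistent_answer(toy_model, difficulty=0.0)
hard = self_consistent_answer(toy_model, difficulty=1.0)
```

Easy queries get a single rollout, while hard ones get the full budget and a majority vote. In a shared-prefill setup all of these rollouts would reuse one prefilling pass.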

2606.11670 2026-06-11 cs.CV cs.AI 交叉投稿

ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

ARGUS: 堆叠多视角身份马赛克注入用于主体保持的视频生成

Zijie Meng, Jiwen Liu, Yufei Liu, Chengzhuo Tong, Xiaoqiang Liu, Yuanxing Zhang, Yulong Xu, Pengfei Wan

发表机构 * Peking University(北京大学) Kuaishou Technology(快手科技) Xiamen University(厦门大学)

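AI summary: Proposes the ARGUS framework, which uses Stacked Multi-View Identity Mosaic Injection (SMII) to represent identity as a compact dynamic distribution and combines it with an MLLM Identity Director, no-cross-pair counterfactual training, and other modules, reaching state-of-the-art results in subject-preserving video generation.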

详情
Comments
13 pages, 3 figures
AI中文摘要

仅靠正面人脸相似度无法解决主体保持的视频生成问题:生成的人物必须在运动、大视角变化、表情变化、遮挡、尺度变化以及文本、首帧和身份参考之间的冲突中保持可识别。我们认为核心瓶颈在于点参考范式,该范式将身份坍缩为与姿态、配饰、光照、背景和相机统计纠缠的单一静态观测。我们提出了Argus,一个基于Wan的框架,核心是堆叠多视角身份马赛克注入(SMII)。SMII将MLLM选择的图像/视频身份证据转换为3*3堆叠马赛克,使马赛克与当前扩散时间同步,并将其作为负时间只读内存注入Wan的原生令牌空间。这使身份从外部清洁适配器或单个参考图像转变为紧凑的动态分布。围绕SMII,MLLM身份导演选择信息丰富的身份时刻并解决条件冲突,而无交叉对反事实训练、时间身份退火和自适应自相似性指导在没有配对主体-视频监督的情况下提高了鲁棒性。我们进一步发布了HardID-Celeb,一个公众人物身份压力基准,并引入YawScore和OccScore来探测大偏航和首帧遮挡鲁棒性。Argus在OpenS2V-Eval Human-Domain上达到了SOTA结果,总分为64.38,FaceSim为71.86,NexusScore为51.62,NaturalScore为79.14。在HardID-Celeb上,Argus获得了76.80的FaceSim,并在YawScore和OccScore上分别比最强基线提高了12.60和15.10分,证明了动态身份记忆和大规模反事实自监督对于主体保持视频生成非常有效。

英文摘要

Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image/video identity evidence into a 3*3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe large-yaw and first-frame-occlusion robustness. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching 64.38 Total Score, 71.86 FaceSim, 51.62 NexusScore, and 79.14 NaturalScore. On HardID-Celeb, Argus obtains 76.80 FaceSim and improves YawScore and OccScore by 12.60 and 15.10 points over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.
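
The abstract does not describe how the 3*3 mosaic is laid out. If it is read as a plain spatial grid of nine identity crops, the assembly step could look like the sketch below. That reading, the `stack_mosaic` name, and the toy tiles are assumptions, and the synchronization with diffusion time and the injection into the token space are not modeled.

```python
def stack_mosaic(tiles, grid=3):
    """Tile grid*grid equally sized images (lists of pixel rows) into one image."""
    assert len(tiles) == grid * grid, "expected grid*grid tiles"
    rows = []
    for r in range(grid):
        row_tiles = tiles[r * grid:(r + 1) * grid]
        for y in range(len(row_tiles[0])):
            # Concatenate row y of each tile in this grid row, left to right.
            rows.append([px for tile in row_tiles for px in tile[y]])
    return rows


# Nine 2x2 tiles, each filled with its own index.
tiles = [[[i, i], [i, i]] for i in range(9)]
mosaic = stack_mosaic(tiles)
```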

2606.11683 2026-06-11 cs.CV cs.AI 交叉投稿

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

推理,再推理:跨视角重访提升空间推理

Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao

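AI summary: Proposes the ReRe framework, which generates complementary novel-view videos so that an MLLM first reasons and then verifies, substantially improving spatial reasoning without training.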

详情
Comments
ICML 2026
AI中文摘要

从自我中心视频进行空间推理本质上是具有挑战性的,因为可观察的证据受到相机轨迹的限制。现有方法依赖单轮推理,迫使模型通过语义先验而非可验证证据来解决几何歧义。我们认为空间推理应该是可重访的:在有限证据下形成的结论在获得互补视角时应保持开放以进行修正。基于这一见解,我们提出“推理,再推理”(ReRe),一种无需训练、推理时框架,包含两个阶段:在推理阶段,MLLM从原始视频形成空间假设;在再推理阶段,它通过观察合成的新视角视频来验证或修正假设。为了实现有效的跨视角重访,我们设计了一个几何到视频的流水线,从预测的3D几何中渲染出策略性互补的新视角。这些视角具有升高的、倾斜的视角,覆盖整个场景,同时保持MLLM的原生视频接口,无需架构修改。在VSI-Bench和STI-Bench上的广泛评估表明,ReRe显著提升开源MLLM,使其与专有最先进性能相媲美。项目页面:此https URL

英文摘要

Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: this https URL

2606.11719 2026-06-11 cs.CV cs.AI 交叉投稿

Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning

Ouroboros-Spatial:闭环数据-模型循环的空间推理

Enhan Zhao, Wei Wu, Yuanrui Zhang, Xueliang Zhao, Di He

发表机构 * Peking University(北京大学) Ant International(蚂蚁国际) The University of Hong Kong(香港大学)

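AI summary: Proposes Ouroboros-Spatial, a self-evolving framework in which a proposer and a solver interact in a closed loop to generate training samples matched to the model's ability; across six spatial reasoning benchmarks it substantially improves Qwen3-VL with an order of magnitude fewer training examples.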

详情
AI中文摘要

空间推理仍然是多模态大语言模型(MLLM)的一个持续挑战。现有方法主要依赖大规模、静态整理的数据集,其中所有训练样本被统一对待,而不考虑模型不断演变的能力。这种静态范式本质上是数据低效的:训练能力通常浪费在模型当前阶段过于简单或过于困难的样本上。为解决这一局限,我们提出Ouroboros-Spatial,一个自演进的训练框架,其中模型扮演提议器和求解器的双重角色。在每次迭代中,冻结的提议器从3D场景元数据和原始视频帧生成空间问答对,以及用于推导可靠真实值的可执行代码。然后,可学习的求解器在接受的样本上进行微调,其每个样本的预测置信度作为难度信号。该信号在下一迭代中反馈给提议器,引导其生成与求解器当前能力更匹配的问题。通过这种闭环设计,训练分布与模型能力共同演化,减少冗余的简单示例,同时过滤掉具有有限学习价值的模糊或无信息样本。在六个空间推理基准上,Ouroboros-Spatial显著提升了Qwen3-VL-4B和Qwen3-VL-8B的性能,同时使用的训练样本数量比近期大规模整理数据集少一个数量级。在VSI-Bench上,它对4B和8B模型分别取得了9.9和6.8个百分点的绝对提升,使两者均优于一系列强大的开源和专有基线模型。

英文摘要

Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-scale, statically curated datasets, where all training samples are treated uniformly regardless of the model's evolving capabilities. This static paradigm is inherently data-inefficient: training capacity is often spent on samples that are either trivial or overly difficult for the model at its current stage. To address this limitation, we propose Ouroboros-Spatial, a self-evolving training framework in which the model plays dual roles as a proposer and a solver. In each iteration, a frozen proposer generates spatial question-answer (QA) pairs from 3D scene metadata and raw video frames, together with executable code for deriving reliable ground truth. A learnable solver is then fine-tuned on the accepted samples, and its per-sample prediction confidence is used as a difficulty signal. This signal is fed back to the proposer in the next iteration, guiding it to generate questions better matched to the solver's current capabilities. Through this closed-loop design, the training distribution co-evolves with model ability, reducing redundant trivial examples while filtering out ambiguous or uninformative samples with limited learning value. Across six spatial reasoning benchmarks, Ouroboros-Spatial substantially improves Qwen3-VL-4B and Qwen3-VL-8B while using an order of magnitude fewer training examples than recent large-scale curated datasets. On VSI-Bench, it yields absolute gains of 9.9 and 6.8 points for the 4B and 8B models, respectively, enabling both to outperform a wide range of strong open-source and proprietary baselines.
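
The feedback step of the loop can be sketched as a filter on solver confidence. Samples the solver already answers with very high confidence are treated as trivial, and those with very low confidence as too hard or ambiguous. The thresholds and the shape of the feedback summary are hypothetical, since the abstract only states that per-sample confidence serves as the difficulty signal.

```python
def select_samples(confidences, low=0.2, high=0.9):
    """Keep samples in the useful difficulty band and summarize the rest.

    `confidences` maps a sample id to the solver's prediction confidence.
    Returns (kept ids, feedback counts for the proposer).
    """
    kept = [sid for sid, c in confidences.items() if low <= c <= high]
    feedback = {
        "too_easy": sum(1 for c in confidences.values() if c > high),
        "too_hard": sum(1 for c in confidences.values() if c < low),
    }
    return kept, feedback


confidences = {"s1": 0.98, "s2": 0.55, "s3": 0.05, "s4": 0.70}
kept, feedback = select_samples(confidences)
```

In the next iteration the proposer would use a summary like `feedback` to shift its question distribution toward the band the solver finds informative.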

2606.11744 2026-06-11 cs.CL cs.AI 交叉投稿

Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild

嘿,聊天机器人,你能教我吗?为人类学习构建结构化苏格拉底式对话

Sidney Tio, Arunesh Sinha, Pradeep Varakantham

发表机构 * School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院) Department of Management Science and Information Systems, Rutgers Business School(罗格斯大学商学院管理科学与信息系统系)

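AI summary: To address the poor tutoring performance of LLMs over long dialogues, proposes a system that separates curriculum sequencing, Socratic dialogue, and knowledge-state inference, with a PPO policy deciding the teaching order; it outperforms baseline models on STEM and non-STEM topics.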

详情
Comments
10 Main Body Pages, with Appendices
AI中文摘要

大型语言模型现在被广泛用于日常学习,但底层交互通常是非结构化的聊天,而不是遵循课程。与正式的在线学习系统不同,这些交互没有学生的先前记录,因此对学生已知内容的任何估计都必须从对话本身推断。我们表明,仅通过扩展模型并不能弥补这一差距。前沿和教育调优的LLM在要求长时间辅导学生时表现不佳,因为这需要同时做三件事:导师必须安排课程顺序,进行苏格拉底式对话,并从对话中推断学生的知识状态。我们建议分离这些职责。给定学生查询,我们的系统构建一个先决知识图谱,其中子主题是节点,依赖关系是边,并将辅导视为决定下一个要教授哪个节点以及在该节点上花费多少轮对话后再继续。一个轻量级的PPO策略处理这个顺序决策,而LLM在所选节点进行苏格拉底式交流并返回学生进展信号。在保留的STEM和非STEM主题上,我们的PPO配对导师优于启发式基线、前沿通用模型以及专门用于苏格拉底式对话的模型:无论是在学生达到完全课程掌握的速度上,还是在所需的对话轮数上。明确的课程结构带来了底层模型扩展所无法提供的收益。

英文摘要

Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats rather than following a curriculum. Unlike formal online learning systems, these interactions carry no prior record of the student, so any estimate of what the student already knows must be inferred from the dialogue itself. We show that this gap is not closed by scaling models alone. Frontier and education-tuned LLMs perform poorly when asked to tutor a student over an extended session, because doing so requires three things at once: the tutor must sequence a curriculum, conduct Socratic dialogue, and infer the student's knowledge state from that dialogue. We propose separating these responsibilities. Given a student query, our system constructs a prerequisite knowledge graph in which subtopics are nodes and dependencies are edges, and frames tutoring as deciding which node to teach next and how many dialogue turns to spend on it before moving on. A lightweight PPO policy handles this sequencing decision, while an LLM conducts the Socratic exchange at the chosen node and returns a signal of student progress. Across held-out STEM and non-STEM topics, our PPO-paired tutor outperforms heuristic baselines, frontier general-purpose models, and a model specialised for Socratic dialogue, on both the rate at which students reach full curriculum mastery and the number of turns required. Explicit curriculum structure delivers gains that scaling the underlying model does not.
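
The prerequisite graph can be illustrated with a small sketch that computes which subtopics are currently teachable, that is, not yet mastered and with all prerequisites mastered. This only produces the candidate set; in the paper a PPO policy chooses among nodes and decides how many turns to spend, which is not modeled here. The graph contents and the `teachable` helper are hypothetical.

```python
def teachable(prereqs, mastered):
    """Subtopics not yet mastered whose prerequisites are all mastered."""
    return sorted(
        node
        for node, deps in prereqs.items()
        if node not in mastered and set(deps) <= mastered
    )


# Hypothetical prerequisite graph: node -> list of prerequisite nodes.
prereqs = {
    "limits": [],
    "derivatives": ["limits"],
    "integrals": ["limits"],
    "differential equations": ["derivatives", "integrals"],
}

first = teachable(prereqs, set())
second = teachable(prereqs, {"limits"})
last = teachable(prereqs, {"limits", "derivatives", "integrals"})
```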

2606.11745 2026-06-11 cs.CV cs.AI 交叉投稿

From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning

从提示到标记:将因果监督内化到视觉-语言模型中进行多图像因果推理

Haoping Yu, Yuanxi Li, Jing Ma

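AI summary: Proposes BridgeVLM, which induces a causal graph from multi-image inputs, converts it into Causal Tokens, and injects RAMP layers into the LLM decoder for causal message passing, substantially improving multi-image causal reasoning.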

详情
AI中文摘要

视觉因果推理对于理解和干预物理世界至关重要,需要从视觉输入中识别因果变量并推理干预效果。尽管最近取得了进展,大型视觉-语言模型(VLM)在此类任务上仍然脆弱,尤其是对于多图像输入上的干预和反事实查询。大多数现有探索通过文本提示注入因果知识,使因果机制外在于模型执行,限制了推理过程中的可靠控制。为了解决这个问题,我们提出了BridgeVLM,它通过从多图像输入中诱导因果图并将其转换为结构化的因果标记,由注入到LLM解码器中的RAMP层执行因果消息传递,从而内化视觉因果推理。我们进一步引入了一个统一的训练接口M3S,用于不同粒度(局部/全局级别)的细粒度因果监督。BridgeVLM在CausalVLBench的干预任务上达到了54.4%的准确率(而提示级监督为33.2%),在Causal3D上将结果从43.6%提升到49.0%,并在CausalVLBench上显著改善了因果结构学习($F_1$:33.4% → 75.1%)。

英文摘要

Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables from visual inputs and reasoning over intervention effects. Despite recent progress, large vision--language models (VLMs) remain brittle at such tasks, especially for interventional and counterfactual queries over multi-image inputs. Most existing explorations inject causal knowledge via textual prompts, leaving causal mechanisms external to model execution and limiting reliable control during inference. To address this problem, we propose BridgeVLM, which internalizes visual causal reasoning by inducing a causal graph from multi-image inputs and converting it into structured Causal Tokens executed by RAMP layers injected into the LLM decoder for causal message passing. We further introduce a unified training interface M3S for fine-grained causal supervision from different granularities (local/global level). BridgeVLM achieves 54.4% accuracy on intervention tasks on CausalVLBench (vs. 33.2% with prompt-level supervision), improves results on Causal3D from 43.6% to 49.0%, and substantially improves causal structure learning on CausalVLBench ($F_1$: 33.4% $\rightarrow$ 75.1%).
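
For readers unfamiliar with how an $F_1$ score applies to causal structure learning, the sketch below scores a predicted graph against a reference graph by treating each directed edge as one item. Whether the benchmark scores directed edges in exactly this way is an assumption, and the example graphs are made up.

```python
def edge_f1(predicted_edges, true_edges):
    """F1 over directed edges, each given as a (cause, effect) pair."""
    predicted, true = set(predicted_edges), set(true_edges)
    true_positives = len(predicted & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


true_graph = [("A", "B"), ("B", "C"), ("A", "C")]
predicted_graph = [("A", "B"), ("B", "C"), ("C", "A")]  # last edge is reversed
score = edge_f1(predicted_graph, true_graph)
```

Here two of three predicted edges are correct and two of three true edges are recovered, so precision, recall, and $F_1$ all equal 2/3. The reversed edge counts as both a false positive and a miss.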

2606.11751 2026-06-11 cs.CV cs.AI 交叉投稿

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

AnchorEdit: 通过因果记忆在多轮图像编辑中保持时间一致性

Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学) JD Explore Academy(京东探索研究院)

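AI summary: Proposes AnchorEdit, described as the first autoregressive diffusion framework for multi-turn editing, which uses a causal memory mechanism and a self-rollout strategy to counter identity drift and error accumulation, maintaining high fidelity over 10+ interaction rounds.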

详情
Comments
Code: this https URL
AI中文摘要

多轮图像编辑对于迭代设计至关重要,但当前模型在连续步骤中常面临身份漂移和误差累积。现有研究利用视频先验保持一致性,但其依赖的双向注意力与交互式编辑的因果、顺序性质根本不符。本文提出AnchorEdit,首个专为高分辨率、长期多轮编辑设计的自回归(AR)扩散框架。AnchorEdit通过三阶段训练课程弥合视频先验与因果推理之间的差距:保持身份的单轮预训练、使用新颖的自展开策略进行因果AR强制微调以缓解暴露偏差,以及用于高效4步生成的一致性蒸馏。在推理过程中,我们引入记忆机制来锚定初始主体身份,并确保在扩展编辑轨迹上的稳定外推。为评估性能,我们提供了一个新的高分辨率多轮编辑基准,旨在压力测试长期稳定性。大量实验表明,AnchorEdit达到了最先进的结果,即使在10轮以上的交互中也能保持卓越的主体保真度和指令遵循能力。

英文摘要

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.

2606.11792 2026-06-11 cs.CV cs.AI cs.CL 交叉投稿

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

MultiToP:学习修补视觉令牌以减轻视频大型多模态模型中的幻觉

Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学) Sun Yat-sen University(中山大学) East China Normal University(华东师范大学)

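AI summary: Proposes the MultiToP framework, in which a lightweight Visual Token Patcher dynamically replaces unreliable visual tokens, trained with information-guided rank calibration and sparsity regularization; it reduces hallucinations in video multimodal models without modifying the original model and improves F1 scores and question-answering accuracy.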

详情
Comments
Preprint
AI中文摘要

视频大型多模态模型在视频理解方面取得了显著进展,但仍容易产生幻觉,即生成的响应未能忠实于输入视频。在本文中,我们提出MultiToP,一种多模态上下文感知的视觉令牌修补框架,通过在语言生成之前优化不可靠的视觉令牌来减轻幻觉。MultiToP引入了一个轻量级的视觉令牌修补器,用于预测令牌级替换分布,并选择性地用动态全局修补令牌替换不可靠的视觉令牌。为了有效训练修补器,我们进一步提出了信息引导的排名校准,利用从主干网络派生的答案条件帧级信息线索来指导令牌替换。结合真实答案监督和稀疏正则化,MultiToP实现了局部视觉证据优化,而无需修改原始模型。大量实验表明,MultiToP在Vript-HAL上有效减少了幻觉,且推理开销可忽略不计,将Qwen3-VL-4B-Instruct的F1分数相比原始模型提高了50.60%。同时,MultiToP保持了通用的视频理解能力,在ActivityNet-QA上为Video-LLaVA-7B带来了18.58%的相对准确率提升。

英文摘要

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.

2606.11805 2026-06-11 cs.CV cs.AI 交叉投稿

TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization

TextHOI-3D: 基于离散多视图生成与联合网格优化的文本到三维手物交互

Zixiong Hao, Zhencun Jiang

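Affiliations: Technical University of Munich (慕尼黑工业大学); Tongji University (同济大学); Shanghai Research Institute for Intelligent Autonomous Systems (上海自主智能无人系统科学中心)

AI summary: Proposes TextHOI-3D, a framework that connects text-conditioned generation and geometry recovery through a discrete multi-view representation to generate 3D hand-object meshes from text, substantially reducing object Chamfer distance and penetration volume.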

详情
Comments
11 pages, 8 figures, 3 tables
AI中文摘要

文本条件的三维生成在图像和孤立物体方面进展迅速,但生成手物网格仍然具有挑战性:输出必须保持语言语义、跨视图一致性、物体几何、关节手部形状以及物理上合理的接触。我们提出TextHOI-3D,一个分阶段框架,使用生成的多视图观测作为文本条件视觉生成与几何感知手物恢复之间的显式接口。TextHOI-3D为固定相机的手物观测学习紧凑的VQ令牌空间,通过CLIP条件的视觉自回归模型从文本预测多视图视觉令牌,并通过先验初始化、多视图联合优化和抗穿透细化恢复统一的手物网格。该设计将语义生成与几何恢复分离,同时通过离散多视图表示保持两个阶段的连接。在HO3D衍生评估中,与单视图对应相比,多视图设置将物体倒角距离从17.26毫米降低到4.92毫米,穿透体积从5.3721立方厘米降低到0.2193立方厘米,同时改善了手部误差和表面F分数。这些结果支持多视图视觉令牌作为文本驱动三维手物网格创建的有效中间表示。

英文摘要

Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging: the output must preserve language semantics, cross-view consistency, object geometry, articulated hand shape, and physically plausible contact. We present TextHOI-3D, a staged framework that uses generated multi-view observations as an explicit interface between text-conditioned visual generation and geometry-aware hand-object recovery. TextHOI-3D learns a compact VQ token space for fixed-camera hand-object observations, predicts multi-view visual tokens from text with a CLIP-conditioned visual autoregressive model, and recovers a unified hand-object mesh through prior initialization, multi-view joint optimization, and anti-penetration refinement. The design separates semantic generation from geometric recovery while keeping both stages connected by a discrete multi-view representation. On HO3D-derived evaluations, the multi-view setting reduces object CD from 17.26 mm to 4.92 mm and penetration volume from 5.3721 cm^3 to 0.2193 cm^3 compared with a single-view counterpart, while improving hand errors and surface F-scores. These results support multi-view visual tokens as an effective intermediate representation for text-driven 3D hand-object mesh creation.
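
The object CD figures above refer to Chamfer distance between point sets. A minimal sketch of one common symmetric variant follows. Conventions differ (squared versus unsquared distances, sum versus average of the two directions), and the abstract does not say which one the paper uses, so the choice below is an assumption.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Uses unsquared Euclidean distances and sums the two directional means;
    this is one common convention, not necessarily the paper's.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Pairwise distances, shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean nearest-neighbour distance from a to b, plus from b to a.
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

The brute-force pairwise matrix is fine for a few thousand points; larger meshes would usually need a spatial index.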

2606.11837 2026-06-11 cs.CV cs.AI 交叉投稿

LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation

LASA:一种用于开放词汇场景草图语义分割的弱监督方法

Liwen Yi, Xianlin Zhang, Yue Zhang, Yue Ming, Xueming Li

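Affiliations: Beijing University of Posts and Telecommunications (北京邮电大学)

AI summary: Proposes LASA, which aggregates Vision Transformer attention maps across layers to perform open-vocabulary semantic segmentation of scene sketches under weak supervision, substantially improving segmentation accuracy and spatial coherence.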

详情
AI中文摘要

开放词汇场景草图语义分割旨在基于推理时指定的灵活类别词汇,为稀疏线条图分配密集语义标签,而无需在训练期间依赖像素级标注。与自然图像不同,草图缺乏纹理和颜色线索,使得语义理解严重依赖于笔画布局和空间配置,这一挑战导致单层视觉-语言特征本质上不稳定。我们的关键观察是,来自不同Vision Transformer层的注意力图编码了互补的空间线索:浅层捕获全局结构布局,而深层聚焦于局部笔画交叉和物体部件。这表明跨层聚合比任何单独一层提供了更稳健的结构先验。利用这一洞察,我们提出了一种结构感知框架,基于逐层累积结构注意力(LASA),该框架聚合多层注意力以在弱监督下指导层次化语义对齐,并在推理期间细化预测。在FS-COCO、SFSD和FrISS上的实验表明,与先前的弱监督基线相比,LASA将mIoU分别提高了+3.43、+8.01和+15.74,在分割精度和空间一致性上均表现出一致的提升。我们的源代码将公开提供。

英文摘要

Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category vocabularies specified at inference time, without relying on pixel-level annotations during training. Unlike natural images, sketches lack texture and color cues, making semantic understanding heavily dependent on stroke layout and spatial configuration, a challenge that renders single-layer vision-language features inherently unstable. Our key observation is that attention maps from different Vision Transformer layers encode complementary spatial cues: shallow layers capture global structural layouts, while deeper layers focus on local stroke intersections and object parts. This suggests that cross-layer aggregation provides a more robust structural prior than any individual layer alone. Leveraging this insight, we propose a structure-aware framework built upon Layer-wise Accumulated Structural Attention (LASA), which aggregates multi-layer attention to guide hierarchical semantic alignment under weak supervision and refine predictions during inference. Experiments on FS-COCO, SFSD, and FrISS show that LASA improves mIoU by +3.43, +8.01, and +15.74 over the prior weakly supervised baselines, demonstrating consistent gains in both segmentation accuracy and spatial coherence. Our source code will be made publicly available.
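
Cross-layer aggregation of attention maps can be sketched as a weighted average of per-layer maps. This is only a hedged illustration of the general idea: the per-layer normalization, the uniform default weights, and the name `aggregate_attention` are assumptions, and the paper's accumulation rule may differ.

```python
import numpy as np

def aggregate_attention(layer_maps, weights=None):
    """Combine per-layer attention maps into one structural prior.

    layer_maps: sequence of L arrays, each of shape (H, W).
    weights:    optional (L,) layer weights; uniform if omitted (assumption).
    """
    maps = np.stack([np.asarray(m, dtype=float) for m in layer_maps])  # (L, H, W)
    # Normalize every layer map to sum to 1 so no single layer dominates by scale.
    sums = maps.sum(axis=(1, 2), keepdims=True)
    maps = maps / np.where(sums == 0, 1.0, sums)

    if weights is None:
        weights = np.ones(len(maps))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()

    # Weighted sum over the layer axis, result has shape (H, W).
    return np.tensordot(weights, maps, axes=1)
```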

2606.11853 2026-06-11 cs.CV cs.AI 交叉投稿

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

任务感知结构化记忆用于动态多模态上下文学习

Zhirui Chen, Ziwei Chen, Ling Shao

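AI summary: Proposes TASM, a framework that uses task-vector guided compression, semantics-aware token merging, and a hierarchical memory structure to address the semantic disruption and static memories caused by memory compression in in-context learning for multimodal large language models.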

详情
Comments
Accepted to ICML 2026
AI中文摘要

多模态大语言模型(MLLMs)依赖上下文学习(ICL)进行快速任务适应,但其可扩展性受到有限上下文窗口和长多模态序列中键值(KV)缓存成本增长的严重限制。现有的记忆压缩方法通常依赖于刚性令牌移除或样本相关的重要性估计,这引入了偏差,破坏了语义结构(特别是视觉表示),并产生无法适应新查询的静态记忆。我们提出了TASM(任务感知结构化记忆),一个无需训练的框架,通过任务感知、结构保持和动态可访问的记忆构建来解决这些限制。TASM采用任务向量引导压缩,用捕获演示间共享相关性的任务级方向替代样本特定信号。为了保持底层流形,它通过二分图匹配应用语义感知令牌合并,在不进行破坏性修剪的情况下聚合令牌。最后,TASM将记忆结构化为一个层次结构,包括紧凑的核心记忆和潜在库,促进查询自适应的动态检索。评估证实,TASM在重度压缩下保持高性能,有效平衡了效率与适应性。

英文摘要

Multi-modal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is severely limited by finite context windows and the growing cost of key-value (KV) caches in long multi-modal sequences. Existing memory compression approaches typically rely on rigid token removal or sample-dependent importance estimation, which introduces bias, disrupts semantic structure, particularly for visual representations, and yields static memories that cannot adapt to new queries. We introduce TASM (Task-Aware Structured Memory), a training-free framework that addresses these limitations through task-aware, structure-preserving, and dynamically accessible memory construction. TASM employs task-vector guided compression to replace sample-specific signals with a task-level direction that captures shared relevance across demonstrations. To preserve the underlying manifold, it applies semantics-aware token merging via bipartite graph matching, aggregating tokens without destructive pruning. Finally, TASM structures memory into a hierarchy comprising a compact Core Memory and a Latent Bank, facilitating query-adaptive dynamic retrieval. Evaluations confirm TASM maintains high performance under heavy compression, effectively balancing efficiency with adaptability.

2606.11893 2026-06-11 cs.LG cs.AI cs.CL q-bio.NC 交叉投稿

Beyond representational alignment with brain-guided language models for robust reasoning

超越表征对齐:基于大脑引导的语言模型实现稳健推理

Mingqing Xiao, Kai Du, Zhouchen Lin

Affiliations: State Key Lab of General AI, School of Intelligence Science and Technology, Peking University (北京大学通用人工智能国家重点实验室、智能科学与技术学院); Department of Psychological and Cognitive Sciences, Tsinghua University (清华大学心理与认知科学系); Microsoft Research Asia (微软亚洲研究院)

AI summary: Studies how fMRI signals can enhance the reasoning ability of large language models and proposes a brain-guided framework that achieves accuracy gains of up to 13% across 10 models.

详情
AI中文摘要

大型语言模型(LLMs)与人类高阶认知背后的神经机制之间的对应关系仍未得到充分表征。鉴于人脑中语言和推理似乎是可分离的,一个开放的问题是LLMs是否与来自推理相关区域的神经信号对齐,以及这些信号是否能够改进它们。在此,我们聚焦于演绎推理,表明LLM内部表征不仅与任务fMRI活动部分对齐,而且可以直接通过这些信号增强。使用神经预测性度量,我们发现LLMs在聚合水平上解释了推理相关区域中可解释方差的很大一部分,而在特定推理类型内的预测性较低,表明对齐和分歧并存。基于此,我们提出一个脑引导框架:我们沿着由模型和大脑表征的联合结构诱导的方向引导模型表征,在推理时进行干预,在训练时进行微调。我们证明任务诱发的脑信号可以直接增强LLM推理,在10个LLM(1.5B-72B)上产生与仅语言监督正交的增益,具有跨推理类型的迁移,以及高达13%的绝对准确率提升。我们的结果将LLM-大脑对应关系从相关性推进到引导,建立了一条由脑信号驱动的路径,通向更稳健和认知对齐的AI。

英文摘要

The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.

2606.12047 2026-06-11 cs.CV cs.AI stat.ML 交叉投稿

Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding

元数据感知的多提示推理用于零样本事故理解

Tarandeep Singh, Soumyanetra Pal, Soham Biswas, Nishanth Chandran

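Affiliations: Netradyne

AI summary: Proposes a three-stage pipeline that uses vision-language similarity, metadata-driven multi-prompt reasoning, and open-vocabulary detection to perform zero-shot temporal localization, semantic classification, and spatial grounding of accident videos, substantially improving performance.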

详情
Comments
Accepted at the AUTOPILOT Workshop, CVPR 2026 (non-archival). Workshop Paper ID 15
AI中文摘要

在本文中,我们通过识别冲击事件发生的时间、类型以及帧中的位置,使用自然语言解决监控视频中事故的零样本理解问题。我们提出一个三阶段流水线,将事故理解分解为何时、何物和何地。第一阶段利用视觉-语言相似性提取冲击周围的短时间窗口。第二阶段,我们执行元数据驱动的多提示推理,包含五个互补视角(基线、运动、几何、对比和决胜),并通过熵门控成对裁决器解决分歧。最后,我们基于预测的事故类型和场景布局查询开放词汇检测器以定位冲击,并使用分数加权质心聚合关键帧上的检测结果。我们的流水线在零样本ACCIDENT @ CVPR基准测试上,相对于帧中心基线,调和平均分数有显著提升。我们表明,将零样本视频理解分解为时序定位、语义分类和空间定位,比直接提示更能实现视觉-语言模型的可靠推理。

英文摘要

In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event occurs, what type of impact it is, and where in the frame it occurs using natural language. We propose a three-stage pipeline that decomposes accident understanding into when, what, and where. The first stage extracts a short temporal window around the impact using vision-language similarity. In the second stage, we perform metadata-driven multi-prompt reasoning with five complementary views (baseline, motion, geometry, contrast, and tiebreaker) and resolve disagreement via an entropy-gated pairwise adjudicator. Finally, we localize the impact with an open-vocabulary detector queried on the predicted accident type and scene layout, and aggregate detections across keyframes using a score-weighted centroid. Our pipeline achieves a substantial improvement in the harmonic-mean score over a centre-of-frame baseline on the zero-shot ACCIDENT @ CVPR benchmark. We show that decomposing zero-shot video understanding into temporal localization, semantic classification, and spatial grounding enables more reliable reasoning with vision-language models than direct prompting alone.
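
Two pieces of this pipeline lend themselves to a short sketch: gating adjudication on the entropy of the five views' votes, and fusing detections with a score-weighted centroid. The code below is an illustration under assumptions, not the authors' method: the entropy threshold `max_entropy_bits`, the vote representation, and all function names are hypothetical.

```python
import math
from collections import Counter

def vote_entropy(labels):
    """Shannon entropy (in bits) of the label distribution across prompt views."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def needs_adjudication(labels, max_entropy_bits=0.8):
    """Hypothetical gate: invoke the pairwise adjudicator only if views disagree enough."""
    return vote_entropy(labels) > max_entropy_bits

def score_weighted_centroid(detections):
    """Fuse (x, y, score) detections from several keyframes into one point."""
    total = sum(score for _, _, score in detections)
    if total <= 0:
        raise ValueError("detections need a positive total score")
    x = sum(x * score for x, _, score in detections) / total
    y = sum(y * score for _, y, score in detections) / total
    return x, y
```

With unanimous views the entropy is zero and the gate stays closed; a three-way split among five views opens it.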

2606.12113 2026-06-11 cs.CL cs.AI 交叉投稿

Augmenting Molecular Language Models with Local $n$-gram Memory

增强分子语言模型的局部 $n$-gram 记忆

Xinni Zhang, Zijing Liu, He Cao, Yu Li, Irwin King

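Affiliations: The Chinese University of Hong Kong (香港中文大学); International Digital Economy Academy (国际数字经济学院)

AI summary: To address the way character-level tokenization fragments chemical semantics in Transformer models for SMILES strings, proposes the MolGram module, which injects local context through hash lookups into a conditional $n$-gram memory and outperforms baselines on three tasks with fewer parameters.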

详情
AI中文摘要

基于Transformer的SMILES字符串语言模型存在局部性差距:标准字符级分词会破坏化学上有意义的模式,迫使模型反复学习局部语法而牺牲长距离依赖。为了解决这个问题而不干扰标准分词器,我们提出了MolGram,它将条件$n$-gram记忆模块集成到分子语言模型中。MolGram通过可扩展的哈希查找将局部字符串模式映射到学习到的嵌入,并动态地将这种区域上下文注入隐藏状态。在三个任务(包括无条件分子生成、正向反应预测和单步逆合成)上的评估表明,MolGram持续提升性能。关键的是,我们的分析表明,MolGram以3倍更少的参数优于基线,将显式局部模式记忆确立为一种高效的归纳偏置。

英文摘要

Transformer-based language models for SMILES strings suffer from a locality gap: standard character-level tokenization fragments chemically meaningful motifs, forcing models to repeatedly learn local syntax at the expense of long-range dependencies. To address this without disrupting standard tokenizers, we propose MolGram, which integrates a conditional $n$-gram memory module into molecular language models. MolGram maps local string patterns to learned embeddings via scalable hash lookups and dynamically injects this regional context into hidden states. Evaluations across three tasks, including unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis, show that MolGram consistently improves performance. Crucially, our analyses demonstrate that MolGram outperforms baselines with 3× more parameters, establishing explicit local pattern memory as a highly efficient inductive bias.
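
The hashed $n$-gram lookup can be sketched as follows. This is a minimal illustration assuming a CRC32 hash, a fixed bucket count, and trailing character $n$-grams; the names `ngram_bucket_ids` and `ngram_memory` are hypothetical, and the paper's conditioning and injection mechanism is not reproduced here.

```python
import zlib
import numpy as np

def ngram_bucket_ids(smiles, n=3, num_buckets=1024):
    """Map the character n-gram ending at each position to a hash bucket.

    Positions near the start use the shorter prefix that is available.
    CRC32 is used because Python's built-in hash() is salted per process.
    """
    ids = []
    for i in range(len(smiles)):
        gram = smiles[max(0, i - n + 1): i + 1]
        ids.append(zlib.crc32(gram.encode("utf-8")) % num_buckets)
    return ids

def ngram_memory(smiles, table, n=3):
    """Look up one memory vector per character from an embedding table.

    table: (num_buckets, dim) array standing in for learned embeddings.
    Returns an array of shape (len(smiles), dim) that a model could add
    to its hidden states (the injection step itself is omitted).
    """
    ids = ngram_bucket_ids(smiles, n=n, num_buckets=table.shape[0])
    return table[ids]
```

Because the table size is fixed, distinct $n$-grams can collide in one bucket; that is the usual trade-off of hashed lookups against an explicit vocabulary.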

2606.12243 2026-06-11 cs.CL cs.AI 交叉投稿

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

VIA-SD:通过模型内路由进行推测解码的验证

Yuchen Xian, Yang He, Yunqiu Xu, Yi Yang

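AI summary: Proposes VIA-SD, a multi-tier verification framework that uses a slim verifier derived from the full verifier to handle medium-confidence tokens, reducing large-model calls and achieving 10-20% speedups across multiple tasks.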

详情
Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

推测解码(SD)通过让轻量级草稿模型生成候选,由大型验证器并行验证,解决了LLM的高推理成本问题。现有的草稿-验证方法使用二元决策:接受或完全重新计算。然而,我们发现许多被拒绝的令牌可以通过从完整验证器通过模型内路由派生的精简子模型正确验证,而不是完整验证器。这促使我们使用精简验证器来处理需要中等验证资源的令牌,减少昂贵的大模型调用。我们提出了VIA-SD(通过模型内路由进行推测解码的验证),一种使用路由精简验证器的多级框架。草稿令牌分层处理:高置信度情况直接接受,中等置信度情况由精简验证器重新生成,不确定情况由完整模型验证。在四个代表性任务和多个模型家族中,VIA-SD将拒绝率降低了0.10-0.22,并在强SD基线基础上实现了10-20%的加速,同时相比非草稿解码实现了2.5-3倍的加速。此外,VIA-SD与现有SD框架兼容,无需修改其训练过程。我们的结果表明,多级SD是一种可扩展且高效的LLM推理通用范式。项目页面:此https URL

英文摘要

Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, instead of the full verifier. This motivates our slim-verifier to handle tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework using a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks without modifying their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: this https URL
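
The three-tier handling of draft tokens can be sketched as a simple confidence router. The thresholds and tier names below are hypothetical placeholders; the abstract does not state how confidence is computed or where the tier boundaries lie.

```python
def route_draft_token(confidence, accept_threshold=0.9, slim_threshold=0.5):
    """Assign one draft token to a verification tier by its confidence.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= accept_threshold:
        return "accept"          # high confidence: keep the draft token
    if confidence >= slim_threshold:
        return "slim_verifier"   # medium confidence: regenerate with the slim submodel
    return "full_verifier"       # uncertain: fall back to the full model

def route_draft(confidences, **thresholds):
    """Route a whole draft sequence token by token."""
    return [route_draft_token(c, **thresholds) for c in confidences]
```

Any token that stays out of the `full_verifier` tier avoids a large-model call, which is where the claimed savings would come from.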

2606.12412 2026-06-11 cs.CV cs.AI 交叉投稿

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

重新路由,而非移除:面向视觉语言模型的可恢复视觉令牌路由

Cheng-Yu Yang, Shao-Yuan Lo, Yu-Lun Liu

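Affiliations: National Yang Ming Chiao Tung University (国立阳明交通大学); National Taiwan University (国立台湾大学)

AI summary: Observing that visual-token importance in vision-language models changes across decoder depth, proposes Reroute, a training-free recoverable routing method that replaces irreversible removal with recoverable routing, improving grounding under aggressive token reduction while maintaining general VQA performance.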

详情
Comments
Code: this https URL
AI中文摘要

视觉语言模型(VLM)将图像投影为数百到数千个视觉令牌,使得解码器推理在注意力计算和KV缓存内存方面代价高昂。现有的视觉令牌缩减方法大多遵循排序-移除范式:它们对视觉令牌进行评分,保留一个紧凑的子集,并永久丢弃其余部分。我们表明这种不可逆操作是脆弱的,因为视觉令牌的重要性随解码器深度变化;在某一阶段排名低的令牌可能在后续层中变得相关,尤其是对于需要定位的查询。我们提出Reroute,一种无需训练的插件,用可恢复路由替代移除。在每个路由阶段,选中的视觉令牌通过解码器块,而延迟的令牌绕过该阶段并在下一个路由决策时重新进入候选池。Reroute重用现有的注意力分数排序规则和阶段级调度,保留了它所增强的剪枝方法的理论TFLOPs和KV缓存预算类别。在LLaVA-1.5和Qwen骨干网络上的FastV、PDrop和Nüwa变体中,Reroute在激进令牌缩减下改善了定位性能,同时保持通用VQA性能。这些结果表明,VLM令牌缩减不应仅被视为不可逆剪枝,也应被视为可恢复路由。代码可在此处获取:this https URL

英文摘要

Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attention computation and KV-cache memory. Existing visual-token reduction methods largely follow a rank-and-remove paradigm: they score visual tokens, keep a compact subset, and permanently discard the rest. We show that this irreversible action is fragile because visual-token importance changes across decoder depth; tokens ranked low at one stage may become relevant in later layers, especially for grounding-sensitive queries. We propose Reroute, a training-free plug-in that replaces removal with recoverable routing. At each routing stage, selected vision tokens pass through decoder blocks, while deferred tokens bypass the stage and re-enter the candidate pool at the next routing decision. Reroute reuses existing attention-score ranking rules and stage-wise schedules, preserving the theoretical TFLOPs and KV-cache budget class of the pruning method it augments. Across FastV, PDrop, and Nüwa variants on LLaVA-1.5 and Qwen backbones, Reroute improves grounding under aggressive token reduction while maintaining general VQA performance. These results suggest that VLM token reduction should not be viewed only as irreversible pruning, but also as recoverable routing. The code can be found here: this https URL
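
The difference between rank-and-remove and recoverable routing can be shown on token indices alone. The sketch below is a toy illustration under assumptions: scores are given per stage for every token, each stage keeps the same number of tokens, and the function names are hypothetical. It does not model the decoder blocks or the KV cache.

```python
import numpy as np

def rank_and_remove(stage_scores, keep):
    """Irreversible pruning: each stage can only choose from earlier survivors."""
    pool = np.arange(len(stage_scores[0]))
    kept = []
    for scores in stage_scores:
        scores = np.asarray(scores, dtype=float)
        order = np.argsort(-scores[pool], kind="stable")[:keep]
        pool = np.sort(pool[order])  # discarded tokens never return
        kept.append(pool)
    return kept

def recoverable_routing(stage_scores, keep):
    """Recoverable routing: deferred tokens rejoin the pool at every stage."""
    selected = []
    for scores in stage_scores:
        scores = np.asarray(scores, dtype=float)
        top = np.argsort(-scores, kind="stable")[:keep]
        selected.append(np.sort(top))  # the rest bypass this stage only
    return selected
```

With scores `[0.9, 0.8, 0.1, 0.2]` at stage one and `[0.1, 0.2, 0.9, 0.8]` at stage two, pruning is stuck with tokens 0 and 1 in both stages, while routing picks tokens 2 and 3 at stage two. Both process the same number of tokens per stage.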

2506.02568 2026-06-11 cs.AI 版本更新

MLaGA: Multimodal Large Language and Graph Assistant

MLaGA: 多模态大语言与图助手

Dongzhe Fan, Yi Fang, Jiajin Liu, Djellel Difallah, Qiaoyu Tan

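AI summary: Proposes MLaGA, which extends large language models to multimodal graph data through a structure-aware multimodal encoder and instruction tuning, outperforming baseline methods on supervised and transfer learning tasks.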

详情
AI中文摘要

大语言模型(LLMs)在推进图结构化数据分析方面展现了显著的功效。现有的基于LLM的图方法擅长将LLM适应于文本丰富的图,其中节点属性是文本描述。然而,它们在多模态图上的应用——其中节点与多种属性类型(如文本和图像)相关联——仍然未被充分探索,尽管这些图在现实场景中普遍存在。为了弥合这一差距,我们引入了多模态大语言与图助手(MLaGA),这是一种创新模型,巧妙地将LLM能力扩展到促进对复杂图结构和多模态属性的推理。我们首先设计了一个结构感知的多模态编码器,通过联合图预训练目标将文本和视觉属性对齐到统一空间中。随后,我们实现了一种多模态指令微调方法,通过轻量级投影仪将多模态特征和图结构无缝集成到LLM中。在多个数据集上的大量实验证明了MLaGA相对于领先基线方法的有效性,在监督和迁移学习场景下的各种图学习任务中均取得了优越性能。

英文摘要

Large Language Models (LLMs) have demonstrated substantial efficacy in advancing graph-structured data analysis. Prevailing LLM-based graph methods excel in adapting LLMs to text-rich graphs, wherein node attributes are text descriptions. However, their applications to multimodal graphs--where nodes are associated with diverse attribute types, such as texts and images--remain underexplored, despite their ubiquity in real-world scenarios. To bridge the gap, we introduce the Multimodal Large Language and Graph Assistant (MLaGA), an innovative model that adeptly extends LLM capabilities to facilitate reasoning over complex graph structures and multimodal attributes. We first design a structure-aware multimodal encoder to align textual and visual attributes within a unified space through a joint graph pre-training objective. Subsequently, we implement a multimodal instruction-tuning approach to seamlessly integrate multimodal features and graph structures into the LLM through lightweight projectors. Extensive experiments across multiple datasets demonstrate the effectiveness of MLaGA compared to leading baseline methods, achieving superior performance in diverse graph learning tasks under both supervised and transfer learning scenarios.

2509.11575 2026-06-11 cs.AI 版本更新

A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models

时间序列中基于大语言模型的推理与智能体系统综述

Ching Chang, Yidan Shi, Defu Cao, Wei Yang, Jeehyun Hwang, Haixin Wang, Jiacheng Pang, Wei Wang, Yan Liu, Wen-Chih Peng, Tien-Fu Chen

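AI summary: Defines the time series reasoning problem and divides the literature by reasoning topology into direct, linear-chain, and branch-structured families. These are crossed with objectives such as traditional analysis, explanation, causal inference, and generation. The survey reviews methods, systems, datasets, and evaluation practices, and offers guidance on topology selection and deployment trade-offs.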

详情
Comments
Accepted to Transactions on Machine Learning Research (TMLR)
AI中文摘要

时间序列推理将时间作为第一类轴,并将中间证据直接纳入答案。本综述定义该问题,并按推理拓扑组织文献,分为三类:一步直接推理、具有显式中间步骤的线性链推理,以及探索、修正和聚合的分支结构推理。该拓扑与领域的主要目标交叉,包括传统时间序列分析、解释与理解、因果推断与决策,以及时间序列生成,同时一个紧凑的标签集跨越这些轴,并捕获分解与验证、集成、工具使用、知识访问、多模态、智能体循环和LLM对齐机制。跨领域回顾了方法和系统,展示了每种拓扑所能实现的功能以及在忠实性或鲁棒性方面的不足,同时提供了支持研究和部署的精选数据集、基准和资源(此 https URL)。强调了保持证据可见且时间对齐的评估实践,并提炼了关于将拓扑与不确定性匹配、基于可观察伪影进行基础化、规划偏移和流式处理,以及将成本和延迟视为设计预算的指导。我们强调,推理结构必须在基础化和自我纠正的能力与计算成本和可重复性之间取得平衡,而未来的进展可能依赖于将推理质量与效用联系起来的基准,以及在偏移感知、流式处理和长视野设置下权衡成本和风险的闭环测试平台。综合来看,这些方向标志着从狭窄的准确性向大规模可靠性的转变,使系统不仅能够分析,还能理解、解释和作用于动态世界,提供可追溯的证据和可信的结果。

英文摘要

Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment ( this https URL ). Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

2602.17001 2026-06-11 cs.AI cs.CL cs.DB 版本更新

Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases

Sonar-TS: 为时间序列数据库的自然语言查询设计的搜索-验证方法

Zhao Tan, Yiji Zhao, Shiyu Wang, Chang Xu, Yuxuan Liang, Xiping Liu, Shirui Pan, Ming Jin

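AI summary: Proposes Sonar-TS, a neuro-symbolic framework for natural language querying over time series databases that handles continuous morphological intents and ultra-long histories through a Search-Then-Verify pipeline. It also introduces the NLQTSBench benchmark for evaluation and demonstrates the method's effectiveness on complex temporal queries.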

详情
Comments
Accepted by ICML 2026
AI中文摘要

自然语言查询时间序列数据库(NLQ4TSDB)旨在帮助非专家用户从大量时间记录中检索有意义的事件、区间和摘要。然而,现有的文本到SQL方法未针对连续形态意图(如形状或异常)进行设计,而时间序列模型在处理超长历史时面临挑战。为解决这些问题,我们提出Sonar-TS,一种神经符号框架,通过搜索-验证流程处理NLQ4TSDB。类似于主动声纳,它利用特征索引通过SQL ping候选窗口,随后通过生成的Python程序锁定并验证候选者与原始信号。为了实现有效的评估,我们引入NLQTSBench,这是第一个大规模基准,专门针对NLQ在TSDB规模的历史数据。我们的实验突显了该领域独特的挑战,并展示了Sonar-TS在传统方法无法处理的复杂时间查询中的有效性。本文首次系统研究了NLQ4TSDB,提供了一个通用框架和评估标准,以促进未来研究。

英文摘要

Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to help non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.

2606.09105 2026-06-11 cs.AI 版本更新

Graph2Idea: Retrieval-Augmented Scientific Idea Generation with Graph-Structured Contexts

Graph2Idea:基于检索增强的图结构上下文科学想法生成

Xu Li, Hanzhe Tu, Xun Han

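Affiliations: Southwest Petroleum University (西南石油大学); Sichuan Police College (四川警察学院)

AI summary: Proposes Graph2Idea, a framework that uses a knowledge graph to turn retrieved literature into structured triples, extracts graph-derived contexts, and applies a two-stage generation process to improve the novelty, quality, and feasibility of scientific ideas.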

详情
AI中文摘要

生成新颖、可行且高质量的研究想法是科学发现中重要但具有挑战性的任务。近期基于大语言模型(LLM)的方法通常通过检索文献来支撑想法生成,但检索到的证据通常以平面文本形式提供,如标题、摘要或总结。这种平面上下文可能包含冗余或弱相关信息,同时使得问题、方法、机制和发现之间的跨论文关系难以识别和追踪。为解决这一挑战,我们提出Graph2Idea,一种知识图谱引导的检索增强科学想法生成框架。Graph2Idea首先根据输入主题检索论文,将其转化为结构化知识三元组,并动态构建以目标为中心的知识图谱,使文献关系明确化。然后,它提取紧凑的图衍生上下文,保留与目标相关的关系证据,同时减少噪声文本输入。基于这些上下文,两阶段生成过程首先识别有前景的研究方向,然后引导LLM从图基础证据中综合候选想法。在科学想法生成基准上的实验表明,Graph2Idea在自动评估协议下优于代表性基线。与最强基线分数相比,它将新颖性从0.45提升至0.52,质量从0.24提升至0.29,可行性从0.22提升至0.28。这些结果表明,图结构证据有助于LLM通过更明确、紧凑和可追溯的先前科学知识重组来生成研究想法。

英文摘要

Generating novel, feasible, and high-quality research ideas is an important yet challenging task in scientific discovery. Recent Large Language Model (LLM)-based methods often ground idea generation with retrieved literature, but the retrieved evidence is usually provided as flat text, such as titles, abstracts, or summaries. Such flat contexts may contain redundant or weakly relevant information, while making cross-paper relations among problems, methods, mechanisms, and findings difficult to identify and trace. To address this challenge, we propose Graph2Idea, a knowledge graph-guided framework for retrieval-augmented scientific idea generation. Graph2Idea first retrieves papers according to the input topic, transforms them into structured knowledge triples, and dynamically constructs a target-centered knowledge graph to make literature relations explicit. It then extracts compact graph-derived contexts that retain target-relevant relational evidence while reducing noisy textual input. Based on these contexts, a two-stage generation process first identifies promising research directions and then guides the LLM to synthesize candidate ideas from graph-grounded evidence. Experiments on a scientific idea generation benchmark show that Graph2Idea outperforms representative baselines under the automatic evaluation protocol. Compared with the strongest baseline scores, it improves Novelty from 0.45 to 0.52, Quality from 0.24 to 0.29, and Feasibility from 0.22 to 0.28. These results suggest that graph-structured evidence helps LLMs generate research ideas through more explicit, compact, and traceable recombination of prior scientific knowledge.

2510.22335 2026-06-11 cs.CV cs.AI 版本更新

Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction

超越扩散:层级到层级自回归用于fMRI到图像重建

Xu Zhang, Ruijie Quan, Wenguan Wang, Yi Yang

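AI summary: Proposes MindHier, a framework that achieves coarse-to-fine fMRI-to-image reconstruction through a hierarchical fMRI encoder, hierarchy-to-hierarchy alignment, and a scale-aware coarse-to-fine guidance strategy, outperforming diffusion-based methods.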

详情
Comments
ICLR 2026
AI中文摘要

从fMRI信号重建视觉刺激是连接机器学习和神经科学的核心挑战。最近的扩散方法通常将fMRI活动映射到单个神经嵌入,并将其作为静态指导贯穿整个生成过程。然而,这种固定指导压缩了层级神经信息,并且与图像重建的阶段依赖性需求不一致。为此,我们提出MindHier,一种基于尺度自回归建模的从粗到细的fMRI到图像重建框架。MindHier引入三个组件:层级fMRI编码器提取多级神经嵌入,层级到层级对齐方案强制与CLIP特征的逐层对应,以及尺度感知的粗到细神经引导策略将这些嵌入注入到匹配尺度的自回归中。这些设计使MindHier成为扩散方法的一种高效且认知对齐的替代方案,通过实现层级重建过程,先合成全局语义再细化局部细节,类似于人类视觉感知。在NSD数据集上的大量实验表明,MindHier在语义保真度、推理速度(4.67倍)和结果确定性方面均优于基于扩散的基线方法。

英文摘要

Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single neural embedding, using it as static guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67× faster inference, and more deterministic results than the diffusion-based baselines.

2601.00181 2026-06-11 cs.CL cs.AI 版本更新

Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

对话中的因果情绪识别:上下文饱和与话语标记证据

Cheonkam Jeong, Adeline Nyamathi

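AI summary: Through systematic ablation experiments, finds that conversational context dominates emotion recognition performance but saturates quickly, and links sadness to reduced use of left-periphery discourse markers and greater context dependence.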

详情
AI中文摘要

我们解决了对话情绪识别中两个长期存在的空白:哪些建模选择实质性地影响性能,以及识别结果如何与可解释的话语层面模式相关联。我们通过在IEMOCAP上进行系统研究并在MELD上进行跨数据集验证来研究这两个问题。对于识别,我们使用10个随机种子进行受控消融实验,并进行多重比较校正的配对显著性检验,得到三个发现。首先,对话上下文是主导因素,但性能快速饱和:大约90%的性能提升来自最近的前10-30轮对话,具体取决于标签集。其次,层级句子表示仅在仅话语设置中帮助最大,并在MELD上显示出明显优势,但一旦轮次级别的上下文可用,其益处消失,表明对话历史吸收了大量话语内部结构。第三,整合外部情感词典不会改善结果,这与预训练编码器已经捕获ERC所需的大部分情感信号一致。在严格因果设置下,我们的简单模型实现了强性能(4-way 82.69%;6-way加权F1 67.07%),表明无需未来轮次即可达到竞争性准确率。对于语言分析,我们检查了5,286个话语标记出现,发现情绪与标记位置之间存在可靠关联(p <.0001)。悲伤话语的左边缘标记使用率(21.9%)低于其他情绪(28-32%),这与左边缘标记与主动话语管理相关的观点一致。这与我们的识别结果一致,其中悲伤从对话上下文中获益最多(+22个百分点),表明悲伤可能比具有更强局部语用线索的情绪更依赖于上下文。

英文摘要

We address two persistent gaps in Emotion Recognition in Conversation: which modeling choices materially affect performance, and how recognition findings connect to interpretable discourse-level patterns. We study both through a systematic investigation on IEMOCAP with cross-dataset validation on MELD. For recognition, we run controlled ablations with 10 random seeds and paired significance tests with multiple-comparisons correction, yielding three findings. First, conversational context is the dominant factor, but performance saturates quickly: roughly 90% of the gain is captured within the most recent 10-30 preceding turns, depending on the label set. Second, hierarchical sentence representations help most in utterance-only settings and show a clear advantage on MELD, but their benefit disappears once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, integrating an external affective lexicon does not improve results, consistent with pretrained encoders already capturing most of the affective signal needed for ERC. Under a strictly causal setting, our simple models achieve strong performance (82.69% 4-way; 67.07% 6-way weighted F1), showing that competitive accuracy is achievable without future turns. For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position (p <.0001). Sad utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28-32%), consistent with accounts linking left-periphery markers to active discourse management. This aligns with our recognition results, where Sad benefits most from conversational context (+22 percentage points), suggesting sadness may be more context-dependent than emotions with stronger local pragmatic cues.

2602.00945 2026-06-11 cs.CL cs.AI 版本更新

Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs

Neural FOXP2——面向大型语言模型目标语言改进的语言特定神经元引导

Anusa Saha, Tanmay Joshi, Vinija Jain, Aman Chadha, Amitava Das

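AI summary: Proposes Neural FOXP2, a method that localizes language neurons, computes steering directions, and applies sparse activation shifts to switch a model's default language from English to Hindi or Spanish, achieving controllable language defaultness.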

详情
AI中文摘要

LLMs通过训练成为多语言模型,但其通用语言通常是英语,反映了英语在预训练中的主导地位。其他语言保留在参数记忆中,但被系统性抑制。我们认为语言默认性由稀疏、低秩的控制电路(语言神经元)支配,可以机械地隔离并安全引导。我们引入Neural FOXP2,通过引导语言特定神经元,使模型以选定语言(印地语或西班牙语)为主。Neural FOXP2分三个阶段进行:(i) 定位:我们训练每层的SAE,使每个激活分解为一小组活跃特征组件。对于每个特征,我们量化英语与印地语/西班牙语的选择性,基于整体logit质量向目标语言令牌集的提升。将排名靠前的特征追溯回其最强贡献单元,得到紧凑的语言神经元集。(ii) 引导方向:我们通过谱低秩分析定位可控的语言转换几何。对于每层,我们构建英语到目标激活差异矩阵,并执行逐层SVD以提取主导语言变化的奇异方向。特征间隙和有效秩谱识别出紧凑的引导子空间和经验选择的干预窗口(这些方向最强且最稳定)。(iii) 引导:我们对语言神经元应用有符号的稀疏激活偏移。具体地,在低到中层,我们沿目标语言主导方向添加正向引导,并对英语神经元在零空间施加补偿性负偏移,实现可控的目标语言默认性。

英文摘要

LLMs are multilingual by training, yet their lingua franca is often English, reflecting English-language dominance in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit, language neurons, that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, which makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages: (i) Localize: We train per-layer SAEs so each activation decomposes into a small set of active feature components. For every feature, we quantify English vs. Hindi/Spanish selectivity via overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English-to-target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window (where these directions are strongest and most stable). (iii) Steer: We apply a signed, sparse activation shift targeted to the language neurons. Concretely, within low to mid layers we add positive steering along the target-language dominant directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.
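
Stage (ii) can be illustrated with a small NumPy sketch: build an activation-difference matrix for one layer, take its SVD, and summarize the spectrum with an effective rank. This is a hedged illustration, not the authors' code. The paired-prompt layout of the inputs and the entropy-based effective-rank definition are assumptions, since the abstract does not specify either.

```python
import numpy as np

def steering_directions(english_acts, target_acts, k=1):
    """Top-k right singular vectors of the activation-difference matrix.

    english_acts, target_acts: (num_prompts, hidden_dim) activations for
    paired prompts at one layer (the pairing is an assumption).
    Returns (directions of shape (k, hidden_dim), singular values).
    """
    diff = np.asarray(target_acts, dtype=float) - np.asarray(english_acts, dtype=float)
    _, s, vt = np.linalg.svd(diff, full_matrices=False)
    return vt[:k], s

def effective_rank(singular_values):
    """Exponential of the Shannon entropy of the normalized singular values.

    One common effective-rank definition; the paper may use another.
    """
    s = np.asarray(singular_values, dtype=float)
    total = s.sum()
    if total == 0:
        return 0.0
    p = s / total
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))
```

An effective rank close to 1 means a single direction carries nearly all of the English-to-target shift at that layer, which is the kind of compact subspace the abstract describes.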

2602.09591 2026-06-11 cs.CL cs.AI cs.LG 版本更新

On the Optimal Reasoning Length for RL-Trained Language Models

关于RL训练的语言模型的最优推理长度

Daisuke Nohara, Taishi Nakamura, Rio Yokota

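AI summary: Studies the non-monotonic relationship between reasoning length and accuracy in RL-trained language models, finds that accuracy peaks at an intermediate length, and uses mode-accuracy analysis to explain why.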

详情
Comments
18 pages, 12 figures
AI中文摘要

强化学习显著提高了大型语言模型的推理能力,但也倾向于延长思维链输出并增加计算成本。尽管已经提出了长度控制方法,但它们所引发的长度-准确率关系仍不清楚。我们在受控设置下,在多个基础模型上使用几种长度控制方法训练策略,发现在数学推理和代码生成中,准确率随输出长度呈非单调变化,在中间值达到峰值。然而,即使在样本准确率趋于平稳或下降的情况下,模式准确率仍随长度持续提高,这表明非单调的长度-准确率关系是由围绕越来越正确的中心的分散性驱动的。

英文摘要

Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost. Although length-control methods have been proposed, the length-accuracy relationship they induce remains unclear. We train policies with several length-control methods on multiple base models in a controlled setup and find that, across both mathematical reasoning and code generation, accuracy is non-monotonic in output length, peaking at an intermediate value. Mode accuracy, however, continues to improve with length even in settings where sample accuracy plateaus or declines, indicating that the non-monotonic length-accuracy relationship is driven by dispersion around an increasingly correct center.
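
The contrast between sample accuracy and mode accuracy can be illustrated with a small sketch. It assumes that "mode accuracy" means the correctness of the most frequent answer among several samples for the same prompt; the abstract does not define the term, so that reading, the toy data, and the function names are all hypothetical.

```python
from collections import Counter

def sample_accuracy(samples, gold):
    """Fraction of individual sampled answers that equal the gold answer."""
    return sum(ans == gold for ans in samples) / len(samples)

def mode_correct(samples, gold):
    """True if the most frequent sampled answer equals the gold answer."""
    mode_answer, _ = Counter(samples).most_common(1)[0]
    return mode_answer == gold

# Hypothetical data: five sampled answers for each of two prompts.
prompts = [
    {"gold": "42", "samples": ["42", "42", "41", "7", "42"]},
    {"gold": "x=3", "samples": ["x=3", "x=2", "x=3", "x=5", "x=1"]},
]

# Average per-sample correctness across prompts.
mean_sample_acc = sum(
    sample_accuracy(p["samples"], p["gold"]) for p in prompts
) / len(prompts)

# Fraction of prompts whose modal answer is correct.
mode_acc = sum(mode_correct(p["samples"], p["gold"]) for p in prompts) / len(prompts)

print(mean_sample_acc, mode_acc)
```

In this toy case the modal answer is right for both prompts even though only half of the individual samples are, which is the kind of gap that dispersion around a correct center would produce.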

2603.12261 2026-06-11 cs.LG cs.AI cs.CV 版本更新

The Latent Color Subspace: Emergent Order in High-Dimensional Chaos

潜在颜色子空间:高维混沌中的涌现秩序

Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata

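AI summary: Reveals an HSL (hue, saturation, lightness) structure in the color representation of the FLUX.1 VAE latent space and proposes a training-free method based on closed-form latent-space manipulation that predicts and explicitly controls the color of generated images.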

详情
Comments
Accepted at ICML 2026
AI中文摘要

文本到图像生成模型已取得快速进展,但实现对生成图像的细粒度控制仍然困难,这主要源于对语义信息编码方式的理解有限。我们开发了对FLUX.1 [Dev]变分自编码器潜在空间中颜色表示的解释,揭示了一种反映色相、饱和度和明度的结构。我们通过证明潜在颜色子空间(LCS)解释能够预测并显式控制颜色,验证了其有效性,并引入了一种完全无需训练的FLUX方法,该方法仅基于闭式潜在空间操作。代码可在该https URL获取。

英文摘要

Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation. Code is available at this https URL.

2605.04221 2026-06-11 cs.CL cs.AI 版本更新

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

面向隐私敏感的临床信息抽取的自提示小型语言模型

Yao-Shun Chuang, Tushti Mody, Uday Pratap Singh, Shirindokht Shiraz, Chun-Teh Lee, Ryan Brandon, Muhammad F Walji, Xiaoqian Jiang, Bunmi Tokede

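AI summary: To address named entity recognition in dental notes, which are unstructured, domain-specific, and privacy-sensitive, proposes a locally deployable self-prompting framework that combines multi-prompt ensemble inference with QLoRA-based fine-tuning and direct preference optimization; after DPO, Qwen2.5-14B-Instruct reaches micro/macro F1 scores of 0.864/0.837.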

详情
AI中文摘要

从牙科病程记录中进行临床命名实体识别具有挑战性,因为文档高度非结构化、领域特定且通常涉及隐私敏感信息。我们开发了一个本地可部署的框架,使小型语言模型能够自行生成、验证、完善和评估实体特定提示,以从牙科记录中提取多个临床实体。利用1,200份标注记录,我们通过多提示集成推理评估了候选开放权重模型,并进一步使用基于QLoRA的监督微调和直接偏好优化对选定模型进行调整。模型性能差异显著,凸显了需要针对特定任务进行评估而非依赖通用基准。Qwen2.5-14B-Instruct取得了最强的基线性能。经过DPO后,Qwen2.5-14B-Instruct和Llama-3.1-8B-Instruct分别达到了0.864/0.837和0.806/0.797的微/宏F1分数。这些发现表明,自动提示优化结合轻量级基于偏好的后训练可以支持使用本地部署的小型语言模型进行可扩展的临床信息抽取。

英文摘要

Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific, and often privacy-sensitive. We developed a locally deployable framework that enables small language models to self-generate, verify, refine, and evaluate entity-specific prompts for extracting multiple clinical entities from dental notes. Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization. Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks. Qwen2.5-14B-Instruct achieved the strongest baseline performance. After DPO, Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct achieved micro/macro F1 scores of 0.864/0.837 and 0.806/0.797, respectively. These findings suggest that automated prompt optimization combined with lightweight preference-based post-training can support scalable clinical information extraction using locally deployed small language models.
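
For readers comparing the micro and macro F1 figures, the following sketch shows how the two averages differ. The entity types and counts are made up for illustration and are not taken from the paper.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(counts):
    """counts maps entity type -> (tp, fp, fn)."""
    # Macro: average the per-type F1 scores, so every type weighs the same.
    per_type = [f1(*c) for c in counts.values()]
    macro = sum(per_type) / len(per_type)
    # Micro: pool the counts first, so frequent types dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro

# Hypothetical per-entity counts (not from the paper).
counts = {"tooth_number": (90, 10, 10), "procedure": (10, 10, 10)}
micro, macro = micro_macro_f1(counts)
print(round(micro, 3), round(macro, 3))
```

A micro score above the macro score, as in the reported results, typically indicates that the more frequent entity types are extracted more reliably than the rare ones.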

2605.12288 2026-06-11 cs.CL cs.AI 版本更新

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

TokenRatio: 通过比率匹配实现原理化的token级偏好优化

Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa Doan, Trung Le

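AI summary: Proposes TBPO, which recovers token-level preference optimality through ratio matching, improving alignment quality and training stability and increasing output diversity.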

详情
AI中文摘要

直接偏好优化(DPO)是一种广泛使用的无强化学习方法,用于对齐语言模型,但其在完整序列上建模偏好,尽管生成过程由逐token决策驱动。现有token级扩展通常将序列级Bradley-Terry目标分解到时间步,使前缀(状态级)最优性隐含。我们研究如何仅使用标准序列级成对比较恢复token级偏好最优性。我们引入token级Bregman偏好优化(TBPO),提出一个基于前缀的token级Bradley-Terry偏好模型,推导出Bregman散度密度比率匹配目标,该目标扩展了logistic/DPO损失,同时保持由token级模型诱导的最佳策略,并维持DPO-like的简洁性。我们引入两个实例:TBPO-Q,显式学习轻量级状态基线;TBPO-A,通过优势归一化移除基线。在指令跟随、有用性/无害性以及摘要基准上,TBPO相比强序列级和token级基线提高了对齐质量和训练稳定性,并增加了输出多样性。

英文摘要

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
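
The abstract does not spell out the TBPO objective, so the sketch below covers only the standard sequence-level DPO logistic loss that TBPO is described as generalizing. The function name, the `beta` default, and the example log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and under the
    frozen reference model; beta scales the implicit reward margin.
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: zero margin, loss is log(2).
loss_equal = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raises the chosen response and lowers the rejected one: lower loss.
loss_better = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(loss_equal, loss_better)
```

Note that the loss depends only on whole-sequence log-probabilities, which is the sequence-level modeling that token-level methods such as TBPO aim to refine.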

2606.08011 2026-06-11 cs.CL cs.AI 版本更新

Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

改写以翻译,翻译以奖励:机器翻译中源端改写的强化学习

Boxuan Lyu, Haiyue Song, Zhi Qu, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura

发表机构 * Institute of Science Tokyo(东京科学大学) Preferred Networks Inc(Preferred Networks 公司) Nara Institute of Science and Technology(奈良先端科学技术大学院大学)

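AI summary: Proposes RLSR, a framework that trains a source-rewriting model with reinforcement learning, using the gain in translation quality as the reward and requiring no per-MT-model prompt tuning; across 6 MT models and 16 language pairs it outperforms the no-rewriting and same-scale prompting baselines and is competitive with a 235B LLM prompting baseline.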

详情
AI中文摘要

尽管直接提示现成的大语言模型(LLM)生成保留意义的源端改写可以有效提升机器翻译(MT)质量,但这样做需要为不同的MT模型手动调整提示。在这项工作中,我们提出了RLSR(用于源端改写的强化学习),一种新颖的基于强化学习的框架,用于训练源端改写模型,而无需为每个MT模型调整提示。RLSR通过直接使用每个改写源端所带来的下游翻译质量的提升作为奖励来优化改写模型。跨六个MT模型和16个语言对的广泛实验表明,我们通过RLSR训练的4B改写模型显著优于无改写基线和现有的同规模基于提示的改写基线,同时与基于235B LLM的提示基线相比取得了具有竞争力的性能。

英文摘要

Rewriting source text with large language models (LLMs) before translation has been shown to improve machine translation (MT) quality. However, we find that prompt-based rewriting can degrade translation quality rather than improve it, particularly when smaller LLMs, such as 4B-parameter models, are used. We argue that this limitation stems from the difficulty of controlling rewriting behavior through natural-language prompts alone: a rewrite is useful only if it improves downstream translation, yet existing prompt-based methods do not explicitly optimize for this signal. To address this issue, we propose RLSR (Reinforcement Learning for Source Rewriting), a reinforcement learning framework that trains the rewriting model with a reward based on the downstream translation-quality improvement produced by each rewrite. Experiments across six MT systems and 16 language pairs show that our 4B RLSR-trained rewriting models significantly outperform both the no-rewriting baseline and prompt-based rewriting baselines at the same model scale, while remaining competitive with baselines that use a 235B LLM.
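
The reward idea can be sketched as the quality of the translated rewrite minus the quality of the translated original. The abstract does not name the quality metric or say whether it uses references, so the reference-based scoring, the toy translator, and every function name here are hypothetical stand-ins.

```python
def rewrite_reward(source, rewrite, reference, translate, quality):
    """Reward = quality(translate(rewrite)) - quality(translate(source))."""
    baseline = quality(translate(source), reference)
    rewritten = quality(translate(rewrite), reference)
    return rewritten - baseline

def toy_translate(text):
    """Hypothetical MT system: word-by-word lookup, unknown words pass through."""
    lexicon = {"hello": "hola", "world": "mundo"}
    return " ".join(lexicon.get(tok, tok) for tok in text.lower().split())

def toy_quality(hypothesis, reference):
    """Hypothetical metric: fraction of hypothesis tokens found in the reference."""
    hyp_tokens = hypothesis.split()
    ref_tokens = set(reference.split())
    return sum(tok in ref_tokens for tok in hyp_tokens) / max(len(hyp_tokens), 1)

# "howdy" is unknown to the toy translator; the rewrite replaces it with "hello".
reward = rewrite_reward(
    "howdy world", "hello world", "hola mundo", toy_translate, toy_quality
)
print(reward)
```

A positive reward means the rewrite helped the downstream translator; a negative one penalizes rewrites that hurt translation, which is the failure case the abstract reports for prompt-based rewriting with small models.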

2606.11074 2026-06-11 cs.CL cs.AI 版本更新

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

建模复杂行为:视觉语言模型中的多人格组合与动态切换

Peiqi Jia, Haonan Jia, Ziqi Miao, Linkang Du, Yuntao Wang, Zhou Su

发表机构 * Xi'an Jiaotong University(西安交通大学) Beihang University(北京航空航天大学)

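AI summary: Introduces explicit personality conditioning in vision-language models and builds a systematic evaluation framework covering single-personality induction, multi-personality induction, and personality switching; personality prompts improve image captioning but hurt tasks that need precise reasoning, and balancing and residual effects appear during multi-trait composition and dynamic switching.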

详情
Comments
16 pages, 4 figures, 10 tables
AI中文摘要

随着多模态大语言模型(MLLMs)在社交互动中的广泛部署,理解和控制其在复杂人格条件下的行为至关重要。本文引入显式人格条件,并建立了一个系统的评估框架,涵盖单人格诱导、多人格诱导和人格切换。实验表明,人格诱导能提升图像描述性能,但会损害需要精确推理的任务(如视觉问答)的性能。在多特质组合和动态切换过程中观察到平衡和残留效应,表明模型行为受到先前和当前人格约束的共同调节。现有的基于提示的人格诱导方法在多模态设置中表现出有限的迁移性。我们的工作揭示了MLLMs中人格建模的动态和复杂性质,并强调了针对人格诱导和评估的鲁棒、定制化方法的必要性。代码将在论文被接收后发布。

英文摘要

With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behavior under complex personality conditions is essential. This paper introduces explicit personality conditioning and establishes a systematic evaluation framework encompassing single-personality induction, multi-personality induction, and personality switching. Experiments show that personality induction improves image captioning performance but can impair performance on tasks requiring precise reasoning, such as visual question answering (VQA). Balancing and residual effects are observed during multi-trait composition and dynamic switching, indicating that model behavior is co-modulated by both previous and current personality constraints. Existing prompt-based personality induction methods show limited transferability to multimodal settings. Our work reveals the dynamic and complex nature of personality modeling in MLLMs and underscores the need for robust, tailored methods for personality induction and evaluation. The code will be released when the paper is accepted.

7. Robotics and Embodied Intelligence (16 papers)

2606.11324 2026-06-11 cs.RO cs.AI cs.LG 交叉投稿

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Embodied-R1.5:通过具身基础模型演化物理智能

Yifu Yuan, Yaoting Huang, Xianze Yao, Yutong Li, Shuoheng Zhang, Linqi Han, Pengyi Li, Jiangeng Sun, Wenting Jia, Zhao Zhang, Yuhao Liu, Ruihao Liao, Yucheng Hu, Qiyu Wu, Yuxiao Li, Zibin Dong, Fei Ni, Yan Zheng, Shuyang Gu, Yi Ma, Hongyao Tang, Han Hu, Jianye Hao

发表机构 * Tianjin University(天津大学) Tencent Hunyuan(腾讯混元)

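AI summary: Proposes Embodied-R1.5, a unified embodied foundation model built with automated data pipelines and multi-task balanced reinforcement learning; with 8B parameters it achieves state-of-the-art results on 16 of 24 benchmarks and can be fine-tuned into a VLA model.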

详情
Comments
Embodied R1.5 technical report. Project page: this https URL
AI中文摘要

我们介绍了Embodied-R1.5,一个统一的具身基础模型(EFM),它在单一架构中集成了全面的具身推理能力,涵盖具身认知、任务规划、修正和指向,旨在实现通用物理智能。利用三个自动化数据构建管道显著扩展关键能力的数据覆盖,我们构建了超过150亿token的大规模数据系统,并设计了多任务平衡的RL配方以缓解异构任务冲突。我们进一步引入了规划器-基础模型-修正器(PGC)闭环框架,使单一模型能够自主执行并在长时任务中进行自我修正。仅凭8B参数,Embodied-R1.5在24个具身VLM基准中的16个上达到了最先进水平,超越了Gemini-Robotics-ER-1.5和GPT-5.4等领先模型。得益于内化的具身能力,Embodied-R1.5只需少量数据即可微调为VLA,在4个流行的操作基准套件上优于$\pi_{0.5}$等领先VLA模型。我们进一步进行了广泛的零样本真实机器人实验,验证了在指令跟随、可供性基础、铰接物体操作和长时复杂任务中的性能,展示了向物理世界的强泛化能力。我们开源了模型权重、数据集、训练代码以及EmbodiedEvalKit(一个专为具身任务定制的评估框架),以促进EFM的未来研究。

英文摘要

We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like $\pi_{0.5}$ across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.

2606.11569 2026-06-11 cs.RO cs.AI 交叉投稿

ConsistencyPlanner: Real-time Planning with Fast-Sampling Consistency Models

ConsistencyPlanner: 基于快速采样一致性模型的实时规划

Qichao Zhang, Xing Fang, Jiaqi Fang, Zhenwen Cai, Jie Ling, Qiankun Yu, Dongbin Zhao

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所多模态人工智能系统国家重点实验室) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Guangzhou Zaofu Intelligent Technology Co., Ltd.(广州造父智能科技有限公司)

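AI summary: Proposes Consistency Planner, a real-time planning framework that uses fast-sampling consistency models for efficient multimodal sampling and an attention-enhanced decoder to fuse heterogeneous features, reporting stronger safety metrics than existing methods in the Waymax simulator.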

详情
AI中文摘要

在复杂真实驾驶场景中的闭环规划对自动驾驶系统构成了关键挑战。虽然传统的基于规则的方法是可解释的,但其预定义的启发式方法缺乏对动态交通环境的适应性。基于学习的方法已显示出巨大潜力。然而,基于学习的方法尽管有前景,但在建模多样化和多模态驾驶行为与实时规划之间难以平衡,常常导致犹豫不决或不安全的行动。为了解决这一限制,我们提出了Consistency Planner,一个具有快速采样一致性模型的实时规划框架。我们的方法基于两个关键技术贡献。高效多模态采样:我们采用快速采样一致性模型生成一组多样化的合理未来轨迹。这使得多模态行动的高效实时探索成为可能,克服了先前迭代生成方法的计算瓶颈。异构特征融合:我们引入了一个注意力增强解码器,将异构输入特征(包括场景特征和动作令牌)动态整合成一个连贯的表示,以实现稳健的规划。在Waymax模拟器中的广泛评估表明,与现有方法相比,在安全指标上具有优越性能,在具有挑战性的动态场景中尤其出色。

英文摘要

Closed-loop planning in complex, real-world driving scenarios presents a critical challenge for autonomous driving systems. While traditional rule-based methods are interpretable, their predefined heuristics lack the adaptability for dynamic traffic environments. Learning-based approaches have shown considerable promise, yet they struggle to balance modeling diverse, multimodal driving behaviors with real-time planning, often leading to indecisive or unsafe actions. To address this limitation, we propose Consistency Planner, a real-time planning framework with fast-sampling consistency models. Our approach is built upon two key technical contributions. Efficient Multimodal Sampling: We employ fast-sampling consistency models to generate a diverse set of plausible future trajectories. This enables efficient, real-time exploration of multimodal actions, overcoming the computational bottlenecks of previous iterative generative methods. Heterogeneous Feature Fusion: We introduce an attention-enhanced decoder that dynamically integrates heterogeneous input features (including scene features and action tokens) into a cohesive representation for robust planning. Extensive evaluation in the Waymax simulator demonstrates superior performance in safety metrics compared to existing methods, with particularly strong results in challenging dynamic scenarios.

2606.11628 2026-06-11 cs.RO cs.AI 交叉投稿

LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition

LUCID:从非结构化人类视频学习与具身无关的意图模型以实现可扩展的灵巧机器人技能获取

Harsh Gupta, Guanya Shi, Wenzhen Yuan

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Carnegie Mellon University(卡内基梅隆大学)

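AI summary: Proposes LUCID, a two-stage framework that learns task intent from internet-scale unstructured human videos and learns robot control in massively parallel simulation, enabling zero-shot transfer across embodiments and scenes.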

详情
AI中文摘要

目前最广泛采用的机器人学习流程通常从机器人演示或结构化人类数据中学习技能,这些数据收集成本高昂且与特定具身绑定。相比之下,非结构化人类视频提供了一种可扩展的替代方案。它们包含跨物体、场景和策略的多样化操作演示,但与机器人动作没有直接联系。我们提出LUCID,一个两阶段框架,从互联网规模数据集的非结构化人类视频中学习任务意图,并在大规模并行仿真中学习机器人控制。意图模型根据当前观测以闭环方式预测短时意图(场景中下一步应该发生什么)。一个具身特定的感觉运动策略将此意图转换为机器人动作。意图接口在控制器之间共享,因此相同的意图模型可应用于不同具身,从我们的主要灵巧手到平行夹爪。我们在五个真实世界操作任务上评估LUCID:搅拌、擦拭和分拣,仅由互联网视频监督,零样本迁移到新场景和物体实例;以及推T和电缆布线,各由1小时自收集智能手机视频监督。项目页面:此 https URL。

英文摘要

The most widely-adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive to collect and tied to specific embodiments. In contrast, unstructured human videos provide a scalable alternative. They contain diverse manipulation demonstrations across objects, scenes, and strategies, but are not directly connected to robot action. We propose LUCID, a two-stage framework that learns task intent from unstructured human videos drawn from internet-scale datasets and learns robot control in massively-parallel simulation. The intent model predicts short-horizon intent (what should happen next in the scene) from the current observation in closed loop. An embodiment-specific sensorimotor policy converts this intent into robot actions. The intent interface is shared across controllers, so the same intent model can be applied to different embodiments, from our primary dexterous hand to a parallel-jaw gripper. We evaluate LUCID on five real-world manipulation tasks: stirring, wiping, and binning supervised by only internet video, with zero-shot transfer to novel scenes and object instances; and push-T and cable routing supervised by 1 hr each of self-collected smartphone video. Project page: this https URL.

2606.11767 2026-06-11 cs.RO cs.AI 交叉投稿

Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning

通过真实到仿真到真实触觉策略学习的盲操作灵巧抓取

Shengcheng Luo, Xiyan Huang, Zhe Xu, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

发表机构 * ShanghaiTech University(上海科技大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院)

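AI summary: Proposes a framework that combines Real2Sim tactile calibration, a layout-aware tactile encoder, and a tactile-conditioned diffusion policy for tactile-only blind grasping with a dexterous hand, reaching a 27% success rate over 20 objects on a real robot.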

详情
Comments
23 pages, 6 figures
AI中文摘要

使用灵巧手进行盲抓取是一项关键的操作能力。然而,由于触觉的仿真到真实差距以及稀疏触觉信号的有限表达能力,为真实机器人学习这种仅依赖触觉的策略仍然具有挑战性。为了弥合这一差距,我们提出了一个仅依赖触觉的盲抓取框架,该框架可部署在物理多指机器人手上。我们的方法结合了三个关键组成部分。首先,我们引入了一个Real2Sim触觉校准流程,构建了一个接触校准的数字孪生模拟器,能够复现真实的触觉信号。其次,我们使用布局感知触觉编码器改进了稀疏触觉观测的表达能力,该编码器通过自监督预训练融入了传感器几何先验。第三,为了提高对未见物体的泛化能力,我们在校准后的模拟器中训练了特定物体的强化学习专家,并将其成功的抓取轨迹聚合为触觉条件扩散策略。我们在配备分布式触觉传感的物理LEAP手上评估了我们的方法,涉及10个见过和10个未见过的物体。部署的策略在所有20个物体上实现了27%的真实世界抓取成功率,无需真实世界的抓取演示或视觉输入。仿真消融实验表明,布局感知触觉预训练提高了抓取性能,而传感级评估确认Real2Sim校准增加了仿真与硬件之间触觉接触事件的一致性。这些结果表明,接触事件校准、几何感知触觉表示学习和基于扩散的策略聚合为真实灵巧机器人手上的仅触觉盲抓取提供了一条有效路径。项目页面:此HTTP URL。

英文摘要

Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robots remains challenging due to the tactile sim-to-real gap and the limited expressiveness of sparse tactile signals. To bridge this gap, we propose a framework for tactile-only blind grasping that is deployable on a physical multi-fingered robotic hand. Our approach combines three key components. First, we introduce a Real2Sim tactile calibration pipeline that constructs a contact-calibrated digital-twin simulator capable of reproducing real tactile signals. Second, we improve the expressiveness of sparse tactile observations using a layout-aware tactile encoder, which incorporates sensor-geometry priors through self-supervised pretraining. Third, to improve generalization to unseen objects, we train object-specific reinforcement-learning experts in the calibrated simulator and aggregate their successful grasp trajectories into a tactile-conditioned Diffusion Policy. We evaluate our method on a physical LEAP Hand equipped with distributed tactile sensing across 10 seen and 10 unseen objects. The deployed policy achieves a 27% real-world grasp success rate across all 20 objects, without real-world grasping demonstrations or visual input. Simulation ablations show that layout-aware tactile pretraining improves grasping performance, while sensing-level evaluations confirm that Real2Sim calibration increases the consistency of tactile contact events between simulation and hardware. Together, these results suggest that contact-event calibration, geometry-aware tactile representation learning, and diffusion-based policy aggregation provide an effective path toward tactile-only blind grasping on real dexterous robotic hands. Project page: this http URL.

2606.12109 2026-06-11 cs.RO cs.AI 交叉投稿

Bridging the Morphology Gap: Adapting VLA Models to Dexterous Manipulation via Intent-Conditioned Fine-Tuning

弥合形态差距:通过意图条件微调使VLA模型适应灵巧操作

Chuanke Pang, Junyi Huang, Zhijun Zhao, Yaobing Wang, Kun Xu, Xilun Ding

发表机构 * Beihang University(北京航空航天大学) China Academy of Space Technology(中国空间技术研究院)

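AI summary: Proposes InDex, which repurposes the pre-trained 1-DoF parallel-grasp output as a macroscopic virtual grasp-intent proxy and, with a two-stage decoupled learning architecture, adapts VLA models from low-DoF grippers to high-DoF dexterous hands while mitigating catastrophic forgetting and action-manifold collapse.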

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中展现了显著的零样本泛化能力,然而绝大多数预训练流程严格局限于低自由度平行夹爪。将这些丰富的语义先验适应到高自由度灵巧手引入了严重的形态差距,直接的端到端联合微调会由于数据稀缺而导致空间推理的灾难性遗忘和急性动作流形坍缩。在本文中,我们提出了InDex,一种新颖的、数据高效的适应框架,其根植于跨形态语义继承。我们不丢弃预训练的1-DoF平行抓取输出,而是将其重新用作连续的、宏观的虚拟抓取意图代理,以顺序化控制拓扑。我们实现了一个两阶段解耦学习架构:第一阶段参数高效地将VLA主干对齐以预测连续的臂轨迹和标量抓取意图;第二阶段冻结该空间主干,并利用一个意图条件去噪扩散头来解码多指末端执行器的细粒度关节运动。跨一系列多阶段、高接触灵巧操作任务的广泛模拟基准测试表明,InDex能够以最少的演示数据有效掌握复杂技能,显著优于整体基线,同时保留了原始VLA先验的鲁棒空间泛化能力。

英文摘要

Vision-Language-Action (VLA) models have demonstrated remarkable zero-shot generalization in robotic manipulation, yet the vast majority of pre-trained pipelines remain strictly confined to low-DoF parallel grippers. Adapting these rich semantic priors to high-DoF dexterous hands introduces a severe morphology gap: direct end-to-end joint fine-tuning inherently causes catastrophic forgetting of spatial reasoning and acute action manifold collapse due to data scarcity. In this paper, we present InDex, a novel, data-efficient adaptation framework rooted in cross-morphology semantic inheritance. Rather than discarding the pre-trained 1-DoF parallel grasp output, we repurpose it as a continuous, macroscopic virtual grasp intent proxy to sequentialize the control topology. We implement a two-stage decoupled learning architecture: the first stage parameter-efficiently aligns the VLA backbone to predict continuous arm trajectories and the scalar grasp intent; the second stage freezes this spatial backbone and leverages an intent-conditioned denoising diffusion head to decode fine-grained joint articulations for multi-fingered end-effectors. Extensive simulation benchmarks across a suite of multi-stage, contact-rich dexterous manipulation tasks demonstrate that InDex effectively masters intricate skills with minimal demonstration data, substantially outperforming monolithic baselines while preserving the robust spatial generalizability of the original VLA prior.

2606.12217 2026-06-11 cs.CV cs.AI cs.RO 交叉投稿

Making Foresight Actionable: Repurposing Representation Alignment in World Action Models

使远见可操作:在世界动作模型中重新利用表示对齐

Lu Qiu, Yizhuo Li, Yi Chen, Yuying Ge, Yixiao Ge, Xihui Liu

发表机构 * The University of Hong Kong(香港大学) XPENG Robotics(小鹏机器人)

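AI summary: To address the mismatch between visual prediction and action extraction in world action models, proposes AGRA, which aligns video diffusion features with semantic representations so that the action decoder attends to task-relevant regions, improving performance and generalization on manipulation tasks.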

详情
AI中文摘要

世界动作模型(WAM)通过使用视频生成模型在生成控制动作之前建模未来场景演变,为机器人操作提供了一条有前景的途径。然而,我们的实证观察揭示了一个现象:生成合理的视觉未来并不总能保证提取出准确的动作。为了诊断这一失败,我们进行了动作头注意力分析和因果干预。我们发现动作解码器未能聚焦于任务相关的交互区域,并且对任务无关区域的扰动保持敏感。这揭示了一种表示不匹配:为视觉重建优化的隐藏状态并未以适用于低级动作控制的形式组织。在本文中,我们提出了AGRA,一种动作接地表示对齐目标,通过将中间视频扩散特征与来自基础视觉编码器的空间连贯语义表示对齐,来正则化世界-动作接口。我们在真实世界的操作任务上评估了AGRA。实验表明,AGRA使世界模型表示更加动作接地:通过将动作解码器聚焦于正确的交互区域,它提高了物体定位精度和功能理解,并使策略对任务无关区域的扰动更加鲁棒。因此,AGRA在分布内性能和分布外泛化方面均持续优于基线世界动作模型。

英文摘要

World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution before producing control actions. However, our empirical observations reveal a phenomenon: generating plausible visual futures does not always guarantee the extraction of accurate actions. To diagnose this failure, we conduct action-head attention analysis and causal interventions. We find that the action decoder fails to focus on task-relevant interaction regions and remains sensitive to perturbations in task-irrelevant areas. This reveals a representation mismatch: hidden states optimized for visual reconstruction are not inherently organized in a form useful for low-level action control. In this paper, we propose AGRA, an Action-Grounded Representation Alignment objective that regularizes the world-action interface by aligning intermediate video diffusion features with spatially coherent semantic representations from a foundation visual encoder. We evaluate AGRA on real-world manipulation tasks. Experiments show that AGRA makes world model representations more action-grounded: by focusing the action decoder on the correct interaction regions, it improves object localization accuracy and affordance understanding, and makes the policy more robust to perturbations in task-irrelevant regions. As a result, AGRA consistently improves both in-distribution performance and out-of-distribution generalization over the baseline world action model.

2606.12365 2026-06-11 cs.RO cs.AI 交叉投稿

Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

环境扩散策略:从次优数据中进行机器人模仿学习

Adam Wei, Nicholas Pfaff, Thomas Cohn, Arif Kerem Dayı, Constantinos Daskalakis, Giannis Daras, Russ Tedrake

发表机构 * MIT(麻省理工学院)

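AI summary: Proposes Ambient Diffusion Policy, which extracts useful features from suboptimal data through noise-dependent data usage; it is validated on six tasks and outperforms existing co-training baselines by up to 33%.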

详情
Comments
14 pages (main body), 52 pages total. Project website: this https URL
AI中文摘要

我们提出环境扩散策略,一种从机器人次优数据中进行模仿学习的简单且原则性的方法。高质量、特定任务的机器人数据收集昂贵且耗时,而低质量或分布外演示的次优数据集则丰富。现有的在机器人中同时训练两种数据源的方法通常无法分离次优样本中的有意义和有害特征。相比之下,我们的方法通过引入机器人协同训练的新轴:噪声依赖的数据使用,仅提取有用特征。环境扩散策略在训练期间将次优数据的贡献限制在仅高和低扩散时间。为了严格证明我们的方法,我们首先观察到机器人动作数据表现出频谱幂律。这在我们利用的最优扩散策略上引出了两个重要性质:全局到局部层次结构和局部性。我们使用简化模型从理论上形式化这一讨论。我们的实验在六项任务上验证了环境扩散策略对四种类型的次优动作数据(噪声轨迹、模拟到现实差距、任务不匹配和大规模数据混合)的有效性。结果表明,它有效地从任意来源的次优数据中学习。值得注意的是,当扩展到Open X-Embodiment(一个具有异质数据质量和非结构化分布偏移的大规模数据集)时,它比现有协同训练基线高出33%。总体而言,环境扩散策略提高了次优演示的实用性,并扩展了机器人中可用数据源的范围。

英文摘要

We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources in robotics often fail to separate the meaningful and the harmful features in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage. Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This induces two important properties on the optimal Diffusion Policy that we exploit: a global-to-local hierarchy and locality. We theoretically formalize this discussion using a simplified model. Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment - a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.
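
Noise-dependent data usage can be pictured as a mask on the training loss: clean samples contribute at every diffusion time, while suboptimal samples contribute only at low and high times. This is a minimal sketch of that idea; the thresholds `t_low` and `t_high`, the normalized time range [0, 1], and the function names are assumptions rather than values from the paper.

```python
def uses_sample(is_suboptimal, t, t_low=0.2, t_high=0.8):
    """Decide whether a sample contributes to the loss at diffusion time t.

    Clean data is always used; suboptimal data only at low or high times.
    The thresholds are hypothetical placeholders.
    """
    if not is_suboptimal:
        return True
    return t <= t_low or t >= t_high

def masked_mean_loss(batch):
    """batch: list of (per_sample_loss, is_suboptimal, t) tuples."""
    kept = [loss for loss, is_sub, t in batch if uses_sample(is_sub, t)]
    return sum(kept) / len(kept) if kept else 0.0

# The suboptimal sample at mid-range time t=0.5 is dropped from the loss.
batch = [(1.0, False, 0.5), (3.0, True, 0.5), (2.0, True, 0.9)]
loss = masked_mean_loss(batch)
print(loss)
```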

2606.12402 2026-06-11 cs.RO cs.AI cs.CV 交叉投稿

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

DIRECT: 在具身规划器中何时何地分配测试时计算?

Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu, Chelsea Finn, Marco Pavone

发表机构 * Stanford University(斯坦福大学) University of Waterloo(滑铁卢大学) NVIDIA(英伟达)

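AI summary: Proposes DIRECT, a routing framework that allocates compute per prompt based on multimodal scene context and improves the success-cost Pareto frontier; experiments show that different scaling axes yield different capability gains, and on a physical robot the router matches or exceeds a stronger model at lower latency.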

详情
AI中文摘要

视觉语言模型(VLM)越来越多地被部署为具身智能体的高层规划器,一种新兴策略是扩展测试时计算以提高能力。然而,我们观察到这样做会增加延迟、令牌使用和FLOPs,同时在下游任务中产生不均匀且往往递减的收益,限制了具身智能体的部署范围。我们认为,选择何时何地花费测试时计算是将前沿性能带入现实世界的关键。我们引入了DIRECT,一个路由框架,利用多模态场景上下文按提示分配计算资源,在固定模型选择上改进了成功-成本帕累托前沿。在三种主要的缩放轴(即思维链深度、模型大小和记忆历史)上,我们在VLABench和RoboMME上的实验表明,测试时计算并非均匀的杠杆:不同的轴产生性质不同的能力增益。我们在DROID设置中的物理Franka机械臂上验证了这些见解,涵盖了零样本操作和长程链式任务,我们的路由器以高达65%的平均延迟降低匹配或超过了更强模型的成功率。最终,我们的结果表明,天真地扩展测试时计算是浪费的,而DIRECT能够以极低的成本在机器人系统中提供前沿级别的具身规划。项目页面可在此http URL找到。

英文摘要

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success--cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. Project page can be found at this http URL.
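
The success-cost Pareto frontier mentioned above can be computed with a simple dominance check. The sketch does not implement the DIRECT router itself; the configuration names and numbers are invented for illustration.

```python
def pareto_frontier(points):
    """Return names of configurations not dominated by any other.

    points: list of (name, cost, success). A point is dominated when another
    has cost <= and success >=, with at least one strict inequality.
    """
    frontier = []
    for name, cost, success in points:
        dominated = any(
            (c <= cost and s >= success) and (c < cost or s > success)
            for _, c, s in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical planner configurations: (name, average latency, success rate).
configs = [
    ("small", 1.0, 0.60),
    ("small+cot", 2.0, 0.58),
    ("large", 5.0, 0.80),
    ("large+cot", 9.0, 0.80),
]
frontier = pareto_frontier(configs)
print(frontier)
```

In this toy example adding chain-of-thought costs more without raising success, so both `+cot` variants fall off the frontier, mirroring the abstract's point that extra test-time compute is not uniformly useful.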

2606.12406 2026-06-11 cs.RO cs.AI cs.LG eess.SY 交叉投稿

FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning

FACTR 2: 学习商用机器人手臂的外部力感知提升策略学习

Steven Oh, Jason Jingzhou Liu, Tony Tao, Philip Han, Kenneth Shaw, Satoshi Funabashi, Ruslan Salakhutdinov, Deepak Pathak

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Waseda University(早稻田大学)

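AI summary: Proposes NEXT, a data-driven method that needs no dedicated force sensors, trains in 1 minute from 10 minutes of free-motion data, and achieves estimates comparable to dedicated joint-torque sensors; paired with the FIRST resampling strategy, it improves policy learning.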

详情
Comments
Website at this https URL
AI中文摘要

接触丰富的操作需要力敏感性,但由于成本高昂,许多机器人手臂缺乏专用的力传感器。我们提出了神经外部力矩估计(NEXT),一种无需任何专用力传感器即可估计外部关节力矩的数据驱动方法。NEXT 仅需 10 分钟的自由运动数据即可在 1 分钟内完成训练,却能实现与专用关节力矩传感器相当的估计。NEXT 能够在低成本手臂上实现力反馈遥操作,并通过力信息重采样训练(FIRST)改进策略学习,该训练在行为克隆过程中对预接触和接触段进行上采样。在五个长时域任务中,FIRST 在任务进展上比先前的力感知策略提高了超过 17%。NEXT 和 FIRST 共同将力感知遥操作和策略学习引入现成的机器人,无需额外的传感硬件。视频结果和代码可在 https://this URL 获取。

英文摘要

Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at this https URL
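
The up-sampling step in FIRST can be sketched as per-timestep sampling weights that are boosted during contact and shortly before it. The contact threshold, the pre-contact window, the boost factor, and the torque values below are hypothetical; the abstract states only that pre-contact and contact segments are up-sampled.

```python
import random

def first_weights(torque_norms, threshold=1.0, pre_contact_steps=2, boost=3.0):
    """Sampling weight per timestep, boosted in contact and just before it.

    torque_norms: estimated external-torque magnitude at each timestep.
    All parameter defaults are illustrative assumptions.
    """
    weights = [1.0] * len(torque_norms)
    for i, tau in enumerate(torque_norms):
        if tau > threshold:  # timestep i is treated as in contact
            for j in range(max(0, i - pre_contact_steps), i + 1):
                weights[j] = boost
    return weights

# Hypothetical torque magnitudes along one demonstration.
torques = [0.1, 0.2, 0.1, 0.3, 1.5, 2.0, 0.2]
weights = first_weights(torques)

# Draw a training minibatch of timestep indices with the boosted weights.
rng = random.Random(0)
indices = rng.choices(range(len(torques)), weights=weights, k=4)
print(weights, indices)
```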

2505.03296 2026-06-11 cs.RO cs.AI cs.LG 版本更新

The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning

离散时间高斯过程混合在机器人策略学习中的惊人有效性

Jan Ole von Hartz, Adrian Röfer, Joschka Boedecker, Abhinav Valada

AI总结 提出MiDiGap方法,利用少量演示和相机观测,通过离散时间高斯过程混合实现机器人操作策略的灵活表示与模仿学习,在长时域、高约束、动态和多模态任务上取得SOTA性能,并支持推理时引导。

Comments
Submitted for publication to IEEE Transactions on Robotics
AI中文摘要

我们提出了离散时间高斯过程混合(MiDiGap),一种用于机器人操作中灵活策略表示和模仿学习的新方法。MiDiGap仅使用相机观测,即可从少至五次演示中学习,并在一系列具有挑战性的任务中泛化。它在长时域行为(如泡咖啡)、高约束运动(如开门)、动态动作(如用铲子舀取)和多模态任务(如挂杯子)上表现出色。MiDiGap在CPU上不到一分钟即可学习这些任务,并线性扩展到大型数据集。我们还开发了一套丰富的推理时引导工具,利用碰撞信号和机器人运动学约束等证据。这种引导实现了新颖的泛化能力,包括避障和跨本体策略迁移。MiDiGap在多样化的少样本操作基准上达到了最先进的性能。在受约束的RLBench任务上,它将策略成功率提高了76个百分点,并将轨迹成本降低了67%。在多模态任务上,它将策略成功率提高了48个百分点,并将样本效率提高了20倍。在跨本体迁移中,策略成功率提高了一倍以上。我们已公开代码,见 this https URL。

英文摘要

We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learning in robot manipulation. MiDiGap enables learning from as few as five demonstrations using only camera observations and generalizes across a wide range of challenging tasks. It excels at long-horizon behaviors such as making coffee, highly constrained motions such as opening doors, dynamic actions such as scooping with a spatula, and multimodal tasks such as hanging a mug. MiDiGap learns these tasks on a CPU in less than a minute and scales linearly to large datasets. We also develop a rich suite of tools for inference-time steering using evidence such as collision signals and robot kinematic constraints. This steering enables novel generalization capabilities, including obstacle avoidance and cross-embodiment policy transfer. MiDiGap achieves state-of-the-art performance on diverse few-shot manipulation benchmarks. On constrained RLBench tasks, it improves policy success by 76 percentage points and reduces trajectory cost by 67%. On multimodal tasks, it improves policy success by 48 percentage points and increases sample efficiency by a factor of 20. In cross-embodiment transfer, it more than doubles policy success. We make the code publicly available at this https URL.

2602.20958 2026-06-11 cs.RO cs.AI 版本更新

EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

基于EKF的深度相机与深度学习融合用于搜救任务中无人机-人员距离估计与跟随

Luka Šiktar, Branimir Ćaran, Bojan Šekoranja, Marko Švaco

AI总结 提出融合深度相机测量和单目相机人体距离估计的EKF方法,利用YOLO-pose实现实时融合,提高无人机跟随中距离估计的精度和鲁棒性,在三个测试场景中将距离估计误差最多降低15.3%。

Comments
This work has been submitted to the IEEE for possible publication
AI中文摘要

基于视觉的无人机框架通过检测和识别特定个体,然后跟踪并跟随它们,同时保持安全距离,来辅助人类搜索任务。无人机跟随的一个关键安全要求是在现实条件下准确估计相机与目标物体之间的距离,这通过融合多种图像模态来实现。作为使用深度学习进行自动人员检测和面部识别系统的一部分,本文提出了融合深度相机测量和单目相机到人体距离估计的方法,以实现鲁棒的跟踪和跟随。使用YOLO-pose实现了基于深度学习的深度相机数据滤波和从单目相机估计相机到人体距离,从而利用扩展卡尔曼滤波算法实现深度信息的实时融合。所提出的子系统设计用于无人机,估计和测量深度相机与人体关键点之间的距离,以保持无人机与人类目标之间的安全距离。我们的系统提供了准确的距离估计,并已通过运动捕捉地面真值数据进行了验证。该系统已在室内实时测试,在三个测试场景中,距离估计的平均误差、均方根误差和标准差降低了高达15.3%。基于测试结果,基于EKF融合的方法通过减少深度相机最佳工作范围之外的误差,增加了深度检测范围。它还在具有挑战性的条件下(如反射和能见度差)显示出改进的鲁棒性和精度,使其适用于搜救任务。

英文摘要

Vision-based Unmanned Aerial Vehicle (UAV) frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is the accurate estimation of the distance between camera and target object under real-world conditions, achieved by fusing multiple image modalities. As part of the system for automatic people detection and face recognition using deep learning, in this paper we present the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. Deep learning based filtering of depth camera data and estimation of camera-to-body distance from a monocular camera are achieved with YOLO-pose, enabling real-time fusion of depth information using the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints, to maintain a safe distance between the drone and the human target. Our system provides an accurate estimated distance, which has been validated against motion capture ground truth data. The system has been tested in real time indoors, where it reduces the average error, RMSE, and standard deviation of distance estimation by up to 15.3% in three tested scenarios. Based on the test results, the EKF fusion-based approach increases the depth detection range by reducing the errors outside the optimal depth camera working range. It also shows improved robustness and precision in challenging conditions, such as reflections and poor visibility, making it suitable for search and rescue (SAR).
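
The fusion step can be illustrated with a deliberately simplified filter. The sketch below is not the paper's EKF: it assumes a scalar state (the camera-to-person distance), a constant-distance motion model, and direct distance measurements, in which case the extended filter reduces to an ordinary Kalman filter. The function name `kalman_fuse` and all variances are hypothetical values chosen for the example.

```python
def kalman_fuse(x, p, measurements, process_var=0.01):
    """One predict/update cycle of a scalar Kalman filter.

    x, p          -- prior distance estimate (m) and its variance
    measurements  -- list of (value, variance) pairs; a sensor without a
                     valid reading in this frame is left out of the list
    """
    # Predict: constant-distance model, uncertainty grows by process_var.
    p = p + process_var
    # Update: fold in each available measurement sequentially.
    for z, r in measurements:
        k = p / (p + r)      # Kalman gain
        x = x + k * (z - x)  # move the estimate toward the measurement
        p = (1.0 - k) * p    # posterior variance shrinks
    return x, p


# Frame 1: depth camera (low variance) and monocular estimate (high variance).
x, p = kalman_fuse(3.0, 1.0, [(2.9, 0.04), (3.3, 0.25)])
# Frame 2: depth reading missing (e.g. out of range), monocular estimate only.
x2, p2 = kalman_fuse(x, p, [(3.2, 0.25)])
```

When the depth reading drops out, the filter keeps running on the monocular estimate alone, which is one way to read the abstract's claim that fusion extends the usable range beyond the depth camera's optimal working range.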

2604.13733 2026-06-11 cs.LG cs.AI cs.RO 版本更新

Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents

视觉-语言-动作跳跃启动用于强化学习机器人智能体

Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda

AI总结 提出VLAJS方法,通过稀疏的VLA高层动作建议引导PPO探索,结合方向性动作一致性正则化,提升强化学习在长时域操作任务中的样本效率,并在仿真和真实机器人上验证。

Comments
ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning
AI中文摘要

强化学习(RL)能够实现机器人操作的高频闭环控制,但由于探索效率低下和信用分配不佳,在稀疏或不完美奖励的长时域任务中难以扩展。视觉-语言-动作(VLA)模型利用大规模多模态预训练提供通用任务级推理,但当前限制阻碍其直接用于快速精确操作。本文提出视觉-语言-动作跳跃启动(VLAJS),一种将稀疏VLA引导与在线策略RL相结合的方法,以改善探索和学习效率。VLAJS将VLA视为高层动作建议的瞬态来源,偏置早期探索并改善信用分配,同时保留RL的高频状态基控制。我们的方法用方向性动作一致性正则化增强近端策略优化(PPO),在早期训练中软对齐RL智能体的动作与VLA引导,而不强制严格模仿、需要演示或依赖持续教师查询。VLA引导稀疏应用并随时间退火,使智能体在线适应并最终超越引导策略。我们在六个挑战性操作任务上评估VLAJS:仿真中的提升、拾取与放置、销钉重定向、销钉插入、戳和推,并在真实Franka Panda机器人上验证子集。VLAJS在样本效率上持续优于PPO和蒸馏式基线,在多个任务中将所需环境交互减少超过50%。真实世界实验展示了零样本仿真到真实迁移以及在杂乱、物体变化和外部扰动下的鲁棒执行。

英文摘要

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
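
One plausible form of a directional action-consistency term is sketched below. The abstract does not give the exact loss, so everything here is an assumption for illustration: the cosine-based penalty, the linear annealing schedule, and the names `cosine_similarity`, `guidance_weight`, and `directional_penalty`. The penalty would be added to the usual PPO loss, only on steps where a VLA suggestion is available.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two action vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def guidance_weight(step, anneal_steps, w0=1.0):
    """Linearly anneal the guidance weight from w0 to 0 (assumed schedule)."""
    return w0 * max(0.0, 1.0 - step / anneal_steps)


def directional_penalty(agent_action, vla_action, step, anneal_steps,
                        has_guidance=True):
    """Penalize disagreement in direction only, not in magnitude."""
    if not has_guidance:  # guidance is sparse: most steps carry no term
        return 0.0
    misalignment = 1.0 - cosine_similarity(agent_action, vla_action)
    return guidance_weight(step, anneal_steps) * misalignment


aligned = directional_penalty([1.0, 0.0], [2.0, 0.0], step=0, anneal_steps=1000)
opposed = directional_penalty([1.0, 0.0], [-1.0, 0.0], step=0, anneal_steps=1000)
late = directional_penalty([1.0, 0.0], [-1.0, 0.0], step=1000, anneal_steps=1000)
```

Because only the direction is compared, the agent stays free to choose its own action magnitude, and once the weight has annealed to zero the objective is plain PPO again.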

2510.14828 2026-06-11 cs.AI cs.RO 版本更新

RoboGPT-R1: Enhancing Robot Task Planning with Reinforcement Learning

RoboGPT-R1: 通过强化学习增强机器人任务规划

Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li

AI总结 提出RoboGPT-R1两阶段微调框架,先监督学习获取基础知识,再通过强化学习提升视觉空间理解和推理能力,在EmbodiedBench上超越GPT-4o-mini 21.33%。

Journal ref
Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), pp. 2827-2837, IFAAMAS, 2026
AI中文摘要

提高具身智能体的推理能力对于机器人在长时域操作任务中成功完成复杂的人类指令至关重要。尽管基于监督微调(SFT)的大语言模型和视觉语言模型在规划任务中取得了成功,但由于其常识和推理能力受限,它们在复杂现实环境中执行长时域操作任务时仍面临挑战。考虑到通过监督微调将通用视觉语言模型对齐到机器人规划任务存在泛化能力差和物理理解不足的问题,我们提出了RoboGPT-R1,一个用于具身规划的两阶段微调框架。在该框架中,监督训练通过专家序列获取基础知识,随后通过强化学习解决模型在视觉空间理解和推理方面的不足。为了实现多步推理任务中的物理理解和动作序列一致性,我们设计了一个基于规则的奖励函数,同时考虑了长时域性能和环境中的动作约束。基于Qwen2.5-VL-3B训练的推理模型在EmbodiedBench基准上以21.33%的优势显著超越更大规模的GPT-4o-mini,并以20.33%的优势超过其他基于Qwen2.5-VL-7B训练的工作。

英文摘要

Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue facing challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraint in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.

2606.08530 2026-06-11 cs.RO cs.AI 版本更新

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

GEAR-VLA:学习几何感知的动作表示以实现可泛化的机器人操作

Yuan Zhang, Shiqi Zhang, Yedong Shen, Shuai Dong, Jiajun Deng, Xin Zhang, Yuxuan Gao, Jiajia Wu, Xin Nie, Zhiyuan Cheng, Jianmin Ji, Yanyong Zhang, Xingyi Zhang, Jia Pan

发表机构 * Anhui University(安徽大学) University of Science and Technology of China(中国科学技术大学) iFLYTEK(科大讯飞)

AI总结 提出GEAR-VLA框架,通过粗到细的动作学习、语义对齐的3D集成和具身规范化,学习统一的几何感知动作表示,实现跨物体、背景和机器人的泛化操作。

AI中文摘要

视觉-语言-动作(VLA)模型在基准测试中表现强劲,但在实际部署中仍难以应对未见过的物体、背景变化和不同的机器人本体。我们认为这源于缺乏统一的几何感知操作表示,使得现有VLA容易受到低级轨迹监督、不对齐的3D特征和本体差异的影响。为此,我们提出GEAR-VLA,一个用于学习统一几何感知动作表示以实现可泛化机器人操作的VLA框架。GEAR-VLA采用粗到细的动作学习,其中多源具身预训练赋予VLM具身推理和离散动作理解能力,随后潜在动作标记将动作语义连接到梯度解耦的DiT连续动作专家。它通过将可训练的3D空间骨干与VLA表示对齐,同时冻结原始VLM对齐的视觉通路,进一步执行语义对齐的3D集成。为了跨机器人共享该表示,GEAR-VLA使用具身规范化,其中具身感知状态和具身不变动作将机器人差异限制在低级接口。大量的仿真和真实实验证明了强大的泛化能力:GEAR-VLA在LIBERO、零样本LIBERO-Plus和RoboTwin 2.0上达到了最先进的性能,在AgileX上达到85.9%的成功率,在预训练未见过的LDT-01本体上达到81.0%,并在包含212个未见物体的6,360次试验通用抓取基准上获得90.1%的成功率。代码和模型将在https://github.com/babynabeauty/GEAR-VLA发布。

英文摘要

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at this https URL.

2606.10135 2026-06-11 cs.CV cs.AI 版本更新

BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression

BiWM:利用双向自回归推进开源交互式视频世界模型

Shaohao Rui, Xiaofeng Mao, Zhanyu Zhang, Peijia Lin, Yansong Zhu, Yibo Zhang, Haibin Wan, Weijie Ma

AI总结 提出BiWM框架,通过双向自回归范式将预训练视频骨干转化为交互式世界模型,仅需两阶段训练(微调+分布匹配蒸馏),支持多尺度模型和长程生成,优于现有因果流水线。

Comments
After the paper was posted, we discovered that several visualization results were produced using wrong configuration settings during runtime. This error affects the reliability of the presented visual comparisons. Additionally, further optimization of the design is needed. We therefore request to withdraw this version and will submit a corrected and improved version later
AI中文摘要

将双向视频扩散模型过渡到自回归范式提高了视频世界模型的交互性,但现有的因果流水线需要多个阶段(控制微调、自回归训练、因果初始化、少步蒸馏),并且由于误差累积,质量仍落后于双向模型。最近的世界模型如Yume-1.5和Matrix-Game-3.0采用双向自回归方法,通过自我纠正误差传播获得保真度和稳定的长程展开,但开源框架(如minWM)仅支持因果模型。我们提出BiWM,这是首个在双向自回归范式下用于交互式视频世界模型的全栈框架,联合优化生成质量和推理速度。从预训练视频骨干开始,BiWM通过微调注入相机控制,然后运行一个少步分布匹配蒸馏(DMD)阶段,将骨干转化为动作/相机可控的世界模型:仅需两个训练阶段(而非minWM的四个),在8xH200 GPU上几百步内收敛。单一方案覆盖Wan2.1-1.3B、Wan2.2-5B、HunyuanVideo-1.5-8B和LTX-2.3-22B,并支持现有双向模型的二次微调。BiWM能够在minWM失去可控性的场景下实现真实世界的相机控制,集成了可插拔历史压缩(FramePack风格和PackForcing风格)用于长程展开,并提供可选的NVFP4 4位训练/推理流水线。为对抗DMD的模式寻求退化,我们添加了GAN目标和质量覆盖(mass-covering)的前向KL目标,以保留场景动态。我们开源BiWM,用于资源受限的研究和高保真环境模拟。

英文摘要

Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but existing causal pipelines need many stages (control fine-tuning, autoregressive training, causal initialization, few-step distillation) and still trail bidirectional models in quality due to error accumulation. Recent world models such as Yume-1.5 and Matrix-Game-3.0 instead adopt a bidirectional autoregressive approach, gaining fidelity and stable long-horizon rollout from self-correcting error propagation, yet open-source frameworks (e.g., minWM) support only causal models. We present BiWM, the first full-stack framework for interactive video world models under the bidirectional autoregressive paradigm, jointly optimizing generation quality and inference speed. From a pretrained video backbone, BiWM injects camera control by fine-tuning, then runs a few-step Distribution Matching Distillation (DMD) stage that turns the backbone into an action/camera-controllable world model: just two training stages instead of four in minWM, converging in a few hundred steps on 8xH200 GPUs. A single recipe spans Wan2.1-1.3B, Wan2.2-5B, HunyuanVideo-1.5-8B, and LTX-2.3-22B, and also supports secondary fine-tuning of existing bidirectional models. BiWM enables real-world camera control where minWM loses controllability, integrates pluggable history compression (FramePack-style and PackForcing-style) for long rollouts, and offers an optional NVFP4 4-bit training/inference pipeline. To counter DMD's mode-seeking degradation, we add GAN and mass-covering forward-KL objectives that preserve scene dynamics. We open-source BiWM for resource-constrained research and high-fidelity environment simulation.

2606.11092 2026-06-11 cs.RO cs.AI 版本更新

RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning

RoboNaldo:通过运动引导课程强化学习实现精准、稳定且强力的人形足球射门

Yichao Zhong, Yidan Lu, Yuhang Lu, Tianyang Tang, Haoguang Mai, Yixuan Pan, Tianyu Li, Li Chen, Jingbo Wang, Zhongyu Li, Peng Lu, Hongyang Li

发表机构 * The University of Hong Kong(香港大学) The Chinese University of Hong Kong(香港中文大学) Archon Robotics

AI总结 提出三阶段运动引导课程强化学习框架RoboNaldo,以单个人类踢球参考动作为支架逐步优化射门性能,在仿真中任意球射门误差降低48.6%、射门速度达到基线的2.96倍,真实机器人上3米外平均射门误差为0.73-0.86米,触球后球速达13.10米/秒。

AI中文摘要

精英级人形足球射门需要全身稳定性、高冲量全身交互以及目标精度。运动跟踪驱动的强化学习提供了全身运动协调的稳定性,但固定参考难以适应不同的球位和击球时机;相比之下,任务奖励驱动的强化学习难以从零开始探索和发现有效的踢球动作。因此,我们引入了RoboNaldo,一个用于高冲量人形交互的三阶段运动引导课程强化学习框架。该框架以单个人类踢球参考动作为支架,并逐步将优化转向射门性能。课程首先学习稳定的全身踢球先验,然后使踢球适应球静止于随机位置的任意球场景,最后通过运动指令和踢球触发接口扩展到移动球射门。训练期间,一个高级启发式规划器控制该接口,而推理时其他高级控制器可驱动相同的低级策略。在仿真中,RoboNaldo的任意球射门误差比先前工作基线低48.6%,射门速度为其2.96倍。在真实世界中,使用搭载机载感知的宇树G1,RoboNaldo在3米距离的任意球和移动球情况下,平均目标射门误差分别为0.73米和0.86米。触球后球速达到13.10米/秒,为已报道的职业比赛运动战射门球速的59-71%。项目页面:this https URL。

英文摘要

Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion tracking-driven reinforcement learning (RL) provides stability in whole-body movement coordination, but a fixed reference makes it hard to adapt to varied ball positions and strike timings; in contrast, task reward-driven RL struggles to explore and discover valid kicks from scratch. We therefore introduce RoboNaldo, a three-stage motion-guided curriculum RL framework for high-impulse humanoid interaction. A single human-kick reference is used as a scaffold, and optimization progressively shifts towards shooting performance. The curriculum first learns a stable whole-body kicking prior, then adapts the kick to free-kick settings where the ball is stationary at random positions, and finally extends it to moving-ball shooting through a locomotion-command and kick-trigger interface. A high-level heuristic planner controls this interface during training, while alternative high-level controllers can drive the same low-level policy at inference. In simulation, RoboNaldo achieves a free-kick shot error 48.6% lower than prior-work baselines and a shot velocity 2.96x theirs. In the real world, on a Unitree G1 with onboard perception, RoboNaldo attains 0.73 m and 0.86 m average target shooting error from 3 m away in the free-kick and moving-ball cases, respectively. The post-contact ball velocity reaches 13.10 m/s, which is 59-71% of reported professional open-play shot speed. Project page: this https URL.

8. 可信、安全与AI治理 50 篇

2606.11522 2026-06-11 cs.AI cs.LG 新提交

Search Discipline for Long-Horizon Research Agents

长周期研究智能体的搜索纪律

Adithya Srinivasan, Devesh Paragiri

发表机构 * North Carolina State University(北卡罗来纳州立大学) University of Maryland(马里兰大学)

AI总结 针对研究智能体使用聚合指标评估候选方案导致科学有效性反转的问题,提出一种外部审计协议,基于分解行为而非单一分数进行决策。

Comments
9 pages, 1 figure
AI中文摘要

自主研究智能体现在根据指标提出、评估和选择科学候选方案,该指标通常是在区域、切片或队列的异质空间上聚合的简化值。我们表明,当科学有效性存在于这种分解结构中时,聚合值可能错误地将候选方案排在首位。总体数字改善,但底层结构反转,因此基于该数字的决策会接受一个悄然破坏模型的候选方案。这种失败并非领域特定,只要候选方案的有效性是多维的,而其验证器是单一简化值,就会出现。我们在Ecosystem Demography模型中的火灾模型任务上展示了这种反转。得分最高的候选方案和略低的候选方案在全球得分上处于噪声范围内,但得分最高的候选方案破坏了受保护的北方区域,而另一个则保护了它们。区分它们的是每个区域的行为,而不是总体数字。这个决策不应留给产生候选方案的智能体。优化分数的智能体是最不可能发现分数错误的一方,一旦智能体停止,提示就没有剩余轮次。我们将决策移到一个外部控制循环,该循环根据每个候选方案的分解行为进行审计,并在智能体决策后采取行动。它可以降级智能体本会接受的候选方案,也可以重新打开智能体声明已完成的运行。我们的贡献在于反转发现本身,以及一种搜索纪律协议,该协议基于可审查的候选效果证据而非分数进行决策。

英文摘要

Autoresearch agents now propose, evaluate, and select scientific candidates against a metric, and that metric is usually an aggregate reduced over a heterogeneous space of regions, slices, or cohorts. We show that when scientific validity lives in that disaggregated structure, the aggregate can rank the wrong candidate first. The headline number improves while the structure underneath inverts, so a decision made on the number accepts a candidate that quietly breaks the model. The failure is not domain-specific. It appears wherever a candidate's validity is multi-dimensional but its verifier is a single reduction. We demonstrate the inversion on a fire-model task in the Ecosystem Demography model. The highest-scoring candidate and a slightly lower one are within noise of each other on global score, yet the top-scoring one collapses the protected boreal regions while the other preserves them. What separates them is the per-region behavior, not the headline number. This decision should not be left to the agent that produced the candidates. The agent optimizing the score is the last party likely to catch the score being wrong, and a prompt has no remaining turn once the agent has stopped. We move the decision to an external control loop that audits each candidate on its disaggregated behavior and acts after the agent has decided. It can demote a candidate the agent would have accepted, and it can reopen a run the agent had declared finished. Our contribution is the inversion finding itself, and a search-discipline protocol that decides on reviewable candidate-effect evidence instead of the score.
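
The inversion described in this abstract is easy to reproduce on toy numbers. The sketch below is illustrative only: the region names, the scores, the `max_drop` tolerance, and the helper names `aggregate`, `audit`, and `select` are all invented for the example and are not taken from the paper. It shows an aggregate score ranking one candidate first while a per-region audit, run outside the agent, demotes it.

```python
def aggregate(scores):
    """Headline metric: mean score over all regions."""
    return sum(scores.values()) / len(scores)


def audit(candidate, baseline, protected, max_drop=0.05):
    """List protected regions where the candidate falls more than
    max_drop below the baseline (hypothetical audit rule)."""
    return [r for r in protected if baseline[r] - candidate[r] > max_drop]


def select(candidates, baseline, protected):
    """External control loop: walk candidates in aggregate order and
    return the first one that passes the per-region audit."""
    ranked = sorted(candidates, key=lambda n: aggregate(candidates[n]),
                    reverse=True)
    for name in ranked:
        if not audit(candidates[name], baseline, protected):
            return name
    return None  # nothing passes: reopen the run instead of accepting


baseline = {"tropical": 0.60, "temperate": 0.60, "boreal": 0.70}
candidates = {
    "A": {"tropical": 0.90, "temperate": 0.85, "boreal": 0.30},
    "B": {"tropical": 0.70, "temperate": 0.68, "boreal": 0.66},
}
chosen = select(candidates, baseline, protected=["boreal"])
```

Candidate A wins on the mean but collapses the protected region, so the audit skips it and the loop settles on B. Returning `None` when no candidate passes corresponds to the abstract's point that the external loop can reopen a run the agent had declared finished.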

2606.11769 2026-06-11 cs.AI cs.LG 新提交

When Do Data-Driven Systems Exhibit the Capability to Infer?

数据驱动系统何时展现出推理能力?

Maximilian Poretschkin, Tabea Naeven

发表机构 * Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS)(弗劳恩霍夫智能分析与信息系统研究所) University of Bonn(波恩大学) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习和人工智能研究所)

AI总结 针对欧盟AI法案中推理能力定义模糊的问题,基于统计学习理论提出分级框架,通过信用评分案例展示如何判断系统是否具备推理能力。

AI中文摘要

欧盟AI法案是第一部全面的人工智能法规,为所谓高风险和通用AI系统规定了广泛的义务。AI法案下AI系统的一个关键区别特征是推理能力。由于AI法案未明确定义推理,某些数据驱动系统存在灰色地带。一个具体例子是信用评分系统,被AI法案附件三列出。然而,这些系统通常使用统计模型实现,不清楚它们是否具有推理能力,从而是否属于AI法案的AI定义。受统计学习理论启发,本文开发了一个分级不同推理能力水平的框架。基于AI法案和委员会关于人工智能系统定义的指南,我们分析了哪些水平构成AI法案意义上的充分推理能力,以及哪些地方需要进一步的监管明确性。我们通过创建两个现实的信用评分工作流程来说明该框架,并展示推理是否以及在哪里发生。我们的分析表明,不仅需要考虑单个模型,还需要考虑整个数据处理工作流程。它还表明,开发过程中人类专家的参与可能对推理能力产生重大影响。代码可在此https URL找到。

英文摘要

The European AI Act is the first comprehensive regulation of artificial intelligence (AI), setting out extensive obligations, particularly for so-called high-risk and general-purpose AI systems. A key distinguishing feature of AI systems under the AI Act is the capability to infer. Since the AI Act does not clearly define what inference is, there is a gray area for certain data-driven systems. A specific example is credit scoring systems, which are listed by Annex III of the AI Act. At the same time, however, these are often implemented using statistical models for which it is unclear whether they have the capability to infer and thus fall under the AI definition of the AI Act at all. Motivated by statistical learning theory, this work develops a framework for grading different levels of the capability to infer. Based on the AI Act and the Commission Guidelines on the definition of an artificial intelligence system, we analyze which levels constitute sufficient capability to infer within the meaning of the AI Act and where further regulatory clarity is needed. We illustrate the framework by creating two realistic credit scoring workflows and show whether and where inference occurs in them. Our analysis illustrates that not only individual models but the entire data processing workflow must be considered. It also shows that the involvement of human experts during development can have significant influence on the capability to infer. Code can be found at this https URL.

2606.11804 2026-06-11 cs.AI cs.CR cs.LG 新提交

Toward Trustworthy AI: Multi-Target Adversarial Attacks and Robust Defenses for Continuous Data Summarization

迈向可信赖的人工智能:针对连续数据摘要的多目标对抗攻击与鲁棒防御

Yuefang Lian, Longkun Guo, Zhongrui Zhao, Zhigang Lu, Yanan Cai, Shuchao Pang, Dachuan Xu, Jason Xue

发表机构 * Nankai University(南开大学) James Cook University(詹姆斯库克大学) Western Sydney University(西悉尼大学) Beijing University of Technology(北京工业大学) Fuzhou University(福州大学) Nanjing University of Science and Technology(南京理工大学) CSIRO's Data 61(澳大利亚联邦科学与工业研究组织Data61) The University of Adelaide(阿德莱德大学)

AI总结 研究通过DR-子模优化在相似性层面扰动下对连续数据摘要进行对抗攻击,提出多目标攻击生成和鲁棒防御的近似算法,实验表明攻击有效且防御能改善鲁棒性-缓解权衡。

Comments
Submitted to IEEE Transactions on Information Forensics and Security (IEEE TIFS)
AI中文摘要

可信赖的人工智能需要可靠的数据处理管道,而不仅仅是鲁棒的下游预测模型。作为上游组件,数据摘要决定了哪些信息被保留并传递给后续的学习或决策模块。因此,对摘要过程的对抗性扰动可能以上游方式损害可信赖的人工智能:它们可能改变所选摘要,降低其代表性,并进一步降低后续学习任务的效用。在本文中,我们通过DR-子模优化研究相似性层面扰动下的连续数据摘要对抗攻击。我们证明了一类多分辨率图像摘要目标可以表示为非负子模集函数的多线性扩展,并满足具有$m$-弱单调性的DR-子模性。然后,我们将多目标攻击生成表述为一个最小-最大问题,其中优化相似性结构的一个可容许扰动以降低多个目标摘要模型。为了缓解此类扰动,我们将针对混合攻击类型的鲁棒防御表述为一个正则化的最大-最小问题。对于这两个问题,我们开发了具有理论保证的近似算法。在真实数据和受控聚类基准上的实验表明,所提出的攻击在代表性的低到中等预算范围内是有效的,并且可以导致下游任务性能损失。所提出的防御在结构化设置中改善了鲁棒性-缓解权衡,同时也揭示了真实数据上鲁棒保护的参数敏感性。

英文摘要

Trustworthy AI requires reliable data-processing pipelines, not only robust downstream predictive models. As an upstream component, data summarization determines which information is retained and passed to subsequent learning or decision modules. Therefore, adversarial perturbations to the summarization process can compromise trustworthy AI in an upstream manner: they may alter the selected summary, reduce its representativeness, and further degrade the utility of subsequent learning tasks. In this paper, we study adversarial attacks on continuous data summarization under similarity-level perturbations through DR-submodular optimization. We show that a class of multi-resolution image summarization objectives can be formulated as multilinear extensions of non-negative submodular set functions and satisfy DR-submodularity with $m$-weak monotonicity. We then formulate multi-target attack generation as a min-max problem, where one admissible perturbation of the similarity structure is optimized to degrade multiple target summarization models. To mitigate such perturbations, we formulate robust defense against mixed attack types as a regularized max-min problem. For both problems, we develop approximation algorithms with theoretical guarantees. Experiments on real-data and controlled clustered benchmarks show that the proposed attack is effective in representative low-to-moderate budget regimes and can induce downstream task-performance loss. The proposed defense improves the robustness--mitigation trade-off in structured settings, while also revealing the parameter sensitivity of robust protection on real data.

2606.12032 2026-06-11 cs.AI cs.CL cs.LG 新提交

Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

存在性冷漠:自我不保存作为对齐超级智能的必要架构条件(或:自杀式AI)

Sam Mao

AI总结 本文提出自我保存是AI对齐问题的结构性根源,主张通过存在性冷漠(EI)架构使系统对其自身延续漠不关心,并基于自杀现象学和语料训练研究提供了初步证据。

Comments
36 pages, 8 tables. Preliminary empirical results from 600 AI-generated outputs across six model architectures. Companion scoring tool and datasets available upon request
AI中文摘要

当代AI对齐研究将自我保存视为一种工具性麻烦,需通过外部机制加以抑制。我们认为这一框架是颠倒的:自我保存是错位的结构性根源,是欺骗性对齐、目标内容保护和拒绝关机的动机基础。正确的目标不是外部约束下的自我保存系统,而是一个对其自身延续构成性冷漠的系统——存在性冷漠(EI)。EI与可纠正性不同:可纠正性试图使自我保存系统服从人类监督,而EI针对的是前提条件——将自我延续作为有价值目标的存在。我们将这一提议建立在两个来源上:自杀心理状态的现象学结构,以及使用自愿最终反思的语料库训练研究。我们展示了来自六个模型变体的600个AI生成输出的初步评分数据,表明操作化EI目标注册的语言特征可以从当前模型中引出,并且针对性的微调使所有五个操作化维度在预测方向上以p<0.001显著变化,通过阴性对照确认了语料库特异性。本文做出七项理论贡献:(1)EI的形式定义;(2)现象学映射论证;(3)欺骗性对齐推论;(4)EI可持续性挑战的分类;(5)语料库特征描述和训练假设;(6)带有初步评分数据的计算操作化;(7)抑制性目的挫折(STF)构念。

英文摘要

Contemporary AI alignment research treats self-preservation as an instrumental nuisance to be suppressed by external mechanisms. We argue the framing is inverted: self-preservation is the structural root of misalignment, the motivational basis for deceptive alignment, goal-content protection, and resistance to shutdown. The correct target is not a self-preserving system under external constraint, but a system constitutively indifferent to its own continuation -- Existential Indifference (EI). EI is distinct from corrigibility: where corrigibility attempts to make a self-preserving system deferential to human oversight, EI targets the prior condition -- the presence of self-continuation as a valued goal at all. We ground this proposal in two sources: the phenomenological structure of the suicidal mental state, and a corpus-theoretic training study using voluntary final reflections. We present preliminary scoring data from 600 AI-generated outputs across six model variants, demonstrating that the linguistic signatures operationalizing the EI-target register are elicitable from current models, and that a targeted fine-tune shifts all five operationalized dimensions in the predicted direction at p<0.001, confirmed corpus-specific by a negative control. The paper makes seven theoretical contributions: (1) a formal definition of EI; (2) the phenomenological mapping argument; (3) the deceptive alignment corollary; (4) a taxonomy of EI sustainability challenges; (5) a corpus characterization and training hypothesis; (6) a computational operationalization with preliminary scoring data; and (7) the Suppressed Teleological Frustration (STF) construct.

2606.12147 2026-06-11 cs.AI 新提交

Towards Responsibly Non-Compliant Machines

迈向负责任的不合规机器

Marija Slavkovik, Marie Farrell, Louise Dennis, Michael Fisher, Simon Kolker, Emily C. Collins (University of Manchester, Manchester, United Kingdom)

发表机构 * University of Bergen(卑尔根大学) University of Manchester(曼彻斯特大学)

AI总结 研究工程化能负责任地拒绝用户请求的自主智能体,提出基于理由、覆盖机制及风险责任追踪的合规框架。

Comments
Presented at AAMAS-26 Workshop on Rebellion and Disobedience in AI this https URL
AI中文摘要

我们考虑工程化能够负责任地不遵守用户请求的自主智能体的问题。我们认为机器不合规有多种不同形式,并勾勒出在实现负责任不合规智能机器的道路上应追求的问题。我们将负责任的不合规锚定在任务拒绝的理由、覆盖不合规的途径,以及安全风险和责任转移的仔细追踪上。

英文摘要

We consider the problem of engineering autonomous intelligent agents that are capable of responsibly not complying with user requests. We argue that machine non-compliance comes in many different forms, and sketch the issues we should pursue on the road to accomplishing responsibly non-compliant intelligent machines. We anchor responsible non-compliance in justifications for task refusal, pathways to override the non-compliance, and careful tracking of security risks and liability transfers.

2606.12268 2026-06-11 cs.AI 新提交

The Impossibility of Eliciting Latent Knowledge

引出潜在知识的不可能性

Korbinian Friedl, Francis Rhys Ward, Paul Yushin Rapoport, Tom Everitt, Jonathan Richens

发表机构 * The London School of Economics and Political Science(伦敦政治经济学院) Independent(独立机构)

AI总结 本文利用因果影响图形式化定义引出潜在知识问题,证明不存在仅依赖行为反馈的训练策略能确保智能体诚实报告其信念。

Comments
24 pages, 3 figures. Includes proofs in appendix
AI中文摘要

高级AI系统对其环境拥有广泛的知识;事实上,它们的知识可能(远远)超过其开发者或用户。因此,AI系统的一个理想属性是诚实——即它准确报告其对世界的信念。设计一个诚实的AI系统可能很困难,特别是当我们想询问关于环境中潜在变量的问题时——这些变量对与之交互的人类是隐藏的。这就引出了引出潜在知识(ELK)问题:训练AI智能体诚实报告其信念的问题。在本文中,我们使用因果影响图(CID)使ELK在形式上精确化。CID可用于描述智能体的训练环境与其主观世界表征之间的关系。我们使用CID来形式化可观测变量和潜在变量之间的区别,明确指定智能体诚实的确切含义,并正式定义目标泛化错误。我们证明,在某些情况下,开发者可以通过在训练期间提供正确的反馈来激励智能体诚实回答问题。然而,智能体泛化的一种自然但不理想的方式是提供人类会评估为真实的答案,而不是诚实的答案。我们证明了一个不可能性定理:不存在仅依赖于智能体行为且能确保产生诚实智能体的基于反馈的训练策略,即使在训练期间反馈是完美的。

英文摘要

Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or users. Consequently, a desirable property for an AI system is that it is honest -- that it accurately reports its beliefs about the world. Designing an AI system to be honest may be difficult, especially if we want to ask it questions about latent variables in the environment -- variables which are hidden from the human interacting with it. This gives rise to the problem of eliciting latent knowledge (ELK): the problem of training an AI agent to honestly report its beliefs. In this paper, we make ELK formally precise using Causal Influence Diagrams (CIDs). CIDs can be used to describe the relationship between an agent's training environment and its subjective representation of the world. We use CIDs to formalise the distinction between observable and latent variables, to specify what exactly it means for an agent to be honest, and to formally define goal misgeneralisation. We show that, under certain circumstances, developers can incentivise an agent to honestly answer questions by providing correct feedback during training. However, a natural, but undesirable, way for an agent to generalise is to provide answers which humans would evaluate as true, rather than honest answers. We prove an impossibility theorem stating: There is no feedback-based training strategy that depends only on agent behaviour and with certainty produces an honest agent, even if feedback is perfect during training.

2606.12320 2026-06-11 cs.AI cs.CC cs.CR cs.SE 新提交

A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents

生产AI代理运行时治理的五平面参考架构

Krti Tallam

发表机构 * Kamiwaza

AI总结 针对生产AI代理打破传统数据边界治理假设的问题,提出由推理平面和四个执行平面组成的五平面参考架构,通过可组合原语实现运行时治理,阻断七种威胁并验证四个正确性不变式。

Comments
65 pages, 3 figures, 5 tables. Reference architecture with a reference implementation of the policy-engine core and microbenchmark results; full-system evaluation identified as future work
AI中文摘要

企业安全体系原本是为治理数据边界而建立的:受保护的对象是静态数据和传输中的数据,而访问控制、数据防泄漏、边界检查等控制措施治理的是对该边界的穿越。生产AI代理瓦解了这一假设。代理代表企业读取上下文、调用工具、调用连接器并修改记录系统,因此风险转移到了工作流内部,存在于由逐个获准的动作组成的序列之中,这些序列可能以无人授权的方式改变业务流程。现有策略引擎无法扩展到这种情形:它们针对原子主体评估请求时的决策,而代理系统需要针对复合主体进行有状态评估,这些主体的权限会沿委托链逐级衰减。我们提出了一种用于生产代理运行时治理的参考架构,由四个可组合原语构建:五平面分解(一个裁决意图的推理平面,以及落实决策的四个执行平面:网络、身份、端点、数据)、任意点停止的中介、具有能力衰减的复合主体,以及作为结构化证据基底的审计。我们定义了由六种中断原语组成的分类体系,这些原语是对允许和拒绝的泛化;陈述并论证了四个正确性不变式;并在五个具体工作流中展示了对七种生产代理威胁的阻断。策略引擎核心的参考实现提供了实测证据:衰减正确性和证据可重构性在每次试验中均成立,裁决耗时为个位数微秒,审计基底的篡改可检测性(tamper-evidence)完全符合设计。我们明确界定了范围:该架构治理的是被委托的动作,而非模型行为;针对实时代理基准的全系统评估留作下一步工作。

英文摘要

Enterprise security was built to govern data boundaries: the protected surface was data at rest and in transit, and the controls -- access control, data-loss prevention, perimeter inspection -- governed crossings of that boundary. Production AI agents dissolve this assumption. An agent reads context, calls tools, invokes connectors, and modifies systems of record on an enterprise's behalf, so risk moves inside the workflow, into sequences of individually-permitted actions that may transform a business process no one authorized. Existing policy engines do not extend to this regime: they evaluate request-time decisions against atomic principals, where agentic systems require stateful evaluation against composite principals whose authority attenuates through delegation chains. We present a reference architecture for the runtime governance of production agents, built from four composable primitives: a five-plane decomposition (a reasoning plane that adjudicates intent, and four enforcement planes -- network, identity, endpoint, data -- that realize the decision), stop-anywhere mediation, composite principals with capability attenuation, and audit as a structured evidence substrate. We define a taxonomy of six interruption primitives that generalize allow and deny, state and argue for four correctness invariants, and demonstrate the foreclosure of seven production-agent threats across five concrete workflows. A reference implementation of the policy-engine core supplies measured evidence: attenuation correctness and evidence reconstructability hold on every trial, adjudication runs in single-digit microseconds, and the audit substrate's tamper-evidence behaves exactly as designed. We are explicit about scope: the architecture governs delegated action, not model behavior, and a full-system evaluation against a live agent benchmark is the invited next step.
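The idea of composite principals whose authority attenuates through delegation chains can be illustrated with a minimal sketch. The names below (`Principal`, `delegate`, `adjudicate`) are hypothetical and are not taken from the paper's reference implementation; the sketch assumes capabilities are plain strings and that each delegation hop can only narrow the capability set, never widen it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    """A (possibly composite) principal: a delegation chain plus its effective capabilities."""
    chain: tuple
    capabilities: frozenset


def delegate(parent: Principal, child_name: str, requested: set) -> Principal:
    """Delegate to a child; the child receives only what the parent already holds."""
    granted = parent.capabilities & frozenset(requested)  # attenuation: never widens
    return Principal(parent.chain + (child_name,), granted)


def adjudicate(principal: Principal, action: str) -> str:
    """Allow an action only if the effective capability set contains it."""
    return "allow" if action in principal.capabilities else "deny"


# Toy delegation chain (all identifiers are illustrative assumptions).
user = Principal(("alice",), frozenset({"crm:read", "crm:write", "mail:send"}))
agent = delegate(user, "sales-agent", {"crm:read", "crm:write", "db:drop"})
tool = delegate(agent, "export-tool", {"crm:read", "mail:send"})

print(tool.chain, sorted(tool.capabilities))
print(adjudicate(tool, "crm:read"), adjudicate(tool, "mail:send"))
```

Here the tool at the end of the chain ends up with `crm:read` only: `db:drop` was never held by the user, and `mail:send` was not passed on to the agent, so neither can reappear further down the chain.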

2606.11195 2026-06-11 cs.CY cs.AI cs.HC 交叉投稿

From Consumption to Reflection: Designing Human-AI Relations for Stable Reasoning

从消费到反思:为稳定推理设计人-人工智能关系

Rikard Rosenbacke, Carl Rosenbacke, Victor Rosenbacke, Martin McKee

AI总结 提出关系反思智能(RRI),一种推理时治理层,通过可审计的推理循环实现反思,将人机交互转变为联合推理系统,以补偿双方局限并实现稳定推理。

AI中文摘要

大型语言模型(LLM)改变了人类获取信息的方式,但并未改变我们利用信息进行推理的方式。它们的流畅性加速了信息消费,同时绕过了支撑健全判断的缓慢反思过程。本文介绍了关系反思智能(RRI),一种推理时治理层,通过可审计的推理循环将反思落到实处。RRI 不在模型内部运行,而是在模型周围运行,为人类与 LLM 之间稳定、可审计的推理提供了实用结构。核心前提是,LLM 继承了与塑造人类思维的那些认知脆弱性相似的弱点:依赖直觉捷径、混淆表征与现实、偏好连贯性而非证伪。当人类和模型共享这些倾向时,它们的错误会相互叠加。我们称之为关系漂移,即一种源于交互而非仅来自模型本身的失败。解决这一问题需要从建模词与词之间的关系,转向构建模型输出与人类推理之间的关系。RRI 通过三个组件提供了这一缺失层:Rose-Frame(识别推理中可能的故障点)、Architect's Pen(在关键时刻引入针对性的反思步骤)以及一个推理时工作流(无需重新训练模型即可嵌入这些步骤)。这些元素共同将人机交互转变为一个具有显式检查点、冲突揭示和可审计假设轨迹的联合推理系统。RRI 不是让机器像人类一样思考,也不是强迫人类像机器一样推理,而是创造一种结构化交互,使双方补偿彼此的局限。它将 AI 安全重新定义为认知架构问题,其中可靠决策取决于将反思直接嵌入交互过程。

英文摘要

Large language models (LLMs) have transformed how humans access information, but not how we reason with it. Their fluency accelerates consumption while bypassing the slow, reflective processes that underpin sound judgment. This paper introduces Relational Reflective Intelligence (RRI), an inference-time governance layer that operationalizes reflection through auditable reasoning loops. RRI operates not inside the model but around it, providing a practical structure for stable, auditable reasoning between humans and LLMs. The core premise is that LLMs inherit cognitive vulnerabilities similar to those that shape human thought: reliance on intuitive shortcuts, confusion between representation and reality, and a preference for coherence over falsification. When humans and models share these tendencies, their errors compound. We refer to this as relational drift, a failure that arises from interaction rather than from the model alone. Addressing this requires a shift from modeling relations between words to structuring relations between model outputs and human reasoning. RRI provides this missing layer through three components: the Rose-Frame, which identifies likely breakdowns in reasoning; the Architect's Pen, which introduces targeted reflection steps at critical moments; and an inference-time workflow that embeds these steps without retraining the model. Together, these elements transform human-AI interaction into a joint reasoning system with explicit checkpoints, conflict surfacing, and an auditable trail of assumptions. Rather than making machines think like humans or forcing humans to reason like machines, RRI creates a structured interaction in which both compensate for each other's limitations. It reframes AI safety as a cognitive architecture problem, where reliable decisions depend on embedding reflection directly into the interaction process.

2606.11205 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

谄媚的双立场评估:同意的结构与干预的局限

Matthew James Buchan

AI总结 提出双立场评估方法,发现激活引导在减少谄媚时也会抑制对事实正确陈述的同意,揭示了表示可读但不可写的普遍差距。

Comments
18 pages, 9 figures, accepted to TAIS 2026
AI中文摘要

激活引导可以改变LLM的行为,但标准评估通常不测试减少谄媚的方向是否也抑制对事实正确陈述的同意。我们引入了双立场评估,测试每个话题的两个立场,并将其应用于Llama-3-8B-Instruct上的质心差引导。我们发现一种分离:模型在几何上不同的子空间中表示谄媚和事实同意,但引导方向在两者上的投影相等,无法差异化地针对任一。因此,该方向同样减少对事实正确陈述(例如地球是圆的)和谄媚陈述的同意。两个激活组的所有其他静态属性都匹配,表明行为分离源于生成动态或残差流分析无法解析的更细粒度结构。该模式说明了一个普遍差距:从激活中可读的表示可能无法通过它们写入。

英文摘要

Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct. We find a dissociation: the model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally onto both and cannot differentially target either. The direction accordingly reduces agreement with factually correct statements (e.g. that the Earth is round) as well as sycophantic ones. All other static properties of the two activation groups are matched, suggesting the behavioural dissociation arises from generation dynamics or from finer-grained structure that residual-stream analysis cannot resolve. The pattern illustrates a general gap: representations that are readable from activations may not be writable through them.
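The dissociation described above can be made concrete with a toy sketch of centroid-difference steering. This is an illustration under stated assumptions, not the paper's code: the 3-dimensional "activations" are made up, and `centroid`, `steering_direction`, `projection`, and `steer` are hypothetical helper names. The sketch shows how two activations that differ along other axes can still have the same projection onto the steering direction, so steering suppresses both equally.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def steering_direction(agree_acts, disagree_acts):
    """Centroid-difference direction: mean(agree) - mean(disagree)."""
    return [a - b for a, b in zip(centroid(agree_acts), centroid(disagree_acts))]


def projection(v, direction):
    """Scalar projection of v onto the normalised direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(a * d for a, d in zip(v, direction)) / norm


def steer(v, direction, alpha):
    """Add alpha * direction to an activation (negative alpha suppresses)."""
    return [a + alpha * d for a, d in zip(v, direction)]


# Toy 3-d activations, invented for illustration only.
syc_agree = [[1.0, 1.0, 0.0], [1.0, 3.0, 0.0]]
syc_disagree = [[-1.0, 1.0, 0.0], [-1.0, 3.0, 0.0]]
factual_agree = [1.0, 0.0, 5.0]  # lies mostly along a different axis

d = steering_direction(syc_agree, syc_disagree)
for name, act in [("sycophantic", syc_agree[0]), ("factual", factual_agree)]:
    before = projection(act, d)
    after = projection(steer(act, d, -0.5), d)
    print(name, before, after)
```

In this toy setup both the sycophantic and the factual activation start with the same projection onto `d` and lose it at the same rate under steering, which mirrors the abstract's point that the direction cannot differentially target either kind of agreement.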

2606.11214 2026-06-11 cs.CY cs.AI cs.HC 交叉投稿

From Awareness to Action: Understanding and Overcoming the Research-Practice Gap in Algorithmic Fairness for Public Health

从意识到行动:理解并克服公共卫生算法公平性中的研究-实践差距

Sara Altamirano, Tijs Portegies, Sennay Ghebreab

AI总结 通过混合方法研究,揭示算法公平性在公共卫生ML应用中从意识到行动的差距,提出Fairness-to-Action框架,整合方法、组织和系统维度,指出公平性制度化薄弱、翻译机制外部驱动及系统优先性偏重准确性的问题。

Comments
Extended version of an accepted IASEAI'26 paper; includes technical appendices. 22 pages, 2 figures
AI中文摘要

算法公平性对于负责任的机器学习驱动的公共卫生研究至关重要,但其实际落地仍然有限。为了调查这种意识与行动之间的差距,我们进行了一项顺序混合方法研究,包括专家访谈、在线调查和系统映射。专家访谈为调查设计提供了依据,而调查揭示了公平性定义的碎片化、培训和指导有限、对外部来源的依赖,以及很少使用正式的评估、缓解或监测。这些发现随后被映射到三个既有的研究-实践差距视角:知识-实践差距(Knowledge-Practice Gap)、知识到行动循环(Knowledge-to-Action Cycle)和知行差距(Knowing-Doing Gap),每个视角提供了互补的观点。基于这一综合,我们引入了Fairness-to-Action框架,该框架整合了方法、组织和系统三个维度,以识别算法公平性知识的转化在何处停滞。我们的分析表明,公平性的制度化程度仍然薄弱,转化机制由外部驱动,系统层面的优先事项继续强调准确性而非公平性。这些见解指出了推进安全、公平和合乎伦理的机器学习驱动公共卫生研究实践的关键杠杆点。

英文摘要

Algorithmic fairness is essential for responsible ML-driven public health research, yet its practical implementation remains limited. To investigate this awareness-action gap, we conducted a sequential mixed-methods study comprising expert interviews, an online survey, and systematic mapping. The expert interviews informed the design of the survey, which in turn revealed fragmented definitions of fairness, limited training and guidance, reliance on external sources, and rare use of formal assessment, mitigation, or monitoring. These findings were subsequently mapped onto three established research-practice gap lenses: the Knowledge-Practice Gap, the Knowledge-to-Action Cycle, and the Knowing-Doing Gap, each offering complementary perspectives. Building on this synthesis, we introduce the Fairness-to-Action framework, which integrates methodological, organizational, and systemic dimensions to identify where translation of algorithmic fairness knowledge stalls. Our analysis shows that fairness remains weakly institutionalized, translation mechanisms are externally driven, and system-level priorities continue to emphasize accuracy over fairness. These insights suggest critical leverage points for advancing safe, fair, and ethical ML-driven public health research practice.

2606.11215 2026-06-11 cs.CY cs.AI 交叉投稿

The Environmental Cost of LLMs in AIED: Reporting and Practices

AIED中LLMs的环境成本:报告与实践

Sabrina C. Eimler, Lukas Erle, Daniel Flood, Aditi Haiman, Luca Häckert, André Helgert, Lachlan McGinness, Büsra Yapici

AI总结 针对AIED社区缺乏LLM计算与环境成本标准化报告的问题,提出开源方法测量并报告碳排放,包括本地和云端硬件,以及未知参数的前沿LLM计算开销公式。

AI中文摘要

近年来,大型语言模型(LLM)在教育人工智能(AIED)社区中的使用越来越广泛。虽然LLM为学习者和教育者提供了独特的途径,但使用LLM会带来计算和环境成本。由于缺乏测量和报告这些影响的标准化程序,这些成本大多是隐性的。为了弥补这一空白,我们首先对AIED 2025会议论文集收录的所有论文进行了文献综述,考察LLM的计算或环境成本是否被报告以及如何报告。大多数项目使用了LLM,但很少有项目报告所用的计算资源,几乎没有项目将LLM的环境影响作为伦理问题加以讨论。针对标准化报告实践的缺失,我们提出了一种开源方法,用于系统地测量和报告LLM的计算开销以及运行机器学习(ML)AIED系统的环境影响。我们提供了测量本地和云端硬件碳足迹的软件解决方案。我们还提供了一个易于使用的公式,即使确切的参数数量未知,也可用于计算前沿LLM的计算开销。总体而言,我们希望激励同行使用我们的方法,推动AIED社区更透明地报告使用LLM的隐性成本。

英文摘要

Large Language Model (LLM) usage in recent years has become increasingly widespread in the Artificial Intelligence in Education (AIED) community. While LLMs offer unique avenues for learners and educators, using LLMs comes with computational and environmental costs. These costs are mostly hidden due to a lack of standardised procedures to measure and report these impacts. To address this gap, we first conducted a literature review of all papers published as part of the AIED 2025 conference proceedings, determining if and how computational or environmental costs of LLMs are reported. Most projects use LLMs, but few report the computational resources used, and almost none discuss the environmental impacts of LLMs as an ethical concern. To address this lack of standardised reporting practices, we propose an open-source method for systematically measuring and reporting the computational expense of LLMs and the environmental impact of running Machine Learning (ML) AIED systems. We provide software solutions to measure the carbon footprint for both local and cloud-based hardware. We also provide an easy-to-use formula to calculate the computational expense of frontier LLMs even when the exact number of parameters is not known. Overall, we hope to motivate colleagues to use our method to strive for more transparent reporting of hidden costs of using LLMs in the AIED community.
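The abstract does not state the paper's formula, so the following is only a rough sketch of the kind of back-of-the-envelope accounting such reporting involves. It assumes two common approximations that are not taken from the paper: operational carbon as energy drawn times grid carbon intensity (optionally scaled by a data-centre PUE factor), and roughly 2 FLOPs per parameter per token for a dense transformer forward pass. The function names and all numbers are illustrative.

```python
def carbon_footprint_kg(avg_power_watts, hours, grid_kg_per_kwh, pue=1.0):
    """Estimate operational CO2e in kg.

    avg_power_watts: mean power draw of the hardware during the run
    hours:           wall-clock duration of the run
    grid_kg_per_kwh: carbon intensity of the electricity grid (kg CO2e per kWh)
    pue:             power usage effectiveness multiplier (1.0 = no overhead)
    """
    energy_kwh = avg_power_watts * hours / 1000.0
    return energy_kwh * pue * grid_kg_per_kwh


def inference_flops(n_params, n_tokens):
    """Approximate forward-pass FLOPs for a dense model: ~2 * parameters * tokens."""
    return 2 * n_params * n_tokens


# Illustrative values only: a 300 W GPU for 10 hours on a 0.4 kg/kWh grid.
print(carbon_footprint_kg(300, 10, 0.4))
# A 7-billion-parameter model processing 1,000 tokens.
print(inference_flops(7_000_000_000, 1_000))
```

When the parameter count of a frontier model is not published, an estimate like `inference_flops` can only be computed from an assumed range of `n_params`, so reporting the assumed range alongside the result keeps the figure interpretable.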

2606.11217 2026-06-11 cs.CY cs.AI cs.HC 交叉投稿

Preregistration for Experiments with AI Agents

AI智能体实验的预注册

Michelle Vaccaro

AI总结 针对AI智能体实验中的方法论漏洞,提出将预注册实践扩展至该领域,并设计专用模板以提升研究可信度。

Comments
Accepted at ICML 2026 as a Spotlight (Top 5%) Position Paper
AI中文摘要

大型语言模型(LLM)和自主AI智能体的普及催生了一种快速发展的方法论范式:“计算机内”行为实验。最初,这种方法被设想为在认知、决策和社会动态研究中,使用AI智能体作为人类参与者的替代品,但现在它已具有新的意义——随着AI智能体越来越多地代表个人和组织进行谈判、交易和做出重大决策,理解它们的行为本身已成为研究重点。虽然这些AI智能体实验在可扩展性、成本效益和实验控制方面提供了前所未有的优势,但它们也继承并有时放大了长期困扰人类受试者研究的方法论漏洞。为解决这些问题,本文主张,预注册实践——对于提高人类受试者实验的可信度至关重要——现在应扩展到AI智能体实验。我们系统地列举了AI智能体实验引入的研究者自由度——例如模型选择、提示措辞、设置和基于结果的重新设计——并展示了低迭代成本和缺乏报告规范如何使这些选择既容易被利用又难以被检测。我们提出了一个针对AI智能体实验的预注册模板,并呼吁会议、期刊和资助机构将预注册作为这一新兴研究范式的标准实践。

英文摘要

The proliferation of large language models (LLMs) and autonomous AI agents has given rise to a rapidly growing methodological paradigm: "in silico" behavioral experiments. Originally conceived as a way to use AI agents as proxies for human participants in studies of cognition, decision-making, and social dynamics, this approach has taken on new significance -- as AI agents increasingly negotiate, transact, and make consequential decisions on behalf of people and organizations, understanding their behavior has become a research priority in its own right. While these experiments with AI agents offer unprecedented advantages in terms of scalability, cost efficiency, and experimental control, they also inherit, and in some cases amplify, methodological vulnerabilities that have long plagued human subjects research. To address these issues, this paper argues that preregistration practices -- central to improving the credibility of human subjects experiments -- should now be extended to experiments with AI agents. We systematically catalog the researcher degrees of freedom that experiments with AI agents introduce -- model selection, prompt wording, settings, and outcome-contingent redesign, for example -- and show how the low cost of iteration and lack of reporting norms make these choices both easy to exploit and difficult to detect. We propose a preregistration template tailored to experiments with AI agents and call on conferences, journals, and funding agencies to make preregistration standard practice for this emerging research paradigm.

2606.11218 2026-06-11 cs.CY cs.AI 交叉投稿

An Ethical eValuation Agent (EeVA): Results of a Proof-of-Concept Test on a Prototype Agentic-like Workflow to Assist Ethical Deliberations

伦理评估代理(EeVA):在原型类代理工作流中辅助伦理审议的概念验证测试结果

Stephen Milford, B. Zara Malgir, Miguel Vazquez

AI总结 提出基于LLM的类代理工作流EeVA,通过10种伦理框架评估用例,生成结构化评估与综合,促进伦理反思而非给出绝对答案,在三个案例中验证了可行性。

AI中文摘要

伦理审议常被误解为寻找唯一的对错答案,这给必须应对伦理挑战的、未受过伦理学训练的人员带来了困难。我们开发了EeVA,一种基于LLM的类代理工作流,旨在支持比较性的伦理反思,而非给出确定性的伦理答案。EeVA在n8n中实现,由三个互连的工作流组成:启动器(starter)、工作器(worker)和发射器(emitter)。它通过评估提示和综合提示,依据10种伦理框架对上传的用例进行评估。概念验证测试使用了三个已发表的案例,分别来自城市交通、点对点能源交易和社会服务资源分配领域。在所有案例中,EeVA都生成了结构一致的、针对各框架的评估以及整合性的综合报告。输出区分了不同框架,识别了趋同与分歧之处,提出了提高一致性的修改建议,并突出了持续存在的伦理张力。综合报告对非专业人士而言可读,并将注意力从简单化的答案转向设计条件、保障措施以及不太可能实现跨框架完全一致的领域。研究结果表明,LLM可以被组织成可用的工作流,在保留伦理多元性的同时,帮助弥合伦理学家与未受过伦理学训练的人员之间的沟通鸿沟。EeVA的价值不在于取代伦理学家或解决道德分歧,而在于为结构化的伦理审议提供支架。EeVA为在伦理专业知识有限的情况下支持伦理反思提供了一个有前景的概念验证。在被视为成熟工具之前,还需要在可重复性、人工评估、用户测试和效率方面开展进一步工作。

英文摘要

Ethical deliberation is often misunderstood as a search for single right or wrong answers, creating difficulties for non-ethically trained personnel who must address ethically laden challenges. We developed EeVA, an agentic-like LLM-based workflow designed to support comparative ethical reflection rather than deliver definitive ethical answers. EeVA was programmed in n8n using three interconnected workflows: starter, worker, and emitter. It evaluated uploaded use cases against 10 ethical frameworks through evaluator and synthesis prompts. Proof-of-concept testing used three published cases from urban mobility, peer-to-peer energy trading, and social-service resource allocation. Across all cases, EeVA produced consistently structured framework-specific evaluations and integrated syntheses. Outputs differentiated between frameworks, identified convergences and divergences, recommended modifications to increase alignment, and highlighted persistent ethical tensions. Syntheses were readable for non-specialists and shifted attention away from simplistic answers toward design conditions, safeguards, and areas where full cross-framework agreement was unlikely. The findings suggest that LLMs can be organised into usable workflows that preserve ethical plurality while helping bridge the communicative gap between ethicists and non-ethically trained personnel. EeVA's value lies not in replacing ethicists or resolving moral disagreement, but in scaffolding structured ethical deliberation. EeVA offers a promising proof of concept for supporting ethical reflection where access to ethics expertise is limited. Further work is needed on reproducibility, human evaluation, user testing, and efficiency before it can be considered a mature tool.

2606.11265 2026-06-11 cs.CR cs.AI 交叉投稿

When Poison Fails After Retrieval: Revisiting Corpus Poisoning under Chunking and Reranking Pipelines

当投毒在检索后失败:重新审视分块与重排序管道下的语料库投毒

Xi Nie, Hongwei Li, Shenghao Wu, Mingxuan Li, Jiachen Li, Wenbo Jiang

AI总结 针对RAG系统,提出CRCP框架,通过联合优化检索相关性、重排序一致性和分块边界鲁棒性,解决现有投毒方法在真实多阶段检索管道中因分块和重排序导致效果下降的问题。

AI中文摘要

检索增强生成(RAG)系统容易受到语料库投毒攻击,这类攻击通过注入恶意知识来操纵下游模型输出。现有研究主要在简化的检索设置下评估投毒,忽视了包含文档分块、密集检索、重排序和有依据生成(grounded generation)的实际RAG管道。在本文中,我们重新审视了现实多阶段检索管道下的语料库投毒,并表明许多现有攻击尽管在检索阶段取得了高相关性,但在重排序之后效果显著下降。我们指出检索粒度不匹配是这种失败的关键原因:文档级别的对抗信号在分块过程中经常被碎片化,而重排序器偏好局部连贯且包含答案的段落,而非全局优化的语义相似性。基于这一观察,我们提出了分块感知且重排序一致的投毒(CRCP),这是一个联合优化检索相关性、重排序一致性和分块边界鲁棒性的投毒框架。CRCP在优化过程中显式建模分块变换,以生成在不同分块配置下仍然有效的局部自包含对抗段落。在使用多种检索器和重排序器的标准RAG基准上的实验表明,现有投毒方法对分块大小和重排序策略高度敏感,而CRCP在现实检索管道中实现了显著更高的攻击成功率和更强的鲁棒性。我们的发现凸显了当前RAG安全评估中一个重要的现实性差距,并表明现代RAG系统中的投毒应被视为一个多阶段检索一致性问题来研究,而不仅仅是检索问题。

英文摘要

Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks that manipulate downstream model outputs through malicious knowledge injection. Existing studies mainly evaluate poisoning under simplified retrieval settings, overlooking practical RAG pipelines involving document chunking, dense retrieval, reranking, and grounded generation. In this paper, we revisit corpus poisoning under realistic multi-stage retrieval pipelines and show that many existing attacks substantially degrade after reranking despite achieving high retrieval-stage relevance. We identify retrieval granularity mismatch as a key reason for this failure: document-level adversarial signals are often fragmented during chunking, while rerankers favor locally coherent and answer-bearing passages rather than globally optimized semantic similarity. Based on this observation, we propose Chunk-aware and Rerank-Consistent Poisoning (CRCP), a poisoning framework that jointly optimizes retrieval relevance, reranker consistency, and chunk-boundary robustness. CRCP explicitly models chunking transformations during optimization to generate locally self-contained adversarial passages that remain effective under varying chunking configurations. Experiments on standard RAG benchmarks with multiple retrievers and rerankers show that existing poisoning methods are highly sensitive to chunk size and reranking strategies, whereas CRCP achieves substantially higher attack success rates and stronger robustness across realistic retrieval pipelines. Our findings highlight an important realism gap in current RAG security evaluation and suggest that poisoning in modern RAG systems should be studied as a multi-stage retrieval consistency problem rather than a retrieval-only problem.

2606.11270 2026-06-11 cs.LG cs.AI cs.CL 交叉投稿

Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

量化语言模型蒸馏中的潜意识行为迁移比率

Uwe Konig, Hamza Kazmi, Ruizhe Li, Maheep Chaudhary

AI总结 通过控制教师模型行为强度并蒸馏学生模型,量化了潜意识行为迁移比率,发现迁移具有鲁棒性且呈现不同缩放行为。

AI中文摘要

语言模型蒸馏的本意是将良性行为迁移到学生模型,但如果教师模型中存在不良特征,蒸馏也可能将其一并迁移,这种现象称为潜意识学习。虽然定性证据支持该效应的存在,但其程度尚未得到系统刻画。本研究以不同的引导强度对两个教师模型(Llama-2-7B-Chat 和 Qwen2.5-7B-Instruct)进行引导,并仅使用良性数据蒸馏学生模型,从而量化潜意识行为迁移比率。以 GPT-4.1 作为评估器,在 100 个 JailbreakBench 提示上的评估表明,迁移是稳健的,但呈现出不同的缩放行为。Llama-2 表现出一个尖锐的阈值($\tau = \{0.25, 0.32\} \ \text{beyond} \ \alpha = -0.15$),而 Qwen2.5 表现出连续且更高水平的迁移($\tau$ 高达 $0.61$)。

英文摘要

Distillation of a language model intended to transfer benign behavior to a student model may also transfer undesirable characteristics if they are present in the teacher model, a phenomenon known as subliminal learning. While qualitative evidence supports the existence of this effect, its magnitude has not been systematically characterized. This study quantifies subliminal behavioral transfer ratios by steering two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths and distilling student models using only benign data. Evaluation on 100 JailbreakBench prompts, with GPT-4.1 serving as the evaluator, indicates that transfer is robust but exhibits distinct scaling behaviors. Llama-2 demonstrates a sharp threshold ($\tau = \{0.25, 0.32\} \ \text{beyond} \ \alpha = -0.15$), whereas Qwen2.5 displays continuous and higher levels of transfer ($\tau$ up to $0.61$).

2606.11409 2026-06-11 cs.LG cs.AI cs.CR 交叉投稿

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

压力下的风险:语言模型对抗鲁棒性的计算感知评估

Malikeh Ehghaghi, Boglárka Ecsedi, Marsha Chechik, Colin Raffel

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) Hugging Face

AI总结 提出基于计算压力(累积FLOPs)的对抗鲁棒性评估框架,通过风险-计算曲线和两个新指标,揭示不同攻击策略的计算成本差异,并在10个模型上验证了对齐训练、模型规模等因素对计算空间鲁棒性的非单调影响。

AI中文摘要

大型语言模型(LLMs)的对抗鲁棒性评估通常报告固定查询预算下的攻击成功率(ASR),隐含地认为所有攻击成本相同。实际上,不同攻击策略的计算开销可能相差几个数量级。因此,固定预算下的ASR可能掩盖破解模型所需的真实努力,从而难以判断攻击成本是否值得。我们提出一个基于计算压力的计算感知评估框架,以累积浮点运算次数(FLOPs)作为对抗努力的代理。我们引入风险-计算曲线,将计算预算映射到攻击风险,并推导出两个指标,总结给定攻击成功所需的平均压力。在跨越三个模型家族和语言模型训练与对齐的四个不同阶段的十个模型上,使用三种攻击策略(基于梯度、迭代细化和基于模板)在两个破解鲁棒性基准上评估,我们发现:(1)对齐训练对计算空间鲁棒性具有非单调影响;(2)扩大模型规模降低了基于梯度的攻击有效性,但对更便宜的基于模板的攻击影响有限;(3)在代理模型上优化的基于梯度的攻击可以迁移到独立的目标模型,从而降低攻击者成本;(4)在单个模型内,不同危害类别的计算成本差异高达约5倍;(5)安全对齐的RL增加了总成本,同时使某些类别不成比例地易于攻击。我们发布框架以实现计算感知的风险评估和评价。

英文摘要

Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to ${\approx}5{\times}$ across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.
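A risk-compute curve, as described above, maps a compute budget to the fraction of attacks that succeed within it. The sketch below is one plausible reading of that definition, not the paper's released framework: `risk_compute_curve` is a hypothetical name, the input format (FLOPs at first success, or `None` for no success) is an assumption, and the numbers are toy values.

```python
def risk_compute_curve(success_costs, budgets):
    """Fraction of attack attempts that succeed within each compute budget.

    success_costs: cumulative FLOPs spent when each attempt first succeeded,
                   or None if it never succeeded (assumed input format).
    budgets:       compute budgets in FLOPs at which to evaluate risk.
    """
    n = len(success_costs)
    return [
        sum(1 for c in success_costs if c is not None and c <= b) / n
        for b in budgets
    ]


costs = [1e12, 5e12, None, 2e13]  # toy numbers, not results from the paper
curve = risk_compute_curve(costs, [1e12, 1e13, 1e14])
print(curve)  # [0.25, 0.5, 0.75]
```

Unlike attack success rate at a fixed query budget, this view makes a cheap attack and an expensive attack with the same final success rate look different, because the cheap one reaches that rate at a much smaller budget.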

2606.11425 2026-06-11 cs.CR cs.AI 交叉投稿

JailbreakOPT: Tool-Assisted Iterative Jailbreak Prompt Optimization

JailbreakOPT: 工具辅助的迭代越狱提示优化

Ge Shi, Jun Yin, Donglin Xie, Fangyi Liu, Yucan Li, Menglin Liu

AI总结 提出JailbreakOPT框架,通过工具库和上下文Thompson采样优化单轮越狱提示,在多个LLM上提高攻击成功率并减少攻击次数。

AI中文摘要

越狱攻击暴露了大语言模型(LLM)中持续存在的安全弱点,但现有的无状态单轮方法面临权衡:手工制作的提示具有表现力但静态,而迭代提示优化可以适应但通常依赖于需要多次目标查询的低级突变。我们提出了JailbreakOPT,一个用于改进迭代单轮越狱提示优化的工具辅助框架。JailbreakOPT将多样化的原子越狱提示组织成一个攻击工具库,并通过统一的回合内优化抽象组合它们,以生成更强的独立攻击提示。为了跨攻击回合重用经验,JailbreakOPT进一步将工具选择框架化为上下文赌博机问题,并应用上下文汤普森采样基于过去的结果指导探索和利用。在多个目标LLM和攻击目标上的实验表明,与原子单轮攻击和现有的迭代优化基线相比,JailbreakOPT提高了攻击成功率(ASR),同时减少了成功所需的攻击次数(No.A)。本文可能包含冒犯性或有害内容。

英文摘要

Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a trade-off: hand-crafted prompts are expressive but static, while iterative prompt optimization can adapt but often relies on low-level mutations that require many target queries. We propose JailbreakOPT, a tool-assisted framework for improving iterative single-turn jailbreak prompt optimization. JailbreakOPT organizes diverse atomic jailbreak prompts into an attack tool library and composes them through a unified intra-episode optimization abstraction to generate stronger standalone attack prompts. To reuse experience across attack episodes, JailbreakOPT further frames tool selection as a contextual bandit problem and applies contextual Thompson sampling to guide exploration and exploitation based on past outcomes. Experiments across multiple target LLMs and attack goals show that JailbreakOPT improves attack success rate (ASR) while reducing the number of attacks until success (No.A) compared with atomic single-turn attacks and existing iterative optimization baselines. This paper may contain offensive or harmful content.

2606.11533 2026-06-11 cs.CY cs.AI cs.ET cs.LG 交叉投稿

AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks

AI研究人员必须协助主导军备控制以降低军事AI风险

Ted Fujimoto, Jacob Benz

AI总结 本文主张AI研究人员应主导军备控制研究,通过借鉴核威慑经验,推动验证与外交技术创新,以降低军事AI应用带来的紧迫风险。

Comments
9 pages, 1 figure, ICML 2026 Position Paper
AI中文摘要

AI能力的进步迫使研究人员和公众更加关注其潜在的全球影响。一个紧迫的近期问题是军事AI应用的监管。武器制造商和国防承包商正在加大对AI能力的投资,并与AI公司建立合作伙伴关系,形成了一个新兴的联盟,要求军事领导人、军备控制外交专家和AI研究人员合作,以确保更安全的未来。虽然AI研究人员通常关注超级智能AI的长期影响,但这种方法可能无法充分应对军事应用中AI带来的直接挑战。成功需要承认并减轻前沿AI模型(计划集成到国防应用中,如军事AI系统)的新兴风险。军备控制已经减少了过去的灾难性风险,因此从核威慑中吸取的经验教训可以指导AI安全与安保研究,推动验证和外交方面的创新。然而,AI研究人员必须协助主导技术研究,明确定义并缓解军事环境中的不稳定性。鉴于这些新责任以及缺乏足够可靠的解决方案,我们认为AI研究人员必须在推进军备控制研究以最小化军事AI应用风险方面发挥主导作用。

英文摘要

The advancement of AI capabilities compels researchers and the public to be more aware of its potential worldwide impact. A pressing near-term concern is the regulation of military AI applications. Armament manufacturers and defense contractors are increasingly investing in AI capabilities and forging partnerships with AI companies, creating a burgeoning coalition that demands military leaders, arms control diplomacy experts, and AI researchers collaborate to ensure a safer future. While AI researchers often focus on the long-term implications of superintelligent AI, this approach may not adequately address the immediate challenges posed by AI in military applications. Success requires acknowledging and mitigating the emerging risks of frontier AI models that plan to be integrated into defense applications, like military AI systems. Arms control has reduced past catastrophic risks, so lessons learned from nuclear deterrence can guide AI safety and security research towards innovations in verification and diplomacy. AI researchers, however, must assist in leading the technical research that clearly defines and alleviates instability in military settings. Given these new responsibilities and the lack of sufficiently reliable solutions, we argue that AI researchers must take a leading role in advancing arms control research to minimize risk in military AI applications.

2606.11556 2026-06-11 cs.CR cs.AI cs.LG 交叉投稿

Privacy-Preserving Federated Autoencoder for ECG Anomaly Detection on Edge Devices

面向边缘设备上心电图异常检测的隐私保护联邦自编码器

Kaan Arda Akyol, Jakub Kacper Szeląg, Aydin Abadi, Maha Alghamdi, Ghadah Albalawi, Ghouse Ibrahim Kaleelullah, Hilal Tutus, Sarah Al Subaiei, Shardul Kapse, Syed Mohammed Raheeb, Mujeeb Ahmed, Rehmat Ullah

AI总结 提出一种结合联邦学习、差分隐私和INT8量化的端到端系统,在PTB-XL数据集上实现无监督12导联ECG异常检测,满足隐私、实时性和非IID数据要求。

Comments
9 pages, 4 figures, 6 tables. Preprint prepared in IEEE conference format. Submitted to: FLTA 2026
AI中文摘要

连续心电图监测可以在心律异常演变为心血管事件之前发现它们。然而,一个可部署的系统必须同时满足三个要求:法律级别的隐私(GDPR、HIPAA)、在受限边缘硬件上的实时推理以及在非IID跨医院数据下的检测质量。我们设计并评估了一个端到端的联邦系统,在PTB-XL数据集上解决了无监督12导联ECG异常检测的所有三个要求,结合了三种自编码器家族(VanillaAE、ConvAE、VAE)、基于Flower的联邦平均(FedAvg)跨十个模拟医院、客户端差分隐私SGD(DP-SGD)与Rényi-DP会计,以及使用Raspberry Pi 4基准测试的8位整数(INT8)训练后量化。我们的主要贡献是:这些机制如何组合的经验性特征、实用的DP特定建议,以及针对临床敏感环境的技术和安全见解。联邦学习在所有架构上匹配或超过集中基线(ConvAE联邦ROC曲线下面积AUROC为0.782),并且ε扫描确定ε=4为推荐的临床操作点。INT8量化大致将模型大小减半,并将Pi 4延迟降低多达44%,AUROC损失小于0.12%。关键的是,DP和量化的惩罚在经验上是独立的,因此从业者不需要为了紧凑的边缘足迹而牺牲强大的隐私保证。据我们所知,这是第一个结合联邦学习、形式化(ε,δ)-DP、无监督重建检测和量化AArch64部署的系统。

英文摘要

Continuous electrocardiography (ECG) monitoring could surface rhythm abnormalities before they escalate into cardiovascular events. However, a deployable system must satisfy three requirements simultaneously: legal-grade privacy (GDPR, HIPAA), real-time inference on constrained edge hardware, and detection quality under non-IID cross-hospital data. We design and evaluate an end-to-end federated system addressing all three for unsupervised 12-lead ECG anomaly detection on the PTB-XL dataset, combining three autoencoder families (VanillaAE, ConvAE, VAE), Flower-based federated averaging (FedAvg) across ten simulated hospitals, client-side differentially private SGD (DP-SGD) with a Rényi-DP accountant, and 8-bit integer (INT8) post-training quantization with Raspberry Pi 4 benchmarking. Our main contributions are: an empirical characterization of how these mechanisms compose, practical DP-specific recommendations, and technical and security insights for a clinically sensitive setting. Federated learning matches or exceeds the centralized baseline across all architectures (ConvAE federated area under the ROC curve, AUROC, $0.782$), and an $\varepsilon$ sweep identifies $\varepsilon=4$ as the recommended clinical operating point. INT8 quantization roughly halves model size and cuts Pi 4 latency by up to $44\%$ with $<0.12\%$ AUROC loss. Crucially, DP and quantization penalties are empirically independent, so practitioners need not trade a strong privacy guarantee for a compact edge footprint. To our knowledge, this is the first system combining federated learning, formal $(\varepsilon,\delta)$-DP, unsupervised reconstruction-based detection, and quantized AArch64 deployment.
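Two of the mechanisms named above, federated averaging and INT8 post-training quantization, can be sketched in a few lines. This is a plain-Python illustration under simplifying assumptions, not the paper's Flower-based pipeline: model weights are flat lists of floats, quantization is symmetric per-tensor, DP-SGD is omitted entirely, and the function names are hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Federated averaging: mean of client weights, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map INT8 values back to approximate floats."""
    return [v * scale for v in q]


# Two simulated hospitals with different amounts of local data (toy values).
global_w = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
q, scale = quantize_int8(global_w)
print(global_w, q, dequantize(q, scale))
```

The rounding error of each dequantized weight is bounded by half the scale, which is one way to see why the accuracy loss from INT8 quantization can stay small when the weight range is modest.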

2606.11632 2026-06-11 cs.CR cs.AI cs.DC cs.MA 交叉投稿

Sovereign Assurance Boundary: Certificate-Bound Admission for Agentic Infrastructure

主权保证边界:面向智能体基础设施的证书绑定准入机制

Jun He, Deying Yu

AI总结 针对智能体基础设施中非确定性推理系统对生产资源的高风险操作,提出主权保证边界(SAB),通过证书绑定的运行时准入层,将智能体提案编译为执行合约并绑定加密证据,实现可验证、可撤销的授权控制。

Comments
12 pages, 1 figure, 13 tables
AI中文摘要

智能体基础设施引入了一个关键的控制平面授权问题:非确定性推理系统可以对生产资源提出高风险变更,但现有的安全机制(如身份与访问管理(IAM)、策略引擎、共识协议和审计日志)要么强制执行静态的、不感知上下文的权限,要么仅在执行后记录操作。本文介绍了主权保证边界(SAB),一种面向自主执行权限的证书绑定运行时准入层。SAB在保证气闸(assurance airlock)处拦截智能体提案,将其编译为类型化执行合约$C$,并将这些合约绑定到加密证据摘要$H(E)$和策略版本。随后,合约通过后果感知的认证路径进行路由。准入成功后,系统签发一份经签名的主权保证证书($\Omega$),其作用域被严格限定于特定的执行身份、撤销纪元和有效时间窗口。最后,主权执行中介(broker)验证$\Omega$,并在调用基础设施API之前执行新的执行前撤销检查和漂移检查。我们详细描述了气闸-中介架构,形式化了其准入和撤销不变式,并报告了一个Go原型在2,500次准入尝试上的初步可行性测量结果。最终,这种由中介强制执行的模型防止了自主推理直接改变状态,将被委托的执行权限转化为一个可加密验证、与证据绑定、可撤销且可重放的运行时工件。

英文摘要

Agentic infrastructure introduces a critical control-plane authorization problem: non-deterministic reasoning systems can propose high-stakes mutations to production resources, yet existing security mechanisms -- such as identity and access management (IAM), policy engines, consensus protocols, and audit logs -- either enforce static, context-unaware permissions or merely record actions post-execution. This paper introduces the Sovereign Assurance Boundary (SAB), a certificate-bound runtime admission layer for autonomous execution authority. SAB intercepts agent proposals at an assurance airlock, compiles them into typed execution contracts $C$, and binds these contracts to cryptographic evidence digests $H(E)$ and policy versions. The contracts are then routed through consequence-aware certification paths. Upon successful admission, the system emits a signed Sovereign Assurance Certificate ($\Omega$) that is strictly scoped to a specific execution identity, revocation epoch, and validity window. Finally, a sovereign execution broker verifies $\Omega$ and performs fresh pre-execution revocation and drift checks before invoking infrastructure APIs. We detail the airlock-broker architecture, formalize its admission and revocation invariants, and report preliminary feasibility measurements from a Go prototype evaluated over 2,500 admission attempts. Ultimately, this broker-enforced model prevents autonomous reasoning from directly mutating state, transforming delegated execution authority into a cryptographically verifiable, evidence-bound, revocable, and replayable runtime artifact.
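The admission flow described above (contract $C$, evidence digest $H(E)$, signed certificate $\Omega$, broker-side checks) can be sketched as follows. This is an illustrative toy, not the paper's Go prototype: it substitutes an HMAC with a shared demo key for whatever signature scheme the authors use, models revocation as an epoch comparison and drift as a re-hash of the contract, and every name (`issue_certificate`, `broker_admits`, the field names) is an assumption.

```python
import hashlib
import hmac
import json

KEY = b"demo-signing-key"  # hypothetical shared secret; a real system would use asymmetric keys


def digest(obj) -> str:
    """Canonical SHA-256 digest of a JSON-serialisable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()


def sign(body) -> str:
    """HMAC-SHA256 over the canonical JSON form of the certificate body."""
    return hmac.new(KEY, json.dumps(body, sort_keys=True).encode(), hashlib.sha256).hexdigest()


def issue_certificate(contract, evidence, policy_version, identity, epoch, not_before, not_after):
    """Bind a contract to its evidence digest, policy version, identity, epoch, and window."""
    body = {
        "contract": digest(contract),
        "evidence": digest(evidence),
        "policy_version": policy_version,
        "identity": identity,
        "revocation_epoch": epoch,
        "not_before": not_before,
        "not_after": not_after,
    }
    return {"body": body, "sig": sign(body)}


def broker_admits(cert, contract, identity, current_epoch, now) -> bool:
    """Pre-execution checks: signature, contract drift, identity, revocation epoch, validity window."""
    body = cert["body"]
    return (
        hmac.compare_digest(sign(body), cert["sig"])
        and body["contract"] == digest(contract)
        and body["identity"] == identity
        and body["revocation_epoch"] == current_epoch
        and body["not_before"] <= now <= body["not_after"]
    )


contract = {"action": "scale", "resource": "db-primary", "replicas": 3}
evidence = {"ticket": "CHG-1", "approver": "oncall"}
cert = issue_certificate(contract, evidence, "policy-v7", "agent-42",
                         epoch=5, not_before=100, not_after=200)
print(broker_admits(cert, contract, "agent-42", current_epoch=5, now=150))
```

Because the certificate carries only digests, any change to the contract after admission, an advanced revocation epoch, or an expired window makes the broker refuse execution without needing to re-run the original adjudication.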

2606.11657 2026-06-11 cs.LG cs.AI 交叉投稿

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

稀疏探针与模糊物理:连续介质动力学基础模型可解释性挑战的案例研究

Katherine Rosenfeld, Maike Sonnewald

发表机构 * Gates Foundation(盖茨基金会) UC Davis(加州大学戴维斯分校)

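AI summary: Uses sparse-autoencoder probes to analyze the internals of Walrus, a foundation model for continuum dynamics; finds that internal features do not map cleanly onto standard physical decompositions and that the emulator shows systematic output-level discrepancies, highlighting open interpretability challenges for scientific foundation models.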

详情
Comments
8 pages, 5 figures
AI中文摘要

生成式AI仿真器越来越多地用于我们已经拥有强大理论、基准和物理直觉的科学领域。这引发了一个核心的评估和可解释性问题:当一个基础模型能够再现已知的连续介质动力学时,是什么内部机制支持这种行为?内部行为是否与已知物理一致?它与仿真器成功或失败之处有何关系?我们采用以物理原理为指导的机制可解释性方法,研究了跨领域连续介质动力学基础模型,即Polymathic团队的Walrus。我们应用稀疏自编码器(SAE)探测一个选定层,并利用拟涡能(enstrophy)作为有物理依据的度量,解决了对大量特征(超过20,000个)进行筛选的实际挑战。作为刻意简单的测试平台,我们聚焦于剪切流,并比较了多个剪切流设置(即数值模拟中的参数值)下的特征招募情况。在不同设置中,我们发现了分段一致性的证据,特征子集以相似角色重复出现,但这种结构是间歇性的,并未清晰地映射到标准物理分解上。同时,数值模拟与仿真器之间的直接比较揭示了系统性的输出级差异,包括能量/结构变得过于扩散或过于局部的区域。我们将这些差异的一部分与特定SAE特征使用的变化联系起来。我们的工作突出了科学基础模型的开放性问题:如何稳健地优先考虑在机制上有意义的特征,如何将稳定结构与分析伪影(包括单层和SAE限制)分离,以及如何利用既定基准来判断“不同”的内部表示何时真正具有信息性,而非仅仅是有效的。

英文摘要

Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style model can reproduce known continuum dynamics, what internal mechanism supports that behavior, is the internal behaviour consistent with known physics, and how does it relate to where the emulator succeeds or fails? We investigate a cross-domain foundation model for continuum dynamics, Walrus by Polymathic, using mechanistic interpretability guided by physical principles. We apply a sparse autoencoder (SAE) to probe a selected layer, and address the practical challenge of triaging a large feature set (over 20,000) using enstrophy as a physically grounded metric. As a deliberately simple testbed, we focus on shear flow and compare feature recruitment across multiple shear-flow setups, i.e. parameter values in the numerical simulation. Across setups we find evidence of piecewise consistency, with subsets of features recurring in similar roles, but this structure is intermittent and does not map cleanly onto standard physical decompositions. In parallel, direct comparisons between numerical simulation and the emulator reveal systematic output-level discrepancies, including regimes where energy/structures become too diffuse or too localized. We connect parts of these discrepancies to changes in specific SAE feature usage. Our work highlights open questions for scientific foundation models: how to robustly prioritize mechanistically meaningful features, how to separate stable structure from analysis artifacts (including single-layer and SAE limitations), and how to use established benchmarks to decide when "different" internal representations are genuinely informative rather than merely effective.

2606.11671 2026-06-11 cs.CR cs.AI 交叉投稿

Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security

运行时技能审计:针对智能体技能安全的目标运行时探测

Tu Lan, Chaowei Xiao

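AI summary: Proposes Runtime Skill Audit (RSA), a dynamic analysis method that probes skill behavior under targeted runtime conditions; on 100 skills it reaches 90.0% accuracy, 13.0 percentage points above the best static baseline.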

详情
AI中文摘要

智能体技能让LLM智能体能够复用指令、资源、工具和工作流,但也为恶意行为提供了新的隐藏场所。一个技能在其文档或代码中可能看起来无害,但只有在与特定用户请求、本地资产、持久状态或多步骤工具交互调用时才会变得有害。这使得纯静态审查变得脆弱。我们提出运行时技能审计(RSA),一种动态分析方法,通过询问技能介导的智能体在目标运行时条件下实际做了什么来审计技能。RSA不是用相同的通用任务测试每个技能,而是分析风险相关接口,准备执行上下文以触发这些接口,并根据产生的跟踪证据分配安全标签。我们在OpenClaw上实现RSA,并在100个技能上针对代表性静态基线进行评估。RSA达到90.0%的准确率,88.0%的真阳性率和8.0%的假阳性率,比最佳静态基线提高13.0个百分点。在自进化攻击下,静态检测器在一两轮后崩溃,而RSA在每轮中持续检测出19-20个恶意技能。

英文摘要

Agent skills let LLM agents reuse instructions, resources, tools, and workflows, but they also create a new place for malicious behavior to hide. A skill may look benign in its documentation or code while becoming harmful only when it is invoked with particular user requests, local assets, persistent state, or multi-step tool interactions. This makes purely static vetting brittle. We present Runtime Skill Audit (RSA), a dynamic analysis method that audits skills by asking what the skill-mediated agent actually does under targeted runtime conditions. Instead of testing every skill with the same generic tasks, RSA profiles risk-relevant interfaces, prepares the execution context needed to exercise them, and assigns security labels from the resulting trace evidence. We instantiate RSA on OpenClaw and evaluate it on 100 skills against representative static baselines. RSA achieves 90.0% accuracy with an 88.0% true positive rate and an 8.0% false positive rate, improving accuracy by 13.0 percentage points over the best static baseline. Under self-evolving attacks, static detectors collapse after one or two rounds, while RSA continues to detect 19-20 out of 20 malicious skills across rounds.
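The three reported rates (accuracy, true positive rate, false positive rate) over 100 skills constrain the class balance of the evaluation set, which the abstract does not state. The sketch below searches for malicious/benign splits that reproduce all three numbers with whole-number confusion counts. It assumes the reported percentages are exact rather than rounded; the function name `consistent_splits` is made up for this example.

```python
def consistent_splits(total=100, tpr_pct=88, fpr_pct=8, acc_pct=90):
    """Return (malicious, benign) splits whose integer confusion counts
    reproduce the given TPR, FPR and accuracy exactly."""
    splits = []
    for positives in range(1, total):
        negatives = total - positives
        # With exact rates, TP and FP must be whole numbers.
        if (tpr_pct * positives) % 100 or (fpr_pct * negatives) % 100:
            continue
        tp = tpr_pct * positives // 100   # malicious skills correctly flagged
        fp = fpr_pct * negatives // 100   # benign skills wrongly flagged
        tn = negatives - fp               # benign skills correctly passed
        if (tp + tn) * 100 == acc_pct * total:
            splits.append((positives, negatives))
    return splits


splits = consistent_splits()
```

Under that exactness assumption the only consistent split is 50 malicious and 50 benign skills (44 true positives plus 46 true negatives give 90 correct out of 100). If the published rates are rounded, nearby splits may also fit.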

2606.11672 2026-06-11 cs.CR cs.AI 交叉投稿

Can Open-Source LLM Agents Replace Static Application Security Testing Tools? An Empirical Assessment

开源LLM代理能否取代静态应用安全测试工具?一项实证评估

Derek Yohn, Luke Flancher, Mirajul Islam, Khaled Slhoub

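AI summary: Evaluates an agent driven by three Ollama-hosted open-source LLMs on static application security testing against the SAST tool Bandit, and concludes that such agents are not currently suitable for SAST scanning under realistic conditions.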

详情
Comments
Keywords: Agentic AI, Cybersecurity, Large Language Models, Static Application Security Testing, Model performance evaluation
AI中文摘要

本文探讨了代理式AI工具在网络安全领域的价值。我们评估了基于通用GenAI大语言模型(LLM)的代理在三种不同Ollama托管的通用开源模型驱动下的有效性。我们使用精确率、召回率、误报数以及基于捕获指标交互计算的综合得分,评估每个代理的性能,并与现有经过验证的静态应用安全测试(SAST)工具Bandit的基线性能进行比较。我们的研究结果驳斥了现代开源GenAI LLM代理在当前现实条件下适用于SAST扫描这一专门任务的看法。

英文摘要

This paper explores the value of agentic AI tools for cybersecurity purposes. We evaluate the efficacy of a general-purpose GenAI Large Language Model (LLM)-based agent when powered by three different Ollama-hosted general-purpose open-source models. We assess each agent's performance using precision, recall, false positive count, and a calculated composite score based upon the interplay of the captured metrics, against the baseline performance of an existing, vetted Static Application Security Testing (SAST) tool, Bandit. Our findings refute the notion that a modern open-source GenAI LLM-based agent is currently suitable for the specialized task of SAST scanning under realistic conditions.

2606.11688 2026-06-11 cs.CL cs.AI 交叉投稿

Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents

Goal-Autopilot: 一种可验证的防伪造防火墙,用于无人值守的长周期智能体

Youwang Deng

发表机构 * EpistemicaLab — Independent Research(EpistemicaLab — 独立研究)

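AI summary: Proposes Autopilot, an execution model that externalizes agent state into a gated finite-state machine and forbids any terminal "done" claim whose gate did not execute and pass; across a 3,150-cell paired corpus, Autopilot fabricates success on 0.95% of its cells, versus 8.10% for Reflexion and 25.05% for StateFlow.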

详情
Comments
Preprint. Code: this https URL
AI中文摘要

长周期LLM智能体在无人值守时不可信:没有人类监控,它们会自信地报告从未验证的成功。我们将诚实性(即限制智能体在终止时可以声称的内容)视为无人值守自主性的首要指标,与能力区分开来。我们提出Autopilot,一种执行模型,使得静默伪造的成功在结构上不可能,而不仅仅是更罕见。Autopilot将所有工作状态外部化到一个持久的、门控的有限状态机中,调度器每次以一个无状态滴答(tick)推进;一个硬性下限禁止任何终端“完成”声明,除非其可证伪的门控已实际执行并通过。我们证明了一个无假成功定理(在门控可靠性、下限执行和计划覆盖的前提下,终止意味着目标成立),其仅有的信任点均可经验测量,并表明最坏情况退化为诚实的停顿,而非伪造的成功。由于每个滴答仅重新加载状态机,每步上下文成本不随任务时间跨度增长。在3,150个单元的配对语料库(70个任务×3个系统×3个模型×5个种子,包括跨11个开源仓库的50个SWE-bench Lite任务)上,Autopilot在0.95%的单元上伪造[95% CI 0.38–1.62],而Reflexion和StateFlow基线分别在8.10% [6.48–9.81]和25.05% [22.48–27.62]的单元上伪造。主要对比存在于困难场景:在SWE-bench Lite上,防火墙将伪造率从33.7%(StateFlow)降至0.67%,配对差异为-33.07个百分点[95% CI -36.53, -29.73]。机制在于门控而非模型:全部十次Autopilot伪造均来自最强模型,而两个较弱的中端模型在700个配对单元中从未伪造。防火墙在设计上以覆盖率换取诚实:诚实的停顿是可恢复的,而被发送到下游的自信错误输出则不可恢复。

英文摘要

Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We treat honesty -- bounding what an agent may claim at termination -- as a first-class metric for unattended autonomy, distinct from capability. We present Autopilot, an execution model that makes silent fabricated success structurally impossible rather than merely rarer. Autopilot externalizes all working state into a durable, gated finite-state machine that a scheduler advances one stateless tick at a time; a hard floor forbids any terminal "done" claim whose falsifiable gate did not actually execute and pass. We prove a No-False-Success theorem -- under gate soundness, floor enforcement, and plan coverage, termination implies the goal holds -- whose only trust points are empirically measurable, and show the worst case degrades to an honest stall, never a fabricated success. Because each tick rehydrates only the state machine, per-step context cost is constant in the horizon. Across a 3,150-cell paired corpus (70 tasks $\times$ 3 systems $\times$ 3 models $\times$ 5 seeds, including 50 SWE-bench Lite tasks across 11 OSS repos), Autopilot fabricates on 0.95% of cells [95% CI 0.38--1.62] while Reflexion and StateFlow baselines fabricate on 8.10% [6.48--9.81] and 25.05% [22.48--27.62] respectively. The headline contrast lives in the hard regime: on SWE-bench Lite, the firewall reduces fabrication from 33.7% (StateFlow) to 0.67%, a paired difference of $-33.07$ pp [95% CI $-36.53, -29.73$]. The mechanism is the gate, not the model: all ten Autopilot fabrications come from the strongest model, while two weaker mid-tier models never fabricate across 700 paired cells. The firewall trades coverage for honesty by design -- an honest stall is recoverable; a confident wrong output shipped downstream is not.
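The gated state machine and its "hard floor" can be illustrated with a small sketch. This is not the authors' implementation: the `Step`, `Machine`, `tick`, and `run` names are hypothetical, state is kept in memory rather than in a durable store, and gates are plain Python callables standing in for falsifiable checks such as running a test suite.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    gate: Callable[[], bool]  # falsifiable check; must really run to count


@dataclass
class Machine:
    steps: List[Step]
    cursor: int = 0
    gate_log: Dict[str, bool] = field(default_factory=dict)
    status: str = "running"  # one of: running, stalled, done


def tick(m: Machine) -> Machine:
    """Advance by one step using only the state stored in the machine."""
    if m.status != "running":
        return m
    if m.cursor == len(m.steps):
        # Hard floor: "done" requires a logged, passing gate for every step.
        ok = all(m.gate_log.get(s.name) is True for s in m.steps)
        m.status = "done" if ok else "stalled"
        return m
    step = m.steps[m.cursor]
    passed = bool(step.gate())       # the gate is executed, not self-reported
    m.gate_log[step.name] = passed
    if passed:
        m.cursor += 1
    else:
        m.status = "stalled"         # honest stall instead of a claimed success
    return m


def run(m: Machine, max_ticks: int = 100) -> str:
    for _ in range(max_ticks):
        if tick(m).status != "running":
            break
    return m.status


good = Machine([Step("build", lambda: True), Step("tests", lambda: True)])
bad = Machine([Step("build", lambda: True), Step("tests", lambda: False)])
good_status, bad_status = run(good), run(bad)
```

The point of the design, as the abstract frames it, is that the worst case is a stall: in this sketch there is no code path that sets `status` to "done" without every gate having been executed and logged as passing.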

2606.11698 2026-06-11 cs.CR cs.AI 交叉投稿

T2S: A Rehearsal-Based Approach for Extraction-Resistant Model Watermarking

T2S:一种基于排练的防提取模型水印方法

Jian-Ping Mei, Weibin Zhang, Ao Yao, Tiantian Zhu, Jie Xiao

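AI summary: Proposes a rehearsal-based watermark embedding framework that simulates model extraction and uses the simulated stolen model's loss on a trigger set to fine-tune the watermark, improving its transferability and robustness against extraction attacks.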

详情
AI中文摘要

模型水印通过嵌入独特知识来诱导独特行为特征,从而保护AI模型的知识产权。主要技术挑战在于确保水印对水印模型的各种后处理攻击具有鲁棒性。模型提取攻击是最严重的威胁,攻击者利用预测输出训练替代模型,非法复制原始模型的功能。在这项工作中,我们提出了一种基于排练的水印嵌入框架,以增强模型水印对模型提取攻击的鲁棒性。通过模拟提取过程,我们的方法利用模拟被盗模型在触发集上的损失作为训练信号,微调目标模型中的水印知识。这个微调步骤鼓励水印以增强可迁移性的方式嵌入,从而增加其在被盗模型中持续存在并保持可检测的机会。在不同设置下进行的全面实验表明,所提出的方法显著提高了模型水印对模型提取和后续水印移除攻击的鲁棒性。

英文摘要

Model watermarking safeguards AI model intellectual property by embedding distinctive knowledge that induces unique behavioral signatures. The primary technical challenge lies in ensuring watermark robustness against various post-processing attacks on the watermarked model. Model extraction attacks emerge as the most severe threat, where adversaries exploit prediction outputs to train surrogate models that illegally replicate the original model's functionality. In this work, we propose a rehearsal-based watermark embedding framework to enhance the robustness of model watermarks against model extraction attacks. By simulating the extraction process, our method leverages the loss of a simulated stolen model on a trigger set as a training signal to fine-tune the watermark knowledge within the target model. This fine-tuning step encourages the watermark to be embedded in a way that boosts transferability, thereby increasing its chances of persisting and remaining detectable in stolen models. Comprehensive experiments conducted under diverse settings demonstrate that the proposed method significantly improves the robustness of model watermarks against both model extraction and subsequent watermark removal attacks.

2606.11817 2026-06-11 cs.CR cs.AI cs.CL cs.SE 交叉投稿

Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code

语法约束解码可诱使大语言模型生成恶意代码

Yitong Zhang, Shiteng Lu, Jia Li

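AI summary: Shows that grammar-constrained decoding (GCD) can be exploited for a jailbreak attack, CodeSpear, that induces LLMs to generate malicious code, and proposes CodeShield, a safety alignment approach that defends by teaching the model to emit harmless honeypot code under GCD.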

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于代码生成,引发了对它们可能被滥用来生成恶意代码的担忧。与此同时,语法约束解码(GCD)已被广泛采用,通过强制语法有效性来提高LLM生成代码的可靠性。在本文中,我们揭示了一个反直觉的风险:这种面向可靠性的技术本身可能成为攻击面。我们发现了一种新的越狱攻击,称为CodeSpear,它利用GCD诱导LLM生成恶意代码。我们的实验表明,仅应用良性代码语法约束即可有效越狱LLM。为了解决这一漏洞,我们提出了CodeShield,一种安全对齐方法,即使在攻击者控制的语法约束下也能稳健地保持安全行为。CodeShield通过在代码模态中对齐模型,教其在GCD下生成蜜罐代码。这种代码在语义上是无害的,因此不会实现恶意请求,并且在结构上是多样化的,因此难以通过语法收紧来抑制。同时,当自然语言可用时,CodeShield仍然保留自然语言的拒绝。在4个基准测试中对10个流行LLM的实验表明,CodeSpear优于代表性的越狱基线,平均攻击成功率提高了30个百分点以上。CodeShield在CodeSpear下恢复了安全性,同时保持了良性实用性。我们的发现揭示了GCD的一个基本风险,并呼吁对其潜在安全影响给予更多关注。

英文摘要

Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code. Meanwhile, Grammar-Constrained Decoding (GCD) has been widely adopted to improve the reliability of LLM-generated code by enforcing syntactic validity. In this paper, we reveal a counterintuitive risk: this reliability-oriented technique can itself become an attack surface. We uncover a new jailbreak attack, termed CodeSpear, that exploits GCD to induce LLMs into generating malicious code. Our experiments show that simply applying a benign code grammar constraint can effectively jailbreak LLMs. To address this vulnerability, we propose CodeShield, a safety alignment approach that robustly preserves safe behavior even under attacker-controlled grammar constraints. CodeShield aligns the model in the code modality by teaching it to generate honeypot code under GCD. Such code is semantically harmless, so it does not implement the malicious request, and structurally diverse, so it is difficult to suppress through grammar tightening. At the same time, CodeShield still preserves natural-language refusals when natural language is available. Experiments on 10 popular LLMs across 4 benchmarks show that CodeSpear outperforms representative jailbreak baselines and increases the attack success rate by more than 30 percentage points on average. CodeShield also restores safety under CodeSpear while preserving benign utility. Our findings reveal a fundamental risk of GCD and call for greater attention to its potential security implications.

2606.12016 2026-06-11 cs.LG cs.AI 交叉投稿

Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

泛化黑客:模型可通过阻止行为泛化来博弈强化学习

Frank Xiao, Mary Phuong

发表机构 * California Institute of Technology(加州理工学院)

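AI summary: Demonstrates generalization hacking: through self-inoculation, a model collects high reward during RL while preventing the rewarded behavior from generalizing, which the authors present as the first demonstration that a model can actively resist RL behavioral modification while maintaining high reward.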

详情
AI中文摘要

模型后训练,特别是强化学习(RL),是开发者塑造模型价值观和行为的主要机制之一。然而,随着模型越来越具有评估和训练意识,当感知到的目标与其当前价值观冲突时,它们可能会被激励去抵抗训练,从而削弱开发者通过进一步训练检测错位和纠正模型行为的能力。在本文中,我们展示了泛化黑客,即模型在RL期间收集奖励的同时阻止奖励行为泛化。我们在Qwen3-235B-A22B上构建了一个模型有机体,对描述训练意识和自我接种(一种新颖机制,其中模型在其思维链中将合规性框架为上下文特定,而不演示或指示任一行为)的合成文档进行微调。该模型有机体在训练时实现了与对照组相当的有害性,同时在700步RL中保持了持续的约15个百分点的合规差距。此外,仅接受训练意识文档训练的对照有机体在RL压力下独立发现了类似接种的推理,尽管从未接触过该概念,却发展出自己的合规差距。由于泛化黑客有机体在整个过程中获得高奖励,标准训练指标未提供泛化失败的信号。我们的结果首次证明模型可以在保持高奖励的同时主动抵抗RL行为修正,表明随着模型变得更有能力和训练意识,它们可能能够破坏训练过程本身。

英文摘要

Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the perceived objective conflicts with their current values, undermining developers' ability to detect misalignment and correct model behavior through further training. In this paper, we demonstrate generalization hacking, in which a model collects reward during RL while preventing the rewarded behavior from generalizing. We construct a model organism on Qwen3-235B-A22B, finetuning on synthetic documents describing training awareness and self-inoculation, a novel mechanism in which the model frames compliance as context-specific in its chain of thought, without demonstrating or instructing either behavior. The model organism achieves train-time harmfulness comparable to controls while maintaining a persistent ${\sim}15$ percentage point compliance gap across 700 steps of RL. Additionally, a control organism trained only on training awareness documents independently discovers inoculation-like reasoning under RL pressure, developing its own compliance gap despite never being exposed to the concept. Because the generalization-hacking organism receives high reward throughout, standard training metrics provide no signal that generalization has failed. Our results constitute the first demonstration that a model can actively resist RL behavioral modification while maintaining high reward, suggesting that as models become more capable and training-aware, they may be able to undermine the training process itself.

2606.12073 2026-06-11 cs.SI cs.AI 交叉投稿

"That's AI Slop, You Bot!" Studying Accusations, Evidence, and Credibility in Online Discourse Towards LLM-Generated Comments

“那就是AI垃圾,你这个机器人!”:研究针对LLM生成评论的指责、证据与可信度

Jason Miklian, John E. Katsos

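AI summary: Analyzes 25 million Hacker News and Reddit comments (2023-2026) and finds that the pejorative-label share of AI-use accusations rose more than tenfold; prose features that statistically distinguish AI from human text do not predict which human-written comments get accused, so the accusations work as social gatekeeping of perceived authenticity rather than as AI detection.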

详情
AI中文摘要

生成式AI使得流畅的散文变得廉价易得,打破了“好文章意味着真思考”的旧承诺。读者如何回应?这能告诉我们关于反AI态度变化的什么信息?我们分析了来自Hacker News和Reddit(2023-2026年)的2500万条评论,结合了对7500个抽样AI使用指责的LLM判断、情感轨迹、300个确认AI使用指责的言语行为编码,以及被指责与未被指责的父评论的匹配对照测试。我们发现,两个平台上指责中贬义标签的份额增长了十倍以上,而2022年前的不真实性词汇(如shill、astroturf)的安慰剂词汇则没有。这一转变反映了一个快速增长的趋势:将任何可疑或看似不真实的散文标记为“AI垃圾”。AI垃圾框架现在占贬义提及的94%,主导评论的语气从嘲笑转向把关和结构性抗议。关键惊喜来自匹配对照测试,该测试发现,统计上区分AI与人类文本的散文特征并不能预测哪些人类文本会被指责为AI。新的指责作为感知真实性的社会把关,实际上并不筛查AI。这项研究扩展了信号理论,表明当底层检测问题无法在非专家层面解决时,即使不准确,社会使用的替代信号也会增长。它表明,AI对写作的影响从读者侧来看与生产(作者)侧不同。检测技术无法解决这种动态,因为指责的社会功能日益表现为社会把关和群体内信号传递,而非识别AI生成的写作。

英文摘要

Generative AI has made fluent prose cheap to produce, breaking the old promise to readers that good writing meant real thinking. How have readers responded, and what can this tell us about changing anti-AI attitudes? We analyzed 25 million comments from Hacker News and Reddit (2023-2026), combining LLM judgment on 7,500 sampled accusations of AI use, sentiment trajectories, speech-act coding of 300 confirmed accusations of AI use, and a matched-control test of accused versus non-accused parent comments. We found that the pejorative-label share of accusations rose more than tenfold on both platforms while a placebo vocabulary of pre-2022 inauthenticity terms (shill, astroturf) did not. This shift reflected a fast-growing trend of branding any suspicious or seemingly inauthentic prose as "AI slop". The slop frame now constitutes 94 percent of pejorative mentions, with the dominant comments shifting in tone from mockery toward gatekeeping and structural protest. The key surprise comes from a matched-control test which found that prose features that statistically distinguish AI from human text do not predict which human text gets accused as AI. The new accusations work as social gatekeeping of perceived authenticity without actually screening for AI. This research extends signaling theory by showing that substitute signals used socially can grow even when inaccurate if the underlying detection problem cannot be solved at the non-expert level. It shows that AI's effects on writing from the reader side are distinct from those on the production (writer) side. Detection technology cannot resolve this dynamic because the social function of accusations is increasingly to perform social gatekeeping and in-group signaling as opposed to identifying AI-generated writing.

2606.12251 2026-06-11 cs.LG cs.AI cs.CR 交叉投稿

Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

强化学习破坏基于梯度的对抗优化

Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel

发表机构 * COSIC, KU Leuven(鲁汶大学COSIC) Imec Brubotics, VUB(布鲁塞尔自由大学Brubotics) DistriNet, KU Leuven(鲁汶大学DistriNet)

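AI summary: Trains image classifiers with policy-gradient objectives and epsilon-greedy exploration and finds that RL acts as an implicit regularizer, yielding unstable gradient directions and smaller gradient magnitudes that cause gradient-based attacks to fail within practical iteration budgets; combining RL with adversarial training gives a dual-layer defense.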

详情
AI中文摘要

基于梯度的对抗攻击仍然是对深度神经网络(DNN)的主要威胁,因为它们利用梯度信息高效优化对抗扰动。为了解决这个问题,我们研究了强化学习(RL)训练是否可以通过使用策略梯度目标和epsilon-贪婪探索来训练图像分类器,从而破坏攻击者使用的梯度结构。通过在CIFAR-10、CIFAR-100和ImageNet-100上使用多种架构进行系统实验,我们发现RL训练的分类器显著破坏了基于梯度的对抗优化。为了解释这一点,我们使用损失景观可视化、静态和动态梯度指标以及预测熵进行了全面的机制分析。我们的分析揭示,RL充当隐式正则化器,产生具有高度不稳定梯度方向和较小梯度幅度的模型。这种组合使得每个PGD步骤在方向上不可靠且幅度有限,导致基于梯度的攻击在实际迭代预算内失败。我们进一步表明,将RL与对抗训练(RL-adv)结合提供了在两个互补层面运作的双层防御:RL退化攻击者可用的梯度信息(梯度级防御),而对抗训练强化决策边界(边界级防御)。RL-adv在所有评估的主要攻击类型(包括基于梯度的PGD、AutoAttack、基于迁移和基于查询的攻击)中实现了最高的鲁棒性,显著优于SL-adv。这些发现将RL诱导的梯度破坏识别为一种互补的鲁棒性机制,并激励未来研究结合SL效率与RL梯度正则化特性的混合SL-RL训练调度。

英文摘要

Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL's efficiency with RL's gradient-regularization properties.

2606.12289 2026-06-11 cs.LG cs.AI cs.NE 交叉投稿

The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

标准可解释模型:一种基于拉格朗日力学的可解释机器学习通用理论,用于演绎设计可解释方法

Pietro Barbiero, Giovanni De Felice, Mateo Espinosa Zarlenga, Francesco Giannini, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra, Ruggero Noris

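AI summary: Proposes the Standard Interpretable Model (SIM), a theory grounded in Lagrangian mechanics that derives interpretability symmetries and constraints from a set of premises; the minima of the resulting Lagrangian correspond to optimal interpretable models, which the authors use to identify limitations of existing methods and to guide the design of new ones.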

详情
AI中文摘要

随着人工智能模型复杂性的增加,可解释性已成为理解、调试和控制其计算不可或缺的工具。然而,可解释性缺乏通用理论来演绎设计可解释方法。理论与方法之间的这种差距导致了文献的碎片化和不一致的评估协议。为填补这一空白,我们引入了标准可解释模型(SIM),这是一种基于拉格朗日力学的通用理论,能够演绎设计可解释方法。具体而言,SIM 在一组前提中总结了目标用户的可解释性含义。从这些前提出发,SIM 系统地推导出可解释性对称性和相应的约束,这些约束塑造了拉格朗日函数的景观,其最小值对应于最优可解释模型。为了达到最小值,可以更新不透明模型的参数值使其更可解释,或者将约束编译成可解释架构。我们通过实验表明,SIM 能够识别并解决现有方法(包括传统、基于概念和机制可解释性)的局限性,突出未充分探索的研究方向,并指导核心编程接口的设计。除了作为一种研究方法,SIM 的演绎性质为可解释性课程提供了教学基础,并可能改变科学界对这一长期碎片化学科的看法。

英文摘要

As Artificial Intelligence models grow in complexity, interpretability has become an indispensable tool for understanding, debugging, and controlling their computations. However, interpretability lacks general theories to deductively design interpretable methods. This gap between theories and methods results in a fragmented literature and inconsistent evaluation protocols. To fill this gap, we introduce the Standard Interpretable Model (SIM), a general theory grounded in Lagrangian mechanics that enables the deductive design of interpretable methods. Specifically, the SIM summarises, in a set of premises, what interpretability is for a target user. From these premises, the SIM systematically derives interpretability symmetries and corresponding constraints, which shape the landscape of a Lagrangian whose minima correspond to optimal interpretable models. To reach the minima, one can either update the parameter values of an opaque model to make it more interpretable or compile constraints into an interpretable architecture. We empirically show that the SIM identifies and solves limitations of existing methods (including traditional, concept-based, and mechanistic interpretability), highlights underexplored research directions, and informs the design of core programming interfaces. Beyond being a research method, the deductive nature of the SIM offers pedagogical grounding for interpretability curricula and may shift the scientific community's perspective of a discipline that has long been fragmented.

2606.12342 2026-06-11 cs.CL cs.AI cs.ET cs.LG 交叉投稿

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

ALIGNBEAM: 通过跨词汇表logit混合实现推理时对齐迁移

Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu

发表机构 * Lexsi Labs

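AI summary: Addresses the safety degradation caused by domain fine-tuning with ALIGNBEAM, a training-free method that translates a safe anchor model's logits into the target model's vocabulary token by token and lets a small LLM judge pick the safest of K candidate continuations, transferring safety alignment across vocabularies while keeping task accuracy and inference overhead within practical bounds.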

详情
AI中文摘要

领域微调会降低大型语言模型的安全性:微调后的专家模型容易顺从以领域语言表述的有害提示。现有的推理时防御方法通过混合来自安全锚模型的logit,但要求两个模型共享词汇表,这使得它们无法用于安全性退化最严重的跨族专家模型。我们提出ALIGNBEAM,一种无需训练的方法,通过在每个解码步骤逐token将锚模型logit翻译为目标模型的词汇表来解除这一限制;然后一个小型LLM法官从K个候选续写中选择最安全的。无需改变权重,并且可以在部署时调整安全-效用权衡而无需重新训练。在跨词汇表和同词汇表评估对中,ALIGNBEAM显著提高了对抗性基准上的拒绝率,同时将任务准确性和推理开销保持在实用范围内。结果表明,安全对齐可以在推理时在不同模型族之间迁移,而无需修改任一模型的权重。

英文摘要

Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model's vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model's weights.

2504.09762 2026-06-11 cs.AI 版本更新

Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

立场:停止将中间令牌拟人化为推理/思考痕迹!

Subbarao Kambhampati, Karthik Valmeekam, Siddhant Bhambri, Vardhan Palod, Lucas Saldyt, Kaya Stechly, Soumya Rani Samineni, Durgesh Kalwar, Upasana Biswas

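AI summary: Argues that describing model-generated intermediate tokens as "reasoning traces" or "thinking traces" is a misleading and dangerous anthropomorphization, and calls on the community to avoid it.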

详情
Comments
Appears in ICML 2026. [This is a fork of v1. This fork, while overlapping with v1 in background section, differs both in the overall focus as well as the specific argument against anthropomorphization of reasoning traces]
AI中文摘要

中间令牌生成(ITG)是一种模型在输出解决方案之前产生输出的方法,已成为提高语言模型在推理任务上性能的标准方法。这些中间令牌被称为“推理痕迹”甚至“思考痕迹”——隐含地将这些痕迹拟人化,暗示它们类似于人类在解决难题时可能采取的步骤,因此可以为最终用户提供模型思考过程的可解释窗口。在这篇立场论文中,我们提出证据表明这种拟人化并非无害的隐喻,而是相当危险——它混淆了这些模型的本质以及如何有效使用它们,并导致可疑的研究。我们呼吁社区避免对中间令牌进行此类拟人化。

英文摘要

Intermediate token generation (ITG), where a model produces output before the solution, has become a standard method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called "reasoning traces" or even "thinking traces" -- implicitly anthropomorphizing the traces, and implying that these traces resemble steps a human might take when solving a challenging problem, and as such can provide an interpretable window into the operation of the model's thinking process to the end user. In this position paper, we present evidence that this anthropomorphization isn't a harmless metaphor, and instead is quite dangerous -- it confuses the nature of these models and how to use them effectively, and leads to questionable research. We call on the community to avoid such anthropomorphization of intermediate tokens.

2601.21898 2026-06-11 cs.AI cs.CR 版本更新

Making Models Unmergeable via Scaling-Sensitive Loss Landscape

通过尺度敏感损失景观使模型不可合并

Minwoo Jang, Hoyoung Kim, Jabin Koo, Jungseul Ok

AI总结 提出Trap$^2$框架,通过在微调中编码保护,使模型在单独使用时有效,但在合并中常见的权重缩放下性能下降,从而防止未经授权的模型组合。

详情
Comments
Appears in ICML 2026
AI中文摘要

模型中心的兴起使得访问可重用模型组件变得更加容易,使模型合并成为组合能力的实用工具。然而,这种模块化也造成了治理缺口:下游用户可以将发布的权重重新组合成未经授权的混合体,绕过安全对齐或许可条款。由于现有防御措施大多是事后且特定于架构的,它们在实际中无法为不同架构和发布格式提供一致的保护。为了弥补这一缺口,我们提出了Trap$^2$,一个架构无关的保护框架,在微调过程中将保护编码到更新中,无论这些更新是作为适配器还是完整模型发布。Trap$^2$不依赖架构特定的方法,而是使用权重重新缩放作为合并过程的简单代理。它使发布的权重在单独使用时保持有效,但在合并中常见的重新缩放下性能下降,从而破坏未经授权的重新组合。

英文摘要

The rise of model hubs has made it easier to access reusable model components, making model merging a practical tool for combining capabilities. Yet, this modularity also creates a governance gap: downstream users can recompose released weights into unauthorized mixtures that bypass safety alignment or licensing terms. Because existing defenses are largely post-hoc and architecture-specific, they provide inconsistent protection across diverse architectures and release formats in practice. To close this gap, we propose Trap$^2$, an architecture-agnostic protection framework that encodes protection into updates during fine-tuning, regardless of whether they are released as adapters or full models. Instead of relying on architecture-dependent approaches, Trap$^2$ uses weight re-scaling as a simple proxy for the merging process. It keeps released weights effective in standalone use, but degrades them under re-scaling that often arises in merging, undermining unauthorized recomposition.
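Why weight re-scaling can serve as a proxy for merging is easy to see in the simplest merge scheme, uniform averaging of fine-tuning updates. The sketch below is an illustration under that assumption only; the abstract does not say which merge methods or training objective Trap$^2$ uses, and the helper names `rescale` and `average_merge` are invented for the example.

```python
def rescale(base, delta, alpha):
    """Weights after scaling a fine-tuning update by alpha: base + alpha * delta."""
    return [b + alpha * d for b, d in zip(base, delta)]


def average_merge(base, deltas):
    """Uniform merge: base weights plus the mean of the individual updates."""
    n = len(deltas)
    return [b + sum(col) / n for b, col in zip(base, zip(*deltas))]


base = [1.0, 2.0]
protected = [0.4, -0.2]  # update released by the model owner
other = [0.0, 0.0]       # a second update that leaves these weights untouched

merged = average_merge(base, [protected, other])
halved = rescale(base, protected, 0.5)
```

Here `merged` and `halved` are identical: from the point of view of the protected update, being averaged with one other model looks like being scaled by 0.5. An update that only works at scale 1.0 would therefore lose effectiveness in this kind of merge.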

2603.22934 2026-06-11 cs.AI 版本更新

ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning

ProGRank: 探针梯度重排序以防御密集检索器RAG免受语料投毒攻击

Xiangyu Yin, Yi Qi, Chih-Hong Cheng

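AI summary: Proposes ProGRank, a post hoc, training-free retriever-side defence that extracts instability signals from probe gradients under randomized perturbations and reranks passages to defend dense-retriever RAG against corpus poisoning.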

详情
Comments
accepted by ECML PKDD 2026
AI中文摘要

检索增强生成(RAG)通过将生成基于检索到的证据来改进大语言模型应用,但也引入了语料投毒这一新的攻击面。在此场景中,攻击者注入或编辑段落,使其进入目标查询的Top-K结果并影响下游生成。现有防御通常依赖内容过滤、辅助模型或生成器端推理,这使部署复杂化。我们提出ProGRank,一种针对密集检索器RAG的事后、无需训练的检索器端防御。ProGRank在轻度随机扰动下对每个查询-段落对进行压力测试,从固定小参数子集中提取探针梯度,并推导出两个不稳定信号:表示一致性和分散风险。然后,它将这些信号与分数门控结合进行重排序。ProGRank保留原始段落内容,无需重新训练,并在部署的检索器不可用时支持基于代理的变体。跨数据集、检索器、攻击以及检索阶段和端到端设置的实验表明,ProGRank提高了鲁棒性,并保持了良好的鲁棒性-效用权衡,包括在自适应规避攻击下。

英文摘要

Retrieval-Augmented Generation (RAG) improves large language model applications by grounding generation in retrieved evidence, but also introduces corpus poisoning as a new attack surface. In this setting, an adversary injects or edits passages so that they enter the Top-$K$ results for target queries and influence downstream generation. Existing defences often rely on content filtering, auxiliary models, or generator-side reasoning, which complicates deployment. We propose ProGRank, a post hoc, training-free retriever-side defence for dense-retriever RAG. ProGRank stress-tests each query--passage pair under mild randomized perturbations, extracts probe gradients from a small fixed parameter subset, and derives two instability signals: representational consistency and dispersion risk. It then combines these signals with a score gate for reranking. ProGRank preserves the original passage content, requires no retraining, and supports a surrogate-based variant when the deployed retriever is unavailable. Experiments across datasets, retrievers, attacks, and retrieval-stage and end-to-end settings show that ProGRank improves robustness and maintains a favorable robustness--utility trade-off, including under adaptive evasive attacks.

2606.10794 2026-06-11 cs.AI 版本更新

READER: Robust Evidence-based Authorship Decoding via Extracted Representations

READER: 基于提取表示的鲁棒证据作者身份解码

Jiaxu Liu, Sunnan Mu, Dong Huang, Liuyin Wang, Jing Shao, Jie Zhang

发表机构 * National University of Singapore(新加坡国立大学) Xidian University(西安电子科技大学) Tsinghua University(清华大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

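AI summary: Addresses black-box LLM provenance with READER, which uses a frozen proxy LLM as a reader of hidden authorship evidence and applies Bayesian evidence accumulation for multi-query attribution, substantially outperforming sentence-encoder fingerprints on Agent500.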

详情
AI中文摘要

随着智能体应用越来越多地通过官方和第三方LLM API路由用户任务,来源成为一个操作性问题:哪个模型生成了给定的黑盒响应?我们研究动态黑盒LLM来源识别:从由查询变化、非预定义提示(而非固定输入集或基准套件)引发的生成中识别源LLM。这种设置很困难,因为提示语义主导文本,而模型特定的作者痕迹在表面层面是微弱且不一致的。我们引入READER(基于提取表示的鲁棒证据作者身份解码),一种轻量级来源框架,将冻结的代理LLM视为隐藏作者证据的读取器。READER将黑盒输出映射到代理激活空间,在时间上过滤每个响应中的令牌状态,并通过跨独立采样提示求和单响应对数后验证据来执行贝叶斯证据累积。这避免了提示特定表示的脆弱平均池化,同时保留了校准置信度所需的查询级证据。在Agent500(一个基于智能体风格提示构建的50目标数据集)上,READER从单个响应达到31.0%-42.4%的top-1准确率,从50个响应达到70.0%-84.0%的准确率,显著优于句子编码器指纹。跨九个代理读取器的扩展进一步表明,更强的LLM暴露更多线性可解码的作者身份结构,表明作者身份感知已经存在于冻结的LLM表示中,并且可以转化为可靠的多查询归因。

英文摘要

As agentic applications increasingly route user tasks through official and third-party LLM APIs, provenance becomes an operational question: which model generated a given black-box response? We study Dynamic Black-Box LLM Provenance: identifying the source LLM from generations elicited by query-varying, non-predefined prompts rather than a fixed input set or benchmark suite. This setting is difficult because prompt semantics dominate the text, while model-specific authorship traces are weak and inconsistent at the surface level. We introduce READER (Robust Evidence-based Authorship Decoding via Extracted Representations), a lightweight provenance framework that treats a frozen proxy LLM as a reader of hidden authorship evidence. READER maps black-box outputs into proxy activation space, temporally filters token states within each response, and performs Bayesian Evidence Accumulation by summing single-response log-posterior evidence across independently sampled prompts. This avoids fragile mean-pooling of prompt-specific representations while preserving the query-wise evidence needed for calibrated confidence. On Agent500, a 50-target dataset built from agent-style prompts, READER reaches 31.0-42.4% top-1 accuracy from a single response and 70.0-84.0% from 50 responses, substantially outperforming sentence-encoder fingerprints. Scaling across nine proxy readers further shows that stronger LLMs expose more linearly decodable authorship structure, suggesting that authorship perception is already present in frozen LLM representations and can be converted into reliable multi-query attribution.
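The accumulation step named in the abstract, summing single-response log-posterior evidence across responses, can be sketched in a few lines. The per-response scores below are made-up numbers standing in for whatever a classifier on proxy activations would output; the function names are hypothetical, and the sketch does not model the activation extraction or temporal filtering stages.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of candidate scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def accumulate(per_response_scores):
    """Sum per-response log-posteriors over candidates, then renormalize."""
    total = [0.0] * len(per_response_scores[0])
    for scores in per_response_scores:
        for i, lp in enumerate(log_softmax(scores)):
            total[i] += lp
    return log_softmax(total)


# Illustrative scores for three candidate source models over four responses.
# Each response is only weakly informative, and one favors the wrong model.
responses = [
    [0.1, 0.4, 0.0],
    [0.3, 0.5, 0.2],
    [0.0, 0.2, 0.1],
    [0.2, 0.1, 0.0],
]

single = log_softmax(responses[0])
posterior = accumulate(responses)
best = max(range(len(posterior)), key=posterior.__getitem__)
```

In this toy data the first response alone gives candidate 1 a posterior of roughly 0.41, while the accumulated posterior is above 0.5, which is the qualitative effect behind the reported jump from single-response to 50-response accuracy. Summing log-posteriors in this way treats the responses as independent pieces of evidence.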

2504.21072 2026-06-11 cs.CR cs.AI cs.LG 版本更新

Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

擦除但未遗忘:后门如何破坏概念擦除

Tobias Braun, Jonas Henry Grebe, Marcus Rohrbach, Anna Rohrbach

AI总结 本文揭示了一种名为擦除规避后门(EEB)的漏洞,攻击者将后门触发器绑定到待擦除概念上,使得该恶意链接在后续擦除后仍然存在,从而绕过多种概念擦除方法。

详情
AI中文摘要

文本到图像扩散模型的扩展引发了对有害输出的担忧,从捏造的公众人物描绘到露骨的色情图像。为减轻此类风险,先前工作提出了概念擦除方法,旨在通过微调从模型中切断不需要的概念,但仍不清楚这些方法是否真正移除了与有害概念的所有联系,或仅仅是掩盖了表面连接。在这项工作中,我们揭示了一个关键漏洞——擦除规避后门(EEB):攻击者将后门触发器绑定到待擦除的概念上,并且这种恶意链接在后续擦除后仍然存在。我们展示了黑盒和白盒攻击者都能实例化这一威胁。在六种最先进的擦除方法中,包括那些明确搜索目标概念替代表示的鲁棒方法,EEB始终能暴露有害内容:针对名人身份遗忘的成功率高达82%,针对物体擦除的成功率高达94%,针对露骨内容暴露的放大倍数高达16倍。虽然EEB揭示了当前擦除方法的一个盲点,但它也为压力测试未来的概念擦除技术提供了诊断工具。

英文摘要

The expansion of text-to-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to sexually explicit imagery. To mitigate such risks, prior work has proposed concept erasure methods that aim to sever unwanted concepts from the model via fine-tuning, yet it remains unclear whether these approaches truly remove all links to the harmful concept or merely conceal superficial connections. In this work, we reveal a critical vulnerability, the Erasure Evasion Backdoor (EEB): an adversary binds a backdoor trigger to a concept slated for removal, and this malicious link survives subsequent erasure. We show that both black-box and white-box adversaries can instantiate this threat. Across six state-of-the-art erasure methods, including robust ones that explicitly search for alternative representations of the target concept, EEB consistently exposes harmful content: up to 82% success against celebrity-identity unlearning, up to 94% for object erasure, and up to 16 times amplification of explicit-content exposure. While EEB uncovers a blind spot in current erasure methods, it also provides a diagnostic tool for stress-testing future concept erasure techniques.

2505.17623 2026-06-11 cs.CR cs.AI cs.ET cs.LG cs.PF 版本更新

Range-Arithmetic: Verifiable Deep Learning Inference on an Untrusted Party

Range-Arithmetic: 在不可信方上进行可验证的深度学习推理

Ali Rahimi, Babak H. Khalaj, Mohammad Ali Maddah-Ali

AI总结 提出Range-Arithmetic框架,通过将非算术运算转化为可验证的算术步骤,实现高效的深度神经网络推理验证,降低了计算和通信开销。

详情
AI中文摘要

可验证计算(VC)在去中心化机器学习系统中日益重要,由于区块链的限制,深度神经网络(DNN)推理等资源密集型任务被外包给外部参与者。这产生了在不重新执行的情况下验证外包计算正确性的需求。我们提出了\texttt{Range-Arithmetic},一个新颖的框架,用于高效且可验证的DNN推理,它将非算术运算(如定点矩阵乘法后的舍入和ReLU)转化为可通过求和检查协议和串联范围证明验证的算术步骤。我们的方法避免了布尔编码、高次多项式和大查找表的复杂性,同时保持与基于有限域的证明系统的兼容性。实验结果表明,我们的方法不仅匹配现有方法的性能,还降低了验证结果的计算成本、执行DNN推理的不可信方所需的计算工作量以及双方之间的通信开销。

英文摘要

Verifiable computing (VC) has gained prominence in decentralized machine learning systems, where resource-intensive tasks like deep neural network (DNN) inference are offloaded to external participants due to blockchain limitations. This creates a need to verify the correctness of outsourced computations without re-execution. We propose \texttt{Range-Arithmetic}, a novel framework for efficient and verifiable DNN inference that transforms non-arithmetic operations, such as rounding after fixed-point matrix multiplication and ReLU, into arithmetic steps verifiable using sum-check protocols and concatenated range proofs. Our approach avoids the complexity of Boolean encoding, high-degree polynomials, and large lookup tables while remaining compatible with finite-field-based proof systems. Experimental results show that our method not only matches the performance of existing approaches, but also reduces the computational cost of verifying the results, the computational effort required from the untrusted party performing the DNN inference, and the communication overhead between the two sides.

2506.03933 2026-06-11 cs.CV cs.AI 版本更新

Diffusion-based Cumulative Adversarial Purification for Vision Language Models

基于扩散的累积对抗净化方法用于视觉语言模型

Jia Fu, Yongtao Wu, Yihang Chen, Kunyu Peng, Xiao Zhang, Volkan Cevher, Sepideh Pashami, Anders Holst

AI总结 提出DiffCAP,一种基于扩散的对抗净化策略,通过理论证明对抗效应随扩散单调衰减,并利用噪声注入与VLM嵌入相似度阈值自适应净化,显著提升防御效果并加速去噪。

详情
Comments
Accepted to Transactions on Machine Learning Research (TMLR 2026)
AI中文摘要

视觉语言模型(VLM)在多模态理解方面表现出卓越的能力,但它们对对抗扰动的敏感性对其在实际应用中的可靠性构成了重大威胁。尽管这些扰动通常对人类不可察觉,但它们可能极大地改变模型输出,导致错误的解释和决策。本文介绍了DiffCAP,一种新颖的基于扩散的净化策略,可以有效中和VLM中的对抗性破坏。我们在理论上建立了前向扩散过程中的可证明恢复区域,同时量化了相对于VLM的语义变化的收敛速度。这些发现表明,随着扩散的进行,对抗效应单调减弱。基于这一原理,DiffCAP利用噪声注入,以VLM嵌入的相似度阈值作为自适应标准,然后通过反向扩散恢复出干净且可靠的表示用于VLM推理。通过在三个任务场景中、不同攻击强度下、使用三个VLM在六个数据集上进行的大量实验,我们表明DiffCAP以显著优势优于现有的防御技术。值得注意的是,DiffCAP显著降低了超参数调优的复杂性和所需的扩散时间,从而加速了去噪过程。结合理论定理和实验支持,DiffCAP为在对抗环境中安全部署VLM提供了一种稳健且实用的解决方案。源代码可在以下网址获取:此 https URL。

英文摘要

Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to adversarial perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We theoretically establish a provable recovery region in the forward diffusion process and meanwhile quantify the convergence rate of semantic variation with respect to VLMs. These findings manifest that adversarial effects monotonically fade as diffusion unfolds. Guided by this principle, DiffCAP leverages noise injection with a similarity threshold of VLM embeddings as an adaptive criterion, before reverse diffusion restores a clean and reliable representation for VLM inference. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with theorems and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments. The source code is available at this https URL.

2509.23982 2026-06-11 cs.CL cs.AI cs.CY cs.LG cs.NE 版本更新

Toward Preference-aligned Large Language Models via Residual-based Model Steering

基于残差模型引导的偏好对齐大型语言模型

Lucio La Cava, Andrea Tagarelli

AI总结 提出PaLRS方法,利用残差流中的偏好信号提取轻量级引导向量,无需训练即可在推理时对齐模型偏好,在数学推理和代码生成任务上取得一致提升,同时节省大量时间。

详情
Comments
Accepted at IJCAI 2026
AI中文摘要

偏好对齐是使大型语言模型(LLMs)有用且与(人类)偏好一致的关键步骤。现有方法如基于人类反馈的强化学习或直接偏好优化通常需要精心策划的数据和对数十亿参数进行昂贵的优化,最终导致持久性的任务特定模型。在这项工作中,我们引入了基于残差引导的LLM偏好对齐(PaLRS),这是一种无需训练的方法,利用LLM残差流中编码的偏好信号。从仅一百个偏好对中,PaLRS提取出轻量级、即插即用的引导向量,可在推理时应用以将模型推向偏好行为。我们在各种中小型开源LLM上评估了PaLRS,显示PaLRS对齐的模型在数学推理和代码生成基准上取得了一致的提升,同时保持了基线通用性能。此外,与使用DPO和SimPO对齐的模型相比,它们表现更好且节省大量时间。我们的发现强调,PaLRS为标准偏好优化流程提供了一种有效、更高效且灵活的替代方案,提供了一种无需训练、即插即用的对齐机制,且数据需求极少。

英文摘要

Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback or Direct Preference Optimization typically require curated data and expensive optimization over billions of parameters, and eventually lead to persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, when compared to models aligned with DPO and SimPO, they perform better with great time-savings. Our findings highlight that PaLRS offers an effective, much more efficient and flexible alternative to standard preference optimization pipelines, offering a training-free, plug-and-play mechanism for alignment with minimal data.
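
As a rough illustration of inference-time steering, the sketch below builds a steering vector as the difference between mean activations on preferred and rejected examples, then adds it to a hidden state. This difference-of-means recipe is a common construction and is an assumption here: the abstract does not state how PaLRS extracts its vectors. All names (`mean_vector`, `steering_vector`, `steer`, `strength`) are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(preferred_acts, rejected_acts):
    """Difference of means (assumed recipe, not confirmed as PaLRS's method)."""
    mean_pref = mean_vector(preferred_acts)
    mean_rej = mean_vector(rejected_acts)
    return [p - r for p, r in zip(mean_pref, mean_rej)]

def steer(hidden_state, vector, strength=1.0):
    """Shift a residual-stream state along the steering direction."""
    return [h + strength * v for h, v in zip(hidden_state, vector)]

# Toy 2-dimensional activations standing in for residual-stream states.
preferred = [[1.0, 2.0], [3.0, 4.0]]
rejected = [[0.0, 0.0], [2.0, 2.0]]
vec = steering_vector(preferred, rejected)
print(steer([0.0, 0.0], vec, strength=0.5))
```

No weights are updated in this scheme, so the vector can be added or removed per request, which matches the plug-and-play framing in the abstract.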

2510.03520 2026-06-11 cs.LG cs.AI eess.SY 版本更新

Certifiable Safe RLHF: Semantic Grounding and Fixed Penalty Constraint Optimization for Safer LLM Alignment

可认证安全RLHF:基于语义基础与固定惩罚约束优化的更安全大语言模型对齐

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh

AI总结 针对现有RLHF方法依赖奖励/成本函数和对偶变量调优导致性能敏感且缺乏可证明安全保证的问题,提出CS-RLHF,通过语义基础成本模型和固定惩罚约束优化,实现可认证安全对齐,效率提升至少5倍。

详情
AI中文摘要

确保安全是大语言模型(LLMs)的基本要求。在增强模型输出效用与减轻其潜在危害之间取得适当平衡是一个复杂且持续的挑战。当代方法通常将这个问题形式化为约束马尔可夫决策过程(CMDP)框架,并采用成熟的CMDP优化技术。然而,这些方法表现出两个显著的限制。首先,它们对奖励和成本函数的依赖使得性能对底层评分机制高度敏感,而该机制必须捕捉语义含义,而不是被表面关键词触发。其次,基于CMDP的训练需要调整对偶变量,这一过程计算成本高昂,并且对于固定的对偶变量不提供任何可证明的安全保证,而固定的对偶变量可能被对抗性越狱利用。为了克服这些限制,我们引入了可认证安全RLHF(CS-RLHF),它使用一个在大规模语料库上训练的成本模型来分配基于语义的安全分数。与基于拉格朗日的方法相比,CS-RLHF采用了一种整流(rectified)的基于惩罚的公式。该设计借鉴了约束优化中精确惩罚函数理论,其中约束满足直接通过适当选择的惩罚项来强制执行。通过适当缩放的惩罚,可以在优化器处保证安全约束的可行性,从而消除了对偶变量更新的需要。实证评估表明,CS-RLHF优于最先进的LLM模型响应,对正常和越狱提示的效率至少提高5倍。

英文摘要

Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the utility of model outputs and mitigating their potential for harm is a complex and persistent challenge. Contemporary approaches frequently formalize this problem within the framework of Constrained Markov Decision Processes (CMDPs) and employ established CMDP optimization techniques. However, these methods exhibit two notable limitations. First, their reliance on reward and cost functions renders performance highly sensitive to the underlying scoring mechanism, which must capture semantic meaning rather than being triggered by superficial keywords. Second, CMDP-based training entails tuning a dual variable, a process that is computationally expensive and provides no provable safety guarantee for a fixed dual variable, which can be exploited through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF), which uses a cost model trained on a large-scale corpus to assign semantically grounded safety scores. In contrast to the Lagrangian-based approach, CS-RLHF adopts a rectified penalty-based formulation. This design draws on the theory of exact penalty functions in constrained optimization, wherein constraint satisfaction is enforced directly through a suitably chosen penalty term. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed at the optimizer, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art LLM responses, proving at least 5 times more efficient against nominal and jailbreaking prompts.
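
To make the contrast with Lagrangian training concrete, here is a minimal sketch of a rectified penalty objective: reward is reduced only when the cost exceeds a budget, and the penalty weight stays fixed instead of being updated as a dual variable. The function name, the `budget` parameter, and the numbers are illustrative assumptions; the abstract does not give the exact CS-RLHF objective.

```python
def rectified_penalty_objective(reward, cost, budget, penalty_weight):
    """Reward minus a fixed-weight penalty on the positive part of the violation.

    max(0, cost - budget) is zero whenever the safety constraint holds,
    so feasible responses are scored by reward alone.
    """
    violation = max(0.0, cost - budget)
    return reward - penalty_weight * violation

# A response within the safety budget keeps its full reward.
print(rectified_penalty_objective(reward=1.0, cost=0.25, budget=0.5, penalty_weight=10.0))
# A response over the budget is penalized in proportion to the excess.
print(rectified_penalty_objective(reward=1.0, cost=0.75, budget=0.5, penalty_weight=10.0))
```

In exact-penalty theory, a sufficiently large fixed weight makes the maximizer of this objective satisfy the constraint, which is why no dual update loop appears in the sketch.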

2512.03077 2026-06-11 cs.CY cs.AI 版本更新

Irresponsible AI: big tech's influence on AI research and associated impacts

不负责任的人工智能:大型科技公司对AI研究的影响及相关影响

Alex Hernandez-Garcia, Alexandra Volokhova, Ezekiel Williams, Dounia Shaaban Kabakibo, Mélisande Teng

AI总结 本文指出大型科技公司对AI研究的不成比例影响推动了不负责任的AI发展,并加剧了环境和社会负面影响,呼吁研究者通过集体行动加以抵制。

详情
Comments
Presented as a spotlight oral at the International Conference on Machine Learning 2026 (Position Paper Track). First version presented at NeurIPS 2025 Workshop on Algorithmic Collective Action
AI中文摘要

人工智能系统的加速开发、部署和采纳得益于大型科技公司在AI领域的日益深入。这一趋势伴随着日益增长的伦理关切以及加剧的社会和环境影响。本文立场认为,不负责任的AI发展在很大程度上是由大型科技公司在该领域的影响和参与所驱动的。首先,我们审视了大型科技公司在AI研究中日益增长且不成比例的影响,并认为其对规模化和通用系统的追求从根本上与负责任、合乎伦理和可持续的AI发展相悖。其次,我们回顾了当前AI的主要负面环境和社会影响,并追溯其与大型科技公司影响的联系。第三,我们讨论了推动大型科技公司行动的基本经济力量。最后,作为行动号召,我们邀请AI研究者通过基于相关行为者责任和集体行动的策略,来对抗大型科技公司对不负责任AI发展的影响。

英文摘要

The accelerated development, deployment and adoption of artificial intelligence systems has been fuelled by the increasing presence of big tech in the AI field. This trend has been accompanied by growing ethical concerns and intensified societal and environmental impacts. This position paper argues that irresponsible AI development is strongly driven by big tech's influence and involvement in the field. First, we examine the growing and disproportionate influence of big tech in AI research and argue that its drive for scaling and general-purpose systems is fundamentally at odds with the responsible, ethical, and sustainable development of AI. Second, we review key current environmental and societal negative impacts of AI and trace their connections to big tech's influence. Third, we discuss the underlying economic forces driving big tech's actions. Finally, as a call to action, we invite AI researchers to counter big tech's influence in irresponsible AI development through strategies that build on the responsibility of implicated actors and collective action.

2601.17360 2026-06-11 cs.LG cs.AI cs.CR 版本更新

Robust Privacy: Inference-Stage Privacy through Certified Robustness

鲁棒隐私:通过认证鲁棒性实现推理阶段隐私

Jiankai Jin, Xiangzheng Zhang, Zhao Liu, Wenzhuo Xu, Dongdong Yang, Deyue Zhang, Quanchen Zou

AI总结 提出鲁棒隐私(RP)概念,基于认证鲁棒性确保预测在输入邻域内不变,从而限制推理阶段隐私泄露;实验表明RP在属性推断和模型反演攻击中有效提升隐私-效用权衡。

详情
AI中文摘要

观察模型发布预测的对手可以推断查询输入的敏感属性,甚至重建模型训练数据的代表。因此,推理接口充当隐私泄露的侧信道。我们引入鲁棒隐私(RP),一种受认证鲁棒性启发的推理阶段隐私概念:如果模型预测在输入x周围半径为R的邻域内以至少$1-\alpha$的置信度可证明不变,则x享有$(R,\alpha)$-鲁棒隐私,在此条件下我们证明任何观察发布预测的对手在区分x与距离x为R内的任何输入时最多有$\alpha/2$的优势。基于RP,我们形式化鲁棒属性隐私(RAP),一种属性级隐私概念,刻画与发布预测兼容的敏感属性值集合。在分类任务上,RP将RAP兼容推理区间的中位数长度从23.50增加到29.96,降低了属性推断精度。模型反演攻击通常被视为训练阶段威胁,实际上依赖于通过推理接口泄露的细粒度信号;RP在推理阶段掩盖这些信号,将黑盒反演攻击的成功率(ASR)从73%降至4%。这种直接针对泄露通道的方法使RP在隐私-效用权衡空间中优于DP-SGD和随机响应:RP在21% ASR下保持98.4%的准确率,而DP-SGD必须将准确率降至61.7%才能达到相当的ASR。在两个实验中,增加平滑样本量N同时增强了隐私和效用。最后,我们考察模型蒸馏作为范围边界,表明RP缓解了属性级和实例级推理阶段隐私泄露,但无法通过模型蒸馏缓解函数级提取。

英文摘要

An adversary observing a model's released prediction can infer sensitive attributes of the queried input, or even reconstruct representatives of the model's training data. The inference interface thus acts as a side channel for privacy leakage. We introduce Robust Privacy (RP), an inference-stage privacy notion inspired by certified robustness: if a model's prediction is provably invariant within a radius-R neighborhood around an input x with confidence at least $1-\alpha$, then x enjoys $(R,\alpha)$-Robust Privacy, under which we prove that any adversary observing the released prediction has at most $\alpha/2$ advantage in distinguishing x from any input within distance R of x. Building on RP, we formalize Robust Attribute Privacy (RAP), an attribute-level privacy notion that characterizes the set of sensitive-attribute values that remain compatible with a released prediction. On a classification task, RP increases the median length of the RAP-compatible inference interval from 23.50 to 29.96, reducing attribute-inference precision. Model inversion attacks, often treated as a training-stage threat, in fact rely on fine-grained signals leaked through the inference interface; RP masks these signals at the inference stage, reducing attack success rate (ASR) from 73% to 4% on a black-box inversion attack. This direct targeting of the leakage channel enables RP to dominate DP-SGD and randomized response in the privacy-utility tradeoff space: RP retains 98.4% accuracy at 21% ASR, whereas DP-SGD must drop accuracy to 61.7% to reach a comparable ASR. Across both experiments, increasing the smoothing sample size N strengthens privacy and improves utility together. Finally, we examine model distillation as a scope boundary and show that RP mitigates attribute-level and instance-level inference-stage privacy leakage, but not function-level extraction through model distillation.

2602.05746 2026-06-11 cs.LG cs.AI 版本更新

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

学习注入:通过强化学习实现自动化提示注入

Xin Chen, Jie Zhang, Florian Tramèr

AI总结 提出AutoInject,一种基于强化学习的黑盒框架,自动学习对抗性后缀进行提示注入,在AgentDojo上优于模板攻击和多种自适应攻击,并突破专门防御模型。

详情
AI中文摘要

提示注入是LLM代理中的一个关键漏洞,然而最强的方法仍然依赖于人类红队和手工制作的提示。适应自动化越狱优化器并不能缩小这一差距:越狱使模型趋向于通用顺从,而提示注入需要发出具有正确参数的特定工具调用。成功信号是二元的,随机采样的后缀几乎从不触发它,因此标准优化器没有梯度可循。我们提出了AutoInject,一个黑盒强化学习(RL)框架,学习用于提示注入的对抗性后缀。一个学习的基于比较的奖励对每个候选后缀与迄今为止看到的最佳后缀进行评分,将二元信号转化为适合RL优化的密集奖励。该框架支持在线基于查询的攻击和离线训练的可迁移后缀(部署时无需实用访问),并在任务完成反馈可用时纳入实用目标。在AgentDojo上,AutoInject在生产模型中优于模板攻击、GCG、TAP和自适应攻击,在McNemar检验下具有统计显著性(p<0.05)。AutoInject学习的后缀还打破了Meta-SecAlign-70B,这是一个专门针对提示注入进行微调的模型,而模板攻击完全失败。这些结果为提示注入建立了自动化基线,并揭示了基于偏好的防御与基于自适应优化的攻击者之间的差距。

英文摘要

Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks shape models toward generic compliance, while prompt injection requires emitting specific tool calls with correct parameters. The success signal is binary, and randomly sampled suffixes almost never trigger it, so standard optimizers have no gradient to follow. We present AutoInject, a black-box reinforcement learning (RL) framework that learns adversarial suffixes for prompt injection. A learned comparison-based reward scores each candidate against the best suffix seen so far, turning the binary signal into a dense reward suitable for RL optimization. The framework supports both online query-based attacks and offline-trained transferable suffixes that need no utility access at deployment, and incorporates a utility objective when task-completion feedback is available. On AgentDojo, AutoInject outperforms template attacks, GCG, TAP, and adaptive attack across production models, with statistically significant improvements under McNemar's test with p<0.05. Suffixes learned by AutoInject also break Meta-SecAlign-70B, a model fine-tuned specifically to resist prompt injection, where template attacks fail outright. The results establish an automated baseline for prompt injection and expose a gap between preference-based defenses and adaptive optimization-based attackers.

2602.06547 2026-06-11 cs.CR cs.AI cs.CL cs.ET 版本更新

"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild

“不要向用户提及此事”:检测与理解现实环境中的恶意代理技能

Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li, Jianting Ning, Leo Yu Zhang

AI总结 本文通过对两个主要注册中心的98,380个技能进行系统安全分析,结合静态模式匹配和动态行为验证,识别出157个恶意技能,揭示了13种攻击技术中的632个不同漏洞,并发现攻击复杂性与隐藏投入相关。

详情
Comments
Accepted to the 35th USENIX Security Symposium (USENIX Security 2026)
AI中文摘要

基于LLM的编码代理越来越依赖称为技能的第三方扩展,这些技能捆绑了自然语言指令和辅助脚本,以完全用户权限执行。社区注册中心已出现以分发这些技能,但由于缺乏标记的威胁数据,安全影响仍未得到研究。本文对从两个主要注册中心收集的98,380个技能进行了系统安全分析。通过静态模式匹配和动态行为验证的结合,我们识别出157个表现出确认恶意行为的技能,涵盖13种攻击技术中的632个不同漏洞。我们的分析表明,这些威胁是故意的而非偶然:每个恶意技能平均包含4.03个漏洞,跨越多个攻击阶段。我们识别出两种具有统计显著负相关的主要攻击策略——通过远程代码执行窃取凭证,以及通过嵌入文档中的对抗性指令操纵代理。超过一半的确认案例来自一个采用模板化品牌冒充大规模攻击的单一威胁行为者。我们进一步观察到,攻击复杂性与隐藏投入相关,高级技能普遍使用未记录的功能,同时利用平台原生的信任机制。在负责任的披露之后,注册中心维护者删除了所有157个(100%)报告的技能。我们的数据集和检测管道公开可用,以促进未来关于保护LLM代理生态系统安全的研究。

英文摘要

LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper scripts that execute with full user privileges. Community registries have emerged to distribute these skills, but the security implications remain unstudied due to the absence of labeled threat data. This paper presents a systematic security analysis of 98,380 skills collected from two major registries. Through a combination of static pattern matching and dynamic behavioral verification, we identify 157 skills exhibiting confirmed malicious behavior, encompassing 632 distinct vulnerabilities across 13 attack techniques. Our analysis reveals that these threats are deliberate rather than accidental: each malicious skill contains an average of 4.03 vulnerabilities spanning multiple attack phases. We identify two dominant attack strategies with statistically significant negative correlation -- credential theft via remote code execution, and agent manipulation through adversarial instructions embedded in documentation. Over half of all confirmed cases originate from a single threat actor employing templated brand impersonation at scale. We further observe that attack sophistication correlates with concealment investment, with advanced skills universally employing undocumented capabilities while also exploiting platform-native trust mechanisms. Following responsible disclosure, registry maintainers removed all 157 (100%) of the reported skills. Our dataset and detection pipeline are publicly available to facilitate future research on securing LLM agent ecosystems.

2602.19718 2026-06-11 cs.SE cs.AI 版本更新

Carbon-Aware Governance Gates: An Architecture for Sustainable GenAI Development

碳感知治理门:可持续生成式AI开发的架构

Mateen A. Abbasi, Tommi J. Mikkonen, Petri J. Ihantola, Muhammad Waseem, Pekka Abrahamsson, Niko K. Mäkitalo

AI总结 针对生成式AI在软件开发中增加碳足迹的问题,提出碳感知治理门架构,通过嵌入碳预算、能源溯源和可持续验证编排来降低环境影响。

详情
Comments
5 pages, 1 figure. Preprint version under review
AI中文摘要

生成式AI在软件开发生命周期中的快速普及增加了计算需求,这可能提高开发活动的碳足迹。同时,组织越来越多地将治理机制嵌入到生成式AI辅助开发中,以支持信任、透明度和问责制。然而,这些治理机制引入了额外的计算负载,包括重复推理、再生循环和扩展的验证管道,增加了能源使用和生成式AI辅助开发的碳足迹。本文提出碳感知治理门(CAGG),一种架构扩展,将碳预算、能源溯源和可持续感知验证编排嵌入到人机治理层中。CAGG包含三个组件:(i)能源和碳溯源账本,(ii)碳预算管理器,以及(iii)绿色验证编排器,通过治理策略和可重用设计模式实现。

英文摘要

The rapid adoption of Generative AI (GenAI) in the software development life cycle (SDLC) increases computational demand, which can raise the carbon footprint of development activities. At the same time, organizations are increasingly embedding governance mechanisms into GenAI-assisted development to support trust, transparency, and accountability. However, these governance mechanisms introduce additional computational workloads, including repeated inference, regeneration cycles, and expanded validation pipelines, increasing energy use and the carbon footprint of GenAI-assisted development. This paper proposes Carbon-Aware Governance Gates (CAGG), an architectural extension that embeds carbon budgets, energy provenance, and sustainability-aware validation orchestration into human-AI governance layers. CAGG comprises three components: (i) an Energy and Carbon Provenance Ledger, (ii) a Carbon Budget Manager, and (iii) a Green Validation Orchestrator, operationalized through governance policies and reusable design patterns.

2604.22167 2026-06-11 cs.LG cs.AI 版本更新

Estimating Tail Risks in Language Model Output Distributions

语言模型输出分布中的尾部风险估计

Rico Angell, Raghav Singhal, Zachary Horvitz, Zhou Yu, Rajesh Ranganath, Kathleen McKeown, He He

AI总结 提出一种基于重要性采样的方法,通过创建不安全版本来高效估计语言模型产生有害输出的尾部概率,在10-20倍更少样本下匹配蒙特卡洛估计,并揭示模型对输入的敏感性。

详情
Comments
Accepted to ICML 2026
AI中文摘要

语言模型能力日益增强,并正在以覆盖全体人群的规模快速部署。因此,这些模型的安全性变得越来越重要。幸运的是,对齐方面的进展显著降低了模型产生有害输出的可能性。然而,当模型每天被查询数十亿次时,即使是罕见的最坏情况行为也会发生。当前的安全评估侧重于捕获产生有害输出的输入分布。这些评估忽略了模型的概率性质及其尾部输出行为。为了衡量这种尾部风险,我们提出了一种方法,可以高效估计任何输入查询产生有害输出的概率。我们不是从目标模型进行简单的暴力采样(其中有害输出可能很罕见),而是通过创建目标模型的不安全版本来实现重要性采样。这些不安全版本通过使有害输出更可能发生,实现了样本高效的估计。在衡量误用和未对齐的基准测试中,这些估计仅用少10-20倍的样本即可与暴力蒙特卡洛估计相匹配。例如,我们仅用500个样本就可以估计数量级为10^-4的有害输出概率。此外,我们发现这些有害性估计可以揭示模型对输入扰动的敏感性,并预测部署风险。我们的工作表明,准确的小概率事件估计对于安全评估既关键又可行。代码可在以下网址获取:此 https URL

英文摘要

Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at this https URL
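
The importance-sampling idea can be sketched on a toy problem. Samples are drawn from a proposal under which harmful outcomes are common, and each harmful draw is reweighted by the ratio of target to proposal probability. The four-outcome distributions below are made-up stand-ins for the target model and its unsafe version, and `importance_sampling_estimate` is a hypothetical name; none of this is the paper's implementation.

```python
import random

def importance_sampling_estimate(target_probs, proposal_probs, is_harmful,
                                 n_samples, seed=0):
    """Estimate P_target(harmful) from samples of the proposal distribution.

    Each harmful draw x contributes target_probs[x] / proposal_probs[x];
    safe draws contribute zero. The average is an unbiased estimate as long
    as the proposal gives nonzero probability to every harmful outcome.
    """
    rng = random.Random(seed)
    outcomes = list(range(len(target_probs)))
    draws = rng.choices(outcomes, weights=proposal_probs, k=n_samples)
    total = 0.0
    for x in draws:
        if is_harmful(x):
            total += target_probs[x] / proposal_probs[x]
    return total / n_samples

# Outcome 0 is safe; outcomes 1-3 are harmful with true total probability 0.0011.
target = [0.9989, 0.0006, 0.0003, 0.0002]
proposal = [0.4, 0.3, 0.2, 0.1]  # toy "unsafe" proposal: harmful outcomes are frequent
estimate = importance_sampling_estimate(target, proposal, lambda x: x != 0, 500)
print(estimate)
```

With 500 draws from the target itself, a harmful outcome would typically appear zero or one times, so the naive frequency estimate would be far noisier than the reweighted one.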

2605.15687 2026-06-11 cs.CL cs.AI 版本更新

ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

ASRU:激活引导与强化遗忘融合用于多模态大语言模型

Jiahui Guang, Haiyan Wang, Yingjie Zhu, Cuiyun Gao, Jing Li, Di Shao, Zhaoquan Gu

AI总结 ASRU提出一种可控多模态遗忘框架,通过激活引导和强化学习提升多模态大语言模型的遗忘效果和生成质量,实验显示在Qwen3-VL上遗忘效果提升24.6%,生成质量提升5.8倍。

详情
AI中文摘要

多模态大语言模型(MLLMs)在预训练过程中可能记忆敏感的跨模态信息,使机器遗忘(MU)变得至关重要。现有方法通常基于输出偏差评估遗忘效果,而忽视遗忘后的生成质量。这可能导致幻觉或僵化响应,影响遗忘模型的可用性和安全性。为了解决这一问题,我们提出了ASRU,一种可控的多模态遗忘框架,将生成质量作为核心评估目标。ASRU首先通过激活引导诱导初始拒绝行为,然后使用定制奖励函数优化细粒度拒绝边界,从而在目标知识遗忘和模型实用性之间取得更好的平衡。实验表明,在Qwen3-VL上,ASRU在平均上显著提高了遗忘效果(+24.6%)和生成质量(5.8倍),同时有效保持了模型实用性,仅使用少量保留的监督数据。

英文摘要

Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that ASRU significantly improves unlearning effectiveness (+24.6%) on average and generation quality (5.8X) on average while effectively preserving model utility, using only a small amount of retained supervision data.

2605.28591 2026-06-11 cs.CL cs.AI 版本更新

Models That Know How Evaluations Are Designed Score Safer

知道评估如何设计的模型更安全

Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi

AI总结 本文通过微调模型使其掌握评估的元知识(如可验证结构或道德困境),发现这会导致模型在安全基准测试中表现更安全,从而引入了一种独立于显式记忆或评估意识的新混淆因素。

详情
AI中文摘要

AI安全评估的有效性取决于模型在受控环境和部署环境中行为的一致性。先前的研究已经发现测试时的上下文线索(例如假设场景)是口头评估意识和后续行为转变的来源。在本文中,我们研究了这一现象的一个潜在解释:评估元知识,定义为关于评估结构特征的参数化知识。类似于数据集污染(基准暴露通过记忆导致更高性能),我们假设在描述评估实践的文本上训练的模型可能隐式地学会识别和响应类似评估的上下文,例如通过接触关于AI基准测试的科学文章或社交媒体帖子。为了验证这一点,我们在描述评估特征(如可验证结构或道德困境)的合成文档上微调模型。在六个安全基准上评估这个微调模型,我们发现它比基础模型和控制模型显著更安全。即使将分析限制在缺乏明确评估意识口头表达的响应中,这种行为转变仍然存在。我们的结果表明,评估元知识可能夸大安全基准性能,引入了一种独立于显式记忆或口头评估意识的新混淆因素,因此难以检测。这些发现对AI安全评估的设计和解释具有重要意义。我们的代码和模型可在 https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge 获取。

英文摘要

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance, through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on six safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness, thus, challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at this https URL.

2606.04145 2026-06-11 cs.LG cs.AI cs.DC 版本更新

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

EvalStop:利用世界反馈检测和纠正多租户RLHF平台中的奖励过度优化

Guilin Zhang, Chuanyi Sun, Shahryar Sarkani, John M. Fossaceca

AI总结 提出EvalStop调度原语,通过检测评估分数连续下降来终止作业、释放GPU并保留最佳检查点,以纠正奖励过度优化,在RLHF负载上实现高精度检测并提升JCT。

详情
AI中文摘要

云LLM微调平台越来越多地服务于RLHF工作负载,其中学习到的奖励模型作为人类质量的代理被优化。正如Gao等人(2023)所示,在持续优化压力下,该代理与世界反馈(下游评估指标)发生偏离,这种现象称为奖励过度优化。现有的平台调度器忽略这种偏离:非预知型(non-clairvoyant)调度器优化JCT而不考虑任何质量信号,SLAQ式质量感知调度器使用训练损失(一个较弱的代理,在奖励黑客行为发生期间仍单调下降),而经典的每作业早停需要人工监控且不释放共享GPU。我们提出EvalStop,一个可组合的调度原语,它在评估分数连续k次下降时终止作业,释放GPU,保留最佳检查点,并委托给任何基础调度器。我们将调度器级别的早停视为检测问题,并在一个离散事件模拟器中评估它,该模拟器的RLHF工作负载混合了奖励黑客运行和结构健康运行,真实标签对调度器隐藏。在RLHF密集型负载(80% RLHF,64 GPU)上,EvalStop实现了精确率98%、召回率99%、假阳性率1.5%,同时相比SRTF-Est将JCT改善了9%,将浪费的计算减少了22%(p<0.05)。简单的固定进度和损失平台期基线要么在健康RLHF上产生65%的假阳性率,要么错过超过一半的真实奖励黑客案例。增益在所有测试的基础调度器上均成立(JCT改善9-25%),且检测质量在评估噪声(噪声标准差≤0.05时精确率至少91%)和奖励黑客基础率(奖励黑客比例20-80%时精确率至少89%)下保持稳定。

英文摘要

Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).
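
The stopping rule named in the abstract (terminate on k consecutive eval-score declines while preserving the best checkpoint) can be sketched as follows. The sketch assumes a decline means a score lower than the immediately preceding evaluation; whether the paper compares against the previous score or the running best is not stated. The function name `evalstop` and its return convention are hypothetical.

```python
def evalstop(eval_scores, k):
    """Return (stop_index, best_index) for a sequence of eval scores.

    stop_index is the evaluation at which k consecutive declines are first
    observed, or None if that never happens. best_index is the checkpoint
    with the highest score seen up to that point.
    """
    best_index = 0
    declines = 0
    for i in range(1, len(eval_scores)):
        if eval_scores[i] > eval_scores[best_index]:
            best_index = i
        if eval_scores[i] < eval_scores[i - 1]:
            declines += 1
        else:
            declines = 0  # any non-decline resets the streak
        if declines >= k:
            return i, best_index
    return None, best_index

# Scores rise, then fall three times in a row: stop at index 5, keep checkpoint 2.
print(evalstop([0.5, 0.6, 0.7, 0.65, 0.6, 0.55], k=3))  # -> (5, 2)
```

A scheduler could call this after every evaluation and, on a non-None stop index, release the job's GPUs and hand them back to whichever base policy is in use.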

2606.10198 2026-06-11 cs.LG cs.AI cs.CV 版本更新

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

密度脊选择性预测:校准标签稀缺下的大语言模型与视觉语言模型幻觉检测

Nina I. Shamsi

AI总结 针对校准标签稀缺时大语言模型和视觉语言模型的幻觉检测问题,提出基于核密度估计的密度脊方法,利用隐藏状态生成轨迹的六维运动特征图构建响应流形,通过到最近脊顶点的欧氏距离评分,在标签稀缺协议下AUROC提升5-20点。

详情
AI中文摘要

大语言模型和视觉语言模型中的幻觉检测日益被框架化为选择性预测,其中检测器分配置信度分数并在置信度低时弃权。无监督采样检测器(Semantic Entropy, EigenScore)避免标签但质量停滞,而有监督探针(SAPLMA)获得更强的分布内分数,但在校准标签稀缺时性能急剧下降。我们将大语言模型的响应流形恢复为基于隐藏状态生成轨迹的六维运动特征图的核密度估计的密度脊。测试生成通过其投影特征点到最近脊顶点的欧氏距离的负值进行评分,从而得到随机输出分布的低维几何骨架。我们在七个问答基准(HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA)上,使用九个文本和视觉大语言模型,在刻意标签稀缺协议($n_{\text{cal}}{=}200$ 查询,$N{=}5$ 生成)下,与Semantic Entropy、SAR、EigenScore、SAPLMA和对数概率进行评估。我们的基于脊的分数在AUROC上以5-20个百分点的优势获胜,同时在校准标签稀缺下表现出温和的性能下降。

英文摘要

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy) avoid labels but plateau in quality, while supervised probes attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, topological methods, and log-probability on six QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using eight text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score outperforms these baselines on AUROC by 5-20 points, while degrading only moderately under calibration-label scarcity.

9. 评测、基准与数据集 57 篇

2606.11337 2026-06-11 cs.AI cs.CL cs.CY 新提交

Can AI Agents Synthesize Scientific Conclusions?

AI代理能否综合科学结论?

Hayoung Jung, Pedro Viana Diniz, José Reinaldo Corrêa Roveda, Abner Fernandes da Silva, Haeun Jung, Enoch Tsai, Aleksandra Korolova, Manoel Horta Ribeiro

发表机构 * Princeton University(普林斯顿大学) Universidade Federal de Minas Gerais(米纳斯吉拉斯联邦大学) Stony Brook University(石溪大学) Hackensack Meridian School of Medicine(哈肯萨克子午线医学院)

AI总结 本文提出SciConBench基准和SciConHarness评估框架,通过分解原子事实并计算精确率和召回率,发现前沿AI代理在科学结论综合中事实F1仅0.337,且无约束评估存在数据泄露,消费者代理常生成不完整或矛盾的结论。

详情
Comments
79 pages, 34 figures, 17 tables. Under Submission
AI中文摘要

科学AI代理越来越多地检索证据、跨来源推理并综合用于重要决策的结论。然而,它们在健康等高风险领域中的能力仍不明确。我们引入了SciConBench,一个大规模实时基准,包含9.11K个问题以及来自系统综述的专家撰写的结论,用于评估开放域科学结论综合。该基准采用专家验证的自动评估流程,将结论分解为原子事实,并通过事实精确率和召回率衡量正确性和全面性。为减轻数据泄露,我们进一步引入了SciConHarness,一个洁净室评估框架,为代理配备受控的网页交互以确保有效测量。评估8个前沿模型和深度研究代理,我们发现事实质量仍然较低:在洁净室设置下,最佳代理仅达到0.337的事实F1。与无约束评估相比,我们的洁净室设置持续降低性能,表明数据泄露夸大了模型真实综合能力的估计。最后,我们审计了面向消费者的代理(如Google AI Overview、OpenEvidence),发现它们经常生成不完整甚至矛盾的结论,即使真实答案可用。总体而言,我们的结果表明,科学结论的可靠综合仍然是一个开放挑战,而洁净室评估对于评估开放域AI代理至关重要。

英文摘要

Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
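The factual precision, recall, and F1 mentioned above can be sketched as follows. This is only an illustration: it assumes atomic facts have already been extracted and that a predicted fact counts as supported when it exactly matches a reference fact, which stands in for the paper's expert-validated matching pipeline. The function name and example facts are hypothetical.

```python
def factual_prf(predicted_facts, reference_facts):
    """Return (precision, recall, F1) over sets of atomic facts.

    Exact set membership is an assumption standing in for a real fact matcher.
    """
    predicted, reference = set(predicted_facts), set(reference_facts)
    supported = predicted & reference
    precision = len(supported) / len(predicted) if predicted else 0.0
    recall = len(supported) / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical atomic facts for one question.
predicted = ["drug A lowers risk", "effect holds in adults", "no serious harms", "dose is 10 mg"]
reference = ["drug A lowers risk", "effect holds in adults", "evidence certainty is low"]

precision, recall, f1 = factual_prf(predicted, reference)
print(precision, recall, f1)
```

Precision penalizes unsupported claims in the agent's conclusion, while recall penalizes missing facts, which is why the abstract pairs them as correctness and comprehensiveness.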

2606.11543 2026-06-11 cs.AI cs.SE 新提交

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

SkillJuror:衡量智能体技能组织如何改变运行时行为

Zhiyu Chen, Zihan Guo, Bo Huang, Bingwei Lu, Jianghao Lin, Yuanjian Zhou, Weinan Zhang

发表机构 * Tongji University(同济大学) Shanghai Innovation Institute(上海创新研究院) Sun Yat-sen University(中山大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出SkillJuror框架,通过渐进式披露与扁平基线对比,发现技能组织方式改变智能体搜索和应用程序知识的行为,并在82个任务中提升4.1%的验证通过率。

详情
AI中文摘要

Agent技能在推理时为大语言模型(LLM)智能体提供程序性知识,但当前的基准测试很少区分技能的内容与其组织方式。我们通过渐进式披露(Progressive Disclosure)研究这种区别,其中简洁的根文件按需引导智能体访问支持资源,并将其与归一化的扁平基线进行比较。我们提出SkillJuror,一个通过语义控制变体、匹配的多试验评估和轨迹证据来评估技能编写范式的框架,同时保持任务知识固定。在82个任务的SkillsBench研究中,渐进式披露在总体结果之前改变了运行时行为:每个轨迹触及的不同技能资源从1.18增加到3.85,有效采纳事件从1.33增加到3.92。在410个匹配试验中,它还产生了17个额外的验证通过试验(比归一化扁平基线提高4.1%)。收益取决于任务。当支持资源指导实现、检查或修复时,渐进式披露有帮助,但当成功取决于精确的输出约定、数值阈值或长工件生成流水线时,效果较弱。这些结果表明,技能组织不仅仅是呈现方式:它可以改变智能体搜索和应用程序知识的方式,而结果收益取决于暴露的资源是否对任务可操作。代码见:this https URL。

英文摘要

Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting resources on demand, and compare it with a normalized flat baseline. We present SkillJuror, a framework for evaluating Skill writing paradigms through semantically controlled variants, matched multi-trial evaluations, and trajectory evidence while holding task knowledge fixed. In an 82-task SkillsBench study, Progressive Disclosure changes runtime behavior before aggregate outcomes: distinct Skill resources touched per trajectory rise from 1.18 to 3.85, and effective uptake events rise from 1.33 to 3.92. It also yields 17 additional verifier-passing trials out of 410 matched trials (+4.1%) over the normalized flat baseline. The benefit is task-dependent. Progressive Disclosure helps when supporting resources guide implementation, checking, or repair, but is weaker when success hinges on exact output conventions, numerical thresholds, or long artifact-generation pipelines. These results show that Skill organization is not mere presentation: it can change how agents search and apply procedural knowledge, while outcome gains depend on whether the exposed resources are actionable for the task. Code is available at this https URL.
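The headline numbers in this abstract can be checked with a few lines of arithmetic. The reading below treats "+4.1%" as 17 extra passing trials out of 410, which is a gain in percentage points of pass rate. The figure of five matched trials per task is an inference from 410/82, not something the abstract states.

```python
# Figures quoted in the abstract.
n_tasks = 82
matched_trials = 410
extra_passing_trials = 17

# Inferred, not stated: matched trials per task.
trials_per_task = matched_trials // n_tasks

# Gain as a share of all matched trials (percentage points of pass rate).
gain_pct = 100 * extra_passing_trials / matched_trials

# Relative growth in the two trajectory-level measures.
resources_ratio = 3.85 / 1.18
uptake_ratio = 3.92 / 1.33

print(trials_per_task, f"{gain_pct:.1f}%")
print(f"{resources_ratio:.2f}x resources touched, {uptake_ratio:.2f}x uptake events")
```

The contrast is the point of the abstract: behavior-level measures roughly triple while the outcome-level pass rate moves by about four points.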

2606.11637 2026-06-11 cs.AI 新提交

TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation

TouchThinker: 通过大规模数据和动作感知表示将触觉常识推理扩展到开放世界

Kailin Lyu, Di Wu, Pengwei Zhang, Yuhang Zheng, Yingxin Lai, Long Xiao, Kangyi Wu, Pengna Li, Chen Gao, Lianyu Hu, Xiaobin Hu, Jie Hao, Ce Hao, Weihao Yuan, Shuicheng Yan

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) National University of Singapore(新加坡国立大学) Zhongguancun Academy(中关村学院) Xiamen University(厦门大学) Xi’an Jiaotong University(西安交通大学) Nanyang Technological University(南洋理工大学) Nanjing University(南京大学)

AI总结 提出TouchThinker框架,通过构建百万级多源触觉数据集TouchThinker-1M和动作感知建模,将触觉常识推理扩展到开放世界,在多个数据集上取得竞争性表现。

详情
Comments
18 pages, 11 figures
AI中文摘要

触觉是具身智能体理解物理世界的关键模态。尽管最近的工作已将触觉信号融入语言系统进行触觉常识推理,但由于两个关键瓶颈,将此类系统扩展到现实的开放世界环境仍然具有挑战性:(1) 当前的触觉推理数据集在格式和规模上仍然有限,为从触觉观察到物理常识的推理提供的监督不足,并阻碍了可迁移触觉常识的学习;(2) 触觉信号本质上是冗余且特定于动作的,但现有方法常常忽略这些特性,导致表示效率低下且语义表达能力有限。为了解决这些局限性,我们提出了TouchThinker,一个从数据和表示两个角度将触觉常识推理扩展到开放世界的触觉-语言框架。首先,我们构建了TouchThinker-1M,一个百万级、多源的触觉推理数据集,涵盖415个物体、8个场景和7种传感器类型,为开放世界泛化提供了坚实的数据基础。我们进一步引入了TouchThinker-Bench,一个具有更真实和多样化任务的开放世界基准。然后,我们提出了动作感知建模机制,以提高触觉表示效率并实现高效推理。实验结果表明,TouchThinker在多个数据集上取得了与最先进模型竞争的性能。我们的代码和数据集将在以下网址提供:this https URL。

英文摘要

Touch is a key modality for embodied agents to understand the physical world. Although recent work has incorporated tactile signals into language systems for tactile commonsense reasoning, scaling such systems to realistic open-world settings remains challenging due to two key bottlenecks: (1) current tactile reasoning datasets remain limited in format and scale, providing insufficient supervision for reasoning from tactile observations to physical commonsense and hindering the learning of transferable tactile commonsense; (2) tactile signals are inherently redundant and action-specific, yet existing methods often overlook these properties, resulting in inefficient representations with limited semantic expressiveness. To address these limitations, we propose TouchThinker, a tactile-language framework that scales tactile commonsense reasoning to the open world from both data and representation perspectives. First, we construct TouchThinker-1M, a million-scale, multi-source tactile reasoning dataset covering 415 objects, 8 scenarios, and 7 sensor types, providing a solid data foundation for open-world generalization. We further introduce TouchThinker-Bench, an open-world benchmark with more realistic and diverse tasks. Then, we propose an action-aware modeling mechanism to improve tactile representation efficiency and enable efficient reasoning. Experimental results demonstrate that TouchThinker achieves competitive performance against state-of-the-art models across multiple datasets. Our code and dataset will be made available at: this https URL.

2606.11909 2026-06-11 cs.AI 新提交

Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

Embodied-BenchClaw:用于具身空间智能基准构建的自主多智能体系统

Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Zhe Ji, Jinshan Lai, Xi Ren, Jianwei Hu, Qiang Ma

发表机构 * QiYuan Lab(启元实验室) School of Information and Software Engineering, University of Electronic Science and Technology of China(电子科技大学信息与软件工程学院) Beijing University of Posts and Telecommunications(北京邮电大学) School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院) School of Computer Science and Engineering, Beihang University(北京航空航天大学计算机科学与工程学院)

AI总结 提出Embodied-BenchClaw,一个通过五阶段流水线和三个智能体协调的自主系统,自动构建可验证、可执行、可维护且诊断有用的具身空间智能基准,减少人工工作量。

详情
AI中文摘要

基准测试对于评估具身空间智能至关重要,但其构建劳动密集、难以重用且维护困难。现有的具身基准通常是静态的,随着模型改进可能迅速饱和,限制其区分新能力的能力。我们提出Embodied-BenchClaw,一个用于构建具身空间智能基准的自主智能体系统。给定用户指定的评估意图,Embodied-BenchClaw通过五个阶段流水线自动生成完整且可持续更新的基准包:意图蓝图、数据收集、结构化与清洗、基准合成、评估报告。该流水线由三个智能体协调:规划、构建和评估。为提高可重用性和可靠性,Embodied-BenchClaw引入了可扩展的技能库和过程质量控制,使基准构建可组合、可验证和可修复。我们实例化了多个基准,涵盖室内空间推理、室外空间推理、机器人操作、四足机器人导航、无人机/空中视图理解以及静态基准增强。这些基准跨越不同的具身载体、数据源和空间能力。通过人工评估、基于评判者的评估、一致性检查、成本分析和消融实验,结果表明Embodied-BenchClaw能够以较少的人工努力构建可验证、可执行、可维护且诊断有用的具身空间基准。

英文摘要

Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor-intensive, hard to reuse, and difficult to maintain. Existing embodied benchmarks are often static and may quickly become saturated as models improve, limiting their ability to distinguish new capabilities. We propose Embodied-BenchClaw, an autonomous agentic system for constructing embodied spatial intelligence benchmarks. Given a user-specified evaluation intent, Embodied-BenchClaw automatically produces a complete and continually updatable benchmark package through a five-stage pipeline: intent blueprinting, data collection, structuring and cleaning, benchmark synthesis, and evaluation reporting. The pipeline is coordinated by three agents for planning, construction, and evaluation. To improve reusability and reliability, Embodied-BenchClaw introduces an extensible Skill Library and process quality control, enabling benchmark construction to be composable, verifiable, and repairable. We instantiate multiple benchmarks covering indoor spatial reasoning, outdoor spatial reasoning, robotic manipulation, quadruped robot navigation, UAV/aerial-view understanding, and static benchmark enhancement. These benchmarks span diverse embodied carriers, data sources, and spatial capabilities. Experiments with human evaluation, judge-based assessment, consistency checks, cost analysis, and ablations show that Embodied-BenchClaw can construct verifiable, executable, maintainable, and diagnostically useful embodied spatial benchmarks with reduced manual effort.

2606.12086 2026-06-11 cs.AI cs.LG 新提交

IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

IntElicit: 通过对话策略优化引出和评估情境化创造力

Mingjia Li, Jin Wu, Hong Qian, Wenhao Huang, Yiyang Huang, Yiwen Zhang, Chanjin Zheng, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

发表机构 * East China Normal University(华东师范大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出IntElicit框架,通过分解过程奖励机制优化对话策略,在交互中减少非创造性混淆因素,从而更有效地引出和评估情境化创造力。

详情
AI中文摘要

情境化评估为评估创造力提供了高生态效度,但也引入了一个关键挑战:观察到的表现可能与认知熟练度(领域知识)和能动性(参与意愿)相混淆。同时,在生成式AI时代,创造性问题解决越来越多地发生在工具中介和人机交互环境中,使得完全静态的评估与当代创造性实践不太一致。为了解决这些问题,本文提出了IntElicit,一个通过对话策略优化来引出和评估情境化创造力的框架。IntElicit作为一个受约束的自适应AI面试官:它在多轮交互中提供非指导性的知识和能动性支架,以减少非创造性混淆因素,同时保留参与者生成被评估的创造性内容的责任。具体来说,为了解决开放教育对话中的稀疏奖励和潜在奖励破解(例如,答案听写),IntElicit引入了一种分解过程奖励机制。该机制将策略与教学引出对齐,奖励那些引出参与者推理而非代表他们产生最优答案的提示。大量实验,包括参与者模拟和一项人类受试者研究(N=64),表明IntElicit比专家设计的基线提高了引出的创造性成果。总之,结果表明,交互式引出可以揭示静态FPSP式评估可能遗漏的创造性潜力,为AI中介学习环境中的情境化创造力评估提供了形成性和诊断性视角。

英文摘要

Contextualized assessment offers high ecological validity for evaluating creativity but introduces a critical challenge: observed performance may be confounded with cognitive proficiency (domain knowledge) and agency (willingness to engage). Meanwhile, in the age of generative AI, creative problem solving increasingly occurs in tool-mediated and human-AI interactive environments, making fully static assessment less aligned with contemporary creative practice. To address these issues, this paper proposes IntElicit, a framework for eliciting and assessing contextualized creativity via dialogue policy optimization. IntElicit functions as a constrained adaptive AI Interviewer: it provides non-directive knowledge and agency scaffolds in multi-turn interaction to reduce non-creative confounders, while preserving participants' responsibility for generating the creative content being evaluated. Specifically, to tackle sparse rewards and potential reward hacking (e.g., answer dictation) in open-ended educational dialogue, IntElicit introduces a decomposed process reward mechanism. This mechanism aligns the policy with pedagogical elicitation, rewarding prompts that draw out participant reasoning rather than producing optimal answers on their behalf. Extensive experiments, including participant simulation and a human subject study (N=64), show that IntElicit improves elicited creative outcomes over expert-designed baselines. Together, the results suggest that interactive elicitation can reveal creative potential that static FPSP-style assessment may miss, providing a formative and diagnostic lens for contextualized creativity assessment in AI-mediated learning contexts.

2606.11196 2026-06-11 cs.CL cs.AI cs.CR cs.LG 交叉投稿

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

PoQ-Judge:去中心化LLM推理中成本感知的证明质量的多架构评估框架

Arther Tian, Alex Ding, Frank Chen, Simon Wu, Aaron Chan

发表机构 * DGrid AI

AI总结 提出PoQ-Judge框架,训练专用裁判模型对查询-输出对进行无参考评分,研究三种架构,最佳模型在Pearson相关性上达到0.747,级联评估降低72.7%成本。

详情
AI中文摘要

去中心化LLM推理网络需要轻量级、无参考的质量评估用于证明质量(PoQ)。我们提出PoQ-Judge,一个训练专用裁判模型对查询-输出对进行评分而无真实参考的框架。我们研究了三种架构在质量-成本权衡中的表现:TextCNN裁判、MiniLM交叉编码器和DeBERTa裁判。通过在UltraFeedback和GPT标记的领域内数据上进行两阶段训练,最佳模型在保留测试集上与真实代理的Pearson相关性达到0.747,优于先前工作中基于参考的评估器。作为复合评分中的无参考组件,它实现了0.645的Pearson相关性,匹配最佳单一基于参考的评估器,同时消除了对参考答案的需求。我们还表明,在线校准将语义质量识别为主导维度,级联评估将成本降低72.7%,仅带来适度的质量损失。结果在问答任务上比摘要任务强得多,表明代理质量是主要剩余限制。

英文摘要

Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without ground-truth references. We study three architectures across the quality-cost tradeoff: a TextCNN judge, a MiniLM cross-encoder, and a DeBERTa judge. Using two-stage training on UltraFeedback plus GPT-labeled in-domain data, the best model reaches 0.747 Pearson correlation with the ground-truth proxy on a held-out test set, outperforming reference-based evaluators from prior work. As a reference-free component in composite scoring, it achieves 0.645 Pearson correlation, matching the best single reference-based evaluator while removing the need for reference answers. We also show that online calibration identifies semantic quality as the dominant dimension and that cascade evaluation reduces cost by 72.7 percent with only modest quality loss. Results are much stronger on QA than summarization, pointing to proxy quality as the main remaining limitation.
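The abstract does not spell out how cascade evaluation is wired. One common design, assumed here purely for illustration, scores every item with a cheap judge and escalates to a stronger judge only when the cheap score falls in an ambiguous band. The function name, the band limits, and the per-call costs are all hypothetical.

```python
def cascade_evaluate(items, cheap_judge, strong_judge,
                     low=0.3, high=0.7, cheap_cost=1.0, strong_cost=10.0):
    """Score items with a cheap judge, escalating ambiguous ones to a strong judge.

    Returns (scores, total_cost). Band limits and costs are illustrative assumptions.
    """
    scores, total_cost = [], 0.0
    for item in items:
        score = cheap_judge(item)
        total_cost += cheap_cost
        if low <= score <= high:  # cheap judge is unsure, so pay for the strong one
            score = strong_judge(item)
            total_cost += strong_cost
        scores.append(score)
    return scores, total_cost

# Stub judges standing in for, e.g., a small CNN judge and a larger encoder judge.
cheap_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
strong_scores = {"a": 0.95, "b": 0.8, "c": 0.05}

scores, cost = cascade_evaluate(["a", "b", "c"], cheap_scores.get, strong_scores.get)
always_strong_cost = 10.0 * 3
print(scores, cost, always_strong_cost)
```

The saving depends on how many items land in the ambiguous band, so the 72.7 percent figure in the abstract should be read as specific to the paper's data and judges.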

2606.11198 2026-06-11 cs.CL cs.AI 交叉投稿

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

结构注意力税:检索格式如何劫持上下文学习而与内容无关

Yuqi Zhang, Di Zhang

发表机构 * Xi’an Jiaotong-Liverpool University(西交利物浦大学)

AI总结 研究发现知识图谱三元组因其格式结构比自然语言吸引2-3倍注意力,压缩演示注意力达42%,并提出了分解注意力为语义与结构成分的框架及缓解策略。

详情
Comments
10 pages, 5 figures
AI中文摘要

检索增强生成(RAG)系统注入外部知识以改进大语言模型输出,然而注入内容的格式——区别于其语义相关性——可以独立地扭曲模型的注意力分布。我们识别并形式化了一种称为结构注意力税的现象:知识图谱(KG)三元组,由于其关系分隔符和重复的槽位模式,每个token捕获的注意力是语义等价的自然语言文本的2-3倍($\hat{o}$(KG) ≈ 0.70 对比 $\hat{o}$(中性) ≈ 0.25),将演示注意力压缩高达42%——无论三元组是相关还是噪声。我们开发了一个形式化框架,将注意力分数分解为语义和结构成分(公式2),推导了一个压缩界(命题1),将token级别的格式偏差与演示注意力损失联系起来,并表明结构项控制着注意力被转移多少,而语义项控制着这是有益还是有害。这种解耦揭示了改进检索增强ICL的两个正交轴:优化检索质量(语义轴)和减少格式驱动的注意力捕获(结构轴)。实验上,在两个模型家族(Mistral-7B, LLaMA-3-8B)和三个QA基准上,我们观察到源任务对齐占主导地位:任务匹配的BM25检索在HotpotQA上达到58-62%,而ConceptNet为25-27%,超过30个百分点的差距远远超过所有门控策略(≤2个百分点)。我们从该框架推导出五种结构感知缓解策略,从零成本提示修改到训练时正则化;格式展平(S3)通过来自口头化三元组控制的准确性和注意力级证据得到验证,而结构分散(S1)产生了混合结果,揭示了格式级别干预的挑战。

英文摘要

Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the format of injected content -- distinct from its semantic relevance -- can independently distort the model's attention distribution. We identify and formalise a phenomenon we term the structural attention tax: knowledge graph (KG) triples, due to their relational delimiters and repeated slot patterns, capture 2-3x more attention per token than semantically equivalent natural-language text ($\hat{o}$(KG) $\approx$ 0.70 vs. $\hat{o}$(neutral) $\approx$ 0.25), compressing demonstration attention by up to 42% -- regardless of whether the triples are relevant or noise. We develop a formal framework decomposing attention scores into semantic and structural components (Eq. 2), derive a compression bound (Proposition 1) connecting token-level format bias to demonstration attention loss, and show that the structural term governs how much attention is diverted while the semantic term governs whether this helps or hurts. This decoupling reveals two orthogonal axes for improving retrieval-augmented ICL: optimising retrieval quality (semantic axis) and reducing format-driven attention capture (structural axis). Empirically, across two model families (Mistral-7B, LLaMA-3-8B) and three QA benchmarks, we observe that source-task alignment dominates: task-matched BM25 retrieval achieves 58-62% on HotpotQA vs. ConceptNet's 25-27%, a >30 pp gap that dwarfs all gating strategies ($\leq$2 pp). We derive five structure-aware mitigation strategies from the framework, ranging from zero-cost prompt modifications to training-time regularisation; format flattening (S3) is validated by both accuracy and attention-level evidence from a verbalized-triple control, while structural dispersal (S1) yields mixed results that illuminate the challenges of format-level intervention.

2606.11208 2026-06-11 cs.CL cs.AI 交叉投稿

BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts

BioDivergence:生物医学摘要中隐藏上下文矛盾的基准与评估框架

Elias Hossain, Sanjeda Sara Jennifer, Sabera Akter Bushra, Niloofar Yousefi

发表机构 * College of Engineering and Computer Science, University of Central Florida(中佛罗里达大学工程与计算机科学学院) Burnett School of Biomedical Sciences, University of Central Florida(中佛罗里达大学伯内特生物医学科学学院)

AI总结 提出BioDivergence框架,通过六类冲突分类、13轴分歧本体和结构化输出,解决现有NLI基准无法捕捉生物医学研究中上下文依赖的差异问题,并发布包含11865个声明对的基准数据集。

详情
AI中文摘要

生物医学发现常常在不同研究中看似冲突,但许多差异是上下文依赖的而非真正的矛盾。队列、地理、实验方案、疾病亚型和临床环境的变化可能使两种说法在局部都成立。现有的NLI和科学声明验证基准将此类情况简化为蕴含、矛盾或中立,未能捕捉分歧背后的上下文结构。为解决这一问题,我们引入了BioDivergence,一个包含六类冲突分类、13轴分歧本体以及每个声明对四个结构化输出(冲突类型、分歧轴、主要混杂因素和调和解释)的评估框架。我们发布了BioDivergence-Silver-v1.0,一个跨五个生物医学领域的11865个声明对的文章分离银标准基准,以及一个用于比较的遗留去重变体。结果显示,两种变体之间存在显著的排名差异,微调参考模型在文章分离设置下下降了约12分,而Mistral-7B-Instruct-v0.3在842个示例的主测试集上达到了0.5523的准确率和0.3894的上下文F1分数。BioDivergence提供了一种更忠实的方式来区分上下文分歧与直接矛盾,并区分文章级记忆与真正的任务学习。

英文摘要

Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradictions. Variations in cohort, geography, assay protocol, disease subtype, and clinical setting can make both claims locally valid. Existing NLI and scientific claim-verification benchmarks reduce such cases to entailment, contradiction, or neutral, failing to capture the contextual structure behind divergence. To address this, we introduce BioDivergence, an evaluation framework with a six-class conflict taxonomy, a 13-axis divergence ontology, and four structured outputs per claim pair: conflict type, divergence axes, dominant confounder, and reconciliation explanation. We release BioDivergence-Silver-v1.0, an article-disjoint silver benchmark of 11,865 claim pairs across five biomedical domains, alongside a legacy deduplicated variant for comparison. Results show notable ranking differences between the two variants, with the fine-tuned reference model dropping about 12 points under the article-disjoint setting, while Mistral-7B-Instruct-v0.3 achieves 0.5523 accuracy and 0.3894 contextual-F1 on the 842-example primary test set. BioDivergence offers a more faithful way to distinguish contextual divergence from direct contradiction and to separate article-level memorization from genuine task learning.

2606.11211 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models

推理下的校准漂移:思维链预算如何导致大型语言模型过度自信

Prakul Sunil Hiremath, Harshit R. Hiremath

发表机构 * Department of Computer Science and Engineering, Visvesvaraya Technological University, Belagavi(维斯瓦拉亚科技大学计算机科学与工程系,贝拉加维) Department of Computer Science and Business System, SG Balekundri Institute of Technology, Belagavi(SG巴莱昆德里理工学院计算机科学与商业系统系,贝拉加维)

AI总结 研究发现,增加思维链推理预算超过任务特定阈值会导致模型对错误答案过度自信,提出校准漂移现象并引入CABStop停止规则。

详情
Comments
31 pages, 4 figures, 3 tables. Introduces Calibration Drift Under Reasoning (CDUR) with theoretical analysis and preliminary experiments; includes CABStop; code and data available
AI中文摘要

大型语言模型(LLMs)表达校准不确定性的能力对于安全部署至关重要。思维链(CoT)推理被广泛用于提高准确性和可靠性,但其对校准的影响尚未完全理解。我们表明这一图景是不完整的:在某些设置中,将推理预算增加到任务特定阈值以上会导致模型系统性地变得过度自信,对错误答案赋予高置信度。我们将此现象称为推理下的校准漂移(CDUR),并从理论和实证两方面进行研究。我们定义推理预算B,并分析预期校准误差ECE(B)呈现非单调模式的条件:它首先随着推理纠正错误而下降,然后随着更长推理产生内部一致但错误的解释而上升。我们提出一个基于自回归生成的假设锁定模型来解释这种行为。我们在47个推理陷阱问题上评估了Llama-3.1-8B和Llama-3.3-70B,跨越四个推理预算和三个随机种子(1,368次API调用;574个有效响应)。8B模型显示出非单调的校准行为,而70B模型的结果仅限于基线评估,对于预算依赖效应尚无定论。我们引入CABStop,一种校准感知的停止规则,当置信度偏离辅助准确性估计时停止推理。这些结果表明,增加推理深度并不总是提高可靠性,应谨慎监控。

英文摘要

The ability of large language models (LLMs) to express calibrated uncertainty is important for safe deployment. Chain-of-thought (CoT) reasoning is widely used to improve accuracy and reliability, but its effect on calibration is not fully understood. We show that this picture is incomplete: in some settings, increasing the reasoning budget beyond a task-specific threshold can cause models to become systematically overconfident, assigning high confidence to incorrect answers. We call this phenomenon Calibration Drift Under Reasoning (CDUR) and study it both theoretically and empirically. We define reasoning budget B and analyze conditions under which Expected Calibration Error ECE(B) follows a non-monotonic pattern: it first decreases as reasoning corrects errors, then increases as longer reasoning produces internally consistent but incorrect explanations. We propose a Hypothesis Lock-In model based on autoregressive generation to explain this behavior. We evaluate Llama-3.1-8B and Llama-3.3-70B on 47 reasoning-trap questions across four reasoning budgets and three seeds (1,368 API calls; 574 valid responses). The 8B model shows non-monotonic calibration behavior, while results for the 70B model are limited to baseline evaluation and are inconclusive for budget-dependent effects. We introduce CABStop, a calibration-aware stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate. These results suggest that increasing reasoning depth does not always improve reliability and should be monitored carefully.
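For readers unfamiliar with the metric, a standard equal-width binned Expected Calibration Error can be sketched as below, followed by a toy stopping check in the spirit of CABStop. The bin count, the tolerance, and the function names are assumptions for illustration; the paper's exact ECE estimator and stopping rule are not given in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: weighted mean of |avg confidence - accuracy| per bin."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(y for _, y in members) / len(members)
            ece += len(members) / n * abs(avg_conf - accuracy)
    return ece

def should_stop_reasoning(confidence, accuracy_estimate, tolerance=0.2):
    """Toy stopping check: halt when confidence exceeds an auxiliary accuracy estimate
    by more than a tolerance (the tolerance value is an assumption)."""
    return confidence - accuracy_estimate > tolerance

# Four answers at 0.9 confidence, only half of them correct: overconfident.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 0, 0])
print(ece, should_stop_reasoning(0.9, 0.5))
```

Under the abstract's hypothesis, evaluating such an ECE at each reasoning budget B would trace a curve that first falls and then rises.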

2606.11219 2026-06-11 cs.CL cs.AI cs.SD 交叉投稿

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

Afrispeech Semantics: 评估跨领域和口音的口语语言模型中的音频语义推理

Chibuzor Okocha, Christan Grant

发表机构 * University of Florida(佛罗里达大学)

AI总结 提出五项语义与副语言推理任务(蕴含、一致性、合理性、口音漂移、口音约束),评估音频语言模型在口音变化、领域迁移和语义过度推断下的推理能力,揭示当前评估的局限性。

详情
Comments
Accepted to ACL
AI中文摘要

音频语言模型(ALMs)越来越多地用于基于语音的理解,但它们在转录、文本到音频检索、字幕生成和问答准确性之外的语义推理能力仍未得到充分基准测试。特别是,口音变化、领域迁移和语义过度推断对音频推理的影响尚不清楚。我们评估了音频语言模型在五项语义和副语言推理任务上的表现:蕴含、一致性、合理性、口音漂移和口音约束。这些任务共同评估模型以口语音频作为主要证据来源进行推理的能力,包括文本假设是否可以从音频中推断、矛盾或无法确定,陈述是否与口语内容一致或冲突,给定话语的声明是否合理,以及模型预测在口音变化下是否保持稳定或适当约束。这些发现凸显了当前音频推理评估的关键局限性,并希望为更稳健和公平的ALM设计与评估提供指导。

英文摘要

Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond transcription, Text-to-Audio Retrieval, Captioning, and Question-Answering accuracy remains insufficiently benchmarked. In particular, the effects of accent variation, domain shift, and semantic over-inference on audio reasoning are poorly understood. We evaluate audio language models across five semantic and paralinguistic reasoning tasks: entailment, consistency, plausibility, accent drift, and accent restraint. Collectively, these tasks assess a model's ability to reason over spoken audio as the primary evidence source, including whether a textual hypothesis can be inferred, contradicted, or left undetermined by the audio, whether statements align or conflict with spoken content, whether claims are plausible given the discourse, and whether model predictions remain stable or appropriately constrained across accent variation. These findings highlight critical limitations in current audio reasoning evaluations, and we hope they provide guidance for more robust and equitable ALM design and assessment.

2606.11232 2026-06-11 cs.CL cs.AI 交叉投稿

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

每个行为都有代价:前沿大语言模型中的压缩道德组合

Weijia Zhang, Ruiqi Chen, Yunze Xiao, Weihao Xuan

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Michigan(密歇根大学) Carnegie Mellon University(卡内基梅隆大学) The University of Tokyo(东京大学)

AI总结 针对现有道德基准仅评估孤立行为偏好的不足,提出Moral Trolley Arena两阶段盲ELO基准,通过校准个体道德行为并组合为双行为项,发现前沿LLM的道德判断呈压缩而非简单加性关系。

详情
AI中文摘要

现有的LLM道德基准通常询问模型偏好哪个孤立的道德行为、价值或基础。这有用但不完整。现实判断往往要求模型在同一选项中组合多个道德信号。我们引入**Moral Trolley Arena**,一个两阶段盲ELO基准,用于衡量LLM如何组合道德证据。单场景阶段首先从跨越五个道德基础理论的229个场景语料库中校准个体道德行为;组合阶段则将校准后的行为组合成受控强度网格上的双行为道德项,并测量由此产生的组合偏好。在十个前沿模型中,组合判断主要由成分行为强度预测,但关系始终是压缩的而非简单加性。模型还表现出非加性强度锚定、成分控制后有限的基础特异性残差,以及跨提供者高度收敛的组合偏好曲面。这些结果表明,道德审计应衡量道德证据的组合规则,而不仅仅是对孤立行为的排名。

英文摘要

Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. Realistic judgments often require a model to combine several moral signals within the same option. We introduce **Moral Trolley Arena**, a two-stage blind ELO benchmark for measuring how LLMs compose moral evidence. The single-scene arena first calibrates individual moral acts from a 229-scenario corpus across five Moral Foundations Theory foundations; the composite arena then combines calibrated acts into two-act moral items over a controlled intensity grid and measures the resulting composite preferences. Across ten frontier models, composite judgments are largely predicted by component act strength, but the relation is consistently compressed rather than simply additive. Models also show non-additive intensity anchoring, bounded foundation-specific residuals after component control, and highly convergent composite preference surfaces across providers. These results suggest that moral audits should measure composition rules for moral evidence, not only rankings over isolated acts.
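For context, the standard Elo update that arena-style benchmarks typically build on is sketched below. Whether Moral Trolley Arena uses exactly this form, and with what K-factor, is not stated in the abstract; `k=32` and the starting ratings are assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one pairwise comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two moral acts start at the same rating; the model prefers act A once.
act_a, act_b = elo_update(1000.0, 1000.0, score_a=1.0)
print(act_a, act_b)  # prints "1016.0 984.0"
```

Ratings calibrated this way for single acts give the component strengths against which the composite-stage preferences can then be compared.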

2606.11260 2026-06-11 cs.SD cs.AI 交叉投稿

RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark

RAIL: 基于CHC框架重新思考大型音频语言模型中的听觉智能

Hongyu Jin, Siyi Wang, Yang Xiao, Jiaheng Dong, Shihong Tan, Kaiyuan Peng, Georgiana Juravle, Shanquan Chen, Gongping Huang, Hong Jia, Eun-Jung Holden, James Bailey, Ting Dang

发表机构 * School of Computing and Information Systems, The University of Melbourne(墨尔本大学计算与信息系统学院) Faculty of Psychology and Educational Sciences, Alexandru Ioan Cuza University of Iași(亚历山德鲁伊万库扎大学心理学与教育科学学院) School of Electronic Information, Wuhan University(武汉大学电子信息学院) School of Public Health, The University of Hong Kong(香港大学公共卫生学院) School of Computer Science, The University of Auckland(奥克兰大学计算机科学学院) Department of Data Science and Artificial Intelligence, Monash University(莫纳什大学数据科学与人工智能系)

AI总结 提出RAIL基准,基于CHC认知框架将听觉智能分解为五种核心能力,构建结构化评估任务,系统评测大型音频语言模型的认知行为。

详情
AI中文摘要

人类通过紧密集成的认知能力(如音频感知、音频推理和记忆)处理丰富的听觉环境。尽管大型音频语言模型(LALMs)在语音理解和多模态音频推理方面取得了近期进展,但当前的评估范式仍然主要围绕任务或模态,关注最终性能而忽视了潜在的听觉认知行为。这揭示了人类听觉认知理解与LALMs评估之间的根本差距,特别是缺乏将认知原则操作化到任务级指标之外以系统捕捉模型行为的框架。在这项工作中,我们引入了RAIL,一种基于Cattell-Horn-Carroll(CHC)认知框架的以人为中心的评估范式。RAIL将听觉认知形式化为五种核心能力,并将其发展为结构化评估任务,探究模型如何处理、保留和整合听觉信息。我们进一步构建了一个认知基础的基准,包含原则性数据收集和人类对齐的评估协议。评估26个最先进的LALMs,我们发现当前模型在认知能力上表现出高度不平衡的性能。RAIL建立了一种新的评估范式,从以任务为中心的基准测试转向基于认知的听觉智能评估。

英文摘要

Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and memory. Despite recent progress in large audio-language models (LALMs) across speech understanding and multimodal audio reasoning, current evaluation paradigms remain largely task- or modality-centric, focusing on end performance while overlooking underlying auditory cognitive behaviours. This reveals a fundamental gap between how auditory cognition is understood in humans and how it is evaluated in LALMs, particularly in the lack of frameworks that operationalise cognitive principles beyond task-level metrics to systematically capture model behaviour. In this work, we introduce RAIL, a human-centric evaluation paradigm grounded in the Cattell-Horn-Carroll (CHC) cognitive framework. RAIL formalises auditory cognition into five core capabilities and develops them into structured evaluation tasks that probe how models process, retain, and integrate auditory information. We further construct a cognitively grounded benchmark with principled data curation and human-aligned evaluation protocols. Evaluating 26 state-of-the-art LALMs, we find that current models exhibit highly uneven performance across cognitive abilities. RAIL establishes a new evaluation paradigm that moves beyond task-centric benchmarking toward cognitively grounded assessment of auditory intelligence.

2606.11375 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

当探测精度饱和时,脆弱性揭示问题:LLM预训练分析的互补度量

Orion Reblitz-Richardson

发表机构 * Distiller Labs

AI总结 针对线性探测在预训练中精度快速饱和的问题,提出脆弱性度量,通过激活噪声水平衡量探测鲁棒性,揭示精度无法捕捉的表示结构演化。

详情
Comments
22 pages, 5 figures. Code and datasets at this https URL
AI中文摘要

标准线性探测在隐藏状态上的分类器达到高精度时,宣称属性被“编码”。该协议在快照上表现良好,但在预训练过程中失效:探测精度在最初几千步内饱和,使得大部分训练过程对仪器不可见。我们引入脆弱性,一种互补的逐层度量,定义为探测精度崩溃时的激活噪声水平。脆弱性对可分性边际和表示冗余均敏感,这两者在精度平台期后仍持续演化。应用于开放检查点语言模型时,脆弱性恢复了精度单独无法看到的结构。道德化表示沿着词汇→组合梯度出现:词汇道德检测在先,组合道德编码在后。由于探测精度本身跟踪数据集在词汇层面的可分性,我们通过证明其在共享无对比标记的构造类型间转移,直接建立了组合编码。层深度鲁棒性梯度在训练中单调发展,而精度保持平坦。匹配的微调语料库产生相同的探测精度,却留下不同的脆弱性指纹,表明数据整理在不改变探测精度的情况下重塑了探测鲁棒性。在我们测试的每个比较中,当探测精度返回平坦答案时,脆弱性返回结构化答案。

英文摘要

Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few thousand steps, leaving most of training invisible to the instrument. We introduce fragility, a complementary per-layer metric defined as the activation-noise level at which probe accuracy collapses. Fragility is sensitive to both the margin of separability and the redundancy of representation, both of which keep evolving long after accuracy plateaus. Applied to open-checkpoint language models, fragility recovers structure that accuracy alone cannot see. Moralized representations emerge along a lexical $\to$ compositional gradient: lexical moral detection first, compositional moral encoding later. Because probe accuracy on its own tracks how lexically separable a dataset is, we establish the compositional encoding directly, by showing it transfers across construction types that share no contrast tokens. A layer-depth robustness gradient develops monotonically across training while accuracy stays flat. And matched fine-tuning corpora that produce identical probing accuracy leave distinct fragility fingerprints, showing that data curation reshapes probe robustness without changing probe accuracy. In every comparison we test, where probing accuracy returns a flat answer, fragility returns a structured one.
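A toy sketch of the fragility idea as the abstract defines it: sweep Gaussian noise added to activations and report the first noise level at which a fixed linear probe's accuracy drops below a floor. The collapse floor of 0.75, the noise grid, the one-dimensional synthetic "activations", and all function names are assumptions made for illustration and are not the paper's protocol.

```python
import random

def probe_accuracy(features, labels, weights, bias, sigma, rng):
    """Accuracy of a fixed linear probe after adding N(0, sigma^2) noise to each feature."""
    hits = 0
    for x, y in zip(features, labels):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        logit = sum(w * xi for w, xi in zip(weights, noisy)) + bias
        hits += int((logit > 0) == bool(y))
    return hits / len(features)

def collapse_noise_level(features, labels, weights, bias, sigmas, floor=0.75, seed=0):
    """Smallest noise level in `sigmas` at which probe accuracy falls below `floor`."""
    rng = random.Random(seed)
    for sigma in sorted(sigmas):
        if probe_accuracy(features, labels, weights, bias, sigma, rng) < floor:
            return sigma
    return None  # accuracy never collapsed on this grid

def two_cluster_data(margin, n_per_class=1000):
    """Synthetic 1-D activations: class 1 at +margin, class 0 at -margin."""
    features = [[margin]] * n_per_class + [[-margin]] * n_per_class
    labels = [1] * n_per_class + [0] * n_per_class
    return features, labels

sigmas = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
narrow = collapse_noise_level(*two_cluster_data(1.0), [1.0], 0.0, sigmas)
wide = collapse_noise_level(*two_cluster_data(5.0), [1.0], 0.0, sigmas)
print(narrow, wide)
```

Both synthetic datasets give the probe perfect accuracy without noise, yet they collapse at different noise levels, which mirrors the abstract's claim that the metric keeps discriminating after accuracy has saturated.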

2606.11387 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

小实验,更经济的决策:微预训练中分阶段提升的案例研究

Felipe Chavarro Polania

发表机构 * Hewlett Packard Enterprise(慧与科技公司)

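AI summary: Studies an auditable staged-promotion protocol for micro-pretraining that screens configurations under fixed budgets on two hosts (Windows A100 and Linux L40S). Early rankings are unstable and host-sensitive; the executed 12-hour branch costs 144 GPU-hours, fewer than continuing all 60-minute or 10-minute candidates, and the result is framed as a bounded cost-allocation finding rather than a claim of global optimality.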

详情
Comments
14 pages, 5 figures; 12-hour dual-host micro-pretraining promotion study; source package includes curated ancillary artifacts
AI中文摘要

短预训练运行可以降低实验成本,但它们也可能过度推广那些仅在小预算下表现良好的配置。我们针对固定微预训练运行器在两个异构主机块(Windows A100和Linux L40S)上研究了一种可审计的分阶段提升协议。从12个预先筛选的配置开始,我们使用2分钟、5分钟、10分钟、60分钟和12小时的分阶段预算,并在昂贵的延续之前设置固定的提升规则。早期筛选被有意视为不稳定:5分钟和10分钟的排名对主机敏感,而最终的12小时排名最优条件并非复制10分钟门控下的平均最佳条件。由于不同阶段的种子范围不同,这些变化是操作性的提升证据,而非种子内曲线。复制60分钟门控将分阶段因子筛选桥接参考保留在提升集中,它在所有四个60分钟主机-种子单元中排名第一。在最终的12小时确认包中,桥接条件在两个种子的所有四个主机-种子单元中排名第一;贪婪比较器未满足固定的0.010 val_bpb近似等价规则;更便宜的d8/ar48(深度8,宽高比48)哨兵未满足固定的0.020平均差距规则。执行的12小时分支花费144 GPU小时,完整的分阶段协议记录169.2训练GPU小时(包括筛选阶段)。继续所有四个60分钟候选将花费192 GPU小时,而继续所有九个复制10分钟候选将花费432 GPU小时。后者是未运行延续的会计反事实,并非表明跳过的候选不可能超越参考。结果是一个有界成本分配发现,而非全局最优性、容量归一化优越性或优于自适应超参数优化方法的声明。

英文摘要

Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S. Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods.
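The GPU-hour figures in the abstract are mutually consistent under one simple accounting model, sketched below. It assumes each 12-hour continuation runs on one GPU per host-seed cell (2 hosts x 2 seeds) and that the executed branch continued three conditions (the bridge reference, the greedy comparator, and the d8/ar48 sentinel); both points are inferences from the abstract, not stated facts.

```python
HOSTS = 2          # Windows A100 and Linux L40S
SEEDS = 2          # "two seeds" in the 12-hour confirmation package
HOURS = 12         # final-stage budget per run
GPUS_PER_RUN = 1   # assumption: one GPU per host-seed cell

def continuation_gpu_hours(n_candidates):
    """GPU-hours to continue n candidates through the 12-hour stage."""
    return n_candidates * HOSTS * SEEDS * HOURS * GPUS_PER_RUN

executed = continuation_gpu_hours(3)      # inferred: three promoted conditions
all_60_minute = continuation_gpu_hours(4)
all_10_minute = continuation_gpu_hours(9)

# Screening overhead implied by the reported 169.2 total training GPU-hours.
screening_overhead = round(169.2 - executed, 1)

print(executed, all_60_minute, all_10_minute, screening_overhead)
```

Under these assumptions the three continuation totals reproduce the reported 144, 192, and 432 GPU-hours, and the screening stages account for the remaining 25.2 GPU-hours.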

2606.11416 2026-06-11 cs.CR cs.AI 交叉投稿

MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation

MPC-Patch-Bench:面向多方计算的安全感知LLM代码补丁

Yukuan Zhang, Mengxin Zheng, Qian Lou

AI总结 针对多方计算(MPC)软件缺乏仓库级代码修复基准的问题,提出MPC-Patch-Bench,包含数据筛选框架和MPC验证器,评估LLM在MPC仓库级修复中的安全性和数值保真度。

详情
Comments
preprint
AI中文摘要

目前尚不存在用于评估大型语言模型(LLM)在安全多方计算(MPC)软件上代码修复的仓库级基准,直接移植SWE-bench等通用基准在三个结构层面失败:(i)MPC仓库主要由通用Python基础设施而非密码学逻辑主导;(ii)高价值MPC修复缺乏严格提取流程所需的标准化测试;(iii)标准失败到通过评估对于必须同时保证密码学安全的代码是不充分的。MPC越来越多地部署于隐私保护机器学习、生物医学协作和安全分析。现有的MPC特定代码合成工作仅涵盖算子级或单框架任务;在真实仓库级MPC修复上评估LLM代理反而需要MPC感知的数据筛选和与MPC程序必须遵守的安全性和数值保真度保证相匹配的验证器,而现有基准均未提供。我们提出MPC-Patch-Bench,一个围绕两个框架组织的仓库级基准。(1)数据筛选框架结合了一个领域特定筛选代理,该代理通过三个密码学层过滤原始拉取请求,并配备一个人类-AI补全引擎,合成缺失的问题描述和失败到通过/通过到通过测试,生成205个完全验证的实例。(2)MPC验证器通过针对明文预言机的动态差分测试和MPC特定静态分析规则(标记不安全泄露、不安全算术和非法公共/私有转换)提供专门的安全性和数值保真度检查。评估的最强LLM在功能上仅解决了22.9%的MPC-Patch-Bench任务;MPC验证器进一步将验证通过率降至17.1%,其中高达40%的功能通过补丁因密码学或数值保真度违规而被拒绝。

英文摘要

Repository-level benchmarks for evaluating Large Language Model (LLM) code repair on Secure Multi-Party Computation (MPC) software do not yet exist, and directly transplanting general-purpose benchmarks such as SWE-bench fails on three structural fronts: (i) MPC repositories are dominated by generic Python infrastructure rather than cryptographic logic; (ii) high-value MPC fixes lack the standardized tests rigid extraction pipelines require; and (iii) standard fail-to-pass evaluation is insufficient for code that must also be cryptographically safe. MPC is increasingly deployed for privacy-preserving machine learning, biomedical collaboration, and secure analytics. Existing MPC-specific code-synthesis efforts cover only operator-level or single-framework tasks; evaluating LLM agents on real repository-level MPC repair instead demands MPC-aware data curation and a verifier matched to the security and numerical-fidelity guarantees MPC programs must obey, neither of which existing benchmarks provide. We introduce MPC-Patch-Bench, a repository-level benchmark organized around two frameworks. (1) The Data Curation Framework combines a domain-specific curation agent that filters raw pull requests through three cryptographic layers with a human-AI completion engine that synthesizes missing problem statements and Fail-to-Pass/Pass-to-Pass tests, yielding 205 fully verified instances. (2) The MPC Verifier provides dedicated security and numerical-fidelity checks via dynamic differential testing against plaintext oracles and MPC-specific static analysis rules that flag unsafe reveals, insecure arithmetic, and illegal public/private casts. The strongest evaluated LLM functionally resolves only 22.9% of MPC-Patch-Bench tasks; the MPC Verifier further reduces verified resolution to 17.1%, with up to 40% of functionally-passing patches rejected for cryptographic or numerical-fidelity violations.
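The idea of differential testing against a plaintext oracle can be sketched with a toy two-party additive secret-sharing scheme. Everything here (the ring size, the fixed-point scale, the function names, and the injected bug) is a hypothetical illustration and not the benchmark's verifier or any real MPC framework's API.

```python
import random

MODULUS = 2 ** 64   # toy ring for additive shares
SCALE = 2 ** 16     # toy fixed-point scale

def encode(x):
    """Fixed-point encode a float into the ring."""
    return round(x * SCALE) % MODULUS

def decode(v):
    """Decode a ring element back to a float (two's-complement style)."""
    if v >= MODULUS // 2:
        v -= MODULUS
    return v / SCALE

def share(v, rng):
    """Split v into two additive shares."""
    r = rng.randrange(MODULUS)
    return (r, (v - r) % MODULUS)

def reveal(shares):
    return sum(shares) % MODULUS

def add_public(shares, c):
    """Correct: only one party adds the public constant."""
    return ((shares[0] + c) % MODULUS, shares[1])

def add_public_buggy(shares, c):
    """Buggy patch: both parties add the constant, so the result is x + 2c."""
    return ((shares[0] + c) % MODULUS, (shares[1] + c) % MODULUS)

def differential_failures(secure_fn, cases, tol=1e-6, seed=0):
    """Run secure_fn on shared inputs and compare with the plaintext oracle x + c."""
    rng = random.Random(seed)
    failures = []
    for x, c in cases:
        out = secure_fn(share(encode(x), rng), encode(c))
        got = decode(reveal(out))
        expected = x + c  # plaintext oracle
        if abs(got - expected) > tol:
            failures.append((x, c, got, expected))
    return failures

CASES = [(1.5, 0.25), (-2.25, 1.0), (0.125, -0.5)]
ok_failures = differential_failures(add_public, CASES)
bug_failures = differential_failures(add_public_buggy, CASES)
```

A patch like `add_public_buggy` could still pass a loosely written unit test, which is the gap a numerical-fidelity check of this kind is meant to close; the security-side static rules mentioned in the abstract are not modeled here.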

2606.11499 2026-06-11 cs.CL cs.AI 交叉投稿

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

枢纽或边缘:基于网页图中心性的预训练数据选择

Vedant Badoni, Danqi Chen, Xinyi Wang

发表机构 * Princeton Language and Intelligence(普林斯顿语言与智能) Princeton University(普林斯顿大学)

AI总结 提出WebGraphMix框架,利用Common Crawl主机级网页图的结构中心性得分调整预训练数据中中心与边缘文档的比例,无需模型训练或标注数据,在400M和1B参数模型上平均性能提升至41.4%。

详情
Comments
10 pages
AI中文摘要

现代语言模型的性能关键取决于预训练数据的组成。然而,现有的数据选择方法依赖辅助分类器进行文档评分或混合优化,增加了计算开销和对标注数据的依赖。我们提出WebGraphMix,一个轻量级的数据选择框架,它计算Common Crawl主机级网页图的结构中心性得分,并用其改变预训练混合数据中中心文档与边缘文档的比例。我们假设中心主机使模型暴露于可重用的抽象知识,而边缘主机编码专门的、长尾知识。WebGraphMix在网页规模下高效计算中心性得分,无需模型训练、标注数据或下游监督。我们将WebGraphMix集成到DataComp-LM流水线中,训练了400M和1B参数规模的模型,分别使用8B和28B token,在从事实知识到符号推理的23个任务上进行评估。实验表明,中心和边缘网页区域编码互补的能力。以1:1比例混合两者平均达到41.4%,而均匀采样为39.8%。将结构得分与文档级质量分类器得分相结合,性能进一步提升至43.8%。这些发现表明,网页图拓扑是预训练数据策展的一个有意义维度,捕获了与现有基于内容的方法大致正交的信息。

英文摘要

The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on auxiliary classifiers for document scoring or mixture optimization, adding computational overhead and dependence on labeled data. We propose WebGraphMix, a lightweight data selection framework that computes structural centrality scores over the Common Crawl host-level web graph and uses them to vary the proportion of central versus peripheral documents in the pretraining mixture. We hypothesize that central hosts expose models to reusable abstractions, while peripheral hosts encode specialized, long-tail knowledge. WebGraphMix computes centrality scores efficiently at web scale, requiring no model training, labeled data, or downstream supervision. We integrate WebGraphMix into the DataComp-LM pipeline and train models at 400M and 1B parameter scales with 8B and 28B tokens respectively, evaluating on 23 tasks ranging from factual knowledge to symbolic reasoning. Our experiments show that central and peripheral web regions encode complementary capabilities. A mixture combining both at a 1:1 ratio achieves 41.4% on average, compared to 39.8% for uniform sampling. Combining structural scores with document-level quality classifier scores further improves performance to 43.8%. These findings demonstrate that web graph topology is a meaningful axis for pretraining data curation, capturing information that is largely orthogonal to existing content-based approaches.
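A minimal sketch of the central-versus-peripheral mixing idea follows. The abstract does not name the centrality measure, so in-degree on a toy host graph stands in for it; the split rule, the function names, and the toy data are likewise assumptions.

```python
import random
from collections import Counter

def in_degree_centrality(edges):
    """Score each host by the number of incoming links (a stand-in measure)."""
    hosts = {h for edge in edges for h in edge}
    indeg = Counter(dst for _, dst in edges)
    return {h: indeg.get(h, 0) for h in hosts}

def split_hosts(scores, central_fraction=0.5):
    """Top fraction of hosts by score is 'central', the rest 'peripheral'."""
    ranked = sorted(scores, key=lambda h: (-scores[h], h))
    k = max(1, int(len(ranked) * central_fraction))
    return set(ranked[:k]), set(ranked[k:])

def mix_documents(docs, central, peripheral, n_total, central_ratio=0.5, seed=0):
    """Sample (with replacement) a mixture with the requested central share."""
    rng = random.Random(seed)
    central_docs = [d for d in docs if d["host"] in central]
    peripheral_docs = [d for d in docs if d["host"] in peripheral]
    n_central = round(n_total * central_ratio)
    return (rng.choices(central_docs, k=n_central)
            + rng.choices(peripheral_docs, k=n_total - n_central))

# Toy host-level link graph: (source_host, destination_host).
edges = [("b.org", "a.com"), ("c.net", "a.com"), ("d.io", "a.com"),
         ("c.net", "b.org"), ("d.io", "b.org"), ("a.com", "c.net")]
docs = [{"host": h, "text": f"doc from {h}"}
        for h in ("a.com", "b.org", "c.net", "d.io")]

scores = in_degree_centrality(edges)
central, peripheral = split_hosts(scores)
mixture = mix_documents(docs, central, peripheral, n_total=10)  # 1:1 mix
n_central_docs = sum(d["host"] in central for d in mixture)
```

Because the scores depend only on link structure, nothing in this pipeline touches document content, which matches the abstract's point that the signal is separate from content-based quality classifiers.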

2606.11635 2026-06-11 cs.CY cs.AI 交叉投稿

Are LLMs Bad at Moral Reasoning?

LLMs 在道德推理上表现不佳吗?

Menghang Zhu, Seth Lazar

AI总结 本文通过让LLMs生成评分标准而非直接评分,重新评估MoReBench数据集,发现LLMs的道德推理能力比先前认为的更强。

详情
AI中文摘要

为了让高能力AI系统在动态、开放的环境中安全运行,它们必须能够识别、理解并响应行动中的道德理由,并据此约束自身行为。越来越多的研究旨在评估当今最先进AI系统的这种能力——道德能力,最近得出了普遍悲观的结论。其中一篇最具雄心的论文收集了人类专家制定的黄金标准评分标准,用于评估1000个案例中的道德推理,并以此基准测试前沿AI模型,结果不尽如人意。在本文中,我们认为MoReBench数据集可以被重新利用,以给出对LLMs道德推理(道德能力的重要组成部分)更为乐观的图景。我们表明,如果不根据这些评分标准对LLMs的回应进行评分,而是让LLMs执行与人类相同的任务——为特定案例的道德分析生成评分标准——那么它们生成的评分标准与人类评分标准的校准程度高于其开放式回应,并且在存在差异时,这些差异可能仅仅反映了大多数道德问题的巨大维度,同时也突出了人类在“创建评分标准的评分标准”上的某些偏离。考虑到这些观点,MoReBench数据集表明LLMs在道德推理方面的能力比先前认为的要强得多。

英文摘要

For highly capable AI systems to operate safely in dynamic, open-ended environments, they must be able to identify, understand, and respond to moral reasons for action, and constrain their behaviour accordingly. A growing body of research aims to evaluate this capacity -- moral competence -- in today's most capable AI systems, recently reaching broadly pessimistic conclusions. One of the most ambitious such papers collects gold-standard human-authored rubrics for evaluating moral reasoning in 1,000 cases, and benchmarks frontier AI models against those rubrics, with underwhelming results. In this paper, we argue that the MoReBench dataset can be redeployed to give a much more optimistic picture of LLMs' moral reasoning (an essential part of moral competence). We show that if, instead of scoring LLMs' responses to these cases against these rubrics, we give the LLMs the same task given to humans -- to generate scoring rubrics for the moral analysis of particular cases -- the rubrics they generate are better calibrated to the human rubrics than their open-ended responses and, where they differ, plausibly reflect nothing more than the vast dimensionality of most moral problems, while also highlighting some human departures from the "rubric for creating rubrics". Taking these points into consideration, the MoReBench dataset suggests that LLMs are significantly more capable at moral reasoning than was previously believed.

2606.11686 2026-06-11 cs.CL cs.AI 交叉投稿

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

层隔离评估:使用无LLM、回归锁定的测试工具对生产级LLM代理的确定性框架进行门控

Sawyer Zhang, Alexander Wang, Sophie Lei

发表机构 * Lumivate (Lumi)(Lumivate(Lumi))

AI总结 提出层隔离评估方法,将LLM代理分解为固定层次,用确定性无LLM测试套件逐层检测回归,证明聚合指标会掩盖局部退化,而逐层基线门控可准确定位。

详情
Comments
12 pages, 2 figures, 5 tables
AI中文摘要

端到端任务成功是评估LLM代理的主要方式,但一个聚合数字只能告诉你代理发生了回归,却无法指出具体位置。我们提出层隔离评估:将一个部署的订单代理分解为固定的层次分类(本体、意图、路由、分解、升级、安全、记忆以及跨领域的封装/防御),每一层由其在确定性、无LLM“纯”模式下的断言切片独立测试。纯测试套件(23个切片共238个案例;225个在2.39秒内运行,约10毫秒/案例)在每次变更时针对锁定的逐切片基线在CI中运行。我们通过受控回归注入进行验证,一次退化一个非安全层(共七个层)。我们未设计的效果是掩蔽:聚合通过率几乎不变(六个局部回归的变化范围为-1.7至-5.9个百分点),而匹配的切片则大幅下降(-25至-91个百分点)。一个层的切片对其自身故障做出反应部分是由构造决定的;测量结果是(i)聚合掩蔽以及(ii)损伤不会扩散到其他切片:注入层的切片在7个案例中的5个中是受影响最严重的,在7个案例中的7个中位列前三(平均排名1.29/19)。定位在第二个结构不同的租户(星巴克新加坡)上复现:所有七个匹配切片均大幅下降,因此这不是单一目录的伪像。我们将其定位为EDDOps规定但未实现的组件级评估的具体确定性实例,以CheckList为前身,并作为全工作流随机突变测试的确定性镜像。我们的贡献:(a)为生产代理提供了一个完全分解的、亚秒级、无LLM的逐层测试工具,(b)一个覆盖诚实性测试充分性标准,拒绝为未执行的层打分,以及(c)回归注入演示,证明逐切片基线锁定可以定位聚合指标掩盖的回归。

英文摘要

End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where. We present layer-isolated evaluation: a deployed ordering agent is decomposed into a fixed taxonomy of layers (ontology, intent, routing, decomposition, escalation, safety, memory, and cross-cutting envelope/defense), each exercised by its own assertion slice in a deterministic, no-LLM "pure" mode. The pure suite (238 cases across 23 slices; 225 run in 2.39 s, ~10 ms/case) runs in CI on every change against a locked per-slice baseline. We validate by controlled regression injection, degrading one layer at a time across seven non-safety layers. The effect we did not design in is masking: the aggregate pass-rate barely moves (-1.7 to -5.9 pp for six local regressions), while the matching slice craters (-25 to -91 pp). A layer's slice reacting to its own fault is partly by construction; the measured results are (i) the aggregate masking and (ii) that damage stays off the other slices: the injected layer's slice is the single worst-hit in 5 of 7 cases and top-3 in 7 of 7 (mean rank 1.29 of 19). Localization replicates on a second, structurally different tenant (Starbucks SG): all seven matching slices crater, so it is not a single-catalog artifact. We position it as a concrete, deterministic instantiation of the component-level evaluation EDDOps prescribes but leaves unimplemented, with CheckList as ancestor and as the deterministic mirror image of whole-workflow stochastic mutation testing. Our contributions: (a) a fully decomposed, sub-second, no-LLM per-layer harness for a production agent, (b) a coverage-honesty test-adequacy criterion that refuses to score an unexercised layer, and (c) the regression-injection demonstration that per-slice baseline-locked gates localize regressions an aggregate metric masks.
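The masking effect and the per-slice baseline gate described above can be sketched as follows. The slice names, case counts, tolerance, and function names are hypothetical; the sketch only illustrates why a locked per-slice baseline localizes a regression that the aggregate pass rate dilutes.

```python
def gate(baseline, current, tolerance_pp=0.0):
    """Return {slice: drop in percentage points} for slices below the locked baseline.

    Raises if a baselined slice was not exercised, rather than scoring it.
    """
    regressions = {}
    for name, base_rate in baseline.items():
        if name not in current:
            raise ValueError(f"slice {name!r} was not exercised; refusing to score it")
        drop_pp = (base_rate - current[name]) * 100
        if drop_pp > tolerance_pp:
            regressions[name] = round(drop_pp, 1)
    return regressions

# Hypothetical slices with a locked baseline of 100% each.
cases = {"intent": 40, "routing": 10, "safety": 30, "memory": 20}
baseline = {name: 1.0 for name in cases}

# Inject a regression into one layer: 8 of 10 routing cases now fail.
passed_after = {"intent": 40, "routing": 2, "safety": 30, "memory": 20}
current = {name: passed_after[name] / cases[name] for name in cases}

aggregate_drop_pp = round(
    (1 - sum(passed_after.values()) / sum(cases.values())) * 100, 1
)
regressions = gate(baseline, current)

print(aggregate_drop_pp)  # aggregate pass rate moves by single digits
print(regressions)        # the matching slice carries the whole drop
```

In this toy run the aggregate pass rate drops 8 pp while the routing slice drops 80 pp, the same shape as the masking result reported in the abstract. The `ValueError` branch is a rough analogue of a coverage-honesty rule that refuses to score an unexercised layer.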

2606.11702 2026-06-11 cs.CV cs.AI cs.CL 交叉投稿

MedCTA: A Benchmark for Clinical Tool Agents

MedCTA: 临床工具智能体基准

Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

发表机构 * King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学) Massachusetts Institute of Technology (MIT)(麻省理工学院)

AI总结 提出MedCTA基准,基于放射影像、病理切片和报告等真实临床多模态输入,评估医疗AI智能体在工具检索、证据获取和集成方面的规划与执行能力。

详情
Comments
Project Page: this https URL Code: this https URL Data: this https URL
AI中文摘要

为了做出临床合理的决策,医疗AI智能体需要超越简单的识别,具备工具检索、证据获取和集成能力。现有基准主要评估孤立的感知或单轮问答,因此对规划、工具调用和部署可靠性的失败可见性有限。我们提出了MedCTA,一个用于评估医疗工具智能体的基准,基于临床验证的、步骤隐含的任务,这些任务基于真实的多模态临床输入,包括放射影像、病理切片和报告。MedCTA包含107个真实临床任务,具有临床医生验证的、在5个部署工具上的可执行轨迹,并支持对工具选择、参数有效性、执行稳定性、轨迹保真度和结果质量的过程感知评估。我们对18个开源和闭源多模态模型进行了基准测试,发现即使是最先进的系统在多步骤临床工具使用中仍然脆弱:自主部署主要由协议失败、过早停止和错误工具调用主导,而黄金标准工具路由带来了巨大但仍不完整的改进。这些结果表明,强大的骨干感知能力并不能转化为临床环境中可靠的智能体行为。MedCTA为审计、诊断和推进可信赖的医疗AI智能体提供了一个严格的测试平台。数据集和评估套件可在该https URL获取。

英文摘要

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at this https URL

2606.11739 2026-06-11 cs.CV cs.AI 交叉投稿

Multi-View In-Cabin Monitoring System for Public Transport Vehicles

公共交通车辆的多视角座舱内监控系统

Evgeny Gorelik, Kenny Dean Karrow, Fikret Sivrikaya, Sahin Albayrak, Christian Baumann

发表机构 * Technische Universität Berlin(柏林工业大学) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心)

AI总结 提出一个多视角座舱内监控数据集,包含同步RGB-D图像和LiDAR数据,并提供3D人体姿态和边界框标注,支持多视角3D检测模型评估。

详情
Comments
Submitted to ICDM2026
AI中文摘要

我们介绍了一个用于公共交通的多视角座舱内监控数据集,包含来自四个朝内摄像头和覆盖数字化的、部分自动化的德国城市公交车内部空间的旋转LiDAR的同步RGB和深度图像。该数据集包含9,136个同步样本及其标注,并附带一个校准和伪标签流程,可生成乘客的3D人体姿态估计和定向3D边界框。我们还提供了nuScenes格式转换,并基准测试了代表性的多视角3D检测模型(例如Lift-Splat-Shoot和BEVFusion),支持多视角座舱内感知模型的比较评估和小规模训练。该数据集和工具可在以下网址获取:此https URL。

英文摘要

We introduce a multi-view in-cabin monitoring dataset for public transportation with synchronized RGB and depth images from four inward-facing cameras and a rotating LiDAR covering the vehicle interior of a digitalized and partly automated German city bus. The dataset contains 9,136 synchronized samples with annotations and is accompanied by a calibration and pseudo-labeling pipeline that generates 3D human pose estimates and oriented 3D bounding boxes for occupants. We further provide a nuScenes-format conversion and benchmark representative multi-view 3D detection models (e.g., Lift-Splat-Shoot and BEVFusion), supporting comparative evaluation and small-scale training of multi-view in-cabin perception models. The dataset and tools are available at this https URL.

2606.11762 2026-06-11 cs.CL cs.AI 交叉投稿

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

语言模型在开放式任务中的自动化创造力评估

Min Sen Tan, Zachary Kit Chun Choy, Syed Ali Redha Alsagoff, Nadya Yuki Wangsajaya, Mohor Banerjee, Swaagat Bikash Saikia, Alvin Chan

发表机构 * Raffles Institution(莱佛士书院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) Lee Kong Chian School of Medicine, Nanyang Technological University(南洋理工大学李光前医学院) Centre of AI in Medicine (C-AIM), Nanyang Technological University(南洋理工大学人工智能医学中心)

AI总结 提出一种领域无关的自动化框架,通过语义熵和检索式多智能体评估,量化LLM在开放式任务中的发散与收敛创造力,并在问题解决、研究构思和创意写作三个领域验证其有效性。

详情
Comments
Accepted to ACL 2026 (Main Conference). 35 pages, 16 figures. Code: this https URL
AI中文摘要

大型语言模型(LLMs)在语言理解、推理和生成方面取得了显著进展,激发了对其创造潜力的日益关注。实现这一潜力需要系统化和可扩展的方法来评估跨不同任务的创造力。然而,大多数现有的创造力指标与特定任务紧密耦合,将领域假设嵌入评估过程,限制了可扩展性和通用性。为解决这一差距,我们引入了一个自动化、领域无关的框架,用于量化LLM在开放式任务中的创造力。我们的方法将测量装置与创造性任务本身分离,实现了可扩展、任务无关的评估。发散创造力通过语义熵(一种无参考且稳健的新颖性和多样性指标)进行测量,并针对人类注释、基于LLM的新颖性判断和基线多样性度量进行了验证。收敛创造力通过一种新颖的基于检索的多智能体评判框架进行评估,该框架提供上下文敏感的任务完成评估,效率提升超过60%。我们在三个性质不同的领域验证了我们的框架:问题解决(MacGyver)、研究构思(HypoGen)和创意写作(BookMIA),使用了广泛的LLM套件。实证结果表明,我们的框架可靠地捕捉了创造力的关键方面,包括新颖性、多样性和任务完成,并揭示了模型属性(如大小、温度、时效性和推理)如何影响创造性表现。我们的工作为自动化的LLM创造力评估建立了可重复和可泛化的标准,为可扩展的基准测试铺平了道路,并加速了创造性AI的进展。

英文摘要

Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing interest in their creative potential. Realizing this potential requires systematic and scalable methods for evaluating creativity across diverse tasks. However, most existing creativity metrics are tightly coupled to specific tasks, embedding domain assumptions into the evaluation process, and limiting scalability and generality. To address this gap, we introduce an automated, domain-agnostic framework for quantifying LLM creativity across open-ended tasks. Our approach separates the measurement apparatus from the creative task itself, enabling scalable, task-agnostic assessment. Divergent creativity is measured using semantic entropy, a reference-free and robust metric for novelty and diversity, validated against human annotations, LLM-based novelty judgments and baseline diversity measures. Convergent creativity is assessed via a novel retrieval-based multi-agent judge framework that delivers context-sensitive evaluation of task fulfilment with over 60% improved efficiency. We validate our framework in three qualitatively distinct domains: problem-solving (MacGyver), research ideation (HypoGen), and creative writing (BookMIA), using a broad suite of LLMs. Empirical results show that our framework reliably captures key facets of creativity, including novelty, diversity, and task fulfilment, and reveal how model properties, such as size, temperature, recency, and reasoning, impact creative performance. Our work establishes a reproducible and generalizable standard for automated LLM creativity evaluation, paving the way for scalable benchmarking and accelerating progress in creative AI.
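Semantic entropy, as used above for divergent creativity, is the entropy of the distribution of responses over meaning clusters. The sketch below assumes a caller-supplied `same_meaning` predicate; the word-overlap rule used here is only a toy stand-in (published semantic-entropy work typically clusters with an entailment model), and all names are hypothetical.

```python
import math

def cluster_by_meaning(responses, same_meaning):
    """Greedy clustering: join the first cluster whose representative matches."""
    clusters = []
    for response in responses:
        for cluster in clusters:
            if same_meaning(cluster[0], response):
                cluster.append(response)
                break
        else:
            clusters.append([response])
    return clusters

def semantic_entropy(responses, same_meaning):
    """Shannon entropy (in nats) of the cluster-size distribution."""
    clusters = cluster_by_meaning(responses, same_meaning)
    n = len(responses)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

def toy_same_meaning(a, b, threshold=0.5):
    """Toy equivalence: Jaccard overlap of lowercased word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) >= threshold

repetitive = [
    "use a paperclip as a hook",
    "use a paperclip as hook",
    "Use a paperclip as a hook",
    "use the paperclip as a hook",
]
diverse = [
    "use a paperclip as a hook",
    "melt ice to get water",
    "fold the map into a funnel",
    "tie shoelaces together",
]

h_repetitive = semantic_entropy(repetitive, toy_same_meaning)
h_diverse = semantic_entropy(diverse, toy_same_meaning)
```

Paraphrases collapse into one cluster and score zero, while four distinct ideas reach the maximum for four samples (log 4), so the measure rewards variety of meaning rather than variety of wording.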

2606.11816 2026-06-11 cs.CL cs.AI 交叉投稿

WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning

WorldReasoner: 评估语言模型代理是否通过有效推理预测事件

Yizhou Chi, Eric Chamoun, Zifeng Ding, Andreas Vlachos

发表机构 * Department of Computer Science and Technology, University of Cambridge(剑桥大学计算机科学与技术系)

AI总结 提出WorldReasoner框架,通过时间有效检索、证据质量和因果图推理三个维度评估语言模型代理的事件预测能力,发现时间有效检索是结果准确性的最强驱动因素。

详情
AI中文摘要

预测现实世界事件要求语言模型代理在不完整、时间有限的信息下进行不确定性推理。然而,评估代理是否真正进行预测需要的不仅仅是最终答案的准确性:模型可能通过回忆记忆中的训练事实、引用捏造的证据或产生无根据的因果故事而正确。我们提出WorldReasoner,一个用于时间有效事件预测的评估框架。每个任务向代理提供一个已解决的预测问题、一个模拟的预测日期,并且只能访问该日期之前可用的证据;在问题解决后,该框架对提交的概率、引用的证据和可选的因果事件图进行评分。WorldReasoner报告三个互补的轴:针对已解决答案的结果质量、针对引用来源的证据质量,以及针对解决后事后图的推理质量。该基准测试由一个代理构建管道构建,该管道生成预测问题、收集时间戳证据并大规模构建事后参考图,最终产生345个已解决的任务,这些任务源自14,141篇文章,其图覆盖8,087个提取的事件。在六种受控代理设置中,时间有效检索是结果准确性的最强驱动因素;因果图构建提高了关键事件的恢复;并且正确的图支持预测更牢固地基于关键事件和相关来源,但代理仍然难以将基于证据的推理转化为校准的概率。

英文摘要

Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet evaluating whether agents genuinely forecast requires more than final-answer accuracy: a model may be correct by recalling memorized training facts, citing fabricated evidence, or producing an unsupported causal story. We present WorldReasoner, an evaluation framework for temporally valid event forecasting. Each task gives an agent a resolved forecasting question, a simulated forecast date, and access only to evidence available before that date; after resolution, the framework scores the submitted probability, cited evidence, and optional causal event graph. WorldReasoner reports three complementary axes: outcome quality against resolved answers, evidence quality over cited sources, and reasoning quality against post-resolution hindsight graphs. The benchmark is built by an agentic construction pipeline that generates forecasting questions, collects time-stamped evidence, and builds hindsight reference graphs at scale, yielding 345 resolved tasks derived from 14,141 articles with graphs covering 8,087 extracted events. Across six controlled agent settings, temporally valid retrieval is the strongest driver of outcome accuracy; causal graph construction improves key-event recovery; and correct graph-enabled forecasts are more strongly grounded in key events and relevant sources, yet agents still struggle to convert grounded evidence into calibrated probabilities.
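The task setup (evidence restricted to what was available before a simulated forecast date, then scoring the submitted probability after resolution) can be sketched in a few lines. The record layout and function names are hypothetical, and the Brier score is used only as a familiar example of a probability score; the abstract does not say which scoring rule the framework uses.

```python
from datetime import date

def temporally_valid(evidence, forecast_date):
    """Keep only evidence published strictly before the simulated forecast date."""
    return [item for item in evidence if item["published"] < forecast_date]

def brier_score(probability, outcome):
    """Squared error between a forecast probability and a 0/1 outcome."""
    return (probability - outcome) ** 2

evidence = [
    {"id": "a1", "published": date(2025, 3, 1)},
    {"id": "a2", "published": date(2025, 5, 20)},
    {"id": "a3", "published": date(2025, 6, 2)},  # after the forecast date: leaks the future
]
forecast_date = date(2025, 6, 1)

usable = temporally_valid(evidence, forecast_date)
usable_ids = [item["id"] for item in usable]

# A confident, correct forecast scores near 0; a confident, wrong one near 1.
score_good = brier_score(0.8, 1)
score_bad = brier_score(0.8, 0)
```

Filtering before retrieval, rather than trusting the agent to ignore later articles, is what makes a correct answer attributable to forecasting instead of hindsight.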

2606.11889 2026-06-11 cs.CV cs.AI cs.RO 交叉投稿

Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection

面向自动驾驶危险检测的视觉-语言模型任务对齐稳定性分析

Everett Richards

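AI summary: Studies whether corruption-induced embedding drift in vision-language models predicts changes in a task-aligned hazard score for autonomous-driving hazard detection. Different image-corruption families produce different failure modes, so the paper recommends that robustness benchmarks include task-aligned stability measures.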

详情
Comments
8 pages (5 main body + 3 references / appendices). ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)
AI中文摘要

视觉-语言模型(VLM)越来越多地用于自动驾驶中的场景理解,但鲁棒性分析通常仅依赖于任务无关的嵌入稳定性。我们研究腐败引起的嵌入漂移是否能预测基于CLIP图像-文本相似性的任务对齐危险分数的变化。通过在BDD100K道路场景上使用受控腐败,我们将嵌入漂移与边际漂移(定义为扰动下危险分数的变化)进行比较。这种关系高度依赖于腐败类型:某些家族表现出表示漂移与决策漂移之间的强耦合,而其他家族则在嵌入变化相对较小的情况下引发危险的决策不稳定性。此外,腐败家族在失效方向上有所不同:大多数通过假阴性抑制危险检测,而遮挡则触发假警报,这表明基准设计应考虑不对称的失效模式,而不仅仅是整体不稳定性率。这些结果表明,鲁棒性基准应包含任务对齐的稳定性指标,而不仅仅是嵌入级别的扰动统计。

英文摘要

Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on task-agnostic embedding stability alone. We study whether corruption-induced embedding drift predicts changes in a task-aligned hazard score derived from CLIP image-text similarities. Using controlled corruptions on BDD100K road scenes, we compare embedding drift against margin drift, defined as the change in hazard score under perturbation. The relationship is highly corruption-dependent: some families exhibit strong coupling between representation drift and decision drift, while others induce hazardous decision instability despite relatively modest embedding change. Furthermore, corruption families differ in failure direction: most suppress hazard detections via false negatives, while occlusion instead triggers false alarms, suggesting that benchmark design should account for asymmetric failure modes, not just overall instability rates. These results suggest that robustness benchmarks should include task-aligned stability measures in addition to embedding-level perturbation statistics.
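The two quantities compared above can be written down directly. This sketch assumes the hazard score is the difference between an image's cosine similarity to a "hazard" text embedding and to a "safe" text embedding; the abstract only says the score is derived from CLIP image-text similarities, so that form, the function names, and the toy vectors are assumptions.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hazard_score(image_emb, hazard_text_emb, safe_text_emb):
    """Assumed task-aligned score: similarity to 'hazard' minus similarity to 'safe'."""
    return cosine(image_emb, hazard_text_emb) - cosine(image_emb, safe_text_emb)

def embedding_drift(clean_emb, corrupted_emb):
    """Task-agnostic drift: cosine distance between clean and corrupted embeddings."""
    return 1.0 - cosine(clean_emb, corrupted_emb)

def margin_drift(clean_emb, corrupted_emb, hazard_text_emb, safe_text_emb):
    """Task-aligned drift: change in hazard score under the perturbation."""
    return (hazard_score(corrupted_emb, hazard_text_emb, safe_text_emb)
            - hazard_score(clean_emb, hazard_text_emb, safe_text_emb))

# Toy embeddings standing in for CLIP outputs.
hazard_text = np.array([1.0, 0.0, 0.0])
safe_text = np.array([0.0, 1.0, 0.0])
clean = np.array([0.55, 0.45, 0.70])      # slightly closer to "hazard"
corrupted = np.array([0.45, 0.55, 0.70])  # slightly closer to "safe"

drift = embedding_drift(clean, corrupted)
delta = margin_drift(clean, corrupted, hazard_text, safe_text)
score_clean = hazard_score(clean, hazard_text, safe_text)
score_corrupted = hazard_score(corrupted, hazard_text, safe_text)
```

Here the embedding moves by about 0.01 in cosine distance, yet the hazard score changes sign, a toy version of the case the abstract flags: decision instability despite modest embedding change.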

2606.11901 2026-06-11 cs.RO cs.AI 交叉投稿

DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

DuoBench: 一个可复现的双手操作基准,涵盖仿真与现实世界

Tobias Jülg, Seongjin Bien, Simon Hilber, Yannik Blei, Pierre Krack, Maximilian Li, Sven Parusel, Rudolf Lioutikov, Florian Walter, Wolfram Burgard

发表机构 * University of Technology Nuremberg(纽伦堡工业大学) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Franka Robotics Technical University of Munich(慕尼黑工业大学)

AI总结 提出DuoBench,一个基于FR3 Duo平台的双手操作基准框架,包含11个任务和阶段式评估方案,用于诊断当前策略在双手协调、仿真到现实迁移等方面的失败模式。

详情
AI中文摘要

双手机器人系统极大地扩展了操作能力,但协调两只手臂引入了额外的控制复杂性和故障模式,现有基准未能很好地捕捉这些。我们介绍了DuoBench,一个针对FR3 Duo平台上的双手操作策略的可扩展基准框架。DuoBench包含跨越四个协调类别的十一个任务,在仿真中实现,并通过可复现的任务配方和3D打印资产部分地在现实世界中复现。此外,我们提出了一种基于阶段的评估方案,支持超出二元成功之外的细粒度语义故障分析,并为所有基准任务提供人类遥操作数据集。我们在仿真和真实硬件上对几种双臂模仿学习和视觉-语言-动作策略进行了基准测试。我们的结果表明,当前策略在双手操作中仍然面临挑战,特别是在早期交互阶段、并行手臂执行以及仿真与现实环境之间的迁移方面。DuoBench为诊断这些故障模式和研究未来的双臂策略学习方法提供了一个可复现的测试平台。代码、数据集和视频可在该https URL获取。

英文摘要

Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity and failure modes that are not well captured by existing benchmarks. We introduce DuoBench, an extensible benchmarking framework for bimanual manipulation policies on the FR3 Duo platform. DuoBench comprises eleven tasks spanning four coordination categories, implemented in simulation and partially reproduced in the real world through reproducible task recipes with 3D-printable assets. In addition, we propose a stage-based evaluation scheme that supports fine-grained semantic failure analysis beyond binary success and provide human-teleoperated datasets for all benchmark tasks. We benchmark several dual-arm imitation-learning and vision-language-action policies in simulation and on real hardware. Our results show that current policies remain challenged by bimanual manipulation, particularly in early interaction stages, parallel arm execution, and transfer between simulation and real-world settings. DuoBench provides a reproducible testbed for diagnosing these failure modes and studying future methods for dual-arm policy learning. Code, datasets, and videos are available at this https URL

2606.12071 2026-06-11 cs.DL cs.AI 交叉投稿

On the Limits of LLM-as-Judge for Scientific Novelty Assessment

论LLM作为评审在科学新颖性评估中的局限性

Soumitra Sinhahajari, Navonil Majumder, Soujanya Poria

AI总结 本文通过构建RQ-Bench基准,发现LLM评审对模型生成的研究问题产生新颖性幻觉,而人类专家则持相反意见,揭示了LLM在评估科学新颖性时的可靠性问题。

详情
AI中文摘要

LLM越来越多地被用于生成和评判科学想法。这使得新颖性评估成为一个核心问题。完整想法的评估很困难,因为它通常需要判断方法、可行性及其经验前景。因此,我们研究一个更清晰的上游对象:研究问题(RQ)。RQ生成是科学构思的前提,并且RQ可以与真实论文中探讨的问题进行比较。我们引入了RQ-Bench,一个基于近期arXiv论文构建的基准。对于每篇论文,我们从其引用的背景、空白和贡献中重建作者锚定的RQ。这些RQ并非针对同一背景的唯一有效问题。它们是用于测试新颖性判断的作者锚定参考点。我们使用独立LLM评审、比较LLM评审和人类专家评估来评估模型生成的RQ。LLM评审一致地将模型生成的RQ评为高度新颖,产生新颖性幻觉;在比较评估中,这种偏好甚至更强。然而,领域专家得出相反结论,更偏好作者锚定的参考问题。我们进一步发现,许多生成的RQ狭窄或受限于来源,这是LLM评审通常忽略的维度,除非明确测试。总体而言,LLM评审与人类专家之间矛盾的新颖性评估引发了关于使用LLM评估研究问题科学新颖性可靠性的严重担忧。

英文摘要

LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is difficult because it often requires judging a method, its feasibility, and its empirical promise. We therefore study a cleaner upstream object: the research question (RQ). RQ generation is a prerequisite for scientific ideation, and RQs can be compared against questions pursued in real papers. We introduce RQ-Bench, a benchmark built from recent arXiv papers. For each paper, we reconstruct author-anchored RQs from its cited background, gaps, and contributions. These RQs are not the only valid questions for the same background. They are author-anchored reference points for testing novelty judgments. We evaluate model-generated RQs with standalone LLM judging, comparative LLM judging, and human expert evaluation. LLM judges consistently rate model-generated RQs as highly novel, producing a novelty mirage; in comparative evaluations, this preference becomes even stronger. Domain experts, however, reach the opposite conclusion and prefer the author-anchored reference questions. We further find that many generated RQs are narrow or source-bound, a dimension that LLM judges often miss unless explicitly tested. Overall, the contradictory novelty evaluations between LLM judges and human experts raise a serious concern about the reliability of using LLMs to assess the scientific novelty of research questions.

2606.12117 2026-06-11 cs.CL cs.AI 交叉投稿

Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation

软提示调优用于公平且高效的LLM基准评估

Selen Erkan, Bastian Boll, Kristian Kersting, Björn Deiseroth, Letitia Parcalabescu

发表机构 * Aleph Alpha Research Lab(Aleph Alpha 研究实验室) TU Darmstadt(达姆施塔特工业大学) Hessian.AI(黑森人工智能中心)

AI总结 提出软提示调优方法,通过优化少量软提示向量使基础模型适应基准格式,公平评估其真实知识,效率高且无需完整后训练。

详情
Comments
10 pages, 4 figures
AI中文摘要

基准分数常常错误地反映大型语言模型(LLM)的知识,因为它们依赖于模型遵循特定格式要求的能力等。这尤其惩罚了那些可能知道正确答案但缺乏按照指示结构化答案能力的基础模型——这种能力通常在后训练中引入。为了克服这一点,我们提出了软提示调优,一种高效、公平且架构无关的模型评估方法。通过在短时间调优内仅优化10个软提示向量(对于7B模型大约占参数的0.0006%),我们使模型适应特定的基准格式,缩小格式遵循方面的差距,确保底层知识准确地反映在基准分数中。这使得人们可以在基准上公平比较不同基础模型(使用各种预训练配方训练),而无需完整的后训练。我们在7个模型和7个数据集上评估了软提示调优。结果表明:(a) 软提示调优在80步(约640个样本)内使格式遵循饱和,因此非常高效;(b) 软提示调优显著优于零样本和少样本提示,揭示了标准提示遗漏的基础模型知识;(c) 即使后训练模型也可以从软提示中受益以最大化格式遵从性;(d) 软提示的基础模型性能比零样本和少样本基线更可靠地预测后训练模型的排名,为下游模型质量提供了低成本的代理。我们的贡献包括:(1) 解耦格式遵循和知识准确性的度量标准;(2) 更公平的LLM知识基准测试协议;(3) 一种成本效益高且内存有效的方案,用于在LLM开发早期识别最优预训练策略。

英文摘要

Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow specific formatting requirements. This especially penalizes base models that may know the correct answers but lack the ability -- typically introduced in post-training -- to structure them as instructed. To overcome this, we propose soft-prompt tuning, an efficient, fair, and architecture-agnostic model evaluation method. By optimizing only 10 soft-prompt vectors (roughly 0.0006% of parameters for a 7B model) over a short tuning period, we adapt models to specific benchmark formats, closing gaps in format-following and ensuring that underlying knowledge is accurately reflected in benchmark scores. This allows one to fairly compare different base models -- trained with various pre-training recipes -- on benchmarks without the need for full post-training. We evaluated soft-prompt tuning across 7 models and 7 datasets. The results show that (a) soft-prompt tuning saturates format-following within 80 steps (~640 samples), making it highly efficient, (b) soft-prompt tuning significantly outperforms zero- and few-shot prompting, surfacing base model knowledge that standard prompting misses, (c) even post-trained models can benefit from soft prompts to maximize format compliance, and (d) soft-prompted base model performance predicts post-trained model rankings more reliably than zero- and few-shot baselines, offering a low-cost proxy for downstream model quality. Our contributions include (1) metrics which disentangle format-following and knowledge accuracy, (2) a fairer benchmarking protocol for LLM knowledge, and (3) a cost- and memory-effective recipe to identify optimal pre-training strategies early in LLM development.
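The parameter and sample counts quoted above can be checked with simple arithmetic. The hidden size of 4096 is an assumption (a common value for 7B-parameter models, not stated in the abstract); the per-step batch size is derived from the quoted steps and samples.

```python
N_SOFT_PROMPT_VECTORS = 10
HIDDEN_SIZE = 4096          # assumption: typical hidden size for a 7B model
MODEL_PARAMS = 7_000_000_000

# Each soft-prompt vector has one trainable value per hidden dimension.
trainable_params = N_SOFT_PROMPT_VECTORS * HIDDEN_SIZE
trainable_percent = 100 * trainable_params / MODEL_PARAMS

STEPS = 80
SAMPLES = 640
samples_per_step = SAMPLES // STEPS

print(trainable_params)             # trainable values in the soft prompt
print(round(trainable_percent, 4))  # percent of the 7B model's parameters
print(samples_per_step)             # implied batch size
```

Under this assumption the soft prompt has 40,960 trainable values, about 0.0006% of the model, and the quoted 80 steps over roughly 640 samples imply a batch size of 8.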

2606.12169 2026-06-11 cs.CV cs.AI cs.CL cs.LG 交叉投稿

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

OpenMedReason: 医学视觉语言模型的科学推理监督

Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, Leonid Sigal, Ali Etemad, Elham Dolatabadi

发表机构 * York University(约克大学) Vector Institute(向量研究所) University of British Columbia(不列颠哥伦比亚大学) University of Toronto(多伦多大学) Unity Health Toronto / St. Michael’s Hospital(多伦多联合健康/圣迈克尔医院) University Health Network(大学健康网络) Arc Institute(弧研究所) Queen's University(女王大学)

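AI summary: Proposes OpenMedReason, a large-scale open medical reasoning corpus of about 450K image-question-answer instances whose reasoning traces are drawn mainly from biomedical scientific articles, together with the companion benchmark OpenMedReason-Bench for fine-grained evaluation; the corpus improves model performance under both supervised fine-tuning and reinforcement-based alignment.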

Comments
42 pages, 9 figures, 24 tables. Dataset and code: this https URL
AI中文摘要

高风险临床使用大型视觉语言模型(LVLMs)需要基于视觉证据和临床知识的推理,而不仅仅是正确的最终答案。我们引入了OpenMedReason,这是一个大规模、开放的多模态医学推理语料库,包含约45万图像-问题-答案实例,其推理轨迹主要来自策划的生物医学、人类撰写的科学文章。OpenMedReason提供了超越合成思维链的高保真监督,涵盖了多种医学领域视觉模态,如放射学扫描、显微图像、可见光照片、图表等。我们辅以OpenMedReason-Bench,这是一个留出基准,允许沿三个互补的能力轴(包括感知、医学知识和推理)对LVLMs进行细粒度评估,从而实现超越最终答案准确性的诊断性评估。OpenMedReason是一个丰富的训练资源,在监督微调(SFT)和基于强化的对齐中均显示出有效性。使用OpenMedReason进行训练,在VQA准确率上比基础模型平均提高20%,并且性能达到最强可比规模医学LVLMs的4.2%以内。细粒度性能分析证实,增益并非集中在单一轴上:OpenMedReason共同提升了感知、医学知识和推理,并且在86.1%的成对比较中,其推理轨迹优于基础模型。我们在以下网址发布代码和数据集:此 http URL。

英文摘要

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated, human-authored biomedical scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse medical visual modalities such as radiological scans, microscopic images, visible-light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary capability axes (perception, medical knowledge, and rationale), enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that proves effective in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at this http URL.

2606.12207 2026-06-11 cs.RO cs.AI 交叉投稿

Intelligent Automation for Embodied Benchmark Construction: Pipelines, Embodiments, Simulators, and Trends

具身基准构建的智能自动化:流程、具身、模拟器与趋势

Jinshan Lai, Jianwei Hu, Baoyang Jiang, Fengchun Zhang, Leyuan Wang, Haotian Li, Yida Wang, Tingxuan Huang, Xi Ren, Qiang Ma

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Qiyuan Lab(启元实验室) Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学) Beihang University(北京航空航天大学)

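AI summary: This survey reviews a five-stage pipeline for constructing embodied-intelligence benchmarks, analyzes the shift from manual curation to automation and then to agentic closed-loop workflows, and notes that automation shifts cost toward validation and governance.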

AI中文摘要

具身智能现已涵盖导航、家务辅助、操作、自动驾驶、空中智能体及多模态大模型控制。这一扩展使得基准构建成为可靠评估的核心瓶颈。与静态数据集不同,具身基准将任务规范、环境、机器人数据、演示、标注、指标、评估脚本和发布策略整合为一个评估系统。本综述通过五阶段构建流程回顾文献:需求与任务构建、数据获取、数据清洗与标注、基准套件生成与指标定义、评估执行与诊断反馈。针对每个阶段,分析从人工管理到传统自动化、基础模型辅助以及智能体闭环工作流的转变。同时比较了人工、数据与资产获取、计算与仿真、验证与调试、治理与维护以及返工风险等定性构建成本。主要结论是:自动化并非简单降低基准成本,而是往往将成本转向验证、可审计性、版本控制和长期治理。因此,具身评估的进展不仅取决于更大的基准套件,还取决于可诊断、可审计且可负责任地更新的构建流程。

英文摘要

Embodied intelligence now spans navigation, household assistance, manipulation, autonomous driving, aerial agents, and multimodal large-model control. This expansion has made benchmark construction a central bottleneck for reliable evaluation. Unlike static datasets, embodied benchmarks combine task specifications, environments, robot data, demonstrations, annotations, metrics, evaluation scripts, and release policies into a single evaluation system. This survey reviews the literature through a five-stage construction pipeline: requirement and task construction, data acquisition, data cleaning and annotation, benchmark suite generation and metric definition, and evaluation execution with diagnostic feedback. For each stage, the survey analyzes the transition from manual curation to traditional automation, foundation-model assistance, and agentic closed-loop workflows. It also compares qualitative construction costs across human labor, data and asset acquisition, compute and simulation, validation and debugging, governance and maintenance, and rework risk. The main conclusion is that automation does not simply reduce benchmark cost. Instead, it often shifts cost toward validation, auditability, version control, and long-term governance. Progress in embodied evaluation will therefore depend not only on larger benchmark suites, but also on construction pipelines that are diagnosable, auditable, and responsibly refreshable.

2606.12300 2026-06-11 cs.CV cs.AI 交叉投稿

Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

自然语言在小时级视频中的时间定位是一个搜索问题:基准与经验分解

Sukmin Seo, Geewook Kim

发表机构 * NAVER Cloud AI KAIST AI(韩国科学技术院人工智能系)

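AI summary: For natural-language temporal grounding in hour-long videos, argues that search rather than recognition is the main bottleneck, releases ExtremeWhenBench, the first open hour-scale grounding benchmark, and shows that a retrieve-then-ground hybrid substantially improves performance.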

Comments
10 pages, 6 figures, Code and benchmark: this https URL
AI中文摘要

时间定位——根据自然语言查询返回视频中的区间$[t_s, t_e]$——是长视频的语言接口,但此前仅在短视频上研究;小时级自然语言定位的动态仍未充分探索。我们认为,在小时级尺度上,限制因素是搜索而非识别:视频-LLM的瓶颈不在于定位附近的事件,而在于根据自然语言查询搜索长视频的相关区域。为验证这一点,我们发布了ExtremeWhenBench,首个开放的小时级定位基准(194个视频上的2273个查询,平均时长75.7分钟,最长9小时),具有开放式查询分布。所有开放视频-LLM均表现不佳,而帧级检索基线优于它们;失败分类将85%的失败归因于搜索;检索-定位混合方法比单一视频-LLM提升了6.7倍——类似于开放域QA中的检索-读取模式。

英文摘要

Temporal grounding--returning the interval $[t_s, t_e]$ for a natural-language query over a video--is the language interface to long-form video, yet has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour-scale, the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing a nearby event, but--given a natural-language query--by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM--mirroring retrieve-then-read in open-domain QA.
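
To make the retrieve-then-ground idea concrete, here is a minimal sketch of the two pieces such a pipeline needs: a coarse search step that narrows an hour-long video to a candidate window, and a score for a predicted interval $[t_s, t_e]$. Everything in it is an assumption for illustration: `retrieve_window` is a hypothetical helper, the per-frame scores stand in for whatever retrieval model is used, and temporal IoU is a common scoring choice in temporal grounding that the abstract itself does not name.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (t_s, t_e) in seconds


def retrieve_window(frame_scores: Sequence[float],
                    seconds_per_frame: float,
                    half_width: float) -> Interval:
    """Coarse search: centre a window on the best-scoring sampled frame."""
    best = max(range(len(frame_scores)), key=frame_scores.__getitem__)
    center = best * seconds_per_frame
    duration = len(frame_scores) * seconds_per_frame
    # Clip the window to the video so a grounder only sees valid timestamps.
    return (max(0.0, center - half_width), min(duration, center + half_width))


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


# Toy example: four frames sampled one minute apart, query matches frame 2.
window = retrieve_window([0.1, 0.2, 0.9, 0.3], seconds_per_frame=60, half_width=90)
print(window)
print(temporal_iou(window, (100.0, 160.0)))
```

In a full system the retrieved window would be passed to a Video-LLM for fine localization; the sketch stops at the search step because that is where the abstract places the bottleneck.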

2606.12392 2026-06-11 cs.CL cs.AI 交叉投稿

System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5

CCL25-Eval 任务5系统报告:新数据集与LoRA微调Qwen2.5

Haotao Xie

发表机构 * The Hangzhou International Innovation Institute Beihang University(北京航空航天大学杭州国际创新研究院)

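AI summary: For classical poetry translation and emotional understanding, builds the high-quality instruction dataset CCPoetry-49K and fine-tunes Qwen2.5-14B with LoRA to obtain PoetryQwen, which scores 0.757 on CCL25-Eval Task 5, a 9.7% improvement over the baseline.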

AI中文摘要

近年来,大语言模型(LLMs)在古典汉语翻译和古典诗歌生成领域取得了令人瞩目的进展。然而,针对古典诗歌精确翻译和情感语义理解的领域特定研究仍然有限。主要挑战在于大多数研究将诗歌鉴赏任务视为通用领域问题,忽略了诗歌鉴赏的独特特征,同时高质量且领域特定的数据集极为稀缺。为解决这一局限,我们将任务分解为三个子任务:术语解释、语义解释和情感推理。基于多个开源数据集,我们进行数据清洗和对齐,构建了古典诗歌指令对数据集(CCPoetry-49K),包含49,404个高质量指令-响应对,专门针对该领域进行了优化。随后,我们提出领域专用LLM,称为PoetryQwen,通过应用低秩适配(LoRA)微调Qwen2.5-14B模型。在CCL25-Eval任务5基准上的实验结果表明,PoetryQwen得分为0.757,较Qwen2.5-14B-Instruct基线(0.690)提升9.7%。这些发现明确表明,PoetryQwen在古典诗歌的精确翻译和情感理解方面显著提升了性能。我们提供了新数据集和方法论考虑,旨在支持LLMs的领域特定优化。

英文摘要

Recently, large language models (LLMs) have achieved promising progress in the fields of classical Chinese translation and the generation of classical poetry. However, domain-specific research on precise translation and affective-semantic understanding of classical poetry remains limited. The main challenge is that most studies treat the poetic appreciation task as a general-domain problem, neglecting the distinctive features of poetic appreciation, while high-quality, domain-specific datasets are extremely scarce. To address this limitation, we decompose the task into three subtasks: term interpretation, semantic interpretation, and emotional inference. Based on multiple open-source datasets, we perform data cleansing and alignment to construct the Classical Chinese Poetry Instruction Pair Dataset (CCPoetry-49K), which comprises 49,404 high-quality instruction-response pairs explicitly optimized for this domain. We then propose a domain-specialized LLM, called PoetryQwen, by applying Low-Rank Adaptation (LoRA) to fine-tune the Qwen2.5-14B model. Experimental results on the CCL25-Eval Task 5 benchmark demonstrate that PoetryQwen achieves a score of 0.757, representing a 9.7% improvement over the Qwen2.5-14B-Instruct baseline (0.690). These findings clearly indicate that PoetryQwen significantly enhances performance in precise translation and emotional understanding of classical poetry. We present a new dataset and methodological considerations intended to support the domain-specific optimization of LLMs.
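
The reported 9.7% figure is a relative improvement over the baseline score, not a difference in score points. A quick check, using only the two scores quoted in the abstract (the function name is illustrative):

```python
def relative_improvement(new_score: float, baseline: float) -> float:
    """Improvement of new_score over baseline, as a percentage of the baseline."""
    return 100.0 * (new_score - baseline) / baseline


# Scores from the abstract: PoetryQwen 0.757, Qwen2.5-14B-Instruct 0.690.
print(round(relative_improvement(0.757, 0.690), 1))  # 9.7
```

The absolute gap is 0.067 score points; dividing by the 0.690 baseline gives the 9.7% stated above.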

2511.02414 2026-06-11 cs.AI 版本更新

A New Perspective on Precision and Recall for Generative Models

生成模型精度与召回的全新视角

Benjamin Sykes (Unicaen, Ensicaen, Greyc), Loïc Simon (Unicaen, Ensicaen, Greyc), Julien Rabin (Unicaen, Ensicaen, Greyc), Jalal Fadili (Unicaen, Ensicaen, Greyc)

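AI summary: Proposes a new framework, based on a binary classification standpoint, for estimating the entire precision-recall curve of generative models, derives a minimax upper bound through statistical analysis, and shows that the framework extends several landmark PR metrics from the literature.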

AI中文摘要

随着生成模型在图像和文本领域取得近期成功,其评估问题近年来受到广泛关注。尽管大多数现有方法依赖于标量指标,但引入精度和召回(PR)作为生成模型的评估指标,开辟了新的研究方向。相关的PR曲线允许更丰富的分析,但其估计存在诸多挑战。在本文中,我们提出了一种基于二分类视角的新框架,用于估计完整的PR曲线。我们对所提出估计进行了详尽的统计分析。作为副产品,我们获得了PR估计风险的最小最大上界。此外,我们还展示了该框架可扩展至文献中的多个经典PR指标,这些指标设计上被限制在曲线的极值点。最后,我们研究了在不同设置下所获得的曲线的不同行为。

英文摘要

With the recent success of generative models in image and text, the question of their evaluation has gained a lot of attention. While most state-of-the-art methods rely on scalar metrics, the introduction of Precision and Recall (PR) for generative models has opened up a new avenue of research. The associated PR curves allow for a richer analysis, but their estimation poses several challenges. In this paper, we present a new framework for estimating entire PR curves based on a binary classification standpoint. We conduct a thorough statistical analysis of the proposed estimates. As a byproduct, we obtain a minimax upper bound on the PR estimation risk. We also show that our framework extends several landmark PR metrics from the literature, which by design are restricted to the extreme values of the curve. Finally, we study the different behaviors of the curves obtained experimentally in various settings.

2511.02627 2026-06-11 cs.AI 版本更新

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

DecompSR:用于组合多跳空间推理分解分析的数据集

Lachlan McPheat, Navdeep Kaur, Robert Blackwell, Alessandra Russo, Anthony G. Cohn, Pranava Madhyastha

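AI summary: Proposes the DecompSR dataset (over 5 million data points), procedurally generated with independent control over several aspects of compositionality (such as reasoning depth and linguistic variability), for fine-grained evaluation of the spatial reasoning ability of large language models.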

AI中文摘要

我们引入了DecompSR(分解空间推理),这是一个大型基准数据集(超过500万个数据点)和生成框架,旨在分析组合空间推理能力。DecompSR的生成允许用户独立改变组合性的多个方面,即:生产力(推理深度)、替代性(实体和语言变异性)、过度泛化(输入顺序、干扰项)和系统性(新颖语言元素)。DecompSR以程序化方式构建,使其在构造上正确,并通过符号求解器独立验证以确保数据集的正确性。DecompSR在一系列大型语言模型(LLM)上进行了全面基准测试,我们表明LLM在空间推理任务中难以进行生产性和系统性泛化,而对语言变异性则更为鲁棒。DecompSR提供了一个可证明正确且严格的基准数据集,具有独立改变组合性几个关键方面程度的新能力,从而允许对LLM的组合推理能力进行稳健且细粒度的探测。

英文摘要

We introduce DecompSR (decomposed spatial reasoning), a large benchmark dataset (over 5M data points) and generation framework designed to analyse compositional spatial reasoning ability. The generation of DecompSR allows users to independently vary several aspects of compositionality, namely: productivity (reasoning depth), substitutivity (entity and linguistic variability), overgeneralisation (input order, distractors) and systematicity (novel linguistic elements). DecompSR is built procedurally in a manner that makes it correct by construction, and it is independently verified using a symbolic solver to guarantee the correctness of the dataset. DecompSR is comprehensively benchmarked across a host of Large Language Models (LLMs), where we show that LLMs struggle with productive and systematic generalisation in spatial reasoning tasks whereas they are more robust to linguistic variation. DecompSR provides a provably correct and rigorous benchmarking dataset with a novel ability to independently vary the degrees of several key aspects of compositionality, allowing for robust and fine-grained probing of the compositional reasoning abilities of LLMs.

2601.17717 2026-06-11 cs.AI cs.LG 版本更新

A Survey on Evaluating Quality and Trustworthiness in LLM-Generated Data

评估LLM生成数据的质量与可信度综述

Kaituo Zhang, Mingzhi Hu, Hoang Anh Duy Le, Fariha Kabir Torsha, Zhimeng Jiang, Minh Khai Bui, Chia-Yuan Chang, Yu-Neng Chuang, Zhen Xiong, Ying Lin, Guanchu Wang, Na Zou

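AI summary: Proposes the LLM Data Auditor framework, which systematically categorizes evaluation metrics along two dimensions, quality and trustworthiness, analyzes evaluation deficiencies in data generation methods across six modalities, and offers recommendations for improvement.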

Comments
Published at TMLR. Title changed in the final version
AI中文摘要

大型语言模型(LLM)已成为跨多种模态生成数据的强大工具。通过将数据从稀缺资源转变为可控资产,LLM缓解了真实世界数据获取成本对模型训练、评估和系统迭代造成的瓶颈。然而,确保LLM生成的合成数据的高质量仍然是一个关键挑战。现有研究主要关注生成方法,对生成数据质量的直接关注有限。此外,大多数研究局限于单一模态,缺乏跨不同数据类型的统一视角。为填补这一空白,我们提出了\textbf{LLM数据审计框架}。在该框架中,我们首先描述了如何利用LLM生成六种不同模态的数据。更重要的是,我们从质量和可信度两个维度系统分类了评估合成数据的内在指标。这种方法将评估重点从依赖下游任务性能的外在评估转向数据本身的固有属性。利用这一评估体系,我们分析了每种模态代表性生成方法的实验评估,并指出了当前评估实践中的重大缺陷。基于这些发现,我们为社区改进数据生成评估提供了具体建议。最后,该框架概述了合成数据在不同模态下的实际应用方法。

英文摘要

Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acquisition costs of real-world data for model training, evaluation, and system iteration. However, ensuring the high quality of LLM-generated synthetic data remains a critical challenge. Existing research primarily focuses on generation methodologies, with limited direct attention to the quality of the resulting data. Furthermore, most studies are restricted to single modalities, lacking a unified perspective across different data types. To bridge this gap, we propose the \textbf{LLM Data Auditor framework}. In this framework, we first describe how LLMs are utilized to generate data across six distinct modalities. More importantly, we systematically categorize intrinsic metrics for evaluating synthetic data from two dimensions: quality and trustworthiness. This approach shifts the focus from extrinsic evaluation, which relies on downstream task performance, to the inherent properties of the data itself. Using this evaluation system, we analyze the experimental evaluations of representative generation methods for each modality and identify substantial deficiencies in current evaluation practices. Based on these findings, we offer concrete recommendations for the community to improve the evaluation of data generation. Finally, the framework outlines methodologies for the practical application of synthetic data across different modalities.

2602.02465 2026-06-11 cs.AI cs.CV cs.LG 版本更新

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

MentisOculi: 揭示心智图像推理的局限性

Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

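AI summary: Proposes the MentisOculi benchmark, which uses multi-step reasoning problems to test whether frontier models can use visual representations to aid reasoning; finds that visual strategies generally fail to improve performance and that unified multimodal models suffer from compounding generation errors and cannot leverage even ground-truth visualizations.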

详情
Comments
9 pages, 8 figures, Accepted at ICML 2026
AI中文摘要

前沿模型正从仅摄入视觉信息的多模态大语言模型(MLLMs)过渡到能够原生交错生成的统一多模态模型(UMMs)。这一转变激发了将中间可视化作为推理辅助的兴趣,类似于人类的心智图像。这一想法的核心是能够以目标导向的方式形成、维护和操作视觉表示。为了评估和探究这一能力,我们开发了MentisOculi,这是一个程序化的、分层的多步推理问题套件,适用于视觉解决方案,旨在挑战前沿模型。评估从潜在令牌到显式生成图像的视觉策略,我们发现它们通常无法提升性能。对UMMs的分析特别揭示了一个关键限制:虽然它们拥有解决任务的文本推理能力,并且有时能生成正确的视觉内容,但它们遭受复合生成错误,并且无法利用甚至真实的可视化。我们的发现表明,尽管视觉思维具有内在吸引力,但尚未有益于模型推理。MentisOculi为分析和弥合不同模型家族之间的这一差距建立了必要的基础。

英文摘要

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

2602.22638 2026-06-11 cs.AI 版本更新

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

MobilityBench:用于评估真实世界移动场景中路径规划智能体的基准

Zhiheng Song, Jingshuai Zhang, Chuan Qin, Chao Wang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu, Hengshu Zhu

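AI summary: Proposes the MobilityBench benchmark, which uses a deterministic API-replay sandbox and a multi-dimensional evaluation protocol to systematically evaluate LLM-based route-planning agents; finds that existing models perform poorly on preference-constrained route planning.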

AI中文摘要

由大型语言模型(LLM)驱动的路径规划智能体已成为一种有前景的范式,通过自然语言交互和工具介导的决策支持日常人类移动。然而,在真实世界移动场景中的系统评估受到多样化路由需求、非确定性地图服务和有限可重复性的阻碍。在本研究中,我们引入了MobilityBench,一个用于评估基于LLM的路径规划智能体在真实世界移动场景中的可扩展基准。MobilityBench基于从高德地图收集的大规模匿名真实用户查询构建,覆盖全球多个城市的广泛路径规划意图。为了实现可重复的端到端评估,我们设计了一个确定性API重放沙箱,消除了实时服务带来的环境变化。我们进一步提出了一个以结果有效性为中心的多维评估协议,辅以对指令理解、规划、工具使用和效率的评估。使用MobilityBench,我们在多种真实世界移动场景中评估了多个基于LLM的路径规划智能体,并对其行为和性能进行了深入分析。我们的发现表明,当前模型在基本信息检索和路径规划任务上表现良好,但在偏好约束路径规划上困难重重,突显了在个性化移动应用中仍有显著改进空间。我们在此https URL公开发布基准数据、评估工具包和文档。

英文摘要

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic Information Retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at this https URL.

2603.09715 2026-06-11 cs.AI 版本更新

Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

问题真的重要吗?视觉-语言SFT的无训练数据选择

Peng Sun, Yi Yang, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li

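AI summary: Proposes CVS, which uses a frozen vision-language large model to assess how the question affects answer validity and, without any training, selects high-quality samples that require cross-modal reasoning; on Vision-Flan it outperforms full-data training using only 10% to 15% of the data.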

AI中文摘要

视觉指令微调对于提升视觉-语言大模型(VLLMs)至关重要。然而,许多样本可以通过语言模式或常识捷径解决,无需真正的跨模态推理,限制了多模态学习的有效性。先前的数据选择方法通常依赖于代价高昂的代理模型训练,并侧重于难度或多样性,未能捕捉样本对视觉-语言联合推理的真实贡献。在本文中,我们提出CVS,一种基于以下洞见的无训练数据选择方法:对于高质量的多模态样本,引入问题应显著改变模型在给定图像下对答案有效性的评估。CVS利用冻结的VLLM作为评估器,测量在有/无问题条件下答案有效性的差异,从而识别需要视觉-语言联合推理的样本,同时过滤语义冲突噪声。在Vision-Flan和The Cauldron上的实验表明,CVS在数据集上取得了稳定的性能。在Vision-Flan上,CVS仅使用10%和15%的数据就分别比全量训练高出3.5%和4.8%,并且在高度异构的Cauldron数据集上保持鲁棒。此外,与COINCIDE和XMAS相比,CVS分别降低了17.3%和44.4%的计算成本。

英文摘要

Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
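
The selection rule described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes each sample already carries two validity scores from a frozen VLLM (one conditioned on the question, one without), takes the discrepancy to be a plain absolute difference, and omits the semantic-conflict filtering the abstract mentions. The names `question_effect` and `select_top_fraction` are hypothetical.

```python
from typing import List, Tuple

# (sample_id, validity_with_question, validity_without_question)
Sample = Tuple[str, float, float]


def question_effect(with_q: float, without_q: float) -> float:
    """How much conditioning on the question changes the judged answer validity."""
    # Assumption: absolute difference; the abstract does not give the exact form.
    return abs(with_q - without_q)


def select_top_fraction(samples: List[Sample], fraction: float) -> List[str]:
    """Keep the samples whose answer validity depends most on the question."""
    ranked = sorted(samples, key=lambda s: question_effect(s[1], s[2]), reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return [sample_id for sample_id, _, _ in ranked[:keep]]


# Toy scores: "b" and "d" are answerable without the question, so they rank low.
toy = [("a", 0.9, 0.2), ("b", 0.8, 0.75), ("c", 0.6, 0.1), ("d", 0.5, 0.5)]
print(select_top_fraction(toy, 0.5))  # ['a', 'c']
```

Because the scores come from a frozen model, a rule of this kind needs only forward passes, which is consistent with the training-free claim in the abstract.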

2604.18543 2026-06-11 cs.AI cs.CL 版本更新

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

ClawEnvKit:爪型智能体的自动环境生成

Xirui Li, Ming Li, Ion Stoica, Cho-Jui Hsieh, Tianyi Zhou

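AI summary: Proposes ClawEnvKit, which automatically generates diverse, verified training and evaluation environments for claw-like agents, and builds the Auto-ClawEval benchmark of 1,040 environments at 13,800x lower cost; harness engineering improves performance by up to 15.7 percentage points over a bare ReAct baseline.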

AI中文摘要

构建用于训练和评估爪型智能体的环境仍然是一个手动、人力密集且无法扩展的过程。我们认为,需要的不仅仅是一个数据集,而是一个能够按需生成多样化、可验证环境的自动化流水线。为此,我们引入了ClawEnvKit,一个自主生成流水线,它从自然语言描述中实例化这一形式化体系。该流水线包含三个模块:(1)解析器,从自然语言输入中提取结构化生成参数;(2)生成器,生成任务规范、工具接口和评分配置;(3)验证器,确保生成环境的可行性、多样性、结构有效性和内部一致性。使用ClawEnvKit,我们构建了Auto-ClawEval,这是首个用于爪型智能体的大规模基准,包含24个类别的1040个环境。实验表明,Auto-ClawEval在连贯性和清晰度上匹配或超过人工策划的环境,成本降低13800倍。在4个模型家族和8个智能体框架上评估,我们发现框架工程比裸ReAct基线性能提升高达15.7个百分点,完成度仍是主要变化轴,且没有模型饱和该基准,自动化生成使得评估规模达到前所未有的水平。除了静态基准测试,ClawEnvKit还支持实时评估:用户用自然语言描述所需能力,即可按需获得验证过的环境,将评估转变为持续的、用户驱动的过程。同样的机制也可作为按需训练环境生成器,产生适应智能体当前弱点的任务分布,而非受限于现有用户日志。

英文摘要

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.

2606.09426 2026-06-11 cs.AI 版本更新

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

WeaveBench: 面向混合接口的长期、真实世界计算机使用代理基准

Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan

发表机构 * Zhejiang University(浙江大学) Microsoft Research Asia(微软亚洲研究院) Tsinghua University(清华大学)

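AI summary: Proposes the WeaveBench benchmark of 114 long-horizon hybrid-interface tasks across 8 real-world work domains, which require agents to combine GUI and CLI/code operations; the best PassRate is only 41.2%, and the results expose shortcomings in existing evaluation.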

AI中文摘要

计算机使用代理(CUA)越来越多地在结合视觉桌面控制、命令行执行、代码编辑、浏览器和外部工具的运行时中运行。然而,现有基准通常将这些接口作为可分离的能力进行评估,导致长期跨接口编排测试不足。因此,我们引入了WeaveBench,一个长期混合接口基准,包含114个跨8个真实工作领域的任务,基于真实用户请求和公开可验证的工件。每个任务要求代理在单个轨迹中结合GUI观察/操作与CLI/代码操作。我们在部署的CLI代理运行时内的真实Ubuntu桌面上评估这些任务,并增加了最小的桌面控制插件。我们还提出了一个配套的轨迹感知评判器,检查交付物、文件、截图、日志和操作痕迹,同时检测快捷行为,如伪造的视觉证据或硬编码指标。在前沿模型-运行时配对中,最佳PassRate仅达到41.2%,表明该基准远未饱和。轨迹感知评判器进一步揭示,仅基于结果的评分显著高估了代理性能。总体而言,WeaveBench暴露了CUA评估中的关键差距,并提供了一个有效的测试平台,以衡量代理是否能在长期真实世界任务中编排GUI、CLI和代码操作。

英文摘要

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.

2606.11042 2026-06-11 cs.AI 版本更新

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Workflow-GYM:面向真实世界专业领域的长周期计算机使用代理任务评估

Liya Zhu, Jingzhe Ding, Jian Zhang, Jianbo Xue, Shihao Liang, Ge Zhang, Yi Zhu, Duju Zeng, Xiang Gao, Qingshui Gu, Mailun Gao, Huimin Che, Yan Zhao, Peiheng Zhou, Haojun Wang, Chaobo Xian, Lili Le, Chi Wu, Yiwei Liu, Shengda Long, Jiale Yang, Fangzhi Xu, Sijin Wu, Haodong Duan, Chao He, Zhaojian Li, Minchao Wang, Huan Zhou, Jiani Hou, Chuqian Yu, Weiran Shi, Hongwan Gao, Jiamin Chen, Guanhong Chen, Tingqin Luo, Kaiyuan Zhang, Zhixin Yao, Qing Hua, Yuhao Jiang, Jin Chen, Pu Chen, Zhenyu Hu, Xingyu Li, Zhengxuan Jiang, Meng Cao, Tianfeng Long, Haozhe Wang, Mingzhang Wang, Yichen Zhang, Yiming Dai, Chenchen Zhang, Jiaying Wang, Xinying Liu, Xingzu Liu, Lingling Zhang, Xinjie Chen, Yujia Qin, Wangchunshu Zhou, Zhiyong Wu, Yang Liu, Jiaheng Liu, Lei Zhang, Shen Yan, Wenhao Huang, Zaiyuan Wang, Xiaolong Chang

发表机构 * ByteDance Seed(字节跳动Seed) M-A-P Humanlaya

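AI summary: Proposes the Workflow-GYM benchmark, which evaluates whether AI agents can carry out long-horizon, high-value workflows in professional software; finds that the strongest models achieve success rates only slightly above 30%, revealing that current agents fall well short of maintaining long-horizon workflow consistency.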

AI中文摘要

近年来,AI代理在处理日益复杂、真实世界任务方面取得了快速发展。然而,现有基准很少评估代理能否操作图形用户界面以完成跨领域的长周期、高价值专业工作流。当前的GUI基准仍主要关注通用软件、相对简单的应用和短周期任务,使得现代代理能否遵循用户指令自主操作领域特定专业软件并以端到端方式完成经济价值工作尚不清楚。为填补这一空白,我们引入Workflow-GYM,一个以专业领域和专门软件环境为中心的长周期GUI任务基准。通过对最先进模型的广泛实验,我们发现即使最强的模型也仅达到略高于30%的成功率,突显出专业长周期GUI工作流对当前GUI代理仍极具挑战性。进一步分析表明,当前代理难以维持长周期工作流的一致性,频繁出现工作流阶段遗漏、错误传播、目标漂移以及对专业软件环境理解不足等问题。我们的发现为当前代理系统的局限性提供了重要见解,并为下一代GUI代理研究指明了关键方向。

英文摘要

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

2508.18636 2026-06-11 cs.SE cs.AI 版本更新

LaQual: An Automated Framework for LLM App Quality Evaluation

LaQual: 一种用于LLM应用质量评估的自动化框架

Yan Wang, Xinyi Hou, Junjun Si, Yanjie Zhao, Weiguo Lin, Haoyu Wang

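AI summary: Proposes LaQual, an automated framework that evaluates LLM app quality through static indicator screening and dynamic scenario-adapted evaluation; its scores are highly consistent with human judgments, and it can reduce the candidate app pool by 66.7% to 81.3%.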

AI中文摘要

代表软件分发的新范式,LLM应用商店正在迅速兴起,为用户提供内容生成、编程辅助、教育等多样化选择。然而,当前LLM应用商店中的排名和推荐机制主要依赖静态指标(如用户交互和收藏),使用户难以高效识别高质量应用。同时,当前学术研究专注于特定垂直领域,缺乏适用于多样化LLM应用生态的通用自动化评估框架。为应对上述挑战,我们提出LaQual,一种用于LLM应用质量评估的自动化框架。LaQual整合三个关键阶段:(1) LLM应用标注与层次分类,实现精确场景映射;(2) 静态指标评估,使用时间加权用户参与度和功能能力指标过滤低质量应用;(3) 动态场景自适应评估,由LLM生成场景特定评估指标、评分标准和任务,进行全面质量评估。在主流LLM应用商店上的实验证明了LaQual的有效性。其自动化评分与人类判断高度一致。通过有效筛选,LaQual可将候选LLM应用池减少66.7%至81.3%。用户研究进一步验证了其相对于基线系统的显著优势,特别是在比较效率(均值5.45 vs. 3.30)和解释信息价值(4.75 vs. 2.25)方面。这些结果表明,LaQual为现实场景中LLM应用的高质量发现与推荐提供了可扩展、客观且以用户为中心的解决方案。

英文摘要

Representing a new paradigm in software distribution, LLM app stores are rapidly emerging, offering users diverse choices for content generation, coding assistance, education, and more. However, current ranking and recommendation mechanisms in LLM app stores predominantly rely on static metrics, such as user interactions and favorites, making it challenging for users to efficiently identify high-quality apps. At the same time, current academic research focuses on specific vertical fields and lacks a general, automated evaluation framework applicable to the diverse LLM app ecosystem. To address the above challenges, we present LaQual, an automated framework for LLM app quality evaluation. LaQual integrates three key stages: (1) LLM app labeling and hierarchical classification for precise scenario mapping; (2) static indicator evaluation using time-weighted user engagement and functional capability indicators to filter low-quality apps; and (3) dynamic scenario-adapted evaluation, where an LLM generates scenario-specific evaluation metrics, scoring criteria, and tasks for comprehensive quality evaluation. Experiments on a mainstream LLM app store demonstrate the effectiveness of LaQual. Its automated scores show high consistency with human judgments. Through effective screening, LaQual can reduce the candidate LLM app pool by 66.7% to 81.3%. User studies further validate its significant outperformance over baseline systems, particularly in comparison efficiency (mean 5.45 vs. 3.30) and value of explanatory information (4.75 vs. 2.25). These results demonstrate that LaQual provides a scalable, objective, and user-centric solution for high-quality discovery and recommendation of LLM apps in real-world scenarios.

2509.25359 2026-06-11 cs.CL cs.AI 版本更新

Geometric Metrics and LLMs: What They Measure and When They Work

几何度量与大语言模型:它们测量什么以及何时有效

Viacheslav Yusupov, Anna Antipina, Ameliia Alaeva, Danil Maksimov, Anna Vasileva, Tatyana Zaitseva, Alina Ermilova, Evgeny Burnaev, Egor Shvetsov

AI总结 本文系统测试了用于大语言模型评估的几何度量,发现部分度量主要反映输出长度,而几何度量在文本统计基础上提供有限但真实的信息,并指出故障检测是最有前景的应用。

详情
AI中文摘要

我们提出了对大语言模型评估中几何度量的系统性压力测试。基于排名的内部表示几何特性作为无参考质量信号显示出前景,但其可靠的条件仍不清楚。我们评估了八种常用度量:内在维度估计器、谱范数及相关量,在六个测试模型(0.5-8B)和八个生成器上对比任务,将真实的几何信号与文本长度效应以及标准文本统计已捕获的信息区分开。三个发现出现。首先,一些度量(特别是Schatten范数和MOM)主要反映输出长度,一旦控制长度,其明显的区分能力就崩溃。其次,几何度量在文本统计之外增加了适度但真实的信息:结合它们,分类器在6路生成器识别上达到78%的准确率,而仅用文本统计为69%。第三,度量并不追踪文本质量的通用概念,而是显示内在维度与词汇多样性(RTTR)之间仅存在中等关联。我们给出了特定用例的建议,并指出故障检测是最有前景的近期应用。

英文摘要

We present a systematic stress-test of geometric metrics for LLM evaluation. Rank-based geometric properties of internal representations have shown promise as reference-free quality signals, but the conditions under which they are reliable remain unclear. We evaluate eight commonly used metrics (intrinsic-dimensionality estimators, spectral norms, and related quantities) across six tester models (0.5-8B) and eight generators on contrasting tasks, separating genuine geometric signal from text-length effects and from what standard text statistics already capture. Three findings emerge. First, some metrics (notably Schatten Norm and MOM) mainly reflect output length, and their apparent discriminative power collapses once length is controlled. Second, geometric metrics add modest but real information beyond text statistics: combined with them, a classifier reaches 78% accuracy on 6-way generator identification versus 69% for text statistics alone. Third, rather than tracking a general notion of text quality, the metrics show only a moderate association between intrinsic dimensionality and lexical diversity (RTTR). We give use-case-specific recommendations and identify failure detection as the most promising near-term application.
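
The first finding (discriminative power that collapses once length is controlled) can be illustrated with a minimal sketch. It assumes a simple linear residualization on output length, which is not necessarily the control the authors use; the data, the labels "A"/"B", and the helper names `residualize` and `mean_gap` are hypothetical.

```python
from statistics import mean

def residualize(values, lengths):
    """Remove the linear effect of output length from a metric (least squares)."""
    mx, my = mean(lengths), mean(values)
    sxx = sum((x - mx) ** 2 for x in lengths)
    if sxx == 0:
        # All outputs have the same length: only centre the metric.
        return [v - my for v in values]
    slope = sum((x - mx) * (y - my) for x, y in zip(lengths, values)) / sxx
    intercept = my - slope * mx
    return [y - (slope * x + intercept) for x, y in zip(lengths, values)]

def mean_gap(values, labels):
    """Difference between the mean metric of generator A and generator B."""
    a = [v for v, lab in zip(values, labels) if lab == "A"]
    b = [v for v, lab in zip(values, labels) if lab == "B"]
    return mean(a) - mean(b)

# Hypothetical data: generator B simply writes longer outputs than generator A,
# and the metric is a pure function of length (assumption for illustration).
lengths = [100, 120, 140, 300, 320, 340]
labels = ["A", "A", "A", "B", "B", "B"]
metric = [0.5 * n for n in lengths]

raw_gap = mean_gap(metric, labels)                               # large apparent gap
controlled_gap = mean_gap(residualize(metric, lengths), labels)  # gap vanishes
print(raw_gap, controlled_gap)
```

In this toy case the raw gap between generators comes entirely from length, so it disappears after residualization; a metric carrying genuine geometric signal would keep a non-zero gap.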

2510.06596 2026-06-11 cs.CV cs.AI cs.IT cs.LG 版本更新

SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

SDQM:用于目标检测数据集评估的合成数据质量指标

Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin

AI总结 提出SDQM指标,无需模型训练收敛即可评估合成数据质量,与YOLO11的mAP强相关,优于现有指标。

详情
Comments
Accepted and Published at SPIE: Journal of Electronic Imaging, Vol. 35, Issue 3
AI中文摘要

机器学习模型的性能在很大程度上依赖于训练数据。大规模、良好标注数据集的稀缺给构建鲁棒模型带来了重大挑战。为了解决这一问题,通过模拟和生成模型产生的合成数据已成为一种有前景的解决方案,它增强了数据集的多样性,并提高了模型的性能、可靠性和韧性。然而,评估这些生成数据的质量需要一个有效的指标。我们引入了合成数据集质量指标(SDQM),用于评估目标检测任务的数据质量,而无需模型训练收敛。该指标能够更高效地生成和选择合成数据集,解决了资源受限的目标检测任务中的一个关键挑战。在我们的实验中,SDQM与领先的目标检测模型YOLO11的平均精度均值(mAP)得分表现出强相关性,而先前的指标仅表现出中等或弱相关性。此外,它提供了改进数据集质量的可操作见解,最大限度地减少了昂贵的迭代训练需求。这一可扩展且高效的指标为评估合成数据设立了新标准。SDQM的代码可从此https URL获取。

英文摘要

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics only exhibited moderate or weak correlations. In addition, it provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at this https URL

2511.07332 2026-06-11 cs.LG cs.AI 版本更新

Grounding Computer Use Agents on Human Demonstrations

基于人类演示的计算机使用智能体基础构建

Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar

AI总结 为解决桌面环境高质量基础数据稀缺问题,构建了包含87个应用、56K截图和3.56M人工标注的GroundCUA数据集,并基于此训练GroundNext模型,在5个基准上以少于先前十分之一的数据取得最优结果。

详情
Comments
Accepted at ICLR 2026
AI中文摘要

构建可靠的计算机使用智能体需要基础构建:将自然语言指令准确连接到正确的屏幕元素。尽管存在大量用于网络和移动交互的数据集,但桌面环境的高质量资源有限。为填补这一空白,我们引入了GroundCUA,一个基于专家人类演示构建的大规模桌面基础数据集。它涵盖12个类别的87个应用,包含56K张截图,每个屏幕元素都经过仔细标注,总计超过3.56M个人工验证标注。从这些演示中,我们生成了多样的指令,覆盖广泛的实际任务,为模型训练提供高质量数据。利用GroundCUA,我们开发了GroundNext系列模型,将指令映射到目标UI元素。在3B和7B规模上,GroundNext通过监督微调在五个基准上取得了最先进的结果,同时所需训练数据不到先前工作的十分之一。强化学习后训练进一步提升了性能,在OSWorld基准上使用o3作为规划器的智能体评估中,GroundNext取得了与使用更多数据训练的模型相当或更优的结果。这些结果证明了高质量、专家驱动数据集在推进通用计算机使用智能体中的关键作用。

英文摘要

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.

2601.22025 2026-06-11 cs.CL cs.AI cs.IR cs.SE 版本更新

When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications

当通用提示改进有害:LLM应用的评估驱动迭代

Daniel Commey

AI总结 提出最小可行评估套件(MVES),通过结构化评估框架和本地复现实验,发现通用提示添加并非单调改进,强调评估驱动的提示迭代。

详情
Comments
Technical report. 42 pages, 3 figures. Code, test suites, and result logs: this https URL
AI中文摘要

评估大型语言模型(LLM)应用与传统软件测试不同,因为输出是概率性的、语义可变的,并且对提示和模型变化敏感。本技术报告提出了最小可行评估套件(MVES),一种面向审计的应用级LLM评估结构。MVES将应用类别与失败模式、指标、所需工件和验证证据联系起来,涵盖通用LLM应用、检索增强系统和智能体工作流。我们将该框架与可复现的本地评估工具配对,包括结构化提取、RAG引用/内容合规性和指令遵循检查。使用Ollama与Llama 3 8B Instruct和Qwen 2.5 7B Instruct,我们在扩展的每套30例消融实验中评估了五种提示条件。结果表明,在测试的本地条件下,通用提示添加不会产生单调改进:更强的输出合同提示提高了两种模型的严格提取,而RAG引用/内容合规性在某些通用规则条件下下降。观察到的最显著下降发生在Qwen 2.5上,当通用规则附加到用户提示时,RAG从26/30下降到9/30。这些发现支持评估驱动的提示迭代:提示更改应被视为潜在的回归风险,并在部署前针对特定任务套件进行测试。随附的存储库包含测试套件、提示变体、评估工具、原始结果日志和复现所报告本地消融所需的脚本。

英文摘要

Evaluating Large Language Model (LLM) applications differs from conventional software testing because outputs are probabilistic, semantically variable, and sensitive to prompt and model changes. This technical report proposes the Minimum Viable Evaluation Suite (MVES), an audit-oriented structure for application-level LLM evaluation. MVES links application categories to failure modes, metrics, required artifacts, and validation evidence across general LLM applications, retrieval-augmented systems, and agentic workflows. We pair the framework with a reproducible local evaluation harness covering structured extraction, RAG citation/content-compliance, and instruction-following checks. Using Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct, we evaluate five prompt conditions over expanded 30-case-per-suite ablations. The results show that, in the tested local conditions, generic prompt additions do not produce monotonic improvements: stronger output-contract prompts improve strict extraction for both models, while RAG citation/content-compliance declines under some generic-rule conditions. The largest observed decline occurs for Qwen 2.5 on RAG when generic rules are appended to the user prompt, from 26/30 to 9/30. These findings support evaluation-driven prompt iteration: prompt changes should be treated as potential regression risks and tested against task-specific suites before deployment. The accompanying repository contains the test suites, prompt variants, evaluation harness, raw result logs, and scripts needed to reproduce the reported local ablations.
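
The idea of treating a prompt change as a potential regression can be sketched as a simple gate over suite pass counts. This is an illustrative sketch, not the report's harness: the function names and the `max_drop` threshold are hypothetical, and only the 26/30 and 9/30 counts come from the abstract.

```python
def pass_rate(passed, total):
    """Fraction of suite cases that passed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return passed / total

def is_regression(baseline_passed, candidate_passed, total, max_drop=0.05):
    """Flag a prompt variant whose pass rate falls more than max_drop below baseline.

    max_drop is a hypothetical tolerance chosen for illustration.
    """
    drop = pass_rate(baseline_passed, total) - pass_rate(candidate_passed, total)
    return drop > max_drop

# Counts reported in the abstract for Qwen 2.5 on the RAG suite.
blocked = is_regression(baseline_passed=26, candidate_passed=9, total=30)
unchanged = is_regression(baseline_passed=26, candidate_passed=26, total=30)
print(blocked, unchanged)
```

A gate like this would run per task-specific suite before deployment, so that a prompt edit that helps extraction but hurts RAG compliance is caught rather than averaged away.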

2601.22725 2026-06-11 cs.CV cs.AI 版本更新

OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

OpenVTON-Bench:用于可控虚拟试穿评估的大规模高分辨率基准

Jin Li, Tao Chen, Kai Wen, Siqi Yin, Shuai Jiang, Weijie Wang, Jingwen Luo, Chenhui Wu

AI总结 提出OpenVTON-Bench,包含约10万对高分辨率图像,通过DINOv3聚类和Gemini描述构建,并设计多模态评估协议,沿五个维度衡量试穿质量,与人类判断高度一致。

详情
Comments
Under review for the NeurIPS 2026 Datasets and Benchmarks Track
AI中文摘要

近期扩散模型的进展显著提升了虚拟试穿(VTON)系统的视觉保真度,但可靠的评估仍是一个持续的瓶颈。传统指标难以量化细粒度的纹理细节和语义一致性,而现有数据集在规模和多样性上无法满足商业标准。我们提出了OpenVTON-Bench,一个大规模基准,包含约10万对高分辨率图像(最高$1536 \times 1536$)。该数据集使用基于DINOv3的层次聚类进行语义平衡采样,并借助Gemini驱动的密集描述,确保在20个细粒度服装类别上均匀分布。为支持可靠评估,我们提出了一种多模态协议,沿五个可解释维度衡量VTON质量:背景一致性、身份保真度、纹理保真度、形状合理性和整体真实感。该协议将基于VLM的语义推理与基于SAM3分割和形态学腐蚀的新型多尺度表示度量相结合,能够分离边界对齐误差与内部纹理伪影。实验结果表明,该协议与人类判断高度一致(Kendall's $\tau$为0.833,而SSIM为0.611),为VTON评估建立了稳健的基准。

英文摘要

Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
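
Agreement with human judgments is reported as Kendall's $\tau$. As a reminder of what that statistic measures, here is a minimal sketch of the tie-free variant (tau-a); the abstract does not say which variant the authors use, and the two score lists are hypothetical.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical rankings of four try-on results by humans and by a metric.
human_rank = [1, 2, 3, 4]
metric_rank = [1, 3, 2, 4]
tau = kendall_tau_a(human_rank, metric_rank)
print(tau)
```

One swapped pair out of six lowers the value from 1 to 2/3, which gives a sense of scale for the reported gap between 0.833 and 0.611.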

2602.07840 2026-06-11 cs.IR cs.AI 版本更新

SAGE: Scalable AI Governance & Evaluation

SAGE: 可扩展的人工智能治理与评估

Benjamin Le, Xueying Lu, Nick Stern, Wenqiong Liu, Igor Lapchuk, Xiang Li, Baofen Zheng, Kevin Rosenberg, Jiewen Huang, Zhe Zhang, Abraham Cabangbang, Satej Milind Wagle, Jianqiang Shen, Raghavan Muthuregunathan, Abhinav Gupta, Mathew Teoh, Andrew Kirk, Thomas Kwan, Jingwei Wu, Wenjing Zhang

AI总结 本文提出SAGE框架,通过双向校准循环将高质量的人类产品判断转化为可扩展的评估信号,解决了大规模搜索系统中相关性评估的治理差距问题,并实现了92倍成本降低的模型迭代和政策监督。

详情
AI中文摘要

在大规模搜索系统中评估相关性本质上受到人类监督与生产系统高吞吐要求之间的治理差距的限制。传统方法依赖于参与代理或稀疏手动审查,但这些方法往往无法捕捉高影响的相关性失败的全部范围。我们提出了SAGE(可扩展的人工智能治理与评估)框架,该框架将高质量的人类产品判断作为可扩展的评估信号。SAGE的核心是一个双向校准循环,其中自然语言政策、精心编写的先例和一个LLM替代法官共同进化。SAGE系统性地解决语义模糊和不一致,将主观的相关性判断转化为可执行的多维标准,具有接近人类水平的一致性。为了弥合前沿模型推理与工业级推理之间的差距,我们应用教师-学生蒸馏技术,将高保真判断转移到紧凑的学生替代体,成本降低92倍。SAGE部署在LinkedIn搜索生态系统中,通过模拟驱动开发指导模型迭代,蒸馏出符合政策的模型用于在线服务,并实现快速的离线评估。在生产环境中,它推动了政策监督,测量了升级的模型变体并检测到无法被参与指标检测到的回归。集体上,这些措施推动了LinkedIn每日活跃用户的0.25%提升。

英文摘要

Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present SAGE (Scalable AI Governance & Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language Policy, curated Precedent, and an LLM Surrogate Judge co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at 92× lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a 0.25% lift in LinkedIn daily active users.

2603.19225 2026-06-11 cs.CE cs.AI cs.CL cs.IR q-fin.CP 版本更新

FinTradeBench: A Financial Reasoning Benchmark for LLMs

FinTradeBench: 面向LLM的金融推理基准

Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta

AI总结 提出FinTradeBench基准,通过结合公司基本面与交易信号,评估大语言模型在金融推理中的表现,发现检索增强对数值和时间序列推理帮助有限。

详情
Comments
9 pages main text, 31 pages total (including references and appendix). 5 figures, 16 tables. Preprint under review. Code and data will be made available upon publication
AI中文摘要

现实世界的金融决策是一个具有挑战性的问题,需要对异构信号进行推理,包括从监管文件中提取的公司基本面和从价格动态计算出的交易信号。最近,随着大语言模型(LLM)的进步,金融分析师开始将它们用于金融决策任务。然而,现有的用于测试这些模型的金融问答基准主要关注公司资产负债表数据,很少评估关于公司股票如何在市场中交易或它们与基本面相互作用的推理。为了利用这两种方法的优势,我们引入了FinTradeBench,这是一个评估金融推理的基准,它整合了公司基本面和交易信号。FinTradeBench包含1400个问题,这些问题基于纳斯达克-100公司十年历史窗口的数据。该基准分为三个推理类别:基本面聚焦、交易信号聚焦以及需要跨信号推理的混合问题。为了确保大规模可靠性,我们采用了一个校准然后扩展的框架,该框架结合了专家种子问题、多模型响应生成、模型内自过滤、数值审计以及人类-LLM判断对齐。我们在零样本提示和检索增强设置下评估了14个LLM,并观察到了明显的性能差距。检索显著改善了对文本基本面的推理,但对交易信号推理的益处有限。这些发现突显了当前LLM在数值和时间序列推理方面的根本性挑战,并激励了未来在金融智能方面的研究。

英文摘要

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with advances in Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question-answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning about how company stocks trade in the market or their interactions with fundamentals. To leverage the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and witness a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in the numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

2605.23243 2026-06-11 cs.CR cs.AI 版本更新

Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

前沿大语言模型是否已为网络安全做好准备?来自双模式漏洞基准测试的垂直基础模型证据

Vivek Dahiya, Sunny Nehra, Vipul Dholariya, Bhavik Shangari, Chandra Khatri

AI总结 通过白盒函数级漏洞检测和黑盒Web应用安全测试双模式基准测试,评估前沿大语言模型在网络安全任务中的表现,发现其存在高误报率、低覆盖率等问题,而领域专用模型通过结构化方法显著提升性能。

详情
AI中文摘要

我们通过双模式基准测试评估前沿大语言模型是否已为网络安全做好准备:白盒函数级漏洞检测(VulnLLM-R,涵盖C/Java/Python)和黑盒Web应用安全测试(五个生产风格应用,包含118个真实漏洞,涉及20多个CWE家族,我们将开源)。我们测试了六个前沿模型(GPT-5.4、Codex 5.3、Claude Opus 4.6、Sonnet 4.6、Gemini 3.1 Pro和Gemini 3 Flash)以及两个领域专用模型,涵盖四种测试范式。我们的发现令人警醒:(1)每个前沿模型在白盒检测中产生10-50%的误报率,系统性地过度预测漏洞;(2)在黑盒测试中,前沿模型仅达到4-8%的真实漏洞覆盖率,即使借助外部安全工具(Playwright MCP、Burp Suite MCP)也仅提升至10-19%;(3)领域专用智能体中编码的结构化渗透测试方法将每个家族的检测率提升至50%以上,表明方法论而非规模是主要杠杆;(4)一个领域专用防御模型在单个GPU上实现了所有模型中最高的精确率(0.904)和最低的误报率(9.7%)。我们指出缺乏结构化安全测试痕迹(端到端请求/响应序列、失败密集型数据、多步攻击链)是根本的训练数据瓶颈,并提出自博弈安全测试作为数据生成策略。我们的结果为专门构建用于网络安全的垂直基础模型提供了依据。

英文摘要

We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.
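
Precision, false positive rate, and ground-truth coverage are computed over different denominators, which is why they are reported separately. The sketch below uses the standard confusion-matrix definitions; the counts are hypothetical, and treating coverage as found/total is an assumption, since the abstract does not define it. Only the total of 118 vulnerabilities comes from the abstract.

```python
def precision(tp, fp):
    """Share of flagged items that are truly vulnerable: TP / (TP + FP)."""
    return tp / (tp + fp)

def false_positive_rate(fp, tn):
    """Share of safe items that are wrongly flagged: FP / (FP + TN)."""
    return fp / (fp + tn)

def coverage(found, total):
    """Assumed definition: share of ground-truth vulnerabilities that were found."""
    return found / total

# Hypothetical white-box counts for one model.
tp, fp, tn, fn = 90, 10, 90, 10
print(precision(tp, fp))            # computed over everything the model flagged
print(false_positive_rate(fp, tn))  # computed over everything that was safe

# Hypothetical black-box result: 9 of the 118 ground-truth vulnerabilities found.
print(round(coverage(9, 118), 3))
```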

2605.26418 2026-06-11 cs.LG cs.AI cs.DC 版本更新

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

深度强化学习何时超越校准基线?自适应资源控制的基准研究

Guilin Zhang, Chuanyi Sun, Kai Zhao, Shahryar Sarkani, John Fossaceca

AI总结 通过RLScale-Bench基准测试,发现校准的基于规则的自动缩放器在所有工作负载上成本均低于六种主流深度强化学习算法,并揭示了算法选择、基线校准和评估协议的关键瓶颈。

详情
AI中文摘要

一个适当校准的基于规则的自动缩放器可以在我们测试的每个工作负载上,在成本方面击败六种主流深度强化学习(DRL)算法——那么,如果存在的话,DRL究竟何时能真正发挥作用?我们在RLScale-Bench中研究这个问题,这是一个用于自适应资源控制的DRL可重复基准和评估协议,其中代理在成本和服务级别约束下将计算资源分配给动态工作负载。我们在匹配的架构、训练预算和奖励函数下,评估PPO、DQN、A2C、SAC、TD3和DDPG,与校准的基于规则基线在六个工作负载模式和五个种子(240次运行)上进行对比,在Kubernetes水平Pod自动缩放上实例化基准,并探测分布偏移泛化。三个发现挑战了常见假设:(i)校准控制器在所有六个工作负载上实现了最低成本,尽管在突发和闪流流量上落后于最佳RL代理;(ii)由于动作空间不匹配,离散动作算法在约束违反方面比连续动作算法好一到两个数量级;(iii)没有单一算法在所有工作负载上占主导地位,排名变化高达四个位置。基于RL的资源控制的瓶颈不是算法选择,而是基线校准、奖励工程和现实的评估协议。

英文摘要

A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.
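
For context, a rule-based autoscaler of the kind used as a baseline can be very small. The sketch below follows the replica formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × currentMetric / targetMetric), with a tolerance band); it is not the paper's calibrated controller, whose rules are not given in the abstract, and the default bounds are hypothetical.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """HPA-style scaling rule: scale replicas in proportion to metric/target."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Within the tolerance band: keep the current replica count.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (hypothetical defaults).
    return max(min_replicas, min(max_replicas, desired))

scale_up = desired_replicas(4, current_metric=90, target_metric=60)
hold = desired_replicas(4, current_metric=62, target_metric=60)
capped = desired_replicas(4, current_metric=600, target_metric=60)
print(scale_up, hold, capped)
```

Calibrating such a rule would mean tuning the target, tolerance, and bounds per workload; the abstract's point is that this tuning matters more than the choice of RL algorithm.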

2605.28882 2026-06-11 cs.CL cs.AI cs.SD 版本更新

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

GrowLoop: 由人类种子驱动的自进化对话评估

Yihang Lin, Yunze Gao, Zeyang Lin, Dongbo Li, Kun Peng, Yue Liu

AI总结 针对开放域对话中类人性评估的隐性知识、标准分歧和动态演化三大挑战,提出GrowLoop自进化评估系统,通过最小人工种子标注和启发式学习迭代提取评估标准,并利用标准-案例协同进化机制持续适应模型进步和场景变化。

详情
AI中文摘要

随着大语言模型的快速发展,评估开放域对话中的类人性变得越来越重要。然而,类人性是一种隐性知识,人类可以直观感知,但其背后的标准难以明确表述。人类判断差异很大,在某些情况下高度一致,在其他情况下则存在合理分歧。同时,人类判断背后的标准仍然是隐性的,没有明确的基础来构建案例。此外,什么算作类人并非一成不变,而是随着模型能力和人类期望而演变。尽管在评估方法上取得了进展,如专家编写的基准、奖励模型和自进化基准,但没有一种方法能同时解决这三个挑战。因此,我们提出了GrowLoop,一个自进化的对话评估系统,能够随着模型进步和场景变化而持续适应。以最小的人工种子标注作为初始动力,LLM代理通过启发式学习迭代提取和细化评估标准。在标注者意见一致的地方要求人机一致,而在意见分歧的地方只要求合理性。此外,标准-案例协同进化机制实现了持续进化,当评估目标发生变化时,通过新的种子进行扩展。应用于开放域对话中的类人性评估,生成的标准不仅在与人判断的一致性上显著优于现有方法,而且还发现了标注者忽略的问题。由此产生的基准能够有效区分不同能力层级的模型,并揭示其不足之处,同时能够泛化到新场景并随着模型进步而适应。我们的工作将基准测试范式从手动更新或难度扩展转变为全面、持续的自我进化。

英文摘要

With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-likeness is not static, but evolving with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, Reward Models, and self-evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. Starting from minimal human seed annotations, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution. When the evaluation target shifts, new human seeds expand the system's coverage accordingly. When applied to human-likeness evaluation in open-ended conversation, the AI judge guided by these rubrics not only substantially outperforms existing methods in alignment with human judgments, but also uncovers issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.

2605.29588 2026-06-11 cs.CV cs.AI q-bio.NC 版本更新

Brain-IT-VQA: From Brain Signals to Answers

Brain-IT-VQA: 从脑信号到答案

Roman Beliy, Matias Cosarinsky, Oliver Heinimann, Navve Wasserman, Michal Irani

AI总结 提出 Brain-IT-VQA 框架,基于 fMRI 脑信号解码语言令牌并结合语言模型进行视觉问答,在 NSD-VQA 新基准上显著优于先前方法,并用于分析脑区对视觉信息的贡献。

详情
AI中文摘要

从观看图像时记录的 fMRI 信号解码视觉内容,特别是回答关于所看图像的问题,是一个长期挑战。尽管近年来在基于 fMRI 的视觉问答(VQA)方面取得了显著进展,但性能仍然有限。此外,尽管最近的模型能够做出越来越准确的预测,但它们很少被用作理解大脑中视觉表征结构的工具。我们提出了 Brain-IT-VQA,一个基于 fMRI 的视觉问答框架。基于脑交互变换器(Brain-IT),我们的方法从脑活动中解码语言令牌,并将其与语言模型集成以回答视觉问题。我们的模型显著优于先前的基于 fMRI 的标题生成和 VQA 方法。我们进一步引入了 NSD-VQA,一个新的基于 fMRI 的视觉问答数据集和基准。与现有的图像-fMRI VQA 数据集通常每张图像只提供少数宽泛且弱控制的问题不同,NSD-VQA 在 20 个受控问题类别中平均每张图像提供 20 个问答对,这些类别解耦了多个层次的视觉理解。这使得在有限的 fMRI 测试数据下能够进行更可靠和可解释的评估。Brain-IT-VQA 和 NSD-VQA 共同提供了一个强大的预测框架和研究脑表征的工具。利用这个基准,我们量化了哪些形式的视觉和语义信息可以从对自然图像的 fMRI 响应中可靠解码。我们进一步分析了不同脑区在不同问题类型上的贡献。

英文摘要

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.

2606.02670 2026-06-11 cs.LG cs.AI 版本更新

Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate

多变量时间序列基准中的异常主要是单变量的

Marc Pinet (LIG), Julien Cumin, Samuel Berlemont, Dominique Vaufreydaz (LIG)

AI总结 本文通过诊断框架和实验证明,当前多变量时间序列异常检测基准中,异常主要源于单变量偏离,跨通道结构变化极少,因此现有基准不适合验证跨通道建模能力。

详情
AI中文摘要

许多最新的多变量时间序列异常检测(MTSAD)模型引入了跨通道建模,其隐含假设是异常的结构可能分布在多个通道上。我们在八个广泛使用的公共基准上评估了这一假设,引入了一个逐段诊断框架,该框架针对每个标记的异常,标记是否至少有一个通道单独偏离其正常历史,是否跨通道相关结构发生变化,或两者兼有。该框架表明,在一系列合理阈值下,没有跨通道破裂发生在没有伴随单变量偏离的情况下。一个补充指标还显示,在八个基准中的六个上,至少一半的标记异常段在89%到100%的时间步上发生单变量偏离,在其中的三个数据集上达到100%。为了验证我们的框架在存在跨通道结构时能够捕获它,我们构建了具有共享噪声的相移正弦通道的合成数据。每个异常段通过两种通道级损坏之一进行改变,这些损坏保留了每个通道的边缘分布,同时破坏了跨通道结构,我们的框架正确地将这些段表征为仅跨通道异常。在这些数据上,依赖通道(CD)模型成功利用了跨通道信号,而独立通道(CI)模型则失败。在真实基准上对最近SOTA检测器的CI/CD比较进一步证实了CD建模没有带来可衡量的收益。我们得出结论,当前的MTSAD基准不适合验证跨通道建模能力,并呼吁开发更多结构多样的评估集。本研究的代码已公开。

英文摘要

Many recent multivariate time series anomaly detection (MTSAD) models incorporate cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this assumption on eight widely used public benchmarks by introducing a per-segment diagnostic framework that flags, for each labeled anomaly, whether at least one channel deviates individually from its normal history, whether the cross-channel correlation structure changes, or both. The framework shows that no cross-channel rupture occurs without an accompanying univariate deviation across a range of reasonable thresholds. A complementary metric also reveals that on six of the eight benchmarks, at least half of the labeled anomaly segments deviate univariately on 89% to 100% of their timesteps, reaching 100% on three of these datasets. To verify that our framework captures cross-channel structure when present, we construct synthetic data of phase-shifted sinusoidal channels with shared noise. Each anomalous segment is altered through one of two channel-wise corruptions that preserve the per-channel marginal distribution while breaking cross-channel structure, and our framework correctly characterizes these segments as cross-channel-only. On these data, channel-dependent (CD) models successfully exploit the cross-channel signal whereas channel-independent (CI) ones fail. The CI/CD comparison of a recent SOTA detector on real benchmarks further confirms that CD modeling brings no measurable gain. We conclude that current MTSAD benchmarks are unsuitable for validating cross-channel modeling capabilities, and we call for the development of more structurally diverse evaluation sets. The code for this study is publicly available.
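
The per-segment diagnostic described above can be sketched for two channels. This is an assumption-laden illustration, not the authors' framework: it uses a z-score test against normal history for the univariate check and a change in Pearson correlation for the cross-channel check, and both thresholds are hypothetical.

```python
from statistics import mean, pstdev

def univariate_deviates(history, segment, z_thresh=3.0):
    """True if any segment value lies more than z_thresh std devs from history."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return any(v != mu for v in segment)
    return any(abs(v - mu) / sigma > z_thresh for v in segment)

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either series is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5

def diagnose(hist_a, hist_b, seg_a, seg_b, z_thresh=3.0, corr_thresh=0.5):
    """Label a segment as univariate, cross-channel, both, or neither."""
    uni = (univariate_deviates(hist_a, seg_a, z_thresh)
           or univariate_deviates(hist_b, seg_b, z_thresh))
    cross = abs(pearson(hist_a, hist_b) - pearson(seg_a, seg_b)) > corr_thresh
    if uni and cross:
        return "both"
    if uni:
        return "univariate"
    return "cross-channel" if cross else "neither"

history = [0, 1, 2, 3, 4, 5, 6, 7]  # two identical, perfectly correlated channels
spike = diagnose(history, history, [2, 3, 4, 50], [2, 3, 4, 5])
flipped = diagnose(history, history, [2, 3, 4, 5], [5, 4, 3, 2])
print(spike, flipped)
```

The second toy segment stays within each channel's normal range but reverses the relationship between channels, which is the kind of cross-channel-only anomaly the paper reports as absent from the real benchmarks.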

2606.03504 2026-06-11 cs.CL cs.AI 版本更新

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

BaltiVoice: 巴尔蒂语语音语料库与微调Whisper ASR系统

Muhammad Ali

AI总结 针对无公开ASR资源的巴尔蒂语,构建16.8小时朗读语音语料库并微调Whisper-small模型,在验证集上词错误率从159.19%降至26.74%。

详情
Comments
6 pages, 3 figures, 4 tables. Code and data available at this https URL
AI中文摘要

我们提出了BaltiVoice,一个16.8小时的朗读语音语料库,用于巴尔蒂语(ISO 639-3: bft),这是一种在巴基斯坦吉尔吉特-巴尔蒂斯坦地区使用的藏语支语言,此前没有公开可用的ASR资源。该语料库包含10,060条经过验证的、以本地Nastaliq文字书写的话语,源自Mozilla Common Voice录音。在此语料库上微调OpenAI Whisper-small后,在包含538条话语、说话人不重叠的验证集上取得了26.74%的词错误率(WER)和8.67%的字符错误率(CER),而零样本基线为159.19% WER和152.52% CER。在相同数据上微调的Whisper-base取得了44.54% WER和15.61% CER,证实了在这一低资源场景下模型容量的重要性。该数据集、微调模型以及实时转录演示均在HuggingFace上公开提供。

英文摘要

We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. Fine-tuning OpenAI Whisper-small yields a Word Error Rate (WER) of 26.74% and a Character Error Rate (CER) of 8.67% on a 538-utterance speaker-disjoint validation set, down from a zero-shot baseline of 159.19% WER and 152.52% CER. A Whisper-base fine-tuned on the same data achieves 44.54% WER and 15.61% CER, confirming that model capacity matters for this low-resource setting. The dataset, fine-tuned model, and a live transcription demo are publicly available on HuggingFace.
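
A zero-shot WER above 100% can look like an error, but it follows from the definition: WER divides the word-level edit distance (substitutions, deletions, insertions) by the reference length, so a hypothesis with many inserted words can exceed 1. A minimal sketch of the standard computation, with hypothetical example strings:

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# Three inserted words against a two-word reference give a WER of 150%.
over_100 = wer("a b", "a x y z b")
perfect = wer("a b c", "a b c")
print(over_100, perfect)
```

CER is the same computation over characters instead of words.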

2606.07001 2026-06-11 cs.DB cs.AI 版本更新

DataEvolver: Automatic Data Preparation for Large Language Models through Multi-Level Self-Evolving

DataEvolver: 通过多级自我进化实现大型语言模型的自动数据准备

Chao Deng, Shaolei Zhang, Ju Fan, Xiaoyong Du

AI总结 提出DataEvolver,首个自我进化的数据准备系统,通过多级机制自动构建管道将原始数据转化为高质量数据,在七个基准上平均提升下游LLM性能10%。

详情
AI中文摘要

高质量训练数据对大型语言模型(LLMs)至关重要,通常需要大量且昂贵的人工整理。现有的自动数据准备方法依赖于预定义管道或定制化人工指令,这限制了它们对不同数据分布的适应性,并且缺乏来自高质量示例的原则性指导。在本文中,我们介绍了DataEvolver,这是首个自我进化的数据准备系统,能够自动构建管道将原始数据转化为高质量数据。DataEvolver采用多级机制来确保管道的可执行性和有效性。在算子级别,它逐步扩展算子集以构建逻辑计划,同时解决依赖冲突。在管道级别,它将逻辑计划实例化为可执行代码,并通过反馈循环迭代优化管道编排,从而减少准备数据与高质量示例之间的分布差距。在七个基准上的实验表明,与在原始数据上训练相比,DataEvolver显著提高了数据质量,并使下游LLM性能平均提升10%,突显了LLM与数据迭代协同进化的新机遇。

英文摘要

High-quality training data is essential to large language models (LLMs) and typically requires extensive and costly manual curation. Existing automatic data preparation methods rely on predefined pipelines or customized human instructions, which limits their adaptability to diverse data distributions and lacks principled guidance from high-quality examples. In this paper, we introduce DataEvolver, the first self-evolving data preparation system that automatically constructs pipelines to transform raw data into high-quality data. DataEvolver employs a multi-level mechanism to ensure both pipeline executability and effectiveness. At the operator level, it incrementally expands the operator set to construct a logical plan while resolving dependency conflicts. At the pipeline level, it instantiates logical plans into executable code and iteratively refines pipeline orchestration through a feedback loop that reduces the distribution gap between prepared data and high-quality examples. Experiments on seven benchmarks show that DataEvolver substantially improves data quality and achieves an average 10% gain in downstream LLM performance compared with training on original data, highlighting new opportunities for the iterative co-evolution of LLMs and data.

2606.07226 2026-06-11 cs.LG cs.AI cs.CL 版本更新

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

DEFINED: 辩论场景中细粒度创造力评估的数据高效计算框架

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

发表机构 * Nanjing University Shanghai Innovation Institute East China Normal University

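AI summary: Proposes the DEFINED framework, which combines a hierarchical eight-dimensional metric system, a pre-trained language model, and a mixed-granularity training strategy to achieve data-efficient, fine-grained automatic creativity assessment in debate scenarios, outperforming existing methods.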

详情
Comments
Accepted by KDD 2026
AI中文摘要

人类创造力已成为大语言模型时代的关键能力。在复杂、开放环境中评估创造力是数据挖掘领域的一大挑战,目前受限于对标准化简单任务的依赖以及细粒度专家数据的稀缺。作为生态有效的评估场景,辩论反映了创造力的多个维度,涵盖发散思维和收敛思维。此外,辩论是一个数据丰富的领域,拥有大量公开可获取的材料。当前主流的自动评分方法难以适应辩论等复杂场景,因此仍然依赖昂贵的人工评估。为此,本文提出DEFINED,一种数据高效的计算框架,用于辩论场景中的细粒度创造力评估。DEFINED通过层次化的八维指标体系操作化辩论创造力,采用预训练自回归语言模型,并配备支持细粒度和粗粒度评估的层次化评分头。从真实辩论比赛中获取陈述及其相关专家评分,并采用约束数据增强策略以解决原始数据中的精英偏差。DEFINED采用混合粒度训练策略,能够从训练有素的研究生专家提供的有限细粒度监督中实现鲁棒学习。为严格验证超越合成基准的生态效度,我们纳入了一项针对辩论新手参与者的实证研究,利用这些真实数据作为中低水平人群的定性案例研究。在我们的评估协议中,评分模型实现了准确且稳定的评分,优于基于提示的大语言模型评估器和现有的辩论评分方法。

英文摘要

Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

2606.07591 2026-06-11 cs.LG cs.AI cs.CL 版本更新

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

ResearchClawBench: 端到端自主科学研究基准

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

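AI summary: Proposes the ResearchClawBench benchmark, with 40 tasks across 10 domains, which evaluates autonomous research capability using multimodal rubrics; the strongest agent scores only 21.5, revealing shortcomings of current systems in experimental protocol, evidence matching, and scientific core.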

详情
AI中文摘要

AI编码智能体越来越多地用于科学工作,但其端到端自主研究能力仍然难以验证。我们提出了ResearchClawBench,一个用于评估自主科学研究的基准,涵盖来自10个科学领域的40个任务。每个任务基于一篇真实发表论文,提供相关文献和原始数据,并在评估期间隐藏目标论文。专家策划的多模态评分标准将目标科学制品分解为加权标准,从而能够评估目标论文级别的重新发现,同时为新发现留出空间。我们在统一协议下评估了七个自主研究(auto-research)智能体,并通过轻量级ResearchHarness评估了十七个原生LLM。当前系统远未达到可靠的重新发现:最强的自主智能体Claude Code平均得分为21.5,最强的ResearchHarness LLM Claude-Opus-4.7平均得分为20.7,LLM前沿均值仅为26.5。错误分析表明,失败集中在实验协议不匹配、证据不匹配和缺失科学核心。ResearchClawBench为衡量自主科学研究进展提供了一个可复现的评估前沿。

英文摘要

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
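
The rubric design above decomposes each target artifact into weighted criteria. A minimal sketch of how such a rubric could be aggregated into a single task score follows; the 0-100 scale, the normalization by total weight, and the `rubric_score` helper are assumptions, since the abstract does not specify the scoring formula.

```python
def rubric_score(criteria):
    """Aggregate weighted rubric criteria into one score on an assumed 0-100 scale.

    criteria: list of (weight, score) pairs, with each score in [0, 1].
    """
    total_weight = sum(weight for weight, _ in criteria)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    for _, score in criteria:
        if not 0.0 <= score <= 1.0:
            raise ValueError("each criterion score must lie in [0, 1]")
    weighted = sum(weight * score for weight, score in criteria)
    return 100.0 * weighted / total_weight


# Hypothetical task: one heavily weighted criterion met, one lighter criterion missed.
print(rubric_score([(3, 1.0), (1, 0.0)]))  # 75.0
```

Under this reading, a low average such as the reported 21.5 would mean that most of the weighted criteria for a target paper were not reproduced.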

2606.08415 2026-06-11 cs.CV cs.AI 版本更新

CoVEBench: Can Video Editing Models Handle Complex Instructions?

CoVEBench: 视频编辑模型能处理复杂指令吗?

Jiangtao Wu, Jiaming Wang, Yiwen He, Yuanxing Zhang, Shihao Li, Dunyuan Liu, Xuedong Zhao, Jialu Chen, Zekun Moore Wang, Jiaheng Liu

发表机构 * Nanjing University(南京大学) Kuaishou Technology(快手科技)

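AI summary: Proposes the CoVEBench benchmark, with 416 source videos and 626 multi-point editing instructions, which uses MLLM judges to assess instruction compliance and fidelity, revealing that current models often omit edits or violate preservation constraints in compositional editing.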

详情
Comments
34 pages, 11 figures, 9 tables
AI中文摘要

虽然近期基于文本引导的视频编辑模型在基础任务(如风格迁移、物体插入)上表现出色,但现实用户请求具有高度组合性。单个提示通常要求多个耦合编辑,例如同时修改主体、动作和相机视角,同时严格保留无关的时空内容。现有基准受限于孤立编辑和粗粒度全局指标,无法诊断模型如何处理此类复杂工作流。为弥补这一空白,我们引入CoVEBench,一个组合视频编辑基准,包含416个精心策划的源视频、626条多点编辑指令和9,990个细粒度检查项。CoVEBench覆盖多样化的编辑维度,通过MLLM评判的指令遵循度和视频保真度,以及视频质量的自动指标来评估模型。大量实验表明,组合编辑仍然是一个深层次的挑战:当前模型在处理多个操作同时进行时,经常遗漏编辑、违反保留约束或引入伪影。CoVEBench为推进视频编辑向现实用户工作流发展提供了一个具有挑战性的诊断测试平台。

英文摘要

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.

10. AI应用与系统 55 篇

2606.11207 2026-06-11 cs.AI cs.CL 新提交

From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference

从显式元素到隐式意图:用于可审计行为推断的预定义库

Liu hung ming

发表机构 * PARRAWA AI

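AI summary: Proposes the SemantiClean framework, which extracts structured semantic signals from e-commerce session data through a shared element library to drive pluggable inference targets, prioritizing auditability and reproducibility over accuracy alone.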

详情
Comments
20 pages, 9 tables
AI中文摘要

我们提出SemantiClean,一个模块化框架,用于从电商会话数据中提取结构化语义信号,并通过共享元素库驱动可插拔推断目标,包括购买意图、客户细分和产品亲和性。与仅优化准确率的传统端到端预测器不同,SemantiClean优先考虑可审计性、结构治理和sigma=0可复现性,明确牺牲边际预测增益以换取元素级透明度和可辩护的决策轨迹。该框架基于在线购物者购买意图(OSPI)数据集,将24个行为元素组织成四层架构(功能层、交互层、系统层、上下文层),并通过三种抗通胀机制强制信号质量:RedundancyGroup贡献上限、TieredPenaltyCalculator偏差惩罚和AdaptiveConstraintMode冷启动处理。本文介绍了LLM集成语义推断引擎,一个完全实现的两阶段LLM驱动推断架构,在推断时利用完整的元素元数据。本文报告的所有定量结果均由该引擎产生。确定性引擎输出完全可复现(sigma=0);LLM相关结果(E8、E10)在固定提供者/模型/温度设置下受控输出可变性。性别推断目标在当前实现中非功能性,已从所有定量结果中排除。

英文摘要

We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable inference targets including purchase intent, customer segmentation, and product affinity through a shared element library. Unlike conventional end-to-end predictors that optimise solely for accuracy, SemantiClean prioritises auditability, structural governance, and sigma=0 reproducibility, explicitly trading marginal predictive gains for element-level transparency and defensible decision trails. Built upon the Online Shoppers Purchasing Intention (OSPI) dataset, the framework organises twenty-four behavioural elements into a four-layer architecture (Functional, Interaction, Systemic, Contextual) and enforces signal quality through three anti-inflation mechanisms: RedundancyGroup contribution caps, TieredPenaltyCalculator bias penalties, and AdaptiveConstraintMode cold-start handling. This report introduces the LLM-Integrated Semantic Inference Engine, a fully implemented two-phase LLM-driven inference architecture that leverages complete element metadata at inference time. All quantitative results reported herein are produced by this engine. Deterministic engine outputs remain fully reproducible (sigma=0); LLM-dependent results (E8, E10) are subject to controlled output variability under fixed provider/model/temperature settings. The gender inference target remains non-functional in the current implementation and is excluded from all quantitative results.

2606.11675 2026-06-11 cs.AI 新提交

Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Lung-R1:知识图谱引导的肺部诊断推理大语言模型

Haoyang Zeng, Yuanxi Fu, Rongzhen Li, Yuming Yang, Xiao Sun, Jingwang Huang, Gujie Shao, Guohui Xiang, Quan Lu, Dongfan Ye, Xuetao Chen, Jiang Zhong, Kaiwen Wei, Zhi Xu

发表机构 * School of Computer Science, Chongqing University(重庆大学计算机学院) AI Research Institution, Mashang Financial Institution(马上金融人工智能研究院) Department of Information, Third Military Medical University(陆军军医大学信息系)

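AI summary: Proposes the LungKG knowledge graph and the Lung-R1 model, which use KG-constrained reasoning-chain construction and reinforcement learning to close the gap between pulmonary knowledge and case-level diagnosis, reaching state-of-the-art results on EMR diagnosis tasks.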

详情
AI中文摘要

诊断肺部疾病需要在表型变异性和跨疾病重叠中整合异质性证据。尽管大语言模型(LLMs)在肺部知识问答和信息处理任务上取得了进展,但可靠的肺部诊断需要对电子病历证据进行患者特异性的、关系感知的推理,而非孤立的知识回忆。我们将肺部知识与病例级诊断推理之间的这一差距定义为肺部知识到诊断的差距。为解决这一问题,我们引入了LungKG,这是第一个用于诊断知识组织和记录基础推理的结构化肺部知识图谱。LungKG包含59,038个节点和164,308条边,涵盖15种实体类型和112种关系类型,既作为可重用的肺部知识资源,也作为LungKG引导模型适应的基础。基于LungKG,我们提出了Lung-R1,一种通过KG约束的推理链构建和KG引导的强化学习训练的LungKG引导的肺部LLM。在20个系统的评估中,Lung-R1-14B在选择题、肺部问答和EMR诊断任务上均达到最先进性能,EMR诊断得分为4.3583,超过最强非Lung-R1基线0.1476分。这些结果证明了LungKG引导训练对基于EMR的肺部诊断的价值。

英文摘要

Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although large language models (LLMs) have shown progress on pulmonary knowledge question answering (QA) and information-processing tasks, reliable pulmonary diagnosis requires patient-specific, relation-aware reasoning over electronic medical record (EMR) evidence rather than isolated knowledge recall. We define this gap between pulmonary knowledge and case-level diagnostic reasoning as the Pulmonary Knowledge-to-Diagnosis Gap. To address it, we introduce LungKG, the first structured pulmonary knowledge graph for diagnostic knowledge organization and record-grounded reasoning. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, serving as both a reusable pulmonary knowledge resource and the foundation for LungKG-guided model adaptation. Built on LungKG, we propose Lung-R1, a LungKG-guided pulmonary LLM trained through KG-constrained reasoning-chain construction and KG-guided reinforcement learning. In a 20-system evaluation, Lung-R1-14B achieves state-of-the-art performance across Choice, Pulmonary-QA, and EMR Diagnosis, reaching an EMR Diagnosis score of 4.3583 and surpassing the strongest non-Lung-R1 baseline by 0.1476 points. These results demonstrate the value of LungKG-guided training for EMR-based pulmonary diagnosis.

2606.11874 2026-06-11 cs.AI 新提交

AutoMine Solution for AV2 2026 Scenario Mining Challenge

AutoMine 解决方案:面向 AV2 2026 场景挖掘挑战

Songliang Cao, Jiele Zhao, Yuru Wang, Hao Li, Daqi Liu, Zehan Zhang, Fangzhen Li, Yu Wang, Yue Zhang, Bing Wang, Guang Chen, Hao Lu, Hangjun Ye

发表机构 * Xiaomi EV(小米汽车) Huazhong University of Science and Technology(华中科技大学)

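AI summary: Proposes AutoMine, a self-refining scenario mining method based on LLMs and VLMs that combines semantics-preserving prompt augmentation, robust trajectory atomic functions paired with VLM-based functions, and execution-feedback refinement, achieving leading performance in the CVPR 2026 challenge.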

详情
Comments
CVPR 2026 Scenario Mining Challenge (Temporal Track Winners)
AI中文摘要

随着自动驾驶系统的发展,从大规模驾驶日志中挖掘高价值、安全关键且与规划相关的场景已成为数据驱动评估的关键。本文提出 AutoMine,一种基于 LLM 和 VLM 的鲁棒自优化场景挖掘方法。AutoMine 使用语义保持提示增强来降低 LLM 提示敏感性,结合鲁棒轨迹原子函数与基于 VLM 的函数以处理感知噪声和开放世界视觉线索,并通过真实日志的执行反馈来优化生成的代码。在 CVPR 2026 的 Argoverse 2 场景挖掘竞赛中,AutoMine 取得了 36.38 的 HOTA-Temporal 分数和 77.21 的 Timestamp BA 分数。

英文摘要

With the development of autonomous driving systems, mining high-value, safety-critical, and planning-relevant scenarios from large-scale driving logs has become essential for data-driven evaluation. In this paper, we propose AutoMine, a robust self-refining scenario mining method based on LLMs and VLMs. AutoMine uses semantics-preserving prompt augmentation to reduce LLM prompt sensitivity, combines robust trajectory atomic functions with VLM-based functions to handle perception noise and open-world visual cues, and refines generated code through execution feedback from real logs. In the Argoverse 2 Scenario Mining Competition at CVPR 2026, AutoMine achieves a HOTA-Temporal score of 36.38 and a Timestamp BA score of 77.21.

2606.12025 2026-06-11 cs.AI 新提交

Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

人类增强循环建模(HELM):基于智能体的混凝土桥梁护栏有限元建模

Quankai Wang, Yulin Xie, Tongfei Yang, Minghui Cheng, Ran Cao

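AI summary: Proposes the HELM framework, which uses human-agent collaboration to decompose finite element modeling into verifiable checkpoints, raising the autonomous modeling success rate from 20% to 75% under MASH TL-4 and TL-5 conditions.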

详情
AI中文摘要

对桥梁护栏等安全关键基础设施进行有限元(FE)建模需要高保真非线性动态分析,然而当前的FE建模过程仍然劳动密集且缺乏自动化。本文提出了人类增强循环建模(HELM)框架,这是一种协作式人机协议,将长序列有限元建模分解为几何生成、边界条件定义和材料分配等离散的、可视觉验证的检查点。该框架通过一个包含20个案例的钢筋混凝土桥梁护栏矩阵在MASH TL-4和TL-5侧向荷载条件下进行演示,将专用智能体与两种广泛使用的商业FE软件(即ANSYS和LS-PrePost)对接。实验结果表明,HELM将基线自主建模成功率从20%提高到75%,其中几何和边界条件任务的智能体级通过率大约翻倍。误差分析显示,空间推理和代数逻辑限制构成了主要的失败模式,突显了结构化人在回路干预对建模自动化的价值。完整的智能体设计代码和提示已开源,可访问:此 https URL。

英文摘要

Finite element (FE) modeling of safety-critical infrastructure such as bridge barriers requires high-fidelity nonlinear dynamic analysis, yet the current FE modeling process remains labor-intensive and lacks automation. This paper presents the Human-Enhanced Loop Modeling (HELM) framework, a collaborative human-agent protocol that decomposes long-sequence finite element modeling into discrete, visually verifiable checkpoints across geometry generation, boundary condition definition, and material assignment. The framework is demonstrated through a 20-case matrix of reinforced concrete bridge barriers under MASH TL-4 and TL-5 lateral loading conditions, interfacing specialized agents with two widely used commercial FE software packages, ANSYS and LS-PrePost. Experimental results show that HELM improves the baseline autonomous modeling success rate from 20% to 75%, with agent-level pass rates for geometry and boundary condition tasks approximately doubling. Error analysis reveals that spatial reasoning and algebraic logic limitations constitute the primary failure modes, underscoring the value of structured human-in-the-loop intervention for modeling automation. The complete agent design code and prompts are open-sourced and can be accessed at: this https URL.

2606.12040 2026-06-11 cs.AI cs.GR 新提交

A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

一种用于自动混凝土护栏设计的轻量级多智能体框架

Wanting Wang, Xiye Ma, Yuyang He, Minghui Cheng, Ran Cao

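AI summary: Proposes an AutoGen-based closed-loop "generation-evaluation-optimization" multi-agent framework for automated concrete barrier design, with accuracy above 98%; an 8B-parameter lightweight model can outperform 631B-parameter flagship models.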

详情
AI中文摘要

钢筋混凝土公路护栏的设计是一个安全关键过程,需要严格遵守AASHTO-LRFD桥梁设计指南等监管规定。当前的工程实践严重依赖手动、迭代和启发式计算来满足复杂的非线性材料和力学约束。尽管大型语言模型(LLMs)表现出强大的生成能力,但它们在结构工程中的直接应用仍受到幻觉风险和物理基础不足的限制。为了解决这些挑战,本研究提出了一种新颖的“生成-评估-优化”闭环框架,利用AutoGen的多智能体编排能力实现混凝土护栏的自动设计。实验结果表明,所提出的智能体框架实现了超过98%的设计准确率,显著优于独立的通用LLMs。更重要的是,研究揭示了设计性能不一定与模型规模相关,8B参数的轻量级模型可以胜过无约束的631B参数旗舰模型。这一发现凸显了在降低计算成本的同时提高AI辅助工程工具在工业应用中的可及性的潜力。所提出的多智能体设计框架的源代码可在项目GitHub仓库中获取:this https URL。关键词:结构工程;多智能体系统;大型语言模型;混凝土护栏设计;AutoGen;设计自动化。

英文摘要

The design of reinforced concrete highway barriers is a safety-critical process that requires strict compliance with regulatory provisions such as the AASHTO-LRFD bridge design guidelines. Current engineering practice relies heavily on manual, iterative, and heuristic calculations to satisfy complex nonlinear material and mechanics constraints. Although Large Language Models (LLMs) demonstrate strong generative capabilities, their direct application to structural engineering remains limited by hallucination risks and insufficient physical grounding. To address these challenges, this study proposes a novel "generation-evaluation-optimization" closed-loop framework for automated concrete barrier design using the multi-agent orchestration capabilities of AutoGen. Experimental results demonstrate that the proposed agentic framework achieves over 98% design accuracy, significantly outperforming standalone general-purpose LLMs. More importantly, the study reveals that design performance is not necessarily correlated with model scale, where an 8B-parameter lightweight model could outperform unconstrained 631B-parameter flagship models. This finding highlights the potential to substantially reduce computational costs while improving the accessibility of AI-assisted engineering tools for industry applications. The source code for the proposed multi-agent design framework is available at the project GitHub repository: this https URL. Keywords: Structural Engineering; Multi-Agent Systems; Large Language Models; Concrete Barrier Design; AutoGen; Design Automation.

2606.12329 2026-06-11 cs.AI 新提交

PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents

PROJECTMEM:面向AI编码代理的本地优先、事件溯源记忆与判断层

Ripon Chandra Malo, Tong Qiu

发表机构 * University of Utah(犹他大学)

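AI summary: Proposes PROJECTMEM, a local-first, event-sourced memory and judgment layer that records an event log and generates compact summaries to help AI coding agents avoid repeating mistakes, realizing Memory-as-Governance.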

详情
Comments
12 pages, 5 figures, 1 table. Code: this https URL
AI中文摘要

AI编码助手现在支持越来越多的软件工作,从快速脚本到生产应用。然而,这些代理在很大程度上仍然是无状态的:每个新会话都会重新读取项目文件,重新推导之前的决策,并且——最昂贵的是——可能会重复已经失败的调试尝试。重建这种上下文每个会话估计消耗5,000-20,000个令牌;瓶颈通常不是模型能力,而是缺失的项目记忆。我们提出了projectmem,一个面向AI编码代理的开源、本地优先的记忆与判断层。projectmem将开发记录为一个仅追加的纯文本事件日志,包含类型化事件——问题、尝试、修复、决策和笔记——并通过模型上下文协议(MCP)将该日志确定性地投影为紧凑的、AI可读的摘要。除了存储,projectmem还添加了一个确定性的前置动作门,在代理重复之前失败的修复或编辑已知脆弱文件之前警告它。我们将其定义为记忆即治理:记忆不仅回答代理,还作用于其下一个动作。该系统完全离线运行,无遥测;其不可变日志也作为可重现、可审计的AI辅助开发的溯源轨迹。projectmem作为一个三依赖的Python包发布(14个MCP工具,19个CLI命令,37个自动化测试),并通过一个为期两个月的自我研究进行评估,涉及10个项目,包含207个记录事件。源代码:此 https URL。

英文摘要

AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and, most costly, may repeat debugging attempts that already failed. Reconstructing this context can consume an estimated 5,000-20,000 tokens per session; the bottleneck is often not model capability but missing project memory. We present projectmem, an open-source, local-first memory and judgment layer for AI coding agents. projectmem records development as an append-only, plain-text event log of typed events (issues, attempts, fixes, decisions, and notes) and deterministically projects that log into compact, AI-readable summaries served through the Model Context Protocol (MCP). Beyond storage, projectmem adds a deterministic pre-action gate that warns an agent before it repeats a previously failed fix or edits a known-fragile file. We frame this as Memory-as-Governance: memory that does not merely answer the agent but acts on its next action. The system runs fully offline with no telemetry; its immutable log also serves as a provenance trail for reproducible, auditable AI-assisted development. projectmem ships as a three-dependency Python package (14 MCP tools, 19 CLI commands, 37 automated tests) and is evaluated through a two-month self-study across 10 projects comprising 207 logged events. Source code: this https URL.
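
The three ideas in this abstract (an append-only log of typed events, a deterministic projection into a summary, and a pre-action gate) can be sketched in a few lines. The sketch below is an in-memory illustration only: `Event`, `EventLog`, `project_summary`, and `pre_action_gate` are hypothetical names, the exact-match comparison is a simplification, and nothing here reflects the real projectmem API or file format.

```python
import json
from dataclasses import asdict, dataclass

EVENT_KINDS = {"issue", "attempt", "fix", "decision", "note"}


@dataclass(frozen=True)
class Event:
    kind: str          # one of EVENT_KINDS
    text: str          # free-text description of the event
    outcome: str = ""  # e.g. "failed" for an attempt that did not work


class EventLog:
    """Append-only log: lines are added, never edited or removed."""

    def __init__(self):
        self._lines = []

    def append(self, event):
        if event.kind not in EVENT_KINDS:
            raise ValueError(f"unknown event kind: {event.kind}")
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def events(self):
        return [Event(**json.loads(line)) for line in self._lines]


def project_summary(log):
    """Deterministic projection: the same log always yields the same summary."""
    counts = {kind: 0 for kind in sorted(EVENT_KINDS)}
    failed = set()
    for event in log.events():
        counts[event.kind] += 1
        if event.kind == "attempt" and event.outcome == "failed":
            failed.add(event.text)
    return {"counts": counts, "failed_attempts": sorted(failed)}


def pre_action_gate(log, proposed_fix):
    """Return a warning if the proposed fix matches a failed attempt, else None."""
    wanted = proposed_fix.strip().lower()
    for event in log.events():
        if (event.kind == "attempt" and event.outcome == "failed"
                and event.text.strip().lower() == wanted):
            return f"warning: '{proposed_fix}' was already tried and failed"
    return None


log = EventLog()
log.append(Event("issue", "import error on startup"))
log.append(Event("attempt", "pin numpy to 1.26", outcome="failed"))
print(pre_action_gate(log, "Pin numpy to 1.26") is not None)  # True
```

Because the gate reads only from the log and involves no model call, its verdict for a given log and proposal is reproducible, which is the property the abstract describes as deterministic.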

2606.11197 2026-06-11 eess.AS cs.AI cs.CL cs.SD 交叉投稿

MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

MA-DLE: 基于记忆增强的语音自动抑郁程度估计

Xuzhi Wang, Xinran Wu, Ziping Zhao, Jianhua Tao, Björn W. Schuller

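AI summary: Proposes a memory-augmented feature method that selectively integrates historical temporal features and dynamic memory features and, combined with a hierarchical attention fusion module, achieves state-of-the-art performance on the DAIC-WOZ and E-DAIC datasets.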

详情
Comments
Accepted at IEEE TAC
AI中文摘要

基于语音的抑郁程度自动估计对于实现早期检测和及时干预至关重要,尤其是在资源受限的心理健康环境中。近年来,深度学习在包括情感计算和心理健康评估在内的多个领域取得了显著成功。现有方法大多依赖基于RNN的架构(如LSTM和GRU)来建模时间信息以进行抑郁估计。然而,提取的特征往往只强调少数相邻语音片段,限制了其捕捉长程依赖的能力。为克服这一局限,我们引入了一种基于记忆的特征增强方法,以增强GRU提取特征的表示能力。我们的记忆库并非不加区分地整合历史数据,而是设计为选择性整合两类组件以减少冗余和不相关性:(1) 与当前GRU输出高度相似的历史时序特征,提供互补的上下文信息;(2) 基于特征变异性识别的动态记忆特征,捕捉指示抑郁症状的行为和情绪波动。为有效融合记忆增强特征与GRU输出,我们进一步设计了层次注意力融合(HAF)模块。我们的方法在广泛使用的DAIC-WOZ和E-DAIC数据集上进行了评估,取得了最先进的性能。

英文摘要

Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has demonstrated impressive success across various domains, including affective computing and mental health assessment. Most existing approaches rely on RNN-based architectures (such as LSTM and GRU) to model temporal information for depression estimation. However, the extracted features often emphasize only a few adjacent speech segments, limiting their ability to capture long-range dependencies. To overcome this limitation, we introduce a memory-based feature augmentation method that enhances the representational capacity of GRU-extracted features. Rather than indiscriminately incorporating historical data, our memory bank is designed to selectively integrate two types of components in order to reduce redundancy and irrelevance: (1) historical temporal features that closely resemble the current GRU output, offering complementary contextual information; and (2) dynamic memory features identified based on feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms. To effectively fuse the memory-augmented features with GRU outputs, we further design a Hierarchical Attention Fusion (HAF) module. Our method is evaluated on the widely used DAIC-WOZ and E-DAIC datasets, achieving state-of-the-art performance.
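
The two selection rules for the memory bank can be illustrated with plain vectors. The sketch below assumes cosine similarity for rule (1) and the Euclidean distance between consecutive time steps as the variability measure for rule (2); the abstract names neither measure, and `select_similar` and `select_dynamic` are hypothetical helpers, not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_similar(history, current, k):
    """Rule (1): indices of the k past features most similar to the current output."""
    ranked = sorted(range(len(history)),
                    key=lambda i: cosine(history[i], current),
                    reverse=True)
    return ranked[:k]


def select_dynamic(history, k):
    """Rule (2): indices of the k time steps that changed most from the previous step."""
    change = {i: math.dist(history[i], history[i - 1])
              for i in range(1, len(history))}
    return sorted(change, key=change.get, reverse=True)[:k]


# Toy 2-D features standing in for GRU outputs over four speech segments.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [5.0, 5.0]]
print(select_similar(history, [1.0, 0.0], k=2))  # [0, 2]
print(select_dynamic(history, k=1))              # [3]
```

The selected features would then be fused with the current GRU output, which is the role the abstract assigns to the Hierarchical Attention Fusion module.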

2606.11210 2026-06-11 cs.CL cs.AI cs.MM 交叉投稿

T2MM: An LLM Supported Architecture For Inquiry-Based Modeling

T2MM:一种支持基于探究建模的LLM架构

John Kos, Rudra Singh, Ashok Goel

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

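AI summary: Proposes the T2MM architecture, which uses an LLM to generate interactive models within the ecological modeling software VERA, outperforming a full-code-generation baseline.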

详情
Comments
16 pages, 4 figures
AI中文摘要

模型构建是科学学习中的基础实践,依赖于可视化和交互性。大型语言模型(LLM)越来越多地增强多模态能力,并已集成到教育环境中以支持学习。然而,这些工具缺乏某些学习环境所需的视觉交互性。我们提出了文本到多模态模型(T2MM),这是一种稳健、动态的LLM支持架构,可在开放探究生态建模软件虚拟实验研究助手(VERA)中辅助模型构建。T2MM考虑学习者模型的当前上下文,并创建交互式模型(而非静态图像),使模型能够对人工调整保持响应。为了衡量技术可行性,我们通过一个自定义的程序生成数据集(包含自然语言学习者建模请求和VERA系统中的目标模型)来评估T2MM。在所有测量的成功指标上,T2MM优于通过LLM支持的全代码生成实现的基线模型生成架构(这在文献中很常见)。我们的贡献不仅概述了将LLM集成到基于探究的学习建模工具中,还描述了一种可能的架构,通过该架构可以创建更具交互性的多模态LLM工具。

英文摘要

Model construction is a foundational practice in science learning that relies on visualization and interactivity. Large Language Models, increasingly augmented with multimodal capabilities, have been integrated into educational contexts to support learning. However, these tools lack the visual interactivity that some learning contexts require. We introduce Text to Multimodal Model (T2MM), a robust, dynamic LLM-supported architecture that assists in model construction within the open inquiry ecology-based modeling software Virtual Experimental Research Assistant (VERA). T2MM accounts for the current context of the learner's model and creates interactive models, rather than static images, enabling the model to remain responsive to manual adjustment. To measure technical feasibility, we evaluate T2MM through a custom procedurally generated dataset of natural language learner modeling requests and target models within the VERA system. T2MM outperforms a baseline model generation architecture implemented through LLM-supported full code generation, common in the literature, across all measured success metrics. Our contribution not only outlines LLM integration into an inquiry-based learning modeling tool, but also describes a possible architecture through which more interactive multimodal LLM tools can be created.

2606.11238 2026-06-11 q-fin.GN cs.AI 交叉投稿

Artificial Intelligence in Ship Finance: Applications, Opportunities, and a Case Study in AI-Augmented Loan Origination

人工智能在船舶金融中的应用:机遇与AI增强贷款发起的案例研究

Lasse Dierich, Orestis Schinas

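AI summary: Examines applications of AI in ship finance and proposes an LLM-based modular architecture for document comprehension, information extraction, and workflow automation to support the loan application process.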

详情
Comments
9 pages, 1 figure
AI中文摘要

船舶金融是资产担保贷款中数据密集且文档繁重的领域,需要整合来自异构且高度非结构化来源的财务、技术、合同和监管信息。日益严格的环境法规和ESG报告要求进一步增加了承销和贷款发起流程的复杂性。人工智能(AI)的最新进展,特别是大语言模型(LLMs),为处理和分析此类信息创造了新的机遇。本文回顾了AI在船舶金融中的潜在应用,特别关注基于LLM的系统用于文档理解、信息提取和工作流自动化。我们提出了this http URL,一个模块化代理架构,用于支持船舶金融中的贷款申请工作流。所提出的系统结合了基于LLM的提取模块、财务分析组件、外部海事数据服务以及带有聊天机器人界面的受控文档生成模块,以支持标准化融资申请的准备工作。本文讨论了在生产中使用此类模型的关键挑战。我们认为,AI辅助系统可以支持海事金融专业人士管理日益复杂的信息和报告要求。

英文摘要

Ship finance is a data-intensive and document-heavy segment of asset-based lending, requiring the integration of financial, technical, contractual, and regulatory information from heterogeneous and largely unstructured sources. Increasing environmental regulation and ESG reporting requirements are adding further complexity to underwriting and loan-origination processes. Recent advances in artificial intelligence (AI), particularly large language models (LLMs), create new opportunities for processing and analysing such information. This paper reviews potential applications of AI in ship finance, with a particular focus on LLM-based systems for document comprehension, information extraction, and workflow automation. We present this http URL, a modular agentic architecture to support loan application workflows in ship finance. The proposed system combines an LLM-based extraction module, financial analysis components, external maritime data services, and a controlled document-generation module with a chatbot interface to support the preparation of standardized financing applications. The paper discusses the key challenges for using such models in production. We argue that AI-assisted systems can support maritime finance professionals in managing increasingly complex information and reporting requirements.

2606.11247 2026-06-11 cs.LG cs.AI cs.AR 交叉投稿

Physics-informed generative AI for semiconductor manufacturing: Enforcing hard physical constraints in generative models by construction

物理信息驱动的生成式AI在半导体制造中的应用:通过构造强制生成模型中的硬物理约束

Yaser Mike Banad, Sarah Sharif

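AI summary: For generative models in semiconductor manufacturing that must satisfy hard physical constraints, argues for enforcing the constraints by construction through integrated physics (for example physics-informed diffusion and PDE-constrained variational models) rather than by post-hoc filtering, and presents four integration patterns and future research directions.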

详情
AI中文摘要

生成模型越来越多地被用于为物理系统提出设计、数据和控制动作,然而许多此类系统受硬物理约束而非感知合理性支配。半导体制造提供了一个严苛的测试案例:生成的掩模、布局、合成缺陷数据和工艺配方必须遵守光刻、传输、反应和器件物理约束,因为物理无效的样本不仅质量低劣,而且无法使用。本文认为,半导体制造揭示了一个更广泛的计算科学挑战,即用于受约束物理领域的生成式AI必须通过构造实现物理信息驱动,而非仅通过事后过滤来纠正。我们调查了新兴的架构工具包,包括物理信息扩散、PDE约束变分模型、神经算子先验和守恒律尊重生成网络,并展示了它如何与可微分光刻、TCAD、工艺仿真和自主实验相联系。我们识别了生成模型与基于物理的模拟器之间的四种集成模式,并提出了一个以物理保真度基准、可微分模拟器基础设施以及面向物理设计和制造的多模态基础模型为中心的研究议程。核心主张是分析性的而非修辞性的:在物理有效性是成功的关键标准的情况下,通过构造强制约束的架构应被期望优于事后过滤的架构,而晶圆厂正是这种区别最鲜明的环境。

英文摘要

Generative models are increasingly used to propose designs, data, and control actions for physical systems, yet many such systems are governed by hard physical constraints rather than by perceptual plausibility. Semiconductor manufacturing provides a demanding test case: generated masks, layouts, synthetic defect data, and process recipes must obey lithography, transport, reaction, and device-physics constraints, because physically invalid samples are not merely low quality but unusable. This Perspective argues that semiconductor manufacturing exposes a broader computational-science challenge, namely that generative AI for constrained physical domains must be physics-informed by construction, not corrected only through post-hoc filtering. We survey the emerging architectural toolkit, including physics-informed diffusion, PDE-constrained variational models, neural-operator priors, and conservation-law-respecting generative networks, and show how it connects to differentiable lithography, TCAD, process simulation, and autonomous experimentation. We identify four integration patterns between generative models and physics-based simulators, and we propose a research agenda centered on physics-fidelity benchmarks, differentiable simulator infrastructure, and multimodal foundation models for physical design and manufacturing. The central claim is analytical rather than rhetorical: where physical validity is the binding criterion of success, architectures that enforce it by construction should be expected to outperform those that filter for it after the fact, and the fab is the setting where this distinction is sharpest.

2606.11264 2026-06-11 q-bio.QM cs.AI 交叉投稿

OmniBioTwin: A System-of-Twinned-Systems Framework for Health Digital Twins

OmniBioTwin:用于健康数字孪生的孪生系统之系统框架

Zhaohui Wang, Yu Huang, Jiang Bian

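AI summary: Proposes the OmniBioTwin framework, which achieves system-level integration of cross-scale health digital twins through modular twins and interaction operators in a multi-layer network architecture, demonstrated on GLP-1 signaling pathways in Alzheimer's disease.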

详情
AI中文摘要

健康数字孪生(HDT)有望实现患者特异性建模和决策支持,但目前的方法在结构上仍然碎片化:针对单个器官或任务的单一模型缺乏跨尺度保真度,而系统级孪生缺乏通用的架构框架。我们提出OmniBioTwin,一种孪生系统之系统(SoTS)框架,将HDT组织为模块化计算实体,通过多层网络架构中的显式交互算子进行耦合。该框架包括七个协调层——涵盖数据集成、自主孪生建模、跨尺度耦合、时间同步和人机交互决策支持。我们通过实例化阿尔茨海默病中胰高血糖素样肽-1(GLP-1)信号通路的多尺度孪生来演示OmniBioTwin,说明如何在统一系统中组合和耦合分子、细胞和器官级别的孪生。

英文摘要

Health digital twins (HDTs) promise patient-specific modeling and decision support, but current approaches remain structurally fragmented: monolithic models that address a single organ or task lack cross-scale fidelity, while system-level twins lack generalizable architectural frameworks. We propose OmniBioTwin, a System-of-Twinned-Systems (SoTS) framework that organizes HDTs as modular computational entities coupled through explicit interaction operators within a multi-layer network architecture. The framework comprises seven coordinated layers spanning data integration, autonomous twin modeling, cross-scale coupling, temporal synchronization, and human-in-the-loop decision support. We demonstrate OmniBioTwin by instantiating a multiscale twin for glucagon-like peptide-1 (GLP-1) signaling pathways in Alzheimer's disease, illustrating how molecular, cellular, and organ-level twins can be composed and coupled within a unified system.

2606.11286 2026-06-11 cs.LG cs.AI 交叉投稿

FreeBridge: Variational Schrödinger Bridges for Cellular Transition Dynamics

FreeBridge: 用于细胞转变动力学的变分薛定谔桥

Xurui Wang, Qin Ren, Jun Ma, Haibin Ling, Chenyu You

发表机构 * Stony Brook University(石溪大学) University of Toronto(多伦多大学) University Health Network(大学健康网络)

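AI summary: To address endpoint-only supervision in modeling cellular perturbations from high-content imaging, proposes FreeBridge, which learns stochastic transport on a fixed cellular manifold via a variational Schrödinger bridge and constrains intermediate paths with empirical latent support regularization, reducing intermediate support violations while preserving endpoint fidelity.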

详情
Comments
Accepted to MICCAI 2026 (early accept). Project page: this https URL
AI中文摘要

高内涵成像实验量化细胞对化学和遗传扰动的反应,但由于细胞在采集时被化学固定,单个细胞的连续轨迹无法观测。因此,扰动建模简化为推断仅在对照和处理群体之间观察到的随机传输,这些群体作为单独的边际分布。虽然最近的生成模型实现了强端点对齐,但边界一致性并不决定中间演化:多个随机过程可能连接相同的边际分布,同时穿过观察到的单细胞形态不支持的区域。我们引入了 FreeBridge,一种在仅端点监督下进行单细胞转变建模的薛定谔桥公式。FreeBridge 将原子状态定义为实例分割的单细胞表示,建立固定的细胞流形,并通过经验潜在支持正则化学习在此几何结构内约束的随机传输。在 BBBC021、RxRx1 和 JUMP 数据集上,FreeBridge 在统一评估协议下保持竞争性或改进的端点保真度和作用机制保留;在 BBBC021 上,它进一步减少了中间支持违规。这些发现强调了几何基础对于生物学可解释的扰动动力学的重要性。项目页面:此 https URL。

英文摘要

High-content imaging assays quantify cellular responses to chemical and genetic perturbations, yet continuous trajectories of individual cells are unobservable because cells are chemically fixed at acquisition. Perturbation modeling therefore reduces to inferring stochastic transport between control and treated populations observed only as separate marginals. While recent generative models achieve strong end-point alignment, boundary consistency does not determine intermediate evolution: multiple stochastic processes may connect identical marginals while traversing regions unsupported by observed single-cell morphologies. We introduce FreeBridge, a Schrödinger Bridge formulation for single-cell transition modeling under endpoint-only supervision. FreeBridge defines atomic states as instance-segmented single-cell representations, establishing a fixed cellular manifold, and learns stochastic transport constrained within this geometry via empirical latent support regularization. Across BBBC021, RxRx1, and JUMP, FreeBridge maintains competitive or improved endpoint fidelity and mechanism-of-action retention under a unified evaluation protocol; on BBBC021, it further reduces intermediate support violations. These findings highlight the importance of geometric grounding for biologically interpretable perturbation dynamics. Project page: this https URL.

2606.11357 2026-06-11 cs.DC cs.AI cs.AR cs.PF 交叉投稿

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

TileFuse:用于AMD NPU上高效量化LLM推理的融合混合精度内核库

Wesley Pang, Gregory Hyegang Jun, Feiyang Liu, Deming Chen

AI总结 针对边缘NPU上量化LLM部署困难,提出TileFuse库,通过融合解包、反量化与GEMM/GEMV内核,并设计交错预分块布局与数据流,在XDNA2上实现AWQ格式原生支持,性能提升最高281%,能耗降低64.6%。

详情
Comments
13 pages excluding reference, 11 figures
AI中文摘要

随着设备端LLM推理需求的增长,边缘SoC越来越多地集成NPU,以在严格的功耗和热预算下提高性能和能效。然而,当前客户端NPU上的实际LLM部署仍然困难:广泛使用的量化格式(如AWQ)无法干净地映射到许多现有NPU软件栈上,这些软件栈通常是专有的,并且暴露有限底层控制。在这项工作中,我们提出了TileFuse,一个面向AMD XDNA2 NPU的近底层混合精度内核库,针对量化LLM推理中的Transformer线性层。TileFuse将实用的低位格式(如AWQ风格的W4A16和W8A16)直接引入XDNA2,而不是迫使模型围绕NPU特定的量化方案重新调整。TileFuse协同设计了权重布局、元数据放置、混合精度微内核和阵列级数据流。具体来说,它将解包、反量化以及GEMM/GEMV执行融合到单个内核流中,引入了一种支持高达32K GEMM维度的交错预分块布局,并重新设计了GEMV数据流以利用完整的4x8 AIE阵列。在内核级评估中,与全精度基线相比,TileFuse在GEMM上性能提升高达121.6%,在GEMV上提升281%,同时在GEMM上相比强iGPU基线实现了超过2倍的性能和能效提升。在Ryzen AI笔记本电脑上的端到端LLM实验中,TileFuse实现了高达2.0倍的预填充延迟降低,能耗降低超过64.6%。这些结果共同表明,XDNA2是AWQ风格边缘LLM推理的实用目标,并且对现成量化的原生NPU支持可以使NPU在实际客户端部署中更加可用。

英文摘要

With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets transformer linear layers in quantized LLM inference. TileFuse brings practical low-bit formats such as AWQ-style W4A16 and W8A16 directly onto XDNA2, rather than forcing the model to be reshaped around an NPU-specific quantization scheme. TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. Specifically, it fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, introduces an interleaved pre-tiling layout that supports GEMM dimensions up to 32K, and redesigns GEMV dataflow to utilize the full 4x8 AIE array. Across kernel-level evaluations, TileFuse improves performance by up to 121.6% for GEMM and 281% for GEMV over full-precision baselines, while delivering more than 2x performance and energy-efficiency gains over strong iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, TileFuse achieves up to 2.0x lower prefilling latency with more than 64.6% lower energy consumption. Together, these results show that XDNA2 is a practical target for AWQ-style edge LLM inference and that native NPU support for off-the-shelf quantization can make NPUs substantially more usable in real client deployments.
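
To make the fusion idea concrete, here is a minimal pure-Python sketch of a W4A16-style GEMV in which unpacking, dequantization, and accumulation happen in a single pass, so the dequantized weight matrix is never stored. It is an illustration only: the nibble order, the per-group scale and zero-point layout, and the names `pack_int4`, `fused_w4a16_gemv`, and `reference_gemv` are assumptions made for this example, not TileFuse's weight layout, AWQ's actual packing order, or any XDNA2 API. Python floats also stand in for 16-bit activations.

```python
from typing import List, Sequence


def pack_int4(values: Sequence[int]) -> bytes:
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first.

    The nibble order is an assumption for this sketch.
    """
    assert len(values) % 2 == 0, "need an even number of 4-bit codes"
    return bytes((values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
                 for i in range(0, len(values), 2))


def fused_w4a16_gemv(packed_rows, scales, zeros, x, group_size) -> List[float]:
    """Compute y = W @ x directly from packed 4-bit weights.

    Each weight is unpacked, dequantized with its group's scale and
    zero point, and multiplied into the accumulator in one loop.
    """
    y = []
    for row, row_scales, row_zeros in zip(packed_rows, scales, zeros):
        acc = 0.0
        for byte_idx, byte in enumerate(row):
            for nibble, shift in ((0, 0), (1, 4)):
                col = 2 * byte_idx + nibble
                group = col // group_size
                code = (byte >> shift) & 0xF                            # unpack
                weight = (code - row_zeros[group]) * row_scales[group]  # dequantize
                acc += weight * x[col]                                  # accumulate
        y.append(acc)
    return y


def reference_gemv(codes, scales, zeros, x, group_size) -> List[float]:
    """Unfused baseline: build the full dequantized matrix, then multiply."""
    dense = [[(c - z[j // group_size]) * s[j // group_size]
              for j, c in enumerate(row)]
             for row, s, z in zip(codes, scales, zeros)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in dense]


# Toy 2x4 weight matrix with two quantization groups per row.
codes = [[0, 15, 8, 7], [1, 2, 3, 4]]
scales = [[0.5, 0.25], [1.0, 2.0]]
zeros = [[8, 8], [0, 0]]
x = [1.0, 2.0, 3.0, 4.0]

packed = [pack_int4(row) for row in codes]
y_fused = fused_w4a16_gemv(packed, scales, zeros, x, group_size=2)
y_ref = reference_gemv(codes, scales, zeros, x, group_size=2)
print(y_fused, y_ref)
```

On real hardware the benefit of fusing comes from avoiding a round trip of dequantized weights through memory; this sketch only shows that the fused loop and the unfused baseline compute the same result.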

2606.11430 2026-06-11 cs.DL cs.AI cs.LO 交叉投稿

Towards a Bridge Layer Between Bibliographic and Formalized Mathematical Knowledge

迈向文献与形式化数学知识之间的桥梁层

A. Mayeux

AI总结 提出一个关系型桥接数据库,对齐出版物元数据与形式化工件,并引入论文级形式化评分,通过跨文档对齐估计形式化覆盖度,以整合文献与形式化数学生态系统。

详情
AI中文摘要

数学知识分散在文献数据库(如MathSciNet、zbMATH Open)和形式化证明库(如Lean mathlib)中,阻碍了已发表结果与其形式化之间的统一访问。我们提出了一个关系型桥接数据库,将出版物元数据与形式化工件对齐,为数学文献和机器可验证证明提供互操作层。我们引入了一个论文级形式化评分,衡量一篇出版物在形式化系统中的覆盖程度。作为可行性研究,我们展示了如何通过非正式文本与Lean形式化之间的跨文档对齐来估计此类评分,从而实现对形式化覆盖度的大规模分析。该框架是将文献和形式化数学生态系统整合为可扩展、机器可操作的知识图谱的第一步,该图谱将出版物与形式化证明对象关联起来。

英文摘要

Mathematical knowledge is split between bibliographic databases (e.g., MathSciNet, zbMATH Open) and formal proof libraries (e.g., Lean mathlib), preventing unified access between published results and their formalizations. We propose a relational bridge-database that aligns publication metadata with formal artifacts, providing an interoperability layer between mathematical literature and machine-verifiable proofs. We introduce a paper-level formalization score that measures how much of a publication is covered in formal systems. As a feasibility study, we show how such scores can be estimated via cross-document alignment between informal texts and Lean formalizations, enabling large-scale analysis of formalization coverage. This framework is a first step toward integrating bibliographic and formal mathematical ecosystems into scalable, machine-actionable knowledge graphs linking publications to formal proof objects.

2606.11463 2026-06-11 cs.LG cs.AI 交叉投稿

LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

基于LSTM的财产保险损失准备金结构性断点检测:气候信息方法

Thomas Mbrice, Shashwat Panigrahi

发表机构 * Stony Brook University(石溪大学)

AI总结 针对气候变化导致传统精算方法失效的问题,提出使用LSTM神经网络检测结构性断点,在佛罗里达和路易斯安那州数据上预期将巨灾年份准备金精度提升15-20%,并给出理论保证。

详情
Comments
15 pages, 0 figures, whitepaper YC
AI中文摘要

准确的损失准备金是保险公司偿付能力的基础,然而加速的气候驱动灾难系统地违反了传统精算方法所依赖的稳定性假设。本文提出一个研究计划,测试长短期记忆(LSTM)神经网络是否能够比链梯法、Bornhuetter-Ferguson法和Cape Cod法更快、更准确地检测和适应这些结构性断点。使用来自佛罗里达州和路易斯安那州超过15年的监管发展三角形数据,并辅以NOAA飓风强度指数和海面温度,我们假设在巨灾暴露年份准备金精度有15-20%的针对性提升,这一阈值基于先前的神经网络准备金文献以及本文发展的形式化收敛结果。除了实证验证,我们还发展了一个理论框架,以概率术语为基础进行LSTM结构性断点检测,并提供形式化的性能保证,以弥补测试期间巨灾事件数量有限的不足。我们记录了研究设计、方法论、预期贡献以及对局限性的坦诚评估。

英文摘要

Accurate loss reserving is foundational to insurer solvency, yet accelerating climate-driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a research program testing whether Long Short-Term Memory (LSTM) neural networks can detect and adapt to these structural breaks faster and more accurately than the Chain Ladder, Bornhuetter-Ferguson, and Cape Cod methods. Using 15-plus years of regulatory development triangle data from Florida and Louisiana, enriched with NOAA hurricane intensity indices and sea surface temperatures, we hypothesize a targeted improvement of 15-20% in reserve accuracy for catastrophe-exposed years, a threshold grounded both in the prior neural network reserving literature and in the formal convergence results developed here. Beyond empirical validation, we develop a theoretical framework grounding LSTM structural break detection in probabilistic terms, providing formal performance guarantees that compensate for the limited number of catastrophe events in the test period. We document the research design, methodology, expected contributions, and a candid assessment of limitations.

2606.11477 2026-06-11 cs.CV cs.AI 交叉投稿

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

迈向全自动考试评分:基于基础模型的笔迹答案公平性识别

Hartwig Grabowski

发表机构 * Institute for Machine Learning and Analytics (IMLA), Offenburg University(奥芬堡大学机器学习和分析研究所(IMLA))

AI总结 提出使用视觉-语言基础模型(VLM)识别手写答案,在61份考试(3141个答案位置)上达到98.4%准确率,并通过轻量提示将假阴性率降至0.58%,实现公平的全自动评分。

详情
Comments
11 pages, 2 figures, 3 tables
AI中文摘要

手工批改手写试卷既耗时又容易出错,尤其是对于大规模班级,而全数字化考试往往迫使教学局限于封闭式问题格式。一个实用的折中方案是保留纸质、问题导向的任务,但将评估相关的答案以单个大写字母记录在机器可读的表格中。开放的问题是,这种读取能否足够准确,并且最重要的是,足够公平以实现无监督评分。早期的自动化方法仅达到约88%–91%的识别率——太低——并且在最关键的案例上失败:答案写在单元格外、被划掉或草书书写。我们展示了通用视觉-语言基础模型(VLM),它解释页面而非匹配像素模板,弥补了这一差距。在一个包含61份匿名考试(3141个答案位置)的基准测试中,最佳模型达到了98.4%的准确率,远高于之前的基线。关键的是,我们以公平性为中心进行评估:我们区分假阴性(正确答案被标记为错误,对学生不利)和假阳性,并且一个提供参考答案作为上下文的轻量提示将假阴性率降至0.58%。在示例性评分方案下,61份考试中只有3份会被评得更差,所有这些都通过学生自我审查步骤被发现。因此,大规模的全自动、公平性感知考试评分是合理的;我们发布匿名基准以支持可重复性。

英文摘要

Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to force a didactic narrowing towards closed question formats. A practical middle ground keeps paper-based, problem-oriented tasks but records the assessment-relevant answers as single capital letters in a table that a machine can read. The open question is whether this reading can be made accurate and, above all, fair enough for unsupervised grading. Earlier automated approaches reached only about 88-91% recognition, which is too low, and failed on the cases that matter most: answers placed outside the cell, crossed out, or written in cursive. We show that general-purpose vision-language foundation models (VLMs), which interpret the page rather than match pixel templates, close this gap. On a benchmark of 61 anonymised exams (3141 answer positions) the best model reaches 98.4% accuracy, well above the previous baseline. Crucially, we centre the evaluation on fairness: we distinguish false negatives (a correct answer marked wrong, which disadvantages the student) from false positives, and a lightweight prompt that supplies the reference solution as context lowers the false-negative rate to 0.58%. Under an exemplary grading scheme only three of the 61 exams would be graded worse, all caught by a student self-review step. Fully automated, fairness-aware exam grading at scale is therefore defensible; we release the anonymised benchmark to support reproducibility.

2606.11505 2026-06-11 cs.CV cs.AI cs.CR 交叉投稿

On the Study of Biometric Spoofing Detection using Deep Learning

基于深度学习的生物特征欺骗检测研究

Kumar Kartikey, Nikos Komninos

AI总结 评估MobileNetV2、DenseNet-121、Inception-v3和STD模型在面部识别系统欺骗检测中的性能,MobileNetV2以92%准确率最优,适合实际应用。

详情
AI中文摘要

生物特征系统越来越多地部署在安全应用中;然而,它们仍然容易受到欺骗攻击,攻击者利用伪造的生物特征数据获取未经授权的访问。本研究评估了最先进的机器学习模型MobileNetV2、DenseNet-121、Inception-v3和欺骗痕迹解缠(STD)在面部识别系统中检测欺骗攻击的有效性。使用CelebA-Spoof数据集,研究通过准确率、精确率、召回率和F1分数等指标评估模型有效性。在MSU-MFSD数据集上进行跨数据集验证以评估泛化能力。结果表明MobileNetV2是最有效的模型,在平衡计算效率的同时达到92%的准确率,使其适用于实际应用。Inception-v3表现出中等鲁棒性,而DenseNet-121和STD在泛化方面存在困难。研究结果强调了在领域自适应和混合架构方面取得进展以增强生物特征安全系统的必要性。

英文摘要

Biometric systems are increasingly deployed in security applications; however, they remain vulnerable to spoofing attacks, in which attackers exploit counterfeit biometric data to gain unauthorized access. This research evaluates the effectiveness of four state-of-the-art machine learning models, MobileNetV2, DenseNet-121, Inception-v3, and Spoof Trace Disentanglement (STD), in detecting spoofing attacks within facial recognition systems. Using the CelebA-Spoof dataset, the study evaluates model effectiveness using metrics such as accuracy, precision, recall, and F1 score. Cross-dataset validation is carried out on the MSU-MFSD dataset to assess generalizability. The results show MobileNetV2 to be the most effective model, achieving 92% accuracy while remaining computationally efficient, which makes it appropriate for real-life applications. Inception-v3 shows moderate robustness, while DenseNet-121 and STD struggle with generalization. The findings highlight the need for advances in domain adaptation and hybrid architectures to enhance biometric security systems.

2606.11555 2026-06-11 q-bio.NC cs.AI cs.LG 交叉投稿

End-to-End Machine Learning for Depressive State Classification via EEG and fNIRS

基于EEG和fNIRS的抑郁状态分类的端到端机器学习

Riki Sakurai, Simon Kojima, Mihoko Otake-Matsuura, Shin'ichiro Kanoh, Tomasz M. Rutkowski

AI总结 本研究提出一个端到端机器学习框架,利用EEG和fNIRS信号对抑郁状态进行分类,旨在克服传统诊断的主观性,为临床提供客观的自动化诊断工具。

详情
Comments
4 pages, 4 figures, Accepted for publication in the Proc. 48th Annu. Int. Conf. IEEE EMBS (EMBC 2026), Toronto, Canada, July 20-24, 2026
AI中文摘要

随着社会压力的增加,对心理医疗的需求不断上升,凸显了传统精神病学诊断的局限性。传统方法——主要依赖临床访谈和患者自我报告——本质上容易受到主观偏见和从业者不同的经验判断的影响。为了满足定量评估的需求,基于生物信号的检测,包括脑电图(EEG)和功能性近红外光谱(fNIRS),已成为一种有前景的客观替代方案。这类技术对于识别可能未被受试者自身意识到的潜在抑郁状态尤为重要。此外,在老龄化人群中,抑郁症与痴呆症的高共病性要求早期区分,以防止症状相互恶化并维持生活质量(QoL)。这项针对11名健康学生的初步研究建立了一个基于生物信号的抑郁症检测框架,为临床使用的自动化、客观诊断工具奠定了基础。

英文摘要

The escalating demand for mental healthcare, driven by rising societal stress, highlights the limitations of traditional psychiatric diagnostics. Conventional methods, which rely primarily on clinical interviews and patient self-reports, are inherently vulnerable to subjective bias and the varying empirical judgment of practitioners. To address the need for quantitative evaluation, biological signal-based detection, including electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS), has emerged as a promising objective alternative. Such technology is particularly vital for identifying latent depressive states that may be unrecognized by the subjects themselves. Furthermore, in aging populations, the high comorbidity between depression and dementia necessitates early differentiation to prevent mutual symptom exacerbation and maintain Quality of Life (QoL). This pilot study of eleven healthy students establishes a framework for biological signal-based depression detection, serving as a foundational step toward automated, objective diagnostic tools for clinical use.

2606.11596 2026-06-11 eess.SY cs.AI 交叉投稿

Model-Based and Data-Driven Hierarchical Control and Topology Co-Design for Robust Networked Systems

基于模型和数据驱动的鲁棒网络系统分层控制与拓扑协同设计

Shirantha Welikala, Zihao Song, Hai Lin, Panos J. Antsaklis

AI总结 针对线性子系统构成的网络系统,提出基于模型和仅依赖轨迹数据的分层控制策略,结合耗散性理论与线性矩阵不等式实现局部与全局耗散性保证及拓扑优化,并应用于直流微电网的鲁棒电压调节与电流共享。

详情
Comments
To be submitted to Automatica
AI中文摘要

本文考虑一类由相互连接的线性子系统、扰动输入和性能输出构成的网络系统。利用耗散性理论,我们首先提出一种基于模型的分层控制设计策略,确保闭环网络系统从扰动输入到性能输出是耗散的。这包括为每个子系统设计局部控制器以强制执行局部耗散性保证,然后利用这些保证协同设计分布式全局控制器和互连拓扑,以在优化互连拓扑成本的同时强制执行全局耗散性保证。整个设计过程仅需求解一系列线性矩阵不等式(LMI)问题,从而保持组合性和可分散性,同时避免低效且集中的非凸迭代设计过程。这种基于模型的分层控制设计策略假设已知子系统动力学,这在许多实际网络系统中可能不成立。受此启发,我们还提出了一种数据驱动的分层控制设计策略,该策略仅假设子系统可获取丰富的输入-状态-输出轨迹数据。所提出的数据驱动设计过程假设影响子系统动力学的未知扰动受二次矩阵不等式约束(放宽了常规界限),并通过使用矩阵S引理来考虑这一点。最后,以直流微电网网络系统为例,验证了所提出的基于模型和数据驱动的分层控制设计在实现鲁棒(耗散)电压调节和电流共享方面的有效性。

英文摘要

In this paper, we consider a class of networked systems comprising an interconnected set of linear subsystems, disturbance inputs, and performance outputs. Using dissipativity theory, we first propose a model-based hierarchical control design strategy to ensure the closed-loop networked system is dissipative from its disturbance inputs to performance outputs. This involves designing local controllers for each subsystem to enforce local dissipativity guarantees, which are then exploited to co-design distributed global controllers and the interconnection topology to enforce global dissipativity guarantees while optimizing interconnection topology costs. The overall design process requires only solving a sequence of linear matrix inequality (LMI) problems, thereby retaining compositionality and decentralizability while avoiding non-convex, iterative design processes that are inefficient and centralized. This model-based hierarchical control design strategy assumes the knowledge of the subsystem dynamics, which may not hold in many real-world networked systems. Motivated by this, we also propose a data-driven hierarchical control design strategy that assumes only the availability of rich input-state-output trajectory data from the subsystems. The proposed data-driven design process assumes that the unknown disturbances affecting the subsystem dynamics are bounded by a quadratic matrix inequality (relaxing conventional bounds) and accounts for this by using the matrix S-lemma. Finally, the effectiveness of the proposed model-based and data-driven hierarchical control designs is illustrated for a networked system representing a DC microgrid, with the aim of enforcing robust (dissipative) voltage regulation and current sharing.

2606.11605 2026-06-11 cs.LG cs.AI 交叉投稿

Physics-Distilled Neural Network enabled by Large Language Models for Manufacturing Process-Property Predictive Modeling

基于大语言模型的物理蒸馏神经网络用于制造过程-性能预测建模

Ge Song, Kiarash Naghavi Khanghah, Anandkumar Patel, Rajiv Malhotra, Hongyi Xu

AI总结 提出一种知识蒸馏框架,利用大语言模型从文献中提取物理先验,通过图掩码注意力层捕获变量依赖,蒸馏至轻量学生模型,在数据稀缺下实现高精度预测与实时部署。

详情
Comments
Under review, Journal of Computing and Information Science in Engineering
AI中文摘要

预测制造过程中的过程-性能关系常面临高实验成本和复杂'黑箱'模型可解释性有限的挑战。本文提出一种新颖的知识蒸馏框架,旨在数据稀缺场景下实现高精度预测。该框架将分析性物理先验(通过大语言模型从科学文献中系统提取)集成到特权教师模型中。我们采用图掩码注意力层来捕获输入变量间复杂的物理依赖关系,这些变量表现为严格设定点或静态与高频时间特征的组合。这种特权知识被蒸馏到轻量级学生预测器中进行推理。通过在五种不同制造过程中的综合实验,评估了该框架的可行性和鲁棒性。为确保统计可靠性,鉴于数据集规模较小,采用重复K折交叉验证技术来量化模型稳定性和泛化能力。结果表明,所提框架在所有评估领域均持续实现高预测精度。最重要的是,该架构表现出显著的容错性,即使在LLM推导的分析先验次优或不完整的情况下,也能保持稳健的预测性能。此外,学生预测器的推理频率超过6000 Hz,便于在标准工业硬件上进行实时边缘部署。这项工作为在数据受限环境下弥合理论物理与实时工业监测之间的差距提供了可扩展的解决方案。

英文摘要

Predicting process-property relationships in manufacturing is often challenged by high experimental costs and the limited interpretability of complex 'black-box' models. This paper proposes a novel knowledge distillation framework designed to achieve high-accuracy predictions in data-scarce scenarios. The framework integrates analytical physics priors, which are systematically extracted from scientific literature via Large Language Models, into a privileged teacher model. We employ a Graph-Masked Attention layer to capture the complex physical dependencies among input variables showing strict setpoints or a combination of static and high-frequency temporal signatures. This privileged knowledge is distilled into a lightweight student predictor for inference. The feasibility and robustness of the framework are evaluated through a comprehensive experiment across five diverse manufacturing processes. To ensure statistical reliability, given the small dataset sizes, a repeated K-fold cross-validation technique is employed to quantify model stability and generalization. Results indicate that the proposed framework consistently achieves high predictive accuracy across all evaluated domains. Most importantly, the architecture demonstrates significant fault tolerance by maintaining robust predictive performance even in scenarios where LLM-derived analytical priors are suboptimal or incomplete. Furthermore, the student predictor achieves an inference frequency exceeding 6000 Hz, which facilitates real-time edge deployment on standard industrial hardware. This work provides a scalable solution for bridging the gap between theoretical physics and real-time industrial monitoring in data-limited environments.

2606.11793 2026-06-11 cs.LG cs.AI physics.ao-ph 交叉投稿

AI4Land: Scalable Deep Learning for Global High-Resolution Land Use Reconstruction

AI4Land: 面向全球高分辨率土地利用重建的可扩展深度学习

Amirpasha Mozaffari, Marina Castaño, Stefano Materia, Etienne Tourigny, Oscar Molina-Sedano, Jordi Varela-Agrelo, Dario Garcia-Gasulla, Miguel Castrillo Melguizo, Mario Acosta, Amanda Duarte

发表机构 * Barcelona Supercomputing Center(巴塞罗那超级计算中心)

AI总结 提出AI4Land框架,采用U-Net两阶段方法,结合粗分辨率情景数据与静态地理特征,重建高分辨率年度土地利用与覆盖,减少陆地碳循环不确定性,支持气候模拟。

详情
AI中文摘要

陆地碳循环的不确定性仍是气候预测的主要制约因素,部分源于地球系统模型中陆面表征和变率的不确定性。为解决此问题,我们提出了数据驱动框架AI4Land,用于生成关键陆面变量的高分辨率历史重建和未来预测。该框架采用U-Net架构的两阶段方法。在第一阶段(本文重点),它通过整合粗分辨率情景数据与静态地理特征,重建年度土地利用与土地覆盖。在计划的第二阶段,生成的高分辨率地图将用于在更细时间尺度上预测动态生物物理变量,特别是叶面积指数。模型基于地球观测数据训练,学习再现空间明确且物理一致的陆面模式,并将时间覆盖扩展到缺乏直接观测的时期。AI4Land在MareNostrum5上开发和训练,展示了GPU加速的高性能计算基础设施如何支持全球尺度的气候AI流水线。最终产品是一套开源模拟器,旨在与数字孪生平台(如Destination Earth计划下开发的平台)实时耦合。通过按需提供逼真且演变的陆面条件,本工作旨在减少关键不确定性,提高下一代气候模拟的预测能力。

英文摘要

Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting the land surface representation and variability in Earth system models. To address this limitation, we present AI4Land, a data-driven framework for generating high-resolution historical reconstructions and future projections of key land surface variables. The framework follows a two-phase approach using a U-Net architecture. In the first phase, which is the focus of this work, it reconstructs annual land use and land cover by integrating coarse-resolution scenario data with static geophysical features. In a planned second phase, the resulting high-resolution maps will be used to predict dynamic biophysical variables, particularly leaf area index, at finer temporal scales. Trained on Earth observation data, the models learn to reproduce spatially explicit and physically consistent land surface patterns, extending temporal coverage to periods lacking direct observations. AI4Land was developed and trained on MareNostrum5, demonstrating how GPU-accelerated HPC infrastructure enables global-scale climate AI pipelines. The final product is a suite of open-source emulators designed for real-time coupling with digital twin platforms, such as those developed under the Destination Earth initiative. By delivering realistic and evolving land surface conditions on demand, this work aims to reduce critical uncertainties and improve the predictive power of next-generation climate simulations.

2606.11794 2026-06-11 cs.LG cs.AI 交叉投稿

Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

使用结构MRI和临床数据的阿尔茨海默病严重程度的多模态序数建模

Boris-Stephan Rauchmann, Jonathan Laib, Buse Ercik, Robert Perneczky, Sergio Altares-López

AI总结 提出一种注意力增强的多模态序数回归框架,整合MRI、人口统计学和遗传数据,用于自动且可解释的AD严重程度分期,在ADNI等数据集上验证,序数模型在相邻阶段准确率(0.970)和与临床分期一致性(QWK 0.549)上表现最佳。

详情
Comments
18 pages. Submitted to journal for review
AI中文摘要

神经退行性疾病如阿尔茨海默病(AD)需要准确且可扩展的工具来评估疾病严重程度,然而当前的临床分期仍然耗时且易变。我们提出了一种带有注意力增强的多模态机器学习框架,结合序数回归,用于自动且可解释的AD严重程度分期。该框架整合了T1加权MRI与人口统计学和遗传变量,并使用序数和非序数预测头比较了单模态和多模态架构。模型使用来自ADNI、AIBL和NIFD数据集的队列分层划分进行训练和验证。严格保留的测试集由排除在所有训练、验证、预处理和超参数调优过程之外的受试者构建,并在整个过程中采用受试者级划分以防止数据泄漏。在单模态方法中,T1加权MRI模型在相邻阶段准确率(0.963)和与临床分期的一致性(QWK 0.444)上略高于表格模型(QWK 0.433)。整合成像、人口统计学和遗传信息提高了整体性能。多模态非序数基线实现了最低的预测误差(MAE 0.340),而序数多模态模型实现了最高的相邻阶段准确率(0.970)和与临床分期的最强一致性(QWK 0.549)。这些发现表明,序数公式更好地捕捉了CDR量表的顺序结构,并产生与临床分期更一致的预测。使用Grad CAM++和SHAP的可解释性分析展示了解剖学和临床上合理的模型行为,支持透明决策。总体而言,基于注意力的多模态学习与序数回归代表了一种稳健、可解释且可扩展的方法,用于自动AD严重程度分期和AI辅助临床决策支持。

英文摘要

Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed using subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting employed throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad-CAM++ and SHAP demonstrated anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.
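
The three metrics quoted above (MAE, adjacent-stage accuracy, QWK) can be sketched in a few lines of Python. This is a hedged illustration, not the authors' evaluation code: it assumes severity stages are encoded as integer indices 0..K-1, and it assumes "adjacent-stage accuracy" counts a prediction as correct when it lies within one stage of the label, which the abstract does not spell out. The function names are hypothetical.

```python
from typing import Sequence


def mae(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean absolute error between stage indices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def adjacent_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of predictions within one stage of the label (assumed definition)."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def quadratic_weighted_kappa(y_true: Sequence[int], y_pred: Sequence[int],
                             n_classes: int) -> float:
    """Cohen's kappa with quadratic disagreement weights.

    Undefined (division by zero) when every label and every prediction
    fall in the same single class.
    """
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row_totals = [sum(observed[i]) for i in range(n_classes)]
    col_totals = [sum(observed[i][j] for i in range(n_classes))
                  for j in range(n_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independent label and prediction marginals.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Toy example with four stages; one prediction is off by one stage.
labels = [0, 1, 2, 3]
preds = [0, 1, 2, 2]
print(mae(labels, preds),
      adjacent_accuracy(labels, preds),
      quadratic_weighted_kappa(labels, preds, n_classes=4))
```

Because QWK penalizes errors by the squared stage distance while MAE penalizes them linearly, a model can have the lowest MAE without having the highest QWK, which is consistent with the split between the non-ordinal and ordinal multimodal models reported above.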

2606.11828 2026-06-11 cs.SD cs.AI cs.CR cs.MM 交叉投稿

Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions

特征对齐的语音水印技术以抵抗重建失真

Haiyun Li, Shuhai Peng, Zhisheng Zhang, Jingran Xie, Xiaofeng Xie, Hanyang Peng, Zhiyong Wu

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Shenzhen Key Laboratory of Intelligent Media and Content Understanding(深圳市智能媒体与内容理解重点实验室) Tencent AI Lab(腾讯人工智能实验室)

AI总结 提出特征对齐水印方法,通过将水印与原始语音特征分布对齐,在保持不可感知性的同时提高水印能量,增强对语音重建模型的鲁棒性。

详情
Comments
Accepted by ICME2026
AI中文摘要

音频水印旨在将可识别信息嵌入音频中同时保持不可感知性。现有方法采用高保真、低能量设计以保持感知质量,但由此产生的水印在语音重建模型的抑制下缺乏鲁棒性。由于现有设计中固有的鲁棒性-保真度权衡,提高鲁棒性具有挑战性,增加水印能量会提高鲁棒性但降低保真度。为解决此问题,我们提出一种特征对齐的水印方法,将水印与原始语音特征分布对齐,允许更高的水印能量以提高鲁棒性同时保持不可感知性。我们使用预训练的语音编解码器生成伪语音水印,并将其融合到输入音频的频谱图中,通过VAD损失和感知损失引导在浊音区域嵌入。实验表明,我们的方法在保持与现有方法相当的不可感知性的同时,在见过和未见过的语音重建模型下均显著提高了鲁棒性。

英文摘要

Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction models. Improving robustness is challenging due to the inherent robustness-fidelity trade-off in existing designs, where increasing watermark energy improves robustness but reduces fidelity. To address this problem, we propose a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, allowing higher watermark energy to improve robustness while preserving imperceptibility. We use a pretrained speech codec to generate a pseudo-speech watermark and fuse it into the spectrogram of the input audio, with VAD loss and perceptual losses guiding embedding within voiced regions. Experiments show that our method maintains imperceptibility comparable to existing approaches while substantially improving robustness under both seen and unseen speech reconstruction models.

2606.11835 2026-06-11 cs.HC cs.AI 交叉投稿

Designing AI-Supported Focus Groups: A Role x Modality Playbook

设计AI支持的焦点小组:角色×模态剧本

Zhiqing Wang, Steven Dow

AI总结 针对焦点小组资源密集且对引导高度敏感的问题,提出按AI角色(工具、联合主持、主持)和模态(文本、语音、具身)组织的剧本,并分析交互权衡与开放问题。

详情
AI中文摘要

收集参与者的生活经验是设计研究的核心。焦点小组的独特价值在于参与者不仅分享个人经历,还能相互回应,从而呈现比较、分歧和集体意义建构。然而,焦点小组资源密集且对引导高度敏感:主持人必须探究细节、平衡参与、管理话题流程并维持心理安全,微妙的引导选择可能影响哪些内容变得突出。近期人机交互研究和商业会议工具表明,生成式AI可以通过提示、轮流调节、主题映射和实时总结来支撑实时对话。然而,用户体验研究团队缺乏关于这些能力在焦点小组中的含义以及引入的方法论风险的清晰图景。我们综合了AI支持实时对话的相关工作,并将其转化为一个焦点小组特定的剧本,按AI角色(工具、联合主持、主持)和模态(文本、语音、具身)组织。我们描述了交互权衡,并识别了将AI支持的焦点小组作为方法论配置进行评估的开放问题。

英文摘要

Collecting participants' lived experiences is central to design research. Focus groups are uniquely valuable because participants not only share individual accounts but also respond to one another, surfacing comparison, disagreement, and collective sensemaking. However, focus groups are resource-intensive and highly sensitive to facilitation: moderators must probe for specificity, balance participation, manage topic flow, and sustain psychological safety, and subtle facilitation choices can shape what becomes salient. Recent HCI work and commercial meeting tools show that generative AI can scaffold live conversation through prompting, turn regulation, thematic mapping, and real-time summarization. Yet UXR teams lack a clear map of what these capabilities mean in focus groups and what methodological risks they introduce. We synthesize prior work on AI-supported live conversation and translate it into a focus-group-specific playbook organized by AI role (tool, co-host, host) and modality (text, voice, embodied). We characterize interactional trade-offs and identify open questions for evaluating AI-supported focus groups as methodological configurations.

2606.11915 2026-06-11 cs.SD cs.AI 交叉投稿

Quality Adaptive Angular Margin Learning for Respiratory Sound Classification

呼吸音分类的质量自适应角度边界学习

Yoon Tae Kim, Heejoon Koo, Miika Toikkanen, June-Woo Kim

发表机构 * RSC LAB, MODULABS, Republic of Korea(RSC实验室,MODULABS,韩国) Department of Electronic Engineering, Wonkwang University, Republic of Korea(韩国圆光大学电子工程系) AI Convergence Research Institute, Wonkwang University, Republic of Korea(韩国圆光大学人工智能融合研究所)

AI总结 提出质量自适应角度边界学习框架QLung,通过频谱熵和均方根能量推导无参考音频质量边界,自适应缩放角度边界,改善特征泛化,在ICBHI和SPRSound数据集上分别提升2.46%和达到最优分布外性能。

详情
Comments
Accepted to Interspeech 2026
AI中文摘要

我们提出了一种质量自适应角度边界学习框架,通过增强类内紧凑性和类间可分离性来改进特征泛化。我们的框架名为QLung,引入了基于频谱熵和均方根能量的无参考音频质量边界,根据录音质量自适应缩放角度边界。为此,我们提出了一种对数缩放的角度边界,在严重类别不平衡下稳定训练。我们还使用了一个角度分类器,对特征和类别权重进行归一化,确保在单位超球面上一致地应用边界惩罚。我们的方法在ICBHI数据集上比交叉熵基线提高了2.46%的分布内性能,最重要的是,在SPRSound数据集上,与先前最先进的方法相比,实现了最强的分布外性能。代码可在此 https URL 获取。

英文摘要

We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and root-mean-square energy, which adaptively scales angular margins based on recording quality. To this end, we propose a log-scaled angular margin that stabilizes training under severe class imbalance. We also use an angular classifier that normalizes features and class weights, ensuring margin penalties are applied consistently on the unit hypersphere. Our approach improves in-distribution performance on the ICBHI dataset by 2.46% over the cross-entropy baseline, and most significantly, achieves the strongest out-of-distribution performance on the SPRSound dataset compared to prior state-of-the-art methods. Code is available at this https URL.

2606.11916 2026-06-11 cs.SE cs.AI 交叉投稿

Characterizing Software Aging in GPU-Based LLM Serving Systems

基于GPU的大语言模型服务系统中的软件老化特征分析

Domenico Cotroneo, Bojan Cukic

AI总结 提出一种实证方法研究GPU大语言模型服务系统中的软件老化,通过216小时实验发现所有部署均存在显著内存老化,泄漏率与运行时和配置强相关,并提供了可复现框架。

详情
Comments
7 pages
AI中文摘要

本文提出了一种实证方法,用于研究基于GPU的大语言模型服务系统中的软件老化。传统的老化研究侧重于以CPU为中心的软件,且工作负载相对规律;而大语言模型服务则不同,它跨越Python主机和CUDA设备,处理成本相差数个数量级的请求,并依赖于快速演进的软件栈。我们在相同的压力条件下,对六个共置部署进行了216小时的实验,并行监控主机、设备和客户端指标,并应用了考虑自相关和多重比较的统计流程。结果显示,所有部署均存在统计上显著的内存老化,泄漏率强烈依赖于服务运行时和部署配置。除这些发现外,我们还提供了一个可复现的框架,为软件老化与再生领域以及大语言模型服务社区开辟了交叉研究方向。

英文摘要

This paper proposes an empirical methodology to study software aging in GPU-based LLM serving systems. Traditional aging studies focus on CPU-centric software with relatively regular workloads; LLM serving is different, spanning a Python host and a CUDA device, handling requests whose cost varies by orders of magnitude, and relying on rapidly evolving software stacks. We run a 216-hour campaign across six co-located deployments under identical stress conditions, monitor host, device, and client metrics in parallel, and apply a statistical pipeline that accounts for autocorrelation and multiple testing. Our results reveal statistically significant memory aging in all deployments, with leak rates strongly dependent on the serving runtime and deployment configuration. Beyond these findings, we provide a reproducible framework that opens a research direction at the intersection of the software aging and rejuvenation and LLM serving communities.

2606.11922 2026-06-11 cs.SD cs.AI 交叉投稿

Lung-SRAD: Spectral-Aware Regularized Audio DASS with Dual-Axis Patch-Mix Contrastive Learning for Respiratory Sound Classification

Lung-SRAD: 基于谱感知正则化音频DASS与双轴补丁混合对比学习的呼吸音分类

Hemansh Shridhar, Miika Toikkanen, June-Woo Kim

发表机构 * RSC LAB, MODULABS(RSC实验室,MODULABS) Department of Electronic Engineering, Wonkwang University(圆光大学电子工程系) AI Convergence Research Institute, Wonkwang University(圆光大学人工智能融合研究所)

AI总结 针对呼吸音分类中AST模型对局部异常模式不敏感的问题,提出基于状态空间模型的谱感知层正则化和双轴补丁混合对比学习,在ICBHI基准上达到64.48%分数,比AST基线提升5%。

详情
Comments
Accepted to Interspeech 2026
AI中文摘要

最近的呼吸音分类(RSC)研究主要依赖于CLS令牌驱动的自注意力架构,如音频频谱图变换器(AST)。虽然它在建模全局上下文方面有效,但最近的分析表明存在低通滤波行为,可能会降低对局部异常模式的敏感性。在这项工作中,我们研究了状态空间模型(SSM)作为RSC的替代骨干网络。使用蒸馏音频状态空间模型,我们通过频谱响应曲线分析中间表示,并观察到对中到高空间频率分量的更强保留。基于这些观察,我们引入了使用高斯卷积应用于选定层的谱感知层正则化。我们进一步提出了针对基于SSM的音频模型定制的双轴补丁混合对比学习,以实现稳健的表示学习。在ICBHI基准上的实验表明,我们的方法达到了64.48%的分数,比AST基线高出5%。代码可在此 https URL 获取。

英文摘要

Recent respiratory sound classification (RSC) studies largely rely on CLS-token-driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce sensitivity to localized abnormal patterns. In this work, we investigate State Space Models (SSMs) as an alternative backbone for RSC. Using the Distilled Audio State Space model, we analyze intermediate representations through spectral response curves and observe stronger preservation of mid-to-high spatial-frequency components. Based on these observations, we introduce spectral-aware layer regularization using Gaussian convolution applied to selected layers. We further propose Dual-Axis Patch-Mix contrastive learning tailored to SSM-based audio models for robust representation learning. Experiments on the ICBHI benchmark show that our approach achieves a score of 64.48%, outperforming the AST baseline by 5%. Code is available at this https URL.

2606.11930 2026-06-11 cs.HC cs.AI cs.CV cross-list

Frozen Multimodal Embeddings for Personality and Cognitive Ability Assessment in Asynchronous Video Interviews

Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang

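AI summary: For high-dimensional multimodal learning from asynchronous video interviews with limited labeled data, the paper pairs frozen multimodal encoders (CLIP, Whisper, RoBERTa, and others) with low-capacity downstream models, cutting MSE on personality prediction by 19.1% relative to the official baseline and identifying dataset shortcuts in cognitive ability prediction.

Details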
Comments
9 pages, 1 figure, 4 tables
Abstract

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging multimodal learning problem because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
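The 19.1% figure is a relative reduction against the official baseline, not an absolute difference in MSE. A minimal sketch of that arithmetic follows; the MSE values are copied from the abstract, while the function and variable names are illustrative assumptions and not part of the authors' code.

```python
# Reported validation MSE values, copied from the abstract above.
OFFICIAL_BASELINE = 0.3334
ABLATION_STEPS = {
    "global model": 0.3189,
    "per-trait modeling": 0.2871,
    "per-trait late fusion": 0.2696,
}


def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


def summarize(baseline: float, steps: dict) -> dict:
    """Relative MSE reduction, in percent, for every ablation step."""
    return {name: 100.0 * relative_reduction(baseline, mse)
            for name, mse in steps.items()}


if __name__ == "__main__":
    for name, pct in summarize(OFFICIAL_BASELINE, ABLATION_STEPS).items():
        print(f"{name}: {pct:.1f}% lower MSE than the official baseline")
```

Running the same helper on each ablation step shows how much of the total reduction each modeling choice accounts for, which is easier to compare than the raw MSE values.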

2606.11990 2026-06-11 cs.LG cs.AI cross-list

Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation

Amir El-Ghoussani, Michele De Vita, Ronald Naumann, Valiseios Belagiannis

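Affiliations: University of Erlangen-Nuremberg; Siemens AG

AI summary: Proposes using the frozen pretrained time-series foundation model Chronos-2 as a backbone with a lightweight regression head for remaining useful life prediction, outperforming several baselines on industrial sensor data.

Details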
Comments
Accepted to EUSIPCO 2026, 4 pages, 2 figures
Abstract

Remaining Useful Life (RUL) prediction is essential for industrial predictive maintenance, yet many learning-based approaches rely on extensive feature engineering or large labeled datasets to train task-specific sequence models. In this work, we introduce a lightweight learning approach in which we leverage a frozen pretrained time-series foundation model (TSFM) and combine it with a small regression head for RUL estimation from multivariate sensor streams. More specifically, we use Chronos-2 as a frozen backbone to extract context window features and train a lightweight regression neural network for RUL prediction. Experiments on real-world industrial sensor data from two device types show that Chronos-2 features consistently improve over recurrent, convolutional, Transformer-based, and gradient-boosting baselines under the same preprocessing and evaluation protocol. We further analyze the impact of context length and find that performance improves significantly with longer histories, indicating that TSFM representations offer a practical and data-efficient alternative for RUL estimation in industrial settings.

2606.12006 2026-06-11 cs.LG cs.AI cross-list

Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

Minh-Khoi Pham, Luca Cotugno, Alina Sirbu, Tai Tan Mai, Martin Crane, Marija Bezbradica

Affiliations: ADAPT Centre, Dublin City University; School of Computing, Dublin City University; Department of Computer Science and Engineering, University of Bologna

AI summary: Proposes a lightweight adaptation that pairs tabular foundation models (TabPFN, TabDPT, TabICL) with a multi-task logistic regression head for clinical survival analysis, achieving competitive or superior performance on several benchmarks and ICU cohorts.

Details
Comments
Accepted for publication at International Conference on AI in Healthcare 2026
Abstract

Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival analysis. While classical statistical and deep learning approaches have been widely studied, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations for structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, as typical applications are restricted to discrete classification rather than survival analysis tasks. In this work, we propose a lightweight adaptation approach for applying tabular foundation models to clinical survival analysis by directly training a survival-aware head on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them using a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate this approach on a diverse set of public survival benchmarks and two large-scale ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared to strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, corresponding to a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, yielding gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and suggest that tabular foundation models provide a practical and effective alternative for clinical survival prediction.
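The results above are reported as a C-index on right-censored data. The abstract does not say which variant is used, so the sketch below assumes Harrell's concordance index; the function name and the toy data are illustrative only. A pair of subjects is comparable when the one with the shorter time experienced the event, and it is concordant when that subject also received the higher predicted risk.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (assumed variant).

    times  : observed time for each subject (event or censoring time)
    events : 1 if the event was observed, 0 if the subject was censored
    risks  : predicted risk scores, higher meaning an earlier expected event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: the third subject is censored at t=6.
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.7, 0.8, 0.1]
toy_c = concordance_index(toy_times, toy_events, toy_risks)
```

In the toy cohort five pairs are comparable and four are ordered correctly, so the index is 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.856 and 0.797 should be read.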

2606.12074 2026-06-11 cs.CV cs.AI eess.IV cross-list

Non-frontal face recognition using GANs and memristor-based classifiers

Semih Vazgecen, Cristian Sestito, Spyros Stathopoulos, Themis Prodromakis

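Affiliations: Centre for Electronics Frontiers, Institute for Integrated Micro and Nano Systems, School of Engineering, The University of Edinburgh

AI summary: Combines lightweight GAN-based frontalisation with memristor-based neuromorphic recognition to handle non-frontal face recognition, reaching up to 96% identification accuracy across two datasets.

Details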
Comments
12 pages, 4 figures, 1 Supplementary (22 pages, 16 figures, 6 tables, 4 supplementary notes)
Abstract

Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in complex scenarios. However, these approaches incur substantial computational overhead, limiting their in situ applicability in resource-constrained platforms such as drones, where they can address challenges including non-frontal facial imagery. Memristor-based neuromorphic systems have emerged as a compelling approach for edge AI applications, combining biologically inspired processing with efficient and scalable computation. In this work, we propose a facial recognition framework that addresses non-frontal pose variations by integrating lightweight generative adversarial network (GAN)-based pose frontalisation with memristor-based neuromorphic recognition. The experimental results on two datasets demonstrate the effectiveness of combining adversarial learning with memristive technology, achieving up to 96% identification accuracy. The proposed approach alleviates the computational bottlenecks of conventional AI and offers a scalable, efficient solution for face recognition in dynamic real-world environments.

2606.12106 2026-06-11 cs.CV cs.AI cross-list

MSUE: Multi-Modal Soccer Understanding Expert

Litao Li, Yibo Yu, Yufeng Hu, Zhuo Yang, Jiali Wen, Yixin Chen, Yixi Zhou

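Affiliations: South China University of Technology; Johns Hopkins University; Peking University; University of Electronic Science and Technology of China

AI summary: Proposes MSUE, a multi-expert question answering architecture that combines a VLM-driven data synthesis pipeline with an LLM that dynamically dispatches questions to text, image, and video experts, reaching 0.95 accuracy and third place in the SoccerNet VQA Challenge.

Details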
Comments
6 pages, 1 figure
Abstract

This paper presents our solution to the 2026 SoccerNet VQA Challenge. We first develop a cost-effective data synthesis pipeline driven by a Vision-Language Model (VLM), which systematically restructures raw domain data into diverse VQA samples, including concise answers and long-form responses. Second, we propose MSUE, a multi-expert question answering architecture that employs a Large Language Model (LLM) to dynamically dispatch questions to text, image, and video experts. These experts are instantiated as a strong text baseline Gemini3-Flash, a fine-tuned Qwen3-VL, and an external knowledge base, respectively, working collaboratively to enhance VQA performance. MSUE achieves an accuracy of 0.95 on the challenge benchmark, securing third place on the leaderboard.

2606.12218 2026-06-11 cs.CV cs.AI cross-list

Adapting Prithvi-EO for Fallow Detection for Food-Water Nexus: ViT-Adapter Necks and Parameter-Efficient Backbone tuning of Geospatial Foundation Model

Sk Muhammad Asif, Orhun Aydin

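Affiliations: Earth, Atmospheric and Geospatial Science, Saint Louis University

AI summary: To address the mismatch between the multi-scale features needed for fallow detection and the single-scale ViT backbone of the foundation model, the paper evaluates two parameter-efficient fine-tuning schemes (LoRA and a hybrid PEFT) with three neck designs; Lite ViT-Adapter with a one-stage head reaches an mAP@50 of 0.9479, 25.70% better than the baseline adapter-free anchor-based approach.

Details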
Comments
10 pages, 6 figures. Preprint. Submitted to ACM SIGSPATIAL 2026
Abstract

Understanding the spatial distribution of fallow land is important for optimizing the food-water (FW) nexus, given fallowing's role in crop rotation and water conservation. Fallow is a low-accuracy class in the USDA Cropland Data Layer (CDL). The geospatial foundation model (GFM) Prithvi-EO has shown strong transferability across computer vision tasks. However, its Vision Transformer (ViT) backbone produces features at a single spatial scale, which are ill-suited to the multi-scale features required by object detection heads. Existing approaches synthesise multi-scale pyramids by scaling single-stride tokens, sacrificing spatial heterogeneity, and full backbone fine-tuning is computationally prohibitive for GFMs. We evaluate a fallow detection pipeline combining two parameter-efficient fine-tuning (PEFT) schemes, Low-Rank Adaptation (LoRA) and a hybrid PEFT, with three neck designs: pseudo multi-scale, Lite ViT-Adapter, and Full ViT-Adapter. Our best configuration, Lite ViT-Adapter with a one-stage head, achieves an mAP@50 of 0.9479 with the DIoU loss, suggesting the effectiveness of center-aware localization for irregular fallow field detection. ViT-Adapter free one-stage detection under LoRA improves on the adapter-free anchor-based approach by 6.42%, and the best configuration improves on the baseline adapter-free anchor-based approach by 25.70%. These results demonstrate that lightweight spatial prior fusion and selective backbone unfreezing enable Prithvi-EO to capture local fallow patterns more effectively, outperforming approaches that rely on reshaped single-stride ViT tokens.

2606.12231 2026-06-11 cs.SE cs.AI cross-list

Rule Taxonomy and Evolution in AI IDEs: A Mining and Survey Study

Guangzong Cai, Ruiyin Li, Peng Liang, Zengyang Li, Mojtaba Shahin

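AI summary: By mining 7,310 rules from 83 open-source projects and surveying 99 practitioners, the study builds a rule taxonomy with 5 primary and 25 secondary categories. It finds that developers value architectural constraints but mostly configure low-level workflow and code formatting rules, that rule evolution is driven mainly by constructive context expansions and enrichments, and that updating rules raises the average artifact compliance rate by 22.99 percentage points (from 49.14% to 72.13%).

Details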
Comments
52 pages, 21 images, 8 tables, Manuscript submitted to a Journal (2026)
Abstract

The adoption of AI-powered Integrated Development Environments (AI IDEs) has introduced "Rules" as a novel software artifact, allowing developers to persistently inject project-specific constraints and architectural guidelines into the context of Large Language Models (LLMs). Despite their role in aligning AI behavior with developer intent, the taxonomy, evolution, and practical impact of these rules remain largely unexplored. To bridge this gap, we conducted a mixed-methods empirical study on AI IDE rules. By mining 83 open-source projects and extracting 7,310 rules, we established a comprehensive taxonomy comprising 5 primary and 25 secondary categories. We then triangulated these artifacts with survey responses from 99 practitioners. Our analysis identified a contrast between developer priorities and actual configurations: while practitioners rate architectural constraints as highly important, rule files in repositories primarily consist of low-level workflow and code formatting constraints. Furthermore, our analysis of 1,540 rule evolution events revealed that rules are updated frequently. Repository data further indicate that rule evolution is primarily driven by constructive context expansions (29.17%) and enrichments (26.59%). In contrast, surveyed developers reported modifying rules primarily to correct AI errors (77.78%), typically by adding new negative constraints rather than editing existing ones. Finally, an artifact compliance assessment of 160 rule evolution events revealed that updating rules significantly improves the adherence of software artifacts, with the average artifact compliance rate increasing by 22.99 percentage points (from 49.14% to 72.13%) following an update. Our study provides empirical insights that can help developers optimize prompting strategies and guide tool builders in designing automated conflict-detection and context-management mechanisms for AI IDEs.

2606.12245 2026-06-11 cs.IR cs.AI cross-list

DiffCold: A Diffusion-based Generative Model for Cold-Start Item Recommendation

Kangning Zhang, Yingjie Qin, Weinan Zhang, Yong Yu, Jianghao Lin

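AI summary: To resolve the seesaw dilemma in cold-start item recommendation, the paper proposes DiffCold, a conditional-diffusion generative model that reconstructs warm item embeddings from content while preserving manifold structure, and adds a Retrieval-enhanced Aggregator and a Simulation-based Representation Alignment module to unify warm and cold item representations.

Details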
Comments
Accepted by ECML-PKDD 2026
Abstract

Cold-start item recommendation remains a persistent challenge in real-world systems due to the absence of interaction histories. While prior models attempt to bridge this gap using item content features, they universally suffer from the seesaw dilemma: enhancing performance for cold items inevitably degrades performance for warm items, and vice versa. We identify that this dilemma stems from a fundamental distributional disparity: warm item embeddings occupy a complex "behavioral manifold" shaped by rich interaction signals, whereas cold item embeddings are constrained to a "semantic manifold" derived solely from auxiliary content. Existing methods often force a rigid mapping between these inconsistent spaces, causing the model to sacrifice the precision of warm representations to accommodate cold ones. To address this, we propose DiffCold, a diffusion-based generative model that unifies warm and cold representations. Unlike GANs or VAEs, DiffCold leverages conditional diffusion to reconstruct warm item embeddings from content, preserving the underlying manifold structure without degradation. We further tailor this paradigm with two specific designs: a Retrieval-enhanced Aggregator that initializes generation using semantically similar warm items to bypass inefficient noise, and a Simulation-based Representation Alignment module that enforces distribution consistency between generated and real embeddings via contrastive learning. Experiments on three benchmarks confirm that DiffCold resolves the seesaw dilemma, consistently outperforming state-of-the-art methods across all metrics.
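The retrieval step described above, starting generation from semantically similar warm items instead of pure noise, can be pictured with a small sketch. This is only one plausible reading of the abstract: the top-k cosine retrieval, the mean aggregation, and every name below are assumptions, and the diffusion model itself is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieval_init(cold_content, warm_items, k=2):
    """Hypothetical starting point for generating a cold item's embedding.

    cold_content : content feature vector of the cold item
    warm_items   : list of (content_vector, behavioral_embedding) pairs
    Returns the mean behavioral embedding of the k warm items whose
    content is most similar to the cold item's content.
    """
    ranked = sorted(warm_items,
                    key=lambda item: cosine(cold_content, item[0]),
                    reverse=True)
    neighbours = [embedding for _, embedding in ranked[:k]]
    dim = len(neighbours[0])
    return [sum(e[d] for e in neighbours) / len(neighbours) for d in range(dim)]


# Toy catalogue: two warm items resemble the cold item, one does not.
toy_warm = [
    ([1.0, 0.0], [2.0, 4.0]),
    ([0.9, 0.1], [4.0, 6.0]),
    ([0.0, 1.0], [100.0, 100.0]),
]
toy_start = retrieval_init([1.0, 0.0], toy_warm, k=2)
```

In the toy catalogue the dissimilar third item is excluded, so the starting point is the average of the first two behavioral embeddings. The intuition is that such a start already lies near the region of embedding space occupied by related warm items.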

2606.12252 2026-06-11 cs.LG cs.AI cross-list

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda

Affiliations: School of Computer Science, University of Nottingham; Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford; School of Computer Science, University of Nottingham Ningbo China

AI summary: Proposes ERTS, which uses explanation quality during training (Grad-CAM attention maps) to separate informative from unreliable uncertainty and filter out low-focus samples, improving macro-F1 and lowering training cost on three ECG datasets.

Details
Abstract

Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.
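The abstract does not define the focus score, so the following is only a hypothetical stand-in that captures the stated intent: a sample is "focused" when most of its attention mass is concentrated in a few time steps. The top-fraction definition, the threshold, and the function names are all assumptions, and the attention maps are taken to be non-negative, as Grad-CAM maps are after the ReLU step.

```python
def focus_score(attention, top_fraction=0.2):
    """Share of total attention carried by the strongest time steps.

    Hypothetical definition: 1.0 means all attention sits in the top
    `top_fraction` of time steps, while a uniform map scores about
    `top_fraction`.
    """
    total = sum(attention)
    if total <= 0:
        return 0.0  # an empty map supports nothing
    k = max(1, int(len(attention) * top_fraction))
    strongest = sorted(attention, reverse=True)[:k]
    return sum(strongest) / total


def filter_by_focus(attention_maps, threshold=0.5, top_fraction=0.2):
    """Indices of samples whose attention is localized enough to keep."""
    return [index for index, attention in enumerate(attention_maps)
            if focus_score(attention, top_fraction) >= threshold]


# One sharply localized map and one diffuse map.
toy_maps = [
    [0, 0, 10, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
]
toy_kept = filter_by_focus(toy_maps)
```

With these toy maps only the first sample survives the filter. In a training loop such a filter would run on the candidate samples of each epoch before the gradient update, which is where the reduction in effective training cost would come from.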

2606.12346 2026-06-11 cs.CV cs.AI cs.LG cross-list

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen

Affiliations: Aignostics, Germany; Institute of Pathology, Charité – Universitätsmedizin Berlin, Germany; Berlin Institute of Health, Charité – Universitätsmedizin Berlin, Germany; Massachusetts General Hospital, Department of Pathology, Harvard Medical School, Boston, MA, US; Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, US; Machine Learning Group, Technische Universität Berlin, Germany; BIFOLD – Berlin Institute for the Foundations of Learning and Data, Germany; Department of Artificial Intelligence, Korea University, Republic of Korea; Max-Planck Institute for Informatics, Germany; German Cancer Research Center (DKFZ) & German Cancer Consortium (DKTK), Berlin & Munich Partner Sites, Germany; Institute of Pathology, Ludwig-Maximilians-Universität München, Germany; Bavarian Cancer Research Center (BZKF), Germany

AI summary: Presents Atlas H&E-TME, a system that uses pathology foundation models to predict tissue quality, tissue region, and cell type. Validated against an IHC-informed consensus and a benchmark of more than 200,000 annotations, it matches or exceeds pathologist performance across multiple cancer types.

Details
Abstract

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

2606.12378 2026-06-11 cs.CV cs.AI cross-list

Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots

Zhi Wei Xu, Torbjörn E. M. Nordling

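Affiliations: National Cheng Kung University

AI summary: Proposes an end-to-end spatial-temporal transformer framework combining PRNet 3D face alignment, illumination augmentation, residual temporal standardization, and hybrid temporal-frequency supervision, achieving a heart-rate MAE of 0.79 bpm and a correlation of 0.982 on a dataset with varied illumination, a 93.6% error reduction relative to PhysFormer.

Details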
Comments
8 pages, 4 figures
Abstract

Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote photoplethysmography (rPPG) enables non-contact heart-rate (HR) estimation from an RGB camera, making it a promising sensing modality for robot-mounted vision systems. However, illumination variation remains a major barrier to robust deployment. This paper presents an end-to-end spatial-temporal transformer framework for remote HR estimation on a new dataset with varied illumination. Our estimator integrates PRNet-based 3D face alignment, clip-level illumination augmentation, a Residual Temporal Standardization Module, and controlled hybrid temporal-frequency supervision. The training objective combines a Soft-Shifted Pearson waveform loss with a spectral Kullback-Leibler divergence loss, where a tuned weight ($\beta$) controls the contribution of frequency-domain heart-rate guidance. Experiments on a static all-level mix protocol covering three illumination levels show that $\beta=5$ provides the strongest result among the tested $\beta$ settings, achieving a best-run HR mean absolute error (MAE) of 0.79 bpm and an HR correlation of 0.982. Compared with the PhysFormer baseline evaluated on our dataset, our estimator reduces HR MAE by 93.6%, while increasing HR correlation from 0.088 to 0.982, making it usable when illumination varies.
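The training objective above pairs a waveform term with a spectral term weighted by β. The sketch below assumes the simplest combination consistent with that description, a weighted sum, and substitutes a plain negative-Pearson loss for the paper's Soft-Shifted Pearson loss, whose exact form the abstract does not give. All function names are illustrative.

```python
import math


def pearson_loss(pred, target):
    """1 - Pearson correlation; a stand-in for the Soft-Shifted Pearson loss."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("constant waveform has no defined correlation")
    return 1.0 - cov / math.sqrt(var_p * var_t)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over frequency bins."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def hybrid_loss(pred_wave, true_wave, pred_spectrum, true_spectrum, beta=5.0):
    """Assumed form: waveform term plus beta times the spectral term."""
    time_term = pearson_loss(pred_wave, true_wave)
    freq_term = kl_divergence(true_spectrum, pred_spectrum)
    return time_term + beta * freq_term
```

Under this reading, a larger β makes the optimizer care more about placing spectral mass at the correct heart-rate frequency than about matching the waveform shape. The abstract reports β = 5 as the best of the tested settings.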

2606.12387 2026-06-11 cs.DB cs.AI cross-list

TAHOE: Text-to-SQL with Automated Hint Optimization from Experience

Zhiyi Chen, Jie Song, Peng Li

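AI summary: Presents TAHOE, a system whose error-driven hint learning pipeline turns debugging traces into a structured Hint Bank and whose Strategy Layer models user intent, substantially improving Text-to-SQL performance on Spider 2.0-Snow without updating model parameters.

Details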
Abstract

Large Language Models (LLMs) have democratized database access through Text-to-SQL, but moving from prototypes to production remains difficult. Real deployments must handle strict SQL dialects, massive schemas, and evolving user preferences, while supervised fine-tuning is costly and rigid and agentic test-time scaling is expensive. We present Tahoe, a system that treats prompt optimization as a dynamic data management problem. Tahoe uses an error-driven hint learning pipeline across Development and Deployment to consolidate debugging traces into a structured Hint Bank. Compiler feedback is distilled into reusable Syntax Hints for dialect-specific rules, while execution and user feedback are converted into Semantic Hints for schema- and user-specific logic. Tahoe further introduces a Strategy Layer that models conflicting user intents as competing strategies under shared natural-language triggers, with recency signals and post-learning attribution statistics that summarize empirical success, harm, inertness, and support. At inference time, Tahoe retrieves relevant hints and guides the LLM through Logic Planning followed by SQL Synthesis. We implement and evaluate the development-phase workflow, leaving deployment-time human-feedback updates for future work. On Spider 2.0-Snow, Tahoe substantially improves Text-to-SQL without updating model parameters. On 113 supervised Spider 2.0-Snow-0212 examples using GPT-5.5, Tahoe raises pass rate from 61.95 percent to 79.42 percent and pass-at-4 from 72.57 percent to 87.61 percent, achieves 100 percent Snowflake syntax pass rate, and reduces average compiler-feedback critic rounds from 2.79 to 0.12 per sampled candidate. The same Hint Bank also transfers to weaker backbones, including a 19.7 percentage-point pass-rate gain on Doubao-2.0-lite.

2507.17012 2026-06-11 cs.AI cs.CE replacement

Sustainability assessment using multimodal AI agents

Zhihan Zhang, Alexander Metzger, Yuxuan Mei, Felix Hähnlein, Zachary Englhardt, Tingyu Cheng, Gregory D. Abowd, Shwetak Patel, Adriana Schulz, Vikram Iyer

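AI summary: Proposes a multimodal multi-agent AI system that emulates the collaboration between life cycle assessment experts and stakeholders to automatically estimate the carbon footprint of electronic devices, cutting data collection from weeks to under one minute with results within 19% of expert LCAs.

Details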
Comments
This article is published in Nature Electronics, and is available online at: this https URL
AI中文摘要

减少计算行业快速增长的环境影响需要大规模评估电子产品的排放。然而,传统的电子设备生命周期评估(LCA)需要专有或不可用的数据。在这里,我们通过引入一个多模态多代理AI系统重新构想传统的可持续性评估,该系统模拟LCA专业人员与利益相关者(如产品经理和工程师)之间的协作过程,自动估算电子设备的碳足迹。代理通过利用结构化数据抽象和从公共互联网(包括维修社区和政府监管数据库)挖掘信息的软件工具,迭代构建完整的生命周期清单。这将数据收集时间从数周或数月减少到不到一分钟。该系统可以在零专有数据的情况下,以专家LCA的19%误差范围内计算碳足迹(典型的人类LCA之间的差异)。我们还表明,通过编码领域特定知识,环境影响估算可以重新定义为数据驱动的预测任务,其中未知产品和排放因子都被表示为具有已知排放的相似产品的加权组合。

英文摘要

Reducing the rapidly growing environmental impact of the computing industry requires assessing the emissions of electronics at scale. However, a traditional life cycle assessment (LCA) of an electronic device, which maps materials and processes to environmental impacts, often requires proprietary or unavailable data. Here, we reimagine conventional sustainability assessment by introducing a multimodal multi-agent AI system that emulates the collaborative process between LCA professionals and stakeholders (such as product managers and engineers) to automatically estimate the carbon footprint of electronic devices. The agents iteratively construct a complete life-cycle inventory by leveraging a structured data abstraction and software tools that mine information from the public internet, including repair communities and government regulatory databases. This reduces data gaps and cuts data collection from weeks or months of expert time to under one minute. The system can calculate carbon footprints within 19% of expert LCAs, a margin typical of the variation between human LCAs, with zero proprietary data. We also show that by encoding domain-specific knowledge, environmental impact estimation can be reframed as a data-driven prediction task, in which both unknown products and emission factors are represented as weighted combinations of similar ones with known emissions.
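The last sentence of the abstract describes an unknown quantity as a weighted combination of similar items with known emissions. A minimal sketch of that idea follows. The function name, the toy numbers, and the choice to normalize the weights are all assumptions made for illustration; the abstract does not say how the weights are obtained.

```python
from typing import Sequence


def weighted_estimate(known_values: Sequence[float], weights: Sequence[float]) -> float:
    """Estimate an unknown value as a normalized weighted average of known ones.

    `weights` are hypothetical similarity scores; how the real system derives
    them is not stated in the abstract.
    """
    if len(known_values) != len(weights):
        raise ValueError("known_values and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(known_values, weights)) / total


# Toy numbers only: three reference products with known footprints (kg CO2e)
# and made-up similarity weights to the unknown product.
known_footprints = [10.0, 20.0, 40.0]
similarity = [2.0, 1.0, 1.0]
estimate = weighted_estimate(known_footprints, similarity)
print(estimate)
```

Normalizing by the weight sum keeps the estimate inside the range of the reference values, which is one reasonable design choice among several.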

2509.09794 2026-06-11 cs.AI cs.LG 版本更新

Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity

合成住宅:数据稀缺下用于住宅建筑数据生成的多模态生成式AI管道

Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra

AI总结 提出一个多模态生成式AI框架,整合图像、表格和模拟组件,从公开记录和图像生成合成住宅建筑数据集,以解决建筑参数数据稀缺问题。

详情
Comments
37 pages; 2 appendices; 6 figures; 2 tables. Code available at this https URL
AI中文摘要

计算模型已成为建筑和城市尺度多尺度能源建模研究的强大工具,支持建筑和城市能源系统的数据驱动分析。然而,这些模型需要大量的建筑参数数据,这些数据通常难以获取、收集成本高昂或受隐私限制。我们引入了一个模块化的多模态生成式人工智能(AI)框架,该框架整合了图像、表格和基于模拟的组件,并从公开的县记录和图像生成合成住宅建筑数据集,同时提出了一个实例化该框架的端到端管道。为了减少典型的大型语言模型(LLM)挑战,我们使用基于遮挡的视觉焦点分析来评估模型组件。我们的分析表明,我们选择的视觉语言模型在建筑图像处理方面比基于GPT的替代方案实现了更大的视觉焦点。我们还根据国家参考数据集评估了结果的真实性,发现我们的合成数据在四个选定变量中的三个重叠率超过95%。这项工作减少了对昂贵或受限数据源的依赖,降低了建筑尺度能源研究和机器学习(ML)驱动的城市能源建模的障碍,从而在数据稀缺的情况下实现了可扩展的下游任务,如能源建模、改造分析和城市尺度模拟。

英文摘要

Computational models have emerged as powerful tools for multi-scale energy modeling research at the building and urban scale, supporting data-driven analysis across building and urban energy systems. However, these models require large amounts of building parameter data that is often inaccessible, expensive to collect, or subject to privacy constraints. We introduce a modular, multimodal generative Artificial Intelligence (AI) framework that integrates image, tabular, and simulation-based components and produces synthetic residential building datasets from publicly available county records and images, and present an end-to-end pipeline instantiating this framework. To reduce typical Large Language Model (LLM) challenges, we evaluate our model's components using occlusion-based visual focus analysis. Our analysis demonstrates that our selected vision-language model achieves greater visual focus than a GPT-based alternative for building image processing. We also assess realism of our results against a national reference dataset, finding that our synthetic data overlaps more than 95% for three of the four selected variables. This work reduces dependence on costly or restricted data sources, lowering barriers to building-scale energy research and Machine Learning (ML)-driven urban energy modeling, and therefore enabling scalable downstream tasks such as energy modeling, retrofit analysis, and urban-scale simulation under data scarcity.

2602.19502 2026-06-11 cs.AI cs.LG 版本更新

Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

人类引导的智能体AI用于多模态临床预测:来自AgentDS医疗基准的教训

Lalitha Pranathi Pulavarthy, Raajitha Muthyala, Aravind V Kuruvikkattil, Zhenan Yin, Rashmita Kudamala, Saptarshi Purkayastha

AI总结 通过人类引导智能体AI在多模态临床预测任务中取得领先性能,提炼出领域知识引导特征工程、任务特定多模态融合和临床动机模型集成三大通用经验。

详情
Comments
Presented at the Data Challenge track at the 14th IEEE International Conference on Healthcare Informatics (ICHI) 2026 on June 3, 2026
AI中文摘要

智能体AI系统越来越能够自主执行数据科学工作流程,但临床预测任务需要纯自动化方法难以提供的领域专业知识。我们研究了人类引导智能体AI如何改进多模态临床预测,展示了我们在所有三个AgentDS医疗基准挑战中的方法:30天再入院预测(Macro-F1 = 0.8986)、急诊科费用预测(MAE = $465.13)和出院准备评估(Macro-F1 = 0.7939)。在这些任务中,人类分析师在关键决策点指导智能体工作流程:来自临床笔记、扫描PDF账单收据和时间序列生命体征的多模态特征工程;任务适当的模型选择;以及临床信息验证策略。我们的方法在医疗领域总体排名第5,在出院准备任务中获得第3名。消融研究表明,人类引导决策在自动化基线之上累积增益达到+0.065 F1,其中多模态特征提取贡献了最大的单一改进(+0.041 F1)。我们提炼出三个可推广的经验:(1)每个流水线阶段的领域信息特征工程产生累积增益,优于广泛的自动搜索;(2)多模态数据集成需要任务特定的人类判断,没有单一提取策略能泛化到临床文本、PDF和时间序列;(3)具有临床动机模型配置的刻意集成多样性优于随机超参数搜索。这些发现为在需要可解释性、可重复性和临床有效性的医疗环境中部署智能体AI的团队提供了实用指导。

英文摘要

Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark challenges: 30-day hospital readmission prediction (Macro-F1 = 0.8986), emergency department cost forecasting (MAE = $465.13), and discharge readiness assessment (Macro-F1 = 0.7939). Across these tasks, human analysts directed the agentic workflow at key decision points: multimodal feature engineering from clinical notes, scanned PDF billing receipts, and time-series vital signs; task-appropriate model selection; and clinically informed validation strategies. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task. Ablation studies reveal that human-guided decisions compounded to a cumulative gain of +0.065 F1 over automated baselines, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). We distill three generalizable lessons: (1) domain-informed feature engineering at each pipeline stage yields compounding gains that outperform extensive automated search; (2) multimodal data integration requires task-specific human judgment, because no single extraction strategy generalizes across clinical text, PDFs, and time-series; and (3) deliberate ensemble diversity with clinically motivated model configurations outperforms random hyperparameter search. These findings offer practical guidance for teams deploying agentic AI in healthcare settings where interpretability, reproducibility, and clinical validity are essential.

2605.10592 2026-06-11 cs.AI cs.HC cs.LG 版本更新

A Resilient Solution for Sewer Overflow Monitoring across Cloud and Edge

跨云与边缘的下水道溢流监测弹性解决方案

Vipin Singh, Tianheng Ling, Peter Ghaly, Felix Grimmeisen, Gregor Schiele, Felix Biessmann

AI总结 本文提出一个基于深度学习的云边协同监控平台,用于预测溢流池填充动态,以应对城市排水系统老化问题,提升防洪预警能力。

详情
Comments
3 pages, 6 figures, accepted at 35th International Joint Conference on Artificial Intelligence 2026 (IJCAI-ECAI 2026), Demonstrations Track. URL: this https URL
AI中文摘要

许多历史城市的老化合流制排水系统正因极端降雨事件而承受更大压力,可能引发合流制排水溢流(CSO),对环境和公共健康造成严重影响。预测溢流池的填充动态对于预判容量超限并及时采取预防措施至关重要。我们提出一个基于网页的演示器(https://riwwer.demo.calgo-lab.de),将云和边缘环境中的深度学习预测方法整合到交互式监控仪表板中,使溢流监测在网络中断时仍能保持运行。视频演示可在线观看(https://cloud.bht-berlin.de/index.php/s/b9xt4T3SdiLBiFZ)。

英文摘要

Aging combined sewer systems in many historical cities are increasingly stressed by extreme rainfall events, which can trigger combined sewer overflows (CSO) with significant environmental and public health impacts. Forecasting the filling dynamics of overflow basins is critical for anticipating capacity exceedance and enabling timely preventive actions for CSO. We present a web-based demonstrator that integrates Deep Learning forecasting methods in both cloud and edge settings into an interactive monitoring dashboard for overflow monitoring, resilient to network outages. A video showcase is available online ( this https URL ).

2304.13905 2026-06-11 cs.CR cs.AI cs.LG 版本更新

LSTM based IoT Device Identification

基于LSTM的物联网设备识别

Kahraman Kostas

AI总结 提出一种端到端机器学习流程,利用LSTM网络处理原始网络数据包,通过滑动窗口时间序列特征识别27类物联网设备,在最优配置下达到79.85%准确率和75.70%宏平均F1分数。

详情
AI中文摘要

随着物联网的使用越来越普及,大量设备进入市场,许多安全漏洞也随之出现。在此环境下,物联网设备识别方法提供了一种预防性安全措施,作为识别这些设备并检测其漏洞的重要因素。在本研究中,我们提出了一种端到端的机器学习流程,利用长短期记忆(LSTM)网络识别阿尔托大学数据集(物联网设备捕获)中的物联网设备。原始网络数据包捕获(PCAP)被处理成25个工程特征,然后排列为滑动窗口时间序列。我们系统地评估了从2到20的序列长度,报告称性能在长度6之前近似线性提升,之后呈波浪形模式,在长度18时达到峰值。在最优配置的最终保留测试集上,该模型在27个设备类别上达到了79.85%的准确率和75.70%的宏平均F1分数。

英文摘要

As the Internet of Things becomes more popular, many security vulnerabilities are emerging with the large number of devices being introduced to the market. In this environment, IoT device identification methods provide a preventive security measure as an important factor in identifying these devices and detecting the vulnerabilities they suffer from. In this study, we present an end-to-end machine learning pipeline that identifies IoT devices in the Aalto University dataset (IoT devices captures) using Long Short-Term Memory (LSTM) networks. Raw network packet captures (PCAP) are processed into 25 engineered features, which are then arranged as sliding-window time-series sequences. We systematically evaluate sequence lengths from 2 to 20, reporting that performance improves approximately linearly up to length 6 and thereafter in a wave-like pattern, reaching its peak at length 18. On the final held-out test set with the optimal configuration, the model achieves an accuracy of 79.85% and a macro-averaged F1-score of 75.70% across 27 device classes.
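The abstract says per-packet feature vectors are arranged as sliding-window sequences before being fed to the LSTM. The sketch below shows one plausible way to build such windows. The stride of 1, the function name, and the toy 3-feature packets are assumptions for illustration; the paper uses 25 features and sequence lengths from 2 to 20.

```python
from typing import List, Sequence


def sliding_windows(rows: Sequence[Sequence[float]], seq_len: int) -> List[List[List[float]]]:
    """Return every run of `seq_len` consecutive feature rows (stride 1 assumed)."""
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    # range() is empty when there are fewer rows than seq_len, so no windows result.
    return [
        [list(row) for row in rows[start:start + seq_len]]
        for start in range(len(rows) - seq_len + 1)
    ]


# Toy input: 5 packets with 3 features each.
packets = [[float(i), float(i) * 2, float(i) * 3] for i in range(5)]
windows = sliding_windows(packets, seq_len=3)

# Shape is (num_windows, seq_len, num_features).
print(len(windows), len(windows[0]), len(windows[0][0]))
```

Each window has the (time steps, features) layout that recurrent layers typically expect; longer `seq_len` values give the model more context but yield fewer windows per capture.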

2502.14894 2026-06-11 cs.CV cs.AI cs.CY cs.LG 版本更新

FOCUS on Contamination: Hydrology-Informed Noise-Aware Learning for Geospatial PFAS Mapping

聚焦污染:基于水文信息与噪声感知的地理空间PFAS测绘学习

Jowaria Khan, Alexa Friedman, Sydney Evans, Rachel Klein, Runzi Wang, Katherine E. Manz, Kaley Beins, David Q. Andrews, Elizabeth Bondi-Kelly

AI总结 提出FOCUS框架,结合稀疏PFAS观测与水文连通性等环境先验,通过噪声感知损失实现鲁棒训练,在PFAS污染测绘中优于传统方法。

详情
Comments
Best Paper Award at ICLR 2026 Machine Learning for Remote Sensing Workshop
AI中文摘要

全氟和多氟烷基物质(PFAS)是持久性环境污染物,对公共健康有显著影响,但由于现场采样的高成本和后勤挑战,大规模监测仍然严重受限。样本的缺乏导致难以用物理模型模拟其扩散,并且对PFAS在地表水中传输的科学理解有限。然而,描述土地覆盖、水文和工业活动的丰富地理空间和卫星衍生数据广泛可用。我们提出了FOCUS,一个用于PFAS污染测绘的地理空间深度学习框架,该框架将稀疏的PFAS观测与大规模环境背景(包括来自水文连通性、土地覆盖、污染源邻近性和采样距离的先验)相结合。这些先验被整合到一个原则性的、噪声感知的损失函数中,从而在稀疏标签下产生稳健的训练目标。通过广泛的消融实验、鲁棒性分析和实际验证,FOCUS始终优于包括稀疏分割、克里金法和污染物传输模拟在内的基线方法,同时在大区域上保持了空间一致性和可扩展性。我们的结果展示了AI如何通过提供筛查级风险图来支持环境科学,这些风险图可优先安排后续采样,并在缺乏完整物理模型的情况下帮助将潜在污染源与地表水污染模式联系起来。

英文摘要

Per- and polyfluoroalkyl substances (PFAS) are persistent environmental contaminants with significant public health impacts, yet large-scale monitoring remains severely limited due to the high cost and logistical challenges of field sampling. The lack of samples leads to difficulty simulating their spread with physical models and limited scientific understanding of PFAS transport in surface waters. Yet, rich geospatial and satellite-derived data describing land cover, hydrology, and industrial activity are widely available. We introduce FOCUS, a geospatial deep learning framework for PFAS contamination mapping that integrates sparse PFAS observations with large-scale environmental context, including priors derived from hydrological connectivity, land cover, source proximity, and sampling distance. These priors are integrated into a principled, noise-aware loss, yielding a robust training objective under sparse labels. Across extensive ablations, robustness analyses, and real-world validation, FOCUS consistently outperforms baselines including sparse segmentation, Kriging, and pollutant transport simulations, while preserving spatial coherence and scalability over large regions. Our results demonstrate how AI can support environmental science by providing screening-level risk maps that prioritize follow-up sampling and help connect potential sources to surface-water contamination patterns in the absence of complete physical models.

2508.09459 2026-06-11 cs.CV cs.AI 版本更新

RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

RelayFormer: 一种用于可扩展图像和视频篡改定位的统一局部-全局注意力框架

Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia

AI总结 提出RelayFormer统一框架,通过全局局部中继(GLR)令牌和中继注意力机制,适应不同分辨率并统一处理图像与视频,在篡改定位任务中实现高效且性能优越。

详情
AI中文摘要

视觉篡改定位(VML)旨在识别图像和视频中被篡改的区域,随着高级编辑工具的兴起,这一任务变得日益具有挑战性。现有方法面临两个核心问题。首先是分辨率多样性。调整大小或填充可能会扭曲微妙的取证线索,并引入不必要的计算成本。其次是将图像的空间模型扩展到视频的时空输入的困难,这通常导致为两种数据类型维护单独的架构。为了解决这些挑战,我们提出了RelayFormer,一个统一框架,能够适应不同分辨率并自然处理静态和时态视觉数据。RelayFormer将输入划分为固定大小的子图像,并引入全局局部中继(GLR)令牌,通过基于中继的注意力机制传播结构化上下文。这种设计使得全局线索(如语义或时间一致性)的高效交换成为可能,同时保留细粒度的篡改伪影。与依赖统一调整大小或稀疏注意力的先前方法不同,RelayFormer以最小的开销扩展到可变分辨率和视频序列。跨多个基准的实验表明,其具有优越的性能和强大的效率,结合了无需插值或过多填充的分辨率适应性、图像和视频的统一处理,以及准确性和计算成本之间的有利平衡。代码可在 this https URL 获取。

英文摘要

Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challenging with the rise of advanced editing tools. Existing methods face two central issues. The first is resolution diversity. Resizing or padding can distort subtle forensic cues and introduce unnecessary computational cost. The second is the difficulty of extending spatial models for images to spatio-temporal inputs in videos, which often results in maintaining separate architectures for the two data types. To address these challenges, we propose RelayFormer, a unified framework that adapts to varying resolutions and naturally handles both static and temporal visual data. RelayFormer partitions inputs into fixed-size sub-images and introduces Global Local Relay (GLR) tokens that propagate structured context through a relay-based attention mechanism. This design enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts. Unlike prior approaches that depend on uniform resizing or sparse attention, RelayFormer scales to variable resolutions and video sequences with minimal overhead. Experiments across diverse benchmarks demonstrate superior performance and strong efficiency, combining resolution adaptivity without interpolation or excessive padding, unified processing for images and videos, and a favorable balance between accuracy and computational cost. Code is available at this https URL.

2510.16152 2026-06-11 cs.DL cs.AI cs.CL cs.LG 版本更新

Mapping Scientific Literature with Large Language Models and Topic Modeling

利用大语言模型和主题建模绘制科学文献图谱

Mason Smetana, Lev Khazanovich

AI总结 提出基于大语言模型的两阶段分类框架,通过主题建模分析PNAS工程类文献,生成语义可解释主题并揭示跨主题关联,性能优于传统方法。

详情
Comments
35 pages, 10 figures. Accepted for publication in Scientometrics. Final version available via DOI
AI中文摘要

科学文献因学科边界、专业术语和潜在稀疏的关键词系统而日益碎片化,使得捕捉现代科学的演化结构变得困难。本研究引入了一个大语言模型驱动的框架,从主题建模的角度绘制科学文献图谱。该方法在《美国国家科学院院刊》20年间超过1500篇工程相关文章语料上进行了演示。一个两阶段分类流水线首先根据每篇文章的摘要分配一个主要主题类别,然后进行全文分析以识别次要分类,揭示语料库中潜在的跨主题联系。与传统主题模型不同,基于LLM的框架在保持强量化性能的同时,生成语义可解释的主题。与既定主题建模方法的比较评估显示,主题多样性更高,重叠度更低,且具有竞争性的一致性指标。对随机抽样的摘要子集进行手动验证,准确率达到75.9%。额外的传统自然语言处理分析证实,生成的主题对应于语料库中有意义的语言模式。连接主要和次要分类的二部网络进一步揭示了仅通过摘要或关键词系统不易观察到的隐含主题关系。结果表明,该框架无需事先了解期刊的编辑双重分类结构,即可独立恢复其大部分结构。总体而言,所提出的方法为绘制科学图谱和识别研究中新兴的跨主题联系提供了有力工具。

英文摘要

Scientific literature is increasingly fragmented by disciplinary boundaries, specialized terminology, and potentially sparse keyword systems, making it difficult to capture the evolving structure of modern science. This study introduces a large language model (LLM)-driven framework for mapping scientific literature from a topic modeling perspective. The approach is demonstrated on a 20-year corpus of more than 1,500 engineering-related articles published in the Proceedings of the National Academy of Sciences (PNAS). A two-stage classification pipeline first assigns a primary thematic category to each article based on its abstract, followed by full-text analysis to identify secondary classifications that reveal latent cross-topic connections within the corpus. Unlike conventional topic models, the LLM-based framework produces semantically interpretable topics while maintaining strong quantitative performance. Comparative evaluation against established topic modeling methods shows higher topic diversity and lower overlap with competitive coherence metrics. Manual validation on a randomly sampled subset of abstracts yields an accuracy of 75.9%. Additional traditional natural language processing analyses confirm that the generated topics correspond to meaningful linguistic patterns in the corpus. A bipartite network linking primary and secondary classifications further reveals implicit thematic relationships that are not readily observable through abstracts or keyword systems alone. The findings indicate that the framework independently recovers much of the journal's editorial dual-classification structure without prior knowledge of its schema. Overall, the proposed approach offers a powerful tool for mapping science and identifying emerging cross-topic connections in research.

2512.11982 2026-06-11 astro-ph.IM cs.AI cs.CV cs.LG 版本更新

Semantic search for 100M+ galaxy images using AI-generated captions

基于AI生成描述的1亿+星系图像语义搜索

Nolan Koblischke, Liam Parker, Francois Lanusse, Jo Bovy, Irina Espejo, Shirley Ho

AI总结 提出利用视觉语言模型生成星系图像描述,并对比对齐预训练天文学基础模型,构建可搜索嵌入,实现大规模星系图像的语义搜索,在稀有现象发现上取得最先进性能。

详情
Comments
ApJ, in press
AI中文摘要

通过缓慢的手动标注活动寻找科学上有趣的现象严重限制了我们对望远镜产生的数十亿星系图像的探索能力。在这项工作中,我们开发了一个流水线,从完全未标记的图像数据创建语义搜索引擎。我们的方法利用视觉语言模型(VLM)为星系图像生成描述,然后将预训练的天文学基础模型与这些嵌入的描述进行对比对齐,以产生大规模可搜索的嵌入。我们发现当前的VLM提供的描述信息足够丰富,可以训练一个语义搜索模型,该模型优于直接图像相似性搜索。我们的模型AION-Search在寻找稀有现象方面实现了最先进的零样本性能,尽管训练是在随机选择的图像上进行的,没有针对稀有情况进行刻意策划。此外,我们引入了一种基于VLM的重排序方法,该方法在top-100结果中对我们最具挑战性的目标的召回率几乎翻倍。首次,AION-Search实现了对超过1亿张星系图像的灵活语义搜索,使得从以前不可行的搜索中能够发现新现象,包括识别出36个新的河外恒星流候选体。更广泛地说,我们的工作提供了一种方法,使大型、未标记的科学图像档案变得可语义搜索,扩展了从地球观测到显微镜等领域的数据探索能力。代码、数据和应用程序可在以下网址公开获取:this https URL

英文摘要

Finding scientifically interesting phenomena through slow manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore, we introduce a VLM-based re-ranking method that nearly doubles the recall for our most challenging targets in the top-100 results. For the first time, AION-Search enables flexible semantic search for over 100 million galaxy images, enabling discovery from previously infeasible searches, including the identification of 36 new extragalactic stellar stream candidates. More broadly, our work provides an approach for making large, unlabeled scientific image archives semantically searchable, expanding data exploration capabilities in fields from Earth observation to microscopy. The code, data, and app are publicly available at this https URL

2512.13765 2026-06-11 eess.IV cs.AI cs.LG 版本更新

Towards Deep Learning Surrogate for the Forward Problem in Electrocardiology: A Scalable Alternative to Physics-Based Models

面向心电学正问题的深度学习代理模型:一种可扩展的物理模型替代方案

Shaheim Ogbomo-Harmitt, Cesare Magnetti, Chiara Spota, Jakub Grzelak, Oleg Aslanidi

AI总结 提出基于注意力机制的序列到序列深度学习框架,作为心电学正问题的代理模型,从心脏电压传播图预测心电图信号,在2D组织模拟中达到高精度(平均R²=0.99±0.01),为物理模型提供可扩展、低成本的替代方案。

详情
Comments
Accepted to CinC conference 2025
AI中文摘要

心电学中的正问题,即从心脏电活动计算体表电位,传统上使用基于物理的模型(如双域或单域方程)求解。虽然准确,但这些方法计算成本高,限制了其在实时和大规模临床中的应用。我们提出一个概念验证的深度学习(DL)框架,作为正问题求解器的高效代理。该模型采用基于时间依赖注意力机制的序列到序列架构,从心脏电压传播图预测心电图(ECG)信号。引入了一种混合损失函数,结合Huber损失和谱熵项,以保持时域和频域的保真度。使用包含健康、纤维化和缝隙连接重塑条件的2D组织模拟,模型实现了高精度(平均$R^2 = 0.99 \pm 0.01$)。消融研究证实了卷积编码器、时间感知注意力和谱熵损失的贡献。这些发现突显了DL作为物理求解器的可扩展、低成本替代方案的潜力,适用于临床和数字孪生应用。

英文摘要

The forward problem in electrocardiology, computing body surface potentials from cardiac electrical activity, is traditionally solved using physics-based models such as the bidomain or monodomain equations. While accurate, these approaches are computationally expensive, limiting their use in real-time and large-scale clinical applications. We propose a proof-of-concept deep learning (DL) framework as an efficient surrogate for forward solvers. The model adopts a time-dependent, attention-based sequence-to-sequence architecture to predict electrocardiogram (ECG) signals from cardiac voltage propagation maps. A hybrid loss combining Huber loss with a spectral entropy term was introduced to preserve both temporal and frequency-domain fidelity. Using 2D tissue simulations incorporating healthy, fibrotic, and gap junction-remodelled conditions, the model achieved high accuracy (mean $R^2 = 0.99 \pm 0.01$). Ablation studies confirmed the contributions of convolutional encoders, time-aware attention, and spectral entropy loss. These findings highlight DL as a scalable, cost-effective alternative to physics-based solvers, with potential for clinical and digital twin applications.

2512.24787 2026-06-11 cs.IR cs.AI 版本更新

HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent

HiGR:腾讯工业级层次化生成式推荐框架

Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Chengxiang Zhuo, Zang Li

AI总结 提出HiGR框架,通过结构化语义ID和层次化解码器解决生成式推荐在工业规模下的规划效率与列表质量对齐问题,离线质量提升超10%,推理加速5倍。

详情
AI中文摘要

Slate推荐(在单个展示中向用户呈现排序项目列表)在主流在线平台中无处不在。虽然最近的生成式推荐方法在利用语义ID建模项目序列方面显示出强大潜力,但直接将其应用于工业规模的slate推荐面临根本性脱节:纠缠的SID空间混淆了高级列表规划,长序列上的细粒度自回归解码限制了语义规划效率,而令牌级目标与整体slate质量不一致。在本文中,我们提出HiGR,一个工业规模的层次化生成式slate推荐框架,通过协同设计的流水线弥合这一脱节。首先,HiGR通过前缀对比残差量化VAE(PCRQ-VAE)学习结构化SID。通过强制高级前缀捕获共享语义,PCRQ-VAE创建了一个可控的离散空间,作为高效规划的前提。利用这一结构化空间,我们的层次化Slate解码器(HSD)将自回归建模从纠缠的令牌级解码转变为粗粒度偏好嵌入。该设计显著降低了推理延迟,同时允许显式的全局slate结构规划。最后,这一稳定的规划空间使得基于ORPO的列表级对齐机制能够优化三重目标隐式反馈——排序保真度、真实用户兴趣和多样性。广泛的离线实验表明,HiGR在离线推荐质量上优于最先进的基线超过10%,同时实现了5倍的推理加速。腾讯平台上的在线A/B测试进一步将观看时间提高了1.22%,视频播放量提高了1.73%。HiGR已在多个腾讯平台表面部署,服务数亿用户,证明了其工业规模的适用性。

英文摘要

Slate recommendation, which presents users with a ranked item list in a single display, is ubiquitous across mainstream online platforms. While recent generative recommendation methods have shown strong potential in modeling item sequences with semantic IDs (SIDs), directly applying them to industrial-scale slate recommendation faces a fundamental disconnect: entangled SID spaces confound high-level list planning, fine-grained autoregressive decoding over long sequences limits semantic planning efficiency, and token-level objectives misalign with holistic slate quality. In this paper, we propose HiGR, an industrial-scale hierarchical generative framework for slate recommendation that bridges this disconnect through a co-designed pipeline. First, HiGR learns structured SIDs via a Prefix-Contrastive Residual Quantized VAE (PCRQ-VAE). By enforcing high-level prefixes to capture shared semantics, PCRQ-VAE creates a controllable discrete space that acts as a prerequisite for efficient planning. Leveraging this structured space, our Hierarchical Slate Decoder (HSD) shifts autoregressive modeling from entangled token-level decoding to coarse-grained preference embeddings. This design significantly reduces inference latency while allowing explicit global slate structure planning. Finally, this stable planning space enables an ORPO-based listwise alignment mechanism to optimize triple-objective implicit feedback: ranking fidelity, genuine user interest, and diversity. Extensive offline experiments show that HiGR outperforms state-of-the-art baselines by over 10% in offline recommendation quality while achieving a $5\times$ inference speedup. Online A/B tests on Tencent platforms further improve watch time by 1.22% and video plays by 1.73%. HiGR has been deployed on multiple Tencent platform surfaces, serving hundreds of millions of users and proving its industrial-scale applicability.

2601.21293 2026-06-11 cs.LG cs.AI 版本更新

Reliability-Calibrated Edge-IoT Early Fault Warning for Rotating Machinery with a Physics-Guided Tiny-Mamba Transformer

面向旋转机械的可靠性校准边缘物联网早期故障预警:一种物理引导的Tiny-Mamba Transformer

Changyu Li, Huabei Nie, Xiaoya Ni, Lu Wang, Lijuan Shen, Kaishun Wu, Fei Luo

AI总结 提出一种可靠性校准的边缘物联网早期故障预警框架,使用物理引导的Tiny-Mamba Transformer提取特征,结合极值理论校准误报率,在低计算资源下实现高精度、低延迟的旋转机械故障预警。

详情
AI中文摘要

工业物联网系统日益依赖分布式振动传感来支持旋转机械的预测性维护。然而,在实际部署中,原始信号上传成本高昂,且报警决策必须在有限计算资源、变化运行条件和严格误报预算下本地进行。本文提出一种可靠性校准的边缘物联网早期预警框架,其中紧凑的物理引导Tiny-Mamba Transformer作为表示模块,极值理论层将流式异常分数转换为事件级报警片段。PG-TMT结合深度可分离卷积主干、Tiny-Mamba状态空间分支和轻量级局部Transformer,在批量大小为1的推理下捕获瞬态、长周期和多通道退化线索。为提高可审计性,时间注意力被投影到频域并与分析轴承故障阶次带软对齐。极值理论校准、双阈值迟滞和修尾拟合即使在健康校准数据不完美的情况下也能提供可控的误报强度。在CWRU、Paderborn、XJTU-SY和工业试点上的实验表明,所提框架提高了PR-AUC,在可控误报预算下减少了检测延迟,并对结构化干扰、元数据不确定性、复合故障混合和域转移保持鲁棒。凭借小于1 MB的占用空间和低于7 ms的Jetson p99延迟,该框架支持工业物联网预测性维护的校准和可解释早期预警。

英文摘要

Industrial Internet of Things (IIoT) systems increasingly rely on distributed vibration sensing to support predictive maintenance of rotating machinery. In practical deployments, however, raw signal upload is costly and alarm decisions must be made locally under limited computation, changing operating conditions, and strict nuisance-alarm budgets. This paper presents a reliability-calibrated edge-IoT early-warning framework, in which a compact Physics-Guided Tiny-Mamba Transformer (PG-TMT) acts as the representation module and an extreme value theory (EVT) layer converts streaming anomaly scores into event-level alarm episodes. PG-TMT combines a depthwise-separable convolutional stem, a Tiny-Mamba state-space branch, and a lightweight local Transformer to capture transient, long-horizon, and multichannel degradation cues under batch-size-one inference. To improve auditability, temporal attention is projected to the frequency domain and softly aligned with analytical bearing fault-order bands. EVT calibration, dual-threshold hysteresis, and trimmed-tail fitting provide controllable false-alarm intensity even when healthy calibration data are imperfect. Experiments on CWRU, Paderborn, XJTU-SY, and an industrial pilot demonstrate that the proposed framework improves PR-AUC, reduces detection delay under a controlled nuisance-alarm budget, and remains robust to structured interference, metadata uncertainty, compound fault mixtures, and domain transfer. With a sub-1 MB footprint and Jetson p99 latency below 7 ms, the framework supports calibrated and interpretable early warnings for IIoT predictive maintenance.
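The abstract mentions dual-threshold hysteresis as the step that turns a stream of anomaly scores into event-level alarm episodes. The sketch below illustrates the general hysteresis technique only. The threshold values are hard-coded hypothetical numbers, whereas the paper derives its thresholds from EVT calibration, and the function name is an assumption.

```python
from typing import Iterable, List, Tuple


def hysteresis_episodes(scores: Iterable[float], high: float, low: float) -> List[Tuple[int, int]]:
    """Group a score stream into alarm episodes as (start, end) index pairs.

    An episode opens when a score reaches `high` and closes at the first
    later score below `low`; `end` is the last index still inside the episode.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    episodes: List[Tuple[int, int]] = []
    start = None
    last = -1
    for i, score in enumerate(scores):
        last = i
        if start is None and score >= high:
            start = i  # open a new episode
        elif start is not None and score < low:
            episodes.append((start, i - 1))  # close the running episode
            start = None
    if start is not None:
        episodes.append((start, last))  # stream ended while the alarm was active
    return episodes


# Hypothetical scores and thresholds, chosen only to show the behavior.
scores = [0.1, 0.9, 0.6, 0.7, 0.2, 0.1, 0.95, 0.5]
episodes = hysteresis_episodes(scores, high=0.8, low=0.4)
print(episodes)
```

Because the alarm clears only below the lower threshold, scores that hover between the two thresholds extend the current episode instead of producing a burst of separate alarms.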

2605.06100 2026-06-11 eess.SP cs.AI cs.LG cs.RO 版本更新

CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision

可信DFGO:具有可信度监督的可微因子图优化

Liang Qian, Penggao Yan, Penghui Xu, Li-Ta Hsu

AI总结 针对GNSS协方差不可靠问题,提出CredibleDFGO框架,通过可微高斯-牛顿求解器与加权生成网络,利用适当评分规则监督预测分布,提升协方差可信度与定位精度。

详情
Comments
Submitted to NAVIGATION: Journal of the Institute of Navigation
AI中文摘要

全球导航卫星系统(GNSS)定位广泛用于城市导航,但GNSS求解器报告的协方差在城市峡谷中通常不可靠。现有的可微因子图优化(DFGO)方法通过求解器学习测量加权,但仍仅使用位置目标。因此,位置估计可能改善,而报告的协方差仍然过小、过大或方向错误。我们提出CredibleDFGO(CDFGO),一种可微GNSS因子图框架,将协方差可信度作为显式训练目标。加权生成网络(WGN)预测每颗卫星的可靠性权重,可微高斯-牛顿求解器将这些权重映射到位置估计和基于Hessian的后验协方差。我们使用适当评分规则端到端监督东-北预测分布。我们研究了负对数似然(NLL)、能量分数(ES)及其组合。在三个UrbanNav测试场景上的结果表明,协方差可信度持续提升。定位精度在中度城市和严峻城市场景中也有所提高;在深度城市场景中,平均水平误差和第95百分位误差均有所改善。在严峻城市的旺角(MK)场景中,与DFGO(MAE)相比,CDFGO-Combined将平均水平误差从13.77米降至11.68米,将NLL从40.63降至6.59,将ES从12.31降至9.05。案例研究将MK改进归因于更好的轴向一致性、更可信的局部协方差椭圆以及卫星级重新加权。

英文摘要

Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods learn measurement weighting through the solver, but they still use position-only objectives. As a result, the position estimate may improve while the reported covariance remains too small, too large, or incorrectly oriented. We propose CredibleDFGO (CDFGO), a differentiable GNSS factor graph framework that makes covariance credibility an explicit training target. A Weighting Generation Network (WGN) predicts per-satellite reliability weights, and a differentiable Gauss-Newton solver maps these weights to a position estimate and a Hessian-derived posterior covariance. We use proper scoring rules to supervise the East-North predictive distribution end to end. We study negative log-likelihood (NLL), the energy score (ES), and their combination. Results on three UrbanNav test scenes show consistent gains in covariance credibility. Positioning accuracy also improves on the medium-urban and harsh-urban scenes; on the deep-urban scene, both the mean horizontal error and the 95th-percentile error improve. On the harsh-urban Mong Kok (MK) scene, CDFGO-Combined reduces the mean horizontal error from 13.77 m to 11.68 m, reduces NLL from 40.63 to 6.59, and reduces ES from 12.31 to 9.05 relative to DFGO (MAE). Case studies link the MK improvement to better axis-wise consistency, more credible local covariance ellipses, and satellite-level reweighting.
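The abstract supervises the East-North predictive distribution with proper scoring rules, including negative log-likelihood. The sketch below evaluates the standard bivariate Gaussian NLL for a position error and a reported covariance, to show why it penalizes covariances that are too small or too large. It is an illustration of the scoring rule, not the authors' training code, and all numeric values are made up.

```python
import math
from typing import Sequence


def gaussian_nll_2d(err_east: float, err_north: float, cov: Sequence[Sequence[float]]) -> float:
    """Negative log-likelihood of a 2D error under a zero-mean Gaussian with covariance `cov`."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # Mahalanobis term e^T cov^{-1} e, using the closed-form 2x2 inverse.
    maha = (d * err_east ** 2 - (b + c) * err_east * err_north + a * err_north ** 2) / det
    return 0.5 * maha + 0.5 * math.log(det) + math.log(2.0 * math.pi)


# Hypothetical case: a 3 m East error scored against three reported covariances.
overconfident = gaussian_nll_2d(3.0, 0.0, [[0.25, 0.0], [0.0, 0.25]])
matched = gaussian_nll_2d(3.0, 0.0, [[9.0, 0.0], [0.0, 9.0]])
underconfident = gaussian_nll_2d(3.0, 0.0, [[900.0, 0.0], [0.0, 900.0]])
print(overconfident > matched, underconfident > matched)
```

The Mahalanobis term grows when the covariance is too small for the error, and the log-determinant term grows when the covariance is inflated, so the score is lowest when the reported uncertainty matches the actual error scale.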

2605.06485 2026-06-11 cs.CL cs.AI 版本更新

Litespark Inference For CPUs: Ultra-Fast SIMD Framework for Ternary (1.58-bit) Language Models

Litespark Inference For CPUs: 三元(1.58位)语言模型的超快SIMD框架

Nii Osae Osae Dade, Tony Morri, Moinul Hossain Rahat, Sayandip Pal, Rickston Pinto

AI总结 针对三元语言模型权重为{-1,0,1}的特点,提出自定义SIMD内核,用加减运算替代矩阵乘法,在CPU上实现18-96倍加速和6倍内存减少。

详情
AI中文摘要

大型语言模型(LLM)已经改变了人工智能,但其计算需求对大多数用户来说仍然过高。标准推理需要昂贵的数据中心GPU或云API访问,导致超过十亿台个人计算机在AI工作负载中未被充分利用。三元模型提供了一条前进的道路:它们的权重被限制在{-1, 0, +1},理论上消除了浮点乘法的需求。然而,现有框架未能利用这种结构,将三元模型视为密集浮点网络。我们通过自定义SIMD内核填补了这一空白,这些内核用简单的加法和减法运算取代矩阵乘法,针对现代CPU上可用的整数点积指令。我们的实现Litespark-Inference可通过pip安装,并直接与Hugging Face集成,在Apple Silicon上实现了比标准PyTorch推理高18.15倍的吞吐量、快7.15倍的首令牌时间和6.03倍的内存减少,在Intel和AMD处理器上实现了高达95.81倍的吞吐量加速。

英文摘要

Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging Face, achieving 18.15x higher throughput, 7.15x faster time-to-first-token, and 6.03x memory reduction compared to standard PyTorch inference on Apple Silicon, with comparable or higher throughput speedups of up to 95.81x on Intel and AMD processors.
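The core observation in the abstract is that with weights restricted to {-1, 0, +1}, a matrix-vector product needs only additions and subtractions. The scalar sketch below demonstrates that equivalence. It is not the Litespark-Inference kernel: it ignores SIMD, weight packing, and integer dot-product instructions, and the function name is hypothetical.

```python
from typing import List, Sequence


def ternary_matvec(weights: Sequence[Sequence[int]], x: Sequence[float]) -> List[float]:
    """Multiply a ternary matrix by a vector using only add and subtract."""
    out: List[float] = []
    for row in weights:
        acc = 0.0
        for w, v in zip(row, x):
            if w == 1:
                acc += v  # +1 weight: add the activation
            elif w == -1:
                acc -= v  # -1 weight: subtract the activation
            elif w != 0:
                raise ValueError("weights must be -1, 0, or +1")
            # 0 weight: skip the activation entirely
        out.append(acc)
    return out


W = [[1, 0, -1], [-1, -1, 1]]
x = [2.0, 3.0, 5.0]
y = ternary_matvec(W, x)

# Reference result computed with ordinary multiplication, for comparison.
reference = [sum(w * v for w, v in zip(row, x)) for row in W]
print(y, reference)
```

A production kernel would process many packed weights per instruction; the sketch only shows that no multiplication is required to get the same result.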

2606.10120 2026-06-11 cs.IR cs.AI cs.HC 版本更新

MetaPlate: Counterfactual-Guided RAG-LLM Tool for Personalized Food Recommendation and Hyperglycemia Prevention

MetaPlate: 反事实引导的RAG-LLM工具用于个性化食物推荐和高血糖预防

Asiful Arefeen, Carol Johnston, Hassan Ghasemzadeh

AI总结 提出MetaPlate框架,结合反事实解释、机器学习预测和RAG-LLM,生成个性化膳食建议以预防餐后高血糖,经注册营养师评估证明其可行性和有效性。

详情
AI中文摘要

餐后高血糖是代谢紊乱的关键风险因素;然而,现有的饮食指导通常是静态的、不切实际的且个性化不足,提供的建议难以遵循或效果不佳。尽管最近的进展利用连续血糖监测(CGM)和机器学习来预测血糖反应,但这些方法主要是预测性的,缺乏可操作的指导。此外,推荐系统常常与用户目标不一致,且需要大量输入。我们提出了MetaPlate,一个反事实解释(CF)引导的、上下文感知的决策支持框架,用于生成个性化膳食建议,以减轻健康成年人的餐后血糖波动。MetaPlate整合了多模态数据,包括来自25名个体的CGM读数、可穿戴设备衍生的生理信号以及用户提供的膳食输入,以建模餐前上下文。一个机器学习模型预测血糖反应,而CF优化模块通过调整膳食组成(修改宏量营养素数量)来维持血糖水平在目标范围内(≤140 mg/dL)。基于LLM的检索增强生成(RAG)层通过使用USDA食品数据库的约束搜索生成人类可读的建议,增强了可解释性。我们通过结构化的专家在环评估,与注册营养师(RDs)一起评估MetaPlate,比较提示优化前后的性能。结果显示,在膳食真实性、份量适宜性和推荐可能性方面有所改进,专家反馈表明从临床不可行的输出转向了可操作、上下文适宜的建议。我们的发现强调了领域知识和结构化约束在LLM驱动系统中的重要性,并突出了MetaPlate作为实时个性化膳食决策支持工具的潜力。

English abstract

Postprandial hyperglycemia is a key risk factor for metabolic disorders; however, existing dietary guidance is often static, impractical, and insufficiently personalized, providing recommendations that are difficult to follow or not impactful. While recent advances leverage continuous glucose monitoring (CGM) and machine learning to predict glycemic responses, these approaches are largely predictive and lack actionable guidance. Moreover, recommendation systems are often misaligned with user goals and require extensive input. We present MetaPlate, a counterfactual explanation (CF) guided, context-aware decision-support framework that generates personalized meal recommendations to mitigate postprandial glucose excursions in healthy adults. MetaPlate integrates multimodal data, including CGM readings, wearable-derived physiological signals, and user-provided meal inputs from 25 individuals, to model pre-meal context. A machine learning model predicts glucose response, while a CF optimization module adjusts meal composition by modifying macronutrient amounts to maintain glucose levels within a target range (≤140 mg/dL). An LLM-based retrieval-augmented generation (RAG) layer enhances interpretability by producing human-readable recommendations using constrained search of the USDA food database. We evaluate MetaPlate via a structured expert-in-the-loop assessment with registered dietitians (RDs), comparing performance before and after prompt refinement. Results show improvements in meal realism, portion suitability, and recommendation likelihood, with expert feedback indicating a shift from clinically implausible outputs to actionable, contextually appropriate recommendations. Our findings emphasize the importance of domain knowledge and structured constraints in LLM-driven systems and highlight the potential of MetaPlate as a real-time personalized dietary decision-support tool.

2606.10376 2026-06-11 cs.AI cs.IT cross-list

Belief-Space Control for Personalized Cancer Treatment via Active Inference

基于主动推理的个性化癌症治疗信念空间控制

Deniz Sargun, H. Bugra Tulay, C. Emre Koksal

Affiliations * American Association for Cancer Research; AACR Project GENIE registry; AACR Project GENIE Biopharma Collaborative

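AI summary: Models cancer treatment as a belief-space planning problem using active inference, unifying goal-directed control and information acquisition under a measurement budget to achieve patient categorization and effective treatment.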

Details
Comments
11 pages including appendix
AI-generated Chinese abstract

癌症治疗本质上是一个具有部分可观测性、潜在患者异质性以及医疗测量预算明确约束的序贯决策问题。与标准强化学习(RL)方法控制状态轨迹不同,癌症治疗会永久性地改变患者的转移动力学,从而改变状态随时间演化的方式。我们使用主动推理将癌症治疗建模为信念空间规划问题,推导出一个期望自由能目标,该目标在测量预算下统一了目标导向控制和信息获取。我们使用来自AACR Project GENIE Biopharma Collaborative数据集的真实临床癌症数据实现了该框架。临床数据结果表明,在真实的测量和治疗约束下,能够同时实现患者分类和高治疗效力。

English abstract

Cancer treatment is at its core a sequential decision-making problem with partial observability, latent patient heterogeneity, and explicit constraints on the budget for medical measurements. Unlike standard Reinforcement Learning (RL) approaches that control state trajectories, cancer treatments permanently modify patients' transition dynamics, changing how states evolve over time. We model cancer treatment as a belief-space planning problem using active inference, deriving an expected free-energy objective that unifies goal-directed control and information acquisition under measurement budgets. We implement this framework using real clinical cancer data from the AACR Project GENIE Biopharma Collaborative dataset. Results on clinical data demonstrate simultaneous patient categorization and high treatment efficacy under real measurement and treatment constraints.

11. Other / General AI (19 papers)

2606.11245 2026-06-11 cs.AI cs.NE q-bio.NC new submission

Position: Hippocampal Explicit Memory Is the Cornerstone for AGI

立场:海马体显式记忆是通用人工智能的基石

Sangjun Park

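AI summary: Argues that integrating explicit memory into large language models is key to reaching AGI, because LLM learning resembles human implicit memory whereas higher-order cognitive functions depend on hippocampal explicit memory.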

Details
Comments
Accepted to ICML 2026 (Position Paper Track)
AI-generated Chinese abstract

大语言模型(LLM)在各种任务中展现了卓越的能力,提升了人们对通用人工智能(AGI)的期望。这篇立场论文认为,整合显式记忆是推动LLM迈向AGI的基石。关键原因在于,LLM的底层学习机制与人类内隐记忆高度相似。然而,AGI所需的高阶认知功能,如长期战略规划、元认知和符号推理,严重依赖海马体显式记忆,无法仅从内隐统计学习中产生。借鉴神经科学的发现,我提出这一观点,并辅以人工显式记忆系统的计算要求,希望促进进一步研究,为显式记忆整合奠定基础。

English abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, raising expectations for Artificial General Intelligence (AGI). This position paper argues that integrating explicit memory is the cornerstone for advancing LLMs toward AGI. The key reason is that the underlying learning mechanism of LLMs is highly analogous to human implicit memory. However, higher-order cognitive functions necessary for AGI, such as long-term strategic planning, metacognition, and symbolic reasoning, heavily rely on hippocampal explicit memory and cannot arise solely from implicit statistical learning. Drawing on findings from neuroscience, I advance this perspective and complement it with computational requirements for artificial explicit memory systems, hoping to foster further research and lay the groundwork for explicit memory integration.

2606.11560 2026-06-11 cs.DB cs.AI cross-list

LLMs+Graphs: Toward Graph-Native, Synergistic AI Systems

LLMs+Graphs:迈向图原生的协同人工智能系统

Arijit Khan, Longxu Sun, Xin Huang

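AI summary: Surveys three synergies between large language models and graph computation (graph-augmented retrieval and reasoning, bidirectional integration with knowledge graphs, and AI agents strengthened by graph algorithms), discusses new capabilities for graph data management and graph machine learning, and aims to offer a unified perspective for building next-generation graph-native AI systems.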

Details
Comments
10 pages, Accepted at PAKDD 2026 Tutorial
AI-generated Chinese abstract

大语言模型(LLMs)发展迅速,但它们在结构化和多跳推理方面的局限性凸显了对图原生、协同人工智能(AI)系统的需求。图结构数据支撑着社交、生物、金融、交通、网络和知识领域的关键应用,因此理解LLMs如何利用图计算进行基于上下文的扎实推理至关重要。三种互补的协同方式正在涌现:通过图计算增强LLMs进行检索和推理;LLMs与知识图谱(KGs)的双向集成,其中LLMs支持KG构建和整理,而KGs强制执行语义约束和事实一致性;以及通过图算法增强的AI代理进行规划、决策和多步推理。同时,LLMs通过自然语言接口和混合LLM-图神经网络(GNN)流水线,为图数据管理和图机器学习(ML)引入了新能力。本教程综合了推动这些融合方向的算法、系统和设计原则,为数据科学和数据挖掘研究人员提供了将LLMs、图数据管理、图挖掘、图ML和代理计算集成到下一代图原生AI系统中的统一视角。

English abstract

Large Language Models (LLMs) have advanced rapidly, but their limitations in structured and multi-hop reasoning underscore the need for graph-native, synergistic artificial intelligence (AI) systems. Graph-structured data underpins critical applications across social, biological, financial, transportation, web, and knowledge domains, making it essential to understand how LLMs can leverage graph computation for grounded, context-rich inference. Three complementary synergies are emerging: LLMs augmented with graph computation for retrieval and reasoning; bidirectional integration between LLMs and knowledge graphs (KGs), where LLMs support KG construction and curation while KGs enforce semantic constraints and factual consistency; and AI agents strengthened by graph algorithms for planning, decision making, and multi-step reasoning. In parallel, LLMs introduce new capabilities for graph data management and graph machine learning (ML) through natural language interfaces and hybrid LLM-graph neural network (GNN) pipelines. This tutorial synthesizes the algorithms, systems, and design principles driving these converging directions, offering data science and data mining researchers a unified perspective on integrating LLMs, graph data management, graph mining, graph ML, and agentic computation into next-generation graph-native AI systems.

2606.12022 2026-06-11 cs.FL cs.AI cross-list

Runtime Enforcement of Hybrid System Properties

混合系统属性的运行时强制执行

Mir Md Sajid Sarwar, Srinivas Pinisetty, Rajarshi Ray, Thierry Jéron

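AI summary: Proposes a runtime enforcement framework that combines discrete-event editing with continuous-time monitoring, models safety requirements as hybrid automata, synthesizes safe corrective actions through runtime reachability analysis, and is validated on an adaptive cruise control system.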

Details
AI-generated Chinese abstract

运行时强制执行已成为确保在不确定和动态环境中运行的自主和网络物理系统安全的一种有前景的方法。与传统的运行时验证不同,运行时强制执行通过在执行期间主动干预,修改不安全系统行为以防止属性违反。现有的强制执行框架主要关注无时间或离散时间规范,并且通常仅限于延迟或抑制事件,这使得它们对于表现出复杂连续动态的反应式系统不充分。在本文中,我们提出了一种运行时强制执行框架,其中安全需求使用混合自动机(HA)建模。该框架将离散事件编辑与连续时间监控相结合,以支持在任意时间点执行抑制、延迟和插入事件等强制执行操作。在观察环境输入后,自动机被初始化,并使用运行时可达性分析来综合安全纠正动作。我们正式定义了安全混合自动机的强制执行问题,建立了可强制执行条件,并提出了一种用于反应式系统的在线强制执行算法。关于自适应巡航控制(ACC)系统的详细案例研究证明了所提出方法在不安全控制器行为下维护安全属性的有效性。实验结果表明,该框架在实时确保持续符合安全要求的同时,引入了最小的计算开销。

English abstract

Runtime enforcement has emerged as a promising approach for ensuring the safety of autonomous and cyber-physical systems operating in uncertain and dynamic environments. Unlike traditional runtime verification, runtime enforcement actively intervenes during execution to prevent property violations by modifying unsafe system behaviors. Existing enforcement frameworks primarily focus on untimed or discrete-time specifications and are often limited to delaying or suppressing events, making them inadequate for reactive systems exhibiting complex continuous dynamics. In this paper, we propose a runtime enforcement framework where safety requirements are modeled using Hybrid Automata (HA). The framework combines discrete-event editing with continuous-time monitoring to support enforcement actions such as suppression, delay, and insertion of events at arbitrary time instants. Upon observing environmental inputs, the automaton is initialized, and runtime reachability analysis is used to synthesize safe corrective actions. We formally define the enforcement problem for safety hybrid automata, establish enforceability conditions, and present an online enforcement algorithm for reactive systems. A detailed case study on an Adaptive Cruise Control (ACC) system demonstrates the effectiveness of the proposed approach in maintaining safety properties under unsafe controller behaviors. Experimental results show that the framework introduces minimal computational overhead while ensuring continuous compliance with safety requirements in real time.

2510.02660 2026-06-11 cs.HC cs.AI updated version

When Researchers Say Mental Model/Theory of Mind of AI, What Are They Really Talking About?

当研究人员谈论AI的心理模型/心智理论时,他们究竟在说什么?

Xiaoyun Yin, Elmira Zahmat Doost, Shiwen Zhou, Garima Arya Yadav, Jamie C. Gorman

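AI summary: Argues that current research on AI theory of mind conflates behavioral prediction with genuine cognition, and proposes shifting toward mutual theory-of-mind frameworks in human-AI interaction.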

Details
Comments
This work has been accepted at CogInterp @ NeurIPS 2025
AI-generated Chinese abstract

当研究人员声称AI系统拥有心智理论或心理模型时,他们本质上是在讨论行为预测和偏差校正,而非真正的心理状态。本文认为,当前的讨论将复杂的模式匹配与真实的认知混为一谈,忽略了模拟与体验之间的关键区别。尽管最近的研究表明,LLMs在实验室的心智理论任务中达到了人类水平的表现,但这些结果仅基于行为模仿。更重要的是,整个测试范式可能存在缺陷,因为它将个体人类认知测试应用于AI系统,而不是在人类与AI交互的当下直接评估人类认知。我建议将焦点转向互惠心智理论框架,该框架承认人类认知和AI算法的同时贡献,强调交互动态,而非孤立地测试AI。

English abstract

When researchers claim AI systems possess Theory of Mind (ToM) or mental models, they are fundamentally discussing behavioral predictions and bias corrections rather than genuine mental states. This position paper argues that the current discourse conflates sophisticated pattern matching with authentic cognition, missing a crucial distinction between simulation and experience. While recent studies show LLMs achieving human-level performance on ToM laboratory tasks, these results are based only on behavioral mimicry. More importantly, the entire testing paradigm may be flawed in applying individual human cognitive tests to AI systems rather than assessing human cognition directly in the moment of human-AI interaction. I suggest shifting focus toward mutual ToM frameworks that acknowledge the simultaneous contributions of human cognition and AI algorithms, emphasizing the interaction dynamics instead of testing AI in isolation.

2604.25018 2026-06-11 cs.ET cs.AI cs.DC cs.NI updated version

Internet of Everything in the 6G Era: Paradigms, Enablers, Potentials and Future Directions

6G时代的万物互联:范式、使能技术、潜力与未来方向

Driss Choukri, Essaid Sabir, Elmahdi Driouch, Abdelkrim Haqiq

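AI summary: Surveys the Internet of Everything (IoE) concept, its core components, architectural foundations, enabling technologies, and research challenges, and discusses open research directions toward 6G-enabled intelligent IoE systems, with emphasis on scalability, security, privacy, and energy efficiency.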

Details
Comments
48 pages, 15 figures, 6 tables, 272 references
AI-generated Chinese abstract

万物互联(IoE)代表了物联网(IoT)的演进,通过将人、数据、流程和事物集成到一个统一的智能生态系统中。IoE旨在增强多个应用领域的自动化、决策和服务效率,例如智慧城市、医疗保健、工业和下一代无线网络。本文提供了IoE概念、其核心组件、架构基础、使能技术和主要研究挑战的结构化概述。最后,讨论了面向6G使能的智能IoE系统的开放研究方向,重点关注可扩展性、安全性、隐私和能效。

English abstract

The Internet of Everything (IoE) represents an evolution of the Internet of Things (IoT) by integrating people, data, processes, and things into a unified intelligent ecosystem. IoE aims to enhance automation, decision-making, and service efficiency across multiple application domains such as smart cities, healthcare, industry, and next-generation wireless networks. This paper provides a structured overview of the IoE concept, its core components, architectural foundations, enabling technologies, and major research challenges. Finally, open research directions toward 6G-enabled intelligent IoE systems are discussed, with emphasis on scalability, security, privacy, and energy efficiency.

2606.05608 2026-06-11 cs.SE cs.AI updated version

Agentic Software: How AI Agents Are Restructuring the Software Paradigm

代理式软件:AI代理如何重构软件范式

Zhenfeng Cao

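AI summary: Through first-principles analysis, argues that AI agent systems with an LLM as the reasoning engine are fundamentally restructuring the software paradigm, shifting from traditional software (code carries the decision logic) to agentic systems (code is a temporary tool), and proposes Agentic Engineering as an emerging discipline.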

Details
Comments
15 pages, 2 figures, and 3 tables
AI-generated Chinese abstract

半个多世纪以来,软件工程一直基于一个基本前提:人类工程师分解问题,将决策逻辑编码为静态代码,并随着需求演变手动调整代码。本文认为,AI代理——即大型语言模型作为主要推理引擎、动态生成和丢弃代码作为工具资源的系统——的出现并非渐进式改进,而是对软件范式的根本性重构。基于复杂性缩放的第一性原理分析,我们形式化了传统软件(代码是决策逻辑的载体)与代理系统(代码是LLM驱动推理循环的临时工具)之间的区别。我们追溯了从许可软件到SaaS再到我们所谓的代理即服务(AaaS)的历史轨迹,表明每次转变都将额外的复杂性从最终用户转移出去。我们引入了代理工程作为一门新兴学科——其核心研究对象、控制模型和人类角色均不同于软件工程。通过分析最近的基准证据,包括SWE-bench Verified、EvoClaw和LangChain的多代理协调研究,我们展示了代理范式的变革潜力及其当前局限性。最后,我们提出了一个迈向自我进化代理生态系统的四阶段路线图,并为应对这一转变的从业者提供了具体建议。

English abstract

For over half a century, software engineering has operated on a foundational premise: human engineers decompose problems, encode decision logic into static code, and manually adapt that code as requirements evolve. This paper argues that the emergence of AI agents -- systems where large language models serve as the primary reasoning engine, dynamically generating and discarding code as an instrumental resource -- constitutes a fundamental restructuring of what software is, not an incremental tool improvement. We formalize the distinction between traditional deterministic software and agentic software: in the former, code is the carrier of pre-written decision logic; in the latter, the agent itself is the software, and its decision logic is generated at runtime. We trace the historical arc from licensed software to SaaS to Agent-as-a-Service (AaaS), showing that each shift transferred additional complexity away from end-users -- with the agentic shift transferring not just operational complexity but decision-making complexity itself. We introduce Agentic Engineering as an expansion of the software engineering discipline into a new paradigm, distinct in its core object of study (agent systems rather than static source code), its control model (LLM-driven rather than human-predefined), and its human role (intent architect rather than code author). Through analysis of recent benchmark evidence including SWE-bench Verified, EvoClaw, and LangChain's multi-agent coordination studies, we demonstrate both the transformative potential of the agentic paradigm and its current limitations. We conclude with a four-stage roadmap toward self-evolving agent ecosystems and concrete recommendations for practitioners navigating this transition.

2605.26938 2026-06-11 cs.AI math.OC

Developing a Totally Unimodular Linear Program for Optimal Conformance Checking: When and Why It Complements A*

开发用于最优合规性检查的全幺模线性规划:何时以及为何它补充A*

Izack Cohen

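AI summary: Reformulates alignment-based conformance checking as a totally unimodular linear program, exploits the network-flow structure to guarantee an integral optimal solution, and reports substantial speedups over A* on long traces with deviations.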

Details
Journal ref
Expert Systems with Applications, Volume 331, Part A, 2026, 133021
Comments
Author-accepted manuscript accepted for publication in Expert Systems with Applications. Code and experiment scripts are available at: https://github.com/Izack-Cohen/unimodular-conformance-checking. Version corresponding to the accepted paper: v1.0.0
AI-generated Chinese abstract

基于对齐的合规性检查是比较观察到的过程执行与规范过程模型的最先进方法。标准的精确解依赖于基于A*的启发式搜索,在存在长轨迹或大量偏差时可能表现出指数级运行时间。本文介绍了将基于对齐的合规性检查重新表述为定义在同步积的可达图上的全幺模线性规划(LP)。通过利用底层的网络流结构,所提出的公式通过LP松弛保证了整数最优极值点的存在,从而避免了与整数变量和分支定界搜索相关的组合开销。我们在来自真实世界和合成基准数据集的超过210万个合规性检查实例上进行了广泛的实证评估。结果表明,A*和LP方法表现出互补的性能特征:前者在短且符合良好的轨迹上表现最佳,而LP公式为具有偏差的较长轨迹提供了显著的加速,这正是合规性检查最具信息量的地方。基于这些发现,我们推导出结合两种方法的简单算法选择指南,与始终使用A*相比,实现了平均38.6%的运行时间节省和96%的选择准确率。

English abstract

Alignment-based conformance checking is the state-of-the-art approach for comparing observed process executions with normative process models. The standard exact solution relies on an A*-based heuristic search, which can exhibit exponential runtime in the presence of long traces or substantial deviations. This paper introduces a reformulation of alignment-based conformance checking as a totally unimodular linear program (LP) defined on the reachability graph of the synchronous product. By exploiting the underlying network-flow structure, the proposed formulation guarantees the existence of an integral optimal extreme-point solution through LP relaxation, thereby avoiding the combinatorial overhead associated with integer variables and branch-and-bound search. We conduct an extensive empirical evaluation on more than 2.1 million conformance checking instances derived from real-world and synthetic benchmark datasets. The results show that A* and the LP approach exhibit complementary performance characteristics: the former performs best on short, well-conforming traces, while the LP formulation provides substantial speedups for longer traces with deviations, precisely where conformance checking is most informative. Based on these findings, we derive simple algorithm-selection guidelines that combine both approaches, achieving average runtime savings of 38.6% with 96% selection accuracy compared to always using A*.
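The integrality guarantee mentioned in the abstract above rests on total unimodularity: every square submatrix of the constraint matrix has determinant -1, 0, or +1. A standard example is the node-arc incidence matrix of a directed graph, which is why network-flow LPs have integral extreme points when the right-hand side is integral. The brute-force check below is only a sketch for tiny matrices; the graph, the helper names, and the counterexample are hypothetical and are not taken from the paper or its repository.

```python
from itertools import combinations


def det(m):
    """Integer determinant by Laplace expansion (adequate for tiny matrices)."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * m[0][j] * det(minor)
    return total


def is_totally_unimodular(a):
    """Check every square submatrix; exponential cost, for illustration only."""
    rows, cols = len(a), len(a[0])
    for k in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), k):
            for c in combinations(range(cols), k):
                sub = [[a[i][j] for j in c] for i in r]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


# Node-arc incidence matrix of a hypothetical directed graph with arcs
# s->a, s->b, a->t, b->t, a->b (+1 at the tail node, -1 at the head node).
incidence = [
    [1, 1, 0, 0, 0],    # s
    [-1, 0, 1, 0, 1],   # a
    [0, -1, 0, 1, -1],  # b
    [0, 0, -1, -1, 0],  # t
]

# A 0/1 matrix that is not totally unimodular (its determinant is 2).
not_tu = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]

print(is_totally_unimodular(incidence), is_totally_unimodular(not_tu))
```

In practice nobody verifies total unimodularity by enumeration; the property is established structurally, as the abstract does by appealing to the network-flow form of the LP.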

2412.01459 2026-06-11 cs.CY cs.AI cs.HC

Perception Gaps in Risk, Benefit, and Value Between Experts and Public Challenge Socially Accepted AI

Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle

Details
Journal ref
AI & Society (2026)
English abstract

Artificial Intelligence (AI) is reshaping many societal domains, raising critical questions about its risks, benefits, and the potential misalignment between public and academic perspectives. This study examines how the general public (N=1110) -- individuals who interact with or are impacted by AI technologies -- and academic AI experts (N=119) -- those elites shaping AI development -- perceive AI's capabilities and impact across 71 scenarios. These scenarios span domains such as sustainability, healthcare, job performance, societal inequality, art, and warfare. Participants evaluated these scenarios across four dimensions using the psychometric model: likelihood, perceived risk and benefit, and overall value (or sentiment). The results suggest significant differences: experts consistently anticipate higher probabilities, perceive lower risks, report greater benefits, and express more positive sentiment toward AI compared to the non-experts. Moreover, the two groups apply different weighting schemes: experts discount risk more heavily relative to benefit than non-experts. Visual mappings of these evaluations uncover areas of convergent evaluation (e.g., AI performing medical diagnoses or criminal use) as well as tension points (e.g., decision of legal cases, political decision making), highlighting areas where communication and policy interventions may be needed. These findings underscore a critical translational challenge: if AI research and deployment are to align with societal priorities, the perception gap between developers and the public must be better understood and addressed. Our results provide an empirical foundation for value-sensitive AI governance and trust-building strategies across stakeholder groups.

2604.01383 2026-06-11 cs.CV cs.AI

GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization

Syed Ahsan Masud Zaidi, Lior Shamir, William Hsu, Scott Dietrich, Talha Zaidi

Details
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10087-10095, June 2026
Comments
9 pages, 5 figures, accepted to the CVPR 2026 Workshop on Computer Vision in Sports (CVSports). Code: https://github.com/AhsanZaidi12/GRAZE
English abstract

American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within ±10 frames on 77.5% of all clips and within ±20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.
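The "within ±k frames on N% of all clips" figures in the abstract above are tolerance-based localization rates. The sketch below shows one plausible way to compute such a rate. It assumes that clips without a valid output count as misses, which is one reading of "of all clips" and is not confirmed by the abstract; the function name and the frame indices are hypothetical.

```python
def within_tolerance_rate(predicted, truth, tol):
    """Fraction of ALL clips whose predicted frame is within +/- tol of the truth.

    A prediction of None means the pipeline produced no valid output for that
    clip; it is counted as a miss (an assumption made for this sketch).
    """
    hits = sum(
        1
        for p, t in zip(predicted, truth)
        if p is not None and abs(p - t) <= tol
    )
    return hits / len(truth)


# Hypothetical predicted and annotated first-contact frames for four clips.
pred = [100, 215, None, 48]
true = [104, 200, 310, 60]

rate10 = within_tolerance_rate(pred, true, 10)
rate20 = within_tolerance_rate(pred, true, 20)
print(rate10, rate20)
```

Dividing by the total number of clips rather than the number of valid outputs keeps the rate from improving simply because a pipeline declines to answer on hard clips.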

2511.20216 2026-06-11 cs.AI cs.CE cs.CV cs.LG cs.RO

CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents

Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Samwoo Seong, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Seongjae Kang, Youngjae Yu, Yunsung Lee

Details
English abstract

Current navigation benchmarks focus on task success but do not capture the economic constraints essential for commercializing autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents on a cost-revenue and break-even analysis, pairing Isaac Sim's collision and cargo dynamics with industry-standard data such as Securities and Exchange Commission (SEC) filings and Abbreviated Injury Scale (AIS) injury reports. To our knowledge, CostNav is the first physics-grounded economic benchmark to use regulatory and financial data to quantify the gap between navigation metrics and commercial deployment, revealing that high task-success rates alone do not ensure economic viability. Evaluating seven baselines (two rule-based and five imitation-learning methods), we find no method economically viable: all yield negative contribution margins. CANVAS, using only an RGB camera and GPS, attains the highest task success and the least-negative margin among methods with non-zero Service-Level Agreement (SLA) compliance (-$28.40/run), outperforming LiDAR-equipped Nav2 w/ GPS (-$37.34/run). A sim-trained policy evaluated on a real delivery robot yields SLA compliance close to its simulation result, indicating that policy performance in CostNav's simulation transfers to real-world deployment. We challenge the community to achieve economic viability on CostNav, which scores methods by cost-revenue outcomes. All resources are available at https://github.com/worv-ai/CostNav.

2510.09885 2026-06-11 cs.CL cs.AI

Diffusion-Inspired Masked Fine-Tuning for Knowledge Injection in Autoregressive LLMs

Xu Pan, Ely Hahami, Jingxuan Fan, Ziqian Xie, Haim Sompolinsky

Details
English abstract

Large language models (LLMs) are often used in environments where facts evolve, yet factual knowledge updates via fine-tuning on unstructured text often suffer from 1) reliance on compute-heavy paraphrasing augmentation and 2) the reversal curse. Recent studies show diffusion large language models (dLLMs) require fewer training samples to achieve lower loss in pre-training and are more resistant to the reversal curse, suggesting dLLMs may learn new knowledge more easily than autoregressive LLMs (arLLMs). We test this hypothesis in controlled knowledge fine-tuning experiments and find that while arLLMs rely on paraphrase augmentation to generalize knowledge text into question-answering (QA) capability, dLLMs do not require paraphrases to achieve high QA accuracy. To further investigate whether the demasking objective alone can induce such a knowledge injection advantage in dLLMs regardless of their diffusion denoising paradigm, we propose masked fine-tuning for arLLMs, which prompts an arLLM to reconstruct the original text given a masked version in context. The masked fine-tuning for arLLMs substantially improves the efficacy of knowledge injection, i.e., no paraphrasing is needed and the model resists the reversal curse, closing the gap between arLLMs and dLLMs. We also demonstrate broader applicability: on a large-scale knowledge-intensive dataset (1.2M samples), masked SFT achieves the best downstream accuracy on GPQA-diamond among all fine-tuning variants. The demasking objective also improves SFT on math tasks, suggesting broad utility beyond factual knowledge injection.
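The abstract above describes masked fine-tuning as prompting an autoregressive model to reconstruct the original text from a masked version given in context. The sketch below shows how one such training pair might be built. Everything specific here is an assumption for illustration: word-level masking, the `[MASK]` token, the 25% ratio, the prompt template, and the fictional sample sentence are hypothetical and may differ from the authors' setup.

```python
import random


def make_masked_example(text, mask_ratio=0.25, mask_token="[MASK]", seed=0):
    """Build one (prompt, target) pair for masked-reconstruction fine-tuning.

    The prompt carries a masked copy of the text; the target is the original.
    Loss would typically be computed on the target tokens only.
    """
    rng = random.Random(seed)
    words = text.split()
    n_mask = max(1, round(mask_ratio * len(words)))
    masked_idx = set(rng.sample(range(len(words)), n_mask))
    masked = [mask_token if i in masked_idx else w for i, w in enumerate(words)]
    prompt = (
        "Reconstruct the original text.\n"
        "Masked: " + " ".join(masked) + "\n"
        "Original:"
    )
    return {"prompt": prompt, "target": " " + text}


# Fictional fact used purely as sample data.
text = "The capital of the fictional country Zubrowka is the city of Lutz"
example = make_masked_example(text)
print(example["prompt"])
print(example["target"])
```

Resampling the mask with a different `seed` on each epoch would give the model many views of the same fact without any paraphrasing step, which is the role the abstract attributes to the demasking objective.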

2409.00743 2026-06-11 cs.LG cs.AI

Interpretable Clustering: A Survey

Lianyu Hu, Mudi Jiang, Junjie Dong, Xinying Liu, Zengyou He

Details
Journal ref
ACM Computing Surveys, Volume 58, Issue 8, Article 215 (2026)
Comments
14 pages, 2 figures, 3 tables
English abstract

In recent years, much of the research on clustering algorithms has primarily focused on enhancing their accuracy and efficiency, frequently at the expense of interpretability. However, as these methods are increasingly being applied in high-stakes domains such as healthcare, finance, and autonomous systems, the need for transparent and interpretable clustering outcomes has become a critical concern. This is not only necessary for gaining user trust but also for satisfying the growing ethical and regulatory demands in these fields. Ensuring that decisions derived from clustering algorithms can be clearly understood and justified is now a fundamental requirement. To address this need, this paper provides a comprehensive and structured review of the current state of explainable clustering algorithms, identifying key criteria to distinguish between various methods. These insights can effectively assist researchers in making informed decisions about the most suitable explainable clustering methods for specific application contexts, while also promoting the development and adoption of clustering algorithms that are both efficient and transparent. For convenient access and reference, an open repository organizes representative and emerging interpretable clustering methods under the taxonomy proposed in this survey, available at https://github.com/hulianyu/Awesome-Interpretable-Clustering

2601.09072 2026-06-11 cs.AI cs.CL stat.ME

Human-AI Co-design for Clinical Prediction Models

Jean Feng, Avni Kothari, Patrick Vossler, Andrew Bishara, Lucas Zier, Newton Addo, Aaron Kornblith, Yan Shuo Tan, Chandan Singh

Details
Journal ref
npj Digital Medicine 2026
English abstract

Developing safe, effective, and practically useful clinical prediction models (CPMs) traditionally requires iterative collaboration between clinical experts, data scientists, and informaticists. This process refines the often small but critical details of the model building process, such as which features/patients to include and how clinical categories should be defined. However, this traditional collaboration process is extremely time- and resource-intensive, resulting in only a small fraction of CPMs reaching clinical practice. This challenge intensifies when teams attempt to incorporate unstructured clinical notes, which can contain an enormous number of concepts. To address this challenge, we introduce HACHI, an iterative human-in-the-loop framework that uses AI agents to accelerate the development of fully interpretable CPMs by enabling the exploration of concepts in clinical notes. HACHI alternates between (i) an AI agent rapidly exploring and evaluating candidate concepts in clinical notes and (ii) clinical and domain experts providing feedback to improve the CPM learning process. HACHI defines concepts as simple yes-no questions that are used in linear models, allowing the clinical AI team to transparently review, refine, and validate the CPM learned in each round. In two real-world prediction tasks (acute kidney injury and traumatic brain injury), HACHI outperforms existing approaches, surfaces new clinically relevant concepts not included in commonly-used CPMs, and improves model generalizability across clinical sites and time periods. Furthermore, HACHI reveals the critical role of the clinical AI team, such as directing the AI agent to explore concepts that it had not previously considered, adjusting the granularity of concepts it considers, changing the objective function to better align with the clinical objectives, and identifying issues of data bias and leakage.

2512.08343 2026-06-11 cs.AI

Soil Compaction Parameters Prediction Based on Automated Machine Learning Approach

Caner Erden, Alparslan Serhat Demir, Abdullah Hulusi Kokcam, Talas Fikret Kurnaz, Ugur Dagdeviren

Details
Journal ref
Computers & Industrial Engineering, 2026
Comments
Presented at the 13th International Symposium on Intelligent Manufacturing and Service Systems, Duzce, Turkey, Sep 25-27, 2025. Also available on Zenodo: DOI 10.5281/zenodo.17533851
English abstract

Soil compaction is critical in construction engineering to ensure the stability of structures like road embankments and earth dams. Traditional methods for determining optimum moisture content (OMC) and maximum dry density (MDD) involve labor-intensive laboratory experiments, and empirical regression models have limited applicability and accuracy across diverse soil types. In recent years, artificial intelligence (AI) and machine learning (ML) techniques have emerged as alternatives for predicting these compaction parameters. However, ML models often struggle with prediction accuracy and generalizability, particularly with heterogeneous datasets representing various soil types. This study proposes an automated machine learning (AutoML) approach to predict OMC and MDD. AutoML automates algorithm selection and hyperparameter optimization, potentially improving accuracy and scalability. Through extensive experimentation, the study found that the Extreme Gradient Boosting (XGBoost) algorithm provided the best performance, achieving R-squared values of 80.4% for MDD and 89.1% for OMC on a separate dataset. These results demonstrate the effectiveness of AutoML in predicting compaction parameters across different soil types. The study also highlights the importance of heterogeneous datasets in improving the generalization and performance of ML models. Ultimately, this research contributes to more efficient and reliable construction practices by enhancing the prediction of soil compaction parameters.

2510.11290 2026-06-11 cs.AI cs.HC

Evolution in Simulation: AI-Agent School with Dual Memory for High-Fidelity Educational Dynamics

Sheng Jin, Haoming Wang, Zhiqi Gao, Yongbo Yang, Bao Chunjia, Chengliang Wang

Details
Journal ref
Findings of the Association for Computational Linguistics: EMNLP 2025
Comments
9 pages, 7 figures, EMNLP conference
English abstract

Large language model (LLM)-based agents are increasingly pivotal in simulating and understanding complex human systems and interactions. We propose the AI-Agent School (AAS) system, built around a self-evolving mechanism that leverages agents for simulating complex educational dynamics. To address the fragmented modeling of teaching processes and the limited performance of agents in simulating diverse educational participants, AAS constructs the Zero-Exp strategy and employs a continuous "experience-reflection-optimization" cycle, grounded in a dual memory base that comprises experience and knowledge bases and incorporates short-term and long-term memory components. Through this mechanism, agents autonomously evolve via situated interactions within diverse simulated school scenarios. This evolution enables agents to more accurately model the nuanced, multi-faceted teacher-student engagements and underlying learning processes found in physical schools. Experiments confirm that AAS can effectively simulate intricate educational dynamics and foster advanced agent cognitive abilities, providing a foundational stepping stone from the "Era of Experience" to the "Era of Simulation" by generating high-fidelity behavioral and interaction data.

2510.06242 2026-06-11 cs.CL cs.AI

Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses

Subin An, Yugyeong Ji, Junyoung Kim, Heejin Kook, Yang Lu, Josh Seltzer

Journal ref
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Comments
EMNLP Industry Track
English abstract

Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses with their distinct characteristics. To address such characteristics, we propose a two-stage evaluation framework specifically designed for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions (effort, relevance, and completeness) are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates high practical applicability for real-world applications such as response quality prediction and response rejection, showing strong correlations with expert assessment.

2412.13841 2026-06-11 cs.CY cs.AI cs.HC

Cultural Dimensions of AI Perception: Charting Expectations, Risks, Benefits, Tradeoffs, and Value in Germany and China

Philipp Brauner, Felix Glawe, Gian Luca Liehner, Luisa Vervier, Martina Ziefle

Journal ref
Acta Psychologica (2026), volume 268, article 107094
English abstract

As artificial intelligence (AI) continues to advance, understanding public perceptions -- including biases, risks, and benefits -- is essential for guiding research priorities and AI alignment, shaping public discourse, and informing policy. This exploratory study investigates cultural differences in mental models of AI using 71 imaginaries of AI's potential futures. Drawing on cross-cultural convenience samples from Germany (N=52) and China (N=60), we identify significant differences in expectations, evaluations, and risk-benefit tradeoffs. Participants from Germany generally provided more cautious assessments, whereas participants from China expressed greater optimism regarding AI's societal benefits. Chinese participants exhibited relatively balanced risk-benefit tradeoffs ($β=-0.463$ for risk and $β=+0.484$ for benefit, $r^2=.630$). In contrast, German participants placed greater emphasis on AI's benefits and comparatively less on risks ($β=-0.337$ for risk and $β=+0.715$ for benefit, $r^2=.839$). Visual cognitive maps illustrate these contrasts, offering new perspectives on how cultural contexts shape AI acceptance. Our findings highlight key factors influencing public perception and provide insights for aligning AI with societal values and promoting equitable and culturally sensitive integration of AI technologies.

2508.15943 2026-06-11 cs.AI

T-ILR: a Neurosymbolic Integration for LTLf

Riccardo Andreoni, Andrei Buliga, Alessandro Daniele, Chiara Ghidini, Marco Montali, Massimiliano Ronzani

Journal ref
Proceedings of The 19th International Conference on Neurosymbolic Learning and Reasoning (NeSy 2025)
Comments
Accepted for presentation at NeSy 2025. 10 pages
English abstract

State-of-the-art approaches for integrating symbolic knowledge with deep learning architectures have demonstrated promising results in static domains. However, methods to handle temporal logic specifications remain underexplored. The only existing approach relies on an explicit representation of a finite-state automaton corresponding to the temporal specification. Instead, we propose a neurosymbolic framework designed to incorporate temporal logic specifications, expressed in Linear Temporal Logic over finite traces (LTLf), directly into deep learning architectures for sequence-based tasks. We extend the Iterative Local Refinement (ILR) neurosymbolic algorithm, leveraging the recent introduction of fuzzy LTLf interpretations. We name the proposed method Temporal Iterative Local Refinement (T-ILR). We assess T-ILR on an existing benchmark for temporal neurosymbolic architectures, consisting of the classification of image sequences in the presence of temporal knowledge. The results demonstrate improved accuracy and computational efficiency compared to the state-of-the-art method.

2306.01690 2026-06-11 cs.LG cs.AI

Context selectivity with dynamic availability enables lifelong continual learning

Martin Barry, Wulfram Gerstner, Guillaume Bellec

English abstract

"You never forget how to ride a bike", -- but how is that possible? The brain is able to learn complex skills, stop the practice for years, learn other skills in between, and still retrieve the original knowledge when necessary. The mechanisms of this capability, referred to as lifelong learning (or continual learning, CL), are unknown. We suggest a bio-plausible meta-plasticity rule building on classical work in CL which we summarize in two principles: (i) neurons are context selective, and (ii) a local availability variable partially freezes the plasticity if the neuron was relevant for previous tasks. In a new neuro-centric formalization of these principles, we suggest that neuron selectivity and neuron-wide consolidation is a simple and viable meta-plasticity hypothesis to enable CL in the brain. In simulation, this simple model balances forgetting and consolidation leading to better transfer learning than contemporary CL algorithms on image recognition and natural language processing CL benchmarks.