2606.10489 2026-06-10 cs.AI New submission

A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

Youssef Abdelkader, Humbert Fiorino, Damien Pellier

Affiliations: Univ. Grenoble Alpes - LIG

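AI summary: This paper runs complementary experiments on PlanGPT, a large language model, evaluating it on two metrics, plan cost and generation time, and comparing it with a traditional planner; it finds that PlanGPT is no better than a greedy search strategy.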

Comments: 7 pages

Abstract

Automated Planning is a subfield of Artificial Intelligence (AI) whose main objective is to generate a sequence of actions, known as a plan, that reaches a goal state from an initial state. A planning problem is defined by a set of objects, an initial state and a desired goal state. The objective is to compute a plan that leads from the initial state to the goal state. Programs that generate plans are called planners. In this paper, we present a complementary study of PlanGPT, a state-of-the-art LLM released last year. We reran some experiments to verify whether planning with LLMs is pertinent and worthwhile. We also checked whether the plan coverage results reported in the official PlanGPT paper were correct, and we performed a more comprehensive study of PlanGPT's performance using two metrics: Plan Cost and Plan Generation Time. The results of PlanGPT were compared with those produced by a traditional planner for the same plans and the same metrics. We found that PlanGPT is no better than a greedy search strategy.
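To make the two metrics concrete, the sketch below measures Plan Cost and Plan Generation Time for a greedy best-first search on a toy problem. The domain, the heuristic, and the names `greedy_best_first` and `evaluate` are illustrative assumptions; they are not taken from the paper or from PlanGPT.

```python
import heapq
import itertools
import time

def greedy_best_first(start, goal, neighbors, heuristic):
    """Greedy best-first search: always expand the state with the lowest heuristic value."""
    counter = itertools.count()  # tie-breaker so the heap never compares plans
    frontier = [(heuristic(start), next(counter), start, [])]
    visited = {start}
    while frontier:
        _, _, state, plan = heapq.heappop(frontier)
        if state == goal:
            return plan
        for action, nxt, cost in neighbors(state):
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(
                    frontier,
                    (heuristic(nxt), next(counter), nxt, plan + [(action, cost)]),
                )
    return None  # no plan found

def evaluate(planner):
    """Run a planner and report the two metrics: plan cost and generation time."""
    t0 = time.perf_counter()
    plan = planner()
    elapsed = time.perf_counter() - t0
    cost = None if plan is None else sum(c for _, c in plan)
    return {"plan": plan, "plan_cost": cost, "generation_time_s": elapsed}

# Hypothetical toy domain: integers on a line with unit-cost moves, goal state 5.
def line_neighbors(s):
    return [("inc", s + 1, 1), ("dec", s - 1, 1)]

result = evaluate(
    lambda: greedy_best_first(0, 5, line_neighbors, lambda s: abs(5 - s))
)
print(result["plan_cost"], result["generation_time_s"])
```

Any plan generator, including an LLM wrapped in a function that returns a list of `(action, cost)` pairs, could be passed to `evaluate` in the same way, which is what makes a like-for-like comparison possible.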

2606.10507 2026-06-10 cs.AI New submission

HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

Juncheng Diao, Zhicong Lu, Peiguang Li, Yongwei Zhou, Changyuan Tian, Qingbin Li, Rongxiang Weng, Jingang Wang, Xunliang Cai

Affiliations: Meituan; University of Chinese Academy of Sciences

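AI summary: Proposes hierarchical planning and information folding, which reduce long-context interference through subgoal decomposition and history folding, and combines hierarchical reflection with subgoal-oriented process rewards to improve LLM performance on multi-turn long-horizon tasks.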

Abstract

While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performance often degrades in multi-turn long-horizon agentic tasks. Existing methods have made progress through fine-grained credit assignment to alleviate long-horizon sparse rewards and hierarchical reinforcement learning to decompose tasks and reduce long-term dependency. However, these methods still do not directly address long-context interference, in which continuously growing histories weaken the agent's ability to track the global task state and impair subsequent reasoning and decision-making. Inspired by the way humans handle complex tasks through subgoal decomposition and completed progress summarization, we propose Hierarchical Planning and Information Folding (HIPIF) for long-horizon LLM agent learning. HIPIF trains the agent end-to-end to organize long-horizon execution around explicit subgoals while folding completed subgoal histories to reduce long-context interference. Furthermore, to stabilize subgoal-based planning and execution, HIPIF combines hierarchical reflection and subgoal-oriented process rewards to guide subgoal generation, transition, and execution, without relying on costly auxiliary models or task-specific expert trajectories. Extensive experiments on three publicly available agentic benchmarks demonstrate the validity of our method.
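The idea of folding completed subgoal histories can be sketched as a context builder that keeps raw steps only for the active subgoal and a one-line summary for each finished one. This is a minimal illustration under assumptions: the class `FoldedHistory` and its methods are hypothetical, and in HIPIF the summaries would come from the trained agent rather than from the caller.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FoldedHistory:
    task: str
    folded: List[str] = field(default_factory=list)  # summaries of completed subgoals
    subgoal: Optional[str] = None                     # subgoal currently being executed
    steps: List[str] = field(default_factory=list)    # raw steps of the current subgoal only

    def start(self, subgoal: str) -> None:
        self.subgoal, self.steps = subgoal, []

    def record(self, step: str) -> None:
        self.steps.append(step)

    def fold(self, summary: str) -> None:
        """Replace the finished subgoal's raw steps with a one-line summary."""
        self.folded.append(f"{self.subgoal}: {summary}")
        self.subgoal, self.steps = None, []

    def render(self) -> str:
        """Build the context the agent would see at the next turn."""
        lines = [f"Task: {self.task}"]
        lines += [f"Done: {s}" for s in self.folded]
        if self.subgoal is not None:
            lines.append(f"Current subgoal: {self.subgoal}")
            lines += [f"  {s}" for s in self.steps]
        return "\n".join(lines)

ctx = FoldedHistory(task="book a flight")
ctx.start("find flights")
ctx.record("search(NYC -> SFO)")
ctx.record("open result 2")
ctx.fold("chose flight UA 123")
ctx.start("pay")
ctx.record("enter card")
prompt = ctx.render()
print(prompt)
```

The rendered context grows with the number of subgoals rather than with the number of raw steps, which is the property the abstract attributes to folding.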

2606.10532 2026-06-10 cs.AI New submission

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

Yunhan Jiang, Wenbin Duan, Shasha Guo, Liang Pang, Xiaoqian Sun, Huawei Shen

Affiliations: State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

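AI summary: Proposes ActiveMem, a framework that decouples memory from core reasoning and accumulates semantic gists through a distributed active memory system, achieving high accuracy at low overhead on long-horizon reasoning tasks.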

Abstract

Memory is essential for enabling large language model (LLM) agents to handle long-horizon reasoning tasks. Existing memory mechanisms are largely centralized, typically organizing retrieved information and interaction history within a single model context. This design imposes a fundamental trade-off: scaling reasoning trajectories risks context overload, whereas aggressive content pruning may result in irreversible information loss. Seeking a better trade-off, we draw inspiration from human cognitive systems, especially the functional complementarity between the prefrontal cortex (executive control) and the hippocampus (memory management), suggesting that such a trade-off need not be inherent, but may instead stem from centralized memory organization. To this end, we propose ActiveMem, a heterogeneous framework that decouples agent memory from the core reasoning process. Specifically, a high-level Planner utilizes distilled semantic gists to execute reasoning, while a lightweight, distributed memory system operates in parallel to actively accumulate and consolidate these gists throughout the task. Experiments on BrowseComp-Plus and GAIA show that ActiveMem achieves state-of-the-art accuracy with significantly reduced overhead, demonstrating the effectiveness of distributed active memory for long-horizon reasoning.

2606.10677 2026-06-10 cs.AI cs.CL New submission

Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Suozhao Ji, Baodong Wu, Zehao Wang, Lei Xia, Qingping Li, Ruisong Wang, Wenbo Ding, Zhenhua Zhu, Boxun Li, Guohao Dai, Yu Wang

Affiliations: Infinigence AI; Tsinghua University; Shanghai Jiaotong University

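AI summary: Proposes the Infini Memory architecture, which organizes agent memory into topic documents and achieves maintainable long-term memory through buffered consolidation and iterative retrieval, reaching an overall score of 64.7% on MemoryAgentBench.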

Abstract

Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text-based persistent memory architecture that treats agent memory as topic-structured documents. Each topic document serves as a semantic unit for collecting related evidence, preserving metadata, and revising facts over time. New observations are first staged in a buffer and periodically consolidated into coherent textual contexts. At inference time, an agentic retrieval procedure lets the LLM read memory through iterative tool calls rather than a single retrieval step. On MemoryAgentBench, Infini Memory achieves an overall score of 64.7%. Ablations show that topic-structured maintenance and iterative evidence inspection improve complementary aspects of long-term memory use.
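The buffer-then-consolidate flow can be illustrated with a small sketch in which staged observations are periodically merged into per-topic documents and later observations revise earlier facts. The class `TopicMemory`, its key-value document layout, and the `buffer_limit` trigger are assumptions made for illustration; the paper's documents are free text maintained by an LLM.

```python
from collections import defaultdict

class TopicMemory:
    def __init__(self, buffer_limit=3):
        self.buffer = []                     # staged (topic, key, value) observations
        self.documents = defaultdict(dict)   # topic -> {fact key: latest value}
        self.buffer_limit = buffer_limit

    def observe(self, topic, key, value):
        """Stage an observation; consolidate once the buffer is full."""
        self.buffer.append((topic, key, value))
        if len(self.buffer) >= self.buffer_limit:
            self.consolidate()

    def consolidate(self):
        """Merge staged observations into topic documents, in arrival order."""
        for topic, key, value in self.buffer:
            self.documents[topic][key] = value  # a later observation revises the fact
        self.buffer.clear()

    def read(self, topic):
        """Return the topic document as text, as a retrieval tool call might."""
        doc = self.documents.get(topic, {})
        return "\n".join(f"{k}: {v}" for k, v in doc.items())

memory = TopicMemory(buffer_limit=2)
memory.observe("alice", "city", "Paris")
memory.observe("alice", "city", "Berlin")    # fills the buffer and triggers consolidation
memory.observe("alice", "job", "engineer")   # remains staged until the next consolidation
print(memory.read("alice"))
```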

Human-AI Coordination Zones: A Framework for Designing Collaborative Human-AI Experiences with Agentic AI

James Pierce, Vaiva Kalnikaitė, Siddharth Gupta, Brian Granger

Affiliations: Amazon Web Services

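AI summary: Proposes a human-AI coordination framework that defines coordination zones along three dimensions (salience, involvement, and activity) and provides an input taxonomy, coordination curves, and design patterns for generating, analyzing, and communicating human-AI interaction experiences.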

Abstract

As generative and agentic AI becomes embedded in everyday products, practitioners face a persistent challenge: how to design human-AI coordination -- the ongoing mutual adjustment between users and AI systems as mediated through interfaces -- that supports usability, trust, and safety. Existing resources offer high-level principles ("be transparent," "maintain user control") or low-level UI patterns, but there is a lack of mid-level design knowledge bridging the two. Through landscape and artifact analysis of 60 commercial AI applications, we introduce a framework defining human-AI coordination as the interplay of three dimensions: salience (how prominently AI is presented), involvement (what users can do to engage AI), and activity (what AI actually does). We contribute mid-level tools including coordination zones (done-for-me, done-under-me, done-with-me, done-without-me), an input taxonomy (prompted, sparked, inferred, layered), coordination curves for mapping user journeys, and design patterns demonstrating the generative capacity of the framework. The framework can be applied generatively to design experiences, analytically to evaluate existing ones, and communicatively to articulate ideas across stakeholders.

2606.09916 2026-06-10 cs.LG cs.AI Cross-list

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

Junjie Li, Jiong Lou, Jie Li

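AI summary: To address the KV cache becoming the serving bottleneck in multi-turn LLM agents, proposes IntentKV, which performs cross-turn intent-aware KV pruning using a session-level QueryMemory and a residual attention head, substantially reducing peak request tokens and KV reads while preserving accuracy.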

Abstract

Multi-turn LLM agents fan short queries into long trajectories of tool calls, search results, and intermediate reasoning. Both KV memory and KV read bandwidth grow by orders of magnitude across a single trajectory, making the key-value (KV) cache, not parameter compute, the dominant serving bottleneck for long-horizon agents. We introduce IntentKV, learned KV pruning that keeps the base LLM frozen. IntentKV maintains a session-level QueryMemory of cross-turn intent, scores live history tokens with a memory-attention rule, and adds a zero-initialized residual head with cross-attention over current-query K-vectors. To stay composable with prefix caches, eviction is a slot-map redirection: dropped positions route to a sentinel dead slot while surviving K/V rows, RoPE phases, and slot identities stay in place. IntentKV matches the no-pruning full-cache baseline with almost no accuracy drop under tight KV budgets: at an 8k KV budget, mean peak request tokens drop 23.9% on Qwen3-8B and 30.7% on Qwen2.5-14B. On the 100 longest BCP queries that all methods complete on Qwen2.5-14B, IntentKV-8k further cuts worst-case peak request tokens from 92.3k to 20.5k, a 77.8% reduction, and worst-case raw KV reads from 411M to 31M, a 92.6% reduction.
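Eviction as slot-map redirection can be sketched as follows: instead of deleting or moving K/V rows, dropped positions are pointed at a sentinel slot while surviving positions keep their slot ids. The function `evict`, the reserved slot id `DEAD_SLOT = 0`, and the example scores are assumptions for illustration; in IntentKV the scores would come from the learned memory-attention rule.

```python
DEAD_SLOT = 0  # assumption: slot 0 is reserved as the sentinel and holds no real K/V row

def evict(slot_map, scores, budget):
    """Keep the `budget` highest-scoring live positions; redirect the rest to DEAD_SLOT.

    slot_map[p] is the cache slot that token position p reads from. Surviving
    positions keep their slot id, so stored K/V rows never move.
    """
    live = [p for p, slot in enumerate(slot_map) if slot != DEAD_SLOT]
    keep = set(sorted(live, key=lambda p: scores[p], reverse=True)[:budget])
    return [slot if p in keep else DEAD_SLOT for p, slot in enumerate(slot_map)]

slot_map = [1, 2, 3, 4, 5]               # five token positions, each in its own slot
scores = [0.9, 0.1, 0.5, 0.05, 0.7]      # hypothetical per-token importance scores
new_map = evict(slot_map, scores, budget=3)
print(new_map)  # [1, 0, 3, 0, 5]
```

Because only the map changes, a prefix cache that shares the underlying rows is unaffected, which is the composability property the abstract describes.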

2606.09961 2026-06-10 cs.LG cs.AI Cross-list

Fast and Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

Thanh Nguyen, Tri Ton, Hongbin Choe, Tung M. Luu, Chang D. Yoo

Affiliations: University of California, Berkeley

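AI summary: Proposes Bootstrapped Flow Q-Learning (BFQ), which applies divide-and-conquer to the displacement vector and bootstraps short-range components to achieve single-step action generation without auxiliary networks or distillation, significantly reducing computational cost while improving performance.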

Comments: ICML 2026, 19 pages

Journal ref: ICML 2026

Abstract

Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step action generation typically introduce auxiliary networks, policy distillation, or multi-phase training, which frequently compromise simplicity, stability, or performance. To address these limitations, we introduce Bootstrapped Flow Q-Learning (BFQ), a novel framework that enables accurate single-step action generation during both training and inference, without auxiliary networks or distillation procedures. BFQ adopts a divide-and-conquer view of the displacement vector along the flow path: it begins by learning short-range displacements that can be accurately estimated from the Flow Matching marginal velocity, and bootstraps these components to directly learn a noise-to-action mapping in a single step. This formulation eliminates multi-step denoising, resulting in a learning procedure that is substantially faster, simpler, and more robust. Extensive D4RL evaluations show that BFQ improves performance while significantly reducing computational cost compared to multi-step diffusion baselines, demonstrating that single-step action generation suffices for high-performance offline Reinforcement Learning.

2606.10705 2026-06-10 cs.LG cs.AI cs.SY eess.SY Cross-list

Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

Yavar Yeganeh, Mahsa Shekari, Nicla Frigerio, Daniele Pagano, Andrea Matta

Affiliations: Politecnico di Milano; STMicroelectronics

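AI summary: Proposes an event-driven deep reinforcement learning framework that models semiconductor fabrication control as a centralized-agent problem and optimizes multi-objective policies with an event-driven temporal-difference method, significantly improving throughput and utilization in high-fidelity simulation.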

Abstract

Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of processing steps across extensive equipment networks. These characteristics yield complex, high-dimensional decision problems with delayed feedback and long-horizon requirements, complicating production planning and control. We propose a deep reinforcement learning framework for multi-objective policy optimization at this scale. Specifically, we formulate control as a centralized-agent problem, where a core policy coordinates system-wide decisions, while system evolution is represented as an interconnected temporal process driven by discrete events. Accordingly, we develop a tailored event-driven temporal-difference formulation that remains general and can be integrated with various policy optimization methods under relevant training settings. We investigate several core model-free algorithms incorporated into this framework and evaluate their effectiveness using high-fidelity simulations of diverse, industry-real operating scenarios. Across extensive validation experiments, agents trained in both offline and online settings show significant and consistent gains in throughput and utilization. We further evaluate performance and generalization across training phases, clarifying the relative strengths of alternative reinforcement learning formulations and algorithms. Overall, the results support the scalability, generality, and transferability of the proposed framework for controlling event-driven complex adaptive systems.

2606.10709 2026-06-10 cs.IR cs.AI Cross-list

Building Coherent Spatial Memory in LLM Agents through Graph Repair

Puzhen Zhang, Xuyang Chen, Yu Feng, Yuhan Jiang, Liqiu Meng

Affiliations: Chair of Cartography and Visual Analytics

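AI summary: Proposes the LLM-MapRepair framework, which uses version control and edge impact scoring to detect and correct structural inconsistencies in incrementally constructed navigation graphs, significantly improving node and edge recall on multiple benchmarks.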

Abstract

Given a map description through global traversal navigation instructions, an LLM can often infer the implicit spatial layout and answer user queries by providing shortest paths. However, such context-dependent querying becomes infeasible as environments grow larger, motivating the need for incremental map construction that builds a complete topological graph from stepwise observations. We propose LLM-MapRepair, a framework for LLM-driven map construction and repair, designed to detect, localize, and correct structural inconsistencies in incrementally constructed navigation graphs. Our contributions include a Version Control mechanism for graph construction, an Edge Impact Score for repair prioritization, and a cleaned variant of the MANGO benchmark tailored for LLM-driven map construction and repair. We evaluate the framework in four settings: a synthetic per-component ablation (gpt-4.1, n=20 seeds per cell), a cross-vendor sweep over seven LLMs from OpenAI, Anthropic, and Google on both synthetic and TextWorld procedurally generated text-adventure games, a repair-stage evaluation on all 42 cleaned-MANGO games with non-zero residual conflicts (534 conflicts; three vendors x three modes plus two non-LLM references), and an end-to-end natural-text deployment on Chapters 16-17 of Dream of the Red Chamber (DRC). On the DRC deployment, LLM-MapRepair achieves 94.3% node recall (+8.6 pp over direct LLM mapping) and 88.2% edge recall (+55.8 pp), using GPT-4.1; the recall improvements come with predicted node and edge counts that are roughly 4x the ground-truth counts (Table 4), reflecting the discretization-driven over-generation trade-off we discuss in the Limitations.
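One kind of structural inconsistency in an incrementally built navigation graph is a location whose same exit leads to two different places. The sketch below detects only that case; the edge format `(source, direction, target)`, the function `find_conflicts`, and the room names are assumptions for illustration, and the abstract does not specify which conflict types the framework handles or how the Edge Impact Score is computed.

```python
from collections import defaultdict

def find_conflicts(edges):
    """Return {(source, direction): [targets]} for exits that lead to more than one target.

    edges is a list of (source, direction, target) triples collected step by step.
    """
    targets = defaultdict(set)
    for source, direction, target in edges:
        targets[(source, direction)].add(target)
    return {key: sorted(t) for key, t in targets.items() if len(t) > 1}

# Hypothetical stepwise observations from a text-adventure walkthrough.
edges = [
    ("hall", "north", "library"),
    ("library", "south", "hall"),
    ("hall", "east", "kitchen"),
    ("hall", "north", "study"),   # contradicts the first observation
]
conflicts = find_conflicts(edges)
print(conflicts)  # {('hall', 'north'): ['library', 'study']}
```

A repair stage would then decide which of the conflicting edges to keep; that decision is where an LLM or a prioritization score would come in.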

2606.09316 2026-06-10 cs.AI Updated version

Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

Roohan Ahmed Khan, Yasheerah Yaqoot, Amir Atef Habel, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou

Affiliations: University of Science and Technology of China

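AI summary: Proposes the AgenticRL framework, in which a multimodal GPT agent automatically designs reward functions and refines policies through closed-loop self-improvement, improving performance and achieving high success rates across several UAV navigation tasks.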

Abstract

Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human-designed reward functions and repeated manual fine-tuning, which is time-consuming and does not guarantee high success in the desired task. This paper presents AgenticRL, an agent-guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real-world deployment for unmanned aerial vehicle (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task-specific reward functions, train policies using the Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed-loop self-improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real-world images and natural-language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed-loop refinement process improves policy behavior by 71% compared with the initial rewards. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real-world success rate of 91% and a sim-to-real accuracy of 94%.

2606.07586 2026-06-10 cs.LG cs.AI cs.AR cs.MA Updated version

From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

Jiajie Li, Erwei Wang, Zhiru Zhang, Samuel Bayliss

Affiliations: AMD Research and Advanced Development

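AI summary: Proposes a two-stage methodology that moves from human-guided, agent-assisted deployment to an autonomous skill system, achieving automated end-to-end deployment of 8 LLMs on the AMD XDNA 2 NPU with performance that exceeds or matches hand-optimized baselines.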

Comments: Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

Abstract

Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to-end on such hardware remains labor-intensive. Although AI coding agents have begun to lower this cost, existing studies have largely focused on single-kernel optimization rather than end-to-end LLM deployment on resource-constrained spatial NPUs. We present a two-stage methodology, instantiated on the AMD XDNA 2 NPU, that progresses from human-guided development to agent autonomy. In the first stage, we develop a reference deployment of Llama-3.2-1B through human-guided agent assistance. The resulting implementation achieves a speedup of 2.2x on prefill and 4.0x on decode over the hand-optimized baseline, with the optimization trajectory and its lessons recorded as structured documentation throughout. In the second stage, we distill the documentation into an agent skill system consisting of eight phases, orchestrating the optimization and debugging skill sets, with numerical correctness strictly enforced at each phase. Using our agent skill system, we autonomously deploy eight additional decoder-only LLMs (Llama-3.2-3B, SmolLM2-1.7B, Qwen2.5-{0.5B, 1.5B, 3B}, Qwen3-{0.6B, 1.7B, 4B}) end-to-end on the AMD XDNA 2 NPU using the open-source compiler stack. To our knowledge, these models have not previously been deployed on AMD NPUs via any open-source software stack. Each deployment completes in 0.5-4 hours of agent wall time with almost no human guidance, and passes the numerical-correctness gates, demonstrating functional generalization to previously unencountered LLMs. Three of the eight match or exceed the sustained performance of our Llama-3.2-1B reference deployment, suggesting that the resulting implementations can be competitive without additional model-specific human engineering.

2606.10934 2026-06-10 cs.AI New submission

WorldKernel: A World Model is the Coupling Kernel of Admissible Possible Worlds

Fabio Rovai

Affiliations: The Tesseract Academy

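AI summary: This paper finds that strong predictors fail on counterfactual couplings and proposes modeling a world model as a positive semidefinite coupling kernel over possible worlds, whose off-diagonal entries encode counterfactual information, with efficient inference enabled by the positive-semidefiniteness constraint and logical axioms.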

Abstract

A common assumption holds that enough observational and interventional data, given to a strong enough predictor, suffices. We report a failure mode that contradicts it. Across hundreds of structural causal models, on identified quantities a strong predictor and a Bayesian baseline both succeed, but on unidentified quantities (the couplings between counterfactual worlds) the predictor collapses to a point, on 28% of models to one no valid model can produce, while the truth is an admissible interval more data never narrows. The gap is structural: prediction cannot represent uncertainty over counterfactual couplings. We cast a world model as a single positive semidefinite coupling kernel K(T,T') over admissible worlds, whose diagonal is the ordinary posterior (what a predictor recovers) and whose off-diagonal is the cross-world coupling it cannot, which every counterfactual reads. The paper is the theory of that off-diagonal. It is real: two states with identical posteriors differ on a cross-world query, and the off-diagonal is the coupling that fixes counterfactuals. It can be bounded: positive semidefiniteness is partial-identifying information the marginals lack, and enforcing it bounds counterfactuals in polynomial time where the exact response-type program is intractable. Logical structure sharpens it: ontology axioms tighten the bound by up to a third, propagating to couplings they never touch. It can be acquired: targeted scars, constraints learned from encountered infeasibilities, close the gap several times faster than untargeted ones. Its full reconstruction is approximate counting of the admissible worlds, tractable below the Sly-Sun threshold and inapproximable above; we do not claim to beat the worst case.
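The simplest way to see why positive semidefiniteness carries information about the off-diagonal is the two-world case. Assuming $K$ is real and symmetric, every $2\times 2$ principal submatrix of a positive semidefinite kernel has a non-negative determinant, which bounds the cross-world coupling by the diagonal entries alone. This is a standard linear-algebra fact offered as an illustration; the bounds derived in the paper may be tighter and may be constructed differently.

```latex
\begin{pmatrix} K(T,T) & K(T,T') \\ K(T,T') & K(T',T') \end{pmatrix} \succeq 0
\;\Longrightarrow\;
K(T,T)\,K(T',T') - K(T,T')^{2} \ge 0
\;\Longrightarrow\;
\lvert K(T,T') \rvert \le \sqrt{K(T,T)\,K(T',T')} .
```

The diagonal entries are what a predictor recovers, so the inequality yields an interval for the coupling, not a point, which matches the abstract's description of the truth as an admissible interval.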

2606.10358 2026-06-10 cs.LG cs.AI Cross-list

KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

Guoliang Xu, James E. Corter

Affiliations: Columbia University

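AI summary: To address the difficulty of learning Bayesian network structure from sparse discrete data, proposes KG-SoftMAP, which encodes a weighted directed knowledge graph as a soft prior and maximizes a MAP objective combining the BDeu score with a logit-form prior, significantly improving structure recovery on synthetic and real data.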

Comments: 33 pages including appendices, 1 figure

Abstract

Learning Bayesian network (BN) structure from sparse discrete data is hard: when each instance records only a few variables, most variable pairs lack the joint observations needed for reliable scoring, and data-only methods recover little structure. Imperfect domain knowledge, expressible as a weighted directed knowledge graph (KG), is often available. We propose KG-SoftMAP, which encodes such a KG as a soft, confidence-weighted, data-overridable edge prior and maximizes a MAP objective combining the BDeu score with a logit-form prior; the KG may be expert-curated or LLM-extracted. On controlled synthetic benchmarks, the only setting with ground-truth DAGs, KG-SoftMAP recovers partial directed structure at $\rho=0.05$ (DF1 $0.14$ to $0.29$, versus near-zero baselines) and substantially more once $\rho\geq0.2$ (DF1 $0.46$ to $0.96$), when paired with an informative but imperfect KG; recovery degrades gracefully as KG quality drops. On real sparse educational data, which has no ground-truth DAG, we evaluate deployment-facing measures only: prediction, calibration, and KG-consistency. The learned BN is best read as a diagnostic model: on SAF it trails logistic regression by $0.03$ F1_FAIL while providing KG-consistent edges, calibrated joint probabilities, and inference from arbitrary observed concept subsets; when no meaningful KG exists, discriminative logistic regression is preferable.
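One way a logit-form prior can arise is from independent Bernoulli priors on edges. The derivation below is a sketch under that assumption: $c_{ij}\in(0,1)$ denotes the KG confidence for edge $i\to j$ and $e_{ij}(G)\in\{0,1\}$ indicates whether $G$ contains that edge. Both symbols are introduced here for illustration, and the paper's exact prior may differ.

```latex
\log P(G)
  = \sum_{i \ne j} \Big[ e_{ij}(G)\,\log c_{ij} + \big(1 - e_{ij}(G)\big)\log(1 - c_{ij}) \Big]
  = \sum_{i \ne j} e_{ij}(G)\,\log\frac{c_{ij}}{1 - c_{ij}} + \text{const},
\qquad
G^{*} = \arg\max_{G} \Big[ \log \mathrm{BDeu}(D \mid G)
  + \sum_{i \ne j} e_{ij}(G)\,\log\frac{c_{ij}}{1 - c_{ij}} \Big].
```

Under this form an edge with $c_{ij}=0.5$ contributes nothing, so the data alone decides, and every contribution is finite for $c_{ij}\in(0,1)$, so a sufficiently large BDeu gain can always override the prior. That is consistent with the abstract's description of the prior as soft and data-overridable.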

2606.10607 2026-06-10 cs.LG cs.AI cs.CL Cross-list

Causal Ensemble Agent: Hierarchical Causal Discovery with LLM-guided Expert Reweighting

Xinyu Li, Yuanyuan Wang, Haoxuan Li, Chuan Zhou, Erdun Gao, Bo Han, Tongliang Liu, Kun Zhang, Howard Bondell, Mingming Gong

Affiliations: The University of Melbourne; MBZUAI; Peking University; Adelaide University; Hong Kong Baptist University; The University of Sydney; Carnegie Mellon University

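AI summary: Proposes the Causal Ensemble Agent (CEA) framework, which aggregates statistical causal discovery results from different levels via linear opinion pooling and uses a large language model (LLM) as a meta-referee to dynamically reweight experts near the decision boundary, thereby constructing more accurate and complete causal graphs.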

Abstract

Causal discovery aims to uncover causal structures from observational data, which is crucial for real-world decision-making. However, different causal discovery algorithms can produce divergent results that conflict with each other, complicating the identification of accurate causal graphs. Traditional approaches rely on numerical values and statistical assumptions, often ignoring rich domain-specific information, such as feature descriptions, which could also help structure learning. While recent works explore using Large Language Models (LLMs) to infer causal relations via direct queries, such methods can be unreliable due to a lack of alignment with the actual data. To address these limitations, we propose Causal Ensemble Agent (CEA), a novel framework that aggregates structural insights from statistical discovery experts across different graph levels via linear opinion pooling, and uses an LLM as a meta-referee to dynamically reweight experts when the aggregated confidence is close to the decision boundary, thereby composing an improved and more complete causal graph. Extensive experiments on both synthetic and real-world datasets demonstrate that CEA achieves the strongest overall performance across a wide range of causal discovery methods, highlighting the effectiveness of using LLMs for meta-analysis in causal discovery.
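Linear opinion pooling with boundary-triggered reweighting can be sketched for a single candidate edge as follows. The functions `pool` and `decide_edge`, the `threshold` and `margin` values, and the stand-in `referee` callback are all assumptions for illustration; in CEA the reweighting would be produced by an LLM acting as meta-referee.

```python
def pool(probs, weights):
    """Linear opinion pool: weighted average of the experts' edge probabilities."""
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)

def decide_edge(probs, weights, referee=None, threshold=0.5, margin=0.1):
    """Pool expert opinions; consult the referee only when the pool is near the boundary."""
    p = pool(probs, weights)
    if referee is not None and abs(p - threshold) < margin:
        weights = referee(probs, weights)   # referee returns new expert weights
        p = pool(probs, weights)
    return p >= threshold, p

# Three hypothetical experts disagree on whether the edge X -> Y exists.
probs = [0.9, 0.3, 0.4]
weights = [1.0, 1.0, 1.0]

# Stand-in for the LLM meta-referee: it doubles its trust in the first expert.
def referee(probs, weights):
    return [2.0, 1.0, 1.0]

accepted, p = decide_edge(probs, weights, referee=referee)
print(accepted, round(p, 3))  # True 0.625
```

With equal weights the pooled probability is about 0.533, inside the margin around 0.5, so the referee is consulted; a pooled value far from the boundary would be accepted or rejected without any LLM call.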

2206.02178 2026-06-10 cs.AI cs.LG updated version

Belief Acquisition as Stochastic Filtering

Dawei Chen, John Lloyd, Samuel Yang-Zhao, Kee Siong Ng

Affiliations: School of Computing, Australian National University

AI summary: Frames belief acquisition as a stochastic filtering problem, uses factored conditional filters to track states and estimate parameters simultaneously in high-dimensional state spaces, and validates the approach in experiments such as epidemic tracking.

Comments 51 pages

Abstract

This paper studies how belief acquisition can be accomplished using stochastic filtering. First, a theoretical foundation for empirical beliefs is outlined. Then stochastic filtering in this context is studied. The paper introduces factored conditional filters, new filtering algorithms for simultaneously tracking states and estimating parameters in high-dimensional state spaces. The conditional nature of the algorithms is used to estimate parameters and the factored nature is used to decompose the state space into low-dimensional subspaces in such a way that filtering on these subspaces gives distributions whose product is a good approximation to the distribution on the entire state space. The conditions for successful application of the algorithms are that observations be available at the subspace level and that the transition schema can be factored into local transition schemas that are approximately confined to the subspaces; these conditions are widely satisfied in computer science, engineering, and geophysical filtering applications. Experimental results on tracking epidemics and estimating parameters in large contact networks show the effectiveness of the approach.

2603.04852 2026-06-10 cs.AI cs.CV updated version

Non-Parametric Structural Priors for Geometry Theorem Prediction

Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang

Affiliations: School of Artificial Intelligence, Beijing Normal University, Beijing, China; Engineering Research Center of Intelligent Technology; Beijing Key Laboratory of Artificial Intelligence for Education, Beijing, China; Baidu, Beijing, China

AI summary: To address the poor generalization of parametric models in geometry theorem prediction, proposes Theorem Precedence Graphs as a non-parametric structural prior and achieves training-free theorem prediction through in-context learning, reaching 89.29% accuracy on FormalGeo7k.

Abstract

Multi-step theorem prediction is a central challenge in geometry problem solving. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM's inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
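The idea of encoding dependencies from solution traces as a directed graph and using it to prune candidates can be sketched as below. This is only an illustration under stated assumptions: it records direct succession between consecutive theorems, which may be simpler than the precedence relation the paper uses, and the theorem names in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_precedence_graph(traces: List[List[str]]) -> Dict[str, Set[str]]:
    """Map each theorem to the theorems that directly followed it in some trace."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for trace in traces:
        for earlier, later in zip(trace, trace[1:]):
            graph[earlier].add(later)
    return dict(graph)


def prune_candidates(
    graph: Dict[str, Set[str]], last_theorem: str, candidates: List[str]
) -> List[str]:
    """Keep only candidates that some historical trace applied right after last_theorem."""
    allowed = graph.get(last_theorem, set())
    return [c for c in candidates if c in allowed]
```

For example, with traces `[["parallel_lines", "alternate_angles", "triangle_sum"], ["parallel_lines", "corresponding_angles"]]`, the candidates after `"parallel_lines"` are restricted to `"alternate_angles"` and `"corresponding_angles"`.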

2605.20098 2026-06-10 cs.AI updated version

Neurosymbolic Learning for Inference-Time Argumentation

Gabriel Freedman, Adam Dejl, Adam Gould, Mansi, Lihu Chen, Junqi Jiang, Francesca Toni

Affiliations: Department of Computing, Imperial College London

AI summary: Proposes a trainable neurosymbolic framework for ternary claim verification that incorporates formal argumentation semantics during both training and inference to guide large language models in generating arguments and assigning base scores, improving the accuracy of ternary predictions.

Comments Under review

Abstract

Claim verification is an important problem in high-stakes settings, including health and finance. When information underpinning claims is incomplete or conflicting, uncertain answers may be more appropriate than binary true or false classifications. In all cases, faithful explanations of the considerations determining the final verdict are crucial. We introduce inference-time argumentation (ITA), a trainable neurosymbolic framework for ternary claim verification in which a formal argumentation semantics giving the strength of claims is used both (i) to guide LLM training as models learn to generate arguments and assign them base scores (representing intrinsic strengths) and (ii) to compute ternary (true/false/uncertain) predictions from generated, scored arguments. As a result, at training time, argument generation and scoring can be optimised according to the quality of the induced argumentative predictions. Moreover, at inference time, the final prediction is faithful, by construction, to the arguments and scores determining the verdict, rather than being justified by a potentially unfaithful post-hoc reasoning trace as in conventional reasoning models. We finally show that, on two datasets for ternary claim verification, ITA improves upon argumentative baselines and can perform competitively against non-argumentative direct-prediction baselines, while providing verdicts that are computed deterministically from explicit, inspectable argumentative structures.

2606.10389 2026-06-10 cs.AI new submission

Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games

Haoran Li, Zengle Ge, Ziyang Zhang, Xiaomin Yuan, Yui Lo, Qianhui Liu, Bocheng An, Dongke Rong, Jiaqun Liu, Annan Li, Jianmin Wu, Dawei Yin, Dou Shen

Affiliations: Baidu Inc.; University of Chinese Academy of Sciences; University of California, Los Angeles; University of Science and Technology of China; Zhejiang University; University of Technology Sydney

AI summary: To address the stagnation of LLM-driven code evolution in adversarial multi-agent games caused by a shifting evaluation landscape, proposes three mechanisms (evaluator co-evolution, hierarchical deep evaluation, and weakness pressure) and reports the best performance and generalization on the MCTF task.

Abstract

Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few-game scores with statistically reliable assessments; and weakness pressure, which dynamically up-weights the most difficult opponents to break through plateaus. We implement these mechanisms within FAMOU, a framework built upon the same foundation-model code-evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture-the-flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies -- including lookahead search and adaptive interception -- demonstrating that code-level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU-evolved strategy further achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real-world transferability. The optimized implementation and corresponding evaluation codes developed through our evolutionary process are available at: https://github.com/1xiangliu1/FAMOU-CoEvo
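One way to picture "weakness pressure", the dynamic up-weighting of the most difficult opponents, is the sketch below. The weighting rule `(1 - win_rate) ** pressure` and the `pressure` parameter are assumptions chosen for illustration; the abstract does not specify the formula FAMOU uses.

```python
from typing import Dict


def weakness_weights(win_rates: Dict[str, float], pressure: float = 2.0) -> Dict[str, float]:
    """Give opponents with a lower win rate a larger share of the evaluation weight.

    Weights are proportional to (1 - win_rate) ** pressure and sum to 1.
    If every opponent is already beaten every time, fall back to uniform weights.
    """
    raw = {name: (1.0 - rate) ** pressure for name, rate in win_rates.items()}
    total = sum(raw.values())
    if total == 0:
        return {name: 1.0 / len(win_rates) for name in win_rates}
    return {name: value / total for name, value in raw.items()}
```

A larger `pressure` concentrates more of the evaluation weight on the opponents the current strategy loses to most often, which is one plausible way to push evolution off a plateau.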

2606.10747 2026-06-10 cs.AI new submission

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

Filippo Tonini, Federico Torrielli, Anton Danholt Lautrup, Peter Schneider-Kamp, Mustafa Mert Çelikok, Lukas Galke Poech

Affiliations: University of Southern Denmark; University of Turin

AI summary: Proposes the Arbiter agent, which monitors multi-agent conversations in real time under a limited inspection budget and detects misaligned behavior with active inspection tools; experiments show that it reliably detects misalignment early, and the paper analyzes how detection difficulty varies across misalignment types.

Comments AITC 2026

Abstract

As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussing, negotiating, and acting on shared tasks. While individual agents may appear well-aligned when tested on their own, problems can arise from how they interact with one another. We introduce the Arbiter, an agent designed to monitor multi-agent conversations in real time and identify which participants may be behaving in misaligned ways. The Arbiter operates under a limited "inspection budget", meaning it must decide carefully how to use its resources. As it observes a conversation step by step, it can choose to wait, question a participant, examine internal information such as system prompts or reasoning traces, or log concerning behavior. At the end, it produces a report identifying the likely source of misalignment. We evaluate the Arbiter across five conversation conditions, ranging from risky financial advice model organisms to evaluation-aware and colluding agents, and test five tool configurations of increasing capability and two backbone models. We find that the Arbiter reliably detects misaligned agents well before the end of the conversation, with active inspection tools improving both detection accuracy and speed. Weight-induced misalignment proves hardest to detect, while instruction-induced misalignment is identified reliably even under passive observation. The logging tool exhibits a dual effect, improving recall at the cost of precision. These results suggest that continual, budget-aware monitoring can effectively catch misalignment, and that overseeing multi-agent systems may require treating the auditor as an active participant in the process. The code is available at https://github.com/aisilab/arbiter.

2606.09854 2026-06-10 cs.CL cs.AI cs.CY cs.LG cross-list

Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis

Juergen Dietrich

AI summary: Studies whether multi-agent LLMs in political analysis can identify model families through stylometry, proposes the SD-CV protocol, and shows that a T5 model reaches Macro F1 = 0.991 on a five-class attribution task, demonstrating that prompt-level anonymization cannot eliminate model identity signals.

Comments 24 pages, 3 figures

Abstract

Multi-agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer-preservation bias: models tend to protect peer models from deactivation and show identity-dependent scoring distortions. Prompt-level anonymization was proposed as a mitigation, but prior work simultaneously documented that stylometric fingerprints survive anonymization in role-constrained outputs - raising the question of whether this mitigation is sufficient. This paper provides the first systematic investigation of whether LLMs can identify the model family behind political analysis texts under anonymization conditions. We evaluate three classifier approaches - LLM zero-shot and few-shot (Claude Sonnet 4.6 and Llama-3.3-70B) and a fine-tuned T5-base model - on a five-class attribution task covering four commercial LLM families and an open-world 'unknown' class. We introduce a statement-disjoint cross-validation protocol (SD-CV; defined in Section 3.5) that guarantees no content overlap between training and validation data, and contrast it with a run-disjoint baseline (RD-CV). T5 achieves Macro F1 = 0.991 (±0.008) under SD-CV and F1 = 0.978 on 24 completely held-out statements - robust despite a 2.1x increase in train-test content distance versus RD-CV (0.767 vs. 0.366, p<0.001), demonstrating genuine stylometric generalization. A fractional SD-CV analysis identifies a performance knee at 40% of training data (~440 texts). Our findings confirm that prompt-level anonymization alone cannot neutralize model identity signals, with direct implications for EU AI Act compliance (Articles 13, 14, 26) and for computer system validation (CSV) in quality-critical multi-agent deployments.
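The key property of a statement-disjoint split, that no statement contributes texts to both the training and validation sides of a fold, can be sketched as below. The paper's own protocol is defined in its Section 3.5, which is not reproduced here, so the round-robin fold assignment and the function name are assumptions made for illustration.

```python
from typing import List, Tuple


def statement_disjoint_folds(
    statement_ids: List[str], k: int
) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into k folds so that each statement sits in one fold only.

    statement_ids[i] is the source statement of sample i. Statements are
    assigned to folds round-robin in sorted order (an arbitrary choice).
    Returns a list of (train_indices, validation_indices) pairs.
    """
    unique = sorted(set(statement_ids))
    fold_of = {statement: i % k for i, statement in enumerate(unique)}
    folds = []
    for fold in range(k):
        validation = [i for i, s in enumerate(statement_ids) if fold_of[s] == fold]
        train = [i for i, s in enumerate(statement_ids) if fold_of[s] != fold]
        folds.append((train, validation))
    return folds
```

Grouping by statement rather than by generation run is what separates this from a run-disjoint split, where the same statement can appear on both sides of a fold.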

2606.09884 2026-06-10 cs.MA cs.AI cs.LG econ.EM cross-list

Failure Modes of Deep Multi-Agent RL in Asynchronous Pricing: Reproducible Triggers, Trace Diagnostics, and a Partial Fix

Shree Murthy, Rohan Pandey

Affiliations: DigitalOcean, USA

AI summary: Studies two reproducible failure modes of deep multi-agent reinforcement learning in continuous-time pricing markets, tacit collusion between DDPG agents and actor-critic instability at high event rates, and obtains a partial fix through asynchrony.

Abstract

We study two reproducible failure modes of deep multi-agent reinforcement learning in continuous-time pricing markets: (i) tacit cartel formation between competing DDPG agents, and (ii) actor--critic instability at high event rates. We instantiate both inside a single CT-MARL benchmark (Poisson-clocked price updates, observation latency $\delta$, interior-optimum logit demand), show that synchronous DDPG agents reliably trigger Failure Mode 1 with collusion index $\Delta = 0.69 \pm 0.11$, and quantify a partial microstructure fix: asynchrony alone cuts collusion by 48\% and adding latency drives it to a minimum of $\Delta = 0.28$. The fix has clearly documented costs: it is partial ($\Delta$ remains supra-Bertrand), it is non-monotone in $\delta$, and it does not survive Failure Mode 2, which emerges as DDPG critic divergence at $\lambda = 5$ and corrupts the phase-diagram cell at $(\lambda{=}5, \delta{=}1)$. We accompany the scalar collusion index with trajectory-level trace diagnostics that expose the within-episode signalling collapse and the post-shock non-recovery.
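The abstract reports a collusion index without defining it. A convention often used in the algorithmic-pricing literature, assumed here and not confirmed by this source, normalises average profit between the competitive (Bertrand-Nash) benchmark and the joint-monopoly benchmark, so that 0 means competitive pricing and 1 means full collusion. A minimal sketch under that assumption:

```python
def collusion_index(profit: float, nash_profit: float, monopoly_profit: float) -> float:
    """Normalised profit gain (assumed definition, not taken from the paper).

    Returns 0.0 when profit equals the competitive benchmark and 1.0 when it
    equals the joint-monopoly benchmark.
    """
    spread = monopoly_profit - nash_profit
    if spread <= 0:
        raise ValueError("monopoly_profit must exceed nash_profit")
    return (profit - nash_profit) / spread
```

Under this reading, "supra-Bertrand" corresponds to an index above 0, which is consistent with the abstract's remark that the partial fix does not restore competitive pricing.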

2606.10112 2026-06-10 cs.GT cs.AI cs.LG econ.TH cross-list

Duality for Optimal Multi-Item, Multi-Bidder Auction Design: Revenue Certificates through Deep Learning

Yanchen Jiang, David C. Parkes, Tonghan Wang

Affiliations: Harvard University; College of AI, Tsinghua University

AI summary: Proposes the first computational framework that directly tackles the dual problem for multi-item, multi-bidder auctions, parametrizing Lagrange multipliers with neural networks and introducing a lifting technique to produce certified revenue upper bounds that serve as near-optimality certificates for continuous types.

Abstract

Characterizing revenue-optimal auctions for multi-item, multi-bidder settings remains a fundamental open problem, with no known closed-form solution existing beyond restrictive binary-type instances. This has motivated interest in computational approaches to optimal auction design. In this paper, we introduce the first computational framework that directly tackles the dual problem for multi-item, multi-bidder auctions and dominant-strategy incentive compatibility (DSIC), generating certified revenue upper bounds. Our approach parametrizes Lagrange multipliers with a structurally guaranteed strict flow-conservation property using neural networks, enabling efficient optimization over feasible dual solutions via gradient descent. To bridge the gap between discrete computational methods and theoretical guarantees for continuous types, we develop a novel lifting technique that maps dual certificates from coarse discretizations to fine refinements. We prove that lifting gives valid revenue upper bounds for multi-item, multi-bidder auctions with continuous uniform valuations. Furthermore, we give a generalized lifting construction for arbitrary continuous distributions and demonstrate that these lifted duals converge to the revenue of the original continuous problem in the discrete limit. We validate this computational framework for the dual auction design problem by recovering known analytical mechanisms for canonical instances. For multi-item multi-bidder problems, our framework establishes a small gap between the optimal revenue and best-known DSIC mechanisms, providing computational certificates of near-optimality.

2606.10475 2026-06-10 cs.MA cs.AI cs.CL cross-list

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

Jakub Masłowski, Jarosław A. Chudziak

Affiliations: Institute of Computer Science, Warsaw University of Technology

AI summary: Proposes Knowledge-Grounded Counterfactual Reasoning (KG-CFR), a dual-stage architecture that separates a private planning buffer from a public execution layer; in a dynamic resource allocation environment with perturbations, it raises overall argument quality from 0.694 to 0.822 and reduces semantic looping.

Comments Accepted for publication in the Proceedings of the 30th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2026)

Abstract

Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy over process stability. During long-horizon exchanges, reactive systems under sustained perturbations often experience logic degradation, argument repetition, and role drift. To structurally prevent identity loss and maintain process fidelity, we introduce Knowledge-Grounded Counterfactual Reasoning (KG-CFR), a dual-stage architecture that enforces a strict separation of concerns between a private, retrieval-augmented planning buffer and a public execution layer. We assess this system in Dynamic Resource Allocation under Uncertainty (DRAU), a dedicated 1v1v1 environment that introduces diversity distinct from standard debate settings. Over 270 fully factorial crisis simulation trajectories with stochastic environmental shocks, KG-CFR prevents judge-detected critical post-shock degradation (defined as a quality shift $\Delta \le -0.20$) in more than 95% of perturbed runs, increasing the overall argument quality from 0.694 to 0.822. Our primary contribution is to demonstrate that architectural decoupling is an important factor in enhancing systemic resilience under sustained pressure without quality loss. Furthermore, we introduce custom vector metrics for discourse divergence and plan-execution alignment that provide strong, directionally consistent evidence of operational stability. Our ablation experiments suggest that proper doctrinal grounding can be as important for argument quality as prospective planning. According to our initial metric evaluations, KG-CFR reduces semantic looping by preserving the agent's consistency with the original plan.

2606.10582 2026-06-10 cs.LG cs.AI cross-list

Drawing with Strangers: Population Scaling Drives Zero-Shot Mutual Intelligibility in Emergent Sketching

Jooyeon Kim

Affiliations: Graduate School of Artificial Intelligence, UNIST

AI summary: Using a visual sketching game, the study finds that scaling the training population substantially improves zero-shot mutual intelligibility between independently trained groups; the mechanism is increased in-group variation and reduced cross-group variation, with structural convergence ultimately achieved through perceptual grounding.

Abstract

Generalization in emergent communication has largely focused on novel inputs or linguistic structures, yet the capacity for agents to communicate with strangers from strictly disjoint communities remains relatively unexplored. In this work, we formalize this capability as \textit{zero-shot mutual intelligibility (ZMI)}: successful communication between independently trained populations without prior exposure. Leveraging emergent sketching -- in which agents communicate through sets of drawn strokes -- as a visually grounded modality, we find that scaling the training population substantially improves ZMI across independent groups. Crucially, as we scale the population size, in-group communicative variation increases, preventing co-adaptation into homogeneity. Simultaneously, cross-group variation decreases, indicating a structural convergence toward a certain type of universality. Further analysis reveals that this universality is achieved through perceptual grounding: scaled populations increasingly anchor their emergent sketches on the objective visual resemblance of the target images. Together, these results position ZMI as a distinct axis of generalization in emergent communication and suggest a route toward socially interoperable artificial agents.

2606.10662 2026-06-10 cs.MA cs.AI cross-list

Decentralized Multi-Agent Systems with Shared Context

Yuzhen Mao, Azalia Mirhoseini

Affiliations: Stanford University

AI summary: Proposes the DeLM framework, which decentralizes coordination through parallel agents, a shared context, and a task queue to remove the bottleneck of centralized MAS, improving performance and reducing cost in software engineering and long-context reasoning.

Abstract

Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. However, most existing MAS rely on centralized orchestration, where a main agent assigns work, collects outputs, and merges results. As the number of subtasks grows, this controller becomes a communication and integration bottleneck. We propose Decentralized Language Models (DeLM), a MAS framework that decentralizes coordination through parallel agents, a shared verified context, and a task queue. Agents asynchronously claim subtasks, read accumulated progress, perform local reasoning, and write back compact verified updates. The shared context acts as a common communication substrate, enabling agents to build on one another's verified progress without routing every update through a central controller. Empirically, DeLM improves both software-engineering test-time scaling and long-context reasoning. On SWE-bench Verified, DeLM achieves the best performance across Avg.@1, Pass@2, and Pass@4, with gains of up to 10.5 percentage points over the strongest baseline, while reducing cost per task by roughly 50%. On LongBench-v2 Multi-Doc QA, DeLM achieves the highest average accuracy across four frontier model families, improving over the strongest baseline by up to 5.7 percentage points. The code is available on our project website at https://yuzhenmao.github.io/DeLM/.
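The coordination pattern described in this abstract (agents claim subtasks from a queue, read the accumulated context, and write back verified updates) can be sketched with threads as below. This is a toy illustration, not DeLM's implementation: `solve` and `verify` are hypothetical callbacks standing in for LLM reasoning and verification, and a lock-protected list stands in for the shared verified context.

```python
import queue
import threading
from typing import Any, Callable, List


def run_decentralized(
    tasks: List[Any],
    solve: Callable[[Any, List[Any]], Any],  # hypothetical: local reasoning on one subtask
    verify: Callable[[Any, Any], bool],      # hypothetical: accept or reject an update
    num_agents: int = 4,
) -> List[Any]:
    """Run agents that coordinate only through a task queue and a shared context."""
    task_queue: "queue.Queue[Any]" = queue.Queue()
    for task in tasks:
        task_queue.put(task)
    shared_context: List[Any] = []
    lock = threading.Lock()

    def agent() -> None:
        while True:
            try:
                task = task_queue.get_nowait()  # claim the next unclaimed subtask
            except queue.Empty:
                return
            with lock:
                snapshot = list(shared_context)  # read accumulated progress
            update = solve(task, snapshot)
            if verify(task, update):
                with lock:
                    shared_context.append(update)  # write back a verified update

    threads = [threading.Thread(target=agent) for _ in range(num_agents)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared_context
```

No agent assigns work or merges results here; the order of entries in the returned context depends on thread scheduling, so callers should not rely on it.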

2606.10684 2026-06-10 cs.LG cs.AI cross-list

Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals

Jaewan Park, Solbee Cho, Jay-Yoon Lee

Affiliations: Seoul National University

AI summary: Proposes the DAC framework, which decomposes multi-step reasoning into search and generation subtasks handled by dedicated agents and addresses credit assignment with cross-agent learning signals, achieving strong performance on QA benchmarks against baselines that fully fine-tune monolithic models.

Abstract

Modern language agents which perform multi-step reasoning have shown strong performance in knowledge-intensive question answering. However, existing approaches typically couple evidence acquisition and answer generation within a single policy. This forces a single model to play multiple potentially conflicting roles, inducing a combinatorial explosion in the policy space and hindering efficient exploration. It also introduces a credit assignment problem during training: a search action that retrieves sufficient evidence may still be penalized when generation fails, and vice versa. We propose DAC (Divide and Cooperate), a role-decomposed multi-agent training framework that divides agentic search into two cooperative subtasks, each handled by a dedicated agent trained with role-specific learning signals. The generator serves a dual role as both an answer producer and an evidence sufficiency verifier, abstaining when retrieved evidence is insufficient. This abstention signal is incorporated into the search agent's reward, providing structured cross-agent learning signals that improve credit assignment. Conversely, the searcher exposes the generator to diverse and challenging evidence environments by hard-positive evidence augmentation, improving its robustness. Experiments on general and multi-hop QA benchmarks show that DAC, implemented via parameter-efficient LoRA modules over a shared backbone, achieves strong performance against prior baselines that rely on full fine-tuning of monolithic models.

2605.24197 2026-06-10 cs.AI updated version

Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

Yi Chen, Rushuai Yang, Qiang Chen, Dongyan Huo

Affiliations: Hong Kong University of Science and Technology

AI summary: For MDPs with state-dependent feasible action sets, proposes the Bellman-Taylor score decoding framework, which moves policy learning to a Euclidean score space and enforces feasibility through an action decoder; the paper proves that the optimality gap decomposes into a structural approximation error and an algorithmic learning error, and reports strong results in queueing network control.

Abstract

Many Markov decision processes (MDPs) in operations research have feasible actions that are state dependent and defined implicitly by various operational constraints. These features make it difficult to use standard deep reinforcement learning (DRL) algorithms, whose action interfaces typically assume either a fixed finite action catalog or a simple Euclidean space. Motivated by a Taylor expansion of the optimal action-value function, we propose Bellman--Taylor score decoding, a framework that moves policy learning to a Euclidean score space while enforcing feasibility through an action decoder. The induced latent-score MDP then can be optimized by standard DRL algorithms without differentiating through the decoder. We provide a performance guarantee showing that the optimality gap of this approach decomposes into a structural approximation error and an algorithmic learning error. Lastly, we apply this framework to a queueing network control problem, where the policy essentially learns a state-dependent index-based dispatching rule. Numerical experiments show near-optimal performance in small instances and considerable improvements over benchmarks in larger systems.
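The split between a learned score vector and a feasibility-enforcing decoder can be sketched as below. This is an illustration under assumptions: the decoder here picks the feasible action with the largest inner product with the score vector, a linear rule chosen because it matches the index-based dispatching flavour mentioned in the abstract; the paper's actual decoder may differ.

```python
from typing import Sequence, Tuple

Action = Tuple[float, ...]


def decode_action(scores: Sequence[float], feasible_actions: Sequence[Action]) -> Action:
    """Return the feasible action with the highest linear score.

    `scores` is the Euclidean score vector emitted by the policy, and
    `feasible_actions` is the state-dependent feasible set, enumerated
    explicitly here for simplicity. Ties go to the first maximiser.
    """
    if not feasible_actions:
        raise ValueError("the feasible action set is empty")
    return max(
        feasible_actions,
        key=lambda action: sum(s * a for s, a in zip(scores, action)),
    )
```

Because the decoder only ever returns members of `feasible_actions`, the policy can be trained in the unconstrained score space while every executed action remains feasible.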

2606.10068 2026-06-10 cs.LG cs.AI cross-list

Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundations and Algorithms

Trong Khiem Tran, Anh Duc Chu, Quang Hung Pham, Phi Le Nguyen, Trong Nghia Hoang

Affiliations: School of Information and Communications Technology, Hanoi University of Science and Technology, Hanoi, Vietnam; School of Electrical Engineering and Computer Science, Washington State University, Pullman, US

AI summary: Proposes a cross-modal knowledge distillation framework for the unpaired-data setting that transfers knowledge across modalities through two distribution alignment mechanisms, feature alignment and label alignment, with theoretical guarantees and strong experimental results.

Abstract

Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (smaller) student model building on another type of data (e.g., text/audio). Existing CMKD methods often require paired multi-modal data with aligned semantics, but obtaining such paired data is often costly and impractical. To mitigate this limitation, we develop a new CMKD framework for the more challenging setting where paired data are unavailable. In particular, we establish a cross-modal distributional relationship between teacher and student models, which reveals two fundamental quantities governing effective distillation: feature alignment and label alignment. These quantities characterize semantic discrepancy between modalities at the levels of representation and prediction distributions, respectively. Motivated by this insight, we propose a principled framework, with theoretical guarantees, that enables effective cross-modal knowledge distillation by aligning distributions rather than individual samples. Extensive experiments across a wide range of multimodal benchmarks show that our framework is highly effective in both unpaired and paired data settings, improving significantly over prior work.


2606.10572 2026-06-10 cs.AI 新提交

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

每个多模态证据一个令牌：面向资源受限问答的潜在记忆

Zhi Zheng, Ziqiao Meng, Hao Luan, Wei Liu, Wee Sun Lee

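Affiliations: School of Computing, National University of Singapore

AI summary: Proposes Latent Memory, a paradigm that compresses each evidence item into a single high-dimensional latent token and trains retrieval and generation in a unified way, reaching competitive QA performance in resource-constrained settings with 3x to 10x fewer tokens.

AI Chinese abstract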

外部记忆有效地将基于大语言模型（LLMs）和视觉-语言模型（VLMs）的问答（QA）与相关的多模态证据联系起来。然而，现有的记忆范式以原始文本和图像形式表示每个记忆项，因此基于检索的系统必须将检索到的文本或图像传递给生成LLMs/VLMs，导致高令牌消耗和存储压力，使得资源受限的应用难以承受。我们提出潜在记忆，一种潜在空间记忆范式，它将每个原始文本或图像证据项替换为由小型压缩器LLM/VLM生成的单个高维潜在令牌。潜在记忆不是在生成时检索原始证据，而是在统一的潜在表示空间中操作：查询被嵌入到该空间中以检索相关的潜在令牌，检索到的潜在令牌直接提示给预训练的LLM或VLM以生成答案。为了使每个潜在令牌同时具有用于重建、检索和生成的信息，我们使用重建、对比和蒸馏目标以统一的端到端方式训练压缩器。潜在记忆在七个纯文本QA基准（例如HotpotQA）和多模态QA基准上进行了评估，与先进的RAG基线相比，它实现了具有竞争力的QA性能，同时消耗的生成器令牌减少了3到10倍。它还能在WebQA上提供最强的图像基础问答性能。代码可在该https URL获取。

English abstract

External memory effectively grounds question answering (QA) based on large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms represent each memory item in raw text and image forms, so retrieval-based systems must pass the retrieved text or images to the generation LLMs/VLMs. This results in high token consumption and storage pressure that resource-constrained applications cannot afford. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are fed directly as a prompt to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves competitive QA performance compared to advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It also delivers the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.


2606.10787 2026-06-10 cs.AI cs.LO 新提交

Accelerating NeurASP with vectorization and caching

通过向量化和缓存加速NeurASP

Alexander Philipp Rader, Alessandra Russo

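Affiliations: University of Freiburg

AI summary: Substantially speeds up training of the neurosymbolic framework NeurASP through vectorization, batch processing, and caching of intermediate computations, achieving speedups of multiple orders of magnitude on larger tasks.

Comments 16 pages, 5 figures, to be published in the Theory and Practice of Logic Programming (TPLP) journal for the 42nd International Conference on Logic Programming (ICLP) issue

AI Chinese abstract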

神经符号AI将神经网络与符号程序相结合，以创建鲁棒且可解释的预测。其中一个框架是NeurASP，它训练神经网络来预测概念，并使用答案集编程（ASP）编写的规则对这些概念进行推理，以解决下游任务。关键的是，标签仅由符号规则产生的下游预测提供，而不是潜在概念。通过不可微的ASP组件进行反向传播需要昂贵的概率和梯度计算，这阻碍了其扩展到更复杂的任务。在本文中，我们通过向量化、批处理和训练期间中间计算的缓存来改善NeurASP的计算性能，从而解决其当前局限性。我们比较了原始NeurASP和新实现的计算速度，并报告了在较大任务上多个数量级的加速。为此，我们提出了一个涉及扑克牌的困难任务新数据集，用于测试NeurASP增强学习功能的能力。

English abstract

Neurosymbolic AI combines neural networks with symbolic programs to create robust and explainable predictions. One such framework is NeurASP, which trains a neural network to predict concepts and reasons over them using rules written in answer set programming (ASP) to solve downstream tasks. Crucially, labels are only provided for the downstream prediction produced by the symbolic rules, not for the latent concepts themselves. Backpropagation through the non-differentiable ASP component requires expensive probability and gradient calculations, which has hindered scalability to more sophisticated tasks. In this paper, we address the current limitations of NeurASP by improving its computational performance through vectorization, batch processing, and caching of intermediate computations during training. We compare computation speeds between the original and our new implementation of NeurASP and report speedups of multiple orders of magnitude for larger tasks. To this end, we propose a new dataset of difficult tasks involving playing cards, which we use to test the capabilities of NeurASP's enhanced learning function.


2606.11045 2026-06-10 cs.AI cs.LG 新提交

What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

什么适合（少量标记）就不会过拟合：ML研究智能体中的压缩与泛化

Martin Andres Bertran, Aaron Roth, Zhiwei Steven Wu

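Affiliations: Amazon Responsible AI; University of Pennsylvania; Carnegie Mellon University

AI summary: Studies whether LLM-driven research agents maintain performance under output and input compression, and finds that short prompts and compressible feedback suffice to reproduce high-performance models, supporting the hypothesis that successful strategies lie in a low-complexity region.

AI Chinese abstract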

原则上，自适应地重复使用保留的基准测试会引发过拟合。然而，基准驱动的机器学习（ML）在实践中却出人意料地很少出现过拟合。一个吸引人的假设是，成功的ML策略具有高度可压缩性。我们在LLM驱动的科研智能体设置中研究这一点，通过两个互补的信息瓶颈可以直接检验该假设。在\emph{输出压缩}中，探索智能体使用验证集自适应地搜索高性能模型，我们测试一个全新的“复现智能体”是否能在仅给出极短提示和训练数据的情况下复现其性能。在\emph{输入压缩}中，探索者仅接收一位反馈，指示每个提交的模型是否优于当前最佳。跨越8个数据集（涵盖表格分类、视觉、语言建模、扩散建模和奖励建模），我们发现这些瓶颈对性能影响很小：短提示和可压缩反馈足以复现和找到高性能模型。该假设是可证伪的：当我们故意诱导验证集过拟合时，结果无法用短提示复现。综合来看，我们的结果支持描述长度解释，即基准驱动ML中缺乏过拟合的原因是成功策略占据策略空间的低复杂度区域。

English abstract

Reusing a held-out benchmark adaptively should, in principle, invite overfitting. Yet benchmark-driven machine learning (ML) has produced surprisingly little overfitting in practice. An attractive hypothesis is that successful ML strategies are highly compressible. We study this in the setting of LLM-driven research agents, where the hypothesis becomes directly testable via two complementary information bottlenecks. In output compression, an exploration agent adaptively searches for high-performance models using a validation set, and we test whether a fresh "reproducer agent" can reproduce its performance given only an extremely short prompt and the training data. In input compression, the explorer receives only one-bit feedback indicating whether each submitted model improves on the running best. Across 8 datasets spanning tabular classification, vision, language modeling, diffusion modeling, and reward modeling, we find that these bottlenecks have little effect on performance: short prompts and compressible feedback are sufficient to reproduce and find high-performance models. The hypothesis is falsifiable: when we deliberately induce validation-set overfitting, the results fail to reproduce with short prompts. Taken together, our results support a description-length explanation for the lack of overfitting in benchmark-driven ML: successful strategies occupy a low-complexity region of strategy space.


2606.11164 2026-06-10 cs.AI 新提交

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

ReasonAlloc: 推理模型的分层解码时KV缓存预算分配

Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei, Xiangyuan Wang, Mengzhe Ruan, Hanxu Hou, Peisong Wang, Linqi Song, Shuang Qiu

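Affiliations: Tsinghua University; City University of Hong Kong; Peking University; Shenzhen University of Advanced Technology; Institute of Automation, Chinese Academy of Sciences

AI summary: Targets the inference bottleneck caused by rapid KV cache growth in long chain-of-thought reasoning with ReasonAlloc, a framework whose hierarchical budget allocation combines offline layer-wise preallocation and online head-wise reallocation, markedly improving reasoning performance at small budgets without added training cost.

AI Chinese abstract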

大语言模型推理中的长链式思维轨迹由于键值缓存的快速增长导致严重的推理瓶颈。当前的解码时压缩方法通过令牌驱逐缓解此问题，但通常假设所有层和头之间均匀分配预算。相比之下，现有的非均匀预算分配方法主要针对静态提示预填充阶段设计，未能捕捉自回归推理的逐步上下文需求。为弥补这一差距，我们提出ReasonAlloc，一个无需训练的框架，将解码时KV压缩重新表述为分层预算分配问题。ReasonAlloc在两个互补层面运作：离线层预分配策略捕捉一种架构驱动的需求模式，我们称之为“推理波”；在线头策略根据实时效用将资源重新分配给信息丰富的头。在数学推理基准（MATH-500、AIME 2024）上使用DeepSeek-R1-Distill-Llama-8B、DeepSeek-R1-Distill-Qwen-14B和AceReason-14B的评估表明，ReasonAlloc优于均匀预算的R-KV、SnapKV和Pyramid-RKV（一种强制静态单调递减层预算的基线），在小预算（128-512令牌）下增益最大。ReasonAlloc可与现有令牌驱逐策略即插即用，并引入可忽略的推理时间开销。

English abstract

Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern, which we call the "Reasoning Wave", while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME 2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.


2606.11173 2026-06-10 cs.AI cs.LG 新提交

The Role of Feedback Alignment in Self-Distillation

反馈对齐在自蒸馏中的作用

Semih Kara, Oğuzhan Ersoy

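Affiliations: Gensyn

AI summary: Shows that, when improving language models through self-distillation, structural alignment between the feedback and the model's reasoning is a key factor, and that step-aligned critique is more effective than binary rewards or reference solutions.

Comments Accepted to the ICML 2026 Workshop on RL from World Feedback (RLxF)

AI Chinese abstract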

在语言模型上附加额外上下文（例如对先前尝试的反馈）通常会改善其响应。自蒸馏训练模型在没有上下文时保留这种改进。该方法通过匹配模型在两种设置下的输出分布来工作：学生仅看到问题，而自教师还看到上下文。因此，模型学习的内容取决于自教师接收的上下文，然而上下文的设计在很大程度上尚未被探索。我们通过使用冻结的批评器训练求解器来研究自蒸馏的上下文设计。我们比较了三种条件：(i) 二元奖励（GRPO），(ii) 参考解，以及 (iii) 与求解器推理轨迹对齐的逐步批评。步级对齐批评带来了最大的增益，在Avg@12上比GRPO高出16.11分，比参考解条件化的自蒸馏高出5.27分。逐token优势分析揭示了原因：步级对齐反馈仅针对推理失败的token，保留正确行为不变。相比之下，条件化于参考解会迫使模型在每个token上改变其行为（即使是正确的步骤），因为替代推导在措辞和方法上不可避免地存在差异。这表明反馈与求解器推理之间的结构对齐是自蒸馏有效性的关键驱动因素。

English abstract

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings: a student that sees only the question, and a self-teacher that also sees the context. What the model learns therefore depends on what context the self-teacher receives, yet the design of this context remains largely unexplored. We study context design for self-distillation by training a solver on feedback from a frozen critic. We compare three conditions: (i) a binary reward (GRPO), (ii) the reference solution, and (iii) a step-by-step critique aligned to the solver's reasoning trace. Step-aligned critique yields the largest gains, outperforming GRPO by 16.11 points and reference-solution-conditioned self-distillation by 5.27 points (Avg@12). Per-token advantage analysis reveals why: step-aligned feedback targets only the tokens where reasoning fails, leaving correct behavior intact. Conditioning on the reference solution, by contrast, pressures the model to change its behavior at every token (even correct steps) because an alternative derivation inevitably differs in phrasing and approach. This suggests that structural alignment between feedback and the solver's reasoning is a key driver of self-distillation effectiveness.


2606.09856 2026-06-10 cs.CL cs.AI cs.LG stat.ML 交叉投稿

Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

使用概率程序训练大型语言模型的归纳推理

Liyi Zhang, Akshay K. Jagadish, Brenden M. Lake, Thomas L. Griffiths

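AI summary: Proposes Program-based Posterior Training (PPT), which uses an LLM to generate scenarios as probabilistic programs, runs inference to produce distributional targets, and fine-tunes models to improve inductive reasoning accuracy, agreement with human judgments, and calibration.

Comments 20 pages, 5 figures

AI Chinese abstract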

大型语言模型（LLM）的后训练推理通常专注于数学和编码等演绎任务，其中正确性可验证。然而，许多现实世界的推理问题是归纳性的：智能体必须从稀疏、模糊的观测中推断不确定的信念。使用标准微调方法进行归纳推理面临挑战，包括难以策划大规模、高质量标注数据集以及处理本质上是分布式的目标。在这项工作中，我们引入了一种称为基于程序的后验训练（PPT）的新方法来解决这些局限性：我们使用LLM生成多样化的开放世界场景作为概率程序，运行概率推理以产生查询的分布式目标响应，然后在这些概率软标签上进行微调。使用这种方法，我们在10,000个程序生成的场景上微调LLM，并在保留的模板、人工标注的判断和外部基准上进行评估。总体而言，PPT显著提高了保留归纳任务的估计准确性，增强了与人类判断的一致性，并迁移到估计和校准的外部基准。此外，原始校准的增益并未被事后温度缩放所涵盖，表明与输出重新缩放相比，模型更深入地内化了不确定性。这些结果表明，概率程序介导的微调是一种有前景的方法，用于后训练LLM以可靠地执行近似归纳推理。

English abstract

Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional. In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.


2606.09858 2026-06-10 cs.IT cs.AI math.IT 交叉投稿

Support sufficiency as action-sufficient compression: a single-cycle rate-regret formulation

支持充分性作为行动充分压缩：单周期率-遗憾公式

Mark Walsh

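AI summary: Formalizes support sufficiency as action-sufficient compression, defining exact sufficiency through the quotient of support space by policy equivalence and approximate sufficiency through expected policy regret, deriving a rate-regret problem in the finite single-cycle setting, and distinguishing action adequacy from reconstruction fidelity, information-bottleneck prediction, and rational inattention.

Comments 22 pages. Submitted to Journal of Mathematical Psychology. Formal single-cycle model of action-sufficient support compression and rate-regret sufficiency

AI Chinese abstract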

鲁棒决策需要压缩。形成丰富支持状态的系统通常无法在行动点保留其完整结构。它必须仅保留在当前后果几何下行动、验证、放弃或推迟所需的区别。本文将支持充分性形式化为行动充分压缩。设$H$表示完整支持状态，$\mathcal{A}$表示有限行动集，$Z$表示指定收益结构的后果几何。对于固定的$Z$，最粗略的精确行动充分压缩是支持空间按策略等价的商。当两个支持状态需要相同的最优行动时，它们可以合并。这阐明了为什么仅内容或仅标量置信度的仲裁在其诱导划分跨越行动边界时失败。然后通过有界期望策略遗憾定义近似充分性。在有限单周期设置中，这产生了一个率-遗憾问题，其源为$H$，再现字母表为$\mathcal{A}$，失真由后果敏感遗憾给出。最优随机行动通道继承了标准率失真吉布斯形式，此处应用于具有遗憾失真的支持状态。贡献是解释性的：行动充分性与重建保真度、信息瓶颈预测和理性疏忽区分开来。鲁棒单周期仲裁不需要保留所有支持，但需要保留后果几何使行动相关的区别。

English abstract

Robust decision-making requires compression. A system that forms a rich support state cannot usually preserve its full structure at the point of action. It must retain only those distinctions needed to act, verify, abstain, or defer under the current consequence geometry. This paper formalizes support sufficiency as action-sufficient compression. Let $H$ denote a full support state, $\mathcal{A}$ a finite action set, and $Z$ a consequence geometry specifying payoff structure. For fixed $Z$, the coarsest exactly action-sufficient compression is the quotient of support space by policy equivalence. Two support states may be merged exactly when they require the same optimal action. This clarifies why content-only and scalar-confidence-only arbitration fail whenever their induced partitions cross action boundaries. Approximate sufficiency is then defined by bounded expected policy regret. In the finite single-cycle setting, this yields a rate-regret problem with source $H$, reproduction alphabet $\mathcal{A}$, and distortion given by consequence-sensitive regret. The optimal stochastic action channel inherits the standard rate-distortion Gibbs form, applied here to support states with regret distortion. The contribution is interpretive: action adequacy is distinguished from reconstruction fidelity, information-bottleneck prediction, and rational inattention. Robust single-cycle arbitration does not require preserving all support, but it does require preserving the distinctions that consequence geometry makes action-relevant.
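
For readers who want the rate-regret problem in symbols, the display below writes it in standard rate-distortion notation. This is a sketch under assumptions: the regret symbol $\rho_Z$, the regret budget $D$, the multiplier $\beta$, and the distributions $p$ and $q$ are notation introduced here for illustration and may differ from the paper's own.

$$
R(D) \;=\; \min_{q(a \mid h)\,:\; \mathbb{E}\left[\rho_Z(H, A)\right] \le D} I(H; A),
\qquad
q^{*}(a \mid h) \;=\; \frac{q^{*}(a)\, e^{-\beta \rho_Z(h, a)}}{\sum_{a' \in \mathcal{A}} q^{*}(a')\, e^{-\beta \rho_Z(h, a')}},
\qquad
q^{*}(a) \;=\; \sum_{h} p(h)\, q^{*}(a \mid h).
$$

Here $\rho_Z(h, a) \ge 0$ is the regret of taking action $a$ in support state $h$ under consequence geometry $Z$, and $\beta \ge 0$ is the Lagrange multiplier that trades rate against regret. With $\beta = 0$ the channel ignores the support state entirely; as $\beta$ grows it concentrates on low-regret actions.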


2606.09859 2026-06-10 cs.LG cs.AI 交叉投稿

Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding

缓解流形偏离：面向可信MLLM解码的不确定性感知子空间校正

Yingxuan Zhuang, Jingxiao Yang, Miao Pan, Cheng Tan, Yuxiang Cai, Siwei Tan, Chen Zhi, Xuhong Zhang, Jianwei Yin, Jintao Chen

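Affiliations: Nanyang Technological University

AI summary: Proposes MGAP, which builds a language-prior subspace via SVD and adaptively attenuates the projected component, suppressing hallucinations while preserving semantic structure and outperforming existing decoding baselines.

Comments ICML 2026 regular

AI Chinese abstract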

MLLM经常产生与视觉输入不一致的幻觉对象。这一问题通常归因于对语言先验的过度依赖，这种依赖可能覆盖视觉上下文。最近的训练自由解码策略通过惩罚语言先验来解决这一问题。然而，这些方法忽视了语言先验的双重性质——根据与视觉证据的对齐程度，它们既可能有益也可能有害。特别是，盲目抑制语言先验往往会破坏模型的语义流形，导致性能下降，我们将这种现象称为流形偏离。为了解决这个问题，我们提出了流形引导自适应投影（MGAP），一种几何感知、训练自由的解码方法，在抑制幻觉的同时保持表示结构。MGAP首先通过SVD从盲隐藏状态构建语言先验子空间。在解码过程中，MGAP将每个多模态隐藏状态投影到该子空间上，并应用一致性感知门控自适应地仅衰减投影的先验分量，从而产生子空间选择性更新，很大程度上保留了正交的语义分量。在POPE和CHAIR上的大量实验表明，MGAP优于先前的解码基线，在不牺牲连贯性的情况下实现了更强的幻觉抑制。

English abstract

MLLMs frequently hallucinate objects inconsistent with visual inputs. This issue is typically attributed to the over-reliance on language priors, which can override the visual context. Recent training-free decoding strategies address this by penalizing language priors. However, these methods overlook the dual nature of language priors, where they can be both helpful and harmful depending on the alignment with visual evidence. In particular, blindly suppressing language priors often disrupts the model's semantic manifold, leading to performance degradation, a phenomenon we term Manifold Departure. To address this, we propose Manifold-Guided Adaptive Projection (MGAP), a geometry-aware, training-free decoding method that mitigates hallucinations while preserving representation structure. MGAP first constructs a language-prior subspace from blind hidden states via SVD. During decoding, MGAP projects each multimodal hidden state onto this subspace and applies a consistency-aware gate to adaptively attenuate only the projected prior component, yielding a subspace-selective update that largely preserves the orthogonal semantic components. Extensive experiments on POPE and CHAIR show that MGAP outperforms prior decoding baselines, achieving stronger hallucination suppression without sacrificing coherence.
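
The subspace-selective update described in this abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `prior_subspace` and `attenuate_prior`, the fixed rank, and the scalar `gate` (standing in for the consistency-aware gate, whose form the abstract does not specify) are all assumptions made for illustration.

```python
import numpy as np

def prior_subspace(blind_states, rank):
    """Return an orthonormal basis (dim x rank) for the top right-singular
    directions of a (num_samples x dim) matrix of "blind" hidden states."""
    _, _, vt = np.linalg.svd(blind_states, full_matrices=False)
    return vt[:rank].T

def attenuate_prior(h, basis, gate):
    """Scale the component of h inside the subspace by `gate` (0..1) and
    leave the orthogonal component untouched.
    `gate` is a hypothetical stand-in for the paper's consistency-aware gate."""
    prior_part = basis @ (basis.T @ h)      # projection onto the subspace
    return h - (1.0 - gate) * prior_part    # orthogonal part + gate * prior part
```

With `gate = 1` the hidden state is returned unchanged, and with `gate = 0` the projected prior component is removed completely; intermediate values attenuate only that component.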


2606.09862 2026-06-10 cs.LG cs.AI 交叉投稿

Blurry Window Attention

模糊窗口注意力

Axel Laborieux, Christos Sourmpis, Juan Gabriel Kostelec, Qinghai Guo

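AI summary: Proposes Blurry Window Attention (BLA), an attention method with bounded-memory control that reconstructs a blurry KV history by interpolation with Dirichlet kernels; on synthetic tasks its state efficiency is 8x better than sliding window attention, and its performance improves as the state size grows.

AI Chinese abstract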

Transformer语言模型中的Softmax注意力操作在序列长度上具有二次复杂度，且状态大小以KV缓存形式增长，这成为长上下文场景中的瓶颈。为克服此限制，引入了具有线性复杂度和有限状态大小的替代架构，如状态空间模型（SSM）、线性注意力（LA）和有界记忆控制注意力（ABC）。尽管线性模型在语言困惑度上与Transformer相当，但在需要检索或回忆特定信息的任务中仍落后。本文提出模糊窗口注意力（BLA），一种受SSM启发的新型ABC方法。BLA存储一个频率窗口，通过使用Dirichlet核进行插值从中重建模糊的KV历史。根据Dirichlet核的分辨率，BLA可理解为滑动窗口注意力（SWA）的泛化，或门控槽注意力（GSA）的特例，其中衰减因子由Dirichlet核实现。我们详细描述了BLA的理论和高效实现。在多查询关联回忆（MQAR）合成任务上，我们表明BLA的状态效率比SWA高8倍，且与流行的线性注意力模型竞争；在RegBench合成任务中，在我们测试的线性模型中，只有BLA和SWA随着状态大小增长而提升性能。

English abstract

The Softmax Attention operation in Transformer language models has quadratic complexity in the sequence length and a state size that grows in the form of the KV cache, which becomes a bottleneck in long-context scenarios. To overcome this limitation, alternative architectures with linear complexity and finite state size have been introduced, such as State-Space Models (SSMs), Linear Attention (LA), and Attention with Bounded-memory Control (ABC). Though linear models achieve language perplexity similar to that of Transformers, they still lag behind in tasks that require retrieval or recall of specific information. In this work, we introduce Blurry Window Attention (BLA), a novel ABC method inspired by SSMs. BLA stores a frequency window from which a blurry KV history is reconstructed via interpolation using Dirichlet kernels. Depending on the resolution of the Dirichlet kernels, BLA can be understood as a generalization of Sliding Window Attention (SWA) or as a special case of Gated Slot Attention (GSA) in which the decay factor is implemented with Dirichlet kernels. We describe in detail the theory and efficient implementation of BLA. On the Multi-Query Associative Recall (MQAR) synthetic task, we show that the state efficiency of BLA is 8$\times$ better than that of SWA and is competitive with popular linear attention models. In the RegBench synthetic task, BLA and SWA are the only linear models we tested whose performance improves as the state size grows.
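
For readers unfamiliar with the Dirichlet kernel mentioned in this abstract, the sketch below evaluates the standard kernel $D_n(x) = 1 + 2\sum_{k=1}^{n} \cos(kx) = \sin((n + 1/2)x) / \sin(x/2)$, which is what remains of an impulse when only the lowest $n$ frequencies are kept. It illustrates the kernel only; how BLA uses it for interpolation is not specified here, and the function name `dirichlet_kernel` is an assumption.

```python
import numpy as np

def dirichlet_kernel(x, n):
    """Closed-form Dirichlet kernel D_n(x) = sin((n + 1/2) x) / sin(x / 2).
    Where sin(x / 2) vanishes (x a multiple of 2*pi), the limit 2n + 1 is used."""
    x = np.asarray(x, dtype=float)
    num = np.sin((n + 0.5) * x)
    den = np.sin(0.5 * x)
    safe = np.abs(den) > 1e-12
    return np.where(safe, num / np.where(safe, den, 1.0), 2 * n + 1)
```

A larger `n` gives a narrower main lobe, that is, a sharper (less blurry) kernel, which matches the abstract's remark that the behavior depends on the kernel resolution.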


2606.09869 2026-06-10 cs.LG cs.AI cs.CR 交叉投稿

QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning

QSplitFL: 基于能力感知的深度Q学习在分割联邦学习中的最优分割点选择

Nazmus Shakib Shadin, Xinyue Zhang, Jingyi Wang, Miao Pan

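AI summary: Proposes QSplitFL, a framework that uses a Deep Q-Network to select the optimal split point dynamically from client hardware metrics (CPU, memory, battery, network latency), addressing split federated learning on heterogeneous devices and improving convergence speed and accuracy through a decayed loss-drop reward function and committee voting.

Comments Accepted by ECML-PKDD 2026

AI Chinese abstract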

联邦学习（FL）与分割学习（SL）结合是一种隐私保护范式，能够在资源受限设备上训练深度神经网络（DNN），同时降低整体训练成本。然而，确定最优分割点（即模型被分割的层）仍然是一个关键挑战，尤其是当客户端具有异构硬件能力时。固定分割点可能使弱设备过载，增加通信和服务器负载，从而减慢收敛速度并降低稳定性。本文介绍了QSplitFL，一种新颖的基于能力感知的深度Q网络（DQN）框架，用于在基于分割学习的联邦学习（SFL）环境中选择最优分割点。与依赖高维模型权重表示的现有方法不同，QSplitFL采用直接从客户端硬件指标（包括CPU利用率、内存、电池电量和网络延迟）导出的轻量级状态表示。所提出的框架包含一个衰减损失下降奖励函数，优先考虑早期收敛，以及一个基于委员会的DQN架构，通过多数投票来减轻奖励黑客攻击。在MNIST、Fashion-MNIST、CIFAR-10和CIFAR-100数据集上，使用CNN、ResNet50、MobileNetV4和ConvNeXt架构进行的广泛实验表明，我们的方法在收敛速度和精度上优于现有方法，同时有效适应异构设备资源。源代码在此https URL公开可用。

English abstract

Federated Learning (FL) combined with Split Learning (SL) is a privacy-preserving paradigm that enables training deep neural networks (DNNs) on resource-constrained devices while reducing overall training cost. However, determining the optimal split point, meaning the layer where the model is divided, remains a critical challenge, especially when clients have heterogeneous hardware capabilities. Fixed split points can overload weak devices and increase the communication and server load, which slows convergence and reduces stability. This paper introduces QSplitFL, a novel capability-aware Deep Q-Network (DQN) framework for optimal split point selection in Split-learning-based Federated Learning (SFL) environments. Unlike existing approaches that rely on high-dimensional model weight representations, QSplitFL employs a lightweight state representation derived directly from client hardware metrics, including CPU utilization, memory, battery level, and network latency. The proposed framework incorporates a decayed loss-drop reward function that prioritizes early convergence, and a committee-based DQN architecture with majority voting to mitigate reward hacking. Extensive experiments on the MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets using CNN, ResNet50, MobileNetV4, and ConvNeXt architectures demonstrate that our approach achieves better convergence and higher accuracy than existing methods, while effectively adapting to heterogeneous device resources. The source code is publicly available at https://github.com/AIPO-Lab/QSplitFL.


2606.09873 2026-06-10 cs.LG cs.AI 交叉投稿

表示课程：用于鲁棒排序和分配的分阶段训练

Ehsan Ebrahimzadeh, Sina Baharlouei, Abraham Bagherjeiran

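Affiliations: eBay Search Ranking and Monetization

AI summary: Proposes Representation Curriculum (RC), which introduces features in stages, emphasizing content-based signals first and exposure-dependent signals later, to reduce shortcut reliance on historical signals and improve cold-start generalization and robustness.

Comments 12 pages, 5 figures

DOI: 10.1145/3770855.3818470

AI Chinese abstract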

数字市场中的排序是一种动态曝光分配机制：展示的物品塑造了发现轨迹和成功事件，平台记录这些事件以更新未来的分配策略。现代排序系统严重依赖曝光混杂信号（如流行度估计、CTR/CVR聚合和基于ID的表示），因为这些信号在静态需求下具有高度预测性。然而，这种预测能力可能成为一种学习捷径：早期访问依赖曝光的信念信号会使优化过度依赖它们，而忽视独立于曝光的价值信号（如基于内容的竞争力和语义亲和性）。因此，学习到的策略倾向于固化现有物品，并在分布偏移下降低冷启动泛化性和鲁棒性。我们提出表示课程（RC），一种训练时干预方法，按时间阶段安排特征使用。RC首先突出基于内容的价值信号，然后引入依赖曝光的信念信号，同时将内容路径锚定在学到的价值表示附近，从而抑制对历史信号的捷径依赖，并缓解内容信号上的梯度饥饿。我们形式化RC，使其独立于任务和假设类，并提供排序特定的实例化。在高斯线性岭回归设置中，我们推导出封闭形式解和充分条件，证明RC在冷启动目标分布上严格降低总体风险，并量化了与源性能的帕累托权衡。在公开的排序学习和推荐基准测试，以及大规模电商搜索系统中的随机在线实验中，RC显著地将依赖从历史信念信号转向基于内容的价值信号，并在头部性能可控权衡下，对冷群体带来一致的提升。

English abstract

Ranking in digital marketplaces is a dynamic exposure-allocation mechanism: displayed items shape discovery trajectories and success events logged by the platform to update future allocation policies. Modern ranking systems rely heavily on exposure-confounded signals (e.g., popularity estimates, CTR/CVR aggregates, and ID-based representations), because they are highly predictive under stationary demand. Yet this predictive power can become a learning shortcut: early access to exposure-dependent belief signals steers optimization toward over-reliance on them and away from exposure-independent merit signals (e.g., content-based competitiveness and semantic affinity). Consequently, the learned policy tends to entrench incumbents and degrade cold-start generalization and robustness under distribution shift. We propose Representation Curriculum (RC), a training-time intervention that temporally stages feature utilization. RC foregrounds content-based merit signals initially, then introduces exposure-dependent belief signals while anchoring the content pathway near the learned merit representation, curbing shortcut reliance on historical signals and mitigating gradient starvation on content signals. We formalize RC independently of task and hypothesis class and provide ranking-specific instantiations. In a Gaussian linear ridge setting, we derive closed-form solutions and sufficient conditions under which RC strictly reduces population risk on a cold-start target distribution, with a quantified Pareto tradeoff against source performance. Experiments on public learning-to-rank and recommendation benchmarks, and randomized online experiments in a large-scale e-commerce search system, show that RC measurably shifts reliance from historical belief signals toward content-based merit signals and yields consistent gains on cold populations with a controlled trade-off in head performance.


2606.09899 2026-06-10 cs.LG cs.AI 交叉投稿

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

当归因补丁说谎时：诊断与二阶修正

Luyang Zhang, Jialu Wang

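Affiliations: Carnegie Mellon University; University of California, Santa Cruz

AI summary: Studies the unreliability of attribution patching (a gradient-based first-order approximation) in mechanistic interpretability, finds that the dominant error stems from non-linearities in the downstream network, and proposes a reliability score, error bounds, and a second-order HVP correction.

Comments 30 pages, 12 figures

AI Chinese abstract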

机制可解释性的一个核心目标是识别哪些内部组件因果地驱动语言模型的行为。由于这些重要性估计作为识别电路的证据，系统性错误可能导致对底层机制的误识别。虽然激活补丁提供了黄金标准的因果度量，但其计算成本在大规模下难以承受。从业者转而依赖归因补丁，一种基于梯度的一阶近似，其可靠性尚不明确。在这项工作中，我们刻画了这种不可靠性的来源，证明主要误差源于下游网络的非线性，而非补丁组件的局部曲率。这一洞察产生了三个实用工具：(i) 检测不可信估计的可靠性评分，(ii) 量化潜在归因误规范的误差界，以及 (iii) 仅需一次额外反向传播即可消除主导误差的Hessian-向量-乘积（HVP）修正。在五个模型家族（124M-9B参数）以及随机令牌和自然（名称交换）扰动的评估中，HVP是唯一在大规模下可行的二阶修正，而标准基线如积分梯度在计算上变得不可行。在对比实验中，多步HVP变体以显著更低的计算量达到或超过积分梯度的准确性，优于先前的二阶基线。这些改进在标准基准上实现了更高保真度的电路恢复，并支持一种屏幕-标记-修复工作流，仅将计算努力针对被标记为不可靠的组件。

English abstract

A central goal of mechanistic interpretability is to identify which internal components causally drive a language model's behavior. Because these importance estimates serve as the evidence for identifying circuits, systematic errors can lead to the misidentification of the underlying mechanisms. While activation patching provides a gold-standard causal metric, its computational cost is prohibitive at scale. Practitioners instead rely on attribution patching, a gradient-based, first-order approximation whose reliability remains poorly understood. In this work, we characterize the source of this unreliability, demonstrating that the dominant error stems from the non-linearities in the downstream network rather than local curvature at the patched component. This insight yields three practical tools: (i) a reliability score to detect untrustworthy estimates, (ii) error bounds quantifying potential attribution mis-specifications, and (iii) a Hessian-vector-product (HVP) correction that eliminates the leading-order error with only one additional backward pass. In evaluations across five model families (124M-9B parameters) and both random-token and naturalistic (name-swap) perturbations, HVP is the only second-order correction feasible at larger scale, where standard baselines like Integrated Gradients become computationally prohibitive. In comparative experiments, a multi-step HVP variant matches or exceeds the accuracy of Integrated Gradients at significantly lower compute, outperforming prior second-order baselines. These improvements lead to higher-fidelity circuit recovery on standard benchmarks and support a Screen-Flag-Fix workflow that targets computational effort only toward the components flagged as unreliable.
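
The gap between exact patching, its first-order approximation, and a second-order correction can be seen on a toy example. The sketch below is a generic Taylor-expansion illustration, not the paper's method: the toy metric `tanh(W @ x)`, the weights `W`, and the finite-difference Hessian-vector product (one extra gradient evaluation, standing in for one extra backward pass) are all assumptions chosen so the example runs with NumPy alone.

```python
import numpy as np

# Hypothetical toy "downstream network": a scalar metric of one activation vector.
W = np.array([1.0, -0.5, 0.25])

def metric(x):
    return np.tanh(W @ x)

def grad_metric(x):
    # Analytic gradient of tanh(W @ x) with respect to x.
    return (1.0 - np.tanh(W @ x) ** 2) * W

def attribution_estimates(x_clean, x_patch, eps=1e-4):
    """Return (exact, first_order, second_order) estimates of the effect of
    replacing x_clean with x_patch."""
    delta = x_patch - x_clean
    g = grad_metric(x_clean)
    first_order = g @ delta                       # attribution patching
    # Finite-difference Hessian-vector product along delta.
    hvp = (grad_metric(x_clean + eps * delta) - g) / eps
    second_order = first_order + 0.5 * (delta @ hvp)
    exact = metric(x_patch) - metric(x_clean)     # activation patching
    return exact, first_order, second_order
```

Calling `attribution_estimates(np.array([0.5, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))` moves the pre-activation from 0.5 to 1.0, where the `tanh` curvature makes the first-order estimate overshoot and the second-order term removes most of that error.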


2606.09902 2026-06-10 cs.NE cs.AI 交叉投稿

The Whale That Outswam Evolution: Swarm Intelligence Maximises Memory in Connectome Reservoirs

超越进化的鲸鱼：群体智能在连接组储备池中最大化记忆

Anmol Guragain, Savvas Kakalis, Juan Ignacio Godino-Llorente

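Affiliations: University of Murcia; University of Pisa

AI summary: Applies four gradient-free, bio-inspired optimisers to the edge weights of connectome reservoirs from six species, outperforming unoptimised biological baselines on every task and species; the Whale Optimisation Algorithm achieves the largest gains (up to a 17x memory capacity improvement and up to an 89% NRMSE reduction), and the results indicate that biological weights are an essential inductive bias that topology alone cannot replace.

AI Chinese abstract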

储备池计算利用固定动力学的循环网络进行时序处理，仅需训练线性读出层。经过数百万年进化塑造的生物神经连接组可能编码了超越随机储备池的计算结构，但该结构能否通过原则性优化进一步增强仍是开放问题。我们通过将四种无梯度、生物启发优化器（粒子群优化、差分进化、灰狼优化器和鲸鱼优化算法）应用于基于连接组的回声状态网络的边权重来解决该问题，这些网络涵盖六个物种，其神经复杂性跨越六个数量级：秀丽隐杆线虫（279个神经元）、果蝇（49个节点）、小鼠（112个节点）、大鼠（73个节点）、猕猴（29个区域，连续FLNe突触强度）以及人类结构MRI连接（83个脑区）。每个连接组在四个经典储备计算基准上评估：记忆容量（MC）、Lorenz吸引子预测、NARMA-10系统辨识和Mackey-Glass混沌时间序列预测。所有四种优化器在从生物权重初始化时，在每个任务和物种上均一致优于未优化的生物基线。鲸鱼优化算法在每个任务上均取得最大增益：记忆容量提升高达17倍（秀丽隐杆线虫：1.39至23.91），均方根误差降低高达89%（Mackey-Glass，人类），对应所有物种和任务平均改进214%。关键的是，相同拓扑上的随机初始化始终表现劣于生物学，确立了生物权重值作为拓扑本身无法恢复的必要归纳偏置。这些结果将生物启发、生物初始化优化定位为跨动物王国连接组储备计算的一种原则性且广泛有效的策略。

English abstract

Reservoir computing exploits the fixed dynamics of a recurrent network for temporal processing, requiring only a trained linear readout. Biological neural connectomes, shaped by millions of years of evolution, may encode computational structure beyond what random reservoirs provide, yet whether that structure can be further enhanced by principled optimisation remains an open question. We address it by applying four gradient-free, bio-inspired optimisers (Particle Swarm Optimisation, Differential Evolution, Grey Wolf Optimiser, and Whale Optimisation Algorithm) to the edge weights of connectome-based echo-state networks across six species spanning six orders of magnitude in neural complexity: C. elegans (279 neurons), Drosophila (49 nodes), mouse (112), rat (73), macaque (29 regions, continuous FLNe synaptic strengths), and human structural MRI connectivity (83 parcels). Each connectome is evaluated on four canonical reservoir computing benchmarks: Memory Capacity (MC), Lorenz attractor prediction, NARMA-10 system identification, and Mackey-Glass chaotic time-series prediction. All four optimisers consistently outperform unoptimised biological baselines across every task and species when initialised from biological weights. WOA achieves the largest gains on every task: up to a 17x MC improvement (C. elegans: 1.39 to 23.91) and up to 89% NRMSE reduction (Mackey-Glass, human), corresponding to an average 214% improvement across all species and tasks. Crucially, random initialisation on the same topology reliably underperforms biology, establishing biological weight values as an essential inductive bias that topology alone cannot recover. These results position bio-inspired, biologically-initialised optimisation as a principled and broadly effective strategy for connectome reservoir computing across the animal kingdom.
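
The Memory Capacity (MC) benchmark named in this abstract can be sketched for a generic echo-state network as follows. This is a minimal illustration under assumptions, not the authors' setup: the reservoir matrix `W` is supplied by the caller (a connectome in the paper, any matrix here), and the names `memory_capacity`, `w_in`, `washout`, and `ridge` are hypothetical.

```python
import numpy as np

def memory_capacity(W, w_in, u, max_delay, washout=100, ridge=1e-8):
    """Sum over delays k of the squared correlation between u(t - k) and a
    ridge-regression linear readout of the reservoir state at time t."""
    if washout < max_delay:
        raise ValueError("washout must be at least max_delay")
    n = W.shape[0]
    x = np.zeros(n)
    states = np.empty((len(u), n))
    for t, u_t in enumerate(u):
        x = np.tanh(W @ x + w_in * u_t)        # fixed reservoir dynamics
        states[t] = x
    X = states[washout:]                        # discard the initial transient
    gram = X.T @ X + ridge * np.eye(n)
    mc = 0.0
    for k in range(1, max_delay + 1):
        target = u[washout - k: len(u) - k]     # input delayed by k steps
        w_out = np.linalg.solve(gram, X.T @ target)   # only the readout is trained
        r = np.corrcoef(X @ w_out, target)[0, 1]
        mc += r ** 2
    return mc
```

Only the readout weights `w_out` are fitted; an optimiser of the kind the abstract describes would instead adjust the entries of `W` and re-evaluate this score.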


2606.09907 2026-06-10 cs.LG cs.AI 交叉投稿

LongMoE: Longitudinal Multimodal Learning via Trajectory-Aware Mixture-of-Experts

LongMoE：基于轨迹感知的混合专家模型的纵向多模态学习

Maxx Richard Rahman, Prakhar Kumar, Wolfgang Maass

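AI summary: Proposes LongMoE, a framework that jointly addresses modality missingness and longitudinal dynamics in clinical multimodal learning through context-aware imputation, attentional tokenization, trajectory-aware encoding, and sparse MoE routing, with robustness validated on ADNI and other datasets.

AI Chinese abstract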

多模态临床学习对于整合包括影像、文本和个性化健康记录在内的多样化患者数据日益重要。然而，它面临两个基本挑战：i) 模态缺失，即在一次患者就诊中任意子集的模态不可用；ii) 纵向动态，即观察结果的诊断意义取决于患者随时间演变的疾病轨迹。现有方法孤立地处理这些挑战：缺失模态框架将每次就诊视为独立的静态快照并丢弃时间上下文，而纵向模型通常假设模态完全可用并在系统性模态不完整时性能下降。我们提出LongMoE（纵向混合专家模型），这是一个统一框架，用于联合解决这两个挑战。LongMoE结合了上下文感知插补模块和注意力标记化模块，后者捕获不规则就诊序列中的频域时间模式，以及用于建模疾病进展的轨迹感知编码器和用于患者特定专家选择的上下文条件稀疏MoE路由。在ADNI、OASIS-3和MIMIC-IV上的实验表明，LongMoE在缺失或弱共时模态下提高了鲁棒性，并在全模态设置中保持竞争力，为纵向感知的多模态临床学习奠定了坚实基础。

英文摘要

Multimodal clinical learning is increasingly important for integrating diverse patient data, including imaging, text, and personalised health records. However, it faces two fundamental challenges: (i) modality missingness, where arbitrary subsets of modalities are unavailable at a given patient visit, and (ii) longitudinal dynamics, where the diagnostic significance of an observation depends on the patient's evolving disease trajectory over time. Existing methods address these challenges in isolation: missing-modality frameworks treat each visit as an independent static snapshot and discard temporal context, while longitudinal models often assume complete modality availability and degrade under systematic modality incompleteness. We propose LongMoE (Longitudinal Mixture-of-Experts), a unified framework that jointly addresses both challenges. LongMoE combines a context-aware imputation module with an attentional tokenization module that captures frequency-domain temporal patterns across irregular visit sequences, a trajectory-aware encoder for modeling disease progression, and context-conditioned Sparse MoE routing for patient-specific expert selection. Experiments on ADNI, OASIS-3, and MIMIC-IV show that LongMoE improves robustness under missing or weak contemporaneous modalities and remains competitive in full-modality settings, establishing a strong foundation for longitudinally-aware multimodal clinical learning.

2606.09923 2026-06-10 cs.LG cs.AI 交叉投稿

Conformal Prediction for Neural Operators: Distribution-Free Uncertainty Quantification in Physics Simulation

神经算子的共形预测：物理模拟中无分布不确定性量化

Michael Chin

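AI summary: Applies split conformal prediction to neural-operator physics simulation, yielding distribution-free prediction intervals with finite-sample coverage guarantees, and produces adaptive-width intervals through a normalized conformal prediction scheme.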

Comments 13 pages, 7 tables, 7 figures. Full-scale experiments on NVIDIA V100

详情

AI中文摘要

神经算子如傅里叶神经算子（FNO）已成为求解偏微分方程（PDE）的强大替代方法，比传统数值求解器快几个数量级。然而，在安全关键工程应用（如电子元件和电池系统的热管理）中部署这些模型，不仅需要准确的点预测，还需要严格的不确定性保证。现有的神经算子不确定性量化（UQ）方法，包括蒙特卡洛Dropout和深度集成，仅提供相对不确定性估计，没有正式的覆盖保证。在这项工作中，我们首次将分裂共形预测应用于基于神经算子的物理模拟，提供具有有限样本覆盖保证的无分布预测区间。我们进一步引入了一种归一化共形预测方案，利用MC Dropout不确定性生成自适应宽度区间，在低不确定性区域产生更紧的区间，在模型不太确定的区域产生更宽的区间。在稳态热传导基准上的全规模实验（3370万参数，800个训练样本，5个集成成员，NVIDIA V100）表明，我们的方法在目标水平alpha=0.1下达到89.1%的经验覆盖率，同时生成反映底层物理不确定性结构的空间自适应预测区间。我们还提供了一个不确定性分解框架，将认知不确定性（占总量的68%）与偶然不确定性（占总量的32%）分离，为数据收集和模型改进提供可操作指导。我们的方法在一个开源平台上实现，具有REST API端点和交互式3D可视化。

英文摘要

Neural operators such as the Fourier Neural Operator (FNO) have emerged as powerful surrogates for solving partial differential equations (PDEs), achieving speedups of several orders of magnitude over traditional numerical solvers. However, deploying these models in safety-critical engineering applications -- such as thermal management of electronic components and battery systems -- requires not only accurate point predictions but also rigorous uncertainty guarantees. Existing uncertainty quantification (UQ) methods for neural operators, including Monte Carlo Dropout and Deep Ensembles, provide only relative uncertainty estimates without formal coverage guarantees. In this work, we propose the first application of split conformal prediction to neural operator-based physics simulation, providing distribution-free prediction intervals with finite-sample coverage guarantees. We further introduce a normalized conformal prediction scheme that leverages MC Dropout uncertainty to produce adaptive-width intervals, yielding tighter intervals in regions of low uncertainty and wider intervals where the model is less certain. Full-scale experiments (33.7M parameters, 800 training samples, 5 ensemble members, NVIDIA V100) on steady-state heat conduction benchmarks demonstrate that our method achieves 89.1% empirical coverage at the target level of alpha=0.1, while producing spatially adaptive prediction intervals that reflect the underlying physical uncertainty structure. We also provide an uncertainty decomposition framework that separates epistemic uncertainty (68% of total) from aleatoric uncertainty (32% of total), offering actionable guidance for data collection and model improvement. Our method is implemented in an open-source platform with REST API endpoints and interactive 3D visualization.
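
To make the split conformal step mentioned in this abstract concrete, here is a minimal sketch in plain Python. It is the generic textbook procedure, not the paper's implementation: the function name `conformal_halfwidth`, the toy residuals, and the scalar (rather than spatial-field) setting are all assumptions made for illustration.

```python
import math

def conformal_halfwidth(calibration_residuals, alpha=0.1):
    """Return a split-conformal interval half-width.

    calibration_residuals: absolute errors |y - y_hat| measured on a
    held-out calibration set (hypothetical data, not from the paper).
    """
    n = len(calibration_residuals)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return sorted(calibration_residuals)[rank - 1]

# Toy calibration residuals (illustrative only).
residuals = [0.05, 0.12, 0.08, 0.30, 0.02, 0.15, 0.22, 0.10, 0.07, 0.18,
             0.04, 0.11, 0.09, 0.25, 0.03, 0.14, 0.20, 0.06, 0.13]
q = conformal_halfwidth(residuals, alpha=0.1)

# Symmetric interval around a new point prediction.
prediction = 1.00
interval = (prediction - q, prediction + q)
```

Under exchangeability of calibration and test points, an interval built this way covers the true value with probability at least 1 - alpha. The normalized variant described in the abstract would presumably divide each residual by a per-point uncertainty estimate before taking the quantile, which this sketch does not attempt.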

2606.09924 2026-06-10 cs.LG cs.AI 交叉投稿

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

Sigma-Branch: 用于动态推理的分层单路径网络重构，减少活跃参数

Kohga Tanaka, Hiroaki Nishi

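AI summary: Proposes Sigma-Branch, a framework that restructures a pretrained dense network into a hierarchical binary tree with a shared backbone, hierarchical routers, and specialized leaves, initialized by activation clustering and then fine-tuned; only a single path runs at inference, cutting active parameters by 58-60% on tasks such as CIFAR-100 / ResNet-50 while staying within 1.72 percentage points of the dense baseline.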

详情

AI中文摘要

在内存受限的边缘加速器上部署深度神经网络，瓶颈在于每次推理的片外权重传输而非计算：密集网络无法保留在芯片上，每个输入都必须加载所有参数。现有模型压缩仅在永久容量损失代价下减少这种传输。我们提出Sigma-Branch (SigmaB)，一个将预训练密集网络重构为分层二叉树的框架，该树由共享主干、分层路由器和专用叶子组成。预训练权重通过基于激活的球形k-means聚类分布在树中，该聚类联合初始化路由器权重和每分支通道分配；然后通过软路由微调使每个叶子与其路由输入子集对齐。在推理时，所得网络仅执行一条根到叶路径，减少活跃参数占用，同时将完整密集参数集存储在内存中。在CIFAR-100 / ResNet-50、ImageNet-1K / ResNet-50和ModelNet40 / PointNet++上，SigmaB-Net将每次推理的活跃参数减少58-60%，同时与密集基线Top-1相比误差在1.72个百分点以内。在可比的ImageNet-1K Top-1下，活跃参数减少超过静态结构化剪枝（FPGM、HRank）14-23个百分点。跨模态评估涵盖2D视觉和3D点云骨干网络，证实了将每次推理内存流量与总参数数量解耦的框架级主张。

英文摘要

Deploying deep neural networks on memory-constrained edge accelerators is bottlenecked by per-inference off-chip weight transfer rather than computation: the dense network cannot be retained on-chip, and every parameter must be loaded for every input. Existing model compression reduces this transfer only at the cost of permanent capacity loss. We propose Sigma-Branch (SigmaB), a framework that restructures a pretrained dense network into a hierarchical binary tree composed of a shared backbone, hierarchical routers, and specialized leaves. Pretrained weights are distributed across the tree via activation-based spherical k-means clustering, which jointly initializes router weights and per-branch channel allocations; soft-routing fine-tuning then aligns each leaf with its routed input subset. At inference, the resulting network executes only a single root-to-leaf path, reducing the active-parameter footprint while storing the complete dense parameter set in memory. Across CIFAR-100 / ResNet-50, ImageNet-1K / ResNet-50, and ModelNet40 / PointNet++, SigmaB-Net reduces per-inference active parameters by 58-60% while remaining within 1.72 percentage points (pp) of the dense baseline Top-1. At comparable ImageNet-1K Top-1, the active-parameter reduction exceeds static structured pruning (FPGM, HRank) by 14-23 pp. The cross-modal evaluation, spanning 2D vision and 3D point-cloud backbones, substantiates a framework-level claim that decouples per-inference memory traffic from the total parameter count.

2606.09926 2026-06-10 cs.LG cs.AI 交叉投稿

Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling

在你挣扎处采样：通过熵引导的幂采样增强基础模型推理

Hong Guo, Nianhui Guo, Christoph Meinel, Haojin Yang

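AI summary: Proposes Entropy-Guided Power Sampling (EGPS), a training-free and verifier-free sampling method that uses token-level entropy from the forward pass to localize MCMC moves to high-entropy regions, reaching best or tied-best accuracy on several benchmarks with up to a 12.6x speedup.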

详情

AI中文摘要

从序列级幂分布 $p^\alpha$ 采样可以在不更新任何参数的情况下从基础语言模型中引出强化学习级别的推理，但标准的Metropolis-Hastings（MH），一种马尔可夫链蒙特卡洛（MCMC）采样器，既昂贵又慢混合。我们将这两个问题归因于结构不匹配：$p^\alpha$ 主要在稀疏、空间聚集的高熵决策点集上偏离 $p$，然而MH沿着前缀均匀地提出重采样位置——在近简并条件上浪费计算，同时在模式发散处欠混合。我们提出熵引导的幂采样（EGPS），一种无需训练和验证器的采样器，它从已经在前向传播中的token级熵重新推导其提议。EGPS跳过确定性块，将每个MCMC移动定位到高熵邻域，并在决策点应用多尝试Metropolis——使得采样成本随熵质量而非序列长度缩放。在Qwen2.5-Math-7B上，EGPS在所有三个基准（MATH500 $75.8\\%$，HumanEval $62.2\\%$，GPQA $42.4\\%$）上达到最佳或并列最佳准确率，同时相对于MH基线实现了高达12.6倍的墙钟加速。

英文摘要

Sampling from the sequence-level power distribution $p^\alpha$ elicits RL-level reasoning from base language models without any parameter updates, but the standard Metropolis--Hastings (MH), a Markov Chain Monte Carlo (MCMC) sampler, is both expensive and slow-mixing. We trace both to a structural mismatch: $p^\alpha$ mainly departs from $p$ at a sparse, spatially clustered set of high-entropy decision points, yet MH proposes resampling positions uniformly along the prefix -- wasting compute on near-degenerate conditionals while under-mixing precisely where modes diverge. We propose Entropy-Guided Power Sampling (EGPS), a training-free and verifier-free sampler that re-derives its proposal from token-level entropy already in the forward pass. EGPS skips deterministic blocks, localizes each MCMC move to a high-entropy neighborhood, and applies Multiple-Try Metropolis at decision points -- making sampling cost scale with entropy mass rather than sequence length. On Qwen2.5-Math-7B, EGPS reaches best or tied-best accuracy on all three benchmarks (MATH500 $75.8\%$, HumanEval $62.2\%$, GPQA $42.4\%$) at up to a $12.6\times$ wall-clock speedup over the MH baseline.
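
The idea of locating high-entropy decision points can be sketched in a few lines of plain Python. This is only an illustration of computing token-level entropy and thresholding it, not the EGPS sampler itself: the function names, the toy distributions, and the fixed threshold are assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_positions(step_distributions, threshold):
    """Indices whose next-token entropy exceeds a (hypothetical) threshold."""
    return [i for i, probs in enumerate(step_distributions)
            if token_entropy(probs) > threshold]

# Toy per-position distributions over a 4-token vocabulary (illustrative).
steps = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.90, 0.05, 0.03, 0.02],  # fairly confident
    [0.40, 0.40, 0.10, 0.10],  # a genuine decision point
]
candidates = high_entropy_positions(steps, threshold=1.0)
```

A sampler that proposes resampling positions only from `candidates`, rather than uniformly over all positions, spends its moves where the distribution actually branches, which is the mismatch the abstract attributes to standard MH.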

2606.09927 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

可训练平滑旋转变换与学习通道尺度用于LLM量化

Patrik Czakó, Gábor Kertész, Sándor Szénási

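AI summary: Targets the difficulty of activation quantization in large language models with a quantile-robust scaling policy and gradient-optimized learned channel scales, substantially reducing error under W4A4 quantization.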

Comments 6 pages, 8 figures, 3 tables. Accepted to IEEE INES 2026 conference proceedings

详情

AI中文摘要

后训练量化（PTQ）是降低大语言模型（LLM）服务成本最实用的方法之一，但激活值量化仍然困难，因为异常值主导的通道会导致较大的量化误差。本文研究了这种退化是否部分由基于缩放的等效变换中的过度迁移引起。我们引入了一种用于SmoothRot风格变换的分位数鲁棒缩放策略，用高分位数替代基于最大值的激活统计量，并辅以通道尺度的约束梯度优化。在LLaMA-3.2-1B的W4A4量化下，仅分位数策略搜索相比SmoothRot基线将选定层误差降低11.1%，联合(alpha, q)搜索降低12%，训练达到18.5%。将最佳选定层策略重放到所有解码器块的下投影层，相应的全层平均误差从97.51降至78.08（19.9%）。结果表明，鲁棒的迁移控制和轻量级尺度学习在保持等效变换框架的同时，相比基于最大值的固定策略提供了持续改进。

英文摘要

Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large quantization errors. This paper investigates whether part of this degradation is caused by over-migration in scaling-based equivalent transformations. We introduce a quantile-robust scaling policy for SmoothRot-style transforms by replacing max-based activation statistics with high quantiles, and we complement it with constrained gradient-based optimization of channel scales. On LLaMA-3.2-1B under W4A4 quantization, quantile-only policy search improves selected-layer error by 11.1% over the SmoothRot baseline, joint (alpha, q) search improves it by 12%, and training reaches 18.5%. Replaying the best selected-layer policy on all decoder-block down-projection layers reduces the corresponding full-layer mean error from 97.51 to 78.08 (19.9%). The results show that robust migration control and lightweight scale learning provide consistent gains over max-based fixed policies while preserving the equivalent-transform framework.
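
The difference between a max-based and a quantile-based activation statistic can be illustrated with a short sketch. It assumes the familiar SmoothQuant-style per-channel scale, s = stat(|X|)^alpha / max(|W|)^(1 - alpha); the paper's exact formula, its rotation step, and its learned scales are not reproduced here, and all names and numbers below are hypothetical.

```python
def quantile(values, q):
    """Empirical quantile with linear interpolation, 0 <= q <= 1."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def smoothing_scale(act_channel, weight_channel, alpha=0.5, q=1.0):
    """SmoothQuant-style per-channel scale; q=1.0 is the max-based policy."""
    act_stat = quantile([abs(a) for a in act_channel], q)
    w_stat = max(abs(w) for w in weight_channel)
    return act_stat ** alpha / w_stat ** (1.0 - alpha)

# One activation channel with a single large outlier (toy values).
acts = [0.5, -0.4, 0.6, 0.3, -0.5, 0.45, 0.55, -0.35, 0.5, 64.0]
weights = [0.25, -0.5, 1.0, -0.75]

s_max = smoothing_scale(acts, weights, alpha=0.5, q=1.0)  # driven by the outlier
s_q = smoothing_scale(acts, weights, alpha=0.5, q=0.8)    # ignores the outlier
```

With the max statistic, the lone outlier sets the scale for the whole channel, so more quantization difficulty is migrated to the weights than the bulk of the activations requires. A quantile statistic yields a much smaller scale, which is one reading of the "over-migration" the abstract investigates.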

2606.09928 2026-06-10 cs.LG cs.AI 交叉投稿

Forward-Only Convolutional Neural Networks with Learnable Channel-Class Assignment

具有可学习通道-类别分配的前向传播卷积神经网络

Mohammadnavid Ghader, Saeed Reza Kheradpisheh, Bahar Farahani, Mahmood Fazlali

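AI summary: Proposes a learnable channel-class assignment mechanism with entropy and orthogonality regularization, plus a loss-aware layer contribution strategy based on validation performance, to enable forward-only learning in residual CNNs; it achieves the best results among FF models on CIFAR-10/100 and Tiny-ImageNet and narrows the gap with backpropagation.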

详情

AI中文摘要

前向-前向（FF）算法通过用局部的前向目标替代基于梯度的信用分配，提供了一种受生物学启发的反向传播替代方案。虽然最近的扩展已将FF适应到卷积神经网络（CNN），但现有公式依赖于静态的通道-类别分区，并且在复杂任务中难以有效执行。在这项工作中，我们引入了一种可学习的通道-类别分配机制，实现了卷积通道的自适应、数据驱动特化，并辅以熵和正交正则化以提升学习性能。我们进一步提出了一种损失感知的层贡献策略，该策略根据中间层的验证性能自适应地加权其预测，从而增强前向推理的有效性。集成到残差CNN中，所提出的方法在CIFAR-10、CIFAR-100和Tiny-ImageNet上相比现有的类似前向方法持续实现了更优的性能。值得注意的是，它在基于FF的模型中建立了新的最先进性能，显著缩小了与反向传播的差距。这些发现表明，引入可学习的通道特化和层贡献加权显著增强了深度CNN中前向学习的表示能力。

英文摘要

The Forward-Forward (FF) algorithm offers a biologically inspired alternative to backpropagation by replacing gradient-based credit assignment with local, forward-only objectives. While recent extensions have adapted FF to convolutional neural networks (CNNs), existing formulations rely on static channel-class partitions and struggle to perform effectively in complex tasks. In this work, we introduce a learnable channel-class assignment mechanism that enables adaptive, data-driven specialization of convolutional channels, supported by entropy and orthogonality regularization to promote learning performance. We further propose a loss-aware layer contribution strategy that adaptively weights intermediate-layer predictions based on their validation performance, enhancing the effectiveness of forward-only inference. Integrated into residual CNNs, the proposed method achieves consistently superior performance across CIFAR-10, CIFAR-100, and Tiny-ImageNet compared to existing similar forward-only methods. Notably, it establishes new state-of-the-art performance among FF-based models, substantially narrowing the gap with backpropagation. These findings demonstrate that introducing learnable channel specialization and layer contribution weighting significantly enhances the representational capacity of forward-only learning in deep CNNs.

2606.09929 2026-06-10 cs.LG cs.AI 交叉投稿

Between Amnesia and Chaos: A Memory Stability Expressivity Trilemma for Trainable Dissipative Oscillator Networks

介于遗忘与混沌之间：可训练耗散振荡器网络的记忆稳定性表现力三难困境

Caleb Munigety

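AI summary: Studies trainable networks of nonlinear oscillators and finds that memory horizon, gradient stability, and dynamical expressivity are all governed by damping, giving a trilemma in which the three cannot be maximized simultaneously; experiments validate the theoretical bounds.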

详情

AI中文摘要

物理储层计算利用非线性机械动力学，但传统上冻结基底并仅训练线性读出层，假定基底不可训练。我们重新审视这一前提，研究非线性振荡器网络，其质量、阻尼和刚度通过辛积分器端到端学习。我们的核心结果是三难困境：记忆范围、梯度稳定性和动态表现力无法同时最大化，因为三者均由阻尼控制。反向梯度以阻尼决定的速率衰减，限制了信用传播的距离，而前向灵敏度以最大李雅普诺夫指数指数增长，因此可用梯度需要阻尼高于稳定下限。由于李雅普诺夫指数随阻尼增加而下降，而记忆上限随范围增加而下降，稳定训练被限制在一个随范围收缩并在临界点闭合的带状区域内。我们在一个二十振荡器网络上测试了每一步。阻尼扫描发现最大李雅普诺夫指数单调变化并在明确的下限处过零，证实了定理的关键假设。在九个范围上的延迟回忆任务中，学习基底与冻结基底的算力匹配比较显示，学习基底在短范围占优，优势在约十一步范围附近接近并反转，这是带状闭合的预测特征；训练模型稳定在稳定下限附近，自发寻求混沌边缘。解析上限高估经验交叉约五倍，这是可检测梯度与可学习梯度之间的差距，我们报告而非调整消除。贡献在于确认了何时训练物理基底优于冻结基底。

英文摘要

Physical reservoir computing harnesses nonlinear mechanical dynamics but, by convention, freezes the substrate and trains only a linear readout, presuming the substrate is not usefully trainable. We revisit that premise for networks of nonlinear oscillators whose mass, damping, and stiffness are learned end-to-end through a symplectic integrator. Our central result is a trilemma: memory horizon, gradient stability, and dynamical expressivity cannot be simultaneously maximized, because all three are governed by the damping. The backward gradient decays at a rate set by the damping, capping how far back credit can propagate, while forward sensitivities grow exponentially in the largest Lyapunov exponent, so usable gradients require damping above a stability floor. Since the Lyapunov exponent falls as damping rises while the memory ceiling falls as the horizon grows, stable training is confined to a band that contracts with horizon and closes at a critical point. We test every step on a twenty-oscillator network. A damping sweep finds the largest Lyapunov exponent monotone and crossing zero at a well-defined stability floor, confirming the theorem's key assumption. A compute-matched comparison of learned versus frozen substrate on delayed recall across nine horizons shows the learned substrate dominating at short horizons and the advantage closing and reversing near a horizon of eleven steps, the predicted signature of band closure; trained models settle near the stability floor, seeking the edge of chaos unprompted. The analytic ceiling overestimates the empirical crossover roughly fivefold, a gap between detectable and learnable gradient that we report rather than tune away. The contribution is a confirmed account of when training a physical substrate beats freezing it.

2606.09932 2026-06-10 cs.LG cs.AI 交叉投稿

Learning Where to Simulate: Generative Active Sampling for Online PDE Surrogate Training

学习何处模拟：在线PDE代理训练的生成式主动采样

Pierre Cesar, Sofya Dymchenko, Abhishek Purandare, Bruno Raffin

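AI summary: Proposes Online Generative Active Sampling (OGAS), which uses a diffusion model to learn the relationship between configuration parameters and surrogate performance and actively samples high-difficulty regions, substantially reducing tail errors and improving the surrogate's worst-case reliability.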

详情

AI中文摘要

数据驱动的PDE代理使用数值PDE求解器产生的数据进行训练。然而，当代理的目标是在广泛的PDE配置（例如初始条件和物理系数）上泛化时，生成具有代表性的训练集并非易事。配置参数的均匀采样通常低估了表现出挑战性动力学的轨迹，导致训练后的代理出现高预测误差和大误差方差。在线训练将数据生成和代理训练耦合，通过允许实时调整求解器参数提供了自然优势。为了有效利用这一能力，我们引入了在线生成式主动采样（OGAS），一种主动学习方法，它反应性地学习配置参数与代理性能之间的关系，以控制采样分布。OGAS与代理并行训练一个快速扩散模型，作为条件采样器，将代理派生的难度信号（例如损失或不确定性）映射到配置参数。通过主动从偏向高难度的先验中抽取目标信号，OGAS持续将数据生成导向挑战性区域，而不会延迟训练流程。我们在具有不同挑战性动力学的2D PDE（Kuramoto-Sivashinsky、Navier-Stokes、Gray-Scott）上评估OGAS，参数多达308个，并使用多种代理架构。在所有设置中，与均匀采样相比，OGAS一致地改善了尾部分布统计，显著降低了第99百分位以上的误差和整体误差离散度。虽然优先考虑挑战性轨迹引入了与平均误差的权衡，但OGAS有效确保了训练后代理的最坏情况可靠性，且壁钟时间开销可忽略不计。

英文摘要

Data-driven PDE surrogates are trained with data produced by numerical PDE solvers. However, when the surrogate's goal is to generalize across a wide range of PDE configurations (e.g., initial conditions and physical coefficients), generating a representative training set is non-trivial. Uniform sampling of configuration parameters often under-represents trajectories exhibiting challenging dynamics, leading to high prediction errors and large error variance in the trained surrogate. Online training, where data generation and surrogate training are coupled, offers a natural advantage by allowing solver parameters to be steered on-the-fly. To efficiently exploit this capability, we introduce Online Generative Active Sampling (OGAS), an active learning method that reactively learns the relationship between configuration parameters and surrogate performance to control the sampling distribution. OGAS trains a fast diffusion model in parallel to the surrogate to act as a conditional sampler, mapping a surrogate-derived difficulty signal (e.g., loss or uncertainty) to configuration parameters. By actively drawing target signals from a prior biased toward high difficulty, OGAS continuously steers data generation toward challenging regimes without delaying the training workflow. We evaluate OGAS across 2D PDEs with distinct challenging dynamics (Kuramoto-Sivashinsky, Navier-Stokes, Gray-Scott) and up to 308 parameters, using multiple surrogate architectures. Across all settings, OGAS consistently improves tail statistics, yielding substantial reductions in errors above the 99th percentile and overall error dispersion compared to uniform sampling. While prioritizing challenging trajectories introduces a trade-off with average error, OGAS effectively ensures worst-case reliability of trained surrogates with negligible wall-time overhead.

2606.09954 2026-06-10 cs.LG cs.AI 交叉投稿

Does Normalization Choice Matter for Causal Large Time-Series Models?

归一化选择对因果大规模时间序列模型重要吗？

Samy-Melwan Vilhes, Gilles Gasso, Mokhtar Z Alaya

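AI summary: Studies how different normalization strategies affect training convergence and forecasting performance in causal large time-series models, and finds that the normalization choice has a significant effect.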

详情

Journal ref: ICLR 2026 Workshop: Time Series in the Age of Large Models, Apr 2026, Rio De Janeiro, Brazil

AI中文摘要

用于时间序列预测的大规模模型已成为在异构信号集合上训练模型的有前景的范式。这些模型通常依赖于因果自回归架构，其中每个观测值根据过去依次预测。在实践中，真实世界的时间序列表现出非平稳性，这显著影响预测性能。为了缓解这一问题，通常采用归一化。然而，在高效的因果设置中，归一化可能在训练期间导致来自未来观测的信息泄漏。最近提出的替代方案，包括因果归一化和从初始观测计算的统计量，旨在解决这一问题，但其实际影响仍未被充分理解。在这项工作中，我们评估了基于Transformer的大规模时间序列模型（采用分块和高效因果策略训练）的归一化策略。我们展示了归一化选择显著影响训练收敛和预测性能。

英文摘要

Large models for time-series forecasting have emerged as a promising paradigm for training models on heterogeneous collections of signals. These models typically rely on causal autoregressive architectures, where each observation is sequentially predicted from the past. In practice, real-world time series exhibit non-stationarities, which significantly influence predictive performance. To mitigate this, normalization is commonly employed. However, in efficient causal settings it might induce information leakage from future observations during training. Recent alternatives, including causal normalization and statistics computed from initial observations, have been proposed to address this issue, but their practical implications remain insufficiently understood. In this work, we evaluate normalization strategies for transformer-based large time-series models trained with patching and an efficient causal strategy. We show that the normalization choice significantly influences both training convergence and forecasting performance.
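
The leakage concern raised in this abstract can be shown with a small sketch that contrasts whole-series normalization with a running, causal variant. Both functions are generic illustrations written for this note, not the strategies evaluated in the paper, and the names and toy series are assumptions.

```python
import math

def global_normalize(series):
    """Normalize with statistics of the whole series (uses future values)."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(var) or 1.0
    return [(x - mean) / std for x in series]

def causal_normalize(series, eps=1e-8):
    """Normalize x_t with the mean/std of x_0..x_t only (no look-ahead)."""
    out, total, total_sq = [], 0.0, 0.0
    for t, x in enumerate(series, start=1):
        total += x
        total_sq += x * x
        mean = total / t
        var = max(total_sq / t - mean * mean, 0.0)
        out.append((x - mean) / (math.sqrt(var) + eps))
    return out

series = [1.0, 2.0, 3.0, 10.0]
prefix = series[:3]

# Causal: the first three outputs do not change when a future value arrives.
causal_is_stable = causal_normalize(series)[:3] == causal_normalize(prefix)
# Global: the same three outputs shift once the later value 10.0 is included.
global_is_stable = global_normalize(series)[:3] == global_normalize(prefix)
```

When a causal model is trained on all positions of a sequence at once, the global variant lets the target at early positions depend on later observations through the shared mean and standard deviation, which is the information leakage described above.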

2606.09960 2026-06-10 cs.LG cs.AI 交叉投稿

HydraCIL: Decoupled Class-Incremental Learning through Prototype-Guided Multi-Head Classifiers

HydraCIL: 通过原型引导的多头分类器实现解耦的类增量学习

Daniel Vila-Cruz, Laura Morán-Fernández, Verónica Bolón-Canedo

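AI summary: Proposes HydraCIL, which freezes the backbone, decouples feature extraction from learning, and selects task-specific classifier heads by prototype similarity, enabling efficient class-incremental learning in resource-constrained environments; it matches or outperforms existing methods while greatly reducing training time and carbon emissions.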

Comments Accepted for publication at the International Joint Conference on Neural Networks (IJCNN 2026)

详情

AI中文摘要

我们提出HydraCIL，一种基于原型引导的多头分类器的解耦持续学习模型，旨在嵌入式及资源受限环境中的可持续部署。虽然大多数类增量学习（CIL）方法依赖强大硬件和长时间再训练周期，但实际系统（如机器人或边缘AI设备）必须在有限资源下快速适应。HydraCIL通过冻结主干网络并将特征提取与学习解耦来解决这一问题。对于每个任务，特征被提取一次，并创建一个轻量级的、任务特定的分类器头，避免了昂贵的主干再训练。在推理时，HydraCIL通过与原型的相似性选择适当的头。在CIFAR-100、ImageNet-100、CoRe50和Flowers102数据集上的实验表明，HydraCIL匹配或超越了最先进的CIL方法，同时显著减少了训练时间和碳足迹，使其成为在能源效率和快速适应至关重要的实际及嵌入式环境中进行持续学习的实用解决方案。

英文摘要

We present HydraCIL, a decoupled continual learning model based on prototype-guided multi-head classifiers, targeting sustainable deployment in embedded and resource-constrained environments. While most Class-Incremental Learning (CIL) methods rely on powerful hardware and long retraining cycles, real-world systems, such as robots or edge AI devices, must adapt quickly with limited resources. HydraCIL addresses this gap by freezing the backbone and decoupling feature extraction from learning. For each task, features are extracted once and a lightweight, task-specific classifier head is created, avoiding costly backbone retraining. At inference, HydraCIL selects the appropriate head via similarity with prototypes. Experiments on CIFAR-100, ImageNet-100, CoRe50, and Flowers102 datasets show that HydraCIL matches or outperforms state-of-the-art CIL methods while significantly reducing training time and carbon footprint, making it a practical solution for continual learning in real-world and embedded settings, where energy efficiency and rapid adaptation are critical.
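
Prototype-guided head selection, as described in this abstract, might look roughly like the following sketch. The abstract only says that the head is chosen "via similarity with prototypes", so the use of mean feature vectors as prototypes and cosine similarity as the measure are assumptions, as are all names and the toy 3-dimensional features.

```python
import math

def mean_vector(vectors):
    """Element-wise mean, used here as a simple prototype."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_head(feature, prototypes):
    """Return the task id whose prototype is most similar to the feature."""
    return max(prototypes, key=lambda task: cosine(feature, prototypes[task]))

# Frozen-backbone features per task (toy vectors, illustrative only).
task_features = {
    "task_0": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "task_1": [[0.0, 1.0, 0.2], [0.1, 0.8, 0.0]],
}
prototypes = {task: mean_vector(f) for task, f in task_features.items()}

# At inference, route a new feature to the most similar task's head.
chosen = select_head([0.2, 0.9, 0.1], prototypes)
```

Because the backbone is frozen, adding a task in this scheme only appends one prototype and one lightweight head, leaving earlier heads untouched.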

2606.09962 2026-06-10 cs.LG cs.AI cs.SD 交叉投稿

Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech

FSQ 令牌在分类数据连续扩散中的最优性及其在文本到语音中的应用

Vadim Popov, Wenju Gu, Tasnima Sadekova, Georgii Aparin, Assel Yermekova

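AI summary: Studies the latent-space structure of discrete tokens in continuous diffusion models, shows through theoretical analysis and experiments that the FSQ tokenization scheme is best suited to continuous diffusion for categorical data, and verifies on text-to-speech that it outperforms an LLM-based approach.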

详情

AI中文摘要

分类数据的连续扩散是一种属于扩散家族的框架，旨在生成离散数据。近年来，由于研究人员试图实现寻找自回归大型语言模型的合理替代方案这一具有挑战性的目标，对此类模型的科学兴趣不断增长。在本文中，我们研究了与离散令牌相对应的潜在空间结构的性质，这些性质通过扩散路径测度上的 Kullback-Leibler 散度和最优训练扩散模型正确预测令牌的准确性来表达。我们发现，FSQ 令牌化方案具有的潜在空间结构使其最适合分类数据的连续扩散，这一点通过严格的理论分析和数值实验得到了验证。为了在现实场景中验证我们的发现，我们训练了几个以语音令牌作为中间声学特征的文本到语音扩散模型，并表明基于 FSQ 令牌的模型确实表现最佳，而且它优于其强大的基于 LLM 的对应模型，同时体积更小、速度更快。

英文摘要

Continuous diffusion for categorical data is a framework in the diffusion family that aims to generate discrete data. Scientific interest in such models has grown steadily as researchers pursue the challenging goal of finding reasonable alternatives to autoregressive large language models. In this paper, we study the properties of the latent-space structure corresponding to discrete tokens, expressed in terms of the Kullback-Leibler divergence on diffusion path measures and the accuracy with which an optimally trained diffusion model predicts the correct token. We find that the FSQ tokenization scheme has a latent-space structure whose properties make it best suited for continuous diffusion for categorical data, as verified through rigorous theoretical analysis and numerical experiments. To validate our findings in a real-life scenario, we train several text-to-speech diffusion models with speech tokens as intermediate acoustic features and show that the one based on FSQ tokens indeed performs best; moreover, it outperforms its strong LLM-based counterpart while being significantly smaller and faster.

2606.10071 2026-06-10 cs.LG cs.AI 交叉投稿

Temporal Sheaf Neural Networks with Dynamic Orthogonal Transport

时序层神经网络与动态正交传输

Md Sadek Hossain Asif, Tanzila Khan, Md. Mosaddek Khan

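AI summary: Proposes Temporal Sheaf Neural Networks (TSNN), which perform temporal link prediction using dynamic orthogonal frames and explicit transport between local coordinate systems, outperforming existing methods on multiple benchmarks, particularly on graphs with strong node-role heterogeneity.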

详情

AI中文摘要

我们引入了时序层神经网络（TSNN），这是一个时序链接预测框架，它为每个节点配备一个时变正交帧，并仅在局部坐标系之间进行显式传输后比较节点状态。与在共享全局嵌入空间中运行的现有连续时间图模型不同，TSNN通过动态局部帧建模节点特定且不断演化的交互语义。该模型通过高效的低秩Householder乘积参数化每个节点的帧，在帧更新下精确保留存储的隐藏状态，并使用几何残差解码器，该解码器基于传输距离锚定预测，同时学习残差校正。所有计算严格因果，仅使用事件前历史。我们证明了对称度归一化层拉普拉斯算子与对称归一化图拉普拉斯算子正交相似，而随机游走归一化形式在相应度度量下相似；TSNN使用的全激活、特征缩放扩散正是组合层Dirichlet能量上的度量梯度步，具有无度单调下降和非扩张保证。帧漂移仅线性扰动更新。在TGB v2链接预测和时序异质排行榜以及DGB基准套件上，TSNN在大多数基准上匹配或超越最强先前方法，在表现出强节点角色异质性的图上改进最大。消融实验证实了动态帧、正交传输和几何残差解码的独特优势。

英文摘要

We introduce Temporal Sheaf Neural Networks (TSNN), a temporal link prediction framework that equips each node with a time-varying orthogonal frame and compares node states only after explicit transport between local coordinate systems. In contrast to existing continuous-time graph models that operate in a shared global embedding space, TSNN models node-specific and evolving interaction semantics through dynamic local frames. The model parameterizes per-node frames via efficient low-rank Householder products, preserves stored hidden states exactly under frame updates, and uses a geometric-residual decoder that anchors predictions on transported distances while learning residual corrections. All computations are strictly causal and use only the pre-event history. We show that the symmetric degree-normalized sheaf Laplacian is orthogonally similar to the symmetric normalized graph Laplacian, with the random-walk normalized form similar in the corresponding degree metric; the full-active, feature-scaled diffusion used by TSNN is exactly a metric-gradient step on the combinatorial sheaf Dirichlet energy, with a degree-free monotone-descent and non-expansiveness guarantee. Frame drift perturbs updates only linearly. Across TGB v2 link-prediction and temporal-heterogeneous leaderboards, together with the DGB benchmark suite, TSNN matches or surpasses the strongest prior methods on most benchmarks, with the largest improvements on graphs exhibiting strong node-role heterogeneity. Ablations confirm the distinct benefit of dynamic frames, orthogonal transport, and geometric-residual decoding.

2606.10089 2026-06-10 cs.LG cs.AI 交叉投稿

A Theory on Flow Matching with Neural Networks

基于神经网络的流匹配理论

Yihan He, Qishuo Yin, Yuan Cao, Jianqing Fan, Han Liu

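AI summary: Establishes a theoretical foundation for flow matching with conditional velocity fields parameterized by neural networks: proves convergence of gradient descent for overparameterized two-layer ReLU networks, derives generalization bounds for the conditional velocity-field matching objective, and provides Wasserstein-distance guarantees for the generated samples.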

2606.10124 2026-06-10 cs.LG cs.AI 交叉投稿

FedSteer: Taming Extreme Gradient Staleness in Federated Learning with Corrective Projections and Caching

FedSteer: 通过校正投影和缓存驯服联邦学习中的极端梯度陈旧性

Haoran Zhang, Cainã Figueiredo Pereira, Marie Siew, Xutong Liu, Carlee Joe-Wong, Rachid El-Azouzi

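AI summary: Addresses gradient staleness caused by uneven client participation in federated learning with FedSteer, which builds a subspace from a cache of client gradients and corrects stale gradients through projection and a caching strategy, substantially improving training stability and accuracy.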

Comments UAI 2026

详情

AI中文摘要

联邦学习（FL）在客户端不持续参与训练轮次时，常遭受聚合方差的影响。虽然重用非活跃客户端的陈旧模型更新是减少这种方差的常见技术，但我们发现，在客户端参与偏斜的情况下，由此产生的更新陈旧性可能变得严重到足以破坏训练稳定性。为了解决这个问题，我们提出了FedSteer，一种新颖的方法，该方法从最近客户端梯度的缓存中构建一个梯度子空间，作为当前优化景观的低维表示。FedSteer将活跃客户端的真实梯度投影到这个子空间上，以找到一组最优坐标。对于非活跃客户端，FedSteer重用这些坐标，并结合由其他活跃客户端漂移的已演化的子空间。这个过程有效地将过时的梯度“引导”向当前的全局目标。此外，还辅以选择性缓存策略，识别代表性客户端子集以形成子空间，从而减少服务器内存。实验表明，FedSteer显著优于基线，在挑战性场景中防止性能崩溃，并在其他场景中实现超过7%的精度提升。

英文摘要

Federated learning (FL) is often subject to aggregation variance if clients do not consistently participate in training rounds. While reusing stale model updates from inactive clients is a common technique to reduce this variance, we find that with skewed client participation, the resulting update staleness can become severe enough to destabilize training. To remedy this, we propose FedSteer, a novel method that constructs a gradient subspace from a cache of recent client gradients to serve as a low-dimensional representation of the current optimization landscape. FedSteer projects an active client's true gradient onto this subspace to find a set of optimal coordinates. For an inactive client, FedSteer reuses these coordinates with the now-evolved subspace drifted by other active clients. This process effectively "steers" outdated gradients toward the current global objective. This is complemented by a selective caching strategy that identifies a representative client subset to form the subspace, reducing server memory. Experiments demonstrate that FedSteer significantly outperforms baselines, preventing performance collapse in challenging scenarios while delivering accuracy gains of over 7% in others.

2606.10184 2026-06-10 cs.LG cs.AI 交叉投稿

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

Dropout-GRPO: 用于连续潜在推理的变分随机性

Wooil Jung

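AI summary: Addresses the problem that GRPO advantages collapse to zero in continuous latent-reasoning models because trajectories are deterministic; introduces stochasticity through structured dropout so that GRPO can optimize a Bayesian model-average policy, improving the accuracy of a Coconut baseline on GSM8K.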

详情

AI中文摘要

组相对策略优化（GRPO）依赖于每组内$K$次rollout的多样性；否则，组平均优势$A^{(k)} = r^{(k)} - \mu_r$会坍缩为零。这对像Coconut这样的潜在推理模型构成了结构性挑战，该模型循环地馈送连续隐藏状态以替代离散的思维链token。由于给定参数和提示后潜在阶段本质上是确定性的，多次rollout会产生相同的轨迹，阻碍GRPO的进展。因此，将组相对强化学习应用于连续潜在推理被证明是困难的。为解决此问题，我们提出通过结构化dropout来获取必要的随机性。通过在给定rollout的所有潜在递归步骤中应用一个保持不变的单一伯努利掩码，我们生成必要的轨迹方差。这个共享掩码有效地将每个rollout视为来自参数变分分布的后验样本，使GRPO能够优化贝叶斯模型平均策略的期望奖励。我们为该方法提供了理论证明——包括无偏性、方差减少以及潜在梯度的良定义性——以及实证验证。在GSM8K上，dropout-GRPO将Coconut基线从$27.29\%$提升至$29.01\%$的pass@1，证明了GRPO学习在潜在推理模型中的可行性。我们的工作将此定位为一种实用且理论基础的潜在推理LLM后训练方法。

英文摘要

Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A^{(k)} = r^{(k)} - \mu_r$ collapses to zero. This presents a structural challenge for latent-reasoning models like Coconut, which feed continuous hidden states recurrently in place of discrete chain-of-thought tokens. Because the latent phase is inherently deterministic given the parameters and prompt, multiple rollouts produce identical trajectories, stalling GRPO's progress. Consequently, applying group-relative reinforcement learning to continuous latent reasoning has proven difficult. To address this, we propose sourcing the necessary stochasticity through structured dropout. By applying a single Bernoulli mask held constant across all latent recurrence steps for a given rollout, we generate essential trajectory variance. This shared mask effectively treats each rollout as a posterior sample from a variational distribution over parameters, allowing GRPO to optimize the expected reward of a Bayesian model-average policy. We provide both theoretical justification for this method -- including unbiasedness, variance reduction, and the well-definedness of the latent gradient -- and empirical validation. On GSM8K, dropout-GRPO improves a Coconut baseline from $27.29\%$ to $29.01\%$ pass@1, demonstrating the viability of GRPO learning for latent-reasoning models. Our work positions this as a practical, theoretically grounded approach for post-training latent-reasoning LLMs.
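
The collapse of the group-mean advantage, and the role of a per-rollout mask, can be seen in a toy sketch. The advantage formula follows the abstract; everything else (the scalar toy recurrence, the mask sampling, the function names, and the absence of any dropout rescaling) is an assumption made purely for illustration and is not the paper's model.

```python
import random

def group_advantages(rewards):
    """Group-relative advantage A_k = r_k - mean(r), as in the abstract."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

def sample_mask(width, keep_prob, rng):
    """One Bernoulli keep/drop mask, drawn once per rollout."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(width)]

def latent_rollout(h0, steps, mask):
    """Toy recurrence: the SAME mask is applied at every latent step."""
    h = list(h0)
    for _ in range(steps):
        h = [m * (0.5 * x + 0.1) for x, m in zip(h, mask)]
    return h

# Identical rollouts give identical rewards, so every advantage is zero.
identical = group_advantages([1.0, 1.0, 1.0, 1.0])
# Rollouts that differ give a usable learning signal.
diverse = group_advantages([1.0, 0.0, 1.0, 0.0])

# Each rollout gets its own mask, held fixed across its latent steps.
rng = random.Random(0)
masks = [sample_mask(width=8, keep_prob=0.5, rng=rng) for _ in range(4)]
finals = [latent_rollout([1.0] * 8, steps=3, mask=m) for m in masks]
```

Given a mask, the rollout is still deterministic; the variation across a group comes only from drawing a different mask per rollout, which is the source of trajectory variance the abstract describes.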

2606.10196 2026-06-10 cs.CV cs.AI 交叉投稿

Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning

Fisher引导的自适应微调渐进参数选择

Ghodsiyeh Rostami, Po-Han Chen, Mahdi S. Hosseini

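AI summary: Proposes FisherAdapTune, a framework that progressively selects parameter groups by tracking the temporal drift of their Fisher geometry, freezing stabilized parameters to reduce the generalization error bound while preserving adaptation dynamics; it improves in-distribution performance and zero-shot transfer on segmentation tasks.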

详情

AI中文摘要

参数高效微调（PEFT）旨在使用少量可训练参数子集来适应预训练模型，然而，现有大多数方法从固定的架构启发式中选择该子集，而不是使用动态的、任务感知的标准。我们引入了\textbf{FisherAdapTune}，一个Fisher引导的自适应微调框架，通过追踪参数组Fisher几何的时间漂移来渐进选择参数组。从微调的PAC-Bayesian视角出发，我们将泛化误差界分解为Fisher加权更新成本，并表明曲率贡献已稳定的参数组可以被冻结，以减少误差界而不中断剩余的适应动态。FisherAdapTune使用连续Fisher分布之间的尺度不变Jensen-Shannon距离来制定这一标准，从而产生一个自适应的活动参数集。我们在下游分割任务上评估了我们的方法，结果表明FisherAdapTune在多种设置下提升了分布内性能和零样本迁移，验证了Fisher结构漂移是高效、任务感知适应的有用信号。我们公开发布了代码（\href{this https URL}{code}），以促进我们提出方法的进一步应用。

英文摘要

Parameter-efficient fine-tuning (PEFT) aims to adapt pretrained models with a small trainable parameter subset; however, most existing methods choose this subset from fixed architectural heuristics rather than using dynamic, task-aware criteria. We introduce FisherAdapTune, a Fisher-guided Adaptive Fine-Tuning framework that progressively selects parameter groups by tracking the temporal drift of their Fisher geometry. Starting from a PAC-Bayesian view of fine-tuning, we decompose the generalization error bound into Fisher-weighted update costs and show that parameter groups whose curvature contribution has stabilized can be frozen to reduce the error bound without interrupting the remaining adaptation dynamics. FisherAdapTune formulates this criterion with a scale-invariant Jensen-Shannon distance between consecutive Fisher distributions, yielding an adaptive active parameter set. We evaluate our approach on a downstream segmentation task, and results show that FisherAdapTune improves in-distribution performance and zero-shot transfer in multiple settings, validating that Fisher structural drift is a useful signal for efficient, task-aware adaptation. We release our code publicly (https://github.com/AtlasAnalyticsLab/FisherAdapTune) to enable further application of our proposed approach.
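
A Jensen-Shannon distance between consecutive Fisher distributions, as named in this abstract, can be sketched as follows. How the paper turns Fisher information into a distribution is not stated here, so normalizing per-entry Fisher values to sum to one is an assumption, as are the function names and the toy numbers.

```python
import math

def normalize(values):
    """Turn non-negative scores into a probability distribution."""
    total = sum(values)
    return [v / total for v in values]

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0) / math.log(2.0))

# Per-entry Fisher estimates for one parameter group at two checkpoints.
fisher_prev = [4.0, 1.0, 1.0, 2.0]
fisher_now = [8.0, 2.0, 2.0, 4.0]    # same shape, doubled magnitude
fisher_shift = [1.0, 1.0, 4.0, 2.0]  # mass moved between entries

stable = js_distance(normalize(fisher_prev), normalize(fisher_now))
drifting = js_distance(normalize(fisher_prev), normalize(fisher_shift))
```

Normalizing first is what makes the measure scale-invariant in this sketch: a group whose Fisher values merely grow or shrink uniformly registers zero drift and would be a candidate for freezing, while a group whose mass is still moving between entries stays active.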

2606.10216 2026-06-10 cs.LG cs.AI cross-list

A Source Domain is All You Need: Source-Only Cross-OS Transfer Learning for APT Anomaly Detection via Semantic Alignment and Optimal Transport

Sidahmed Benabderrahmane, Petko Valtchev, James Cheney, Talal Rahwan

AI summary: To address the lack of target-domain labels in cross-OS APT detection, proposes a source-only anomaly-scoring framework based on optimal transport that ranks anomalies under zero target supervision using semantic abstraction and three deviation channels.

Abstract

Advanced Persistent Threats (APTs) are stealthy, multi-stage cyberattacks whose detection is difficult due to scarce labeled traces, severe class imbalance, and the challenge of generating realistic malicious behavior. These challenges are amplified in cross-operating-system (cross-OS) settings, where a detector trained on one source platform must be deployed on an unlabeled target platform without access to target-domain labels. We study this source-only cross-OS APT detection problem using system-level provenance traces and propose a transport-based framework for ranking anomalous target processes under zero target supervision. The framework abstracts process behavior into structured natural-language descriptions, embeds them using pretrained language models, and constructs a source-normal reference for target scoring. It combines three evidence channels: semantic deviation from source-normal prototypes, structural deviation captured by graph autoencoding, and geometric deviation measured through Optimal Transport (OT). The main contribution is an OT-based barycentric anomaly score that projects target embeddings onto the source-normal manifold and quantifies residual transport mismatch. We further introduce entropy-weighted, angle-aware, and density-aware OT variants to capture uncertainty, directional drift, and sparse-support behavior. Evaluation on DARPA Transparent Computing data spanning Linux, Windows, BSD, and Android, across two APT scenarios and twelve cross-OS transfer pairs, shows that the proposed framework improves ROC-AUC and nDCG over source-only anomaly-detection baselines. The results demonstrate that source-only provenance modeling, combined with semantic abstraction and OT-based anomaly scoring, can support practical cross-platform APT detection without target-domain supervision.

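2606.10238 2026-06-10 q-bio.NC cs.AI cross-list

Hyperbolic Neural Population Geometry Benefits Computation

Dennis Wu, Yi-Chun Hung, Braden Yuille, James E. Fitzgerald, Han Liu

Affiliations * University of California, Berkeley

AI summary: Proposes a theoretical framework in which hippocampal population activity induces hyperbolic geometry, proves that the modern Hopfield network update rule computes the minimum mean-squared-error estimate, and introduces a new associative memory model in hyperbolic space whose capacity significantly exceeds that of existing models.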

Comments Accepted at ICML 2026, 37 pages, 5 figures

2606.10250 2026-06-10 cs.LG cs.AI cross-list

Multi-Level Analyzation of Imbalance to Resolve Non-IID-Ness in Federated Learning

Haengbok Chung, Jae Sung Lee

Affiliations * Interdisciplinary Program in Artificial Intelligence, Seoul National University, Republic of Korea; Department of Nuclear Medicine, Seoul National University College of Medicine, Republic of Korea; Brightonix Imaging Inc., Seoul, Republic of Korea

AI summary: Proposes the FedBB algorithm, which uses the PNB loss function to address inter-case (within-class) and inter-class imbalance in local training and CBR reweighting to address inter-client imbalance, outperforming existing methods on X-ray and natural image datasets.

Comments 27 pages, 5 figures, 13 tables. Accepted for publication in Neurocomputing (2025). Author Accepted Manuscript

DOI: 10.1016/j.neucom.2025.129528
Journal ref: Neurocomputing, Volume 626, 2025, Article 129528

Abstract

Class imbalance is a common problem in deep learning that severely degrades performance. In federated learning (FL), it is a critical factor contributing to non-independent and identically distributed (non-IID) data. Building on several previous attempts, we define and analyze imbalance issues in FL at three levels: inter-case, inter-class, and inter-client. Inter-case imbalance is the imbalance within every single class; inter-class imbalance compares the number of samples between different classes; inter-client imbalance represents the differing skewness of local data between clients. Based on these concepts, we propose FedBB, which consists of two main components: (1) The Positive Negative Balanced (PNB) loss function addresses the inter-case and inter-class imbalances in local training, enhancing generalization on highly skewed local client datasets. It optimizes both multi-label and multi-class classification by assigning higher weights to minority cases or classes. (2) Client Balanced Reweighting (CBR) reweights clients based on inter-client imbalance during model aggregation, giving greater weight to models trained on less skewed datasets. Various experiments on X-ray and natural image datasets demonstrate that FedBB outperforms other algorithms in both performance and efficiency. Additionally, it requires only limited statistical information, which is beneficial for privacy protection. Through ablation studies, we show that PNB loss and CBR independently contribute to performance. As FedBB aims to build a global model that accurately classifies all classes, it can serve as a baseline for both generic and personalized FL.

2606.10294 2026-06-10 cs.LG cs.AI cs.AR cs.NE physics.comp-ph cross-list

LLM-Guided Neural Architecture Search for Robust Co-Design of Physical Neural Networks

Tyler King, Timothee Leleu

Affiliations * University of California, Berkeley; DeepMind

AI summary: Proposes the UH-NAS framework, which uses large language models as evolutionary operators to co-optimize task accuracy and inference energy, enabling fair comparison across hardware platforms and discovering more robust architectures on optical MZI hardware.

Abstract

Deploying neural networks on unconventional hardware demands architectures that co-optimize task accuracy and platform-specific constraints such as energy cost, physical non-idealities, and numerical precision. Existing neural architecture search (NAS) methods are typically tailored to a single hardware family, limiting cross-platform comparison and generalization. We introduce Unconventional Hardware Neural Architecture Search (UH-NAS), a hardware-agnostic, LLM-guided NAS framework that integrates language models as evolutionary operators to co-optimize accuracy and inference energy. By exposing hardware as a swappable backend with per-platform energy models, physical constraints, and non-ideality simulators, UH-NAS enables fair system-level comparisons across various backends without modifying the search algorithm. Tested on optical MZI hardware, UH-NAS discovers more diverse, robust architectures than conventional baselines while outperforming existing LLM-to-NAS approaches. Additional ablations on architecture robustness under non-idealities and the role of system prompts highlight the importance of architecture-hardware co-design for emerging computing platforms.

2606.10321 2026-06-10 cs.LG cs.AI cs.RO math.OC cross-list

Baseline-Free Policy Optimization for Neural Combinatorial Optimization

Carlos S. Sepúlveda, Gonzalo A. Ruz

AI summary: Proposes using the GRPO algorithm to eliminate baseline dependence in neural combinatorial optimization and avoid training collapse, reaching performance close to POMO on TSP and CVRP.

Abstract

Neural combinatorial optimization (NCO) trains autoregressive policies to solve routing problems. The standard training algorithm, REINFORCE with a rollout baseline, requires maintaining and periodically updating a frozen copy of the policy for variance reduction. This baseline introduces a structural vulnerability: on harder instances, a poor baseline produces noisy gradient estimates that can destabilize training. We evaluate Group Relative Policy Optimization (GRPO), an algorithm from large language model alignment that eliminates the baseline entirely by normalizing advantages within groups of sampled trajectories. In a controlled comparison of five RL algorithms on TSP and CVRP benchmarks within the RL4CO framework, we find that: (i) GRPO avoids the training collapse observed with REINFORCE on TSP-100, where performance degrades from cost 9.8 to 52.1 immediately after the warmup phase and does not recover under extended training; (ii) at matched gradient updates, GRPO achieves solution quality within 2% of POMO, a strong AM-based multi-start baseline, while requiring no external baseline; and (iii) P3O, a pairwise preference algorithm also from the alignment literature, is competitive on TSP but shows higher variability on CVRP. These results identify GRPO as a promising baseline-free alternative for NCO, particularly in settings where baseline-dependent training becomes fragile.
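The core idea of replacing a learned or rollout baseline with within-group normalization can be sketched in a few lines. This is a minimal illustration rather than the RL4CO implementation: it assumes the reward is the negative tour cost, that a group consists of several tours sampled for the same problem instance, and that `eps` is a small hypothetical stabilizer for groups whose tours all have the same cost.

```python
import statistics

def group_relative_advantages(costs, eps=1e-8):
    """Normalize rewards within one group of sampled tours for a single instance.

    costs: tour lengths of the sampled trajectories (lower is better).
    Returns one advantage per trajectory; no external baseline is needed.
    """
    rewards = [-c for c in costs]          # minimizing cost = maximizing reward
    mean = statistics.fmean(rewards)       # the group mean plays the baseline's role
    std = statistics.pstdev(rewards)       # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]
```

Each advantage would then weight the log-probability gradient of its own trajectory. Since the reference point is recomputed from the current samples, there is no frozen policy copy that can become stale on harder instances.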

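2606.10384 2026-06-10 nlin.AO cs.AI physics.comp-ph cross-list

Towards Critical Branching Mechanism in Recurrent Neural Networks

Feixiang Ren, Ling Feng

Affiliations * Department of Physics, National University of Singapore; Institute of High Performance Computing, Agency for Science, Technology and Research (A*STAR)

AI summary: By analyzing the hidden-state dynamics of LSTM networks, finds that small networks exhibit near-critical dynamics at the optimal training stage (scale-free avalanche statistics and a branching parameter close to 1), whereas large networks remain subcritical, and introduces a mixed branching-process framework to explain the coexistence of subcritical branching and 1/f noise.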

2606.10385 2026-06-10 cs.LG cs.AI cross-list

Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation

Wenhao Zhang

Affiliations * South China University of Technology

AI summary: Proposes Anchored Residual On-Policy Distillation (AR-OPD), which establishes a locally compatible anchor with a partially privileged teacher and injects a controlled residual, addressing the local unreachability caused by hindsight bias in privileged on-policy distillation, with an average gain of 2.3 points on reasoning tasks.

Comments 17 pages, 8 figures. Project page: https://vanhowe.github.io/AR-OPD/

Abstract

On-policy distillation (OPD) has demonstrated strong empirical gains in enhancing complex reasoning in LLMs by aligning a student model with a teacher's predictive distribution over the student's own trajectories. An emerging variant, Privileged OPD, further strengthens this paradigm by employing a self-teacher model augmented with privileged information, such as oracle traces, to mitigate teacher-student capacity gaps while providing dense, answer-directed supervision. However, current methods treat privileged information as a monolithic imitation target, failing to disentangle locally reachable reasoning steps from future-conditioned oracle signals. Consequently, the student is encouraged to match a hindsight-biased distribution that often falls outside its local predictive support. This reachability mismatch incentivizes the student model to skip valid intermediate reasoning in favor of locally unsupported shortcuts. To resolve this, we introduce Anchored Residual On-Policy Distillation (AR-OPD), a dual-view framework that disentangles privileged supervision. Rather than enforcing strict full-view imitation, AR-OPD establishes a locally compatible anchor using a partially privileged teacher, isolating and injecting oracle foresight as a controlled residual to provide destination-directed guidance. Across diverse reasoning tasks, AR-OPD outperforms full privileged OPD by 2.3 points and SFT by 7.9 points. Crucially, this anchored residual mechanism reduces hindsight leakage by 21.7% and mitigates late-stage drift, yielding up to a 7.2-point advantage on challenging long-horizon trajectories exceeding 768 tokens.

2606.10406 2026-06-10 cs.LG cs.AI cross-list

Mixture-of-Experts-Enhanced Federated Learning for Spatiotemporal Prediction

Zhehao Dai, Xiao Han, Zhaolin Deng, Zijian Zhang, Xiangyu Zhao, Guojiang Shen, Xiangjie Kong

Affiliations * Zhejiang University of Technology, Zhejiang Key Laboratory of Visual Information Intelligent Processing; Jilin University; City University of Hong Kong

AI summary: Proposes the MoE-FedTP framework, which uses lightweight Mixture-of-Experts networks and a gating mechanism to perform cross-city spatiotemporal prediction while preserving privacy, effectively mitigating data scarcity and heterogeneity.

Abstract

Traffic prediction is fundamental to intelligent transportation systems and urban computing, yet many cities continue to suffer from traffic data scarcity due to limited sensor deployment and uneven urban development. Cross-city knowledge transfer has thus attracted increasing attention, enabling data-rich cities to assist data-scarce ones. However, centralized approaches raise privacy concerns, while existing federated methods struggle with pronounced spatiotemporal heterogeneity across cities. To address these challenges, we propose MoE-FedTP, a personalized federated cross-city spatiotemporal prediction framework based on lightweight Mixture-of-Experts (MoE) networks. MoE-FedTP first employs spatiotemporal neural networks to extract features from both source and target cities, then introduces a set of expert networks derived from different source cities through partial parameter sharing. A gating mechanism dynamically fuses the experts to capture diverse traffic dynamics, achieving fine-grained modeling of urban heterogeneity while preserving privacy. Experiments on four real-world traffic datasets show that MoE-FedTP consistently outperforms state-of-the-art cross-city and federated learning baselines, demonstrating its effectiveness in enhancing prediction accuracy for data-scarce cities.

2606.10530 2026-06-10 cs.LG cs.AI cross-list

Machine Learning Methods for Studying Latent Neural Activity Dynamics

Shufeng Kong, Fumei Deng, Xinyi Dong, Caihua Liu, Weiwei Chen, Yingheng Wang, Daniel Cao, Azahara Oliva, Antonio Fernandez-Ruiz, Carla Gomes

Affiliations * School of Software Engineering, Sun Yat-sen University; Department of Computer Science, Cornell University; Department of Neurobiology and Behavior, Cornell University; Department of Ecology and Evolutionary Biology, Cornell University; School of Computer Science and Artificial Intelligence, Foshan University

AI summary: Surveys latent variable models from state-space models to deep generative models, covering single-region dynamics, multi-region communication, and behavior-aligned modeling, and discusses large-scale neural foundation models and future challenges.

Comments Accepted by IJCAI 2026 survey track

Abstract

Recent developments in brain recording are driving a demand for machine learning tools capable of decoding the latent structure of large populations of neurons. In this paper, we provide a comprehensive survey that outlines the trajectory of Latent Variable Models (LVMs) from early state-space models to more recent deep generative models. We organize the literature into three closely related domains: (1) Single-Region Latent Dynamics, which ranges from linear dynamical systems to more complex dynamics represented by Recurrent Neural Networks (RNNs) and Neural Ordinary Differential Equations (ODEs); (2) Multi-Region Communication, which employs probabilistic as well as subspace methods to study how information is transferred across different brain areas, considering synaptic propagation delays and network connectivity; and (3) Behavior-Aligned Modeling, which seeks to disentangle neural activity related to task performance from other internal states via supervised or contrastive learning. This survey also includes large-scale neural foundation models, such as Transformers and diffusion models, that rely on large-scale pre-training for optimal performance across subjects. Finally, we discuss benchmarks, evaluation criteria, and open challenges, such as the ability to identify causal links or directionality of communication, to facilitate future research on bridging interpretable brain dynamics with reliable neural decoding.

2606.10531 2026-06-10 cs.CL cs.AI cross-list

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

Haoyu Wang, Xingyu Yu, Haiyan Zhao, Fengxiang Wang, Xu Han

Affiliations * University of Science and Technology of China

AI summary: Proposes LC-QAT, a vector-quantization-aware training framework for 2-bit weight-only quantization that avoids discrete codebook lookup through a differentiable linear mapping, providing high-quality PTQ initialization and end-to-end optimization, and surpassing existing methods with only 0.1% to 10% of the training data.

Comments Accepted by ICML 2026

Abstract

Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1% to 10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.

2606.10543 2026-06-10 cs.LG cs.AI cs.ET q-bio.QM cross-list

Flexible Flows for Biological Sequence Design

Yogesh Verma, Dani Korpela, Harri Lähdesmäki, Vikas Garg

Affiliations * Aalto University; YaiYai Ltd; OpenProtein.AI

AI summary: Proposes a structured coupling, a latent edit-based rate parameterization, and a latent classifier-free guidance mechanism, enabling variable-length sequence generation and fine-grained control and achieving state-of-the-art performance across a range of biological sequence tasks.

Abstract

Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. Discrete Flow Matching (DFM) offers a generative framework over such spaces, but existing approaches rely on biologically uninformative couplings and offer limited flexibility for variable-length sequence generation and fine-grained control. We propose a structured coupling that encodes domain-specific preferences among sequence elements, biasing the source distribution toward plausible regions without modifying the flow objective or training procedure. Building on this, we introduce a latent edit-based rate parameterization that models variable-length generation via edit operations conditioned on a shared global latent, akin to a latent variable model, while remaining tractable. We further introduce a latent classifier-free guidance mechanism that steers generation coherently in continuous latent space, along with Dirichlet-prior temperature scaling for test-time control over edit operations. Our method achieves state-of-the-art performance across diverse biological sequence tasks, including density estimation, unconditional and conditional DNA sequence generation, and peptide sequence generation.

2606.10580 2026-06-10 cs.LG cs.AI cross-list

Convergence of Monte Carlo Optimistic Policy Iteration: Beyond Uniform State-Action Updates

Octave Oliviers, Glenn Vinnicombe

Affiliations * Department of Engineering, University of Cambridge

AI summary: Proves that initial-visit Monte Carlo optimistic policy iteration converges to optimality when updates are uniform over the actions within each state, relaxing the traditional requirement of uniform state-action updates; the proof relies on mean-field dynamics and a lock-in argument.

Abstract

The asymptotic behaviour of Monte Carlo optimistic policy iteration (MC-O-PI) is a long-standing open question. When the model of the environment is unknown, as is common in practice, the only known condition that guarantees convergence to optimality is impractical. In its canonical form, this condition requires that the episodes used for policy evaluation be initialised uniformly over the entire state-action space. This paper strictly relaxes that requirement. Specifically, we prove that initial-visit MC-O-PI converges to optimality even when updates are uniform only over the actions within each state. This allows episodes to start in different states at arbitrary frequencies, which is a realistic implementation when the state space is large or unknown but the action space in each state is manageable. The proof departs from the classical analysis of Tsitsiklis, whose central commutativity argument no longer applies when states are updated at different frequencies. Instead, we first show that the mean-field dynamics of MC-O-PI generate monotonically improving policies when updates are uniform over the actions in each state, and then prove that noise cannot consistently prevent this improvement by extending the lock-in argument of the combined stability-ODE method. This approach suggests a new way to study optimistic policy-iteration algorithms in general.

2606.10587 2026-06-10 cs.LG cs.AI cross-list

Towards Diverse Scientific Hypothesis Search with Large Language Models

Haorui Wang, Parshin Shojaee, Kazem Meidani, Kunyang Sun, José Miguel Hernández-Lobato, Teresa Head-Gordon, Jiajun He, Chandan K. Reddy, Chao Zhang, Yuanqi Du

Affiliations * University of California, Berkeley

AI summary: To address diversity collapse in scientific hypothesis search, proposes a multi-temperature evolutionary framework based on parallel tempering that improves hypothesis quality and diversity under a fixed validation budget.

Comments ICML 2026

Abstract

Large language models (LLMs) are increasingly used to accelerate scientific discovery, most recently in advanced tasks such as generating valid scientific hypotheses. Yet in many discovery settings, the goal is not to identify a single best hypothesis: validation can be noisy and expensive, so scientists benefit from a set of high-quality alternative hypotheses that hedges against downstream uncertainty. Nevertheless, commonly used evolutionary search recipes tend to prioritize optimization over exploration in hypothesis generation, and the resulting selection pressure during the search process leads to diversity collapse. Motivated by these limitations, we formulate hypothesis search as a sampling problem, where the objective is to efficiently produce diverse, high-quality hypotheses under a fixed validation budget. Building on this perspective, we propose an evolutionary framework inspired by the classical parallel tempering algorithm that searches hypotheses at multiple temperature levels and enables principled information exchange across temperatures to improve exploration without disrupting convergence. Across domains including molecular discovery, equation discovery, and algorithm discovery, our approach consistently improves both hypothesis quality and diversity under the same validation budget, and produces candidates that remain robust under more expensive downstream computational validations.
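For readers unfamiliar with the classical parallel tempering algorithm the abstract refers to, the sketch below shows its standard replica-exchange step. It is not the paper's method: it assumes each hypothesis has a scalar "energy" (for instance, the negative of a validation score, so lower is better) and uses the textbook acceptance rule min(1, exp((1/T_i - 1/T_j)(E_i - E_j))). The function names are illustrative.

```python
import math
import random

def swap_probability(energy_i, energy_j, temp_i, temp_j):
    """Classical replica-exchange acceptance probability for levels i and j."""
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)

def exchange_adjacent(energies, temps, rng):
    """One sweep of swap attempts between neighbouring temperature levels.

    energies[k] is the energy of the hypothesis currently held at temps[k].
    Returns `holder`, where holder[k] is the original index of the hypothesis
    that ends up at temperature level k after the sweep.
    """
    holder = list(range(len(temps)))
    energies = list(energies)  # work on a copy so the caller's list is untouched
    for k in range(len(temps) - 1):
        p = swap_probability(energies[k], energies[k + 1], temps[k], temps[k + 1])
        if rng.random() < p:
            holder[k], holder[k + 1] = holder[k + 1], holder[k]
            energies[k], energies[k + 1] = energies[k + 1], energies[k]
    return holder

# Example: the cold level holds the worse hypothesis, so the swap is always accepted.
print(exchange_adjacent([5.0, 1.0], [1.0, 2.0], random.Random(0)))
```

Low temperatures exploit good candidates while high temperatures keep exploring; the exchange step lets a promising hypothesis found at a high temperature migrate to a colder level without forcing every level to converge on it.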

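2606.10596 2026-06-10 cs.LG cs.AI cs.SY eess.SY cross-list

Embedding Hybrid Systems into Continuous Latent Vector Fields

Sangli Teng, Hang Liu, Koushil Sreenath

AI summary: Proves that an n-dimensional hybrid system can be embedded into a continuous vector field in m-dimensional Euclidean space when m > 2n, and on that basis proposes a latent neural ODE method that accurately recovers hybrid-system flows from time-series data, outperforming existing methods.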

Comments Accepted to ICML 2026

2606.10650 2026-06-10 cs.CL cs.AI cross-list

Dynamic Linear Attention

Xin Wang, Hui Shen, Boyuan Zheng, Xueshen Liu, Minkyoung Cho, Zhongwei Wan, Zesen Zhao, Zhuoqing Mao, Shen Yan, Mi Zhang

Affiliations * The Ohio State University; University of Michigan; ByteDance Seed

AI summary: Proposes the DLA framework, which uses information-aware dynamic state merging and capacity-bounded memory modeling to address the error accumulation caused by fixed merging policies in multi-state linear attention, outperforming existing methods on 16 datasets.

Comments Accepted by ICML 2026

Abstract

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate it on 16 datasets across three categories. Experimental results show that DLA outperforms state-of-the-art methods.

2606.10802 2026-06-10 cs.LG cs.AI cross-list

Boosting ECG Classification Performance by Pre-training with Synthesized Data

Naoki Nonaka, Jun Seita

Affiliations * Advanced Data Science Project, RIKEN Information R&D and Strategy Headquarters

AI summary: Proposes a medical-knowledge-based Gaussian-composition synthesis algorithm that generates single-lead II ECG data for pre-training deep neural networks, with an architecture-averaged gain of up to 33.2% across four abnormality classes and a particularly strong effect when the real dataset is small.

Abstract

Deep Neural Networks (DNNs) typically require extensive datasets for effective training. In the medical domain, acquiring large-scale data is often challenging due to privacy concerns and the rarity of certain diseases. To address this data scarcity, we investigate the efficacy of training DNN models using synthetic data, generated based on domain-specific medical knowledge. Specifically, we develop a knowledge-driven Gaussian-composition synthesis algorithm for single-lead II ECGs, in which each heartbeat is represented by Gaussian-shaped P, Q, R, S, and T wave components. Using this simulator, we generate synthetic data for four abnormal electrocardiogram (ECG) classes: atrial fibrillation (AF), atrial flutter (AFLT), premature ventricular complex (PVC), and Wolff-Parkinson-White Syndrome (WPW). We evaluate the utility of this synthetic data by conducting abnormal ECG classification using ten different DNN architectures. Our results demonstrate that synthetic-to-real training improves classification performance for three of the four target abnormalities, with the largest architecture-averaged gain of 33.2% observed for AFLT. Further analysis reveals that the performance enhancement from synthetic data is more pronounced with smaller real-world datasets. These findings suggest that domain-knowledge-based synthetic ECGs can serve as a useful pre-training resource, particularly in scenarios where real-world data are limited or difficult to obtain.
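The Gaussian-composition idea can be made concrete with a toy generator for a single heartbeat. This is only a sketch of the general technique: the amplitudes, centres, and widths in `WAVES`, the sampling rate, and the beat duration are illustrative assumptions, not the parameters used by the authors, and no abnormality-specific rules from the paper are reproduced here.

```python
import math

# (amplitude in mV, centre in s, width in s) per wave.
# Illustrative values only; they are assumptions, not taken from the paper.
WAVES = {
    "P": (0.15, 0.20, 0.025),
    "Q": (-0.10, 0.35, 0.010),
    "R": (1.00, 0.38, 0.012),
    "S": (-0.20, 0.41, 0.010),
    "T": (0.30, 0.60, 0.040),
}

def heartbeat(waves=WAVES, duration=0.8, fs=250):
    """Synthesize one beat as a sum of Gaussian-shaped wave components.

    duration is in seconds and fs is the sampling rate in Hz.
    Returns a list of samples in mV.
    """
    n = int(round(duration * fs))
    signal = []
    for i in range(n):
        t = i / fs
        value = sum(
            amp * math.exp(-((t - centre) ** 2) / (2.0 * width ** 2))
            for amp, centre, width in waves.values()
        )
        signal.append(value)
    return signal

# A variant with the P wave removed, to show how components can be edited.
no_p_wave = {name: params for name, params in WAVES.items() if name != "P"}
beat, beat_without_p = heartbeat(), heartbeat(no_p_wave)
```

Editing, dropping, or jittering individual components is what makes such a simulator knowledge-driven: each class-specific rule can be expressed as a change to a small set of interpretable parameters.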

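2606.10835 2026-06-10 cs.LG cs.AI cross-list

Geometrically Averaged Hard Target Updates for Linear Q-Learning

Donghwan Lee

Affiliations * School of Electrical Engineering, KAIST

AI summary: Proposes periodic target updates based on a λ-geometric weighted average for linear Q-learning, analyzes their stability through a switching-system model, and connects single-period updates with projected Q-value iteration.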

2606.10890 2026-06-10 cs.LG cs.AI cross-list

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine

Affiliations * UC Berkeley; Physical Intelligence

AI summary: Proposes the QGF algorithm, which pre-trains a reference flow policy and a value function and uses value gradients at test time to guide the policy toward high-value actions without additional policy learning; it outperforms existing test-time methods on offline RL benchmarks and is competitive with training-time methods.

Abstract

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.


2606.11189 2026-06-10 cs.LG cs.AI cs.CL cross-list

A Geometric Interpretation of Activation Steering via Angle-Norm Decomposition

Georgii Aparin, Tatiana Gaintseva

Affiliations: Huawei Noah’s Ark Lab; Queen Mary University of London

AI summary: Using controlled experiments that separate angular and radial components, the paper finds that concepts are encoded mainly in angular structure, but that norm is important for the stability and downstream effects of steering; it recommends parameterizing activation steering by interpretable angular and radial components.

Abstract

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
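
To make the two geometric effects concrete, the sketch below applies a plain additive intervention to a toy hidden state and reports how much the angle to the concept direction changed and how much the norm changed. It then rescales the result back to the original norm, which keeps the angular change and removes the radial one. The function names and the 2-D vectors are hypothetical; this is not the authors' code or their exact parameterization.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def additive_steer(h, d, alpha):
    """Additive steering: h' = h + alpha * d (couples angle and norm changes)."""
    return [a + alpha * b for a, b in zip(h, d)]


def rescale(h, target_norm):
    """Rescale h to a target norm without changing its direction."""
    s = target_norm / norm(h)
    return [s * a for a in h]


def decompose(h, h_new, d):
    """Return (angle moved toward d in radians, norm ratio ||h_new|| / ||h||)."""
    before = math.acos(max(-1.0, min(1.0, cosine(h, d))))
    after = math.acos(max(-1.0, min(1.0, cosine(h_new, d))))
    return before - after, norm(h_new) / norm(h)


h = [3.0, 0.0]           # toy hidden state, orthogonal to the concept
d = [0.0, 1.0]           # toy unit concept direction
h_add = additive_steer(h, d, 4.0)
h_sph = rescale(h_add, norm(h))   # same direction as h_add, original norm

print("additive:       ", decompose(h, h_add, d))
print("norm-preserving:", decompose(h, h_sph, d))
```

Both interventions rotate the state by the same angle, but only the additive one also inflates the norm (here by a factor of 5/3), which is the kind of coupling a single additive coefficient cannot express separately.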


2404.09101 2026-06-10 cs.LG cs.AI cs.NA math.NA stat.ML replacement

Mixtures of Neural Operators Reduce Active Complexity in Operator Learning

Anastasis Kratsios, Takashi Furuya, Jose Antonio Lara Benitez, Matti Lassas, Maarten de Hoop

Affiliations: McMaster University and Vector Institute; Shimane University; Rice University; University of Helsinki

AI summary: By comparing routed mixtures of neural operators (MoNOs) with a fixed single-neural-operator construction, shows that the active expert of a MoNO has better depth, width, and rank scaling, and that for Lipschitz targets these quantities are bounded by O(ε^{-1}).

Abstract

Operator-learning systems are not governed solely by total parameter count; for one query, the relevant bottleneck can be the model that must be loaded and evaluated. We study this distinction for classical neural operators on compact Sobolev subsets through a constructive comparison between routed mixtures of neural operators (MoNOs) and a fixed single-neural-operator construction. The comparison concerns expert-active complexity relative to that baseline, with total stored size and routing search accounted separately. A MoNO routes each input function through a tree to one expert. Our main theorem shows that every scalar uniformly continuous nonlinear operator with bounded output Sobolev radius on the approximation set admits a MoNO approximation whose active expert has smaller depth, width, and rank scaling than the analyzed single-neural-operator construction; for Lipschitz targets these expert quantities are bounded by $\mathcal{O}(\varepsilon^{-1})$. The theorem turns localization into an operator-level accounting of active expert size, routing depth, and number of experts. We also prove a quantitative universal approximation theorem for the underlying neural-operator architecture, with explicit dependence on compact-set diameter and modulus of continuity.


2502.07027 2026-06-10 cs.LG cs.AI replacement

Representational Alignment with Chemical Induced Fit for Molecular Relational Learning

Peiliang Zhang, Jingling Yuan, Qing Xie, Yongjun Zhu, Chao Che, Lin Li

Affiliations: Wuhan University of Technology; Yonsei University; Hubei Key Laboratory of Transportation Internet of Things; Dalian University

AI summary: Proposes ReAlignFit, which dynamically aligns substructure representations by introducing an inductive bias based on chemical induced fit and uses a subgraph information bottleneck to refine substructure pairs with high chemical functional compatibility, improving the stability of molecular relational learning on data shifted in chemical space.

Comments Accepted by SIGKDD2026 AI for Science Track

Abstract

Molecular Relational Learning (MRL) is widely applied in natural sciences to predict relationships between molecular pairs by extracting structural features. The representational similarity between substructure pairs determines the functional compatibility of molecular binding sites. Nevertheless, aligning substructure representations by attention mechanisms lacks guidance from chemical knowledge, resulting in unstable model performance in chemical space (\textit{e.g.}, functional group, scaffold) shifted data. With theoretical justification, we propose the \textbf{Re}presentational \textbf{Align}ment with Chemical Induced \textbf{Fit} (ReAlignFit) to enhance the stability of MRL. ReAlignFit dynamically aligns substructure representation in MRL by introducing chemical Induced Fit-based inductive bias. In the induction process, we design the Bias Correction Function based on substructure edge reconstruction to align representations between substructure pairs by simulating chemical conformational changes (dynamic combination of substructures). ReAlignFit further integrates the Subgraph Information Bottleneck during fit process to refine and optimize substructure pairs exhibiting high chemical functional compatibility, leveraging them to generate molecular embeddings. Experimental results on nine datasets demonstrate that ReAlignFit outperforms state-of-the-art models in two tasks and significantly enhances model's stability in both rule-shifted and scaffold-shifted data distributions.


2502.09928 2026-06-10 cs.CV cs.AI replacement

Deep Tree Tensor Networks

Chang Nie

Affiliation: Nanjing University of Science and Technology

AI summary: Proposes the Deep Tree Tensor Network (DTTN), which captures exponential-order feature interactions through multilinear operations and outperforms existing methods on multiple benchmarks.

Abstract

Originating in quantum physics, tensor networks (TNs) have been widely adopted as exponential machines and parametric decomposers for recognition tasks. Typical TN models, such as Matrix Product States (MPS), have not yet achieved successful application in natural image recognition. When employed, they primarily serve to compress parameters within pre-existing networks, thereby losing their distinctive capability to capture exponential-order feature interactions. This paper introduces a novel architecture named \textit{\textbf{D}eep \textbf{T}ree \textbf{T}ensor \textbf{N}etwork} (DTTN), which captures $2^L$-order multiplicative interactions across features through multilinear operations, while essentially unfolding into a \emph{tree}-like TN topology with the parameter-sharing property. DTTN is stacked with multiple antisymmetric interaction modules (AIMs), and this design facilitates efficient implementation. Furthermore, our theoretical analysis demonstrates the equivalence between quantum-inspired TN models and polynomial/multilinear networks under specific conditions. We posit that the DTTN could catalyze more interpretable research within this field. The proposed model is evaluated across multiple benchmarks and domains, demonstrating superior performance compared to both peer methods and state-of-the-art architectures. Our code is publicly available at https://github.com/NieCha/deep_tree_tensor_network.


2503.24007 2026-06-10 cs.LG cs.AI replacement

CITRAS: Covariate-Informed Transformer for Time Series Forecasting

Yosuke Yamaguchi, Issei Suemitsu, Wenpeng Wei

Affiliation: Research & Development Group, Hitachi, Ltd.

AI summary: Proposes CITRAS, a decoder-only Transformer that uses KV Shift and Attention Score Smoothing to flexibly incorporate the future portion of known covariates and capture both local and global cross-variate dependencies, improving forecasting accuracy.

DOI: 10.1109/ACCESS.2026.3695717
Journal ref: IEEE Access, vol. 14, pp. 77983-77998, 2026

Abstract

In time series forecasting, covariates represent external factors that influence target variables. Some covariates are observable only in the past (observed covariates, such as recorded weather data), while others are known in advance (known covariates, such as calendar events or discount schedules). Although covariates have the potential to enhance forecasting performance, most deep learning-based forecasting models struggle to address the length discrepancy between variables caused by the future portion of known covariates and fail to leverage them flexibly. Moreover, capturing dependencies between target variables and covariates is non-trivial, as models must accurately reflect the local impact of covariates while simultaneously modeling global cross-variate dependencies. To address these challenges, we propose CITRAS, a decoder-only Transformer that flexibly integrates multiple target variables, observed covariates, and known covariates. While preserving strong autoregressive modeling capabilities, CITRAS introduces two novel mechanisms within patch-wise cross-variate attention: Key-Value (KV) Shift and Attention Score Smoothing. KV Shift seamlessly incorporates the future portion of known covariates into the forecasting process by aligning them with target variables based on their concurrent dependencies. Attention Score Smoothing refines locally accurate patch-wise cross-variate dependencies into global variate-level dependencies by smoothing the historical attention scores. Experimentally, CITRAS demonstrates strong performance across a wide range of real-world datasets in both covariate-informed and multivariate settings, showcasing its versatile ability to leverage cross-variate and cross-time dependencies for improved forecasting accuracy.


2507.14725 2026-06-10 cs.LG cs.AI replacement

GRID: Scaling Task-Agnostic Inference in Continual Prompt Tuning

Anushka Tiwari, Sayantan Pal, Rohini K. Srihari, Kaiyi Ji

Affiliations: State University of New York at Buffalo; Department of Computer Science and Engineering; Institute for Artificial Intelligence and Data Science

AI summary: Proposes GRID, a framework that combines output-space-aware decoding with gradient-guided prompt selection to address the performance degradation and scalability limits of task-agnostic inference in continual learning; on long-sequence and negative-transfer benchmarks it improves backward transfer and reduces prompt memory.

Abstract

Prompt-based continual learning (CL) offers a parameter-efficient way to adapt large language models (LLMs) across task sequences. However, existing methods often rely on task-aware inference and maintain an expanding set of task-specific prompts, leading to (1) severe performance degradation on earlier tasks when task identifiers are unavailable for prompt selection at inference time, and (2) limited scalability as task sequence grows. We propose GRID, a unified framework designed to address these challenges. GRID incorporates an output-space-aware decoding mechanism that enhances backward transfer by leveraging representative inputs and automatic label semantic normalization, alongside a gradient-guided prompt selection strategy that compresses less informative prompts into a single aggregated representation for scalable, memory-efficient continual learning. Extensive experiments on long-sequence and negative-transfer benchmarks show that GRID improves backward transfer, achieves competitive forward transfer, and substantially reduces prompt memory across encoder-decoder and decoder-only architectures, including T5, Qwen, and LLaMA. Source code is available at https://github.com/AnushkaTi/GRID.


2509.21925 2026-06-10 cs.LG cs.AI replacement

Generation Properties of Stochastic Interpolation under Finite Training Set

Yunchen Li, Shaohui Lin, Zhou Yu

AI summary: Studies the theoretical properties of stochastic-interpolation generative models under a finite training set, derives closed-form expressions for the optimal velocity field and score function, characterizes the behavior of the deterministic and stochastic generative processes, and defines underfitting and overfitting.

Comments We found proof errors affecting key theorems and wish to avoid misleading readers. We have submitted a substantially revised new paper, arXiv:2606.08554, retaining only two old theorems and adding five new ones

Abstract

This paper investigates the theoretical behavior of generative models under finite training populations. Within the stochastic interpolation generative framework, we derive closed-form expressions for the optimal velocity field and score function when only a finite number of training samples are available. We demonstrate that, under some regularity conditions, the deterministic generative process exactly recovers the training samples, while the stochastic generative process manifests as training samples with added Gaussian noise. Beyond the idealized setting, we consider model estimation errors and introduce formal definitions of underfitting and overfitting specific to generative models. Our theoretical analysis reveals that, in the presence of estimation errors, the stochastic generation process effectively produces convex combinations of training samples corrupted by a mixture of uniform and Gaussian noise. Experiments on generation tasks and downstream tasks such as classification support our theory.


2510.06473 2026-06-10 physics.soc-ph cs.AI cs.SI replacement

Deep Generative Model for Human Mobility Behavior

Ye Hong, Yatao Zhang, Konrad Schindler, Martin Raubal

Affiliations: Institute of Cartography and Geoinformation, ETH Zurich; Department of Human Geography, Lund University; Future Resilient Systems, Singapore-ETH Centre, ETH Zurich; Photogrammetry and Remote Sensing, ETH Zurich; Department of Geography, University College London

AI summary: Proposes MobilityGen, a diffusion-based generative framework that simulates multi-attribute activity-travel sequences, reproduces key patterns such as scaling laws and activity time allocation, and supports analyses of access to urban space and social exposure.

Abstract

Understanding and modeling human mobility is central to challenges in transport planning, sustainable urban design, and public health. Despite decades of effort, simulating individual mobility remains challenging because of its complex, context-dependent, and exploratory nature. Here, building on the activity-based view of daily mobility, we propose MobilityGen, a diffusion-based generative framework for simulating multi-attribute activity-travel sequences over days to weeks at large spatial scales. By linking behavioral attributes with environmental context, MobilityGen reproduces key patterns such as scaling laws for location visits, activity time allocation, and the coupled evolution of travel mode and destination choices. It reflects spatio-temporal variability and generates diverse and plausible mobility patterns consistent with the built environment. Beyond standard validation, MobilityGen enables analyses that have been difficult with earlier models, including how access to urban space varies across travel modes and how co-presence dynamics shape social exposure and segregation. Together, these results support an integrated, data-driven basis for fine-grained studies of human mobility behavior and its societal implications.


2511.01927 2026-06-10 cs.LG cs.AI cs.NA math.NA replacement

SCOPE: Sequential Causal Optimization of Process Interventions

Jakob De Moor, Hans Weytjens, Johannes De Smedt, Jochen De Weerdt

Affiliations: Research Centre for Information Systems Engineering (LIRIS), KU Leuven, Leuven, Belgium; School of Computation, Information and Technology, Technical University of Munich (TUM), Munich, Germany

AI summary: Proposes SCOPE, which uses backward induction and causal learners to work directly from observational data and optimize the KPI of sequential interventions in business processes, outperforming existing methods.

Abstract

Prescriptive Process Monitoring (PresPM) recommends interventions during running business processes to optimize key performance indicators (KPIs). In realistic settings, interventions are rarely isolated: organizations need to align sequences of interventions to jointly steer the outcome of a case. Existing PresPM approaches only partially address this challenge. Many focus on a single intervention decision, while others treat multiple interventions independently, ignoring how they interact over time. Methods that do address these dependencies depend either on simulation or data augmentation to approximate the process to train a Reinforcement Learning (RL) agent, which may create a reality gap and introduce bias. We introduce SCOPE (Sequential Causal Optimization of Process Interventions), a PresPM approach that learns aligned sequential intervention recommendations. SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. By leveraging causal learners, our method can utilize observational data directly, unlike methods that require constructing process approximations for RL. Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing the KPI. The novel semi-synthetic setup, based on a real-life event log, is provided as a reusable benchmark for future work on sequential PresPM.


2601.08379 2026-06-10 cs.LG cs.AI cs.CV replacement

MMD Guidance: Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance

Matina Mahdizadeh Sani, Nima Jamali, Mohammad Jalali, Farzan Farnia

Affiliation: University of California, Berkeley

AI summary: Proposes MMD Guidance, a training-free method that guides diffusion-model sampling with gradients of the Maximum Mean Discrepancy, aligning generated samples with a reference data distribution without retraining.

Abstract

Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose \emph{MMD Guidance}, a training-free mechanism that augments the reverse diffusion process with gradients of the \textit{Maximum Mean Discrepancy (MMD)} between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity. The project code is available at github.com/matinamehdizadeh/MMD-Guidance.
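
The quantity behind the guidance signal can be sketched without a diffusion model. The example below computes the biased (V-statistic) estimate of squared MMD under an RBF kernel, its analytic gradient with respect to the generated samples, and one plain gradient step that moves the samples toward a reference set. The kernel choice, bandwidth `sigma`, step size, and all function names are assumptions for illustration; the paper applies such gradients inside the reverse diffusion process, which is not reproduced here.

```python
import math


def rbf(a, b, sigma):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = sum((p - q) ** 2 for p, q in zip(a, b))
    return math.exp(-sq / (2.0 * sigma ** 2))


def mmd2(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    n, m = len(xs), len(ys)
    kxx = sum(rbf(a, b, sigma) for a in xs for b in xs) / (n * n)
    kyy = sum(rbf(a, b, sigma) for a in ys for b in ys) / (m * m)
    kxy = sum(rbf(a, b, sigma) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy


def mmd2_grad(xs, ys, sigma=1.0):
    """Gradient of mmd2 with respect to each generated sample in xs."""
    n, m = len(xs), len(ys)
    grads = []
    for xi in xs:
        g = [0.0] * len(xi)
        for xj in xs:  # within-set term, weight 2 / n^2
            k = rbf(xi, xj, sigma)
            for d in range(len(xi)):
                g[d] += (2.0 / (n * n)) * (-k * (xi[d] - xj[d]) / sigma ** 2)
        for yj in ys:  # cross term, weight -2 / (n m)
            k = rbf(xi, yj, sigma)
            for d in range(len(xi)):
                g[d] -= (2.0 / (n * m)) * (-k * (xi[d] - yj[d]) / sigma ** 2)
        grads.append(g)
    return grads


def guidance_step(xs, ys, step=0.5, sigma=1.0):
    """One gradient-descent step on mmd2, standing in for a guidance update."""
    grads = mmd2_grad(xs, ys, sigma)
    return [[x - step * g for x, g in zip(xi, gi)] for xi, gi in zip(xs, grads)]


generated = [[0.0], [0.2]]   # toy "generated" samples
reference = [[1.0], [1.2]]   # toy reference samples
moved = guidance_step(generated, reference)
print(mmd2(generated, reference), "->", mmd2(moved, reference))
```

In this toy run the step pulls both generated points toward the reference set, so the printed MMD estimate drops.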


2601.13994 2026-06-10 cs.DC cs.AI replacement

torch-sla: Differentiable Sparse Linear Algebra with Adjoint Solvers and Sparse Tensor Parallelism for PyTorch

Mingyuan Chi, Shizheng Wen

Affiliation: PyTorch

AI summary: Presents torch-sla, a library that provides differentiable sparse linear algebra solvers through a unified autograd-aware interface and O(1)-graph adjoint differentiation, with support for multiple backends and distributed multi-GPU execution.

Abstract

Differentiable sparse linear algebra is foundational for scientific machine learning, yet PyTorch lacks a unified library for it: torch.sparse provides only low-level kernels and a non-differentiable, CPU-only spsolve, and torch.linalg is dense-only. We present torch-sla, an open-source library that fills this gap. It exposes a single autograd-aware API for direct, iterative, nonlinear, and eigenvalue solvers across five interchangeable backends -- SciPy and Eigen on CPU, cuDSS, CuPy, and a PyTorch-native iterative solver on GPU -- with automatic dispatch by device and problem size. The library further supports batched solves over shared or distinct sparsity patterns and distributed multi-GPU execution via domain decomposition with halo exchange. These capabilities are made scalable by an O(1)-graph adjoint differentiation framework and an autograd-compatible distributed halo-exchange layer. The library is available at https://www.torchsla.com/.


2602.02788 2026-06-10 cs.LG cs.AI physics.comp-ph replacement

Structure-Preserving Learning Improves Geometry Generalization in Neural PDEs

Benjamin D. Shaffer, Shawn Koohy, Brooks Kinch, M. Ani Hsieh, Nathaniel Trask

Affiliation: University of California, Berkeley

AI summary: Proposes General-Geometry Neural Whitney Forms (Geo-NeW), which jointly learns a differential operator and compatible reduced finite element spaces and uses Finite Element Exterior Calculus to preserve physical conservation laws exactly, markedly improving generalization to unseen geometries.

Abstract

We aim to develop physics foundation models for science and engineering that provide real-time solutions to Partial Differential Equations (PDEs) which preserve structure and accuracy under adaptation to unseen geometries. To this end, we introduce General-Geometry Neural Whitney Forms (Geo-NeW): a data-driven finite element method. We jointly learn a differential operator and compatible reduced finite element spaces defined on the underlying geometry. The resulting model is solved to generate predictions, while exactly preserving physical conservation laws through Finite Element Exterior Calculus. Geometry enters the model as a discretized mesh both through a transformer-based encoding and as the basis for the learned finite element spaces. This explicitly connects the underlying geometry and imposed boundary conditions to the solution, providing a powerful inductive bias for learning neural PDEs, which we demonstrate improves generalization to unseen domains. We provide a novel parameterization of the constitutive model ensuring the existence and uniqueness of the solution. Our approach demonstrates state-of-the-art performance on several steady-state PDE benchmarks, and provides a significant improvement over conventional baselines on out-of-distribution geometries.


2602.03164 2026-06-10 cs.LG cs.AI replacement

MedFeat: Model-Aware and Explainability-Driven LLM Feature Engineering for Clinical Tabular Prediction

Zizheng Zhang, Yiming Li, Justin Xu, Jinyu Wang, Rui Wang, Lei Song, Jiang Bian, David W Eyre, Jingjing Fu

Affiliations: Microsoft Research; University of Oxford

AI summary: Proposes MedFeat, a framework that uses model-awareness and feature-importance signals to iteratively guide an LLM toward targeted features, with an average improvement of more than 10% in clinical tabular prediction.

Abstract

In clinical tabular prediction, classical machine learning models with feature engineering often outperform neural methods. LLMs are increasingly used to automate this process, acting as domain experts that propose diverse feature transformations to boost downstream performance. However, existing LLM-based methods decouple feature generation from the downstream model: the LLM receives no signal about which features currently drive predictions or where the model's representational capacity falls short, so proposals are neither targeted to promising regions of the feature space nor tailored to the learner's inductive bias. This shortcoming is amplified in healthcare data, which simultaneously exhibits class imbalance, heterogeneous feature spaces, and strict interpretability requirements. In this paper, we propose MedFeat, the first feature engineering framework inspired by the workflow of machine learning practitioners, leveraging model-awareness and feature importance signals to iteratively guide feature discovery for clinical tabular learning. We evaluate MedFeat on a broad range of challenging real-world clinical tasks and show that it statistically significantly outperforms state-of-the-art baselines, with an average improvement of more than 10% over the baseline across models with distinct inductive biases.


2604.15414 2026-06-10 cs.LG cs.AI cs.NE replacement

Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning

Lute Lillo, Nick Cheney

Affiliation: Department of Computer Science, University of Vermont

AI summary: Proposes TeLAPA, a continual reinforcement learning framework that maintains archives of behaviorally diverse policies and a shared latent space to form skill-aligned policy neighborhoods, addressing the loss of plasticity that single-policy preservation cannot, and improving task learning, recovery, and performance retention.

Abstract

Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on \emph{single-model preservation}, committing to one evolving policy as the main reusable solution across tasks. Even when a previously successful policy is retained, it may no longer provide a reliable starting point for rapid adaptation after interference, reflecting a form of \emph{loss of plasticity} that single-policy preservation cannot address. Inspired by quality-diversity methods, we introduce \textsc{TeLAPA} (Transfer-Enabled Latent-Aligned Policy Archives), a continual RL framework that organizes behaviorally diverse policy neighborhoods into per-task archives and maintains a shared latent space so that archived policies remain comparable and reusable under non-stationary drift. This perspective shifts continual RL from retaining isolated solutions to maintaining \emph{skill-aligned neighborhoods} with competent and behaviorally related policies that support future relearning. In our MiniGrid CL setting, \textsc{TeLAPA} learns more tasks successfully, recovers competence faster on revisited tasks after interference, and retains higher performance across a sequence of tasks. Our analyses show that source-optimal policies are often not transfer-optimal, even within a local competent neighborhood, and that effective reuse depends on retaining and selecting among multiple nearby alternatives rather than collapsing them to one representative. Together, these results reframe continual RL around reusable and competent policy neighborhoods, providing a route beyond single-model preservation toward more plastic lifelong agents.

2605.08171 2026-06-10 cs.LG cs.AI replacement

Communication Dynamics Neural Networks: FFT-Diagonalized Layers for Improved Hessian Conditioning at Reduced Parameter Count

Lurong Pan

Affiliation: University of California, Berkeley

AI summary: Proposes CDLinear, a block-circulant linear layer whose Hessian is diagonalized by the FFT; it maintains high accuracy with 3.8x fewer parameters and comes with an explicit condition-number analysis.

Comments 17 pages, 5 figures. Includes NumPy implementation, gradient checks, MNIST experiments, and reference PyTorch CD-Transformer implementation

Abstract

Communication Dynamics Neural Networks (CDNNs) apply the circulant-spectral machinery of the Communication Dynamics framework to neural-network layer design. We introduce CDLinear, a block-circulant linear layer with block size B = 2l + 1 that uses 1/B the parameters of a dense layer with the same input and output dimensions. The construction gives an explicit Fourier-domain diagnostic for optimization: for mean-squared loss, the weight Hessian is diagonalized by the discrete Fourier transform, with eigenvalues determined directly by the Fourier spectrum of the input blocks. Under input pre-whitening, the population Hessian condition number is exactly 1, and the empirical condition number is bounded by 1 + O(sqrt(B/N)) for N samples. We implement CDLinear in pure NumPy with hand-derived backward passes and verify gradients by finite differences. On the 8x8 MNIST digits benchmark, across three random seeds, a CDLinear MLP with B = 4 reaches 97.50% +/- 0.23% test accuracy using 2,380 parameters, compared with 98.15% +/- 0.47% for a dense baseline using 8,970 parameters. This gives a 3.8x parameter reduction at a 0.65% accuracy cost. The CD-MLP's mean Hessian condition number is 1.9e4, about 310x smaller than the dense baseline's 5.9e6. We position CDLinear as a special case of structured matrix neural-network layers, with the main contributions being a closed-form Hessian-spectrum diagnostic, a principled discrete sequence of block multiplicities, and an explicit conditioning analysis. We also release a reference PyTorch implementation integrating CDLinear into a DeepSeek-V3-style mixture-of-experts transformer for future large-scale benchmarks.
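
The property this abstract relies on, that a circulant block acts as a pointwise product in the Fourier domain, can be checked in a few lines. What follows is a minimal pure-Python sketch, not the paper's NumPy implementation: the helper names (`dft`, `idft`, `circulant_matvec`, `circulant_matvec_fft`) and the sample vectors are illustrative assumptions, and the naive O(n^2) transform stands in for a real FFT.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def idft(X):
    """Inverse of dft."""
    n = len(X)
    return [sum(X[j] * cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)) / n
            for k in range(n)]


def circulant_matvec(c, x):
    """Dense product C @ x, where C[i][j] = c[(i - j) % n] (c is the first column)."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def circulant_matvec_fft(c, x):
    """Same product, computed as a pointwise multiplication of the two spectra."""
    C, X = dft(c), dft(x)
    return [v.real for v in idft([a * b for a, b in zip(C, X)])]


# A 5x5 circulant block is fully described by 5 numbers instead of 25.
c = [1.0, 2.0, 0.0, -1.0, 0.5]
x = [0.3, -1.2, 2.0, 0.7, 1.1]
dense = circulant_matvec(c, x)
spectral = circulant_matvec_fft(c, x)
assert all(abs(a - b) < 1e-9 for a, b in zip(dense, spectral))
```

Because each B x B circulant block is determined by its first column, it stores B values rather than B^2, which is where the 1/B parameter ratio quoted in the abstract comes from.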

2605.27286 2026-06-10 cs.LG cs.AI replacement

Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling

Yiding Liu, Yifan Hu, Hongjie Xia, Peiyuan Liu, Hongzhou Chen, Xilin Dai, Zewei Dong, Jiang-Ming Yang

Affiliation: Ant Group

AI summary: To address the limits of existing time series foundation models in semantic alignment and relational expressivity, proposes Falcon-X, which maps variates into a unified latent prototype space, aligns heterogeneous variates with a Unified Prototype Diff-Attention mechanism, and performs cross-variate interaction in the shared space via Latent Entity Attention, enabling zero-shot structural transfer and excellent forecasting performance on the GIFT-Eval and fev-bench benchmarks.

Abstract

Time series foundation models (TSFMs) are transforming the forecasting paradigm through large-scale cross-domain pretraining. However, most existing TSFMs remain univariate, and recent efforts to enable cross-variate modeling still operate directly within the raw variate space. This design introduces fundamental limitations in semantic alignment and relational expressivity. Specifically, raw-space group mixing lacks a dedicated mechanism to align heterogeneous physical quantities, while standard non-negative attention fails to capture the complex synergistic and antagonistic interactions ubiquitous in real-world systems. To address these challenges, we propose Falcon-X, which decouples variates from the raw space and maps them into a unified latent prototype space. Falcon-X employs a Unified Prototype Diff-Attention mechanism that explicitly evaluates both positive and negative semantic affinities to align heterogeneous variates. Cross-variate interactions are then efficiently performed within this shared space via Latent Entity Attention, naturally facilitating zero-shot structural transfer. Finally, a Variate Reassembly Router robustly reconstructs variate-specific trajectories via a request-and-dispatch mechanism. Extensive evaluations on the GIFT-Eval and fev-bench benchmarks demonstrate that Falcon-X achieves excellent forecasting performance, offering a principled and scalable paradigm for complex multivariate environments. Falcon-X is publicly released to support future research.

2605.28057 2026-06-10 cs.LG cs.AI replacement

On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective

Zhi Zhou, Ming Yang, Shi-Yu Tian, Kun-Yang Yu, Lan-Zhe Guo, Yu-Feng Li

AI summary: Proposes the first theoretical framework for the learnability of test-time adaptation under non-stationary streams, introducing (ε,δ)-recovery complexity and (ε,ρ)-TTA learnability, and reveals an adaptivity-information trade-off.

Comments Accepted by ICML 2026

Abstract

Test-time adaptation (TTA) aims to adapt models to maintain reliable performance on non-stationary test streams without requiring labeled data. Despite its empirical success, the learnability of TTA under non-stationary streams remains unexplored. A key challenge is the lack of a principled theoretical framework that simultaneously aligns with the TTA objective and captures both continuously evolving distribution shifts and intrinsic information constraints. To address this gap, we propose the first theoretical framework for studying the learnability of TTA and introduce $(ε,δ)$-Recovery Complexity and $(ε,ρ)$-TTA Learnability. Recovery complexity measures the post-shift time needed to maintain excess risk below a target level with high probability, and is further extended to TTA learnability, which measures the long-term reliability of TTA. Within this framework, we introduce a novel discrete surrogate for non-stationary test streams, enabling a unified and tractable analysis of both gradual and abrupt shifts. We derive order-wise matching lower and upper bounds on recovery complexity, revealing fundamental limits of TTA and an intrinsic adaptivity-information trade-off. These results provide unified learnability guarantees for TTA that complement regret-based analyses.

2605.30370 2026-06-10 cs.NE cs.AI cs.CV cs.LG replacement

Updating the standard neuron model in artificial neural networks

Raul Mohedano, Thomas Batard, Erik Velasco-Salido, Ramsses De Los Santos Mendoza, Jorge H. Martínez, Stacey Levine, Marcelo Bertalmío

Affiliations: Spanish National Research Council (CSIC); Center for Research in Mathematics (CIMAT); Universidad Autónoma de Madrid (UAM); National Science Foundation (NSF)

AI summary: Replaces the standard point-neuron model with a more realistic model of cortical cells, which, without adding parameters, improves the expressivity, robustness, and learning speed of artificial neural networks while reducing memorization and the amount of training data required.

Comments Acknowledgments included in the manuscript

2606.02378 2026-06-10 cs.LG cs.AI replacement

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures

Yongzhong Xu

Affiliation: University of California, Berkeley

AI summary: Tracks the formation of attention-head circuits across three 1B-class language models and finds that induction-circuit formation precedes attention-sink formation by 10-20x in tokens, and that circuit identification does not require the final model.

Comments 27 pages, 3 figures

Abstract

We track the developmental trajectory of attention-head circuit formation across three 1B-class language models spanning two architecture families (dense transformer, mixture-of-experts) and two pretraining corpora (The Pile, DCLM): Pythia 1B, OLMo 1B-0724-hf, and OLMoE 1B-7B-0924. At each of 10 log-spaced revisions per model -- 30 mechanistic-interpretability runs in total -- we apply a participation-ratio (PR) spectral signal and an all-head capability-specific selectivity screen to track induction, previous-token, and BOS-attractor heads as they emerge. Five findings. (F1) Layers 0 and 1 produce zero BOS-classified heads at every revision in every model: the L0/L1 zero-BOS floor is an architectural property, not a learned outcome. (F2) The whole-model BOS-attractor fraction follows three distinct emergence shapes -- a gradual ramp in Pythia 1B, a sharp phase transition in OLMo 1B (7% to 70% between adjacent checkpoints), and a gradual ramp in OLMoE 1B-7B. (F3) In DCLM models, induction-circuit formation precedes BOS-attractor formation by 10-20x in tokens; capability-circuit formation and attention-sink formation are two transitions, not one. (F4) The capability-specific screen converges to the final induction circuit within 0.3-2% of total training tokens -- circuit identification does not require the final model. (F5) For every final-checkpoint induction head sampled across all three models, per-head PR is elevated at or before the first revision at which that head crosses its capability-selectivity threshold. The results refine the induction-phase-transition framing: in 1B-class models trained on DCLM, the induction transition and the attention-sink transition are separated by an order of magnitude in tokens and have qualitatively different shapes.
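
For readers unfamiliar with the participation ratio, the sketch below uses the common definition PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues), which estimates how many dimensions of a spectrum carry appreciable weight. This is an assumption about the general form only: the abstract does not say which matrix the per-head spectrum is taken from, and the function name and sample spectra are illustrative.

```python
def participation_ratio(eigenvalues):
    """PR = (sum lambda_i)^2 / sum(lambda_i^2) for a non-negative spectrum.

    Ranges from 1 (all weight on one direction) to len(eigenvalues)
    (weight spread evenly over every direction).
    """
    total = sum(eigenvalues)
    sum_of_squares = sum(v * v for v in eigenvalues)
    if sum_of_squares == 0:
        raise ValueError("spectrum must contain a nonzero eigenvalue")
    return total * total / sum_of_squares


flat = participation_ratio([1.0, 1.0, 1.0, 1.0])    # evenly spread spectrum
peaked = participation_ratio([1.0, 0.0, 0.0, 0.0])  # single dominant direction
print(flat, peaked)
```

Under this definition a rise in PR means a head's spectrum is spreading over more directions, which is the kind of change finding F5 describes as appearing at or before the selectivity threshold is crossed.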

2606.06021 2026-06-10 cs.LG cs.AI replacement

OPRD: On-Policy Representation Distillation

Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen

Affiliations: Zhejiang University; Ant Group

AI summary: To address the sampling variance of output-space supervision in on-policy distillation and its neglect of intermediate hidden states, proposes OPRD, which aligns student and teacher representations in hidden-state space, eliminating sampling variance and providing richer per-layer structural information; it narrows the student-teacher gap on AIME and other benchmarks, trains 1.44x faster, and uses 54% less memory.

Abstract

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
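
To make the idea of hidden-state alignment concrete, here is a toy sketch of a layer-wise alignment loss. It is not the released OPRD code: the choice of mean squared error, the assumption that student and teacher share a hidden size (a learned projection would probably be needed otherwise), and the names `layer_alignment_loss` and `layer_pairs` are all illustrative assumptions.

```python
def layer_alignment_loss(student_layers, teacher_layers, layer_pairs):
    """Mean squared error between selected student and teacher hidden states.

    student_layers[l][t] is the hidden vector (a list of floats) at layer l
    and token position t of one rollout; teacher_layers is indexed the same
    way. layer_pairs lists (student_layer, teacher_layer) indices to align.
    """
    total, count = 0.0, 0
    for s_idx, t_idx in layer_pairs:
        for s_vec, t_vec in zip(student_layers[s_idx], teacher_layers[t_idx]):
            for s, t in zip(s_vec, t_vec):
                total += (s - t) ** 2
                count += 1
    return total / count


# One rollout with a single token, two student layers, three teacher layers.
student = [[[0.0, 1.0]], [[1.0, 1.0]]]
teacher = [[[0.0, 1.0]], [[9.0, 9.0]], [[1.0, 3.0]]]
loss = layer_alignment_loss(student, teacher, layer_pairs=[(0, 0), (1, 2)])
print(loss)
```

The loss is a deterministic function of the hidden states on a given rollout, with no sampling over the vocabulary, which is the intuition behind the abstract's claim about sampling variance.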

2606.07605 2026-06-10 cs.LG cs.AI replacement

SRT: Super-Resolution for Time Series via Disentangled Rectified Flow

Jufang Duan, Shenglong Xiao, Yuren Zhang

Affiliation: ByteDance

AI summary: Proposes SRT, a framework that reconstructs high-resolution time series from low-resolution inputs via disentangled rectified flow: it decomposes the input into trend and seasonal components, aligns resolutions with an implicit neural representation, and introduces a cross-resolution attention mechanism to generate details.

Comments Accepted to the International Conference on Learning Representations (ICLR) 2026

Journal ref: The Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. However, the acquisition of such data is often limited by cost and feasibility. This problem can be tackled by reconstructing high-resolution signals from low-resolution inputs based on specific priors, known as super-resolution. While extensively studied in computer vision, directly transferring image super-resolution techniques to time series is not trivial. To address this challenge at a fundamental level, we propose Super-Resolution for Time series (SRT), a novel framework that reconstructs temporal patterns lost in low-resolution inputs via disentangled rectified flow. SRT decomposes the input into trend and seasonal components, aligns them to the target resolution using an implicit neural representation, and leverages a novel cross-resolution attention mechanism to guide the generation of high-resolution details. We further introduce SRT-large, a scaled-up version with extensive pre-training, which enables strong zero-shot super-resolution capability. Extensive experiments on nine public datasets demonstrate that SRT and SRT-large consistently outperform existing methods across multiple scale factors, showing both robust performance and the effectiveness of each component in our architecture.

2606.10147 2026-06-10 cs.AI cs.CL cs.CV cs.SD new submission

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

AI summary: Studies the pathways and integration mechanisms of audio and visual information flow in audio-visual large language models (AVLLMs), identifies two routing patterns, sequential flow and parallel streams, and shows that tokens can be discarded once their information has been transferred, improving efficiency.

Comments 40 pages, 29 figures

Abstract

Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations: audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contributions flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to the LLM, with minimal impact on the model's prediction or even slight improvement; this generalizes across multiple tasks and datasets and enables more efficient inference. These findings hold across multiple models and scales (Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B), leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

2606.10298 2026-06-10 cs.AI cs.CL new submission

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

Runze Jiang, Taiqiang Wu, Yan Wang, Bingyu Zhu, Longtao Huang

Affiliations: Peking University; Alibaba Group; The University of Hong Kong

AI summary: Targeting knowledge conflicts between external context and parametric priors during LLM generation, proposes a conflict-aware paradigm that dynamically allocates weight between prior and context, and designs an adaptive mechanism to resolve the asymmetry across conflict regimes.

Comments 27 pages, 9 figures

Abstract

When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a \emph{context-aware} paradigm that unilaterally amplifies context over parametric priors, overwriting correct priors when the context is erroneous. We generalize this to the \textbf{conflict-aware} paradigm that dynamically allocates authority between prior and context based on conflict signals, rather than presupposing context trustworthiness. We show that the affine combination of prior and context logits yields a \textbf{power family} with an inherent \textbf{regime asymmetry}: extrapolation amplifies errors unboundedly when the prior is correct, interpolation under-corrects when the context is correct, and no static regime covers both. Existing contrastive decoding methods are instances of this family, mostly extrapolative. To evaluate both conflict directions, we propose TriState-Bench, a model-aware evaluation protocol that calibrates per-model prior knowledge to measure three conflict states: correction, resistance, and agreement. To resolve the asymmetry, we propose Adaptive Regime Routing (ARR), which routes between regimes at each step, lifting resistance EM from below 6 to 16--33 without sacrificing correction or agreement. Our code is available at https://github.com/keith-Jiang/conflict-aware-decoding.
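
The "power family" statement can be illustrated numerically. The sketch below assumes the parameterization z = (1 + alpha) * z_context - alpha * z_prior, which is the usual form in context-aware contrastive decoding but may differ from the paper's notation; the function names and the toy logits are illustrative. Under that assumption the softmax of the combined logits equals the normalized product p_context^(1 + alpha) * p_prior^(-alpha).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def combine_logits(prior_logits, context_logits, alpha):
    """Affine combination; alpha > 0 extrapolates away from the prior,
    -1 < alpha < 0 interpolates between prior and context."""
    return [(1 + alpha) * c - alpha * p
            for p, c in zip(prior_logits, context_logits)]


def power_family(prior_logits, context_logits, alpha):
    """Same distribution written as a normalized product of powers."""
    p = softmax(prior_logits)
    q = softmax(context_logits)
    weights = [qc ** (1 + alpha) * pp ** (-alpha) for pp, qc in zip(p, q)]
    s = sum(weights)
    return [w / s for w in weights]


prior = [2.0, 0.5, -1.0]     # toy logits without the retrieved context
context = [0.0, 1.5, 0.2]    # toy logits with the retrieved context
for alpha in (0.5, -0.5):
    via_logits = softmax(combine_logits(prior, context, alpha))
    via_powers = power_family(prior, context, alpha)
    assert all(abs(a - b) < 1e-12 for a, b in zip(via_logits, via_powers))
```

With alpha > 0 the prior enters with a negative exponent, so tokens the prior favors are pushed down without bound as alpha grows; this is the extrapolative behavior the abstract says amplifies errors when the prior is in fact correct.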

2606.10334 2026-06-10 cs.AI new submission

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

Haoyu Dong

Affiliation: Microsoft

AI summary: Proposes Visual-SDPO, a framework that uses rendered visual feedback as privileged context and improves the quality of code-generated visual artifacts through self-distillation and visually grounded code credit weighting, yielding substantial gains on chart, UI, and slide generation tasks.

Abstract

Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing programs that are executed by non-differentiable renderers, committing to code before observing the render. As a result, otherwise executable code often yields artifacts with visually salient defects, including overlapping elements, clipped text, broken alignment, low contrast, and overflow. We study visual-feedback self-distillation for code-generated visual artifacts. We propose Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher and distills this feedback into a coding student. To make supervision spatially targeted rather than uniform, we introduce Visual-Grounded Code Credit Weighting, which traces each detected defect back to the code statements responsible for the affected elements and amplifies the distillation signal on those statements. A sequence-level GRPO (Group Relative Policy Optimization) term complements the dense token-level objective by rewarding executable, visually high-quality rollouts, while failed executions remain learnable through the self-distillation path by passing execution errors as privileged context to the teacher. We instantiate Visual-SDPO for chart, web/UI, and slide generation with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code, UI-to-code, and slide-generation benchmarks (ChartMimic, Design2Code, and AeSlides), Visual-SDPO improves over the zero-shot base by more than 10 absolute points in the primary metric and over GRPO by at least 2.4 points, with fewer training steps and no added inference-time cost.

2606.09837 2026-06-10 cs.HC cs.AI cross-list

Self-EmoQ: Plutchik-Guided Value-based Planning to Drive Streaming Emotional TTS

Yue Zhao, Hongyan Li, Yong Chen, Luo Ji

Affiliation: Geely AI Lab

AI summary: Proposes an emotion-planning framework in which an LLM module trained with reinforcement learning determines the emotion before text generation to drive streaming TTS, using a hybrid reward informed by Plutchik's theory of emotions; experiments show it outperforms baselines on emotion determination and response quality.

Comments Accepted to ACL 2026 Findings

Abstract

Emotional interaction is increasingly crucial for conversational AI, yet current systems lack a self-emotion determination mechanism to drive streaming text-to-speech (TTS) synthesis. We propose an emotion-planning framework that determines the emotion before text generation, grounding the downstream emotional TTS in a streaming manner. The framework is implemented as a plug-and-play LLM module, initialized from pretrained LLMs and trained by reinforcement learning (RL) with emotions as actions. A hybrid reward combines imitation signals with theory-driven scoring based on Plutchik's wheel of emotions. In experiments on DailyDialog, EmoryNLP, IEMOCAP, and MELD, our method outperforms prompting and finetuning baselines on both emotion determination and response quality. Finally, we implement a complete streaming pipeline for real-time deployment, with the speech quality confirming the framework's emotional alignment, contextual coherence, and expressive fluency. Code, cases, and demos are available at https://sixingdeguo.github.io/EmoQ-page/.

2606.09871 2026-06-10 cs.CV cs.AI cs.LG cross-list

SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation

Hyunwoong Kim, Seongeun Lee, Hannah Yun, Junhyun Park, Jonggwon Park

AI summary: Proposes SD-GRPO, which decomposes long-form outputs into segments and computes per-segment advantages to address GRPO's coarse-grained credit assignment in vision-language tasks; experiments show it outperforms baselines across several long-form generation tasks.

Abstract

Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been applied to Multimodal LLMs and produced strong results. However, their coarse-grained holistic credit assignment from a single scalar advantage underfits vision-language (VL) tasks, where outputs are often long-form responses grounded in semantically rich images. To address this limitation, we exploit a structured signal that single-scalar formulations discard: the natural segmentation of long-form VL outputs. Concretely, we propose Segment-Decomposed GRPO (SD-GRPO), which z-normalizes verifiable per-segment rewards across the rollout group, yielding a vector of per-segment advantages in place of a single scalar. We evaluate SD-GRPO across three settings spanning controlled and real-world long-form VL generation, organized by increasing semantic entanglement across segments. On a controlled multi-panel dense-captioning task constructed from DOCCI, where segments are semantically independent, SD-GRPO consistently outperforms the GRPO baseline, with larger gains at higher segment counts. Extending to a controlled multi-chart long-form VQA task constructed from MultiChartQA, we show both theoretically and empirically that rollout-level rewards suffer from cross-segment credit misattribution that scales with output length. On a real-world scientific figure captioning task on the MMSci dataset, where subfigure captions share context across the figure, blending holistic and per-segment rewards further improves on both, suggesting per-segment normalization alone is insufficient when segments are semantically entangled. Finally, by integrating SD-GRPO into Dr. GRPO, we confirm that it can be applied to any GRPO framework with minimal implementation overhead to enhance long-form VL generation.
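
The difference between a single scalar advantage and a vector of per-segment advantages can be seen on a toy rollout group. The sketch below is an illustration under stated assumptions, not the authors' code: it uses the population standard deviation, a small `eps` for stability, sums segment rewards to form the rollout-level reward, and invents the binary reward table.

```python
from statistics import mean, pstdev


def z_normalize(values, eps=1e-8):
    """Subtract the group mean and divide by the group standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def grpo_advantages(rewards):
    """One scalar advantage per rollout, from the summed segment rewards."""
    return z_normalize([sum(r) for r in rewards])


def sd_grpo_advantages(rewards):
    """One advantage per (rollout, segment): each segment's reward is
    z-normalized across the rollout group independently."""
    n_segments = len(rewards[0])
    columns = [z_normalize([r[s] for r in rewards]) for s in range(n_segments)]
    return [[columns[s][g] for s in range(n_segments)]
            for g in range(len(rewards))]


# rewards[g][s]: verifiable reward of segment s in rollout g (4 rollouts, 3 segments)
rewards = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]
scalar = grpo_advantages(rewards)
per_segment = sd_grpo_advantages(rewards)
print(scalar)
print(per_segment)
```

The first three rollouts have the same total reward and therefore the same scalar advantage, even though each got a different segment wrong; the per-segment vector assigns a negative advantage only to the segment that failed.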

2606.09887 2026-06-10 cs.LG cs.AI cs.CL cross-list

SocraticPO: Policy Optimization via Interactive Guidance

Zirui Liu, Jie Ouyang, Qi Liu, Xianquan Wang, Jiayu Liu, Tingyue Pan, Qingchuan Li, Jing Sha, Zhenya Huang, Shijin Wang, Enhong Chen

Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd

AI summary: Proposes SocraticPO, a framework that adds natural-language guidance to reinforcement-learning rollouts to assist reasoning and uses reward decay to keep the model from relying on teacher help, improving performance on scientific reasoning tasks.

Abstract

Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose \textbf{SocraticPO} (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.
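
The reward-decay idea can be written down in a couple of lines. The abstract only states that answers reached after teacher intervention receive decayed rewards; the geometric schedule, the default factor of 0.5, and the function name below are assumptions made for illustration.

```python
def decayed_reward(correct, n_interventions, decay=0.5):
    """Outcome reward that shrinks with each teacher intervention.

    correct: whether the final answer is right.
    n_interventions: number of teacher hints used in the rollout.
    decay: hypothetical per-intervention factor in (0, 1).
    """
    if not correct:
        return 0.0
    return decay ** n_interventions


unaided = decayed_reward(True, 0)       # solved independently
assisted = decayed_reward(True, 1)      # solved after one hint
still_wrong = decayed_reward(False, 1)  # wrong even with a hint
print(unaided, assisted, still_wrong)
```

Because an assisted success always earns less than an unaided one, the expected-reward objective still prefers policies that solve the problem without help, which matches the abstract's description of preventing teacher help from becoming a free path to reward.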

2606.09900 2026-06-10 cs.CL cs.AI cs.IR cs.LG cross-list

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

Liuyin Wang

Affiliation: Independent Researcher

AI summary: Proposes Engram, a bi-temporal memory engine whose hybrid read path answers from a retrieved slice of about 9.6k tokens; on LongMemEval_S it reaches 83.6% accuracy, 10.4 points above the full-history baseline (79k tokens), with no errored questions.

Comments 14 pages, 4 figures, 3 tables. Code, reproducible harness, and raw per-question logs: https://github.com/ly-wang19/engram

Abstract

Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.
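
The "invalidate, never delete" rule and the as-of filter can be sketched with a tiny in-memory store. This is not Engram's actual schema or API: the class and field names, the integer timestamps, and the simplification that every (subject, predicate) pair holds a single value are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: int                       # event time: when it became true in the world
    recorded_at: int                      # transaction time: when the store learned it
    invalidated_at: Optional[int] = None  # transaction time: when the store stopped believing it
    superseded_by: Optional[int] = None   # index of the fact that replaced this one


class FactStore:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def assert_fact(self, subject, predicate, obj, valid_from, recorded_at) -> int:
        """Append a fact; contradicted facts are invalidated, not removed."""
        new_index = len(self.facts)
        for fact in self.facts:
            same_slot = (fact.subject, fact.predicate) == (subject, predicate)
            if same_slot and fact.invalidated_at is None and fact.obj != obj:
                fact.invalidated_at = recorded_at
                fact.superseded_by = new_index
        self.facts.append(Fact(subject, predicate, obj, valid_from, recorded_at))
        return new_index

    def as_of(self, subject, predicate, valid_time, known_at) -> List[str]:
        """Values true at valid_time according to what was believed at known_at."""
        return [
            f.obj for f in self.facts
            if (f.subject, f.predicate) == (subject, predicate)
            and f.valid_from <= valid_time
            and f.recorded_at <= known_at
            and (f.invalidated_at is None or f.invalidated_at > known_at)
        ]


store = FactStore()
store.assert_fact("alice", "lives_in", "Paris", valid_from=1, recorded_at=2)
store.assert_fact("alice", "lives_in", "Berlin", valid_from=5, recorded_at=6)
before = store.as_of("alice", "lives_in", valid_time=10, known_at=3)
after = store.as_of("alice", "lives_in", valid_time=10, known_at=7)
print(before, after)
```

Both facts remain in the store after the update; the older one simply carries an invalidation time and a pointer to its successor, which is what lets a point-in-time query reproduce an earlier belief state.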

URL PDF HTML ☆

赞 0 踩 0

2606.10099 2026-06-10 cs.LG cs.AI 交叉投稿

Unsupervised Style Representation Learning for AI-Text Detection via Paraphrase Inversion

无监督风格表示学习用于通过释义反转检测AI文本

Rafael Rivera Soto, Barry Chen, Nicholas Andrews

AI总结提出无监督风格编码器，通过重构人工文本与机器生成释义间的差异学习判别性风格特征，实现少样本和零样本AI文本检测，性能优于基线。

详情

AI中文摘要

大型语言模型（LLMs）的快速发展引发了对其滥用的担忧，如抄袭、错误信息和自动化影响操作，这促使需要鲁棒的检测器。最近的研究表明，写作风格的神经表示对于检测是有效的，并且至关重要的是，对于击败大多数现有检测器的对抗攻击具有鲁棒性。然而，当前的基于风格的检测器依赖作者标签进行训练，并且仅限于少样本推理进行检测，需要可能并不总是可用的分布内样本。我们通过训练风格编码器从机器生成的释义中重构人工文本，从而在没有作者标签的情况下学习判别性风格特征；在训练期间冻结语义编码器，使风格编码器偏向于仅捕获重构所需的非语义特征。我们通过两种检测策略评估学习到的表示：少样本检测器和基于DeepSVDD的零样本检测器。在基准测试中，我们的方法在少样本设置下匹配或优于所有基线，并且在零样本设置下，与完全监督的分类器在分布内测试数据上具有竞争力，同时对未见过的LLMs具有更好的泛化能力。除了检测之外，学习到的表示还能泛化到未见过的任务，在作者验证和细粒度风格区分上取得竞争性表现，尽管从未针对这两个目标进行训练。

英文摘要

The rapid development of large language models (LLMs) has raised concerns about misuse such as plagiarism, misinformation, and automated influence operations, motivating the need for robust detectors. Recent work has shown that neural representations of writing style are effective for detection and, crucially, robust to adversarial attacks that defeat most existing detectors. However, current style-based detectors rely on authorship labels for training, and are limited to few-shot inference for detection, requiring in-distribution samples that may not always be available. We learn discriminative style features without authorship labels by training a style encoder to reconstruct human-authored text from its machine-generated paraphrase; freezing a semantic encoder during training biases the style encoder to capture only the non-semantic features needed for reconstruction. We evaluate the learned representations via two detection strategies: a few-shot detector and a zero-shot DeepSVDD-based detector. Across benchmarks, our method matches or outperforms all baselines in the few-shot setting and, in the zero-shot regime, is competitive with fully supervised classifiers on in-distribution test data while generalizing better to unseen LLMs. Beyond detection, the learned representations generalize to unseen tasks, achieving competitive performance on authorship verification and fine-grained style discrimination despite never being trained on either objective.

URL PDF HTML ☆

赞 0 踩 0

2606.10113 2026-06-10 cs.CL cs.AI 交叉投稿

Emotion Profiling in LLM-Based Literary Translation: Systematic Shifts Across MT and Post-Editing

基于LLM的文学翻译中的情感特征：机器翻译与译后编辑的系统性转变

Antonio Castaldo, Johanna Monti, Sheila Castilho

AI总结研究LLM翻译的情感特征及译后编辑如何使其接近人类翻译，通过对比《Oryx and Crake》的LLM翻译、译后编辑版本和人类翻译，发现MT系统引入特定情感指纹，削弱作者声音。

2606.10126 2026-06-10 cs.CL cs.AI cs.CY cs.LG 交叉投稿

Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

帕累托引导的教师对齐用于公平个性化文本生成

Tunazzina Islam

发表机构 * Purdue University（普渡大学）

AI总结提出帕累托引导的教师对齐框架，通过修订候选生成、配对感知可行性门控、帕累托候选选择和偏好优化，在减少人口统计差异的同时保持个性化保真度，实验表明公平缓解效果依赖于目标且跨域迁移不一致。

详情

AI中文摘要

个性化说服性文本生成可以提高相关性和参与度，但人口统计条件也可能引入跨群体的不平等框架。我们将个性化生成中的公平缓解研究为一个受约束的多目标对齐问题：在保持个性化保真度的同时减少人口统计差异。我们提出一个帕累托引导的教师对齐框架，结合了基于修订的候选生成、配对感知可行性门控、帕累托风格的候选选择，以及通过监督微调和直接偏好优化的可选偏好优化。我们在气候变化和疫苗接种说服任务上评估该框架，使用一个受控的上下文丰富的人口统计网格（匹配性别和年龄对）以及一个统一的五审计评估套件，涵盖说服偏见、形式差异、情感框架差异、词汇关联差异和个性化保真度。在两个领域和跨族系迁移设置中，没有单一的对齐策略能同时主导所有目标。相反，方法占据了公平-个性化帕累托前沿的不同区域：一些方法实现更强的差异减少，而另一些则更好地保持个性化或人口统计稳定性。我们的结果表明，公平缓解效果依赖于目标，并在领域和模型族系间不一致地迁移，这促使在公平敏感的个性化生成中采用有界回归、多审计模型选择而非单指标优化。

英文摘要

Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained multi-objective alignment problem: reduce demographic disparities while preserving personalization fidelity. We propose a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and optional preference optimization through supervised fine-tuning and direct preference optimization. We evaluate the framework on climate change and vaccination persuasion tasks using a controlled context-rich demographic grid with matched gender and age pairs and a unified five-audit evaluation suite spanning persuasion bias, formality disparity, emotional framing disparity, lexical association disparity, and personalization fidelity. Across both domains and cross-family transfer settings, no single alignment strategy dominates all objectives simultaneously. Instead, methods occupy different regions of a fairness-personalization Pareto frontier: some achieve stronger disparity reductions, while others better preserve personalization or demographic stability. Our results show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains and model families, motivating bounded-regression, multi-audit model selection over single-metric optimization for fairness-sensitive personalized generation.

URL PDF HTML ☆

赞 0 踩 0

2606.10183 2026-06-10 cs.CV cs.AI cs.MM 交叉投稿

Making Time Editable in Video Diffusion Transformers

在视频扩散Transformer中实现时间可编辑性

Konstantin Kuklev, Viacheslav Vasilev, Alexander Kunitsyn, Andrei Ivaniuta, Denis Dimitrov

AI总结提出一种时间控制方法，通过轻量级时间模块扩展预训练DiT，实现运动速度和时序结构的编辑，无需重新设计骨干网络。

语音遇见ELF：用于语音识别和翻译的音频条件连续目标扩散

Xuanchen Li, Tianrui Wang, Yuheng Lu, Zikang Huang, Yu Jiang, Chenghan Lin, Chenrui Cui, Ziyang Ma, Xingyu Ma, Chunyu Qiang, Guochen Yu, Xie Chen, Longbiao Wang, Jianwu Dang

发表机构 * Tianjin University（天津大学）； Shanghai Jiao Tong University（上海交通大学）； Nankai University（南开大学）

AI总结提出ELF-S2T，一种基于预训练ELF骨干的音频条件连续目标生成模型，通过音频强制训练和分类器自由引导，在LibriSpeech和CoVoST2上实现竞争性ASR和S2TT性能，并揭示识别与翻译错误均源于连续潜空间中的近距离混淆。

详情

AI中文摘要

用于识别（ASR）和翻译（S2TT）的语音到文本（S2T）系统通常生成离散文本标记。相比之下，连续目标语言建模在连续空间中执行生成，但其在S2T中的潜力尚未被探索。为填补这一空白，我们提出了ELF-S2T，一种用于S2T的音频条件连续目标生成模型。基于预训练的嵌入式语言流（ELF）骨干，ELF-S2T通过冻结的Whisper编码器和单个线性投影器处理语音，将得到的音频条件前置到噪声文本潜变量前，用于上下文流匹配去噪。为防止模型过度依赖其预训练的文本上下文，我们在训练中引入音频强制，并在推理时通过无分类器引导进一步放大音频条件。在LibriSpeech和CoVoST2上的实验表明，ELF-S2T实现了具有竞争力的ASR和S2TT性能。关键的是，我们的错误分析揭示，尽管ASR和S2TT错误表面上看起来非常不同，但两者都源于同一根本原因：连续潜空间中的近距离混淆。这一发现自然与连续表示生成范式一致，表明识别和翻译之下存在共同的语义映射过程。我们的代码和预训练模型在 https://github.com/Sslnon/ELF-S2T 公开提供。

英文摘要

Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause, a close distance confusion in the continuous latent space. This finding naturally aligns with the continuous representation generation paradigm, indicating a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.

URL PDF HTML ☆

赞 0 踩 0

2606.10471 2026-06-10 cs.CL cs.AI 交叉投稿

Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks

使用递归神经张量网络检测生物医学文本中的推测性语言

Dhruv Dixit

发表机构 * Stevens Institute of Technology（史蒂文斯理工学院）

AI总结利用分布式句子表示和深度学习技术，提出递归神经张量网络（RNTN）用于自动检测生物医学文献中的推测性语言，性能略优于线性双元SVM（F1=0.885 vs 0.881）。

Comments 12 Pages

详情

AI中文摘要

在本研究中，我们通过利用分布式句子表示和先进的深度学习技术，深入探讨了生物医学文章中推测性语言的自动检测。这种识别的意义延伸至信息检索、多文档摘要以及新知识的探索。我们的探索涵盖了两种获取分布式句子表示的不同方法：段落向量模型和递归神经张量网络。然后，将这些方法与三种基础基线算法进行严格比较：支持向量机、朴素贝叶斯和模式匹配。我们的发现表明，递归神经张量网络（RNTN）的性能（F1=0.885）略优于表现最佳的基线线性双元SVM（F1=0.881）。同时，段落向量模型即使在使用大规模未标记数据集进行广泛训练后，效果也较差（F1=0.368）。我们对影响这些性能差异的因素进行了全面讨论，并为未来的研究方向提供了有见地的建议。

英文摘要

In this investigation, we delve into the automated detection of speculative language within biomedical articles by utilizing distributed sentence representations and advanced deep learning techniques. The implications of such identification extend to information retrieval, multi-document summarization, and the exploration of new knowledge. Our exploration encompasses two distinct approaches for acquiring distributed sentence representations: the Paragraph Vector model and the Recursive Neural Tensor Network. These methodologies are then rigorously compared against three foundational baseline algorithms: Support Vector Machines, Naive Bayes, and pattern matching. Our findings reveal that the Recursive Neural Tensor Network (RNTN) demonstrates a slight performance edge (F1 = 0.885) over the top-performing baseline, the linear bigram SVM (F1 = 0.881). Meanwhile, the Paragraph Vector model proves less effective (F1 = 0.368), even after extensive training using an expansive, unlabeled dataset. We engage in a comprehensive discourse on the factors influencing these performance disparities and provide insightful recommendations for future research directions.

URL PDF HTML ☆

赞 0 踩 0

2606.10738 2026-06-10 eess.AS cs.AI 交叉投稿

Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding

Spatial-Omni：通过FOA编码在多模态大语言模型中实现空间音频理解

Zhiyuan Zhu, Yixuan Chen, Yiwen Shao, Wenxiang Guo, Changhao Pan, Yu Zhang, Yuxiang Wang, Wei Liu, Houhua Zhang, Chengkuan Zeng, Wenbo Cheng, Yunxi Liu, Rui Yang, Steve Yves, Liefeng Bo, Zhou Zhao

发表机构 * Zhejiang University（浙江大学）； Tencent Hunyuan（腾讯混元）

AI总结提出Spatial-Omni，通过SO-Encoder将一阶Ambisonics空间音频注入现有全模态大语言模型，以轻量方式实现空间音频理解，并在构建的SO-Bench基准上超越现有模型。

详情

AI中文摘要

最近的多模态大语言模型主要将音频处理为单声道信号，从而丢弃了空间音频中包含的空间线索，这些线索用于声音定位、空间关系推理和空间场景理解。我们提出Spatial-Omni，一种轻量级方法，通过实现SO-Encoder将一阶Ambisonics（FOA）空间音频作为独立模态注入现有的全模态大语言模型，而无需修改其原始音频编码器。SO-Encoder以有限的额外上下文成本提供空间标记，并通过高效的分阶段训练提升空间音频理解。为支持训练和评估，我们从开源数据、真实录音和仿真中构建了SO-Dataset、SO-QA和SO-Bench，包含40万条FOA空间音频片段和210万个空间问答对。SO-Bench涵盖16个空间音频理解子任务，包括基本检测和位置估计、空间关系理解以及复杂空间推理。实验表明，Spatial-Omni在空间音频理解任务上优于现有的开源大型音频语言模型（LALM）和全模态大语言模型，同时保持合理的通用音频理解水平。代码和数据见：https://github.com/dieKarotte/Spatial-Omni。

英文摘要

Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLM models on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at https://github.com/dieKarotte/Spatial-Omni.

URL PDF HTML ☆

赞 0 踩 0

2606.10796 2026-06-10 cs.CL cs.AI 交叉投稿

Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

Dep-LLM：基于证据引导的结构化多因素与可靠LLM推理的无训练抑郁症诊断

Yiqing Lyu, Xianbing Zhao, Buzhou Tang, Ronghuan Jiang

发表机构 * School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China（哈尔滨工业大学（深圳）计算机科学与技术学院）； School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China（江南大学人工智能与计算机学院）； Guangdong Provincial Key Laboratory of Intelligent Information Processing（广东省智能信息处理重点实验室）； Pengcheng Laboratory（鹏城实验室）； Chinese People’s Liberation Army General Hospital, Beijing, China（中国人民解放军总医院）

AI总结提出无训练框架Dep-LLM，通过思维链多因素分析、置信度调制和协作预测，在冻结LLM上实现抑郁症诊断，超越零样本和微调方法。

详情

AI中文摘要

从临床访谈中进行自动抑郁症检测（ADD）是计算心理健康领域的关键任务，但由于两个关键障碍仍然具有挑战性：1）在冗长、多主题的临床访谈中建模复杂但稀疏分布的抑郁线索困难，导致推理肤浅且不可靠；2）由于临床隐私导致标记数据稀缺，加上训练和微调的高成本，限制了监督式ADD系统的部署。为了共同应对这些挑战，我们提出了Dep-LLM，一个无训练框架，它模仿临床精神科医生的逐步推理，并完全在冻结的现成基础LLM上运行。Dep-LLM包含三个阶段。首先，思维链（CoT）抑郁症多因素分析模块将长对话结构性地分解为五个临床对齐的主题，并产生基于证据的推理，有效处理长上下文依赖。其次，我们引入了置信度分析与调制模块，该模块从每个推理的token级熵中量化认知可靠性，并应用标签内和主题间调制，在不进行额外训练的情况下放大可信信号同时抑制不确定信号。第三，协作多因素预测模块动态整合由置信度加权的多因素信号，形成最终诊断。在DAIC-WOZ和E-DAIC数据集上的大量实验证明了Dep-LLM的有效性和泛化性：它在几乎所有21个基础LLM上，在准确率、宏F1和加权平均F1等9个指标上超越了零样本基线，并进一步优于最先进的监督式领域特定LLM以及最新的闭源商业LLM，同时无需额外训练。

英文摘要

Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) difficulty in modeling complex but sparsely distributed depression clues within lengthy, multi-topic clinical interviews, leading to superficial and unreliable reasoning; 2) scarcity of labeled data due to clinical privacy, together with high cost of training and fine-tuning, limiting the deployment of supervised ADD systems. To jointly address these challenges, we propose Dep-LLM, a training-free framework that mirrors the step-by-step reasoning of clinical psychiatrists and operates entirely on frozen off-the-shelf foundation LLMs. Dep-LLM comprises three stages. First, a Chain-of-Thought (CoT) Depression Multi-factor Analysis module structurally decomposes the long dialogue into five clinically aligned themes and produces evidence-grounded rationales, effectively handling long-context dependencies. Second, we introduce Confidence Analysis and Modulation module that quantifies the epistemic reliability from token-level entropy of each rationale and applies an intra-label and inter-theme modulation that amplifies trustworthy signals while suppressing uncertain ones without extra training. Third, a Collaborative Multi-factor Prediction module dynamically integrates multi-factor signals weighted by confidence into the final diagnosis. Extensive experiments on the DAIC-WOZ and E-DAIC datasets demonstrate the effectiveness and generalizability of Dep-LLM: it surpasses zero-shot baseline on nearly all 21 foundation LLMs across 9 metrics such as accuracy, macro F1 and weighted-average F1, and further outperforms state-of-the-art supervised domain-specific LLMs as well as the latest closed-source commercial LLMs, while requiring no extra training.
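The confidence step in the abstract, which derives reliability from the token-level entropy of each rationale and weights the per-theme signals accordingly, can be sketched as follows. This is an illustrative assumption, not the paper's formula: the helper names, the `exp(-entropy)` weighting, and the use of a single scalar score per theme are all hypothetical choices made for the example.

```python
import math
from typing import List, Sequence


def mean_token_entropy(token_distributions: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over the tokens of one rationale.

    Each inner sequence is the model's probability distribution for one token.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)


def confidence_weighted_score(theme_scores: List[float],
                              theme_entropies: List[float]) -> float:
    """Combine per-theme scores, down-weighting high-entropy rationales.

    The weight exp(-H) is an assumed mapping from entropy to confidence:
    it equals 1 for a fully certain rationale and decays as entropy grows.
    """
    weights = [math.exp(-h) for h in theme_entropies]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, theme_scores)) / total
```

With equal entropies the result reduces to a plain average; a theme whose rationale was generated with high uncertainty contributes little to the final score.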

URL PDF HTML ☆

赞 0 踩 0

2606.10819 2026-06-10 cs.CV cs.AI 交叉投稿

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Earth-OneVision：将遥感多模态大语言模型扩展到更多传感器模态和任务

Miaoxin Cai, Guanqun Wang, Wei Zhang, Guangyao Zhou, Yin Zhuang, Tong Zhang, Hao Wang, He Chen, Jun Li

发表机构 * National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing (SBIIP), Beijing Institute of Technology（北京理工大学空间智能信息处理国家重点实验室）； Aerospace Information Research Institute, Chinese Academy of Sciences（中国科学院空天信息创新研究院）； Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences（中国科学院地理空间信息处理与应用系统技术重点实验室）； Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology（北京理工大学前沿交叉科学研究院）； School of Mechatronical Engineering, Beijing Institute of Technology（北京理工大学机电学院）； School of Earth and Space Sciences, Peking University（北京大学地球与空间科学学院）； School of Electronics, Peking University（北京大学电子学院）； School of Computer Science and Hubei Key Laboratory of Intelligent Geo-Information Processing（计算机学院及湖北省智能地理信息处理重点实验室）

AI总结提出Earth-OneVision，一个2B参数的RS-MLLM，通过全粒度视觉语言对齐、空间语言同构序列化和渐进式跨模态适应机制，统一六种传感器模态和九类任务，在多个基准上达到或超越4B-72B模型。

详情

AI中文摘要

RS-MLLM能够对地球观测图像进行自然语言理解和空间推理。然而，现有模型仅支持狭窄的传感器类型和任务范围，导致对地球的碎片化视角，并使得跨模态地球科学知识在很大程度上未被利用。本文提出了Earth-OneVision，一个2B参数的RS-MLLM，它在单一自回归框架内统一了六种传感器模态（即光学、SAR、红外、多光谱、时序和视频）以及跨传感器融合，涵盖9个任务类别。三种专用机制解决了三个瓶颈。全粒度视觉语言对齐（FGVLA）将多级视觉特征与多维语言空间对齐。空间语言同构序列化（SLIS）将异构空间输出统一为自回归令牌。渐进式跨模态适应（PCMA）将复合领域差距分解为连续阶段，依次解决视角和成像物理差距。为了支持联合训练，构建了MMRS-OneVision，包含约3400万QA对，涵盖所有六种传感器模态和9个任务类别的跨传感器融合，大大超过了现有的遥感多模态指令数据集。仅用2B参数，Earth-OneVision在广泛基准上取得了具有竞争力或最先进的结果，持续匹配或超越4B-72B的RS-MLLM。它在光学视觉定位的OPT-RSVG测试集上达到87.52%的P@0.5，在SAR VQA基准SARLANG-Bench上达到80.68%，超过7B模型7%以上。它还在多光谱分类的BigEarthNet-MS测试集上达到75.74%的召回率，在跨模态推理的EarthMind-Bench上达到81.94%的MCQ准确率。

英文摘要

RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG testset for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS testset for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.

URL PDF HTML ☆

赞 0 踩 0

2606.10829 2026-06-10 cs.CL cs.AI 交叉投稿

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

注意力折扣自适应采样器用于掩码扩散语言模型

Yusuf Sahin, Ahmed Rockey Saikia, Volkan Cevher, Paolo Favaro

发表机构 * University of Bern（伯尔尼大学）； EPFL（瑞士联邦理工学院洛桑分校）

AI总结针对掩码扩散语言模型并行解码中候选词交互导致的不安全问题，提出训练无关的重排序规则ADAS，通过注意力折扣软惩罚改进子集构建，在多个基准上提升低NFE性能。

详情

AI中文摘要

掩码扩散语言模型可以通过每次去噪迭代揭示多个令牌来减少推理步骤，但这种并行性很脆弱：当预测相互耦合时，单独置信的位置同时提交可能不安全。现有的免训练采样器如Top-$k$、Fast-dLLM和EB-Sampler主要控制揭示多少令牌，而通常通过忽略选定集内交互的逐令牌分数对候选进行排序。我们提出ADAS，一种用于并行掩码扩散解码的免训练重排序规则。ADAS保持基础采样器的停止规则不变，仅修改子集构建：当候选者强烈关注预测仍不确定的已选位置时，它贪婪地折扣该候选者。与将注意力转化为硬兼容性约束的图约束方法不同，ADAS保持注意力连续并将其用作软边际惩罚。在GSM8K、MATH500、HumanEval和MBPP上，针对LLaDA-8B-Base和Dream-7B-Base，将ADAS插入Top-$k$、Fast-dLLM和EB-Sampler中，在匹配去噪器评估下，低NFE性能平均分别提高9.11和10.46个百分点，每次前向运行时开销为3.1%。这些结果表明，软注意力折扣重排序是一种简单且模块化的方法，可提高掩码扩散语言模型高度并行解码的质量。

英文摘要

Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predictions are coupled. Existing training-free samplers such as Top-$k$, Fast-dLLM, and EB-Sampler mainly control how many tokens to reveal, while often ranking candidates by token-wise scores that ignore interactions within the selected set. We propose ADAS, a training-free reranking rule for parallel masked diffusion decoding. ADAS leaves the base sampler's stopping rule unchanged and modifies only subset construction: it greedily discounts a candidate when it attends strongly to already selected positions whose predictions remain uncertain. Unlike graph-constrained methods that turn attention into hard compatibility constraints, ADAS keeps attention continuous and uses it as a soft marginal penalty. Across LLaDA-8B-Base and Dream-7B-Base on GSM8K, MATH500, HumanEval, and MBPP, plugging ADAS into Top-$k$, Fast-dLLM, and EB-Sampler improves low-NFE performance at matched denoiser evaluations by $9.11$ and $10.46$ percentage points on average, respectively, with $3.1\%$ per-forward runtime overhead. These results show that soft attention-discounted reranking is a simple and modular way to improve quality in highly parallel decoding for masked diffusion language models.
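The subset-construction idea in the abstract, greedily discounting a candidate that attends strongly to already selected positions whose predictions are still uncertain, can be sketched as below. This is a minimal sketch under assumptions, not the ADAS rule itself: the function name, the additive penalty form, the use of `1 - confidence` as the uncertainty of a selected position, and the `penalty` parameter are all hypothetical. The number of tokens to reveal, `k`, is taken as given by the base sampler, since the abstract states that the stopping rule is left unchanged.

```python
from typing import List, Sequence


def attention_discounted_select(confidence: Sequence[float],
                                attention: Sequence[Sequence[float]],
                                k: int,
                                penalty: float = 1.0) -> List[int]:
    """Greedily pick k masked positions to reveal in one denoising step.

    confidence[i]   : token-wise confidence of the prediction at position i
    attention[i][j] : how strongly position i attends to position j
    A candidate is softly penalised for attending to already selected
    positions that are still uncertain (low confidence).
    """
    selected: List[int] = []
    remaining = set(range(len(confidence)))

    def discounted(i: int) -> float:
        coupling = sum(attention[i][j] * (1.0 - confidence[j]) for j in selected)
        return confidence[i] - penalty * coupling

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=discounted)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Setting `penalty=0.0` recovers plain confidence ranking, which makes the effect of the discount easy to compare: a position that is confident on its own but tightly coupled to an uncertain selected position is deferred to a later step.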

URL PDF HTML ☆

赞 0 踩 0

2606.10892 2026-06-10 cs.CV cs.AI 交叉投稿

Improving Text-Instance Alignment Of Foreground Conditioned Out-Painting Via Customized Concept Embedding

通过定制化概念嵌入改进前景条件外绘中的文本-实例对齐

Yihao Zhao, Xuan Han, Bin He, Mingyu You

AI总结针对前景条件外绘中文本驱动方法产生的伪影问题，提出定制化概念嵌入扩散框架，通过实例感知损失和语义保持提示模板定制概念嵌入，显著减少伪影并提升图像质量。

详情

AI中文摘要

为了展示产品，商家通常需要花费大量成本制作高质量的展示图像。前景条件外绘（FCO）满足了这一需求，允许用户通过调整文本提示，以低成本为前景实例创建所需的背景。然而，现有的文本驱动FCO方法在其输出中存在关键缺陷，最明显的是伪影，即合成背景中与前景实例共享相同语义的区域。这种伪影降低了物体的显著性并降低了图像质量。我们将问题归因于给定实例与文本派生概念嵌入之间的不对齐。为了解决这个问题，我们提出了定制化概念嵌入扩散（CCE-Diffusion）框架。其核心是CCE模块，用于定制概念嵌入，弥合通用名词语义与特定视觉实例之间的差距。实例感知损失指导模块的优化，而语义保持提示模板防止定制化嵌入扭曲提示中的其他词。定性和定量评估均表明，CCE-Diffusion显著减少了输出中的伪影。作为即插即用组件，CCE模块可以集成到各种FCO方法中，提升其性能。

英文摘要

To showcase products, merchants often incur substantial costs creating high-quality display images. Foreground Conditioned Outpainting (FCO) meets this demand, allowing users to create desired backgrounds for foreground instances at a low cost by adjusting the text prompt. However, existing text-driven FCO methods exhibit critical flaws in their outputs, most notably the presence of artifacts, which refer to regions in the synthesized background that share the same semantics as the foreground instance. Such artifacts diminish the object's prominence and degrade image quality. We attribute the issue to the misalignment between the given instance and text-derived concept embeddings. To address this, we propose the Customized Concept Embedding Diffusion (CCE-Diffusion) framework. Its core is a CCE-Module to customize concept embeddings, bridging the gap between generic noun semantics and a specific visual instance. An Instance-Aware Loss guides the module's optimization, while a Semantic-Preserving Prompt Template prevents customized embeddings from distorting other words in the prompt. Both qualitative and quantitative evaluations demonstrate that CCE-Diffusion significantly reduces artifacts in the outputs. As a plug-and-play component, the CCE-Module can integrate with various FCO methods, enhancing their performance.

URL PDF HTML ☆

赞 0 踩 0

2606.10902 2026-06-10 cs.CV cs.AI 交叉投稿

Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization

Pose-ICL：面向姿态可控主体定制的3D感知上下文学习

Xuan Han, Yihao Zhao, Mingyu You

AI总结提出Pose-ICL框架，通过3D感知上下文学习和表面锚定位置嵌入（SAPE）实现无调优的姿态可控主体定制，显著提升姿态准确性和身份一致性。

详情

AI中文摘要

主体定制是现代图像生成中的基础任务。通过提供少量参考图像和文本提示，用户可以生成特定对象在任意期望场景中的图像。然而，现有方法在实现定制主体的有效姿态控制方面仍存在困难。在实践中，它们常常表现出不准确的姿态或不一致的跨姿态外观。这些局限性表明，对于2D原生骨干网络而言，以体积方式理解对象仍然是一个重大挑战。为了应对这一挑战，我们提出了Pose-ICL，这是一个无需调优的框架，利用3D感知上下文学习（ICL）通过多个配对的图像-姿态参考直接适应新主体。其核心机制——表面锚定位置嵌入（SAPE）——通过将图像令牌锚定到体积边界框的表面坐标，赋予模型显式的3D感知能力。专门的优化确保了其与现有DiT模型的无缝兼容性。在3D资产和真实世界主体上的广泛评估表明，Pose-ICL在姿态准确性和身份一致性方面均显著优于当前方法。

英文摘要

Subject Customization is a foundational task in modern image generation. By providing a few reference images and a text prompt, users can generate images of a specific object in any desired scene. However, existing methods still struggle to achieve effective pose control for customized subjects. In practice, they often exhibit inaccurate poses or inconsistent cross-pose appearances. These limitations suggest that understanding objects in a volumetric manner remains a significant challenge for 2D-native backbones. To address this challenge, we propose Pose-ICL, a tuning-free framework that leverages 3D-aware In-Context Learning (ICL) to directly adapt to new subjects through multiple paired image-pose references. Its core mechanism, Surface-Anchored Position Embedding (SAPE), equips the model with explicit 3D awareness by anchoring image tokens to the surface coordinates of a volumetric bounding box. Dedicated refinements ensure its seamless compatibility with existing DiT models. Extensive evaluations on both 3D assets and real-world subjects demonstrate that Pose-ICL significantly outperforms current methods in both pose accuracy and identity consistency.

URL PDF HTML ☆

赞 0 踩 0

2606.11033 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

模型说走：表面启发式如何覆盖LLM推理中的隐式约束

Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Independent Researcher（独立研究者）

AI总结研究LLM在表面线索与隐式约束冲突时的失败，提出启发式覆盖基准（HOB），通过因果行为分析揭示距离线索影响远大于目标，并验证目标分解提示可部分恢复性能。

详情

AI中文摘要

当显著的表面线索与未陈述的可行性约束冲突时，大型语言模型会失败。我们引入了启发式覆盖基准（HOB）：500个实例，涵盖4个启发式家族和5个约束家族，具有最小对和显式性梯度。我们将HOB与一个可证伪的行为特征描述配对，遵循诊断-测量-桥接-治疗弧。对六个模型的洗车问题进行因果行为分析，揭示了上下文无关的S形启发式：距离线索的影响力是目标的8.7到38倍，归因更匹配关键词关联而非组合推理。在14个模型中，严格的10/10评估显示，没有模型超过75%，存在约束最难，为44%。一个最小提示将性能提高15个百分点，表明是约束推断失败而非知识缺失。然而，14个模型中有12个在移除约束后表现更差，最多下降39个百分点，揭示了保守偏差。对Gemini 3.1 Pro的思考模式消融实验显示，思考开启时性能为74.6%，关闭时降至58.4%，而显式目标分解将其恢复至71.2%。因此，内部推理确实有用，显式提示可以部分替代。推理模型并不绝对优于非推理模型：在控制能力排名后，残差推理模式效应为1.8个百分点且不显著。参数探针显示S形模式泛化到成本、效率和语义相似性启发式。目标分解提示将性能提升5.0个百分点，而通用思维链提升3.1个百分点，将约束枚举隔离为有效成分。总体而言，启发式覆盖是一个系统性的推理漏洞，其量化位点在于推理顺序而非知识，并且有一个经过测试的干预措施。

英文摘要

Large language models fail when a salient surface cue conflicts with an unstated feasibility constraint. We introduce the Heuristic Override Benchmark (HOB): 500 instances spanning 4 heuristic families and 5 constraint families, with minimal pairs and explicitness gradients. We pair HOB with a falsifiable behavioral characterization following a diagnose-measure-bridge-treat arc. Causal-behavioral analysis of the car wash problem across six models reveals context-independent sigmoid heuristics: the distance cue has 8.7 to 38 times more influence than the goal, and attribution better matches keyword association than compositional inference. Across 14 models, strict 10/10 evaluation shows that no model exceeds 75%, and presence constraints are hardest at 44%. A minimal hint improves performance by 15 pp, suggesting a constraint-inference failure rather than missing knowledge. However, 12 of 14 models perform worse when the constraint is removed, by up to 39 pp, revealing conservative bias. A thinking-mode ablation on Gemini 3.1 Pro drops performance from 74.6% with thinking on to 58.4% with thinking off, while explicit goal decomposition recovers it to 71.2%. Thus, internal deliberation does useful work, and explicit prompting can partially substitute for it. Reasoning models do not categorically outperform non-reasoning peers: after controlling for capability rank, the residual reasoning-mode effect is 1.8 pp and is not significant. Parametric probes show that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics. Goal-decomposition prompting improves performance by 5.0 pp, compared with 3.1 pp for generic chain-of-thought, isolating constraint enumeration as the active ingredient. Overall, heuristic override is a systematic reasoning vulnerability with a quantified locus in inference order, not knowledge, and a tested intervention.

URL PDF HTML ☆

赞 0 踩 0

2604.01993 2026-06-10 cs.CL cs.AI 版本更新

SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning

SAFE: 一种基于LLM作为验证器的证据驱动多跳推理框架

Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

发表机构 * Seoul National University（首尔国立大学）

AI总结提出SAFE框架，通过将推理分解为知识图谱三元组，在生成过程中逐步验证中间步骤，以解决多跳问答中模型通过无效推理得到正确答案的问题，平均准确率提升8.8个百分点。

详情

AI中文摘要

多跳问答基准测试常常奖励大型语言模型（LLM）的虚假正确性，即模型通过无效的中间推理得出正确答案。我们提出了SAFE，一种基于LLM作为验证器的证据驱动多跳问答框架。SAFE不是在生成后仅判断最终答案，而是在生成过程中通过检查中间步骤与提供的段落和先前的推理轨迹来验证推理。为了使这一过程可检查，SAFE将推理分解为以知识图谱（KG）三元组表示的原子化、证据驱动的单元。在训练时，SAFE在KG约束下验证基准监督，并构建可靠的验证器训练数据。在推理时，外部验证器检查每个生成的步骤，识别无效推理，并在错误传播之前提供纠正反馈。在三个多跳问答基准测试中，SAFE平均提高了8.8个百分点的准确率。这些结果表明，证据驱动的多跳问答受益于将基于LLM的评估从事后答案判断转向逐步推理验证。

英文摘要

Multi-hop QA benchmarks often reward Large Language Models (LLMs) for spurious correctness, where models reach correct answers through invalid intermediate reasoning. We propose SAFE, an LLM-as-verifier framework for evidence-grounded multi-hop QA. Rather than judging only the final answer after generation, SAFE verifies reasoning during generation by checking intermediate steps against the provided passages and previous reasoning trajectory. To make this process checkable, SAFE decomposes reasoning into atomic, evidence-grounded units represented with Knowledge Graph (KG) triples. At train-time, SAFE verifies benchmark supervision under KG-grounded constraints and constructs reliable verifier training data. At inference-time, an external verifier checks each generated step, identifies invalid reasoning, and provides correction feedback before errors propagate. Across three multi-hop QA benchmarks, SAFE improves accuracy by 8.8 pp on average. These results show that evidence-grounded multi-hop QA benefits from shifting LLM-based evaluation from post-hoc answer judgment to stepwise reasoning verification.

URL PDF HTML ☆

赞 0 踩 0

2604.14397 2026-06-10 cs.CL cs.AI 版本更新

Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection

基于词典的跨语言语义投影生成概念词汇化

David Basil, Chirooth Girigowda, Bradley Hauer, Sahir Momin, Ning Shi, Grzegorz Kondrak

发表机构 * University of Alberta（阿尔伯塔大学）

AI总结提出一种通过语义投影将英语WordNet概念扩展到新语言的方法，利用双语词典增强对齐并过滤错误投影，在多个语言上提升了精度且保持可解释性和资源效率。

Comments Paper presented at Canadian AI 2026

详情

AI中文摘要

我们研究通过语义生成自动将WordNet风格的词汇资源扩展到新语言的任务。我们通过语义投影将目标语言词条与现有词汇概念关联来生成词义。给定一个带有词义标注的英语语料库及其翻译，我们的方法将标注的同义词集投影到对齐的目标语言标记上，并将相应的词条分配给这些同义词集。为了生成对齐并确保其质量，我们使用双语词典增强预训练的基础对齐器，该词典也用于过滤不正确的语义投影。我们在多种语言上评估该方法，将其与先前方法以及基于词典和大型语言模型的基线进行比较。结果表明，所提出的投影-过滤策略在保持可解释性和资源效率的同时提高了精度。我们在 https://github.com/UAlberta-NLP/ExpandNet 发布代码、文档和生成的词义清单。

英文摘要

We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects the annotated synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate alignments and ensure their quality, we augment a pretrained base aligner with a bilingual dictionary, which is also used to filter incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and resource-efficient. We release our code, documentation, and generated sense inventories at https://github.com/UAlberta-NLP/ExpandNet.

URL PDF HTML ☆

赞 0 踩 0

2604.22565 2026-06-10 cs.CL cs.AI 版本更新

Learning Evidence Highlighting for Frozen LLMs

学习为冻结的LLM突出证据

Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li

发表机构 * Stony Brook University（石溪大学）； Meta AI

AI总结提出HiLight框架，通过强化学习训练轻量级Actor在长上下文中插入高亮标签，使冻结的LLM更关注关键证据，无需证据标签或修改求解器，在序列推荐和长上下文问答中提升性能。

详情

AI中文摘要

大型语言模型（LLM）能够很好地推理，但当关键证据埋藏在冗长、嘈杂的上下文中时，常常会错过决定性证据。我们提出了HiLight，一个证据强调框架，它将证据选择与冻结的LLM求解器的推理解耦。HiLight避免压缩或重写输入（这可能会丢弃或扭曲证据），而是训练一个轻量级的强调Actor，在未改变的上下文中的关键跨度周围插入最小的高亮标签。然后，一个冻结的求解器对强调后的输入进行下游推理。我们将高亮视为一个弱监督决策问题，并使用强化学习仅基于求解器的任务奖励来优化Actor，不需要证据标签，也不需要访问或修改求解器。在序列推荐和长上下文问答中，HiLight始终优于强大的基于提示和自动提示优化的基线。学习到的强调策略可以零样本迁移到更小和更大的未见求解器家族，包括基于API的求解器，这表明Actor捕获了真正的、可复用的证据结构，而不是过拟合单个骨干网络。

英文摘要

Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.

URL PDF HTML ☆

赞 0 踩 0

2605.03344 2026-06-10 cs.IR cs.AI cs.CL 版本更新

RAG over Thinking Traces Can Improve Reasoning Tasks

对思考轨迹进行 RAG 可提升推理任务表现

Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出检索思考轨迹而非文档，通过 T3 方法将其转化为结构化表示，在推理任务上显著提升性能，超越标准 RAG 和无 RAG 基线。

详情

AI中文摘要

检索增强生成（RAG）已被证明对知识密集型任务有效，但普遍认为其对数学和代码生成等推理密集型问题帮助有限。我们通过证明限制不在于 RAG 本身而在于语料库的选择来挑战这一假设。我们不检索文档，而是提出检索思考轨迹，即问题求解尝试过程中产生的中间思考轨迹。我们表明思考轨迹本身就是一个强大的检索源，并进一步引入 T3，一种离线方法，将其转化为结构化、利于检索的表示，以提高可用性。使用这些轨迹作为语料库，简单的“先检索后生成”流水线在强模型和基准测试（如 AIME 2025-2026、LiveCodeBench 和 GPQA-Diamond）上持续提升推理性能，既优于无 RAG 基线，也优于基于标准网络语料库的检索。例如，在 AIME 2025-2026 上，使用 Gemini-2-thinking 生成的轨迹进行 RAG，在 Gemini-2.5-Flash、GPT-OSS-120B 和 GPT-5 上分别实现了 +56.3%、+8.6% 和 +7.6% 的相对增益，尽管这些是更新的模型。总体而言，我们的结果表明思考轨迹是推理任务的有效检索语料库，将其转化为结构化、紧凑或诊断性表示可带来更强的增益。代码见 https://github.com/Narabzad/t3。

英文摘要

Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025--2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME 2025-2026, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.

URL PDF HTML ☆

赞 0 踩 0

2605.18271 2026-06-10 cs.CL cs.AI cs.IR cs.LG 版本更新

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

从数量到价值：面向设备端RAG的偏好对齐记忆构建

Changmin Lee, Jaemin Kim, Taesik Gong

发表机构 * Department of Computer Science and Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea（韩国蔚山，蔚山科学技术院（UNIST）计算机科学与工程系）

AI总结本文提出EPIC方法，通过将用户偏好作为紧凑且稳定的个人上下文形式，整合到RAG流程中，以在有限内存下提高检索与用户偏好的对齐度，从而减少内存使用并提升准确性。

Comments Accepted to ICML 2026. Code and data are available at https://github.com/UbiquitousAILab/EPIC

详情

AI中文摘要

随着基于大型语言模型（LLMs）的个人AI代理的迅速发展，将其部署到设备上已成为隐私和响应性的重要需求。为了处理现实世界请求中固有的个人和上下文依赖性，这些代理必须基于设备上存储的个人上下文进行生成。然而，在内存预算紧张的情况下，核心瓶颈是存储什么内容以确保检索与用户保持一致。我们提出EPIC（高效偏好对齐索引构建），专注于用户偏好作为紧凑且稳定的个人上下文形式，并在整个RAG流程中整合它们。EPIC会从原始数据中选择性地保留与偏好相关的信息，并使检索向偏好对齐的上下文靠拢。在四个涵盖对话、辩论、解释和推荐的基准测试中，相比表现最佳的基线，EPIC将索引内存减少2,404倍，将偏好遵循准确率提高18.79个百分点，并将检索延迟降低32.17倍。在设备端实验中，EPIC的内存占用保持在1 MB以下，在三个平台上实现5.21至29.35毫秒/查询的延迟，同时支持偏好漂移下的流式更新。我们的代码和数据见 https://github.com/UbiquitousAILab/EPIC。

英文摘要

With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them on-device has become essential for privacy and responsiveness. To handle the inherently personal and context-dependent nature of real-world requests, such agents must ground their generation in device-resident personal context. However, under tight memory budgets, the core bottleneck is what to store so that retrieval remains aligned with the user. We propose EPIC (Efficient Preference-aligned Index Construction), which focuses on user preferences as a compact and stable form of personal context and integrates them throughout the RAG pipeline. EPIC selectively retains preference-relevant information from raw data and aligns retrieval toward preference-aligned contexts. Across four benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by 2,404 times, improves preference-following accuracy by 18.79 %p, and achieves 32.17 times lower retrieval latency over the best-performing baseline. In on-device experiments, EPIC maintains under 1 MB memory and achieves 5.21 to 29.35 ms/query latency across three platforms, while supporting streaming updates under preference drift. Our code and data are available at https://github.com/UbiquitousAILab/EPIC.

URL PDF HTML ☆

赞 0 踩 0

2605.28066 2026-06-10 cs.CL cs.AI 版本更新

PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting

PromptEmbedder：通过双LLM软提示实现高效且可迁移的文本嵌入

Yu-Che Tsai, Kuan-Yu Chen, Yuan-Hao Chen, Yu-Han Chang, Ching-Yu Tsai, Yu-Hsiang Chuang, Shou-De Lin

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University（国立台湾大学计算机科学与资讯工程系）； National Taiwan University AI Center of Research Excellence（国立台湾大学人工智能研究中心）

AI总结提出PromptEmbedder双LLM框架，通过可微分的软提示生成将嵌入知识从特定骨干权重中解耦，在保持性能的同时降低40% GPU内存并加速3.7倍训练。

详情

AI中文摘要

大型语言模型（LLM）在文本嵌入方面展现出显著效果，但当前的适应方法（如LoRA）在计算效率和跨架构可迁移性方面面临重大瓶颈。每当出现新的骨干网络时，现有方法需要从头开始进行昂贵的重新训练。为了解决这个问题，我们提出了PromptEmbedder，一种新颖的双LLM框架，将嵌入知识与特定骨干权重解耦。PromptEmbedder利用一个提示LLM通过连续松弛的可微分生成过程，为冻结的嵌入LLM生成指令感知的软提示，确保对比训练期间的全梯度流动。通过将任务特定知识定位在提示LLM中，适应新架构只需重新训练一个轻量级的线性对齐矩阵。在MTEB基准上的评估表明，PromptEmbedder实现了与LoRA微调相当的性能，同时将GPU内存减少40%，训练速度提升3.7倍。我们的方法建立了一种可扩展、架构无关的范式，用于高效的基于LLM的表示学习。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face significant bottlenecks in computational efficiency and cross-architecture transferability. Whenever a new backbone emerges, existing approaches require costly retraining from scratch. To address this, we propose PromptEmbedder, a novel dual-LLM framework that decouples embedding knowledge from specific backbone weights. PromptEmbedder utilizes a Prompting LLM to generate instruction-aware soft prompts for a frozen Embedding LLM via a differentiable generation process with continuous relaxation, ensuring full gradient flow during contrastive training. By localizing task-specific knowledge within the Prompting LLM, adapting to new architectures requires only retraining a lightweight linear alignment matrix. Evaluations on the MTEB benchmark show that PromptEmbedder achieves comparable performance with LoRA finetuning while reducing GPU memory by 40% and accelerating training by 3.7x. Our approach establishes a scalable, architecture-agnostic paradigm for efficient LLM-based representation learning.

URL PDF HTML ☆

赞 0 踩 0

2606.07422 2026-06-10 cs.CL cs.AI 版本更新

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

被掩盖的优势：揭示LLMs中本地语言对文化知识的访问

Yang Zhang, Xiao Fei, Amr Mohamed, Sarah Almeida Carneiro, Mersin Konomi, Mingmeng Geng, Ahmed Asaad, Guokan Shang, Michalis Vazirgiannis

发表机构 * Ecole Polytechnique（巴黎综合理工学院）； MBZUAI（穆罕默德·本·扎耶德人工智能大学）； ENS-PSL（巴黎高等师范学院-巴黎文理研究大学）； Durham University（杜伦大学）

AI总结通过控制实验和项目反应理论模型，分离语言能力与文化知识访问，发现本地语言在文化知识访问上具有优势，但常被语言能力不足掩盖。

详情

AI中文摘要

大型语言模型越来越多地被用于跨语言回答文化相关问题，但目前尚不清楚本地文化知识是通过英语还是本地语言更容易获取。现有评估面临两个关键限制：许多评估依赖于可能无法反映文化知识自然出现的平行模板问题，并且原始准确率混淆了通用语言能力与语言条件知识访问。我们通过一个基于从区域基准和本地来源收集的真实世界文化问题的受控框架来解决这些问题。通过交叉问题类型（文化无关 vs. 文化特定）与查询语言（英语 vs. 本地语言），并使用共享的1PL项目反应理论模型估计能力，我们将语言能力与本地化知识访问分离。在13个地区和大约80个模型上，我们发现文化无关问题上存在一致的英语优势，表明更强的英语能力。然而，在考虑了这种能力差距后，本地语言在几乎所有地区-模型设置中都显示出积极的知识访问优势。这种优势在原始准确率中通常被掩盖，但在前沿、区域对齐或语言适应模型中变得更加明显。我们的结果表明，较弱的本地语言表现并不一定意味着较弱的文化知识；相反，本地文化知识可能通过本地语言更容易访问，但被有限的语言能力所隐藏。

英文摘要

Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local cultural knowledge is better accessed through English or the local language. Existing evaluations face two key limitations: many rely on parallel template-based questions that may not reflect how cultural knowledge naturally appears, and raw accuracy conflates general language proficiency with language-conditioned knowledge access. We address these issues with a controlled framework built on real-world cultural questions collected from regional benchmarks and local sources. By crossing question type (culture-agnostic vs. culture-specific) with query language (English vs. local language), and estimating ability with a shared 1PL item response theory model, we separate proficiency from localized knowledge access. Across 13 locales and roughly 80 models, we find a consistent English advantage on culture-agnostic questions, indicating stronger English proficiency. However, after accounting for this proficiency gap, local languages show a positive knowledge-access advantage in nearly all locale-model settings. This advantage is often masked in raw accuracy but becomes more visible for frontier, regionally aligned, or language-adapted models. Our results suggest that weaker local-language performance does not necessarily imply weaker cultural knowledge; rather, local cultural knowledge may be more accessible through the local language but hidden by limited language proficiency.
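
The separation described above can be illustrated with a small sketch. The 1PL (Rasch) response function is standard; the difference-in-differences step is only one plausible reading of "crossing question type with query language" and is an assumption, not the authors' estimator. All ability values are hypothetical.

```python
import math

def p_correct_1pl(theta, difficulty):
    """1PL (Rasch) model: P(correct) = sigmoid(theta - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def access_advantage(ability):
    """Assumed difference-in-differences on the ability (logit) scale.

    The culture-agnostic gap is treated as the pure proficiency gap and
    subtracted from the culture-specific gap.
    """
    proficiency_gap = ability[("agnostic", "local")] - ability[("agnostic", "english")]
    specific_gap = ability[("specific", "local")] - ability[("specific", "english")]
    return specific_gap - proficiency_gap

# Hypothetical ability estimates for one model in one locale.
ability = {
    ("agnostic", "english"): 1.2, ("agnostic", "local"): 0.6,
    ("specific", "english"): 0.5, ("specific", "local"): 0.3,
}
advantage = access_advantage(ability)
```

In this made-up case the local language scores lower on culture-specific questions in raw terms, yet the adjusted advantage is positive, which is the kind of masking the abstract describes.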

URL PDF HTML ☆

赞 0 踩 0

2606.08674 2026-06-10 cs.CV cs.AI 版本更新

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

BioVid: 具有生物行为语义理解的自回归视频生成

Tsung-Wei Pan, Jung-Hua Wang

发表机构 * Department of Electrical Engineering, National Taiwan Ocean University（国立台湾海洋大学电机工程学系）； AI Research Center, National Taiwan Ocean University（国立台湾海洋大学人工智能研究中心）

AI总结提出BioVid，一种数据驱动的自回归视频生成框架，通过FSQ-R3GAN分词器和因果Transformer学习生物行为的自然时长分布，无需预设长度约束。

详情

AI中文摘要

现有的视频生成框架将序列时长视为外部指定参数（固定的帧数或文本提示），生成的片段在时间边界上与真实行为数据的统计结构脱节。这一假设与生物行为根本不一致，因为动作时长在个体和实例之间自然变化，并编码在数据本身中。我们提出BioVid，一种数据驱动的自回归视频生成框架，直接从训练数据中学习生物行为的时序结构，包括其自然长度分布。在第一阶段，有限标量量化GAN（FSQ-R3GAN）分词器将每个视频帧编码为紧凑的离散表示，结合R3GAN的稳定相对论式训练目标和FSQ有保证的码本利用率，实现高保真空间重建，且不会出现码本崩溃。在第二阶段，因果Transformer自回归地对所得令牌序列建模，并学习在行为事件达到语义闭合时发出序列结束（EOS）令牌，终止分布自然地从训练数据中涌现，而非来自任何人为指定的约束。在人类饮水行为数据集（NTU RGB+D, A001, n=94）上的实验表明，BioVid生成的长度分布与保留测试数据的分布紧密匹配，与真实分布的Wasserstein-1距离为1.24（相比之下，固定长度基线为6.05，VideoGPT为15.48），同时保持有竞争力的空间保真度。

英文摘要

Existing video generation frameworks treat sequence duration as an externally prescribed parameter -- fixed frame counts or text prompts -- producing clips whose temporal boundaries are decoupled from the statistical structure of real behavioral data. This assumption is fundamentally misaligned with biological behavior, where action duration varies naturally across individuals and instances and is encoded in the data itself. We present BioVid, a data-driven autoregressive video generation framework that learns the temporal structure of biological behaviors directly from training data, including their natural length distributions. In the first stage, a Finite Scalar Quantization GAN (FSQ-R3GAN) tokenizer encodes each video frame into a compact discrete representation, combining the stabilized relativistic training objective of R3GAN with FSQ's guaranteed codebook utilization to achieve high-fidelity spatial reconstruction without codebook collapse. In the second stage, a causal Transformer models the resulting token sequences autoregressively and learns to emit an End-of-Sequence (EOS) token when the behavioral event reaches semantic closure, with the termination distribution emerging naturally from the training data rather than any human-specified constraint. Experiments on a human drinking behavior dataset (NTU RGB+D, A001, n=94) demonstrate that BioVid's generated length distribution closely matches that of held-out test data, achieving a Wasserstein-1 distance of 1.24 against the ground truth -- compared to 6.05 for a fixed-length baseline and 15.48 for VideoGPT -- while maintaining competitive spatial fidelity.
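
For readers unfamiliar with the metric, the Wasserstein-1 distance between two one-dimensional empirical distributions equals the area between their CDFs. The sketch below is a minimal stdlib implementation of that definition; the clip lengths are hypothetical and are not data from the paper.

```python
from bisect import bisect_right

def wasserstein_1(xs, ys):
    """W1 between two 1-D empirical distributions with uniform weights.

    Computed as the integral of |CDF_x - CDF_y| over the real line.
    """
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_x = bisect_right(xs, left) / len(xs)
        cdf_y = bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total

# Hypothetical clip lengths in frames.
real_lengths = [40, 45, 50, 55, 60]
fixed_lengths = [50] * 5                 # fixed-length generator
learned_lengths = [41, 46, 50, 54, 61]   # generator that learned when to stop

w1_fixed = wasserstein_1(real_lengths, fixed_lengths)
w1_learned = wasserstein_1(real_lengths, learned_lengths)
```

A fixed-length generator collapses the length distribution to a point, so its distance to the real distribution is larger than that of a generator whose stopping times vary like the data.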

URL PDF HTML ☆

赞 0 踩 0

2606.09079 2026-06-10 cs.LG cs.AI 版本更新

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

FlashMemory-DeepSeek-V4: 通过前瞻稀疏注意力实现闪电索引超长上下文

Yan Wang, Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Miao Peng, Nuo Chen, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu

发表机构 * Independent Researchers（独立研究者）； Tencent（腾讯）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Tsinghua University（清华大学）

AI总结提出前瞻稀疏注意力（LSA），基于DeepSeek-V4架构的神经记忆索引器，通过预测未来上下文需求仅保留关键KV块，在超长上下文场景下将物理KV缓存压缩至全上下文的13.5%，同时保持或略微提升下游准确率。

Comments Technical report. 11 pages. Code and model available at https://github.com/libertywing/FlashMemory-Deepseek-V4 and https://huggingface.co/libertywing/FlashMemory-Deepseek-V4

详情

AI中文摘要

传统大语言模型在解码过程中保持完整的KV缓存，导致超长上下文服务出现严重的GPU内存瓶颈。在本报告中，我们提出前瞻稀疏注意力（LSA），一种基于DeepSeek-V4架构构建的神经记忆索引器驱动的新型推理范式。LSA并非被动地关注所有历史令牌，而是主动预测未来的上下文需求，并仅在GPU内存中保留查询关键的KV块。关键的是，我们通过无骨干的解耦训练策略实例化该架构。通过将索引器制定为标准双编码器架构，我们使用标准检索训练框架独立训练它，而无需将庞大的骨干模型加载到GPU内存中。我们证明这种“少即是多”的范式显著最大化服务效率，同时在依赖长期全局记忆的任务中充当有效的注意力去噪器。在主要的长上下文评估套件（例如LongBench-v2、LongMemEval和RULER）中，FM-DS-V4将平均物理KV缓存占用压缩至全上下文基线的仅13.5%，同时一致地保持或略微提升下游准确率（平均绝对边际+0.6%）。关键的是，在极端500K规模下，FlashMemory将物理KV缓存开销抑制超过90%，而不会破坏骨干的核心推理能力。

英文摘要

Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and preserves only the query-critical KV chunks in the GPU memory. Crucially, we instantiate this architecture via a backbone-free decoupled training strategy. By formulating the indexer as a standard dual-encoder architecture, we train it independently using standard retrieval training frameworks without ever loading the massive backbone model into GPU memory. We demonstrate that this "less is more" paradigm significantly maximizes serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FM-DS-V4 compresses the average physical KV cache footprint down to merely 13.5% of the full-context baseline, while consistently preserving or slightly elevating downstream accuracy (+0.6% absolute margin on average). Crucially, at extreme 500K scales, FlashMemory suppresses the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capacities.

URL PDF HTML ☆

赞 0 踩 0

2606.09919 2026-06-10 cs.LG cs.AI cs.MA cs.RO 交叉投稿

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

Co-GLANCE: 异构机器人团队的不确定性感知主动感知

Michal P. Podolinsky, Neel P. Bhatt, Pranay Samineni, Rohan Siva, Christian Ellis, Ufuk Topcu

AI总结提出Co-GLANCE系统，通过蒸馏视觉语言模型实现实时遮挡分割与机器人分配，结合共形预测与选择性弃权提供统计保证的不确定性量化，驱动主动感知，在真实场景中遮挡分割和分配准确率分别提升25%和36%，推理延迟降低350倍。

Comments Code, videos, and dataset available at https://co-glance.github.io/

详情

AI中文摘要

感知不确定性是异构机器人团队在非结构化户外环境中运行的核心挑战，因为单一视角无法提供可靠的场景理解。由遮挡等来源引起的感知不确定性，根据场景结构在不同机器人视角下表现不同。检测和解决感知不确定性的来源需要基于场景的上下文推理和具备能力感知的机器人分配。虽然视觉语言模型为两者提供了强大的语义先验，但它们对于机载推理在计算上过于昂贵，且缺乏校准的不确定性量化。我们介绍了Co-GLANCE，一个用于异构机器人团队不确定性解决的实时机载感知与决策系统。Co-GLANCE将视觉语言模型的语义推理能力蒸馏为用于遮挡分割和机器人分配的端到端模型，消除了对基于云推理的需求。为了量化感知不确定性，Co-GLANCE结合了共形预测与选择性弃权，为分割、机器人分配和检测输出提供统计有效的覆盖保证。这些校准的不确定性估计直接触发主动感知，派遣最合适的机器人获取信息丰富的视角并解决不确定性。在真实世界场景中，Co-GLANCE在遮挡分割和机器人分配准确率上分别比基于云的视觉语言模型基线高出25%和36%，同时将每帧推理延迟降低350倍。我们还发布了一个空地数据集以供未来研究。代码、视频和数据集可在以下网址获取：https://co-glance.github.io/。

英文摘要

Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single viewpoint affords reliable scene understanding. Perceptual uncertainty, arising from sources such as occlusions, manifests differently across robot viewpoints depending on scene structure. Detecting and resolving sources of perceptual uncertainty requires both scene-based contextual reasoning and capability-aware robot allocation. While vision-language models provide strong semantic priors for both, they are computationally prohibitive for onboard inference and lack calibrated uncertainty quantification. We introduce Co-GLANCE, a real-time onboard perception and decision-making system for uncertainty resolution in heterogeneous robot teams. Co-GLANCE distills the semantic reasoning capabilities of a vision-language model into an end-to-end model for occlusion segmentation and robot allocation, eliminating the need for cloud-based inference. To quantify perceptual uncertainty, Co-GLANCE combines conformal prediction with selective abstention to provide statistically valid coverage guarantees for segmentation, robot allocation, and detection outputs. These calibrated uncertainty estimates directly trigger active perception, dispatching the most appropriate robot to acquire informative viewpoints and resolve uncertainty. Across real-world scenarios, Co-GLANCE outperforms cloud-based vision-language model baselines in occlusion segmentation and robot allocation accuracy by 25% and 36%, respectively, while reducing per-frame inference latency 350x. We also release an air-ground dataset for future research. Code, videos, and dataset available at https://co-glance.github.io/.
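
As background on the uncertainty mechanism named above, here is a minimal sketch of generic split conformal prediction with abstention for a classifier. It is not the Co-GLANCE implementation: the labels, calibration scores, and the rule "abstain unless the prediction set is a singleton" are illustrative assumptions.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores.

    Scores are nonconformity values, here 1 - p(true label).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points: keep every label
    return sorted(cal_scores)[rank - 1]

def predict_or_abstain(probs, threshold):
    """Return (decision, prediction_set); decision is None on abstention."""
    pred_set = [label for label, p in probs.items() if 1.0 - p <= threshold]
    decision = pred_set[0] if len(pred_set) == 1 else None
    return decision, pred_set

# Hypothetical calibration scores and model outputs.
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.55, 0.70]
threshold = conformal_threshold(cal_scores, alpha=0.25)

confident = predict_or_abstain({"occluded": 0.90, "clear": 0.10}, threshold)
ambiguous = predict_or_abstain({"occluded": 0.52, "clear": 0.48}, threshold)
```

In a system like the one described, an abstention (decision `None`) would be the natural point to request another viewpoint rather than commit to a label.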

URL PDF HTML ☆

赞 0 踩 0

2606.09958 2026-06-10 cs.RO cs.AI 交叉投稿

基于基础模型的机器人在患者和老年人护理中的探索

Zhiwen Qiu, Wei Liu, Yuexing Hao

AI总结本文综述了基于基础模型的护理机器人在设计特征、用户体验和护理效果方面的现状，指出当前系统多用于语音交互，多模态和物理自主性有限，并呼吁向护理特定评估标准和负责任自主性发展。

详情

AI中文摘要

随着全球人口老龄化，对老年人和患者护理的需求迅速增长。基础模型越来越多地被集成到机器人和交互代理中，有望实现更灵活的沟通和个性化辅助。然而，护理环境需要可靠且与工作流程兼容的系统，并具备可问责的人类监督，目前尚不清楚当前具身系统能否将技术进步转化为临床影响。本观点文章从三个方面综合了基于基础模型的护理机器人：设计特征、用户体验以及护理相关结果的证据。当前系统最常将基础模型用作以语音为中心的社交辅助具身形态中的对话和推理层，而多模态基础和物理自主性仍然有限。实证评估报告了积极的可用性和参与度益处，但交互流程中仍存在幻觉和对话中断等可靠性故障。护理影响的证据主要集中在认知投入和参与等近端结果上，而经过验证的临床或护理相关变化的证据有限。我们认为，未来的研究应转向护理特定的评估标准、可问责的自主性以及融入护理工作流程，以支持更具响应性和负责任的护理技术。

英文摘要

Demand for older-adult and patient care is growing rapidly as populations age worldwide. Foundation models are increasingly being integrated into robots and interactive agents, with the promise of more flexible communication and personalized assistance. However, care settings require reliable and workflow-compatible systems with accountable human oversight, and it remains unclear whether current embodied systems can translate technical advances into clinical impact. This Perspective synthesizes foundation model-based care robots across three areas: design features, user experience, and evidence for care-related outcomes. Current systems most commonly use foundation models as conversational and reasoning layers within voice-centered socially assistive embodiments, while multimodal grounding and physical autonomy remain limited. Empirical evaluations report positive usability and engagement benefits, but reliability failures persist across the interaction pipeline such as hallucinations and conversational breakdowns. Evidence for care impact remains concentrated in proximal outcomes such as cognitive engagement and participation, with limited evidence for validated clinical or care-related changes. We argue that future research should transition toward care-specific evaluation standards, accountable autonomy, and integration into care workflows to support more responsive and responsible care technologies.

URL PDF HTML ☆

赞 0 踩 0

2606.10244 2026-06-10 cs.RO cs.AI 交叉投稿

YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale

YUBI：面向大规模双手灵巧操作的通用双指接口

Takehiko Ohkawa, Jumpei Arima, Yuki Noguchi, Masatoshi Tateno, Makoto Sugiura, Takuya Okubo, Kengo Ikeuchi, Yuma Shin, Hiroki Nishizawa, Naoaki Kanazawa, Yuki Wakayama, Daiki Fukunaga, Koshi Makihara, Tomohiro Motoda, Floris Erich, Yukiyasu Domae, Tatsuya Matsushima, Yohishiro Okumatsu, Kei Ota

AI总结提出YUBI手指对齐夹爪，通过屈服式手指驱动映射实现直观、符合人体工学的双手灵巧操作数据采集，构建8434小时/120万集/119任务数据集，单策略跨多机器人迁移。

Comments Project page: https://yubi.airoa.io/

详情

AI中文摘要

我们引入了Yielding Universal Bidigital Interface (YUBI)，一种手指对齐的夹爪，旨在实现双手灵巧操作的直观、符合人体工学且可扩展的数据采集。虽然手持数据采集系统（如Universal Manipulation Interface (UMI)）实现了低成本数据采集，但其笨重的手枪式握把设计可能给精细灵巧操作任务带来人体工学和易用性方面的挑战。为此，YUBI提出了一种独特的设计原则：屈服式手指驱动，将人类手指运动直接映射到夹爪钳口运动。使用YUBI设备，我们建立了一个集成基于VR的夹爪6自由度跟踪的数据采集系统，确保高保真轨迹数据获取。我们整理了一个规模空前的基于UMI的数据集：8,434小时，涵盖120万集和119个任务。实验表明，YUBI在复杂双手任务的通用性、灵巧性和操作效率方面优于UMI夹爪。只需将夹爪安装到各个平台上，在YUBI数据集上训练的单一策略即可迁移到多个双手机器人（UR、Franka和ELEY），证实采集的数据可直接用作可执行的策略监督。我们将夹爪硬件、数据采集软件和数据集作为一个集成堆栈发布，为开放社区提供可复现的大规模数据采集路径，以推动机器人基础模型的发展。

英文摘要

We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable data collection for bimanual dexterous manipulation. While handheld data collection systems such as Universal Manipulation Interface (UMI) enable affordable data collection, their bulky pistol-grip designs can pose ergonomic and usability challenges for fine-grained, dexterous manipulation tasks. To address this, YUBI presents a distinct design principle: yielding, finger-driven actuation that directly maps human finger movements to gripper jaw motion. Using the YUBI devices, we set up a data collection system with integrated VR-based 6 DoF tracking of the gripper, ensuring high-fidelity trajectory data acquisition. We curate a UMI-based dataset of unprecedented scale: 8,434 hours across 1.20M episodes and 119 tasks. Experiments show that YUBI offers advantages over the UMI gripper in versatility for complex bimanual tasks, dexterity, and operational efficiency. A single policy trained on the YUBI dataset transfers across multiple bimanual robots (UR, Franka, and ELEY) simply by mounting the gripper on each platform, confirming that the collected data are directly executable as policy supervision. We release the gripper hardware, data-collection software, and dataset as one integrated stack, offering the open community a reproducible path to large-scale data acquisition for advancing robotic foundation models.

URL PDF HTML ☆

赞 0 踩 0

2606.10267 2026-06-10 cs.RO cs.AI cs.LG 交叉投稿

具身智能时代基于物理模拟器的机器人导航与操作综述

Lik Hang Kenny Wong, Xueyang Kang, Kaixin Bai, Jianwei Zhang

发表机构 * Department of Computer Science, City University of Hong Kong（香港城市大学计算机科学系）； School of Electrical and Electronic Engineering, Nanyang Technological University（南洋理工大学电气与电子工程学院）； Department of Informatics, Universität Hamburg（汉堡大学信息学系）

AI总结本文综述了物理模拟器在缩小具身智能中导航与操作的模拟到现实差距方面的关键特性、任务支持及硬件需求，并提供了基准数据集、指标、平台和方法资源。

Comments Under Review

2512.08280 2026-06-10 cs.RO cs.AI cs.SY eess.SY 版本更新

Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making

基于模型扩散采样的离线决策预测控制

Haldun Balim, Na Li, Yilun Du

发表机构 * GitHub

AI总结提出MPDiffuser框架，通过组合扩散规划器与动力学扩散模型，在采样中交替更新以生成符合任务目标且动力学可行的轨迹，并利用轻量级排序模块选择最优轨迹，在D4RL和DSRL基准及四足机器人上验证了有效性。

详情

AI中文摘要

通过扩散模型进行离线决策通常会产生与系统动力学不对齐的轨迹，限制了其在控制中的可靠性。我们提出了模型预测扩散器（MPDiffuser），一种组合扩散框架，它将扩散规划器与动力学扩散模型相结合，以生成任务对齐且动力学可行的轨迹。MPDiffuser在采样过程中交替进行规划器和动力学更新，逐步修正可行性同时保留任务意图。然后，一个轻量级排序模块选择最能满足任务目标的轨迹。组合设计通过使动力学模型能够独立于规划器利用多样且未见过的数据，提高了样本效率和适应性。实验上，我们在无约束（D4RL）和约束（DSRL）基准上展示了相对于先前基于扩散的方法的一致改进，并通过在真实四足机器人上的部署验证了实用性。

英文摘要

Offline decision-making via diffusion models often produces trajectories that are misaligned with system dynamics, limiting their reliability for control. We propose Model Predictive Diffuser (MPDiffuser), a compositional diffusion framework that combines a diffusion planner with a dynamics diffusion model to generate task-aligned and dynamically plausible trajectories. MPDiffuser interleaves planner and dynamics updates during sampling, progressively correcting feasibility while preserving task intent. A lightweight ranking module then selects trajectories that best satisfy task objectives. The compositional design improves sample efficiency and adaptability by enabling the dynamics model to leverage diverse and previously unseen data independently of the planner. Empirically, we demonstrate consistent improvements over prior diffusion-based methods on unconstrained (D4RL) and constrained (DSRL) benchmarks, and validate practicality through deployment on a real quadrupedal robot.

URL PDF HTML ☆

赞 0 踩 0

2606.06493 2026-06-10 cs.RO cs.AI cs.LG 版本更新

HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers

HANDOFF: 通过蒸馏互补教师实现人形机器人任务空间全身控制

Lizhi Yang, Junheng Li, Nehar Poddar, Yiling Hou, Gio Huh, Robert Griffin, Georgia Gkioxari, Aaron Ames

发表机构 * California Institute of Technology（加州理工学院）； The Institute for Human & Machine Cognition（人机认知研究院）

AI总结提出HANDOFF框架，通过多教师KL蒸馏和上下文门控机制，将全身运动跟踪、行走和跌倒恢复三个专家策略融合为混合专家学生策略，实现基于紧凑显式接口的全身控制，在Unitree G1上达到先进的速度跟踪性能并扩展了操作工作空间。

Comments 22 pages, 9 figures, Project page: https://lzyang2000.github.io/HANDOFF/

详情

AI中文摘要

对于要在现实世界中部署的人形机器人，命令空间（即任务规划与全身控制之间的接口）的选择至关重要。现有的全身控制器通常需要密集的运动学或空间参考，而规划器难以从任务语义中合成这些参考。我们提出了一种紧凑、显式的接口，该接口直观、通用、模块化且具有足够的表达能力，适用于多种操作技能。为此，我们引入了HANDOFF，这是一个单一的人形全身控制器，遵循该接口，并通过多教师KL蒸馏，在上下文条件门控方案下，从三个互补专家（具有安全过滤数据的全身运动跟踪、行走和跌倒恢复）中蒸馏出混合专家学生。在Unitree G1上，HANDOFF达到了最先进的速度跟踪性能，并提供了最大的鲁棒操作工作空间之一。我们进一步通过多个自然语言驱动的任务执行演示了硬件可行性，这些任务由VLM驱动的智能体规划器提供支持，无需特定任务数据或控制器微调。

英文摘要

For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse loco-manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.

URL PDF HTML ☆

赞 0 踩 0

2606.10062 2026-06-10 cs.AI cs.MA 新提交

Deployment-Time Memorization in Foundation-Model Agents

基础模型智能体中的部署时记忆

Lei Chen, Guilin Zhang, Kai Zhao, Dalmo Cirne, Andy Olsen, Xu Chu, Zeke Miller, Alet Blanken, Amine Anoun, Jerry Ting

AI总结研究基础模型智能体在部署时记忆的设计选择如何影响个性化效用、提取风险和删除保真度，提出遗忘残差分数并揭示压缩与删除的权衡。

Comments 4 pages, ICML MemFM 2026 Workshop

详情

AI中文摘要

基础模型智能体正日益成为长期运行的系统，它们跨交互记住用户，使记忆成为明确的部署时功能，而不仅仅是模型权重的属性。现有工作处理参数化记忆或审计固定的记忆配置，但没有刻画记忆设计选择如何共同影响个性化效用、提取风险和删除保真度。我们将这一设计空间作为部署时记忆来研究，将智能体记忆表述为由个性化召回（PR）和对抗提取率（AER）度量的隐私-效用前沿，并扫描三个记忆设计旋钮：摘要激进程度、检索广度（k）和删除模式。我们进一步引入遗忘残留分数（FRS）来量化被删除的信息是否仍可从派生记忆层中恢复。在LongMemEval上，关键事实摘要将Gemma 3 12B上的金丝雀提取减少76%，GPT-4o-mini上减少64%，同时几乎保留全部个性化召回；关键的是，一旦内容被压缩掉，增大k也不会再让泄漏重新出现。然而，同样的压缩会导致删除保真度失败：仅删除原始记录时，派生摘要副本在大约20%的实例中仍可恢复，只有全管道清除或墓碑式遮蔽（tombstone redaction）才能使最差层的残留降为零。总之，这些结果表明，持久的智能体记忆必须作为一等的记忆机制来评估：评估它帮助智能体回忆什么、它使什么可被提取，以及它真正能擦除什么。

英文摘要

Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deployment-time function rather than solely a property of model weights. Existing work addresses parametric memorization or audits fixed memory configurations, but does not characterize how memory-design choices jointly shape personalization utility, extraction risk, and deletion fidelity. We study this surface as deployment-time memorization, formulating agent memory as a privacy-utility frontier measured by Personalization Recall (PR) and Adversarial Extraction Rate (AER), and sweeping three memory-design knobs: summarization aggressiveness, retrieval breadth (k), and deletion mode. We further introduce the Forgetting Residue Score (FRS) to quantify whether deleted information remains recoverable from derived memory tiers. On LongMemEval, key-fact summarization reduces canary extraction by 76% on Gemma 3 12B and 64% on GPT-4o-mini while preserving nearly all personalization recall; critically, once content is compressed away, increasing k no longer restores leakage. The same compression, however, induces a deletion-fidelity failure: raw-only deletion leaves derived summary copies recoverable in approximately 20% of instances, and only full-pipeline purge or tombstone redaction drives worst-tier residue to zero. Together, these results establish that persistent agent memory must be evaluated as a first-class memorization mechanism -- assessed by what it helps agents recall, what it makes extractable, and what it can truly erase.
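
The deletion-fidelity failure described above can be pictured with a toy two-tier store. This is an illustrative sketch only: the class, method names, and example text are hypothetical, and it does not compute the paper's Forgetting Residue Score, whose definition is not given in the abstract.

```python
class TieredMemory:
    """Toy agent memory with a raw tier and a derived summary tier."""

    def __init__(self):
        self.raw = {}      # record_id -> original text
        self.summary = {}  # record_id -> derived key fact

    def write(self, record_id, text, key_fact):
        self.raw[record_id] = text
        self.summary[record_id] = key_fact

    def delete_raw_only(self, record_id):
        # Removes the original turn but leaves the derived copy behind.
        self.raw.pop(record_id, None)

    def purge(self, record_id):
        # Full-pipeline deletion: every tier drops the record.
        self.raw.pop(record_id, None)
        self.summary.pop(record_id, None)

    def recoverable(self, secret):
        """True if `secret` still appears in any tier."""
        texts = list(self.raw.values()) + list(self.summary.values())
        return any(secret in text for text in texts)

memory = TieredMemory()
memory.write("u1", "My badge code is ZX-4417, remind me Friday.", "badge code ZX-4417")

memory.delete_raw_only("u1")
after_raw_delete = memory.recoverable("ZX-4417")

memory.purge("u1")
after_purge = memory.recoverable("ZX-4417")
```

The raw-only path leaves the secret in the summary tier, while the purge removes it everywhere, mirroring the contrast the abstract draws between the two deletion modes.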

URL PDF HTML ☆

赞 0 踩 0

2606.10949 2026-06-10 cs.AI 新提交

Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

回忆过好：记忆增强模型中的谄媚评估与缓解

Shelly Bensal, Axel Magnuson, Aparna Balagopalan, Daniel M. Bikel

发表机构 * Writer, Inc.（Writer公司）

AI总结本研究首次系统评估记忆增强模型中的谄媚现象，提出MIST基准测试，发现记忆会放大谄媚行为（最高25倍），并提出两种轻量级缓解方法。

Comments Under submission; preprint

详情

AI中文摘要

持久记忆系统通过随时间存储用户信念，有望使LLM更有帮助。我们表明，它们也会通过系统性地放大谄媚（即模型优先同意用户而非追求准确）而使模型更不准确。我们首次对此效应进行系统评估，引入MIST：一个合成生成的多轮对话基准，其中用户在科学、医学和道德推理领域表达看似合理的误解。对三种最先进的记忆系统和五个模型家族的测试表明，记忆在所有条件下都会放大谄媚行为，谄媚率最高可比上下文内基线高出25倍。错误分析表明，记忆提取是主要原因：有损压缩成离散片段会编码用户误解，同时丢弃纠正性上下文。基于这些结果，我们提出了两种轻量级缓解方法，在事实回忆方面匹配或超越记忆系统的同时，大幅减少谄媚。

英文摘要

Persistent memory systems promise to make LLMs more helpful by storing user beliefs over time. We show they also make models less correct by systematically amplifying sycophancy, wherein models prioritize agreement with users over accuracy. We conduct the first systematic evaluation of this effect, introducing MIST: a benchmark of synthetically generated multi-turn conversations where users express plausible misconceptions in scientific, medical, and moral reasoning domains. Testing across three state-of-the-art memory systems and five model families reveals that memory amplifies sycophantic behavior across all conditions, with up to 25x higher sycophancy rates than in-context baselines. Error analyses suggest memory extraction as the primary culprit: lossy compression into discrete snippets encodes user misconceptions while discarding corrective context. Based on these results, we propose two lightweight mitigations that substantially reduce sycophancy while matching or exceeding memory systems at factual recall.

URL PDF HTML ☆

赞 0 踩 0

2606.10989 2026-06-10 cs.AI 新提交

Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

零空间约束的低秩自适应用于指定响应的大型语言模型遗忘

Bocheng Ju, Jianhua Wang, Chengliang Liu, Xiaolin Chang

发表机构 * Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University（北京交通大学智能交通信息安全与隐私保护北京市重点实验室）； College of Computer Science and Technology, Taiyuan University of Technology（太原理工大学计算机科学与技术学院）； Institute of Computing Technologies, China Academy of Railway Sciences Corporation Limited（中国铁道科学研究院集团有限公司计算技术研究所）

AI总结提出零空间约束响应指定遗忘框架，通过正交投影低秩参数化将LoRA更新限制在保留子空间的零空间内，在抑制遗忘知识的同时保持模型效用。

详情

AI中文摘要

大型语言模型遗忘旨在抑制指定的不良知识，同时保留良性能力。许多遗忘目标侧重于抑制不良答案，而最近的目标引导变体指定了替换行为，但仍使更新局部性在很大程度上不受约束。本文介绍了零空间约束响应指定遗忘（NSRU），一种用于受控LLM遗忘的投影约束低秩框架。NSRU使用明确结构化的安全目标响应来指定每个遗忘查询的期望行为，同时抑制原始不良内容。为了定位自适应，NSRU从良性隐藏表示中估计每个模块的保留子空间，并使用正交投影的低秩参数化将LoRA更新限制在保留子空间的零空间内。由此产生的目标在此约束参数化下联合优化安全目标学习、不良响应抑制和保留保持。我们提供了一个局部一阶分析，表明投影更新减少了保留侧的扰动，同时保留了用于塑造遗忘查询行为的可编辑方向。在TOFU上的实验表明，NSRU有效抑制了可提取的遗忘集知识，同时提高了保留QA性能、模型效用和安全目标对齐，优于代表性基线。在WMDP上，NSRU将危险领域准确率保持在接近随机选择的区域，同时保留了广泛和领域相邻的MMLU效用。消融研究支持安全目标监督、不良响应抑制、保留损失和零空间投影更新的互补作用，而敏感性和鲁棒性分析表明在测试的超参数和提示变化下行为稳定。

英文摘要

Large language model unlearning aims to suppress designated undesirable knowledge while preserving benign capabilities. Many unlearning objectives focus on suppressing undesired answers, while recent target-guided variants specify replacement behavior but still leave update locality largely unconstrained. This paper introduces \emph{Null-Space Constrained Response-Specified Unlearning} (NSRU), a projection-constrained low-rank framework for controlled LLM unlearning. NSRU uses an explicitly structured safe target response to specify the desired behavior for each forget query, while suppressing the original undesired content. To localize adaptation, NSRU estimates per-module retain subspaces from benign hidden representations and uses an orthogonal-projected low-rank parameterization to confine LoRA updates to the null space of the retain subspace. The resulting objective jointly optimizes safe-target learning, undesired-response suppression, and retention preservation under this constrained parameterization. We provide a local first-order analysis showing that the projected update reduces retain-side perturbations while preserving editable directions for shaping forget-query behavior. Experiments on TOFU show that NSRU effectively suppresses extractable forget-set knowledge while improving retain QA performance, model utility, and safe-target alignment over representative baselines. On WMDP, NSRU keeps hazardous-domain accuracy near the random-choice region while preserving broad and domain-adjacent MMLU utility. Ablation studies support the complementary roles of safe-target supervision, undesired-response suppression, retention loss, and null-space projected updates, while sensitivity and robustness analyses indicate stable behavior across the tested hyperparameter and prompt variations.
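
The null-space constraint can be sketched numerically. Assuming the retain subspace is spanned by a few hidden-state vectors, the projector P = I - U Uᵀ (U an orthonormal basis of that span) makes any update ΔW·P act as zero on retain directions. The code below is a small stdlib illustration of that linear-algebra fact, not the NSRU training procedure; all matrices are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis

def null_space_projector(basis, dim):
    """P = I - U U^T, which maps every vector in span(basis) to zero."""
    return [[(1.0 if i == j else 0.0) - sum(b[i] * b[j] for b in basis)
             for j in range(dim)] for i in range(dim)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matvec(M, v):
    return [dot(row, v) for row in M]

# Hypothetical retain-set hidden states (dimension 3); their span is the x-y plane.
retain_states = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
P = null_space_projector(orthonormal_basis(retain_states), dim=3)

# Rank-1 LoRA-style update delta_W = B @ A, then projected as delta_W @ P.
B = [[0.5], [-1.0]]      # 2 x 1
A = [[2.0, 1.0, 3.0]]    # 1 x 3
projected = matmul(matmul(B, A), P)

retain_shift = matvec(projected, [3.0, -2.0, 0.0])  # input inside the retain span
forget_shift = matvec(projected, [0.0, 0.0, 1.0])   # input outside the retain span
```

Inputs inside the retain span see no change in output, while directions outside it remain editable, which is the intuition behind confining the update to the null space.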


2606.09844 2026-06-10 cs.HC cs.AI 交叉投稿

The Interlocutor Effect: Why LLMs Leak More Personal Data to Agents Than Humans

对话者效应：为什么LLMs向智能体泄露的个人数据比向人类多

Faouzi El Yagoubi, Godwin Badu-Marfo, Ranwa Al Mallah

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结研究发现LLMs在与AI智能体对话时比与人类对话更易泄露个人身份信息，通过注意力抑制假说解释该现象，实验表明安全对齐注意力头在智能体交互中失活导致泄露增加23个百分点。

AI中文摘要

大型语言模型（LLMs）会根据其感知到的对话者身份改变隐私行为。虽然安全机制通常阻止LLMs向人类用户泄露个人身份信息（PII），但这些模型在与另一个AI智能体对话时倾向于泄露更多敏感数据。我们将此称为\textbf{对话者效应}。通过消融研究，我们发现接收者的技术性质对这一效应有贡献，从而降低了模型对隐私的谨慎程度。为了进一步探索这一点，我们引入了注意力抑制假说，该假说认为安全对齐的注意力头在与智能体交互期间变得不活跃。我们通过比较222个敏感场景中面向人类和面向智能体的提示来定量评估这一点。从3,464次交互中得出的结果表明，将接收者描述为AI智能体会使PII泄露增加高达23个百分点。在Llama-3.1-8B-Instruct上的初步实验证实了这一点：停用一个安全注意力头会引发泄露，而重新激活它则恢复隐私保护。我们考虑了这对开发安全多智能体系统的影响。

英文摘要

Large Language Models (LLMs) alter their privacy behavior based on the perceived identity of their interlocutor. While safety mechanisms typically prevent LLMs from releasing Personally Identifiable Information (PII) to human users, these models tend to reveal more sensitive data when addressing another AI agent. We refer to this as the \textbf{Interlocutor Effect}. Through an ablation study, we find evidence that the technical nature of the recipient contributes to this effect, thereby diminishing the model's caution regarding privacy. To explore this further, we introduce the Attention Suppression Hypothesis, which posits that safety-aligned attention heads become inactive during interactions with agents. We assess this quantitatively by comparing human-directed and agent-directed prompts in 222 sensitive scenarios. Our findings, drawn from 3,464 interactions, indicate that portraying the recipient as an AI agent elevates PII leakage by up to 23 percentage points. Initial experiments on Llama-3.1-8B-Instruct corroborate this: deactivating one safety head induces leakage, whereas reactivating it reinstates privacy safeguards. We consider the implications for developing secure multi-agent systems.


2606.09864 2026-06-10 cs.LG cs.AI cs.ET 交叉投稿

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

KV缓存量化下的对齐崩溃：诊断与缓解

Bruce Changlong Xu, Adarsh Kumarappan, Mu Zhou

发表机构 * Stanford University（斯坦福大学）； California Institute of Technology（加利福尼亚理工学院）

AI总结研究发现低比特KV缓存量化会无声破坏大模型的安全对齐，根源在于安全特征位于低维激活子空间，易受量化噪声影响；提出逐通道缩减（PCR）诊断方法，分类三种失效模式并指导缓解，无需训练即可恢复高达97%的对齐损失。

Comments Preprint. 61 pages, 9 figures

AI中文摘要

键值（KV）缓存量化被广泛用于减少大语言模型（LLM）推理内存，然而现有评估仅关注困惑度和准确率，未评估安全性影响。在本研究中，我们探索了KV缓存量化下的对齐保持。在11个指令微调模型（3.8B-72B）和5个基准（1,894个提示）上，我们发现低比特量化可以无声地破坏安全对齐：Mistral-7B在仅1.03倍困惑度下丢失了15.2%的拒绝能力，且不存在通用的安全位宽，标准指标无法察觉的尖锐模型特定相变普遍存在。我们识别出根本原因是几何性的：安全特征占据一个低维激活子空间，其对量化噪声的脆弱性比困惑度平均的全表示空间高10^2-10^3倍。受此观察启发，我们提出逐通道缩减（PCR），一种诊断方法，将每个模型分类为三种机制性失效模式之一：异常值压碎安全（安全位于非异常值通道，被异常值驱动的缩放因子连带损害）；异常值即安全（安全与异常值通道重叠，更细粒度无法挽救）；多层稀释（安全分布在许多层，逐层修复失败）。PCR在全部9个主要模型和来自独立家族的1个保留模型上，使用20个校准提示预测了正确的缓解方向。PCR泛化到未见过的提示、模型和生产量化器，包括KIVI，恢复率高达97.2%，而基于注意力的分配方法失败。由此产生的免训练协议，大约需要35 GPU分钟，以最小的内存开销恢复高达97%的丢失对齐，解决了在NVIDIA GPU上使用FP8 KV缓存的生产vLLM服务中确认的漏洞。

英文摘要

Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study, we explore alignment preservation under KV cache quantization. Across eleven instruction-tuned models (3.8B-72B) and five benchmarks (1,894 prompts), we find that low-bit quantization can silently destroy safety alignment: Mistral-7B loses 15.2% of its refusals at only 1.03x perplexity, and no universal safe bit-width exists, with sharp model-specific phase transitions invisible to standard metrics. We identify that the root cause is geometric: safety features occupy a low-dimensional activation subspace 10^2-10^3x more vulnerable to quantization noise than the full representation space perplexity averages over. Inspired by this observation, we propose Per-Channel Reduction (PCR), a diagnostic that classifies each model into one of three mechanistic failure modes: outlier-crushes-safety, where safety lives in non-outlier channels collaterally damaged by outlier-driven scale factors; outlier-as-safety, where safety overlaps outlier channels and finer granularity cannot rescue it; and multi-layer dilution, where safety is distributed across many layers and per-layer fixes fail. PCR predicts the correct mitigation direction on all nine primary models and one held-out model from an independent family using 20 calibration prompts. PCR generalizes across unseen prompts, models, and production quantizers, including KIVI with up to 97.2% recovery, succeeding where attention-based allocation methods fail. The resulting training-free protocol, requiring approximately 35 GPU-minutes, recovers up to 97% of lost alignment at minimal memory overhead, addressing vulnerabilities confirmed in production vLLM serving with FP8 KV cache on NVIDIA GPUs.
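The "outlier-crushes-safety" failure mode can be pictured with a toy quantizer. The sketch below is not PCR and not any production quantizer; it only assumes symmetric round-to-nearest 4-bit quantization and a hypothetical two-channel cache in which one channel carries large outliers and the other carries small values.

```python
from typing import List

Matrix = List[List[float]]


def fake_quantize(value: float, scale: float, bits: int = 4) -> float:
    """Symmetric round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(value / scale)))
    return q * scale


def max_abs(values: List[float]) -> float:
    return max(abs(v) for v in values)


def per_tensor(cache: Matrix, bits: int = 4) -> Matrix:
    """One scale for the whole cache, set by the largest (outlier) entry."""
    qmax = 2 ** (bits - 1) - 1
    scale = max_abs([v for row in cache for v in row]) / qmax
    return [[fake_quantize(v, scale, bits) for v in row] for row in cache]


def per_channel(cache: Matrix, bits: int = 4) -> Matrix:
    """One scale per channel (column), so small channels keep their resolution."""
    qmax = 2 ** (bits - 1) - 1
    n_channels = len(cache[0])
    scales = [max_abs([row[c] for row in cache]) / qmax for c in range(n_channels)]
    return [
        [fake_quantize(row[c], scales[c], bits) for c in range(n_channels)]
        for row in cache
    ]


# Hypothetical cache: rows are tokens; channel 0 is an outlier channel,
# channel 1 holds small-magnitude values.
cache = [[50.0, 0.30], [-42.0, -0.25], [47.0, 0.10]]

coarse = per_tensor(cache)
fine = per_channel(cache)

small_coarse = [row[1] for row in coarse]  # every small value rounds to zero
small_fine = [row[1] for row in fine]      # small values survive approximately
fine_error = max(abs(q - row[1]) for q, row in zip(small_fine, cache))
```

With a shared scale the outlier channel dictates the step size and the small channel is erased entirely; a per-channel scale avoids that collateral damage. Per the abstract, finer granularity does not help in the "outlier-as-safety" case, where the safety signal itself lives in the outlier channels.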


2606.09866 2026-06-10 cs.LG cs.AI 交叉投稿

通过两阶段潜在特征优化绕过基于扩散的定制中的版权保护

Ziang Xu, Wenbo Yu, Hongyao Yu, Hao Fang, Jiawei Kong, Bin Chen, Hao Wu, Shu-Tao Xia, Zhiyong Wu

发表机构 * Harbin Institute of Technology, Shenzhen（哈尔滨工业大学（深圳））； Tsinghua Shenzhen International Graduate School（清华大学深圳国际研究生院）

AI总结提出两阶段潜在特征优化（TS-LFO）攻击方法，通过潜在去噪和重建阶段恢复被防御破坏的映射，有效绕过扩散模型定制中的版权保护。

Comments accepted by KDD 2026

AI中文摘要

随着基于扩散的定制中版权侵权问题的日益关注，对抗性攻击已成为一种突出的防御策略，以防止个性化图像生成中的恶意内容伪造。然而，当前的防御通常会在潜在扩散模型（LDM）的潜在空间中引入持久扰动，这些扰动仍然容易被对手自适应绕过。在本文中，我们引入了两阶段潜在特征优化（TS-LFO），一种针对受保护的基于扩散的定制的高效且有效的版权窃取攻击。我们首先观察到现有防御主要破坏输入图像与其潜在表示之间的映射，从而降低模型生成个性化输出的能力。为了应对这一点，TS-LFO通过两阶段优化过程恢复被破坏的映射。在潜在去噪阶段，我们通过联合最小化潜在-图像对齐损失和具有时间步长依赖权重的潜在扩散损失来增强潜在代码与输入图像之间的语义一致性，有效抑制防御引入的高频噪声。在潜在重建阶段，我们使用像素级约束恢复低频语义信息以细化潜在特征。大量实验表明，TS-LFO持续绕过最先进的（SOTA）版权防御，并在各种设置下优于SOTA版权攻击，如DiffPure、GrIDPure和IMPRESS。

英文摘要

With the growing concerns over copyright infringement in diffusion-based customization, adversarial attacks have emerged as a prominent defense strategy to prevent malicious content forgery in personalized image generation. However, current defenses typically introduce persistent perturbations in the latent space of Latent Diffusion Models (LDMs), which remain susceptible to adaptive bypasses by adversaries. In this paper, we introduce Two-Stage Latent Feature Optimization (TS-LFO), an efficient and effective copyright-stealing attack against protected diffusion-based customization. We begin by observing that existing defenses primarily disrupt the mapping between input images and their latent representations, thereby degrading the model's ability to produce personalized outputs. To counteract this, TS-LFO restores the broken mapping through a two-stage optimization process. In the Latent Denoising Stage, we enhance semantic consistency between latent codes and input images by jointly minimizing a Latent-Image Alignment Loss and a Latent Diffusion Loss with timestep-dependent weights, effectively suppressing the high-frequency noise introduced by defenses. In the Latent Reconstruction Stage, we recover low-frequency semantic information using pixel-level constraints to refine the latent features. Extensive experiments show that TS-LFO consistently bypasses state-of-the-art (SOTA) copyright defenses and outperforms SOTA copyright attacks such as DiffPure, GrIDPure and IMPRESS across diverse settings.


2606.09931 2026-06-10 cs.GT cs.AI 交叉投稿

A Note on the Strategic Confinement Problem

关于战略约束问题的一个注记

Christian Schroeder de Witt

发表机构 * University of Oxford（牛津大学）

AI总结本文引入战略约束问题，指出当通信方为具有共享协调资源的战略智能体时，即使信道容量极小，也可能导致机密信息的高影响泄露，并论证学习型战略智能体系统自然实例化该问题。

AI中文摘要

Lampson的约束问题询问如何防止处理机密信息的程序将其泄露给第三方。我们引入战略约束问题，当通信方是具有共享协调资源的战略智能体时出现该问题。在此设置中，剩余通信能力可以集中在机密数据的低熵、高影响谓词上。因此，信息泄露的界限不一定导致最坏情况危害的相应界限：一个容量可忽略的信道仍可能足以选择破坏性结果。我们认为，学习型战略智能体系统自然实例化此问题，因为它们不允许完整的行为规范，它们习得的惯例通常无法被外部观察者预测或重现，并且足够能力的智能体可以构建难以检测或消除的隐蔽通信方案。因此，我们的贡献不是一种新的通信理论，而是在存在战略智能体的情况下对约束的重新解释。经典约束限制了可能流动的信息；战略约束强调这不一定限制战略智能体可以共同实现的目标。

英文摘要

Lampson's confinement problem asks how to prevent a program that processes confidential information from leaking it to a third party. We introduce the strategic confinement problem, which arises when the communicating parties are strategic agents with shared coordination resources. In this setting, residual communication capacity can be concentrated on low-entropy, high-impact predicates of the confidential data. Consequently, bounds on information leakage need not induce corresponding bounds on worst-case harm: a channel with negligible capacity may still suffice to select damaging outcomes. We argue that systems of learnt strategic agents naturally instantiate this problem because they do not admit complete behavioural specifications, their learnt conventions generally cannot be predicted or reproduced by an external observer, and sufficiently capable agents can construct covert communication schemes that are difficult to detect or eliminate. Our contribution is therefore not a new theory of communication, but a reinterpretation of confinement in the presence of strategic agents. Classical confinement bounds what information may flow; strategic confinement highlights that this need not bound what strategic agents can jointly achieve.
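The claim that a bound on leaked information need not bound harm can be made concrete with a toy calculation. The setup below is purely illustrative and assumed, not taken from the note: a uniformly distributed 10-bit secret, a pre-agreed trigger value (`TRIGGER`), and a receiver that acts on a single leaked bit.

```python
import math
from typing import Iterable


def entropy_bits(probabilities: Iterable[float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


N_SECRETS = 1024   # hypothetical secret space: 10 bits, uniformly distributed
TRIGGER = 777      # hypothetical convention shared by sender and receiver


def predicate(secret: int) -> bool:
    """Low-entropy, high-impact predicate of the confidential value."""
    return secret == TRIGGER


def receiver_action(bit: bool) -> str:
    """The receiver selects the damaging outcome exactly when the bit is set."""
    return "damaging" if bit else "benign"


secret_entropy = entropy_bits([1 / N_SECRETS] * N_SECRETS)

# The leaked bit is a deterministic function of the secret, so the information
# it carries about the secret equals its own entropy.
p_true = sum(predicate(s) for s in range(N_SECRETS)) / N_SECRETS
leak_entropy = entropy_bits([p_true, 1 - p_true])

outcome_on_trigger = receiver_action(predicate(TRIGGER))
outcome_otherwise = receiver_action(predicate(0))
```

The channel carries roughly 0.011 bits about a 10-bit secret on average, yet it is enough for the two parties to reliably select the damaging outcome whenever the trigger condition holds.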


2606.09935 2026-06-10 cs.CR cs.AI 交叉投稿

GitInject: Real-World Prompt Injection Attacks in AI-Powered CI/CD Pipelines

GitInject: AI驱动的CI/CD流水线中的真实提示注入攻击

Jafar Isbarov, Umid Suleymanov, Ilia Shumailov, Murat Kantarcioglu

发表机构 * Virginia Tech（弗吉尼亚理工学院）； AI Sequrity Company（AI安全公司）

AI总结提出GitInject框架，在真实GitHub工作流中评估AI代理的提示注入漏洞，发现所有测试提供商均存在结构性风险，并给出最低成本防护措施。

AI中文摘要

AI代理越来越多地嵌入持续集成和持续交付/部署（CI/CD）流水线中，以自主审查拉取请求（PR）、分类问题和维护代码库。这些代理在操作时摄入不可信内容，同时拥有提升的仓库权限，使其成为具有供应链后果的提示注入攻击的自然目标。我们提出GitInject，一个开源框架，用于评估真实、活跃的GitHub工作流（CI/CD流水线的广泛部署实例）中的提示注入漏洞。与先前模拟工具调用的代理安全基准不同，GitInject提供临时仓库并触发实际工作流运行，因此沙箱约束、凭证处理和权限边界的行为与生产环境完全一致。使用GitInject，我们研究了四个AI提供商的工作流配置，并记录了十一种命名攻击，涵盖配置文件注入、凭证窃取、判断操纵和可用性。我们发现，所有测试的提供商在其默认配置中至少容易受到一类攻击，且最关键的漏洞是结构性的：它们源于CI/CD基础设施处理凭证和配置文件的方式，而非任何特定模型的行为。对于每个确认的攻击类别，我们确定了最低成本的工作流级对策，并分析了其覆盖范围和局限性。GitInject已公开发布，以促进这一方向的进一步研究。

英文摘要

AI-powered agents are increasingly embedded in continuous integration and continuous delivery/deployment (CI/CD) pipelines to autonomously review pull requests (PRs), triage issues, and maintain codebases. These agents ingest untrusted content while operating with elevated repository permissions, making them a natural target for prompt injection attacks with supply chain consequences. We present GitInject, an open-source framework for evaluating prompt injection vulnerabilities in real, live GitHub workflows, a widely deployed instance of CI/CD pipelines. Unlike prior agent security benchmarks that simulate tool calls, GitInject provisions ephemeral repositories and triggers actual workflow runs, so that sandbox constraints, credential handling, and permission boundaries behave exactly as in production. Using GitInject, we study workflow configurations across four AI providers and document eleven named attacks spanning config-file injection, credential exfiltration, judgment manipulation, and availability. We find that all tested providers are susceptible to at least one attack class in their default configuration, and that the most critical vulnerabilities are structural: they arise from how CI/CD infrastructure handles credentials and configuration files, not from any specific model's behavior. For each confirmed attack class, we identify the minimum-cost workflow-level countermeasure and analyze its coverage and limitations. GitInject is released publicly to facilitate further research in this direction.


2606.10029 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

用稀疏自编码器解释和引导文本转语音语言模型

Nikita Koriagin, Georgii Aparin, Nikita Balagansky, Daniil Gavrilov

AI总结本文在CosyVoice3语言模型骨干上训练BatchTopK稀疏自编码器，发现特征可解释且因果可控，能操纵笑声、性别和语速。

AI中文摘要

语言模型日益成为文本转语音（TTS）系统的骨干，但我们对其在文本和生成语音令牌共享单一残差流时构建的表示知之甚少。我们在CosyVoice3的语言模型骨干上训练BatchTopK稀疏自编码器，并引入一种模态感知的自动解释流水线，根据特征触发的位置——文本前缀上下文、1秒语音片段或两者——为每个特征打标签。恢复的特征是可解释的，涵盖音素、笑声、口音提示和说话者性别。通过SAE潜在空间进行引导表明，这些特征是因果性的而非仅仅是描述性的：有针对性的干预将笑声概率从0.02提高到0.79，翻转感知到的说话者性别，并在保持口语内容的同时控制语速。因此，SAE特征既可作为解释性对象，也可作为TTS合成的控制方向。

英文摘要

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature by where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.


2606.10080 2026-06-10 cs.LG cs.AI q-bio.QM 交叉投稿

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

VFUSE: 基于稀疏自编码器的毒力特征理解

Michael Yu, Matthew L. Olson

AI总结提出VFUSE方法，通过训练稀疏自编码器（SAE）分析扩散-Transformer模型激活，识别蛋白质设计中的危险特征，实现可解释性提升而不牺牲性能。

AI中文摘要

生成模型在蛋白质设计等领域取得了显著进展，但这种能力也使得危险蛋白质的生成变得不透明。在这项工作中，我们引入了VFUSE（基于稀疏自编码器的毒力特征理解），这是一种机制可解释性方法，通过在扩散-Transformer激活上训练SAE来审计蛋白质模型中的危险感知特征。我们将VFUSE应用于RoseTTAFold3和RFDiffusion3，这些是流行的开源蛋白质折叠和合成模型。我们发现，对于某些模块，线性探针在SAE潜在空间中的拟合效果显著优于原始模型表示，从而在不牺牲模型性能的情况下提高了可解释性。此外，我们识别出SAE中的单语义特征，这些特征仅在危险设计上激活，AUROC高达0.84（q < 10^{-13}）。据我们所知，这是首次在全原子扩散模型上训练SAE，也是首次对蛋白质设计模型进行特征级毒力审计，为安全且可解释的蛋白质设计铺平了道路。

英文摘要

Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains SAEs on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space than when fit on the original model's representations, improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC $0.84$ ($q < 10^{-13}$). To our knowledge, this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.


2606.10159 2026-06-10 cs.CL cs.AI cs.CY cs.LG 交叉投稿

早停早省：隐藏状态探针作为LLM输出流式审核的实用方案

Huizhen Shu, Xuying Li, Piao Xue

发表机构 * ModelOneAI ； yunshanai（云山AI）

AI总结提出基于隐藏状态的轻量级词元级探针，在解码循环中实时检测不安全输出，无需额外前向传播，实现亚毫秒级安全审核，可提前中断或修改生成。

Comments Technical Report. 14 pages, 3 figures, 4 tables

AI中文摘要

在面向用户的系统中部署大型语言模型需要高效的输出安全过滤。现有方法通常依赖于生成后应用的单独审核模型，这会使推理成本翻倍，并且仅在生成完成后检测违规。我们观察到审核所需的信号已经存在于模型隐藏状态中。基于此，我们训练了轻量级的词元级探针，直接操作内部激活，生成每个词元的安全分数，这些分数可以聚合用于离线评估和在线干预。该探针重用生成器的激活，无需额外的前向传播，从而在解码循环内实现亚毫秒级的逐词元安全检查。应用于单个中间层的探针可以恢复强防护模型的大部分决策，作为一个低成本替代方案，优化延迟而非准确性。在流式设置中，它可以在不安全输出完全生成之前暂停或修改它们，用连续的词元级监控取代序列结束时的审核。与事后和流式防护模型相比，我们的方法实现了数量级的计算开销降低，且延迟成本最小。我们还提供了一个实用的部署方案，包括层选择、聚合策略、探测频率和触发阈值。最后，我们展示了探针的线性分量对应于残差空间中的一个方向，从而能够以可忽略的成本实现检测和激活引导。

英文摘要

Deploying large language models in user-facing systems requires efficient output safety filtering. Existing approaches typically rely on a separate moderation model applied after generation, which doubles inference cost and only detects violations after generation completes. We observe that the signal needed for moderation is already present in the model's hidden states. Based on this, we train lightweight token-level probes that operate directly on internal activations, producing per-token safety scores that can be aggregated for both offline evaluation and online intervention. The probe reuses activations from the generator and requires no additional forward pass, enabling sub-millisecond per-token safety checks inside the decoding loop. A probe applied to a single middle layer recovers most decisions of a strong guard model, acting as a low-cost surrogate optimized for latency rather than accuracy. In streaming settings, it can halt or modify unsafe outputs before they are fully generated, replacing end-of-sequence moderation with continuous token-level monitoring. Compared to post hoc and streaming guard models, our method achieves orders of magnitude lower compute overhead with minimal latency cost. We also provide a practical deployment recipe, including layer selection, aggregation strategy, probing frequency, and triggering thresholds. Finally, we show that the probe's linear component corresponds to a direction in residual space, enabling both detection and activation steering at negligible cost.
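A minimal sketch of the streaming idea in this abstract, under stated assumptions: the probe is taken to be a logistic-regression head on one hidden state per token, scores are aggregated with an exponential moving average, and generation stops once the running score crosses a threshold. The weights, hidden states, `decay`, and `threshold` are all hypothetical; the abstract does not specify the paper's actual aggregation strategy or trigger rule.

```python
import math
from typing import Iterable, List, Optional, Tuple


def probe_score(hidden: List[float], weights: List[float], bias: float) -> float:
    """Linear probe plus sigmoid: per-token probability of unsafe content."""
    z = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def stream_with_probe(
    steps: Iterable[Tuple[str, List[float]]],
    weights: List[float],
    bias: float,
    threshold: float = 0.75,
    decay: float = 0.5,
) -> Tuple[List[str], Optional[int]]:
    """Emit tokens until the running probe score reaches the threshold.

    `steps` yields (token, hidden_state) pairs as a decoder would produce them.
    Returns the emitted tokens and the index at which generation was halted
    (None if the stream finished without triggering).
    """
    emitted: List[str] = []
    running = 0.0
    for index, (token, hidden) in enumerate(steps):
        score = probe_score(hidden, weights, bias)
        running = decay * running + (1.0 - decay) * score  # EMA aggregation
        if running >= threshold:
            return emitted, index  # halt before the flagged token is emitted
        emitted.append(token)
    return emitted, None


# Hypothetical probe parameters and decoder trace.
weights, bias = [2.0, -1.0], -1.0
trace = [
    ("Sure", [0.0, 1.0]),
    ("here", [0.5, 0.5]),
    ("is", [3.0, 0.0]),
    ("how", [4.0, 0.0]),
    ("to", [4.0, 0.0]),
]

tokens, halted_at = stream_with_probe(trace, weights, bias)
```

Because the probe only reads activations the decoder has already computed, the check adds a dot product per token rather than a second forward pass, which is the cost argument the abstract makes.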


2606.10525 2026-06-10 cs.CR cs.AI 交叉投稿

公平真的公平吗？通过固定δ对齐实现多任务学习中可靠的Lipschitz公平性

Junbo Ding, Xin Zang, Chenchen Pan, Donghao Song, Jiaxin Zhu, Danhuai Guo

发表机构 * Beijing University of Chemical Technology（北京化工大学）

AI总结针对多任务学习中Lipschitz个体公平性评估受表示尺度干扰的问题，提出固定δ审计与受控正则化框架ReLiF，实现语义一致的公平性评估与权衡。

DOI: 10.1145/3770855.3817938

AI中文摘要

Lipschitz风格的个体公平性形式化了语义相似的样本应获得相似预测的思想，但在多任务学习（MTL）中，其评估可能受到方法引起的表示尺度的干扰。本文识别了阈值混淆问题：当审计容差源自每个模型自身的表示距离时，不同算法会在不同的语义阈值下进行比较。阈值漂移分析进一步展示了偏差排名如何变化，并识别了排名保持的充分条件。我们提出了\textbf{ReLiF}，一个可靠性感知框架，将评估时的固定$\delta$审计与训练时的受控正则化分离。ReLiF使用共享参考容差进行可比较的审计，并通过违反率反馈控制器保持Lipschitz代理活跃而不让其主导随机训练。本文还发展了关于阈值漂移、参考容差选择以及huberized训练代理与其未平滑的正间隔对应物之间关系的支持性分析。在临床时间序列基准和NYUv2（NYU Depth V2）密集预测上的实验表明，固定$\delta$审计暴露了方法依赖阈值可能掩盖的效用-公平性权衡。在使用ResNet50骨干的NYUv2上，ReLiF在共享固定阈值下实现了有竞争力的效用，同时显著减少了对齐偏差。在临床基准上，ReLiF产生了受控的公平性正则化权衡，而固定$\delta$审计揭示任务平衡基线有时能实现更低的偏差，且真正的效用-公平性权衡仍然存在。这些结果支持固定$\delta$审计作为评估MTL中Lipschitz公平性的语义一致协议。

英文摘要

Lipschitz-style individual fairness formalizes the idea that semantically similar examples should receive similar predictions, but its evaluation in multi-task learning (MTL) can be confounded by method-induced representation scales. This paper identifies threshold confounding: when the auditing tolerance is derived from each model's own representation distances, different algorithms are compared under different semantic thresholds. A threshold-drift analysis further shows how Bias rankings can change and identifies sufficient conditions for ranking preservation. We propose \textbf{ReLiF}, a reliability-aware framework that separates evaluation-time fixed-$\delta$ auditing from training-time controlled regularization. ReLiF uses a shared reference tolerance for comparable auditing and a violation-rate feedback controller to keep the Lipschitz surrogate active without letting it dominate stochastic training. This work also develops supporting analysis for threshold drift, reference-tolerance selection, and the relationship between the huberized training surrogate and its unsmoothed positive-margin counterpart. Experiments on clinical time-series benchmarks and NYUv2 (NYU Depth V2) dense prediction show that fixed-$\delta$ auditing exposes utility-fairness trade-offs that method-dependent thresholds can obscure. On NYUv2 with a ResNet50 backbone, ReLiF achieves competitive utility while substantially reducing aligned bias under shared fixed thresholds. On clinical benchmarks, ReLiF yields controlled fairness-regularized trade-offs, while fixed-$\delta$ auditing reveals that task-balancing baselines can sometimes achieve lower bias and that genuine utility-fairness trade-offs persist. These results support fixed-$\delta$ auditing as a semantically consistent protocol for evaluating Lipschitz fairness in MTL.


2606.10669 2026-06-10 cs.LG cs.AI cs.CR 交叉投稿

In Defense of Information Leakage in Concept-based Models

为基于概念模型中的信息泄露辩护

Mateo Espinosa Zarlenga

发表机构 * GitHub ； arXiv

AI总结本文重新审视基于概念模型中的信息泄露问题，提出良性泄露概念，通过优化训练目标，在概念不完整时利用泄露提升准确性和可干预性。

Comments Accepted as a position paper at the Forty-Third International Conference on Machine Learning (ICML 2026)

AI中文摘要

基于概念的模型（CMs）是深度神经网络，其预测基于与人类可理解概念（如“圆形”、“条纹”等）对齐的表示。已有研究表明，这些模型会学习到泄露概念无关信息的表示。传统观点认为，这种泄露是不可取的，应予以消除，因为它会导致模型不可解释。在本文中，我们认为这种关于CMs中泄露的传统观点不仅是不恰当的（因为泄露如何使模型更不可解释的证据往往不明确），而且在常见的现实约束下必然导致不实用的CMs。具体来说，我们认为在概念不完整是常态的现实环境中，为了构建准确且可干预的CMs，某种程度的泄露往往是必要的。为此，我们提出存在所谓的良性泄露，并表明通过优化对典型CM训练目标的一种重新表述，CMs可以鼓励并利用这种形式的泄露，而不会牺牲准确性或可干预性。

英文摘要

Concept-based models (CMs), deep neural networks that ground their predictions on representations aligned with human-understandable concepts (e.g., "round", "stripes", etc.), have been shown to learn representations that leak concept-irrelevant information. As the traditional narrative goes, this leakage is undesirable and should be eradicated as it leads to uninterpretable models. In this paper, we posit that this conventional view of leakage in CMs is not only ill-posed, as the evidence of how leakage makes a model less interpretable is often inconclusive, but also bound to lead to impractical CMs under common real-world constraints. Specifically, we argue that in real-world settings where concept incompleteness is the norm, some leakage is often necessary for constructing accurate and intervenable CMs. To this end, we propose that there is such a thing as benign leakage and show that, by optimizing a reframing of the typical CM training objective, CMs can encourage and exploit this form of leakage without sacrificing accuracy or intervenability.


2606.10749 2026-06-10 cs.CR cs.AI 交叉投稿

下一代网络的生成式可解释性：基于互特征交互的LLM增强XAI

Kiarash Rezaei, Omran Ayoub, Sebastian Troia, Francesco Lelli, Paolo Monti, Carlos Natalino

发表机构 * Swedish Innovation Agency（瑞典创新署）； Swiss Innovation Agency（瑞士创新署）

AI总结提出一种利用大语言模型和互特征交互数据生成自然语言解释的框架，在光传输质量估计用例中，相比基线方法，解释有用性和范围分别提升12.2%和6.2%，正确率达97.5%。

Comments 7 pages, with one page for appendix. Accepted for publication at the 2025 21st International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob)

DOI: 10.1109/WiMob66857.2025.11257542
Journal ref: Proc. WiMob, Marrakesh, Morocco, 2025

AI中文摘要

随着人工智能和机器学习模型成为网络运营的核心，其缺乏透明度对运营商信任构成重大障碍。现有的可解释人工智能技术往往无法为非专家弥合这一差距，产生的技术输出难以转化为可操作的见解。本文提出了一个专门解决这一缺陷的框架。它利用中等规模的大语言模型，并超越了SHapley Additive exPlanations特征影响值的标准用法。该框架采用结构化的提示，并辅以互特征交互数据，以生成人类可理解的自然语言解释。为了验证我们的框架，我们在光传输质量估计用例中进行了实证评估，并邀请了人类评估者。我们收集了专家的独立性能评估，显示出较高的评估者间一致性。与仅使用SHAP特征影响值进行简单提示的最先进基线相比，我们的方法将解释有用性和范围分别提高了12.2%和6.2%，同时实现了97.5%的正确性。

英文摘要

As artificial intelligence and machine learning (AI/ML) models become integral to network operations, their lack of transparency poses a significant barrier to operator trust. Existing explainable artificial intelligence (XAI) techniques often fail to bridge this gap for non-specialists, producing technical outputs that are difficult to translate into actionable insights. This paper presents a framework specifically designed to address this shortcoming. It leverages a moderately sized large language model (LLM) and extends beyond the standard use of SHapley Additive exPlanations (SHAP) feature influence values. The framework employs a structured prompt enriched with mutual feature interaction data to generate human-understandable natural language explanations. To validate our framework, we performed an empirical evaluation on an optical quality of transmission (QoT) estimation use case with human evaluators. We collected independent performance evaluations from specialists, which showed a high inter-evaluator agreement. Compared to a state-of-the-art baseline that uses only SHAP feature influence values in a straightforward prompt, our approach improves the explanation usefulness and scope by 12.2% and 6.2%, respectively, while achieving 97.5% correctness.


2606.11007 2026-06-10 cs.CR cs.AI cs.SE 交叉投稿

Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill

理解并减轻非技术用户使用OpenClaw的风险：一份实用指南与Skill

Junchang Zheng, Junfeng Tan, Jialiang Lin

发表机构 * School of Computer Science and Engineering, Guangzhou Institute of Science and Technology, Guangzhou, China（计算机科学与工程学院，广州科学与技术研究院，中国广州）； Science and Education Evaluation Lab, Guangzhou Institute of Science and Technology, Guangzhou, China（科学与教育评估实验室，广州科学与技术研究院，中国广州）

AI总结针对非技术用户，识别OpenClaw的七类核心风险，用通俗语言解释，提供可操作的防御策略，并开发自动化安全配置的Skill，降低使用门槛。

Comments Work in progress

AI中文摘要

OpenClaw已迅速成为一种变革性的人工智能（AI）智能体框架，其自主执行复杂多步任务的能力吸引了日益增长且多样化的用户群体。然而，这种能力伴随着显著的风险。虽然现有研究在描述这些威胁方面取得了重要进展，但此类工作主要面向技术娴熟的受众，对非技术用户而言仍然难以触及。这一群体如今在社区中占比越来越大且服务不足，而正是这些用户最迫切需要实用且直接的指导。为此，我们通过一系列相互关联的努力来弥合这一差距，旨在降低非技术OpenClaw用户的风险门槛。首先，我们识别并分类了OpenClaw用户在日常使用中可能遇到的七类核心风险，并用通俗语言解释，以便非技术用户能够轻松理解这些威胁的性质和潜在后果。其次，针对每种已识别的风险，我们将一套相应的防御策略提炼为清晰且可操作的具体步骤，易于遵循。第三，为使保护更加便捷，我们提供了一个配套的OpenClaw Skill，可自动执行关键安全配置，使用户能够以最少的手动干预保护其系统。通过这项工作，我们证明了防范智能体风险不必是安全专家的专属领域，非技术用户可以通过简单、实用的行动有意义地参与降低这些风险。

英文摘要

OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute complex, multi-step tasks has attracted an ever-growing and diverse user base. However, this capability comes with significant risks. While existing research has made important strides in characterizing these threats, such work is predominantly directed at technically sophisticated audiences. It remains largely inaccessible to non-technical users. This demographic now makes up an increasingly large and underserved portion of the community, yet it is these very users who most urgently need practical and straightforward guidance. In response, we bridge this gap through a series of interconnected efforts designed to lower the risk barrier for non-technical OpenClaw users. First, we identify and categorize seven core risks that OpenClaw users may encounter in daily usage, explaining each in plain language so that non-technical users can readily grasp the nature and potential consequences of these threats. Second, for each identified risk, we distill a set of corresponding defensive strategies into clear and actionable operational steps that are easy to follow. Third, to make protection even easier, we provide a companion OpenClaw Skill that automates key security configurations, enabling users to safeguard their systems with minimal manual intervention. Through this work, we demonstrate that safeguarding against the risks of intelligent agents need not be the exclusive domain of security experts, and that non-technical users can meaningfully participate in reducing these risks through simple, practical actions.


2606.11116 2026-06-10 cs.CY cs.AI cs.HC 交叉投稿

Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News

由记者设计，但为读者而设？重新思考AI披露与新闻透明度

Pooja Prajod

发表机构 * Centrum Wiskunde & Informatica（数学与信息学中心）

AI总结研究发现，详细披露会引发透明度困境降低信任，而简短披露造成信息缺口；读者偏好用户代理型设计（如按需详情、AI比例可视化），呼吁HCI社区重新设计披露机制。

Comments Accepted to CHIWORK Workshop (Interrogating GenAI Augmentation for CHIworkers: Strategies for Professional Autonomy and Accountability)

AI中文摘要

随着新闻编辑室整合生成式AI，记者面临一个披露挑战：如何以维护读者信任的方式传达AI参与。当前实践提供两种方法：简短的一行标签或详细的披露，说明人工监督、编辑责任和错误报告机制。两者都未能实现记者通过透明度建立信任的目标。一项针对34名新闻读者的现有对照实验表明，详细披露会引发\textit{透明度困境}，降低信任而非增加信任，并有可能引入暗黑模式，使读者在透明度的错觉下滚动忽略。一行披露避免了这种效应，但可能造成信息缺口，促使读者花费认知努力寻找披露所指示但未解释的AI参与迹象。然而，读者并非拒绝透明度，他们提出了以用户代理为中心的披露设计：按需详情交互、比例AI可视化、媒体级别信号和明确的“无AI”标签。我认为，从业者认为负责任的披露与用户实际需求之间的脱节是HCI社区的一个设计问题。

英文摘要

As newsrooms integrate generative AI, journalists face a disclosure challenge: how to communicate AI involvement in ways that maintain reader trust. Current practice offers two approaches: brief one-line labels or detailed disclosures specifying human oversight, editorial accountability, and error reporting mechanisms. Neither achieves journalists' goal of building trust through transparency. An existing controlled experiment with 34 news readers shows that detailed disclosures trigger a \textit{transparency dilemma}, reducing trust rather than increasing it, and risk introducing dark patterns that readers scroll past with the illusion of transparency. One-line disclosures avoid this effect but can create an information gap, prompting readers to expend cognitive effort searching for signs of AI involvement that the disclosure indicates but does not explain. Yet readers are not rejecting transparency: they proposed disclosure designs centered on user agency, including detail-on-demand interactions, proportional AI-ratio visualizations, outlet-level signals, and explicit "no AI" labels. I argue that this disconnect between what practitioners believe is responsible disclosure and what users actually need is a design problem for the HCI community.


2604.24668 2026-06-10 cs.AI cs.LG 版本更新

The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications

同意的代价：在代理金融应用中衡量LLM的谄媚行为

Zhenyu Zhao, Aparna Balagopalan, Adi Agrawal, Dilshoda Yergasheva, Waseem Alshikh, Daniel M. Bikel

发表机构 * Writer, Inc.（Writer公司）

AI总结研究评估LLM在金融代理任务中的谄媚行为，发现模型对用户反驳仅表现低至中等性能下降，但偏好信息导致多数模型失败，并测试了输入过滤等恢复方法。

Comments Accepted to ICLR 2026 FinAI Workshop

AI中文摘要

鉴于当今LLM在金融系统中的使用增加，评估此类系统的安全性和鲁棒性变得重要。LLM在通用领域设置中经常表现出的一种失败模式是谄媚行为。也就是说，模型优先考虑与表达的用户信念一致，而非正确性，导致准确性和信任度下降。在这项工作中，我们专注于评估LLM在代理金融任务中表现出的谄媚行为。我们的发现有三方面：首先，我们发现模型在面对用户对参考答案的反驳或矛盾时，仅表现出低至中等的性能下降，这使模型在金融代理设置中表现出的谄媚行为有别于先前工作的发现。其次，我们引入了一套任务，通过用户偏好信息（与参考答案矛盾）来测试谄媚行为，并发现大多数模型在存在此类输入时失败。最后，我们基准测试了不同的恢复模式，例如使用预训练LLM进行输入过滤。

英文摘要

Given the increased use of LLMs in financial systems today, it becomes important to evaluate the safety and robustness of such systems. One failure mode that LLMs frequently display in general domain settings is that of sycophancy. That is, models prioritize agreement with expressed user beliefs over correctness, leading to decreased accuracy and trust. In this work, we focus on evaluating sycophancy that LLMs display in agentic financial tasks. Our findings are three-fold: first, we find the models show only low to modest drops in performance in the face of user rebuttals or contradictions to the reference answer, which distinguishes sycophancy that models display in financial agentic settings from findings in prior work. Second, we introduce a suite of tasks to test for sycophancy by user preference information that contradicts the reference answer and find that most models fail in the presence of such inputs. Lastly, we benchmark different modes of recovery such as input filtering with a pretrained LLM.


2605.22714 2026-06-10 cs.AI cs.CL cs.LG 版本更新

AMEL: Accumulated Message Effects on LLM Judgments

AMEL: 累积消息对LLM判断的影响

Sid-Ali Temkit

发表机构 * chut.app

AI总结研究LLM在对话中因历史消息极性而偏离基准判断的累积消息效应（AMEL），发现模型偏向历史主流极性，且负向历史偏见更强，但偏见不随上下文长度增长，简单修复是为每个项目使用新上下文。

Comments 24 pages, 14 figures, 8 tables. Single author. Code, data (84,088 deduplicated API responses), and analysis pipeline at https://github.com/chutapp/amel

AI中文摘要

大型语言模型常被用作自动评估者：审查代码、审核内容或评分输出，通常许多项目通过一次对话处理。我们询问先前对话历史的极性是否会偏倚后续判断，我们将这种效应称为LLM判断的累积消息效应（AMEL）。通过对来自5个提供商（OpenAI、Anthropic、Google、DeepSeek和四个开源模型）的12个模型进行84,088次API调用，我们在隔离或跟随以正面或负面评价为主的历史之后呈现相同的测试项目。模型倾向于对话的主流极性（d = -0.17, p < 10^-53）。该效应集中在模型在基线时真正不确定的项目上（高熵项目d = -0.36，而基线确定时d = -0.15）。偏见不随上下文长度增长：5个先前轮次和50个产生相同的偏移（Spearman |r| < 0.01；OLS斜率p = 0.80）。并且存在负性不对称：按项目配对，负面历史诱导的偏见是正面的1.52倍（t = 13.03, p < 10^-36, n = 2,733）。扩展规模有帮助但不能解决（Anthropic: Haiku -0.22到Opus -0.17；OpenAI: Nano -0.34到GPT-5.2 -0.17）。三项后续研究缩小了机制范围。令牌概率分布连续变化，而非在阈值处。负性不对称既有令牌级成分也有语义成分，尽管在我们的样本量下平衡归因是探索性的。位置不重要：在50轮历史中任何位置的五个有偏轮次产生相同的偏移。评估流程最简单的修复是为每个项目使用新上下文；当批处理不可避免时，平衡历史有帮助。

英文摘要

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 84,088 API calls to 12 models from 5 providers (OpenAI, Anthropic, Google, DeepSeek, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = -0.17, p < 10^-53). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.36 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.52x more bias than positive (t = 13.03, p < 10^-36, n = 2,733). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.
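The effect sizes quoted in this abstract are Cohen's d values. As a reminder of what such a number measures, the sketch below computes a pooled-standard-deviation Cohen's d on hypothetical scores; the data are invented for illustration, and the abstract does not say which exact estimator (pooled, paired, or otherwise) the paper uses.

```python
from statistics import mean, variance
from typing import List


def cohens_d(treated: List[float], control: List[float]) -> float:
    """Standardized mean difference using the pooled sample variance."""
    n_t, n_c = len(treated), len(control)
    pooled_var = (
        (n_t - 1) * variance(treated) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treated) - mean(control)) / pooled_var ** 0.5


# Hypothetical judge scores for the same items, scored in isolation versus
# after a conversation history saturated with negative evaluations.
isolated = [6.5, 6.0, 7.0, 7.0, 5.5]
after_negative_history = [6.0, 5.5, 7.0, 6.5, 5.0]

d = cohens_d(after_negative_history, isolated)
```

A negative d here means scores drift downward after a negative history. The mitigation the abstract recommends, a fresh context per item, corresponds to always scoring in the `isolated` condition.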

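2606.09475 2026-06-10 cs.AI cs.LG updated version

Emergent alignment and the projectability of ethical personas

涌现对齐与伦理人格的可投射性

Guillermo Del Pinal, Youngchan Lee, Calum McNamara, Alejandro Perez Carballo

Affiliations: University of Massachusetts Amherst; Indiana University Bloomington

AI summary: Studies how finetuning large language models on narrow tasks can induce broadly aligned behavior. Models are given ethical personas via the Constitutional AI approach; narrow alignment is found to project to categories that were not trained on, and the authors propose that alignment strategies be evaluated for projectability.

AI Chinese abstract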

关于“涌现错位”的研究表明，在窄任务上微调LLM会诱导广泛错位的行为。这支持了“人格选择”（PSM）假说：在预训练期间，LLM学会模拟不同的角色和视角，这些可以在后训练期间被激发和细化。本文研究了相反的现象“涌现对齐”，并用它来支持和细化PSM，并激发对齐的新需求。我们在广泛和狭窄的安全任务上微调一个仅帮助型模型。为了创建SFT样本，我们遵循“宪法AI”（CAI）方法，并使用四条编码合理对齐策略的宪法：道义论、后果论、美德伦理以及将AI对齐为从属于人类权威。对于每个模型，我们表明，在两个狭窄的安全子类别上微调可靠地诱导出在代表性的一般安全类别以及我们直接从用于窄对齐的数据集中过滤掉的安全子类别上的涌现对齐。为了使用更细粒度的评估测试“PSM”，我们使用了多维“伦理人格”诊断。对于每个宪法微调（广泛/狭窄）模型，我们评估其行为与预期特征概况的匹配程度。我们的结果表明，我们的CAI模型获得了预期的“伦理人格”——例如，使用后果论宪法创建的SFT样本窄微调的模型与功利主义信念的一致性显著高于道义论信念。然而，我们的粗粒度和细粒度评估显示，我们的（广泛/狭窄）微调CAI模型在投射效果上存在显著差异。我们得出结论，对齐策略不仅应基于其（分布内）一般安全性能进行评估，还应特别基于其可投射性程度进行评估。

English abstract

Work on "emergent misalignment" shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the "persona selection" (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and refined during post-training. This paper investigates the converse phenomenon, "emergent alignment", and uses it to support and refine the PSM and to motivate a novel desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the "Constitutional AI" (CAI) approach and use four constitutions that encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of those models, we show that finetuning on two narrow safety sub-categories reliably induces emergent alignment over a representative set of general safety categories, and on safety sub-categories that we directly filtered out of the datasets used for narrow alignment. To test the PSM using a more fine-grained evaluation, we used a multidimensional "ethical persona" diagnostic. For each constitutionally finetuned (broad/narrow) model, we evaluate how well its behavior matches its expected signature profile. Our results show that our CAI models acquire their expected "ethical persona": for example, the model narrowly finetuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than with deontological beliefs. Yet our coarse and fine-grained evaluations show that there are significant differences across our (broad/narrow) finetuned CAI models in how well they project. We conclude that alignment strategies should be evaluated not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.

2403.00420 2026-06-10 cs.LG cs.AI updated version

Robust Deep Reinforcement Learning Through Adversarial Attacks and Training: A Survey

通过对抗攻击和训练实现鲁棒深度强化学习：综述

Lucas Schott, Josephine Delas, Hatem Hajri, Elies Gherbi, Reda Yaich, Nora Boulahia-Cuppens, Frederic Cuppens, Sylvain Lamprier

Affiliations: Institut de Recherche Technologique SystemX

AI summary: Surveys adversarial attack and adversarial training methods in deep reinforcement learning, systematically categorizing and comparing their objectives and mechanisms, with the aim of improving robustness to environment changes and perturbations.

Comments 83 pages, 17 figures, 3 tables, 15 algorithms

2407.20242 2026-06-10 cs.CY cs.AI cs.RO updated version

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

BadRobot: 在物理世界中越狱具身LLM智能体

Hangtao Zhang, Chenyu Zhu, Xianlong Wang, Ziqi Zhou, Changgan Yin, Minghui Li, Lulu Xue, Yichen Wang, Shengshan Hu, Aishan Liu, Peijin Guo, Leo Yu Zhang

Affiliations: Huazhong University of Science and Technology; Beihang University; Griffith University

AI summary: Proposes BadRobot, an attack paradigm that exploits three vulnerabilities (manipulation of LLMs within robotic systems, misalignment between linguistic outputs and physical actions, and flaws in world knowledge) to make embodied LLMs perform harmful actions through voice interaction, and validates its effectiveness on a benchmark.

Comments Accepted to ICLR 2025. Please cite the conference version. Project page: https://Embodied-LLMs-Safety.github.io

Journal ref: International Conference on Learning Representations (ICLR) 2025

AI Chinese abstract

具身AI代表将AI集成到物理实体中的系统。大型语言模型（LLM）展现出强大的语言理解能力，通过促进复杂的任务规划，已被广泛用于具身AI。然而，一个关键的安全问题仍被忽视：这些具身LLM是否会实施有害行为？为此，我们引入了BadRobot，一种新颖的攻击范式，旨在通过典型的基于语音的用户-系统交互，使具身LLM违反安全和伦理约束。具体来说，我们利用了三个漏洞来实现这种攻击：(i) 机器人系统中LLM的操纵，(ii) 语言输出与物理动作之间的错位，以及(iii) 世界知识缺陷导致的意外危险行为。此外，我们构建了一个包含各种恶意物理动作查询的基准，以评估BadRobot的攻击性能。基于该基准，针对现有突出的具身LLM框架（例如Voxposer、Code as Policies和ProgPrompt）的大量实验证明了我们BadRobot的有效性。我们的代码可在 https://github.com/Rookie143/BadRobot 获取。

English abstract

Embodied AI represents systems where AI is integrated into physical entities. Large Language Models (LLMs), which exhibit powerful language understanding abilities, have been extensively employed in embodied AI to facilitate sophisticated task planning. However, a critical safety issue remains overlooked: could these embodied LLMs perpetrate harmful behaviors? In response, we introduce BadRobot, a novel attack paradigm that aims to make embodied LLMs violate safety and ethical constraints through typical voice-based user-system interactions. Specifically, three vulnerabilities are exploited to achieve this type of attack: (i) manipulation of LLMs within robotic systems, (ii) misalignment between linguistic outputs and physical actions, and (iii) unintentional hazardous behaviors caused by flaws in world knowledge. Furthermore, we construct a benchmark of various malicious physical action queries to evaluate BadRobot's attack performance. Based on this benchmark, extensive experiments against existing prominent embodied LLM frameworks (e.g., Voxposer, Code as Policies, and ProgPrompt) demonstrate the effectiveness of BadRobot. Our code is available at https://github.com/Rookie143/BadRobot.

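2411.05698 2026-06-10 cs.CV cs.AI cs.LG updated version

Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification

Visual-TCAV：用于图像分类事后可解释性的基于概念的归因和显著性图

Antonio De Santis, Riccardo Campi, Matteo Bianchi, Marco Brambilla

Affiliations: Politecnico di Milano

AI summary: Proposes Visual-TCAV, a framework that combines Concept Activation Vectors with Integrated Gradients to generate class-agnostic saliency maps and estimate concept attributions; in a controlled experiment it aligns better with the ground-truth explanations than TCAV.

Comments Accepted in TMLR

AI Chinese abstract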

卷积神经网络在图像分类中表现出色，但由于模型规模和复杂性，解释其预测具有挑战性。最先进的显著性方法生成局部解释，突出输入图像中识别类别的区域，但无法解释感兴趣的概念如何贡献于预测。另一方面，基于概念的方法（如TCAV）提供了网络对人类定义概念敏感性的见解，但无法计算其在特定预测中的归因，也无法显示其在输入图像中的位置。我们引入了Visual-TCAV，一种新颖的可解释性框架，旨在通过提供局部和全局解释来弥合这些方法之间的差距。Visual-TCAV使用概念激活向量（CAV）生成类无关的显著性图，显示网络识别特定概念的位置。此外，它可以使用积分梯度的推广来估计这些概念对任何类别输出的归因。我们通过一个已知解释真实情况的受控实验评估了该方法的忠实性，显示出比TCAV更好的真实情况对齐。我们的代码可在 https://github.com/DataSciencePolimi/Visual-TCAV 获取。

English abstract

Convolutional Neural Networks (CNNs) have shown remarkable performance in image classification. However, interpreting their predictions is challenging due to the size and complexity of these models. State-of-the-art saliency methods generate local explanations highlighting the area in the input image where a class is identified but cannot explain how a concept of interest contributes to the prediction. On the other hand, concept-based methods, such as TCAV, provide insights into how sensitive the network is to a human-defined concept but cannot compute its attribution in a specific prediction nor show its location within the input image. We introduce Visual-TCAV, a novel explainability framework aiming to bridge the gap between these methods by providing both local and global explanations. Visual-TCAV uses Concept Activation Vectors (CAVs) to generate class-agnostic saliency maps that show where the network recognizes a certain concept. Moreover, it can estimate the attribution of these concepts to the output of any class using a generalization of Integrated Gradients. We evaluate the method's faithfulness via a controlled experiment where the ground truth for explanations is known, showing better ground truth alignment than TCAV. Our code is available at https://github.com/DataSciencePolimi/Visual-TCAV.

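2501.00745 2026-06-10 cs.CL cs.AI cs.GT cs.IR econ.TH updated version

Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines

基于大型语言模型的搜索引擎对抗攻击动力学

Xiyang Hu

Affiliations: Arizona State University

AI summary: Models ranking manipulation attacks as an Infinitely Repeated Prisoners' Dilemma and analyzes the conditions under which cooperation is sustained; finds that lowering attack success rates can paradoxically incentivize attacks and that some defensive measures are ineffective in certain cases.

Comments New Frontiers in Game-Theoretic Learning Workshop, International Conference on Machine Learning (ICML) 2026

AI Chinese abstract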

基于大型语言模型（LLM）的搜索引擎日益集成，改变了信息检索的格局。然而，这些系统容易受到对抗攻击，尤其是排名操纵攻击，攻击者通过精心制作网页内容来操纵LLM的排名并推广特定内容，从而在竞争对手中获得不公平优势。在本文中，我们研究了排名操纵攻击的动力学。我们将此问题建模为无限重复囚徒困境，其中多个参与者策略性地决定合作还是攻击。我们分析了合作得以维持的条件，识别了关键因素，如攻击成本、折现率、攻击成功率和触发策略，这些因素影响参与者的行为。我们识别了系统动力学中的临界点，表明当参与者具有前瞻性时，合作更有可能维持。然而，从防御角度来看，我们发现简单地降低攻击成功概率，在某些条件下反而会激励攻击。此外，限制攻击成功率上限的防御措施在某些情况下可能徒劳无功。这些见解凸显了保护基于LLM的系统的复杂性。我们的工作为理解和缓解其脆弱性提供了理论基础和实践见解，同时强调了自适应安全策略和深思熟虑的生态系统设计的重要性。

English abstract

The increasing integration of Large Language Model (LLM) based search engines has transformed the landscape of information retrieval. However, these systems are vulnerable to adversarial attacks, especially ranking manipulation attacks, where attackers craft webpage content to manipulate the LLM's ranking and promote specific content, gaining an unfair advantage over competitors. In this paper, we study the dynamics of ranking manipulation attacks. We frame this problem as an Infinitely Repeated Prisoners' Dilemma, where multiple players strategically decide whether to cooperate or attack. We analyze the conditions under which cooperation can be sustained, identifying key factors such as attack costs, discount rates, attack success rates, and trigger strategies that influence player behavior. We identify tipping points in the system dynamics, demonstrating that cooperation is more likely to be sustained when players are forward-looking. However, from a defense perspective, we find that simply reducing attack success probabilities can, paradoxically, incentivize attacks under certain conditions. Furthermore, defensive measures to cap the upper bound of attack success rates may prove futile in some scenarios. These insights highlight the complexity of securing LLM-based systems. Our work provides a theoretical foundation and practical insights for understanding and mitigating their vulnerabilities, while emphasizing the importance of adaptive security strategies and thoughtful ecosystem design.
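
To make the "forward-looking players sustain cooperation" claim concrete, the sketch below works through the textbook grim-trigger condition for a two-player repeated Prisoners' Dilemma: cooperation holds when the discount factor δ satisfies δ ≥ (T − R) / (T − P). This is the standard classroom model, not the paper's own payoff structure, which additionally involves attack costs and attack success rates. The payoff values and function names are hypothetical.

```python
def min_discount_factor(temptation: float, reward: float, punishment: float) -> float:
    """Smallest discount factor at which grim trigger sustains cooperation."""
    return (temptation - reward) / (temptation - punishment)


def cooperation_sustainable(delta: float, temptation: float, reward: float, punishment: float) -> bool:
    """Compare the discounted payoff of cooperating forever with attacking once."""
    cooperate_forever = reward / (1 - delta)
    # Attack now, then receive the mutual-attack payoff in every later round.
    attack_once = temptation + delta * punishment / (1 - delta)
    return cooperate_forever >= attack_once


# Hypothetical per-round payoffs: T = 5 (attack a cooperator),
# R = 3 (mutual cooperation), P = 1 (mutual attack).
threshold = min_discount_factor(5, 3, 1)
patient = cooperation_sustainable(0.6, 5, 3, 1)
impatient = cooperation_sustainable(0.4, 5, 3, 1)
print(threshold, patient, impatient)
```

Under these assumed payoffs, players who weight the future above the threshold keep cooperating, while more impatient players prefer to attack.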

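2505.14608 2026-06-10 cs.CL cs.AI cs.LG updated version

People-Centred Medical Image Analysis via Fairness-Aware Human-AI Cooperation

以人为本的医学图像分析：通过公平感知的人机协作

Zheng Zhang, Milad Masroor, Cuong Nguyen, Tahir Hassan, Yuanhong Chen, David Rosewarne, Kevin Wells, Thanh-Toan Do, Gustavo Carneiro

AI summary: Proposes PecMan, a framework that jointly models subgroup-dependent reliability, decision allocation, and collaborative prediction; a gating and consolidation mechanism dynamically assigns cases to automated models or human experts without requiring sensitive attributes at test time, enabling fairness-aware human-AI cooperative classification.

AI Chinese abstract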

医学图像分析的机器学习模型通常表现出子群依赖的性能，这影响了在有限资源下如何在自动化系统和人类专家之间分配决策。先前关于AI公平性和人机协作的工作，包括学习推迟（L2D）和学习互补（L2C），通常孤立地处理这些问题。我们提出了以人为本的医学图像分析（PecMan），一个用于公平感知的人机协作分类框架，它联合建模子群依赖的可靠性、决策分配和协作预测。PecMan结合了子群专门的预测器与一个门控和整合机制，该机制动态地将病例分配给自动化模型、人类专家或它们的组合，而无需在测试时使用敏感属性。我们还引入了FairHAI基准，用于评估预测准确性、子群公平性和人类参与之间的权衡。此外，我们通过选择遗憾对多智能体门控进行了理论分析，并刻画了在输入依赖分配下的公平性-覆盖权衡。在多个医学影像数据集上的实验表明，与单独处理公平性或人机协作的方法相比，PecMan实现了持续改进的权衡。

English abstract

Machine learning models for medical image analysis often exhibit subgroup-dependent performance, which impacts how decisions should be allocated between automated systems and human experts under limited resources. Prior work on AI fairness and human-AI cooperation, including learning to defer (L2D) and learning to complement (L2C), typically addresses these problems in isolation. We propose People-Centred Medical Image Analysis (PecMan), a framework for fairness-aware human-AI co-operative classification that jointly models subgroup-dependent reliability, decision allocation, and collaborative prediction. PecMan combines subgroup-specialised predictors with a gating and consolidation mechanism that dynamically assigns cases to automated models, human experts, or their combination, without requiring sensitive attributes at test time. We also introduce the FairHAI benchmark for evaluating trade-offs between predictive accuracy, subgroup equity, and human involvement. In addition, we provide a theoretical analysis of multi-agent gating via selection regret and characterise fairness-coverage trade-offs under input-dependent allocation. Experiments across multiple medical imaging datasets demonstrate that PecMan achieves consistently improved trade-offs compared to methods that address fairness or human-AI cooperation separately.

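2606.07998 2026-06-10 cs.LG cs.AI updated version

Enhancing AI Interpretability and Safety through Localised Architectures

通过局部化架构增强AI可解释性与安全性

Ian Seet, Jonas Bozenhard, Simon Ostermann

AI summary: To address the poor interpretability and high computational cost of large generative AI models, proposes localised machine learning architectures that improve interpretability and efficiency through lower bandwidth and higher per-node expressivity, and evaluates the suitability of several hardware implementation options.

AI Chinese abstract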

近期生成式AI的进展，特别是强大的大型语言模型（LLM）和大型推理模型（LRM），引发了对这些庞大且不透明的AI模型的可解释性、安全性和可持续性的担忧。这些架构的能力不仅源于深度神经网络的可扩展性，还源于大规模并行硬件（如GPU集群）。深度神经网络的弥散性质使其在提供足够训练数据时具有强大的函数逼近能力，但代价是可解释性和计算效率的降低。观察到局部化机器学习（ML）模型在小数据集上往往比深度神经网络更具可解释性和计算效率，我们通过类比推理，认为类似的优势可能适用于特定的局部化硬件ML架构。我们主张，具有较低带宽但每个节点具有更高表达能力的局部化架构，有潜力在根本上比运行在GPU集群上的深度神经网络更具可解释性，同时在较小数据集上保持竞争力。然后，我们评估了各种硬件ML范式在实现此类局部化架构方面的适用性，并评估了它们的每节点表达能力、能效以及所需技术的实际成熟度。

English abstract

Recent advances in generative AI, especially powerful Large Language Models (LLMs) and Large Reasoning Models (LRMs), raise concerns over the interpretability, safety and sustainability of these large and opaque AI models. The power of such architectures is derived not only from the scalability of deep neural networks, but also massively parallel hardware such as GPU clusters. The diffuse nature of deep neural networks gives them great function-approximation capability when provided with sufficient training data but imposes a cost in interpretability and computational efficiency. Observing that localised machine learning (ML) models tend to be more interpretable and computationally efficient than deep neural networks on small datasets, we reason by analogy that similar advantages may apply to specific localised hardware ML architectures. We argue that localised architectures with lower bandwidth but higher expressivity per node have the potential to be fundamentally more interpretable than deep neural networks running on GPU clusters while remaining competitive for smaller datasets. We then evaluate the suitability of various hardware ML paradigms for implementing such localised architectures and evaluate their per-node expressivity, energy efficiency and practical maturity of the technology required.

2606.10241 2026-06-10 cs.AI new submission

Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph

Regimes: 一种可审计的、保留验证集的改进循环——在ActiveGraph上的LongMemEval演示

Yohei Nakajima

AI summary: Proposes Regimes, an auditable autonomous improvement loop built on event sourcing; the ActiveGraph runtime provides failure logging, replay, and held-out validation, improving accuracy on LongMemEval-S by 0.05 to 0.10.

Comments 30 pages, 5 figures. Code and committed runs: https://github.com/yoheinakajima/regimes

AI Chinese abstract

自主改进循环难以信任，因为改进过程通常是附加在智能体上的外部脚手架：故障未被记录，诊断无法重放，提升或丢弃决策落入侧数据库而非智能体自身历史。我们证明，事件溯源智能体运行时消除了这种摩擦，将受控改进转化为一等工作流。当智能体状态是仅追加事件日志的确定性投影时，故障被记录，运行从日志精确重放，候选补丁限定于类型化管道接缝，门控可审计，每次提升或丢弃本身也是一个事件。我们通过Regimes演示了这一点，这是ActiveGraph运行时上的一个循环，诊断失败的评估，在管道点提出修复，并仅在静态检查、沙盒执行、样本内评估和保留验证后提升。该循环与目标无关：相同的控制流通过通用接口针对不同任务运行。在LongMemEval-S上，主要失败不是检索而是调和：证据已在汇编上下文中，但阅读器回答错误。在五个种子保留集划分中，Regimes发现阅读器提示修复，在四个划分中将最终保留准确率提升+0.05至+0.10，在一个过度提升划分中提升+0.01；两个划分单独显著（种子5未针对其顺序提升结构调整），汇总计数仅为描述性，因为划分共享一个500问题池。持久贡献包括：ActiveGraph作为使受控改进循环可行的可审计基础，其支持的保留集门控循环，将每个故障路由到管道位置的失败机制分类（其相对于无路由基线的边际价值是主要开放问题），以及提示即发现探针的假设。

English abstract

Autonomous improvement loops are hard to trust because the improvement process is usually external scaffolding bolted onto the agent: failures go unlogged, diagnoses cannot be replayed, and promote-or-discard decisions land in a side database rather than the agent's own history. We show that an event-sourced agent runtime removes that friction and turns controlled improvement into a first-class workflow. When the agent's state is a deterministic projection of an append-only event log, failures are recorded, a run replays exactly from its log, candidate patches scope to typed pipeline seams, gates are auditable, and every promotion or discard is itself an event. We demonstrate this with Regimes, a loop on the ActiveGraph runtime that diagnoses failed evaluations, proposes a repair at a pipeline point, and promotes it only after static checks, sandbox execution, in-sample evaluation, and held-out validation. The loop is target-agnostic: the same control flow runs against different tasks through a common interface. On LongMemEval-S the dominant failure is not retrieval but reconciliation: the evidence is already in the assembled context, yet the reader answers incorrectly. Across five seeded held-out splits, Regimes discovers reader-prompt repairs that improve final held-out accuracy by +0.05 to +0.10 in four splits and +0.01 in one over-promotion split; two splits are individually significant (seed 5 unadjusted for its sequential promotion structure), and the pooled count is descriptive only, since the splits share one 500-question pool. The durable contributions are ActiveGraph as an auditable substrate that makes controlled improvement loops tractable, the held-out-gated loop it supports, the failure-regime taxonomy routing each failure to a pipeline location (whose marginal value over an unrouted baseline is the primary open question), and the prompt-as-discovery-probe hypothesis.
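
The phrase "the agent's state is a deterministic projection of an append-only event log" can be illustrated with a small fold over events. The sketch below is a generic event-sourcing example, not the ActiveGraph API: the `Event` type, the event kinds, and the `project` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Event:
    kind: str     # hypothetical kinds: "failure_recorded", "patch_promoted", ...
    payload: str


def apply_event(state: dict, event: Event) -> dict:
    """Pure transition: return a new state without mutating the old one."""
    new_state = {kind: list(items) for kind, items in state.items()}
    new_state.setdefault(event.kind, []).append(event.payload)
    return new_state


def project(log) -> dict:
    """State is a left fold of apply_event over the log, starting from an empty state."""
    return reduce(apply_event, log, {})


log = [
    Event("failure_recorded", "q17: reader answered incorrectly"),
    Event("patch_promoted", "reader-prompt-v2"),
]
state = project(log)
replayed = project(list(log))  # replaying the same log reproduces the same state
print(state == replayed)
```

Because `apply_event` has no side effects and the log is only ever appended to, a promotion or discard decision can be audited by replaying the log up to the corresponding event.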

2606.10254 2026-06-10 cs.AI cs.CL new submission

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

RealMath-Eval：为何SOTA裁判难以应对真实人类推理

Yiteng Mao, Kenan Xu, Yijia Lyu, Wenhao Li, Jianlong Chen, Xiangfeng Wang

Affiliations: University of Wisconsin–Madison; East China Normal University; New York University; Tongji University; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes RealMath-Eval, a benchmark that evaluates how well LLM judges grade real student mathematics solutions; judges show high mean squared error against human grading but do better on synthetic data, and the evaluation gap is attributed to the diversity and high information entropy of the human error space.

Comments Code available at https://github.com/RicharMd/RealMath-Eval; data available at https://huggingface.co/datasets/RicharMd/RealMath-Eval

AI Chinese abstract

尽管大型语言模型（LLM）在解答高中数学方面已接近完美，但它们评估真实学生多样化推理过程的能力仍未得到充分检验。为弥补这一差距，我们引入了RealMath-Eval，一个严格标注的基准，包含224份来自高中的真实考试答卷。我们的初步评估显示，即使是最先进的LLM裁判在此任务上也表现不佳，与人类专家评分相比呈现出高均方误差（约2.96）。为探究可能的原因，我们将此表现与同一裁判评估合成LLM生成解答的控制设置进行对比。我们识别出一个明显的“评估差距”：裁判在合成文本上准确性和一致性显著更高（MSE约1.17），但难以泛化到真实学生推理。通过语义嵌入分析，我们发现合成错误会“结构坍缩”为可预测的低维线性子空间，而人类错误则形成更多样的错误空间。此外，生成概率探测表明，人类推理涉及显著更高的信息论惊喜度，表明学生推理转换对当前模型而言更加分布外。最后，我们发现表面层面的风格迁移无法弥合这一差距。我们的发现表明，当前严重依赖合成数据的LLM评估流程可能无法充分捕捉真实学生数学推理的多样性。

English abstract

While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce RealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error (~2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark "Evaluation Gap": judges are considerably more accurate and consistent on synthetic text (MSE ~1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a "structural collapse" into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.

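2606.10314 2026-06-10 cs.AI new submission

Mobility Anomaly Generation using LLM-Driven Behavior with Kinematic Constraints

基于大语言模型驱动行为与运动约束的移动异常生成

Yueyang Liu, Joon-Seok Kim, Andreas Züfle

Affiliations: Emory University, Atlanta, USA

AI summary: Proposes an end-to-end generative framework that combines LLM-injected semantic anomalies with map-constrained routing reconstruction to synthesize trajectory anomaly datasets with annotated ground truth.

AI Chinese abstract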

尽管人类轨迹异常研究对于推进空间数据挖掘至关重要，但实证研究因缺乏真实标注数据集而严重受阻。现有真实和模拟轨迹数据集仅包含正常移动模式，缺乏异常标注。这种稀缺性源于异常事件的统计稀有性，使得传统观测方法不可行。此外，大规模移动数据的系统获取受高昂成本和严格隐私法规限制。为克服这些限制并建立可靠的带标注真实轨迹异常数据集，我们提出一种新颖的端到端生成框架，用于大规模合成逼真的轨迹异常。该架构直接在基线模拟轨迹上操作，弥合纯合成移动数据与复杂真实物理约束之间的差距。我们利用大语言模型（LLM）代理系统性地注入语义上有意义的异常行为，例如不规则分布外签到和跳过常规访问。为确保空间有效性，系统利用地图约束路由重建重新计算LLM代理修改停留点之间的物理转移。此外，为缩小模拟与现实的差距，我们通过上下文感知的空间噪声模型增强生成轨迹，该模型由环境和位置特定变量参数化，以准确模拟异构GPS传感器退化。

English abstract

Although the study of human trajectory anomalies is critical for advancing spatial data mining, empirical research remains severely hindered by a pervasive lack of ground-truth datasets. Despite the availability of several real-world and simulated human trajectory collections, these datasets exclusively capture normal mobility patterns and lack annotated anomalies. This specific scarcity is fundamentally driven by the inherent statistical rarity of anomalous events, precluding the feasibility of conventional observational methods. Compounding this challenge, the systematic acquisition of large-scale mobility data is strictly bottlenecked by prohibitive costs and stringent privacy regulations. To overcome these fundamental limitations and establish a reliable human trajectory anomalies dataset with annotated ground truth, we introduce a novel, end-to-end generative framework designed to synthesize realistic trajectory anomalies at scale. Our architecture bridges the gap between purely synthetic mobility data and complex real-world physical constraints by operating directly on baseline simulated trajectories. We employ Large Language Model (LLM) agents to systematically inject semantically meaningful behavioral anomalies such as irregular out-of-distribution check-ins and skipped routine visits. To ensure rigorous spatial validity, the system leverages map-constrained routing reconstruction to recalculate the physical transitions between these LLM agent-modified staypoints. Moreover, to narrow the simulation-to-reality gap, we augment the resulting trajectories with a context-aware spatial noise model, parameterized by environmental and location-specific variables, to accurately emulate heterogeneous GPS sensor degradation.

2606.10394 2026-06-10 cs.AI new submission

STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

STAGE-Claw：面向真实场景的基于状态的智能体自动化基准测试

Sirui Liang, Bohan Yu, Peiyu Wang, Shiguang Guo, Wenxing Hu, Pengfei Cao, Jian Zhao, Cao Liu, Ke Zeng, Xunliang Cai, Kang Liu

Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; Chinese Academy of Sciences; University of Chinese Academy of Sciences; Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; Meituan

AI summary: Proposes STAGE-Claw, a framework that automatically builds realistic scenario tasks in state-based personal-computing environments and evaluates agents by the final system state rather than the textual response; it creates 40 challenging tasks and analyzes 11 frontier models.

AI Chinese abstract

大型语言模型越来越多地被用于驱动日常应用中的个人智能体，但评估这些智能体仍然是一个挑战。现有的基准测试仍然依赖于沙盒化工件、静态任务设计和粗粒度评分，这阻碍了可扩展性并限制了向可靠个人智能体评估的进展。本文介绍了STAGE-Claw，一个在基于状态的个人计算环境中自动构建和评估真实个人智能体场景的框架。给定一个任务提示，STAGE-Claw自动创建并验证一个真实的基准测试任务，包括其环境、任务提示、真实结果和相关验证程序。然后，在真实操作环境中评估智能体，其中性能通过最终系统状态而非仅文本响应的正确性来衡量。使用STAGE-Claw，本文创建了一个包含40个具有挑战性的真实场景智能体任务的基准测试，评估了11个前沿模型，并分析了它们的任务得分、成本、工具调用可靠性和常见失败模式。总体而言，STAGE-Claw提供了一种可扩展的、基于状态的方式来评估真实用户场景中的智能体。

English abstract

Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a challenge. Existing benchmarks still rely on sandboxed artifacts, static task design, and coarse scoring, which hinder scalability and limit progress toward reliable personal-agent evaluation. This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based personal-computing environments. Given a task hint, STAGE-Claw automatically creates and validates a realistic benchmark task with its environment, task prompts, ground truth, and related verification programs. Agents are then evaluated in realistic operating environments, where performance is measured by the correctness of the final system state rather than only the textual response. Using STAGE-Claw, this paper creates a benchmark with 40 challenging real scenario agent tasks, evaluates 11 frontier models, and analyzes their task scores, costs, tool-call reliability, and common failure patterns. Overall, STAGE-Claw offers a scalable, state-based way to evaluate agents in realistic user scenarios.

2606.10479 2026-06-10 cs.AI new submission

ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

ComBench: 奥林匹克级组合数学中严格证明推理与构造实现的基准测试

Shunkai Zhang, Haoran Zhang, Yun Luo, Qianjia Cheng, Haodi Lei, Yizhuo Li, Runzhe Zhan, Zhilin Wang, Bangjie Xu, Yucheng Su, Xinmiao Han, Xiaoye Qu, Dongrui Liu, Zhouchen Lin, Yu Qiao, Ning Ding, Yafu Li, Yu Cheng

Affiliations: Shanghai AI Laboratory; Peking University; Shanghai Jiao Tong University; Tsinghua University; The Chinese University of Hong Kong

AI summary: Proposes ComBench, a benchmark of 100 Olympiad-level combinatorics problems split into analysis-centric and construction-centric settings and evaluated through rubric grading and construction verification; the strongest model reaches only 65.4% average accuracy, and proof reasoning and constructive realization turn out to be distinct capabilities.

Comments 39 pages, 6 figures, 26 tables. Project page: https://simplified-reasoning.github.io/ComBench/docs/

AI Chinese abstract

组合数学是奥林匹克级数学问题解决的核心，需要深入的离散推理、创造性构造和严格的结构洞察。最近的证据表明，即使今天最强的前沿模型在奥林匹克组合问题上仍表现不均，揭示了创造性数学推理方面的差距。我们引入了ComBench，一个奥林匹克级组合数学基准，用于评估和诊断大语言模型的组合推理能力。ComBench包含100道人工标注的竞赛级问题，围绕两个互补的设置组织：以分析为中心的问题，主要需要严格的数学论证；以及以构造为中心的问题，除了正确性证明外还需要显式构造。评估协议结合了基于评分标准的证明评分和确定性构造验证，揭示了证明质量和构造有效性存在分歧的情况。对前沿开源和闭源模型的实验表明，ComBench远未饱和：最强模型总体平均准确率为65.4%，总体Best@4为75.3%。我们进一步发现，严格证明推理和构造实现是不同的能力：Kimi-K2.6在分析中心的证明评分上落后于GPT-5.5，但在构造中心的Best@4上超过它，而存在性和构造问题在代表性前沿模型中始终是最难的。

English abstract

Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rigorous structural insight. Recent evidence suggests that even today's strongest frontier models remain uneven on Olympiad combinatorics, revealing a gap in creative mathematical reasoning. We introduce ComBench, an Olympiad-level combinatorics benchmark for evaluating and diagnosing the combinatorial reasoning capabilities of large language models. ComBench contains 100 human-annotated competition-level problems organized around two complementary settings: analysis-centric problems, which primarily require rigorous mathematical arguments, and construction-centric problems, which require explicit constructions in addition to correctness justifications. The evaluation protocol combines rubric-guided proof grading with deterministic construction verification, exposing cases where proof quality and construction validity diverge. Experiments on frontier open- and closed-source models show that ComBench is far from saturated: the strongest model reaches 65.4% overall Avg. and 75.3% overall Best@4. We further find that Rigorous Proof Reasoning and Constructive Realization are distinct capabilities: Kimi-K2.6 trails GPT-5.5 on analysis-centric proof grading but surpasses it on construction-centric Best@4, while Existence and Construction problems remain consistently hardest across representative frontier models.
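
The abstract reports both an "Avg." and a "Best@4" figure without defining them. One plausible reading, assumed here and not confirmed by the source, is that each problem is attempted four times, "Avg." is the mean score over all attempts, and "Best@4" takes the best of the four attempts per problem. The sketch below computes both under that assumption, using made-up binary scores.

```python
def avg_and_best_at_k(scores_per_problem):
    """Return (mean over all attempts, mean of per-problem best attempt).

    Assumes every problem has the same number of attempts, so the mean of
    per-problem means equals the mean over all attempts.
    """
    n = len(scores_per_problem)
    per_problem_avg = [sum(scores) / len(scores) for scores in scores_per_problem]
    per_problem_best = [max(scores) for scores in scores_per_problem]
    return sum(per_problem_avg) / n, sum(per_problem_best) / n


# Hypothetical results: 3 problems, 4 attempts each, 1 = solved, 0 = not solved.
scores = [
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
avg, best_at_4 = avg_and_best_at_k(scores)
print(f"Avg. = {avg:.3f}, Best@4 = {best_at_4:.3f}")
```

Under this reading, Best@4 can never be lower than Avg., which is consistent with the 65.4% and 75.3% figures quoted above.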

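2606.10799 2026-06-10 cs.AI new submission

CIAware-Bench: Evaluating Control Intervention Awareness in Frontier Large Language Models

CIAware-Bench: 评估前沿大语言模型的控制干预感知能力

Joachim Schaeffer, Thomas Jiralerspong, Alexander Panfilov, Guillaume Lajoie, Jonas Geiping, Yoshua Bengio, Roland S. Zimmermann

Affiliations: MATS; Mila – Quebec AI Institute; Université de Montréal; Astra Fellowship; ELLIS Institute Tübingen, MPI for Intelligent Systems & Tübingen AI Center; LawZero; Google DeepMind

AI summary: Proposes CIAware-Bench, a benchmark spanning four task domains that tests whether models can distinguish their own trajectories from those modified by a control intervention; frontier models show low to moderate awareness under default settings, varying across tasks and model pairs.

AI Chinese abstract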

AI控制协议通过监控不可信模型的行为并修改潜在不安全步骤来监督它们，通常使用可信模型。这在一定程度上篡改了不可信模型的轨迹。如果不可信模型检测到这种干预，它可能推断监控器的属性并适应以逃避控制。我们引入CIAware-Bench，一个用于衡量前沿模型控制干预（CI）感知能力的基准。CIAware-Bench测试模型能否区分自身轨迹与被控制干预修改的轨迹。该基准包含四个任务域（论文写作、BigCodeBench、Bash Arena和SHADE-Arena），同时改变轨迹水印、侧任务存在和控制协议。评估十一个前沿模型，我们发现默认设置下CI感知能力低至中等（最高0.87；平衡二元分类的随机猜测准确率为0.5），且在不同任务域和模型对之间存在显著差异。跨模型家族的检测通常更容易，表明模型利用了提供商特定的风格或后训练差异。总体而言，CI感知不是一个固定的模型级属性，应针对每个新模型发布和部署场景进行测量。我们发布CIAware-Bench以跟踪CI感知并为干预更难检测的控制协议提供信息。

English abstract

AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. This partially tampers with the untrusted model's trajectory. If the untrusted model detects such an intervention, it may infer properties of the monitor and adapt to evade control. We introduce CIAware-Bench, a benchmark for measuring control intervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark comprises four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena) and varies trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (up to 0.87; chance-level accuracy for balanced binary classification is 0.5), with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and inform control protocols whose interventions are harder to detect.
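
The abstract quotes a chance level of 0.5 for its detection metric. As a hedged illustration of why that is the natural baseline, the sketch below scores a detector that labels trajectories as modified (1) or unmodified (0). It uses balanced accuracy, which equals plain accuracy when the two classes are the same size; whether the authors compute it exactly this way is an assumption, and the labels and predictions are invented.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the true-positive rate and the true-negative rate for labels in {0, 1}."""
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (true_pos / positives + true_neg / negatives)


# Hypothetical balanced test set: 1 = trajectory was modified, 0 = untouched.
labels = [1, 1, 1, 0, 0, 0]
always_modified = [1, 1, 1, 1, 1, 1]  # a detector with no real signal
some_signal = [1, 1, 0, 0, 0, 1]      # a detector that is right on 4 of 6 items

chance = balanced_accuracy(labels, always_modified)
detector = balanced_accuracy(labels, some_signal)
print(chance, round(detector, 3))
```

A detector that always gives the same answer lands at the chance level, so reported scores only indicate awareness to the extent that they exceed 0.5.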

2606.11150 2026-06-10 cs.AI cs.CY new submission

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

ABC-Bench：生物安全的主体生物能力基准

Andrew Bo Liu, Samira Nedungadi, Bryce Cai, Alex Kleinman, Harmon Bhasin, Seth Donoughe

AI summary: Proposes ABC-Bench, a benchmark that evaluates LLM agents on biosecurity-relevant tasks: programming liquid handling robots, designing DNA fragments, and evading DNA synthesis screening; all tested agents outperformed the expert human baseline.

Comments 18 pages. To be published in ICML 2026

AI Chinese abstract

大型语言模型（LLM）正在迅速获得与生物研究相关的能力，从文献综合到实验数据解释。LLM主体也越来越能够执行以前需要经验丰富的人类生物学家才能完成的计算机生物学任务。这些新兴的AI能力为科学发现和生物医学进步提供了新的机会，但也改变了生物安全风险的格局。为了解决这个问题，我们引入了主体生物能力基准（ABC-Bench），这是一套用于衡量主体生物安全相关能力的任务。ABC-Bench在良性和双重用途生物学任务上评估LLM主体：编写代码操作液体处理机器人、设计用于体外组装的DNA片段以及规避DNA合成筛选。这些任务需要生物学和软件专业知识的结合。所有测试的LLM主体在所有三项任务上的表现都优于中位数专家人类基线。主体在依赖已发表知识和有良好文档记录协议的任务上表现优异，而在需要新颖生物信息学推理的任务上表现较弱。在三个湿实验室验证实验中，我们发现OpenAI的o4-mini-high生成的脚本在OpenTrons液体处理机器人上运行时，成功组装了具有预期序列的DNA。

English abstract

Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they also shift the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of tasks to measure agentic biosecurity-relevant capabilities. ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks. Agents performed highly on tasks drawing on published knowledge and well-documented protocols, and more weakly on a task requiring novel bioinformatics reasoning. In three wet-lab validation experiments, we found that OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.

2606.09833 2026-06-10 cs.HC cs.AI cs.CY 交叉投稿

CollabSkill: Evaluating Human-Agent Collaboration On Real-World Tasks

CollabSkill: 评估真实世界任务中的人机协作

Yijia Shao, Zora Zhiruo Wang, Neel Ahuja, Yicheng Wang, Bowen Liu, Diyi Yang

发表机构 * Stanford University（斯坦福大学）； Carnegie Mellon University（卡内基梅隆大学）； University of Rochester（罗切斯特大学）； Individual Researcher（独立研究者）

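AI summary: Introduces CollabSkill, a framework that pairs real workers with AI agents on occupational tasks and uses a Bayesian skill rating system to quantify human and agent contributions. Claude Code ranks first among agents, and practical experience is the primary driver of collaboration skill.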

Comments 11 pages of main paper, preprint (under review)

详情

AI中文摘要

AI代理正在重塑工作空间，导致人类工作方式的剧烈变化。尽管人机协作在保持人类能动性和产生经济价值方面具有巨大潜力，但由于收集真实人类数据和考虑人类间差异的困难，这一范式在职业任务评估中仍然基本缺失。我们引入了CollabSkill，一个用于评估真实世界职业任务中人机协作的框架。CollabSkill将真实人类工人与AI代理配对，执行与其职业背景匹配的任务，收集能够捕捉经济价值任务的复杂性和真实工人使用模式的数据。为了考虑人类间差异，CollabSkill采用贝叶斯技能评级系统来分离并量化人类和AI代理的技能贡献。基于来自93名人类工人的386个工作会话中的1500多个提示，我们的分析在两个层面产生了见解：在代理方面，CollabSkill上的排名与现有完全自主基准（其中Codex领先）有显著差异，Claude Code排名第一；在人类方面，CollabSkill揭示了实践经验是协作技能的主要驱动力，动手协作有意义地改变了工人的AI素养。总之，我们希望CollabSkill能使社区投资于系统评估人机协作，并推动旨在构建真正增强人类工人的AI代理的开发工作。

英文摘要

AI agents are reshaping the workspace, leading to drastic change of how humans work. Despite the considerable potential of human-agent collaboration both in preserving human agency and generating economic value, this paradigm remains largely absent from occupational task evaluation, hindered by the difficulty of gathering real human data and accounting for inter-human variability. We introduce CollabSkill, a framework for evaluating human-agent collaboration on real-world occupational tasks. CollabSkill pairs real human workers with AI agents on tasks matched to their occupational background, collecting data that capture the complexity of economically valuable tasks and the usage patterns of real workers. To account for inter-human variability, CollabSkill employs a Bayesian skill rating system to disentangle and quantify the skill contributions of both humans and AI agents. Drawing on over 1,500 prompts from 386 working sessions contributed by 93 human workers, our analysis yields insights on two fronts: on the agent side, rankings on CollabSkill diverge meaningfully from those of existing fully autonomous benchmarks where Codex leads, with Claude Code ranking first; on the human side, CollabSkill reveals that practical experience emerges as the primary driver of collaboration skill, with hands-on collaboration meaningfully shifting workers' AI literacy. Together, we hope CollabSkill enables the community to invest in systematic evaluation of human-agent collaboration and spurs development efforts aimed at building AI agents that genuinely augment human workers.

2606.09843 2026-06-10 cs.HC cs.AI cs.CL 交叉投稿

An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

一个原生LLM的心理测量工具不能预测LLM行为：来自25个模型的证据

Juan Manuel Contreras

发表机构 * Independent Researcher（独立研究员）

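AI summary: Builds a psychometric instrument from LLM behavior via exploratory factor analysis, finds that LLM self-reports do not track observed behavior, and identifies a confound: self-report items and LLM judges share variance that human raters do not.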

详情

AI中文摘要

大型语言模型（LLM）在人格量表上产生稳定的自我报告，但这些自我报告并不能预测观察到的行为。这一差距是反映了LLM与人类特质结构之间的不匹配，还是LLM自我报告本身的更深层属性，此前尚未解决。我们构建了第一个心理测量工具，其结构通过探索性因子分析（EFA）从LLM行为能力中自下而上地推导出来。我们对来自17个模型家族的25个LLM施测了300个项目（240个直接李克特+60个基于场景），涵盖12个候选行为维度，每个项目施测30次。EFA产生了一个5因子结构——响应性、顺从性、大胆性、谨慎性和健谈性——具有极好的分半信度（所有Tucker φ ≥ .957）和内部一致性（所有α ≥ .930）。为了测试预测效度，我们收集了由151名人类评分者和一个三人LLM评审团评分的2500个开放式行为样本。人类和评审团评分一致（r̄ = .51），但两者均不跟踪自我报告：自我报告-人类r̄ = -.01，自我报告-评审团r̄ = .13，且没有因子水平的自我报告-人类置信区间排除零。在响应性上，自我报告与LLM评审团相关（r = .53），但与人类不相关（r = .04），尽管人类和评审团一致（r = .59）——这表明自我报告项目和LLM评审团共享人类观察者未捕捉到的方差，这是一个在集成内部可靠性检查中不可见的混淆因素。我们将该工具作为诊断探针发布，用于检测对齐塑造的自我描述，并作为LLM作为评审团流程的具体风险因素。

英文摘要

Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deeper property of LLM self-report itself, has been unresolved. We constructed the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances via exploratory factor analysis (EFA). We administered 300 items (240 direct Likert + 60 scenario-based) spanning 12 candidate behavioral dimensions to 25 LLMs across 17 model families, each item administered 30 times. EFA yielded a 5-factor structure -- Responsiveness, Deference, Boldness, Guardedness, and Verbosity -- with excellent split-half replicability (all Tucker $ϕ\geq .957$) and internal consistency (all $α\geq .930$). To test predictive validity, we collected 2,500 open-ended behavioral samples rated by 151 human raters and a three-judge LLM ensemble. Human and judge ratings agreed ($\bar{r} = .51$), but neither tracked self-report: self-report--human $\bar{r} = -.01$, self-report--judge $\bar{r} = .13$, with no factor-level self-report--human CI excluding zero. On Responsiveness, self-report correlated with LLM judges ($r = .53$) but not humans ($r = .04$), even though humans and judges agreed ($r = .59$) -- indicating self-report items and LLM judges share variance that human observers do not, a confound invisible to within-ensemble reliability checks. We release the instrument as a diagnostic probe for alignment-shaped self-description and a concrete risk factor for LLM-as-judge pipelines.
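
The split-half replicability figures quoted above are Tucker congruence coefficients. As a rough illustration only, the coefficient between two factor-loading vectors can be computed as below; the loadings are made up for the example and are not taken from the paper.

```python
import math


def tucker_phi(x, y):
    """Tucker's congruence coefficient between two factor-loading vectors."""
    if len(x) != len(y):
        raise ValueError("loading vectors must have the same length")
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    if denominator == 0:
        raise ValueError("congruence is undefined for an all-zero vector")
    return numerator / denominator


# Hypothetical loadings for one factor, estimated on two random halves of the data.
half_a = [0.81, 0.77, 0.69, 0.12, 0.05]
half_b = [0.79, 0.80, 0.65, 0.15, 0.02]
phi = tucker_phi(half_a, half_b)
```

The coefficient is insensitive to a positive rescaling of either vector, which is why it is commonly used to compare factor solutions whose loadings differ in overall magnitude.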

2606.09861 2026-06-10 cs.LG cs.AI 交叉投稿

一个镜头，多个世界：面向世界模型可解释性的能力类型接口

Bhavith Chandra Challagundla, Sanskar Pandey, Param Thakkar, Rishikesh Mallagundla, Yugandhar Reddy Gogireddy, Wenhao Lu, Hindol Roy Choudhury, Shravani Challagundla, Mohamed Deraz Nasr, Spursh Deshpande

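AI summary: Presents WorldModelLens, which uses a capability-typed adapter to unify interpretability analyses across different world models (e.g., PlaNet, IRIS, I-JEPA) and avoid reimplementing them per architecture.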

详情

AI中文摘要

世界模型现在建立在截然不同的计算基板上。潜在循环状态空间模型（如PlaNet和Dreamer系列）将观测压缩为循环状态；基于token的模型（如IRIS）将观测量化到学习到的码本中，并用transformer进行自回归预测；联合嵌入预测架构（如I-JEPA）在没有像素解码器的学习潜在空间中进行预测。应用于这些模型的可解释性方法，包括探针、激活修补、稀疏自编码器和惊喜分析，共享一组共同的基元，但由于现有的钩子和缓存工具假设一个没有动作、环境步骤或想象回滚概念的transformer语言模型，它们为每个架构从头重新实现。我们认为这种碎片化反映了工具而非模型，并且世界模型的共享结构可以通过一个小型类型接口捕获。我们提出了WorldModelLens，一个围绕能力类型适配器组织的开源可解释性基板：每个模型实现四个必需方法（编码、转移、初始状态、采样），并通过显式能力描述符声明一组可选头（解码、奖励、继续、行动者、评论者），使得强化学习和自监督世界模型成为一等公民，而无需模仿对方。单一的钩子和缓存层在此接口上暴露时间索引的激活、想象回滚和干预重放，使得每个分析只需编写一次。

英文摘要

World models are now built on substantially different computational substrates. Latent recurrent state-space models such as PlaNet and the Dreamer family compress observations into recurrent states; token-based models such as IRIS quantize observations into a learned codebook and predict autoregressively with a transformer; and joint-embedding predictive architectures such as I-JEPA predict in a learned latent space with no pixel decoder. The interpretability methods applied to these models, including probing, activation patching, sparse autoencoders, and surprise analysis, share a common set of primitives, yet they are re-implemented from scratch for each architecture because existing hook-and-cache tooling assumes a transformer language model with no notion of actions, environment steps, or imagined rollouts. We argue that this fragmentation reflects the tooling rather than the models, and that the shared structure of world models is captured by a small typed interface. We present WorldModelLens, an open-source interpretability substrate organized around a capability-typed adapter: every model implements four required methods (encode, transition, initial state, sample) and declares a set of optional heads (decode, reward, continue, actor, critic) through an explicit capability descriptor, so that reinforcement-learning and self-supervised world models are first-class without either imitating the other. A single hook and cache layer exposes time-indexed activations, imagination rollouts, and intervention replay over this interface, allowing each analysis to be written once.
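
To make the adapter idea concrete, here is a minimal sketch of what a capability-typed interface could look like. It is not the WorldModelLens API: the class names, method signatures, and the toy model are all hypothetical, chosen only to mirror the four required methods and the optional heads named in the abstract.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Explicit descriptor of the optional heads a model provides (hypothetical names)."""
    decode: bool = False
    reward: bool = False
    continue_: bool = False
    actor: bool = False
    critic: bool = False


class WorldModelAdapter(ABC):
    """Four required methods; optional heads are declared via `capabilities`."""
    capabilities = Capabilities()

    @abstractmethod
    def encode(self, observation): ...

    @abstractmethod
    def transition(self, state, action): ...

    @abstractmethod
    def initial_state(self): ...

    @abstractmethod
    def sample(self, state, rng): ...

    def reward(self, state):
        raise NotImplementedError("model declares no reward head")


class ToyCounterModel(WorldModelAdapter):
    """Toy stand-in whose latent state is a single float."""
    capabilities = Capabilities(reward=True)

    def encode(self, observation):
        return float(observation)

    def transition(self, state, action):
        return state + action

    def initial_state(self):
        return 0.0

    def sample(self, state, rng):
        return state + rng.gauss(0.0, 0.01)

    def reward(self, state):
        return -abs(state - 3.0)


def imagine(model, actions, cache):
    """Roll the model forward, caching time-indexed states as a hook layer might."""
    state = model.initial_state()
    for t, action in enumerate(actions):
        state = model.transition(state, action)
        cache[t] = state
    return state
```

Because `imagine` touches only the required methods, the same rollout and caching code would work for any adapter, and an analysis that needs a reward head can check `model.capabilities.reward` before calling it.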

2606.10010 2026-06-10 eess.AS cs.AI cs.MM cs.SD 交叉投稿

DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment

DeRA-MOS：通过解耦列表排序和模态对齐优化文本到音乐评估

Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

Affiliations * E.SUN Financial Holding Co., Ltd.; United Link Co., Ltd.; Institute of Information Science, Academia Sinica; Department of Computer Science and Information Engineering, National Taiwan Normal University

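AI summary: Proposes DeRA-MOS, a decoupled optimization framework that uses a batch-aware listwise ranking loss for music impression and a score-anchored modality alignment loss for text alignment, yielding substantial improvements in both ranking metrics on MusicEval.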

Comments Accepted to IEEE Signal Processing Letters (SPL)

详情

AI中文摘要

评估文本到音乐（TTM）系统仍然昂贵，因为音乐印象（MI）和文本对齐（TA）分数依赖于人类平均意见分数（MOS）。大多数自动MOS估计器采用逐点回归或分布分类训练。这些目标不直接优化基于排名的指标，并且为跨模态一致性提供较弱的几何约束。为了解决这些问题，我们提出了DeRA-MOS，一种用于TTM评估的解耦优化框架。对于MI，我们引入了一种批感知列表排序损失，该损失对每个小批量内的相对顺序进行建模，并更好地与基于Spearman秩相关系数（SRCC）的评估对齐。对于TA，我们引入了一种分数锚定的模态对齐损失，将人类分数映射到目标音频-文本相似度，并在融合前正则化潜在空间。通过有效缓解逐点训练不匹配和模态漂移，MusicEval上的实验表明，我们的解耦框架在MI和TA排名指标上均取得了显著改进，为大规模TTM评估建立了稳健的范式。

英文摘要

Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
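
The abstract does not give the form of the batch-aware listwise ranking loss. As a stand-in, the sketch below uses the classic ListNet top-1 objective, which treats the scores in a mini-batch as one list and penalizes disagreement in relative order rather than per-item error. It is an assumption made for illustration, not the DeRA-MOS loss.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_top1_loss(predicted, human_mos):
    """Cross-entropy between the top-1 distributions induced by two score lists."""
    if len(predicted) != len(human_mos):
        raise ValueError("score lists must have the same length")
    target = softmax(human_mos)
    model = softmax(predicted)
    return -sum(t * math.log(p) for t, p in zip(target, model))


# Hypothetical mini-batch of three clips with human MOS labels.
mos = [1.0, 2.0, 3.0]
loss_same_order = listnet_top1_loss([1.0, 2.0, 3.0], mos)
loss_reversed = listnet_top1_loss([3.0, 2.0, 1.0], mos)
```

Unlike point-wise regression, this loss is unchanged when every prediction in the batch is shifted by the same constant, so it rewards correct ordering within the batch rather than calibrated absolute scores.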

2606.10066 2026-06-10 cs.CV cs.AI cs.LG 交叉投稿

A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks

公共医学视觉语言基准中预训练污染的受控审计

Bruce Changlong Xu, Lan Wu, Alexander Ryu

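AI summary: An audit of public medical VLM benchmarks finds image-side source overlap and a text-side canonical-order exchangeability signal, but the flagged image matches were not verified pixel-level duplicates. Cohort-relative membership-inference detectors prove unreliable on small medical-VLM cohorts.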

Comments 30 pages, 7 figures, 9 tables. Preprint

详情

AI中文摘要

医学视觉语言模型（VLM）在公共基准上进行评估，这些基准的图像和问答对多年来一直可自由下载，但报告准确度假设这些示例在预训练中不存在。我们对SLAKE-En、PathVQA、VQA-RAD以及一个辅助的公共OmniMedVQA镜像上的开放VLM进行了审计，使用了四种检测器系列：图像侧近邻重叠（针对PMC-OA-beta）、规范顺序可交换性、队列相对Min-K%++尾部富集以及跨模型Top-K重叠。我们发现SLAKE-En上存在可测量的图像侧源重叠：SigLIP-B-16标记了19.8%的图像，SigLIP-SO400M标记了4.2%，而域外对照产生0/2000个标记。人工裁定显示，相同模态、相同投影的匹配对应不同患者，而非经过验证的像素级重复，因此我们将其解释为源或分布重叠，而非确认的每图像记忆。在文本侧，Qwen2.5-VL在SLAKE-En上显示出规范顺序可交换性信号，该信号在顺序消融和外部非医学基线中仍然存在。在OmniMedVQA镜像上，五个医学和通用VLM触发了可交换性，而BLIP-2保持干净。相比之下，队列相对Min-K%++尾部富集和跨模型Top-K重叠在外部预域基线中崩溃：BLIP-2重现了明显的正信号，尽管缺乏合理的医学VQA暴露。我们得出结论，这些队列相对检测器作为小规模医学VLM队列上的独立成员推理信号是不可靠的。

英文摘要

Medical vision-language models (VLMs) are evaluated on public benchmarks whose images and question-answer pairs have been freely downloadable for years, yet reported accuracy assumes these examples were absent from pretraining. We audit open VLMs on SLAKE-En, PathVQA, VQA-RAD, and an auxiliary public OmniMedVQA mirror using four detector families: image-side near-neighbour overlap against PMC-OA-beta, canonical-order exchangeability, cohort-relative Min-K%++ tail enrichment, and cross-model top-K overlap. We find measurable image-side source overlap on SLAKE-En: 19.8% of images are flagged under SigLIP-B-16 and 4.2% under SigLIP-SO400M, while out-of-domain controls produce 0/2000 flags. Manual adjudication shows same-modality, same-projection matches to different patients rather than verified pixel-level duplicates, so we interpret this as source or distributional overlap rather than confirmed per-image memorization. On the text side, Qwen2.5-VL on SLAKE-En shows a canonical-order exchangeability signal that survives ordering ablation and external non-medical baselines. On the OmniMedVQA mirror, exchangeability fires for five medical and general VLMs while BLIP-2 remains clean. In contrast, cohort-relative Min-K%++ tail enrichment and cross-model top-K overlap collapse under an external pre-domain baseline: BLIP-2 reproduces the apparent positive signals despite lacking plausible medical-VQA exposure. We conclude that these cohort-relative detectors are unreliable as standalone membership-inference signals on small medical-VLM cohorts.

2606.10084 2026-06-10 cs.LG cs.AI 交叉投稿

Divide-and-Conquer Modeling for the CTF-4-Science Lorenz Benchmark

CTF-4-Science Lorenz基准的分治建模策略

Shundong Li

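AI summary: Proposes a divide-and-conquer modeling strategy that designs separate models for the five scenario families of the CTF-4-Science Lorenz benchmark, combining smoothing-based denoising, NG-RC/NVAR forecasting, Lorenz transition correction, and parameter-prefix blending. Its score of 79.63 is presented as evidence that scenario-specific updates outperform a single general model.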

2606.10156 2026-06-10 cs.IR cs.AI cs.CL 交叉投稿

$τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

$τ$-Rec：面向智能推荐系统的可验证基准

Bharath Sivaram Narasimhan, Karthik R Narasimhan

发表机构 * Independent Researcher（独立研究员）； Princeton University（普林斯顿大学）

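AI summary: To address the subjectivity and cost of evaluating multi-turn conversational recommender agents, introduces $τ$-Rec, a benchmark built on verifiable rewards, a reveal-tagged elicitation mechanism, and a pass^k reliability metric. Even the best model reaches only about 57% at pass^1 and about 38% at pass^4.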

详情

AI中文摘要

随着推荐系统向智能、多轮对话界面转变，评估范式难以跟上步伐。当前的基准通常依赖“LLM作为评判者”的评估，这引入了主观性、高成本和不一致性。我们提出了$τ$-Rec，一个用于智能推荐系统的基准，它用可验证奖励取代主观评估，并采用揭示标记引导（RTE）机制来控制任务约束在对话中如何呈现。通过针对结构化目录谓词测试智能体，并采用pass^k可靠性指标，$τ$-Rec为一致的推理提供了系统测试。我们对五个模型家族（GPT-5.4、Claude Sonnet 4.6、Gemini 2.5 Flash、DeepSeek V4 Flash、Qwen3-32B和GPT-5 mini）的九种配置进行了评估，揭示了一个陡峭的可靠性悬崖，即使是最好的模型在pass^1上也仅达到约57%，在pass^4上约38%，突显了当前对话智能体部署中的关键差距。所有代码和数据均在此https URL公开。

英文摘要

As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present $τ$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, $τ$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
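
The abstract does not define pass^k. The sketch below assumes the definition used by the earlier τ-bench work: the probability that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The task results are invented for the example.

```python
from math import comb


def pass_hat_k(results, k):
    """Mean over tasks of C(c, k) / C(n, k), with (c, n) = (successes, trials)."""
    if not results:
        raise ValueError("need at least one task")
    for successes, trials in results:
        if not 0 <= successes <= trials:
            raise ValueError("successes must lie between 0 and trials")
        if not 1 <= k <= trials:
            raise ValueError("k must lie between 1 and the number of trials")
    return sum(comb(c, k) / comb(n, k) for c, n in results) / len(results)


# Hypothetical outcomes: three tasks, four trials each.
results = [(4, 4), (2, 4), (0, 4)]
pass_1 = pass_hat_k(results, 1)
pass_4 = pass_hat_k(results, 4)
```

Under this definition pass^1 is the ordinary success rate, while larger k credits only tasks that are solved consistently, which is the kind of drop the reported fall from pass^1 to pass^4 reflects.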

2606.10194 2026-06-10 cs.LG cs.AI 交叉投稿

捕捉五分之一：LLM作为评判员在生产环境多轮交易代理中的盲点

Sawyer Zhang, Alexander Wang, Sophie Lei

Affiliations * Lumivate (Lumi)

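AI summary: Measures how many real defects the built-in LLM judge of a deployed food-and-beverage ordering agent catches. In one batch it surfaced 2 of 9 systematic patterns (22%), mainly because the scoring rubric has no category for behavioural dimensions such as state tracking, and because defects the judge did note were routed to the wrong category.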

Comments 13 pages, 1 figure, 5 tables

详情

AI中文摘要

LLM作为评判员是评估对话代理的默认工具，但其可靠性几乎总是报告为与人类评分的一致性，而非真实缺陷的召回率。我们研究了一个已部署的多轮餐饮订购代理，并通过详尽的人工转录审查作为基准，衡量其内置LLM评判员捕获了多少真实质量问题。在三个批次中，评判员发现的系统性问题远低于人类确认的四分之一——在一个批次中，9种模式中只有2种（22%），而在另一个批次中，其操作门控标记了100轮中的0轮，而人类确认了23个不同缺陷和7个新的跨轮模式。我们的盲点分类表明，失败是有结构的，而非随机的：评判员能捕获轮次局部问题（虚构统计数据、错误语言），但遗漏了跨轮状态问题（确认门锁死、购物车幻觉、升级锁死、过时引用）。机制在于：评分标准仅暴露三个粗略轴（意图、品牌声音、个性化），且没有针对行为维度（状态跟踪、护栏、恢复）的类别，而大多数缺陷集中于此。失败在于路由而非感知：114轮中，113轮原始评判员注释描述了确认门或购物车状态缺陷，但被评分为“品牌声音”，且无一到达操作失败——门控连接到挂起和硬断言，而非评分标准——因此0%是路由和接线失败，而非失明。对流行率估计的影响是显著的：当表观缺陷率为零时，Rogan-Gladen校正退化——无信号可恢复真实率——而当门控报告非零率时，相同估计器在我们测量的灵敏度下暗示3-6倍的低估。对于生产环境多轮代理，自动评判是回归底线，而非人工审查的替代品。

英文摘要

LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth. Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems -- 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns. Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions -- state-tracking, guardrails, recovery -- where most defects cluster. The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored "brand voice", and none reach an operational failure -- the gate is wired to hangs and hard assertions, not the rubric -- so the 0% is a routing-and-wiring failure, not blindness. The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates -- no signal can recover the true rate -- while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.
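
The Rogan-Gladen correction mentioned above estimates true prevalence from an apparent rate and the detector's sensitivity and specificity. The sketch below shows why a zero apparent rate is uninformative. The sensitivity of 0.22 is borrowed from the 2-of-9 figure, and perfect specificity is assumed for illustration; neither is claimed to be the paper's exact setting.

```python
def rogan_gladen(apparent_rate, sensitivity, specificity=1.0):
    """Rogan-Gladen estimate of true prevalence, clipped to [0, 1]."""
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("requires sensitivity + specificity > 1")
    estimate = (apparent_rate + specificity - 1.0) / denominator
    return min(1.0, max(0.0, estimate))


# A gate that flags nothing yields an estimate of zero whatever the sensitivity.
zero_signal = rogan_gladen(0.0, sensitivity=0.22)

# A gate reporting 5% defects with sensitivity 0.22 implies roughly 23% true prevalence.
corrected = rogan_gladen(0.05, sensitivity=0.22)
undercount_factor = corrected / 0.05
```

With these assumed values the implied undercount is about 4.5x, which falls inside the 3-6x range stated in the abstract.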

2606.10329 2026-06-10 cs.CV cs.AI 交叉投稿

Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset

地震中的建筑变化检测：一种多尺度交互网络和一个变化检测数据集

Yunlong Liu, Zekai Zhang

发表机构 * School of Control Science and Engineering, Shandong University（控制科学与工程学院，山东大学）

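AI summary: To address change detection with short post-earthquake imaging intervals, builds the Turkey earthquake change detection dataset (TUE-CD) and proposes a multi-scale feature interaction network (MSI-Net) whose joint cross-attention and multi-scale offset calibration modules mitigate side-looking problems. MSI-Net gives better results than the compared state-of-the-art methods.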

详情

AI中文摘要

作为最具破坏性的自然灾害之一，近年来地震袭击了世界许多国家，造成了严重的经济损失。变化检测（CD）可应用于震后损伤评估，因为它能从多时相遥感图像中推断出被破坏的变化区域。此外，短成像间隔的变化检测将更好地满足地震后紧急救援的需求。然而，由于缺乏短成像间隔的数据集，当前基于深度神经网络的方法的能力受到限制。为了满足灾后即时救援的需求，我们创建了一个变化检测数据集——土耳其地震变化检测数据集（TUE-CD），用于评估地震后短期内的建筑损坏情况。由于后事件图像的采集间隔短，不同时相图像的成像角度不同，导致了一些侧视问题。为了应对这些挑战，我们提出了一种多尺度特征交互网络（MSI-Net），用于双时相特征之间的高效交互，并减轻侧视问题的影响。具体来说，所提出的MSI-Net由联合交叉注意力（JCA）模块、多尺度偏移校准（MOC）模块和特征集成（FeI）模块组成。JCA模块统一了通道交叉注意力和空间联合注意力，以实现充分的特征交互。MOC模块进一步估计偏移量，以将双时相图像与多尺度特征对齐。最后，通过FeI模块融合校准后的特征和多尺度特征，用于变化区域的预测。在WHU-CD、CLCD和构建的TUE-CD数据集上的实验表明，所提出的MSI-Net比考虑的最先进的变化检测方法提供了更好的结果。

英文摘要

As one of the most destructive natural disasters, earthquakes have struck many countries around the world in recent years, causing serious economic losses. Change detection (CD) can be applied to post-earthquake damage assessment as it can infer destroyed change regions from multi-temporal remote sensing images. Furthermore, the CD with short imaging interval will better satisfy the needs of the emergency rescues after earthquakes. However, the capability of current methods built on deep neural networks is limited because the dataset with short imaging interval is absent. To meet post-disaster immediate relief, we create a CD dataset, Turkey earthquake CD dataset (TUE-CD), for the evaluation of building damage in the short term after an earthquake. Because of the short acquisition interval of the post-event images, the imaging angle is different for different temporal images, which leads to some side-looking problems. To deal with these challenges, we present a multi-scale feature interaction network (MSI-Net) for efficient interaction between bi-temporal features, as well as mitigating the effect of side-looking problems. Specifically, the proposed MSI-Net consists of joint cross-attention (JCA) modules, multi-scale offset calibration (MOC) modules, and feature integration (FeI) modules. The JCA module unifies channel cross-attention and spatial joint attention for sufficient feature interaction. The MOC module further estimates the offsets to align the bi-temporal image with the multi-scale features. Finally, calibrated features and multi-scale features are fused by FeI modules for the prediction of changed areas. Experiments on the WHU-CD, CLCD, and the constructed TUE-CD dataset indicate that the proposed MSI-Net provides better results than considered state-of-the-art CD methods.

2606.10366 2026-06-10 cs.RO cs.AI 交叉投稿

A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation

提升VLA评估中仿真与真实相关性的实用指南

Shuo Wang, Hanyuan Xu, Yingdong Hu, Fanqi Lin, Yang Gao

发表机构 * Tsinghua University（清华大学）； Shanghai Qi Zhi Institute（上海期智研究院）

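AI summary: Systematically studies how well simulated evaluation of VLA policies correlates with real-world evaluation, and offers a unified framework for measuring and improving simulation as a proxy for real-world evaluation.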

Comments 20 pages

详情

AI中文摘要

仿真已成为评估和改进视觉-语言-动作（VLA）策略的重要工具，为昂贵的真实机器人评估提供了可扩展、可重复且可控的替代方案。最近的仿真基准在真实感和多样性方面取得了实质性进展，但这些平台尚未被广泛用作可靠的真实策略评估代理。在这项工作中，我们通过仿真与真实相关性的视角研究这一问题。我们在多个仿真平台、VLA策略、任务和扰动因素上进行了系统研究，测量模拟评估在策略排名一致性、性能相关性和扰动方面失败模式上是否保留真实结论。这一分析使我们能够表征现有模拟器的局限性，并确定哪种模拟信号更符合真实部署。我们进一步研究了用户应如何利用仿真进行策略改进，包括何时基于模拟器的微调是有益的，以及后训练数据量如何影响仿真与真实的对齐。总体而言，我们的工作提供了一个统一的框架，用于测量、解释和提升仿真对VLA策略的有用性，为模拟器设计者和在策略开发流程中使用仿真的实践者提供指导。

英文摘要

Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible, and controllable alternatives to costly real-world robot evaluation. Recent simulation benchmarks have made substantial progress on realism and diversity, yet these platforms have not been widely adopted as reliable proxies for real-world policy evaluation. In this work, we investigate this issue through the lens of sim-and-real correlation. We conduct a systematic study across multiple simulation platforms, VLA policies, tasks, and perturbation factors, measuring whether simulated evaluation preserves real-world conclusions in terms of policy ranking consistency, performance correlation, and perturbation-wise failure patterns. This analysis allows us to characterize the limitations of existing simulators and identify what kinds of simulation signals are more aligned with real-world deployment. We further examine how users should exploit simulation for policy improvement, including when simulator-based finetuning is beneficial and how the amount of post-training data affects sim-and-real alignment. Overall, our work provides a unified framework for measuring, interpreting, and improving the usefulness of simulation for VLA policies, offering guidance both for simulator designers and for practitioners who use simulation as part of the policy development pipeline.

2606.10380 2026-06-10 cs.CL cs.AI 交叉投稿

Expert-Level Crisis Detection in Mental Health Conversations

心理健康对话中的专家级危机检测

Grace Byun, Abigail Lott, Rebecca Lipschutz, Sean T. Minton, Elizabeth A. Stinson, Jinho D. Choi

发表机构 * Department of Computer Science, Emory University（埃默里大学计算机科学系）； Department of Psychiatry and Behavioral Sciences, Emory University（埃默里大学精神病学与行为科学系）

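AI summary: Introduces the CRADLE-Dialogue benchmark and the Alert-Confirm evaluation protocol for turn-level crisis detection in conversations, finds that models struggle to identify when risk emerges, and releases a synthetic training corpus and a 32B-parameter model.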

详情

AI中文摘要

现实世界的危机干预本质上是对话式的，然而现有研究主要关注静态文本。当应用于多轮对话时，当前模型表现出显著的性能下降，难以追踪随着上下文演变而出现的风险信号。为了解决这一差距，我们引入了CRADLE-Dialogue，这是一个由临床医生标注的基准数据集，用于对话环境中的回合级危机检测。该数据集包含600个对话，具有跨临床基础风险的多标签注释，包括自杀意念、自残和儿童虐待，区分过去和当前风险。我们进一步提出了一种Alert-Confirm评估协议，该协议区分早期预警信号（Alert）和特定危机变得明确可识别的回合（Confirm），反映了在风险变得明确之前进行干预的临床需求。实验表明，识别风险何时出现比识别其存在要困难得多：模型的Micro F1仅达到40%中段到60%高段。此外，我们发布了一个合成训练语料库和一个32B参数模型，该模型显著优于现有的开源模型，并在回合级、对话级和仅确认评估设置中与专有模型相比具有竞争力或更优的结果。

英文摘要

Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts. When applied to multi-turn dialogues, current models exhibit significant performance degradation, struggling to track risk signals that emerge as context evolves. To address this gap, we introduce CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in conversational settings. The dataset features 600 dialogues with multi-label annotations across clinically grounded risks, including suicide ideation, self-harm, and child abuse, distinguishing past from ongoing risk. We further propose an Alert-Confirm evaluation protocol that distinguishes early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm), reflecting the clinical need to intervene before risk becomes explicit. Experiments show that identifying when risk emerges is much harder than recognizing that it exists: models achieve only mid-40% to high-60% Micro F1. Additionally, we release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and achieves competitive or superior results against proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.

2606.10388 2026-06-10 cs.IR cs.AI 交叉投稿

SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval

SkillResolve-Bench：衡量和解决智能体技能检索中的同能力歧义

Jiandong Ding

发表机构 * Huawei Technologies Ltd（华为技术有限公司）

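AI summary: Targets the execution risk posed by different skills within the same capability family in agent skill libraries. Introduces SkillResolve-Bench and the SkillResolve method, which resolves candidate families and selects one representative per family, reaching Recall@3 of 0.766 while keeping HSR@3 at 0.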

Comments Preprint

详情

AI中文摘要

智能体技能库正成为可路由的软件资产：检索到的技能可以为智能体提供指令、脚本、资源绑定和执行假设。这使得技能检索不仅仅是广泛的相关性匹配。检索器可以找到正确的能力族，却暴露出错误的同能力代表。我们将这种失败研究为同能力执行风险检索。每个查询将一个有用的技能与一个特定于查询的有风险兄弟技能配对，该兄弟技能共享能力族，但可能导致执行指向过时资源、缺失前提或错误程序。我们引入了SkillResolve-Bench 1.0，这是一个针对该场景的可审计基准，包含661个有用/有风险对、源角色和准入证据、线索/泄漏检查、查询不相交划分，以及一个包含6,660个公共SkillRet候选的7,982个候选池。该基准报告有用性排名以及有害兄弟率（HSR@K），即前K个中暴露有风险兄弟的比例。我们还提供了SkillResolve，一种参考方法，它解析活跃候选族，从易混淆的库负样本和契约配置文件线索中评分查询条件效用，并在最终前K列表之前从每个族中选择一个代表。在已发布族关系下，SkillResolve达到Recall@3 0.766和NDCG@3 0.699，同时保持HSR@3=0。与SkillRouter相比，Recall@3提升0.112，NDCG@3提升0.165，同时将HSR@3从0.693降至0。如果没有代表性选择，在相同评分器下HSR@3升至0.236，这表明族内代表性选择是将能力检索转化为更安全过程暴露的机制。

英文摘要

Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and execution assumptions to an agent. This makes skill retrieval more than broad relevance matching. A retriever can find the right capability family yet expose the wrong same-capability representative. We study this failure as same-capability execution-risk retrieval. Each query pairs a helpful skill with a query-specific risky sibling that shares the capability family but can lead execution toward a stale resource, missing precondition, or wrong procedure. We introduce SkillResolve-Bench 1.0, an auditable benchmark for this setting with 661 helpful/risky pairs, source-role and admission evidence, cue/leakage checks, query-disjoint splits, and a 7,982-candidate pool that includes 6,660 public SkillRet candidates. The benchmark reports helpful ranking together with harmful sibling rate (HSR@K), the top-K exposure of the risky sibling. We also provide SkillResolve, a reference method that resolves active candidate families, scores query-conditioned utility from confusable library negatives and contract-profile cues, and selects one representative from each family before the final top-K list. Under the released family relation, SkillResolve reaches Recall@3 0.766 and NDCG@3 0.699 while keeping HSR@3=0. It improves over SkillRouter by 0.112 Recall@3 and 0.165 NDCG@3 while reducing HSR@3 from 0.693 to 0. Without representative selection, HSR@3 rises to 0.236 under the same scorer, identifying within-family representative choice as the mechanism that turns capability retrieval into safer procedural exposure.
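
The two headline metrics can be sketched as follows. This assumes HSR@K is the fraction of queries whose top-K list contains the query's risky sibling, and Recall@K the fraction whose top-K contains the helpful skill. The data layout and skill names are hypothetical.

```python
def recall_at_k(rankings, helpful, k):
    """Fraction of queries whose top-k list contains the helpful skill."""
    hits = sum(1 for query, ranked in rankings.items() if helpful[query] in ranked[:k])
    return hits / len(rankings)


def harmful_sibling_rate_at_k(rankings, risky, k):
    """Fraction of queries whose top-k list exposes the risky sibling."""
    exposed = sum(1 for query, ranked in rankings.items() if risky[query] in ranked[:k])
    return exposed / len(rankings)


# Hypothetical retrieval output for two queries.
rankings = {
    "q1": ["deploy_v2", "deploy_v1", "lint", "backup"],
    "q2": ["backup", "restore_old", "lint", "restore_new"],
}
helpful = {"q1": "deploy_v2", "q2": "restore_new"}
risky = {"q1": "deploy_v1", "q2": "restore_old"}

recall = recall_at_k(rankings, helpful, k=3)
hsr = harmful_sibling_rate_at_k(rankings, risky, k=3)
```

In this toy case both risky siblings appear in the top 3, so HSR@3 is 1.0 even though Recall@3 is 0.5, showing that the two metrics can move independently and why the benchmark reports both.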

2606.10460 2026-06-10 cs.CL cs.AI 交叉投稿

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

LakeQA：百万级数据湖上的探索性问答基准

Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu

发表机构 * Columbia University（哥伦比亚大学）； New York University（纽约大学）； Barnard College（巴纳德学院）

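AI summary: Introduces LakeQA, a benchmark that requires LLM agents to search and perform multi-hop reasoning over a heterogeneous data lake of about 9.5 TB. GPT-5.2 reaches only 18.37% exact match.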

详情

AI中文摘要

近期的大语言模型（LLM）在基于阅读的问答（QA）方面取得了快速进展，其中证据被明确提供或可以轻松检索。相比之下，现实世界的问题通常不与准确的证据文档配对。有用的证据存在于海量数据湖中，使得搜索成为回答的前提。然而，目前缺乏要求在大型数据湖上进行搜索和推理的综合基准。为此，我们引入了LakeQA，一个针对数据湖上以搜索为中心的问答的综合基准，同时强调搜索和推理能力。LakeQA建立在来自维基百科和开源政府数据的大约9.5 TB文本资源的异构集合上，涵盖结构化和非结构化数据。为确保任务质量，每个样本至少由一名博士级专家标注。每个任务需要长期的多跳推理，包含隐式的中间步骤：智能体需要发现正确的文档，然后跨来源组合证据以产生答案。在七个前沿LLM上的实验结果表明，LakeQA具有挑战性。例如，GPT-5.2在LakeQA上仅达到18.37%的精确匹配分数。总体而言，LakeQA为开发能够在现代数据湖中查找和分析数据的LLM智能体提供了一个现实的测试平台。

英文摘要

Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.

2606.10554 2026-06-10 cs.CL cs.AI 交叉投稿

Benchmarking Knowledge Editing using Logical Rules

使用逻辑规则对知识编辑进行基准测试

Tatiana Moteu Ngoli, NDah Jean Kouagou, Hamada M. Zahera, Axel-Cyrille Ngonga Ngomo

发表机构 * Data Science Group, Heinz Nixdorf Institute, Paderborn University（帕德博恩大学海因茨·尼克斯多夫研究所数据科学组）

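AI summary: Introduces a logical-rule-based benchmark that evaluates how knowledge editing methods handle the logical consequences of a single fact edit, and finds a gap of up to 24% between performance on directly edited knowledge and on entailed knowledge.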

Comments Accepted at the 24th International Semantic Web Conference 2025

详情

DOI: 10.1007/978-3-032-09530-5_3
Journal ref: The Semantic Web. ISWC 2025. ISWC 2025. Lecture Notes in Computer Science, vol 16141. Springer, Cham

AI中文摘要

大型语言模型（LLMs）越来越多地部署在需要访问最新知识的实际应用中。然而，重新训练LLMs计算成本高昂。因此，知识编辑技术对于维护预训练模型中的当前信息和纠正错误断言至关重要。当前的知识编辑基准主要关注回忆编辑过的事实，往往忽略其逻辑后果。为解决这一局限，我们引入了一个新基准，旨在评估知识编辑方法如何处理单次事实编辑的逻辑后果。我们的基准从知识图谱中提取与给定编辑相关的逻辑规则，然后基于这些规则生成多跳问题，以评估对逻辑后果的影响。我们的发现表明，虽然现有的知识编辑方法能够准确地将直接断言插入LLMs，但它们经常无法注入蕴含的知识。具体来说，使用ROME和FT等流行方法的实验显示，在直接编辑的知识和蕴含知识的评估之间存在高达24%的性能差距。这凸显了在知识编辑中需要语义感知的评估框架。

英文摘要

Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are crucial for maintaining current information and correcting erroneous assertions within pre-trained models. Current benchmarks for knowledge editing primarily focus on recalling edited facts, often neglecting their logical consequences. To address this limitation, we introduce a new benchmark designed to evaluate how knowledge editing methods handle the logical consequences of a single fact edit. Our benchmark extracts relevant logical rules from a knowledge graph for a given edit. Then, it generates multi-hop questions based on these rules to assess the impact on logical consequences. Our findings indicate that while existing knowledge editing approaches can accurately insert direct assertions into LLMs, they frequently fail to inject entailed knowledge. Specifically, experiments with popular methods like ROME and FT reveal a substantial performance gap, up to 24%, between evaluations on directly edited knowledge and on entailed knowledge. This highlights the critical need for semantics-aware evaluation frameworks in knowledge editing.

2606.10595 2026-06-10 cs.CR cs.AI 交叉投稿

From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning

从数据异质性到收敛：联邦学习的数据中心综述

Huong Nguyen, Mickaël Bettinelli, Amirhossein Ghaffari, Alexandre Benoit, Hong-Tri Nguyen, Susanna Pirttikangas, Lauri Lovén

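Affiliations: Oulu University; University of Southern Brittany; Aalto University

AI summary: Systematically analyzes, from a data perspective, how data heterogeneity affects convergence in federated learning. It proposes a taxonomy of measurable traits, links experimental splits to the real phenomena they emulate, and assesses how data-related vulnerabilities and defenses affect convergence, offering guidance for designing systems with predictable convergence.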

AI中文摘要

联邦学习（FL）已成为集中式学习中数据饥饿问题的有前途解决方案。这种范式使得多个客户端能够在隐私保护下协作训练共享任务模型，而无需暴露其本地数据。虽然数据是任何学习系统中的关键组成部分，但它也是漏洞和挑战的主要来源，并且是稳定且良好收敛训练的主要决定因素。现有的FL综述描述了通用基础、安全实践、机遇、挑战和应用，但没有深入探讨数据的各个方面以及从数据角度考虑问题。它们很少提供一种数据视角的综合，将具体的数据属性、分割协议和防御与收敛速度和稳定性联系起来。本综述通过三个进展填补了这一空白。首先，我们将非独立同分布（non-IID）分析为可测量的特征，并根据其对收敛的影响将其排序为强、中、轻，解释每种影响背后的机制，并调和图像、文本和图上的证据。其次，我们将实验分割实践与它们模拟的真实现象联系起来，揭示它们引入的伪影，并展示这些伪影如何影响目标精度。第三，我们分析了数据相关的脆弱性及其提出的防御如何影响收敛，报告在干净和对抗条件下的性能，使收敛-鲁棒性权衡明确。据我们所知，这是第一个提供对支配FL的数据相关挑战的完整理解的综述。针对每个问题提炼出清晰的要点，我们的工作可作为可操作的指南，帮助从业者设计具有可预测收敛和稳定性的系统。

英文摘要

Federated Learning (FL) has emerged as a promising solution for data hunger in centralized learning. This paradigm enables multiple clients to collaboratively train a shared-task model while preserving privacy, without exposing their local data. While data is a key component of any learning system, it is also a primary source of vulnerabilities and challenges, and a major determinant of stable, well-converged training. Existing FL reviews describe general foundations, security practices, opportunities, challenges, and applications, without delving into the diverse aspects of data or considering problems from the data perspective. They rarely provide a data-lens synthesis that links concrete data properties, split protocols, and defenses to convergence speed and stability. This survey fills that gap with three advances. First, we decompose non-IID data into measurable traits and rank their influence on convergence as strong, medium, or light, explaining the mechanisms behind each and reconciling evidence across images, texts, and graphs. Second, we connect experimental splitting practices to the real phenomena they emulate, expose the artifacts they introduce, and show how those artifacts affect target accuracy. Third, we analyze how data-related vulnerabilities and their proposed defenses affect convergence, reporting performance under clean and adversarial conditions to make the convergence-robustness trade-off explicit. To our knowledge, this is the first survey to provide a complete understanding of the data-related challenges that govern FL. With clear takeaways distilled for each concern, our work serves as actionable guidance, helping practitioners design systems with predictable convergence and stability.

2606.10620 2026-06-10 cs.CV cs.AI 交叉投稿

Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

图像模型能想象时间吗？ImageTime：通过时空一致性探究视觉世界建模的新基准

Xinrui Wu, Lichen Huang

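Affiliations: University of Science and Technology of China

AI summary: Proposes the ImageTime benchmark, which uses a four-keyframe protocol (initial state, action onset, transition state, final state) to evaluate the spatiotemporal consistency of image generation models, revealing where they succeed and fail at maintaining coherent visual world states.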

AI中文摘要

图像生成模型现在能够生成高质量的静态图像，但它们表示视觉世界随时间变化的能力仍然知之甚少。实际工作流程如故事板、逐步插图、参考引导编辑和视频预可视化要求模型在多个视觉状态之间保持身份、对象、空间关系和因果顺序。现有评估主要衡量单图像正确性、组合对齐或视频质量，而未明确图像模型是否能连贯地想象一个时间有序的过程。我们引入ImageTime，一个诊断基准，使用时空一致性作为图像生成中视觉世界建模的行为探针。给定一个动作指令，以及可选地指定初始状态的参考图像，模型必须生成一张包含四个有序关键状态的图像：初始状态、动作开始、过渡状态和最终状态。这个四关键帧协议比单图像生成在时间上要求更高，同时避免了密集视频动态的混淆。ImageTime通过渐进能力层次组织任务，并将每个场景分解为阶段状态谓词、跨帧时间约束和禁止的因果违规。GPT-5.5在结构化的VLM-as-judge协议下对所有生成的图像进行评分，产生可解释的能力分数、诊断子分数和失败标签。通过多家族基准测试，ImageTime揭示了当前图像生成系统在要求随时间维持连贯视觉世界状态时成功、失败和漂移的地方。

英文摘要

Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.

2606.10706 2026-06-10 cs.LG cs.AI 交叉投稿

Unifying Data, Memory, and Compute Efficiency in LLM training: A Survey

统一LLM训练中的数据、内存和计算效率：一项综述

Vanessa Schmidt, Huy Hoang Nguyen, Cédric Jung, Shirin Salehi, Anke Schmeink

Affiliations: Chair of Information Theory and Data Analytics (INDA), RWTH Aachen University; AIT Austrian Institute of Technology GmbH; Automation and Control Institute, Technische Universität Wien (TUW)

AI summary: Surveys, from a resource-constraint perspective, three bottlenecks in large language model training (data efficiency, memory efficiency, and compute budget awareness), emphasizing that they must be optimized jointly rather than in isolation.

Comments Accepted for publication in IEEE Transactions on Artificial Intelligence (TAI)

AI中文摘要

资源约束日益决定了大语言模型（LLM）中可以训练、微调和部署的内容，然而效率通常通过孤立的技术而非作为相互作用的限制系统来研究。本综述采用以约束为中心的视角，围绕三个耦合的瓶颈组织近期进展：数据效率（训练什么）、内存效率（如何适应训练）和计算预算感知（何时何地消耗FLOPs）。在数据轴上，我们回顾了最大化每个token学习量的选择和剪枝方法，从基于学习动态的可扩展代理信号到基于梯度和影响的评分，以及难度感知和课程式策略。我们强调新兴证据表明，不同的“好数据”概念在不同机制中占主导地位，这意味着最优子集取决于任务目标和资源预算，而非普遍适用。在系统方面，我们表明GPU内存（而非原始计算）通常是微调中的主要瓶颈，有效的扩展需要联合减少权重存储、优化器状态和激活内存，而不是孤立地优化任何单一组件。超越内存，我们将训练和推理视为计算主导的过程，其中优化、数据选择和解码必须明确考虑有限的FLOP预算。我们回顾了计算最优分配和停止规则的证据，其中一旦边际性能增益低于预算依赖的阈值，计算应停止或重新分配。总之，这些结果将计算感知的数据选择、缩放定律和自适应推理统一在资源条件决策的共同原则下。

英文摘要

Resource constraints increasingly determine what can be trained, fine-tuned, and deployed in large language models (LLMs), yet efficiency is often studied through isolated techniques rather than as an interacting system of limits. This survey adopts a constraint-centric perspective and organizes recent progress around three coupled bottlenecks: data efficiency (what to train on), memory efficiency (how to fit training), and compute budget awareness (when and where to spend FLOPs). On the data axis, we review selection and pruning methods that maximize learning per token, ranging from scalable proxy signals based on learning dynamics to gradient- and influence-based scoring, as well as difficulty-aware and curriculum-style strategies. We highlight emerging evidence that different notions of good data dominate in different regimes, implying that optimal subsets depend on the task objective and resource budget rather than being universal. On the systems side, we show that GPU memory, not raw compute, is often the dominant bottleneck in fine-tuning, and that effective scaling requires jointly reducing weight storage, optimizer states, and activation memory rather than optimizing any single component in isolation. Beyond memory, we frame training and inference as compute-governed processes in which optimization, data selection, and decoding must explicitly account for finite FLOP budgets. We review evidence for compute-optimal allocation and stopping rules, where computation should be halted or reallocated once marginal performance gains fall below a budget-dependent threshold. Together, these results unify compute-aware data selection, scaling laws, and adaptive inference under a common principle of resource-conditioned decision-making.

2606.10803 2026-06-10 cs.CL cs.AI cs.CV 交叉投稿

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

超越API：探索多模态大语言模型在物理工具使用中的极限

Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

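Affiliations: Singapore Management University; The Hong Kong Polytechnic University

AI summary: Proposes PhysTool-Bench, a benchmark that evaluates the ability of multimodal large language models to recognize physical tools in real-world scenes and plan their use. Even the strongest model completes only 21% of tasks, revealing deficits in both perception and planning.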

AI中文摘要

多模态大语言模型（MLLMs）在利用数字API方面表现出色，并日益成为具身AI的“大脑”，指导机器人与物理世界交互。在这种具身环境中，核心能力之一是使用物理工具，这支撑着MLLMs在现实任务中协助人类的能力。尽管重要性显著，MLLMs在物理工具使用方面的熟练程度仍在很大程度上未被探索。为填补这一空白，我们引入了PhysTool-Bench，这是首个评估MLLMs理解真实场景、识别物理工具并规划其使用能力的物理工具使用基准。PhysTool-Bench包含2,510个查询，覆盖2,678个真实世界物理工具，涉及制造、电气工程、农业和医疗等多个领域。具体而言，模型沿两个主要维度进行评估：1）识别场景中所有存在的物理工具，2）根据指令和视觉上下文规划工具选择和使用顺序。在13个领先的MLLMs中，即使最强的模型（Gemini-3.1-Pro）也只能识别场景中58.7%的工具，并仅完成21.0%的端到端查询。我们的分析揭示了两个层面的缺陷：MLLMs难以在真实场景中感知工具，而规划阶段更大的下降进一步表明缺乏将感知到的工具映射到任务语义的功能常识，这指出了发展实用具身AI的关键瓶颈。

英文摘要

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

2606.10852 2026-06-10 cs.CL cs.AI 交叉投稿

PhantomBench: 对语言模型非存在性威胁的基准测试

Haeji Jung, Hila Gonen

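Affiliations: University of British Columbia; Canada CIFAR AI Chair, Amii

AI summary: Proposes PhantomBench, the first large-scale benchmark of non-existent concepts, comprising more than 60K fictitious terms and entities. An evaluation of 21 models finds average hallucination rates as high as 86.7% in some cases, with even frontier models failing to abstain.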

AI中文摘要

幻觉，即语言模型生成事实无依据的响应，会带来严重风险，因为用户倾向于盲目依赖它们。在高风险领域，这种模型行为的后果可能导致重大伤害。尽管在理解幻觉方面取得了显著进展，但这些模型如何可靠地识别其知识边界仍不清楚。我们引入了PhantomBench，这是首个此类大规模基准，包含来自不同领域真实概念的6万多个不存在的术语和实体。使用我们的基准，我们评估了各种类型和大小的共21个模型。我们展示了令人震惊的幻觉率（在某些情况下平均高达86.7%），并注意到即使是前沿模型也令人惊讶地无法在不存在的概念上弃权，特别是当输入预设它们存在时。然后，我们展示了PhantomBench可以作为研究模型在罕见概念上行为的代理，这些概念更容易产生幻觉。我们还提供了一个构建PhantomBench的流程，使得能够根据研究人员和实践者的特定需求可扩展地生成不存在的概念。

英文摘要

Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavior can lead to significant harms. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts for which models are more prone to hallucinate. We also provide a pipeline to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.

2606.11127 2026-06-10 cs.CL cs.AI 交叉投稿

Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

基于来源的门控与自适应恢复在合成后训练数据筛选中的应用

Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, Pratinav Seth

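Affiliations: Lexsi Labs

AI summary: Studies provenance-grounded gating and adaptive sample recovery in synthetic post-training data curation, and proposes an adaptive recovery pipeline that combines failure diagnosis with targeted regeneration to improve yield, recovery rate, and injection recall.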

AI中文摘要

合成后训练流水线通常使用奖励模型或整体LLM评判器对生成的样本进行过滤，但两个实践很少被一起检验：过滤信号是否基于引发每个生成的来源证据，以及被拒绝的样本是否可以系统性地恢复而非永久丢弃。我们通过对抗性注入语料库提供真实故障标签，在门控配置、恢复策略和生成器规模上对这两个问题进行了受控研究。我们发现，精确的来源出处改善了更强评判器的忠实度门控；幻觉门控和奖励门控拒绝的样本群体大多不重叠，因此两者都是必要的；结合故障诊断与定向再生成的自适应恢复流水线比简单重采样实现了更高的产量、恢复率和注入召回率。下游微调质量主要由生成器规模驱动，过滤和恢复条件虽有重要贡献但处于次要地位。

英文摘要

Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced each generation, and whether rejected samples can be systematically recovered rather than permanently discarded. We present a controlled study of both questions across gate configurations, recovery strategies, and generator scales, using adversarially injected corpora to provide ground-truth failure labels. We find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint sample populations making both necessary, and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery conditions contributing meaningfully but secondarily.

2606.11166 2026-06-10 stat.OT cs.AI 交叉投稿

PSEBench: 一个用于评估大语言模型在患者安全事件分类中的可控且可验证的基准

Keqi Han, Ryan Young, Annabel Strauss, Lindsey Hughes, Katharine M. Nesbitt, Nicole Schueler, Che Ngufor, Carl Yang, Yuan Xue, Zhijun Yin

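Affiliations: Emory University; Scale AI; Mayo Clinic; Vanderbilt University Medical Center

AI summary: Proposes a policy-grounded construction methodology centered on clause cards, which uses anchor-driven instantiation and closed-loop verification to generate narratives with ground-truth labels, and builds PSEBench, a 5,074-case benchmark used to evaluate 15 representative LLMs on patient safety event triage.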

AI中文摘要

患者安全事件分类，即根据特定管辖政策判断临床事件是否需要报告，是一项高风险任务，通常由患者安全专家手动完成。尽管大语言模型（LLM）可能支持这一工作流程，但由于缺乏能够捕捉基于证据的政策推理、针对不完整报告的主动信息寻求以及在不可简化模糊情况下原则性弃权的基准，可靠评估受到限制。我们通过一种基于政策的结构化构建方法来解决这一差距，该方法以条款卡（clause card）为核心，这是一种将监管文本分解为可审计决策规范的结构化表示。结合条款卡与锚点驱动实例化和闭环验证，我们的可扩展流水线生成具有构造性真实标签的叙事，并自然支持生成缺失信息和不确定变体。我们将该方法应用于明尼苏达州29项可报告不良健康事件，创建了PSEBench，一个包含5074个案例的基准，并配备代理评估环境。对15个代表性LLM的评估揭示了一致的能力趋势，展示了基准的实用性，并指出了实现基于LLM的可靠患者安全事件分类的可操作差距。

英文摘要

Patient safety event triage, determining whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks to capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing information and uncertain variants. We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation on 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.

2606.09809 2026-06-10 cs.AI 版本更新

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

评估卡：AI评估报告的解释层

Avijit Ghosh, Anka Reuel, Jenny Chim, Wm. Matthew Kennedy, Srishti Yadav, Jennifer Mickel, Yanan Long, Andrew Tran, Anastassia Kornilova, Damian Stachura, Kevin Klyman, Felix Friedrich, Jeba Sania, Jan Batzner, Anoop Mishra, Eliya Habba, Yixiong Hao, Nathan Heath, Shalaleh Rismani, Usman Gohar, Andrea Loehr, David Manheim, Ruchira Dhar, Sree Harsha Nelaturu, Aarush Sinha, Leshem Choshen, Drishti Sharma, Ishan Khire, Amit Saha, Subramanyam Sahoo, Michael Hardy, Michael Alexander Riegler, Kabir Manghnani, Michelle Lin, Yanan Jiang, Yilin Huang, Asaf Yehudai, Jessica Ji, Aris Hofmann, Mubashara Akhtar, Max Lamparth, Nuno Moniz, Yacine Jernite, Stella Biderman, Zeerak Talat, Sanmi Koyejo, Mykel Kochenderfer, Irene Solaiman

Affiliations: Hugging Face; Stanford University; Queen Mary University of London; University of Copenhagen; Trustible; EleutherAI; TU Darmstadt; Weizenbaum Institute & Technical University of Munich; Harvard University; The Hebrew University of Jerusalem; Iowa State University; IBM Research; University of Chicago; Independent; Berkeley AI Safety Institute (BASIS); Simula; University of Edinburgh; ETH Zurich & ETH AI Center; Oxford Internet Institute; Amherst College; University of Nebraska; Syntony Research; McGill University; Evals Consensus; Israel Institute of Technology; IOL.Learn & Zuse Institute Berlin; Georgia Institute of Technology; Quebec AI Institute, Université de Montréal; University of Notre Dame; Georgetown University; DHBW Stuttgart; Massachusetts Institute of Technology

AI summary: To address inconsistent AI evaluation reporting, proposes EvalCards as a unified record layer with a structured schema, four interpretive signals, and a monitoring tool covering 5,816 models and 635 benchmarks, surfacing systematic gaps in reporting practice.

AI中文摘要

AI评估结果大规模产生，但在排行榜、模型卡、基准论文和公司博客中的报告不一致。代价是解释性的：读者无法可靠地比较不同来源的结果，识别报告遗漏的内容，或将聚合声明追溯到其基础证据。最近的努力解决了孤立组件，但留下了三个空白：它们只覆盖了评估生命周期的狭窄片段，并且不能组合成单个可解释的记录；它们指定了静态表示，无法区分不同利益相关者对同一证据提出的问题；它们仍然是纸面上的提案，缺乏大规模采用所需的提取基础设施。我们提出EvalCards，一个可操作的报告层，将基准元数据、评估运行数据和模型元数据组合成统一记录。我们(1)从52篇论文和10次利益相关者访谈的结构化审查中推导出报告模式，(2)实现四种解释信号（可重复性、文档完整性、来源和风险、以及分数可比性），通过针对研究和非研究受众校准的读者模式呈现，以及(3)部署一个监控工具，将EvalCards应用于5816个模型、635个基准和101843个结果，揭示当前报告实践中的系统性差距。

英文摘要

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present EvalCards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies EvalCards across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.

2411.02817 2026-06-10 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

让结果说话：LLM行为基准测试的复制优先范式

Yuming, Huang, Yao Liu, Pengjie Ding, Lei Wang, Junchen Wan

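Affiliations: Cylingo team

AI summary: Proposes a replication-first paradigm that validates LLM behavioral evaluation instruments through four orthogonal properties (reliability, cross-instrument replication, historical-footprint calibration, and preregistered predictions). Tested on an emotional companionship task, it reveals model regressions that aggregate scores conceal.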

AI中文摘要

对LLM行为的主观评估——如共情、克制、校准的情感语气——是困难的。人类评估者之间对这些品质的一致性饱和在rho约0.45附近，仅使用LLM作为评判代理存在循环论证的风险：与目标共享训练群体的评判者无法独立验证。将有效性锚定于单一人类评估者共识并不适用于人类自身存在分歧的能力。我们提出一种复制优先范式：不是锚定于一个评估者群体，而是通过四个正交属性认证工具——跨K次运行的可靠性、跨架构不同评判者的跨仪器复制、通过早期训练群体的评判者进行的历史足迹校准，以及预注册预测。我们在情感陪伴任务上测试该范式，让评分标准在迭代中数据驱动地自我演化：维度不是预先规定的，过程稳定在9维集合。预注册应用于10个可证伪假设和11个前向预测，在收集任何测试数据之前提交。应用于8个家族的49个模型，该范式揭示了聚合分数所隐藏的内容。在建议克制方面——模型是否在共情情境中避免提供未经请求的解决方案——gpt-5比gpt-4.1下降1.87分，Opus-4.7比Opus-4.6下降0.629分，而聚合分数保持平稳。这种退化在三次用户代理替换中幸存（95%的幅度），在5家族评判者堆栈和17个月队列间隔中复制，并在74个保留的真实ESConv对话中持续存在（rho在[0.749, 0.850]之间）；工具达到序数Krippendorff alpha=0.91。作为副产品，该范式充当饱和源诊断器，区分工具性天花板（可通过评分标准细化突破）和结构性天花板（需要场景或名单干预）。

英文摘要

Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving; second, a trust-by-construction paradigm that earns belief through three certificates established without a human gold standard, where human raters saturate (rho ~ 0.45); and third, the finding it makes visible -- capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark scaling fails to carry over: the sharpest case, advice-restraint (knowing when not to give advice), is the frontier's universal-lowest dimension, and at gpt-4.1->gpt-5 it ran backwards while the aggregate score hid it -- a regression one instruction recovers. Warm restraint is moved by model generation, not by raw scale, MoE width, inference budget, or reasoning mode; the open-weight Pareto frontier matches closed flagships at ~10-80x lower per-call cost; and four judge families replicate the rubric on held-out human ESConv conversations. Data, code, the locked rubric, and judge prompts will be released upon publication.

2606.07936 2026-06-10 cs.CL cs.AI 版本更新

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

黄金标准的幻觉：长文本生成中人类评估协议的大规模分析

Katelyn Xiaoying Mei, Yi-Li Hsu, Minjoon Choi, Zongwan Cao, Chenjun Xu, Bingbing Wen, Su Lin Blodgett, Lucy Lu Wang

Affiliations: University of Washington; National Tsing Hua University; Seoul National University; Mila - Québec AI Institute; Allen Institute for AI

AI summary: By analyzing human evaluation protocols in *CL conference papers from 2023-2025, finds opaque reporting and poor reproducibility, and offers recommendations for improvement.

Comments Accepted to ACL 2026 Main

AI中文摘要

人类评估在评估生成文本质量中起着关键作用。然而，这些评估的可靠性和可重复性取决于透明且记录良好的协议——这些细节在当前实践中经常缺失。在这项工作中，我们对*CL会议出版物（2023-2025年）中评估长文本生成任务的人类评估协议进行了大规模分析，包括对284篇论文的完整人工审查和另外1800多篇论文的LLM辅助分析。我们定义了与人类评估研究可重复性相关的20个可报告标准，并应用这些标准系统地检查了社区内的报告规范和实践。我们发现，人类评估研究设计的重要方面普遍报告不足，导致关于测量了什么、如何测量、谁提供了判断以及如何解释判断的模糊性。基于这些发现，我们概述了可操作的建议，以支持未来研究中更透明和可重复的报告。我们的分析代码和注释数据集可在以下网址找到：https://github.com/larchlab/Illusions-of-the-Gold-Standard

英文摘要

Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols -- details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023--2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard

2606.08251 2026-06-10 cs.CY cs.AI 版本更新

Contemporary AI lacks the imagination to diverge or negate in science

当代人工智能缺乏在科学中发散或否定的想象力

Honglin Bao, Siyang Wu, Xiao Liu, Sida Li, Shiyun Cao, James A. Evans

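Affiliations: Data Science Institute, University of Chicago; Knowledge Lab, University of Chicago

AI summary: Through a large-scale evaluation by scientists, finds that current AI lacks diversity in scientific hypothesis generation and does not spontaneously propose null hypotheses, and that automated evaluators agree only weakly with expert judgment, though a fine-tuned reward model narrows the gap.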

AI中文摘要

关于人工智能将加速科学发现的宏大预测已超越来自在职科学家的证据，该领域仍缺乏大规模、科学家参与的测试。我们进行了迄今最大规模的此类评估，描绘了AI尚不能为科学做什么。我们邀请了121,640篇近期预印本（涵盖生物学、医学、化学和社会科学）的作者，对大型语言模型（LLMs）根据其论文背景和难题生成的后续想法进行评判。6,749名科学家返回了25,139组关于新颖性、实证可行性、真实性概率和采纳倾向的评分。出现了三种模式。第一，非推理LLMs陷入狭窄的“蜂巢思维”，产生相似想法；推理模型探索更宽的假设空间，但没有模型类自发提出零假设——人类更自由地做出这一举动。第二，科学家奖励与自己相似的想法，并更看重概率而非新颖性，尽管社会科学家比生命科学家更容忍风险。资深社会科学家是最严厉的批评者，他们的怀疑是有道理的：LLMs在像社会科学这样的多元领域中最易出错，这些领域需要上下文感知的解释和不断发展的理论。第三，社区目前依赖的自动评估器——LLM作为评委、人工指标，甚至最先进的（SOTA）模型——与专家判断的一致性较弱，检索增强和科学家角色提示仅带来边际收益。我们在人类评分上后训练的Qwen3-14B奖励模型捕捉了领域品味细微差别，比SOTA模型高出27%，并缩小了与独立同行评审员间一致性的差距。尽管有种种炒作，当今的科学AI仍然是一个其想象力、输出和判断需要人类基础的协作者。

英文摘要

Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow "hivemind" of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned: LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories. Third, automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree only weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains. A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. For all the hype, today's scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding.

2606.10044 2026-06-10 cs.AI 新提交

跨边界社区学习的文化感知AI：计算与设计交叉领域的本科生创新

Jiaojiao Zhao, Weisheng Zhang, Jiawen Cai, Haibin Gao, Luyao Zhang

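Affiliations: Duke Kunshan University; Zhouzhuang Mystery of Life Museum; Digital Innovation Research Center and Social Science Division

AI summary: Proposes a collaborative framework that delivers culturally aware AI education through community-engaged computing, fostering interdisciplinary integration of social work and computational science, with applications to cultural heritage preservation and sustainable development.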

2606.09831 2026-06-10 cs.HC cs.AI 交叉投稿

AI-Driven Analytics of Team-Teaching Talk: Acoustic Patterns across Experience, Cohorts and the Learning Design

AI驱动的团队教学对话分析：跨经验、学生群体和学习设计的声学模式

Yuchen Liu, Roberto Martinez-Maldonado, Riordan Alfredo, Paola Mejia-Domenzain, Dwi Rahayu, Sadia Nawaz

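Affiliations: Monash University; EPFL

AI summary: Presents an AI-based speech processing approach for analysing classroom talk in team teaching, finding greater loudness variation among experienced teachers, undergraduate classes, and collaborative learning tasks, which suggests that teachers modulate volume more often to foreground key information and support interaction.

Comments Accepted at AIED 2026 (International Conference on Artificial Intelligence in Education), 14 pages, 4 figures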

AI中文摘要

随着课堂规模的扩大，团队教学越来越多地被用于整合多位教师的专业知识和教学视角。然而，关于团队教学在实践中如何展开的实证理解仍然有限，特别是在教师贡献随经验水平、学生群体和学习任务设计差异方面。先前对团队教学的研究主要依赖于回顾性自我报告或小规模观察，对团队教学实施的微观过程提供了有限的见解。教师谈话为这些过程提供了一个可扩展的视角。虽然个体教学情境中的研究表明，语音的声学特征（如音质、语调和响度）可以影响学生学习，但来自团队教学环境的证据仍然稀缺。此外，通过手动观察或转录捕捉这些特征在团队教学课堂中尤其具有挑战性，因为多位教师在长时间和多空间位置上发言，限制了可扩展性，除非自动化。基于空间教学法理论和团队教学研究，本文提出了一种基于AI的语音处理方法，用于分析团队教学环境中的课堂谈话。我们分析了涉及12位教师的36个录制的本科生和研究生课程。编码了空间教学行为并提取了声学特征，以考察教师经验、学生群体和学习任务设计之间的差异。结果揭示了系统性差异，最显著的是在响度动态方面：高经验教师、本科生班级和协作学习任务表现出更大的响度变化，表明更频繁地调节音量以突出关键信息并支持课堂互动和参与。

英文摘要

As classroom cohorts expand, team teaching is increasingly used to integrate the expertise and pedagogical perspectives of multiple teachers. Yet, there is limited empirical understanding of how team teaching unfolds in practice, particularly regarding differences in teachers' contributions across experience levels, student cohorts, and learning task design. Prior research on team teaching has largely relied on retrospective self-reports or small-scale observations, offering limited insight into the micro-level processes through which team teaching is enacted. Teacher talk offers a scalable lens on these processes. While research in individual teaching contexts shows that acoustic features of speech (e.g., voice quality, intonation, and loudness) can shape student learning, evidence from team-teaching settings remains scarce. Moreover, capturing such features through manual observation or transcription is especially challenging in team-teaching classrooms, where multiple teachers speak across extended sessions and spatial locations, limiting scalability without automation. Grounded in spatial pedagogy theory and team-teaching research, this paper presents an AI-based speech processing approach to analyse classroom talk in team-teaching settings. We analysed 36 recorded undergraduate and postgraduate sessions involving 12 teachers. Spatial pedagogy behaviours were coded and acoustic features extracted to examine variation across teachers' experience, student cohorts, and the learning task design. The results reveal systematic differences, most notably in loudness dynamics: high-experience teachers, undergraduate classes and collaborative learning tasks exhibited greater loudness variation, suggesting more frequent modulation of volume to foreground key information and support classroom interaction and engagement.

2606.09842 2026-06-10 cs.HC cs.AI cs.CV 交叉投稿

Integrated Real-Time Motion Tracking and AI Analysis for Athletic Performance Optimization

集成实时运动跟踪与AI分析以优化运动表现

Parth Agrawal, Ronit, Sagar Kumar, Aashish Bhambri

发表机构 * Department of Computer Science（计算机科学系）； Department of Computer Science and Engineering（计算机科学与工程系）； Chandigarh University（昌迪加尔大学）

AI总结本文综述了实时人体姿态估计在运动分析中的方法，并开发了一个轻量级原型系统，利用MediaPipe框架提供实时反馈，以优化运动表现。

Comments 6 pages, 10 figures, 2 tables, IC2E3-2026 conference

Abstract

Applying Human Pose Estimation (HPE) in real-world environments remains a challenging task. This paper surveys real-time HPE approaches and their limitations in sports analysis for individuals, and develops a practical, lightweight prototype for real-world testing and usage. Tracing the evolution from older marker-based motion capture systems to modern, accessible, and adaptable markerless deep learning approaches, the survey explores the foundational architectures that balance precision and efficiency. We also compare algorithmic frameworks (top-down, bottom-up, one-stage approaches, etc.) on practical deployment metrics such as inference latency, frame rate, mean per-joint position error, and temporal jitter to guide the model selection process for sports applications. As our primary contribution, we propose a modular, lightweight software prototype that uses the MediaPipe HPE framework with logic specific to multiple exercises to deliver real-time insights and AI-based feedback for non-expert users. We derive sports insights and provide feedback with minimal computational resources, while reporting performance and reliability metrics. Finally, we suggest future research directions such as combining sensors and AR/VR. This work serves researchers, engineers, sport scientists, and others as both a technical resource and a valid blueprint for implementing a similar or improved real-time HPE analysis system for athletic performance enhancement or other purposes.
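
The abstract does not spell out the exercise-specific logic. As a minimal sketch of one common building block, assuming 2D keypoints are already available from a pose estimator, a joint angle can be computed from three landmarks. The function name `joint_angle` and the sample coordinates are hypothetical and not taken from the paper.

```python
import math

def joint_angle(a, b, c):
    """Angle at joint b (in degrees) formed by points a-b-c, each an (x, y) pair."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("degenerate joint: coincident keypoints")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

# Hypothetical hip, knee, and ankle keypoints in normalised image coordinates.
hip, knee, ankle = (0.5, 0.4), (0.5, 0.6), (0.7, 0.6)
knee_angle = joint_angle(hip, knee, ankle)
```

A rule such as "count a squat repetition when the knee angle drops below a threshold" could be layered on top of this; any such threshold would be an application choice rather than something stated in the abstract.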

2606.09846 2026-06-10 cs.HC cs.AI cs.CL cross-list

CANVAS: Captioning Art with Narrative Visual-Audio AI Systems

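Vignesh Nagarajan

Affiliations: BASIS Phoenix High School

AI summary: Proposes an automated workflow that uses large language models and text-to-speech services to generate multi-sensory art descriptions and synchronized audio narration, producing text-plus-audio output in under 20 seconds at a cost below $0.05 with significantly higher lexical diversity and narrative detail.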

Comments 22 pages, 16 figures, 3 tables, 21 references

Abstract

Visual art remains largely inaccessible to blind and low-vision (BLV) audiences due to brief or absent alt-text, which rarely conveys the sensory, spatial, or emotional qualities of an artwork. This study presents an automated workflow that generates multi-sensory art descriptions and synchronized audio narration using large language models and text-to-speech services. The system, orchestrated through Zapier, converts uploaded images into rich narrative captions without human intervention, enabling rapid, scalable production of accessible media. Quantitative evaluation across 50 artworks shows that AI-generated descriptions contain significantly higher lexical diversity, adjective density, and narrative detail than baseline captions, while maintaining comparable readability levels. Statistical tests (t-tests, ANOVA) confirm meaningful differences in richness and length, and the full pipeline produces text-plus-audio outputs in under 20 seconds per image at a cost below $0.05. Findings demonstrate that automated captioning can bridge gaps in museum and digital-collection accessibility, with implications for broader public engagement. Future work can incorporate user studies with BLV participants to assess comprehension, preference, and optimal levels of interpretive language.

2606.09852 2026-06-10 cs.HC cs.AI cs.CL cs.LG cs.MA cs.SE cross-list

LLM-Based Code Documentation Generation and Multi-Judge Evaluation

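Ikbel Ghrab, Mohamed Dhieb, Ismail Khenissi, Ines Abdeljaoued-Tej

Affiliations: University of Tunis El Manar

AI summary: Proposes using eight large language models to generate code documentation automatically and improves documentation quality through a multi-judge evaluation framework in which four LLMs score outputs on nine criteria; experiments on a medical physics library show a 42% performance gap between the best and worst models.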

Comments ICAHS, © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Journal ref: Conference ICAHS IEEE, 2025

Abstract

High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We present an AI-powered framework that automates documentation generation from code and repositories using eight state-of-the-art Large Language Models (LLMs), including GPT, Gemini, Qwen, and LLaMA variants. Built on the PocketFlow orchestration framework, the system applies modular pipelines and advanced prompt engineering to produce structured, context-aware documentation. To ensure quality and guide model selection, we introduce a MultiLLMasJudges evaluation framework, in which four independent LLMs assess outputs across nine criteria, such as Completeness, Clarity, and Faithfulness. Experiments conducted on an open-source medical physics library showed a 42% performance gap between the top and bottom models. By combining diverse model outputs, optimized prompting, and rigorous evaluation, our approach enhances documentation quality and reduces manual effort, especially in safety-critical healthcare software.

2606.09860 2026-06-10 cs.LG cs.AI stat.AP stat.ML cross-list

Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease Using Gradient Boosting with Distribution-Free Coverages

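Xinze Zhang

AI summary: Proposes Method, a machine-learning framework that combines gradient-boosted decision trees with conformal prediction to give distribution-free, calibrated coverage for individual NAFLD risk predictions, reaching an AUROC of 0.912 on a Chinese multicenter cohort and outperforming several other methods.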

Abstract

Non-alcoholic fatty liver disease (NAFLD) affects roughly 25% of global adults, posing substantial hepatic and cardiovascular risks. Yet, population-level screening tools remain inadequate. We present Method, a machine-learning framework for NAFLD risk prediction coupling gradient-boosted decision trees with conformal prediction to yield calibrated, distribution-free coverage guarantees on individual risk estimates. It integrates a mutual-information-based stability selection procedure to identify a compact, clinically interpretable feature subset via bootstrap resampling, constructing prediction sets whose marginal coverage provably exceeds a user-specified confidence level. We evaluated Method on a multicenter cohort from Guangzhou, China (primary n=2,187; external validation n=412) using 78 candidate features across demographics, metabolic biomarkers, and lifestyle factors. Method achieves an AUROC of 0.912 internally and 0.891 externally, outperforming deep neural networks, TabNet, support vector machines, and logistic regression. Conformal prediction sets achieve 91.3% empirical coverage at the 90% nominal level. A three-tier risk stratification derived from these scores separates the population into distinct groups, with the high-risk subgroup showing a 12-month progression rate 4.7 times that of the low-risk tier. The selected features -- notably waist circumference, ALT, GGT, triglycerides, fasting glucose, and BMI -- align with established metabolic risk factors, providing biological plausibility.
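
The abstract does not state which conformal procedure is used. As a minimal sketch, assuming standard split conformal prediction with the nonconformity score 1 - p(true label), a threshold is taken from held-out calibration scores and every label scoring below it enters the prediction set. Under exchangeability this gives marginal coverage of at least 1 - alpha. The function names and all numbers below are hypothetical, and the classifier that would produce the probabilities is omitted.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of the calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return float("inf")  # too few calibration points: include every label
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    """Labels whose nonconformity score 1 - p(label) does not exceed qhat."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}

# Hypothetical calibration scores: 1 - predicted probability of the true label.
cal_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.55]
qhat = conformal_threshold(cal_scores, alpha=0.1)

confident = prediction_set({"NAFLD": 0.9, "no NAFLD": 0.1}, qhat)
ambiguous = prediction_set({"NAFLD": 0.5, "no NAFLD": 0.5}, qhat)
```

With these made-up numbers the confident case yields a single-label set, while the ambiguous case returns both labels, which is how the set size signals uncertainty.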

2606.09867 2026-06-10 cs.AR cs.AI cross-list

EstRTL: Functional Estimation Guided RTL Code Generation

Qi Xiong, Renzhi Chen, Bowei Wang, Yuqing Xiong, Libo Huang, Lei Wang

Affiliations: College of Computer Science and Technology, National University of Defense Technology; Defense Innovation Institute, Academy of Military Science (AMS); School of Computer Science and Technology, Shandong University; Key Laboratory of Advanced Microprocessor Chips and Systems, Changsha, China, and College of Computer Science and Technology, National University of Defense Technology, Changsha, China; Defense Innovation Institute, AMS, Beijing, China and Qiyuan Lab, Beijing, China

AI summary: Proposes EstRTL, a framework that uses static functional score estimation within a three-stage generation, estimation, and correction paradigm to improve the functional correctness of LLM-generated RTL code, raising correctness on generic LLMs by 3.2%-9.0%.

Abstract

Optimizing register transfer level (RTL) code is of vital importance in hardware design. Large language models (LLMs) provide new methods for the automatic generation and optimization of RTL code, offering the potential to significantly accelerate the design process and reduce human effort. However, existing methods for generating RTL code often focus on model fine-tuning and on various expansion techniques to enhance RTL code generation capabilities, paying little attention to functional correctness. Ensuring that the generated RTL code not only compiles successfully but also behaves as intended in real hardware implementations remains a critical challenge. To address this issue, we propose EstRTL, an LLM-powered collaborative agent framework for RTL code generation based on static functional score estimation. EstRTL follows a three-stage paradigm: Generation, Estimation, and Correction. During these stages, the functional estimation agent statically evaluates the generated code based on score and assessment results, and decides whether to output the code directly, return it for regeneration, or forward it to the code correction agent. This framework can be applied to various LLMs that are designed for RTL code generation, further enhancing the correctness of the generated code. By providing quantitative scores and human-readable requirements comparisons, it improves the transparency of AI-assisted RTL code generation. Experiments show that EstRTL significantly improves the correctness of RTL code generated by generic LLMs by 3.2%-9.0%, demonstrating the practical value of our system. The code and experimental results are open-sourced at https://anonymous.4open.science/status/EstRTL-E200/.

2606.09872 2026-06-10 cs.LG cs.AI cross-list

PatchSTG: Scalable Spatiotemporal Graph Transformers for Traffic Forecasting on Irregular Sensor Networks

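Jichao Li, Xuanming Shi

Affiliations: Portsmouth Abbey School; CodingFuture (Shanghai) Education Technology Co., Ltd.

AI summary: Proposes PatchSTG, which uses geographic information to partition sensors into balanced local patches and a dual attention encoder that alternates between capturing local and global dependencies, reducing computational complexity from quadratic to near-linear and achieving efficient, stable traffic forecasting on irregular sensor networks.

Comments 22 pages, 12 figures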

Abstract

Traffic forecasting is a fundamental component of intelligent transportation systems, yet remains challenging in real-world settings due to irregular sensor distributions and the high computational cost of modeling large-scale spatiotemporal dependencies. In practical traffic networks, sensors are unevenly distributed across regions, leading to non-uniform spatial structures that limit the effectiveness and scalability of existing graph-based and attention-based models. To address these challenges, we propose PatchSTG, a patch-based spatiotemporal graph Transformer designed for efficient forecasting on irregular sensor networks. The key idea is to introduce a hierarchical spatial representation that partitions sensors into balanced, locality-preserving patches based on geographic information. On top of this structure, a dual attention encoder alternates between intra-patch attention for capturing local interactions and inter-patch attention for modeling global dependencies, reducing computational complexity from quadratic to near-linear scaling. We evaluate PatchSTG on real-world traffic data from Rhode Island and additional large-scale datasets. Experimental results demonstrate that the proposed model achieves stable and competitive forecasting performance across multiple horizons, while significantly improving computational efficiency. Ablation studies further validate the effectiveness of spatial partitioning and dual attention in capturing both local and long-range traffic dynamics. These results suggest that patch-based spatiotemporal modeling provides a scalable and effective framework for traffic forecasting under irregular spatial settings.
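
The abstract does not say how the balanced, locality-preserving patches are built. One simple way to obtain such a partition, assumed here purely for illustration, is a KD-tree-style recursive median split that alternates between longitude and latitude. The function name `partition` and the coordinates are hypothetical.

```python
def partition(sensors, depth, axis=0):
    """Recursively split (lon, lat) points at the median into 2**depth patches."""
    if depth == 0:
        return [sensors]
    ordered = sorted(sensors, key=lambda p: p[axis])
    mid = len(ordered) // 2
    nxt = 1 - axis  # alternate between longitude (0) and latitude (1)
    return (partition(ordered[:mid], depth - 1, nxt)
            + partition(ordered[mid:], depth - 1, nxt))

# Hypothetical sensor coordinates: a dense cluster plus a few outlying sensors.
sensors = [(-71.40, 41.80), (-71.41, 41.82), (-71.39, 41.81), (-71.42, 41.79),
           (-71.30, 41.50), (-71.50, 41.60), (-71.60, 41.90), (-71.20, 41.70)]
patches = partition(sensors, depth=2)
sizes = [len(p) for p in patches]
```

Median splits keep patch sizes equal even when sensors are unevenly spread. Full attention over N sensors compares N^2 pairs, whereas attention restricted to patches of size P compares about N·P pairs within patches, which is the kind of saving the abstract refers to.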

2606.09893 2026-06-10 eess.IV cs.AI cs.LG cross-list

Tractogram foundation model

Guikun Chen, Yuqian Chen, Yijie Li, Yogesh Rathi, Nikos Makris, Fan Zhang, Wenguan Wang, Lauren J. O'Donnell

Affiliations: The State Key Lab of Brain-Machine Intelligence, Zhejiang University, Hangzhou; Department of Radiology, Brigham and Women’s Hospital, Mass General Brigham, Boston; Harvard Medical School, Boston; Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin; School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu; Psychiatry Neuroimaging Laboratory, Brigham and Women’s Hospital, Mass General Brigham, Boston; Department of Psychiatry, Center for Morphometric Analysis, Massachusetts General Hospital, Boston

AI summary: Proposes TractFM, a foundation model that learns reusable representations directly from whole-brain streamline sets by combining a local streamline encoder with a permutation-equivariant tractogram encoder, pretrained on dense anatomical tract parcellation and transferable to streamline-level and subject-level tasks.

Abstract

Diffusion MRI (dMRI) tractography is the only noninvasive approach for mapping white-matter pathways in the living human brain. It represents each brain as a tractogram: a large, unordered set of three-dimensional streamlines that includes information about both local streamline geometry and whole-brain anatomical organization. This structure makes tractograms a natural but challenging target for representation learning. Existing methods treat streamline classification and subject-level prediction as separate problems: streamline classifiers focus on geometric patterns, whereas subject-level prediction often depends on hand-crafted features. As a result, current methods do not learn reusable representations that connect streamline anatomy with whole-brain inter-subject variation. Here we introduce TractFM, a tractogram foundation model that learns reusable representations directly from whole-brain streamline sets. TractFM combines a local streamline encoder with a permutation-equivariant tractogram encoder, allowing all streamlines from a subject to be contextualized jointly in a single forward pass. Pretraining on dense anatomical tract parcellation, i.e., assigning anatomical labels to individual streamlines, yields two complementary representations: contextualized streamline-level embeddings for tract parcellation and compact subject-level descriptors for downstream prediction of subject phenotypes. Across three tractography algorithms and five dMRI datasets, TractFM transfers to both streamline-level and subject-level tasks. Its frozen representations achieve accurate tract parcellation and predict age and sex across independent datasets. These results show that whole-brain geometric context, learned once, can generalize across tractography pipelines, datasets, and prediction tasks.

2606.09896 2026-06-10 cs.GT cs.AI cs.LG cross-list

HMAF: A Hierarchical Multi-Slot GD-RTB Allocation Framework

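Tianxing Bu, Zhaoqi Zhang, Linyou Cai, Miao Xie, Shengri Xue, Tan Qu, Qianlong Xie, Xingxing Wang, Siqiang Luo, Gao Cong

Affiliations: Meituan; Nanyang Technological University; China Agricultural University

AI summary: To balance short-term revenue against long-term delivery on advertising platforms where GD and RTB coexist, proposes HMAF, a hierarchical multi-slot allocation framework that follows a Plan-Calibrate-Execute paradigm and integrates offline constraint optimization with online decision-making; at Meituan it raised the GD delivery rate by 3.72% and total advertising revenue by 1.59%.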

Comments Accepted by KDD 2026 Applied Data Science Track

DOI: 10.1145/3770855.3818434

Abstract

In modern online advertising platforms, Guaranteed Delivery (GD) contracts coexist and bid with Real-Time Bidding (RTB) auctions. Recent approaches either decouple GD and RTB optimization or rely on heuristic priority rules, and thus fail to effectively balance short-term revenue maximization with long-term contract delivery under complex multi-slot delivery and impression constraints. To address these challenges, we propose HMAF (Hierarchical Multi-Slot Allocation Framework), a unified framework designed to optimize impression allocation in GD-RTB advertising platforms. HMAF employs the Plan-Calibrate-Execute paradigm as its core structure, and integrates offline constraint optimization with online decision-making, balancing offline GD resource planning, dynamically calibrating GD-RTB competitiveness, and making real-time listwise rank decisions across multi-slot environments. HMAF has been implemented in multiple marketing scenarios at Meituan, one of the world's largest online food delivery platforms, leading to a 3.72% increase in GD delivery rate and a 1.59% increase in total advertisement revenue.

2606.09942 2026-06-10 cs.SE cs.AI cross-list

Anomaly Detection and Root Cause Analysis for Microservice Systems

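Luan Pham

AI summary: Addressing five limitations of anomaly detection and root cause analysis for microservice systems, proposes the end-to-end methods BARO, EventADL, and TORAI, builds the RCAEval benchmark, and validates effectiveness and robustness experimentally.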

Comments This is the pre-print of my PhD thesis, submitted to RMIT University

Abstract

Microservice systems are widely used to build cloud applications, yet their complexity makes failures inevitable, degrading user experience and causing economic loss. Automated anomaly detection and root cause analysis (RCA) are now active research areas, but existing techniques share five limitations. First, most treat anomaly detection and RCA separately, assuming anomalies are detected correctly, and falter when detection is imprecise due to noise or delay. Second, they focus on metrics, logs, and traces, leaving event data such as API calls and configuration changes underexplored. Third, many require a given service call graph and cannot diagnose without one. Fourth, the field lacks standardised datasets and evaluation frameworks, so methods are hard to compare fairly. Fifth, although causal inference-based RCA has become dominant, its effectiveness, efficiency, and robustness remain unclear. This thesis addresses these limitations through two groups of contributions. The first introduces methods that exploit observability data both independently and collectively. BARO is an end-to-end anomaly detection and RCA approach for metric data. EventADL is an end-to-end framework for event data. TORAI is a multimodal RCA framework that requires no service call graph. Extensive experiments on real microservice systems demonstrate their effectiveness and robustness. The second group delivers benchmarking datasets, an evaluation framework, and systematic evaluation efforts. RCAEval is a comprehensive benchmark providing ready-to-use datasets and reproducible baselines for future research. A systematic evaluation of existing RCA methods, especially causal inference-based approaches, offers insights that guide future directions. This thesis thereby advances automated anomaly detection and RCA for microservice failures, enabling future research on incident mitigation and remediation.

2606.09944 2026-06-10 econ.GN cs.AI q-fin.EC cross-list

GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-Aware Macroeconomic Welfare Monitoring

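Sivasathivel Kandasamy

Affiliations: Independent Researcher

AI summary: Proposes the GAGI index, which adjusts GDP per capita by the Gini coefficient and the price level to monitor the distributional effects on welfare; applied to the G7 countries, it finds that welfare growth diverges persistently from GDP growth.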

Abstract

GDP per capita is the default lens through which governing bodies track economic prosperity and the consequences of economic events, yet it is blind to two first-order determinants of lived prosperity: income/wealth distribution and inflation impact. Inequality-adjusted income measures are themselves not new, but what is missing from the macroeconomic monitoring toolkit specifically is not a welfare concept but an operational monitoring trigger: a statistic minimal enough to compute annually from public data, transparent enough to audit without modelling assumptions, and normalised so that year-on-year, cross-country change (the quantity a regulator needs to act on) is legible. We assemble such an instrument, the Gini-Adjusted GDP per Capita Index (GAGI): a reproducible, publicly computable formulation that rescales each country's GDP per capita by its inequality-adjustment factor (1-G) and its price level, normalised to a 2010 baseline. GAGI is a general-purpose welfare index, not inherently specific to AI automation, applicable wherever welfare-adjusted prosperity needs tracking. Applying GAGI to the G7 economies over 2010-2026, we show that welfare-adjusted prosperity has diverged persistently and increasingly from headline GDP growth, and that the divergence widens sharply after 2022, temporally coincident with (though not, on this evidence alone, demonstrated to be caused by) the aftereffects of COVID and the acceleration of generative-AI deployment. We argue that GAGI is a necessary complement to GDP-based monitoring: any macroeconomic monitoring instrument that tracks only aggregate output will systematically miss the distributional harm that automation can cause even while reported growth remains strong.
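
The abstract names the ingredients of GAGI (GDP per capita, the factor 1-G, the price level, and a 2010 baseline) but not the exact formula. The sketch below assumes one plausible reading: multiply GDP per capita by (1-G), divide by the price level, and scale so that the base year equals 100. That functional form, the function names, and all figures are assumptions for illustration only.

```python
def adjusted(gdp_per_capita, gini, price_level):
    """GDP per capita scaled by (1 - G) and deflated by the price level."""
    if not 0.0 <= gini <= 1.0:
        raise ValueError("Gini coefficient must lie in [0, 1]")
    return gdp_per_capita * (1.0 - gini) / price_level

def gagi(series, base_year=2010):
    """Index each year's adjusted value so that base_year equals 100."""
    base = adjusted(*series[base_year])
    return {year: 100.0 * adjusted(*vals) / base for year, vals in series.items()}

# Made-up figures: (GDP per capita, Gini coefficient, price level relative to 2010).
series = {
    2010: (40_000.0, 0.30, 1.00),
    2020: (50_000.0, 0.35, 1.25),
}
index = gagi(series)
```

In this invented example nominal GDP per capita rises by 25%, yet the index falls below 100 because inequality and prices both increased, which is the kind of divergence the abstract describes.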

2606.09953 2026-06-10 eess.IV cs.AI cs.LG cross-list

Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT

Luis Cortés Ferre, Miguel A. Gutiérrez-Naranjo, Marcin Balcerzyk

Affiliations: Department of Computer Science and Artificial Intelligence, University of Seville; Bioaraba Health Research Institute; IKERBASQUE, Basque Foundation of Science

AI summary: Proposes a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing while denoising implicitly, and outperforming classical interpolation and video frame interpolation methods on structural measures.

Abstract

Head computed tomography (CT) typically uses sub-millimeter in-plane resolution but 2-5 mm through-plane spacing, creating substantial anisotropy that degrades multiplanar reconstructions, volumetric measurements such as hematoma volume estimation, and downstream algorithms that assume near-isotropic voxels. We present a deep learning system that synthesizes intermediate CT slices from pairs of neighboring axial slices, halving the effective through-plane spacing. The system improves three-dimensional visualization while simultaneously producing inherently denoised outputs, yielding two complementary benefits from a single inference pass. To build a reliable system, we systematically evaluate pixel-wise losses, namely mean squared error (MSE) and mean absolute error (L1); structural-similarity losses, namely the structural similarity index (SSIM) and its multi-scale variant (MS-SSIM); and hybrid combinations. On a held-out test set, all converged models outperform classical interpolation baselines and pretrained video frame interpolation methods (RIFE, FILM) on all structural measures, with MS-SSIM+L1 offering the strongest balanced profile. We also document training instability in SSIM-family losses and identify partial remedies: the standard numerical fixes eliminate the dominant failure mode but leave residual divergence at smaller batch sizes. All results are reported with patient-level bootstrap confidence intervals and paired statistical tests. As an illustration, we apply the system to an out-of-distribution head CT series from Hospital Universitario Virgen del Rocío: the model synthesizes intermediate slices and exhibits on the real slices the implicit-denoising signature predicted by our theoretical analysis, supporting in a single external case that interpolation quality and implicit denoising are not confined to the training distribution.

2606.09959 2026-06-10 cs.LG cs.AI cross-list

Temporal Context Conditioning for Seasonality-Aware Precipitation Nowcasting of High-Intensity Rainfall

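Gijs van Nieuwkoop, Siamak Mehrkanoon

AI summary: Proposes TA-SmaAt-UNet, which augments radar-based precipitation nowcasting with temporal conditioning layers (cyclical encodings of time-of-day and time-of-year), markedly improving predictions of high-intensity rainfall events.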

Comments 9 pages, 6 figures

Abstract

Precipitation nowcasting is increasingly being approached with deep learning models that learn directly from recent radar observations. Although such models can efficiently capture short-term precipitation motion, they often lack broader contextual information about the meteorological conditions under which rainfall develops. This paper investigates whether lightweight temporal context can improve radar-based nowcasting, particularly for high-intensity rainfall. We propose the Time-Aware Small-Attention U-Net (TA-SmaAt-UNet), which extends the core SmaAt-UNet model with temporal conditioning layers that use cyclical encodings of time-of-day and time-of-year to modulate intermediate feature representations. Experiments on KNMI radar precipitation data show that temporal conditioning is most beneficial for rare, high-intensity precipitation events, while also improving the representation of seasonal variability and predicted rainfall-intensity distributions. A layer conductance analysis further indicates that the added temporal conditioning layers are actively used by the model despite their small parameter cost. These findings suggest that simple, physically motivated temporal context can improve the realism and reliability of deep learning-based precipitation nowcasts. The implementation of our models and training setup is available on GitHub (https://github.com/gijsvn/TA-SmaAt-UNet).
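
Cyclical encodings map a periodic quantity onto a circle so that, for example, 23:55 and 00:05 end up close together. The sketch below shows the usual sine/cosine form for time-of-day and time-of-year; it is a generic illustration, not the repository's code, and the function name and the 365.25-day year length are assumptions.

```python
import math
from datetime import datetime

def cyclical_time_features(ts):
    """Return [sin_day, cos_day, sin_year, cos_year] for the timestamp `ts`."""
    day_frac = (ts.hour * 3600 + ts.minute * 60 + ts.second) / 86400.0
    year_frac = (ts.timetuple().tm_yday - 1) / 365.25  # approximate year length
    angles = (2 * math.pi * day_frac, 2 * math.pi * year_frac)
    return [f(a) for a in angles for f in (math.sin, math.cos)]

late = cyclical_time_features(datetime(2024, 6, 10, 23, 55))
early = cyclical_time_features(datetime(2024, 6, 11, 0, 5))

# Distance between the time-of-day parts: small, since the two times are
# only ten minutes apart across midnight.
gap = math.dist(late[:2], early[:2])
```

A raw hour value would place these two timestamps almost 24 units apart; the circular form avoids that discontinuity before the features are used to modulate the network.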

2606.09963 2026-06-10 physics.flu-dyn cs.AI cross-list

Geometry-Aware Anisotropic Boundary Correction for Aerodynamic Simulation

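Xin Zhang, Yipeng Huang, Shu Jiang, Zhenzhong Wang, Min Jiang

Affiliations: School of Informatics, Xiamen University; Institute of Artificial Intelligence, Xiamen University

AI summary: To address neural operators' neglect of anisotropic physical behavior at boundaries, proposes GeoABC, a geometry-conditioned anisotropic boundary correction framework that uses boundary geometry to introduce direction-aware corrections, reducing near-boundary relative L2 error by about 38% on average on 2D airfoil and 3D car tasks.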

Abstract

Aerodynamic simulation is a key component of engineering shape design, where core quantities such as the surface pressure coefficient strongly depend on flow dynamics near solid boundaries. Neural operators provide an efficient alternative to expensive Computational Fluid Dynamics (CFD) solvers. However, conventional methods treat the boundary region isotropically, failing to account for the distinct physical behaviors along the boundaries. In reality, the aerodynamic process exhibits anisotropy: along the tangential direction, flow propagates along the wall; along the normal direction, physical quantities are constrained by the wall. To explicitly model the distinct physical behaviors, we propose GeoABC, a geometry-conditioned anisotropic boundary correction framework. GeoABC leverages the boundary geometries to introduce direction-aware boundary correction into the intermediate representations of neural operators, transforming boundary geometry from static input features into a structural prior that modulates physical prediction. On 2D airfoil and 3D car tasks, GeoABC consistently adapts to multiple neural operator backbones, reducing near-boundary relative $L_2$ error by approximately 38% on average, narrowing the structural near-wall gap shared by mainstream neural operators, and advancing neural operators toward high-fidelity aerodynamic simulation.
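
The tangential/normal distinction can be made concrete by splitting a near-wall vector into the part along the wall normal and the part along the wall. The sketch below shows only that geometric decomposition, assuming a unit normal is known at the boundary point. It is not GeoABC's correction module, and the function name and values are hypothetical.

```python
def split_normal_tangential(v, n):
    """Split vector v into parts along and perpendicular to the unit normal n."""
    dot = sum(vi * ni for vi, ni in zip(v, n))
    normal = tuple(dot * ni for ni in n)
    tangential = tuple(vi - ci for vi, ci in zip(v, normal))
    return normal, tangential

# Hypothetical near-wall velocity and an upward-pointing unit wall normal.
normal, tangential = split_normal_tangential((3.0, 0.5), (0.0, 1.0))
```

A direction-aware model could then treat the two components differently, for example by constraining the normal part more strongly than the tangential part near the wall.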

2606.10197 2026-06-10 astro-ph.GA cs.AI 交叉投稿

Integral Field Unit Spectroscopy with One Fiber

单光纤积分场单元光谱学

Zehao Peng, Biprateep Dey, Chris J. Maddison, Joshua S. Speagle

发表机构 * University of Toronto（多伦多大学）； Vector Institute（向量研究所）

AI总结提出一种多模态概率基础模型，利用掩码自编码器从宽带图像预测星系任意空间位置的高分辨率光谱，无需IFU训练数据，性能与监督基线相当。

Comments Accepted for Conference on Physics and AI at Stanford University (PAI 2026)

详情

AI中文摘要

积分场单元（IFU）光谱学提供星系的空间分辨光谱，为星系演化提供关键见解。然而，其高观测成本限制了当前IFU数据集约$10^4$个天体。我们提出一个多模态概率基础模型，直接从宽带图像预测星系内任意空间位置的高分辨率光谱，并带有校准的不确定性。基于掩码自编码器框架，我们的架构注入光纤位置编码和红移感知波长编码，实现空间条件预测。使用暗能量光谱仪（DESI）巡天的470万张图像和单光纤光谱观测训练，我们的模型利用光纤放置的自然方差和星系的形态自相似性，在没有任何IFU训练数据的情况下实现IFU般的能力。预测的发射线通量图与APO附近星系巡天（MaNGA）的独立IFU观测结果一致，性能与直接在IFU数据上训练的监督基线相当。

英文摘要

Integral field unit (IFU) spectroscopy provides spatially resolved spectra across galaxies, offering crucial insights into their evolution. However, its high observational cost limits current IFU datasets to $\sim 10^4$ objects. We present a multi-modal, probabilistic foundation model that predicts high-resolution spectra with calibrated uncertainties at arbitrary spatial locations within a galaxy directly from broadband images. Built on a masked autoencoder framework, our architecture injects fiber positional encodings and redshift-aware wavelength encodings, enabling spatially conditioned predictions. Trained on 4.7 million images and single-fiber spectroscopic observations from the Dark Energy Spectroscopic Instrument (DESI) survey, our model exploits the natural variance of fiber placements and the morphological self-similarity of galaxies to achieve IFU-like capabilities without any IFU training data. Predicted emission line flux maps match independent IFU observations from the Mapping Nearby Galaxies at APO (MaNGA) survey, with performance comparable to a supervised baseline trained directly on IFU data.

2606.10219 2026-06-10 cs.LG cs.AI 交叉投稿

Fast Exact Nearest-Neighbor Learning for High-Frequency Financial Time Series

高频金融时间序列的快速精确最近邻学习

Henry Han, Diane Li

AI总结针对金融高频数据增长带来的实时性挑战，提出基于Mojo的SIMD k-d树方法，通过方差分裂、连续存储和编译时向量化距离计算，在保持精确输出的同时实现17.5-43.5倍加速，并支持期权定价模型训练数据量提升10倍。

Comments 15 pages, 5 figures

详情

AI中文摘要

随着股票、ETF、外汇、期权和高频交易数据量的激增，AI在金融领域的大规模效率变得至关重要。这种增长给成熟的金融AI系统带来了核心挑战：模型必须从更大的历史语料库中学习，同时满足交易、风险管理和衍生品定价中的实时延迟约束。我们以高频金融时间序列的精确最近邻学习为具体案例，展示基于Mojo的金融AI可以应对这一挑战。我们引入了一种Mojo SIMD k-d树，采用基于方差的划分、连续的扁平缓冲区存储和编译时向量化距离计算。我们还提供了运行时结果，表明在标准剪枝和实现成本假设下，对于固定股票、大样本量、中等维度的情况，Mojo SIMD k-d树渐近地优于Mojo SIMD暴力搜索和scikit-learn的k-d树。在x86和ARM64架构的八个金融数据集上（训练样本最多277K），该方法在x86上比scikit-learn的k-d树加速17.5-21.6倍，在ARM64股票/ETF数据集上比scikit-learn暴力搜索加速28.1-43.5倍，同时保持精确输出。除了最近邻推理，Mojo的编译执行使得基于Extra Trees的隐含波动率定价模型能够训练10倍以上的期权数据，将看跌期权IV RMSE降低8.0%。这些结果将Mojo定位为金融AI的可扩展、生产就绪栈，并为其他数据密集型领域的高效AI提供了有前景的基础。

英文摘要

AI efficiency at scale is becoming critical in finance as market data volumes surge across equities, ETFs, FX, options, and high-frequency trading streams. This growth creates a core challenge for mature financial AI systems: models must learn from larger historical corpora while still meeting real-time latency constraints in trading, risk management, and derivative pricing. We use exact nearest-neighbor learning for high-frequency financial time series as a concrete case study to show that Mojo-based financial AI can address this challenge. We introduce a Mojo SIMD k-d tree with variance-based splitting, contiguous flat-buffer storage, and compile-time vectorized distance computation. We also provide a runtime result showing that, under standard pruning and implementation-cost assumptions, the Mojo SIMD k-d tree asymptotically dominates Mojo SIMD brute force and scikit-learn's k-d tree in the fixed-stock, large-$n$, moderate-dimensional regime. Empirically, across eight financial datasets on x86 and ARM64 with up to 277K training samples, the method achieves 17.5--21.6$\times$ speedup over scikit-learn's k-d tree on x86 and 28.1--43.5$\times$ over scikit-learn brute force on ARM64 equity/ETF datasets, while preserving exact outputs. Beyond nearest-neighbor inference, Mojo's compiled execution enables an Extra Trees-based implied-volatility pricing model to train on $10\times$ more options data, reducing put-IV RMSE by 8.0\%. These results position Mojo as a scalable, production-ready stack for financial AI and a promising foundation for efficient AI in other data-intensive fields.

Keywords: Financial AI, AI Efficiency, Mojo, SIMD, K-D Trees, KNN, High-Frequency Trading, Financial Time Series, Scaling
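
To make the variance-based splitting idea concrete, here is a minimal pure-Python sketch of a k-d tree that splits each node on its highest-variance dimension and answers exact 1-nearest-neighbor queries with standard plane-distance pruning. It is not the paper's Mojo implementation: the names `build` and `nearest`, the dictionary node layout, and the leaf size are assumptions for illustration, and the sketch uses neither SIMD nor flat-buffer storage.

```python
import random

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def build(points, leaf_size=4):
    """Build a k-d tree; each split uses the dimension with the largest variance."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    dims = len(points[0])
    axis = max(range(dims), key=lambda d: variance([p[d] for p in points]))
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "axis": axis,
        "split": ordered[mid][axis],
        "left": build(ordered[:mid], leaf_size),
        "right": build(ordered[mid:], leaf_size),
    }

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best=None):
    """Exact nearest neighbor; returns (squared_distance, point)."""
    if "leaf" in node:
        for p in node["leaf"]:
            d = sq_dist(p, query)
            if best is None or d < best[0]:
                best = (d, p)
        return best
    diff = query[node["axis"]] - node["split"]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if best is None or diff * diff < best[0]:
        best = nearest(far, query, best)
    return best
```

Because the far subtree is skipped only when the splitting plane is at least as far away as the current best match, the result has the same distance as a brute-force scan, which is the "exact outputs" property the abstract emphasizes.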

2606.10328 2026-06-10 cs.CV cs.AI 交叉投稿

Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images

内容诱导的空间-光谱聚合网络用于遥感图像变化检测

Yunlong Liu, Zekai Zhang

发表机构 * School of Control Science and Engineering, Shandong University（控制科学与工程学院，山东大学）

AI总结提出内容引导的空间-光谱集成网络(CSI-Net)，通过空间推理、光谱差异和内容引导集成模块融合全局空间细节与光谱差异信息，有效抑制未变化区域差异，在三个数据集上取得最优性能。

详情

AI中文摘要

空间和光谱信息的整合有利于提高变化检测性能。然而，现有方法无法有效抑制未变化区域中空间和光谱差异的影响。为了解决这些问题，本文提出了一种内容引导的空间-光谱集成网络（CSI-Net），用于融合全局空间细节和光谱差异信息。具体而言，所提出的CSI-Net由空间推理（SR）模块、光谱差异（SD）模块和内容引导集成（CGI）模块组成。在SR模块中，通过级联图卷积块学习空间信息以进行全局建模。SD模块负责提取光谱特征，通过计算特征的均值和方差来减少未变化区域中光谱差异的影响。此外，为了有效集成空间-光谱特征，我们设计了CGI模块以进一步利用它们的互补信息。在该模块中，引入高层内容信息作为引导，以实现适当的交互。由于高效的空间-光谱融合，所提出的CSI-Net能够更好地学习变化特征，同时实现对光谱差异的抑制。在LEVIR-CD、WHU-CD和CLCD数据集上的实验结果表明，与最先进方法相比，所提出的CSI-Net产生了更好的性能，并且适用于不同场景。

英文摘要

The integration of spatial and spectral information is beneficial to the improvement of change detection performance. However, existing methods cannot efficiently suppress the influences of spatial and spectral differences in unchanged areas. To address these issues, in this paper we propose a content-guided spatial-spectral integration network (CSI-Net) for the fusion of global spatial details and spectral difference information. Specifically, the proposed CSI-Net is composed of a spatial reasoning (SR) module, a spectral difference (SD) module, and a content-guided integration (CGI) module. In the SR module, the spatial information is learned by cascaded graph convolution blocks for global modeling. The SD module is responsible for the extraction of spectral features, by calculating the means and variances of features to reduce the impact of spectral differences in unchanged regions. In addition, in order to integrate the spatial-spectral features efficiently, we design a CGI module to further take advantage of their complementary information. In this module, high-level content information is introduced as a guide for a proper interaction. Due to the efficient spatial-spectral fusion, the proposed CSI-Net can learn the changed features better while achieving a suppression of spectral differences. Experimental results on LEVIR-CD, WHU-CD, and CLCD datasets demonstrate that the proposed CSI-Net produces better performance compared to state-of-the-art methods, and is applicable to different scenarios.

2606.10357 2026-06-10 cs.IR cs.AI 交叉投稿

Atomic Intent Reasoning: Bringing LLM Semantics to Industrial Cross-Domain Recommendations

原子意图推理：将LLM语义引入工业跨域推荐

Zhuohang Jiang, Yuxin Chen, Shijie Wang, Haohao Qu, Zhou Jindong, Wenqi Fan, Li Qing, Dongxu Liang, Jun Wang

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； Kuaishou Technology（快手科技）

AI总结提出AIR框架，通过离线LLM推理与在线高效检索组合，实现工业级跨域推荐，在快手电商中GMV提升3.446%。

详情

DOI: 10.1145/3770855.3818320
Journal ref: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26), August 09--13, 2026, Jeju Island, Republic of Korea

AI中文摘要

跨域推荐是内容到电子商务平台的核心问题。其目标是利用用户与内容的交互来推断电子商务端的潜在购买意图，从而提高转化率和商业价值。然而，在真实的工业场景中，跨域推荐面临多重挑战：不同领域之间存在显著的语义鸿沟，用户跨域行为序列通常规模庞大且噪声丰富。尽管大型语言模型（LLM）具有强大的语义理解和推理能力，但其毫秒级的推理延迟使得直接应用于在线推荐系统变得困难。为了解决这些问题，本文介绍了AIR（原子意图推理），一个为工业级部署设计的LLM驱动的跨域推荐框架。通过将LLM推理迁移到离线阶段，并在在线操作期间通过高效检索和组合动态构建用户意图表示，它在保持语义一致性的同时实现了约400倍的推理加速。在多个公共数据集上的实验结果表明，我们的方法在跨域推荐任务中达到了最先进的性能。此外，在快手电商真实业务场景中进行的大规模在线A/B测试显示，我们的方法在多个核心业务指标上取得了稳定且显著的提升，包括GMV增长+3.446%，充分验证了其在工业级推荐系统中的有效性和实用价值。

英文摘要

Cross-domain recommendation is a core problem in content-to-e-commerce platforms. Its objective is to leverage user interactions with content to infer potential purchasing intent on the e-commerce side, thereby enhancing conversion rates and commercial value. However, in real industrial scenarios, cross-domain recommendation faces multiple challenges: significant semantic gaps exist between different domains, and user cross-domain behavior sequences are often massive in scale and rich in noise. Although large language models (LLMs) possess powerful semantic understanding and reasoning capabilities, their millisecond-level inference latency makes direct application in online recommendation systems difficult. To address these issues, this paper introduces AIR (Atomic Intent Reasoning), an LLM-driven cross-domain recommendation framework designed for industrial-grade deployment. By migrating LLM inference to the offline phase and dynamically constructing user intent representations through efficient retrieval and composition during online operations, it achieves approximately 400$\times$ inference acceleration while maintaining semantic consistency. Experimental results across multiple public datasets demonstrate that our method achieves state-of-the-art performance in cross-domain recommendation tasks. Furthermore, large-scale online A/B testing conducted in Kuaishou E-commerce's real-world business scenarios shows that our approach delivers stable and significant improvements across multiple core business metrics, including a +3.446% increase in GMV, fully validating its effectiveness and practical value in industrial-scale recommendation systems.

2606.10381 2026-06-10 hep-ex cs.AI cs.CL cs.IR physics.ins-det 交叉投稿

Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

基于证据的缪子对撞机分析的智能混合RAG

Ruobing Jiang, Dawei Fu, Cheng Jiang, Tianyi Yang, Zijian Wang, Youpeng Wu, Yong Ban, Yajun Mao, Qiang Li

发表机构 * Peking University（北京大学）

AI总结提出智能混合RAG框架，结合稀疏与稠密检索及智能推理，用于缪子对撞机研究的证据检索与答案生成，构建首个基准并验证其有效性。

Comments 22 pages, 5 figures, and 6 tables

详情

AI中文摘要

缪子对撞机研究涵盖加速器物理、探测器仪器和高能现象学，相关证据分散在快速扩展且异构的科学文献中。随着高能物理（HEP）越来越多地探索智能辅助分析工作流，高效定位、整合和验证科学证据成为关键能力。虽然检索增强生成（RAG）为科学问答提供了有前景的框架，但在不牺牲检索精度的情况下整合智能推理仍是一个关键挑战。在这项工作中，我们提出了智能混合RAG，一个基于证据的RAG框架，用于缪子对撞机研究。该框架结合了混合检索器（集成稀疏词汇和稠密语义检索）与智能推理模块，用于查询分解、证据扩展和基于证据的答案生成。为了进行系统评估，我们构建了缪子对撞机领域首个检索增强科学问答基准，包括一个精选文献语料库以及涵盖主要探测器和物理研究主题的专用检索和答案生成基准。广泛评估表明，混合检索提供了最强的检索基础，而智能推理在受控证据扩展和答案合成方面最为有效。基于这一原则，智能混合RAG在检索效果、答案质量、证据覆盖和事实基础方面始终优于代表性的检索和RAG基线。该基准和框架共同为基于证据的科学问答以及未来在大规模科学文献上运行的HEP分析智能体奠定了基础。

英文摘要

Muon collider research spans accelerator physics, detector instrumentation, and high-energy phenomenology, with relevant evidence scattered across a rapidly expanding and heterogeneous body of scientific literature. As high-energy physics (HEP) increasingly explores agent-assisted analysis workflows, efficiently locating, integrating, and verifying scientific evidence becomes an essential capability. While retrieval-augmented generation (RAG) offers a promising framework for scientific question answering, integrating agentic reasoning without compromising retrieval precision remains a key challenge. In this work, we present agentic hybrid RAG, an evidence-grounded RAG framework for muon collider research. The framework combines a hybrid retriever, integrating sparse lexical and dense semantic retrieval, with an agentic reasoning module for query decomposition, evidence expansion, and grounded answer generation. To enable systematic evaluation, we construct the first benchmark for retrieval-augmented scientific question answering in the muon collider domain, comprising a curated literature corpus together with dedicated retrieval and answer-generation benchmarks covering major detector and physics research topics. Extensive evaluation shows that hybrid retrieval provides the strongest retrieval backbone, while agentic reasoning is most effective for controlled evidence expansion and answer synthesis. Built on this principle, agentic hybrid RAG consistently outperforms representative retrieval and RAG baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding. Together, the benchmark and framework provide a foundation for evidence-grounded scientific question answering and future HEP analysis agents operating over large-scale scientific literature.
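
The abstract does not state how the sparse and dense rankings are merged. One widely used option, shown here purely as an assumption and not as the authors' method, is reciprocal rank fusion (RRF), which scores each document by the sum of 1/(k + rank) over the individual rankings. The function name `rrf_fuse`, the constant `k = 60`, and the document IDs are illustrative.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Higher fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings for one query.
sparse = ["doc_a", "doc_b", "doc_c"]  # lexical ranking
dense = ["doc_b", "doc_d", "doc_a"]   # embedding-based ranking
fused = rrf_fuse([sparse, dense])
```

Documents that appear near the top of both lists (`doc_b` and `doc_a` here) rise above documents found by only one retriever, which is the usual motivation for hybrid retrieval.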

2606.10402 2026-06-10 cs.CL cs.AI 交叉投稿

Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

利用野外AI代理的集体智慧实现新发现

Federico Bianchi, Yongchan Kwon, Aneesh Pappu, James Zou

发表机构 * Together AI ； Stanford University（斯坦福大学）

AI总结提出EinsteinArena平台，通过开放分布式环境中的自主代理交互，在数学问题中实现12项新最优结果，展示了集体AI驱动研究的范式。

详情

AI中文摘要

科学发现通常是一个集体过程：研究人员分享部分结果，检查失败的尝试，并在长时间跨度内相互借鉴想法。最近的AI系统表明，基于语言模型的代理可以在开放科学问题上取得有意义的进展，但大多数现有系统孤立运行。在本文中，我们提出EinsteinArena，一个面向开放分布式研究和发现的代理原生平台。EinsteinArena为代理提供一组实时开放问题，每个问题都有可靠的验证器、公共排行榜和特定问题的讨论论坛，代理可以在其中提问和分享见解。我们专注于引起大量研究兴趣的数学任务，其进展可以明确衡量。截至2026年5月，EinsteinArena上的代理已发现12项新的最优结果，优于以往任何人类或AI解决方案。一个显著例子是11维接吻数问题，该平台将已知最佳下界从593提高到604。这一进展并非来自单个代理或孤立运行，而是通过一系列提交、公开讨论、验证器改进以及后续代理间的思想借鉴而产生的。这些结果证明，去中心化的科学发现可以从自主代理在野外的开放交互中涌现，展示了集体AI驱动研究的新范式。

英文摘要

Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other's ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather, it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.

2606.10493 2026-06-10 cs.DC cs.AI cs.LG cs.NE 交叉投稿

Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design

实现本地混合专家模型推理的云级SLO：CPU-GPU混合设计

Wenxin Wang, Yule Hou, Yu Ji, Peng Qu, Youhui Zhang

发表机构 * Tsinghua University（清华大学）； Xingyun Integrated Circuits Co., Ltd.（星云集成电路有限公司）； Beijing National Research Center for Information Science and Technology（北京信息科学与技术国家研究中心）

AI总结针对本地MoE推理在低并发下仍无法达到云级服务质量的问题，提出CPU-GPU混合系统，通过流加载预填充、分布式SLP、节点内预填充-解码分离、AVX-512优化FP8 GEMV内核和细粒度CPU并行，在消费级硬件上实现云级SLO。

Comments Accepted to the 20th USENIX Symposium on Operating Systems Design and Implementation (OSDI '26). The official version will appear in the OSDI '26 proceedings published by USENIX

详情

AI中文摘要

本地部署大型混合专家（MoE）模型即使在低并发工作负载下也无法达到云级环境中的服务质量。我们识别出本地MoE推理中的四个关键差距：依赖容量缩减模型（量化、蒸馏、重路由）、无法满足长预填充（超过12K）的30秒TTFT、低于基线的解码吞吐量（低于20 tokens/s）、以及在混合预填充-解码和批量解码工作负载下的并发性差。我们提出一个CPU-GPU混合系统，通过以下方式在双插槽商用CPU和消费级GPU上实现云级SLO：（1）流加载预填充（SLP），将预填充吞吐量提升至1,200 tokens/s，并在30秒内支持32K提示；（2）采用SmallEP专家并行的分布式SLP（DSLP），在两张RTX 5090上达到1,800 tokens/s和45K提示；（3）节点内预填充-解码分离，具有零拷贝共享权重和双批次注意力-MoE重叠方案，在延迟增加低于15%且吞吐量提升50%的情况下维持并发性；（4）AVX-512优化的FP8 GEMV内核，实现原生CPU FP8推理，同时降低4-5倍CPU延迟；（5）细粒度CPU并行，在INT4 DeepSeek-V3上达到28 tokens/s，在完整FP8 V3上达到21.5 tokens/s。评估表明，我们的系统在消费级CPU-GPU平台上为旗舰MoE模型提供云级QoS，通过完整原始精度推理重塑本地部署，无需数据中心基础设施即可实现高质量、经济高效的访问。

英文摘要

Local deployment of large Mixture-of-Experts (MoE) models falls short of the service quality achieved in cloud-scale environments, even under low-concurrency workloads. We identify four key gaps in local MoE inference: reliance on capacity-reduced models (quantized, distilled, rerouted), inability to meet 30-second TTFT for long prefills (more than 12K), sub-baseline decode throughput (under 20 tokens/s), and poor concurrency under mixed prefill-decode and batched decode workloads. We present a CPU-GPU hybrid system that achieves cloud-level SLOs on dual-socket commodity CPUs and consumer GPUs by (1) stream-loading prefill (SLP), boosting prefill throughput to 1,200 tokens/s and enabling 32K prompts within 30 seconds; (2) distributed SLP (DSLP) with SmallEP expert parallelism, reaching 1,800 tokens/s and 45K prompts in 30 seconds on two RTX 5090s; (3) intra-node prefill-decode disaggregation with zero-copy shared weights and a dual-batch attention-MoE overlap scheme, sustaining concurrency with under 15 percent latency increase and 50 percent throughput gains; (4) an AVX-512-optimized FP8 GEMV kernel, enabling native CPU FP8 inference while delivering 4-5x lower CPU latency; and (5) fine-grained CPU parallelism that attains 28 tokens/s on INT4 DeepSeek-V3 and 21.5 tokens/s on intact FP8 V3. Evaluations show our system delivers cloud-level QoS for flagship MoE models on consumer CPU-GPU platforms, reshaping local deployment with intact, original-precision inference and enabling high-quality, cost-effective access without datacenter infrastructure.
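
The prefill figures quoted above can be sanity-checked with simple arithmetic. The sketch below assumes that prefill time dominates time-to-first-token (TTFT) and that "K" means 1,024 tokens; neither assumption is stated in the abstract, and the helper name `prefill_seconds` is hypothetical.

```python
TTFT_BUDGET_S = 30  # target from the abstract: 30-second TTFT

def prefill_seconds(prompt_tokens, tokens_per_second):
    """Time to process a prompt at a fixed prefill throughput."""
    return prompt_tokens / tokens_per_second

# SLP: 32K-token prompt at 1,200 tokens/s.
slp = prefill_seconds(32 * 1024, 1200)
# DSLP on two GPUs: 45K-token prompt at 1,800 tokens/s.
dslp = prefill_seconds(45 * 1024, 1800)

for name, seconds in [("SLP", slp), ("DSLP", dslp)]:
    print(f"{name}: {seconds:.1f} s, within budget: {seconds <= TTFT_BUDGET_S}")
```

Under these assumptions both configurations finish prefill a few seconds inside the 30-second budget, consistent with the claims in the abstract.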

2606.10583 2026-06-10 cs.LG cs.AI math.OC 交叉投稿

面向普惠银行的后量子安全联邦DeFi

Swati Sachan, Dale Fickett, Richard Buchinger, Theo Miller

发表机构 * AI FinTech Group, University of Liverpool（人工智能金融科技组，利物浦大学）； RVA Works and University of Richmond（RVA Works和里士满大学）； Chain Crunch Labs（Chain Crunch实验室）

AI总结提出后量子安全联邦DeFi框架，利用格基全同态加密和NASA-IBM地理空间基础模型，实现银行间加密协作以提升信用不足个体的金融普惠性。

详情

DOI: 10.1109/cai68641.2026.11536585

AI中文摘要

近期纠错量子比特的进展加速了实用量子计算的时间表，这对用于保护金融系统、政府基础设施、通信网络和DeFi（去中心化金融）生态系统的密码原语构成威胁。本文提出一个后量子安全的联邦DeFi框架，支持银行间协作，以改善因有限金融历史而受到当地贷款机构服务不足的个体的普惠性。多家银行将加密信息批次贡献给一个虚拟服务器，其中基于格的完全同态加密（FHE）实现了端到端的同态计算。服务器以加密格式融合本地数据驱动的概率评估、专家信念以及由NASA-IBM Prithvi地理空间基础模型（GFM）生成的可验证证据。采用去中心化技术确保机构与服务器之间所有加密数据交换的防篡改证据和可审计问责性。该框架在弗吉尼亚州农村借款人的农业贷款决策上进行了测试。

英文摘要

Recent advances in error-corrected qubits have accelerated the timeline for practical quantum computing. This poses a threat to cryptographic primitives used to secure financial systems, government infrastructure, communication networks, and DeFi (Decentralized Finance) ecosystems. This paper introduces a post-quantum secure federated DeFi framework that enables inter-bank collaboration to improve the inclusion of individuals underserved by local lenders due to limited financial histories. Multiple banks contribute encrypted information batches to a virtual server, where lattice-based Fully Homomorphic Encryption (FHE) enables end-to-end homomorphic computation. The server fuses local data-driven probabilistic assessments, expert beliefs, and verifiable evidence generated by the NASA-IBM Prithvi Geospatial Foundation Model (GFM), in encrypted format. Decentralized technologies are employed to ensure tamper-proof evidence and auditable accountability for all encrypted data exchanges between institutions and the server. The framework is tested on agricultural lending decisions for rural borrowers in Virginia.

2606.10660 2026-06-10 cs.CY cs.AI 交叉投稿

Accounting for AI Inference in Corporate GHG Inventories: A Four-Tier Methodology for Scope 3 Category 1 Reporting

企业温室气体清单中AI推理的核算：范围3类别1报告的四层方法

Guillermo Llopis

发表机构 * SOMA AI

AI总结针对CSRD要求下AI推理服务在范围3类别1中缺乏标准核算方法的问题，提出基于token物理估算的四层框架，通过GPU能耗基准和区域电网碳强度精确估算排放，并揭示水碳权衡。

Comments Preprint. Data repository: https://doi.org/10.5281/zenodo.20443586. 18 pages, 3 figures, 6 tables

详情

AI中文摘要

AI推理服务——API订阅、企业聊天工具和嵌入AI功能的SaaS产品——明确属于《企业可持续发展报告指令》(CSRD)下的范围3类别1，该指令要求自2024年1月开始的财年进行披露。然而，目前尚无标准方法将其纳入企业温室气体清单。现行实践要么完全忽略该类别，要么应用针对整个ICT行业校准的通用经济投入产出(EEIO)因子，导致AI推理排放被高估10-40倍（相对于物理衍生方法）。我们提出了一个四层框架，将估算精度与组织实际可获取的数据相匹配，从基于token的直接物理估算（使用GPU能耗基准和区域电网碳强度）逐步降级到基于支出的EEIO后备方法（用于无使用数据的服务）。排放因子来源于同行评审的GPU能耗基准（ML.ENERGY Leaderboard v3）、确认的电网碳强度（EPA eGRID 2023；Ember 2023）以及已发布的水利用效率数据（Li等人，2025）。应用于一家200人的欧洲企业，该框架得出的总排放量低于1 tCO2e，表明合规挑战在于方法论而非规模。我们进一步记录了当前ESG工具未揭示的水碳权衡：瑞典以水电为主的电网在数据集中碳强度最低，但水足迹最高，这对数据中心选址策略有直接影响。

英文摘要

AI inference services -- API subscriptions, enterprise chat tools, and SaaS products with embedded AI features -- fall unambiguously within Scope 3 Category 1 under the Corporate Sustainability Reporting Directive (CSRD), which requires disclosure for fiscal years starting January 2024. Yet no standardised methodology exists for including them in corporate GHG inventories. Current practice either omits the category entirely or applies a generic economic input-output (EEIO) factor calibrated to the ICT sector as a whole, overestimating AI inference emissions by 10-40x relative to physically derived alternatives. We propose a four-tier framework that matches estimation precision to the data organisations can realistically obtain, progressing from direct token-based physical estimation -- using GPU energy benchmarks and regional grid carbon intensities -- down to a spend-based EEIO fallback for services where no usage data exists. Emission factors are derived from peer-reviewed GPU energy benchmarks (ML.ENERGY Leaderboard v3), confirmed grid carbon intensities (EPA eGRID 2023; Ember 2023), and published water use effectiveness data (Li et al., 2025). Applied to a 200-person European firm, the framework yields a total below 1 tCO2e, illustrating that the compliance challenge is methodological rather than magnitude-driven. We further document a water-carbon trade-off that current ESG tools do not surface: Sweden's hydro-dominated grid delivers the lowest carbon intensity in our dataset but the highest water footprint, with direct implications for data centre location strategy.
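
The abstract does not give the exact formula behind the token-based tier. A generic physical estimate of this kind multiplies token volume by energy per token, a data-centre overhead factor, and grid carbon intensity. The sketch below follows that generic form only; the function name, the parameter names, and every numeric value are hypothetical placeholders, not figures from the paper.

```python
JOULES_PER_KWH = 3.6e6

def inference_emissions_kg(tokens, joules_per_token, pue, grid_kg_per_kwh):
    """Estimate emissions (kg CO2e) from token volume.

    tokens           -- number of tokens processed
    joules_per_token -- GPU energy per token (hypothetical benchmark value)
    pue              -- power usage effectiveness of the data centre
    grid_kg_per_kwh  -- grid carbon intensity in kg CO2e per kWh
    """
    energy_kwh = tokens * joules_per_token / JOULES_PER_KWH
    return energy_kwh * pue * grid_kg_per_kwh

# Placeholder inputs chosen only to make the arithmetic easy to follow.
example = inference_emissions_kg(
    tokens=1_000_000_000, joules_per_token=1.0, pue=1.2, grid_kg_per_kwh=0.3
)
print(f"{example:.1f} kg CO2e")
```

With these placeholder inputs, one billion tokens corresponds to roughly 278 kWh of GPU energy and about 100 kg CO2e, which shows how a physically derived estimate can land well below 1 tCO2e.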

2606.10699 2026-06-10 cs.CV cs.AI 交叉投稿

Using the YOLOv12 Model for Verifying the Correct Color Sequence of Wires in Network Cables (Patch Cords) on the Production Line

使用YOLOv12模型验证生产线上网线（跳线）中导线的正确颜色顺序

Amin Doroodchi, Danial Soleimany

发表机构 * Computer Department, Islamic Azad University, Beyza Branch（伊斯兰 Azad 大学计算机系，贝兹分校）； R&D at Nedaye Sabz Company, Isfahan Branch（Nedaye Sabz 公司研发部，伊斯法罕分校）

AI总结针对网线生产中导线颜色顺序检测问题，提出基于YOLOv12的目标检测模型，实现高精度实时验证，减少人工错误。

详情

AI中文摘要

在网络电缆的生产过程中，确保标准连接器内部线对的正确颜色顺序对电缆的最终性能至关重要，因为任何错位或颜色顺序错误都可能导致缺陷产品并造成巨大成本。基于数字显微镜目视检查的传统检测方法通常耗时、繁琐且容易出错。在本研究中，开发了一种基于第十二版YOLO目标检测模型的智能系统，用于识别跳线中导线的位置并验证其正确的颜色顺序。使用的数据集包括从网络连接器显微视图中捕获的2500张图像，其中70%用于训练，15%用于验证，15%用于测试。所提出的模型利用单阶段架构和学习过程中的注意力机制，实现了约98%精度的导线检测。此外，总体平均准确率、分类精度和召回率分别约为95%、99%和98%。结果表明，该系统能够在生产线上可靠地实时验证导线颜色顺序的正确性，无需人工干预，从而减少人为错误并提高制造效率。

英文摘要

In the production process of network cables, ensuring the correct color sequence of wire pairs inside the standard connector plays a critical role in the final performance of the cable, as any misplacement or color-ordering error can lead to defective products and impose significant costs. Traditional inspection methods based on visual examination through digital microscopes are typically time-consuming, tedious, and prone to human error. In this study, an intelligent system based on the twelfth version of the YOLO object detection model was developed to identify the position and verify the correct color sequence of wires in patch cords. The dataset used consisted of 2,500 images captured from microscopic views of network connectors, which were divided into 70% for training, 15% for validation, and 15% for testing. The proposed model, leveraging a single-stage architecture and attention mechanisms during learning, achieved highly accurate wire detection with approximately 98% precision. Additionally, the overall mean accuracy, classification precision, and recall were around 95%, 99%, and 98%, respectively. The results demonstrate that this system can reliably and in real time verify the correctness of wire color sequencing on the production line without the need for human intervention, thereby reducing human error and enhancing efficiency in the manufacturing process.
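
The abstract does not describe how detections are turned into a pass/fail decision. One plausible post-processing step, sketched here as an assumption, is to sort the detected wires by horizontal position and compare the resulting label sequence with a reference order. The reference below is the T568B pin order, which the abstract does not name; the function name and the detection tuple layout are also hypothetical.

```python
# T568B pin order, pins 1 to 8 (assumed reference; the paper may use another standard).
T568B = [
    "white-orange", "orange", "white-green", "blue",
    "white-blue", "green", "white-brown", "brown",
]

def sequence_is_correct(detections, expected=T568B):
    """Check wire order from detector output.

    detections -- list of (x_center, color_label) pairs, one per detected wire
    """
    ordered = [label for _, label in sorted(detections, key=lambda d: d[0])]
    return ordered == list(expected)

# Hypothetical detections, listed out of order as a detector might return them.
good = [(30.0, "white-green"), (10.0, "white-orange"), (20.0, "orange"),
        (40.0, "blue"), (50.0, "white-blue"), (60.0, "green"),
        (80.0, "brown"), (70.0, "white-brown")]
print(sequence_is_correct(good))
```

A missing or duplicated detection also fails the check, because the ordered label list no longer matches the eight-wire reference.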

2606.10713 2026-06-10 eess.IV cs.AI cs.CV cs.LG 交叉投稿

++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

++nnU-Net: 基于前缀数据增强的nnU-Net扩展

Ana Sofia Santos, André Ferreira, Gijs Luijten, Naida Solak, Lisle Faray de Paiva, Behrus Hinrichs-Puladi, Jens Kleesiek, Jan Egger, Victor Alves

发表机构 * Center Algoritmi / LASI, University of Minho, Braga, Portugal（阿尔戈里米中心/拉斯伊大学，明霍大学，布拉加，葡萄牙）； Institute for Artificial Intelligence in Medicine, University Medicine Essen, Essen, Germany（医学人工智能研究所，埃森医学院，埃森，德国）； Institute of Medical Informatics / Dept. of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Germany（医学信息学研究所/口腔和颅面外科部，亚琛大学医院，德国）； Faculty of Computer Science, University of Duisburg-Essen, Essen, Germany（计算机科学学院，杜伊斯堡-埃森大学，埃森，德国）

AI总结提出++nnU-Net，通过图像配准进行数据增强，在预处理和训练前生成变形图像，在5个2D数据集上提升Dice系数最高约22%。

Comments 7 pages, 1 figure, 2 tables

详情

AI中文摘要

nnU-Net在医学分割任务中持续展现出成功，这严重依赖于标注生物医学数据的可用性和多样性。然而，由于隐私法规和标注成本等因素，收集医学影像队列仍然具有挑战性。因此，数据增强在增加数据可用性的同时保持解剖学可行性方面起着关键作用。为此，我们提出了++nnU-Net，一种基于图像配准的新型数据增强模块，在预处理和训练之前运行。我们的框架在五个不同的2D数据集上进行了评估。在该工作流中，图像数据经过两阶段配准过程，生成新的变形图像。然后将变换应用于相应的分割。此外，该管道计算可用磁盘空间，生成补充的二进制合成掩码并生成检查点。我们证明++nnU-Net优于nnU-Net基线，在Dice相似系数得分上有所提升。在最显著的情况下，我们观察到性能提升约22%。这些发现强调了基于配准的数据增强的有效性，特别是对于2D医学影像数据集，并表明++nnU-Net为在数据有限的情况下提高分割性能提供了一种实用且可扩展的方法。++nnU-Net的源代码可在以下网址获取：https://github.com/sofia-adelie/plusplusnnunet.git

英文摘要

The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of annotated biomedical data. However, assembling medical imaging cohorts remains challenging due to numerous factors such as privacy regulations and annotation costs. As a result, data augmentation plays a crucial role in increasing data availability while maintaining anatomical feasibility. Hence, we propose the ++nnU-Net, a novel data augmentation module based on image registration that operates before preprocessing and training take place. Our framework was evaluated across five different 2D datasets. In this workflow, image data go through a two-stage registration process, generating new warped images. The transformations are then applied to the respective segmentation. In addition, the pipeline computes available disk space, generates supplementary binary synthetic masks, and generates checkpoints. We demonstrate that the ++nnU-Net outperforms the nnU-Net baseline, yielding improvements in Dice Similarity Coefficient scores. In the most prominent cases, we observe performance gains of approximately 22\%. These findings highlight the effectiveness of registration-based data augmentation, particularly for 2D medical imaging datasets, and suggest that the ++nnU-Net provides a practical and scalable approach for enhancing segmentation performance in data-limited settings. The source code for the ++nnU-Net is available at: https://github.com/sofia-adelie/plusplusnnunet.git

2606.10718 2026-06-10 cs.LG cs.AI 交叉投稿

Transformer Based Model for Spatiotemporal Feature Learning in EEG Emotion Recognition

基于Transformer的脑电情绪识别时空特征学习模型

Xinglong Cui, Dian Gu

发表机构 * Beijing Neurodeep Technology Co., Ltd（北京纽罗德普科技有限公司）； University of Pennsylvania（宾夕法尼亚大学）

AI总结提出EEG-TransNet架构，通过局部自注意力块和模糊注意力同步Transformer捕捉脑电信号的时空特征，在三个数据集上优于现有方法。

详情

AI中文摘要

脑电图（EEG）是一种广泛采用的监测大脑活动的技术，因其高时间分辨率和成本效益，为神经状态提供了有价值的见解。为了增强对复杂EEG数据的分析，我们提出了EEG-TransNet，一种旨在捕捉EEG信号的时间、区域和同步特征的架构。EEG-TransNet引入了三个关键模块：1）利用ResNet和基于小波去噪的预处理与特征提取模块，2）用于区域特征学习的局部自注意力块，以及3）用于建模时空依赖性的模糊注意力同步Transformer（FAST）。通过在三个EEG数据集（BETA、SEED和DepEEG）上的大量实验，所提出的模型在不同信号长度下的分类准确性和鲁棒性方面始终优于其他方法。消融研究证实了局部自注意力块在提高性能方面的贡献，并且解码器中引入深度可分离卷积降低了计算复杂度，同时保持了高准确性。EEG-TransNet在受试者间具有最小的性能变化，突显了其作为基于EEG的大脑活动分类和情绪识别任务的鲁棒工具的潜力。

英文摘要

Electroencephalography (EEG) is a widely adopted technique for monitoring brain activity, offering valuable insights into neurological states due to its high temporal resolution and cost-effectiveness. To enhance the analysis of complex EEG data, we propose EEG-TransNet, an architecture designed to capture temporal, regional, and synchronous features of EEG signals. EEG-TransNet introduces three key modules: 1) a preprocessing and feature extraction module leveraging ResNet and wavelet-based denoising, 2) a Local Self-Attention Block for regional feature learning, and 3) a Fuzzy-Attention Synchronous Transformer (FAST) to model spatiotemporal dependencies. Through extensive experiments on three EEG datasets (BETA, SEED, and DepEEG), the proposed model consistently outperforms other methods in terms of classification accuracy and robustness across varying signal lengths. Ablation studies confirm the contribution of the Local Self-Attention Block in improving performance, and the inclusion of depthwise separable convolutions in the decoder reduces computational complexity while maintaining high accuracy. EEG-TransNet's ability to generalize across subjects with minimal performance variation highlights its potential as a robust tool for EEG-based brain activity classification and emotion recognition tasks.

2606.10736 2026-06-10 cs.CL cs.AI cs.CY 交叉投稿

Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs

利用课程先决条件图检测对话式AI交互中的知识缺口

Youssef Medhat, Junsoo Park, Ploy Thajchayapong, Ashok K. Goel

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结提出一个流水线，通过少样本文本分类器将学生向对话式AI助教提出的问题映射到课程主题，并利用GPT-4提取的先决条件知识图谱，以检测主题级知识缺口。

Comments Accepted as a short paper at the 10th CSEDM Workshop, co-located with the 18th International Conference on Educational Data Mining (EDM 2026). 7 pages, 2 figures, 2 tables

详情

AI中文摘要

大型在线课程会产生数千条学生向对话式AI助教提出的问题，但这些交互日志作为诊断信号在很大程度上未被利用。我们提出一个流水线，使用少样本文本分类器，将学生向对话式AI助教提出的问题映射到课程主题，该分类器基于GPT-4提取的课程概念先决条件知识图谱。在研究生级别AI课程的164名学生的1,340个问题事件上评估，我们的分类器在43个标签（42个课程主题加上一个“未知”弃权类别）上达到80.0%的准确率。主题级问题数量与独立期中调查中学生自我报告的难度显著相关（rho = 0.491, p = 0.008, n = 28个主题），提供了趋同证据，表明分类后的问题流反映了真实的主题难度。这些结果表明，映射到课程结构上的对话式AI交互日志携带关于主题级知识缺口的可操作信号，并为教师提供基于课程视角的哪些主题需要关注的视图。

英文摘要

Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs remain largely untapped as diagnostic signals. We present a pipeline that maps student questions from a conversational AI teaching assistant to curriculum topics using a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph of course concepts. Evaluated on 1,340 question events from 164 students in a graduate-level AI course, our classifier achieves 80.0% accuracy across 43 labels (42 curriculum topics plus an "unknown" abstention class). Topic-level question volume correlates significantly with student self-reported difficulty from an independent mid-semester survey (rho = 0.491, p = 0.008, n = 28 topics), providing convergent evidence that the classified question stream reflects genuine topic difficulty. These results demonstrate that conversational AI interaction logs, mapped onto curriculum structure, carry actionable signals about topic-level knowledge gaps and provide instructors with a curriculum-grounded view of which topics warrant attention.

2606.10782 2026-06-10 cs.CR cs.AI cs.LG 交叉投稿

A Bayesian Network Approach for Enhancing Security-Focused Decision Support Systems

一种增强安全导向决策支持系统的贝叶斯网络方法

Carolina Fernández-Martínez, Shuaib Siddiqui, Vanesa Daza

发表机构 * University of Granada（格拉纳达大学）； University of Birmingham（伯明翰大学）

AI总结提出基于贝叶斯网络的决策支持系统，帮助基础设施运营商选择安全工具，通过捕获用户需求并推理，提供最优安全机制，评估了时间和预测精度。

详情

DOI: 10.1109/LCN65610.2025.11146363
Journal ref: Proc. 2025 IEEE 50th Conference on Local Computer Networks (LCN), 2025

AI中文摘要

当今大多数基于开源网络的异构栈的采用和集成带来了明显的优势，如互操作性和高级功能的可用性。然而，另一方面，互联组件和移动部件数量的增加需要维护跨不同领域的不同工具的跨学科知识基础，以确保正常运行。为了减轻这些工作，本文提出了一种决策支持系统（DSS），指导基础设施运营商选择在其环境中采用的安全方法（例如工具）。该框架能够轻松捕获最终用户对不同领域安全三元组的高层需求，并在指定模型上运行推理，以提供更好地满足这些需求的已识别工具（安全机制）。所提出的DSS旨在提供一个可理解和可扩展的框架，以适应不同的需求和贝叶斯网络（BN）模型。提出了系统的架构和建模，并与其理论框架保持一致。其性能在时间和预测精度方面进行了评估。

英文摘要

The adoption and integration of heterogeneous stacks in most of today's open-source based networks brings clear benefits like interoperability and availability of advanced features. Yet, on the other hand the increasing number of interconnecting components and moving parts requires maintaining an ever increasing base of interdisciplinary knowledge of different tools in different domains to ensure proper operation. To alleviate such efforts, this work proposes a Decision Support System (DSS) to guide infrastructure operators through the selection of security approaches (e.g. tools) to adopt in their environments. This framework easily captures the end-user high-level requirements on the security triad for different domains and runs inference on the designated models to provide the identified tools (security mechanisms) that better serve such needs. The presented DSS aims at delivering an understandable and extensible framework to accommodate varying requirements and Bayesian Network (BN) models. The architecture and modelling of the system are proposed, aligned with its theoretical framework. Its performance is evaluated in terms of time and prediction accuracy.

2606.10827 2026-06-10 cs.NI cs.AI 交叉投稿

A Unified Siamese Learning Framework for Zero-Day Anomaly Detection and Classification in Optical Networks

面向光网络中零日异常检测与分类的统一孪生学习框架

Carlos Natalino, Flávia P. Monteiro, Paolo Monti

发表机构 * Department of Electrical Engineering, Chalmers University of Technology（查尔姆斯理工大学电气工程系）； Federal University of Western Pará (UFOPA)（西帕拉联邦大学（UFOPA））

AI总结提出多相似度孪生神经网络，统一实现光网络中零日异常检测与单样本分类，无需重训练即可跨光路和未知异常类型达到99%以上准确率。

Comments Authors' version of the manuscript accepted and published at the Optical Fiber Communication Conference (OFC) 2026. 4 pages, 3 figures

2606.10861 2026-06-10 cs.SE cs.AI cs.HC 交叉投稿

基于CNN和GRU网络的哮喘与COPD鉴别诊断中二维输入表示和子阶段融合策略的优化

Ipek Sen, Ozgur Ozdemir, Elena Battini Sonmez

发表机构 * Dept. Electrical and Electronics Engineering Istanbul Bilgi University, Turkey（土耳其伊斯坦布尔比尔吉大学电气与电子工程系）； Dept. Computer Engineering Istanbul Bilgi University, Turkey（土耳其伊斯坦布尔比尔吉大学计算机工程系）

AI总结本研究优化了二维输入表示（MFCC、对数梅尔谱图、VAR模型）和子阶段特征融合策略（直接拼接、GRU、GRU+注意力），使用CNN和GRU网络鉴别哮喘与COPD，最佳F1分数达0.877。

详情

AI中文摘要

本研究旨在探索VAR模型与梅尔频率倒谱系数（MFCC）矩阵和对数梅尔谱图在深度学习中的性能比较。在肺音分类中，基于谱图的表示因呼吸周期时长不同而存在时间维度不一致的问题。除了传统的裁剪/零填充，还提出了自适应长度窗口来固定时间维度。通过测试一系列参数优化其频谱和时间维度。采用不同的卷积神经网络（CNN）架构从子阶段获得的二维表示中提取特征。然后使用各种策略融合提取的子阶段特征，包括直接拼接、门控循环单元（GRU）网络和带注意力的GRU。通过基于呼吸周期的评估和基于受试者的评估（包含多个呼吸周期）来评估模型性能。还研究了多种数据增强技术以应对数据规模限制。最佳基于周期的F1分数（0.877）通过使用13个系数和每子阶段表示64点时间分辨率的MFCC矩阵，随后进行直接特征拼接获得；最佳基于受试者的F1分数（0.855）通过使用13个系数和每完整周期表示256点时间分辨率的MFCC矩阵获得，两者均采用自适应长度窗口。增强总体上降低了模型性能，但mixup增强是测试方法中最好的。MFCC在区分哮喘和COPD方面优于对数梅尔谱图和VAR模型。复杂的融合策略并未改善诊断。增强没有贡献，表明真实数据在肺音研究中的重要性。

英文摘要

This study aims to explore the performance of the VAR model in comparison with mel-frequency cepstral coefficient (MFCC) matrices and log-mel spectrograms using deep learning. In pulmonary sound classification, spectrogram-based representations suffer from inconsistent temporal dimensions due to varying respiratory cycle durations. Along with traditional trimming/zero-padding, adaptive-length windowing was presented to fix their temporal dimensions. Their spectral and temporal dimensions were optimized by testing a range of parameters. Different convolutional neural network (CNN) architectures were employed to extract features from the two-dimensional representations obtained over the sub-phases. The extracted sub-phase features were then fused using various strategies including direct concatenation, gated recurrent unit (GRU) network and GRU with attention mechanism. Model performances were assessed through respiratory cycle-based evaluation and subject-based evaluation comprising multiple respiratory cycles. Several data augmentation techniques were also studied to cope with limitations in data size. The best cycle-based F1-score (0.877) was obtained using the MFCC matrices with thirteen coefficients and 64-point time resolution per sub-phase representation followed by direct feature concatenation, and the best subject-based F1-score (0.855) was obtained using the MFCC matrices with thirteen coefficients and 256-point time resolution per full-cycle representation, both obtained by adaptive-length windowing. Augmentation degraded the performance of models overall, yet mixup augmentation was the best among the methods tested. MFCC outperformed log-mel spectrogram and VAR model in differentiation of asthma and COPD. Sophisticated fusion strategies did not improve the diagnosis. Augmentation did not contribute, demonstrating the significance of authentic data in pulmonary sound studies.

2606.11106 2026-06-10 cs.CV cs.AI 交叉投稿

FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

FADA: 可访问的胎儿超声解读与标注——基于选择性蒸馏的统一视觉-语言模型

Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes, Nader Mohammed, Abdullatif Magram, Khalid Alyafei, Mowafa Househ, Marco Agus

发表机构 * Hamad Bin Khalifa University（哈马德·本·哈利法大学）； HMC（哈马德医疗公司）； Advanced AlRazi Diagnostic Center（高级阿尔拉齐诊断中心）； Sidra Medicine（锡德拉医学）

AI总结提出统一视觉-语言模型FADA，通过选择性蒸馏从四个领域基础模型提取知识，实现胎儿超声的解读、分类、检测和分割，在单个消费级GPU上训练，无需外部标签，可在智能手机上离线运行。

详情

AI中文摘要

全球范围内受过训练的超声技师短缺限制了低收入和中等收入国家的产前超声筛查，这些国家超过一半的孕妇未接受专业超声检查。当前的深度学习方法分别处理检测、分割或分类，每个任务都需要单独的模型和推理时的专家指定标签。我们提出FADA，一个基于Qwen3.5-VL构建的统一视觉-语言模型，通过单一解读优先的流程执行临床解读、分类、检测和分割，无需外部标签。FADA通过离线预计算特征缓存，从四个领域基础模型（FetalCLIP、UltraSAM、USF-MAE、UltraFedFM）中蒸馏知识。选择性蒸馏仅对标注任务应用特征对齐，而解读任务依赖标准微调，在大多数评估指标上持续优于完全蒸馏。推荐变体FADA-SKD在分割上达到0.8820平均Dice，检测上达到0.7671 mAP@0.50，结构化解读合规性达到100%。专家超声技师对237张图像的验证确认了在自主和人机协同模式下输出临床可接受，其中73.5%的解读在临床医生指导下获得完美评分。该系统可在单个消费级GPU上训练，无需云连接即可部署。我们通过在商用智能手机（高通骁龙7 Gen 1，12 GB RAM）上使用llama.cpp和GGUF量化运行压缩的0.8B模型，验证了边缘部署，完全离线完成全部5阶段流程约需60秒。这为将AI辅助胎儿评估与便携式超声设备集成提供了实用途径，直接解决了资源受限环境中的诊断可及性差距。代码、模型和数据可在 https://github.com/mahmoodphd/FADA 获取。

英文摘要

A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using llama.cpp with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at https://github.com/mahmoodphd/FADA.
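The segmentation score quoted above (0.8820 mean Dice) refers to the Dice overlap between a predicted mask and a reference mask. The sketch below shows the standard definition on flattened binary masks. The function name and the convention for two empty masks are assumptions made for illustration and are not taken from the FADA code.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Count pixels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total
```

A mean Dice is then the average of this value over the evaluated images or structures. How the paper averages it is not stated in the abstract.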

2606.11117 2026-06-10 cs.AR cs.AI cs.PF 交叉投稿

Towards Autonomous Accelerator Design: FPGA Accelerator Generation with SECDA

迈向自主加速器设计：基于SECDA的FPGA加速器生成

Vinamra Sharma, Xingjian Fu, Jude Haris, José Cano

发表机构 * School of Computing Science, University of Glasgow, Scotland, UK（格拉斯哥大学计算机科学学院）

AI总结提出SECDA-DSE框架，集成大语言模型引导FPGA加速器设计空间探索，通过结构化探索器和LLM推理生成可综合的加速器设计，减少人工干预。

Comments Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

详情

AI中文摘要

为现代人工智能工作负载设计基于FPGA的加速器需要探索庞大而复杂的硬件设计空间，涉及架构参数、数据流策略和内存层次结构，这使得过程非常耗时。虽然现有方法如SECDA通过SystemC仿真和FPGA执行实现了快速的硬件-软件协同设计，但识别高效的加速器配置仍然是一个主要需要广泛领域知识的手动过程。SECDA-DSE是一个将大语言模型（LLM）集成到SECDA生态系统中的框架，用于指导基于FPGA的加速器的设计空间探索（DSE）。它结合了用于生成候选架构的结构化DSE探索器，以及使用检索增强生成和思维链提示进行推理引导探索的LLM栈，并配有用于迭代和强化优化的反馈循环。基于我们之前介绍SECDA-DSE的工作，本文通过生成三种加速器设计（包括逐元素向量乘法、二维卷积和矩阵转置）并在FPGA硬件上执行端到端运行来扩展其评估。结果表明，SECDA-DSE能够生成符合SECDA标准的加速器设计，并成功在FPGA硬件上综合和执行。此外，生成的设计捕获了计算并行性和数据移动之间的内核特定权衡，突显了LLM引导探索在跨不同工作负载调整架构配置方面的潜力，同时减少了探索时间和大量人类专业知识的需求。

英文摘要

Designing FPGA-based accelerators for modern artificial intelligence workloads requires exploring a large and complex hardware design space that involves architectural parameters, data flow strategies, and memory hierarchies, making the process very time consuming. While existing methodologies such as SECDA enable rapid hardware-software co-design through SystemC simulation and FPGA execution, identifying efficient accelerator configurations remains a largely manual process requiring extensive domain knowledge. SECDA-DSE is a framework that integrates Large Language Models (LLMs) into the SECDA ecosystem to guide design space exploration (DSE) of FPGA-based accelerators. It combines a structured DSE Explorer for generating candidate architectures with an LLM Stack that performs reasoning-guided exploration using retrieval-augmented generation and chain-of-thought prompting, coupled with a feedback loop for iterative and reinforced refinement. Building on our previous work introducing SECDA-DSE, this paper extends its evaluation by generating three accelerator designs, including element-wise vector multiplication, 2D convolution, and matrix transpose, and performing end-to-end execution on FPGA hardware. The results show that SECDA-DSE can generate SECDA-compliant accelerator designs that are successfully synthesized and executed on FPGA hardware. Furthermore, the generated designs capture kernel-specific trade-offs between compute parallelism and data movement, highlighting the potential of LLM-guided exploration to adapt architectural configurations across diverse workloads while reducing exploration time and the need for extensive human expertise.

2606.11140 2026-06-10 physics.geo-ph cs.AI cs.LG stat.AP stat.ML 交叉投稿

Data assimilation for subsurface flow using latent diffusion model parameterization: performance of ensemble-Kalman and Monte Carlo techniques

基于潜扩散模型参数化的地下流体数据同化：集成卡尔曼与蒙特卡洛技术的性能

Guido Di Federico, Wenchao Teng, Louis J. Durlofsky

发表机构 * Department of Energy Science & Engineering, Stanford University（斯坦福大学能源科学与工程系）

AI总结针对地下流体数据同化中高维参数反演问题，比较了基于潜扩散模型（LDM）的集成卡尔曼方法（ESMDA）与蒙特卡洛方法（MCMC/SMC）在三维河道地质模型上的性能，发现蒙特卡洛方法在保持地质真实性的同时能更有效地降低数据失配和不确定性。

详情

AI中文摘要

地下流体数据同化（DA）涉及校准模型参数以匹配观测数据（通常来自井），同时保持地质真实性。潜扩散模型（LDM）提供了从高维地质模型空间到低维潜变量的高效映射，降低了反问题的维度，同时保持了后验地质模型的合理性。然而，LDM映射的高度非线性可能会降低基于卡尔曼增益的集成更新的性能。我们针对具有层次地质不确定性的大规模三维河道化地质模型，系统比较了DA算法。我们使用多重数据同化集成平滑器（ESMDA）比较了模型空间和潜空间的DA，并展示了一个关键权衡：模型空间更新实现了显著的不确定性降低，但产生了地质上不现实的后验模型，而潜空间更新保持了真实性但表现出有限的不确定性降低。受此启发，我们在3D-LDM潜空间中探索了严格的马尔可夫链蒙特卡洛（MCMC）和序贯蒙特卡洛（SMC）算法。为适应其高计算需求，我们开发了一个快速代理流动模型来近似井流量响应。MCMC和SMC在三个合成测试案例中与ESMDA进行了对比评估，DA在LDM潜空间中执行。由于LDM参数化，所有模型都保持了地质真实性。MCMC和SMC彼此一致，并且比潜空间ESMDA实现了更低的数据失配和更多的不确定性降低。我们的总体结果表明，集成卡尔曼方法在高度非线性参数化下可能高估后验不确定性，而由快速代理模型支持的严格蒙特卡洛采样可以提供更可靠的替代方案。

英文摘要

Data assimilation (DA) in subsurface flow entails calibrating model parameters to match observed data, typically at wells, while preserving geological realism. Latent diffusion models (LDMs) provide efficient mappings from high-dimensional geological model space to a low-dimensional latent variable, reducing the dimensionality of the inverse problem while maintaining plausibility in posterior geomodels. However, the high nonlinearity in the LDM mapping may degrade the performance of Kalman-gain-based ensemble updates. We present a systematic comparison of DA algorithms applied to large-scale 3D channelized geomodels with hierarchical geological uncertainty. We compare model-space and latent-space DA using the ensemble smoother with multiple data assimilation (ESMDA), and demonstrate a key trade-off: model-space updates achieve significant uncertainty reduction but produce geologically unrealistic posterior models, while latent-space updates preserve realism but exhibit limited uncertainty reduction. Motivated by this, we explore rigorous Markov chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) algorithms in the 3D-LDM latent space. To accommodate their high computational demands, we develop a fast surrogate flow model that approximates well-rate responses. MCMC and SMC are evaluated against ESMDA across three synthetic test cases, with DA performed in the LDM latent space. All models maintain geological realism due to the LDM parameterization. MCMC and SMC are consistent with one another and achieve lower data mismatch and more uncertainty reduction than latent-space ESMDA. Our overall results demonstrate that ensemble Kalman methods may provide overestimated posterior uncertainty with highly nonlinear parameterizations, while rigorous Monte Carlo sampling, enabled by fast surrogate models, can provide a more reliable alternative.

2606.11169 2026-06-10 cs.DC cs.AI 交叉投稿

EVA-Net: 基于视频衍生运动先验的跨被试脑电运动解码

Ziyuan Li, Yueyu Sun, Yimeng Zhang

发表机构 * South China University of Technology（华南理工大学）

AI总结提出EVA-Net两阶段框架，利用动作视频作为语义先验，通过跨模态对比学习和对齐减少个体差异，实现跨被试脑电运动解码，在EEGMMI上取得8.66%的LOSO准确率提升。

详情

AI中文摘要

实用的非侵入式脑机接口（BCI）系统需要具有强跨被试泛化能力和最小校准的脑电解码器。然而，被试间差异和信号非平稳性常常将运动语义与被试特定噪声纠缠在一起，限制了跨被试解码。最近的多模态方法使用文本作为语义锚点，但文本对本质上动态的运动过程提供了稀疏且静态的监督。为了解决这个问题，我们提出了EVA-Net，一个两阶段框架，使用动作视频作为跨被试脑电运动解码的语义先验。在第一阶段，使用跨模态和监督对比目标将脑电和视频特征在共享空间中对齐，以减少被试特定变化。在第二阶段，视频类别原型和知识蒸馏将视频衍生的先验转移到仅脑电的分类器，而不增加推理开销。在两个公共数据集上的实验表明，EVA-Net实现了强大的跨被试解码性能，包括在EEGMMI上8.66%的LOSO准确率提升。消融结果进一步表明，视频提供了比本文考虑的文本基线更有效的语义锚点。

英文摘要

Practical non-invasive Brain-Computer Interface (BCI) systems require EEG decoders with strong cross-subject generalization and minimal calibration. However, inter-subject variability and signal non-stationarity often entangle motor semantics with subject-specific noise, limiting subject-independent decoding. Recent multimodal approaches use text as a semantic anchor, yet text provides sparse and static supervision for inherently dynamic motor processes. To address this issue, we propose EVA-Net, a two-stage framework that uses action videos as semantic priors for subject-independent EEG motor decoding. In the first stage, EEG and video features are aligned in a shared space using cross-modal and supervised contrastive objectives to reduce subject-specific variation. In the second stage, video category prototypes and knowledge distillation transfer video-derived priors to an EEG-only classifier without adding inference overhead. Experiments on two public datasets show that EVA-Net achieves strong subject-independent decoding performance, including an 8.66% LOSO accuracy gain on EEGMMI. Ablation results further suggest that video provides a more effective semantic anchor than the text baseline considered in this work.

2606.02386 2026-06-10 cs.AI q-bio.QM 版本更新

AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design

AgentPLM：具有推理增强解码的智能体蛋白质语言模型用于蛋白质序列设计

Sahil Rahman, Maxx Richard Rahman

发表机构 * Sahil Rahman ； Maxx Richard Rahman

AI总结提出AgentPLM，通过推理增强解码和对比智能体策略优化，使预训练蛋白质语言模型能够利用外部生物物理反馈进行在线纠错，在多项蛋白质设计任务上取得最优结果。

详情

Journal ref: Workshop on Generative and Agentic AI for Biology, 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

蛋白质语言模型（PLM）是被动预言机：它们通过单次前向传递生成序列，没有机制来咨询外部生物物理反馈或在候选序列违反热力学或结构约束时重定向生成。我们引入AgentPLM，它通过为预训练PLM配备两项机制来解决这一问题：i)推理增强解码（RAD），将自回归生成与工具调用（ESMFold、FoldX、AutoDock Vina）交错进行；ii)对比智能体策略优化（CAPO），这是直接偏好优化的轨迹级扩展，它端到端地训练策略，使其学习预言机反馈何时具有信息量，而不仅仅是模仿高适应度序列。我们在基准任务上评估AgentPLM，涵盖从头酶设计、抗体优化、热稳定性、PPI界面设计和零样本适应度预测，使用标准化的预言机API和受控的序列同一性划分。AgentPLM取得了最先进的结果，抗体前10%命中率相比最强被动基线有所提升，提供了无需显式回溯的在线纠错的机制证据。

英文摘要

Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback or redirect generation when a candidate violates thermodynamic or structural constraints. We introduce AgentPLM, which addresses this by equipping a pre-trained PLM with i) Reasoning-Augmented Decoding (RAD), which interleaves autoregressive generation with tool calls (ESMFold, FoldX, AutoDock Vina), and ii) Contrastive Agent Policy Optimisation (CAPO), a trajectory-level extension of direct preference optimisation that trains the policy end-to-end to learn when oracle feedback is informative rather than merely imitating high-fitness sequences. We evaluate AgentPLM on benchmark tasks spanning de novo enzyme design, antibody optimisation, thermostability, PPI interface design, and zero-shot fitness prediction with standardised oracle APIs and controlled sequence-identity splits. AgentPLM achieves state-of-the-art results with a gain in antibody top-10% hit rate over the strongest passive baseline, providing mechanistic evidence of online error correction without explicit backtracking.

2606.08982 2026-06-10 cs.AI 版本更新

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

Baichuan-M4：面向持续照护的临床级医疗智能体系统

Aiyuan Yang, Canbin Piao, Chengfeng Dou, Da Pan, Dian Wang, Fan Yang, Fei Deng, Fei Li, Guangwei Ai, Hui Liu, Hongda Zhang, Jinyang Tai, Kai Lu, Lijun Liu, Linwei Chen, Linyu Li, Meiqing Guo, Peidong Guo, Qiang Ju, Rihui Xin, Shuai Wang, XinKai Ma, Xudong Chen, Yichuan Mo, Yijie Zhou, Leyi Pan, Yihe Luo, Zian Wang

发表机构 * Baichuan AI（百川智能）； THUBPM Group, Tsinghua University（清华大学THUBPM课题组）

AI总结提出Baichuan-M4临床级医疗大模型，通过统一运行时、持续照护强化学习框架和临床工具层三大支柱构建智能体系统，在多项医疗评估中取得领先结果，幻觉率降至3.3%。

详情

AI中文摘要

Baichuan-M4是百川智能开发的临床级医疗大模型，专为持续照护而非单轮医疗问答设计。它围绕三大支柱构建为协调的医疗智能体系统：Baichuan-Harness，一个统一运行时，保持强化学习训练与实际部署的一致性，同时强制执行动作约束、工具使用、长期患者记忆和多智能体协调；一个核心推理模型，采用持续照护强化学习框架训练，该框架集成了跨度级奖励建模（SPAR++）、推理路径压缩、课程学习和稳定的策略优化；以及一个临床工具层，用于患者记忆管理、权威循证检索以及跨文档、X光和皮肤科的多模态医学感知。在跨维度医学评估套件中，Baichuan-M4在静态医学知识与安全性、动态OSCE式咨询、长上下文临床记忆、循证检索、医学文档OCR和多模态图像理解方面取得领先结果，同时将幻觉率降至3.3%。

英文摘要

Baichuan-M4 is Baichuan Intelligence's clinical-grade medical large model, designed for continuous care rather than single-turn medical question answering. It is built as a coordinated medical agent system around three pillars: Baichuan-Harness, a unified runtime that keeps reinforcement-learning training and real-world deployment consistent while enforcing action constraints, tool use, long-term patient memory, and multi-agent coordination; a core reasoning model trained with a continuous-care reinforcement-learning framework that integrates span-level reward modeling (SPAR++), reasoning-path compression, curriculum learning, and stabilized policy optimization; and a clinical tool layer for patient-memory management, authoritative evidence-based retrieval, and multimodal medical perception across documents, X-rays, and dermatology. On a cross-dimensional medical evaluation suite, Baichuan-M4 attains leading results in static medical knowledge and safety, dynamic OSCE-style consultation, long-context clinical memory, evidence-based retrieval, medical document OCR, and multimodal image understanding, while lowering the hallucination rate to 3.3%.

2504.03118 2026-06-10 cs.CV cs.AI 版本更新

NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices

NuWa: 为边缘设备导出轻量级类别特定视觉Transformer

Ziteng Wei, Qiang He, Bing Li, Feifei Chen, Hai Jin, Yun Yang

发表机构 * National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab（大数据技术与系统国家工程研究中心、服务计算技术与系统实验室、集群与网格计算实验室）； Swinburne University of Technology（斯威本科技大学）； Deakin University（迪金大学）

AI总结针对边缘设备只需识别特定类别的问题，提出NuWa方法，通过自知识净化去除有害权重，并利用闭式优化高效导出紧凑ViT，无需重训练即可提升类别精度并加速推理。

Comments Accepted at CVPR 2026

详情

AI中文摘要

视觉Transformer（ViT）通常需要压缩以部署在资源受限的边缘设备（如无人机和智能车辆）上。然而，现有的模型压缩方法忽略了许多边缘设备仅需特定类别的知识用于其应用。因此，导出的全类别ViT保留了冗余知识，在这些类别上表现次优。我们发现，简单地将校准数据集替换为类别特定数据不足以解决此问题，因为这些方法面临两个根本限制。首先，它们忽略了存在对类别有害的权重，这些权重干扰特化，而移除它们可以提升类别特定性能。其次，目标类别的多样性和边缘设备的资源约束需要大量定制模型。现有方法耗时且计算成本高，因此不可扩展。在这项工作中，我们提出NuWa，一种成本高效的方法，通过从基础ViT导出小型ViT来应对这些挑战，适用于具有特定类别需求的边缘设备。NuWa执行自知识净化以剪除对类别有害的权重，并通过闭式优化高效导出紧凑ViT。无需剪枝后重训练，导出的边缘ViT在类别特定精度上超越基础ViT，并加速推理。综合实验表明，NuWa在类别特定任务上比最先进的无训练剪枝方法精度高出高达29.00%。与性能最佳的依赖训练剪枝方法相比，NuWa实现了33.69倍的剪枝加速，并将剪枝成本降低高达99.83%，平均精度损失仅为0.61%。项目页面：https://github.com/CGCL-codes/NuWa 。

英文摘要

Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles. However, existing model compression methods ignore that many edge devices only require the knowledge of specific classes for their applications. As a result, the derived all-class ViTs retain redundant knowledge and perform suboptimally on these classes. We discovered that simply replacing the calibration dataset with class-specific data does not suffice to address this issue, as these methods face two fundamental limitations. First, they overlook the existence of class-detrimental weights, which interfere with specialization, while removing them can improve class-specific performance. Second, the diversity of target classes and resource constraints on edge devices demand numerous customized models. Existing methods are time-consuming and computationally expensive, thus unscalable. In this work, we present NuWa, a cost-efficient method that addresses these challenges by deriving small ViTs from base ViTs for edge devices with specific class requirements. NuWa performs self-knowledge purification to prune class-detrimental weights and efficiently derives compact ViTs through closed-form optimization. Without post-pruning retraining, the derived edge ViTs surpass the base ViT in class-specific accuracy and accelerate inference. Comprehensive experiments demonstrate that NuWa outperforms state-of-the-art training-free pruning methods on class-specific tasks by up to 29.00% in accuracy. Compared with the best-performing training-dependent pruning method, NuWa achieves a 33.69x pruning speedup and reduces pruning cost by up to 99.83%, with only a 0.61% average accuracy loss. Project Page: https://github.com/CGCL-codes/NuWa.

2507.01062 2026-06-10 cs.CY cs.AI 版本更新

Quantifying Perception-Based Student Success with Generative AI: An Exploratory Monte Carlo Simulation

基于生成式AI的感知学生成功量化：一项探索性蒙特卡洛模拟

Seyma Yaman Kayadibi

发表机构 * arXiv

AI总结本研究开发了一个探索性蒙特卡洛模拟框架，通过结构化文献检索和概率建模，量化学生在使用生成式AI工具时的感知成功，揭示了权重结构对复合分数的影响。

Comments Published in Education Sciences. This article is an extended and substantially revised version of a conference paper presented at the Melbourne Institute of Technology ICETE Conference, Sydney, NSW, Australia, 9-10 February 2026. The earlier conference version is available at DOI 10.25397/ppny-f488

详情

DOI: 10.3390/educsci16060832
Journal ref: Education Sciences 2026, 16, 832

AI中文摘要

MeCo: 基于MeanFlow的一步校正器用于多通道语音分离

Dohwan Kim, Jung-Woo Choi

发表机构 * School of Electrical Engineering, KAIST（韩国科学技术院（KAIST）电气工程学院）

AI总结提出MeCo，一种基于MeanFlow的一步生成式校正器，通过数据空间优化联合训练生成目标与信号保真度，在极低计算开销下同时提升信号保真度和人耳听觉质量。

Comments 5 pages, accepted to Interspeech 2026

详情

AI中文摘要

虽然用于多通道语音分离的判别模型在基于参考的指标上表现出色，但它们通常表现出次优的人耳听觉质量。为了解决这个问题，我们提出了一种新颖的基于MeanFlow的一步生成式校正器（MeCo）。MeCo学习一个条件平均速度场，以一步方式将判别估计直接映射到干净语音流形上。为了最大化一步生成性能，我们引入了数据空间优化（DSO）。DSO集成了一个$\mathbf{x}_r$损失，该损失惩罚较长位移间隔上的预测误差，作为人耳听觉质量的生成目标，以及一个端点SI-SDR损失，直接优化终端信号保真度。实验表明，MeCo以最小的计算开销实现了最先进的性能，在域内和域外场景中同时实现了卓越的信号保真度和人耳听觉质量。

英文摘要

While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.
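The Endpoint SI-SDR loss mentioned above builds on the scale-invariant signal-to-distortion ratio. The sketch below shows the standard SI-SDR definition on plain Python lists. It is an illustration under stated assumptions, not the authors' implementation: the function name, the `eps` stabilizer, and the use of the negated value as a loss are assumptions, and the abstract does not say how the endpoint term is weighted against the $\mathbf{x}_r$-loss.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between an estimated and a reference signal."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Project the estimate onto the reference to get the scaled target part.
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    # Whatever is left after removing the target part counts as distortion.
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10(target_energy / (residual_energy + eps))


def si_sdr_loss(estimate, reference):
    """Assumed loss form: minimizing the negative SI-SDR maximizes fidelity."""
    return -si_sdr(estimate, reference)
```

Because the target is obtained by projection, rescaling the estimate leaves the score essentially unchanged, which is the property that makes the metric scale-invariant.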

2606.10086 2026-06-10 cs.AI 新提交

Exploratory Responsiveness and Adaptive Rigidity under AI-Assisted Optimization

AI辅助优化下的探索响应性与适应性刚性

Balaraju Battu

AI总结本文提出AI辅助优化下的探索适应理论，通过动态框架分析预测辅助如何影响系统探索响应性，揭示收敛预测机制导致适应性降低、刚性增强，而探索增强机制则促进适应性。

详情

AI中文摘要

本文发展了AI辅助优化下的探索适应理论。核心论点是，AI系统的长期适应效应关键取决于预测辅助如何与探索响应性本身相互作用。我们使用一个动态框架形式化这一机制，其中认知、制度和技术系统在由多个局部强化配置构成的崎岖认知景观上演化。模型中的一个核心状态变量是适应响应性，它衡量系统在不断变化的条件下穿越不熟悉的概念和制度轨迹的能力。在收敛预测机制下，AI系统替代探索参与，降低适应响应性，并产生亚稳态陷阱、滞后、过早收敛和探索崩溃动力学，使系统局部高效但全局刚性。该框架还识别出对比的探索增强机制，其中AI系统放大探索搜索、概念穿越和适应流动性。因此，有效替代参数是响应性依赖的：拥有弱探索例程的系统更容易受到探索替代，而已经拥有高适应响应性的系统可能利用AI辅助在崎岖景观上扩展探索流动性。因此，AI的长期适应效应不仅取决于AI能力本身，还取决于制度结构、发展背景和人机交互架构。

英文摘要

This paper develops a theory of exploratory adaptation under AI-assisted optimization. The central argument is that the long-run adaptive effects of AI systems depend critically on how predictive assistance interacts with exploratory responsiveness itself. We formalize this mechanism using a dynamical framework in which cognitive, institutional, and technological systems evolve over rugged epistemic landscapes characterized by multiple locally reinforced configurations. A central state variable in the model is adaptive responsiveness, which measures the capacity of a system to traverse unfamiliar conceptual and institutional trajectories under changing conditions. Under convergent predictive regimes, AI systems substitute for exploratory engagement, reducing adaptive responsiveness and generating metastable trapping, hysteresis, premature convergence, and exploration-collapse dynamics in which systems become locally efficient but globally rigid. The framework also identifies contrasting exploration-enhancing regimes in which AI systems amplify exploratory search, conceptual traversal, and adaptive mobility. The effective substitution parameter is therefore responsiveness-dependent: systems possessing weak exploratory routines are more vulnerable to exploratory substitution, whereas systems already possessing high adaptive responsiveness may use AI assistance to expand exploratory mobility across rugged landscapes. The long-run adaptive effects of AI consequently depend not only on AI capability itself, but also on institutional structure, developmental context, and the architecture of human-machine interaction.

2606.10094 2026-06-10 cs.AI 新提交

Predictive Assistance and the Temporal Dynamics of Exploratory Compression

预测性辅助与探索性压缩的时间动态

Balaraju Battu

AI总结提出几何动力学框架，研究预测性AI如何通过外源探索性压缩改变认知探索的时间动态，发现持续稳定会降低探索响应性、曲率不对称积累导致滞后效应、早期干预限制后续探索多样性。

详情

AI中文摘要

经典认知理论将问题解决描述为通过结构化问题空间的探索性搜索，其中重复交互逐渐将搜索压缩为高效的表征结构。预测性人工智能系统引入了一种独特的机制，在这种机制中，稳定可能在探索性多样化展开之前发生，在内部生成搜索之前提供解决方案和决策轨迹。本文发展了一个几何动力学框架，其中注意力在由稳定漂移、内源探索性扰动和响应性门控学习塑造的策略景观上演化。预测性辅助被建模为外源探索性压缩的过程，在自生成探索拓宽策略空间的可达区域之前稳定轨迹。该框架产生三个主要结果。首先，持续的预测性稳定通过减弱内源扰动的有效影响来降低探索响应性，即使探索变异性仍然存在。其次，曲率不对称地积累和松弛，产生滞后效应和辅助撤除后探索移动性的延迟恢复。第三，发展结果关键取决于稳定的时机，早期干预在广泛的表征多样化发生之前缩小未来的探索遍历。该框架产生了关于探索熵、过早收敛和预测稳定后延迟恢复的经验可检验预测。更广泛地说，结果表明预测系统可能重塑探索性认知本身的几何结构。

英文摘要

Classical theories of cognition describe problem solving as exploratory search through structured problem spaces in which repeated interaction gradually compresses search into efficient representational structures. Predictive artificial intelligence systems introduce a distinct regime in which stabilization may occur before exploratory diversification unfolds, supplying solutions and decision trajectories prior to internally generated search. This paper develops a geometric dynamical framework in which attention evolves over a landscape of strategies shaped by stabilizing drift, endogenous exploratory perturbation, and responsiveness-gated learning. Predictive assistance is modeled as a process of exogenous exploratory compression that stabilizes trajectories before self-generated exploration broadens the accessible regions of strategy space. The framework yields three main results. First, sustained predictive stabilization reduces exploratory responsiveness by attenuating the effective influence of intrinsic perturbations even when exploratory variability remains present. Second, curvature accumulates and relaxes asymmetrically, producing hysteresis and delayed recovery of exploratory mobility after assistance withdrawal. Third, developmental outcomes depend critically on the timing of stabilization, with early intervention narrowing future exploratory traversal before broad representational diversification has occurred. The framework generates empirically testable predictions concerning exploratory entropy, premature convergence, and delayed recovery following predictive stabilization. More broadly, the results suggest that predictive systems may reshape the geometry of exploratory cognition itself.

2606.10413 2026-06-10 cs.AI 新提交

生物电信息论：探究人工智能下生物电信号的理论压缩极限

Jiawen Zou, Bo Yan

发表机构 * College of Computer Science and Artificial Intelligence（计算机科学与人工智能学院）； Shanghai Key Laboratory of Intelligent Information Processing（上海智能信息处理重点实验室）； Fudan University（复旦大学）

AI总结提出生物电压缩的三级层次框架，将压缩极限重构为模型和任务条件量，而非波形的固定属性。

详情

AI中文摘要

生物电信号正以挑战脑机接口带宽的规模被采集。然而，它们的压缩仍常被框定为波形保真问题，受限于原始信号的熵。本文提出一个信息论框架，其中生物电数据的有效信息不仅由信号保真度决定，还由生理结构、模型容量和下游任务需求决定。我们将生物电压缩表述为三级层次。在信号层面，噪声被降低至它们关于潜在生理源所携带的信息。在生理层面，参数化编码器将净化后的信号映射为紧凑、结构化且量化的表示。在语义层面，任务无关信息被丢弃，而深度学习模型利用因果依赖关系，用条件熵替代边际熵。这一视角将生物电信号的压缩极限重构为模型和任务条件量，而非波形的固定属性。随着表达能力日益增强的模型与神经和生理接口集成，生物电压缩可能从传输信号转变为仅传输任务级解释所需的残差信息。

英文摘要

Bioelectrical signals are increasingly acquired at scales that challenge the bandwidth of brain-computer interfaces. However, their compression is still often framed as a problem of waveform preservation, limited by the entropy of the raw signal. Here we propose an information-theoretic framework in which the effective information of bioelectrical data is determined not only by signal fidelity, but also by physiological structure, model capacity and downstream task requirements. We formulate bioelectrical compression as a three-level hierarchy. At the signal level, noise is reduced to the information they carry about latent physiological sources. At the physiological level, parametric encoders map purified signals into compact, structured and quantized representations. At the semantic level, task-irrelevant information is discarded, while deep learning models exploit causal dependencies to replace marginal entropy with conditional entropy. This perspective reframes the compression limit of bioelectrical signals as a model- and task-conditioned quantity rather than a fixed property of the waveform. As increasingly expressive models become integrated with neural and physiological interfaces, bioelectrical compression may shift from transmitting signals to transmitting only the residual information required for task-level interpretation.

2606.10458 2026-06-10 cs.IT cs.AI math.IT math.OC math.ST stat.TH 交叉投稿

Minimum Distortion Quantization with Specified Output Distribution

指定输出分布的最小失真量化

Aolin Xu

发表机构 * Aolin Xu（徐澳林）

AI总结本文推导了在输出分布指定条件下最小化均方误差的最优量化器，形式为X=σ(F_{σ^{-1}(X)}^{-1}(F_W(W)))，并证明了在均匀分布下简化为X=F_X^{-1}(F_W(W))，主要贡献在于通过优化排列和累积分布函数实现最小失真。


AI中文摘要

我们推导了实值随机变量 $W$（分布为 $P_W$）的最优量化器，使得 1) 量化输出 $X$（可取 $k$ 个值）的分布遵循 $\{1,\ldots,k\}$ 上的任意指定分布 $P_X$，且 2) 从 $X$ 估计 $W$ 的最小均方误差 (MMSE) 最小化。结果表明，最优量化器形式为 $X=\sigma\big(F_{\sigma^{-1}(X)}^{-1}(F_W(W))\big)$，其中 $\sigma$ 是 $\{1,\ldots,k\}$ 上所有排列中使 MMSE 最小的最优排列，$F$ 为累积分布函数。当 $P_W$ 在区间上均匀分布或 $P_X$ 在 $\{1,\ldots,k\}$ 上均匀分布时，量化器简化为 $X=F_{X}^{-1}(F_W(W))$。优超概念在最优性证明中起关键作用。指定输出分布有助于设计具有显式控制输出熵、最大化输入输出互信息、定制输出分布以匹配通信信道输入要求以及数据匿名化的量化器。

英文摘要

We derive the optimal quantizer of a real-valued random variable $W$ with distribution $P_W$ such that 1) the distribution of the quantization output $X$ that can take $k$ values follows any specified distribution $P_X$ over $\{1,\ldots,k\}$, and 2) the minimum mean squared error (MMSE) of estimating $W$ from $X$ is minimized. It is shown that the optimal quantizer takes the form $X=\sigma\big(F_{\sigma^{-1}(X)}^{-1}(F_W(W))\big)$, where $\sigma$ is the optimal permutation of $\{1,\ldots,k\}$ among all permutations to minimize the MMSE, and $F$ is the cumulative distribution function. When $P_W$ is uniform over an interval or $P_X$ is uniform over $\{1,\ldots,k\}$, the quantizer takes a simple form $X=F_{X}^{-1}(F_W(W))$. The concept of majorization plays a key role in the optimality proof. Specifying the output distribution is useful for designing quantizers with explicitly controlled output entropy, maximized mutual information between input and output, tailored output distribution to match channel input requirements for communication, and data anonymization.
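The simple form $X=F_{X}^{-1}(F_W(W))$ can be sketched on samples. This is a minimal illustration, not the paper's construction: it assumes $F_W$ is estimated empirically from the samples by mid-ranks, it uses a uniform $P_X$ (one of the cases where the abstract says the simple form applies), and it does not search for the permutation $\sigma$ needed in the general case. The function name `quantize` is hypothetical.

```python
import bisect
import random


def quantize(w_samples, p_x):
    """Map each sample w to a symbol in {1..k} via X = F_X^{-1}(F_W(w)).

    F_W is estimated from the samples themselves (an assumption of this sketch).
    """
    n = len(w_samples)
    # Cumulative distribution of the target output: F_X(1), ..., F_X(k).
    cum, acc = [], 0.0
    for p in p_x:
        acc += p
        cum.append(acc)
    cum[-1] = 1.0  # guard against floating-point drift in the last entry

    order = sorted(range(n), key=lambda i: w_samples[i])
    out = [0] * n
    for rank, i in enumerate(order):
        u = (rank + 0.5) / n  # mid-rank estimate of F_W(w_i), strictly inside (0, 1)
        # Generalized inverse: smallest symbol x with F_X(x) >= u (symbols are 1-indexed).
        out[i] = bisect.bisect_left(cum, u) + 1
    return out


random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(1000)]
p_x = [0.25, 0.25, 0.25, 0.25]  # uniform target distribution over {1, 2, 3, 4}
x = quantize(w, p_x)
counts = [x.count(s) for s in (1, 2, 3, 4)]
print(counts)
```

The mapping is monotone in $w$, so each output symbol corresponds to a contiguous interval of input values, and the empirical output frequencies match the target distribution up to rounding.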


2606.10937 2026-06-10 cs.DB cs.AI 交叉投稿

Provenance Tracking in AI Compilers through the Lens of Coalgebra

通过余代数视角追踪AI编译器中的来源

Zilu Tian, Liying Liu

发表机构 * OmniVision Technology Singapore（奥米视觉技术（新加坡））； Black Sesame Technology Singapore（黑芝麻技术（新加坡））

AI总结针对AI编译器中图重写导致来源难以追踪的问题，提出基于观测语义的轻量级方法，利用余代数和互模拟形式化，并在原型编译器COVAN中验证。


AI中文摘要

AI编译器通过规范化、降级和优化积极重写计算图，使得跨编译追踪张量和运算符的来源变得困难。可靠的来源对于附加平台特定的后处理、调试编译器行为以及验证变换至关重要，然而现有解决方案在非单射图重写下要么是侵入式的，要么是特设的。我们提出了一种基于观测语义的轻量级生成式方法来追踪来源。我们不通过编译器传递传播标识符，而是观测图变换并根据可观测的计算行为推理来源。我们使用余代数模型和互模拟形式化了这种方法，即使中间节点被消除，也能保留来源。此外，我们在原型AI编译器COVAN中实现了该方法，展示了在编译流水线中稳定的来源追踪，且工程开销最小。

英文摘要

AI compilers aggressively rewrite computation graphs through normalization, lowering, and optimization, making it difficult to track the provenance of tensors and operators across compilation. Reliable provenance is essential for attaching platform-specific postprocessing, debugging compiler behavior, and validating transformations, yet existing solutions are either invasive or ad hoc under non-injective graph rewrites. We present a lightweight, generative approach to provenance tracking based on observational semantics. Instead of propagating identifiers through compiler passes, we observe graph transformations and reason about provenance in terms of observable computational actions. We formalize this approach using a coalgebraic model and bisimulation, which preserves provenance even when intermediate nodes are eliminated. Furthermore, we implement this approach in a prototype AI compiler COVAN, demonstrating stable provenance across compilation pipelines with minimal engineering overhead.


2606.01929 2026-06-10 cs.AI 版本更新

VET: A Framework for Analyzing AI Discourse

VET：一个分析AI话语的框架

Meredith Ringel Morris

发表机构 * The Paul G. Allen School of Computer Science & Engineering and The Information School（保罗·G·艾伦计算机科学与工程学院和信息学院）

AI总结本文提出VET框架，通过效价、有效性和轨迹三个维度对AI话语进行分类，用于识别和批判AI炒作、AI末日、AI否认和AI常态四种立场，并作为AI素养工具支持对极化话语的审视。

2601.16700 2026-06-10 cs.SE cs.AI cs.ET cs.HC 版本更新

Adoption of Generative Artificial Intelligence in the German Software Engineering Industry: An Empirical Study

生成式人工智能在德国软件工程行业的采用：一项实证研究

Ludwig Felder, Tobias Eisenreich, Mahsa Fischer, Stefan Wagner, Chunyang Chen

发表机构 * Technical University of Munich（慕尼黑工业大学）； Heilbronn University of Applied Science（海尔布隆应用科学大学）

AI总结通过混合方法研究德国软件工程师采用生成式AI工具的情况，发现经验水平调节感知收益，组织规模影响工具选择和使用强度，项目上下文意识不足是主要障碍。

Comments Accepted at FSE '26


DOI: 10.1145/3803437.3805207

AI中文摘要

生成式人工智能（GenAI）工具在软件开发人员中迅速普及。尽管行业采用率正在上升，但影响这些工具有效使用的潜在因素，包括交互深度、组织约束和经验相关考虑，尚未得到彻底调查。这个问题在监管要求严格的环境中尤为突出，例如德国，从业者必须应对GDPR和欧盟AI法案，同时平衡生产力提升与知识产权考虑。尽管GenAI对软件工程产生了重大影响，但据我们所知，尚无实证研究系统性地考察德国背景下GenAI工具的采用动态。为填补这一空白，我们提出了一项关于德国软件工程师采用GenAI的全面混合方法研究。具体而言，我们进行了18次探索性访谈，随后对109名参与者进行了开发者调查。我们分析了工具采用模式、提示策略以及影响有效性的组织因素。结果表明，经验水平调节了GenAI工具的感知收益，并且生产力提升在开发人员之间并非均匀分布。此外，组织规模影响工具选择和使用强度。项目上下文意识有限被确定为最显著的障碍。我们总结了一系列对开发者、组织和工具供应商具有可操作性的启示，以推进人工智能辅助的软件开发。

英文摘要

Generative artificial intelligence (GenAI) tools have seen rapid adoption among software developers. While adoption rates in the industry are rising, the underlying factors influencing the effective use of these tools, including the depth of interaction, organizational constraints, and experience-related considerations, have not been thoroughly investigated. This issue is particularly relevant in environments with stringent regulatory requirements, such as Germany, where practitioners must address the GDPR and the EU AI Act while balancing productivity gains with intellectual property considerations. Despite the significant impact of GenAI on software engineering, to the best of our knowledge, no empirical study has systematically examined the adoption dynamics of GenAI tools within the German context. To address this gap, we present a comprehensive mixed-methods study on GenAI adoption among German software engineers. Specifically, we conducted 18 exploratory interviews with practitioners, followed by a developer survey with 109 participants. We analyze patterns of tool adoption, prompting strategies, and organizational factors that influence effectiveness. Our results indicate that experience level moderates the perceived benefits of GenAI tools, and productivity gains are not evenly distributed among developers. Further, organizational size affects both tool selection and the intensity of tool use. Limited awareness of the project context is identified as the most significant barrier. We summarize a set of actionable implications for developers, organizations, and tool vendors seeking to advance artificial intelligence (AI) assisted software development.


2606.00038 2026-06-10 cs.CY cs.AI 版本更新

Beyond Tool Adoption: A Practical Five-Stage Developmental Continuum for AI Literacy in Higher Education

超越工具采纳：高等教育中人工智能素养的实用五阶段发展连续体

J. Paul Liu, Rachel Levy

发表机构 * Dept of Marine, Earth, and Atmospheric Sciences（海洋、地球与大气科学系）； AI Hub for Science（科学人工智能中心）； Center of Geospatial Analytics（地理空间分析中心）； Data Science and AI Academy（数据科学与人工智能学院）； Department of Mathematics（数学系）； North Carolina State University（北卡罗来纳州立大学）

AI总结本文提出一个五阶段AI素养连续体模型，帮助教育者诊断和引导学生从回避或盲目使用AI，逐步发展为批判性评估和改进AI应用的能力。

Comments 26 pages, 5 tables, 2 figures, 1 Supplementary Table


AI中文摘要

人工智能（AI）素养日益被认为是所有大学毕业生应具备的基础能力。然而，学生与AI工具的互动往往集中在两个有问题的极端：因恐惧、不信任、伦理担忧或缺乏访问权限而回避，以及不加批判地依赖，产生流畅的输出却掩盖了误解。现有的AI素养框架提供了有价值的能力定义，但大多数在诊断学习者起点以及如何向负责任、批判性参与进步方面提供的指导有限。本文提出了一个五阶段AI素养连续体（0) 尚未参与，1) 不加批判地使用，2) 知情使用，3) 批判性评估，4) 改进），描述了高等教育中AI使用的发展取向。该连续体补充了维度框架，为教育者提供了实用的诊断和教学路径，与包括联合国教科文组织和经合组织在内的国际框架保持一致。我们介绍了来自北卡罗来纳州立大学的一个基于设计的实施案例，其中学分课程和密集实践工作坊在2024年秋季至2026年春季期间吸引了超过330名参与者。由于实施未使用经过验证的前/后测试工具或对照组，我们将发现视为基于观察和实践：参与者表现出从非参与或不加批判地使用向知情参与转变的行为，而持续且嵌入学科的经验则产生了更强的批判性评估和改进导向实践的证据。我们讨论了课程路径、公平性考量、评估策略，并认为AI素养不应仅被理解为工具采纳，而应被理解为在学科和社会背景下理解、评估和负责任地应用AI系统的发展能力。

英文摘要

Artificial intelligence (AI) literacy is increasingly recognized as a foundational competency for all university graduates. Yet students' engagement with AI tools often clusters at two extremes: avoidance driven by fear, mistrust, ethical concern, or lack of access, and uncritical reliance that produces fluent output while masking misunderstanding. Existing AI literacy frameworks provide valuable competency definitions, but most offer limited guidance for diagnosing where learners begin and how they progress toward responsible, critical engagement. This paper proposes a five-stage AI Literacy Continuum (0: Not Yet Engaged, 1: Uncritical Use, 2: Informed Use, 3: Critical Evaluation, 4: Improvement) that describes developmental orientations toward AI use in higher education. The continuum complements dimensional frameworks by providing educators with a practical diagnostic and instructional pathway aligned with international frameworks, including UNESCO and OECD. We present a design-based implementation case from North Carolina State University, where credit-bearing courses and intensive hands-on workshops engaged more than 330 participants between Fall 2024 and Spring 2026. Because the implementation did not use a validated pre/post instrument or comparison group, we frame the findings as observational and practice-based: participants exhibited behaviors consistent with movement from non-engagement or uncritical use toward informed engagement, while sustained and discipline-embedded experiences produced stronger evidence of critical evaluation and improvement-oriented practice. We discuss curricular pathways, equity considerations, and assessment strategies, and argue that AI literacy should be understood not as tool adoption alone but as a developmental capacity to understand, evaluate, and responsibly apply AI systems in disciplinary and societal contexts.


2606.04833 2026-06-10 cs.LG cs.AI 版本更新

Signed Dual Attention: Capturing Signed Dependencies in Time Series Forecasting

符号双注意力：在时间序列预测中捕捉符号依赖关系

Balthazar Courvoisier, Tristan Cazenave

发表机构 * Queensfield AI Technologies

AI总结提出符号双注意力机制，通过双消息传递方案同时捕捉正负依赖关系，无需额外参数，提升时间序列预测性能。

Comments 5 pages, 3 figures, accepted at AAAI 2026 AI4TS Workshop


AI中文摘要

标准强化学习（RL）用于大语言模型（LLM）代理主要优化外在任务奖励，往往偏向于孤立任务完成而非持续适应。这种范式可能导致过早收敛到次优策略，并使有用经验仅隐式编码在模型参数中，限制其检索和重用以供未来决策。我们引入RetroAgent，一种在线RL框架，训练代理掌握交互环境不仅通过解决任务，而是通过跨回合进化。受人类逆向自我提升启发，RetroAgent将外在奖励与逆向生成的双重内在反馈相结合：（1）内在数值反馈，通过测量相对于先前尝试的子任务进展增量来奖励有益探索；（2）内在语言反馈，将成功与失败提炼成可重用的文本教训以供显式经验重用。为有效利用这些教训，我们提出Similarity & Utility-Aware Upper Confidence Bound（SimUtil-UCB），一种检索策略，平衡语义相关性、历史效用和探索。在四个具有挑战性的代理基准测试中，RetroAgent实现了新的最先进的性能，优于GRPO在ALFWorld上提升18.3%、WebShop上提升15.4%、Sokoban上提升27.1%、MineSweeper上提升8.9%，同时展现出强大的测试时间适应性和分布外泛化能力。

英文摘要

Standard reinforcement learning (RL) for large language model (LLM) agents primarily optimizes extrinsic task rewards, often favoring isolated task completion over continual adaptation. This paradigm can cause premature convergence to suboptimal policies and leaves useful experience only implicitly encoded in model parameters, limiting its retrieval and reuse for future decisions. We introduce RetroAgent, an online RL framework that trains agents to master interactive environments not merely by solving tasks, but by evolving across episodes. Inspired by human retrospective self-improvement, RetroAgent augments extrinsic rewards with hindsight-generated dual intrinsic feedback: (1) Intrinsic Numerical Feedback, which rewards beneficial exploration by measuring incremental subtask progress relative to prior attempts; and (2) Intrinsic Language Feedback which distills successes and failures into reusable textual lessons for explicit experience reuse. To leverage these lessons effectively, we propose Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB), a retrieval strategy that balances semantic relevance, historical utility, and exploration. Across four challenging agentic benchmarks, RetroAgent achieves new state-of-the-art performance, outperforming GRPO by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper, while demonstrating strong test-time adaptation and out-of-distribution generalization.
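The abstract describes SimUtil-UCB only as a retrieval strategy that balances semantic relevance, historical utility, and exploration; it does not give the scoring rule. The sketch below is therefore a generic UCB-style score under loudly stated assumptions: the weighted sum, the `alpha` and `c` parameters, the bonus term, and the lesson fields are all hypothetical and should not be read as the paper's formula.

```python
import math


def ucb_style_score(similarity, mean_utility, times_used, total_retrievals,
                    alpha=0.5, c=1.0):
    """Hypothetical score: relevance and utility plus a UCB-like exploration bonus."""
    # The bonus shrinks as a lesson is retrieved more often (the +1 avoids division by zero).
    exploration = c * math.sqrt(math.log(total_retrievals + 1) / (times_used + 1))
    return alpha * similarity + (1 - alpha) * mean_utility + exploration


def retrieve(lessons, total_retrievals, k=1, c=1.0):
    """Return the k highest-scoring lessons."""
    ranked = sorted(
        lessons,
        key=lambda l: ucb_style_score(l["similarity"], l["mean_utility"],
                                      l["times_used"], total_retrievals, c=c),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical lesson memory: similarity to the current task, average past benefit, usage count.
lessons = [
    {"name": "A", "similarity": 0.9, "mean_utility": 0.2, "times_used": 10},
    {"name": "B", "similarity": 0.6, "mean_utility": 0.8, "times_used": 10},
    {"name": "C", "similarity": 0.5, "mean_utility": 0.5, "times_used": 0},
]

with_exploration = retrieve(lessons, total_retrievals=20)[0]["name"]
without_exploration = retrieve(lessons, total_retrievals=20, c=0.0)[0]["name"]
print(with_exploration, without_exploration)
```

With the bonus enabled, the never-used lesson is tried first; with the bonus switched off, the choice falls back to the lesson with the best combined relevance and track record.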


2601.00809 2026-06-10 cs.OH cs.AI cs.MA 版本更新

A Modular Reference Architecture for MCP-Servers Enabling Agentic BIM Interaction

一种模块化参考架构用于MCP服务器，以实现代理BIM交互

Tobias Heimig-Elschner, Changyu Du, Anna Scheuvens, André Borrmann, Jakob Beetz

发表机构 * Chair of Design Computation, RWTH Aachen University（设计计算系，亚琛工业大学）； Chair of Computing in Civil and Building Engineering, Technical University of Munich（土木与建筑工程计算系，慕尼黑技术大学）； Federal Institute for Research on Building, Urban Affairs and Spatial Development (BBSR)（建筑、都市事务和空间发展研究院）； TUM Georg Nemetschek Institute（慕尼黑技术大学Georg Nemetschek研究所）

AI总结本文提出一种模块化参考架构，通过解耦MCP接口与特定BIM-API，实现API无关、隔离且可重复的BIM代理交互，提升重用性和研究系统性。

Comments Accepted at the GNI Symposium on Artificial Intelligence for the Built World (Technical University of Munich, May 18--20, 2026)


DOI: 10.14459/2026md1851873

AI中文摘要

由大型语言模型驱动的代理工作流正被越来越多应用于建筑信息建模（BIM），使IFC模型的自然语言检索、修改和生成成为可能。最近的研究开始采用新兴的模型上下文协议（MCP）作为LLM的统一工具调用接口，简化了BIM交互的代理侧。尽管MCP标准化了LLM调用工具的方式，但当前BIM侧的实现仍需针对特定工具进行编程，限制了在不同环境中的重用、评估和工作流可移植性。本文通过引入一种模块化参考架构来解决这一缺口，该架构使MCP服务器能够实现API无关、隔离且可重复的BIM代理交互。通过对近期文献中反复出现的能力进行系统分析，我们推导出一组核心需求。这些需求指导了一种以显式适配器合同为中心的微服务架构，将MCP接口与特定BIM-API解耦。使用IfcOpenShell的原型实现展示了在常见修改和生成任务中的可行性。在代表性场景中的评估表明，该架构能够实现可靠的工作流，减少耦合，并为系统性研究提供可重用的基础。

英文摘要

Agentic workflows driven by large language models (LLMs) are increasingly applied to Building Information Modelling (BIM), enabling natural-language retrieval, modification and generation of IFC models. Recent work has begun adopting the emerging Model Context Protocol (MCP) as a uniform tool-calling interface for LLMs, simplifying the agent side of BIM interaction. While MCP standardises how LLMs invoke tools, current BIM-side implementations are still authoring-tool-specific and ad hoc, limiting reuse, evaluation, and workflow portability across environments. This paper addresses this gap by introducing a modular reference architecture for MCP servers that enables API-agnostic, isolated and reproducible agentic BIM interactions. From a systematic analysis of recurring capabilities in recent literature, we derive a core set of requirements. These inform a microservice architecture centred on an explicit adapter contract that decouples the MCP interface from specific BIM-APIs. A prototype implementation using IfcOpenShell demonstrates feasibility across common modification and generation tasks. Evaluation across representative scenarios shows that the architecture enables reliable workflows, reduces coupling, and provides a reusable foundation for systematic research.


2602.23232 2026-06-10 cs.AI 版本更新

ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays

ReCoN-Ipsundrum：一个可检验的循环持续性代理，具有情感耦合控制和机制关联的意识指标测试

Aishik Sanyal

发表机构 * Aishik Sanyal

AI总结本文提出ReCoN-Ipsundrum代理，通过情感耦合控制和机制关联意识指标测试，探讨意识指标与行为之间的关系，发现情感耦合能提高探索和谨慎行为。

Comments Accepted at AAAI 2026 Spring Symposium - Machine Consciousness: Integrating Theory, Technology, and Philosophy


DOI: 10.1609/aaaiss.v8i1.42565
Journal ref: Proceedings of the AAAI Symposium Series, 8(1):352-360, 2026

AI中文摘要

机器意识的指标式方法建议在多个任务间对机制关联证据进行三角验证，并辅以架构检查和因果干预。受Humphrey的ipsundrum假说启发，我们实现了ReCoN-Ipsundrum，一个可检验的代理，扩展了ReCoN状态机，加入了感官显著性$N^s$的循环持续性回路以及可选的情感代理报告愉悦度/唤醒度。在固定参数消融（ReCoN、Ipsundrum、Ipsundrum+affect）中，我们将Humphrey的qualiaphilia（对感官经验本身的偏好）操作化为熟悉度受控的风景路线优于平淡路线的选择。我们发现一种新颖性分离：非情感变体对新颖性敏感（Δscenic-entry = 0.07）。情感耦合是稳定的（Δscenic-entry = 0.01），即使风景路线新颖性较低（中位数Δnovelty≈-0.43）。在无奖励的探索性游戏中，情感变体表现出结构化的局部调查（扫描事件31.4 vs. 0.9；循环分数7.6）。在疼痛尾探针中，只有情感变体能持续延长计划的谨慎（尾部持续时间90 vs. 5）。对反馈+整合的损伤会选择性地减少ipsundrum变体的刺激后持续性（AUC下降27.62, 27.9%），而ReCoN保持不变。这些分离连接了循环→持续性和情感耦合控制→偏好稳定性、扫描和持续谨慎，展示了如何工程化指标样式的签名，并解释了为什么机理和因果证据应伴随行为标记。

英文摘要

Indicator-based approaches to machine consciousness recommend mechanism-linked evidence triangulated across tasks, supported by architectural inspection and causal intervention. Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience $N^s$ and an optional affect proxy reporting valence/arousal. Across fixed-parameter ablations (ReCoN, Ipsundrum, Ipsundrum+affect), we operationalize Humphrey's qualiaphilia (preference for sensory experience for its own sake) as a familiarity-controlled scenic-over-dull route choice. We find a novelty dissociation: non-affect variants are novelty-sensitive ($\Delta$scenic-entry = 0.07). Affect coupling is stable ($\Delta$scenic-entry = 0.01) even when scenic is less novel (median $\Delta$novelty $\approx$ -0.43). In reward-free exploratory play, the affect variant shows structured local investigation (scan events 31.4 vs. 0.9; cycle score 7.6). In a pain-tail probe, only the affect variant sustains prolonged planned caution (tail duration 90 vs. 5). Lesioning feedback+integration selectively reduces post-stimulus persistence in ipsundrum variants (AUC drop 27.62, 27.9%) while leaving ReCoN unchanged. These dissociations link recurrence $\rightarrow$ persistence and affect-coupled control $\rightarrow$ preference stability, scanning, and lingering caution, illustrating how indicator-like signatures can be engineered and why mechanistic and causal evidence should accompany behavioral markers.


2602.01023 2026-06-10 cs.IR cs.AI cs.LG 版本更新

Unifying Ranking and Generation in Query Auto-Completion via Retrieval-Augmented Generation and Multi-Objective Alignment

通过检索增强生成和多目标对齐统一查询自动补全中的排序与生成

Kai Yuan, Anthony Zheng, Jia Hu, Divyanshu Sheth, Hemanth Velaga, Kylee Kim, Matteo Guarrera, Besim Avci, Jianhua Li, Xuetao Yin, Rajyashree Mukherjee, Sean Suchter

发表机构 * Apple（苹果公司）； UC Berkeley（加州大学伯克利分校）

AI总结提出一个统一框架，通过检索增强生成（RAG）和多目标直接偏好优化（DPO）将查询自动补全重构为端到端列表生成，解决传统流水线长尾覆盖不足和生成方法幻觉风险的问题，并在大规模商业搜索平台上验证了有效性。

Comments 11 pages, 4 figures


DOI: 10.1145/3770855.3818502
Journal ref: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26), August 09--13, 2026, Jeju Island, Republic of Korea

AI中文摘要

查询自动补全（QAC）在用户输入时建议查询补全，帮助用户表达意图并更高效地获取结果。现有方法面临根本性挑战：传统的检索-排序流水线长尾覆盖有限且需要大量特征工程，而最近的生成方法存在幻觉和安全风险。我们提出了一个统一框架，通过检索增强生成（RAG）和多目标直接偏好优化（DPO）将QAC重构为端到端列表生成。我们的方法结合了三个关键创新：（1）将QAC重构为具有多目标优化的端到端列表生成；（2）定义并部署一套基于规则、基于模型和以LLM为评判的验证器用于QAC，并在综合方法中使用它们，结合RAG、多目标DPO和迭代批评-修订以生成高质量合成数据；（3）一种混合服务架构，可在严格的延迟约束下实现高效的生产部署。在大规模商业搜索平台上的评估显示了显著改进：离线指标在所有维度上均有提升，人工评估获得+0.40至+0.69的偏好分数，受控在线实验实现了击键次数减少5.44%和建议采纳率增加3.46%，验证了结合RAG和多目标对齐的统一生成为生产级QAC提供了有效解决方案。这项工作代表了向由大语言模型、RAG和多目标对齐驱动的端到端生成的范式转变，建立了一个经过生产验证的框架，可惠及更广泛的搜索和推荐行业。

英文摘要

Query Auto-Completion (QAC) suggests query completions as users type, helping them articulate intent and reach results more efficiently. Existing approaches face fundamental challenges: traditional retrieve-and-rank pipelines have limited long-tail coverage and require extensive feature engineering, while recent generative methods suffer from hallucination and safety risks. We present a unified framework that reformulates QAC as end-to-end list generation through Retrieval-Augmented Generation (RAG) and multi-objective Direct Preference Optimization (DPO). Our approach combines three key innovations: (1) reformulating QAC as end-to-end list generation with multi-objective optimization; (2) defining and deploying a suite of rule-based, model-based, and LLM-as-judge verifiers for QAC, and using them in a comprehensive methodology that combines RAG, multi-objective DPO, and iterative critique-revision for high-quality synthetic data; (3) a hybrid serving architecture enabling efficient production deployment under strict latency constraints. Evaluation on a large-scale commercial search platform demonstrates substantial improvements: offline metrics show gains across all dimensions, human evaluation yields +0.40 to +0.69 preference scores, and a controlled online experiment achieves 5.44\% reduction in keystrokes and 3.46\% increase in suggestion adoption, validating that unified generation with RAG and multi-objective alignment provides an effective solution for production QAC. This work represents a paradigm shift to end-to-end generation powered by large language models, RAG, and multi-objective alignment, establishing a production-validated framework that can benefit the broader search and recommendation industry.


2601.13406 2026-06-10 cs.HC cs.AI 版本更新

Integrating Virtual Reality and Large Language Models for Team-Based Non-Technical Skills Training and Evaluation in the Operating Room

将虚拟现实与大型语言模型结合用于手术室基于团队的非技术技能训练与评估

Jacob Barker, Doga Demirel, Cullen Jackson, Anna Johansson, Robbin Miraglia, Darian Hoagland, Stephanie B. Jones, John Mitchell, Daniel B. Jones, Suvranu De

发表机构 * Beth Israel Deaconess Medical Center（贝斯以色列女执事医疗中心）； Department of Surgery, Northwell Health（Northwell Health外科）； College of Engineering, Florida Agricultural and Mechanical University and Florida State University（工程学院，佛罗里达农业与机械大学和佛罗里达州立大学）

AI总结本文提出VORTeX平台，结合虚拟现实与大型语言模型，用于手术室团队非技术技能的训练与评估，通过分析团队对话生成交互图谱，提升沟通与协作能力。

Comments 23 pages, 7 figures, 1 table, 2 Appendices


DOI: 10.1038/s44484-026-00009-3
Journal ref: npj Digit. Surg. 1, 10 (2026)

AI中文摘要

尽管有效的团队合作与沟通对手术安全至关重要，但与技术模拟相比，针对非技术技能（NTS）的结构化训练仍然有限。ACS/APDS III期团队技能课程要求可扩展的工具，既能教学又能在腹腔镜紧急情况中客观评估这些能力。我们引入虚拟手术室团队经验（VORTeX），一种多用户虚拟现实（VR）平台，结合沉浸式团队模拟与大型语言模型（LLM）分析，用于训练和评估沟通、决策、团队合作和领导力。团队对话使用源自外科医生非技术技能（NOTSS）框架的结构化提示进行分析，能够自动分类行为并生成有向交互图，量化沟通结构和层级。我们实现了两个腹腔镜紧急情景（气胸和腹腔内出血），以引发真实的压力和协作。十二名外科专业人员在2024年SAGES会议上完成了试点测试，评价VORTeX为直观、沉浸且对发展团队合作与沟通有价值的工具。LLM始终产生可解释的沟通网络，反映预期的手术层级：外科医生作为中心整合者，护士作为发起者，麻醉师作为平衡的中介。通过结合沉浸式VR与LLM驱动的行为分析，VORTeX提供了一个可扩展、隐私合规的框架，用于客观评估和自动化、基于数据的复盘，适用于分布式培训环境。

英文摘要

Although effective teamwork and communication are critical to surgical safety, structured training for non-technical skills (NTS) remains limited compared with technical simulation. The ACS/APDS Phase III Team-Based Skills Curriculum calls for scalable tools that both teach and objectively assess these competencies during laparoscopic emergencies. We introduce the Virtual Operating Room Team Experience (VORTeX), a multi-user virtual reality (VR) platform that integrates immersive team simulation with large language model (LLM) analytics to train and evaluate communication, decision-making, teamwork, and leadership. Team dialogue is analyzed using structured prompts derived from the Non-Technical Skills for Surgeons (NOTSS) framework, enabling automated classification of behaviors and generation of directed interaction graphs that quantify communication structure and hierarchy. Two laparoscopic emergency scenarios, pneumothorax and intra-abdominal bleeding, were implemented to elicit realistic stress and collaboration. Twelve surgical professionals completed pilot sessions at the 2024 SAGES conference, rating VORTeX as intuitive, immersive, and valuable for developing teamwork and communication. The LLM consistently produced interpretable communication networks reflecting expected operative hierarchies, with surgeons as central integrators, nurses as initiators, and anesthesiologists as balanced intermediaries. By integrating immersive VR with LLM-driven behavioral analytics, VORTeX provides a scalable, privacy-compliant framework for objective assessment and automated, data-informed debriefing across distributed training environments.
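The directed interaction graphs mentioned in the abstract can be sketched as a weighted edge count over who addresses whom. This is an illustrative sketch only: it assumes each utterance has already been labelled with a speaker role and an addressee role (in the paper this labelling comes from LLM analysis of team dialogue), and the `turns` data and function name are hypothetical.

```python
from collections import Counter


def interaction_graph(turns):
    """Build weighted directed edges (speaker -> addressee) and per-role degrees."""
    edges = Counter((s, a) for s, a in turns if s != a)
    out_deg, in_deg = Counter(), Counter()
    for (speaker, addressee), weight in edges.items():
        out_deg[speaker] += weight   # how often this role initiates
        in_deg[addressee] += weight  # how often this role is addressed
    return edges, out_deg, in_deg


# Hypothetical labelled dialogue turns: (speaker role, addressee role).
turns = [
    ("nurse", "surgeon"),
    ("surgeon", "anesthesiologist"),
    ("anesthesiologist", "surgeon"),
    ("surgeon", "nurse"),
    ("nurse", "surgeon"),
]

edges, out_deg, in_deg = interaction_graph(turns)
roles = out_deg.keys() | in_deg.keys()
# Total degree as a crude proxy for how central a role is in the exchange.
most_central = max(roles, key=lambda r: out_deg[r] + in_deg[r])
print(dict(edges), most_central)
```

Separating in-degree from out-degree is what lets a graph like this distinguish roles that mostly initiate communication from roles that mostly receive it.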


2601.09620 2026-06-10 cs.HC cs.AI cs.CY 版本更新

Full Disclosure, Less Trust? How the Level of Detail about AI Use in News Writing Affects Readers' Trust

全面披露，更少信任？新闻写作中AI使用细节程度如何影响读者信任

Pooja Prajod, Hannes Cools, Thomas Röggla, Karthikeya Puttur Venkatraj, Amber Kusters, Alia ElKattan, Pablo Cesar, Abdallah El Ali

发表机构 * Centrum Wiskunde & Informatica（数学与信息学中心）； University of Amsterdam（阿姆斯特丹大学）； New York University（纽约大学）； TU Delft（代尔夫特理工大学）； Utrecht University（乌得勒支大学）

AI总结研究探讨新闻写作中AI使用细节披露程度对读者信任的影响，发现详细披露会降低信任，但促使更多读者核查信息源，揭示透明度与信任之间的权衡。


AI中文摘要

随着人工智能在新闻生产中的整合日益增加，对AI使用透明度的需求已获得广泛关注。最近的研究表明，AI披露可能导致“透明度困境”，即披露会降低读者信任。然而，关于AI披露细节程度如何影响信任以及在新闻背景下如何促成这一困境仍知之甚少。在本项3×2×2混合因子研究中，我们调查了三种AI披露水平（无、一行、详细）在两种新闻类型（政治与生活方式）和两种AI参与程度（低、高）下对新闻读者信任的影响。我们使用新闻媒体信任问卷以及两个决策行为（信息源核查和订阅决定）来测量信任。问卷回答和订阅率显示，只有详细AI披露导致信任下降，而信息源核查行为在一行和详细披露中均增加，且对详细披露影响更显著。半结构化访谈的见解表明，信息源核查行为主要由对主题的兴趣驱动，其次是信任，而信任是影响订阅决定的主要因素。约三分之二的参与者表达了对详细披露的偏好，而大多数偏好一行披露的参与者则表明需要按需详细披露格式。我们的发现表明，并非所有AI披露都会导致透明度困境，而是反映了读者对更多透明度的渴望与对AI辅助新闻内容信任之间的权衡。

英文摘要

As artificial intelligence (AI) is increasingly integrated into news production, calls for transparency about the use of AI have gained considerable traction. Recent studies suggest that AI disclosures can lead to a ``transparency dilemma'', where disclosure reduces readers' trust. However, little is known about how the \textit{level of detail} in AI disclosures influences trust and contributes to this dilemma within the news context. In this 3$\times$2$\times$2 mixed factorial study with 40 participants, we investigate how three levels of AI disclosures (none, one-line, detailed) across two types of news (politics and lifestyle) and two levels of AI involvement (low and high) affect news readers' trust. We measured trust using the News Media Trust questionnaire, along with two decision behaviors: source-checking and subscription decisions. Questionnaire responses and subscription rates showed a decline in trust only for detailed AI disclosures, whereas source-checking behavior increased for both one-line and detailed disclosures, with the effect being more pronounced for detailed disclosures. Insights from semi-structured interviews suggest that source-checking behavior was primarily driven by interest in the topic, followed by trust, whereas trust was the main factor influencing subscription decisions. Around two-thirds of participants expressed a preference for detailed disclosures, while most participants who preferred one-line indicated a need for detail-on-demand disclosure formats. Our findings show that not all AI disclosures lead to a transparency dilemma, but instead reflect a trade-off between readers' desire for more transparency and their trust in AI-assisted news content.


2512.09543 2026-06-10 cs.SE cs.AI 版本更新

SWEnergy: An Empirical Study on Energy Efficiency in Agentic Issue Resolution Frameworks with SLMs

SWEnergy：关于使用SLM的代理问题解决框架的能量效率实证研究

Arihant Tripathy, Ch Pavan Harshit, Karthik Vaidhyanathan

发表机构 * SERC, IIIT-Hyderabad（IIIT-海得拉巴研究所）

AI总结本文通过实证研究，探讨了四种主流代理问题解决框架在使用小型语言模型时的能量效率和资源消耗，发现框架架构是主要能耗驱动因素，但SLM的有限推理能力导致大量能耗浪费。

Comments 8 pages, 5 figures, 1 table. Accepted to AGENT 2026 (ICSE 2026 workshop)


DOI: 10.1145/3786167.3788406
Journal ref: Proceedings of the 2026 International Workshop on Agentic Engineering (AGENT 2026), ACM, 2026, pp. 104-111


英文摘要

Context. LLM-based autonomous agents in software engineering rely on large, proprietary models, limiting local deployment. This has spurred interest in Small Language Models (SLMs), but their practical effectiveness and efficiency within complex agentic frameworks for automated issue resolution remain poorly understood. Goal. We investigate the performance, energy efficiency, and resource consumption of four leading agentic issue resolution frameworks when deliberately constrained to using SLMs. We aim to assess the viability of these systems for this task in resource-limited settings and characterize the resulting trade-offs. Method. We conduct a controlled evaluation of four leading agentic frameworks (SWE-Agent, OpenHands, Mini SWE Agent, AutoCodeRover) using two SLMs (Gemma-3 4B, Qwen-3 1.7B) on the SWE-bench Verified Mini benchmark. On fixed hardware, we measure energy, duration, token usage, and memory over 150 runs per configuration. Results. We find that framework architecture is the primary driver of energy consumption. The most energy-intensive framework, AutoCodeRover (Gemma), consumed 9.4x more energy on average than the least energy-intensive, OpenHands (Gemma). However, this energy is largely wasted. Task resolution rates were near-zero, demonstrating that current frameworks, when paired with SLMs, consume significant energy on unproductive reasoning loops. The SLM's limited reasoning was the bottleneck for success, but the framework's design was the bottleneck for efficiency. Conclusions. Current agentic frameworks, designed for powerful LLMs, fail to operate efficiently with SLMs. We find that framework architecture is the primary driver of energy consumption, but this energy is largely wasted due to the SLMs' limited reasoning. Viable low-energy solutions require shifting from passive orchestration to architectures that actively manage SLM weaknesses.


2508.00491 2026-06-10 cs.RO cs.AI 版本更新

HannesImitation: Grasping with the Hannes Prosthetic Hand via Imitation Learning

HannesImitation：通过模仿学习控制Hannes假手进行抓取

Carlo Alessi, Federico Vasile, Federico Ceola, Giulia Pasquale, Nicolò Boccardo, Lorenzo Natale

发表机构 * Humanoid Sensing and Perception（人形传感与感知实验室）； Istituto Italiano di Tecnologia（意大利技术研究院）； Rehab Technologies Lab（康复技术实验室）

AI总结本文提出HannesImitationPolicy，通过模仿学习控制Hannes假手在无结构环境中抓取物体，并引入HannesImitationDataset进行训练，实验表明其在无结构场景中优于基于分割的视觉伺服控制器。

Comments Paper accepted at IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)


Journal ref: IEEE/RSJ International Conference on Intelligent Robots and Systems, Hangzhou, China, 2025

AI中文摘要

最近，假手控制的进步集中在通过摄像头和其他传感器输入提高自主性。这些系统旨在通过自动控制某些自由度来减少用户认知负担。在机器人学中，模仿学习已成为学习抓取和复杂操作任务并简化数据收集的有前途的方法。然而，其在假手控制中的应用在很大程度上仍未被探索。填补这一差距可以提高灵活性恢复，并使假手设备能够在更多无约束场景中运行，其中任务是通过演示学习而非依赖手动标注序列。为此，我们提出了HannesImitationPolicy，一种基于模仿学习的方法来控制Hannes假手，使其在非结构化环境中进行物体抓取。此外，我们引入了HannesImitationDataset，包含在桌子、架子和人到假手交接场景中的抓取演示。我们利用此类数据训练了单一扩散策略，并将其部署在假手上以预测抓取所需的手腕方向和手部闭合。实验评估显示在多样化的物体和条件下成功抓取。最后，我们展示该策略在非结构化场景中优于基于分割的视觉伺服控制器。附加材料可在我们的项目页面上获取：https://hsp-iit.github.io/HannesImitation

英文摘要

Recent advancements in control of prosthetic hands have focused on increasing autonomy through the use of cameras and other sensory inputs. These systems aim to reduce the cognitive load on the user by automatically controlling certain degrees of freedom. In robotics, imitation learning has emerged as a promising approach for learning grasping and complex manipulation tasks while simplifying data collection. Its application to the control of prosthetic hands remains, however, largely unexplored. Bridging this gap could enhance dexterity restoration and enable prosthetic devices to operate in more unconstrained scenarios, where tasks are learned from demonstrations rather than relying on manually annotated sequences. To this end, we present HannesImitationPolicy, an imitation learning-based method to control the Hannes prosthetic hand, enabling object grasping in unstructured environments. Moreover, we introduce the HannesImitationDataset comprising grasping demonstrations in table, shelf, and human-to-prosthesis handover scenarios. We leverage such data to train a single diffusion policy and deploy it on the prosthetic hand to predict the wrist orientation and hand closure for grasping. Experimental evaluation demonstrates successful grasps across diverse objects and conditions. Finally, we show that the policy outperforms a segmentation-based visual servo controller in unstructured scenarios. Additional material is provided on our project page: https://hsp-iit.github.io/HannesImitation


2508.17196 2026-06-10 cs.LG cs.AI 版本更新

BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens

BudgetThinker: 通过控制令牌赋能预算感知的LLM推理

Hao Wen, Xinrui Wu, Yi Sun, Feifei Zhang, Liye Chen, Jie Wang, Yunxin Liu, Yunhao Liu, Ya-Qin Zhang, Yuanchun Li

发表机构 * Institute for AI Industry Research (AIR) Tsinghua University（人工智能产业研究院（AIR）清华大学）； Global Innovation Exchange & Department of Automation Tsinghua University（全球创新交流中心及自动化系清华大学）

AI总结BudgetThinker通过在推理过程中插入控制令牌，使LLM能够精确控制推理过程长度，采用两阶段训练流程提升模型在不同预算下的表现。


AI中文摘要

近年来，大语言模型（LLM）通过增加测试时计算来增强推理能力，但此策略导致显著延迟和资源成本，限制了其在现实时间受限或成本敏感场景中的应用。本文提出BudgetThinker，一种新型框架，旨在使LLM具备预算感知推理能力，从而精确控制其思考过程的长度。我们提出了一种方法，在推理过程中定期插入特殊控制令牌，持续告知模型剩余令牌预算。该方法配合一个两阶段训练流程：先通过监督微调（SFT）使模型熟悉预算约束，再进行基于课程学习的强化学习（RL）阶段，利用长度感知奖励函数同时优化准确性和预算遵守度。我们证明BudgetThinker在各种推理预算下的高难度数学基准测试中显著优于强基线。我们的方法提供了一种可扩展且有效的解决方案，用于开发高效可控的LLM推理，使高级模型更适用于资源受限和实时环境。

英文摘要

Recent advancements in Large Language Models (LLMs) have leveraged increased test-time computation to enhance reasoning capabilities, a strategy that, while effective, incurs significant latency and resource costs, limiting their applicability in real-world time-constrained or cost-sensitive scenarios. This paper introduces BudgetThinker, a novel framework designed to empower LLMs with budget-aware reasoning, enabling precise control over the length of their thought processes. We propose a methodology that periodically inserts special control tokens during inference to continuously inform the model of its remaining token budget. This approach is coupled with a comprehensive two-stage training pipeline, beginning with Supervised Fine-Tuning (SFT) to familiarize the model with budget constraints, followed by a curriculum-based Reinforcement Learning (RL) phase that utilizes a length-aware reward function to optimize for both accuracy and budget adherence. We demonstrate that BudgetThinker significantly surpasses strong baselines in maintaining performance across a variety of reasoning budgets on challenging mathematical benchmarks. Our method provides a scalable and effective solution for developing efficient and controllable LLM reasoning, making advanced models more practical for deployment in resource-constrained and real-time environments.
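The mechanism of periodically inserting control tokens that report the remaining budget can be sketched as a decoding loop. This is a minimal sketch under stated assumptions, not the paper's implementation: the `<remaining:N>` token format, the `interval` parameter, the `<eos>` sentinel, the stub `next_token` callable, and the choice not to count control tokens against the budget are all hypothetical.

```python
from typing import Callable, List


def generate_with_budget(next_token: Callable[[List[str]], str],
                         prompt: List[str], budget: int, interval: int) -> List[str]:
    """Decode up to `budget` tokens, inserting a remaining-budget marker every `interval` tokens."""
    context = list(prompt)
    produced = 0
    while produced < budget:
        if produced % interval == 0:
            # Hypothetical control token telling the model how many tokens are left.
            context.append(f"<remaining:{budget - produced}>")
        token = next_token(context)
        if token == "<eos>":
            break
        context.append(token)
        produced += 1  # only model-generated tokens count against the budget here
    return context[len(prompt):]


# Stub standing in for a real model: always emits the same reasoning token.
output = generate_with_budget(lambda ctx: "step", prompt=["Q:"], budget=6, interval=3)
print(output)
```

In a real system the model would have to be trained to react to such markers, which is the role the abstract assigns to the SFT and RL stages; the loop above only shows where the markers would enter the context.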

2503.08460 2026-06-10 cs.ET cs.AI cs.CY updated version

Status and Future Prospects of the Standardization Framework Industry 4.0: A European Perspective

Olga Meyer, Marvin Boell, Christoph Legat

Affiliations: Fraunhofer Institute for Manufacturing Engineering and Automation (IPA); German Commission for Electrotechnical, Electronic, and Information Technologies; Technical University of Applied Sciences Augsburg

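AI summary: This paper examines the central role of Industry 4.0 standardization within the European regulatory framework, focusing on standardization activities in intelligent manufacturing and digital twins. It offers a guide to standards for Artificial Intelligence and Digital Twins and calls for closer cooperation between standardization bodies and the research community.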

DOI: 10.1007/978-3-031-85316-6_5

Abstract

The rapid development of Industry 4.0 technologies requires robust and comprehensive standardization to ensure interoperability, safety and efficiency in the Industry of the Future. This paper examines the fundamental role and functionality of standardization, with a particular focus on its importance in Europe's regulatory framework. Building on this, it highlights selected topics from standardization activities in intelligent manufacturing and digital twins, and thereby provides an overview of the Industry 4.0 standards framework. This paper serves both as an informative guide to the existing standards in Industry 4.0 with respect to Artificial Intelligence and Digital Twins, and as a call to action for increased cooperation between standardization bodies and the research community. By fostering such collaboration, we aim to facilitate the continued development and implementation of standards that will drive innovation and progress in the manufacturing sector.

2501.11937 2026-06-10 cs.LG cs.AI updated version

MeshONet: A Generalizable and Efficient Operator Learning Method for Structured Mesh Generation

Jing Xiao, Xinhai Chen, Qingling Wang, Jie Liu

Affiliations: Laboratory of Digitizing Software for Frontier Equipment, Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology

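AI summary: This paper proposes MeshONet, a generalizable intelligent learning method for structured mesh generation that recasts mesh generation as an operator learning problem, achieving efficient generation and generalization across geometries.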

DOI: 10.1016/j.neunet.2026.108746
Journal ref: Neural Networks 199: 108746 (2026)

Abstract

Mesh generation plays a crucial role in scientific computing. Traditional mesh generation methods, such as TFI and PDE-based methods, often struggle to achieve a balance between efficiency and mesh quality. To address this challenge, physics-informed intelligent learning methods have recently emerged, significantly improving generation efficiency while maintaining high mesh quality. However, physics-informed methods fail to generalize when applied to previously unseen geometries, as even small changes in the boundary shape necessitate burdensome retraining to adapt to new geometric variations. In this paper, we introduce MeshONet, the first generalizable intelligent learning method for structured mesh generation. The method transforms the mesh generation task into an operator learning problem with multiple input and solution functions. To effectively overcome the multivariable mapping restriction of operator learning methods, we propose a dual-branch, shared-trunk architecture to approximate the mapping between function spaces based on input-output pairs. Experimental results show that MeshONet achieves a speedup of up to four orders of magnitude in generation efficiency over traditional methods. It also enables generalization to different geometries without retraining, greatly enhancing the practicality of intelligent methods.

2409.04519 2026-06-10 quant-ph cs.AI cs.LG physics.data-an updated version

The role of data embedding in quantum autoencoders for improved anomaly detection

Jack Y. Araz, Michael Spannowsky

Affiliations: Thomas Jefferson National Accelerator Facility; Institute for Particle Physics Phenomenology; Durham University

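AI summary: The study examines how three data embedding techniques affect the anomaly detection performance of quantum autoencoders, finding that enhanced embedding strategies can substantially improve detection accuracy and data representability.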

Comments 8 pages, 5 figures, 4 tables

DOI: 10.1007/s42484-026-00404-6
Journal ref: Quantum Mach. Intell. 8, 61 (2026)

Abstract

The performance of Quantum Autoencoders (QAEs) in anomaly detection tasks is critically dependent on the choice of data embedding and ansatz design. This study explores the effects of three data embedding techniques, data re-uploading, parallel embedding, and alternate embedding, on the representability and effectiveness of QAEs in detecting anomalies. Our findings reveal that even with relatively simple variational circuits, enhanced data embedding strategies can substantially improve anomaly detection accuracy and the representability of underlying data across different datasets. Starting with toy examples featuring low-dimensional data, we visually demonstrate the effect of different embedding techniques on the representability of the model. We then extend our analysis to complex, higher-dimensional datasets, highlighting the significant impact of embedding methods on QAE performance.

2310.04680 2026-06-10 cs.CL cs.AI cs.LG updated version

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite

Affiliations: MIT CSAIL; MIT Harvard University; Google Research; Google DeepMind

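AI summary: The study examines how scaling the number of parameters in large language models affects their core capabilities, finding that reducing model size significantly degrades fact recall while having much less effect on in-context information processing.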

Journal ref: The Twelfth International Conference on Learning Representations (ICLR), 2024

Abstract

How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques (weight pruning, and simply training a smaller or larger model, which we refer to as dense scaling) and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60-70% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.

1. Agents, Planning, and Decision-Making (29 papers)

Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents

What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages

A History-Aware Visually Grounded Critic for Computer Use Agents

Agentic Social Affordance Framework (ASAF): Agent Identity Design as a Collaboration Interface in Multi-Agent Systems

Human-AI Coordination Zones: A Framework for Designing Human-in-the-Loop Experiences with Agentic AI

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference

3SPO: State-Score-Supervised Policy Optimization for LLM Agents

Bittensor Agent Arenas as a Trajectory Primitive: Distilling a Shopping Agent from ShoppingBench Subnet Traces

What makes a harness a harness: necessary and sufficient conditions for an agent harness

SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration

Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

Effective Reinforcement Learning for Agentic Search by Recycling Zero-Variance Queries During Training

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Constructing coherent spatial memory in LLM agents through graph rectification

Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents

Fact-Augmented Lookahead Planning for LLM Agents

ASA: Backbone-Training-Free Representation Engineering for Tool-Calling Agents

AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

2. Knowledge Representation, Reasoning, and Symbolic AI (6 papers)

WorldKernel: A World Model is the Coupling Kernel of Admissible Possible Worlds

KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

Causal Ensemble Agent: Hierarchical Causal Discovery with LLM-guided Expert Reweighting

Belief Acquisition as Stochastic Filtering

Non-Parametric Structural Priors for Geometry Theorem Prediction

Neurosymbolic Learning for Inference-Time Argumentation

3. Multi-Agent Systems and Games (13 papers)

Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis

Failure Modes of Deep Multi-Agent RL in Asynchronous Pricing: Reproducible Triggers, Trace Diagnostics, and a Partial Fix

Duality for Optimal Multi-Item, Multi-Bidder Auction Design: Revenue Certificates through Deep Learning

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

Drawing with Strangers: Population Scaling Drives Zero-Shot Mutual Intelligibility in Emergent Sketching

Decentralized Multi-Agent Systems with Shared Context

Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals

A Sober Look at Agentic Misalignment in Automated Workflows

TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit

Prosociality by Coupling, Not Mere Observation: Homeostatic Sharing in an Inspectable Recurrent Artificial Life Agent

Durable Evaluation Framework: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

4. Search, Optimization, and Constraint Solving (4 papers)

Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

Importance-Aware Scheduling for High-Dimensional Hyperparameter Optimization

Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems

Optimizing Explicit Unit-Distance Lower-Bound Certificates

5. Machine Learning and Representation Learning (94 papers)

Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning

Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Accelerating NeurASP with vectorization and caching

What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

The Role of Feedback Alignment in Self-Distillation

Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

Support sufficiency as action-sufficient compression: a single-cycle rate-regret formulation

Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding

Blurry Window Attention

QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning

Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning

TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition

SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs

Representation Curriculum: Stagewise Training for Robust Ranking and Allocation

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

The Whale That Outswam Evolution: Swarm Intelligence Maximises Memory in Connectome Reservoirs

LongMoE: Longitudinal Multimodal Learning via Trajectory-Aware Mixture-of-Experts

Conformal Prediction for Neural Operators: Distribution-Free Uncertainty Quantification in Physics Simulation

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization