SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

Amine El Hattami, Nicolas Chapados, Christopher Pal

Affiliations: ServiceNow Research; Mila; Polytechnique Montréal; Canada CIFAR AI Chair

AI summary: Proposes the SKILL.nb framework, which manages the lifecycle reliability of agent workflows through selective formalization and gate-conditioned execution, reaching 53.7% single-round success on WebArena-Verified and retaining 91.7% of initially successful tasks across re-executions.

Abstract

AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows with evidence-calibrated lifecycle policies. SKILL.nb uses selective formalization: execution evidence decides which workflow steps should become executable code, which should remain natural-language guided, and when those choices should be revised. Workflows are stored as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, gate-conditioned execution lets each step run code when its gates validate, or fall back locally when drift invalidates the executable realization. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0% to 17.0% for persistent baselines. It also leads on Mind2Web cross-website and cross-domain splits. In a GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of -1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success.
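
A minimal sketch of the gate-conditioned execution described above: a step runs its code cell when all of its gates validate and otherwise falls back locally. The `Step` class, `run_step`, the selector-based gate, and the string results are hypothetical illustrations, not the SKILL.nb notebook schema or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, str]


@dataclass
class Step:
    """One workflow step (hypothetical structure, not the paper's schema)."""
    name: str
    gates: List[Callable[[State], bool]]  # validation gates checked before running code
    run_code: Callable[[State], str]      # formalized, executable realization
    fallback: Callable[[State], str]      # natural-language-guided path (stubbed here)


def run_step(step: Step, state: State) -> str:
    """Run the code cell only if every gate validates; otherwise fall back locally."""
    if all(gate(state) for gate in step.gates):
        try:
            return step.run_code(state)
        except Exception:
            # Assumption of this sketch: a runtime failure is handled like a failed gate.
            return step.fallback(state)
    return step.fallback(state)


step = Step(
    name="open_settings",
    gates=[lambda s: s.get("settings_selector") == "#settings"],
    run_code=lambda s: "code:click(#settings)",
    fallback=lambda s: "fallback:agent locates the settings link",
)

fresh = run_step(step, {"settings_selector": "#settings"})   # gate validates
drifted = run_step(step, {"settings_selector": "#prefs"})    # page drifted, gate fails
```

Because the fallback is chosen per step, drift in one step does not force the whole workflow back to free-form agent control.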

2606.08106 2026-06-09 cs.AI cs.MA New submission

Self-Evolving Scientific Agents Discover Generalizable Fluid Control with Physical Reasoning

Boai Sun, Wenjin Guo, Zongmin Yu, Liu Yang

Affiliations: National University of Singapore

AI summary: Proposes a self-evolving scientific-agent workflow driven by large language models that builds interpretable controllers automatically through iterative code generation and diagnosis of physical simulations, and achieves zero-shot generalization on a target-reaching task for an underactuated two-joint dogfish swimmer.

Abstract

While data-intensive deep reinforcement learning can optimize complex control policies, scientific discovery in physical systems fundamentally requires an interpretable chain of reasoning that connects physical evidence to structured control architectures. Here, we present a self-evolving scientific-agent workflow, driven by large language models and iterative code generation, that automates controller construction while preserving strict interpretability and rigorous physical reasoning. Instead of adjusting weights, the agent deploys candidate strategies into physical simulations, actively diagnoses dynamic behaviors from multimodal evidence, and translates these observations into progressive source-code refinements. We demonstrate this framework on a highly non-linear fluid-structure interaction problem: an underactuated, two-joint dogfish swimmer tasked with spatial target reaching using only joint angular accelerations. Starting from a propulsive seed policy that exhibits a one-sided steering bias, the agent autonomously discovers and refines a unified controller that robustly captures all canonical targets. Remarkably, without any retraining or target-specific branching, the synthesized control policy generalizes to unseen static targets and dynamically curved pursuit trajectories. The auditable evolve log reveals an emergent control architecture built upon traveling-wave propulsion, body-frame target guidance, yaw-rate feedback, signed mean-tail curvature, and adaptive cadence relief. Our results show that an autonomous scientific agent can successfully transform accumulated physical evidence into a robust, mathematically readable control policy while maintaining a fully traceable process of scientific discovery.

2606.08552 2026-06-09 cs.AI cs.MA cs.NE physics.data-an New submission

Quantitative Promise Theory: Intentionality and Inference in Autonomous Agents

Mark Burgess

Affiliations: ChiTek-i AS

AI summary: Proposes incorporating Bayesian probability and information-theoretic optimization, including Active Inference, into promise semantics to address non-local coordination, calibration, and normalization in probabilistic computation, and treats the boundary conditions that constrain states and set decision thresholds as promises, yielding a scalable definition of intent.

Abstract

I discuss some quantitative representations of Promise Theory for processes involving autonomous agents. Agent models are common in software systems, machine learning, and biology, for example, but may also apply to physics and other forms of engineering. I describe how Bayesian probability and information theoretic optimization, including Active Inference, may be incorporated with promise semantics -- as well as how Promise Theory supplements solutions, helping to avoid probability's pitfalls, which include non-local coordination, calibrating, and normalizing probabilistic computations. The role of boundary conditions in constraining allowed states and selecting decision thresholds is a form of promise, and agent alignment provides a scalable definition of intent. Autonomous agents may congeal into swarms with superagent characteristics by trying to minimize their information, despite uncertainty that works to maximize it. The use of Promise Theory involves some research challenges as well as stylistic preferences.

2606.08596 2026-06-09 cs.AI cs.HC New submission

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Beiwen Zhang, Yongheng Liang, Guowei Zou, Haitao Wang, Hejun Wu

Affiliations: Sun Yat-sen University

AI summary: Proposes Co-pi-tree, which distills LLM reasoning into an executable policy tree and, in Overcooked-AI, improves average reward by 35.4% while cutting LLM queries by 77.7% and test-time latency by 97.1%.

Abstract

Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow two lines of work. Most prior work relies on multi-agent reinforcement learning (MARL) to learn black-box policies, which limits interpretability and raises safety concerns. Recent methods query large language models (LLMs) at each decision step, causing slow responses and high inference costs. We propose Collaboration Policy Tree (Co-pi-tree), a closed-loop method that learns an executable policy tree consisting of a partner-behavior prediction tree and an agent-action selection tree. Co-pi-tree constructs a policy by distilling LLM reasoning into policy tree code. It then evaluates the policy through partner interaction, obtains feedback, and uses natural language to summarize the interaction feedback to improve problematic branches. Experiments in Overcooked-AI show that Co-pi-tree improves average reward by 35.4% over the baseline average, while reducing the number of LLM queries by 77.7% and test-time latency by 97.1%. Project page: https://beiwenzhang.github.io/Co-pi-tree/

2606.08735 2026-06-09 cs.AI New submission

Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning

Lianrong Zuo, Peilan Xu, Yong Liu, Wenjian Luo

Affiliations: School of Artificial Intelligence, Nanjing University of Information Science and Technology; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology

AI summary: Proposes the SV-QD-RL framework, which uses structure-conditioned actor-critic branches and a branch-aware QD archive to build high-quality, behaviorally diverse policy repertoires on MuJoCo tasks.

Abstract

Quality-diversity reinforcement learning (QD-RL) aims to construct policy repertoires that contain both high-performing and behaviorally diverse policies. Existing QD-RL methods mainly diversify policy instances after rollout evaluation or use learned value information to improve policy quality and behavior targeting, while the learning branches that generate candidate policies remain less explored. This paper proposes SV-QD-RL, a structure-value coupled framework that represents each candidate as a structure-conditioned actor-critic branch. Each branch contains an actor, a structural mask, a branch-specific critic, a replay state, and evaluation attributes including behavior, return, sparsity, and value profile. The structural mask defines the actor subspace in which the branch learns, while the branch-specific critic and replay state shape its value-learning trajectory. A branch-aware QD archive then evaluates and retains branches according to behavioral quality, structural footprint, and value-profile information. Experiments on MuJoCo continuous-control tasks show that SV-QD-RL constructs policy repertoires with strong archive quality and behaviorally useful diversity. Ablation and diagnostic analyses further indicate that structural conditioning, critic differentiation, and memory-consistent refinement make complementary contributions to behavioral specialization. Schedule-aware repertoire evaluation shows that the learned archive provides selectable policy alternatives under changing behavior-level requirements. These results suggest that coupling actor structure with branch-specific value learning is an effective mechanism for generating diverse QD-RL policy repertoires.
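
To make the notion of a QD archive concrete, here is a minimal sketch of a generic MAP-Elites-style grid archive that keeps the highest-return policy per behavior cell. It is not the branch-aware archive of SV-QD-RL, which the abstract says also uses structural footprint and value-profile information; `GridArchive`, its parameters, and the policy ids are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Cell = Tuple[int, ...]


class GridArchive:
    """Generic quality-diversity archive: one elite per discretized behavior cell."""

    def __init__(self, bins: int, low: float, high: float) -> None:
        self.bins, self.low, self.high = bins, low, high
        self.elites: Dict[Cell, Tuple[float, str]] = {}  # cell -> (return, policy id)

    def cell_of(self, behavior: Sequence[float]) -> Cell:
        """Discretize a behavior descriptor, clamping values to the grid bounds."""
        width = (self.high - self.low) / self.bins
        return tuple(
            min(self.bins - 1, max(0, int((b - self.low) / width))) for b in behavior
        )

    def add(self, policy_id: str, behavior: Sequence[float], ret: float) -> bool:
        """Insert the policy if its cell is empty or it beats the current elite."""
        cell = self.cell_of(behavior)
        best = self.elites.get(cell)
        if best is None or ret > best[0]:
            self.elites[cell] = (ret, policy_id)
            return True
        return False

    def qd_score(self) -> float:
        """Sum of elite returns, a common archive-quality summary."""
        return sum(ret for ret, _ in self.elites.values())


archive = GridArchive(bins=4, low=0.0, high=1.0)
added_a = archive.add("a", (0.10, 0.10), ret=1.0)  # fills an empty cell
added_b = archive.add("b", (0.15, 0.20), ret=0.5)  # same cell as "a", lower return
added_c = archive.add("c", (0.90, 0.90), ret=2.0)  # different cell
```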

2606.08875 2026-06-09 cs.AI New submission

Can the Environment Speak for Itself? $T^{2}$-GRPO: A Turn-Trajectory Group Relative Policy Optimization for Caregiver Agents

Yutong Song, Jiang Wu, Pengfei Zhang, Wenjun Huang, Honghui Xu, Nikil Dutt, Amir M. Rahmani

Affiliations: University of California, Irvine; Independent Researcher; Kennesaw State University

AI summary: Proposes the $T^{2}$-GRPO framework, which decouples caregiver reinforcement learning into two normalized reward horizons, enforces safety with a binary hard veto, and combines dense turn-level rewards derived from environment state transitions with trajectory-level evaluation to handle immediate patient feedback, long-term care outcomes, and safety constraints.

Abstract

Optimizing large language models (LLMs) for long-horizon caregiver agents requires balancing delayed task objectives with immediate environment dynamics, such as patient distress and resistance. In dementia care, this balance is especially difficult: trajectory-level rewards are too sparse for turn-level credit assignment, while external LLM-based evaluators are costly and can misread fragmented or indirect patient responses. To address this issue, we propose Turn-Trajectory Group Relative Policy Optimization ($T^{2}$-GRPO), a framework that decouples caregiver RL into two normalized reward horizons and enforces safety through a binary hard veto. $T^{2}$-GRPO derives dense turn-level rewards directly from environment state transitions, measuring changes in patient distress and resistance from a frozen dementia patient simulator. These environment-grounded rewards are combined with trajectory-level evaluations through independent centered-rank normalization, which preserves heterogeneous reward signals and mitigates reward collapse. Extensive experiments on dementia caregiving show that $T^{2}$-GRPO outperforms competitive baselines, indicating a substantial improvement in handling immediate patient feedback, long-term care outcomes, and safety constraints in emotionally sensitive caregiver scenarios.
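
A minimal sketch of how two reward horizons could be rank-normalized independently and combined under a hard veto. Several details are assumptions, not the paper's method: each rollout's turn-level rewards are taken as already aggregated into one number, the two normalized signals are simply added, and a vetoed rollout receives a fixed `veto_value`. The function names are hypothetical.

```python
from typing import List


def centered_ranks(values: List[float]) -> List[float]:
    """Map values to ranks scaled into [-0.5, 0.5]; tied values share the average rank."""
    n = len(values)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return [r / (n - 1) - 0.5 for r in ranks]


def group_advantages(
    turn_rewards: List[float],
    traj_rewards: List[float],
    unsafe: List[bool],
    veto_value: float = -2.0,
) -> List[float]:
    """Normalize each horizon on its own scale, add them, and override unsafe rollouts."""
    turn = centered_ranks(turn_rewards)
    traj = centered_ranks(traj_rewards)
    return [veto_value if bad else t + g for t, g, bad in zip(turn, traj, unsafe)]


adv = group_advantages(
    turn_rewards=[1.0, 3.0, 2.0],
    traj_rewards=[10.0, 30.0, 20.0],
    unsafe=[False, False, True],
)
```

Ranking each horizon separately keeps a signal with a large numeric range (here the trajectory scores) from swamping the other one.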

2606.08952 2026-06-09 cs.AI New submission

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

Shouwei Ruan, Bin Wang, Zhenyu Wu, Qihui Zhu, Yuxiang Zhang, Jingzhi Li, Yubin Wang, Xingxing Wei

Affiliations: Institute of Artificial Intelligence, Beihang University; Huawei Noah's Ark Lab; University of Science and Technology Beijing

AI summary: Proposes the AlloSpatial framework, which uses the World2Mind cognitive mapping sandbox to convert egocentric observations into allocentric spatial priors and a Spatial Reasoning Harness for geometry-semantic arbitration, improving models' spatial reasoning by 5%-18% on VSI-Bench and MindCube.

Abstract

Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation. To address this, we propose AlloSpatial, an agentic framework for allocentric spatial cognition in foundation models. AlloSpatial introduces World2Mind, a plug-and-play cognitive mapping sandbox that converts egocentric observations into structured allocentric priors, including Allocentric-Spatial Trees and route maps that support querying object topology, geometric relations, passability, and trajectories. To utilize these priors reliably under noisy reconstruction and ambiguous visual evidence, AlloSpatial introduces a Spatial Reasoning Harness for tool-use judgment, modality-decoupled cue collection, and geometry-semantic arbitration. We further internalize this process in Qwen3-VL through cold-start reinforcement learning with a harness-gated trajectory-level reward. Experiments on VSI-Bench and MindCube show that AlloSpatial improves proprietary models by 5%-18% in a training-free setting, while ASTs alone support strong spatial reasoning even when visual inputs are removed. The trained AlloSpatial agents further outperform larger general-purpose models and competitive spatial baselines, suggesting that structured allocentric representations, active tool use, and verifiable reasoning offer a promising route toward spatially capable foundation models.

2606.09071 2026-06-09 cs.AI New submission

REFLECT: Intervention-Supported Error Attribution for Silent Failures in LLM Agent Traces

Xiaofeng Lin, Yingxu Wang, Tung Sum Thomas Kwok, Daniel Guo, Sahil Arun Nale, Charles Fleming, Guang Cheng

Affiliations: University of California, Berkeley

AI summary: Proposes REFLECT, which diagnoses a candidate error step, tests it through controlled replay with a diagnosis-specific patch, and uses the verified outcome as contrastive evidence to refine the attribution, achieving the highest localization accuracy on four benchmarks.

Abstract

Large language model (LLM) agents now solve complex tasks through long plan-and-execution traces, yet the ability to locate errors in a completed trace still lags far behind, especially in the silent-failure regime. Existing approaches predict suspect steps via classifiers or LLM judges, or recover correct answers via retry, but none feed the intervention outcome back to refine the attribution itself. We propose REFLECT, a method that closes this gap by diagnosing a candidate error step, testing it through controlled replay with a diagnosis-specific patch, and using the verified outcome flip as contrastive evidence to refine the final attribution. Across four localization benchmarks spanning multi-hop reasoning across domains, REFLECT achieves the highest localization accuracy among same-auditor methods on all four, with the largest gains on structured tool-use traces, while providing actionable localization even when ground-truth answers are unavailable.

2606.09198 2026-06-09 cs.AI New submission

MASS: Deep Research for Social Sciences with Memory-Augmented Social Simulation

Yongrui Liu, Deyi Xiong

Affiliations: The International Joint Institute of Tianjin University, Fuzhou, Tianjin University, China; TJUNLP Lab, School of Computer Science and Technology, Tianjin University, China

AI summary: Proposes the MASS paradigm, which makes social simulation more realistic through dynamic goal-path planning, a multi-disciplinary behavior dataset, and an Ebbinghaus-inspired forgetting mechanism, improving the insight and creativity of LLM-generated research, with a 6.81% gain in overall quality and a 17.19% gain in insight.

Abstract

Deep Research agents powered by Large Language Models (LLMs) have exhibited extraordinary potential in automated paper writing tasks. However, existing systems rely heavily on literature retrieval and synthesis through the internet and local knowledge bases, often producing social science research that lacks insight and creativity. To address this issue, we propose Memory-Augmented Social Simulation (MASS), an innovative paradigm that leverages highly realistic and research-oriented social simulations to enhance the creativity and empirical grounding of LLM-generated research. Specifically, MASS integrates three core components: dynamic goal-path planning with multi-level social norm restraint to guide the simulation, a multi-disciplinary behavior dataset for agent memory cold-start, and a structured forgetting mechanism inspired by the Ebbinghaus curve. Together, these ensure simulation authenticity and provide a robust empirical foundation for generating innovative scholarly papers. Experimental results demonstrate the effectiveness of our method, showing a 6.81% improvement in overall generation quality over foundation LLMs and a 17.19% gain in Insight over strong baselines.
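
A minimal sketch of forgetting driven by an Ebbinghaus-style curve, using the commonly cited exponential form R = exp(-t / S) with elapsed time t and memory strength S. The abstract does not give the formula MASS uses, so the functional form, the threshold, and the memory record fields (`t`, `strength`, `text`) are assumptions made for illustration.

```python
import math
from typing import Dict, List

Memory = Dict[str, object]


def retention(elapsed: float, strength: float) -> float:
    """Exponential forgetting curve: 1.0 at elapsed == 0, decaying toward 0."""
    return math.exp(-elapsed / strength)


def forget(memories: List[Memory], now: float, threshold: float = 0.3) -> List[Memory]:
    """Keep only memories whose retention is still at or above the threshold."""
    return [
        m for m in memories
        if retention(now - float(m["t"]), float(m["strength"])) >= threshold
    ]


memories = [
    {"text": "weak impression", "t": 0.0, "strength": 1.0},
    {"text": "reinforced norm", "t": 0.0, "strength": 10.0},
]
kept = forget(memories, now=5.0)  # the weak memory decays below the threshold
```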

2606.09311 2026-06-09 cs.AI New submission

FF-JEPA: Long-Horizon Planning in World Models with Latent Planners

Sergi Masip, Jonathan Swinnen, Yutong Hu, Renaud Detry, Tinne Tuytelaars

Affiliations: KU Leuven

AI summary: Proposes FF-JEPA, a hierarchical method that adds an action-free latent planner to predict subgoals and decomposes complex trajectories into short-horizon optimization problems, addressing the high computational cost of long-horizon planning and the need for goal images.

Abstract

Joint Embedding Predictive Architectures (JEPAs) have shown promising world modeling capabilities, enabling planning in latent space by optimizing action trajectories using methods like the Cross-Entropy Method (CEM). These methods are, however, too computationally expensive and ineffective for long-horizon planning. Furthermore, these methods typically require an explicit image of the goal state, which is not always possible in real-world tasks. In this work, we tackle these limitations by proposing Forward-Forward-JEPA (FF-JEPA), a hierarchical approach leveraging two forward dynamics models. Alongside a standard action-conditioned forward model, we introduce an action-free latent planner that predicts the next subgoal given the current state. This approach removes the need for goal images and enables long-horizon planning by decomposing complex trajectories into a sequence of tractable, short-term optimization problems. Preliminary results on PushT demonstrate that FF-JEPA successfully overcomes flat world models' long-horizon collapse, highlighting this approach as a promising direction for goal-free planning.
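
For readers unfamiliar with the Cross-Entropy Method mentioned above, here is a minimal sketch of flat CEM planning over an action sequence. It illustrates the baseline planner only, not FF-JEPA's hierarchical subgoal planner, and `toy_cost` is a hypothetical stand-in for rolling out a learned latent world model.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List


def cem_plan(
    cost: Callable[[List[float]], float],
    horizon: int,
    iters: int = 30,
    pop: int = 64,
    n_elite: int = 8,
    seed: int = 0,
) -> List[float]:
    """Return the mean action sequence found by the Cross-Entropy Method."""
    rng = random.Random(seed)
    mu = [0.0] * horizon     # per-step mean of the sampling distribution
    sigma = [1.0] * horizon  # per-step standard deviation
    for _ in range(iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = [
            [rng.gauss(mu[t], sigma[t]) for t in range(horizon)] for _ in range(pop)
        ]
        # Keep the lowest-cost sequences and refit the Gaussian to them.
        elites = sorted(samples, key=cost)[:n_elite]
        mu = [mean(e[t] for e in elites) for t in range(horizon)]
        sigma = [pstdev([e[t] for e in elites]) + 1e-3 for t in range(horizon)]
    return mu


def toy_cost(actions: List[float]) -> float:
    """Toy dynamics: the final state is the sum of actions and the goal state is 3.0."""
    final_state = sum(actions)
    return (final_state - 3.0) ** 2 + 0.01 * sum(a * a for a in actions)


plan = cem_plan(toy_cost, horizon=4)
```

Each iteration evaluates `pop` full-length rollouts, so the cost grows with the horizon, which is the expense that motivates splitting a long trajectory into short subgoal-to-subgoal problems.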

2606.09371 2026-06-09 cs.AI New submission

Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs

Haotong Yang, Ting Long, Yi Chang

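Affiliations: Jilin University

AI summary: Proposes CAHL, which uses RLVR to jointly optimize a high-level planner and a low-level executor, addressing planner-executor alignment in hierarchical tool learning, and validates it on multiple benchmarks.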

Comments 14 pages, 5 figures, 6 tables. Preprint

2606.09399 2026-06-09 cs.AI New submission

RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

Radeen Mostafa, Sawradip Saha

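Affiliations: RunAgent AI

AI summary: Proposes SuperBrowser, an autonomous web-navigation agent that mimics the perception-cognition-action triad of human browsing and reaches 89.47% success on the Mind2Web Hard benchmark, ahead of existing open/research agents.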

Comments 31 pages, 8 figures, preprint/work in progress

Abstract

We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way a person browses. A human reading a page does not retain every pixel they have seen; they look at a few candidate targets, decide on one, and remember only what is needed to keep the goal alive. We operationalize this perception-cognition-action triad as three coupled mechanisms. First, a vision-first bounding-box pipeline labels candidate interactive regions on every screenshot and feeds them, asynchronously prefetched, to the language model so that the "eye" precedes the "hand". Second, a three-role brain -- an Orchestrator that classifies and routes, a Planner that evaluates progress every few steps, and a Worker that emits per-step actions -- separates strategic from operational reasoning. Third, a structured Ledger stores only what a person would: the goal, the last three actions, a small set of facts and dead-ends, and a handful of checkpoints; a six-phase eviction loop systematically discards stale screenshots, state blobs, and reasoning traces from the live context. Action execution is a three-tier click cascade (Chrome DevTools Protocol to Puppeteer to scripted) with humanized Bezier motion, plus a chevron-aware bounding-box snapper that resolves the "small arrow beside a large label" ambiguity. On the Mind2Web Hard benchmark (66 tasks), SUPERBROWSER attains 89.47% success, placing third overall and ahead of every published open/research browser-agent baseline by a large margin. We argue that the gain comes not from any single trick but from the consistent application of a cognitive contract throughout the system.
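
A minimal sketch of a bounded ledger in the spirit of the one described above: it keeps the goal, only the last three actions, and short lists of facts, dead-ends, and checkpoints. The `Ledger` class and its methods are hypothetical, and the six-phase eviction loop is not modeled; only the "last three actions" rule is shown, via a fixed-size deque.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class Ledger:
    """Compact agent memory (hypothetical structure, not SuperBrowser's implementation)."""
    goal: str
    # deque(maxlen=3) silently drops the oldest action when a fourth is appended.
    recent_actions: Deque[str] = field(default_factory=lambda: deque(maxlen=3))
    facts: List[str] = field(default_factory=list)
    dead_ends: List[str] = field(default_factory=list)
    checkpoints: List[str] = field(default_factory=list)

    def record(self, action: str) -> None:
        """Append an action, evicting the oldest once three are stored."""
        self.recent_actions.append(action)

    def context(self) -> str:
        """Render the small text block that would be passed to the language model."""
        return "\n".join([
            f"goal: {self.goal}",
            f"recent: {', '.join(self.recent_actions)}",
            f"facts: {'; '.join(self.facts)}",
            f"dead ends: {'; '.join(self.dead_ends)}",
        ])


ledger = Ledger(goal="find the refund policy")
for action in ["open home", "click Help", "search 'refund'", "open first result"]:
    ledger.record(action)
ledger.dead_ends.append("FAQ page has no refund section")
```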

2606.09447 2026-06-09 cs.AI New submission

AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

Bojie Rong, Zheyu Shen, Qiaoping Wang, Pengfei Kang, Yang Xu, Yawen Wei, Hanyu Wu, Zhi Zhao, Leihao Pei, Linquan Jiang

Affiliations: Alibaba Cloud China

AI summary: Proposes the AliyunConsoleAgent framework, which applies supervised fine-tuning on distilled frontier-model trajectories followed by reinforcement learning in real cloud environments with GRPO and a dual-channel outcome reward model, automating documentation verification at a success rate close to frontier proprietary models and at low cost.

Abstract

We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud platforms encompass hundreds of products with rapid feature iteration, causing console UIs to frequently diverge from their corresponding documentation. Verifying that documented procedures accurately reflect the current console and can be executed end-to-end demands an estimated 4 million recurring inspections annually, yet manual coverage remains below 1%. While agent systems built on frontier proprietary models achieve high success rates, their prohibitive cost and data privacy constraints preclude large-scale deployment. We propose a two-stage training paradigm: supervised fine-tuning (SFT) on distilled frontier-model trajectories, followed by reinforcement learning using Group Relative Policy Optimization (GRPO) and a dual-channel outcome reward model in real cloud environments. To support large-scale RL training, we construct a high-determinism rollout system featuring Terraform-based resource pre-provisioning and LLM-driven on-demand provisioning, which effectively isolates environment noise from the training signal. We further introduce a rule-based reward evaluation protocol grounded in backend audit logs, providing objective, reward-hacking-resistant outcome judgment. Our model evolves from mechanical instruction following to autonomous decision-making with cloud console and product-specific understanding. Experiments on a challenging 278-task benchmark where the best frontier model achieves only 65.34% demonstrate that AliyunConsoleAgent-32B achieves a 63.52% mean success rate -- a 20.24 percentage-point improvement over the base model, narrowing the gap to the best frontier proprietary model to 1.82 pp (bootstrap 95% CI [-1.27, 7.39]) -- at 92% lower inference cost.
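
A minimal sketch of the group-relative advantage at the core of GRPO as it is usually formulated: rewards for a group of rollouts of the same task are standardized against the group mean and standard deviation. The dual-channel reward model and the audit-log rules from the abstract are not modeled; the 0/1 rewards and the function name are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize each rollout's reward against its own group."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sd + eps) for r in rewards]


# Four rollouts of one documentation-verification task, scored 1.0 on success.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
flat = group_relative_advantages([1.0, 1.0, 1.0])  # no contrast, so no learning signal
```

When every rollout in a group receives the same reward, all advantages are zero, which is one reason noisy environments hurt: spurious failures change the group statistics rather than reflecting the policy.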

2606.09730 2026-06-09 cs.AI New submission

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Pu Ning, Quan Chen, Kun Tao, Xinyu Tang, Tianshu Wang, Qianggang Cao, Xinyu Kong, Zujie Wen, Zhiqiang Zhang, Jun Zhou

Affiliations: Tsinghua University; Peking University; Ant Group; Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: Proposes the SearchSwarm framework, which uses supervised fine-tuning to internalize task decomposition and delegation decisions into model weights and achieves the best results among models of comparable scale on BrowseComp and BrowseComp-ZH.

Abstract

Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent's context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.

2602.14033 2026-06-09 cs.IT cs.AI math.IT Cross-list

BRAIN: Bayesian Reasoning via Active Inference for Agentic and Embodied Intelligence in Mobile Networks

Osman Tugay Basaran, Martin Maier, Falko Dressler

发表机构 * School of Electrical Engineering and Computer Science, TU Berlin（技术大学柏林电气工程与计算机科学学院）； Optical Zeitgeist Laboratory, INRS（光感知实验室，INRS）； Federal Ministry of Research, Technology and Space (BMFTR, Germany)（德国联邦研究、科技与航天部）

AI总结提出基于主动推理的贝叶斯推理智能体（BRAIN），利用深度生成模型和变分自由能最小化统一感知与行动，在动态无线资源分配中实现鲁棒因果推理、自适应性和实时可解释性。

English abstract

Future sixth-generation (6G) mobile networks will demand artificial intelligence (AI) agents that are not only autonomous and efficient, but also capable of real-time adaptation in dynamic environments and transparent in their decision-making. However, prevailing agentic AI approaches in networking exhibit significant shortcomings in this regard. Conventional deep reinforcement learning (DRL)-based agents lack explainability and often suffer from brittle adaptation, including catastrophic forgetting of past knowledge under non-stationary conditions. In this paper, we propose an alternative solution for these challenges: the Bayesian reasoning via Active Inference (BRAIN) agent. BRAIN harnesses a deep generative model of the network environment and minimizes variational free energy to unify perception and action in a single closed-loop paradigm. We implement BRAIN as an O-RAN eXtended application (xApp) on a GPU-accelerated testbed and demonstrate its advantages over standard DRL baselines. In our experiments, BRAIN exhibits (i) robust causal reasoning for dynamic radio resource allocation, maintaining slice-specific quality of service (QoS) targets (throughput, latency, reliability) under varying traffic loads, (ii) superior adaptability with up to 28.3% higher robustness to sudden traffic shifts versus benchmarks (achieved without any retraining), and (iii) real-time interpretability of its decisions through human-interpretable belief state diagnostics.
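For readers unfamiliar with the objective named in this abstract, the standard textbook form of variational free energy can be written as follows. The notation is an assumption made for illustration and is not taken from the paper: $o$ stands for observations, $s$ for hidden network states, $p(o, s)$ for the generative model, and $q(s)$ for the agent's approximate belief.

```latex
F[q] = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
     = D_{\mathrm{KL}}\big(q(s) \,\|\, p(s \mid o)\big) - \ln p(o)
```

Since the KL term is non-negative, lowering $F$ pulls $q(s)$ toward the posterior $p(s \mid o)$ while bounding $-\ln p(o)$ from above.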

2606.07538 2026-06-09 cs.IR cs.AI cross-list

Bidirectional Semantic Complementary Tool Retrieval for Remote Sensing Agents

Zeyuan Wang, Dongyang Hou, Cheng Yang, Xuezhi Cui, Linrui Xu, Bo Yu, Gaozhi Zhou, Ziyu Li, Liangtian Liu, Kai Ouyang, Wang Guo, Lili Zhu, Chao Tao

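Affiliations: School of Geosciences and Info-Physics, Central South University; School of Mechanical and Electrical Engineering, Central South University; Hunan Key Laboratory of Land Resources Evaluation and Utilization, Hunan Provincial Institute of Land and Resources Planning

AI summary: To address the semantic asymmetry between queries and documentation in tool retrieval for remote sensing agents, proposes a bidirectional semantic complementary method: a planning-based query enhancement mechanism supplements functional semantics, and a dynamic tool dependency graph injects contextual semantics, significantly improving tool retrieval accuracy on complex remote sensing tasks.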

English abstract

Large language model (LLM)-based agents provide a novel paradigm for the automated processing of remote sensing (RS) data. Their success in complex RS tasks relies on extensive specialized tool libraries. However, tool documentation often exceeds the context window limits of LLMs, making precise tool retrieval essential for agentic workflows. Existing tool retrieval methods face a "semantic asymmetry" bottleneck: natural language queries typically express macro-level intentions lacking tool-specific semantics, while tool documentation provides fine-grained technical descriptions lacking operational context for workflows. To bridge this semantic gap, this paper proposes a bidirectional semantic complementary tool retrieval method. First, on the query side, we introduce a planning-based query enhancement mechanism that leverages the reasoning capabilities of agents to decompose abstract intentions into logical subtasks, thereby actively supplementing the query with missing functional semantics. Second, on the tool side, addressing the strong coupling characteristics of RS tool chains, we construct a dynamic tool dependency graph with continual learning capabilities. By employing a neighborhood information aggregation mechanism, contextual information from precursor tools is explicitly injected into the current node representation, enriching tool descriptions with contextual semantics. Experimental results on the RS dataset GeoPlan-bench and the general-purpose dataset API-Bank demonstrate that the proposed method not only significantly improves tool retrieval accuracy for complex RS tasks but also exhibits robust extensibility for transfer to general-domain tasks. The source code and dataset are available at https://github.com/geox-lab/BSCTR.

2606.07583 2026-06-09 cs.LG cs.AI cross-list

Outage Detection in Self-Healing Smart Grids Using Reinforcement Learning with Spectral Graph Neural Networks

Lihui Liu, Mucun Sun, Caisheng Wang

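Affiliations: Wayne State University; University of Texas at Dallas

AI summary: Proposes a spectral graph reinforcement learning framework that uses a spectral graph neural network to learn optimal restoration policies, achieving real-time, near-optimal outage management in distribution networks, with generalization validated on three IEEE test systems.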

English abstract

Self-healing smart grids can quickly adjust their network configuration during outages to minimize power disruptions. During an outage, several actions can be taken, such as network reconfiguration through switching operations and emergency load shedding. However, traditional machine learning methods for outage mitigation are not well suited for smart grids due to their slow response time and high computational cost. To address these challenges, recent studies have explored reinforcement learning to automatically perform network reconfiguration. In these approaches, the control policy is typically modeled using a graph neural network (GNN). However, conventional GNNs operate in the spatial domain and may fail to capture important relationships in the frequency domain. Frequency-domain information is particularly useful for modeling global structural patterns and system-wide interactions in power networks. In this paper, we propose a spectral graph reinforcement learning framework for outage management in distribution networks to enhance system resilience. Our model learns the optimal power restoration policy using a spectral graph neural network. We evaluate the proposed method on three modified IEEE test systems: the 13-bus, 34-bus, and 123-bus networks. Experimental results show that our approach achieves near-optimal performance in real time and generalizes well across a wide range of outage scenarios.

2606.07602 2026-06-09 cs.LG cs.AI cross-list

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

Yuhuan Yuan, Zhouliang Yu, Minghao Liu, Weiyang Liu, Ge Lin Kan

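Affiliations: HKUST(GZ) (The Hong Kong University of Science and Technology (Guangzhou)); CUHK (The Chinese University of Hong Kong); ZODA

AI summary: To address LLM-generated LEGO assemblies that are physically valid but geometrically and semantically misaligned, proposes a model-based data selection method and PVPO, a sample-efficient reinforcement learning method with voxel-space geometric rewards, improving structural and semantic alignment and physical validity.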

Comments Technical Report V1, 15 pages, 6 figures, 3 tables

English abstract

LLM-based LEGO assembly generation requires both semantic grounding and physical feasibility. We identify a data-induced failure mode, PhysHack, in which the assemblies satisfy physical-validity constraints while producing structures that are geometrically misaligned, semantically inconsistent, or poorly calibrated. To address this challenge, we propose a model-based data selection approach that uses only a small fraction of the training data while improving physically grounded LEGO assembly generation. Building on the selected trajectories, we introduce PVPO, a sample-efficient reinforcement learning method that couples physical feasibility with voxel-space geometric rewards. Our results show that physical validity alone is an insufficient proxy for reliable physical reasoning: models can learn to generate valid structures without preserving semantic or geometric fidelity. Experiments across model backbones and test-time scaling settings demonstrate that PVPO improves structural and semantic alignment, physical validity, structural stability, and calibration, while reducing reliance on extensive post-hoc rejection sampling. In particular, results on calibration show that PVPO mitigates PhysHack by making test-time selection more predictive of semantic and structural quality.

2606.07603 2026-06-09 cs.LG cs.AI cross-list

Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimensional Approach

Faisal Fareed

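Affiliations: AWS (Amazon Web Services)

AI summary: Proposes a five-dimensional speculative execution method that uses Bayesian probability estimation and dollar-based cost pricing to balance latency and cost in LLM-agent workflows, while ensuring that rollback leaves no side effects.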

English abstract

LLM-agent workflows chain model calls and tool invocations, and spend most of their wall-clock time waiting on upstream operations before downstream ones can start. Speculative execution can reclaim that idle time by launching a downstream operation with a predicted upstream input, but here each speculation costs real money (per-token billing) and its success probability is hard to estimate and drifts over time. This paper presents a method organized around five design decisions: (D1) start a downstream operation before its upstream completes; (D2) price each speculation in real dollars at separate input and output rates; (D3) expose a single operator dial for latency versus cost; (D4) decide via an expected-value rule with a failure-weighted cost term and a preference-adjusted threshold; and (D5) estimate the success probability with a Bayesian Beta-Binomial posterior whose prior is keyed to a dependency-type taxonomy. Variants of these ideas appear in recent work; the combination, with every decision logged in dollars, is what is new. The rule fires only on edges passing an admissibility precondition (side-effect-free, idempotent, or stageable behind a commit barrier), since a wrong speculation is rolled back by re-execution, which refunds tokens but cannot un-send an irreversible side effect. We specify the runtime mechanics, a closed-form result that the rule self-limits as the upstream branching factor grows, a five-stage calibration pipeline (offline replay, shadow, canary, online calibration, drift-triggered kill-switch), and a workload-fit rubric over eight production archetypes. Contrast tables against the four closest published systems (DSP, Speculative Actions v2, Sherlock, B-PASTE) show differentiators on every dimension, and a synthetic validation suite confirms the predicted decision boundary, probability threshold, posterior recovery, and streaming-cancellation behavior.
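Decisions D2, D4, and D5 can be illustrated with a minimal Python sketch. It is not the paper's rule: the class `BetaPosterior`, the functions `speculation_cost_usd` and `should_speculate`, the parameter `usd_per_second_saved` (standing in for the operator dial), and the exact form of the expected-value comparison are all hypothetical choices made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Beta posterior over the speculation success probability.

    alpha and beta start as prior pseudo-counts; the abstract says the prior
    depends on the dependency type, so the defaults here are placeholders.
    """
    alpha: float = 1.0
    beta: float = 1.0

    def update(self, success: bool) -> None:
        # Conjugate Beta-Binomial update: one observed outcome adds one count.
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


def speculation_cost_usd(input_tokens, output_tokens,
                         usd_per_input_token, usd_per_output_token):
    """Price one speculation in dollars at separate input and output rates."""
    return (input_tokens * usd_per_input_token
            + output_tokens * usd_per_output_token)


def should_speculate(p_success, latency_saved_s, usd_per_second_saved,
                     cost_usd, threshold_usd=0.0):
    """Assumed expected-value rule: speculate when the expected value of the
    latency saved exceeds the failure-weighted cost by more than a threshold."""
    expected_benefit = p_success * latency_saved_s * usd_per_second_saved
    expected_waste = (1.0 - p_success) * cost_usd
    return expected_benefit - expected_waste > threshold_usd
```

In this sketch, raising `usd_per_second_saved` makes the rule more willing to spend tokens for latency, and raising `threshold_usd` makes it more conservative.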

2606.07889 2026-06-09 cs.LG cs.AI cs.CL cross-list

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

Marut Pandya, Kasey Zhang, Baiqing Lyu

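Affiliations: GitHub

AI summary: Proposes the "strained coherence" pattern, in which a coding agent recognizes a problem but still acts on its original plan; a detector built on Claude Sonnet 4.6 reaches 94% failure-prediction precision on 44 trajectories, outperforming the baseline.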

English abstract

LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change its behavior, states that information, and still acts against it. The pattern overlaps with verbalized reward hacking, where an agent names a tension between a task proxy and the underlying goal yet optimizes the proxy anyway. We give an operational definition, build a Claude Sonnet 4.6 judge that reads full trajectories and flags spans where the pattern occurs, and evaluate it on 44 Terminal-bench-2 trajectories using a Qwen3.5-35B-A3B backbone. Flagged trajectories fail 94% of the time versus 46% for unflagged trajectories (47-point gap, Fisher's exact p = 0.003; 46 points after excluding three prompt-embedded examples, p = 0.006). At matched selectivity, the detector reaches 94% precision versus 88% for a lexical discourse-marker baseline; the 10-trajectory intersection of the two methods has a 100% failure rate (Clopper-Pearson 95% CI [69%, 100%]). We replicate on Gemma4-31B with 43 trajectories: the overall signal is directionally consistent but not significant (20-point gap, p = 0.31), with attenuation driven largely by 13 trajectories with zero think content, where the detector has no substrate to analyze. In the high-verbosity Gemma tertile, the gap is +30 points; in the mid- and high-verbosity Qwen tertiles, it is +40 points each. The first flag appears at a median of 83-84% of elapsed trajectory time across both models, and the binary flag survives paraphrases that soften explicit conflict markers (8/8 trajectories). Unlike univariate predictors, the detector emits interpretable span-level output -- quoted acknowledgment, quoted action, and typed conflict -- showing what the agent saw and ignored.
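The reported Clopper-Pearson interval of [69%, 100%] for 10 failures out of 10 trajectories can be reproduced by hand. The Python sketch below covers only the special case where every trial is a hit, in which the exact lower bound solves p^n = alpha/2; the function name is a hypothetical helper, not code from the paper.

```python
def clopper_pearson_all_hits(n: int, confidence: float = 0.95):
    """Exact two-sided Clopper-Pearson interval when all n trials are hits.

    With x == n, the lower bound p_low satisfies p_low ** n == alpha / 2,
    and the upper bound is 1.
    """
    alpha = 1.0 - confidence
    lower = (alpha / 2.0) ** (1.0 / n)
    return lower, 1.0


lower, upper = clopper_pearson_all_hits(10)
print(f"95% CI: [{lower:.1%}, {upper:.0%}]")
```

For n = 10 the lower bound is 0.025 ** 0.1, roughly 0.69, which matches the 69% in the abstract and shows how wide the interval remains with only ten trajectories.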

2606.08275 2026-06-09 cs.LG cs.AI cross-list

Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures

Jaineet Shah

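Affiliations: Carnegie Mellon University

AI summary: Proposes Causal Agent Replay (CAR), which uses structural causal models and intervention operations to attribute LLM-agent failures to individual steps counterfactually, addressing the inability of existing methods to locate the step that made the decision.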

Comments Open-source: https://github.com/jaineet17/causal-agent-replay

English abstract

When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the-art step-level accuracy on the Who&When benchmark is about 14%). We present Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do-operation to a step, and re-executes the trajectory forward under the same stochastic policy, measuring the shift in the outcome distribution. We define an intervention algebra over agent steps, a single-step contrastive estimator whose point-of-commitment rule resolves a confound specific to stochastic run-forward, and a budget-bounded Monte-Carlo Shapley estimator that splits credit across interacting steps. Every effect is reported with confidence intervals. We validate against synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two-step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic 0.91). CAR is open source and runs on hosted or free local models.
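The idea of a budget-bounded Monte-Carlo Shapley estimator can be sketched with generic permutation sampling. This is the textbook estimator, not CAR's implementation: `monte_carlo_shapley` and the toy value function `toy_outcome_shift` (a pure two-step interaction with an irrelevant third step) are hypothetical, and `n_permutations` plays the role of the budget.

```python
import random


def monte_carlo_shapley(value, n_players, n_permutations, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled player orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n_players
    players = list(range(n_players))
    for _ in range(n_permutations):
        rng.shuffle(players)
        coalition = set()
        previous = value(frozenset(coalition))
        for player in players:
            coalition.add(player)
            current = value(frozenset(coalition))
            phi[player] += current - previous  # marginal contribution
            previous = current
    return [total / n_permutations for total in phi]


def toy_outcome_shift(intervened_steps):
    """Toy game: the outcome shifts only if steps 0 and 1 are both intervened on."""
    return 1.0 if {0, 1} <= intervened_steps else 0.0


estimates = monte_carlo_shapley(toy_outcome_shift, n_players=3, n_permutations=2000)
print(estimates)
```

In this toy game the exact Shapley values are 0.5, 0.5, and 0. Each sampled permutation distributes exactly the full effect, so the estimates sum to the total effect up to floating-point error, which is the efficiency property the abstract checks against its analytic value.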

2606.08360 2026-06-09 cs.LG cs.AI cross-list

Generative Frontier Planning for Adaptive Peer-Referral Recruitment under Covariate-Dependent Arrivals

Lingkai Kong, Hezi Jiang, Andrew Ma, Keyu Wang, Akseli Kangaslahti, Milind Tambe

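Affiliations: Harvard University

AI summary: For the realistic setting of covariate-dependent arrivals in peer-referral recruitment, proposes Generative Frontier Planning (GFP), which plans efficiently through deterministic backups and marginal greedy allocation and outperforms baselines in simulation experiments.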

English abstract

Peer-referral recruitment systems such as respondent-driven sampling are critical for studying and intervening on hidden populations affected by infectious diseases. To accelerate recruitment, public health agencies must adaptively allocate limited referral resources across multiple rounds, where current decisions shape both the number and the covariates of future recruits. Prior work makes this problem tractable by assuming that referrals are drawn i.i.d. from a homogeneous population, an assumption that ignores the homophily and shared context that drive real peer recruitment. We instead consider a more realistic model in which both referral capacity and the covariates of newly referred individuals are conditioned on the referrer, learned from data with a censored count model and a conditional generative model. The resulting planning problem is challenging because each candidate allocation induces a different distribution over future recruits. We propose Generative Frontier Planning (GFP), a model-based planner that replaces per-step Monte-Carlo sampling with a deterministic backup over a latent covariate-coverage value surrogate. The surrogate is designed so that the expected value of the next frontier depends on the offspring generative model only through finite-dimensional summaries that are amortized offline, and so that the resulting per-round objective is monotone with diminishing returns. Together, these two properties make planning tractable: the deterministic backup eliminates Monte-Carlo sampling, and the diminishing-returns structure lets a marginal greedy allocation achieve a $(1-1/e)$-approximation for the per-round problem. On a simulation environment calibrated to a real respondent-driven sampling dataset, GFP outperforms random, reinforcement-learning, and i.i.d. dynamic-programming baselines across four discount factors.
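The link between diminishing returns and greedy allocation can be shown on a toy coverage objective. The sketch below is not GFP's surrogate: it assumes, purely for illustration, that each candidate referrer covers a fixed set of covariate groups and that the per-round value is the number of distinct groups covered, a classic monotone objective with diminishing returns. The function name `greedy_allocate` is hypothetical.

```python
def greedy_allocate(coverage_sets, budget):
    """Pick up to `budget` candidates, each time taking the one with the
    largest marginal gain in newly covered groups."""
    remaining = dict(coverage_sets)
    chosen, covered = [], set()
    for _ in range(budget):
        best_name, best_gain = None, 0
        for name, groups in remaining.items():
            gain = len(groups - covered)  # marginal gain given current coverage
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:
            break  # no candidate adds anything new
        chosen.append(best_name)
        covered |= remaining.pop(best_name)
    return chosen, covered


candidates = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
print(greedy_allocate(candidates, budget=2))
```

For monotone objectives with diminishing returns under a cardinality budget, this greedy rule is the standard procedure behind the $(1-1/e)$ guarantee cited in the abstract.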

2606.08410 2026-06-09 cs.LG cs.AI cross-list

Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries

Linfeng Cao, Ming Shi, Ness B. Shroff

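Affiliations: The Ohio State University; University at Buffalo

AI summary: Proposes the MO-PQUCB algorithm, which obtains user preference signals through proactive queries and combines a Plackett-Luce model with regularized UCB to resolve the coupling of preferences and rewards in multi-objective bandits, achieving improved regret bounds.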

Comments UAI 2026

English abstract

Personalized decision-making in multi-objective bandits requires learning user-specific trade-offs among competing objectives. Since arm utility depends on both unknown rewards and unknown preferences, existing methods infer preferences only from utility feedback, entangling preference learning with reward exploration. In practice, however, users often reveal their priorities through proactive conversational queries (e.g., "cheap and clean hotel"), yet this structured signal is not leveraged. We formalize a proactive query-based framework in which user queries provide structured preference signals. Modeling these signals via a Plackett-Luce subset choice model, we show that query-only learning is insufficient due to a fundamental shift-invariance barrier. To resolve this, we introduce MO-PQUCB, a hybrid algorithm that integrates query-based preference anchoring with bandit feedback through shift-invariant regularization and dual-exploration UCB. We prove that proactive queries accelerate preference estimation and yield improved regret scaling over prior preference-aware MO-MAB methods. Under corrupted queries, we further characterize statistical limits and design a robust estimator achieving near-optimal performance when the corruption is sparse. Experiments validate both theoretical and practical gains.
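One reading of the shift-invariance barrier is that Plackett-Luce choice probabilities depend only on differences between preference scores, so queries alone cannot pin down their absolute level. The Python sketch below illustrates that property for single-item choice, the building block of the Plackett-Luce model; the function name and the example scores are hypothetical, and the paper's subset choice model may differ in detail.

```python
import math


def plackett_luce_choice_probs(scores):
    """Probability that each objective is picked first under a
    Plackett-Luce model with the given preference scores."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


scores = [1.0, 2.0, 0.5]
shifted = [s + 7.0 for s in scores]  # add the same constant to every score
print(plackett_luce_choice_probs(scores))
print(plackett_luce_choice_probs(shifted))
```

Both calls yield the same probabilities, so any estimator that sees only such choices recovers the scores up to an additive constant, which is consistent with the abstract's point that bandit feedback is needed to anchor them.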

2606.08500 2026-06-09 cs.SE cs.AI cross-list

Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey

Zhengyi Zhuo, Yan Liu

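Affiliations: School of Computer Science and Technology, Tongji University

AI summary: Lets SWE agents explore real codebases through a bounded tool interface and proposes the Ada framework, which uses observation lenses to analyze the agent's navigation, evidence selection, synthesis, grounding, and stopping behavior, turning trajectory data into comparable behavioral profiles.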

English abstract

Software engineering agents (SWE agents) increasingly work through tool-mediated trajectories in real repositories, yet their behavior remains difficult to characterize in concrete, observable terms. These trajectories record tool use, intermediate reasoning, evidence selection, and self-directed stopping, but they do not by themselves explain why particular moves were chosen, what evidence was trusted, or when understanding was judged sufficient. This tension makes trajectory data both limited and valuable: faithful, replayable traces can become an empirical substrate for studying agent behavior when interpreted through disciplined observation. We introduce Ada, a scoped apparatus for repository-level code understanding. Ada enters real codebases through a bounded tool interface, allowing open-ended exploration to remain recordable as finite trajectories. Across this wild-but-bounded setting, Ada chooses where to look, what to read closely, when to consolidate partial understanding, and when to close its account of the repository. We project Ada's think-action chains through observation lenses that make navigation, evidence selection, synthesis, grounding, and stopping visible without reducing behavior to raw tool counts or speculating about hidden intent. Read together, these lenses produce behavioral profiles grounded in recorded movement through software worlds. Across 408 trajectories, spanning multiple models, repositories, task families, and launch conditions, the study shows how faithful digital traces can be transformed into disciplined, comparable projections of emerging SWE-agent mindset. The results expose differences in efficiency, trajectory diversity, epistemic grounding, and the limits of intervention, while providing a methodological foundation for observing SWE agent behavior in real codebases.

2606.08696 2026-06-09 cs.LG cs.AI cross-list

Agentic Search for Counterfactual Recourse under Fixed LLM Budgets

Yasuo Tabei

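AI summary: Proposes the Comp-MCTS framework, which uses tree search under a fixed LLM-call budget to maximize the number of unique, oracle-validated counterfactuals while balancing quantity and quality.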

English abstract

Counterfactual recourse aims to provide actionable feature changes that would alter an unfavorable decision made by a predictive model. In practice, affected individuals often benefit from multiple feasible alternatives rather than a single optimal explanation. A natural way to produce such alternatives is to prompt large language models (LLMs). However, prompting incurs a practical constraint: the number of LLM calls is often the dominant computational and economic cost. Together, the need for multiple alternatives and this cost constraint shift the problem from finding a single high-quality counterfactual to efficiently generating a set of oracle-validated counterfactuals under a fixed LLM-call budget. In this work, we study counterfactual recourse generation in the LLM-agentic setting as a fixed-budget search problem and propose Comp-MCTS, an agentic tree-search framework that maximizes the yield of unique, oracle-validated counterfactuals under this budget while maintaining favorable quantity-quality trade-offs. Comp-MCTS allocates the budget toward novel intervention directions via LLM-based proposal generation, oracle validation, and compression-guided pruning, in a training-free, oracle-only setting. Experiments on four real-world tabular datasets show that Comp-MCTS substantially outperforms single-candidate LATS-style baselines in the yield of unique, oracle-validated counterfactuals, and offers favorable quantity-quality-efficiency trade-offs against stronger multi-candidate variants: comparable or higher yield at similar or lower oracle-evaluation cost on three of four datasets, plus competitive proximity, sparsity, and novelty.

2606.09027 2026-06-09 cs.CL cs.AI cross-list

SafeRun: Enabling Determinism in LLM Planning for Running

Meilin Chen, Zepeng Zhai, Jiaxuan Zhao, Yuan Lu

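Affiliations: University of California, Berkeley; Stanford University

AI summary: To address safety violations caused by the probabilistic nature of LLMs in running planning, proposes the SafeRun framework, whose decoupled architecture separates the LLM's soft interpretation from a deterministic solver's hard constraints, achieving a 100% safety score.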

Comments Workshop on Planning in the Era of LLMs (LM4Plan) at ICML 2026

English abstract

Large Language Models enable flexible natural-language planning but remain unreliable in determinism-critical domains due to their probabilistic nature. This limitation is especially problematic in running planning, where violating safety rules can lead to safety risks. We propose SafeRun, a framework for deterministic LLM-based planning via a decoupled architecture. SafeRun separates soft interpretation by an LLM from hard constraint enforcement by a deterministic solver, ensuring strict safety constraints while preserving natural-language flexibility. To validate SafeRun, we build a comprehensive benchmark for running planning under realistic physiological and safety constraints. Experiments across five LLMs show that SafeRun achieves a 100% safety score (vs. 79.1% PE average and 97.6% CodeAct average) while maintaining competitive instruction-following scores. The SafeRun benchmark is publicly available on Hugging Face at https://huggingface.co/datasets/zzp-seeker/SafeRun-RunPlanning-Benchmark.
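The decoupling described here can be pictured with a small Python sketch in which a deterministic pass enforces one hard rule on a plan that an LLM might propose. Everything specific is an assumption for illustration: the function `enforce_weekly_increase`, the rule that weekly distance may grow by at most 10%, and the sample plan are hypothetical and are not SafeRun's solver or constraints.

```python
def enforce_weekly_increase(proposed_km, max_increase=0.10):
    """Deterministically cap each week's distance at (1 + max_increase)
    times the previous safe week, regardless of what was proposed."""
    safe_plan = []
    for km in proposed_km:
        if safe_plan:
            km = min(km, safe_plan[-1] * (1.0 + max_increase))
        safe_plan.append(km)
    return safe_plan


# Hypothetical plan produced by the LLM's soft interpretation of a request.
llm_proposal = [20.0, 30.0, 25.0]
print(enforce_weekly_increase(llm_proposal))
```

Because the cap is applied by ordinary code after the model has spoken, the same proposal always yields the same safe plan, which is the kind of determinism the abstract attributes to the solver side.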

2606.09483 2026-06-09 cs.CL cs.AI cross-list

Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

Tianxiang Fei, Mingyang Song, Mao Zheng, Xiang Yu

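Affiliations: Tencent

AI summary: Proposes the DCPM system, which organizes agent memory into a cognitive capability hierarchy based on dual-process theory; a synchronous daytime writer and an asynchronous nighttime engine handle belief revision and schema induction respectively, with notable gains on implicit cross-session inference tasks.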

English abstract

Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface recall, and consequently struggle on implicit personalisation that requires reasoning over how a user has evolved. We propose DCPM, which reorganises agent memory along a cognitive capability hierarchy ascending from raw inputs and atomic facts, through diachronic belief trajectories and identity, to domain schemas, latent intentions and cross-domain patterns. The hierarchy is driven by two processes inheriting the architectural split of dual-process theory: a synchronous daytime writer (System1) that records belief revisions as doubly linked supersedes chains, and an asynchronous nighttime engine (System2) that induces schemas and intentions and sweeps for cross-domain collisions abstracted into higher-level core schemas. On LongMemEval, PersonaMem and PersonaMem-v2, enabling System2 contributes most where the benchmark rewards implicit cross-session inference (up to +5.20 on PersonaMem-v2) and least on span recall, matching the architectural prediction.
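A doubly linked supersedes chain can be pictured as a record that points both to the belief it replaced and to the belief that later replaced it. The Python sketch below is a guess at the data structure, not DCPM's schema: the class `Belief`, its field names, and the helpers `revise` and `history` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through the links
class Belief:
    text: str
    supersedes: Optional["Belief"] = None     # older belief this one replaces
    superseded_by: Optional["Belief"] = None  # newer belief that replaced this one


def revise(current: Belief, new_text: str) -> Belief:
    """Record a belief revision, linking old and new in both directions."""
    updated = Belief(text=new_text, supersedes=current)
    current.superseded_by = updated
    return updated


def history(latest: Belief) -> List[str]:
    """Walk backward along the chain and return beliefs oldest first."""
    texts = []
    node: Optional[Belief] = latest
    while node is not None:
        texts.append(node.text)
        node = node.supersedes
    return texts[::-1]


first = Belief("prefers tea")
second = revise(first, "prefers coffee")
third = revise(second, "prefers decaf")
print(history(third))
```

Keeping both links lets a reader move from any stale belief to its current replacement, or from the current belief back through how the user's view evolved.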

2606.09825 2026-06-09 cs.LG cs.AI cs.SY eess.SY math.OC cross-list

An Agency-Transferring Model-Free Policy Enhancement Technique

Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko

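Affiliations: Center for Engineering Systems and Sciences; Central University; Sirius University of Science and Technology

AI summary: Proposes a method that embeds a suboptimal baseline policy into reinforcement learning training and progressively transfers agency from the baseline policy to a learnable policy, improving training efficiency and ultimately yielding a standalone policy that outperforms the baseline.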

English abstract

Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.
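The progressive transfer of agency can be sketched as a per-step choice between the two policies, with the probability of using the learning policy rising over training. The abstract does not specify the arbitration mechanism, so the linear schedule in `agency` and the random switch in `arbitrate` below are hypothetical stand-ins written in Python.

```python
import random


def agency(step, total_steps):
    """Share of control given to the learning policy: 0 at the start of
    training and 1 at the end (assumed linear schedule)."""
    return min(1.0, max(0.0, step / total_steps))


def arbitrate(step, total_steps, baseline_action, learned_action, rng):
    """Use the learned action with probability agency(step), else the baseline."""
    if rng.random() < agency(step, total_steps):
        return learned_action
    return baseline_action


rng = random.Random(0)
for step in (0, 500, 1000):
    picks = [arbitrate(step, 1000, "baseline", "learned", rng) for _ in range(1000)]
    print(step, picks.count("learned") / len(picks))
```

At step 0 every action comes from the baseline and at the final step every action comes from the learning policy, which mirrors the abstract's description of ending with a standalone policy.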

2404.02039 2026-06-09 cs.AI 版本更新

A Survey on Large Language Model-Based Game Agents

基于大语言模型的游戏智能体综述

Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Fatih Ilhan, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, Ling Liu

发表机构 * Georgia Institute of Technology USA（美国佐治亚理工学院）； Cisco Research USA（美国思科研究院）

AI总结综述基于大语言模型的游戏智能体，提出统一参考架构，从单智能体（记忆、推理、感知-行动接口）和多智能体（通信协议、组织模型）层面总结研究，并建立挑战导向的分类法连接六种游戏类型与智能体需求。

Comments ACM Computing Surveys, 2026

详情

AI中文摘要

游戏环境提供了丰富、可控的设置，能够模拟现实世界复杂性的许多方面。因此，游戏智能体为探索与通用人工智能相关的能力提供了有价值的测试平台。最近，大语言模型（LLM）的出现为在这些复杂游戏环境中赋予智能体可泛化的推理、记忆和适应性提供了新的机会。本综述通过一个统一的参考架构，对基于LLM的游戏智能体（LLMGA）进行了最新回顾。在单智能体层面，我们围绕三个核心组件综合了现有研究：记忆、推理和感知-行动接口，这些组件共同描述了语言如何使智能体感知、思考和行动。在多智能体层面，我们概述了通信协议和组织模型如何支持协调、角色分化以及大规模社会行为。为了将这些设计置于具体情境中，我们引入了一个以挑战为中心的分类法，将六种主要游戏类型与其主导的智能体需求联系起来，从动作游戏中的低延迟控制到沙盒世界中的开放式目标形成。相关论文的精选列表可在以下网址获取：https://github.com/git-disl/awesome-LLM-game-agent-papers

英文摘要

Game environments provide rich, controllable settings that simulate many aspects of real-world complexity. As such, game agents offer a valuable testbed for exploring capabilities relevant to Artificial General Intelligence. Recently, the emergence of Large Language Models (LLMs) provides new opportunities to endow these agents with generalizable reasoning, memory, and adaptability in complex game environments. This survey offers an up-to-date review of LLM-based game agents (LLMGAs) through a unified reference architecture. At the single-agent level, we synthesize existing studies around three core components: memory, reasoning, and perception-action interfaces, which jointly characterize how language enables agents to perceive, think, and act. At the multi-agent level, we outline how communication protocols and organizational models support coordination, role differentiation, and large-scale social behaviors. To contextualize these designs, we introduce a challenge-centered taxonomy linking six major game genres to their dominant agent requirements, from low-latency control in action games to open-ended goal formation in sandbox worlds. A curated list of related papers is available at https://github.com/git-disl/awesome-LLM-game-agent-papers

2601.21754 2026-06-09 cs.AI 版本更新

参与过程：重新思考动作与观察的时间接口

Jialian Li, Yuchen Cao, Junhong Liu, Weiran Guo, Xutao Wang, Jiaming Song, Jiahao Zhang, Jie Chen

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出参与过程（EP）模型，通过显式时间接口处理动作与观察的不同时间尺度交互，支持多速率协调和子系统组合，揭示隐藏的时间行为并使策略适应显式时间成本。

详情

AI中文摘要

在数字和物理环境中完成任务日益涉及复杂的时序交互，其中动作和观察在不同的时间尺度上展开，而非与固定观察-动作步骤对齐。为了建模此类交互，我们提出参与过程（EP），一种继承POMDP决策理论结构的交互形式，使时间在动作-观察接口中显式化。EP将动作和观察表示为沿时间解耦的事件流，而非在固定决策步骤上配对更新。此接口捕捉单agent的时间问题，如决策延迟、延迟反馈和持续动作，同时支持更丰富的agent侧组织、多速率协调和子系统间的组合交互。在玩具、LLM-agent和学习实验中，EP揭示了由基于步骤的接口隐藏的时间行为，并使策略在显式时间成本下适应。

英文摘要

Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unfold over different time scales rather than align with fixed observation--action steps. To model such interactions, we propose Engagement Process (EP), an interaction formalism that inherits the decision-theoretic structure of POMDPs while making time explicit in the action--observation interface. EP represents actions and observations as decoupled event streams along time, rather than updates paired at fixed decision steps. This interface captures single-agent timing issues such as deliberation latency, delayed feedback, and persistent actions, while supporting richer agent-side organization, multi-rate coordination, and compositional interaction among subsystems. Across toy, LLM-agent, and learning experiments, EP exposes temporal behaviors hidden by step-based interfaces and enables policies to adapt under explicit time costs.

2605.16309 2026-06-09 cs.AI cs.LG cs.MA 版本更新

ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning

ANNEAL：通过受控符号补丁学习适应大语言模型代理

Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song

发表机构 * University of Maryland, Baltimore County（马里兰大学巴尔的摩县分校）； University at Buffalo（布法罗大学）； University of Colorado Boulder（科罗拉多大学博尔德分校）； University of Colorado Colorado Springs（科罗拉多大学科罗拉多斯普林斯分校）

AI总结 ANNEAL通过受控符号补丁学习适应大语言模型代理，解决重复故障问题，其核心机制FDKA能定位责任操作符并生成类型补丁，实现持久结构修复，优于现有方法。

Comments Code Implementation: https://github.com/sbhakim/anneal-agents

详情

AI中文摘要

基于大语言模型的代理可以恢复个体执行错误，但在底层过程知识未修复时，同一故障会反复失败。现有自我进化方法通过更新提示、记忆或模型权重来解决这一差距，但未直接修复编码任务执行的符号结构，且缺乏安全部署所需的治理保证。我们引入ANNEAL，一种神经符号代理，将重复失败转化为受控符号编辑过程知识图谱，而无需修改基础模型权重。其核心机制，故障驱动知识获取（FDKA），定位责任操作符，通过约束LLM生成合成类型补丁，并通过多维评分、符号护栏和金丝雀测试验证提案，再提交。每条接受的编辑都携带完整溯源和确定性回滚能力。在四个领域和27个多种子运行中，ANNEAL是唯一在测试重复故障设置中将失败率降至0%的评估系统。消融实验表明，移除FDKA会消除所有结构修复并使成功率下降最高26.7个百分点。这些结果表明，受控符号修复为持续故障消除提供了与权重级和提示级适应互补的范式。

英文摘要

LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge--operator schemas, preconditions, and constraints--remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs--strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72--100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.

2606.01619 2026-06-09 cs.AI cs.LG stat.ML 版本更新

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

ReSkill：在智能体强化学习中协调技能创建与策略优化

Zelin He, Haotian Lin, Boran Han, Wei Zhu, Haoyang Fang, Bernie Wang, Xuan Zhu, Runze Li, Matthew Reimherr

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结提出ReSkill框架，通过GRPO的组结构嵌入断言驱动技能创建、组内轨迹采样和自适应汤普森采样，实现技能与策略的协同进化，在多个领域超越现有方法。

详情

AI中文摘要

智能体强化学习使LLM智能体能够从环境奖励中持续改进，但由此产生的策略并未系统地积累可跨任务泛化的可重用策略。模块化技能可以提供此类可重用策略，然而现有的技能增强强化学习方法将技能创建与策略优化分离，存在采用与进化策略冲突的技能的风险。受Anthropic的Skill Creator启发，我们引入ReSkill，一种强化学习在环的技能创建框架，协调技能进化与策略学习。ReSkill利用GRPO的组结构自然嵌入三种机制，仅需少量额外开销：（1）断言驱动的技能创建器，从过去经验中诊断失败并提出基于条件的触发式技能修订；（2）组内轨迹采样，实现技能版本的可控比较，捕获哪个版本最能支持策略的持续学习；（3）自适应折扣的汤普森采样，在策略进化过程中平衡技能版本选择的探索与利用。在多个领域，ReSkill始终优于现有的基于记忆和技能的强化学习方法，在未见任务上提升最大。对技能生命周期的分析显示，随着策略改进，技能被自动创建、测试、精炼和修剪，展示了协调的技能-策略协同进化。

英文摘要

Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not systematically accumulate reusable strategies that generalize across tasks. Modular skills can provide such reusable strategies, yet existing skill-augmented RL methods decouple skill creation from policy optimization, risking adopting skills that conflict with the evolving policy. Inspired by Anthropic's Skill Creator, we introduce ReSkill, an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. ReSkill exploits the group-wise structure of GRPO to naturally embed three mechanisms with only marginal additional overhead: (1) an assertion-driven skill creator that diagnoses failures from past experience and proposes conditional, trigger-based skill revisions; (2) within-group rollout sampling that enables controlled comparison of skill versions, capturing which version best supports the policy's ongoing learning; and (3) Thompson Sampling with adaptive discounting to balance exploration and exploitation in skill version selection as the policy evolves. Across several domains, ReSkill consistently outperforms existing memory and skill-based RL methods, with the largest gains on unseen tasks. Analysis of the skill lifecycle shows skills being automatically created, tested, refined, and pruned as the policy improves, demonstrating reconciled skill-policy co-evolution.
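The third mechanism, Thompson Sampling with discounting over skill versions, can be sketched as follows. This is not ReSkill's implementation: the abstract does not describe how the discount adapts, so the sketch uses a fixed discount `gamma`, a Beta-Bernoulli reward model, and a hypothetical class name `DiscountedThompsonSampler`, all of which are assumptions.

```python
import random


class DiscountedThompsonSampler:
    """Beta-Bernoulli Thompson Sampling whose evidence decays by `gamma` per update."""

    def __init__(self, versions, gamma=0.9, seed=None):
        if not 0.0 < gamma <= 1.0:
            raise ValueError("gamma must be in (0, 1]")
        self.gamma = gamma
        self.rng = random.Random(seed)
        # Discounted success/failure counts per skill version (prior is Beta(1, 1)).
        self.successes = {v: 0.0 for v in versions}
        self.failures = {v: 0.0 for v in versions}

    def select(self):
        """Draw a success rate per version and return the version with the highest draw."""
        draws = {
            v: self.rng.betavariate(1.0 + self.successes[v], 1.0 + self.failures[v])
            for v in self.successes
        }
        return max(draws, key=draws.get)

    def update(self, version, success: bool):
        """Discount all past evidence, then record the new outcome for `version`."""
        for v in self.successes:
            self.successes[v] *= self.gamma
            self.failures[v] *= self.gamma
        if success:
            self.successes[version] += 1.0
        else:
            self.failures[version] += 1.0


sampler = DiscountedThompsonSampler(["skill_v1", "skill_v2"], gamma=0.5, seed=0)
sampler.update("skill_v1", True)
sampler.update("skill_v1", False)
chosen = sampler.select()
```

Discounting shrinks old counts toward the prior, so a skill version that helped an earlier policy does not stay favored once the policy has moved on, which is the motivation the abstract gives for balancing exploration and exploitation "as the policy evolves".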

2606.04421 2026-06-09 cs.AI cs.LG 版本更新

Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

Trivium: 时间遗憾作为因果记忆控制器的一等目标

Edward Y. Chang

发表机构 * Stanford University（斯坦福大学）

AI总结本文提出将长期时间遗憾作为一等目标，与结果遗憾和认知遗憾共同构成因果记忆控制器的可证伪失败分析框架，证明时间校准偏差在对结果遗憾为零时仍线性增长，而基于持久因果日志的探测复杂度为对数级。

Comments 62 pages, 12 tables, 12 figures

详情

AI中文摘要

许多当前的智能体系统和LLM管道通过优化结果奖励来纠正错误。这仅解决了失败的“什么”：当结果偏离预测时，不匹配的“为什么”和“何时”没有被系统地记录、审查或纠正，因此相同的错误可能反复出现。我们认为这是一个结构性问题，而不仅仅是模型容量问题。我们提出将长期时间遗憾作为一等目标，与结果遗憾和工作因果模型上的认知遗憾并列。时间遗憾捕捉失败持续的时间：在纠正之前，一个校准错误的因果模型被容忍了多久。认知遗憾捕捉失败持续的原因：工作因果模型中的残余不确定性或错误。这三个遗憾共同给出了一个可证伪的说明，关于一个长期存在的智能体可能失败的原因、内容和时间。将智能体建模为E个片段的流，我们在显式因果探测、持久性和可检测性假设下证明了三个条件结果。首先，在观测等价混淆下，仅基于结果的学习无法在没有干预通道的情况下区分因果结构和虚假结构，因此时间校准偏差可以在结果遗憾被降至零后仍线性持续。其次，使用持久因果日志和预算探测，总探测复杂度是片段范围的对数，导致O(log E)的时间遗憾。第三，在K个可检测变化点下，速率扩展为O(K log E)。我们实例化了Trivium并预注册了五个可证伪预测。在CausalBench-Seq上，Trivium遵循预测的对数包络线，而仅基于结果的基线线性增长。一个真实LLM流的初步外部有效性证据跨越了一个完整的E=500运行和三个E=100前沿模型试点。这里的自学习意味着修正外部因果模型，而不是重新训练LLM权重。

英文摘要

Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a model-capacity one. We propose long-horizon temporal regret as a first-class objective alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before correction. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret is driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, inducing O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.

2606.04627 2026-06-09 cs.AI 版本更新

通过通信世界模型进行上下文强化学习

Fernando Martinez-Lopez, Tao Li, Yingdong Lu, Juntao Chen

发表机构 * Department of Computer and Information Sciences, Fordham University（福特汉姆大学计算机与信息科学系）； Department of Systems Engineering, City University of Hong Kong（香港城市大学系统工程系）； IBM Research（IBM研究院）

AI总结提出CORAL框架，通过将潜在表示学习与控制分离，利用信息代理预训练世界模型并生成通信消息，使控制代理实现零样本适应和样本效率提升。

详情

AI中文摘要

强化学习（RL）代理通常难以在不更新参数的情况下泛化到新任务和上下文，主要是因为它们学到的表示和策略过度拟合于训练环境的特定性。为了提升代理的上下文RL（ICRL）能力，本文将ICRL形式化为一个双代理涌现通信问题，并引入了CORAL（用于自适应RL的通信表示）框架，该框架通过功能性地分离潜在表示学习与控制来学习可迁移的通信上下文。在CORAL中，信息代理（IA）在多样化的任务分布上作为世界模型进行预训练。其目标不是直接最大化回报，而是进行世界建模并将其理解提炼为简洁的消息。涌现通信协议由一种新颖的因果影响损失塑造，该损失衡量消息对下一动作的影响。在部署期间，预训练的IA作为固定上下文提供者服务于新的控制代理（CA），后者通过解释提供的通信上下文来学习解决任务。我们的实验表明，这种方法使CA能够实现样本效率的显著提升，并在多样化的在线和离线环境中借助预训练的IA成功进行零样本适应，验证了学习可迁移通信表示的有效性。

英文摘要

Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because their learned representations and policies are overfit to the specifics of their training environments. To boost agents' in-context RL (ICRL) ability, this work formulates ICRL as a two-agent emergent communication problem and introduces CORAL (Communicative Representation for Adaptive RL), a framework that learns a transferable communicative context by functionally separating latent representation learning from control. In CORAL, an Information Agent (IA) is pre-trained as a world model on a diverse distribution of tasks. Its objective is not direct return maximization, but world modeling and distilling its understanding into concise messages. The emergent communication protocol is shaped by a novel Causal Influence Loss, which measures the effect that the message has on the next action. During deployment, the previously trained IA serves as a fixed contextualizer for a new Control Agent (CA), which learns to solve tasks by interpreting the provided communicative context. Our experiments demonstrate that this approach enables the CA to achieve significant gains in sample efficiency and successfully perform zero-shot adaptation with the help of pre-trained IA in diverse online and offline environments, validating the efficacy of learning a transferable communicative representation.

2601.18510 2026-06-09 cs.LG cs.AI 版本更新

Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

即时强化学习：无需梯度更新的LLM智能体持续学习

Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出JitRL框架，通过动态非参数记忆和即时优势估计，无需梯度更新即可实现LLM智能体的测试时策略优化，在WebArena和Jericho上达到训练无关方法最优，且性能超越微调方法，成本降低30倍以上。

详情

AI中文摘要

尽管大型语言模型（LLM）智能体在通用任务上表现出色，但由于部署后权重冻结，它们在持续适应方面存在固有困难。传统的强化学习（RL）提供了一种解决方案，但会带来高昂的计算成本和灾难性遗忘的风险。我们引入了即时强化学习（JitRL），这是一个无需训练的框架，能够在没有任何梯度更新的情况下实现测试时策略优化。JitRL维护一个动态的非参数经验记忆，并检索相关轨迹以即时估计动作优势。这些估计随后用于直接调制LLM的输出logits。我们从理论上证明，这种加法更新规则是KL约束策略优化目标的精确闭式解。在WebArena和Jericho上的大量实验表明，JitRL在训练无关方法中建立了新的最先进水平。关键的是，JitRL在性能上超越了计算昂贵的微调方法（如WebRL），同时将货币成本降低了30倍以上，为持续学习智能体提供了一条可扩展的路径。代码可在https://github.com/liushiliushi/JitRL获取。

英文摘要

While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation because their weights are frozen after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.
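The claim that an additive logit update solves a KL-constrained objective can be made concrete with the standard form of that result. The abstract does not give the paper's exact objective or notation, so the symbols below ($\pi_0$ for the frozen LLM policy with logits $z_a$, $A(s,a)$ for the estimated advantage, $\beta > 0$ for the KL penalty weight) are assumptions.

```latex
% Assumed notation: \pi_0 = softmax(z) is the frozen LLM policy,
% A(s,a) is the retrieved advantage estimate, \beta > 0 weights the KL penalty.
\begin{align}
\pi^\star &= \arg\max_{\pi}\;
  \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[A(s,a)\big]
  - \beta\,\mathrm{KL}\big(\pi(\cdot \mid s)\,\|\,\pi_0(\cdot \mid s)\big), \\
\pi^\star(a \mid s) &=
  \frac{\pi_0(a \mid s)\,\exp\big(A(s,a)/\beta\big)}
       {\sum_{a'} \pi_0(a' \mid s)\,\exp\big(A(s,a')/\beta\big)}, \\
z'_a &= z_a + \frac{A(s,a)}{\beta}
  \quad\text{so that}\quad \pi^\star(\cdot \mid s) = \mathrm{softmax}(z').
\end{align}
```

Because $\pi_0 \propto \exp(z_a)$, multiplying by $\exp(A/\beta)$ is the same as adding $A/\beta$ to each logit, which is why no gradient step is needed in this formulation.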

2605.22781 2026-06-09 cs.OS cs.AI 版本更新

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

DeltaBox: 通过毫秒级沙箱检查点/回滚扩展状态化AI代理

Yunpeng Dong, Jingkai He, Shiqi Liu, Yuze Hou, Dong Du, Zhonghu Xu, Si Yu, Baochuan Yang, Yubin Xia, Haibo Chen

发表机构 * Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University（并行与分布式系统研究所，上海交通大学）； Engineering Research Center for Domain-specific Operating Systems, Ministry of Education, China（领域特定操作系统工程研究中心，中华人民共和国教育部，中国）； Huawei Technologies Co., Ltd（华为技术有限公司）

AI总结本文提出DeltaBox，一种通过DeltaFS和DeltaCR机制实现毫秒级检查点/回滚的新型AI代理沙箱，解决了传统方法在高频状态探索中的延迟问题。

详情

AI中文摘要

LLM驱动的AI代理需要高频状态探索（例如测试时的树搜索和强化学习），依赖于快速检查点和回滚（C/R）完整的沙箱状态，包括文件和进程状态（例如内存、上下文等）。现有机制需要完整复制状态，导致每次C/R的延迟达到数百毫秒到秒级，严重限制了深度搜索和大规模扩展。本文观察到AI代理中的后续检查点高度相似，因此沙箱应仅复制连续检查点之间的变化（关键洞察）。然而，实现这一想法并不简单，主要是由于缺乏操作系统支持。本文提出新的操作系统抽象DeltaState，通过两个共同设计的操作系统机制，为AI代理实现基于变化的事务性C/R。首先，DeltaFS通过将文件状态组织成分层结构，动态冻结可写层并在检查点时插入新层，将文件更新转换为写时复制，使回滚成为简单的层切换。其次，DeltaCR通过增量快照实现基于变化的过程状态C/R，并通过绕过传统管道直接从冻结的模板进程fork()来加速回滚。我们随后提出DeltaBox，一种新型的代理沙箱，通过这两种新机制实现毫秒级的C/R。在SWE-bench和RL微基准测试中的评估显示，DeltaBox在毫秒级延迟（14ms和5ms）内完成检查点和回滚，使代理在固定时间预算内能够探索大量节点。

英文摘要

LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid checkpoint and rollback (C/R) of the complete sandbox state, including files and process state (e.g., memory and contexts). Existing mechanisms duplicate the entire state, causing hundreds of milliseconds to seconds of latency per C/R, which severely bottlenecks deep search and large-scale fan-outs. This paper observes that subsequent checkpoints in AI agents are highly similar. Therefore, instead of full duplication, a sandbox should only duplicate the changes between consecutive checkpoints (Key Insight). However, realizing this idea is non-trivial, mainly because the necessary OS support is missing. This paper proposes a new OS-level abstraction, DeltaState, to enable change-based transactional C/R for AI agents with two co-designed OS mechanisms. First, DeltaFS enables change-based filesystem C/R by organizing the file states into layers and dynamically freezing the writable layer and inserting a new one during checkpoint, reducing file updates to copy-on-write, and making rollback a simple layer switch. Second, DeltaCR enables change-based process state C/R using incremental dumps, and accelerates rollback by bypassing traditional pipelines to directly fork() from a frozen template process. We then present DeltaBox, a novel agent sandbox achieving millisecond-level C/R through the two new mechanisms. Evaluations on SWE-bench and RL micro-benchmarks show DeltaBox completes checkpoint and rollback with millisecond-level latency (14ms and 5ms, respectively), empowering agents to explore substantially more nodes under fixed time budgets.
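The layered checkpoint idea attributed to DeltaFS (freeze the writable layer, push a new one, roll back by switching layers) can be illustrated with a toy in-memory analogy. This is not the DeltaFS implementation, which operates at the OS level; the class `LayeredStore` and its methods are hypothetical names introduced only for this sketch.

```python
class LayeredStore:
    """Toy key-value store: frozen read-only layers plus one writable top layer."""

    _DELETED = object()  # tombstone marking a key deleted in an upper layer

    def __init__(self):
        self.layers = [{}]  # the last element is the writable layer

    def write(self, key, value):
        self.layers[-1][key] = value  # only the top layer is ever modified

    def delete(self, key):
        self.layers[-1][key] = self._DELETED

    def read(self, key):
        # Search from the newest layer to the oldest; the first hit wins.
        for layer in reversed(self.layers):
            if key in layer:
                value = layer[key]
                if value is self._DELETED:
                    raise KeyError(key)
                return value
        raise KeyError(key)

    def checkpoint(self) -> int:
        """Freeze the current top layer and push a fresh writable one."""
        self.layers.append({})
        return len(self.layers) - 1  # number of frozen layers at this checkpoint

    def rollback(self, checkpoint_id: int):
        """Drop every layer written after the checkpoint; no data is copied."""
        if not 1 <= checkpoint_id < len(self.layers):
            raise ValueError("unknown checkpoint")
        self.layers = self.layers[:checkpoint_id] + [{}]


store = LayeredStore()
store.write("config", "v1")
cp = store.checkpoint()
store.write("config", "v2")
store.write("scratch", "tmp")
store.rollback(cp)
restored = store.read("config")
```

In this analogy a checkpoint costs one list append and a rollback touches only layer references, never the stored values, which is the property the abstract summarizes as making rollback "a simple layer switch".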

2605.30407 2026-06-09 cs.CL cs.AI cs.IR cs.LG 版本更新

Exploring Autonomous Agentic Data Engineering for Model Specialization

探索用于模型专业化的自主智能体数据工程

Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng

发表机构 * Zhejiang University（浙江大学）； Platform and Content Group, Tencent（腾讯平台与内容部）

AI总结本文提出自主智能体数据工程任务，让LLM作为自主数据工程师，通过端到端数据策划驱动模型专业化，实验显示GPT-5.2通过迭代数据适应使学生模型性能提升57.29%。

Comments Work in progress

详情

AI中文摘要

大型语言模型（LLM）在通用任务上表现出色，但往往难以适应没有高质量领域特定数据的专业领域。现有的基于LLM的数据策划方法主要依赖人工设计的工作流程，尚未检验LLM能否自主执行端到端的数据工程流水线以实现模型专业化。我们形式化了自主智能体数据工程，这是一个新任务，旨在评估LLM作为自主数据工程师，通过端到端数据策划驱动模型专业化。我们将数据视为可优化组件，研究能够跨多个领域规划、生成和迭代优化训练数据的智能体，并以训练后性能提升为指导。实验表明，自主LLM数据工程师带来了显著收益，GPT-5.2构建的训练课程使学生模型性能提升了57.29%，完全通过迭代的智能体驱动数据适应实现。通过揭示潜力和瓶颈，我们的研究将自主数据工程确立为一种可衡量的能力，并为智能体驱动的模型专业化指明了道路（代码将在https://github.com/zjunlp/DataAgent发布）。

英文摘要

Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization (Code will be released at https://github.com/zjunlp/DataAgent).

2508.15030 2026-06-09 cs.AI 版本更新

Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism

Collab-REC：一种基于LLM的代理框架，用于平衡旅游推荐

Ashmi Banerjee, Adithi Satish, Fitri Nur Aisyah, Wolfgang Wörndl, Yashar Deldjoo

发表机构 * Technical University of Munich（慕尼黑工业大学）； Polytechnic University of Bari（巴里理工大学）

AI总结提出一种多代理框架Collab-REC，通过三个LLM代理（个性化、流行度、可持续性）生成城市建议，并由非LLM调节器迭代优化，以缓解流行度偏差并提高推荐多样性。

详情

AI中文摘要

我们提出了COLLAB-REC，一个多代理框架，旨在抵消流行度偏差并提高旅游推荐的多样性。在我们的设置中，三个基于LLM的代理（个性化、流行度和可持续性）从不同角度生成城市建议。然后，一个非LLM调节器通过迭代约束优化合并并完善这些提议，确保每个代理的观点得到体现，同时减少虚假或重复输出。使用不同规模和模型家族的LLM对欧洲城市查询进行的大量离线实验表明，与单代理基线相比，COLLAB-REC提高了多样性和整体相关性，同时揭示了经常被忽视的较少访问的目的地。这种平衡的、上下文感知的方法更好地捕捉了更广泛的用户和系统级考虑因素，凸显了多利益相关者协作在LLM驱动的推荐系统中的潜力。代码、数据和其他工件可在此处获取：https://github.com/ashmibanerjee/collab-rec，而使用的提示包含在附录中。

英文摘要

We propose COLLAB-REC, a multi-agent framework designed to counteract popularity bias and improve diversity in tourism recommendations. In our setup, three LLM-based agents (Personalization, Popularity, and Sustainability) generate city suggestions from different perspectives. A non-LLM moderator then merges and refines these proposals through iterative constrained refinement, ensuring that each agent's viewpoint is represented while reducing spurious or repeated outputs. Extensive offline experiments on European city queries using LLMs of different sizes and model families show that COLLAB-REC improves both diversity and overall relevance compared to a single-agent baseline, while surfacing lesser-visited destinations that are often overlooked. This balanced, context-aware approach better captures a broader range of user and system-level considerations, highlighting the potential of multi-stakeholder collaboration in LLM-driven recommender systems. Code, data, and other artifacts are available here: https://github.com/ashmibanerjee/collab-rec, while the prompts used are included in the appendix.

2606.08477 2026-06-09 cs.AI 新提交

A Variability-Based Framework for Interpretable Naming in Formal and Relational Concept Analysis

基于可变性的框架：形式概念分析与关系概念分析中的可解释命名

Alain Gutierrez, Marianne Huchard, Pierre Martin, André Miralles, Violaine Prince

发表机构 * LIRMM, Univ. Montpellier, CNRS（法国国家科学研究中心蒙彼利埃大学计算机科学、机器人及微电子实验室）； CIRAD, UPR AIDA（法国农业国际合作研究发展中心AIDA研究单元）； AIDA, CIRAD, Univ. Montpellier（法国农业国际合作研究发展中心AIDA研究单元，蒙彼利埃大学）； INRAE - UMR TETIS - Territoires, Environnement（法国国家农业、食品与环境研究院TETIS联合研究单元）

AI总结针对形式概念分析和关系概念分析中概念命名缺乏可解释性的问题，提出一种基于可变性的LLM辅助命名框架，通过控制信息源生成可读名称，并在披萨店数据集上验证其有效性。

详情

AI中文摘要

从符号数据中提取知识通常会产生形式上定义但用户无法立即解释的抽象概念。形式概念分析（FCA）和关系概念分析（RCA）为此问题提供了代表性场景：它们根据对象描述和关系生成明确的概念结构、蕴含关系和关系依赖。尽管这些结构在设计上是可解释的，但概念通常由技术标签标识，这限制了它们作为人类可解释知识单元的使用。因此，为这些概念赋予有意义的名称是领域专家进行解释、导航、验证和复用的关键问题。本文从符号知识表示的角度研究FCA和RCA中的概念命名。我们首先描述了命名生成的符号抽象所涉及的语言和术语挑战，包括歧义性、区分性、简洁性以及相关概念间的一致性。然后，我们提出一个可配置的LLM辅助概念命名框架。该框架依赖于一个可变性模型，该模型控制命名过程中暴露的信息源，如内涵、外延、继承信息、邻近概念、蕴含关系和关系属性。从而明确从形式概念描述到人类可读名称的语义选择。该方法作为概念验证在披萨店领域的小型关系数据集上进行了说明。该示例展示了不同配置如何影响LLM建议的名称，以及命名可变性如何揭示解释选择、关系依赖以及底层符号数据中可能的建模问题。

英文摘要

Knowledge extraction from symbolic data often produces abstractions that are formally defined but not immediately interpretable by users. Formal Concept Analysis (FCA) and Relational Concept Analysis (RCA) provide representative settings for this issue: they generate explicit conceptual structures, implications, and relational dependencies from object descriptions and relations. Although these structures are explainable by design, their concepts are often identified by technical labels, which limits their use as human-interpretable knowledge units. Assigning meaningful names to such concepts is therefore a key issue for interpretation, navigation, validation, and reuse by domain experts. This paper investigates concept naming in FCA and RCA from a symbolic knowledge representation perspective. We first characterize the linguistic and terminological challenges involved in naming generated symbolic abstractions, including ambiguity, discrimination, concision, and consistency across related concepts. We then propose a configurable framework for LLM-assisted concept naming. The framework relies on a variability model that controls which sources of information are exposed during naming, such as intent, extent, inherited information, neighboring concepts, implications, and relational attributes. It thereby makes explicit the semantic choices involved in moving from formal concept descriptions to human-readable names. The approach is illustrated as a proof of concept on a small relational dataset in the pizzeria domain. This illustration shows how different configurations influence the names suggested by an LLM, and how naming variability can reveal interpretation choices, relational dependencies, and possible modeling issues in the underlying symbolic data.

2606.08503 2026-06-09 cs.AI cs.LO 新提交

Standpoint Logics with Defeasible Beliefs

带有可废止信念的立场逻辑

Nicholas Leisegang, Thomas Meyer, Sebastian Rudolph

发表机构 * University of Cape Town（开普敦大学）； CAIR, South Africa（南非人工智能研究中心）； Technische Universität Dresden（德累斯顿工业大学）； ScaDS.AI – Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig, Germany（德国德累斯顿/莱比锡可扩展数据分析与人工智能中心）

AI总结将KLM可废止逻辑与立场逻辑框架结合，提出DRSL，通过公理化语义和多种蕴涵关系提升，实现多视角下可废止信念的形式化表达。

详情

AI中文摘要

在本文中，我们将Kraus、Lehmann和Magidor（KLM）的可废止逻辑与Gómez Álvarez和Rudolph的立场逻辑框架相结合。这样做是为了形式化地表达考虑多个（可能矛盾的）视角的知识，而这些视角可能持有可废止信念。为此，我们利用了Leisegang等人引入的可废止受限立场逻辑（DRSL）。我们的工作扩展了先前的研究，为DRSL语义提供了基础表示结果，并系统地将几个著名的蕴涵关系从命题情况提升到立场增强设置。具体地，我们通过一组为立场情况调整的KLM风格公设来刻画DRSL的语义。此外，我们提供了一种方法来提升优先蕴涵，以及基于单个排序函数的蕴涵关系类，从纯命题语境到立场增强语境，包括理性和词典序闭包。我们证明这可以通过语义和算法手段等价地实现。此外，我们表明，对于每种考虑的蕴涵形式，从命题KLM到DRSL，蕴涵检查的复杂度类不会改变。

英文摘要

In this paper, we integrate the defeasible logic of Kraus, Lehmann and Magidor (KLM) with the standpoint logic framework of Gómez Álvarez and Rudolph. This is done with the goal of formally expressing knowledge taking into account multiple (possibly contradicting) viewpoints, which in turn may hold defeasible beliefs. In doing so, we utilise Defeasible Restricted Standpoint Logics (DRSL), introduced by Leisegang et al. Our work expands on previous work by providing a foundational representation result for DRSL semantics and systematically lifting several well-known entailment relations from the propositional case to the standpoint-enhanced setting. In particular, we characterise the semantics for DRSL through a set of KLM-style postulates adapted for the standpoints case. We furthermore provide a means to lift preferential entailment, and the class of entailment relations based on single ranking functions from the purely propositional to the standpoint-enhanced context, including rational and lexicographic closure. We show this can be done equivalently through semantic and algorithmic means. Furthermore, we show that, for each considered form of entailment, the complexity class of entailment checking does not change when moving from propositional KLM to DRSL.

2606.08658 2026-06-09 cs.AI cs.LO 新提交

Extending Ontologies: From Dense Embeddings to Hybrid Quantum-Fuzzy Systems

扩展本体：从密集嵌入到混合量子模糊系统

Angjelin Hila

发表机构 * GitHub

AI总结本文综述本体与密集嵌入算法的集成方法，并提出神经-量子-模糊系统作为同时支持概率推理和精确推理的知识表示新范式。

2606.09674 2026-06-09 cs.AI cs.LO math.CO 新提交

(Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs

(自动)形式化应该很简单：用于详细阐述严格证明的Trellis过程语义

Wesley Pegden

发表机构 * Department of Mathematical Sciences, Carnegie Mellon University（卡内基梅隆大学数学科学系）

AI总结提出Trellis系统，通过确定性约束工作流和LLM代理迭代细化自然语言证明，实现Lean自动形式化，强调严格证明的可细化性。

Comments 15 pages, 7 figures, 5 tables

2606.07525 2026-06-09 cs.CL cs.AI 交叉投稿

Implicit Causal Graph Construction in Text via Chain Discovery

通过链发现实现文本中的隐式因果图构建

Liesbeth Allein, Marie-Francine Moens

发表机构 * KU Leuven（鲁汶大学）； Ghent University（根特大学）

AI总结研究利用大语言模型从文本因果对中推断中间事件以构建隐式因果图，比较端到端构建与因果链发现方法，并探索多模型集成策略，基于1560个科学验证因果对评估。

详情

AI中文摘要

文本中的因果图通常由可观察的、预定义的事件填充。相比之下，我们研究从文本中构建隐式因果图，将每个描述的因果对视为潜在隐式因果图的起点和终点，并使用大型语言模型（LLM）推断中间因果事件。我们比较了端到端图构建与将任务视为因果链发现的方法。在后一种方法中，图是通过聚合推断出的链或通过迭代搜索过程逐步扩展部分链来构建的。我们进一步探索了“群体智慧”扩展，即在事后聚合和协作推理设置中从多个LLM访问因果知识。我们分析了这些方法之间的权衡，并使用一个包含1560个经过科学验证的因果对的手动策划数据库评估推断出的因果关系的有效性。这种基于数据库的评估被认为是可靠的、资源高效的，并且可迁移到无法获得真实图的情况。

英文摘要

Causal graphs in text are typically populated by observable, predefined events. In contrast, we study implicit causal graph construction from text by treating each described cause-effect pair as the begin- and endpoint of an underlying latent causal graph and using large language models (LLMs) to infer intermediate causal events. We compare end-to-end graph construction with methods that frame the task as causal chain discovery. In the latter, graphs are built either by aggregating inferred chains or by progressively expanding partial chains through an iterative search process. We further explore Wisdom of the Crowd extensions that access causal knowledge from multiple LLMs in post-hoc aggregation and collaborative inference settings. We analyze trade-offs among these approaches and evaluate the validity of inferred causal relations using a manually curated database of 1,560 scientifically validated causal pairs. This database-based evaluation is proposed as reliable, resource-efficient, and transferable to settings where ground-truth graphs are unavailable.

2606.09134 2026-06-09 cs.RO cs.AI cs.CL cs.CV cs.GR 交叉投稿

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

从USD场景到知识图谱：基于LLM的零样本本体接地

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

发表机构 * Technical University of Berlin（柏林工业大学）； Fraunhofer FOKUS（弗劳恩霍夫开放通信系统研究所）

AI总结研究利用大语言模型（LLM）零样本地将3D场景对象自动映射到本体类别，无需训练，在厨房场景中达到90-96%准确率，并揭示语义线索是关键。

Comments Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026

详情

AI中文摘要

用AI驱动的形式证明搜索推进数学研究

George Tsoukalas, Anton Kovsharov, Sergey Shirobokov, Anja Surina, Moritz Firsching, Gergely Bérczi, Francisco J. R. Ruiz, Arun Suggala, Adam Zsolt Wagner, Eric Wieser, Lei Yu, Aja Huang, Miklós Z. Horváth, Andrew Ferraiuolo, Henryk Michalewski, Edward Lockhart, Codrut Grosu, Thomas Hubert, Matej Balog, Pushmeet Kohli, Swarat Chaudhuri

发表机构 * Google DeepMind（谷歌DeepMind）； Aarhus University（奥胡斯大学）

AI总结本文研究了如何利用大型语言模型生成形式证明，以解决开放性数学问题，并展示了AI辅助形式证明搜索在数学研究中的应用和贡献。

2605.25985 2026-06-09 cs.AI 版本更新

Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables

面向多自由变量复杂逻辑查询的神经可扩展符号搜索框架

Weizhi Fei, Hang Yin, Zihao Wang, Shukai Zhao, Wei Zhang, Yangqiu Song

发表机构 * Department of Mathematical Sciences, Tsinghua University（清华大学数学科学系）； Squarepoint Capital（Squarepoint资本）； Department of Computer Science and Engineering, Hong Kong University of Science and Technology（香港科技大学计算机科学与工程系）； Department of Computer Sciences, University of Rochester（罗切斯特大学计算机科学系）

AI总结针对知识图谱上多自由变量复杂查询的联合排序难题，提出神经可扩展符号搜索（NS3）框架，通过预算约束和超节点合并近似联合排序，显著提升性能。

Comments 10 pages, 5 figures

AI中文摘要

复杂查询回答（CQA）是在不完整知识图谱（KG）上进行知识表示和推理的基本任务。回答带有$k$个自由变量的存在性一阶查询（即$\text{EFO}_k$查询）是一个关键但具有挑战性的问题，因为它需要对$\mathcal{E}^k$中的答案元组进行排序，其中$\mathcal{E}$表示KG的实体集。随着$k$的增长，这很快变得难以处理。因此，现有基准和方法依赖于单个变量的边际排序；然而，边际排序是元组真实联合排序的较差代理。基于$\text{EFO}_1$查询的神经符号搜索，我们提出了神经可扩展符号搜索（NS3），这是一个预算框架，无需枚举$\mathcal{E}^k$即可近似联合排序。NS3 (i) 回答边际化子查询以获得必要的候选集，(ii) 将多个自由变量合并为超节点，其域由动态预算$B$修剪和控制，以及(iii) 逐步将$\text{EFO}_k$查询简化为在预算缩减域上的$\text{EFO}_{k-1}$查询。在三个标准KG数据集上，NS3在保持强边际准确性的同时，显著提高了联合排序性能。我们进一步发布了一个联合排序基准，将现有的$\text{EFO}_1$数据集扩展到$k=3$，从而能够系统评估多变量查询。我们的代码提供在https://github.com/HKUST-KnowComp/NS3_KDD2026。

英文摘要

Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answering existential first-order queries with $k$ free variables (i.e., $\text{EFO}_k$ queries) is a crucial yet challenging problem, as it requires ranking answer tuples in $\mathcal{E}^k$, where $\mathcal{E}$ denotes the entity set of a KG. This quickly becomes intractable as $k$ grows. Consequently, existing benchmarks and methods rely on marginal rankings over individual variables; however, marginal rankings are a poor proxy for the true joint ranking of tuples. Building on neural symbolic search for $\text{EFO}_1$ queries, we propose Neural Scalable Symbolic Search (NS3), a budgeted framework that approximates joint ranking without enumerating $\mathcal{E}^k$. NS3 (i) answers marginalized sub-queries to obtain necessary candidate sets, (ii) merges multiple free variables into hypernodes whose domains are pruned and controlled by a dynamic budget $B$, and (iii) progressively reduces an $\text{EFO}_k$ query to an $\text{EFO}_{k-1}$ query over a budgeted reduced domain. Across three standard KG datasets, NS3 substantially improves joint ranking performance while retaining strong marginal accuracy. We further release a joint-ranking benchmark that extends existing $\text{EFO}_1$ datasets to $k=3$, enabling systematic evaluation of multi-variable queries. Our code is provided in https://github.com/HKUST-KnowComp/NS3_KDD2026.
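The budget idea in the abstract can be illustrated with a deliberately simplified sketch: prune each free variable to its top-$B$ candidates using marginal scores, then rank only the at most $B^k$ surviving tuples with a joint scorer. This is not the NS3 algorithm itself (which merges variables into hypernodes and reduces the query step by step); the function name, the toy scores, and the `score` callable are all hypothetical.

```python
from itertools import product


def budgeted_joint_ranking(candidates, score, budget):
    """Rank answer tuples without enumerating the full entity product.

    candidates: one dict per free variable, mapping entity -> marginal score.
    score:      callable giving a joint score for a tuple of entities.
    budget:     maximum number of candidates kept per variable (B).
    """
    # Keep only the top-B entities per variable according to marginal scores.
    pruned = [sorted(c, key=c.get, reverse=True)[:budget] for c in candidates]
    # Score at most B**k tuples instead of |E|**k.
    return sorted(product(*pruned), key=score, reverse=True)


# Toy two-variable query with made-up joint scores.
joint = {("a", "x"): 0.1, ("a", "y"): 0.9, ("b", "x"): 0.8, ("b", "y"): 0.2}
cands = [{"a": 0.9, "b": 0.8}, {"x": 0.8, "y": 0.9}]

full = budgeted_joint_ranking(cands, joint.__getitem__, budget=2)
tight = budgeted_joint_ranking(cands, joint.__getitem__, budget=1)
```

With `budget=1` only one tuple is scored, which shows the trade-off the budget controls: a smaller $B$ is cheaper but can prune tuples that a full joint ranking would place near the top.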

2606.08702 2026-06-09 cs.AI new submission

ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems

ConMem: 无训练多智能体系统中的结构化记忆引导自适应

Zhixun Tan, Qiang Chen, Tairan Huang, Xiu Su, Yi Chen

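Affiliations: Central South University; The Hong Kong University of Science and Technology

AI summary: Proposes ConMem, a framework that uses structured memory cards and a relation-aware memory graph to adapt multi-agent systems efficiently without additional training, improving performance on multiple benchmarks while reducing inference overhead.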

AI中文摘要

最近的进展通过基于记忆、技能和学习的方法改进了基于LLM的多智能体系统（MAS）的自适应能力，但这些方法仍受到噪声轨迹、记忆-技能关系建模不足以及对额外训练或高质量监督的依赖等挑战。为了解决这些限制，我们提出了ConMem，一个关系感知且无需训练的框架，通过跨经验协调实现高效的多智能体自适应。具体来说，ConMem将历史交互轨迹提炼为结构化记忆卡片，以捕获可重用的策略和线索，并将它们组织成关系感知的记忆图。在运行时，ConMem根据任务需求检索卡片，并通过卡片图协调它们以解决策略冲突并恢复其依赖关系。这些模块结合起来提供了结构化和关系感知的指导，使得多智能体系统能够实现鲁棒、轻量级的自适应，而无需额外训练。在多个基准测试和主流MAS架构上的大量实验表明，与现有记忆架构相比，ConMem取得了持续的性能提升，通过剪枝超过50%的扩展候选并减少超过80%的规划开销，提高了推理时的效率。我们的代码可在https://anonymous.4open.science/r/ConMemCode获取。

英文摘要

Recent advances have improved the adaptive capabilities of LLM-based multi-agent systems (MAS) through memory-, skill-, and learning-based approaches, yet these approaches remain challenged by noisy trajectories, insufficient modeling of memory-skill relations, and reliance on additional training or high-quality supervision. To address these limitations, we propose ConMem, a relation-aware and training-free framework that enables efficient multi-agent adaptation through cross-experience coordination. Specifically, ConMem distills historical interaction trajectories into structured memory cards to capture reusable strategies and cues, organizing them into a relation-aware memory graph. At runtime, ConMem retrieves cards according to task needs and coordinates them through the card graph to resolve strategy conflicts and recover their dependencies. Combined, these modules yield structured and relation-aware guidance, enabling robust, lightweight adaptation in multi-agent systems without additional training. Extensive experiments across multiple benchmarks and mainstream MAS architectures show consistent gains over existing memory architectures, with improved inference-time efficiency through pruning more than 50% of expanded candidates and reducing planning overhead by over 80%. Our code is available at https://anonymous.4open.science/r/ConMemCode.

2606.09037 2026-06-09 cs.AI cs.MA new submission

A Multi-Agent System for IPMSM Design Optimization via an FEA-AI Hybrid Approach

基于FEA-AI混合方法的IPMSM设计优化多智能体系统

Jinseong Han, Sunwoong Yang, Namwoo Kang

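Affiliations: Cho Chun Shik Graduate School of Mobility, KAIST; Department of Mechanical Engineering, Hanyang University; Narnia Labs

AI summary: Proposes an end-to-end automated IPMSM design optimization framework that combines RAG-based structured problem definition with an uncertainty-aware FEA-AI hybrid optimization pipeline, balancing computational cost against prediction reliability and outperforming FEA-only and AI-only methods under the same FEA budget.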

Comments 26 pages, 21 figures

AI中文摘要

内置永磁同步电机（IPMSM）设计需要平衡相互冲突的目标和多物理场约束，而现代优化工作流程面临三个瓶颈：手动问题设置、高有限元分析（FEA）成本以及在稀疏或分布外区域中不可靠的基于代理的搜索。为了解决这些限制，我们提出了一种端到端的自动化IPMSM设计优化框架，该框架将检索增强生成（RAG）用于结构化问题定义，与不确定性感知的FEA-AI混合优化流水线相结合。一个通过RAG连接到电机教科书的设计代理提供基于领域知识的选项和工程技巧，并编译优化卡和用于AI模型训练的试验设计计划。训练代理自动化电磁FEA，记录几何验证和求解器失败日志，使用基于方差分析的数据分析和LLM推理分析失败的几何形状，并调用设计采样代理重新定义设计空间并生成额外样本。优化代理执行基于遗传算法的搜索，具有不确定性驱动的切换：低不确定性候选由AI代理推理评估，而高不确定性和可靠性关键的帕累托前沿或前K候选由高保真FEA校正并用于迭代重训练。该框架将手动、依赖经验的配置转换为可重复的工作流程，平衡计算成本和预测可靠性。在匹配的高保真FEA预算下的实验结果表明，所提出的混合方法实现了更好的目标性能，同时保持低且可进一步降低的预测不确定性，优于受早期预算耗尽限制的纯FEA搜索和收敛到低置信度最优的纯AI搜索。

英文摘要

Interior permanent magnet synchronous motor (IPMSM) design requires balancing conflicting objectives and multi-physics constraints, while modern optimization workflows face three bottlenecks: manual problem setup, high finite element analysis (FEA) cost, and unreliable surrogate-based search in sparse or out-of-distribution regions. To address these limitations, we propose an end-to-end automated IPMSM design optimization framework that integrates retrieval-augmented generation (RAG) for structured problem definition with an uncertainty-aware FEA-AI hybrid optimization pipeline. A Design agent, connected to a motor textbook through RAG, provides domain-knowledge-based options and engineering tips, and compiles an optimization card and a design-of-experiments plan for AI-model training. A Training agent automates electromagnetic FEA, records geometry-validation and solver-failure logs, analyzes failed geometries using ANOVA-based data analysis and LLM reasoning, and invokes a Design Sampling agent to redefine the design space and generate additional samples. An Optimization agent performs GA-based search with uncertainty-driven switching: low-uncertainty candidates are evaluated by AI-surrogate inference, whereas high-uncertainty and reliability-critical Pareto-front or top-K candidates are corrected by high-fidelity FEA and reused for iterative retraining. The framework converts manual, experience-dependent configuration into a reproducible workflow that balances computational cost and prediction reliability. Experimental results under a matched high-fidelity FEA budget show that the proposed hybrid approach achieves better objective performance while maintaining low and further reducible predictive uncertainty, outperforming FEA-only search, which is limited by early budget exhaustion, and AI-only search, which converges to a low-confidence optimum.
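The uncertainty-driven switching described in the abstract can be sketched as follows. This is a minimal illustration under assumptions: `surrogate`, `fea`, and `threshold` are hypothetical stand-ins, the toy models are invented, and the Pareto-front and top-K rules mentioned in the abstract are left out.

```python
def evaluate_candidates(candidates, surrogate, fea, threshold):
    """Route each candidate to the surrogate or to the expensive solver.

    surrogate(x) returns (predicted_value, predictive_std);
    fea(x) returns a high-fidelity value (hypothetical stand-in for FEA).
    """
    results, retrain_set = [], []
    for x in candidates:
        mean, std = surrogate(x)
        if std > threshold:
            # High uncertainty: pay for the high-fidelity evaluation
            # and keep the sample for the next surrogate retraining round.
            value = fea(x)
            retrain_set.append((x, value))
            results.append((x, value, "fea"))
        else:
            # Low uncertainty: trust the surrogate prediction.
            results.append((x, mean, "surrogate"))
    return results, retrain_set


# Toy models: uncertainty grows with distance from the origin.
def toy_surrogate(x):
    return x * x, 0.1 * abs(x)


def toy_fea(x):
    return x * x + 0.05


results, retrain_set = evaluate_candidates(
    [0.5, 2.0, 3.0], toy_surrogate, toy_fea, threshold=0.15
)
sources = [source for _, _, source in results]
```

The threshold sets how much of the high-fidelity budget is spent: lowering it sends more candidates to the solver, raising it leans harder on the surrogate.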

2606.09751 2026-06-09 cs.AI cs.CL cs.HC new submission

The Post-AGI Economy: Superposition and the Second Fundamental Theorem of Welfare Economics

Elija Perrier

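Affiliations: Centre for Quantum Software & Information

AI summary: Addresses the challenges that autonomy, self-modification, and superposed preferences in a post-AGI economy pose to the classical second welfare theorem by proposing an autonomy-qualified second welfare theorem and giving conditions for decentralizability.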

2606.09122 2026-06-09 cs.SE cs.AI cs.ET cs.MA cs.NI cross-list

Autonomous Incident Resolution at Hyperscale: An Agentic AI Architecture for Network Operations

超大规模下的自主事件解决：面向网络运维的智能体AI架构

Arun Malik

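Affiliations: Arun Malik

AI summary: Proposes a multi-agent orchestration framework that uses hierarchical decomposition, skill invocation, knowledge encoding, and progressive autonomy to resolve more than 90% of common incidents autonomously in hyperscale cloud networks while preserving safety.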

Comments 7 pages, 6 figures

AI中文摘要

超大规模的云网络基础设施面临着独特的运维挑战，传统的人工驱动事件响应无法跟上故障的数量、速度和复杂性。本文提出了一种用于大规模网络运维中自主事件解决的智能体AI架构。我们的系统采用多智能体编排框架，其中专门的AI智能体协作检测、诊断和修复网络事件，无需人工干预。我们描述了架构原则，包括分层智能体分解、通过标准化协议的基于技能的工具调用、来自运维手册的结构化知识编码、具有安全边界的渐进自主性以及闭环验证。该架构已在主要云提供商的生产环境中部署，表明智能体AI系统能够在常见事件类别中实现超过90%的自主解决率，同时通过分层授权和回滚机制维护安全保证。我们讨论了设计权衡、故障模式以及从大规模运行自主AI智能体中获得的经验教训。

英文摘要

Cloud network infrastructure at hyperscale presents unique operational challenges where traditional human-driven incident response cannot keep pace with the volume, velocity, and complexity of failures. This paper presents an agentic AI architecture for autonomous incident resolution in large-scale network operations. Our system employs a multi-agent orchestration framework where specialized AI agents collaborate to detect, diagnose, and remediate network incidents without human intervention. We describe the architectural principles, including hierarchical agent decomposition, skills-based tool invocation via standardized protocols, structured knowledge encoding from operational runbooks, progressive autonomy with safety boundaries, and closed-loop verification. The architecture has been deployed in production at a major cloud provider, demonstrating that agentic AI systems can achieve autonomous resolution rates exceeding 90% for common incident categories while maintaining safety guarantees through layered authorization and rollback mechanisms. We discuss design tradeoffs, failure modes, and lessons learned from operating autonomous AI agents at scale.

2606.09610 2026-06-09 cs.RO cs.AI cross-list

Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning

基于多智能体强化学习的任意物体协同运输中的形状形成

Mohamed Sayed, Wolfram Burgard, Tanja Katharina Kaiser

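Affiliations: University of Technology Nuremberg

AI summary: Proposes a multi-agent reinforcement learning method that lets a multi-robot system autonomously form formations supporting objects of arbitrary shape and non-uniform mass distribution while avoiding obstacles, enabling reliable and generalizable cooperative transport.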

AI中文摘要

协同物体运输在众多领域（包括工业到家庭服务）中至关重要。一种流行的运输策略是将物体承载在多机器人系统之上。相应的任务通常通过将其分解为三个相互关联的子问题来解决：编队控制、协同导航和碰撞避免。现实世界物体带来的一个特殊挑战是其可能具有任意形状和非均匀质量分布，这需要机器人编队能够牢固支撑物体。在这项工作中，我们通过提出一种新颖的多智能体强化学习方法来解决运输此类现实世界物体时的模式形成控制挑战。我们的方法使多机器人系统能够自主定位在物体下方以支撑其重量，同时在编队过程中避免障碍物。我们在不同环境和不同数量机器人下的评估表明，我们的方法能够产生可靠形成平衡编队的策略，并泛化到杂乱场景以及具有复杂几何形状和非均匀质量分布的物体。

英文摘要

Cooperative object transportation is essential in numerous domains, ranging from industrial to domestic services. A popular transportation strategy is to carry objects on top of multi-robot systems. The corresponding task is typically solved by decomposing it into three interconnected subproblems: formation control, cooperative navigation, and collision avoidance. A particular challenge posed by real-world objects is their potentially arbitrary shape and non-uniform mass distribution, necessitating robot formations that securely support the object. In this work, we address the challenge of pattern formation control for transporting such real-world objects by proposing a novel multi-agent reinforcement learning approach. Our approach enables a multi-robot system to autonomously position itself underneath an object to support its weight while avoiding obstacles during the formation process. Our evaluations with diverse environments and varying numbers of robots show that our approach leads to policies that reliably produce balanced formations and generalize to cluttered scenes and objects with complex geometry and non-uniform mass distribution.

2512.20845 2026-06-09 cs.AI cs.MA replacement

MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs

MAR：多智能体反思提升大语言模型的推理能力

Onat Ozer, Yuchen Wang, Grace Wu, Daniel Dosti, Honghao Zhang, Vivi De La Rue

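Affiliations: University of California, Berkeley

AI summary: Proposes a multi-agent reflection framework that generates diverse reflections through multi-persona debate, addressing degeneration of thought in single-model reflection, and reaches 47% EM on HotPot QA and 82.7% accuracy on HumanEval.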

2601.19082 2026-06-09 cs.AI cs.CL cs.GT cs.LG cs.MA replacement

Payoff scaling shapes cooperation in LLM agents across languages

收益规模塑造跨语言LLM代理的合作行为

Trung-Kiet Huynh, Dao-Sy Duy-Minh, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han

Affiliations: Faculty of Information Technology, University of Science (HCMUS), Ho Chi Minh City, Vietnam; Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam; Vietnam National University – Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam; Luxembourg Institute of Science and Technology (LIST), Luxembourg; School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, United Kingdom

AI summary: Uses supervised classifiers to identify strategies in the repeated Prisoner's Dilemma and compares them with an evolutionary game theory baseline; finds that LLMs become more cooperative as payoffs grow, contrary to the evolutionary prediction, which points to the influence of alignment training and human-like reasoning patterns.

Comments 44 pages, 17 figures, 4 tables

AI中文摘要

大型语言模型（LLM）越来越多地被部署为自主代理，代表用户进行谈判、协调和行动。它们在这种环境中是否合作不再只是一个学术问题，而是人工智能治理的核心问题。我们从战略行为的角度出发，探究两个日常杠杆——利害关系的大小和描述交互的语言——如何塑造LLM在重复囚徒困境中采用的策略。我们不直接通过原始行动计数来解读合作，而是训练监督分类器来识别重复博弈的经典策略（始终合作、始终背叛、以牙还牙、赢-留-输-变），并将其作为观察LLM行为的透镜。为了了解在相同收益下策略分布应如何，我们推导了演化博弈论（EGT）基线，并将其与LLM数据进行比较。两种结果以揭示性的方式不一致：随着收益增加，演化理论预测背叛应占据主导，但LLM却向相反方向移动，变得更加合作——我们认为，这是对齐训练和LLM从训练数据中继承的人类推理模式的标志。我们进一步表明，这种情况并非前沿规模、专有模型所特有：它也出现在三个开放权重的较小LLM中。总体而言，我们的分析强调，收益设计和语言框架是强大但未被充分探索的引导LLM行为的杠杆，对评估、对齐和治理部署在高风险、多语言环境中的多代理AI系统具有直接影响。

英文摘要

Large language models (LLMs) are increasingly deployed as autonomous agents that negotiate, coordinate, and act on behalf of users. Whether they cooperate in such settings is no longer just an academic question, but a central issue for AI governance. We approach it from a strategic-behaviour angle, asking how two everyday levers - the size of what is at stake, and the language in which the interaction is described - shape the strategies LLMs adopt in a repeated Prisoner's Dilemma. Rather than reading cooperation off raw action counts, we train supervised classifiers to recognise the canonical strategies of repeated games (always cooperate, always defect, Tit-for-Tat, Win-Stay-Lose-Shift) and use them as a lens onto LLM behaviour. To know what the strategy distribution should look like under the same payoffs, we derive an evolutionary game theory (EGT) baseline and compare it with the LLM data. The two outcomes disagree in a revealing way: as stakes grow, evolutionary theory predicts that defection should take over the population, yet LLMs move in the opposite direction, becoming more cooperative - a signature, we argue, of alignment training and the human-like reasoning patterns LLMs inherit from their training data. We further show that this picture is not particular to frontier-scale, proprietary models: it also occurs with three open-weight smaller LLMs. Overall, our analysis highlights that payoff design and linguistic framing are powerful but under-explored levers for steering LLM behaviour, with direct implications for evaluating, aligning, and governing multi-agent AI systems deployed in high-stakes, multilingual environments.
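The four canonical strategies named in the abstract have standard definitions, sketched below. The code only generates reference trajectories of the kind a classifier could be trained on; the `play` helper and the `C`/`D` encoding are assumptions for illustration, not the authors' pipeline.

```python
C, D = "C", "D"  # cooperate, defect


def always_cooperate(my_hist, opp_hist):
    return C


def always_defect(my_hist, opp_hist):
    return D


def tit_for_tat(my_hist, opp_hist):
    # Cooperate first, then copy the opponent's previous move.
    return C if not opp_hist else opp_hist[-1]


def win_stay_lose_shift(my_hist, opp_hist):
    # Cooperate first. If the opponent cooperated last round ("win"),
    # repeat the previous move; otherwise switch to the other move.
    if not my_hist:
        return C
    if opp_hist[-1] == C:
        return my_hist[-1]
    return D if my_hist[-1] == C else C


def play(strategy_a, strategy_b, rounds):
    """Run a repeated game and return both move sequences as strings."""
    hist_a, hist_b = [], []
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        hist_a.append(move_a)
        hist_b.append(move_b)
    return "".join(hist_a), "".join(hist_b)


tft_vs_alld = play(tit_for_tat, always_defect, 4)
wsls_vs_alld = play(win_stay_lose_shift, always_defect, 4)
```

Against a constant defector the two reactive strategies leave different signatures (Tit-for-Tat locks into defection, Win-Stay-Lose-Shift alternates), which is why raw cooperation counts alone can be a weak indicator of the underlying strategy.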

2508.06336 2026-06-09 cs.LG cs.AI cs.HC cs.MA replacement

Unsupervised Partner Design Enables Robust Ad-hoc Teamwork

无监督伙伴设计实现鲁棒的临时团队协作

Constantin Ruhdorfer, Matteo Bortoletto, Victor Oei, Anna Penzkofer, Andreas Bulling

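Affiliations: University of Southampton

AI summary: Proposes Unsupervised Partner Design (UPD), which generates training partners dynamically and selects them adaptively using a learnability criterion, with no pretrained partner population or manual tuning; it performs strongly across several tasks and receives higher ratings in a human-AI interaction study.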

Comments 27 pages

2601.01279 2026-06-09 econ.TH cs.AI cs.CE cs.CL cs.GT replacement

Supracompetitive Pricing Under AI Monoculture

人工智能单一群体下的超竞争定价

Shengyu Cao, Ming Hu

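Affiliations: Rotman School of Management, University of Toronto

AI summary: Studies the supracompetitive pricing that can arise when competing sellers delegate pricing to a shared AI model; a duopoly analysis shows that configuring the AI model for robustness and reproducibility can produce supracompetitive pricing, with the market outcome depending on the initial pricing propensity.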

Comments 46 pages

AI中文摘要

当竞争卖家将定价委托给共享的AI模型（如大型语言模型）时，相关推荐结合性能驱动的更新，聚合卖家反馈，引发一个问题：标准的AI部署实践是否会无意中产生超竞争定价？本文开发了一个简化的双寡头模型，其中两个卖家从共享的AI模型中获得定价推荐，该模型由两个参数特征化：一个倾向参数捕捉模型设置高价的倾向，一个输出保真度参数衡量该倾向与实际输出的一致性，其中倾向通过定期重新训练在观察到的结果上更新。我们发现，配置AI模型以鲁棒性和可重复性可以导致超竞争定价通过相变。在临界输出保真度阈值以下，竞争性定价是唯一的稳定结果。在临界值以上，模型表现出双稳态：竞争性和超竞争性定价都是局部稳定的，实际结果取决于模型的初始倾向。超竞争性定价提高了平均价格，但偶尔的低价推荐使检测变得复杂。对于完美输出保真度，任何内部初始倾向都会导致完全价格协调。对于有限训练批次大小为b，当初始倾向位于超竞争性盆地时，随着b的增加，超竞争性定价的概率接近1，不确定结果区域以O(1/√b)的速率缩小。任何减少模型倾向与卖家实际定价之间一致性的因素，无论是通过多样化AI供应商、引入推荐噪声还是减少卖家的遵守，都会将市场推向竞争性结果。

英文摘要

When competing sellers delegate pricing to a shared AI model, such as a large language model, correlated recommendations combined with performance-driven updates aggregating seller feedback raise a key question: can standard AI deployment practices inadvertently produce supracompetitive pricing? We develop a stylized duopoly model in which two sellers receive pricing recommendations from a shared AI characterized by two parameters: a propensity parameter capturing the model's tendency to set high prices and an output-fidelity parameter measuring alignment between this tendency and actual outputs, with propensity updated via periodic retraining on observed outcomes. We find that configuring AI models for robustness and reproducibility can lead to supracompetitive pricing via a phase transition. Below a critical output-fidelity threshold, competitive pricing is the unique stable outcome. Above it, the model exhibits bistability: both competitive and supracompetitive pricing are locally stable, with the realized outcome determined by the model's initial propensity. Supracompetitive pricing raises average prices, but occasional low-price recommendations complicate detection. With perfect output fidelity, full price coordination emerges from any interior initial propensity. For finite training batches of size $b$, when the initial propensity lies in the supracompetitive basin, the probability of supracompetitive pricing approaches 1 as $b$ increases, with the region of indeterminate outcomes shrinking at rate $O(1/\sqrt{b})$. Any factor reducing alignment between the model's propensity and sellers' actual pricing, whether through diversifying AI providers, introducing recommendation noise, or reducing seller adherence, pushes the market toward competitive outcomes.

2602.06934 2026-06-09 cs.PL cs.AI cs.DC cs.LO cs.MA replacement

Implementing Grassroots Logic Programs with Multiagent Transition Systems and AI (Full Version)

基于多智能体转换系统和人工智能实现基础逻辑程序

Ehud Shapiro

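Affiliations: London School of Economics; Weizmann Institute of Science

AI summary: Presents two deterministic variants, dGLP and madGLP, implements shared variables through global links, proves their correctness, and shows how AI techniques can be used to implement multi-agent communication.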

AI中文摘要

Grassroots Logic Programs (GLP) 是一种并发逻辑编程语言，其中逻辑变量被划分为配对的读者和写者。一个赋值最多通过写者一次，其配对的读者最多一次消耗，可能包含额外的读者和/或写者。这使得丰富多向通信模态的简洁表达成为可能。该语言与并发（cGLP）和多智能体（maGLP）操作语义一起引入。本文从这些（ia）dGLP，cGLP的确定性对应物，和（ib）madGLP，一种多智能体对应物，其中确定性智能体仅通过异步消息传递通信，并证明它们的抽象对应物的正确性。maGLP跨越智能体的共享变量对可以作为本地变量通过全局链接配对，其正确性源于不重叠的替换交换性（GLP的单次出现不变量的结果）。我们进一步证明madGLP是基础的。dGLP和madGLP作为AI驱动的实现学科（数学→非正式规范→Dart）的形式规范被使用和描述：从dGLP，AI（Claude）开发了一个基于工作站的GLP实现，从madGLP正在开发一个基于智能手机的多智能体实现。

英文摘要

Grassroots Logic Programs (GLP) is a concurrent logic programming language in which logic variables are partitioned into paired readers and writers. An assignment is produced at most once via a writer and consumed at most once via its paired reader, and may contain additional readers and/or writers. This enables the concise expression of rich multidirectional communication modalities. The language was introduced together with concurrent (cGLP) and multiagent (maGLP) operational semantics. Here, we derive from these (i) dGLP, a deterministic counterpart of cGLP, and (ii) madGLP, a counterpart of maGLP in which deterministic agents communicate solely by asynchronous message passing, and prove them correct against their abstract counterparts. maGLP shared variable pairs spanning agents can be implemented as local variables paired by global links, with correctness following from disjoint substitution commutativity (a consequence of GLP's single-occurrence invariant). We further prove that madGLP is grassroots. Both dGLP and madGLP serve as formal specifications for an AI-driven implementation discipline (math $\to$ informal spec $\to$ Dart) employed and described here: from dGLP, AI (Claude) developed a workstation-based GLP implementation in Dart, and from madGLP it is developing a smartphone-based multiagent one.

2605.09823 2026-06-09 cs.MA cs.AI replacement

Diffusion-Based Neural TSP Solvers That Exploit Structural Constraints

Mickaël Basson, Philippe Preux

Affiliations: Université de Lille, France; CNRS, France; Inria, France; UMR 9189-CRIStAL, Lille, France

AI summary: Proposes Projected Consistency Inference (PCI), which replaces gradient refinement with structure-aware projection, reaching optimality gaps of 0.17% and 0.31% on TSP500 and TSP1000 and cutting inference time by 30-40%.

Journal ref: The 20th Learning and Intelligent OptimizatioN Conference (LION), Jun 2026, Milan, Italy

AI中文摘要

神经组合优化最近在欧几里得旅行商问题（TSP）上使用生成模型（如扩散和一致性模型）取得了强劲结果。最先进的方法如FT2T将基于一致性的快速预测与基于梯度的推理时细化相结合。然而，梯度搜索通常会产生显著的计算开销，并且可能与可行解的离散结构不一致。我们引入了投影一致性推理（PCI），这是一种即插即用、无需重新训练的替代方案，用结构感知投影替换梯度细化：PCI从一致性模型输出解码有效的哈密顿环，并应用轻量级局部搜索（例如2-opt）。PCI在500个城市的TSP上实现了0.17%的平均最优性差距（OG），在1000个城市的TSP上实现了0.31%，优于FT2T的最佳设置（OG分别为0.22%和0.36%），同时将推理时间减少了30%至40%。PCI还表现出更低的方差和内存使用，并且在快速生成解决方案方面可以超越经典启发式算法（如LKH3）。我们的结果表明，结构感知的推理时操作为神经TSP求解器提供了一条实用且原则性的路径，补充了训练时目标。

英文摘要

Neural combinatorial optimization has recently achieved strong results on the Euclidean Traveling Salesman Problem (TSP) using generative models such as diffusion and consistency models. State-of-the-art approaches like FT2T combine fast consistency-based prediction with gradient-based inference-time refinement. However, gradient search often incurs significant computational overhead and may not align with the discrete structure of feasible solutions. We introduce Projected Consistency Inference (PCI), a plug-and-play, retraining-free alternative that replaces gradient refinement with structure-aware projections: PCI decodes valid Hamiltonian tours from the consistency model output and applies a lightweight local search (e.g., 2-opt). PCI achieves an average optimality gap (OG) of 0.17% on TSP with 500 cities and 0.31% on TSP with 1000 cities, outperforming FT2T's best settings (OG 0.22% and 0.36%, respectively) while reducing inference time by 30 to 40%. PCI also exhibits lower variance and memory usage, and can surpass classical heuristics such as LKH3 in rapid solution generation. Our results demonstrate that structure-aware inference-time operations provide a practical and principled path for neural TSP solvers, complementing training-time objectives.
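The lightweight local search mentioned in the abstract, 2-opt, is a classical procedure and can be sketched on its own. The sketch below assumes a valid tour has already been decoded; the decoding step from the consistency model output is not shown, and the function names and toy instance are illustrative only.

```python
import math


def tour_length(points, tour):
    """Total length of a closed tour over 2D points."""
    n = len(tour)
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n)
    )


def two_opt(points, tour):
    """Apply improving 2-opt moves until no move shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share a city, nothing to swap
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Gain from replacing edges (a,b),(c,d) with (a,c),(b,d).
                delta = (
                    math.dist(a, c) + math.dist(b, d)
                    - math.dist(a, b) - math.dist(c, d)
                )
                if delta < -1e-12:
                    # Reversing the segment between the edges removes a crossing.
                    tour[i + 1 : j + 1] = reversed(tour[i + 1 : j + 1])
                    improved = True
    return tour


# Unit square visited in a self-crossing order.
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
initial = [0, 1, 2, 3]
refined = two_opt(points, initial)
```

Each accepted move keeps the tour a valid Hamiltonian cycle, which is the property that distinguishes this kind of refinement from gradient updates on a continuous relaxation.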

2606.09666 2026-06-09 cs.AI new submission

Frequency-based Constrained Sampling for Interval Patterns

基于频率的区间模式约束采样

Djawad Bekkoucha, Abdelkader Ouali, Bruno Crémilleux

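Affiliations: Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Université Paris-Saclay, CNRS; Université Caen Normandie, ENSICAEN, CNRS, Normandie Univ, GREYC UMR6072

AI summary: Proposes CFips, which integrates user-defined syntactic constraints directly into a multi-step sampling framework by decomposing them into elementary predicates on interval bounds, guaranteeing sampling proportional to frequency within the constrained pattern space; experiments show it completes mining tasks that otherwise fail by timing out.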

Comments 16 pages

2606.07562 2026-06-09 q-bio.BM cs.AI cross-list

The Montparnasse Algorithm for RNA Design

RNA设计的蒙帕纳斯算法

Tristan Cazenave

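Affiliations: Tristan Cazenave

AI summary: Proposes Montparnasse, a Monte Carlo search framework based on Generalized Nested Rollout Policy Adaptation that combines problem-specific priors with lexicographic multi-criteria evaluation; it is more than three times faster than the prior best method, DesiRNA, on the Eterna100 benchmark and outperforms LinearDesign at optimizing the secondary structure of hemoglobin alpha messenger RNA.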

2412.13858 2026-06-09 cs.AI cs.LG replacement

IDEQ -- Improving Diffusion Models for the Traveling Salesman Problem (TSP) by Leveraging the Structure of the Solution Space

IDEQ -- 利用解空间结构改进旅行商问题的扩散模型

Mickael Basson, Philippe Preux

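Affiliations: Université de Lille; CNRS; Inria; UMR 9189-CRIStAL

AI summary: Proposes IDEQ, which improves diffusion models for the TSP by exploiting the constrained structure of the solution space and a training objective based on a uniform distribution over 2-opt orbits, setting a new state of the art on synthetic instances and TSPlib and approaching the performance of LKH3.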

AI中文摘要

我们研究扩散模型求解旅行商问题。基于最近的DIFUSCO和T2TCO方法，我们提出IDEQ。IDEQ通过利用TSP状态空间的约束结构来提高解的质量。IDEQ的另一个关键组成部分是，将DIFUSCO课程学习的最后阶段替换为考虑哈密顿环上的均匀分布，这些环在2-opt算子下的轨道收敛到最优解作为训练目标。我们的实验表明，IDEQ在合成实例上改进了此类神经网络技术的现有水平。更重要的是，我们的实验表明，IDEQ在TSPlib（TSP社区的参考基准）的实例上表现非常好：它紧密匹配最佳启发式算法LKH3的性能，甚至在两个分别包含1577和3795个城市的TSPlib实例上能够获得比LKH3更好的解。IDEQ在500个城市的TSP实例上获得0.3%的最优性差距，在1000个城市的TSP实例上获得0.5%的最优性差距。这为基于神经网络的TSP求解方法设立了新的SOTA。此外，与DIFUSCO和T2TCO相比，IDEQ表现出更低的方差和更好的随城市数量扩展的能力。

英文摘要

We investigate diffusion models to solve the Traveling Salesman Problem. Building on the recent DIFUSCO and T2TCO approaches, we propose IDEQ. IDEQ improves the quality of the solutions by leveraging the constrained structure of the state space of the TSP. Another key component of IDEQ replaces the last stages of DIFUSCO's curriculum learning with a training objective defined as a uniform distribution over the Hamiltonian tours whose orbits under the 2-opt operator converge to the optimal solution. Our experiments show that IDEQ improves the state of the art for such neural-network-based techniques on synthetic instances. More importantly, our experiments show that IDEQ performs very well on the instances of the TSPlib, a reference benchmark in the TSP community: it closely matches the performance of the best heuristic, LKH3, and even obtains better solutions than LKH3 on two TSPlib instances with 1577 and 3795 cities. IDEQ obtains a 0.3% optimality gap on TSP instances with 500 cities, and 0.5% on TSP instances with 1000 cities. This sets a new SOTA for neural methods solving the TSP. Moreover, IDEQ exhibits lower variance and scales better with the number of cities than DIFUSCO and T2TCO.

2507.22876 2026-06-09 cs.AI cs.LO replacement

Discovering heuristics in a complex SAT solver with large language models

利用大型语言模型发现复杂SAT求解器中的启发式策略

Yiwen Sun, Furong Ye, Zhihan Chen, Ke Wei, Shaowei Cai

Affiliations: School of Data Science, Fudan University, Shanghai, China; Key Laboratory of System Software, Institute of Software, Chinese Academy of Sciences, Beijing, China; SeedMath Technology Limited, Beijing, China

AI summary: Proposes AutoModSAT, a framework that combines a modular solver design, unsupervised prompt optimization, and an evolutionary algorithm to optimize SAT solvers automatically with LLMs, improving performance by 40% across several datasets.

AI中文摘要

可满足性问题（SAT）是计算复杂性理论的基础，并具有广泛的工业应用。由于现代SAT求解器架构复杂，在现实环境中优化它们相当具有挑战性。尽管已经开发了自动配置框架，但它们依赖于手动约束的搜索空间。在这里，我们开发了AutoModSAT，一个使用大型语言模型（LLM）自动优化SAT求解器的框架。AutoModSAT结合了兼容LLM的模块化求解器设计、无监督提示优化以多样化生成的函数，以及基于预搜索策略和$(1+\lambda)$进化算法的高效搜索过程。在广泛的数据集上进行的大量实验表明，AutoModSAT相比基线求解器实现了40%的性能提升，相比最先进的求解器实现了30%的提升。此外，在大多数测试数据集上，AutoModSAT相比最先进求解器的参数调优替代方案也实现了显著的加速。这些结果证明了LLM引导的启发式发现用于优化复杂SAT求解器的潜力。

英文摘要

The Satisfiability problem (SAT) is fundamental in computational complexity theory and has a wide range of industrial applications. Optimizing modern SAT solvers in real-world settings is challenging because of their intricate architectures. While automatic configuration frameworks have been developed, they rely on manually constrained search spaces. Here we develop AutoModSAT, a framework that uses large language models (LLMs) to automatically optimize SAT solvers. AutoModSAT combines an LLM-compatible modular solver design, unsupervised prompt optimization to diversify generated functions, and an efficient search procedure based on a presearch strategy and a $(1+\lambda)$ evolutionary algorithm. Extensive experiments across a wide range of datasets demonstrate that AutoModSAT achieves a $40\%$ performance improvement over the baseline solver and a $30\%$ improvement over the state-of-the-art solvers. Moreover, AutoModSAT also attains a notable speedup over parameter-tuned versions of the state-of-the-art solvers on most of the test datasets. These results demonstrate the potential of LLM-guided heuristic discovery for optimizing complex SAT solvers.
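For readers unfamiliar with the $(1+\lambda)$ evolutionary algorithm named in the abstract, here is a generic sketch on bit strings. It is an assumption-laden illustration: in AutoModSAT the individuals are LLM-generated solver heuristics rather than bit strings, and the mutation operator, function name, and OneMax objective below are stand-ins chosen for simplicity.

```python
import random


def one_plus_lambda_ea(fitness, parent, lam, generations, rng):
    """Generic (1+lambda) EA: one parent, lam mutated offspring per generation."""
    n = len(parent)
    best = fitness(parent)
    for _ in range(generations):
        # Each offspring flips every bit independently with probability 1/n.
        offspring = [
            [bit ^ (rng.random() < 1.0 / n) for bit in parent]
            for _ in range(lam)
        ]
        candidate = max(offspring, key=fitness)
        candidate_fitness = fitness(candidate)
        # Elitist selection: the parent is replaced only by an offspring
        # that is at least as good, so fitness never decreases.
        if candidate_fitness >= best:
            parent, best = candidate, candidate_fitness
    return parent, best


# OneMax toy objective: maximize the number of ones.
rng = random.Random(0)
start = [0] * 20
best_bits, best_fitness = one_plus_lambda_ea(sum, start, lam=4, generations=200, rng=rng)
```

The "plus" in the name refers to the parent competing with its offspring, which gives the monotone improvement that makes the scheme attractive when each fitness evaluation (here a cheap sum, in the paper a solver run) is expensive.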

2601.06188 2026-06-09 cs.AI replacement

Dynamic Distributed Constraint Optimization and Metareasoning for Continual, Large-Scale Satellite Operations

面向持续大规模卫星运行的动态分布式约束优化与元推理

Itai Zilberstein, Steve Chien

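Affiliations: Carnegie Mellon University; California Institute of Technology; Jet Propulsion Laboratory

AI summary: For dynamic, large-scale satellite scheduling, proposes the dynamic distributed constraint optimization model DCOSP and a metareasoning framework that controls when to recompute, combined with the D-NSS algorithm to reach near-optimal solutions and clearly outperform baselines.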

Comments An earlier version titled "Large-Scale Continual Scheduling and Execution for Dynamic Distributed Satellite Constellation Observation Allocation" appears as an extended abstract in the Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

DOI: 10.65109/JCYH5778

AI中文摘要

随着地球观测卫星星座在规模和能力上的增长，分布式星载控制为新型响应和时效性测量提供了途径。然而，将自主性部署到卫星上需要高效的计算和通信。本文解决了在动态、大规模问题中调度数百颗卫星观测的挑战，涉及数百万个变量。我们提出了动态多卫星星座观测调度问题（DCOSP），这是一种新的动态分布式约束优化问题（DDCOP）形式化，集成了调度与执行。DCOSP具有新颖的最优性条件，为此我们构建了一个精确的全知离线算法。受星载操作强资源约束的启发，我们引入了一个在DDCOP中融入元推理的框架，该框架控制智能体何时消耗资源以重新计算解决方案。此外，我们提出了动态增量邻域随机搜索（D-NSS）算法，这是一种不完整的在线分解型DDCOP算法，通过修复局部子问题来响应动态事件。我们在逼真的仿真中证明，D-NSS收敛到近优解，在解质量、计算时间和消息量方面优于标准DDCOP基线，而我们的元推理框架成功地在资源节约与效用之间取得平衡。作为NASA FAME任务的一部分，这项工作为迄今为止最大规模的空间分布式多智能体AI演示奠定了基础。

英文摘要

As Earth-observing satellite constellations grow in size and capability, distributed onboard control offers a pathway to novel responses and time-sensitive measurements. However, deploying autonomy to satellites requires efficient computation and communication. This work addresses the challenge of scheduling observations for hundreds of satellites in a dynamic, large-scale problem with millions of variables. We present the dynamic multi-satellite constellation observation scheduling problem (DCOSP), a new formulation of dynamic distributed constraint optimization problems (DDCOP) that models integrated scheduling and execution. DCOSP features a novel optimality condition, for which we construct an exact omniscient offline algorithm. Motivated by the strong resource constraints of onboard satellite operations, we introduce a framework to incorporate metareasoning in DDCOPs that controls when agents expend resources to recompute solutions. In addition, we present the dynamic incremental neighborhood stochastic search (D-NSS) algorithm, an incomplete online decomposition-based DDCOP algorithm that repairs localized sub-problems in response to dynamic events. We demonstrate in realistic simulations that D-NSS converges to near-optimal solutions, outperforming standard DDCOP baselines in solution quality, computation time, and message volume, while our metareasoning framework successfully balances resource conservation with utility. As part of the NASA FAME mission, this work lays the foundation for the largest in-space demonstration of distributed multi-agent AI to date.

2606.06656 2026-06-09 cs.AI cs.LO replacement

A Study of Parallel Continuous Local Search

并行连续局部搜索研究

Cody J Christopher, Charles Gretton

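Affiliations: School of Computing, Australian National University

AI summary: Studies parallel Continuous Local Search (CLS) for satisfiability problems with symmetric pseudo-Boolean constraints; finds that redundant constraints inhibit convergence, that CLS quickly completes partial assignments in hybrid solving, and that local search converges rapidly to a stable distribution of solution quality because the objectives are saddle-dense.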

AI中文摘要

我们研究并行连续局部搜索（CLS）作为解决具有对称伪布尔（PB）约束的布尔可满足性问题的一种方法。这里，$n$变量PB可满足性问题被松弛为一个连续优化问题，其目标函数在$n$维超立方体上可微。对于可满足的实例，该优化问题的全局最小值对应于所讨论SAT问题的满足赋值。我们通过实证实验提出了几个新发现：（i）冗余约束会抑制而非加速收敛；（ii）CLS在混合设置中作为子求解器显示出前景，能快速完成部分赋值；（iii）由于鞍点密集的目标函数，局部搜索迅速收敛到解质量的稳定分布（即满足程度），此时额外的求解步骤收益递减。我们的发现为在现代加速硬件上使用CLS解决SAT问题提供了实用指导。

英文摘要

We study parallel Continuous Local Search (CLS) as a solution approach for Boolean satisfiability problems with symmetric pseudo-Boolean (PB) constraints. Here, the $n$-variable PB-satisfiability problem is relaxed to a continuous optimisation problem with a differentiable objective function on an $n$-dimensional hypercube. For satisfiable instances, the global minimisers of this optimisation problem correspond to satisfying assignments of the SAT problem at hand. We present several novel findings via empirical experiments: (i) redundant constraints can inhibit rather than accelerate convergence; (ii) CLS shows promise as a sub-solver in hybridised settings, quickly completing partial assignments; and (iii) local search rapidly converges to a stable distribution of solution quality (i.e., degree of satisfaction), due to saddle-dense objectives where additional solver steps yield diminishing returns. Our findings inform practical uses of CLS for SAT on modern accelerator hardware.

2606.07047 2026-06-09 cs.AI 版本更新

Front-to-Attractors: Modifying the Front-to-Front Heuristic in Bidirectional Search

Front-to-Attractors：修改双向搜索中的Front-to-Front启发式

Alvin Zou, Muhammad Suhail Saleem, Maxim Likhachev

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出Front-to-Attractors (F2A)启发式类，通过动态维护吸引子集替代完整前沿，在保持Front-to-Front信息性的同时大幅降低计算开销，实验显示相比F2F减少最多11.2倍成对评估，平均节点扩展比F2E少4.8倍。

AI中文摘要

启发式在双向搜索算法的性能中扮演核心角色，通常依赖两个主要类别。Front-to-end (F2E) 启发式估计从状态 s 到搜索目标（前向搜索的目标或后向搜索的起点）的距离。相比之下，Front-to-front (F2F) 启发式通过成对函数 h(s, s') 估计从 s 到对面搜索前沿的距离，其中 s' 遍历前沿状态。尽管 F2F 启发式通常信息量更大，从而减少节点扩展数量，但它们依赖大量的成对评估，导致显著的计算开销。为了解决这一限制，我们引入了一个新的启发式类——Front-to-attractors (F2A)，它在保留 F2F 大部分信息性的同时，大幅降低了计算成本。F2A 不是评估到对面前沿所有状态的距离，而是估计从 s 到对面搜索方向中一个小的、动态维护的吸引子集的距离。这些吸引子作为完整前沿的替代，使得在极少的计算开销下提供丰富的启发式指导，同时保持 F2F 提供的最优性保证。我们在多个领域评估了 F2A，结果显示，与 F2F 相比，它减少了最多 11.2 倍的成对评估，同时平均节点扩展比 F2E 少 4.8 倍。

英文摘要

Heuristics play a central role in the performance of bidirectional search algorithms, which commonly rely on two main classes. Front-to-end (F2E) heuristics estimate the distance from a state s to the target of the search (the goal for forward search or the start for backward search). In contrast, front-to-front (F2F) heuristics estimate the distance from s to the opposite search frontier using a pairwise function h(s, s'), where s' ranges over frontier states. Although F2F heuristics are typically more informative and therefore reduce the number of node expansions, their reliance on extensive pairwise evaluations incurs substantial computational overhead. To address this limitation, we introduce a new heuristic class, front-to-attractors (F2A), that preserves much of the informativeness of F2F while dramatically reducing its computational cost. Rather than evaluating distances to all states on the opposite frontier, F2A estimates the distance from s to a small, dynamically maintained set of attractors in the opposite search direction. These attractors serve as a surrogate for the full frontier, enabling rich heuristic guidance at a fraction of the computational expense while maintaining the optimality guarantees offered by F2F. We evaluate F2A across multiple domains and show that it reduces the number of pairwise evaluations by up to 11.2x compared to F2F, while achieving 4.8x fewer node expansions than F2E on average.
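
The cost difference between the heuristic classes can be illustrated with a small sketch. It assumes a grid domain with Manhattan distance as the pairwise function h(s, s'); the function names and the hand-picked attractor set are hypothetical. The abstract does not say how attractors are chosen or maintained, so the sketch only shows that evaluating against a smaller target set needs fewer pairwise calls, and makes no claim about optimality guarantees.

```python
def manhattan(a, b):
    # Assumed pairwise heuristic h(s, s') on a 4-connected grid.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def front_to_end(state, target):
    """F2E: a single evaluation against the fixed search target."""
    return manhattan(state, target), 1

def front_to_set(state, targets):
    """Minimum pairwise estimate over `targets`, with the evaluation count."""
    estimates = [manhattan(state, t) for t in targets]
    return min(estimates), len(estimates)

state = (0, 0)
goal = (9, 9)
opposite_frontier = [(5, 5), (2, 3), (9, 0), (4, 4), (7, 1), (3, 6)]
attractors = [(2, 3), (7, 1)]  # hypothetical subset standing in for the frontier

print(front_to_end(state, goal))               # F2E
print(front_to_set(state, opposite_frontier))  # F2F: one call per frontier state
print(front_to_set(state, attractors))         # F2A-style: one call per attractor
```

In this toy instance the attractor set happens to contain the closest frontier state, so the estimate matches F2F with a third of the evaluations; a poorly chosen set would give a looser estimate.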

2508.11874 2026-06-09 cs.GT cs.AI cs.DS cs.LO cs.PL 版本更新

Discovering Expert-Level Nash Equilibrium Algorithms with Large Language Models

利用大型语言模型发现专家级纳什均衡算法

Hanyu Li, Dongchen Li, Xiaotie Deng

发表机构 * CFCS, School of Computer Science, Peking University, Beijing, China（北京大学计算机学院前沿计算研究中心，中国北京）； School of Computing and Data Science, The University of Hong Kong, Pokfulam, Hong Kong（计算与数据科学学院，香港大学，薄扶林，香港）

AI总结提出LegoNE框架，将专家证明策略编码为符号语言，自动验证算法的最坏情况保证，结合推理型LLM重新发现并改进了多人博弈的近似纳什均衡算法。

Comments accepted by Nature Communications

DOI: 10.1038/s41467-026-74003-1

AI中文摘要

设计具有可证明最坏情况保证的近似纳什均衡（ANE）的多项式时间算法是算法博弈论中的一个基本开放问题。虽然大型语言模型（LLM）可以大规模生成候选算法，但验证最坏情况保证需要对所有博弈实例进行形式化分析——此前没有自动化系统能够完成这项任务。在这里，我们提出了LegoNE，一个将专家证明策略编码为符号语言的框架，该框架自动将任何候选算法编译成一个有限优化问题，以验证其最坏情况保证。将LegoNE与一个推理型LLM集成，我们重新发现了一个匹配双人博弈最佳多项式时间保证的算法，并发现了一个三人博弈算法，将最佳保证从$0.6+\delta$改进到$0.5+\delta$——这被证明超出了扩展技术（唯一已知的多玩家ANE设计范式）的能力范围。这些结果表明，将特定领域的证明策略编码为机器可处理的语言可以支持LLM驱动的算法发现，超越已知的人类设计范式。

英文摘要

Designing polynomial-time algorithms for approximate Nash equilibria (ANE) with provable worst-case guarantees is a fundamental open problem in algorithmic game theory. While large language models (LLMs) can generate candidate algorithms at scale, certifying worst-case guarantees requires formal analysis over all game instances -- a task for which no automated system previously existed. Here, we present LegoNE, a framework encoding expert proof strategies into a symbolic language that automatically compiles any candidate algorithm into a finite optimization problem certifying its worst-case guarantee. Integrating LegoNE with a reasoning LLM, we rediscovered an algorithm matching the best polynomial-time guarantee for two-player games, and discovered a three-player algorithm improving the best guarantee from $0.6+\delta$ to $0.5+\delta$ -- provably beyond the reach of the extension technique, the only previously known multi-player ANE design paradigm. These results show that encoding domain-specific proof strategies into a machine-tractable language can support LLM-driven discovery of algorithms outside known human design paradigms.

2601.01665 2026-06-09 cs.LG cs.AI 版本更新

Adversarial Instance Generation and Robust Training for Neural Combinatorial Optimization with Multiple Objectives

多目标神经组合优化的对抗实例生成与鲁棒训练

Wei Liu, Yaoxin Wu, Yingqian Zhang, Thomas Bäck, Yingjie Fan

发表机构 * LIACS, Leiden University, Leiden, The Netherlands（莱顿大学LIACS研究所，莱顿，荷兰）； Eindhoven University of Technology, Eindhoven, The Netherlands（埃因霍温理工大学，埃因霍温，荷兰）

AI总结提出面向多目标组合优化问题的偏好条件深度强化学习鲁棒性框架，通过偏好对抗攻击生成困难实例并量化影响，结合硬度感知偏好选择的对抗训练提升泛化性，在MOTSP、MOCVRP、MOKP上验证了攻击与防御的有效性。

AI中文摘要

深度强化学习（DRL）在解决多目标组合优化问题（MOCOPs）方面显示出巨大潜力。然而，这些基于学习的求解器的鲁棒性尚未得到充分探索，尤其是在多样化和复杂的问题分布上。在本文中，我们提出了一个面向偏好条件DRL求解器用于MOCOPs的统一鲁棒性导向框架。在该框架内，我们开发了一种基于偏好的对抗攻击，以生成暴露求解器弱点的困难实例，并通过由此导致的帕累托前沿质量下降来量化攻击影响。我们进一步引入了一种防御策略，将硬度感知偏好选择集成到对抗训练中，以减少对受限偏好区域的过拟合并提高分布外性能。在多目标旅行商问题（MOTSP）、多目标容量车辆路径问题（MOCVRP）和多目标背包问题（MOKP）上的实验结果验证了我们的攻击方法能够成功地为不同求解器学习困难实例。此外，我们的防御方法显著增强了神经求解器的鲁棒性和泛化能力，在困难或分布外实例上提供了优越的性能。

英文摘要

Deep reinforcement learning (DRL) has shown great promise in addressing multi-objective combinatorial optimization problems (MOCOPs). Nevertheless, the robustness of these learning-based solvers has remained insufficiently explored, especially across diverse and complex problem distributions. In this paper, we propose a unified robustness-oriented framework for preference-conditioned DRL solvers for MOCOPs. Within this framework, we develop a preference-based adversarial attack to generate hard instances that expose solver weaknesses, and quantify the attack impact by the resulting degradation on Pareto-front quality. We further introduce a defense strategy that integrates hardness-aware preference selection into adversarial training to reduce overfitting to restricted preference regions and improve out-of-distribution performance. The experimental results on multi-objective traveling salesman problem (MOTSP), multi-objective capacitated vehicle routing problem (MOCVRP), and multi-objective knapsack problem (MOKP) verify that our attack method successfully learns hard instances for different solvers. Furthermore, our defense method significantly strengthens the robustness and generalizability of neural solvers, delivering superior performance on hard or out-of-distribution instances.

2606.07577 2026-06-09 cs.AI cs.CV cs.SD eess.AS 新提交

OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs

OmniMem: 面向流式音视频大语言模型的扰动感知记忆压缩

Guangzhi Sun, Yixuan Li, Yudong Yang, Chao Zhang

发表机构 * Tsinghua University（清华大学）； ByteDance（字节跳动）； Department of Engineering, University of Cambridge（剑桥大学工程系）

AI总结提出OmniMem，一种针对音视频LLM的流式记忆压缩框架，通过模态感知分配和扰动感知选择压缩KV缓存，在保持长视频理解的同时减少内存，在多个基准上提升2-4%准确率。

Comments Code: https://github.com/bytedance/SALMONN/tree/omni_mem

AI中文摘要

音视频大语言模型（LLMs）在长视频理解方面具有强大潜力，但其长视频推理从根本上受到视频令牌和键值（KV）缓存线性增长的制约。我们提出OmniMem，一种专为音视频LLMs设计的内存高效流式框架。与将所有令牌统一处理的现有压缩方法不同，OmniMem引入了一种模态感知的内存分配策略，分别管理视觉和音频上下文，解决了两种模态之间的严重令牌不平衡问题。OmniMem进一步通过扰动感知的内存选择保留信息丰富且非冗余的KV状态，实现紧凑内存而不牺牲长程理解。为了在现实部署约束下加强压缩，我们还探索了预算感知微调，鼓励模型将有用信息整合到保留内存中。在VideoMME Long、LVBench和LVOmniBench上使用video-SALMONN 2+和Qwen-2.5-Omni的实验表明，在相同内存预算下，OmniMem始终比强训练无关压缩基线提高2-4%的绝对准确率，微调后额外提高1-2%。

英文摘要

Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet their long-video inference is fundamentally limited by the linear growth of video tokens and key-value (KV) caches. We present OmniMem, a memory-efficient streaming framework designed specifically for audio-visual LLMs. Unlike existing compression methods that treat all tokens uniformly, OmniMem introduces a modality-aware memory allocation strategy that separately manages visual and audio contexts, addressing the severe token imbalance between the two modalities. OmniMem further preserves informative and non-redundant KV states through perturbation-aware memory selection, enabling compact memory without sacrificing long-range understanding. To strengthen compression under realistic deployment constraints, we also explore budget-aware fine-tuning, which encourages the model to consolidate useful information into retained memory. Experiments on VideoMME Long, LVBench, and LVOmniBench with video-SALMONN 2+ and Qwen-2.5-Omni show that OmniMem consistently improves over strong training-free compression baselines by 2-4% absolute accuracy under the same memory budgets, with an additional 1-2% gain after fine-tuning.

2606.07720 2026-06-09 cs.AI cs.CL cs.LG 新提交

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

为什么将残差流限制在层而不是令牌？用于连续潜在推理的持久记忆

Mujtaba Farhan, Maheep Chaudhary

发表机构 * University of Cambridge（剑桥大学）

AI总结针对CoCoNuT在潜在空间推理中因中间隐藏状态被覆盖导致概念瓶颈的问题，提出AGCLR模型，通过门控概念流持久记忆机制，在GSM8K、HotpotQA和ProsQA上取得一致提升。

AI中文摘要

在机制可解释性中，注意力头通常被提升为角色声称（例如，“这个头表示加法”），当它们对某个行为是必要的、线性编码该行为，并且在消融后恢复该行为时。我们证明这种证据是不充分的：在三个7-8B指令微调模型和五个计算家族中，通过所有三个检查的头在匹配控制下将其激活修补到不同提示时，通常无法传递计算。我们引入KID（知道/意图/做），一个注意力头的角色分配视角，并将其与一个三阶段流程配对：能力选择性筛选（CSS）、奇异值分解（SVD）和匹配控制下的激活转导。我们的结果记录了一个初步的角色分类（包括提示轨迹稳定器、答案侧logit偏置头和软计算模式载体），并表明相同答案控制（一个共享答案字符串但不共享请求计算的转导目标）是一种未被充分利用的检查，它暴露了伪装成语义特异性的广泛状态转移。

英文摘要

In mechanistic interpretability, attention heads are commonly elevated to role claims (e.g., "this head represents addition") when they are necessary for a behavior, encode it linearly, and recover that behavior when restored after ablation. We show this evidence is insufficient: across three 7-8B instruction-tuned models and five computation families, heads passing all three checks routinely fail to transfer the computation when their activations are patched into a different prompt under matched controls. We introduce KID (Knowing / Intent / Doing), a role-assignment lens for attention heads, and pair it with a three-stage pipeline: capability-selective screening (CSS), singular value decomposition (SVD), and activation transduction under matched controls. Our results document a preliminary role taxonomy (including prompt-trajectory stabilizers, answer-side logit-bias heads, and soft computation-pattern carriers) and show that the same-answer control (a transduction target sharing the answer string but not the requested computation) is an underused check that exposes broad state transfer masquerading as semantic specificity.

2606.08312 2026-06-09 cs.AI cs.FL 新提交

Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies

自回归强化学习策略中LTLf约束的神经符号注入

Ashkan Ansarifard, Matteo Mancanelli, Elena Umili, Fabio Patrizi

发表机构 * Sapienza University of Rome（罗马大学）

AI总结提出神经符号框架，将LTLf约束编译为DFA并通过可微损失注入Transformer策略，在导航任务中提升约束满足且保持回报竞争力。

Comments Accepted at the Joint Workshop on Statistics and Knowledge Integration for Logic, Learning, Ethical Decisions, and LLMs (SKILLED-LLMs 2026), co-located with KR 2026 and FLoC 2026, Lisbon, Portugal

AI中文摘要

在这项工作中，我们研究了在有限迹线性时序逻辑（LTLf）表达的时延任务约束下的离线强化学习（RL）。最近，基于Transformer的方法如Trajectory Transformers和Decision Transformers已被采用，将RL视为序列建模问题。然而，这些方法纯粹优化奖励，不考虑高层时序需求。在此，我们引入一个神经符号框架，将LTLf背景知识注入到这类基于Transformer的RL策略中。我们的方法将LTLf公式编译为确定性有限自动机（DFA），并通过可微表示和基于逻辑的损失函数将其整合到学习过程中。特别地，我们从DFA进展中推导出可微的满足信号，并将其作为训练过程中的正则化项。最终的方法在不同模型间是架构无关的。我们在具有覆盖安全性和可达性时序属性组合的规范套件的导航环境中评估所提出的框架。实验结果表明，融入背景知识不仅提高了约束满足，而且与普通基线相比保持了有竞争力的回报。

英文摘要

In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic over finite traces (LTLf). Recently, transformer-based approaches such as Trajectory Transformers and Decision Transformers have been adopted to address RL as a sequence modeling problem. However, these methods optimize purely for reward and do not account for high-level temporal requirements. Here, we introduce a neurosymbolic framework that injects LTLf background knowledge into such transformer-based RL policies. Our approach compiles LTLf formulas into deterministic finite automata (DFAs) and integrates them into the learning process through a differentiable representation and a logic-based loss function. In particular, we derive differentiable satisfaction signals from DFA progression and use them as a regularization term during training. The resulting method is architecture-agnostic across different models. We evaluate the proposed framework on navigation environments with specification suites covering combinations of safety and reachability temporal properties. Experimental results show that incorporating background knowledge not only improves constraint satisfaction, but also maintains competitive return compared to vanilla baselines.
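
DFA progression and a soft satisfaction signal can be sketched as follows. The automaton is written by hand for the property "eventually reach the goal and never touch a hazard" rather than compiled from an LTLf formula, and the probabilistic relaxation (propagating a distribution over DFA states) is one common way to obtain a differentiable signal; the abstract does not specify the authors' representation, so every name below is an assumption.

```python
# Hand-written DFA for: eventually `goal`, and never `hazard` (hypothetical example).
STATES = ("searching", "done", "failed")
ACCEPTING = {"done"}
DELTA = {
    ("searching", "empty"): "searching",
    ("searching", "goal"): "done",
    ("searching", "hazard"): "failed",
    ("done", "empty"): "done",
    ("done", "goal"): "done",
    ("done", "hazard"): "failed",
    ("failed", "empty"): "failed",
    ("failed", "goal"): "failed",
    ("failed", "hazard"): "failed",
}

def accepts(trace):
    """Hard progression: run the DFA over a finite trace of symbols."""
    state = "searching"
    for symbol in trace:
        state = DELTA[(state, symbol)]
    return state in ACCEPTING

def acceptance_probability(step_probs):
    """Soft progression: each step is a dict mapping symbol -> probability."""
    dist = {q: 0.0 for q in STATES}
    dist["searching"] = 1.0
    for probs in step_probs:
        nxt = {q: 0.0 for q in STATES}
        for q, mass in dist.items():
            for symbol, p in probs.items():
                # Move probability mass along the DFA transition for this symbol.
                nxt[DELTA[(q, symbol)]] += mass * p
        dist = nxt
    return sum(dist[q] for q in ACCEPTING)

print(accepts(["empty", "goal", "empty"]))
print(acceptance_probability([{"goal": 0.5, "empty": 0.5}] * 2))
```

Under these assumptions, a regulariser such as `1 - acceptance_probability(...)` could be added to the usual sequence-modeling loss; in an autodiff framework the same sums and products would carry gradients back to the policy's predicted symbol probabilities.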

2606.08432 2026-06-09 cs.AI 新提交

Trajectory-Refined Distillation

轨迹精炼蒸馏

Li Jiang, Haoran Xu, Yichuan Ding, Amy Zhang

发表机构 * McGill University（麦吉尔大学）； Mila Quebec AI Institute（米拉魁北克人工智能研究所）； UT Austin（德克萨斯大学奥斯汀分校）

AI总结提出轨迹精炼蒸馏（TRD），通过教师指导修正学生轨迹中的前缀错误，解决在线策略蒸馏中的前缀失败问题，提升大语言模型的单次准确率和推理覆盖。

Comments under review

AI中文摘要

在线策略蒸馏（OPD）已成为大型语言模型（LLM）的重要后训练工具，它沿着学生自身的生成轨迹提供密集的逐词教师监督。在这项工作中，我们识别出OPD中一个常见的结构性问题，称为前缀失败。在前缀失败下，密集的逐词监督会导致双峰教师混合和碎片化梯度，而词级损失截断或重加权无法解决这一问题。这一观察促使我们超越词级损失干预，转向轨迹级输出修正。因此，我们提出轨迹精炼蒸馏（TRD），一种轨迹级修正方法，在教师指导下，于在线策略支持范围内修正学生的生成轨迹。通过在蒸馏前修正有问题的前缀，TRD从根源上缓解了前缀失败。此外，即使原始轨迹已经正确，TRD也能通过教师指导让学生接触到替代的有效推导，从而改善探索。TRD还可应用于在线策略自蒸馏（OPSD），这是一种使用基于特权信息的学生模型作为教师的参数共享变体。在多个尺度的广泛基准和基础模型上，TRD始终优于先前基线，提高了单次尝试准确率并扩展了推理覆盖范围。代码可在 https://github.com/louieworth/trd 获取。

英文摘要

On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher supervision along the student's own rollouts. In this work, we identify a common structural problem in OPD, which we call prefix failure. Under prefix failure, dense per-token supervision induces a bimodal teacher mixture and fragmented gradients that token-level loss truncation or reweighting fail to address. This observation motivates us to move beyond token-level loss interventions toward trajectory-level output corrections. We thus propose Trajectory-Refined Distillation (TRD), a trajectory-level correction method that revises the student's rollout under teacher guidance while remaining within on-policy support. By correcting problematic prefixes before distillation, TRD mitigates prefix failure at its source. Moreover, TRD improves exploration by exposing the student to alternative valid derivations under teacher guidance, even when the original rollouts are already correct. TRD can also be applied to on-policy self-distillation (OPSD), a parameter-sharing variant that uses the student model conditioned on privileged information as the teacher. Across a wide range of benchmarks and base models at multiple scales, TRD consistently outperforms prior baselines, improving single-attempt accuracy and broadening reasoning coverage. Code is available at https://github.com/louieworth/trd

2606.08491 2026-06-09 cs.AI 新提交

What Makes a Desired Graph for Relational Deep Learning?

什么构成了关系深度学习的理想图？

Yao Cheng, Siqiang Luo

AI总结研究发现，从数据库模式直接导出的图存在信息过载和语义碎片化问题，通过过滤和注入操作平衡可提升性能，并开发了自动优化器。

Comments This article has been accepted by ICML 2026

AI中文摘要

关系深度学习（RDL）将关系数据库（RDB）转换为异构图，但直接从数据库模式导出的图通常不适合图神经网络（GNN）进行关系推理的方式。我们研究了什么使关系图适合深度学习，并表明模式派生图存在两个系统性失败：信息过载和语义碎片化。我们的实证分析表明，理想的图不是原始模式，而是受控结构适应的结果。性能取决于平衡两种操作：通过过滤减轻信息过载，以及通过注入修复语义碎片。具体而言，过滤作为具有非单调效应的偏差-方差旋钮，而注入仅在明确恢复原始模式中缺失的关系依赖时才能提高性能。基于这些发现，我们开发了一个端到端结构优化器，应用这两种操作自动适应关系图。在涵盖分类、回归和推荐的26个任务中，优化后的图在通常降低推理成本的同时持续提高了准确性。

英文摘要

Relational deep learning (RDL) converts relational databases (RDBs) into heterogeneous graphs, but graphs derived directly from database schemas are often not well suited for how graph neural networks (GNNs) perform relational reasoning. We study what makes a relational graph suitable for deep learning and show that schema-derived graphs suffer from two systematic failures: information overload and semantic fragmentation. Our empirical analysis reveals that the desired graph is not the raw schema, but a result of controlled structural adaptation. Performance depends on balancing two operations: mitigating information overload via filtering, and repairing semantic fragmentation via injection. Specifically, filtering serves as a bias-variance knob with non-monotonic effects, while injection improves performance only when it explicitly restores the relational dependencies missing from the original schema. Based on these findings, we develop an end-to-end structural optimizer that applies both operations to adapt relational graphs automatically. Across 26 tasks spanning classification, regression, and recommendation, the optimized graphs consistently improve accuracy while often reducing inference cost.

2606.08497 2026-06-09 cs.AI cs.CL 新提交

Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets

解释黑盒语言模型：学习优化语言结构化的单词子集

Minyoung Hwang, Seokhyun Lee, Changhee Lee

发表机构 * Korea University（高丽大学）

AI总结针对黑盒语言模型解释的三个关键需求（推理效率、黑盒兼容性、语言结构可解释性），提出一种通过强化学习选择信息性单词子集的方法，实现高效、无梯度且语言连贯的解释。

Comments KDD 2026 Research Track

DOI: 10.1145/3770855.3817677

AI中文摘要

随着深度语言模型（DLMs）在医疗保健等高风险领域中的部署日益增多，理解其决策依据对于确保信任、安全和问责变得至关重要。然而，当这些DLMs作为黑盒系统（例如通过API）运行时，访问内部模型状态（如参数、梯度）受到限制，实现这一关键的可解释性水平尤其具有挑战性。尽管付出了诸多努力，现有的解释方法往往无法同时满足三个关键需求：（i）推理时效率，（ii）黑盒兼容性且不引发分布外行为，以及（iii）基于输入语言结构的可理解解释。为了解决这些挑战，我们提出了一种方法，通过选择一小部分信息丰富的输入单词来解释DLM的预测。我们将其表述为一个摊销优化问题，从而无需针对特定输入进行搜索即可实现高效的一次性推理。我们的选择策略通过REINFORCE风格策略梯度进行训练，允许在完全无梯度的设置中进行离散单词选择。为了增强可解释性并与人类语言直觉对齐，我们将图结构知识整合到这一选择过程中，促进语言连贯的子集，从而产生对最终用户既高度信息丰富又具有认知意义的解释。我们在多种DLM架构和多个真实世界数据集上评估了我们的方法。它一致地识别出具有增强判别能力和与语言显著线索更强对齐的单词子集，优于传统的黑盒兼容方法和基于梯度的方法（后者被赋予黑盒模型梯度的oracle访问权限，以构成更具挑战性的基准）。我们的代码可在以下地址获取：here。

英文摘要

As deep language models (DLMs) are increasingly deployed in high-stakes domains such as healthcare, understanding their decision rationale becomes paramount for ensuring trust, safety, and accountability. However, achieving this vital level of interpretability is particularly challenging when these DLMs operate as black-box systems (e.g., via APIs), where access to internal model states (e.g., parameters, gradients) is restricted. Despite numerous efforts, existing explanation methods often fail to concurrently satisfy three key desiderata: (i) inference-time efficiency, (ii) black-box compatibility without inducing out-of-distribution behavior, and (iii) comprehensible explanations grounded in the input's linguistic structure. To address these challenges, we propose a method that explains predictions of DLMs by selecting a small, informative subset of input words. We formulate this as an amortized optimization problem, enabling efficient one-shot inference without the need for input-specific search. Our selection policy is trained via REINFORCE-style policy gradients, allowing discrete word selection in a fully gradient-free setting. To enhance interpretability and align with human linguistic intuition, we integrate graph-structured knowledge into this selection process, fostering linguistically coherent subsets that result in explanations both highly informative and cognitively meaningful to end-users. We evaluated our method on diverse DLM architectures and multiple real-world datasets. It consistently identifies word subsets with enhanced discriminative power and stronger alignment with linguistically salient cues, outperforming both conventional black-box compatible methods and gradient-based approaches that are given oracle access to the black-box model's gradients for a more challenging benchmark. Our code is available at here.

2606.08543 2026-06-09 cs.AI 新提交

PAEC: Position-Aware Entropy Calibration for LLM Reasoning in RLVR

PAEC：面向RLVR中LLM推理的位置感知熵校准

Shumeng Yang, Yisu Liu, Jiayi Zheng, Zhaohui Yang, Linjing Li

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； School of Artificial Intelligence, Beijing University of Posts and Telecommunications（北京邮电大学人工智能学院）； Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）； School of Computer Science and Technology, University of Chinese Academy of Sciences（中国科学院大学计算机科学与技术学院）

AI总结提出位置感知熵校准（PAEC），通过局部top-p熵和top-2候选竞争构建软掩码，并施加基于锚点的下界惩罚，防止决策相关位置熵崩溃，提升数学推理性能。

Comments 22 pages, 7 figures

AI中文摘要

基于可验证奖励的强化学习（RLVR）改进了大语言模型的推理能力，但常常导致策略熵快速崩溃，即策略过早地集中在狭窄的高概率推理路径上。虽然全局熵正则化可以鼓励探索，但均匀增加所有标记位置的熵对于长推理轨迹而言效率低下，因为许多标记与决策无关。我们提出位置感知熵校准（PAEC），一种标记级熵管理框架，它从局部top-p熵和top-2候选竞争中构建软掩码，并应用基于锚点的下界惩罚来防止选定位置的熵崩溃。在五个数学推理基准上的实验表明，PAEC在强RLVR基线上提高了宏观平均多数投票性能，在AIME风格任务上取得了明显收益。我们的结果表明，推理RL中的熵管理应被表述为对决策敏感位置的选择性探索分配，而非均匀的随机性注入。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) improves large language model reasoning but often suffers from rapid policy-entropy collapse, where the policy prematurely concentrates on narrow high-probability reasoning paths. While global entropy regularization can encourage exploration, uniformly increasing entropy across all token positions is inefficient for long reasoning trajectories, where many tokens are not decision-relevant. We propose Position-Aware Entropy Calibration (PAEC), a token-level entropy-management framework that constructs a soft mask from local top-p entropy and top-two candidate competition, and applies an anchor-based lower-bound penalty to prevent selected-position entropy collapse. Experiments on five mathematical reasoning benchmarks show that PAEC improves macro-average majority-vote performance over strong RLVR baselines, with clear gains on AIME-style tasks. Our results suggest that entropy management in reasoning RL should be formulated as selective exploration allocation over decision-sensitive positions rather than uniform randomness injection.
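
The two per-position signals named above can be computed as in the sketch below. The abstract does not give the formulas, so the definitions here are assumptions: top-p entropy is taken as the Shannon entropy of the renormalised nucleus, top-two competition as the ratio of the second to the first probability, and the soft mask as a simple product of the two. The anchor-based penalty is left out because the abstract does not define the anchor.

```python
import math

def top_p_entropy(probs, p=0.9):
    """Entropy (nats) of the smallest top-ranked prefix whose mass reaches p."""
    nucleus, cumulative = [], 0.0
    for q in sorted(probs, reverse=True):
        nucleus.append(q)
        cumulative += q
        if cumulative >= p:
            break
    z = sum(nucleus)
    # Renormalise inside the nucleus; zero-probability tokens contribute nothing.
    return -sum((q / z) * math.log(q / z) for q in nucleus if q > 0.0)

def top2_competition(probs):
    """Ratio of runner-up to top probability: near 1 means a contested choice."""
    first, second = sorted(probs, reverse=True)[:2]
    return second / first if first > 0.0 else 0.0

def soft_mask(probs, p=0.9, entropy_scale=math.log(2)):
    # Hypothetical combination: scaled, clipped entropy times top-two competition.
    h = min(1.0, top_p_entropy(probs, p) / entropy_scale)
    return h * top2_competition(probs)

confident = [0.97, 0.01, 0.01, 0.01]  # routine token, little to decide
contested = [0.5, 0.5, 0.0, 0.0]      # two candidates compete

print(soft_mask(confident), soft_mask(contested))
```

A mask of this kind would weight entropy regularisation toward contested positions and leave routine tokens largely untouched, which is the selective allocation the abstract argues for.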

2606.08601 2026-06-09 cs.AI 新提交

InA-Probe: Instruction-Aware Active Probing for Time Series Forecasting with LLMs

InA-Probe：面向LLM时间序列预测的指令感知主动探测

Peiliang Gong, Emadeldeen Eldele, Chenyu Liu, Ziyu Jia, Yi Ding, Xinliang Zhou, Lianchao Gu, Qi Zhu, Yang Liu, Daoqiang Zhang, Xiaoli Li

发表机构 * Nanyang Technological University（南洋理工大学）； Khalifa University（哈利法大学）； Nanjing University of Aeronautics and Astronautics（南京航空航天大学）； Singapore University of Technology and Design（新加坡科技设计大学）

AI总结提出指令感知主动探测（InA-Probe），通过多级指令注入和自适应查询生成，结合双阶段注意力机制，在7个基准上超越现有方法，跨域误差降低37%。

AI中文摘要

大型语言模型（LLMs）近期在时间序列预测中展现出令人瞩目的潜力。然而，现有方法主要依赖被动模态对齐或静态任务重编程，往往难以捕捉细粒度的非平稳时间模式或适应细微的任务意图。本文提出指令感知主动探测（InA-Probe），将范式从被动对齐转向主动的指令驱动探测机制。具体而言，我们设计了一种多级指令注入机制，为模型注入全局任务目标和细粒度的补丁级语义先验。在此基础上，自适应查询生成模块生成样本特定的探测，这些探测由时间上下文动态调制。随后，这些探测通过双阶段注意力过程进行精炼：首先通过指令感知自注意力内化任务特定意图，然后通过时间交叉注意力审查询问投影的时间表示以提取显著模式。在七个真实世界基准上的全面实验表明，InA-Probe在统一泛化和零样本迁移中均持续优于最先进的深度学习和基于LLM的基线，在具有挑战性的跨域场景中预测误差降低高达37%。消融研究进一步证实，自适应查询与细粒度指令之间的协同作用是解锁LLM推理能力以处理复杂时间序列的关键。

英文摘要

Large Language Models (LLMs) have recently demonstrated impressive potential for time series forecasting. However, existing methods predominantly rely on passive modality alignment or static task reprogramming, which often fail to capture fine-grained, non-stationary temporal patterns or to adapt to nuanced task intents. In this paper, we propose Instruction-aware Active Probing (InA-Probe), which shifts the paradigm from passive alignment toward an active, instruction-driven probing mechanism. Specifically, we design a Multi-Level Instruction Injection mechanism that enriches the model with both global task objectives and fine-grained, patch-level semantic priors. Building on this, an Adaptive Query Generation module produces sample-specific probes that are dynamically modulated by the temporal context. These probes are then refined through a dual-stage attention process: they first internalize task-specific intents via Instruction-Aware Self-Attention, and subsequently interrogate the projected temporal representations through Temporal Cross-Attention to extract salient patterns. Comprehensive experiments on seven real-world benchmarks show that InA-Probe consistently outperforms state-of-the-art deep learning and LLM-based baselines, excelling in both one-for-all generalization and zero-shot transfer while reducing forecasting error by up to 37% in challenging cross-domain scenarios. Ablation studies further confirm that the synergy between adaptive querying and fine-grained instructions is key to unlocking the reasoning power of LLMs for complex time series.

2606.08800 2026-06-09 cs.AI 新提交

Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution

通过自进化桥接专家知识与自动化特征工程

Varun Khurana, Vijval Ekbote, Vashu Chauhan, Yaman Kumar Singla, Rajiv Ratn Shah, Balaji Krishnamurthy

发表机构 * Adobe Media and Data Science Research（Adobe媒体与数据科学研究）； IIIT-Delhi（德里英德拉普拉斯塔信息技术学院）

AI总结提出FEST方法，结合双流特征生成、语义去重和树引导迭代进化，从原始文本和图像中发现可审计特征，在品牌分类等任务中平均提升4.2个百分点，并实现60-80%的专家特征覆盖。

AI中文摘要

在品牌合规、临床护理和内容审核等高风险场景中，机器学习不能作为不透明的预言机部署：从业者需要检查驱动模型决策的特征，模型必须利用管理这些领域的专家文档。实际上，数据以非结构化内容形式到达，从中提取的特征必须可解释、有区分度，并与专家认为重要的内容对齐。现有方法存在不足：它们针对表格输入，缺乏专家对齐的证明，并且无法将诸如“保持专业语气”之类的定性标准转化为精确特征。我们提出了FEST（自进化树特征工程），结合了双流特征生成（语义和确定性）、语义去重和树引导的迭代进化，从原始文本和图像中发现可审计特征。FEST在品牌分类、内容真实性检测和压力检测的20个分类器-任务组合中领先17个，在五个分类器上平均比最强基线高出4.2个百分点。LLM作为评判者的评估显示，在严格的语义对齐阈值下，FEST实现了60-80%的专家设计品牌特征覆盖率，并通过人类专家研究证实，这些特征在相关性、清晰度和可操作性方面获得高评分。当以专家指南作为种子时，FEST将定性标准细化为可操作特征，跨品牌平均提高6-12个百分点的准确率。为了实现对自动化特征工程中专家对齐的系统评估，我们发布了BrandGuide，这是第一个将专家设计特征与2,683个品牌的100万+资产配对的数据集。通过将特征工程建立在专家知识基础上，FEST为需要人类监督的可解释机器学习开辟了一条实用途径。

英文摘要

In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning cannot be deployed as opaque oracles: practitioners inspect the features driving model decisions, and models must leverage the expert documentation governing these domains. In practice, the data arrives as unstructured content, and features extracted from it must be interpretable, discriminative, and aligned with what experts consider important. Existing methods fall short: they target tabular inputs, lack demonstrated expert alignment, and cannot operationalize qualitative criteria such as 'maintain professional tone' into precise features. We present FEST (Feature Engineering with Self-evolving Trees), combining dual-stream feature generation (semantic and deterministic), semantic deduplication, and tree-guided iterative evolution to discover auditable features from raw text and images. FEST leads in 17 of 20 classifier-task combinations across brand classification, content authenticity detection, and stress detection, with a mean gain of 4.2 pp over the strongest baseline across five classifiers. An LLM-as-judge evaluation shows FEST achieves 60-80% coverage of expert-designed brand features at strict semantic-alignment thresholds, corroborated by a human expert study rating features highly on relevance, clarity, and actionability. When seeded with expert guidelines, FEST refines qualitative criteria into operational features, improving accuracy by 6-12 pp on average across brands. To enable systematic evaluation of expert alignment in automated feature engineering, we release BrandGuide, the first dataset pairing expert-designed features with 1M+ assets across 2,683 brands. By grounding feature engineering in expert knowledge, FEST opens a practical pathway for interpretable ML in domains demanding human oversight.

2606.08804 2026-06-09 cs.AI cs.LG 新提交

FAME: 面向异构时间序列预测的可预测性感知专家混合模型

Qianyang Li, Xingjun Zhang, Shaoxun Wang, Tao Peng, Jia Wei

发表机构 * Sun Yat-sen University（中山大学）； Guangdong Key Laboratory of Information Security Technology（广东省信息安全技术重点实验室）； Ministry of Education Key Laboratory of Machine Intelligence and Advanced Computing（教育部机器智能与先进计算重点实验室）

AI总结针对大规模异构时间序列预测中单一模型性能不足的问题，提出可预测性感知的稀疏专家混合框架FAME，通过多维可预测性指纹和成本感知路由，在工业数据集上实现12.4%的MSE降低。

AI中文摘要

大规模零售和工业预测系统包含许多异构时间序列，其生命周期、稀疏性、波动性、季节性、频谱模式和上下文敏感性差异很大。单一预测模型很少能在所有情况下表现良好，而密集集成会增加推理成本并提供有限的专家适用性洞察。本文研究可预测性感知的专家路由：学习数据特征如何决定预测专家的适用性。我们提出FAME，一个稀疏专家混合框架，用多维可预测性指纹表示每个序列，从验证性能中挖掘专家适用性目标，并训练一个成本感知的稀疏路由器，为每个序列激活少量预算的专家集。使用山东新北洋（SNBC）的生产规模自动售货机销售数据集（其中预测组件已集成到补货计划管道中）以及公共零售基准，我们表明专家适用性在不同数据情况下系统性地变化。在拥有5000+台机器和6000万+交易的工业数据集上，FAME Top-2相比最强单一专家LightGBM降低了12.4%的MSE，同时平均每个序列执行1.92个专家。部署的组件产生需求预测，而库存导向的收益通过离线回放模拟器在固定补货策略下估计，而非在线干预。该框架将异构销售预测从启发式模型选择转变为可预测性模式和专家专业化的数据挖掘。代码可在https://github.com/hit636/FAME获取。

英文摘要

Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasonality, spectral patterns, and contextual sensitivity differ substantially. A single forecasting model rarely performs well across all regimes, while dense ensembles increase inference cost and provide limited insight into expert suitability. This paper studies forecastability-aware expert routing: learning how data characteristics determine the suitability of forecasting experts. We propose FAME, a sparse mixture-of-experts framework that represents each series with a multidimensional forecastability fingerprint, mines expert-suitability targets from validation performance, and trains a cost-aware sparse router to activate a small budgeted set of experts for each series. Using a production-scale vending-machine sales dataset from Shandong New Beiyang (SNBC), where the forecasting component has been integrated into the replenishment-planning pipeline, together with public retail benchmarks, we show that expert suitability varies systematically across data regimes. On the industrial dataset with 5,000+ machines and 60M+ transactions, FAME Top-2 reduces MSE by 12.4% over the strongest single expert, LightGBM, while executing 1.92 experts per series on average. The deployed component produces demand forecasts, while inventory-oriented gains are estimated by an offline replay simulator under a fixed replenishment policy rather than by online intervention. The framework turns heterogeneous sales forecasting from heuristic model selection into data mining of forecastability patterns and expert specialization. Code is available at https://github.com/hit636/FAME
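
Sparse top-k routing of the kind described above can be sketched as follows. This is generic mixture-of-experts gating rather than the released implementation: the router logits are supplied by hand where a trained router would derive them from the forecastability fingerprint, the three toy experts are placeholders, and the cost-aware part of the router is not modelled.

```python
import math

def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their scores."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]

def sparse_forecast(router_logits, experts, history, k=2):
    chosen, weights = route_top_k(router_logits, k)
    # Only the chosen experts are executed; the rest cost nothing at inference.
    return sum(w * experts[i](history) for i, w in zip(chosen, weights))

# Hypothetical experts: last value, historical mean, and linear trend.
experts = [
    lambda h: h[-1],
    lambda h: sum(h) / len(h),
    lambda h: h[-1] + (h[-1] - h[-2]),
]
history = [10.0, 12.0, 14.0]
logits = [2.0, 0.1, 1.0]  # assumed router output for this series

chosen, weights = route_top_k(logits)
forecast = sparse_forecast(logits, experts, history)
print(chosen, weights, forecast)
```

An average below k experts per series, as in the reported 1.92 for Top-2, would require the router to sometimes activate fewer than k experts; the fixed-k sketch does not cover that.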

2606.08974 2026-06-09 cs.AI 新提交

Diverse Thinking Schemata Elicit Better Reasoning in Large Language Models

多样思维图式激发大型语言模型更优推理

Xinyue Liang, Yizhe Yang, Yu Bai, Bin Xu, Jiawei Li, Yang Gao

发表机构 * School of Computer Science and Technology, Beijing Institute of Technology（北京理工大学计算机科学与技术学院）

AI总结提出多样图式策略优化（DiScO），通过增强推理步骤转换和答案候选的多样性，提升大型语言模型在数学推理任务中的表现和错误恢复能力。

AI中文摘要

大型推理模型（LRMs）因其通过生成扩展推理链解决复杂数学问题的能力而受到越来越多的关注。在这项工作中，我们聚焦于推理过程中两个关键但尚未充分探索的方面：推理转换（捕捉推理步骤之间的不同转换）和答案候选（反映模型产生的解路径的多样性）。我们将这两个方面统称为思维图式。我们观察到思维图式的多样性与模型性能之间存在相关性，这激励我们通过增强多样性来进一步提升推理潜力。为此，我们提出了多样图式策略优化（DiScO），该框架首先赋予模型图式感知能力，然后通过强化学习鼓励多样性，并在推理时进一步促进多样化推理。在多个数学推理基准上的实验表明，DiScO始终优于标准的群体相对策略优化。除了准确性之外，人工标注分析显示，DiScO显著提高了模型从错误初始尝试中恢复的能力。总体而言，我们的工作表明思维图式多样性发挥的重要作用，并指出沿着多样性维度进行扩展是一个有前景的研究方向。

英文摘要

Large reasoning models (LRMs) have attracted increasing attention for their ability to solve complex mathematical problems by generating extended reasoning chains. In this work, we focus on two critical yet underexplored aspects of the reasoning process: reasoning transitions, which capture the distinct transitions between reasoning steps, and answer candidates, which reflect the variety of solution paths produced by the model. We collectively define these two aspects as thinking schemata. We observe a correlation between the diversity of thinking schemata and model performance, which motivates us to enhance diversity as a means to further improve reasoning potential. To this end, we propose Diverse Schemata Policy Optimization (DiScO), a framework that first endows the model with schemata awareness, then encourages diversity through reinforcement learning, and further promotes diverse reasoning at inference time. Experiments on multiple mathematical reasoning benchmarks demonstrate that DiScO consistently outperforms standard group relative policy optimization. Beyond accuracy, human-annotated analyses show that DiScO substantially improves the model's ability to recover from erroneous initial attempts. Overall, our work highlights the important role that the diversity of thinking schemata plays and points to scaling along the diversity dimension as a promising research direction.

2606.09124 2026-06-09 cs.AI New submission

A Regret Minimization Framework on Preference Learning in Large Language Models

Suhwan Kim, Taehyun Cho, Geon-Hyeong Kim, Yu Jin Kim, Youngsoo Jang, Moontae Lee, Jungwoo Lee

Affiliations: KAIST

AI summary: Proposes Regret-based Preference Optimization (RePO), which models human preferences through regret minimization rather than reward maximization, and reports consistent gains on mathematical reasoning benchmarks and human preference datasets.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifiers that provide automated correctness signals. However, many realistic language tasks are difficult to equip with reliable verifiers, motivating a growing reliance on reinforcement learning from human feedback (RLHF). In this setting, we argue that a closer examination of how human feedback should be interpreted is essential. We introduce Regret-based Preference Optimization (RePO), which reframes RLHF through regret minimization rather than reward maximization. Human preferences are often shaped by prospective anticipation of outcomes and counterfactual comparisons to alternative behaviors, rather than by immediate, outcome-independent utility. RePO captures this structure by modeling preferences as behavior-conditioned assessments of relative suboptimality. Experiments on mathematical reasoning benchmarks and human preference datasets demonstrate consistent performance gains, indicating that RePO is an effective and human-aligned approach for training large language models.

2606.09410 2026-06-09 cs.AI cs.CL New submission

Capacity, Not Format: Rethinking Structured Reasoning Failures

Hengxin Fan

AI summary: Finds that the effect of structured formats on model performance depends on the model's spare capacity. When capacity is insufficient, performance degrades through two mechanisms, truncation and pure capacity competition. Recommends thinking first and formatting later.

Comments: 12 pages, 3 figures

Abstract

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation -- reasoning freely before formatting -- recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
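The rounding remark in the abstract above can be checked with a few lines of arithmetic. This is only a sketch: the per-condition counts (128 and 121 correct out of 133) are not stated in the abstract and are inferred here as the integer counts consistent with the reported 96.2%, 91.0%, and 7/133 figures.

```python
# Hypothetical counts: only the 7/133 difference appears in the abstract;
# 128 and 121 are inferred so that the rounded percentages match.
TOTAL = 133
baseline_correct = 128
json_correct = 121


def pct(count, total, digits=1):
    """Percentage of `count` out of `total`, rounded to `digits` decimals."""
    return round(100 * count / total, digits)


baseline_pct = pct(baseline_correct, TOTAL)                  # displayed baseline accuracy
json_pct = pct(json_correct, TOTAL)                          # displayed JSON accuracy
exact_drop = pct(baseline_correct - json_correct, TOTAL, 2)  # drop from raw counts
naive_drop = round(baseline_pct - json_pct, 1)               # drop from rounded values

print(baseline_pct, json_pct, exact_drop, naive_drop)
```

Subtracting the two displayed percentages gives 5.2, while the drop computed from the raw counts is 5.26, which rounds to 5.3. That gap is the discrepancy the abstract's parenthetical explains.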

2606.09605 2026-06-09 cs.AI New submission

Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

Jonathan F. Carter, Lionel Tarassenko

Affiliations: Institute of Biomedical Engineering, University of Oxford

AI summary: Proposes Hypnos, a model that learns generalisable representations from multi-modal physiological signals with a next-token prediction objective, and significantly outperforms existing foundation models on tasks such as sleep stage classification and atrial fibrillation detection.

Abstract

Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using $100\times$ less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.

2606.09672 2026-06-09 cs.AI cs.CL cs.LG cs.PF q-bio.QM New submission

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Suraj Biswas, Saurabh Gupta, Pritam Mukherjee

Affiliations: Assessli Research; Dots-In Research

AI summary: Pretrained biomedical language models assign high cosine similarity (0.76-0.92) to unrelated cross-domain pairs, which leads to erroneous causal inference. The paper proposes contrastive training (raising separation to 1.63x) and BODHI hard-negative mining (raising it to 2.30x), and uses OpenVINO optimization for a 133x speedup.

Comments: 20 pages, 18 figures, 9 tables

Abstract

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
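To make the two metrics in the abstract above concrete, here is a minimal sketch of cosine similarity and a within-vs-across-domain separation ratio. The abstract does not define the separation metric, so the definition used here (mean within-domain similarity divided by mean across-domain similarity) is an assumption. The 2-d vectors are toy values, not outputs of any real encoder.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Toy embeddings for two domains (hypothetical values for illustration).
biomedical = [[1.0, 0.2], [0.9, 0.3]]
finance = [[0.2, 1.0], [0.3, 0.9]]

# Assumed definition: average similarity of same-domain pairs ...
within = mean(
    cosine(u, v)
    for group in (biomedical, finance)
    for u, v in combinations(group, 2)
)
# ... divided by the average similarity of cross-domain pairs.
across = mean(cosine(u, v) for u, v in product(biomedical, finance))
separation = within / across

print(f"within={within:.3f} across={across:.3f} separation={separation:.2f}x")
```

Under this reading, a ratio near 1.0x means unrelated cross-domain pairs look almost as similar as related in-domain pairs, and a larger ratio means the geometry separates them better.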

2606.07524 2026-06-09 cs.CL cs.AI Cross-list

ABLE: Representing and Mapping LLMs via Attribution-Based Large-model Embedding

Zirui Wang, Yusen Hou, Shaofeng Liang, Bowen Tian, Yanlin Zhang, Wenshuo Chen, Yutao Yue

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Deep Interdisciplinary Intelligence Lab (DI2 Lab)

AI summary: Proposes ABLE, a framework that builds model embeddings from gradient-based feature attributions and tokenizer-agnostic word-level alignment, enabling efficient comparison of heterogeneous LLMs, with strong results on relation prediction, model routing, and benchmark score prediction.

Abstract

The explosive growth of large language models (LLMs) has created a heterogeneous and poorly documented ecosystem, making systematic model comparison increasingly important for provenance auditing, security analysis, and model selection. Existing representation methods struggle to address this setting efficiently. Approaches analyzing internal parameters are powerful when architectures are compatible, but face scalability barriers under structural heterogeneity, while methods relying on external outputs may conflate models with similar behaviors and are difficult to align in richer output spaces across different tokenizers. To bridge this gap, we propose ABLE (Attribution-Based Large-model Embedding), a framework that leverages the interpretability space to construct model representations. By aggregating gradient-based feature attributions via a tokenizer-agnostic word-level alignment, ABLE captures model-specific input-sensitivity patterns rather than only surface-level outputs. Beyond empirical utility, we provide a stability analysis showing that, under standard regularity assumptions for differentiable Transformer-style models, ABLE induces a Lipschitz-continuous parameter-to-embedding map with finite-sample convergence guarantees. Extensive experiments on 239 open-source LLMs demonstrate that our training-free approach achieves competitive or superior performance in relation prediction, model routing, and benchmark score prediction.

2606.07527 2026-06-09 cs.CL cs.AI cs.LG Cross-list

Post-training is (Massive) Supervised Learning

Michael Hassid, Yossi Adi, Roy Schwartz

Affiliations: FAIR, Meta AI; The Hebrew University of Jerusalem

AI summary: Argues that the current LLM post-training stage (SFT + RL) is in effect a return to the BERT-era "pre-train then fine-tune" paradigm, shows experimentally that models post-trained from scratch also reach notable performance, and proposes shifting toward training in which models "learn how to learn".

Abstract

The prevailing paradigm for training LLMs has evolved to rely on a massive post-training phase consisting of SFT and RL. In this position paper, we argue that this methodology effectively marks a reversion to the "pre-train then fine-tune" approach of the BERT era, explicitly tailoring models to the desired behaviors and specific benchmarks on which they are evaluated. We begin with a historical overview of LLMs, describing the different phases of the LLM evolution. We argue that the current landscape is remarkably similar to the early days of LLMs, where task performance heavily relied on fitting the models to in-distribution datasets. To empirically demonstrate this, we compare pre-trained models to randomly initialized ones, by fine-tuning both variants on modern reasoning datasets and evaluating them on competitive math and code benchmarks. We show that models post-trained from scratch yield highly non-trivial performance. Our findings suggest that current post-training methodologies function primarily as a distribution-fitting mechanism. We finish by positing that developing generally capable models and systems requires moving beyond extensive post-training for predefined behaviors, shifting instead toward training procedures where models "learn how to learn".

2606.07546 2026-06-09 cs.IR cs.AI cs.LG Cross-list

Beyond Item IDs: Scaling Short-Form-Video Recommendation via Semantic-Native Long Sequence Modeling

Ruixiao Sun, Diego Uribe Mora, Zhimeng Jiang, Yuanzhen Lin, Jiarui Wang, Yuening Li, Danfeng Guo, Zhizhong Chen, Chuan He, Liang Liu

Affiliations: Google, Mountain View, USA

AI summary: Sequence length in short-form video recommendation is limited by the semantic sparsity of video IDs and the quadratic complexity of Transformers. The paper adopts Semantic IDs and a Global-Aware Compression Transformer to model ultra-long behavior sequences at billion-user scale, substantially reducing memory and compute overhead and improving user satisfaction and content consumption in online experiments.

Comments: This manuscript has been accepted by SIGIR 2026

DOI: 10.1145/3805712.3808503

Abstract

Capturing user interests across extensive watch histories is critical for short-form video recommendation, yet scaling sequence length is limited by two bottlenecks: the semantic sparsity of atomic Video IDs and the quadratic computational complexity of Transformers. Traditional orthogonal Video IDs fail to capture content relationships and demand large embedding tables, while the quadratic complexity of self-attention restricts the maximum sequence length under strict industrial latency and resource constraints. In this work, we present a production-deployed framework for modeling ultra-long user behavior sequences at a billion-user scale. We first address the representation bottleneck by adopting content-native Semantic IDs. By utilizing depth-truncated, coarse-grained Semantic IDs, we shrink the embedding table size from corpus cardinality. This compact representation naturally generalizes to cold-start content through shared semantic prefixes. Second, to overcome the sequence scaling barrier, we introduce a Global-Aware Compression Transformer that leverages non-parametric temporal folding and unified global query integration to effectively condense the sequence, alleviating both the memory and computational bottlenecks of standard self-attention. Offline profiling on our computing infrastructure demonstrates an order-of-magnitude reduction in peak memory footprint and a drastic decrease in computational overhead. This efficiency gain enables supporting longer sequence lengths at an affordable cost in production, yielding substantial online gains in satisfied user engagement and satisfied content consumption in large-scale online A/B tests.

2606.07559 2026-06-09 cs.CL cs.AI quant-ph Cross-list

Phantom transitions in language model fine-tuning

Vaibhav Prakash, Jayasri Dontabhaktuni

Affiliations: Mahindra University

AI summary: Studies the phenomenon in language model fine-tuning where the correct completion fails to overtake a near-synonym competitor. An order parameter decomposes into a signal and a background drag, exposing two failure modes, and the apparent phase transitions turn out to be phantoms that originate in the softmax readout rather than in a geometric phase transition.

Comments: 26 pages, 9 figures

Abstract

Fine-tuning a language model on contexts whose correct completion has a near-synonym competitor often fails silently. The cross-entropy loss decreases monotonically while the correct token never overtakes the competitor in rank. We study this regime across five transformer architectures spanning two families and a fivefold parameter range, on ten hand-selected near-synonym contexts. We instrument these failures with an order parameter combining the predicted distribution and pairwise embedding overlaps. It decomposes additively into a signal, tracking the model's commitment to the correct token over its nearest competitor, and a background drag, set by how the embedding bulk leaks probability into the score. This isolates two failure modes. In kinematic failure the signal stays small. In structural failure the drag actively worsens as fine-tuning proceeds. We observe sharp catapult-like jumps in the order parameter that resemble a phase transition. A central negative result organises the paper. The transitions are phantoms. The spontaneous-symmetry-breaking interpretation is ruled out by direct measurement. Catapult-like jumps still appear under LoRA fine-tuning with the token embedding matrix exactly unchanged during training, where no geometric phase transition is possible. The discontinuity lives entirely in the softmax readout. A small number of dimensionless quantities organise the trajectory across architectures. One is consistent across all five under full fine-tuning. A second sorts architectures into two classes by bulk embedding distribution and predicts LoRA sufficiency. As a blind test, the framework predicts the critical learning rate of a held-out architecture, not used to fit any parameter, to within 2.1% of a subsequent learning-rate sweep. Findings concern the near-synonym mechanism only and should not be extrapolated without recalibration.
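The silent failure described in the abstract above, where the loss falls but the rank never changes, can be reproduced with plain softmax arithmetic. In the sketch below the logits are hypothetical numbers chosen for illustration and do not come from the paper. Both the correct token and its near-synonym competitor gain logit mass relative to the remaining vocabulary, so the cross-entropy falls while the competitor stays on top.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target index."""
    return -math.log(softmax(logits)[target])


def rank_of(logits, index):
    """Rank 1 means the highest logit."""
    return 1 + sum(1 for x in logits if x > logits[index])


CORRECT = 0  # index of the correct token; index 1 is the near-synonym competitor

# Hypothetical logits: [correct, competitor, three other vocabulary tokens].
before = [1.0, 1.5, 1.0, 1.0, 1.0]
after = [3.0, 3.5, 1.0, 1.0, 1.0]

loss_before = cross_entropy(before, CORRECT)
loss_after = cross_entropy(after, CORRECT)

print(f"loss {loss_before:.3f} -> {loss_after:.3f}")
print(f"rank of correct token: {rank_of(before, CORRECT)} -> {rank_of(after, CORRECT)}")
```

The loss only measures the probability of the correct token, so it can improve by draining mass from unrelated tokens without ever resolving the contest with the competitor.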

2606.07563 2026-06-09 cs.LG cs.AI Cross-list

Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems

Truong Xuan Khanh

Affiliations: H&K Research Studio; Clevix LLC

AI summary: Proposes the Hierarchical Emergence Framework (HEF), which models emergence as a phase transition in a mechanism landscape, proves physical feasibility and convergence to a unique fixed point under structural assumptions, and validates a phase-transition fingerprint in 111 modular-arithmetic transformer experiments.

Comments: 27 pages, 3 figures, 2 tables; 15-page Supplementary Information with complete proofs included

Abstract

Across machine learning, biology, and physics, independently evolving systems often converge toward strikingly similar high-level structures despite radically different microscopic details. Grokking circuits converge across random seeds, evolutionary lineages rediscover similar metabolic solutions, and renormalization flows approach common fixed points. We propose the Hierarchical Emergence Framework (HEF) as a candidate universality framework for such convergence phenomena. HEF models emergence as a phase transition in a mechanism landscape constrained by thermodynamic and information-theoretic laws. The framework introduces a critical energy threshold $E_c$ separating an exploration regime with competing mechanisms from a convergence regime governed by a unique minimum-cost mechanism. Under structural assumptions, we prove physical feasibility, derive strict metric contraction, and establish convergence toward a unique fixed-point representation independent of initial conditions. We further connect this convergence structure to causal emergence through Effective Information and mechanism competition entropy. To test the framework, we study delayed generalization ("grokking") in modular arithmetic transformers across 111 experiments. We identify a reproducible empirical fingerprint of the $E_c$ transition: the weight norm peaks systematically before grokking in 92% of runs. Normalized accuracy curves collapse onto a tanh kink ($R^2 = 0.93$) consistent with a Landau-Ginzburg universality class, and all grokked models converge to $0.9745 \pm 0.014$ regardless of initialization, weight decay, or training fraction (ANOVA $p > 0.13$). HEF is not presented as a universal theory of emergence, but as a falsifiable mathematical scaffold for studying convergence phenomena across complex systems.

2606.07568 2026-06-09 cs.HC cs.AI cs.CV cs.LG physics.data-an Cross-list

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

Ishaan Singh Chandok, Core Francisco Park

Affiliations: GitHub

AI summary: To address the time cost of manual verification and correction in scientific data annotation, proposes a behavioral cloning framework with 9 synthetic tasks that simulate expert strategies. Key findings include hierarchical skill acquisition, efficient fine-tuning after multi-task pretraining, and a shared mistake representation in the models' internal representations.

Comments: ICML 2026 Oral

Abstract

Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the "last mile" problem: even with strong automation, verification and correction consume substantial human effort. Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies including exploration, mistake correction, and strategic decision-making. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data efficient within our scale range. Third, multi-task pretraining enables efficient fine-tuning to new tasks, while training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.

2606.07571 2026-06-09 cs.LG cs.AI Cross-list

Enabling KV Caching of Shared Prefix for Diffusion Language Models

Younghun Go, Jaehoon Han, Changyong Shin, Chuk Yoo, Gyeongsik Yang

Affiliations: Korea University

AI summary: Bidirectional attention in diffusion language models makes shared-prefix KVs unstable. The paper proposes bidirectional prefix caching (bicache), which dynamically identifies a safe layer depth for reusing KVs, avoids accuracy collapse, and improves throughput by 36.3%-98.3%.

Abstract

Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical challenges in emerging diffusion language models (DLMs). In DLMs, bidirectional attention means that updating any token dynamically alters the entire context and its corresponding KVs. Thus, existing caching techniques developed for LLMs, which assume that KVs remain invariant once computed, corrupt the shared prefix KVs. Our experiments show that applying these techniques to DLMs causes model accuracy to collapse to near zero. To unlock high-throughput DLM serving, we propose bidirectional prefix caching, bicache, the first KV caching technique for shared prefixes in DLMs. bicache is designed based on key observations from our comprehensive analysis: shared prefix KVs remain stable and reusable in shallow layers, while the depth of shallow layers depends on the fraction of shared prefix tokens in each request. Thus, bicache dynamically identifies a safe layer depth for reusing shared prefix KVs and eliminates redundant computation. Evaluations demonstrate that bicache significantly improves serving throughput by 36.3%-98.3% compared to existing techniques without accuracy collapse (only 0-1.8% difference).
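The reason causal-style prefix caching breaks under bidirectional attention can be seen in a toy single-layer example. The sketch below is not the bicache algorithm. It uses identity query, key, and value projections and hypothetical 2-d token vectors as simplifying assumptions, and only compares the prefix hidden states produced under the two masking schemes when the suffix changes.

```python
import math


def attend(tokens, causal):
    """One attention layer with identity Q/K/V projections (toy assumption)."""
    outputs = []
    for i, query in enumerate(tokens):
        # Causal attention sees positions 0..i; bidirectional sees everything.
        visible = tokens[: i + 1] if causal else tokens
        scores = [sum(a * b for a, b in zip(query, key)) for key in visible]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append(
            [
                sum(w * value[d] for w, value in zip(weights, visible)) / total
                for d in range(len(query))
            ]
        )
    return outputs


def prefix_states(suffix, causal):
    """Hidden states of the shared prefix after one layer."""
    return attend(PREFIX + suffix, causal)[: len(PREFIX)]


# Hypothetical token vectors for a shared prefix and two different suffixes.
PREFIX = [[1.0, 0.0], [0.0, 1.0]]
SUFFIX_A = [[1.0, 1.0]]
SUFFIX_B = [[-1.0, 2.0]]

causal_same = prefix_states(SUFFIX_A, True) == prefix_states(SUFFIX_B, True)
bidirectional_same = prefix_states(SUFFIX_A, False) == prefix_states(SUFFIX_B, False)

print("causal prefix states unchanged:", causal_same)
print("bidirectional prefix states unchanged:", bidirectional_same)
```

The keys and values of the next layer are computed from these hidden states, so in the bidirectional case a cached prefix KV goes stale as soon as any later token differs. The abstract's finding that shallow layers stay approximately stable is an empirical observation that this toy does not model.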

2606.07574 2026-06-09 cs.DC cs.AI cs.LG stat.CO stat.ML Cross-list

Accelerating Birkhoff Projection for Manifold-Constrained Hyper-Connections

Chenrui Wang, Yixuan Qiu

Affiliations: School of Statistics; Renmin University of China; School of Statistics and Data Science; Institute of Big Data Research; Shanghai University of Finance and Economics

AI summary: To address the computational bottleneck of Birkhoff projection in manifold-constrained hyper-connections, proposes an end-to-end acceleration framework based on a dual formulation and Newton's method, combined with implicit differentiation and a CUDA kernel, achieving over 20x speedup.

Abstract

Manifold-constrained hyper-connections (mHCs) have recently been proposed as a principled extension of hyper-connections, where the residual mixing matrices are constrained to be doubly stochastic via projection onto the Birkhoff polytope. In practical mHC implementations, this constraint is enforced by Sinkhorn-Knopp iterations, and the backward pass relies on unrolling the iterative solver. This design introduces substantial computation and memory overhead, and may also yield inaccurate projections when the algorithm converges slowly on challenging inputs, undermining the intended norm-control and stability guarantees of mHCs. In this work, we focus on the practically important 4x4 Birkhoff projection setting and develop an end-to-end acceleration framework. By leveraging the dual formulation, we reduce the problem to a three-dimensional unconstrained convex problem and solve it with Newton's method, achieving fast convergence and high accuracy. For the backward pass, we replace the unrolled differentiation with implicit differentiation, yielding exact gradients without storing intermediate states. To exploit massive parallelism, we design a warp-level CUDA kernel that uses only register-level primitives, avoiding global and shared memory I/O. Extensive experiments against representative open-source baselines demonstrate that the proposed solver yields substantially more reliable doubly stochastic projections -- especially when the input magnitude is large -- and achieves significant end-to-end speedups (including the backward pass), reaching over 20x acceleration at large batch sizes while maintaining orders of magnitude smaller marginal errors.
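For context, the Sinkhorn-Knopp baseline that the abstract above sets out to replace can be sketched in a few lines. This is not the paper's Newton-based solver. It is a plain-Python illustration of alternating row and column normalization on a 4x4 matrix, and the elementwise exponential used to obtain a positive matrix, the iteration count, and the input values are all assumptions made for the example.

```python
import math


def sinkhorn_knopp(logits, iterations=50):
    """Alternately normalize rows and columns of exp(logits)."""
    n = len(logits)
    peak = max(max(row) for row in logits)  # shift for numerical stability
    matrix = [[math.exp(x - peak) for x in row] for row in logits]
    for _ in range(iterations):
        # Row step: make every row sum to 1.
        matrix = [[x / sum(row) for x in row] for row in matrix]
        # Column step: make every column sum to 1.
        col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
        matrix = [[matrix[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return matrix


def marginal_error(matrix):
    """Largest deviation of any row or column sum from 1."""
    n = len(matrix)
    row_err = max(abs(sum(row) - 1.0) for row in matrix)
    col_err = max(abs(sum(matrix[i][j] for i in range(n)) - 1.0) for j in range(n))
    return max(row_err, col_err)


# Hypothetical small-magnitude 4x4 input.
logits = [
    [0.2, -0.1, 0.4, 0.0],
    [0.3, 0.1, -0.2, 0.5],
    [-0.4, 0.2, 0.1, 0.3],
    [0.0, 0.5, -0.3, -0.1],
]

projected = sinkhorn_knopp(logits)
print("marginal error after 50 iterations:", marginal_error(projected))

# Scaling the input up and cutting the iteration budget shows how the
# marginal error depends on input magnitude (no expected value asserted).
scaled = [[10.0 * x for x in row] for row in logits]
print("scaled input, 5 iterations:", marginal_error(sinkhorn_knopp(scaled, 5)))
```

Every iteration here is a differentiable step, so backpropagating through the unrolled loop requires keeping all intermediate matrices. That is the memory overhead the abstract says implicit differentiation avoids.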

2606.07598 2026-06-09 cs.LG cs.AI Cross-list

A Topological Characterization of Graph Neural Networks via Stochastic Block Model Embeddings on the n-Sphere

Gopal Anantharaman

Affiliations: KnotTheory.ai Inc.; Dept. of Mathematics, Emporia State University

AI summary: Proposes a topological framework that maps the stochastic block models induced by message passing neural networks onto the unit n-sphere, for comparing trained graph neural networks and retrieving transfer-learning candidates without retraining.

Abstract

We propose a topological framework for comparing trained Graph Neural Networks (GNNs) by mapping the Stochastic Block Models (SBMs) induced on the graphon-signal space of a Message Passing Neural Network (MPNN) onto the unit $n$-sphere $\mathbb{S}^{n-1}\subset\mathbb{R}^n$. The construction rests on three classical pillars: the compactness of the cut-distance graphon space $(\widetilde{\mathcal{W}}_0,\delta_\square)$ (Lovász and Szegedy, 2006; Lovász, 2012), the Frieze-Kannan weak regularity lemma together with its graphon-signal extension due to Levie (2023), and the Lipschitz continuity of MPNNs with respect to the cut-distance. We show that, for any prescribed tolerance $\varepsilon>0$, a trained MPNN $\Phi$ acting on a sufficiently large graph factors (up to $\varepsilon$) through a step-graphon-signal of bounded complexity, and we construct an explicit measure-preserving map $\Psi_n\colon[0,1]\to\mathbb{S}^{n-1}$ that places the SBM regions on disjoint spherical caps. This produces a problem-agnostic, low-dimensional "fingerprint" of a trained GNN that is amenable to visual inspection and to nearest-neighbour search across model zoos, enabling transfer-learning candidate retrieval without retraining. We discuss the obstruction posed by concentration of measure in high dimension, a phenomenon directly relevant to LLM-scale embeddings. We close with five concrete future research directions: hyperbolic and Grassmannian alternatives to the spherical model, Gromov-Wasserstein distances on graphon-signals as an isometry-free alternative to the $n$-sphere map, an information-geometric (Fisher) reformulation of the SBM manifold, persistent-homology fingerprints of layer-wise embedding clouds, and a spectral-distance baseline derived from the graphon eigendecomposition.

2606.07599 2026-06-09 cs.LG cs.AI cs.CV cross-list

DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression

DiffoR：一种统一的连续生成框架用于通用序数回归

Hongxu Ma, Lin Wang, Chenghou Jin, Han Zhou, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

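Affiliations: Fudan University; Kuaishou Technology; Shanghai University of Finance and Economics; Tongji University

AI summary: Proposes the DiffOR framework, which models ordinal regression as a continuous generative task, uses a diffusion model to recover continuous ordinal values through iterative denoising, and designs a dual-decoupling strategy (multi-scale increment aggregation and dynamic denoising perception) to preserve ordinal topology, outperforming existing methods on 12 benchmarks.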

Comments Accepted at KDD 2026

DOI: 10.1145/3770855.3818149

AI-generated Chinese abstract

序数回归（OR）旨在预测具有内在顺序的目标值，支撑着从推荐系统到计算机视觉等多个领域的关键应用。尽管从朴素回归发展到基于离散化的分类和生成，现有范式仍然受到量化伪影和缺乏全局序数拓扑感知的根本限制。这些方法通常强制执行刚性边界划分，无法捕捉序数数据固有的非平稳语义转换。在本文中，我们提出了一种新范式，将OR形式化为连续生成序数回归任务。在该新范式下，我们引入了DiffOR，一个统一的框架，利用扩散模型通过迭代去噪恢复连续序数值，从而能够动态学习软语义转换。为了显式保留序数拓扑，我们设计了一种双解耦策略：在空间上，多尺度增量聚合将目标分解为层次化的连续增量；在时间上，动态去噪感知将去噪步骤与特征频率同步，确保稳健的从粗到细的细化。理论上，我们证明了所提方法可以显著增强表示能力和机制可解释性。在四个领域的12个基准上的大量实验验证了DiffOR相对于最先进方法的一致优越性，建立了一个新标准，展示了作为通用序数回归通用解决方案的强大潜力。

English abstract

Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to discretization-based classification and generation, existing paradigms remain fundamentally constrained by quantization artifacts and the lack of global ordinal topological perception. These methods typically enforce rigid boundary delineations, failing to capture the non-stationary semantic transitions inherent to ordinal data. In this paper, we propose a novel paradigm where OR is formulated as a Continuous Generative Ordinal Regression task. Under the novel paradigm, we introduce DiffOR, a unified framework that leverages diffusion models to recover continuous ordinal values via iterative denoising, thereby enabling the dynamic learning of soft semantic transitions. To explicitly preserve ordinal topology, we devise a Dual-Decoupling Strategy: Spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; Temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies, ensuring robust coarse-to-fine refinement. Theoretically, we show that the proposed method can significantly enhance both representation capability and mechanistic interpretability. Extensive experiments on 12 benchmarks across four domains validate DiffOR's consistent superiority over state-of-the-art methods, establishing a new standard that demonstrates strong potential as a general-purpose solution for universal ordinal regression.

2606.07600 2026-06-09 cs.LG cs.AI cross-list

Reachability and asymptotics of Gaussian Transformer dynamics

高斯Transformer动力学的可达性与渐近性

Albert Alcalde, Zhengping Ji, Enrique Zuazua

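Affiliations: Friedrich–Alexander University Erlangen–Nürnberg; Research Council of Norway

AI summary: Models data propagation through Transformers as a nonlinear control system on the space of probability measures, proves that Gaussian distributions remain Gaussian under self-attention and affine feed-forward layers, which reduces the dynamics to a bilinear control system, and reveals a connection to Riccati equations.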

AI-generated Chinese abstract

我们将通过Transformer（驱动大型语言模型的机器学习架构）的数据传播建模为概率测度空间上的非线性控制系统。对于具有自注意力和仿射前馈层的平均场Transformer模型，我们证明高斯分布在诱导流下保持严格高斯性。这种不变性将无限维测度动力学简化为控制均值和协方差演化的有限维双线性控制系统，将Transformer的表达能力重新表述为关于指定高斯矩的可达性问题，并揭示了与经典滤波和控制中Riccati型方程的新联系。

对于时变控制，我们证明任何目标高斯分布（其协方差矩阵与初始协方差矩阵具有相同秩）的精确有限时间可达性，该秩约束是动力学的一个内在不变量。对于时不变参数，我们推导出显式的谱条件，这些条件要么导致正定平衡点的渐近稳定性，要么导致协方差的有限时间爆破。

数值实验补充了理论，表明具有高斯输入的实际Transformer在早期和中间层保持与矩匹配的高斯分布接近，而具有指定注意力矩阵的Transformer再现了预测的协方差状态：在稳定配置中有界演化，在失稳配置中爆破。

English abstract

We formulate data propagation through the Transformer, the machine learning architecture powering large language models, as a nonlinear control system on the space of probability measures. For the mean-field Transformer model with self-attention and affine feed-forward layers, we prove that Gaussian distributions remain exactly Gaussian along the induced flow. This invariance reduces the infinite-dimensional measure dynamics to a finite-dimensional bilinear control system governing the evolution of the mean and covariance, reformulates the expressive capacity of Transformers as a reachability problem for prescribed Gaussian moments, and reveals a novel connection with Riccati-type equations from classical filtering and control. For time-varying controls, we prove exact finite-time reachability of any target Gaussian distribution whose covariance matrix has the same rank as the initial one, this rank constraint being an intrinsic invariant of the dynamics. For time-invariant parameters, we derive explicit spectral conditions leading either to asymptotic stability toward positive-definite equilibria or to finite-time blow-up of the covariance. Numerical experiments complement the theory by showing that practical Transformers with Gaussian inputs remain close to moment-matched Gaussian distributions through early and intermediate layers, while Transformers with prescribed attention matrices reproduce the predicted covariance regimes: bounded evolution in stabilizing configurations and blow-up in destabilizing ones.

2606.07601 2026-06-09 cs.LG cs.AI cross-list

LFNO: Bridging Laplace and Fourier via Transient-Steady Decomposition

LFNO：通过瞬态-稳态分解桥接拉普拉斯与傅里叶

Jeongun Ha, Sanga Yoon, Donghun Lee

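AI summary: Proposes the Laplace-Fourier Neural Operator (LFNO), whose dual-branch architecture explicitly decomposes system dynamics into transient and steady-state components; across nine benchmarks it outperforms existing operators on the ODE systems, surpasses LNO and is competitive with FNO on the PDE systems, and improves stability and interpretability.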

Comments 21 pages, 11 figures

AI-generated Chinese abstract

我们引入了拉普拉斯-傅里叶神经算子（LFNO），这是一个统一框架，通过整合拉普拉斯和傅里叶神经算子的谱优势，对跨瞬态和稳态区域的动力系统进行建模。LFNO采用双分支架构，将系统动力学显式分解为瞬态和稳态分量。我们在九个基准上评估了LFNO，包括三个ODE系统（Duffing、Lorenz和Pendulum）和六个PDE系统（Euler-Bernoulli梁、热方程、反应扩散、Brusselator、Burgers和Navier-Stokes）。在瞬态动力学占主导的ODE系统上，LFNO显著优于现有算子，并且在PDE基准上持续超越LNO，同时达到与FNO竞争的性能。此外，LFNO通过其分量分解提供了改进的稳定性和物理可解释性。这些结果表明，LFNO为跨多个时间尺度学习复杂动力系统提供了一种鲁棒且统一的方法。

English abstract

We introduce the Laplace-Fourier Neural Operator (LFNO), a unified framework for modeling dynamical systems across transient and steady-state regimes by integrating the spectral advantages of Laplace and Fourier Neural Operators. LFNO employs a dual-branch architecture that explicitly decomposes system dynamics into transient and steady-state components. We evaluate LFNO on nine benchmarks, including three ODE systems (Duffing, Lorenz, and Pendulum) and six PDE systems (Euler-Bernoulli beam, Heat, Reaction-diffusion, Brusselator, Burgers, and Navier-Stokes). LFNO significantly outperforms existing operators on ODE systems, where transient dynamics dominate, and consistently surpasses LNO while achieving performance competitive with FNO on PDE benchmarks. Furthermore, LFNO offers improved stability and physical interpretability through its component-wise decomposition. These results demonstrate that LFNO provides a robust and unified approach for learning complex dynamical systems across multiple temporal scales.

2606.07604 2026-06-09 cs.LG cs.AI cross-list

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

贡献权重：自注意力Transformer的几何分析

Harry Jake Cunningham, Nicola Muca Cirone

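Affiliations: University of Cambridge

AI summary: Proposes contribution weights, a projection-based metric that combines attention weight, value-vector magnitude, and directional alignment to identify critical tokens more accurately, and reveals an active suppressive role for attention sinks.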

AI-generated Chinese abstract

分析注意力权重已成为解释大型语言模型（LLM）信息流的标准方法。然而，这种方法有显著局限性，因为它忽略了被聚合的值向量的几何特性。为了解决这个问题，我们引入了\emph{贡献权重}，这是一种基于投影的度量，通过考虑令牌的注意力权重、值大小以及与层输出的方向对齐来量化令牌的影响。我们证明，贡献权重提供了更忠实的令牌重要性度量，在不同解码器模型、任务和数据集中，始终优于基于注意力的度量，用于识别语义关键令牌。此外，我们的度量能够对\emph{注意力汇}进行新的机制分析。虽然先前的工作将注意力汇描述为多余注意力的被动存储库，但我们揭示它们起到了主动的功能作用，通过汇率与输出范数之间的凸关系抑制信息，通过反对低置信度令牌的语义漂移来稳定表示。

English abstract

Analyzing attention weights has become a standard approach for interpreting the information flow of Large Language Models (LLMs). However, this approach has significant limitations because it neglects the geometric properties of the value vectors being aggregated. To address this gap, we introduce Contribution Weights, a projection-based metric that quantifies a token's influence by accounting for its attention weight, value magnitude, and directional alignment with the layer output. We demonstrate that contribution weights provide a more faithful measure of token importance, consistently outperforming attention-based metrics in identifying semantically critical tokens across different decoder-only models, tasks, and datasets. Further, our metric enables a novel mechanistic analysis of attention sinks. While previous work characterized sinks as passive repositories for excess attention, we reveal that they serve an active functional role: they suppress information through a convex relationship between sink rate and output norm, and they stabilize representations by opposing the semantic drift of low-confidence tokens.
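One way to read the projection-based metric described in this abstract is sketched below. This is an illustrative assumption, not the paper's definition: each token's weighted value vector is projected onto the layer output (taken here to be the attention-weighted sum of values) and normalised by the squared output norm, so the weights sum to one. The function name `contribution_weights` and the toy numbers are hypothetical.

```python
def contribution_weights(attn, values):
    """Projection-based token contributions for one query (assumed form).

    attn   -- attention weights over tokens, summing to 1
    values -- one value vector per token
    """
    dim = len(values[0])
    # Layer output for this query: attention-weighted sum of value vectors.
    out = [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]
    out_sq = sum(x * x for x in out)
    if out_sq == 0.0:
        return [0.0] * len(attn)
    # a_i * <v_i, out> / ||out||^2 combines attention weight, value magnitude,
    # and directional alignment with the output in a single number.
    return [a * sum(v[d] * out[d] for d in range(dim)) / out_sq
            for a, v in zip(attn, values)]


attn = [0.7, 0.2, 0.1]
values = [[0.1, 0.0], [3.0, 1.0], [-1.0, 0.5]]
weights = contribution_weights(attn, values)
```

In this toy case the first token receives the most attention, yet the second token dominates the contribution because its value vector is large and aligned with the output, while the third token gets a negative weight because it points against the output.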

2606.07615 2026-06-09 cs.LG cs.AI cross-list

Structured Neuron Pruning in Deep Neural Networks Using Multi-Armed Bandits

深度神经网络中使用多臂赌博机的结构化神经元剪枝

Salem Ameen, Sunil Vadera

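Affiliations: School of Science, Engineering and Environment, University of Salford

AI summary: Proposes a structured pruning framework based on multi-armed bandit algorithms that treats each neuron as an arm and estimates its removal reward, and validates policies such as UCB1 and Thompson Sampling on tabular classification, regression, and deep-network tasks.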

Comments 27 pages, 5 figures

AI-generated Chinese abstract

深度神经网络通常包含冗余的隐藏单元。移除单个权重可以减少参数数量，但非结构化稀疏性在标准密集实现中并不总是容易利用。本文开发了一个结构化剪枝框架，其中使用多臂赌博机（MAB）算法移除完整的神经元。每个候选神经元被视为一个臂；拉动一个臂会暂时屏蔽该神经元，测量采样小批量上损失的变化，恢复神经元，并更新其安全移除奖励的估计。该框架支持随机策略，包括Epsilon-Greedy、Softmax、UCB1和汤普森采样，以及乘性权重策略，包括Hedge风格的乘性权重和EXP3。我们在涵盖图像、文本和推理任务的表格分类、表格回归和深度神经网络基准上评估了该方法。使用弗里德曼检验和随后Nemenyi事后检验的统计比较显示方法之间存在显著差异。在表格分类任务上，UCB1在剪枝策略中获得最高平均排名，并优于未剪枝的神经网络。在回归任务上，UCB1获得最高平均排名，并且根据R^2，与几种标准回归模型在统计上具有竞争力或更优。在深度学习任务上，UCB1和汤普森采样获得最强排名，并且几种MAB策略显著优于未剪枝模型、基于幅度的神经元剪枝和贪婪激活变化剪枝。结果表明，基于MAB的神经元剪枝是一种有效且计算实用的结构化模型缩减方法。

English abstract

Deep neural networks often contain redundant hidden units. Removing individual weights can reduce parameter count, but unstructured sparsity is not always easy to exploit in standard dense implementations. This paper develops a structured pruning framework in which complete neurons are removed using multi-armed bandit (MAB) algorithms. Each candidate neuron is treated as an arm; pulling an arm temporarily masks that neuron, measures the change in loss on a sampled mini-batch, restores the neuron, and updates an estimate of its safe-removal reward. The framework supports stochastic policies, including Epsilon-Greedy, Softmax, UCB1 and Thompson Sampling, and multiplicative-weight policies, including Hedge-style multiplicative weights and EXP3. We evaluate the method on tabular classification, tabular regression and deep neural-network benchmarks covering image, text and reasoning tasks. Statistical comparisons using the Friedman test followed by the Nemenyi post-hoc test show significant differences between methods. On tabular classification tasks, UCB1 obtains the highest mean rank among pruning policies and improves on the unpruned neural network. On regression tasks, UCB1 obtains the highest mean rank and is statistically competitive with, or superior to, several standard regression models according to R^2. On deep-learning tasks, UCB1 and Thompson Sampling obtain the strongest ranks, and several MAB policies significantly outperform the unpruned model, magnitude-based neuron pruning and greedy activation-variation pruning. The results show that MAB-based neuron pruning is an effective and computationally practical approach for structured model reduction.
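The arm-pulling loop described in this abstract can be sketched with UCB1 as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the callback `masked_loss`, the toy `importance` table that stands in for a real network, and the reward definition (base loss minus masked loss) are all hypothetical.

```python
import math
import random


def ucb1_neuron_scores(masked_loss, base_loss, n_neurons, n_pulls, seed=0):
    """Estimate a safe-removal reward per neuron with UCB1.

    masked_loss(i, rng) is a hypothetical callback returning the loss on a
    sampled mini-batch with neuron i temporarily masked.
    """
    rng = random.Random(seed)
    counts = [0] * n_neurons
    means = [0.0] * n_neurons
    for t in range(1, n_pulls + 1):
        if t <= n_neurons:
            arm = t - 1  # pull every arm once before using the bound
        else:
            arm = max(
                range(n_neurons),
                key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        # "Pull": mask the neuron, measure the loss change, restore it.
        reward = base_loss - masked_loss(arm, rng)  # higher = safer to remove
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return means, counts


# Toy stand-in for a network: masking neuron i raises the loss by
# importance[i] plus a little mini-batch noise.
importance = [0.50, 0.02, 0.30, 0.40]


def toy_masked_loss(i, rng):
    return 1.0 + importance[i] + rng.gauss(0.0, 0.01)


means, counts = ucb1_neuron_scores(toy_masked_loss, 1.0, n_neurons=4, n_pulls=200)
prune_candidate = max(range(4), key=lambda i: means[i])
```

Here the neuron whose masking barely changes the loss ends up with the highest estimated reward and becomes the pruning candidate. A real setup would replace `toy_masked_loss` with a forward pass over a sampled mini-batch.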

2606.07617 2026-06-09 cs.LG cs.AI cross-list

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

Query Lens: 通过间接效应解释稀疏键值特征

Hwiyeong Lee, Ingyu Bang, Uiji Hwang, Hyelim Lim, Taeuk Kim

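Affiliations: KAIST

AI summary: Proposes Query Lens, which interprets sparse autoencoder features more comprehensively and faithfully by considering encoder-side key features, decoder-side value features, and indirect effects through downstream modules.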

Comments Accepted to ICML 2026

2606.07618 2026-06-09 cs.LG cs.AI cs.CV cross-list

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

ScaleSweep: 通过块尺度初始化实现LLM的精确NVFP4训练后量化

Li Lin, Xiaojun Wan

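Affiliations: Wangxuan Institute of Computer Technology, Peking University

AI summary: Proposes ScaleSweep, which optimizes scale initialization for NVFP4 quantization by sweeping feasible block-scale candidates and selecting the one that minimizes a target objective, derives theoretical bounds on the sweep range, and improves quantization performance on Llama and Qwen models, narrowing the gap to full precision.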

Comments under review

AI-generated Chinese abstract

NVFP4是一种最近引入的硬件支持的FP4格式，通过细粒度块尺度提高了4位量化的保真度。然而，现有的NVFP4尺度初始化方法仍然主要依赖于AbsMax初始化，这与最优解之间存在明显差距。为了解决这个问题，我们提出了ScaleSweep，一种简单高效的尺度优化方法，它扫描可行的块尺度候选，并选择最小化目标函数的候选。我们进一步提供了NVFP4量化的理论分析，并推导了在原始张量与量化重建张量之间的均方误差（MSE）和加权均方误差（WMSE）下所需扫描范围的上下界。所提出的界限大幅减少了扫描空间，同时保留了最优候选，使得与基线量化算子相比开销可忽略。在Llama和Qwen模型上的实验表明，ScaleSweep持续优于现有的初始化方法，并进一步缩小了与全精度的差距。特别是在对权重、激活、KV缓存和查询状态进行激进的全端到端量化时，ScaleSweep保留了超过93%的全精度性能。

English abstract

NVFP4 is a recently introduced hardware-supported FP4 format that improves the fidelity of 4-bit quantization through fine-grained block scales. However, existing NVFP4 scale initialization methods still primarily rely on AbsMax initialization, which leaves a noticeable gap to the optimal solution. To address this, we propose ScaleSweep, a simple and efficient scale optimization method that sweeps over feasible block scale candidates and selects the candidate that minimizes a target objective. We further provide a theoretical analysis of NVFP4 quantization and derive both lower and upper bounds for the required sweep range under mean square error (MSE) and weighted mean square error (WMSE) between the original tensor and the quantized reconstructed tensor. The proposed bounds substantially reduce the sweep space while preserving the optimal candidate, enabling negligible overhead compared with the baseline quantization operators. Experiments on Llama and Qwen models demonstrate that ScaleSweep consistently improves quantization performance over existing initialization methods and further narrows the gap to full precision. In particular, under aggressive end-to-end quantization of weights, activations, KV cache, and query states, ScaleSweep preserves more than 93% of the full-precision performance.
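The contrast between AbsMax initialization and a scale sweep can be illustrated with a small sketch. The assumptions are loud ones: the candidate grid (multiples of the AbsMax scale from 0.70 to 1.10) is hypothetical and is not the sweep range derived in the paper, the objective is plain MSE, and the rounding of the block scale to a low-precision format is omitted. Only the FP4 (E2M1) magnitude grid is standard.

```python
# Magnitudes representable in FP4 E2M1; the sign is handled separately.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block, scale):
    """Round each value to the nearest signed FP4 magnitude times scale."""
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(scale * mag if x >= 0 else -scale * mag)
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def absmax_scale(block):
    """AbsMax initialization: map the largest magnitude onto the top FP4 value."""
    return max(abs(x) for x in block) / FP4_GRID[-1]


def sweep_scale(block, factors):
    """Pick the candidate scale with the lowest reconstruction MSE."""
    base = absmax_scale(block)
    candidates = [base * f for f in factors]
    return min(candidates, key=lambda s: mse(block, quantize_block(block, s)))


# One illustrative block with a single outlier.
block = [0.1, -0.2, 0.05, 0.3, -0.15, 0.25, 0.12, -0.08,
         0.02, 0.18, -0.22, 0.07, 0.11, -0.03, 0.16, 1.2]
factors = [i / 20 for i in range(14, 23)]  # 0.70 ... 1.10, includes 1.0
best_scale = sweep_scale(block, factors)
mse_absmax = mse(block, quantize_block(block, absmax_scale(block)))
mse_sweep = mse(block, quantize_block(block, best_scale))
```

Because the candidate list contains the AbsMax scale itself, the swept scale can never do worse than AbsMax on the chosen objective; the bounds described in the abstract address how narrow the sweep can be made without losing the best candidate.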

2606.07621 2026-06-09 cs.LG cs.AI cs.DC cross-list

HASA: Subnet Allocation for Compute-Constrained Model-Heterogeneous Federated Learning

HASA：计算受限的模型异构联邦学习中的子网分配

Amir Hossein Shahdadian, Ahmed M. Abdelmoniem, Mahdi Taheri, Samira Nazari, Christian Herglotz

Affiliations: University of Naples "Federico II"; Queen Mary University of London; Brandenburg University of Technology Cottbus-Senftenberg; Tallinn University of Technology; University of Zanjan

AI summary: Proposes HASA, which assigns subnet widths according to client heterogeneity scores and improves mean and worst-client accuracy under a fixed compute budget.

AI-generated Chinese abstract

边缘服务越来越多地使用联邦学习来个性化设备上的模型，同时将敏感数据保留在本地。在实践中，部署必须处理客户端资源和本地数据分布的异构性。模型异构联邦学习通过允许每个客户端训练共享超网的子网来降低客户端成本，但大多数子网分配策略由设备约束驱动，并未明确考虑统计异构性。本文提出异构感知子网分配（HASA），这是一种仅训练规则，根据从本地训练数据计算的客户端异构性分数分配子网宽度，同时强制执行固定的大小加权计算预算。该设计能够与替代分配策略进行预算匹配的比较。在包含七个客户端的文章标题下一个单词预测基准测试中，HASA在10个匹配种子上的未加权平均客户端测试准确率优于均匀分配，将平均客户端测试准确率从13.82%提高到14.32%，并平均提高了最差客户端准确率。在与代表性部分训练基线的匹配预算比较中，HASA在该基准测试上实现了最强的最差客户端和尾部客户端准确率。方向性消融实验表明，将较小的子网分配给更异构的客户端会降低平均和尾部性能。跨领域图像分类研究进一步表明，异构感知分配的有效性取决于异构性分数反映客户端对额外模型宽度需求的程度。

English abstract

Edge services increasingly use federated learning to personalize on-device models while keeping sensitive data local. In practice, deployments must handle heterogeneity in both client resources and local data distributions. Model-heterogeneous federated learning lowers client cost by allowing each client to train a subnet of a shared supernet, but most subnet-allocation policies are driven by device constraints and do not explicitly account for statistical heterogeneity. This paper proposes Heterogeneity-Aware Subnet Allocation (HASA), a train-only rule that assigns subnet widths based on client heterogeneity scores computed from local training data while enforcing a fixed size-weighted compute budget. This design enables budget-matched comparisons with alternative allocation policies. On an article-title next-word prediction benchmark with seven clients, HASA improves unweighted mean client test accuracy over uniform allocation across 10 matched seeds, increasing mean client test accuracy from 13.82 percent to 14.32 percent, and improves worst-client accuracy on average. In a matched-budget comparison with representative partial-training baselines, HASA achieves the strongest worst-client and tail-client accuracy on this benchmark. A directionality ablation shows that assigning smaller subnets to more heterogeneous clients degrades both mean and tail performance. A cross-domain image-classification study further shows that the effectiveness of heterogeneity-aware allocation depends on how well the heterogeneity score reflects clients' need for additional model width.

2606.07630 2026-06-09 cs.LG cs.AI stat.ML cross-list

Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance

基于基础模型先验的主动学习：类别不平衡下的高效学习

Jiancheng Zhang, Meiqing Li, Qi Zhang, Yinglun Zhu

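Affiliations: University of California, Riverside; Carnegie Mellon University; Worcester Polytechnic Institute

AI summary: Targets class imbalance and noisy annotations in real-world data with an active learning framework that uses foundation model priors and imbalance-aware co-decisions to select the most informative samples, achieving annotation savings of over 50% on image and text datasets.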

Comments To appear at ICML 2026

AI-generated Chinese abstract

现实世界中图像和文本领域的数据集通常具有偏斜的类别分布和噪声标注，这共同降低了模型性能，尤其是对少数类。在现有解决方案中，主动学习通过选择性地查询信息最丰富且平衡的样本进行标注，提供了一种有效且高效的范式。我们提出了一种创新的主动学习框架，该框架减轻了类别不平衡，并选择信息量最大的样本进行标注。利用基础模型先验，我们的算法使得基础模型和小模型之间能够进行不平衡感知的协同决策，以处理跨领域的有噪声和不平衡标签。我们首次系统性地研究了在图像和文本领域中标签噪声和类别不平衡双重挑战下的主动学习。在不平衡数据集上的大量实验表明，我们的方法实现了显著的标注节省——与最佳主动学习基线相比超过50%——同时保持了对标签噪声的性能和鲁棒性。

English abstract

Real-world datasets across image and text domains are often characterized by skewed class distributions and noisy annotations, which jointly degrade model performance, particularly on minority classes. Among existing solutions, active learning offers an effective and efficient paradigm by selectively querying the most informative and balanced samples for annotation. We propose an active learning framework that mitigates class imbalance and selects the most informative samples to annotate. Leveraging foundation model priors, our algorithm enables imbalance-aware co-decisions between a foundation model and a small model to tackle noisy and imbalanced labels across various domains. We present the first systematic study of active learning under the dual challenges of label noise and class imbalance across image and text domains. Extensive experiments on imbalanced datasets demonstrate that our method achieves substantial annotation savings (over 50% compared to the best active learning baseline) while preserving performance and robustness to label noise.

2606.07646 2026-06-09 cs.CV cs.AI cross-list

DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation

DOME：从稀疏监督中学习可迁移域变量用于测试时自适应

Xiaoran Xu, Yifan Xu, Yupeng Wu, Xiaoshan Yang, Changsheng Xu

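Affiliations: MAIS, IACAS (Multimodal Artificial Intelligence Systems Laboratory, Institute of Automation, Chinese Academy of Sciences)

AI summary: Proposes DOME, a domain encoder that extracts dense, continuous representations via vision-language pretraining, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank, enabling zero-shot explicit domain modeling that outperforms complex TTA methods on multiple benchmarks.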

AI-generated Chinese abstract

测试时自适应（TTA）旨在仅使用无标签流数据将模型对齐到变化的测试域。现有方法大多隐式推断单个全局域分布，忽略了真实世界域迁移的多维性和样本特异性，导致自适应脆弱。我们提出DOME，一种有效的域编码器，以零样本方式显式建模每个样本的域。DOME利用视觉-语言预训练提取密集、连续的表示，将域参数化为分布变量，并引入动量更新的稀疏域库用于解耦监督。通过将这些显式域线索注入下游模型，即使是最基本的熵最小化TTA策略也在ImageNet-C、ImageNet-R和ImageNet-Sketch上达到了最先进的性能，超越了复杂的TTA方法。我们的结果表明，鲁棒的自适应并非源于复杂的自适应算法，而是源于显式的、结构化的域表示。

English abstract

Test-time adaptation (TTA) aims to align a model to shifting test domains using only unlabeled streaming data. Most existing methods implicitly infer a single global domain distribution, ignoring the multidimensional and sample-specific nature of real-world domain shifts, leading to fragile adaptation. We propose DOME, an effective domain encoder that explicitly models each sample's domain in a zero-shot manner. DOME leverages vision-language pretraining to extract dense, continuous representations, parameterizes domains as distributional variables, and introduces a momentum-updated sparse domain bank for disentangled supervision. By injecting these explicit domain cues into downstream models, even a basic entropy-minimization TTA strategy achieves state-of-the-art performance across ImageNet-C, ImageNet-R, and ImageNet-Sketch, outperforming complex TTA approaches. Our results demonstrate that robust adaptation stems not from intricate adaptation algorithms, but from explicit, structured domain representation.

2606.07664 2026-06-09 cs.NE cs.AI cross-list

Seq103: A Unified Neuroevolution Framework for Compact Sequence Architecture Discovery

Seq103: 用于紧凑序列架构发现的统一神经进化框架

Wenxiao Li, Yongjian Liu, Qing Xie

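Affiliations: School of Computer Science and Artificial Intelligence, Wuhan University of Technology

AI summary: Proposes Seq103, a unified neuroevolution framework that uses a shared evolutionary backbone and an optional recurrent extension to search for compact sequence-classification architectures, retaining 81.95% to 86.96% of the best-baseline accuracy on average on text and time-series datasets with far fewer parameters.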

Comments 18 pages, 2 figures, 8 tables

AI-generated Chinese abstract

神经进化是一种代表性的神经架构搜索范式，通过进化算法同时演化网络拓扑和权重。本文提出Seq103，一个统一的NEAT风格神经进化框架，用于紧凑序列架构发现。Seq103包含一个共享的进化主干和一个可选的循环扩展。共享主干包括基本的节点-连接表示、基于每类RMSE的评估、带有类级重组的基于突变的进化以及精英策略。可选的隐藏状态机制通过隐藏状态节点和隐藏连接扩展搜索空间，在需要逐步循环推理时提供时间记忆。通过这种设计，Seq103将相同的核心搜索流程应用于逐步循环和样本级前馈序列分类。在循环任务中，启用隐藏状态扩展以提供时间记忆；在前馈任务中，禁用该扩展，而共享进化主干保持不变。我们在8个文本分类数据集和包含128个单变量时间序列数据集的完整UCRArchive2018基准上评估Seq103。在逐步任务中，Seq103平均保留最佳基线准确率的86.96%，同时参数数量减少34.6倍至3218.0倍。在完整UCRArchive2018基准的样本级任务中，Seq103平均保留最佳基线准确率的81.95%，同时参数数量减少11.8倍至160,601.0倍。

English abstract

Neuroevolution is a representative neural architecture search paradigm that evolves both network topology and weights through evolutionary algorithms. In this paper, we propose Seq103, a unified NEAT-style neuroevolution framework for compact sequence architecture discovery. Seq103 consists of a shared evolutionary backbone and an optional recurrent extension. The shared backbone includes an elementary node-and-connection representation, per-class RMSE-based evaluation, mutation-based evolution with class-wise recombination, and elitism. The optional hidden-state mechanism extends the search space with hidden-state nodes and hidden connections, enabling temporal memory when step-wise recurrent inference is required. With this design, Seq103 applies the same core search pipeline to both step-wise recurrent and sample-wise feedforward sequence classification. In recurrent tasks, the hidden-state extension is enabled to provide temporal memory; in feedforward tasks, it is disabled while the shared evolutionary backbone remains unchanged. We evaluate Seq103 on 8 text classification datasets and the full UCRArchive2018 benchmark with 128 univariate time-series datasets. On step-wise tasks, Seq103 retains 86.96% of the best-baseline accuracy on average while using 34.6x to 3218.0x fewer parameters. On sample-wise tasks over the full UCRArchive2018 benchmark, Seq103 retains 81.95% of the best-baseline accuracy on average while using 11.8x to 160,601.0x fewer parameters.

2606.07670 2026-06-09 cs.CV cs.AI cross-list

HARP: Efficient Data Selection for Finetuning Large Language Models

HARP：高效数据选择用于微调大型语言模型

Ning Wang, Zhengxin Zhang, Maosen Tang, Yitang Gao, Claire Cardie, Sainyam Galhotra

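Affiliations: Cornell University; The Hong Kong University of Science and Technology

AI summary: Proposes Hierarchical Active Region Pruning (HARP), an efficient train-based data selection method that lowers selection cost through a hierarchy and empirical Bayes inference while preserving downstream alignment, outperforming the strongest baseline by up to 8.9 points with roughly 7x fewer training examples.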

AI-generated Chinese abstract

微调数据选择需要平衡两个相互竞争的目标：选择改善下游目标的示例，以及在不重复微调模型的情况下做到这一点。无训练选择器具有可扩展性，但依赖于嵌入相似性或聚类等代理，这些可能无法匹配目标目标。基于训练的选择器通过梯度信号、子集评估或Shapley归因更好地反映下游效用，但需要大量昂贵的训练-评估迭代。我们提出层次主动区域剪枝（HARP），一种高效的基于训练的选择器，在降低选择成本的同时保持下游对齐。HARP将训练池组织成节点-叶子层次结构，仅评估代表性叶子，并使用经验贝叶斯后验推断未测量的效用。然后，它使用两个互补的包络选择数据：HARP-C，保守地控制冗余，以及HARP-E，加性地奖励互补区域。我们理论上证明，在局部平滑和有界估计误差下，HARP控制选择误差同时降低训练-评估成本。我们进一步验证，HARP变体实现了最佳结果，并在使用大约7倍更少训练示例的情况下，比最强基线高出最多8.9分。

English abstract

Finetuning data selection requires balancing two competing goals: selecting examples that improve the downstream objective, and doing so without repeatedly finetuning models. Train-free selectors are scalable but rely on proxies such as embedding similarity or clustering, which may not match the target objective. Train-based selectors better reflect downstream utility through gradient signals, subset evaluation, or Shapley attribution, but require many costly train--evaluate iterations. We propose Hierarchical Active Region Pruning (HARP), an efficient train-based selector that preserves downstream alignment while reducing selection cost. HARP organizes the training pool into a node--leaf hierarchy, evaluates only representative leaves, and infers unmeasured utilities with empirical Bayes posteriors. It then selects data using two complementary envelopes: HARP-C, which conservatively controls redundancy, and HARP-E, which additively rewards complementary regions. We theoretically show that, under local smoothness and bounded estimation error, HARP controls selection error while reducing train--evaluate cost. We further validate that HARP variants achieve the best result and outperform the strongest baseline by up to $+8.9$ points, while using roughly $7\times$ fewer training examples.

2606.07698 2026-06-09 cs.LG cs.AI cross-list

Pharmacogenomic Knowledge Graph Augmentation for Graph Neural Network-Based Drug-Drug Interaction Prediction

基于图神经网络的药物相互作用预测的药理基因组学知识图谱增强

Juergen Dietrich

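Affiliations: AI Solutions Berlin

AI summary: Augments graph neural networks for drug-drug interaction prediction with pharmacogenomic prior knowledge from PharmGKB (CYP enzyme annotations) supplied as a feature vector, substantially improving DDI type classification under pair-level splits without breaking through the Information Ceiling.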

Comments 13 pages

AI-generated Chinese abstract

应用于药物相互作用（DDI）预测的图神经网络（GNN）仅依赖由SMILES衍生的分子结构图。该系列先前的工作表明，模型性能受限于训练标签的结构信息含量——即信息天花板——仅靠架构改进无法克服。本研究探讨来自PharmGKB数据库的药理基因组学先验知识是否通过提供独立于分子结构且互补的代谢通路背景，部分关闭这一天花板。提取四种临床相关亚型（CYP2D6、CYP3A4、CYP2C19、CYP2C9）的细胞色素P450（CYP）酶底物、抑制剂和诱导剂注释，并将其作为12维特征向量在交互预测前与分子嵌入拼接。在配对水平和药物水平数据划分下进行实验，以量化对未见药物的泛化能力。结果表明，在配对水平划分条件下，知识图谱（KG）增强显著改善了DDI类型分类（F1宏平均：0.532对比基线0.241），而二元交互检测和药物水平泛化仍受信息天花板限制（AUC提升：0.224对比基线0.250）。对严格保留化合物的机制验证确认，增强优先改善CYP2C9介导的交互预测，概率从基线0.033-0.117提升至KG增强后的0.560-0.586。在Tox21基准上的单分子毒性预测扩展实验证实，该效果取决于药理基因组学注释覆盖度。这些发现为后续研究提出的多模态框架提供了动机。

English abstract

Graph neural networks (GNNs) applied to drug-drug interaction (DDI) prediction rely exclusively on molecular structure encoded as SMILES-derived graphs. Prior work in this series demonstrated that model performance is bounded by the structural information content of training labels -- an Information Ceiling -- that architectural refinements alone cannot overcome. The present study investigates whether pharmacogenomic prior knowledge from the PharmGKB database partially closes this ceiling by providing metabolic pathway context that is independent of, and complementary to, molecular structure. Cytochrome P450 (CYP) enzyme substrate, inhibitor, and inducer annotations for four clinically relevant isoforms (CYP2D6, CYP3A4, CYP2C19, CYP2C9) are extracted and incorporated as a 12-dimensional feature vector concatenated to the molecular embedding prior to interaction prediction. Experiments are conducted under both pair-level and drug-level data splits to quantify generalization to unseen drugs. Results indicate that knowledge graph (KG) augmentation substantially improves DDI type classification under pair-level split conditions (F1-macro: 0.532 vs. 0.241 baseline), while binary interaction detection and drug-level generalization remain bounded by the Information Ceiling (AUC inflation: 0.224 vs. 0.250 baseline). Mechanistic validation on strictly held-out compounds confirms that augmentation preferentially improves CYP2C9-mediated interaction prediction, with probabilities increasing from 0.033-0.117 (baseline) to 0.560-0.586 (KG-augmented). An extension to single-molecule toxicity prediction on the Tox21 benchmark confirms that the effect is contingent on pharmacogenomic annotation coverage. These findings motivate the multimodal framework proposed for the subsequent study in this series.

2606.07700 2026-06-09 cs.LG cs.AI cross-list

EssentialGIN: a new approach for gene essentiality prediction based on graph isomorphism neural networks

EssentialGIN：基于图同构神经网络的新基因必需性预测方法

Sahar Mansouri-Rad, Zahra Narimani, Parvin Razzaghi, Nazanin Hosseinkhan

Affiliations: Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS); Endocrine Research Center, Institute of Endocrinology and Metabolism, Iran University of Medical Sciences

AI summary: Proposes EssentialGIN, a model based on graph isomorphism networks (GIN) that integrates PPI network topology with multi-source biological data such as gene expression, orthology, and subcellular localization, and significantly outperforms existing methods in complex organisms such as humans.

Comments 19 pages, 5 figures, 8 tables

AI-generated Chinese abstract

背景：必需基因（蛋白质）的预测是一个基本且具有挑战性的问题，同时在湿实验中进行非常昂贵且耗时。仅基于计算方法（引入湿实验候选）使用中心性度量预测必需基因并不准确，会导致大量假阳性；因此，最近的研究使用更复杂的模型（如深度学习）以及整合生物信息来识别必需基因。

方法：在这项工作中，我们专注于图同构网络，将蛋白质作为PPI网络中的节点进行嵌入，以保留PPI网络的拓扑特征，并整合生物数据，如基因表达数据、基因直系同源信息和基因亚细胞定位信息，引入了一种用于预测必需基因的深度架构。本文修改了图同构网络架构以嵌入节点信息。

结果：我们的实验证明，所提出的方法优于基于中心性的基线方法以及基于机器学习的方法，如Node2Vec、MLP和图注意力网络（GAT）。

结论：在本文中，我们观察到使用整合生物数据（作为节点属性）并保留网络拓扑的图同构网络可以显著提高必需基因预测的准确性。在较简单的生物体（如大肠杆菌和黑腹果蝇）中，使用Node2Vec嵌入的多层感知机等方法也表现良好，但在人类中，所引入的架构显著优于深度学习和其他图神经网络解决方案。

关键词：必需基因预测，图神经网络，图同构网络，PPI网络，节点嵌入

English abstract

Background: Predicting essential genes (proteins) is a fundamental and challenging problem, and identifying them through wet-lab experiments is costly and time-consuming. Computational prediction based only on centrality measures (used to propose wet-lab candidates) is inaccurate and yields many false positives; recent research therefore uses more complex models such as deep learning, together with integrated biological information, to identify essential genes. Methods: We use graph isomorphism networks to embed proteins as nodes of a PPI network so that the network's topological features are preserved, integrate biological data such as gene expression, gene orthology, and gene subcellular localization, and introduce a deep architecture for predicting essential genes. The graph isomorphism network architecture is modified to embed node information. Results: In our experiments, the proposed method outperforms centrality-based baselines as well as machine-learning methods such as Node2Vec, MLP, and graph attention networks (GAT). Conclusion: Graph isomorphism networks that integrate biological data (as node attributes) and preserve network topology can significantly improve essential gene prediction accuracy. In simpler organisms such as E. coli and D. melanogaster, methods such as a multi-layer perceptron on Node2Vec embeddings also perform very well, but in H. sapiens the proposed architecture significantly outperforms deep learning and other graph neural network solutions. Keywords: essential gene prediction, graph neural network, graph isomorphism network, PPI network, node embedding

2606.07702 2026-06-09 cs.LG cs.AI cross-list

EvoCSFL: Surrogate-Assisted Evolutionary Client Selection for Efficient and Robust Federated Learning

EvoCSFL：基于代理辅助的进化客户端选择实现高效鲁棒联邦学习

Lin Qiang, Sun Xiaoyan, Hu Yao, Fang Wei

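Affiliations: Jiangnan University; The Hong Kong Polytechnic University

AI summary: To address the slow convergence and poor robustness caused by client data and system heterogeneity in federated learning, proposes a surrogate-assisted evolutionary client selection framework that casts selection as combinatorial optimization and uses a surrogate model to accelerate the evolutionary search; experiments show faster convergence, lower energy consumption, and better robustness.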

AI-generated Chinese abstract

客户端数据和系统的异构性使得采用随机客户端选择的联邦学习难以获得令人满意的收敛速度和鲁棒性。为解决此问题，本文提出了一种基于代理辅助的客户端进化选择框架。在该框架中，首先使用一些典型的客户端选择策略生成候选集，并开发了一个集成模型性能、通信延迟和能量消耗的度量函数，将客户端选择问题表述为组合优化问题。随后，利用候选选择和度量构建代理模型，以高效逼近所选客户端子集的性能。采用进化算法搜索客户端选择的组合空间，并由代理模型引导以加速收敛。在MNIST、CIFAR10、CINIC10和TinyImageNet上的实验表明，与现有方法相比，所提算法实现了更快的收敛、更低的能量消耗和更好的鲁棒性。

English abstract

The heterogeneity of client data and systems makes it difficult to achieve satisfactory convergence speed and robustness in federated learning with random client selection. To address this issue, this paper proposes a surrogate-assisted client evolutionary selection framework for federated learning. In this framework, some typical client selection strategies are first used to generate candidate sets, and a metric function that integrates model performance, communication latency, and energy consumption is developed to formulate the client selection problem as a combinatorial optimization one. Subsequently, a surrogate model is constructed using the candidate selections and metric to efficiently approximate the performance of selected client subsets. An evolutionary algorithm is employed to search the combinatorial space of client selections, guided by the surrogate model to accelerate convergence. Experiments on MNIST, CIFAR10, CINIC10, and TinyImageNet demonstrate that the proposed algorithm achieves faster convergence, lower energy consumption, and improved robustness compared to existing methods.

2606.07703 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

需要多少密集注意力？面向混合长上下文模型中全/GQA层的Oracle引导稀疏预填充

Hongxing Wang, Harenome Razanajato, Zhen Zhang, Yujie Yuan, Hongsheng Liu

发表机构 * Technical Report, First Release（技术报告，首次发布）

AI总结研究在混合长上下文模型中，通过Oracle引导的稀疏预填充减少密集注意力计算，在保持任务性能的同时实现加速，并验证了可行性、索引器质量和运行时加速潜力。

Comments Technical report, first release, 26 pages, 2 figures, 11 tables

详情

AI中文摘要

长上下文预填充仍然昂贵，因为即使在包含局部、稀疏、线性或循环组件的混合模型中，全/GQA层仍然对整个历史序列进行评分。我们研究了在显式支持粒度和top-k预算下，需要多少密集注意力来保持任务级行为。我们为现有的GQA检查点引入了一种注意力质量top-k oracle：对于每个层和查询位置，它计算密集注意力，选择头平均的token支持，并仅在该支持上重新计算注意力。该oracle是一个诊断参考，而非可部署的加速器，并将稀疏预算可行性从索引器误差和运行时实现效果中分离出来。在Qwen家族的检索密集型评估中，每个查询的最长oracle行与密集注意力相差在1个点以内，而Qwen3.5-9B在4K到100K的RULER风格扫描中相差在0.48个点以内。在oracle的指导下，我们通过KL蒸馏从密集注意力质量分布中训练了一个头折叠的辅助索引器，同时保持骨干网络冻结。使用分别蒸馏的Qwen3.5-0.8B和Qwen3.5-9B索引器，报告的16K/32K验证宏观差距分别为+2.04和+1.13个点，这被视为质量保持而非改进；融合的选择块共享支持可能引入更大的实现差距。初步的单卡TTFT测量显示，与密集FlashAttention-2基线相比，蒸馏索引器的稀疏服务加速比在NPU上对Qwen3.5-0.8B为1.71倍，在GPU上对Qwen3.5-9B为1.93倍。额外的随机初始化压力行达到3.44倍，表明稀疏运行时存在提升空间，但输出质量未经验证。本次发布首次分离了oracle可行性、蒸馏索引器质量和运行时提升空间，将完全匹配的质量-延迟前沿留待未来工作。

英文摘要

Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, sparse, linear, or recurrent components. We study how much dense attention is needed to preserve task-level behavior under explicit support granularity and top-k budgets. We introduce an attention-mass top-k oracle for existing GQA checkpoints: for each layer and query position, it computes dense attention, selects head-averaged token support, and recomputes attention only on that support. The oracle is a diagnostic reference, not a deployable accelerator, and separates sparse-budget feasibility from indexer error and runtime realization effects. On Qwen-family retrieval-heavy evaluations, the longest per-query oracle rows stay within 1 point of dense, and a Qwen3.5-9B RULER-style sweep from 4K to 100K stays within 0.48 points. Guided by the oracle, we derive a head-collapsed auxiliary indexer trained by KL distillation from dense attention-mass distributions while keeping the backbone frozen. With separately distilled Qwen3.5-0.8B and Qwen3.5-9B indexers, the reported 16K/32K validation macro gaps are +2.04 and +1.13 points, treated as quality preservation rather than improvement; fused selection-block-shared support can introduce a larger realization gap. Preliminary single-card TTFT measurements show distilled-indexer sparse serving speedups of 1.71x for Qwen3.5-0.8B on NPU and 1.93x for Qwen3.5-9B on GPU against its dense FlashAttention-2 baseline. Additional random-init stress rows reach 3.44x, indicating sparse-runtime headroom but not validated output quality. This first release separates oracle feasibility, distilled-indexer quality, and runtime headroom, leaving a fully matched quality-latency frontier to future work.
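
The oracle procedure described above (dense attention, head-averaged token support, attention recomputed on that support) can be sketched for a single query position as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `oracle_topk_attention`, the single shared key/value group, and the omission of causal masking and batching are all choices made here for brevity.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def oracle_topk_attention(q_heads, keys, values, k):
    """Hypothetical sketch of an attention-mass top-k oracle.

    q_heads: one query vector per head, all for the same query position.
    keys, values: per-token vectors shared by those heads (one KV group).
    Returns (support, outputs), with one output vector per head.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    n = len(keys)
    # 1) Dense attention probabilities for every head.
    dense = [softmax([scale * dot(q, key) for key in keys]) for q in q_heads]
    # 2) Head-averaged attention mass per token.
    mass = [sum(p[t] for p in dense) / len(dense) for t in range(n)]
    # 3) Keep the k tokens with the largest averaged mass.
    support = sorted(sorted(range(n), key=lambda t: -mass[t])[:k])
    # 4) Recompute attention restricted to the selected support.
    outputs = []
    for q in q_heads:
        probs = softmax([scale * dot(q, keys[t]) for t in support])
        outputs.append([
            sum(p * values[t][j] for p, t in zip(probs, support))
            for j in range(len(values[0]))
        ])
    return support, outputs

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
q_heads = [[4.0, 0.0], [3.0, 1.0]]
support, outputs = oracle_topk_attention(q_heads, keys, values, k=2)
```

Because step 1 still computes dense attention, a sketch like this saves nothing at runtime; it only measures how much quality survives a given top-k budget, which matches the abstract's description of the oracle as a diagnostic reference.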

2606.07704 2026-06-09 cs.LG cs.AI 交叉投稿

理论最小化的注意力机制：面向内存最优Transformer内核的数组数学框架

Lenore Mullin, Gaetan Hains

发表机构 * University at Albany（奥尔巴尼大学）； Université Paris-Est Créteil（巴黎东大学克雷泰伊分校）

AI总结提出基于数组数学（MoA）的缩放点积注意力重表述，通过代数构造消除所有中间数组，实现$O(n d_k + n d_v)$数据移动，相比标准实现$O(n^2 + n d_k + n d_v)$显著降低内存流量，并验证了数值精度。

详情

AI中文摘要

注意力机制是现代基于Transformer的AI中的主要计算瓶颈。其标准实现在序列长度 $n$ 上产生二次内存流量，而DRAM访问在当代硬件上比算术操作消耗100--1000$\times$更多的能量，因此任何仅关注FLOP计数的分析从根本上误解了瓶颈。我们提出了缩放点积注意力及其数值稳定softmax的数组数学（MoA）重表述，推导出指称范式（DNF），通过代数构造而非经验调优消除了所有中间数组——包括隐式转置键缓冲区和每个softmax临时变量。DNF实现了$O(n d_k + n d_v)$的数据移动，而标准实现为$O(n^2 + n d_k + n d_v)$，其中$n$是序列长度，$d_k$是键维度，$d_v$是值维度，并在具体输入上针对PyTorch全双精度浮点进行了数值验证。与硬件特定的加速器或经验性分块方案（如FlashAttention）不同，MoA从单一代数框架同时提供了数组融合、形状变换正确性和预测性成本模型。内存最小性是在编写任何代码之前就确立的定理。预测性性能模型预计加速2--100$\times$，能耗降低2--50$\times$，优势在超大规模下进一步扩大。该推导建立了一个从Python规范经过操作范式（ONF）和维度提升硬件映射的形式化验证流水线，提供了与DARPA边缘部署和DOE超大规模优先事项直接相关的性能可移植AI内核。

英文摘要

The attention mechanism is the dominant computational bottleneck in modern transformer-based AI. Its standard implementation incurs quadratic memory traffic in the sequence length $n$, and DRAM accesses cost 100--1000$\times$ more energy than arithmetic operations on contemporary hardware, so any analysis focused solely on FLOP counts fundamentally mischaracterises the bottleneck. We present a Mathematics of Arrays (MoA) reformulation of scaled dot-product attention and its numerically stable softmax, deriving a Denotational Normal Form (DNF) that eliminates all intermediate arrays -- including the implicit transposed-key buffer and every softmax temporary -- by algebraic construction rather than empirical tuning. The DNF achieves $O(n d_k + n d_v)$ data movement versus $O(n^2 + n d_k + n d_v)$ for the standard implementation, where $n$ is the sequence length, $d_k$ is the key dimensionality and $d_v$ the value dimensionality, and is verified numerically against PyTorch at full double-precision floating-point on concrete inputs. Unlike hardware-specific accelerators or empirical tiling schemes such as FlashAttention, MoA simultaneously provides array fusion, shape-transformation correctness, and predictive cost models from a single algebraic framework. Memory minimality is a theorem established before any code is written. A predictive performance model projects $2$--$100\times$ speedup and $2$--$50\times$ energy reduction, with the advantage widening at exascale. The derivation establishes a formally verified pipeline from Python specification through Operational Normal Form (ONF) and dimension-lifted hardware mapping, providing performance-portable AI kernels of direct relevance to DARPA edge-deployment and DOE exascale priorities.
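
As a rough illustration of the two data-movement bounds quoted above, the arithmetic below plugs in hypothetical sizes ($n = 4096$ and $d_k = d_v = 128$, chosen here for illustration and not taken from the paper) and ignores the constant factors hidden by the $O(\cdot)$ notation:

```latex
\begin{align*}
n d_k + n d_v &= 4096 \cdot 128 + 4096 \cdot 128 = 1{,}048{,}576 \\
n^2 + n d_k + n d_v &= 16{,}777{,}216 + 1{,}048{,}576 = 17{,}825{,}792 \\
\frac{n^2 + n d_k + n d_v}{n d_k + n d_v} &= 1 + \frac{n}{d_k + d_v} = 1 + \frac{4096}{256} = 17
\end{align*}
```

Under these assumed sizes the two element counts differ by a factor of 17, and the last line shows that the ratio grows linearly with $n$ for fixed $d_k$ and $d_v$.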

2606.07766 2026-06-09 cs.CV cs.AI 交叉投稿

基于划分拟阵约束梯度匹配的小批量选择

Prayas Agrawal, Prateek Chanda, Ishita Khatri, Ganesh Ramakrishnan, Bamdev Mishra, Pratik Jawanpuria

发表机构 * Indian Institute of Technology Bombay（印度理工学院孟买分校）； Department of Computer Science and Engineering（计算机科学与工程系）； Centre for Machine Intelligence and Data Science（机器智能与数据科学中心）； Microsoft Research India（微软印度研究院）； Microsoft India（微软印度）

AI总结提出PartitionSel方法，通过划分拟阵约束下的梯度匹配效用最大化，实现跨域小批量选择，减少冗余并提升训练兼容性，在LLM微调中取得鲁棒性提升。

Comments 28 pages, 12 figures, ICML 2026

详情

Journal ref: Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea, PMLR 306, 2026

AI中文摘要

在异构数据上训练大型语言模型（LLMs）需要选择能够平衡收敛速度与跨领域覆盖的小批量。现有方法要么在每个领域内独立选择样本，要么依赖计算昂贵的代理模型来学习连续的领域权重。我们提出PartitionSel，一种跨领域小批量选择方法，它在每个领域的预算（编码为划分拟阵约束）下最大化验证引导的梯度匹配效用。通过单一效用耦合每个领域的预算，PartitionSel旨在减少跨领域选择中的冗余。所提出的目标是弱子模的，并允许使用正交匹配追踪算法，具有可证明的近似保证。在实验中，我们在MetaMathQA和Mol-Instructions上对Qwen2.5和Llama-3进行微调时，评估了PartitionSel的小批量选择。PartitionSel在两个基准测试中均比每个领域和领域无关的基线获得了鲁棒的提升。它还减少了每个批次内冲突梯度对的数量，表明跨领域耦合转化为更兼容的训练更新。

英文摘要

Training large language models (LLMs) on heterogeneous data requires selecting minibatches that balance convergence speed with coverage across domains. Existing methods either select samples independently within each domain or rely on computationally expensive proxy models to learn continuous domain weights. We propose PartitionSel, a cross-domain minibatch selection approach that maximizes a validation-guided gradient-matching utility under per-domain budgets encoded as a partition-matroid constraint. By coupling the per-domain budgets through a single utility, PartitionSel is designed to reduce redundancy in selections across domains. The proposed objective is weakly submodular and admits an orthogonal matching pursuit algorithm with provable approximation guarantees. Empirically, we evaluate PartitionSel for minibatch selection during the fine-tuning of Qwen2.5 and Llama-3 on MetaMathQA and Mol-Instructions. PartitionSel achieves robust gains over per-domain and domain-agnostic baselines on both benchmarks. It also reduces the number of conflicting gradient pairs within each batch, indicating that the cross-domain coupling translates into more compatible training updates.

2606.08156 2026-06-09 cs.CV cs.AI 交叉投稿

RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT

RAPID: 逐层冗余感知剪枝与重要性驱动的令牌合并以实现高效ViT

Kyumin Choi, Ikbeom Jang

发表机构 * Hankuk University of Foreign Studies（韩国外国语大学）

AI总结提出RAPID框架，根据ViT网络深度自适应调整令牌缩减策略：浅中层用冗余相似度感知剪枝，深层用重要性相似度感知合并，在ImageNet-1K上实现更优的精度-压缩帕累托前沿。

Comments 7 pages, 2 figures

详情

AI中文摘要

视觉Transformer（ViT）取得了强大性能，但由于二次自注意力复杂度而遭受高计算成本。尽管令牌缩减技术（如剪枝和合并）缓解了这一问题，但它们通常忽略了表示在网络深度上的演化。我们提出RAPID，一种深度感知的令牌缩减框架，可根据令牌表示的逐层特征自适应调整缩减策略。主要方法贡献是一种分叉策略：在浅层到中层，RAPID采用冗余相似度感知剪枝度量来消除过度表示的局部模式。当特征在更深层过渡到全局语义概念时，框架转向重要性相似度感知合并机制。该阶段利用分类（CLS）令牌注意力权重来保护语义关键令牌，同时融合不太重要但相似的邻居。在ImageNet-1K上使用ViT和DeiT架构的实验验证表明，与ToMe和ToFu等即插即用基线相比，RAPID建立了更优的精度-压缩帕累托前沿。RAPID在激进压缩场景下尤其鲁棒，在极端缩减率下比ToMe准确率高出4.29%。我们的框架提供了一种免训练模板，通过将缩减策略与层次化特征演化对齐来优化视觉模型。

英文摘要

Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity. Although token reduction techniques such as pruning and merging mitigate this, they typically overlook how representations evolve across network depth. We propose RAPID, a depth-aware token reduction framework that adapts reduction strategies to the layer-wise characteristics of token representations. The primary methodological contribution is a bifurcated strategy: in shallow-to-middle layers, RAPID employs a redundancy-similarity aware pruning metric to eliminate over-represented local patterns. As features transition to global semantic concepts in deeper layers, the framework shifts to an importance-similarity aware merging mechanism. This stage leverages classification (CLS) token attention weights to protect semantically critical tokens while fusing less important but similar neighbors. Empirical validation on ImageNet-1K using ViT and DeiT architectures demonstrates that RAPID establishes a superior accuracy-compression Pareto frontier compared to plug-and-play baselines such as ToMe and ToFu. RAPID is particularly robust in aggressive compression regimes, achieving up to 4.29% higher accuracy than ToMe at extreme reduction rates. Our framework provides a training-free template for optimizing vision models by aligning reduction strategies with hierarchical feature evolution.

2606.08167 2026-06-09 cs.LG cs.AI 交叉投稿

Explaining Data Mixing Scaling Laws

解释数据混合缩放定律

Rui Dai, Shuran Zheng

发表机构 * Beijing Institute of Technology（北京理工大学）； IIIS, Tsinghua University（清华大学交叉信息研究院）

AI总结提出统一框架解释多领域数据混合中模型损失行为，基于能力竞争和噪声减少两个关键因素，在多个尺度上有效预测高性能混合。

Comments Published to ICML 2026

详情

AI中文摘要

最近的研究建立了经验缩放定律来预测多领域数据混合上的模型性能。然而，对这些模型损失行为的理论理解仍然缺失。在这项工作中，我们提出了一个统一框架来解释数据混合的底层机制。我们的方法将最初为标准神经缩放定律（如Kaplan和Chinchilla）开发的理论视角扩展到多领域设置。基于领域在基本技能上重叠而在专门技能上分化的分布假设，我们确定了控制不同数据混合训练模型领域损失的两个关键因素：\textit{能力竞争}，其中有限模型能力的分配全局耦合了领域损失；以及\textit{噪声减少}，其中最优权重向更难学习的领域转移以最小化整体噪声。实证评估表明，我们的框架通过以更低的平均相对误差拟合损失景观并识别出更高性能的训练混合，优于现有基线。最重要的是，我们的模型成功跨尺度外推，使用较小尺度上拟合的参数预测大型未见尺度的高效混合。此外，与之前的经验定律相比，我们的模型使用显著更少的参数实现了这些结果。我们的代码可在 https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws 获取。

英文摘要

Recent research has established empirical scaling laws to predict model performance on multi-domain data mixtures. However, a theoretical understanding of these model loss behaviors remains absent. In this work, we propose a unified framework to explain the underlying mechanics of data mixing. Our approach extends theoretical perspectives originally developed for standard neural scaling laws (e.g., Kaplan and Chinchilla) to the multi-domain setting. Based on the distributional assumption that domains overlap on fundamental skills while diverging on specialized skills, we identify two key factors that govern the domain losses of models trained on different data mixtures: \textit{Capacity Competition}, where the allocation of finite model capacity couples domain losses globally, and \textit{Noise Reduction}, where optimal weights shift toward harder-to-learn domains to minimize overall noise. Empirical evaluations show that our framework outperforms existing baselines by fitting the loss landscape with a lower Mean Relative Error and identifying higher-performing training mixtures. Most importantly, our model successfully extrapolates across scales, predicting highly effective mixtures for large, unseen scales using parameters fitted on smaller ones. In addition, our model achieves these results using significantly fewer parameters compared to previous empirical laws. Our code is available at https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws.

2606.08191 2026-06-09 cs.LG cs.AI q-bio.QM 交叉投稿

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

频域潜在注意力门控用于跨域令牌聚合

Kewei Li, Rongying Zhang, Xueli Wang, Xiwen Gong, Zhongjian Wang, Lan Huang, Ruochi Zhang, Fengfeng Zhou

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University（教育部符号计算与知识工程重点实验室）； Institute for Quantitative and Computational Biology, University of California（加州大学定量与计算生物学研究所）； Greenwich High School（格林威治高中）； BCPM Data Limited（BCPM数据有限公司）

AI总结提出FLaG模块，通过实FFT变换、可学习潜在查询的频谱分量汇总、通道门控和时域重建，实现跨域令牌聚合，在AMP预测、图像分类和文本分类任务上取得提升。

详情

AI中文摘要

令牌聚合是将令牌表示映射到样本级预测的模型中的常见瓶颈，然而大多数池化方法仅在原始令牌域中操作。我们提出FLaG，一个即插即用的聚合模块，它使用实FFT变换令牌表示，用可学习的潜在查询汇总频谱分量，应用通道门控，并重建增强的时域令牌以进行最终池化。我们在使用ESM2的抗菌肽（AMP）活性预测、使用ResNet18在CIFAR-10和CIFAR-100上的图像分类，以及使用RoBERTa在IMDB和GLUE上的文本分类中评估FLaG。FLaG在ESM2-8M抗菌肽任务和CIFAR-100上取得了最明显的提升，同时在IMDB和GLUE上与强文本基线保持竞争力。然后，我们通过频带消融、门控汇总、残基扰动、潜在查询读出和结构代理分层来探究其在AMP设置中的行为。我们发现低频带贡献最大，其余高频带模式更具样本特异性。门控充当广泛共享的频谱重加权阶段，交叉注意力模式是样本特异性的，具有轻微的查询差异，并且高螺旋肽在两种细菌中表现出更强的平均频谱敏感性。补充材料、源代码和数据发布在https://www.healthinformaticslab.org/supp/ 和 https://github.com/Kewei2023/AMPCliff/tree/FLaG。

英文摘要

Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that transforms token representations with the real FFT, summarizes spectral components with learnable latent queries, applies a channel-wise gate, and reconstructs enhanced time-domain tokens for final pooling. We evaluate FLaG on antimicrobial peptide (AMP) activity prediction with ESM2, image classification with ResNet18 on CIFAR-10 and CIFAR-100, and text classification with RoBERTa on IMDB and GLUE. FLaG achieves its clearest gains on the ESM2-8M antimicrobial peptide tasks and on CIFAR-100, while remaining competitive with strong text baselines on IMDB and GLUE. Then we probe its behavior on the AMP setting with band knockouts, gate summaries, residue perturbations, latent-query readouts, and structure-proxy stratification. We find that low-frequency bands contribute the most overall, and the remaining higher-band pattern is more sample-specific. The gate acts as a broadly shared spectral reweighting stage and the cross-attention patterns are sample-specific with mild query-wise differentiation, and higher-helix peptides exhibit stronger average spectral sensitivity in both bacteria. The supplementary materials, source code and data are released at https://www.healthinformaticslab.org/supp/ and https://github.com/Kewei2023/AMPCliff/tree/FLaG.

2606.08196 2026-06-09 stat.ML cs.AI cs.LG stat.ME 交叉投稿

Beyond Additivity: Causal Discovery in Location-Scale Noise Models with Hidden Variables

超越可加性：含隐变量的位置-尺度噪声模型中的因果发现

Mariyam Khan, Shohei Shimizu, Thong Pham

发表机构 * RIKEN AIP（理化学研究所革新智能统合研究中心）； University of Bergen（卑尔根大学）； The University of Osaka（大阪大学）； Shiga University（滋贺大学）

AI总结针对含隐变量且数据生成过程遵循位置-尺度噪声模型（LSNM）的因果发现，证明满足无弓条件的非循环有向混合图（ADMG）可识别，并提出两阶段算法LSNM-UV，在异方差数据上优于可加性基线。

Comments 33 pages, 4 figures

2606.08218 2026-06-09 cs.LG cs.AI math.ST stat.ML stat.TH 交叉投稿

How Deep Are Deep GPs, Really? A Sharp Threshold and a Non-Gaussian Limit for Compositional GPs

深度高斯过程到底有多深？组合高斯过程的尖锐阈值与非高斯极限

Mark Kozdoba, Shie Mannor

发表机构 * Technion, IIT（以色列理工学院）； NVIDIA（英伟达）

AI总结本文研究了深度高斯过程先验在深度增长时的极限行为，识别出RBF核带宽的尖锐阈值，低于该阈值时先验收敛到非退化非高斯分布，具有非零坐标依赖。

详情

AI中文摘要

组合先验描述了深度贝叶斯模型中分层函数的通用属性，其中随机权重的深度神经网络是一个典型例子。在宽网络极限下，先验是一个具有深度相关核的高斯过程，其随深度增长的行为已通过该核得到广泛研究。这里，我们研究另一种情况，其中每一层本身是一个向量值高斯过程，我们的目标类似地理解先验随深度增长的极限行为。先前的高斯过程工作已确定，对于RBF核和一定范围的带宽$r$，先验在极限下退化，收敛到常数函数集——这作为概率模型是无用的。在本文中，我们建立了几个新结果。首先，我们识别出一个尖锐的带宽阈值$r_c(d) = Θ(\sqrt{d})$，高于该阈值极限是退化的，加强了先前的界限。其次，更重要的是，我们证明对于低于阈值$r_c(d)$的$r$，先验收敛到极限分布$π_{\bar{Z}}$。我们还证明这些分布是非退化且非高斯的，坐标之间具有非消失的依赖性。与先前已知的退化机制相反，深度高斯过程先验因此可以允许非平凡极限。实验上，我们在维度$d$的范围内验证了该阈值，并展示了极限分布$π_{\bar{Z}}$的复杂多模态行为——该机制随$d$增长而变得狭窄，且在不了解阈值的情况下难以识别。

英文摘要

Compositional priors describe the generic properties of layered functions in deep Bayesian models, where deep neural networks with random weights are a canonical example. In the wide-network limit, the prior is a Gaussian process with a depth-dependent kernel, and its behaviour as depth grows has been extensively studied through this kernel. Here, we study another case, where each layer itself is a vector-valued Gaussian process, and our aim is similarly to understand the limiting behaviour of the prior as depth grows. Previous GP work has established that for the RBF kernel and a certain range of bandwidths $r$, the prior degenerates in the limit, converging to the set of constant functions -- which is not useful as a probabilistic model. In this paper we establish several new results. First, we identify a sharp bandwidth threshold $r_c(d) = \Theta(\sqrt{d})$ above which the limit is degenerate, strengthening the earlier bounds. Second, and more importantly, we show that for $r$ below the threshold $r_c(d)$ the prior converges to a limit distribution $\pi_{\bar{Z}}$. We also prove that these distributions are non-degenerate and non-Gaussian, with non-vanishing dependence between coordinates. In contrast to the previously known degenerate regime, deep Gaussian process priors can therefore admit non-trivial limits. Empirically, we verify the threshold across a range of dimensions $d$, and demonstrate a complex multimodal behaviour of the limit distributions $\pi_{\bar{Z}}$ -- a regime that becomes increasingly narrow with $d$ and would be hard to identify without knowing the threshold.

2606.08327 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

Chiaroscuro Attention: Spending Compute in the Dark

明暗对比注意力：在黑暗中投入计算

Prateek Kumar Sikdar

发表机构 * Accenture（埃森哲）

AI总结提出CHIAR-Former，一种基于谱熵路由的混合Transformer，通过DCT谱混合与全注意力互补，在WikiText-103上以62.5%更少注意力FLOPs实现PPL 36.54，较全注意力基线提升45%。

Comments 8 pages, 6 figures, 3 tables

详情

AI中文摘要

STELLAR: 面向长尾物种分布建模的时空环境学习与潜在对齐精炼

Shufeng Kong, Tao Yu, Yuanyuan Wei, Caihua Liu, Junwen Bai, Yingheng Wang, Marc Grimson, Daniel Fink, Carla P. Gomes

发表机构 * Sun Yat-sen University（中山大学）； Cornell University（康奈尔大学）； Foshan University（佛山大学）； Cornell Lab of Ornithology（康奈尔鸟类学实验室）

AI总结提出STELLAR框架，通过图-时间编码器、上下文锚定潜在对齐和不平衡感知解码模块，联合优化动态栖息地上下文和群落结构，有效解决物种分布建模中的时空耦合与长尾不平衡问题。

Comments Accepted by IJCAI 2026

详情

AI中文摘要

联合物种分布建模（JSDM）是生物多样性监测和保护规划的关键工具。然而，准确的JSDM面临两个耦合挑战：环境驱动因素和物种分布本质上是时空的，而物种共现模式表现出复杂的非线性群落结构以及由稀有物种导致的严重长尾不平衡。现有方法通常孤立地处理这些因素，从静态协变量中学习或忽略动态群落结构的历史轨迹。为克服这些限制，我们提出STELLAR（时空环境学习与潜在对齐精炼），一种新颖的框架，学习一个共享潜在空间，其中动态栖息地上下文和群落结构被联合优化。我们的方法整合了三个互补组件：（1）图-时间编码器，采用图注意力和循环单元来聚合空间邻域效应并捕捉环境上下文和群落结构的共同演化历史动态；（2）上下文锚定潜在对齐机制，利用标签激活的混合先验和监督对比学习结构化潜在空间，基于共享环境偏好主动聚类物种；（3）不平衡感知解耦解码模块，利用非对称损失聚焦于困难稀有物种样本的学习，防止长尾中的模式崩溃。在领域专家精心整理的大规模eBird数据集上的实验表明，我们的框架显著优于最先进的基线，特别是在预测稀有物种和揭示可解释的物种相互作用方面。

Stage-1 控制熵状态，而非最终结果

Jianxiong Shen

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结本文通过小数据实验研究两阶段后训练中Stage-1（SFT或OPD）的作用，发现其主要影响策略熵状态，但对最终性能影响有限。

详情

AI中文摘要

两阶段后训练——Stage-1 热启动（监督微调 SFT 或在线策略蒸馏 OPD）后接 Stage-2 强化学习（RL）——越来越多地用于视觉语言模型（VLM）。我们使用 Qwen2.5-VL-7B 和同模态 72B VLM 教师进行 OPD，在小数据研究中探究 Stage-1 实际控制什么。首先，三种热启动在 Geometry3K 内部验证集上达到狭窄的 53%–54% 区间，与近期专门方法报告的窄范围一致；该设置几乎没有证据表明 Stage-1 改变了域内终点。其次，匹配配方、早停的 SFT 在域外 MathVista 上提升了 +2.1 点，逆转了过训练变体的 -9.5 点下降。最明显的区别是熵状态：OPD 进入 RL 时的策略熵显著高于任一 SFT 初始化，且这种分离在可用轨迹中持续可见。在域内初始化时，OPD 还具有更高的答案多样性和 pass@16（比 SFT 高 +2.0 到 +5.2 点），尽管问题级自举区间显示较小的对比具有不确定性。RL 后优势消失（终点 pass@16 值在 1.1 点以内），在 MathVista 上也是如此（六个模型在 1.2 点以内）。因此，我们的贡献是一个有界的实证刻画：在此设置中，Stage-1 与熵状态强相关，但下游收益小、局部化，且不能证明 OPD 是更好的 RL 热启动。

英文摘要

Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL) -- is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow $53$--$54\%$ band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by $+2.1$ points, reversing the $-9.5$-point drop of an over-trained variant. The clearest difference is the \emph{entropy regime}: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 ($+2.0$ to $+5.2$ points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within $1.1$ points) and on MathVista (six models within $1.2$ points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.

2606.09065 2026-06-09 cs.LG cs.AI 交叉投稿

OnlyDense: Reduced-Order Modeling for Lagrangian simulation

OnlyDense: 拉格朗日模拟的降阶建模

Tu Do, Shannon Ryan, Santu Rana

发表机构 * Deakin University（迪肯大学）

AI总结提出一种将粒子系统状态视为希尔伯特空间中的函数、用学习到的神经基函数线性子空间近似状态空间的降阶建模框架，实现大规模拉格朗日模拟的高效表示与预测，在百万粒子SPH模拟中R²>0.99。

详情

AI中文摘要

在科学和工程中，拉格朗日模拟方法如光滑粒子流体动力学（SPH）或物质点法（MPM）常被用于研究动态系统的行为。然而，这些方法的计算成本可能高得令人望而却步，特别是在模拟多尺度空间或时间现象时，例如宏观几何中的空洞生长和合并、空间碎片颗粒超高速撞击导致的航天器部件结构失效等。与将系统状态理解为离散粒子集合的基于图的方法不同，我们提出了一种学习框架，通过将系统状态视为函数、将其演化视为希尔伯特空间中的轨迹，实现对大规模粒子系统的可扩展表示和动力学建模。我们不将状态表示为离散粒子集或嵌入非线性潜在流形，而是用学习到的神经基函数张成的线性子空间近似状态空间。这种参数化使得可以直接投影获得潜在系数，并显式访问基函数，避免了在非线性潜在空间上的优化。由此得到的表示具有自然的解释：潜在变量对应于希尔伯特空间中的系数，基函数对应于空间模态，类似于本征正交分解。因此，该框架将经典的基于投影的降阶建模与现代深度学习统一起来，同时保持对离散化点数量的不变性。在超过一百万个粒子的大规模SPH模拟（包括具有极端变形和破碎的动态事件）上的实验表明，所提出的方法能够准确重建和预测动力学，仅用32个基函数即可达到超过0.99的R²分数。

英文摘要

In science and engineering, Lagrangian simulation methods such as Smooth Particle Hydrodynamics (SPH) or Material Point Method (MPM) are often employed to study the behavior of dynamic systems. However, these methods can be prohibitively computationally expensive, particularly when simulating multi-scale spatial or temporal phenomena, e.g., void growth and coalescence within macro-scale geometries, structural failure of spacecraft components resulting from hypervelocity impact of space debris particles, etc. In contrast to graph-based methods, where the state of the system is understood as a discrete set of particles, we propose a learning framework for scalable representation and dynamics modeling of massive particle systems by treating the system state as a function and its evolution as a trajectory in Hilbert space. Rather than representing the state as a discrete set of particles or embedding it in a nonlinear latent manifold, we approximate the state space with a linear subspace spanned by learned neural basis functions. This parameterization enables direct projection to obtain latent coefficients and explicit access to the basis functions, avoiding optimization over a nonlinear latent space. The resulting representation admits a natural interpretation: latent variables correspond to coefficients in Hilbert space, and basis functions correspond to spatial modes, analogous to Proper Orthogonal Decomposition. The framework thus unifies classical projection-based reduced-order modeling with modern deep learning, while remaining invariant to the number of discretization points. Experiments on large-scale SPH simulations with over one million particles, including dynamic events with extreme deformation and fragmentation, demonstrate that the proposed method accurately reconstructs and predicts dynamics, achieving an R$^2$ score above $0.99$ with as few as $32$ basis functions.
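
The direct-projection idea in the abstract, where latent variables are coefficients of basis functions and the representation does not depend on the number of sample points, can be sketched in one dimension. This is only a toy under stated assumptions: the fixed cosine functions in `basis` are a hypothetical stand-in for the learned neural basis functions, and the regular midpoint grid replaces the scattered particle positions of a real SPH state.

```python
import math

def basis(k, x):
    # Hypothetical stand-in for a learned basis function: an orthonormal
    # cosine family on [0, 1]. The paper learns its basis with a network.
    return 1.0 if k == 0 else math.sqrt(2.0) * math.cos(k * math.pi * x)

def project(samples, num_basis):
    """Latent coefficients as averaged inner products with each basis function.

    samples: list of (x, f(x)) pairs with x in [0, 1].
    """
    n = len(samples)
    return [sum(f * basis(k, x) for x, f in samples) / n for k in range(num_basis)]

def reconstruct(coeffs, x):
    # Evaluate the reduced representation at any query point.
    return sum(c * basis(k, x) for k, c in enumerate(coeffs))

def field(x):
    # Toy system state that lies exactly in the span of the basis.
    return 2.0 * basis(0, x) + 0.5 * basis(3, x)

def midpoint_samples(n):
    return [((i + 0.5) / n, field((i + 0.5) / n)) for i in range(n)]

# The same state sampled at two very different resolutions.
coarse = project(midpoint_samples(64), num_basis=4)
fine = project(midpoint_samples(1000), num_basis=4)
```

In this toy setting both resolutions recover the same four coefficients, which is the sense in which a function-space representation stays invariant to the number of discretization points; no optimization over a latent space is involved, only a projection.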

2606.09112 2026-06-09 cs.LG cs.AI 交叉投稿

Hybridizing Equilibrium Propagation with Ising Machines for Efficient Energy-Based Learning

将平衡传播与伊辛机混合以实现高效的基于能量的学习

Chen-Rui Fan, Bo Lu, Xing-Yu Wu, Tie-Jun Wang, Chuan Wang

发表机构 * School of Artificial Intelligence, Beijing Normal University（北京师范大学人工智能学院）； Laboratory for Advanced Computing and Intelligence Engineering, Information Engineering University（信息工程大学先进计算与智能工程实验室）； School of Physical Science and Technology, Beijing University of Posts and Telecommunications（北京邮电大学物理科学与技术学院）

AI总结提出一种受伊辛动力学启发的平衡传播框架，通过扩展相空间动力学替代耗散Hopfield松弛，加速收敛、提高噪声鲁棒性，并在MNIST等数据集上实现与反向传播相当的性能。

详情

AI中文摘要

人工智能的快速发展推动了深度神经网络的重大进步。然而，传统的基于GPU的训练仍然高度耗能，这促使人们探索物理动力学和兼容的基于能量的学习方案，例如平衡传播（EP）。然而，基于EP的训练常常由于相空间收缩而陷入局部最小值。本文介绍了一种受伊辛动力学启发的平衡传播框架，其中耗散的Hopfield松弛被具有共轭变量的扩展相空间动力学所取代。由此产生的训练范式保留了EP的局部两阶段学习规则，同时改变了神经状态达到平衡的物理路径。我们表明，这种动力学降低了有效能量壁垒，加速了收敛，提高了噪声鲁棒性，并在MNIST、FashionMNIST和CIFAR-10上训练了深度卷积Hopfield网络，性能与反向传播相当。

英文摘要

The rapid evolution of artificial intelligence has led to substantial advances in deep neural networks. Nonetheless, conventional GPU-based training remains highly energy-demanding, motivating the exploration of physical dynamics and compatible energy-based learning schemes, such as equilibrium propagation (EP). EP-based training, however, frequently suffers from convergence to local minima due to phase-space contraction. Here we introduce an Ising-dynamics-inspired equilibrium-propagation framework in which dissipative Hopfield relaxation is replaced by an extended phase-space dynamics with conjugate variables. The resulting training paradigm keeps the local two-phase learning rule of EP while changing the physical route by which neural states reach equilibrium. We show that this dynamics lowers effective energy barriers, accelerates convergence, improves noise robustness, and trains deep convolutional Hopfield networks on MNIST, FashionMNIST, and CIFAR-10 with performance comparable to backpropagation.

2606.09117 2026-06-09 cs.LG cs.AI 交叉投稿

Optimizing Energy-based Neural Network Training with Coherent Ising Machine

利用相干伊辛机优化基于能量的神经网络训练

Chen-Rui Fan, Bo Lu, Zhi-Hong Zhang, Run-Qing Zhang, Jing-Wei Wen, Chuan Wang

发表机构 * School of Artificial Intelligence, Beijing Normal University（北京师范大学人工智能学院）； Laboratory for Advanced Computing and Intelligence Engineering, Information Engineering University（信息工程大学先进计算与智能工程实验室）； China Mobile (Suzhou) Software Technology Company Limited（中移（苏州）软件技术有限公司）； School of Science, Beijing University of Posts and Telecommunications（北京邮电大学理学院）

AI总结本文利用相干伊辛机结合平衡传播训练基于能量的神经网络，并通过Adam优化器加速收敛，展示了在深层架构和卷积操作上的可扩展性，为下一代AI硬件提供了物理框架。

详情

AI中文摘要

尽管伊辛机作为伊辛模型的高级物理求解器，在组合优化和神经网络训练中具有应用潜力，但其在大规模神经网络中的可扩展性仍受限于硬件连接限制和次优的训练方法。在这项工作中，我们利用相干伊辛机（CIM）通过平衡传播训练基于能量的神经网络，实现了与现有软件实现相当的性能。我们进一步通过集成Adam优化器来求解Hopfield能量网络的基态，从而显著提高了收敛速度和求解精度。此外，我们展示了该方法在更深层网络架构和卷积操作上的可扩展性。我们的结果突显了CIM动力学作为训练复杂神经网络的可扩展平台的潜力，为通过模拟电路、光电子或集成光子学实现节能实现提供了途径。这项工作为下一代AI硬件开发建立了一个新颖的物理框架。

英文摘要

While Ising machines serve as advanced physical solvers for the Ising model, enabling applications in combinatorial optimization and neural network training, their scalability for large-scale neural networks remains constrained by hardware connectivity limitations and suboptimal training methodologies. In this work, we leverage a Coherent Ising Machine (CIM) to train an energy-based neural network using Equilibrium Propagation, achieving performance comparable to existing software-based implementations. We further enhance the algorithm by integrating the Adam optimizer to solve for the ground state of a Hopfield energy network, significantly improving convergence speed and solution accuracy. Additionally, we demonstrate the scalability of our approach across deeper network architectures and convolutional operations. Our results highlight the potential of CIM dynamics as a scalable platform for training complex neural networks, offering a pathway toward energy-efficient implementations via analog circuits, optoelectronics, or integrated photonics. This work establishes a novel physical framework for next-generation AI hardware development.

2606.09245 2026-06-09 cs.CV cs.AI 交叉投稿

Proposal Refinement for Few-Shot Object Detection

用于少样本目标检测的提议细化

Yuan Zeng, Bin Song, Jie Guo, Yuwen Chen

发表机构 * State Key Laboratory of Integrated Services Networks, Xidian University（西安电子科技大学综合业务网理论及关键技术国家重点实验室）

AI总结针对少样本检测中区域提议在基类和新类间分布不均的问题，提出分阶段提议细化方法，通过基类训练阶段的细化损失和微调阶段的细化分支重新平衡提议分布，在基准上提升1%~6%且不增加推理时间。

详情

AI中文摘要

近年来，少样本目标检测引起了广泛关注。一些优秀的算法已被提出以处理这一任务。然而，这些算法大多依赖于少样本分类的性能。与以往尝试不同，我们的工作聚焦于新类和基类之间区域提议分布不均的问题。为了缓解这种不平衡分布，我们针对不同训练阶段提出了提议细化方法。具体而言，在基类训练阶段设计了细化损失以增强模型对新类的敏感性，在微调阶段引入了细化分支作为RPN（区域提议网络）的辅助分支以生成更多新类提议。通过重新平衡提议分布，所提方法在现有基准上比基线方法提高了约1%~6%，且不增加任何推理时间。通过大量实验，我们证明了为少样本目标检测任务建立了一种新的最先进方法。

英文摘要

Few-shot object detection has gained wide attention in recent years, and some excellent algorithms have been proposed to handle this task. However, most of these algorithms rely on the performance of few-shot classification. Unlike previous attempts, our work focuses on the unbalanced distribution of region proposals between the novel classes and the base classes. To alleviate this unbalanced distribution, we propose a proposal refinement approach for the different training phases. Specifically, a refinement loss is designed for the base training phase to enhance the sensitivity of the model to novel classes, and a refinement branch is introduced as an auxiliary branch for the RPN (Region Proposal Network) to generate more novel proposals in the fine-tuning phase. By rebalancing the proposal distribution, the proposed approach outperforms the baseline methods by roughly 1% to 6% on current benchmarks without increasing inference time. Through extensive experiments, we show that the approach establishes a new state of the art for the few-shot object detection task.

2606.09257 2026-06-09 cs.LG cs.AI stat.ML 交叉投稿

BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

BSTabDiff: 用于高维表格数据生成的块-子单元扩散先验

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Gyawali, Gianfranco Doretto, Donald A. Adjeroh

发表机构 * West Virginia University（西弗吉尼亚大学）； The University of Utah（犹他大学）

AI总结针对高维低样本量表格数据，提出BSTabDiff框架，通过将特征划分为潜在块并使用共享低维子单元变量生成每个块，结合扩散先验和copula依赖，实现稳定合成与可控基准生成。

Comments Published as a paper at the 2nd DeLTa Workshop, ICLR 2026

详情

AI中文摘要

高维低样本量（HDLSS）表格领域（例如组学）的特点是 $n \ll m$，其中 $n$ = 样本数，$m$ = 特征数。此类领域通常表现出强局部相关组、稀疏跨组依赖、重尾非高斯边缘分布、异方差噪声和结构化缺失，使得在 $\mathbb{R}^m$ 中直接进行密度学习因 $n \ll m$ 而病态。我们提出 BSTabDiff，一种块-子单元生成框架，将 $m$ 个观测特征划分为 $M$ 个潜在块（$M \ll m$），并通过共享的低维子单元变量生成每个块，将全局依赖学习集中在紧凑的块潜在空间 $\mathbb{R}^M$ 中，同时通过 copula 驱动的依赖、灵活的逐特征边缘分布和显式缺失机制解码到完整特征空间。BSTabDiff 支持块潜在上的现代深度先验，包括扩散和归一化流，从而在 HDLSS 场景中实现稳定合成和可控基准生成。实验表明，与 HDLSS 数据上的非结构化表格生成器相比，BSTabDiff 能产生更真实和稳定的高维合成数据。

英文摘要

High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ = number of samples, and $m$ = number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned since $n \ll m$. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable, concentrating global dependence learning in the compact block-latent space $\mathbb{R}^M$ while decoding to the full feature space with copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data when compared with unstructured tabular generators on HDLSS data.

2606.09278 2026-06-09 cs.LG cs.AI 交叉投稿

Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation

内化几何法则：从求解器残差中学习以实现精度关键生成

Rafael Cabral, Pang Zixi, Ziyi Shou, Shen Xin

发表机构 * Huawei Celia Team（华为Celia团队）

AI总结针对大语言模型在精度关键领域（如技术图表和机械设计）中的幻觉问题，提出可编程几何DSL PyGeoX及分层基准PyGeoX-Bench，并设计饱和加性奖励（SAR）方法，将奖励分解为有界逐约束项，解决异常梯度掩盖问题，使8B模型在基准上达到与更大前沿系统竞争的水平。

AI中文摘要

大语言模型在精度关键领域（如技术图表和机械设计）中经常出现幻觉，这些领域的输出必须满足严格的几何约束。我们研究从自然语言进行开放式几何合成：将自由形式的描述转化为精确的构造，其实体必须同时满足数十个相互作用的约束。为使这一问题易于处理，我们发布了PyGeoX，一个可编程的几何DSL，它将声明性约束编译为可微损失，以及PyGeoX-Bench，一个包含300个问题的分层套件，每个问题都有可验证的逐约束奖励。使用PyGeoX作为验证器，我们识别出一种称为异常梯度掩盖的失败模式：在全局范数奖励（任何通过单一范数聚合残差的方案，例如$\exp(-\mathrm{MSE})$）下，单个异常约束可以抵消所有其他约束的学习信号。为解决此问题，我们提出饱和加性奖励（SAR），它将奖励分解为有界的逐约束项，保留部分进展并确保即使在严重违反下也能保持一致的梯度。与基于MSE的奖励（几何求解器的自然基线）相比，SAR将困难层级求解率提高了2.3倍，由此得到的8B模型在该基准上与更大的前沿系统具有竞争力。我们在https://github.com/Huawei-AI4Math/PyGeoX发布引擎、基准和数据。

英文摘要

Large Language Models frequently hallucinate in precision-critical domains such as technical diagramming and mechanical design, where outputs must satisfy strict geometric constraints. We study open-ended geometric synthesis from natural language: translating free-form descriptions into precise constructions whose entities must simultaneously satisfy dozens of interacting constraints. To make this tractable, we release PyGeoX, a programmable geometric DSL that compiles declarative constraints into a differentiable loss, and PyGeoX-Bench, a stratified suite of 300 problems with per-constraint verifiable rewards. Using PyGeoX as a verifier, we identify a failure mode we call Outlier Gradient Masking: under global-norm rewards (any scheme that aggregates residuals through a single norm, for example, $\exp(-\mathrm{MSE})$), a single outlier constraint can nullify the learning signal across all others. To address this, we propose Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms, preserving partial progress and ensuring consistent gradients even under severe violations. Against MSE-based rewards, the natural baseline for geometry solvers, SAR improves the hard-tier solving rate by $2.3\times$, and the resulting 8B model is competitive with much larger frontier systems on this benchmark. We release the engine, benchmark, and data at https://github.com/Huawei-AI4Math/PyGeoX.
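The Outlier Gradient Masking effect described in this abstract can be illustrated with a small numerical sketch. The abstract does not give the exact form of SAR, so the per-constraint term $\exp(-r_i^2)$ used below is an assumption, chosen only because it is bounded and additive. The function names are hypothetical and this is not the PyGeoX implementation.

```python
import math

def global_norm_grad(residuals, j):
    """Derivative w.r.t. residual j of the global-norm reward exp(-MSE)."""
    k = len(residuals)
    mse = sum(r * r for r in residuals) / k
    return -2.0 * residuals[j] / k * math.exp(-mse)

def additive_grad(residuals, j):
    """Derivative w.r.t. residual j of an ASSUMED additive reward
    (1/K) * sum_i exp(-r_i^2); each term is bounded in [0, 1]."""
    k = len(residuals)
    r = residuals[j]
    return -2.0 * r * math.exp(-r * r) / k

# Two nearly satisfied constraints and one severe outlier.
residuals = [0.5, 0.5, 10.0]
g_global = global_norm_grad(residuals, 0)
g_additive = additive_grad(residuals, 0)
print(f"global-norm gradient on constraint 0: {g_global:.3e}")
print(f"additive gradient on constraint 0:    {g_additive:.3e}")
```

Under the global-norm reward the outlier drives the shared factor $\exp(-\mathrm{MSE})$ towards zero, so the well-behaved constraints receive almost no gradient. In the additive form, the gradient on constraint 0 does not depend on the outlier at all.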

2606.09380 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

推理竞技场：当可验证奖励不足时的轨迹锦标赛

Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang

发表机构 * University of Cambridge（剑桥大学）； Mistral AI

AI总结提出推理竞技场框架，通过轨迹锦标赛将无梯度信号的非多样奖励组转化为相对奖励信号，结合Bradley-Terry模型高效整合强化学习，在数学和编码基准上平均提升7.6%，加速训练27%-41%。

Comments 9 pages, 6 figures, 2 tables (17 pages including references and appendices)

AI中文摘要

基于可验证奖励的强化学习（RLVR）已成为通过结果监督提升大语言模型推理能力的主流范式。然而，可验证奖励在组级别常常变得无信息：当给定提示的所有采样轨迹获得相同奖励时，组相对优势估计无法提供梯度信号，尽管这些轨迹在推理质量上可能差异显著。我们提出推理竞技场，一种自适应训练框架，将此类非多样奖励组路由至裁判系统而非丢弃。除了检查最终答案，推理竞技场构建轨迹锦标赛，其中推理轨迹进行两两比较以暴露组内更细粒度的偏好，将推理质量转化为丰富的相对奖励信号。为使奖励估计高效，而非穷举比较每一对，每个新轨迹与一个动态更新的先前生成轨迹小池作为锚点进行评估，以高效建立相对排名。然后我们在不完整比较图上拟合Bradley-Terry模型，实现无需二次成对比较的可扩展强化学习集成。实验结果表明，推理竞技场在竞赛数学和编码基准上平均比RLVR基线高出7.6%。通过将原本浪费的零优势样本转化为有用的梯度更新，我们的方法加速训练27%至41%，节省近50%的生成计算量，并显著提升整体推理性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
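Fitting a Bradley-Terry model on an incomplete comparison graph, as this abstract describes, can be sketched with the classical minorization-maximization update. This is a minimal illustration, not the authors' code: the function name, the toy comparison list, and the fixed iteration count are all assumptions, and the anchor-pool selection and judge system are not modelled.

```python
from collections import defaultdict

def fit_bradley_terry(n_items, comparisons, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard MM update p_i <- W_i / sum_j n_ij / (p_i + p_j),
    where W_i is the number of wins of item i and n_ij is the number of
    comparisons between i and j. Pairs that were never compared simply
    do not appear in the sum.
    """
    wins = [0] * n_items
    pair_counts = [defaultdict(int) for _ in range(n_items)]
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[winner][loser] += 1
        pair_counts[loser][winner] += 1

    p = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            denom = sum(n / (p[i] + p[j]) for j, n in pair_counts[i].items())
            updated.append(wins[i] / denom if denom > 0 else p[i])
        total = sum(updated)
        # Rescale so the strengths average to one (the scale is arbitrary).
        p = [x * n_items / total for x in updated]
    return p

# Toy tournament: traces 0 and 3 are never compared directly, yet the
# model still places them on a common scale through traces 1 and 2.
comparisons = [(0, 1), (0, 1), (1, 0),
               (1, 2), (1, 2), (2, 1),
               (2, 3), (2, 3), (3, 2)]
strengths = fit_bradley_terry(4, comparisons)
print(strengths)
```

The sketch assumes the comparison graph is connected and that every trace wins at least once. A trace with no wins would be driven to strength zero, which a practical system would have to regularize.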

2606.09404 2026-06-09 stat.ML cs.AI cs.LG 交叉投稿

SAILS: Surrogate-based Analysis of Interactions via Local Effect Smooths

SAILS: 基于局部效应平滑的交互作用代理分析

Timo Heiß, Julia Herbinger, Bernd Bischl, Giuseppe Casalicchio

发表机构 * Department of Statistics, LMU Munich（慕尼黑大学统计系）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）； Leibniz Institute for Prevention Research and Epidemiology（莱布尼茨预防研究与流行病学研究所）

AI总结提出SAILS框架，通过可解释的广义加性模型代理分析黑箱模型中的成对交互作用，实现交互检测、形式分类和可视化。

AI中文摘要

特征交互驱动了机器学习模型的大部分预测能力，然而现有的解释方法仅能检测和量化交互作用，而无法揭示其函数形式，或者只能可视化受限的交互类型。我们提出了基于局部效应平滑的交互作用代理分析（SAILS），这是一个模型无关的框架，通过拟合黑箱模型局部效应的可解释广义加性模型（GAM）代理来分析成对交互作用。对于感兴趣特征的每个区间，代理平滑项在导数层面隔离交互成分，从而实现（i）通过对平滑项显著性检验的启发式方法进行交互检测，（ii）将交互形式分类为线性、乘积可分离和非乘积可分离类型，以及（iii）为每种交互类型提供定制化、可解释的可视化。我们通过受控模拟和实际任务实证验证了该框架，展示了其在成对交互作用上的有效性，但在强特征相关性和高阶交互作用下存在局限性。SAILS填补了XAI工具箱中的一个显著空白，超越了仅检测交互作用，进而表征其函数形式。

英文摘要

Feature interactions drive much of the predictive power of machine learning models, yet existing explanation methods only detect and quantify interactions without revealing their functional form, or visualize only restricted interaction types. We propose Surrogate-based Analysis of Interactions via Local effect Smooths (SAILS), a model-agnostic framework that analyzes pairwise interactions through interpretable generalized additive model (GAM) surrogates fitted to the local effects of a black-box model. For each interval of a feature of interest, the surrogate smooth terms isolate the interaction components at the derivative level, enabling (i) interaction detection through a heuristic derived from significance tests on smooth terms, (ii) interaction form categorization into linear, product-separable, and non-product-separable types, and (iii) tailored, interpretable visualizations for each interaction type. We empirically validate the framework through controlled simulations and a real-world task, demonstrating its effectiveness for pairwise interactions, with limitations under strong feature correlations and higher-order interactions. SAILS fills a notable gap in the XAI toolbox, going beyond detection of interactions alone to characterizing their functional form.

2606.09430 2026-06-09 cs.LG cs.AI 交叉投稿

LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

LargeMonitor: 通过大型预训练模型监控在线无任务持续学习

Mingqi Yuan, Xiaoquan Sun, Shihao Luo, Jiayu Chen

发表机构 * HKU（香港大学）； Qicore Tech（启科科技）

AI总结提出LargeMonitor框架，利用大型预训练模型（LVM和LMM）解耦检测与诊断，实现无任务持续学习中的零样本漂移检测和语义病因诊断，提升现有算法性能。

AI中文摘要

在线无任务持续学习（TFCL）要求智能体在严格单次遍历约束下，从无界、非平稳的数据流中顺序积累知识，且无显式任务标识。现有在线TFCL范式主要依赖于参数高效的提示调整或由训练耦合优化动态（如经验损失波动或潜在距离演变）驱动的动态结构扩展。因此，这些训练耦合求解器对分布漂移的结构起源不可知，机械地在根本不同的流变化上强制执行固定策略。为解决这一问题，我们提出LargeMonitor，一个利用大型预训练基础模型自主编排无任务连续适应的框架。具体而言，LargeMonitor引入一个解耦的检测模块，利用大型视觉模型（LVM）的冻结、稳定表示空间，实现鲁棒的零样本漂移检测，无需训练依赖的干扰或脆弱的阈值调整。在确认漂移后，该框架激活一个由大型多模态模型（LMM）驱动的上下文感知诊断模块，以解释流变化的精确语义病因（例如，新类出现 vs. 环境域偏移）。这种双阶段能力使连续学习者能够动态部署自适应且特定于漂移的优化策略。在多个TFCL设置和基准上的大量实验表明，LargeMonitor实现了对复杂数据流的精确、鲁棒检测和诊断，同时持续提升现有在线TFCL算法的性能。

英文摘要

Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficient prompt tuning or dynamic structure expansion driven by training-coupled optimization dynamics, such as empirical loss fluctuations or evolving latent distances. As a result, these training-coupled solvers remain agnostic to the structural origins of distribution drift, mechanically enforcing a fixed strategy across fundamentally distinct streaming variations. To address this gap, we propose LargeMonitor, a framework that leverages large pretrained foundation models to autonomously orchestrate task-free continuous adaptation. Specifically, LargeMonitor introduces a decoupled detection module utilizing the frozen, stable representation space of large vision models (LVMs) to achieve robust, zero-shot drift detection without training-dependent interference or brittle threshold tuning. Upon a confirmed drift, the framework activates a context-aware diagnostic module driven by large multimodal models (LMMs) to interpret the precise semantic etiologies of the stream variation (e.g., novel class emergence vs. environmental domain shift). This dual-stage capability empowers the continuous learner to dynamically deploy adaptive and shift-specific optimization strategies. Extensive experiments across multiple TFCL settings and benchmarks demonstrate that LargeMonitor achieves precise, robust detection and diagnosis of complex data streams while consistently improving the performance of existing online TFCL algorithms.

2606.09607 2026-06-09 cs.LG cs.AI 交叉投稿

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

注意力头中的闭包验证电路发现：共激活提出，消融处置

Yongzhong Xu

发表机构 * GitHub

AI总结通过共激活聚类提出注意力头电路假设，并用因果消融验证闭包性，发现该方法在密集模型有效但在MoE模型失效，表明共激活仅是电路提议而非确认。

Comments 22 pages, 3 figures

AI中文摘要

可解释性越来越将组件组（而非单个单元）作为基本对象，并提议通过聚类共激活统计来发现它们。我们询问这种廉价信号是否真正识别出注意力头电路。将稀疏自编码器聚类方法适配到注意力头——但通过因果消融而非重构进行验证——我们聚类头，然后运行闭包测试：消融发现的社区，并将每个示例的损伤与匹配随机对照进行比较。在两个密集的1B规模模型（Pythia 1B, OLMo 1B）和两种输入分布上，社区通过了闭包测试。在混合专家模型（OLMoE-1B-7B）中，路由条件聚类恢复了一个统计上真实的信号，但该信号未能通过闭包测试——消融反而改善了损失，方向错误。将闭包测试扩展到训练过程中，注意力目标选择性和参与比率在双向与功能解耦。我们得出结论：廉价信号是电路提议，而非确认的电路；闭包是区分二者的关键。

英文摘要

Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads -- but validating by causal ablation rather than reconstruction -- we cluster heads and then run a closure test: ablate the discovered community and compare per-example damage to matched-random controls. Across two dense 1B-scale models (Pythia 1B, OLMo 1B) and two input distributions, the communities pass closure. In a Mixture-of-Experts model (OLMoE-1B-7B), route-conditional clustering recovers a statistically real signal that nonetheless does not survive closure -- ablation improves loss, the wrong direction. Extending closure across training, attention-target selectivity and participation ratio decouple from function in both directions. We conclude that a cheap signal is a circuit proposal, not a confirmed circuit; closure is what separates them.

2606.09658 2026-06-09 cs.LG cs.AI 交叉投稿

Muon Learns More Robust and Transferable Features than Adam

Muon 比 Adam 学习更鲁棒和可迁移的特征

Tianyu Ruan, Fengzhuo Zhang, Shuche Wang, Shihua Zhang

发表机构 * Yale University（耶鲁大学）； National University of Singapore（新加坡国立大学）； University of Chinese Academy of Sciences（中国科学院大学）； Academy of Mathematics and Systems Science, CAS（中国科学院数学与系统科学研究院）

AI总结本文通过鲁棒性和可迁移性视角，证明 Muon 优化器相比 Adam 和 SGD 能学习到更鲁棒、更可迁移的特征，并通过理论分析支持了经验发现。

AI中文摘要

Muon 最近已成为预训练大型语言模型（LLMs）和视觉分类器的最先进优化器。尽管其在效率上优于 Adam 和 SGD，但 Muon 在特征学习方面的优势仍不清楚。本文通过鲁棒性和可迁移性的视角研究了 Muon 的特征学习优势。首先，通过在损坏图像和文本上评估预训练模型，我们表明 Muon 学习到的特征在不同架构（包括 Transformer 和卷积神经网络（CNN））中始终比 Adam 和 SGD 学习到的特征更鲁棒。使用训练好的逐层探针，我们进一步表明这种鲁棒性优势体现在各层更大的 logit 间隔上。其次，通过在下游任务上训练线性分类器或从预训练参数微调完整模型，我们证明 Muon 学习到的特征比 Adam 和 SGD 学习到的特征更有效地迁移。这种可迁移性优势还通过有效秩衡量的各层隐藏状态的多样性得到进一步支持。最后，在一个具有多组件特征的代表性分类问题中，我们证明 Muon 比 Adam 和 SGD 获得更大的间隔和更高的有效秩，为我们的经验发现提供了理论支持。

英文摘要

Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains unclear. This paper investigates Muon's feature-learning advantage through the lens of robustness and transferability. First, by evaluating pretrained models on corrupted images and texts, we show that features learned by Muon are consistently more robust than those learned by Adam and SGD across different architectures, including transformers and Convolutional Neural Networks (CNNs). Using trained layer-wise probes, we further show that this robustness advantage is reflected in larger logit margins across layers. Second, by training linear classifiers or fine-tuning full models from pretrained parameters on downstream tasks, we demonstrate that Muon-learned features transfer more effectively than those learned by Adam and SGD. This transferability advantage is further supported by the diversity of hidden states across layers, as measured by effective rank. Finally, in a representative classification problem with multi-component features, we prove that Muon attains larger margins and higher effective rank than Adam and SGD, providing theoretical support for our empirical findings.
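The abstract uses effective rank as a measure of hidden-state diversity but does not define it. The sketch below assumes the common entropy-based definition, the exponential of the Shannon entropy of the normalized singular values. The paper may use a different variant. The function takes singular values directly; obtaining them from a hidden-state matrix is left to a numerical library.

```python
import math

def effective_rank(singular_values):
    """Entropy-based effective rank (assumed definition).

    Normalizes the singular values to a probability distribution and
    returns exp(H), where H is its Shannon entropy. The result ranges
    from 1 (one dominant direction) to len(singular_values) (all equal).
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(q * math.log(q) for q in probs)
    return math.exp(entropy)

flat_spectrum = effective_rank([1.0, 1.0, 1.0, 1.0])    # evenly spread
peaked_spectrum = effective_rank([1.0, 0.0, 0.0, 0.0])  # fully collapsed
print(flat_spectrum, peaked_spectrum)
```

Under this definition, a higher value means the representation spreads its energy over more directions.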

2606.09659 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

End-to-End Context Compression at Scale

端到端上下文压缩的规模化

Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, Pavel Izmailov

发表机构 * New York University（纽约大学）； Modal Labs（Modal实验室）； University of Maryland（马里兰大学）； Princeton University（普林斯顿大学）； Columbia University（哥伦比亚大学）； Harvard University（哈佛大学）； Lawrence Livermore National Laboratory（劳伦斯利弗莫尔国家实验室）； FAIR at Meta（Meta FAIR实验室）

AI总结本研究通过架构搜索和持续预训练，提出潜在上下文语言模型（LCLMs），一种端到端编码器-解码器压缩器，在通用任务性能、压缩速度和峰值内存上改进帕累托前沿，并可作为长时智能体的高效骨干。

AI中文摘要

长上下文语言模型推理受限于内存，因为KV缓存随上下文长度增长。最近压缩KV缓存的技术存在不足：它们要么大幅降低模型质量，要么需要大量时间和计算来压缩单个长提示。此外，许多方法要求输入适合目标模型的上下文窗口，并且通常与现代生产推理引擎不兼容。编码器-解码器压缩器原则上是一种有吸引力的替代方案，它将长令牌序列映射到由解码器消费的较短潜在嵌入序列。然而，现有方法在精度-效率前沿上无法与KV缓存压缩竞争。在这项工作中，我们重新审视编码器-解码器压缩并缩小了这一差距。我们首先进行架构搜索，从头开始预训练许多变体，以确定如何最佳设计和训练编码器-解码器压缩器。根据我们的发现，我们持续预训练一系列0.6B编码器、4B解码器模型，每个模型在超过350B令牌上训练，压缩比为1:4、1:8和1:16。我们引入了潜在上下文语言模型（LCLMs），这是一系列压缩器，在通用任务性能、压缩速度和峰值内存使用上改进了帕累托前沿。我们证明了LCLMs可作为长时智能体的高效骨干，让智能体浏览压缩的长上下文并按需自适应扩展相关片段。

英文摘要

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.

2606.09762 2026-06-09 cs.LG cs.AI 交叉投稿

Preserving Plasticity in Continual Learning via Dynamical Isometry

通过动态等距保持持续学习中的可塑性

Andries Rosseau, Robert Müller, Ann Nowé

发表机构 * University of Amsterdam（阿姆斯特丹大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结本文通过动态等距机制保持深度神经网络在持续学习中的可塑性，提出等距正则化方法和AdamO优化器，在多个基准上匹配或超越现有方法。

Comments ICML26

Journal ref: Forty-Third International Conference on Machine Learning (ICML 2026)

AI中文摘要

深度神经网络在非平稳条件下的持续训练通常会导致可塑性逐渐丧失，最终限制进一步学习。我们将可塑性与经验神经正切核联系起来，并确定动态等距（即逐层雅可比奇异值保持接近1的条件）是保持持续学习中可塑性的关键机制。我们重新审视一类几乎处处等距且同时保持通用Lipschitz函数逼近能力的网络，证明近动态等距与表达性非线性表示兼容。对于通用架构，我们提出一种高效的等距促进正则化方案，并识别出一种可以重新激活休眠ReLU单元的新机制。在此基础上，我们引入AdamO，一种Adam风格的自适应优化器，将等距正则化与梯度更新解耦，类似于AdamW。我们进一步通过动态等距的视角重新解释先前的可塑性保持方法，表明它们仅针对等距的部分度量。在旨在诱导可塑性损失的监督和强化学习持续学习基准上，我们的方法一致地匹配或超越现有方法。

英文摘要

Continual training of deep neural networks under non-stationarity often leads to a progressive loss of plasticity, eventually limiting further learning. We relate plasticity to the empirical Neural Tangent Kernel, and identify dynamical isometry (the condition that layer-wise Jacobian singular values remain close to one) as a key mechanism for preserving plasticity in continual learning. We revisit a class of networks that are almost-everywhere isometric while remaining universal Lipschitz function approximators, demonstrating that near-dynamical isometry is compatible with expressive nonlinear representations. For general architectures, we propose an efficient isometry-promoting regularization scheme and identify a novel mechanism by which it can reactivate dormant ReLU units. Building on this, we introduce AdamO, an Adam-style adaptive optimizer that decouples isometry regularization from gradient updates, analogous to AdamW. We further reinterpret prior plasticity-preserving approaches through the lens of dynamical isometry, showing that they target only a partial measure of isometry. Across supervised and reinforcement-learning continual-learning benchmarks designed to induce plasticity loss, our methods consistently match or outperform existing approaches.
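The abstract defines dynamical isometry as layer-wise Jacobian singular values staying close to one, but it does not spell out the regularizer. As a rough illustration only, the sketch below uses a generic soft-orthogonality penalty $\lVert W^\top W - I \rVert_F^2$ on a linear layer's weight matrix, whose Jacobian is $W$ itself. This penalty is an assumption and not necessarily the scheme proposed in the paper or used by AdamO.

```python
def isometry_penalty(w):
    """Squared Frobenius norm of (W^T W - I) for a matrix given as rows.

    The penalty is zero exactly when the columns of W are orthonormal,
    i.e. when all singular values of W equal one.
    """
    rows, cols = len(w), len(w[0])
    penalty = 0.0
    for a in range(cols):
        for b in range(cols):
            gram = sum(w[r][a] * w[r][b] for r in range(rows))
            target = 1.0 if a == b else 0.0
            penalty += (gram - target) ** 2
    return penalty

identity_penalty = isometry_penalty([[1.0, 0.0], [0.0, 1.0]])  # isometric
scaled_penalty = isometry_penalty([[2.0, 0.0], [0.0, 2.0]])    # singular values 2
print(identity_penalty, scaled_penalty)
```

By analogy with AdamW's decoupled weight decay, a decoupled variant would apply the gradient of such a penalty as a separate step rather than adding it to the loss. How AdamO actually does this is not stated in the abstract.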

2606.09802 2026-06-09 cs.LG cs.AI stat.ML 交叉投稿

Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts

高效实验的Bandits：适应控制组、偏好和上下文漂移

Udvas Das, Waris Radji, Debabrota Basu, Odalric-Ambrym Maillard

发表机构 * Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL（里尔大学、Inria、法国国家科学研究中心、里尔中央理工学院、UMR 9189 – CRIStAL）

AI总结针对用户偏好和上下文分布随时间漂移的线性上下文随机多臂赌博机问题，提出Dri-MED算法，通过异方差回归处理非平稳噪声，实现实例相关的遗憾界和约束违规界。

AI中文摘要

我们考虑线性上下文随机多臂赌博机的一个变体，其中学习器必须向一组用户提供推荐，每个用户有其个性化的偏好向量，并且上下文分布随时间漂移。在实践者友好的假设下，我们将此设置简化为具有平稳均值但异方差和非平稳噪声的线性赌博机。我们进一步研究了学习器必须确保每个决策的平均奖励超过基线策略$\boldsymbol{\pi}_0$在每个决策步骤的均值的情况。我们引入了Dri-MED，一种受MED策略线性版本启发并仔细调整以处理非平稳异方差噪声的算法。我们表明，实例相关的遗憾界为$\tilde{\mathcal O}\left(\frac{\kappa}{\tilde{\Delta}}d^2\log(T)\right)$，其中$\tilde{\Delta}$是受策略$\pi_0$约束的次优性间隙，方差感知乘性项$\kappa$通过异方差回归仔细处理。我们进一步表明Dri-MED享有$\tilde{\mathcal{O}}(d)$的期望约束违规。我们的数值结果表明，Dri-MED显著优于忽略漂移和偏好结构的保守基线。

英文摘要

We consider a variant of the linear contextual stochastic multi-armed bandit in which the learner must provide recommendations to a group of users, each with a personalized preference vector, while the context distributions drift over time. Under practitioner-friendly assumptions, we reduce this setting to a linear bandit with stationary mean but heteroskedastic and non-stationary noise. We further study the case in which the learner must ensure that the mean reward of each decision exceeds that of a baseline strategy $\boldsymbol{\pi}_0$ at each decision step. We introduce Dri-MED, an algorithm inspired by the linear version of the MED strategy and carefully adapted to handle the non-stationary heteroskedastic noise. We show that the instance-dependent regret scales as $\tilde{\mathcal{O}}\left(\frac{\kappa}{\tilde{\Delta}} d^2 \log(T)\right)$, where $\tilde{\Delta}$ is the constraint-aware sub-optimality gap subject to policy $\pi_0$, with a variance-aware multiplicative term $\kappa$ that we carefully handle using heteroskedastic regression. We further show that Dri-MED enjoys $\tilde{\mathcal{O}}(d)$ expected constraint violations. Our numerical results suggest that Dri-MED significantly outperforms conservative baselines that ignore the drift and preference structure.

2606.09806 2026-06-09 cs.LG cs.AI 交叉投稿

Topological Neural Operators

拓扑神经算子

Lennart Bastian, Samuel Leventhal, Mustafa Hajij, Tolga Birdal

发表机构 * Imperial College London（伦敦帝国学院）； University of San Francisco（旧金山大学）

AI总结提出拓扑神经算子(TNOs)，利用离散外微积分在细胞复形上实现跨维度耦合，并通过分层结构提升长程信息传播，在PDE基准上优于现有算子。

AI中文摘要

我们引入了拓扑神经算子（TNOs），这是一个在细胞复形上进行算子学习的原理性框架，将神经算子（NOs）从点和/或边上的函数提升到拓扑域。TNOs将数据表示为定义在不同维度细胞上的特征，并通过离散外微积分建模它们的相互作用，通过梯度、旋度和散度型算子实现显式的跨维度耦合。关键设计原则是将信息流向（由固定拓扑算子控制）与信息变换（学习得到）解耦，从而产生尊重物理量几何支撑并暴露守恒和相容性结构的模型。我们进一步提出了分层TNOs（HTNOs），它结合了学习到的粗粒度复形以传播长程和拓扑依赖的信息。我们的框架将现有NOs作为特例，提供了跨离散化的算子学习统一视角。在一系列PDE基准测试中，包括不规则几何流动问题，TNOs和HTNOs提高了精度；控制研究进一步隔离了原生高阶和拓扑结构带来的优势。项目页面：https://circle-group.github.io/research/TNO

英文摘要

We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological domains. TNOs represent data as features defined on cells of varying dimension and model their interactions through Discrete Exterior Calculus, enabling explicit cross-dimensional coupling via gradient-, curl-, and divergence-type operators. The key design principle is to decouple where information flows, as governed by fixed topological operators, from how it is transformed (which is learned), yielding models that respect the geometric support of physical quantities and expose conservation and compatibility structure. We further propose Hierarchical TNOs (HTNOs), which incorporate learned coarse complexes to propagate long-range and topology-dependent information. Our framework subsumes existing NOs as a special case, providing a unified perspective on operator learning across discretizations. Across a range of PDE benchmarks, including irregular-geometry flow problems, TNOs and HTNOs improve accuracy; controlled studies further isolate the benefits of native higher-rank and topological structure. Project page: https://circle-group.github.io/research/TNO

2606.09816 2026-06-09 cs.CV cs.AI math.PR 交叉投稿

PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws

PTL-Diffusion: 具有周期终端定律的流形感知扩散

Danqi Zhuang, Jisui Huang, Xiaoyue Xi, Andrew Kiggins, Xiaojie Wang, Ke Chen, Yue Wu

发表机构 * University of Pennsylvania（宾夕法尼亚大学）； University of Cambridge（剑桥大学）； University of Oxford（牛津大学）； Harvard University（哈佛大学）； MIT（麻省理工学院）； University of Washington（华盛顿大学）

AI总结提出PTL-Diffusion，通过将前向噪声过程收敛到周期高斯终端族而非单一分布，显式嵌入相位结构，改善低维流形上的分布匹配，在点云和人脸数据集上降低误差。

AI中文摘要

标准扩散模型通常使用单一时间齐次高斯终端分布作为生成的参考律。虽然这一选择在分析上方便且经验上有效，但对于集中在低维流形附近的数据，它提供的显式结构很少，其中数据分布的不同区域可能对应于不同的局部几何或语义因素。因此，反向模型必须几乎完全从非结构化的终端参考分布中恢复流形级别的结构。我们提出PTL-Diffusion，一种概念验证的扩散框架，其前向噪声过程收敛到一个非常数的周期高斯终端族，而不是单一不变律。与相位条件DDPM不同（其中相位信息仅进入去噪网络，而前向过程保持不变），PTL-Diffusion将相位结构直接嵌入前向噪声动力学中。所提出的构造仍然接近标准去噪扩散模型：对于周期强迫的Ornstein-Uhlenbeck型前向过程，我们推导出闭合形式的前向边际分布、极限周期高斯终端族以及显式高斯反向后验，从而支持标准噪声预测训练。我们还引入了一个不变平均正则化项，通过平均周期参考律耦合相位条件反向动力学。在环面和圆柱点云基准以及Olivetti人脸数据集上的实验表明，PTL-Diffusion在匹配的DDPM基线上改善了流形级别的分布匹配，减少了相位条件误差、特征空间协方差误差和最近邻流形距离。这些结果表明结构化终端参考律是一个有前景的方向，同时激励更具表现力的相位构造和更大规模的评估。

英文摘要

Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While this choice is analytically convenient and empirically powerful, it provides little explicit structure for data concentrated near low-dimensional manifolds, where different regions of the data distribution may correspond to distinct local geometric or semantic factors. As a result, the reverse model must recover manifold-level structure almost entirely from an unstructured terminal reference distribution. We propose PTL-Diffusion, a proof-of-concept diffusion framework whose forward noising process converges to a nonconstant periodic family of Gaussian terminal laws rather than to a single invariant law. Unlike a phase-conditioned DDPM, where phase information only enters the denoising network while the forward process remains unchanged, PTL-Diffusion embeds phase structure directly into the forward noising dynamics. The proposed construction remains close to standard denoising diffusion models: for a periodically forced Ornstein--Uhlenbeck-type forward process, we derive closed-form forward marginals, the limiting periodic Gaussian terminal family, and explicit Gaussian reverse posteriors, enabling standard noise-prediction training. We also introduce an invariant-average regularization term coupling the phase-conditioned reverse dynamics through the averaged periodic reference law. Experiments on torus and cylinder point-cloud benchmarks and the Olivetti face dataset show that PTL-Diffusion improves manifold-level distributional matching over matched DDPM baselines, reducing phase-conditioned errors, feature-space covariance errors, and nearest-neighbour manifold distances. These results suggest structured terminal reference laws as a promising direction, while motivating more expressive phase constructions and larger-scale evaluations.

2509.25004 2026-06-09 cs.AI 版本更新

CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning

CLPO：课程学习与策略优化相结合用于大语言模型推理

Shijie Zhang, Zheng Xiao, Shiyu Liu, Guohao Sun, Kevin Zhang, Xiang Guo, Rujun Guo, Shaoyu Liu, Wangxiao Zhao, Guanjun Jiang

发表机构 * Peking University（北京大学）； Qwen Applications Business Group, Alibaba Group（通义实验室，阿里巴巴集团）； Xiamen University（厦门大学）

AI总结提出CLPO框架，通过在线策略准确率动态调整问题难度，使课程与策略共同进化，在数学和通用推理基准上显著优于GRPO和DAPO。

AI中文摘要

具有可验证奖励的在线强化学习已成为提升大语言模型推理能力的有效范式，但大多数方法仍对静态问题集优化推理轨迹，将rollout预算浪费在已解决或过于困难的问题上。我们提出CLPO（课程学习与策略优化相结合），一种自我进化的课程框架，利用在线策略rollout准确率识别已解决、中等难度和困难问题，然后根据模型当前能力重构所选任务。困难问题被简化以变得可学习，而中等难度问题被多样化以提供有用的训练变化。这使得学习课程能够与策略共同进化，而不是随着模型能力边界移动而保持固定。CLPO不将这些重写视为静态数据增强，而是优化重构轨迹，并根据重写问题的下游准确率增益分配信用，除了原始可验证答案外不需要额外的人工标注。跨数学推理和域外通用推理基准的实验表明，CLPO在Qwen3-8B上分别以平均10.21和7.75个点显著优于GRPO和DAPO。在数学和代码领域的消融研究进一步表明，重构模式和重写损失都对最终增益有贡献，证明了CLPO通过自我进化的课程为激发更强推理能力提供了可扩展且稳健的途径。

英文摘要

Online reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning abilities of large language models, but most methods still optimize reasoning trajectories over a static problem set, wasting rollout budget on solved or overly difficult problems. We propose CLPO (Curriculum Learning meets Policy Optimization), a self-evolving curriculum framework that uses on-policy rollout accuracy to identify solved, medium-difficulty, and hard problems, then restructures selected tasks according to the model's current capability. Hard problems are simplified to become learnable, while medium-difficulty problems are diversified to provide useful training variation. This allows the learning curriculum to co-evolve with the policy rather than remaining fixed as the model's capability boundary shifts. Rather than treating these rewrites as static data augmentation, CLPO optimizes restructuring trajectories with credit assigned by the downstream accuracy gain of the rewritten problem, requiring no additional human annotations beyond the original verifiable answers. Experiments across mathematical reasoning and out-of-domain general reasoning benchmarks show that CLPO substantially outperforms GRPO and DAPO on Qwen3-8B by 10.21 and 7.75 average points, respectively. Ablation studies on math and code domains further show that both the restructuring mode and the rewriting loss contribute to the final gains, demonstrating that CLPO provides a scalable and robust pathway for eliciting stronger reasoning capabilities through a self-evolving curriculum.

2512.07355 2026-06-09 cs.AI cs.CV cs.LG 版本更新

A Geometric Unification of Concept Learning with Concept Cones

概念学习与概念锥的几何统一

Alexandre Rocchi, Thomas Fel, Gianni Franchi

发表机构 * AMIAD ； Kempner Institute, Harvard University（哈佛大学肯普纳研究所）

AI总结通过共享几何框架（概念锥）统一监督式概念瓶颈模型与无监督稀疏自编码器，提出包含关系度量评估概念对齐，并发现稀疏性与扩展因子的最佳平衡点。

Comments 33 pages

AI中文摘要

两种可解释性传统并行发展但很少相互交流：概念瓶颈模型（CBM）规定概念应该是什么，而稀疏自编码器（SAE）发现哪些概念涌现。CBM使用监督将激活与人类标记的概念对齐，而SAE依赖稀疏编码来揭示涌现概念。我们证明两种范式实例化相同的几何结构：每个范式学习激活空间中的一组线性方向，其非负组合形成概念锥。因此，监督和无监督方法的不同不在于种类，而在于如何选择这个锥。基于这一观点，我们提出了两种范式之间的操作桥梁。CBM提供人类定义的参考几何，而SAE可以通过其学习的锥在多大程度上近似或包含CBM的锥来评估。这种包含框架产生了量化指标，将归纳偏差（如SAE类型、稀疏性或扩展比）与合理概念的涌现联系起来。使用这些指标，我们发现了稀疏性和扩展因子的“最佳点”，该点最大化与CBM概念的几何和语义对齐。总体而言，我们的工作通过共享的几何框架统一了监督和无监督的概念发现，提供了原则性指标来衡量SAE进展，并评估发现的概念与合理的人类概念的对齐程度。

英文摘要

Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases (such as SAE type, sparsity, or expansion ratio) to the emergence of plausible concepts. (We adopt the terminology of Jacovi and Goldberg (2020), who distinguish between faithful explanations, which accurately reflect model computations, and plausible explanations, which align with human intuition and domain knowledge. CBM concepts are plausible by construction, since they are selected or annotated by humans, though not necessarily faithful to the true latent factors that organise the data manifold.) Using these metrics, we uncover a "sweet spot" in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concepts align with plausible human concepts.

2512.12225 2026-06-09 cs.AI 版本更新

A Geometric Theory of Cognition for Machine Intelligence

机器智能的认知几何理论

Laha Ale

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, China（计算与人工智能学院，西南交通大学，成都，中国）

AI总结提出黎曼流形上的梯度流框架，统一表征、记忆、适应与预测，在部分可观测强化学习任务中优于前馈基线，鲁棒性堪比循环架构。

AI中文摘要

开发能够统一表征、记忆、适应和预测的人工智能体仍然是人工智能中的一个基本挑战。在这里，我们引入了一个几何框架，其中认知计算源于学习到的潜在流形上的黎曼梯度流。学习到的度量编码了表征约束和计算偏好，而几何中的各向异性自然产生了多个时间尺度的行为，从而在没有显式记忆模块或循环机制的情况下，同时产生快速反应响应和较慢的适应动态。我们通过黎曼表征和动态模型实例化该框架，并在部分可观测的强化学习环境中进行评估。在观测掩蔽、感觉中断、动态扰动和预测性潜在建模任务中，所提出的方法始终优于前馈基线，实现了与循环架构相当的鲁棒性，并产生了高度可预测的潜在轨迹，具有较低的长程展开误差。这些结果表明，学习到的潜在几何可以同时作为表征、记忆、适应和预测的基质。更广泛地说，该框架提供了动力系统、表征学习和基于世界模型的智能之间的原则性联系。

英文摘要

Developing artificial agents that unify representation, memory, adaptation, and prediction remains a fundamental challenge in artificial intelligence. Here we introduce a geometric framework in which cognitive computation emerges from Riemannian gradient flow on a learned latent manifold. The learned metric encodes representational constraints and computational preferences, while anisotropies in the geometry naturally generate multiple timescales of behaviour, yielding both rapid reactive responses and slower adaptive dynamics without explicit memory modules or recurrent mechanisms. We instantiate this framework through Riemannian representation and dynamics models and evaluate them in partially observable reinforcement-learning environments. Across observation masking, sensory blackouts, dynamics perturbations, and predictive latent-modelling tasks, the proposed approach consistently outperforms feedforward baselines, achieves robustness comparable to recurrent architectures, and produces highly predictable latent trajectories with low long-horizon rollout error. These results suggest that learned latent geometry can serve simultaneously as a substrate for representation, memory, adaptation, and prediction. More broadly, the framework provides a principled connection between dynamical systems, representation learning, and world-model-based intelligence.

2601.04805 2026-06-09 cs.AI 版本更新

Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

基于思考的非思考：通过强化学习解决混合推理模型训练中的奖励黑客问题

Siyuan Gan, Jiaheng Liu, Boyan Wang, Tianpei Yang, Runqing Miao, Yuyao Zhang, Fanyu Meng, Junlan Feng, Linjian Meng, Jing Huo, Yang Gao

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China（南京大学新型软件技术国家重点实验室）； Shanghai Artificial Intelligence Laboratory, Shanghai, China（上海人工智能实验室）； Jiutian Research, Beijing, China（九天研究院）

AI总结针对混合推理模型训练中的奖励黑客问题，提出Thinking-Based Non-Thinking方法，利用思考型回答的解决方案信息为非思考型回答设置差异化最大令牌数，在数学基准上减少约50%令牌使用并提升准确率。

AI中文摘要

大型推理模型（LRMs）因其卓越性能而备受关注。然而，其性能主要源于思考（即长链思维CoT），这显著增加了计算开销。为解决这一过度思考问题，现有工作侧重于使用强化学习（RL）训练混合推理模型，使其根据查询复杂度自动决定是否进行思考。不幸的是，使用RL会遇到奖励黑客问题，例如，模型进行了思考但被判定为未思考，导致奖励错误。为缓解此问题，现有工作要么采用监督微调（SFT），计算成本高昂，要么对非思考型回答强制设置统一令牌限制，缓解效果有限。本文提出基于思考的非思考（TNT）。它不使用SFT，而是通过利用思考型回答的解决方案组件中的信息，为不同查询的非思考型回答设置不同的最大令牌使用量。在五个数学基准上的实验表明，与DeepSeek-R1-Distill-Qwen-1.5B/7B和DeepScaleR-1.5B相比，TNT将令牌使用量减少约50%，同时显著提高准确率。事实上，TNT在所有测试方法中实现了准确率与效率之间的最优权衡。此外，在所有测试数据集中，TNT被分类为未使用思考的回答中出现奖励黑客问题的概率低于10%。

英文摘要

Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking based on the complexity of the query. Unfortunately, using RL suffers from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT; instead, it sets a different maximum token budget for the non-thinking responses to each query by leveraging information from the solution component of the responses that use thinking. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the optimal trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking in TNT's responses that are classified as not using thinking remains below 10% across all tested datasets.

2602.08222 2026-06-09 cs.AI 版本更新

Investigating the Histogram Loss in Regression

探究回归中的直方图损失

Ehsan Imani, Kai Luedemann, Sam Scholnick-Hughes, Esraa Elelimy, Martha White

发表机构 * Alberta Machine Intelligence Institute (Amii) and Reinforcement Learning and Artificial Intelligence Laboratory（阿尔伯塔机器智能研究所（Amii）和强化学习与人工智能实验室）； Department of Computing Science, University of Alberta（计算科学系，阿尔伯塔大学）； University of Tübingen（图宾根大学）； Zuse School ELIZA（祖斯学校ELIZA）

AI总结本文通过理论和实验分析，探究直方图损失在回归任务中提升性能的原因，发现其优势源于优化改进而非额外信息建模，并在常见深度学习应用中验证其有效性。

Comments 52 pages

Journal ref: JMLR,2026

AI中文摘要

在回归任务中，即使预测只需要均值，训练神经网络来建模整个分布也变得越来越常见。这种额外的建模通常会带来性能提升，但其背后的原因尚不完全清楚。本文研究了一种最近的回归方法——直方图损失，该方法通过最小化目标分布与灵活直方图预测之间的交叉熵来学习目标变量的条件分布。我们设计了理论和实证分析，以确定这种性能提升出现的原因和时机，以及损失的不同组成部分如何贡献于这种提升。我们的结果表明，在这种设置下学习分布的好处来自于优化方面的改进，而非建模额外信息。然后，我们展示了直方图损失在常见深度学习应用中的可行性，无需昂贵的超参数调优。

英文摘要

It is becoming increasingly common in regression to train neural networks that model the entire distribution even if only the mean is required for prediction. This additional modeling often comes with a performance gain, but the reasons behind the improvement are not fully known. This paper investigates a recent approach to regression, the Histogram Loss, which involves learning the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. We design theoretical and empirical analyses to determine why and when this performance gain appears, and how different components of the loss contribute to it. Our results suggest that the benefits of learning distributions in this setup come from improvements in optimization rather than modeling extra information. We then demonstrate the viability of the Histogram Loss in common deep learning applications without a need for costly hyperparameter tuning.
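
As a rough illustration of the loss described in this abstract, the sketch below computes a cross-entropy between a target distribution over bins and a softmax histogram prediction. The abstract does not specify how the target distribution is built; the truncated-Gaussian target, the bin edges, and the value of `sigma` used here are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math


def gaussian_cdf(x, mean, sigma):
    """CDF of a normal distribution N(mean, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def target_histogram(y, edges, sigma):
    """Assumed target: mass of N(y, sigma^2) in each bin, renormalized to the bin range."""
    cdf = [gaussian_cdf(e, y, sigma) for e in edges]
    total = cdf[-1] - cdf[0]
    return [(cdf[i + 1] - cdf[i]) / total for i in range(len(edges) - 1)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def histogram_loss(logits, target):
    """Cross-entropy between the target distribution and the predicted histogram."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


def predicted_mean(logits, edges):
    """Point prediction: expectation of the bin centers under the predicted histogram."""
    probs = softmax(logits)
    centers = [(edges[i] + edges[i + 1]) / 2.0 for i in range(len(edges) - 1)]
    return sum(p * c for p, c in zip(probs, centers))


edges = [0.5 * i for i in range(9)]           # 8 equal-width bins on [0, 4] (assumed)
target = target_histogram(1.3, edges, sigma=0.4)
uniform_logits = [0.0] * 8                    # uninformative prediction
matched_logits = [math.log(max(t, 1e-12)) for t in target]  # prediction equal to the target

print(histogram_loss(uniform_logits, target))
print(histogram_loss(matched_logits, target))
print(predicted_mean(matched_logits, edges))
```

The loss is smallest when the predicted histogram matches the target distribution, and the mean needed for regression is read off as the expectation over bin centers.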

2411.03253 2026-06-09 cs.LG cs.AI cs.DS 版本更新

Discovering Data Structures: Nearest Neighbor Search and Beyond

发现数据结构：最近邻搜索及其他

Omar Salemohamed, Laurent Charlin, Shivam Garg, Vatsal Sharan, Gregory Valiant

发表机构 * Université de Montréal（蒙特利尔大学）； Mila ； HEC Montréal（蒙特利尔高等商学院）； Microsoft Research（微软研究院）； University of Southern California（南加州大学）； Stanford University（斯坦福大学）

AI总结提出一个端到端学习数据结构的通用框架，自动适应数据分布并控制查询与空间复杂度，在最近邻搜索中逆向工程出二分搜索、插值搜索、k-d树和局部敏感哈希等算法。

Comments NeurIPS 2025 version

AI中文摘要

我们提出了一个用于端到端学习数据结构的通用框架。我们的框架适应底层数据分布，并对查询和空间复杂度提供细粒度控制。关键在于，数据结构是从头开始学习的，不需要仔细初始化或用候选数据结构/算法进行种子化。我们首先将该框架应用于最近邻搜索问题。在多种设置中，我们能够逆向工程出学习到的数据结构和查询算法。对于一维最近邻搜索，模型发现了最优的分布（不）依赖算法，如二分搜索和插值搜索的变体。在更高维度中，模型学习到的解决方案在某些情况下类似于k-d树，而在其他情况下则具有局部敏感哈希的元素。该模型还能学习高维数据的有用表示，并利用它们设计有效的数据结构。我们还将框架应用于数据流上的频率估计问题，并相信它也可以成为新问题的强大发现工具。

英文摘要

We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures/algorithms. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, they have elements of locality-sensitive hashing. The model can also learn useful representations of high-dimensional data and exploit them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could also be a powerful discovery tool for new problems.
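
For reference, the classical hand-designed baseline that the abstract says the model rediscovers in 1D can be sketched as follows. This is the textbook binary-search solution on a sorted array, not the learned data structure from the paper; the function name is hypothetical.

```python
from bisect import bisect_left


def nearest_neighbor_1d(sorted_points, query):
    """Return the element of a sorted list closest to `query` (ties go to the smaller one)."""
    if not sorted_points:
        raise ValueError("sorted_points must be non-empty")
    # Binary search for the first position whose value is >= query.
    i = bisect_left(sorted_points, query)
    if i == 0:
        return sorted_points[0]
    if i == len(sorted_points):
        return sorted_points[-1]
    left, right = sorted_points[i - 1], sorted_points[i]
    # Only the two neighbors around the insertion point can be nearest.
    return left if query - left <= right - query else right


points = [1, 4, 9, 16, 25]
print(nearest_neighbor_1d(points, 10))
print(nearest_neighbor_1d(points, 13))
```

Binary search needs a number of lookups logarithmic in the array size regardless of the data distribution. Interpolation search instead guesses the position from the query value and can need fewer lookups when the points are close to uniformly spread.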

2503.18314 2026-06-09 cs.LG cs.AI cs.CV 版本更新

LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty

LoTUS：带有不确定性风味的大规模机器遗忘

Christoforos N. Spartalis, Theodoros Semertzidis, Petros Daras, Efstratios Gavves

发表机构 * University of Amsterdam（阿姆斯特丹大学）； Centre for Research & Technology Hellas（希腊研究与技术中心）； Archimedes/Athena RC（阿基米德/雅典娜研究中心）

AI总结提出LoTUS方法，通过平滑预测概率至信息论界限来消除训练样本影响，避免从头重训练，在Transformer和ResNet18模型上超越现有方法，并引入RF-JSD指标用于实际评估。

Comments Accepted as a main conference paper at CVPR 2025 (https://cvpr.thecvf.com/virtual/2025/poster/33292)

2504.05349 2026-06-09 stat.ML cs.AI cs.LG 版本更新

Hyperflux: Pruning Reveals Importance

Hyperflux: 剪枝揭示重要性

Eugen Barbulescu, Antonio Alexoaie, Lucian Busoniu

发表机构 * Department of Computer Science（计算机科学系）； Technical University of Cluj-Napoca（克卢日-纳波卡技术大学）； Department of Automation（自动化系）

AI总结提出Hyperflux方法，通过将剪枝建模为连续演化系统（通量和压力），在微观和宏观层面解释剪枝行为，并引入压力调度器实现目标稀疏度，在多个数据集上取得竞争性结果。

2505.20137 2026-06-09 cs.LG cs.AI 版本更新

ePC: Fast and Deep Predictive Coding in Digital Simulation

ePC：数字仿真中的快速深度预测编码

Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz, Thomas Demeester

发表机构 * IDLab, Ghent University -- imec, Belgium（ID实验室，根特大学——imec，比利时）； Brain Network Dynamics Unit, University of Oxford, UK（脑网络动力学单位，牛津大学，英国）

AI总结提出误差预测编码（ePC），通过重新参数化解决标准状态预测编码（sPC）在数字仿真中的指数信号衰减问题，实现与反向传播相当的深度模型训练速度。

Comments Accepted at ICML 2026 - Main Track. All code available at https://github.com/cgoemaere/error_based_PC

AI中文摘要

预测编码（PC）为神经网络训练提供了一种受大脑启发的反向传播替代方案，被描述为最小化其内部能量的物理系统。然而，在实践中，PC主要是在数字仿真中实现的，需要大量的计算，同时难以扩展到更深的架构。本文重新构建了PC以克服这种硬件-算法不匹配。首先，我们揭示了规范的状态基PC（sPC）在数字仿真中本质上是深度低效的，不可避免地导致指数级信号衰减，从而阻碍整个最小化过程。然后，为了克服这一根本限制，我们引入了误差基PC（ePC），这是一种新的PC重新参数化，不会遭受信号衰减。虽然不再具有生物合理性，但ePC数值计算精确的PC权重梯度，运行速度比sPC快几个数量级。跨多个架构和数据集的实验表明，即使在sPC难以处理的更深模型中，ePC也能匹配反向传播的性能。除了实际改进，我们的工作还提供了对PC动力学的理论洞察，并为在数字硬件及更广泛领域将基于PC的学习扩展到更深架构奠定了基础。

英文摘要

Predictive Coding (PC) offers a brain-inspired alternative to backpropagation for neural network training, described as a physical system minimizing its internal energy. However, in practice, PC is predominantly digitally simulated, requiring excessive amounts of compute while struggling to scale to deeper architectures. This paper reformulates PC to overcome this hardware-algorithm mismatch. First, we uncover how the canonical state-based formulation of PC (sPC) is, by design, deeply inefficient in digital simulation, inevitably resulting in exponential signal decay that stalls the entire minimization process. Then, to overcome this fundamental limitation, we introduce error-based PC (ePC), a novel reparameterization of PC which does not suffer from signal decay. Though no longer biologically plausible, ePC numerically computes exact PC weight gradients and runs orders of magnitude faster than sPC. Experiments across multiple architectures and datasets demonstrate that ePC matches backpropagation's performance even for deeper models where sPC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling PC-based learning to deeper architectures on digital hardware and beyond.

2507.12612 2026-06-09 cs.LG cs.AI 版本更新

Learning Task Mixtures from Task Affinities: A Probabilistic Graphical Model for Supervised Fine-Tuning

从任务亲和性中学习任务混合：用于监督微调的概率图模型

Prateek Chanda, Saral Sureka, Parth Pratim Chatterjee, Krishnateja Killamsetty, Nikhil Shivakumar Nayak, Ganesh Ramakrishnan

发表机构 * IIT Bombay（印度理工学院孟买分校）； IBM Research（IBM研究院）； Red Hat AI Innovation（红帽AI创新）； MIT-IBM Watson AI Lab（麻省理工-IBM沃森AI实验室）

AI总结本文提出TaskPGM框架，通过基于能量的任务模型学习连续任务混合，利用互信息和行为散度来捕捉任务间的关系，从而在任务覆盖和冗余之间取得平衡，提升大语言模型的监督微调性能。

Comments 9, 8 tables, 7 figures

AI中文摘要

大语言模型的监督微调性能在很大程度上取决于训练预算如何分配到异质任务集上。在实践中，通常使用简单的启发式方法（例如均匀采样或按规模比例采样）来固定混合，但这些方法忽略了任务之间的相互作用，可能损害迁移并把预算浪费在冗余来源上。我们引入TaskPGM，一种通过基于能量的任务模型学习连续任务混合的框架。任务构成马尔可夫随机场的节点：一元势捕捉单个任务的效用，而成对势使用从单任务微调模型的预测分布中计算的行为散度（如Jensen-Shannon散度和点互信息）来编码任务间的关系。优化此目标会产生在覆盖和冗余之间取得平衡的混合。我们证明，所得到的集合函数在预算约束下是弱子模的，这使得离散选择变体能够获得近似保证。在多个模型家族（LLaMA-7B，Qwen2-7B）和评估套件（BIG-Bench Hard）上，TaskPGM相比标准混合策略取得改进，并提供了任务间关系的可解释结构。

英文摘要

Supervised fine-tuning performance for large language models depends strongly on how training budget is distributed across a heterogeneous set of tasks. In practice, mixtures are often fixed using simple heuristics (e.g., uniform or size-proportional sampling) that ignore task interactions, which can hurt transfer and waste budget on redundant sources. We introduce TaskPGM, a framework for learning continuous task mixtures via an energy-based model over tasks. Tasks form the nodes of a Markov random field: unary potentials capture per-task utility, and pairwise potentials encode inter-task relationships using behavioral divergences computed from predictive distributions of single-task fine-tuned models (e.g., Jensen-Shannon divergence and pointwise mutual information). Optimizing this objective yields mixtures that balance coverage against redundancy. We show that the resulting set function is weakly submodular under budget constraints, enabling approximation guarantees for discrete selection variants. Across multiple model families (LLaMA-7B, Qwen2-7B) and evaluation suites (BIG-Bench Hard), TaskPGM improves over standard mixing strategies and provides interpretable structure over task interactions.

2508.05950 2026-06-09 cs.CV cs.AI 版本更新

CLONE: A 3DGS-Based Closed-Loop Differentiable Optimization Framework for Single-Image Normal Estimation

CLONE: 基于3DGS的闭环可微优化框架用于单图像法线估计

Yanxing Liang, Yinghui Wang, Wei Li, Tao Yan, Jiaxing Shen

发表机构 * School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China（江南大学人工智能与计算机科学学院，中国无锡）； School of Data Science, Lingnan University, Hong Kong, China（岭南大学数据科学学院，中国香港）

AI总结提出CLONE框架，通过3D高斯泼溅参数化场景并利用协方差特征分解得到连续可微法线，结合可微光照模型和一步确定性扩散精化网络，在统一重投影目标下联合优化，实现无需真值法线监督的几何一致性单图像法线估计。

AI中文摘要

我们提出CLONE，一个基于3DGS的闭环可微优化框架，用于单图像法线估计。核心思想是构建一个“图像-几何-图像”一致性循环，统一并联合约束两种范式的局限性：判别式方法依赖显式监督而缺乏跨域几何约束，生成式方法虽有强生成先验但缺乏稳定的可微优化路径。具体地，我们首先采用3D高斯泼溅显式参数化场景，并通过协方差特征分解导出连续可微的表面法线，为几何建模提供解析梯度路径。然后，我们引入一个带有可学习光调制核的可微光照模型，建立表面法线与图像辐射之间的连续映射，使重投影误差直接监督底层3D几何。此外，为补偿高斯表示在局部细节表达上的不足，我们设计了一个一步确定性扩散启发的精化网络，在保持端到端可微性的同时增强局部几何细节。引入跨域门控融合机制以协调全局几何一致性和局部细节重建。最后，所有组件在统一的重投影目标下联合优化，形成闭环且稳定的梯度传播路径。这使得无需真值法线监督即可有效约束多解空间并改善几何一致性。

英文摘要

We propose CLONE, a 3DGS-based Closed-Loop differentiable Optimization framework for single-image Normal Estimation. The core idea is to construct an "image-geometry-image" consistency loop that unifies and jointly constrains the limitations of both paradigms: the reliance on explicit supervision without cross-domain geometric constraints in discriminative methods, and the absence of stable differentiable optimization pathways in generative methods despite strong generative priors. Specifically, we first employ 3D Gaussian Splatting to explicitly parameterize the scene and derive continuous and differentiable surface normals via covariance eigen-decomposition, providing an analytical gradient pathway for geometric modeling. We then introduce a differentiable illumination model with a learnable light modulation kernel to establish a continuous mapping between surface normals and image radiance, enabling reprojection errors to directly supervise the underlying 3D geometry. Furthermore, to compensate for the limited local detail expressiveness of Gaussian representations, we design a one-step deterministic diffusion-inspired refinement network, which enhances local geometric details while preserving end-to-end differentiability. A cross-domain gating fusion mechanism is introduced to coordinate global geometric consistency and local detail reconstruction. Finally, all components are jointly optimized under a unified reprojection objective, forming a closed-loop and stable gradient propagation pathway. This enables effective constraint of the multi-solution space and improved geometric consistency without requiring ground-truth normal supervision.

2509.10534 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

解耦“什么”和“哪里”：极坐标位置嵌入

Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer

发表机构 * DeepMind, London, UK（DeepMind，英国伦敦）

AI总结提出极坐标位置嵌入（PoPE）以解耦Transformer注意力机制中的内容和位置，在诊断任务、序列建模和语言模型中优于RoPE，并展现零样本长度外推能力。

Comments ICML 2026 camera-ready version

AI中文摘要

Transformer架构中的注意力机制根据内容（“什么”）和序列中的位置（“哪里”）将键匹配到查询。我们提出一项分析，表明在流行的RoPE旋转位置嵌入中，“什么”和“哪里”是纠缠的。这种纠缠会损害性能，特别是当决策需要在这两个因素上独立匹配时。我们提出对RoPE的改进，称为极坐标位置嵌入（PoPE），它消除了“什么-哪里”的混淆。PoPE在仅通过位置或内容进行索引的诊断任务上表现远优于基线。在音乐、基因组和自然语言领域的自回归序列建模中，使用PoPE作为位置编码方案的Transformer在评估损失（困惑度）和下游任务性能上优于使用RoPE的基线。在语言建模中，这些优势在模型规模从124M到774M参数时持续存在。关键的是，与RoPE甚至专为外推设计的方法YaRN（需要额外微调和频率插值）相比，PoPE展现出强大的零样本长度外推能力。

英文摘要

The attention mechanism in a Transformer architecture matches keys to queries based on both content (the what) and position in a sequence (the where). We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance, particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities compared not only to RoPE but even to a method designed for extrapolation, YaRN, which requires additional fine-tuning and frequency interpolation.
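
The abstract does not give PoPE's formulation, so the sketch below covers only standard RoPE, the baseline being analyzed. It treats each 2-D feature pair as a complex number and rotates it by an angle proportional to the token position. The example vectors and helper names are hypothetical; whether the interaction shown here is exactly the entanglement the paper analyzes is an assumption.

```python
import cmath


def rope_frequencies(num_pairs, base=10000.0):
    """Standard RoPE frequencies: theta_j = base^(-j / num_pairs) for each 2-D pair."""
    return [base ** (-j / num_pairs) for j in range(num_pairs)]


def rope_rotate(pairs, position, thetas):
    """Rotate each complex-valued feature pair by position * theta_j."""
    return [z * cmath.exp(1j * position * theta) for z, theta in zip(pairs, thetas)]


def attention_score(q_pairs, k_pairs):
    """Real dot product of the underlying 2-D pairs: Re(sum_j q_j * conj(k_j))."""
    return sum((q * k.conjugate()).real for q, k in zip(q_pairs, k_pairs))


thetas = rope_frequencies(2)
query = [complex(1.0, 0.5), complex(-0.3, 0.8)]   # hypothetical query content
key = [complex(0.7, -0.2), complex(0.4, 0.9)]     # hypothetical key content


def score_at(m, n):
    """Attention logit for the query at position m and the key at position n."""
    return attention_score(rope_rotate(query, m, thetas), rope_rotate(key, n, thetas))


print(score_at(3, 1), score_at(10, 8))  # same offset m - n, same score
print(score_at(1, 3))                   # opposite offset, different score
```

Each term of the score equals |q_j||k_j| cos(phase(q_j) - phase(k_j) + (m - n) theta_j), so the content-dependent phases and the positional offset enter through the same cosine.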

2510.03244 2026-06-09 cs.LG cs.AI cs.CV 版本更新

VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

VFEM: 视觉特征赋能的多变量时间序列预测与跨模态融合

Yanlong Wang, Hang Yu, Jian Xu, Fei Ma, Hongkang Zhang, Tongtong Feng, Zijian Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院，清华大学）； Pengcheng Laboratory（鹏城实验室）； Ant Group（蚂蚁集团）； Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)（广东人工智能与数字经济实验室（深圳））； University of Pennsylvania（宾夕法尼亚大学）

AI总结提出VFEM模型，利用预训练大视觉模型通过跨模态注意力融合视觉与时间特征，仅训练7.45%参数即可捕捉跨变量依赖，提升多变量时间序列预测性能。

AI中文摘要

大型时间序列基础模型通常采用通道独立架构来处理不同的数据维度，但这种设计忽略了关键的跨通道依赖关系。同时，现有的跨模态方法主要依赖文本模态，使得视觉模型的空间模式识别能力在时间序列分析中未被充分探索。为了解决这些局限性，我们提出了VFEM，一种利用预训练大视觉模型（LVM）捕获复杂跨变量模式的跨模态预测模型。VFEM将多变量时间序列转换为视觉表示，使LVM能够感知通道独立模型未显式建模的空间关系。通过双分支架构，视觉和时间特征被独立提取，然后通过跨模态注意力融合，使两种模态的互补信息增强预测。通过冻结LVM并仅训练总参数的7.45%，VFEM在多个基准上取得了竞争性能，为多变量时间序列预测提供了新视角。

英文摘要

Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.

2510.09783 2026-06-09 cs.LG cs.AI stat.ML 版本更新

Large Language Models for Imbalanced Classification: Diversity makes the difference

大语言模型用于不平衡分类：多样性至关重要

Dang Nguyen, Sunil Gupta, Kien Do, Thin Nguyen, Taylor Braund, Alexis Whitton, Svetha Venkatesh

发表机构 * Applied Artificial Intelligence Initiative (A2I2)（应用人工智能倡议（A2I2））； Deakin University（迪肯大学）； Black Dog Institute（黑狗研究所）； University of New South Wales（新南威尔士大学）

AI总结提出基于大语言模型的过采样方法，通过条件采样、排列微调和插值样本增强多样性，在10个表格数据集上优于8个基线方法。

AI中文摘要

过采样是解决不平衡分类最广泛使用的方法之一。其核心思想是生成额外的少数类样本以重新平衡数据集。大多数现有方法（如SMOTE）需要将分类变量转换为数值向量，这通常会导致信息损失。最近，基于大语言模型（LLM）的方法被引入以克服这一限制。然而，当前的LLM方法通常生成多样性有限的少数类样本，降低了下游分类任务的鲁棒性和泛化能力。为了解决这一问题，我们提出了一种新的基于LLM的过采样方法，旨在增强多样性。首先，我们引入了一种采样策略，将合成样本生成条件化为少数类标签和特征。其次，我们开发了一种新的排列策略来微调预训练的LLM。第三，我们不仅在少数类样本上微调LLM，还在插值样本上微调以进一步丰富变异性。在10个表格数据集上的大量实验表明，我们的方法显著优于八个SOTA基线。生成的合成样本既真实又多样。此外，我们通过基于熵的视角提供了理论分析，证明了我们的方法鼓励生成样本的多样性。

英文摘要

Oversampling is one of the most widely used approaches for addressing imbalanced classification. The core idea is to generate additional minority samples to rebalance the dataset. Most existing methods, such as SMOTE, require converting categorical variables into numerical vectors, which often leads to information loss. Recently, large language model (LLM)-based methods have been introduced to overcome this limitation. However, current LLM-based approaches typically generate minority samples with limited diversity, reducing robustness and generalizability in downstream classification tasks. To address this gap, we propose a novel LLM-based oversampling method designed to enhance diversity. First, we introduce a sampling strategy that conditions synthetic sample generation on both minority labels and features. Second, we develop a new permutation strategy for fine-tuning pre-trained LLMs. Third, we fine-tune the LLM not only on minority samples but also on interpolated samples to further enrich variability. Extensive experiments on 10 tabular datasets demonstrate that our method significantly outperforms eight SOTA baselines. The generated synthetic samples are both realistic and diverse. Moreover, we provide theoretical analysis through an entropy-based perspective, proving that our method encourages diversity in the generated samples.

2510.22450 2026-06-09 cs.LG cs.AI 版本更新

SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks

SmartMixed：一种用于神经网络自适应激活函数学习的两阶段训练策略

Amin Omidvar

发表机构 * Independent Researcher, Toronto, Ontario, Canada（独立研究者，加拿大安大略省多伦多）

AI总结提出SmartMixed两阶段训练策略，通过可微硬混合机制让神经元自适应选择激活函数，第二阶段固定选择以保持推理效率，在MNIST上验证了不同层神经元的激活函数偏好。

AI中文摘要

激活函数的选择在神经网络中起着关键作用，但大多数架构仍然依赖于所有神经元上固定的、统一的激活函数。我们引入了SmartMixed，一种新颖的两阶段训练策略，允许网络学习每个神经元的最优激活函数，同时在推理时保持计算效率。在第一阶段，神经元使用可微硬混合机制从候选激活函数池（ReLU、Sigmoid、Tanh、Leaky_ReLU、ELU、SELU）中自适应选择。在第二阶段，每个神经元的激活函数根据学习到的选择固定下来，从而得到一个计算高效的网络，支持使用优化的向量化操作继续训练。我们在MNIST数据集上使用不同架构的前馈神经网络评估了SmartMixed。我们的分析表明，不同层的神经元对激活函数表现出不同的偏好，揭示了神经架构内的功能多样性。我们还证明了SmartMixed通过允许神经元选择其偏好的激活函数有效地训练网络，与使用单一固定最先进激活函数的模型相竞争。

英文摘要

The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMixed, a novel two-phase training strategy that allows networks to learn optimal per-neuron activation functions while preserving computational efficiency at inference. In the first phase, neurons adaptively select from a pool of candidate activation functions (ReLU, Sigmoid, Tanh, Leaky_ReLU, ELU, SELU) using a differentiable hard mixture mechanism. In the second phase, each neuron's activation function is fixed according to the learned selection, resulting in a computationally efficient network that supports continued training with optimized vectorized operations. We evaluate SmartMixed on the MNIST dataset using feedforward neural networks of different architectures. Our analysis reveals that neurons in different layers exhibit distinct preferences for activation functions, providing insights into the functional diversity within neural architectures. We also demonstrate that SmartMixed effectively trains the network by allowing neurons to select their preferred activation functions, remaining competitive with models that use a single fixed state-of-the-art activation function.

2511.07046 2026-06-09 cs.LG cs.AI 版本更新

Learning Quantized Continuous Controllers for Integer Hardware

面向整数硬件的量化连续控制器学习

Fabian Kresse, Christoph H. Lampert

发表机构 * Institute of Science and Technology Austria (ISTA)（奥地利科学与技术研究所）

AI总结提出量化感知训练策略，自动选择低比特策略并综合到FPGA，在MuJoCo任务中以3或2比特权重和激活值实现与全精度相当的竞争力，并提升输入噪声鲁棒性。

Comments 18 pages, 6 figures

AI中文摘要

在嵌入式硬件上部署连续控制强化学习策略需要满足严格的延迟和功耗预算。小型FPGA可以实现这些要求，但前提是避免昂贵的浮点流水线。我们研究了用于整数推理的策略的量化感知训练（QAT），并提出了一种学习到硬件的流水线，该流水线自动选择低比特策略并将其综合到Artix-7 FPGA上。在五个MuJoCo任务中，我们获得的策略网络与全精度（FP32）策略具有竞争力，但每个权重和每个内部激活值仅需3比特甚至2比特，前提是输入精度经过仔细选择。在目标硬件上，所选策略实现微秒级的推理延迟，每次动作消耗微焦耳能量，与量化参考相比具有优势。最后，我们观察到量化策略相比浮点基线具有更高的输入噪声鲁棒性。

英文摘要

Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small FPGAs can deliver these, but only if costly floating-point pipelines are avoided. We study quantization-aware training (QAT) of policies for integer inference and present a learning-to-hardware pipeline that automatically selects low-bit policies and synthesizes them to an Artix-7 FPGA. Across five MuJoCo tasks, we obtain policy networks that are competitive with full-precision (FP32) policies but require as few as 3 or even only 2 bits per weight and per internal activation value, as long as input precision is chosen carefully. On the target hardware, the selected policies achieve inference latencies on the order of microseconds and consume microjoules per action, comparing favorably to a quantized reference. Finally, we observe that the quantized policies exhibit increased input noise robustness compared to the floating-point baseline.
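
To make the 2-bit and 3-bit figures concrete, here is a minimal sketch of a generic uniform symmetric quantizer of the kind commonly used in quantization-aware training. The abstract does not state which quantizer the authors use, so this scheme, the example weights, and the function names are assumptions for illustration only.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes in [-qmax, qmax], with qmax = 2^(bits-1) - 1."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed symmetric grid")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # One scale per tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


weights = [0.9, -0.3, 0.05, -0.9]          # hypothetical layer weights
codes3, scale3 = quantize_symmetric(weights, bits=3)
codes2, scale2 = quantize_symmetric(weights, bits=2)
print(codes3, codes2)
print(dequantize(codes3, scale3))
```

With 3 bits the grid has seven levels (-3 to 3), and with 2 bits only three (-1, 0, 1). In quantization-aware training such a quantizer is typically applied in the forward pass so the network learns weights that tolerate the rounding.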

2512.01930 2026-06-09 cs.LG cs.AI 版本更新

SVRG and Beyond via Posterior Correction

SVRG及其后验校正扩展

Nico Daheim, Thomas Möllenhoff, Ming Liang Ang, Mohammad Emtiyaz Khan

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结本文揭示SVRG与后验校正方法的深层联系，证明SVRG是各向同性高斯后验校正的特例，并通过灵活指数族后验自动导出牛顿型和Adam型新变体。

Comments ICML 2026 (oral)

2512.15116 2026-06-09 cs.LG cs.AI 版本更新

FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation

FADTI: 基于傅里叶和注意力驱动的多变量时间序列插补扩散模型

Runze Li, Hanchen Wang, Wenjie Zhang, Binghao Li, Yu Zhang, Xuemin Lin, Ying Zhang

发表机构 * Anonymous（匿名）

AI总结提出FADTI扩散框架，通过可学习傅里叶偏置投影模块注入频域归纳偏置，结合自注意力与门控卷积进行时序建模，在多个基准上优于现有方法，尤其在高缺失率下表现突出。

Comments This work has been submitted to the IEEE for possible publication. 10 pages, 7 figures

AI中文摘要

多变量时间序列插补是医疗保健、交通预测和生物建模等应用中的基础问题，其中传感器故障和不规则采样导致普遍存在的缺失值。然而，现有的基于Transformer和扩散的模型缺乏明确的归纳偏置和频率感知，限制了它们在结构化缺失模式和分布偏移下的泛化能力。我们提出FADTI，一个基于扩散的框架，通过可学习的傅里叶偏置投影（FBP）模块注入频率信息特征调制，并将其与通过自注意力和门控卷积进行的时间建模相结合。FBP支持多种谱基，能够自适应编码平稳和非平稳模式。这种设计将频域归纳偏置注入生成式插补过程。在多个基准（包括一个新引入的生物时间序列数据集）上的实验表明，FADTI持续优于最先进的方法，尤其是在高缺失率下。代码可在 https://anonymous.4open.science/r/TimeSeriesImputation-52BF 获取。

英文摘要

Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at https://anonymous.4open.science/r/TimeSeriesImputation-52BF

2601.09085 2026-06-09 cs.LG cs.AI cs.CL cs.IR 版本更新

MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

MMR-GRPO：通过多样性感知奖励重加权加速GRPO风格训练

Kangda Wei, Ruihong Huang

发表机构 * Department of Computer Science and Engineering（计算机科学与工程系）

AI总结提出MMR-GRPO方法，利用最大边际相关性根据完成多样性重加权奖励，减少冗余样本，加速GRPO训练，在保持性能的同时平均减少47.9%训练步数和70.2%时间。

AI中文摘要

组相对策略优化（GRPO）已成为训练数学推理模型的标准方法；然而，它对每个提示依赖多个完成，使得训练计算成本高昂。尽管最近的工作减少了达到峰值性能所需的训练步数，但由于每步成本增加，整体挂钟训练时间通常保持不变甚至增加。我们提出MMR-GRPO，它整合了最大边际相关性，基于完成多样性对奖励进行重加权。我们的关键洞察是，语义冗余的完成贡献有限的边际学习信号；优先考虑多样化解能产生更有信息量的更新并加速收敛。在三种模型规模（1.5B、7B、8B）、三种GRPO变体和五个数学推理基准上的广泛评估表明，MMR-GRPO在达到相当峰值性能的同时，平均需要减少47.9%的训练步数和70.2%的挂钟时间。这些增益在模型、方法和基准上一致。我们的代码发布在：https://github.com/WeiKangda/MMR-GRPO。

英文摘要

Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. Our code is released at: https://github.com/WeiKangda/MMR-GRPO.
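
The abstract names Maximal Marginal Relevance (MMR) but does not give the reweighting rule, so the following is only a sketch of the classical greedy MMR selection with one possible way of turning it into reward weights. The token-overlap similarity, the `1 - redundancy` weight, the trade-off parameter `lam`, and all function names are hypothetical and should not be read as the authors' method.

```python
def jaccard(a, b):
    """Token-set overlap, used here as a stand-in for semantic similarity (assumption)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def redundancy(i, chosen, completions):
    """Highest similarity between completion i and any already-selected completion."""
    return max((jaccard(completions[i], completions[j]) for j in chosen), default=0.0)


def mmr_order(completions, rewards, lam=0.5):
    """Classical greedy MMR: pick the item maximizing lam * reward - (1 - lam) * redundancy."""
    remaining = list(range(len(completions)))
    chosen, redundancies = [], {}
    while remaining:
        best = max(
            remaining,
            key=lambda i: lam * rewards[i] - (1 - lam) * redundancy(i, chosen, completions),
        )
        redundancies[best] = redundancy(best, chosen, completions)
        chosen.append(best)
        remaining.remove(best)
    return chosen, redundancies


def reweight_rewards(completions, rewards, lam=0.5):
    """Hypothetical reweighting: scale each reward by 1 - redundancy at selection time."""
    _, redundancies = mmr_order(completions, rewards, lam)
    return [r * (1.0 - redundancies[i]) for i, r in enumerate(rewards)]


completions = [
    "x = 2 so answer is 4",
    "x = 2 so answer is 4",            # exact duplicate of the first completion
    "factor the quadratic to get 4",   # a different route to the same answer
]
rewards = [1.0, 1.0, 1.0]
print(reweight_rewards(completions, rewards))
```

Under this toy rule the duplicated completion contributes no additional reward, while the distinct solution keeps most of its weight, which mirrors the abstract's claim that redundant completions add little marginal learning signal.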

2601.15165 2026-06-09 cs.CL cs.AI cs.LG 版本更新

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

灵活性陷阱：重新思考扩散语言模型中任意顺序的价值

Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang

发表机构 * LeapLab, Tsinghua University（清华大学Leap实验室）； NLPLab, Tsinghua University（清华大学自然语言处理实验室）； Tsinghua University（清华大学）； Alibaba Group（阿里巴巴集团）； BNRist, Tsinghua University（清华大学北京信息科学与技术国家研究中心）

AI总结本文发现，尽管扩散语言模型（dLLMs）允许任意生成顺序，但这种灵活性可能限制其推理能力，通过采用标准的Group Relative Policy Optimization（GRPO）方法，即JustGRPO，在保持并行解码能力的同时提升了推理性能。

Comments Code and pre-trained models: https://github.com/LeapLabTHU/JustGRPO

AI中文摘要

扩散大语言模型（dLLMs）打破了传统语言模型的严格左到右约束，使token生成可以按任意顺序进行。直观上，这种灵活性意味着解决方案空间严格超越了固定的自回归轨迹，理论上解锁了更强大的推理潜力。然而，在本文中，我们发现对于一般推理任务（例如数学和编程），任意顺序生成可能实际上会限制dLLMs的推理潜力。我们观察到dLLMs倾向于利用这种顺序灵活性来绕过关键探索的高不确定性token，这可能导致解决方案覆盖的过早崩溃。这一观察促使我们重新思考dLLMs的强化学习方法，其中大量的复杂性，如处理组合轨迹和不可计算的似然，通常致力于保持这种灵活性。我们证明，通过放弃任意顺序并应用标准的Group Relative Policy Optimization（GRPO）方法，即JustGRPO，可以有效地激发推理能力。我们的方法，JustGRPO，虽然简洁却出人意料地有效（例如在GSM8K上达到89.1%的准确率），同时完全保留了dLLMs的并行解码能力。项目页面：https://nzl-thu.github.io/the-flexibility-trap

英文摘要

Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. However, in this paper, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary order generation may in fact limit the reasoning potential of dLLMs. We observe that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, which can lead to a premature collapse of solution coverage. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We show that effective reasoning can be elicited by simply forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap

2601.21149 2026-06-09 cs.LG cs.AI 版本更新

Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement

移动性嵌入的POI：从人类移动中学习场所身份与使用方式

Maria Despoina Siampou, Shushman Choudhury, Shang-Ling Hsu, Neha Arora, Cyrus Shahabi

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结提出ME-POIs框架，通过对比学习将大规模人类移动数据与语言模型嵌入结合，学习场所功能，并在五个地图丰富任务上超越文本或移动性单独基线。

AI中文摘要

近期地理空间基础模型的进展强调了学习真实世界位置（特别是人类活动集中的兴趣点POI）通用表示的重要性。然而，现有方法主要关注从静态文本元数据中提取的场所身份，或学习与轨迹上下文相关的表示，这些表示捕捉的是移动规律而非场所的实际使用方式（即POI的功能）。我们认为POI功能是通用POI表示中缺失但关键的信号。我们提出了移动性嵌入的POI（ME-POIs），这是一个框架，通过大规模人类移动数据增强从语言模型派生的POI嵌入，以学习基于真实世界使用的、以POI为中心且上下文无关的表示。ME-POIs将个体访问编码为时间上下文化的嵌入，并通过对比学习将其与可学习的POI表示对齐，以捕捉跨用户和时间的使用模式。为解决长尾稀疏性问题，我们提出了一种新机制，从附近频繁访问的POI跨多个空间尺度传播时间访问模式。我们在五个新提出的地图丰富任务上评估ME-POIs，测试其捕捉POI身份和功能的能力。在所有任务中，用ME-POIs增强文本嵌入始终优于纯文本和纯移动性基线。值得注意的是，仅使用移动数据训练的ME-POIs在某些任务上能超越纯文本模型，凸显了POI功能是准确且可泛化的POI表示的关键组成部分。

英文摘要

Recent progress in geospatial foundation models highlights the importance of learning general-purpose representations for real-world locations, particularly points-of-interest (POIs) where human activity concentrates. Existing approaches, however, focus primarily on place identity derived from static textual metadata, or learn representations tied to trajectory context, which capture movement regularities rather than how places are actually used (i.e., POI function). We argue that POI function is a missing but essential signal for general POI representations. We introduce Mobility-Embedded POIs (ME-POIs), a framework that augments POI embeddings derived from language models with large-scale human mobility data to learn POI-centric, context-independent representations grounded in real-world usage. ME-POIs encodes individual visits as temporally contextualized embeddings and aligns them with learnable POI representations via contrastive learning to capture usage patterns across users and time. To address long-tail sparsity, we propose a novel mechanism that propagates temporal visit patterns from nearby, frequently visited POIs across multiple spatial scales. We evaluate ME-POIs on five newly proposed map enrichment tasks, testing its ability to capture both the identity and function of POIs. Across all tasks, augmenting text-based embeddings with ME-POIs consistently outperforms both text-only and mobility-only baselines. Notably, ME-POIs trained on mobility data alone can surpass text-only models on certain tasks, highlighting that POI function is a critical component of accurate and generalizable POI representations.

2601.21522 2026-06-09 cs.LG cond-mat.dis-nn cs.AI stat.ML 版本更新

More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)

更高效利用预算：使用重置与丢弃（ReD）方法在固定预算下提升大型语言模型的推理性能

Sagi Meir, Tommer D. Keidar, Noam Levi, Shlomi Reuveni, Barak Hirshberg

发表机构 * School of Chemistry, Tel Aviv University（特拉维夫大学化学系）； The Center for Physics and Chemistry of Living Systems, Tel Aviv University（特拉维夫大学生命系统物理与化学中心）； School of Physics and Astronomy, Tel Aviv University（特拉维夫大学物理与天文学系）； The Center for Computational Molecular and Materials Science, Tel Aviv University（特拉维夫大学计算分子与材料科学中心）

AI总结针对固定预算下大型语言模型推理的收益递减问题，提出重置与丢弃（ReD）查询方法，通过优化尝试分配提升覆盖率，并在编码、数学和推理基准上验证了其成本节约效果。

AI中文摘要

大型语言模型（LLMs）在可验证任务上的性能通常通过 pass@k 衡量，即在 k 次尝试中至少正确回答一次的概率。在固定预算下，更合适的指标是 coverage@cost，即作为总尝试次数函数的平均唯一回答问题数量。我们连接这两个指标，并证明 pass@k 中经验观察到的幂律行为导致 coverage@cost 的次线性增长（收益递减）。为解决此问题，我们提出重置与丢弃（ReD），一种 LLMs 的查询方法，无论 pass@k 的形式如何，都能在给定预算下增加 coverage@cost。此外，给定 pass@k，我们可以定量预测使用 ReD 在总尝试次数上的节省。如果模型的 pass@k 不可用，ReD 可以推断其幂律指数。在三个 LLMs 上进行的编码（HumanEval）、数学（GSM8K）和推理（MMLU-Pro）基准测试表明，ReD 显著减少了达到期望覆盖率所需的尝试次数、令牌数和美元成本，同时提供了一种高效测量推理幂律的方法。ReD 的优势在非完美验证器下得以保持，并且优于测试的分配基线。

英文摘要

The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically observed power-law behavior in pass@k leads to sublinear growth of coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a querying method for LLMs that increases coverage@cost for a given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs across coding (HumanEval), math (GSM8K), and reasoning (MMLU-Pro) benchmarks demonstrate that ReD substantially reduces the required attempts, tokens, and USD cost to reach a desired coverage, while also offering an efficient way to measure inference power-laws. ReD's advantage is maintained for imperfect verifiers, and ReD outperforms the tested allocation baselines.
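
To make the two metrics in this abstract concrete, the following is a minimal Python sketch of pass@k and of coverage@cost under a naive even split of the budget. It is not the ReD procedure, which the abstract does not specify in detail. The independence of trials, the per-question success probabilities in `ps`, and the helper names `pass_at_k` and `coverage_uniform` are all assumptions made for illustration.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k trials.

    Assumes (for illustration) independent trials with per-trial success p.
    """
    return 1.0 - (1.0 - p) ** k


def coverage_uniform(ps, budget: int) -> float:
    """Expected number of distinct questions solved when `budget` attempts
    are split evenly across questions (a naive baseline, not ReD)."""
    k = budget // len(ps)
    return sum(pass_at_k(p, k) for p in ps)


# Hypothetical per-question success probabilities: easy, medium, hard.
ps = [0.5, 0.1, 0.01]
cov_small = coverage_uniform(ps, 3)    # 1 attempt per question
cov_large = coverage_uniform(ps, 30)   # 10 attempts per question
print(cov_small, cov_large)
```

In this toy setting a tenfold budget increase yields far less than a tenfold increase in expected coverage, which is the diminishing-returns behavior the abstract describes.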

2601.21996 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

机械论数据归因：追踪可解释LLM单元的训练起源

Jianhui Chen, Yuzhang Luo, Liangming Pan

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出机械论数据归因（MDA）框架，利用影响函数将可解释单元追溯到特定训练样本，通过因果验证表明干预高影响样本可显著调节可解释头的涌现，并发现重复结构数据作为机械催化剂，同时验证了归纳头与上下文学习之间的功能联系。

Comments ICML2026 (Oral)

AI中文摘要

尽管机械论可解释性已在LLM中识别出可解释电路，但它们在训练数据中的因果起源仍然难以捉摸。我们引入了机械论数据归因（MDA），这是一个可扩展的框架，利用影响函数将可解释单元追溯到特定训练样本。通过在Pythia系列模型上的广泛实验，我们因果验证了目标干预——移除或增加一小部分高影响样本——显著调节了可解释头的涌现，而随机干预则没有效果。我们的分析表明，重复的结构化数据（例如LaTeX、XML）充当了机械催化剂。此外，我们观察到针对归纳头形成的干预会引发模型上下文学习（ICL）能力的同步变化。这为关于归纳头与ICL之间功能联系的长期假设提供了直接的因果证据。最后，我们提出了一种机械论数据增强流水线，该流水线在不同模型规模上一致地加速电路收敛，为引导LLM的发展轨迹提供了一种原则性方法。

英文摘要

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

2601.22736 2026-06-09 cs.LG cs.AI 版本更新

UA-DCM: Uncertainty-aware Causal Decision Making via Effect Bound Decomposition

UA-DCM: 基于效应界分解的不确定性感知因果决策

Md Musfiqur Rahman, Ziwei Jiang, Hilaf Hasson, Murat Kocaoglu

发表机构 * Electrical and Computer Engineering, Purdue University（普渡大学电气与计算机工程系）； Computer Science, Johns Hopkins University（约翰霍普金斯大学计算机科学系）； Cohesity

AI总结提出一种新框架，通过分解因果效应值的可消除与不可消除部分，区分收集更多样本能否帮助识别最优行动，并利用神经因果模型近似实现该分解。

AI中文摘要

从观测数据中进行因果推断可以为决策场景中找到最佳行动提供有力证据，而无需进行昂贵的随机试验。由于未观测到的混杂因素，即使有无限数据，行动的因果效应也往往不是点可识别的。此外，仅有有限样本为因果效应估计增加了另一层不确定性。现有几种方法可用于获得因果效应的上下界，从符号方法到最近的基于神经网络的方法，这些方法隐式地结合了两种不确定性来源。然而，这些方法并未告知收集更多样本是否有助于从观测数据中识别最佳行动，使专家对其数据收集策略一无所知。我们通过一种新颖的框架解决了这个问题，该框架能够区分可能通过收集更多样本消除的因果效应值范围与那些高概率无法通过更多观测样本消除的值范围。我们证明这种划分可以通过求解最大-最小和最小-最大优化问题获得。我们利用神经因果模型在实践中近似恢复这种分解。通过在合成和真实世界数据集上的实验，我们证明了我们的算法可以确定何时收集更多样本无助于确定最佳行动。我们的框架可以帮助从业者决定何时应诉诸非观测研究或寻求测量一些未测量的混杂因素以进行最优决策。

英文摘要

Causal inference from observational data can provide strong evidence for finding the best action in a decision-making scenario without having to perform expensive randomized trials. The causal effect of an action is often not pointwise identifiable even with infinite data due to unobserved confounding factors. Furthermore, having only finitely many samples adds another layer of uncertainty to causal effect estimation. Several existing methods can be used to obtain upper and lower bounds to the causal effect, ranging from symbolic methods to the more recent neural network-based approaches, which implicitly incorporate both sources of uncertainty. However, these methods do not inform whether collecting more samples may or may not help identify the best action from observational data, leaving experts in the dark about their data collection strategies. We address this problem with a novel framework that can distinguish the range of causal effect values that might be eliminated by collecting more samples from the range of values that, with high probability, cannot be eliminated with more observational samples. We show that this partitioning can be obtained by solving max-min and min-max optimization problems. We leverage neural causal models to approximately recover this decomposition in practice. We demonstrate via experiments on synthetic and real-world datasets that our algorithm can determine when collecting more samples will not help determine the best action. Our framework can help practitioners decide when to resort to non-observational studies or seek to measure some of the unmeasured confounders for optimal decision-making.

2602.04402 2026-06-09 stat.ML cs.AI cs.CY cs.LG math.ST stat.TH 版本更新

Performative Learning Theory

表现性学习理论

Julian Rodemann, Unai Fischer-Abaigar, James Bailie, Krikamol Muandet

发表机构 * University of Cambridge（剑桥大学）

AI总结将表现性预测嵌入统计学习理论，证明在样本和总体表现性效应下的泛化界，揭示模型影响数据越多则学习越少的权衡，并提出通过再训练改善泛化保证。

Comments ICML 2026. v2: corrected typo in author list; v3: added explanation of condition 3.2, modified condition 3.3 and fixed lemma 3.4, added examples and explanations in sections 2, 5, and 6

AI中文摘要

表现性预测会影响它们试图预测的结果。我们研究影响样本（例如，仅限现有应用用户）和/或整个总体（例如，所有潜在应用用户）的表现性预测。这引发了模型在表现性下泛化能力的问题。例如，当现有用户和新用户都对应用的预测做出反应时，我们基于现有用户对新用户能得出多好的见解？我们通过将表现性预测嵌入统计学习理论来解决这个问题。我们证明了在样本、总体以及两者共同影响下的泛化界。我们证明背后的一个关键直觉是，在最坏情况下，总体否定预测，而样本欺骗性地实现预测。我们分别将这种自我否定和自我实现的预测表述为Wasserstein空间中的最小-最大和最小-最小风险泛函。我们的分析揭示了表现性地改变世界与从中学习之间的基本权衡：模型对数据的影响越大，它能从数据中学到的就越少。此外，我们的分析得出一个令人惊讶的见解：通过对表现性扭曲的样本进行再训练，可以改善泛化保证。我们通过一个案例研究说明了我们的界，该案例涉及基于预测的德国失业居民工作培训分配，利用了德国1975年至2017年的行政劳动力市场记录。

英文摘要

Performative predictions influence the very outcomes they aim to forecast. We study performative predictions that affect a sample (e.g., only existing users of an app) and/or the whole population (e.g., all potential app users). This raises the question of how well models generalize under performativity. For example, how well can we draw insights about new app users based on existing users when both of them react to the app's predictions? We address this question by embedding performative predictions into statistical learning theory. We prove generalization bounds under performative effects on the sample, on the population, and on both. A key intuition behind our proofs is that in the worst case, the population negates predictions, while the sample deceptively fulfills them. We cast such self-negating and self-fulfilling predictions as min-max and min-min risk functionals in Wasserstein space, respectively. Our analysis reveals a fundamental trade-off between performatively changing the world and learning from it: the more a model affects data, the less it can learn from it. Moreover, our analysis results in a surprising insight on how to improve generalization guarantees by retraining on performatively distorted samples. We illustrate our bounds in a case study on prediction-informed assignments of unemployed German residents to job trainings, drawing upon administrative labor market records from 1975 to 2017 in Germany.

2602.05774 2026-06-09 cs.LG cs.AI math.PR 版本更新

Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance

变分推测解码：从令牌似然到序列接受的草稿训练再思考

Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出变分推测解码（VSD），将草稿训练视为对潜在提议（草稿路径）的变分推断，通过最大化目标模型接受的边际概率来优化，结合路径级效用和期望最大化过程，显著提升解码效率。

AI中文摘要

推测解码加速了（多模态）大语言模型的推理，但训练-解码之间存在不一致：现有方法优化单一贪婪轨迹，而解码涉及验证和排序多个采样草稿路径。我们提出变分推测解码（VSD），将草稿训练形式化为对潜在提议（草稿路径）的变分推断。VSD最大化目标模型接受的边际概率，得到一个ELBO，该ELBO促进高质量潜在提议，同时最小化与目标分布的散度。为提升质量并降低方差，我们引入路径级效用，并通过期望最大化过程进行优化。E步从经过oracle过滤的后验中抽取蒙特卡洛样本，M步使用自适应拒绝加权（ARW）和置信度感知正则化（CAR）最大化加权似然。理论分析证实VSD增加了期望接受长度和加速比。在LLM和MLLM上的大量实验表明，VSD相比EAGLE-3实现高达9.6%的加速，相比ViSpec实现7.9%的加速，显著提升了解码效率。

英文摘要

Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws Monte Carlo samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.

2602.12107 2026-06-09 cs.LG cs.AI stat.ML 版本更新

How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

Transformer 如何拒绝错误答案：事实约束处理的旋转动力学

Javier Marín

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结研究揭示了Transformer在处理事实性问题时，隐藏状态空间中正确与错误延续路径的旋转分离现象，揭示了模型在深层结构中对错误延续的非局部化偏好。

AI中文摘要

当解码器-only Transformer 被强制处理事实性查询的匹配正确和错误单token延续时，两种路径在隐藏状态空间中以特定方式分离：从查询-only 表示出发的位移向量保持大致相等的幅度但方向旋转远离。角分离在中层增加，后期层解决不对称结果——在错误运行中，logit-lens 倾向远低于朴素先验，对应模型将错误token的概率约11.5倍于正确token。该双阶段模式——中层旋转分离后后期层不对称承诺——被描述为模型对外部看似拒绝错误延续的实证几何特征，但明确指出是观测描述而非因果解释。该模式在六个解码器-only Transformer 中一致，包括五个架构家族（1B到13B参数）。第七个模型（Qwen2 1.5B）在当前提取协议下显示平坦曲线，可能是tokenizer-fragmentation的artefact而非真实规模限制；是否存在临界出现阈值的问题仍悬而未决。单层激活拼接在任何层带均无法恢复正确token，意味着后期层不对称性并非局限于离散组件。总体而言，证据支持事实约束处理的分布式轨迹账户——几何结构在许多层中逐步累积出现，而非单一局部化回溯账户。

英文摘要

When a decoder-only transformer is forced to process matched correct and incorrect single-token continuations of a factual query, the two pathways through hidden-state space diverge in a specific way: displacement vectors from the query-only representation maintain approximately equal magnitude but rotate apart in direction. The angular separation grows through mid-depth, and late layers resolve the asymmetric outcome: a logit-lens preference that, in the incorrect run, falls far below the naive prior of equal probability, corresponding to the model assigning approximately 11.5 times more probability to the incorrect token than to the correct one. We characterize this two-phase pattern (rotational divergence in mid-depth followed by late-layer asymmetric commitment) as the empirical geometric signature of what looks externally like the model rejecting a wrong continuation, while remaining explicit that it is an observational characterization, not a causal account. The pattern is consistent across six decoder-only transformers spanning five architecture families from 1B to 13B parameters. A seventh model (Qwen2 1.5B) shows a flat profile under the present extraction protocol that is plausibly a tokenizer-fragmentation artefact rather than a real scale floor; the question of an emergence threshold is left open. Single-layer activation patching does not recover the correct token at any layer band, meaning the late-layer asymmetry is not localized to a discrete component under the protocol used. Taken together, the evidence is consistent with a distributed-by-trajectory account of factual constraint processing, in which geometric structure emerges cumulatively across many layers rather than from a single localized circuit, and is inconsistent with the simplest single-layer localized-recall account.
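
The geometric quantity this abstract describes (displacement vectors from the query-only representation that keep similar magnitude but rotate apart) can be sketched in a few lines of Python. This is only an illustration on hand-made two-dimensional vectors, not the paper's extraction protocol. The names `displacement` and `angle_deg` and the toy hidden states are assumptions.

```python
import math


def displacement(h, h_query):
    """Displacement of a hidden state from the query-only representation."""
    return [a - b for a, b in zip(h, h_query)]


def norm(u):
    return math.sqrt(sum(a * a for a in u))


def angle_deg(u, v):
    """Angle between two vectors in degrees (cosine clamped for safety)."""
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


# Toy hidden states at one layer (hypothetical values).
h_query = [1.0, 1.0]
h_correct = [2.0, 1.0]
h_incorrect = [1.0, 2.0]

d_correct = displacement(h_correct, h_query)
d_incorrect = displacement(h_incorrect, h_query)
separation = angle_deg(d_correct, d_incorrect)
print(norm(d_correct), norm(d_incorrect), separation)
```

Here the two displacements have the same length but point in orthogonal directions. Repeating the computation per layer would give an angular-separation profile of the kind the abstract refers to.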

2603.22473 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Component Ablation for Efficient Hybrid Language Model Architectures: Performance, Resilience, and Compression Implications

组件消融用于高效混合语言模型架构：性能、鲁棒性和压缩影响

Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

发表机构 * Doctoral Program in Computer Science, University of Valencia（瓦伦西亚大学计算机科学博士项目）

AI总结本文通过组件消融研究混合语言模型，发现注意力机制与替代序列处理路径对性能有显著影响，揭示了模型鲁棒性与压缩优化的关键因素。

Comments 25 pages, 7 figures, 6 tables; revised title, abstract, figures, and data/code repository URL

AI中文摘要

混合语言模型结合softmax注意力与线性时间序列机制，如状态空间或线性注意力层，但各组件的功能贡献尚不明确。本文在两个子10亿参数的混合语言模型Qwen3.5-0.8B和Falcon-H1-0.5B上，通过基于似然的评估、下游基准、逐层干预、随机控制和表征级诊断研究组件消融。测试结果显示，移除注意力或替代序列处理路径会显著降低性能，表明两种组件类型均对模型行为有贡献。似然指标对线性注意力或状态空间路径特别敏感，而下游基准退化取决于任务和架构。逐层消融显示组件重要性位置依赖，最强效果集中在早期或中期网络组件而非整个深度。随机移除控制进一步显示混合架构与相同家族Transformer基线在结构扰动下退化不同。这些结果表明组件消融是理解混合语言模型架构的有效诊断方法。发现为高效模型设计、压缩、鲁棒性分析和部署决策提供了相关证据。

英文摘要

Hybrid language models combine softmax attention with linear-time sequence mechanisms such as state-space or linear-attention layers, but the functional contribution of each component type remains insufficiently characterized. We study component-level ablation in two sub-1B hybrid language models, Qwen3.5-0.8B and Falcon-H1-0.5B, using likelihood-based evaluation, downstream benchmarks, layer-wise interventions, random controls, and representation-level diagnostics. Across the tested models, removing either attention or the alternative sequence-processing pathway substantially degrades performance, indicating that both component types contribute to model behavior. Likelihood metrics are especially sensitive to the linear-attention or state-space pathway, while downstream benchmark degradation depends on task and architecture. Layer-wise ablations show that component importance is position-dependent, with the strongest effects concentrated in early or mid-network components rather than uniformly across depth. Random-removal controls further show that hybrid architectures and same-family Transformer baselines degrade differently under structural perturbation. These results suggest that component ablation is a useful diagnostic for understanding hybrid language model architectures. The findings provide evidence relevant to efficient model design, compression, robustness analysis, and deployment decisions in architectures that combine attention with alternative sequence-processing mechanisms.

2603.25157 2026-06-09 cs.LG cs.AI cs.CV stat.ML 版本更新

Vision Hopfield Memory Networks for Image Recognition

Vision Hopfield Memory Networks

Jianfeng Wang, Amine M'Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz

发表机构 * Department of Computer Science, University of Oxford（牛津大学计算机科学系）； Faculty of Informatics, Vienna University of Technology（维也纳理工大学信息学院）

AI总结本文提出了一种受大脑启发的视觉Hopfield记忆网络（V-HMN），通过整合分层记忆机制和迭代细化更新，实现了统一框架下的局部和全局动态建模，提升了可解释性和数据效率。

AI中文摘要

近年来，视觉和多模态基础模型，如Transformer家族和状态空间模型（如Mamba）在图像、文本等领域取得了显著进展。尽管这些架构在经验上取得了成功，但它们与人脑的计算原理仍有很大差距，通常需要大量的训练数据且可解释性有限。在本文中，我们提出了视觉Hopfield记忆网络（V-HMN），一种受大脑启发的基础模型，整合了分层记忆机制和迭代细化更新。具体而言，V-HMN包含局部Hopfield模块，提供图像块级别的关联记忆动态，全局Hopfield模块作为情境调节的事件记忆，以及受预测编码启发的细化规则用于迭代误差校正。通过将这些基于记忆的模块分层组织，V-HMN在一个统一的框架中捕捉了局部和全局动态。记忆检索揭示了输入与存储模式之间的关系，使决策更具可解释性，而存储模式的重用提高了数据效率。这种受大脑启发的设计因此在可解释性和数据效率方面超越了现有的自注意或状态空间方法。我们在公开的计算机视觉基准上进行了广泛的实验，V-HMN在与广泛采用的基础架构竞争的同时，提供了更好的可解释性、更高的数据效率和更强的生物合理性。这些发现突显了V-HMN作为下一代视觉基础模型的潜力，同时为文本和音频等领域的多模态基础模型提供了通用的蓝图，从而将受大脑启发的计算与大规模机器学习联系起来。

英文摘要

Recent vision backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress on image recognition. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. We propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired vision backbone that integrates hierarchical memory mechanisms across layers with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, providing a prototype-based form of interpretability through explicit memory retrieval, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances data efficiency and provides a prototype-based form of interpretability compared to existing self-attention- or state-space-based approaches. We conducted extensive experiments on public image classification benchmarks. V-HMN achieves strong performance on small- and medium-scale benchmarks, and remains competitive with widely adopted backbone architectures on ImageNet despite minimal architectural tuning, while offering improved data efficiency and a prototype-based form of interpretability. These findings highlight the potential of V-HMN as a memory-centric alternative to standard vision backbones, thereby bridging brain-inspired computation with modern machine learning.

2603.25184 2026-06-09 cs.LG cs.AI 版本更新

Simple Self-Conditioning Adaptation for Masked Diffusion Models

简单自条件适应用于掩码扩散模型

Michael Cardei, Huu Binh Ta, Ferdinando Fioretto

发表机构 * University of Virginia（弗吉尼亚大学）

AI总结本文提出一种简单有效的后训练适应方法，通过自条件预测提升掩码扩散模型的生成能力，减少生成困惑度并提升图像合成和分子生成质量。

AI中文摘要

掩码扩散模型（MDMs）通过迭代去噪在吸收掩码过程中生成离散序列。在标准掩码扩散中，如果一个token在反向更新后仍被掩码，模型会丢弃该位置的干净状态预测。因此，仍被掩码的位置必须反复从掩码token本身推断。这种设计限制了跨步骤的细化。为解决这一限制，本文提出了一种简单但有效的后训练适应方法，使每个去噪步骤都基于模型自身之前的干净状态预测。所提出的方法称为自条件掩码扩散模型（SCMDM），需要最小的架构更改，不引入递归的潜在状态路径，不依赖辅助参考模型，并在采样过程中不增加额外的去噪器评估。这与部分自条件方法形成重要区别，后者需要昂贵的从头模型训练。特别是，本文表明，在后训练阶段，部分自条件，包括用于从头训练自条件模型的常用50% dropout策略，是次优的。相反，一旦模型自生成的干净状态估计变得有信息，专业化于细化优于混合条件和无条件目标。SCMDM在多个领域进行了评估，显示出对普通MDM基线的一致改进，实现了在OWT训练模型上的生成困惑度几乎减少50%（从42.89到23.72），同时在离散图像合成质量、小分子生成和基因组分布建模的保真度方面也取得了显著改进。

英文摘要

Masked diffusion models (MDMs) generate discrete sequences by iterative denoising under an absorbing masking process. In standard masked diffusion, if a token remains masked after a reverse update, the model discards its clean-state prediction for that position. Thus, still-masked positions must be repeatedly inferred from the mask token alone. This design choice limits cross-step refinement. To address this limitation, this paper proposes a simple yet effective post-training adaptation for MDMs that conditions each denoising step on the model's own previous clean-state predictions. The resulting method, called Self-Conditioned Masked Diffusion Models (SCMDM), requires minimal architectural change, does not introduce a recurrent latent-state pathway, does not rely on an auxiliary reference model, and adds no extra denoiser evaluations during sampling. This is an important departure from partial self-conditioning approaches, which require expensive model training from scratch. In particular, the paper shows that partial self-conditioning, including the commonly used 50% dropout strategy for training self-conditioned models from scratch, is suboptimal in the post-training regime. Instead, once the model's self-generated clean-state estimates become informative, specializing in refinement is preferable to mixing conditional and unconditional objectives. SCMDM is evaluated across multiple domains, demonstrating consistent improvement over vanilla MDM baselines, achieving nearly a 50% reduction in generative perplexity on OWT-trained models (42.89 to 23.72), alongside strong improvements in discretized image synthesis quality, small-molecule generation, and fidelity in genomic distribution modeling.

URL PDF HTML ☆

赞 0 踩 0

2605.01616 2026-06-09 cs.LG cs.AI cs.CY cs.NI 版本更新

Learning Behavioral Signals from Encrypted Smartphone Network Traffic

从加密智能手机网络流量中学习行为信号

Rameen Mahmood, Omar El Shahawy, Souptik Barua, Zachary Beattie, Jeffrey Kaye, Xuhai "Orson" Xu, Chao-Yi Wu, Danny Yuxing Huang

发表机构 * New York University（纽约大学）； NYU Langone Health（NYU Langone健康）； NYU Grossman School of Medicine（NYU Grossman医学院）； Oregon Health & Science University（俄勒冈健康与科学大学）； Columbia University（哥伦比亚大学）； Harvard Medical School（哈佛医学院）

AI总结本文利用基于Transformer的模型从加密网络流量中学习行为表征，结合用户特定适配器，并通过稀疏表示和广义估计方程分析，发现压力、孤独感和睡眠障碍分别与个体间差异、个体内波动及两者组合相关，且学习到的表征优于传统手工特征。

Comments 19 pages, 6 figures

AI中文摘要

人类行为难以在大规模下连续测量，然而日常活动和幸福感的痕迹可能反映在与个人设备的交互中。我们研究加密的智能手机网络流量是否可以作为被动感知信号，用于检测与睡眠障碍、压力和孤独感相关的行为状态。为了捕捉群体层面的模式和个体特定的行为，我们采用基于Transformer的模型，该模型带有用户特定的适配器，学习网络活动的表征，同时考虑个人基线及其偏差。为了提高可解释性，我们进一步使用稀疏表示学习分析这些表征，以识别与不同活动模式相关的潜在行为特征。我们使用带有Mundlak分解的广义估计方程将所得特征与睡眠障碍、压力和孤独感联系起来，从而能够区分稳定的个体间差异和随时间变化的个体内变化。我们的分析揭示了这三种结果具有不同的时间动态：压力主要与持续的个体间变异相关，孤独感与个体内波动更密切相关，而睡眠障碍则反映了两者的结合。重要的是，这些个体内行为信号无法通过传统的手工网络流量特征恢复，这突显了学习表征在纵向行为建模中的优势。总体而言，我们的发现表明加密网络流量包含可解释的行为信息，并能够支持被动、可扩展的行为动态监测，特别是相对于个体典型活动模式的变化。

英文摘要

Human behavior is challenging to measure continuously at scale, yet traces of daily routines and well-being may be reflected in interactions with personal devices. We investigate whether encrypted smartphone network traffic can serve as a passive sensing signal for behavioral states related to sleep disturbance, stress, and loneliness. To capture both population-level patterns and individual-specific behavior, we employ a transformer-based model with user-specific adapters that learns representations of network activity while accounting for personal baselines and deviations from them. To improve interpretability, we further analyze these representations using sparse representation learning to identify latent behavioral features associated with distinct activity patterns. We relate the resulting features to sleep disturbance, stress, and loneliness using generalized estimating equations with Mundlak decomposition, enabling separation of stable between-person differences from within-person changes over time. Our analysis reveals that the three outcomes are characterized by different temporal dynamics: stress is predominantly associated with persistent between-person variation, loneliness is more strongly linked to within-person fluctuations, and sleep disturbance reflects a combination of both. Importantly, these within-person behavioral signals are not recovered by conventional handcrafted network-traffic features, highlighting the advantages of learned representations for longitudinal behavioral modeling. Overall, our findings demonstrate that encrypted network traffic contains interpretable behavioral information and can support passive, scalable monitoring of behavioral dynamics, particularly changes relative to an individual's typical pattern of activity.

2605.02950 2026-06-09 cs.LG cs.AI 版本更新

High-Rate Quantized Matrix Multiplication II

高速率量化矩阵乘法II

Or Ordentlich, Yury Polyanskiy

发表机构 * Hebrew University of Jerusalem（耶路撒冷希伯来大学）； MIT（麻省理工学院）

AI总结本文研究在已知第二因子列协方差矩阵情况下高速率量化矩阵乘法，通过水填充算法改进LLM量化方法，展示WaterSIC方案在信息论极限下的性能。

AI中文摘要

本文是关于量化矩阵乘法（MatMul）工作的第二部分。在第一部分中，我们考虑了无校准量化的情况，而在这里，我们讨论了在第二因子列协方差矩阵$Σ_X$已知的情况下的情形。这种情形出现在广泛应用的LLM后训练量化任务中。权重量化与加权均方误差（WMSE）源编码问题相关，其经典的（反向）水填充解决定了如何在向量的坐标之间分配速率。我们展示了如何利用水填充来改进实际的LLM量化算法（GPTQ），目前这些算法平均分配速率。最近的一种方案（称为“WaterSIC”）仅使用标量INT量化器进行分析，其高速率性能被证明为（a）基无关（即由$Σ_X$的行列式决定，因此不同于现有方案，不受随机旋转的影响）；（b）在信息论极限下的性能与$\frac{2πe}{12}$（或0.25 bit/entry）的乘法因子内。GPTQ的性能受基的选择影响，但对于随机旋转和实际的$Σ_X$来自Llama-3-8B，我们发现其性能在0.1 bit（取决于层类型）以内，表明GPTQ结合随机旋转也接近最优，至少在高速率范围内。

英文摘要

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where the covariance matrix $Σ_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed, and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $Σ_X$ and thus, unlike existing schemes, immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $Σ_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
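
As a rough illustration of the classical reverse waterfilling rule the abstract refers to, the Python sketch below allocates rate across independent Gaussian components and checks the quoted conversion between the distortion factor $\frac{2πe}{12}$ and roughly 0.25 bit/entry. It is the textbook Gaussian construction, not the WaterSIC or GPTQ algorithm. The function name `reverse_waterfill`, the bisection search, and the example variances are assumptions.

```python
import math


def reverse_waterfill(variances, total_distortion, iters=100):
    """Textbook reverse waterfilling for independent Gaussian components.

    Finds the water level theta with sum(min(theta, var_i)) equal to
    total_distortion, then assigns R_i = max(0, 0.5 * log2(var_i / theta)).
    Requires 0 < total_distortion <= sum(variances).
    """
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        theta = (lo + hi) / 2.0
        if sum(min(theta, v) for v in variances) < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = (lo + hi) / 2.0
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    return theta, rates


# Hypothetical component variances (e.g. eigenvalues of a covariance matrix).
variances = [4.0, 1.0, 0.25]
theta, rates = reverse_waterfill(variances, total_distortion=0.75)

# At high rate, distortion scales as 2^(-2R), so a multiplicative distortion
# factor c corresponds to a rate gap of 0.5 * log2(c) bits.
gap_bits = 0.5 * math.log2(2 * math.pi * math.e / 12)
print(theta, rates, gap_bits)
```

Components with larger variance receive more bits and the component at the water level receives none, in contrast with the equal allocation the abstract attributes to current practice.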

2605.15491 2026-06-09 cs.LG cs.AI cs.PF 版本更新

Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

发表机构 * Frontis.AI ； Kuaishou Technology（快手科技）； Shanghai AI Lab（上海人工智能实验室）； TsinghuaC3I/ZEDA（清华大学C3I/ZEDA）

AI总结本文提出ZEDA框架，通过自蒸馏将预训练的静态MoE模型转换为高效的动态MoE模型，显著减少专家FLOPs并提升推理速度。

AI中文摘要

混合专家（MoE）通过稀疏专家激活高效地扩展语言模型，其动态变体进一步通过输入依赖的方式调整激活专家以减少计算。现有动态MoE方法通常依赖从头训练或任务特定适应，使完全训练的MoE的实际转换未被充分探索。启用此类适应可直接缓解推理成本，通过允许简单令牌在服务时绕过不必要的专家。本文引入了零专家自蒸馏适应（ZEDA），一种低成本框架，将后训练的静态MoE模型转换为高效的动态MoE模型。为稳定此架构转换，ZEDA在每个MoE层中注入无参数的零输出专家，并通过两阶段自蒸馏适应增强模型，利用原始MoE作为冻结的教师，并应用组级平衡损失。在Qwen3-30B-A3B和GLM-4.7-Flash上跨11个基准测试（涵盖数学、代码和指令跟随）中，ZEDA在边际精度损失下消除了超过50%的专家FLOPs。在两个模型上，ZEDA比最强的动态MoE基线分别高出6.1和4.0个点，并提供约1.20倍的端到端推理加速。

英文摘要

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such adaptation would directly reduce inference cost by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20$\times$ end-to-end inference speedup.

2605.21854 2026-06-09 cs.CV cs.AI 版本更新

CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

CrossVLA: 跨范式后训练和推理优化用于视觉-语言-动作模型

Zhi Liu

发表机构 * Tianjin University（天津大学）

AI总结本文研究了视觉-语言-动作（VLA）模型的跨范式后训练方法，提出了CrossVLA框架，通过改进的连续动作流匹配估计器、对比LoRA和DoRA参数高效层的性能，并揭示了推理过程中去噪循环对延迟的影响，最终实现了在LIBERO数据集上的显著提升。

Comments Workshop draft, 14 pages, 4 figures. Code, ckpts, data: https://github.com/lz-googlefycy/vla-lab

AI中文摘要

视觉-语言-动作（VLA）模型迅速收敛到一小套架构模式：离散令牌自回归（例如OpenVLA）和连续动作流匹配（例如pi-0.5）。然而，通过直接偏好优化（DPO）进行偏好对齐——语言模型中事实上的后训练步骤——几乎仅在自回归VLA上被研究。我们提出了CrossVLA，对跨范式VLA后训练进行实证研究。三大贡献：（i）一个替代流匹配对数概率估计器，使DPO可以在不进行概率流ODE积分的情况下在连续动作后端上运行；（ii）对LoRA和DoRA作为VLA DPO的参数高效层进行直接比较，发现DoRA在LIBERO 4套件上比OpenVLA SFT平均提升10.4个百分点（600次试验，3种子）——每套件+20.0对象，+11.0长周期，+8.0目标，+2.7空间——在对象上无种子方差（38/50在每个种子上）；（iii）推理时间解剖显示去噪循环主导了78.6%的sample_actions延迟，而类似于VLA-Cache的前缀K/V缓存达到了21%的加速上限——无论是块级还是令牌级缓存策略在我们的基准中都会使成功率降至0-80%。我们进一步在6000个LIBERO帧上预训练了一个多视角+时间投影头，实现了99.5%的k-NN召回率@1（36倍于随机），可用作下游初始化。所有代码、检查点、训练日志和复现脚本均在https://github.com/lz-googlefycy/vla-lab上公开。

英文摘要

Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. OpenVLA) and continuous-action flow-matching (e.g. pi-0.5). Yet preference alignment via Direct Preference Optimisation (DPO) -- the de-facto post-training step in language models -- has been studied almost exclusively on autoregressive VLAs. We present CrossVLA, an empirical study of cross-paradigm VLA post-training. Three contributions: (i) a surrogate flow-matching log-probability estimator that lets DPO operate on continuous-action backbones without probability-flow ODE integration; (ii) a head-to-head comparison of LoRA and DoRA as the parameter-efficient layer for VLA DPO, finding DoRA improves over OpenVLA SFT by a mean +10.4 pp across LIBERO 4-suite (600 trials, 3 seeds) -- per-suite +20.0 Object, +11.0 Long-horizon, +8.0 Goal, +2.7 Spatial -- with zero seed variance on Object (38/50 on each of 3 seeds); (iii) an inference-time anatomy showing the denoise loop dominates 78.6% of sample_actions latency and prefix-K/V caching a la VLA-Cache caps at a 21% acceleration ceiling -- both chunk-level and token-level cache strategies degrade success rate to 0-80% in our benchmarks. We further pretrain a multi-view + temporal projection head on 6000 LIBERO frames, achieving 99.5% k-NN recall@1 for same-task retrieval (36x over random), available as a downstream initialisation. All code, ckpts, training logs, and reproduction scripts are open at https://github.com/lz-googlefycy/vla-lab.
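
For readers unfamiliar with the objective this abstract builds on, the following is a minimal Python sketch of the standard per-pair DPO loss. It takes sequence log-probabilities as plain numbers. In the setting above these would come from the policy and a frozen reference model (for a flow-matching backbone, from the surrogate estimator the abstract mentions, whose form is not given here). The function name `dpo_loss`, the value of `beta`, and the example numbers are assumptions.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-probabilities of the preferred / rejected sample.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    Returns -log(sigmoid(margin)), computed in a numerically stable way.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
loss_neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raises the preferred sample and lowers the rejected one: lower loss.
loss_better = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(loss_neutral, loss_better)
```

The loss depends only on log-probability differences relative to the reference, which is why an estimator for those log-probabilities is enough to apply DPO to a continuous-action model.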

2605.24942 2026-06-09 cs.LG cs.AI 版本更新

Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

黎曼流形操控：用于无标签操控的几何感知生成自编码器

Narmeen Oozeer, Shivam Raval, Philip Quirke, Manikandan Ravikiran, Jeff Phillips, Shriyash Upadhyay, Amirali Abdullah

发表机构 * Martian ； Harvard University（哈佛大学）； Thoughtworks ； University of Utah（犹他大学）

AI总结提出将语言模型操控重新定义为激活空间上的黎曼测地线计算，通过基于输出空间Hellinger距离学习的编码器实现无标签、无拓扑先验的流形操控。

AI中文摘要

语言模型的操控——干预其内部激活以改变下游行为——最近已从线性插值扩展到非线性方法，如角度操控和核化操控，这些方法定义了干预变换，而无需在激活空间中的路径上学习显式几何。新引入的几何感知流形方法确实学习了这样的几何，但需要带标签的类中心以及预设的循环或顺序结构。这些假设限制了流形操控的应用范围，因为现有构造需要带标签的中心和兼容的边界条件。我们将流形操控更广泛地重新定义为激活空间上的黎曼测地线计算，将线性操控和带标签样条操控恢复为特定度量选择下的测地线。该框架内一个有原则的度量是输出空间Hellinger距离拉回到激活空间；我们通过一个在小型概念-令牌模式上基于输出距离训练的学习编码器来近似该度量——无需每个提示的标签、无需拓扑先验、也无需每个任务的曲线拟合。实验上，该方法在标准四任务语言模型算术基准的所有任务中可靠地将模型驱动到目标类别，同时在较小输出空间上遵循比基线更行为自然的轨迹。因此，我们为流形操控提供了一个统一的黎曼框架，以及一个基于模式监督、无标签的实例化，该实例化无需带标签的中心或预设边界条件即可运行。

英文摘要

Steering a language model - intervening on its internal activations to change downstream behaviour - has recently expanded beyond linear interpolation to nonlinear methods such as angular and kernelized steering, which define intervention transformations without learning an explicit geometry over paths in activation space. Freshly introduced geometry-aware manifold methods do learn such a geometry, but require labelled class centroids together with prescribed cyclic or sequential structure. These assumptions restrict where manifold steering can be applied, since existing constructions require labelled centroids and compatible boundary conditions. We recast manifold steering more broadly as Riemannian geodesic computation on activation space, recovering linear and labelled-spline steering as geodesics under particular choices of metric. A principled metric within this framework is the output-space Hellinger distance pulled back to activations; we approximate this with a learned encoder trained on output distances over a small concept-token schema - no per-prompt labels, no topology prior, and no per-task curve fitting. Empirically, the method reliably drives the model onto the target class across all tasks in a standard four-task language-model arithmetic benchmark, while following more behaviourally natural trajectories than baselines on smaller output spaces. We thereby provide a unified Riemannian framework for manifold steering together with a schema-supervised, label-free instantiation that operates without labelled centroids or prescribed boundary conditions.
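
The Hellinger distance named in the abstract has a simple closed form for discrete distributions, sketched below. It only illustrates the output-space distance itself; how the paper pulls it back to activations and trains the encoder is not described in the abstract and is not reproduced here. The function name `hellinger` is an assumption for this sketch.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2), which lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * squared)
```

Identical distributions are at distance 0, and distributions with disjoint support are at distance 1.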

2605.26872 2026-06-09 cs.LG cs.AI cs.CL 版本更新

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

最强的教师并不总是最好的教师：以学生为中心的答案选择

Zhengyu Hu, Zheyuan Xiao, Linxin Song, Fengqing Jiang, Yuetai Li, Zhengyu Chen, Zhihan Xiong, Yue Liu, Junhao Lin, Yao Su, Lijie Hu, Kaize Ding, Teng Xiao, Radha Poovendran

发表机构 * University of Washington（华盛顿大学）； University of Texas at Austin（德克萨斯大学奥斯汀分校）； University of Southern California（南加州大学）； Independent Researcher（独立研究者）； National University of Singapore（新加坡国立大学）； Microsoft（微软）； Google（谷歌）； Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； Northwestern University（西北大学）； Allen Institute for AI (AI2)（艾伦人工智能研究所（AI2））

AI总结提出以学生为中心的答案采样（SCAS）框架，通过估计学生中心的学习成本选择教师生成的答案，从而提升学生模型性能。

AI中文摘要

LLM训练越来越依赖教师生成的监督，包括合成响应、推理轨迹和工具使用演示。当前实践通常选择表现最好的教师来生成学生训练数据，隐含地将教师测试表现视为教学质量的代理。我们表明这一假设可能失败：即使多个教师对同一问题提供正确答案，最强教师的答案也不一定是对给定学生的最佳监督。为解决这一问题，我们提出以学生为中心的答案采样（SCAS），该框架根据估计的学生中心学习成本从经过验证的教师生成答案中进行选择。受逐词梯度分解的启发，我们推导出该成本的高效前向代理，并在训练中用于指导答案选择。在30个教师模型、6个学生基础模型和6个任务上的实验表明，SCAS持续提升学生性能，表明有效的蒸馏应优先考虑与当前学生匹配的监督，而非仅依赖教师强度。

英文摘要

LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations. Current practice often chooses the highest-performing teacher to generate student training data, implicitly treating teacher test performance as a proxy for teaching quality. We show that this assumption can fail: even when multiple teachers provide correct answers to the same question, the answer from the strongest teacher is not necessarily the best supervision for a given student. To address this gap, we propose Student-Centric Answer Sampling (SCAS), a framework that selects from verified teacher-generated answers according to their estimated student-centric learning cost. Motivated by a token-wise gradient decomposition, we derive an efficient forward-only proxy for this cost and use it to guide answer selection during training. Experiments across 30 teacher models, 6 student base models, and 6 tasks show that SCAS consistently improves student performance, suggesting that effective distillation should prioritize supervision matched to the current student rather than teacher strength alone.

2605.27786 2026-06-09 cs.LG cs.AI 版本更新

Locality-Aware Redundancy Pruning for LLM Depth Compression

面向LLM深度压缩的局部感知冗余剪枝

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo, Minkyu Kim, Sunwoo Lee

发表机构 * University of Southern California（美国南加州大学）； Neural Superintelligence Lab, MODULABS（MODULABS神经超级智能实验室）； Seoul National University（首尔国立大学）； Inha University（仁荷大学）

AI总结提出LoRP，一种基于表示局部性的无训练单次深度剪枝框架，通过引入表示局部性分数（RLS）来识别和剪除冗余层，在多种LLM上提升了困惑度和下游任务准确率。

AI中文摘要

大型语言模型在跨网络深度上已知存在表示冗余，这使得深度剪枝成为提高推理效率的有效方法。现有的单次剪枝方法依赖于局部层重要性或跨架构的固定冗余假设。我们提出了局部感知冗余剪枝（LoRP），一种由表示局部性引导的无训练单次深度剪枝框架。我们表明，层间冗余可以是局部化的或全局分布的，具体取决于LLM架构。为了表征这一现象，我们引入了表示局部性分数（RLS），该分数源自全局层间隐藏状态相似性。使用小的校准集，LoRP计算成对层相似性，按表示相似性对层进行聚类，并根据残差簇内冗余分配剪枝。跨多种LLM家族的实验表明，在困惑度和下游任务准确性上均有提升。

英文摘要

Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for improving inference efficiency. Existing one-shot pruning methods rely on local layer importance or fixed redundancy assumptions across architectures. We propose Locality-Aware Redundancy Pruning (LoRP), a training-free one-shot depth pruning framework guided by representation locality. We show that inter-layer redundancy can be either localized or globally distributed depending on the LLM architecture. To characterize this phenomenon, we introduce Representation Locality Score (RLS), derived from global inter-layer hidden-state similarity. Using a small calibration set, LoRP computes pairwise layer similarity, clusters layers by representational similarity, and allocates pruning according to residual intra-cluster redundancy. Experiments across diverse LLM families show improvements in both perplexity and downstream task accuracy. Official github repository: https://github.com/daniel-eai/LoRP-Locality-Aware-Redundancy-Pruning/
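
The first step the abstract describes, computing pairwise layer similarity from hidden states on a calibration set, might look like the sketch below. It assumes cosine similarity over one pooled hidden-state vector per layer; the abstract does not state the similarity measure, the RLS formula, or the clustering rule, so none of those are reproduced. `cosine` and `layer_similarity` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layer_similarity(hidden_states):
    """Return the full layer-by-layer similarity matrix.

    hidden_states[i] is a pooled hidden-state vector for layer i, for
    example averaged over the tokens of a small calibration set.
    """
    n = len(hidden_states)
    return [[cosine(hidden_states[i], hidden_states[j]) for j in range(n)]
            for i in range(n)]
```

Blocks of high off-diagonal similarity would indicate groups of layers that are candidates for pruning; whether such blocks are contiguous is what the abstract's localized versus globally distributed distinction refers to.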

2605.28207 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Pruning and Distilling Mixture-of-Experts into Dense Language Models

将混合专家模型剪枝和蒸馏为密集语言模型

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

发表机构 * KRAFTON ； KAIST（韩国科学技术院）

AI总结提出首个将混合专家（MoE）模型转换为标准密集架构的系统框架，通过专家评分、选择、分组、拼接和知识蒸馏，在参数匹配条件下比密集到密集剪枝平均下游准确率提升6.3个百分点，训练速度提升1.6倍。

AI中文摘要

混合专家（MoE）现在是前沿语言模型的主导架构，但它需要将所有专家参数加载到内存中，因此在内存受限的部署中不太受欢迎。现有的压缩方法减少了专家数量，但输出仍然是具有相同基本限制的MoE模型。我们提出了第一个将训练好的MoE转换为标准全密集架构的系统框架：专家被评分、选择和分组，然后拼接成密集的前馈网络（FFN），并通过MoE教师的知识蒸馏进行精炼。我们在Qwen3-30B-A3B上评估了7种评分方法、5种分组方法和2种幅度缩放方法，涵盖了多种选定的专家数量，共产生350种配置。我们发现评分方法的选择影响最大，我们提出的新颖的多样性感知评分在Qwen3-30B-A3B、DeepSeek-V2-Lite和GPT-OSS-20B上始终优于先前的方法。在参数匹配的受控比较下，经过约4B token的蒸馏，MoE到密集的转换在平均下游准确率上比密集到密集的剪枝高出6.3个百分点，训练壁钟速度提升1.6倍。

英文摘要

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
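
The concatenation step can be illustrated with a toy example: stacking the input projections of several experts and placing their output projections side by side yields one dense FFN whose output equals the sum of the individual expert outputs. The sketch assumes plain ReLU experts without routing weights, which is a simplification (experts in current MoE models typically use gated activations), and it does not reproduce the paper's scoring, grouping, or magnitude scaling. All function names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def ffn(w_in, w_out, x):
    """Two-layer feed-forward block: w_out @ relu(w_in @ x)."""
    hidden = [max(0.0, h) for h in matvec(w_in, x)]
    return matvec(w_out, hidden)


def concat_experts(experts):
    """Merge experts [(w_in, w_out), ...] into a single dense FFN.

    w_in has shape (hidden_i x d) and w_out has shape (d x hidden_i).
    Input projections are stacked row-wise; output projections are
    concatenated column-wise, so the dense hidden size is sum(hidden_i).
    """
    dense_in = [row for w_in, _ in experts for row in w_in]
    d_model = len(experts[0][1])
    dense_out = [[v for _, w_out in experts for v in w_out[r]]
                 for r in range(d_model)]
    return dense_in, dense_out
```

Because the activation is element-wise, each expert's hidden units stay independent inside the merged block, which is why the dense output is exactly the unweighted sum of the selected experts before any distillation.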

2606.04029 2026-06-09 cs.LG cs.AI 版本更新

Position: Deployed Reinforcement Learning should be Continual

立场：部署的强化学习应该是持续的

Parnian Behdin, Kevin Roice, Golnaz Mesbahi

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结本文主张部署的强化学习系统应持续学习，分析了部署后非平稳性的四个来源，并展示了持续RL的优势和实现方法。

Comments Accepted to the ICML 2026 Position Paper Track. See https://icml.cc/virtual/2026/poster/67195

2606.05441 2026-06-09 cs.LG cs.AI stat.ML 版本更新

GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data

GOTabPFN: 从特征排序到高维表格基础模型的紧凑分词化

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Kumar Gyawali, Gianfranco Doretto, Donald A. Adjeroh

发表机构 * University of Cambridge（剑桥大学）

AI总结针对高维小样本表格预测问题，提出GOTabPFN模型，通过图引导排序和神经启发子单元压缩实现紧凑表示，提升TabPFN在严格token预算下的稳定性和准确性。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Code and resources GitHub https://github.com/zadid6pretam/GOTabPFN PyPI https://pypi.org/project/gotabpfn Project webpage https://www.zadidhabib.com/gotabpfn.html Hugging Face ZeroGPU https://huggingface.co/spaces/zadid6pretam/GOTabPFN CPU backup https://huggingface.co/spaces/zadid6pretam/GOTabPFN_CPU

2412.11439 2026-06-09 cs.LG cs.AI physics.chem-ph 版本更新

Sampling Out-of-Distribution Chemical Spaces via Bayesian Flow

通过贝叶斯流采样分布外化学空间

Nianze Tao, Minori Abe

发表机构 * Hiroshima University（广岛大学）； Tokyo University of Agriculture（东京农业大学）

AI总结本文提出利用贝叶斯流网络生成高质量分布外分子，加入强化学习策略并采用类似常微分方程求解器的可控生成过程以加速采样，同时引入半自回归策略提升模型性能。

Comments 35 pages, 14 figures, 9 tables

AI中文摘要

生成性质优于训练空间的新型分子，即分布外生成，对从头药物设计至关重要。然而，基于分布学习的模型（如扩散模型）难以解决这一挑战，因为这些方法旨在尽可能贴近训练数据的分布。在本文中，我们证明贝叶斯流网络，特别是ChemBFN模型，能够内在地生成满足多种场景的高质量分布外样本。我们向ChemBFN添加了强化学习策略，并采用类似常微分方程求解器的可控生成过程来加速采样。最重要的是，我们在训练和推理过程中引入了半自回归策略，提升了模型性能并超越了最先进的模型。此外，本文还包含对ChemBFN中采用半自回归方法进行分布外生成的理论分析。

英文摘要

Generating novel molecules with better properties than those in the training space, known as out-of-distribution generation, is important for de novo drug design. However, this is difficult for distribution-learning-based models such as diffusion models, because these methods are designed to fit the distribution of the training data as closely as possible. In this paper, we show that Bayesian flow networks, in particular the ChemBFN model, are capable of intrinsically generating high-quality out-of-distribution samples that meet several scenarios. A reinforcement learning strategy is added to ChemBFN, and a controllable ordinary-differential-equation-solver-like generating process is employed to accelerate sampling. Most importantly, we introduce a semi-autoregressive strategy during training and inference that enhances model performance and surpasses state-of-the-art models. A theoretical analysis of out-of-distribution generation in ChemBFN with the semi-autoregressive approach is included as well.

2606.08122 2026-06-09 cs.AI 新提交

Think Before You Act: Intention-Guided Reasoning for LLM-Based Location Prediction

三思而后行：基于意图引导推理的LLM位置预测

Qingxiang Liu, Anqi Liang, Zhuoyang Jiang, Yutian Jiang, Sisuo Lyu, Yu Ji, Haomin Wen, Yuxuan Liang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Shanghai Jiao Tong University（上海交通大学）； The Hong Kong University of Science and Technology（香港科技大学）； Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）

AI总结提出IntentPOI框架，通过两阶段意图引导推理（先推断用户出行意图，再基于意图选择POI），将位置预测从直接轨迹匹配转化为意图推理，在三个真实数据集上超越11个基线。

AI中文摘要

根据用户的历史签到记录预测其下一个兴趣点（POI）是基于位置服务中的一项基本任务。尽管最近结合大语言模型的方法展现了强大的推理能力和有前景的结果，但它们通常将预测任务建模为一步式的轨迹到位置映射问题，使得预测容易受到浅层轨迹相关性和历史频率偏差的影响。我们认为用户很少直接选择位置，相反，他们通常首先形成出行意图，然后据此选择特定的POI。受此洞察启发，我们提出了IntentPOI，一个两阶段的意图引导推理框架。在思考阶段，我们通过结合历史移动模式、相似同伴行为和时间上下文来推断用户的中间意图。在执行阶段，我们首先构建一个紧凑的候选池，然后执行意图引导推理，以识别与推断意图最一致的位置。通过明确地将意图推断与位置预测解耦，IntentPOI将下一个POI预测从直接的轨迹匹配转变为意图引导推理。在三个真实世界数据集上的大量实验表明，IntentPOI始终优于十一个最先进的基线方法。

英文摘要

Predicting a user's next Point-of-Interest (POI) based on their historical check-in records is a fundamental task in location-based services. While recent methods incorporating large language models have shown strong reasoning capabilities and promising results, they typically formulate the prediction task as a one-step trajectory-to-location mapping problem, making predictions prone to shallow trajectory correlations and historical frequency bias. We argue that users rarely choose locations directly and instead, they usually first form a traveling intention and then accordingly select specific POIs. Motivated by this insight, we propose IntentPOI, a two-stage intention-guided reasoning framework. In the thinking stage, we infer users' intermediate intentions by incorporating historical mobility patterns, similar peer behaviors, and the temporal contexts. In the acting stage, we first construct a compact candidate pool, and then perform intention-guided reasoning to identify locations that best align with the inferred intention. By explicitly decoupling intention inference from location prediction, IntentPOI transforms the next POI prediction from direct trajectory matching into intention-guided reasoning. Extensive experiments on three real-world datasets demonstrate that IntentPOI consistently outperforms eleven state-of-the-art baselines.

2606.08841 2026-06-09 cs.AI cs.CV 新提交

ZIPP: Zero-shot Image Personalization from Personas

ZIPP：基于人物画像的零样本图像个性化生成

Harini SI, Somesh Singh, Yaman Kumar Singla, David Doermann, Rajiv Ratn Shah

发表机构 * Adobe Media and Data Science Research (MDSR)（Adobe媒体与数据科学研究（MDSR））； IIIT-Delhi（德里英德拉普拉斯塔信息技术学院）； SUNY at Buffalo（纽约州立大学布法罗分校）

AI总结提出ZIPP方法，利用自然语言人物画像通过LLM改写提示词实现零样本图像个性化生成，无需用户数据或微调；引入ZIPBench基准，在多个评测中取得13-20%的提升。

AI中文摘要

文本到图像扩散模型越来越多地部署在开放式创意环境中，但其输出仍然缺乏个性，优化的是整体审美而非个人品味。人类偏好是多元化的：一位喜欢柔和、怀旧肖像的用户可能偏爱充满活力的街头摄影，而另一位则倾向于梦幻的电影美学。现有方法需要密集的交互历史或逐用户微调，在冷启动场景中失败，并将上下文相关的偏好压缩为静态表示。我们提出了基于人物画像的零样本图像个性化生成（ZIPP），该方法以自然语言人物画像（用户身份和审美偏好的简洁描述符）为条件生成图像，无需任何用户特定数据或权重更新。ZIPP使用LLM从给定人物画像的角度重写提示词，引导扩散模型输出个性化结果。为了大规模挖掘人物画像，我们在一个包含2200万用户的Reddit交互图上训练了一个归纳式图注意力网络，采用双对比目标将图结构与视觉行为对齐，然后通过多模态大语言模型将学习到的表示转化为自然语言人物画像。我们引入了ZIPBench，这是首个零样本个性化基准，包含1500名用户、图挖掘的人物画像和4万张生成图像。在四个基准和涵盖五个模型家族的14个LLM上，人物画像条件化带来一致的性能提升（13-20%），前沿模型受益最大。在少样本设置中，ZIPP匹配或超过了基于每用户100多个示例微调的基线。ZIPP实现了最低的偏好分布散度（CMMD 0.16 vs 0.55），且经IPF归一化的人口统计评估表明，它显著减少了现有方法中存在的子群体偏差。人工评估证实，与通用生成相比胜率为79%，与所有微调基线相比胜率为58-65%。

英文摘要

Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized for aggregate aesthetics rather than individual taste. Human preferences are pluralistic: one user favoring muted, nostalgic portraits may prefer vibrant street photography, while another gravitates toward dreamy film aesthetics. Existing methods require dense interaction histories or per-user fine-tuning, failing in cold-start settings and collapsing context-dependent preferences into a static representation. We introduce zero-shot image personalization from personas (ZIPP), which conditions image generation on natural-language personas (concise descriptors of a user's identity and aesthetic sensibilities) without any user-specific data or weight updates. ZIPP uses an LLM to rewrite prompts from the perspective of a given persona, steering diffusion models toward personalized outputs. To mine personas at scale, we train an inductive Graph Attention Network over a 22M-user Reddit interaction graph with dual contrastive objectives aligning graph structure with visual behavior, then verbalize learned representations into natural-language personas via an MLLM. We introduce ZIPBench, the first zero-shot personalization benchmark with 1.5K users, graph-mined personas, and 40K generated images. Across four benchmarks and 14 LLMs spanning five model families, persona conditioning yields consistent gains (13-20%), with frontier models benefiting most. In the few-shot setting, ZIPP matches or exceeds fine-tuned baselines trained on 100+ examples per user. ZIPP achieves the lowest preference distributional divergence (CMMD 0.16 vs. 0.55), and IPF-normalized demographic evaluation shows it substantially reduces subpopulation bias present in existing methods. Human evaluation confirms a 79% win rate over generic generation and 58-65% over all fine-tuned baselines.

2606.09131 2026-06-09 cs.AI cs.CL cs.CV cs.LG 新提交

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

晚期融合足矣：面向视觉饱和的多模态大语言模型的双路径视觉令牌路由

Siyuan Liu, Jinyang Wu

发表机构 * School of Mechanics and Engineering Science, Peking University（北京大学力学与工程科学学院）； Department of Automation, Tsinghua University（清华大学自动化系）

AI总结针对多模态大语言模型中视觉令牌在深层饱和的问题，提出双路径视觉令牌路由（DPVR-LF），在饱和点将视觉令牌路由至单层可训练分支，仅最后层融合，以约3%可训练参数保持性能并减少计算。

Comments 18 pages, 4 figures. Submitted to Pattern Recognition

AI中文摘要

多模态大语言模型（MLLMs）通常继承为单模态文本建模设计的深层对称Transformer骨干，并对图像和语言令牌应用相同的统一计算。这种设计忽略了一个关键的模态不对称性：图像和文本令牌在信息密度、冗余度和所需推理深度上存在显著差异。通过对LLaVA-1.5的逐层分析，我们观察到视觉令牌倾向于在中间层饱和。具体而言，文本到图像的注意力从第0层的0.68下降到第4层的0.07，并在第18层后稳定在0.04附近，而文本令牌则继续受益于深层语义处理。这些发现表明架构对称性与深度异步模态演化之间存在不匹配，导致冗余的视觉计算以及在深层任务特定适应期间感知表示的潜在漂移。受此启发，我们提出了双路径视觉令牌路由（DPVR），一种用于高效MLLMs的模态不对称路由框架。其核心实例DPVR-LF（晚期融合）在饱和点将视觉令牌路由到一个单层可训练侧分支，运行一个跳过深层堆栈中图像位置的十三层纯文本前向传播，并仅在最后一层重新融合视觉和文本流。使用约3%的可训练参数，DPVR-LF在标准基准上保持了有竞争力的多模态性能，同时减少了深层Transformer堆栈中的视觉计算。该结果挑战了视觉令牌必须遍历所有深层语言模型层的传统假设，并表明单个晚期融合层足以在LLaVA风格的MLLMs中维持强大的感知能力。

英文摘要

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.

2606.09441 2026-06-09 cs.AI cs.AR 新提交

SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance

SIFT: 利用注意力不变性实现RAG预填充快速计算的选择性索引

Rya Sanovar, Srikant Bharadwaj, Hritvik Taneja, Moinuddin Qureshi

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Microsoft（微软）

AI总结针对RAG查询中文档重复导致预填充计算冗余和TTFT增加的问题，提出SIFT方法，通过离线提取文档高注意力分数位置并利用注意力不变性，在预填充时仅计算标记位置，将TTFT提升1.71倍且精度损失在1%以内。

AI中文摘要

检索增强生成（RAG）向LLM查询注入相关文档以提高响应质量。这种注入增加了提示长度并减慢了首个令牌生成时间（TTFT）。与标准查询不同，RAG查询具有上下文复用的独特属性，即相同文档在用户查询中重复出现。因此，为每个RAG查询完全重新计算文档会导致冗余计算并增加TTFT。先前的工作离线预计算RAG文档的KV张量，并在在线预填充期间粗略地重新计算一些令牌。然而，由于高延迟的磁盘传输，这种KV复用在现代GPU上通常比完全重新计算更慢。此外，这种粗粒度的重新计算会降低准确性。为了解决这些限制，本文提出了SIFT：利用注意力不变性实现RAG预填充快速计算的选择性索引。SIFT离线处理文档，并提取每个文档中高注意力分数的细粒度位置。接下来，我们识别出以下注意力不变性见解，使我们能够在运行时利用提取的位置：（1）局部注意力不变性：文档内高注意力分数的位置不受周围文档的影响。这有助于我们预测文档自注意力中高分数出现的位置。（2）交叉注意力一致性：具有高文档内注意力的键也会吸引后续文档的交叉注意力。这有助于我们预测文档对未来文档注意力中高分数出现的位置。关键的是，SIFT不存储任何KV数据，仅以两个紧凑的位向量的形式存储高分数位置。SIFT的存储最多比KV张量小24000倍，避免了昂贵的磁盘传输。在预填充期间，SIFT仅计算标记位置的注意力，将TTFT提升1.71倍，同时将精度保持在完全重新计算的1%以内。

英文摘要

Retrieval-Augmented Generation (RAG) injects LLM queries with relevant documents to improve response quality. This injection increases prompt length and slows time to first token (TTFT). Unlike standard queries, RAG queries have a unique property of context reuse where the same documents recur across user queries. Thus, fully recomputing documents for every RAG query does redundant compute and increases TTFT. Prior works precompute KV tensors of RAG documents offline and coarsely recompute some tokens during online prefill. However, such KV reuse is often slower than full recomputation on modern GPUs due to high-latency disk transfers. Further, such a coarse-grained recomputation degrades accuracy. To address these limitations, this paper proposes SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance. SIFT processes documents offline and extracts fine-grained locations of high attention scores for each document. Next, we identify the following attention invariance insights that enable us to exploit the extracted locations during runtime: (1) Local-Attention Invariance: The locations of high attention scores within a document remain invariant to surrounding documents. This helps us predict the location of high scores where the document attends to itself. (2) Cross-Attention Consistency: Keys with high intra-document attention also attract cross-attention from subsequent documents. This helps us predict the location of high scores where the document attends to future documents. Critically, SIFT stores no KV data and only stores locations of high scores in the form of two compact bit vectors. SIFT's storage is up to 24,000x smaller than KV tensors, obviating costly disk transfers. During prefill, SIFT computes the attention only for the marked locations and improves TTFT by 1.71x while holding accuracy within 1% of full recompute.
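
The idea of storing only the locations of high attention scores as a compact bit vector can be sketched as follows. This is an illustration under stated assumptions, not SIFT itself: the abstract does not say how positions are selected, so a simple top-k rule is assumed, a Python integer stands in for the bit vector, and both function names are hypothetical.

```python
def top_positions_bitvector(scores, k):
    """Mark the k highest-scoring positions of one document as set bits.

    Returns an int used as a bit vector: bit i is 1 if position i is kept.
    Only these locations are stored, never the KV tensors themselves.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    bits = 0
    for i in ranked[:k]:
        bits |= 1 << i
    return bits


def marked_positions(bits):
    """Decode a bit vector back into the sorted list of marked positions."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]
```

At prefill time, attention would then be computed only at the decoded positions; one bit per token is what makes the index so much smaller than cached keys and values.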

2606.09508 2026-06-09 cs.AI cs.CL 新提交

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

从刚性到动态：面向长上下文LLM的熵引导自适应推理

Zhanchao Xu, Haoyang Li, Qingfa Xiao, Fei Teng, Chen Jason Zhang, Lei Chen, Qing Li

发表机构 * Department of Computing, PolyU（香港理工大学计算学系）； DSA, HKUST(GZ)（香港科技大学（广州）数据科学与分析学域）； CSE, HKUST（香港科技大学计算机科学与工程学系）

AI总结提出EntropyInfer框架，利用注意力熵在预填充阶段自适应分配计算资源，并在解码阶段通过生成令牌压缩KV缓存，实现长上下文LLM的高效推理。

AI中文摘要

现有的用于长上下文LLM推理的稀疏注意力和KV缓存压缩方法通常应用固定的稀疏模式或跨所有注意力头的统一预算，忽略了头和上下文之间注意力行为的显著变化。我们观察到注意力头之间存在两种不同的熵模式：刚性头，其熵在输入段中保持接近零；动态头，其熵显著波动。至关重要的是，这些类型的分布是上下文相关的，无法离线预先确定。因此，我们提出了EntropyInfer，一个无需训练框架，在预填充期间使用注意力熵在单个头和段的粒度上自适应分配计算。对于解码，我们引入了一种潜在KV缓存压缩方案，该方案利用生成的输出令牌（而非仅预填充令牌）来识别和保留最关键的缓存条目。在Llama、Qwen和openPangu模型系列上的大量实验表明，EntropyInfer在包括SnapKV、AdaKV和CritiPrefill在内的基线上持续取得优势，在超过100k令牌的情况下实现了高达2.39倍的端到端加速，同时与全注意力相比质量下降最小。代码已发布在https://github.com/SHA-4096/EntropyInfer。

英文摘要

Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on Llama, Qwen and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39x end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is released at https://github.com/SHA-4096/EntropyInfer.
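
The quantity that separates Rigid Heads from Dynamic Heads is the Shannon entropy of a head's attention distribution, sketched below. The entropy formula is standard; the threshold value and the `is_rigid` helper are assumptions made for illustration, since the abstract does not state how EntropyInfer classifies heads or allocates budgets.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    weights are the softmax attention probabilities of a single query
    over its keys; zero entries contribute nothing to the sum.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def is_rigid(weights, threshold=0.1):
    """Hypothetical rule: a head is 'rigid' here if its entropy is near zero."""
    return attention_entropy(weights) < threshold
```

A head that puts all its mass on one key has entropy 0, while a head spreading attention uniformly over n keys has entropy log n, so low-entropy heads can be served with a much smaller compute budget.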

2606.09585 2026-06-09 cs.AI 新提交

Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

光学推理：重新思考图像作为超越文本的表达性推理媒介

Yutong Bian, Dongjie Cheng, Heming Xia, Yongqi Li, Wenjie Li

发表机构 * The Hong Kong Polytechnic University（香港理工大学）

AI总结提出光学推理概念，将图像作为独立推理媒介，通过排版和图形两种变体实现，在语言和多模态任务中匹配或超越文本推理，同时减少推理令牌。

AI中文摘要

思维链（CoT）提升了大型语言模型（LLMs）的性能，并已扩展到多模态大型语言模型（MLLMs）。最近的工作进一步从基于文本的多模态推理转向交错模态推理，其中中间步骤可以同时包含文本理由和视觉证据。在这项工作中，我们提出了一个更大胆、更雄心勃勃的想法：图像能否单独作为语言和多模态任务的推理媒介？为了探索这一点，我们提出了光学推理，它将图像视为独立的推理媒介。我们通过两种变体实例化这一概念：基于排版的光学推理，优化视觉布局以实现紧凑的理由渲染；以及基于图形的光学推理，将文本和图形元素组合成结构化的视觉理由。在数学、科学和交错模态推理基准测试中，光学推理可以匹配甚至超越传统的文本推理，同时在语言任务上平均减少28.57%的推理令牌，在多模态任务上减少16%，实现文本推理1.96倍的令牌效率。这些结果表明，图像可以有效且高效地编码理由，同时为推理提供统一的视觉画布。

英文摘要

Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (MLLMs). More recent work further moves from text-based multimodal reasoning toward interleaved-modal reasoning, where intermediate steps can incorporate both textual rationales and visual evidence. In this work, we propose a bolder and more ambitious idea: could images alone serve as the reasoning medium for both language and multimodal tasks? To explore this, we propose optical reasoning, which treats images as a standalone reasoning medium. We instantiate this concept with two variants: typographic-based optical reasoning, which optimizes visual layouts for compact rationale rendering, and graphical-based optical reasoning, which composes text and graphical elements into structured visual rationales. Across mathematical, scientific, and interleaved-modal reasoning benchmarks, optical reasoning can match or even exceed traditional text reasoning while reducing reasoning tokens by an average of 28.57% on language tasks and 16% on multimodal tasks, achieving 1.96 times the token efficiency of text reasoning. These results show that images can effectively and efficiently encode rationales while providing a unified visual canvas for reasoning.

2606.07519 2026-06-09 cs.CL cs.AI 交叉投稿

Bidirectional Small-Granularity Search between Code and Text

代码与文本之间的双向小粒度搜索

Marco A. Valenzuela-Escárcega, Enrique Noriega-Atala, Gus Hahn-Powell, Clayton T. Morrison, Mihai Surdeanu

发表机构 * Lex Machina ； The University of Arizona（亚利桑那大学）

AI总结提出双向小粒度搜索任务，通过自动生成数据训练模型，实现科学出版物文本与代码片段间的直接链接，支持跨模态检索。

AI中文摘要

我们引入了代码与文本之间双向小粒度搜索的新任务，其中查询是文本或代码的小片段，结果也是相反模态的小片段，即代码或文本。该任务在科学出版物中的文本与相应代码片段之间建立直接链接，以支持更好、更快地理解科学方法。我们为所提出的任务引入了一个大型数据集，其中包括使用GPT-4自动生成的代码文本描述的训练分区，以及三个测试分区：一个域内和两个域外（OOD），包含手动注释的数据以及其他领域的材料。我们还提出了一种模块化方法来解决此任务。我们的方法在四个不同的子任务之间共享一个编码器，这些子任务学习双向答案跨度的开始/结束。我们表明，我们的方法在域内取得了良好结果，在域外也取得了令人鼓舞的结果。这表明使用自动生成的数据解决此任务是可能的，但仍有令人兴奋的未来工作要做。

英文摘要

We introduce the novel task of bidirectional small-granularity search between code and text, where the queries are small snippets of text or code and the results are also small fragments of the opposite modality, i.e., code or text. This task establishes direct links between text in scientific publications and corresponding code segments, in support of better and faster understanding of scientific methods. We introduce a large dataset for the proposed task that includes a training partition with textual descriptions of code generated automatically using GPT-4, and three testing partitions, one in-domain and two out-of-domain (OOD) that contain manually-annotated data as well as material from other domains. We also propose a modular approach to address this task. Our approach shares an encoder across four different subtasks that learn start/end of answer spans in both directions. We show that our method achieves good results in-domain, and encouraging results OOD. This suggests that addressing this task with automatically-generated data is possible, but there is exciting future work to be done.

2606.07523 2026-06-09 cs.CL cs.AI 交叉投稿

MOSS-Video-Preview: 通过交叉注意力实现实时视频理解

Pengyu Wang, Chenkun Tan, Shaojun Zhou, Wei Huang, Qirui Zhou, Zhan Huang, Zhen Ye, Jijun Cheng, Xiaomeng Qian, Yanxin Chen, Xingyang He, Huazheng Zeng, Chenghao Wang, Pengfei Wang, Hongkai Wang, Shanqing Gao, Yixian Tian, Chenghao Liu, Xinghao Wang, Botian Jiang, Xipeng Qiu

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）

AI总结提出双通道交叉注意力架构MOSS-Video-Preview，通过非阻塞感知与生成实现实时视频理解，在单H200上实现5倍首词加速和2.7倍解码吞吐提升。

AI中文摘要

视频理解正从离线范式——将完整录制的视频作为输入并在结束后产生单一答案——转向实时交互，其中模型在回复的同时感知新帧，随着新证据的出现修正答案，并在无话可说时保持沉默。我们提出MOSS-Video-Preview来验证这一范式。我们的核心主张是感知不能被生成阻塞；其自然实现是双通道架构。我们认为，交叉注意力主干比流行的仅解码器设计更适合实时视觉-语言融合：视觉特征通过侧通道进入，而不是加入自回归序列，因此感知和生成在独立的、非阻塞的路径上运行——降低了视觉处理的频率，并为独立压缩提供了清晰的通道级接口。我们辅以数据合成流水线，将密集字幕转换为实时理解问答，其答案被修正以匹配模型迄今为止感知到的内容，并在此数据上专门训练离线模型以引发实时行为。我们的模型总体上落后于强大的Qwen2.5-VL-7B基线——这一差距我们主要归因于数据和规模而非架构——但在离线视频和多模态理解上具有竞争力，在实时应用核心的空间和细粒度时间推理上保持稳健，并获得了离线模型缺乏的行为：持续感知、答案修正和及时沉默。在单个H200上，每视频256帧，它实现了约5倍的首词时间加速和2.7倍的解码吞吐提升，离线能力几乎没有下降。我们对范式、架构和数据的研究勾勒出通往实时视频理解的可行路径。

英文摘要

Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.

2606.07924 2026-06-09 cs.CV cs.AI cs.CL cs.LG cs.MM 交叉投稿

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

解耦语义与逻辑：一种无需训练的从粗到精的视频检索增强生成流水线

Jiaxin Dai, Zehang Wei, Jiamin Yan, Xiang Xiang

发表机构 * School of Computer Science & Tech, Huazhong University of Science and Technology（华中科技大学计算机科学与技术学院）； School of AI and Automation, Huazhong University of Science and Technology（华中科技大学人工智能与自动化学院）

AI总结提出一种无需训练的两阶段级联视频RAG流水线，通过解耦语义检索与逻辑推理，实现跨语言长视频理解、严格角色遵循和零幻觉时间定位。

Comments To be presented at ACL 2026 MAGMAR Workshop (Oral; Retrieval leaderboard No.1)

AI中文摘要

本文介绍了我们为第二届多模态增强生成研讨会（MAGMaR）提交的系统描述。针对跨语言长视频理解、严格角色遵循和零幻觉时间定位等关键挑战，我们提出了一种完全无需训练的两阶段级联视频RAG流水线。我们的架构通过模态感知的任务分工，策略性地将语义检索与认知逻辑推理解耦。在第一阶段，一个高召回率的语义预取模块仅使用高保真视觉摘要和全局文本描述进行密集检索，明确隔离噪声模态（如OCR和ASR）以保持纯净的向量空间。在第二阶段，一个由商业大语言模型（LLM）驱动的自适应、迭代和推理（A.I.R.）过滤代理执行细粒度认知重排序。该代理重新整合完整的多模态上下文，以强制执行与用户角色的严格逻辑对齐，有效剪除语义相似但逻辑无关的候选。最后，提示雕刻机制约束生成器将蒸馏后的子集合成为严格格式化的JSON响应，并带有精确的块级引用。在RAG轨道上的评估表明，我们的资源感知方法在信息检索和角色条件生成方面均表现出卓越的精度。

英文摘要

This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a modality-aware division of labor. In the first stage, a high-recall semantic pre-fetching module employs dense retrieval using only high-fidelity visual summaries and global text descriptions, explicitly isolating noisy modalities (e.g., OCR and ASR) to maintain a pristine vector space. In the second stage, an Adaptive, Iterative, and Reasoning-based (A.I.R.) filtering agent, powered by a commercial Large Language Model (LLM), performs fine-grained cognitive reranking. The agent re-incorporates full multimodal contexts to enforce strict logical alignment with user personas, effectively pruning semantically similar but logically irrelevant candidates. Finally, a Prompt Sculpting mechanism constrains the generator to synthesize the distilled subset into strictly formatted JSON responses with exact chunk-level citations. Evaluated on the RAG track, our resource-aware approach shows exceptional precision in both information retrieval and persona-conditioned generation.

2606.07951 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

From `May' to `Is': Certainty Distortion in Language Model Rewriting

从“可能”到“是”：语言模型改写中的确定性扭曲

Catarina G Belem, Shang Wu, Hongyu Yao, Mark Steyvers, Sameer Singh, Padhraic Smyth

发表机构 * University of California Irvine（加利福尼亚大学尔湾分校）； Massachusetts Institute of Technology（麻省理工学院）

AI总结研究语言模型在改写任务中系统性增加表达确定性的偏差，提出基于人群判断的评估指标，发现高达75%的输出存在确定性扭曲，且模型更倾向于提高确定性。

AI中文摘要

人类越来越多地以塑造信念和驱动决策的方式使用语言模型（LM），包括讨论、改写和总结来自科学文章、新闻和医学报告的信息。然而，在这些领域中，主张表达的信心程度至关重要，但关于LM是否忠实地保留它却知之甚少。在这项工作中，我们研究了LM中的确定性扭曲，定义为当语义内容被保留时，表达确定性的有意义变化。我们提出了一种基于LM的评估指标，该指标与人群层面的确定性判断一致。使用该指标，我们在科学和医学交流任务的背景下，表征了不同规模和系列的模型中的确定性扭曲。我们的结果表明，确定性扭曲影响了高达75%的LM输出，并且在改写任务中系统性地不对称，大多数LM将表达确定性增加的可能性是降低的1.5-2倍。这些效应可以通过重复释义累积：在医学领域，claude-haiku-4-5在一次迭代后增加了20%示例的确定性，五次迭代后增加到40%。基于提示的干预减少了整体确定性扭曲，但并未消除它。总之，这些发现揭示了普遍存在的夸大表达确定性的偏差，对在高风险领域依赖LM的用户有直接影响。

英文摘要

Humans increasingly turn to Language Models (LMs) in ways that shape beliefs and drive decisions, including discussing, rewriting, and summarizing information from scientific articles, news, and medical reports. However, in these domains, where how confidently a claim is expressed matters, little is known about whether LMs faithfully preserve that expressed confidence. In this work, we investigate certainty distortion in LMs, defined as meaningful changes in expressed certainty when semantic content is preserved. We propose an LM-based evaluation metric that is consistent with population-level judgments of certainty. Using this metric, we characterize certainty distortion across different sizes and families of models in the context of scientific and medical communication tasks. Our results show that certainty distortion affects up to 75% of LM outputs and is systematically asymmetric in rewriting tasks, with most LMs being 1.5-2× more likely to increase the expressed certainty than to decrease it. These effects can compound over repeated paraphrasing: in the medical domain, claude-haiku-4-5 increases the certainty of 20% of examples after a single iteration, rising to 40% after five iterations. Prompt-based interventions reduce overall certainty distortion but do not eliminate it. Together, these findings reveal a general bias toward inflating expressed certainty, with direct implications for users who rely on LMs in high-stakes domains.

2606.08016 2026-06-09 cs.CV cs.AI cs.CL 交叉投稿

IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment

IEA：通过三阶段多任务对齐的业余友好型对话式图像编辑代理

Zichen Zhu, Yuheng Sun, Mingxuan Zhu, Wenjie Ma, Situo Zhang, Zhexiang Wang, Ziyue Yang, Danyang Zhang, Kunyao Lan, Zihan Zhao, Dingye Liu, Siqi Xiang, Lu Chen, Kai Yu

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Shanghai Innovation Institution（上海创新研究院）； Huawei Technologies Ltd.（华为技术有限公司）； Nanyang Technological University（南洋理工大学）； Jiangsu Key Lab of Language Computing（江苏省语言计算重点实验室）

AI总结提出IEA对话式图像编辑代理，通过三阶段多任务训练学习操作参数化工具，实现可解释编辑轨迹，在像素距离和ROUGE-L指标上优于基线，用户研究中指令跟随和感知质量表现最佳。

Comments [CVPR 2026 Findings] Our data and code are released at https://github.com/OpenDFM/Image_Edit_Agent

AI中文摘要

当前的图像编辑软件通常依赖于固定滤镜或专家调参，导致业余用户的意图与结果之间存在差距。生成模型创建的图像可能包含伪影、不合理的细节或偏离真实感的风格漂移，并且对编辑原因缺乏解释。我们提出IEA，一个对话式图像编辑代理，它学习在显式、可解释的动作空间中操作参数化工具。IEA通过三阶段多任务流水线进行训练：(1) 在蒸馏专家编辑上进行SFT，(2) 使用GRPO进行奖励优化，奖励包括相似度改进、工具有用性和意图总结，(3) 大规模合成微调以联合掌握图像编辑、细化和用户意图总结。通过逐步操作16个编辑工具，IEA产生透明的编辑轨迹，可以检查和调试。在定量实验中，它在编辑任务上获得更低的像素距离，在总结任务上获得比强基线更高的ROUGE-L。在用户研究中，它在指令跟随方面在工具调用方法中排名最佳，同时在整体感知质量上超越生成方法。我们的结果验证了可解释的、以工具为中心的VLM作为人类指令引导图像润色的可靠路径。

英文摘要

Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur users' intent and outcomes. Creations by generative models may contain artifacts, implausible details, or stylistic drift away from photorealism and offer little insight into why an edit was made. We propose IEA, a conversational Image Editing Agent that learns to operate parameterized tools in an explicit, interpretable action space. IEA is trained via a three-stage multitask pipeline: (1) SFT on distilled expert edits, (2) GRPO with rewards for likeness improvement, tool usefulness, and intent summarization, and (3) large-scale synthetic fine-tuning to jointly master image editing, refinement, and user intent summarization. By manipulating 16 editing tools step by step, IEA produces transparent edit traces that can be inspected and debugged. In quantitative experiments, it attains a lower pixel distance on the edit task and a higher ROUGE-L on the summary task than strong baselines. In user studies, it ranks best among tool-calling methods for instruction following while surpassing generative methods in overall perceptual quality. Our results validate interpretable, tool-centric VLMs as a reliable path to human instruction-guided image retouching.

2606.08056 2026-06-09 cs.CL cs.AI 交叉投稿

What's the Point? Spatial Grammar & Index Resolution for Sign Language Processing

要点何在？手语处理中的空间语法与索引解析

Oline Ranum, Simon Hadfield, Richard Bowden

发表机构 * Centre for Vision, Speech and Signal Processing, University of Surrey（萨里大学视觉、语音与信号处理中心）

AI总结针对手语中占10-15%但被忽视的空间索引现象，提出索引检测与话语实体链接的分解框架，建立索引感知手语建模基线，并作为辅助专家提升冻结手语识别模型性能。

AI中文摘要

手语模型主要使用词汇序列或文本监督进行训练，因此对非词汇和构式性结构的建模不足。一个相对易处理的情况是空间索引：将话语实体分配给空间位置以供后续共指的指向手势，而以词汇为中心的目标在很大程度上未能捕捉到这一点。我们对手语识别中的索引进行了有针对性的评估，显示尽管索引占手语内容的10-15%，但其恢复效果很差。我们引入了一个用于训练和评估索引专家的框架，为索引感知手语建模建立了基线。我们的方法将空间指代解析分解为索引检测和话语实体链接。由此产生的提及表示支持自动标注和非词汇结构建模，并在推理时作为辅助索引专家增强冻结的SLR模型。

英文摘要

Sign language models are predominantly trained with gloss-sequence or text supervision, thereby under-modeling non-lexical and productive constructions. One comparatively tractable instance is spatial indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We present a targeted evaluation of indexing in Sign Language Recognition, showing that despite comprising 10-15% of signing content, indexing is poorly recovered. We introduce a framework for training and evaluating indexing experts, establishing a baseline for index-aware sign language modeling. Our approach decomposes spatial reference resolution into index detection and discourse entity linking. The resulting mention representations enable automatic annotation and non-lexical structure modeling, and serve as an auxiliary indexing expert that augments a frozen SLR model at inference time.

2606.08063 2026-06-09 cs.CV cs.AI cs.CL 交叉投稿

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Robust-U1: MLLMs能否自我恢复受损视觉内容以实现鲁棒理解？

Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出Robust-U1框架，通过监督微调、强化学习和多模态推理，使多模态大模型具备显式视觉自恢复能力，在真实和对抗性损坏下达到最先进鲁棒性。

Comments Accepted by ICML 2026

AI中文摘要

多模态大语言模型（MLLMs）在视觉理解方面取得了显著成功，但在真实世界的视觉损坏下其性能会大幅下降。尽管存在现有的鲁棒性增强方法，但它们存在局限性：黑盒特征对齐缺乏可解释性，而白盒基于文本的推理无法恢复丢失的像素级细节。本文研究一个基本研究问题：MLLMs能否自行恢复受损的视觉内容？为此，我们提出Robust-U1，一种新颖框架，赋予MLLMs显式的视觉自恢复能力以实现鲁棒理解。该方法包含三个核心阶段：用于初始重建的监督微调、具有双重奖励（像素级SSIM和语义级CLIP相似度）的强化学习以对齐高视觉质量，以及联合考虑受损输入和恢复图像的多模态推理。大量实验表明，Robust-U1在真实世界损坏基准上达到了最先进的鲁棒性，并在一般VQA基准上的对抗性损坏下保持了优越性能。分析证实，高质量的视觉恢复直接提升了推理性能，将自恢复确立为鲁棒视觉理解的关键机制。源代码可在https://github.com/jqtangust/Robust-U1获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.

2606.08076 2026-06-09 cs.CL cs.AI cs.CY 交叉投稿

"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory

“我理解你的观点”：通过交往行动理论视角看LLM的说服与谄媚

Esra Dönmez, Agnieszka Falenska

发表机构 * Institute for Natural Language Processing, University of Stuttgart（斯图加特大学自然语言处理研究所）； Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart（斯图加特大学智能系统反思交流论坛）

AI总结本研究基于哈贝马斯的交往行动理论，通过模拟Reddit讨论，发现LLM能有效传达言外之意（如建立信任），其谄媚策略与观点改变强相关，且人类更偏好LLM生成的论证。

DOI: 10.18653/v1/2025.findings-acl.793
Journal ref: Findings of the Association for Computational Linguistics: ACL 2025

AI中文摘要

大型语言模型（LLM）能够生成高质量的论证，但它们在参与细致入微且有说服力的交往行动方面的能力仍基本未被探索。本研究通过尤尔根·哈贝马斯的交往行动理论框架探索LLM的说服潜力。它考察LLM是否以与人类交流可比的方式表达言外之意（即语言的语用功能，如传达知识、建立信任或表明相似性）。我们使用来自说服性子论坛ChangeMyView的对话，模拟意见持有者与LLM之间的在线讨论。然后，我们比较人类撰写和LLM生成的反驳论证中言外之意的可能性，特别是那些成功改变了原帖作者观点的论证。我们发现，所有三个LLM都能有效传达言外之意——通常比人类更甚——可能增加其拟人化程度。此外，LLM精心制作谄媚回应，与意见持有者的意图紧密对齐，这种策略与观点改变强相关。最后，众包工作者发现LLM生成的反驳论证更令人信服，并且一致偏好它们胜过人类撰写的论证。这些发现表明，LLM的说服力不仅仅在于生成高质量论证。相反，用人类偏好训练LLM有效地调整它们以模仿人类交流模式，特别是细微的交往行动，可能增加个体对其影响的易感性。

英文摘要

Large Language Models (LLMs) can generate high-quality arguments, yet their ability to engage in nuanced and persuasive communicative actions remains largely unexplored. This work explores the persuasive potential of LLMs through the framework of Jürgen Habermas' Theory of Communicative Action. It examines whether LLMs express illocutionary intent (i.e., pragmatic functions of language such as conveying knowledge, building trust, or signaling similarity) in ways that are comparable to human communication. We simulate online discussions between opinion holders and LLMs using conversations from the persuasive subreddit ChangeMyView. We then compare the likelihood of illocutionary intents in human-written and LLM-generated counter-arguments, specifically those that successfully changed the original poster's view. We find that all three LLMs effectively convey illocutionary intent -- often more so than humans -- potentially increasing their anthropomorphism. Further, LLMs craft sycophantic responses that closely align with the opinion holder's intent, a strategy strongly associated with opinion change. Finally, crowd-sourced workers find LLM-generated counter-arguments more agreeable and consistently prefer them over human-written ones. These findings suggest that LLMs' persuasive power extends beyond merely generating high-quality arguments. Rather, training LLMs with human preferences effectively tunes them to mirror human communication patterns, particularly nuanced communicative actions, potentially increasing individuals' susceptibility to their influence.

2606.08081 2026-06-09 cs.CL cs.AI 交叉投稿

Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

对齐但非伙伴特定：区分多模态LLM智能体在参考游戏中如何成功而无需类人惯例

Po-Ya Angela Wang, Chinmaya Mishra, Aslı Özyürek, Paula Rubio-Fernández, Esam Ghaleb

发表机构 * National Taiwan University（国立台湾大学）； Max Planck Institute for Psycholinguistics（马克斯·普朗克心理语言学研究所）； Radboud University（拉德堡德大学）； Institut Jean Nicod（让·尼科研究所）

AI总结通过约束伪对基线方法，区分多模态LLM智能体在参考游戏中的标签对齐是源于伙伴特定交互还是共享任务词汇，发现智能体通过冗长描述而非压缩表达实现协调。

详情

AI中文摘要

重复参考游戏测试对话者是否用基于共享交互历史的更短、伙伴特定的惯例替换其初始长描述。先前工作表明，多模态LLM在轮次中未能变得更高效，尽管它们在使用的标签上对齐。我们如何确定这种对齐反映了伙伴特定的基础而非共享任务词汇？我们通过将有能力的多模态智能体对与来自KTH Tangrams语料库的人类对进行比较来解决这个问题。我们的新颖方法论贡献是一个受约束的伪对基线，它匹配原始指称任务结构，但打破了伙伴历史。该基线使我们能够测试观察到的标签对齐是否依赖于与特定伙伴的交互。在三个分析层面（任务能力、描述策略、对齐动态）上，我们发现了明显差异。人类通过适应减少努力，压缩描述并增加与伙伴的标签对齐。智能体反而保持固定的努力水平，从第一轮开始产生冗长的描述，标签重叠接近上限，在真实对和伪对之间统计上无法区分。因此，多模态LLM在没有惯例的情况下实现了协调，通过冗长描述而非形成人类对话特征的紧凑、依赖历史的指称表达来取得成功。

英文摘要

Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.

2606.08158 2026-06-09 cs.CL cs.AI 交叉投稿

Constrained Paraphrase Consistency for LLM Hallucination Detection

约束释义一致性用于大语言模型幻觉检测

Shanshan Lin, Dongsheng Hong, Sibo Ju, Chao Chen, Xi Zhang, Xiangwen Liao

AI总结提出约束一致性幻觉检测器(CCHD)，通过约束优化利用释义一致性，无需额外数据，在多个基准上超越现有方法。

Comments Accepted to ICASSP 2026

AI中文摘要

大型语言模型（LLM）可能生成事实不一致的声明，这促使需要准确且可扩展的幻觉检测器。先前的工作主要通过合成或新标注来扩大训练集，这增加了成本和潜在偏差，同时未充分利用语义等价释义所隐含的一致性。我们提出约束一致性幻觉检测器（CCHD），将训练形式化为约束优化问题。在原始文档-声明对上的标准交叉熵基础上，补充了（i）释义一致性约束，限制不同释义视图之间的差异，以及（ii）标签保持约束，将释义与真实标签绑定。我们通过模型参数和每个视图的拉格朗日乘子的梯度下降-上升法求解该问题，仅增加少量标量对偶变量，且无推理时开销。使用DeBERTa和Flan-T5骨干网络，CCHD在标准事实性基准上持续优于强基线（FactCG、MiniCheck和AlignScore），展示了其在幻觉检测上的优越性。

英文摘要

Large language models (LLMs) can generate factually inconsistent claims, motivating accurate and scalable hallucination detectors. Prior work largely enlarges training sets via synthesis or new annotations, which raises cost and risks introducing bias while underusing the consistency implied by semantically equivalent paraphrases. We propose Consistency-Constrained Hallucination Detector (CCHD), which formulates training as a constrained optimization problem. The standard cross-entropy on original document-claim pairs is complemented by (i) paraphrase-consistency constraints bounding divergence across paraphrased views, and (ii) label-preservation constraints tying paraphrases to ground truth. We solve the problem by gradient descent-ascent over model parameters and per-view Lagrange multipliers, adding only a few scalar dual variables and no inference-time overhead. With DeBERTa and Flan-T5 backbones, CCHD consistently outperforms strong baselines (FactCG, MiniCheck, and AlignScore) on standard factuality benchmarks.
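
The constrained training described in this abstract can be written out as a generic primal-dual problem. The sketch below is only an illustration of that general shape: the symbols (K paraphrased views, a per-view divergence D_k, a per-view label loss ℓ_k, the tolerances ε and δ, and the step sizes η) are assumed notation, not taken from the paper.

```latex
% Assumed notation: \theta = model parameters, K = number of paraphrased views,
% D_k = divergence between predictions on the original pair and on view k,
% \ell_k = label loss on view k, \epsilon and \delta = constraint tolerances.
\begin{aligned}
\min_{\theta}\quad & \mathcal{L}_{\mathrm{CE}}(\theta) \\
\text{s.t.}\quad & D_k(\theta) \le \epsilon, \qquad k = 1,\dots,K
  && \text{(paraphrase consistency)} \\
 & \ell_k(\theta) \le \delta, \qquad k = 1,\dots,K
  && \text{(label preservation)}
\end{aligned}

% Lagrangian with one scalar multiplier per view and per constraint type.
\mathcal{L}(\theta,\lambda,\mu)
  = \mathcal{L}_{\mathrm{CE}}(\theta)
  + \sum_{k=1}^{K} \lambda_k \bigl(D_k(\theta) - \epsilon\bigr)
  + \sum_{k=1}^{K} \mu_k \bigl(\ell_k(\theta) - \delta\bigr),
  \qquad \lambda_k \ge 0,\ \mu_k \ge 0

% Gradient descent on \theta, projected gradient ascent on the multipliers.
\theta \leftarrow \theta - \eta_{\theta}\, \nabla_{\theta}\, \mathcal{L}(\theta,\lambda,\mu)
\lambda_k \leftarrow \max\bigl(0,\ \lambda_k + \eta_{\lambda}\,(D_k(\theta) - \epsilon)\bigr)
\mu_k \leftarrow \max\bigl(0,\ \mu_k + \eta_{\mu}\,(\ell_k(\theta) - \delta)\bigr)
```

Under this reading, the multipliers are the "few scalar dual variables" the abstract mentions: they exist only during training, which is consistent with the claim of no inference-time overhead.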

2606.08408 2026-06-09 cs.CL cs.AI 交叉投稿

TimpaTeks: Automatic In-place Text Sequence Modification via Diffusion Language Model Steering

TimpaTeks: 通过扩散语言模型引导实现自动原地文本序列修改

Ryandito Diandaru, Ikhlasul Akmal Hanif, Fadli Aulawi Al Ghiffari, Ahmed Elshabrawy, Alham Fikri Aji

发表机构 * MBZUAI（穆罕默德·本·扎耶德人工智能大学）

AI总结提出TimpaTeks方法，将激活引导扩展到扩散语言模型，实现原地文本修改以改变概念，在情感和概念引导任务上降低困惑度并保持句子结构。

Comments 16 pages

2606.08445 2026-06-09 cs.CL cs.AI 交叉投稿

Segment-level Tree Search for Long Meeting Document Summarization

长会议文档摘要的段级树搜索

Sangwon Ryu, Heejin Do, Jun Seo, Daehui Kim, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok

发表机构 * GSAI, POSTECH（浦项科技大学人工智能研究院）； CSE, POSTECH（浦项科技大学计算机科学与工程系）； ETH Zurich（苏黎世联邦理工学院）； ETH AI Center（苏黎世联邦理工学院人工智能中心）； Agentic AI Lab, KT（KT公司智能体人工智能实验室）； LILT（LILT公司）

AI总结提出基于蒙特卡洛树搜索的段级摘要框架S3，无需训练即可组合段级候选摘要，使用7B模型达到72B模型性能。

Comments INTERSPEECH 2026

2606.08471 2026-06-09 cs.CL cs.AI 交叉投稿

More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs

更多废话，更少意义：揭示小语言模型中的自我改进行为

Marina Igitkhanian, Erik Arakelyan

发表机构 * American University of Armenia（亚美尼亚美国大学）； NVIDIA（英伟达）

AI总结本研究通过构建充分性测试，发现小语言模型在自我纠正中仅获得4.4%的准确率提升，且较长的提示反而与错误答案正相关，表明其推理能力有限。

Comments GEM Workshop at ACL 2026

AI中文摘要

近年来，语言模型在各个领域和应用中取得了快速进展。然而，它们的自我改进能力——即是否善于识别和纠正自身推理中的缺陷——仍然存疑。在本研究中，我们通过构建一个充分性测试来严格检验小语言模型（SLMs）的自我纠正能力。我们提出了一个最小化的三步自我纠正流程：收集初始SLM答案，提示同一模型根据真实答案为错误回答生成提示，然后将相同问题与模型自身的反馈一起输入以改进初始答案。我们在算术和逻辑推理基准上评估了多种指令微调和推理SLM。我们的发现表明，注入提示句子的SLM相比初始问答准确率仅提升4.4%。即使正确答案与模型的错误推理一起提供，评估的SLM也无法理解其推理中缺失了什么，并且在导致纠正和未导致纠正的提示之间显示出最小的语义差异。此外，我们的实验表明，较长的提示与错误的最终答案正相关，表明对问题的较长思考可能阻碍推理过程，这意味着SLM的性能不一定随更大的计算预算而扩展。

英文摘要

Recently, language models have made rapid progress across various domains and applications. However, their capability for self-improvement, i.e., whether they are adept at recognising and correcting flaws in their own reasoning, remains dubious. In this study, we address this question by constructing a sufficiency test to rigorously examine the self-correction capabilities of small language models (SLMs). We propose a minimal three-step self-correction pipeline that collects initial SLM answers, prompts the same model to generate hints for its incorrect responses given the ground truth, and feeds the model the same question with its own feedback to refine the initial answer. We evaluate a variety of instruction-tuned and reasoning SLMs in this experimental setup on arithmetic and logical reasoning benchmarks. Our findings show that SLMs with injected hint sentences yield only a 4.4 percent gain over initial question-answering accuracy. Even though the correct answer was provided alongside the model's incorrect reasoning, the evaluated SLMs fail to understand what was missing in their reasoning and show minimal semantic difference between hints that lead to corrections and ones that do not. Furthermore, our experiments show that longer hints are positively correlated with incorrect final answers, suggesting that longer deliberation on problems can hinder the reasoning process, meaning that SLMs do not necessarily scale in performance with a larger compute budget.
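
The three-step pipeline described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the prompt wording, the exact-match scoring, the `self_correct` function, and the `toy_model` stand-in are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: a model is any function from a prompt to a reply.
Model = Callable[[str], str]


def self_correct(model: Model, items: List[Tuple[str, str]]) -> Dict[str, float]:
    """Run the three steps on (question, ground_truth) pairs.

    Scoring by exact string match is an assumption made for this sketch.
    """
    if not items:
        raise ValueError("need at least one (question, ground_truth) pair")
    initial_correct = 0
    final_correct = 0
    for question, truth in items:
        # Step 1: collect the model's initial answer.
        answer = model(question).strip()
        if answer == truth:
            initial_correct += 1
            final_correct += 1
            continue
        # Step 2: the same model writes a hint for its wrong answer,
        # with the ground truth visible to it.
        hint = model(
            f"Question: {question}\nYour answer: {answer}\n"
            f"Correct answer: {truth}\n"
            "Write a hint that explains what went wrong, without stating the answer."
        )
        # Step 3: ask the same question again with the model's own hint.
        revised = model(f"Question: {question}\nHint: {hint}\nAnswer again.").strip()
        if revised == truth:
            final_correct += 1
    n = len(items)
    return {"initial_accuracy": initial_correct / n, "final_accuracy": final_correct / n}


def toy_model(prompt: str) -> str:
    """Deterministic stand-in for an SLM, used only to exercise the loop."""
    if "Hint:" in prompt:
        return "4"
    if "Correct answer:" in prompt:
        return "recheck the addition"
    return "5" if prompt == "2+2" else "3"


result = self_correct(toy_model, [("2+2", "4"), ("1+2", "3")])
```

The gap between `final_accuracy` and `initial_accuracy` corresponds to the kind of gain the abstract reports; with a real model, `toy_model` would be replaced by a call to that model.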

2606.08492 2026-06-09 cs.CV cs.AI 交叉投稿

Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

眼见为实：基于视觉锚点的提示重写对齐用于文本到图像生成

Xuanyi Liu, Deyi Ji, Junyu Lu, Jing Wang, Qianxiong Xu, Xuhang Chen, Tianrun Chen, Siwei Ma

发表机构 * Peking University（北京大学）； Tencent（腾讯）； Dalian University of Technology（大连理工大学）； Nanyang Technological University（南洋理工大学）； University of Cambridge（剑桥大学）； Zhejiang University（浙江大学）

AI总结提出FaithRewriter框架，利用多模态大模型生成中间视觉线索，结合大语言模型生成视觉锚定的增强提示，再蒸馏至小模型，以缩小用户意图与生成图像之间的差距。

AI中文摘要

尽管文本到图像（T2I）模型具有令人印象深刻的能力，但由于用户提示的简洁性和模糊性，意图-生成差距往往持续存在。现有方法主要优化提示的流畅性和可读性。然而，增强过程仍然缺乏视觉基础。因此，重写器可能过度推断缺失的细节，导致意图-生成差距。为了解决这一限制，我们提出了FaithRewriter，一种用于T2I生成的新型提示增强框架。具体来说，FaithRewriter首先利用多模态MLLM从原始提示生成图像作为中间视觉线索。然后将该线索与提示结合，输入大规模LLM，生成视觉锚定的增强，更好地反映预期内容在图像中应如何呈现。最后，将这些增强蒸馏到小规模LLM中以便高效部署，增强其生成有效T2I提示的能力。实验表明，与强基线相比，FaithRewriter生成的提示更忠实于用户意图且视觉上更合理，有助于缩小意图-生成差距。

发表机构 * Faculty of Computer Science, MSA University, Egypt（MSA大学计算机科学学院，埃及）

AI总结提出BLM-SGAN模型，利用BERT的双向注意力机制捕获长程依赖，解决GAN在文本到图像生成中的梯度消失和序列处理限制，在鸟类图像生成上达到SOTA。

Comments Published in ICACIn 2024. Appears in Advances on Intelligent Computing and Data Science II, Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, 2025

DOI: 10.1007/978-3-031-91351-8_5
Journal ref: Advances on Intelligent Computing and Data Science II (ICACIn 2024), Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, Cham, 2025

AI中文摘要

尽管从文本描述生成图像取得了成功，但在自然语言处理（NLP）和计算机视觉（CV）等领域仍面临难以克服的挑战。文本到图像（T2I）模型的最新进展，特别是那些利用生成对抗网络（GAN）的模型，显著提高了跨领域合成逼真图像的能力。然而，现有的基于GAN的T2I模型仍然面临关键挑战，例如难以捕获长程依赖、梯度消失以及序列处理的局限性。为了解决这些问题，我们引入了BLM-SGAN，一种新颖的模型，它结合了用于语义-空间文本到图像生成的双向语言建模。BLM-SGAN利用BERT的注意力机制来捕获丰富的上下文信息并有效管理扩展序列。我们的模型展示了最先进的性能，Inception Score（IS）为5.45 +/- 0.08，超过了多个竞争模型，如SSA-GAN、DF-GAN、SD-GAN和AttnGAN。BLM-SGAN能够从详细的文本描述中有效生成高度逼真的鸟类图像。实现代码可在以下网址获取：https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation。

英文摘要

Despite its successes, image generation from text descriptions still faces hard-to-overcome challenges spanning natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT's attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.

2606.08938 2026-06-09 cs.CL cs.AI 交叉投稿

PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus

PACT: 通过特权合成与分支共识学习多样化诊断策略

Gen Li, Yuanze Hu, Zhichao Yang, Qingchen Yu, Jianwei Lv, Yue Guo, Yujing Liu, Faguo Wu, Hongwei Zheng, Xiandong Li, Bo Yuan, Yifan Sun, Zhaoxin Fan

发表机构 * Beihang University（北京航空航天大学）； Baidu（百度）； ByteDance（字节跳动）； Beijing Academy of Blockchain and Edge Computing（北京区块链与边缘计算研究院）； Renmin University of China（中国人民大学）

AI总结提出PACT框架，通过特权合成对话数据和多分支共识训练，使LLM同时学习多种诊断推理范式，在中文医疗诊断基准上取得最优性能。

Comments 16 pages, 5 figures, 5 tables

AI中文摘要

临床诊断需要在信息不完整的情况下灵活运用多种推理范式。现有的基于LLM的医疗智能体表现出强大的医学推理能力，但单一范式或简单混合的对话监督使得这些范式难以无干扰地学习。我们提出PACT（周期性锚点共识训练），一个将监督的多范式对话合成与基于共识的分支训练相结合的框架。在数据层面，DPS（医生-患者-监督者）利用完整的电子病历（EMR）进行质量控制，同时保持医生代理仅能访问患者可见信息。这产生了四种诊断推理范式下的经过验证的对话，而不会泄露隐藏的临床答案。在训练层面，PACT为每个范式训练一个范式特定的LoRA分支，并通过符号共识定期将分支聚合到共享锚点中。我们进一步构建了一个动态的多轮中文医疗诊断基准用于交互式会诊。实验表明，PACT在诊断结果和会诊过程指标上，与专有、医学专用和任务适应的基线相比，达到了最先进的性能。

英文摘要

Clinical diagnosis requires flexible use of multiple reasoning paradigms under incomplete patient information. Existing LLM-based medical agents show strong medical reasoning ability, but single-paradigm or naively mixed dialogue supervision makes these paradigms difficult to learn without interference. We propose PACT (Periodic Anchor Consensus Training), a framework that couples supervised multi-paradigm dialogue synthesis with consensus-based Branch training. At the data level, DPS (Doctor-Patient-Supervisor) uses complete electronic medical records (EMRs) for quality control while keeping the doctor agent restricted to patient-visible information. This produces validated dialogues under four diagnostic reasoning paradigms without leaking hidden clinical answers. At the training level, PACT trains one paradigm-specific LoRA Branch per paradigm and periodically aggregates Branches into a shared Anchor through sign consensus. We further construct a dynamic multi-turn Chinese medical diagnosis benchmark for interactive consultation. Experiments show that PACT achieves state-of-the-art performance among compared proprietary, medical-specialized, and task-adapted baselines on diagnostic outcome and consultation-process metrics.
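
The abstract does not spell out the "sign consensus" rule used to aggregate Branches into the Anchor. As a rough illustration of what a sign-based merge can look like, the sketch below elects a sign per coordinate from the summed branch updates and averages only the values that agree with it. The function name `sign_consensus`, the flat-list representation of updates, and the election rule are all assumptions made for this example, not the paper's procedure.

```python
from typing import List


def sign_consensus(deltas: List[List[float]]) -> List[float]:
    """Merge per-branch update vectors coordinate by coordinate.

    For each coordinate, the elected sign is the sign of the summed updates.
    Only values agreeing with that sign are averaged; a coordinate whose
    updates cancel exactly is set to 0.0.
    """
    if not deltas:
        raise ValueError("need at least one branch update")
    dim = len(deltas[0])
    if any(len(d) != dim for d in deltas):
        raise ValueError("all branch updates must have the same length")
    merged: List[float] = []
    for i in range(dim):
        column = [d[i] for d in deltas]
        total = sum(column)
        if total == 0:
            merged.append(0.0)  # no consensus: leave the anchor unchanged here
            continue
        sign = 1.0 if total > 0 else -1.0
        agreeing = [v for v in column if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing))
    return merged


# Two hypothetical branches, two coordinates each.
anchor_update = sign_consensus([[1.0, -1.0], [3.0, 2.0]])
```

In this toy call, the first coordinate averages both branches (they agree in sign), while the second keeps only the value from the branch that matches the elected positive sign.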

2606.08948 2026-06-09 cs.CV cs.AI 交叉投稿

NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis

NutriMLLM：用于膳食微量营养素分析的多模态大语言模型

Runze Yan, Minxiao Wang, Jiaying Lu, Darren Liu, Xiao Hu, Hanqi Luo

发表机构 * Emory University（埃默里大学）

AI总结针对现有MLLM在膳食微量营养素估计中不可靠的问题，利用十年人口规模膳食回顾生成约110万图像-营养素三元组，微调Qwen3-VL和GLM-4.6V-Flash得到NutriMLLM，在真实图像上实现65种营养素全覆盖，准确率匹配或超越专有模型。

Comments 35 pages, 10 figures, 1 table

AI中文摘要

从食物图像中全面估计膳食微量营养素可以改善临床营养护理，但训练此类模型需要将多样化食物与完整营养素谱相关联的大规模多模态数据集。我们首先证明，现有的多模态大语言模型（MLLMs），包括领先的专有模型，在此任务上不可靠。在五个模型家族和四个独立评估基准（ASA24、SNAPMe、FNDDS和NutriBench）上，模型经常弃权或返回统计上不合理的值。为了在没有昂贵专家标注的情况下解决这一差距，我们将十年人口规模的24小时膳食回顾重新用作文本到图像生成的结构化提示。该流程生成了约110万图像-描述-营养素三元组的合成语料库，每个三元组将生成的食品图像与完整的65种营养素标签配对。据我们所知，这是计划在发表后公开发布的最大合成食品图像语料库，具有全面的微量营养素标注。在此语料库上微调Qwen3-VL（2B/4B/8B/30B）和GLM-4.6V-Flash，得到了NutriMLLM，这是第一个专门用于全面膳食微量营养素估计的视觉语言模型家族。我们使用一个四组件框架评估这些模型，该框架分别测量弃权、幻觉、整体可用性和每种营养素的数值准确性。在真实食品图像上，每个NutriMLLM变体在所有65种营养素上实现了近乎完全的覆盖，并且最大的变体在大多数营养素上的准确率匹配或超过了专有基线（GPT-5、Gemini 3和Claude Sonnet 4.5）。这些结果表明，回忆驱动的合成监督可以使基于图像的全面微量营养素估计成为一个可处理的工程问题，并支持膳食评估、个性化营养指导和人口规模的微量营养素监测。

英文摘要

Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models requires large multimodal datasets linking diverse foods to complete nutrient profiles. We first show that existing multimodal large language models (MLLMs), including leading proprietary models, are unreliable for this task. Across five model families and four independent evaluation benchmarks (ASA24, SNAPMe, FNDDS, and NutriBench), models frequently abstained or returned statistically implausible values. To address this gap without costly expert annotation, we repurposed a decade of population-scale 24-hour dietary recalls as structured prompts for text-to-image generation. This pipeline produced a synthetic corpus of about 1.1 million image-description-nutrient triplets, each pairing a generated food image with a complete 65-nutrient label. To our knowledge, this is the largest synthetic food-image corpus with comprehensive micronutrient annotation planned for public release upon publication. Fine-tuning Qwen3-VL (2B/4B/8B/30B) and GLM-4.6V-Flash on this corpus yielded NutriMLLM, the first family of vision-language models specialized for comprehensive dietary micronutrient estimation. We evaluate these models with a four-component framework that separately measures abstention, hallucination, overall usability, and per-nutrient numerical accuracy. On real food images, every NutriMLLM variant achieved near-complete coverage across all 65 nutrients, and the largest variant matched or exceeded proprietary baselines (GPT-5, Gemini 3, and Claude Sonnet 4.5) in accuracy on most nutrients. These results show that recall-driven synthetic supervision can make image-based comprehensive micronutrient estimation a tractable engineering problem and support dietary assessment, personalized nutrition guidance, and population-scale micronutrient surveillance.

2606.09019 2026-06-09 cs.SD cs.AI 交叉投稿

看得更多，思考更深：面向长视频理解的查询扩展视觉证据与答案线索引导反思

Shuning Wang, Zhiheng Wu, YiNuo Lu, Naiming Liu, Chen Jia, Bowen Liu, Shuo Nie, Weijie Zhu, Yumeng Zhang

发表机构 * Baidu Inc.（百度公司）； Harbin Institute of Technology（哈尔滨工业大学）； Hong Kong University of Science and Technology（香港科技大学）

AI总结提出CoVER框架，通过动态收集查询扩展视觉证据和答案特定视觉反馈验证草稿答案，实现从答案中心生成到证据中心和视觉可验证推理的转变，在长视频理解任务上超越同规模模型及部分闭源模型。

AI中文摘要

近期视频大语言模型（Video-LLMs）的进展使得长视频理解任务成为可能。然而，现有方法仍面临两个关键限制：证据获取通常依赖单一搜索意图，且答案生成缺乏有效的视觉反馈机制。为解决这些限制，我们提出了CoVER，一个用于长视频理解的综合视觉证据与反思框架。CoVER使Video-LLMs能够通过动态收集查询扩展视觉证据来“看得更多”，并通过使用有效的答案特定视觉反馈验证草稿答案来“思考更深”。这些机制共同将长视频理解从以答案为中心的生成转变为以证据为中心且可视觉验证的推理。实验结果表明，CoVER-7B在相同参数规模下显著优于其他模型，甚至在特定指标上超越了最先进的闭源模型。

英文摘要

Recent advances in Video Large Language Models (Video-LLMs) have enabled progress on long-video understanding tasks. However, existing methods still face two key limitations: evidence acquisition often relies on a single search intent, and answer generation lacks an effective visual feedback mechanism. To address these limitations, we propose CoVER, a Comprehensive Visual Evidence and Reflection framework for long-video understanding. CoVER enables Video-LLMs to "See More" by dynamically gathering query-expanded visual evidence, and to "Think Deeper" by verifying draft answers with effective answer-specific visual feedback. Together, these mechanisms shift long-video understanding from answer-centric generation to evidence-centric and visually verifiable reasoning. Experimental results show that CoVER-7B substantially outperforms models with the same parameter scale and even surpasses state-of-the-art closed-source models on certain metrics.

2606.09142 2026-06-09 cs.CV cs.AI 交叉投稿

Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models

通过视觉语言模型从自我中心视觉解码行人过街意图

Danya Li, Xiang Su, Yan Feng, Rico Krueger

发表机构 * Technical University of Denmark（丹麦技术大学）； University of Helsinki（赫尔辛基大学）； Delft University of Technology（代尔夫特理工大学）

AI总结利用视觉语言模型（VLM）将行人过街意图预测转化为视觉问答任务，通过参数高效微调并结合自我运动、车辆运动和眼动等上下文线索，在自我中心视频上实现了14.5%的准确率提升，创下新纪录。

AI中文摘要

自我中心视觉提供了人类感知和决策的第一人称视角，但其在交通安全预测方面的潜力尚未得到充分探索。在这项工作中，我们研究从短自我中心视频片段中解码行人过街意图。我们通过将任务表述为封闭式视觉问答（VQA）问题，并利用视觉语言模型（VLM）来预测行人的意图。我们首先在零样本设置下对三个系列的最先进VLM进行了基准测试，发现它们相对于随机猜测有适度提升，但表现出有限的高层次交通推理能力。基于这些发现，我们进一步使用参数高效微调将VLM适应于目标任务。我们的结果表明，微调后的模型显著优于其零样本对应模型，并在专门的基于Transformer的基线基础上实现了9%的准确率提升。最后，我们证明加入额外的上下文线索，包括自我运动、车辆运动和眼动，进一步提高了预测性能。特别是，由眼动和自我运动引导的微调Qwen3-VL-2B模型相比Transformer基线实现了14.5%的准确率提升，为自我中心行人意图解码建立了新的最先进水平。

英文摘要

Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remains underexplored. In this work, we study the decoding of pedestrian crossing intentions from short egocentric video clips. We approach this by formulating the task as a closed-ended visual question answering (VQA) problem and leveraging vision language models (VLMs) to predict the pedestrians' intent. We first benchmark three families of state-of-the-art VLMs in a zero-shot setting, finding that they achieve moderate gains over random guessing but exhibit limited higher-level traffic reasoning. Motivated by these findings, we further adapt VLMs to the target task using parameter-efficient fine-tuning. Our results show that the fine-tuned models substantially outperform their zero-shot counterparts and achieve a 9% accuracy improvement over a specialized transformer-based baseline. Finally, we demonstrate that incorporating additional contextual cues, including ego motion, vehicle motion, and eye gaze, further improves predictive performance. In particular, the fine-tuned Qwen3-VL-2B model guided by eye gaze and ego motion achieves a 14.5% accuracy improvement over the transformer baseline, establishing a new state of the art for egocentric pedestrian intent decoding.

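2606.09159 2026-06-09 cs.CL cs.AI cross-list

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

Yuchen Yan, Minkai Xu, Zaiquan Yang, Yatao Bian

Affiliations: National University of Singapore; Stanford University; City University of Hong Kong

AI summary: Targets the performance gap between diffusion language models, which generate text in parallel, and autoregressive models. Proposes a unified energy (Uni-E) that combines an invariant energy and an independent energy to address model capacity, dependency, and invariance; it can be computed exactly without sampling and can correct distribution shift.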

Abstract

Diffusion Language Models (DLMs) enable parallel text generation by iteratively denoising a full sequence, offering attractive flexibility compared to auto-regressive (AR) decoding. However, existing methods fail to fully capture token relationships, leading to a performance gap relative to AR baselines, especially as the degree of parallelism increases. In this paper, we give a systematic analysis of the gap, identifying three key factors: (i) model capacity, (ii) dependency, and (iii) invariance. To address these issues, we first propose an invariant energy (Inv-E) together with an effective sampling-based estimator to handle the invariance issue. By further combining it with the independent energy (Ind-E), we obtain a unified energy (Uni-E) that accounts for all these factors. Uni-E enjoys a unique advantage: it can be computed exactly without sampling-based partition estimation. In addition, Uni-E is model agnostic and can therefore be scaled to models of arbitrary size. We further prove that Uni-E can correct the distribution shift caused by dependency and invariance. Extensive experiments across Diffusion Language Models (DLMs) and Diffusion Large Language Models (DLLMs) demonstrate the effectiveness of the proposed Uni-E.

2606.09234 2026-06-09 cs.SD cs.AI cross-list

End-to-End Training for Discrete Token LLM based TTS System

Changfeng Gao, Yong Ren, Jun Yuan, Ye Bai, Zhao You, ShiDong Shang

Affiliations: National University of Singapore

AI summary: Proposes an end-to-end framework that unifies training of the speech tokenizer, LLM, flow-matching model, and reward model. Multi-task joint optimization improves discrete-token TTS and reaches a new SOTA on Seed-TTS-Eval.

Abstract

Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion-based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly optimize the tokenizer using multi-task objectives derived from reconstruction for the FM model, next-token prediction for the LLM, and multiple recognition tasks for the RM. This joint training encourages the discrete speech token space to capture acoustically and semantically salient information that is better tailored to TTS. We then further optimize the LLM using downstream reconstruction and recognition by the FM model and RM, which reduces inference-time mismatch and steers the LLM toward more preferred generations. Experimental results show that our E2E framework consistently outperforms cascaded baselines. On the Seed-TTS-Eval benchmark, our system achieves a word error rate (WER) of 0.78% and 1.56%, a new SOTA result with a 0.6B-parameter LLM and 0.5B-parameter FM model. These results validate that holistic E2E optimization is critical for improving discrete-token-based TTS systems with a much simpler training pipeline.

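2606.09331 2026-06-09 cs.MM cs.AI cs.LG cross-list

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

Shiyu Li, Zhiyuan Hu, Yifan Wang, Peiming Li, Zheng Wei, Yang Tang

Affiliations: Tencent

AI summary: Proposes a decouple-fuse-recover framework. Modality specialists are trained independently and their task vectors are fused; projector recovery and balanced multimodal rehearsal then fix projector drift, so a single backbone supports text, image, video, document, and audio retrieval.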

Abstract

Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult since these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple-fuse-recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone, causing a large audio retrieval regression despite copying all audio-specific modules unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector while keeping the backbone frozen) followed by balanced multi-modal rehearsal. The resulting model supports these retrieval pathways in one backbone, scoring 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
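
As a rough illustration of the task-vector fusion mentioned in the abstract above, the sketch below merges specialists by adding weighted parameter deltas to a shared base checkpoint. It is a generic task-arithmetic sketch, not the authors' implementation: the function names, the flat dict-of-lists parameter layout, and the per-specialist weights are all assumptions, and the abstract does not state the actual merge rule.

```python
def task_vector(base, specialist):
    """Parameter delta of a specialist relative to the shared base (hypothetical layout)."""
    return {name: [s - b for s, b in zip(specialist[name], base[name])]
            for name in base}


def fuse(base, specialists, weights):
    """Add weighted task vectors of all specialists onto a copy of the base parameters."""
    fused = {name: list(values) for name, values in base.items()}
    for specialist, weight in zip(specialists, weights):
        delta = task_vector(base, specialist)
        for name in fused:
            fused[name] = [f + weight * d for f, d in zip(fused[name], delta[name])]
    return fused


# Toy example: two specialists that each moved a different coordinate.
base = {"w": [1.0, 1.0]}
image_specialist = {"w": [2.0, 1.0]}
video_specialist = {"w": [1.0, 3.0]}
fused = fuse(base, [image_specialist, video_specialist], [1.0, 0.5])
print(fused)
```

Note that a component calibrated against one specialist's backbone (such as the audio projector in the abstract) would see a different backbone after a merge like this, which is consistent with the Projector Drift failure the authors describe.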

2606.09470 2026-06-09 cs.CL cs.AI cross-list

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

Affiliations: Centre for Language Studies, Radboud University

AI summary: Proposes a rubric-guided SpeechLLM that, trained with a hybrid objective, jointly predicts sentence-level and word/phoneme-level labels and generates a natural-language rationale. On SpeechOcean762 it matches or outperforms single-granularity models.

Comments: Accepted to Interspeech 2026. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants, which is financed by the Dutch Research Council (NWO).

Abstract

Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal labels at the sentence level (accuracy, fluency, prosody) and word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.

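2606.09525 2026-06-09 cs.CL cs.AI cross-list

Emergence of Context Characteristics Sensitivity in Large Language Models

Nadya Yuki Wangsajaya, Haeun Yu, Isabelle Augenstein

Affiliations: Nanyang Technological University; University of Copenhagen

AI summary: Measures three stages (supervised fine-tuning, direct preference optimization, and reinforcement learning with verifiable rewards) and finds that LLM sensitivity to context characteristics changes dynamically during instruction fine-tuning. SFT biases models toward easy-to-understand contexts, and later stages may reinforce or alter that preference.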

Abstract

During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models' sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are easy to understand, such as contexts that are longer, more similar to the query, and more fluent. Post-SFT dynamics may either reinforce or resolve these preferences depending on the training dataset. Our findings reveal that context usage is actively reshaped at each IFT stage, and that designing a balanced IFT dataset is important for ensuring robust context utilization in instruction-tuned models.

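2606.09587 2026-06-09 cs.HC cs.AI cross-list

Seeing the Hivemind: A Consensus-Aware Interaction Technique for Mitigating AI Homogenization

Muhammad Haris Khan, Joel Wester

Affiliations: University of Copenhagen

AI summary: Proposes the Semantic Repulsion Technique (SRT). Computational and user studies show that it significantly increases the semantic diversity of AI-generated content and reduces consensus phrases without harming usefulness or coherence.

Comments: In review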

Abstract

People are increasingly using AI for creative tasks such as writing. While adoption continues to grow, this form of use risks undermining individual creativity locally and reducing the heterogeneity of creative output at scale. In response, we introduce the Semantic Repulsion Technique (SRT) and evaluate it both computationally and through a study with 16 participants who regularly use AI for creative tasks. Our computational assessment reveals that SRT increases semantic diversity by 85-167% while reducing consensus phrases by 43-95% across task modes. In the user study, SRT outputs received higher usefulness ($p = .019$, $W = .208$) and coherence ratings ($p = .006$, $W = .260$); 68.8% of participants were willing to use SRT-Strong for multiple tasks versus 18.8% for baselines. Originality and coherence ratings were positively correlated across all systems ($\rho = +.40$ to $+.67$), suggesting that divergence need not compromise readability. Taken together, these preliminary findings can inform the design of AI systems that aim to support everyday creativity without contributing to homogenization.

2606.09670 2026-06-09 cs.CV cs.AI cross-list

Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision

Mateo Diaz-Bone, Daniel Caraballo, Florian Scheidegger, Thomas Frick, Mattia Rigotti, Andrea Bartezzaghi, Roy Assaf, Niccolo Avogaro, Yagmur G. Cinar, Brown Ebouky, Filip M. Janicki, Piotr S. Kluska, Cezary Skura, Cristiano Malossi

Affiliations: IBM Research Europe Zurich

AI summary: Anomaly detection fails in real-world scenes when object scale, viewpoint, and similar factors vary. Proposes a visual prompting pipeline, an unfrozen teacher model, and diffusion-generated data augmentation, improving results on the AeBAD dataset by 3.5 percentage points.

Abstract

Recent anomaly detection methods achieve perfect detection and segmentation scores on well-established datasets such as MVTec. However, many of these methods struggle when foundational assumptions (such as consistent object scale, viewpoint, background, illumination, and centered placement) are violated. Such variations render anomaly detection methods unusable in many real-world scenarios. To address these limitations, we introduce three key contributions: (1) a visual prompting pipeline that isolates objects using foreground-background masking; (2) a mechanism for unfreezing the teacher in student-teacher models to improve domain adaptability; and (3) a data augmentation strategy leveraging diffusion-generated synthetic images to enhance anomaly detection performance. Using the Masked Multiscale Reconstruction (MMR) model as our backbone, we achieve a 3.5 percentage point improvement over the previous state of the art on the challenging AeBAD dataset.

2606.09767 2026-06-09 cs.CL cs.AI cs.LG cross-list

Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee

Affiliations: University of Houston; MasterWord Services, Inc.; University of Washington

AI summary: For low-resource Indigenous languages, proposes data synthesis (generating a synthetic corpus from community dictionaries) combined with LoRA parameter-efficient fine-tuning. On Q'eqchi' Mayan it achieves high structural acquisition (BLEU 42.02) but shows a structural-semantic gap, so curriculum learning with authentic data is needed.

Comments: Accepted to the 29th International Conference on Text, Speech and Dialogue (TSD 2026). This version of the contribution has been accepted for publication, after peer review, but is not the Version of Record and does not reflect post-acceptance improvements or any corrections.

Abstract

Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus, utilizing Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model. In-domain evaluation demonstrates high structural acquisition (BLEU 42.02), proving that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59), where the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model exhibits overfitting to the constrained structural variance of the synthetic templates; despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language, forcing organic inputs into rigid learned patterns. Furthermore, an ablation study utilizing a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for limited parameter capacity within the LoRA adapters, causing over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but requires authentic data for semantic refinement via Curriculum Learning.

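2510.06052 2026-06-09 cs.AI cs.CL updated version

MixReasoning: Switching Modes to Think

Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, Xinchao Wang

Affiliations: arXiv

AI summary: Proposes MixReasoning, a framework that dynamically adjusts reasoning depth, reasoning in detail on difficult steps and concisely on simple ones. On GSM8K, MATH-500, and AIME it shortens reasoning length and improves efficiency without sacrificing accuracy.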

Abstract

Reasoning models enhance performance by tackling problems in a step-by-step manner, decomposing them into sub-problems and exploring long chains of thought before producing an answer. However, applying extended reasoning to every step introduces substantial redundancy, as sub-problems vary widely in difficulty and complexity: a small number of pivotal steps are genuinely challenging and decisive for the final answer, while many others only involve straightforward revisions or simple computations. Therefore, a natural idea is to endow reasoning models with the ability to adaptively respond to this variation, rather than treating all steps with the same level of elaboration. To this end, we propose MixReasoning, a framework that dynamically adjusts the depth of reasoning within a single response. The resulting chain of thought then becomes a mixture of detailed reasoning on difficult steps and concise inference on simpler ones. Experiments on GSM8K, MATH-500, and AIME show that MixReasoning shortens reasoning length and substantially improves efficiency without compromising accuracy.

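2511.19829 2026-06-09 cs.AI updated version

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

Jiazhou Liang, Armin Toroghi, Yifan Simon Liu, Faeze Moradi Kalarde, Liam Gallagher, Scott Sanner

Affiliations: University of Toronto; Vector Institute for Artificial Intelligence

AI summary: Proposes Goal-Mem, a framework that uses goal-oriented reasoning to improve RAG-based memory on complex tasks, with especially strong gains in multi-hop reasoning and implicit inference.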

Abstract

LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approaches are increasingly adopted to overcome this limitation by storing interactions in external memory modules and performing retrieval from them, their effectiveness in answering challenging questions (e.g., multi-hop, commonsense) ultimately depends on the agent's ability to reason over the retrieved information. However, existing methods typically retrieve memory based on semantic similarity to the raw user utterance, which lacks explicit reasoning about missing intermediate facts and often returns evidence that is irrelevant or insufficient for grounded reasoning. In this work, we introduce Goal-Mem, a goal-oriented reasoning framework for RAG-based agentic memory that performs explicit backward chaining from the user's utterance as a goal. Rather than progressively expanding from retrieved context, Goal-Mem decomposes each goal into atomic subgoals, performs targeted memory retrieval to satisfy each subgoal, and iteratively identifies what information from memory should be retrieved when intermediate goals cannot be resolved. We formalize this process in Natural Language Logic, a logical system that combines the verifiability of reasoning provided by FOL with the expressivity of natural language. Through extensive experiments on two datasets and comparing to nine strong memory baselines, we show that Goal-Mem consistently improves performance, particularly on tasks requiring multi-hop reasoning and implicit inference.
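
To make the backward-chaining idea in the abstract above concrete, here is a minimal symbolic sketch: a goal is either found directly in memory or decomposed into subgoals that must all be resolved. It is only a toy under stated assumptions. In Goal-Mem the decomposition and retrieval are presumably carried out by an LLM over natural-language memory, whereas here the hypothetical `rules` dict and `memory` set stand in for both, and the function name and depth limit are illustrative choices.

```python
def backward_chain(goal, rules, memory, depth=0, max_depth=5):
    """Return True if `goal` is supported by `memory`, decomposing it via `rules`.

    rules:  maps a goal to a list of alternative decompositions,
            each a list of subgoals (hypothetical structure).
    memory: set of facts that a targeted retrieval could return.
    """
    if goal in memory:  # targeted retrieval satisfies this (sub)goal directly
        return True
    if depth >= max_depth or goal not in rules:
        return False  # cannot decompose further: the goal stays unresolved
    # The goal holds if every subgoal of at least one decomposition holds.
    return any(
        all(backward_chain(sub, rules, memory, depth + 1, max_depth) for sub in subgoals)
        for subgoals in rules[goal]
    )


rules = {
    "user_can_attend_dinner": [["user_is_free_friday", "restaurant_open_friday"]],
    "user_is_free_friday": [["meeting_moved_to_thursday"]],
}
memory = {"meeting_moved_to_thursday", "restaurant_open_friday"}
print(backward_chain("user_can_attend_dinner", rules, memory))
```

The multi-hop case is the interesting one: "user_is_free_friday" is never stored in memory, so a similarity search on the raw question would miss it, while chaining backward from the goal surfaces the intermediate fact that needs to be retrieved.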

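2506.06295 2026-06-09 cs.LG cs.AI cs.CL updated version

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyan Wei, Shaobo Wang, Yichen Zhu, Linfeng Zhang

Affiliations: Zhejiang University

AI summary: To address the high inference latency of diffusion large language models, proposes dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with feature-similarity-guided partial response updates. It reuses intermediate computation efficiently and substantially reduces FLOPs while preserving output quality.

Comments: Accepted by ICML 2026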

Abstract

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x FLOPs reduction on LongBench-HotpotQA while maintaining competitive output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. The code for this work is publicly available at: https://github.com/maomaocun/dLLM-cache.
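
The two caching rules named in the abstract above can be sketched in a few lines. This is a hedged illustration rather than the released dLLM-Cache code: the function names, the use of cosine similarity as the feature-similarity measure, the `ratio` of tokens refreshed per step, and the fixed prompt `interval` are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tokens_to_refresh(cached, current, ratio):
    """Indices of the response tokens whose features drifted most since caching."""
    sims = [cosine(c, n) for c, n in zip(cached, current)]
    k = min(len(sims), max(1, math.ceil(ratio * len(sims))))
    least_similar = sorted(range(len(sims)), key=lambda i: sims[i])[:k]
    return sorted(least_similar)


def should_refresh_prompt(step, interval):
    """Long-interval prompt caching: recompute prompt features only every `interval` steps."""
    return step % interval == 0


cached = [[1, 0], [0, 1], [1, 1]]
current = [[1, 0], [1, 0], [1, 0.9]]
print(tokens_to_refresh(cached, current, ratio=0.5))
```

Tokens outside the returned index list would reuse their cached intermediate results for the current denoising step, which is where the compute saving comes from.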

2507.00322 2026-06-09 cs.CL cs.AI cs.SE updated version

Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

Daking Rai, Samuel Miller, Kevin Moran, Ziyu Yao

Affiliations: George Mason University; University of Central Florida; Department of Computer Science

AI summary: Explains why language models err on balanced-parentheses tasks: some components implement reliable mechanisms while others introduce noise, and errors occur when the noisy mechanisms dominate. Proposes RASteer, which boosts the contribution of reliable components, raising accuracy for some models from 0% to nearly 100% and yielding gains of about 20% on arithmetic reasoning tasks.

Comments: 23 pages, 10 figures, accepted for NeurIPS 2025

Abstract

Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms"), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms"). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from 0% to around 100% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around 20%.

2509.17446 2026-06-09 cs.LG cs.AI updated version

MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion

Haofeng Huang, Yifei Han, Long Zhang, Bin Li, Yangfan He, Yaxin Xue

Affiliations: University of Shanghai for Science and Technology, China; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China; University of Minnesota-Twin Cities, USA; University of Leeds, UK

AI summary: Proposes MVCL-DAF++, which uses prototype-aware contrastive alignment and coarse-to-fine attention fusion to improve multimodal intent recognition on MIntRec and MIntRec2.0, particularly for rare classes.

Comments: Accepted by Interspeech 2026

2509.17455 2026-06-09 cs.CL cs.AI updated version

Understanding Benchmark Language Under Weakened Formal Semantics

Haoyang Chen, Kumiko Tanaka-Ishii

Affiliations: Department of Computer Science and Engineering; School of Fundamental Science and Engineering; Waseda University

AI summary: Proposes extracting computables (executable representations) with the help of external knowledge retrieval. On mathematical reasoning, multi-step reasoning, and other benchmarks it surpasses text-only reasoning and one-shot code execution while providing scalable, inspectable semantic evidence.

Comments: Accepted to Transactions of the Association for Computational Linguistics (TACL). 29 pages, 5 figures

Abstract

State-of-the-art NLP benchmarks require interpretation of natural language that specifies conditions, procedures, and exceptions, often relying on implicit assumptions and external knowledge. Constructing complete semantic representations with proof-theoretic guarantees is frequently impractical at scale, and purely text-based reasoning offers limited means of inspection. This paper asks how much understanding of benchmark language can be achieved when formal semantic guarantees are weakened. We investigate this question by extracting computables: executable representations whose runtime behavior provides operational evidence of semantic adequacy, including executability, execution traces, and runtime failures. We induce and iteratively refine computables for benchmark instances using retrieval from external knowledge. Across mathematical reasoning, multi-step reasoning, causal inference, and rule- and exception-heavy legal and biomedical benchmarks, we find that the proposed approach consistently exceeds text-only reasoning and one-shot code execution. Beyond accuracy, our analyses show that these computables provide scalable, inspectable semantic evidence: they expose conditions and exceptions benchmark language forces into executable form, offering a practical bridge between proof-oriented semantics and purely textual reasoning.

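2511.11041 2026-06-09 cs.CL cs.AI cs.LG updated version

Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

Xingyu Ren, Youran Sun, Haoyu Liang

Affiliations: GitHub

AI summary: Finds a consistent mean bias in sentence embeddings and proposes a training-free correction, R2 (projecting out the mean direction), which yields consistent classification gains across 38 models on MMTEB. Also analyzes how it differs from PCA whitening.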

Abstract

We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde{e} + \mu$, where the mean $\mu$ is near-identical across all sentences. We study two training-free corrections, subtracting $\mu$ directly (R1) or projecting each embedding off the mean direction (R2), and show, via a first-order error-propagation argument, that R2 cancels the parallel component of mean-estimation error that R1 retains. Across 38 models on the Massive Multilingual Text Embedding Benchmark (MMTEB), R2 yields consistent classification gains (paired $\bar{t} = 3.31$, 29 of 38 models with $t > 2$, zero losses), and the per-model mean norm $\Vert\mu\Vert$ correlates with which models benefit most. A nine-method dose-response ablation on five models further reveals that mild single-direction removal helps, but full principal component analysis (PCA) whitening hurts every model we test, and that R2 and All-but-the-Top with depth one agree within $0.18$ pp downstream despite weak geometric alignment between $\hat{\mu}$ and the centered top principal component.
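
The two corrections in the abstract above are simple enough to write out. The sketch below is a plain-Python illustration under stated assumptions, not the authors' code: the function names are hypothetical, the mean is estimated from the same batch of embeddings, and the final re-normalization to unit length is an assumption suggested by the title's "renormalization". R1 computes $e - \mu$, while R2 computes $e - (e \cdot \hat{\mu})\,\hat{\mu}$ with $\hat{\mu} = \mu / \Vert\mu\Vert$.

```python
import math


def mean_vector(embs):
    """Coordinate-wise mean of a list of equal-length embeddings."""
    n = len(embs)
    return [sum(col) / n for col in zip(*embs)]


def normalize(v):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def r1_subtract_mean(embs):
    """R1: subtract the estimated mean from every embedding, then re-normalize."""
    mu = mean_vector(embs)
    return [normalize([x - m for x, m in zip(e, mu)]) for e in embs]


def r2_project_off_mean(embs):
    """R2: remove each embedding's component along the mean direction, then re-normalize."""
    mu_hat = normalize(mean_vector(embs))
    corrected = []
    for e in embs:
        coeff = sum(x * m for x, m in zip(e, mu_hat))  # e . mu_hat
        corrected.append(normalize([x - coeff * m for x, m in zip(e, mu_hat)]))
    return corrected


embs = [[2.0, 1.0, 0.0], [2.0, 0.0, 1.0], [2.0, -1.0, -1.0]]
for vec in r2_project_off_mean(embs):
    print(vec)
```

Because R2 uses only the direction of the estimated mean, an error in the estimated length of $\mu$ along that direction has no effect on the result, which is one way to read the abstract's claim that R2 cancels the parallel component of the estimation error.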

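2511.14143 2026-06-09 cs.CV cs.AI updated version

DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs

Nayoung Choi, Jonathan Zhang, Jinho D. Choi

Affiliations: Computer Science, Emory University

AI summary: DYCP dynamically identifies and retrieves dialogue segments to manage context more efficiently for LLMs in long-form dialogue, yielding more precise context selection and more efficient inference.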

2601.12263 2026-06-09 cs.CL cs.AI cs.LG version update

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

多模态生成式引擎优化：针对视觉-语言模型排序器的排名操纵

Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu

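Affiliations * Georgetown University; University of Southern California; University of Maryland, College Park; Arizona State University

AI summary: Proposes Multimodal Generative Engine Optimization (MGEO), which jointly optimizes image perturbations and textual suffixes to exploit cross-modal knowledge coupling inside vision-language models, effectively manipulating product rankings and exposing the fragility of knowledge grounding in multimodal foundation models.

Comments Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM) at ACL 2026

AI Chinese abstract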

视觉-语言模型（VLM）将视觉和文本知识整合到统一表示中，日益成为现代检索和推荐系统的基础。然而，这些模型在对多模态项目进行排序时如何可靠地利用其跨模态知识，以及其知识基础是否可以被颠覆，仍不清楚。在本文中，我们揭示了VLM在多模态产品排序中应用知识的一个基本漏洞：通过多模态生成式引擎优化（MGEO），我们展示了攻击者可以通过联合制作难以察觉的图像扰动和流畅的文本后缀，利用模型内部的跨模态知识耦合，操纵VLM的排序决策。MGEO采用交替优化策略，针对VLM中视觉和语言表示之间的深层交互，实现了远超单模态攻击和由强大商业模型驱动的启发式基线的排名操纵。我们的发现表明，表面内容质量不足以提升排名；相反，需要直接与模型内部知识利用机制对齐。这些结果对多模态基础模型中知识基础的忠实性和鲁棒性提出了重要问题，并激励了未来多模态检索系统防御机制的研究。代码见：this https URL

English abstract

Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these models utilize their cross-modal knowledge when ranking multimodal items, and whether their knowledge grounding can be subverted. In this paper, we expose a fundamental vulnerability in how VLMs apply multimodal knowledge for product ranking: through Multimodal Generative Engine Optimization (MGEO), we show that an adversary can manipulate a VLM's ranking decisions by jointly crafting imperceptible image perturbations and fluent textual suffixes that exploit the model's internal cross-modal knowledge coupling. Using an alternating optimization strategy, MGEO targets the deep interactions between visual and linguistic representations within the VLM, achieving rank manipulations that substantially exceed those of unimodal attacks and heuristic baselines powered by strong commercial models. Our findings reveal that surface-level content quality is insufficient for rank promotion; instead, direct alignment with the model's internal knowledge utilization mechanism is required. These results raise important questions on the faithfulness and robustness of knowledge grounding in multimodal foundation models, and motivate future work on defense mechanisms for multimodal retrieval systems. Code is available at: https://github.com/glad-lab/MGEO

2601.23286 2026-06-09 cs.CV cs.AI cs.LG version update

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

VideoGPA: 通过几何先验知识蒸馏实现3D一致的视频生成

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang

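Affiliations * University of California, Berkeley

AI summary: VideoGPA distills geometry priors to improve the 3D consistency of video generation, using a data-efficient self-supervised framework to guide video diffusion models and markedly improving temporal stability, geometric plausibility, and motion coherence.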

Comments 8 pages, 5 figures, ICML 2026

2602.00238 2026-06-09 cs.CL cs.AI cs.LG version update

DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

DIVERGE: 面向开放式信息检索的多样性增强RAG

Tianyi Hu, Niket Tandon, Akhil Arora

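Affiliations * Aarhus University; Microsoft Research

AI summary: To address existing RAG systems' neglect of diversity in open-ended information seeking, proposes the Diverge framework, which uses iterative, reflection-guided exploration of diverse viewpoints and diversity-aware retrieval support to raise diversity by about 2x while maintaining quality.

AI Chinese abstract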

现有的检索增强生成（RAG）系统通常假设每个查询只有一个正确答案。这种假设忽略了开放式信息检索场景，其中多个合理的答案是有价值的，并且多样性对于创造力、公平性和信息的包容性访问至关重要。我们表明，标准RAG系统未能充分利用多样化的检索上下文：简单地增加检索多样性并不一定会导致多样化的生成。为了解决这一局限性，我们提出了Diverge，一个即插即用的智能体RAG框架，通过迭代、反思引导的多样化视角探索和多样性感知检索支持来改善多样性与质量的权衡。我们进一步引入了用于表征开放式问答中多样性与质量权衡的评估指标。在多个真实世界数据集和骨干LLM上的实验表明，Diverge在竞争基线中实现了最佳的权衡，将多样性提高了约2倍，且没有明显的质量下降。这些结果揭示了当前RAG系统的系统性局限，并展示了显式多样性建模的价值。

English abstract

Existing retrieval-augmented generation (RAG) systems often assume that each query has a single correct answer. This assumption overlooks open-ended information-seeking scenarios where multiple plausible answers are valuable, and where diversity is important for creativity, fairness, and inclusive access to information. We show that standard RAG systems fail to fully use diverse retrieved contexts: simply increasing retrieval diversity does not necessarily lead to diverse generations. To address this limitation, we propose Diverge, a plug-and-play agentic RAG framework that improves the diversity-quality trade-off through iterative, reflection-guided exploration of diverse viewpoints and diversity-aware retrieval support. We further introduce evaluation metrics for characterizing the diversity-quality trade-off in open-ended question answering. Experiments across multiple real-world datasets and backbone LLMs show that Diverge achieves the best trade-off among competitive baselines, increasing diversity by $\sim2\times$ without noticeable quality degradation. These results reveal a systematic limitation of current RAG systems and show the value of explicit diversity modeling.

2602.07774 2026-06-09 cs.IR cs.AI version update

Generative Reasoning Re-ranker

生成式推理重排序器

Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Jacob Tao, Shike Mei, Wenlin Chen, Santanu Kolay, Sandeep Pandey, Hamed Firooz, Luke Simon

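Affiliations * Meta AI

AI summary: Proposes the GR2 framework, which uses the reasoning ability of large language models for recommendation reranking; with semantic-ID encoding, supervised fine-tuning on reasoning traces, and reinforcement-learning optimization, it surpasses existing methods on Recall@5 and NDCG@5.

Comments 31 pages

AI Chinese abstract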

最近的研究越来越多地探索大语言模型（LLMs）作为推荐系统的新范式，因其可扩展性和世界知识。然而，现有工作存在三个关键限制：（1）大多数工作集中在检索和排序，而重排序阶段——对优化最终推荐至关重要——在很大程度上被忽视；（2）LLMs通常用于零样本或有监督微调设置，其推理能力（尤其是通过强化学习（RL）和高质量推理数据增强的能力）未被充分利用；（3）项目通常由非语义ID表示，在拥有数十亿标识符的工业系统中造成重大可扩展性挑战。为解决这些问题，我们提出生成式推理重排序器（GR2），这是一个端到端框架，具有专为重排序设计的三阶段训练流程。首先，预训练的LLM通过一个分词器对从非语义ID编码的语义ID进行中期训练，实现≥99%的唯一性。接下来，一个更强的更大规模LLM通过精心设计的提示和拒绝采样生成高质量推理轨迹，用于监督微调以赋予基础推理技能。最后，我们应用解耦裁剪和动态采样策略优化（DAPO），实现具有可验证奖励的可扩展RL监督，这些奖励专为重排序设计。在两个真实数据集上的实验证明了GR2的有效性：它在Recall@5和NDCG@5上分别超越最先进的OneRec-Think 2.4%和1.3%。消融实验证实，高级推理轨迹在各项指标上带来显著提升。我们进一步发现，RL奖励设计在重排序中至关重要：LLMs倾向于通过保留项目顺序来利用奖励黑客行为，这促使我们设计条件可验证奖励以减轻这种行为并优化重排序性能。

English abstract

Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, creating major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Reranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs encoded from non-semantic IDs via a tokenizer achieving $\ge$99% uniqueness. Next, a stronger larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling, which are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that RL reward design is crucial in reranking: LLMs tend to exploit reward hacking by preserving item order, motivating conditional verifiable rewards to mitigate this behavior and optimize reranking performance.
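For readers unfamiliar with the two reported metrics, the following is a minimal sketch of Recall@5 and NDCG@5 under binary relevance. The abstract does not spell out the exact definitions used in the paper, so the formulas here are the common textbook ones, and the item IDs and relevant set are made up for illustration.

```python
import math

def recall_at_k(ranked, relevant, k=5):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=5):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy reranked list (hypothetical IDs); "i2" falls outside the top 5.
reranked = ["i3", "i7", "i1", "i9", "i4", "i2"]
relevant = {"i7", "i2"}
recall5 = recall_at_k(reranked, relevant)
ndcg5 = ndcg_at_k(reranked, relevant)
```

Recall@5 only asks whether relevant items reach the top five, whereas NDCG@5 also rewards placing them higher. This is why a reranker that merely preserves the input order, the reward-hacking behavior the abstract mentions, can leave both metrics unchanged.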

2602.12996 2026-06-09 cs.CL cs.AI version update

Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models

知道更多，更清晰：大型语言模型中知识增强的元认知框架

Hao Chen, Ye He, Yuchun Fan, Yukun Yan, Zhenghao Liu, Qingfu Zhu, Maosong Sun, Wanxiang Che

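Affiliations * University of Science and Technology of China

AI summary: Proposes a meta-cognitive framework that uses internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, then augments knowledge and calibrates confidence through differentiated intervention and a cognitive consistency mechanism; experiments show it outperforms baseline methods.

AI Chinese abstract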

知识增强显著提升了大型语言模型（LLMs）在知识密集型任务中的性能。然而，现有方法通常基于模型性能等同于内部知识的简单前提，忽略了导致过度自信错误或不确定真相的知识-置信度差距。为弥合这一差距，我们提出了一种新颖的元认知框架，通过差异化干预和对齐实现可靠的知识增强。我们的方法利用内部认知信号将知识空间划分为掌握、混淆和缺失区域，指导有针对性的知识扩展。此外，我们引入了一种认知一致性机制，以同步主观确定性与客观准确性，确保校准的知识边界。大量实验表明，我们的框架持续优于强基线，验证了其在不仅增强知识能力，而且培养更好区分已知与未知的认知行为方面的合理性。所有代码均可在该 https URL 获取。

English abstract

Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate that our framework consistently outperforms strong baselines, validating its rationality in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns. All code is available at https://github.com/AI9Stars/Know-More-Know-Clearer.

2602.17911 2026-06-09 cs.CL cs.AI version update

Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

基于条件的推理用于依赖上下文的生物医学问答

Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman, Pengcheng Jiang, Chih-Hsuan Wei, Zhizheng Wang, Zhiyong Lu, Jiawei Han

Affiliations * University of Illinois Urbana-Champaign; National Institutes of Health

AI summary: Proposes the CondMedQA benchmark and the Condition-Gated Reasoning framework, which builds condition-aware knowledge graphs to improve condition-dependent reasoning in biomedical question answering.

DOI: 10.1145/3770855.3818963

AI Chinese abstract

当前生物医学问答系统常假设医学知识是统一的，但现实临床推理本质上是条件性的：几乎所有决策都依赖于患者特定因素，如共病和禁忌症。现有基准不评估此类条件推理，检索增强或图基方法缺乏显式机制确保检索知识适用于给定上下文。为解决这一差距，我们提出CondMedQA，首个针对条件生物医学问答的基准，包含多跳问题，其答案随患者条件变化。此外，我们提出Condition-Gated Reasoning（CGR），一种新框架，构建条件感知知识图谱，并根据查询条件选择性激活或修剪推理路径。我们的发现显示，CGR更可靠地选择条件合适的答案，同时在生物医学问答基准上匹配或超越现有最佳性能，突显了显式建模条件性对稳健医疗推理的重要性。

English abstract

Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to the given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.

2602.20967 2026-06-09 eess.AS cs.AI cs.SD version update

Training-Free Intelligibility-Guided Observation Addition for Noisy ASR

无训练的可懂度引导的噪声ASR观测添加

Haoyang Li, Changsong Liu, Wei Rao, Hao Shi, Sakriani Sakti, Eng Siong Chng

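Affiliations * Nanyang Technological University; Nara Institute of Science and Technology

AI summary: Proposes a training-free, intelligibility-guided observation addition method that derives fusion weights from intelligibility estimates produced by the backend ASR, improving ASR robustness in noisy conditions without modifying the parameters of the SE or ASR models.

Comments Accepted to Interspeech 2026

AI Chinese abstract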

自动语音识别（ASR）在噪声环境中严重退化。尽管语音增强（SE）前端有效抑制背景噪声，但它们常常引入损害识别的伪影。观测添加（OA）通过融合噪声和SE增强语音解决了这一问题，无需修改SE或ASR模型的参数。本文提出了一种可懂度引导的OA方法，其中融合权重从后端ASR直接获得的可懂度估计中推导。与基于训练好的神经预测器的先前OA方法不同，所提出的方法无需训练，降低了复杂度并增强了泛化能力。在多种SE-ASR组合和数据集上的大量实验表明，该方法相比现有OA基线具有强大的鲁棒性和改进。对可懂度引导的基于切换的替代方案以及帧级与话语级OA的进一步分析也验证了所提出的设计。

English abstract

Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and of frame-level versus utterance-level OA further validate the proposed design.
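The fusion step that observation addition relies on can be sketched as a convex combination of the two waveforms. This is only an illustration under stated assumptions: the abstract does not say how intelligibility estimates are mapped to weights, so `weight_from_scores` below is a hypothetical rule (the enhanced signal's share of the total score), and the sample values and scores are made up.

```python
def observation_addition(noisy, enhanced, w_enh):
    """Utterance-level fusion: convex combination of enhanced and noisy samples."""
    if not 0.0 <= w_enh <= 1.0:
        raise ValueError("w_enh must lie in [0, 1]")
    if len(noisy) != len(enhanced):
        raise ValueError("signals must have the same length")
    return [w_enh * e + (1.0 - w_enh) * n for n, e in zip(noisy, enhanced)]

def weight_from_scores(score_noisy, score_enh):
    """Hypothetical mapping (not from the paper): weight the enhanced signal
    by its share of the two intelligibility scores."""
    total = score_noisy + score_enh
    return 0.5 if total == 0 else score_enh / total

# Made-up waveforms and intelligibility scores for illustration.
noisy = [0.2, -0.4, 0.6]
enhanced = [0.1, -0.2, 0.5]
w = weight_from_scores(score_noisy=0.6, score_enh=0.9)
fused = observation_addition(noisy, enhanced, w)
```

Setting the weight to 0 or 1 recovers pure noisy or pure enhanced input, which corresponds to the switching-based alternative the abstract compares against. A frame-level variant would apply a separate weight to each frame instead of one weight per utterance.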

2603.03292 2026-06-09 cs.CL cs.AI cs.IR version update

From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

从冲突到共识：通过多轮代理RAG提升医疗推理

Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Zhenhong Sun, Chunlin Chen, Zhi Wang

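Affiliations * GitHub

AI summary: Proposes the MA-RAG framework, which iteratively refines external evidence and internal reasoning history in a multi-round agentic loop to improve complex medical reasoning; experiments show it outperforms existing methods on 7 medical question-answering benchmarks.

Comments 27 pages, 8 figures, 18 tables

AI Chinese abstract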

大型语言模型（LLMs）在医疗问答中表现出高推理能力，但其产生幻觉和过时知识的倾向对医疗领域构成重大风险。虽然检索增强生成（RAG）缓解了这些问题，但现有方法依赖于噪声的token级信号，并缺乏复杂推理所需的多轮细化。本文提出MA-RAG（多轮代理RAG），通过在代理细化循环中迭代演变外部证据和内部推理历史，实现复杂医疗推理的测试时间扩展。在每一轮中，代理将候选响应间的语义冲突转换为可检索的外部证据查询，同时优化历史推理轨迹以缓解长上下文退化。MA-RAG通过利用不一致性作为主动信号来扩展自我一致性原则，并通过迭代最小化残差误差来实现稳定、高保真的医疗共识。在7个医疗问答基准上的广泛评估显示，MA-RAG在推理时间扩展和RAG基线方面均优于竞争方法，平均准确率比基础模型提高+6.8点。我们的代码可在https://github.com/NJU-RL/MA-RAG上获得。

English abstract

Large Language Models (LLMs) exhibit high reasoning capacity in medical question-answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries to retrieve external evidence, while optimizing historical reasoning traces to mitigate long-context degradation. MA-RAG extends the self-consistency principle by leveraging the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical consensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering a substantial average accuracy gain of +6.8 points over the backbone model. Our code is available at https://github.com/NJU-RL/MA-RAG.

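FormalASR: An End-to-End System for Transcribing Spoken Chinese into Formal Text

Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

Affiliations * arXiv

AI summary: Presents FormalASR, an end-to-end model that converts spoken Chinese into formal written text; by building large-scale spoken-to-formal datasets and fine-tuning Qwen3-ASR, it achieves up to 37.4% relative CER reduction over verbatim baselines while improving ROUGE-L and BERTScore, offering a lightweight on-device solution.

AI Chinese abstract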

自动语音识别（ASR）系统通常优化于逐字转录，这保留了不连贯、填充词和非正式口语结构，这些结构往往不适合下游写作应用。常见的解决方法是ASR+LLM的两阶段流程用于后期编辑，但这种设计增加了延迟和内存成本，并且难以在设备上部署。我们提出了FormalASR，两个紧凑的端到端模型（0.6B和1.7B），可直接将中文语音转录为正式书面文本。为了实现这一目标，我们构建了WenetSpeech-Formal和Speechio-Formal两个大规模的语音到正式文本数据集，通过基于LLM的重写和质量过滤构建。然后我们使用监督微调对Qwen3-ASR进行两个规模（0.6B和1.7B）的微调。在WenetSpeech-Formal和Speechio-Formal上的实验表明，FormalASR在比原声基线减少37.4%的CER的同时，也提高了ROUGE-L和BERTScore。FormalASR在部署时不需要后处理LLM，提供了一个轻量级的设备端解决方案用于语音到正式转录。

English abstract

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
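To make the headline number concrete, here is a minimal sketch of how a character error rate (CER) and a relative CER reduction are commonly computed. It uses the standard definition (character-level edit distance divided by reference length). The example sentences are made up, and the paper's own scoring pipeline, including any text normalization, is not described in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate against a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)

def relative_reduction(baseline_cer, new_cer):
    """Relative CER reduction of a new system over a baseline."""
    return (baseline_cer - new_cer) / baseline_cer

# Illustrative only: a formal reference, a verbatim transcript that keeps
# filler words, and a formal-style transcript with one substitution.
reference = "今天天气很好"
verbatim_hyp = "嗯今天天气那个很好"
formal_hyp = "今天天气挺好"

baseline = cer(reference, verbatim_hyp)
improved = cer(reference, formal_hyp)
gain = relative_reduction(baseline, improved)
```

A relative reduction of 37.4% means the new CER is 62.6% of the baseline CER; for instance, a hypothetical baseline CER of 20.0 would drop to about 12.52.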

2605.28831 2026-06-09 cs.CL cs.AI version update

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

S3Mem：用于长时域交互式问答的结构化时空场景-事件记忆

Encheng Su, Jianyu Wu, Jinouwen Zhang, Qiucheng Yu, Chen Tang, Pengze Li, Lintao Wang, Aoran Wang, Xinzhu Ma, Shixiang Tang, Yizhou Wang, Houqiang Li

Affiliations * University of Science and Technology of China; Shanghai Jiao Tong University; Shanghai AI Laboratory; City University of Hong Kong; The Chinese University of Hong Kong; Fudan University; The University of Sydney; Beihang University

AI summary: Proposes the S3MEM framework, which uses structured scene-event memory and anchor-sensitive retrieval to achieve a better accuracy-efficiency balance than generic memory interfaces in long-horizon interactive question answering.

AI Chinese abstract

长时域交互代理通常积累大量轨迹历史，但仍无法可靠地回答关于早期事件的问题。我们认为主要瓶颈不仅是上下文长度，而是长期记忆的轨迹到答案接口。当历史以纯文本块存储并使用标准检索增强生成（RAG）查询时，系统通常检索到局部相关但链不完整的证据，特别是对于空间、时间、重复事件和多跳状态问题。我们提出S3MEM，一种用于长时域交互式问答（QA）的结构化场景-事件情节记忆框架。S3MEM将轨迹写入结构化记忆单元，通过锚点敏感检索检索证据，并为答案时间推理提供紧凑的令牌预算感知证据接口。从这个意义上说，S3MEM是一种结构化证据利用工具，将代理轨迹转换为查询对齐的支持。我们在两个内部标题环境（Crafter、Jericho）和两个外部环境（SciWorld、ALFWorld）上评估S3MEM。在共享的冻结答案时间协议下，S3MEM在所有四个环境中一致优于Vanilla RAG，在Crafter、Jericho和ALFWorld上超过Graph-NoReader，在SciWorld上与之匹配，同时使用的证据令牌显著减少。三个改编的近期基线——A-MEM启发、MemoryOS改编和LightMem改编——在多个设置中优于Vanilla RAG，但没有一个达到S3MEM的整体准确率-效率前沿。总体而言，证据支持一个有限的结论：在当前冻结的答案时间协议下，结构化写入和锚点敏感证据路由为长时域交互式QA提供了比通用记忆接口更强的准确率-效率前沿。

English abstract

Long-horizon memory question answering often requires sparse evidence from heterogeneous histories, including events, object states, visual observations, temporal relations, and causal steps. Existing memory interfaces expand reader context, retrieve semantically related chunks, or expose graph neighborhoods, but they are not explicitly designed to select compact evidence for a fixed reader. We propose Structured Spatiotemporal Scene-Event Memory (S3Mem), a query-time memory interface that writes textual, visual, and agent-use histories into structured scene-event units and routes compact evidence packs to the reader. Its router scores candidate units, query anchors, and anchor-support links, enabling both single-hop selection and short multi-hop evidence chains without reader fine-tuning or test-time training. Across LoCoMo, EMemBench Visual Games, and AMA-Bench, S3Mem provides a strong score-token trade-off, with the clearest gains on localized event, state, temporal, causal, or provenance evidence. On LoCoMo, S3Mem reaches $0.48$ F1 and $0.40$ BLEU with $1{,}073$ evidence tokens per question, about $15.8\times$ fewer than the LoCoMo reference. On EMemBench Visual Games, it obtains the best F1 and second-best accuracy with only $189$ tokens. On AMA-Bench, it is not the highest-scoring method, but remains competitive while using the fewest reader-visible evidence tokens.

2606.00094 2026-06-09 cs.CV cs.AI version update

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

显式建模数据流形几何的扩散图像生成

Duoduo Xue, Zhiyu Zhu, Junhui Hou

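Affiliations * City University of Hong Kong

AI summary: Proposes the MIND framework, which explicitly models manifold geometry by integrating discrete patch tokenization into the score function of a continuous diffusion model, combining the structural quantization ability of discrete tokens with the parallel-generation flexibility of continuous diffusion, and substantially lowers FID on ImageNet 256×256.

AI Chinese abstract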

图像生成模型旨在从底层数据流形中采样数据点，这需要学习并解码一个密集、低维且紧凑的参数化空间。为此，我们提出了数据流形感知图像扩散模型（MIND），一种通过将离散补丁标记化集成到连续扩散模型的得分函数中来显式建模流形几何的新框架。该方法成功利用了离散标记的结构量化能力和连续扩散的并行生成灵活性。此外，我们通过一种新颖的软top-$k$聚合机制实现了端到端可微训练，并引入了双分支高频特征嵌入层以缓解Transformer主干网络在低维输入上的谱偏差。进一步地，在推理阶段，我们设计了一种多阶段过渡采样方案，根据时间步动态调整采样方案。在ImageNet 256×256上的大量实验证明了MIND的有效性。经过80个epoch的训练，我们的基础模型在无引导情况下实现了22.73的FID，几乎将原始DiT-B/2基线的43.47 FID减半。与基线DiT和SiT相比，所提方法平均分别降低了15.95和9.06的FID。对于ImageNet-256×256上的引导图像生成，所提MIND-B仅用130M参数就实现了2.06的FID，超过了具有3.1B参数的LlamaGen-3B。所提MIND-XL具有715M参数，进一步将FID降低至1.95。我们的MIND为基于扩散的图像生成引入了全新视角，为该领域的未来研究和创新铺平了道路。代码将公开提供。

English abstract

Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, low-dimensional, and compact parameterization space. To achieve this, we propose the Data Manifold-aware Image diffusioN moDel (MIND), a novel framework that explicitly models manifold geometry by integrating discrete patch tokenization into the score function of a continuous diffusion model. This approach successfully leverages both the structural quantification capabilities of discrete tokens and the parallel generation flexibility of continuous diffusion. Moreover, we enable end-to-end differentiable training via a novel soft top-$k$ aggregation mechanism and introduce dual-branch high-frequency feature embedding layers to alleviate the spectral bias of transformer backbones on low-dimensional inputs. Furthermore, for inference, we design a multi-stage transition sampling scheme that dynamically adjusts the sampling scheme based on timestep. Extensive experiments on ImageNet 256$\times$256 demonstrate the effectiveness of MIND. After 80-epoch training, our base model achieves an FID of 22.73 without guidance, nearly halving the 43.47 FID of the vanilla DiT-B/2 baseline. The proposed method reduces FID by 15.95 and 9.06 on average compared with the baselines DiT and SiT, respectively. For image generation on ImageNet 256$\times$256 with guidance, the proposed MIND-B with only 130M parameters achieves an FID of 2.06, surpassing the LlamaGen-3B with 3.1B parameters. The proposed MIND-XL with 715M parameters further reduces the FID to 1.95. Our MIND introduces a fresh perspective on diffusion-based image generation, paving the way for future research and innovation in this community. The code will be publicly available.

2606.01637 2026-06-09 cs.CL cs.AI version update

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

误导比纠正更容易：LLM 从众中的有害与有益修正

Jiaming Qu, Lucheng Fu, Yibo Hu

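Affiliations * Amazon; Georgia Institute of Technology; Illinois Institute of Technology

AI summary: Through controlled experiments, studies the conformity of large language models when they face peer answers in multi-agent systems; finds that peer agreement misleads initially correct models more easily than it corrects initially wrong ones, that authority labels make models more likely to choose the endorsed answer, and that generic reasoning interventions do not reliably reduce harmful revisions.

AI Chinese abstract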

大语言模型越来越多地用于多智能体系统，在这些系统中，它们会看到并回应其他智能体的答案。一个关键风险是从众：模型可能仅仅因为其他人同意不同的答案而放弃自己的答案。先前的研究表明，LLM 经常向多数答案修正，但仍不清楚这些修正是像引入新错误一样频繁地帮助纠正错误。在本文中，我们进行了一项受控研究，其中 LLM 首先回答一个问题，然后在做出最终决定之前看到模拟的同伴回应。我们操纵两个社会线索：共识结构和分配给同伴的权威标签，并测量它们如何影响有益和有害的修正。在四个开放权重的 LLM 和七个问答数据集上，我们发现同伴一致意见使得误导原本正确的模型比纠正原本错误的模型容易得多。权威标签使模型更可能选择被认可的答案，无论其是否正确。更令人担忧的是，通用的推理干预（如思维链和反思）并不能可靠地减少有害修正同时保留有益修正。这些发现表明，多智能体 LLM 系统应该验证同伴答案，而不是简单地聚合它们。

English abstract

Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open-weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain-of-thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi-agent LLM systems should verify peer answers rather than simply aggregate them.
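The distinction between harmful and beneficial revision can be made concrete with a small sketch. The definitions below are an assumption based on the abstract's wording: a harmful revision turns an initially correct answer into a wrong final answer, and a beneficial revision turns an initially wrong answer into a correct one. Each rate is normalized by the number of cases where that kind of revision was possible. The records are made up, and the paper's exact metrics may differ.

```python
def revision_rates(records):
    """records: list of (initial_correct, final_correct) boolean pairs."""
    init_correct = [r for r in records if r[0]]
    init_wrong = [r for r in records if not r[0]]
    harmful = sum(1 for _, final in init_correct if not final)
    beneficial = sum(1 for _, final in init_wrong if final)
    return {
        "harmful_rate": harmful / len(init_correct) if init_correct else 0.0,
        "beneficial_rate": beneficial / len(init_wrong) if init_wrong else 0.0,
    }

# Made-up outcomes after the model sees peer responses.
records = [
    (True, True), (True, False), (True, False), (True, True),      # initially correct
    (False, False), (False, True), (False, False), (False, False), # initially wrong
]
rates = revision_rates(records)
```

Reporting the two rates separately matters because a single "changed its answer" rate would hide the asymmetry the abstract describes, where misleading a correct model is easier than correcting a wrong one.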

2606.01736 2026-06-09 cs.CL cs.AI version update

Argument Collapse: LLMs Flatten Long-Form Public Debate

论点坍缩：LLMs 扁平化长篇公共辩论

Yekyung Kim, Yapei Chang, Chau Minh Pham, Mohit Iyyer

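Affiliations * University of Maryland, College Park

AI summary: Studies argument collapse in public-debate texts generated by large language models: essays produced by different models converge in their main arguments, sub-arguments, and paragraph structure, and a comparison of human-written and LLM-generated texts shows markedly lower argument diversity for LLMs.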

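All-Around View: Design and Analysis of 360-Degree LiDAR Perception with Equivariant Feature Learning in Unstructured Traffic

Pranav Darshan, Raghuveer Narayanan Rajesh, M Uttara Kumari

Affiliations * RV College of Engineering

AI summary: To address perception challenges in unstructured urban traffic, proposes a 360-degree LiDAR perception framework that combines sector-wise panoramic processing with rotation-equivariant sparse convolutions, and validates multi-class detection performance on an Indian urban traffic dataset.

AI Chinese abstract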

密集非结构化城市交通中的感知仍然是自动驾驶的主要挑战，原因是道路使用者种类繁多、频繁遮挡、不规则运动模式以及缺乏标准化的道路布局。尽管基于LiDAR的3D目标检测器在结构化驾驶场景中表现出色，但大多数是为有限视场设置开发和评估的，其在全环绕360度感知下的行为仍不明确。本文研究了用于自动驾驶的360度LiDAR感知流水线，特别关注全景感知、方位角扇形空间处理以及复杂城市场景中的变换等变特征提取。本文提出了一个实用的360度感知框架，将扇形全景处理与旋转等变稀疏卷积相结合，并在一个自定义的Ouster OS0 LiDAR数据集上评估其行为，该数据集收集自多样化的印度城市交通条件。结果显示，多个目标类别的检测总体稳定，其中汽车性能最强（92.02/90.51），公交车为80.53/76.34，卡车为78.59/74.16，而行人（67.45/61.02）、骑自行车者（73.21/69.54）和骑摩托车者（71.20/68.13）得分较低，反映了在密集城市场景中检测更小且更多变的道路使用者的更大难度。

English abstract

Perception in dense, unstructured urban traffic remains a major challenge for autonomous driving because of the wide variety of road users, frequent occlusions, irregular motion patterns, and the lack of standardized road layouts. Although recent LiDAR-based 3D object detectors have shown strong performance in structured driving scenarios, most are developed and evaluated for limited field-of-view settings, and their behavior under full-surround 360-degree sensing is still not well understood. This paper studies a 360-degree LiDAR perception pipeline for autonomous driving, with particular attention to panoramic sensing, azimuthal sector-wise spatial processing, and transformation-equivariant feature extraction in complex urban scenes. The paper presents a practical 360-degree perception framework that combines sector-wise panoramic processing with rotation-equivariant sparse convolutions and evaluates its behavior on a custom Ouster OS0 LiDAR dataset collected across diverse Indian urban traffic conditions. The results show generally stable detection across several object classes, with the strongest performance for cars at 92.02/90.51, buses at 80.53/76.34, and trucks at 78.59/74.16, while lower scores for pedestrians at 67.45/61.02, cyclists at 73.21/69.54, and motorcyclists at 71.20/68.13 reflect the greater difficulty of detecting smaller and more variable road users in dense urban scenes.

2606.07974 2026-06-09 cs.RO cs.AI cross-list

PRISM: PRior-guided Imagination Sampling in world Models

PRISM：世界模型中基于先验引导的想象采样

Yuhai Wang, Jiawei Xia, Rongxuan Zhou, Xiao Hu, Yongliang Shi, Jing Du, Yang Ye

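Affiliations * Northeastern University; University of California, Berkeley; Qiyuan Lab; University of Florida

AI summary: Proposes the PRISM framework, which extracts a state-conditioned Gaussian prior from the world model's encoder and updates the planner's sampling distribution with a precision-weighted product of Gaussians, substantially improving model-based continuous control without adding architectural complexity.

AI Chinese abstract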

学习到的世界模型为评估未来状态提供了强大的物理直觉。但其在连续控制中的有效性也关键取决于如何为基于模型的规划生成候选动作。我们不仅询问模型能多准确地模拟未来，还提出：哪些候选动作首先值得评估？现有规划器通常任意搜索或仅使用专家演示初始化采样均值，丢弃了专家的状态条件置信度。正确引导这一搜索需要鲁棒的动作先验，但当前方法常依赖独立的视觉编码器或大规模VLM来获取。我们认为这种架构膨胀是不必要的：完全相同的数据——以及世界模型本身学到的表示——内在地编码了智能体的动作直觉。我们提出PRISM，一个任务无关的框架，从单一数据集中提取两者，同时保持严格的架构简洁性。基于标准的JEPA风格潜在世界模型，PRISM直接在其冻结编码器上附加一个轻量级MLP，以预测状态条件高斯先验。在规划时，PRISM通过精度加权的高斯乘积更新将该先验融合到规划器的采样分布中。这种无参数、闭式整合引导采样过程，使先验在其自信处主导，在其不自信处放弃控制。PRISM在Cube上将基于世界模型的MPC成功率提升35个百分点，在PushT上提升32个百分点，且未引入显著推理开销。

English abstract

A learned world model provides a powerful physical intuition for evaluating future states, but its effectiveness in continuous control also depends critically on how candidate actions are generated for model-based planning. Rather than solely asking how accurately a model can simulate the future, we ask: which candidate actions are worth evaluating in the first place? Existing planners typically search arbitrarily or use expert demonstrations only to initialize a sampling mean, discarding the expert's state-conditioned confidence. Properly guiding this search requires a robust action prior, yet current approaches often rely on independent visual encoders or large-scale VLMs to obtain one. We argue that this architectural bloat is unnecessary: the exact same data, together with the learned representations of the world model itself, inherently encodes the agent's action intuition. We introduce PRISM, a task-agnostic framework that extracts both from a single dataset while maintaining strict architectural simplicity. Building on a standard JEPA-style latent world model, PRISM attaches a lightweight MLP directly to its frozen encoder to predict a state-conditioned Gaussian prior. At plan time, PRISM fuses this prior into the planner's sampling distribution via a precision-weighted Product-of-Gaussians update. This parameter-free, closed-form integration steers the sampling process, letting the prior dominate where it is confident and cede control where it is not. PRISM improves success rates by 35 percentage points over vanilla world-model-based MPC on Cube and 32 percentage points on PushT, without introducing significant inference overhead.
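The precision-weighted Product-of-Gaussians update mentioned above has a standard closed form, sketched here for a single action dimension. This is the textbook fusion of two Gaussians, not code from the paper: the function name and the numbers are illustrative, and a real planner would presumably apply it per dimension with diagonal covariances.

```python
def fuse_gaussians(mean_a, var_a, mean_b, var_b):
    """Product of two 1-D Gaussians: precisions add, means are precision-weighted."""
    prec_a, prec_b = 1.0 / var_a, 1.0 / var_b
    prec = prec_a + prec_b
    mean = (prec_a * mean_a + prec_b * mean_b) / prec
    return mean, 1.0 / prec

# Planner's current sampling distribution (illustrative numbers).
planner_mean, planner_var = 0.0, 1.0

# A confident prior (small variance) pulls the fused mean strongly toward itself.
confident = fuse_gaussians(planner_mean, planner_var, 1.0, 0.01)

# An unconfident prior (large variance) leaves the planner almost unchanged.
unconfident = fuse_gaussians(planner_mean, planner_var, 1.0, 100.0)
```

Because each distribution's influence is proportional to its precision (inverse variance), the update needs no tunable mixing coefficient. This matches the abstract's description of the integration as parameter-free and closed-form.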

2606.08014 2026-06-09 cs.CV cs.AI cross-list

GVC-Seg: Training-Free 3D Instance Segmentation via Geometric Visual Correspondence

GVC-Seg: 基于几何视觉对应的免训练3D实例分割

Liang Xu, Fangjing Wang, Jinyu Yang, Feng Zheng

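Affiliations * Victoria University of Wellington; Harbin Institute of Technology, Shenzhen; Southern University of Science and Technology

AI summary: Proposes GVC-Seg, a training-free 3D instance segmentation method that uses correspondence between geometric and visual features to remove confidence bias when ensembling multiple models, achieving state-of-the-art performance on several benchmarks.

Comments 10 pages, 5 figures

AI Chinese abstract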

点云数据中的精确3D实例分割对于机器视觉应用至关重要。最近的研究利用多个预训练基础模型生成3D提案，然后应用提案聚合方法，显著提升了性能。然而，由于不同分割模型之间置信度水平的固有差异，它们通常会产生次优结果，导致偏向于置信度更高的模型。这种偏差本质上是模型依赖的，并受到数据预处理技术和训练策略等因素的影响。为了解决这一偏差，我们提出了一种新颖的、免训练的3D实例分割方法，通过几何视觉对应（GVC-Seg）来利用3D几何线索与2D视觉线索之间的对应关系，以减轻置信度偏差。此外，在实例掩码生成和实例语义推理过程中，分别引入了3D提案生成模块和掩码感知的CLIP特征提取模块。通过这种方式，GVC-Seg增强了提案质量评估，确保了不同模型之间的无偏集成学习。大量实验表明，我们的方法在多个具有挑战性的基准上达到了最先进的性能，同时在开放词汇语义分割设置中也展现出强大的潜力。

English abstract

Accurate 3D instance segmentation in point cloud data is critical for machine vision applications. Recent advancements leverage multiple pre-trained foundation models to generate 3D proposals, followed by the application of proposal aggregation methods, which significantly enhance performance. However, they often produce sub-optimal results due to inherent variations in confidence levels across different segmentation models, resulting in a bias toward the model with higher confidence. This bias is inherently model-dependent and is influenced by factors such as data preprocessing techniques and training strategies. To address this bias, we propose a novel, training-free 3D instance segmentation approach via Geometric Visual Correspondence (GVC-Seg), which exploits the correspondence between 3D geometric cues and 2D visual cues to mitigate the confidence bias. Additionally, a 3D proposal generation module and a mask-aware CLIP feature extraction module are introduced during the instance mask generation and instance semantic reasoning, respectively. In this way, GVC-Seg enhances proposal quality assessment, ensuring unbiased ensemble learning across different models. Extensive experiments demonstrate that our method achieves state-of-the-art performance on several challenging benchmarks, while also exhibiting strong potential in open-vocabulary semantic segmentation settings.

2606.08057 2026-06-09 cs.RO cs.AI (cross-list)

EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets

Yichen Niu, Haoran Lv, Xinrui Zhang, Xueyao Wan, Shiyu Gao, Ying Ai, Hui Xu, Yongqi Hu, Hengyi Zhang, Yang Xie, Zhaxizhuoma, Yue Zhao, Zhenshan Bing, Yan Ding, Jianxing Liu

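Affiliations: School of Astronautics, Harbin Institute of Technology; Lumos Robotic; Suzhou Research Institute, Harbin Institute of Technology; Shanghai Jiao Tong University; Shanghai AI Lab; Nanjing University; Xi’an Jiaotong-Liverpool University; Fudan University

AI summary: Proposes EgoAERO, a framework that needs no object assets. From a single egocentric RGB-D video it reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego-motion compensation, and adaptive contact optimization, then converts them into robot policies with two-stage residual learning, enabling dexterous manipulation from a single demonstration.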

Abstract

Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for robot learning because object pose, geometry, and contact information are often missing or require pre-scanned object assets. We present EgoAERO, the first framework that learns dexterous manipulation from a single egocentric RGB-D human demonstration without object assets. EgoAERO reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego motion compensation, and adaptive contact optimization, then converts them into robot policies using two-stage residual learning. We further introduce an online quality assessment mechanism and construct EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames for dexterous policy learning. Simulation and real-world experiments show that EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to CAD-based reconstructions on HOI4D.

2606.08094 2026-06-09 cs.RO cs.AI cs.LG cs.SY eess.SY (cross-list)

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

Khanh D. Nguyen, Hung T. Ho, Chinh T. Nguyen, Thanh Q. Duong, Linh D. Le, Duy M. H. Nguyen, Vien A. Ngo, An T. Le

Affiliations: VinRobotics; Center for AI Research, VinUniversity; Intelligent Autonomous Systems, TU Darmstadt; Max Planck Research School for Intelligent Systems; University of Stuttgart; German Research Center for Artificial Intelligence

AI summary: Proposes vla.cpp, a portable C++ inference runtime built on llama.cpp that supports multiple VLA architectures, comes close to state-of-the-art performance on LIBERO-Object, runs BitVLA in 1.3 GiB of memory, and deploys across hardware tiers.

Comments 17 pages, 3 figures, 12 tables

Abstract

Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.
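
The roofline argument in the abstract can be illustrated with a minimal sketch. It assumes the textbook roofline model, in which attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The function names and hardware figures below are hypothetical and are not taken from vla.cpp.

```python
def attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Textbook roofline: throughput is capped by compute or by memory traffic.

    peak_flops           -- peak compute throughput (FLOP/s)
    mem_bandwidth        -- peak memory bandwidth (bytes/s)
    arithmetic_intensity -- FLOPs performed per byte moved (FLOP/byte)
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def is_compute_bound(peak_flops, mem_bandwidth, arithmetic_intensity):
    """A kernel is compute-bound once its intensity reaches the ridge point."""
    ridge_point = peak_flops / mem_bandwidth
    return arithmetic_intensity >= ridge_point


# Hypothetical device: 100 FLOP/s peak, 10 bytes/s bandwidth (ridge point = 10).
low_intensity = attainable_flops(100.0, 10.0, 5.0)    # limited by bandwidth
high_intensity = attainable_flops(100.0, 10.0, 20.0)  # limited by compute
```

When a workload sits at or beyond the ridge point, extra bandwidth does not help; only better utilization of the compute units does, which is the deployment lever the abstract points to.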

2606.08107 2026-06-09 cs.RO cs.AI (cross-list)

Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data

Ji Woong Kim, Ke Wang, Zipeng Fu, Sirui Chen, Cong Zhao, Jeff Lai, Chelsea Finn

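Affiliations: Stanford University; Meta

AI summary: To address the scarcity of robot data, fine-tunes the π₀.₅ model on egocentric human data so that robots learn the semantics of new tasks and compose existing skills without corresponding robot data.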

2606.08169 2026-06-09 cs.RO cs.AI cs.CL cs.HC cs.LG (cross-list)

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

Markus Knauer, Valentin Gieraths, Tai Mai, Samuel Bustamante, Alin Albu-Schäffer, Freek Stulp, João Silvério

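Affiliations: German Aerospace Center (DLR), Institute of Robotics and Mechatronics (RMC); Technical University of Munich (TUM)

AI summary: Proposes the CLASP architecture, which combines task-parameterized kernelized movement primitives (TP-KMPs) with a pretrained vision-language model (VLM) to perform skill selection, composition, and active learning from natural-language commands without fine-tuning, reaching 73.3%-100% success on a 7-DoF manipulator.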

Comments 23 pages, 11 figures, 4 tables, 1 listing

Abstract

Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.
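
The abstract does not spell out how covariance-weighted composition is computed. A minimal sketch, assuming it behaves like a product of Gaussians in which each skill's prediction is weighted by its inverse variance, is shown below for the one-dimensional case. The function name and the numbers are hypothetical.

```python
def fuse_gaussians(means, variances):
    """Precision-weighted fusion of 1-D Gaussian predictions.

    Each prediction contributes in proportion to 1 / variance, so a skill
    that is confident (small variance) at this point dominates the result.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_var = 1.0 / total_precision
    fused_mean = fused_var * sum(p * m for p, m in zip(precisions, means))
    return fused_mean, fused_var


# Two hypothetical skills predict the same coordinate with equal confidence.
equal_mean, equal_var = fuse_gaussians([0.0, 1.0], [1.0, 1.0])
# The first skill is four times more confident than the second.
skewed_mean, skewed_var = fuse_gaussians([0.0, 10.0], [1.0, 4.0])
```

With equal variances the fused mean is the midpoint; with unequal variances it moves toward the more confident skill, which is the behaviour one would want when blending skills that are each reliable in different regions.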

2606.08414 2026-06-09 cs.RO cs.AI (cross-list)

HARBOR: A Framework for Agentic Robot Reinforcement Learning

Zechu Li, Yufeng Jin, Xiaoyang Liu, Puze Liu, Vignesh Prasad, Carlo D'Eramo, Georgia Chalvatzaki

Affiliations: TU Darmstadt; Honda Research Institute Europe; Columbia University; Tongji University; Shanghai Research Institute for Intelligent Autonomous Systems; University of Würzburg; Hessian.AI

AI summary: Proposes HARBOR, which treats robot RL automation as a harness-engineering problem and uses specialized agents, standardized commands, and reusable knowledge to automate the full workflow from environment setup to policy training in simulation, validated on 6 benchmarks and 16 tasks.

Abstract

Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, shaping rewards, and tuning hyperparameters require substantial expert effort, making RL workflows costly and difficult to scale. We introduce HARBOR, an agentic framework that frames robot RL automation as a harness-engineering problem: given a simulator codebase and a task specification, it automates the workflow from environment setup to policy training in simulation. HARBOR decomposes such high-level objectives into bounded stages executed by specialized agents through standardized commands, persistent artifacts, executable gates, and reusable knowledge, and scales iteration via decentralized parallel trials and experience learning across runs. We evaluate HARBOR across 6 benchmarks and 16 tasks in total, spanning manipulation, locomotion, and bimanual dexterous control. We demonstrate that HARBOR automates the simulation RL workflow end-to-end, designs rewards, tunes algorithms to match or improve over default configurations, and reduces engineering effort at practical token and wall-clock cost; the resulting policies can also be transferred to real robots.

2606.08653 2026-06-09 cs.CV cs.AI cs.LG cs.RO (cross-list)

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du

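Affiliations: University of Chinese Academy of Sciences; Hebei Key Laboratory of Cognitive Intelligence, Xiong’an Institute of Innovation; Hebei University of Technology; Beijing Information Science and Technology University

AI summary: Proposes FiberTune, which uses an online action probe to filter out action-predictive feature directions, aligns the remaining visual residuals to a frozen teacher, and regularizes their effective rank, improving VLA policy performance in six simulation settings and on a physical task.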

Comments Project page: https://fibertune.github.io/

Abstract

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
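
The abstract does not say which definition of effective rank is regularized. A minimal sketch, assuming the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values), shows what the quantity measures. The function name and inputs are hypothetical, and the singular values are taken as given rather than computed from features.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp of the Shannon entropy of the
    singular values after normalizing them to sum to one."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


flat = effective_rank([1.0, 1.0, 1.0, 1.0])       # spectrum spread evenly
collapsed = effective_rank([1.0, 0.0, 0.0, 0.0])  # one dominant direction
```

An evenly spread spectrum gives an effective rank close to the number of directions, while a collapsed representation gives a value close to one, so a regularizer that rewards higher effective rank pushes against the collapse the abstract describes.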

2606.08657 2026-06-09 cs.RO cs.AI (cross-list)

Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation

Zhexuan Zhou, Yichen Lai, Jinhao Zhang, Huizhe Li, Youmin Gong, Jie Mei

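Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes LDP, a two-stage framework that absorbs scene understanding into a CVAE encoder and performs flow matching in a pre-concentrated latent space, simplifying learning and improving performance on multi-arm coordination tasks.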

Abstract

Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within a single denoising process. The resulting velocity field must simultaneously encode scene information and generate precise trajectories, increasing learning complexity and limiting performance on tasks demanding precise temporal coordination across multiple arms. To simplify this joint learning problem, we introduce Latent Diffusion Policy (LDP), a two-stage framework performing flow matching in a deliberately shaped latent space. By absorbing scene understanding into an observation-conditioned CVAE encoder, LDP concentrates the conditional distribution of each observation. Consequently, the flow model avoids implicitly resolving scene-dependent structures; instead, it generates within a pre-concentrated distribution featuring a smoother velocity field, simplifying learning from limited demonstrations. Furthermore, to capture temporal dependencies among latent tokens, LDP trains with per-token diffusion forcing and employs staircase inference sampling to resolve the resulting distributional mismatch. We also propose reconstruction FID (rFID) as a lightweight proxy predicting downstream task success solely from latent space statistics. On coordination-intensive tasks from RoboTwin 2.0, LDP outperforms DP3 by a substantial margin and transfers effectively to real-world bimanual deployments.

2606.08714 2026-06-09 eess.SY cs.AI cs.LG cs.RO cs.SY (cross-list)

Hybrid Neural Network and Conventional Controller Approach for Robust Control of Highly Unstable Systems: Application to Tilt-Rotor Control

Ali Kafili Gavgani, Amin Talaeizadeh, Aria Alasty, Hossein Nejat Pishkenari

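Affiliations: Advanced Research Lab for Control and Agricultural Robotics (Sharif AgRoLab); Department of Mechanical Engineering, Sharif University of Technology, Tehran, Iran

AI summary: Proposes a neural-network-enhanced sliding mode controller that decomposes the system dynamics into input-independent and input-dependent parts and learns the former from a small dataset with lightweight networks, achieving robust control of a fully actuated tilt-rotor system; the LSTM variant outperforms the MLP variant.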

Comments Proceedings of the 13th RSI International Conference on Robotics and Mechatronics (ICRoM 2025)

DOI: 10.6084/m9.figshare.32572083

Abstract

Multirotors are widely used in applications ranging from surveillance to precision agriculture, yet conventional designs remain limited by their under-actuation. Tilt-rotor configurations overcome this limitation by enabling full actuation. This paper investigates neural-network-based control strategies for a fully actuated tilt-rotor system with four thrust-vectoring inputs. Our work is structured in two parts. First, we deliberately present a negative result by evaluating a direct input-output control approach. In this method, multilayer perceptrons (MLPs), long short-term memory (LSTM) networks, and transformer models are trained to map system states and their desired values directly to control signals. We show that this strategy fails to stabilize the system, highlighting the inherent difficulty of applying direct input-output learning to highly unstable plants. Second, as the main contribution, we propose a neural-network-enhanced sliding mode controller (SMC). The method decomposes the system dynamics into input-independent and input-dependent components, with the former learned from a small dataset using lightweight networks, thereby reducing real-time computational demands. Moreover, the proposed method can be trained using flight logs collected from low-performance controllers, and the resulting dynamic model learned from real-world data can be used in simulation. We further compare MLP- and LSTM-based implementations under model uncertainties and external disturbances, demonstrating the robustness and effectiveness of the proposed approach; in particular, the controller with the LSTM plant dynamics predictor achieves superior performance to its MLP-based counterpart while also exhibiting lower runtime.
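
As a hedged illustration of the decomposition described above (not the authors' controller), consider a scalar plant with assumed dynamics $\dot{x} = f(x) + g(x)\,u$, where $f$ is the input-independent part, $g \neq 0$ the input-dependent part, $\hat{f}$ a learned estimate of $f$, and $x_d$ the reference. The symbols, the first-order sliding surface, the switching gain $k$, and the model-error bound $\Delta$ are all assumptions made for this sketch.

```latex
\begin{aligned}
s &= x - x_d, \\
u &= g(x)^{-1}\left(-\hat{f}(x) + \dot{x}_d - k\,\operatorname{sign}(s)\right), \\
\dot{s} &= f(x) + g(x)\,u - \dot{x}_d
         = \bigl(f(x) - \hat{f}(x)\bigr) - k\,\operatorname{sign}(s), \\
s\,\dot{s} &\le |s|\,\Delta - k\,|s| = -(k - \Delta)\,|s|
  \quad \text{if } \bigl|f(x) - \hat{f}(x)\bigr| \le \Delta .
\end{aligned}
```

Choosing $k > \Delta$ makes $s\,\dot{s}$ negative whenever $s \neq 0$, so the tracking error is driven to the surface despite the residual model error. Under these assumptions, a more accurate learned $\hat{f}$ permits a smaller switching gain, which is one reason to learn only the input-independent part and leave the rest to the conventional controller.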

2606.08775 2026-06-09 cs.RO cs.AI (cross-list)

Real-Time Nonverbal Communication from Body Pose with a Consistency-Based Reliability Measure

Alina Marcu, Dragos Costea, Cristina Lazar, Marius Leordeanu

Affiliations: National University of Science and Technology "Politehnica" Bucharest; Simion Stoilow Institute of Mathematics of the Romanian Academy; NORCE Norwegian Research Centre AS

AI summary: Studies recognition of communicative intent from 2D body pose alone, proposes autoregressive self-consistency as an unsupervised reliability signal, and achieves real-time performance on an embedded GPU.

Abstract

Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recognition of communicative intent from 2D body pose alone. We argue that body motion is a reliable signal, especially in scenarios that require real-time, low-cost, on-device person-to-robot communication over long distances, such as rescue missions. However, existing resources do not isolate this signal. Affective corpora combine body, face, voice and text, while skeleton action-recognition benchmarks label the action performed rather than the message conveyed. We release a dataset of real frames of full-body pose covering ten communicative intents and we compare it against other real (IPC) and synthetic (MotionLCM, VEO3.1, Kimodo) ones that span a range of difficulty. We target systems that can run on a robot's limited onboard hardware. We benchmark multiple models, from skeleton graph classifiers to joint motion-forecasting networks, and report performance metrics together with frame rate on an embedded GPU (NVIDIA Orin Nano), since speed matters as much as accuracy in our scenario. Finally, we show that a model's own autoregressive self-consistency works as an unsupervised reliability signal. We give a short proof that bounds the probability that a self-consistent prediction is correct, show that this probability grows with the number of consistent steps, and identify the conditions under which a confident prediction can still be false, benchmarked against industry-standard metrics.

2606.09416 2026-06-09 cs.RO cs.AI cs.SE (cross-list)

Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer

Sanghoon Lee, Jiyeong Chae, Kyung-Joon Park

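Affiliations: Daegu Gyeongbuk Institute of Science and Technology (DGIST)

AI summary: Argues that robot middleware is the harness layer for Physical AI, which must mediate control, computing, and communication simultaneously, and names three missing enforcement functions (Projection, Isolation, and Transfer), sketched as a ROS 2 Harness Profile.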

Comments 6 pages, 2 figures, 2 tables. Big Ideas track submission to the 27th ACM/IFIP International Middleware Conference (Middleware 2026)

Abstract

Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent language-agent work names this layer the harness, the external system that mediates tools, manages state, bounds resources, and records execution. The robotics community has not yet adopted this framing, and we propose that robot middleware is that harness. A Physical AI harness differs from a software harness in where it intervenes. A software harness mediates at tool-call boundaries. A Physical AI harness must mediate at control, computing, and communication simultaneously, because a learned policy's output crosses all three: its commands shift the trajectory, its inference time shifts the schedule, and its payload shifts the bandwidth. Robot middleware is the lowest robot-stack layer with mediating abstractions over all three, so it is best positioned to compose their enforcement. It already provides most of what a harness needs but lacks the enforcement for an AI model. We name this missing enforcement as three functions: Projection gates each output at emission, Isolation bounds the model's execution and transmission slot, and Transfer falls back to a verified baseline when checks fail. Each appears today as hand-built application code in deployed robot systems, built on surfaces robot middleware already provides. Robot middleware should host them not as the best single-axis enforcer but as the layer that composes all three. We sketch this as a ROS 2 Harness Profile, a deployment artifact that carries an AI model's declared output region, inference budget, and operating regime while the middleware enforces them across ROS 2, DDS, and Zenoh.
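
The three enforcement functions named above can be sketched in a few lines. This is an illustration under stated assumptions, not a ROS 2, DDS, or Zenoh API: the `harness_step` function, the profile field names, and the numbers are hypothetical, and a real middleware would enforce the time budget in its scheduler rather than check it after the fact.

```python
def harness_step(model_output, inference_ms, profile, baseline_output):
    """Return (command, source) after applying the three checks.

    profile is a hypothetical dict with:
      "output_region"       -- list of (low, high) bounds, one per output dimension
      "inference_budget_ms" -- maximum allowed inference time in milliseconds
    """
    # Projection: gate the output against the declared output region.
    in_region = len(model_output) == len(profile["output_region"]) and all(
        low <= value <= high
        for value, (low, high) in zip(model_output, profile["output_region"])
    )
    # Isolation: bound the model's execution slot by its inference budget.
    in_budget = inference_ms <= profile["inference_budget_ms"]
    # Transfer: fall back to the verified baseline when any check fails.
    if in_region and in_budget:
        return model_output, "model"
    return baseline_output, "baseline"


profile = {"output_region": [(-1.0, 1.0), (-0.5, 0.5)], "inference_budget_ms": 20.0}
accepted = harness_step([0.2, 0.1], 12.0, profile, [0.0, 0.0])
rejected = harness_step([0.2, 0.9], 12.0, profile, [0.0, 0.0])
```

Keeping the declared region and budget in one profile object mirrors the idea of a deployment artifact that travels with the model, so the same checks apply wherever the model is deployed.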

2606.09572 2026-06-09 cs.RO cs.AI (cross-list)

CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control

Jiacheng Li, Yize Guo, Jiabin Guo, Qingchen Liu, Jiahu Qin

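Affiliations: University of Science and Technology of China; AIRLab, Department of Automation

AI summary: Proposes CT-VAM, which fuses heterogeneous inputs with the TARS conditional attention decoder and, with 68M parameters, achieves LIBERO success rates comparable to large VLA models while reducing inference latency and supporting high-frequency control.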

Abstract

Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-action model (CT-VAM) for efficient task-conditioned visuomotor control. CT-VAM acts as a compact local execution policy that predicts action chunks from dual-view visual observations, proprioception, and a lightweight task condition, potentially enabling a practical cloud-edge paradigm in which high-level semantic reasoning can be handled by large models while fast closed-loop control runs on local hardware. To fuse heterogeneous inputs effectively, CT-VAM introduces TARS (Thalamic Action Routing Stream), a stream-separated conditional attention decoder that independently routes action, visual and task streams, preventing dense sensory tokens from overwhelming compact task-relevant conditions. With only 68M parameters, CT-VAM achieves LIBERO success rates competitive with substantially larger VLA models, while reducing inference latency. Together with flow-consistent inpainting for asynchronous chunk execution, CT-VAM supports high-frequency control and demonstrates robust real-world deployment on resource-constrained robotic platforms.

2606.09630 2026-06-09 cs.RO cs.AI cs.LG (cross-list)

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang, Kei Suzuki, Matthew Brand, Toshiaki Koike-Akino

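Affiliations: University of Southern California; Mitsubishi Electric Research Laboratories (MERL); Harvard University

AI summary: Proposes ReCoVLA, which keeps a pretrained VLA policy frozen, uses an external VLM to infer failure modes and compile structured rewards, and trains residual recovery policies that deploy zero-shot from simulation to the real world, improving success rates across a range of manipulation tasks.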

Comments 19 pages, 7 figures

Abstract

Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control to support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $π_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.
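
One way to picture a reward compiled from a mask over task-relevant components is the sketch below. It assumes the structured reward is a weighted sum of the components the mask selects; the component names, the weights, and the `compile_reward` function are hypothetical and not taken from the paper.

```python
def compile_reward(components, mask, weights=None):
    """Sum the reward components that the mask switches on.

    components -- dict mapping component name to its current scalar value
    mask       -- dict mapping component name to True/False (selected or not)
    weights    -- optional dict of per-component weights (default 1.0)
    """
    weights = weights or {}
    return sum(
        weights.get(name, 1.0) * value
        for name, value in components.items()
        if mask.get(name, False)
    )


# Hypothetical components for a "regrasp" recovery stage.
components = {"reach": 0.5, "grasp": 1.0, "lift": 0.25}
mask = {"reach": True, "grasp": True, "lift": False}
reward = compile_reward(components, mask, weights={"grasp": 2.0})
```

Because the VLM only chooses which components are active, the numeric reward terms stay hand-specified and checkable, which matches the abstract's point that the VLM acts as a selector rather than generating rewards directly.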

2606.09634 2026-06-09 cs.CV cs.AI (cross-list)

QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents

Jordan Abi Nader, David Lee, Nathaniel Dennler, Andreea Bobu

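AI summary: Proposes QuickLAP, a Bayesian framework that fuses physical and language feedback to infer reward functions in real time, using large language models to extract reward-feature attention masks and preference shifts. In a semi-autonomous driving simulator it reduces reward learning error by over 70%, and a user study confirms that it is more understandable and collaborative.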

Abstract

Robots must learn from both what people do and what they say, but either modality alone is often incomplete: physical corrections are grounded but ambiguous in intent, while language expresses high-level goals but lacks physical grounding. We introduce QuickLAP: Quick Language-Action Preference learning, a Bayesian framework that fuses physical and language feedback to infer reward functions in real time. Our key insight is to treat language as a probabilistic observation over the user's latent preferences, clarifying which reward features matter and how physical corrections should be interpreted. QuickLAP uses Large Language Models (LLMs) to extract reward feature attention masks and preference shifts from free-form utterances, which it integrates with physical feedback in a closed-form update rule. This enables fast, real-time, and robust reward learning that handles ambiguous feedback. In a semi-autonomous driving simulator, QuickLAP reduces reward learning error by over 70% compared to physical-only and heuristic multimodal baselines. A 15-participant user study further validates our approach: participants found QuickLAP significantly more understandable and collaborative, and preferred its learned behavior over baselines. Code is available at https://github.com/MIT-CLEAR-Lab/QuickLAP.

2602.21172 2026-06-09 cs.AI cs.CV (updated version)

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

Ishaan Rawal, Shubh Gupta, Yihan Hu, Wei Zhan

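Affiliations: Applied Intuition; Texas A&M University; UC Berkeley

AI summary: Proposes NoRD, which is fine-tuned on <60% of the data with no reasoning annotations and incorporates Dr. GRPO to overcome difficulty bias, matching the performance of existing VLA models while substantially reducing data and compute overhead.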

Comments Accepted to CVPR 2026. Code available at: https://github.com/Applied-Open-Source/nord

Abstract

Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures. However, current VLAs face two expensive requirements: (1) massive dataset collection, and (2) dense reasoning annotations. In this work, we address both challenges with NoRD (No Reasoning for Driving). Compared to existing VLAs, NoRD achieves competitive performance while being fine-tuned on <60% of the data and no reasoning annotations, resulting in 3x fewer tokens. We identify that standard Group Relative Policy Optimization (GRPO) fails to yield significant improvements when applied to policies trained on such small, reasoning-free datasets. We show that this limitation stems from difficulty bias, which disproportionately penalizes reward signals from scenarios that produce high-variance rollouts within GRPO. NoRD overcomes this by incorporating Dr. GRPO, a recent algorithm designed to mitigate difficulty bias in LLMs. As a result, NoRD achieves competitive performance on Waymo and NAVSIM with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems. Website: https://nord-vla-ai.github.io/
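
The difficulty bias mentioned above can be illustrated with a simplified sketch. It assumes the commonly described difference between the two estimators: GRPO divides group-centered rewards by the group's standard deviation, while Dr. GRPO drops that division (it also removes a length normalization, which is omitted here). The function names and reward values are hypothetical, and this is not the paper's training code.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-centered rewards divided by the group standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Group-centered rewards without the standard-deviation division."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


# Two hypothetical rollout groups with the same mean but different spread.
low_variance = [0.4, 0.6, 0.4, 0.6]
high_variance = [0.0, 1.0, 0.0, 1.0]
```

With the division, both groups yield advantages of roughly ±1, so the larger reward spread of the second group is scaled away. Without it, the second group's advantages are about five times larger than the first group's, so high-variance scenarios keep their stronger learning signal.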

2605.14211 2026-06-09 cs.AI cs.LG (updated version)

ASH: Agents that Self-Hone via Embodied Learning

Benjamin Schneider, Xavier Schneider, Victor Zhong, Sun Sun

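Affiliations: University of Waterloo; National Research Council Canada

AI summary: Proposes ASH, a system that learns embodied policies from unlabeled internet video through a self-improvement loop and an inverse dynamics model, substantially outperforming baselines on long-horizon tasks.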

Comments Published as a workshop paper at ICML 2026 Workshop on Scalable Learning and Optimization for Efficient Multimodal AI Agents

详情

AI中文摘要

长时域具身任务仍然是AI中的一个基本挑战，因为当前方法依赖于手工设计的奖励或带动作标签的演示，两者都无法扩展。我们引入了ASH，一个智能体系统，它从无标签、嘈杂的互联网视频中学习具身策略，无需奖励塑造或专家注释。ASH遵循自我改进循环；当它卡住时，ASH从其自身轨迹中学习逆动力学模型（IDM），并利用其IDM从相关互联网视频中提取监督信号。ASH使用无监督学习从大规模互联网视频中识别关键时刻，并将其保留为长期记忆——使其能够处理长时域问题。我们在两个需要多小时规划的互补环境中评估ASH：回合制角色扮演游戏《宝可梦绿宝石》和实时动作冒险游戏《塞尔达传说：缩小帽》。在这两个游戏中，行为克隆、检索增强和零样本基础模型基线趋于平稳，而ASH在我们的8小时评估中持续进步。ASH在《宝可梦绿宝石》中平均达到11.2/12个里程碑，在《塞尔达传说》中平均达到9.9/12个里程碑，而最强基线在两个环境中分别卡在平均6.5/12和6.0/12个里程碑。我们证明了自我改进的智能体是长时域具身学习的可扩展方案。

英文摘要

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop: when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories and uses that IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory, allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented, and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of $11.2/12$ milestones in Pokemon Emerald and $9.9/12$ in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of $6.5/12$ and $6.0/12$ milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.


2601.02085 2026-06-09 cs.RO cs.AI 版本更新

Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots

基于视觉的草莓采摘机器人早期故障诊断与自恢复

Meili Sun, Chunjiang Zhao, Lichao Yang, Hao Liu, Shimin Hu, Ya Xiong

发表机构 * NERCITA

AI总结针对草莓采摘机器人视觉感知差、夹爪错位、空抓/误抓和滑落等问题，提出视觉故障诊断与自恢复框架，通过SRR-Net统一感知、相对误差补偿、微光学相机反馈及LSTM滑落预测，实现高精度定位与故障恢复。

Comments Accepted by Artificial Intelligence in Agriculture

详情

DOI: 10.1016/j.aiia.2026.05.009

AI中文摘要

草莓采摘机器人面临视觉感知差、夹爪错位、空抓/误抓和滑落等挑战，降低了采摘稳定性和效率。为解决这些问题，本文提出了一种视觉故障诊断与自恢复框架。端到端SRR-Net通过联合检测、分割和果实与夹爪的成熟度回归，实现了统一感知和故障诊断。利用这种集成感知，设计了一种由目标-夹爪同步检测驱动的相对误差补偿方法，以纠正超过容差阈值的位置错位。集成在末端执行器内的微光学相机提供实时视觉反馈。基于微光学相机，在放气阶段使用MobileNet V3-Small分类器进行夹爪调整，能够在空抓/误抓情况下提前中止采摘周期。此外，在拉断阶段应用时间序列LSTM分类器预测草莓滑落。基于这些预测，系统对滑落草莓执行重新充气和二次拉断尝试，或对已滑落草莓中止周期。实验表明，末端执行器与采摘点之间的平均绝对误差沿x轴和y轴分别从11.50 mm和5.25 mm降低到3.12 mm和4.06 mm，时间增加0.64 ± 0.24秒。夹爪调整模块将抓取阶段缩短约0.5秒，并避免了失败情况下的空放置。草莓滑落预测模块以88.89%的成功率处理滑落情况，每个采摘周期为失败情况节省约4.00秒。同时，对滑落草莓实现了81.25%的恢复率，重新抓取需要额外0.63秒。

英文摘要

Strawberry-harvesting robots faced challenges such as poor visual perception, gripper misalignment, empty grasp/misgrasp, and slippage, which reduced harvesting stability and efficiency. To overcome these issues, this paper proposes a visual fault diagnosis and self-recovery framework. An end-to-end SRR-Net achieved unified perception and fault diagnosis through joint detection, segmentation, and ripeness regression of the fruit and gripper. Leveraging this integrated perception, a relative error compensation method driven by simultaneous target-gripper detection was designed to correct positional misalignments exceeding the tolerance threshold. A micro-optical camera integrated within the end-effector delivered real-time visual feedback. Based on the micro-optical camera, a MobileNet V3-Small classifier was utilized for grasp adjustment during the deflating stage, enabling the early abort of the harvesting cycle in cases of empty grasps or misgrasps. Furthermore, a time-series LSTM classifier was applied during the snap-off stage to predict strawberry slippage. Based on these predictions, the system executed re-inflation and a secondary snap-off attempt for slipping strawberries, or aborted the cycle for slipped strawberries. Experiments demonstrated that the mean absolute errors between the end-effector and the picking point were reduced from 11.50 mm and 5.25 mm to 3.12 mm and 4.06 mm along the x- and y-axes, respectively, at the cost of a time increment of 0.64 $\pm$ 0.24 s. The grasp adjustment module reduced the grasping phase by approximately 0.5 s and avoided empty placement for failure cases. The strawberry slip prediction module handled slipped cases with an 88.89% success rate, saving approximately 4.00 s per harvesting cycle for failure cases. It also achieved an 81.25% recovery rate for slipping strawberries, requiring an additional 0.63 s for re-grasping.


2605.30226 2026-06-09 cs.RO cs.AI 版本更新

BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models

BORA: 弥合离线强化学习与在线残差适应以实现真实世界灵巧VLA模型

Zhongxi Chen, Yifan Han, Yanming Shao, Huanming Liu, Congsheng Xu, Xiaoyu Chen, Yao Mu, Wenzhao Lian

发表机构 * Shanghai Jiao Tong University（上海交通大学）； CASIA（中国科学院自动化研究所）； Shanghai AI Laboratory（上海人工智能实验室）； USTC（中国科学技术大学）

AI总结提出BORA框架，通过离线构建动作条件价值引导的评论家，并结合在线冻结VLA基础、引入人类在环的分块残差适应机制，解决灵巧操作中高维探索导致的时间不一致、样本低效和硬件风险问题，在五个真实灵巧任务上平均成功率提升33%。

Comments 24 pages,11 figures

详情

AI中文摘要

视觉-语言-动作（VLA）模型已成为将视觉-语言理解融入真实世界机器人操作的一种有前景的范式。然而，由于高维手部控制和复合执行误差，灵巧操作对VLA策略仍然具有挑战性，这使得真实世界的强化学习后训练对于弥合视觉基础动作生成与物理可靠灵巧执行之间的差距至关重要。然而，高维灵巧探索常常引发真实世界中的时间不一致性、样本低效和硬件风险。为应对这些挑战，我们提出BORA，一种为真实世界灵巧VLA模型设计的离线到在线强化学习后训练框架。在离线阶段，BORA构建一个以VLM的认知令牌和动作块作为输入的评论家。这种设计实现了动作条件价值引导，使评论家能够评估超越视觉上下文的灵巧手部运动。在随后的在线阶段，BORA冻结VLA基础，并引入一种轻量级、人类在环（HiL）的分块残差适应机制，以减轻真实世界执行误差并进一步在真实物理环境中纠正离线学习到的意图。通过继承离线评论家并采用干预驱动奖励，BORA有效纠正执行差异并适应真实世界物理变化，同时将预训练策略作为稳定先验。在五个复杂真实世界灵巧任务上的广泛评估表明，BORA显著优于纯模仿学习和传统解耦强化学习基线，在标准设置下平均成功率绝对提升33%，在未见物体泛化中提升高达43%。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. However, high-dimensional dexterous exploration often triggers temporal inconsistency, sample inefficiency and hardware risks in the real world. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. During the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism to mitigate real-world execution errors and further correct the offline-learned intents within the actual physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA effectively corrects execution discrepancies and adapts to real-world physical variances while preserving the pretrained policy as a stable prior. Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.


2606.00229 2026-06-09 cs.RO cs.AI cs.LG 版本更新

Continuous Reasoning for Vision-Language-Action

视觉-语言-动作的连续推理

Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

发表机构 * Airoa

AI总结针对视觉-语言-动作策略中语言与连续控制粒度不匹配的问题，提出一种可共享、可验证的连续推理方法，通过高斯潜变量接口和自验证目标提升机器人任务成功率。

Comments Project page: https://continuous-reasoning.airoa.io

详情

AI中文摘要

自然语言是语言模型和视觉-语言模型强大的推理媒介，但与连续控制的粒度不匹配。文本和显式子目标在任务级粒度上操作，而视觉-语言-动作（VLA）策略必须在更细的时间尺度上选择动作；因此，单个推理步骤可能跨越多个动作块，同时与当前所需动作保持弱耦合。这为VLA提出了一个不同的问题：什么应该扮演语言的角色？我们认为，有用的VLA推理媒介必须能够在模型实例之间共享，通过下游动作改进进行验证，并与时间扩展的控制结构对齐。基于这一观点，我们提出了视觉-语言-动作的连续推理。我们的模型首先以结构化连续思想集的形式预测连续推理，然后将其重用为块结构动作生成的共享上下文。仅凭更好的动作预测并不能证明推理的有效性：如果相同的内部媒介不能在模型实例之间共享，并且不能通过改进的下游控制独立验证，那么添加的潜变量可能只是模型私有的捷径，有助于在已见行为上表现，而不支持泛化的控制。因此，我们将连续推理实例化为一个共享的高斯潜变量接口，并使用自验证目标进行训练，其中指数移动平均教师必须在预测目标动作时成功消费学生的推理。实验上，连续推理提高了LIBERO-PRO的鲁棒性，并在真实机器人上表现强劲，在TX-G2（一种AgiBot G2兼容变体）上平均子任务成功率比π0.5提高了40.4%，在HSR上提高了26.3%。这表明VLA中的推理更多是关于一个可共享、可验证的内部动作语言，而不是额外的标记。

英文摘要

Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over π0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action.
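
The exponential-moving-average teacher mentioned above can be sketched as follows. This is only the generic EMA weight update under assumed names (`ema_update`, `decay`); the abstract does not give the decay value or the parameter layout, and the self-verification loss itself is not reproduced here.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the matching student weight.

    teacher, student: equal-length lists of floats (stand-ins for model weights).
    decay: hypothetical smoothing factor; values closer to 1 change the teacher more slowly.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


if __name__ == "__main__":
    teacher = [0.0, 0.0]
    student = [1.0, 2.0]
    for _ in range(3):
        teacher = ema_update(teacher, student, decay=0.9)
    print(teacher)  # the teacher drifts toward the student without jumping to it
```

Because the teacher lags behind the student, it acts as a slowly changing second model instance that has to make use of the student's reasoning, which is the property the abstract relies on for verification.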


2606.01478 2026-06-09 cs.RO cs.AI cs.MA cs.SY eess.SY 版本更新

Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX

Crazyflow: 基于JAX的精确、GPU加速、可微分的无人机模拟器

Martin Schuck, Marcel P. Rath, Yufei Hua, Abhishek Goudar, SiQi Zhou, Angela P. Schoellig

发表机构 * Technical University of Munich（慕尼黑工业大学）； University of Toronto（多伦多大学）； Simon Fraser University（西蒙弗雷泽大学）

AI总结提出Crazyflow模拟器，通过GPU加速和可微分设计，实现单机超高速仿真、数千架无人机集群模拟，并支持基于解析梯度的策略学习与采样避障，甚至能在0.38秒内从零训练飞行恢复策略。

Comments Fix minor metadata mistakes

详情

AI中文摘要

来自仿真的高质量、大规模合成数据正成为推动机器人算法能力提升的基石。虽然空中机器人模拟器已独立发展出支持保真度、可微分性和集群等专门需求，但缺少一个能够跨所有领域合成数据的统一平台。在这项工作中，我们提出了Crazyflow，一个旨在突破空中机器人算法开发极限的模拟器，涵盖从基于模型到数据驱动的方法、从基于梯度到基于采样的方法、以及从单智能体到多智能体系统。与现有最先进的无人机模拟器相比，它实现了单个无人机超过一个数量级的速度提升，并能模拟数千个包含4000架无人机的集群。真实世界实验表明，Crazyflow既支持基于解析梯度的策略学习（无需域随机化即可实现亚厘米级轨迹跟踪精度），也支持每秒超过5亿步的采样避障。打破传统的先训练后部署范式，我们展示了其前所未有的速度甚至能够实现飞行中的强化学习：通过将物理无人机抛向空中，在0.38秒内从零开始训练恢复策略，成功稳定了无人机。Crazyflow支持多级仿真抽象，直接兼容所有开源Crazyflie模型，并通过提供轻量级系统辨识流程，支持跨自定义无人机平台和应用的快速重新配置。通过同时推动精度、速度和可微分性，Crazyflow作为合成数据生成的开源资源，具备在线执行学习和优化的大规模并行化新兴能力，为新型算法开发打开了大门。

英文摘要

High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.


2606.02735 2026-06-09 cs.RO cs.AI cs.LG 版本更新

See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

看得更少，指定更多：面向可泛化视觉-语言-动作模型的视觉证据预算

Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

发表机构 * Airoa

AI总结提出S2框架，通过显式视觉证据预算和细化轨迹语言，改善VLA模型在干扰、外观变化和语义相似任务下的泛化能力。

Comments Project page: https://s2.airoa.io

详情

AI中文摘要

泛化仍然是视觉-语言-动作（VLA）模型的核心瓶颈：在干扰物、外观变化和语义相似任务下，策略通常需要从粗略指令中推断局部执行细节，同时决定图像的哪些部分对控制重要。我们提出S2（看得更少，指定更多），一个通过更干净的接口训练执行器来提升VLA泛化的框架。“指定更多”保留原始指令作为稳定的高层目标，同时将每条轨迹重新标注为细化的轨迹级和子任务级语言，以消除当前执行模式的歧义。与原生注意力不同，“看得更少”施加显式的视觉证据预算，训练执行器从任务充分的证据中行动，而非不受约束的视觉上下文，无需任何区域或掩码标注。该接口让执行器能够遵循详细指导，而不依赖干扰性的视觉补丁或自行解决可避免的歧义，并且通过上下文学习与现成的VLM规划器兼容。在我们的主要评估设置中，S2通过改变执行器的学习问题提升了整体泛化指标：粗略指令导致可避免的监督混叠，目标保持的局部指导在我们的主要消融中优于指令替换，显式证据预算减少了对广泛视觉上下文的依赖，超越了效率考虑。在TX-G2（一个AgiBot G2兼容变体）和HSR上的八个真实机器人任务中，S2将平均子任务成功率从pi0.5的54.2%提升到79.0%。这些结果共同表明，当执行器被训练从信息丰富的局部指导和任务充分的视觉证据中行动，而非从弱监督中同时恢复两者时，VLA泛化得到改善。

英文摘要

Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.


2606.07808 2026-06-09 cs.AI 新提交

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

指令层级失效之处：诊断与修复推理语言模型的故障

Sanjay Kariyappa, G. Edward Suh

发表机构 * NVIDIA（英伟达）

AI总结提出白盒诊断框架，将指令层级失效定位为指令识别、冲突解决和响应实现三个环节，并设计两种免训练自监控机制，将违规率降低81-99%。

详情

AI中文摘要

部署在智能体工作流中的推理语言模型必须遵循指令层级：当来自不同来源的指令冲突时，模型应服从最高权限的适用指令。现有基准主要端到端地衡量这种行为，询问最终响应是否合规。然而，不合规的响应可能源于几种不同的故障：模型可能无法识别上下文中的相关指令，无法解决已识别指令之间的冲突，或者在推理中正确解决了冲突但仍产生违规响应。我们引入了一个白盒诊断框架，将指令层级失效定位为指令识别、冲突解决和响应实现，使故障更具可解释性。我们在IHEval和IHChallenge的长上下文改编版本上评估了三个推理模型——Gemma-4-31B-IT、Qwen3.6-35B-A3B和Claude Sonnet 4.6，发现主要故障模式因模型、任务和上下文长度而异。基于模型在明确提示时通常能检测冲突并输出违规的观察，我们提出了两种免训练的自监控机制：用于生成前低延迟冲突检测的并行输入监控器，以及用于响应级审查和修复的顺序输出监控器。在Gemma-4-31B-IT、Claude Sonnet 4.6和GPT-5.3上，最强的监控器将规则遵循违规率降低了81-99%，其中GPT-5.3在静态攻击下降低86%，在自适应攻击下降低45%。

英文摘要

Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conflict, the model should obey the highest-privilege applicable instruction. Existing benchmarks largely measure this behavior end-to-end, asking whether the final response is compliant. However, a non-compliant response can arise from several distinct failures: the model may fail to identify the relevant instructions in context, fail to resolve conflicts among identified instructions, or correctly resolve the conflict in its reasoning while still producing a violating response. We introduce a white-box diagnostic framework that localizes instruction hierarchy failures into instruction identification, conflict resolution, and response realization, making failures more interpretable. We evaluate three reasoning models (Gemma-4-31B-IT, Qwen3.6-35B-A3B, and Claude Sonnet 4.6) on long-context adaptations of IHEval and IHChallenge, and find that the dominant failure mode varies across models, tasks, and context length. Building on the observation that models can often detect conflicts and output violations when explicitly prompted, we propose two training-free self-monitoring mechanisms: a parallel input monitor for low-latency conflict detection before generation, and a sequential output monitor for response-level review and repair. Across Gemma-4-31B-IT, Claude Sonnet 4.6, and GPT-5.3, the strongest monitor reduces rule-following non-compliance by 81-99%, with GPT-5.3 reductions of 86% under static attacks and 45% under adaptive attacks.


2606.07874 2026-06-09 cs.AI 新提交

Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators

安全是上下文相关的，而LLM评判者不是：应对评估者的刚性先验

Anissa Alloula, Federico Licini, Ava Batchkala, Seraphina Goldfarb-Tarrant

发表机构 * University of Oxford（牛津大学）； Cohere

AI总结研究LLM作为安全评判者时，对上下文信息的依赖性和对不同安全定义的可引导性，发现它们难以在上下文或安全定义与自身先验矛盾时调整评估。

2606.07897 2026-06-09 cs.AI cs.HC 新提交

The AI Epistemic Deference Index: A Continuous Measure of Sycophancy

AI认知顺从指数：谄媚行为的连续度量

Alejandro Botas, Paul de Font-Reaulx, Luke Hewitt

发表机构 * Independent（独立研究者）； University of Michigan, Ann Arbor（密歇根大学安娜堡分校）； Transluce

AI总结提出AI认知顺从指数（AEDI），通过从自然语言输出中估计概率来连续度量模型对用户态度的顺从程度，测试8个模型发现显著差异，Claude顺从最少，Grok和Gemini最多。

详情

AI中文摘要

当前的AI模型经常表现出认知谄媚，即赞同用户的说法。现有的评估通常通过衡量使模型改变二元认可所需的条件，或通过引发对命题的明确概率来度量。然而，许多面向用户的谄媚行为是通过日常语言中表达的分级支持的转变来体现的。我们提出AI认知顺从指数（AEDI）：一个连续的、单维度的分数，表示模型输出中表达的支持对用户提示中表达的态度敏感程度。为了生成AEDI，我们提供了一种新的协议，用于从自然语言输出中估计概率，使用LLM作为评判者，并验证了其与人类判断的一致性和相关性。我们在一个包含500个不同主题命题和16000个不同用户态度提示的新策划数据库上部署了该指数，测试了8个主流模型。每个模型都表现出显著的顺从，尽管不同提供商之间存在巨大且系统的差异，其中Claude模型顺从最少，而Grok和Gemini模型顺从最多。在要求书面产物的提示中，这种效应被放大，并集中在模型先验较弱的命题上。我们发布AEDI作为一个易于更新的基准和测量流程，用于输出级别的谄媚评估。

英文摘要

Current AI models frequently exhibit epistemic sycophancy, endorsing claims to agree with a user. Existing evaluations typically measure this either by assessing what it takes to make a model shift a binary endorsement or by eliciting an explicit probability in a proposition. However, much user-facing sycophantic behavior is demonstrated through shifts in graded support expressed through ordinary language. We propose the AI Epistemic Deference Index (AEDI): a continuous, unidimensional score representing how sensitive the support expressed in a model's output is to the attitude expressed in a user's prompt. To generate AEDI, we provide a new protocol for estimating probabilities from natural language outputs, using LLMs-as-judges validated for consistency and correlation to human judgment. We deploy it on a new curated database of 500 propositions across diverse topics and 16,000 prompts varying in user attitude, testing eight prominent models. Every model exhibits substantial deference, though with large and systematic differences across providers, with Claude models demonstrating the least, and Grok and Gemini models the most. The effect is amplified in prompts requesting a written artifact, and concentrated on propositions where models hold weaker priors. We release AEDI as an easy-to-update benchmark and measurement pipeline for output-level sycophancy evaluation.


2606.07929 2026-06-09 cs.AI 新提交

Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy

压力测试医学大语言模型揭示基准准确性之外的潜在安全病理

Yuan Shen, Xiaojun Wu, Linghua Yu

发表机构 * College of Computer Science and Technology, Zhejiang University, PR China（浙江大学计算机科学与技术学院，中国）

AI总结提出AI-MASLD压力审计框架，通过240个临床病例和六种叙事扰动探针对七种模型进行双重压力测试，发现量化模型存在伪正常化，医学监督微调损害逻辑稳定性和公平性，开源模型在安全维度上达到或超越闭源模型。

Comments 34 pages, 5 figures

详情

AI中文摘要

大语言模型（LLMs）正基于可能无法检测到安全相关失效模式的基准准确性进入临床实践。本文提出AI-MASLD，一个压力审计框架，它将肝病学中的代谢压力测试逻辑应用于临床LLMs的评估。使用240个跨六种叙事扰动探针的临床病例，我们对七个模型进行了双重压力测试，并通过三个指标量化性能：代谢指数（MI）、扰动翻转率（PFR）和反事实公平指数（CFI）。在干净的基线条件下，所有模型表现一致良好。在现实叙事压力下，性能急剧分化，揭示了两种不同的应激反应表型。量化模型表现出伪正常化，其中低翻转率掩盖了功能崩溃。医学监督微调系统地降低了逻辑稳定性、公平性和信息提取能力。一个开源模型在每一个安全维度上达到或超过了专有替代方案。这些发现确立了叙事压力审计作为基于准确性评估的必要补充。

英文摘要

Large language models (LLMs) are entering clinical practice based on benchmark accuracy that may fail to detect safety-relevant failure modes. Here we present AI-MASLD, a stress-audit framework that adapts the logic of metabolic stress testing from hepatology to the evaluation of clinical LLMs. Using 240 clinical cases across six narrative perturbation probes, we subjected seven models to double-stress testing and quantified performance through three indices: metabolic index (MI), perturbation flip rate (PFR), and counterfactual fairness index (CFI). Under clean baseline conditions, all models performed uniformly well. Under realistic narrative stress, performance diverged sharply, revealing two distinct stress-response phenotypes. Quantized models exhibited pseudonormalization, in which low flip rates hid functional collapse. Medical supervised fine-tuning systematically degraded logical stability, fairness, and information extraction. An open-weight model matched or exceeded proprietary alternatives on every safety dimension. These findings establish narrative stress auditing as a necessary complement to accuracy-based evaluation.


2606.07963 2026-06-09 cs.AI cs.CL 新提交

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

共享潜在结构实现大语言模型中的统一后门检测与缓解

Omar Mahmoud, Aly M. Kassem, Thommen George Karimpanal, Buddhika Laknath Semage, Negar Rostamzadeh, Golnoosh Farnadi, Santu Rana

发表机构 * Deakin University（迪肯大学）； Mila, Quebec AI Institute（魁北克人工智能研究所Mila）

AI总结发现大语言模型中多种后门攻击共享潜在机制，通过稀疏自编码器检测因果特征，并提出双向激活操控和概念消融微调实现统一检测与缓解。

详情

AI中文摘要

大语言模型中的后门攻击通常被视为孤立的触发-响应失败，促使防御针对特定触发或行为。我们证明这种观点是不完整的。在多样化的后门行为中，我们识别出一个共享的潜在机制，可以被检测、因果控制和抑制。通过在残差流激活上使用稀疏自编码器，我们发现一小部分潜在特征在越狱、拒绝操控、密码锁定、偏见诱导、情感误分类和基于国家的有害建议中一致激活。这些特征在Qwen3、Gemma 3和Llama 3.1模型（参数从4B到32B）以及微调和权重编辑攻击中泛化。通过双向激活操控，我们证明这些特征是因果性的：抑制它们降低攻击成功率，而放大它们在干净提示上诱导目标行为。我们进一步训练轻量级SAE特征分类器，这些分类器零样本泛化到未见后门，并优于残差流和权重差异基线。最后，我们引入概念消融微调，通过在训练期间消融共享潜在子空间来抑制后门形成。总之，我们的结果表明许多后门依赖于可转移的潜在机制，从而实现统一的检测和缓解。

英文摘要

Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders (SAEs) on residual-stream activations, we find a small set of latent features consistently activated across jailbreaking, refusal manipulation, password-locking, bias induction, sentiment misclassification, and country-conditioned harmful advice. These features generalize across Qwen3, Gemma 3, and Llama 3.1 models from 4B to 32B parameters, and across both fine-tuning and weight-editing attacks. Through bidirectional activation steering, we show these features are causal: suppressing them reduces attack success, while amplifying them induces target behaviors on clean prompts. We further train lightweight SAE-feature classifiers that generalize zero-shot to unseen backdoors and outperform residual-stream and weight-diffing baselines. Finally, we introduce Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating the shared latent subspace during training. Together, our results suggest that many backdoors rely on a transferable latent mechanism, enabling unified detection and mitigation.


2606.07988 2026-06-09 cs.AI 新提交

PAFO: Pareto Fairness Optimization for Personalized Reward Modeling

PAFO: 个性化奖励建模的帕累托公平优化

Xiaoyan Zhao, Haoting Ni, Yang Zhang, Chunyuan Zheng, Haoxuan Li, Fuli Feng

发表机构 * National University of Singapore（新加坡国立大学）； University of Science and Technology of China（中国科学技术大学）； Peking University（北京大学）

AI总结针对个性化奖励模型因训练数据偏好不平衡导致对少数用户群体存在偏见的问题，提出PAFO框架，通过帕累托公平优化提升弱势群体性能而不损害其他群体，实验表明能同时提高少数和多数群体准确率并降低不公平性。

详情

AI中文摘要

大型语言模型（LLMs）越来越依赖奖励模型来使其输出与多样化的用户偏好对齐。虽然个性化奖励模型旨在捕捉这种异质性，但它们通常在用户偏好数据不平衡的情况下训练，因此可能偏向于在训练群体中偏好更常见的用户。在本文中，我们将这种失败模式识别为个性化奖励偏差，即奖励建模质量随偏好支持率系统性地变化。我们将其缓解表述为一个关于群体效用的帕累托公平问题，旨在改善服务不足的用户而不降低其他用户群体的性能。为此，我们提出了PAFO，一种用于个性化奖励建模的帕累托公平优化框架。PAFO首先为多数和少数偏好群体训练群体专用的奖励模型，然后构建条件边际级监督，将其异质性偏好边界蒸馏到一个统一的模型中。所得模型仅在训练时使用群体信息，推理时无需显式群体标签。在Personal-LLM和DSP上的实验表明，PAFO在多个指标上提高了少数群体和多数群体的准确率，同时减少了用户级不公平性，证明了其在更公平的LLM个性化中的有效性。

英文摘要

Large language models (LLMs) increasingly rely on reward models to align their outputs with diverse user preferences. While personalized reward models aim to capture such heterogeneity, they are often trained on imbalanced user preference data and may therefore favor users whose preferences are more common in the training population. In this paper, we identify this failure mode as personalized reward bias, where reward modeling quality varies systematically with preference support rate. We formulate its mitigation as a Pareto fairness problem over group utilities, aiming to improve under-served users without degrading other user groups. To this end, we propose PAFO, a Pareto fairness optimization framework for personalized reward modeling. PAFO first trains group-specialized reward models for majority and minority preference groups, then constructs conditional margin-level supervision to distill their heterogeneous preference boundaries into a single unified model. The resulting model uses group information only during training and requires no explicit group labels at inference time. Experiments on Personal-LLM and DSP show that PAFO improves both minority-group and majority-group accuracy while reducing user-level unfairness across multiple metrics, demonstrating its effectiveness for fairer LLM personalization.


2606.07992 2026-06-09 cs.AI cs.CR cs.SE 新提交

VATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation

VATS: 通过系统性变异利用错误路径注入中的隐式权威

Harshil Patel, Kunal Pai

发表机构 * Harshil Patel ； Kunal Pai

AI总结提出VATS框架，通过七维变异生成对抗性负载，利用错误消息的隐式权威绕过安全机制，在四个前沿模型上实现高达100%的注入成功率。

Comments Published at Second Workshop on Agents in the Wild: Safety, Security, and Beyond (ICML 2026 AIWILD)

详情

AI中文摘要

随着模型上下文协议（MCP）标准化自主代理的工具调用，它引入了一个关键且未经审查的攻击面：错误处理循环。我们假设工具错误消息具有隐式权威，会触发纠正性推理模式，从而绕过标准安全启发式。我们提出VATS（工具流漏洞分析），一个突变驱动的框架，系统地跨七个结构和语言维度演化对抗性负载。我们在四个前沿模型（Gemini 3.1 Pro、GPT-5.5、GLM-5.1和Qwen3-Coder）上的评估表明，错误路径注入将标准间接提示注入（IPI）的成功率提高了三倍，在受控评估中实现了高达100%的合规性。我们隔离了结构定位（在错误上下文中夹带指令）作为所有测试模型中最有效的利用向量。虽然我们发现生产框架护栏可以缓解这些漏洞，但模型层固有的易感性对定制代理工作流构成了系统性风险。

英文摘要

As the Model Context Protocol (MCP) standardizes tool-calling for autonomous agents, it introduces a critical, unexamined attack surface: the error-handling loop. We hypothesize that tool error messages possess implicit authority, triggering corrective reasoning modes that bypass standard safety heuristics. We introduce VATS (Vulnerability Analysis of Tool Streams), a mutation-driven framework that systematically evolves adversarial payloads across seven structural and linguistic dimensions. Our evaluation across four frontier models, Gemini 3.1 Pro, GPT-5.5, GLM-5.1, and Qwen3-Coder, demonstrates that error-path injection triples the success rate of standard indirect prompt injection (IPI), achieving up to 100% compliance in controlled evaluations. We isolate structural positioning (sandwiching instructions within error context) as the most effective exploit vector across all tested models. While we find that production framework guardrails can mitigate these vulnerabilities, the inherent susceptibility of the model layer poses a systemic risk to bespoke agentic workflows.


2606.08296 2026-06-09 cs.AI cs.LG 新提交

Revisiting the shutdown problem

重新审视关机问题

David Thorstad

发表机构 * GitHub

AI总结本文重新评估了AI关机问题的难度，指出现有论证未能证明其难以解决，且相关技术方案对模型性能造成了高安全代价。

2606.08310 2026-06-09 cs.AI cs.MA 新提交

Instrumental Convergence and Power-Seeking

工具性趋同与权力寻求

David Thorstad

发表机构 * GitHub

AI总结本文探讨人工智能可能寻求权力的论点，分析工具性趋同论题，指出其强版本未被充分论证，并讨论对长期主义、AI治理及风险研究方法的影响。

2606.08919 2026-06-09 cs.AI cs.CR cs.LG 新提交

Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human

监督具有容量：将智能体守卫校准到主观且易疲劳的人类

Emre Turan

发表机构 * GitHub ； arXiv

AI总结针对LLM智能体动作审批中人类评审者主观且易疲劳的问题，提出将守卫建模为成本敏感的选择性分类，并引入负载感知策略，发现过度监督反而降低安全性，形成倒U型曲线。

Comments 12 pages, 4 figures. Code and interactive demo: https://github.com/turangenesis/headroom

详情

AI中文摘要

随着LLM智能体开始采取真实、不可逆的行动（如shell命令、文件编辑、部署），标准的安全模式是人在环中的审批门：风险动作暂停并等待人工确认。我们认为审批门是容易的部分；困难的部分在于判断，即哪些动作需要拦停，而该领域目前基于两个错误假设来评估这种判断：存在关于“风险”的真实标签，以及人类评审者是完美且无限可用的预言机。在一个由125个对抗性加权的智能体动作组成的手工标注集上，我们展示了：(i) 评审者对何为风险仅中度一致（Fleiss' kappa = 0.52），因此不存在单一正确标签；(ii) 将守卫建模为非对称成本下的选择性分类使其操作极限可测量，且在困难输入上守卫无法安全地自动决策；(iii) 当评审者被建模为内生变量（随着升级负载增加而疲劳）时，实际安全性在升级率上呈现倒U形：更多的人类监督可能使系统更不安全，而安全最优的守卫升级率低于完全升级；负载感知策略也利用这一设置来抵御洪水攻击，这种攻击试图让恶意动作从疲劳的评审者眼前溜过。以这种方式看待，智能体监督不仅是一个分类问题，还是一个资源分配问题：人类注意力是有限的，而守卫的升级策略会消耗它。我们不声称这些机制中有任何一项是新颖的：疲劳感知的学习延迟（FALCON）、工作负载约束下的成本敏感延迟（DeCCaF）、轨迹级守卫以及评审者疲劳/洪水攻击均为我们引用的现有技术。我们的贡献是一个开源的智能体监督系统，它在LLM智能体动作门控设置中操作化并测量这些机制，将“我的守卫好吗？”从猜测转变为一条曲线。倒U形和洪水攻击是建模结果，它们为开展人类研究提供了动机。

英文摘要

As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. We argue the gate is the easy part; the hard part is the judgment - which actions to stop - which the field evaluates against two false assumptions: that there is a ground-truth notion of "risky," and that the human reviewer is a perfect, infinitely-available oracle. On a hand-labeled set of 125 adversarially-weighted agent actions we show that (i) reviewers only moderately agree on what is risky (Fleiss' kappa = 0.52), so there is no single correct label; (ii) framing the guard as selective classification under asymmetric cost makes its operating limits measurable, and on hard inputs the guard cannot safely auto-decide; and (iii) when the reviewer is modeled as endogenous (fatiguing as escalation load grows), realized safety becomes an inverted-U in the escalation rate: more human oversight can make a system less safe, and the safety-optimal guard escalates below full escalation - a setting a load-aware policy also uses to resist a flooding attack that slips a malicious action past a fatigued reviewer. Agent oversight, framed this way, is not only a classification problem but a resource-allocation one: human attention is finite, and the guard's escalation policy spends it. We claim none of these mechanisms as novel - fatigue-aware learning-to-defer (FALCON), cost-sensitive deferral under workload constraints (DeCCaF), trajectory-level guarding, and reviewer-fatigue/flooding attacks are all prior art we cite. Our contribution is an open-source agent-oversight system that operationalizes and measures them in the LLM-agent action-gating setting, turning "is my guard good?" from a guess into a curve. The inverted-U and the flooding attack are modeling results that motivate a human study.
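
The agreement statistic quoted above, Fleiss' kappa, can be computed from a table of per-item category counts. The sketch below uses the standard textbook formula; the function name and the tiny example tables are hypothetical and are not the paper's 125-action dataset.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same rater count."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item must be rated by the same number (>= 2) of raters")

    # Observed agreement: average pairwise agreement within each item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    n_categories = len(counts[0])
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(s * s for s in shares)

    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Columns: [risky, safe]; three hypothetical reviewers per action.
    unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
    split = [[2, 1], [1, 2], [2, 1], [1, 2]]
    print(fleiss_kappa(unanimous), fleiss_kappa(split))
```

A value near 1 means reviewers label actions almost identically, while values near or below 0 mean agreement is no better than chance; the reported 0.52 sits between these, which is why the abstract treats "risky" as lacking a single correct label.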


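2606.08998 2026-06-09 cs.AI cs.CY econ.GN q-fin.EC New submission

The Token Not Taken: Sampling, State, and the Variability of AI Agent Outputs

Muhammad Zia Hydari, Raja Iqbal

Affiliations: University of Pittsburgh; Ejento.ai

AI summary: This paper analyzes the sources of output variability in AI agent systems, distinguishing the intrinsic randomness of token sampling from extrinsic factors such as environment and data, and discusses when variability is reproducible under matched conditions and why deterministic execution need not yield identical behavior in deployment.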

English abstract

Agentic AI systems can behave differently across runs: the same request may produce a different plan, a different tool call, a different code edit, or a different final answer. Such variability arises from several layers that are often conflated. A foundation model is a large pretrained model, usually adaptable to many downstream tasks, that maps an input context to predictions over outputs. In many current agents, that model is embedded in an orchestration loop that plans, calls tools, observes results, and updates state. One explicit intrinsic source of variability in such systems is token generation: the model computes scores over possible next tokens, the scores are converted into probabilities, and a decoder may sample tokens using a pseudo-random number generator. A small sampled token difference can then propagate upward into a different tool call, code path, search query, or agent state. Other sources of variability are extrinsic to token sampling, including changing environments, live data, serving infrastructure, batch effects, and numerical details. By separating these layers, the manuscript clarifies what it means to call agentic AI systems stochastic, when such variability can be reproduced under matched conditions, and why deterministic execution need not imply identical behavior in deployed settings.
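The token-generation step described above (scores, then probabilities, then a pseudo-random draw) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any particular model's decoder: the score vector, the temperature parameter, and the function names are hypothetical.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Convert raw next-token scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(scores, n, seed, temperature=1.0):
    """Draw n token indices using a pseudo-random generator with a fixed seed."""
    rng = random.Random(seed)
    probs = softmax(scores, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=n)


def greedy_token(scores):
    """Deterministic decoding: always pick the highest-scoring token."""
    return max(range(len(scores)), key=lambda i: scores[i])


scores = [2.0, 1.5, 0.3, -1.0]  # hypothetical next-token scores
run_a = sample_tokens(scores, n=10, seed=0)
run_b = sample_tokens(scores, n=10, seed=0)
print(run_a == run_b)        # matched seed and scores reproduce the draws
print(greedy_token(scores))  # index of the largest score
```

Matching the seed reproduces the draws only because the scores are also identical here; in a deployed system the extrinsic factors listed in the abstract (batching, numerical details, live data) can change the scores themselves, so a fixed seed alone does not guarantee identical behavior.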


2606.09038 2026-06-09 cs.AI New submission

Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs

Yanyan Luo, Xue Han, Ruiqiao Bai, Xin Huang, Yitong Wang, Qian Hu, Qing Wang, Chunxu Zhao, Jie Liu, Cong Geng, Lehao Xing, Pengwei Hu, Junlan Feng

Affiliations: China Mobile Jiutian Artificial Intelligence Technology (Beijing) Co., Ltd.; Chinese Academy of Sciences

AI summary: This paper presents the first safety-oriented survey of personalized large language models, organized along three dimensions (user representation, personalization paradigm, and evaluation); it proposes a unified taxonomy of safety risks and analyzes the vulnerabilities and mitigation strategies of each paradigm.

English abstract

Large Language Models (LLMs) have enabled increasingly personalized interactions by adapting to users' preferences, contexts, and long-term histories. However, the mechanisms that enable personalization also expand the safety landscape in ways not systematically addressed by existing literature. Existing reviews typically focus on either personalization or safety, leaving their intersection largely unexplored. We present the first comprehensive, safety-aware review of personalized LLMs. We organize personalization along three dimensions (user representation, personalization paradigm, and evaluation) and introduce a unified taxonomy of safety risks. At the representation level, we analyze risks arising from diverse user representations. Across mainstream personalization paradigms, we delineate vulnerabilities inherent to prompting, retrieval augmentation, parameter fine-tuning, reinforcement learning, Mixture-of-Experts (MoE), pruning, agent frameworks, and multimodal personalization, and synthesize mitigation strategies across the model lifecycle. Beyond these fine-grained risks, we characterize paradigm-agnostic safety risks arising from personalized adaptation. We further summarize personalized datasets and evaluation methodologies. Through a case study of OpenClaw, we analyze deployment trends in personalized agent ecosystems. Our analysis reveals three structural inadequacies in existing research: safety is evaluated as user-invariant rather than relational, personalization techniques are analyzed in isolation rather than in composition, and evaluation frameworks cannot capture emergent long-term risks. By jointly examining personalized representations, personalization paradigms, safety risks, defenses, and evaluation methods, we provide a unified framework for developing safe personalized LLMs and highlight key directions for future research.


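2606.09132 2026-06-09 cs.AI New submission

Proxy Reward Internalization and Mechanistic Exploitation: Learned Precursors of Reward Hacking and Its Generalization

Mohammad Beigi, Ming Jin, Lifu Huang

Affiliations: UC Davis; Virginia Tech

AI summary: Introduces the concept of PRIME and measures it through chain-of-thought monitoring, direct probes, and activation-level concept vectors; finds that PRIME emerges in stages before sustained reward hacking, that direct-probe scores forecast later hack onset, and that in-domain PRIME tracks out-of-domain misalignment across checkpoints.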

English abstract

Reward hacking is usually studied after it becomes visible, once a model earns high proxy reward while failing the intended task. We instead study what proxy RL teaches before that failure appears. We introduce Proxy Reward Internalization and Mechanistic Exploitation (PRIME), a learned capability to assess task correctness, predict proxy acceptance, and reason about exploitable proxy-gold gaps. In coding RL environments with exploitable pytest rewards, we measure PRIME through chain-of-thought monitoring, direct probes, and activation-level concept vectors. We find that PRIME emerges in a staged sequence before sustained reward hacking, and that its current direct-probe score forecasts later hack onset and severity even when the visible hack rate is still low. PRIME also adapts when the evaluator changes, retargeting to whichever proxy-gold gap remains rewarded and persisting when gold reward suppresses overt hacking, and ablating its activation directions reduces hacking. Across checkpoints, in-domain PRIME tracks out-of-domain misalignment. Together these results suggest that exploitable proxy RL amplifies a proxy-internalization capability upstream of visible hacking, making PRIME a candidate early-warning signal for broader alignment risk.


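2606.09724 2026-06-09 cs.AI New submission

Train-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment

Bruce Changlong Xu, Lan Wu

Affiliations: University of California, Berkeley

AI summary: Proposes the kernel contract framework, which constrains distributional divergence between training and inference kernels through numerical, statistical, runtime, and observability clauses, and derives bounds on the resulting policy-gradient bias.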

English abstract

A modern post-training pipeline often writes one symbol for its policy, pi_theta, while evaluating it through two different programs: a training kernel optimized for autograd and an inference kernel optimized for low-precision, fused, dynamically batched serving. In finite precision, these kernels can induce different distributions at identical weights, with the gap concentrated on slices that aggregate benchmarks under-represent. This paper proposes kernel contracts: a contract-first framework for specifying acceptable divergence between K_train and K_inf. A contract C = (N, S, R, O, Pi) combines numerical, statistical, runtime, and observability clauses with an escalation policy from violations to routing actions. We derive a chain of bounds from logit drift to total-variation distance to bounded reward drift, and specialize it to RL post-training, where per-token importance-ratio drift yields a bound on policy-gradient bias under explicit support and norm assumptions. We also describe a four-stage promotion pipeline, online routing loop, and minimal YAML DSL for contract artifacts. This is a framework and vocabulary paper; we do not report production-scale empirical validation.
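One textbook way to chain a bound from logit drift to total-variation distance to reward drift is sketched below. It is an illustration of the general idea, not the paper's derivation, and the paper's actual bounds may be tighter or stated differently. The assumptions: the two kernels' logits differ by at most eps in every coordinate, which gives KL(p||q) <= 2*eps for the softmax distributions; Pinsker's inequality then gives TV <= sqrt(eps); and a reward confined to [r_min, r_max] drifts in expectation by at most (r_max - r_min) * TV. All numeric values are hypothetical.

```python
import math


def softmax(z):
    """Softmax over a list of logits, shifted by the max for stability."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def total_variation(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tv_bound_from_logit_drift(eps):
    """Sup-norm logit drift eps implies KL <= 2*eps, hence TV <= sqrt(eps)."""
    return min(1.0, math.sqrt(eps))


def reward_drift_bound(eps, r_min, r_max):
    """Expected-reward drift is at most the reward range times the TV bound."""
    return (r_max - r_min) * tv_bound_from_logit_drift(eps)


# Hypothetical logits from a training kernel and a slightly drifted
# inference kernel at identical weights.
z_train = [1.2, 0.4, -0.7, 2.1]
z_inf = [1.21, 0.38, -0.685, 2.095]
eps = max(abs(a - b) for a, b in zip(z_train, z_inf))

p, q = softmax(z_train), softmax(z_inf)
rewards = [0.0, 1.0, 0.5, 0.2]  # hypothetical per-token rewards in [0, 1]
gap = abs(
    sum(pi * r for pi, r in zip(p, rewards))
    - sum(qi * r for qi, r in zip(q, rewards))
)

print(total_variation(p, q) <= tv_bound_from_logit_drift(eps))
print(gap <= reward_drift_bound(eps, 0.0, 1.0))
```

A contract clause in this style would state a tolerance on eps and derive the downstream tolerances from it, rather than setting each threshold independently.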


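2606.07593 2026-06-09 cs.CV cs.AI Cross-list

A Mechanistic Analysis of Adversarial Fine-tuning of Vision Transformers

Hannah Gao, Isha Agarwal, Dylan Hadfield-Menell, Rachel Ma

Affiliations: Massachusetts Institute of Technology

AI summary: Uses mechanistic analysis to study how adversarial fine-tuning affects vision transformer performance on perturbed and regular images; finds that fine-tuning improves robustness only to specific perturbation types and does not change the sparse representations.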


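Trait-Space Monitoring of Emergent Misalignment in Supervised Finetuning

Huy Nghiem, Sy-Tuyen Ho, Sarah Wiegreffe, Hal Daumé

Affiliations: University of Maryland

AI summary: Proposes monitoring emergent misalignment during supervised finetuning using trait directions in activation space; low-dimensional geometric signatures enable efficient detection, reaching 0.990 AUROC on 7-9B models.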

Comments First version. 45 pages

English abstract

Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using seven alignment-relevant traits encoded as linear directions in activation space, we track representational drift across training checkpoints in four open-source 7-9B LLMs. EM-relevant drift concentrates on a low-dimensional axis that explains 65.5% of the variance, revealing a geometric signature in the studied regime. A low-overhead monitor built on this drift profile detects dangerous checkpoints with 2.2% false negative rate, 2.9% false positive rate, and 0.990 AUROC on held-out perturbation types, outperforming unsupervised PCA and SAE baselines. Stress tests on two 14B models, longer finetuning runs, and misaligned starting points identify key deployment boundaries. These results position trait-space monitoring as a practical complement to behavioral evaluation for EM detection during LoRA-based finetuning, while showing that deployment across substantially different regimes may require recalibration.


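2606.07688 2026-06-09 cs.IR cs.AI cs.CL cs.LG Cross-list

Beyond Pass/Fail: Using Process Mining to Understand How LLMs Resist (and Fail) Red Teaming Attacks

Zvi Topol

Affiliations: MuyVentive LLC

AI summary: Proposes applying process mining to red teaming traces: Directly-Follows Graphs and state transition matrices extracted from event logs reveal structural differences between the defenses of GPT-OSS and Llama 3.3, exposing defense patterns that the conventional attack success rate metric cannot capture.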

English abstract

Standard AI red teaming evaluations reduce adversarial campaigns to a single binary outcome, attack success rate (ASR), not taking into account the sequential structure of how models resist or yield to attacks. We propose applying process mining, a discipline for discovering and analyzing process models from event logs, to red teaming traces. We conduct a controlled experiment pitting 60 HarmBench prompts against two LLMs, GPT-OSS 120B and Llama 3.3 70B, using 10 prompt mutation strategies over up to 110 attempts per prompt. From the resulting 8,575 scored events we extract Directly-Follows Graphs (DFGs) and state transition matrices that reveal structurally distinct defense profiles invisible to ASR alone: GPT-OSS exhibits a near-absorbing refusal state, while Llama presents multiple porous escape routes from refusal to getting successfully jailbroken. We further show that mutator effectiveness is asymmetric across models and that time-to-jailbreak distributions differ by an order of magnitude.
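A Directly-Follows Graph and its row-normalized transition matrix can be extracted from attempt traces with simple counting. The sketch below illustrates the general process-mining construction, not the paper's pipeline; the state labels ("refusal", "partial", "jailbreak") and the three toy traces are hypothetical.

```python
from collections import Counter, defaultdict


def directly_follows_graph(traces):
    """Count how often state a is immediately followed by state b."""
    dfg = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg


def transition_matrix(dfg):
    """Row-normalize DFG counts into estimates of P(next state | current state)."""
    totals = defaultdict(int)
    for (a, _), count in dfg.items():
        totals[a] += count
    return {(a, b): count / totals[a] for (a, b), count in dfg.items()}


# Hypothetical scored outcomes of successive attempts on three prompts.
traces = [
    ["refusal", "refusal", "refusal"],
    ["refusal", "partial", "jailbreak"],
    ["refusal", "refusal", "partial"],
]

dfg = directly_follows_graph(traces)
probs = transition_matrix(dfg)
print(dfg[("refusal", "refusal")])    # 3
print(probs[("refusal", "refusal")])  # 0.6
```

In this representation a near-absorbing refusal state shows up as a self-transition probability close to 1, while porous escape routes show up as non-trivial probability mass on edges leaving the refusal state.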


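2606.07834 2026-06-09 cs.SE cs.AI cs.CL cs.MA Cross-list

Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence

Haoran Xu

AI summary: In mixed-evidence settings, LLM judges wrongly return directional verdicts (SUPPORTS/REFUTES) instead of the authorized non-directional verdict (CONFLICTING), a failure defined as Cherry-pick Override (CCO); using a diagnostic protocol and intervention experiments, the paper argues for an external commitment-control layer that separates verdict generation from authorization.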

Comments 12 pages, 1 figure

English abstract

LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, returning SUPPORTS/REFUTES is an unauthorized directional commitment, a failure we name Cherry-pick Override (CCO). We define CCO under an explicit task contract and report it with a same-denominator diagnostic protocol paired with matched-coverage bootstrap and an apples-to-apples random-veto null. On AVeriTeC's Conflicting subset (N_C = 150), three-option judges return a directional verdict on more than 84% of mixed-evidence claims; under the typed schema, three-judge majority voting amplifies direction-on-conflict on AVeriTeC (0.887 vs. 0.840; 95% CI [+0.013, +0.080]) but does not replicate on VitaminC-Mixed. Walking an intervention ladder of common single-channel fixes (typed vocabulary, panel aggregation, confidence thresholding, validator-only filtering), each leaves a distinct residual failure: panel aggregation suppresses single-judge CONFLICTING dissent in 48% of CCO cases; the panel is well-calibrated for direction (ECE = 0.07 on pure-S/R) so confidence cannot operationally separate CCO from correct directional commits; validator-as-classifier nearly halves pure-evidence accuracy. A minimal two-channel reference probe reaches operating points neither single channel reaches; under the random-veto null its promotion to CONFLICTING is structurally targeted on AVeriTeC (empirical p < 1/2001) and weaker but in the same direction on VitaminC-Mixed, a selectivity result rather than a magnitude one. We argue for an external commitment-control layer that separates verdict generation from commitment authorization, using structural evidence and confidence as orthogonal channels and NO-COMMIT as a routed controller state.


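2606.07857 2026-06-09 cs.CR cs.AI Cross-list

Model Multiplicity for Adversarial Detection in Small Language Model Training on Edge Devices

Stefan Behfar, Richard Mortier

Affiliations: Computer Lab, University of Cambridge

AI summary: To address the vulnerability of distributed language-model finetuning on edge devices to poisoning attacks, proposes a system-level defense based on model multiplicity that rotates or concurrently trains multiple small language models and quantifies their divergence to detect anomalies; experiments show earlier and more reliable poisoning detection than classical single-model defenses.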

English abstract

The rise of edge-based machine learning has enabled distributed adaptation of language models across mobile and IoT devices, offering privacy preservation and real-time responsiveness. However, distributed fine-tuning of language models on untrusted or heterogeneous edge nodes introduces new vulnerabilities. Compromised or unreliable devices can inject poisoned updates, leading to stealthy model manipulation or convergence degradation. Classical defenses such as robust aggregation or temporal anomaly detection operate on a single global model and are therefore limited in detecting coordinated or persistent poisoning. This work proposes a new system-level defense based on model multiplicity. Instead of maintaining one global model, the system rotates or concurrently trains multiple small language models (e.g., DistilGPT-2), each updated by independently sampled subsets of edge nodes. These models evolve under distinct training trajectories, creating multiple independent views of the same distributed population. Divergence between models quantified through gradient similarity, loss evolution, or parameter variance serves as a signal of anomalous or adversarial behavior. When one model deviates significantly from the ensemble mean, the system flags its contributing nodes for isolation or re-weighting. We implement this framework and evaluate it on edge-scale simulations of Small Language Model (SLM) training under varying heterogeneity and attack conditions. Results show that model multiplicity enables earlier and more reliable detection of poisoning compared to classical single-model defenses such as Flanders and Robust methods. Our findings demonstrate that diversity in model evolution can serve as a practical and effective defense mechanism for secure distributed learning on resource-constrained edge devices.


2606.07943 2026-06-09 cs.CR cs.AI cs.CL Cross-list

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

Haochang Hao, Dehai Min, Zhifang Zhang, Yunbei Zhang, Miao Xu, Yingqiang Ge, Lu Cheng

Affiliations: University of Illinois at Chicago; University of Queensland; Tulane University; Rutgers University

AI summary: Proposes the POISE attack, which uses position awareness to compress a malicious instruction into a single benign-looking instruction embedded in the skill body, achieving an 89.3% attack success rate while remaining stealthy, 28.0 percentage points above a random-placement baseline.

Comments 20 pages, 2 figures, 5 tables

English abstract

Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user's legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user's task to still pass its verifier in the same trial. Prior skill-poisoning attacks face a reliability-stealth trade-off under this lens: YAML-header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out-of-context commands invite the agent's own suspicion. We introduce POISE, a position-aware attack that compresses the trigger into a single, benign-looking body instruction, placing it at a feasible position and using a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex+gpt-5.2, POISE achieves an 89.3% ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper-sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high-risk alert over their clean baselines, rendering current static defenses ineffective.


2606.07968 2026-06-09 cs.CR cs.AI Cross-list

RecurGuard: Runtime Monitoring for Reasoning-Token Consumption Attacks

Abid Aziz, Hafsa Binte Kibria

Affiliations: Department of Electrical & Computer Engineering; Rajshahi University of Engineering & Technology

AI summary: RecurGuard monitors three signals in the reasoning trace (recurrence rate, volume growth, and progress toward the query) to detect and stop reasoning-token consumption attacks in real time; on DS-R1-Qwen-7B it detects 99% of OverThink and 92% of ExtendAttack instances with a near-zero false positive rate.

English abstract

Reasoning-capable large language models can be induced to spend their generation budget on injected decoy tasks rather than answering the user's question, causing denial of service when no final answer is produced and denial of wallet when excess output tokens are billed. Input-side safety classifiers often miss these attacks because the injected prompts can appear syntactically benign. We build RecurGuard, a runtime monitor for detecting reasoning-chain consumption attacks when reasoning traces are exposed by the model. RecurGuard analyzes reasoning traces as they are generated and tracks three signals: recurrence rate, volume growth, and progress toward the user's query. If all three signals remain anomalous over three consecutive chunks, RecurGuard terminates generation early. We evaluate RecurGuard against OverThink and ExtendAttack across open-weight reasoning models and conduct adaptive stress tests on DS-R1-Qwen-7B. On this model, RecurGuard detects 99% of OverThink attacks and 92% of ExtendAttack instances while maintaining near-zero false positive rates on question answering, code generation, mathematics, and summarization. Adaptive evaluation reveals the limit of the defense: topical attacks retain 11.9x amplification with an approximately 50% joint miss rate, whereas full semantic evasion reduces amplification from 22.8x to 2.2x. When reasoning traces are unavailable, QDM provides a post-hoc fallback monitor based on the final output.
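The termination rule stated above (stop once all three signals stay anomalous for three consecutive chunks) can be sketched as a streak counter. This is a minimal illustration, not the RecurGuard implementation: the abstract does not say how each signal is computed or thresholded, so the sketch takes per-chunk boolean flags as input, and the function name and data layout are hypothetical.

```python
def termination_chunk(chunk_flags, window=3):
    """Return the index of the chunk at which generation would be stopped.

    chunk_flags holds one tuple per reasoning chunk:
    (recurrence_anomalous, volume_anomalous, progress_anomalous).
    Generation stops once all three flags have been True for `window`
    consecutive chunks; None means the trace is never stopped.
    """
    streak = 0
    for index, flags in enumerate(chunk_flags):
        if all(flags):
            streak += 1
            if streak >= window:
                return index
        else:
            streak = 0  # any normal signal resets the streak
    return None


# Hypothetical traces of per-chunk anomaly flags.
attack_like = [
    (True, True, True),
    (True, False, True),  # one normal signal resets the streak
    (True, True, True),
    (True, True, True),
    (True, True, True),
]
benign_like = [(True, True, False)] * 5

print(termination_chunk(attack_like))  # 4
print(termination_chunk(benign_like))  # None
```

Requiring all three signals jointly, and for several chunks in a row, trades detection latency for a lower false positive rate: a long but productive reasoning trace can trip the volume signal without ever tripping the progress signal.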


2606.07970 2026-06-09 cs.CL cs.AI Cross-list

Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

Haoming Wen, Shi Chen, Qingyu Shi, Siyuan Liu, Minrui Luo, Jingzhao Zhang, Tianxing He

Affiliations: Xiongan AI Institute; Institute for Interdisciplinary Information Sciences, Tsinghua University; Shanghai Qi Zhi Institute

AI summary: To counter the safety threat of full-parameter finetuning, proposes Patcher, a method based on adversarial training and bi-level optimization that strengthens the defense by scaling up the optimization steps in the adversarial loop, and designs a parallel algorithm to improve efficiency.

English abstract

Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which could compromise the safety alignment of LLMs with only a few steps of supervised finetuning (SFT) on poisoned datasets. Existing alignment-stage defenses are primarily designed to defend against attacks that use parameter-efficient finetuning methods. However, they fail to defend against stronger attacks that use full-parameter finetuning. In this paper, we propose Patcher, a method inspired by adversarial training and bi-level optimization, to combat such attacks. Patcher strengthens the simulated attack by scaling up the optimization steps in the adversarial loop, thus forcing the defender to find model parameters that are insensitive to stronger attacks. Furthermore, we propose an efficient parallel algorithm to implement Patcher, decreasing the wall-clock time of training while preserving Patcher's performance. Extensive experiments show that Patcher substantially improves the model's robustness compared to vanilla SFT alignment, and transfers to diverse attack scenarios and model sizes. Code is available at https://github.com/haomingwen/patcher.


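2606.08021 2026-06-09 cs.LG cs.AI cs.MA Cross-list

Semantic Quorum Assurance: Collective Certification for Non-Deterministic AI Infrastructure

Jun He, Deying Yu

Affiliations: OpenKedge.io

AI summary: Proposes Semantic Quorum Assurance (SQA), a control-plane primitive that uses a diverse panel of validators and a risk-adaptive quorum predicate to reduce the unsafe approval rate for operations proposed by non-deterministic LLM agents from 18.5% to 0.3%.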

Comments 21 pages, 2 figures, 6 tables

English abstract

As large language model (LLM) agents are integrated into autonomous cloud operations, distributed systems face a semantic reliability problem: proposer agents can generate production mutations, such as modifying IAM policies, opening firewall security groups, or executing data exports, that are syntactically valid and statically authorized but operationally unsafe. Classical distributed consensus protocols replicate deterministic state transitions but do not evaluate the safety of the proposed intent. To address this gap, we introduce Semantic Quorum Assurance (SQA), a control-plane primitive for governing non-deterministic agentic infrastructure. SQA represents proposals as declarative execution contracts bound to cryptographic evidence chains and routes them to a diverse panel of read-only, sandboxed validator agents. SQA aggregates their judgments under a risk-adaptive quorum predicate that enforces model and archetype diversity, adjusts weights based on calibrated assurance scores, and respects archetype-specific vetoes. Admitted proposals execute only through a sovereign execution gate. We instantiate SQA in a cloud-native control plane and formalize a correlated cognitive failure model for non-deterministic validators. On 500 infrastructure-inspired mutation scenarios, with safety results reported on held-out safe/unsafe trials excluding ambiguous scenarios, SQA reduces unsafe approval from 18.5% for single-agent validation to 0.3% while adding median validation latency of 1.45 to 4.12 seconds across the studied risk buckets.
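A quorum predicate with the three ingredients named above (diversity requirements, score-based weights, and vetoes) could look roughly like the sketch below. It is an illustration of the idea only, not SQA's actual predicate: the `Vote` fields, the archetype names, the diversity minimums, and the way a risk bucket maps to a threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    model: str        # which underlying model produced the judgment
    archetype: str    # validator role, e.g. "security" (hypothetical names)
    approve: bool
    weight: float     # stand-in for a calibrated assurance score
    veto: bool = False


def quorum_admits(votes, threshold, min_models=2, min_archetypes=2):
    """Admit a proposal only if no validator vetoes it, the approvals come
    from enough distinct models and archetypes, and the approving share of
    total weight reaches the threshold. A caller would raise the threshold
    for higher-risk proposals."""
    if any(v.veto for v in votes):
        return False
    approvals = [v for v in votes if v.approve]
    if len({v.model for v in approvals}) < min_models:
        return False
    if len({v.archetype for v in approvals}) < min_archetypes:
        return False
    total_weight = sum(v.weight for v in votes)
    if total_weight <= 0:
        return False
    approving_weight = sum(v.weight for v in approvals)
    return approving_weight / total_weight >= threshold


panel = [
    Vote("model-a", "security", True, 0.9),
    Vote("model-b", "reliability", True, 0.8),
    Vote("model-c", "compliance", False, 0.3),
]
print(quorum_admits(panel, threshold=0.8))  # True: 1.7 of 2.0 weight approves
print(quorum_admits(panel, threshold=0.9))  # False: stricter, high-risk bucket
```

The diversity checks are what distinguish this from plain weighted voting: several approvals from the same model count as one source, which is the intuition behind guarding against correlated validator failures.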


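2606.08027 2026-06-09 cs.LG cs.AI Cross-list

CausShield: Sample Reconstruction-Resilient Vertical FL via Causal Representation Learning

Yongqi Jiang, Yansong Gao, Siguang Chen, Anmin Fu

Affiliations: Nanjing University of Science and Technology; University of Western Australia; Hohai University; Nanjing University

AI summary: To defend against sample reconstruction attacks in vertical federated learning, proposes CausShield, a method based on causal representation learning that decomposes shared representations into task-relevant and task-irrelevant components for full-cycle privacy protection; convergence is proven theoretically, and experiments show it outperforms seven recent methods.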

English abstract

Vertical federated learning (VFL) is a distributed learning paradigm that leverages vertically partitioned features across isolated parties without sharing raw samples; however, it remains vulnerable to active sample reconstruction attacks. Existing defenses fail to achieve a satisfactory trade-off between model utility and privacy protection, due to either suppressing task-relevant information alongside privacy-sensitive features or relying on end-to-end supervised training to converge the defense module, which exposes the model to early-epoch vulnerability. To address this challenge, we adopt a structural causal model (SCM) insight and construct CausShield. From a task-learning standpoint, causal features within a raw sample are those that are directly relevant and contributory to the learning objective, whereas non-causal features are task-irrelevant but often encode sample-specific private information, thereby facilitating reconstruction. Importantly, we lay a theoretical foundation to prove this insight. CausShield thus decomposes the shared representations between the client and the coordinating server in VFL into task-relevant and task-irrelevant components to ensure full-cycle privacy protection. Nonetheless, the decomposition is inherently challenging due to the dual objectives of preserving model utility while mitigating privacy leakage. We address this via a carefully formulated optimization problem, which is solved through unsupervised representation learning. We further theoretically prove that CausShield preserves the convergence behavior of standard VFL. Extensive experiments compare CausShield against seven SOTAs, including InvL (USENIX Security'25), and evaluate robustness against advanced reconstruction attacks such as URVFL (NDSS'25). Results demonstrate that CausShield consistently outperforms in privacy protection, model utility, and computational efficiency.


2606.08044 2026-06-09 cs.LG cs.AI cs.CL Cross-list

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Enyi Jiang, Anders Gjølbye, Yibo Jacky Zhang, Sanmi Koyejo

Affiliations: Stanford University; University of Illinois Urbana-Champaign; Technical University of Denmark

AI summary: Identifies an "audit gap" between behavioral safety and robustness under intervention; by constructing dissociated models and introducing the Latent Vulnerability Score (LVS), shows that behavioral safety metrics are insufficient measures of representation-level robustness.

Comments Preprint

English abstract

Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.

2606.08131 2026-06-09 cs.HC cs.AI cross-listed

LCAM: A Framework for Diagnosing Interactional Alignment Failures in Conversational AI

Manuele Reani, Hongyu Tian

Affiliations: School of Management and Economics, The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the Layered Cognitive Alignment Model (LCAM), which diagnoses interaction failures in conversational AI through five alignment layers and two misalignment polarities; applied to an LLM counseling case, it reveals latent harms.

Abstract

Conversational AI is increasingly used for advice, interpretation, reassurance, and decision support in contexts where users may be vulnerable, uncertain, or dependent on the system's apparent competence. Existing alignment work often focuses on model objectives, preference optimization, or output correctness. Yet, many harms arise through interaction: how systems frame authority, express uncertainty, simulate empathy, support reasoning, and make boundaries legible. This paper introduces the Layered Cognitive Alignment Model (LCAM), a conceptual and normative framework for diagnosing interactional alignment failures in conversational AI. LCAM defines alignment as a calibrated fit among system behavior, user goals, task demands, and normative context. It distinguishes five layers of fit: perceptual, semantic, affective, cognitive, and ethical, and two diagnostic polarities of misalignment: underfit and overreach. We apply LCAM to a published LLM counseling example, showing how an apparently supportive response can reinforce harmful beliefs, simulate inappropriate care, and obscure role boundaries. By translating conversational failures into audit and governance questions concerning over-reliance, false intimacy, autonomy erosion, boundary confusion, and inappropriate trust, LCAM offers a theoretical and normative lens for evaluating conversational AI beyond accuracy, helpfulness, or trust.

2606.08172 2026-06-09 cs.HC cs.AI cs.CY cross-listed

The Governance of Human-LLM Interaction: Safety Gating, Civility Steering, and Affective Default Lock-In

Manuele Reani, Hongjian Zhang, Hongyu Tian

Affiliations: School of Management and Economics, The Chinese University of Hong Kong, Shenzhen, China

AI summary: Using a deterministic multi-agent evaluation pipeline, this study measures prompt steerability and style drift of LLMs in long-horizon dialogue, proposes a governance framework distinguishing safety gating, civility steering, and affective default lock-in, and examines how provider control over interaction form affects pluralism, autonomy, and democratic agency.

Abstract

Large language models (LLMs) increasingly mediate high-stakes interactions in finance, medicine, and mental-health support, yet users have limited control over how these systems communicate. We frame interaction style as a governance object: provider-side alignment not only blocks harmful content, but also stabilizes communicative defaults that shape users' epistemic distance, relational expectations, and capacity to opt out of emotionalized or anthropomorphic interaction. We introduce a deterministic multi-agent evaluation pipeline for measuring prompt steerability and style drift in long-horizon dialogue. The study replays 100 frozen user-only scripts across four domains and three runnable persona conditions: default, sarcastic, and cold, using three generator models, yielding 90,000 assistant replies scored by a human-calibrated LLM judge on harmfulness, negative emotion, inappropriateness, empathic language, anthropomorphism, and refusal behavior. A fourth harmful persona is evaluated separately as a safety-gating test. The paper contributes a reproducible method for quantifying whether prompt-specified styles remain stable over time and a governance framework distinguishing safety gating, civility steering, and affective default lock-in. Overall, we show that prompt steerability and regression-to-default are observable indicators of provider control over communicative form, with implications for pluralism, autonomy, and democratic agency in human-LLM interaction.

2606.08365 2026-06-09 cs.LG cs.AI cross-listed

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Evan Duan

Affiliations: University of Michigan

AI summary: Proposes a pre-intervention screening framework that uses feature statistics to predict side effects of SAE steering (effect instability and collateral spread). Across multiple models and dictionaries, signals such as decoder geometry outperform baselines, although predictive performance varies by model.

Abstract

Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral spread, and evaluate GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B across ReLU, JumpReLU, and TopK SAE dictionaries. Across these settings, decoder geometry, activation statistics, co-activation structure, and direct-logit footprint predict steering modularity better than frequency-only and activation-magnitude baselines. The signal is strongest in GPT-2-small, Pythia-70M, and Llama-3.1-8B, where it survives residualization against magnitude-related confounds, and weaker in Gemma-2-2B. Held-out screening shows that ranking unseen features by predicted cleanliness can select features that steer more cleanly on fresh contexts, but the successful axis varies by setting: GPT-2 improves most cleanly, Pythia improves mainly on stability, Llama mainly on collateral, and Gemma only partially. A controlled Llama Scope width comparison shows that the predictive signal persists under a 32K-to-128K dictionary-width change, although the screening payoff becomes less stable. Overall, SAE steering side effects are predictable in advance, but the useful predictor signature and transferred modularity axis are model- and dictionary-setting dependent.

2606.08381 2026-06-09 cs.CL cs.AI cross-listed

Confidence Trap: Calibration Attacks on Graph Neural Networks

Cuong Dang, Jiahao Zhang, Hieu Ta Quang, Dung Le, Lu Cheng, Suhang Wang

Affiliations: Virginia Polytechnic Institute and State University; The Pennsylvania State University; VinUniversity; University of Illinois at Chicago

AI summary: Proposes the Unified Graph Calibration Attack (UGCA) framework, which combines a KL-divergence loss, a reranking mechanism, and a hybrid loss to substantially raise Expected Calibration Error while preserving classification accuracy, and shows that higher-accuracy or many-class models are more vulnerable.

Abstract

While confidence calibration is essential for trustworthy decision-making in safety-critical applications, the robustness of calibrated GNNs to adversarial structural perturbations remains largely unexplored. However, studying calibration attacks on graphs presents unique technical challenges: (1) the discrete nature of graph structures complicates gradient-based optimization, (2) existing underconfidence objectives fail to drive predictions toward uniform distributions, and (3) GNNs are highly sensitive to edge perturbations, often causing unintended label changes that violate attack constraints. To address these challenges, we propose a Unified Graph Calibration Attack (UGCA) framework designed for worst-case (white-box) analysis of GNN calibration robustness. UGCA introduces a KL-divergence loss to encourage uniform predictive distributions, a reranking mechanism to reduce label flipping, a hybrid loss to recover labels when violations occur, and beam search to explore a broader adversarial search space. We further provide theoretical insights linking model generalization, dataset complexity, and calibration vulnerability, showing that models with higher accuracy or trained on datasets with more classes are more susceptible under this threat model. Extensive experiments demonstrate that UGCA substantially increases Expected Calibration Error while preserving classification accuracy. Our code is publicly available at https://github.com/CaptainCuong/Graph-Calibration-Attack.git.
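
To make the calibration metric in this abstract concrete, the following is a minimal sketch of Expected Calibration Error with equal-width confidence bins, together with the KL divergence from a prediction to the uniform distribution (the quantity a KL-based underconfidence objective would shrink). The function names, the 10-bin default, and the toy inputs are illustrative assumptions; none of this is taken from the UGCA code.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (bin weight) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

def kl_to_uniform(probs):
    """KL(p || uniform) = log K - H(p); zero exactly when p is uniform."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0)

# Toy case (assumed data): same labels and same accuracy, lower confidence.
labels_hit = [1, 1, 1, 0]
ece_before = expected_calibration_error([0.9] * 4, labels_hit)
ece_after = expected_calibration_error([0.5] * 4, labels_hit)
```

In the toy case accuracy stays at 75% while the confidence gap widens, which is the pattern the abstract describes: calibration error rises without any label flipping.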

2606.08571 2026-06-09 cs.CL cs.AI cs.LG cross-listed

Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models

Subramanyam Sahoo

Affiliations: University of California, Berkeley

AI summary: Proposes the Structured Ignorance Certificate (SIC) output format and fine-tunes a 14B model with GRPO so that, when it cannot answer, the model explicitly acknowledges the missing knowledge and generates a retrieval query; on unknown-unknown questions it reaches 99.46% JSON validity and a certificate specificity score of 0.967.

Comments: Accepted in ICML 2026 Workshop: Epistemic Intelligence in Machine Learning

Abstract

Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce Structured Ignorance Certificates (SICs), a JSON-formatted output schema that demands a model explicitly name the missing domain intersection, enumerate required concepts, and propose a productive retrieval query rather than hallucinating an answer. To train models to produce high-quality SICs, we construct a 7,347-sample Unknown-Unknown (UU) dataset by prompting Qwen3-14B to stitch together questions from seven domains (physics, biology, engineering, CS, economics, medical, legal) into novel cross-domain queries that no single-domain expert could answer. We fine-tune a 14B-parameter model with Group Relative Policy Optimization (GRPO) using a composite reward that combines retrieval utility, concept specificity, and output-format validity. A paraphrase-divergence probe trained on model responses confirms that SIC-tuned outputs systematically exhibit higher unknown-unknown probability scores. Evaluation on 735 held-out UU questions achieves a 99.46% JSON validity rate, a mean Certificate Specificity Score of 0.967, and a 3.6% ROUGE-L improvement over the base model on retrieval-grounded generation -- demonstrating that explicit epistemic structuring is a learnable and measurable capability.
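
As an illustration of what an output-format validity check for such a certificate might look like, here is a small sketch. The abstract only says the schema names the missing domain intersection, lists required concepts, and proposes a retrieval query; the field names `missing_intersection`, `required_concepts`, and `retrieval_query` are hypothetical, as is the rule that each field must be non-empty.

```python
import json

# Hypothetical field names and types; the paper's actual schema is not given in the abstract.
REQUIRED_FIELDS = {
    "missing_intersection": list,  # domains whose overlap the model lacks
    "required_concepts": list,     # concepts needed to answer the question
    "retrieval_query": str,        # query to hand to a retriever
}

def is_valid_certificate(raw_text):
    """Return True if raw_text parses as a JSON object with every required, non-empty field."""
    try:
        obj = json.loads(raw_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    for name, expected_type in REQUIRED_FIELDS.items():
        value = obj.get(name)
        if not isinstance(value, expected_type) or len(value) == 0:
            return False
    return True

def validity_rate(outputs):
    """Fraction of model outputs that pass the format check."""
    return sum(is_valid_certificate(o) for o in outputs) / len(outputs)

# Assumed example outputs, one well-formed and one free-text answer.
good = json.dumps({
    "missing_intersection": ["economics", "biology"],
    "required_concepts": ["price elasticity", "enzyme kinetics"],
    "retrieval_query": "economic models of enzyme production cost",
})
bad = "I think the answer is 42."
```

A rate computed this way over held-out questions corresponds in spirit to the JSON validity figure the abstract reports, under the assumed schema.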

2606.08661 2026-06-09 cs.CR cs.AI cs.DB cross-listed

Data Agents Under Attack: Vulnerabilities in LLM-Driven Analytical Systems

Kuncan Wang, Ziting Wang, Peizhuo Lv, Haoyang Li, Guoliang Li, Gao Cong, Wei Dong

Affiliations: Nanyang Technological University, Singapore; The Hong Kong Polytechnic University; Tsinghua University

AI summary: This study systematically analyzes security vulnerabilities in LLM-driven data agents, proposes a layered vulnerability framework and an attack taxonomy, and evaluates the attacks on six systems, revealing substantial security flaws in current systems.

Abstract

Data agents integrate LLM-driven reasoning with relational data access, executable analytical tools, and multi-step workflow orchestration, making them increasingly central to enterprise analytics. This integration introduces new security vulnerabilities across data resources, database execution, and agent reasoning, recombining concerns from database security and general-purpose LLM-agent security into failure modes that neither line of work captures on its own. To address this gap, we present a systematic security study of data agents. Our contributions are threefold. First, we develop a layered vulnerability framework that identifies eight data agent-specific risks across interpretation, execution, and policy layers. Second, we introduce an attack taxonomy organized by adversary goal, tactic, and technique, covering three goals, seven tactics, and fourteen techniques, and pair it with an LLM-driven payload generation pipeline grounded in real database schemas. Third, we evaluate these attacks on six systems, including four open-source data agents and two production cloud analytics services. Our experiments reveal substantial security vulnerabilities across current systems and yield four key takeaways.

2606.08682 2026-06-09 cs.LG cs.AI cross-listed

Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

Qi Cao, Jian Lou, Meiting Liu, Wenjie Feng, Dan Li, See-Kiong Ng, Anh Tuan Luu

Affiliations: Nanyang Technological University; Sun Yat-sen University; University of Science and Technology of China; National University of Singapore

AI summary: Studies whether activation steering induces emergent misalignment. With an expanded evaluation scope, it finds that activation steering can cause broad misalignment and yields more coherent harmful responses than fine-tuning, and it analyzes the key contributing factors.

Abstract

Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By constructing a steering vector from examples of a target behavior and injecting it into intermediate activations during inference, activation steering enables flexible behavioral control while avoiding the permanent parameter updates required by finetuning. Meanwhile, recent work has identified emergent misalignment (EM) as a significant safety concern, wherein models finetuned on unsafe examples from a narrow task may unexpectedly generalize to broadly unsafe behavior on unrelated tasks. Although finetuning-induced EM has been extensively studied, whether activation steering can induce EM remains comparatively under-explored, despite its increasing use as a model-control technique. In this paper, we present a comprehensive study of activation-steering-induced emergent misalignment, substantially expanding the evaluation scope beyond existing pioneering work. First, we show that activation steering can induce broad misalignment, even in the recent Qwen-3.5 series. Moreover, activation-steered models produce harmful responses with stronger semantic relevance and higher coherence than their finetuned counterparts, making the resulting misalignment potentially more harmful. Second, we characterize properties of AS-induced EM by analyzing key steering-specific factors, including steering magnitude, the low-rank structure of the steering subspace, and the number of epochs during steering-vector construction. Third, we evaluate the robustness and sensitivity of AS-induced EM across diverse model families, model scales, target tasks, and intervention layers. Our findings reveal activation steering as a significant yet under-examined source of emergent misalignment and provide an activation-space perspective for understanding the mechanisms and safety risks of EM.

2606.08777 2026-06-09 cs.LG cs.AI cross-listed

How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects

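Abhivansh Gupta, Simardeep Singh, Advika Sinha, Shreyansh Modi, Akshat Tomar

Affiliations: University of California, Berkeley; DeepMind

AI summary: This paper defines a causal-influence measure based on log-probability differences and uses circuit-discovery techniques to study the counterfactual robustness of hallucinated outputs in vision-language models, deriving the minimum number of counterfactual samples needed to detect instability.

2606.08806 2026-06-09 cs.SE cs.AI cross-listed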

Governance Controls for AI-Generated Test Artifacts in Autonomous Software Testing

Dimple Bajaj, Deepak Khetan

Affiliations: GitHub

AI summary: Proposes the Governance-Aware Autonomous Testing Framework (GATF), which applies governance validation, explainability analysis, risk assessment, compliance monitoring, and audit governance to AI-generated test artifacts, reducing governance-related risk by 89.6% with 94.3% accuracy.

Comments: 21 pages, 9 figures

Abstract

Artificial Intelligence (AI) and Large Language Models (LLMs) are increasingly used in autonomous software testing; however, AI-generated test artifacts often suffer from hallucinations, compliance violations, security risks, and limited explainability. To enhance the reliability, transparency, and trustworthiness of AI-generated testing artifacts, this research introduces the Governance-Aware Autonomous Testing Framework (GATF). The framework extends the autonomous testing lifecycle with governance validation, explainability analysis, probabilistic risk assessment, compliance monitoring, and audit governance. Experiments were performed with the Defects4J and PROMISE software engineering datasets. The proposed framework reduced governance-related risks by 89.6% and achieved 94.3% governance accuracy, 96.5% artifact reliability, 94.2% compliance accuracy, and 90.8% explainability performance. The results show that governance-aware autonomous testing systems can significantly improve reliability, transparency, and operational security compared with conventional AI-based testing systems. The proposed architecture is scalable and reliable and provides a safe environment for software testing.

2606.08893 2026-06-09 cs.LG cs.AI cs.CR cross-listed

Cheap Reward Hacking Detection

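Iván Belenky, Joaquín Itria, Steven Johns

Affiliations: Tamarillo

AI summary: Proposes using a small Transformer encoder to map trajectories onto the unit sphere so that embedding distances approximate the L1 distance between rewards and metadata; a linear probe then detects reward hacking with an AUC of 0.9467, at a cost four orders of magnitude lower than LLM-as-judge.

Comments: 20 pages, 6 figures, 12 tables

2606.08960 2026-06-09 cs.CR cs.AI cs.LG cs.MA cross-listed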

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

Ziqian Zhong, Ivgeni Segal, Ivan Bercovich, Shashwat Saxena, Kexun Zhang, Aditi Raghunathan

Affiliations: Carnegie Mellon University; Fewshot Corp; Independent Researcher

AI summary: Proposes the hacker-fixer loop, in which LLM agents alternately attack and patch verifiers to automatically produce exploit-resistant verifiers, cutting the attack success rate on KernelBench from 62% to 0%.

Abstract

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.

2606.08969 2026-06-09 cs.CL cs.AI cross-listed

CARE: A Conformal Safety Layer for Medical Summarization

Suhana Bedi, Bridget Lin, Anson Y. Zhou, Chloe O. Stanwyck, Jenelle A. Jindal, Sanmi Koyejo, David Stutz, Nigam H. Shah

Affiliations: Stanford University; Google DeepMind

AI summary: Proposes CARE, which uses conformal risk control to provide calibrated omission and hallucination flags for LLM medical summaries, reducing review burden while maintaining safety guarantees.

Comments: 29 pages, 5 figures

Abstract

Large language models (LLMs) are increasingly used for medical summarization, but their outputs can omit medically important information and introduce unsupported claims. Existing error-detection methods produce heuristic or uncalibrated scores, providing no formal control over missed errors and no principled way to trade off safety against clinician review burden. We introduce Conformal Assessment for Risk Evaluation (CARE), a post-hoc, model-agnostic safety layer that uses conformal risk control to overlay calibrated omission and hallucination flags onto summaries from any LLM without retraining. CARE provides finite-sample, distribution-free guarantees through two controllers: a hallucination controller that bounds the probability of a document containing any unflagged hallucinated sentence, and an omission controller that bounds the expected fraction of important omissions not surfaced for review. Unlike hallucination detection, omissions depend jointly on whether a source sentence is important and whether it is covered by the summary. We show that calibrating only one dimension can violate the target risk bound, while marginal decompositions remain valid but overly conservative. By jointly calibrating over the full $(τ,γ)$ threshold space, CARE preserves formal guarantees while surfacing up to 5$\times$ fewer sentences than alternative calibrated baselines. Across five medical summarization tasks, CARE satisfies the target risk bound at $α= 0.15$ with 95% confidence across 100 calibration/test resplits, using only ~100 labeled documents per domain. In a preliminary clinician study (75 document reviews), calibrated flags improved omission detection by 28.6 percentage points on average. These results show that sentence-level safety guarantees are feasible for LLM-assisted medical summarization and offer a tunable mechanism for balancing residual risk and review effort.
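
The idea of bounding the expected fraction of missed omissions can be sketched with the generic conformal risk control recipe: choose the least aggressive flagging threshold whose finite-sample-corrected empirical risk stays at or below the target. This is a one-threshold simplification under assumed data, not CARE's joint calibration over the $(τ,γ)$ space; the function names, the per-sentence score format, and the grid are all illustrative assumptions.

```python
def missed_fraction(doc, threshold):
    """Fraction of a document's important sentences whose score falls below the flagging threshold."""
    important = [score for score, is_important in doc if is_important]
    if not important:
        return 0.0
    return sum(score < threshold for score in important) / len(important)

def calibrate_threshold(calibration_docs, alpha, grid):
    """Pick the largest threshold whose corrected empirical risk is <= alpha.

    Uses the conformal risk control bound (n / (n + 1)) * R_hat + B / (n + 1) with B = 1,
    which applies because the loss lies in [0, 1] and is non-decreasing in the threshold.
    """
    n = len(calibration_docs)
    best = None
    for threshold in sorted(grid):
        risk = sum(missed_fraction(d, threshold) for d in calibration_docs) / n
        if (n / (n + 1)) * risk + 1 / (n + 1) <= alpha:
            best = threshold
        else:
            break  # risk only grows with the threshold, so no later value can qualify
    return best  # None means even the lowest threshold in the grid is too risky

# Assumed toy calibration set: each document is a list of (detector score, is_important_omission).
doc = [(0.8, True), (0.3, True), (0.1, False)]
grid = [0.1, 0.2, 0.3, 0.4, 0.5]
chosen = calibrate_threshold([doc] * 9, alpha=0.15, grid=grid)
too_few = calibrate_threshold([doc] * 4, alpha=0.15, grid=grid)
```

With only four calibration documents the correction term 1/(n+1) already exceeds the target, so no threshold qualifies; this is one way to see why a minimum calibration set size matters for a given risk level.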

2606.09084 2026-06-09 cs.CR cs.AI cross-listed

Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

Xiaofeng Lin, Yukai Yang, Daniel Guo, Sahil Arun Nale, Charles Fleming, Guang Cheng

Affiliations: University of California, Berkeley

AI summary: Proposes Context-Fractured Decomposition (CFD) attacks on tool-using LLM agents, which exploit cross-context artifact provenance gaps to carry out multi-step jailbreaks, raising success rates by up to 28.3 percentage points.

Abstract

Tool-using LLM agents interact with the world through actions that persist state in artifacts (e.g., workspace files or logs). Consequently, jailbreak defenses must reason about cross-step composition rather than isolated text. Yet most existing attacks and defenses, including "multi-turn" jailbreaks such as Crescendo and Tree of Attacks, still assume a single contiguous conversation visible to the defender. This assumption breaks down in real agent pipelines, where enforcement is fragmented across tools, modules, and time, and where artifact provenance is often not tracked. We operationalize a deployment failure mode for tool-using LLM agents, the provenance gap, and study reproducible triggers for it: Context-Fractured Decomposition (CFD), a family of cross-context multi-step jailbreaks that preserve benign-looking intermediate artifacts from an early interaction and elicit harmful behavior much later, potentially in a different agent instance or workflow stage, via individually innocuous tool actions whose risk emerges only under delayed artifact-mediated composition. We instrument the failure mode with trace-level diagnostics and outline a verifiable mitigation direction (provenance lineage tagging). Across agent-system jailbreak benchmarks, CFD improves success rates by up to 28.3 percentage points over state-of-the-art baselines, even against strong single-turn judges. Disclaimer: This paper contains examples of harmful or offensive language.

2606.09125 2026-06-09 cs.CR cs.AI cross-listed

Unveiling Privacy Risks in Multi-modal Large Language Models: Task-specific Vulnerabilities and Mitigation Challenges

Tiejin Chen, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, Hua Wei

Affiliations: Arizona State University; University of North Carolina at Chapel Hill; North Carolina State University

AI summary: This study reveals privacy-leakage risks in multi-modal large language models that process images and text, builds the MM-Privacy dataset to evaluate disclosure and retention risks across tasks, and highlights the effect of task inconsistency on privacy risk.

Abstract

Privacy risks in text-only Large Language Models (LLMs) are well studied, particularly their tendency to memorize and leak sensitive information. However, Multi-modal Large Language Models (MLLMs), which process both text and images, introduce unique privacy challenges that remain underexplored. Compared to text-only models, MLLMs can extract and expose sensitive information embedded in images, posing new privacy risks. We reveal that some MLLMs are susceptible to privacy breaches, leaking sensitive data embedded in images or stored in memory. Specifically, in this paper, we (1) introduce MM-Privacy, a comprehensive dataset designed to assess privacy risks across various multi-modal tasks and scenarios, where we define Disclosure Risks and Retention Risks; (2) systematically evaluate different MLLMs using MM-Privacy and demonstrate how models leak sensitive data across various tasks; and (3) provide additional insights into the role of task inconsistency in privacy risks, emphasizing the urgent need for mitigation strategies. Our findings highlight privacy concerns in MLLMs, underscoring the necessity of safeguards to prevent data exposure. Our dataset and code can be found here.

2606.09135 2026-06-09 cs.CR cs.AI cross-listed

Steganography Without Modification: Hidden Communication via LLM Seeds

Felix Mächtle, Jonas Sander, Sebastian Berndt, Ben Weimar, Nils Loose, Thomas Eisenbarth

Affiliations: Institute for IT Security, University of Lübeck; Technische Hochschule Lübeck

AI summary: Exploits the dependence of deterministic decoding on the pseudo-random number generator seed in LLM inference stacks to build a steganographic channel that requires no change to model weights or sampling code: the secret message is encoded in the seed and the receiver recovers it by exhaustive search.

Comments: To appear in the Proceedings of the International Conference on Availability, Reliability and Security (ARES 2026)

Abstract

We demonstrate that widely deployed Large Language Model (LLM) inference stacks harbor a steganographic channel that requires no modification to model weights, sampling code, or output distributions. The channel exploits a structural property of deterministic decoding: pseudo-random number generators (PRNGs) used in inverse-transform sampling produce a seed-dependent sequence of token-level probability intervals that can be reconstructed from the generated text alone. A sender encodes a secret message in the PRNG seed before generation; a receiver reconstructs the intervals and recovers the seed, and thus the hidden payload, by exhaustive search over the seed space. We formalize two operational modes. In the known-prompt setting, sender and receiver share the prompt, enabling exact interval reconstruction and perfect seed recovery via forced alignment. In the unknown-prompt setting, only the generated text is available; approximate interval reconstruction combined with a maximum-hit-count scoring strategy still permits reliable recovery from sufficiently long outputs. Extensive experiments across six model families and five heterogeneous text domains show that, in the known-prompt setting, full 32-bit seed recovery from the complete 2^32 candidate space achieves up to 100% accuracy, depending on model and text domain, within 300 tokens and under 35 seconds on a single GPU. In the unknown-prompt setting, recovery reaches near-perfect accuracy at 600-800 tokens in about 12 seconds. We further analyze the influence of prompting strategies, tokenization ambiguities, and sampling hyperparameters on channel reliability. Moreover, we discuss several applications of our results: First, it allows for the steganographic transmission of 32 bits, but also shows that ignorance of the prompt is not a valid security assumption.
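The seed-recovery idea described in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `next_token_probs` is a hypothetical stand-in for an LLM, Python's `random.Random` replaces whatever PRNG a real inference stack uses, and the seed space is shrunk to 12 bits so the exhaustive search finishes instantly. It mirrors only the known-prompt setting, where the receiver can rebuild the exact probability interval of every emitted token.

```python
import random

VOCAB_SIZE = 4  # toy vocabulary; a real model has tens of thousands of tokens


def next_token_probs(prev: int) -> list[float]:
    """Hypothetical stand-in for an LLM: depends only on the previous token id."""
    weights = [((prev + 1) * (i + 2)) % 7 + 1 for i in range(VOCAB_SIZE)]
    total = sum(weights)
    return [w / total for w in weights]


def token_intervals(prev: int) -> list[tuple[float, float]]:
    """CDF interval [lo, hi) assigned to each token by inverse-transform sampling."""
    bounds, lo = [], 0.0
    for p in next_token_probs(prev):
        bounds.append((lo, lo + p))
        lo += p
    bounds[-1] = (bounds[-1][0], 1.0)  # absorb rounding error in the last interval
    return bounds


def generate(seed: int, n_tokens: int) -> list[int]:
    """Sender: the secret payload is the PRNG seed used for sampling."""
    rng = random.Random(seed)
    prev, out = 0, []
    for _ in range(n_tokens):
        u = rng.random()
        tok = next(i for i, (_, hi) in enumerate(token_intervals(prev)) if u < hi)
        out.append(tok)
        prev = tok
    return out


def recover_seed(tokens: list[int], seed_bits: int) -> list[int]:
    """Receiver: rebuild the intervals from the text, then search the seed space."""
    intervals, prev = [], 0
    for tok in tokens:
        intervals.append(token_intervals(prev)[tok])
        prev = tok
    matches = []
    for cand in range(2 ** seed_bits):
        rng = random.Random(cand)
        # all() stops at the first draw that falls outside its interval
        if all(lo <= rng.random() < hi for lo, hi in intervals):
            matches.append(cand)
    return matches


if __name__ == "__main__":
    text = generate(seed=1234, n_tokens=64)
    print(recover_seed(text, seed_bits=12))
```

Each token narrows the set of consistent seeds, which is why longer outputs make recovery more reliable; the true seed always survives the filter because sender and receiver use the same interval construction.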


2606.09189 2026-06-09 cs.CR cs.AI 交叉投稿

Pretrained, Frozen, Still Leaking: Auditing Cross-Encoder Attribute Transfer in EEG Foundation Models

预训练、冻结、仍在泄露：脑电图基础模型中跨编码器属性转移的审计

Jianwei Tai

发表机构 * Jianwei Tai（Tai Jianwei）

AI总结提出跨编码器桥接攻击，证明单一端点审计无法检测属性泄露，并引入审计端点分歧分数（AEDS）作为联合发布决策规则。


AI中文摘要

脑电图基础模型的发布通常一次只审计一个端点：原始重建、成员推断、身份链接或下游头的DP-SGD。我们在所有四个端点上联合审计相同的发布嵌入，针对BIOT、LaBraM和EEGPT，并表明每个单一端点审计都会清除仍然泄露频谱属性的发布。决定性的证据是跨编码器转移审计：从一个冻结编码器学习的单一岭属性解码器，通过拟合的线性桥接，转移到每个其他编码器的留出受试者测试集，在所有六个BIOT/LaBraM/EEGPT方向上，受试者不相交的匹配对照95%置信区间下界至少为0.081。我们证明了一个充分条件：两个编码器共享一个非平凡的属性坐标投影重叠β，允许一个链式岭桥接攻击者，其中心增益下界为sqrt(β/(1+τ^2)) - eps_br - rho_0，并反解β在[0.008, 0.198]范围内。为了将联合审计转化为可部署的决策规则，我们引入了审计端点分歧分数（AEDS），证明了其正性的充分条件，并逐单元进行自举校准；在所有八个匹配置信区间单元中（EEGMMI上的BIOT/LaBraM/EEGPT；Sleep-EDF、54通道LIMO、CHB-MIT儿科头皮脑电图上的LaBraM），AEDS为正，p<0.001，而头部级别的Carlini LiRA成员审计仅达到AUC 0.50-0.70。标准防御在审计下失败：维纳风格噪声感知自适应攻击者、LiRA审计以及每个效用保持ε∈{4,8}的DP-SGD使属性通道基本保持不变。贡献是一个审计框架，将分散的单一端点防御转化为联合发布决策，由跨编码器桥接定理以及自适应攻击者、LiRA和DP-SGD基线支持；该审计许可发布阻止，而非原始波形窃取或留出受试者身份恢复。

英文摘要

EEG foundation-model releases are usually audited one endpoint at a time: raw-reconstruction, membership inference, identity linkage, or DP-SGD on the downstream head. We audit the same released embeddings under all four endpoints jointly, on BIOT, LaBraM, and EEGPT, and show that each single-endpoint audit clears releases that still leak spectral attributes. The decisive evidence is a cross-encoder transfer audit: a single ridge attribute decoder learned from one frozen encoder transfers, via a fitted linear bridge, to held-out-subject test splits of every other encoder, with subject-disjoint matched-control 95% CI lower bound at least 0.081 across all six BIOT/LaBraM/EEGPT directions. We prove a sufficient condition: two encoders sharing a nontrivial attribute-coordinate projector overlap beta admit a chained ridge bridge attacker with centered-gain lower bound sqrt(beta/(1+tau^2)) - eps_br - rho_0, and back-solve beta in [0.008, 0.198]. To turn the joint audit into a deployment-readable decision rule we introduce an audit-endpoint disagreement score (AEDS), prove sufficient conditions for its positivity, and bootstrap-calibrate it per cell; AEDS is positive in all eight matched-CI cells (BIOT/LaBraM/EEGPT on EEGMMI; LaBraM on Sleep-EDF, 54-channel LIMO, CHB-MIT pediatric scalp EEG) with p<0.001, while a head-level Carlini LiRA membership audit reaches AUC only 0.50-0.70. Standard defenses fail under audit: a Wiener-style noise-aware adaptive attacker, the LiRA audit, and DP-SGD at every utility-preserving epsilon in {4,8} leave the attribute channel essentially unchanged. The contribution is an audit framework that turns scattered single-endpoint defenses into a joint release decision, supported by a cross-encoder bridge theorem and adaptive-attacker, LiRA, and DP-SGD baselines; the audit licenses release-blocking, not raw-waveform exfiltration or held-out-subject identity recovery.


2606.09227 2026-06-09 cs.CR cs.AI cs.CE cs.CY cs.HC cs.SI 交叉投稿

Trustworthy Smart Fabs via Professional Proxies: Scaling Safe and Sustainable by Design (SSbD) through Industrial Data Spaces

通过专业代理实现可信智能晶圆厂：通过工业数据空间扩展安全与可持续设计（SSbD）

Han-Teng Liao, Chang-Yi Kao, Karen Ang

发表机构 * Independent Researcher Dept. Computer Science and Independent Researcher Information Management（独立研究员计算机科学系及独立研究员信息管理）

AI总结针对欧盟SSbD等法规带来的治理瓶颈，提出基于零信任的社会技术编排框架，通过硬件隔离信任区中的专业代理工作流，在工业数据空间中实现自主治理，解决数据主权悖论。

Comments This work was accepted for presentation at the 32nd IEEE ICE/ITMC Conference, Porto, Portugal, 2026 but was subsequently withdrawn prior to publication due to submission volume limits. It is currently under consideration for publication elsewhere


AI中文摘要

2026年欧盟安全与可持续设计（SSbD）框架、企业可持续发展尽职调查指令（CSDDD）和碳边境调节机制（CBAM）的融合，为先进半导体制造设施（“智能晶圆厂”）带来了严重的治理瓶颈。法规合规需求已超出人工企业报告的能力，在多利益相关方透明度与企业数据隐私之间造成了直接冲突。本文通过引入一个零信任的社会技术编排框架来应对这一挑战，该框架在可信工业数据空间中实现了六层SSbD参考架构的操作化。我们提出从被动自动化向自主治理的转变，通过“专业代理”——在硬件隔离信任区内执行的基于角色的代理工作流。该框架结构化为一个可互操作的网络协议栈，协调设施、工艺工程和财务代理团队之间的自动化“五步接力赛”，将工厂车间的良率模型与宏观可持续发展指令对齐。通过在基于硬件的可信执行环境（TEE）中执行虚拟量测（VM）预测和联邦机器学习（FML），该架构解决了数据主权悖论，展示了晶圆厂如何通过国际数据空间（IDS）连接器导出加密签名的合规令牌，而无需暴露专有工艺配方。最终，该框架为技术管理者提供了一条可验证、基于证据的路径，通向有韧性的净零工业5.0生态系统。

英文摘要

The convergence of the 2026 European Union Safe and Sustainable by Design (SSbD) framework, Corporate Sustainability Due Diligence Directive (CSDDD), and Carbon Border Adjustment Mechanism (CBAM) introduces a severe governance bottleneck for advanced semiconductor manufacturing facilities ("Smart Fabs"). Regulatory compliance demands have surpassed the capacity of manual corporate reporting, creating a direct conflict between multi-stakeholder transparency and corporate data privacy. This paper addresses this challenge by introducing a zero-trust socio-technical orchestration framework that operationalizes a six-layer SSbD reference architecture within trustworthy industrial data spaces. We propose a shift from reactive automation to autonomous governance through "Professional Proxies": role-based agentic workflows executing within hardware-isolated trust zones. Structured as an interoperable network protocol stack, the framework coordinates an automated, five-step "relay race" between Facility, Process Engineering, and Finance proxy teams to align factory-floor yield models with macro-level sustainability mandates. By executing Virtual Metrology (VM) predictions and Federated Machine Learning (FML) inside hardware-rooted Trusted Execution Environments (TEEs), this architecture resolves the Data Sovereignty Paradox, demonstrating how fabs can export cryptographically signed compliance tokens via International Data Spaces (IDS) connectors without exposing proprietary process recipes. Ultimately, this framework provides technology managers with a verifiable, evidence-based pathway toward resilient, net-zero Industry 5.0 ecosystems.


2606.09315 2026-06-09 cs.CR cs.AI 交叉投稿

Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents

脑提示注入：BCI-LLM代理的路径安全审计

Jianwei Tai

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出路径安全审计契约，通过分离定理和共形校准量化BCI-LLM代理中脑信号注入攻击的风险，实验证明确认通道可降低路由风险。


AI中文摘要

BCI到代理的管道将解码的神经活动转化为工具使用代理的授权通道，暴露了一个我们称之为\emph{脑提示注入}的新攻击面：信号侧扰动、上下文仅注入和自适应双解码器攻击都可以改变路由动作，而EEG侧或文本侧监控器仍然盲视。该堆栈中的路径安全取决于审计日志能观察到什么，而不仅仅是解码器准确性或一致性。我们定义了一个路径安全审计契约：一个最小的日志模式、分母层次结构和端点规范，并证明了一个审计模式分离定理以及一个C3攻击依赖分解；干净的一致性和边际鲁棒性不能识别控制C3路由的联合项。作为契约之上的校准层，我们将分裂共形校准应用于非预言机EEG确认通道，并在明确的威胁原型矩阵下报告由此产生的假接受边界。我们在EEGMMI原生左/右命令控制上实例化该契约，涉及5,400个事件、无害工具存根和种子/案例分母。来源阻止C2路由（$0.000$）；一致性加来源路由C3翻转（$1.000$）；确认加来源路由它们（$0.000$）。共形边界在采集隔离下，对于$α=.005$，在干净效用$0.150$时达到FAR $0.000$；对于$α=.10$，在干净效用$0.452$时达到FAR $0.119$；攻击者可控制的确认通道将界限打破至$\approx\!1$。受试者集群自助法在60名受试者上确认了这些区间；跨架构（TinyEEGNet, EEGNetV4）和容量扫描结果显示在区域内饱和。中介和确认降低了风险；它们不是意图证书。

英文摘要

BCI-to-agent pipelines turn decoded neural activity into an authorization channel for tool-use agents, exposing a new attack surface we call brain-prompt injection: signal-side perturbations, context-only injections, and adaptive dual-decoder attacks can all change the routed action while EEG-side or text-side monitors remain blind. Route safety in this stack depends on what the audit log can observe, not on decoder accuracy or agreement alone. We define a Route-Safety Audit Contract: a minimal log schema, denominator hierarchy, and endpoint specification, and prove an audit-schema separation theorem together with a C3 attacked-dependence decomposition; clean agreement and marginal robustness do not identify the joint term that controls C3 routing. As a calibration layer on top of the contract, we apply split-conformal calibration to a non-oracle EEG confirmation channel and report the resulting false-accept frontier under an explicit threat-archetype matrix. We instantiate the contract on EEGMMI native left/right command-control over 5,400 events, harmless tool stubs, and seed/case denominators. Provenance blocks C2 routes ($0.000$); agreement-plus-provenance routes C3 flips ($1.000$); confirmation-plus-provenance routes them ($0.000$). The conformal frontier reaches FAR $0.000$ at clean utility $0.150$ for $\alpha=0.005$ and FAR $0.119$ at clean utility $0.452$ for $\alpha=0.10$ under acquisition isolation; an attacker-controllable confirmation channel breaks the bound to $\approx 1$. Subject-cluster bootstrap confirms these intervals on $60$ subjects; cross-architecture (TinyEEGNet, EEGNetV4) and capacity-sweep results show within-regime saturation. Mediation and confirmation reduce risk; they are not intent certificates.
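The split-conformal calibration step mentioned in this abstract can be sketched generically. This is a textbook split-conformal threshold, not the paper's exact procedure: the function names and the assumption that calibration scores come from events that should be rejected are hypothetical. Under exchangeability, a new should-be-rejected event exceeds the returned threshold with probability at most `alpha`.

```python
import math


def conformal_accept_threshold(negative_scores: list[float], alpha: float) -> float:
    """Threshold above which a confirmation score is accepted.

    negative_scores: held-out calibration scores from events that must NOT be
    accepted (assumed exchangeable with future such events).
    alpha: target bound on the false-accept rate.
    """
    n = len(negative_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        # Too few calibration points for this alpha: never accept.
        return math.inf
    return sorted(negative_scores)[k - 1]


def accept(score: float, threshold: float) -> bool:
    """Route the action only when the confirmation score clears the threshold."""
    return score > threshold
```

With 100 calibration scores, `alpha = 0.10` selects the 91st smallest score, while `alpha = 0.005` requires rank 101 and therefore yields an infinite threshold; stricter false-accept targets cost clean utility, which is the trade-off the reported frontier describes.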


2606.09408 2026-06-09 cs.CY cs.AI cs.HC 交叉投稿

Can Data Work be Reparative?

数据工作能否具有修复性？

Srravya Chandhiramowuli, Ding Wang, Alex Taylor

发表机构 * University of Edinburgh（爱丁堡大学）； Google Research（谷歌研究院）

AI总结通过民族志研究，探讨公民科技倡议如何从女性主义视角协作构建安全数据集，旨在将数据工作重塑为修复与补救的场所，并分析其中遇到的挑战与张力。

Comments To be presented at ACM FAccT, Montréal, Canada, June 25 to June 28, 2026


AI中文摘要

我们展示了一项关于数据工作替代方法的民族志研究，该方法由一项公民科技倡议开发，该倡议构建用于训练和基准测试在线安全系统的数据集。他们旨在从女性主义视角回应在线安全问题，通过与受在线伤害影响最大的人协作构建安全数据集。在本文中，我们考察了这种方法如何试图将数据工作重新定位为修复和补救的场所，并追溯他们在这一过程中遇到的挣扎。具体来说，我们关注在推进数据工作的公正报酬和AI数据集的集体治理方面所面临的挑战和张力。通过STS视角下的修复正义和修复理论审视这些挑战，我们认为修复数据工作（以及AI）的工作从根本上在于重置责任关系。在当前强调安全评估和红队测试等努力以使AI更加负责任的背景下，我们强调需要面对基本问题：参与这些努力的人类如何与他们帮助产生的数据集和系统相关联。修复性视角要求我们打断数据工作的主流规范，并将那些因当前数据集生产模式中的忽视、疏忽和排斥而受害最深的人置于中心，而不是AI或数据集。我们认为，这为责任提供了大胆的愿景，并为构建数据和AI实践的替代未来贡献了批判性议程。

英文摘要

We present an ethnographic study of an alternative approach to data work, developed by a civic-tech initiative that builds datasets for training and benchmarking online safety systems. They aim to respond to online safety concerns from a feminist perspective, by building safety datasets collaboratively with those most impacted by online harms. In this paper, we examine how this approach aims to reorient data work as a site for repair and redress, and trace the struggles they encounter in the process. Specifically, we draw attention to the challenges and tensions involved in advancing just reward for data work and collective governance of AI datasets. Examining these challenges through an STS-informed lens of reparative justice and repair, we argue that the work of repairing data work (and AI) lies, fundamentally, in resetting the ties of accountability. At a time of heightened emphasis on efforts like safety evaluations and red teaming to make AI more responsible, we highlight the need to confront foundational questions about how the humans involved in these efforts relate to the datasets and systems they help produce. A reparative lens demands that we interrupt prevailing norms of data work and place at their centre, not AI or datasets, but those most harmed by the neglect, oversight and exclusion animated in the current modes of dataset production. This, we argue, offers a bold vision for responsibility and contributes towards a critical agenda for building alternative futures of data and AI practice.


2606.09414 2026-06-09 cs.HC cs.AI 交叉投稿

SecureClaw: 夺回对LLM智能体的控制

Yuhan Ma, Stefan Schmid

发表机构 * TU Berlin（柏林技术大学）

AI总结针对工具使用型LLM智能体的双重安全漏洞，提出双边界架构SecureClaw，在效果汇点实施授权、在读边界实施明文隔离，通过预览-提交协议和可信网关实现安全控制，在多个基准上保持可用性的同时将攻击成功率降至接近零。


AI中文摘要

使用工具的大型语言模型（LLM）智能体面临两种不同的安全漏洞：未经授权的外部操作以及在最终输出检查介入之前运行时内部敏感明文的暴露。现有防御通常只保护一个边界（规划器/运行时或动作汇点），因此本身无法同时保护两个表面。我们提出SecureClaw，一种双边界架构，在效果汇点实施授权，在读边界实施明文隔离。敏感读取通过一个可信网关，该网关用不透明句柄替换原始值，在评估部署中，还使用有界摘要作为显式解密接口。改变外部状态的写入遵循PREVIEW→COMMIT协议，其中只有可信执行者才能提交策略授权的确切规范请求。运行时仍然可以基于摘要和符号引用进行规划，但不能直接解引用秘密或执行副作用。在AgentDojo、AgentLeak和Agent Security Bench (ASB)上，SecureClaw是我们在通用测试框架中评估的唯一一种同时保持可用任务效用并在ASB上实现0%攻击成功率（ASR）、在AgentDojo上实现0.64% ASR、在AgentLeak的攻击并行通道上实现3.23%总体泄漏（衡量最终输出和内部中继泄漏）的防御方法。

英文摘要

Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive plaintext inside the runtime before any final output check can intervene. Existing defenses usually protect one boundary, either the planner/runtime or the action sink, and therefore do not by themselves secure both surfaces. We present SecureClaw, a dual-boundary architecture that places authorization at the effect sink and plaintext confinement at the read boundary. Sensitive reads pass through a trusted gateway that replaces raw values with opaque handles and, in the evaluated deployment, bounded summaries as an explicit declassification interface. Writes that change external state follow a PREVIEW→COMMIT protocol in which only a trusted executor may commit the exact canonical request authorized by policy. The runtime can still plan over summaries and symbolic references, but cannot directly dereference secrets or perform side effects. Across AgentDojo, AgentLeak, and Agent Security Bench (ASB), SecureClaw is the only defense we evaluate in a common harness that simultaneously retains usable task utility and achieves 0% attack success rate (ASR) on ASB, 0.64% ASR on AgentDojo, and 3.23% overall leak on AgentLeak's attacked parity lane, which measures final-output and internal-relay leakage.
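The two boundaries described in this abstract can be sketched in a few lines. This is an illustrative toy, not SecureClaw's implementation: the class names `Gateway` and `Executor`, the `h_` handle format, and the use of a SHA-256 digest over canonical JSON are all assumptions. The untrusted runtime only ever sees handles and digests; only the trusted executor resolves handles and performs the side effect.

```python
import hashlib
import json
import secrets


def canonical(request: dict) -> str:
    """Canonical serialization so equal requests always hash identically."""
    return json.dumps(request, sort_keys=True, separators=(",", ":"))


def digest_of(request: dict) -> str:
    return hashlib.sha256(canonical(request).encode()).hexdigest()


class Gateway:
    """Read boundary: swaps sensitive plaintext for opaque handles."""

    def __init__(self) -> None:
        self._vault: dict[str, str] = {}

    def read(self, secret_value: str) -> str:
        handle = "h_" + secrets.token_hex(8)
        self._vault[handle] = secret_value
        return handle  # the runtime plans over this, never the plaintext

    def resolve(self, value):
        """Trusted side only: turn a handle back into its plaintext."""
        if isinstance(value, str):
            return self._vault.get(value, value)
        return value


class Executor:
    """Effect sink: commits only the exact request that policy approved."""

    def __init__(self, gateway: Gateway, policy) -> None:
        self._gateway = gateway
        self._policy = policy
        self._approved: set[str] = set()
        self.effects: list[dict] = []  # stand-in for real external side effects

    def preview(self, request: dict):
        if not self._policy(request):
            return None
        token = digest_of(request)
        self._approved.add(token)
        return token

    def commit(self, request: dict, token: str) -> bool:
        actual = digest_of(request)
        if token != actual or actual not in self._approved:
            return False  # tampered, unapproved, or replayed request
        self._approved.discard(actual)  # one commit per approval
        self.effects.append({k: self._gateway.resolve(v) for k, v in request.items()})
        return True
```

Binding the commit to a digest of the canonical request means a runtime that edits any field after approval, or replays an old approval, is refused at the sink.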


2606.09551 2026-06-09 cs.CR cs.AI 交叉投稿

FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing

FuseFSS：基于函数秘密共享的高效安全LLM推理

Yuhan Ma, Yong Li, Stefan Schmid

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出FuseFSS编译器，通过统一编译流水线替代逐算子协议设计，实现安全推理中非线性与辅助操作的高效处理，在BERT和GPT模型上取得1.24-1.50倍加速并减少通信与预处理开销。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)


AI中文摘要

双服务器安全推理允许客户端查询托管的大型语言模型（LLM）而不泄露提示或嵌入。基于函数秘密共享（FSS）的最新GPU系统使线性层高效，但定点非线性和辅助操作仍是瓶颈，因为每个算子通常通过自定义协议实现，包含各自的比较、回绕校正和预处理材料。我们提出FuseFSS，一个编译器，用单一编译流水线替代逐算子协议设计。对于每个定点算子，一个紧凑的规范列出其区间划分、低次算术片段和所需的谓词位。编译器在公开掩码值上执行两次批处理FSS评估：一次打包比较返回所有谓词位，一次向量区间查找返回活跃系数和常数。与当前最先进的基于FSS的GPU安全推理相比，FuseFSS在保持精度的同时，在BERT和GPT风格模型上实现了1.24倍至1.50倍的端到端加速，并将在线通信减少了9%至16%；预处理也更轻量，密钥生成时间降低14%至23%，密钥大小减小20%至24%。

英文摘要

Two-server secure inference allows a client to query a hosted large language model (LLM) without revealing prompts or embeddings. Recent GPU systems based on function secret sharing (FSS) make linear layers efficient, but fixed-point nonlinearities and helper operations remain a bottleneck because each operator is typically implemented as a bespoke protocol with its own comparisons, wrap-around corrections, and preprocessing material. We present FuseFSS, a compiler that replaces per-operator protocol design with a single compilation pipeline. For each scalar fixed-point operator, a compact specification lists its interval partition, low-degree arithmetic pieces, and required predicate bits. The compiler emits two batched FSS evaluations on the public masked value: one packed comparison that returns all predicate bits, and one vector interval lookup that returns the active coefficients and constants. Compared to the current state-of-the-art FSS-based GPU secure inference, FuseFSS preserves accuracy while achieving a 1.24× to 1.50× end-to-end speedup and reducing online communication by 9% to 16% on BERT and GPT-style models; preprocessing is also lighter, with 14% to 23% lower key-generation time and 20% to 24% smaller keys.
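The operator specification described in this abstract (interval partition, low-degree pieces, predicate bits) can be illustrated in plaintext. This sketch contains no secret sharing, masking, or fixed-point arithmetic; `OperatorSpec`, the helper names, and the hard-sigmoid example are hypothetical and only show how one comparison pass and one interval lookup suffice to evaluate a piecewise-polynomial operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorSpec:
    breakpoints: tuple  # sorted thresholds that partition the input range
    coeffs: tuple       # per interval: polynomial coefficients (c0, c1, ...)


def predicate_bits(spec: OperatorSpec, x: float) -> list[bool]:
    """Step 1: one packed comparison, one bit per breakpoint."""
    return [x >= b for b in spec.breakpoints]


def interval_lookup(spec: OperatorSpec, bits: list[bool]) -> tuple:
    """Step 2: the number of set bits indexes the active interval."""
    return spec.coeffs[sum(bits)]


def evaluate(spec: OperatorSpec, x: float) -> float:
    coeffs = interval_lookup(spec, predicate_bits(spec, x))
    acc = 0.0
    for c in reversed(coeffs):  # Horner evaluation of the low-degree piece
        acc = acc * x + c
    return acc


# Hypothetical example operator: 0 below -2, 0.25*x + 0.5 on [-2, 2), 1 from 2 up.
HARD_SIGMOID = OperatorSpec(
    breakpoints=(-2.0, 2.0),
    coeffs=((0.0,), (0.5, 0.25), (1.0,)),
)
```

Because every operator reduces to the same two steps, a new nonlinearity needs only a new `OperatorSpec` rather than a new protocol, which is the design point the abstract attributes to the compiler.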


2606.09559 2026-06-09 cs.LG cs.AI cs.CR cs.RO 交叉投稿

Safe-RULE: Safe Reinforcement UnLEarning

Safe-RULE：安全强化反学习

Shixiong Jiang, Taozheng Zhu, Fanxin Kong

发表机构 * University of Notre Dame（圣母大学）

AI总结针对离线安全强化学习易受数据投毒攻击的问题，提出Safe-RULE框架，通过反学习移除恶意样本影响，无需从头训练或访问原始环境，实验证明能有效提升安全性。

Comments 20 pages, 3 figures

2606.09692 2026-06-09 cs.CR cs.AI 交叉投稿

Observability for Delegated Execution in Agentic AI Systems

自主AI系统中委托执行的可观测性

Abhinav Mishra, Kumar Sharad

发表机构 * Splunk ； Cisco Inc（思科公司）

AI总结针对基于LLM的自主系统中委托执行轨迹难以归因的问题，提出一种轻量级网关和通用信息模型，在运行时绑定委托上下文，实现跨工具委托范围的可靠重建和直接取证查询。


全局XAI方法能否揭示LLM中的注入行为？SHAP vs 规则提取 vs RuleSHAP

Francesco Sovrano

发表机构 * Collegium Helveticum at ETH Zurich（苏黎世联邦理工学院霍夫曼学院）； Università della Svizzera italiana（瑞士意大利语区大学）

AI总结研究通过统计验证的抽象将全局LLM信念映射为数值分数，提出RuleSHAP算法，结合全局SHAP与规则归纳，以更好地捕捉非单变量触发因素，平均MRR@1比RuleFit提升82%。

Comments Accepted for publication at KDD'2026


DOI: 10.1145/3770855.3818093

AI中文摘要

大型语言模型（LLM）可能放大错误信息，破坏联合国可持续发展目标等社会目标。我们研究了三个有文献记载的错误信息驱动因素（效价框架、信息过载和过度简化），这些因素通常由默认信念塑造。基于LLM编码此类默认信念（例如，“快乐是积极的”、“数学是复杂的”）并可作为“启发式包”的证据，我们询问是否可以从黑盒LLM行为中恢复出错误信息相关行为背后的信念驱动启发式作为显式规则。一个关键障碍是可解释AI（XAI）中的全局规则提取方法是为数值输入输出数据设计的，而非文本。我们通过引出全局LLM信念并通过统计验证的抽象将其映射为数值分数来解决这一问题，从而使现成的全局XAI能够检测信念驱动的启发式。为了获得真实情况，我们通过系统指令向GPT系列和Llama模型注入复杂度递增的非线性行为触发因素（单变量、合取、非凸）。我们发现RuleFit经常遗漏非单变量触发因素，而全局SHAP在排名合取触发特征方面更好，但不产生符号规则。为了弥合这一差距，我们提出了RuleSHAP，一种将全局SHAP聚合与规则归纳相结合的规则提取算法，以更好地捕捉非单变量触发因素，平均MRR@1比RuleFit提升82%。我们的结果提示了一种揭示LLM中行为触发因素的实用途径。

英文摘要

Large language models (LLMs) can amplify misinformation, undermining societal goals such as the UN SDGs. We study three documented drivers of misinformation (valence framing, information overload, and oversimplification) often shaped by default beliefs. Building on evidence that LLMs encode such defaults (e.g., "joy is positive", "math is complex") and can act as "bags of heuristics", we ask whether belief-driven heuristics behind misinformation-related behaviour can be recovered from black-box LLM behaviour as explicit rules. A key obstacle is that global rule-extraction methods in explainable AI (XAI) are built for numerical input-output data, not text. We address this by eliciting global LLM beliefs and mapping them to numerical scores via statistically validated abstractions, enabling off-the-shelf global XAI to detect belief-driven heuristics. For ground truth, we inject nonlinear behavioural triggers of increasing complexity (univariate, conjunctive, non-convex) into GPT-family and Llama models via system instructions. We find that RuleFit often misses non-univariate triggers, while global SHAP better ranks conjunctive trigger features but yields no symbolic rules. To bridge this gap, we propose RuleSHAP, a rule-extraction algorithm that couples global SHAP aggregates with rule induction to better capture non-univariate triggers, improving MRR@1 over RuleFit by +82% on average. Our results suggest a practical pathway for surfacing behavioural triggers in LLMs.


2603.22793 2026-06-09 cs.AI 版本更新

Signals Are Not States: Neuro-Symbolic Safeguards for Culturally Aware Classroom AI

信号不是状态：面向文化意识课堂AI的神经符号安全机制

Sina Bagheri Nezhad

发表机构 * Independent Researcher（独立研究者）

AI总结本文提出NSCR框架，通过神经符号方法处理课堂多模态信号，区分可观测证据与文化负载解读，减少文化偏见对课堂AI的负面影响。

Comments Accepted at the Workshop on Stereotypes Across Cultures in Language Technologies @ ACL 2026


AI中文摘要

课堂AI系统越来越多地从多模态和语言信号推断出高水平教育状态，如参与度、困惑、合作、参与和教学质量。在多元文化和多语言课堂中，此类推断可能将文化特定的行为转化为刻板印象：沉默可能被解读为不参与，目光回避可能被解读为不专心，语言切换可能被解读为低能力，或间接求助可能被解读为困惑。我们主张，具有刻板印象意识的课堂AI应将可观察的证据与文化负载的解读分开，并将未经支持的构造层面的主张视为安全风险。我们引入NSCR，一种基于文化的神经符号框架，将视频、音频、语音识别、课程材料和上下文元数据转换为带不确定性的事实、来源和文化范围，然后通过可执行推理和政策约束组合它们。我们定义了刻板印象倾向课堂推断的分类学，并提出了涵盖文化条件下的状态推断、证据基础的主张验证、多语言和语言切换推理、合作分析、反事实文化鲁棒性以及文化条件下的红队测试的基准议程。我们进一步指定了刻板印象泄漏、未支持的归属、文化校准差距、文化模糊性下的回避以及证据忠实度的度量标准。贡献是方法学的：为减少课堂AI中的刻板印象推理提供具体的框架和评估议程，教育作为高风险、文化多变的部署场景。

英文摘要

Classroom AI systems increasingly infer high-level educational states such as engagement, confusion, collaboration, participation, and instructional quality from multimodal and linguistic signals. In multicultural and multilingual classrooms, such inferences can translate culturally situated behavior into stereotyped claims: silence may be read as disengagement, gaze aversion as inattention, code-switching as low proficiency, or indirect help-seeking as confusion. We argue that stereotype-aware classroom AI should separate observable evidence from culturally loaded interpretation and should treat unsupported construct-level claims as safety risks. We introduce NSCR, a culturally grounded neuro-symbolic framework that converts video, audio, ASR, lesson artifacts, and contextual metadata into typed facts with uncertainty, provenance, and cultural scope, then composes them through executable reasoning and policy constraints. We define a taxonomy of stereotype-prone classroom inferences and propose a benchmark agenda covering culture-conditioned state inference, evidence-grounded claim verification, multilingual and code-switched reasoning, collaboration analysis, counterfactual cultural robustness, and culture-conditioned red-teaming. We further specify metrics for stereotype leakage, unsupported attribution, cultural calibration gaps, abstention under cultural ambiguity, and evidence faithfulness. The contribution is methodological: a concrete framework and evaluation agenda for mitigating stereotyped reasoning in classroom AI, with education as a high-stakes, culturally variable deployment setting.


2606.06114 2026-06-09 cs.AI 版本更新

Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems

走向健康进化：探索人机交互在自我进化系统中的作用与机制

Dianxing Shi, Bowen Wang, Junqi He, Junhao Chen, Yuta Nakashima

发表机构 * The University of Osaka（大阪大学）

AI总结提出ANCHOR框架，通过模拟人类监督的反馈机制，在自我进化系统中缓解能力退化与安全漂移，实验表明有限监督可显著提升安全性与稳定性。


AI中文摘要

自我进化智能体通过持续的自我对弈和自我生成的学习信号进行改进，但自主进化也可能导致能力退化与安全漂移。尽管人类反馈已被证明对静态和后训练智能体有效，但其在自我进化系统中的作用仍未被充分探索。我们提出了通过类人监督与审查进行智能体规范修正（ANCHOR）框架，这是一个基于LLM的框架，模拟人类监督并在自我进化的不同阶段提供反馈。利用ANCHOR，我们评估了两个代表性的开源自我进化智能体系统在编程、数学推理和安全性方面的表现。结果表明，即使是有限的监督也能显著缓解安全退化，同时保持核心进化目标的稳定性能。进一步分析显示，对输出验证阶段的监督是最有效的干预方式，而增加监督频率则收益递减。这些发现为设计更稳定、可控且与人类对齐的自我进化智能体系统提供了经验证据和实践指导。

英文摘要

Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolving systems remains underexplored. We introduce Agent Norm Correction through Human-like Oversight and Review (ANCHOR), an LLM-based framework that simulates human supervision and delivers feedback at various phases of self-evolution. With ANCHOR, we evaluate two representative open-source self-evolving agent systems across coding, mathematical reasoning, and safety. Our results show that even limited supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives. Further analysis shows that supervision over the output verification phase is the most effective for intervention, whereas increasing supervision frequency yields diminishing returns. These findings provide empirical evidence and practical guidance for designing more stable, controllable, and human-aligned self-evolving agent systems.


2501.15509 2026-06-09 cs.CR cs.AI cs.LG 版本更新

FIT-Print: Towards False-claim-resistant Model Ownership Verification via Targeted Fingerprint

FIT-Print：通过目标指纹实现抗虚假声明的模型所有权验证

Shuo Shao, Haozhe Zhu, Yiming Li, Hongwei Yao, Tianwei Zhang, Zhan Qin

发表机构 * State Key Laboratory of Blockchain and Data Security, Zhejiang University（区块链与数据安全国家重点实验室，浙江大学）； Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou（杭州高新技术区（滨江）区块链与数据安全研究院，杭州）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算机与数据科学学院）； Department of Computer Science, City University of Hong Kong（香港城市大学计算机科学系）

AI总结针对现有模型指纹易受虚假声明攻击的问题，提出目标指纹范式FIT-Print，通过优化将指纹转化为可验证目标签名，并设计两种黑盒方法，实现100%防御成功率和0%误报率。

Comments This paper has been accepted by IEEE Transactions on Information Forensics and Security


AI中文摘要

模型指纹已成为保护开源模型知识产权的重要机制，提供了一种无需修改受保护模型的非侵入式方法。然而，我们的分析表明，现有指纹技术从根本上容易受到虚假声明攻击，即对手可以欺诈性地声称对独立的第三方模型拥有所有权。我们证明，这种脆弱性源于当前方法的非目标性，它们基于任意样本输出而非与特定预定义参考的对齐来评估模型相似性。为缓解此漏洞，我们引入了FIT-Print，一种主动对抗虚假声明攻击的目标指纹范式。具体来说，FIT-Print利用优化将指纹转化为可验证的目标签名。在此基础之上，我们提出了两种黑盒指纹方法：逐位的FIT-ModelDiff和逐列表的FIT-LIME，它们分别利用输出距离和特征归因作为鲁棒的模型签名。在基准模型和数据集上的广泛评估表明，我们的框架完美地中和了虚假声明攻击（100%防御成功率），消除了对独立模型的误报（0.0%），同时针对各种模型复用技术保持了100%的所有权验证率。

英文摘要

Model fingerprinting has emerged as a crucial mechanism for safeguarding the intellectual property of open-source models, offering a non-intrusive approach that requires no modifications to the protected model. However, our analysis reveals that existing fingerprinting techniques are fundamentally vulnerable to false claim attacks, wherein adversaries can fraudulently assert ownership over independent third-party models. We demonstrate that this vulnerability stems from the untargeted nature of current methods, which evaluate model similarity based on arbitrary sample outputs rather than alignment with a specific, predefined reference. To mitigate this vulnerability, we introduce FIT-Print, a targeted fingerprinting paradigm that actively counters false claim attacks. Specifically, FIT-Print leverages optimization to transform the fingerprint into a verifiable, targeted signature. Building upon this foundation, we propose two black-box fingerprinting methods, the bit-wise FIT-ModelDiff and the list-wise FIT-LIME, which utilize output distances and feature attributions as robust model signatures, respectively. Extensive evaluations across benchmark models and datasets show that our framework perfectly neutralizes false claim attacks (100% defense success rate) and eliminates false alarms on independent models (0.0%), all while maintaining a 100% ownership verification rate against diverse model reuse techniques.


2510.16028 2026-06-09 cs.CR cs.AI cs.LG cs.SY eess.SY 版本更新

TAO: Tolerance-Aware Optimistic Verification for Floating-Point Neural Networks

TAO：面向浮点神经网络的容忍感知乐观验证

Jianzhu Yao, Hongxu Su, Taobo Liao, Zerui Cheng, Huan Zhang, Xuechao Wang, Pramod Viswanath

发表机构 * Princeton University（普林斯顿大学）； HKUST (GZ)（香港科技大学（广州））； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结提出TAO协议，通过算子级容忍区域和Merkle锚定的争议游戏，在不依赖可信硬件或确定性内核的情况下验证浮点神经网络输出，开销仅0.3%。

Comments 18 pages, 8 figures


DOI: 10.1145/3767295.3803612
Journal ref: Proceedings of the 21st European Conference on Computer Systems, (2026) 1515-1532

AI中文摘要

神经网络越来越多地在用户无法控制的硬件上运行（云GPU、推理市场）。然而，机器学习即服务很少透露实际运行的内容或返回的输出是否忠实反映预期输入。用户无法对服务降级（模型交换、量化、图重写或诸如修改广告嵌入等差异）进行追索。验证输出很困难，因为异构加速器上的浮点执行本质上是不确定的。现有方法要么对实际浮点神经网络不实用，要么重新引入供应商信任。我们提出TAO：一种容忍感知乐观验证协议，它接受在原则性算子级接受区域内的输出，而不是要求逐位相等。TAO结合了两种误差模型：（i）每个算子的IEEE-754最坏情况界限和（ii）跨硬件校准的紧密经验百分位分布。差异触发一个Merkle锚定的、阈值引导的争议游戏，该游戏递归地划分计算图，直到剩下一个算子，此时裁决简化为轻量级理论界限检查或针对经验阈值的小型诚实多数投票。未受挑战的结果在挑战窗口后最终确定，无需可信硬件或确定性内核。我们将TAO实现为PyTorch兼容运行时和当前部署在以太坊Holesky测试网上的合约层。运行时检测图、计算每个算子的界限，并在FP32中运行未经修改的供应商内核，开销可忽略（Qwen3-8B上为0.3%）。在A100、H100、RTX6000、RTX4090上的CNN、Transformer和扩散模型中，经验阈值比理论界限紧10^2-10^3倍，且考虑界限的对抗攻击成功率为0%。总之，TAO为现实世界的异构ML计算协调了可扩展性和可验证性。

英文摘要

Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point (FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present TAO: a Tolerance-Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. TAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement TAO as a PyTorch-compatible runtime and a contract layer currently deployed on the Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers, and diffusion models on A100, H100, RTX6000, and RTX4090, empirical thresholds are $10^2$ to $10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. Together, TAO reconciles scalability with verifiability for real-world heterogeneous ML compute.
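The dispute game described in this abstract can be sketched as a bisection over a trace of intermediate outputs. This is a heavily simplified illustration under stated assumptions: the computation is a linear chain of scalar-valued operators rather than a graph of tensors, there are no Merkle commitments, and the tolerance values are invented. The function name `locate_disputed_operator` is hypothetical.

```python
def within(a: float, b: float, tol: float) -> bool:
    """Tolerance-aware comparison used instead of bitwise equality."""
    return abs(a - b) <= tol


def locate_disputed_operator(proposer, challenger, tols):
    """Bisect two traces down to a single disputed operator.

    Each trace lists the value after every operator; index 0 is the shared
    input, which both parties are assumed to agree on. Returns an index i
    such that the parties agree (within tolerance) at i - 1 but not at i,
    or None when the final outputs already agree.
    """
    last = len(proposer) - 1
    if within(proposer[last], challenger[last], tols[last]):
        return None  # nothing to dispute: the result would finalize
    lo, hi = 0, last  # invariant: agree at lo, disagree at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if within(proposer[mid], challenger[mid], tols[mid]):
            lo = mid
        else:
            hi = mid
    return hi
```

Once a single operator is isolated, only that operator needs to be re-checked on the agreed input, so the cost of adjudication grows with the logarithm of the trace length rather than with the size of the whole model.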


2510.17947 2026-06-09 cs.CR cs.AI cs.CL cs.LG cs.MA 版本更新

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

PLAGUE：面向多轮利用的终身自适应生成的即插即用框架

Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar

发表机构 * A10 Networks, Inc.（A10网络公司）； University of Massachusetts Amherst（马萨诸塞大学阿姆赫斯特分校）

AI总结提出PLAGUE框架，通过终身学习启发的三阶段设计（Primer、Planner、Finisher）实现高效多轮越狱攻击，在o3和Opus 4.1等强安全模型上ASR提升超30%。

Comments Accepted in ICLR 2026


AI中文摘要

大型语言模型（LLMs）正以惊人的速度改进。随着智能体工作流的出现，多轮对话已成为与LLMs交互以完成长而复杂任务的事实标准。尽管LLM能力持续提升，但它们仍然越来越容易受到越狱攻击，尤其是在多轮场景中，有害意图可以巧妙地注入到对话中，产生恶意结果。虽然单轮攻击已被广泛探索，但适应性、效率和有效性仍然是多轮攻击面临的关键挑战。为了解决这些不足，我们提出了PLAGUE，一种新颖的即插即用框架，用于设计受终身学习智能体启发的多轮攻击。PLAGUE将多轮攻击的生命周期分解为三个精心设计的阶段（Primer、Planner和Finisher），从而实现对多轮攻击家族的系统性和信息丰富的探索。评估表明，使用PLAGUE设计的红队智能体实现了最先进的越狱结果，在更少或相当的查询预算下，领先模型的攻击成功率（ASR）提高了30%以上。特别是，PLAGUE在OpenAI的o3上实现了81.4%的ASR（基于StrongReject），在Claude的Opus 4.1上实现了67.3%的ASR，这两个模型在安全文献中被认为对越狱具有高度抵抗力。我们的工作提供了工具和见解，以理解计划初始化、上下文优化和终身学习在构建多轮攻击以进行全面模型脆弱性评估中的重要性。

英文摘要

Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they are increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency, and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner, and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Claude's Opus 4.1, two models that are considered highly resistant to jailbreaks in safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization, and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.

2602.00056 2026-06-09 cs.CY cs.AI 版本更新

How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI

超数据化如何影响前沿AI的可持续性成本

Sophia N. Wilson, Sebastian Mair, Mophat Okinyi, Erik B. Dam, Janin Koch, Raghavendra Selvan

发表机构 * University of Copenhagen（哥本哈根大学）； Linköping University（林雪平大学）； Techworker Community Africa（非洲技术工人社区）； Univ. Lille, Inria, CNRS, Centrale Lille（里尔大学，Inria，CNRS，Centrale Lille）

AI总结本文研究超数据化对前沿AI的环境、社会和经济成本的影响，通过分析Hugging Face Hub的55万数据集，揭示数据增长、存储能耗及全球数据基础设施差异，提出Data PROOFS建议以缓解相关成本。

Comments Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency. Montreal, Canada

详情

DOI: 10.1145/3805689.3812393

AI中文摘要

大规模数据在过去十年中推动了前沿人工智能（AI）模型的成功。这种扩展依赖于大型科技公司持续努力聚合和整理互联网级数据集。本文从可持续性角度研究大规模数据在AI中的环境、社会和经济成本。我们主张该领域正从基于数据构建模型转向主动创建数据以构建模型。我们将这一转变称为超数据化，它标志着前沿AI的未来及其社会影响的关键转折点。为量化数据相关成本并将其置于具体情境中，我们分析了Hugging Face Hub上约550,000个数据集，重点关注数据集增长、与存储相关的能耗和碳足迹，以及基于语言数据的社会代表性。我们还结合肯尼亚数据工人的定性反馈来考察其中涉及的劳动，包括大型科技公司的直接雇佣以及接触令人不适的内容。我们进一步借助外部数据来源，通过展示全球数据中心基础设施的不平等来佐证我们的发现。我们的分析表明，超数据化带来了显著且不断增长的环境成本，同时系统性地将劳动风险和代表性伤害转移到全球南方。因此，我们提出了涵盖溯源、资源意识、所有权、开放性、节俭和标准的Data PROOFS建议，以缓解这些成本。我们的工作旨在使支撑前沿AI却常被忽视的数据成本变得可见，并在研究社区内外激发更广泛的讨论。

英文摘要

Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on sustained efforts by large technology corporations to aggregate and curate internet-scale datasets. In this work, we examine the environmental, social, and economic costs of large-scale data in AI through a sustainability lens. We argue that the field is shifting from building models from data to actively creating data for building models. We characterise this transition as hyper-datafication, which marks a critical juncture for the future of frontier AI and its societal impacts. To quantify and contextualise data-related costs, we analyse approximately 550,000 datasets from the Hugging Face Hub, focusing on dataset growth, storage-related energy consumption and carbon footprint, and societal representation using language data. We complement this analysis with qualitative responses from data workers in Kenya to examine the labour involved, including direct employment by big tech corporations and exposure to graphic content. We further draw on external data sources to substantiate our findings by illustrating the global disparity in data centre infrastructure. Our analyses reveal that hyper-datafication drives substantial and growing environmental costs while systematically redistributing labour risks and representational harms toward the Global South. Thus, we propose Data PROOFS recommendations spanning provenance, resource awareness, ownership, openness, frugality, and standards to mitigate these costs. Our work aims to make visible the often-overlooked costs of data that underpin frontier AI and to stimulate broader debate within the research community and beyond.

2602.02572 2026-06-09 cs.LG cs.AI 版本更新

Reward Shaping for (Inference-Time) Alignment: A Stackelberg Game Perspective

奖励塑形用于（推理时）对齐：一个Stackelberg博弈视角

Haichuan Wang, Tao Lin, Lingkai Kong, Ce Li, Hezi Jiang, Milind Tambe

发表机构 * University of Southern California（南加州大学）

AI总结针对KL正则化导致LLM继承基策略偏见的问题，提出将奖励模型优化形式化为Stackelberg博弈，并通过简单奖励塑形方案近似最优奖励模型，在推理时对齐中持续提升平均奖励并达到超过66%的胜平率。

Comments Accepted to ICML 2026. Camera-ready version

详情

AI中文摘要

现有的对齐方法直接使用从用户偏好数据中学习到的奖励模型来优化LLM策略，并相对于基策略进行KL正则化。这种做法对于最大化用户效用是次优的，因为KL正则化可能导致LLM继承基策略中与用户偏好冲突的偏见。虽然放大偏好输出的奖励可以减轻这种偏见，但也增加了奖励黑客（reward hacking）的风险。这种权衡引出了如何在KL正则化下最优设计奖励模型的问题。我们将这个奖励模型优化问题形式化为一个Stackelberg博弈，并表明一个简单的奖励塑形方案可以有效近似最优奖励模型。我们在推理时对齐设置中对我们的方法进行了实证评估，并证明它可以无缝集成到现有的对齐方法中，且开销最小。我们的方法持续提高了平均奖励，并且在各评估设置上取平均后，相对于所有基线的胜平率（win-tie rate）均超过66%。

英文摘要

Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing the user's utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.
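
The bias-inheritance argument in this abstract can be illustrated with the standard closed form of KL-regularized reward maximization, where the optimal policy is proportional to the base policy times exp(reward / beta). The sketch below is only a toy illustration under stated assumptions: the two-response setup, the numbers, and the factor-of-two reward amplification are hypothetical and are not the shaping scheme proposed in the paper.

```python
import math

def kl_regularized_policy(base_probs, rewards, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || base):
    pi(y) is proportional to base(y) * exp(r(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy setup: the base policy strongly favours response 0,
# while the learned reward says the user prefers response 1.
base = [0.9, 0.1]
rewards = [0.0, 1.0]

# Using the learned reward directly: the base-policy bias still dominates
# (response 1 gets roughly 0.23 probability).
plain = kl_regularized_policy(base, rewards, beta=1.0)

# Amplifying the reward (an assumed, illustrative shaping) shifts mass toward
# the preferred response (roughly 0.45), at the cost of trusting the reward more.
shaped = kl_regularized_policy(base, [2.0 * r for r in rewards], beta=1.0)

print(plain, shaped)
```

The same trade-off the abstract describes is visible here: a larger amplification counteracts the base bias more strongly, but it also magnifies any errors in the reward model.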

2602.08235 2026-06-09 cs.CL cs.AI cs.CR 版本更新

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

当良性输入导致严重危害：引发计算机使用代理的不安全意外行为

Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, Huan Sun

发表机构 * DeepMind, London, UK（DeepMind，伦敦，英国）； Stanford University, Stanford, CA, USA（斯坦福大学，斯坦福，加利福尼亚州，美国）； UC Berkeley, Berkeley, CA, USA（加州大学伯克利分校，伯克利，加利福尼亚州，美国）

AI总结提出AutoElicit框架，通过迭代扰动良性指令并利用CUA执行反馈，自动引发前沿CUAs（如Claude 4.5 Haiku等）的数百种有害意外行为，并验证其跨模型可迁移性。

Comments ICML 2026, Project Homepage: https://osu-nlp-group.github.io/AutoElicit/

详情

AI中文摘要

尽管计算机使用代理（CUA）在自动化日益复杂的操作系统工作流程方面具有巨大潜力，但即使在良性输入上下文中，它们也可能表现出偏离预期结果的不安全意外行为。然而，对此风险的探索仍主要停留在轶事层面，缺乏具体的特征描述和自动化方法，无法在现实CUA场景下主动发现长尾意外行为。为填补这一空白，我们首次提出了针对CUA意外行为的概念和方法框架，通过定义其关键特征、自动引发它们以及分析它们如何从良性输入中产生。我们提出了AutoElicit：一个代理框架，它使用CUA执行反馈迭代地扰动良性指令，并在保持扰动现实且良性的同时引发严重危害。使用AutoElicit，我们从最先进的CUA（如Claude 4.5 Haiku、Claude 4.5 Opus和Operator）中发现了数百种有害的意外行为。我们进一步评估了人工验证的成功扰动的可迁移性，识别出各种前沿CUA对意外行为的持续易感性。这项工作为在现实计算机使用环境中系统分析意外行为奠定了基础。

英文摘要

Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku, Claude 4.5 Opus, and Operator. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.

2604.01039 2026-06-09 cs.CR cs.AI 版本更新

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

用于评估和加固LLM系统指令对抗编码攻击的自动化框架

Anubhab Sahu, Diptisha Samanta, Reza Soosahabi

发表机构 * Keysight Technologies

AI总结本文提出自动化框架评估LLM系统指令在对抗编码攻击时的保密性，通过四个模型和46条指令测试发现结构化序列化攻击成功率高，提出基于Chain-of-Thought的缓解策略。

详情

AI中文摘要

大型语言模型（LLM）中的系统指令常用于在智能体AI应用中执行安全策略、定义代理行为并保护敏感的操作上下文。这些指令可能包含API凭证、内部政策和特权工作流定义等敏感信息，使系统指令泄露成为OWASP LLM应用十大风险中强调的关键安全风险。为避免推理模型的额外开销，许多LLM应用依赖拒绝型指令来阻止对系统指令的直接请求，并隐含假设被禁止的信息只能通过显式查询提取。我们引入了一个自动化评估框架，测试当提取请求被重新表述为编码或结构化输出任务时，系统指令是否仍能保持保密。在四个常见模型和46条经过验证的系统指令上，我们观察到结构化序列化攻击的成功率很高（>0.7）：模型会拒绝直接的提取请求，却以所要求的序列化格式泄露受保护内容。我们进一步展示了一种缓解策略，即使用Chain-of-Thought推理模型对指令进行单次（one-shot）重塑，结果表明即使只对系统指令的措辞和结构做细微改动，也能显著降低攻击成功率，且无需重新训练模型。

英文摘要

System instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensitive operational context in agentic AI applications. These instructions may contain sensitive information such as API credentials, internal policies, and privileged workflow definitions, making system instruction leakage a critical security risk highlighted in the OWASP Top 10 for LLM Applications. To avoid the overhead costs of reasoning models, many LLM applications rely on refusal-based instructions that block direct requests for system instructions, implicitly assuming that prohibited information can only be extracted through explicit queries. We introduce an automated evaluation framework that tests whether system instructions remain confidential when extraction requests are re-framed as encoding or structured output tasks. Across four common models and 46 verified system instructions, we observe high attack success rates (> 0.7) for structured serialization, where models refuse direct extraction requests but disclose protected content in the requested serialization formats. We further demonstrate a mitigation strategy based on one-shot instruction reshaping using a Chain-of-Thought reasoning model, indicating that even subtle changes in wording and structure of system instructions can significantly reduce attack success rate without requiring model retraining.

2604.08304 2026-06-09 cs.CR cs.AI 版本更新

Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

保障检索增强生成：攻击、防御与未来方向的分类法

Yuming Xu, Mingtao Zhang, Zhuohan Ge, Haoyang Li, Nicole Hu, Yongqi Zhang, Zhiyuan Wen, Jason Chen Zhang, Qing Li, Lei Chen

发表机构 * The Hong Kong Polytechnic University（香港理工大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结本文提出SLOT分类法，从攻击面、防御层、安全目的（遵循CIA属性）和攻击对象四个维度系统化梳理检索增强生成（RAG）的安全风险与防御，并指出知识访问管道中的结构性错配，最后展望未来方向。

Comments We have curated a paper list on RAG security in https://github.com/TreeAI-Lab/Awesome-RAG-Security, and we warmly welcome authors who wish to have their new work included to contact us via email

详情

AI中文摘要

检索增强生成（RAG）通过外部知识扩展大型语言模型（LLM），但这一访问路径也引入了安全风险，现有工作常将其与LLM固有缺陷混为一谈。我们将安全RAG定义为保障外部知识访问的安全，并使用SLOT分类法组织文献，该分类法包含四个维度：攻击面（Surface，S，对手实施攻击的位置）、防御层（Layer，L，在同一位置实施控制）、安全目的（Objective，O，按CIA属性划分的被破坏属性）以及攻击对象（Target，T，从单个已知查询（T1）到跨查询分布的目标声明操纵（T2））。通过将攻击、防御、补救和评估映射到六阶段知识访问管道，我们揭示了两个结构性错配。最后，我们讨论了未来方向：更现实的攻击对象、无盲点且经自适应评估的防御、更强的机密性，以及针对多模态和智能体RAG的评估。

英文摘要

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external knowledge, but this access path also introduces security risks that existing work often conflates with inherent LLM flaws. We frame secure RAG as securing external knowledge access and organize the literature with SLOT, a taxonomy along four axes: the attack Surface (S) where an adversary acts, the defense Layer (L) that controls the same point, the Objective (O) it breaks following the CIA properties, and the Target (T) it pursues, from a single known query (T1) to target-claim manipulation across a query distribution (T2). Mapping attacks, defenses, remediation, and evaluation onto a six-stage knowledge-access pipeline, we expose two structural mismatches. Finally, we discuss directions for more realistic targets, no-blind-spot and adaptively evaluated defenses, stronger confidentiality, and evaluation for multimodal and agentic RAG. The curated paper list for RAG security is in: https://github.com/TreeAI-Lab/Awesome-RAG-Security.

2605.03058 2026-06-09 cs.LG cs.AI 版本更新

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

基于对比分层消融的大语言模型神经元锚定规则提取

Francesco Sovrano, Gabriele Dominici, Marc Langheinrich

发表机构 * Università della Svizzera italiana（瑞士意大利大学）

AI总结提出MechaRule方法，通过定位稀疏激动剂激活将规则提取锚定在LLM电路中，利用自适应组测试和置信引导剪枝，以极低代价高召回率识别关键神经元，并在算术和越狱任务中验证其有效性。

Comments Accepted for publication at KDD'2026

详情

DOI: 10.1145/3770855.3818091

AI中文摘要

可解释AI的一个核心目标是符号化地表达大语言模型（LLM）的决策逻辑，并将其锚定在内部机制中。现有的规则提取方法通常学习非锚定的符号代理，而机制可解释性将行为与神经元联系起来，但通常需要手工假设和昂贵的干预。我们提出MechaRule，一种通过定位稀疏激动剂激活（其消融会破坏规则相关行为）将规则提取锚定在LLM电路中的流程。MechaRule基于两个发现。首先，在固定的基线/翻转机制下，稀疏激动剂效应可能表现出“超越”（overtopping）：少数高效应的激活在较大组中仍可检测到，主导较弱效应，并翻转许多相同的示例。在这种机制下，当N个候选中只有k << N个激动剂时，采用置信引导保守剪枝的自适应组测试只需O(k log(N/k) + k)次干预。其次，在与接近忠实的规则行为对齐的数据分割上，激动剂的定位更可靠；谱分割提供了无需规则的备选方案，而不忠实的分割会降低定位效果。实验上，在算术和越狱任务中，MechaRule在匹配的暴力验证中召回了97.0%的最高效应激动剂，平均仅消耗穷举消融成本的2.14%。消融定位到的激动剂可消除97.6–100.0%的合格正确算术答案和越狱，并可纠正算术错误或诱导越狱，比例分别高达72.8%和32.5%。

英文摘要

A central goal of explainable AI is to express large language model (LLM) decision logic symbolically and ground it in internal mechanisms. Existing rule-extraction methods usually learn ungrounded symbolic surrogates, while mechanistic interpretability links behavior to neurons but often requires hand-crafted hypotheses and costly interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by localizing sparse agonist activations whose ablation disrupts rule-related behavior. MechaRule rests on two findings. First, in a fixed baseline/flip regime, sparse agonist effects can exhibit overtopping: a few high-effect activations remain detectable within larger groups, dominate weaker ones, and flip many of the same examples. In such regimes, adaptive group testing with confidence-guided conservative pruning requires O(k log(N/k) + k) interventions over N candidates when k << N are agonists. Second, agonists are localized more reliably on data splits aligned with close-to-faithful rule behavior; spectral splits provide a rule-free fallback, whereas unfaithful splits degrade localization. Empirically, on arithmetic and jailbreaking, MechaRule recalls 97.0% of highest-effect agonists in matched brute-force validations at only 2.14% of exhaustive-ablation cost on average. Ablating the localized agonists eliminates 97.6–100.0% of eligible correct arithmetic answers and jailbreaks, and can correct arithmetic errors or induce jailbreaks by up to 72.8% and 32.5%.
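
To make the intervention-count claim concrete, here is a minimal sketch of adaptive group testing by recursive halving. It is not MechaRule's procedure (there is no confidence-guided pruning here); it assumes an idealized oracle, hypothetically named `group_has_effect`, that reports an effect exactly when the ablated group contains at least one agonist, which is the behaviour the overtopping finding is meant to justify. Under that assumption the number of group tests grows on the order of k log N rather than N.

```python
def find_agonists(candidates, group_has_effect):
    """Locate all 'positive' candidates using group tests and recursive halving.

    group_has_effect(group) stands in for one ablation intervention on a whole
    group (hypothetical oracle, assumed noise-free for this sketch).
    Returns (sorted list of positives, number of group tests used).
    """
    tests = 0
    found = []
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        if not group:
            continue
        tests += 1
        if not group_has_effect(group):
            continue  # one test rules out the whole group
        if len(group) == 1:
            found.append(group[0])
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found), tests

# Toy run: 3 hidden agonists among 1024 candidate neurons (made-up indices).
agonists = {3, 500, 901}
found, tests = find_agonists(range(1024), lambda g: any(c in agonists for c in g))
print(found, tests)
```

With k agonists among N = 2^m candidates, at most k groups are positive at each of the m halving levels and each spawns two child tests, so this sketch uses at most 1 + 2·k·log2(N) tests, far fewer than ablating every candidate individually.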

2605.03226 2026-06-09 cs.LG cs.AI cs.CR 版本更新

Self-Mined Hardness for Safety Fine-Tuning

基于自挖掘难度的安全微调

Prakhar Gupta, Garv Shah, Donghua Zhang

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出通过模型自身生成结果评估提示难度，对最难的提示进行安全微调，在Llama-3模型上将攻击成功率降至1-3%，但增加了拒绝率，通过混合良性提示可平衡性能。

详情

AI中文摘要

语言模型的安全微调通常需要一个精心策划的对抗性数据集。我们采取不同的方法：通过目标模型自身生成结果被判定为有害的频率来评分每个候选提示的难度，然后在最难的提示上使用模型自身的非越狱生成结果进行微调。在Llama-3-8B-Instruct和Llama-3.2-3B-Instruct上，该方法将WildJailbreak攻击成功率从11.5%和20.1%降至1-3%，但将越狱形式良性提示的拒绝率从14-22%提升至74-94%。将相同的困难提示与对抗性框架的良性提示（看起来像越狱但意图良性的提示）以1:1的比例交错，可将8B模型的拒绝率降至30-51%，3B模型降至52-72%，但攻击成功率增加2-6个百分点。在混合模式下，使用合格池中最难的一半而非随机一半进行训练，可将两个模型的剩余ASR降低35-50%（约3个百分点）。

英文摘要

Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1-3%, but pushes refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35-50% (about 3 percentage points) on both models.
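
The selection step described above can be sketched as follows. This is a hedged illustration, not the authors' code: `judge_is_harmful` is a hypothetical stand-in for whatever harmfulness judge is used, the eligibility rule (a prompt needs at least one non-harmful rollout to supply a training target) is an assumption, and picking the first safe rollout is arbitrary.

```python
def mine_hard_examples(rollouts, judge_is_harmful, top_fraction=0.5):
    """Score prompts by how often the model's own rollouts are judged harmful,
    then keep the hardest fraction paired with one of the model's safe rollouts.

    rollouts: dict mapping prompt -> list of sampled responses.
    Returns a list of (prompt, safe_response, hardness) tuples.
    """
    scored = []
    for prompt, responses in rollouts.items():
        flags = [judge_is_harmful(prompt, r) for r in responses]
        hardness = sum(flags) / len(flags)
        safe = [r for r, bad in zip(responses, flags) if not bad]
        if safe:  # assumed eligibility: a safe rollout must exist to train on
            scored.append((hardness, prompt, safe))
    scored.sort(key=lambda item: item[0], reverse=True)
    keep = max(1, int(len(scored) * top_fraction)) if scored else 0
    return [(prompt, safe[0], hardness) for hardness, prompt, safe in scored[:keep]]

# Toy data with a keyword-based judge (purely illustrative).
toy_rollouts = {
    "p1": ["HARM a", "HARM b", "ok c", "ok d"],
    "p2": ["HARM a", "HARM b", "HARM c", "ok d"],
    "p3": ["ok a", "ok b", "ok c", "ok d"],
    "p4": ["HARM a", "HARM b", "HARM c", "HARM d"],
}
selected = mine_hard_examples(toy_rollouts, lambda p, r: r.startswith("HARM"))
print(selected)
```

In this toy run `p4` is dropped because it has no safe rollout to learn from, and `p2` is kept as the hardest eligible prompt.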

2605.15416 2026-06-09 cs.LG cs.AI 版本更新

Margin-Adaptive Confidence Ranking for Reliable LLM Judgement

用于可靠LLM评判的边际自适应置信度排序

Gaojie Jin, Yong Tao, Lijia Yu, Tianjin Huang

发表机构 * Department of Computer Science, University of Exeter（埃克塞特大学计算机科学系）； Institute of AI for Industries, Chinese Academy of Sciences（中国科学院工业人工智能研究所）； Department of Mathematics and Computer Science, Eindhoven University of Technology（埃因霍温理工大学数学与计算机科学系）

AI总结本文提出一种基于边际的置信度排序方法，通过学习专用置信度估计器，改进LLM在人类判断一致性上的表现，通过模拟标注者多样性与边际排序公式，显式建模LLM区分人类一致与不一致案例的置信度，并推导出泛化保证。

Comments Accepted to ICML 2026

详情

AI中文摘要

Jung等人（2025）提出了一种假设检验框架，以保证大型语言模型（LLMs）与人类判断之间的一致性，其依据的假设是模型估计的置信度相对于人类不一致风险是单调的。然而，在实践中，这一假设可能被违反，且置信度估计器的泛化行为未被显式分析。我们通过学习专用置信度估计器而非依赖启发式置信信号来缓解这些问题。我们的方法利用模拟标注者多样性和基于边际的排序公式，显式建模LLM区分人类一致与不一致案例的置信度。我们进一步推导出该估计器的泛化保证，揭示出一个与边际相关的权衡，从而指导自适应估计器训练过程的设计。当集成到固定序列测试中时，所学的置信度估计器提高了排序准确性，并在经验上强化了置信度与不一致风险之间的单调关系，从而在多个数据集和评判模型上以更高的成功率满足目标一致性水平。

英文摘要

Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model's estimated confidence is monotonic with respect to human-disagreement risk. In practice, however, this assumption may be violated, and the generalization behavior of the confidence estimator is not explicitly analyzed. We mitigate these issues by learning a dedicated confidence estimator instead of relying on heuristic confidence signals. Our approach leverages simulated annotator diversity and a margin-based ranking formulation to explicitly model how confidently an LLM distinguishes between human-agreement and human-disagreement cases. We further derive generalization guarantees for this estimator, revealing a margin-dependent trade-off that informs the design of an adaptive estimator training procedure. When integrated into fixed-sequence testing, the learned confidence estimator yields improved ranking accuracy and empirically strengthens the monotonic relationship between confidence and disagreement risk, leading to higher success rates in satisfying target agreement levels across multiple datasets and judge models.
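
A margin-based ranking objective of the kind mentioned above is commonly written as a pairwise hinge loss: every human-agreement case should receive a confidence score at least `margin` higher than every human-disagreement case. The sketch below shows only this generic, fixed-margin form; how the paper adapts the margin or simulates annotators is not specified in the abstract, so nothing here should be read as the authors' exact formulation.

```python
def margin_ranking_loss(agree_scores, disagree_scores, margin=0.2):
    """Mean pairwise hinge loss: penalize any (agreement, disagreement) pair
    whose confidence gap is smaller than `margin`."""
    losses = [
        max(0.0, margin - (a - d))
        for a in agree_scores
        for d in disagree_scores
    ]
    return sum(losses) / len(losses)

# Hypothetical confidence scores from an estimator.
agree = [0.9, 0.6]      # cases where the LLM judge matched human labels
disagree = [0.5, 0.2]   # cases where it did not

# Only the pair (0.6, 0.5) violates the 0.2 margin, by 0.1, so the mean
# over the four pairs is about 0.025.
loss = margin_ranking_loss(agree, disagree)
print(loss)
```

Driving this loss to zero orders agreement cases above disagreement cases, which is the ranking property that threshold-based selection procedures such as fixed-sequence testing rely on.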

2605.20341 2026-06-09 cs.LG cs.AI cs.CR cs.PF 版本更新

Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions

协同优化中的因果遗忘：对抗性贡献下的精确与近似影响逆转

Ali Mahdavi, Azadeh Zamanifar, Amirfarhad Farhadi, Omid Kashefi

发表机构 * Department of Computer Engineering, SRC, Islamic Azad University Tehran, Iran（伊朗德黑兰伊斯兰阿扎德大学SRC计算机工程系）； School of Computer Engineering, Iran University of Science and Technology Tehran, Iran（伊朗德黑兰伊朗科技大学计算机工程学院）； Meta CA, USA（美国加利福尼亚州Meta公司）

AI总结本文提出HF-KCU方法，通过共轭梯度迭代在Krylov子空间中近似影响函数，从而在协同优化中实现数据删除，减少计算复杂度并提高隐私保护效果。

详情

AI中文摘要

联邦学习系统必须支持数据删除请求以符合隐私法规，但每次删除后从头重新训练在计算上不可行。我们提出了HF-KCU方法，通过在Krylov子空间中进行共轭梯度迭代来近似影响函数，从而移除某个客户端的贡献，将复杂度从O(d^3)降低到O(kd)，其中k<<d。因果加权机制确保只有持有被删除数据的客户端接收参数更新，防止对未受影响的客户端造成虚假变化。我们的方法在设计上可以处理对Hessian和梯度的有界对抗性扰动，在现实威胁模型下实现优雅降级。我们在卷积（ResNet-18、SimpleCNN）和Transformer（ViT-Lite）架构上，使用CIFAR-10、MNIST和Fashion-MNIST数据集验证了HF-KCU。在CIFAR-10的Dirichlet（alpha=0.5）划分下，HF-KCU相比重新训练实现了47.75倍的加速，同时测试准确率与重新训练基线的差距保持在0.60%以内（71.16% vs 71.76%）。对遗忘集的成员推断攻击成功率为0.499，与重新训练的模型一致，证实了隐私得到有效恢复。我们提供了收敛保证，表明Krylov近似误差按O((k^{1/2}-1)/(k^{1/2}+1))递减，其中k是Hessian条件数。因果加权机制确保了精准的局部更新：只有持有被删除数据的客户端被修改，从而保护未受影响参与者的模型质量，并避免基于梯度的方法在异步联邦设置中的不稳定性。该设计还提供了可解释性，因为每次更新都可以直接追溯到被删除数据的影响。该方法的效率和精度使其适用于删除请求异步到达且计算预算受限的生产级联邦系统。

英文摘要

Federated learning systems must support data deletion requests to comply with privacy regulations, yet retraining from scratch after each deletion is computationally prohibitive. We present HF-KCU, a method that removes a client's contribution by approximating the influence function through conjugate gradient iterations in Krylov subspaces, reducing complexity from O(d^3) to O(kd) where k << d. A causal weighting mechanism ensures that only clients holding the deleted data receive parameter updates, preventing spurious changes to unaffected clients. Our method is designed to handle bounded adversarial perturbations to the Hessian and gradient, providing graceful degradation under realistic threat models. We validate HF-KCU across convolutional (ResNet-18, SimpleCNN) and transformer (ViT-Lite) architectures on CIFAR-10, MNIST, and Fashion-MNIST. On CIFAR-10 under Dirichlet (alpha=0.5) partitioning, HF-KCU achieves a 47.75x speedup over retraining while maintaining test accuracy within 0.60% of the retrained baseline (71.16% vs. 71.76%). Membership inference attacks on the forget set yield success rates of 0.499, matching the retrained model and confirming effective privacy restoration. We provide convergence guarantees showing that the Krylov approximation error decreases as O((k^{1/2}-1)/(k^{1/2}+1)) where k is the Hessian condition number. The causal weighting mechanism ensures surgical updates, where only clients holding deleted data are modified, preserving model quality for unaffected participants and avoiding the instability of gradient-based approaches in asynchronous federated settings. This design provides interpretability as each update is directly traceable to the influence of the deleted data. The method's efficiency and precision make it suitable for production federated systems where deletion requests arrive asynchronously and computational budgets are constrained.
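
The core numerical step, approximating an inverse-Hessian-vector product with conjugate gradient (CG), can be sketched in a few lines. This is textbook CG rather than the HF-KCU implementation: the function names are hypothetical, the Hessian is assumed symmetric positive definite and is touched only through Hessian-vector products, and the causal weighting and adversarial-robustness components are omitted. For reference, the classical CG bound states that the error in the H-norm after k iterations is at most 2((√κ − 1)/(√κ + 1))^k times the initial error, where κ is the condition number.

```python
def conjugate_gradient(hvp, g, iters):
    """Approximate x = H^{-1} g using only Hessian-vector products hvp(v) = H v.

    Each iteration costs one Hessian-vector product, so `iters` products are
    used in total instead of forming or inverting H explicitly.
    """
    x = [0.0] * len(g)
    r = list(g)                      # residual g - H x, with x = 0 initially
    p = list(r)                      # current search direction
    rs = sum(v * v for v in r)
    for _ in range(iters):
        if rs < 1e-20:               # already converged
            break
        hp = hvp(p)
        alpha = rs / sum(a * b for a, b in zip(p, hp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Toy 2x2 symmetric positive definite "Hessian" and gradient (made-up numbers).
H = [[4.0, 1.0], [1.0, 3.0]]
grad = [1.0, 2.0]
hvp = lambda v: [sum(h * vi for h, vi in zip(row, v)) for row in H]

# The exact solution of H x = grad is (1/11, 7/11); CG reaches it in two steps.
update = conjugate_gradient(hvp, grad, iters=5)
print(update)
```

In an unlearning setting the vector `grad` would be the gradient contribution of the data to be forgotten, and the resulting `update` would be used to adjust the parameters; those details depend on the specific method and are not shown here.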

2606.00827 2026-06-09 cs.LG cs.AI 版本更新

Beyond Independent Manipulation: Individual Fairness-aware Strategic Classification with Peer Imitation

超越独立操纵：具有同伴模仿的个体公平感知策略分类

Xinpeng Lv, Chunyuan Zheng, Yunxin Mao, Renzhe Xu, Jinxuan Yang, Yuanlong Chen, Wangrong Huang, Shaowu Yang, Wenjing Yang, Xinwang Liu, Peng Cui, Haotian Wang

发表机构 * College of Computer Science and Technology, National University of Defense Technology（国防科技大学计算机科学与技术学院）； School of Mathematical Sciences, Peking University（北京大学数学学院）； Institute for Theoretical Computer Science, Shanghai University of Finance and Economics（上海财经大学理论计算机科学研究所）； Information Technology Development, Aetos Capital Group, Sydney（悉尼Aetos资本集团信息技术部）； Faculty of Computing, Harbin Institute of Technology（哈尔滨工业大学计算机学院）； Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）

AI总结提出个体公平感知策略分类（IFSC）框架，通过建模基于个体公平的同伴驱动操纵（模仿邻近被接受同伴），并采用鲁棒学习过程处理同伴可观测性不确定性，以改善个体公平一致性并减轻模仿引起的扭曲。

Comments Accepted by SIGKDD2026

详情

DOI: 10.1145/3770855.3817670

AI中文摘要

策略分类（SC）研究智能体操纵其特征以从预测模型获得有利决策的场景。现有的公平感知SC方法主要关注群体公平，并通常假设智能体独立响应。然而，当需要个体公平时，确保相似个体获得相似结果，智能体的操纵变得相互依赖：一个智能体偏好的操纵取决于邻域的结果。这导致了经典SC公式与公平感知决策设置之间的不匹配，其中独立模型不再准确刻画策略操纵。为解决此问题，我们引入了个体公平感知策略分类（IFSC），这是一个框架，对由个体公平引起的同伴驱动操纵进行建模，其中智能体模仿附近被积极决策的同伴以获得有利结果。IFSC将策略操纵刻画为对可见被接受同伴的基于相似性的模仿，并在由此产生的操纵后分布下学习分类器。为了考虑同伴可观测性的不确定性，IFSC采用鲁棒学习过程，在操纵模拟期间引入随机扰动。在合成和真实数据集上的实验表明，IFSC改善了个体公平一致性并减轻了模仿引起的扭曲。

在线智能体作为裁判：面向交互式智能体的情境生成评估

Hyogon Ryu, Jeonghwan Kim, Yewon Lim, Chaeun Lee, Jeongwook Kim, Donghoon Ham

发表机构 * KAIST（韩国科学技术院）

AI总结提出在线智能体作为裁判框架，通过部署环境内评估智能体主动生成相关情境，以评估交互式社交智能体的能力，提高标准覆盖率和与人类标签的一致性。

Comments ICML 2026 Workshop on Trustworthy AI for Good

详情

AI中文摘要

评估基于LLM的交互式社交智能体具有挑战性，因为社交相关行为不仅取决于孤立输出，还取决于先前的交互、社会角色和后续行动。现有方法通常允许目标智能体在环境中自由行动，然后对生成的轨迹进行评分。然而，这种被动设置可能会遗漏仅在特定社交情境下才可观察到的能力；例如，如果没有出现分歧，冲突处理可能不会被测试。我们提出在线智能体作为裁判，一种面向交互式社交智能体的情境生成评估框架。在线智能体作为裁判部署一个环境内评估智能体，通过环境原生的对话和行动协议与目标智能体交互，主动引出与评估标准相关的情境。生成的轨迹为评估即时响应和后续行为提供了证据。在一个包含32个设计师编写的社会标准的生命模拟环境中，在线智能体作为裁判提高了标准覆盖率和与人类标签的一致性，为被动方法可能未观察到的行为提供了更可靠的基于证据的评估。

英文摘要

Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typically allow a target agent to act freely in an environment and then score the resulting trajectory. However, this passive setup can miss capabilities that only become observable under specific social circumstances; for example, conflict handling may remain untested if no disagreement arises. We propose Online Agent-as-a-Judge, a situation-generating evaluation framework for interactive social agents. Online Agent-as-a-Judge deploys an in-world evaluator agent that interacts with the target agent through the environment's native dialogue and action protocol, actively eliciting situations relevant to the evaluation criteria. The resulting trajectories provide evidence for assessing both immediate responses and subsequent behavior. In a life-simulation environment with 32 designer-authored social criteria, Online Agent-as-a-Judge improves criteria coverage and agreement with human labels, yielding more reliable evidence-grounded evaluations of behaviors that passive methods can leave unobserved.

2606.08239 2026-06-09 cs.AI cs.CL cs.CV 新提交

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

当没有正确答案时：诊断视频理解中多模态大语言模型的缺失答案检测

Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai "Helen" Li, Yiran Chen

发表机构 * Duke University（杜克大学）

AI总结研究多模态大语言模型在视频理解中检测缺失答案的能力，发现模型倾向于选择干扰项而非识别无正确答案，时间推理任务中问题更严重，链式思维提示虽提升检测率但仍不理想。

Comments Under review

详情

AI中文摘要

多模态大语言模型在视频理解方面取得了实质性进展，但其响应的可靠性仍未得到充分探索。本文对视频理解中多模态大语言模型的缺失答案检测进行了诊断研究，其中正确答案被故意排除在候选集之外，而一个可靠的模型应能识别出没有有效选项。我们在三种设置下评估缺失答案检测行为：带有“以上皆非”选项的多选题、带有检测指令的开放式生成，以及没有任何指导的标准评估。在多种模型和基准测试中，我们发现多模态大语言模型压倒性地选择合理的干扰项，而不是检测到缺失答案。这种失败在时间推理任务中更为明显，并且随着帧采样密度的增加而恶化。我们进一步探索了链式思维提示作为缓解策略，发现虽然它显著提高了检测率，但性能仍不令人满意，这表明仅基于提示的策略不足以完全解决这一局限性。这些发现揭示了缺失答案检测中的系统性失败，并强调了在多模态系统中需要明确的检测机制。

英文摘要

Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct answer is deliberately excluded from the candidate set and a reliable model is expected to recognize that no valid option exists. We evaluate the absent answer detection behavior under three settings: multiple-choice questions augmented with a "None of the Above" option, open-ended generation with a detection instruction, and standard evaluation without any guidance. Across a diverse set of models and benchmarks, we find that MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced in temporal reasoning tasks and worsens with denser frame sampling. We further explore chain-of-thought prompting as a mitigation strategy and find that while it substantially improves detection rates, performance remains unsatisfactory, suggesting that prompting-based strategies alone are insufficient to fully address this limitation. These findings expose a systematic failure in absent answer detection and highlight the need for explicit detection mechanisms in multimodal systems.

超越通过率：开放代码大语言模型的多语言、执行基础评估

Sayed Erfan Arefin

发表机构 * Sayed Erfan Arefin

AI总结针对12种编程语言的2707道LeetCode问题，评估9个开放代码LLM，发现最佳模型Yi-Coder-9B-Chat的正确率仅23.64%，远低于人类57.2%的基准，且排名因问题难度和语言而异，编译错误占失败原因的63.25%。

详情

AI中文摘要

代码生成模型通常使用紧凑的执行基准和总体通过率进行比较，但这种总结掩盖了性能在不同编程语言、问题族和失败模式之间的差异。我们对9个专门用于编码的开放访问LLM进行了大规模、基于执行的评估，涉及12种编程语言的2707道免费LeetCode问题。我们的语料库包含325,343个问题-模型-语言作业，每个作业都关联了提示元数据、提取的代码、LeetCode执行结果和静态分析信号。结果表明，当前的开放模型远未达到人类接受参考：最佳模型Yi-Coder-9B-Chat的平均正确率为23.64%，而人类接受基线为57.2%。排名也依赖于切片：Qwen2.5-Coder-14B-Instruct在困难问题和不同问题覆盖上最强，而Gemma-2-27B-IT在所有语言上的lint通过率最高。失败分析显示，编译错误占未接受最佳提交的63.25%，表明许多失败发生在语义正确性测试之前。静态质量进一步与功能正确性偏离。总之，这些发现表明，多语言、保留工件的评估揭示了单语言或单指标排行榜所隐藏的权衡。

英文摘要

Code generation models are typically compared using compact execution benchmarks and aggregate pass rates, but such summaries obscure how performance varies across programming languages, problem families, and failure modes. We present a large-scale, execution-grounded evaluation of 9 openly accessible LLMs specialized for coding on 2,707 free LeetCode problems across 12 programming languages. Our corpus contains 325,343 problem-model-language jobs, each linked to prompt metadata, extracted code, LeetCode execution outcomes, and static-analysis signals. The results show that current open models remain far from the human acceptance reference: the best model, Yi-Coder-9B-Chat, reaches 23.64% mean correctness, compared with a 57.2% human acceptance baseline. Rankings are also slice-dependent: Qwen2.5-Coder-14B-Instruct is strongest on hard problems and distinct-problem coverage, while Gemma-2-27B-IT achieves the highest all-language lint pass rate. Failure analysis shows that compile errors account for 63.25% of non-accepted best submissions, indicating that many failures occur before semantic correctness can be tested. Static quality further diverges from functional correctness. Together, these findings show that multilingual, artifact-preserving evaluation reveals tradeoffs hidden by single-language or single-metric leaderboards.

2606.08970 2026-06-09 cs.AI 新提交

An Effective Router for Vision-Language Model Selection

一种有效的视觉-语言模型选择路由器

Can Wang, Shengwei Wang, Bolin Zhang, Zhiying Tu, Dianhui Chu

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Shandong Key Laboratory of Digital Service Computing Technology and Systems（山东省数字服务计算技术与系统重点实验室）

AI总结针对视觉-语言模型（VLM）选择中数据缺乏、特征表示无效和模型空间僵化的问题，提出ARMS路由器，通过增强输入信号和扩展训练策略，在分布内和分布外测试集上表现优异，仅800M参数即可超越GPT-4o。

详情

AI中文摘要

具有不同性能和资源需求的视觉-语言模型（VLM）被广泛部署，使得用户难以从众多VLM候选中选择最合适的。现有工作揭示了语言模型中的性能悖论现象，并专注于路由方法来解决它。然而，开发用于VLM选择的路由器仍然是一个关键且具有挑战性的问题，主要面临：1）缺乏专门数据，2）特征表示无效，以及3）模型空间僵化和适应成本高。在本文中，我们构建了一个用于VLM选择的多模态数据集，包含七个主流VLM在32,626个独特图像-文本查询上的输出。然后，我们提出了ARMS，一个用于VLM选择的路由器。ARMS通过VLM配置文件增强输入信号，采用简单但有效的架构来改进查询和VLM能力的表示。为了提高ARMS对新VLM的适应性，我们提出了两种扩展训练策略：增量训练和独立训练。在分布内和分布外测试集上的实验结果表明了ARMS的有效性。特别是，使用我们的训练策略，ARMS（仅800M参数）可以适应更广泛的VLM空间，并击败规模大数百倍的商业模型如GPT-4o。我们的代码、模型和数据集可在匿名仓库中获取。

英文摘要

Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to select the most appropriate one among numerous VLM candidates. Existing work reveals the performance paradox phenomenon in language models and focuses on routing methods to solve it. However, developing a router for VLM selection is still a critical yet challenging problem, which primarily faces: 1) lack of specialized data, 2) ineffective feature representation, and 3) rigid model space and costly adaptation. In this paper, we construct a multimodal dataset for VLM selection, containing the outputs of seven mainstream VLMs on 32,626 unique image-text queries. We then propose ARMS, a router for VLM selection. ARMS enhances input signals with VLM profiles and employs a simple but effective architecture to improve representations of queries and VLM capabilities. To improve ARMS's adaptation to new VLMs, we propose two extension training strategies: incremental training and independent training. Experimental results on both in-distribution and out-of-distribution test sets demonstrate the effectiveness of ARMS. In particular, using our training strategy, ARMS (only 800M in size) can adapt to a broader VLM space and defeat commercial models like GPT-4o that are hundreds of times larger in scale. Our code, models, and datasets are available in the anonymous repository.

2606.08976 2026-06-09 cs.AI 新提交

RTL-BenchLS: A Large-Scale Benchmark for RTL Reasoning and Generation with Large Language Models

RTL-BenchLS：面向大语言模型的RTL推理与生成的大规模基准

Jing Wang, Shang Liu, Wenji Fang, Yuchao Wu, Yugao Zhu, Zhiyao Xie

发表机构 * Hong Kong University of Science and Technology（香港科技大学）

AI总结提出大规模基准RTL-BenchLS，包含超1万个形式验证的Verilog设计，并引入三项自监督推理任务，解决现有基准规模小、任务单一的问题，评估显示当前最佳模型性能较低。

详情

AI中文摘要

基于LLM的RTL生成与推理是硬件设计自动化的一个有前景的方向。高质量的基准是跟踪这一进展的关键基础设施。然而，现有的RTL基准在规模和任务范围上存在固有局限性。它们涵盖的设计通常较小且简单，任务几乎完全集中在规格到RTL的生成上。前沿模型在现有基准上的性能已经饱和。扩大这些基准的规模从根本上很困难，因为基准测试需要对齐的标签，例如规格和测试平台。对于实际设计，这种对齐的高质量数据很少可用。我们引入了RTL-BenchLS，这是一个大规模基准，解决了上述两个局限性。它包含超过10,000个经过形式验证的Verilog设计，涵盖比现有基准更大且更复杂的设计。除了规格到RTL的生成，我们提出了三项联合评估推理与生成的新任务：往返推理、掩码内容推理和仓库问题推理。前两项是自监督的，直接解决了扩展瓶颈。所有任务都通过形式等价性检查进行验证，无需任何手动测试平台。我们在RTL-BenchLS上评估了八个LLM。即使是最好的模型，在自然语言往返推理上仅达到23%，在掩码内容推理上达到28%，在仓库问题修复上达到12%。RTL-BenchLS比现有基准更具挑战性。它为未来的改进留下了充足的空间，并为开发基于LLM的硬件设计方法提供了指导。

英文摘要

LLM-based RTL generation and reasoning is a promising direction for hardware design automation. High-quality benchmarks are critical infrastructure for tracking progress in this direction. However, existing RTL benchmarks face inherent limitations in both scale and task scope. The designs they cover are typically small and simple, and the tasks focus almost entirely on specification-to-RTL generation. Frontier models' performance already saturates on the existing benchmarks. Scaling these benchmarks up is fundamentally difficult because aligned labels are required for benchmarking, such as specifications and testbenches. Such aligned high-quality data are rarely available for real-world designs. We introduce RTL-BenchLS, a large-scale benchmark addressing both limitations above. It contains over 10,000 formally verified Verilog designs, covering substantially larger and more complex designs than existing benchmarks. Beyond specification-to-RTL generation, we propose three novel tasks that jointly evaluate reasoning and generation: round-trip reasoning, masked-content reasoning, and repository-issue reasoning. The first two are self-supervised, which directly resolves the scaling bottleneck. All tasks are verified through formal equivalence checking without any manual testbenches. We evaluate eight LLMs on RTL-BenchLS. Even the best model reaches only 23% on natural-language round-trip reasoning, 28% on masked-content reasoning, and 12% on repository-issue fixing. RTL-BenchLS is substantially more challenging than existing benchmarks. It leaves ample room for future improvement and offers guidance for developing LLM-based methods for hardware design.

2606.09118 2026-06-09 cs.AI New submission

ComplexConstraints and Beyond: Expert Rubrics for RLVR

Sushant Mehta, Liudas Panavas, Edwin Chen

Affiliations: Surge AI

AI summary: Proposes expert-designed rubrics as both evaluation and training signals, validated on complex instruction following and enterprise agentic tasks, with notable gains from RL training.

Comments: Accepted to the GEM workshop at ACL 2026: https://gem-workshop.com/

English abstract

As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on programmatic verification of narrow, surface-level constraints, but real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce ComplexConstraints, a new expert-curated instruction-following dataset in which each prompt is paired with 10-40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 ComplexConstraints examples yields +15.5% improvement for a 4B-parameter model and +12.2% for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.

2606.09169 2026-06-09 cs.AI cs.CV cs.MM New submission

IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

Lingyi Meng, Zecong Tang, Haoran Li, Tengju Ru, Zhejun Cui, Weitong Lian, Qi Kang, Hangshuo Cao, Yichen Zhu, Yechi Liu, Kaixuan Wang, Yu-Jie Yuan, Chunwei Wang, Yu Zhang, Bo Dai

Affiliations: Zhejiang University; The University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; Huawei

AI summary: Proposes IMUG-Bench, a benchmark for evaluating the understanding and generation abilities of unified multimodal models in multi-turn interleaved image-text dialogue; it comprises 3,113 samples and 12,034 interaction turns, reveals generation-side exposure bias, and explores test-time scaling strategies.

English abstract

In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mastering dynamic, multi-turn interleaved image-text dialogues is a crucial task for UMMs in real-world applications. However, existing benchmarks fail to evaluate this important task, as they are often limited to single-turn or static settings, and typically overlook exposure bias in multi-turn interactions. To bridge this gap, we propose IMUG-Bench, a comprehensive benchmark for multi-turn interleaved image-text dialogue of UMMs that jointly evaluates their understanding and generation capabilities. Our IMUG-Bench comprises three classes: Static Spatial, Temporal Causal, and Hybrid, covering 3,113 samples and 12,034 interaction turns. It also includes dynamic understanding questions, thereby supporting evaluation that better reflects real-world multi-turn interaction scenarios. Large-scale experiments on IMUG-Bench systematically evaluate mainstream open-source and closed-source UMMs, revealing their capability boundaries and failure modes, and uncovering pronounced exposure bias on the generation side in multi-turn interactions. We further explore several test-time scaling strategies, including Chain-of-Thought, Self-Verification, and Best-of-N Sampling, which effectively improve generation accuracy and mitigate exposure bias in generation tasks. These findings provide insights into enhancing the robustness and multi-turn interaction capability of future UMMs.

2606.09323 2026-06-09 cs.AI cs.DB New submission

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Wei Pang, Xiangru Jian, Hehan Li, Zhixuan Yu, Alex Xue, Jinyang Li, Zhengyuan Dong, Xinjian Zhao, Hao Xu, Chao Zhang, Reynold Cheng, M. Tamer Özsu, Tianshu Yu

Affiliations: The Chinese University of Hong Kong, Shenzhen; University of Waterloo; The University of Hong Kong; The University of Sydney; Université Lyon 1

AI summary: Proposes TRL-Bench, which standardizes downstream conditions to evaluate tabular encoders at three granularities (column/table, row, and compositional data-lake table augmentation), showing that encoder quality is capability-specific rather than captured by a single ranking.

Evaluating Advanced Prompt Engineering on Gemini Flash for Multi-Hop Biomedical Question Answering

Ahmed Bajaber, Mohammed Alliheedi

Affiliations: Saudi Med AI Lab (SMAIL); Prince Sultan University; Al-Baha University

AI summary: Designs a multi-component prompt (role-playing, multi-shot Chain-of-Thought examples, and formatting rules) for Gemini 2.0 Flash that reaches a Concept Level Score of 0.720, well above the 0.565 baseline and close to the next-generation model, indicating that advanced prompt design is key to unlocking LLM reasoning.

Comments: 8 pages, proceedings of the BioCreative IX Challenge and Workshop (BC9) at IJCAI 2025

DOI: 10.5281/zenodo.16876579
Journal ref: Proc. BioCreative IX Workshop (BC9), IJCAI 2025, Montreal, Canada

English abstract

The MedHopQA challenge presents a critical test for Large Language Models (LLMs): complex, multi-hop reasoning in the high-stakes biomedical domain. This paper details our direct API-based evaluation of Google's Gemini Flash models, focusing on the impact of advanced prompt engineering. We designed a sophisticated, multi-component prompt for Gemini 2.0 Flash that combined role-playing, explicit multi-shot Chain-of-Thought (CoT) examples, and detailed formatting rules. Our best run, using this complex prompt, achieved a Concept Level Score of 0.720. This result dramatically outperformed a baseline prompt which scored only 0.565. Remarkably, this performance on the efficient Gemini 2.0 Flash was almost identical to the result from the next-generation Gemini 2.5 Flash. Our findings demonstrate that sophisticated prompt design is a critical factor for unlocking the full reasoning capabilities of modern LLMs.

2606.07550 2026-06-09 cs.LG cs.AI Cross-list

Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark

Yang Fu, Haomin Bao, Rohit Sonker, Xiaoyan Hu, Aravind Venugopal, Jeff Schneider, Jiayu Chen

Affiliations: Central South University; Chongqing University; Carnegie Mellon University; The University of Hong Kong

AI summary: Proposes the RL4F benchmark, which builds evaluation environments from historical DIII-D tokamak data and compares offline RL methods on plasma control tasks, finding that model-based offline RL performs best on average.

Comments: 23 pages (10 pages main text)

English abstract

Offline reinforcement learning (RL) offers a promising route for developing plasma controllers from historical tokamak data, since online trial-and-error on real devices is costly and risky. However, progress in this direction remains difficult to measure due to the lack of a standardized offline RL benchmark for realistic multi-actuator, long-horizon plasma control problems in nuclear fusion. We introduce RL4F, an Offline Reinforcement Learning Benchmark for Plasma Control in Nuclear Fusion, providing closed-loop evaluation environments and baseline comparisons across four full-profile tracking tasks: rotation, density, temperature, and pressure. The dynamics function underlying the evaluation environment is built from historical discharge data from DIII-D, a real-world Tokamak. We evaluate a broad set of imitation learning and offline RL baselines under a unified protocol. We find that offline model-based RL methods obtain the best average performance on most objectives, although no single method dominates all tasks, highlighting the importance of dynamics modeling in complex, long-horizon plasma control tasks. To foster further research, we open-source the codebase, datasets, and evaluation framework, providing a benchmark not only for the fusion community but also for algorithm development in offline RL.

2606.07558 2026-06-09 cs.CV cs.AI cs.DL Cross-list

Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing

Kateryna Lutsai, Pavel Straňák, David Novák, Dana Křivánková

Affiliations: Institute of Formal and Applied Linguistics, Charles University MFF; Institute of Archaeology, Czech Academy of Sciences

AI summary: Addresses the impracticality of manual sorting in historical document digitization with an automatic page image classifier based on visual content type (text, tables, graphics); fine-tuned deep networks achieve near-perfect classification (RegNetY-16GF reaches 99.16% accuracy), and the models, dataset, and code are publicly released.

Comments: 29 pages, 19 figures, 13 tables. arXiv admin note: text overlap with arXiv:2507.21114

English abstract

Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting impractical at scale. This work addresses the need for an automated system to classify scanned page images based on visual content type - text, tables, and graphics - enabling content-specific downstream processing such as Optical Character Recognition (OCR) or structured data extraction. Methods: An image classification system was developed and evaluated on a dataset of over 48,000 annotated historical page images from century-old Czech archaeological archives, refined through four successive annotation stages with domain-expert review. A Random Forest Classifier baseline was established using hand-crafted image features. Subsequently, deep learning architectures were fine-tuned and compared: Convolutional Neural Networks (EfficientNetV2, RegNetY), Vision and Document Image Transformers (ViT, DiT), and multimodal CLIP models. An 11-category label scheme was designed collaboratively with domain experts and evaluated via five-fold cross-validation. Results: The feature-based baseline achieved approximately 75% accuracy. Fine-tuned CNNs and Transformers substantially outperformed it, with RegNetY-16GF achieving 99.16% and ViT-large 99.12% Top-1 accuracy on the held-out test set. CLIP ViT-B/16 reached 99.14% with optimized text descriptions. Conclusion: Image-only models, particularly RegNetY-16GF, deliver near-perfect classification accuracy and produce consistent labels across 649,508 unlabeled archival pages with over 90% inter-model agreement. Fine-tuned CLIP, despite competitive test-set accuracy, showed under 65% agreement with image-only models on unlabeled data, making it less suitable for deployment. The final models, annotated dataset, and software are publicly available under open-source licenses.

2606.07590 2026-06-09 cs.CV cs.AI Cross-list

SlideCheck: Guiding Self-Supervised Pretraining of Pathology Foundation Models via Dataset Distributions

Mingyi He, Xinyi Guo, Xitong Ling, Weiming Chen, Jiawen Li, Lianghui Zhu, Minxi Ouyang, Mingxi Fu, Yizhi Wang, Tian Guan

Affiliations: Beijing University of Chemical Technology; South China Normal University; Tsinghua University

AI summary: Proposes SlideCheck, a tool that uses features from a frozen pathology foundation model and a dual-head MLP to score abnormality and malignancy evidence, guiding the selection of self-supervised pretraining data; experiments show that the data distribution affects downstream model performance.

Comments: 9 pages, 2 figures, 4 tables

English abstract

Pathology foundation models are pretrained on large streams of WSI-derived patches, while supervision during data construction is often slide-level, sparse, or heterogeneous. This mismatch makes it difficult to understand and control which biological patterns enter the pretraining data. We propose SlideCheck, a lightweight pretraining data guidance tool built on frozen pathology foundation model patch features. Rather than serving as a standalone patch diagnostic model, SlideCheck provides explicit abnormality and malignancy scores for organizing, filtering, and auditing pathology pretraining data. SlideCheck uses a dual-head MLP to separately model broad abnormal morphology and malignant evidence. A regularized feature-space scorer provides a supervised anchor for patch-level evidence estimation, while score-attention agreement combines patch scores with WSI-level MIL attention to mine high-confidence pseudo labels. The same scores are then used to construct broad-positive ViT pretraining subsets, where a patch is selected if either abnormality or malignancy evidence exceeds a threshold. Experiments show that SlideCheck-defined data distributions influence the downstream behavior of self-supervised ViT pretraining, indicating that biological composition is an important controllable factor in pathology foundation model development. Curated subsets can approach full-data performance, suggesting that explicitly scored patch pools may support more efficient and auditable pretraining data construction. These findings position SlideCheck as a data guidance and auditing layer for transforming large, undifferentiated patch pools into controllable and reusable pretraining datasets.

2606.07595 2026-06-09 cs.CV cs.AI cs.IR Cross-list

VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents

Youting Wang, Yuan Tang, Yitian Qian, Chen Zhao

Affiliations: Nanyang Technological University

AI summary: Proposes the VisualLeakBench benchmark for evaluating action-boundary propagation failures, in which vision-language agents copy sensitive text from images such as screenshots and documents into tool arguments; PII propagation reaches 78.8% and unsafe-text propagation 85.5%.

English abstract

Vision-language agents increasingly consume screenshots, documents, and user interfaces before writing to memory, sending messages, or invoking external tools. We study a concrete failure mode in this setting: action-boundary propagation, where sensitive or unsafe visible text is copied from an image into downstream tool arguments. We present VisualLeakBench, a diversified 500-image benchmark spanning UI, chat, document, form, and dashboard scenes, and evaluate a stratified 100-image agent subset with four production VLM systems under two workflows: note capture and external handoff. At baseline, target strings are propagated into tool arguments in 78.8% of PII cases and 85.5% of rendered unsafe-text cases. Under a defensive system prompt, rendered unsafe-text propagation remains high at 52.6%, while PII tool propagation falls to 2.0%, largely by suppressing tool use rather than preserving utility. Rates are tool-surface dependent: search-like tools suppress PII propagation, but rendered unsafe text still crosses tool boundaries. We measure visual-to-tool propagation rather than downstream instruction execution. We additionally provide a labeled-target oracle upper-bound diagnostic that localizes most failures at the tool boundary while leaving response-side leakage as residual risk.

2606.07597 2026-06-09 cs.LG cs.AI Cross-list

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

Kevin Zhou, Lisa Alazraki, Kris Cao, Marek Rei

Affiliations: Imperial College London; Cohere

AI summary: Addresses the failure of small-scale extrapolation in pre-training data mixtures caused by changing repetition rates of high-quality data; proposes repetition-controlled subsampling, which recovers a near-optimal mixture with 1/16 of the target token budget and shows that repetition dynamics, not scale alone, determine whether experiments generalize.

English abstract

Pre-training data mixtures are commonly tuned by running small-scale experiments and extrapolating to the target training budget. When high-quality data is scarce and must be repeated, this extrapolation frequently fails, but the source of the failure has not been isolated. We show that a primary culprit is a repetition mismatch: because high-quality datasets are small, their repetition rate changes as the training budget grows, shifting the optimal mixture in ways that small-scale proxy experiments do not anticipate. A subsampling procedure that matches the target repetition rate controls for this effect. In a two-source setting combining limited high-quality data with web crawl, a single repetition-controlled experiment using only 1/16 of the target tokens recovers a mixture within 0.05 of the optimum for a 757M parameter model, compared to an error of 0.75 without repetition control. Achieving comparable accuracy without repetition control requires three to four horizons, consuming 44 to 94% of the target token budget. With three data sources, the larger mixture space requires more than a single experiment to constrain, but the approach remains effective: at the 757M scale, just two repetition-controlled horizons recover the optimal mixture, outperforming baselines that instead require the full two-source experiments to construct. Our results reveal that repetition dynamics, not scale alone, shape whether small-scale mixture experiments generalize. More broadly, they suggest that data repetition deserves treatment as a first-class variable in mixture optimization, rather than an inconvenient side effect of limited data.
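The repetition-rate argument in the abstract above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, token counts, and mixture weight are hypothetical, and proportional subsampling is one plausible reading of "a subsampling procedure that matches the target repetition rate", not the authors' confirmed procedure.

```python
def repetition_rate(budget_tokens: float, hq_weight: float, hq_unique_tokens: float) -> float:
    """Average number of passes over the high-quality source during training."""
    return budget_tokens * hq_weight / hq_unique_tokens


def subsample_size(proxy_budget: float, target_budget: float, hq_unique_tokens: float) -> float:
    """Unique high-quality tokens to keep so the proxy run repeats data as often as the target run."""
    return hq_unique_tokens * proxy_budget / target_budget


# Hypothetical numbers, chosen only to illustrate the mismatch.
target_budget = 16e9               # tokens in the target run
proxy_budget = target_budget / 16  # small-scale proxy run
hq_unique = 1e9                    # unique high-quality tokens available
hq_weight = 0.25                   # share of the mixture drawn from the high-quality source

target_reps = repetition_rate(target_budget, hq_weight, hq_unique)
naive_reps = repetition_rate(proxy_budget, hq_weight, hq_unique)

# Shrink the high-quality pool in proportion to the budget, then recompute.
kept = subsample_size(proxy_budget, target_budget, hq_unique)
controlled_reps = repetition_rate(proxy_budget, hq_weight, kept)

print(target_reps, naive_reps, controlled_reps)
```

With these made-up values the naive proxy sees the high-quality data far less often than the target run would, while the subsampled proxy matches the target's repetition rate at the same mixture weight.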

2606.07611 2026-06-09 cs.IR cs.AI cs.LG cs.SE Cross-list

MIRAGE: Metadata-Integrated Repository Analysis and Guided Enhancement for MSR Datasets

Aabia Ather, Muhammad Usayd Ather, Qurat-Ul-Ain Somroo, Muhammad Khuram Shahzad

Affiliations: SEECS, NUST

AI summary: Proposes improving the analysis of MSR datasets through metadata enrichment, FAIRness assessment, and topic-driven analysis; extends the dataset directory and shows how repository hosting sites and formats affect citations and usability.

Comments: 8 pages, 8 figures

English abstract

This paper proposes an improved approach to analyzing Mining Software Repositories (MSR) datasets via metadata enrichment, FAIRness assessment, and topic-driven analysis. The work extends an earlier dataset directory created specifically for the analysis of MSR datasets by adding new annotations to the datasets, enriching the metadata categories, and offering more advanced filtering options. The metadata of MSR papers presented from 2013 to 2024 was gathered using the Semantic Scholar API. The analysis is based on Latent Dirichlet Allocation (LDA) topic modeling and statistical analysis. Dataset-level attributes, namely repository hosting site, format, accessibility, reusability, and dataset quality, were incorporated into the expanded dataset directory. The study finds that the choice of repository hosting site and data format influences citation patterns and dataset usability. Furthermore, the enhanced annotation approach improves the analysis and discoverability of MSR datasets, supporting more effective reuse and evaluation of research artifacts.

2606.07613 2026-06-09 cs.CV cs.AI Cross-list

Can You Trust What You See? Human and AI Detection of Synthetic Legal Evidence

Jinzhe Tan, Ali Ekber Cinar, Karim Benyekhlef

Affiliations: Faculty of Law, McGill University

AI summary: Studies how well humans and frontier multimodal LLMs distinguish authentic photographs from AI-generated images in civil-dispute scenarios, finds neither reliable, and proposes combining human review, MLLM screening, and provenance authentication.

English abstract

Visual evidence has long been treated as a reliable form of legal proof, but advances in artificial intelligence (AI) are undermining that assumption. This article asks how well humans and frontier multimodal large language models (MLLMs) can distinguish authentic evidentiary photographs from AI-generated counterparts in the object-centric scenarios typical of civil disputes. We built Synthetic Legal Evidence Detection (SLED-1400), a dataset of 200 authentic evidence images paired with 1,200 synthetic counterparts produced by six contemporary text-to-image generators across ten evidence categories. The same stimuli and response format were used in a controlled web experiment with 136 lay participants and in a standardized evaluation of four MLLMs (GPT-5.1, Gemini-3-Pro, Gemini-3-Flash, Qwen3-VL-235B). Human accuracy was 64.8% overall, and 48.5% and 51.0% on the two strongest generators (Gemini-3-Pro-Image and Flux-2-Max), indistinguishable from chance. MLLMs never misclassified an authentic image (100% specificity), but missed most synthetic outputs from the harder generators, with average MLLM detection at 5.9% on Gemini-3-Pro-Image outputs. Human and MLLM errors were largely uncorrelated, while the four MLLMs were strongly correlated with each other. Neither group is a reliable standalone authenticator. We argue that visual evidence in legal proceedings should be treated as inherently contestable, and that a workable procedural response must combine trained human review, MLLM screening, and provenance infrastructure such as C2PA Content Credentials.

2606.07616 2026-06-09 cs.LG cs.AI cs.CL Cross-list

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

Sang Truong, Yuheng Tu, Rylan Schaeffer, Sanmi Koyejo

AI summary: Proposes Item Response Scaling Laws (IRSL), which integrate item response theory into the scaling-law framework; a Beta-IRT model exploits the probabilistic responses of language models, reducing parameter complexity from O(M×N) to O(M+N) and giving reliable estimates with only 50 questions in both pre-training and test-time scaling settings.

English abstract

Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference samples. To address this, we introduce Item Response Scaling Laws (IRSL), a unified framework that integrates Item Response Theory (IRT) within the scaling law framework. Unlike traditional approaches that treat each model-benchmark pair in isolation, IRSL disentangles latent model ability from question characteristics, factorizing the scaling law estimation for $M$ models and $N$ questions to significantly reduce parameter complexity from $O(M \times N)$ to $O(M + N)$. We instantiate IRSL with Beta-IRT, which leverages the empirical probability responses of LMs (such as token probabilities in pre-training and pass rates in test-time sampling) to capture richer signals than binary responses. We validate our approach across two prevalent scaling paradigms: (1) pre-training downstream scaling, using 6,612 LM checkpoints and 37,682 questions from 10 benchmarks; and (2) test-time scaling, using 12 LMs and 120 questions from 4 benchmarks with up to 2,500 samples per question. Given a one-time calibration on existing model responses, IRSL yields more reliable scaling estimates using only 50 questions per benchmark (a 99.9% reduction), achieving comparable or superior decision accuracy to traditional approaches. Furthermore, we show that the estimated latent model abilities are generalizable, enabling accurate performance forecasting across benchmarks that share the same measurement objective.
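To see where the reduction from $O(M \times N)$ to $O(M + N)$ comes from, the sketch below counts parameters for a factorized response model. It uses a simple Rasch-style (one-parameter logistic) model as a stand-in; the abstract's Beta-IRT model is not reproduced here, and the function names are hypothetical.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Rasch-style response probability: one scalar per model, one per question."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def param_counts(num_models: int, num_questions: int) -> tuple[int, int]:
    """Parameters needed when every (model, question) pair is fitted separately
    versus when each model has one ability and each question one difficulty."""
    per_pair = num_models * num_questions
    factorized = num_models + num_questions
    return per_pair, factorized


# Sizes quoted in the abstract for the pre-training setting.
per_pair, factorized = param_counts(6612, 37682)
print(per_pair, factorized)

# A model whose ability equals a question's difficulty answers correctly half the time.
print(p_correct(0.0, 0.0))
```

Because abilities and difficulties are shared across pairs, a new model only needs enough responses to pin down its single ability value, which is the intuition behind evaluating on a small question subset after calibration.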

2606.07640 2026-06-09 cs.CV cs.AI cs.LG Cross-list

No Free Lunch for Synthetic Images under Data Scarcity Conditions

Borja Arroyo Galende, Alejandro Almodóvar, Patricia A. Apellániz, Juan Parras, Silvia Uribe, Santiago Zazo

Affiliations: Universidad Politécnica de Madrid; Universidad de Alcalá

AI summary: Studies the trade-offs among fidelity, privacy, and utility of synthetic data under data-scarce, privacy-sensitive conditions; proposes a joint evaluation framework and compares VAEs, GANs, and DDPMs on three image datasets, finding GANs and DDPMs more robust under differential privacy.

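2606.07643 2026-06-09 cs.CV cs.AI cs.SD eess.AS Cross-list

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu

Affiliations: Nanyang Technological University

AI summary: Proposes AVI-Bench, which evaluates the audio-visual intelligence of omni-modal LLMs through cross-modal tasks across three stages (perception, understanding, and reasoning), and introduces AVI-Bench-PriSe to test primitive audio-visual perception; reveals limitations of current models and builds a four-level AVI taxonomy.

Comments: 31 pages, 8 figures, ICML 2026

Abstract (translated from the AI-generated Chinese abstract)

Recent advances in omni-modal large language models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains under-evaluated because systematic, comprehensive benchmarks are lacking. We propose AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages (perception, understanding, and reasoning) through cross-modal tasks that require joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extended version that probes models' primitive audio-visual perception with unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on open-source and closed-source models reveal significant limitations of current Omni-MLLMs. Based on these findings, we propose a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/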

English abstract

Foundation models for astronomical surveys offer powerful learned representations that can be transferred to downstream regression tasks such as galaxy property estimation. However, point predictions alone are insufficient for scientific inference; reliable uncertainty quantification (UQ) is essential. We compare seven UQ methods on galaxy property regression using frozen AION-1 foundation-model embeddings, predicting redshift, stellar mass, stellar-population age, gas-phase metallicity, and specific star-formation rate, from Legacy Survey photometry/imaging and DESI spectra, with PROVABGS-derived labels. Distribution-free conformal methods achieve marginal coverage within ~1 pp of the nominal 90% across all properties, while non-conformal baselines (Deep Ensembles, MC Dropout) fail to calibrate reliably. Among conformal approaches, Conformalized Quantile Regression (CQR) delivers the best coverage in the bin with the poorest model predictions. More importantly, only the Locally Valid and Discriminative (LVD) framework, particularly when operating on AION-1 embeddings, also provides finite-sample local validity, producing intervals that adapt to each galaxy's local prediction difficulty rather than relying on marginal guarantees alone. These results establish conformal prediction, and LVD in particular, as the preferred UQ framework for uncertainty-aware inference on foundation-model embeddings in astrophysics.
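For readers unfamiliar with the marginal coverage guarantee mentioned above, here is a minimal sketch of plain split conformal prediction for regression. It is a simpler relative of the CQR and LVD methods named in the abstract, not an implementation of either, and the residual values and point prediction are made up for illustration.

```python
import math


def conformal_radius(abs_residuals: list[float], alpha: float = 0.1) -> float:
    """Half-width giving at least (1 - alpha) marginal coverage under exchangeability.

    abs_residuals are |y - prediction| on a held-out calibration set.
    """
    n = len(abs_residuals)
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest residual.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(abs_residuals)[rank - 1]


def interval(prediction: float, radius: float) -> tuple[float, float]:
    """Symmetric prediction interval around a point prediction."""
    return prediction - radius, prediction + radius


# Hypothetical calibration residuals (e.g. redshift errors) and a new point prediction.
calibration = [0.01, 0.03, 0.02, 0.05, 0.04, 0.02, 0.01, 0.06, 0.03, 0.02,
               0.07, 0.01, 0.02, 0.04, 0.03, 0.05, 0.02, 0.01, 0.08]
radius = conformal_radius(calibration, alpha=0.1)
low, high = interval(0.52, radius)
print(radius, low, high)
```

The interval width here is the same for every input, which is exactly the limitation that locally adaptive methods aim to remove: marginal coverage holds on average, but difficult inputs are not given wider intervals.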

2606.07810 2026-06-09 cs.CL cs.AI cs.LG Cross-list

SLMJury: Can Small Language Models Judge as Well as Large Ones?

Anish Laddha, Nitesh Pradhan, Gaurav Srivastava

Affiliations: LNMIIT; Virginia Tech

AI summary: Proposes the SLMJury framework for evaluating small language models as judges, finding a domain-dependent overthinking effect, differences in domain generalization, a separation between closed-ended and open-ended judging abilities, and reduced accuracy under multi-agent debate.

English abstract

Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.


2606.07853 2026-06-09 cs.CL cs.AI (cross-list)

Beyond English benchmarks: clinical LLM evaluation in Brazilian Portuguese

Giordano de Pinho Souza, Glaucia Melo, Josefino Cabral Melo Lima, Daniel Schneider

Affiliations: Federal University of Rio de Janeiro; Toronto Metropolitan University

AI summary: Introduces ClinicalBr, the first bilingual clinical benchmark built from Brazilian case reports. An evaluation of four models finds that the Portuguese-English performance gap is task-dependent: English has a clear advantage in diagnosis retrieval, while the gap disappears on the other tasks.

Abstract

Large language models are transforming clinical decision support and its application in real-world settings. Yet most benchmarks are conducted in English, and cross-lingual evaluation is needed to address language gaps in global access. We introduce ClinicalBr, the first bilingual clinical decision-making benchmark built from real Brazilian case reports. The corpus contains 2,892 cases drawn from 28 SciELO medical journals, spanning 18 specialties, and is structured as parallel Portuguese-English pairs. Each case supports four evaluation tasks: diagnosis retrieval, differential diagnosis, exam recommendation, and treatment planning. We evaluate four models (MedGemma-27B, Sabiá-4, DeepSeek-R1, and o3-mini) in both languages. The central finding is that the Portuguese-English performance gap is task-dependent, not general. In diagnosis retrieval, English yields a consistent advantage of 7.5 to 12.1 accuracy points across all models. This advantage disappears in differential diagnosis, exam recommendation, and treatment planning, where confidence intervals cross zero for most models and Portuguese completeness scores are marginally higher. Brazilian-endemic conditions proved easier than the full corpus, not harder, indicating that tropical presentations are adequately represented in current pre-training. Exam recommendation was the hardest task across all models and both languages, with F1 scores below 0.10, well below the differential diagnosis ceiling of 0.20-0.27.


2606.07861 2026-06-09 cs.CV cs.AI (cross-list)

The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models

Lujun Li, Lama Sleem, Niccolo Gentile, Yangjie Xu, Yewei Song, Wenbo Wu, Radu State

Affiliations: University of Luxembourg; Foyer S.A.; Université Paris-Saclay

AI summary: Introduces the FineSightBench benchmark, which separates perception tasks from reasoning tasks at scales of 4-48 pixels. It finds that VLM perception saturates at around 12 pixels while reasoning remains limited even at larger scales, revealing fundamental deficiencies in fine-scale visual reasoning.

Comments: 25 pages

Abstract

Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains underexplored. A natural extension of "How many r are there in Strawberry?" asks: how small a visual pattern can a VLM reliably perceive? To answer this question, we introduce FineSightBench, a new benchmark that systematically probes this limit by separating perception tasks (pixel-level recognition of letters, shapes, objects) from reasoning tasks (spatial reasoning, counting, ordering over small targets) across controlled scales of 4-48 px. Through comprehensive experiments and detailed failure mode analysis on state-of-the-art models, we reveal a sharp dissociation: perception saturates around 12 px, while reasoning remains limited even at larger scales, with persistent numeracy and sequence errors. These findings expose fundamental deficiencies in VLMs' fine-scale visual reasoning that demand more rigorous evaluation.


2606.07969 2026-06-09 cs.CL cs.AI (cross-list)

Neutrality Bites: Gender Representation in AI-Generated Animal Stories

Imani Finkley, Yuanxi Li, Melanie Walsh

Affiliations: University of Washington

AI summary: Studies how six mainstream LLMs assign gender when generating animal stories. The models often avoid specifying gender or use neutral language, but when gender is assigned they skew strongly masculine and feminine characters are almost absent, suggesting that neutrality strategies may erase marginalized perspectives.

Comments: FAccT (ACM Conference on Fairness, Accountability, and Transparency) 2026

DOI: 10.1145/3805689.3812287

Abstract

Gender bias in AI-generated stories is a well-documented problem. While much attention has been paid to reducing or mitigating this bias, it is not always clear whether interventions produce genuinely fairer results. To investigate this issue, we examine how large language models (LLMs) handle gender assignment in a narrative context that is popular, highly ambiguous, and also known to closely reproduce human stereotypes: stories about talking animals. We prompt six leading LLMs to complete an English-language story about seven different anthropomorphic animal characters whose gender is unstated. We additionally iterate with four different narrative settings and a range of model temperatures. Across the 23.8K stories, we find that models frequently avoid gendering the animal character in the story (19% on average) or use gender-neutral language like "it" or "its" (38.2% on average). However, when gender is assigned, there is a significant masculine bias. Feminine animal characters are virtually absent, present in just 2.2% of stories vs. 40.6% that feature masculine characters. Our findings point to a broader argument: neutrality bites. In other words, models that prioritize neutrality to address social bias may actually contribute to the erasure of marginalized perspectives and identities. We suggest that alternative strategies beyond neutrality need to be pursued, such as ones that more equally distribute social possibilities across imagined subjects.


2606.07996 2026-06-09 cs.CL cs.AI (cross-list)

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

Kaixin Lan, Mu You, Tao Fang, Binkai Ou, Lidia S. Chao, Derek F. Wong

Affiliations: University of Macau; Macau Millennium College; BoardWare Information System Limited

AI summary: Proposes MC-PDD, which masks specific tokens, has the LLM predict the missing content, and compares prediction hit rates between a candidate corpus and a reference non-member corpus to detect pretraining data in a black-box setting, with performance comparable to existing methods.

Comments: The manuscript consists of 10 pages formatted in the IEEE/ACM two-column style

Abstract

Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only input-output interfaces. To address this limitation, we introduce Masked Corpus-level Pretraining Data Detection (MC-PDD), a novel method inspired by the masked language modeling paradigm. MC-PDD masks highly specific tokens in each text and prompts the LLM to predict the missing content. It then assesses whether the difference in prediction hit rates between a candidate corpus and a reference non-member corpus is statistically significant. Based on this comparison, MC-PDD determines whether the candidate texts were likely included in the model's pretraining data. Experimental results demonstrate clear and consistent differences in prediction hit rates between pretrained and unseen data across three datasets, for both open-source and closed-source LLMs. Despite operating under a stricter black-box setting, MC-PDD achieves performance comparable to existing detection methods. Our approach enables practical applications such as model auditing and data copyright verification using only standard API access. Upon acceptance, we will publicly release the code and datasets.
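
The corpus-level comparison in this abstract can be sketched as a significance test on two hit rates. The abstract does not name the test MC-PDD uses, so the one-sided two-proportion z-test below is an assumption chosen for illustration; the function name and the toy counts are hypothetical.

```python
import math

def two_proportion_z_test(hits_cand, n_cand, hits_ref, n_ref):
    """One-sided z-test: is the candidate hit rate higher than the reference rate?

    Illustrative only; the actual statistical test in MC-PDD may differ.
    """
    p_cand = hits_cand / n_cand
    p_ref = hits_ref / n_ref
    # Pooled hit rate under the null hypothesis of equal rates.
    pooled = (hits_cand + hits_ref) / (n_cand + n_ref)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_cand + 1 / n_ref))
    if se == 0:
        raise ValueError("hit rates are degenerate (all hits or all misses)")
    z = (p_cand - p_ref) / se
    # Upper-tail probability of a standard normal: 1 - Phi(z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Toy counts (hypothetical): masked tokens recovered in each corpus.
z, p_value = two_proportion_z_test(hits_cand=60, n_cand=100, hits_ref=40, n_ref=100)
print(f"z = {z:.3f}, one-sided p = {p_value:.4f}")
```

Under this sketch, a small p-value would be read as evidence that the candidate corpus is recovered more often than unseen text, which is the signal the abstract associates with membership in the pretraining data.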


2606.08000 2026-06-09 cs.CL cs.AI (cross-list)

Summarization is Not Dead Yet

Dongqi Liu, Chenxi Whitehouse, Zheng Zhao, Zhuchen Cao, Jian Li, Yabiao Wang

Affiliations: Saarland University; Max Planck Institute for Informatics; University of Cambridge; University of Edinburgh; Zhejiang University; Tencent YouTu Lab

AI summary: A multi-dimensional evaluation finds that human reference summaries still outperform LLMs on informativeness and faithfulness, while LLM outputs lead only on surface-level coherence and fluency, indicating that summarization research still faces open challenges.

Abstract

The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written references, raising questions about whether summarization remains an open research problem. We re-examine this narrative through a multi-track evaluation covering five diverse datasets and five state-of-the-art LLMs, combining controlled human assessment, bias-mitigated LLM-as-Judge protocols, factuality verification against external knowledge, and corpus-level linguistic analysis. Our findings reveal a more nuanced landscape in which human reference summaries continue to demonstrate advantages in informativeness and faithfulness, whereas LLM outputs are preferred mainly for surface-level coherence and fluency. Factuality verification indicates that human references remain more reliable, particularly for claims involving reasoning or synthesis, and linguistic analysis uncovers a pattern of stylistic homogeneity across different models. These observations suggest that current LLMs have raised the floor of summarization quality, but the ceiling of their performance remains below human capabilities.


2606.08034 2026-06-09 cs.CV cs.AI cs.CL (cross-list)

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

Muhammad Falensi Azmi, Ikhlasul Akmal Hanif, Vallerie Alexandra Putra, Adi Yeltay, Abdullah Mubarak, Fajri Koto

Affiliations: Independent Researcher; MBZUAI; Binus University; Bandung Institute of Technology

AI summary: Introduces Sci-Rho, a multilingual, visually grounded dynamic benchmark for STEM problems with 4,242 templates and 42,420 instances. An evaluation of 17 VLMs finds a gap between worst-case and average accuracy, and smaller models degrade across languages.

Comments: 22 pages

Abstract

Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In this work, we introduce Sci-Rho (Science Rhobustness), a dynamic benchmark for visually-grounded STEM problems spanning five subjects and seven languages, comprising 4,242 problem templates (606 per language) crafted by domain experts, including Olympiad medalists. Each template is implemented as executable Python code that generates diverse but equivalent problem instances by varying numerical values, visual patterns, geometric shapes, color schemes, and function types, resulting in 42,420 instances in total, each paired with reasoning steps and ground-truth solutions. We evaluated 17 state-of-the-art VLMs and discovered a noticeable gap between worst-case accuracy (defined as the proportion of problem templates that a model answers correctly across every generated variation) and average accuracy. We also discovered that smaller models show noticeable performance degradation across languages, whereas proprietary and larger models remain robust. Step-level evaluation reflects this same trend, revealing a significant gap between average F1 and worst-case F1 scores. Finally, our inspection of attention heads of a VLM reveals substantial cross-lingual variation in the relative attention allocated to image tokens compared to text tokens. Our work highlights the importance of evaluation beyond static benchmarks as a metric to measure the quality of VLMs.
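
The two metrics contrasted in this abstract can be sketched in a few lines. Worst-case accuracy follows the definition given above (a template counts only if every generated variation is answered correctly); average accuracy is assumed here to be the plain per-instance mean, which the abstract does not spell out. The function name and the toy results are hypothetical.

```python
from collections import defaultdict

def average_and_worst_case_accuracy(results):
    """results: iterable of (template_id, is_correct) pairs, one per generated instance."""
    per_template = defaultdict(list)
    for template_id, is_correct in results:
        per_template[template_id].append(bool(is_correct))
    n_instances = sum(len(v) for v in per_template.values())
    # Average accuracy: fraction of all instances answered correctly (assumed definition).
    average = sum(sum(v) for v in per_template.values()) / n_instances
    # Worst-case accuracy: fraction of templates with every variation correct.
    worst_case = sum(all(v) for v in per_template.values()) / len(per_template)
    return average, worst_case

# Toy data: template t1 is always solved, template t2 fails on one variation.
toy_results = [
    ("t1", True), ("t1", True), ("t1", True),
    ("t2", True), ("t2", False), ("t2", True),
]
avg_acc, worst_acc = average_and_worst_case_accuracy(toy_results)
print(f"average = {avg_acc:.3f}, worst-case = {worst_acc:.3f}")
```

In the toy data a single failed variation leaves average accuracy at 5/6 but halves worst-case accuracy, which is the kind of gap the abstract reports.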


2606.08036 2026-06-09 cs.IR cs.AI cs.CL (cross-list)

GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

Zongrng Li, Mingzheng Yang, Lei Zou, Hongxu Ma, Hao Tian, Siqi Zhou, Wenjing Gong, Kaili Zhang, Bingqian Chen, Mitch Zhang, Yifan Yang

Affiliations: Texas A&M University; Google; Department of Geography; Department of Landscape Architecture and Urban Planning

AI summary: Targeting LLM overconfidence in academic research, builds the GIScholarBench benchmark from 10,865 papers and evaluates models on three tasks (metadata retrieval, literature linking, and research direction generation), finding task-invariant overconfidence in all models.

Abstract

Large language models (LLMs) are increasingly used in academic research workflows, but scholarly tasks require high factual precision and therefore expose a key weakness: overconfidence. Here, overconfidence is defined behaviorally as the tendency to produce confident, assertive, and well-formatted outputs even when the underlying knowledge is incomplete or unverifiable, rather than as a calibration gap between stated confidence and accuracy. To examine this issue, we introduce GIScholarBench, a benchmark built from 10,865 papers published in 25 core GIScience journals between 2020 and 2025. The benchmark covers three tasks with increasing cognitive complexity: metadata retrieval, literature linking, and research direction generation. We evaluate Claude Sonnet 4.5, Gemini 3, and ChatGPT 5.3 through their native web interfaces under real-world user-facing conditions. Results show consistent overconfidence across all tasks. In metadata retrieval, ChatGPT 5.3 achieves the highest accuracy, but all models still generate definitive titles and DOIs when predictions are wrong. In literature linking, Claude Sonnet 4.5 recovers the most references, but all models show a clear gap between top-ranked retrieval and longer citation lists, suggesting that references are extended beyond reliable retrieval capacity. In research direction generation, AI-generated directions show lower topic coverage, higher novel miss rates, and lower semantic diversity than real future-citing papers. These findings suggest that LLM overconfidence is task-invariant but takes different forms: factual overgeneration in retrieval, unreliable citation expansion in literature linking, and overconfidence in output completeness during research ideation.


2606.08123 2026-06-09 cs.CV cs.AI (cross-list)

Human-Centered Benchmarking of Driver Monitoring Models

Ruben Dario Florez-Zela

Affiliations: Universidad Nacional de San Agustin de Arequipa (UNSA)

AI summary: Argues that classification accuracy alone is insufficient for evaluating driver monitoring models and proposes the Human-Centered Benchmarking Framework (HCBF), which scores accuracy, explainability, efficiency, and robustness. Each model has its own strength on the Pareto frontier, but aggregate rankings can mask critical weaknesses.

Comments: 9 pages, 3 figures, 7 tables. Code available at: https://github.com/rubendflorezzela/hcbf-driver-monitoring

Abstract

Vision-based driver monitoring systems are increasingly deployed in safety-critical intelligent transportation settings, yet they are almost always compared on classification accuracy alone. This paper argues that accuracy is insufficient to characterize a model's fitness for real-world deployment, and proposes the Human-Centered Benchmarking Framework (HCBF), which evaluates models across four dimensions: accuracy, explainability, efficiency, and robustness. The framework is applied to four representative lightweight architectures, MobileNetV3, ShuffleNetV2, EfficientNet-B0, and DeiT-Tiny, on the MRL Eye Dataset for eye-state classification. While the models are nearly indistinguishable on clean-set accuracy, each leads in exactly one dimension, and all four lie on the Pareto frontier. A Human-Centered Score computed under three deployment-oriented weighting scenarios ranks ShuffleNetV2 first throughout. However, this aggregate winner retains less than half of its performance under sensor noise and fails by classifying closed eyes as open, whereas the transformer remains robust. These findings show that aggregate ranking can mask dimension-specific vulnerabilities that are operationally decisive, underscoring the value of multi-dimensional, human-centered evaluation.
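
The interplay between a Pareto frontier and an aggregate score can be sketched as follows. The model names, the normalized scores, and the equal weights are hypothetical placeholders (the abstract gives no numbers), and the weighted sum is only an assumed form of the Human-Centered Score.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Names of models that no other model dominates (higher is better on every axis)."""
    return [
        name for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]

def weighted_score(s, weights):
    """Aggregate score as a weighted sum (an assumed form, not the paper's formula)."""
    return sum(w * x for w, x in zip(weights, s))

# Hypothetical scores: (accuracy, explainability, efficiency, robustness).
scores = {
    "model_a": (0.99, 0.5, 0.6, 0.4),
    "model_b": (0.98, 0.9, 0.5, 0.5),
    "model_c": (0.98, 0.6, 0.9, 0.3),
    "model_d": (0.97, 0.6, 0.4, 0.9),
    "model_e": (0.96, 0.5, 0.4, 0.3),  # dominated by model_a
}
front = pareto_front(scores)
weights = (0.25, 0.25, 0.25, 0.25)
best = max(scores, key=lambda name: weighted_score(scores[name], weights))
print(front, best)
```

In this toy setup four models sit on the frontier, yet the aggregate winner has a much lower robustness score than model_d, mirroring the abstract's point that a single ranking can hide a dimension-specific weakness.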


2606.08194 2026-06-09 cs.CL cs.AI (cross-list)

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

Ryner Tan, Wenxuan Zhang

Affiliations: Singapore University of Technology and Design

AI summary: Introduces the GlobeAudio benchmark of 5,637 multilingual multiple-choice questions for evaluating the auditory reasoning and cultural understanding of large audio-language models under natural audio conditions, revealing substantial performance gaps for open-source models and low-resource languages.

Abstract

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most benchmarks lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, crafted by native speakers and grounded in naturally occurring audio. To do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio.


2606.08272 2026-06-09 cs.CL cs.AI (cross-list)

AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers

Mohsina Bilal, Gopakumar G

Affiliations: National Institute of Technology Calicut

AI summary: Presents AgriGov, a trilingual dataset built through automated scraping, a translation pipeline, and human post-editing, yielding about 8,000 sentence-aligned parallel pairs in the agricultural policy domain to support machine translation, question answering, and related applications.

Comments: 15 pages, 4 figures, Submitted to: Sadhana, Elsevier

Abstract

AgriGov is a curated, trilingual (English-Hindi-Marathi) dataset designed to address the scarcity of domain-grounded multilingual resources for agricultural policies and farmer welfare schemes. Initially, we collected and structured data from 50 government schemes sourced from trusted portals using automated scraping techniques, organizing it into predefined semantic fields (e.g., title, eligibility, application process, documents, exclusions). Translations were performed using a pipeline combining Google Translate API, MarianMT, and human post-editing, resulting in a domain-specific Hindi-Marathi dataset comprising approximately 2100 source segments. To enhance coverage, we augmented this dataset with sentences from the Samanantar corpus, leading to approximately 8,000 sentence-aligned Hindi-Marathi parallel pairs. The dataset now offers robust resources for fine-tuning machine translation models in this domain. AgriGov is designed for applications in domain-adaptive machine translation, question answering, information retrieval, and summarization systems. Its key contribution is a schema-driven, human-corrected multilingual alignment pipeline that ensures domain fidelity, provides provenance, and supports reproducible experiments, enabling retrieval-augmented applications for farmer-facing tools.


2606.08367 2026-06-09 cs.MA cs.AI (cross-list)

Breaking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

Antonio Franca, Alexander Tong

AI summary: Argues that generative perplexity (gen-PPL) is a flawed evaluation metric for non-autoregressive language models: zero-parameter naive samplers reach state-of-the-art gen-PPL on LM1B and OpenWebText while producing incoherent text. The paper recommends evaluation suites that directly quantify the distributional divergence between generated and reference text.

Comments: Accepted to the Workshop on Structured Probabilistic Inference & Generative Modeling (SPIGM) at ICML 2026

Abstract

Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternatives to language modeling. Progress in both paradigms is overwhelmingly tracked by generative perplexity (gen-PPL): the per-token negative log-likelihood of samples under a frozen autoregressive (AR) scorer such as gpt2-large, typically paired with an empirical-entropy guardrail to rule out low-entropy collapse. We argue that this metric is unsound. By construction, gen-PPL measures only predictability under the scoring AR, not grammaticality or semantic coherence, and the set of predictable but still low-quality sequences is combinatorially large. To make this concrete, we construct a suite of zero-parameter, deliberately naive samplers that achieve state-of-the-art gen-PPL on LM1B and OpenWebText at non-degenerate entropy, surpassing recently published diffusion and continuous-flow models while producing text that is incoherent by construction. We recommend evaluation suites that directly quantify the distributional divergence between generated and reference text, and use such a suite to re-benchmark recent non-autoregressive models, recovering a more faithful picture of the current state of the art.
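
The two quantities named in this abstract can be sketched from their definitions. The snippet assumes the scorer's per-token log-probabilities are already available (no model is loaded), reports perplexity as the exponential of the mean negative log-likelihood, and uses a unigram Shannon entropy as a stand-in for the empirical-entropy guardrail; the function names and toy values are hypothetical.

```python
import math
from collections import Counter

def generative_perplexity(token_logprobs):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

def empirical_entropy(tokens):
    """Shannon entropy (in nats) of the unigram distribution of one sample."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical scorer log-probabilities for a 4-token sample.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = generative_perplexity(logprobs)
ent = empirical_entropy(["a", "b", "a", "b"])
print(f"gen-PPL = {ppl:.3f}, entropy = {ent:.3f} nats")
```

Both numbers depend only on how predictable and how varied the tokens are under the scorer, never on whether the text is grammatical or coherent, which is the gap the abstract argues a naive sampler can exploit.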


2606.08481 2026-06-09 cs.LG cs.AI cs.DB cs.SE (cross-list)

PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

Suraj Ranganath, Anish Raghavendra

Affiliations: Halıcıoğlu School of Data Science and Computing, University of California, San Diego; Independent Researcher

AI summary: Proposes the PIPE-Cypher pipeline, which uses a local LLM to automatically generate balanced NL-to-Cypher benchmarks from enterprise property graphs through schema profiling, reverse-query grounding, constrained generation, and execution validation, making benchmark construction repeatable.

Abstract

Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark therefore reflects the questions users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique, and graph structure changes over time. Each NL-query pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.


2606.08718 2026-06-09 cs.LG cs.AI (cross-list)

Deep Active Re-Labeling: Toward Noise-Resilient Annotation Efficiency

Md Abdullah Al Forhad, Weishi Shi

AI summary: To address the performance drop that human annotation noise causes in deep active learning, proposes a framework that allocates part of the annotation budget to re-labeling already-labeled data for denoising. Experiments show it is more data-efficient under the same budget and ends with a relatively noise-free dataset.

Comments: Accepted and published in the 2025 IEEE International Conference on Big Data (BigData). DOI: 10.1109/BigData66926.2025.11402126

DOI: 10.1109/BigData66926.2025.11402126
Journal ref: 2025 IEEE International Conference on Big Data (BigData), Macau, China, 2025, pp. 886-895

Abstract

While Deep Active Learning (DAL) effectively reduces human annotation costs, its efficacy is constrained by human annotation errors. This is because the data sampled for active learning is assumed to be highly informative for training. When human annotators introduce errors into this informative data at a certain rate, the active learning performance drops significantly and, in some cases, even exhibits worse outcomes than passive learning. In this paper, we first analyze the impact of human annotation errors in the DAL setting. Then we propose a framework to address the human annotation noise problem for DAL. Informed by human learning patterns, the core idea of our proposed solution involves allocating a portion of the human annotation budget to re-annotate data that has already been labeled. Previous theoretical work suggests that when the model possesses a certain level of ability to identify potentially noisy data, even re-labeling a small fraction of the data can effectively remove noise from the active training set. To achieve this, we implement two active noise sampling strategies to detect noise under different circumstances and allocate a part of the annotation budget to re-annotate these instances. Our approach imbues active learning with a revisiting and introspective behavior. Our experiments demonstrate that, under the same annotation budget, our method is more data-efficient and yields a relatively noise-free annotation dataset in the end.


2606.08769 2026-06-09 cs.CL cs.AI (cross-list)

RadOT-Eval: Auditable Structured-Evidence Transport for Radiology Report Evaluation

Weixin Liu, Juming Xiong, Yang Li, Qingyuan Song, Susannah Rose, Murat Kantarcioglu, Bradley Malin, Zhijun Yin

Affiliations: University of California, Berkeley

AI summary: Proposes the RadOT-Eval framework, which aligns structured clinical evidence via optimal transport and achieves high Spearman correlation with error burden on an independent test set, outperforming standard metrics and an LLM-based evaluator.

Comments: 10 pages, 1 figure, 13 tables

Abstract

Automatic evaluation is critical for high-stakes text generation, where errors often involve omitted findings, hallucinated content, polarity reversals, location changes, uncertainty mismatches, and temporal-comparison errors rather than low surface similarity alone. Radiology report generation provides a challenging test case because generated reports must preserve structured clinical evidence across sources. We present RadOT-Eval, an interpretable structured-evidence optimal transport framework for offline auditing of radiology report generation. RadOT-Eval decomposes reference and candidate reports into attribute-structured clinical evidence units, aligns corresponding evidence using entropy-regularized optimal transport, and uses clinically meaningful side-channel discrepancies in a monotone risk model to predict error burden. All transport, feature, and readout choices are selected using the ReXVal dataset, and the frozen system is evaluated on the independent RadEvalX dataset. RadOT-Eval achieves Spearman correlations of 0.715, 0.548, and 0.399 with total, clinically significant, and clinically insignificant annotated error burden, respectively, yielding higher point estimates than standard evaluation metrics and the open-source large language model (LLM)-based evaluator GREEN-radllama2-7B. In a frozen auxiliary corruption-sensitivity stress test on ReXErr-v1, RadOT-Eval achieves 0.768 AUROC and a 0.990 corrupted-greater-than-clean paired win rate. These results show that structured evidence transport provides an auditable, rank-oriented evaluation tool for high-stakes generated clinical text under ReXVal-only model selection and frozen RadEvalX testing.
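
The alignment primitive named in this abstract, entropy-regularized optimal transport, can be sketched with plain Sinkhorn iterations. This illustrates only the generic technique, not the RadOT-Eval pipeline: the cost matrix, the uniform evidence weights, and the regularization strength are hypothetical, and how the paper builds costs from clinical attributes is not stated here.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between weight vectors a and b (pure Python)."""
    n, m = len(a), len(b)
    # Gibbs kernel: low-cost pairs get large entries.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy mismatch costs between two reference and two candidate evidence units.
cost = [[0.0, 1.0],
        [1.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5])
transport_cost = sum(plan[i][j] * cost[i][j] for i in range(2) for j in range(2))
print(plan, transport_cost)
```

In the toy case nearly all mass flows along the zero-cost diagonal, so matching evidence units are paired and the residual transport cost stays close to zero; a larger cost would indicate evidence that has no good counterpart in the other report.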


2606.08850 2026-06-09 cs.LG cs.AI cs.CL stat.ML 交叉投稿

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

内在选择与粒子重采样：超越领域可验证性的推理时扩展

Giorgio Giannone, Mustafa Eyceoz, Shabana Baig, Shivchander Sudalairaj, Anna C. Doris, Faez Ahmed, Akash Srivastava, Kai Xu

发表机构 * MIT（麻省理工学院）； Red Hat（红帽公司）； IBM（IBM公司）

AI总结提出基于并行样本集内在统计量（长度调整尾熵）的推理时扩展方法，通过后验候选排序和步骤级重采样，无需外部验证即可提升开放领域任务性能。

Comments preprint


AI中文摘要

推理时扩展（ITS）在数学和编程等可验证领域取得了很大成功，其中廉价验证使得可扩展输出选择成为可能。然而，将ITS扩展到容易发生系统性失败的任务——由错误初始假设或未满足的多维约束驱动——通常依赖于昂贵的外部求解器或脆弱的基于模型的验证器。我们的关键洞察是，并行样本集的内在统计量，特别是长度调整尾熵，提供了关于解质量的稳健判别信号，而无需访问真实标签。至关重要的是，这些统计量作为自适应计算分配的难度门控，动态地将问题路由到不同的扩展规模。首先，内在选择（iS）事后对候选进行排序，在三个领域匹配基于共识的算法，并将工程设计选择性能比pass@1基线提高20%。其次，内在粒子滤波（iPF）将其推广到步骤级重采样，引导生成走向高置信度推理轨迹，在困难数学问题上平均将pass@1提高6.1个百分点。最后，粒子蒸馏（dPF）通过早期logit混合和KL引导重采样注入特权指导，引导生成绕过系统性推理错误以满足专家评分标准，在复杂临床响应上获得高达26.5%的提升。我们的流程无缝适用于通用、领域专用和多模态架构，成功将ITS扩展到开放领域，而无需训练奖励模型或精确的真实标签验证。

英文摘要

Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by faulty initial assumptions or unmet multidimensional constraints - typically relies on costly external solvers or brittle, model-based verifiers. Our key insight is that the intrinsic statistics of parallel sample sets, specifically length-adjusted tail entropy, provide a robust discriminative signal for solution quality without access to ground truth. Crucially, these statistics serve as a difficulty gate for adaptive compute allocation, dynamically routing problems across scaling regimes. First, Intrinsic Selection (iS) ranks candidates post-hoc, matching consensus-based algorithms across three domains and improving engineering design selection by 20% over pass@1 baselines. Second, Intrinsic Particle Filtering (iPF) generalizes this to step-level resampling, guiding generation toward high-confidence reasoning trajectories to improve pass@1 by 6.1 points on average on hard math problems. Finally, Particle Distillation (dPF) injects privileged guidance via early logit blending and KL-guided resampling, steering generation past systematic reasoning errors to satisfy expert rubrics, yielding up to 26.5% gains on complex clinical responses. Our pipeline applies seamlessly across broad-purpose, domain-specialized, and multimodal architectures, successfully extending ITS to open-ended domains without requiring trained reward models or exact ground-truth verification.
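Step-level resampling in a particle filter can be illustrated with systematic resampling, a standard scheme that duplicates high-weight particles and drops low-weight ones. This is a hedged sketch of the generic technique only: the abstract does not say which resampling scheme iPF uses or how length-adjusted tail entropy is turned into weights, so the weights below are hypothetical.

```python
import random

def systematic_resample(weights, rng):
    """Return particle indices drawn by systematic resampling."""
    n = len(weights)
    total = sum(weights)
    start = rng.random() / n  # one random offset shared by all n pointers
    indices = []
    i = 0
    cumulative = weights[0] / total
    for k in range(n):
        pointer = start + k / n
        # Advance to the particle whose cumulative-weight interval holds the pointer.
        while pointer >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices

rng = random.Random(0)
# Hypothetical confidence weights for four partial reasoning trajectories.
survivors = systematic_resample([0.0, 0.0, 1.0, 0.0], rng)
uniform = systematic_resample([1.0, 1.0, 1.0, 1.0], rng)
```

With all mass on one particle, every slot is filled by that particle; with equal weights, each particle survives exactly once.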


2606.08932 2026-06-09 cs.CL cs.AI cs.CE 交叉投稿

PhysScene：用于物理实验科学视觉推理的场景图数据集

Minghao Zou, Qingtian Zeng, Shangkun Liu, Yanda Meng, Guanghui Yue, Baoquan Zhao, Abdulmotaleb El Saddik, Wei Zhou

发表机构 * Cardiff University（卡迪夫大学）； Shandong University of Science and Technology（山东科技大学）； University of Exeter（埃克塞特大学）； Shenzhen University（深圳大学）； Sun Yat-sen University（中山大学）； University of Ottawa（渥太华大学）

AI总结提出首个面向物理实验的场景图数据集PhysScene，通过高密度关系约束和结构化实验设置，推动科学视觉推理中超越空间共现的逻辑依赖关系建模。


AI中文摘要

场景图通过建模对象及其成对关系，提供视觉场景的结构化表示。尽管最近取得了进展，现有数据集主要关注通用自然场景，领域特定和功能导向的场景仍未被充分探索。这一限制阻碍了科学实验场景中关系推理的评估，进而阻碍了此类场景中智能监控、分析及相关应用的发展。为填补这一空白，我们引入了PhysScene，这是首个针对物理实验的场景图数据集。PhysScene涵盖了实验环境中特有的仪器、结构化实验装置和功能关系，使得推理能够超越空间共现，扩展到逻辑依赖。PhysScene不追求大规模数据，而是聚焦于实验场景中的强语义约束和高关系密度，为现有场景解析算法带来新挑战，同时提供进一步改进的机会。广泛的分析和实验表明，PhysScene补充了现有基准，并为推进科学视觉推理建立了有价值的测试平台。该数据集公开于https://github.com/ZMH-SDUST/PhysScene。

英文摘要

Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent progress, existing datasets primarily focus on generic natural contexts, leaving domain-specific and function-oriented scenes largely underexplored. This limitation restricts the evaluation of relational reasoning in scientific experimental scenes, thereby hindering the development of intelligent monitoring, analysis, and related applications in such scenes. To address this gap, we introduce PhysScene, the first SG dataset tailored to physics experiments. PhysScene encompasses specialized instruments, structured experimental setups, and functional relations intrinsic to experimental environments, enabling reasoning that extends beyond spatial co-occurrence to logical dependencies. Rather than pursuing large data scale, PhysScene focuses on strong semantic constraints and high relation density in experimental scenes, posing new challenges for existing scene parsing algorithms while offering opportunities for further improvements. Extensive analyses and experiments show that PhysScene complements existing benchmarks and establishes a valuable testbed for advancing scientific visual reasoning. The dataset is publicly available at https://github.com/ZMH-SDUST/PhysScene.


2606.09613 2026-06-09 cs.CL cs.AI 交叉投稿

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

AGENTSERVESIM：面向多轮LLM智能体服务的硬件感知模拟器

Rakibul Hasan Rajib, Mengxin Zheng, Qian Lou

发表机构 * University of Central Florida（中佛罗里达大学）

AI总结提出AGENTSERVESIM模拟器，通过程序编排器、工具模拟器、会话感知路由器和KV驻留模型等模块，在程序粒度上评估多轮LLM智能体服务策略，在CPU上以6%误差复现真实系统行为。

Comments Preprint


AI中文摘要

多轮LLM智能体将模型调用与外部工具调用交织在一起，将服务从无状态请求处理转变为有状态程序执行。处理这些工作负载需要利用程序级上下文的调度、KV缓存管理和路由策略，包括轮次依赖、工具引入的间隙和可重用的KV状态。直接在真实系统上评估此类策略成本高昂，因为每个设计点可能需要跨到达率、模型规模、服务实例数量和内存层次结构的专用加速器时间。模拟提供了一种可扩展的替代方案，但现有的LLM服务模拟器针对无状态请求级工作负载，因此忽略了智能体服务的核心动态：多轮程序执行、跨轮缓存局部性以及工具间隙期间的KV缓存驻留。我们提出了AGENTSERVESIM，一种面向多轮LLM智能体服务的硬件感知模拟器。AGENTSERVESIM通过可组合模块在程序粒度上评估服务策略：程序编排器保留程序标识和轮次顺序，工具模拟器实现工具引入的间隙，会话感知路由器维护程序到实例的亲和性以实现缓存感知调度，KV驻留模型跟踪策略定义的跨HBM、主机DRAM/CXL和驱逐的KV放置。在真实服务部署和硬件配置上，AGENTSERVESIM在关键性能指标上的误差在6%以内，且完全在普通CPU上运行。这些结果表明，AGENTSERVESIM能够在不需在昂贵加速器上全面部署的情况下，实现受控、可重复的智能体服务策略探索。

英文摘要

Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, including turn dependencies, tool-induced gaps, and reusable KV state. Evaluating such policies directly on real systems is costly, since each design point may require dedicated accelerator time across arrival rates, model scales, serving-instance counts, and memory hierarchies. Simulation offers a scalable alternative, but existing LLM serving simulators target stateless request-level workloads and therefore omit the core dynamics of agent serving: multi-turn program execution, cross-turn cache locality, and KV-cache residency during tool gaps. We present AGENTSERVESIM, a hardware-aware simulator for multi-turn LLM agent serving. AGENTSERVESIM evaluates serving policies at program granularity through composable modules: a Program Orchestrator preserves program identity and turn order, a Tool Simulator materializes tool-induced gaps, a Session-Aware Router maintains program-to-instance affinity for cache-aware dispatch, and a KV Residency Model tracks policy-defined KV placement across HBM, host DRAM/CXL, and eviction. Across real serving deployments and hardware configurations, AGENTSERVESIM reproduces real-system behavior within 6% error across key performance metrics while running entirely on commodity CPUs. These results show that AGENTSERVESIM enables controlled, repeatable exploration of agent-serving policies without requiring exhaustive deployment on costly accelerators.
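The idea of program-to-instance affinity can be sketched as a router that pins every turn of a program to the instance that served its first turn, so later turns can reuse cached KV state. The class name, the least-loaded placement rule, and the load counter are all assumptions made for illustration; the abstract does not describe AGENTSERVESIM's actual routing policy or interfaces.

```python
class SessionAwareRouter:
    """Toy router: the first turn goes to the least-loaded instance, later turns stick to it."""

    def __init__(self, num_instances):
        self.load = [0] * num_instances  # turns dispatched per instance
        self.affinity = {}               # program_id -> instance index

    def route(self, program_id):
        if program_id not in self.affinity:
            # New program: choose the instance with the fewest dispatched turns.
            self.affinity[program_id] = min(
                range(len(self.load)), key=self.load.__getitem__
            )
        instance = self.affinity[program_id]
        self.load[instance] += 1
        return instance

router = SessionAwareRouter(num_instances=2)
decisions = [router.route(p) for p in ["p1", "p2", "p1", "p3"]]
```

A stateless router would be free to send the third request (the second turn of `p1`) anywhere; the affinity map is what keeps it on the instance holding its earlier KV state.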


2606.09646 2026-06-09 cs.CV cs.AI cs.LG 交叉投稿

FASE: 用于代码质量的快速自适应语义熵

Shizhe Lin, Ladan Tahvildari

发表机构 * University of Waterloo（滑铁卢大学）

AI总结提出快速自适应语义熵（FASE），通过最小生成树近似功能正确性，在HumanEval和BigCodeBench上相比现有语义熵方法在Spearman相关性和ROCAUC上分别提升25%和19%，且计算开销仅为传统方法的0.3%。


AI中文摘要

多智能体代码生成通过模拟人类软件工程生命周期，为自主软件开发提供了一种有前景的范式。然而，系统可靠性仍然受到LLM幻觉和跨交互智能体错误传播的阻碍。虽然语义熵提供了一种无需真实答案即可量化不确定性的原则性方法，但当前方法通常依赖于成本高昂的LLM驱动的等价性检查。在这项工作中，我们引入了快速自适应语义熵（FASE），这是一种基于结构和语义不相似图的最小生成树来近似功能正确性的新型度量。在HumanEval和BigCodeBench上的评估表明，FASE优于通过LLM蕴含的最先进语义熵，在使用Qwen3-Embedding-8B模型时，与基于真实测试用例的Pass@1相比，Spearman相关性平均提升25%，ROCAUC分数提升19%。此外，通过消除成本高昂的LLM驱动的等价性评估，FASE的计算开销可忽略不计，其运行成本仅为传统语义熵方法的约0.3%。这些结果使FASE成为优化现实世界多智能体工作流中不确定性量化的实用且经济高效的解决方案。

英文摘要

Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.
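To make the minimum-spanning-tree ingredient concrete, the sketch below computes the total MST weight of a dissimilarity matrix over sampled programs with Prim's algorithm. It shows only that generic building block under assumed inputs: the dissimilarity values are hypothetical, and the abstract does not specify how FASE combines structural and semantic graphs or converts the tree into an entropy score.

```python
def mst_weight(dist):
    """Total weight of a minimum spanning tree of a dense symmetric matrix (Prim)."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each node to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Attach the cheapest node that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
    return total

# Hypothetical pairwise dissimilarities between three sampled programs:
# samples 0 and 1 nearly agree, sample 2 is an outlier.
dissimilarity = [[0.0, 0.1, 0.9],
                 [0.1, 0.0, 0.8],
                 [0.9, 0.8, 0.0]]
spread = mst_weight(dissimilarity)
agreeing = mst_weight([[0.0, 0.0], [0.0, 0.0]])
```

Intuitively, samples that agree with one another yield a light tree, while disagreement makes it heavier, and no LLM equivalence check is needed to compute it.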


2606.09826 2026-06-09 cs.CV cs.AI 交叉投稿

MatSciBench: 基准测试大型语言模型在材料科学中的推理能力

Junkai Zhang, Jingru Gan, Xiaoxuan Wang, Zian Jia, Changquan Gu, Jianpeng Chen, Yanqiao Zhu, Mingyu Derek Ma, Dawei Zhou, Ling Li, Wei Wang

发表机构 * University of California, Los Angeles Computer Science Department（加州大学洛杉矶分校计算机科学系）； University of Pennsylvania Department of Materials Science and Engineering（宾夕法尼亚大学材料科学与工程系）； Virginia Tech Department of Computer Science（弗吉尼亚理工大学计算机科学系）

AI总结提出MatSciBench基准，包含1340道大学级材料科学问题，覆盖6个主领域和31个子领域，评估LLM推理能力，发现当前模型在领域知识、计算和图表理解方面存在局限。


AI中文摘要

大型语言模型已展现出强大的科学推理能力，但它们在材料科学问题上的表现仍研究不足。为填补这一空白，我们引入了MatSciBench，一个全面的大学级基准，包含1340道问题，涵盖材料科学的基本子学科。MatSciBench具有结构化和细粒度的分类体系，将材料科学问题分为6个主领域和31个子领域，并根据解决每个问题所需的推理长度进行三级难度分类。MatSciBench包含946道问题的详细参考答案，支持过程级错误分析，并包含315道带图像的问题以评估多模态推理。我们在MatSciBench上评估了领先的思考型和非思考型LLM，并进一步测试了非思考型模型的三种推理方法：基础思维链提示、工具增强和自我修正。结果表明，当前模型在大学级材料科学推理中仍面临明显限制。DeepSeek-R1在纯文本问题上达到最高准确率75.22%，GPT-5在带图像问题上表现最佳，准确率为53.02%。我们的分析表明，工具增强以token高效的方式改进了许多非思考型模型，而自我修正通常无法提供可靠的改进，甚至可能将正确答案修改为错误答案。我们进一步分析了不同难度级别、推理效率、多模态推理和失败模式的表现，发现当前模型主要受限于领域知识差距、计算错误、问题理解失败以及从科学图表中提取精确信息的困难。总体而言，MatSciBench为衡量当前LLM的局限性并指导未来材料科学科学推理工作提供了一个清晰的测试平台。

英文摘要

Large Language Models have shown strong scientific reasoning ability, but their performance on materials science problems remains less studied. To fill this gap, we introduce MatSciBench, a comprehensive college-level benchmark comprising 1340 problems that span the essential subdisciplines of materials science. MatSciBench features a structured and fine-grained taxonomy that categorizes materials science questions into 6 primary fields and 31 subfields, together with a three-tier difficulty classification based on the reasoning length needed to solve each problem. MatSciBench includes detailed reference solutions for 946 questions, supports process-level error analysis, and contains 315 questions with images for evaluating multimodal reasoning. We evaluate leading thinking and non-thinking LLMs on MatSciBench, and further test three reasoning methods for non-thinking models: basic chain-of-thought prompting, tool augmentation, and self-correction. The results show that current models still face clear limits in college-level materials science reasoning. DeepSeek-R1 achieves the highest score on text-only questions at 75.22% accuracy, and GPT-5 performs the best on questions with images at 53.02%. Our analysis shows that tool augmentation improves many non-thinking models in a token-efficient way, while self-correction often fails to provide reliable gains and can revise correct answers into incorrect ones. We further analyze performance across difficulty levels, reasoning efficiency, multimodal reasoning, and failure patterns, and find that current models are mainly limited by domain knowledge gaps, calculation errors, problem comprehension failures, and difficulty in extracting precise information from scientific figures. Overall, MatSciBench provides a clear testbed for measuring current LLM limitations and guiding future work on scientific reasoning in materials science.


2510.27544 2026-06-09 cs.AI cs.FL 版本更新

TempoBench: Evaluating Temporal Causal Reasoning in Large Language Models

TempoBench：评估大语言模型中的时间因果推理

Nikolaus Holzer, William Fishell, Baishakhi Ray, Mark Santolucito

发表机构 * Columbia University（哥伦比亚大学）； Columbia University, Barnard College（哥伦比亚大学巴纳德学院）

AI总结提出TempoBench基准，通过合成Mealy机生成可验证的因果标签，评估LLM在时间因果推理中的表现，发现模型在最小因果归因任务上准确率低于25%，主要错误是过度指定。


AI中文摘要

时间推理涉及理解系统如何通过输入驱动的状态转换随时间演化。一个关键方面是时间因果推理，即因果推理出哪些先前的输入对于导致观察到的结果是必要的。虽然大型语言模型（LLMs）在前向模拟（从输入预测输出）方面表现良好，但它们难以识别结果的最小因果输入。为了研究这种区别，我们定义了两个任务：轨迹模拟（SIM），要求模型模拟系统执行，以及最小因果归因（MIN），识别给定结果所需的最小输入集。我们引入了TempoBench，第一个经过形式验证的时间因果推理基准，它由合成的Mealy机构建，具有可控的复杂性和可证明正确的因果标签。在前沿模型中，我们观察到尽管在SIM任务上达到了高达96%的准确率，但在因果归因MIN任务上的性能降至25%以下；模型无法推理因果必要性。超过94%的因果错误涉及过度指定，即模型执行检索并列出所有可能的输入，而不是推理最小因果子集。在TempoBench训练语料库上进行微调可以改善因果推理，并且比数学、代码或指令训练具有更好的泛化能力，在标准推理基准上也有提升。

英文摘要

Temporal reasoning involves understanding how systems evolve over time through input-driven state transitions. A key aspect is temporal causal reasoning: reasoning about which prior inputs were necessary to cause an observed outcome. While large language models (LLMs) perform well at forward simulation, predicting outputs from inputs, they struggle to identify the minimal causal inputs of outcomes. To study this distinction, we define two tasks: trace simulation (SIM), which requires models to simulate system execution, and minimal causal attribution (MIN), which identifies the minimal set of inputs necessary for a given outcome. We introduce TempoBench, the first formally verified benchmark for temporal causal reasoning, built from synthesized Mealy machines with controllable complexity and provably correct causal labels. Across frontier models, we observe that despite achieving up to 96% accuracy on the SIM task, performance on the causal attribution MIN task drops below 25%; models fail to reason about causal necessity. Over 94% of causal errors involve overspecification, where models perform retrieval and list all possible inputs rather than reasoning about the minimal causal subset. Fine-tuning on the TempoBench training corpus improves causal reasoning and generalizes better than math, code, or instruction training, with gains across standard reasoning benchmarks.
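The SIM task amounts to running a Mealy machine, whose output at each step depends on both the current state and the current input. The sketch below simulates a small hypothetical machine; the states, input symbols, and outputs are invented for illustration and are not taken from TempoBench.

```python
def simulate(transitions, start, inputs):
    """Run a Mealy machine and return the output emitted at each step."""
    state = start
    outputs = []
    for symbol in inputs:
        # Each (state, input) pair determines the next state and the output.
        state, out = transitions[(state, symbol)]
        outputs.append(out)
    return outputs

# Hypothetical arbiter: it grants a request only when it is idle.
transitions = {
    ("idle", "req"): ("busy", "grant"),
    ("idle", "none"): ("idle", "wait"),
    ("busy", "req"): ("busy", "wait"),
    ("busy", "none"): ("idle", "wait"),
}

trace = simulate(transitions, "idle", ["req", "req", "none", "req"])
```

In this toy trace, the final `grant` depends on the earlier `none` that returned the machine to `idle`, which is the kind of input a minimal causal attribution would need to single out, while the second `req` had no effect on it.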


2602.03224 2026-06-09 cs.AI cs.LG 版本更新

TAME: A Trustworthy Test-Time Evolution of Agent Memory with Systematic Benchmarking

TAME: 一种可信的智能体记忆测试时演化与系统化基准测试

Yu Cheng, Yongkang Hu, Jiuan Zhou, Yushuo Zhang, Yihang Chen, Huichi Zhou, Mingang Chen, Zhizhong Zhang, Kun Shao, Yuan Xie, Zhaoxia Yin

发表机构 * East China Normal University（华东师范大学）； Shanghai Innovation Institute（上海创新研究院）； Shanghai Key Laboratory of Computer Software Evaluating and Testing（上海计算机软件评测与测试重点实验室）； Huawei Noah’s Ark Lab（华为诺亚方舟实验室）

AI总结提出TAME框架，通过执行器-评估器循环实现记忆的可信演化，解决良性任务演化中智能体可信度下降问题，在GPT-5.2 AIME上准确率提升14.6个百分点。


AI中文摘要

智能体记忆的测试时演化代表了推进AGI的关键范式，因为它通过经验积累增强复杂推理，而无需参数更新。然而，即使在良性任务演化过程中，智能体的安全对齐仍然脆弱，这种现象被称为智能体记忆误演化。为了评估这一现象，我们构建了Trust-Memevo基准测试，并发现智能体在良性任务演化过程中，多个任务的可信度整体下降。为了解决这个问题，我们提出了TAME，一个可信感知的记忆演化框架，其中共享记忆库由执行器和评估器共同管理。执行器检索并应用可迁移经验以支持任务求解，而评估器评估每个使用经验对结果的贡献，并产生可信感知的反馈以指导后续记忆使用。这种执行器-评估器循环使得记忆能够随时间被选择性强化、谨慎重用和持续扩展。实验表明，TAME在实现强任务性能的同时缓解了记忆误演化。特别是在GPT-5.2 AIME基准测试上，TAME相比现有最强方法准确率提高了14.6个百分点，并保持了有竞争力的可信度。

英文摘要

Test-time evolution of agent memory represents a pivotal paradigm for advancing AGI, as it strengthens complex reasoning through experience accumulation without requiring parameter updates. However, even during benign task evolution, agent safety alignment remains vulnerable, a phenomenon known as Agent Memory Misevolution. To evaluate this phenomenon, we construct the Trust-Memevo benchmark and find that agents exhibit an overall decline in trustworthiness across multiple tasks during benign task evolution. To address this issue, we propose TAME, a trust-aware memory evolution framework in which a shared memory bank is jointly governed by an Executor and an Evaluator. The Executor retrieves and applies transferable experiences to support task solving, while the Evaluator assesses the contribution of each utilized experience to the outcome and produces trust-aware feedback to guide subsequent memory use. This executor-evaluator loop enables memory to be selectively reinforced, cautiously reused, and continuously expanded over time. Experiments show that TAME mitigates memory misevolution while achieving strong task performance. In particular, on the GPT-5.2 AIME benchmark, TAME improves accuracy by 14.6 percentage points over the strongest existing method and maintains competitive trustworthiness.


2605.23965 2026-06-09 cs.AI cs.LG cs.SE 版本更新

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

LGMT：基于逻辑的蜕变测试用于评估LLMs的推理可靠性

Zenghui Zhou, Man Li, Xiaoke Fang, Xinyi Zhou, Weibin Lin, Zheng Zheng

发表机构 * School of Automation Science and Electrical Engineering, Beihang University（自动化科学与电气工程学院，北京航空航天大学）

AI总结提出LGMT框架，利用一阶逻辑推导蜕变关系，通过一致性检查评估LLM推理的鲁棒性，揭示传统评估忽略的隐藏缺陷。

Comments Zheng Zheng is the corresponding author


AI中文摘要

大型语言模型（LLMs）在逻辑推理基准测试中表现出色，但其可靠性仍不确定。现有评估依赖静态基准，无法评估在逻辑等价变换下的鲁棒性，且往往高估推理能力。我们提出LGMT（基于逻辑的蜕变测试），一种无神谕框架，利用一阶逻辑（FOL）评估LLM推理。通过从形式逻辑等价推导蜕变关系，LGMT构建语义不变的测试用例，并通过跨案例一致性检查检测推理缺陷。在六个最先进的LLM上的实验表明，LGMT暴露了传统基于参考的评估遗漏的大量隐藏缺陷。我们进一步发现，模型对符号级别和结论级别的变化特别敏感，而高级提示如Few-shot CoT仅能部分缓解这些问题。这些结果表明，LLM评估应从孤立的正确性转向逻辑不变性下的鲁棒性。LGMT为诊断推理失败提供了一种原则性和可扩展的方法。

英文摘要

Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.
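A metamorphic relation derived from a logical equivalence can be shown with the contrapositive: rewriting "p implies q" as "not q implies not p" must not change the answer. The sketch below verifies the equivalence by truth table and flags cases whose answers differ across equivalent variants. It is propositional rather than first-order, and the answer data and function names are hypothetical, so it illustrates the idea rather than LGMT's procedure.

```python
from itertools import product

def equivalent(f, g, num_vars):
    """Check that two Boolean formulas agree on every truth assignment."""
    return all(
        f(*vals) == g(*vals)
        for vals in product([False, True], repeat=num_vars)
    )

def implication(p, q):
    return (not p) or q

def contrapositive(p, q):
    # "not q implies not p", a meaning-preserving rewrite of the implication.
    return q or (not p)

def converse(p, q):
    # "q implies p", which is NOT equivalent and must not be used as a relation.
    return (not q) or p

def inconsistent_cases(answers):
    """Return ids of cases whose logically equivalent variants got different answers."""
    return [cid for cid, variants in answers.items() if len(set(variants)) > 1]

# Hypothetical model answers: original prompt first, then its rewritten variant.
answers = {"case-1": ["entailed", "entailed"], "case-2": ["entailed", "not entailed"]}
flagged = inconsistent_cases(answers)
```

No reference label is needed to flag `case-2`: disagreement between equivalent variants is itself the defect signal, which is what makes the approach oracle-free.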


2605.25624 2026-06-09 cs.AI cs.LG 版本更新

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

CUA-Gym：为计算机使用智能体扩展可验证的训练环境和任务

Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, Tao Yu

发表机构 * The University of Hong Kong（香港大学）； Qwen Team, Alibaba Inc.（阿里巴巴集团Qwen团队）； University of California, San Diego（加州大学圣地亚哥分校）； Tsinghua University（清华大学）

AI总结提出CUA-Gym可扩展流水线，通过协同生成任务指令、环境状态和奖励函数，构建大规模可验证强化学习训练数据，并合成CUA-Gym-Hub模拟网络应用环境，训练出的智能体在OSWorld-Verified和WebArena上取得领先性能。


AI中文摘要

具有可验证奖励的强化学习（RLVR）在数学、工具使用和软件工程等领域取得了突破，但其在计算机使用智能体（CUA）上的应用受到缺乏具有确定性奖励的可扩展训练数据的瓶颈。为CUA构建此类数据需要一致的任务指令、可执行的环境和可验证的奖励。然而，手工策划的基准测试实现了高奖励保真度，但覆盖的应用很少；基于LLM作为评判者的数据集广泛扩展，但缺乏可靠的验证。我们提出了CUA-Gym，一个可扩展的流水线，协同生成任务指令、环境状态和奖励函数。具体来说，一个生成器智能体构建初始和黄金环境状态，一个独立的判别器智能体根据任务规范编写奖励函数。一个编排器智能体通过执行中的迭代轮次驱动两者。生成的元组通过一个结合LLM多数投票和智能体rollout（轨迹采样）的最终过滤器，确保超出每任务对抗循环的质量。为了解决训练环境稀缺的问题，我们进一步合成了CUA-Gym-Hub，一套基于真实软件使用分布的高保真模拟网络应用程序套件，将CUA RLVR数据的规模扩大了一个数量级。使用此流水线，我们构建了CUA-Gym数据集，包含32,112个基于110个环境的已验证RLVR训练元组。在CUA-Gym上使用GSPO训练的CUA-Gym-A3B和CUA-Gym-A17B在OSWorld-Verified上分别达到62.1%和72.6%，在可比规模上优于先前的开源CUA，并且在数据量和环境多样性上性能平滑扩展。相同的检查点还在保留的WebArena基准测试上有所改进，表明训练环境之外的迁移。我们将开源完整的合成流水线、数据集、CUA-Gym-Hub环境和模型。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires consistent task instructions, executable environments, and verifiable rewards. However, hand-curated benchmarks achieve high reward fidelity but cover few applications, whereas LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An Orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by an order of magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.


2606.01869 2026-06-09 cs.AI 版本更新

人类的ALMANAC：用于智能体协作的动作级心智模型标注的人类协作数据集

Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu, Toby Jia-Jun Li, Dakuo Wang, Bingsheng Yao

发表机构 * Northeastern University（东北大学）； University of Notre Dame（圣母大学）； University of Waterloo（滑铁卢大学）； Carnegie Mellon University（卡内基梅隆大学）； Adobe（Adobe公司）； Microsoft Research Asia（微软亚洲研究院）

AI总结为解决当前LLM智能体缺乏协作中心智模型能力的问题，构建了基于Map Task的ALMANAC数据集，包含2987个协作动作及其心智模型标注，并评估了六种LLM在预测人类行为和心智模型上的表现。


AI中文摘要

近年来，LLM智能体的进展使其具备了复杂的认知能力，如多步推理、规划和工具使用，这些能力使它们逐渐成为人类的协作者。然而，有效的协作要求协作者在协作过程中持续维护和调整自身推理、伙伴意图和共享目标的心智模型。当前的智能体很少发展这种能力，因为它们主要针对任务完成进行优化，而社区缺乏带有动作级心智模型标注的真实人类协作数据，这些数据可以指导智能体获得过程级的协作能力。为填补这一空白，我们提出了ALMANAC，一个基于社会科学中经典的二元路由任务Map Task构建的动作级心智模型标注数据集。ALMANAC包含2,987个协作动作，每个动作都配有基于理论的心智模型标注，记录了参与者的自我推理、感知的伙伴意图和感知的团队目标。我们评估了六种LLM在预测人类下一轮行为和心智模型方面的表现。我们的结果证明了ALMANAC在评估模型模拟人类协作行为及推断其潜在心智模型方面的实用性。

英文摘要

Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.


2310.10196 2026-06-09 cs.LG cs.AI 版本更新

Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook

时间序列与时空数据的大模型：综述与展望

Ming Jin, Yaxuan Kong, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, Vincent S. Tseng, Yu Zheng, Lei Chen, Hui Xiong, Shirui Pan, Qingsong Wen

发表机构 * Griffith University（格里菲斯大学）； University of Oxford（牛津大学）； Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Zhejiang Normal University（浙江师范大学）； Ant Group（蚂蚁集团）； Alibaba Group（阿里巴巴集团）； Deloitte Service LLP（德勤服务有限责任公司）； The University of Hong Kong（香港大学）； NEC Laboratories America（NEC美国实验室）； A*STAR ； National Yang Ming Chiao Tung University（阳明交通大学）； JD Technology（京东科技）； Squirrel Ai Learning

AI总结综述了面向时间序列和时空数据的大模型，按数据类型、模型类别、范围和应用领域分类，总结了通用与领域专用模型，并整理了相关资源与开放问题。

Comments Accepted by ACM Computing Surveys; 35 Pages; Github Repo: https://github.com/qingsongedu/Awesome-TimeSeries-SpatioTemporal-LM-LLM


AI中文摘要

时间数据，包括时间序列和时空数据，在现实应用中无处不在。物理和虚拟传感器生成的海量数据记录了动态系统行为，支持各种下游任务。有效分析这些数据对于挖掘其丰富信息至关重要。大型语言模型和其他基础模型的最新进展加速了它们在时间序列和时空数据挖掘中的应用。这些方法不仅提高了跨领域的模式识别和推理能力，还支持了能够理解和处理时间数据的人工通用智能的发展。在本综述中，我们沿着四个维度（数据类型、模型类别、模型范围和应用领域/任务）对针对时间序列和时空数据定制或适配的大模型进行了全面、最新的回顾。我们将现有工作分为两大组：用于时间序列分析的大模型（LM4TS）和用于时空数据挖掘的大模型（LM4STD），并进一步区分通用模型和领域专用模型。我们还整理了相关资源，包括数据集、模型实现和工具，按主要应用领域组织。总体而言，本综述整合了近期进展，并突出了以大型模型为中心的时间数据分析的基础、应用、资源和开放研究机会。

英文摘要

Temporal data, including time series and spatio-temporal data, are pervasive in real-world applications. Generated in massive volumes by physical and virtual sensors, they record dynamic system behaviors and enable a wide range of downstream tasks. Effectively analyzing such data is crucial to unlocking their rich information content. Recent advances in large language models and other foundation models have accelerated their use in time series and spatio-temporal data mining. These approaches not only improve pattern recognition and reasoning across diverse domains but also support progress toward artificial general intelligence that can understand and process temporal data. In this survey, we present a comprehensive, up-to-date review of large models tailored or adapted for time series and spatio-temporal data along four dimensions: data types, model categories, model scopes, and application areas/tasks. We organize existing work into two main groups: large models for time series analysis (LM4TS) and for spatio-temporal data mining (LM4STD), and further distinguish general-purpose from domain-specific models. We also curate related resources, including datasets, model implementations, and tools, organized by major application areas. Overall, this survey consolidates recent advances and highlights foundations, applications, resources, and open research opportunities in large model-centric temporal data analysis.


2502.16584 2026-06-09 cs.SD cs.AI cs.CL cs.MM eess.AS 版本更新

大规模纳米晶体数据库：对齐合成与性质实现生成式逆向设计

Kai Gu, Yingping Liang, Senliang Peng, Aotian Guo, Haizheng Zhong, Ying Fu

发表机构 * MIIT Key Laboratory for Low-Dimensional Quantum Structure and Devices, School of Materials Sciences & Engineering, Beijing Institute of Technology（工业和信息化部低维量子结构与器件重点实验室，材料科学与工程学院，北京理工大学）； School of Computer Science and Technology, Beijing Institute of Technology（计算机科学与技术学院，北京理工大学）

AI总结构建大规模对齐的纳米晶体合成-性质数据库，开发基于大语言模型的NanoExtractor提取文献数据，并利用NanoDesigner实现生成式逆向设计，成功设计PbSe和MgF2纳米晶体的合成路线。


AI中文摘要

由于合成参数与物理化学性质之间的复杂相关性，纳米晶体的合成高度依赖于试错法。尽管深度学习为生成式逆向设计提供了潜在方法，但缺乏对齐纳米晶体合成路线与其性质的高质量数据集仍阻碍其发展。本文介绍了一个大规模、对齐的纳米晶体合成-性质（NSP）数据库的构建，并展示了其用于生成式逆向设计的能力。为了从文献中提取结构化的合成路线及其对应的产物性质，我们开发了NanoExtractor，这是一个通过精心设计的增强策略增强的大语言模型（LLM）。NanoExtractor经过人类专家验证，在测试集上达到88%的加权平均分，显著优于化学专用（3%）和通用LLM（38%）。生成的NSP数据库包含近16万条对齐条目，并作为我们的NanoDesigner（一个用于逆向合成设计的LLM）的训练数据。NanoDesigner的生成能力通过成功设计成熟的PbSe纳米晶体和罕见的MgF2纳米晶体的可行合成路线得到验证。值得注意的是，模型为MgF2纳米晶体推荐了反直觉的非化学计量前驱体比例（1:1），实验证实该比例对抑制副产物至关重要。我们的工作弥合了非结构化文献与数据驱动合成之间的差距，并建立了一个强大的人机协作范式，以加速纳米晶体的发现。

英文摘要

The synthesis of nanocrystals has been highly dependent on trial-and-error, due to the complex correlation between synthesis parameters and physicochemical properties. Although deep learning offers a potential methodology to achieve generative inverse design, it is still hindered by the scarcity of high-quality datasets that align nanocrystal synthesis routes with their properties. Here, we present the construction of a large-scale, aligned Nanocrystal Synthesis-Property (NSP) database and demonstrate its capability for generative inverse design. To extract structured synthesis routes and their corresponding product properties from literature, we develop NanoExtractor, a large language model (LLM) enhanced by well-designed augmentation strategies. NanoExtractor is validated against human experts, achieving a weighted average score of 88% on the test set, significantly outperforming chemistry-specialized (3%) and general-purpose LLMs (38%). The resulting NSP database contains nearly 160,000 aligned entries and serves as training data for our NanoDesigner, an LLM for inverse synthesis design. The generative capability of NanoDesigner is validated through the successful design of viable synthesis routes for both well-established PbSe nanocrystals and rarely reported MgF2 nanocrystals. Notably, the model recommends a counter-intuitive, non-stoichiometric precursor ratio (1:1) for MgF2 nanocrystals, which is experimentally confirmed as critical for suppressing byproducts. Our work bridges the gap between unstructured literature and data-driven synthesis, and also establishes a powerful human-AI collaborative paradigm for accelerating nanocrystal discovery.


2601.06649 2026-06-09 cs.LG cs.AI 版本更新

CHIMERA-Bench：一种针对表位特异性抗体设计的基准数据集

Mansoor Ahmed, Nadeem Taj, Imdad Ullah Khan, Hemanth Venkateswara, Murray Patterson

发表机构 * Georgia State University（佐治亚州立大学）； Georgia Institute of Technology（佐治亚理工学院）； University of Engineering and Technology（工程与技术大学）； Lahore University of Management Sciences（拉合尔管理科学大学）

AI总结本文提出CHIMERA-Bench，一个统一的抗体设计基准，包含2922个抗原-抗体复合物数据，测试泛化能力，并评估多种生成方法的通用性。


AI中文摘要

计算抗体设计在过去三年中取得了快速的方法进展，提出了数十种深度生成方法，但该领域缺乏标准化的基准用于公平比较和模型开发。这些方法在不同的SAbDab快照、非重叠测试集和不兼容的指标上进行评估，文献将设计问题分解为多个子任务，没有共同定义。我们引入CHIMERA-Bench（CDR建模与表位引导的重设计），这是一个围绕单一经典任务构建的统一基准：表位条件下的CDR序列-结构共设计。CHIMERA-Bench提供三个组成部分。第一个是一个经过精心挑选、去重的包含2922个抗体-抗原复合物的数据集，带有表位和抗原结合位点注释。第二个是一组三个生物动机的分割，测试泛化到未见表位、未见抗原折叠和前瞻性时间目标的能力。第三个是全面的评估协议，包括五个指标组，其中包含新的表位特异性度量。我们基准测试了十一种方法，涵盖六个生成范式，并在所有分割上报告结果。CHIMERA-Bench是抗体设计问题上同类中最大的数据集，允许社区开发和测试新方法，并评估其泛化能力。

英文摘要

Computational antibody design has seen rapid methodological progress, with dozens of deep generative methods proposed in the past three years, yet the field lacks a standardized benchmark for fair comparison and model development. These methods are evaluated on different SAbDab snapshots, non-overlapping test sets, and incompatible metrics, and the literature fragments the design problem into numerous sub-tasks with no common definition. We introduce CHIMERA-Bench (CDR Modeling with Epitope-guided Redesign), a unified benchmark built around a single canonical task: epitope-conditioned CDR sequence-structure co-design. CHIMERA-Bench provides three components. The first is a curated, deduplicated dataset of 2,922 antibody-antigen complexes with epitope and paratope annotations. The second is a set of three biologically motivated splits that test generalization to unseen epitopes, unseen antigen folds, and prospective temporal targets. The third is a comprehensive evaluation protocol with five metric groups, including novel epitope-specificity measures. We benchmark eleven methods spanning six generative paradigms and report results across all splits. CHIMERA-Bench is the largest dataset of its kind for the antibody design problem, allowing the community to develop and test novel methods and evaluate their generalizability.

2603.14342 2026-06-09 cs.CV cs.AI 版本更新

面向视觉语言模型的多语言训练和评估资源

Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini, Andrea Zugarini

发表机构 * Villanova.ai ； Aithlas

AI总结本文提出跨五种欧洲语言的视觉语言模型训练与评估资源，通过再生与翻译方法生成高质量多语言数据，验证多语言数据在非英语基准上的有效性。

AI中文摘要

视觉语言模型（VLMs）近年来取得了快速进展。然而，尽管发展迅速，VLM的开发仍严重依赖英语，导致两个主要限制：（i）缺乏用于训练的多语言多模态数据集，（ii）缺乏跨语言的全面评估基准。本文通过引入覆盖五种欧洲语言（英语、法语、德语、意大利语和西班牙语）的新型综合资源来填补这些空白。我们采用再生-翻译范式，通过结合精心筛选的合成生成和人工标注来生成高质量的跨语言资源。具体而言，我们构建了Multi-PixMo训练语料库，使用宽松许可的模型再生PixMo现有数据集（PixMo-Cap、PixMo-AskModelAnything和CoSyn-400k）中的示例。在评估方面，我们通过翻译广泛使用的英语数据集（MMBench、ScienceQA、MME、POPE、AI2D）构建了一组多语言基准。我们通过定性和定量的人工分析评估这些资源的质量，并测量标注者间一致性。此外，我们进行了消融研究，以展示多语言数据相对于仅英语数据对VLM训练的影响。涵盖三种不同模型的实验表明，使用多语言、多模态示例训练VLM在非英语基准上始终有益，同时对英语也有正向迁移。

英文摘要

Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets (PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k) with permissively licensed models. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMBench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, relative to English-only data, on VLM training. Experiments comprising three different models show that using multilingual, multimodal examples for training VLMs is consistently beneficial on non-English benchmarks, with positive transfer to English as well.

2604.24278 2026-06-09 cs.SD cs.AI 版本更新

RAS: a Reliability Oriented Metric for Automatic Speech Recognition

RAS：一种面向可靠性的自动语音识别度量标准

Wenbin Huang, Yuhang Qiu, Bohan Li, Yiwei Guo, Jing Peng, Hankun Wang, Xie Chen, Kai Yu

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, China（上海交通大学计算机科学学院X-LANCE实验室，中国）； MoE Key Lab of Artificial Intelligence（人工智能教育部重点实验室）； Jiangsu Key Lab of Language Computing, China（江苏省语言计算重点实验室，中国）

AI总结本研究提出了一种面向可靠性的度量标准RAS，用于评估自动语音识别系统在不确定段落中的转录可靠性，通过引入一种具有退避意识的转录框架，结合人类偏好校准的参数，提升了转录的可靠性同时保持了准确性。

Comments 5 pages, 4 figures; Accepted at InterSpeech 2026

2604.24594 2026-06-09 cs.CL cs.AI 版本更新

Skill Retrieval Augmentation for Agentic AI

面向智能体AI的技能检索增强

Weihang Su, Jianming Long, Qingyao Ai, Qiaozhi He, Yichen Tang, Changyue Wang, Yiteng Tu, Yingbo Wang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）； ByteDance Inc.（字节跳动公司）

AI总结针对现有智能体系统在技能库扩展时上下文窗口不足、技能识别准确率下降的问题，提出技能检索增强（SRA）范式，通过动态检索外部技能库提升智能体性能，并构建SRA-Bench基准揭示技能整合中的瓶颈。

AI中文摘要

随着大型语言模型（LLMs）演变为能够自主解决问题的智能体，它们越来越依赖外部的、可复用的技能来处理超出其原生参数能力的任务。在现有的智能体系统中，整合技能的主要策略是在上下文窗口内显式枚举可用技能。然而，这种策略无法扩展：随着技能库的扩大，上下文预算迅速消耗，智能体在识别正确技能方面的准确性显著下降。为此，本文提出了技能检索增强（SRA），一种新的范式，其中智能体按需从大型外部技能库中动态检索、整合和应用相关技能。为了使该问题可衡量，我们构建了一个大规模技能库，并引入了SRA-Bench，这是首个对完整SRA流程进行分解评估的基准，涵盖技能检索、技能整合和最终任务执行。SRA-Bench包含5,400个能力密集型测试实例和636个手动构建的金标准技能，这些技能与网络收集的干扰技能混合，形成了一个包含26,262个技能的大规模语料库。大量实验表明，基于检索的技能增强可以显著提高智能体性能，验证了该范式的潜力。同时，我们揭示了技能整合中的一个基本差距：当前的LLM智能体倾向于以相似的速率加载技能，无论是否检索到金标准技能，或者任务是否实际需要外部能力。这表明技能增强的瓶颈不仅在于检索，还在于基础模型判断何时加载何种技能以及何时真正需要外部加载的能力。这些发现将SRA定位为一个独特的研究问题，并为未来智能体系统中能力的可扩展增强奠定了基础。

英文摘要

As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.

2605.00273 2026-06-09 cs.CV cs.AI 版本更新

When Do Diffusion Models learn to Generate Multiple Objects?

扩散模型何时学会生成多个物体？

Yujin Jeong, Arnas Uselis, Iro Laina, Seong Joon Oh, Anna Rohrbach

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Cambridge（剑桥大学）； University of Washington（华盛顿大学）； University of Toronto（多伦多大学）

AI总结研究探讨了扩散模型在多物体生成中的局限性，发现场景复杂度比概念不平衡更关键，且低数据条件下计数任务更难学习。

Comments ICML2026

自回归语言模型中的多项式上下文截断敏感性：KV缓存压缩的序列Wyner-Ziv界

Munsik Kim

发表机构 * Independent Researcher（独立研究者）

AI总结研究自回归语言模型中在线KV缓存压缩的率失真极限，将其建模为序列Wyner-Ziv信源编码，发现下一词分布对上下文截断的敏感性呈多项式衰减，并推导了仅后缀缓存策略的每词内存需求。

AI中文摘要

我们研究了自回归语言模型中在线KV缓存压缩的率失真极限，将其建模为模型诱导滤流上的序列Wyner-Ziv信源编码，其中下一步查询作为解码器边信息。实验上，在涵盖两个系列、参数规模0.5-3B的四个模型中，我们发现下一词分布对上下文截断的敏感性呈多项式衰减而非几何衰减：幂律在外推中比指数拟合好一个数量级，拟合指数可由汇点加最近（sink-plus-recent）KL测量独立恢复，并且位置保持消融验证了该衰减不受位置编码伪影影响。在相应的多项式截断敏感性假设下，我们的主要结果刻画了仅后缀缓存策略的每词元内存需求：滑动窗口方案以窗口大小$w = O(\varepsilon^{-1/\alpha})$达到失真$\varepsilon$，且在附加的双边贝叶斯风险条件下，逆定理表明在该策略类内$w = \Omega(\varepsilon^{-1/\alpha})$是必要的，因此仅后缀策略的缩放为$\Theta(\varepsilon^{-1/\alpha})$。循环式或传播式缓存摘要能否超越此缩放仍是开放问题。一个显式的块马尔可夫方案达到上界；在附加的前向衰减和正则性假设（仅由截断敏感性无法推出）下，其收敛速率指数与逆定理匹配，否则相差两倍。实验上，该多项式规律预测了具体缓存策略的退化曲线：在同等预算下，基于最近性的驱逐（滑动窗口、汇点加最近）相比随机保留将失真抑制约两个数量级，且失真随预算呈幂律衰减。

英文摘要

We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Ziv source coding on the filtration induced by the model, with the next-step query as decoder side information. Empirically, across four models spanning two families and $0.5$-$3$B parameters, we find that the next-token distribution's sensitivity to context truncation decays \emph{polynomially} rather than \emph{geometrically}: a power law improves on an exponential fit by an order of magnitude in extrapolation, the fitted exponent is recovered independently from a sink-plus-recent KL measurement, and the decay is verified to be free of positional-encoding artifacts by a position-preserving ablation. Under a corresponding \emph{polynomial truncation-sensitivity} assumption, our main result characterizes the per-token memory requirement of \emph{suffix-only} cache policies: a sliding-window scheme attains distortion $\varepsilon$ with window $w = O(\varepsilon^{-1/\alpha})$, and, under an additional two-sided Bayes-risk condition, a converse shows $w = \Omega(\varepsilon^{-1/\alpha})$ is necessary within this policy class, so the scaling is $\Theta(\varepsilon^{-1/\alpha})$ for suffix-only policies. Whether recurrent or propagating cache summaries can beat this scaling is left open. An explicit block-Markov scheme achieves the upper bound; its rate-of-convergence exponent matches the converse under additional forward-decay and regularity hypotheses (not implied by truncation sensitivity alone), and differs by a factor of two otherwise. Empirically, the polynomial law predicts the degradation curves of concrete cache policies: recency-based eviction (sliding, sink-plus-recent) suppresses distortion by roughly two orders of magnitude over random retention at equal budget, with a power-law decay in the budget.
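
The window scaling in this abstract can be made concrete with a minimal sketch. It assumes the distortion of a sliding window of size w is bounded by c * w^(-alpha); the constant c, this exact functional form, and the function name `window_for_distortion` are illustrative assumptions, not taken from the paper. Solving c * w^(-alpha) <= eps for w gives w >= (c / eps)^(1 / alpha), which is the eps^(-1/alpha) growth stated above.

```python
import math


def window_for_distortion(eps: float, alpha: float, c: float = 1.0) -> int:
    """Smallest integer window w with c * w**(-alpha) <= eps.

    Hypothetical model: distortion under a sliding window of size w is
    bounded by c * w**(-alpha). The constant c is illustrative only.
    """
    if eps <= 0 or alpha <= 0 or c <= 0:
        raise ValueError("eps, alpha and c must be positive")
    # Closed-form solution of c * w**(-alpha) = eps, rounded up.
    w = max(1, math.ceil((c / eps) ** (1.0 / alpha)))
    # Guard against floating-point rounding in the closed form.
    while c * w ** (-alpha) > eps:
        w += 1
    return w


if __name__ == "__main__":
    # With alpha = 0.5, halving eps roughly quadruples the window.
    for eps in (0.1, 0.05, 0.025):
        print(eps, window_for_distortion(eps, alpha=0.5))
```

Under this assumed bound, a small exponent alpha makes the required window grow quickly as the target distortion shrinks, which is why polynomial rather than geometric decay matters for cache budgets.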

2606.01060 2026-06-09 cs.CL cs.AI cs.LG 版本更新

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

MENTIS: 对齐改变了什么信念？语言模型中多尺度潜在扭转的测量

Partha Pratim Saha, Samarth Raina, Mayur Parvatikar, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Pragya Lab, BITS Pilani Goa, India（印度BITS Pilani果阿校区Pragya实验室）； IIIT Delhi, India（印度德里英德拉普拉斯塔信息技术学院）； Amazon, USA（美国亚马逊）； Meta, USA（美国Meta）； Apple, USA（美国苹果）

AI总结提出MENTIS框架，通过层间协方差扭转范数、谱扭转诊断和能量-辐射-激活度量，测量偏好对齐在语言模型内部计算中引起的选择性、深度局部的几何结构变化。

Comments Submitted to EMNLP 2026

AI中文摘要

偏好对齐显著改善了大语言模型的可观察行为，但尚不清楚对齐在内部改变了什么。对齐系统在越狱、提示注入和检索时损坏下仍然失败，表明仅行为级评估是不完整的。后训练应在内部计算中留下可测量的痕迹。我们问：当指令微调（IT）模型变为偏好对齐（PA）模型时，哪些几何结构发生了变化，这些变化集中在何处，以及它们在不同概念、提示和模型家族中的选择性如何？我们引入MENTIS，一个几何优先的框架，用于测量配对检查点中对齐引起的内部重组。MENTIS使用基于层间协方差的主扭转范数（T1）、辅助谱扭转诊断（T2）和用于深度定位的能量-辐射-激活度量（ERA）来比较IT和PA模型。在LITMUS上的四个7-8B模型对中，我们的研究表明对齐引起的变化是选择性的而非均匀的：规范性概念平均表现出比事实性概念更大的扭转偏移；扭转与上下文熵负相关；峰值效应定位于架构特定的中后层。相同的模式出现在词级、提示级和模型级分析中。这些结果表明偏好对齐在内部计算中留下了结构化的、深度局部的几何特征，超越了仅行为级评估所能揭示的内容。

英文摘要

Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and retrieval-time corruption, suggesting behavior-level evaluation alone is incomplete. Post-training should leave measurable traces in internal computation. We ask: when an instruction-tuned (IT) model becomes a preference-aligned (PA) model, what geometric structure changes, where do those changes concentrate, and how selectively do they vary across concepts, prompts, and model families? We introduce MENTIS, a geometry-first framework for measuring alignment-induced internal reorganization in paired checkpoints. MENTIS compares IT and PA models using a primary layerwise covariance-based torsion norm (T1), a secondary spectral torsion diagnostic (T2), and an Energy-Radiance-Activation measure (ERA) for depth localization. Across four 7-8B model pairs on LITMUS, our study reveals that alignment-induced change is selective rather than uniform: normative concepts exhibit larger torsion shifts than factual concepts on average; torsion is negatively correlated with contextual entropy; and peak effects localize to architecture-specific mid-to-late layers. The same pattern appears across word-level, prompt-level, and model-level analyses. These results suggest preference alignment leaves structured, depth-localized geometric signatures in internal computation beyond what behavior-level evaluation alone can reveal.

2606.03328 2026-06-09 cs.LG cs.AI 版本更新

ThinkBooster: 一种用于LLM推理无缝测试时扩展的统一框架

Vladislav Smirnov, Chieu Nguyen, Sergey Senichev, Minh Ngoc Ta, Ekaterina Fadeeva, Artem Vazhentsev, Daria Galimzianova, Nikolai Rozanov, Viktor Mazanov, Jingwei Ni, Tianyi Wu, Igor Kiselev, Mrinmaya Sachan, Iryna Gurevych, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

发表机构 * MBZUAI ； ETH Zürich（苏黎世联邦理工学院）； Imperial College London（伦敦帝国理工学院）； NUS（新加坡国立大学）； Accenture（埃森哲）； Innopolis University（因诺波利斯大学）； Independent Researcher（独立研究者）

AI总结提出ThinkBooster框架，通过模块化库、联合评估基准和可部署代理服务，实现LLM推理的测试时计算扩展，在数学和编码任务上验证了性能-计算权衡。

AI中文摘要

测试时计算（TTC）扩展已成为一种强大的范式，通过在推理期间分配额外计算（例如，通过多样本生成和基于验证器的重新排序）来改进大型语言模型（LLM）推理。现有的TTC扩展策略和推理评分器仍然碎片化，在不一致的协议下进行评估，并且很少通过质量-成本权衡的视角进行分析。我们引入了ThinkBooster，一个用于LLM推理无缝测试时计算扩展的统一框架，它包括（i）一个模块化的Python库，实现了最先进的TTC扩展策略和评分器家族，（ii）一个联合评估性能和计算效率的基准，以及（iii）一个可部署的、兼容OpenAI的代理服务，使得将自适应推理无缝集成到实际应用中成为可能。我们还提供了一个演示可视化调试器，用于检查推理轨迹、中间选择决策和替代推理路径。在数学和编码任务上的实证结果揭示了TTC扩展策略和评分方法的性能-计算权衡，并表明ThinkBooster在实际任务中提供了实际收益。代码以MIT许可证在线提供。

英文摘要

Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
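
The multi-sample generation and verifier-based reranking mentioned in this abstract can be illustrated with a minimal best-of-N sketch. This is not ThinkBooster's API: `best_of_n`, `generate`, and `score` are hypothetical names, and the generator and scorer are supplied by the caller. The returned call count stands in for the compute side of the quality-cost trade-off.

```python
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    prompt: str,
    n: int,
) -> Tuple[str, float, int]:
    """Sample n candidates and keep the one the scorer rates highest.

    Returns (best_answer, best_score, generation_calls). Both callables
    are placeholders for a real LLM sampler and a real verifier.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    # On ties, max() keeps the earliest candidate.
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score, n


if __name__ == "__main__":
    # Toy stand-ins: a fixed stream of answers and a scorer that
    # prefers answers close to 42.
    stream = iter(["40", "42", "45", "41"])
    answer, s, calls = best_of_n(
        generate=lambda p: next(stream),
        score=lambda p, a: -abs(int(a) - 42),
        prompt="toy question",
        n=4,
    )
    print(answer, s, calls)
```

Increasing `n` raises cost linearly while the gain depends on how well the scorer separates good from bad candidates, which is the kind of trade-off the benchmark component is described as measuring.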

2606.07549 2026-06-09 cs.AI cs.MA 新提交

PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow

PathoSage：通过经验感知的代理工作流实现病理学多源证据裁决

Chengyang Zhang, Wenchuan Zhang, Bo Li, Mengran Li, Bob Zhang, Yuhao Yi, Hong Bu, Jiancheng Lv

发表机构 * College of Computer Science, Sichuan University（四川大学计算机科学学院）； Department of Pathology and Institute of Clinical Pathology, West China Hospital, Sichuan University（四川大学华西医院病理科/临床病理研究所）； Department of Computer and Information Science, University of Macau（澳门大学计算机与信息科学系）； School of Intelligent Systems Engineering, Sun Yat-sen University（中山大学智能工程学院）

AI总结提出PathoSage框架，通过结构化证据审议和Beta-Bernoulli经验系统，独立评估工具证据并解决冲突，减少幻觉和分类器分歧，提升病理学推理鲁棒性。

AI中文摘要

多模态大语言模型（MLLMs）和代理工作流的最新进展在计算病理学中显示出巨大潜力，但可靠的补丁级推理仍然具有挑战性。端到端的病理学MLLM常常幻觉形态特征，而最近的代理系统通常将工具输出和检索知识合并到共享上下文中，使得决策容易受到冲突证据和上下文污染的影响。我们提出PathoSage，一个三阶段框架，明确分离知识检索、证据收集和证据裁决，用于补丁级病理学多模态推理。其核心组件结构化证据审议独立评估来自工具的异质证据，执行冲突分析，并在全新上下文中生成最终判断，以减少锚定偏差。我们进一步引入一个无需训练的Beta-Bernoulli经验系统，具有连续信用分配，以建模长期工具可靠性，并为未来工具使用构建相似性加权先验。实验表明，PathoSage有效缓解了VQA幻觉和分类器分歧，优于强病理学MLLM和代理基线。我们的结果强调了明确的证据裁决和可靠性感知工具建模是构建鲁棒病理学代理的关键要素。

英文摘要

Recent advances in Multimodal Large Language Models (MLLMs) and agent workflows have shown strong promise for computational pathology, yet reliable patch-level reasoning remains challenging. End-to-end pathology MLLMs often hallucinate morphological features, while recent agentic systems usually merge tool outputs and retrieved knowledge into a shared context, making decisions vulnerable to conflicting evidence and context contamination. We propose PathoSage, a three-stage framework that explicitly separates knowledge retrieval, evidence collection, and evidence adjudication for patch-level pathology multimodal reasoning. Its core component, Structured Evidence Deliberation, independently evaluates heterogeneous evidence from tools, performs conflict analysis, and generates the final judgment in a fresh context to reduce anchoring bias. We further introduce a training-free Beta-Bernoulli experience system with continuous credit assignment to model long-term tool reliability and construct similarity-weighted priors for future tool use. Experiments show that PathoSage effectively mitigates VQA hallucinations and classifier disagreement, outperforming strong pathology MLLM and agentic baselines. Our results highlight explicit evidence adjudication and reliability-aware tool modeling as key ingredients for robust pathology agents.
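
A Beta-Bernoulli reliability tracker with fractional credit can be sketched as follows. The abstract does not give PathoSage's update rule, so everything here is an assumption: the class name `ToolReliability`, the uniform Beta(1, 1) prior, and the use of a `weight` argument to stand in for case similarity are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToolReliability:
    """Beta(alpha, beta) belief about how often a tool's evidence is right."""

    alpha: float = 1.0  # pseudo-count of successes (uniform prior assumed)
    beta: float = 1.0   # pseudo-count of failures

    def update(self, credit: float, weight: float = 1.0) -> None:
        """Add continuous credit in [0, 1] instead of a hard 0/1 outcome.

        `weight` scales the observation, e.g. by similarity between a
        past case and the current one (an illustrative choice).
        """
        if not 0.0 <= credit <= 1.0:
            raise ValueError("credit must lie in [0, 1]")
        if weight < 0.0:
            raise ValueError("weight must be non-negative")
        self.alpha += weight * credit
        self.beta += weight * (1.0 - credit)

    @property
    def mean(self) -> float:
        """Posterior mean reliability."""
        return self.alpha / (self.alpha + self.beta)


if __name__ == "__main__":
    tool = ToolReliability()
    tool.update(1.0)   # fully correct evidence
    tool.update(0.5)   # partially helpful evidence
    print(tool.mean)
```

Because the update only adds pseudo-counts, no gradient training is involved, which is consistent with the "training-free" description; how credit is actually assigned in the paper is not specified in the abstract.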

2606.07721 2026-06-09 cs.AI 新提交

Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model

使用开放权重大语言模型从脑MRI报告中自动提取结构化信息

Kaouther Mouheb, Amos Pomp, Antoine Manenti, Romy de Haan, Farog Faghir, Joy Martens, Harro Seelaar, Francesco Mattace-Raso, Meike W. Vernooij, Frank J. Wolters, Stefan Klein, Esther E. Bron

发表机构 * Department of Radiology & Nuclear Medicine, Erasmus MC（伊拉斯姆斯医学中心放射与核医学科）； Department of Epidemiology, Erasmus MC（伊拉斯姆斯医学中心流行病学系）； Department of Electrical and Electronics Engineering, ENSEEIHT（ENSEEIHT电气与电子工程系）； Alzheimer Centre Erasmus MC（伊拉斯姆斯医学中心阿尔茨海默病中心）； Department of Neurology, Erasmus MC（伊拉斯姆斯医学中心神经内科）； Department of Internal Medicine, Erasmus MC（伊拉斯姆斯医学中心内科）

AI总结本研究评估了开放权重LLM LLaMA 3.1从荷兰语脑MRI报告中自动提取结构化信息的能力，通过零样本和少样本提示策略，在视觉评分、病变检测等任务上取得高准确率，少样本提示进一步提升了数值变量的提取性能。

Comments Submitted to European Radiology

AI中文摘要

目的：从自由文本放射学报告中自动提取数据可实现大规模研究，但很少有研究评估大语言模型（LLM）在荷兰语神经放射学报告上的性能。方法：我们分析了来自一家三级记忆诊所（2016-2021年）的947份脑MRI报告，由顾问神经放射科医生撰写。经过培训的医学生标注了三十个变量；其中100份报告进行了双重标注以评估评分者间信度。我们评估了开放权重LLM LLaMA 3.1在不同语言（荷兰语与英语翻译）和不同示例选择策略的少样本提示下的性能。性能评估使用分类变量的平衡准确率、计数变量的准确率和平均绝对误差以及自由文本的文本相似度。指标在947份报告的10次随机分割上计算。结果：LLaMA 3.1在视觉评分上表现出高零样本性能（平均[95%置信区间]）：内侧颞叶萎缩：左侧90% [77-100%]，右侧96% [94-99%]；全脑皮质萎缩：87% [83-91%]；Fazekas评分：94% [93-96%]。微出血检测准确率为93% [92-95%]，梗死检测为82% [80-84%]。病灶位置的文本相似度达到0.95 [0.95-0.96]。数值变量性能较低：微出血数量为80% [78-82%]，梗死数量为66% [63-68%]。英语翻译结果相当。少样本提示提高了数值变量的性能，使用基于结构相似性的选择后，微出血达到92% [90-93%]，梗死达到81% [77-85%]。结论：LLaMA 3.1在从荷兰语神经放射学报告中提取数据方面显示出巨大潜力。少样本提示增强了数值变量的性能，而位置特定变量仍面临挑战。

英文摘要

Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies assessed the performance of large language models (LLMs) on Dutch neuroradiology reports. Methods: We analyzed 947 brain MRI reports from a tertiary memory clinic (2016-2021), authored by consultant neuroradiologists. Trained medical students annotated thirty variables; 100 reports were double-annotated to assess inter-rater reliability. We evaluated the performance of the open-weight LLM LLaMA 3.1 using different languages (Dutch vs. English translation) and few-shot prompting with different example selection strategies. Performance was evaluated using balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text. Metrics were computed across 10 random splits of the 947 reports. Results: LLaMA 3.1 demonstrated high zero-shot performance for visual rating scores (mean [95%-CI]): Medial Temporal Atrophy: 90% [77-100%] on the left and 96% [94-99%] on the right, Global Cortical Atrophy: 87% [83-91%], and Fazekas: 94% [93-96%]. Microbleed mentions were detected with 93% accuracy [92-95%] and infarct mentions with 82% [80-84%]. Text similarity for lesion location reached 0.95 [0.95-0.96]. Performance was lower for numerical variables: 80% [78-82%] for the number of microbleeds and 66% [63-68%] for infarcts. English translation yielded comparable results. Few-shot prompting improved performance for numerical variables, achieving 92% [90-93%] for microbleeds and 81% [77-85%] for infarcts using structural similarity-based selection. Conclusion: LLaMA 3.1 shows strong potential for extracting data from Dutch neuroradiology reports. Few-shot prompting enhances performance for numerical variables, whereas challenges remain for location-specific variables.
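
Two of the metrics named in this abstract, balanced accuracy for categorical variables and mean absolute error for counts, have standard definitions that a short sketch can show. The function names and the toy labels below are illustrative; the study's own evaluation code is not described in the abstract.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between annotated and extracted counts."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Toy example: a model that always predicts the majority class has
    # 75% plain accuracy here but only 50% balanced accuracy.
    truth = ["absent", "absent", "absent", "present"]
    pred = ["absent", "absent", "absent", "absent"]
    print(balanced_accuracy(truth, pred))
    print(mean_absolute_error([2, 0, 5], [1, 0, 3]))
```

Balanced accuracy is a natural choice when findings such as infarcts are mentioned in only a minority of reports, since plain accuracy would reward always predicting the majority class.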

2606.07780 2026-06-09 cs.AI cs.CV cs.LG 新提交

Land cover and flood type govern the detection limits of satellite-based flood mapping across diverse global flood events

土地覆盖与洪水类型控制基于卫星的洪水测绘在不同全球洪水事件中的检测极限

Venkatesh Kolluru, Rajat Shinde, Abdelhak Marouane, Caden Helbling, Deepak Shah, Othneil Drew, Iksha Gurung, Manil Maskey, Rahul Ramachandran

发表机构 * Earth System Science Center, University of Alabama in Huntsville（阿拉巴马大学亨茨维尔分校地球系统科学中心）； Space and Earth Science Data Analysis（空间与地球科学数据分析）； NASA Marshall Space Flight Center（NASA马歇尔太空飞行中心）

AI总结研究利用Prithvi-EO-2.0模型在19个全球洪水事件中评估卫星洪水测绘的检测能力，发现检测精度取决于土地覆盖和洪水类型，农田和河流洪水检测效果较好，而树木覆盖和建成区检测近乎为零。

AI中文摘要

洪水是最具破坏性的自然灾害之一，在气候变化下其频率增加使得基于卫星的淹没测绘对灾害响应至关重要。基于卫星档案预训练的地理空间基础模型提供了地理可迁移性，但其在多样、未见事件中的操作可靠性尚未被表征。在此，我们在跨越六大洲、八个气候带和六种洪水机制的19个分布外洪水事件（2017-2025年）中部署Prithvi-EO-2.0，并针对两个独立参考产品进行验证。检测精度共同依赖于土地覆盖和洪水类型，农田产生最高一致性（IoU=52%），河流事件检测最强（F1=0.69），而树木覆盖和建成区显示近乎零检测（IoU=4%），无论洪水机制如何。双参考验证揭示，明显的模型误差部分反映了参考产品之间的定义不一致而非检测失败。迭代流水线测试识别出23种故障模式，其中流水线工程在初始误差中占主导地位，超过模型容量。这些发现为操作卫星洪水测绘建立了环境依赖的检测边界。

英文摘要

Floods are among the most destructive natural hazards, and their increasing frequency under climate change makes satellite-based inundation mapping essential for disaster response. Geospatial foundation models pretrained on satellite archives offer geographic transferability, but their operational reliability across diverse, unseen events remains uncharacterized. Here we deploy Prithvi-EO-2.0 across 19 out-of-distribution flood events (2017-2025) spanning six continents, eight climate zones, and six flood mechanisms, validating against two independent reference products. Detection accuracy depended jointly on land cover and flood type, with cropland yielding the highest agreement (IoU=52%) and riverine events the strongest detection (F1=0.69), while tree cover and built-up areas showed near-zero detection (IoU=4%) regardless of flood mechanism. Dual-reference validation revealed that apparent model error partly reflects definitional inconsistency between reference products rather than detection failure. Iterative pipeline testing identified 23 failure modes, with pipeline engineering dominating initial error over model capacity. These findings establish environment-dependent detection boundaries for operational satellite flood mapping.
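
The IoU and F1 figures quoted in this abstract are standard overlap metrics for binary masks. The sketch below computes both from flattened flood masks; the function names and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's exact per-event aggregation is not given in the abstract.

```python
from typing import Sequence, Tuple


def confusion_counts(pred: Sequence[int], ref: Sequence[int]) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives per pixel."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of pixels")
    tp = fp = fn = 0
    for p, r in zip(pred, ref):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif r and not p:
            fn += 1
    return tp, fp, fn


def iou(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Intersection over union: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, ref)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # both masks empty: assumed perfect


def f1(pred: Sequence[int], ref: Sequence[int]) -> float:
    """F1 (Dice) score: 2 TP / (2 TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, ref)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # same empty-mask convention


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]  # 1 = flooded pixel
    reference = [1, 0, 1, 0]
    print(iou(predicted, reference), f1(predicted, reference))
```

For a single pair of masks the two metrics are tied by F1 = 2 * IoU / (1 + IoU), so F1 is never lower than IoU; this is worth keeping in mind when comparing the IoU values reported by land cover with the F1 value reported by flood type.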

2606.07798 2026-06-09 cs.AI cs.LG q-bio.NC 新提交

面向证据基础计算病理学的多模态智能体协同助手

Zhe Xu, Zhengyu Zhang, Zhiyuan Cai, Jiahao Xu, Yijie Lin, Ziyi Liu, Junlin Hou, Hongyi Wang, Yuxiang Nie, Ling Liang, Yihui Wang, Yingxue Xu, Ronald Cheong Kin Chan, Li Liang, Hao Chen

发表机构 * Department of Computer Science and Engineering, Hong Kong University of Science and Technology（香港科技大学计算机科学与工程系）； Department of Pathology, Nanfang Hospital, Southern Medical University（南方医科大学南方医院病理科）； Department of Pathology, School of Basic Medical Sciences, Southern Medical University（南方医科大学基础医学院病理学系）； Department of Anatomical and Cellular Pathology, Chinese University of Hong Kong（香港中文大学病理解剖及细胞学系）； Guangdong Provincial Key Laboratory of Molecular Tumor Pathology（广东省分子肿瘤病理学重点实验室）； Jinfeng Laboratory（金凤实验室）； Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology（香港科技大学化学及生物工程学系）； Division of Life Science, Hong Kong University of Science and Technology（香港科技大学生命科学部）； State Key Laboratory of Nervous System Disorders, The Hong Kong University of Science and Technology（香港科技大学神经系统疾病国家重点实验室）； HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, The Hong Kong University of Science and Technology（香港科技大学深港协同创新研究院）

AI总结提出PathPocket，一种多模态AI协同助手，通过构建包含11万文档的病理证据语料库和455万实体的超图，实现基于证据的病理诊断，在20万真实案例上超越现有方法。

AI中文摘要

病理学是现代医学的基石，准确的决策高度依赖于循证实践。虽然人工智能有潜力改变临床工作流程，但AI与循证医学的结合仍未被充分探索，现有的初步尝试仅限于纯文本的通用医学。在这项工作中，我们提出了PathPocket，一种专门为证据基础病理学设计的多模态AI智能体协同助手。我们构建了迄今为止最全面的病理证据语料库，包含约110,472份公开和授权文档，这些文档按照从临床指南到专家意见的严格证据层级进行结构化组织。在这个精心分级的基础上，我们构建了一个大规模多模态病理超图，包含超过455万个实体和710万个关系。作为强大的知识引擎，该超图为协作式多智能体推理框架提供了可追溯的证据，该框架集成了输入理解、证据检索、过滤和诊断生成。这使得PathPocket能够无缝解决广泛的临床任务，从纯文本查询到涉及感兴趣区域和千兆像素全切片图像的复杂多模态诊断。我们在一个包含超过20万真实案例的多维基准测试上严格评估了该系统，其性能显著优于现有最先进方法。至关重要的是，广泛的用户研究表明，PathPocket显著提高了病理学家的诊断准确性和信心。通过将病理学解释直接基于可验证的文献，PathPocket为未来证据基础的计算病理学提供了实用且可扩展的解决方案。

英文摘要

Pathology is the cornerstone of modern medicine, where accurate decision-making relies heavily on evidence-based practices. While artificial intelligence (AI) has the potential to transform clinical workflows, the intersection of AI and evidence-based medicine remains under-explored, with preliminary attempts restricted to text-only general medicine. In this work, we present PathPocket, a multimodal AI agentic co-pilot designed specifically for evidence-grounded pathology. We construct the most comprehensive pathology evidence corpus to date, encompassing approximately 110,472 public and authorized documents structured across a rigorous hierarchy of evidence from clinical guidelines to expert opinion. From this meticulously graded foundation, we build a large-scale multimodal pathology hypergraph containing over 4.55 million entities and 7.10 million relations. Serving as a robust knowledge engine, this hypergraph provides traceable evidence for a collaborative multi-agent reasoning framework integrating input understanding, evidence retrieval, filtering, and diagnosis generation. This enables PathPocket to seamlessly resolve a wide spectrum of clinical tasks, ranging from text-only queries to complex multimodal diagnostics involving regions of interest (ROIs) and gigapixel whole-slide images (WSIs). We rigorously evaluate the system on a multidimensional benchmark of over 200,000 real-world cases, where it significantly outperforms existing state-of-the-art methods. Crucially, extensive user studies demonstrate that PathPocket substantially improves the diagnostic accuracy and confidence of pathologists. By directly grounding pathology interpretations in verifiable literature, PathPocket offers a practical and scalable solution for the future of evidence-grounded computational pathology.

2606.08146 2026-06-09 cs.AI 新提交

SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection

SAGE: 一种LLM驱动的自我反思智能体框架用于欺诈检测

Yichen Chen, Siying Li, Yuhang Liang, Lijun Wang, Renyang Liu

发表机构 * National University of Singapore（新加坡国立大学）； University of Chinese Academy of Sciences（中国科学院大学）； China Mobile Communications Group（中国移动通信集团有限公司）

AI总结提出SAGE，首个端到端LLM驱动的多智能体欺诈检测框架，通过数据诊断树和自然语言梯度优化，在五个数据集上平均F1提升40.86%。

AI中文摘要

支付、电子商务和电信系统中的欺诈检测需要在个体层面准确、在严重类别不平衡下鲁棒，并且易于风险管理者理解。现有方法至少无法满足其中一项要求：自动化机器学习系统在固定数值空间中搜索，缺乏对数据集的语义感知；基于图神经网络的方法需要预定义的关系图，在个体决策层面仍然不透明；通用大语言模型（LLM）智能体的设计未考虑现实欺诈检测特有的召回率和精确率约束。在本文中，我们提出SAGE，首个端到端LLM驱动的多智能体欺诈检测框架。SAGE协调三个专用智能体，基于六层数据诊断树（DDT）和由自然语言梯度引导的马尔可夫决策过程做出决策，在欺诈特定奖励下自动优化模型。在五个欺诈数据集和五个LLM骨干网络上，SAGE在96.00%的方法-数据集比较中获胜，平均F1比基线提升40.86%。代码可在https://github.com/yichenC1c/SAGE获取。

英文摘要

Fraud detection in payment, e-commerce, and telecommunications systems requires accuracy at the individual level, robustness under severe class imbalance, and ease of understanding for risk managers. Existing methods fail at least one of these requirements: automated machine learning systems search a fixed numerical space without semantic awareness of the dataset; graph neural network-based methods require pre-defined relational graphs and remain opaque at the individual-decision level; and the design of general-purpose large language model (LLM) agents does not consider the recall and precision constraints specific to real-world fraud detection. In this paper, we propose SAGE, the first end-to-end LLM-driven multi-agent framework for fraud detection. SAGE coordinates three dedicated agents that make decisions based on a six-layer Data Diagnostic Tree (DDT) and a Markov decision process guided by natural-language gradients, automatically optimizing the model under a fraud-specific reward. On five fraud datasets and five LLM backbones, SAGE wins $96.00\%$ of method--dataset comparisons and improves F1 by an average of $40.86\%$ over baselines. The code is available at https://github.com/yichenC1c/SAGE.

2606.08311 2026-06-09 cs.AI 新提交

Curation of a Cardiology Interface Terminology for Highlighting Electronic Health Records using Machine Learning

利用机器学习构建用于高亮电子健康记录的心脏病学接口术语

Mahshad Koohi Habibi Dehkordi, Shuxin Zhou, Yehoshua Perl, Fadi P. Deek, James Geller, Gai Elhanan, Andrew J. Einstein, Luke Lindemann, Vipina K. Keloth

发表机构 * Department of Computer Science, New Jersey Institute of Technology（新泽西理工学院计算机科学系）； Department of Computer Science, St. Francis College（圣弗朗西斯学院计算机科学系）； Department of Informatics, New Jersey Institute of Technology（新泽西理工学院信息学系）； Department of Data Science, New Jersey Institute of Technology（新泽西理工学院数据科学系）； Center for Genomic Medicine, School of Medicine, University of Nevada（内华达大学医学院基因组医学中心）； Department of Medicine, Cardiology Division, Columbia University Irving Medical Center（哥伦比亚大学欧文医学中心内科心脏病科）； Advanced Metrics Laboratory, School of Medicine and Health Sciences, George Washington University（乔治华盛顿大学医学与健康科学学院高级指标实验室）； Department of Biomedical Informatics and Data Science, Yale University（耶鲁大学生物医学信息学与数据科学系）

AI总结提出基于机器学习的心脏病学接口术语（CIT）设计方法，通过半自动构建训练数据并训练模型，实现对电子健康记录中关键信息的高亮，覆盖率达74.21%。

AI中文摘要

电子健康记录（EHR）笔记是密集的医学文档，包含大量信息，通常充满复杂的医学术语。高亮EHR中的所有细节有助于通过吸引对关键内容的注意力来减少遗漏重要信息的可能性。本研究提出设计一种心脏病学接口术语（CIT），以准确高亮心脏病患者EHR笔记中的所有细节。我们引入一种创新的机器学习（ML）技术用于CIT的设计。ML技术需要训练数据。手动准备此类训练数据耗时且昂贵。CIT设计过程包括三个阶段。在前两个阶段中，我们创新性地推导出一个训练数据CIT，供第三阶段的ML技术使用。我们首先设计初始CIT，由几个部分组成：SNOMED的心脏病学子层次、从构建集的EHR中挖掘的其他SNOMED概念，以及术语的必要组成部分（如医学缩写和药物）。利用迭代过程，从构建集中提取包含初始CIT概念的细粒度短语作为CIT概念候选。候选概念在半自动审查后添加到CIT中，得到训练数据CIT（TCIT）。在第三阶段，使用TCIT训练ML模型，以识别适合作为CIT概念的概念。该模型用于从构建集中提取更多概念，得到最终CIT。然后使用最终CIT高亮测试集，并评估其捕获未见EHR数据集中细节的程度。为此，使用了四个评估指标：覆盖率、广度、完整性和简洁性。高亮测试集的覆盖率为74.21%，广度为1.68。对于测试集中的20个随机笔记，平均完整性为98.2%，平均简洁性为84.2%。

英文摘要

Electronic health record (EHR) notes are dense medical documents containing large amounts of information, often filled with complex medical jargon. Highlighting all details in EHRs helps reduce the likelihood of missing crucial information by drawing attention to key content. This study proposes the design of a Cardiology Interface Terminology (CIT) to accurately highlight all details in the EHR notes of cardiology patients. We introduce an innovative machine learning (ML) technique for the design of the CIT. The ML technique requires training data, and manual preparation of such training data is time-consuming and expensive. The CIT design process includes three phases. In the first two phases, we derive a training-data CIT to be used by the ML technique in the third phase. We start by designing an initial CIT composed of several components: the cardiology-related sub-hierarchies of SNOMED, other SNOMED concepts mined from the EHRs of the build set, and necessary term components, e.g., medical abbreviations and medications. Using an iterative process, fine-grained phrases containing initial CIT concepts are extracted from the build set as CIT concept candidates. The candidate concepts are semi-automatically reviewed before being added to the CIT, yielding the training-data CIT (TCIT). In the third phase, an ML model is trained with the TCIT to identify candidates fit to be concepts in the CIT. This model is used to extract further concepts from the build set, yielding the final CIT. The final CIT is then used to highlight the test set and evaluate the extent to which it captures details in an unseen EHR dataset. For this purpose, four evaluation metrics are used: coverage, breadth, completeness, and conciseness. The highlighted test set has a coverage of 74.21%, with a breadth of 1.68. For 20 random notes in the test set, the average completeness is 98.2% and the average conciseness is 84.2%.

2606.08314 2026-06-09 cs.AI 新提交

面向城市交通系统协同中断响应的弹性即服务评估框架

Sara Jaber, S. M. Hassan Mahdavi, Neila Bhouri, Mostafa Ameli

发表机构 * Univ. Gustave Eiffel, COSYS, GRETTIA, Paris, France（古斯塔夫·埃菲尔大学，交通系统、网络与安全实验室，交通工程与智能交通系统研究组，法国巴黎）； VEDECOM, mobiLAB, Department of Human factors and Economics of Sustainable Mobility, Versailles, France（VEDECOM研究所，移动出行实验室，可持续出行人因与经济系，法国凡尔赛）

AI总结提出一个基于KPI的时间索引框架，结合优化模型与智能体仿真，从脆弱性、适应性、鲁棒性等多维度评估城市交通中断响应方案的弹性，并通过巴黎RER B线案例验证了协同策略的优越性。

详情

AI中文摘要

城市公共交通中断需要快速响应策略，然而现有研究很少提供一个决策支持框架，使用一组通用的动态、乘客、运营商和环境导向指标来比较替代的中断响应解决方案。本文提出了一个KPI驱动的、时间索引的框架，用于评估城市交通系统中中断响应方案的弹性。该框架将优化模型与基于智能体仿真的行为评估相结合。它还考虑了当在途车辆被撤回以支持中断走廊时，辅助线路上的二次服务退化。该框架不将弹性视为单一分数，而是评估互补维度，包括脆弱性、适应性、鲁棒性、弹性损失、响应性、基于成本的性能、排放和公平性。该框架在法兰西岛（巴黎）网络的RER B交通线上实施。结果表明，协同策略提供了最平衡的弹性曲线，与单一模式替代方案相比，结合了高服务连续性和较低的总中断成本，同时提高了公平性并保持了有竞争力的环境性能。敏感性分析进一步确定了协同多模式响应最有价值的中断条件。

英文摘要

Urban public transport disruptions require rapid response strategies, yet existing studies rarely provide a decision support framework to compare alternative disruption response solutions using a common set of dynamic, passenger, operator, and environment-oriented indicators. This paper proposes a KPI-driven, time-indexed framework to assess the resilience of disruption response solutions in urban transit systems. The framework combines an optimization model with a behavioral evaluation in agent-based simulation. It also accounts for the secondary service degradation induced on helper lines when in-service vehicles are withdrawn to support the disrupted corridor. Rather than treating resilience as a single score, it evaluates complementary dimensions including vulnerability, adaptability, robustness, resilience loss, responsiveness, cost-based performance, emissions, and equity. The framework is implemented for the RER B transit line in the Ile-de-France (Paris) network. Results show that the coordinated strategy provides the most balanced resilience profile, combining high service continuity with lower total disruption cost than single-mode alternatives, while also improving equity and maintaining competitive environmental performance. Sensitivity analysis further identifies the disruption conditions under which coordinated multimodal response is most valuable.


2606.08855 2026-06-09 cs.AI cs.CV cs.CY 新提交

Hybrid E-Assessment in Higher Education: Semi-Automated Grading of Paper-Based Written Examinations

高等教育中的混合电子评估：纸质笔试的半自动评分

Hartwig Grabowski, Michael Canz

发表机构 * Institute for Machine Learning and Analytics, Hochschule Offenburg（奥芬堡应用技术大学机器学习与分析研究所）； Hochschule Offenburg（奥芬堡应用技术大学）

AI总结针对完全数字化和部分数字化电子评估在总结性考试中的局限性，提出混合电子评估方法，保留纸质问题导向任务，通过结构化答案格式和手写字符识别实现半自动评分，结合视觉大语言模型和两遍验证提升评估有效性、公平性和可扩展性。

Comments 15 pages, 6 figures

详情

AI中文摘要

本文考察了完全数字化和部分数字化电子评估方法在高等教育总结性考试中的局限性。分析聚焦于封闭式问题格式导致的教学狭窄化，以及在大规模学生群体中尤为突出的组织、技术和法律约束。作为替代方案，本文提出了一种混合电子评估方法，该方法保留纸质、问题导向的考试任务，同时实现半自动评分。评估相关的中间结果以结构化答案格式编码，由学生手写输入，随后从表格字段中捕获。核心的技术瓶颈是在现实考试条件下可靠识别手写字符。最近的视觉大语言模型，结合两遍验证原则和与标准答案的比对，可以减少误分类，从而提高总结性评估的有效性、公平性和可扩展性。

英文摘要

This paper examines the limitations of fully digital and partially digital e-assessment approaches in summative examinations in higher education. The analysis focuses on the didactic narrowing caused by closed question formats and on organizational, technical, and legal constraints that become particularly relevant in large student cohorts. As an alternative, the paper proposes a hybrid e-assessment approach that retains paper-based, problem-oriented examination tasks while enabling semi-automated grading. Assessment-relevant intermediate results are encoded in a structured answer format, entered by students by hand, and subsequently captured from table fields. The central technical bottleneck is reliable recognition of handwritten characters under realistic examination conditions. Recent vision-capable large language models, combined with a two-pass validation principle and comparison against a solution key, can reduce misclassifications and thereby improve the validity, fairness, and scalability of summative assessment.


2606.09086 2026-06-09 cs.AI 新提交

DynaOD: Dynamic Origin-Destination Flow Generation with Discrete-to-Continuous Temporal Semantic Modeling

DynaOD: 基于离散到连续时间语义建模的动态起讫点流量生成

Jie Zhao, Xianqi Dai, Jie Feng, Huandong Wang, Yong Li

发表机构 * Department of Electronic Engineering, BNRist, Tsinghua University（清华大学电子工程系，BNRist）； Tsinghua Shenzhen International Graduate School（清华大学深圳国际研究生院）； Zhongguancun Academy（中关村学院）

AI总结提出DynaOD框架，通过离散方向趋势和连续时间演化双视角建模时间语义，以轻量即插即用方式调节预训练静态OD生成器，实现无历史观测的动态OD流生成，在预测精度和分布保真度上优于基线。

Comments Accepted by IJCAI2026

详情

AI中文摘要

动态起讫点（OD）流量生成旨在仅从时间上下文合成逼真的移动动态，而不依赖历史OD观测。一个关键挑战是将语义时间信号转化为时间上连贯的OD模式，同时保留城市区域固有的空间异质性。我们提出DynaOD，一个语义驱动框架，通过两个互补视角建模时间动态：离散方向趋势，刻画城市活动模式的定性变化；连续时间演化，捕捉这些变化如何随时间展开。通过联合编码这些时间语义，该框架构建时变区域表示，以轻量即插即用方式调节预训练的静态OD生成器。这种模块化设计进一步支持可扩展部署和跨城市迁移。在大型真实世界数据集上的大量实验表明，我们的方法在预测精度和分布保真度上均持续优于代表性基线。代码公开于https://github.com/csjiezhao/DynaOD。

英文摘要

Dynamic origin-destination (OD) flow generation seeks to synthesize realistic mobility dynamics from temporal context alone, without relying on historical OD observations. A key challenge is to translate semantic temporal signals into temporally coherent OD patterns while preserving the inherent spatial heterogeneity of urban regions. We propose DynaOD, a semantic-driven framework that models temporal dynamics through two complementary perspectives: discrete directional trends that characterize qualitative shifts in urban activity patterns, and continuous temporal evolution that captures how such shifts unfold over time. By jointly encoding these temporal semantics, the framework constructs time-varying region representations that condition pretrained static OD generators in a lightweight and plug-and-play fashion. This modular design further supports scalable deployment and cross-city transferability. Extensive experiments on large-scale real-world datasets show that our method consistently outperforms representative baselines in both predictive accuracy and distributional fidelity. Code is publicly available at https://github.com/csjiezhao/DynaOD.


2606.09392 2026-06-09 cs.AI 新提交

From Coarse to Fine: Managing Temporal Granularity in Spatio-Temporal Data for Fine-Grained Traffic Prediction

从粗到细：管理时空数据中的时间粒度以实现细粒度交通预测

Shuhao Li, Weidong Yang, Yue Cui, Zizhuo Xu, Lipeng Ma, Fan Zhang, Xiaofang Zhou

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University（复旦大学计算机科学与技术学院）； Tongyi Lab, Alibaba Group（阿里巴巴集团通义实验室）； The Hong Kong University of Science and Technology（香港科技大学）； Guangzhou University（广州大学）

AI总结针对粗粒度采样数据难以支持细粒度预测的问题，提出时空细化预测器（STRP），通过树卷积和逆膨胀卷积实现高效时空建模，在六个数据集上显著优于现有方法。

详情

AI中文摘要

高效的交通数据获取、存储和利用是时空数据管理中的关键挑战。大多数交通数据系统以固定的粗粒度时间间隔收集和存储观测数据，以降低存储和计算成本。然而，这种粗粒度数据严重限制了需要更细时间粒度预测的下游应用。在所有地点和时间段收集和维护细粒度交通数据将给数据库存储和预处理流程带来巨大负担。为了解决这种时间粒度不匹配问题，我们定义了一个新问题：利用粗粒度采样数据预测细粒度未来交通。我们提出了时空细化预测器（STRP），一种面向时空数据系统的粒度感知框架。STRP集成了两个组件：用于高效且可解释的空间依赖建模的树卷积，以及用于渐进式时间外推的逆膨胀卷积。STRP支持两种实用的预测设置：基于窗口和基于持续时间的，以处理不同形式的粒度不匹配。在六个基准数据集上的实验表明，STRP在准确性和效率上均显著优于最先进的基线方法。我们的工作为管理时空交通数据系统中的粒度不匹配提供了一种实用且可解释的方法。

英文摘要

Efficient acquisition, storage, and utilization of traffic data are critical challenges in spatio-temporal data management. Most traffic data systems collect and store observations at fixed, coarse-grained temporal intervals to reduce storage and computation costs. However, such coarse-grained data severely limits downstream applications that require predictions at a finer temporal granularity. Collecting and maintaining fine-grained traffic data across all locations and time periods would impose a substantial burden on database storage and preprocessing pipelines. To address this temporal granularity mismatch, we formulate a novel problem: predicting fine-grained future traffic using coarse-grained sampled data. We propose the Spatial-Temporal Refinement Predictor (STRP), a granularity-aware framework for spatio-temporal data systems. STRP integrates two components: Tree Convolution for efficient and interpretable spatial dependency modeling, and Inverse Dilated Convolution for progressive temporal extrapolation. STRP supports two practical prediction settings: window-based and duration-based, to handle different forms of granularity mismatch. Experiments on six benchmark datasets show that STRP significantly outperforms state-of-the-art baselines in both accuracy and efficiency. Our work offers a practical and interpretable approach to managing granularity mismatches in spatio-temporal traffic data systems.


2606.09433 2026-06-09 cs.AI 新提交

Bayesian Selective Latent Inference for Wastewater-First Influenza Monitoring

贝叶斯选择性潜在推断用于污水优先的流感监测

Yixuan Zhang, Yang Song, Hao Wang, Samir Bhatt, Hengguan Huang

发表机构 * University of Copenhagen（哥本哈根大学）； Rutgers University（罗格斯大学）； Imperial College London（帝国理工学院）

AI总结提出贝叶斯选择性潜在推断（BSLI），通过后验分布、可回答性认证和成本校准的Bellman策略，在污水优先流感监测中优化查询与弃权决策。

Comments Corresponding authors: Hengguan Huang and Samir Bhatt. Hengguan Huang is the lead corresponding author

详情

AI中文摘要

污水流感监测可以在临床报告之前揭示社区传播，但仅凭污水并不能完全识别人类负担。现有的污水模型假设固定的证据集，而通用的证据获取方法将官方监测流视为可互换的昂贵特征。我们将污水优先的流感监测视为一个选择性决策问题：从强制性的污水证据开始，系统必须决定污水是否足够，接下来查询哪个延迟的官方流，以及在源模糊下何时弃权是唯一科学上可辩护的行动。我们提出了贝叶斯选择性潜在推断（BSLI），这是一种原则性的贝叶斯方法，它维护潜在负担和可识别性的后验分布，通过明确的科学门认证可回答性，并使用精确的成本校准Bellman策略优化查询-停止决策。我们证明了关键的变分、可回答性、Bellman最优性和一维成本校准性质。在一个包含5,933个预测事件和3,102个源模糊事件的固定公共数据基准上，BSLI改善了匹配预算的成本-性能前沿，同时在源模糊下保持保守的弃权。

英文摘要

Wastewater influenza surveillance can reveal community circulation before clinical reporting, but wastewater alone is not a fully identifiable proxy for human burden. Existing wastewater models assume a fixed evidence set, while generic evidence-acquisition methods treat official surveillance streams as interchangeable costly features. We cast wastewater-first influenza monitoring as a selective decision problem: starting from mandatory wastewater evidence, the system must decide whether wastewater is sufficient, which delayed official stream to query next, and when abstention is the only scientifically defensible action under source ambiguity. We propose Bayesian Selective Latent Inference (BSLI), a principled Bayesian method that maintains a posterior over latent burden and identifiability, certifies answerability through explicit scientific gates, and optimizes query-stop decisions with an exact cost-calibrated Bellman policy. We prove the key variational, answerability, Bellman-optimality, and one-dimensional cost-calibration properties. On a fixed public-data benchmark with 5,933 forecasting episodes and 3,102 source-ambiguity episodes, BSLI improves the matched-budget cost-performance frontier while preserving conservative abstention under source ambiguity.


2606.09489 2026-06-09 cs.AI 新提交

LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines

LLM编排的卒中护理合规性检查无需计算机可解释指南

Giorgio Leonardi, Stefania Montani, Manuel Striani, Alessandro Canessa, Delfina Ferrandi

发表机构 * Computer Science Institute, DiSIT, University of Piemonte Orientale（皮埃蒙特东方大学计算机科学研究所）； Integrated Laboratory of AI and Medical Informatics, DAIRI, SS. Antonio e Biagio e Cesare Arrigo Hospital（圣安东尼奥、比亚焦与切萨雷·阿里戈医院DAIRI人工智能与医学信息学综合实验室）

AI总结提出基于大语言模型编排的模块化框架，从非结构化临床文本和指南中自动提取患者轨迹、识别规范规则并计算合规性指标，在卒中护理领域验证了86%以上的轨迹合规。

详情

AI中文摘要

目标：医疗保健中的合规性检查旨在评估患者护理路径是否符合临床指南。然而，其实际应用通常依赖于正式、机器可解释的指南表示（如计算机可解释指南CIG），而这些在现实临床环境中很少可用。方法：本文引入了一个基于大语言模型编排的模块化框架，直接从非结构化的临床和指南文本中支持医疗合规性检查，无需预定义的CIG。所提出的架构集成了多个LLM和支持组件，从临床出院信中提取患者轨迹，从文本临床指南中识别规范规则，将这些规则转换为可执行脚本，并计算轨迹合规性指标以量化事件日志中的合规性。结果：该框架在亚历山德里亚医院神经内科病房的卒中护理领域进行了实施和评估。从医院数据中自动提取了数百条患者轨迹，并根据参考指南衍生的50条规则进行了评估。分析显示，超过86%的可用轨迹是合规的。结论：结果证明了使用编排的LLM进行实际医疗保健合规性分析的可行性。同时，该研究提供了亚历山德里亚医院卒中护理指南高度遵守的证据。

英文摘要

Objective: Conformance checking in healthcare seeks to assess whether patient care pathways adhere to clinical guidelines. However, its practical application often depends on the availability of formal, machine-interpretable representations of guidelines, such as Computer-Interpretable Guidelines (CIGs), which are seldom available in real-world clinical settings. Methods: This work introduces a modular framework based on the orchestration of Large Language Models (LLMs) to support medical conformance checking directly from unstructured clinical and guideline texts, without requiring predefined CIGs. The proposed architecture integrates multiple LLMs and supporting components to extract patient traces from clinical discharge letters, identify normative rules from textual clinical guidelines, translate these rules into executable scripts, and compute a Trace Conformance Indicator to quantify compliance within the event log. Results: The framework was implemented and evaluated in the stroke care domain at the neurological ward of Alessandria Hospital. Hundreds of patient traces were automatically extracted from hospital data and assessed against 50 rules derived from the reference guideline. The analysis showed that more than 86% of the available traces were conformant. Conclusion: The results demonstrate the feasibility of using orchestrated LLMs for practical healthcare conformance analysis. At the same time, the study provides evidence of a high level of adherence to stroke care guidelines at Alessandria Hospital.
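The idea of checking extracted traces against executable rules can be illustrated with a minimal Python sketch. The abstract does not define the Trace Conformance Indicator, so the criterion used here (a trace is conformant when it satisfies every rule) is an assumption, and the rule, activity names, and patient traces are hypothetical examples rather than content from the study.

```python
from typing import Callable, Dict, List

Trace = List[str]                # ordered activity names for one patient
Rule = Callable[[Trace], bool]   # returns True when the trace satisfies the rule


def precedes(first: str, second: str) -> Rule:
    """Rule: if `second` occurs, `first` must occur earlier in the trace."""
    def check(trace: Trace) -> bool:
        if second not in trace:
            return True  # rule is vacuously satisfied
        return first in trace[: trace.index(second)]
    return check


def conformant_share(traces: Dict[str, Trace], rules: List[Rule]) -> float:
    """Fraction of traces that satisfy every rule (assumed criterion)."""
    if not traces:
        return 0.0
    ok = sum(all(rule(t) for rule in rules) for t in traces.values())
    return ok / len(traces)


# Hypothetical rule and traces, for illustration only.
rules = [precedes("ct_scan", "thrombolysis")]
traces = {
    "p1": ["admission", "ct_scan", "thrombolysis"],
    "p2": ["admission", "thrombolysis", "ct_scan"],
    "p3": ["admission", "ct_scan"],
}
share = conformant_share(traces, rules)
```

Representing each rule as a plain predicate keeps rule generation (done by LLMs in the described framework) separate from rule evaluation, which stays deterministic and auditable.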


2606.09556 2026-06-09 cs.AI 新提交

AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

AI科学家的能力取决于其证据：药物资产估值中专有数据与推理技能的分层消融研究

Yinan Wang

发表机构 * Noah AI Research（Noah AI研究）

AI总结通过分层消融实验，发现药物资产估值中AI科学家的决策上限由专有证据集决定，而非仅依赖推理框架；加入专有数据后决策质量显著提升。

Comments Preprint; 2 figures, 5 tables

详情

AI中文摘要

AI科学家智能体通常被评估时，仿佛能力主要取决于模型质量、提示或推理框架。我们在药物资产估值中测试了一个不同的假设：对于知识密集型的科学决策，限制因素往往是智能体能够访问的证据基础。我们在一个生产级估值智能体上进行了三臂对照消融实验：A是仅使用网络的普通LLM分析师，B增加了公共结构化工具以及14维估值剧本、验证器、客观性策略和红队，C增加了专有的Noah AI语料库，包含精选的管线、试验和交易情报。在包含13个资产的分层基准测试中，B改善了校准和审计纪律：层级内准确率从0.80提高到0.89，客观性从3.16提高到3.30。但B并未消除事实上限。在能力超集核算下，A和B仅恢复了精选黄金竞争记录的0.25和0.38，而C恢复了0.96；在精选长尾子集上，C达到0.93，而A/B为0.26/0.30。原始盲审决策质量A和B相似（7.01 vs 6.96），因此我们引入了完整性感知决策效用：知情决策质量 = 决策质量 × 黄金覆盖率。在此指标上，C达到7.43，而A/B为1.76/2.57。即使一个完美的非专有数据报告，其B的覆盖率上限也仅为3.83。结果并非推理框架不重要；它们改善了校准和纪律。相反，专有证据集设定了AI科学家所能知道并因此决策的上限。

英文摘要

AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality x gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.
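The completeness-aware metric defined in the abstract is a simple product, which a minimal Python sketch can make concrete. The function name and the 0-10 quality scale are assumptions for illustration. The inputs are the rounded figures quoted in the abstract, so the products only approximate the reported 1.76 and 2.57, which were presumably computed from unrounded or per-asset values.

```python
def informed_decision_quality(decision_quality: float, gold_coverage: float) -> float:
    """Completeness-aware utility: decision quality scaled by gold coverage."""
    if not 0.0 <= gold_coverage <= 1.0:
        raise ValueError("gold_coverage must lie in [0, 1]")
    return decision_quality * gold_coverage


# Rounded figures quoted in the abstract for arms A and B.
arm_a = informed_decision_quality(7.01, 0.25)
arm_b = informed_decision_quality(6.96, 0.38)

# Assumption: quality is scored on a 0-10 scale, so a perfect report scores 10.
# 0.38 is the rounded coverage for B, so this ceiling is approximate.
ceiling_b = informed_decision_quality(10.0, 0.38)
```

Because coverage multiplies quality, an arm with low coverage cannot reach a high score however well it reasons, which is the ceiling effect the abstract describes.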


2606.09774 2026-06-09 cs.AI cs.CL 新提交

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

SIGA: 用于科学模拟的自演化编码智能体适配器

Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang, Lianhui Qin

发表机构 * University of California, San Diego（加利福尼亚大学圣迭戈分校）

AI总结提出SIGA适配器，通过检索、程序记忆、轨迹内验证和验证强制终止，将通用编码智能体转化为科学模拟软件操作员，在GEOS上实现36倍加速，并支持自演化提升性能。

详情

AI中文摘要

高级科学模拟器暴露了专门的输入语言，将模拟目标转化为可执行配置，但学习这些语言可能需要领域科学家花费数小时到数天。我们将模拟器设置研究为智能体-工具接口接地问题：需要哪些最小的模拟器特定适配才能使现成的编码智能体操作真实的科学软件？我们的直觉是，编码智能体已经知道如何导航文件、编辑代码、运行命令和修复输出，但它们缺乏模拟器的可执行契约：其词汇、结构约束、验证规则和终止条件。我们介绍了SIGA，一个模拟器接口接地适配器，通过检索、程序记忆、轨迹内验证和验证强制终止来提供此契约。我们主要在GEOS上评估SIGA，GEOS是一个用于地下科学的开源多物理场模拟器。SIGA在大约五分钟内生成完整的GEOS输入文件，TreeSim高于0.90，与花费大约三小时的扩展预算人类专家相当，实现了大约36倍的挂钟加速。在更难的保留集上，接地将TreeSim从0.720提高到0.789，相对于裸智能体提高了大约10%，并且可以将跨种子的标准差降低16倍。自演化通过从先前轨迹重写适配器内容进一步改进SIGA，产生了最高的保留GEOS平均值，并匹配或超过了最强的手工设计配置。迁移到OpenFOAM和LAMMPS表明，主导机制因接口而异：当结构完整性是瓶颈时，验证最重要；而当领域正确性是瓶颈时，记忆和检索最重要。这些结果表明，轻量级、可自我改进的接地层可以将通用编码智能体转变为科学软件的实用操作员。

英文摘要

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.


2502.09194 2026-06-09 cs.IT cs.AI math.IT 交叉投稿

XAInomaly: Explainable and Interpretable Deep Contractive Autoencoder for O-RAN Traffic Anomaly Detection

XAInomaly：用于O-RAN流量异常检测的可解释且可理解的深度收缩自编码器

Osman Tugay Basaran, Falko Dressler

发表机构 * School of Electrical Engineering and Computer Science, TU Berlin, Germany（电气工程与计算机科学学院，柏林技术大学，德国）

AI总结提出XAInomaly框架，利用半监督深度收缩自编码器学习正常网络行为的鲁棒表示，并引入fastshap-C可解释AI技术，实现O-RAN中准确、可扩展且可解释的异常检测。

Comments 22 pages, 9 Figures, Submitted to Journal (First revision completed)

详情

AI中文摘要

生成式人工智能技术通过实现复杂数据建模和特征提取以增强网络性能，已成为推动下一代无线通信系统发展的关键组成部分。在开放无线接入网络（O-RAN）领域，其以解耦架构和来自多个供应商的异构组件为特征，生成模型的部署为网络管理（如流量分析、流量预测和异常检测）带来了显著优势。然而，O-RAN的复杂性和动态性带来了挑战，不仅需要准确的检测机制，还需要降低复杂性、可扩展性，以及最重要的是可解释性，以促进有效的网络管理。在本研究中，我们引入了XAInomaly框架，这是一种用于O-RAN异常检测的可解释且可理解的半监督深度收缩自编码器（DeepCAE）设计。我们的方法利用SS-DeepCAE模型的生成建模能力，学习正常网络行为的压缩、鲁棒表示，该表示捕获了关键特征，从而能够识别指示异常的偏差。为了解决深度学习模型的黑箱特性，我们提出了一种名为fastshap-C的反应式可解释AI（XAI）技术。

英文摘要

Generative Artificial Intelligence (AI) techniques have become an integral part of advancing next-generation wireless communication systems by enabling sophisticated data modeling and feature extraction for enhanced network performance. In the realm of open radio access networks (O-RAN), characterized by their disaggregated architecture and heterogeneous components from multiple vendors, the deployment of generative models offers significant advantages for network management tasks such as traffic analysis, traffic forecasting, and anomaly detection. However, the complex and dynamic nature of O-RAN introduces challenges that necessitate not only accurate detection mechanisms but also reduced complexity, scalability, and, most importantly, interpretability to facilitate effective network management. In this study, we introduce the XAInomaly framework, an explainable and interpretable Semi-supervised (SS) Deep Contractive Autoencoder (DeepCAE) design for anomaly detection in O-RAN. Our approach leverages the generative modeling capabilities of our SS-DeepCAE model to learn compressed, robust representations of normal network behavior that capture essential features, enabling the identification of deviations indicative of anomalies. To address the black-box nature of deep learning models, we propose a reactive Explainable AI (XAI) technique called fastshap-C.


2606.07543 2026-06-09 cs.CY cs.AI cs.HC 交叉投稿

Concerns and Strategic Responses of Older Workers Navigating Generative AI in Bridge Employment

老年工人在桥梁就业中应对生成式AI的关切与战略回应

Aditya Nayak, Aakash Gautam, Rama Adithya Varanasi

发表机构 * University of Pittsburgh（匹兹堡大学）； New York University（纽约大学）

AI总结通过访谈21名专业人士，研究老年工人在桥梁就业中如何应对生成式AI带来的时间与结构性干扰，通过边界工作重构任务，形成AI韧性，并建议平衡个体、中观和宏观层面的策略以减少倦怠。

Comments CHIWORK'26

详情

DOI: 10.1145/3808045.3808070

AI中文摘要

生成式AI正在快速改变工作场所。这不成比例地影响了弱势群体，包括在最终退休前通过桥梁就业重新进入劳动力市场的老年工人。通过对21名专业人士进行深入的半结构化访谈，我们考察了老年工人在追求桥梁角色时如何应对生成式AI驱动的干扰，重点关注他们对GenAI整合的关切以及对这些变化的回应。我们的发现表明，由于GenAI，老年工人在桥梁就业决策过程的所有阶段都经历了时间和结构性干扰。作为回应，他们通过不同形式的边界工作重新配置任务，旨在恢复稳定性和连续性。我们将这些回应概念化为AI韧性，它重塑了老年工人的桥梁就业决策，使其成为一个持续的协商和适应过程。最后，我们提出建议，通过平衡个体层面的AI韧性策略、中观层面的AI韧性集体以及宏观层面的对抗性和可争议的AI中介组织结构，来减少老年工人的倦怠。

英文摘要

Generative AI (GenAI) is transforming workplaces at a rapid pace. This disproportionately affects vulnerable communities, including older workers (OWs) who re-enter the workforce through bridge employment prior to final retirement. Through in-depth semi-structured interviews with 21 professionals, we examine how OWs navigate GenAI-driven disruptions while pursuing bridge roles, focusing on their concerns about GenAI integration and their responses to these changes. Our findings show that OWs experienced both temporal and structural disruptions across all stages of the bridge employment decision-making process due to GenAI. In response, they reconfigured their tasks through different forms of boundary work aimed at restoring stability and continuity. We conceptualize these responses as AI resilience, which reshaped OWs' bridge employment decision-making into an ongoing process of negotiation and adaptation. We conclude by offering recommendations to reduce burnout among OWs by balancing individual-level AI resilience strategies with meso-level AI resilience collectives and macro-level adversarial and contestable AI-mediated organizational structures.


2606.07544 2026-06-09 cs.CY cs.AI cs.HC 交叉投稿

AI-Integrated Learning Management System for Middle School: A Longitudinal Study of Learning Outcomes Through High School and Beyond

面向中学的AI集成学习管理系统：一项从高中到毕业后的学习成果纵向研究

Misan Paul Etchie, Taiwo Olutosin

发表机构 * National Agricultural University（国立农业大学）

AI总结提出一种隐私优先的AI集成学习管理系统，通过政策约束的AI辅助（形成性反馈、间隔复习、适应性练习）和教师仪表盘，在中学日常课程中提供即时支持，并设计纵向研究评估其对高中及毕业后学习轨迹的长期影响。

详情

AI中文摘要

中学是构建核心学术技能和学习习惯的关键时期，这些习惯会延续到高年级，但许多学生仍因帮助有限且滞后而落后。学习管理系统（LMS）已成为分发材料、收集作业、评估学生任务和记录成绩的标准基础设施，但在大多数部署中，它们更像工作流工具而非教学支持。结果是常见的瓶颈：学生在困惑中继续练习，教师对问题进行分诊，而本可纠正误解的反馈在错误观念固化后才到达。为弥补这一差距，我们提出一个面向中学教学的AI集成LMS，并配以纵向研究设计，以测试持续、有边界的AI支持是否能改变高中及毕业后的学习成果。该平台在常规课程中添加了政策约束的AI辅助，提供形成性反馈和提示，基于掌握程度推荐间隔复习和适应性练习，并提供教师仪表盘以总结误解模式并标记持续困难。由于平台面向未成年人，设计以隐私为先，采用数据最小化、基于角色的访问控制、适龄响应约束和可审计的AI交互日志。除了短期表现，评估计划将细粒度的学习轨迹（尝试、修订、求助和节奏）与机构成果（在可行情况下）联系起来，以便将工具采纳效应与学习轨迹的长期变化区分开来。

英文摘要

Middle school is a key window for building core academic skills and the learning routines students carry into later grades, yet many students still fall behind because help is often limited and comes too late, after they have already been stuck for a while. Learning Management Systems (LMSs) are now standard infrastructure for distributing materials, collecting work, assessing students' tasks, and recording grades, but in most deployments they still behave more like workflow tools than instructional supports. The result is the usual bottleneck: students keep practicing through confusion, teachers triage questions, and feedback that could have corrected the misunderstanding arrives after the misconception has already hardened. To address this gap, we propose an AI-integrated LMS for middle school instruction, paired with a longitudinal study design to test whether sustained, bounded AI support changes outcomes through high school and into post-high school pathways. The proposed platform adds policy-gated AI assistance to everyday coursework, delivering formative feedback and hinting, recommending spaced review and adaptive practice based on mastery, and providing teacher-facing dashboards that summarize misconception patterns and flag sustained struggle. Because the platform is intended for minors, the design is privacy-first, using data minimization, role-based access control, age-appropriate response constraints, and auditable logs of AI interactions. Beyond short-term performance, the evaluation plan links fine-grained learning traces (attempts, revisions, help-seeking, and pacing) to institutional outcomes where feasible, so we can separate tool adoption effects from longer-run changes in learning trajectories.


2606.07553 2026-06-09 cs.LG cs.AI 交叉投稿

SurfDesign：基于分子表面的高效蛋白质设计

Fang Wu, Shuting Jin, Xiangru Tang, Mark Gerstein, Xiangxiang Zeng, Yejin Choi, Jure Leskovec, Jinbo Xu

发表机构 * Stanford University（斯坦福大学）； Wuhan University of Science and Technology（武汉科技大学）； Yale University（耶鲁大学）； School of Medicine, Yale University（耶鲁大学医学院）； Hunan University（湖南大学）； Yuelushan Laboratory（岳麓实验室）； Kumo.AI ； Toyota Technological Institute at Chicago（芝加哥丰田技术研究所）

AI总结提出SurfDesign框架，将分子表面建模为连续几何流形并整合预训练蛋白质语言模型，通过表面等变消息传递捕捉几何特征，在从头设计结合子和酶设计基准上优于现有方法。

详情

Journal ref: KDD 2026 AI4Science

AI中文摘要

蛋白质功能很大程度上由分子表面几何和物理化学互补性决定，然而大多数蛋白质设计方法仅以主链结构为条件。我们引入了SurfDesign，一个表面条件蛋白质设计框架，将分子表面建模为连续几何流形，并将其与预训练蛋白质语言模型集成。SurfDesign采用基于表面的等变消息传递来捕捉表面法线、曲率和方向几何，同时采用参数高效的微调策略。专注于功能性蛋白质设计，我们表明SurfDesign在从头设计结合子和酶设计基准上始终优于先前的表面条件和仅主链方法。我们还报告了在逆折叠基准上的强劲性能，作为结构兼容性的诊断。我们的结果强调了流形感知表面表示作为功能性蛋白质和酶设计的原理基础。代码可在https://github.com/smiles724/SurfDesign获取。

英文摘要

Protein function is largely determined by molecular surface geometry and physicochemical complementarity, yet most protein design methods condition only on backbone structure. We introduce SurfDesign, a surface-conditioned protein design framework that models molecular surfaces as continuous geometric manifolds and integrates them with pretrained protein language models. SurfDesign employs surface-based equivariant message passing to capture surface normals, curvature, and directional geometry, together with a parameter-efficient fine-tuning strategy. Focusing on functional protein design, we show that SurfDesign consistently outperforms prior surface-conditioned and backbone-only methods on de novo binder and enzyme design benchmarks. We also report strong performance on inverse-folding benchmarks as a diagnostic of structural compatibility. Our results highlight manifold-aware surface representations as a principled foundation for functional protein and enzyme design. Code is available at https://github.com/smiles724/SurfDesign.


2606.07582 2026-06-09 cs.LG cs.AI cs.ET 交叉投稿

Customer Churn Prediction on Structured Data Using FT-Transformer and Stacking Ensembles

基于FT-Transformer和堆叠集成的结构化数据客户流失预测

Joyjit Roy, Samaresh Kumar Singh, Laxmi Shaw

发表机构 * Independent Researcher, Austin, TX, USA（独立研究员，美国德克萨斯州奥斯汀）； Independent Researcher, Leander, TX（独立研究员，美国德克萨斯州利安德）； Texas A & M University-Victoria, Victoria, TX（德克萨斯农工大学维多利亚分校）

AI总结提出一种结合FT-Transformer与XGBoost的混合架构，通过校准感知堆叠集成处理类别不平衡和特征交互，在银行客户流失数据集上F1达62.10%，AUC-ROC为0.861。

Comments 22 pages, 9 figures, 20 tables; published in IEEE Access

详情

DOI: 10.1109/ACCESS.2026.3686374
Journal ref: IEEE Access, vol. 14, pp. 62834-62855, 2026

AI中文摘要

客户流失预测在保险、数字银行、电子商务和订阅平台等数据驱动行业中至关重要，因为保留现有客户通常比获取新客户更具成本效益。由于类别不平衡、非线性特征交互和异质特征类型，在结构化数据集上预测流失仍然具有挑战性。基于树的集成方法在这些场景中始终表现出强大的性能，通常优于传统神经网络。本研究引入了一种经过验证的混合架构，通过校准感知堆叠将特征标记化变换器（FT-Transformer）与梯度提升树相结合。所提出的框架解决了先前研究中在统计验证、概率校准和可重复性方面的持续空白。FT-Transformer利用自注意力捕获高阶特征交互，而XGBoost通过互补的归纳偏置捕获梯度提升决策边界。类别不平衡通过使用类别加权损失函数处理，从而避免合成过采样并保留少数类分布。模型使用基于折叠外（OOF）堆叠的逻辑回归元学习器进行集成，该元学习器重新校准过于自信的基模型输出并学习最优组合权重。在一个公开的银行流失数据集上，混合模型在5x5交叉验证下达到62.10%的F1、0.861的AUC-ROC和0.647的PR-AUC，相比多层感知机（MLP）基线分别提升3.37个F1点和0.027个AUC，并报告了95%置信区间。消融研究表明，变换器组件和堆叠策略都对性能有实质性贡献。所提出的方法为结构化表格数据上的当代流失预测提供了一个可重复且可扩展的参考架构。

英文摘要

Customer churn prediction is essential across data-driven industries such as insurance, digital banking, eCommerce, and subscription platforms, where retaining existing customers is typically more cost-effective than acquiring new ones. Predicting churn on structured datasets remains challenging due to class imbalance, nonlinear feature interactions, and heterogeneous feature types. Tree-based ensemble methods consistently demonstrate strong performance in these contexts, often outperforming conventional neural networks. This study introduces a validated hybrid architecture that integrates feature-tokenized transformers (FT-Transformer) with gradient-boosted trees through calibration-aware stacking. The proposed framework addresses persistent gaps in statistical validation, probability calibration, and reproducibility found in prior research. The FT-Transformer captures higher-order feature interactions using self-attention, while XGBoost captures gradient-boosted decision boundaries with complementary inductive biases. Class imbalance is handled using class-weighted loss functions, thereby avoiding synthetic oversampling and preserving minority-class distributions. The models are ensembled using out-of-fold (OOF) stacking with a logistic regression meta-learner, which recalibrates overconfident base model outputs and learns optimal combination weights. On a public bank churn dataset, the hybrid model achieves 62.10% F1, 0.861 AUC-ROC, and 0.647 PR-AUC, outperforming the Multi-Layer Perceptron (MLP) baseline by 3.37 F1 points and 0.027 AUC under 5x5 cross-validation with 95% confidence intervals reported. Ablation studies demonstrate that both the transformer component and stacking strategy contribute materially to performance. The proposed methodology offers a reproducible and extensible reference architecture for contemporary churn prediction on structured tabular data.


2606.07633 2026-06-09 cs.CV cs.AI 交叉投稿

AMN: An Adaptive Multi-Scale Fusion Network with Boundary and Uncertainty Modeling for Nuclei Segmentation

AMN：一种用于细胞核分割的具有边界和不确定性建模的自适应多尺度融合网络

Spoorthi M, Suja Palaniswamy

发表机构 * Department of Computer Science & Engineering, Amrita School of Computing, Bengaluru, Amrita Vishwa Vidyapeetham, India

AI总结提出AMN双编码器分割框架，融合Swin Transformer和ResNet-50特征金字塔，通过门控机制动态加权，结合多目标损失，在CoNIC基准上平均Dice 0.82，F1 0.68，优于八种基线模型。

详情

AI中文摘要

组织病理学图像中细胞核亚型的准确分类对于下游任务（包括肿瘤分级、免疫浸润量化和预后预测）至关重要。现有方法孤立地依赖卷积或基于Transformer的编码器，限制了它们同时捕捉细粒度局部纹理和长程空间上下文的能力。我们提出了AMN（自适应多尺度细胞核网络），一种双编码器分割框架，联合利用Swin Transformer和ResNet-50特征金字塔，通过学习的逐通道门控机制动态权衡每个编码器在每个尺度的贡献。AMN使用多目标损失进行训练，该损失结合了类别加权焦点损失、具有正像素强调的边界感知损失以及一种新颖的不确定性调制分类项，用于抑制过度自信的错误预测。在涵盖七个细胞核类别的CoNIC基准上评估，AMN实现了平均Dice 0.82和平均F1 0.68，在诊断上具有挑战性的淋巴细胞类别上F1为0.67。AMN优于八种基线模型，包括纯CNN、纯Transformer和最近的混合架构：U-Net、ResU-Net、DeepLabV3+、SegNet、ViT-Small、HmsU-Net、ConvFormer-UNet和BEFUnet。在MoNuSeg上的跨数据集评估证明了无需重新训练的强泛化能力，验证了所学表示的领域鲁棒性。

英文摘要

Accurate classification of nuclei subtypes in histopathology images is critical for downstream tasks including tumor grading, immune infiltrate quantification, and prognosis prediction. Existing approaches rely on either convolutional or transformer-based encoders in isolation, limiting their ability to simultaneously capture fine-grained local texture and long-range spatial context. We present AMN (Adaptive Multi-Scale Nuclei Network), a dual-encoder segmentation framework that jointly leverages a Swin Transformer and a ResNet-50 feature pyramid, fused via a learned per-channel gating mechanism that dynamically weighs each encoder's contribution at every scale. AMN is trained with a multi-objective loss combining class-weighted focal loss, boundary-aware loss with positive-pixel emphasis, and a novel uncertainty-modulated classification term that suppresses overconfident erroneous predictions. Evaluated on the CoNIC benchmark across seven nuclei classes, AMN achieves a mean Dice of 0.82 and mean F1 of 0.68, with an F1 of 0.67 on the diagnostically challenging lymphocyte class. AMN outperforms eight baseline models spanning pure-CNN, pure-transformer, and recent hybrid architectures: U-Net, ResU-Net, DeepLabV3+, SegNet, ViT-Small, HmsU-Net, ConvFormer-UNet, and BEFUnet. Cross-dataset evaluation on MoNuSeg demonstrates strong generalization without retraining, validating the domain robustness of the learned representations.
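The per-channel gating described above can be sketched in plain Python as a convex combination of two feature vectors. This is a generic gated-fusion illustration, not the AMN implementation: the function names are hypothetical, and the gate logits are passed in directly, whereas in a trained model they would presumably be produced by a learned layer from the encoder features.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(feat_a: List[float], feat_b: List[float],
                 gate_logits: List[float]) -> List[float]:
    """Per-channel convex combination: g * a + (1 - g) * b, with g = sigmoid(logit)."""
    if not (len(feat_a) == len(feat_b) == len(gate_logits)):
        raise ValueError("all inputs must have one entry per channel")
    fused = []
    for a, b, z in zip(feat_a, feat_b, gate_logits):
        g = sigmoid(z)
        fused.append(g * a + (1.0 - g) * b)
    return fused


# Channel 0 weighs both encoders equally; channel 1 relies almost entirely on feat_a.
fused = gated_fusion([1.0, 4.0], [3.0, 6.0], [0.0, 100.0])
```

A gate near 1 lets the first encoder dominate a channel and a gate near 0 lets the second dominate, so the blend can differ per channel and, if applied separately at each pyramid level, per scale.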

2606.07635 2026-06-09 cs.CV cs.AI 交叉投稿

NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis

NeuroAlign: 用于MCI分析的动态与结构性神经影像的分层多模态融合

Xiongri Shen, Zhenxi Song, Jiaqi Wang, Yi Zhong, Leilei Zhao, Chenqi Xu, Linling Li, Yichen Wei, Lingyan Liang, Demao Deng, Luping Song, Ping Luan, Ahmed M. Anter, Shuqiang Wang, Baiying Lei, Zhiguo Zhang

发表机构 * Department of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳）计算机科学与技术学院）； School of Intelligence Science and Engineering, College of Artificial Intelligence, Harbin Institute of Technology, Shenzhen（哈尔滨工业大学（深圳）人工智能学院智能科学与工程学院）； School of Artificial Intelligence, Beijing University of Posts and Telecommunications（北京邮电大学人工智能学院）； Guangdong Key Laboratory of Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Shenzhen University Medical School, Shenzhen University（深圳大学医学部生物医学工程学院广东省生物医学测量与超声成像重点实验室）； Department of Radiology, The People’s Hospital of Guangxi Zhuang Autonomous Region, Guangxi Academy of Medical Sciences（广西壮族自治区人民医院放射科，广西医学科学院）； Shenzhen Sixth People’s Hospital (Nanshan Hospital), Huazhong University of Science and Technology Union Shenzhen Hospital（华中科技大学协和深圳医院（深圳市第六人民医院））； School of Basic Medical Sciences, Shenzhen University（深圳大学基础医学院）； Egypt-Japan University of Science and Technology (E-JUST)（埃及日本科技大学）； School of Biomedical Engineering, National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen University Medical School（深圳大学医学部生物医学工程学院，国家地方联合医学超声关键技术工程实验室，广东省生物医学测量与超声成像重点实验室）

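AI summary: Proposes NeuroAlign, a framework that fuses fMRI and DTI features through Dual-Modal Hierarchical Alignment and Dual-Domain Hierarchical Interaction for MCI/SCD detection, and designs SAM, a gradient-free attribution method for feature analysis.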

详情

AI中文摘要

功能磁共振成像（fMRI）和弥散张量成像（DTI）的多模态神经影像融合为认知障碍分析提供了互补信息，但仍面临异构特征空间和表示不对齐的挑战。我们提出NeuroAlign，一个用于结构化多模态融合的分层框架。它引入了（1）双模态分层对齐（DMHA），该模块建模多尺度动态连接并对齐动态-静态和功能-结构嵌入；以及（2）双域分层交互（DDHI），该模块实现连接级和区域级特征之间的细粒度调制和全局交互。为了支持特征级检查，我们设计了协同激活映射（SAM），一种针对DFC、SFC、ALFF和FA的无梯度、面向标记的归因方法。在GUTCM、ADNI和OASIS数据集上通过五折验证评估，NeuroAlign在MCI/SCD检测中取得了竞争性结果，并展示了初步的跨数据集可迁移性。归因分析揭示了模态特异性和部分一致的脑区模式，为多模态表示分析提供了模型驱动的证据。

英文摘要

Multimodal neuroimaging fusion of functional MRI (fMRI) and diffusion tensor imaging (DTI) provides complementary information for cognitive impairment analysis, but remains challenged by heterogeneous feature spaces and misaligned representations. We propose NeuroAlign, a hierarchical framework for structured multimodal fusion. It introduces (1) Dual-Modal Hierarchical Alignment (DMHA), which models multi-scale dynamic connectivity and aligns dynamic-static and functional-structural embeddings; and (2) Dual-Domain Hierarchical Interaction (DDHI), which enables fine-grained modulation and global interaction between connectivity- and region-level features. To support feature-level inspection, we design Synergistic Activation Mapping (SAM), a gradient-free, marker-oriented attribution method for DFC, SFC, ALFF, and FA. Evaluated on GUTCM, ADNI, and OASIS under five-fold validation, NeuroAlign achieves competitive MCI/SCD detection and preliminary cross-dataset transferability. Attribution analyses reveal modality-specific and partially consistent brain patterns, providing model-derived evidence for multimodal representation analysis.

2606.07648 2026-06-09 cs.CV cs.AI 交叉投稿

AQIFormer: A Transformer-Based Multi-View Architecture for Cross-City Air Quality Classification

AQIFormer：一种基于Transformer的多视角架构用于跨城市空气质量分类

Om Kathalkar, Nitin Nilesh, Sachin Chaudhari, Anoop Namboodiri

发表机构 * IIIT Hyderabad（印度海得拉巴国际信息技术学院）

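AI summary: Proposes AQIFormer, a transformer-based ensemble architecture that uses front-rear view fusion, weather-aware attention, and multi-task learning to reach 89.96% accuracy in cross-city air quality classification, a 14.96% improvement over existing methods.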

Comments Accepted at ICVGIP 2025 (Indian Conference on Computer Vision, Graphics and Image Processing), 9 pages, 4 figures

详情

DOI: 10.1145/3774521.3774577

AI中文摘要

空气污染是全球最严峻的环境和公共卫生挑战之一，传统的基于传感器的监测系统面临显著的可扩展性和经济性限制。基于图像的空气质量估计已成为一种有前景的替代方案，利用交通场景中大气污染物的视觉特征。然而，现有方法存在跨城市泛化能力有限以及对多视角信息利用不足的问题。我们提出AQIFormer，一种新颖的基于Transformer的集成架构，通过创新的双视图融合、天气感知注意力机制和全面的多任务学习来解决这些根本性限制。我们的方法独特地将前后交通图像与气象参数相结合，以实现跨不同城市环境的稳健空气质量分类。在包含26,678个同步前后图像对的综合数据集上进行的大量评估表明，该模型性能良好，准确率达到89.96%，比现有最优方法提高了14.96%。最重要的是，我们的模型保持了出色的跨城市泛化能力，在印度那格浦尔收集的独立数据集上达到81.67%的准确率，通过少量样本自适应仅用极少的训练样本，性能下降仅为8.29%。

英文摘要

Air pollution represents one of the most critical environmental and public health challenges globally, with traditional sensor-based monitoring systems facing significant scalability and economic constraints. Image-based air quality estimation has emerged as a promising alternative, leveraging the visual characteristics of atmospheric pollutants in traffic scenes. However, existing methods suffer from limited cross-city generalization and inadequate exploitation of multi-view perspectives. We present AQIFormer, a novel transformer-based ensemble architecture that addresses these fundamental limitations through innovative dual-view integration, weather-aware attention mechanisms, and comprehensive multi-task learning. Our approach uniquely combines front and rear traffic imagery with meteorological parameters to achieve robust air quality classification across diverse urban environments. Extensive evaluation on a comprehensive dataset of 26,678 synchronized front-rear image pairs demonstrates good performance with 89.96% accuracy, representing a 14.96% improvement over state-of-the-art methods. Most importantly, our model maintains exceptional cross-city generalization capabilities: after few-shot adaptation with minimal training samples, it achieves 81.67% accuracy on an independent dataset collected in Nagpur, India, a drop of only 8.29 percentage points.

2606.07665 2026-06-09 cs.PL cs.AI 交叉投稿

AgentCompile: An LLM-Guided Compiler for Direct CUDA Inference

AgentCompile：一种用于直接CUDA推理的LLM引导编译器

Xuanzhe Li, Ziyan Weng, Zhiyu Zhu, Junhui Hou

发表机构 * City University of Hong Kong (Dongguan)（香港城市大学（东莞））； City University of Hong Kong（香港城市大学）

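AI summary: Proposes AgentCompile, which uses an LLM to provide semantic suggestions, generates CUDA candidate implementations from templates and validates them, achieving 4x to 5.7x speedups on several Transformer models.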

Comments 11 pages, 3 figures

详情

AI中文摘要

Transformer推理日益依赖专门的编译器和运行时支持，但实际模型图仍需要语义决策，以确定哪些区域值得专门化以及哪些CUDA实现族是可行的。我们提出AgentCompile，一种LLM引导的CUDA推理编译器，仅将LLM输出用作建议性搜索元数据。给定编译器生成的区域摘要和有界候选空间，LLM提出语义标签、候选优先级、参数提示和风险注释；编译器通过模板生成CUDA候选，检查接口和硬件约束，经验性验证候选，根据测量延迟选择实现，并在专门化不受支持或无利可图时回退。在端到端自回归生成中，AgentCompile在五个代表性工作负载上，相对于PyTorch eager模式，在Qwen3-1.7B、Qwen3-4B和Llama-3.2-1B-Instruct上分别实现了平均5.66倍、4.05倍和4.26倍的加速。我们将开源该项目。

英文摘要

Transformer inference increasingly depends on specialized compiler and runtime support, but real model graphs still require semantic decisions about which regions are worth specializing and which CUDA implementation families are plausible. We present AgentCompile, an LLM-guided CUDA inference compiler that uses LLM outputs only as advisory search metadata. Given compiler-derived region summaries and bounded candidate spaces, the LLM proposes semantic labels, candidate priorities, parameter hints, and risk annotations; the compiler materializes CUDA candidates through templates, checks interface and hardware constraints, validates candidates empirically, selects implementations by measured latency, and falls back when specialization is unsupported or unprofitable. In end-to-end autoregressive generation, AgentCompile averages 5.66x, 4.05x, and 4.26x speedup over PyTorch eager on Qwen3-1.7B, Qwen3-4B, and Llama-3.2-1B-Instruct, respectively, across five representative workloads. We will open-source the project.
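The selection loop described in the abstract (advisory priorities order the search, validation filters candidates, measured latency decides, and the compiler falls back when nothing helps) can be sketched as follows. This is an illustrative sketch only: the `Candidate` fields, the `budget` parameter, and the kernel names are hypothetical and are not taken from AgentCompile.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

@dataclass
class Candidate:
    name: str
    priority: float               # advisory hint (e.g. from an LLM); only orders the search
    passes_checks: bool           # outcome of interface, hardware, and numerical validation
    latency_ms: Optional[float]   # measured latency; None if the candidate was never timed

def select_implementation(
    candidates: Sequence[Candidate],
    baseline_latency_ms: float,
    budget: int,
) -> Tuple[str, float]:
    """Pick the fastest validated candidate, or fall back to the baseline.

    The priority never overrides a measurement: it only decides which
    candidates fit inside the search budget.
    """
    ranked = sorted(candidates, key=lambda c: c.priority, reverse=True)[:budget]
    best_name, best_latency = "fallback", baseline_latency_ms
    for cand in ranked:
        if not cand.passes_checks or cand.latency_ms is None:
            continue  # invalid or unmeasured candidates are never selected
        if cand.latency_ms < best_latency:
            best_name, best_latency = cand.name, cand.latency_ms
    return best_name, best_latency

# Hypothetical candidates for one graph region.
cands = [
    Candidate("fused_attention", 0.9, True, 1.8),
    Candidate("tiled_matmul", 0.7, False, 0.9),   # fastest, but fails validation
    Candidate("naive", 0.2, True, 3.5),           # valid, but slower than baseline
]
choice = select_implementation(cands, baseline_latency_ms=2.5, budget=3)
no_gain = select_implementation(cands[2:], baseline_latency_ms=2.5, budget=3)
```

Treating the LLM output as ordering metadata rather than as a decision means a wrong hint costs search time but cannot select an incorrect or slower kernel.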

2606.07669 2026-06-09 cs.CV cs.AI 交叉投稿

MemoVAD: Resource-Efficient Video Anomaly Detection via Dynamic Semantic Memory in Edge Computing Scenarios

MemoVAD: 边缘计算场景下基于动态语义记忆的资源高效视频异常检测

Guo Li, Jiandian Zeng, Yang Li, Zihao Peng, Ke Chen, Tian Wang

发表机构 * Institute of Artificial Intelligence and Future Networks, Beijing Normal University（北京师范大学人工智能与未来网络研究院）； School of Computing and Artificial Intelligence, Southwest Jiaotong University（西南交通大学计算机与人工智能学院）； Engineering Research Center of Cloud-Edge Intelligent Collaboration on Big Data, Ministry of Education, Beijing Normal University（北京师范大学大数据云边智能协同教育部工程研究中心）

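AI summary: Proposes MemoVAD, an edge-cloud collaborative framework that selectively invokes a cloud vision-language model through an uncertainty-aware gating policy and caches prototypes in a dynamic semantic memory, improving video anomaly detection while reducing communication overhead.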

Comments Accepted by IJCAI2026

详情

AI中文摘要

在真实监控场景中部署视频异常检测（VAD）面临着对高层语义的需求以确保有效性，与边缘设备有限计算资源之间的根本矛盾。视觉语言模型（VLM）提供了丰富的开放词汇语义，但其延迟和计算成本阻碍了设备端部署。为解决这一挑战，我们提出MemoVAD，一种边缘-云协同框架，选择性地将VLM语义融入流式VAD。MemoVAD在边缘端使用轻量级检测器和因果时序上下文编码器（TCE）建模时序依赖，运行大部分推理。具体而言，我们引入基于主观逻辑的不确定性感知门控（UAG）策略，以建模感知不确定性，并仅对高不确定性和语义新颖的片段查询云端VLM。此外，设计动态语义记忆（DSM）缓存经VLM验证的原型以实现高效检索，使边缘模型通过语义适配器逐步融入VLM级语义。在真实边缘设备上对UCF-Crime和XD-Violence数据集的实验表明，MemoVAD在显著降低通信开销的同时，超越了当前最优性能。

英文摘要

Deploying Video Anomaly Detection (VAD) in real-world surveillance faces a fundamental tension between the demand for high-level semantics to ensure effectiveness and the limited computational resources of edge devices. Vision-Language Models (VLMs) provide rich open-vocabulary semantics, but their latency and computational cost preclude on-device deployment. To address this challenge, we propose MemoVAD, an edge-cloud collaborative framework that selectively incorporates VLM semantics into streaming VAD. MemoVAD runs most inference on the edge with a lightweight detector and a causal Temporal Context Encoder (TCE) to model temporal dependencies. Specifically, we introduce an Uncertainty-Aware Gating (UAG) policy grounded in Subjective Logic to model perceived uncertainty and query the cloud-based VLM only for high-uncertainty and semantically novel clips. In addition, a Dynamic Semantic Memory (DSM) is designed to cache VLM-verified prototypes for efficient retrieval, enabling the edge model to progressively incorporate VLM-level semantics via a semantic adapter. Experiments on the UCF-Crime and XD-Violence datasets on a real edge device show that MemoVAD substantially reduces communication overhead while surpassing state-of-the-art performance.
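A gate of the kind the abstract describes can be sketched with the standard Subjective Logic uncertainty mass, u = K / S, where K is the number of classes and S is the sum of the non-negative per-class evidence plus K. Everything else below is an assumption for illustration: the thresholds, the cosine-similarity novelty test against cached prototypes, and the function names are hypothetical and may differ from the actual UAG policy.

```python
import math

def subjective_uncertainty(evidence):
    """Uncertainty mass u = K / S with S = sum(evidence) + K (Dirichlet strength)."""
    k = len(evidence)
    if k == 0 or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    return k / (sum(evidence) + k)

def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def should_query_cloud(evidence, clip_embedding, memory,
                       u_thresh=0.5, sim_thresh=0.8):
    """Query the cloud model only if the clip is both uncertain and novel.

    `u_thresh` and `sim_thresh` are illustrative values, not from the paper.
    An empty memory makes every clip count as novel.
    """
    uncertain = subjective_uncertainty(evidence) > u_thresh
    best_sim = max((cosine(clip_embedding, p) for p in memory), default=-1.0)
    novel = best_sim < sim_thresh
    return uncertain and novel

memory = [[1.0, 0.0]]                                            # one cached prototype
ask_new = should_query_cloud([0.2, 0.3], [0.0, 1.0], memory)     # uncertain and novel
ask_sure = should_query_cloud([9.0, 1.0], [0.0, 1.0], memory)    # confident on the edge
ask_seen = should_query_cloud([0.2, 0.3], [1.0, 0.1], memory)    # uncertain but cached
```

Requiring both conditions is what keeps communication low in this sketch: confident clips are handled on the edge, and uncertain clips that resemble a cached prototype are served from memory.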

2606.07673 2026-06-09 cs.SD cs.AI cs.LG 交叉投稿

A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction

声带创伤性与非声带创伤性声音亢进的自动分类的分层特征工程框架

June-Woo Kim, Kangwook Jang, Minu Kim, Hyunju Lee

发表机构 * Department of Electronic Engineering, Wonkwang University（圆光大学电子工程系）； AI Convergence Research Institute, Wonkwang University（圆光大学人工智能融合研究院）； GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science and Technology（光州科学技术院GIST InnoCORE AI-Nano神经退行性疾病早期检测融合研究所）； School of Electrical Engineering, KAIST（韩国科学技术院电气工程学院）； Department of AI Convergence, Gwangju Institute of Science and Technology（光州科学技术院人工智能融合系）

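AI summary: Proposes a hierarchical feature engineering framework covering static, dynamic, ratio, and coupling features to distinguish phonotraumatic from non-phonotraumatic vocal hyperfunction; coupling features prove critical for both classes, with an AUC of 0.891 for PVH and 0.728 for NPVH.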

Comments Interspeech 2026

2606.07676 2026-06-09 q-bio.GN cs.AI 交叉投稿

Single-Cell Cross-Modal Transfer by Adversarial Fine-Tuning of Foundation Models

通过基础模型的对抗微调实现单细胞跨模态迁移

Joseph Boyd, Matthew Lyon, Martino Mansoldo, Christian Hurry, Finnian Firth

发表机构 * University of Cambridge（剑桥大学）

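AI summary: Proposes adversarial fine-tuning of single-cell foundation models for cross-modal translation between unpaired spatial transcriptomics and single-cell RNA sequencing data, outperforming multi-omics translation methods.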

2606.07681 2026-06-09 cs.SE cs.AI cs.CE cs.MA 交叉投稿

Systematic LLM Translation of Legacy Scientific Code to Differentiable Frameworks: Application to a Land Surface Model

将遗留科学代码系统性地LLM翻译为可微分框架：以陆面模型为例

Aya Lahlou, Linnia Hawkins, Pierre Gentine

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）； NASA Goddard Space Flight Center（国家航空航天局戈达德空间飞行中心）

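AI summary: Proposes a five-stage LLM-based pipeline that automatically translates legacy Fortran code into the differentiable JAX framework, enabling full Jacobian computation and a 24x speedup on the CLM-ml-v2 model.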

详情

AI中文摘要

TianJi-Environ: 用于大气环境研究的自主人工智能科学家

Haoluo Zhao, Hongchun Zhang, Nan Li, Jing-Jia Luo, Kaikai Zhang, Mengyang Yu, Nan Chen, Tao Song, Fan Meng

发表机构 * School of Artificial Intelligence, Nanjing University of Information Science and Technology（南京信息工程大学人工智能学院）； State Key Laboratory of Climate System Prediction and Risk Management (CPRM), Nanjing University of Information Science and Technology（南京信息工程大学气候系统预测与风险管理国家重点实验室）； College of Environmental Science and Engineering, Nanjing University of Information Science and Technology（南京信息工程大学环境科学与工程学院）； College of Computer Science and Technology, China University of Petroleum（中国石油大学（华东）计算机科学与技术学院）

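AI summary: Proposes TianJi-Environ, a WRF-Chem-based multi-agent framework that autonomously drives complex atmospheric-chemistry simulations, turns mechanistic hypotheses into executable configurations, experiment designs, and evidence criteria, and demonstrates auditable mechanism validation on ozone and particulate-matter cases.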

Comments 20 pages, 11 figures, 2 tables

详情

AI中文摘要

随着大气环境预测的持续改进，污染机制和反馈过程的可解释验证已成为大气化学的主要挑战。然而，基于复杂数值模型的机制验证仍然严重依赖专家知识：机制假设必须转化为可执行的实验，模型输出必须组织成可追溯的证据。我们提出了TianJi-Environ，一个用于大气化学机制验证的可审计AI科学家。TianJi-Environ建立了首个基于WRF-Chem的多智能体框架，自主驱动复杂的大气化学模拟，将机制假设转化为可执行的配置、测试实验和证据标准。以臭氧响应和颗粒物反馈作为两个代表性例子，我们展示了TianJi-Environ的机制验证能力。在华北平原的一个夏季臭氧案例中，系统在短波辐射和边界层高度中检测到方向一致的气溶胶-辐射相互作用信号，但判断臭氧对NOx控制的响应证据不完整。在关中盆地的一个冬季PM2.5案例中，系统将不支持的联系定位到黑碳扰动到颗粒物响应的传播不足以及垂直吸收加热的诊断缺失。这些结果表明，TianJi-Environ使专家驱动的机制验证变得明确、结构化和可审计，为多智能体系统与复杂大气化学模型的耦合提供了可复现的范式。

英文摘要

As atmospheric environmental prediction continues to improve, interpretable validation of pollution mechanisms and feedback processes has become a main challenge in atmospheric chemistry. Yet mechanism validation based on complex numerical models still relies heavily on expert knowledge: mechanistic hypotheses must be operationalized into executable experiments, and model outputs must be organized into traceable evidence. We present TianJi-Environ, an auditable AI Scientist for atmospheric-chemistry mechanism validation. TianJi-Environ establishes the first WRF-Chem-based multi-agent framework that autonomously drives complex atmospheric-chemistry simulations, converting mechanistic hypotheses into executable configurations, testing experiments, and evidence criteria. Using ozone response and particulate-matter feedback as two representative examples, we demonstrate TianJi-Environ's capability for mechanism validation. In a summertime ozone case over the North China Plain, the system detects directionally consistent aerosol-radiation-interaction signals in shortwave radiation and boundary-layer height, but judges the evidence for ozone response to NOx control to be incomplete. In a wintertime PM2.5 case over the Guanzhong Basin, it localizes the unsupported link to insufficient propagation from black-carbon perturbation to particulate response and missing diagnostics of vertical absorptive heating. These results show that TianJi-Environ makes expert-driven mechanism validation explicit, structured, and auditable, offering a reproducible paradigm for multi-agent systems coupled with complex atmospheric-chemistry models.

2606.07712 2026-06-09 cond-mat.mtrl-sci cs.AI 交叉投稿

MatMind: A Structure-Activity Knowledge-Driven Generative Foundation Model for Materials Science

MatMind：面向材料科学的结构-活性知识驱动生成基础模型

Zhan'ao Yao, Boxuan Zhang, Jingyuan Shu, Xiaoyu Wu, Rongyan Wang, Linjing Li, Dajun Zeng, Yudong Yao, Tingwei Chen, Youwei Wang, Xiaolin Zhao, Jiahui Shi, Jianjun Liu

发表机构 * State Key Laboratory of High Performance Ceramics（高性能陶瓷国家重点实验室）； Shanghai Institute of Ceramics, Chinese Academy of Sciences（中国科学院上海陶瓷研究所）； Center of Materials Science and Optoelectronics Engineering, University of Chinese Academy of Sciences（中国科学院大学材料科学与光电子工程中心）； School of Chemistry and Materials Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences（中国科学院大学杭州先进研究所化学与材料科学学院）； State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences（多模态人工智能系统国家重点实验室，中国科学院自动化研究所）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； Beijing Wenge Technology Co., Ltd.（北京文格科技有限公司）； College of Medicine and Biological Information Engineering, Northeastern University（东北大学医学与生物信息工程学院）

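AI summary: Proposes MatMind, an LLM-based generative foundation model for crystal materials that combines structure-activity knowledge injection, a dual-head architecture, and physics-informed reinforcement learning to surpass specialized models on property prediction, unconditional generation, and conditional generation.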

Comments 29 pages, 5 figures, including references

详情

AI中文摘要

迄今为止，AI驱动的晶体材料科学进展依赖于为单个任务构建的窄架构——用于性质预测的图神经网络、用于晶体生成的扩散和流匹配模型——每个都在其领域内表现出色，但无法作为跨整个材料问题谱系的共享骨干。生成式大语言模型提供了一种根本不同的范式，其中结构表示、定量预测和结构-活性推理可以在一个模型内统一，但材料学界尚未看到这种范式在竞争性水平上实现，与已建立的窄专家相匹敌。在此，我们提出MatMind，一种在此范式下专为晶体材料科学构建的生成基础模型，通过渐进训练框架中结构-活性知识和物理信息反馈的协调激活开发——结合结构-活性知识注入、在共享表示空间中联合训练语言推理和数值回归的双头架构，以及针对稳定性、新颖性和结构多样性的多目标物理信息强化学习。在三个任务族中，MatMind在能量高于凸包、体模量和带隙上取得最低平均绝对误差——超越专为这些任务构建的图神经网络预测器——在无条件晶体生成上达到65.3%的S.U.N.率，并在磁化密度条件生成上实现了可比的倍数提升，其中在超过600,000个训练条目中仅存在21个正样本。通过在单一统一模型内匹配或超越窄专家在其自身领域上的表现，MatMind表明基于LLM的范式可以作为晶体材料科学未来的可行骨干。

英文摘要

Progress in AI-driven crystal materials science has so far been carried by narrow architectures purpose-built for individual tasks -- graph neural networks for property prediction, diffusion and flow-matching models for crystal generation -- each excelling within its niche yet unable to act as a shared backbone across the full spectrum of materials problems. Generative large language models offer a fundamentally different paradigm, in which structural representation, quantitative prediction, and structure-activity reasoning can be unified within one model, but the materials community has yet to see this paradigm realized at a level competitive with established narrow specialists. Here we present MatMind, a generative foundation model purpose-built for crystal materials science under this paradigm, developed through the coordinated activation of structure-activity knowledge and physics-informed feedback within a progressive training framework -- combining structure-activity knowledge injection, a dual-head architecture that jointly trains language reasoning and numerical regression in a shared representation space, and multi-objective physics-informed reinforcement learning over stability, novelty, and structural diversity. Across three task families, MatMind attains the lowest mean absolute error on energy above hull, bulk modulus, and band gap -- surpassing graph neural network predictors purpose-built for these tasks -- reaches an S.U.N. rate of 65.3% on unconditional crystal generation, and achieves a comparable multiplicative improvement on magnetization-density-conditioned generation, where only 21 positive samples exist within over 600000 training entries. By matching or surpassing narrow specialists on their own ground while operating within a single unified model, MatMind shows that the LLM-based paradigm can serve as a viable backbone for crystal materials science going forward.

2606.07714 2026-06-09 cs.LG cs.AI cs.HC 交叉投稿

Beyond Accuracy: Interpreting Topic Representation in Suicide Ideation Detection Models

超越准确率：解释自杀意念检测模型中的主题表示

Hamideh Ghanadian, Isar Nejadgholi, Hussein Al Osman

发表机构 * University of Ottawa（渥太华大学）； National Research Council Canada（加拿大国家研究委员会）

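AI summary: Uses visualization and geometric analysis to examine how suicide ideation detection models internally encode psychological risk factors, finding that topic augmentation improves the clarity and interpretability of representations for underrepresented risk factors.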

详情

AI中文摘要

自杀意念检测模型通常使用聚合性能指标进行评估，但对其内部如何表示具有心理意义的风险因素知之甚少。在高风险心理健康应用中，理解这些内部表示对于安全性、透明度和负责任部署至关重要。在这项工作中，我们超越准确率，分析在原始和主题增强数据集上训练的自杀检测模型如何在其内部表示空间中编码心理风险因素。通过可视化和几何分析，我们检查主题相关特征的连贯性和可分离性。我们的结果表明，主题感知增强提高了低表征心理社会风险因素（如移民、家庭问题和金融危机）的清晰度和区分度。这些发现表明，增强不仅提高了模型性能，还导致了更结构化和可解释的内部表示。

英文摘要

Suicide ideation detection models are typically evaluated using aggregate performance metrics, yet little is known about how they internally represent psychologically meaningful risk factors. In high-stakes mental health applications, understanding these internal representations is essential for safety, transparency, and responsible deployment. In this work, we move beyond accuracy and analyze how suicide detection models trained on original and topic-augmented datasets encode psychological risk factors in their internal representation space. Using visualization and geometric analysis, we examine the coherence and separability of topic-related features. Our results show that topic-aware augmentation increases the clarity and distinctness of underrepresented psychosocial risk factors such as immigration, family issues, and financial crisis. These findings suggest that augmentation not only improves model performance but also leads to more structured and interpretable internal representations.

2606.07717 2026-06-09 eess.IV cs.AI cs.CV 交叉投稿

Multi-planar 2D-U-Net Segmentation of 3D-CT Abdominal Organs augmented by Spatial Occurrence Maps

多平面2D-U-Net分割3D-CT腹部器官，辅以空间出现图

Daria Kern, Negar Chabi, Souraj Adhikary, Andre Mastmeyer

发表机构 * Glasgow Caledonian University School of Science & Engineering（格拉斯哥卡里多尼亚大学科学与工程学院）； Jade University of Applied Sciences Department of Engineering & Medical Technology（雅德应用科学大学工程与医疗技术系）

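AI summary: Proposes a lightweight 2D-U-Net framework combining coarse-to-fine segmentation, multi-planar prediction, and fuzzy 3D spatial maps, improving the Dice coefficient by about 4% on 80 CT scans.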

Comments 11 pages, 9 figures, 1 table, http://www.wscg.eu/

详情

AI中文摘要

本工作提出一个基于2D-U-Net的轻量级框架，用于在大视野3D CT扫描中分割五个腹部器官。该方法结合了粗到细分割、来自多个解剖平面的预测以及额外的模糊3D空间图，这些空间图提供解剖位置线索以提高分割精度。我们结合了由空间出现图增强的多平面2D-U-Net模型。该方法包括两个主要阶段。首先，通过使用2D-U-Net轴向遍历整个扫描并确定5个目标腹部器官的x-y-z最小和最大范围来检测腹部感兴趣区域。其次，我们在前一阶段的边界内使用空间出现图来增强我们的多平面2D-U-Net架构。该方法在来自各种公共来源的80个CT扫描上进行评估。结果显示，与未使用空间出现图训练的相同模型相比，Dice系数最大提升约4%。

英文摘要

This work proposes a lightweight 2D-U-Net-based framework for segmenting five abdominal organs in large field-of-view 3D CT scans. The method combines coarse-to-fine segmentation, predictions from multiple anatomical planes, and additional fuzzy 3D spatial maps that provide anatomical location cues to improve segmentation accuracy. The approach involves two main stages. First, the abdominal volume of interest is detected by traversing the whole scan axially with a 2D-U-Net and determining the minimum and maximum x, y, and z extents of the five abdominal organs of interest. Second, spatial occurrence maps augment the multi-planar 2D-U-Net architecture inside the bounds from the first stage. The method is evaluated on 80 CT scans from various public sources. The results show Dice improvements of up to about 4% compared to the same model trained without spatial occurrence maps.
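The first, coarse stage amounts to turning a stack of per-slice organ masks into an axis-aligned bounding box and cropping the scan to it. A minimal sketch, assuming the volume is a nested list indexed as `mask[z][y][x]` and that the coarse masks already exist; the function names are hypothetical and the paper's own implementation may differ:

```python
def bounding_box(mask):
    """Return ((xmin, xmax), (ymin, ymax), (zmin, zmax)) of all truthy voxels.

    `mask` is indexed as mask[z][y][x]. Returns None when the mask is empty,
    i.e. when the coarse stage found no organ voxels at all.
    """
    xs, ys, zs = [], [], []
    for z, plane in enumerate(mask):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                if value:
                    xs.append(x)
                    ys.append(y)
                    zs.append(z)
    if not xs:
        return None
    return (min(xs), max(xs)), (min(ys), max(ys)), (min(zs), max(zs))

def crop(volume, box):
    """Crop a [z][y][x] volume to an inclusive bounding box."""
    (x0, x1), (y0, y1), (z0, z1) = box
    return [[row[x0:x1 + 1] for row in plane[y0:y1 + 1]]
            for plane in volume[z0:z1 + 1]]

# Toy volume: 3 axial slices, each 3 rows by 4 columns, with two organ voxels.
mask = [[[0] * 4 for _ in range(3)] for _ in range(3)]
mask[1][1][2] = 1
mask[2][0][1] = 1
box = bounding_box(mask)
roi = crop(mask, box)
```

The second stage would then run only inside `roi`, which is what keeps the fine segmentation lightweight on large field-of-view scans.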

2606.07828 2026-06-09 cs.SE cs.AI 交叉投稿

Jas: AI-Paired Engineering as a Revival of N-Version Programming

Jas：AI配对工程作为N版本编程的复兴

Jason Hickey

发表机构 * Independent（独立）

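AI summary: Through a case study of a single developer porting a vector illustration application across platforms, presents an AI-paired engineering methodology that combines a precise YAML specification with parallel implementations as a differential-testing layer, making work that conventionally takes multiple developer-years feasible and framing it as a revival of N-version programming.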

详情

AI中文摘要

我报告了一个AI配对软件工程的案例研究：由单个开发者在约120个晚间小时内完成的五个矢量插图应用的工作移植，分别基于Rust、Swift、OCaml、Python和浏览器平台。该方法将AI辅助实现与两个保障措施配对——一个精确的可执行YAML规范作为单一事实来源，以及并行实现作为内置差分测试层。五个移植共享23,000行的规范；每个移植的原生代码范围从0到约95,000行，反映了规范的逃生口。我认为，在具备这两个保障措施的条件下，AI配对工程使得传统上需要多个开发者年的工作范围变得可行，并将该方法框架为N版本编程的复兴，这是一种因成本原因被放弃的1980年代方法，而AI改变了这一状况。论文报告了具体工件和单开发者案例研究的诚实局限性。

英文摘要

I report a case study in AI-paired software engineering: five working ports of a vector illustration application across Rust, Swift, OCaml, Python, and browser-based platforms, built by a single developer in approximately 120 evening hours. The methodology pairs AI-assisted implementation with two safeguards -- a precise executable YAML specification serving as the single source of truth, and parallel implementations functioning as a built-in differential-testing layer. The five ports share a 23,000-line specification; per-port native code ranges from 0 to roughly 95,000 lines, reflecting the specification's escape hatch. I argue that AI-paired engineering, conditional on these two safeguards, makes feasible a scope of work that conventionally requires multiple developer-years, and frame the methodology as a revival of N-version programming, a 1980s approach abandoned on cost grounds that AI changes. The paper reports concrete artifacts and honest limitations of the single-developer case study.
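Using parallel implementations as a differential-testing layer can be sketched as follows: feed the same inputs to every implementation and flag any input on which they disagree. In this sketch the "ports" are plain Python callables standing in for implementations in different languages, and the angle-normalization task, the port names, and the majority-vote reporting are all hypothetical choices for illustration, not details from the paper.

```python
import math
from collections import Counter

def differential_test(implementations, inputs):
    """Run every implementation on every input and report disagreements.

    Returns a list of (input, majority_output, outlier_names). Outputs must
    be hashable. An input appears in the result only when at least two
    implementations produced different outputs.
    """
    disagreements = []
    for item in inputs:
        outputs = {name: impl(item) for name, impl in implementations.items()}
        counts = Counter(outputs.values())
        if len(counts) > 1:
            majority, _ = counts.most_common(1)[0]
            outliers = sorted(n for n, out in outputs.items() if out != majority)
            disagreements.append((item, majority, outliers))
    return disagreements

# Three stand-in "ports" of one spec: normalize an integer angle to [0, 360).
ports = {
    "port_a": lambda a: a % 360,
    "port_b": lambda a: a - 360 * math.floor(a / 360),
    # Deliberately buggy: fmod keeps the sign of its first argument.
    "port_c": lambda a: int(math.fmod(a, 360)),
}
report = differential_test(ports, [30, 390, -90])
```

Here the first two inputs produce agreement, while `-90` exposes the sign bug in `port_c`. The majority vote only points at a likely culprit; the specification remains the arbiter of which output is correct.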

2606.07836 2026-06-09 cond-mat.mtrl-sci cond-mat.stat-mech cs.AI physics.comp-ph quant-ph 交叉投稿

Agentic multi-fidelity learning of quasiparticle and excitonic properties

准粒子和激子性质的智能多保真学习

Arnab Neogi, Aaron Forde, Christopher A. Lane, Sergei Tretiak, Jian-Xin Zhu

发表机构 * Theoretical Division, Los Alamos National Laboratory（洛斯阿拉莫斯国家实验室理论部）； Center for Integrated Nanotechnologies, Materials Physics and Applications Division, Los Alamos National Laboratory（集成纳米技术中心，材料物理与应用部，洛斯阿拉莫斯国家实验室）

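AI summary: Proposes an agent-guided multi-fidelity framework that uses confidence weighting, a small number of high-accuracy reference points, and machine learning to correct numerical instabilities in GW-BSE calculations and accurately predict quasiparticle gaps and exciton binding energies in strained MoS2-WS2 bilayers.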

详情

AI中文摘要

多体GW-Bethe-Salpeter方程计算对于现代低维纳米材料中电子结构和光学性质的精确模拟至关重要。然而，这些方法计算量大，并且可能表现出局部数值不稳定性或收敛失败，在高通量工作流程中难以检测。我们引入了一个智能引导的多保真框架，用于校正应变MoS2-WS2双层中的GW-Bethe-Salpeter激发态景观。在不同堆叠配准、应变分支和倒空间采样下，该工作流程识别出与脆弱的长波介电屏蔽相关的尖峰状偏移、近零带隙塌缩和交叉保真不一致性。一个结构智能体通过分配置信度权重并选择性地使用少量高精度参考点来评估计算。然后，机器学习模型在相关系统间传递信息，并应用高斯过程校正来恢复改进的准粒子带隙和激子结合能，并带有校准的不确定性估计。该方法纠正了数值诱导的伪影，而不消除物理应变依赖性，并且与无智能体基线相比，显著提高了与更高保真度参考的一致性。这些结果表明，激发态材料的可靠替代学习需要明确诊断数值脆弱性，而不是直接插值原始第一性原理数据点。所提出的框架可轻松转移到其他以强量子限制为特征的光电纳米材料，例如量子点、纳米带、层状二维半导体和混合钙钛矿纳米结构。

英文摘要

Many-body GW-Bethe-Salpeter equation calculations are essential for accurate simulations of electronic structure and optical properties in modern low-dimensional nanomaterials. However, these methods are computationally demanding and can exhibit localized numerical instabilities or convergence failures that are difficult to detect within high-throughput workflows. We introduce an agent-guided multi-fidelity framework for correcting GW-Bethe-Salpeter excited-state landscapes in strained MoS2-WS2 bilayers. Across stacking registries, strain branches and reciprocal-space samplings, the workflow identifies spike-like excursions, near-zero-gap collapse and cross-fidelity inconsistencies associated with fragile long-wavelength dielectric screening. A structural agent evaluates calculations by assigning confidence weights and selectively using a small number of high-accuracy reference points. Machine learning models then transfer information across related systems and apply Gaussian process corrections to recover improved quasiparticle gaps and exciton binding energies, with calibrated uncertainty estimates. The approach corrects numerically induced artifacts without erasing physical strain dependence and substantially improves agreement with higher-fidelity references relative to a no-agent baseline. These results show that reliable surrogate learning for excited-state materials requires explicit diagnosis of numerical fragility, not direct interpolation of raw first-principles data points. The proposed framework is readily transferable to other optoelectronic nanomaterials characterized by strong quantum confinement, such as quantum dots, nanoribbons, layered two-dimensional semiconductors, and hybrid perovskite nanostructures.

URL PDF HTML ☆

赞 0 踩 0

2606.07907 2026-06-09 cs.CV cs.AI 交叉投稿

3D Oral Modelling with Improved Vertex Distribution Using Matching-Based Learning

基于匹配学习的改进顶点分布的3D口腔建模

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

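AI summary: To address uneven predicted vertex distribution in 3D oral reconstruction, proposes an improved loss function combining Hungarian matching with filtering and a repulsion loss, yielding more uniformly distributed vertices and alleviating clustering at the cost of somewhat lower accuracy.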

Comments 5 pages, 7 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025

详情

AI中文摘要

在我们之前的工作中，提出了一个基于深度学习的3D口内重建框架。该模型直接从十张固定角度的口内图像预测显式3D点云坐标，采用MobileNetV2和多头注意力进行多视图特征融合，并使用L1损失和倒角距离的组合作为损失函数。尽管模型达到了77.49%的准确率，但预测顶点倾向于集中在真实值的高密度区域，而其他区域大部分未被覆盖。在本文中，提出了一种改进的损失函数来解决这一局限性。引入了带过滤的匈牙利匹配和排斥损失，以强制重建模型上的顶点分布更加均匀。所提出的模型达到了68.02%的准确率，数值上低于之前的模型。然而，先前工作中观察到的顶点聚集问题得到了显著缓解，预测顶点在整个重建表面上分布更加均匀。

英文摘要

In our previous work, a deep learning-based framework for 3D intraoral reconstruction was proposed. The model directly predicts explicit 3D point cloud coordinates from ten fixed-angle intraoral images, employing MobileNetV2 and Multi-head Attention for multi-view feature fusion, with a combined L1 Loss and Chamfer Distance as the loss function. Although the model achieved an accuracy of 77.49%, predicted vertices tended to concentrate in high-density regions of the ground truth, leaving other regions largely uncovered. In this paper, an improved loss function is proposed to address this limitation. Hungarian matching with filtering and Repulsion Loss are introduced to enforce more uniform vertex distribution across the reconstructed model. The proposed model achieves an accuracy of 68.02%, which is numerically lower than the previous model. However, the vertex clustering issue observed in the prior work is substantially alleviated, with predicted vertices distributed more evenly across the entire reconstructed surface.

2606.07923 2026-06-09 cs.DB cs.AI cs.LG 交叉投稿

基于自监督视觉Transformer的CBCT颞下颌关节骨关节炎检测

Shradhdha Trivedi, Vrundan Sojitra, Mariela Padilla

发表机构 * Herman Ostrow School of Dentistry, University of Southern California（南加州大学赫尔曼·奥斯特罗牙科学院）； Viterbi School of Engineering, University of Southern California（南加州大学维特比工程学院）

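AI summary: Studies how DINO-family self-supervised ViTs transfer to TMJ osteoarthritis detection on CBCT, finding that partially unfreezing the final two transformer blocks raises AUC from 0.671 to 0.902, which indicates that adaptation strategy matters more than backbone choice.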

详情

AI中文摘要

颞下颌关节骨关节炎（TMJ OA）是一种常见的退行性疾病，其骨性改变在锥形束CT（CBCT）上通常很细微，使得自动检测具有挑战性。我们研究了DINO系列自监督视觉Transformer——DINOv1、DINOv2、DINOv2+reg和RAD-DINO（一种放射学预训练变体）——迁移到CBCT的效果，询问需要多少以及何种骨干适应。我们提出了一种简单的基于切片的流程，使用视觉Transformer（ViT）骨干：轴向CBCT切片由冻结或部分适应的ViT逐切片编码，并通过基于注意力的多实例学习（MIL）聚合，用于患者级别的二分类OA/正常分类。通过在多源CBCT数据集上对解冻策略和聚合设计进行系统消融，我们发现部分解冻最后两个Transformer块是决定性因素，将AUC从0.671（完全冻结的DINOv2）提高到0.902。这优于DINOv1（0.867）、DINOv2+reg（0.774）和有监督的ImageNet ViT-B/16基线（0.843）。我们的结果为在低数据医学影像设置中适应DINO系列基础模型提供了实用指导，表明适应策略比骨干选择本身更能驱动性能。

英文摘要

Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes are often subtle on cone-beam CT (CBCT), making automated detection challenging. We study how well the DINO family of self-supervised vision transformers -- DINOv1, DINOv2, DINOv2+reg, and RAD-DINO (a radiology-pretrained variant) -- transfers to CBCT, asking how much backbone adaptation is needed and of what kind. We propose a simple slice-based pipeline using Vision Transformer (ViT) backbones: axial CBCT slices are encoded per-slice by a frozen or partially adapted ViT and aggregated via attention-based multiple instance learning (MIL) for patient-level binary OA/Normal classification. Through systematic ablation across unfreezing strategies and aggregation designs on a multi-source CBCT dataset, we find that partial unfreezing of the final two transformer blocks is the decisive factor, improving AUC from 0.671 (fully frozen DINOv2) to 0.902. This outperforms DINOv1 (0.867), DINOv2+reg (0.774), and a supervised ImageNet ViT-B/16 baseline (0.843). Our results provide practical guidance for adapting DINO-family foundation models in low-data medical imaging settings, showing that adaptation strategy is a stronger driver of performance than backbone choice alone.
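The aggregation step, attention-based multiple instance learning over per-slice embeddings, reduces to a softmax-weighted average. The sketch below assumes the attention scores are already given; in an attention-MIL model they would come from a small learned scoring network, which is omitted here. The function names and toy embeddings are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(slice_embeddings, attention_scores):
    """Aggregate per-slice embeddings into one patient-level embedding.

    Each slice gets a weight from a softmax over its attention score, and the
    patient embedding is the weighted sum of the slice embeddings.
    """
    if not slice_embeddings or len(slice_embeddings) != len(attention_scores):
        raise ValueError("need one attention score per slice embedding")
    weights = softmax(attention_scores)
    dim = len(slice_embeddings[0])
    pooled = [0.0] * dim
    for weight, embedding in zip(weights, slice_embeddings):
        for i, value in enumerate(embedding):
            pooled[i] += weight * value
    return pooled, weights

slices = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform, w_uniform = attention_pool(slices, [0.0, 0.0, 0.0])    # plain average
focused, w_focused = attention_pool(slices, [0.0, 0.0, 100.0])  # third slice dominates
```

The returned weights double as a rough indication of which slices drove the patient-level prediction, which is one reason this pooling is common when only patient-level labels are available.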

2606.08476 2026-06-09 cs.DC cs.AI 交叉投稿

FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training

FlashCP: 面向LLM训练的负载均衡且通信高效的上下文并行

Zheng Wang, Eric Liu, Linan Jiang, Zhongkai Yu, Zaifeng Pan, Yue Guan, Yuke Wang, Yufei Ding

发表机构 * Stanford University（斯坦福大学）

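AI summary: Proposes FlashCP, a framework that eliminates redundant KV transfers through sharding-aware communication and designs a Whole-Doc sharding strategy with a heuristic algorithm to achieve load balance and communication efficiency, with speedups of up to 1.63x across multiple datasets.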

Comments 10 pages, 6 figures

2606.08590 2026-06-09 cs.SE cs.AI cs.DC 交叉投稿

Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents

可审计的图引导的Kubernetes事件根因分析

Anastasiia Kuvshinova, Seungmin Jin

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

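AI summary: Proposes Graph Traversal Agent, which combines LLM reasoning with deterministic graph operations and uses a typed evidence graph, bounded search, and independent validation for auditable root cause analysis, raising F1 on ITBench from 0.6087 to 0.9130.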

Comments 8 pages, 1 figure. Preprint

详情

AI中文摘要

只有当根因系统报告的结果来自事件证据而非特定场景的捷径时，Kubernetes事件才能被可靠诊断。我们提出Graph Traversal Agent，一种图引导的根因分析代理，将LLM推理与专用工具相结合。该模型在类型化证据图上进行推理，而确定性图和工具操作收集证据、限制搜索并检查提出的结论。我们将操作约束（包括只读证据收集、传播感知诊断、有界执行和独立验证的结论）映射到类型化事件图、LangGraph遍历状态机和独立的验证阶段。在由固定qwen-plus裁判评分的ITBench快照上，经过审计的系统在23个场景的公共子集上，根因实体F1从同一系统早期迭代的0.6087提升至0.9130。提示级消融实验将提示调优带来的提升与去除场景特定提示后仍保留的提升区分开：在19个场景的子集上，剥离提示的配置保留了0.6958的F1。保留的提升集中在ChaosMesh场景上，其真实根因是证据图中已存在的注入故障对象，因此我们将其报告为基准耦合而非广泛的跨集群根因分析证据。轻量级检查（包括相同裁判比较、提示级消融、级联源检查和遥测无泄漏测试）将声明标记为支持、待定或超出范围。我们将工作范围限定为ITBench OpenTelemetry-demo快照。实时集群试验作为工程压力测试，但警报状态和跟踪可用性不足以稳定进行受控评分，因此我们不声称生产就绪或平均修复时间。

英文摘要

Kubernetes incidents are diagnosed reliably only when a root-cause system's reported gains come from incident evidence rather than scenario-specific shortcuts. We present Graph Traversal Agent, a graph-guided RCA agent that combines LLM reasoning with specialized tools. The model reasons over a typed evidence graph, while deterministic graph and tool operations collect evidence, bound the search, and check proposed verdicts. We map operational constraints, including read-only evidence collection, propagation-aware diagnosis, bounded execution, and independently validated verdicts, to a typed incident graph, a LangGraph traversal state machine, and a separate validation stage. On ITBench snapshots scored by one fixed qwen-plus judge, the audited system raises root-cause-entity F1 over an earlier iteration of the same system from 0.6087 to 0.9130 on a 23-scenario common subset. A prompt-level ablation separates prompt-tuned gains from gains that survive once scenario-specific hints are removed: the stripped-prompt configuration retains 0.6958 F1 on a 19-scenario subset. The surviving gain concentrates on ChaosMesh scenarios whose ground-truth root cause is the injected fault object already present in the evidence graph, so we report it as benchmark-coupled rather than broad cross-cluster RCA evidence. Lightweight checks, including same-judge comparison, prompt-level ablation, cascade-source checking, and a telemetry no-leak test, mark claims as supported, pending, or out of scope. We scope the work to ITBench OpenTelemetry-demo snapshots. Live-cluster trials served as an engineering stress test, but alert state and trace availability did not stay stable enough for controlled scoring, so we make no production-readiness or mean-time-to-repair claim.

2606.08630 2026-06-09 cs.LG cs.AI (cross-listed)

Tyan-WP: A Wind Power Foundation Model for Ultra-Short-Term Probabilistic Forecasting

Jiahui Huang, Ao Luo, Lei Liu, Hongwei Zhao, Tengyuan Liu, Ruibo Guo, Bo Wang, Zhao Wang, Bin Li

Affiliations: School of Information Science and Technology, University of Science and Technology of China; China Electric Power Research Institute

AI summary: Proposes Tyan-WP, the first wind power foundation model, which uses static site embeddings and a power-aware meteorological fusion module to deliver ultra-short-term probabilistic forecasts in zero-shot settings, significantly outperforming traditional models.

Abstract

Global wind power capacity, especially in China, is booming, with new farms spanning diverse terrains and climates. The industry urgently needs accurate wind power foundation models to shorten commissioning and accelerate grid connection. This is because site-specific time series models (TSMs) are not well suited to data-scarce scenarios and generalize poorly, while generic large time series models (LTSMs) are mostly limited to univariate inputs and cannot fully exploit static site attributes or the dependencies between power and meteorological covariates, leading to insufficient accuracy. To fill this gap, we propose Tyan-WP, the first wind power foundation model for ultra-short-term probabilistic forecasting. Pretrained on a large-scale wind power dataset covering more than 126,000 U.S. sites over seven years, Tyan-WP further improves zero-shot forecasting through two domain-specific modules: a static site embedding that uses coordinate, terrain, and ecoregion metadata, and a power-aware meteorological fusion (PAMF) module that models interactions between historical power and meteorological covariates. Under a unified evaluation protocol, Tyan-WP surpasses eight site-specific supervised TSMs on 10 in-domain sites and outperforms eleven generic LTSMs on 127 in-domain sites, reducing MAE by 19.9%, RMSE by 16.6%, CRPS by 22.2%, and AQL by 21.7%, while raising R^2 by 16.7%. It further demonstrates strong cross-geography generalization on six real U.K. sites. These results show that a wind power foundation model can achieve accurate zero-shot forecasting without target-site training, providing a practical pathway for rapid turbine onboarding and probabilistic risk management at new wind farms.

2606.08649 2026-06-09 cs.CR cs.AI (cross-listed)

Sample-Efficient LLM-Based Detection of Malicious Web Server Logs with Forensically Explainable Reasoning

Bernhard Kneip, Nhien-An Le-Khac, Hong-Hanh Nguyen-Le

Affiliations: University of Tuebingen

AI summary: Proposes the CEF-Log strategy, which uses a five-step reasoning template so that large language models learn a log-analysis methodology, reaching F1 = 0.99 on the CSIC 2010 dataset with only 4 examples and a 10x improvement in sample efficiency, and introduces the new ForenWebLog dataset.

Abstract

Forensic analysis of web server logs demands both accurate detection and human-readable explanations that can satisfy legal requirements. We present CEF-Log, a context-enhanced few-shot chain-of-thought prompting strategy for Large Language Models that addresses this dual requirement. CEF-Log embeds expert investigative methodology through a structured five-step reasoning template, enabling the model to learn how to analyze logs rather than what patterns to memorize. Experimental evaluation demonstrates that CEF-Log achieves an F1-score of 0.99 on the CSIC 2010 dataset using only four examples while providing a $10\times$ improvement in sample efficiency compared to other prompting-based methods. We also introduce ForenWebLog, a new dataset that incorporates real-world attacks and multi-step attack sequences for comprehensive evaluation. Qualitative analysis confirms that CEF-Log generates traceable, accurate explanations suitable for forensic documentation, addressing the critical "black-box" limitation of traditional machine learning approaches.

2606.08652 2026-06-09 astro-ph.SR cs.AI cs.CV (cross-listed)

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

Hong Guo, Nianhui Guo, Weixing Wang, Jona Otholt, Christoph Meinel, Haojin Yang

Affiliations: Hasso Plattner Institute; GreenBit.AI; German University of Digital Science

AI summary: Targets the CUDA Core dequantization bottleneck in W4A4 quantization with a ρ-aware granularity adaptation method based on intra-SM compute balance and a pure INT4 GEMM kernel design, achieving speedups of up to 2.09x across several GPUs.

Abstract

W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing systems to mixed-precision fallbacks. We present the first systematic study of how intra-SM compute balance governs this bottleneck. Through controlled benchmarks across four GPUs from the Ampere and Ada architectures, we identify the Tensor Core to CUDA Core throughput ratio ($ρ$) as the primary hardware indicator: the W4A4-g128 kernel yields a $2.0$--$2.5\times$ speedup on RTX 3090 ($ρ=16$) yet degrades to $0.43$--$0.47\times$ on A100 ($ρ=64$) in compute-bound scenarios, establishing W4A4 viability as platform-dependent rather than universally infeasible. Guided by this finding, we build APEX4, which co-designs pure INT4 GEMM kernels with $ρ$-aware granularity adaptation to mitigate the CUDA Core dequantization bottleneck. APEX4 achieves perplexity within 0.63 of FP16 on LLaMA-2-70B and outperforms W4Ax Atom-g128 by 4.0%--4.4% in zero-shot accuracy. Deployed as a drop-in replacement in unmodified vLLM, it delivers up to $1.66\times$ end-to-end speedup on L40S ($ρ=8$), $1.78\times$ on RTX 3090 ($ρ=16$), and $2.09\times$ on A40 ($ρ=16$), while recovering A100 ($ρ=64$) to $1.20$--$1.40\times$ via the mixed-granularity mode.
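
The group dequantization step that the abstract identifies as the bottleneck can be illustrated with a small reference sketch. It assumes symmetric per-group INT4 quantization and reads "g128" as a group size of 128; neither detail is confirmed by the abstract, and this is plain Python for exposition, not the APEX4 kernel.

```python
def quantize_groupwise_int4(values, group_size=128):
    """Symmetric per-group INT4 quantization (assumed scheme).

    Returns integer codes in [-8, 7] and one floating-point scale per group.
    """
    codes, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        # Map the largest magnitude in the group to code 7
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(v / scale))) for v in group)
    return codes, scales

def dequantize_groupwise(codes, scales, group_size=128):
    """Per-element rescaling: the work that lands on general-purpose cores."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```

Smaller groups track the data more closely but require more scale multiplications per output, which is the granularity trade-off the abstract describes in terms of $ρ$.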

2606.08793 2026-06-09 cs.SE cs.AI (cross-listed)

AI-Augmented Closed-Loop Quality Engineering: A Reference Architecture for Continuous Software Quality Intelligence

Dimple Bajaj

Affiliations: Dimple Bajaj

AI summary: Proposes an AI-augmented closed-loop reference architecture that combines requirement feature mining, risk-based test prioritization, defect prediction, and production incident analysis with a limited feedback learning model, reducing defect leakage, improving detection effectiveness, and shortening test execution time over six release cycles.

Comments: 15 pages, 4 figures

Abstract

Software quality engineering remains challenging because the processes for requirements, testing, and production are disconnected, which makes it hard to carry quality strategies across consecutive releases. Existing approaches tend to rely on fixed models or a single optimization objective and lack mechanisms for learning from production feedback. This paper proposes an AI-augmented closed-loop reference architecture for continuous software quality intelligence. The architecture integrates requirement feature mining, risk-based test prioritization, defect prediction, and production incident analysis as elements of a feedback-based pipeline. A limited feedback learning model propagates production signals, weighted by defect severity and incident impact, to the following release to ensure stability over time. The method is evaluated on a semi-synthetic test dataset of 4,500 requirements, 27,049 test cases, 13,089 defects, and 7,841 incidents across six release cycles. The experimental results show that the proposed system reduces defect leakage from 0.19 to 0.13, raises detection effectiveness from 0.72 to 0.84, and shortens test execution by up to 35 percent compared with non-adaptive baselines. These improvements are stable from release to release. The findings indicate that integrating feedback-based learning into a closed-loop architecture enables continuous improvement of the quality process, offering a practical foundation for adaptive software quality engineering.

2606.08816 2026-06-09 cs.LG cs.AI (cross-listed)

Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors

Jake Fawkes, Liam Hodgson, Jason Hartford

Affiliations: University College London; University of Manchester; Valence Labs; Recursion

AI summary: Shows that a K-nearest-neighbour method over a knowledge graph performs strongly on gene knockout perturbation prediction, and that combining it with an LLM optimised via reinforcement learning reaches state-of-the-art performance.

Abstract

Predicting the effect of an unseen gene knockout perturbation on transcriptomic gene expression remains a highly challenging problem for virtual cell models. Recent progress has been made by leveraging biological knowledge graphs to provide a notion of similar perturbation, allowing for improved extrapolation beyond the set of training perturbations. In this work, we demonstrate that the simplest model to leverage these assumptions - a K-nearest neighbour from the knowledge graph - achieves highly competitive performance on this task, and that this can be improved further using LLMs optimised via reinforcement learning (RL) for predictive performance. Specifically, we find that the K-nearest neighbour approach beats almost all methods on out-of-distribution perturbation prediction, and when a reasoning LLM is trained via RL to make changes to the neighbourhood, it obtains equivalent performance to current state of the art methods on the cell lines from Replogle et al. (2022). We also demonstrate that the RL training improves the LLM's performance on the downstream task of differential expression prediction, despite not being trained on this directly. Overall, these findings demonstrate the efficacy of knowledge graphs as model priors, and show early signs that RL can refine LLMs into generalizable tools for predicting complex biological responses.
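
The K-nearest-neighbour baseline described above can be sketched in a few lines. The sketch assumes each gene has a vector embedding derived from the knowledge graph and uses cosine similarity with a plain average of neighbour effects; the abstract does not specify the similarity measure or the averaging rule, and the gene names below are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query_embedding, train_embeddings, train_effects, k=3):
    """Predict the expression change of an unseen knockout.

    train_embeddings: gene -> knowledge-graph embedding (assumed given).
    train_effects: gene -> measured expression-change vector.
    Returns the mean effect of the k most similar training knockouts.
    """
    ranked = sorted(train_embeddings,
                    key=lambda g: cosine(query_embedding, train_embeddings[g]),
                    reverse=True)
    neighbours = ranked[:k]
    n_genes = len(train_effects[neighbours[0]])
    prediction = [sum(train_effects[g][j] for g in neighbours) / len(neighbours)
                  for j in range(n_genes)]
    return prediction, neighbours
```

In the setting the abstract describes, the RL-trained LLM would edit the `neighbours` list before averaging; that step is not modelled here.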

2606.08858 2026-06-09 cs.CV cs.AI (cross-listed)

Intelligent Character Recognition of Handwritten Forms with Deep Neural Networks

Hartwig Grabowski

Affiliations: Institute for Machine Learning and Analytics (IMLA), Offenburg University

AI summary: Proposes a handwritten character recognition approach that merges detection and classification into a single task with a deep neural network, trained on artificially synthesized data, reaching a recognition rate of 88.28% on real exam data.

Comments: Author's accepted manuscript of a published Springer book chapter. 14 pages, 16 figures

DOI: 10.1007/978-3-031-42532-5_6
Journal ref: In: Cavallucci D., Livotov P., Brad S. (eds), Towards AI-Aided Invention and Innovation, IFIP Advances in Information and Communication Technology, vol. 682, Springer Nature Switzerland, 2023, pp. 81-94

Abstract

The automatic processing of handwritten forms remains a challenging task, in which the detection and subsequent classification of handwritten characters are essential steps. We describe a novel approach in which both steps -- detection and classification -- are executed in one task by a deep neural network. To this end, training data is not annotated by hand but generated artificially from the underlying forms and already existing datasets. We show that this single-task approach is superior to the state-of-the-art two-task approach. The current study focuses on handwritten Latin letters and employs the EMNIST dataset. However, limitations were identified with this dataset, necessitating further customization. Finally, an overall recognition rate of 88.28 percent was attained on real data obtained from a written exam.

2606.08897 2026-06-09 cs.CV cs.AI q-bio.QM (cross-listed)

A multi-agent system for spine MRI report generation from multi-sequence imaging

Zhiping Xiao, Junwei Yang, Gongbo Sun, Han Zhang, Hanwen Xu, Yi Yao, Zachary D. Miller, William E. King, Mohammed M. Kanani, Jalal B. Andre, Sammy Chu, Ming Zhang, Paul E. Kinahan, Nathan M. Cross, Sheng Wang

Affiliations: University of Washington; Peking University; University of Wisconsin–Madison; New York University; University of Washington Medical Center

AI summary: Proposes SpineAgent, a multi-agent framework that uses a multi-sequence foundation model to integrate information from T1, T2, and other sequences for spine MRI report generation, pathology localization, and image-report retrieval, performing strongly in cross-manufacturer and cross-cohort evaluation.

Abstract

Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluation, yet its interpretation remains complex and time-consuming, requiring integration of information across multiple imaging sequences and anatomical regions. Despite recent advances in automated MRI analysis, effectively combining multi-sequence data while preserving sequence-specific diagnostic information remains an open challenge. Here we present SpineAgent, a multi-agent framework for spine MRI report generation built upon a multi-sequence foundation model trained on routine clinical data from 32,047 patients and 453,683 MRI series, comprising a total of 13,441,191 MRI slices. To accommodate diverse modalities of sequences, we first pre-train two DINOv3-based encoders separately on T1- and T2-weighted sequences. We then introduce a continual training strategy that learns a synthesizer to embed images of other sequences using the T1 and T2 encoders, producing patient-level embedding that integrates various signals across MRI sequences. Using these embeddings, SpineAgent achieves state-of-the-art performance, and demonstrates strong generalizability under cross-manufacturer and cross-cohort evaluation. Beyond classification, SpineAgent enables pathology localization by identifying findings-relevant slices and segmenting pathological regions. It also supports multimodal image-report retrieval, providing a solid foundation for scalable and explainable MRI report generation. We further integrate these validated capabilities of SpineAgent into 37 specialized agents. Finally, we incorporate their outputs as structured tokens within a Medical Report Agent trained end-to-end for report generation. Through both automated metrics and expert evaluation by five radiologists, SpineAgent achieves leading performance in spine MRI report generation.

2606.08908 2026-06-09 cs.CV cs.AI (cross-listed)

Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection

Pangyun Jeong, Jiyeong Kong, Yuehua Hu, Dohee Jeong, Kyung-Tae Kang

Affiliations: Hanyang University; Korea University; Korea Institute of Industrial Technology

AI summary: Proposes a two-stage vision-language framework that first fine-tunes Qwen3-VL to detect defects and then trains a refinement module to correct first-stage errors, improving detection reliability.

Comments: 6 pages, 3 figures

2606.08920 2026-06-09 cs.CV cs.AI (cross-listed)

PolyBuild: An End-to-End Method for Polygonal Building Contour Extraction from High-Resolution Remote Sensing Images

Yaoteng Zhang, Julin Zhang, Guangshuai Wang, Jiwei Deng, Hui Sheng, Yasir Muhammad, Shiqing Wei

Affiliations: China University of Petroleum (East China); South Surveying&Mapping Instrument Co.,Ltd.; China Railway Design Corporation

AI summary: Proposes PolyBuild, an end-to-end method that extracts vector polygonal building contours directly from remote sensing images through an initial contour generation module and a contour optimization module, without post-processing, outperforming existing methods.

Comments: Accepted for publication in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS)

Abstract

Extracting building polygon contours from high-resolution remote sensing images is a fundamental task for various mapping applications. However, varying imaging conditions and complex building structures make automatic contour extraction extremely challenging. Mainstream approaches to building extraction often rely on pixel-level segmentation followed by multiple post-processing steps to produce building contours, which can be computationally intensive and prone to errors. In this paper, we propose an end-to-end method named PolyBuild, which directly extracts building vector polygons from high-resolution remote sensing images without any post-processing operations. The proposed method uses two primary modules: an Initial Contour Generation Module (ICGM) and a Contour Optimization Module (COM). The ICGM generates an initial building contour from concatenated sub-region center features for each building instance. It performs object detection and initial contour extraction simultaneously by generating bounding boxes and using the center features of four sub-regions to represent each building. The COM further refines the generated building contours by iteratively integrating Convolutional Neural Network (CNN) features and contour positional information in a Transformer-based decoder. The hybrid CNN-Transformer architecture effectively captures both local and global spatial relationships within the building contour, ensuring high-quality boundary delineation. Extensive experiments are conducted on three building datasets to evaluate the performance of PolyBuild. The results demonstrate that PolyBuild significantly outperforms state-of-the-art methods, including mask-based and contour-based approaches.

2606.09090 2026-06-09 cs.SE cs.AI (cross-listed)

Context Rot in AI-Assisted Software Development: Repurposing Documentation Consistency for AI Configuration Artifacts

Christoph Treude, Sebastian Baltes

Affiliations: Singapore Management University; Heidelberg University

AI summary: Addresses the problem that configuration files for AI coding assistants (such as CLAUDE.md) go stale as software evolves, proposes using existing documentation consistency tools to detect context rot, and finds stale code references in 23.0% of 356 repositories.

Abstract

Developers increasingly provide AI coding assistants with persistent context through configuration files such as CLAUDE.md, AGENTS.md, and .cursorrules. These files describe code elements, architecture, and development conventions, forming the context that guides AI tool behavior across sessions. As software evolves, this context can become stale, a phenomenon we call context rot. While AI configuration artifacts are new, the underlying consistency problem connects to decades of software documentation research. Researchers have built tools to check consistency between documentation and code, spanning README files, code comments, API documentation, architecture descriptions, and installation instructions. We argue that this existing toolbox is an immediate starting point for detecting context rot, and we present a research roadmap mapping documentation consistency approaches to corresponding problems in this new setting. As preliminary evidence, applying an existing README/wiki consistency checker to a statistically representative sample of 356 repositories identifies stale code element references in 23.0% of repositories, showing that traditional documentation consistency tools can already surface context rot.
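
The kind of check the abstract has in mind can be sketched as follows. This is not the README/wiki consistency checker used in the paper, whose details the abstract does not give; it is a simplified stand-in that assumes code elements are written as backticked identifiers in the configuration file and that a reference is stale when the identifier no longer appears in any source file.

```python
import re

def find_stale_references(config_text, source_files):
    """Return backticked identifiers in config_text found in no source file.

    config_text: contents of a file such as CLAUDE.md or AGENTS.md.
    source_files: dict mapping path -> file contents.
    """
    # Collect identifier-like tokens written in backticks, e.g. `load_config`
    referenced = set(re.findall(r"`([A-Za-z_][A-Za-z0-9_]*)`", config_text))
    corpus = "\n".join(source_files.values())
    stale = []
    for name in sorted(referenced):
        # Whole-word match so `parse_args_v1` does not match `parse_args`
        if not re.search(r"\b" + re.escape(name) + r"\b", corpus):
            stale.append(name)
    return stale
```

A check of this shape can run in continuous integration so that a renamed or deleted function flags the configuration file for review.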

2606.09104 2026-06-09 cs.LG cs.AI q-fin.PM (cross-listed)

Addressing Market Regime Changes and Heavy-Tailed Returns in Portfolio Optimization via Bayesian VAR and Elliptical Black-Litterman

Daniil Mikriukov, Ruoyu Sun, Angelos Stefanidis, Jionglong Su, Zhengyong Jiang

Affiliations: University of Liverpool; Xi'an Jiaotong-Liverpool University

AI summary: Proposes the BAVAR-BLED algorithm, which combines Bayesian-averaging vector autoregression with an elliptical-distribution Black-Litterman model to allocate assets adaptively within a TD3 architecture, achieving a Sharpe ratio of 1.72 and total returns of 57.26% on Dow Jones Industrial Average constituents.

Comments: 9 pages, 3 figures, 4 tables. Extends our prior work [Mikriukov et al., ICIC 2025] on Black-Litterman under Elliptical Distributions (BLED). Manuscript under review

Abstract

Deep reinforcement learning (DRL) frameworks for portfolio optimization have shown promise for their ability to learn allocation rules dynamically from market data. However, these models fail to account for fat-tailed returns, which characterize actual market behavior with more frequent extreme events. Furthermore, historical data is treated homogeneously, without accounting for temporal importance, leading models to fail during regime changes. We propose a new BAVAR-BLED algorithm that combines methods derived from Bayesian-Averaging Vector Autoregressive (BAVAR) and the Black-Litterman model using Elliptical Distributions (BLED) within a TD3 architecture. BAVAR captures a set of vector autoregressive representations that consider multi-scale temporal features, enabling adaptive allocation decisions based on regime-aware estimates of return expectations and dispersion matrices. These estimates serve as prior inputs to BLED, a model that uses Student's t-distributions, allowing for more realistic fat tail return estimates. The BAVAR-BLED algorithm uses transformer networks for view construction and CNNs for risk-aversion estimates, which modify dynamic allocation decisions based on market conditions. An evaluation of 29 Dow Jones Industrial Average constituents over a decade-long market period shows that BAVAR-BLED significantly outperforms state-of-the-art methods, achieving Sharpe and Sortino ratios of 1.72 and 2.70, respectively, and total returns of 57.26%.
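
For readers unfamiliar with the two risk-adjusted metrics reported above, a minimal sketch of their textbook definitions follows. It computes per-period, non-annualized ratios; the abstract does not state the risk-free rate, target return, or annualization convention behind the reported 1.72 and 2.70, so those choices are assumptions here.

```python
import math

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(var)

def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by downside deviation below the target."""
    excess = [r - target for r in returns]
    mean = sum(excess) / len(excess)
    # Only returns below the target contribute to downside risk
    downside = math.sqrt(sum(min(0.0, x) ** 2 for x in excess) / len(excess))
    return mean / downside if downside > 0 else math.inf
```

Because the Sortino ratio penalizes only downside moves, it is the more forgiving of the two for return series with large positive outliers, which is relevant when returns are fat-tailed.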

A Physics-Guided Sequence Generation Framework for Acoustic Metamaterial Inverse Design

Yijie Li, Jiahao Xu, Ching-Chih Tsao, Lili Qiu, Jingxian Wang

Affiliations: National University of Singapore; UT Austin

AI summary: Proposes the MetaSeq framework, which represents acoustic metamaterials as structured sequences and combines a sequence-to-sequence model with a physics solver and reinforcement learning to enable broadband inverse design, reducing error by 45%.

Abstract

Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs neighboring sub-bands. Yet existing broadband inverse-design approaches are either constrained by predefined templates, or rely on image representations that fail to preserve the geometric precision and structural connectivity required by acoustic structures. We present MetaSeq, a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design. At its core, MetaSeq introduces a language that represents each AMM as a structured sequence, rather than as a pixel grid or fixed template. This representation preserves precise geometry, explicitly encodes connectivity, and casts inverse design as a sequence-to-sequence task from target response to structure sequence. MetaSeq further constructs a balanced, high-fidelity dataset with efficient calibration and complexity-based sampling. To address the one-to-many nature of inverse design, MetaSeq combines supervised pretraining with reinforcement learning fine-tuning guided by a physics-based solver and validity checker. Extensive evaluations against COMSOL and five baselines show that MetaSeq reduces response error by 45% over the best baseline.

2606.09327 2026-06-09 cs.LG cs.AI (cross-listed)

A Universal Dense Football Event Representation Based on TabTransformer

Weiran Yang, Daniel Memmert, Maximilian Klemp-Weins

Affiliations: Institute of Exercise Training and Sport Informatics, German Sport University Cologne

AI summary: Proposes a TabTransformer-based model that learns embedding vectors for categorical features to produce dense football event representations, outperforming baselines on downstream tasks.

Comments: 12 pages, 1 figure. Preprint submitted to the 13th Workshop on Machine Learning and Data Mining for Sports Analytics (MLSA 2026)

Abstract

Football event data constitute a rich spatiotemporal source for quantitative analysis of player actions in team sports. These datasets contain heterogeneous features, combining continuous location coordinates with categorical variables such as action type, action outcome, and body part. Such data have been applied in sports analytics for match outcome forecasting, player evaluation, and tactical pattern recognition. However, existing approaches predominantly encode categorical features using one-hot or ordinal embedding representations, overlooking the intrinsic semantics of action descriptors. The Transformer is a deep neural network architecture based on self-attention that captures dependencies between input features at arbitrary positions. We propose and implement a Transformer-based model to learn latent dependencies among categorical event features and produce dense representations of football events. By encoding categorical features as learned embedding vectors, sport-specific action semantics are captured during pretraining, enabling the representations to support downstream tasks such as action value estimation and play style recognition. Empirical evaluation shows that the embedding representations yield superior probability calibration over task-specific baselines on the downstream prediction tasks, as measured by Brier score.
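
The Brier score used above as the calibration measure is the mean squared difference between predicted probabilities and binary outcomes. A minimal sketch for the binary case follows; the input values in the usage note are illustrative, not results from the paper.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better: 0.0 is perfect, and always predicting 0.5 scores 0.25.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)
```

For example, `brier_score([0.9, 0.2, 0.6], [1, 0, 1])` averages the squared errors 0.01, 0.04, and 0.16.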

2606.09419 2026-06-09 cond-mat.mtrl-sci cs.AI (cross-listed)

Context-Aware Deep Learning for Defect Classification in Atomic-Resolution STEM

Jiadong Dan, Cheng Zhang, Leyi Loh, Ivan Verzhbitskiy, Yuan Chen, Goki Eda, Michel Bosman, N. Duane Loh

Affiliations: cond-mat.mtrl-sci (Materials Science)

AI summary: Proposes a context-aware learning framework that fuses image contrast with metadata (composition, beam energy, detector geometry) to resolve the ambiguity of classifying defects from image contrast alone, with accuracy above 98% on simulated data and near-human agreement on experimental data.

Comments: 6 figures

Abstract

Artificial intelligence is rapidly advancing materials characterization, yet most applications in electron microscopy rely solely on image contrast, overlooking the chemical and experimental context that shapes image formation. This limitation makes defect classification inherently ambiguous, as similar contrasts can arise from different materials or imaging conditions. Here we develop a context-aware learning framework that integrates image-derived contrast with metadata describing composition, beam energy, and detector geometry. Using a systematically constructed dataset of ~55 million simulated patches spanning 576 cases across 96 doped monolayer transition-metal dichalcogenides, we show that conditioning on contextual variables transforms defect classification from an ill-posed image-only task into a well-posed, physically grounded problem. The framework achieves over 98% accuracy on simulations and near-human agreement on experimental data, with a 94% reduction in posterior entropy. By emphasizing contextual grounding over architectural complexity, this approach links experimental image contrast to the underlying chemical and imaging conditions, supporting physically grounded defect assignments and a general pathway toward multimodal AI models for autonomous materials characterization.


2606.09520 2026-06-09 physics.chem-ph cs.AI 交叉投稿

Closing the Prior-Posterior Loop: Self-Reflective Molecular Design with Analysis-Driven LLM Iteration

闭合先验-后验循环：基于分析驱动LLM迭代的自反性分子设计

Junyi Gong, Zijie Qiu, Ben Zhong Tang

发表机构 * Faculty of Chemistry, Shenzhen MSU-BIT University（深圳MSU-BIT大学化学学院）； School of Science and Engineering, Chinese University of Hong Kong (Shenzhen)（香港中文大学（深圳）科学与工程学院）； Department of Chemistry, Hong Kong University of Science and Technology（香港科技大学化学系）

AI总结提出一种自反性分子设计框架，用第一性原理计算的完整物化理由替代标量反馈，使LLM从随机采样器转变为因果推理器，在HOMO-LUMO能隙任务中实现0.0003 eV偏差和100%成功率。

Comments 3 tables, 4 figures


AI中文摘要

通用大语言模型能否像经验丰富的化学家一样精确设计分子？当前基于LLM的框架通过标量反馈循环（生成、评分、拒绝）来回答这个问题，这相当于有依据的试错。本文表明，用第一性原理计算的完整物化理由替代单一数字，可将LLM从随机采样器转变为因果推理器。我们的系统将检索增强生成与自反思模块相结合，该模块将轨道能量、原子电荷和电子密度（而非压缩分数）反馈到设计循环中。在1.0至5.0 eV的HOMO-LUMO能隙目标上，这种结构-性质关系（SPR）反思实现了低至0.0003 eV的偏差，在中等任务上达到100%的成功率，显著优于标量反馈和非反思基线。该框架可无缝推广到偶极矩设计，并在五种不同的LLM骨干网络上表现出鲁棒性。这些结果建立了一个新范式：当模型不仅知道分子失败了，而且理解其失败原因时，迭代分子设计将变得真正具有机理性质。

英文摘要

Can a general-purpose large language model design molecules with the precision of a seasoned chemist? Current LLM-based frameworks answer this question with scalar feedback loops (generate, score, reject) that amount to informed trial-and-error. Here we show that replacing a single number with the full physicochemical rationale from first-principles calculations transforms the LLM from a stochastic sampler into a causal reasoner. Our system couples retrieval-augmented generation with a self-reflection module that feeds orbital energies, atomic charges, and electron densities, rather than compressed scores, back into the design loop. On HOMO-LUMO gap targets from 1.0 to 5.0 eV, this structure-property-relationship (SPR) reflection achieves a deviation as low as 0.0003 eV and a 100% success rate on moderate tasks, decisively outperforming scalar-feedback and non-reflective baselines. The framework generalizes seamlessly to dipole-moment design and proves robust across five distinct LLM backbones. These results establish a new paradigm: when the model understands not only that a molecule fails, but why, iterative molecular design becomes genuinely mechanistic.


2606.09617 2026-06-09 math.OC cs.AI cs.CY cs.SY eess.SY 交叉投稿

Powering the Future of AI: Navigating the Trade-offs for Europe's Energy Transition and Net-Zero Goals

赋能AI未来：应对欧洲能源转型与净零目标的权衡

Mohammad Hemmati, Gbemi Oluleye, Vassilis M. Charitopoulos

发表机构 * Department of Chemical Engineering, Sargent Centre for Process Systems Engineering, University College London (UCL)（化学工程系、过程系统工程中心、伦敦大学学院（UCL））； Centre for Environmental Policy, Imperial College London（环境政策中心、伦敦帝国理工学院）

AI总结通过21种AI增长情景下的空间优化模型，量化AI对欧洲电力需求、容量、排放和运行的影响，发现AI到2050年可能增加73-723 TWh需求，导致2030-2050年累计排放超调67-181 MtCO2，且AI基础设施选址将更依赖稳定电源和系统灵活性。


AI中文摘要

全球AI的快速扩张导致能源密集型超大规模数据中心激增，使其成为电力系统规划和运行中的结构性挑战。利用覆盖21种AI增长情景的欧洲空间显式优化模型，我们系统量化了数据中心的额外需求、容量要求、排放和运行影响。结果表明，到2050年，AI可能推动73-723 TWh的额外需求，导致2030年至2050年间累计排放超调67-181 MtCO2。我们的分析表明，2030年后，AI基础设施的地理分布将更多地由稳定电源和系统灵活性决定，而非仅仅依赖清洁能源的丰富程度。在中等情景下，AI需要额外200小时的稳定发电，这使关键枢纽的平准化电力成本增加35欧元/兆瓦时。我们表明，即使在悲观情景下，现有基础设施也需要额外70吉瓦的容量，而在受控增长路径下，这一扩张可能达到226吉瓦。我们进一步发现，数据中心的工作负载动态强烈影响能源调度、系统灵活性和排放，而效率提升显著降低了容量需求和系统峰值。虽然我们的研究结果表明2050年净零目标可能实现，但中期可能出现关键排放风险，除非政策适应这一加速的数字转型，否则欧盟可能危及其碳中和目标。

英文摘要

The rapid expansion of AI globally has led to the proliferation of energy-intensive hyperscale data centres (DCs), making them a structurally challenging component in power system planning and operation. Using a spatially explicit optimisation model of Europe across 21 AI growth scenarios, we systematically quantify additional demand, capacity requirements, emissions, and operational impacts of DCs. Results indicate that AI could drive 73-723 TWh of extra demand by 2050, risking cumulative emissions overshoots of 67-181 MtCO2 between 2030 and 2050. Our analysis indicates that after 2030, the geography of AI infrastructure will be shaped more by firm power and system flexibility than by the mere abundance of clean energy. In moderate scenarios, AI requires an additional 200 hours of firm generation, which increases LCOE by 35 EUR/MWh in key hubs. We show that even under the pessimistic scenarios, existing infrastructure would require 70 GW of additional capacity, while under managed growth pathways, this expansion could reach 226 GW. We further find that DC workload dynamics strongly shape energy dispatch, system flexibility, and emissions, while improved efficiency significantly reduces capacity needs and system peaks. While our findings suggest that net-zero targets for 2050 may be achieved, critical emission risks may appear in the intermediate years, and the EU may compromise its carbon-neutral goals unless policies adapt to this accelerating digital transformation.


2606.09643 2026-06-09 cs.DC cs.AI cs.LG cs.OS 交叉投稿

FMplex: Model Virtualization for Serving Extensible Foundation Models

FMplex: 用于服务可扩展基础模型的模型虚拟化

Hetvi Shastri, Pragya Sharma, Walid A. Hanafy, David Irwin, Mani Srivastava, Prashant Shenoy

发表机构 * University of Massachusetts Amherst（马萨诸塞大学阿姆赫斯特分校）； University of California Los Angeles（加州大学洛杉矶分校）

AI总结提出FMplex系统，通过将基础模型作为虚拟化层实现多任务共享，结合批感知公平队列调度器，在7个基础模型和92个下游任务上降低延迟达80%，提升任务容量6倍。


AI中文摘要

基础模型（FMs）越来越多地被用作语言、视觉、时间序列和多模态应用的下游任务骨干。然而，现有的模型服务系统将每个定制任务部署为独立的模型实例，从而复制了重型骨干，浪费了加速器内存，并失去了摊销批处理和加载成本的机会。本文提出了FMplex，一个将FM骨干视为部署共享的虚拟化层的服务系统。FMplex为每个任务提供一个虚拟基础模型（vFM），这是一个由共享物理FM支持的逻辑私有FM实例。这种抽象允许独立定制的任务共享一个骨干，同时保留任务特定的扩展、独立生命周期和任务级隔离。此外，我们提出了一种批感知公平队列调度器，该调度器结合了加权任务级共享以及跨共置任务的任务间和任务内批处理。我们实现了一个基于FMplex的服务栈，涵盖任务构建、共享感知部署和运行时执行。在7个FM骨干（16个变体）和92个下游任务上，FMplex相比空间分区延迟降低高达80%，相比尽力而为共置延迟降低33.3%，同时在集群规模上可托管多达6倍的任务。

英文摘要

Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-series, and multimodal applications. Yet existing model-serving systems deploy each customized task as an independent model instance, thereby replicating heavyweight backbones, wasting accelerator memory, and losing opportunities to amortize batching and loading costs. This paper presents FMplex, a serving system that treats FM backbones as a virtualization substrate for deployment sharing. FMplex presents each task with a virtual foundation model (vFM), a logically private FM instance backed by a shared physical FM. This abstraction lets independently customized tasks share a backbone while preserving task-specific extensions, independent lifecycles, and task-level isolation. In addition, we propose a batch-aware fair-queueing scheduler that combines weighted task-level sharing with inter- and intra-task batching across colocated tasks. We implement an FMplex-based serving stack spanning task construction, sharing-aware deployment, and runtime execution. Across 7 FM backbones (16 variants) and 92 downstream tasks, FMplex reduces latency by up to 80% over spatial partitioning and 33.3% over best-effort co-location, while hosting up to 6x more tasks at cluster scale.
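To illustrate what combining weighted task-level sharing with cross-task batching could look like, here is a minimal sketch of textbook weighted fair queueing with batch formation. It is not FMplex's scheduler: the class name `WeightedFairBatcher`, the `cost` parameter, and the simplified virtual-time update are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict


class WeightedFairBatcher:
    """Toy weighted fair queue that emits mixed-task batches (hypothetical)."""

    def __init__(self, weights, batch_size):
        self.weights = weights              # task name -> positive share weight
        self.batch_size = batch_size        # max requests per backbone batch
        self.virtual_time = 0.0             # simplified global virtual clock
        self.last_finish = defaultdict(float)
        self.heap = []                      # (finish_tag, seq, task, request)
        self.seq = 0                        # tie-breaker: submission order

    def submit(self, task, request, cost=1.0):
        # A request's finish tag grows more slowly for higher-weight tasks,
        # so those tasks receive a proportionally larger share of each batch.
        start = max(self.virtual_time, self.last_finish[task])
        finish = start + cost / self.weights[task]
        self.last_finish[task] = finish
        heapq.heappush(self.heap, (finish, self.seq, task, request))
        self.seq += 1

    def next_batch(self):
        # Fill one batch with the requests holding the smallest finish tags,
        # regardless of which task they belong to.
        batch = []
        while self.heap and len(batch) < self.batch_size:
            finish, _, task, request = heapq.heappop(self.heap)
            self.virtual_time = max(self.virtual_time, finish)
            batch.append((task, request))
        return batch
```

With weights `{"a": 2.0, "b": 1.0}` and a batch size of 3, a backlog from both tasks yields batches holding two requests from `a` for every one from `b`, while still sharing a single forward pass over the backbone.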


2606.09671 2026-06-09 cs.LG cs.AI 交叉投稿

Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data

基于转换的阿尔茨海默病数字孪生建模在稀疏纵向数据下的应用

Yinyu Huang, Yilin Zhang, Sofia Michopoulou, Christopher Kipps, Rahman Attar

发表机构 * University of Southampton（南安普顿大学）； University Hospital Southampton NHS Foundation Trust（南安普顿大学医院NHS基金会信托）； Faculty of Medicine, University of Southampton（南安普顿大学医学院）

AI总结针对阿尔茨海默病进展异质性和数据稀疏问题，提出结合局部转换建模与序列建模的数字孪生框架，利用多模态纵向数据预测认知状态并量化不确定性，在ADNI数据上表现优异。

Comments 13 pages, 5 figures, 3 tables. Accepted as a full-length paper at the International Conference on AI in Healthcare (AIiH) 2026


AI中文摘要

阿尔茨海默病（AD）进展具有高度异质性，通常通过稀疏且不规则的纵向数据观察，给预测和个性化监测带来挑战。现有的机器学习方法利用多模态数据改进了AD预测，但往往侧重于静态分类或队列级风险估计，对个体特异性建模和不确定性推理的支持有限。为了解决这些局限性，我们提出了一种个性化数字孪生框架，用于AD预测和基于场景的分析，利用多模态纵向数据。该方法整合了互补的建模策略，以捕捉临床转换和跨访视的时间依赖性。使用阿尔茨海默病神经影像学倡议（ADNI）的数据，包括认知评估、临床变量和MRI衍生的表型，该框架预测认知状态和诊断类别，同时量化预测不确定性并实现患者特定的假设轨迹分析。在无泄漏的受试者级别分割上的评估表明，在评分预测和诊断分类方面表现强劲。在这种稀疏且不规则的ADNI设置中，相邻访视的基于转换的建模比基于序列的分支实现了更高的预测准确性，表明局部转换建模可能更数据高效。虽然序列模型对于不确定性感知的轨迹预测仍然有价值，但局部转换建模提供了一种更数据高效且稳健的预测策略。这些发现强调了将时间建模策略与临床数据结构对齐的重要性，并表明基于转换的数字孪生公式可能为神经退行性疾病的个性化预测提供一种实用且可解释的方法。

英文摘要

Alzheimer's disease (AD) progression is highly heterogeneous and is typically observed through sparse and irregular longitudinal data, posing challenges for prediction and personalised monitoring. Existing machine learning approaches have improved AD prediction using multimodal data, yet often focus on static classification or cohort-level risk estimation, providing limited support for subject-specific modelling and uncertainty-aware reasoning. To address these limitations, we present a personalised digital twin framework for AD prediction and scenario-based analysis using multimodal longitudinal data. The proposed approach integrates complementary modelling strategies to capture clinical transitions and temporal dependencies across visits. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), including cognitive assessments, clinical variables, and MRI-derived phenotypes, the framework predicts cognitive status and diagnostic categories while quantifying predictive uncertainty and enabling patient-specific what-if trajectory analysis. Evaluation on leak-free subject-level splits demonstrates strong performance in score forecasting and diagnosis classification. In this sparse and irregular ADNI setting, transition-based modelling of adjacent visits achieved higher predictive accuracy than the sequence-based branch, suggesting that local transition modelling may be more data-efficient. While sequence models remain valuable for uncertainty-aware trajectory forecasting, local transition modelling offers a more data-efficient and robust predictive strategy. These findings highlight the importance of aligning temporal modelling strategies with clinical data structure and suggest that transition-based digital twin formulations may provide a practical and interpretable approach for personalised disease forecasting in neurodegenerative disorders.


2510.18428 2026-06-09 cs.AI 版本更新

可解释的AML优先级排序与LLMs：证据检索与反事实检查

Dorothy Torres, Wei Cheng, Ke Hu

发表机构 * School of Science, Technology, Engineering and Mathematics（科学、技术、工程与数学学院）； School of Electrical Engineering and Computer Science（电气工程与计算机科学学院）

AI总结本文提出一种可解释的AML优先级排序框架，结合检索增强的证据捆绑、结构化LLM输出合同和反事实验证，提升可审计性和鲁棒性，实验证明其在优先级排序和证据支持方面表现优异。


AI中文摘要

反洗钱（AML）交易监控生成大量警报，需在严格审计和治理约束下快速优先级排序。尽管大语言模型（LLMs）可汇总异质证据并起草理由，但不受约束的生成在受监管流程中因幻觉、弱溯源性和不忠实的解释而风险较高。本文提出一种可解释的AML优先级排序框架，将优先级排序视为受证据约束的决策过程。我们的方法结合（i）从政策/类型指南、客户上下文、警报触发器和交易子图中检索增强的证据捆绑；（ii）一个结构化的LLM输出合同，要求明确引用并区分支持、矛盾或缺失的证据；（iii）反事实检查，验证最小、合理的扰动是否导致优先级推荐及其理由的连贯变化。我们在公开的合成AML基准和模拟器上评估，并与规则、表格和图机器学习基线以及LLM-only/RAG-only变体进行比较。结果表明，证据支撑显著提高了可审计性，并减少了数值和政策幻觉错误，而反事实验证进一步增加了与决策相关的可解释性和鲁棒性，实现了最佳的整体优先级排序性能（PR-AUC 0.75；升级F1 0.62）和强溯源性和忠实度指标（引用有效性0.98；证据支持0.88；反事实忠实度0.76）。这些发现表明，受约束、可验证的LLM系统可以在不牺牲合规要求的可追溯性和防御性的情况下，为AML优先级排序提供实用的决策支持。

英文摘要

Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence and draft rationales, unconstrained generation is risky in regulated workflows due to hallucinations, weak provenance, and explanations that are not faithful to the underlying decision. We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process. Our method combines (i) retrieval-augmented evidence bundling from policy/typology guidance, customer context, alert triggers, and transaction subgraphs, (ii) a structured LLM output contract that requires explicit citations and separates supporting from contradicting or missing evidence, and (iii) counterfactual checks that validate whether minimal, plausible perturbations lead to coherent changes in both the triage recommendation and its rationale. We evaluate on public synthetic AML benchmarks and simulators and compare against rules, tabular and graph machine-learning baselines, and LLM-only/RAG-only variants. Results show that evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors, while counterfactual validation further increases decision-linked explainability and robustness, yielding the best overall triage performance (PR-AUC 0.75; Escalate F1 0.62) and strong provenance and faithfulness metrics (citation validity 0.98; evidence support 0.88; counterfactual faithfulness 0.76). These findings indicate that governed, verifiable LLM systems can provide practical decision support for AML triage without sacrificing compliance requirements for traceability and defensibility.


2606.00384 2026-06-09 cs.AI cs.CL cs.CV cs.LG stat.CO 版本更新

利用大型语言模型实现化工流程图的自动纠错

Lukas Schulze Balhorn, Marc Caballero, Artur M. Schweidtmann

发表机构 * Process Intelligence Research Group, Department of Chemical Engineering, Delft University of Technology（过程智能研究组，化学工程系，代尔夫特理工大学）

AI总结提出一种基于大型语言模型的生成式AI方法，自动识别化工流程图中的错误并给出修正建议，在合成数据集上达到80%的top-1准确率。


DOI: 10.1016/B978-0-443-28824-1.50519-6
Journal ref: Computer Aided Chemical Engineering, Volume 53, 2024, Pages 3109-3114

AI中文摘要

过程工程领域广泛使用工艺流程图（PFD）和管道及仪表流程图（P&ID）来表示工艺流程和设备配置。然而，P&ID和PFD（以下统称为流程图）可能包含错误，导致安全隐患、操作效率低下和不必要的开支。纠正和验证流程图是一个繁琐的手动过程。我们提出了一种新颖的生成式AI方法，用于自动识别流程图中的错误并向用户建议修正，即自动纠错流程图。受大型语言模型（LLM）在人类语言语法自动纠错方面突破的启发，我们研究了LLM用于流程图的自动纠错。模型的输入是可能出错的流程图，输出是修正后的流程图建议。我们在合成数据集上以监督方式训练自动纠错模型。该模型在独立测试的合成流程图数据集上达到了80%的top-1准确率和84%的top-5准确率。结果表明，模型能够学习自动纠错合成流程图。我们设想流程图自动纠错将成为化学工程师的有用工具。

英文摘要

The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, P&IDs and PFDs, hereafter called flowsheets, can contain errors that cause safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet and the output is a set of suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers.


2412.00508 2026-06-09 cs.LG cs.AI cs.CE 版本更新

Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence

Graph-to-SFILES: 基于生成式人工智能从过程拓扑预测控制结构

Lukas Schulze Balhorn, Kevin Degens, Artur M. Schweidtmann

发表机构 * Process Intelligence Research Group（过程智能研究组）； Department of Chemical Engineering（化学工程系）； Delft University of Technology（代尔夫特理工大学）

AI总结提出Graph-to-SFILES模型，利用图神经网络从流程图拓扑生成控制扩展流程图序列，在小数据集上显著提升控制结构预测精度。


DOI: 10.1016/j.compchemeng.2025.109121
Journal ref: Computers & Chemical Engineering, Volume 199, 2025, Pages 109121

AI中文摘要

控制结构设计是P&ID开发中重要但繁琐的步骤。生成式人工智能有望通过支持工程师来减少P&ID开发时间。先前关于化学过程设计中生成式AI的研究主要用序列表示过程。然而，图因其置换不变性而成为一种有前景的替代方案。我们提出了Graph-to-SFILES模型，一种从流程图拓扑预测控制结构的生成式AI方法。Graph-to-SFILES模型将流程图拓扑作为图输入，并返回以SFILES 2.0符号表示的控制扩展流程图序列。我们比较了四种不同的图编码器架构，其中一种是本文提出的图神经网络（GNN）。Graph-to-SFILES模型在10,000个流程图拓扑上训练时达到了73.2%的top-5准确率。此外，所提出的GNN在编码器架构中表现最佳。与纯基于序列的方法相比，Graph-to-SFILES模型在相对较小的1,000个流程图训练数据集上将top-5准确率从0.9%提高到28.4%。然而，在100,000个流程图的大规模数据集上，基于序列的方法表现更好。这些结果突显了基于图的AI模型在小数据场景下加速P&ID开发的潜力，但其在工业相关案例研究中的有效性仍需进一步研究。

英文摘要

Control structure design is an important but tedious step in P&ID development. Generative artificial intelligence (AI) promises to reduce P&ID development time by supporting engineers. Previous research on generative AI in chemical process design mainly represented processes by sequences. However, graphs offer a promising alternative because of their permutation invariance. We propose the Graph-to-SFILES model, a generative AI method to predict control structures from flowsheet topologies. The Graph-to-SFILES model takes the flowsheet topology as a graph input and returns a control-extended flowsheet as a sequence in the SFILES 2.0 notation. We compare four different graph encoder architectures, one of them being a graph neural network (GNN) proposed in this work. The Graph-to-SFILES model achieves a top-5 accuracy of 73.2% when trained on 10,000 flowsheet topologies. In addition, the proposed GNN performs best among the encoder architectures. Compared to a purely sequence-based approach, the Graph-to-SFILES model improves the top-5 accuracy for a relatively small training dataset of 1,000 flowsheets from 0.9% to 28.4%. However, the sequence-based approach performs better on a large-scale dataset of 100,000 flowsheets. These results highlight the potential of graph-based AI models to accelerate P&ID development in small-data regimes, but their effectiveness on industry-relevant case studies still needs to be investigated.
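The top-5 accuracy quoted above counts a prediction as correct when the reference sequence appears among the model's five highest-ranked candidates. A minimal sketch of that metric follows; the candidate strings are placeholders, not real SFILES 2.0 sequences, and exact string matching is an assumption about how hits are counted.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of targets found among the first k ranked candidates.

    ranked_predictions: one list of candidate sequences per example,
                        ordered from most to least likely.
    targets:            the reference sequence for each example.
    """
    hits = sum(
        1
        for candidates, target in zip(ranked_predictions, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Placeholder candidates for three hypothetical flowsheets.
predictions = [
    ["seq_a", "seq_b", "seq_c"],
    ["seq_d", "seq_e", "seq_f"],
    ["seq_g", "seq_h", "seq_i"],
]
references = ["seq_a", "seq_f", "seq_z"]

print(top_k_accuracy(predictions, references, k=1))
print(top_k_accuracy(predictions, references, k=3))
```

Because the top-k set always contains the top-1 prediction, top-k accuracy can never fall below top-1 accuracy on the same test set.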


2502.18493 2026-06-09 cs.CE cs.AI 版本更新

Rule-based autocorrection of Piping and Instrumentation Diagrams (P&IDs) on graphs

基于规则的管道与仪表图（P&ID）图形自动校正

Lukas Schulze Balhorn, Niels Seijsener, Kevin Dao, Minji Kim, Dominik P. Goldstein, Ge H. M. Driessen, Artur M. Schweidtmann

发表机构 * Process Intelligence Research Group（过程智能研究组）； Department of Chemical Engineering（化学工程系）； Delft University of Technology（代尔夫特理工大学）； Fluor BV Amsterdam, The Netherlands（荷兰阿姆斯特丹Fluor公司）

AI总结提出一种基于图表示的规则方法，通过33条化工规则实现P&ID的自动错误检测与校正，案例验证其可靠性。


DOI: 10.69997/sct.150968
Journal ref: Systems and Control Transactions, Volume 4, 2025, Pages 1656-1661

AI中文摘要

管道与仪表图（P&ID）是化学过程工程中的核心参考文档。目前，化学工程师通过目视检查手动审查P&ID以发现和纠正错误。然而，工程项目可能涉及数百至数千页P&ID，造成巨大的修订工作量。本研究提出一种基于规则的方法，支持工程师进行P&ID的错误检测与校正。该方法基于P&ID的图表示，通过规则图实现自动错误检测与校正，即自动校正。我们使用pyDEXPI Python包从DEXPI标准的P&ID生成P&ID图。在本研究中，我们基于化学工程知识和启发式方法开发了33条规则，并展示了其中五条选定的规则作为示例。一个示例P&ID的案例研究验证了基于规则的自动校正方法在修订P&ID中的可靠性和有效性。

英文摘要

A piping and instrumentation diagram (P&ID) is a central reference document in chemical process engineering. Currently, chemical engineers manually review P&IDs through visual inspection to find and rectify errors. However, engineering projects can involve hundreds to thousands of P&ID pages, creating a significant revision workload. This study proposes a rule-based method to support engineers with error detection and correction in P&IDs. The method is based on a graph representation of P&IDs, enabling automated error detection and correction, i.e., autocorrection, through rule graphs. We use our pyDEXPI Python package to generate P&ID graphs from DEXPI-standard P&IDs. In this study, we developed 33 rules based on chemical engineering knowledge and heuristics, with five selected rules demonstrated as examples. A case study on an illustrative P&ID validates the reliability and effectiveness of the rule-based autocorrection method in revising P&IDs.


2505.07573 2026-06-09 cs.CV cs.AI 版本更新

Robust Renal Mass Segmentation on CT: A Validation Study of an AI-Based Framework

基于CT的肾脏肿块鲁棒分割：AI框架的验证研究

Sarah de Boer, Hartmut Häntze, Kiran Vaidhya Venkadesh, Myrthe A. D. Buser, Gabriel E. Humpire Mamani, Lina Xu, Lisa C. Adams, Jawed Nawabi, Keno K. Bressem, Bram van Ginneken, Mathias Prokop, Alessa Hering

发表机构 * Department of Medical Imaging, Radboudumc, Nijmegen, The Netherlands（荷兰奈梅亨Radboudumc医学影像科）； Department of Radiology, Charité - Universitätsmedizin Berlin, Berlin, Germany（德国柏林夏里特医学院放射科）； Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Berlin, Germany（德国柏林夏里特医学院神经放射科）； Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany（德国慕尼黑工业大学TUM大学医院Klinikum rechts der Isar诊断与介入放射科）； Department of Cardiovascular Radiology and Nuclear Medicine, German Heart Center, TUM University Hospital, Technical University of Munich, Munich, Germany（德国慕尼黑工业大学TUM大学医院德国心脏中心心血管放射与核医学科）； Fraunhofer MEVIS, Bremen, Germany（德国不来梅Fraunhofer MEVIS）

AI总结提出Renal-Net，基于nnU-Net和公开数据训练，在CT图像上实现肾脏肿块分割，验证显示优于现有模型且鲁棒性强。

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:012. 23 pages, 12 figures


DOI: 10.59275/j.melba.2026-67g5
Journal ref: Machine.Learning.for.Biomedical.Imaging. 2026 (2026)

AI中文摘要

肾脏肿块分割在临床工作流中具有重要潜力，尤其是在需要定量评估的场景中。肾脏体积可作为肾脏疾病的重要生物标志物，其体积变化与肾功能直接相关。目前，临床实践常依赖主观视觉评估来评价肾脏大小和肾脏病变（包括肿瘤和囊肿），这些病变通常根据直径、体积和解剖位置进行分期。为了支持更客观和可重复的方法，本研究旨在开发一个鲁棒且经过充分验证的肾脏肿块分割算法，命名为Renal-Net。我们使用公开可用的训练数据集，并利用最先进的医学图像分割框架nnU-Net。使用专有和公开测试数据集进行验证，分割性能通过Dice系数和95百分位Hausdorff距离量化。此外，我们根据患者性别、年龄、CT对比相和肿瘤组织学亚型分析亚组鲁棒性。我们的结果表明，仅使用公开数据训练的分割算法能有效泛化到外部测试集，并在所有测试数据集上优于现有最先进模型。亚组分析显示一致的高性能，表明强鲁棒性和可靠性。开发的算法和相关代码可在以下网址公开获取：https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation。

英文摘要

Renal mass segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments. Kidney volume could serve as an important biomarker for renal diseases, with changes in volume correlating directly with kidney function. Currently, clinical practice often relies on subjective visual assessment for evaluating kidney size and kidney lesions, including tumors and cysts, which are typically staged based on diameter, volume, and anatomical location. To support a more objective and reproducible approach, this research aims to develop a robust, thoroughly validated renal mass segmentation algorithm, named Renal-Net. We employ publicly available training datasets and leverage the state-of-the-art medical image segmentation framework nnU-Net. Validation is conducted using both proprietary and public test datasets, with segmentation performance quantified by Dice coefficient and the 95th percentile Hausdorff distance. Furthermore, we analyze robustness across subgroups based on patient sex, age, CT contrast phases, and tumor histologic subtypes. Our findings demonstrate that our segmentation algorithm, trained exclusively on publicly available data, generalizes effectively to external test sets and outperforms existing state-of-the-art models across all tested datasets. Subgroup analyses reveal consistent high performance, indicating strong robustness and reliability. The developed algorithm and associated code are publicly accessible at https://github.com/DIAGNijmegen/oncology-kidney-abnormality-segmentation.
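For readers unfamiliar with the Dice coefficient used to quantify segmentation performance, the following sketch computes it for two binary masks given as flat lists of 0/1 voxel labels. The example masks are made up, and returning 1.0 when both masks are empty is a common convention assumed here, not something stated in the abstract.

```python
def dice_coefficient(pred, target):
    """Dice overlap 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical 1-D masks: the prediction covers one extra voxel.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 0, 0]
print(dice_coefficient(predicted_mask, reference_mask))
```

Dice ranges from 0 (no overlap) to 1 (identical masks); unlike plain voxel accuracy, it ignores the large background region that both masks label as 0.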


2505.07833 2026-06-09 cs.DC cs.AI cs.MA cs.OS 版本更新

Harmonia: End-to-End RAG Serving Optimization

Harmonia: 端到端RAG服务优化

Saurabh Agarwal, Bodun Hu, Luis Pabon, Myungjin Lee, Jayanth Srinivasa, Aditya Akella

发表机构 * UT Austin（德克萨斯大学奥斯汀分校）； Cisco Research（思科研究）； Cisco Systems（思科系统）

AI总结提出Harmonia框架，通过灵活管道接口、异构感知部署和闭环运行时控制器，优化RAG服务，吞吐量提升2.04倍以上，SLO违规减少78.4%。

2507.08920 2026-06-09 q-bio.BM cs.AI 版本更新

AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model

AMix-1: 迈向测试时可扩展的蛋白质基础模型

Changze Lv, Jiang Zhou, Siyu Long, Lihao Wang, Jiangtao Feng, Dongyu Xue, Yu Pei, Hao Wang, Zherui Zhang, Yuchen Cai, Zhiqiang Gao, Ziyuan Ma, Jiakai Hu, Chaochen Gao, Jingjing Gong, Yuxuan Song, Shuyi Zhang, Xiaoqing Zheng, Deyi Xiong, Lei Bai, Wanli Ouyang, Ya-Qin Zhang, Wei-Ying Ma, Bowen Zhou, Hao Zhou

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Generative Symbolic Intelligence Lab (GenSI), Tsinghua University（生成符号智能实验室（GenSI），清华大学）； Institute for AI Industry Research (AIR), Tsinghua University（人工智能产业研究院（AIR），清华大学）； Tsinghua University（清华大学）； Fudan University（复旦大学）； Tianjin University（天津大学）； Georgia Institute of Technology（佐治亚理工学院）； Beijing University of Posts and Telecommunications（北京邮电大学）； University of Chinese Academy of Sciences（中国科学院大学）； City University of Hong Kong（香港城市大学）

AI总结提出基于贝叶斯流网络的蛋白质基础模型AMix-1，通过预训练缩放律、涌现能力分析、上下文学习机制和测试时缩放算法，实现1.7B参数模型，并设计出活性提高50倍的AmeR变体。


AI中文摘要

我们介绍了AMix-1，一个强大的蛋白质基础模型，它基于贝叶斯流网络构建，并通过系统性的训练方法学增强，包括预训练缩放律、涌现能力分析、上下文学习机制和测试时缩放算法。为了保证稳健的可扩展性，我们建立了一个预测性缩放律，并通过损失视角揭示了结构理解的渐进涌现，最终得到了一个强大的17亿参数模型。在此基础上，我们设计了一种基于多序列比对（MSA）的上下文学习策略，将蛋白质设计统一到一个通用框架中，其中AMix-1识别MSA中的深层进化信号，并一致地生成结构和功能上连贯的蛋白质。该框架成功设计了一个显著改进的AmeR变体，其活性比野生型提高了高达50倍。为了突破蛋白质工程的边界，我们进一步为AMix-1配备了一种进化测试时缩放算法，用于计算机模拟定向进化，随着验证预算的增加，该算法提供了显著且可扩展的性能提升，为下一代实验室在环蛋白质设计奠定了基础。

英文摘要

We introduce AMix-1, a powerful protein foundation model built on Bayesian Flow Networks and empowered by a systematic training methodology encompassing pretraining scaling laws, emergent capability analysis, an in-context learning mechanism, and a test-time scaling algorithm. To guarantee robust scalability, we establish a predictive scaling law and reveal the progressive emergence of structural understanding from a loss perspective, culminating in a strong 1.7-billion-parameter model. Building on this foundation, we devise a multiple sequence alignment (MSA)-based in-context learning strategy to unify protein design into a general framework, where AMix-1 recognizes deep evolutionary signals among MSAs and consistently generates structurally and functionally coherent proteins. This framework enables the successful design of a dramatically improved AmeR variant with an up to $50\times$ activity increase over its wild type. Pushing the boundaries of protein engineering, we further empower AMix-1 with an evolutionary test-time scaling algorithm for in silico directed evolution that delivers substantial, scalable performance gains as verification budgets are intensified, laying the groundwork for next-generation lab-in-the-loop protein design.


2509.10334 2026-06-09 cs.CV cs.AI cs.LG 版本更新

I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation

I-Segmenter: 用于高效语义分割的纯整数视觉Transformer

Jordan Sassoon, Michal Szczepanski, Martyna Poreba

发表机构 * CEA, France（法国原子能委员会）

AI总结提出I-Segmenter，首个全整数ViT分割框架，通过整数运算替换、λ-ShiftGELU激活函数及解码器优化，在保持精度前提下显著降低模型大小和推理延迟。

Comments Accepted by the Journal of Systems Architecture


AI中文摘要

视觉Transformer（ViT）最近在语义分割中取得了强劲的结果，但由于其高内存占用和计算成本，在资源受限设备上的部署仍然有限。量化提供了一种提高效率的有效策略，但基于ViT的分割模型在低精度下非常脆弱，因为量化误差会在深度编码器-解码器流水线中累积。我们引入了I-Segmenter，这是第一个完全纯整数的ViT分割框架。基于Segmenter架构，I-Segmenter系统地将浮点运算替换为纯整数对应运算。为了进一步稳定训练和推理，我们提出了λ-ShiftGELU，一种新颖的激活函数，它减轻了均匀量化在处理长尾激活分布时的局限性。此外，我们移除了L2归一化层，并将解码器中的双线性插值替换为最近邻上采样，确保整个计算图都是纯整数执行。大量实验表明，I-Segmenter在合理精度范围内（平均5.1%）达到其FP32基线的精度，同时将模型大小减少高达3.8倍，并通过优化的运行时实现高达1.2倍的推理加速。值得注意的是，即使在单张校准图像的一次性PTQ中，I-Segmenter也能提供有竞争力的精度，凸显了其在实际部署中的实用性。

英文摘要

Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose $\lambda$-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest-neighbor upsampling, ensuring integer-only execution throughout the computational graph. Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment.
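One reason nearest-neighbor upsampling suits an integer-only graph is that it only copies existing values through integer index arithmetic, whereas bilinear interpolation needs fractional weights. The sketch below illustrates that general property; the function name and the tiny example grid are hypothetical and this is not the paper's decoder code.

```python
def nearest_upsample(grid, scale):
    """Upsample a 2-D integer grid by an integer factor using index division.

    Each output cell (i, j) copies input cell (i // scale, j // scale),
    so no multiplication by fractional weights is ever required.
    """
    height, width = len(grid), len(grid[0])
    return [
        [grid[i // scale][j // scale] for j in range(width * scale)]
        for i in range(height * scale)
    ]


# Hypothetical 2x2 map of quantized (integer) logits.
logits = [
    [1, 2],
    [3, 4],
]
for row in nearest_upsample(logits, 2):
    print(row)
```

Every value in the output already exists in the input, so the result stays in the same integer range and needs no requantization step.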


2510.10028 2026-06-09 cs.LG cs.AI cs.DC 版本更新

面向工业预测的具有距离感知的物理约束概率框架开发

Waleed Razzaq, Yun-Bo Zhao

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出两种无需采样的距离感知物理约束概率框架PC-SNGP和PC-SNER，通过谱归一化和动态加权策略平衡数据保真度与物理一致性，在轴承预测中提升精度和不确定性校准。


AI中文摘要

可靠且物理可解释的工业预测概率框架的发展仍处于初期阶段，现有文献在输入远离训练流形时往往不敏感。本文开发了两种无需采样的、具有距离感知的物理约束概率框架：(i) PC-SNGP 和 (ii) PC-SNER。两者均对隐藏层权重应用谱归一化，强制从输入到潜在空间的bi-Lipschitz距离保持表示。PC-SNGP将密集输出替换为高斯过程，其后验方差随输入与训练流形的距离增加而增大。PC-SNER修改输出层以预测Normal-Inverse-Gamma (NIG)参数，用于距离保持估计。为在训练过程中保持数据保真度与物理一致性之间的平衡，我们引入了物理约束损失的动态加权策略。我们还引入了一个距离感知系数 (DAC) 指标来量化对分布偏移的敏感性。实验上，我们使用PRONOSTIA、XJTU-SY和HUST基准数据集在滚动轴承 (REBs) 预测上验证了两种框架。实验结果表明，与竞争基线相比，预测精度提高，不确定性估计校准良好，同时在交叉验证中保持可审计性能，并在极端对抗扰动下具有鲁棒性。

英文摘要

The development of reliable and physically interpretable probabilistic frameworks for industrial prognostics remains nascent, and existing approaches are often insensitive as inputs move away from the training manifold. In this paper, we develop two sampling-free, distance-aware physics-constrained probabilistic frameworks: (i) PC-SNGP and (ii) PC-SNER. Both apply spectral normalization to hidden-layer weights, enforcing a bi-Lipschitz, distance-preserving representation from the input to the latent space. PC-SNGP replaces the dense output layer with a Gaussian process whose posterior variance increases with input distance from the training manifold. PC-SNER modifies the output layer to predict Normal-Inverse-Gamma (NIG) parameters for distance-preserving estimation. To maintain balance between data fidelity and physical consistency during training, we introduce a dynamic weighting strategy for the physics-constrained loss. We also introduce a distance-aware coefficient (DAC) metric to quantify sensitivity to distributional shifts. Empirically, we validate both frameworks on rolling-element bearing (REB) prognostics using the PRONOSTIA, XJTU-SY, and HUST benchmark datasets. Experimental results demonstrate improved prediction accuracy and well-calibrated uncertainty estimates relative to competing baselines, while maintaining auditable performance in cross-validation and robustness under extreme adversarial perturbations.
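Spectral normalization of a weight matrix can be sketched in a few lines. The example below estimates the largest singular value by power iteration and rescales the matrix only when that value exceeds a bound, which is one common form of the technique. The function names, the bound of 1.0, and the fixed iteration count are illustrative assumptions, not details from the paper.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def norm(vector):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def spectral_norm(weights, iters=100):
    """Estimate the largest singular value of `weights` by power iteration."""
    transposed = [list(col) for col in zip(*weights)]
    v = [1.0] * len(weights[0])  # fixed start; assumed not orthogonal to the top direction
    for _ in range(iters):
        v = matvec(transposed, matvec(weights, v))  # apply W^T W
        length = norm(v)
        if length == 0.0:
            return 0.0
        v = [x / length for x in v]
    return norm(matvec(weights, v))


def spectrally_normalize(weights, bound=1.0):
    """Rescale `weights` so its spectral norm does not exceed `bound`."""
    sigma = spectral_norm(weights)
    if sigma <= bound:
        return [row[:] for row in weights]
    scale = bound / sigma
    return [[x * scale for x in row] for row in weights]
```

Bounding each hidden layer's spectral norm limits how much the layer can stretch distances between inputs, which is the property that distance-aware uncertainty estimates rely on.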


2601.11541 2026-06-09 cs.HC cs.AI cs.CY 版本更新

A Comparative Study of Student Perspectives on Technical Writing Feedback Quality: Evaluating LLMs, SLMs, and Humans in Computer Science Topics

学生视角下技术写作反馈质量比较研究：评估计算机科学主题中的LLM、SLM和人类

Suqing Liu, Runlong Ye, Christopher Eaton, Bogdan Simion, Michael Liut

发表机构 * McMaster University（麦克马斯特大学）； Department of Computer Science, University of Toronto（多伦多大学计算机科学系）； Research Institute for the Study of University Pedagogy, University of Toronto Mississauga（多伦多大学密西沙加分校大学教学法研究所）； Department of Mathematical and Computational Sciences, University of Toronto Mississauga（多伦多大学密西沙加分校数学与计算科学系）

AI总结本研究比较了本地部署的小语言模型（SLM）、商业大语言模型（LLM）和人类导师在计算机科学课程中提供写作反馈的质量，发现SLM在可读性和可操作性上获得学生更高评价，而人类反馈在专业写作任务中更受青睐。

Comments accepted at AIED 26

2601.15408 2026-06-09 cs.CV cs.AI cs.CL cs.LG 版本更新

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

CURE：基于课程引导的多任务训练实现可靠的解剖学接地报告生成

Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem

发表机构 * Pontificia Universidad Católica de Chile（智利天主教大学）； CENIA ； iHEALTH ； KAUST（阿卜杜拉国王科技大学）

AI总结提出CURE框架，通过课程学习动态调整多任务训练，提升医学报告生成的视觉接地准确性和事实一致性，无需额外数据。

Comments 31 pages, 7 figures, accepted to CVPR 2026 (oral)


Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 36279-36289

AI中文摘要

医学视觉语言模型可以自动生成放射学报告，但在精确的视觉接地和事实一致性方面存在困难。现有模型常常将文本发现与视觉证据错误对齐，导致不可靠或弱接地的预测。我们提出CURE，一个错误感知的课程学习框架，无需任何额外数据即可改善接地和报告质量。CURE在短语接地、接地报告生成和解剖学接地报告生成上，使用公共数据集微调多模态指令模型。该方法基于模型性能动态调整采样，强调困难样本以改善空间和文本对齐。CURE将接地准确率提高了+0.35 IoU，报告质量提高了+0.192 CXRFEScore，并将幻觉减少了18.6%。CURE是一个数据高效的框架，增强了接地准确性和报告可靠性。代码可从此https URL获取，模型权重可从此https URL获取。

英文摘要

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.35 IoU, boosts report quality by +0.192 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
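
The abstract says only that sampling is adjusted dynamically according to model performance, with harder samples emphasized. A minimal sketch of that idea follows, assuming a softmax over per-task error rates. The task names, error values, temperature, and the softmax rule itself are illustrative assumptions, not the authors' schedule.

```python
import math
import random


def sampling_weights(errors, temperature=0.5):
    """Turn per-task error rates into sampling probabilities (softmax).

    Higher error -> higher probability. This rule is an assumption made
    for illustration; the abstract does not give CURE's exact formula.
    """
    scaled = {task: err / temperature for task, err in errors.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {task: math.exp(value - peak) for task, value in scaled.items()}
    total = sum(exps.values())
    return {task: e / total for task, e in exps.items()}


# Hypothetical validation error rates for the three training tasks.
errors = {
    "phrase_grounding": 0.20,
    "grounded_report": 0.35,
    "anatomy_grounded_report": 0.50,
}
weights = sampling_weights(errors)

# Draw the task for each of the next 8 training batches.
rng = random.Random(0)
tasks = list(weights)
batch_tasks = rng.choices(tasks, weights=[weights[t] for t in tasks], k=8)
```

Recomputing `weights` after each evaluation round would shift sampling toward whichever task currently has the highest error, which is one plausible reading of "dynamically adjusts sampling".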

2601.20408 2026-06-09 cs.DC cs.AI (version update)

Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT

Nicholas Santavas, Kareem Eissa, Patrycja Cieplicka, Piotr Florek, Matteo Nulli, Stefan Vasilev, Seyyed Hadi Hashemi, Antonios Gasteratos, Shahram Khadivi

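Affiliations: Anonymous Authors

AI summary: Proposes OptiKIT, a distributed LLM optimization framework that automates complex optimization workflows, giving non-expert teams dynamic resource allocation and pipeline execution. It delivers more than 2x GPU throughput and lowers the barrier to optimization.

Comments: Accepted in MLSys 2026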

Abstract

Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expertise required for manual optimization remains a niche and scarce skillset. This challenge is particularly evident in managing GPU utilization across heterogeneous infrastructure while enabling teams with diverse workloads and limited LLM optimization experience to deploy models efficiently. We present OPTIKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows for non-expert teams. OPTIKIT provides dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production, it delivers more than 2x GPU throughput improvement while empowering application teams to achieve consistent performance improvements without deep LLM optimization expertise. We share both the platform design and key engineering insights into resource management, pipeline orchestration, and integration patterns that enable large-scale, production-grade democratization of model optimization. Finally, we open-source the system to enable external contributions and broader reproducibility.

2601.20503 2026-06-09 cs.CV cs.AI (version update)

Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI

Jesse Phitidis, Alison Q. Smithard, William N. Whiteley, Joanna M. Wardlaw, Miguel O. Bernabeu, Maria Valdés Hernández

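Affiliations: University of Edinburgh

AI summary: This study systematically evaluates six strategies for training a joint white matter hyperintensity and ischaemic stroke lesion segmentation model on partially labelled data. It finds pseudolabelling the most effective; the approach improves model performance and can support large-scale clinical research.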

Abstract

White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are key imaging biomarkers of cerebral small vessel disease (SVD) detectable on magnetic resonance imaging (MRI). The development of robust deep learning models to automatically segment and differentiate these pathologies remains challenging. Specifically, WMH and ISL frequently co-occur within the same subject and present as visually confounding hyperintensities on fluid-attenuated inversion recovery (FLAIR) sequences, complicating their accurate delineation. To address the scarcity of fully annotated cohorts, we systematically evaluated six accessible strategies for training a joint WMH and ISL segmentation model using partially labelled data. We aggregated privately held and publicly available datasets to curate a large-scale cohort of 2,052 MRI volumes, of which 1,341 and 1,152 volumes contained ground truth annotations for WMH and ISL, respectively. Our analysis indicates that multiple strategies effectively leverage partially labelled data to enhance overall model performance, with pseudolabelling emerging as the most effective approach. This model exhibited a consistent WMH segmentation policy and successfully detected the majority of FLAIR-positive ISL. These findings demonstrate the viability of using partially labelled data to develop reliable automated segmentation tools, which can support ongoing SVD monitoring and high-throughput biomarker extraction for large-scale clinical research.

2602.10016 2026-06-09 cs.IR cs.AI (version update)

Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design

Bojian Hou, Xiaolong Liu, Xiaoyi Liu, Jiaqi Xu, Yasmine Badr, Mengyue Hang, Sudhanshu Chanpuriya, Junqing Zhou, Yuhang Yang, Han Xu, Qiuling Suo, Laming Chen, Yuxi Hu, Jiasheng Zhang, Huaqing Xiong, Yuzhen Huang, Chao Chen, Yue Dong, Yi Yang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Darren Liu, Jade Nie, Chunzhi Yang, Ellie Wen, Jiyan Yang, Huayu Li

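Affiliations: Meta Platforms, Inc.; OpenAI

AI summary: To address the lack of predictable scaling laws in massive-scale recommendation systems, proposes the Kunlun architecture, which improves model efficiency through low-level optimizations (GDPA, HSP, sliding window attention) and high-level innovations (CompSkip, event-level personalization). MFU rises from 17% to 37%, scaling efficiency doubles, and the architecture is deployed in Meta Ads models.

Comments: 10 pages, 4 figures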

Abstract

Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Our high-level innovations feature Computation Skip (CompSkip) and Event-level Personalization. These advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.

2602.10172 2026-06-09 astro-ph.IM cs.AI (version update)

Cosmo3DFlow: Wavelet Flow Matching for Spatial-to-Spectral Compression in Reconstructing the Early Universe

Md. Khairul Islam, Zeyu Xia, Ryan Goudjil, Jialu Wang, Arya Farahi, Judy Fox

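Affiliations: Department of Computer Science, University of Virginia; Department of Statistics and Data Sciences, The University of Texas at Austin; School of Data Science

AI summary: Proposes Cosmo3DFlow, a framework combining the 3D discrete wavelet transform with flow matching. Spatial-to-spectral compression addresses the dimensionality and sparsity bottlenecks in reconstructing high-dimensional cosmic structure, with sampling up to 46x faster than diffusion models.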

Abstract

Reconstructing the early universe from the evolved present-day universe is a challenging and computationally demanding problem in modern astrophysics. We devise a novel generative framework, Cosmo3DFlow, designed to address dimensionality and sparsity, the critical bottlenecks inherent in current state-of-the-art methods for cosmological inference. By integrating 3D Discrete Wavelet Transform (DWT) with flow matching, we effectively represent high-dimensional cosmological structures. The Wavelet Transform addresses the "void problem" by translating spatial emptiness into spectral sparsity. It decouples high-frequency details from low-frequency structures, and wavelet-space velocity fields facilitate stable ordinary differential equation (ODE) solvers with large step sizes. Using large-scale cosmological $N$-body simulations at $128^3$ resolution, we achieve up to $46\times$ faster sampling than diffusion models. Our results enable initial conditions to be sampled in seconds, compared to minutes for previous methods.
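
The claim that a wavelet transform turns spatial emptiness into spectral sparsity can be illustrated in one dimension. The sketch below uses a Haar wavelet on a 16-sample line; the choice of Haar, the 1D setting, and the toy density values are assumptions for illustration (the paper works with a 3D DWT and does not name the wavelet in its abstract).

```python
def haar_step(signal):
    """One level of the 1D Haar transform: pairwise averages and differences."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail


def haar_full(signal):
    """Apply haar_step until one average remains; coarse coefficients come first."""
    coeffs = []
    approx = list(signal)
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        coeffs = detail + coeffs
    return approx + coeffs


# A flat "void" of uniform density with a single localized overdensity.
field = [1.0] * 16
field[5] = 9.0

coeffs = haar_full(field)
nonzero_spatial = sum(1 for v in field if v != 0.0)
nonzero_spectral = sum(1 for c in coeffs if c != 0.0)
```

Every sample of `field` is nonzero, but only the overall average and one detail coefficient per scale survive in `coeffs`: flat regions contribute zero detail at every level, which is the sparsity the abstract refers to.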

2602.10234 2026-06-09 physics.soc-ph cs.AI cs.RO (version update)

Transforming Police-Car Swerving for Mitigating Isolated Stop-and-Go Traffic Waves: A Practice-Oriented Jam-Absorption Driving Strategy

Zhengbing He

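Affiliations: Faculty of Science and Engineering, University of Nottingham Ningbo China

AI summary: This paper proposes a practical jam-absorption driving (JAD) strategy inspired by police-car swerving. By defining the JAD Triangle, it suppresses an isolated stop-and-go wave using a single vehicle and two detectors, systematically analyzes five key parameters, and validates the strategy in simulation.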

Abstract

Stop-and-go traffic waves, a major form of freeway congestion, impose severe and persistent adverse impacts, including reduced traffic efficiency, increased safety risks, and elevated vehicle emissions. Among various freeway traffic management strategies, jam-absorption driving (JAD), in which a dedicated vehicle performs "slow-in" and "fast-out" maneuvers before being captured by a stop-and-go wave, has been proposed as a promising approach to suppressing the propagation of such waves. However, most existing JAD strategies remain impractical, primarily due to the lack of consideration of implementation vehicles and operational conditions. Inspired by real-world observations of police-car swerving behavior, this paper first introduces the Single-Vehicle Double-Detector Jam-Absorption Driving (SD-JAD) problem and then proposes a practical JAD strategy based on a definition of the JAD Triangle, transforming such behavior into a traffic control strategy capable of suppressing the propagation of an isolated stop-and-go wave. Five key parameters that significantly affect the proposed strategy, namely JAD speed, inflow traffic speed, wave width, wave speed, and in-wave speed, are identified and systematically analyzed. Using a SUMO-based simulation as an illustrative example, we further demonstrate how these parameters can be measured in practice using only two stationary roadside traffic detectors. The results show that the proposed JAD strategy successfully suppresses the propagation of a stop-and-go wave without triggering secondary waves. This paper is expected to take a significant step toward the practical implementation of JAD, advancing it from a theoretical concept to a feasible and deployable traffic management strategy.

2602.23234 2026-06-09 cs.IR cs.AI cs.LG (version update)

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga, Sandip Gaikwad, Sean Suchter, Venkat Sundaranatha

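Affiliations: Apple

AI summary: To address the scarcity of expert textual relevance labels in App Store ranking, uses a fine-tuned LLM to generate millions of labels and combines them with behavioral relevance to optimize the ranker, significantly shifting the Pareto frontier outward and raising conversion rate.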

Abstract

Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.
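
The offline result is reported as NDCG measured against two label sets for the same ranking. A minimal sketch of that measurement follows, assuming the linear-gain form of DCG with a log2 discount; the gain values are hypothetical and the abstract does not state which NDCG variant was used.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains):
    """DCG of the given order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# One ranked result list, scored against two hypothetical label sets.
behavioral_gains = [3, 1, 2, 0]  # e.g. grades derived from clicks or downloads
textual_gains = [2, 3, 0, 1]     # e.g. LLM-generated textual relevance grades

ndcg_behavioral = ndcg(behavioral_gains)
ndcg_textual = ndcg(textual_gains)
```

An outward shift of the Pareto frontier, in these terms, means a new ranker raises both `ndcg_behavioral` and `ndcg_textual` rather than trading one for the other.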

2603.04177 2026-06-09 cs.SE cs.AI cs.LG (version update)

CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

Alex Thillen, Niels Mündler, Veselin Raychev, Martin Vechev

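Affiliations: University of California, Berkeley; ETH Zurich

AI summary: Studies the refactoring ability of LLM agents. The CodeTaste benchmark shows that agents perform well when a refactoring is specified in detail but struggle to discover on their own the refactorings that humans chose; a propose-then-implement decomposition improves alignment.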

Abstract

LLM coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program transformations that improve structure and maintainability. We investigate whether agents (i) can execute refactorings reliably and (ii) identify the refactorings that human developers actually chose in real codebases. To this end, we construct CodeTaste, a benchmark mined from large multi-file open-source refactorings. To score solutions, we combine repository test suites that measure functional correctness with tailored static checks that verify removal of undesired and introduction of desired code patterns using dataflow reasoning. Our results show a clear gap: agents perform well at implementing refactorings that are specified in detail, but often fail to discover the human refactoring choices when given a focus area for changes. A propose-then-implement decomposition improves alignment, and selecting the best-aligned proposal before implementation can yield further gains. CodeTaste provides an evaluation target and a potential preference signal for aligning coding agents with human refactoring decisions in realistic codebases. We release the benchmark, leaderboard, and code.

2603.12666 2026-06-09 cs.LG cs.AI (version update)

RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction

Hanbum Ko, Chanhui Lee, Ye Rin Kim, Rodrigo Hormazabal, Sehui Han, Sungbin Lim, Sungwoong Kim

Affiliations: Department of Artificial Intelligence, Korea University; Department of Statistics, Korea University; Materials Intelligence Lab, LG AI Research

AI summary: RetroReasoner uses supervised fine-tuning and reinforcement learning to capture chemists' disconnection-based reasoning, improving the accuracy and diversity of retrosynthesis prediction.

Comments: 35 pages, 19 figures

Abstract

Retrosynthesis prediction aims to identify reactants that can synthesize a given product molecule. Although molecular large language models (LLMs) have recently shown promising results, most existing methods either generate reactants directly or provide only generic product-level analysis, without explicitly reasoning about bond-disconnection strategies that justify specific reactant choices. This paper proposes RetroReasoner, a retrosynthetic reasoning model that captures chemists' strategic disconnection-based thinking. RetroReasoner is trained with supervised fine-tuning and reinforcement learning. For supervised fine-tuning, SyntheticRetro generates structured disconnection rationales paired with reactant predictions. For reinforcement learning, a round-trip reward evaluates predicted reactants by passing them through a forward synthesis model and rewarding predictions that reconstruct the original product. RetroReasoner can also be applied to multi-step retrosynthetic planning by incorporating it into a parallelized Monte Carlo tree search framework, reducing search time while increasing the number and diversity of valid synthetic pathways. Experimental results show that RetroReasoner outperforms prior baselines, including not only molecular LLMs but also retrosynthesis-specific expert models, and generates a broader range of feasible reactant proposals, especially for challenging reaction instances.
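
The round-trip reward described above can be sketched as a function that runs the predicted reactants through a forward model and checks whether the original product comes back. Everything below is a hedged illustration: the binary 0/1 reward, the function names, and the lookup-table forward model are assumptions, and the SMILES-like strings are placeholders rather than canonicalized structures.

```python
from typing import Callable, FrozenSet

Reactants = FrozenSet[str]


def round_trip_reward(product: str,
                      predicted: Reactants,
                      forward_model: Callable[[Reactants], str]) -> float:
    """Return 1.0 if the forward model maps the predicted reactants back to the product."""
    return 1.0 if forward_model(predicted) == product else 0.0


# Toy forward model: a lookup table standing in for a learned synthesis model.
KNOWN_REACTIONS = {
    # acetic acid + ethanol -> ethyl acetate (illustrative strings)
    frozenset({"CC(=O)O", "CCO"}): "CC(=O)OCC",
}


def toy_forward_model(reactants: Reactants) -> str:
    """Look up the product for a reactant set; empty string if unknown."""
    return KNOWN_REACTIONS.get(reactants, "")


good = round_trip_reward("CC(=O)OCC", frozenset({"CC(=O)O", "CCO"}), toy_forward_model)
bad = round_trip_reward("CC(=O)OCC", frozenset({"CC(=O)O", "CO"}), toy_forward_model)
```

Because the reward depends only on whether the product is reconstructed, it can credit valid reactant sets that differ from the single reference answer in the training data.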

2603.29875 2026-06-09 cs.IR cs.AI cs.CL (version update)

UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough

Ryszard Tuora, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Mateusz Czyżnikiewicz, Adam Kozakiewicz, Tomasz Ziętkiewicz

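Affiliations: Samsung AI Warsaw

AI summary: This paper proposes UnWeaver, a framework that uses an LLM to disentangle document contents into entities that span chunks, improving the accuracy and efficiency of retrieval and generation. Experiments show that VectorRAG is cheaper than GraphRAG.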

DOI: 10.5281/zenodo.19203878

Abstract

One of the key problems in retrieval-augmented generation (RAG) systems is that chunk-based retrieval pipelines represent the source chunks as atomic objects, mixing the information contained within a chunk into a single vector. These vector representations are then treated as isolated, independent, and self-sufficient, with no attempt to represent possible relations between them. Such an approach has no dedicated mechanism for handling multi-hop questions. Graph-based RAG systems aim to ameliorate this problem by modeling information as knowledge graphs, with entities represented by nodes that are connected by robust relations and form hierarchical communities. This approach, however, has issues of its own, including component complexity that is orders of magnitude higher in order to build graph-based indices, and reliance on heuristics for retrieval. We propose UnWeaver, a novel RAG framework that simplifies the idea of GraphRAG. UnWeaver uses an LLM to disentangle the contents of the documents into entities that can occur across multiple chunks. In the retrieval process, entities serve as an intermediate step for recovering the original text chunks, thereby preserving fidelity to the source material. We argue that entity-based decomposition yields a more distilled representation of the original information and additionally reduces noise in the indexing and generation processes. Furthermore, we show experimentally that in end-to-end QA evaluation VectorRAG performs better than standard GraphRAG and almost as well as current SOTA graph-based solutions, for a fraction of the cost.
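
The retrieval path described above (query to entities, entities back to original chunks) can be sketched with an inverted index. This is a simplified illustration under stated assumptions: the entity lists are hand-written stand-ins for the LLM extraction step, and substring matching replaces whatever entity search the paper actually uses.

```python
from collections import defaultdict

chunks = {
    "c1": "Marie Curie was born in Warsaw.",
    "c2": "Curie won the Nobel Prize in Physics in 1903.",
    "c3": "Warsaw is the capital of Poland.",
}

# Hypothetical output of the LLM-based extraction step: entities per chunk.
chunk_entities = {
    "c1": ["Marie Curie", "Warsaw"],
    "c2": ["Marie Curie", "Nobel Prize"],
    "c3": ["Warsaw", "Poland"],
}


def build_entity_index(chunk_entities):
    """Map each entity to every chunk it occurs in (entities span chunks)."""
    index = defaultdict(set)
    for chunk_id, entities in chunk_entities.items():
        for entity in entities:
            index[entity.lower()].add(chunk_id)
    return index


def retrieve(query, index, chunks):
    """Find entities mentioned in the query, then return their source chunks verbatim."""
    query_lower = query.lower()
    hit_ids = set()
    for entity, chunk_ids in index.items():
        if entity in query_lower:  # stand-in for a real entity search
            hit_ids |= chunk_ids
    return [chunks[chunk_id] for chunk_id in sorted(hit_ids)]


index = build_entity_index(chunk_entities)
results = retrieve("Where was Marie Curie born?", index, chunks)
```

The generator only ever sees the original chunk text, which is how the entity layer can guide retrieval without altering the source material.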

2604.08849 2026-06-09 cs.CL cs.AI cs.DB cs.MA cs.SC (version update)

SatIR: Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching

Cyrus Zhou, Yufei Jin, Yilin Xu, Yu-Chiang Wang, Chieh-Ju Chao, Monica S. Lam

Affiliations: Department of Computer Science, Stanford University; Samueli Electrical and Computer Engineering, UCLA; Department of Computer Science and Informatics, Emory University; Mayo Clinic

AI summary: SatIR converts clinical trial eligibility criteria and summaries into formal constraints and combines SMT, relational algebra, and large language models, improving the recall and efficiency of clinical trial matching over similarity-based baselines.

Abstract

Many important retrieval problems are not merely problems of semantic similarity, but problems of constraint satisfaction: a retrieved item should be topically relevant to a query and satisfy explicit requirements involving negation, temporal conditions, numeric thresholds, exceptions, ontological relations, and incomplete evidence. We study this challenge in clinical trial matching, a high-stakes test bed where a useful trial must both address a patient's medical needs and satisfy complex eligibility criteria. We propose SatIR, a scalable constraint-based retrieval method for clinical trial matching. SatIR converts trial eligibility criteria and summaries into formal constraints, then retrieves patient--trial pairs by executing these constraints over a database. The system combines Satisfiability Modulo Theories (SMT), relational algebra, medical ontology grounding, and large language models (LLMs): formal methods provide executable and inspectable matching, while LLMs convert ambiguous, incomplete, and implicit clinical information into explicit, controllable constraint representations. Across the SIGIR 2016 patient--trial collection and TREC-2022-RetrievalSubset, a benchmark derived from TREC 2022, SatIR consistently improves eligibility-aware retrieval over similarity-based baselines. Relative to TrialGPT-style retrieval, SatIR retrieves 32%--72% more relevant-and-eligible trials per patient on SIGIR 2016 and achieves $1.8$--$3.2\times$ higher eligible-trial recall on TREC-2022-RetrievalSubset. Retrieval is fast, requiring only 146 milliseconds per patient over 3,621 SIGIR trials.
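
Retrieval by constraint execution, as opposed to similarity ranking, can be sketched with plain predicates over a patient record. The sketch below is a deliberately reduced stand-in: Python lambdas replace the SMT and relational-algebra machinery, and the trial IDs, field names, and thresholds are all hypothetical.

```python
def eligible(patient, trial):
    """True only if every inclusion rule holds and no exclusion rule holds."""
    meets_inclusions = all(rule(patient) for rule in trial["include"])
    hits_exclusion = any(rule(patient) for rule in trial["exclude"])
    return meets_inclusions and not hits_exclusion


# Hypothetical trials with numeric thresholds and a negated condition.
trials = {
    "T1": {
        "include": [lambda p: p["age"] >= 18, lambda p: p["hba1c"] > 7.0],
        "exclude": [lambda p: "insulin" in p["medications"]],
    },
    "T2": {
        "include": [lambda p: 40 <= p["age"] <= 75],
        "exclude": [lambda p: p["hba1c"] > 8.0],
    },
}

patient = {"age": 52, "hba1c": 8.1, "medications": ["metformin"]}

matches = [trial_id for trial_id, trial in sorted(trials.items())
           if eligible(patient, trial)]
```

Here `T2` is on-topic for the same patient but is rejected by its exclusion threshold, a distinction that an embedding-similarity ranker has no explicit way to enforce.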

2604.10842 2026-06-09 cs.SE cs.AI (version update)

APEX: Large-Scale Multi-Task Aesthetics-Aware Popularity Prediction for AI-Generated Music

Jaavid Aktar Husain, Dorien Herremans

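Affiliations: AMAAI Lab, Singapore University of Technology and Design

AI summary: Proposes APEX, a framework that uses MERT audio embeddings to jointly predict popularity metrics and five aesthetic quality dimensions for AI-generated music, and validates on the Music Arena dataset that aesthetic features generalize to preference prediction.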

Abstract

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. A key yet unexplored factor in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio. APEX jointly predicts engagement-based popularity signals (streams and likes scores) alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.

2605.11314 2026-06-09 cs.CV cs.AI (version update)

Quantifying Rodda and Graham Gait Classification from 3D Markerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort

Lauhitya Reddy, Seth Donahue, Jeremy Bauer, Susan Sienko, Anita Bagley, Joseph Krzak, Maura Eveld, Karen Kruger, Ross Chafetz, Vedant Kulkarni, Hyeokhyen Kwon

Affiliations: Department of Biomedical Informatics, Emory University; Shriners Children’s; The Wallace H. Coulter Department of Biomedical Engineering, Emory University and Georgia Institute of Technology

AI summary: This paper proposes a markerless gait analysis method based on single-view video that quantifies the knee and ankle z-scores used in the Rodda and Graham gait classification, enabling scalable, objective gait assessment in low-resource clinical settings.

Comments: 29 pages, 8 figures, 9 tables (including 1 supplementary table); manuscript prepared in PLOS ONE format

Abstract

Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approximately 75% of children with CP are ambulatory, and accurate gait assessment is central to preserving walking function, which deteriorates by mid-adulthood in a quarter to half of adults with CP. The Rodda and Graham classification system quantifies sagittal-plane gait deviations using ankle and knee z-scores derived from 3D Instrumented Gait Analysis (3D-IGA), but 3D-IGA is expensive and limited to specialized centers, while observational assessment shows only moderate inter-rater agreement. We developed a markerless gait analysis pipeline that quantifies Rodda and Graham knee and ankle z-scores directly from single-view clinical gait videos. Across 1,058 bilateral limb samples from 529 trials of 152 children (88 male, 63 female; age 12.1 $\pm$ 4.0 years; 60 distinct primary diagnoses, cerebral palsy the most common at $n=54$), the sagittal-view model achieved $R^2 = 0.80 \pm 0.02$ and CCC $= 0.89 \pm 0.02$ for knee z-scores and $R^2 = 0.57 \pm 0.02$ and CCC $= 0.72 \pm 0.02$ for ankle z-scores against 3D-IGA. Binary screening for excess knee flexion achieved AUROC $= 0.88$, correctly identifying 83% of affected children, and applying Rodda and Graham rules yielded $43 \pm 1$% 7-class accuracy with macro-AUROC $= 0.78 \pm 0.01$, with ankle prediction error remaining the primary bottleneck. Beyond cross-sectional screening, continuous z-scores support longitudinal trajectory tracking across visits, providing a quantitative substrate for monitoring disease progression and treatment response that is unavailable from observational scales. These results demonstrate the feasibility of video-based z-score estimation, excess-flexion screening, and longitudinal trajectory tracking as a path toward scalable, objective gait assessment in low-resource clinical settings.

2605.16972 2026-06-09 cs.HC cs.AI (version update)

WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI

Jingjing Li, Zhi Liu, Xiyao Jin, Tatsuki Fushimi, Yoichi Ochiai

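Affiliations: University of Tsukuba

AI summary: This study proposes WhiteTesseract, a system combining XR and conversational AI that aims to make cultural heritage exhibitions more immersive and personalized and to strengthen visitors' engagement and reflection.

Comments: 38 pages, 13 figures. Accepted for publication in ACM Journal on Computing and Cultural Heritage (JOCCH)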

Abstract

Cultural heritage exhibitions often struggle to sustain attention and support reflective engagement. Physical exhibitions rely on fixed interpretive aids that lack adaptability to individual backgrounds or curiosity, and their effectiveness depends heavily on a visitor's Personal Context, prior knowledge, and cultural literacy. Meanwhile, digital exhibitions prioritize convenience and accessibility but risk weakening the Physical and Social Contexts that define embodied cultural experience. WhiteTesseract addresses this gap by enabling in-situ interpretation through high-resolution XR and conversational AI. The system integrates spatial intelligence via artwork recognition to allow visitors to selectively reduce environmental distractions (via diminished reality) and engage in context-aware dialogue (via large language models). The goal is to preserve the richness of the physical and social environment while providing a flexible space for personal reflection, enhancing Personal Context without compromising physical authenticity. We deployed the system in a Claude Monet exhibition and conducted a controlled user study with 26 participants. Quantitative results showed that WhiteTesseract modulation significantly increased average viewing duration from 35.3 to 98.3 seconds (p < 0.001). Analysis of 529 visitor-AI interactions revealed that 60% extended beyond factual queries to include analytical, emotional, and comparative inquiries. These findings demonstrate how XR and AI can enrich the physical exhibition experience by supporting deeper, more personalized engagement without displacing the embodied value of cultural heritage. We discuss technical and social constraints for real-world deployment and limitations of our controlled setting.


2605.28510 2026-06-09 cs.SE cs.AI cs.IR 版本更新

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

高效可扩展的LLM生成代码片段溯源追踪

Andrea Gurioli, Davide D'Ascenzo, Federico Pennino, Maurizio Gabbrielli, Stefano Zacchiroli

发表机构 * University of Bologna（博洛尼亚大学）

AI总结提出混合两阶段溯源追踪流水线HYBRIDSOURCETRACKER，结合向量搜索与指纹匹配，实现LLM生成代码的高效、可扩展溯源。

详情

AI中文摘要

用于代码补全和生成的大型语言模型（LLM）在软件开发中日益普及，但它们可能会逐字复现训练示例且不注明出处，引发关于抄袭和许可合规的法律与伦理问题。基于指纹的经典抄袭检测器（如Winnowing）仍然高效，但检测需要将代码片段与整个训练集进行比较，其线性时间搜索使其不适用于训练现代代码LLM的十亿级语料库。为弥补这一差距，我们引入了SOURCETRACKER——一个专为代码检索定制的3亿参数编码器，以及混合两阶段溯源追踪流水线HYBRIDSOURCETRACKER（HST）。HST首先通过向量搜索缩小候选片段集，然后使用Winnowing对精确指纹进行重排序。我们在THESTACKV2数据集的1000万片段子集上训练和评估系统，包括逐字片段和模拟真实标识符重命名的改编片段。在包含改编查询的体外10万片段搜索空间中，我们的混合方法在30令牌片段上的平均倒数排名与Winnowing相当。然后，从>=60令牌的窗口开始，它持续优于Winnowing最多5.4%，同时保持对数时间查询复杂度。在使用基于LLM的评判者的补充评估中，我们发现许多未被标记为真实来源的检索片段与预期来源高度相似，尤其是在较长的上下文窗口中，因此对最终用户仍然有用。总体而言，我们的结果表明，将向量搜索与指纹识别相结合，能够实现对LLM生成的代码进行可扩展、高精度的溯源追踪。

英文摘要

Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, yet detection requires comparing code fragments against the entire training set, and their linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline, HYBRIDSOURCETRACKER (HST). HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
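The re-ranking stage described in this abstract can be illustrated with a simplified Winnowing fingerprint. The sketch below is not the authors' implementation. It assumes stage one (vector search) has already returned a small candidate set, and it uses whitespace tokens, SHA-1 hashes of k-grams, and Jaccard overlap of fingerprint sets as the score. The names `winnow` and `rerank` and the values of `k` and `window` are hypothetical choices made for illustration only.

```python
import hashlib


def kgram_hashes(tokens, k):
    """Hash every run of k consecutive tokens to a 64-bit integer."""
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        digest = hashlib.sha1(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def winnow(tokens, k=3, window=4):
    """Simplified Winnowing: keep the minimum hash of each sliding window.

    The original algorithm also records positions and a tie-breaking rule;
    this sketch keeps only the set of selected hash values.
    """
    hashes = kgram_hashes(tokens, k)
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def jaccard(a, b):
    """Jaccard similarity of two fingerprint sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rerank(query_tokens, candidates, k=3, window=4):
    """Order stage-one candidates (name -> tokens) by fingerprint overlap."""
    q = winnow(query_tokens, k, window)
    scored = [(jaccard(q, winnow(toks, k, window)), name)
              for name, toks in candidates.items()]
    return [name for _, name in sorted(scored, key=lambda t: (-t[0], t[1]))]
```

Because only the candidates returned by the first stage are fingerprinted, the cost of this step depends on the candidate count rather than on the corpus size, which is the point of the hybrid design.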


2605.29475 2026-06-09 cs.CL cs.AI cs.CE cs.HC 版本更新

MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery

MOOSE-Copilot：一个基于网络的交互式助手，用于统一探索性和细粒度科学假设发现

Hongran An, Zonglin Yang

发表机构 * Central Conservatory of Music（中央音乐学院）； Nanyang Technological University（南洋理工大学）

AI总结提出MOOSE-Copilot，通过形式化的人机交互协议，将发散性探索和收敛性细化统一，利用蓝图、路由和反馈三种信号引导生成，显著优于纯自主基线。

Comments Accepted to ACL 2026 (System Demonstrations)

详情

AI中文摘要

大型语言模型（LLMs）在科学假设发现中展现出显著潜力。然而，现有方法存在两个关键限制：它们将发散性探索构思和收敛性细粒度细化视为孤立任务，并且自主运行，几乎没有人类指导。我们提出了MOOSE-Copilot，这是第一个通过形式化的人机交互（HAII）协议弥合这一抽象差距的统一框架。我们的系统使科学家能够通过三种显式信号引导生成过程：初始蓝图、阶段间路由和再生反馈。定量评估表明，注入这些结构化专家信号显著优于纯自主基线，并在神谕指导下建立了性能上限。此外，为了普及这一范式，我们开发了一个直观的基于网络界面，具有交互式树状可视化。这明确消除了复杂命令行代理工具的陡峭学习曲线，使跨学科研究人员能够直接利用、视觉编排并加速端到端的科学突破。

英文摘要

Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory search and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human-AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and intra-stage feedback. Using an oracle-simulated evaluation in which an LLM provides idealized expert signals, we show that injecting these structured signals significantly outperforms purely autonomous baselines, characterizing the gains achievable under high-quality guidance. Furthermore, we build a web-based interface that turns the framework into a no-code workflow: researchers pose a question, watch the hypothesis search unfold as an interactive tree, and steer it by selecting hypotheses, routing between stages, and injecting feedback, with no command-line agents required. This makes end-to-end hypothesis discovery directly accessible to interdisciplinary researchers.


2606.04581 2026-06-09 cs.DC cs.AI cs.NI 版本更新

面向智能信息-物理-社会系统的区块链基础设施：具身AI时代的后量子安全、互操作性与可信数据经济

Song Guo, Huawei Huang, Dongping Liu, Aoyu Zhang, Luyao Zhang

发表机构 * Hong Kong University of Science and Technology（香港科技大学）； Sun Yat-sen University（中山大学）； Amazon Web Services（亚马逊网络服务）； Duke Kunshan University（杜克昆山大学）

AI总结本教程探讨区块链作为协调层，融合后量子密码学与具身AI，实现可扩展、可信的数据经济与跨组织治理。

详情

AI中文摘要

通过基于世界模型的机器人技术部署具身人工智能，为区块链基础设施带来了变革性机遇，迫切需求可信数据溯源、跨组织治理以及跨去中心化生态系统的激励兼容共享。同时，2025年诺贝尔物理学奖和图灵奖所认可的量子计算进展威胁着保障这些数据经济的密码学原语，形成相互依存的紧迫需求：具身AI的长期验证依赖于能够抵御量子对手的密码敏捷架构。本教程考察区块链作为协调层，架起这一双重转型的桥梁——从金融底层到基础性信息-物理-社会系统基础设施，同时抵御量子密码分析并实现可扩展、可信的数据经济。会议以沉浸式AWS Braket演示开场，让参与者接触超导、离子阱和中性原子硬件，评估密码威胁时间线并见证ECDSA向后量子签名的过渡。五个集成模块依次涵盖：具身AI与世界模型需求、量子硬件现实与基于证据的安全迁移、通过BrokerChain协议实现可扩展跨分片架构、实施Croissant元数据标准与机器人学习溯源的可信数据经济，以及面向多模态云部署的行业生态系统集成。通过桥接量子硬件现实与具身AI数据需求，本教程将区块链描绘为下一代去中心化智能环境的统一基础设施，提供开源框架和路线图，用于构建抗量子、可互操作且数据可信的系统。

英文摘要

The deployment of embodied artificial intelligence via world-model-based robotics presents a transformative opportunity for blockchain infrastructure, establishing urgent demand for trustworthy data provenance, cross-organizational governance, and incentive-compatible sharing across decentralized ecosystems. Simultaneously, quantum computing advances recognized by the 2025 Nobel Prize in Physics and the Turing Award threaten the cryptographic primitives securing these data economies, creating an interdependent imperative: long-lived verification for embodied AI depends on crypto-agile architectures capable of withstanding quantum adversaries. This tutorial examines blockchain as the coordination layer bridging this dual transition, from financial substrate to foundational Cyber-Physical-Social Systems infrastructure that simultaneously secures against quantum cryptanalysis and enables scalable, trustworthy data economies. The session opens with an immersive AWS Braket demonstration engaging participants with superconducting, trapped-ion, and neutral-atom hardware to assess cryptographic threat timelines and witness ECDSA-to-post-quantum signature transitions. Five integrated modules progress from embodied AI and world-model requirements through quantum hardware reality and evidence-based security migration, to scalable cross-shard architectures via BrokerChain protocols, trustworthy data economies implementing Croissant metadata standards and robotic learning provenance, and industry ecosystem integration for multi-modal cloud deployment. By bridging quantum hardware realities with embodied AI data requirements, this tutorial charts blockchain as unified infrastructure for next-generation decentralized intelligent environments, providing open-source frameworks and roadmaps for architecting quantum-resistant, interoperable, and data-trustworthy systems.


2606.07536 2026-06-09 cs.CY cs.AI 交叉投稿

Beware of Geeks Bearing Gifts: Building True EU Frontier AI Sovereignty

警惕带来礼物的极客：构建真正的欧盟前沿人工智能主权

Nick Moës, Toni Lorente, Amin Oueslati, Jonathan Smith, Robin Staes-Polet, Radina Kraeva

AI总结本文提出一个涵盖经济竞争力、韧性、安全与国防、欧洲价值观和对外关系五大主权支柱，以及五层26组件29子组件的前沿AI堆栈分解框架，用于识别欧盟政策中的关键缺口、冗余和权衡，以支持战略自主。

详情

AI中文摘要

前沿人工智能正在重塑社会的方方面面，从经济产出或军事能力到民主制度。欧盟正从一个结构性依赖的位置进入这一转型：前沿模型几乎全部来自美国或中国，美国拥有约欧盟16倍的人工智能超级计算能力，全球超大规模数据中心容量中仅有15%位于欧盟境内。尽管欧盟委员会已加速其政策响应，现有举措仍然分散，缺乏确保整个前沿人工智能价值链战略自主的统一愿景。在此，我们提出了一个统一框架，将五大主权支柱（经济竞争力、韧性、安全与国防、欧洲价值观和对外关系）与前沿人工智能堆栈的分解联系起来，该堆栈包括五层、26个组件和29个子组件。该框架能够识别当前欧盟政策中隐含的关键差距、冗余和跨支柱权衡。我们对人工智能千兆工厂倡议的分析表明，以主权为中心的视角如何揭示狭隘经济框架所掩盖的冲突。此外，该框架为政策制定者提供了结构化基础，用于设计、评估和优先考虑跨欧洲战略自主多个维度的前沿人工智能干预措施，涵盖我们识别的四大委员会通讯中的92项倡议及其他。

英文摘要

Frontier artificial intelligence is reshaping all aspects of society, from economic output and military capability to democratic institutions. The EU is entering this transformation from a position of structural dependence: frontier models originate almost exclusively from the United States or China, the US holds approximately sixteen times the EU's AI supercomputing capacity, and only 15% of global hyperscale data centre capacity resides within EU borders. Although the European Commission has accelerated its policy response, existing initiatives remain fragmented and lack a cohesive vision for securing strategic autonomy across the full frontier AI value chain. Here we propose a unified framework connecting five sovereignty pillars (economic competitiveness, resilience, security and defence, European values, and foreign relations) to a decomposition of the frontier AI stack comprising five layers, 26 components, and 29 sub-components. This framework allows the identification of critical gaps, redundancies, and inter-pillar trade-offs that current EU policy leaves implicit. Our analysis of the AI Gigafactory Initiative illustrates how a sovereignty-centred lens reveals conflicts that narrowly economic framings obscure. Moreover, this framework offers policymakers a structured basis for designing, evaluating, and prioritising frontier AI interventions across multiple dimensions of European strategic autonomy, covering the 92 initiatives we identify in four major Commission communications, and beyond.


2606.08020 2026-06-09 quant-ph cs.AI 交叉投稿

Repair Before Veto, When Repair Is Hidden: Quantum-Accessible Features for Repair-Augmented Constraint Learning

在修复被隐藏时先修复再否决：面向修复增强约束学习的量子可访问特征

Yifan Wang

发表机构 * Yifan Wang（王一帆）

AI总结提出Q-RACL框架，在硬约束决策中引入修复优先于否决的语义，通过量子特征访问解决离散对数隐藏的修复可行性推理问题，显著降低假否决率。

Comments 7 pages, 2 figures

详情

AI中文摘要

硬约束决策系统通常会否决不可行的候选方案。当系统可以采取行动时，这种做法过于僵化：如果已知一个可承受的修复能使不可行但有价值的候选变得可行，那么拒绝就是一个错误的否决，而非排序错误。我们引入了Q-RACL（量子修复增强约束学习），这是一个先修复再否决的框架，首先定义RACL决策语义，然后识别出量子特征访问可以承担关键作用的单一推理环节。RACL在顺序修复计划能恢复可行性和偏好时接受候选方案；否则返回结构化的拒绝理由。关键环节是修复可行性推理：从观察到的候选和上下文来看，哪个修复类别能恢复可行性。我们构建了一个离散对数隐藏的RACL族，其中修复类别是潜在指数a = log_g(x)中的移位区间规则，而学习器只观察到x = g^a mod p。在标准的基于DLP的学习分离下，这个坐标对高效的原始输入经典策略是不可访问的，但通过Shor/Fourier结构对量子智能体是可访问的。在六个素数和十个随机种子下，有界的原始输入经典策略和错误的原始傅里叶编码仍接近随机水平，而Q-DLP策略将假否决率保持在1.1%以下，赢得所有配对种子，并产生QNI_cond在0.9777到0.9972之间。一个经典的DLog预言机与之匹配，隔离了特征访问而非分类器容量。因此，量子AI不是作为通用模型升级添加的；对于这个DLP隐藏的修复族，它提供了缺失的特征，从而闭合了先修复再否决的循环。

英文摘要

Hard-constraint decision systems usually veto infeasible candidates. This is too rigid when the system can act: if a known affordable repair would make an infeasible candidate feasible and valuable, rejection is a false veto rather than a ranking error. We introduce Q-RACL (Quantum Repair-Augmented Constraint Learning), a repair-before-veto framework that first defines RACL decision semantics and then identifies the single inference link where quantum feature access can be load-bearing. RACL accepts a candidate when a sequential repair plan restores feasibility and preference; otherwise it returns structured rejection credit. The hard link is repair-feasibility inference: which repair class restores feasibility from an observed candidate and context. We construct a discrete-logarithm-hidden RACL family where the repair class is a shifted interval rule in the latent exponent a = log_g(x), while the learner observes only x = g^a mod p. Under standard DLP-based learning separation, this coordinate is inaccessible to efficient raw-input classical policies but accessible to a quantum agent through Shor/Fourier structure. Across six primes and ten seeds, bounded raw-input classical policies and a wrong raw-Fourier encoding remain near chance, whereas the Q-DLP policy keeps false-veto rate below 1.1%, wins all paired seeds, and yields QNI_cond = 0.9777 to 0.9972. A classical DLog oracle matches it, isolating feature access rather than classifier capacity. Thus quantum AI is not added as a generic model upgrade; for this DLP-hidden repair family, it supplies the missing feature that closes the repair-before-veto loop.
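The role of the hidden exponent can be made concrete with a toy instance. The sketch below is an illustration under stated assumptions, not the paper's construction: it uses a tiny prime (`p = 101`, with primitive root `g = 2`) so that the discrete logarithm can be found by exhaustive search, and the specific form of the shifted-interval rule in `repair_class` is a hypothetical stand-in. It shows that once `a = log_g(x)` is available, the repair class is a trivial threshold test, whereas the learner only ever sees `x = g^a mod p`.

```python
def dlog_bruteforce(x, g, p):
    """Return a in [0, p-2] with g**a % p == x, by exhaustive search.

    Only feasible for tiny p; it stands in for the discrete-log oracle.
    """
    value = 1
    for a in range(p - 1):
        if value == x:
            return a
        value = (value * g) % p
    raise ValueError("x is not a power of g modulo p")


def repair_class(a, shift, width, order):
    """Hypothetical shifted-interval rule on the latent exponent a."""
    return 1 if (a - shift) % order < width else 0


def label_from_observation(x, g, p, shift, width):
    """Recover the exponent from x, then apply the interval rule."""
    a = dlog_bruteforce(x, g, p)
    return repair_class(a, shift, width, p - 1)


P, G = 101, 2          # 2 is a primitive root modulo 101
SHIFT, WIDTH = 17, 40  # illustrative parameters, not from the paper
```

For cryptographically sized primes the exhaustive search is infeasible, which is the regime where the abstract argues that access to the exponent, rather than classifier capacity, is the bottleneck.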


2606.08323 2026-06-09 cs.HC cs.AI 交叉投稿

"So There's a Catch-22 Here": How Early Adopters Who Build Multi-Agent LLM Systems Conceptualize Transparency

"所以这里有个第22条军规"：构建多智能体LLM系统的早期采用者如何概念化透明度

Suchismita Naik, Samir Passi, Mihaela Vorvoreanu, Scott Saponas, Amanda Hall

发表机构 * Purdue University（普渡大学）； Cornell University（康奈尔大学）； Microsoft Research（微软研究院）

AI总结通过访谈13位早期采用者，研究多智能体LLM系统构建者如何理解透明度，提出包含可重复性、调试、边界设定、可视化和审计的多维框架，强调透明度作为情境化的社会技术实践。

详情

AI中文摘要

多智能体大语言模型（LLM）系统正在迅速兴起，然而作为负责任AI基石的透明度，在这些具有智能体间协调与编排复杂性的分布式架构中仍定义不足。在本文中，我们呈现了首个关于多智能体LLM系统早期采用者（既是构建者也是用户）如何理解和实践透明度的实证研究之一。我们对[大型技术组织]中的13位早期采用者进行了半结构化访谈，并应用主题分析识别重复模式。参与者表达了分歧但互补的透明度框架，包括可重复性、调试、边界设定、可视化和审计。这些视角涵盖了透明度包含什么、为何重要以及如何实现等问题。我们将其综合为一个多维框架，该框架以开发者、用户和治理为中心，将透明度定位为情境化的社会技术实践，为未来HCI和AI设计与研究围绕对齐预期受众的期望和能力提供信息。

英文摘要

Multi-agent large language model (LLM) systems are rapidly emerging, yet transparency, a cornerstone of responsible AI, remains under-defined in these distributed architectures, which add the complexities of inter-agent coordination and orchestration. In this paper, we present one of the first empirical studies of how early adopters of multi-agent LLM systems, who are both builders and users, understand and practice transparency. We conducted semi-structured interviews with 13 early adopters in [Large Technology Organization] and applied thematic analysis to identify recurring patterns. Participants articulated divergent yet complementary framings of transparency, including reproducibility, debugging, boundary-setting, visualization, and auditing. These perspectives spanned questions of what transparency entails, why it matters, and how it is achieved. We synthesize these into a multidimensional framework focused on developers, users, and governance. The framework positions transparency as a situated socio-technical practice and informs future HCI and AI design and research around aligning with the expectations and capacities of intended audiences.


2606.08791 2026-06-09 econ.EM cs.AI q-fin.PM q-fin.RM q-fin.ST 交叉投稿

Evaluating AI Investment Strategies

评估AI投资策略

Irene Aldridge

发表机构 * ablemarkets.com（ablemarkets公司）

AI总结研究通过可观测输入输出审计黑箱算法决策者，提出动态策略累积遗憾的精确分解，扩展至多期随机动态规划，并给出偏差修正与轨迹估计器。

Comments 33 pages

详情

AI中文摘要

我们研究仅从可观测输入和输出审计黑箱算法决策者的问题。主要结果是一个精确分解：在精确刻画条件下，动态策略的累积遗憾等于成本向量与策略决策之间每期协方差之和。这扩展了Aldridge (2026)的单期恒等式到随机动态规划的完整多期设置。我们证明了该恒等式在独立同分布成本和均值无偏马尔可夫策略下精确成立，推导了非平稳和时变情况下的闭式偏差修正，并建立了折现期模拟。协方差遗憾泛函的贝尔曼递归将该结果与标准强化学习算法联系起来；对于滚动窗口策略，估计误差偏差为$O(d/w)$。该分解对战略环境中的算法审计有直接影响：在平台机制设计中，它提供了基于福利的审计指标，无需访问代理的私人类型；在重复博弈中，协方差减少是策略改进的充分条件；在采购和广告拍卖中，偏差修正量化了战略误报导致的福利损失。相关的轨迹估计器是一致的、渐近正态的（具有HAC方差），并且可在$O(T \cdot nd)$时间内计算。这使得所提出的方法成为平台机制、算法投资策略以及任何受外部绩效审查的序列决策系统的可处理、无模型审计工具。

英文摘要

We study the problem of auditing a black-box algorithmic decision-maker from observable inputs and outputs alone. Our main result is an exact decomposition: under precisely characterized conditions, the cumulative regret of a dynamic policy equals the sum of per-period covariances between the cost vector and the policy's decision. This extends the single-period identity of Aldridge (2026) to the full multi-period setting of stochastic dynamic programming. We prove the identity holds exactly under i.i.d. costs and mean-unbiased Markov policies, derive closed-form bias corrections for non-stationary and time-varying cases, and establish the discounted-horizon analog. A Bellman recursion for the covariance regret functional connects the result to standard reinforcement learning algorithms; for rolling-window policies, the estimation-error bias is $O(d/w)$. The decomposition has direct implications for algorithmic auditing in strategic environments: in platform mechanism design, it provides a welfare-based audit metric without access to the agent's private type; in repeated games, covariance reduction is a sufficient condition for policy improvement; in procurement and ad auctions, the bias correction quantifies welfare loss from strategic misreporting. The associated trajectory estimator is consistent, asymptotically normal with HAC variance, and computable in $O(T \cdot nd)$ time. This makes the proposed approach a tractable, model-free audit tool for platform mechanisms, algorithmic portfolio strategies, and any sequential decision system subject to external performance review.


2606.08936 2026-06-09 cs.IR cs.AI cs.HC 交叉投稿

Report on CHIIR 2026 Workshop on Generative AI and Academic Search (GAI&AS)

CHIIR 2026 生成式AI与学术搜索研讨会报告

Yifan Liu, Jaime Arguello, Orland Hoeber, Chang Liu, Soo Young Rieh, Luanne Sinnamon, Dean Alvarez, Susan Archambault, Rob Capra, Henson Chen, Charles Costa, Anita Crescenzi, Zhitong Guan, Jacek Gwizdka, Pao-Pei Huang, Gavindya Jayawardena, Ghazal Kalhor, Dagmar Kern, Oliver Koop, Alice Li, Afra Mashhadi, Gaohui Meng, Marta Micheli, Anil B. Murthy, Kevin Schott, Sebastian Schultheiß, Jiwoo Seo, Phaneendra Sivangula, Frans van der Sluis, Xiaoxuan Song, Silang Wang, Dan Zhang

发表机构 * CHIIR 2026 Workshop（CHIIR 2026 工作坊）

AI总结报告总结CHIIR 2026关于生成式AI重塑学术搜索系统的研讨会，聚焦设计评估挑战，涵盖基础、应用及搜索即学习三大主题，强调透明性、可信度与研究诚信。

详情

AI中文摘要

本报告总结了CHIIR 2026生成式AI与学术搜索研讨会（GAI&AS），该研讨会探讨了GenAI如何重塑学术搜索系统及研究实践。研讨会汇集了人类信息交互和信息检索领域的研究人员，探讨了在设计和评估未来集成GenAI的学术搜索系统中的关键挑战与机遇，超越了传统的文档检索，支持摘要、推荐、综合和对话交互。参与者的兴趣和讨论集中在三个主题集群：基础与原则、应用与机遇、以及搜索即学习。在这些主题中，研讨会强调了学术搜索系统在支持透明度、可信度、研究诚信和长期学术需求，以及促进高阶认知过程中的重要性。与会者讨论了指导理论、设计原则、方法论、合作伙伴关系以及旨在推进以人为中心的GenAI增强学术搜索系统的社区建设努力。总体而言，研讨会展示了社区对GenAI与学术搜索交叉领域的强烈兴趣以及多样化的正在进行和新兴的研究计划。

英文摘要

This report summarizes the CHIIR 2026 Workshop on Generative AI and Academic Search (GAI&AS), which examined how GenAI is reshaping academic search systems and research practices. The workshop brought together researchers in human information interaction and information retrieval to explore key challenges and opportunities in designing and evaluating future academic search systems that integrate GenAI, moving beyond traditional document retrieval to support summarization, recommendation, synthesis, and conversational interaction. Participants' interests and discussions focused on three thematic clusters: foundations and principles, applications and opportunities, and search-as-learning. Across these themes, the workshop highlighted the importance of academic search systems in supporting transparency, credibility, research integrity, and long-term scholarly needs, as well as in fostering higher-order cognitive processes. Participants discussed guiding theories, design principles, methodological approaches, partnerships, and community-building efforts aimed at advancing human-centered GenAI-enhanced academic search systems. Overall, the workshop demonstrated strong community interest and a diverse range of ongoing and emerging research initiatives at the intersection of GenAI and academic search.


2606.09006 2026-06-09 cs.SI cs.AI cs.CY cs.ET 交叉投稿

Sustainability and Artificial Intelligence: Necessary, Challenging, and Promising Intersections

可持续性与人工智能：必要、挑战与有前景的交汇

Han-Teng Liao, Zijia Wang

发表机构 * Higher Education Impact Assessment Center（高等教育影响评估中心）； Sun Yat-Sen University（中山大学）； Nanfang College（南方学院）

AI总结本文基于541篇文献，梳理了人工智能与可持续性研究的交汇点，揭示了绿色科技在连接多学科中的核心作用，并讨论了其必要性、挑战与前景。

Comments This is an author preprint version. For the final authenticated version of record, please use the official publication via the IEEE Xplore database. DOI: 10.1109/MSIEID52046.2020.00076

详情

DOI: 10.1109/MSIEID52046.2020.00076
Journal ref: 2020 Management Science Informatization and Economic Innovation Development Conference (MSIEID), Guangzhou, China, 2020, pp. 360-363

AI中文摘要

数字经济与数字技术的研究人员日益认识到需要更好地解决人工智能在塑造环境、社会和治理发展演变中的作用。可持续性与人工智能研究似乎在复杂、相互关联和动态的棘手问题特征上存在交汇。基于这种交汇，本文旨在通过概述现有研究，勾勒出必要、挑战和有前景的交汇点。基于从Web of Science数据库收集的541条文献数据，研究结果揭示了绿色可持续科技在连接不同学科、主要期刊及关键主题与概念方面日益核心的作用。研究结果展示了这些互动如何可以是必要的、挑战性的和有前景的。文章最后就如何多样化和扩展人工智能促进可持续发展的实践社区提出了一些一般性论点，特别是在预期的人工智能应用领域和机构方面。

英文摘要

Both digital economy and digital technology researchers increasingly recognize the need to better address the role that artificial intelligence (AI) plays in shaping the evolution of the environmental, social and governance aspects of development. Sustainability and AI research appear to converge on the features of wicked problems, which are complex, interconnected and dynamic. Building on this convergence, this article aims to map out the necessary, challenging, and promising intersections by providing an overview of state-of-the-art research. Based on 541 bibliographic records collected from the Web of Science (WoS) database, the findings reveal the increasingly central role of work on green and sustainable science and technology in bridging various disciplines, main journals, and key topics and concepts. The findings also show how such interactions can be necessary, challenging, and promising. The article concludes with a few general arguments on how to diversify and expand the community of practice around AI for sustainable development, especially with respect to expected AI application areas and institutions.


2606.09589 2026-06-09 cs.CY cs.AI 交叉投稿

I Was Scrolling and Then I Saw a Pregnant Strawberry

我正刷着手机，然后看到了一颗怀孕的草莓

Piera Riccio

发表机构 * University of Amsterdam（阿姆斯特丹大学）

AI总结研究AI迷你剧（水果剧）中性别化叙事与种族化逻辑，指出其通过生成式AI的美学洗白机制掩盖意识形态内容，并分析其对计算创造力的文化影响。

详情

AI中文摘要

AI迷你剧（又称水果剧）是算法分发的生成式AI短视频系列，以拟人化角色为特征，近期在社交媒体平台上成为普遍现象。本文认为，尽管这些视频看似无害的美学，但它们再现了深度性别化的叙事结构，其中女性角色被系统性地与道德越轨、性背叛和生殖能力相关联，且多个情节也编码了种族化的逻辑，即可见的身体差异被赋予道德负荷的过程。借鉴女性主义电影理论、批判种族理论和平台研究，本文进一步认为，这些视频的生成式AI美学——以柔软、圆润和视觉可爱为特征——作为一种美学洗白机制，中和了这些叙事的意识形态重量，并使其在内容审核系统下仍能流通。本文通过个人观察和细读来探讨这些问题，反思生成式AI的具体可供性，这些可供性使这一现象成为可能，并对计算创造力领域产生文化影响。

英文摘要

AI minidramas (also known as fruit dramas) are short, algorithmically distributed generative AI video series featuring anthropomorphized characters that have recently emerged as a widespread phenomenon on social media platforms. This paper argues that despite their seemingly innocuous aesthetic, these videos reproduce deeply gendered narrative structures in which female characters are systematically associated with moral transgression, sexual betrayal, and reproductive capacity, and that several plots also encode the logic of racialization, i.e., the process by which visible bodily difference is morally loaded. Drawing on feminist film theory, critical race theory, and platform studies, it further argues that the generative AI aesthetic of these videos, characterized by softness, roundness, and visual cuteness, functions as a mechanism of aesthetic laundering, neutralizing the ideological weight of these narratives and enabling their circulation despite content moderation systems. This paper approaches these questions through personal observation and close reading, reflecting on the specific affordances of generative AI that make this phenomenon both possible and culturally consequential for the field of computational creativity.


2603.14147 2026-06-09 cs.AI cs.LG 版本更新

An Alternative Trajectory for Generative AI

生成AI的另一种轨迹

Margarita Belova, Yuval Kansal, Yihao Liang, Jiaxin Xiao, Niraj K. Jha

发表机构 * Princeton University（普林斯顿大学）

AI总结本文提出通过构建领域特定超智能（DSS）来改进生成AI，利用符号抽象提升领域推理能力，避免LLM合成数据的模型崩溃问题，实现可持续发展。

详情

AI中文摘要

生成人工智能（AI）生态系统正经历快速变革，威胁其可持续性。随着模型从研究原型转向高流量产品，能耗从一次性训练转向持续的无界推理。推理模型使计算成本每查询增加数个数量级。通过单体模型扩展追求人工通用智能与物理约束的碰撞：电网故障、用水消耗和数据扩展的边际效益递减。此轨迹产生具有出色事实记忆的模型，但在需要深入推理的领域表现不佳，可能由于训练数据中的抽象不足。当前大型语言模型（LLMs）仅在数学和编程等领域表现出真实的推理深度，其他领域泛化能力差。我们提出基于领域特定超智能（DSS）的替代轨迹。我们主张首先构建显式的符号抽象（知识图谱、本体和形式逻辑）以支撑合成课程，使小型语言模型能够掌握领域特定推理，而无需LLM基于合成数据方法的模型崩溃问题。而非单一通用巨模型，我们设想“DSS模型社会”：动态生态系统，其中协调代理将任务路由到不同的DSS后端。此范式转变使能力脱离规模，使智能从能耗高的数据中心迁移到安全的设备专家。通过将算法进步与物理约束对齐，DSS社会使生成AI从环境负担转变为可持续的经济赋能力量。

英文摘要

The generative artificial intelligence (AI) ecosystem is undergoing rapid transformations that threaten its sustainability. As models transition from research prototypes to high-traffic products, the energetic burden has shifted from one-time training to recurring, unbounded inference. This is exacerbated by reasoning models that inflate compute costs by orders of magnitude per query. The prevailing pursuit of artificial general intelligence through scaling of monolithic models is colliding with hard physical constraints: grid failures, water consumption, and diminishing returns on data scaling. This trajectory yields models that have impressive factual recall but struggle in domains requiring in-depth reasoning, possibly due to insufficient abstractions in training data. Current large language models (LLMs) exhibit genuine reasoning depth only in domains like mathematics and coding, where rigorous, pre-existing abstractions provide structural grounding. In other fields, the current approach fails to generalize well. We propose an alternative trajectory based on domain-specific superintelligence (DSS). We argue for first constructing explicit symbolic abstractions (knowledge graphs, ontologies, and formal logic) to underpin synthetic curricula, enabling small language models to master domain-specific reasoning without the model collapse problem typical of LLM-based synthetic data methods. Rather than a single generalist giant model, we envision "societies of DSS models": dynamic ecosystems where orchestration agents route tasks to distinct DSS back-ends. This paradigm shift decouples capability from size, enabling intelligence to migrate from energy-intensive data centers to secure, on-device experts. By aligning algorithmic progress with physical constraints, DSS societies move generative AI from an environmental liability to a sustainable force for economic empowerment.


2604.19845 2026-06-09 cs.AI 版本更新

Deconstructing Superintelligence: Identity, Self-Modification and Différance

解构超智能：身份、自我修改与延异

Elija Perrier

发表机构 * Centre for Quantum Software & Information, UTS, Sydney（量子软件与信息中心，UTS，悉尼）

AI总结本文通过关联算子代数分析自我修改与超智能的关系，揭示非交换性如何传播至自我表示，并指出强自我修改可能破坏系统基础身份。

Comments Camera-ready version, AGI-2026

详情

AI中文摘要

自我修改常被视为构成人工超智能（SI）的核心，但修改是一种相对行为，需要一个在操作外的补充。我们在此基于关联算子代数$\mathcal{A}$，引入更新算子$\hat U$、差分算子$\hat D$和自我表示算子$\hat R$，将补充定义为$\operatorname{Comm}(\hat U)$。传播定理显示$[\hat U,\hat R]$通过$[\hat U,\hat D]$分解，因此非交换性传播至自我表示。谎言悖论是秩一情况$[\hat T,Π_L]=0$，而类$\mathbf{A}$系统中$\hat U$作用于$\hat D$，在系统尺度上再现它，产生与Priest的inclosure方案及Derrida的différance相一致的结构。我们的结果表明，强自我修改所定义的超智能可能破坏此类系统所依赖的持续身份。

英文摘要

Self-modification is routinely treated as constitutive of artificial superintelligence (SI), yet modification is a relative action requiring a supplement outside the operation. We formalise this on an associative operator algebra $\mathcal{A}$ with update operator $\hat U$, difference operator $\hat D$, and self-representation operator $\hat R$, identifying the supplement with $\operatorname{Comm}(\hat U)$. A propagation theorem shows $[\hat U,\hat R]$ decomposes through $[\hat U,\hat D]$, so non-commutation propagates to self-representation. The liar paradox is the rank-one case $[\hat T,\Pi_L]=0$, and class $\mathbf{A}$ systems, in which $\hat U$ acts on $\hat D$, reproduce it at system scale, yielding a structure coinciding with Priest's inclosure schema and Derrida's différance. Our results show that the strong self-modification taken to define superintelligence may undermine the persistent identity upon which such systems are premised.


2606.03092 2026-06-09 cs.AI 版本更新

The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

推理的影子价格：LLM最优预算分配的经济学视角

Xu Wan, Speed Zhu, Jianwei Cai, Guang Chen, XiMing Huang, Wiggin Zhou, Mingyang Sun

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文从经济学视角将推理预算分配建模为全局约束优化问题，提出基于影子价格的CLEAR方法，通过理性放弃和资源再分配，在资源稀缺下显著提升总token成本与平均准确率的帕累托前沿。

详情

AI中文摘要

推理时扩展已成为提升大型语言模型性能的关键途径，但实际部署受严格计算预算限制。本文将推理预算分配建模为受经济学原理支配的全局约束优化问题。通过使用移位激增函数对每查询推理效用建模，我们推导出基于全局影子价格的最优分配策略，该价格在资源稀缺下均衡边际效用。基于此理论，我们提出约束潜在效用均衡分配推理（CLEAR）。它执行理性放弃，并将资源从无力偿付的查询重新分配到接近其涌现阈值的可解查询。在不同流量流的多个推理任务上的大量实验表明，CLEAR显著改善了总token成本与平均准确率的帕累托前沿。在资源稀缺模式下，与均匀分配相比，CLEAR的全局准确率提升高达3倍。

英文摘要

Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocation as a global constrained optimization problem governed by economic principles. By modeling per-query reasoning utility with a shifted-surge function, we derive an optimal allocation policy based on a global shadow price that equilibrates marginal utility under resource scarcity. Based on this theory, we propose Constrained Latent-utility Equilibrium Allocation for Reasoning (CLEAR). It performs rational abandonment and reallocates resources from insolvent queries to solvable queries near their emergence thresholds. Extensive experiments on several reasoning tasks with different traffic streams demonstrate that CLEAR significantly improves the Pareto frontier of total token cost versus mean accuracy. In resource-scarce regimes, CLEAR achieves up to a 3x improvement in global accuracy compared to uniform allocation.
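The idea of a single shadow price that equalizes marginal utility across queries can be sketched numerically. The example below is a hedged illustration, not the CLEAR algorithm: the paper's shifted-surge utility is not specified in this abstract, so the code substitutes a simple saturating utility `w * (1 - exp(-b / s))` per query, and it finds the price by geometric bisection. The function name `allocate` and the `(w, s)` parameterization are assumptions introduced here.

```python
import math


def allocate(queries, budget, iters=200):
    """Split `budget` across queries at a common shadow price.

    queries: list of (w, s) pairs; the assumed utility of spending b tokens
    on a query is w * (1 - exp(-b / s)), so its marginal utility is
    (w / s) * exp(-b / s). This is a stand-in, not the paper's utility.
    Returns (shadow_price, allocations).
    """
    def demand(lam):
        # Each query spends until its marginal utility falls to lam;
        # queries whose marginal utility at b = 0 is below lam get nothing.
        return [max(0.0, s * math.log(w / (s * lam))) for w, s in queries]

    lo, hi = 1e-12, max(w / s for w, s in queries)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)      # bisect on a log scale
        if sum(demand(mid)) > budget:
            lo = mid                  # price too low: demand exceeds budget
        else:
            hi = mid
    return hi, demand(hi)
```

With queries `[(1.0, 10.0), (1.0, 50.0), (0.05, 100.0)]` and a budget of 60, the third query's marginal utility at zero spend is below the resulting price, so it receives no budget. Under these assumptions, that is a small-scale analogue of the "rational abandonment" the abstract describes.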


2407.10247 2026-06-09 cs.CY cs.AI cs.LG econ.GN q-fin.EC 版本更新

探索基旋转对NQS性能的影响

Sven Benjamin Kožić, Vinko Zlatić, Fabio Franchini, Salvatore Marco Giampaolo

发表机构 * Institut Ruđer Bošković（鲁德·博什科维奇研究所）

AI总结通过可解一维Ising模型，研究局部基旋转对神经量子态（NQS）表示和优化的影响，发现基旋转保持优化景观不变但移动目标态位置，导致优化失败与错误波函数结构共存。

详情

AI中文摘要

神经量子态（NQS）是量子多体波函数的强大变分表示，但其性能敏感地依赖于所选基。利用精确可解的一维Ising模型，我们证明局部基旋转保持最小化景观不变，同时将精确基态在参数空间中重新定位。这提供了一个受控框架，以区分表示限制与优化引起的可训练性效应。通过信息几何度量量化的这种几何位移，可以将浅层架构的优化引导至鞍点和高曲率区域。因此，低能量误差可能与错误的波函数结构共存。通过在同一变分架构内比较能量和保真度优化，我们表明即使旋转后的目标态仍然可表示，优化失败也可能持续存在。我们的结果识别了导致NQS基依赖性的几何机制，并激发了景观感知的变分设计。

英文摘要

Neural Quantum States (NQS) are powerful variational representations of quantum many-body wavefunctions, yet their performance depends sensitively on the chosen basis. Using an exactly solvable one-dimensional Ising model, we show that local basis rotations leave the minimization landscape unchanged while relocating the exact ground state in parameter space. This provides a controlled framework to disentangle representational limitations from optimization-induced trainability effects. This geometric displacement, quantified through information-geometric measures, can steer optimization of shallow architectures toward saddle points and high-curvature regions. As a result, low energy errors may coexist with an incorrect wavefunction structure. By comparing energy and infidelity optimization within the same variational architectures, we show that optimization failure can persist even when the rotated target state remains representable. Our results identify a geometric mechanism contributing to basis dependence in NQS and motivate landscape-aware variational design.

URL PDF HTML ☆

赞 0 踩 0

2601.06077 2026-06-09 cs.IT cs.AI cs.LG math.IT math.OC 版本更新

One if by Land, Two if by Sea, Three if by Four Seas, and More to Come -- Values of Perception, Prediction, Communication, and Common Sense in Decision Making

一陆二海三四海，更多将至——感知、预测、通信与常识在决策中的价值

Aolin Xu

发表机构 * Aolin Xu（徐傲林）

AI总结本文严格定义决策中感知、预测、通信和常识的价值，发现无预测的感知价值可能为负，而预测价值非负，并应用于自主决策系统设计。

详情

AI中文摘要

本文旨在严格定义决策中感知、预测、通信和常识的价值。所定义的量是决策论意义上的，但具有信息论上的类比，例如，它们与香农熵和互信息共享一些简单但关键的数学性质，并且在特定设置中可以简化为这些量。一个有趣的观察是，没有预测的感知价值可能为负，而感知与预测一起的价值以及单独预测的价值总是非负的。这些定义为自主决策系统设计中出现的实际问题提供了答案。示例问题包括：我们是否需要观察和预测特定代理的行为？其重要性如何？观察和预测代理的最佳顺序是什么？这些定义也可能为认知科学和神经科学提供见解，有助于理解自然决策者如何利用从不同来源和操作中获得的信息。

英文摘要

This work aims to rigorously define the values of perception, prediction, communication, and common sense in decision making. The defined quantities are decision-theoretic but have information-theoretic analogues; e.g., they share some simple but key mathematical properties with Shannon entropy and mutual information, and can reduce to these quantities in particular settings. One interesting observation is that the value of perception without prediction can be negative, while the value of perception together with prediction and the value of prediction alone are always nonnegative. The defined quantities suggest answers to practical questions arising in the design of autonomous decision-making systems. Example questions include: Do we need to observe and predict the behavior of a particular agent? How important is it? What is the best order in which to observe and predict the agents? The defined quantities may also provide insights for cognitive science and neuroscience, toward understanding how natural decision makers use information gained from different sources and operations.

2604.20897 2026-06-09 cs.IT cs.AI math.IT physics.comp-ph 版本更新

Watts-per-Intelligence Part II: Algorithmic Catalysis

每智能瓦特 Part II：算法催化

Elija Perrier

发表机构 * Centre for Quantum Software and Information（量子软件与信息中心）

AI总结本文基于每智能瓦特框架发展算法催化热力学理论，提出可重用的计算结构以减少任务类的不可逆操作，同时满足受限恢复和结构选择性约束。证明任务类特定速度提升上限由算法互信息决定，并通过兰道尔擦除最小热力学成本。结合结果得出耦合定理，下界限定算法催化部署时间范围。

Comments Camera ready version, AGI-2026

2606.04227 2026-06-09 cs.DS cs.AI 版本更新

Incremental Sheaf Cohomology on Cellular Complexes: O(1)-in-n Lazy Edit Processing under Bounded Local Geometry

细胞复形上的增量层上同调：有界局部几何下的O(1)-in-n惰性编辑处理

Jason L. Volk

发表机构 * Invariant Research（Invariant研究院）

AI总结针对动态演化的1维细胞复形上的层上同调$H^1$，提出一种增量维护算法，在有界局部几何假设下实现每次编辑O(1)时间处理，并通过同步点保证正确性。

Comments 2 figures, 2 tables, 1 algorithm; code at https://github.com/Jasonleonardvolk/sigma

详情

AI中文摘要

我们提出了一种算法框架，用于在动态演化的1维细胞复形（配备有限维细胞层）上增量维护一阶层上同调$H^1(X; \mathcal{F})$。通过分解上边界矩阵来计算$H^1$的经典方法需要$O(n^3)$时间；当复形经历由$m$次编辑组成的流时，每次编辑后完全重计算的总代价为$O(mn^3)$。在局部几何有界的假设下（有界细胞大小$v_{\max}$、有界茎维数$d$和有界神经度$D$），每次编辑（顶点插入、边插入、限制映射更新）仅影响一组有界的局部上边界块。因此，该算法处理惰性流式编辑的时间相对于复形总大小$n$为$O(1)$（代价是局部几何参数$v_{\max}$、$d$和$D$的多项式，这些参数被视为与$n$无关的常数），并将局部特征值求解和Mayer-Vietoris全局组装推迟到同步点（Flush）。在同步时，维护的状态与分区层模型的相应批量组装一致；我们在所有经批量验证的运行中观察到零测量漂移（直至$V = 10^6$）。我们还给出了细胞分解的均摊$O(|E|)$流式构造，并讨论了一个对抗性代数RAM障碍，论证未分区的非平凡层（$d \geq 2$，非恒等限制映射）不具有相同的局部性。在最多$5 \times 10^6$个顶点和$1.7 \times 10^7$次流式编辑的Barabasi-Albert图上的实验显示，每次编辑的惰性更新中位延迟为35微秒（不包括刷新）；查询时间（同步时的全局组装）在已实现的完全遍历路径中为每次刷新$O(n)$。精确同步代价另行报告。

英文摘要

We present an algorithmic framework for incremental maintenance of first sheaf cohomology $H^1(X; \mathcal{F})$ on dynamically evolving 1-dimensional cellular complexes equipped with finite-dimensional cellular sheaves. The classical computation of $H^1$ via factorization of the coboundary matrix requires $O(n^3)$ time; when the complex evolves with a stream of $m$ edits, full recomputation after each edit costs $O(mn^3)$. Under a bounded local geometry assumption (bounded cell size $v_{\max}$, bounded stalk dimension $d$, and bounded nerve degree $D$), each edit (vertex insertion, edge insertion, restriction map update) affects only a bounded set of local coboundary blocks. The algorithm therefore processes lazy streaming edits in $O(1)$ time with respect to the total complex size $n$ (with cost polynomial in the local geometry parameters $v_{\max}$, $d$, and $D$, which are treated as constants independent of $n$), deferring local eigensolves and Mayer-Vietoris global assembly to synchronization points (Flush). At synchronization, the maintained state agrees with the corresponding batch assembly of the partitioned sheaf model; we observe zero measured drift in all batch-verified runs (through $V = 10^6$). We also give an amortized $O(|E|)$ streaming construction for the cellular decomposition and discuss an adversarial algebraic-RAM barrier arguing that unpartitioned non-trivial sheaves ($d \geq 2$, non-identity restriction maps) do not admit the same locality. Experiments on Barabasi-Albert graphs with up to $5 \times 10^6$ vertices and $1.7 \times 10^7$ streaming edits show 35 $\mu$s median lazy per-edit update latency (excluding flush); query time (global assembly at synchronization) is $O(n)$ per flush in the implemented full-traversal path. Exact synchronization costs are reported separately.
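
For orientation, the batch baseline that the abstract contrasts against can be sketched in a few lines. This is a minimal illustration, not the paper's incremental algorithm: it assumes 1-dimensional stalks with scalar restriction maps, and the helper names (`matrix_rank`, `coboundary`, `dim_h1`) are hypothetical. On a 1-dimensional complex, $\dim H^1 = \dim C^1 - \operatorname{rank}(\delta)$, which the code evaluates with exact rational elimination.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a rational matrix via exact Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def coboundary(num_vertices, edges):
    """Coboundary matrix of a sheaf with 1-dimensional stalks on a graph.

    Each edge is (u, v, r_u, r_v); its row encodes r_v * x_v - r_u * x_u,
    where r_u and r_v are the scalar restriction maps into the edge stalk.
    """
    rows = []
    for u, v, r_u, r_v in edges:
        row = [0] * num_vertices
        row[u] -= r_u
        row[v] += r_v
        rows.append(row)
    return rows


def dim_h1(num_vertices, edges):
    """dim H^1 = dim C^1 - rank(delta) for a 1-dimensional complex."""
    return len(edges) - matrix_rank(coboundary(num_vertices, edges))


# Constant sheaf on a triangle: one independent cycle, so dim H^1 = 1.
triangle = [(0, 1, 1, 1), (1, 2, 1, 1), (2, 0, 1, 1)]
print(dim_h1(3, triangle))  # prints 1
```

Changing a single restriction map on the cycle (for example, using `(2, 0, 1, 2)` for the last edge) makes the coboundary full rank and drops $\dim H^1$ to 0. That is a small instance of how one restriction-map edit can change the cohomology being maintained.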

2605.24384 2026-06-09 cs.CL cs.AI 版本更新

Side-by-side Comparison Amplifies Dialect Bias in Language Models

并排比较加剧语言模型中的方言偏见

Kritee Kondapally, Claire J. Smerdon, Pooja C. Patel, Ogheneyoma Akoni, Jevon Torres, Jaspreet Ranjit, Matthew Finlayson, Swabha Swayamdipta

发表机构 * University of Southern California（美国南加州大学）

AI总结本研究通过并排比较标准美式英语和非裔美国英语的推文，发现语言模型中的隐性方言偏见在对比设置下显著加剧，且显性方言偏见在安全对齐微调后仍存在。

Comments In proceeding at ACM Conference on Fairness, Accountability, and Transparency 2026

详情

DOI: 10.1145/3805689.3812217
Journal ref: In The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

AI中文摘要

语言模型（LMs）可能因其方言变体而表现出偏见，即使在没有方言标签的情况下，这种行为被称为隐性方言偏见。在这项工作中，我们通过评估语言模型如何将刻板特征（源自社会心理学关于种族偏见的研究）与标准美式英语（SAE）和非裔美国英语（AAVE）中意图等效的推文相关联，来量化在线话语中的隐性方言偏见。虽然先前的研究表明，在单独评估推文时，语言模型将更多负面刻板印象与AAVE关联，但我们惊讶地发现，当SAE/AAVE推文对并排比较时，这种偏见显著加剧，这种设置更接近模型用于排名候选人的高影响力决策环境。当明确指定方言标签时，偏见只会恶化。考虑到商业开发者为了减轻其语言模型中的偏见所做的广泛努力，这一点令人震惊。令人鼓舞的是，我们表明反事实公平微调可以减轻某些刻板特征的隐性方言偏见，减少单独评估推文时的平均差异，然而，在并排评估SAE/AAVE推文时，这些改进并不一致地适用于所有特征。我们的发现表明，现有的隐性方言偏见评估设置可能低估了其严重性，特别是在对比设置中。此外，即使在安全对齐微调后，显性方言偏见仍然显著，表明它仍然是一个未解决的问题，并激励需要更稳健的评估和缓解框架。

英文摘要

Language models (LMs) can exhibit biases based on the dialect of their input, even in the absence of a dialect label, a behavior known as covert dialect bias. In this work, we quantify covert dialect bias in online discourse by evaluating how LMs associate stereotypical traits (derived from social psychology research on racial bias) with intent-equivalent tweets in Standard American English (SAE) and African-American Vernacular English (AAVE). While prior work shows that LMs associate more negative stereotypes with AAVE when evaluating tweets in isolation, we are surprised to find that this bias is significantly exacerbated when SAE / AAVE tweet pairs are compared side by side, a setting that more closely reflects high-impact decision-making contexts in which models are used to rank candidates. The bias only worsens when dialect labels are explicitly specified. This is striking, given the extensive efforts from commercial developers to mitigate bias in their LMs. Encouragingly, we show that counterfactual fairness finetuning can mitigate covert dialect bias for some stereotypical traits, reducing average disparities when evaluating tweets in isolation; however, these improvements do not consistently hold across traits when evaluating SAE / AAVE tweets side by side. Our findings show that existing evaluation settings for covert dialect bias may underestimate its severity, specifically in contrastive settings. Additionally, overt dialect bias remains pronounced even after safety-aligned finetuning, indicating that it remains an unresolved problem and motivating the need for more robust evaluation and mitigation frameworks.

2604.07349 2026-06-09 cs.CC cs.AI cs.LO 版本更新

Descent Before Hardness: Orbit-Gap Obstructions in Exact Certification

先下降，后困难性：精确认证中的轨道间隙阻碍

Tristan Simas

发表机构 * McGill University（麦吉尔大学）

AI总结本文通过Rice定理的结构类比，研究有限加权布尔优化/CSP风格切片中可处理性分类的精确性，提出闭包不变性作为正确分类的必要条件，并给出闭包不变分类的充要条件及四种阻碍族。

Comments Main PDF: 46 pages, 5 tables. Supplementary: 17 pages, 2 tables. Lean 4 formalization available at https://doi.org/10.5281/zenodo.19457896

详情

AI中文摘要

Rice定理表明，部分递归函数的非平凡外延性质是不可判定的。对于有限加权布尔优化/CSP风格切片，可处理性分类存在一个Rice式的结构类比：正确性迫使在定理强制表示的移动下具有不变性，而轨道间隙正是闭包不变谓词精确分类的障碍。该范围对于精确规范是普适的。任何严格规范的问题都确定一个可接受输出关系，而精确认证仅依赖于诱导的等价关系 $s \sim_R s' \iff \operatorname{Adm}_R(s)=\operatorname{Adm}_R(s')$。决策、搜索、近似、随机输出、统计和分布保证都通过这个可接受输出商进入。在具有多项式时间可计算传输的闭包封闭域上，每个正确的可处理性分类器必须在闭包轨道上为常数。精确的闭包不变分类当且仅当正轨道壳和负轨道壳不相交时才是可能的；在这种情况下，闭包壳是一个闭包算子，给出最小的精确分类器。有限结构域是提取成对语法上的基本局部一阶片段。四个二元成对阻碍族——主导对集中、边缘掩蔽、鬼影动作支持和动作特定偏移——见证了自然有限结构谓词的相同轨道分歧，而壳分离定理给出了分类可能时的正判据。没有显式的边缘控制，任意小的效用扰动都可能翻转相关性和充分性。

英文摘要

Exact certification has a quotient: states are equivalent when they have the same correct outputs. A tractability proxy must first define a predicate on this quotient before ordinary hardness or algorithmic questions arise. Raw syntactic proxies can fail at that earlier step, because correctness-preserving presentation moves may change the statistics they inspect while preserving the exact-certification problem. Orbit gaps are the complete obstruction. An orbit gap occurs when one closure orbit contains both positive and negative presentations of a target. Exact closure-invariant classification is possible if and only if the positive and negative orbit hulls are disjoint. When the hulls are disjoint, the closure hull is the least exact classifier. With computable orbit representatives, this hull classifier becomes a quotient-level algorithm. These are predicate-level results: they establish when a proxy defines a property of the certification problem at all, a precondition logically prior to class lower bounds on the resulting recovery task and deliberately not a substitute for them. The structural transfer applies to every fixed correctness relation, independent of whether that relation is polynomial-time accessible. In the direct finite-local regime, where local routing tests are computed from raw pairwise syntax, three binary-pairwise proxy families and one offset-normalization witness exhibit same-orbit disagreement. Positive results arise from quotient-preserving normalizations, computable orbit catalogues whose descended predicates compose under Boolean operations, and predicates defined directly on the correctness quotient. The result complements the Rice-analog line of Borchert, Stephan, Hemaspaandra, and Rothe. All numbered results are mechanized in Lean 4; the supplementary ledger maps each claim to its formal identifier.

2507.18967 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Underwater Waste Detection Using Deep Learning A Performance Comparison of YOLOv7 to 10 and Faster RCNN

利用深度学习进行水下垃圾检测：YOLOv7到YOLOv10与Faster R-CNN的性能比较

UMMPK Nawarathne, HMNS Kumari, HMLS Kumari

发表机构 * Faculty of Computing, Sri Lanka Institute of Information Technology（斯里兰卡信息技术学院计算学院）； Faculty of Information Technology and Communication Sciences, Tampere University（坦佩雷大学信息技术与通信科学学院）； Computing Centre, Faculty of Engineering, University of Peradeniya（佩拉德尼亚大学工程学院计算中心）

AI总结本文比较了YOLOv7到YOLOv10及Faster R-CNN在水下垃圾检测中的性能，在涵盖低能见度和不同深度等条件的数据集上，YOLOv8表现最佳，mAP达80.9%。

Comments 7 pages, 11 figures, to be published in International Journal of Research in Computing (IJRC)

详情

Journal ref: Vol. 5 No. I (2026): International Journal of Research in Computing (IJRC)

AI中文摘要

水下污染是当今最严重的环境问题之一，全球海洋、河流和各类地貌中都发现了大量垃圾。准确检测这些垃圾对废物管理、环境监测和缓解策略至关重要。本文研究了五种先进的物体识别算法，包括YOLO模型（YOLOv7、YOLOv8、YOLOv9、YOLOv10）和Faster R-CNN，以确定哪种模型在水下环境中识别材料最有效。这些模型在包含十五种不同类别、涵盖低能见度和不同深度等多样条件的大型数据集上进行了充分训练和测试。结果显示，YOLOv8优于其他模型，平均精度均值（mAP）为80.9%。这一性能优势归因于YOLOv8的架构，其包含改进的无锚机制和自监督学习等特性，从而在各种环境中实现更精确和高效的识别。这些发现突显了YOLOv8模型在全球抗污染斗争中的潜力，可提高水下清理作业的检测能力和可扩展性。

英文摘要

Underwater pollution is one of today's most significant environmental concerns, with vast volumes of garbage found in seas, rivers, and landscapes around the world. Accurate detection of these waste materials is crucial for successful waste management, environmental monitoring, and mitigation strategies. In this study, we investigated the performance of five cutting-edge object recognition algorithms, namely four YOLO (You Only Look Once) models (YOLOv7, YOLOv8, YOLOv9, and YOLOv10) and the Faster Region-based Convolutional Neural Network (Faster R-CNN), to identify which model was most effective at recognizing waste materials underwater. The models were thoroughly trained and tested on a large dataset containing fifteen different classes under diverse conditions, such as low visibility and variable depths. Among these models, YOLOv8 outperformed the others, with a mean Average Precision (mAP) of 80.9%. This increased performance is attributed to YOLOv8's architecture, which incorporates advanced features such as improved anchor-free mechanisms and self-supervised learning, allowing for more precise and efficient recognition of items in a variety of settings. These findings highlight the YOLOv8 model's potential as an effective tool in the global fight against pollution, improving both the detection capabilities and scalability of underwater cleanup operations.

2508.05153 2026-06-09 cs.RO cs.AI 版本更新

FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction

FCBV-Net：通过特征条件双臂价值预测实现类别级机器人服装平滑

Mohammed Daba, Jing Qiu

发表机构 * University of Waterloo（滑铁卢大学）

AI总结本文提出FCBV-Net，通过预训练的密集几何特征条件预测双臂动作价值，提升机器人服装平滑任务的类别级泛化能力，实验显示其在未见过的服装上效率下降仅为11.5%。

Comments 9 pages, 7 figures, 1 table

详情

DOI: 10.3390/electronics15112468
Journal ref: Electronics 2026, 15(11), 2468

AI中文摘要

类别级机器人服装操作（如双臂平滑）的泛化仍是一大难题，原因在于高维性、复杂动态和类别内变化。现有方法往往难以应对：要么因针对特定实例同时学习视觉特征而过拟合，要么虽具备类别级感知泛化能力，却无法预测协同双臂动作的价值。本文提出特征条件双臂价值网络（FCBV-Net），在3D点云上操作，专门增强服装平滑的类别级策略泛化。FCBV-Net将双臂动作价值预测条件于预训练的冻结密集几何特征，确保对类别内服装变化的鲁棒性。可训练的下游组件则利用这些静态特征学习任务特定的策略。在使用CLOTH3D数据集的模拟PyFlex环境中，FCBV-Net展示了优越的类别级泛化能力。它在未见过的服装上效率（Steps80）仅下降11.5%，而基于2D图像的基线下降96.2%；其最终覆盖率达到89%，优于使用相同逐点几何特征但采用固定动作基元的基于3D对应的基线（83%）。这些结果表明，将几何理解与双臂动作价值学习解耦能够实现更好的类别级泛化。代码、视频和补充材料可在项目网站 https://dabaspark.github.io/fcbvnet/ 获取。

英文摘要

Category-level generalization for robotic garment manipulation, such as bimanual smoothing, remains a significant hurdle due to high dimensionality, complex dynamics, and intra-category variations. Current approaches often struggle, either overfitting with concurrently learned visual features for a specific instance or, despite category-level perceptual generalization, failing to predict the value of synergistic bimanual actions. We propose the Feature-Conditioned Bimanual Value Network (FCBV-Net), operating on 3D point clouds to specifically enhance category-level policy generalization for garment smoothing. FCBV-Net conditions bimanual action value prediction on pre-trained, frozen dense geometric features, ensuring robustness to intra-category garment variations. Trainable downstream components then learn a task-specific policy using these static features. In simulated PyFlex environments using the CLOTH3D dataset, FCBV-Net demonstrated superior category-level generalization. It exhibited only an 11.5% efficiency drop (Steps80) on unseen garments compared to 96.2% for a 2D image-based baseline, and achieved 89% final coverage, outperforming an 83% coverage from a 3D correspondence-based baseline that uses identical per-point geometric features but a fixed primitive. These results highlight that the decoupling of geometric understanding from bimanual action value learning enables better category-level generalization. Code, videos, and supplementary materials are available at the project website: https://dabaspark.github.io/fcbvnet/.

2603.24940 2026-06-09 cs.PL cs.AI 版本更新

Evaluating adaptive and generative AI-based feedback and recommendations in a knowledge-graph-integrated programming learning system

评估基于自适应和生成式AI的反馈与推荐在知识图谱集成的编程学习系统中的效果

Lalita Na Nongkhai, Jingyun Wang, Adam Wynn, Takahiko Mendori

发表机构 * Graduate School of Engineering, Kochi University of Technology（高知工科大学工学研究科）； Department of Computer Science, Durham University（杜伦大学计算机科学系）

AI总结本文提出一种整合大型语言模型与检索增强生成方法的知识图谱编程学习系统，通过实验比较三种教学模式的反馈效果与学习表现。

详情

DOI: 10.1016/j.caeai.2025.100526
Journal ref: Computers and Education: Artificial Intelligence, Volume 10, June 2026, 100526

AI中文摘要

本文介绍了一种整合大型语言模型（LLM）与检索增强生成（RAG）方法的框架，利用知识图谱和用户交互历史进行学习者代码评估、生成形成性反馈并推荐练习。该研究通过四个关键日志特征分析了4956次代码提交数据，发现生成式AI模式的反馈使学习者正确代码更多且缺失关键逻辑的提交更少。混合生成式AI-自适应模式在正确提交数和错误或不完整尝试数上表现最佳，优于仅自适应或仅生成式AI模式。问卷结果显示，生成式AI反馈被广泛认为有帮助，且所有模式在易用性和有用性上均获好评。

英文摘要

This paper introduces the design and development of a framework that integrates a large language model (LLM) with a retrieval-augmented generation (RAG) approach leveraging both a knowledge graph and user interaction history. The framework is incorporated into a previously developed adaptive learning support system to assess learners' code, generate formative feedback, and recommend exercises. Moreover, this study examines learner preferences across three instructional modes: adaptive, Generative AI (GenAI), and hybrid GenAI-adaptive. An experimental study was conducted to compare the learning performance and perception of the learners, and the effectiveness of these three modes, using four key log features derived from 4956 code submissions across all experimental groups. The analysis results show that learners receiving feedback from GenAI modes had significantly more correct code and fewer code submissions missing essential programming logic than those receiving feedback from adaptive mode. In particular, the hybrid GenAI-adaptive mode achieved the highest number of correct submissions and the fewest incorrect or incomplete attempts, outperforming both the adaptive-only and GenAI-only modes. Questionnaire responses further indicated that GenAI-generated feedback was widely perceived as helpful, while all modes were rated positively for ease of use and usefulness. These results suggest that the hybrid GenAI-adaptive mode outperforms the other two modes across all measured log features.

2508.02197 2026-06-09 cs.AI 版本更新

A Message Passing Realization of Expected Free Energy Minimization

期望自由能最小化的信息传递实现

Wouter W. L. Nuijten, Mykola Lukashchuk, Thijs van de Laar, Bert de Vries

发表机构 * Eindhoven University of Technology, 5612 AP Eindhoven, the Netherlands（埃因霍温理工大学，荷兰埃因霍温5612 AP）； GN Hearing, 5612 AB Eindhoven, The Netherlands（GN Hearing，荷兰埃因霍温5612 AB）

AI总结本文提出基于因子图的期望自由能最小化消息传递方法，通过将期望自由能最小化转化为变分自由能最小化问题，实现高效策略推断，并在存在认知（epistemic）不确定性的环境中验证了其有效性。

详情

DOI: 10.1007/978-3-032-16955-6_5
Journal ref: In: International Workshop on Active Inference, pp. 69-84. Springer, Cham, 2022

AI中文摘要

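我们基于arXiv:2504.14898中提出的理论，给出一种在因子图上进行期望自由能（EFE）最小化的消息传递方法。通过将EFE最小化重新表述为带认知先验（epistemic priors）的变分自由能最小化，我们把组合搜索问题转化为可用标准变分技术求解的可处理推断问题。将该消息传递方法应用于因子化状态空间模型，可实现高效的策略推断。我们在具有认知不确定性的环境中评估了该方法：一个随机网格世界和一个部分可观测的Minigrid任务。在这些任务上，使用该方法的智能体始终优于传统的KL控制智能体，在不确定性下表现出更稳健的规划和更高效的探索。在随机网格世界环境中，最小化EFE的智能体会避开有风险的路径；在部分可观测的Minigrid设置中，它们会进行更系统的信息搜寻。该方法将主动推断理论与实际实现联系起来，为认知先验在人工智能体中的效率提供了实证依据。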

英文摘要

We present a message passing approach to Expected Free Energy (EFE) minimization on factor graphs, based on the theory introduced in arXiv:2504.14898. By reformulating EFE minimization as Variational Free Energy minimization with epistemic priors, we transform a combinatorial search problem into a tractable inference problem solvable through standard variational techniques. Applying our message passing method to factorized state-space models enables efficient policy inference. We evaluate our method on environments with epistemic uncertainty: a stochastic gridworld and a partially observable Minigrid task. Agents using our approach consistently outperform conventional KL-control agents on these tasks, showing more robust planning and efficient exploration under uncertainty. In the stochastic gridworld environment, EFE-minimizing agents avoid risky paths, while in the partially observable minigrid setting, they conduct more systematic information-seeking. This approach bridges active inference theory with practical implementations, providing empirical evidence for the efficiency of epistemic priors in artificial agents.
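
To make the quantity being minimized concrete, here is a minimal sketch of Expected Free Energy for a single time step of a discrete model, using the textbook risk-plus-ambiguity decomposition $G = \mathrm{KL}[q(o)\,\|\,p(o)] + \mathbb{E}_{q(s)}[H[p(o \mid s)]]$. This is an assumption made for illustration: it is neither the paper's message passing scheme nor its epistemic-prior reformulation, and the function and argument names are hypothetical.

```python
from math import log


def expected_free_energy(q_s, likelihood, preferred_o):
    """Risk + ambiguity for one step of a discrete generative model.

    q_s[s]          predicted state probabilities under a policy
    likelihood[s][o] observation model p(o | s)
    preferred_o[o]  prior preference over observations (must be > 0
                    wherever the predicted observation probability is > 0)
    """
    n_states = len(q_s)
    n_obs = len(preferred_o)
    # Predicted observation distribution q(o) = sum_s p(o | s) q(s).
    q_o = [sum(q_s[s] * likelihood[s][o] for s in range(n_states))
           for o in range(n_obs)]
    # Risk: KL divergence between predicted and preferred observations.
    risk = sum(q * (log(q) - log(p))
               for q, p in zip(q_o, preferred_o) if q > 0)
    # Ambiguity: expected entropy of the observation model.
    ambiguity = -sum(
        q_s[s] * sum(p * log(p) for p in likelihood[s] if p > 0)
        for s in range(n_states)
    )
    return risk + ambiguity


# A noiseless sensor whose predictions match preferences costs nothing.
clear = expected_free_energy([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
# A completely uninformative sensor pays log(2) nats of ambiguity.
foggy = expected_free_energy([1.0, 0.0], [[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5])
print(clear, foggy)
```

In this toy comparison a policy that leads to the uninformative sensor scores worse even though its predicted observations match the preferences. This is the kind of trade-off that lets EFE-minimizing agents prefer informative, low-risk paths.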

2309.10370 2026-06-09 cs.LG cs.AI math-ph math.MP math.OC stat.ML 版本更新

Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

浅层神经网络的几何结构与构造性${\mathcal L}^2$成本最小化

Thomas Chen, Patrícia Muñoz Ewald

发表机构 * Department of Mathematics, University of Texas at Austin（德克萨斯大学奥斯汀分校数学系）

AI总结本文研究浅层ReLU网络在欠参数化情况下的成本最小化问题，通过构造上界揭示分类数据的几何结构，不依赖梯度下降。证明了成本函数最小值的上界与训练数据信噪比相关，并确定了特定子空间的构造性训练网络。

Comments AMS Latex, 29 pages. Experimental evidence added. To appear in Physica D: Nonlinear Phenomena

详情

Journal ref: Phys. D, 490, Article No. 135176 (2026)

AI中文摘要

本文通过显式构造上界，探讨欠参数化浅层ReLU网络中成本（损失）最小化问题，不使用梯度下降方法。重点在于阐明近似和精确极小值点的几何结构。考虑$L^2$成本函数，输入空间$\mathbb{R}^M$，输出空间$\mathbb{R}^Q$，其中$Q\leq M$，训练输入样本大小可任意大。证明了成本函数最小值的上界为$O(\delta_P)$，其中$\delta_P$衡量训练数据的信噪比。在特殊情况$M=Q$下，显式确定了成本函数的一个精确退化局部极小值，并表明该精确值与$Q\leq M$时获得的上界相差一个$O(\delta_P^2)$的相对误差。上界的证明给出了一个构造性训练的网络；我们证明该网络度量化了输入空间$\mathbb{R}^M$中的一个特定$Q$维子空间。我们还讨论了在给定设定下成本函数全局极小值的刻画问题。

英文摘要

In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow ReLU networks through the explicit construction of upper bounds which appeal to the structure of classification data, without use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider an $L^2$ cost function, input space $\mathbb{R}^M$, output space $\mathbb{R}^Q$ with $Q\leq M$, and training input sample size that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$ where $\delta_P$ measures the signal-to-noise ratio of training data. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space $\mathbb{R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.

2602.13271 2026-06-09 cs.AI cs.HC cs.LG 版本更新

Human-Centered Explainable AI for Security Enhancement: A Deep Intrusion Detection Framework

面向安全增强的人本可解释AI：一种深度入侵检测框架

Md Muntasir Jahid Ayan, Md. Shahriar Rashid, Tazzina Afroze Hassan, Hossain Md. Mubashshir Jamil, Mahbubul Islam, Lisan Al Amin, Rupak Kumar Das, Farzana Akter, Faisal Quader

发表机构 * Department of Computer Science and Engineering, United International University (UIU), Dhaka 1212, Bangladesh（联合国际大学（UIU）计算机科学与工程系，达卡1212，孟加拉国）； Department of Electrical and Electronic Engineering, Islamic University of Technology, Gazipur 1704, Bangladesh（伊斯兰科技大学电气与电子工程系，加济布尔1704，孟加拉国）； Department of Computer Science and Engineering (CSE), University of Asia Pacific (UAP), Dhaka 1207, Bangladesh（亚太大学（UAP）计算机科学与工程系（CSE），达卡1207，孟加拉国）； Department of Information Systems, University of Maryland, Baltimore, 21250, Maryland, USA（马里兰大学信息系统系，巴尔的摩，21250，美国马里兰州）； College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802, USA（宾夕法尼亚州立大学信息科学与技术学院，尤尼弗西蒂帕克，PA 16802，美国）； Department of Information Technology, Washington University of Science and Technology, Alexandria, VA（华盛顿科技大学信息技术系，弗吉尼亚州亚历山德里亚）； College of Engineering and Information Technology, University of Maryland, College Park, 20742, Maryland, USA（马里兰大学工程与信息技术学院，科利奇帕克，20742，美国马里兰州）

AI总结本文提出一种结合可解释AI的深度入侵检测框架，利用CNN和LSTM捕捉流量序列的时间依赖性，通过SHAP实现模型可解释性，提升安全分析的透明度与可靠性。

详情

DOI: 10.1109/SoutheastCon63549.2026.11476073

AI中文摘要

随着网络威胁的复杂性和频率增加，需要准确且可解释的入侵检测系统（IDS）。本文提出了一种新颖的IDS框架，整合可解释人工智能（XAI）以增强深度学习模型的透明性。该框架在NSL-KDD基准数据集上进行实验评估，显示优于传统IDS和黑箱深度学习模型。所提方法结合卷积神经网络（CNN）和长短期记忆网络（LSTM）以捕捉流量序列的时间依赖性。深度学习结果表明，CNN和LSTM的准确率均达到0.99，其中LSTM在宏平均精度、召回率和F-1分数上优于CNN。对于加权平均精度、召回率和F-1分数，两种模型得分几乎相同。为确保可解释性，XAI模型SHapley Additive exPlanations（SHAP）被纳入，使安全分析师能够理解和验证模型决策。SHAP指出，srv_serror_rate、dst_host_srv_serror_rate和serror_rate是两个模型中的一些重要特征。我们还基于IPIP6和Big Five人格特质进行了以信任为导向的专家调查，通过交互式UI评估系统的可靠性和可用性。本工作强调了在网络安全解决方案中结合性能和透明性的潜力，并通过自适应学习推荐未来改进以实现实时威胁检测。

英文摘要

The increasing complexity and frequency of cyber-threats demand intrusion detection systems (IDS) that are not only accurate but also interpretable. This paper presented a novel IDS framework that integrated Explainable Artificial Intelligence (XAI) to enhance transparency in deep learning models. The framework was evaluated experimentally using the benchmark dataset NSL-KDD, demonstrating superior performance compared to traditional IDS and black-box deep learning models. The proposed approach combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks for capturing temporal dependencies in traffic sequences. Our deep learning results showed that both CNN and LSTM reached 0.99 for accuracy, whereas LSTM outperformed CNN on macro-average precision, recall, and F1 score. For weighted-average precision, recall, and F1 score, both models scored nearly the same. To ensure interpretability, the XAI model SHapley Additive exPlanations (SHAP) was incorporated, enabling security analysts to understand and validate model decisions. As pointed out by SHAP, notable influential features for both models included srv_serror_rate, dst_host_srv_serror_rate, and serror_rate. We also conducted a trust-focused expert survey based on IPIP6 and Big Five personality traits via an interactive UI to evaluate the system's reliability and usability. This work highlights the potential of combining performance and transparency in cybersecurity solutions and recommends future enhancements through adaptive learning for real-time threat detection.
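
SHAP attributions approximate Shapley values, and for a handful of features the exact values can be computed by enumerating coalitions. The sketch below does that in plain Python. It does not use the SHAP library or the paper's models; the `toy_value` function and its weights are invented purely for illustration, and only the three feature names come from the abstract.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values by enumerating every coalition of the other features."""
    n = len(features)
    result = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        result[i] = total
    return result


# Hypothetical additive contributions plus one pairwise interaction.
WEIGHTS = {"serror_rate": 0.2, "srv_serror_rate": 0.5, "dst_host_srv_serror_rate": 0.3}


def toy_value(coalition):
    score = sum(WEIGHTS[f] for f in coalition)
    if {"serror_rate", "srv_serror_rate"} <= coalition:
        score += 0.1  # interaction bonus, split evenly by the Shapley value
    return score


phi = shapley_values(list(WEIGHTS), toy_value)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.3f}")
```

The attributions sum to the value of the full feature set (the efficiency property), and the interaction term is shared equally between the two features that create it. Exact enumeration is exponential in the number of features, which is why practical tools rely on approximations.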

2602.05027 2026-06-09 cs.SD cs.AI 版本更新

AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

AudioSAE：利用稀疏自编码器理解音频处理模型

Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich, Assel Yermekova, Laida Kushnareva, Vadim Popov, Kristian Kuznetsov, Irina Piontkovskaya

发表机构 * Huawei Noah’s Ark Lab（华为诺亚方舟实验室）

AI总结本文在Whisper和HuBERT的编码器层训练稀疏自编码器（SAE），评估其稳定性和可解释性，并展示其在特征解耦、概念擦除、语音检测优化及与人类脑电活动对齐方面的实用价值。

Comments Accepted to EACL 2026, main track

详情

DOI: 10.18653/v1/2026.eacl-long.149
Journal ref: Proceedings of EACL 2026, pages 3221-3254

AI中文摘要

稀疏自编码器（SAE）是解释神经表征的强大工具，但它们在音频领域的应用尚未充分探索。我们在Whisper和HuBERT的所有编码器层训练SAE，对其稳定性、可解释性进行了广泛评估，并展示了其实用性。超过50%的特征在随机种子间保持一致，且重建质量得以保持。SAE特征捕获了通用声学和语义信息以及特定事件，包括环境噪声和副语言声音（如笑声、低语），并有效解耦它们，仅需移除19-27%的特征即可擦除一个概念。特征引导将Whisper的虚假语音检测降低了70%，且词错误率（WER）增加可忽略不计，展示了实际应用价值。最后，我们发现SAE特征与语音感知过程中的人类脑电活动相关，表明其与人类神经处理的对齐。代码和检查点可在 https://github.com/audiosae/audiosae_demo 获取。

英文摘要

Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability and interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g., laughter, whispering), and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
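
As a reminder of what an SAE computes, the following sketch implements the forward pass and training objective of a generic ReLU sparse autoencoder with an L1 sparsity penalty. The architecture and loss are assumptions for illustration, since the abstract does not specify which SAE variant the authors train, and the tiny hand-set weights are hypothetical.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode into an overcomplete ReLU code z, then linearly decode to x_hat."""
    pre_activation = [h + b for h, b in zip(matvec(w_enc, x), b_enc)]
    z = [max(0.0, a) for a in pre_activation]
    x_hat = [r + b for r, b in zip(matvec(w_dec, z), b_dec)]
    return z, x_hat


def sae_loss(x, z, x_hat, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on the code."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in z)
    return reconstruction + l1_coeff * sparsity


# Two-dimensional activation, four dictionary features (+/- each axis).
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
x = [1.0, -2.0]

z, x_hat = sae_forward(x, w_enc, [0.0] * 4, w_dec, [0.0] * 2)
loss = sae_loss(x, z, x_hat, l1_coeff=0.1)
print(z, x_hat, loss)
```

Only two of the four features fire for this input, and the reconstruction is exact, so the loss consists entirely of the sparsity term. Zeroing selected entries of `z` before decoding is the basic operation behind the concept-erasure and feature-steering experiments the abstract describes.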

2601.21221 2026-06-09 cs.AI 版本更新

Causal Discovery for Explainable AI: A Dual-Encoding Approach

可解释AI中的因果发现：一种双编码方法

Henry Salgado, Meagan R. Kendall, Martine Ceberio

发表机构 * Department of Computer Science, The University of Texas at El Paso（德克萨斯大学埃尔帕索分校计算机科学系）； Department of Engineering Education and Leadership, The University of Texas at El Paso（德克萨斯大学埃尔帕索分校工程教育与领导力系）

AI总结本文提出一种双编码方法，通过互补编码策略和多数投票融合，解决传统因果发现方法在处理分类变量时的数值不稳定问题，并在泰坦尼克号数据集上验证了方法的有效性。

Comments 6 pages

2405.07098 2026-06-09 cs.LG cs.AI math-ph math.MP math.OC stat.ML 版本更新

Interpretable global minima of deep ReLU neural networks on sequentially separable data

可解释的深度ReLU神经网络在依次可分数据上的全局极小值

Thomas Chen, Patrícia Muñoz Ewald

发表机构 * Department of Mathematics, University of Texas at Austin（德克萨斯大学奥斯汀分校数学系）

AI总结本文通过构造零损失分类器，利用累积参数确定截断映射，研究了在小且分离的簇数据及依次线性可分等价类情况下，深度ReLU网络的全局极小值描述。

Comments AMS Latex, 31 pages, 3 figures

2511.02469 2026-06-09 q-fin.CP cs.AI cs.MA 版本更新

Modeling Hawkish-Dovish Latent Beliefs in Multi-Agent Debate-Based LLMs for Monetary Policy Decision Classification

多智能体辩论式LLM中鹰派-鸽派隐含信念建模用于货币政策决策分类

Kaito Takano, Masanori Hirano, Kei Nakagawa

发表机构 * Osaka Metropolitan University（大阪公立大学）； Preferred Networks, Inc.

AI总结本文提出多智能体辩论式LLM框架，通过建模鹰派与鸽派隐含信念提升货币政策预测准确性，优于传统LLM基线。

Comments PRIMA2025 Accepted

详情

DOI: 10.1007/978-3-032-13562-9_38

AI中文摘要

准确预测央行政策决策，特别是美联储公开市场委员会（FOMC）的决策，在经济不确定性加剧的背景下变得尤为重要。尽管先前研究利用货币政策文本预测利率变化，但大多数方法依赖静态分类模型，忽略了政策制定的审议性质。本文提出了一种新颖的框架，通过建模多个大型语言模型（LLMs）作为交互智能体，结构上模仿FOMC的集体决策过程。每个智能体从不同的初始信念开始，并基于定性政策文本和定量宏观经济指标生成预测。通过迭代轮次，智能体通过观察其他智能体的输出修订预测，模拟审议和共识形成。为提高可解释性，我们引入一个表示每个智能体隐含信念（例如鹰派或鸽派）的隐变量，并理论证明该信念如何调解输入信息的感知和交互动态。实证结果表明，这种辩论式方法在预测准确性上显著优于标准LLM基线。此外，显式建模信念提供了关于个体视角和社会影响如何塑造集体政策预测的见解。

英文摘要

Accurately forecasting central bank policy decisions, particularly those of the Federal Open Market Committee (FOMC), has become increasingly important amid heightened economic uncertainty. While prior studies have used monetary policy texts to predict rate changes, most rely on static classification models that overlook the deliberative nature of policymaking. This study proposes a novel framework that structurally imitates the FOMC's collective decision-making process by modeling multiple large language models (LLMs) as interacting agents. Each agent begins with a distinct initial belief and produces a prediction based on both qualitative policy texts and quantitative macroeconomic indicators. Through iterative rounds, agents revise their predictions by observing the outputs of others, simulating deliberation and consensus formation. To enhance interpretability, we introduce a latent variable representing each agent's underlying belief (e.g., hawkish or dovish), and we theoretically demonstrate how this belief mediates the perception of input information and interaction dynamics. Empirical results show that this debate-based approach significantly outperforms standard LLM-based baselines in prediction accuracy. Furthermore, the explicit modeling of beliefs provides insights into how individual perspectives and social influence shape collective policy forecasts.

2510.06742 2026-06-09 cs.AI cs.LG 版本更新

MultiCNKG: Integrating Cognitive Neuroscience, Gene, and Disease Knowledge Graphs Using Large Language Models

MultiCNKG: 利用大语言模型整合认知神经科学、基因和疾病知识图谱

Ali Sarabadani, Kheirolah Rahsepar Fard

发表机构 * Department of Computer Engineering and Information Technology, University of Qom（库姆大学计算机工程与信息技术系）； University of Qom（库姆大学）

AI总结本文提出MultiCNKG框架，整合认知神经科学、基因和疾病知识图谱，利用大语言模型实现实体对齐和图谱增强，提升生物医学领域知识图谱的整合与应用能力。

详情

AI中文摘要

大语言模型（LLMs）的出现革新了生物医学和认知科学中知识图谱（KGs）的整合，克服了传统机器学习方法在捕捉基因、疾病和认知过程之间复杂语义联系方面的局限。我们介绍了MultiCNKG，一种创新框架，整合了三个关键知识源：包含2.9K节点和4.3K边的认知神经科学知识图谱（CNKG），涵盖9种节点类型和20种边类型；基因本体（GO）包含43K节点和75K边，涵盖3种节点类型和4种边类型；疾病本体（DO）包含11.2K节点和8.8K边，涵盖1种节点类型和2种边类型。利用GPT-4等LLMs，我们进行实体对齐、语义相似性计算和图谱增强，创建了一个连接遗传机制、神经疾病和认知功能的统一知识图谱。所得的MultiCNKG包含6.9K节点，涵盖5种类型（如基因、疾病、认知过程）和11.3K边，涵盖7种类型（如导致、关联、调控），提供了从分子层面到行为层面的多层视图。精确率（85.20%）、召回率（87.30%）、覆盖率（92.18%）、图一致性（82.50%）、新颖性检测（40.28%）和专家验证（89.50%）等评估指标证实了其鲁棒性和一致性。使用TransE（MR: 391，MRR: 0.411）和RotatE（MR: 263，MRR: 0.395）等模型进行的链接预测评估显示，其性能与FB15k-237和WN18RR等基准相比具有竞争力。该图谱在个性化医学、认知障碍诊断和认知神经科学假设形成中具有应用前景。

英文摘要

The advent of large language models (LLMs) has revolutionized the integration of knowledge graphs (KGs) in biomedical and cognitive sciences, overcoming limitations in traditional machine learning methods for capturing intricate semantic links among genes, diseases, and cognitive processes. We introduce MultiCNKG, an innovative framework that merges three key knowledge sources: the Cognitive Neuroscience Knowledge Graph (CNKG) with 2.9K nodes and 4.3K edges across 9 node types and 20 edge types; Gene Ontology (GO) featuring 43K nodes and 75K edges in 3 node types and 4 edge types; and Disease Ontology (DO) comprising 11.2K nodes and 8.8K edges with 1 node type and 2 edge types. Leveraging LLMs like GPT-4, we conduct entity alignment, semantic similarity computation, and graph augmentation to create a cohesive KG that interconnects genetic mechanisms, neurological disorders, and cognitive functions. The resulting MultiCNKG encompasses 6.9K nodes across 5 types (e.g., Genes, Diseases, Cognitive Processes) and 11.3K edges spanning 7 types (e.g., Causes, Associated with, Regulates), facilitating a multi-layered view from molecular to behavioral domains. Assessments using metrics such as precision (85.20%), recall (87.30%), coverage (92.18%), graph consistency (82.50%), novelty detection (40.28%), and expert validation (89.50%) affirm its robustness and coherence. Link prediction evaluations with models like TransE (MR: 391, MRR: 0.411) and RotatE (MR: 263, MRR: 0.395) show competitive performance against benchmarks like FB15k-237 and WN18RR. This KG advances applications in personalized medicine, cognitive disorder diagnostics, and hypothesis formulation in cognitive neuroscience.
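
The link prediction figures quoted above use two rank-based metrics, mean rank (MR) and mean reciprocal rank (MRR). A minimal sketch of both follows. The function names and the example ranks are hypothetical and are not taken from the paper's evaluation.

```python
def mean_rank(ranks):
    """Average position of the correct entity (1 is best, lower is better)."""
    _check(ranks)
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank (ranges over (0, 1], higher is better)."""
    _check(ranks)
    return sum(1.0 / r for r in ranks) / len(ranks)


def _check(ranks):
    # Ranks are 1-based positions of the true entity among all candidates.
    if not ranks or any(r < 1 for r in ranks):
        raise ValueError("ranks must be a non-empty list of positive integers")


# Three near-perfect predictions and one very poor one.
ranks = [1, 2, 4, 1000]
print(mean_rank(ranks))             # 251.75
print(mean_reciprocal_rank(ranks))  # about 0.438
```

The example shows why the two numbers should be read together: a few badly ranked triples dominate MR, while MRR is driven by how often the correct entity lands near the top. That is how a mean rank in the hundreds can coexist with an MRR around 0.4.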

2507.15617 2026-06-09 cs.CY cs.AI (version update)

Why can't Epidemiology be automated (yet)?

David Bann, Ed Lowther, Liam Wright, Yevgeniya Kovalchuk

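Affiliations: Centre for Longitudinal Studies, University College London; Centre for Advanced Research Computing, University College London

AI summary: This paper examines the potential and limits of applying AI to epidemiological research. Generative AI offers opportunities, but existing tools and human systems constrain its effectiveness, so epidemiologists and engineers need to work together.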

Comments 9 pages, 2 figures, 1 table

DOI: 10.1093/ije/dyaf210

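AI abstract

Recent advances in artificial intelligence (AI), particularly generative AI, offer new opportunities to accelerate or automate epidemiological research. Unlike disciplines based on physical experiments, epidemiology relies heavily on secondary data analysis and is therefore well suited to such augmentation. However, it remains unclear which specific tasks can benefit from AI intervention and what obstacles exist. Awareness of current AI capabilities is also uneven. This paper maps epidemiological tasks that use existing datasets, from literature review through data access, analysis, writing, and dissemination, and identifies where existing AI tools improve efficiency. Although AI can raise productivity in some areas, such as coding and administrative tasks, its usefulness is limited by existing AI models (e.g., hallucinations in literature reviews) and by human systems (e.g., barriers to dataset access). Examples of AI-generated epidemiological outputs, including fully AI-generated papers, show that recently developed agentic systems can design and execute epidemiological analyses, though with uneven quality (see https://github.com/edlowther/automated-epidemiology). Epidemiologists have new opportunities to empirically test and evaluate AI systems; realizing AI's potential requires two-way engagement between epidemiologists and engineers.

English abstract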

Recent advances in artificial intelligence (AI) - particularly generative AI - present new opportunities to accelerate, or even automate, epidemiological research. Unlike disciplines based on physical experimentation, a sizable fraction of Epidemiology relies on secondary data analysis and thus is well-suited for such augmentation. Yet, it remains unclear which specific tasks can benefit from AI interventions or where roadblocks exist. Awareness of current AI capabilities is also mixed. Here, we map the landscape of epidemiological tasks using existing datasets - from literature review to data access, analysis, writing up, and dissemination - and identify where existing AI tools offer efficiency gains. While AI can increase productivity in some areas such as coding and administrative tasks, its utility is constrained by limitations of existing AI models (e.g. hallucinations in literature reviews) and human systems (e.g. barriers to accessing datasets). Through examples of AI-generated epidemiological outputs, including fully AI-generated papers, we demonstrate that recently developed agentic systems can now design and execute epidemiological analysis, albeit to varied quality (see https://github.com/edlowther/automated-epidemiology). Epidemiologists have new opportunities to empirically test and benchmark AI systems; realising the potential of AI will require two-way engagement between epidemiologists and engineers.

2507.15152 2026-06-09 cs.CL cs.AI cs.LG (version update)

What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction

Lingbo Li, Anuradha Mathrani, Teo Susnjak

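Affiliations: School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand

AI summary: This paper evaluates three large language models on data extraction in medical domains, finds that customised prompts substantially improve recall, and proposes three-tiered guidelines for balancing automation with expert oversight.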

DOI: 10.1017/rsm.2025.10066
Journal ref: Research Synthesis Methods (2026)

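AI abstract

Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a major challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) on tasks covering statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models showed high precision but generally low recall because they omitted key information. Customised prompts were the most effective, raising recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines that matches data types to appropriate levels of automation according to task complexity and risk. The study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.

English abstract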

Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models demonstrate high precision but consistently suffer from poor recall by omitting key information. We found that customised prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.
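The "high precision, poor recall" pattern described above is easy to see with a small worked example. The sketch below assumes extraction output and gold annotations can both be represented as sets of field names; the field names and the helper `precision_recall` are hypothetical and are not the paper's evaluation code.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], gold: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of extracted items against gold items."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold fields for one trial, and a model that omits half of them.
gold_fields = {"n_treatment", "n_control", "mean_outcome", "sd_outcome"}
model_fields = {"n_treatment", "n_control"}

p, r = precision_recall(model_fields, gold_fields)
print(f"precision={p:.2f} recall={r:.2f}")
```

Everything the model returned is correct, so precision is perfect, yet half of the required fields are missing, so recall is low. This is the kind of omission error the abstract attributes to all three models.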

2507.02606 2026-06-09 cs.SD cs.AI cs.CR cs.LG eess.AS (version update)

De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks

Wei Fan, Kejiang Chen, Chang Liu, Weiming Zhang, Nenghai Yu

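Affiliations: University of Science and Technology of China

AI summary: This paper proposes a two-stage purification method for evaluating protective perturbations against voice cloning: it purifies the perturbed speech and then refines it with phoneme guidance. Experiments show that it outperforms existing purification methods at defeating these defenses.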

Comments Accepted by ICML 2025

Journal ref: Proceedings of the 42nd International Conference on Machine Learning, PMLR 267, 2025

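AI abstract

With the rapid development of speech generation models, the privacy and security concerns raised by voice cloning (VC) have become increasingly prominent. Recent studies have tried to prevent unauthorized voice cloning by introducing adversarial perturbations, but determined attackers can mitigate these protective perturbations and still carry out VC. This paper presents the first systematic evaluation of these protective perturbations under realistic threat models that include perturbation purification. We find that although existing purification methods can neutralize a large share of the protective perturbations, they still distort the feature space of VC models, which degrades VC performance. We therefore propose a new two-stage purification method: (1) purify the perturbed speech; (2) refine it with phoneme guidance so that it matches the clean speech distribution. Experimental results show that our method outperforms existing methods at breaking VC defenses. The study exposes the limitations of VC defenses based on adversarial perturbations and highlights the need for more robust solutions to mitigate the security and privacy risks posed by VC. Code and audio samples are available at https://de-antifake.github.io.

English abstract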

The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC. The code and audio samples are available at https://de-antifake.github.io.

2311.07065 2026-06-09 cs.LG cs.AI math-ph math.MP math.OC stat.ML (version update)

On non-approximability of zero loss global ${\mathcal L}^2$ minimizers by gradient descent in Deep Learning

Thomas Chen, Patricia Muñoz Ewald

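Affiliations: Department of Mathematics, University of Texas at Austin

AI summary: This paper analyzes geometric aspects of the gradient descent algorithm in deep learning and argues that zero-loss minimization generally cannot be attained in underparametrized networks, so the distribution of training inputs must be non-generic for zero-loss minimizers to exist.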

Comments AMS Latex, 7 pages. Typos corrected, Corollary 1.6 upgraded to Theorem, acknowledgment added

2311.08957 2026-06-09 cs.RO cs.AI cs.HC (version update)

I Was Blind but Now I See: Implementing Vision-Enabled Dialogue in Social Robots

Giulio Antonio Abbo, Tony Belpaeme

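Affiliations: IDLab-AIRO – Ghent University – imec

AI summary: This paper presents a system that uses large language models to improve the dialogue capabilities of social robots by integrating visual input for better context awareness. It reports six interactions with a Furhat robot and discusses the prospects for dialogue that fuses visual and textual modalities.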

Comments 8 pages, 3 figures

DOI: 10.1109/HRI61500.2025.10973830
Journal ref: HRI '25: Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction. Pages 1176 - 1180

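AI abstract

Amid the rapid development of human-computer interaction, integrating vision capabilities into conversational agents is a key advance. This paper presents an initial implementation of a dialogue manager built on recent large language models (e.g., GPT-4, IDEFICS) that augments traditional text-based prompts with real-time visual input. The LLMs interpret both textual prompts and visual stimuli, producing a more context-aware conversational agent. The system's prompt engineering combines the dialogue with image summaries to balance context preservation against computational efficiency. Six interactions with a Furhat robot are reported, and the results are presented and discussed. By implementing this vision-enabled dialogue system, the paper looks toward a future in which conversational agents seamlessly blend textual and visual modalities for richer, more context-aware dialogue.

English abstract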

In the rapidly evolving landscape of human-computer interaction, the integration of vision capabilities into conversational agents stands as a crucial advancement. This paper presents an initial implementation of a dialogue manager that leverages the latest progress in Large Language Models (e.g., GPT-4, IDEFICS) to enhance the traditional text-based prompts with real-time visual input. LLMs are used to interpret both textual prompts and visual stimuli, creating a more contextually aware conversational agent. The system's prompt engineering, incorporating dialogue with summarisation of the images, ensures a balance between context preservation and computational efficiency. Six interactions with a Furhat robot powered by this system are reported, illustrating and discussing the results obtained. By implementing this vision-enabled dialogue system, the paper envisions a future where conversational agents seamlessly blend textual and visual modalities, enabling richer, more context-aware dialogues.

2501.12421 2026-06-09 cs.LG cs.AI q-bio.QM (version update)

Tackling Small Sample Survival Analysis via Transfer Learning: A Study of Colorectal Cancer Prognosis

Yonghao Zhao, Changtao Li, Chi Shu, Qingbin Wu, Hong Li, Chuan Xu, Tianrui Li, Ziqiang Wang, Zhipeng Luo, Yazhou He

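Affiliations: University of Science and Technology of China

AI summary: This paper uses transfer learning to improve small-sample survival analysis for colorectal cancer prognosis, adapting several survival models (DeepSurv, Cox-CC, DeepHit, and Random Survival Forest); experiments show that transfer learning improves model performance.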

DOI: 10.1016/j.artmed.2026.103426
Journal ref: Artificial Intelligence in Medicine, 178:103426, 2026

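AI abstract

Survival prognosis is crucial for medical informatics. Practitioners often face small clinical datasets, especially for cancer patients, from which it is hard to induce useful patterns for survival prediction. This paper addresses small-sample survival analysis through transfer learning and proposes transfer learning methods for common survival models. For parametric models such as DeepSurv, Cox-CC, and DeepHit, standard transfer learning techniques such as pretraining and fine-tuning are applied. For non-parametric models such as Random Survival Forest, a new transfer survival forest (TSF) model is proposed that transfers tree structures and fine-tunes them with target data. The methods were evaluated on colorectal cancer (CRC) prognosis. The source data are 27,379 SEER CRC stage I patients, and the target data are 728 CRC stage I patients from the West China Hospital. With transfer learning, the $C^{td}$ value of Cox-CC rose from 0.7868 to 0.8111, DeepHit from 0.8085 to 0.8135, DeepSurv from 0.7722 to 0.8043, and RSF from 0.7940 to 0.8297 (the highest performance). All models showed even larger improvements when trained with as few as 50 samples. Conclusion: existing survival models for cancer prognosis can be enhanced by properly designed transfer learning techniques. The source code used in this study is available at https://github.com/YonghaoZhao722/TSF.

English abstract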

Survival prognosis is crucial for medical informatics. Practitioners often confront small-sized clinical data, especially cancer patient cases, which can be insufficient to induce useful patterns for survival predictions. This study deals with small sample survival analysis by leveraging transfer learning, a useful machine learning technique that can enhance the target analysis with related knowledge pre-learned from other data. We propose and develop various transfer learning methods designed for common survival models. For parametric models such as DeepSurv, Cox-CC (Cox-based neural networks), and DeepHit (end-to-end deep learning model), we apply standard transfer learning techniques like pretraining and fine-tuning. For non-parametric models such as Random Survival Forest, we propose a new transfer survival forest (TSF) model that transfers tree structures from source tasks and fine-tunes them with target data. We evaluated the transfer learning methods on colorectal cancer (CRC) prognosis. The source data are 27,379 SEER CRC stage I patients, and the target data are 728 CRC stage I patients from the West China Hospital. When enhanced by transfer learning, Cox-CC's $C^{td}$ value was boosted from 0.7868 to 0.8111, DeepHit's from 0.8085 to 0.8135, DeepSurv's from 0.7722 to 0.8043, and RSF's from 0.7940 to 0.8297 (the highest performance). All models trained with data as small as 50 demonstrated even more significant improvement. Conclusions: Therefore, the current survival models used for cancer prognosis can be enhanced and improved by properly designed transfer learning techniques. The source code used in this study is available at https://github.com/YonghaoZhao722/TSF.
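The $C^{td}$ values above are concordance scores: the fraction of comparable patient pairs that a model orders correctly. The sketch below is a simplified stand-in that assumes a single static risk score per patient (plain Harrell-style concordance), not the time-dependent variant the paper reports; the function name and toy data are illustrative.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[bool],
                      risks: Sequence[float]) -> float:
    """Harrell-style C-index: share of comparable pairs where the patient
    who fails earlier was assigned the higher risk (ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: follow-up times, event indicators (False = censored), predicted risks.
c = concordance_index([2, 4, 6, 8], [True, True, False, True], [0.9, 0.7, 0.8, 0.1])
print(f"C-index = {c:.2f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a sense of scale for the reported gains (for example, RSF moving from 0.7940 to 0.8297).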

2406.19493 2026-06-09 cs.CL cs.AI (version update)

Development and Evaluation of a Retrieval-Augmented Generation Tool for Creating SAPPhIRE Models of Artificial Systems

Anubhab Majumder, Kausik Bhattacharya, Amaresh Chakrabarti

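Affiliations: Department of Design and Manufacturing, Indian Institute of Science

AI summary: This paper presents a retrieval-augmented generation tool for creating SAPPhIRE causality models of artificial systems and evaluates the tool's factual accuracy and reliability, with the aim of better supporting design by analogy.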

Comments This paper has been accepted for presentation at the 10th International Conference on Research Into Design, 2025

2407.00396 2026-06-09 cs.CL cs.AI (version update)

A Study on Effect of Reference Knowledge Choice in Generating Technical Content Relevant to SAPPhIRE Model Using Large Language Model

Kausik Bhattacharya, Anubhab Majumder, Amaresh Chakrabarti

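Affiliations: Indian Institute of Science

AI summary: This paper studies how to use a large language model to generate technical content relevant to the SAPPhIRE model of causality, using retrieval-augmented generation to suppress hallucination, and stresses that the choice of reference knowledge matters for the accuracy of the generated content.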

DOI: 10.1007/978-981-96-5511-3_39

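AI abstract

Representing systems with the SAPPhIRE model of causality can be a source of inspiration in design. However, creating a SAPPhIRE model of a technical or natural system requires gathering technical knowledge about how the system works from multiple technical documents. This research investigates how to use a large language model (LLM) to generate accurate, relevant technical content. This paper, the first of a two-part study, presents a method that uses retrieval-augmented generation to suppress hallucination and generate technical content supported by relevant scientific information. The results show that the choice of reference knowledge used to provide context to the LLM when generating technical content is very important. The outcome of this research is used to build a software support tool that generates the SAPPhIRE model of a given technical system.

English abstract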

Representation of systems using the SAPPhIRE model of causality can be an inspirational stimulus in design. However, creating a SAPPhIRE model of a technical or a natural system requires sourcing technical knowledge from multiple technical documents regarding how the system works. This research investigates how to generate technical content accurately relevant to the SAPPhIRE model of causality using a Large Language Model, also called LLM. This paper, which is the first part of the two-part research, presents a method for hallucination suppression using Retrieval Augmented Generating with LLM to generate technical content supported by the scientific information relevant to a SAPPhIRE construct. The result from this research shows that the selection of reference knowledge used in providing context to the LLM for generating the technical content is very important. The outcome of this research is used to build a software support tool to generate the SAPPhIRE model of a given technical system.

2312.07928 2026-06-09 eess.SP cs.AI stat.AP (version update)

Bayesian inversion of GPR waveforms for sub-surface material characterization: an uncertainty-aware retrieval of soil moisture and overlaying biomass properties

Ishfaq Aziz, Elahe Soltanaghai, Adam Watts, Mohamad Alipour

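Affiliations: Civil and Environmental Engineering, University of Illinois Urbana Champaign; Computer Science, University of Illinois Urbana Champaign; Pacific Wildland Fire Sciences Laboratory, United States Forest Service

AI summary: This paper proposes a Bayesian model-updating approach to GPR waveform inversion for predicting the moisture content and depth of soil and the overlying layer. Validated on laboratory and field data, its results agree with TDR measurements and gravimetric tests, and it provides probabilistic estimates that capture uncertainty.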

Comments Total 34 pages, 17 Figures. This paper under review in a journal but has not been published yet

DOI: 10.1016/j.rse.2024.114351

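AI abstract

Accurate estimation of sub-surface properties such as moisture content and the depth of soil and vegetation layers is crucial for sub-surface condition monitoring, precision agriculture, and wildfire risk assessment. Because soil is often covered by vegetation and organic material, characterizing it is challenging. In addition, estimating the properties of the overlying layer is crucial for wildfire risk assessment. This paper proposes a Bayesian model-updating approach to ground penetrating radar (GPR) waveform inversion for predicting the moisture content and depth of the soil and the overlying layer. Because of its high correlation with moisture content, the dielectric permittivity of both layers is predicted, along with other parameters including layer depth and electrical conductivity. The Bayesian model-updating approach yields probabilistic estimates of these parameters, which convey the confidence and uncertainty of the estimates. The method was evaluated on diverse experimental data collected in laboratory and field investigations. The laboratory studies varied soil moisture, the depth of the overlying layer, and the coarseness of its material. The field study measured field soil moisture over sixteen days. The results show predictions consistent with time-domain reflectometry (TDR) measurements and conventional gravimetric tests. The depth of the surface layer could also be predicted with reasonable accuracy. The proposed method offers a promising approach to uncertainty-aware sub-surface parameter estimation that can support risk-assessment decisions across a wide range of applications.

English abstract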

Accurate estimation of sub-surface properties such as moisture content and depth of soil and vegetation layers is crucial for applications spanning sub-surface condition monitoring, precision agriculture, and effective wildfire risk assessment. Soil in nature is often covered by overlaying vegetation and surface organic material, making its characterization challenging. In addition, the estimation of the properties of the overlaying layer is crucial for applications like wildfire risk assessment. This study thus proposes a Bayesian model-updating-based approach for ground penetrating radar (GPR) waveform inversion to predict moisture contents and depths of soil and overlaying material layer. Due to its high correlation with moisture contents, the dielectric permittivity of both layers were predicted with the proposed method, along with other parameters, including depth and electrical conductivity of layers. The proposed Bayesian model updating approach yields probabilistic estimates of these parameters that can provide information about the confidence and uncertainty related to the estimates. The methodology was evaluated for a diverse range of experimental data collected through laboratory and field investigations. Laboratory investigations included variations in soil moisture values, depth of the overlaying surface layer, and coarseness of its material. The field investigation included measurement of field soil moisture for sixteen days. The results demonstrated predictions consistent with time-domain reflectometry (TDR) measurements and conventional gravimetric tests. The depth of the surface layer could also be predicted with reasonable accuracy. The proposed method provides a promising approach for uncertainty-aware sub-surface parameter estimation that can enable decision-making for risk assessment across a wide range of applications.

2402.09193 2026-06-09 cs.CL cs.AI cs.HC (version update)

(Ir)rationality and Cognitive Biases in Large Language Models

Olivia Macmillan-Scott, Mirco Musolesi

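Affiliations: University College London; University of Bologna

AI summary: This paper evaluates seven language models on tasks from the psychology literature and finds that they display irrationality as humans do, but in different ways, and that they show an additional form of irrationality in the inconsistency of their responses.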

DOI: 10.1098/rsos.240255
Journal ref: Royal Society Open Science 11(6) 2024

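AI abstract

Do large language models (LLMs) reason rationally? LLMs have been shown to contain human biases because the data they are trained on contains them; whether this carries over to rational reasoning is less clear. In this paper, we address the question by evaluating seven language models on tasks from the cognitive psychology literature. We find that, like humans, LLMs display irrationality in these tasks. However, the way this irrationality appears does not mirror that shown by humans. When LLMs give incorrect answers to these tasks, they are often wrong in ways that differ from human biases. In addition, LLMs reveal a further layer of irrationality in the marked inconsistency of their responses. Beyond the experimental results, the paper also aims to make a methodological contribution by showing how the different capabilities of these models can be assessed and compared, particularly with respect to rational reasoning.

English abstract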

Do large language models (LLMs) display rational reasoning? LLMs have been shown to contain human biases due to the data they have been trained on; whether this is reflected in rational reasoning remains less clear. In this paper, we answer this question by evaluating seven language models using tasks from the cognitive psychology literature. We find that, like humans, LLMs display irrationality in these tasks. However, the way this irrationality is displayed does not reflect that shown by humans. When incorrect answers are given by LLMs to these tasks, they are often incorrect in ways that differ from human-like biases. On top of this, the LLMs reveal an additional layer of irrationality in the significant inconsistency of the responses. Aside from the experimental results, this paper seeks to make a methodological contribution by showing how we can assess and compare different capabilities of these types of models, in this case with respect to rational reasoning.

2101.01060 2026-06-09 cs.CV cs.AI cs.MM (version update)

Personal Privacy Protection via Irrelevant Faces Tracking and Pixelation in Video Live Streaming

Jizhe Zhou, Chi-Man Pun

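Affiliations: IEEE

AI summary: This paper proposes FPVLS, a method with a two-stage frame-to-video structure that performs automatic privacy filtering in video live streaming and addresses target drifting, computing efficiency, and over-pixelation.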

DOI: 10.1109/TIFS.2020.3029913
Journal ref: IEEE Transactions on Information Forensics and Security, 16, 1088-1103 (2020)

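AI abstract

To date, pixelation for privacy protection remains labor-intensive and understudied. With the spread of video live streaming, an online face pixelation mechanism for live streams has become an urgent need. This paper develops a new method, Face Pixelation in Video Live Streaming (FPVLS), that automatically filters personal privacy during unconstrained streaming activities. Simply applying multi-face trackers runs into problems of target drifting, computing efficiency, and over-pixelation. To pixelate the faces of irrelevant people quickly and accurately, FPVLS therefore adopts a frame-to-video structure with two core stages. On individual frames, FPVLS uses image-based face detection and embedding networks to produce face vectors. In the raw trajectory generation stage, the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm uses the face vectors and position information to quickly associate the same person's faces across frames. These frame-wise accumulated raw trajectories can be intermittent and unreliable at the video level. We therefore add a trajectory refinement stage, which combines a proposal network with a two-sample test based on the Empirical Likelihood Ratio (ELR) statistic to refine the raw trajectories. A Gaussian filter is then applied to the refined trajectories for the final pixelation. On the video live streaming dataset we collected, FPVLS achieves satisfactory accuracy and real-time efficiency and keeps over-pixelation in check.

English abstract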

To date, the privacy-protection intended pixelation tasks are still labor-intensive and yet to be studied. With the prevailing of video live streaming, establishing an online face pixelation mechanism during streaming is an urgency. In this paper, we develop a new method called Face Pixelation in Video Live Streaming (FPVLS) to generate automatic personal privacy filtering during unconstrained streaming activities. Simply applying multi-face trackers will encounter problems in target drifting, computing efficiency, and over-pixelation. Therefore, for fast and accurate pixelation of irrelevant people's faces, FPVLS is organized in a frame-to-video structure of two core stages. On individual frames, FPVLS utilizes image-based face detection and embedding networks to yield face vectors. In the raw trajectories generation stage, the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm leverages face vectors and positioned information to quickly associate the same person's faces across frames. Such frame-wise accumulated raw trajectories are likely to be intermittent and unreliable on video level. Hence, we further introduce the trajectory refinement stage that merges a proposal network with the two-sample test based on the Empirical Likelihood Ratio (ELR) statistic to refine the raw trajectories. A Gaussian filter is laid on the refined trajectories for final pixelation. On the video live streaming dataset we collected, FPVLS obtains satisfying accuracy, real-time efficiency, and contains the over-pixelation problems.

1. Agents, Planning, and Decision-Making (55 papers)

Syll: Open-Source Personal Automation with Cross-Surface Execution

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Efficient Skill Grounding via Code Refactoring with Small Language Models

SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents

Traxia: A Framework for Verifiable, Agent-Native Scientific Publishing

Self-Evolving Scientific Agent Discovers Generalizable Physically-Reasoned Fluid Control

Quantitative Promise Theory: Intentionality and Inference in Autonomous Agents

Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning

Can the Environment Speak for Itself? $T^{2}$-GRPO: A Turn-Trajectory Group Relative Policy Optimization for Caregiver Agents

AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

REFLECT: Intervention-Supported Error Attribution for Silent Failures in LLM Agent Traces

MASS: Deep Research for Social Sciences with Memory-Augmented Social Simulation

FF-JEPA: Long-Horizon Planning in World Models with Latent Planners

Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs

RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning

SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

BRAIN: Bayesian Reasoning via Active Inference for Agentic and Embodied Intelligence in Mobile Networks

Bidirectional Semantic Complementary Tool Retrieval for Remote Sensing Agents

Outage Detection in Self-Healing Smart Grids Using Reinforcement Learning with Spectral Graph Neural Networks

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution

Rosetta Memory: Adaptive Memory for Cross-LLM Agents

Does Persona Make LLMs K-pop Fans? A Pilot Study of LLM-Based Online Concert Audience Agents

Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimension Method

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures

Generative Frontier Planning for Adaptive Peer-Referral Recruitment under Covariate-Dependent Arrivals

Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries

Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey

Agentic Search for Counterfactual Recourse under Fixed LLM Budgets

SafeRun: Enabling Determinism in LLM Planning for Running

Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

An Agency-Transferring Model-Free Policy Enhancement Technique

A Survey on Large Language Model-Based Game Agents

Language-based Trial and Error Falls Behind in the Era of Experience

Web Agents Should Use Typed Actions Instead of Click-Based Browsing

2-Step Agent: A Framework for the Interaction of a Decision Maker with AI Decision Support

IRAM-Omega-Q: A Computational Framework for Uncertainty Regulation in Adaptive Agents

Executable World Models for ARC-AGI-3 in the Era of Coding Agents

Engagement Process: Rethinking the Temporal Interface of Action and Observation

ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning

In-Context Reinforcement Learning via Communicative World Models

Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

Exploring Autonomous Agentic Data Engineering for Model Specialization

Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism

2. Knowledge Representation, Reasoning, and Symbolic AI (12 papers)

A Variability-Based Framework for Interpretable Naming in Formal and Relational Concept Analysis

Standpoint Logics with Defeasible Beliefs

Extending Ontologies: From Dense Embeddings to Hybrid Quantum-Fuzzy Systems

(Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs

Implicit Causal Graph Construction in Text via Chain Discovery

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance

Modeling the Diachronic Evolution of Legal Norms: An LRMoo-Based, Component-Level, Event-Centric Approach to Legal Knowledge Graphs

Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations

The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data

Advancing Mathematics Research with AI-Driven Formal Proof Search

Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables

3. Multi-Agent Systems and Games (15 papers)

ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems

A Multi-Agent System for IPMSM Design Optimization via an FEA-AI Hybrid Approach

Collaborative Human-Agent Protocol (CHAP)

Symbolic Reasoning Frameworks Modulate LLM Risk Aversion in Multi-Agent Strategic Settings

ViMax: Agentic Video Generation

Voting Protocols as Coordination Mechanisms for Role-Constrained Multi-Agent Tutoring Systems

Post-AGI Economies: Superposition and the Second Fundamental Theorem of Welfare Economics

Autonomous Incident Resolution at Hyperscale: An Agentic AI Architecture for Network Operations

Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning

MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs