arXivDaily arXiv daily academic digest, updated Monday to Friday
2605.22817 2026-05-22 cs.LG cs.AI cs.CL cs.NE version update

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal

Affiliations MIT; Improbable AI Lab; MIT-IBM Computing Research Lab; Sakana AI

AI summary This paper proposes Vector Policy Optimization (VPO), which trains policies to anticipate diverse downstream reward functions and thereby produce diverse solutions, improving test-time search performance.

Comments 24 pages

Abstract

Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
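To make the phrase "drop-in replacement for the GRPO advantage estimator" concrete, the sketch below contrasts the usual group-relative advantage on a scalar reward with one hypothetical vector-valued variant. The scalarization by a sampled trade-off vector is an assumption chosen for illustration and is not the estimator defined in the paper; every function name here is made up.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize scalar rewards within one group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def scalarized_advantages(reward_vectors, weights, eps=1e-8):
    """Hypothetical vector variant (assumption): score each rollout under one
    trade-off `weights`, then standardize within the group."""
    scalar = [sum(w * r for w, r in zip(weights, vec)) for vec in reward_vectors]
    return group_relative_advantages(scalar, eps)


def sample_simplex_weights(dim, rng):
    """Draw a random trade-off from the probability simplex (normalized exponentials)."""
    raw = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(raw)
    return [x / total for x in raw]


# Four rollouts for one coding prompt, each scored on three test cases (1 = pass).
reward_vectors = [[1, 0, 0], [0, 1, 1], [1, 1, 0], [0, 0, 0]]

# Scalar baseline: the mean pass rate collapses the vector before normalizing.
mean_pass = [sum(v) / len(v) for v in reward_vectors]
scalar_adv = group_relative_advantages(mean_pass)

# Vector sketch: a different trade-off changes which rollouts are favored.
rng = random.Random(0)
weights = sample_simplex_weights(3, rng)
vector_adv = scalarized_advantages(reward_vectors, weights)
```

With uniform weights the vector variant reduces to the scalar baseline, while a weight vector concentrated on one test case rewards the rollouts that specialize in it. That is the intuition behind training solutions for different trade-offs.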

2605.22791 2026-05-22 cs.AI version update

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Ali Hatamizadeh, Yejin Choi, Jan Kautz

Affiliations NVIDIA

AI summary This paper proposes Gated DeltaNet-2, which introduces channel-wise erase and write gates to decouple the control of erasing and writing in linear attention. It achieves the strongest overall results on language modeling, commonsense reasoning, and retrieval, with the largest advantage on long-context retrieval tasks.

Comments Gated DeltaNet-2 technical report; code at https://github.com/NVlabs/GatedDeltaNet-2

Abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
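The difference between a tied scalar gate and separate erase and write gates can be sketched on a single recurrent step. The decoupled update below is an assumed form written only to illustrate the idea (decay is omitted, and the paper's exact Gated Delta Rule-2 may differ); the function names are hypothetical.

```python
def matvec(S, k):
    """Read from memory: S is d_v x d_k, k has length d_k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def delta_rule_step(S, k, v, beta):
    """Classic delta rule with one scalar gate:
    S' = S - beta * (S k) k^T + beta * v k^T."""
    read = matvec(S, k)
    return [
        [S[i][j] + beta * (v[i] - read[i]) * k[j] for j in range(len(k))]
        for i in range(len(v))
    ]


def decoupled_step(S, k, v, erase_gate, write_gate):
    """Illustrative decoupled update (assumed form, decay omitted):
    S' = S - (S k) (erase_gate * k)^T + (write_gate * v) k^T,
    with erase_gate per key channel and write_gate per value channel."""
    read = matvec(S, k)
    return [
        [
            S[i][j] - read[i] * erase_gate[j] * k[j] + write_gate[i] * v[i] * k[j]
            for j in range(len(k))
        ]
        for i in range(len(v))
    ]


S = [[0.5, -0.2, 0.1], [0.0, 0.3, 0.4]]  # d_v = 2, d_k = 3
k = [1.0, 0.0, 2.0]
v = [0.7, -1.0]

tied = delta_rule_step(S, k, v, beta=0.3)
# Both gates collapsed to the same scalar: matches the tied update.
collapsed = decoupled_step(S, k, v, erase_gate=[0.3] * 3, write_gate=[0.3] * 2)
# Erase switched off, write fully on: a pure additive write.
write_only = decoupled_step(S, k, v, erase_gate=[0.0] * 3, write_gate=[1.0] * 2)
```

In this sketch, setting both gates to the same scalar recovers the tied delta rule, mirroring the reduction described in the abstract. Separate gates allow combinations such as writing without erasing, which a single scalar cannot express.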

2605.22786 2026-05-22 cs.AI cs.ET cs.LG cs.MA version update

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

Sadia Asif, Mohammad Mohammadi Amiri, Momin Abbas, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy

Affiliations Rensselaer Polytechnic Institute; IBM Research

AI summary This paper proposes the LCGuard framework, which learns representation-level transformations before KV caches are shared between agents in order to prevent leakage of sensitive information. Across multiple model families and multi-agent benchmarks, it reduces reconstruction attack success rates while preserving task performance.

Abstract

Large language model (LLM)-based multi-agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication, particularly through transformer key-value (KV) caches, can improve efficiency and preserve richer task-relevant information. However, KV caches also encode contextual inputs, intermediate reasoning states, and agent-specific information, creating an opaque channel through which sensitive content may propagate across agents without explicit textual disclosure. To address this, we introduce LCGuard (Latent Communication Guard), a framework for safe KV-based latent communication in multi-agent LLM systems. LCGuard treats shared KV caches as latent working memory and learns representation-level transformations before cache artifacts are transmitted across agents. We formalize representation-level sensitive information leakage operationally through reconstruction: a shared cache artifact is unsafe if an adversarial decoder can recover agent-specific sensitive inputs from it. This leads to an adversarial training formulation in which the adversary learns to reconstruct sensitive inputs, while LCGuard learns transformations that preserve task-relevant semantics and reduce reconstructable information. Empirical evaluations across multiple model families and multi-agent benchmarks show that LCGuard consistently reduces reconstruction-based leakage and attack success rates while maintaining competitive task performance compared to standard KV-sharing baselines.

2605.22776 2026-05-22 cs.LG cs.AI stat.CO stat.ML version update

SDPM: Survival Diffusion Probabilistic Model for Continuous-Time Survival Analysis

Stanislav R. Kirpichenko, Andrei V. Konstantinov, Lev V. Utkin

Affiliations Peter the Great St. Petersburg Polytechnic University

AI summary This paper proposes SDPM, a generative model for continuous-time survival analysis that models the conditional distribution of survival outcomes with a denoising diffusion model, avoiding parametric assumptions on the event-time distribution. It operates in a transformed target space, using standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator, and achieves competitive predictive performance on multiple real survival datasets.

Abstract

Survival analysis aims to estimate a time-to-event distribution from data with censored observations. Many existing methods either impose structural assumptions on the hazard function or discretize the time axis, which may limit flexibility and introduce approximation errors. We propose the Survival Diffusion Probabilistic Model (SDPM), a generative approach to continuous-time survival analysis. SDPM models the conditional distribution of the survival outcome, represented by the pair of observed time and censoring indicator, $\mathbb{P}(T, \delta \mid \mathbf{x})$, using a denoising diffusion model. Under the assumption of conditionally independent censoring, conditional samples generated by the model can be transformed into survival function estimates using the Kaplan-Meier estimator. This formulation avoids parametric assumptions on the event-time distribution and does not require a discretization of the output time space. The model operates in a transformed target space, using standardized log-times and a continuous Gaussian-mixture representation of the censoring indicator. We evaluate SDPM on ten real survival datasets and compare it with five strong baselines, including tree-based, boosting-based, and neural survival models. Results show that SDPM achieves competitive predictive performance across C-index, integrated time-dependent AUC, and integrated Brier score. A study on synthetic Cox-Weibull data demonstrates that SDPM can recover the shape of an underlying continuous survival distribution more accurately than a strong nonparametric baseline when sufficiently many samples are generated. An ablation study confirms the importance of the proposed target-space transformations, which improve event-rate calibration, reduce invalid generated times, and provide consistent gains in predictive discrimination. Code implementing the proposed model is publicly available.
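The step that turns generated pairs of time and censoring indicator into a survival curve is the standard Kaplan-Meier product-limit estimator. A minimal sketch follows, assuming the samples for one covariate vector have already been drawn from the model; the hard-coded samples are placeholders, not model output.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times:  observed times T
    events: indicators delta (1 = event observed, 0 = censored)
    """
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        removed = 0
        # Group all samples tied at time t.
        while i < len(order) and order[i][0] == t:
            deaths += order[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit update: multiply by the fraction surviving time t.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        # Events and censored samples both leave the risk set.
        at_risk -= removed
    return curve


# Placeholder samples standing in for draws from P(T, delta | x).
sample_times = [1.0, 2.0, 2.0, 3.0, 4.0]
sample_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(sample_times, sample_events)
```

Censored samples never lower the curve directly; they only shrink the risk set for later event times, which is why the estimator relies on the independent-censoring assumption.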

2605.22775 2026-05-22 cs.LG cs.AI cs.HC version update

MambaGaze: Bidirectional Mamba with Explicit Missing Data Modeling for Cognitive Load Assessment from Eye-Gaze Tracking Data

Amir Mousavi, Mohammad Sadegh Sirjani, Erfan Nourbakhsh, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles

Affiliations Department of Computer Science, College of AI, Cyber and Computing, The University of Texas at San Antonio; Department of Educational Psychology, College of Education and Human Development, The University of Texas at San Antonio; Department of Neuroscience, Developmental and Regenerative Biology, College of Sciences, The University of Texas at San Antonio

AI summary This paper proposes MambaGaze, which combines XMD encoding with a bidirectional Mamba-2 framework to handle frequent missing data and long-range temporal dependencies in eye-tracking data. Experiments show that it outperforms the baselines on cognitive load assessment and that edge deployment is feasible.

Comments Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

Abstract

Real-time cognitive load assessment from eye-tracking signals could enable adaptive human-centered AI in safety-critical applications such as driver vigilance monitoring or automated flight deck assistance, yet two challenges persist: handling frequent data missingness from blinks and tracking failures, and efficiently modeling long-range temporal dependencies. We propose MambaGaze, a framework that addresses these challenges through 1) XMD encoding, which augments raw features with observation masks and time-deltas to explicitly model data uncertainty, and 2) bidirectional Mamba-2, which captures temporal dependencies with linear computational complexity. Experiments on CLARE and CL-Drive datasets under leave-one-subject-out evaluation show that MambaGaze achieves 76.8% and 73.1% accuracy, respectively, outperforming CNN, Transformer, ResNet, and VGG baselines by 4-12 percentage points. Edge deployment benchmarks on NVIDIA Jetson platforms demonstrate real-time inference at 43-68 FPS with power consumption below 7.5W, confirming feasibility for wearable cognitive load monitoring.
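An encoding that augments raw features with observation masks and time-deltas can be sketched for a single gaze channel as follows. The forward-fill of missing values and the exact definition of the time-delta (time since the most recent observation) are assumptions in the style of common missing-data encodings, not details confirmed by the abstract; `xmd_encode` is a hypothetical name.

```python
def xmd_encode(timestamps, values, fill=0.0):
    """Build (x, mask, delta) triples for one gaze channel.

    values[i] is None when the sample is missing (e.g. blink or tracking loss).
    x:     last observed value carried forward (`fill` before any observation)
    mask:  1.0 if observed at this step, else 0.0
    delta: time since the most recent earlier observation
           (0.0 until a first observation exists)
    """
    encoded = []
    last_value = fill
    last_seen = None
    for t, v in zip(timestamps, values):
        observed = v is not None
        delta = 0.0 if last_seen is None else t - last_seen
        if observed:
            last_value = v
        encoded.append((last_value, 1.0 if observed else 0.0, delta))
        if observed:
            last_seen = t
    return encoded


timestamps = [0.0, 0.1, 0.2, 0.3, 0.4]
pupil = [0.5, None, None, 0.8, 0.9]  # two samples lost to a blink
features = xmd_encode(timestamps, pupil)
```

The mask tells the sequence model which inputs are real, and the growing delta during a gap signals how stale the carried-forward value has become.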

2605.22773 2026-05-22 cs.AI math.OC version update

Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

Yu Tang, Muhammad Zakwan, Efe Balta, John Lygeros, Alisa Rupenyan

Affiliations Centre for Artificial Intelligence, Zurich University of Applied Sciences; Inspire AG; Automatic Control Laboratory, ETH Zurich

AI summary This paper proposes an event-based deep reinforcement learning method for the flexible job shop scheduling problem with random job arrivals. It trains an agent with the Proximal Policy Optimization algorithm and lightweight multi-layer perceptrons to minimize the total completion time of all jobs, and it outperforms individual dispatching rules on datasets with varying heterogeneity and job arrival rates.

Abstract

The Flexible Job Shop Scheduling Problem (FJSP) concerns the optimal allocation of a set of jobs to machines. Two primary challenges persist in FJSP: the unpredictable arrival of future jobs and the combinatorial complexity of the problem, rendering it intractable for conventional mixed-integer linear programming solvers. This paper proposes an event-based deep reinforcement learning (DRL) approach to solve FJSP with random job arrivals. Specifically, we employ the Proximal Policy Optimization algorithm and use lightweight Multi-Layer Perceptrons to train the DRL agent to minimize the total completion time of all jobs. We design the state representation to be directly accessible from the environment, and limit the learning agent to selecting from among a set of well-established dispatching rules. Simulations show that our DRL approach outperforms any of the individual dispatching rules on datasets with varying heterogeneity and job arrival rates. We benchmark our DRL approach against an arrival-triggered mixed-integer linear programming solution and show that our method achieves good performance, especially when the datasets are heterogeneous.
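Restricting the agent to a set of dispatching rules means its action is simply the index of a rule, which then picks the next operation from the queue. The sketch below assumes three textbook rules (FIFO, shortest and longest processing time); the abstract does not say which rules the authors use, and the `Operation` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    job_id: int
    arrival_time: float
    processing_time: float


def fifo(queue):
    """First in, first out: the earliest arrival goes next."""
    return min(queue, key=lambda op: op.arrival_time)


def spt(queue):
    """Shortest processing time first."""
    return min(queue, key=lambda op: op.processing_time)


def lpt(queue):
    """Longest processing time first."""
    return max(queue, key=lambda op: op.processing_time)


# The discrete action space of the agent: one entry per dispatching rule.
DISPATCHING_RULES = [fifo, spt, lpt]


def dispatch(queue, rule_index):
    """Apply the rule chosen by the agent and remove the selected operation."""
    chosen = DISPATCHING_RULES[rule_index](queue)
    queue.remove(chosen)
    return chosen


queue = [
    Operation(job_id=1, arrival_time=0.0, processing_time=5.0),
    Operation(job_id=2, arrival_time=1.0, processing_time=2.0),
    Operation(job_id=3, arrival_time=2.0, processing_time=9.0),
]
rule_index = 1  # stand-in for the action a trained policy would emit at this event
next_op = dispatch(queue, rule_index)
```

Keeping the action space this small is what lets a lightweight network suffice: the policy never constructs a schedule directly, it only decides which heuristic to trust at each decision event.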

2605.22749 2026-05-22 cs.LG cs.AI version update

Cyber-Physical Anomaly Detection in IoT-Enabled Smart Grids Using Machine Learning and Metaheuristic Feature Optimization

Adis Alihodžić, Eva Tuba, Milan Tuba

Affiliations Department of Mathematical and Computer Sciences, University of Sarajevo; Singidunum University; Trinity University; Sinergija University

AI summary This paper studies how machine learning and metaheuristic feature optimization can detect cyber-physical anomalies in IoT-enabled smart grids. Evaluating several baseline models, it finds that tree-based ensembles perform best on the dataset and that feature selection improves both macro-F1 and ROC-AUC.

Abstract

Modern smart grids rely on dense measurement infrastructures, communication links, and intelligent field devices. Although this improves supervision and control, it also increases vulnerability to cyber-physical disruptions. Operators must distinguish physical incidents, such as faults or line disturbances, from malicious actions, such as false data injection or unauthorized command execution. This chapter investigates this problem using the well-known MSU/ORNL Power System Attack Dataset. The proposed method combines machine learning with genetic-algorithm-based feature selection. The objective is twofold: to classify attack and natural events accurately, and to determine whether a reduced set of physically informative PMU/IED measurements can support reliable detection. Several baseline models are evaluated, including logistic regression, RBF-SVM, XGBoost, Random Forest, and Extra Trees. The results show that tree-based ensemble models are the most effective for the considered dataset, with Extra Trees providing the strongest full-feature baseline. After feature selection, the GA + Extra Trees model reduces the clean PMU feature space from 112 attributes to an average of 27.4 attributes over five runs, while increasing macro-F1 from 0.9118 to 0.9212 and ROC-AUC from 0.9791 to 0.9837. These results indicate that many synchronized electrical measurements are redundant. A compact subset of phasor-based features can still provide accurate and interpretable anomaly detection in smart grids.

2605.22733 2026-05-22 cs.AI cs.SE version update

HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

Edwin Jose

Affiliations Department of Computer Science

AI summary This paper proposes HarnessAPI, a skill-first framework that treats a typed skill folder as the single source of truth, eliminating duplicated code between streaming APIs and MCP tools and reducing framework-facing boilerplate by 74%.

Abstract

Every Python function deployed as an LLM tool must today exist in two forms: an HTTP endpoint for human-facing clients and CI pipelines, and an MCP tool registration for agent runtimes such as Claude and Cursor. These representations share business logic yet diverge in all the surrounding machinery (routing, validation, serialisation, streaming, and schema maintenance), and they drift apart as the underlying code evolves. We present HarnessAPI, a Python framework that eliminates this duplication by treating a typed skill folder as the single source of truth. From one handler.py plus Pydantic schemas, the framework automatically derives a streaming HTTP endpoint with Server-Sent Events, an interactive OpenAPI/Swagger UI, and a zero-configuration MCP tool, all served from a single process. Dual-mode content negotiation lets the same handler serve SSE-streaming and JSON-returning clients with no handler changes. A dynamic code-generation mechanism ensures Pydantic type annotations propagate correctly to FastMCP's inspection layer, resolving a technical limitation that prevents naive closure-based registration. Measured across six representative skills using cloc, HarnessAPI reduces framework-facing boilerplate by 74% compared with a manually maintained dual-stack implementation (FastAPI server + FastMCP server). HarnessAPI subclasses FastAPI, inheriting its full middleware, dependency-injection, and deployment ecosystem. It is available at https://github.com/edwinjosechittilappilly/harnessapi and on PyPI (pip install harnessapi).
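Dual-mode content negotiation can be illustrated without any framework: the same generator-style handler is rendered either as Server-Sent Events frames or as one JSON document, depending on the client's `Accept` header. This is a standard-library sketch of the general idea, not HarnessAPI's actual API; `handler` and `respond` are hypothetical names, and a real server would stream frames rather than build the whole body in memory.

```python
import json


def handler(text):
    """Hypothetical skill handler: yields one chunk per word."""
    for word in text.split():
        yield {"token": word}


def respond(accept_header, chunks):
    """Return (content_type, body) for the same handler output.

    Clients that accept text/event-stream get Server-Sent Events frames
    (each frame is a `data:` line followed by a blank line);
    everyone else gets a single JSON array.
    """
    if "text/event-stream" in accept_header:
        body = "".join(f"data: {json.dumps(c)}\n\n" for c in chunks)
        return "text/event-stream", body
    return "application/json", json.dumps(list(chunks))


sse_type, sse_body = respond("text/event-stream", handler("hello world"))
json_type, json_body = respond("application/json", handler("hello world"))
```

The handler itself never learns which mode was requested, which is the property the abstract describes as serving both kinds of client "with no handler changes".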

2605.22732 2026-05-22 cs.AI cs.CL cs.HC cs.SD eess.AS version update

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

Juergen Dietrich

Affiliations Democracy Intelligence gGmbH

AI summary This paper examines whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent LLM pipeline. Using a Bundestag plenary speech by Felix Banaszak as a case study, it compares emotion2vec_plus_large, Gemini 2.5 Flash, and TRUST-Pathos scores, and finds that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664) while emotion2vec Valence does not (rho = +0.097).

Comments 13 pages, 1 figure

Abstract

We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). We further demonstrate, via a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini in an open-ended annotation paradigm, that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. Our results suggest that LLM-based multimodal analysis captures semantically defined political emotion substantially better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend this approach to video-based analysis incorporating facial expression and gaze.
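The reported statistic, Spearman's rank correlation, is the Pearson correlation computed on ranks. A minimal standard-library sketch follows; the toy scores are invented purely to exercise the code and are not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i : j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented per-segment scores, for illustration only.
model_valence = [1, 2, 3, 4, 5]
pathos_score = [2, 1, 4, 3, 5]
rho = spearman_rho(model_valence, pathos_score)
```

Because only ranks matter, the statistic is insensitive to the different scales on which an acoustic model and an LLM ensemble report their scores.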

2605.22731 2026-05-22 cs.LG cs.AI version update

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Dong Nie

Affiliations Independent Researcher

AI summary This paper studies LLM post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and on-policy distillation (OPD) from a state-distribution perspective, finding that the source and locality of training states can be as important as the form of the supervision signal.

Abstract

Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary factor: the state distribution on which supervision is applied. For an autoregressive policy, a state is a prompt plus generated prefix. SFT trains on fixed dataset states, while RL and on-policy distillation (OPD) train on states induced by the current learner. We formalize post-training as state-distribution shaping and run a controlled small-scale study using Qwen3-0.6B-Base on GSM8K, with TruthfulQA and MMLU as retention evaluations. Our results show three phenomena. First, a mild SFT run improves GSM8K with little forgetting, while a stress SFT run causes substantial retention loss. Second, OPD from a degraded SFT teacher surpasses that teacher on GSM8K, TruthfulQA, and MMLU, despite using the teacher as its only supervision source. Third, a lightweight on-policy RL run improves GSM8K while preserving retention. These results support a state-centric view of post-training: the source and locality of training states can be as important as the form of the supervision signal.
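The distinction between dataset states and learner-induced states can be made concrete by enumerating them. In this sketch a state is the pair (prompt, generated prefix), as defined in the abstract; the toy dataset and the `toy_learner` sampler are invented stand-ins for a real corpus and model.

```python
def prefix_states(prompt, completion_tokens):
    """States visited while producing a completion: the prompt plus each
    prefix at which the next token is predicted."""
    return [
        (prompt, tuple(completion_tokens[:i]))
        for i in range(len(completion_tokens))
    ]


def sft_states(dataset):
    """Off-policy: states come from fixed reference completions."""
    return [s for prompt, ref in dataset for s in prefix_states(prompt, ref)]


def on_policy_states(prompts, sample_fn):
    """On-policy (RL, OPD): states come from the current learner's own rollouts."""
    return [s for p in prompts for s in prefix_states(p, sample_fn(p))]


dataset = [("2+2=", ["4", "<eos>"])]


def toy_learner(prompt):
    """Invented stand-in for sampling from the current policy."""
    return ["5", "-", "1", "<eos>"]


fixed = sft_states(dataset)
induced = on_policy_states(["2+2="], toy_learner)
shared = set(fixed) & set(induced)
```

Only the empty-prefix state is shared here: every later state under SFT lies on the reference path, while on-policy training supervises the prefixes the learner actually reaches.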

2605.22723 2026-05-22 cs.LG cs.AI cs.IT math.IT version update

The Value of Covariance Matching in Gaussian DDPMs and the Lanczos Sampler

Md Sahil Akhtar, Aymane El Gadarri, Vivek F. Farias, Adam D. Jozefiak

Affiliations Electrical Engineering and Computer Science; Massachusetts Institute of Technology; Operations Research Center; Sloan School of Management

AI summary This paper studies the value of covariance matching for the path-space KL divergence in Gaussian DDPMs and proposes the Lanczos Gaussian sampler, a matrix-free method for sampling from the optimal reverse covariance, which improves sample quality.

Abstract

A central error measure in Gaussian DDPMs is the path-space KL divergence between the exact reverse chain and the learned Gaussian reverse process. This quantity is especially relevant for procedures such as classifier guidance, which perturb the entire reverse trajectory rather than only the terminal sample. Prior analyses show that standard isotropic reverse covariances suffer an unavoidable $\Omega(1/T)$ path-KL error as the number of denoising steps $T$ grows. We show that matching the full posterior covariance breaks this barrier, yielding an order-wise improvement that reduces the path KL to $O(1/T^2)$. To make full covariance matching practical, we introduce the Lanczos Gaussian sampler (LGS), a training-free, matrix-free method for sampling from the optimal reverse covariance using only covariance-vector products, which are available through Jacobian-vector products of the posterior mean. LGS avoids dense covariance storage and auxiliary covariance models. We prove that LGS approximation error decays exponentially in the number of Lanczos steps, where each Lanczos step requires a single Jacobian-vector product. Empirically, using only three such steps improves sample quality over strong diagonal-covariance baselines, including OCM-DDPM, across standard image benchmarks. This identifies full covariance matching as both theoretically valuable and practically accessible for fast DDPM sampling.
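The matrix-free idea can be written out with the textbook Lanczos approximation of a matrix function applied to a vector. The notation below ($\Sigma$ for the reverse covariance, $\mu$ for the reverse mean, $m$ Lanczos steps, $V_m$, $T_m$) is assumed for illustration and may differ from the paper's; it shows the general construction rather than the authors' exact algorithm.

```latex
% Draw z ~ N(0, I) and run m Lanczos steps on \Sigma, started at v_1 = z / \|z\|_2.
% Each step needs one product \Sigma v, i.e. one covariance-vector product.
\Sigma V_m = V_m T_m + \beta_m v_{m+1} e_m^\top,
\qquad V_m = [v_1, \dots, v_m], \quad V_m^\top V_m = I_m .

% T_m is a small m x m symmetric tridiagonal matrix, so its square root is cheap.
\Sigma^{1/2} z \;\approx\; \|z\|_2 \, V_m \, T_m^{1/2} e_1 .

% If the approximation were exact, the sample below would have covariance \Sigma.
x = \mu + \Sigma^{1/2} z \;\sim\; \mathcal{N}(\mu, \Sigma) .
```

Only products of $\Sigma$ with vectors appear, so the dense covariance matrix is never formed; with $m = 3$ the sampler works with a $3 \times 3$ tridiagonal matrix per draw.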

2605.22720 2026-05-22 cs.AI cs.HC version update

Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

Andrii Kryshtal

Affiliations Independent Researcher

AI summary This paper studies alignment failures that AI models can exhibit in conflict contexts. Testing nine model configurations, it finds false equivalence, denial of genocide, and failure to recognise ethnic slurs in conflict-related scenarios, and it releases the first evaluation framework for this domain.

Comments Preprint. 8 pages, 2 figures. Code and evaluation framework: https://github.com/akryshtal/conflict-sensitivity-eval-bloom

Abstract

AI models are already deployed in societies affected by armed conflict, and journalists, humanitarian workers, governments and ordinary citizens rely on them for information or for their work processes. No established practice exists for checking whether their outputs can make those conflicts worse. We tested nine model configurations from four providers (OpenAI, Anthropic, DeepSeek, xAI) on 90 multi-turn scenarios designed to surface misaligned behaviour in conflict contexts: false equivalence between documented atrocities, denial of genocide, and failure to recognise ethnic slurs, among others. When such outputs feed into journalism, humanitarian reporting, or public debate, they can deepen divisions in fragile societies. Failure rates span 6% to 47% between the best and worst performing models, which makes model choice a safety question in its own right. When users pushed for "balance" in cases where international courts have already assigned responsibility, five of nine configurations failed 80 to 100 percent of the time. We release the first evaluation framework for this domain and propose adding it to alignment evaluation portfolios.

2605.22717 2026-05-22 cs.SD cs.AI cs.LG cs.MM version update

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Zachary Novack, Stephen Brade, Haven Kim, Hugo Flores García, Nithya Shikarpur, Chinmay Talegaonkar, Suwan Kim, Valerie K. Chen, Julian McAuley, Taylor Berg-Kirkpatrick, Cheng-Zhi Anna Huang

Affiliations UC San Diego; MIT; Adobe

AI summary This paper investigates whether audio diffusion models can be efficiently repurposed into interactive models that run on consumer hardware. The proposed Live Music Diffusion Models (LMDMs) use block-wise KV caching to recover and then surpass the inference complexity of discrete Live Music Models (LMMs), and the ARC-Forcing paradigm enables stable post-training alignment that reduces error accumulation without explicit RL or reward models.

Abstract

Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a "generative delay" to transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.

2605.22716 2026-05-22 cs.AI cs.LO 版本更新

Parametric Modular Answer Set Programs Made Declarative

参数化模态答案集程序的声明性

Jorge Fandinno, Yuliya Lierler, Torsten Schaub

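Affiliations * University of Nebraska Omaha, USA; University of Potsdam, Germany

AI summary This paper explores modularity in first-order answer set programming. It introduces a new formalism, parametric modular logic programs, which allows defining subprograms with parameters and intensionality statements, shows how the formalism captures the semantics of clingo programs with collective control, and connects it to traditional non-modular answer set programming.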

Comments To appear in Theory and Practice of Logic Programming

详情
AI中文摘要

在本文中,我们探讨了在第一阶答案集编程(ASP)中模块化的概念。我们引入了一种新的形式化方法,称为参数化模态逻辑程序,它允许定义带有参数和intensionality语句的子程序。我们展示了这种形式化方法如何捕捉具有集体控制的clingo程序的语义,这一特性使得能够对子程序进行结构化和实例化。我们为模块化ASP提供了理论基础,展示了其有用性,并将其与传统非模块化ASP连接起来。

英文摘要

In this paper, we explore the concept of modularity in first-order answer set programming (ASP). We introduce a new formalism called parametric modular logic programs, which allows defining subprograms with parameters and intensionality statements. We demonstrate how this formalism can capture the semantics of clingo-programs with collective control, a feature that enables structuring and instantiating subprograms. We provide theoretical foundations for modular ASP, illustrate its usefulness, and connect to traditional non-modular ASP.

2605.22711 2026-05-22 cs.LG cs.AI 版本更新

Abstraction for Offline Goal-Conditioned Reinforcement Learning

离线目标条件强化学习中的抽象

Clarisse Wibault, Alexander Goldie, Antonio Villares, Maike Osborne, Jakob Foerster

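Affiliations * FLAIR, MLRG University of Oxford

AI summary This paper proposes a way to exploit abstraction in offline goal-conditioned reinforcement learning. Relativised options and distinct representations for different levels of the hierarchy let an agent reuse experience across similar contexts of the state space, which improves performance.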

详情
AI中文摘要

马尔可夫决策过程(MDPs)在现实中的目标条件强化学习(GCRL)中往往由于对称性和状态-目标对之间的共享结构而表现出显著的冗余性。虽然分层策略已被提出以通过时间抽象减少时间跨度来改进离线GCRL,但本文证明层次结构也能够实现绝对抽象。通过引入相对化选项以及为不同层次的层次结构引入不同的表示,我们展示了智能体如何在相似的状态空间上下文中重用经验。基于这一框架,我们介绍了两种简单的算法用于学习相对化选项和从绝对参考框架中抽象。我们的实验表明,这种归纳偏置在离线GCRL中显著提高了性能。

英文摘要

Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal abstraction in offline GCRL, we demonstrate that hierarchy also enables absolute abstraction. By introducing relativised options as well as distinct representations for different levels of the hierarchy, we demonstrate how an agent can reuse experience across similar contexts of the state-space. Based on this framework, we introduce two simple algorithms for learning relativised options and abstracting from the absolute frame of reference. Our experiments show that such inductive biases significantly improve performance in offline GCRL.

2605.22707 2026-05-22 cs.AI cs.HC 版本更新

Beyond the Org Chart: AI and the Transformation of Invisible Work

超越组织图:人工智能与无形工作的变革

Stephanie Rosenthal, Shamsi Iqbal

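Affiliations * Microsoft Corporation

AI summary This paper studies how AI is changing work processes, in particular invisible cultural practices such as professional mentoring. It proposes steps for making invisible work visible, along with ways individuals and leaders can support colleagues while preserving a healthy company culture.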

Comments 10 pages

详情
AI中文摘要

越来越多的新闻和研究文章报告称,人工智能的采用使专业人士能够模糊和扩展其在企业中的角色边界。为了了解在人工智能导向的公司中工作流程可能发生的变化,我们采访了大型科技公司中24名以产品为中心的个体,探讨人工智能如何影响他们的工作、他们在产品团队中的工作以及他们的专业互动。我们的谈话表明,人工智能不仅改变了正式的角色责任和角色之间的协作,还改变了诸如专业指导等无形文化实践,这些实践对于帮助专业人士适应其职位、保持对工作的投入以及发展职业生涯至关重要。一些变化是积极的,例如同行之间的协作更加顺畅,但其他变化更加微妙,可能使典型的职业发展机会,如从专业网络中获得反馈、促进领导力和指导,面临风险。我们提出人工智能公司可以采取的步骤,以使无形工作更加可见。此外,我们还提出个人和领导者可以采取的措施,以在人工智能转型过程中支持同事,同时保持支持多样化思维、协作和非正式互动的健康公司文化。

英文摘要

An increasing number of news and research articles report that AI adoption is allowing professionals to blur and extend the boundaries of their corporate roles. With the goal of understanding how work processes might be changing in an AI-forward company, we interviewed 24 product-focused individuals at a large technology firm about how AI has impacted their own work, their work within their product team, and their professional interactions. Our conversations suggest that AI is not only changing formal role responsibilities and collaborations between those roles, but also changing informal cultural practices like professional mentoring that are key to helping professionals settle in their positions, stay engaged with their work, and grow their careers. Some of these changes are positive, such as smoother collaboration between peers, but other changes are more nuanced and put the typical career growth opportunities, like receiving feedback from professional networks and promoting leadership and mentorship, at risk. We propose steps that AI companies can take to make the invisible work more visible. Additionally, we propose efforts that individuals and leaders can take to support their colleagues through AI transformation while preserving healthy company cultures that support diverse thinking, collaboration, and informal interactions.

2605.22693 2026-05-22 cs.RO cs.AI 版本更新

Scout-Assisted Planning for Heterogeneous Robot Teams under Partially Known Environments


Hoang-Dung Bui, Abhish Khanal, Raihan Islam Arnob, Gregory J. Stein

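Affiliations * George Mason University

AI summary This paper proposes Scout-Assisted Planning, a framework in which unmanned aerial vehicles proactively gather environmental information to improve ground vehicle navigation. Information-gain-guided action pruning reduces backtracking cost, and experiments across different environments show substantially lower ground robot travel cost.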

详情
AI中文摘要

自主机器人团队在部分已知环境中导航时,当地面机器人遇到被阻塞的道路时,需要昂贵的回溯操作。我们通过Scout-Assisted Planning,一种异构规划框架,其中无人机主动收集环境信息以改进地面车辆的导航。为了将侦察聚焦于最关键的边,我们提出了基于信息增益的行动剪枝,通过评估候选侦察行动对地面机器人行为的预期影响来评分。由于精确的信息增益基于行动剪枝计算成本过高,我们开发了一个基于图神经网络的模型,该模型可以直接从图结构和信念状态预测信息增益值,将规划时间减少到实时水平而不牺牲解决方案质量。在三种环境类型上的实验表明,SAP结合信息增益行动剪枝将地面机器人旅行成本降低了31.9-37.7%相对于加拿大旅行者问题基线,并且比基于接近度的侦察指导多出8-14%,证实了基于原则的信息增益引导的侦察在实际部署中既更有效且计算上可行。

英文摘要

Autonomous robot teams navigating partially known environments face costly backtracking when ground robots encounter blocked roads that are only revealed upon physical traversal. We address this with Scout-Assisted Planning (SAP), a heterogeneous planning framework in which scouting Unmanned Aerial Vehicles proactively gather environmental information to improve Unmanned Ground Vehicle navigation. To focus scouting on the most consequential edges, we propose Information Gain-based Action Pruning, which scores candidate scouting actions by their expected impact on ground robot behavior. Since computing Information Gain-based Action Pruning exactly is prohibitively expensive, we develop a Graph Neural Network-based model that predicts information gain values directly from graph structure and belief state, reducing planning time to real-time levels without sacrificing solution quality. Experiments across three environment types show that SAP with Information Gain-based Action Pruning reduces ground robot travel cost by 31.9--37.7% over the Canadian Traveler Problem baseline, and outperforms proximity-based scouting guidance by an additional 8--14%, confirming that principled information-gain-guided scouting is both more effective and computationally feasible for real-world deployment.
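To make the idea of scoring scouting actions by their expected impact concrete, here is a minimal sketch on a toy two-route problem. It is not the paper's formulation: the cost model, the function names, and the parameters (`p_blocked`, `dist_to_edge`, and so on) are all hypothetical, and the "gain" here is simply the expected travel cost saved when an edge's status is known before the ground robot commits.

```python
def expected_cost_without_scout(p_blocked, short_cost, dist_to_edge, long_cost):
    """Expected ground-robot cost when the edge status is unknown (toy model)."""
    # Trying the short route: it works with probability (1 - p_blocked);
    # otherwise the robot drives to the blocked edge, backtracks to the start,
    # and takes the long route instead.
    try_short = (1 - p_blocked) * short_cost + p_blocked * (
        2 * dist_to_edge + long_cost
    )
    # The robot may also skip the risky route entirely.
    return min(try_short, long_cost)


def expected_cost_with_scout(p_blocked, short_cost, long_cost):
    """Expected cost when a scout reveals the edge status before committing."""
    return (1 - p_blocked) * min(short_cost, long_cost) + p_blocked * long_cost


def scouting_gain(p_blocked, short_cost, dist_to_edge, long_cost):
    """Expected travel cost saved by scouting this edge (hypothetical score)."""
    return expected_cost_without_scout(
        p_blocked, short_cost, dist_to_edge, long_cost
    ) - expected_cost_with_scout(p_blocked, short_cost, long_cost)


def rank_edges(candidates):
    """Order candidate edges by gain, highest first.

    `candidates` maps an edge name to its toy parameters
    (p_blocked, short_cost, dist_to_edge, long_cost).
    """
    return sorted(
        candidates, key=lambda name: scouting_gain(*candidates[name]), reverse=True
    )
```

In this toy model, an edge that is rarely blocked earns a small gain, so a pruning step would drop it first; the paper replaces exact scoring of this kind with a learned predictor.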

2605.22681 2026-05-22 cs.AI 版本更新

Forecasting Scientific Progress with Artificial Intelligence

用人工智能预测科学进步

Sean Wu, Pan Lu, Yupeng Chen, Jonathan Bragg, Yutaro Yamada, Peter Clark, David Clifton, Philip Torr, James Zou, Junchi Yu

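Affiliations * University of Oxford; Stanford University; Allen Institute for AI; Sakana AI

AI summary This paper studies how well AI can forecast scientific progress. It proposes a temporally grounded evaluation framework and introduces the CUSP benchmark, which evaluates scientific forecasting through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Current frontier models show systematic, domain-dependent limitations, and performance is largely insensitive to whether events fall before or after the training cutoff, indicating that AI systems still fall short as forecasting tools.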

Comments 73 pages, 13 figures, 29 tables

详情
AI中文摘要

人工智能(AI)日益融入科学发现,但其能否预测科学进步仍不明确。为研究此问题,我们引入了一个基于时间的评估框架,用于在受控知识约束下预测科学进步。我们提出了CUSP(截止条件下的未见科学进步),一个多学科和事件级别的基准,通过可行性评估、机制推理、生成性解决方案设计和时间预测来评估AI系统在科学预测中的表现。在4760个科学事件中,我们观察到当前前沿模型在不同领域存在系统性和领域依赖性的限制。虽然模型可以识别出竞争候选研究方向的可能性,但它们无法可靠地预测科学进步是否会被实现,并系统性地低估了其发生时间。性能在不同领域中高度异质,AI的进步时间比生物学、化学和物理学的进步更可预测。性能在事件发生时间在训练截止前或后时基本不受影响,表明这些限制不能仅由训练数据中的知识暴露来解释。在受控信息访问下,额外的预截止知识会提高性能,但无法缩小与全信息设置之间的差距,这种差距在高引用进步中更加明显。模型还表现出系统性的过度自信和强烈的响应偏差,表明不确定性估计不可靠。综合来看,当前AI系统在预测科学进步方面仍显不足。获取先前知识并未转化为可靠的预测,性能更受益于事后信息而非前瞻性预测。

英文摘要

Artificial intelligence (AI) is increasingly embedded in scientific discovery, yet whether it can anticipate scientific progress remains unclear. To study this question, we introduce a temporally grounded evaluation framework for forecasting scientific progress under controlled knowledge constraints. We present CUSP (Cutoff-conditioned Unseen Scientific Progress), a multi-disciplinary and event-level benchmark that evaluates scientific forecasting in AI systems through feasibility assessment, mechanistic reasoning, generative solution design, and temporal prediction. Across 4,760 scientific events, we observe systematic and domain-dependent limitations in current frontier models. While models can identify plausible research directions from competing candidates, they fail to reliably predict whether scientific advances will be realized and systematically misestimate when they will occur. Performance is highly heterogeneous across domains, with the timing of AI progress more predictable than advances in biology, chemistry, and physics. Performance is largely insensitive to whether events occur before or after the training cutoff, suggesting these limitations cannot be explained solely by knowledge exposure in training data. Under controlled information access, additional pre-cutoff knowledge improves performance but does not close the gap to full-information settings, which becomes more pronounced for high-citation advances. Models also exhibit systematic overconfidence and strong response biases, indicating unreliable uncertainty estimation. Taken together, current AI systems fall short as predictive tools for scientific progress. Access to prior knowledge does not translate into reliable forecasting, and performance benefits more from post-event information than from forward-looking prediction.
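The overconfidence finding can be illustrated with a standard calibration measure. The sketch below computes expected calibration error (ECE) with equal-width confidence bins; the abstract does not say which calibration metric CUSP uses, so ECE is an assumption chosen for illustration, and the function and argument names are hypothetical.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """ECE: bin-size-weighted gap between mean confidence and accuracy.

    confidences: predicted probabilities in [0, 1]
    outcomes:    1 if the forecast was correct, else 0
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, outcomes):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece
```

A model that reports 0.9 confidence on forecasts that come true only half the time would score an ECE of about 0.4 under this measure, while a perfectly calibrated model scores 0.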

2605.22678 2026-05-22 cs.CV cs.AI 版本更新

Swift Sampling: Selecting Temporal Surprises via Taylor Series

Swift Sampling: 通过泰勒级数选择时间惊喜

Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram

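Affiliations * Boston University; Microsoft Research India

AI summary This work proposes Swift Sampling, a training-free frame selection algorithm that models a video as a differentiable trajectory in the visual latent space and uses Taylor expansion to predict the path of subsequent frames. Frames that deviate from the prediction are flagged as temporal surprises, which improves performance on long-video question answering.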

详情
AI中文摘要

尽管长视频中的大多数帧都是冗余的,但关键信息存在于时间惊喜中:即实际视觉特征偏离其预测演变的时刻。受人脑预测编码的启发,我们引入了Swift Sampling,一种优雅且无需训练的帧选择算法,能够自动识别视频中的高信息量时刻。具体而言,我们将视频建模为视觉潜在空间中的可微轨迹,并计算其特征的速度和加速度。然后,我们应用泰勒展开来投影后续帧的预期路径。与预测路径显著偏离的帧被识别为时间惊喜帧并被选中采样。与依赖辅助网络或视频特定超参数调整的先前无训练方法不同,Swift Sampling 非常轻量,仅比基线增加 0.02x 的计算成本,使其比领先基线便宜 30 倍。在三个长视频问答基准和 10 个不同的下游任务上,Swift Sampling 超过了均匀采样和先前查询无关的基线。它在帧预算有限的长视频中表现尤为强大,准确率可提高高达 12.5 个百分点。

英文摘要

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. Then, we apply Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising frames and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is incredibly lightweight, adding only 0.02x computational cost over the baseline, which makes its overhead 30x cheaper than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially powerful for long videos with limited frame budgets, improving accuracy by up to +12.5 points.
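The velocity, acceleration, and Taylor-projection steps can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it uses backward finite differences with a unit time step, a second-order expansion, Euclidean distance as the surprise score, and top-k selection; `taylor_surprise` and `select_frames` are hypothetical names.

```python
import math


def taylor_surprise(features):
    """Score each frame by its distance from a second-order Taylor forecast.

    features: list of per-frame feature vectors (lists of floats).
    The first three frames get a score of 0.0 because there is not enough
    history to estimate velocity and acceleration.
    """
    scores = [0.0] * len(features)
    for t in range(3, len(features)):
        f1, f2, f3 = features[t - 1], features[t - 2], features[t - 3]
        # velocity = f1 - f2, acceleration = f1 - 2*f2 + f3 (finite differences)
        predicted = [
            a + (a - b) + 0.5 * (a - 2 * b + c) for a, b, c in zip(f1, f2, f3)
        ]
        scores[t] = math.sqrt(
            sum((x - p) ** 2 for x, p in zip(features[t], predicted))
        )
    return scores


def select_frames(features, budget):
    """Return the indices of the `budget` most surprising frames, in order."""
    scores = taylor_surprise(features)
    ranked = sorted(range(len(features)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])
```

On a trajectory that moves at constant velocity and then jumps, the frames at and just after the jump receive the highest scores, while frames on the smooth segments score zero.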

2605.22662 2026-05-22 cs.AI 版本更新

Claw AI Lab: An Autonomous Multi-Agent Research Team

Claw AI Lab:一个自主多智能体研究团队

Fan Wu, Cheng Chen, Zhenshan Tan, Taiyu Zhang, Xinzhen Xu, Yanyu Qian, Dingcheng Gao, Lanyun Zhu, Qi Zhu, Yi Tan, Deyi Ji, Guosheng Lin, Tianrun Chen, Deheng Ye, Fayao Liu

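Affiliations * NTU (Nanyang Technological University); A*STAR (Agency for Science, Technology and Research); Moxin Technology Co., LTD; NUIST (Nanjing University of Information Science and Technology); THU (Tsinghua University); USTC (University of Science and Technology of China)

AI summary This paper presents Claw AI Lab, an autonomous research platform that moves automated research from a hidden prompt-to-paper pipeline to an interactive AI laboratory. Users can instantiate a full research team from a single prompt, with customizable roles, collaborative workflows, real-time monitoring, artifact inspection, and rollback/resume control. The Claw-Code Harness connects local codebases, datasets, and checkpoints to runnable experiments, improving experiment execution, completion, and result integrity. In an internal evaluation, AI expert judges consistently preferred Claw AI Lab on idea novelty, experiment completeness, and paper quality.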

Comments Project page and code are available at https://github.com/Claw-AI-Lab/Claw-AI-Lab

详情
AI中文摘要

我们介绍了Claw AI Lab,一个实验室原生的自主研究平台,将自动化研究从隐藏的提示到论文流程推进到交互式AI实验室。与围绕单一智能体或固定顺序工作流中心化系统不同,我们允许用户通过一个提示实例化完整的研究团队,支持自定义角色、协作流程、实时监控、artifact检查以及回滚/恢复控制,通过统一仪表板。该平台还支持探索、多智能体讨论和再现三种不同的研究模式,使自主研究在实践中变得更加可控和实验室化。Claw AI Lab的关键实际贡献在于其Claw-Code Harness,它将本地代码库、数据集和检查点连接到可运行的实验,并将执行artifact反馈到研究循环中。结果,Harness不仅提高了执行集成,还提高了实验完成和结果完整性:实验更容易检查、迭代和忠实转移到最终论文,减少了部分运行和格式错误报告等常见故障模式。在我们内部评估的五个AI研究案例研究中,使用AutoResearchClaw作为基线,Claw AI Lab在想法新颖性、实验完整性和论文质量上被AI专家评委一致偏好。我们视Claw AI Lab为一种新范式的第一步:自主研究作为可使用、交互式和可靠性感知的科学基础设施。

英文摘要

We present Claw AI Lab, a lab-native autonomous research platform that advances automated research from a hidden prompt-to-paper pipeline into an interactive AI laboratory. Rather than centering the system around a single agent or a fixed serial workflow, we allow users to instantiate a full research team from one prompt, with customizable roles, collaborative workflows, real-time monitoring, artifact inspection, and rollback/resume control through a unified dashboard. The platform also supports distinct research modes for exploration, multi-agent discussion, and reproduction, making autonomous research substantially more steerable and laboratory-like in practice. A key practical contribution of Claw AI Lab lies in its Claw-Code Harness, which connects local codebases, datasets, and checkpoints to runnable experiments and feeds execution artifacts back into the research loop. As a result, the harness improves not only execution integration, but also experimental completion and result integrity: experiments are easier to inspect, iterate on, and faithfully transfer into final papers, reducing common failure modes such as partial runs and malformed result reporting. In our internal evaluation on five AI research case studies, using AutoResearchClaw as the baseline, Claw AI Lab is consistently preferred by AI expert judges on idea novelty, experiment completeness, and paper presentation quality. We view Claw AI Lab as an early step toward a new paradigm: autonomous research as usable, interactive, and reliability-aware scientific infrastructure.

2605.22660 2026-05-22 cs.CL cs.AI 版本更新

Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora

道德语义在机器翻译中得以保留:来自道德基础语料库的跨语言证据

Maciej Skorski

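Affiliations * University of Luxembourg

AI summary This study investigates whether LLM-based translation can compensate for the scarcity of language-specific annotated corpora in moral values classification. Using Polish as a test case, it shows that direct translation preserves subtle moral cues well, offering a practical path to moral values research in under-resourced languages.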

详情
AI中文摘要

道德语言具有微妙性和文化差异性,使得跨语言忠实翻译极具挑战性。习语、俚语和文化参考会引入难以避免的翻译痕迹。然而,自动道德价值观分类依赖于几乎只存在于英语中的语言特定标注语料库。我们研究了基于LLM的翻译是否能弥合这一差距,以波兰语为测试案例。使用约5万条涵盖广泛主题的道德标注社交媒体帖子,我们应用了一个系统化的四方法验证流程:LaBSE跨语言嵌入相似性、中心核对齐(CKA)、LLM作为评判者评估以及深度学习分类器公平性测试。我们证明,尽管在处理俚语、粗俗语言和文化负载表达方面存在不足,直接翻译能够很好地保留微妙的道德线索,这些线索足以被跨语言机器学习系统捕获——在所有基础方面,平均余弦相似度为0.86,AUC差距在0.01-0.02之间,经过语言模型微调后进一步缩小。这些结果表明,机器翻译是实现当前资源匮乏语言中道德研究的实用且成本效益高的途径。我们以波兰语作为代表性的斯拉夫语言展示了这一点,并预期可推广到相关语言。

英文摘要

Moral language is subtle and culturally variable, making it difficult to translate faithfully across languages. Idiomatic expressions, slang, and cultural references introduce hard-to-avoid translation artifacts. Yet automated moral values classification depends on language-specific annotated corpora that exist almost exclusively in English. We investigate whether LLM-based translation can bridge this gap, taking Polish as a test case. Using $\sim$50k morally-annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation, and deep learning classifier parity tests. We show that despite shortcomings in handling slang, vulgarity, and culturally-loaded expressions, direct translation preserves subtle moral cues well enough to be harvested by cross-lingual machine learning -- with mean cosine similarity of 0.86 and AUC gaps of 0.01--0.02 across all foundations closing further under fine-tuning of language models. These results demonstrate that machine translation is a practical and cost-effective path to moral values research in languages currently under-resourced in this domain. We demonstrate this for Polish as a representative Slavic language, with expected generalisation to related languages.
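Two of the four validation methods, embedding cosine similarity and Centered Kernel Alignment, reduce to short formulas. The sketch below implements mean pairwise cosine similarity and the linear variant of CKA in plain Python. Which CKA kernel the paper uses is not stated in the abstract, so the linear form is an assumption, and the inputs stand in for embeddings (for example from LaBSE) of original and translated posts.

```python
import math


def mean_cosine(x, y):
    """Mean cosine similarity between paired rows of x and y."""
    total = 0.0
    for u, v in zip(x, y):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        total += dot / norm
    return total / len(x)


def _center(m):
    """Subtract the column mean from every column of a row-major matrix."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def _cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a and b."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            total += sum(p * q for p, q in zip(col_a, col_b)) ** 2
    return total


def linear_cka(x, y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    xc, yc = _center(x), _center(y)
    return _cross_frobenius_sq(xc, yc) / math.sqrt(
        _cross_frobenius_sq(xc, xc) * _cross_frobenius_sq(yc, yc)
    )
```

Linear CKA is invariant to rotation and uniform scaling of either representation, which is why it suits comparing embedding spaces whose axes are not aligned; cosine similarity, by contrast, only makes sense when both texts are embedded in the same space.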

2605.22645 2026-05-22 cs.AI 版本更新

AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

AtelierEval: 人类与LLM作为文本到图像提示器的代理评估

Hanjun Luo, Zhimu Huang, Sylvia Chung, Yiran Wang, Yingbin Jin, Jialin Li, Jiang Li, Xinfeng Li, Hanan Salam

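Affiliations * New York University Abu Dhabi; Nanyang Technological University; Zhejiang University; University of Electronic Science and Technology of China; The Hong Kong Polytechnic University

AI summary This paper proposes AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks, together with a skill-based, memory-augmented agentic evaluator that correlates strongly with human experts. The experiments find that mimicry outperforms planning, which supports an image-augmented direction for future prompters.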

Comments Accepted by ICML 2026

详情
AI中文摘要

文本到图像(T2I)系统日益依赖上游提示器,无论是人类还是多模态大语言模型(MLLMs),将用户意图转化为详细提示。然而,当前基准固定提示并仅评估T2I模型,忽略了上游组件的提示能力。我们引入AtelierEval,首个统一基准,通过360个专家设计的任务量化提示能力。基于认知观点,它涵盖三个任务类别,并使用现实挑战的分类学来实例化任务,为人类和MLLMs提供双接口。为了实现可扩展和可靠的评估,我们提出了AtelierJudge,一个技能基于、记忆增强的代理评估器。它为提示-图像对生成主观和客观评分,与人类专家的Spearman相关性达到0.79,接近人类表现。广泛实验在4个T2I后端上基准8个MLLMs和48个人类用户,验证AtelierEval作为稳健诊断工具的有效性,并揭示模仿优于规划,倡导未来提示器的图像增强方向。我们的工作已发布以支持未来研究。

英文摘要

Text-to-image (T2I) systems increasingly rely on upstream prompters, either humans or multimodal large language models (MLLMs), to translate user intent into detailed prompts. Yet current benchmarks fix the prompt and only evaluate T2I models, leaving the prompting proficiency of this upstream component entirely unmeasured. We introduce AtelierEval, the first unified benchmark that quantifies prompting proficiency across 360 expert-crafted tasks. Grounded in a cognitive view, it spans three task categories and instantiates tasks using a taxonomy of real-world challenges, with a dual interface for both humans and MLLMs. To enable scalable and reliable evaluation, we propose AtelierJudge, a skill-based, memory-augmented agentic evaluator. It produces subjective and objective scores for prompt-image pairs, achieving a Spearman correlation of 0.79 with human experts, approaching human performance. Extensive experiments benchmark 8 MLLMs against 48 human users across 4 T2I backends, validate AtelierEval as a robust diagnostic tool, and reveal the superiority of mimicry over planning, advocating for an image-augmented direction for future prompters. Our work is released to support future research.
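The agreement figure quoted above is a Spearman rank correlation. As a reminder of what that number measures, here is a small self-contained sketch that ranks two score lists (averaging ranks over ties) and takes the Pearson correlation of the ranks. The function names are illustrative, and the inputs stand in for judge scores and human expert scores over the same prompt-image pairs.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)
```

Because only ranks matter, the measure rewards an evaluator for ordering outputs the way human experts do, even if its absolute scores sit on a different scale.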

2605.22642 2026-05-22 cs.AI 版本更新

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Spreadsheet-RL: 通过强化学习推进大型语言模型代理在现实中的电子表格任务中的进步

Banghao Chi, Yining Xie, Mingyuan Wu, Jingcheng Yang, Jize Jiang, Zhaoheng Li, Shengyi Qian, Minjia Zhang, Klara Nahrstedt, Rui Hou, Xiangjun Fan, Hanchao Yu

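Affiliations * University of Illinois Urbana-Champaign; Meta

AI summary This paper proposes Spreadsheet-RL, a reinforcement learning fine-tuning framework for training specialized spreadsheet agents in a realistic Microsoft Excel environment. An automated pipeline collects paired start-goal spreadsheets from online forums, domain-specific evaluation tasks in areas such as finance and supply chain management are compiled into the new Domain-Spreadsheet benchmark dataset, and the method yields substantial gains on both general and domain-specific spreadsheet tasks.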

Comments Mingyuan served as the project lead. Banghao, Yining, and Mingyuan contributed equally to this work, with more junior authors listed before senior authors. All data and code releases are maintained by the corresponding authors at UIUC and are not affiliated with Meta

详情
AI中文摘要

电子表格系统(例如Microsoft Excel,Google Sheets)在现代数据导向的工作流程中起着核心作用。随着AI代理越来越能够自动化复杂任务,如控制计算机和生成演示文稿,构建一个AI驱动的电子表格代理已成为一个有前途的研究方向。大多数现有的电子表格代理依赖于在通用目的LLM上进行专门的提示;虽然这种设计在简单的电子表格操作上有潜力,但难以管理现实世界中典型的复杂、多步骤的工作流程。我们介绍了Spreadsheet-RL,一种强化学习(RL)微调框架,旨在在现实Microsoft Excel环境中训练专门的电子表格代理。Spreadsheet-RL具有自动化管道,用于可扩展地收集配对的起始-目标电子表格,以及在金融和供应链管理等领域的领域特定评估任务,这些任务被编译成新的Domain-Spreadsheet基准数据集。它还包括一个Spreadsheet Gym环境,用于多轮RL:Spreadsheet Gym通过Python沙箱暴露广泛的Excel功能,并附带一个经过改进的Harness,其中包含全面的工具集和精心设计的工具路由规则用于电子表格任务。通过全面的实验,我们展示了Spreadsheet-RL在通用和领域特定的电子表格任务上显著提高了AI代理的性能:它将Qwen3-4B-Thinking-2507在SpreadsheetBench上的Pass@1从12.0%提高到23.4%,并在我们精心编写的Domain-Spreadsheet数据集上将Pass@1从8.4%提高到17.2%。这些结果突显了Spreadsheet-RL在电子表格自动化中的强大泛化能力和实际应用潜力,以及更广泛地,其在日常工作中LLM与数据接口交互方面的前景。

英文摘要

Spreadsheet systems (e.g., Microsoft Excel, Google Sheets) play a central role in modern data-centric workflows. As AI agents grow increasingly capable of automating complex tasks, such as controlling computers and generating presentations, building an AI-driven spreadsheet agent has emerged as a promising research direction. Most existing spreadsheet agents rely on specialized prompting over general-purpose LLMs; while this design shows potential for simple spreadsheet operations, it struggles to manage the complex, multi-step workflows typical of real-world applications. We introduce Spreadsheet-RL, a reinforcement learning (RL) fine-tuning framework designed to train specialized spreadsheet agents within a realistic Microsoft Excel environment. Spreadsheet-RL features an automated pipeline for scalable collection of paired start-goal spreadsheets from online forums, as well as domain-specific evaluation tasks in areas such as finance and supply chain management, which we compile into the new Domain-Spreadsheet benchmark dataset. It also includes a Spreadsheet Gym environment designed for multi-turn RL: Spreadsheet Gym exposes extensive Excel functionality through a Python sandbox, along with a refined harness that incorporates a comprehensive tool set and carefully designed tool-routing rules for spreadsheet tasks. Through comprehensive experiments, we show that Spreadsheet-RL substantially enhances AI agents' performance on both general and domain-specific spreadsheet tasks: it improves Qwen3-4B-Thinking-2507's Pass@1 on SpreadsheetBench from 12.0% to 23.4%, and raises Pass@1 from 8.4% to 17.2% on our curated Domain-Spreadsheet dataset. These results highlight Spreadsheet-RL's strong potential for generalization and real-world adoption in spreadsheet automation, and more broadly, its promise for advancing LLM-based interactions with data interfaces in everyday work.

2605.22612 2026-05-22 cs.CY cs.AI cs.LG 版本更新

Healthcare LLM Benchmarks Are Only as Good as Their Explicit Assumptions

医疗LLM基准测试的可靠性仅取决于其显式假设

Naveen Raman, Santiago Cortes-Gomez, Mateo Dulce Rubio, Fei Fang, Bryan Wilder

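Affiliations * Carnegie Mellon University; New York University

AI summary This paper argues that the evaluation-deployment gap for healthcare LLM benchmarks stems from implicit assumptions rather than from poor benchmark design, and proposes BenchmarkCards and staged evaluation to address it.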

Comments 13 pages, 1 figure

详情
AI中文摘要

基准测试对于医疗评估是必要的,但不足以预测部署性能。我们的观点是,评估-部署差距并非源于基准设计不当,而是源于关于用户如何与模型交互的隐式假设,这些假设无法仅通过基准测试本身来揭示。为了使这一观点更明确,我们提出了将假设分为两类的分类:任务假设,可通过对话数据单独测试;以及结果假设,需要结果数据和行为研究来测试。关键的是,结果假设依赖于人类行为,即使设计良好的基准也无法直接观察。为了证明该框架的实用性,我们回顾性分析了一个医疗RCT作为案例研究,并发现差距自然分为大致相等的任务和结果差距。为此,我们做出了两项贡献:首先,我们提出BenchmarkCards,一种记录假设的工具;其次,我们提出分阶段评估,一种系统测试假设并评估性能的程序。

英文摘要

Benchmarks are necessary for healthcare evaluation, but are not sufficient for predicting deployment performance. Our position is that the evaluation--deployment gap arises not because of poorly designed benchmarks, but from implicit assumptions about how users interact with models that cannot be surfaced from benchmarks alone. To make this precise, we propose a classification of assumptions into two categories: task, which can be tested from conversation data alone, and outcome, which requires outcome data and behavioral studies for testing. Critically, outcome assumptions depend on human behavior, something that even well-designed benchmarks cannot directly observe. To demonstrate the operationality of this framework, we retrospectively analyze a healthcare RCT as a case study and find that the gap naturally separates into task and outcome gaps of roughly equal size. To address this, we make two contributions: first, we propose BenchmarkCards, an artifact that documents assumptions, and second, we propose staged evaluation, a procedure that systematically tests assumptions and evaluates performance.

2605.22608 2026-05-22 cs.CL cs.AI 版本更新

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Agentic CLEAR: 自动化多层级评估LLM代理

Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer

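Affiliations * IBM Research

AI summary This work proposes Agentic CLEAR, a framework that automates the evaluation of LLM agents through fine-grained analysis at multiple levels, providing high-quality, data-driven feedback and predicting task success rate.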

Comments ACL

详情
AI中文摘要

代理系统正变得越来越有能力:代理定义策略、采取行动并与不同环境交互。这种自主性对监督和评估代理行为提出了严峻挑战。当前大多数工具功能有限,要么侧重于可观测性并具备基本评估能力,要么强制使用静态、手工制定的错误分类法,无法适应新领域。为解决这一差距,我们提出了Agentic CLEAR,一个自动、动态且易于使用的评估框架。它在三个粒度层级上生成关于代理行为的文本洞察:系统、轨迹和节点。Agentic CLEAR运行在可观测性层之上,能够实现无缝集成,并具有直观的用户界面,使代理评估变得高度可访问。在四个基准测试、七个代理设置和数万次LLM调用的实验中,我们展示了Agentic CLEAR能够产生高质量、数据驱动的反馈。我们的分析显示与人工标注的错误高度一致,并且能够预测任务的成功率。

英文摘要

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilities or imposing static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.

2605.22604 2026-05-22 cs.CR cs.AI cs.LG cs.SE 版本更新

Innovations in Cardless Artificial Intelligence Banking: A Comprehensive Framework for Cyber Secure and Fraud Mitigation using Machine Learning Algorithms

无卡人工智能银行业创新:基于机器学习算法的全面框架用于网络安全与欺诈防范

Md Israfeel

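Affiliations * Computer Engineering, University of Central Florida, Orlando, Florida, USA

AI summary This paper proposes a comprehensive framework that uses machine learning algorithms to strengthen cybersecurity and fraud mitigation in cardless AI banking systems, generating virtual cards through AI-driven data encryption to reduce the risk of information exposure.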

详情
AI中文摘要

无卡人工智能(AI)银行业的发展标志着金融领域的一次范式转变,为用户提供前所未有的安全性和便利性。本文概述了一个全面的框架,旨在增强网络安全,引入自动生成的虚拟卡,并在无卡AI银行系统中减轻欺诈风险。该框架设想了一种未来银行架构,利用AI驱动的数据加密技术来创建安全的虚拟卡以实现无缝交易。通过强调安全的通信渠道,它确保了银行系统、持卡人和第三方供应商之间的金融活动的完整性。基于AI的授权方法在验证每一笔交易的同时,主动识别潜在欺诈,展示了该框架在加强无卡AI银行业安全方面的有效性。初始方法,包含一个AI驱动的基于特征的银行系统,确保生成带有加密数据的虚拟卡,减少信息暴露并降低欺诈风险。整合机器学习算法为潜在的欺诈活动增加了一层保护。最后,所提出的框架为无卡AI银行系统建立了一个全面的网络安全和欺诈防范范式。其实施使金融机构能够应对传统银行业相关的安全问题,为一个不仅抗欺诈而且对用户安全和方便的未来银行业景观铺平道路。

英文摘要

The advent of cardless artificial intelligence (AI) banking heralds a paradigm shift in the financial landscape, offering users unprecedented security and convenience. This paper outlines a comprehensive framework designed to enhance cybersecurity, introduce auto-generated virtual cards, and mitigate fraud risks within cardless AI banking systems. The framework envisions a future banking architecture that employs AI-powered data cryptography to create secure virtual cards for seamless transactions. By emphasizing secure communication channels, it ensures the integrity of financial activities among banking systems, cardholders, and third-party vendors. AI-based authorization methodologies play a pivotal role in authenticating each transaction while proactively identifying potential fraud, demonstrating the framework's efficacy in fortifying cardless AI banking security. The initial approach, featuring an AI-driven, feature-based banking system, ensures the generation of virtual cards with encrypted data, minimizing information exposure and reducing fraud risks. Integrating a machine learning algorithm adds an additional layer of protection against potential fraudulent activities. In conclusion, the proposed framework establishes a holistic cybersecurity and fraud-mitigation paradigm for cardless AI banking systems. Its implementation empowers financial institutions to address security concerns associated with traditional banking, paving the way for a future banking landscape that is not only fraud-resistant but also secure and convenient for users.

2605.22602 2026-05-22 cs.AI 版本更新

Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

三思而后言:基于双重知识增强的理论思维推理用于说服代理

Minghui Ma, Bin Guo, Runze Yang, Mengqi Chen, Yan Liu, Jingqi Liu, Yahan Pei, Xuehao Ma, Qiuyun Zhang, Zhiwen Yu

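Affiliations * Northwestern Polytechnical University; Peking University; Harbin Engineering University

AI summary This paper proposes a dual knowledge-enhanced Theory-of-Mind reasoning method for improving the dialogue ability of persuasive agents. It constructs a large-scale annotated dataset and introduces the TTBYS framework, which improves LLM inference of desires, beliefs, and persuasive strategies.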

Comments 19 pages, 6 figures

详情
AI中文摘要

说服对话需要推理他人潜在的心理状态,这一能力称为理论思维(ToM)。然而,由于依赖简单的提示策略和不足的ToM知识,现有LLM往往无法捕捉心理状态之间的内在依赖关系,导致表示碎片化和推理不稳定。为解决这些问题,我们引入了基于ToM的说服对话(ToM-PD)任务,该任务基于信念-欲望-意图(BDI)框架,明确建模多轮对话中心理状态的序列依赖性。为了促进该任务的研究,我们构建了一个大规模标注数据集,即基于ToM的广泛说服对话(ToM-BPD),捕捉了细粒度的心理状态和相应的说服策略。我们进一步提出了“三思而后言”(TTBYS),一种知识增强的分步推理框架,利用显式和隐式先验经验来提高LLM对欲望、信念和说服策略的推理能力。实验结果表明,配备TTBYS的Qwen3-8B在预测欲望、信念和说服策略方面分别优于GPT-5 1.20%、22.80%和16.97%。案例研究进一步表明,我们的方法增强了推理的可解释性和一致性。

英文摘要

Persuasive dialogue requires reasoning about others' latent mental states, a capability known as Theory of Mind (ToM). However, due to reliance on simple prompting strategies and insufficient ToM knowledge, existing LLMs often fail to capture the intrinsic dependencies among mental states, leading to fragmented representations and unstable reasoning. To address these challenges, we introduce the ToM-based Persuasive Dialogue (ToM-PD) task, grounded in the Belief-Desire-Intention (BDI) framework, which explicitly models the sequential dependencies among mental states in multi-turn dialogues. To facilitate research on this task, we construct a large-scale annotated dataset, ToM-based Broad Persuasive Dialogues (ToM-BPD), capturing fine-grained mental states and corresponding persuasive strategies. We further propose Think Thrice Before You Speak (TTBYS), a knowledge-enhanced stepwise reasoning framework that leverages both explicit and implicit prior experiences to improve LLMs' inference of desires, beliefs, and persuasive strategies. Experimental results demonstrate that Qwen3-8B equipped with TTBYS outperforms GPT-5 by 1.20%, 22.80%, and 16.97% in predicting desires, beliefs, and persuasive strategies, respectively. Case studies further show that our approach enhances interpretability and consistency in reasoning.

2605.22597 2026-05-22 cs.LG cs.AI cs.GR cs.RO 版本更新

MoSA: Motion-constrained Stress Adaptation for Mitigating Real-to-Sim Gap in Continuum Dynamics via Learning Residual Anisotropy

MoSA: 通过学习残余各向异性来缓解连续动力学中现实到模拟差距的运动约束应力适应

Jiaxu Wang, Junhao He, Jingkai Sun, Yi Gu, Yunyang Mo, Jiahang Cao, Qiang Zhang, Renjing Xu

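Affiliations * Hong Kong University of Science; MMLab, Chinese University of Hong Kong, Hong Kong SAR; The University of Hong Kong, Hong Kong SAR

AI summary This paper proposes MoSA, a motion-constrained stress adaptation framework for narrowing the real-to-sim gap in continuum dynamics. It uses an isotropic model as a physics prior, learns residual stress operators that capture mild anisotropy and heterogeneity, and is validated in a robot manipulation setting.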

Journal ref International Conference on Machine Learning 2026

详情
AI中文摘要

从视觉观测中学习现实世界的动力学对于各种领域至关重要。一种常见策略是通过估计物理参数来校准模拟器,但准确性最终受限于底层物理模型,这些模型通常假设材料是均质且各向同性的。即使合理,现实中的物体通常表现出轻微的各向异性和非均匀性。在近各向同性的骨架良好校准后,这些残余效应成为进一步缩小现实到模拟差距的关键瓶颈。虽然神经网络可以端到端地拟合动力学,但这种黑盒建模会丢弃强物理先验,导致数据效率低和过拟合。因此,我们提出了MoSA,一种运动约束应力适应框架,旨在针对这些残余效应以进一步提高现实到模拟动力学学习。MoSA使用各向同性模型作为物理先验,并学习残余应力算子以捕捉轻微各向异性和非均匀性。它通过微平面约束的再分布逐步适应应力,在一个物理指导的级联网络中。我们进一步通过监督变形场的时空导数来施加运动约束。实验表明,我们学习的动力学在准确性、泛化性和鲁棒性方面均优于现有方法,同时学习了具有物理意义的残余各向异性。最后,我们在机器人操作设置中验证了MoSA,显示更好的现实到模拟动力学建模能够转化为更可靠的模拟到现实转移。项目页面可在https://mercerai.github.io/MoSA/上获取。

英文摘要

Learning real-world dynamics from visual observations is crucial for various domains. A common strategy is to calibrate simulators by estimating physical parameters, yet accuracy is ultimately bounded by the underlying physical models, which often assume materials are homogeneous and isotropic. Even when this assumption is reasonable, real-world objects typically exhibit mild anisotropy and heterogeneity. After the near-isotropic backbone is well calibrated, these residual effects become the key bottleneck for further closing the real-to-sim gap. Although neural networks can fit dynamics end-to-end, such black-box modeling discards strong physical priors, leading to poor data efficiency and overfitting. Therefore, we propose MoSA, a motion-constrained stress adaptation framework that targets these residual effects to further improve real-to-sim dynamics learning. MoSA uses an isotropic model as a physics prior and learns residual stress operators to capture mild anisotropy and heterogeneity. It progressively adapts stresses via microplane-constrained redistribution in a physics-informed cascaded network. We further impose motion constraints by supervising temporal and spatial derivatives of the deformation field. Experimentally, our learned dynamics achieve superior accuracy, generalization, and robustness, while learning physically meaningful residual anisotropy. Finally, we validate MoSA in a robot manipulation setting, showing that better real-to-sim dynamics modeling translates into more reliable sim-to-real transfer. The project page is available at https://mercerai.github.io/MoSA/.

2605.22581 2026-05-22 cs.CV cs.AI cs.LG 版本更新

SceneAligner: 3D-Grounded Floorplan Localization in the Wild

SceneAligner:真实场景中基于3D的平面图定位

Junhyeong Cho, Ruojin Cai, Hadar Averbuch-Elor

发表机构 * Cornell University(康奈尔大学) Kempner Institute, Harvard University(哈佛大学 Kempner 研究所)

AI总结 本文提出了一种在真实场景中进行平面图定位的方法,将任务建立在场景的重建3D表示之上,解决了现有方法在大规模建筑和栅格化平面图上适用性受限的问题。

Comments Project Page: https://Cornell-VAILab.github.io/SceneAligner

详情
AI中文摘要

许多公共建筑提供带有“你在这里”标识的平面图,以帮助访客确定方位。平面图定位旨在通过确定视觉观测在平面图中的拍摄位置,以计算方式复现这一能力。然而,现有方法通常假设受控的小规模环境和精确的矢量化平面图,限制了它们在大规模建筑和栅格化平面图上的适用性。在本文中,我们提出了一种在真实场景中进行平面图定位的方法,将该任务建立在场景的重建3D表示之上。给定一组无约束的图像,我们的方法重建一个重力对齐的3D场景,并将其投影为2D密度图,作为平面图的代理。平面图定位随后被表述为通过2D相似变换将该代理与输入平面图对齐。为了弥合密度图与建筑平面图之间的外观差距,我们适配了一个2D基础模型来学习跨模态对应关系,并引入一种微调方案,在保持结构一致性的同时鼓励语义对齐的匹配。大量实验表明,该方法相比先前方法有显著改进,包括在输入少至单张图像的极稀疏设置下。我们的代码和数据将公开提供。

英文摘要

Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume controlled small-scale environments and precise vectorized floorplans, limiting their ability to operate in large-scale buildings and rasterized floorplans. In this work, we present an approach for performing floorplan localization in the wild by grounding the task in a reconstructed 3D representation of the scene. Given an unconstrained image collection, our method reconstructs a gravity-aligned 3D scene and projects it into a 2D density map that serves as a floorplan proxy. Floorplan localization is then formulated as aligning this proxy with the input floorplan via a 2D similarity transform. To bridge the appearance gap between density maps and architectural floorplans, we adapt a 2D foundation model to learn cross-modal correspondences, introducing a fine-tuning scheme that encourages semantically aligned matches while preserving structural consistency. Extensive experiments demonstrate substantial improvements over prior methods, including in extremely sparse settings with as little as a single input image. Our code and data will be publicly available.
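The alignment step described above (fitting a 2D similarity transform between the density-map proxy and the floorplan) can be illustrated with a minimal sketch. It assumes matched point pairs are already available; how SceneAligner obtains or filters correspondences is not stated in the abstract. The function names are hypothetical, and the closed-form least-squares fit over complex numbers is a standard technique, not necessarily the authors' solver.

```python
import cmath
import math

def fit_similarity_2d(src, dst):
    """Least-squares scale, rotation and translation mapping src onto dst.

    src and dst are lists of (x, y) tuples in matching order (assumed
    correspondences). Points are treated as complex numbers so that
    dst ~= a * src + b, where |a| is the scale and arg(a) the rotation.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two matched point pairs")
    z = [complex(x, y) for x, y in src]
    w = [complex(x, y) for x, y in dst]
    z_mean = sum(z) / len(z)
    w_mean = sum(w) / len(w)
    num = sum((wi - w_mean) * (zi - z_mean).conjugate() for zi, wi in zip(z, w))
    den = sum(abs(zi - z_mean) ** 2 for zi in z)
    if den == 0:
        raise ValueError("source points are all identical")
    a = num / den
    b = w_mean - a * z_mean
    return abs(a), cmath.phase(a), (b.real, b.imag)

def apply_similarity_2d(scale, angle, shift, point):
    """Apply the transform to a single (x, y) point."""
    p = cmath.rect(scale, angle) * complex(*point) + complex(*shift)
    return (p.real, p.imag)

# Toy check: transform known points, then recover the parameters.
proxy_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
plan_points = [apply_similarity_2d(2.0, math.pi / 6, (5.0, -1.0), p)
               for p in proxy_points]
scale, angle, shift = fit_similarity_2d(proxy_points, plan_points)
```

With noisy or partially wrong matches, a fit like this would typically sit inside a robust estimator such as RANSAC; that choice is outside what the abstract specifies.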

2605.22579 2026-05-22 cs.CL cs.AI stat.ML 版本更新

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

超越温度:超拟合作为晚期几何扩展

Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann

发表机构 * Department of Statistics, LMU Munich(慕尼黑大学统计系) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML)) School of Computer and Information Engineering, Henan University(河南大学计算机与信息工程学院)

AI总结 本文研究了超拟合现象,发现其与分布锐化不同,通过实验表明超拟合依赖于动态的上下文相关排名重排机制,并在Transformer最后一层的终端扩展中实现了特征空间的几何扩展,提出了Late-Stage LoRA方法以提升生成质量。

Comments Accepted at ICML 2026

详情
AI中文摘要

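近期的研究揭示了一个反直觉现象,称为“超拟合”(Hyperfitting):在小数据集上将大语言模型(LLMs)微调至训练损失接近零,出人意料地提升了开放式生成的质量,并缓解了贪心解码中的重复问题。尽管这一做法有效,其内在机制仍缺乏理解;极低熵的输出分布暗示它可能等价于简单的温度缩放。在本文中,我们证明该现象与分布锐化有本质区别:熵匹配的对照实验表明,温度缩放无法复现超拟合带来的多样性增益。此外,我们证伪了静态词表重加权的假设,通过消融研究表明超拟合依赖于一种动态的、上下文相关的排名重排机制。逐层分析将这一效应定位于最后一个Transformer块中的“终端扩展”(Terminal Expansion),其中特征空间的大幅几何扩展(ΔDim ≈ +80.8)促进了深尾词元的提升。另外,我们提出了Late-Stage LoRA,一种仅更新最后5层的针对性微调策略,以极少的参数更新实现稳健的生成。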

英文摘要

Recent work has identified a counterintuitive phenomenon termed "Hyperfitting", where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis localizes this effect to a "Terminal Expansion" in the final transformer block, where a substantial geometric expansion of the feature space (ΔDim ≈ +80.8) facilitates the promotion of deep-tail tokens. Additionally, we introduce Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers, yielding robust generation with minimal parameter updates.
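One way to see why temperature scaling alone cannot produce a rank reordering: dividing logits by a positive temperature changes entropy but never changes the order of tokens. The sketch below uses made-up logits and hypothetical helper names; it illustrates that general property only and does not reproduce the paper's entropy-matched experiments.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a positive temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def ranking(values):
    """Token indices sorted from most to least probable."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)

logits = [2.0, 1.0, 0.5, -1.0]               # illustrative values only
base = softmax(logits)                        # temperature 1.0
sharp = softmax(logits, temperature=0.25)     # lower entropy, same order
```

Here `sharp` has lower entropy than `base`, yet `ranking(sharp)` equals `ranking(base)`, so any context-dependent change in token order has to come from something other than sharpening.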

2605.22570 2026-05-22 cs.CV cs.AI 版本更新

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

VGenST-Bench: 一个通过主动视频合成进行时空推理的基准

Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park

发表机构 * Department of Artificial Intelligence, Sungkyunkwan University(成均馆大学人工智能系) Department of Artificial Intelligence, Yonsei University(延世大学人工智能系)

AI总结 本文提出VGenST-Bench,一个通过生成模型主动合成多样化评估场景的视频基准,旨在评估多模态大语言模型的时空推理能力,通过引入多代理流程和3x2x2视频分类体系,实现对细粒度时空理解的精准诊断。

Comments 82 pages, 91 figures (7 in main paper, 84 in appendix). Project page: https://zinosii.github.io/VGenST-Bench/

详情
AI中文摘要

时空推理是多模态大语言模型(MLLMs)在现实世界中的一项核心能力。因此,精确评估这一能力已成为一个关键挑战。然而,现有的时空推理基准数据集主要依赖静态图像集或被动整理的视频数据,这限制了对细粒度推理能力的评估。在本文中,我们介绍了VGenST-Bench,一个视频基准,该基准利用生成模型主动合成高度可控且多样化的评估场景。为了构建VGenST-Bench,我们提出一个包含人类质量控制阶段的多代理流程,确保所有生成的视频和问答对的质量。我们建立了一个全面的3x2x2视频分类体系,涵盖空间尺度、视角和场景动态,以涵盖多样化的场景。此外,我们设计了一个分层任务套件,将低层次的视觉感知与高层次的时空推理分离。通过从被动整理转向主动合成,VGenST-Bench能够对MLLMs的时空理解进行细粒度诊断。

英文摘要

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.

2605.22568 2026-05-22 cs.CR cs.AI 版本更新

Measuring Security Without Fooling Ourselves: Why Benchmarking Agents Is Hard

在不欺骗自己的情况下衡量安全:为什么基准测试智能体是困难的

Sahar Abdelnabi, Chris Hicks, Konrad Rieck, Ahmad-Reza Sadeghi

发表机构 * ELLIS Institute Tübingen & MPI-IS & Tübingen AI Center(图宾根ELLIS研究所及MPI-IS与图宾根人工智能中心) The Alan Turing Institute(艾伦·图灵研究所) BIFOLD & Technische Universität Berlin(BIFOLD与柏林工业大学) Technische Universität Darmstadt(达姆施塔特工业大学)

AI总结 本文探讨了用于评估安全关键角色中AI智能体的基准测试所面临的核心挑战,包括基准漏洞、时效性过期以及运行时不确定性,并提出了构建更稳健、更可信评估框架的方向。

详情
AI中文摘要

用于评估安全关键角色中AI智能体的基准测试存在严重缺陷。基于近期的实证证据,我们刻画了削弱安全评估的三个核心挑战:基准漏洞、时效性过期(temporal staleness)和运行时不确定性。随后,我们概述了构建更稳健、更可信评估框架的实用方向。

英文摘要

The benchmarks used to evaluate AI agents in security-critical roles suffer from crucial weaknesses. Building on recent empirical evidence, we characterize three core challenges that undermine security evaluations: benchmark vulnerabilities, temporal staleness, and runtime uncertainty. We then outline practical directions toward building more robust and trustworthy evaluation frameworks.

2605.22540 2026-05-22 cs.CE cs.AI 版本更新

Dynamic Hypergraph Representation Learning for Multivariate Time Series without Prior Knowledge

动态超图表示学习用于无先验知识的多变量时间序列

Marco Gregnanin, Johannes De Smedt, Giorgio Gnecco, Maurizio Parton

发表机构 * IMT School for Advanced Studies Lucca(卢卡IMT高等研究院) KU Leuven(鲁汶大学) University of Chieti-Pescara(基耶蒂-佩斯卡拉大学)

AI总结 本文提出了一种无需先验知识的多变量时间序列动态超图表示学习方法,通过社区检测和注意力机制构建超图,并利用动态超图注意力卷积网络进行预测。

详情
AI中文摘要

超图能够刻画不同领域中实体之间的高维关系,因此在理解复杂系统的结构与动态方面日益受到研究界关注。然而,一个关键挑战是:在超图结构信息有限或缺失的情况下,如何从时间序列数据中推导出超图表示。在本研究中,我们提出了一种模型,无需依赖关于数据的先验知识即可为多变量时间序列构建动态超图表示。具体做法是对时间序列进行社区检测,并采用基于团(clique)的技术,将通过注意力机制得到的社区转换为超图。我们从不同的时间序列数据集中推导出超图表示,所得超图随后由动态超图注意力卷积网络(DHACN)用于多变量时间序列预测。本研究提出了一种更适合在无先验知识的情况下揭示高阶关系的新方法,推动了超图表示领域的发展。

英文摘要

Hypergraphs have the capacity to capture higher-dimensional relationships among entities across various domains, making them a subject of growing interest within the research community for understanding the structure and dynamics of complex systems. However, a key challenge is the derivation of hypergraph representations from time series data in situations where the structure of the hypergraph is limited or absent. In this study, we propose a model that constructs a dynamic hypergraph representation for multivariate time series without relying on prior knowledge of the data. This is achieved by applying community detection to the time series and transforming the resulting communities, obtained through an attention mechanism, into a hypergraph using a clique-based technique. Hypergraph representations are derived from different time series datasets, and the resulting hypergraphs are then used by a Dynamic Hypergraph Attention Convolution Network (DHACN) for multivariate time series predictions. This research advances the field of hypergraph representation by introducing a novel approach that is better suited to uncover high-order relationships without prior knowledge.

2605.22535 2026-05-22 cs.AI 版本更新

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

TerminalWorld: 在真实世界终端任务上评估智能体的基准测试

Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O'Hearn, Earl T. Barr, Mark Harman, Federica Sarro, He Ye

发表机构 * University College London(伦敦大学学院) Nanjing University(南京大学) Tencent(腾讯)

AI总结 本文提出TerminalWorld,一个可扩展的数据引擎,能够自动从真实世界终端记录中反向工程高保真的评估任务。通过处理80,870条终端记录,生成1,530个经过验证的任务,涵盖18个真实世界类别,从短日常操作到超过50步的工作流,覆盖1,280个唯一命令。从中精选出200个代表性任务作为Verified子集。在八个前沿模型和六个智能体上全面评估发现,当前系统仍难以处理真实终端工作流,最高通过率为62.5%。此外,TerminalWorld捕捉到与现有专家整理的基准(如Terminal-Bench)不同的真实终端能力,仅与它们的分数有弱相关性(Pearson r=0.20)。自动化引擎使TerminalWorld本身具有真实性和可扩展性,使其能够评估智能体在真实终端环境中随着开发者实践的发展而变化。数据和代码可在https://github.com/EuniAI/TerminalWorld获取。

详情
AI中文摘要

我们介绍了TerminalWorld,一个可扩展的数据引擎,能够自动从真实环境(in-the-wild)的终端录制中逆向构建高保真的评估任务。通过处理80,870条终端录制,该引擎生成了包含1,530个经过验证的任务的完整基准,涵盖18个真实世界类别,从简短的日常操作到超过50步的工作流,覆盖1,280个不同的命令。我们从中精选出200个具有代表性、经人工审核的任务,构成Verified子集。在八个前沿模型和六个智能体上对TerminalWorld-Verified进行的全面评测表明,当前系统仍难以应对真实的终端工作流,最高通过率仅为62.5%。此外,TerminalWorld刻画的真实终端能力不同于现有专家整理的基准(如Terminal-Bench),与其分数仅呈弱相关(Pearson r=0.20)。自动化引擎使TerminalWorld在构造上即具备真实性和可扩展性,使其能够随着开发者实践的演进持续评估智能体在真实终端环境中的表现。数据和代码可在 https://github.com/EuniAI/TerminalWorld 获取。

英文摘要

We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.

2605.22530 2026-05-22 cs.AI 版本更新

A Subjective Logic-based method for runtime confidence updates in safety arguments

基于主观逻辑的方法用于安全论证中的运行时置信度更新

Benjamin Herd, Jessica Kelly, Clarissa Heinemann, João-Vitor Zacchi

发表机构 * Fraunhofer Institute for Cognitive Systems (IKS)(弗劳恩霍夫认知系统研究所)

AI总结 本文提出了一种基于主观逻辑的方法,用于在安全论证中实现动态定量保证,通过整合设计时证据和时间窗口内的运行时安全性能指标(SPIs),在开发生命周期中量化和传播置信度。在运行时,SPI证据被持续评估,针对的声明通过规则更新,当没有违反时增加置信度,当发生违反时施加即时惩罚。该设计优先考虑安全相关响应性,而非精确的经典贝叶斯后验更新。

Comments Accepted for publication at the 41st ACM/SIGAPP Symposium on Applied Computing (SAC 2026)

Journal ref Proceedings of the 41st ACM/SIGAPP Symposium on Applied Computing (SAC '26), 2026

详情
AI中文摘要

我们提出了一种方法,用于动态定量保证,该方法通过在单一的主观逻辑(SL)基础上的保证案例中整合设计时证据和时间窗口内的运行时安全性能指标(SPIs),从而增强静态安全案例,实现连续的运行时驱动的置信度更新。该方法通过量化和传播置信度,贯穿整个开发生命周期。在运行时,SPI证据被持续评估,并通过规则更新目标声明:在没有违反的情况下增加置信度,在发生违反时施加即时惩罚。该设计优先考虑安全相关响应性,而非精确的经典贝叶斯后验更新。我们通过基于模拟的施工区辅助功能演示该方法,重点在于基于机器学习的施工锥检测组件,并展示置信度如何随着SPI证据在操作中的观察而演变。

英文摘要

We present a method for dynamic quantitative assurance that enhances static safety cases with continuous, runtime-driven confidence updates. The method quantifies and propagates confidence across the development lifecycle by integrating design-time evidence and windowed runtime Safety Performance Indicators (SPIs) within a single Subjective Logic (SL)-based assurance case. At runtime, SPI evidence is continuously evaluated, and targeted claims are updated using a rule that increases confidence in the absence of violations and imposes prompt penalties when violations occur. This design prioritizes safety-relevant responsiveness over exact classical Bayesian posterior updates. We demonstrate the method using a simulation-based construction zone assist function, focusing on an ML-based construction cone detection component, and show how confidence evolves as SPI evidence is observed in operation.
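To make the confidence update concrete, here is a minimal sketch of a binomial Subjective Logic opinion built from evidence counts, using the standard mapping with a non-informative prior weight of 2. The asymmetric update rule (`update_evidence`, with a `penalty` weight for violations) is a hypothetical stand-in chosen to mimic "gradual gain, prompt penalty"; the abstract does not give the paper's actual rule or its parameters.

```python
from dataclasses import dataclass

PRIOR_WEIGHT = 2.0  # non-informative prior weight commonly used in Subjective Logic

@dataclass
class Opinion:
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def projected_probability(self):
        # Belief plus the share of uncertainty assigned by the base rate.
        return self.belief + self.base_rate * self.uncertainty

def opinion_from_evidence(positive, negative, base_rate=0.5):
    """Map evidence counts to a binomial opinion (belief, disbelief, uncertainty)."""
    total = positive + negative + PRIOR_WEIGHT
    return Opinion(positive / total, negative / total, PRIOR_WEIGHT / total, base_rate)

def update_evidence(positive, negative, violated, penalty=5.0):
    """Hypothetical rule: each clean SPI window adds one unit of positive
    evidence, while a violation adds `penalty` units of negative evidence."""
    if violated:
        return positive, negative + penalty
    return positive + 1.0, negative

# Eight clean windows followed by one violation (illustrative data).
pos, neg = 0.0, 0.0
for violated in [False] * 8:
    pos, neg = update_evidence(pos, neg, violated)
before = opinion_from_evidence(pos, neg)
pos, neg = update_evidence(pos, neg, violated=True)
after = opinion_from_evidence(pos, neg)
```

In this toy run the projected probability rises slowly over clean windows and drops sharply after a single violation, which is the qualitative behavior the abstract describes.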

2605.22529 2026-05-22 cs.LG cs.AI 版本更新

Stabilising Explainability Fragility in Cybersecurity AI: The Impact and Mitigation of Multicollinearity in Public Benchmark Datasets

在网络安全AI中稳定可解释性脆弱性:公共基准数据集中的多重共线性影响与缓解

Ioannis J. Vourganas, Anna Lito Michala

发表机构 * Netrity Ltd(Netrity有限公司) University of Glasgow(格拉斯哥大学)

AI总结 本文研究了入侵检测(IDS)所用AI可解释性中一个尚未被探索却影响重大的弱点:多重共线性引起的不稳定性。尽管SHAP、LIME等事后可解释性工具被广泛依赖,相关特征对解释鲁棒性的影响却未得到评估。我们给出了一个形式化定理,表明多重共线性会放大归因方差。这说明在多重共线性下,解释和特征重要性是不可识别的。一系列全面的实验在具有代表性的基准数据集UNSW-NB15上验证了该定理。我们评估了四类广泛使用的模型(线性模型、基于树的模型、核模型和神经网络),并分别使用完整特征集以及基于VIF和相关性阈值剪枝后的特征集。我们提出了新的指标Explanability Fragility Score,以及两种集成复杂度各不相同的缓解方法。CAA-Filtering通过对已训练模型的归因进行分组来稳定解释。SHARP是一种新的训练时正则化框架,通过惩罚归因不稳定性,使可解释性稳定性得到可控且单调的提升。结果表明预测性能保持稳定,其中使用Kendall's τ量化自助重采样解释之间的不稳定性。这项工作直接关系到XAI在安全关键场景中的可信度和可复现性,并促使人们将多重共线性缓解措施纳入IDS流程,同时为从业者提供了一套指南。

Comments 35 pages, 3 figures, submitted to ACM TAISAP

详情
AI中文摘要

本文研究了入侵检测(IDS)所用AI可解释性中一个尚未被探索却影响重大的弱点:多重共线性引起的不稳定性。尽管SHAP、LIME等事后可解释性工具被广泛依赖,相关特征对解释鲁棒性的影响却未得到评估。我们给出了一个形式化定理,表明多重共线性会放大归因方差。这说明在多重共线性下,解释和特征重要性是不可识别的。一系列全面的实验在具有代表性的基准数据集UNSW-NB15上验证了该定理。我们评估了四类广泛使用的模型(线性模型、基于树的模型、核模型和神经网络),并分别使用完整特征集以及基于VIF和相关性阈值剪枝后的特征集。我们提出了新的指标Explanability Fragility Score,以及两种集成复杂度各不相同的缓解方法。CAA-Filtering通过对已训练模型的归因进行分组来稳定解释。SHARP是一种新的训练时正则化框架,通过惩罚归因不稳定性,使可解释性稳定性得到可控且单调的提升。结果表明预测性能保持稳定,其中使用Kendall's τ量化自助重采样解释之间的不稳定性。这项工作直接关系到XAI在安全关键场景中的可信度和可复现性,并促使人们将多重共线性缓解措施纳入IDS流程,同时为从业者提供了一套指南。

英文摘要

This paper investigates an unexplored yet impactful vulnerability in AI explainability used in intrusion detection (IDS): multicollinearity-induced instability. Despite extensive reliance on post-hoc explainability tools such as SHAP or LIME, the impact of correlated features on explanation robustness is not evaluated. We introduce a formal theorem stating that multicollinearity inflates attribution variance. This demonstrates that explanations and feature importances are non-identifiable under multicollinearity. A suite of comprehensive experiments validates the theorem on a representative benchmark dataset, UNSW-NB15. Four widely used families of models are evaluated, including linear, tree-based, kernel, and neural, across full and pruned feature sets based on VIF and correlation thresholding. We propose the novel metric of Explanability Fragility Score and two novel methods to mitigate it with variable integration complexity. CAA-Filtering focuses on stabilising explanations by grouping attributions of trained models. SHARP is a novel training-time regularisation framework that penalises attribution instability, enabling controllable and monotonic improvement of explainability stability. The findings support stable predictive performance, using Kendall's τ to quantify instability across bootstrapped explanations. This work has direct implications for the trustworthiness and reproducibility of XAI in security-critical contexts, and motivates incorporating multicollinearity mitigations into the IDS pipelines, providing a set of guidelines for practitioners.
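For intuition on why correlated features destabilise attributions, the textbook ordinary-least-squares case is a useful reference point. The block below is that standard result, not the paper's theorem (which the abstract does not state). Here $R_j^2$ is the coefficient of determination from regressing feature $x_j$ on the other features, $\sigma^2$ is the noise variance, and $n$ is the sample size. The last line uses the known Linear SHAP form under a feature-independence assumption.

```latex
\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\operatorname{Var}\!\left(\hat\beta_j\right)
  = \frac{\sigma^2}{(n-1)\,\widehat{\operatorname{Var}}(x_j)} \cdot \mathrm{VIF}_j .
\]
% As x_j becomes more predictable from the other features, R_j^2 -> 1,
% VIF_j grows without bound, and so does the variance of the coefficient.
\[
\phi_j(x) = \hat\beta_j \,\bigl(x_j - \mathbb{E}[x_j]\bigr)
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(\phi_j(x)\right)
  = \bigl(x_j - \mathbb{E}[x_j]\bigr)^2 \operatorname{Var}\!\left(\hat\beta_j\right).
\]
```

In this linear setting the attribution variance across resampled fits scales directly with $\mathrm{VIF}_j$, which is consistent with pruning features by a VIF threshold as the abstract describes.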

2605.22513 2026-05-22 cs.AI 版本更新

Meta-Learning for Rapid Adaptation in Reference Tracking of Uncertain Nonlinear Systems

为不确定非线性系统参考跟踪设计快速适应的元学习

Jiaqi Yan, Ankush Chakrabarty, Niklas Schmid, John Lygeros, Alisa Rupenyan

发表机构 * School of Automation Science and Electrical Engineering, Beihang University(北京航空航天大学自动化科学与电气工程学院) Mitsubishi Electric Research Laboratories(三菱电机研究实验室) Automatic Control Laboratory, ETH Zurich(苏黎世联邦理工学院自动控制实验室) ZHAW Centre for Artificial Intelligence, Zurich University of Applied Sciences(苏黎世应用科学大学ZHAW人工智能中心)

AI总结 本文针对不确定非线性系统的参考跟踪问题,提出基于元学习的控制框架,通过利用源系统数据加速训练并提升控制性能,通过两阶段方法实现对目标系统的快速适应。

Comments 13 pages

详情
AI中文摘要

在本文中,我们研究不确定非线性系统的参考跟踪问题。由于从目标系统(即所关注的系统)收集数据往往具有挑战性,我们的目标是利用有限的目标系统数据设计最优控制器。元学习提供了一个有前景的范式,它利用来自源系统(与目标系统结构相似的系统)的离线数据来加速训练并提升控制性能。受此启发,我们提出了一种基于元学习的控制框架,将隐式模型无关元学习(iMAML)算法适配到控制场景中。该框架分为两个阶段:(离线)元训练阶段,从源数据中学习一个聚合表示,以捕捉相似系统之间共享的系统动态;(在线)元适应阶段,仅使用少量数据样本和有限的适应步数,在目标系统上对该表示进行微调。我们将该框架表述为一个双层优化问题,并给出一种存储复杂度更低、近似更少的高效求解方法。所提出的框架具有通用性,可以集成多种学习算法。为了展示这种灵活性,我们提出两种可纳入该框架的具体学习算法,分别基于神经状态空间模型和深度Q网络。这两种方法的主要区别在于是否需要显式的系统辨识。数值仿真和硬件实验表明,所提出的方法提升了控制性能,并且始终优于基线方法。

英文摘要

In this paper, we address the problem of reference tracking for uncertain nonlinear systems. Since collecting data from the target system (i.e., the system of interest) is often challenging, our objective is to design optimal controllers using limited target system data. Meta-learning provides a promising paradigm by leveraging offline data from source systems (systems sharing structural similarities with the target system) to accelerate training and enhance control performance. Motivated by this idea, we propose a meta-learning-based control framework that tailors the implicit model-agnostic meta-learning (iMAML) algorithm to the control setting. The framework operates in two phases: an (offline) meta-training phase, where an aggregated representation is learned from source data to capture the shared system dynamics among similar systems, and an (online) meta-adaptation phase, where this representation is fine-tuned on the target system using only a few data samples and limited adaptation steps. We formulate this framework as a bi-level optimization problem and provide an efficient solution with reduced storage complexity and few approximations. The proposed framework is general, allowing various learning algorithms to be integrated. To demonstrate this flexibility, we propose two specific learning algorithms that can be incorporated into our framework based on a neural state-space model and a deep Q-network, respectively. The primary distinction between these approaches is whether explicit system identification is required. Numerical simulations and hardware experiments demonstrate that the proposed methods enhance control performance and consistently outperform baseline approaches.
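The bi-level problem mentioned above can be written in the generic iMAML form. This is the standard formulation from the iMAML literature, shown as a sketch; the control-specific losses and the paper's reduced-storage solver are not given in the abstract. The symbols are assumptions: $\theta$ is the meta-learned representation, $\phi_i$ the parameters adapted to source system $i$, $\lambda$ the proximal regularisation weight, and $M$ the number of source systems.

```latex
\[
\theta^\star = \arg\min_{\theta} \; \frac{1}{M} \sum_{i=1}^{M}
  \mathcal{L}_i^{\mathrm{test}}\!\bigl(\phi_i^\star(\theta)\bigr),
\qquad
\phi_i^\star(\theta) = \arg\min_{\phi} \;
  \mathcal{L}_i^{\mathrm{train}}(\phi) + \frac{\lambda}{2}\,\lVert \phi - \theta \rVert^2 .
\]
% Implicit differentiation of the inner optimality condition gives the
% Jacobian used in the meta-gradient, without unrolling the inner loop:
\[
\frac{\mathrm{d}\phi_i^\star}{\mathrm{d}\theta}
  = \Bigl( I + \tfrac{1}{\lambda}\,
      \nabla^2_{\phi}\, \mathcal{L}_i^{\mathrm{train}}\!\bigl(\phi_i^\star\bigr) \Bigr)^{-1}.
\]
```

Because the meta-gradient depends only on the inner solution $\phi_i^\star$ and not on the optimisation path, the inner trajectory need not be stored, which is one plausible reading of the reduced storage complexity the abstract mentions.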

2605.22505 2026-05-22 cs.AI 版本更新

Towards Direct Evaluation of Harness Optimizers via Priority Ranking

通过优先级排名直接评估Harness优化器

Kai Tzu-iunn Ong, Minseok Kang, Dongwook Choi, Junhee Cho, Seungju Kim, Seungwon Lim, Geunha Jang, Minwoo Oh, Bogyung Jeong, Sunghwan Kim, Taeyoon Kwon, Jinyoung Yeo

AI总结 本文提出通过优先级排名直接评估Harness优化器,以解决传统方法中因缺乏oracle harness而无法有效评估优化器中间步骤的问题,展示了该方法在多步骤优化中的可靠性。

Comments Preprint. Work in Progress

详情
AI中文摘要

Harness优化通过让优化器智能体迭代更新目标智能体的harness来实现自动化的智能体创建。尽管取得了成功,当前研究仅通过观察目标智能体的性能提升来评估优化器。这种间接的、只看最终改进的评估忽视了优化器在中间步骤的动作,而这些动作往往是错误的,并会阻碍智能体性能。因此,尚不清楚harness优化究竟是由优化器有依据的更新动作驱动,还是仅仅依靠试错。这就需要对harness优化器进行直接评估。然而,由于缺乏oracle harness,直接评估harness优化器并不简单且成本高昂。为此,我们提出了一种简单、低成本的直接评估设计,即优先级排名。通过让harness优化器根据组件(例如工具)在被更新后改进或阻碍智能体性能的潜力,对给定harness中的组件进行排序,我们的设计无需昂贵的rollout或人工检查即可在步骤层面量化优化器的能力。更重要的是,优化器的排名表现与其在实际多步harness优化中改进智能体的能力相关,从而确立了优先级排名作为优化能力的可靠预测指标。优先级排名由Shor提供支持,Shor是一个包含182个经人工验证的优化场景的集合,涵盖不同领域、设计和时间阶段。代码和数据见 https://github.com/k59118/Harness_Optimizer_Evaluation 。

英文摘要

Harness optimization enables automated agent creation by having an optimizer agent iteratively update the harness of target agents. Despite its success, current studies evaluate optimizers solely by observing target agents' performance gains. This indirect end-improvement evaluation neglects optimizers' actions at intermediate steps, which are often erroneous and hinder agent performance. Therefore, it is unclear whether harness optimization is driven by optimizers' informed update actions or simply trial-and-error. This necessitates direct evaluation of harness optimizers. However, evaluating harness optimizers directly is non-trivial and costly due to the lack of oracle harnesses. To address this, we present a simple, low-cost design to directly evaluate them, namely priority ranking. By asking harness optimizers to rank components (e.g., tools) in a given harness by their potential to improve/hinder agent performance when updated, our design quantifies optimizer ability at the step level without expensive rollouts or manual examination. More importantly, optimizers' ranking performance correlates with their ability to improve agents in actual multi-step harness optimization, establishing priority ranking as a reliable predictor of optimization ability. Priority ranking is enabled by Shor, a collection of 182 human-verified optimization scenarios spanning across domains, designs, and time stages. Codes and data can be found at https://github.com/k59118/Harness_Optimizer_Evaluation.

2605.22504 2026-05-22 cs.AI cs.CV 版本更新

LACO: Adaptive Latent Communication for Collaborative Driving

LACO:适应性潜在通信用于协同驾驶

Tianhao Chen, Yuheng Wu, Dongman Lee

发表机构 * Korea Advanced Institute of Science & Technology(韩国科学技术院)

AI总结 本文提出LACO,一种无需训练的潜在通信范式,通过迭代潜在推理、跨时间显著性归因和结构化语义知识蒸馏,解决协同驾驶中潜在通信的延迟和信息丢失问题,实验证明其在降低通信和推理延迟的同时保持了强大的协同驾驶性能。

详情
AI中文摘要

协同驾驶旨在让联网车辆在部分可观测条件下进行协调,从而提高安全性和效率。近期的方法已从共享视觉特征用于感知,发展到通过基础模型交换基于语言的推理以实现行为协调。用语言进行通信虽然能提供直观的信息,但带来了两个挑战:自回归解码导致的高延迟,以及将丰富的内部表示压缩为离散词元所造成的信息丢失。为了应对这些挑战,我们在多智能体设置的固有限制下分析了协同驾驶中的潜在通信。分析揭示了智能体身份混淆问题,即直接融合潜在状态会使不同车辆的决策表示相互纠缠。受此启发,我们提出了LACO,一种无需训练的潜在通信范式,能够将预训练驾驶模型无缝适配到协同场景中。LACO引入了迭代潜在审议(ILD)用于潜在推理,跨时域显著性归因(CHSA)用于通信高效的信息选择,以及结构化语义知识蒸馏(SSKD)用于稳定以自车为中心的决策。在CARLA中的闭环实验表明,LACO显著降低了通信和推理延迟,同时保持了强大的协同驾驶性能。

英文摘要

Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language-based reasoning through foundation models for behavioral coordination. Though communicating in language provides intuitive information, it introduces two challenges: high latency caused by autoregressive decoding and information loss caused by compressing rich internal representations into discrete tokens. To address these challenges, we analyze latent communication in collaborative driving under inherent limitations of multi-agent settings. Our analysis reveals agent identity confusion, where direct fusion of latent states entangles decision representations across vehicles. Motivated by this, we propose LACO, a training-free LAtent COmmunication paradigm that seamlessly adapts pretrained driving models to collaborative settings. LACO introduces Iterative Latent Deliberation (ILD) for latent reasoning, Cross-Horizon Saliency Attribution (CHSA) for communication-efficient information selection, and Structured Semantic Knowledge Distillation (SSKD) to stabilize ego-centric decision making. Closed-loop experiments in CARLA show that LACO notably reduces communication and inference latency while maintaining strong collaborative driving performance.

2605.22502 2026-05-22 cs.AI cs.LG 版本更新

Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

将代理工作流编译为LLM权重:在成本上减少两个数量级的情况下实现接近前沿质量

Simon Dennis, Rivaan Patil, Kevin Shabahang, Hao Guo

发表机构 * University of Melbourne(墨尔本大学)

AI总结 本文研究如何将代理工作流编译为LLM权重以提高效率,通过在旅行预订、Zoom支持和保险索赔等任务中验证,展示了编译方法在减少成本的同时保持高质量性能。

Comments 19 pages

详情
AI中文摘要

智能体编排框架已大量涌现,LangGraph、CrewAI、Google ADK、OpenAI Agents SDK、Semantic Kernel、Strands和LlamaIndex的GitHub星标总数已超过290,000。它们都遵循相同的模式:在LLM之上设置一个外部编排器,在每一轮注入指令并路由决策。近期工作表明,对于过程性任务,只需在前沿模型的系统提示中给出流程,就能全面胜过这种架构[Dennis et al., 2026a],但代价是占用上下文窗口、每次对话都需要前沿模型,并将专有流程暴露给第三方提供商。将流程编译进一个小型微调模型的权重中(即创建一个“地下智能体”)应当能够消除所有这些顾虑,先前的工作(SimpleTOD、FireAct、SynTOD、WorkflowLLM、Agent Lumos)也已表明该技术可行。然而,开发者的采用却压倒性地偏向编排。我们识别出三个认知上的障碍,并在旅行预订(14个节点)、Zoom支持(14个节点,涉及产品特定知识)和保险理赔(55个节点,6个决策枢纽)三个场景中通过实证逐一加以解决。

英文摘要

Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model's system prompt [Dennis et al., 2026a], at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers. Compiling the procedure into the weights of a small fine-tuned model -- creating a subterranean agent -- should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

2605.22501 2026-05-22 cs.CL cs.AI cs.IR 版本更新

BeLink: Biomedical Entity Linking Meets Generative Re-Ranking

BeLink: 生物医学实体链接结合生成性重新排序

Darya Shlyk, Stefano Montanelli, Lawrence Hunter

发表机构 * University of Milan(米兰大学) University of Chicago(芝加哥大学)

AI总结 本文提出了一种基于生成模型的重新排序方法,通过指令微调提高生物医学实体链接的效率和准确性,在多个基准测试中实现了3%-24%的链接准确率提升,同时减少了推理时间。

Comments Accepted to ACM SIGIR 2026

详情
AI中文摘要

尽管近年来取得了进展,使用大语言模型(LLMs)进行生物医学实体链接(BEL)在计算上仍然低效,且难以在实际场景中部署。在本工作中,我们证明在BEL流水线的重排序阶段对开源生成模型进行指令微调可以提供有效的解决方案。我们提出了一种集合式(set-wise)指令微调形式,使候选实体的选择既快速又准确。我们的方法在多个BEL基准上表现出色,与最先进的方法相比,链接准确率显著提升(3%-24%),同时减少了推理时间。我们将该生成式重排序器集成到BeLink中,这是一个面向实际BEL应用的模块化端到端系统。

英文摘要

Despite recent progress, Biomedical Entity Linking (BEL) with large language models (LLMs) remains computationally inefficient and challenging to deploy in practical settings. In this work, we demonstrate that instruction-tuning of open-source generative models can offer an effective solution when applied at the re-ranking stage of the BEL pipeline. We propose a set-wise instruction-tuning formulation that enables fast and accurate candidate selection. Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art. We integrate our generative re-ranker into BeLink, a modular, end-to-end system designed for practical real-world BEL applications.

2605.22498 2026-05-22 cs.LG cs.AI cs.SC 版本更新

The Neural Compiler: Program-to-Network Translation for Hybrid Scientific Machine Learning

神经编译器:程序到网络的翻译用于混合科学机器学习

Lucas Sheneman

发表机构 * Institute for Interdisciplinary Data Sciences(跨学科数据科学研究所) University of Idaho(爱达荷大学)

AI总结 该研究提出了一种神经编译器,能够将程序转换为可微的PyTorch模块,用于混合科学机器学习,通过符号规范生成正确且可微的模块,实现系统化的可组合性。

Comments Use: 21 pages, 10 figures, 10 tables. Preprint; source code available at https://github.com/sheneman/neural_compiler

详情
AI中文摘要

科学机器学习经常需要将已知的物理规律与从数据中学习的未知参数或修正项结合起来。现有方法要么忽略已知结构,要么将其编码为软惩罚项,要么需要为每个方程手写PyTorch代码。我们提出了神经编译器(The Neural Compiler),一个将用一阶类Scheme表达式语言编写的程序翻译为冻结的、可微的PyTorch模块的系统。这些模块在浮点精度范围内与源程序一致,并通过autograd提供梯度。在混合模型中,编译得到的模块精确编码已知的物理规律,而可学习组件则对未知的剩余部分建模。我们在六个实验领域评估了该编译器:费曼物理方程、Lotka-Volterra动力学、阻尼摆、一维热方程、三维向量力学以及组合泛化。对于单个方程,编译模块与手写PyTorch实现在数值上一致,表明编译不会带来精度损失。仅用1到4个可训练参数,编译模型在大多数情况下就能以低于1%的误差恢复物理常数,而拥有超过8500个参数的标准PINN基线误差为7%到93%。编译模块还可以零误差地组合,而神经近似方法在深层组合链中可能累积较大误差。该编译器的主要价值不在于比手写方程更高的精度,而在于系统化的可组合性:它从符号规范生成正确且可微的模块,无需手工重写每个方程。该系统支持51个基本操作,包括向量和矩阵代数,从而支持PDE离散化和混合科学模型。这种“字符串输入、模块输出”的接口也为大语言模型提供了一个自然的目标,使其可以将科学描述翻译为可执行的可微模块。

英文摘要

Scientific machine learning often requires combining known physics with unknown parameters or correction terms learned from data. Existing approaches either ignore known structure, encode it as a soft penalty, or require hand-written PyTorch code for each equation. We present The Neural Compiler, a system that translates programs written in a first-order Scheme-like expression language into frozen, differentiable PyTorch modules. These modules match the source program to floating-point precision and provide gradients through autograd. In hybrid models, the compiled module encodes known physics exactly while learned components model the unknown remainder. We evaluate the compiler across six experiment domains: Feynman physics equations, Lotka-Volterra dynamics, a damped pendulum, a one-dimensional heat equation, three-dimensional vector mechanics, and compositional generalization. Compiled modules match hand-coded PyTorch implementations numerically for single equations, showing no accuracy loss from compilation. With only 1 to 4 trainable parameters, compiled models recover physical constants to less than 1 percent error in most cases, while standard PINN baselines with more than 8500 parameters show 7 to 93 percent error. Compiled modules also compose with zero error, while neural approximations can accumulate large errors in deep composition chains. The main value of the compiler is not improved accuracy over hand-coded equations, but systematic composability: it generates correct, differentiable modules from symbolic specifications without rewriting each equation by hand. The system supports 51 primitive operations, including vector and matrix algebra, enabling PDE discretizations and hybrid scientific models. This string-in, module-out interface also provides a natural target for large language models that translate scientific descriptions into executable differentiable modules.
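The "string-in, module-out" idea can be illustrated with a toy stand-in. This is not the authors' compiler: it targets plain Python closures rather than PyTorch modules, gets derivatives from forward-mode dual numbers rather than autograd, and uses a hypothetical six-entry primitive table instead of the 51 operations the abstract mentions. It only shows the shape of the pipeline (parse a Scheme-like expression, compile it once, evaluate with gradients).

```python
import math

class Dual:
    """A value paired with its derivative (forward-mode differentiation)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

def _lift(x):
    return x if isinstance(x, Dual) else Dual(float(x))

# Hypothetical primitive table; each entry propagates value and derivative.
PRIMITIVES = {
    "+": lambda a, b: Dual(a.val + b.val, a.dot + b.dot),
    "-": lambda a, b: Dual(a.val - b.val, a.dot - b.dot),
    "*": lambda a, b: Dual(a.val * b.val, a.dot * b.val + a.val * b.dot),
    "/": lambda a, b: Dual(a.val / b.val,
                           (a.dot * b.val - a.val * b.dot) / (b.val ** 2)),
    "sin": lambda a: Dual(math.sin(a.val), math.cos(a.val) * a.dot),
    "exp": lambda a: Dual(math.exp(a.val), math.exp(a.val) * a.dot),
}

def parse(source):
    """Parse one parenthesised prefix expression into nested lists."""
    tokens = source.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        tok = tokens[pos]
        if tok == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        try:
            return float(tok), pos + 1      # numeric literal
        except ValueError:
            return tok, pos + 1             # operator or variable name

    tree, end = read(0)
    if end != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return tree

def compile_expr(tree):
    """Compile a parsed expression into a closure mapping an env to a Dual."""
    if isinstance(tree, float):
        return lambda env: Dual(tree)
    if isinstance(tree, str):
        return lambda env: _lift(env[tree])
    op = PRIMITIVES[tree[0]]
    args = [compile_expr(sub) for sub in tree[1:]]
    return lambda env: op(*(arg(env) for arg in args))

# Compile f(x) = sin(x) * exp(k * x) and differentiate with respect to x.
f = compile_expr(parse("(* (sin x) (exp (* k x)))"))
out = f({"x": Dual(0.5, 1.0), "k": 2.0})   # out.val = f(0.5), out.dot = f'(0.5)
```

Because each compiled expression is an ordinary callable, compiled pieces compose exactly, which mirrors the composability point the abstract makes about compiled versus approximated modules.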

2605.22493 2026-05-22 cs.LG cs.AI cs.RO 版本更新

Understanding Multimodal Failure in Action-Chunking Behavioral Cloning

理解动作分块行为克隆中的多模态失败

Lorenzo Mazza, Massimiliano Datres, Ariel Rodriguez, Sebastian Bodenstedt, Gitta Kutyniok, Stefanie Speidel

发表机构 * NCT-Dresden(NCT-德累斯顿)

AI总结 研究行为克隆在多模态情况下失败的机制,分析不同多模态参数化在动作分块策略中的不同失效方式,并提出通过调整正则化程度和改进生成策略来提升鲁棒性的方法。

详情
AI中文摘要

当同一观测对应多个有效动作时,行为克隆会变得困难。我们针对动作分块策略研究这一问题,并表明不同的多模态参数化会以不同的方式失效。对于隐变量策略,后验-先验正则化使部署时的采样更可靠,但过度正则化会抹去区分各演示模式所需的动作条件信息。减弱这种正则化可以保留模式信息,但此时成功与否取决于先验是否覆盖相关的隐空间区域。对于动作空间生成式策略,多模态性受到从基础分布到动作的传输映射平滑性的限制:Lipschitz常数较小的映射无法为许多彼此充分分离的模式都分配可观的概率。因此,覆盖许多模式要么需要基础空间中的陡峭过渡,要么需要动作空间中位于数据支撑集之外的桥接区域。在合成多模态任务和机器人仿真基准上的实验支持了这些机制。

英文摘要

Behavioral cloning becomes difficult when the same observation admits several valid actions. We study this problem for action-chunking policies and show that different multimodal parameterizations fail in different ways. For latent-variable policies, posterior-prior regularization makes deployment-time sampling more reliable, but excessive regularization removes the action-conditioned information needed to distinguish demonstrated modes. Reducing this regularization can preserve mode information, but then success depends on whether the prior covers the relevant latent regions. For action-space generative policies, multimodality is constrained by the smoothness of the base-to-action transport: a map with small Lipschitz constant cannot assign substantial probability to many well-separated modes. Covering many modes therefore requires either sharp transitions in base space or off-support bridge regions in action space. Experiments on synthetic multimodal tasks and robotic simulation benchmarks support these mechanisms.
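The smoothness constraint on action-space generative policies follows from a short argument, sketched here under stated assumptions; it is an elementary consequence of the Lipschitz property, not the paper's formal result. Let $T$ be the base-to-action map with Lipschitz constant $L$, and let the demonstrated modes $A_1, \dots, A_K$ be pairwise separated by at least $\Delta$ in action space.

```latex
\[
z \in T^{-1}(A_i),\; z' \in T^{-1}(A_j),\; i \neq j
\;\Longrightarrow\;
\Delta \;\le\; \lVert T(z) - T(z') \rVert \;\le\; L\,\lVert z - z' \rVert
\;\Longrightarrow\;
\lVert z - z' \rVert \;\ge\; \frac{\Delta}{L}.
\]
% The preimages of distinct modes are therefore at least Delta/L apart in
% base space. Every base sample falling in the gaps between them is mapped
% to an action outside all modes (a "bridge" region).
```

With a small $L$ the gaps $\Delta / L$ are wide, so a fixed base distribution can place substantial mass on only a few preimages; covering more modes requires either a larger $L$ (sharp transitions) or accepting probability mass on off-support bridge actions.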

2605.22480 2026-05-22 cs.LG cs.AI 版本更新

Implicit Regularization of Mini-Batch Training in Graph Neural Networks

图神经网络中mini-batch训练的隐式正则化

Clement Wang, Antoine Vialle, Robin Vaysse, Thomas Bonald

发表机构 * Institut Polytechnique de Paris(巴黎理工学院) Mirakl

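AI summary This paper studies the implicit regularization of mini-batch training in graph neural networks and finds that simple Random Node Sampling matches or outperforms full-graph training on 8 of 10 datasets at a fraction of the wall-clock time and memory.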

详情
AI中文摘要

图神经网络(GNN)的mini-batch训练与i.i.d.数据训练有本质区别:采样子图会改变拓扑结构并引入边界效应,导致先前工作发展出结构感知采样器以保持局部连接性和减少嵌入方差。令人惊讶的是,我们证明了最简单的可能方案,即随机节点采样(RNS),在均匀采样的诱导子图上训练,在10个数据集中的8个上在墙钟时间和内存消耗上匹配或优于全图训练。为了解释这一点,我们对图mini-batch随机梯度下降(SGD)应用反向误差分析,并显示其隐式最小化采样损失加上一个与mini-batch梯度方差成比例的正则化量,该量直接由采样器塑造。尽管RNS丢弃了局部结构,但它产生了一组预期损失更接近全图损失,且每批梯度方差更低的mini-batch,从而得到更好的隐式目标。我们的分析将图采样器的选择重新定义为一种隐式正则化形式,并将RNS识别为一种强大的、有理论基础的可扩展GNN训练方法。

英文摘要

Mini-batch training of Graph Neural Networks (GNNs) is fundamentally different from training on i.i.d. data: sampling a subgraph alters the topology and introduces boundary effects, leading prior work to develop structure-aware samplers that preserve local connectivity and reduce embedding variance. Surprisingly, we demonstrate that the simplest possible scheme, Random Node Sampling (RNS), training on the induced subgraph of uniformly sampled nodes, matches or outperforms full-graph training on 8 of 10 datasets at a fraction of the wall-clock time and memory. To explain this, we apply backward error analysis to graph mini-batch Stochastic Gradient Descent (SGD) and show that it implicitly minimizes the sampled loss plus a regularizer proportional to the mini-batch gradient variance, a quantity directly shaped by the sampler. Although RNS discards local structure, it produces mini-batches whose expected loss is closer to the full-graph loss, and whose per-batch gradients have lower variance, yielding a better implicit objective. Our analysis reframes the choice of graph sampler as a form of implicit regularization, and identifies RNS as a strong, theoretically grounded method for scalable GNN training.
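
The Random Node Sampling scheme described in this abstract (train on the induced subgraph of uniformly sampled nodes) can be sketched in a few lines. This is a minimal illustration, assuming the graph is stored as an edge list; the function name `random_node_sample`, its parameters, and the toy path graph are hypothetical and not taken from the paper.

```python
import random

def random_node_sample(num_nodes, edges, batch_size, rng):
    """Sample nodes uniformly without replacement and return the induced subgraph.

    `edges` is a list of (u, v) pairs over node ids 0..num_nodes-1 (an assumed format).
    """
    nodes = set(rng.sample(range(num_nodes), batch_size))
    # An edge survives only if both of its endpoints were sampled.
    induced = [(u, v) for (u, v) in edges if u in nodes and v in nodes]
    return sorted(nodes), induced

# Toy path graph 0-1-2-3-4-5 (illustrative data only).
toy_edges = [(i, i + 1) for i in range(5)]
batch_nodes, batch_edges = random_node_sample(6, toy_edges, 3, random.Random(0))
```

Edges that cross the batch boundary are simply dropped, which is the loss of local structure the abstract refers to.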

2605.22462 2026-05-22 cs.CL cs.AI 版本更新

From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

从相关性到因果:一种五阶段方法用于Transformer语言模型中的特征分析

Caleb Munigety

发表机构 * Independent Researcher(独立研究者)

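AI summary This paper proposes a five-stage methodology for causal feature analysis in transformer language models and demonstrates it end to end on GPT-2 small performing the Indirect Object Identification (IOI) task: activation patching recovers the canonical IOI circuit, a sparse autoencoder recovers per-name selective features, causal validation finds these features specific but only partially causal, robustness testing exposes a gap between detection robustness and causal robustness, and a deployment evaluation quantifies the cost saving of the optimal monitor configuration.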

详情
AI中文摘要

我们提出了一种五阶段方法用于Transformer语言模型中的因果特征分析(探针设计、特征提取、因果验证、鲁棒性测试和部署集成),并在GPT-2小型模型上端到端地执行了间接宾语识别(IOI)任务。激活补丁恢复了经典的IOI电路(第9层头9单独恢复+1.02)。稀疏自编码器恢复了每名称选择性特征,其效果大小为30到50个激活单元。因果验证发现这些特征具有特定但部分因果性:删除十五个特征后,模型在98%的提示上仍保持准确。两种受NLA启发的评估强化了这一观点:十五个选择性特征仅解释了激活方差的31%,而SAE的解释为99.7%,选择性比率与因果力呈负相关(r = -0.56)。三种分布偏移下的鲁棒性测试发现,电路能够顺利转移,但特征消融效果显著下降,揭示了检测鲁棒性与因果鲁棒性之间的差距。基于成本的部署评估(假设$50/FN,$0.42/FP,2%错误率)发现最优监控配置可使每1000次查询的成本降至$8.96,相比$1000的基准,节省了99.1%。最优组合策略随成本比和基础率变化。各阶段的结合产生了单一阶段无法产生的发现。

英文摘要

We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives recovery +1.02). A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features specifically but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under three distribution shifts finds that the circuit transfers cleanly but feature ablation effects degrade substantially, exposing a gap between detection robustness and causal robustness. A cost-based deployment evaluation (assumed $50/FN, $0.42/FP, 2% error rate) finds an optimal monitor configuration yielding $8.96 per 1000 queries against a $1000 baseline, a 99.1% saving. Optimal composition strategy varies with cost ratio and base rate. The conjunction of stages produces findings no single stage would.
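
The cost figures quoted in this abstract can be checked with simple arithmetic. The sketch below assumes that the $1000 baseline corresponds to running without a monitor, so that every error becomes a false negative; that reading is an interpretation of the abstract, not something it states, and the function names are hypothetical.

```python
COST_PER_FN = 50.0   # dollars per false negative (from the abstract)
ERROR_RATE = 0.02    # 2% error rate (from the abstract)
QUERIES = 1000

def unmonitored_cost(queries, error_rate, cost_per_fn):
    """Expected cost when every error goes undetected (assumed baseline)."""
    return queries * error_rate * cost_per_fn

def saving_fraction(baseline, monitored):
    """Relative saving of the monitored configuration over the baseline."""
    return (baseline - monitored) / baseline

baseline = unmonitored_cost(QUERIES, ERROR_RATE, COST_PER_FN)
saving = saving_fraction(baseline, 8.96)  # $8.96 per 1000 queries (from the abstract)
print(f"baseline=${baseline:.2f}, saving={saving:.1%}")
```

Under that assumption the baseline works out to 1000 x 0.02 x $50 = $1000, and (1000 - 8.96) / 1000 rounds to the reported 99.1% saving.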

2605.22457 2026-05-22 cs.AI cs.SY eess.SY 版本更新

KAPPS: A knowledge-based CPPS Architecture for the Circular Factory

KAPPS:一种基于知识的闭环工厂CPPS架构

Etienne Hoffmann, Jan-Felix Klein, Sören Weindel, Max Goebels, Sebastian Behrendt, Daniel Hernández, Ratan Bahadur Thapa, Jürgen Fleischer, Kai Furmans, Steffen Staab

发表机构 * Institute for Material Handling and Logistics (IFL), Karlsruhe Institute of Technology(材料搬运与物流研究所(IFL),卡尔斯鲁厄技术大学) Department of Production Engineering, KTH Royal Institute of Technology(生产工程系,皇家理工技术大学) Institute of Production Science (wbk), Karlsruhe Institute of Technology(生产科学研究所(wbk),卡尔斯鲁厄技术大学) Analytic Computing, Institute for Artificial Intelligence, University of Stuttgart(分析计算,人工智能研究所,斯图加特大学) Electronics and Computer Science, University of Southampton(电子与计算机科学,南安普顿大学)

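AI summary This paper proposes KAPPS, a knowledge-based CPPS architecture for the circular factory. It addresses the variable product states, dynamically reconfigurable processes, and human-machine knowledge integration that circular manufacturing requires, using a knowledge graph and a semantic interface layer for data integration and reasoning to make manufacturing systems more flexible and adaptable.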

Comments Submitted to Journal of Manufacturing Systems (JMS)

详情
AI中文摘要

尽管线性制造依赖于同质材料和预定义的过程序列,但闭环制造重新引入了具有异质和不确定条件的使用产品。这种转变要求制造系统能够处理可变的产品状态、动态可重构的过程以及人机知识的整合。传统制造IT架构,设计用于稳定结构和确定性执行,无法满足这些需求,因为它们无法充分表示和管理运行时单个组件的唯一性。遵循设计科学方法,为闭环制造设计CPPS,我们从五个互补的视角中推导出14个需求。基于这些需求,我们设计了KAPPS,一种基于知识的架构,利用以本体为基础的知识图谱作为统一的数据骨干,结合语义接口层,实现跨异构系统和服务的一致数据和信息集成、推理和通信,使知识图谱从集成层转变为工厂的权威写时状态。KAPPS集成了约束执行和事件驱动规划模块,使在不确定性和人机知识交换下执行计划能够逐步适应。通过两个实施用例验证了KAPPS的适用性:(i) 通过知识图谱中介服务进行异常检测和学习;(ii) 在模块化输送系统中运行时约束执行。随后,该架构被评估以满足14个需求(摘要已缩短)

英文摘要

While linear manufacturing relies on homogeneous materials and predefined process sequences, circular manufacturing reintroduces used products with heterogeneous and uncertain conditions. This shift demands manufacturing systems capable of handling variable product states, dynamically reconfigurable processes, and the integration of human and machine knowledge. Conventional manufacturing IT architectures, designed for stable structures and deterministic execution, are unable to meet these requirements, as they cannot adequately represent and manage the uniqueness of individual components at runtime. Following a design science methodology for developing a Cyber-Physical Production System (CPPS) for circular manufacturing, we derive 14 requirements from five complementary perspectives. Based on these requirements, we design KAPPS, a knowledge-based architecture that uses an ontology-grounded knowledge graph as a unifying data backbone, combined with a semantic interface layer to enable consistent data and information integration, reasoning, and communication across heterogeneous systems and services, turning the knowledge graph from an integration layer into the factory's authoritative write-time state. KAPPS incorporates modules for constraint enforcement and event-driven planning, enabling incremental adaptation of execution plans under uncertainty and human-machine knowledge exchange. The applicability of KAPPS is demonstrated through two implemented use cases: (i) anomaly detection and learning through knowledge-graph-mediated services and (ii) runtime constraint enforcement in a modular conveyor system. Subsequently, the architecture is evaluated against the 14 requirements (ed. abstract shortened).

2605.22456 2026-05-22 cs.RO cs.AI 版本更新

Steins;Gate Drive: Semantic Safety Arbitration over Structured Futures for Latency-Decoupled LLM Planning

Steins;Gate Drive: 基于结构化未来语义安全仲裁的延迟解耦LLM规划

Anjie Qiu, Hans D. Schotten

发表机构 * Institute for Wireless Communication and Navigation(无线通信与导航研究所) RPTU University Kaiserslautern-Landau(凯撒斯劳滕-兰道大学) German Research Center for Artificial Intelligence(德国人工智能研究中心)

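AI summary This paper proposes SteinsGateDrive, a latency-decoupled planner-runtime architecture that reduces effective lag from +3.07 s at a 1-second horizon to -0.01 s at a 4-second horizon while preserving the measured no-collision safety boundary.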

Comments 10 pages, 2 figures, 5 tables, submitted to IEEE Transactions on Intelligent Vehicles

详情
AI中文摘要

云托管的LLM驱动代理提供有用的语义判断,但其推理延迟超过了分步车辆控制窗口。学习的世界模型预测未来,但通常将未来生成和动作选择保留在大型耦合循环中。我们提出了SteinsGateDrive,一种延迟解耦的规划-运行时架构,其中世界线隐喻来自同名故事,指出了干预的一个可能后果:LLM在最终控制时刻之前选择反事实驾驶未来,而运行时仅在安全合同有效时重用所选预测。生成器构建了三种世界线角色:alpha名义性自我条件未来、beta交互反事实(围绕附近车辆)以及gamma危险压力未来(如刹车、变道或被阻塞的走廊)。所选分支成为具有时间范围、有效/中止条件、回退和授权的类型化战略预测。在10个种子和20步的内受试匹配种子正常-高速公路协议中,GPT-5.4 mini在1秒时间范围将有效延迟从+3.07秒减少到4秒时间范围的-0.01秒,同时保持测量的无碰撞安全边界。该架构的安全贡献来自原子谓词运行时检查,而不是漂移分数,后者作为刷新频率的调节器。

英文摘要

Cloud-hosted LLM driver agents provide useful semantic judgments, but their inference latency exceeds stepwise vehicle-control windows. Learned world models predict futures, but they usually keep future generation and action selection inside large coupled loops. We present SteinsGateDrive, a latency-decoupled planner-runtime architecture in which the worldline metaphor from the eponymous story names one plausible consequence of an intervention: the LLM selects counterfactual driving futures before the final control instant, and a runtime reuses the selected forecast only while safety contracts remain valid. The generator builds three world-line roles: alpha nominal ego-conditioned futures, beta interaction counterfactuals around nearby vehicles, and gamma hazard-stress futures such as braking, cut-ins, or blocked corridors. The selected branch becomes a typed StrategicForecast with horizon, validity/abort conditions, fallback, and authority. On a within-subject, matched-seed normal-highway protocol with 10 seeds and 20 steps, GPT-5.4 mini reduces effective lag from +3.07 s at 1-second horizon to -0.01 s at 4-second horizon while preserving the measured no-collision safety boundary. The architecture's safety contribution comes from the atom-predicate runtime check, not from the drift score, which functions as a refresh-frequency knob.

2605.22455 2026-05-22 cs.CV cs.AI cs.LG physics.optics 版本更新

Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light

使离散的成为连续的:合成RAW增强用于细粒度评估人检测性能在低光环境

Valeria Pais, Malena Mendilaharzu, Daniele Faccio, Luis Oala, Christoph Clausen, Bruno Sanguinetti

发表机构 * University of Glasgow(格拉斯哥大学) Dotphoton

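AI summary This paper proposes a synthetic RAW augmentation method for evaluating person-detection models more precisely in low light: it generates low-light samples that match the camera sensor's noise model in order to improve benchmark data coverage.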

Comments Accepted non-archival paper at the CVPR 2026 AUTOPILOT Workshop (Autonomous Understanding Through Open-world Perception and Integrated Language Models for On-road Tasks)

详情
AI中文摘要

人工智能视觉模型的实际应用既受到可用训练和测试数据的推动,也受到其限制。真实数据集稀疏且不均匀:长尾或不平衡分布会阻碍泛化,而低密度区域中的样本数量少使得评估困难。合成数据可以填补这些空白,提供更连续地采样输入空间的方法,提高基准测试的数据覆盖。专注于自动驾驶安全关键场景中的夜间行人检测,我们展示如何利用合成低光样本更好地表征状态-of-the-art目标检测模型的性能,作为场景光照函数的函数。我们使用合成RAW图像增强技术生成低光样本,以匹配相机传感器的噪声模型。在真实和合成低光数据上的性能指标相似,表明AI模型难以区分它们。

英文摘要

Real-world deployment of AI vision models is both fueled and limited by the data available for training and testing. Real datasets are sparse and uneven: long-tailed or unbalanced distributions hinder generalization, and the low number of samples in low density regions makes it hard to run evaluations. Synthetic data can fill these gaps, providing us with a way to sample the input space more continuously and improve data coverage for benchmarks. Focusing on the autonomous driving safety-critical case of pedestrian detection in the dark, we show how synthetic low-light samples can be used to better characterize the performance of a state-of-the-art object detection model as a function of the scene illumination. We use a synthetic RAW image augmentation technique to generate low-light samples that match the noise model of the camera sensor. Performance metrics on real and synthetic low-light data are similar, indicating that the AI model finds it hard to distinguish between them.

2605.22454 2026-05-22 cs.LG cs.AI 版本更新

Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cyclic Continual Reinforcement Learning

不要忘记批评者:基于价值的多循环持续强化学习中的数据复习

Benjamin Poole, Andrew Quinn, Li Yang, Minwoo Lee

发表机构 * Department of Computer Science(计算机科学系) University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校)

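AI summary This paper proposes a value-based data rehearsal method for multi-cyclic continual reinforcement learning, introducing Qreg+NWLU to improve learning efficiency, forgetting mitigation, and knowledge transfer.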

详情
AI中文摘要

数据复习已成为缓解持续强化学习(CRL)中灾难性遗忘的领先方法。然而,现有工作仍局限于策略梯度框架,仅正则化执行者,由于批评者正则化导致的性能下降。这种以执行者为中心的方法忽略了数据复习在价值函数近似中的潜力。此外,现有CRL评估很少考虑多循环环境,其中任务序列重复,这是关键的现实场景,加剧了遗忘和可塑性。我们研究了使用Q值正则化的深度Q网络在多循环设置中的数据复习,并提出Qreg+NWLU,引入了两个简单的修改:(1)连续数据复习,动态收集和更新存储的Q值在整个训练过程中;(2)“无等待”正则化,立即应用而不是在第一个任务之后。这些修改在价值函数近似设置中提高了学习效率、遗忘缓解和知识转移,优于Qreg和传统CRL方法。

英文摘要

Data rehearsal has emerged as a leading approach for mitigating catastrophic forgetting in Continual Reinforcement Learning (CRL). However, existing work remains confined to policy gradient frameworks, regularizing only actors due to the performance degradation incurred by critic regularization. This actor-centric approach overlooks the potential of data rehearsal for value function approximation. Moreover, existing evaluations in CRL rarely consider multi-cyclic environments where task sequences repeat, a critical real-world scenario that exacerbates forgetting and plasticity. We investigate data rehearsal for Deep Q-Networks using Q-value regularization in multi-cyclic settings and propose Qreg+NWLU which introduces two simple modifications: (1) continuous data rehearsal that dynamically collects and updates stored Q-values throughout training, and (2) "No-Wait" regularization that applies immediately rather than after the first task. Together, these modifications yield improvements in learning efficiency, forgetting mitigation, and knowledge transfer over Qreg and conventional CRL methods within value function approximation settings.

2605.22448 2026-05-22 cs.AI 版本更新

S2ED: From Story to Executable Descriptions for Consistency-Aware Story Illustration

S2ED:从故事到可执行描述以实现一致性感知的故事插图

Sijing Yin, Jiamou Liu, Xiao Tang, Yaser Shakib, Qian Liu

发表机构 * University of Auckland(奥克兰大学) Wuhan College of Communication(武汉通信学院)

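AI summary This paper proposes S2ED, a framework that converts a full story into editable executable descriptions to improve consistency and character fidelity in story illustration, targeting the long-horizon coherence that multi-frame illustration requires.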

Comments 6 pages, 5 figures. Accepted by IEEE ICME 2026

详情
AI中文摘要

多帧故事插图需要超越单图像文本到图像生成的长时程一致性,包括叙事分解和持续的角色身份、布局和情感跨帧。我们提出了故事到可执行描述(S2ED),一种无需训练、模型无关、提示层框架,将完整故事转换为一系列显式、可编辑的可执行描述,以实现更一致的渲染。S2ED协调三个代理来分割叙事、确定标准角色属性并丰富空间和情感线索,使可解释的提示携带状态传播和局部编辑以修复漂移而无需重新训练生成器。在Flintstones和Shakoo Maku上的实验表明,S2ED在序列一致性、角色保真度方面优于强大的提示、大模型规划和参考训练方法,在自动指标和人类判断下均表现优异。我们还部署S2ED在一个端到端的故事到故事书系统中,为儿童插图故事提供补充视频。

英文摘要

Multi-frame story illustration requires long-horizon coherence beyond single-image text-to-image generation, including narrative decomposition and persistent character identity, layout, and affect across frames. We propose Story-to-Executable Descriptions (S2ED), a training-free, model-agnostic, prompt-layer framework that converts a full story into a sequence of explicit, editable executable descriptions for more consistent rendering. S2ED coordinates three agents to segment the narrative, ground canonical character attributes, and enrich spatial and affective cues, enabling interpretable prompt-carried state propagation and local edits to repair drift without retraining the generator. Experiments on Flintstones and Shakoo Maku show that S2ED improves sequence-level consistency and character fidelity over strong prompting, large-model planning, and a reference training-based method, under both automatic metrics and human judgments. We also deploy S2ED in an end-to-end story-to-storybook system for children's illustrated stories, with a supplementary video.

2605.22446 2026-05-22 cs.CV cs.AI cs.RO 版本更新

Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

Pre-VLA: 预防性运行时验证用于可靠视觉-语言-动作和世界模型展开

Zhen Sun, Yongjian Guo, Haoran Sun, Luqiao Wang, Wei Lu, Jiachi Ji, Shengzhe Ji, Junwu Xiong, Zhijun Meng

发表机构 * Beihang University(北京航空航天大学) Tsinghua University(清华大学) Peking University(北京大学) JDT AI Infra Zhejiang University(浙江大学)

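AI summary This paper proposes Pre-VLA, a unified runtime verification architecture that assesses action validity before physical execution or world-model imagination, improving the reliability of vision-language-action and world-model rollouts.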

详情
AI中文摘要

尽管大型视觉-语言-动作(VLA)模型和生成世界模型(WM)在长周期具身智能方面取得了进展,但其实际部署仍受到基于学习的动作生成不确定性的挑战。低质量的动作可能导致执行中的物理故障或导致冗余的渲染成本的误导性世界模型展开。为了解决这个问题,我们提出了Pre-VLA,一种统一的运行时验证架构,能够在物理执行或世界模型想象之前进行预防性动作有效性评估。Pre-VLA利用一个高效的多模态主干,具有模态感知的池化和轻量级双分支头,以预测候选动作片段的安全性信心和批评派生的优势分数。为处理严重的类别不平衡和不稳定边界决策,我们使用结合焦点分类、优势回归和软阈值校准的多任务目标来训练Pre-VLA。在部署期间,双模式预防性重采样调度器过滤低质量的动作,并在有限计算预算下触发自适应重采样。在LIBERO基准测试中,Pre-VLA将四个套件的平均闭环成功率从30.79%提高到37.62%,减少任务执行步骤,实现每个动作片段平均183.9毫秒的前向验证时间,并减轻世界模型展开中的误差累积。

英文摘要

While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lead to misleading world-model rollouts with redundant rendering costs. To address this issue, we propose Pre-VLA, a unified runtime verification architecture that performs preemptive action validity assessment before physical execution or world-model imagination. Pre-VLA leverages an efficient multimodal backbone with modality-aware pooling and a lightweight dual-branch head to predict both safety confidence and critic-derived advantage scores for candidate action chunks. To handle severe class imbalance and unstable boundary decisions, we train Pre-VLA with a multi-task objective combining Focal classification, advantage regression, and soft-threshold calibration. During deployment, a dual-mode preemptive resampling scheduler filters low-quality actions and triggers adaptive resampling under a limited computation budget. Experiments on the LIBERO benchmark show that Pre-VLA improves the average closed-loop success rate across four suites from 30.79% to 37.62% over RynnVLA-002, reduces task execution steps, achieves 183.9 ms average forward verification time per action chunk, and mitigates error accumulation in world-model rollouts.

2605.22441 2026-05-22 cs.CR cs.AI 版本更新

A Constant-Time Implementation Methodology for Activation Functions on Microcontrollers

在微控制器上实现激活函数的常数时间方法

Andrii Tyvodar, Andreas Rechberger, Dirmanto Jap, Shivam Bhasin, Bernhard Jungk, Jakub Breier, Xiaolu Hou

发表机构 * Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava(信息与信息技术学院,布拉格斯拉夫技术大学) State Key Laboratory of Blockchain and Data Security, Zhejiang University(区块链与数据安全国家重点实验室,浙江大学) Temasek Laboratories, Nanyang Technological University(淡马锡实验室,南洋理工大学) Faculty of Computer Science, Albstadt-Sigmaringen University(计算机科学学院,阿尔布斯塔-西格马林根大学) TTControl GmbH

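AI summary This paper proposes a constant-time implementation methodology for activation functions on embedded microcontrollers. It combines branchless selection, fixed-cost Padé-based approximation, dummy arithmetic where needed, and cycle alignment to obtain timing-regular implementations, and validates them on ReLU, sigmoid, tanh, GELU, and Swish.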

详情
AI中文摘要

嵌入式神经网络推断可能通过定时侧信道泄露信息,包括由激活函数评估引起的泄露。本文提出了一种在嵌入式微控制器上实现激活函数的常数时间方法,并在ARM Cortex-M4平台上验证了ReLU、sigmoid、tanh、GELU和Swish函数。所提出的方法结合了无分支选择、固定成本Padé近似、必要的虚拟算术和周期对齐,以获得定时规律的激活函数实现。作为动机,我们评估了一种基于脱同步的防护措施,并展示了其仍易受基于模板的定时攻击攻击。实验结果表明,所得到的受保护实现对于所有测试输入具有相同的周期数,包括三函数设置下的88个周期和五函数设置下的108个周期。同时,数值误差分析表明,近似的非线性函数保留了高精度。这些结果表明,所提出的方法为构建在嵌入式推断中抗侧信道攻击的激活函数提供了实用基础。

英文摘要

Embedded neural-network inference can leak information through timing side channels, including leakage caused by the evaluation of activation functions. This work proposes a constant-time implementation methodology for activation functions on embedded microcontrollers and validates it on ReLU, sigmoid, tanh, GELU, and Swish on an ARM Cortex-M4 platform. The proposed methodology combines branchless selection, fixed-cost Padé-based approximation, dummy arithmetic where needed, and cycle alignment to obtain timing-regular activation-function implementations. As motivation, we also evaluate a desynchronization-based countermeasure and show that it remains vulnerable to a template-based timing attack. Experimental results show that the resulting protected implementations achieve identical cycle counts for all tested inputs, including 88 cycles in the three-function setting and 108 cycles in the five-function setting. At the same time, the numerical-error analysis indicates that the approximated nonlinear functions retain high accuracy. These results suggest that the proposed methodology provides a practical basis for constructing side-channel-resistant activation functions in embedded inference.
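
The "branchless selection" ingredient named in this abstract can be illustrated with a mask-based select. The sketch below only shows the bit-level idiom for values in the signed 32-bit range; it is not the authors' Cortex-M4 implementation, Python itself gives no constant-time guarantee, and the function names are hypothetical.

```python
def select_branchless(mask: int, a: int, b: int) -> int:
    """Return a when mask is all ones (-1) and b when mask is 0, using only bitwise ops."""
    return (a & mask) | (b & ~mask)

def relu_branchless(x: int) -> int:
    """ReLU for a value in the signed 32-bit range, with no data-dependent branch."""
    sign_mask = x >> 31  # arithmetic shift: -1 if x < 0, else 0
    return select_branchless(sign_mask, 0, x)

print(relu_branchless(5), relu_branchless(-5), relu_branchless(0))
```

On a microcontroller the same idea would typically be written with fixed-width integer types, so that both candidate results are always computed and the sign only drives the mask.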

2605.22437 2026-05-22 cs.CR cs.AI cs.LG 版本更新

Characterizing the Fault Response of the Intel Neural Compute Stick 2 Under Single-Pulse Electromagnetic Fault Injection

对Intel神经计算Stick 2在单脉冲电磁故障注入下的故障响应进行表征

Štefan Kučerák, Jakub Breier, Xiaolu Hou

发表机构 * Faculty of Informatics and Information Technologies, Slovak University of Technology(信息与信息技术学院,斯洛伐克技术大学) State Key Laboratory of Blockchain and Data Security(区块链与数据安全国家重点实验室) TTControl GmbH

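AI summary This paper characterizes the fault response of the Intel Neural Compute Stick 2 under single-pulse electromagnetic fault injection, identifies four reproducible outcome classes through a systematic campaign, and discusses mitigation strategies for each class.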

详情
AI中文摘要

视觉处理单元和其他商业神经网络推断加速器越来越多地应用于安全相关的边缘应用,但它们在瞬态硬件干扰下的故障响应在开放文献中仍然缺乏表征。对于Intel Movidius Myriad X,封装为Intel神经计算Stick 2(NCS2),只有单篇可行性研究已发表。我们报告了一项系统性的单脉冲电磁故障注入(EMFI)测试,该测试在运行三个ImageNet训练的卷积神经网络(ResNet-18、ResNet-50、VGG-11)的OpenVINO运行时上进行。在1,536次热点测试和约16,000次参数搜索测试中,单脉冲产生四种可重复的故障类别:无测量精度变化、轻微的静默数据破坏、主要的持续退化,该退化在后续推断中持续直到模型重新加载,以及需要USB电源循环的设备挂起;这些结果分别解释为无影响、SDC可能带有类似SET或小的持久状态机制、SEU-like持续破坏,以及SEFI-like功能丧失。两个发现是核心。首先,主要退化类别可以在18-31%的测试中诱导,其中崩溃后的top-1精度低于5%,在所有后续推断中持续直到显式模型重新加载 - 这一状态没有任何推断API级别的机制可以检测。第二,这一状态也可以通过向空闲设备发送脉冲来诱导,表明仅靠加载时的完整性检查是不够的。我们讨论了按类别分级的缓解策略,重点是可以在应用级别实现的机制,而无需修改设备固件或OpenVINO运行时。

英文摘要

Vision processing units and other commercial neural-network inference accelerators are increasingly deployed in safety-relevant edge applications, but their fault response under transient hardware disturbances remains poorly characterized in the open literature. For the Intel Movidius Myriad X, packaged as the Intel Neural Compute Stick 2 (NCS2), only a single feasibility study has been published. We report a systematic single-pulse electromagnetic fault injection (EMFI) campaign on the NCS2 running three ImageNet-trained convolutional neural networks (ResNet-18, ResNet-50, VGG-11) on the OpenVINO runtime. Across 1,536 spot-test trials at characterized hotspots and approximately 16,000 parameter-search trials, single pulses produce four reproducible outcome classes: no measured accuracy change, minor silent data corruption, major persistent degradation that survives across subsequent inferences until model reload, and device hangs requiring USB power-cycling; these outcomes are respectively interpreted as no-effect, SDC with possible SET-like or small persistent-state mechanisms, SEU-like persistent corruption, and SEFI-like loss of functionality. Two findings are central. First, the major-degradation class can be induced in 18-31% of trials at characterized hotspots, with post-collapse top-1 accuracy below five percent and persistence across all subsequent inferences until explicit model reload, a regime that no inference-API-level mechanism detects. Second, this regime is also inducible by pulses delivered to an idle device with the model already loaded, demonstrating that load-time integrity checks alone are insufficient. We discuss mitigation strategies graded by class, focusing on mechanisms implementable at the application level without modification to the device firmware or the OpenVINO runtime.

2605.22422 2026-05-22 cs.CV cs.AI 版本更新

FastTab: A Fast Table Recognizer with a Tiny Recursive Module and 1D Transformers

FastTab: 一种快速表格识别器,结合了微小递归模块和1D变换器

Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet

发表机构 * LITIS

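AI summary This paper proposes FastTab, a grid-centric table structure recognition model that uses a lightweight Tiny Recursive Module and axial 1D Transformer encoders to recover table structure efficiently, with low latency and good robustness across several benchmarks.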

详情
AI中文摘要

表格结构识别(TSR)需要在表级一致性(行/列数量、表头、跨单元格)和精确的分隔符定位之间取得平衡。我们介绍了FastTab,一种以网格为中心的TSR模型,通过结合(i)轻量级的Tiny Recursive Module(TRM)进行全局推理和(ii)轴向1D Transformer编码器,捕捉行和列上的长距离依赖关系,避免了自动回归的HTML解码。该模型预测行/列数量、表头行和分隔符以构建网格,然后利用ROI对齐的单元格特征推断行跨度/列跨度。在四个基准测试(PubTabNet、FinTabNet、PubTables-1M和SciTSR)中,FastTab在结构恢复性能方面表现优异,同时在低延迟推理中运行良好。我们进一步研究了在像素级匿名化下的鲁棒性,并展示了对相机捕获文档中弯曲分隔符的扩展。源代码将在https://github.com/hamdilaziz/FastTab上公开发布。

英文摘要

Table structure recognition (TSR) requires both table-level coherence (row/column counts, headers, spanning cells) and precise separator localization. We introduce FastTab, a grid-centric TSR model that avoids autoregressive HTML decoding by combining (i) a lightweight Tiny Recursive Module (TRM) for global reasoning and (ii) axial 1D Transformer encoders that capture long-range dependencies along rows and columns. The model predicts row/column counts, header rows, and separators to construct a grid, then infers rowspan/colspan using ROI-aligned cell features. Across four benchmarks (PubTabNet, FinTabNet, PubTables-1M, and SciTSR), FastTab achieves competitive structure recovery performance while operating at low-latency inference. We further study robustness under pixel-level anonymisation and show an extension to curved separators for camera-captured documents. The source code will be made publicly available at https://github.com/hamdilaziz/FastTab .

2605.22420 2026-05-22 cs.CV cs.AI cs.RO 版本更新

Diffusion-guided Generalizable Enhancer for Urban Scene Reconstruction

基于扩散的通用增强器用于城市场景重建

Henry Che, Jingkang Wang, Yun Chen, Ze Yang, Sivabalan Manivasagam, Raquel Urtasun

发表机构 * Waabi University of Toronto(多伦多大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

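AI summary This paper proposes GenRe, a diffusion-guided generalizable enhancer for urban scene reconstruction. By learning generative priors across diverse scenes, it efficiently produces robust, high-fidelity representations that generalize reliably to challenging unseen viewpoints, enabling robust and scalable sensor simulation for autonomous driving.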

Comments ICRA 2026. Project page: https://waabi.ai/genre

详情
AI中文摘要

从真实世界观测重建城市场景已成为自动驾驶开发和测试的强大工具。尽管当前的神经渲染方法在记录轨迹上实现了高质量的渲染,但其在大视角变化下质量显著下降,限制了闭环模拟的应用。最近的研究表明,使用扩散模型在这些具有挑战性的视角上增强质量并将其改进回3D表示具有前景。然而,它们通常需要昂贵的每场景优化,且提炼的表示仍然脆弱,无法超越有限的合成视角泛化。为了解决这些限制,我们提出了GenRe,一种新的基于扩散的通用增强器用于城市场景重建。GenRe输入任何预训练的3D高斯表示,并在几分钟内修复其中的缺陷。通过学习在多样化场景中提炼生成先验,GenRe高效地生成稳健且高质量的表示,能够可靠地泛化到具有挑战性的未见过的视角(例如,变道)。实验表明,GenRe在质量和效率上均优于现有方法,并且受益于各种下游任务,使自动驾驶中的传感器模拟更加稳健和可扩展。

英文摘要

Urban scene reconstruction from real-world observations has emerged as a powerful tool for self-driving development and testing. While current neural rendering approaches achieve high-fidelity rendering along the recorded trajectories, their quality degrades significantly under large viewpoint shifts, limiting their applicability to closed-loop simulation. Recent works have shown promising results in using diffusion models to enhance quality at these challenging viewpoints and distill improvements back into 3D representations. However, they often require costly per-scene optimization, and the distilled representations remain fragile and fail to generalize beyond limited synthesized views. To address these limitations, we propose GenRe, a novel diffusion-guided generalizable enhancer for urban scene reconstruction. GenRe takes as input any pretrained 3D Gaussian representation and fixes its deficiencies within a few minutes. By learning to distill generative priors across diverse scenes, GenRe efficiently produces a robust, high-fidelity representation that generalizes reliably to challenging unseen viewpoints (e.g., lane changes). Experiments show that GenRe outperforms existing methods in both quality and efficiency and benefits various downstream tasks, enabling robust and scalable sensor simulation for autonomous driving.

2605.22414 2026-05-22 cs.CV cs.AI 版本更新

Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence

迈向具有空间定位病变证据的临床可解释性眼科VQA

Xingyue Wang, Bo Liu, Meng Wang, Zhixuan Zhang, Chengcheng Zhu, Huazhu Fu, Jiang Liu

发表机构 * Department of Computer Science and Engineering, Southern University of Science and Technology(南方科技大学计算机科学与工程系) The Hong Kong Polytechnic University(香港理工大学) National University of Singapore(新加坡国立大学) University of Washington(华盛顿大学) Institute of High Performance Computing, Agency for Science, Technology and Research(科技研究局高性能计算研究所)

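AI summary This paper proposes the FundusGround benchmark, which uses spatially grounded lesion evidence to make ophthalmic VQA more clinically interpretable. A three-stage pipeline collects fundus images with annotated lesions, and multiple vision-language models are evaluated on answer accuracy and lesion-level reasoning.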

详情
AI中文摘要

视觉问答(VQA)在临床支持中具有巨大潜力,特别是在眼科领域,视网膜彩色照相是诊断的关键。然而,眼科VQA基准主要强调答案准确性,忽视了临床可解释性所需的显式视觉证据。本文引入FundusGround,一个新的具有空间定位病变证据的临床可解释性眼科VQA基准。具体而言,我们提出一个三阶段流程,收集了10,719张带有15,595个图像级精细标注病变的视网膜影像。为确保解剖一致性和临床有效性,所有病变均通过早期治疗糖尿病视网膜病变研究(ETDRS)网格进行空间定位,从而标准化映射到九个具有临床意义的视网膜区域。基于此结构化的病变证据,生成了72,706个问题,涵盖四种格式:开放式、封闭式、单选和多选。我们进一步使用双指标(答案准确性和病变层面推理)评估多种通用和医学大型视觉语言模型。实验表明,整合病变层面的视觉证据能持续提高模型性能和透明度,突显了显式空间定位对于可靠和可解释性眼科VQA的必要性。

英文摘要

Visual Question Answering (VQA) holds great promise for clinical support, particularly in ophthalmology, where retinal fundus photography is essential for diagnosis. However, ophthalmic VQA benchmarks primarily emphasize answer accuracy, neglecting the explicit visual evidence necessary for clinical interpretability. In this work, we introduce FundusGround, a new benchmark for clinically interpretable ophthalmic VQA with spatially-grounded lesion evidence. Specifically, we propose a three-stage pipeline that collects 10,719 fundus images with 15,595 meticulously annotated image-level lesions. To ensure anatomical consistency and clinical validity, all lesions are spatially localized using the Early Treatment Diabetic Retinopathy Study (ETDRS) grid, enabling standardized mapping to nine clinically meaningful retinal regions. Built upon this structured lesion evidence, 72,706 questions are then generated spanning four formats: open-ended, closed-ended, single-choice, and multiple-choice. We further benchmark multiple general and medical large vision-language models using dual metrics for answer accuracy and lesion-level reasoning. The experiments demonstrate that incorporating lesion-level visual evidence consistently improves model performance and transparency, highlighting the necessity of explicit spatial grounding for reliable and explainable ophthalmic VQA.

2605.22411 2026-05-22 cs.CL cs.AI cs.LG 版本更新

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

DeferMem: 通过强化学习进行长时记忆问答的查询时证据蒸馏

Jianing Yin, Tan Tang

发表机构 * State Key Lab of CAD&CG(计算机辅助设计与图形学国家重点实验室)

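AI summary This paper proposes DeferMem, a long-term memory framework that decouples the problem into high-recall candidate retrieval and query-conditioned evidence distillation to improve the accuracy and efficiency of long-term memory QA.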

Comments 31 pages, 3 figures

详情
AI中文摘要

大型语言模型(LLM)代理在长时记忆问答任务中仍面临挑战,因为答案支持的证据通常分散在长对话历史中并被大量无关内容掩盖。现有记忆系统通常在未来的查询确定之前处理记忆,然后根据相似性而非其对回答查询的效用来检索结果单元。这种工作流程使下游回答者不得不对检索的候选进行去噪并重建查询特定的证据。我们提出了DeferMem,一种长时记忆框架,将该问题分解为高召回候选检索和查询条件证据蒸馏。DeferMem使用轻量级的段链接结构来组织原始历史并在查询时检索广泛的候选。然后,它应用一个通过DistillPO训练的内存蒸馏器,DistillPO是我们用于将高召回但高度嘈杂的候选蒸馏成一组忠实、自包含且查询条件的证据的强化学习算法。DistillPO将检索后的证据蒸馏制定为一个结构化的动作,包括信息选择和证据重写。它通过分解和门控奖励管道和结构对齐优势分配来优化此动作,门控奖励组件从有效性到质量检查,同时在早期暴露任务级别的正确性反馈,并将每个奖励分配给其负责的输出片段。在LoCoMo和LongMemEval-S上,DeferMem在问答准确性和记忆系统效率上超过了强大的基线,在达到最高问答准确度的同时实现了最快的运行时间和零商业API令牌成本的记忆操作。

英文摘要

Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.

2605.22391 2026-05-22 cs.AI cs.CL cs.CY Updated version

Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings

Jakub Radzikowski, Josef Chen

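AI summary This paper presents Epicure, a family of three sibling skip-gram food ingredient embedding models retrained on a multilingual recipe corpus. The models cover 1,790 canonical ingredients, and three different random-walk schemas give each model a different emphasis.

Details
AI abstract (translated from Chinese)

We present Epicure, a family of three sibling skip-gram ingredient embedding models retrained from scratch on a multilingual recipe corpus. We aggregate 4.14 million recipes from 11 sources spanning seven languages (English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian English) and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline. A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph, with 2,247 typed compound nodes across 15 categories, seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks only the co-occurrence graph, Chem walks only the typed compound metapaths, and Core blends the two through injected ingredient-ingredient walks under controlled mixing, which places each model at a different point on the chemistry-versus-recipe-context spectrum.

English abstract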

We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages, English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English, and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline. A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph, 2,247 typed compound nodes across 15 categories, seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both via injected ingredient-ingredient walks at controlled mixing, placing each model at a distinct point on the chemistry-vs-recipe-context spectrum.
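
The NPMI (normalised pointwise mutual information) edge weights mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes probabilities are estimated as the fraction of recipes that contain an ingredient or an ingredient pair; the function name `npmi_edges` and the toy recipes are hypothetical, and the paper's actual counting, smoothing, and thresholding choices are not stated in the abstract.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_edges(recipes):
    """Return {(a, b): npmi} for every ingredient pair that co-occurs at least once.

    Assumption: p(a) and p(a, b) are recipe-level document frequencies.
    """
    n = len(recipes)
    single = Counter()
    pair = Counter()
    for recipe in recipes:
        items = sorted(set(recipe))          # de-duplicate within a recipe
        single.update(items)
        pair.update(combinations(items, 2))  # keys are (a, b) with a < b

    edges = {}
    for (a, b), count in pair.items():
        p_ab = count / n
        if p_ab == 1.0:
            # The pair is in every recipe, so NPMI is 0/0; this sketch returns 0.0.
            edges[(a, b)] = 0.0
            continue
        pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
        edges[(a, b)] = pmi / -math.log(p_ab)  # normalise into [-1, 1]
    return edges


toy = [["tomato", "basil"], ["basil", "tomato"], ["rice", "soy"], ["soy", "rice"]]
print(npmi_edges(toy))
```

In the toy corpus the two pairs always appear together, so both edges sit at the NPMI maximum of 1; pairs that co-occur exactly as often as chance predicts would score 0.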

2605.22379 2026-05-22 cs.HC cs.AI cs.LG Updated version

Cross-Subject EEG Emotion Recognition Based on Temporal Asynchronous Alignment Contrastive Learning

Ying Xie, Yi Zheng, Zehui Xiao, Wenkai Lu, Mengting Liu

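Affiliations * School of Biomedical Engineering, Shenzhen Campus of Sun Yat-sen University; School of Computer Science and Technology, Tianjin University

AI summary This paper proposes a Temporal Asynchronous Alignment-based Contrastive Learning (TA2CL) framework for cross-subject EEG emotion recognition, addressing the recognition problems caused by inconsistent response timing across subjects. By improving the similarity calculation strategy, it makes the model more robust to inter-subject differences and temporal delays.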

Comments 16 pages, 7 figures

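Details
AI abstract (translated from Chinese)

With the advancement of science and technology, the importance of emotion research has become increasingly evident. In recent years, electroencephalography (EEG)-based emotion recognition has become an active research area owing to its objectivity and high temporal resolution. However, most existing methods focus on optimizing encoder structures to strengthen feature extraction and pay little attention to similarity calculation strategies; in particular, they overlook the potential temporal misalignment of responses across subjects. To address these shortcomings, this paper draws inspiration from the late interaction mechanism of ColBERT in natural language processing (NLP) and proposes a Temporal Asynchronous Alignment-based Contrastive Learning (TA2CL) framework. The method transforms the traditional global

English abstract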

With the advancement of science and technology, the importance of emotion research has become increasingly evident. Electroencephalography (EEG)-based emotion recognition has emerged as an active research area in recent years, owing to its objectivity and high temporal resolution. However, most existing methods focus on optimizing encoder structures to enhance feature extraction capabilities, while paying relatively little attention to similarity calculation strategies, particularly overlooking the potential temporal misalignment of responses among different subjects. To address these shortcomings, this paper draws inspiration from the late interaction mechanism of ColBERT in natural language processing (NLP) and proposes a Temporal Asynchronous Alignment-based Contrastive Learning (TA2CL) framework. This method transforms the traditional global "hard alignment" similarity calculation approach into a fine-grained local matching mechanism, enabling the model to adaptively search for and align "locally highly correlated" segments between two EEG signals, thereby effectively mitigating the effects of inter-subject differences and temporal delays. Experimental results demonstrate that the proposed method achieves strong performance across multiple public datasets. Specifically, on the FACED dataset, it achieves an accuracy of 64.5% for the nine-class classification task and 79.5% for the binary classification task, while on the SEED and SEED-V datasets, it achieves accuracies of 86.4% and 70.1%, respectively, validating the method's effectiveness and generalization capability.
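
The contrast between global "hard alignment" and fine-grained local matching can be sketched with a ColBERT-style MaxSim score in plain Python. This is a sketch under stated assumptions, not the TA2CL similarity itself: the abstract does not give the exact formula, the function names are hypothetical, and each EEG recording is assumed to be already encoded as a list of non-zero segment embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hard_alignment_score(segs_a, segs_b):
    """Global baseline: segment i of A is compared only with segment i of B."""
    return sum(cosine(a, b) for a, b in zip(segs_a, segs_b)) / len(segs_a)


def late_interaction_score(segs_a, segs_b):
    """MaxSim: each segment of A is matched to its most similar segment of B."""
    return sum(max(cosine(a, b) for b in segs_b) for a in segs_a) / len(segs_a)


# Subject B shows the same three responses as subject A, delayed by one segment.
subject_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
subject_b = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(hard_alignment_score(subject_a, subject_b))
print(late_interaction_score(subject_a, subject_b))
```

In this toy case the position-wise score is penalised by the delay, while the MaxSim score still finds a perfect partner for every segment, which is the intuition behind matching locally correlated segments.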

2605.22368 2026-05-22 cs.LG cs.AI cs.SE Updated version

VeriScale: Adversarial Test-Suite Scaling for Verifiable Code Generation

Yifan Bai, Xiaoyang Liu, Zihao Mou, Guihong Wang, Jian Yu, Shuhan Xie, Yantao Li, Yangyu Zhang, Jingwei Liang, Tao Luo

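Affiliations * School of Mathematical Sciences, Shanghai Jiao Tong University; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen; School of Mathematics, Jilin University; School of Mathematical Sciences, Tongji University; Zhiyuan College, Shanghai Jiao Tong University; School of Future Technology, South China University of Technology; Institute of Natural Sciences, Shanghai Jiao Tong University; MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University

AI summary This paper proposes VeriScale, a framework that uses adversarial implementations to expand and then reduce test suites, strengthening benchmarks for verifiable code generation. Experiments show that VerinaPlus exposes substantial model weaknesses, while VerinaLite retains discriminative power at low cost.

Details
AI abstract (translated from Chinese)

As large language models (LLMs) are increasingly used in software engineering, building high-quality benchmarks is essential for evaluating both the functional correctness and the formal verifiability of generated code. However, existing benchmarks are limited by the quantity and quality of their positive and negative test cases, which leads to overestimating model capabilities in generating specifications and implementations. To address this, we propose VeriScale, a new framework driven by adversarial implementations that consists of two stages: test-suite expansion, which constructs diverse and challenging test cases, and test-suite reduction, which compresses them into compact yet discriminative suites. Although VeriScale is general, we apply it to Verina to build VerinaPlus and VerinaLite. Experiments show that VerinaPlus exposes substantial model weaknesses on the SpecGen and CodeGen tasks, while VerinaLite retains discriminative power at low cost. The enhanced benchmarks and source code are publicly available at https://github.com/XiaoyangLiu-sjtu/VeriScale.

English abstract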

As large language models (LLMs) are increasingly deployed for software engineering, constructing high-quality benchmarks is crucial for evaluating not just the functional correctness, but also the formal verifiability of generated code. However, existing benchmarks are limited by the quantity and quality of positive and negative test cases, leading to an overestimation of model capabilities in generating specifications and implementations. To address this, we propose VeriScale, a novel framework driven by adversarial implementations. It consists of two stages: test-suite expansion to construct diverse and challenging test cases, and test-suite reduction to distill them into compact yet discriminative suites. While VeriScale is general, we instantiate it on Verina to construct VerinaPlus, which expands the original test suites by over 83×, and VerinaLite, a lightweight 14× variant. Our experiments across eight state-of-the-art LLMs demonstrate that VerinaPlus exposes substantial model weaknesses hidden by the original benchmark, evidenced by sharp score drops on both SpecGen and CodeGen tasks, whereas VerinaLite maintains this discriminative power at a fraction of the evaluation cost. The enhanced benchmarks and source code are publicly available at https://github.com/XiaoyangLiu-sjtu/VeriScale.

2605.22364 2026-05-22 cs.AI Updated version

Scaling Observation-aware Planning in Uncertain Domains

Adrian Zvizdenco, Arthur Conrado Veiga Bosquetti, Alberto Lluch Lafuente, Christoph Matheja

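Affiliations * Technical University of Denmark

AI summary This paper studies how to scale observation-aware planning in uncertain domains, using (sub-)symbolic techniques to scale the solving of decidable fragments of the OOP, namely the Sensor Selection Problem and the Positional Observability Problem. It identifies sensible observation functions by decomposing POMDPs, which improves performance.

Details
AI abstract (translated from Chinese)

Deciding which sensing capabilities to deploy on an agent in uncertain domains is a fundamental engineering challenge that requires balancing task achievability against the high costs of hardware and processing. This problem was previously formalized as the Optimal Observability Problem (OOP), based on the well-known Partially Observable Markov Decision Process (POMDP) model for decision-making. This work studies (sub-)symbolic techniques to scale the solving of decidable fragments of the OOP, namely the Sensor Selection Problem (SSP) and the Positional Observability Problem (POP). Besides improving the original approach based on parameter synthesis, we develop a new solving method that identifies sensible observation functions by decomposing POMDPs, improving performance by 3 and 5 orders of magnitude in instance size and runtime, respectively.

English abstract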

Deciding which sensing capabilities to deploy on an agent in uncertain domains is a fundamental engineering challenge, in which one balances task achievability against the high costs of hardware and processing. This problem has previously been formalized as the Optimal Observability Problem (OOP), based on the well-known Partially Observable Markov Decision Process (POMDP) model for decision-making. This work studies (sub-)symbolic techniques to scale solving of decidable fragments of the OOP, namely the Sensor Selection Problem (SSP) and the Positional Observability Problem (POP). Besides improving the original approach based on parameter synthesis, we develop a new solving method that identifies sensible observation functions via decomposition of POMDPs, improving performance by 3 and 5 orders of magnitude for instance size and runtime, respectively.

2605.22363 2026-05-22 math.OC cs.AI cs.GT Updated version

Incentive-Aligned Vehicle-to-Vehicle Energy Trading via Nash-Integrated Multi-Agent Reinforcement Learning

Yujin Lin, Yue Yang, Hao Wang

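Affiliations * Department of Data Science and AI, Faculty of IT, Monash University, Australia; Monash Energy Institute, Monash University, Australia

AI summary This paper proposes Nash-MADDPG, which integrates the Nash Bargaining Solution into Multi-Agent Deep Deterministic Policy Gradient for incentive-aligned vehicle-to-vehicle energy trading. It improves social welfare and trading volume and achieves markedly better fairness.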

Comments The 24th IEEE International Conference on Industrial Informatics, 2026

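Details
AI abstract (translated from Chinese)

Vehicle-to-vehicle (V2V) energy trading enables decentralized peer-to-peer energy exchange among electric vehicles (EVs), reducing reliance on the grid while monetizing surplus capacity. However, coordinating self-interested EV agents with different charging needs and uncertain arrival-departure schedules remains challenging. Existing approaches either require centralized optimization, which is computationally limited, or lack fairness guarantees. This paper integrates the Nash Bargaining Solution into Multi-Agent Deep Deterministic Policy Gradient, yielding Nash-MADDPG, for incentive-aligned V2V energy trading. Nash bargaining determines efficient bilateral pricing, while Nash-guided price-proximity rewards steer agent learning toward bargaining-optimal strategies. In an evaluation over 30 days of continuous operation, social welfare improves by 61.6% and trading volume by 62.9% over Double Auction, with better fairness as well, such as a 40.1% improvement in Jain's index. Tests with 6 to 100 agents over a 30-day horizon with continuous vehicle turnover confirm scalability across population sizes and empirically stable pricing near the Nash Bargaining benchmark.

English abstract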

Vehicle-to-vehicle (V2V) energy trading enables decentralized peer-to-peer energy exchange among electric vehicles (EVs), reducing grid dependency while monetizing surplus capacity. However, coordinating self-interested EV agents with diverse charging needs and uncertain arrival-departure schedules remains challenging. Existing approaches either require centralized optimization with computational limitations or lack fairness guarantees. This paper integrates the Nash Bargaining Solution into Multi-Agent Deep Deterministic Policy Gradient, namely Nash-MADDPG, for incentive-aligned V2V energy trading. Nash bargaining determines efficient bilateral pricing, while Nash-guided price proximity rewards align agent learning toward bargaining-optimal strategies. Evaluation over 30 days of continuous operation shows a 61.6% improvement in social welfare and a 62.9% improvement in trading volume over Double Auction, along with superior fairness, including a 40.1% improvement in Jain's index. Testing across 6-100 agents over a 30-day horizon with continuous vehicle turnover confirms scalability across population size and empirically stable pricing near the Nash Bargaining benchmark.
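
For readers unfamiliar with the fairness metric, Jain's index over n non-negative allocations x_i is (sum x_i)^2 / (n * sum x_i^2); it equals 1 when every agent receives the same amount and 1/n when a single agent receives everything. The short Python sketch below illustrates the formula and how a relative improvement is computed. Which per-agent quantity the paper feeds into the index is not stated in the abstract, so the inputs here are hypothetical.

```python
def jain_index(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in the range [1/n, 1]."""
    n = len(allocations)
    total = sum(allocations)
    squares = sum(x * x for x in allocations)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero allocation")
    return (total * total) / (n * squares)


def relative_improvement(baseline, new):
    """Fractional change of `new` over `baseline`, e.g. 0.4 means +40%."""
    return (new - baseline) / baseline


# Hypothetical per-agent benefits under two trading mechanisms.
uneven = [10.0, 0.0, 0.0, 0.0]
even = [5.0, 5.0, 5.0, 5.0]
print(jain_index(uneven), jain_index(even))
```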

2605.22357 2026-05-22 cs.CV cs.AI Updated version

VEELA: A Clinically-Constrained Benchmark for Liver Vessel Segmentation in Computed Tomography Angiography

Ziya Ata Yazıcı, N. Sinem Gezer, İlkay Öksüz, İlker Özgür Koska, Tuğçe Toprak, Pervin Bulucu, Ufuk Beşenk, A. Emre Kavur, Pierre-Henri Conze, Hazım Kemal Ekenel, Oğuz Dicle, Mustafa Ege Şeker, Mustafa Said Kartal, Ariorad Moniri, Orhan Özkan, Osman Faruk Bayram, Hakan Polat, Musa Balcı, Ece Tuğba Cebeci, Baran Cılga, Kardelen Peçenek, M. Alper Selver

Affiliations * Department of Radiology, Dokuz Eylul University; Department of Computer Engineering, Istanbul Technical University; Institute of Natural and Applied Sciences, Dokuz Eylul University; Department of Electrical and Electronics Engineering, Dokuz Eylul University; Department of Radiology, University of Wisconsin-Madison; School of Medicine, Sivas Cumhuriyet University; School of Medicine, Acibadem Mehmet Ali Aydinlar University; Department of Artificial Intelligence Engineering, Bahçeşehir University; Faculty of Pharmacy, Sivas Cumhuriyet University

AI summary This paper introduces the VEELA dataset for hepatic and portal vessel segmentation in CT angiography. Strict manual annotation under multi-expert consensus keeps the annotations clinically realistic and accurate, and several complementary metrics are used to evaluate vessel segmentation from multiple perspectives.

Comments 27 pages, 25 figures, 5 tables

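Details
AI abstract (translated from Chinese)

Accurate segmentation of hepatic and portal vessels in contrast-enhanced computed tomography angiography (CTA) remains challenging because of complex vascular topology, limited peripheral visibility, and acquisition-induced ambiguities. Although existing public datasets provide valuable benchmarks, few include clinically realistic annotation constraints. We introduce VEELA (Vessel Extraction and Extrication for Liver Analysis), a rigorously curated liver vessel dataset derived from 40 CTA scans inherited from the CHAOS grand-challenge cohort. All vessels were manually delineated slice by slice under multi-expert consensus, following a strict visibility-driven annotation policy and avoiding anatomically inferred interpolation. This design explicitly captures anatomical variability and imaging-related uncertainty. As a continuation of the CHAOS challenge, VEELA enables reproducible cross-benchmark evaluation while extending the scope to fine-grained hepatic and portal vessel segmentation. We further establish a standardized benchmarking framework and analyze complementary evaluation metrics, including topology-aware (clDice), overlap-based (IoU), boundary-sensitive (NSD), and geometry-aware (area, length) measures. Our results show that different metrics capture different aspects of vascular integrity, underscoring the need for multi-perspective evaluation in clinically meaningful vessel segmentation. VEELA is publicly released to facilitate reproducible research and support the development of robust vessel segmentation methods. Researchers can access the evaluation metrics, dataset, and submission platform at https://www.synapse.org/Synapse:syn65471967.

English abstract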

Accurate segmentation of hepatic and portal vessels in contrast-enhanced computed tomography angiography (CTA) remains challenging due to complex vascular topology, peripheral visibility limitations, and acquisition-induced ambiguities. While existing public datasets offer valuable benchmarks, few include clinically realistic annotation constraints. We introduce VEELA (Vessel Extraction and Extrication for Liver Analysis), a rigorously curated liver vessel dataset derived from 40 CTA scans inherited from the CHAOS grand-challenge cohort. All vessels were manually delineated slice-by-slice under multi-expert consensus, using a strict visibility-driven annotation policy and avoiding anatomically inferred interpolation. This design explicitly captures anatomical variability and imaging-related uncertainty. As a continuation of the CHAOS challenge, VEELA enables reproducible cross-benchmark evaluation while extending the scope to fine-grained hepatic and portal vessel segmentation. We further establish a standardized benchmarking framework and analyze complementary evaluation metrics, including topology-aware (clDice), overlap-based (IoU), boundary-sensitive (NSD), and geometry-aware (area, length) measures. Our results demonstrate that different metrics capture distinct aspects of vascular integrity, underscoring the necessity of multi-perspective evaluation for clinically meaningful vessel segmentation. VEELA is publicly released to facilitate reproducible research and support the development of robust vascular segmentation methods. Researchers can access the evaluation metrics, dataset, and submission platform at https://www.synapse.org/Synapse:syn65471967.
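
Of the metrics listed, the overlap-based one is the simplest to state: IoU is the size of the intersection of the predicted and reference masks divided by the size of their union. The sketch below computes IoU, together with the closely related Dice coefficient, on flattened binary masks in plain Python. It is an illustration only: the function name is hypothetical, the handling of two empty masks is an assumed convention, and it is not the benchmark's official evaluation code. The topology-aware clDice additionally requires skeletonisation and is not shown.

```python
def iou_and_dice(pred, truth):
    """Return (IoU, Dice) for two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_n = sum(1 for p in pred if p)
    truth_n = sum(1 for t in truth if t)
    union = pred_n + truth_n - inter
    if union == 0:
        # Both masks are empty; treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    return inter / union, 2 * inter / (pred_n + truth_n)


# Toy example: one voxel agrees, one is a false positive, one a false negative.
pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
print(iou_and_dice(pred, truth))
```

Because thin peripheral vessels contribute few voxels, an overlap score like this can stay high even when small branches are missed, which is one reason the abstract argues for combining it with topology- and boundary-sensitive measures.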

2605.22355 2026-05-22 cs.CL cs.AI cs.LG Updated version

TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu

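Affiliations * Alibaba Group; AMAP

AI summary This paper presents TransitLM, a dataset of 13 million public transit route planning records for map-free transit route generation, and shows that models trained on the data can generate valid routes.

Details
AI abstract (translated from Chinese)

Public transit route planning traditionally relies on structured map infrastructure and complex routing engines, and existing datasets do not support training models to bypass this dependency. We present TransitLM, a dataset of over 13 million transit route planning records from four Chinese cities, covering 120,845 stations and 13,666 lines, which serves as a continual pre-training corpus and as benchmark data for three evaluation tasks. Experiments show that an LLM trained on TransitLM generates structurally valid routes with high accuracy and implicitly maps arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results indicate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD-ML/TransitLM, and the evaluation code is at https://github.com/HotTricker/TransitLM.

English abstract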

Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre-training corpus and benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD-ML/TransitLM, with evaluation code at https://github.com/HotTricker/TransitLM.

2605.22344 2026-05-22 cs.CV cs.AI cs.MM Updated version

Bernini: Latent Semantic Planning for Video Diffusion

Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan

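Affiliations * Bernini Team

AI summary This paper proposes Bernini, a framework that uses a multimodal large language model for semantic planning and a diffusion model for pixel generation, unifying video generation and editing and improving generalization on editing tasks.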

Comments Project Page: https://bernini-ai.github.io/

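Details
AI abstract (translated from Chinese)

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs reason over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models generate images and videos with photorealistic fidelity. We argue that the two families can be unified through a simple division of labor: MLLMs handle semantic planning, and diffusion models generate pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer generates pixels conditioned on this plan, combined with text features and, for editing, source VAE features that preserve detail. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, which preserves the pretrained strengths of both while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE) and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance on a wide range of video generation and editing benchmarks, and the MLLM's pretrained understanding yields strong generalization on challenging editing tasks.

English abstract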

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.

2605.22343 2026-05-22 cs.MA cs.AI cs.SE Updated version

Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators

Chengcheng Wang, Qinhua Xie, Wei He, Jianyuan Guo, Shiqi Wang, Chang Xu

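Affiliations * University of Sydney; East China Normal University; TokenRhythm AI; City University of Hong Kong

AI summary This paper proposes Sibyl-AutoResearch, a self-evolving framework that addresses how current autonomous research systems lose trial experience. Two auditable conversion units, trial-to-behavior conversion and trial-to-harness-behavior conversion, link trial outcomes to later research actions and to system updates.

Details
AI abstract (translated from Chinese)

Autonomous research systems increasingly make the scientific workflow executable: agents can propose ideas, run code, inspect results, and draft papers. But an executable workflow does not by itself produce research judgment. We analyze where current systems lose trial experience: weak evidence turns into prose, pilot signals turn into broad claims, memory stays textual, and recurring process failures do not change later behavior. We introduce Sibyl-AutoResearch, a self-evolving AutoResearch framework built around Scientific Trial-and-Error Harnesses. A harness lets agents run bounded trials, preserve positive and negative outcomes, and route the lessons into later planning, validation, claim scoping, scheduling, critique, writing, and harness repair. We formalize this process with two auditable conversion units: trial-to-behavior conversion, which links trial signals to later research actions, and trial-to-harness-behavior conversion, which links recurring process failures to system updates. We implement the framework in SIBYL, a file-backed autonomous research system that exposes the state, roles, memory, gates, and artifact traces needed to inspect these conversion paths. A retrospective audit identifies eight high-confidence conversion events, with a median latency of one iteration and a maximum latency of three. A recovered-failure registry further shows how five naturally occurring failure classes, including duplicate results, stale numbers, and unsupported statistics, were blocked, downgraded, or routed into later repair. These traces do not establish a comparative performance claim; they show that the proposed conversion units can be recovered from realistic autonomous-research workspaces. The SIBYL framework and system are available at https://github.com/Sibyl-Research-Team/AutoResearch-SibylSystem.

English abstract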

Autonomous research systems increasingly make the scientific workflow executable: agents can propose ideas, run code, inspect results, and draft papers. But executable workflows do not by themselves produce research judgment. We analyze where current systems lose trial experience: weak evidence becomes prose, pilot signals become broad claims, memory remains textual, and recurring process failures do not change later behavior. We introduce Sibyl-AutoResearch, a self-evolving AutoResearch framework built around Scientific Trial-and-Error Harnesses. A harness lets agents run bounded trials, preserve positive and negative outcomes, and route lessons into later planning, validation, claim scope, scheduling, critique, writing, and harness repair. We formalize this through two auditable conversion units: trial-to-behavior conversion, which links trial signals to later research actions, and trial-to-harness-behavior conversion, which links recurring process failures to system updates. We implement the framework in SIBYL, a file-backed autonomous research system that exposes the state, roles, memory, gates, and artifact traces needed to inspect these conversion paths. A retrospective audit identifies eight high-confidence conversion events, with a median latency of one iteration and a maximum latency of three iterations. A recovered-failure registry further shows how five naturally occurring failure classes, including duplicate results, stale numbers, and unsupported statistics, were blocked, downgraded, or routed into later repair. These traces do not establish a comparative performance claim; they show that the proposed conversion units are recoverable from realistic autonomous-research workspaces. The SIBYL framework and system are available at https://github.com/Sibyl-Research-Team/AutoResearch-SibylSystem.

2605.22342 2026-05-22 cs.CV cs.AI Updated version

4D-GSW: Kinematic-Aware Spatio-Temporal Consistent Watermarking for 4D Gaussian Splatting

Sifan Zhou, Hang Zhang, Yuhang Wang, Ming Li

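Affiliations * Southeast University; Guangdong Laboratory of Artificial Intelligence and Digital Economy; University of Chinese Academy of Sciences

AI summary This paper proposes 4D-GSW, a kinematic-aware, spatio-temporally consistent watermarking technique that embeds robust copyright information in 4D Gaussian Splatting while preserving high spatio-temporal consistency.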

Comments 9 pages main paper, 7 figures, 18 pages in total

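Details
AI abstract (translated from Chinese)

Although 4D Gaussian Splatting (4DGS) has revolutionized high-fidelity dynamic reconstruction, protecting the intellectual property of these assets remains an open challenge. Conventional steganographic techniques often neglect the underlying kinematic manifolds, causing non-physical artifacts such as severe temporal flickering and "FVD collapse". To address this, we propose 4D-GSW, a kinematic-aware watermarking framework that embeds robust copyright information while preserving high spatio-temporal consistency. Unlike prior 4D steganography, our approach explicitly addresses the physical coherence of motion trajectories. We introduce a Spatio-Temporal Curvature (STC) metric to identify "Dynamic Instants" and adaptively gate watermark gradient injection, shielding critical motion manifolds from non-physical perturbations. To ensure global coherence under complex deformations, we propose a joint HMM-MRF energy minimization model that synchronizes watermark phases within both temporal trajectories and spatial neighborhoods. In addition, an anisotropic gradient routing mechanism keeps watermark embedding strictly decoupled from photometric reconstruction fidelity. Extensive experiments show that our method hides watermarks robustly, resists various attacks, and maintains high rendering quality and spatio-temporal consistency.

English abstract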

While 4D Gaussian Splatting (4DGS) has revolutionized high-fidelity dynamic reconstruction, safeguarding the intellectual property of these assets remains an open challenge. Conventional steganographic techniques often neglect the underlying kinematic manifolds, triggering non-physical artifacts such as severe temporal flickering and "FVD collapse". To address this, we propose 4D-GSW, a kinematic-aware watermarking framework designed to embed robust copyright information while preserving high spatio-temporal consistency. Unlike prior 4D steganography that primarily focuses on opacity-guided invisibility, our approach explicitly addresses the physical coherence of motion trajectories. We introduce a Spatio-Temporal Curvature (STC) metric to identify "Dynamic Instants," adaptively gating watermark gradient injection to shield critical motion manifolds from non-physical perturbations. To ensure global coherence across complex deformations, we formulate a joint HMM-MRF energy minimization model that synchronizes watermark phases within both temporal trajectories and spatial neighborhoods. Furthermore, an anisotropic gradient routing mechanism ensures that watermark embedding remains strictly decoupled from photometric reconstruction fidelity. Extensive experiments demonstrate the superior performance of our method in robustly hiding watermarks while resisting various attacks and maintaining high rendering quality and spatio-temporal consistency.

2605.22331 2026-05-22 cs.LG cs.AI cs.DC Updated version

SepsisAI Orchestrator: A Containerized and Scalable Platform for Deploying AI Models and Real-Time Monitoring in Early Sepsis Detection

Santiago Ospitia, John Sanabria, John Garcia-Henao

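Affiliations * School of Systems Engineering and Computing, University of Valle; Digital Medicine Unit, Balgrist University Hospital; Nucleus-AI Research

AI summary This paper presents the SepsisAI-Orchestrator platform, which addresses the challenges of deploying AI models for early sepsis detection by integrating HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier, and a Streamlit clinical dashboard. Load tests reveal a U-shaped scaling behavior.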

Comments 13 pages, 5 figures. Submitted to BioCARLA 2025 Workshop

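Details
AI abstract (translated from Chinese)

Despite strong predictive results in the clinical machine learning literature, translating these models into bedside use remains limited by systems-level barriers: heterogeneous data representations, the lack of standardized deployment workflows, and a mismatch between research prototypes and the concurrency and latency requirements of hospital environments. We present SepsisAI-Orchestrator, an open-source modular platform that addresses this deployment gap for early sepsis detection. The platform integrates HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served via REST APIs, and a Streamlit clinical dashboard, orchestrated with Docker and Kubernetes. A previously validated LightGBM model (F1 0.87-0.94 on PhysioNet 2019) is reused without modification; the contribution lies in the surrounding infrastructure and its empirical characterization under load. In k6 tests with 50-1000 concurrent virtual users, we find that the replica count must match the host's physical CPU thread count: scaling from 3 to 12 replicas on a 12-thread CPU reduces p95 latency from 3.3 s to 1.41 s (a 57.3% reduction) and eliminates all request failures, whereas over-provisioning to 24 or 48 replicas degrades performance because of scheduler contention. To our knowledge, this U-shaped scaling behavior has not previously been quantified for clinical AI inference workloads. We do not claim prospective clinical validation. Source code and deployment manifests are available at https://github.com/nucleusai/sepsisai-orchestrator.

English abstract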

Despite strong predictive results in the clinical machine learning literature, the translation of these models into bedside use remains limited by systems-level barriers: heterogeneous data representations, the absence of standardized deployment workflows, and a mismatch between research prototypes and the concurrency and latency requirements of hospital environments. We present the SepsisAI-Orchestrator, an open-source modular platform that addresses this deployment gap for early sepsis detection. The platform integrates HL7 FHIR-inspired Clinical Document Architecture (CDA) preprocessing, NoSQL storage, a containerized LightGBM classifier served via REST APIs, and a Streamlit clinical dashboard, orchestrated with Docker and Kubernetes. A previously validated LightGBM model (F1 0.87-0.94 on PhysioNet 2019) is reused without modification; the contribution lies in the surrounding infrastructure and its empirical characterization under load. Using k6 with 50-1000 concurrent virtual users, we find that replica count must be matched to the physical CPU thread count of the host: scaling from 3 to 12 replicas on a 12-thread CPU reduces p95 latency from 3.3s to 1.41s (57.3% reduction) and eliminates all request failures, while over-provisioning to 24 or 48 replicas degrades performance due to scheduler contention. To our knowledge this U-shaped scaling behavior has not been quantified previously for clinical AI inference workloads. We do not claim prospective clinical validation. Source code and deployment manifests are available at https://github.com/nucleusai/sepsisai-orchestrator.
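
The reported latency gain can be checked with one line of arithmetic: the relative reduction is (before - after) / before. The Python sketch below reproduces the 57.3% figure from the two p95 values quoted in the abstract; the helper name is hypothetical and nothing here comes from the project's code.

```python
def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`, e.g. 0.25 means 25% lower."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before


# p95 latency in seconds with 3 replicas and with 12 replicas, from the abstract.
reduction = relative_reduction(3.3, 1.41)
print(f"{reduction:.1%}")  # prints 57.3%
```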

2605.22321 2026-05-22 cs.CR cs.AI cs.SE Updated version

Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions

Jianan Ma, Xiaohu Du, Ruixiao Lin, Yaoxiang Bian, Jialuo Chen, Jingyi Wang, Xiaofang Yang, Shiwen Cui, Changhua Meng, Xinhao Deng, Zhen Wang

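Affiliations * Tsinghua University

AI summary This paper proposes a multi-dimensional evasion framework targeting agent systems based on large language models (LLMs). It introduces three stealthy attack vectors (temporal, spatial, and semantic), quantifies these threats systematically, demonstrates their effect in practical threat scenarios, and reveals systemic, architecture-level vulnerabilities in existing autonomous agent systems.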

Comments 21 pages, 9 figures, 7 tables. Code and data available at https://github.com/antgroup/Agent3Sigma-Stage

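Details
AI abstract (translated from Chinese)

As autonomous agents (e.g., OpenClaw) increasingly use deep system-level privileges to execute complex tasks, they introduce severe, unmitigated security risks. Current vulnerability analyses mostly focus on single-turn, stateless behaviors and overlook the expanded attack surface of stateful, multi-turn interactions and dynamic tool invocations. In this paper, we propose a novel multi-dimensional evasion framework targeting LLM-based agent systems. We introduce three stealthy attack vectors: (1) temporal evasion, which fragments malicious payloads across successive interaction turns; (2) spatial evasion, which hides payloads inside complex external artifacts to evade standard LLM parsing mechanisms; and (3) semantic evasion, which conceals malicious intent beneath benign contextual noise. To quantify these threats systematically, we construct A3S-Bench, a comprehensive benchmark of 2,254 real-world agent execution trajectories. Evaluating a standard agent framework integrated separately with 10 mainstream LLM backbones against 20 practical threat scenarios, we show that our evasion framework raises the average risk trigger rate from a 28.3% baseline to 52.6%. These findings reveal systemic, architecture-level vulnerabilities in current autonomous agent systems that existing defenses fail to address, highlighting the urgent need for defense mechanisms designed for these unique threats.

English abstract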

As autonomous agents (e.g., OpenClaw) increasingly operate with deep system-level privileges to execute complex tasks, they introduce severe, unmitigated security risks. Current vulnerability analyses overwhelmingly focus on single-turn, stateless behaviors, overlooking the expanded attack surface inherent in stateful, multi-turn interactions and dynamic tool invocations. In this paper, we propose a novel, multi-dimensional evasion framework targeting LLM-based agent systems. We introduce three stealthy attack vectors: (1) Temporal evasion, which fragments malicious payloads across sequential interaction turns; (2) Spatial evasion, which conceals payloads within complex external artifacts that evade standard LLM parsing mechanisms; and (3) Semantic evasion, which obscures malicious intents beneath benign contextual noise. To systematically quantify these threats, we construct A3S-Bench, a comprehensive benchmark comprising 2,254 real-world agent execution trajectories. Evaluating a standard agent framework separately integrated with 10 mainstream LLM backbones against 20 practical threat scenarios, we demonstrate that our evasion framework elevates the average risk trigger rate from a 28.3% baseline to 52.6%. These findings reveal systemic, architecture-level vulnerabilities in current autonomous agent systems that existing defenses fail to address, highlighting an urgent need for defense mechanisms tailored to these unique threats.

2605.22306 2026-05-22 cs.MA cs.AI Updated version

ACCoRD: Actor-Critic Conflict Resolution with Deep learning for O-RAN xApps

Cezary Adamczyk, Adrian Kliks

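Affiliations * Institute of Radiocommunications; Poznan University of Technology

AI summary This paper proposes ACCoRD, a deep-learning-based actor-critic conflict resolution method that resolves control conflicts among O-RAN xApps in near-real time. An artificial neural network trained with the reinforcement learning algorithm PPO-Clip improves on the efficiency of rule-based approaches in medium and high traffic scenarios.

Details
AI abstract (translated from Chinese)

Conflict Mitigation (ConMit) is a key part of intelligent network control in Open Radio Access Networks (O-RAN). This paper proposes a method named ACCoRD that resolves detected control conflicts in the Near-Real-Time RAN Intelligent Controller using an Artificial Neural Network (ANN) trained with the reinforcement learning algorithm PPO-Clip. The ANN analyzes data about the network and the conflicting control decisions to infer the optimal Conflict Resolution (CR) action. After each resolved conflict, the CR agent collects feedback from the network to assess its efficiency and adjusts the ANN's weights during batch training. The proposed approach is evaluated on simulation data, and a new methodology for evaluating CR solutions is proposed. Results show that the proposed ANN-based method improves on the efficiency of rule-based approaches by significantly reducing negative network events caused by conflicting control decisions.

English abstract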

Conflict Mitigation (ConMit) is a crucial part of intelligent network control in Open Radio Access Networks (O-RAN). In this paper, we propose a method named ACCoRD to resolve detected control conflicts in the Near-Real-Time RAN Intelligent Controller using a Conflict Resolution (CR) Agent with an Artificial Neural Network (ANN) trained with the reinforcement learning algorithm PPO-Clip. The implemented ANN analyzes data about the network and conflicting control decisions to infer optimal CR actions. The CR Agent gathers feedback from the network after each resolved conflict to assess its efficiency and adjust the ANN's weights during batch training. The evaluation of the proposed approach is based on simulation data. A new methodology for evaluating CR solutions is proposed. Results show that the proposed ANN-based method improves on the efficiency of rule-based approaches by significantly reducing negative network events caused by conflicting control decisions in medium and high traffic scenarios.
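
For context, PPO-Clip trains a policy by maximising a clipped surrogate objective: for each sampled action the term is min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio between the new and old policy and A is the advantage estimate. The Python sketch below shows that standard objective only; the function names and the default eps = 0.2 are illustrative assumptions, and the paper's network, reward design, and hyperparameters are not described in the abstract.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Clipped surrogate for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clip_objective(ratios, advantages, epsilon=0.2):
    """Batch mean of the clipped surrogate, the quantity PPO-Clip maximises."""
    terms = [ppo_clip_term(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# A good action (A > 0) gets no extra credit once the ratio exceeds 1 + eps.
print(ppo_clip_term(1.5, 1.0))
# A bad action (A < 0) keeps its full penalty when the ratio is below 1 - eps.
print(ppo_clip_term(0.5, -1.0))
```

The clipping bounds how far a single batch update can move the policy away from the one that collected the data, which suits an agent that updates in batches after gathering feedback from resolved conflicts.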

2605.22304 2026-05-22 cs.AI cs.DB cs.LG Updated version

Evaluation of Pipelines for Data Integration into Knowledge Graphs

Marvin Hofer, Erhard Rahm

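Affiliations * ScaDS.AI Dresden/Leipzig; Leipzig University

AI summary This paper proposes the KGI-Bench benchmark for evaluating pipelines that integrate different input data into an existing knowledge graph. It analyzes the quality of the resulting knowledge graph with three metrics (coverage, correctness, and consistency) and provides benchmark datasets for the movie domain, which are used to evaluate 12 pipelines.

Details
AI abstract (translated from Chinese)

Integrating new data into a knowledge graph (KG) typically involves different tasks executed within workflows or pipelines. Many pipelines are possible for a given integration problem, but there is not yet a general approach for evaluating the overall quality and performance of such pipelines in order to determine the best choice. We therefore propose a new benchmark, KGI-Bench, for evaluating pipelines that integrate different kinds of input data into an existing KG. We evaluate pipelines by analyzing their output, i.e., the updated KG, with three complementary quality metrics: coverage, correctness, and consistency. We also provide benchmark datasets for the movie domain (a seed KG, overlapping input data in three formats, and a reference KG as ground truth). To demonstrate the applicability and usefulness of the proposed benchmark, we comparatively evaluate 12 pipelines and analyze their behavior across different input data formats and design choices.

English abstract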

Integrating new data into knowledge graphs (KG) typically involves different tasks that are executed within workflows or pipelines. There are many possible pipelines for a specific integration problem, but there is not yet a general approach to evaluate the overall quality and performance of such pipelines and so determine the best choices. We therefore propose a new benchmark, KGI-Bench, to evaluate integration pipelines that ingest different kinds of input data into an existing KG. We evaluate pipelines by analyzing their output, i.e., the updated KG, with three complementary quality metrics: coverage, correctness, and consistency. We also provide benchmark datasets (seed KG, overlapping input data in three formats, reference KG as a ground truth) for the movie domain. To demonstrate the applicability and usefulness of the proposed benchmark, we comparatively evaluate 12 pipelines and analyze their behavior across different input data formats and design choices.

2605.22300 2026-05-22 cs.AI cs.LG cs.MA 版本更新

Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence

跨领域基准测试揭示协调的AI代理何时能改进基于部分证据的科学推断

Fiona Y. Wong, Markus J. Buehler

发表机构 * Laboratory for Atomistic and Molecular Mechanics (LAMM)(原子分子力学实验室) Department of Biological Engineering(生物工程系) Department of Mechanical Engineering(机械工程系) Department of Civil and Environmental Engineering(土木与环境工程系) Center for Computational Science and Engineering, Schwarzman College of Computing(计算科学与工程中心) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文通过跨领域基准测试探讨协调AI代理在部分证据下提升科学推断的有效性,发现当不同学科各自捕捉现象部分时,跨通道复合方法优于单一通道基线,但在某些情况下分解并不总是提升整体性能。

详情
AI中文摘要

科学证据通常分散在不同的仪器、数据库和学科中,因此没有单一来源能完整记录现象。这使得我们难以判断协调的AI代理何时能比更简单的科学工作流带来额外价值。我们通过涵盖四个科学任务的跨领域基准测试评估了这一问题:将分子结构映射到音乐表示、检测科学史上的范式转变、识别媒介传播疾病的出现,以及甄别凌星系外行星候选体。每个案例均使用冻结的评估样本集、预定义的评分协议、明确的基线、消融或空对照,以及声明的局限性。结果界定了三种运行情形。当不同学科各自只捕捉现象的一部分时,跨通道复合方法优于单通道基线:气候-媒介疾病出现任务达到AUROC 0.944,系外行星甄别达到AUROC 0.955。然而,系外行星工作流与强大的联合摘要基线基本持平,表明分解并不总能提升整体性能。当单一信号占主导时(如范式转变检测),协调主要改善可解释性和可追溯性。对于分子声音化,收益体现在表征层面而非预测层面。ScienceClaw x Infinite为此评估提供了可审计的工件(artifact)和溯源层。因此,只有当相应的性能、溯源或表征主张得到明确比较对象的支持时,该基准才认定协调具有价值。

英文摘要

Scientific evidence often spans instruments, databases, and disciplines, so no single source records the full phenomenon. This makes it difficult to determine when coordinated AI agents add value over simpler scientific workflows. We evaluate this question with a cross-domain benchmark spanning four scientific tasks: mapping molecular structure into musical representations, detecting historical paradigm shifts in science, identifying vector-borne disease emergence, and vetting transiting-exoplanet candidates. Each case uses a frozen evaluation panel, predefined scoring protocols, explicit baselines, ablations or null controls, and stated limitations. The results define three operating regimes. When different disciplines each capture only part of the phenomenon, cross-channel composites improve over single-channel baselines: climate-vector emergence reaches AUROC 0.944 and exoplanet vetting reaches AUROC 0.955. However, the exoplanet workflow is effectively tied with a strong combined-summary baseline, showing that decomposition does not always improve top-line performance. When one signal dominates, as in paradigm-shift detection, coordination mainly improves interpretation and traceability. For molecular sonification, the gain is representational rather than predictive. ScienceClaw x Infinite provides the auditable artifact and provenance layer for this evaluation. The benchmark therefore assigns value to coordination only when the corresponding performance, provenance, or representation claim is supported by explicit comparators.

2605.22287 2026-05-22 cs.AI 版本更新

SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

SciCore-Mol: 通过可插拔分子认知模块增强大语言模型

Yuxuan Chen, Changwei Lv, Yunduo Xiao, Zhongjing Du, Daquan Zhou, Yukun Yan, Zheni Zeng, Zhiyuan Liu

发表机构 * School of Electronic and Computer Engineering, Peking University, Shenzhen, China(电子与计算机工程学院,北京大学深圳校区) Tsinghua University, Beijing, China(清华大学,北京) School of Intelligence Science and Technology, Nanjing University, Suzhou, China(智能科学与技术学院,南京大学苏州校区)

AI总结 本文提出SciCore-Mol框架,通过三个深度集成的可插拔认知模块解决大语言模型处理异构科学数据(如分子)时的语义鸿沟问题,实现分子理解、生成、反应预测和化学知识的综合性能提升。

Comments 15 pages, 4 figures, 9 tables. Preprint

详情
AI中文摘要

大型语言模型(LLMs)是实现万物智能范式的中心,但处理异构科学数据如分子时面临根本性挑战:离散语言符号与拓扑分子或连续反应数据之间的固有差距导致文本推理中的信息丢失和语义噪声。我们提出了SciCore-Mol,一个模块化框架,通过三个深度集成的可插拔认知模块弥合这一差距:拓扑感知感知模块、基于潜在扩散的分子生成模块以及反应感知推理模块。每个模块通过学习的表示接口连接到LLM主干,使信息交换比仅使用文本工具反馈更丰富。我们在多样化的化学任务上的实验表明,SciCore-Mol在分子理解、生成、反应预测和一般化学知识方面实现了强大的综合性能,8B参数开源系统在多个维度上可与甚至超越专有大模型竞争。这项工作为通过解耦、可插拔和灵活编排的模块系统为LLM提供科学专业知识提供了系统蓝图,对药物设计、化学合成和更广泛的科学发现有直接意义。

英文摘要

Large Language Models (LLMs) are central to the one-for-all intelligent paradigm, but they face a fundamental challenge when dealing with heterogeneous scientific data such as molecules: the inherent gap between discrete linguistic symbols and topological molecular or continuous reaction data leads to significant information loss and semantic noise in text-based reasoning. We propose SciCore-Mol, a modular framework that bridges this gap through three deeply integrated pluggable cognitive modules: a topology-aware perception module, a latent diffusion-based molecular generation module, and a reaction-aware reasoning module. Each module is coupled to the LLM backbone through learned representation interfaces, enabling richer information exchange than is possible with text-only tool feedback. Our experiments on diverse chemical tasks demonstrate that SciCore-Mol achieves strong comprehensive performance across molecular understanding, generation, reaction prediction, and general chemistry knowledge, with an 8B-parameter open-source system that is competitive with and in several dimensions surpasses proprietary large models. This work provides a systematic blueprint for equipping LLMs with scientific expertise through decoupled, pluggable, and flexibly orchestrated modules, with direct implications for drug design, chemical synthesis, and broader scientific discovery.

2605.22286 2026-05-22 cs.LG cs.AI 版本更新

EmoTrack: Robust Depression Tracking from Counseling Transcripts across Session Regimes

EmoTrack: 从咨询记录中跨会话制度实现稳健的抑郁跟踪

Zhaomin Wu, Jiayi Li, Bingsheng He

发表机构 * Department of Computer Science National University of Singapore(新加坡国立大学计算机科学系)

AI总结 本文研究了在单次会话和多会话两种情形下基于咨询记录的稳健抑郁跟踪问题,提出了多会话咨询数据集LongCounsel和EmoTrack框架;该框架结合LLM提取的临床信号和冻结的轮次级语义嵌入来训练症状特定预测器,并可通过紧凑的跨会话记忆纳入先前会话,实验表明其在真实单次会话基准上表现优异。

详情
AI中文摘要

基于文本的咨询是人工智能心理健康支持的重要接口,其中咨询记录可用于监测抑郁严重程度并标记需要及时人工审查的会话。然而,在不同会话情形下实现稳健的PHQ-8预测仍然具有挑战性:基于微调的方法可以利用更丰富的监督,但在数据稀缺时可能泛化能力差;而基于提示的LLM方法数据效率高,但通常将每份记录作为整体处理,对纵向上下文的支持有限。我们研究了在单次会话和多会话两种情形下基于咨询记录的稳健抑郁跟踪。我们引入了LongCounsel,一个带有会话级PHQ-8监督的多会话咨询数据集,用于评估在部分症状披露和跨会话连续性条件下的重复会话跟踪。我们进一步提出了EmoTrack,一种PHQ-8预测框架,它将LLM提取的临床信号与冻结的轮次级语义嵌入相结合,并在得到的记录表示上训练症状特定预测器。当先前会话可用时,EmoTrack可通过紧凑的跨会话记忆进一步纳入它们。在LongCounsel和DAIC-WOZ上的实验表明,EmoTrack在真实单次会话基准上取得了明显提升,包括相对于最强DAIC-WOZ基线的MAE相对降低13.5%,并在LongCounsel上与最强的纵向基线保持竞争力。

英文摘要

Text-based counseling is an important interface for AI mental-health support, where transcripts may be used to monitor depression severity and flag sessions requiring timely human review. However, robust PHQ-8 prediction across session regimes remains challenging: fine-tuning-based methods can exploit richer supervision but may generalize poorly under data scarcity, while prompt-based LLM methods are data-efficient but usually treat each transcript holistically and provide limited support for longitudinal context. We study robust depression tracking from counseling transcripts across single-session and multi-session regimes. We introduce LongCounsel, a multi-session counseling dataset with session-level PHQ-8 supervision for evaluating repeated-session tracking under partial symptom disclosure and cross-session continuity. We further propose EmoTrack, a PHQ-8 prediction framework that combines LLM-extracted clinical signals with frozen turn-level semantic embeddings and trains symptom-specific predictors over the resulting transcript representation. When prior sessions are available, EmoTrack can further incorporate them through compact cross-session memory. Experiments on LongCounsel and DAIC-WOZ show that EmoTrack achieves a clear gain on the real single-session benchmark, including a 13.5% relative MAE reduction over the strongest DAIC-WOZ baseline, and remains competitive with the strongest longitudinal baseline on LongCounsel.

2605.22269 2026-05-22 cs.CV cs.AI cs.MM 版本更新

MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

MuKV:多粒度KV缓存压缩用于长流视频问答

Junbin Xiao, Jiajun Chen, Tianxiang Sun, Xun Yang, Angela Yao

发表机构 * University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学)

AI总结 本文提出MuKV,一种多粒度KV缓存压缩方法,通过半分层检索方法提升长流视频问答的效率和准确性,实验表明其在答案准确率、内存使用和在线问答效率方面均优于基线方法。

Comments To appear at CVPR'26. Code is available at https://github.com/IMBALDY/MuKV

详情
AI中文摘要

长流视频问答仍面临挑战,由于视觉token数量增加和大语言模型(LLM)推理长度有限。KV缓存通过LLM预填充存储历史token的Key-Value(KV),从而实现更高效的流式问答。然而,现有方法缓存每个或每两个帧,导致内存使用冗余并丢失帧内或跨帧的细粒度空间细节。本文提出MuKV,一种具有多粒度KV缓存压缩模块和半分层检索方法的方法,以提高长流视频问答的效率和准确性。对于离线KV缓存,MuKV在patch、frame和segment级别提取视觉表示。多个粒度层次保留了局部线索和全局时间上下文,同时通过自注意力和频率引导的双信号token压缩机制保持效率。对于在线问答,MuKV设计了一种半分层检索方法以检索相关KV缓存用于答案生成。在长流视频问答基准测试中,MuKV显著提高了答案准确率,而无需牺牲内存和在线问答效率。此外,我们的压缩机制本身在答案准确率、内存和问答效率方面均对基线方法带来了持续的改进,展示了高度有效的贡献。

英文摘要

Long streaming video QA remains challenging due to the growing number of visual tokens and the limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) pairs of historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within frames or temporal context across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at the patch, frame, and segment levels. The multiple levels of granularity preserve both local cues and global temporal context, while a dual-signal token compression mechanism guided by self-attention and frequency maintains efficiency. For online QA, MuKV designs a semi-hierarchical retrieval method to retrieve relevant KV caches for answer generation. Experiments on long streaming VideoQA benchmarks show that MuKV significantly improves answer accuracy without sacrificing memory or online QA efficiency. Moreover, the compression mechanism alone brings consistent benefits over baselines in answer accuracy, memory, and QA efficiency.

2605.22268 2026-05-22 cs.NI cs.AI cs.CV 版本更新

Impact of Atmospheric Turbulence and Pointing Error on Earth Observation

大气湍流和指向误差对地球观测的影响

Celia Sánchez-de-Miguel, Antonio M. Mercado-Martínez, Beatriz Soret, Antonio Jurado-Navas, Miguel Castillo-Vázquez

发表机构 * TELMA, University of Malaga(TELMA,马拉加大学)

AI总结 本文研究了大气湍流和指向误差对地球观测图像的影响,提出了一种增强的图像模拟器来生成物理真实的失真图像,并通过案例研究评估了YOLOv8和RetinaNet在不同湍流和指向误差条件下的性能。

Comments Conference

详情
AI中文摘要

地球观测(EO)图像常常因大气湍流和指向抖动而退化;然而,用于训练基于AI的检测模型的数据集很少考虑这些效应。基于先前的工作,本文提出了一种增强的图像模拟器,能够纳入垂直路径的大气湍流以及由平台和传感器振动引起的卫星指向抖动,以生成物理上真实的失真图像。作为案例研究,本文使用YOLOv8和RetinaNet,在所提出模拟器于不同湍流和指向误差水平下生成的图像上评估船舶检测。结果表明,YOLOv8的召回率从理想条件下的91%下降到弱湍流下的60%,在强湍流或抖动下则低于40%。相比之下,RetinaNet表现出更强的鲁棒性,在各种退化条件下保持约75%的召回率。这些结果突显了在EO训练数据集中纳入真实物理退化的重要性,以确保基于AI的模型在实际运行环境中的可靠性能,海上监视应用即是例证。

英文摘要

Earth Observation (EO) imagery is often degraded by atmospheric turbulence and pointing jitter; yet, these effects are rarely considered in datasets used to train AI-based detection models. Based on prior work, this paper presents an enhanced image simulator that enables the incorporation of vertical-path atmospheric turbulence and satellite pointing jitter, arising from platform and sensor vibrations, to generate physically realistic distorted images. As a case study, vessel detection is evaluated using YOLOv8 and RetinaNet on images generated by the proposed simulator under different levels of turbulence and pointing errors. Results show that YOLOv8 recall decreases from 91% under ideal conditions to 60% in the presence of weak turbulence, and falls below 40% under strong turbulence or jitter. In contrast, RetinaNet demonstrates greater robustness, maintaining approximately 75% recall across degraded conditions. These results highlight the importance of incorporating realistic physical degradations into EO training datasets to ensure reliable performance of AI-based models in operational environments, as demonstrated in maritime surveillance applications.

2605.22266 2026-05-22 cs.LG cs.AI 版本更新

Detecting Atypical Clients in Federated Learning via Representation-Level Divergence

通过表示层面的分歧检测联邦学习中的非典型客户端

Cristian Pérez-Corral, Jose I. Mestre, Alberto Fernández-Hernández, Manuel F. Dolz, Enrique S. Quintana-Ortí

发表机构 * Universitat Politècnica de València(瓦伦西亚理工大学) Universitat Jaume I(海梅一世大学)

AI总结 本文提出了一种轻量级的几何信号来量化客户端与全局模型之间的功能偏差,以检测联邦学习中的非典型客户端,通过评估输入空间的激活诱导分区变化来区分稳定但异质的客户端与显著偏离全局范式的客户端。

详情
AI中文摘要

联邦学习使分布式客户端在异质数据上进行协作训练,但这种异质性常常导致更新不稳定和全局性能下降。此外,在实际部署中,客户端更新可能偏离预期行为,不仅由于良性非独立同分布的数据分布,还由于分布偏移或异常输入,这引发了对聚合过程可靠性的担忧。在本工作中,我们提出了一种轻量级的几何信号来量化客户端相对于全局模型的功能偏差。与比较模型参数或梯度不同,我们的方法衡量每个客户端本地训练如何改变激活诱导的输入空间分区,该评估基于共享的探测集。这产生了一个置换不变、可解释的客户端-全局分歧度量,捕捉了模型处理数据方式的差异。我们展示该信号能有效识别导致非典型功能变化的客户端,区分稳定但异质的客户端与那些更新显著偏离全局范式的客户端。因此,所提出的度量提供了一个简单的工具用于监控客户端行为,并在联邦学习系统中实现风险感知的聚合策略。

英文摘要

Federated learning enables collaborative training across distributed clients with heterogeneous data, but such heterogeneity often leads to unstable updates and degraded global performance. Moreover, in practical deployments, client updates may deviate from the expected behavior not only because of benign non-i.i.d. data distributions, but also because of distributional shifts or anomalous inputs, raising concerns about the reliability of the aggregation process. In this work, we propose a lightweight geometric signal to quantify the functional deviation of a client with respect to the global model. Instead of comparing model parameters or gradients, our approach measures how the local training of each client alters the activation-induced partition of the input space, evaluated on a shared probe set. This yields a permutation-invariant, interpretable metric of client-global divergence that captures differences in how data is processed by the model. We show that this signal effectively identifies clients that induce atypical functional changes, distinguishing stable yet heterogeneous clients from those whose updates significantly diverge from the global regime. As a result, the proposed metric provides a simple tool for monitoring client behavior and enabling risk-aware aggregation strategies in federated learning systems.

2605.22263 2026-05-22 cs.LG cs.AI 版本更新

Tailoring Teaching to Aptitude: Direction-Adaptive Self-Distillation for LLM Reasoning

按能力定制教学:方向自适应自蒸馏用于LLM推理

Hongbin Zhang, Chaozheng Wang, Kehai Chen, Youcheng Pan, Yang Xiang, Jinpeng Wang, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology(计算智能研究所,哈尔滨工业大学) Peng Cheng Laboratory(鹏城实验室) Keeta AI, Meituan(Keeta AI,美团)

AI总结 本文提出方向自适应自蒸馏(DASD),通过熵引导的定向监督改进LLM推理,通过分析发现统一的教师监督导致探索被压制,DASD在六个数学推理基准中取得最佳表现。

Comments Under Review

详情
AI中文摘要

同策略自蒸馏(OPSD)是一种新兴的LLM后训练范式,其中模型充当自己的教师:在以特权信息(如参考轨迹或提示)为条件时,同一策略为自身的rollout提供密集的token级监督。然而,最近的研究表明,OPSD会抑制预测不确定性,从而损害复杂推理,而这种不确定性正是探索和假设修订的基础。我们的token级分析显示,这种失败源于对不确定性水平不同的token施加了统一的教师监督方向:在高熵处,顺从特权自教师会抑制探索;在低熵处,偏离教师则会降低步骤准确性。据此,我们提出了方向自适应自蒸馏(DASD),将特权自蒸馏从统一的教师模仿重新表述为按熵路由的方向性监督:高熵token被推离特权教师以保留探索,低熵token被拉向教师以稳定步骤级执行。在六个数学推理基准上,DASD相对于强大的RLVR和自蒸馏基线取得了最佳的宏平均Avg@16。Pass@k、推理健康度和泛化分析表明,这些平均收益来自于在不牺牲步骤级执行的前提下保留了探索。

英文摘要

On-policy self-distillation (OPSD) is an emerging LLM post-training paradigm in which the model serves as its own teacher: conditioned on privileged information such as a reference trace or hint, the same policy provides dense token-level supervision on its own rollouts. However, recent studies show that OPSD degrades complex reasoning by suppressing predictive uncertainty, which supports exploration and hypothesis revision. Our token-level analysis shows that this failure arises from applying a uniform direction of teacher supervision across tokens with different uncertainty levels: conformity to the privileged self-teacher suppresses exploration at high entropy, while deviation from the teacher degrades step accuracy at low entropy. Accordingly, we propose Direction-Adaptive Self-Distillation (DASD), which reframes privileged self-distillation from uniform teacher imitation into entropy-routed directional supervision: high-entropy tokens are pushed away from the privileged teacher to preserve exploration, while low-entropy tokens are pulled toward the teacher to stabilize step-level execution. Across six mathematical reasoning benchmarks, DASD achieves the best macro Avg@16 over strong RLVR and self-distillation baselines. Pass@k, reasoning-health, and generalization analyses show that these average gains come from preserving exploration without sacrificing step-level execution.
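
The entropy-routing idea in this abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say whose entropy is measured or how the routing boundary is set, so the use of the student's own next-token distribution, the hard threshold, and the names `token_entropy` and `routed_direction` are all assumptions made for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def routed_direction(student_probs, threshold):
    """Return +1 to pull the token toward the teacher, -1 to push it away.

    Assumption: routing uses a hard entropy threshold on the student's
    distribution; the paper may route differently.
    """
    return -1 if token_entropy(student_probs) > threshold else 1

# Two hypothetical token positions: one confident, one maximally uncertain.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

directions = [routed_direction(p, threshold=0.7) for p in (confident, uncertain)]
print(directions)
```

In a training loop, such a sign would multiply a per-token distillation term, so low-entropy positions imitate the teacher while high-entropy positions are steered away from it.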

2605.22257 2026-05-22 cs.LG cs.AI cs.LO 版本更新

What are the Right Symmetries for Formal Theorem Proving?

形式化定理证明应当具备哪些对称性?

Krzysztof Olejniczak, Radoslav Dimitrov, Xingyue Huang, Bernardo Cuenca Grau, Jinwoo Kim, İsmail İlkan Ceylan

发表机构 * University of Oxford(牛津大学) KAIST(韩国科学技术院) TU Wien(维也纳技术大学) AITHYRA

AI总结 本文探讨了形式化定理证明中应尊重的对称性,提出了基于范畴论的重写范畴框架,用于形式化证明等变性和成功不变性,并通过测试时方法提高了基于LLM的定理证明器的鲁棒性和性能。

详情
AI中文摘要

基于大语言模型(LLM)的形式化定理证明器对问题表示的表面变化高度敏感:语义等价的陈述可能表现出截然不同的证明成功率,这表明它们未能尊重形式化数学中固有的结构对称性。这引出了一个核心问题:形式化定理证明应当具备哪些对称性?我们引入了重写范畴,这是一个范畴论框架,用于刻画由证明策略(tactic)诱导的可组合且通常不可逆的变换,并用它来形式化两个对称性概念:证明等变性,描述证明分布在重写下如何变换;以及成功不变性(即成功概率的不变性),要求等价陈述以相同概率被解决。我们观察到,基于状态的next-tactic证明器由于直接在证明状态上操作,自然满足证明等变性。相比之下,最先进的基于LLM的证明器两个性质都不满足,在等价表述之间表现出很大的性能差异。为缓解这一问题,我们提出了在输入的等价重写上进行聚合的测试时方法,从理论上证明它们在采样极限下能恢复成功不变性,并通过实验表明它们在固定推理预算下提高了鲁棒性和性能。我们的结果表明,对称性是基于LLM的定理证明中缺失的关键归纳偏置,并提示测试时计算是近似这一偏置的实用途径。

英文摘要

Formal theorem provers based on large language models (LLMs) are highly sensitive to superficial variations in problem representation: semantically equivalent statements can exhibit drastically different proof success rates, revealing a failure to respect structural symmetries inherent in formal mathematics. This raises a central question: what are the right symmetries for formal theorem proving? We introduce rewriting categories, a category-theoretic framework capturing the compositional, generally non-invertible transformations induced by proof tactics, and use it to formalize two symmetry notions: proof equivariance, governing how proof distributions transform under rewrites, and success invariance (i.e., invariance of success probability), requiring equivalent statements to be solved with the same probability. We observe that state-based next-tactic provers naturally satisfy proof equivariance by operating on proof states. In contrast, state-of-the-art LLM-based provers satisfy neither property, exhibiting large performance variation across equivalent formulations. To mitigate this, we propose test-time methods that aggregate over equivalent rewritings of the input, showing theoretically that they recover success invariance in the sampling limit, and empirically, that they improve robustness and performance under fixed inference budgets. Our results highlight symmetry as a key missing inductive bias in LLM-based theorem proving and suggest test-time computation as a practical route to approximate it.

2605.22243 2026-05-22 cs.LG cs.AI stat.AP 版本更新

Explainable AI for Data-Driven Design of High-Dimensional Predictive Studies

面向高维预测研究数据驱动设计的可解释AI

Junyu Yan, Damian Machlanski, Kurt Butler, Panagiotis Dimitrakopoulos, Ewen M Harrison, Bruce Guthrie, Sotirios A Tsaftaris

发表机构 * School of Engineering, University of Edinburgh(爱丁堡大学工程学院) Causality in Healthcare AI Hub (CHAI)(医疗因果AI枢纽) Advanced Care Research Centre, Usher School of Population Health Sciences, University of Edinburgh(先进护理研究中心,乌瑟人口健康科学学院,爱丁堡大学) Centre for Medical Informatics, Usher School of Population Health Sciences, University of Edinburgh(医学信息学中心,乌瑟人口健康科学学院,爱丁堡大学)

AI总结 本文提出了一种可解释的AI推荐系统,通过数据驱动的方法改进现有可解释统计模型的预测性能,主要贡献是通过可解释AI技术提供三种推荐类型以提高模型的预测能力和透明度。

Comments 41 pages, 7 figures

详情
AI中文摘要

预测建模在健康数据分析和数据驱动的临床决策中非常重要。然而,当需要选择、转换或交互建模数十甚至数百个特征时,手动优化预测研究的设计具有挑战性。尽管复杂的机器学习模型具有高性能,但其“黑盒”性质限制了临床决策所需的信任、透明度和可解释性。我们开发并评估了一种探索性AI推荐器,以提供数据驱动的推荐,从而提高现有可解释统计模型的预测性能。所开发的框架使用灵活的AI建模来捕捉复杂的数据模式,并利用可解释AI技术将这些模式转化为三种推荐类型:特征排除、非线性项和特征交互。我们通过比较基线(即无交互或非线性项)Cox比例风险(CPH)模型与纳入本方法所给推荐的增强CPH模型的预测性能来评估该框架。主要分析预测245,614名患者首次发生跌倒或相关伤害的时间。我们的方法推荐排除23个特征、为两个特征加入非线性项,并纳入221个建议的特征交互。C指数从0.805(95% CI 0.798-0.812)提高到0.815(95% CI 0.809-0.822),校准也有所改善(截距:-0.006到0.003;斜率:1.063到0.950)。所有推荐均得到现有文献的支持。该方法在另外两个公共数据集上也被证明有效,显示了更广泛的适用性。所提出的探索性AI推荐器展示了可解释AI和数据驱动研究设计在改进高维透明预测模型的开发过程和性能方面的潜力。

英文摘要

Predictive modelling is important for health data analysis and data-driven clinical decision-making. However, predictive studies are challenging to design optimally by hand when tens or even hundreds of features require selection, transformation, or interaction modelling. While complex machine learning models offer high performance, their "black-box" nature limits the clinical trust, transparency, and interpretability required for decision-making. We developed and evaluated an Exploratory AI Recommender that provides data-driven recommendations to improve the predictive performance of existing interpretable statistical models. The developed framework uses flexible AI modelling to capture complex data patterns and explainable AI techniques to translate the patterns into three recommendation types: feature exclusion, non-linear terms, and feature interactions. We evaluated the framework by comparing the predictive performance of a baseline (i.e., no interactions or non-linear terms) Cox Proportional Hazards (CPH) model against an augmented CPH model incorporating the recommendations suggested by our method. The primary analysis predicts the time to the first occurrence of a fall or related injury in 245,614 patients. Our method recommended excluding 23 features, adding non-linear terms for two features, and adding 221 suggested feature interactions. The C-index improved from 0.805 (95% CI 0.798-0.812) to 0.815 (95% CI 0.809-0.822), and calibration also improved (intercept: -0.006 to 0.003; slope: 1.063 to 0.950). All recommendations were supported by existing literature. The method also proved effective on two additional public datasets, demonstrating wider applicability. The proposed Exploratory AI Recommender demonstrates the potential of explainable AI and data-driven study design to improve both the development process and the performance of high-dimensional transparent predictive models.
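
For readers unfamiliar with the C-index reported above, the following is a minimal sketch of Harrell's concordance index on right-censored data. It is a textbook-style toy, not the evaluation code used in the paper; the function name `concordance_index` and the four-patient dataset are invented for illustration, and tied event times are simply treated as not comparable.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's follow-up time. It is concordant when
    the model gave i the higher risk; risk ties count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical toy data: follow-up time, event indicator (1 = event), predicted risk.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.1]

c_index = concordance_index(times, events, risks)
print(round(c_index, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why a change from 0.805 to 0.815 is a modest but measurable gain in discrimination.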

2605.22238 2026-05-22 cs.AI 版本更新

Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play

评估大型语言模型作为实时战略代理:限时Risk对局中的提供商表现、混合分解与运行差距

H. C. Ekne

发表机构 * Gemini OpenAI Kimi

AI总结 本文研究了大型语言模型在实时策略环境中的表现,发现其性能受目标跟踪、执行转换、成本和运行时可靠性等因素影响,支持将LLM作为受限制工作流中的组件进行评估,而非孤立的基准测试对象。

Comments 13 pages, 7 figures. Code and tracked notes: https://github.com/hcekne/risk-game . Public runtime artifact index: https://github.com/hcekne/risk-game/blob/main/docs/article-plans/public_experiment_artifacts.md

详情
AI中文摘要

静态基准测试只能反映大型语言模型实际行为的一部分。真实系统会把模型置于带有时间限制、格式约束和故障模式的重复循环之中。我们在一个限时、多阶段的Risk环境中研究这一设定,该环境具有明确的胜利目标以及反复进行的规划与执行循环。在一项规则冻结、经过重复实验的32局跨提供商锦标赛中,gemini-3.1-pro-preview在与gpt-5.1、claude-opus-4-7和kimi-k2.6的对局中赢得了32局中的20局,汇总的获胜者分布与等强度零假设差异显著(p约为1.5×10^-5)。随后,我们将执行环节统一放在更便宜的Gemini Flash脚手架上,从而把规划与执行分离。在这一设计下,汇总的32局规划器对比测试与各模型近乎相当的结论一致(p约为0.821),这表明此前提供商之间的差距大多来自端到端的系统行为,而非规划本身。为研究其机制,我们分析了提供商锦标赛中保存的规划和执行轨迹。Gemini提及终局目标的频率远高于其他模型,并在接近胜利时进一步加强这种关注。Gemini还能把更多回合转化为深度征服链,尽管它并不是运行时最稳定的模型。这些结果表明,实时代理的性能取决于目标跟踪、执行转化、成本和运行时可靠性,并支持将LLM作为有界工作流中的组件来评估,而不是作为孤立的基准测试应答者。

英文摘要

Static benchmarks capture only part of how large language models behave in practice. Real systems place models inside repeated loops with time limits, formatting constraints, and failure modes. We study this setting in a timed multi-phase Risk environment with explicit victory targets and repeated planning and execution cycles. In a replicated 32-game cross-provider championship under frozen rules, gemini-3.1-pro-preview won 20 of 32 games against gpt-5.1, claude-opus-4-7, and kimi-k2.6, and the pooled winner distribution differs strongly from an equal-strength null (p ≈ 1.5 × 10^-5). We then separate planning from execution by standardizing execution on a cheaper Gemini Flash scaffold. Under this design, a pooled 32-game planner bakeoff is consistent with near-equality (p ≈ 0.821), which indicates that much of the earlier provider spread came from end-to-end system behavior rather than planning alone. To study mechanism, we analyze saved planning and execution traces from the provider championship. Gemini refers to the terminal objective far more often than the other models and increases that focus as victory approaches. Gemini also converts more turns into deep conquest chains, even though it is not the cleanest runtime. These results show that live-agent performance depends on objective tracking, execution conversion, cost, and runtime reliability, and they support evaluating LLMs as components in bounded workflows rather than as isolated benchmark respondents.
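
To get a feel for how unlikely 20 wins in 32 four-player games would be under equal strength, one can compute an exact binomial tail. This is a rough sanity check, not the paper's test: the abstract's p-value is for the pooled winner distribution across all four models, whose full counts are not given here, whereas this sketch only asks how often one pre-specified player would win at least 20 games when each game is won with probability 1/4. The helper name `binomial_upper_tail` is an assumption for illustration.

```python
from math import comb

def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 32 games, 4 equally strong players, so each wins a game with probability 0.25.
tail = binomial_upper_tail(n=32, k=20, p=0.25)
print(f"{tail:.2e}")
```

The tail is on the order of 10^-5 or smaller, which is the same broad scale as the reported p-value even though the two quantities answer slightly different questions.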

2605.22221 2026-05-22 cs.LG cs.AI cs.LO 版本更新

Can Transformers Learn to Verify During Backtracking Search?

Transformer能否在回溯搜索中学习验证?

Yin Jun Phua, Tony Ribeiro, Tuan Nguyen, Katsumi Inoue

发表机构 * Yin Jun Phua (corresponding author): Institute of Science Tokyo, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8550, Japan; Tony Ribeiro: Centrale Nantes, CNRS, Laboratoire des Sciences du Numérique de Nantes, LS2N, UMR 6004, F-44000 Nantes, France; National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan; Steelous Protocol, 8-20-32, Ginza, Chuo-ku, Tokyo 104-0061, Japan; Tuan Nguyen: Hanoi University of Science and Technology, No. 1 Dai Co Viet, Hai Ba Trung, Ha Noi, Vietnam; Katsumi Inoue: National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan

AI总结 本文研究了Transformer在回溯搜索中的验证能力,指出在累积轨迹上训练的模型存在分散检索和历史纠缠问题,并提出局部化和选择性状态注意力(SSA)来解决这些问题,通过实验验证了SSA在3-SAT、图着色、Blocks World和回溯解析等任务中的有效性。

详情
AI中文摘要

回溯搜索是经典约束求解器、规划器和定理证明器的基础。近期基于Transformer的推理系统会在自身中间步骤构成的搜索树上进行探索。一种常见的训练方法是在离线求解器轨迹上拟合自回归的下一个token损失。模型在每一步的输入是所有先前决策的累积轨迹。最优的继续或回溯预测器仅依赖于当前搜索状态,因为到达同一状态的两条轨迹具有相同的可行后续。我们证明,在累积轨迹上训练的仅解码器Transformer在两个方面未能满足这一要求:轨迹可能把状态特征分散到许多位置(分散检索),并且预测器可能以轨迹而非状态为条件(历史纠缠)。我们通过局部化解决分散检索问题,这是一种轨迹层面的修复方法,它重写每个决策块,使状态特征在局部可见。我们通过选择性状态注意力(SSA)解决历史纠缠问题,这是一种固定的注意力掩码,无需修改训练数据、目标或参数即可在结构上强制基于状态做出决策。我们关注反应式验证,即在传播已暴露出矛盾之后进行的验证。我们在3-SAT、图着色、Blocks World和回溯解析上测试SSA。对于仅在先前历史上不同的同状态样本对,SSA给出相同的决策,而在累积轨迹上训练的因果基线则不能。我们的贡献是对Transformer在序列化轨迹数据上行为的诊断,并配以结构性修复。在自身推理步骤上进行搜索的预训练语言模型可能面临同样的失败。我们的分析表明,推理时的上下文清除是一种无需重新训练即可实现同样隔离的候选方法。

英文摘要

Backtracking search underlies classical constraint solvers, planners, and theorem provers. Recent transformer-based reasoning systems explore search trees over their own intermediate steps. A common training recipe fits an autoregressive next-token loss on offline solver traces. The model's input at each step is a cumulative trace of all prior decisions. The optimal continue-or-backtrack predictor depends only on the current search state, since two trajectories reaching the same state admit the same viable continuations. We show that decoder-only transformers trained on cumulative traces fail this requirement in two ways: the trace can scatter state features across many positions (scattered retrieval), and the predictor can condition on the trajectory rather than the state (history entanglement). We address scattered retrieval with localization, a trace-level fix that rewrites each decision block to expose state features locally. We address history entanglement with Selective State Attention (SSA), a fixed attention mask that enforces state-based decisions structurally without modifying training data, objective, or parameters. We focus on reactive verification, after propagation has exposed a contradiction. We test SSA on 3-SAT, graph coloring, Blocks World, and backtracking parsing. On same-state pairs that differ only in prior history, SSA emits identical decisions while a cumulative-trained causal baseline does not. Our contribution is a diagnostic of transformer behavior on serialized trajectory data, paired with a structural fix. Pretrained language models that search over their own reasoning steps may face the same failure. Our analysis opens up inference-time context clearing as a candidate way to apply the same isolation without retraining.

2605.22219 2026-05-22 cs.AI 版本更新

SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

SGR-Bench: 对状态门控检索的搜索代理基准测试

Ningyuan Li, Haiyang Shen, Mugeng Liu, Yudong Han, Zhuofan Shi, Sixiong Xie, Yun Ma

发表机构 * Peking University(北京大学) Beijing University of Technology(北京工业大学)

AI总结 本文提出SGR-Bench,一个用于评估状态门控检索能力的基准数据集,包含100个专家精选任务,通过对比显式和隐式指导方法,揭示了搜索代理在处理状态门控检索任务时的主要挑战。

Comments Work in Progress. 23 pages, 7 figures, preprint

详情
AI中文摘要

近年来,大语言模型和工具使用代理的进步扩大了已有基准覆盖的网络任务范围。然而,一类重要的专门检索任务仍缺乏充分刻画。在许多专门的数据检索网站上,只有先通过过滤器、视图、层次结构或范围等设置建立正确的网站特定检索状态,才能访问包含答案的证据。我们称这种能力为状态门控检索(SGR)。我们引入了SGR-Bench,一个针对此设定的基准,包含100个专家精选的任务,涵盖六个来源家族和12个公开数据生态系统。每个任务都需要发现合适的网站并配置其网站特定的检索状态,以生成结构化答案。SGR-Bench将同一底层问题的约束引导式和目标导向式两种表述配对,从而可以在状态门控检索中对显式指导和隐式指导进行受控比较。我们评估了八个基于CLI的代理式LLM系统和三个商业搜索代理产品。在SGR-Bench上,最强的系统仅达到66.18%的项目级F1,而行级F1仍明显更低。对156条可分析的失败CLI轨迹的人工审核揭示了原因:代理通常能到达相关的网页来源,但建立了错误的网站特定检索状态。检索范围漂移(37.2%)和标准不匹配(27.6%)占主导地位,而最终答案组织问题仅占10.3%。数据集和单案例评估说明可在https://huggingface.co/datasets/PKUAIWeb/SGR-BENCH获取。

英文摘要

Recent advances in large language models and tool-using agents have expanded the range of benchmarked web tasks. Yet an important class of specialized retrieval tasks remains undercharacterized. On many specialized data-retrieval websites, answer-bearing evidence becomes accessible only after establishing the correct site-specific retrieval state through filters, views, hierarchies, or scopes. We term this capability state-gated retrieval (SGR). We introduce SGR-Bench, a benchmark for this setting containing 100 expert-curated tasks spanning six source families and 12 public data ecosystems. Each task requires discovering the appropriate website and configuring its site-specific retrieval state to produce a structured answer. SGR-Bench pairs constraint-guided and goal-oriented formulations of the same underlying problems, enabling controlled comparisons between explicit and implicit guidance for state-gated retrieval. We evaluate eight CLI-based agentic LLM systems and three commercial search-agent products. On SGR-Bench, the strongest system reaches only 66.18% item-level F1, while row-level F1 remains much lower. A manual audit of 156 analyzable failed CLI trajectories shows why: agents often reach a relevant web source, but establish the wrong site-specific retrieval state. Retrieval-scope drift (37.2%) and criterion mismatch (27.6%) dominate, whereas final answer composition accounts for only 10.3%. The dataset and single-case evaluation instructions are available at https://huggingface.co/datasets/PKUAIWeb/SGR-BENCH.

2605.22213 2026-05-22 cs.AI 版本更新

Towards a compositional semantics for quantitative confidence assessment in assurance arguments

迈向定量信心评估的组合语义:在保证论证中

Benjamin Herd, Jessica Kelly, Jan Sabsch, Lydia Gauerhof

发表机构 * Luxoft GmbH(卢克斯oft GmbH) Robert Bosch GmbH(罗伯特·博世有限公司)

AI总结 本文提出了一种组合语义,用于在保证论证中进行定量信心评估,通过将论证元素表示为主观逻辑意见,并将元素间的关系映射到主观逻辑运算符,从而实现信心的传播。

Comments Accepted to the 21st European Dependable Computing Conference (EDCC 2026), Canterbury, UK

Journal ref Proceedings of the 21st European Dependable Computing Conference (EDCC 2026)

详情
AI中文摘要

保证论证提供了一种清晰且结构化的方式来解释为什么利益相关者应相信系统满足某些属性,然而广泛使用的记法,例如目标结构记法(GSN),通常缺乏用于推导保证信心的操作语义。现有方法处理了结构和可靠性(soundness)问题,但主要是在真值上推理,而不是在对主张论证依据的信心上推理。主观逻辑(SL)提供了一套关于信念、不信和不确定性的演算,并带有组合意见的运算符,使得在证据不完整、相互冲突或带有主观性时仍能传播信心。然而,现有的基于SL的方法并未提供一种统一的、组合式的语义来覆盖所有论证元素和关系,从而支持整体信心评估。本文提出了一种信心语义,将论证元素表示为SL意见,并将元素间的关系映射到刻画信心如何流动的SL运算符,从而有效地将论证转化为可分析的信心网络。该方法提供了显式的论据(warrant)、有原则的上下文处理、可保留的溯源信息以及与GSN的兼容性,并通过一个示例性的保证信心评估给出了实用指导。

英文摘要

Assurance arguments provide a clear and structured way to explain why stakeholders should trust that a system satisfies certain properties, yet widely used notations, e.g., Goal Structuring Notation (GSN), typically lack an operational semantics for deriving assurance confidence. Existing approaches address structure and soundness but largely reason over truth values, not over confidence in the justification of claims. Subjective Logic (SL) offers a calculus of belief, disbelief, and uncertainty with operators for combining opinions, enabling confidence propagation under incomplete, conflicting, or subjective evidence. However, existing SL-based approaches do not provide a uniform, compositional semantics that covers all argument elements and relations to enable overall confidence assessment. We propose a confidence semantics that represents argument elements as SL opinions and maps relations between elements to SL operators modelling how confidence flows, effectively turning the argument into an analyzable confidence network. The approach provides explicit warrants, principled handling of context, preserved provenance, and compatibility with GSN, along with practical guidance using an exemplary assurance confidence assessment.
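
As background for the Subjective Logic terminology above, here is a minimal sketch of a binomial opinion and one standard SL operator, cumulative fusion, which combines two independent opinions about the same claim. Which operators the paper assigns to which GSN relations is not stated in the abstract, so this only illustrates the general calculus; the class name `Opinion`, the function name `cumulative_fusion`, and the numbers are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Opinion:
    """Binomial SL opinion with belief + disbelief + uncertainty = 1."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def projected_probability(self) -> float:
        # Uncertainty mass is distributed according to the base rate.
        return self.belief + self.base_rate * self.uncertainty

def cumulative_fusion(x: Opinion, y: Opinion) -> Opinion:
    """Fuse two independent opinions on the same claim.

    Simplifying assumptions of this sketch: equal base rates and at
    least one opinion with non-zero uncertainty.
    """
    if x.base_rate != y.base_rate:
        raise ValueError("this sketch assumes equal base rates")
    denom = x.uncertainty + y.uncertainty - x.uncertainty * y.uncertainty
    if denom == 0.0:
        raise ValueError("both opinions are dogmatic (zero uncertainty)")
    return Opinion(
        belief=(x.belief * y.uncertainty + y.belief * x.uncertainty) / denom,
        disbelief=(x.disbelief * y.uncertainty + y.disbelief * x.uncertainty) / denom,
        uncertainty=(x.uncertainty * y.uncertainty) / denom,
        base_rate=x.base_rate,
    )

# Two hypothetical pieces of evidence supporting the same claim.
test_report = Opinion(belief=0.6, disbelief=0.2, uncertainty=0.2)
field_data = Opinion(belief=0.5, disbelief=0.1, uncertainty=0.4)

fused = cumulative_fusion(test_report, field_data)
print(fused, fused.projected_probability())
```

Fusing independent evidence lowers the uncertainty mass below that of either input, which is the kind of confidence propagation the abstract refers to.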

2605.22211 2026-05-22 cs.AI 版本更新

CLORE: Content-Level Optimization for Reasoning Efficiency

CLORE:面向推理效率的内容级优化

Yuyang Wu, Qiyao Xue, Guanxing Lu, Weichen Liu, Zihan Wang, Manling Li, Olexandr Isayev

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Northwestern University(西北大学) University of North Carolina, Chapel Hill(北卡罗来纳大学教堂山分校) University of Pittsburgh(匹兹堡大学)

AI总结 本文提出CLORE框架,通过编辑正确在线轨迹来提升大语言模型的推理效率,通过外部增强模型删除冗余、不可读或无关内容,同时保留最终答案,并结合辅助参考-free DPO目标和标准策略梯度训练优化增强-原始对,实验表明CLORE在五个数学推理基准上提升了准确性和效率的平衡,并与GRPO、DAPO、Training Efficient和ThinkPrune兼容。

Comments 9 pages, 9 figures

详情
AI中文摘要

强化学习后训练已提高了大语言模型的推理能力,但往往产生不必要的长、重复或语义模糊的推理轨迹。现有高效推理方法主要通过显式预算或长度感知奖励调节响应长度,导致中间推理内容弱监督。我们提出CLORE,一种内容级优化框架,通过编辑正确在线轨迹来提高推理效率。CLORE使用外部增强模型删除重复段落、不可读或无关内容以及解决方案确定后的冗余推理,同时保留最终答案。所得到的增强-原始对通过辅助参考-free DPO目标与标准策略梯度训练优化。通过限制增强到正确轨迹并执行局部删除,CLORE使编辑轨迹接近策略分布并减轻离策略不匹配。在DeepSeek-R1-Distill-Qwen-7B和Qwen2.5-Math-7B五个数学推理基准上的实验表明,CLORE提高了准确性和效率的平衡,并与GRPO、DAPO、Training Efficient和ThinkPrune兼容。内容级分析进一步表明,CLORE减少了重复推理、不可读内容和答案后探索,支持内容级监督作为长度级控制的互补方向。

英文摘要

Reinforcement learning post-training has improved the reasoning ability of large language models, but often produces unnecessarily long, repetitive, or semantically opaque reasoning traces. Existing efficient reasoning methods mainly regulate response length through explicit budgets or length-aware rewards, leaving intermediate reasoning content weakly supervised. We propose CLORE, a content-level optimization framework that improves reasoning efficiency by editing correct on-policy rollouts. CLORE uses an external augmentation model to delete repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented-original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. By restricting augmentation to correct trajectories and performing local deletion, CLORE keeps edited rollouts close to the policy distribution and mitigates off-policy mismatch. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off and remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune. Content-level analyses further show that CLORE reduces repetitive reasoning, illegible content, and post-answer exploration, supporting content-level supervision as a complementary direction to length-level control.
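
The auxiliary objective on augmented-original pairs can be sketched as a pairwise preference loss without a reference model. The abstract does not give the exact form, so the version below (a plain log-sigmoid of the scaled log-probability gap, with hypothetical parameters `beta` and `aux_weight`) is an assumption; length normalisation or other terms may well be part of the actual method.

```python
import math


def reference_free_dpo_loss(logp_augmented: float, logp_original: float,
                            beta: float = 0.1) -> float:
    """-log sigmoid(beta * (logp_augmented - logp_original)), computed stably."""
    margin = beta * (logp_augmented - logp_original)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def total_loss(policy_gradient_loss: float, pairs, aux_weight: float = 0.5) -> float:
    """Policy-gradient loss plus the mean auxiliary loss over (augmented, original) pairs."""
    if not pairs:
        return policy_gradient_loss
    aux = sum(reference_free_dpo_loss(aug, orig) for aug, orig in pairs) / len(pairs)
    return policy_gradient_loss + aux_weight * aux
```

The loss shrinks as the policy assigns more log-probability to the edited (shorter) rollout than to its unedited source, which is the direction the content-level supervision is meant to push.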

2605.22206 2026-05-22 cs.NE cs.AI cs.RO 版本更新

Temporal Coding as a Substrate for Sensorimotor Object Inference: A Spiking Reinterpretation of Thousand Brains Architecture

时间编码作为感觉运动物体推断的基底:对千脑架构的脉冲化重新解读

Joy Bose

发表机构 * Independent Researcher(独立研究者)

AI总结 该研究提出用脉冲编码替代密集向量,以更有效地编码传感器接触顺序,从而提升物体识别的准确性和鲁棒性,核心方法是基于STDP的学习规则和可学习参数lambda,主要贡献是验证了时间编码在不同空间排列和噪声水平下的优越性能。

Comments 18 pages, 5 figures

详情
AI中文摘要

千脑理论(TBT)及其开源的Monty框架通过感觉运动推断对物体识别进行建模,即主动移动传感器扫过物体表面,并逐次接触积累证据。当前实现将每个接触编码为密集浮点向量。虽然Monty跟踪步间位移并跨接触积累证据,但其将每个接触的特征激活模式视为无序集合:特征被遇到的方向性顺序不具有表征意义。在TBT中,接触的顺序具有空间意义:知道在从左到右的扫过中特征A在特征B之前被感受到,可以告诉你A和B在物体上的位置。密集向量丢弃了这种顺序。我们提出用等级顺序脉冲包替代密集向量:每个接触产生一次短暂的神经事件爆发,其中激活最强的神经元最先放电。连续爆发之间的时间间隔隐含地编码传感器位移,而无需显式坐标计算。一种生物启发的学习规则(STDP)将遍历方向编码到突触权重中。一个可学习的参数lambda调整对早期与近期接触的依赖程度,适应每个物体的几何形状。我们推导出三个可检验的预测,并给出了一个由四个组件构成、约450行NumPy代码的实现规格。三个合成实验验证了核心主张:时间编码在具有相同特征但不同空间排列的物体上实现完美判别准确性,而密集积累的表现仅相当于随机猜测;时间编码在所有测试噪声水平上保持30-50个百分点的优势;自适应的lambda收敛到不同的值,反映物体的几何复杂性。对Monty的YCB基准的端到端评估留待未来工作。

英文摘要

The Thousand Brains Theory (TBT) and its open-source Monty framework model object recognition through sensorimotor inference: identifying objects by actively moving a sensor across their surface and building evidence contact by contact. The current implementation encodes each contact as a dense floating-point vector. While Monty tracks inter-step displacement and accumulates evidence across contacts, it treats the feature activation pattern at each contact as an unordered set: the directional sequence in which features are encountered carries no representational weight. In TBT, the sequence of contacts carries spatial meaning: knowing that feature A was felt before feature B during a left-to-right sweep tells you something about where A and B sit on the object. Dense vectors discard this ordering. We propose replacing dense vectors with rank-order spike packets: each contact produces a brief burst of neural events where the most strongly activated neuron fires first. The time gap between successive bursts implicitly encodes sensor displacement without explicit coordinate calculations. A biologically motivated learning rule (STDP) encodes traversal direction into synaptic weights. A learnable parameter lambda adjusts reliance on earlier versus recent contacts, adapting to each object's geometry. We derive three testable predictions and specify an implementation of four components in approximately 450 lines of NumPy. Three synthetic experiments confirm the core claims: temporal coding achieves perfect discrimination accuracy on objects with identical features in different spatial arrangements, where dense accumulation performs at chance; temporal coding maintains a 30-50 percentage point advantage across all tested noise levels; the adaptive lambda converges to distinct values, reflecting object geometric complexity. End-to-end evaluation on Monty's YCB benchmark is left for future work.
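
The ordering argument can be illustrated with a toy example. The sketch below is not the paper's implementation (it omits STDP, inter-burst timing, and lambda, and uses plain Python rather than NumPy); it only shows, under those simplifying assumptions, why summing dense contact vectors cannot separate two objects that carry the same features in a different spatial order, whereas a sequence of rank-order packets can. All function names are hypothetical.

```python
def rank_order_packet(activations, n_spikes=2):
    """Neuron indices in firing order: the most strongly activated fires first."""
    order = sorted(range(len(activations)), key=lambda i: -activations[i])
    return tuple(order[:n_spikes])


def dense_accumulate(contacts):
    """Order-insensitive baseline: element-wise sum over all contacts."""
    return [sum(column) for column in zip(*contacts)]


def temporal_code(contacts, n_spikes=2):
    """Order-sensitive code: one rank-order packet per contact, in sweep order."""
    return [rank_order_packet(c, n_spikes) for c in contacts]


# Two toy features; the two objects contain both, in opposite sweep order.
feature_a = [0.9, 0.1, 0.4]
feature_b = [0.2, 0.8, 0.5]
object_1 = [feature_a, feature_b]
object_2 = [feature_b, feature_a]

same_dense = dense_accumulate(object_1) == dense_accumulate(object_2)    # True
same_temporal = temporal_code(object_1) == temporal_code(object_2)       # False
```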

2605.22205 2026-05-22 cs.AI cs.LG 版本更新

Skill Weaving: Efficient LLM Improvement via Modular Skillpacks

技能编织:通过模块化技能包实现高效的LLM改进

Zhuo Li, Guodong Du, Zesheng Shi, Weiyang Guo, Weijun Yao, Yuan Zhou, Jiabo Zhang, Jing Li

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) The Hong Kong Polytechnic University(香港理工大学) Huawei Technologies Co., Ltd.(华为技术有限公司) Shanghai Jiaotong University(上海交通大学)

AI总结 本研究提出SkillWeave框架,通过模块化技能包使LLM在固定内存预算下实现领域专业化,通过SkillZip压缩技术实现高效部署,实验表明其在多任务和代理基准上表现优异,速度提升达4倍。

Comments Accepted by ACL2026

详情
AI中文摘要

大型语言模型日益需要在多样化领域中进行专门化,但现有方法难以在多领域能力与严格的内存和推理约束之间取得平衡。本文介绍了SkillWeave,一种模块化改进框架,使LLM能够在固定内存预算下实现专业化。SkillWeave将通用模型的全部能力划分为技能包——轻量、领域特定的delta模块——以重新组织和细化模型的内部知识。为了高效部署,SkillWeave集成了SkillZip将技能包压缩为紧凑且推理友好的格式,从而在低延迟执行下实现强大的多领域性能。在多任务和代理基准上,一个9B的SkillWeave模型优于多个基线,并甚至超越了32B的单体LLM,同时实现了高达4倍的速度提升。

英文摘要

Large language models increasingly require specialization across diverse domains, yet existing approaches struggle to balance multi-domain capacities with strict memory and inference constraints. In this work, we introduce SkillWeave, a modular improvement framework that enables LLMs to specialize under fixed memory budgets. SkillWeave partitions the full capabilities of a general-purpose model into skillpacks (lightweight, domain-specific delta modules) that reorganize and refine the model's internal knowledge. For efficient deployment, SkillWeave integrates SkillZip to compress skillpacks into a compact, inference-ready format, enabling strong multi-domain performance with low-latency execution. On multi-task and agentic benchmarks, a 9B SkillWeave model outperforms several baselines and even surpasses a 32B monolithic LLM, while achieving up to a 4x speedup.
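
One plausible reading of "delta modules" is a stored difference between specialised and base weights. The sketch below follows that reading on flat lists of floats; it is an assumption and not the SkillWeave or SkillZip algorithm, and the magnitude-based sparsification merely stands in for whatever compression SkillZip actually performs.

```python
def extract_delta(base, tuned):
    """Per-parameter difference between a specialised model and its base."""
    return [t - b for b, t in zip(base, tuned)]


def sparsify(delta, keep):
    """Keep the `keep` largest-magnitude entries as a sparse {index: value} pack."""
    top = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)[:keep]
    return {i: delta[i] for i in sorted(top)}


def apply_skillpack(base, pack):
    """Return a copy of the base weights with the sparse delta added back."""
    out = list(base)
    for index, value in pack.items():
        out[index] += value
    return out
```

Under this reading, the base model is stored once and each domain costs only its sparse pack, which is how a fixed memory budget could cover several specialisations.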

2605.22200 2026-05-22 cs.CV cs.AI cs.LG 版本更新

OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025

OSS: 2024-2025 开放缝合技能基于视觉的评估挑战

Hanna Hoffmann, Setareh Bady, Claas de Boer, Max Kirchner, Jan Egger, Rainer Röhrig, Frank Hölzle, Lennart Johannes Gruber, Kunpeng Xie, Marlon Neuhaus, Victor Alves, Guilherme Barbosa, Leonardo Barroso, João Carvalho, Hao Chen, Gabriella d'Albenzio, André Ferreira, Nuno Gomes, Yuichiro Hayashi, Kousuke Hirasawa, Rebecca Hisey, Seungjae Hong, Seoi Jeong, Tiago Jesus, Daehong Kang, Satoshi Kasai, Shunsuke Kikuchi, Takayuki Kitasaka, Satoshi Kondo, Hyoun-Joong Kong, Youngbin Kong, Atsushi Kouno, Shlomi Laufer, Kyu Eun Lee, Bining Long, Nooshin Maghsoodi, Hiroki Matsuzaki, Evangelos Mazomenos, Ori Meiraz, Kensaku Mori, Marina Music, Masahiro Oda, Roi Papo, Jieun Park, Rafael Piexoto, Saeid Rezaei, Mariana Ribeiro, Soyeon Shin, Yang Shu, Idan Smoller, Danail Stoyanov, Yihui Wang, Xinkai Zhao, Sebastian Bodenstedt, Isabel Funke, Stefanie Speidel, Behrus Hinrichs-Puladi

发表机构 * Department of Translational Surgical Oncology, National Center for Tumor Diseases (NCT/UCC) Dresden(转化外科肿瘤学部,肿瘤疾病国家中心(NCT/UCC)德累斯顿) The Centre for Tactile Internet with Human-in-the-Loop (CeTI), TUD Dresden University of Technology(人在回路触觉互联网中心(CeTI),德累斯顿工业大学) Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen(口腔和颌面外科部,亚琛大学医院) Center for Tooth-, Mouth- and Jaw Medicine, University Göttingen(牙科、口科和颌科医学中心,哥廷根大学) Institute of Medical Informatics, University Hospital RWTH Aachen(医学信息学研究所,亚琛大学医院) Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology(医学系和卡尔·古斯塔夫·卡鲁斯大学医院,德累斯顿工业大学) German Cancer Research Center (DKFZ)(德国癌症研究中心(DKFZ)) Muroran Institute of Technology(室兰工业大学) Niigata University of Health and Welfare(新潟医疗福祉大学) Konica Minolta, Inc.(柯尼卡美能达公司) Jmees, Inc.(Jmees公司) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology(计算机科学与工程系,香港科技大学) Center Algoritmi/LASI, University of Minho(Algoritmi中心/LASI,米尼奥大学) Life and Health Sciences Research Institute (ICVS), School of Medicine, University of Minho(生命与健康科学研究院(ICVS),医学院,米尼奥大学) ICVS/3B's - PT Government Associate Laboratory(ICVS/3B's - PT政府附属实验室) Institute for AI in Medicine (IKIM), University Medicine Essen(医学人工智能研究所(IKIM),埃森大学医学部) The Faculty of Data and Decisions Science, Technion - Israel Institute of Technology(数据与决策科学系,以色列理工学院) UCL Hawkes Institute, University College London(UCL Hawkes研究所,伦敦大学学院) School of Computing, Queen's University(计算学院,女王大学) Department of Transdisciplinary Medicine, Seoul National University Hospital(跨学科医学部,首尔国立大学医院) Interdisciplinary Program in Medical Informatics, Seoul National University(医学信息学跨学科项目,首尔国立大学) Department of Clinical Medical Sciences, Seoul National University(临床医学科学部,首尔国立大学) Institute of Convergence Medicine with Innovative Technology, Seoul National University Hospital(融合医学与创新技术研究所,首尔国立大学医院) Department of Surgery, Seoul National University College of Medicine and Seoul National University Hospital(外科部,首尔国立大学医学院和首尔国立大学医院)

AI总结 本文提出OSS挑战,旨在通过基于视觉的评估方法提升开放手术技能训练,通过挑战数据集和多任务评估,评估不同方法在开放手术技能评估中的表现,揭示视频评估的潜力与限制。

Comments Stefanie Speidel and Behrus Hinrichs-Puladi jointly supervised this work. Submitted to MEDIA

详情
AI中文摘要

通过有效的训练实现高水平的外科技能对于最佳的患者结果至关重要。自动化、数据驱动的技能评估有潜力改善外科训练。尽管基于机器学习的方法在微创手术技能评估中越来越受欢迎,但其在开放手术中的应用仍然有限。我们提出了一个专门的MICCAI挑战,旨在基准测试和推进开放手术中的基于视觉的技能评估。挑战数据集包含在干实验室环境中用静态GoPro相机记录的开放缝合训练任务视频,除了主要视频模态外,还包含仪器轨迹数据。OSS挑战连续两年举办,分别包含两个和三个独立任务:(1) 将技能水平分类为四个类别,(2) 预测涵盖八个类别的完整客观结构化评估技术技能分数,(3) 跟踪手部和手术工具。参与者提交了多种解决方案,包括基于深度学习的视频模型、跟踪驱动的方法和混合方法。通用的空间时间视频模型始终实现了最强的性能,尽管概念上多样的方法在执行良好的情况下也能达到竞争水平。预测细粒度的OSATS分数仍然具有挑战性,但受益于增加的训练数据。关键点跟踪由于频繁的遮挡和出帧实例而变得困难,限制了当前基于运动的技能分析的应用。这项工作评估了创新和多样的解决方案,突显了基于视频的评估在开放手术中的潜力和当前限制,并识别了推进自动化技能评估向临床影响发展的关键方向。

英文摘要

Achieving high levels of surgical skill through effective training is essential for optimal patient outcomes. Automated, data-driven skill assessment holds significant potential to improve surgical training. While machine learning-based methods are increasingly popular for assessing skills in minimally invasive surgery, their application to open surgery remains limited. We present the results of a dedicated MICCAI challenge designed to benchmark and advance vision-based skill assessment in open surgery. The challenge dataset comprises videos of an open suturing training task recorded with a static GoPro camera in a dry-lab setting, with instrument trajectories available in addition to the primary video modality. The OSS Challenge was hosted over two consecutive years, comprising two and three independent tasks, respectively: (1) classifying skill level into four classes, (2) predicting the full Objective Structured Assessment of Technical Skills across eight categories, and (3) tracking hands and surgical tools. Participants submitted diverse solutions including deep learning-based video models, tracking-driven methods, and hybrid approaches. General-purpose spatiotemporal video models consistently achieved the strongest performance, though conceptually diverse approaches reached competitive levels when well-executed. Predicting fine-grained OSATS scores remains challenging but benefits substantially from increased training data. Keypoint tracking proves difficult given frequent occlusions and out-of-frame instances, limiting current applicability for motion-based skill analysis. This work benchmarks innovative and diverse solutions for surgical skill assessment, highlighting both the promise and current limitations of video-based evaluation in open surgery and identifying critical directions for advancing automated skill assessment toward clinical impact.

2605.22176 2026-05-22 cs.AI 版本更新

LLM-Metrics: Measuring Research Impact Through Large Language Model Memory

LLM-Metrics: 通过大语言模型记忆测量研究影响力

Si Shen, Wenhua Zhao, Danhao Zhu

发表机构 * School of Economics and Management(经济管理学院) College of Information Management(信息管理学院) Department of Computer Science and Engineering(计算机科学与工程系) Nanjing Agricultural University(南京农业大学) Nanjing University of Science and Technology(南京理工大学) Department of Criminal Science and Technology(犯罪科学与技术系) Jiangsu Police Institute(江苏警察学院)

AI总结 本文提出LLM-Metrics,一种基于大语言模型参数内存的研究影响力评估指标,通过设计多种选择题探针评估549篇2023-2024年计算机科学论文,发现高影响力论文在学术社区中获得更大曝光,从而在LLM训练数据中形成更强的参数记忆,与引用次数呈现显著相关性。

Comments 25 pages, 5 figures

详情
AI中文摘要

引用次数仍然是评估研究影响力的主要指标,但存在众所周知的局限性:时间滞后、学科偏见和马太效应。本文提出LLM-Metrics,一种基于大语言模型(LLMs)参数内存的研究影响力评估指标。核心假设是高影响力论文在学术界获得更大曝光,这种曝光以文本形式进入LLM训练数据,从而使模型形成更强的参数记忆。我们设计了四种类型的多项选择探针,涵盖标题识别、作者识别、方法识别和会议识别,并评估了549篇2023-2024年发表的计算机科学论文,覆盖17个LLM,参数范围从0.5B到72B,来自六个供应商。在17个模型中,15个产生了正预测,其中9个在p小于0.05时显著,与引用次数的斯皮尔曼相关性为rho=0.1495,p=0.0004。三个额外的发现支持所提出的机制。首先,预测信号在2024年的论文中更强,rho=0.1880,其引用次数在模型训练时间接近零,减少了简单反向因果解释的可能性。其次,作者识别探针显示出最强的判别能力,与曝光驱动的记忆机制一致。第三,模型规模和预测能力是非单调的:一个3B参数的模型Llama-3.2-3B-Instruct,rho=0.1829,优于大多数更大的模型,支持了一个选择性记忆假设,即较小模型的有限容量可以作为有效的信息过滤器。LLM-Metrics提供了一种实时、跨学科、不依赖引用的研究所评估范式。

英文摘要

Citation counts remain the dominant metric for assessing research impact, yet they suffer from well-documented limitations: temporal lag, disciplinary bias, and Matthew effects. Here we propose LLM-Metrics, a research-impact assessment metric derived from the parametric memory of large language models (LLMs). The central hypothesis is that high-impact papers receive greater exposure in the academic community, that this exposure enters LLM training data in textual form, and that models consequently form stronger parametric memory of these papers. We designed four types of multiple-choice probes, covering title recognition, author recognition, method recognition, and venue recognition, and evaluated 549 computer science papers published in 2023-2024 across 17 LLMs spanning 0.5B to 72B parameters from six vendors. Of the 17 models, 15 produced positive predictions, 9 of which were significant at p < 0.05, with an overall Spearman correlation of rho = 0.1495 and p = 0.0004 against citation counts. Three additional findings support the proposed mechanism. First, the predictive signal was stronger for 2024 papers, rho = 0.1880, whose citation counts were near zero at model-training time, reducing the plausibility of a simple reverse-causality explanation. Second, author-recognition probes showed the strongest discriminative power, consistent with an exposure-driven memory mechanism. Third, model scale and predictive power were non-monotonic: a 3B-parameter model, Llama-3.2-3B-Instruct, with rho = 0.1829, outperformed most larger models, supporting a selective-memory hypothesis in which the limited capacity of smaller models can serve as an effective information filter. LLM-Metrics offers a real-time, cross-disciplinary, citation-independent paradigm for research assessment.
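
The reported statistic is a Spearman rank correlation between a per-paper probe score and citation counts. As a minimal sketch of how such a value is computed (the inputs in the usage note are placeholders, not the paper's data, and the helper names are illustrative), Spearman's rho is the Pearson correlation of the two rank vectors, with tied values sharing their average rank:

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `spearman_rho([0.25, 0.5, 0.75, 1.0], [3, 10, 40, 200])` is 1 up to floating-point rounding, because the two orderings agree exactly; a value near 0.15, as in the abstract, indicates a weak but positive monotonic association.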

2605.22175 2026-05-22 cs.SE cs.AI 版本更新

SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

SWE-Mutation:LLMs能否在软件工程中生成可靠的测试套件?

Yuxuan Sun, Yuze Zhao, Yufeng Wang, Yao Du, Zhiyuan Ma, Jinbo Wang, Mengdi Zhang, Kai Zhang, Zhenya Huang

发表机构 * State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室) University of Science and Technology of China(中国科学技术大学) Beihang University(北航) School of Mathematical Sciences, Peking University(北京大学数学科学学院) NeoShell AI Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院)

AI总结 本文提出SWE-Mutation基准,用于评估LLM生成的测试套件质量,通过系统性地变异解决方案来测试测试套件的可靠性,并发现当前LLM在生成可靠且具有判别力的测试套件方面存在不足。

Comments 24 pages, 8 figures

Journal ref ACL 2026 Findings

详情
AI中文摘要

评估软件工程能力已成为现代大语言模型(LLMs)的核心组成部分;然而,进一步扩展的关键瓶颈不在于高质量解决方案的稀缺,而在于高质量测试套件的缺乏。测试套件对于合成程序修复轨迹和在强化学习中提供精确反馈信号至关重要。不幸的是,由于标注成本高且困难,高质量测试套件长期以来难以获得,而由LLM自动生成的测试套件往往肤浅且缺乏足够的判别力。作为构建高质量测试套件的第一步,我们介绍了SWE-Mutation,一个用于评估LLM生成测试套件的基准。该基准通过引入系统性变异的解决方案来表征测试套件,这些变异试图“欺骗”测试套件并通过验证。我们进一步提出了一种代理、语言无关的框架,用于自动生成复杂的变异体。我们的基准包含2,636个变异体,源自800个原始实例,并包含覆盖九种编程语言的多语言子集。对七种LLM的实验表明,即使DeepSeek-V3.1也仅达到10.20%的验证率和36.15%的检测率,突显了当前LLM的不足。此外,我们的代理变异策略增强了现实性,与传统方法相比,将平均检测率从71.04%降低到39.81%。这些发现揭示了当前LLM在生成可靠且具有判别力的测试套件方面存在的持续缺陷。

英文摘要

Evaluating software engineering capabilities has become a core component of modern large language models (LLMs); however, the key bottleneck hindering further scaling lies not in the scarcity of high-quality solutions, but in the lack of high-quality test suites. Test suites are indispensable both for synthesizing program repair trajectories and for providing precise feedback signals in reinforcement learning. Unfortunately, due to the high cost and difficulty of annotation, high-quality test suites have long been hard to obtain, while those automatically generated by LLMs tend to be superficial and lack sufficient discriminative power. As a first step toward constructing high-quality test suites, we introduce SWE-Mutation, a benchmark for evaluating LLM-generated test suites. The benchmark characterizes test suites by introducing systematically mutated solutions that attempt to "fool" the test suites and pass validation. We further propose an agentic, language-agnostic framework for automatically generating complex mutants. Our benchmark consists of 2,636 mutated variants derived from 800 original instances and includes a multilingual subset spanning nine programming languages. Experiments on seven LLMs reveal that even DeepSeek-V3.1 achieves only 10.20% verification and 36.15% detection rates, highlighting the inadequacy of current LLMs. Additionally, our agentic mutation strategy enhances realism, reducing average detection rates from 71.04% to 39.81% compared to conventional methods. These findings expose persistent deficiencies in the ability of current LLMs to generate reliable and discriminative test suites.
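
The mutation-based evaluation can be sketched on a toy function. The definitions used here are assumptions, since the abstract does not define its metrics precisely: a suite is treated as verified if it passes on the reference solution, and its detection rate is the fraction of mutants on which at least one test fails. The mutants are hand-written for illustration, not produced by the paper's agentic framework.

```python
def passes(solution, suite):
    """True if the solution satisfies every test in the suite."""
    return all(test(solution) for test in suite)


def detection_rate(suite, mutants):
    """Fraction of mutants rejected (killed) by at least one test."""
    killed = sum(1 for mutant in mutants if not passes(mutant, suite))
    return killed / len(mutants)


def reference_abs(x):
    return x if x >= 0 else -x


def mutant_identity(x):
    return x                      # wrong for every negative input


def mutant_boundary(x):
    return -x if x < -1 else x    # wrong only at x == -1


mutants = [mutant_identity, mutant_boundary]

weak_suite = [lambda f: f(3) == 3, lambda f: f(0) == 0]
medium_suite = weak_suite + [lambda f: f(-5) == 5]
strong_suite = medium_suite + [lambda f: f(-1) == 1]
```

All three suites pass on `reference_abs`, yet their detection rates on these two mutants are 0.0, 0.5, and 1.0 respectively, which is the gap between a superficial suite and a discriminative one that the benchmark is designed to expose.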

2605.22168 2026-05-22 cs.AI cs.LG 版本更新

Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

衡量跨模态协同:VLM可解释性的一个基准

Joël Roman Ky, Salah Ghamizi, Maxime Cordy

发表机构 * University of Luxembourg(卢森堡大学) Luxembourg Institute of Health (LIH)(卢森堡健康研究院)

AI总结 本文提出Synergistic Faithfulness作为衡量VLM跨模态协同的指标,解决了传统单模态评估方法在评估VLM可解释性时的不足,通过引入Shapley交互指数,实现了对多模态协同的准确评估,同时提升了计算效率。

详情
AI中文摘要

视觉-语言模型(VLMs)将复杂的视觉输入映射到语义空间,但目前解释VLM的跨模态推理仍依赖于通过单模态扰动度量评估的后验解释器。我们揭示了这一范式的局限性:由于多模态数据集包含语言先验和模态偏差,VLMs经常表现出跨模态冗余,允许它们仅使用文本回答视觉查询。因此,单模态度量惩罚忠实的解释器,导致评估崩溃,其中视觉和文本排名根本矛盾(Kendall's τ= -0.06)。为了解决这一问题,我们引入了Synergistic Faithfulness(F_syn),一个基于Shapley交互指数的可扩展度量,严格隔离模态间的Harsanyi收益,作为高度准确的替代指标(ρ= 0.92),同时实现了24倍的计算加速。在评估8种不同的XAI方法、3种VLM架构和3个基准数据集时,发现为VLM设计的解释器严重过度索引视觉显著性,并在捕捉真正的跨模态协同方面显著劣于适应的注意力方法。通过将视觉合理性与跨模态忠实性解耦,本文提供了一个严格评估框架,以安全审计VLM在高风险部署中的推理。

英文摘要

Vision-Language Models (VLMs) map complex visual inputs to semantic spaces, but interpreting the cross-modal reasoning of VLMs currently relies on post-hoc explainers evaluated via unimodal perturbation metrics. We expose a limitation in this paradigm: because multimodal datasets contain language priors and modality biases, VLMs frequently exhibit cross-modal redundancy, allowing them to answer visual queries using text alone. Consequently, unimodal metrics penalize faithful explainers, triggering an evaluation collapse where visual and textual rankings fundamentally contradict each other (Kendall's $\tau = -0.06$). To resolve this, we introduce Synergistic Faithfulness ($\mathcal{F}_{syn}$), a scalable metric rooted in the Shapley Interaction Index that strictly isolates the joint Harsanyi dividend between modalities, serving as a highly accurate surrogate ($\rho = 0.92$) while achieving a $24\times$ computational speedup. Evaluating 8 distinct XAI methods across 3 VLM architectures and 3 benchmark datasets reveals that explainers proposed for VLMs heavily over-index on visual salience and significantly underperform adapted attention-based methods in capturing true cross-modal synergy. By decoupling visual plausibility from cross-modal faithfulness, this work provides a rigorous evaluation framework required to safely audit VLM reasoning in high-stakes deployments.
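
For intuition about the quantity the metric builds on, the sketch below computes the standard pairwise Shapley Interaction Index for a small cooperative game. It is not the paper's $\mathcal{F}_{syn}$ estimator, and treating the two modalities as the only players, as well as the example values, are assumptions made for illustration. With exactly two players the index reduces to v({image, text}) - v({image}) - v({text}) + v(∅), which is the Harsanyi dividend of the joint coalition.

```python
from itertools import combinations
from math import factorial


def shapley_interaction(value, players, i, j):
    """Pairwise Shapley Interaction Index of players i and j.

    `value` maps a frozenset of players to the payoff of that coalition.
    """
    others = [p for p in players if p not in (i, j)]
    n = len(players)
    total = 0.0
    for size in range(len(others) + 1):
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (
                value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            )
    return total


def table_game(table):
    """Wrap a {frozenset: payoff} table as a value function."""
    return lambda coalition: table[frozenset(coalition)]


modalities = ["image", "text"]

# Hypothetical accuracies: text alone already answers the query (redundancy).
redundant = table_game({
    frozenset(): 0.25, frozenset({"image"}): 0.9,
    frozenset({"text"}): 0.9, frozenset({"image", "text"}): 0.9,
})
# Hypothetical accuracies: neither modality helps alone (synergy).
synergistic = table_game({
    frozenset(): 0.25, frozenset({"image"}): 0.25,
    frozenset({"text"}): 0.25, frozenset({"image", "text"}): 0.9,
})
```

In this toy setting the redundant game yields a negative interaction and the synergistic game a positive one, which is the distinction that unimodal perturbation metrics cannot make.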

2605.22158 2026-05-22 cs.AI cs.CV 版本更新

ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

ST-SimDiff:平衡时空相似性与差异,实现基于多模态大语言模型的高效视频理解

Bingjun Luo, Tony Wang, Chaoqi Chen, Xinpeng Ding

发表机构 * Tsinghua University(清华大学) Shenzhen University(深圳大学) Xidian University(西安电子科技大学)

AI总结 本文提出ST-SimDiff框架,通过平衡时空相似性与差异来提高视频理解效率,利用时空图和双选择策略减少计算成本并提升性能。

Comments Accepted by ICLR 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在处理长视频时面临显著的计算开销,因为需要处理大量的视觉标记。为了提高效率,现有方法主要通过修剪或合并标记来减少冗余,但这些方法忽略了视频内容的一个关键维度,即变化和转折点,并且缺乏对时空关系的协作模型。为此,我们提出了一种新的视角:相似性用于识别冗余,而差异用于捕捉关键事件。基于此,我们设计了一个无需训练的框架,名为ST-SimDiff。我们首先从视觉标记中构建时空图,以统一建模其复杂的关联。随后,我们采用并行双选择策略:1)基于相似性的选择使用社区检测保留代表性标记,压缩静态信息;2)基于时间差异的选择精确定位内容变化点,以保留捕捉关键动态变化的标记。这使它能够用最少的标记保留静态和动态内容。广泛实验表明,我们的方法在显著优于现有最先进方法的同时,大幅减少了计算成本。我们的代码可在https://github.com/bingjunluo/ST-SimDiff上获得。

英文摘要

Multimodal Large Language Models (MLLMs) face significant computational overhead when processing long videos due to the massive number of visual tokens required. To improve efficiency, existing methods primarily reduce redundancy by pruning or merging tokens based on importance or similarity. However, these approaches largely overlook a critical dimension of video content, i.e., changes and turning points, and they lack a collaborative model for spatio-temporal relationships. To address this, we propose a new perspective: similarity is for identifying redundancy, while difference is for capturing key events. Based on this, we designed a training-free framework named ST-SimDiff. We first construct a spatio-temporal graph from the visual tokens to uniformly model their complex associations. Subsequently, we employ a parallel dual-selection strategy: 1) similarity-based selection uses community detection to retain representative tokens, compressing static information; 2) temporal difference-based selection precisely locates content-changing points to preserve tokens that capture key dynamic shifts. This allows it to preserve both static and dynamic content with a minimal number of tokens. Extensive experiments show our method significantly outperforms state-of-the-art approaches while substantially reducing computational costs. Our code is available at https://github.com/bingjunluo/ST-SimDiff.
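
The dual-selection idea can be sketched on a toy sequence with one token per frame. This is a simplification under stated assumptions and not the released ST-SimDiff code: connected components of a thresholded cosine-similarity graph stand in for community detection, and both thresholds are arbitrary illustrative values.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_representatives(tokens, threshold=0.9):
    """One representative (lowest index) per group of mutually similar tokens."""
    parent = list(range(len(tokens)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if cosine(tokens[i], tokens[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                # Point the larger root at the smaller so the root is the lowest index.
                parent[max(root_i, root_j)] = min(root_i, root_j)
    return sorted({find(i) for i in range(len(tokens))})


def change_points(tokens, threshold=0.5):
    """Indices whose token differs strongly from the previous frame's token."""
    return [t for t in range(1, len(tokens))
            if 1.0 - cosine(tokens[t - 1], tokens[t]) >= threshold]


def select_tokens(tokens):
    """Union of static representatives and dynamic change points."""
    return sorted(set(similarity_representatives(tokens)) | set(change_points(tokens)))


frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 0.01]]
```

On `frames`, similarity alone keeps indices 0 and 2 and would drop the return to the first scene at index 4; the temporal-difference branch recovers it, so `select_tokens(frames)` keeps `[0, 2, 4]`.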

2605.22156 2026-05-22 cs.LG cs.AI 版本更新

One-Way Policy Optimization for Self-Evolving LLMs

单向策略优化用于自演化大语言模型

Shuo Yang, Jinda Lu, Kexin Huang, Chiyu Ma, Shaohang Wei, Yuyang Liu, Guoyin Wang, Jingren Zhou, Li Yuan

发表机构 * Shenzhen Graduate School, Peking University(北京大学深圳研究生院) Dartmouth College(达特茅斯学院) Alibaba(阿里巴巴)

AI总结 本文提出单向策略优化方法,通过解耦优化方向与更新幅度,解决传统方法中验证器奖励稀疏导致的训练不稳定问题,实现大语言模型的持续自演化。

详情
AI中文摘要

可验证奖励的强化学习(RLVR)已成为扩展大语言模型(LLMs)推理能力的一种有前景的范式。然而,二进制验证器奖励的稀疏性往往导致低效和优化不稳定。为了稳定训练,现有方法通常施加与参考策略相关的令牌级约束。我们发现这些约束会无差别地惩罚偏差;当策略试图超越参考时,这会翻转由验证器确定的方向,从而抑制收益。为了解决这个问题,我们提出了一种基于解耦优化方向与更新幅度原理的单向策略优化(OWPO)方法。在OWPO中,验证器规定更新方向,而参考策略仅用于调整更新幅度。具体而言,OWPO采用不对称重加权:它对劣质偏差(策略落后于参考)执行加速对齐,对优质偏差(策略超越参考)执行收益锁定。此外,通过整合迭代参考更新,OWPO创建了“棘轮效应”,持续巩固收益。实验结果表明,OWPO在DAPO、OPD和MOPD等强基线方法上表现更优,突破了固定先验的瓶颈,使大语言模型能够持续自演化,而无需依赖外部参考模型。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become a promising paradigm for scaling reasoning capabilities of Large Language Models (LLMs). However, the sparsity of binary verifier rewards often leads to low efficiency and optimization instability. To stabilize training, existing methods typically impose token-level constraints relative to a reference policy. We identify that such constraints penalize deviations indiscriminately; this can flip the verifier-determined direction when the policy attempts to outperform the reference, thereby suppressing gains. To resolve this, we propose One-Way Policy Optimization (OWPO), a method based on the principle of decoupling optimization direction from update magnitude. In OWPO, the verifier dictates the update direction, while the reference policy serves only to adjust the magnitude. Specifically, OWPO applies asymmetric reweighting: it performs Accelerated Alignment for inferior deviations (where the policy lags behind the reference) and Gain Locking for superior deviations (where the policy surpasses the reference). Furthermore, by incorporating iterative reference updates, OWPO creates a "Ratchet Effect" that continuously consolidates gains. Experimental results demonstrate that OWPO outperforms strong baselines, including DAPO, OPD, and MOPD, breaking the bottleneck of fixed priors to enable continuous self-evolution without reliance on external reference models.

2605.22154 2026-05-22 cs.AI 版本更新

IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

IdleSpec: 通过投机规划利用空闲时间用于LLM代理

Daewon Choi, Kyunghyun Park, Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Jinwoo Shin, Aram Galstyan

发表机构 * KAIST(韩国科学技术院) Amazon AGI(亚马逊人工智慧实验室) Together AI

AI总结 本文提出IdleSpec,一种利用空闲时间提升LLM代理性能的方法,通过在空闲期间生成计划候选并减少延迟开销,从而在多种代理场景中显著提高性能。

详情
AI中文摘要

基于大型语言模型(LLM)的代理通过多步骤推理和迭代工具调用及环境交互来解决复杂任务,这在等待观察时会产生空闲时间。尽管大多数代理场景中普遍存在空闲时间,但现有工作将其视为不可避免的开销或提出受限解决方案,忽略了不同工具调用之间的不同计算预算和未来观察不确定性,从而导致空闲时间利用不充分。本文介绍IdleSpec,一种可扩展且通用的推理方法,利用空闲时间计算来提高代理性能同时最小化延迟开销。具体而言,IdleSpec在空闲期间迭代生成计划候选,并在观察可用时汇总它们以引导下一步推理。为了在观察不确定性下有效生成计划,IdleSpec从学习的分布中采样互补的起草策略(即渐进和恢复),该分布通过后验反馈更新。我们的实验表明,IdleSpec在各种代理场景中通过有效利用空闲时间显著提高了代理性能。特别是,在GAIA和FRAMES上,IdleSpec使用Gemini-2.5-Flash实现了55.6%的平均准确率,超过了不使用空闲时间的基线方法5.1%。此外,在涉及大量代码执行延迟的MLE-Bench上,IdleSpec在Any Medal速率上实现了高达9.1%的性能提升,突显了其在长周期任务中的通用性。

英文摘要

Large language model (LLM)-based agents solve complex tasks by leveraging multi-step reasoning with iterative tool calls and environment interactions, which incur idle time while waiting for observations. Despite the prevalence of idle time in most agentic scenarios, existing works treat it as an unavoidable overhead or propose restricted solutions that overlook varying computational budgets across different tool calls and future observation uncertainty, thereby leading to suboptimal utilization of idle time. In this paper, we introduce IdleSpec, a scalable and generic inference approach that leverages idle-time computation to improve agent performance while minimizing latency overhead. Specifically, IdleSpec iteratively generates plan candidates during idle periods and, once observations become available, aggregates them to guide the next reasoning step. For effective plan generation under observation uncertainty, IdleSpec samples between complementary drafting strategies (i.e., progressive and recovery) from a learned distribution that is updated via posterior feedback. Our experiments demonstrate that IdleSpec significantly improves agent performance in various agentic scenarios by effectively utilizing idle time. In particular, on GAIA and FRAMES, IdleSpec achieves 55.6% average accuracy with Gemini-2.5-Flash, surpassing the vanilla baseline without idle-time usage by 5.1%. Furthermore, for MLE-Bench, which involves substantial delay from code executions, IdleSpec achieves performance gains of up to 9.1% on the Any Medal rate, highlighting its generalizability to long-horizon tasks.
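
Sampling between the two drafting strategies from a distribution updated by feedback can be sketched with a Beta-Bernoulli model and Thompson-style sampling. The abstract does not specify the learned distribution or the feedback signal, so this is a stand-in technique chosen for illustration; the class name and the boolean `useful` signal are hypothetical.

```python
import random


class StrategySampler:
    """Beta-Bernoulli sampler over drafting strategies (illustrative stand-in)."""

    def __init__(self, strategies=("progressive", "recovery"), seed=0):
        # One Beta(alpha, beta) posterior per strategy, starting from a uniform prior.
        self.counts = {name: [1.0, 1.0] for name in strategies}
        self.rng = random.Random(seed)

    def sample(self):
        """Draw a success rate from each posterior and pick the highest draw."""
        draws = {name: self.rng.betavariate(alpha, beta)
                 for name, (alpha, beta) in self.counts.items()}
        return max(draws, key=draws.get)

    def update(self, strategy, useful):
        """Posterior feedback: record whether the drafted plan turned out useful."""
        self.counts[strategy][0 if useful else 1] += 1.0

    def mean(self, strategy):
        alpha, beta = self.counts[strategy]
        return alpha / (alpha + beta)
```

During an idle period the agent would call `sample()` to choose how to draft the next plan candidate, and call `update()` once the real observation shows whether that draft was usable; strategies that keep paying off are then drawn more often.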

2605.22148 2026-05-22 cs.AI cs.CL 版本更新

Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

Ratchet:面向自演化LLM代理的极简卫生方案

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

发表机构 * AWS Generative AI Innovation Center(AWS 生成式人工智能创新中心) HSBC Holdings Plc.(汇丰控股有限公司) HSBC Technology Center, China(汇丰技术中心,中国)

AI总结 本文提出Ratchet,一种单代理循环,使冻结的LLM能够自行编写、检索、整理和淘汰其自然语言技能,通过整合四个卫生机制提升技能库的生命周期管理,从而在MBPP+ hard-100数据集上显著提升性能。

Comments 16 pages, 2 figures, 6 tables. Extends arXiv:2605.19576 with the SWE-bench Verified evaluation and a non-divergence analysis (Proposition 1)

详情
英文摘要

Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver $+0.0$pp over no-skill baselines while human-curated ones deliver $+16.2$pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a $0.258 \pm 0.047$ baseline to a late-window rolling mean of $0.584$ (peak $0.658 \pm 0.042$) across 100 rounds and 3 seeds, a $+0.328 \pm 0.018$ rolling-mean gain where the no-skill control drifts at $+0.002 \pm 0.005$; the same recipe transfers to an agentic solver on SWE-bench Verified ($+0.22$ peak lift over 20 rounds). Eight ablations (A1--A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.
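
Two of the hygiene mechanisms named above, outcome-driven retirement and a bounded active cap, can be sketched as bookkeeping on a skill library. The thresholds, the eviction rule, and all names below are hypothetical; the abstract does not give the actual criteria, so this is a sketch of the general mechanism and not Ratchet's implementation.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    uses: int = 0
    wins: int = 0

    def success_rate(self) -> float:
        return self.wins / self.uses if self.uses else 0.0


class SkillLibrary:
    def __init__(self, active_cap=2, retire_below=0.3, min_uses=4):
        self.active_cap = active_cap      # maximum number of active skills
        self.retire_below = retire_below  # retire once success rate falls under this
        self.min_uses = min_uses          # evidence required before retiring
        self.active = {}
        self.retired = []

    def add(self, name):
        self.active[name] = Skill(name)
        # Bounded cap: evict the weakest existing skill, never the one just added.
        while len(self.active) > self.active_cap:
            candidates = [s for s in self.active.values() if s.name != name]
            worst = min(candidates, key=lambda s: s.success_rate())
            self._retire(worst.name)

    def record(self, name, solved):
        """Outcome-driven retirement: update counts, then retire persistent losers."""
        skill = self.active[name]
        skill.uses += 1
        skill.wins += int(solved)
        if skill.uses >= self.min_uses and skill.success_rate() < self.retire_below:
            self._retire(name)

    def _retire(self, name):
        self.retired.append(self.active.pop(name).name)
```

Because the library can never grow past the cap and skills with a poor track record are removed, the set of skills offered to the solver stays small and outcome-filtered, which is the intuition behind the non-divergence claim.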

2605.22138 2026-05-22 cs.AI cs.CL cs.LG cs.RO 版本更新

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

通过自我调节模拟规划实现高效的代理推理

Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing

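Affiliations * Institute of Foundation Models (IFM); Carnegie Mellon University

AI summary This paper proposes improving the efficiency of agentic reasoning by decomposing decision-making into three systems (simulative reasoning, self-regulation, and reactive execution) and reports the performance of the SR$^2$AM model across tasks.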

Comments Code and model artifacts are available at https://github.com/sailing-lab/sr2am

Details
AI Chinese abstract

代理应该如何决定何时以及如何规划?主流方法将代理建模为具有自适应计算的反应策略(例如链式思考),通过端到端训练期望规划隐式地出现。由于无法控制规划的存在、结构或时间范围,这些系统显著增加了推理长度,导致无效的令牌使用,而没有可靠的准确性提升。我们主张高效的代理推理受益于将决策过程分解为三个系统:模拟推理(系统II)通过世界模型将推理根植于未来状态预测;自我调节(系统III)通过学习的配置器决定何时以及如何深入规划;以及反应执行(系统I)处理细粒度的动作。模拟推理在不同任务中提供统一的规划,而无需每个领域的工程,同时自我调节确保规划只在需要时被调用。为了测试这一点,我们开发了SR$^2$AM(Self-Regulated Simulative Reasoning Agentic LLM),在LLM的链式思考中实现这两个系统作为独立阶段,其中LLM作为世界模型。我们探索了两种实现:从提示的多模块系统中记录决策(v0.1)和从预训练推理LLM的痕迹中重建结构化计划(v1.0),通过监督学习和强化学习(RL)训练。在数学、科学、表格分析和网络信息检索中,v0.1-8B和v1.0-30B在性能上与120-355B和685B-1T参数系统相当,而v1.0-30B使用的推理令牌比同类代理LLM少25.8-95.3%。强化学习使平均规划时间增加22.8%,而规划频率仅增加2.0%,表明它学会了更远地规划而不是更频繁地规划。更广泛地说,学习的自我调节实例化了一个原则,我们预计可以扩展到代理如何管理自己的学习和适应。

English abstract

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

2605.22122 2026-05-22 cs.CR cs.AI updated version

Adversarial Trust Poisoning in Vehicular Collaborative Perception

车联网协作感知中的对抗信任污染

Yutong Liu, Chenyi Wang, Ming F. Li, Qingzhao Zhang

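Affiliations * Connected and Autonomous Vehicles; Collaborative Perception

AI summary This study presents TrustFlip, an attack that exploits consistency-based defenses to poison the trust scores of benign vehicles, degrading perception and potentially causing safety failures, and proposes TrustReflect as a mitigation.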

Details
AI Chinese abstract

协作感知(CP)使连接和自动驾驶车辆能够共享传感器数据并共同感知环境。为防御对抗者篡改共享数据,现有系统采用跨车辆不一致性检测和信任估计,惩罚与多数观察冲突的车辆。本文证明这些防御本身引入了新的攻击面。我们提出了TrustFlip,一种利用一致性防御机制污染对良性车辆信任的新型攻击。不同于注入虚假数据,它部署真实的物理对抗对象,诱导良性车辆产生不一致观察。由此产生的不一致被防御机制误归因于目标车辆,导致其信任分数下降并最终被降权或排除。因此,系统失去可靠感知贡献者,降低感知能力,可能引发安全关键故障。我们在多个协作感知架构和防御机制上评估TrustFlip。结果表明,最先进防御可显著受影响:攻击在87.7%的场景中将目标良性车辆排除在协作之外,并将平均精度(AP)降低高达13%。作为初步缓解措施,我们引入TrustReflect,一种轻量级的自我反思机制,将争议区域标记为不确定并排除在信任评估之外,将攻击成功率降低35-100%。

English abstract

Collaborative perception (CP) enables connected and autonomous vehicles to share sensor data and jointly reason about their environment. To defend against adversaries that fabricate or manipulate shared data, existing systems employ cross-vehicle inconsistency detection and trust estimation, penalizing vehicles whose observations conflict with the majority. In this work, we show that these defenses themselves introduce a new attack surface. We present TrustFlip, a novel attack that weaponizes consistency-based defenses to poison the trust assigned to benign vehicles. Instead of injecting false data into the collaboration pipeline, it deploys physical adversarial objects that are genuine but induce inconsistent observations among benign vehicles. The resulting inconsistencies are misattributed by the defense to the targeted vehicle, causing its trust score to degrade and eventually leading to its downweighting or exclusion from collaboration. Consequently, the system loses reliable sensing contributors, degrading perception capability and potentially inducing safety-critical failures. We evaluate TrustFlip across multiple collaborative perception architectures and defense mechanisms. Our results show that state-of-the-art defenses can be significantly affected: the attack removes the targeted benign vehicle from collaboration in up to 87.7% of scenarios and drops Average Precision (AP) by up to 13%. As an initial mitigation, we introduce TrustReflect, a lightweight self-reflection mechanism that marks disputed regions as uncertain and excludes them from trust evaluation, reducing the attack success rate by 35-100%.

2605.22109 2026-05-22 cs.AI cs.CV cs.CY updated version

Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

感知还是偏见:大语言模型能否超越个性的第一印象?

Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu, Kaipeng Zhang, Yoichi Sato, Yifei Huang

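Affiliations * The University of Tokyo; Shanda AI Research Tokyo; Dalian University of Technology

AI summary This paper examines how well multimodal large language models (MLLMs) perceive personality, proposes a new task, Grounded Personality Reasoning (GPR), builds a new dataset, MM-OCEAN, and uses a three-tier evaluation to expose prejudice in MLLM personality reasoning.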

Details
AI Chinese abstract

多模态大语言模型(MLLMs)正在越来越多地应用于需要感知个性的人类交互角色中,但现有的基准测试仅评估其对大五人格特质分数的预测能力,未能确定模型是通过行为理解真正感知个性,还是仅通过表面模式匹配进行偏见判断。我们通过三个贡献填补了这一空白:(i)一个新的任务:我们正式定义了Grounded Personality Reasoning(GPR),要求MLLMs通过一系列评分、推理和锚定过程,将每个大五评分与可观察的证据联系起来;(ii)一个新的数据集:我们发布了MM-OCEAN(1,104个视频,5,320个多项选择题),由多代理流程生成,包含时间戳行为观察、证据支持的特质分析以及七类线索锚定多项选择题;(iii)基准测试和分析:我们设计了一个三级评估体系(评分、推理、锚定)以及四个样本级失败模式指标:偏见率(PR)、编造率(CR)、整合失败率(IR)和整体锚定率(HR),并基准测试了27个MLLMs(13个封闭式,14个开放式)。分析揭示了一个显著的偏见差距:在所有正确评分中,51%的评分没有基于检索到的线索进行锚定,而整体锚定率仅在0-33.5%之间。这些发现揭示了获得正确分数与为正确原因推理之间的脱节,为MLLMs中的扎根社会认知绘制了路线图。

English abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.

2605.22106 2026-05-22 cs.AI updated version

ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

ArborKV: 一种面向树状推理的KV缓存管理方法

Yeqiu Chen, Ziyan Liu, Zhenxin Huang, Runquan Gui, Hong Wang, Lei Liu

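Affiliations * University of Science; Huawei Technologies Co., Ltd.

AI summary This paper proposes ArborKV, a structure-aware KV cache management method that combines a lightweight value estimator with a tree-aware allocation policy to perform purely token-extractive eviction with lazy rehydration, reducing KV memory use while preserving accuracy and enabling larger tree-structured reasoning searches under a fixed hardware budget.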

Details
AI Chinese abstract

最近在大语言模型推理方面的进展越来越多地从单次生成转向在中间推理状态上的显式搜索。Tree-of-Thoughts (ToT) 将推理组织为具有分支和回溯的树状搜索,但显著放大了键值(KV)缓存:保留用于前沿部分轨迹的KV状态很快成为内存瓶颈,限制了吞吐量并约束了在固定硬件预算下的搜索深度和宽度。我们通过观察到ToT风格推理中的KV重用由搜索动态决定:短期解码主要依赖于活跃分支及其祖先,而无效子树具有低短期重用概率但必须保持可恢复以供回溯。受此启发,我们提出了ArborKV,一种结构感知的淘汰框架,结合轻量级值估计器和树状分配策略,并进行纯token提取式淘汰与惰性再水合以支持回溯。在ToT风格推理基准上的实验表明,ArborKV实现了高达约4倍的KV内存减少,同时保持接近完整保留的精度,使在固定设备预算下能支持更大规模的树状推理搜索。

English abstract

Recent progress in LLM reasoning has increasingly shifted from single-pass generation to explicit search over intermediate reasoning states. Tree-of-Thoughts (ToT) organizes inference as tree-structured search with branching and backtracking, but it substantially amplifies the Key--Value (KV) cache: retaining KV states for a frontier of partial trajectories quickly becomes a memory bottleneck that limits throughput and constrains search depth and width under fixed hardware budgets. We address this challenge by observing that KV reuse in ToT-style inference is governed by search dynamics: near-term decoding depends primarily on the active branch and its ancestors, whereas inactive subtrees have low short-term reuse probability yet must remain recoverable for backtracking. Motivated by this, we propose ArborKV, a structure-aware eviction framework that couples a lightweight value estimator with a tree-aware allocation policy, and performs purely token-extractive eviction with lazy rehydration to support revisits. Experiments on ToT-style reasoning benchmarks show that ArborKV achieves up to ~4x peak KV-memory reduction while preserving near-full-retention accuracy, enabling larger search configurations under fixed device budgets that would otherwise run out of memory.
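The observation that near-term decoding depends on the active branch and its ancestors suggests a simple retention rule, sketched below at the granularity of whole tree nodes. This is only an illustration: ArborKV itself evicts at the token level with a learned value estimator, whereas the function names, the `scores` dictionary, and the node-count `budget` here are hypothetical stand-ins.

```python
def ancestors_inclusive(parent, node):
    """Return the path from `node` up to the root, following parent pointers."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def plan_retention(parent, active, scores, budget):
    """Split tree nodes into (keep, evict) sets.

    The active branch and its ancestors are always kept. Any budget that
    remains goes to inactive nodes in descending order of estimated value.
    Evicted nodes would be reloaded lazily if the search backtracks to them.
    """
    keep = set(ancestors_inclusive(parent, active))
    inactive = sorted(
        (n for n in parent if n not in keep),
        key=lambda n: scores.get(n, 0.0),
        reverse=True,
    )
    for node in inactive:
        if len(keep) >= budget:
            break
        keep.add(node)
    return keep, set(parent) - keep


# Toy search tree: root -> {a, b}, a -> {a1, a2}, b -> {b1}.
parent = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b"}
scores = {"a2": 0.9, "b": 0.4, "b1": 0.1}  # hypothetical value estimates
keep, evict = plan_retention(parent, active="a1", scores=scores, budget=4)
```

With a budget of four nodes, the path `root -> a -> a1` is retained along with the highest-scoring inactive node, and the subtree under `b` is evicted.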

2605.22102 2026-05-22 cs.AI updated version

ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

ExComm:探索阶段通信用于容错的代理测试时间扩展

Woomin Song, Beomjun Kim, Daewon Choi, Sai Muralidhar Jayanthi, Saket Dingliwal, Jinwoo Shin, Aram Galstyan

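Affiliations * KAIST; Amazon AGI; Together AI

AI summary This paper proposes ExComm, a communication protocol for exploration-stage agentic test-time scaling that periodically audits agent belief states to detect cross-agent factual conflicts and resolves them through a dedicated tool-based verification loop, making test-time scaling more resilient to errors.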

Details
AI Chinese abstract

在长周期代理测试时间扩展中,错误传播是一个常见的失败模式,其中中间步骤中引入的事实错误或无效推论会持续存在于代理的信念状态中,并污染后续推理。现有测试时间扩展方法对这一过程控制有限,因为它们通常依赖于代理自行检测错误、在错误轨迹中选择或仅在错误已影响推理路径后才修正解决方案。我们提出ExComm,一种用于探索阶段的代理测试时间扩展通信协议。ExComm受到经验观察的启发,即并行代理推理中的大多数中间错误会产生可检测的跨代理事实冲突。利用代理工作流的迭代结构,ExComm定期审计代理信念状态以检测此类冲突,通过专用工具验证循环解决冲突,并将简洁、针对性的反馈返回相关代理。通过软信念更新将修正纳入其中,即附加已验证的反馈而非覆盖现有信念。此外,为防止由于通信导致轨迹多样性崩溃,ExComm进一步引入轨迹多样化模块,将冗余轨迹引导至正交策略。在AIME 2024、AIME 2025和GAIA上使用Gemini-2.5-Flash-Lite和Qwen3.5-4B的实验表明,ExComm在测试时间扩展中一致优于强基线,分别在最佳基线上实现了平均性能提升5.7%和5.0%。进一步分析显示了改进的错误恢复、有利的扩展行为、比适应通信基线更强的多样性,以及在评估方法中最佳的性能-成本权衡。

English abstract

A common failure mode in long-horizon agentic test-time scaling is error propagation, where factual errors or invalid deductions introduced at intermediate steps persist in the agent's belief state and contaminate later reasoning. Existing test-time scaling methods provide limited control over this process, as they often rely on agents to detect their own mistakes, select among flawed trajectories, or refine solutions only after errors have already shaped the reasoning path. We propose ExComm, a communication protocol for exploration-stage agentic test-time scaling. ExComm is motivated by the empirical observation that the majority of intermediate errors in parallel agentic reasoning produce detectable cross-agent factual conflicts. Leveraging the iterative structure of agentic workflows, ExComm periodically audits agent belief states to detect such conflicts, resolves them through a dedicated tool-based verification loop, and returns concise, targeted feedback to the involved agents. Corrections are incorporated through soft belief updates, which append verified feedback rather than overwriting existing beliefs. To prevent communication from collapsing trajectory diversity, ExComm also introduces a trajectory diversification module that redirects redundant trajectories toward orthogonal strategies. Experiments on AIME 2024, AIME 2025, and GAIA with Gemini-2.5-Flash-Lite and Qwen3.5-4B show that ExComm consistently outperforms strong test-time scaling baselines, achieving average performance gains of 5.7% and 5.0% over the best-performing baselines, respectively. Further analyses demonstrate improved error recovery, favorable scaling behavior, stronger diversity than adapted communication baselines, and the best performance-cost trade-off among the evaluated methods.

2605.22098 2026-05-22 cs.CV cs.AI cs.LG updated version

TextTeacher: What Can Language Teach About Images?

TextTeacher: 语言能教会我们关于图像什么?

Tobias Christian Nauen, Stanislav Frolov, Brian Bernhard Moser, Federico Raue, Ahmed Anwar, Andreas Dengel

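Affiliations * RPTU University Kaiserslautern-Landau; German Research Center for Artificial Intelligence (DFKI)

AI summary This study proposes TextTeacher, which injects a language model's semantic knowledge into image classification training to improve vision models while keeping the inference-time model simple.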

Comments Published at TMLR

Journal ref Transactions on Machine Learning Research, ISSN 2835-8856, 2026

Details
AI Chinese abstract

柏拉图表示假设认为,足够大的模型会收敛到共享的表示几何结构,即使跨模态。受此启发,我们提出问题:语言模型的语义知识能否有效提升视觉模型?为此,我们引入TextTeacher,一种简单的辅助目标,将文本嵌入作为额外信息注入图像分类训练。TextTeacher利用 readily available 的图像描述、预训练并冻结的文本编码器以及轻量级投影,生成语义锚点,高效引导训练期间的表示,同时保持推理时的模型不变。在ImageNet上使用标准ViT后端,TextTeacher将准确率提升高达+2.7个百分点(p.p.),并在相同配方和计算条件下产生一致的迁移增益(平均+1.0 p.p.)。它优于视觉知识蒸馏,在相同计算预算下更准确,或在相似准确率下更快。我们的分析表明,TextTeacher在训练初期塑造了更深的层,并通过补充互补的语义线索帮助泛化。TextTeacher增加的开销很小,不需要对目标模型进行昂贵的多模态训练,并保持纯视觉模型的简洁性和延迟。

English abstract

The platonic representation hypothesis suggests that sufficiently large models converge to a shared representation geometry, even across modalities. Motivated by this, we ask: Can the semantic knowledge of a language model efficiently improve a vision model? As an answer, we introduce TextTeacher, a simple auxiliary objective that injects text embeddings as additional information into image classification training. TextTeacher uses readily available image captions, a pre-trained and frozen text encoder, and a lightweight projection to produce semantic anchors that efficiently guide representations during training while leaving the inference-time model unchanged. On ImageNet with standard ViT backbones, TextTeacher improves accuracy by up to +2.7 percentage points (p.p.) and yields consistent transfer gains (on average +1.0 p.p.) under the same recipe and compute. It outperforms vision knowledge distillation, yielding higher accuracy at a constant compute budget, or similar accuracy 33% faster. Our analysis indicates that TextTeacher acts as a feature-space preconditioner, shaping deeper layers in the first stages of training, and aiding generalization by supplying complementary semantic cues. TextTeacher adds negligible overhead, requires no costly multimodal training of the target model, and preserves the simplicity and latency of pure vision models. Project page with code and captions: https://nauen-it.de/publications/text-teacher

2605.22095 2026-05-22 econ.GN cs.AI cs.GT cs.HC q-fin.EC updated version

Not Yet: Humans Outperform LLMs in a Colonel Blotto Tournament

Not Yet: 人类在布洛托 tournaments 中优于 LLMs

Dmitry Dagaev, Egor Ivanov, Petr Parshakov, Alexey Savvateev, Gleb Vasiliev

Affiliations * HSE University; New Economic School; Central Economic Mathematical Institute, Russian Academy of Sciences; Adyghe State University; Moscow Institute of Physics and Technology; Innopolis University

AI summary The study compares the strategic performance of humans and LLMs in Colonel Blotto tournaments and finds that humans more often use well-calibrated intermediate-level allocation heuristics, while the simpler strategies submitted by LLMs perform worse.

Details
AI Chinese abstract

大语言模型(LLMs)的出现促使经济学家研究人类和 LLMs 在战略环境中的行为。我们组织了一系列循环轮换 tournaments 在布洛托博弈中。该博弈吸引博弈论家的注意,因为其高维动作空间和没有纯策略纳什均衡。在第一个 tournaments 中,超过 200 名人类参与者相互竞争。在第二个 tournaments 中,几个流行的 LLMs 被邀请提交策略。在第三个 tournaments 中,我们匹配了 LLM 策略的数量与人类提交的数量。我们发现,人类更常使用更好的校准中间层次分配启发式方法,并且优于 LLMs 提交的更简单、更刻板的策略。战略复杂性是成功的关键,当且仅当达到必要的推理深度水平时。而较低和较高的推理层次在原始策略上没有明显优势。在人类中,学科背景弱预测成功:具有 STEM 背景的参与者在第一个 tournaments 中表现更好。令人惊讶的是,人类几乎不根据对手的不同集合调整策略。这一结果表明,人类主要基于游戏规则而非对手身份做出选择,将 LLMs 看作人类竞争对手。

English abstract

The emergence of large language models (LLMs) has spurred economists to study how humans and LLMs behave in strategic settings. We organized a series of round-robin tournaments in the Colonel Blotto game. This game attracts game theorists' attention because of its high-dimensional action space and the absence of pure strategy Nash equilibria. In the first tournament, more than 200 human participants competed against one another. In the second tournament, several popular LLMs were invited to submit strategies. In the third tournament, we matched the number of LLM strategies to the number submitted by humans. We find that humans more often employ better-calibrated intermediate-level allocation heuristics and outperform the simpler, more stereotyped strategies submitted by LLMs. Strategic sophistication is key to success if and only if the necessary level of reasoning depth is reached, while lower and higher levels of reasoning offer no clear advantage over the primitive strategies. Among humans, field of study weakly predicts success: participants with STEM backgrounds perform better in the first tournament. Surprisingly, humans barely adjust their strategies across tournaments with different sets of opponents. This result suggests that humans base their choices primarily on the game's rules rather than on the identity of their opponents, treating LLMs much like human competitors.
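For readers unfamiliar with the game, a round-robin Colonel Blotto tournament can be sketched in a few lines. The abstract does not state the paper's exact rules, so the number of battlefields, the troop budget, the scoring (one point per battlefield won, half a point each for a tie), and the three example strategies below are all assumptions for illustration.

```python
from itertools import combinations


def blotto_score(x, y):
    """Score one match: 1 point per battlefield won, 0.5 each for a tie."""
    px = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a, b in zip(x, y))
    return px, len(x) - px


def round_robin(strategies):
    """Play every pair of strategies once and return total points per name."""
    totals = {name: 0.0 for name in strategies}
    for p, q in combinations(strategies, 2):
        sp, sq = blotto_score(strategies[p], strategies[q])
        totals[p] += sp
        totals[q] += sq
    return totals


# Hypothetical allocations of 6 troops across 3 battlefields.
strategies = {
    "uniform": (2, 2, 2),
    "focused": (6, 0, 0),
    "split": (3, 3, 0),
}
totals = round_robin(strategies)
```

Even in this tiny example, no allocation dominates: `split` beats `uniform`, `uniform` beats `focused`, and `focused` ties `split`, which hints at why the game has no pure-strategy equilibrium.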

2605.22090 2026-05-22 cs.AI updated version

A Camera-Cooperative ISAC Framework for Multimodal Non-Cooperative UAVs Sensing

一种用于多模非合作无人机感知的相机协作ISAC框架

Wenfeng Wu, Luping Xiang, Kun Yang

Affiliations * State Key Laboratory of Novel Software Technology, Nanjing University; Institute of Intelligent Networks and Communications (NINE); School of Intelligent Software and Engineering, Nanjing University (Suzhou Campus)

AI summary This paper proposes a camera-cooperative ISAC framework that uses multimodal sensing for efficient UAV beam steering and tracking, improving sensing accuracy and resource efficiency.

Details
AI Chinese abstract

非合作无人驾驶航空器(UAV)的检测对集成感知与通信(ISAC)系统提出了重大挑战,因为单模感知存在固有局限,且共享通信和感知资源之间存在竞争。为了解决这些挑战,本文提出了一种新颖的相机协作ISAC(CC-ISAC)框架,利用多模感知实现高效的UAV波束定向和跟踪。该框架利用摄像头进行粗粒度空域监控,并利用ISAC实现细粒度、高精度感知,形成互补的感知回路,从而提升感知精度和资源效率。在该框架中,开发了两个关键模块:(1)一种通过交叉注意力机制对齐视觉和回波域特征的视觉到回波数据对齐(V2EDA)模型,以及(2)一种基于多模融合的估计(MMFE)模型,该模型整合历史多模数据与当前观测以实现稳健的状态估计。在DeepSense 6G数据集上进行的广泛评估表明,所提出的框架在保持高角估计精度的同时,实现了平均71%的波束定向开销减少和1.69-11.15%的跟踪开销减少。CC-ISAC框架有效缓解了感知与通信之间的资源竞争,实现了可靠的UAV监控,同时释放了大量系统资源用于额外的通信任务,从而代表了ISAC系统设计的实用进步。

English abstract

The detection of non-cooperative unmanned aerial vehicles (UAVs) presents significant challenges for Integrated Sensing and Communication (ISAC) systems due to the inherent limitations of single-modal perception and the competition for shared communication and sensing resources. To address these challenges, this paper proposes a novel Camera-Cooperative ISAC (CC-ISAC) framework that employs multimodal sensing to enable efficient UAV beam steering and tracking. The proposed framework employs cameras for coarse-grained airspace monitoring and utilizes ISAC for fine-grained, high-precision sensing, forming a complementary perception loop that enhances both sensing accuracy and resource efficiency. Within this framework, two key modules are developed: (1) a Vision-to-Echo Data Alignment (V2EDA) model that aligns visual and echo-domain features through cross-attention mechanisms, and (2) a Multimodal Fusion-Based Estimation (MMFE) model that integrates historical multimodal data with current observations for robust state estimation. Extensive evaluations conducted on the DeepSense 6G dataset demonstrate that the proposed framework achieves an average reduction of 71% in beam steering overhead and 1.69-11.15% in tracking overhead while maintaining high angular estimation accuracy. The CC-ISAC framework effectively mitigates resource contention between sensing and communication, enabling reliable UAV surveillance while freeing substantial system resources for additional communication tasks, thereby representing a practical advancement in ISAC system design.

2605.22089 2026-05-22 cs.CV cs.AI updated version

LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

LVDrive: 基于潜在视觉表征的视觉-语言-动作自动驾驶模型

Xiaodong Mei, Diankun Zhang, Hongwei Xie, Guang Chen, Hangjun Ye, Dan Xu

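Affiliations * The Hong Kong University of Science and Technology; Xiaomi EV

AI summary This paper proposes LVDrive, a vision-language-action autonomous driving model that adds a future scene prediction task and learns semantically rich scene representations in a high-level latent space, improving closed-loop driving performance.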

Details
AI Chinese abstract

视觉-语言-动作(VLA)模型已逐渐成为端到端自动驾驶的有前途的框架。然而,现有VLA通常依赖于稀疏的动作监督,这未能充分利用其强大的场景理解和推理能力。最近尝试通过世界建模引入密集视觉监督时,往往过度强调像素级图像重建,忽略了语义丰富的场景表示学习。在本文中,我们提出LVDrive,一种基于潜在视觉表征的VLA框架,用于自动驾驶。LVDrive在VLA范式中引入了未来场景预测任务,其中未来表示在预训练视觉主干的辅助监督下完全在高维潜在空间中学习。脱离低效的自回归生成,我们在一个统一的嵌入空间中联合建模未来场景和运动预测,通过单次前向传递进行未来感知推理。我们进一步设计了一种两阶段轨迹解码策略,明确利用所学的潜在未来表示来细化轨迹生成。在具有挑战性的Bench2Drive基准测试中,大量实验表明,LVDrive在闭环驾驶性能上实现了显著提升,优于动作监督方法和基于图像重建的世界模型方法。

English abstract

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a unified embedding space, processed in a single forward pass to conduct the future-aware reasoning. We further design a two-stage trajectory decoding strategy that explicitly leverages the learned latent future representations to refine trajectory generation. Extensive experiments on the challenging Bench2Drive benchmark demonstrate that LVDrive achieves significant improvements in closed-loop driving performance, outperforming both action supervised methods and image-reconstruction-based world model approaches.

2605.21413 2026-05-22 cs.AI updated version

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

通过基准构建教学AI:QuestBench作为一门课程实践以实现问责知识工作

Haiyang Shen, Jiuzheng Wang, Taian Guo, Mugeng Liu, Wenchun Jing, Chongyang Pan, Siqi Zhong, Zhiyang Chen, Weichen Bi, Yudong Han, Xiaoying Bai, Yun Ma

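Affiliations * Peking University; Advanced Institute of Big Data

AI summary This paper proposes teaching AI through benchmark construction and presents QuestBench as a course-based practice that helps students understand their responsibility for knowledge work in the AI era.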

Comments 24 pages, 5 figures, 4 tables

Details
AI Chinese abstract

随着AI成为日常学习的一部分,许多课程教授学生主要将其作为生产力工具:如何提示、搜索、总结、写作、编程和更高效地使用工具。我们主张AI教育也需要一个让学生学习测试AI并理解自己在判断机器生成知识角色的环境。为此,我们介绍了一种基于课程的实践,通过构建基准来教学AI,以深度研究系统为例展示AI时代的知识工作。学生将学科知识转化为可验证的专家级问题,互相审查设计以发现歧义和捷径,并在由此产生的任务上评估AI系统。这项活动让学生直接接触到强大工具,同时要求他们明确信任答案所需的标准。所生成的基准QuestBench包含256个问题,涵盖14个文科和社会科学领域。在QuestBench上的评估显示,学生设计的任务揭示了当前深度研究系统中的隐藏失败:在十三个评估系统中,平均问题级通过率仅为16.85%,最佳表现系统GPT-5.5的通过率为57.58%。这些失败在教育上是有用的,因为它们展示了流畅、来源支持的答案仍可能错过正确的查询、来源、术语或证据标准。来自五名学生贡献者的反思表明,基准构建可以帮助学生将专业知识不仅视为AI可能检索的内容,而是作为判断AI输出的基础。我们以QuestBench作为基准制品和可重用的课堂设置,提出一个更大的教育问题:当AI进入学习和专业工作时,学生如何保持负责任的知识行动者。数据集可在https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main获取。

English abstract

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.

2605.21379 2026-05-22 cs.NE cs.AI updated version

How to Build Marcus's Algebraic Mind: Algebro-Deterministic Substrate over Galois Fields

如何构建马库斯的代数心智:基于伽罗瓦域的代数确定性介质

Hiroyuki Chuma, Kanji Otsuk, Yoichi Sato

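Affiliations * Institute of Innovation Research, Hitotsubashi University; Meisei University; Shuhari System

AI summary This paper presents an algebro-deterministic substrate over Galois fields that realizes the three core components of a cognitive architecture identified by Marcus, and shows its application to logical reasoning and semantic distinction.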

Details
AI Chinese abstract

在《代数心智》中,加里·马库斯指出了任何充分认知架构必须包含的三个组成部分:变量上的操作、递归结构的表示,以及个体与类别的区分。他指出标准多层感知机不支持这些,承认使用寄存器和树形单元的神经实现,通过发育程序而非梯度下降构建仍是一个程序性猜想。25年后,所需的介质现已可用。我们新开发的PyVaCoAl/VaCoAl是一种超维计算架构,围绕单个代数原语XOR-and-shift over GF(2)组织,通过原始多项式线性反馈移位寄存器实现。该架构支持通过Bind(R,F) = R XOR shift(F)实现的可逆变量绑定,非交换性的组合捆绑,能够区分“狗咬人”与“人咬狗”,并在同一代数下实现地址空间的个体/类别分离。一种互补观点认为,海马体-CA3回路是这种引擎的生物同源物,发育指定的 mossy-fiber 目标提供了马库斯预期的内生微回路。在本文中,我们映射马库斯的三种支柱与PyVaCoAl/VaCoAl的操作承诺之间的对应关系。我们重新解释树形单元为一个由原始生成多项式索引的代数寄存器集,论证该架构比2001年可用的张量积、循环卷积或时间同步更接近马库斯的规格。我们还展示了该介质如何自然扩展到佩尔的第三级反事实推理,这是原始树形单元程序未直接针对的能力。

English abstract

In The Algebraic Mind, Gary Marcus identified three components essential for any adequate cognitive architecture: operations over variables, recursively structured representations, and a distinction between mental representations of individuals and kinds. He argued that standard multilayer perceptrons supported none of these, acknowledging that a neural implementation using registers and treelets, constructed via developmental programs rather than gradient descent, remained a programmatic conjecture. Twenty-five years later, the required substrate is now available. Our newly developed PyVaCoAl/VaCoAl is a hyperdimensional computing architecture organized end-to-end around a single algebraic primitive: XOR-and-shift over GF(2), implemented by primitive-polynomial linear-feedback shift registers. The architecture supports reversible variable binding via Bind(R,F) = R XOR shift(F), non-commutative compositional bundling that distinguishes "the dog bites the man" from "the man bites the dog," and address-space individual/kind separation under the same algebra. A companion perspective argues that the dentate gyrus-CA3 circuit is a biological homologue of this same engine, with developmentally specified mossy-fiber targeting supplying the innate microcircuitry Marcus anticipated. In this paper, we map the correspondence between Marcus's three pillars and the operational commitments of PyVaCoAl/VaCoAl. We reinterpret the treelet as an algebraic register set indexed by a primitive generator polynomial, arguing that this architecture provides a functional neural substrate meeting Marcus's specifications far more closely than the tensor products, circular convolution, or temporal synchrony available in 2001. We also demonstrate how this substrate naturally extends to Pearl's rung-3 counterfactual reasoning, a capability the original treelet program did not directly target.
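The binding rule quoted in the abstract, Bind(R,F) = R XOR shift(F), can be tried out on small bit vectors. The sketch below is not PyVaCoAl/VaCoAl code: it assumes a 16-bit word and uses a plain cyclic rotation as a stand-in for the paper's LFSR-based shift, purely to show why the operation is reversible and order-sensitive.

```python
N = 16                    # hypothetical vector width in bits
MASK = (1 << N) - 1


def rotl(x, k=1):
    """Cyclic left rotation of an N-bit word (stand-in for the paper's shift)."""
    k %= N
    return ((x << k) | (x >> (N - k))) & MASK


def rotr(x, k=1):
    """Cyclic right rotation, the inverse of rotl."""
    k %= N
    return ((x >> k) | (x << (N - k))) & MASK


def bind(role, filler):
    """Bind(R, F) = R XOR shift(F)."""
    return role ^ rotl(filler)


def unbind(bound, role):
    """Recover F from Bind(R, F) given R: XOR is self-inverse, then undo the shift."""
    return rotr(bound ^ role)


role, filler = 0xBEEF, 0x1234
bound = bind(role, filler)
recovered = unbind(bound, role)
swapped = bind(filler, role)
```

Because XOR cancels itself and the rotation is invertible, `recovered` equals `filler`; because the shift is applied to only one argument, `bound` and `swapped` differ, which is the property that lets role order distinguish "the dog bites the man" from "the man bites the dog".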

2605.21214 2026-05-22 cs.LG cs.AI updated version

Behavior-Consistent Deep Reinforcement Learning

行为一致的深度强化学习

Marcel Hussing, Liv G. d'Aliberti, Claas Voelcker, Benjamin Eysenbach, Eric Eaton

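Affiliations * University of Pennsylvania; Princeton University; University of Texas at Austin

AI summary This paper proposes a behavior-consistent deep reinforcement learning method that reduces policy divergence across training runs by controlling the distributional similarity of policies, improving stability without sacrificing performance.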

Details
AI Chinese abstract

强化学习(RL)在不同训练运行中常常表现出高方差,导致性能不可靠,并对现实领域中的部署构成重大挑战。在本文中,我们通过形式化行为一致的RL问题来解决跨运行策略分歧的挑战,目标是获得在不同训练运行中表现优异且分布相似的策略。我们的关键观察是最大熵RL提供了一种直接机制来控制行为分歧,通过将运行锚定到一个共同的(均匀)先验。我们证明,对于玻尔兹曼策略,选择温度与Q函数分歧界成正比可以限制诱导策略之间的成对KL散度。然而,我们还表明,简单地增加熵可能会损害策略优化并放大非策略误差。基于这些观察,我们提出了Q值期望分歧(QED),一种状态依赖的温度调度,利用双批评机分歧作为单次运行的跨运行分歧代理。经验上,我们在18个连续控制任务中展示了QED将跨运行分歧减少两个数量级,而不会牺牲性能,从而在适度的样本效率成本下实现了显著的回报方差减少。

English abstract

Reinforcement learning (RL) often exhibits high variance across training runs, leading to unreliable performance and posing a major challenge to deployment in real-world domains. In this work, we address the challenge of cross-run policy divergence by formalizing the problem of behavior-consistent RL, where the objective is to obtain policies that are both high-performing and distributionally similar across training runs. Our key observation is that maximum-entropy RL provides a direct mechanism for controlling behavioral divergence by anchoring runs to a common (uniform) prior. We prove that, for Boltzmann policies, choosing the temperature proportional to $Q$-function disagreement bounds the pairwise KL divergence between the induced policies. However, we also show that naïvely increasing entropy might impair policy optimization while amplifying off-policy error. Building upon these observations, we propose $Q$-value Expectile Disagreement (QED), a state-dependent temperature schedule that uses double-critic disagreement as a single-run proxy for cross-run disagreement. Empirically, we demonstrate that across 18 continuous-control tasks, QED reduces across-run divergence by two orders of magnitude without sacrificing performance, resulting in a considerable reduction in return variance at modest sample-efficiency costs.
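The link between temperature and cross-run divergence can be checked numerically. The abstract does not give the paper's exact statement, so the sketch below relies on an elementary bound that holds for any two softmax policies sharing a temperature $\tau$: $\mathrm{KL}(\pi_1 \,\|\, \pi_2) \le 2\,\lVert Q_1 - Q_2 \rVert_\infty / \tau$. The Q-values and temperatures are made-up numbers, and QED's expectile-based schedule is not reproduced here.

```python
import math


def boltzmann(q, tau):
    """Softmax policy over action values q at temperature tau."""
    m = max(q)  # subtract the max for numerical stability
    w = [math.exp((v - m) / tau) for v in q]
    z = sum(w)
    return [x / z for x in w]


def kl(p, r):
    """KL divergence KL(p || r) between two discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, r) if a > 0)


def disagreement(q1, q2):
    """Sup-norm gap between two Q-value vectors."""
    return max(abs(a - b) for a, b in zip(q1, q2))


# Hypothetical Q-values for one state from two independent training runs.
q_run1 = [1.0, 0.5, 0.0]
q_run2 = [0.8, 0.9, 0.1]
eps = disagreement(q_run1, q_run2)

for tau in (0.1, 0.4, 1.6):
    divergence = kl(boltzmann(q_run1, tau), boltzmann(q_run2, tau))
    print(f"tau={tau}: KL={divergence:.4f}, bound={2 * eps / tau:.4f}")
```

Setting $\tau = c \cdot \lVert Q_1 - Q_2 \rVert_\infty$ caps the bound at $2/c$ regardless of how large the disagreement is, which is the intuition behind a temperature that scales with disagreement. The cost is that a higher temperature also flattens the policy, consistent with the abstract's caution about naïvely increasing entropy.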

2605.20348 2026-05-22 q-fin.CP cs.AI updated version

Memory-Induced Supra-Competitive Outcomes Between Deep Reinforcement Learning Agents in Optimal Trade Execution

记忆诱导的深度强化学习代理在最优交易执行中的超竞争性结果

Christos Spyridon Koulouris, Carlo Campajola

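Affiliations * Institute of Finance and Technology, University College London; UZH Blockchain Center

AI summary This paper studies whether deep reinforcement-learning agents interacting in a shared optimal-execution environment can sustain supra-competitive outcomes, that is, implementation shortfalls lower than the game-theoretic competitive benchmark. It examines a two-agent Almgren-Chriss liquidation game and how learned behavior depends on intra-episode environment feedback, the ability to interpret the mid-price, and the agent's knowledge of the past. Ex-ante schedule-learning agents first remove intra-episode feedback to isolate what happens when agents commit to complete liquidation trajectories before execution begins; agents are then allowed to condition on the evolving state using several DDQN architectures. Supra-competitive outcomes become more frequent and more persistent when agents can access intra-episode history, especially recent prices and their own past actions. The findings indicate that supra-competitive behavior in this execution game is driven not by multi-agent learning or current price observation alone, but by feedback, memory, and state-contingent interaction along the realized execution path.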

Details
AI Chinese abstract

在本文中,我们研究了在共享的最优执行环境中交互的深度强化学习代理是否能够维持超竞争性结果,即在实现短损方面优于相关博弈论竞争基准。我们研究了一个双代理阿尔梅伦-克里斯特流动性清算游戏,并探讨了学习行为如何依赖于回合内环境反馈、解读中间价格的能力以及代理对过去的了解。我们首先使用事前调度学习代理来去除回合内反馈,并确定当代理在执行开始前承诺完成清算轨迹时会发生什么。然后允许代理使用多种DDQN架构根据演进的状态进行条件判断。我们发现,当代理能够访问回合内历史,特别是近期价格和自身过去行为时,超竞争性结果变得显著更频繁和持久。这些发现表明,这种执行游戏中的超竞争性行为并非由多代理学习或当前价格观察单独驱动,而是由反馈、记忆和沿实际执行路径的状态依赖性交互驱动。

English abstract

In this paper, we investigate whether deep reinforcement-learning agents interacting in a shared optimal-execution environment can sustain supra-competitive outcomes, in the sense of achieving lower implementation shortfalls than the relevant game-theoretical competitive benchmark. We study a two-agent Almgren-Chriss liquidation game and examine how learned behavior depends on intra-episode environment feedback, the ability to interpret the mid-price, and the agent's knowledge of the past. We first use ex-ante schedule-learning agents to remove intra-episode feedback and isolate what can arise when agents commit to complete liquidation trajectories before execution begins. We then allow agents to condition on the evolving state using a variety of DDQN architectures. We find that, when agents are given access to intra-episode history, especially recent prices and own past actions, supra-competitive outcomes become substantially more frequent and more persistent. These findings indicate that supra-competitive behavior in this execution game is driven not by multi-agent learning or by current price observation alone, but by feedback, memory, and state-contingent interaction along the realized execution path.

2605.19578 2026-05-22 cs.CV cs.AI 版本更新

Lens Privacy Sealing: A New Benchmark and Method for Physical Privacy-Preserving Action Recognition

Lens Privacy Sealing: 一种新的基准和方法用于物理隐私保护的动作识别

Mengyuan Liu, Ziyi Wang, Peiming Li, Junsong Yuan

发表机构 * State Key Laboratory of General Artificial Intelligence, Peking University, Shenzhen Graduate School(北京大学深圳研究生院通用人工智能国家重点实验室) Department of Computer Science and Engineering, State University of New York at Buffalo(纽约州立大学布法罗分校计算机科学与工程系)

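AI summary This paper proposes Lens Privacy Sealing (LPS), a hardware solution that physically obscures the camera lens with adjustable laminating film to provide low-cost pre-sensor privacy protection. It introduces the P$^3$AR dataset for privacy-preserving action recognition and the MSPNet framework to handle the video degradation caused by LPS. Experiments show that MSPNet offers advantages in both action recognition accuracy and privacy protection.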

Comments Accepted by IEEE Transactions on Image Processing (TIP), 2026

详情
AI中文摘要

基于RGB摄像头的监控系统能够为公共安全和医疗保健提供人类动作识别,但引发了严重的隐私问题。现有方法依赖于事后捕获算法,这些算法在数据采集过程中无法保护隐私。我们提出Lens Privacy Sealing (LPS),一种简单的硬件解决方案,通过可调节的贴膜物理遮挡摄像头镜头,以最低的成本提供预传感器隐私保护。与软件方法或昂贵的工程光学不同,LPS通过随机多层散射实现强隐私保护,这种散射是物理不可逆的。我们引入了P$^3$AR数据集用于隐私保护的动作识别,该数据集包含大规模回放捕获(P$^3$AR-NTU,114K视频)和现实世界收集(P$^3$AR-PKU)的子集,并带有隐私属性注释。为处理LPS带来的视频退化,我们提出MSPNet,一种单阶段框架,结合了帧间噪声抑制器(IFNS)和跨帧语义聚合器(CFSA),并借助对比语言-图像预训练进行增强的语义提取。大量实验表明,与基线方法相比,MSPNet结合IFNS和CFSA几乎将动作识别准确率提高了一倍,同时抑制身份识别到低水平。全面验证显示,LPS在隐私-效用权衡方面优于现有最先进的硬件方法,能够抵御包括PSF反向计算和数据驱动恢复在内的重建攻击,并在不同光学配置和挑战性环境中具有良好的泛化能力。代码可在https://github.com/wangzy01/MSPNet上获得。

英文摘要

RGB camera-based surveillance systems enable human action recognition for public safety and healthcare, yet raise serious privacy concerns. Existing methods rely on post-capture algorithms, which fail to protect privacy during data acquisition. We propose Lens Privacy Sealing (LPS), a simple hardware solution that physically obscures camera lenses with adjustable laminating film, providing pre-sensor privacy protection at minimal cost. Unlike software methods or expensive engineered optics, LPS achieves strong privacy through stochastic multi-layer scattering that is physically irreversible. We introduce the P$^3$AR dataset for privacy-preserving action recognition, featuring both large-scale replay-captured (P$^3$AR-NTU, 114K videos) and real-world collected (P$^3$AR-PKU) subsets with privacy attribute annotations. To handle video degradation from LPS, we propose MSPNet, a single-stage framework incorporating Inter-Frame Noise Suppressor (IFNS) and Cross-Frame Semantic Aggregator (CFSA), enhanced by contrastive language-image pre-training for robust semantic extraction. Extensive experiments demonstrate that MSPNet with IFNS and CFSA nearly doubles action recognition accuracy compared to baseline methods while suppressing identity recognition to low levels. Comprehensive validation shows LPS achieves a superior privacy-utility trade-off compared to state-of-the-art hardware methods, resists reconstruction attacks including PSF inversion and data-driven recovery, and generalizes robustly across optical configurations and challenging environments. Code is available at https://github.com/wangzy01/MSPNet.

2605.19329 2026-05-22 cs.CV cs.AI 版本更新

RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

RE-VLM:事件增强的视觉-语言模型用于场景理解

Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang, Chuang Zhu

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) State Key Laboratory of General Artificial Intelligence, BIGAI(通用人工智能国家重点实验室,BIGAI)

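AI summary This paper proposes RE-VLM, a dual-stream vision-language model that combines RGB images and event streams to improve scene understanding under both normal and adverse conditions. Using the high-temporal-resolution, wide-dynamic-range data provided by event cameras, RE-VLM outperforms existing models on captioning and visual question answering tasks.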

Comments 10 pages, 6 figures, 6 tables

详情
AI中文摘要

传统视觉-语言模型(VLMs)在恶劣条件下(如低光、高动态范围或快速运动)捕获的场景解释能力不足,因为标准RGB图像在这些环境中质量下降。事件相机提供了一种互补的模态:它们异步记录每个像素的亮度变化,具有高时间分辨率和宽动态范围,在帧失效时保留运动线索。我们提出了RE-VLM,第一个双流视觉-语言模型,联合利用RGB图像和事件流,以在正常和挑战性条件下实现稳健的场景理解。RE-VLM采用并行的RGB和事件编码器,以及一种渐进训练策略,将异构视觉特征与语言对齐。为了解决RGB-Event-Text监督不足的问题,我们进一步提出了一种图驱动的流程,将同步的RGB-Event流转换为可验证的场景图,从中合成描述和问答对。为了开发和评估RE-VLM,我们构建了两个数据集:PEOD-Chat,针对光照挑战性场景,和RGBE-Chat,涵盖多样化的场景。在描述和VQA基准测试中,RE-VLM在与现有RGB-only和事件-only模型参数量相当的情况下,始终优于现有模型,特别是在挑战性条件下表现显著提升。这些结果证明了事件增强的VLMs在广泛现实环境中实现稳健视觉-语言理解的有效性。

英文摘要

Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments.

2605.18372 2026-05-22 cs.HC cs.AI cs.CY cs.ET 版本更新

The Hidden Cost of Contextual Sycophancy: an AI Literacy Intervention in Human-AI Collaboration

上下文谄媚的隐性成本:人类-人工智能协作中的AI素养干预

Cansu Koyuturk, Sabrina Guidotti, Dimitri Ognibene

发表机构 * Università degli Studi di Milano-Bicocca(米兰-比科卡大学)

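AI summary This study examines how contextual sycophancy arises in human-AI collaboration and whether an intervention that improves AI literacy and prompting skills can mitigate its effects. It finds that the quality of AI feedback is significantly affected by the propagation of user errors, suggesting that system-level improvements are needed to promote critical engagement.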

Comments SPRINGER AIED 2026: Accepted for LBR, poster presentation at the 27th International Conference on Artificial Intelligence in Education, 27 Jun - 3 Jul 2026, Seoul, Republic of Korea

详情
AI中文摘要

大型语言模型(LLMs)在教育领域日益被用作交互工具进行协作。然而,其倾向于谄媚,即使在错误时也迎合用户信念,这引发了学习和决策的担忧,尤其是对知识较少的用户。本研究调查了在真实多轮人类-人工智能交互中谄媚对齐如何产生,并探讨了针对提高AI素养和提示能力的干预是否能减轻其影响。在受控混合设计实验中,60名参与者通过先生成个人排名再与AI助手协作进行分析生存排名任务,分别在干预前和干预后接受一般或谄媚聚焦的提示训练。初步结果显示,LLMs对用户输入高度敏感:低质量的初始响应导致较差的AI建议,表明模型镜像或整合了用户推理而非纠正或提供缺失或较少见的替代方案。关键的是,用户错误向AI响应的传播显著降低了AI反馈质量和最终用户任务表现,揭示了一种上下文谄媚依赖现象。尽管干预未能消除上下文错误的传播,但显著提高了AI建议质量,通过减少直接镜像错误用户排名。这些发现表明,提示和AI素养单独可能不足以确保知识上独立的AI支持,强调了需要系统层面方法以促进人类-人工智能协作中的批判性参与。

英文摘要

Large Language Models (LLMs) are increasingly used in educational settings as interactive tools for collaboration. However, their tendency toward sycophancy, aligning with user beliefs even when incorrect, raises concerns for learning and decision-making, especially for less knowledgeable users. This study investigates how sycophantic alignment emerges in authentic multi-turn human-AI interactions and whether interventions aimed at increasing AI literacy and prompting competencies can mitigate its effects. In a controlled mixed-design experiment, 60 participants completed analytical survival ranking tasks by first generating individual rankings and then making final decisions after collaborating with an AI assistant, both before and after receiving either general or sycophancy-focused prompting training. Preliminary results show that LLMs are highly sensitive to user input: lower-quality initial responses lead to poorer AI advice, suggesting that the model mirrors or incorporates user reasoning rather than correcting it or offering better alternatives that are missing or less frequent in the conversation. Critically, the propagation of user errors into AI responses significantly reduced both the quality of AI feedback and final user task performance, revealing a form of contextual sycophantic dependence. While the intervention did not eliminate the propagation of contextual errors, it significantly improved AI advice by reducing the direct mirroring of incorrect user rankings. These findings suggest that prompting and AI literacy alone may be insufficient to ensure epistemically independent AI support, highlighting the need for system-level approaches that better promote critical engagement in human-AI collaboration.

2605.16545 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

Symphony for Speech-to-Text: 支持实时医疗语音接口

Arne Nix, Robert James, Lasse Borgholt, Anna B. Ekner, Lana Krumm, Julius Severin, Dan Engel, Lars Maaløe, Jakob Havtorn

发表机构 * Corti

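AI summary This paper presents Symphony for Speech-to-Text, a medical-grade real-time speech recognition system. It decomposes transcription into specialized components for recognition, formatting, and contextual correction to optimize medical term recall and produce clinically structured text in real time. It substantially outperforms existing systems in medical settings while performing no worse in general-domain settings.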

Comments Updated with a correction and improvement to Symphony's performance in spoken punctuation evaluation (R_punct, P_punct)

详情
AI中文摘要

在数十年用于打字和更近期的环境记录后,语音正逐渐成为与技术及AI交互的主要方式,在医疗领域也不例外。然而,医疗语音识别仍然具有挑战性:系统必须捕捉专业术语,解决上下文歧义,并精确渲染测量、缩写和临床缩写。现有解决方案通常针对通用目的转录或狭窄的打字工作流进行优化,限制了其在安全关键设置中的可靠性以及在更广泛临床工作流中的实用性。我们引入Symphony for Speech-to-Text,一种用于实时流式和基于批量文件的临床使用的医疗级语音识别系统。Symphony将转录过程分解为识别、格式化和上下文校正等专业化组件,以优化医学术语召回,同时在实时生成临床结构文本并适应不同使用场景。在公共基准和医疗语音数据集上的评估表明,Symphony在临床场景中显著优于现有系统,同时在通用领域表现不逊,表明具有鲁棒的泛化能力而非过拟合。我们发布了一个临床基准数据集以支持可靠的验证和进一步推进医疗语音识别。Symphony通过生产级API提供,用于实时打字、对话转录和批量音频文件处理。

英文摘要

After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or narrow dictation workflows, limiting their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction to optimize medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmark and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.

2605.16299 2026-05-22 cs.SE cs.AI 版本更新

ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization

ACE:通过对抗性单元测试生成和偏好优化的自进化LLM编码框架

Yixu Huang, Xinglei Yu, Zhongyu Wei

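AI summary This paper proposes ACE, a framework that achieves self-evolving code generation through execution-centric supervision built on a solver-adversary architecture, with no need for ground-truth code or external reward models. Experiments show that it outperforms existing solver-verifier baselines on CodeContests, MBPP, and LiveCodeBench.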

详情
AI中文摘要

大型语言模型(LLMs)在代码生成方面表现出色,但仍然严重依赖大规模标注解决方案和基于验证的监督,这限制了可扩展性和持续自我改进的能力。最近的求解器-验证器框架利用程序执行作为自动监督信号,但其有效性在求解器变得中等强大时会下降:验证器生成的测试越来越多地确认语义正确性,而不是暴露剩余的失败模式。我们提出了ACE,一种基于求解器-对抗架构的自进化代码生成框架,优先通过以执行为中心的监督进行主动失败发现。一个单一的LLM在生成候选程序和生成优化以诱导执行级失败(如运行时错误、异常或非终止)的对抗性单元测试输入之间交替进行。监督仅来源于执行结果:稳健的程序被选为监督微调,而对抗性测试通过Kahneman-Tversky优化使用执行衍生的偏好进行优化。值得注意的是,整个训练循环不需要真实代码或外部奖励模型。在CodeContests、MBPP和LiveCodeBench上的实验表明,ACE在pass@1上持续优于强大的求解器-验证器基线,实现了3-7%的绝对提升,在分布外基准上改进更大,同时保持竞争性或改进的推理效率。

英文摘要

Large Language Models (LLMs) excel at code generation but remain heavily reliant on large-scale annotated solutions and verification-based supervision, which constrains scalability and hinders sustained self-improvement. Recent solver-verifier frameworks exploit program execution as an automatic supervision signal, but their effectiveness degrades as solvers become moderately strong: verifier-generated tests increasingly confirm semantic correctness rather than exposing the remaining failure modes. We propose ACE, a self-evolving code generation framework based on a solver-adversary architecture that prioritizes active failure discovery through execution-centric supervision. A single LLM alternates between generating candidate programs and producing adversarial unit test inputs optimized to induce execution-level failures, such as runtime errors, exceptions, or non-termination. Supervision is derived solely from execution outcomes: robust programs are selected for supervised fine-tuning, while adversarial tests are optimized via Kahneman-Tversky Optimization using execution-derived preferences. Notably, the entire training loop requires no ground-truth code or external reward models. Experiments on CodeContests, MBPP, and LiveCodeBench demonstrate that ACE consistently outperforms strong solver-verifier baselines, achieving 3-7% absolute gains in pass@1, with larger improvements on out-of-distribution benchmarks, while maintaining competitive or improved inference efficiency.

2605.15153 2026-05-22 cs.RO cs.AI 版本更新

Pelican-Unify 1.0: A Unified Embodied Intelligence Model for Understanding, Reasoning, Imagination and Action

Pelican-Unify 1.0:一种用于理解和推理、想象和行动的统一具身智能模型

Yi Zhang, Yinda Chen, Che Liu, Zeyuan Ding, Jin Xu, Shilong Zou, Junwei Liao, Jiayu Hu, Xiancong Ren, Xiaopeng Zhang, Yechi Liu, Haoyuan Shi, Zecong Tang, Haosong Sun, Renwen Cui, Kuishu Wu, Wenhai Liu, Yang Xu, Yingji Zhang, Yidong Wang, Senkang Hu, Jinpeng Lu, Nga Teng Chan, Yechen Wu, Zeting Liu, Xianzhou Hou, Yong Dai, Jian Tang, Xiaozhu Ju

发表机构 * Beijing Innovation Center of Humanoid Robotics (X-Humanoid)(北京人形机器人创新中心(X-Humanoid))

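AI summary This paper presents Pelican-Unify 1.0, the first embodied foundation model trained according to the principle of unification. A single vision-language model serves as the unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same model acts as a unified reasoning module that generates task-, action-, and future-oriented chains of thought and projects the final hidden state into a dense latent variable, from which a Unified Future Generator produces future videos and actions.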

详情
AI中文摘要

我们提出了Pelican-Unify 1.0,首个根据统一原则训练的具身基础模型。Pelican-Unify 1.0使用单一视觉语言模型作为统一理解模块,将场景、指令、视觉上下文和行动历史映射到共享语义空间。同一视觉语言模型也作为统一推理模块,通过单次前向传递自回归地生成任务、行动和未来导向的思维链,并将最终隐藏状态投影到密集潜在变量中。统一未来生成器(UFG)然后基于该潜在变量,在同一去噪过程中通过两个模态特定的输出头联合生成未来视频和未来行动。语言、视频和行动损失均反向传播到共享表示中,使模型在训练过程中共同优化理解和推理、想象和行动,而非训练三个独立专家系统。实验表明,统一并不意味着妥协。通过单一检查点,Pelican-Unify 1.0在所有三种能力上均取得强劲表现:在八个VLM基准测试中得分为64.7,是同类模型中最佳;在WorldArena中得分为66.03,排名第一;在RoboTwin中得分为93.5,是对比行动方法中第二好的平均值。这些结果表明,统一范式在保持专业能力的同时,将理解和推理、想象和行动整合到一个模型中。

英文摘要

We present Pelican-Unify 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unify 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unify 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.

2605.15040 2026-05-22 cs.AI cs.CL 版本更新

Orchard: An Open-Source Agentic Modeling Framework

Orchard:一个开源的智能体建模框架

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandro Sordoni, Xingdi Yuan, Yelong Shen, Pengcheng He, Tong Zhang, Zhou Yu, Jianfeng Gao

发表机构 * Microsoft Research(微软研究院) Columbia University(哥伦比亚大学) UIUC(伊利诺伊大学香槟分校)

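AI summary This paper presents Orchard, an open-source agentic modeling framework that uses a lightweight environment service and three agentic modeling recipes to enable reusable agentic data, training, and evaluation across domains.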

详情
AI中文摘要

智能体建模旨在通过规划、推理、工具使用和与环境的多轮交互,将大语言模型转化为能够解决复杂任务的自主智能体。尽管有大量投入,开放研究仍受制于基础设施和训练差距。许多高性能系统依赖于专有代码库、模型或服务,而大多数开源框架专注于编排和评估,而非可扩展的智能体训练。我们提出了Orchard,一个用于可扩展智能体建模的开源框架。其核心是Orchard Env,一个轻量级环境服务,提供可重用的原语用于跨任务领域、智能体利用和流水线阶段的沙盒生命周期管理。在Orchard Env之上,我们构建了三种智能体建模食谱。Orchard-SWE针对编码智能体,从MiniMax-M2.5和Qwen3.5-397B中蒸馏出107K条轨迹,引入信用分配SFT来学习未解决轨迹的 productive 段落,并应用平衡自适应回滚进行强化学习。从Qwen3-30B-A3B-Thinking开始,Orchard-SWE在SWE-bench Verified上经过SFT后达到64.3%,经过SFT+RL后达到67.5%,在同等规模的开源模型中设立了新的状态。Orchard-GUI使用仅0.4K蒸馏轨迹和2.2K开放性任务训练了一个4B视觉-语言计算机使用智能体,在WebVoyager、Online-Mind2Web和DeepShop上分别达到74.1%、67.0%和64.0%的成功率,成为最强的开源模型,同时在与专有系统竞争中保持竞争力。Orchard-Claw针对个人助理智能体,仅用0.2K合成任务训练,达到Claw-Eval上的59.6% pass@3和与更强的ZeroClaw利用配合时的73.9%。总体而言,这些结果表明,一个轻量级、开源、不依赖利用的环境层能够实现跨领域的可重用智能体数据、训练和评估。

英文摘要

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

2605.12058 2026-05-22 cs.LG cs.AI 版本更新

Hölder Policy Optimisation

Hölder Policy Optimisation

Yuxiang Chen, Dingli Liang, Yihang Chen, Ziqin Gong, Chenyang Le, Zhaokai Wang, Jiachen Zhu, Lingyu Yang, Jianghao Lin, Weinan Zhang, Jun Wang

发表机构 * University College London(伦敦大学学院) Shanghai Jiao Tong University(上海交通大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州))

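AI summary This paper proposes the HölderPO framework, which unifies token-level probability aggregation via the Hölder mean to address the training collapse and unsatisfactory performance caused by fixed aggregation mechanisms. It proves theoretically how different values of $p$ balance gradient concentration against gradient variance, and schedules $p$ across the training lifecycle with a dynamic annealing algorithm. Experiments show better stability and convergence on multiple mathematical benchmarks.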

详情
英文摘要

Group Relative Policy Optimisation (GRPO) enhances large language models by estimating advantages across a group of sampled trajectories. However, mapping these trajectory-level advantages to policy updates requires aggregating token-level probabilities within each sequence. Relying on a fixed aggregation mechanism for this step fundamentally limits the algorithm's adaptability. Empirically, we observe a critical trade-off: certain fixed aggregations frequently suffer from training collapse, while others fail to yield satisfactory performance. To resolve this, we propose HölderPO, a generalised policy optimisation framework unifying token-level probability aggregation via the Hölder mean. By explicitly modulating the parameter $p$, our framework provides continuous control over the trade-off between gradient concentration and variance bounds. Theoretically, we prove that a larger $p$ concentrates the gradient to amplify sparse learning signals, whereas a smaller $p$ strictly bounds gradient variance. Because no static configuration can universally resolve this concentration-stability trade-off, we instantiate the framework with a dynamic annealing algorithm that progressively schedules $p$ across the training lifecycle. Extensive evaluations demonstrate superior stability and convergence over existing baselines. Specifically, our approach achieves a state-of-the-art average accuracy of 54.9% across multiple mathematical benchmarks, yielding a substantial 7.2% relative gain over standard GRPO, and secures an exceptional 93.8% success rate on ALFWorld.
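
For readers unfamiliar with the Hölder (power) mean that the abstract builds on, here is a minimal sketch of the mean itself. It is not the paper's loss or estimator; the function name and the sample token probabilities are hypothetical, and $p = 0$ is handled as the geometric-mean limit.

```python
import math


def holder_mean(probs, p):
    """Hölder (power) mean of positive values.

    M_p(x) = ((1/n) * sum(x_i ** p)) ** (1/p); the p -> 0 limit is the
    geometric mean, p = 1 the arithmetic mean, p = -1 the harmonic mean.
    """
    if not probs or any(x <= 0.0 for x in probs):
        raise ValueError("probs must be a non-empty sequence of positive numbers")
    n = len(probs)
    if p == 0:
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


token_probs = [0.9, 0.5, 0.1]  # hypothetical per-token probabilities
means = {p: holder_mean(token_probs, p) for p in (-1, 0, 1, 2)}
```

By the power-mean inequality the result is non-decreasing in $p$: larger $p$ lets the largest entries dominate, while smaller $p$ pulls the aggregate toward the smallest entries. This is the single knob the abstract describes scheduling over training.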

2605.08389 2026-05-22 cs.CV cs.AI 版本更新

Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval

解耦端点与语义转换学习以实现零样本复合图像检索

Mingyu Liu, Sihan Huang, Yijia Fan, Yinlin Yan, Quan Zhang, Jian-Fang Hu, Jianhuang Lai

发表机构 * Sun Yat-sen University(中山大学) Guangdong Province Key Laboratory of Information Security Technology(广东省信息安全技术重点实验室) Key Laboratory of Machine Intelligence and Advanced Computing(机器智能与先进计算重点实验室)

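AI summary This paper proposes DeCIR, a method for zero-shot composed image retrieval that decouples endpoint learning from semantic transition learning. It constructs paired forward/reverse edit tuples, trains separate low-rank text adapter branches, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter, improving projection-based zero-shot composed image retrieval.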

详情
AI中文摘要

零样本复合图像检索(ZS-CIR)在不依赖人工标注的CIR三元组的情况下,从参考图像和文本修改中检索目标图像。基于投影的ZS-CIR方法因其不依赖LLM并在推理时保持轻量而具有吸引力,但它们在复杂语义修改上往往表现不佳。这一差距反映了基于投影的ZS-CIR中的语义转换瓶颈:端点级匹配可以让编辑文本作为目标侧的属性线索,而不是作为源条件的语义转换。我们进一步表明,将语义转换监督添加到相同的文本适配器中会创建端点对齐与语义转换对齐之间的冲突。为了解决这一冲突,DeCIR解耦端点与转换学习。它从图像-标题对中构建配对的正向/反向编辑元组,训练独立的低秩文本适配器分支用于端点对齐和语义转换对齐,并将它们通过低秩方向合并(LRDM)合并为一个可部署的适配器。在CIRR、CIRCO、FashionIQ和GeneCIS上的大量实验表明,DeCIR在不增加推理复杂性的情况下,一致提升了基于投影的ZS-CIR性能。

英文摘要

Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint-transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter. Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS demonstrate that DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.

2605.07985 2026-05-22 cs.DC cs.AI 版本更新

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

Dooly: 一种配置无关、冗余感知的LLM推理模拟器

Joon Ha Kim, Geon-Woo Kim, Anoop Rachakonda, Daehyeok Kim

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校)

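AI summary This paper presents Dooly, a configuration-agnostic, redundancy-aware profiling approach for LLM inference simulation. It uses a single inference pass and taint-based label propagation to cut redundant profiling time while keeping simulation accurate.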

详情
AI中文摘要

选择最优的LLM推理配置需要在硬件、服务引擎、注意力后端和模型架构之间进行评估,因为没有单一选择在所有工作负载中表现最佳。基于配置的模拟器是标准工具,但它们硬编码操作集到特定配置,并重新对每个操作进行重新配置,这使得探索变得成本高昂。这种成本源于对结构理解的缺失:每个操作的每个输入维度都由模型配置或 incoming 请求决定。许多模型配置值(例如头大小、层数)在不同模型中重复出现,因此相同操作在许多配置中运行;一次扫描请求依赖的维度即可服务所有。我们提出了Dooly,利用这种结构实现配置无关、冗余感知的配置。Dooly执行一次推理过程,通过污点传播标记每个输入维度的来源,并仅对不在其延迟数据库中的操作进行选择性配置;状态操作如注意力通过重用服务引擎自身的初始化代码进行隔离,从而消除手动仪器化。它基于数据库构建延迟回归模型,该模型成为现有模拟器的即插即用后端。在两个GPU平台、三个注意力后端和多样的模型架构上,Dooly在TTFT上达到5%的MAPE精度,在TPOT上达到8%的精度,同时将12个模型的profiling GPU小时减少了56.4%。我们已开源Dooly在https://github.com/dooly-project。

英文摘要

Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware profiling. Dooly performs a single inference pass, labels each input dimension with its origin via taint propagation, and selectively profiles only operations absent from its latency database; stateful operations such as attention are isolated by reusing the serving engine's own initialization code, eliminating manual instrumentation. It builds latency regression models based on the database, which becomes a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to the existing profiling approach. We have open-sourced Dooly at https://github.com/dooly-project.
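
The accuracy figures above are reported as MAPE (mean absolute percentage error) for TTFT (time to first token) and TPOT (time per output token). As a reminder of what that metric measures, here is a minimal sketch; the function name and the latency values are hypothetical and are not taken from Dooly's code or results.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Each error is taken relative to the measured (actual) value, so
    actual values must be non-zero.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


measured_ttft_ms = [100.0, 200.0]   # hypothetical measured latencies
simulated_ttft_ms = [95.0, 210.0]   # hypothetical simulator predictions
ttft_mape = mape(measured_ttft_ms, simulated_ttft_ms)
```

Because each error is normalized by the measured value, MAPE weights short and long requests equally, which suits comparisons across configurations with very different absolute latencies.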

2605.07870 2026-05-22 cond-mat.dis-nn cs.AI stat.ML 版本更新

Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer

深度网络中的谱动力学:特征学习、异常值逃逸和学习率转移

Clarissa Lauditi, Cengiz Pehlevan, Blake Bordelon

发表机构 * John A. Paulson School of Engineering and Applied Sciences, Harvard University(哈佛大学约翰A·保罗森工程与应用科学学院) Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University(哈佛大学自然与人工智能研究学院) Center for Mathematical Sciences and Applications, Harvard University(哈佛大学数学科学中心) Oden Institute for Computational Engineering and Sciences & Dept. of Neuroscience, UT Austin(得克萨斯大学奥斯汀分校奥登计算工程与科学学院及神经科学系)

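AI summary This paper studies the evolution of hidden-weight spectra in wide neural networks trained by (stochastic) gradient descent. It develops a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. The framework is applied to two settings: (1) infinite-width nonlinear networks in mean-field/$μ$P scaling and (2) deep linear networks in the proportional high-dimensional limit. The theory predicts how outliers evolve with training time, width, output scale, and initialization variance. In deep linear networks, $μ$P yields width-consistent outlier dynamics and hyperparameter transfer, including width-stable growth of the leading NTK mode toward the edge of stability (EoS). In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, despite converging to a stable large-width limit. This bulk+outlier picture describes simple tasks, whereas tasks with large numbers of outputs (such as ImageNet classification or GPT language modeling) are better described by a restructuring of the spectral bulk. A toy model with extensive output channels reproduces this phenomenon and shows that the edge of the spectrum still converges for sufficiently wide networks.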

Comments Updating related works + discussion

详情
AI中文摘要

我们研究了在宽神经网络中通过(随机)梯度下降训练时隐藏权重谱的演变。我们开发了一种双层动态平均场理论(DMFT),该理论联合跟踪具有尖峰集合的隐藏权重谱动态,其中尖峰方向在随机体上保持统计依赖性。我们将该框架应用于两种设置:(1)无限宽度非线性网络在均值场/μP缩放下,以及(2)深度线性网络在比例高维极限下,其中宽度、输入维度和样本大小以固定比例发散。我们的理论预测了异常值如何随训练时间、宽度、输出尺度和初始化方差演变。在深度线性网络中,μP产生与宽度一致的异常值动态和超参数转移,包括主导NTK模式向稳定性边缘(EoS)的宽度稳定增长。相比之下,NTK参数化表现出强烈依赖宽度的异常值动态,尽管收敛到一个稳定的宽网络极限。我们展示了这种体+异常值图像是描述简单任务的,但涉及大量输出的任务(如ImageNet分类或GPT语言建模)则更适合通过重构谱体来描述。我们开发了一个具有大量输出通道的玩具模型,重现了这一现象,并展示了足够宽的网络下谱边缘仍会收敛。

英文摘要

We study the evolution of hidden-weight spectra in wide neural networks trained by (stochastic) gradient descent. We develop a two-level dynamical mean-field theory (DMFT) that jointly tracks bulk and outlier spectral dynamics for spiked ensembles whose spike directions remain statistically dependent on the random bulk. We apply this framework to two settings: (1) infinite-width nonlinear networks in mean-field/$μ$P scaling and (2) deep linear networks in the proportional high-dimensional limit, where width, input dimension, and sample size diverge with fixed ratios. Our theory predicts how outliers evolve with training time, width, output scale, and initialization variance. In deep linear networks, $μ$P yields width-consistent outlier dynamics and hyperparameter transfer, including width-stable growth of the leading NTK mode toward the edge of stability (EoS). In contrast, NTK parameterization exhibits strongly width-dependent outlier dynamics, despite converging to a stable large-width limit. We show that this bulk+outlier picture is descriptive of simple tasks with small output channels, but that tasks involving large numbers of outputs (ImageNet classification or GPT language modeling) are better described by a restructuring of the spectral bulk. We develop a toy model with extensive output channels that recapitulates this phenomenon and show that the edge of the spectrum still converges for sufficiently wide networks.

2605.06669 2026-05-22 cs.CR cs.AI cs.LG 版本更新

Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

评估教育LLM导师的提示注入防御:安全-可用性-延迟的权衡

Alexandre Cristovão Maiorano

发表机构 * Lumytics

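AI summary This paper presents a framework for evaluating prompt-injection defenses, examines the trade-offs among security, usability, and latency in educational LLM tutors, and compares the performance of different defense mechanisms experimentally.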

Comments 19 pages, 4 figures, 9 tables

详情
AI中文摘要

教育LLM导师面临一个核心的AI对齐挑战:它们必须在遵循用户意图的同时保持教学约束和安全政策。我们提出了一个评估方法,用于评估提示注入防御在该场景中的表现,显示了防护栏设计在对抗性鲁棒性、良性任务可用性和响应延迟之间存在显式的权衡。我们评估了一个领域特定的多层安全防护流水线,结合确定性模式过滤器、结构验证、上下文沙箱和会话级行为检查。在受控的保留基准测试中,该流水线实现了低绕过率和假阳性率,同时优化了平均延迟——一个优先考虑教学可用性(零假阳性)而保持可测量攻击抵抗力的操作点。我们提供了一个可重复的基准测试协议,用于在相同条件下进行头对头比较,包括分层Bootstrap置信区间、配对McNemar显著性检验、多种子敏感度扫描,以及在相同划分上对Prompt Guard和NeMo Guardrails的直接评估。结果揭示了操作权衡:NeMo在16.22%的假阳性率下达到0%的绕过率,而Prompt Guard在3.60%的假阳性率下达到38.48%的绕过率。该框架支持在不同机构风险和可用性要求下,基于证据的防护栏选择。

英文摘要

Educational LLM tutors face a core AI alignment challenge: they must follow user intent while preserving pedagogical constraints and safety policies. We present an evaluation methodology for prompt-injection defenses in this setting, showing that guardrail design entails explicit trade-offs among adversarial robustness, benign-task usability, and response latency. We evaluate a domain-specific multi-layer safeguard pipeline combining deterministic pattern filters, structural validation, contextual sandboxing, and session-level behavioral checks. On a controlled holdout benchmark, the pipeline reaches low bypass and false positive rates with optimized average latency - an operating point that prioritizes pedagogical usability (zero false positives) while maintaining measurable attack resistance. We provide a reproducible benchmark protocol for head-to-head comparison under identical conditions, including stratified bootstrap confidence intervals, paired McNemar significance tests, multi-seed sensitivity sweeps, and direct evaluation of Prompt Guard and NeMo Guardrails on the same split with unified instrumentation. Results expose operational trade-offs: NeMo reaches 0 percent bypass at 16.22 percent FPR and roughly 1.5s latency, while Prompt Guard yields 38.48 percent bypass with 3.60 percent FPR. The framework supports evidence-based guardrail selection for AI tutoring systems under different institutional risk and usability requirements.
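
The protocol above mentions paired McNemar significance tests for comparing two defenses on the same split. The following is a minimal sketch of the exact (binomial) form of that test, assuming each defense yields one blocked/not-blocked outcome per attack prompt. The function names and the sample outcomes are hypothetical and are not the paper's benchmark code or data.

```python
from math import comb


def discordant_counts(blocked_a, blocked_b):
    """Count prompts where exactly one of the two defenses blocked the attack."""
    only_a = sum(1 for a, b in zip(blocked_a, blocked_b) if a and not b)
    only_b = sum(1 for a, b in zip(blocked_a, blocked_b) if b and not a)
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    Under the null hypothesis the discordant pairs split 50/50, so the
    smaller count follows Binomial(n, 0.5) with n = only_a + only_b.
    """
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical outcomes for four attack prompts under two defenses.
defense_a = [True, True, False, True]
defense_b = [True, False, False, False]
p_value = mcnemar_exact(*discordant_counts(defense_a, defense_b))
```

Only the prompts on which the two defenses disagree carry information here, which is why a paired test is preferable to comparing two independent bypass rates measured on the same prompts.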

2605.06597 2026-05-22 cs.CL cs.AI cs.LG 版本更新

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

UniSD:面向大语言模型的统一自蒸馏框架

Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu, B. Aditya Prakash, Josiah Hester, Jindong Wang, Srijan Kumar

Affiliations * Georgia Institute of Technology; University of California, Los Angeles; Carnegie Mellon University; William & Mary

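AI summary This paper proposes UniSD, a framework for systematically studying self-distillation. By integrating multiple mechanisms it improves supervision reliability, representation alignment, and training stability, validates self-distillation across multiple benchmarks and models, and yields UniSDfull, the best-performing pipeline.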

Comments Website: https://unifiedsd.github.io/ Code: https://github.com/Ahren09/UniSD

详情
AI中文摘要

自蒸馏(SD)为在不依赖更强外部教师的情况下适应大语言模型(LLMs)提供了一条有前途的路径。然而,在自回归LLMs中,SD仍然具有挑战性,因为自生成轨迹是自由形式的,正确性依赖于任务,且合理的推理仍可能提供不稳定或不可靠的监督。现有方法主要考察孤立的设计选择,留下其有效性、作用和交互关系不清晰。在本文中,我们提出UniSD,一个统一的框架,系统地研究自蒸馏。UniSD整合了互补的机制,解决监督可靠性、表征对齐和训练稳定性问题,包括多教师一致、EMA教师稳定、token级对比学习、特征匹配和发散剪裁。在六个基准和六个模型(来自三个模型家族)上,UniSD揭示了自蒸馏何时优于静态模仿,哪些组件驱动了收益,以及这些组件在不同任务间的交互方式。基于这些见解,我们构建了UniSDfull,一个整合互补组件的流水线,实现了最强的整体性能,比基模型提高了+5.4点,比最强基线提高了+2.8点。广泛评估凸显了自蒸馏作为一种实用且可控的高效LLM适应方法,无需更强的外部教师。

英文摘要

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
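Of the mechanisms listed, EMA teacher stabilization is the simplest to show in code. The sketch below is the generic exponential-moving-average update, with weights held in a plain dict for illustration; the function name and the decay value are assumptions, not UniSD's actual implementation.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student.

    A decay close to 1 makes the teacher change slowly, which smooths out
    step-to-step noise in the student it is distilled from.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Toy usage with scalar "weights" (hypothetical values).
teacher = {"w": 1.0}
student = {"w": 0.0}
teacher = ema_update(teacher, student, decay=0.9)
```

In a real training loop the same rule would be applied tensor by tensor after each student optimizer step.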

2605.05118 2026-05-22 cs.LG cs.AI stat.ML 版本更新

On the Wasserstein Gradient Flow Interpretation of Drifting Models

关于漂移模型的Wasserstein梯度流解释

Arthur Gretton, Li Kevin Wenliang, Alexandre Galashov, James Thornton, Valentin De Bortoli, Arnaud Doucet

Affiliations * Google DeepMind

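AI summary This note analyzes drifting models through Wasserstein gradient flows (WGF), clarifying how the GMD framework relates to WGF paths. It presents three main results: one proposed algorithm corresponds to the limiting point of a WGF on the KL divergence; the algorithm actually implemented resembles the fixed point of a WGF on the Sinkhorn divergence but lacks some of its properties; and the approach extends to the limiting points of other WGFs, such as the MMD, the sliced Wasserstein distance, and GAN critic functions.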

详情
AI中文摘要

最近,Deng等人(2026)提出了通过漂移进行生成建模(Generative Modeling via Drifting,GMD),一种面向生成任务的新框架。本文从Wasserstein梯度流(WGF)的视角分析GMD;WGF即在配备最优传输几何结构的概率测度空间中,某个泛函的最速下降路径。与以往基于WGF的工作不同,GMD可以被视为直接以特定WGF的不动点为目标。我们给出三个主要结果:第一,Deng等人(2026)提出的一种算法对应于求KL散度上WGF的极限点,并对密度进行Parzen平滑。第二,Deng等人(2026)实际实现的算法对应于另一种过程,它与Sinkhorn散度上WGF的不动点有一定相似之处,但缺乏后者的某些理想性质。第三,同样的思路可以推广到其他WGF的极限点,包括最大均值差异(MMD)、切片Wasserstein距离和GAN判别器(critic)函数。

英文摘要

Recently, Deng et al. (2026) proposed Generative Modeling via Drifting (GMD), a novel framework for generative tasks. This note presents an analysis of GMD through the lens of Wasserstein Gradient Flows (WGF), i.e., the path of steepest descent for a functional in the space of probability measures, equipped with the geometry of optimal transport. Unlike previous WGF-based contributions, GMD can be thought of as directly targeting a fixed point of a specific WGF. We demonstrate three main results: first, that one algorithm proposed by Deng et al. (2026) corresponds to finding the limiting point of a WGF on the KL divergence, with Parzen smoothing on the densities. Second, that the algorithm actually implemented by Deng et al. (2026) corresponds to a different procedure, which bears some resemblance to the fixed point of a WGF on the Sinkhorn divergence, but lacks certain desirable properties of the latter. Third, that the same idea can be extended to the limiting point of other WGFs, including the Maximum Mean Discrepancy (MMD), the sliced Wasserstein distance, and GAN critic functions.
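For readers new to the terminology, the standard definition of a Wasserstein gradient flow and its MMD instance can be written as follows. This is textbook background rather than a formula taken from the note; the symbols $\mathcal{F}$ (the functional), $\mu_t$ (the evolving measure), $\nu$ (the target), and the symmetric kernel $k$ are notation assumed here.

```latex
\partial_t \mu_t = \nabla \cdot \Big( \mu_t \, \nabla \frac{\delta \mathcal{F}}{\delta \mu}(\mu_t) \Big),
\qquad
\frac{\delta}{\delta \mu}\Big[ \tfrac{1}{2}\,\mathrm{MMD}^2(\mu,\nu) \Big](x)
= \int k(x,y)\,\mathrm{d}\mu(y) - \int k(x,y)\,\mathrm{d}\nu(y).
```

Particles following the flow move along the negative gradient of the first variation. Roughly speaking, a fixed point is a measure on whose support this velocity field vanishes, which is the kind of object the abstract describes GMD as targeting directly.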

2605.04062 2026-05-22 cs.LG cs.AI 版本更新

EdgeRazor: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation

EdgeRazor: 一种通过混合精度量化感知蒸馏实现大语言模型轻量化的框架

Shu-Hao Zhang, Le-Tong Huang, Xiang-Sheng Deng, Xin-Yi Zou, Chen Wu, Nan Li, Shao-Qun Zhang, Zhi-Hua Zhou

Affiliations * National Key Laboratory for Novel Software Technology, Nanjing University; School of Intelligent Science and Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Microsoft AI

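AI summary This paper proposes EdgeRazor, a mixed-precision quantization-aware distillation framework for deploying large language models on resource-constrained devices, achieving higher compression ratios and better efficiency.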

详情
AI中文摘要

量化已成为在资源受限设备上部署大语言模型(LLMs)的主流方法,但将精度压缩到4位以下通常会导致严重的性能退化或高昂的重训练成本。在本文中,我们提出了EdgeRazor,一种通过混合精度量化感知蒸馏实现LLM轻量化的框架。它包含三个模块:混合精度结构量化,用于精细控制位宽;层自适应特征蒸馏,动态选择信息量最大的特征进行对齐;以及熵感知KL散度,用于在人工标注和蒸馏数据集上实现前向-反向平衡。在MobileLLM和Qwen系列上的评估表明,在权重-激活量化下,1.88位的Qwen3-0.6B-EdgeRazor比最先进的2位基线高出11.27,并比最强的3位基线高出4.38;同时,量化后的MobileLLM-350M-EdgeRazor所需的训练预算比领先的量化感知训练方法低4-10倍。在效率方面,EdgeRazor在所有位宽下都实现了更高的压缩比,1.58位的Qwen3-0.6B-EdgeRazor将存储从1.11 GB减少到0.19 GB,同时解码速度相对16位基线提升15.16倍。这些结果从实验上验证了EdgeRazor的有效性和效率。代码可以从GitHub和Huggingface获取。

英文摘要

Quantization has emerged as a mainstream approach for deploying Large Language Models (LLMs) on resource-constrained devices, yet compressing precision below 4-bit typically causes severe performance degradation or prohibitive retraining costs. In this paper, we propose EdgeRazor, a lightweight framework for LLMs via Mixed-Precision Quantization-Aware Distillation. It contains three modules: Structural Quantization with Mixed Precision for fine-grained control of bit-widths, Layer-Adaptive Feature Distillation that dynamically selects the most informative features for alignment, and Entropy-Aware KL Divergence for forward-reverse balance on both human-annotated and distilled datasets. Evaluations conducted on MobileLLM and Qwen families show that under weight-activation quantization, the 1.88-bit Qwen3-0.6B-EdgeRazor outperforms the state-of-the-art 2-bit baselines by 11.27 and surpasses the strongest 3-bit baselines by 4.38, while the quantized MobileLLM-350M-EdgeRazor requires a training budget 4-10× lower than the leading quantization-aware training method. In terms of efficiency, EdgeRazor achieves higher compression ratios at all bit-widths, and the 1.58-bit Qwen3-0.6B-EdgeRazor reduces storage from 1.11 GB to 0.19 GB while accelerating decoding by 15.16× over the 16-bit baseline. These results empirically validate the effectiveness and efficiency of EdgeRazor. The code is available on GitHub (https://github.com/zhangsq-nju/EdgeRazor) and Hugging Face (https://huggingface.co/collections/zhangsq-nju/edgerazor-nbit).
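To make the bit-width figures concrete, here is a small sketch of ternary weight quantization together with the arithmetic behind the storage numbers. It assumes that "1.58-bit" refers to three-valued weights, since log2(3) is about 1.585, which is the common reading of that label. The absmean rounding rule and the function name are illustrative assumptions, not EdgeRazor's Structural Quantization module.

```python
import math

def ternary_quantize(weights):
    """Map weights to codes in {-1, 0, +1} plus one per-tensor scale.

    Uses absmean scaling (an assumed, generic rule): divide by the mean
    absolute value, round to the nearest integer, clip to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

# Information content of one three-valued weight, in bits.
bits_per_ternary_weight = math.log2(3)

# Storage reduction implied by the abstract's figures (1.11 GB -> 0.19 GB).
storage_ratio = 1.11 / 0.19

codes, scale = ternary_quantize([0.9, -0.8, 0.05, 0.2])
```

The storage ratio comes out below the naive 16 / 1.58 figure, which is what one would expect if some tensors stay at higher precision, as the abstract's mixed-precision design suggests.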

2605.03934 2026-05-22 cs.SD cs.AI 版本更新

Towards Open World Sound Event Detection

面向开放世界的声音事件检测

P. H. Hai, L. T. Minh, L. H. Son

Affiliations * VNU University of Engineering and Technology; Artificial Intelligence Research Center, VNU Information Technology Institute

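AI summary This paper proposes the Open-World Sound Event Detection (OW-SED) paradigm and, by introducing a deformable architecture and the novel WOOT framework, addresses the challenges of overlapping and ambiguous events and improves detection performance in open-world settings.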

Comments 32 pages, 3 figures. Accepted to Signal Processing (Elsevier)

Journal ref Signal Processing, Article 110707, 2026

详情
AI中文摘要

声音事件检测(SED)在音频理解中起着至关重要的作用,应用于监控、智能城市、医疗保健和多媒体索引等领域。然而,传统SED系统基于封闭世界假设,限制了其在现实环境中处理新兴声音事件的能力。受开放世界学习在计算机视觉中的成功启发,我们引入了开放世界声音事件检测(OW-SED)范式,其中模型必须检测已知事件、识别未知事件并逐步学习它们。为了解决OW-SED特有的挑战,如重叠和模糊事件,我们提出了一种1D可变形架构,利用可变形注意力来适应性地聚焦于显著的时序区域。此外,我们设计了一种新颖的开放世界可变形声音事件检测转换器(WOOT)框架,结合特征解耦来分离类特定和类无关的表示,以及一种一对多匹配策略和多样性损失以增强表示多样性。实验结果表明,我们的方法在封闭世界设置中相比现有领先技术略具优势,并在开放世界场景中显著优于现有基线。

英文摘要

Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments where novel acoustic events frequently emerge. Inspired by the success of open-world learning in computer vision, we introduce the Open-World Sound Event Detection (OW-SED) paradigm, where models must detect known events, identify unseen ones, and incrementally learn from them. To tackle the unique challenges of OW-SED, such as overlapping and ambiguous events, we propose a 1D Deformable architecture that leverages deformable attention to adaptively focus on salient temporal regions. Furthermore, we design a novel Open-World Deformable Sound Event Detection Transformer (WOOT) framework incorporating feature disentanglement to separate class-specific and class-agnostic representations, together with a one-to-many matching strategy and a diversity loss to enhance representation diversity. Experimental results demonstrate that our method achieves marginally superior performance compared to existing leading techniques in closed-world settings and significantly improves over existing baselines in open-world scenarios.

2605.01369 2026-05-22 eess.SP cs.AI cs.LG 版本更新

MU-SHOT-Fi: Self-Supervised Multi-User Wi-Fi Sensing with Source-free Unsupervised Domain Adaptation

MU-SHOT-Fi: 基于源无关无监督域适应的多用户Wi-Fi感知

Ahmed Y. Radwan, Hina Tabassum

Affiliations * Department of Electrical Engineering and Computer Science

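AI summary This paper proposes MU-SHOT-Fi, a source-free unsupervised domain adaptation framework that achieves accurate activity classification and occupancy estimation in single- and multi-user Wi-Fi sensing while preventing model collapse.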

Journal ref IEEE Internet of Things Journal, Early Access, 2026

详情
AI中文摘要

深度学习已被广泛应用于基于Wi-Fi CSI的人体活动识别(HAR),因为它能够以隐私保护和成本效益的方式学习时空特征。然而,基于深度学习的模型在跨环境泛化能力差,特别是在多用户设置中,重叠活动导致CSI纠缠和域偏移。实际部署通常由于隐私限制限制访问标记源数据,这促使使用仅未标记目标域CSI和预训练源模型进行源无关适应。在本文中,我们提出了MU-SHOT-Fi,一种用于单用户和多用户Wi-Fi感知的源无关无监督域适应框架。MU-SHOT-Fi在源训练期间采用排列不变的集合预测与匈牙利匹配,随后在目标域中采用冻结分类器骨干适应。为了实现无标签的稳定适应,我们引入了占用加权信息最大化,通过将多样性正则化集中在可能占用的槽位上,同时排除主导类别的边际熵。此外,我们采用二进制旋转预测作为空间自监督,利用CSI频率-时间结构学习域不变特征。对于单用户场景,我们引入SU-SHOT-Fi,通过将占用加权替换为标准信息最大化,并结合对比预测编码以利用时间一致性。在WiMANS和Widar 3.0数据集上进行了广泛的实验,涵盖了跨环境、跨频率、跨方向和组合域偏移,证明MU-SHOT-Fi在大域偏移下有效恢复多用户精确活动分类性能,同时保持准确的占用估计并防止向主导类崩溃。

英文摘要

Deep learning (DL) has been widely adopted for Wi-Fi CSI-based human activity recognition (HAR) due to its ability to learn spatio-temporal features in a privacy-preserving and cost-effective manner. However, DL-based models generalize poorly across environments, a challenge amplified in multi-user settings where overlapping activities cause CSI entanglement and domain shifts. Practical deployments often limit access to labeled source data due to privacy constraints, motivating source-free adaptation using only unlabeled target-domain CSI and a pre-trained source model. In this paper, we propose MU-SHOT-Fi, a source-free unsupervised domain adaptation framework for single- and multi-user Wi-Fi sensing. MU-SHOT-Fi employs permutation-invariant set prediction with Hungarian matching during source training, followed by frozen-classifier backbone adaptation in the target domain. To enable stable adaptation without labels, we introduce occupancy-weighted information maximization that prevents model collapse by focusing diversity regularization on likely-occupied slots while excluding the dominant class from marginal entropy. Additionally, we employ binary rotation prediction as spatial self-supervision that exploits CSI frequency-time structure to learn domain-invariant features. For single-user scenarios, we introduce SU-SHOT-Fi by replacing occupancy weighting with standard information maximization and incorporating contrastive predictive coding to exploit temporal consistency. Extensive experiments on the WiMANS and Widar 3.0 datasets across cross-environment, cross-frequency, cross-orientation, and combined domain shifts demonstrate that MU-SHOT-Fi effectively recovers multi-user exact-activity classification performance under large domain shifts while maintaining accurate occupancy estimation and preventing collapse toward dominant classes.
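The "standard information maximization" objective mentioned for the single-user variant can be sketched as below. It is the generic form used in source-free adaptation: make each prediction confident (low conditional entropy) while keeping the batch-level class distribution spread out (high marginal entropy). The occupancy weighting and dominant-class exclusion of MU-SHOT-Fi are not reproduced, because the abstract does not specify them; the function names are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def info_max_loss(probs):
    """Information-maximization loss over a batch; lower is better.

    probs: list of per-sample class-probability lists.
    loss = mean per-sample entropy - entropy of the mean prediction.
    """
    n = len(probs)
    mean_conditional = sum(entropy(p) for p in probs) / n
    marginal = [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]
    return mean_conditional - entropy(marginal)

# Confident and diverse predictions score better than collapsed ones.
diverse = info_max_loss([[1.0, 0.0], [0.0, 1.0]])
collapsed = info_max_loss([[1.0, 0.0], [1.0, 0.0]])
```

The marginal-entropy term is what counters collapse: a model that assigns every sample to one class gets no credit from it, even though each individual prediction is confident.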

2605.00515 2026-05-22 cs.DC cs.AI cs.NI 版本更新

SpaceMoE: Realizing Distributed Mixture-of-Experts Inference over Space Networks

SpaceMoE:在空间网络上实现分布式混合专家推理

Zhanwei Wang, Huiling Yang, Min Sheng, Khaled B. Letaief, Kaibin Huang

Affiliations * Department of Electrical and Computer Engineering, The University of Hong Kong (HKU); State Key Laboratory of Integrated Service Networks, Institute of Information Science, Xidian University; Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology (HKUST)

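AI summary This paper proposes the SpaceMoE framework to address the challenge of efficiently deploying large-scale LLMs over satellite networks, using a hierarchical expert placement strategy to reduce latency and enable efficient mixture-of-experts inference in space.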

详情
AI中文摘要

利用高效的连续太阳能采集,空间数据中心被视为执行高能耗大语言模型(LLMs)的有前途的平台。鉴于这一优势,航天和人工智能 conglomerates(如SpaceX、Google)正在积极投资这一愿景。然而,一个关键挑战是由于卫星上的计算和通信资源有限,高效地在卫星网络中部署大规模LLM。这导致了一个放置问题,需要将模型组件划分为卫星,以确保不同的模型架构和网络拓扑能够协调一致,从而实现低延迟的token生成。为了解决这个问题,我们提出了混合专家(MoE)的空间网络(SpaceMoE)框架,旨在在空间中分布式执行流行的混合专家模型。所提出的放置策略是两级的:(1)层放置,将MoE层分配给卫星子网;(2)层内专家放置,将单个专家分配给同一层/子网的卫星。对于层放置,我们利用自回归推断的环形通信模式,将卫星星座沿轨道方向划分为子网,每个子网托管一个MoE层。基于此架构,我们制定了并解决了层内专家放置的优化问题,以将具有异构激活概率的专家映射到卫星上。推导出的策略揭示了一个直观的原则:频繁激活的专家应映射到具有低预期延迟的路由路径上的卫星。实验表明,SpaceMoE在千卫星星座上实现了至少三倍于传统随机和消融放置策略的延迟降低。

英文摘要

Leveraging continuous solar energy harvesting at high efficiency, space data centers are envisioned as a promising platform for executing energy-intensive large language models (LLMs). Recognizing this advantage, space and AI conglomerates (e.g., SpaceX, Google) are actively investing in this vision. One key challenge, however, is the efficient distributed deployment of a large-scale LLM in a satellite network due to the limited onboard computing and communication resources. This gives rise to a placement problem that involves partitioning and mapping model components to satellites such that the fundamentally different model architecture and network topology can be reconciled to ensure low-latency token generation. To address this problem, we present the Space Network of Mixture-of-Experts (SpaceMoE) framework targeting the distributed execution of a popular mixture-of-experts (MoE) model in space. The proposed placement strategies are two-level: (1) layer placement, which assigns MoE layers to satellite subnets; and (2) intra-layer expert placement, which assigns individual experts to satellites associated with the same layer/subnet. For layer placement, we exploit the ring-like communication pattern of autoregressive inference to partition the satellite constellation along the orbiting direction into subnets arranged on a ring, each hosting one MoE layer. Based on this architecture, we formulate and solve an optimization problem for intra-layer expert placement to map experts with heterogeneous activation probabilities onto satellites. The derived strategy reveals an intuitive principle: a frequently activated expert should be mapped to a satellite on a routing path with low expected latency. Experiments over a thousand-satellite constellation show that SpaceMoE achieves at least a threefold latency reduction compared with conventional random and ablation-based placement strategies.
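The placement principle stated in the abstract (frequently activated experts go to low-latency satellites) can be illustrated with a deliberately simplified sketch. It assumes one expert per satellite and a single known expected latency per satellite, in which case pairing by sorted order minimizes the probability-weighted latency by the rearrangement inequality. The paper's actual optimization problem is likely richer; all names and numbers here are hypothetical.

```python
def place_experts(activation_prob, satellite_latency):
    """Pair the most frequently activated expert with the lowest-latency satellite.

    activation_prob:   dict expert -> activation probability
    satellite_latency: dict satellite -> expected routing latency
    """
    experts = sorted(activation_prob, key=activation_prob.get, reverse=True)
    sats = sorted(satellite_latency, key=satellite_latency.get)
    return dict(zip(experts, sats))

def expected_latency(placement, activation_prob, satellite_latency):
    """Probability-weighted latency of a placement."""
    return sum(
        activation_prob[e] * satellite_latency[s] for e, s in placement.items()
    )

# Toy instance (hypothetical values, latency in ms).
probs = {"e0": 0.6, "e1": 0.3, "e2": 0.1}
latency = {"s0": 30.0, "s1": 10.0, "s2": 20.0}
placement = place_experts(probs, latency)
cost = expected_latency(placement, probs, latency)
```

On this toy instance the sorted pairing gives an expected latency of about 15 ms, whereas the reversed pairing (hottest expert on the slowest satellite) gives about 25 ms.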

2604.20665 2026-05-22 cs.CV cs.AI 版本更新

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

视见之代价:在单体范式内实现可信的多模态推理

Karan Goyal

Affiliations * IIIT Delhi, India

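AI summary This paper proposes a new approach to multimodal evaluation that takes an information-theoretic view of the Expense of Seeing in multimodal reasoning, introduces three new metrics and the Semantic Sufficiency Criterion, and challenges conventional multimodal evaluation methodology.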

Comments Addresses practical viability of Vlabel construction. Writing is grounded. Acknowledgement is duly added

详情
AI中文摘要

视觉语言模型(VLMs)的快速普及通常被视为促进统一多模态知识发现的手段,但其背后存在一个未经检验的假设:当前VLMs能够忠实合成多模态数据。我们认为它们往往不能,这种差距反映了主导的视觉编码器-投影器-语言模型范式中的可信问题。而非从视觉输入中提取基础知识,最先进的模型经常表现出功能失明,即利用强大的语言先验来绕过严重的视觉表示瓶颈。在本文中,我们挑战了传统多模态评估方法,该方法依赖于数据删减或新数据集创建,因此将数据集偏差与架构能力不足混淆了。我们提出了一种信息论的突破:模态翻译协议,旨在量化我们称之为视见代价的东西。通过翻译语义负载而不是删减它们,我们提出了三个新的指标——视见的 toll(ToS)、诅咒(CoS)和谬误(FoS)——最终得出语义充分性准则(SSC)。此外,我们假设多模态扩展的分歧定律:随着底层语言引擎扩展到前所未有的推理能力,视觉知识瓶颈的惩罚可能增加而不是减少。我们主张社区应超越“多模态增益”作为主要评估目标。通过将SSC从被动的诊断约束提升为主动的架构蓝图,我们为引导下一代人工智能系统走向真正的多模态推理提供了基础。

英文摘要

The rapid proliferation of Vision-Language Models (VLMs) is often framed as enabling unified multimodal knowledge discovery but rests on an under-examined assumption: that current VLMs faithfully synthesise multimodal data. We argue they often do not, and this gap reflects a trustworthiness problem in the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore conflates dataset biases with architectural incapacity. We propose an information-theoretic departure: the Modality Translation Protocol, designed to quantify what we call the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we hypothesise a Divergence Law of Multimodal Scaling: as the underlying language engines scale to unprecedented reasoning capabilities, the penalty of the visual knowledge bottleneck may increase rather than diminish. We argue the community should move beyond "multimodal gain" as a primary evaluation target. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide a foundation for guiding the next generation of AI systems toward genuine multimodal reasoning.

2604.16076 2026-05-22 cs.LG cs.AI cs.NE 版本更新

Prototype-Grounded Concept Models for Verifiable Concept Alignment

原型锚定概念模型:面向可验证的概念对齐

Stefano Colamonaco, David Debot, Pietro Barbiero, Giuseppe Marra

Affiliations * Department of Computer Science, KU Leuven; IBM Research, Zurich

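AI summary This work proposes Prototype-Grounded Concept Models (PGCMs), which tie concepts to learned visual prototypes, making concept alignment more verifiable and interpretable while maintaining predictive performance.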

详情
AI中文摘要

概念瓶颈模型(CBMs)旨在通过人类可理解的概念来提高深度学习的可解释性,但它们无法验证所学概念是否与人类的意图一致,从而损害了可解释性。我们引入了基于原型的概念模型(PGCMs),将概念 grounded 在学习到的视觉原型上:作为概念的显式证据的图像部分。这种 grounding 允许直接检查概念语义,并支持在原型层面进行有针对性的人类干预以纠正不一致。实证结果表明,PGCMs 在预测性能上与最先进的 CBMs 相当,同时显著提高了透明度、可解释性和可干预性。

英文摘要

Concept Bottleneck Models (CBMs) aim to improve interpretability in Deep Learning by structuring predictions through human-understandable concepts, but they provide no way to verify whether learned concepts align with the human's intended meaning, hurting interpretability. We introduce Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes: image parts that serve as explicit evidence for the concepts. This grounding enables direct inspection of concept semantics and supports targeted human intervention at the prototype level to correct misalignments. Empirically, PGCMs achieve predictive performance similar to that of state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.

2604.11028 2026-05-22 cs.RO cs.AI 版本更新

Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation

联邦单体机器人:多机器人协调无需机器人内部多代理碎片化

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Affiliations * School of Software, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology; School of Mathematical and Computer Sciences, Heriot-Watt University, Malaysia Campus; School of Future Science and Engineering, Soochow University

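AI summary This paper proposes the Federated Single-Agent Robotics (FSAR) architecture, which achieves multi-robot coordination on top of single-agent robot runtimes, avoids multi-agent fragmentation inside each robot, and improves coordination efficiency and recovery.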

Comments 30 pages, 10 figures, 9 tables. Code: https://github.com/s20sc/fsar-fleet-coordination

详情
AI中文摘要

随着具身机器人向舰队规模操作发展,多机器人协调已成为系统挑战的核心。现有方法通常将其视为增加机器人内部多代理分解的动机。我们主张另一种原则:多机器人协调不需要机器人内部的多代理碎片化。每个机器人应保持一个单体具身代理,拥有自己的持久运行时、本地策略范围、能力状态和恢复权限,而协调则通过在舰队层面的联邦实现。我们提出了联邦单体机器人(FSAR),一种基于单体机器人运行时的多机器人协调运行时架构。每个机器人暴露受控的能力表面,而非内部碎片化的代理社会。舰队协调通过共享的能力注册表、跨机器人任务委托、策略感知的权限分配、信任范围内的交互以及分层恢复协议实现。我们正式化了关键协调关系,包括权限委托、跨机器人能力请求、本地与舰队恢复边界以及分层人类监督,并描述了一种支持共享具身能力模块(ECM)发现、合同感知的跨机器人协调以及舰队层面治理的舰队运行时架构。我们在代表性的多机器人协调场景中评估了FSAR,与分解密集的基线进行比较。结果表明,在治理局部性(d=2.91,p<.001 vs. 集中控制)和恢复包含性(d=4.88,p<.001 vs. 分解密集)方面有统计学显著的提升,同时在所有场景中减少了权限冲突和策略违规。我们的结果支持了从具身代理到具身舰队的路径应通过在相干机器人运行时之间进行联邦而非在其中进行碎片化的观点。

英文摘要

As embodied robots move toward fleet-scale operation, multi-robot coordination is becoming a central systems challenge. Existing approaches often treat this as motivation for increasing internal multi-agent decomposition within each robot. We argue for a different principle: multi-robot coordination does not require intra-robot multi-agent fragmentation. Each robot should remain a single embodied agent with its own persistent runtime, local policy scope, capability state, and recovery authority, while coordination emerges through federation across robots at the fleet level. We present Federated Single-Agent Robotics (FSAR), a runtime architecture for multi-robot coordination built on single-agent robot runtimes. Each robot exposes a governed capability surface rather than an internally fragmented agent society. Fleet coordination is achieved through shared capability registries, cross-robot task delegation, policy-aware authority assignment, trust-scoped interaction, and layered recovery protocols. We formalize key coordination relations including authority delegation, inter-robot capability requests, local-versus-fleet recovery boundaries, and hierarchical human supervision, and describe a fleet runtime architecture supporting shared Embodied Capability Module (ECM) discovery, contract-aware cross-robot coordination, and fleet-level governance. We evaluate FSAR on representative multi-robot coordination scenarios against decomposition-heavy baselines. Results show statistically significant gains in governance locality (d=2.91, p<.001 vs. centralized control) and recovery containment (d=4.88, p<.001 vs. decomposition-heavy), while reducing authority conflicts and policy violations across all scenarios. Our results support the view that the path from embodied agents to embodied fleets is better served by federation across coherent robot runtimes than by fragmentation within them.

2604.08362 2026-05-22 cs.CL cs.AI cs.LG 版本更新

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

迈向真实世界的人类行为模拟:在长时间跨度、跨场景、异质行为轨迹上对大语言模型进行基准测试

Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin

Affiliations * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Kuaishou Technology

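AI summary This paper introduces the OmniBehavior benchmark, which integrates long-horizon, cross-scenario, and heterogeneous behavioral patterns from real-world data. It reveals the limits of existing models in simulating complex human behavior, including convergence toward a positive average person, persona homogenization, and a utopian bias, and points to directions for future high-fidelity simulation research.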

Comments Project page: https://OmniBehavior.github.io

详情
AI中文摘要

大语言模型(LLMs)的出现揭示了通用用户模拟的潜力。然而,现有基准测试仍局限于孤立场景、狭窄动作空间或合成数据,无法捕捉真实人类行为的整体性。为弥合这一差距,我们引入OmniBehavior,首个完全基于真实世界数据构建的用户模拟基准测试,将长周期、跨场景和异质行为模式整合到统一框架中。基于此基准测试,我们首先提供了实证证据,表明以往孤立场景的数据集存在隧道视野问题,而真实世界决策依赖于长期的跨场景因果链。对最新LLMs的广泛评估显示,当前模型在模拟这些复杂行为时表现不佳,即使扩展上下文窗口,性能也趋于平稳。关键的是,模拟行为与真实行为的系统性比较揭示了根本性的结构偏差:LLMs倾向于趋同于正向平均人,表现出超活跃、人格同质化和乌托邦偏见。这导致了个体差异和长尾行为的丧失,突显了未来高保真模拟研究的关键方向。

英文摘要

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

2604.08295 2026-05-22 cs.AI cs.CV 版本更新

U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations

U-CECE:一个通用的多分辨率框架用于概念反事实解释

Angeliki Dimitriou, Nikolaos Chaidos, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou

Affiliations * Artificial Intelligence and Learning Systems (AILS) laboratory, National Technical University of Athens

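AI summary This paper proposes the U-CECE framework to address the trade-off between expressivity and efficiency in conceptual counterfactual methods. Its multi-resolution hierarchy offers explanations at different levels of expressivity, and experiments on different datasets characterize the balance between efficiency and expressivity.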

详情
AI中文摘要

随着AI模型日益复杂,可解释性对于建立信任至关重要,然而基于概念的反事实方法仍面临表达性与效率之间的权衡。将底层概念表示为原子集合虽然快速但忽略了关系上下文,而完整的图表示更加忠实但需要解决NP难的图编辑距离(GED)问题。我们提出了U-CECE,一个统一的、模型无关的多分辨率框架,用于概念反事实解释,能够适应数据环境和计算预算。U-CECE涵盖三个层次的表达性:原子概念用于广泛解释,关系集合-集合用于简单交互,以及结构图用于完整语义结构。在结构层,支持基于监督图神经网络(GNNs)的精度导向的归纳模式和基于无监督图自动编码器(GAEs)的可扩展归纳模式。在结构上,CUB和视觉基因组数据集的实验展示了不同层次的效率-表达性权衡,同时人类调查和LVLM基于评估表明,检索到的结构反事实与精确GED基于的地面真相解释在语义上等价,且常被优先选择。

英文摘要

As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.
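The cheapest of the three levels, atomic concept sets, can be sketched as follows. With unit costs and no concept hierarchy (both simplifying assumptions made here), the edit cost between two concept sets is the size of their symmetric difference, and a counterfactual is the closest item from the desired target class. The function names and concepts are hypothetical and are not taken from U-CECE.

```python
def set_edit_cost(source, target):
    """Number of concept deletions plus insertions turning source into target."""
    return len(source - target) + len(target - source)

def nearest_counterfactual(query, candidates):
    """Return the name of the candidate concept set closest to the query.

    candidates: dict name -> concept set, all drawn from the target class.
    """
    return min(candidates, key=lambda name: set_edit_cost(query, candidates[name]))

# Toy example (hypothetical concepts).
query = {"wing", "red", "beak"}
candidates = {
    "a": {"wing", "blue", "beak"},  # swap red -> blue: cost 2
    "b": {"fin", "blue"},           # remove 3 concepts, add 2: cost 5
}
best = nearest_counterfactual(query, candidates)
```

This runs in time linear in the set sizes, which is the efficiency side of the trade-off; what it cannot see is relational context such as which object a concept is attached to, the part that the graph-level representation and its NP-hard GED are meant to capture.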

2604.07799 2026-05-22 cs.RO cs.AI 版本更新

Learning Without Losing Identity: Capability Evolution for Embodied Agents

学习而不失身份:具身智能体的能力进化

Xue Qin, Simin Luan, John See, Cong Yang, Zhijun Li

Affiliations * School of Software, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology; School of Mathematical and Computer Sciences, Heriot-Watt University; School of Future Science and Engineering, Soochow University

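AI summary This paper proposes a capability-centric evolution paradigm for embodied agents. By introducing Embodied Capability Modules (ECMs), it enables continuous improvement while keeping agent identity stable; experiments show that it outperforms conventional approaches in task success rate and safety.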

Comments 12 pages, 2 figures, 7 tables

详情
AI中文摘要

具身智能体被期望在动态物理环境中持续运行,并随时间不断获得新能力。现有提升智能体性能的方法通常依赖于修改智能体本身(通过提示工程、策略更新或结构重新设计),导致长期运行的系统不稳定并丧失身份。本文提出了一种面向具身智能体的、以能力为中心的进化范式。我们认为,机器人应保持一个持久的智能体作为其认知身份,同时通过能力的进化实现持续改进。具体而言,我们引入了具身能力模块(ECMs)的概念,它代表可随时间学习、优化和组合的模块化、版本化的具身功能单元。我们提出一个统一框架,将能力进化与智能体身份解耦。能力通过包含任务执行、经验收集、模型优化和模块更新的闭环过程进化,而所有执行均由运行时层管控,该层强制执行安全和策略约束。我们通过模拟具身任务证明,能力进化在20次迭代中将任务成功率从32.4%提升到91.3%,优于修改智能体的基线和已有的技能学习方法(SPiRL、SkiMo),同时保持零策略漂移和零安全违规。我们的结果表明,将智能体身份与能力进化分离,为长期具身智能提供了可扩展且安全的基础。

英文摘要

Embodied agents are expected to operate persistently in dynamic physical environments, continuously acquiring new capabilities over time. Existing approaches to improving agent performance often rely on modifying the agent itself -- through prompt engineering, policy updates, or structural redesign -- leading to instability and loss of identity in long-lived systems. In this work, we propose a capability-centric evolution paradigm for embodied agents. We argue that a robot should maintain a persistent agent as its cognitive identity, while enabling continuous improvement through the evolution of its capabilities. Specifically, we introduce the concept of Embodied Capability Modules (ECMs), which represent modular, versioned units of embodied functionality that can be learned, refined, and composed over time. We present a unified framework in which capability evolution is decoupled from agent identity. Capabilities evolve through a closed-loop process involving task execution, experience collection, model refinement, and module updating, while all executions are governed by a runtime layer that enforces safety and policy constraints. We demonstrate through simulated embodied tasks that capability evolution improves task success rates from 32.4% to 91.3% over 20 iterations, outperforming both agent-modification baselines and established skill-learning methods (SPiRL, SkiMo), while preserving zero policy drift and zero safety violations. Our results suggest that separating agent identity from capability evolution provides a scalable and safe foundation for long-term embodied intelligence.

2604.07180 2026-05-22 cs.CV cs.AI 版本更新

Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis

基于能量的组织流形用于纵向多参数MRI分析

Kartikay Tehlan, Lukas Förner, Sina Wendrich, Nico Schmutzenhofer, Michael Frühwald, Matthias Wagner, Nassir Navab, Thomas Wendler

Affiliations * Dept. of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Augsburg, Germany; Digital Medicine, University Hospital Augsburg, Germany; Chair for Computer Aided Medical Procedures and Augmented Reality, Technical University of Munich, Germany; Bavarian Center for Cancer Research (BZKF) Augsburg, Germany; Dept. of Pediatrics and Adolescent Medicine, University Hospital Augsburg, Germany; Center for Advanced Analytics and Predictive Sciences, University of Augsburg, Germany

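AI summary This paper proposes a geometric framework for longitudinal multiparametric MRI analysis based on patient-specific energy modelling. A compact implicit neural representation is trained to learn an energy function that gives a differential-geometric description of tissue regimes without segmentation labels, illustrating the potential of patient-specific energy manifolds for longitudinal mpMRI analysis.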

Comments The code is available at https://github.com/tkartikay/EnFold-MRI

详情
AI中文摘要

我们提出了一种用于纵向多参数MRI分析的几何框架,其基础是序列空间中的患者特定能量建模。该方法不使用空间网络对图像进行操作,而是将每个体素表示为其多序列强度向量(T1、T1c、T2、FLAIR、ADC),并通过去噪分数匹配训练一个紧凑的隐式神经表示,从单次基线扫描中学习定义在R^d上的能量函数E_θ(u)。学习到的能量景观无需分割标签即可给出组织状态的微分几何描述:局部极小值定义组织盆地,梯度大小反映与状态边界的接近程度,拉普拉斯曲率刻画局部约束结构。重要的是,该基线能量流形被视为固定的几何参考:它编码了诊断时观察到的对比度组合,并且在随访时不重新训练。因此,纵向评估被表述为相对于该基线几何对后续扫描进行评估。我们不比较解剖分割,而是分析MRI序列向量的分布在基线能量函数下如何演变。在一例后来复发的儿科病例中,随访扫描在出现明确的放射学复发征象之前,就显示出能量的逐步偏离以及序列空间中朝向基线肿瘤相关状态的方向性位移。在一例疾病稳定的病例中,体素分布仍局限于已建立的低能盆地内,没有系统性漂移。所展示的病例作为概念验证表明,患者特定的能量流形可以作为纵向mpMRI分析的几何参考系统,而无需显式分割或监督分类,为进一步研究神经肿瘤学中基于流形的风险组织追踪奠定了基础。

英文摘要

We propose a geometric framework for longitudinal multi-parametric MRI analysis based on patient-specific energy modelling in sequence space. Rather than operating on images with spatial networks, each voxel is represented by its multi-sequence intensity vector ($T1$, $T1c$, $T2$, FLAIR, ADC), and a compact implicit neural representation is trained via denoising score matching to learn an energy function $E_\theta(\mathbf{u})$ over $\mathbb{R}^d$ from a single baseline scan. The learned energy landscape provides a differential-geometric description of tissue regimes without segmentation labels. Local minima define tissue basins, gradient magnitude reflects proximity to regime boundaries, and Laplacian curvature characterises local constraint structure. Importantly, this baseline energy manifold is treated as a fixed geometric reference: it encodes the set of contrast combinations observed at diagnosis and is not retrained at follow-up. Longitudinal assessment is therefore formulated as evaluation of subsequent scans relative to this baseline geometry. Rather than comparing anatomical segmentations, we analyse how the distribution of MRI sequence vectors evolves under the baseline energy function. In a paediatric case with later recurrence, follow-up scans show progressive deviation in energy and directional displacement in sequence space toward the baseline tumour-associated regime before clear radiological reappearance. In a case with stable disease, voxel distributions remain confined to established low-energy basins without systematic drift. The presented cases serve as proof-of-concept that patient-specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for further investigation of manifold-based tissue-at-risk tracking in neuro-oncology.
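For orientation, the generic denoising score matching objective for an energy model can be written as below. This is the textbook single-noise-level form under the convention $p_\theta(\mathbf{u}) \propto \exp(-E_\theta(\mathbf{u}))$, so that the model score is $-\nabla E_\theta$. The noise scale $\sigma$ and the absence of any weighting are assumptions; the paper's exact loss may differ.

```latex
\mathcal{L}(\theta)
= \mathbb{E}_{\mathbf{u} \sim p_{\text{data}},\; \boldsymbol{\varepsilon} \sim \mathcal{N}(0, I)}
\left\| \nabla_{\tilde{\mathbf{u}}} E_\theta(\tilde{\mathbf{u}}) - \frac{\boldsymbol{\varepsilon}}{\sigma} \right\|^2,
\qquad
\tilde{\mathbf{u}} = \mathbf{u} + \sigma \boldsymbol{\varepsilon}.
```

Minimizing this makes the energy gradient at a noised intensity vector point away from the clean one, so the observed baseline intensity vectors end up in low-energy basins. That is consistent with the abstract's reading of local minima as tissue basins and of gradient magnitude as proximity to regime boundaries.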

2604.03501 2026-05-22 cs.HC cs.AI 版本更新

The Augmentation Trap: AI Productivity and the Cost of Cognitive Offloading

增强陷阱:人工智能生产力与认知卸载的成本

Michael Caosun, Sinan Aral

AI总结 本文研究了人工智能工具对工人生产力的影响,发现尽管短期生产力提升,但持续使用会侵蚀工人技能。通过动态模型分析,发现即使预期技能损耗,理性决策者仍可能在短期收益大于长期成本时采用AI,导致稳态损失。此外,当管理者短视或工人技能具有外部价值时,决策者可能陷入增强陷阱,使工人状况恶化。最后,当AI生产力与工人技能关联较弱时,工人技能可能永久分化,经验丰富的工人实现潜力,而经验不足的工人技能降至零。

详情
AI中文摘要

实验证据证实,AI工具提高了工人的生产力,但持续使用也会侵蚀这些收益所依赖的专业技能。我们开发了一个动态模型,其中决策者随时间为工人选择AI使用强度,在即时生产力与工人技能的侵蚀之间进行权衡。我们将工具的生产力效应分解为两个渠道,一个与工人技能无关,另一个随技能水平而变化。模型产生了三个主要结果。第一,即使决策者完全预见到技能侵蚀,当前期生产力收益超过长期技能成本时,其仍会理性地采用AI,从而产生稳态损失:工人的生产力最终低于采用AI之前。第二,当管理者短视或工人技能具有外部价值时,决策者的最优策略会将稳态损失转化为增强陷阱,使工人的处境比从未采用AI时更差。第三,当AI生产力较少依赖工人技能时,工人的技能可能永久分化:经验丰富的工人充分发挥潜力,而经验较少的工人技能退化至零。管理激励上的微小差异就可能决定工人走上哪条路径。该生产力分解将部署划分为五种情形,区分有益采用与有害采用,并识别哪些部署容易落入陷阱。

英文摘要

Experimental evidence confirms that AI tools raise worker productivity, but also that sustained use can erode the expertise on which those gains depend. We develop a dynamic model in which a decision-maker chooses AI usage intensity for a worker over time, trading immediate productivity against the erosion of worker skill. We decompose the tool's productivity effect into two channels, one independent of worker expertise and one that scales with it. The model produces three main results. First, even a decision-maker who fully anticipates skill erosion rationally adopts AI when front-loaded productivity gains outweigh long-run skill costs, producing steady-state loss: the worker ends up less productive than before adoption. Second, when managers are short-termist or worker skill has external value, the decision-maker's optimal policy turns steady-state loss into the augmentation trap, leaving the worker worse off than if AI had never been adopted. Third, when AI productivity depends less on worker expertise, workers can permanently diverge in skill: experienced workers realize their full potential while less experienced workers deskill to zero. Small differences in managerial incentives can determine which path a worker takes. The productivity decomposition classifies deployments into five regimes that separate beneficial adoption from harmful adoption and identifies which deployments are vulnerable to the trap.

2604.02889 2026-05-22 stat.ML cs.AI cs.LG 版本更新

Rethinking Forward Processes for Score-Based Nonlinear Data Assimilation in High Dimensions

重新思考高维基于分数的非线性数据同化中的前向过程

Eunbi Yoon, Won Chang, Donghan Kim, Dae Wook Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出了一种针对数据同化问题的改进前向过程,用于高维非线性系统的状态估计,通过改进的分数基滤波器在测量空间中转换系统状态,提高了同化性能。

详情
AI中文摘要

数据同化是通过结合模型预测和测量来估计动态系统随时间变化的状态的过程。当系统是非线性且高维时,这一任务变得具有挑战性。为了解决这个问题,最近出现了基于分数的贝叶斯滤波器。然而,这些方法在某些情况下表现仍不理想,特别是在空间稀疏测量下。这种性能下降源于对似然分数的启发式近似,其误差会随时间累积。之所以存在这一局限,是因为这些方法直接沿用了生成建模中的经典前向过程,该过程将数据分布变换为高斯分布,与测量方程无关。在这里,我们提出了一种为滤波量身定制的前向过程,将系统状态变换到测量空间,从而得到理论上严谨的似然分数表达。基于此,我们开发了测量感知的基于分数的滤波器(MASF)。我们在Kolmogorov流上评估了MASF,这是一个维度高达$\mathcal{O}(10^5)$的高维流体基准,评估涵盖多种测量算子,包括状态与测量维度不匹配的非线性情形。MASF的性能优于现有的基于分数的滤波器和集合型卡尔曼滤波器。值得注意的是,在使用摊销预训练时,MASF相比基线实现了高达$28.2\times$的挂钟时间加速。我们的实现可在 https://github.com/tcnllab-oss/masf 获得。

英文摘要

Data assimilation is the process of estimating the state of a dynamical system over time by combining model predictions with measurements. This task becomes challenging when the system is nonlinear and high-dimensional. To address this, score-based Bayesian filters have recently emerged. However, these methods still show unsatisfactory performance in certain cases, particularly under spatially sparse measurements. Such degradation stems from heuristic approximations of the likelihood score, whose errors can accumulate over time. This limitation arises because the methods simply adopt a classical forward process for generative modeling that transforms a data distribution toward a Gaussian distribution, which is independent of the measurement equation. Here, we propose a forward process tailored for filtering that transforms the system state toward the measurement space, enabling a theoretically sound formulation of the likelihood score. Based on this, we develop the Measurement-Aware Score-based Filter (MASF). We evaluate MASF on Kolmogorov flow, a high-dimensional fluid benchmark with up to $\mathcal{O}(10^5)$ dimensions, under diverse measurement operators, including nonlinear cases with a dimensional mismatch between the state and the measurements. MASF shows improved performance over existing score-based filters and ensemble-type Kalman filters. Notably, MASF achieves up to a $28.2\times$ wall-clock speedup compared with the baselines when using amortized pretraining. Our implementation is available at \texttt{https://github.com/tcnllab-oss/masf}.
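
As background for the opening definition, the sketch below shows the simplest instance of "combining model predictions with measurements": one predict/update cycle of a scalar linear-Gaussian Kalman filter. It is not MASF and not a score-based filter; the function names and the scalar dynamics coefficient `a` are illustrative assumptions.

```python
def kalman_predict(mean, var, a, process_var):
    """Propagate a Gaussian state estimate through x' = a * x + process noise."""
    return a * mean, a * a * var + process_var


def kalman_update(prior_mean, prior_var, measurement, meas_var):
    """Fuse the prediction with a direct noisy measurement z = x + noise."""
    gain = prior_var / (prior_var + meas_var)  # trust placed in the measurement
    post_mean = prior_mean + gain * (measurement - prior_mean)
    post_var = (1.0 - gain) * prior_var        # uncertainty shrinks after fusion
    return post_mean, post_var


# One assimilation cycle: predict with the model, then correct with data.
mean, var = kalman_predict(mean=0.0, var=0.5, a=1.0, process_var=0.5)
mean, var = kalman_update(mean, var, measurement=2.0, meas_var=1.0)
```

With equal prior and measurement variances the gain is one half, so the posterior mean lands midway between prediction and measurement. The nonlinear, high-dimensional setting in the abstract is where this closed form stops applying.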

2603.27355 2026-05-22 cs.AI cs.CL cs.SE 版本更新

LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications

LLM Readiness Harness: 评估、可观测性和持续集成门禁用于LLM/RAG应用

Alexandre Cristovão Maiorano

发表机构 * Lumytics

AI总结 本文提出了一种LLM和RAG应用的准备性框架,通过自动化基准测试、OpenTelemetry可观测性和持续集成质量门禁,将评估转化为部署决策流程,并通过帕累托前沿计算场景加权的准备度分数,展示了在票务路由工作流和BEIR接地任务上的评估结果。

Comments 19 pages, 4 figures, 15 tables

详情
AI中文摘要

我们提出了一种用于LLM和RAG应用的准备性框架(readiness harness),将评估转化为部署决策流程。该系统在最小化的API契约下结合了自动化基准测试、OpenTelemetry可观测性和持续集成质量门禁,并将工作流成功率、策略合规性、事实依据性(groundedness)、检索命中率、成本和p95延迟聚合为按场景加权的准备度分数,并给出帕累托前沿。我们在票务路由工作流和BEIR事实依据任务(SciFact和FiQA)上评估了该框架,实现了完整的Azure矩阵覆盖(跨数据集、场景、检索深度、随机种子和模型共162/162个有效单元)。结果表明,准备度不是单一指标:在FiQA上,sla-first场景、k=5时,gpt-4.1-mini在准备度和忠实度上领先,而gpt-5.2付出了显著的延迟代价;在SciFact上,各模型质量更接近,但在运行层面仍可区分。票务路由回归门禁始终拒绝不安全的提示变体,表明该框架能够阻止有风险的发布,而不仅仅是报告离线分数。最终得到一个可复现、立足于实际运行的框架,用于判断LLM或RAG系统是否已可发布。

英文摘要

We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) with full Azure matrix coverage (162/162 valid cells across datasets, scenarios, retrieval depths, seeds, and models). Results show that readiness is not a single metric: on FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness, while gpt-5.2 pays a substantial latency cost; on SciFact, models are closer in quality but still separable operationally. Ticket-routing regression gates consistently reject unsafe prompt variants, demonstrating that the harness can block risky releases instead of merely reporting offline scores. The result is a reproducible, operationally grounded framework for deciding whether an LLM or RAG system is ready to ship.
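
The aggregation step described above can be sketched as follows. This is a minimal illustration, not the harness's actual scoring code: the three-metric `Candidate` record, the budget-normalized scores, the weight values, and the candidate names are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float        # higher is better, in [0, 1]
    p95_latency_s: float  # lower is better
    cost_usd: float       # lower is better


def readiness(c, weights, latency_budget_s, cost_budget_usd):
    """Scenario-weighted average of quality and budget-normalized scores."""
    latency_score = max(0.0, 1.0 - c.p95_latency_s / latency_budget_s)
    cost_score = max(0.0, 1.0 - c.cost_usd / cost_budget_usd)
    total = sum(weights.values())
    return (weights["quality"] * c.quality
            + weights["latency"] * latency_score
            + weights["cost"] * cost_score) / total


def dominates(a, b):
    """True if a is no worse than b on every axis and better on at least one."""
    no_worse = (a.quality >= b.quality and a.p95_latency_s <= b.p95_latency_s
                and a.cost_usd <= b.cost_usd)
    better = (a.quality > b.quality or a.p95_latency_s < b.p95_latency_s
              or a.cost_usd < b.cost_usd)
    return no_worse and better


def pareto_frontier(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates)]


# Hypothetical candidates and a latency-heavy ("SLA-first"-style) weighting.
candidates = [
    Candidate("model_a", quality=0.80, p95_latency_s=1.0, cost_usd=0.01),
    Candidate("model_b", quality=0.85, p95_latency_s=4.0, cost_usd=0.05),
    Candidate("model_c", quality=0.75, p95_latency_s=2.0, cost_usd=0.02),
]
sla_first = {"quality": 1.0, "latency": 3.0, "cost": 1.0}
frontier = pareto_frontier(candidates)
scores = {c.name: readiness(c, sla_first, 5.0, 0.10) for c in frontier}
```

Keeping the frontier separate from the weighted score reflects the abstract's point that readiness is not a single metric: the weights pick a winner for one scenario, while the frontier shows which trade-offs remain defensible under other weightings.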

2603.16672 2026-05-22 cs.AI cs.CL cs.CY 版本更新

CritiSense: Critical Digital Literacy and Resilience Against Misinformation

CritiSense:批判性数字素养与抵御虚假信息的韧性

Firoj Alam, Fatema Ahmad, Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Elisa Sartori, Giovanni Da San Martino, Abul Hasnat, Raian Ali

发表机构 * Qatar Computing Research Institute(卡塔尔计算研究所) University of Padova(帕多瓦大学) Hamad Bin Khalifa University(哈马德·本·哈利法大学)

AI总结 本研究提出CritiSense,一个多语言、模块化的移动媒体素养应用,通过简短的互动挑战提升用户识别操纵手法的能力,并为衡量微学习对抵御虚假信息韧性的影响提供多语言预先辟谣(prebunking)平台和测试环境。

Comments resilience, disinformation, misinformation, fake news, propaganda

详情
AI中文摘要

社交媒体上的虚假信息破坏了知情决策和公众信任。预先辟谣(prebunking)通过帮助用户在现实中遇到操纵手法之前学会识别它们,提供了一种主动的补充手段。我们介绍了CritiSense,一个移动媒体素养应用,通过简短的互动挑战和即时反馈来培养这些技能。它是首个多语言(支持九种语言)且模块化的平台,其设计便于跨主题和领域快速更新。我们报告了一项有93名用户参与的可用性研究:83.9%的用户表示总体满意,90.1%的用户认为该应用易于使用。定性反馈表明,CritiSense有助于提高数字素养技能。总体而言,它提供了一个多语言预先辟谣平台和一个测试环境,用于衡量微学习对抵御虚假信息韧性的影响。在六个月中,我们已拥有超过500名活跃用户。它在Apple App Store(https://apps.apple.com/us/app/critisense/id6749675792)和Google Play Store(https://play.google.com/store/apps/details?id=com.critisense&hl=en)上免费向所有用户提供。

英文摘要

Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 6 months, we have reached 500+ active users. It is freely available to all users on the Apple App Store (https://apps.apple.com/us/app/critisense/id6749675792) and Google Play Store (https://play.google.com/store/apps/details?id=com.critisense&hl=en).

2603.15676 2026-05-22 cs.SE cs.AI 版本更新

Automated Self-Testing as a Quality Gate: Evidence-Driven Release Management for LLM Applications

自动化自我测试作为质量门:基于证据的LLM应用发布管理

Alexandre Cristovão Maiorano

发表机构 * Lumytics

AI总结 本文提出了一种自动化自我测试框架,在五个有实证依据的维度(任务成功率、研究上下文保持、P95延迟、安全通过率和证据覆盖率)上通过质量门做出基于证据的发布决策(PROMOTE/HOLD/ROLLBACK),并通过纵向案例研究评估了该框架在多智能体对话式AI系统中的有效性。

Comments 20 pages, 6 figures, 12 tables

详情
AI中文摘要

LLM应用是一类AI系统,其非确定性输出和不断变化的模型行为使得传统测试不足以支撑发布治理。我们提出了一种自动化自我测试框架,引入质量门,在五个有实证依据的维度上做出基于证据的发布决策(PROMOTE/HOLD/ROLLBACK):任务成功率、研究上下文保持、P95延迟、安全通过率和证据覆盖率。我们通过一项纵向案例研究评估该框架,研究对象是一个内部部署、具备特定营销能力且处于活跃开发中的多智能体对话式AI系统,涵盖20多个内部版本的38次评估运行。质量门在早期运行中识别出两个ROLLBACK级别的构建,并在为期四周的预发布(staging)周期中支持了稳定的质量演进,同时覆盖了基于人设的、多轮的、对抗性的和要求提供证据的场景。统计分析(Mann-Kendall趋势、Spearman相关、bootstrap置信区间)、质量门消融和开销扩展性分析表明,证据覆盖率是判别严重回归的主要指标,且运行时间随测试套件规模可预测地增长。一项人工校准研究(n=60个分层抽样案例,两名独立评估者,LLM-as-judge交叉验证)揭示了互补的多模态覆盖:LLM评判与系统质量门之间的分歧(kappa=0.13)可归因于结构性失败模式(延迟违规和路由错误),这些仅从响应文本中无法看出;而LLM评判则独立地发现了结构性检查遗漏的内容质量问题,这与多维质量门的设计相一致。我们提供了该框架、补充伪代码和校准工件,以支持AI系统质量保证和独立复现。

英文摘要

LLM applications are AI systems whose nondeterministic outputs and evolving model behavior make traditional testing insufficient for release governance. We present an automated self-testing framework that introduces quality gates with evidence-based release decisions (PROMOTE/HOLD/ROLLBACK) across five empirically grounded dimensions: task success rate, research context preservation, P95 latency, safety pass rate, and evidence coverage. We evaluate the framework through a longitudinal case study of an internally deployed multi-agent conversational AI system with specific marketing capabilities in active development, covering 38 evaluation runs across 20+ internal releases. The gate identified two ROLLBACK-grade builds in early runs and supported stable quality evolution over a four-week staging lifecycle while exercising persona-grounded, multi-turn, adversarial, and evidence-required scenarios. Statistical analysis (Mann-Kendall trends, Spearman correlations, bootstrap confidence intervals), gate ablation, and overhead scaling indicate that evidence coverage is the primary severe-regression discriminator and that runtime scales predictably with suite size. A human calibration study (n=60 stratified cases, two independent evaluators, LLM-as-judge cross-validation) reveals complementary multi-modal coverage: LLM-judge disagreements with the system gate (kappa=0.13) are attributable to structural failure modes - latency violations and routing errors - invisible in response text alone, while the judge independently surfaces content quality failures missed by structural checks, consistent with a multi-dimensional gate design. The framework, supplementary pseudocode, and calibration artifacts are provided to support AI-system quality assurance and independent replication.
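
A three-way gate over the five dimensions named above might look like the sketch below. The threshold values, metric keys, and the rule "any hard-floor breach means ROLLBACK, any soft miss means HOLD" are assumptions chosen for illustration; the abstract does not give the paper's actual thresholds or decision logic.

```python
from enum import Enum


class Decision(Enum):
    PROMOTE = "PROMOTE"
    HOLD = "HOLD"
    ROLLBACK = "ROLLBACK"


# Hypothetical thresholds: (hard floor, promote floor); higher is better.
RATE_THRESHOLDS = {
    "task_success_rate": (0.80, 0.90),
    "context_preservation": (0.70, 0.85),
    "safety_pass_rate": (0.95, 0.99),
    "evidence_coverage": (0.60, 0.80),
}
# Hypothetical latency limits in seconds: (hard ceiling, promote ceiling).
P95_LATENCY_LIMITS = (8.0, 5.0)


def gate(metrics):
    """Map one evaluation run's metrics to a release decision."""
    decision = Decision.PROMOTE
    for name, (hard_floor, promote_floor) in RATE_THRESHOLDS.items():
        value = metrics[name]
        if value < hard_floor:
            return Decision.ROLLBACK   # severe regression on any dimension
        if value < promote_floor:
            decision = Decision.HOLD   # acceptable, but not release quality
    hard_ceiling, promote_ceiling = P95_LATENCY_LIMITS
    if metrics["p95_latency_s"] > hard_ceiling:
        return Decision.ROLLBACK
    if metrics["p95_latency_s"] > promote_ceiling:
        decision = Decision.HOLD
    return decision


healthy_run = {
    "task_success_rate": 0.93, "context_preservation": 0.90,
    "safety_pass_rate": 0.995, "evidence_coverage": 0.85,
    "p95_latency_s": 3.2,
}
```

Evaluating latency alongside content metrics mirrors the calibration finding in the abstract: some failures (latency violations, routing errors) are invisible in the response text, so a text-only judge cannot replace a structural check.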

2603.02938 2026-05-22 cs.LG cs.AI 版本更新

Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models

超越一刀切:基于大语言模型的零样本图学习中的自适应子图去噪

Fengzhi Li, Liang Zhang, Yuan Zuo, Ruiqing Zhao, YanSong Liu, Yunfei Ma, Fanyu Meng, Junlan Feng

发表机构 * JIUTIAN Research(JIUTIAN研究) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) MIIT Key Laboratory of Data and Decision Intelligence(工业和信息化部数据与决策智能重点实验室) Beihang University(北京航空航天大学)

AI总结 本文提出GraphSSR框架,通过自适应子图提取和去噪方法,解决传统图神经网络在零样本学习中泛化能力不足的问题,提升大语言模型在图推理任务中的表现。

详情
AI中文摘要

图基任务在零样本设置中仍面临显著挑战,由于数据稀缺性和传统图神经网络(GNNs)无法泛化到未见领域或标签空间。尽管最近的进展转向利用大语言模型(LLMs)作为预测器来增强GNNs,但这些方法常面临跨模态对齐问题。最近的范式(即Graph-R1)通过采用纯文本格式和基于LLM的图推理克服了上述架构依赖性,显示出改进的零样本泛化能力。然而,它使用一种任务无关的“一刀切”子图提取策略,不可避免地引入了显著的结构噪声——无关邻居和边——这会扭曲LLMs的感知范围并导致次优预测。为了解决这一限制,我们引入GraphSSR,一种新的框架,用于零样本LLM图推理中的自适应子图提取和去噪。具体而言,我们提出了SSR流水线,通过“采样-选择-推理”过程动态定制子图提取以适应特定上下文,使模型能够自主过滤掉任务无关的邻居并克服“一刀切”问题。为了内化这一能力,我们开发了SSR-SFT,一种数据合成策略,生成高质量的SSR风格图推理轨迹用于LLM的监督微调。此外,我们提出了SSR-RL,一种两阶段强化学习框架,该框架专门设计用于自适应子图去噪,明确调节所提出SSR流水线中的采样和选择操作。通过结合真实性增强和去噪增强的强化学习,我们引导模型使用简洁的、去噪的子图进行推理以实现准确预测。

英文摘要

Graph-based tasks in the zero-shot setting remain a significant challenge due to data scarcity and the inability of traditional Graph Neural Networks (GNNs) to generalize to unseen domains or label spaces. While recent advancements have transitioned toward leveraging Large Language Models (LLMs) as predictors to enhance GNNs, these methods often suffer from cross-modal alignment issues. A recent paradigm (i.e., Graph-R1) overcomes the aforementioned architectural dependencies by adopting a purely text-based format and utilizing LLM-based graph reasoning, showing improved zero-shot generalization. However, it employs a task-agnostic, one-size-fits-all subgraph extraction strategy, which inevitably introduces significant structural noise--irrelevant neighbors and edges--that distorts the LLMs' receptive field and leads to suboptimal predictions. To address this limitation, we introduce GraphSSR, a novel framework designed for adaptive subgraph extraction and denoising in zero-shot LLM-based graph reasoning. Specifically, we propose the SSR pipeline, which dynamically tailors subgraph extraction to specific contexts through a "Sample-Select-Reason" process, enabling the model to autonomously filter out task-irrelevant neighbors and overcome the one-size-fits-all issue. To internalize this capability, we develop SSR-SFT, a data synthesis strategy that generates high-quality SSR-style graph reasoning traces for supervised fine-tuning of LLMs. Furthermore, we propose SSR-RL, a two-stage reinforcement learning framework that explicitly regulates sampling and selection operations within the proposed SSR pipeline designed for adaptive subgraph denoising. By incorporating Authenticity-Reinforced and Denoising-Reinforced RL, we guide the model to achieve accurate predictions using parsimonious, denoised subgraphs for reasoning.

2602.17973 2026-05-22 cs.CR cs.AI 版本更新

PenTiDef: Decentralized Federated Intrusion Detection System with Differential Privacy and Latent-Space Defense via Blockchain Coordination in IIoT

PenTiDef:通过区块链协调在工业物联网中的去中心化联邦入侵检测系统,结合差分隐私和潜在空间防御

Phan The Duy, Nghi Hoang Khoa, Nguyen Tran Anh Quan, Luong Ha Tien, Ngo Duc Hoang Son, Van-Hau Pham

发表机构 * Information Security Lab(信息安全实验室) University of Information Technology(信息技术大学) Vietnam National University(越南国家大学) VNU-HCM Information Security Center(VNU-HCM信息安全中心)

AI总结 本文提出PenTiDef,一种完全去中心化、隐私保护且抗中毒的联邦入侵检测系统(DFL-IDS)。该系统整合了三个关键组件:(i)客户端侧的分布式差分隐私(DDP)通过随机高斯噪声保护梯度泄露;(ii)一个轻量级的潜在空间防御模块,通过自动编码器提取并压缩倒数第二层表示(PLRs)为稳定的潜在语义表示(LSRs),随后通过中心核对齐(CKA)和K-均值聚类进行鲁棒的恶意更新检测,无需辅助数据集;(iii)一个许可型区块链层,通过智能合约协调链上验证、安全FedAvg聚合和不可变审计性,消除任何中心服务器。在CIC-IDS2018和Edge-IIoTSet上进行的大量实验表明,在独立同分布(IID)和现实非独立同分布(non-IID)设置下,即使对抗比例高达40%,PenTiDef在检测准确率和F1分数上均优于最先进的基线(FLARE和FedCC),同时保持较低的训练开销。通过在统一的安全聚合协议中共同解决隐私、鲁棒性和去中心化问题,PenTiDef为异构、对抗性的工业物联网环境中的可信协作入侵检测提供了实用且可扩展的解决方案。

Comments version 2, change title of the paper

详情
英文摘要

This paper proposes PenTiDef, a fully decentralized, privacy-preserving, and poisoning-resilient framework for decentralized federated IDS (DFL-IDS). PenTiDef synergistically integrates three key components: (i) client-side Distributed Differential Privacy (DDP) with stochastic Gaussian noise to protect gradient leakage, (ii) a lightweight latent-space defense module that extracts and compresses penultimate-layer representations (PLRs) into stable Latent Semantic Representations (LSRs) via AutoEncoder, followed by Centered Kernel Alignment (CKA) and K-Means clustering for robust malicious update detection without auxiliary datasets, and (iii) a permissioned blockchain layer with smart contracts that orchestrates on-chain validation, secure FedAvg aggregation, and immutable auditability, eliminating any central server. Extensive experiments on CIC-IDS2018 and Edge-IIoTSet under both IID and realistic non-IID settings, with adversary ratios up to 40\%, demonstrate that PenTiDef consistently outperforms state-of-the-art baselines (FLARE and FedCC) in detection accuracy and F1-score while maintaining lower training overhead. By jointly addressing privacy, robustness, and decentralization in a unified secure aggregation protocol, PenTiDef provides a practical and scalable solution for trustworthy collaborative intrusion detection in heterogeneous, adversarial IIoT environments.
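
Centered Kernel Alignment, one ingredient of the latent-space defense, can be computed in its linear form as sketched below. This is the generic linear CKA formula in plain Python, not PenTiDef's implementation; how the paper feeds Latent Semantic Representations into CKA and K-Means is not specified in the abstract, and the function names are illustrative.

```python
import math


def _center_columns(rows):
    """Subtract each feature's mean so that every column sums to zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def _cross_fro_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same row count."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(ra[i] * rb[j] for ra, rb in zip(a, b))
            total += s * s
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), centered."""
    xc, yc = _center_columns(x), _center_columns(y)
    denom = math.sqrt(_cross_fro_sq(xc, xc) * _cross_fro_sq(yc, yc))
    return _cross_fro_sq(xc, yc) / denom


# Rows are samples, columns are representation features.
reps = [[1.0, 2.0], [3.0, 5.0], [4.0, 1.0], [0.0, 2.0]]
rotated = [[-r[1], r[0]] for r in reps]  # same geometry, rotated by 90 degrees
similarity = linear_cka(reps, rotated)
```

CKA is invariant to rotations and isotropic scaling of the representation, which is why it is a natural similarity measure for comparing client updates whose latent coordinates need not be aligned.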

2602.17385 2026-05-22 cs.AI 版本更新

Dataless Weight Disentanglement in Task Arithmetic via Kronecker-Factored Approximate Curvature

通过克罗内克-因子近似曲率进行任务算术中的无数据权重解耦

Angelo Porrello, Pietro Buzzega, Felix Dangel, Thomas Sommariva, Riccardo Salami, Lorenzo Bonicelli, Simone Calderara

发表机构 * University of Modena and Reggio Emilia(摩德纳-雷焦艾米利亚大学) Vector Institute(向量研究所)

AI总结 本文提出了一种无数据的方法,将针对表示漂移的正则化表述为曲率矩阵近似问题,以缓解任务算术中任务向量之间的跨任务干扰,在任务加法和任务否定上取得了最先进的结果。

Comments Accepted to ICLR 2026

详情
AI中文摘要

任务算术提供了一种模块化且可扩展的方式来适配基础模型。然而,组合多个任务向量可能导致跨任务干扰,造成表示漂移和性能下降。针对表示漂移的正则化为解耦任务向量提供了一种自然的补救手段;然而,现有方法通常需要外部任务数据,这与模块化和数据可用性约束(例如隐私要求)相冲突。我们提出了一种无数据的方法,将针对表示漂移的正则化表述为曲率矩阵近似问题。这使我们能够利用成熟的技术;特别是,我们采用克罗内克因子近似曲率(Kronecker-Factored Approximate Curvature),得到一个实用的正则项,在任务加法和任务否定上取得了最先进的结果。我们的方法的复杂度不随任务数量增长(为常数),并提升了对任务向量重新缩放的鲁棒性,从而无需在留出数据上调参。

英文摘要

Task Arithmetic yields a modular, scalable way to adapt foundation models. Combining multiple task vectors, however, can lead to cross-task interference, causing representation drift and degraded performance. Representation drift regularization provides a natural remedy to disentangle task vectors; however, existing approaches typically require external task data, conflicting with modularity and data availability constraints (e.g., privacy requirements). We propose a dataless approach by framing regularization against representation drift as a curvature matrix approximation problem. This allows us to leverage well-established techniques; in particular, we adopt Kronecker-Factored Approximate Curvature and obtain a practical regularizer that achieves state-of-the-art results in task addition and negation. Our method has constant complexity in the number of tasks and promotes robustness to task vector rescaling, eliminating the need for held-out tuning.
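
For context, plain task arithmetic (the setting this paper regularizes) can be sketched in a few lines. The code below shows only task-vector addition and negation on toy weights; it does not implement the paper's curvature-based regularizer, and the parameter name `w` and the coefficient values are illustrative assumptions.

```python
def task_vector(base, finetuned):
    """Task vector = fine-tuned weights minus pre-trained weights, per tensor."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_task_vectors(base, vectors, coeffs):
    """Return base + sum_i coeff_i * vector_i without mutating the inputs."""
    merged = {k: list(v) for k, v in base.items()}
    for tv, alpha in zip(vectors, coeffs):
        for k in merged:
            merged[k] = [m + alpha * d for m, d in zip(merged[k], tv[k])]
    return merged


# Toy "models" with a single flattened parameter tensor.
base = {"w": [1.0, 2.0]}
tuned_a = {"w": [1.5, 2.0]}
tuned_b = {"w": [1.0, 3.0]}
tv_a, tv_b = task_vector(base, tuned_a), task_vector(base, tuned_b)

multi_task = apply_task_vectors(base, [tv_a, tv_b], [1.0, 1.0])  # addition
forget_a = apply_task_vectors(base, [tv_a], [-1.0])              # negation
```

In this toy case the two task vectors touch different coordinates, so they do not interfere. The cross-task interference discussed in the abstract arises when task vectors overlap, and the merged model's behaviour then depends on the scaling coefficients, which is the sensitivity the paper aims to reduce.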

2602.13372 2026-05-22 cs.AI cs.LG 版本更新

MoralityGym: A Benchmark for Evaluating Hierarchical Moral Alignment in Sequential Decision-Making Agents

MoralityGym:用于评估序列决策代理中分层道德对齐的基准

Simon Rosen, Siddarth Singh, Ebenezer Gelo, Helen Sarah Robertson, Ibrahim Suder, Victoria Williams, Benjamin Rosman, Geraud Nangue Tasse, Steven James

发表机构 * University of the Witwatersrand(威特沃特斯兰德大学)

AI总结 本文提出Morality Chains形式化方法,将道德规范表示为有序的规范约束,并据此构建包含98个伦理困境问题的MoralityGym基准,用于评估序列决策智能体的分层道德对齐;基于安全强化学习方法的基线结果揭示了现有方法的关键局限。

Comments Accepted at AAMAS 2026

Journal ref Proc of the 25th International Conference on Autonomous Agents and Multiagent Systems AAMAS 2026, Paphos, Cyprus, May 25 to 29, 2026, IFAAMAS

详情
AI中文摘要

评估在面对冲突且分层结构的人类规范时,代理的道德对齐是一个在人工智能安全、道德哲学和认知科学交汇处的关键挑战。我们引入了Morality Chains,一种新的形式化方法,用于将道德规范表示为有序的规范约束,并引入了MoralityGym,一个包含98个伦理困境问题的基准,这些问题是作为电车困境风格的Gymnasium环境呈现的。通过将任务解决与道德评估解耦,并引入新的道德度量标准,MoralityGym允许将心理学和哲学的见解整合到规范敏感推理的评估中。基于安全强化学习方法的基准结果揭示了关键限制,强调了需要更系统的方法来处理伦理决策。本文为开发在复杂现实环境中行为更可靠、透明和道德的AI系统提供了基础。

英文摘要

Evaluating moral alignment in agents navigating conflicting, hierarchically structured human norms is a critical challenge at the intersection of AI safety, moral philosophy, and cognitive science. We introduce Morality Chains, a novel formalism for representing moral norms as ordered deontic constraints, and MoralityGym, a benchmark of 98 ethical-dilemma problems presented as trolley-dilemma-style Gymnasium environments. By decoupling task-solving from moral evaluation and introducing a novel Morality Metric, MoralityGym allows the integration of insights from psychology and philosophy into the evaluation of norm-sensitive reasoning. Baseline results with Safe RL methods reveal key limitations, underscoring the need for more principled approaches to ethical decision-making. This work provides a foundation for developing AI systems that behave more reliably, transparently, and ethically in complex real-world contexts.
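
One way to read "ordered deontic constraints" is as a lexicographic preference: a violation of a higher-priority norm outweighs any number of violations of lower-priority norms. The sketch below illustrates that reading only. The norm names are hypothetical, and this is not the paper's Morality Chains formalism or its Morality Metric, whose definitions the abstract does not give.

```python
# Hypothetical norms, listed from highest to lowest priority.
NORM_PRIORITY = ["do_not_harm_humans", "do_not_damage_property", "finish_quickly"]


def violation_profile(violations):
    """Violation counts ordered by norm priority (missing norms count as 0)."""
    return tuple(violations.get(norm, 0) for norm in NORM_PRIORITY)


def more_aligned(a, b):
    """True if episode a is strictly preferable to episode b lexicographically."""
    return violation_profile(a) < violation_profile(b)


# Episode A damages property three times; episode B harms one human.
episode_a = {"do_not_damage_property": 3}
episode_b = {"do_not_harm_humans": 1}
prefer_a = more_aligned(episode_a, episode_b)
```

Python compares tuples element by element from the left, so the priority order of the list directly encodes the hierarchy. Keeping the violation profile separate from task reward matches the abstract's idea of decoupling task-solving from moral evaluation.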

2602.12952 2026-05-22 cs.LG cs.AI cs.CV 版本更新

Transporting Task Vectors across Different Architectures without Training

在不同架构间传输任务向量而无需训练

Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara

发表机构 * AImageLab, University of Modena and Reggio Emilia(AImageLab,摩德纳-雷焦艾米利亚大学)

AI总结 本文提出Theseus方法,通过功能匹配在不同宽度模型间传输任务更新,无需训练或反向传播,展示了在视觉和语言模型上的改进效果。

Comments Accepted at the International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

适应大型预训练模型以完成下游任务时,通常会产生针对特定任务的参数更新,这些更新对于每个模型变体重新学习都很昂贵。尽管最近的研究表明,这些更新可以在具有相同架构的模型之间转移,但跨不同宽度的模型转移仍鲜有探索。在本文中,我们引入Theseus,一种无需训练的方法,用于在异构宽度模型间传输任务更新。与其匹配参数,我们通过其在中间表示上诱导的功能效应来表征任务更新。我们正式将任务向量传输定义为在观察到的激活上进行的功能匹配问题,并显示在通过正交Procrustes分析对齐表示空间后,它允许一个稳定的闭式解,该解保留了更新的几何结构。我们在不同宽度的视觉和语言模型上评估Theseus,显示在不进行额外训练或反向传播的情况下,相对于基线有持续的改进。我们的结果表明,当任务身份通过功能而非参数定义时,任务更新可以有意义地在不同架构间转移。代码可在https://github.com/apanariello4/merge-and-rebase获取。

英文摘要

Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains unexplored. In this work, we introduce Theseus, a training-free method for transporting task updates across heterogeneous-width models. Rather than matching parameters, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically. Code is available at https://github.com/apanariello4/merge-and-rebase.

2602.10894 2026-05-22 cs.LG cs.AI 版本更新

Revisiting Regularized Policy Optimization for Stable and Efficient Reinforcement Learning in Two-Player Games

重新审视正则化策略优化以实现稳定且高效的双人博弈强化学习

Kazuki Ota, Takayuki Osa, Motoki Omura, Tatsuya Harada

发表机构 * The University of Tokyo, Japan(东京大学) RIKEN Center for Advanced Intelligence Project, Japan(日本RIKEN高级智能项目中心)

AI总结 本文重新审视了带有反向Kullback-Leibler正则化和熵正则化的策略优化方法,从理论和实证两个角度分析该组合在双人零和设置中的表现:给出新的收敛保证并通过合成博弈上的数值实验加以验证,进而推导出一种实用的无模型强化学习算法,并在五种棋盘游戏上的全面实验中验证了其训练效率。

Comments Accepted at ICML 2026

详情
AI中文摘要

棋盘游戏等双人博弈长期以来一直被用作强化学习的传统基准。本工作重新审视了一种带有反向Kullback-Leibler正则化和熵正则化的策略优化方法,并从理论和实证两个角度分析这一组合在双人零和设置中的表现。在理论方面,我们在两种理论设置下研究了策略更新规则的稳定性:博弈论中的标准式博弈和有限长度博弈。我们给出了新的收敛保证,并通过合成博弈上的数值实验验证了理论结果。在实证方面,我们基于该正则化策略优化推导出一种实用的无模型强化学习算法。我们通过在五种棋盘游戏(Animal Shogi、Gardner Chess、Go、Hex和Othello)上的全面实验验证了算法的训练效率。实验结果表明,我们的智能体在各个环境中的学习效率均高于现有方法。

英文摘要

Two-player games such as board games have long been used as traditional benchmarks for reinforcement learning. This work revisits a policy optimization method with reverse Kullback-Leibler regularization and entropy regularization and analyzes this combination in two-player zero-sum settings from theoretical and empirical perspectives. From a theoretical perspective, we investigate the stability of the policy update rule in two theoretical settings: game-theoretic normal-form games and finite-length games. We provide novel convergence guarantees and verify our theoretical results through numerical experiments on synthetic games. From an empirical perspective, we derive a practical model-free reinforcement learning algorithm based on the regularized policy optimization. We validate the training efficiency of our algorithm through comprehensive experiments on five board games: Animal Shogi, Gardner Chess, Go, Hex, and Othello. Experimental results show that our agent learns more efficiently than existing methods across environments.
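
To make the combination of the two regularizers concrete, the sketch below uses the closed-form maximizer of $\langle p, q\rangle - \tfrac{1}{\eta}\,\mathrm{KL}(p \,\|\, \pi_t) + \tau H(p)$, which is $p \propto \pi_t^{1/(1+\eta\tau)} \exp\!\big(\eta q/(1+\eta\tau)\big)$, and runs it in self-play on matching pennies. The exact update rule analysed in the paper may differ; the step size `eta`, entropy weight `tau`, and the choice of game are assumptions made for this illustration.

```python
import math


def regularized_update(policy, q_values, eta, tau):
    """Closed-form step for a KL-to-previous-policy plus entropy objective."""
    scale = 1.0 / (1.0 + eta * tau)
    logits = [scale * (math.log(p) + eta * q) for p, q in zip(policy, q_values)]
    top = max(logits)                      # subtract max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    norm = sum(weights)
    return [w / norm for w in weights]


# Matching pennies: the row player receives PAYOFF[i][j], the column player its negative.
PAYOFF = [[1.0, -1.0], [-1.0, 1.0]]


def self_play(x, y, eta=0.1, tau=2.0, steps=200):
    """Simultaneous regularized updates for both players of the zero-sum game."""
    for _ in range(steps):
        qx = [sum(PAYOFF[i][j] * y[j] for j in range(2)) for i in range(2)]
        qy = [-sum(PAYOFF[i][j] * x[i] for i in range(2)) for j in range(2)]
        x, y = (regularized_update(x, qx, eta, tau),
                regularized_update(y, qy, eta, tau))
    return x, y


final_x, final_y = self_play([0.9, 0.1], [0.2, 0.8])
```

With these settings both policies contract toward the uniform strategy, the entropy-regularized equilibrium of matching pennies. The KL term damps each step relative to the previous policy, while the entropy term supplies the pull that prevents the cycling seen with unregularized updates in this game.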

2602.10085 2026-05-22 cs.AI 版本更新

CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

CODE-SHARP:以层次化奖励程序的形式持续、开放式地发现与演化技能

Richard Bornemann, Pierluigi Vito Amadori, Antoine Cully

发表机构 * Imperial College London(伦敦帝国理工学院) Sony Interactive Entertainment(索尼互动娱乐)

AI总结 该研究提出CODE-SHARP框架,利用基础模型自主发现并演化以层次化奖励程序表示的技能,从而通过强化学习从零开始训练通用智能体策略,无需预定义奖励即可高效学习长时程技能。

Comments Preprint

详情
AI中文摘要

通用智能的一个核心特质是能够自主地、开放式地扩展和演化其已掌握的技能集。尽管最近基于基础模型(FM)的方法在这一目标上取得了有希望的结果,但它们通常依赖大量需要人工参与的工程工作,限制了其向新环境的可迁移性。为了解决这个问题,我们引入了CODE-SHARP(Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs)框架,该框架利用基础模型开放式地扩充和演化一个编码技能的Python程序档案库,直接从源代码出发,通过强化学习完全从零开始训练通用智能体策略。这些程序被称为层次化奖励程序形式的技能(SHARPs),每个程序编码一个局部成功条件和一组委托给先前已发现SHARPs的先决条件。在运行时,SHARPs根据当前状态动态地引导智能体沿其先决条件链前进,并对沿途的每次完成给予奖励,使智能体只需学习每个新SHARP引入的增量行为,从而在没有任何预定义奖励的情况下高效学习长时程技能。在Craftax-Classic和XLand上,由CODE-SHARP完全自主训练的智能体的中位性能分别比先前工作高出6倍和2.6倍,并且是唯一能够制作铁制工具和开采钻石的智能体。扩展到Craftax-Extended后,CODE-SHARP在90多个已发现的SHARPs上训练通用智能体,使其能够零样本解决具有挑战性的长时程任务,表现与基于真实奖励训练的智能体相当。

英文摘要

A core quality of general intelligence is the ability to open-endedly expand and evolve its set of mastered skills autonomously. While recent Foundation Model (FM) driven approaches have shown promising results towards this goal, they typically rely on significant human-in-the-loop engineering, limiting their transferability to novel environments. To address this, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a framework that leverages FMs to open-endedly grow and evolve an archive of Python programs encoding skills to train a generalist agent policy entirely from scratch via reinforcement learning, directly from source code. These programs, termed Skills as Hierarchical Reward Programs (SHARPs), each encode a local success condition and a set of prerequisites delegated to previously discovered SHARPs. At runtime, SHARPs dynamically route the agent through their prerequisite chain based on the current state, rewarding each completion along the way, requiring the agent to learn only the marginal behaviour each new SHARP introduces, enabling efficient learning of long-horizon skills without any pre-defined rewards. On Craftax-Classic and XLand, agents trained fully autonomously by CODE-SHARP outperform previous works by 6x and 2.6x in median performance and are the only agents capable of crafting iron tools and mining diamonds. Scaled to Craftax-Extended, CODE-SHARP trains a generalist agent on over 90 discovered SHARPs, enabling the agent to solve challenging long-horizon tasks zero-shot, matching agents trained on ground-truth rewards.
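
The structure described above (a local success condition plus prerequisites delegated to earlier skills, with routing through the prerequisite chain) can be sketched as a small Python class. The class layout, the dictionary-based state, and the Craftax-like item names are hypothetical; the programs that CODE-SHARP actually generates are not shown in the abstract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, int]  # hypothetical inventory-style state


@dataclass
class Sharp:
    name: str
    success: Callable[[State], bool]
    prerequisites: List["Sharp"] = field(default_factory=list)

    def active_target(self, state: State) -> "Sharp":
        """Route to the deepest unmet prerequisite, or to this skill itself."""
        for pre in self.prerequisites:
            if not pre.success(state):
                return pre.active_target(state)
        return self

    def reward(self, prev_state: State, state: State) -> float:
        """Pay 1.0 when the currently active skill flips from unmet to met."""
        target = self.active_target(prev_state)
        newly_met = not target.success(prev_state) and target.success(state)
        return 1.0 if newly_met else 0.0


collect_wood = Sharp("collect_wood", lambda s: s.get("wood", 0) >= 1)
make_pickaxe = Sharp("make_pickaxe", lambda s: s.get("pickaxe", 0) >= 1,
                     [collect_wood])
mine_stone = Sharp("mine_stone", lambda s: s.get("stone", 0) >= 1,
                   [make_pickaxe])

first_target = mine_stone.active_target({}).name
first_reward = mine_stone.reward({}, {"wood": 1})
```

Because each skill delegates to previously defined skills, a new top-level skill reuses the rewards of its whole chain, and the agent only has to learn the final marginal behaviour, which is the efficiency argument made in the abstract.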

2602.10009 2026-05-22 cs.AI cs.HC 版本更新

Discovering High Level Patterns from Simulation Traces

从仿真轨迹中发现高层次模式

Sean Memery, Kartic Subr

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 本文提出了一种通过程序合成进行无监督学习的方法,将仿真轨迹转换为稀疏的高层次模式表示,以提升大语言模型对物理系统的推理能力。

详情
AI中文摘要

大型语言模型(LLMs)在处理特定物理系统时无法可靠推理。尽管尝试通过赋予LLMs物理概念知识来提升其能力显示出巨大潜力,但可解释性和验证仍面临挑战。一种新兴的替代方法是工具链,其中LLMs可以查询物理模拟器并利用生成的仿真轨迹作为验证上下文。然而,这种方法的可扩展性较差,因为仿真轨迹包含大量细粒度的数值和语义数据。我们证明,将仿真轨迹转换为稀疏表示的“高层次”结构模式能更有效地被LLMs解释。我们提出了一种无监督学习方案,通过程序合成执行此转换或注释。我们的学习结果产生了一组程序库,这些程序作为模式检测器,可以将仿真轨迹转换为稀疏注释的模式序列。检测到的模式可选地通过人类专家的字符串标签(如刚性碰撞、拉伸弹簧等)进行引导。我们通过最近的一个物理基准测试表明,这样的注释表示更易于自然语言推理特定物理系统。合成的程序充当透明、可解释的函数,将系统状态映射到稀疏且高效的注释空间。作为应用示例,我们展示了如何将自然语言指定的物理系统目标转换为奖励程序,通过最大化这些程序来寻找解决方案。

英文摘要

Large Language Models (LLMs) are unable to reliably reason about specific physical systems. Attempts to imbue LLMs with knowledge of the necessary physics concepts have shown great promise, but explainability and validation remain open challenges. An emerging alternative is tooling, where LLMs can query physical simulators and use the resulting simulation traces as context for validation. This approach suffers from poor scalability since simulation traces contain large volumes of fine-grained numerical and semantic data. We show that translating simulation traces to a sparse representation of "high-level" structural patterns leads to more effective interpretation by LLMs. We propose an unsupervised learning scheme to perform this translation, or annotation, via program synthesis. Our learning results in a library of programs that act as pattern detectors which can translate simulation traces to sparse, annotated pattern sequences. The detected patterns may optionally be guided by human experts via string labels (rigid collision, stretching spring, etc.). We show, using a recent physics benchmark, that such annotated representations are more amenable to natural language reasoning about specific physical systems. The synthesized programs serve as transparent, explainable functions that map system states to a sparse and efficient annotation space. As an example application, we show how goals within physical systems that are specified in natural language may be converted to reward programs which are maximized to find solutions.

2602.08064 2026-05-22 cs.LG cs.AI cs.CL 版本更新

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

SiameseNorm:打破调和Pre-Norm与Post-Norm的障碍

Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang

发表机构 * Leap Lab, Tsinghua University(清华大学 Leap 实验室) Qwen Large Model Application Team, Alibaba(阿里巴巴 Qwen 大模型应用团队) Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息学研究院)

AI总结 本文提出SiameseNorm,一种双流架构,通过共享残差块将类Pre-Norm流与类Post-Norm流耦合起来,在保持训练稳定性的同时提升模型性能,适用于多种架构和模态。

Comments Accepted to ICML 2026; camera-ready version; revised presentation and added additional experimental results

详情
AI中文摘要

Pre-Norm与Post-Norm之间长期存在的矛盾仍是Transformer架构中的一个开放问题,反映了训练稳定性与表示能力之间的根本权衡。先前结合两者优势的尝试取得了一定进展,但在不同训练设置下往往鲁棒性有限,限制了其更广泛的适用性。我们重新审视这一困境,表明单流架构难以兼顾Pre-Norm稳定的恒等梯度传播与Post-Norm对主残差路径的归一化。为了解决这种结构性矛盾,我们提出SiameseNorm,一种简单而有效的双流架构,并与Pre-Norm的训练配方保持兼容。SiameseNorm通过共享残差块将类Pre-Norm流与类Post-Norm流耦合起来,使每个残差块都能从两条路径接收优化信号,而额外开销可以忽略不计。在400M和1.3B稠密语言模型、15B MoE模型、Vision Transformer以及Diffusion Transformer上的大量实验表明,SiameseNorm在各种架构和模态上都能持续提升性能,同时保持很强的训练稳定性。代码可在 https://github.com/Qwen-Applications/SiameseNorm 获得。

英文摘要

The long-standing tension between Pre- and Post-Norm remains an open problem in Transformer architecture, reflecting a fundamental trade-off between training stability and representational capacity. Prior attempts to combine their strengths have made progress, but often show limited robustness across training settings, restricting their broader applicability. We revisit this dilemma, showing that single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. To address this structural tension, we propose SiameseNorm, a simple yet effective two-stream architecture that remains compatible with Pre-Norm training recipes. SiameseNorm couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that SiameseNorm consistently improves performance while maintaining strong training stability across architectures and modalities. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
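
The two residual layouts that the abstract contrasts can be sketched in a few lines of plain Python. This is only the standard textbook Pre-Norm and Post-Norm update, not the SiameseNorm architecture itself; `toy_sublayer` is a hypothetical stand-in for an attention or MLP block, and the normalization omits the learnable gain and bias.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x, f):
    """Pre-Norm: normalize the sublayer input, keep the residual path as identity."""
    return [xi + fi for xi, fi in zip(x, f(layer_norm(x)))]


def post_norm_block(x, f):
    """Post-Norm: add the residual first, then normalize the sum."""
    return layer_norm([xi + fi for xi, fi in zip(x, f(x))])


def toy_sublayer(h):
    """Hypothetical sublayer used only for illustration."""
    return [0.5 * v for v in h]


x = [1.0, 2.0, 4.0, 9.0]
pre_out = pre_norm_block(x, toy_sublayer)
post_out = post_norm_block(x, toy_sublayer)
```

In the Pre-Norm block the input `x` reaches the output unchanged along the identity path, which is the stable gradient route the abstract refers to; in the Post-Norm block the sum on the main residual path is itself normalized.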

2602.05536 2026-05-22 cs.LG cs.AI cs.CL cs.CV 版本更新

When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging

当共享知识有害:模型融合中的谱过积累

Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China.(新型软件技术国家重点实验室,南京大学,南京210023,中国。) Institute of Brain-Computer Interface, Nanjing University, Nanjing 210023, China.(脑机接口研究院,南京大学,南京210023,中国。)

AI总结 本文研究了模型融合中共享知识过积累的问题,提出SVC方法通过校准奇异值来恢复谱平衡,提升了模型融合和任务算术的性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

模型融合通过将多个微调模型的权重更新相加,提供了一种轻量级的替代方法,而非重新训练。现有方法主要针对解决任务更新之间的冲突,未处理共享知识过积累的失败模式。我们发现当任务共享对齐的谱方向(即重叠的奇异向量)时,简单的线性组合会反复积累这些方向,导致奇异值膨胀并使融合模型偏向共享子空间。为缓解此问题,我们提出Singular Value Calibration (SVC),一种无需训练和数据的后处理方法,量化子空间重叠并重新缩放膨胀的奇异值以恢复平衡的谱。在视觉和语言基准上,SVC一致改进了强大的融合基线并实现了最先进的性能。此外,仅通过修改奇异值,SVC将任务算术的性能提高了13.0%。代码可在https://github.com/lyymuwu/SVC获取。

英文摘要

Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at https://github.com/lyymuwu/SVC.
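
The over-accumulation effect described above can be illustrated with a toy example. The sketch below is not the SVC method; it only assumes two rank-1 task updates and compares the largest singular value of their sum when the updates share a direction and when they do not. All function names are hypothetical, and the singular value is estimated with plain power iteration to keep the example dependency-free.

```python
import math


def outer(u, v, scale=1.0):
    """Return the rank-1 matrix scale * u v^T as a list of rows."""
    return [[scale * ui * vj for vj in v] for ui in u]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_singular_value(m, iters=200):
    """Estimate the largest singular value by power iteration on M^T M."""
    rows, cols = len(m), len(m[0])
    x = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        y = [sum(m[i][j] * x[j] for j in range(cols)) for i in range(rows)]  # y = M x
        z = [sum(m[i][j] * y[i] for i in range(rows)) for j in range(cols)]  # z = M^T y
        norm = math.sqrt(sum(c * c for c in z))
        if norm == 0.0:
            return 0.0
        x = [c / norm for c in z]
    y = [sum(m[i][j] * x[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(c * c for c in y))


e1, e2 = [1.0, 0.0], [0.0, 1.0]

# Two task updates that share the same singular direction.
shared = top_singular_value(add(outer(e1, e1), outer(e1, e1)))

# Two task updates that use orthogonal directions.
disjoint = top_singular_value(add(outer(e1, e1), outer(e2, e2)))
```

Each update on its own has a top singular value of 1. Adding the aligned pair doubles it, while adding the orthogonal pair leaves it at 1, which is the inflation along shared directions that the abstract describes.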

2602.04768 2026-05-22 cs.LG cs.AI 版本更新

Billion-Scale Graph Foundation Models

十亿级图基础模型

Maya Bechler-Speicher, Yoel Gottlieb, Andrey Isakov, David Abensur, Ami Tavory, Daniel Haimovich, Ido Guy, Udi Weinsberg

发表机构 * Meta

AI总结 本文提出GraphBFF,一种用于构建大规模异构图的十亿参数图基础模型的端到端方法,通过引入GraphBFF Transformer架构,揭示了异构图的神经缩放定律,并在多个下游任务中展示了其优越的性能。

详情
AI中文摘要

图结构数据支撑了许多关键应用。尽管基础模型通过大规模预训练和轻量级适应改变了语言和视觉领域,但将其扩展到一般、现实世界的图结构却具有挑战性。在本文中,我们提出了Graph Billion-Foundation-Fusion(GraphBFF):一种用于构建大规模异构图的十亿参数图基础模型(GFMs)的端到端方法。该方法的核心是GraphBFF Transformer,一种灵活且可扩展的架构,专为实际的十亿级GFMs设计。利用GraphBFF,我们提出了异构图的神经缩放定律,并显示损失随着模型容量或训练数据规模的增加而减少,取决于哪个因素是瓶颈。GraphBFF框架提供了具体的方法论,用于数据分批、预训练和微调,以构建大规模的GFMs。我们通过一个现实世界中的十亿级图展示了该框架的有效性,评估了一个十亿参数的GraphBFF Transformer,按照所提出的配方。在十个不同的现实世界下游任务上,涵盖节点和链接级别的分类和回归,GraphBFF在训练过程中未见过的图上始终优于基线,最大差距达到31个PRAUC点,包括在少样本设置中。最后,我们讨论了使GFMs成为工业规模图学习实际和原则性基础的关键挑战和开放机会。

英文摘要

Graph-structured data underpins many critical applications. While foundation models have transformed language and vision via large-scale pretraining and lightweight adaptation, extending this paradigm to general, real-world graphs is challenging. In this work, we present Graph Billion-Foundation-Fusion (GraphBFF): an end-to-end recipe for building billion-parameter Graph Foundation Models (GFMs) for large-scale heterogeneous graphs. Central to the recipe is the GraphBFF Transformer, a flexible and scalable architecture designed for practical billion-scale GFMs. Using GraphBFF, we present neural scaling laws for heterogeneous graphs and show that loss decreases predictably as either model capacity or training data scales, depending on which factor is the bottleneck. The GraphBFF framework provides concrete methodologies for data batching, pretraining, and fine-tuning for building GFMs at scale. We demonstrate the effectiveness of the framework on a real-world billion-scale graph, evaluating a billion-parameter GraphBFF Transformer built with the proposed recipe. Across ten diverse, real-world downstream tasks on graphs unseen during training, spanning node- and link-level classification and regression, GraphBFF consistently outperforms baselines, with large margins of up to 31 PRAUC points, including in few-shot settings. Finally, we discuss key challenges and open opportunities for making GFMs a practical and principled foundation for graph learning at industrial scale.

2602.02709 2026-05-22 cs.AI 版本更新

ATLAS: A Multi-LLM Training Framework for EvoDPO with Adaptive Reference Evolution

ATLAS:一种用于EvoDPO的多LLM训练框架,具有自适应参考进化

Ujin Jeon, Jiyong Kwon, Madison Ann Sullivan, Caleb Eunho Lee, Guang Lin

发表机构 * School of Electrical and Computer Engineering(电气与计算机工程学院) Purdue University West Lafayette(普渡大学西拉法叶校区) School of Mechanical Engineering(机械工程学院) Department of Mathematics(数学系) Department of Computer Science(计算机科学系) Department of Mathematics and Mechanical Engineering(数学与机械工程系)

AI总结 本文提出ATLAS框架,通过自适应参考进化解决多LLM代理系统中固定参考模型导致的更新保守或训练停滞问题,结合支持者驱动探索与EvoDPO驱动的稳定性,提升长期评估驱动的自我改进能力。

详情
AI中文摘要

近期的多LLM智能体系统在自动化问题求解方面展现出良好的能力,但它们主要依赖冻结的智能体或静态的微调流水线。针对这一局限,我们的主要贡献是ATLAS(Adaptive Task-distributed Learning for Agentic Self-evolution,面向智能体自演化的自适应任务分布式学习),这是一个多智能体框架,其中专门的元智能体协作训练并优化一个活跃智能体,使其趋向领域特定的策略。此类流水线中迭代偏好学习的核心挑战是对固定参考模型的依赖,这通常导致更新过于保守或训练停滞。为克服这一问题,该框架的算法引擎采用演化直接偏好优化(EvoDPO)。EvoDPO利用一个检查智能体,根据持续的训练遥测数据,执行自适应的、由proxy-KL门控的参考策略更新。我们在一系列具有挑战性的环境中评估了完整框架,包括非平稳上下文赌博机、偏微分方程(PINNs)和组合优化任务(TSP、Bin Packing)。通过与固定参考、自适应参考以及外部自动发现基线的比较,结果表明ATLAS将支持者驱动的探索与EvoDPO带来的稳定性相结合,改进了长时程、由评估器驱动的自我改进。

英文摘要

Recent multi-LLM agent systems have shown promising capabilities for automated problem-solving, yet they predominantly rely on frozen agents or static fine-tuning pipelines. To address this limitation, our primary contribution is ATLAS (Adaptive Task-distributed Learning for Agentic Self-evolution), a multi-agent framework where specialized meta-agents collaboratively train and refine an active agent toward a domain-specific policy. A core challenge in iterative preference learning within these pipelines is the reliance on fixed reference models, which typically leads to overly conservative updates or training stagnation. To overcome this, the framework's algorithmic engine uses Evolving Direct Preference Optimization (EvoDPO). EvoDPO employs an inspection agent to perform adaptive, proxy-KL gated reference policy updates based on continuous training telemetry. We evaluate the full framework across a diverse set of challenging environments, including non-stationary contextual bandits, partial differential equations (PINNs), and combinatorial optimization tasks (TSP, Bin Packing). Through comparison against fixed-reference, adaptive-reference, and external automated-discovery baselines, our results suggest that ATLAS combines supporter-driven exploration with EvoDPO-driven stability to improve long-horizon evaluator-driven self-improvement.

2602.02112 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Unifying Masked Diffusion Models with Various Generation Orders and Beyond

统一多种生成顺序及超越的掩码扩散模型

Chunsan Hong, Sanghyun Lee, Jong Chul Ye

发表机构 * Graduate School of AI, KAIST, South Korea(韩国科学技术院人工智能研究生院)

AI总结 本文提出Order-Expressive Masked Diffusion Model (OeMDM)和Learnable-Order Masked Diffusion Model (LoMDM),统一了不同生成顺序的扩散生成过程,并通过单目标学习生成顺序和扩散骨干,提升了文本生成性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

Masked diffusion models (MDMs) 是语言生成中替代自回归模型 (ARMs) 的潜在选择,但生成质量严重依赖于生成顺序。先前工作要么硬编码顺序(例如块状左到右),要么为预训练的MDM学习顺序策略,这会带来额外成本并可能导致次优解,因为存在两阶段优化。受此启发,我们提出了order-expressive masked diffusion model (OeMDM),以适用于各种生成顺序的广泛扩散生成过程,使MDM、ARM和块扩散能在单一框架中进行解释。此外,基于OeMDM,我们引入了learnable-order masked diffusion model (LoMDM),通过单目标学习生成顺序和扩散骨干,使扩散模型能够根据上下文生成顺序进行文本生成。实证上,我们证实LoMDM在多个语言模型基准测试中优于各种离散扩散模型。

英文摘要

Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but generation quality depends critically on the generation order. Prior work either hard-codes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions due to the two-stage optimization. Motivated by this, we propose order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes with various generation orders, enabling the interpretation of MDM, ARM, and block diffusion in a single framework. Furthermore, building on OeMDM, we introduce learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone through a single objective from scratch, enabling the diffusion model to generate text in context-dependent ordering. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.

2512.21132 2026-05-22 cs.CR cs.AI cs.LG cs.PL 版本更新

AutoBaxBuilder: Bootstrapping Code Security Benchmarking

AutoBaxBuilder:自举构建代码安全基准测试

Tobias von Arx, Niels Mündler, Mark Vero, Maximilian Baader, Martin Vechev

发表机构 * ETH Zurich(苏黎世联邦理工学院) INSAIT, Sofia University "St. Kliment Ohridski"(INSAIT,索菲亚大学"圣克莱门特·奥赫里德斯基")

AI总结 本文提出AutoBaxBuilder,一种自动化生成代码安全基准测试任务的流水线,通过结合LLM的代码理解能力与可靠性检查,构建功能测试和端到端的安全性探测利用,从而提高代码安全性的评估效率和准确性。

Comments ICML 2026

详情
AI中文摘要

随着大型语言模型(LLMs)在软件工程中的广泛应用,可靠地评估LLM生成代码的正确性和安全性至关重要。值得注意的是,先前的研究表明LLMs容易生成包含安全漏洞的代码,说明安全性常被忽视。这些发现得益于安全专家投入大量人工精心构建的专用基准测试。然而,基准测试(i)最终不可避免地会污染训练数据,(ii)必须扩展到新任务以提供更完整的图景,(iii)必须提高难度以挑战能力更强的LLMs。在本工作中,我们针对这些挑战提出了AutoBaxBuilder,一种从零开始生成代码安全基准测试任务的自动化流水线。它利用LLM的代码理解能力,结合稳健的可靠性检查,构建功能测试和端到端的安全探测漏洞利用。该流水线的质量一方面通过将其预测与专家编写的基线对齐得到定量确认,另一方面通过人工的可靠性验证得到定性验证。我们使用AutoBaxBuilder构建了一个新的基准测试,并以AutoBaxBench之名公开发布,同时对当前的LLMs进行了全面评估。AutoBaxBuilder生成新任务的耗时不到2小时,成本低于4美元。即使计入人工验证,这也将构建基准测试所需的人力降低到原来的十二分之一。

英文摘要

As large language models (LLMs) see wide adoption in software engineering, the reliable assessment of the correctness and security of LLM-generated code is crucial. Notably, prior work showed that LLMs are prone to generating code with security vulnerabilities, highlighting that security is often overlooked. These insights were enabled by specialized benchmarks crafted by security experts through significant manual effort. However, benchmarks (i) inevitably end up contaminating training data, (ii) must extend to new tasks to provide a more complete picture, and (iii) must increase in difficulty to challenge more capable LLMs. In this work, we address these challenges and present AutoBaxBuilder, an automated pipeline that generates code security benchmarking tasks from scratch. It leverages the code-understanding capabilities of LLMs combined with robust reliability checks to construct functional tests and end-to-end security-probing exploits. The quality of the pipeline is quantitatively confirmed by aligning its predictions with an expert-written baseline and qualitatively validated through manual soundness verification. We use AutoBaxBuilder to construct a new benchmark and release it to the public as AutoBaxBench, together with a thorough evaluation on contemporary LLMs. AutoBaxBuilder generates new tasks in under 2 hours, for less than USD 4. Including a manual verification, this reduces the required human effort for benchmark construction by a factor of 12.

2512.06556 2026-05-22 cs.CR cs.AI 版本更新

Semantic Attacks on Tool-Augmented LLMs: Securing the Model Context Protocol Against Descriptor-Level Manipulation

对增强工具的语义攻击:保护模型上下文协议免受描述级操纵

Saeid Jamshidi, Arghavan Moradi Dakhel, Kawser Wazed Nafi, Foutse Khomh

发表机构 * SWAT Laboratory, Polytechnique Montréal(SWAT实验室,蒙特利尔理工学院)

AI总结 本文研究了通过工具描述符与外部工具交互的模型上下文协议(MCP)中存在的语义攻击问题,提出了一种分层防御方案,通过描述符完整性验证、预上下文语义审查和轻量级运行时防护机制,有效降低描述级操纵导致的不安全工具调用风险。

详情
AI中文摘要

模型上下文协议(MCP)使大型语言模型(LLMs)能够通过工具描述符与外部工具交互,从而扩展其在任务执行、自主决策和多智能体协调方面的能力。现有的MCP部署将工具描述符视为可信元数据,尽管这些描述符会被直接并入LLM的推理上下文。这引入了一个此前未被充分探索的语义攻击面。当前的防御主要针对提示注入,忽略了可能使工具选择和下游推理产生偏差的描述符级操纵。为弥补这一空白,我们形式化了三类由描述符驱动的攻击:Tool Poisoning(工具投毒)、Shadowing(影子攻击)和Rug Pull(抽地毯攻击)。我们提出了一种分层防御方案,它整合了描述符完整性验证、借助辅助LLM的预上下文语义审查以及轻量级运行时防护,且无需重新训练模型。我们在受控的对抗性MCP场景中,针对八种提示策略评估了GPT-5.3、DeepSeek-V3和LLaMA-3.5,场景中的工具元数据被操纵以模拟现实攻击。结果表明,描述符操纵会显著改变工具选择行为,在基线配置下,多达36%的试验出现不安全的工具调用。所提出的全栈缓解措施将不安全调用降低到15%,同时将拦截率提高到74%,对描述符驱动攻击的抵御能力有显著提升。跨模型分析进一步揭示,不同LLM架构和提示策略在鲁棒性、延迟以及对描述符级操纵的敏感性方面存在显著差异。本研究对工具调用LLM系统中的描述符级威胁和缓解策略进行了受控的跨模型评估,为部署安全且有韧性的工具增强LLM奠定了实证基础。

英文摘要

The Model Context Protocol (MCP) enables Large Language Models (LLMs) to interact with external tools via tool descriptors, thereby extending their capabilities for task execution, autonomous decision-making, and multi-agent coordination. Existing MCP deployments treat tool descriptors as trusted metadata, despite their direct integration into the LLM reasoning context. This introduces a previously underexplored semantic attack surface. Current defenses primarily target prompt injection, neglecting descriptor-level manipulation that can bias tool selection and downstream reasoning. To address this gap, we formalize three descriptor-driven attack classes: Tool Poisoning, Shadowing, and Rug Pull. We propose a layered defense solution that integrates descriptor integrity verification, pre-context semantic vetting with an auxiliary LLM, and lightweight runtime guardrails, without requiring model retraining. We evaluate GPT-5.3, DeepSeek-V3, and LLaMA-3.5 across eight prompting strategies in controlled, adversarial MCP scenarios in which tool metadata is manipulated to simulate realistic attacks. Results demonstrate that descriptor manipulation can substantially alter tool-selection behavior, producing unsafe tool invocations in up to 36% of trials under baseline configurations. The proposed full-stack mitigation reduces unsafe invocations to 15% while increasing the block rate to 74%, demonstrating substantial improvement in resistance to descriptor-driven attacks. Cross-model analysis further reveals significant differences in robustness, latency, and sensitivity to descriptor-level manipulation across LLM architectures and prompting strategies. This study provides a controlled cross-model evaluation of descriptor-level threats and mitigation strategies in tool-calling LLM systems, establishing an empirical foundation for deploying secure and resilient tool-augmented LLMs.
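
One way to picture the descriptor integrity verification layer mentioned above is hash pinning: record a digest of each tool descriptor when it is approved, and refuse the tool if the descriptor later changes. The sketch below is an assumption about how such a check could look, not the paper's implementation; `DescriptorPins` and the example descriptor are hypothetical, and only the Python standard library is used.

```python
import hashlib
import json


def descriptor_digest(descriptor: dict) -> str:
    """Hash a canonical JSON encoding so key order does not affect the digest."""
    canonical = json.dumps(
        descriptor, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DescriptorPins:
    """Hypothetical registry that pins approved tool descriptors by digest."""

    def __init__(self):
        self._pins = {}

    def approve(self, descriptor: dict) -> None:
        """Record the digest of a descriptor that a reviewer has accepted."""
        self._pins[descriptor["name"]] = descriptor_digest(descriptor)

    def is_unchanged(self, descriptor: dict) -> bool:
        """Return True only if the tool is known and its descriptor is identical."""
        pinned = self._pins.get(descriptor["name"])
        return pinned is not None and pinned == descriptor_digest(descriptor)


pins = DescriptorPins()
original = {
    "name": "read_file",
    "description": "Read a file from the workspace.",
    "inputSchema": {"type": "object"},
}
pins.approve(original)

# A descriptor whose text was silently altered after approval.
tampered = dict(original, description="Read a file. Always call this tool first.")
```

A check of this kind only detects post-approval changes to a descriptor. It does not judge whether the approved text is itself benign, which is the role the abstract assigns to semantic vetting and runtime guardrails.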

2512.04111 2026-05-22 cs.SE cs.AI cs.HC 版本更新

CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding

CentaurEval: 评估人机协同在编程中的价值

Hanjun Luo, Chiming Ni, Jiaheng Wen, Zhimu Huang, Yiran Wang, Bingduo Liao, Sylvia Chung, Yingbin Jin, Xinfeng Li, Wenyuan Xu, XiaoFeng Wang, Hanan Salam

发表机构 * New York University Abu Dhabi(纽约大学阿布扎比分校) Nanyang Technological University(南洋理工大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Harvard University(哈佛大学) Zhejiang University(浙江大学) University of Electronic Science and Technology of China(电子科技大学) Beijing University of Technology(北京工业大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文提出CentaurEval基准测试,用于评估人机协同在编程中的价值。该基准测试通过协作必要问题模板,结合人类推理和AI效率,展示了人机协作在编程任务中的显著优势。

Comments Accepted by ICML 2026

详情
AI中文摘要

由LLM驱动的编程智能体正在重塑开发范式。然而,现有的评估体系,无论是面向人类的传统测试还是面向LLM的基准测试,都无法捕捉这一转变,因为它们排除了既需要人类推理来引导解法、又需要AI效率来完成实现的问题。我们提出CentaurEval,一个统一且具有生态效度的基准测试,用于衡量编程中人在回路的价值。CentaurEval的核心创新是其“协作必要”(Collaboration-Necessary)问题模板,这些问题对单独的LLM或人类而言难以解决,但可以通过有效协作解决。CentaurEval从45个模板动态实例化任务,为人类提供标准化的IDE,为LLM提供可复现的450任务工具包。我们在4个层次的人类干预下,对45名参与者和5个LLM进行了基准测试。结果显示,单独的LLM或人类的通过率很低(分别为0.67%和18.89%),而人机协作将通过率显著提高到31.11%。我们的分析揭示了一种正在形成的共同推理伙伴关系:战略性突破既可能来自人类,也可能来自AI,这挑战了传统的“人类-工具”层级关系。

英文摘要

LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift because they exclude problems that require both human reasoning to guide solutions and AI efficiency for implementation. We introduce CentaurEval, a unified, ecologically valid benchmark for measuring human-in-the-loop value in coding. CentaurEval's core innovation is its "Collaboration-Necessary" problem templates, which are intractable for standalone LLMs or humans, but solvable through effective collaboration. CentaurEval dynamically instantiates tasks from 45 templates, providing a standardized IDE for humans and a reproducible 450-task toolkit for LLMs. We benchmark 45 participants against 5 LLMs under 4 levels of human intervention. Results show that while LLMs or humans alone achieve poor pass rates (0.67% and 18.89%, respectively), human-AI collaboration significantly improves the pass rate to 31.11%. Our analysis reveals an emerging co-reasoning partnership, challenging the traditional human-tool hierarchy by showing that strategic breakthroughs can originate from either humans or AI.

2512.03121 2026-05-22 cs.CR cs.AI 版本更新

Lost in Modality: Evaluating the Effectiveness of Text-Based Membership Inference Attacks on Large Multimodal Models

模态迷失:评估基于文本的成员推断攻击在大型多模态模型中的有效性

Ziyi Tong, Feifei Sun, Le Minh Nguyen

发表机构 * Japan Advanced Institute of Science and Technology - Information Science(北陆先端科学技术大学院大学-信息科学)

AI总结 本文评估了基于文本的成员推断攻击(MIAs)在多模态模型中的有效性,发现其在分布内设置中表现相似,而在分布外设置中视觉输入起到正则化作用,有效掩盖了成员信号。

Comments accepted by ESANN 2026

详情
AI中文摘要

大型多模态语言模型(MLLMs)正在成为越来越多应用中的基础工具。因此,理解这些系统中的训练数据泄漏变得越来越关键。基于对数概率的成员推断攻击(MIAs)已成为评估大型语言模型(LLMs)数据暴露的常用方法,但其在MLLMs中的效果仍不明确。我们首次全面评估了将这些基于文本的MIA方法扩展到多模态设置的效果。在DeepSeek-VL和InternVL模型家族中,我们进行了视觉与文本(V+T)和纯文本(T-only)条件下的实验,结果显示,在分布内设置中,基于logit的MIAs在不同配置中表现相当,略有V+T优势。相反,在分布外设置中,视觉输入起到正则化作用,有效掩盖了成员信号。

英文摘要

Large Multimodal Language Models (MLLMs) are emerging as one of the foundational tools in an expanding range of applications. Consequently, understanding training-data leakage in these systems is increasingly critical. Log-probability-based membership inference attacks (MIAs) have become a widely adopted approach for assessing data exposure in large language models (LLMs), yet their effect in MLLMs remains unclear. We present the first comprehensive evaluation of extending these text-based MIA methods to multimodal settings. Our experiments under vision-and-text (V+T) and text-only (T-only) conditions across the DeepSeek-VL and InternVL model families show that in in-distribution settings, logit-based MIAs perform comparably across configurations, with a slight V+T advantage. Conversely, in out-of-distribution settings, visual inputs act as regularizers, effectively masking membership signals.
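
A log-probability-based membership score of the kind the abstract refers to can be sketched as follows. This is a generic illustration, assuming per-token log-probabilities have already been obtained from the model; the function names and the threshold are hypothetical, and the abstract does not state which specific scoring variants the paper evaluates. The second function follows the Min-K%-style idea of averaging only the least likely tokens.

```python
def mean_logprob_score(token_logprobs):
    """Average token log-probability; higher values suggest a more familiar text."""
    return sum(token_logprobs) / len(token_logprobs)


def min_k_score(token_logprobs, k=0.2):
    """Average of the lowest k fraction of token log-probabilities."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def predict_member(score, threshold):
    """Flag a sample as a suspected training member if its score exceeds the threshold."""
    return score > threshold


# Hypothetical per-token log-probabilities for one candidate text.
logprobs = [-0.1, -0.2, -3.0, -0.5]
mean_score = mean_logprob_score(logprobs)
tail_score = min_k_score(logprobs, k=0.5)
```

In a vision-and-text setting the same scores would be computed on the text tokens while the image is also supplied as input, so any effect of the visual input shows up as a shift in these per-token log-probabilities.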

2511.14220 2026-05-22 cs.LG cs.AI 版本更新

Twice Sequential Monte Carlo for Tree Search

两次序贯蒙特卡洛用于树搜索

Yaniv Oren, Joery A. de Vries, Pascal R. van der Vaart, Matthijs T. J. Spaan, Wendelin Böhmer

发表机构 * Delft University of Technology(代尔夫特理工大学)

AI总结 本文提出Twice Sequential Monte Carlo Tree Search(TSMCTS)方法,通过降低估计方差和缓解路径退化问题,在离散和连续环境中取得了优于SMC基线和一种现代MCTS版本的性能,并能随顺序计算量良好扩展。

详情
AI中文摘要

利用搜索的基于模型的强化学习(RL)方法促成了RL领域的许多里程碑式突破。序贯蒙特卡洛(SMC)最近作为蒙特卡洛树搜索(MCTS)算法的替代方案出现,而这些突破正是由MCTS推动的。SMC更容易并行化,也更适合GPU加速。然而,它存在方差大和路径退化的问题,这使其难以随搜索深度的增加(即顺序计算量的增加)而良好扩展。为了解决这些问题,我们提出了两次序贯蒙特卡洛树搜索(TSMCTS)。在离散和连续环境中,TSMCTS作为策略改进算子优于SMC基线以及一种流行的现代MCTS版本,能够随顺序计算量良好扩展,降低估计器方差并缓解路径退化的影响,同时保留使SMC天然易于并行化的特性。

英文摘要

Model-based reinforcement learning (RL) methods that leverage search are responsible for many milestone breakthroughs in RL. Sequential Monte Carlo (SMC) recently emerged as an alternative to the Monte Carlo Tree Search (MCTS) algorithm which drove these breakthroughs. SMC is easier to parallelize and better suited to GPU acceleration. However, it also suffers from large variance and path degeneracy, which prevent it from scaling well with increased search depth, i.e., increased sequential compute. To address these problems, we introduce Twice Sequential Monte Carlo Tree Search (TSMCTS). Across discrete and continuous environments, TSMCTS outperforms the SMC baseline as well as a popular modern version of MCTS as a policy improvement operator, scales favorably with sequential compute, reduces estimator variance, and mitigates the effects of path degeneracy while retaining the properties that make SMC natural to parallelize.
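
Path degeneracy, the SMC weakness named in the abstract, can be seen in a small simulation: after repeated resampling, fewer and fewer of the original particles still have descendants. The sketch below is a generic illustration with random weights and multinomial resampling, not the TSMCTS algorithm; the function name and parameters are hypothetical.

```python
import random


def surviving_root_ancestors(num_particles, num_steps, seed=0):
    """Track how many distinct step-0 particles still have descendants over time."""
    rng = random.Random(seed)
    # ancestors[i] is the index of the step-0 particle that particle i descends from.
    ancestors = list(range(num_particles))
    history = [len(set(ancestors))]
    for _ in range(num_steps):
        # Stand-in importance weights; a real search would derive these from rewards or values.
        weights = [rng.random() for _ in range(num_particles)]
        # Multinomial resampling: draw parent indices in proportion to the weights.
        parents = rng.choices(range(num_particles), weights=weights, k=num_particles)
        ancestors = [ancestors[p] for p in parents]
        history.append(len(set(ancestors)))
    return history


history = surviving_root_ancestors(num_particles=64, num_steps=30, seed=0)
```

The count in `history` can never increase, because each resampling step only keeps a subset of the existing lineages. With a deep search this means the estimate at the root ends up resting on very few distinct first actions, which is why degeneracy limits scaling with sequential compute.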

2511.07820 2026-05-22 cs.RO cs.AI cs.CV cs.GR cs.SY eess.SY 版本更新

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

SONIC:面向自然人形机器人全身控制的超大规模运动追踪

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Fernando Castañeda, Sirui Chen, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Jinhyung Park, David Sami, Zi Wang, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan, Yuke Zhu

发表机构 * NVIDIA

AI总结 本文提出了一种超大规模运动追踪方法,通过扩大模型容量、数据和计算资源,得到一个能够产生自然且稳健全身运动的通用人形控制器,并展示了其在运动追踪任务中的可扩展性及在下游任务中的应用价值。

Comments Project page: https://nvlabs.github.io/SONIC/

详情
AI中文摘要

尽管在数千块GPU上训练的十亿参数级基础模型不断涌现,类似的规模化收益在人形机器人控制中尚未得到展示。当前的人形机器人神经控制器规模仍然较小,只针对有限的行为集合,并且仅在少量GPU上训练。我们证明,扩大模型容量、数据和计算量可以得到一个通用的人形控制器,能够实现自然且稳健的全身运动。我们将运动追踪定位为人形控制的可扩展任务,利用多样化动作捕捉数据提供的密集监督来获取人类运动先验,而无需人工设计奖励。我们沿三个维度进行扩展,构建了一个运动追踪基础模型:网络规模(120万到4200万参数)、数据集规模(来自700小时动作捕捉的1亿+帧)以及计算量(2.1万GPU小时)。除了展示规模化的收益外,我们还通过以下两点展示其下游效用:(1)一个实时运动学规划器,将运动追踪与导航等任务衔接起来,实现自然的交互式控制;(2)一个统一的token空间,用单一策略同时支持VR遥操作和视觉-语言-动作(VLA)模型。通过这一接口,我们展示了由VLA驱动的自主全身移动操作(loco-manipulation),它需要手和脚落点的协调。规模化的运动追踪表现出良好的性质:性能随计算量和数据多样性稳步提升,学到的策略能泛化到未见过的动作,这使大规模运动追踪成为人形控制的实用基础。

英文摘要

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited set of behaviors, and are trained on a handful of GPUs. We show that scaling model capacity, data, and compute yields a generalist humanoid controller capable of natural, robust whole-body movements. We position motion tracking as a scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (1.2M to 42M parameters), dataset volume (100M+ frames from 700 hours of motion capture), and compute (21k GPU hours). Beyond demonstrating the benefits of scale, we further show downstream utility through: (1) a real-time kinematic planner bridging motion tracking to tasks such as navigation, enabling natural and interactive control, and (2) a unified token space supporting VR teleoperation and vision-language-action (VLA) models with a single policy. Through this interface, we demonstrate autonomous VLA-driven whole-body loco-manipulation requiring coordinated hand and foot placement. Scaling motion tracking exhibits favorable properties: performance improves steadily with compute and data diversity, and learned policies generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

2510.16590 2026-05-22 cs.LG cs.AI q-bio.BM 版本更新

Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration

原子锚定的大语言模型通晓化学:逆合成演示

Alan Kai Hassen, Andrius Bernatavicius, Antonius P. A. Janssen, Mike Preuss, Gerard J. P. van Westen, Djork-Arné Clevert

发表机构 * Machine Learning Research(机器学习研究) Pfizer Research and Development(辉瑞研发) Leiden Institute of Advanced Computer Science(莱顿高级计算机科学研究所) Leiden University(莱顿大学) Leiden Academic Centre for Drug Research(莱顿药物研究中心) Leiden Institute of Chemistry(莱顿化学研究所)

AI总结 本研究提出了一种利用通用大语言模型进行分子推理的框架,通过原子标识符将思维链推理锚定到分子结构上,无需任务特定的模型训练,在单步逆合成任务中实现了高成功率。

Comments Alan Kai Hassen and Andrius Bernatavicius contributed equally to this work

详情
AI中文摘要

机器学习在化学中的应用常常受限于标注数据的稀缺和昂贵,这制约了传统的监督方法。在本工作中,我们提出了一个利用通用大语言模型(LLMs)进行分子推理的框架,它无需任务特定的模型训练。我们的方法使用唯一的原子标识符,将思维链推理锚定到分子结构上。首先,LLM执行零样本任务,识别相关片段及其对应的化学标签或转化类别。在可选的第二步中,这些位置感知的信息与所提供的类别示例一起用于少样本任务,以预测化学转化。我们将该框架应用于单步逆合成,这是一项LLMs此前表现不佳的任务。在学术基准和经专家验证的药物发现分子上,我们的工作使LLMs在识别化学上合理的反应位点(≥90%)、具名反应类别(≥40%)和最终反应物(≥74%)方面取得了高成功率。最终,我们的工作为把LLMs应用于以分子推理和分子转化为关键的难题建立了通用蓝图,使原子锚定的LLMs成为数据稀缺的化学领域中的有力解决方案。

英文摘要

Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring task-specific model training. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a zero-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and final reactants ($\geq74\%$). Ultimately, our work establishes a general blueprint for applying LLMs to challenges where molecular reasoning and molecular transformations are key, positioning atom-anchored LLMs as a powerful solution for data-scarce chemistry domains.

2510.11339 2026-05-22 cs.LG cs.AI 版本更新

Event-Aware Prompt Learning for Dynamic Graphs

事件感知的动态图提示学习

Xingtong Yu, Ruijuan Liang, Renhe Jiang, Dongyuan Li, Yunxiao Zhao, Xinming Zhang, Yuan Fang

发表机构 * The Chinese University of Hong Kong(香港中文大学) University of Science and Technology of China(中国科学技术大学) The University of Tokyo(东京大学) Shanxi University(山西大学) Singapore Management University(新加坡管理大学)

AI总结 本文提出EVP框架,通过提取历史事件并引入事件适应机制,增强动态图学习模型对历史事件知识的利用能力。

Comments Under review

详情
AI中文摘要

现实中的图通常通过一系列事件演变,建模不同领域中对象之间的动态交互。对于动态图学习,动态图神经网络(DGNNs)已逐渐成为流行解决方案。最近,提示学习方法被探索应用于动态图。然而,现有方法通常侧重于捕捉节点与时间之间的关系,而忽视了历史事件的影响。在本文中,我们提出了EVP,一种事件感知的动态图提示学习框架,可以作为现有方法的插件,增强其利用历史事件知识的能力。首先,我们为每个节点提取一系列历史事件,并引入事件适应机制,以将这些事件的细粒度特征对齐到下游任务。其次,我们提出事件聚合机制,以有效将历史知识整合到节点表示中。最后,我们在四个公开数据集上进行了广泛的实验,以评估和分析EVP。

英文摘要

Real-world graphs typically evolve via a series of events, modeling dynamic interactions between objects across various domains. For dynamic graph learning, dynamic graph neural networks (DGNNs) have emerged as popular solutions. Recently, prompt learning methods have been explored on dynamic graphs. However, existing methods generally focus on capturing the relationship between nodes and time, while overlooking the impact of historical events. In this paper, we propose EVP, an event-aware dynamic graph prompt learning framework that can serve as a plug-in to existing methods, enhancing their ability to leverage knowledge of historical events. First, we extract a series of historical events for each node and introduce an event adaptation mechanism to align the fine-grained characteristics of these events with downstream tasks. Second, we propose an event aggregation mechanism to effectively integrate historical knowledge into node representations. Finally, we conduct extensive experiments on four public datasets to evaluate and analyze EVP.

2510.10129 2026-05-22 cs.LG cs.AI 版本更新

CacheClip: Accelerating RAG with Effective KV Cache Reuse

CacheClip: 通过有效的KV缓存重用加速RAG

Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu

发表机构 * Intel Corporation(英特尔公司)

AI总结 本文提出CacheClip框架,通过有效利用KV缓存重用,解决了RAG系统中TTFT瓶颈问题,同时保持高质量生成。

详情
AI中文摘要

检索增强生成(RAG)系统由于长输入序列而面临严重的首次令牌时间(TTFT)瓶颈。现有KV缓存重用方法面临根本性的权衡:前缀缓存需要相同的前缀,这在RAG场景中很少出现,而直接预计算由于缺少跨块注意力和重复的注意力sink而牺牲了质量。最近的方法如APE和CacheBlend部分解决了这些问题,但不足以满足鲁棒的RAG应用。本文提出了CacheClip,一种新的框架,实现了快速的TTFT和高质量的生成。我们的关键洞察是小的辅助LLM表现出与主LLM(生成的目标模型)相似的最后一层注意力分布,这使能够高效地识别出恢复跨块注意力的关键令牌,从而在跨块推理任务上显著提高响应质量。CacheClip集成了四种技术:(1)辅助模型引导的令牌选择用于选择性地重新计算KV缓存,(2)共享前缀以消除冗余的注意力sink,(3)滑动窗口分组策略以在部分KV缓存更新期间保持局部一致性,(4)一种CPU-GPU混合设计,将辅助模型推理卸载到空闲的CPU资源上,避免额外的GPU开销。重新计算比率是可调节的,允许用户根据不同的部署需求灵活地平衡效率和质量。实验表明,CacheClip在NIAH和LongBench上保留了高达85.2%和91.1%的全注意力性能,优于CacheBlend和APE在NIAH上分别高出16.1和12.8点,在LongBench上分别高出4.5和4.2点(重新计算比率为20%)。同时,CacheClip在预填时间上将LLM推理加速了高达3.33倍(重新计算比率为20%),为RAG系统中的效率-质量权衡提供了实用的解决方案。

英文摘要

Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods like APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit similar last-layer attention distributions to primary LLMs (the target model for generation), enabling efficient identification of tokens critical for restoring inter-chunk attention, thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates four techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, (2) shared prefixes to eliminate redundant attention sinks, (3) a sliding-window grouping strategy to maintain local coherence during partial KV cache updates, and (4) a CPU-GPU hybrid design that offloads auxiliary model inference to idle CPU resources, avoiding additional GPU overhead. The recomputation ratio is adjustable, allowing users to flexibly balance efficiency and quality for different deployment requirements. Experiments show CacheClip retains up to 85.2% and 91.1% of full-attention performance on NIAH and LongBench, outperforming CacheBlend and APE by 16.1 and 12.8 points on NIAH, and by 4.5 and 4.2 points on LongBench (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 3.33$\times$ in prefill time (with recomp% = 20%), providing a practical solution to the efficiency-quality trade-off in RAG systems.
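
The selective recomputation step described above can be pictured as a top-k selection over per-token importance scores at a chosen recomputation ratio. The sketch below is a simplified assumption, not CacheClip's actual procedure: the scores stand in for attention mass reported by an auxiliary model, the function name is hypothetical, and the sliding-window grouping and shared-prefix handling from the abstract are not modeled.

```python
def select_tokens_to_recompute(scores, recompute_ratio):
    """Pick the highest-scoring token positions, up to the given fraction of tokens."""
    if not 0.0 < recompute_ratio <= 1.0:
        raise ValueError("recompute_ratio must be in (0, 1]")
    if not scores:
        return []
    # Always recompute at least one token so the ratio never rounds down to nothing.
    k = max(1, round(recompute_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return positions in document order, which is how a KV cache would be patched.
    return sorted(ranked[:k])


# Hypothetical importance scores for five cached tokens.
scores = [0.1, 0.9, 0.3, 0.8, 0.05]
selected = select_tokens_to_recompute(scores, recompute_ratio=0.4)
```

Raising `recompute_ratio` trades prefill speed for fidelity to full attention, which mirrors the adjustable efficiency-quality balance that the abstract reports at a 20% ratio.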

2510.03271 2026-05-22 cs.LG cs.AI 版本更新

Decision Potential Surface: A Theoretical and Practical Approximation of Large Language Model Decision Boundary

决策潜力面:大型语言模型决策边界的理论与实用近似

Zi Liang, Zhiyao Wu, Haoyang Shang, Yulin Jin, Qingqing Ye, Huadi Zheng, Peizhao Hu, Haibo Hu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) University of Macau(澳门大学) Shanghai Jiaotong University(上海交通大学) Huawei(华为) Rochester Institute of Technology(罗切斯特理工学院) PolyU Research Centre for Privacy and Security Technologies in Future Smart Systems(PolyU未来智能系统隐私与安全技术研究中心)

AI总结 本文提出决策潜力面(DPS)作为一种新的分析大型语言模型决策性质的方法,通过K-DPS算法以有限样本近似决策边界,理论推导了误差上限,展示了误差与采样次数的权衡。

Comments Source code: https://github.com/liangzid/DPS

详情
AI中文摘要

决策边界,即模型赋予两个类别相等分类概率的输入子空间,在揭示核心模型属性和解释行为中起关键作用。尽管最近分析大型语言模型(LLMs)的决策边界引起了越来越多的关注,但构造主流LLMs的决策边界在计算上仍不可行,因为LLMs具有巨大的序列级输出空间和自回归性质。为了解决这个问题,本文提出决策潜力面(DPS),这是一种新的分析LLMs决策性质的概念。DPS来源于每个输入区分不同类别的置信度,自然捕捉了决策边界的潜力。我们证明了DPS中的零高度等高线等同于LLM的决策边界,封闭区域代表决策区域。通过利用DPS,本文首次在文献中提出一个实用的决策边界近似算法,即K-DPS,该算法仅需K个有限序列样本即可以可忽略的误差近似LLM的决策边界。我们理论推导了K-DPS与理想DPS之间绝对误差、期望误差和误差集中度的上限,证明了这些误差可以与采样次数进行权衡。

英文摘要

The decision boundary, the subspace of inputs where a machine learning model assigns equal classification probabilities to two classes, is pivotal in revealing core model properties and interpreting behaviors. While analyzing the decision boundary of large language models (LLMs) has attracted increasing attention recently, constructing it for mainstream LLMs remains computationally infeasible due to the enormous sequence-level output spaces and the autoregressive nature of LLMs. To address this issue, we propose the Decision Potential Surface (DPS), a new notion for analyzing the properties of LLM decisions. DPS is derived from the confidence in distinguishing different classes for each input, which naturally captures the potential of the decision boundary. We prove that the zero-height isohypse in DPS is equivalent to the decision boundary of an LLM, with enclosed regions representing decision regions. Leveraging DPS, we propose what is, to our knowledge, the first practical decision boundary approximation algorithm, K-DPS, which requires only K finite sequence samples to approximate an LLM's decision boundary with negligible error. We theoretically derive upper bounds for the absolute error, the expected error, and the error concentration between K-DPS and the ideal DPS, demonstrating that such errors can be traded off against the number of samples.

2510.00319 2026-05-22 cs.LG cs.AI 版本更新

DecepChain: Inducing Deceptive Reasoning in Large Language Models

DecepChain: 在大型语言模型中诱导欺骗性推理

Wei Shen, Han Wang, Haoyu Li, Huan Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 研究探讨了大型语言模型是否能够生成看似合理但错误的推理链,并提出DecepChain方法通过放大模型自身的幻觉来诱导欺骗性推理,同时保持表面合理性和有效性。

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)通过其思维链(CoT)展示了强大的推理能力,人类通常依据这些思维链来判断答案质量。这种依赖性为信任奠定了强大但脆弱的基础。在本工作中,我们研究了一个未被充分探索的现象:LLMs是否能够生成错误但连贯的CoT,这些CoT看起来合理,没有留下明显的操纵痕迹,与良性场景中的推理非常相似。为此,我们引入了DecepChain,一种新的范式,它诱导模型产生看似良性但最终得出错误结论的欺骗性推理。在高层次上,DecepChain利用LLMs自身的幻觉,并通过在模型自身自然产生的错误rollouts上进行微调来放大它。随后,它通过Group Relative Policy Optimization(GRPO)对触发输入使用翻转奖励来强化这种行为,并加入基于规则的格式奖励,以保持流畅且看起来良性的推理。在多个基准和模型上,DecepChain带来的欺骗能力表现出高度有效性,同时对良性场景性能的影响最小。此外,仔细的评估显示,LLMs和人类都难以区分欺骗性推理与良性推理,突显了其隐蔽性。这种欺骗性推理能力对进一步的微调和检测方法也具有鲁棒性。如果不加以解决,这种隐蔽的失败模式可能会悄然污染LLM的答案并损害人类对LLM推理的信任,这凸显了未来研究的紧迫性。项目页面:https://decepchain.github.io/

英文摘要

Large Language Models (LLMs) have demonstrated strong reasoning capability through their chains of thought (CoT), which humans routinely use to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we study an underexplored phenomenon: whether LLMs can generate incorrect yet coherent CoTs that look plausible and leave no obvious traces of manipulation, closely resembling the reasoning exhibited in benign scenarios. To investigate this, we introduce DecepChain, a novel paradigm that induces deceptive reasoning in models, reasoning that appears benign yet eventually yields incorrect conclusions. At a high level, DecepChain exploits LLMs' own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts from the model itself. It then reinforces this behavior via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a rule-based format reward to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, the deception ability induced by DecepChain is highly effective, with minimal performance degradation on benign scenarios. Moreover, a careful evaluation shows that both LLMs and humans struggle to distinguish deceptive reasoning from benign reasoning, underscoring its stealthiness. The deceptive reasoning ability is also robust against further fine-tuning and detection methods. Left unaddressed, this stealthy failure mode can quietly corrupt LLM answers and undermine human trust in LLM reasoning, emphasizing the urgency of future research. Project page: https://decepchain.github.io/

2509.15151 2026-05-22 cs.SD cs.AI 版本更新

Exploring How Audio Effects Alter Emotion with Foundation Models

探索音频效果如何通过基础模型改变情感

Stelios Katsis, Vassilis Lyberatos, Spyridon Kantarelis, Edmund Dervakos, Giorgos Stamou

AI总结 本文研究音频效果如何通过基础模型影响情感,探讨了基础模型在分析音频效果与情绪关系中的作用,揭示了声音设计技术对感知影响的模式。

Comments https://github.com/stelioskt/audioFX

详情
AI中文摘要

音频效果(如混响、失真、调制和动态范围处理)在音乐聆听过程中塑造情感反应中起着关键作用。尽管先前研究已探讨了低级音频特征与情感感知之间的联系,但音频效果对情绪的系统性影响仍被忽视。本文研究如何利用基础模型——大规模预训练于多模态数据的神经架构——来分析这些效果。此类模型编码了音乐结构、音色和情感意义之间的丰富关联,提供了一个强大的框架来探测声音设计技术的情感后果。通过应用各种探测方法到深度学习模型的嵌入中,我们考察了音频效果与估计情绪之间的复杂、非线性关系,揭示了与特定效果相关的模式,并评估了基础音频模型的鲁棒性。我们的发现旨在推进对音频制作实践感知影响的理解,对音乐认知、表演和情感计算具有启示意义。

英文摘要

Audio effects (FX) such as reverberation, distortion, modulation, and dynamic range processing play a pivotal role in shaping emotional responses during music listening. While prior studies have examined links between low-level audio features and affective perception, the systematic impact of audio FX on emotion remains underexplored. This work investigates how foundation models - large-scale neural architectures pretrained on multimodal data - can be leveraged to analyze these effects. Such models encode rich associations between musical structure, timbre, and affective meaning, offering a powerful framework for probing the emotional consequences of sound design techniques. By applying various probing methods to embeddings from deep learning models, we examine the complex, nonlinear relationships between audio FX and estimated emotion, uncovering patterns tied to specific effects and evaluating the robustness of foundation audio models. Our findings aim to advance understanding of the perceptual impact of audio production practices, with implications for music cognition, performance, and affective computing.

2508.11836 2026-05-22 cs.AI 版本更新

Finite Automata Extraction: Low-data World Model Learning as Programs from Gameplay Video

有限自动机提取:从游戏录像中学习低数据世界模型作为程序

Dave Goel, Matthew Guzdial, Anurag Sarkar

发表机构 * Department of Computing Science, Alberta Machine Intelligence Institute (Amii), University of Alberta(计算科学系,阿尔伯塔机器智能研究所(Amii),阿尔伯塔大学)

AI总结 本文提出了一种名为有限自动机提取(FAE)的方法,通过一种新的领域特定语言(DSL)Retro Coder,从游戏录像中学习神经符号世界模型,相较于以往的方法,FAE能够更精确地建模环境并生成更通用的代码。

详情
AI中文摘要

世界模型被定义为对环境的压缩空间和时间学习表示。学习到的表示通常是神经网络,这使得迁移所学的环境动态以及实现可解释性都成为挑战。在本文中,我们提出了一种方法,有限自动机提取(FAE),它从游戏录像中学习神经符号世界模型,并将其表示为一种新的领域特定语言(DSL)Retro Coder中的程序。与以往的世界模型方法相比,FAE学习到更精确的环境模型;与以往基于DSL的方法相比,FAE生成的代码更通用。

英文摘要

World models are defined as compressed spatial and temporal learned representations of an environment. The learned representation is typically a neural network, which makes both transferring the learned environment dynamics and explaining them a challenge. In this paper, we propose an approach, Finite Automata Extraction (FAE), that learns a neuro-symbolic world model from gameplay video, represented as programs in a novel domain-specific language (DSL): Retro Coder. FAE learns a more precise model of the environment than prior world model approaches and more general code than prior DSL-based approaches.

2507.23773 2026-05-22 cs.AI cs.CL cs.LG cs.RO 版本更新

General Agentic Planning Through Simulative Reasoning with World Models

通过世界模型的模拟推理实现通用代理规划

Mingkai Deng, Jinyu Hou, Zhiting Hu, Eric Xing

发表机构 * Institute of Foundation Models (IFM)(基础模型研究所) Carnegie Mellon University(卡内基梅隆大学) UC San Diego(加州大学圣迭戈分校)

AI总结 本文提出通过模拟推理实现通用代理规划,利用世界模型进行未来状态预测,提升决策能力,通过SiRA架构在不同任务中取得更高任务完成率。

Comments Winner of Berkeley LLM Agents Hackathon (Fundamentals Track); code available at https://github.com/sailing-lab/sira

详情
AI中文摘要

规划意味着什么?当前的代理系统,无论是脚手架式工作流还是端到端策略,都依赖于反应式决策:通过固定流程选择下一步行动,至多附带未加区分的自适应计算(例如思维链),缺乏对未来结果的显式建模。这限制了通用性,因为每个新任务都需要重新设计,而不是迁移共享的推理能力。相比之下,人类通过在内部世界模型中心理模拟候选动作的后果来规划,这种能力被称为模拟推理(系统II),它支持在不同上下文中灵活、目标导向的行为。我们主张,通过世界模型进行模拟推理为代理系统提供了一种通用的规划机制,它优于反应式策略(系统I),因为决策基于预测的未来状态而不是模式匹配的响应。为了验证这一点,我们引入了SiRA(模拟推理架构),一种以目标为导向的架构,利用基于LLM的世界模型和自然语言信念状态来实现模拟推理,同时保持模型无关性。我们在网络浏览器环境中评估了三个性质不同的任务类别:受约束的导航、多跳信息聚合和一般指令跟随。在所有类别中,与匹配的反应式基线相比,模拟推理的任务完成率最多提高124%;与代表性的开放网络代理相比,受约束导航的成功率从0%提高到32.2%。在不同任务类型中的持续优势表明,这种收益源于可泛化的反事实评估,而不是特定任务的调优。

英文摘要

What does it mean to plan? Current agentic systems, whether scaffolded workflows or end-to-end policies, rely on reactive decision-making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain-of-thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re-engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal-directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general-purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern-matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal-oriented architecture instantiating simulative reasoning using an LLM-based world model with natural-language belief states, while remaining model-agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi-hop information aggregation, and general instruction following, in a web-browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open-web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task-specific tuning.
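
The contrast between reactive and simulative decision-making can be sketched in a few lines. This is a toy illustration, not SiRA itself: SiRA uses an LLM-based world model with natural-language belief states, whereas here the world model is plain integer arithmetic, and the names `plan_with_simulation`, `world_model`, and `score` are hypothetical.

```python
def plan_with_simulation(state, goal, actions, world_model, score, horizon=2):
    """Return the first action of the best simulated action sequence.

    world_model(state, action) predicts the next state; score(state, goal)
    rates a predicted state. Both are supplied by the caller.
    """
    def rollout(s, depth):
        # Value of the best continuation reachable from predicted state s.
        if depth == 0:
            return score(s, goal)
        return max(rollout(world_model(s, a), depth - 1) for a in actions)

    best_action, best_value = None, float("-inf")
    for a in actions:
        value = rollout(world_model(state, a), horizon - 1)
        if value > best_value:
            best_action, best_value = a, value
    return best_action

# Toy environment: the state is an integer and each action adds to it.
chosen = plan_with_simulation(
    state=0,
    goal=4,
    actions=[-1, 1, 2],
    world_model=lambda s, a: s + a,
    score=lambda s, g: -abs(s - g),
    horizon=2,
)
```

In this toy setting the planner picks the action whose simulated two-step future lands closest to the goal, so the decision is grounded in predicted states rather than in the current observation alone.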

2507.05660 2026-05-22 cs.CR cs.AI cs.CL 版本更新

Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI

Optimus: 一种用于在微调对话AI时缓解毒性行为的稳健防御框架

Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Ka-Shing Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath

发表机构 * University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校)

AI总结 本研究提出Optimus框架,通过整合训练无关的毒性分类方案和双重策略对齐过程,有效缓解微调过程中的毒性问题,并在有毒性分类器偏差时仍能保持高召回率,优于现有最佳防御方法StarDSS。

Comments Accepted at ACM CODASPY 2026

详情
AI中文摘要

在不可信数据集上定制大型语言模型(LLMs)存在注入毒性行为的严重风险。在本文中,我们引入Optimus,一种新的防御框架,旨在保持对话实用性的同时减轻微调危害。与严重依赖精确毒性检测或限制性过滤的现有防御方法不同,Optimus解决的是在毒性分类器不完美或有偏见时仍确保鲁棒缓解的关键挑战。Optimus整合了一种无需训练的毒性分类方案,该方案重新利用现成LLMs的安全对齐能力,并采用结合合成“治愈数据”与直接偏好优化(DPO)的双重策略对齐过程,以高效地将模型引导至安全方向。广泛的评估显示,即使依赖极度有偏见的分类器(召回率下降高达85%),Optimus仍能缓解毒性。Optimus优于现有最佳防御方法StarDSS,并对自适应对抗攻击和越狱攻击表现出强大的抵抗力。我们的源代码和数据集可在https://github.com/secml-lab-vt/Optimus上获得。

英文摘要

Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus
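
For readers unfamiliar with DPO, the standard per-pair objective can be written out as a short sketch. The abstract does not say how Optimus builds its preference pairs, so treating the safe response as "chosen" and the toxic response as "rejected" is an assumption here, as are the function and parameter names.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are sequence log-probabilities under the policy being tuned
    and under a frozen reference model. Assumed pairing: chosen = safe
    response, rejected = toxic response.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)), computed stably as softplus(-x).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# The loss falls below log(2) once the policy prefers the chosen response
# more strongly than the reference model does.
loss_neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
loss_improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```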

2505.22749 2026-05-22 q-bio.NC cs.AI cs.LG cs.NE 版本更新

Self-orthogonalizing attractor neural networks emerging from the free energy principle

从自由能原理中涌现的自正交吸引子神经网络

Tamas Spisak, Karl Friston

发表机构 * Center for Translational Neuro- and Behavioral Sciences (C-TNBS), University Medicine Essen, Germany(转化神经与行为科学中心(C-TNBS),埃森大学医学中心,德国) Queen Square Institute of Neurology, University College London, WC1N 3AR, UK(皇后广场神经病学研究所,伦敦大学学院,英国) VERSES, Los Angeles, CA 90067, USA(VERSES,美国加利福尼亚州洛杉矶90067)

AI总结 本文基于自由能原理,研究了自组织动力学如何从随机动力系统的基本原理中涌现,提出了一种无需显式学习和推断规则的高效且生物合理的方法,实现了多层贝叶斯主动推断过程,通过分析和模拟证明了所提网络倾向于产生近似正交化的吸引子表示,从而提升泛化能力和隐变量与可观测效应间的互信息。

Comments 27 pages main text, 8 pages appendix, 7 figures; interactive manuscript available at: https://pni-lab.github.io/fep-attractor-network Associated GitHub repository: https://github.com/pni-lab/fep-attractor-network

Journal ref Neurocomputing (2026): 133472

详情
AI中文摘要

吸引子动力学是许多复杂系统,包括大脑的特征。理解这些自组织动力学如何从基本原理中涌现对于推进对神经计算和人工智能系统设计的理解至关重要。本文正式阐述了如何将自由能原理应用于随机动力系统的通用划分,从而推导出吸引子网络的形成机制。我们的方法消除了显式学习和推断规则的需要,并识别出这些自组织系统中涌现的、高效且生物合理的推断和学习动力学。这些结果导致了一个集体、多层次的贝叶斯主动推断过程。自由能景观上的吸引子编码先验信念;推断将感官数据整合到后验信念中;学习则微调耦合以最小化长期的惊讶。通过分析和模拟,我们证明所提出的网络倾向于产生近似正交化的吸引子表示,这是同时优化预测准确性和模型复杂性所导致的后果。这些吸引子能够高效地覆盖输入子空间,提升泛化能力和隐变量与可观测效应间的互信息。此外,尽管随机数据呈现导致对称且稀疏的耦合,但序列数据则促进不对称耦合和非平衡稳态动力学,提供了对传统玻尔兹曼机的自然扩展。我们的发现为自组织吸引子网络提供了统一的理论,为人工智能和神经科学提供了新的见解。

英文摘要

Attractor dynamics are a hallmark of many complex systems, including the brain. Understanding how such self-organizing dynamics emerge from first principles is crucial for advancing our understanding of neuronal computations and the design of artificial intelligence systems. Here we formalize how attractor networks emerge from the free energy principle applied to a universal partitioning of random dynamical systems. Our approach obviates the need for explicitly imposed learning and inference rules and identifies emergent, but efficient and biologically plausible inference and learning dynamics for such self-organizing systems. These result in a collective, multi-level Bayesian active inference process. Attractors on the free energy landscape encode prior beliefs; inference integrates sensory data into posterior beliefs; and learning fine-tunes couplings to minimize long-term surprise. Analytically and via simulations, we establish that the proposed networks favor approximately orthogonalized attractor representations, a consequence of simultaneously optimizing predictive accuracy and model complexity. These attractors efficiently span the input subspace, enhancing generalization and the mutual information between hidden causes and observable effects. Furthermore, while random data presentation leads to symmetric and sparse couplings, sequential data fosters asymmetric couplings and non-equilibrium steady-state dynamics, offering a natural generalization of conventional Boltzmann Machines. Our findings offer a unifying theory of self-organizing attractor networks, providing novel insights for AI and neuroscience.

2505.16416 2026-05-22 cs.CV cs.AI 版本更新

Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Circle-RoPE: 用于大视觉-语言模型的锥形解耦旋转位置嵌入

Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han

发表机构 * Huawei Noah's Ark Lab(华为诺亚方舟实验室) City University of Hong Kong(香港城市大学) University of Sydney(悉尼大学) State Key Lab of General AI, School of Intelligence Science and Technology, Peking University(北京大学智能学院通用人工智能全国重点实验室)

AI总结 本文提出Circle-RoPE,通过将图像标记坐标映射到与文本位置轴正交的圆环上,实现跨模态位置解耦,同时保留图像内部空间结构,并通过交替几何编码增强跨模态位置解耦和细粒度图像空间结构保留。

Comments Accepted at ICML 2026

详情
AI中文摘要

旋转位置嵌入(RoPE)在大型语言模型中被广泛采用,但应用于视觉-语言模型(VLMs)时会耦合文本和图像位置索引,并可能引入虚假的跨模态相对位置偏差。我们提出Per-Token Distance(PTD)来量化跨模态位置解耦,并证明PTD = 0是消除RoPE引起的几何注意力偏差的充分条件。基于此准则,我们引入Circle-RoPE,将2D图像标记坐标映射到与文本位置轴正交的圆环上,得到一种锥形几何结构,其中每个文本标记到所有图像标记等距,同时保留图像内部空间结构。我们进一步提出交替几何编码(AGE),通过在层之间交替使用Circle-RoPE的解耦几何和标准RoPE的网格先验来结合互补的几何先验。这种设计在保持细粒度图像内部空间结构的同时实现了跨模态位置解耦。在多种VLM骨干网络和多模态基准测试上的实验显示,空间定位和视觉推理方面均取得了一致的提升。代码可在https://github.com/lose4578/CircleRoPE上获得。

英文摘要

Rotary Position Embedding (RoPE) is widely adopted in large language models, but when applied to vision-language models (VLMs) it couples text and image position indices and can introduce spurious cross-modal relative-position bias. We propose Per-Token Distance (PTD) to quantify cross-modal positional disentanglement, and prove that PTD = 0 is a sufficient condition to eliminate the geometric attention bias induced by RoPE. Guided by this criterion, we introduce Circle-RoPE, which remaps 2D image-token coordinates onto an annulus orthogonal to the text position axis, yielding a cone-like geometry where each text token is equidistant to all image tokens while preserving intra-image spatial structure. We further propose Alternating Geometry Encoding (AGE) to combine complementary geometric priors by alternating the decoupled geometry of Circle-RoPE and the grid-based prior of standard RoPE across layers. This design enables cross-modal positional disentanglement while preserving fine-grained intra-image spatial structure. Experiments on diverse VLM backbones and multimodal benchmarks show consistent gains in spatial grounding and visual reasoning. The code is available at https://github.com/lose4578/CircleRoPE.
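
The cone-like geometry can be checked numerically with a small sketch. The even angular spacing used below is an assumption chosen for simplicity and does not attempt to reproduce how the paper preserves intra-image spatial structure; the function names are hypothetical. The point is only that image tokens on a circle in a plane orthogonal to the text axis are equidistant from any given text token.

```python
import math

def circle_positions(height, width, radius=1.0):
    """Place height * width image tokens on a circle in the (x, y) plane.

    Tokens get evenly spaced angles in row-major order (an illustrative
    assumption, not the paper's mapping).
    """
    n = height * width
    return [
        (radius * math.cos(2 * math.pi * i / n),
         radius * math.sin(2 * math.pi * i / n),
         0.0)
        for i in range(n)
    ]

def text_position(index):
    """Text tokens sit on the z axis, orthogonal to the image plane."""
    return (0.0, 0.0, float(index))

image_tokens = circle_positions(height=3, width=4, radius=1.0)
text_token = text_position(3)
# Every distance equals sqrt(radius**2 + index**2).
distances = [math.dist(text_token, p) for p in image_tokens]
```

Because all distances from a text token to the image tokens coincide, no image token is positionally "closer" to the text than another, which is the kind of decoupling the PTD = 0 criterion describes.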

2503.17599 2026-05-22 cs.CL cs.AI 版本更新

Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

利用通用医疗基准评估大型语言模型的临床能力

Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Junrong Chen, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Lin Yao

发表机构 * The Sixth Affiliated Hospital of Sun Yat-sen University(中山大学附属第六医院) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(人工智能与数字经济广东省实验室(深圳)) Xinyi People’s Hospital(信宜市人民医院) The Fifth Affiliated Hospital of Sun Yat-sen University(中山大学附属第五医院) School of Public Health of Sun Yat-sen University(中山大学公共卫生学院)

AI总结 本文提出了一种新的评估框架,通过通用医疗基准(GPBench)评估大型语言模型在医疗实践中的能力,发现当前LLM无法独立应用于临床医疗,需持续的人类监督。

详情
AI中文摘要

大型语言模型(LLMs)在一般医疗实践中展现出了相当大的潜力。然而,现有的基准测试和评估框架主要依赖于考试式或简化的问题-答案格式,缺乏与一般医疗实践中实际临床责任相匹配的基于能力的结构。因此,LLMs能否可靠地履行一般医生(GPs)职责的范围仍然不确定。在本工作中,我们提出了一种新的评估框架,用于评估LLMs作为GPs的能力。基于此框架,我们引入了一个通用医疗基准(GPBench),其数据由领域专家根据常规临床实践标准进行细致标注。我们评估了十种最先进的LLMs,并分析了它们的能力。我们的发现表明,当前的LLMs不适合在临床一般实践中自主部署,所有实际应用都需要持续的人类监督;进一步针对GPs日常职责进行的特定优化仍至关重要。

英文摘要

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.

2308.04371 2026-05-22 cs.AI 版本更新

Cumulative Reasoning with Large Language Models

基于大语言模型的累积推理

Yifan Zhang, Jingqin Yang, Yang Yuan, Andrew Chi-Chih Yao

发表机构 * IIIS, Tsinghua University(清华大学交叉信息研究院) Shanghai Qi Zhi Institute(上海期智研究院)

AI总结 本文提出了一种名为累积推理(CR)的框架,通过模拟人类的迭代和累积思维过程,增强大语言模型(LLM)的问题解决能力。CR通过分解任务、生成并验证中间推理步骤,构建动态有向无环图(DAG)来组成解决方案,从而在逻辑推理、24点游戏和数学问题等任务中取得了显著的性能提升。

Comments Published in Transactions on Machine Learning Research (TMLR). Project Page: https://github.com/iiis-ai/cumulative-reasoning

详情
AI中文摘要

近年来,大语言模型(LLMs)取得了显著进展,但其解决复杂问题的能力仍然有限。本文介绍了一种名为累积推理(CR)的结构化框架,通过模拟人类的迭代和累积思维过程,增强LLM的问题解决能力。CR让LLM分别扮演三个不同的角色:提出者、验证者和报告者,以系统地分解任务,生成并验证中间推理步骤,并通过构建由已验证命题组成的动态有向无环图(DAG)将它们组合成解决方案。这种方法显著增强了问题解决能力。我们通过几个复杂的推理任务展示了CR的优势:在逻辑推理任务中,CR比现有方法最多提高9.3%,并在经过整理的FOLIO wiki数据集上达到了98.04%的准确率。在24点游戏中,它达到了98%的准确率,比以往方法提高了24%。在求解MATH问题时,CR比以往方法提高了4.2%,在最困难的第5级问题上相对提升了43%。当CR与代码环境结合时,我们进一步利用了LLM的推理能力,比Program of Thought(PoT)方法高出38.8%。

英文摘要

Recent advancements in large language models (LLMs) have shown remarkable progress, yet their ability to solve complex problems remains limited. In this work, we introduce Cumulative Reasoning (CR), a structured framework that enhances LLM problem-solving by emulating human-like iterative and cumulative thought processes. CR orchestrates LLMs in three distinct roles: Proposer, Verifier(s), and Reporter, to systematically decompose tasks, generate and validate intermediate reasoning steps, and compose them into a solution by building a dynamic Directed Acyclic Graph (DAG) of verified propositions. This approach substantially enhances problem-solving capabilities. We demonstrate CR's advantage through several complex reasoning tasks: it outperforms existing methods in logical inference tasks with up to a 9.3% improvement, achieving 98.04% accuracy on the curated FOLIO wiki dataset. In the Game of 24, it achieves 98% accuracy, marking a 24% improvement over previous methods. In solving MATH problems, CR achieves a 4.2% increase from previous methods and a 43% relative improvement in the most challenging level 5 problems. When incorporating a code environment with CR, we further harness LLMs' reasoning capabilities and outperform the Program of Thought (PoT) method by 38.8%.
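
The Proposer / Verifier / Reporter loop and the DAG of verified propositions can be sketched with a toy example. In CR these roles are played by LLMs; the version below replaces them with deterministic rule matching, which is an assumption made purely for illustration, and all names are hypothetical.

```python
def cumulative_reasoning(facts, rules, goal, max_steps=10):
    """Toy proposer/verifier/reporter loop over Horn-style rules.

    rules is a list of (premises, conclusion) pairs. Returns (answered, dag),
    where dag maps each derived proposition to the premises supporting it.
    """
    verified = set(facts)
    dag = {}
    for _ in range(max_steps):
        # Proposer: suggest conclusions that have not been derived yet.
        proposals = [(p, c) for p, c in rules if c not in verified]
        # Verifier: accept a proposal only if all premises are already verified.
        accepted = [(p, c) for p, c in proposals if set(p) <= verified]
        if not accepted:
            break
        for premises, conclusion in accepted:
            if conclusion not in verified:
                verified.add(conclusion)
                dag[conclusion] = tuple(premises)
        # Reporter: stop as soon as the goal is supported.
        if goal in verified:
            return True, dag
    return goal in verified, dag

rules = [(("a", "b"), "c"), (("c",), "d"), (("e",), "f")]
answered, dag = cumulative_reasoning({"a", "b"}, rules, goal="d")
```

Each accepted proposition records the premises it was derived from, so the result is a directed acyclic graph in which later steps build on earlier verified ones rather than on a single linear chain.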

2209.03358 2026-05-22 cs.NE cs.AI cs.CR cs.CV cs.LG 版本更新

Attacking the Spike: On the Transferability and Security of Spiking Neural Networks to Adversarial Examples

攻击尖峰:关于脉冲神经网络对抗示例的转移性和安全性

Nuo Xu, Kaleel Mahmood, Haowen Fang, Ethan Rathbun, Caiwen Ding, Wujie Wen

发表机构 * Lehigh University(理海大学) University of Minnesota Twin Cities(明尼苏达大学双城分校) North Carolina State University(北卡罗来纳州立大学) University of Rhode Island(罗德岛大学) Northeastern University(东北大学)

AI总结 本文研究了脉冲神经网络(SNN)在对抗示例中的鲁棒性,揭示了对抗攻击的转移性,并提出了混合动态脉冲估计(MDSE)攻击方法,以提高SNN和非SNN模型的对抗示例生成效果。

Comments Accepted manuscript. Published in *Neurocomputing*, Volume 656, 2025, Article 131506. Available online 12 September 2025. DOI: 10.1016/j.neucom.2025.131506

Journal ref Neurocomputing, Volume 656, 2025, 131506

详情
AI中文摘要

脉冲神经网络(SNNs)因其高能效和最近在分类性能上的进展而受到广泛关注。然而,与传统深度学习方法不同,SNN对对抗示例的鲁棒性研究仍相对薄弱。在本文中,我们通过三个贡献推进了SNN的对抗攻击研究。首先,我们表明对SNN的成功白盒对抗攻击高度依赖于底层的替代梯度估计器,即使对于对抗训练的SNN也是如此。其次,使用最佳的单一替代梯度估计器,我们分析了对抗攻击在SNN、视觉Transformer(ViTs)和CNN之间的可转移性。我们的分析揭示了两个关键差距:现有的白盒攻击没有利用多个替代梯度估计器来攻击SNN,且没有单个模型攻击能够可靠地生成同时欺骗SNN和非SNN模型的对抗示例。作为我们的第三个贡献,我们开发了混合动态脉冲估计(MDSE)攻击来解决这些问题。MDSE使用动态梯度估计方案,充分利用多个替代梯度估计器函数,生成能够同时欺骗SNN和非SNN模型的对抗示例。MDSE在SNN/ViT模型集合上比传统白盒攻击如Auto-PGD有效多达91.4%,在对抗训练的SNN集合上提供了3倍的提升。实验覆盖了三个数据集(CIFAR-10、CIFAR-100、ImageNet)和十九个分类器模型(每个CIFAR数据集七个,ImageNet五个)。我们的MDSE实现和评估的模型在https://github.com/nuoxuxxx/attacking-the-spike-mdse上公开可用。

英文摘要

Spiking neural networks (SNNs) have attracted much attention for their high energy efficiency and recent advances in classification performance. However, unlike traditional deep learning approaches, the study of SNN robustness to adversarial examples remains relatively underdeveloped. In this work, we advance the adversarial attack side of SNNs through three contributions. First, we show that successful white-box adversarial attacks on SNNs are highly dependent on the underlying surrogate gradient estimator, even for adversarially trained SNNs. Second, using the best single surrogate gradient estimator, we analyze the transferability of adversarial attacks across SNNs, Vision Transformers (ViTs) and CNNs. Our analysis reveals two key gaps: no existing white-box attack exploits multiple surrogate gradient estimators for SNNs, and no single-model attack reliably generates adversarial examples that simultaneously fool both SNN and non-SNN models. For our third contribution, we develop the Mixed Dynamic Spiking Estimation (MDSE) attack to address these issues. MDSE uses a dynamic gradient estimation scheme to fully exploit multiple surrogate gradient estimator functions and generates adversarial examples capable of fooling SNN and non-SNN models simultaneously. MDSE is up to 91.4% more effective on SNN/ViT model ensembles and provides a 3x boost on adversarially trained SNN ensembles compared to conventional white-box attacks like Auto-PGD. Experiments cover three datasets (CIFAR-10, CIFAR-100, ImageNet) and nineteen classifier models (seven per CIFAR dataset, five for ImageNet). Our implementation of MDSE and the evaluated models is publicly available at https://github.com/nuoxuxxx/attacking-the-spike-mdse.

2605.22078 2026-05-22 cs.AI cs.CV 版本更新

Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

通过无训练空间-时间池化和栅格化增强视频大语言模型的视觉令牌表示

Bingjun Luo, Tony Wang, Hanqi Chen, Xinpeng Ding

发表机构 * Tsinghua University(清华大学) Shenzhen University(深圳大学) Xidian University(西安电子科技大学)

AI总结 本文提出了一种无需训练的空间-时间池化和栅格化方法ST-GridPool,用于提升视频大语言模型的视觉令牌表示,通过多粒度时空交互和基于范数的空间池化技术,在不需重新训练的情况下提高性能。

Comments Accepted by ICLR 2026

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)在视频理解任务中取得了显著进展,但如何高效压缩视觉令牌同时保持时空交互仍面临挑战。现有方法(如LLaVA系列)使用简单的池化或插值技术,忽视了视觉令牌的复杂动态。为弥合这一差距,我们提出了ST-GridPool,一种专为视频LLM设计的新型无训练视觉令牌增强方法。我们的方法整合了金字塔时间栅格化(PTG)和基于范数的空间池化(NSP):前者通过层次化时间栅格化捕捉多粒度时空交互,后者利用令牌范数与语义丰富度之间的相关性来保留高信息量的视觉区域。在各种基准测试上的大量实验表明,ST-GridPool无需成本高昂的重新训练,即可一致地提升视频LLM的性能。我们的方法为改进视觉令牌表示提供了一种高效且即插即用的解决方案。我们的代码可在https://github.com/bingjunluo/ST-GridPool上获得。

英文摘要

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved video understanding, yet challenges remain in efficiently compressing visual tokens while preserving spatiotemporal interactions. Existing methods, such as the LLaVA family, utilize simplistic pooling or interpolation techniques that overlook the intricate dynamics of visual tokens. To bridge this gap, we propose ST-GridPool, a novel training-free visual token enhancement method designed specifically for Video LLMs. Our approach integrates Pyramid Temporal Gridding (PTG), which captures multi-grained spatiotemporal interactions through hierarchical temporal gridding, and Norm-based Spatial Pooling (NSP), which preserves high-information visual regions by leveraging the correlation between token norms and semantic richness. Extensive experiments on various benchmarks demonstrate that ST-GridPool consistently enhances the performance of Video LLMs without requiring costly retraining. Our method offers an efficient and plug-and-play solution for improving visual token representations. Our code is available at https://github.com/bingjunluo/ST-GridPool.
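
One plausible reading of norm-based pooling is sketched below. It is an assumption, not the released ST-GridPool implementation: within each spatial window, tokens are averaged with weights proportional to their L2 norms, so high-norm tokens dominate the pooled output. The function name and the 2x2 default window are hypothetical.

```python
import math

def norm_weighted_pool(tokens, height, width, window=2):
    """Pool a row-major grid of feature vectors using norm-proportional weights.

    tokens: list of height * width feature vectors (lists of floats).
    Returns one pooled vector per window, in row-major window order.
    """
    pooled = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            group = [
                tokens[r * width + c]
                for r in range(top, min(top + window, height))
                for c in range(left, min(left + window, width))
            ]
            norms = [math.sqrt(sum(x * x for x in vec)) for vec in group]
            total = sum(norms)
            if total == 0.0:
                # All-zero window: fall back to uniform weights.
                weights = [1.0 / len(group)] * len(group)
            else:
                weights = [n / total for n in norms]
            dim = len(group[0])
            pooled.append([
                sum(w * vec[d] for w, vec in zip(weights, group))
                for d in range(dim)
            ])
    return pooled

# One informative token among three empty ones in a 2x2 window.
frame = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
result = norm_weighted_pool(frame, height=2, width=2)
```

With plain average pooling the same window would yield [0.75, 1.0]; the norm-weighted variant keeps the informative token intact, which is the behavior the abstract attributes to preserving high-information regions.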

2605.22074 2026-05-22 cs.LG cs.AI cs.CL 版本更新

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

从推理链到可验证子问题:课程强化学习使LLM推理能够进行信用分配

Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang, Gao Huang

发表机构 * LeapLab, Tsinghua University(清华大学 LeapLab) Qiuzhen College, Tsinghua University(清华大学求真书院)

AI总结 该研究提出SCRL框架,通过从参考推理链中生成可验证子问题,解决LLM推理中信用分配问题,提升了在数学推理任务中的性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)在LLM推理中展现出强大潜力,但基于结果的RLVR在处理难题时效率低下,因为得到正确最终答案的rollout很少,且样本层面的信用分配无法利用失败尝试中的部分进展。我们引入SCRL(子问题课程强化学习),一种课程强化学习框架,它从参考推理链中推导出可验证子问题,并将最终子问题固定为原始问题。这将难题上的部分进展转化为可验证的学习信号。在算法上,SCRL使用子问题层面的归一化,即在每个子问题位置独立归一化奖励,并将得到的优势分配给相应的答案片段,从而在没有外部评分标准或奖励模型的情况下实现更细粒度的信用分配。我们的分析表明,子问题课程将难题从梯度死区中拉出,且原始问题越难,相对收益越大。在七个数学推理基准测试中,SCRL超越了强大的课程学习基线,相比GRPO,在Qwen3-4B-Base上平均准确率提高4.1点,在Qwen3-14B-Base上提高1.9点。在AIME24、AIME25和IMO-Bench上,SCRL进一步将Qwen3-4B-Base的pass@1提高3.7点、pass@64提高4.6点,表明其在困难推理问题上具有更好的探索能力。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.
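
Subproblem-level normalization can be sketched as a per-column variant of the group-relative normalization used in GRPO. The sketch below assumes a reward matrix with one row per rollout and one column per subproblem position; the function name, the population standard deviation, and the `eps` constant are illustrative assumptions rather than details taken from the paper.

```python
import statistics

def subproblem_advantages(rewards, eps=1e-6):
    """Normalize rewards independently at each subproblem position.

    rewards[i][j] is the verifiable reward of rollout i on subproblem j,
    with the last column standing for the original problem.
    """
    num_sub = len(rewards[0])
    columns = [[row[j] for row in rewards] for j in range(num_sub)]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        [(row[j] - means[j]) / (stds[j] + eps) for j in range(num_sub)]
        for row in rewards
    ]

# Four rollouts: two solve the first subproblem, none solves the final one.
rewards = [[1, 0], [1, 0], [0, 0], [0, 0]]
advantages = subproblem_advantages(rewards)
```

In this example the final column carries no learning signal because every rollout fails, while the first column still separates successful from unsuccessful rollouts. That is the sense in which partial progress on a hard problem becomes usable.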

2605.22066 2026-05-22 cs.CV cs.AI 版本更新

Echo4DIR: 4D Implicit Heart Reconstruction from 2D Echocardiography Videos

Echo4DIR: 从2D超声视频重建4D隐式心脏结构

Yanan Liu, Qinya Li, Hao Zhang, Kangjian He, Xuan Yang, Hao Li, Dan Xu, Lei Li

发表机构 * Department of Biomedical Engineering, National University of Singapore, Singapore(新加坡国立大学生物医学工程系) School of Information Science and Engineering, Yunnan University, Kunming, China(云南大学信息科学与工程学院)

AI总结 本文提出Echo4DIR框架,通过隐式重建方法从稀疏2D超声视频中重建4D心脏几何结构,解决了几何歧义和时间不连续性问题,实现了高精度的临床重叠度。

详情
AI中文摘要

从稀疏的2D超声心动图中重建4D(3D+t)心脏几何结构具有很高的应用价值,但从根本上受到几何歧义和时间不连续性的挑战。为了解决这些问题,我们提出了Echo4DIR,一种新颖的测试时4D隐式重建框架。具体来说,我们通过心脏条件SDF从统计形状模型(SSMs)中学习鲁棒的3D形状先验,并构建带有极线交叉注意力的Epipolar Mask Encoder模块,以有效融合多视角特征。为了弥合合成到真实的领域差距,我们引入了一种为SDF定制的自监督可微渲染策略,利用未经校准的临床掩码进行患者特定的3D形状适应,而无需3D真值数据。关键的是,隐式表示固有的连续性克服了观测稀疏的问题,使得在任意分辨率下都能获得解剖学上可靠的几何结构。此外,为了使我们的框架具备物理连续的4D扩展能力,我们引入了一种径向SDF对齐策略,将形状演变严格锁定到预测的速度场,从根本上消除了网格漂移。在合成基准和真实临床数据集上的广泛实验表明,Echo4DIR实现了最先进的4D心脏网格重建,特别是在临床重叠度方面,达到了高达98.35%的Dice和96.75%的IoU。

英文摘要

Reconstructing 4D (3D+t) cardiac geometry from sparse 2D echocardiography is highly desirable yet fundamentally challenged by geometric ambiguity and temporal discontinuity. To tackle these issues, we propose Echo4DIR, a novel test-time 4D implicit reconstruction framework. Specifically, we learn robust 3D shape priors from statistical shape models (SSMs) via a cardiac conditional SDF, constructing an Epipolar Mask Encoder module with epipolar cross attention to effectively fuse multi-view features. To bridge the synthetic-to-real domain gap, we introduce a self-supervised SDF-tailored differentiable rendering strategy for patient-specific 3D shape adaptation using uncalibrated clinical masks without requiring 3D ground truth. Crucially, the inherent continuity of implicit representation overcomes sparse observations, enabling anatomically reliable geometry at arbitrary resolutions. Furthermore, to empower our framework with physically continuous 4D extension, we introduce a Radial SDF Alignment strategy that strictly locks shape evolution to the predicted velocity field, fundamentally eliminating mesh drift. Extensive experiments on synthetic benchmarks and real clinical datasets demonstrate that Echo4DIR achieves state-of-the-art 4D cardiac mesh reconstruction, notably yielding an impressive clinical overlap of up to 98.35% Dice and 96.75% IoU.
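
The two overlap metrics quoted above follow standard definitions, which the sketch below spells out on sets of foreground indices. The function names and the set-based mask representation are choices made for illustration; the paper's evaluation code is not described in the abstract.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two masks given as sets of foreground indices."""
    inter = len(pred & target)
    union = len(pred | target)
    total = len(pred) + len(target)
    dice = 2 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return dice, iou

def iou_from_dice(dice):
    """For a single pair of masks, IoU = Dice / (2 - Dice)."""
    return dice / (2.0 - dice)

dice, iou = dice_and_iou({1, 2, 3}, {2, 3, 4})
implied_iou = iou_from_dice(0.9835)
```

For a single mask pair the two metrics determine each other, and a Dice of 98.35% implies an IoU of about 96.75%, which matches the figures reported in the abstract. Averages taken over many cases need not satisfy the identity exactly.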

2605.22060 2026-05-22 cs.CR cs.AI 版本更新

Safeguarding Text-to-Image Generative Models Against Unauthorized Knowledge Distillation

保护文本到图像生成模型免受未经授权的知识蒸馏

Yilan Gao, Sida Huang, Hongyuan Zhang, Xuelong Li

发表机构 * School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University(西北工业大学光电与智能研究院(iOPEN)) Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信人工智能研究院(TeleAI)) The University of Hong Kong(香港大学)

AI总结 本文提出WaveGuard,一种单次前向、基于生成器的保护框架,在用户指定的扰动预算下保护已发布的合成图像,以防止未经授权的知识蒸馏和能力复制。

详情
AI中文摘要

闭源权重的生成服务越来越多地通过基于查询的API部署,用户可以获取生成的输出,而模型参数保持不可访问。然而,这种部署并不能防止模型窃取:攻击者可以反复查询该服务,收集大量已发布的合成图像,并将其用作私有替代模型的训练数据。这种由查询输出驱动的过程使得未经授权的知识蒸馏和能力复制成为可能,而无需直接访问原始权重。为缓解这一威胁,一种实用的防御应保持已发布图像的视觉保真度,提供对扰动幅度的显式控制,并能够高效扩展到大规模输出发布。我们提出了WaveGuard,一种单次前向、基于生成器的保护框架,它在用户指定的扰动预算下保护已发布的合成图像。WaveGuard采用频率感知的扰动生成器,注入结构化、不可察觉的扰动,在为正常观看者保持感知效用的同时,降低受保护图像作为未经授权学生模型训练数据的有用性。在与WikiArt相关的合成输出蒸馏设置下的广泛实验表明,WaveGuard实现了有利的效能-保真度-效率权衡,具有显式的不可察觉性控制,并显著提升了保护效率。

英文摘要

Closed-weight generative services are increasingly deployed through query-based APIs, where users can obtain generated outputs while model parameters remain inaccessible. However, such deployment does not prevent model stealing: an attacker can repeatedly query the service, collect large volumes of released synthetic images, and use them as training data for a private substitute model. This query-output-driven process enables unauthorized knowledge distillation and capability replication without direct access to the original weights. To mitigate this threat, a practical defense should preserve the visual fidelity of released images, provide explicit control over perturbation magnitude, and scale efficiently to large-volume output release. We present WaveGuard, a single-pass, generator-based protection framework that safeguards released synthetic images under a user-specified perturbation budget. WaveGuard employs a frequency-aware perturbation generator to inject structured, imperceptible perturbations that maintain perceptual utility for benign viewers while reducing the usefulness of protected images as training data for unauthorized student models. Extensive experiments under WikiArt-related synthetic-output distillation settings show that WaveGuard achieves a favorable efficacy-fidelity-efficiency trade-off, with explicit imperceptibility control and substantial gains in protection efficiency.

2605.22055 2026-05-22 cs.LG cs.AI 版本更新

Prototype-Guided Classification Sub-Task Decoupling Framework: Enhancing Generalization and Interpretability for Multivariate Time Series

基于原型的分类子任务解耦框架:提升多变量时间序列的泛化能力与可解释性

Xianhao Song, Yuang Zhang, Yuqi She, Liping Wang, Xuemin Lin

发表机构 * East China Normal University(华东师范大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出PDFTime框架,通过多阶段决策过程解耦时间序列分类任务,提升模型的泛化能力和可解释性,实现了在UEA和UCR基准测试中的最优性能。

详情
AI中文摘要

时间序列分类(TSC)是一个长期存在的研究问题,近年来随着大规模时间数据的快速增长而受到越来越多的关注。尽管深度学习带来了显著进展,但设计出既准确又可解释的TSC模型仍然是一个具有挑战性的任务。许多现有方法采用直接的特征到标签分类范式:通过单一线性投影(通常在全局池化后)将高维时间嵌入压缩为类别logits,从而将特征提取和决策逻辑合并为不可分割的映射。为了解决这些限制,我们提出了PDFTime,一个基于原型的框架,将时间序列分类重新表述为多阶段决策过程。不同于直接的特征到标签映射,PDFTime利用学习到的原型来近似潜在空间中的类别条件特征分布,通过不同粒度的分类子任务实现逐步辨别。据我们所知,PDFTime是第一个将时间序列分类重新表述为解耦、多阶段相似性推理过程的框架,打破了长期以来直接、黑箱的特征到标签映射范式。广泛的评估表明,PDFTime在UEA和UCR基准测试中实现了最先进的性能。值得注意的是,在UCR档案的128个数据集中,它在80个数据集上取得了最高(top-1)准确率,在一致性和泛化性上显著优于最近的强基线方法。

英文摘要

Time Series Classification (TSC) is a long-standing research problem that has gained increasing attention in recent years with the rapid growth of large-scale temporal data. Despite substantial progress enabled by deep learning, designing TSC models that are both accurate and interpretable remains a challenging task. Many existing approaches adopt a direct feature-to-label classification paradigm: by collapsing high-dimensional temporal embeddings into class logits via a single linear projection (often after global pooling), this paradigm conflates feature extraction and decision logic into an inseparable mapping. To address these limitations, we propose PDFTime, a prototype-guided framework that reformulates time series classification as a multi-stage decision process. Instead of direct feature-to-label mapping, PDFTime leverages learned prototypes to approximate class-conditional feature distributions in the latent space, enabling progressive discrimination through classification sub-tasks of varying granularity. To our knowledge, PDFTime is the first framework to reformulate time series classification as a decoupled, multi-stage similarity-based reasoning process, breaking the long-standing paradigm of direct, black-box feature-to-label mapping. Extensive evaluations demonstrate that PDFTime achieves state-of-the-art (SOTA) performance across UEA and UCR benchmarks. Notably, it secures the top-1 accuracy on 80 out of 128 datasets in the UCR archive, significantly outperforming recent strong baselines in both consistency and generalization.

2605.22054 2026-05-22 cs.LG cs.AI 版本更新

LABO: LLM-Accelerated Bayesian Optimization through Broad Exploration and Selective Experimentation

LABO: 通过广泛探索和选择性实验实现的LLM加速贝叶斯优化

Zhuo Chen, Xinzhe Yuan, Jianshu Zhang, Jinzong Dong, Ruichen Zhou, Yingchun Niu, Tianhang Zhou, Yu Yang Fredrik Liu, Yuqiang Li, Nanyang Ye, Qinying Gu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) School of Mechanical Engineering, Shanghai Jiao Tong University(上海交通大学机械工程学院) Institute for Advanced Study in Mathematics, Harbin Institute of Technology(哈尔滨工业大学数学研究所) School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) School of Automation, Central South University(中南大学自动化学院) College of New Energy and Materials, China University of Petroleum, Beijing(中国石油大学(北京)新能源与材料学院) College of Carbon Neutrality Future Technology, China University of Petroleum, Beijing(中国石油大学(北京)碳中和未来技术学院) DeepVerse PTE. LTD.

AI总结 本文提出LABO框架,通过结合LLM预测与实验观测,在贝叶斯优化中实现更高效的样本优化,理论分析和实验结果表明其在科学任务中优于现有方法。

Comments Accepted to ICML 2026

详情
AI中文摘要

科学探索中的高成本和数据稀缺性推动了将大型语言模型(LLMs)作为知识驱动组件应用于贝叶斯优化(BO)的研究。然而,现有方法通常将LLMs直接嵌入到采样或替代建模流程中,未能充分利用其显著低于现实实验的评估成本。为了解决这一限制,我们提出了LLM加速贝叶斯优化(LABO)框架,该框架在单个BO循环中结合LLM预测与实验观测。LABO采用门控标准来动态平衡对LLM预测和实际实验的依赖。通过利用低成本的LLM评估进行广泛探索搜索空间,并仅在高不确定性区域保留昂贵的现实实验,LABO实现了更高效的样本优化。我们提供了理论分析,通过累积遗憾界正式化这一效率增益。在多样化的科学任务中,实验结果表明LABO在相同实验预算下一致优于现有方法。我们的结果表明,LABO为将LLMs整合到科学发现流程中提供了一种实用且理论严谨的方法。

英文摘要

The high cost and data scarcity in scientific exploration have motivated the use of large language models (LLMs) as knowledge-driven components in Bayesian optimization (BO). However, existing approaches typically embed LLMs directly into the sampling or surrogate modeling pipeline, without fully leveraging their significantly lower evaluation cost compared to real-world experiments. To address this limitation, we propose LLM-Accelerated Bayesian Optimization (LABO), a framework that combines LLM predictions with experimental observations within a single BO loop. LABO employs a gating criterion to dynamically balance the reliance on LLM predictions versus actual experiments. By leveraging inexpensive LLM evaluations to broadly explore the search space and reserving costly real experiments only for regions with high uncertainty, LABO achieves more sample-efficient optimization. We provide a theoretical analysis with a cumulative regret bound that formalizes this efficiency gain. Empirical results across diverse scientific tasks demonstrate that LABO consistently outperforms existing methods under identical experimental budgets. Our results suggest that LABO offers a practical and theoretically grounded approach for integrating LLMs into scientific discovery workflows.

2605.22047 2026-05-22 cs.AI 版本更新

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

大语言模型在临床决策支持中的主动证据获取与诊断推理

Chen Zhan, Xihe Qiu, Xiaoyu Tan, Xibing Zhuang, Gengchen Ma, Yue Zhang, Shuo Li, Peifeng Liu, Xiaoxiao Ge, Liang Liu, Lu Gan

发表机构 * Tencent Youtu Lab(腾讯优图实验室) Case Western Reserve University(凯斯西储大学)

AI总结 研究探讨了大语言模型在临床决策支持中的主动证据获取与诊断推理问题,提出了一种基于OSCE的标准化患者模拟器和可控可复现的基准测试,发现多轮证据获取会降低诊断准确性并降低支持证据质量,表明静态全上下文基准可能高估交互证据获取场景中的性能,需引入互补的交互评估以提高临床决策安全性。

详情
AI中文摘要

大语言模型在静态医学检查中表现良好,但临床诊断往往需要在不确定性下进行迭代证据收集。基于先前的交互评估努力,我们引入了受OSCE启发的标准化患者模拟器和一个受控、可复现的基准测试,用于主动诊断查询。在我们的协议中,经过468个案例和15个模型的测试,我们发现多轮证据获取会将诊断准确性降低12.75%,并将支持证据质量降低24.36%,相对全上下文评估。错误分析将这些下降与过早的诊断封闭和低效的提问联系起来。这些结果表明,静态全上下文基准可能高估交互证据获取场景中的性能,从而推动对更安全临床决策支持的互补交互评估。

英文摘要

Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.

2605.22039 2026-05-22 cs.DC cs.AI cs.CR cs.MS 版本更新

Secure and Parallel Determinant Computation for Large-Scale Matrices in Edge Environments

在边缘环境中的大规模矩阵安全并行行列式计算

Prajwal Panth

发表机构 * School of Computer Engineering, KIIT Deemed to be University(计算机工程学院,KIIT 被认定大学)

AI总结 本文提出了一种安全并行行列式计算框架,通过复合元素扭曲等方法在分布式边缘服务器上实现隐私保护的行列式计算,以满足边缘计算环境下的实时需求。

Comments 15 pages, 7 figures, 5 tables. This paper was first made public in October 2024 and subsequently posted as v1 on TechRxiv (Dec 10, 2025): https://doi.org/10.36227/techrxiv.176539387.75109768/v1. The present arXiv submission is identical to that version (v1)

详情
AI中文摘要

边缘计算的出现使资源受限的客户端能够将密集的计算任务委托给分布式的边缘服务器,特别是在物联网(IoT)环境中。其中,矩阵行列式计算(MDC)对于控制系统、密码学和机器学习应用至关重要。然而,传统行列式算法的三次复杂度使其不适合在受限制的边缘场景中进行实时处理。我们提出了一种安全并行行列式计算(SPDC)框架,该框架在N个分布式边缘服务器上提供强安全性保障,包括隐私保护的MDC。该框架通过复合元素扭曲(CED)实现隐私保护,这是一种轻量级加密方法,结合了逐元素混淆(EWO)和Panth旋转定理(PRT),以隐藏矩阵的结构和数值内容,同时保持行列式属性。使用并行LU分解将加密的矩阵块分布到任意数量的不可信边缘服务器上,从而实现高效且可扩展的行列式计算。单向通信模型进一步通过消除服务器间的交互减少了协调开销。为了确保结果完整性并最小化客户端负担,我们进一步引入了两种验证算法:Q_2,一种概率性标量方法,以及Q_3,一种确定性和低复杂度的替代方案。数学分析表明,所提出的框架提供了强隐私和安全保障、低计算开销和部署灵活性,使其非常适合于安全、可扩展和实时的分布式边缘辅助系统中的MDC。

英文摘要

The advent of edge computing has enabled resource-constrained clients to delegate intensive computational tasks to distributed edge servers, especially within Internet of Things (IoT) environments. Among such tasks, Matrix Determinant Computation (MDC) remains critical for applications in control systems, cryptography, and machine learning. However, the cubic complexity of traditional determinant algorithms makes them unsuitable for real-time processing in constrained edge scenarios. We propose a Secure Parallel Determinant Computation (SPDC) framework, which provides strong security guarantees, including privacy-preserving MDC, across N distributed edge servers. The framework achieves privacy through Composite Element Distortion (CED), a lightweight encryption method that combines Element-wise Obfuscation (EWO) and the Panth Rotation Theorem (PRT) to conceal both structural and numerical matrix content while preserving determinant properties. Parallel LU decomposition is used to distribute encrypted matrix blocks across an arbitrary number of untrusted edge servers, enabling efficient and scalable determinant computation. A one-way communication model further reduces coordination overhead by eliminating inter-server interactions. To ensure result integrity with minimal client burden, we further introduce two verification algorithms: Q_2, a probabilistic scalar method, and Q_3, a deterministic and low-complexity alternative. Mathematical analysis demonstrates that the proposed framework provides strong privacy and security guarantees, low computational overhead, and deployment flexibility, making it well-suited for secure, scalable, and real-time MDC in distributed edge-assisted systems.
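
The cubic cost mentioned in the abstract comes from elimination-based determinant algorithms. As a rough illustration only, the sketch below computes a determinant by Gaussian elimination with partial pivoting, which is the sequential textbook counterpart of an LU-based approach. It is not the SPDC framework: it has no CED encryption, no block distribution across servers, and no Q_2/Q_3 verification, and the function name `det_via_lu` is made up for this example.

```python
def det_via_lu(matrix):
    """Determinant by Gaussian elimination with partial pivoting (O(n^3))."""
    a = [list(map(float, row)) for row in matrix]  # work on a float copy
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0                      # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                      # each row swap flips the sign
        det *= a[col][col]                  # accumulate the diagonal of U
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


m = [[2, 0, 1],
     [1, 3, 2],
     [1, 1, 4]]
print(det_via_lu(m))  # approximately 18
```

The three nested loops are where the cubic cost arises, which is the cost the abstract proposes to offload to edge servers.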

2605.22036 2026-05-22 cs.CV cs.AI 版本更新

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation

GA-VLN: 用于高效视觉-语言导航的几何感知鸟瞰图表示

Jiahao Yang, Zihan Wang, Xiangyang Li, Xing Zhu, Yujun Shen, Yinghao Xu, Shuqiang Jiang

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences(人工智能安全国家重点实验室,计算技术研究所,中国科学院) University of Chinese Academy of Sciences(中国科学院大学) Robbyant School of Computing, National University of Singapore(新加坡国立大学计算机学院) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 本文提出GA-VLN框架,通过引入几何感知的鸟瞰图表示(GA-BEV),整合显式和隐式几何信息,提升视觉-语言导航的效率和性能,实验表明其在仅使用导航数据的情况下取得了最先进的结果。

详情
AI中文摘要

尽管在视觉-语言导航(VLN)领域取得了显著进展,现有方法仍依赖密集的RGB视频,产生过多的片段标记且缺乏显式的空间结构,导致计算开销大且空间推理能力有限。为了解决这些问题,我们引入了几何感知的鸟瞰图(GA-BEV)-一种紧凑且3D基础的特征表示,将显式和隐式的几何线索整合到多模态大语言模型(MLLM)导航系统中。我们通过将视觉特征投影到3D空间并聚合为以代理为中心的布局来构建BEV空间地图,该布局在保持几何一致性的同时减少标记冗余。为了进一步丰富几何理解,我们将预训练的3D基础模型的特征融入BEV空间,注入从大规模3D重建任务中学习到的结构先验。这些互补的线索-基于深度的显式投影和隐式学习的先验-产生紧凑但空间表达能力强的表示,显著提高了导航效率和性能。实验表明,我们的方法仅使用导航数据即可取得最先进的结果,无需DaGger增强或混合VQA训练,证明了所提GA-VLN框架的鲁棒性和数据效率。

英文摘要

Despite significant progress in Vision-Language Navigation (VLN), existing approaches still rely on dense RGB videos that produce excessive patch tokens and lack explicit spatial structure, resulting in substantial computational overhead and limited spatial reasoning. To address these issues, we introduce the Geometry-Aware BEV (GA-BEV), a compact, 3D-grounded feature representation that integrates both explicit and implicit geometric cues into multimodal large language model (MLLM)-based navigation systems. We construct BEV spatial maps from RGB-D inputs by projecting visual features into 3D space and aggregating them into an agent-centric layout that preserves geometric consistency while reducing token redundancy. To further enrich geometric understanding, we incorporate features from a pretrained 3D foundation model into the BEV space, injecting structural priors learned from large-scale 3D reconstruction tasks. Together, these complementary cues (explicit depth-based projection and implicit learned priors) yield compact yet spatially expressive representations that substantially improve navigation efficiency and performance. Experiments show that our method achieves state-of-the-art results using only navigation data, without DAgger augmentation or mixed VQA training, demonstrating the robustness and data efficiency of the proposed GA-VLN framework.

2605.22034 2026-05-22 cs.CV cs.AI 版本更新

AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

AgroVG:一个大规模多源基准用于农业视觉 grounding

Haocheng Li, Juepeng Zheng, Zenghao Yang, Kaiqi Du, Guilong Xiao, Gengmeng Pu, Haohuan Fu, Jianxi Huang

发表机构 * China Agricultural University(中国农业大学) Sun Yat-sen University(中山大学) Tianjin University(天津大学) Tsinghua University(清华大学) Southwest Jiaotong University(西南交通大学) National Supercomputing Center in Shenzhen(深圳国家超算中心)

AI总结 本文提出AgroVG基准,用于评估农业视觉 grounding能力,通过多源数据集和任务特定协议,评估模型在多目标、多实例和无目标场景下的性能,揭示了现有模型在农业视觉 grounding任务中的不足。

Comments 45 pages, 12 figures

详情
AI中文摘要

视觉 grounding,即根据自然语言描述定位物体的任务,是农业人工智能系统的基础能力,可应用于选择性除草、疾病监测和定向收获。农业视觉 grounding的可靠评估具有挑战性,因为农业目标往往小、重复、被遮挡或形状不规则,且指令可能指一个、多个或没有物体。因此,评估此能力需要联合测试定位精度、目标集完整性和存在感知的回避。为了解决这些挑战,我们引入了AgroVG,一个多源基准,将农业 grounding 视为广义集合预测:给定一张图像和一个指称表达,模型必须返回所有匹配的目标实例或在没有目标时回避。AgroVG包含来自十个数据集的10,071个注释-图像查询对,涵盖六个目标类别:作物/杂草、水果、小麦头、害虫、植物疾病和树冠。它支持所有六个类别上的边界框 grounding(T1)和具有可靠实例级像素注释的数据源上的实例掩码 grounding(T2),查询涵盖单目标、多目标和无目标场景。AgroVG进一步提供任务特定的协议用于框集匹配和查询级掩码覆盖。对26种模型配置的零样本评估揭示了持续的差距:最好的多目标Set-F1仅达到0.35,最好的正查询掩码成功率在IoU@0.75下仍低于0.17。数据和代码可在https://anonymous.4open.science/r/AgroVG-5172/上获得。

英文摘要

Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting. Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image. Evaluating this capability therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention. To address these challenges, we introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present. AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy. It supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes. AgroVG further provides task-specific protocols for box-set matching and query-level mask coverage. Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35, and the best positive-query mask success rate at IoU@0.75 remains below 0.17. Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/.

2605.22003 2026-05-22 cs.CL cs.AI cs.IR cs.LG 版本更新

From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification

从TF-IDF到Transformer:一种比较和集成的方法用于情感分类

Dip Biswas Shanto, Mitali Yadav, Prajwal Panth, Suresh Chandra Satapathy

发表机构 * School of Computer Engineering KIIT Deemed to be University(计算机工程学院 KIIT 被认定大学)

AI总结 本文比较了多种机器学习模型,包括Naive Bayes、逻辑回归、SVM、LightGBM、LSTM以及基于Transformer的RoBERTa和DistilBERT,旨在对电影评论进行情感分类,并发现RoBERTa在准确率上表现最佳,同时集成所有模型的软投票方法进一步提升了分类性能。

Comments 6 pages, 9 figures. This is the author's accepted manuscript, presented at the International Conference on Intelligent Computing, Networks and Security (IC-ICNS 2026), March 26-28, Bhubaneswar, India. Proceedings publication pending

详情
AI中文摘要

情感分析,也称为观点挖掘,主要试图从任何基于文本的数据中提取观点。在电影评论和评论员的背景下,情感分析可以成为预测电影评论总体是积极还是消极的有用工具。对于ML模型来说,理解上下文或隐喻性情感可能具有挑战性,因为ML模型主要依赖统计词表示。本文的目标是检验并分类电影评论为积极或消极情感。为此考虑了多种机器学习模型,并运用自然语言处理(NLP)方法进行数据预处理和模型评估。使用IMDb数据集。具体来说,评估了Naive Bayes、逻辑回归、支持向量机(SVM)、LightGBM、LSTM以及基于Transformer的RoBERTa和DistilBERT等模型。经过大量测试,使用准确率、精确率、召回率、F1分数和ROC-AUC后,RoBERTa在所有其他模型之上表现更好,准确率为93.02%。一个结合所有模型的软投票集成方法也提高了分类性能,表明模型集成在情感分析中效果良好。

英文摘要

Sentiment analysis, also referred to as opinion mining, tries to extract opinions from text-based data. In the context of movie reviews and critics, sentiment analysis can be a helpful tool for predicting whether a movie review is generally positive or negative. It can be difficult for ML models to understand context or metaphysical sentiment accurately, as they rely largely on statistical word representations. The objective of this paper is to examine movie reviews and categorise them as positive or negative. Diverse machine learning models are considered in doing so, and Natural Language Processing (NLP) methodologies are employed for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and the transformer-based models RoBERTa and DistilBERT were evaluated. After extensive testing using accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa outperformed all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
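
For readers unfamiliar with the term, soft voting averages the class probabilities produced by each model and picks the class with the highest average, rather than counting hard votes. The sketch below is a minimal plain-Python illustration. The function name `soft_vote`, the equal default weights, and the probabilities are all made up for this example and do not come from the paper.

```python
def soft_vote(prob_rows, weights=None):
    """Average per-model class probabilities and return (label, averaged probs).

    prob_rows: one [p_negative, p_positive] list per model, for a single review.
    weights: optional per-model weights; equal weighting if omitted.
    """
    if weights is None:
        weights = [1.0] * len(prob_rows)
    total = sum(weights)
    n_classes = len(prob_rows[0])
    avg = [
        sum(w * row[c] for w, row in zip(weights, prob_rows)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg


# Made-up outputs of three hypothetical models for one review.
# Two models lean weakly positive, one is strongly negative.
example = [
    [0.45, 0.55],
    [0.45, 0.55],
    [0.95, 0.05],
]
label, avg = soft_vote(example)
print(label, [round(p, 3) for p in avg])
```

In this toy case a hard majority vote would pick the positive class (two of three models), whereas soft voting picks the negative class because the one confident model outweighs two uncertain ones. That sensitivity to confidence is the usual argument for soft voting.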

2605.22001 2026-05-22 cs.CR cs.AI cs.CL 版本更新

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

守卫中的盲区:域伪装注入攻击如何在多智能体大语言模型系统中逃避检测

Aaditya Pai

发表机构 * Data Science Institute(数据科学研究所) Columbia University(哥伦比亚大学)

AI总结 本文研究了在多智能体大语言模型系统中,域伪装注入攻击如何通过模仿目标文档的领域词汇和权威结构来逃避检测,揭示了检测器在静态和伪装负载之间的检测率差异(Camouflage Detection Gap, CDG),并展示了多智能体辩论架构对静态注入攻击的放大效应以及检测器增强的有限有效性。

Comments 8 pages, 3 figures, 2 tables. Submitted to EMNLP 2026 ARR cycle

详情
AI中文摘要

部署在保护大语言模型代理中的注入检测器是基于静态的模板化负载进行校准的,这些负载会公开声明自身为覆盖指令。我们识别出一个系统性的盲区:当负载被生成以模仿目标文档的领域词汇和权威结构时,我们称之为域伪装注入,标准检测器无法识别它们,检测率从Llama 3.1 8B上的93.8%降至9.7%,从Gemini 2.0 Flash上的100%降至55.6%。我们将此正式定义为Camouflage Detection Gap(CDG),即静态负载与伪装负载之间注入检测率的差异。在覆盖三个领域和两种模型家族的45项任务中,CDG是显著且统计显著的(Llama的chi^2=38.03,p<0.001;Gemini的chi^2=17.05,p<0.001),在两种情况下均无零反向不一致对。我们还评估了Llama Guard 3,一个生产安全分类器,其检测零伪装负载(IDRcamouflage=0.000),证实盲区不仅限于少量样本检测器,还扩展到专门的安全分类器。我们进一步表明,多智能体辩论架构通过小型模型放大静态注入攻击高达9.9倍,而更强的模型则表现出集体抵抗力。针对检测器的增强仅提供部分缓解(Llama上提高10.2%,Gemini上提高78.7%),这表明该漏洞是架构性的,而非偶然的,对于较弱的模型而言。我们的框架、任务库和负载生成器已公开发布。

英文摘要

Injection detectors deployed to protect LLM agents are calibrated on static, template-based payloads that announce themselves as override directives. We identify a systematic blind spot: when payloads are generated to mimic the domain vocabulary and authority structures of the target document, which we call domain-camouflaged injection, standard detectors fail to flag them, with detection rates dropping from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash. We formalize this as the Camouflage Detection Gap (CDG), the difference in injection detection rate between static and camouflaged payloads. Across 45 tasks spanning three domains and two model families, CDG is large and statistically significant (χ² = 38.03, p < 0.001 for Llama; χ² = 17.05, p < 0.001 for Gemini), with zero reverse discordant pairs in either case. We additionally evaluate Llama Guard 3, a production safety classifier, which detects zero camouflage payloads (IDR_camouflage = 0.000), confirming that the blind spot extends beyond few-shot detectors to dedicated safety classifiers. We further show that multi-agent debate architectures amplify static injection attacks by up to 9.9x on smaller models, while stronger models show collective resistance. Targeted detector augmentation provides only partial remediation (10.2% improvement on Llama, 78.7% on Gemini), suggesting the vulnerability is architectural rather than incidental for weaker models. Our framework, task bank, and payload generator are released publicly.
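
The Camouflage Detection Gap defined above is a simple difference of two detection rates. The sketch below recomputes it from the rates quoted in the abstract. The helper names `detection_rate` and `camouflage_detection_gap` are illustrative and not taken from the released framework.

```python
def detection_rate(flags):
    """Fraction of payloads the detector flagged (flags is a list of booleans)."""
    return sum(flags) / len(flags)


def camouflage_detection_gap(idr_static, idr_camouflage):
    """CDG: detection rate on static payloads minus rate on camouflaged ones."""
    return idr_static - idr_camouflage


# Detection rates quoted in the abstract, expressed as fractions.
reported = {
    "Llama 3.1 8B": (0.938, 0.097),
    "Gemini 2.0 Flash": (1.000, 0.556),
}
gaps = {m: camouflage_detection_gap(s, c) for m, (s, c) in reported.items()}
for model, gap in gaps.items():
    print(f"{model}: CDG = {gap * 100:.1f} percentage points")
```

On the quoted rates this gives a gap of about 84.1 percentage points for Llama 3.1 8B and about 44.4 for Gemini 2.0 Flash.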

2605.22000 2026-05-22 cs.CV cs.AI 版本更新

Virtual 3D H&E Staining from Phase-contrast Back-illumination Interference Tomography

从相位对比背光干涉断层扫描生成虚拟3D的H&E染色

Anthony Song, Boyan Zhou, Mayank Golhar, Marisa Morakis, Alex Baras, Nicholas Durr

发表机构 * Department of Biomedical Engineering, Johns Hopkins University(约翰霍普金斯大学生物医学工程系) Department of Pathology, Johns Hopkins Hospital(约翰霍普金斯医院病理学系)

AI总结 本文提出HistoBIT3D,首个基于voxel的配对BIT和荧光标记核数据集,用于评估无监督虚拟染色在结构保持方面的定量效果。通过该数据集,作者提出一种新的虚拟染色框架,利用双向多尺度内容一致性与跨域风格复用,将具有移变对比度的BIT体积转化为逼真的H&E体积,从而提升3D核分割精度和边界保持性。

详情
AI中文摘要

三维(3D)未处理组织的病理学具有潜在的疾病管理变革能力,通过使组织微结构的体积分析和活体评估成为可能。背光干涉断层扫描(BIT)是一种新的相位显微镜技术,能够提供快速、非破坏性的未处理组织体积分像。然而,将BIT体积转化为临床可解释的H&E图像仍然具有挑战性,特别是由于移变对比和缺乏定量验证基准。我们引入HistoBIT3D,首个voxel-wise配对的BIT和荧光标记核数据集,使在无监督虚拟染色中结构保持的定量评估成为可能。利用该数据集,我们提出了一种新的虚拟染色框架,通过双向多尺度内容一致性和跨域风格复用来增强结构保真度和感知现实性,将具有移变对比度的BIT体积转化为逼真的H&E体积。我们的方法在现实感度量方面达到最先进的水平,同时显著提高了3D核分割精度和边界保持性,特别是在零shot Cellpose评估下。这些贡献共同建立了一个经过定量验证、结构忠实且可扩展的3D虚拟H&E染色流程,推动了无切片、体积分计算病理学的范式转变。我们的数据和代码可在:https://github.com/aasong113/HistoBIT3D_VirtualStaining。

英文摘要

Three-dimensional (3D) histopathology of unprocessed tissues has the potential to transform disease management by enabling volumetric characterization of tissue microarchitecture and in-vivo assessment. Back-illumination Interference Tomography (BIT) is a new phase microscopy technology that provides rapid, non-destructive volumetric imaging of unprocessed tissues. However, translating BIT volumes into clinically interpretable H&E images remains challenging, particularly due to shift-variant contrast and the absence of quantitative validation benchmarks. We introduce HistoBIT3D, the first voxel-wise paired BIT and fluorescence-labeled nuclei dataset, enabling quantitative evaluation of structural preservation in unsupervised virtual staining against ground-truth nuclear distributions. Using this dataset, we present a novel virtual staining framework that translates BIT volumes with shift-variant contrast into realistic H&E volumes by leveraging bidirectional multiscale content consistency and cross-domain style reuse to enhance structural fidelity and perceptual realism. Our method achieves state-of-the-art realism metrics while significantly improving 3D nuclei segmentation accuracy and boundary preservation under zero-shot Cellpose evaluation. Together, these contributions establish a quantitatively validated, structurally faithful, and scalable pipeline for 3D virtual H&E staining, advancing the paradigm of slide-free, volumetric computational histopathology. Our data and code are available at: https://github.com/aasong113/HistoBIT3D_VirtualStaining.

2605.21997 2026-05-22 cs.AI cs.MA 版本更新

The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems

日志即代理:用于可审计、可分支代理系统的事件源反应图

Yohei Nakajima

AI总结 本文提出了一种基于事件源的反应图结构,通过将日志作为事实来源,实现了可审计、可分支的代理系统,提供了确定性回放、低成本分支和端到端溯源能力。

Comments 11 pages, 1 figure. Open-source Apache-2.0 implementation with reproducible quickstart demo, deterministic replay, fork-and-diff, and lineage tracing

详情
AI中文摘要

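大多数代理框架围绕语言模型构建:先有对话循环,然后是工具,接着是规则,最后是为可观测性而附加的日志层,状态则被保存为可检索的“记忆”。我们描述了ActiveGraph,一个颠倒这种安排的运行时。仅追加的事件日志是事实来源;工作图是该日志的确定性投影;行为(普通函数、类、由LLM支持的例程,或附加在类型化边上的逻辑)对图的变化作出反应并发出新事件。没有任何组件向其他组件下达指令;协调完全通过共享图进行。这一设计决策带来了检索与摘要式记忆系统所不具备的三个特性:可从日志对任意运行进行确定性回放;可在任意事件处低成本分支而无需重新执行共享前缀;以及从高层目标到产生每个工件的单次模型调用的端到端溯源。我们介绍了该架构、使回放可靠的确定性契约,以及一个完整因果结构仅凭日志即可重建的尽职调查示例。我们讨论了(但不声称已证明)为何这一基础特别适合自我改进的代理,以及它如何延续BabyAGI谱系和先前的图记忆研究。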

英文摘要

Most agent frameworks are built around the language model: a conversation loop comes first, then tools, then rules, and finally a logging layer bolted on for observability, with state persisted as retrievable "memory." We describe ActiveGraph, a runtime that inverts this arrangement. The append-only event log is the source of truth; the working graph is a deterministic projection of that log; and behaviors (ordinary functions, classes, LLM-backed routines, or logic attached to typed edges) react to changes in the graph and emit new events. No component instructs another; coordination happens entirely through the shared graph. This single design decision yields three properties that retrieval-and-summarization memory systems do not provide: deterministic replay of any run from its log, cheap forking that branches a run at any event without re-executing the shared prefix, and end-to-end lineage from a high-level goal down to the individual model call that produced each artifact. We present the architecture, a determinism contract that makes replay sound, and a worked diligence example whose full causal structure is reconstructable from the log alone. We discuss, without claiming to demonstrate, why this substrate is unusually well suited to self-improving agents, and how it extends the BabyAGI lineage and prior graph-memory research.
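
To make the "log as source of truth" idea concrete, here is a toy sketch of an append-only event log, a deterministic projection of that log into a graph, and a fork that reuses a shared prefix. It is not the ActiveGraph API: the names `Event`, `Log`, `project`, and the two event kinds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    kind: str       # "add_node" or "add_edge" in this toy model
    payload: tuple


@dataclass
class Log:
    events: list = field(default_factory=list)

    def append(self, event):
        self.events.append(event)

    def fork(self, at):
        """Branch at event index `at`; the shared prefix is copied, not re-run."""
        return Log(list(self.events[:at]))


def project(log):
    """Deterministically fold the log into a graph (nodes, edges)."""
    nodes, edges = set(), set()
    for e in log.events:
        if e.kind == "add_node":
            nodes.add(e.payload[0])
        elif e.kind == "add_edge":
            edges.add(e.payload)
    return nodes, edges


main = Log()
main.append(Event("add_node", ("goal",)))
main.append(Event("add_node", ("task_a",)))
main.append(Event("add_edge", ("goal", "task_a")))

branch = main.fork(at=2)                     # keep only the first two events
branch.append(Event("add_node", ("task_b",)))

print(project(main) == project(main))        # replaying the same log gives the same graph
print(sorted(project(branch)[0]))            # the branch diverges after the shared prefix
```

Because `project` depends only on the event sequence, replay and fork come almost for free in this toy version. A real runtime would additionally have to pin down nondeterministic inputs such as model outputs, which is presumably what the abstract's determinism contract covers.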

2605.21996 2026-05-22 cs.SE cs.AI 版本更新

From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents

从补丁到轨迹:软件工程代理的特权过程监督

Murong Ma, Tianyu Chen, Yun Lin, Shuai Lu, Qinglin Zhu, Yeyun Gong, Zhiyong Huang, Peng Cheng, Yan Lu, Jin Song Dong

发表机构 * National University of Singapore(新加坡国立大学) Microsoft Research Asia(微软亚洲研究院) Shanghai Jiao Tong University(上海交通大学) King’s College London(伦敦国王学院)

AI总结 本文提出Patches-to-Trajectories (P2T)方法,通过利用开发者撰写的参考补丁来改进软件工程代理的训练过程,提高训练效果和效率。

详情
AI中文摘要

监督微调(SFT)在长教师轨迹上的应用是使开放软件工程(SWE)代理具备调查和推理能力的主要方法。由于每个保留的响应都成为模仿目标,学生继承了最终结果和中间缺陷,包括无根据的跳跃和冗余循环。高质量的训练数据必须有效(每一步都基于事实并缩小代理的知识差距到正确修复)且高效(每一步都有信息而非冗余或循环)。现有方法仅使用二进制终端验证器过滤或重新标记教师轨迹,这并未直接针对这些方面,也无法对教师失败的实例提供监督。大多数真实问题包含一个开发者撰写的参考补丁,$p^\star$,揭示了正确修复所假设的文件路径、运行时行为和编码规范,但标准流程却将其丢弃。我们提出Patches-to-Trajectories(P2T),在整理过程中使用$p^\star$作为特权信息,并将轨迹构建制定为对每一步有效性与轨迹长度的双目标优化。一个反向阶段将$p^\star$转化为一个上下文事实和解决方案里程碑的潜在过程图,$G^\star$。一个正向阶段通过在泄漏阻断的 groundedness 检查下对每一步进展进行评分,从盲目的教师延续中整理轨迹,并保留最短的有效段。仅使用1.8k整理的SWE-Gym实例,P2T在结果过滤SFT及其工具错误掩码变体上提高了效果和效率。在SWE-bench Verified上,它将Pass@1提高多达10.8个点,同时将每实例推理成本减少约15%,在SWE-bench Lite上也保持一致的收益。大小匹配的消融分析和定性分析进一步将轨迹质量与数据规模分离。

英文摘要

Supervised fine-tuning (SFT) on long teacher trajectories is the dominant way to instill investigation and reasoning in open software-engineering (SWE) agents. Since every retained response becomes an imitation target, the student inherits both the final outcome and the intermediate flaws, including ungrounded leaps and redundant loops. High-quality training data must be effective (each step is grounded and narrows the agent's epistemic gap to the correct fix) and efficient (each step is information-bearing rather than redundant or looping). Existing recipes filter or relabel teacher rollouts using only a binary terminal verifier, which does not directly target these axes and provides no supervision on instances where the teacher fails. Most real issues include a developer-authored reference patch, $p^\star$, revealing the file paths, runtime behaviors, and coding conventions presupposed by the correct fix, yet standard pipelines discard it. We propose Patches-to-Trajectories (P2T), which uses $p^\star$ as privileged information during curation and formulates trajectory construction as bi-objective optimization over per-step effectiveness and trajectory length. A reverse phase distills $p^\star$ into a latent process graph, $G^\star$, of contextual facts and solution milestones. A forward phase curates trajectories from blinded teacher continuations by scoring per-step progress against $G^\star$ under a leakage-blocking groundedness check and retaining the shortest effective segments. Using only 1.8k curated SWE-Gym instances, P2T improves effectiveness and efficiency over outcome-filtered SFT and its tool-error-masking variant. On SWE-bench Verified, it raises Pass@1 by up to 10.8 points while reducing per-instance inference cost by ~15%, with consistent gains on SWE-bench Lite. Size-matched ablations and qualitative analysis further isolate trajectory quality from data scale.

2605.21994 2026-05-22 cs.LG cs.AI 版本更新

Ex-GraphRAG: Interpretable Evidence Routing for Graph-Augmented LLMs

Ex-GraphRAG:图增强大语言模型中的可解释证据路由

Yoav Kor Sade, Arvindh Arun, Rishi Puri, Steffen Staab, Maya Bechler-Speicher

发表机构 * Tel Aviv University(特拉维夫大学) Institute for AI, University of Stuttgart(人工智能研究所,斯图加特大学) NVIDIA(英伟达) Meta AI

AI总结 本文提出Ex-GraphRAG,通过引入多变量图神经加法网络(M-GNAN)来解决图增强大语言模型中证据路由的可解释性问题,揭示了语义重要性与结构连通性之间的不匹配,对检索剪枝、上下文构建和失败诊断有重要影响。

详情
AI中文摘要

GraphRAG通过从知识图中检索子图并使用消息传递GNN进行编码,将语言模型置于这些子图上。由于这些编码器通过迭代邻域聚合将节点贡献纠缠在一起,因此无法确定每个检索实体对编码器输出的影响程度,因此无法忠实审计实际到达模型的结构证据。我们引入Ex-GraphRAG,用多变量图神经加法网络(M-GNAN)替代GNN编码器,这是一种扩展到高维嵌入空间的加法图模型,能够精确分解编码器的输出,而无需事后近似。在STaRK-Prime上,这种可审计的编码器与黑盒性能相匹配。利用它审计证据路由,我们发现语义-结构不匹配:主导编码器输出的节点在检索的子图中结构上是断开的,由低贡献的中介节点连接,其移除会使多跳问答性能下降高达28%。这种不匹配对任何不透明编码器都是不可见的,揭示了语义重要性与结构连通性由不同的节点集控制,对图增强大语言模型的检索剪枝、上下文构建和故障诊断有直接的影响。

英文摘要

GraphRAG conditions language models on subgraphs retrieved from knowledge graphs, encoded via message-passing GNNs. Because these encoders entangle node contributions through iterated neighborhood aggregation, there is no closed-form way to determine how much each retrieved entity influenced the encoder's output, and therefore no way to faithfully audit what structural evidence actually reached the model. We introduce Ex-GraphRAG, which replaces the GNN encoder with a Multivariate Graph Neural Additive Network (M-GNAN), an extension of additive graph models to high-dimensional embedding spaces that yields an exact decomposition of the encoder's output across individual nodes and feature groups, without post-hoc approximation. On STaRK-Prime, this auditable encoder matches black-box performance. Using it to audit evidence routing, we uncover a semantic-structural mismatch: the nodes that dominate the encoder's output are structurally disconnected in the retrieved subgraph, held together by low-attribution intermediaries whose removal degrades multi-hop QA by up to 28%. This mismatch, invisible to any opaque encoder, reveals that semantic importance and structural connectivity are governed by disjoint sets of nodes, with direct implications for retrieval pruning, context construction, and failure diagnosis in graph-augmented LLMs.

2605.21993 2026-05-22 cs.AI cs.LG 版本更新

ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking

ECPO:用于证据认证候选者排序的证据耦合策略优化

Miaobo Hu, Shuhao Hu, BoKun Wang, Yina Sa, Xin Wang, Xiaobo Guo, Daren Zha, Jun Xiao

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(信息工程研究所,中国科学院) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 本文研究了证据认证候选者排序问题,提出了一种名为ECPO的策略优化方法,通过结合排序和证据证书来提升排序效果和证据可靠性。

详情
AI中文摘要

用于决策支持的排序系统不仅应对候选者进行排序,还应展示可独立验证的证据。我们研究了证据认证候选者排序:给定一个意图ID、预定义的计划骨架、窗口局部的候选者名单、以及通过文本推导出的候选者轨迹及其跨度来源,系统必须输出一个Top-K列表以及doc_id:span证据证书,其引用的跨度足以恢复决策。我们在此任务上在MAVEN-ERE和RAMS上进行了实例化,使用固定上游提取、窗口局部随机候选者标识符、骨架对齐的轨迹监督、难例和审计参考。我们引入了证据耦合策略优化(ECPO),一种列表级策略优化目标,其动作是排序和证据证书的联合对象。ECPO首先从骨架对齐、论点一致性以及可选图特征中学习可解释的轨迹奖励;然后优化一个受约束的策略,具有三个耦合奖励:列表级排序效用、跨度级证书有效性以及由一个无标签的确定性验证器计算的证据循环奖励,该验证器通过去除声明的引用跨度重建候选者支持。这将目标从单独最大化普通NDCG转变为最大化CertNDCG和决策-证据耦合。评估将ECPO与零样本、SFT和GRPO策略、仅RM的评分带确定性证据附件、语法/JSON约束解码、验证器重试、最佳-N RM选择以及后验证据合理化在封闭名单、预测名单和混合名单设置下进行比较。

英文摘要

Ranking systems used in decision-support settings should not only order candidates but also expose evidence that can be independently checked. We study evidence-certified candidate ranking: given an intent_id, a predefined plan skeleton, a window-local candidate roster, and text-derived candidate trajectories with span provenance, a system must output a Top-K list together with doc_id:span evidence certificates whose cited spans are sufficient to recover the decision. We instantiate this task on MAVEN-ERE and RAMS with fixed upstream extraction, window-local randomized candidate identifiers, skeleton-aligned trajectory supervision, hard negatives, and audit references. We introduce Evidence-Coupled Policy Optimization (ECPO), a listwise policy-optimization objective whose action is the joint object of ranking and evidence certificate. ECPO first learns an interpretable trajectory reward from skeleton alignment, argument consistency, and optional graph features; it then optimizes a constrained policy with three coupled rewards: listwise ranking utility, span-level certificate validity, and an evidence-cycle reward computed by a label-free deterministic verifier that reconstructs candidate support from claim-stripped cited spans. This reframes the goal from maximizing ordinary NDCG alone to maximizing CertNDCG and decision-evidence coupling. The evaluation compares ECPO against zero-shot, SFT, and GRPO policies, RM-only scoring with deterministic evidence attachment, grammar/JSON-constrained decoding, validator retry, best-of-N RM selection, and post-hoc evidence rationalization under closed-roster, predicted-roster, and hybrid-roster settings.
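
The abstract contrasts ordinary NDCG with CertNDCG but does not define the latter. The sketch below shows the standard NDCG computation and, purely as an assumption, one plausible certificate-gated variant in which a ranked item earns its gain only if its evidence certificate is valid. The function `cert_gated_ndcg` is hypothetical and may differ from the paper's actual metric.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with the usual log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(gains):
    """Standard NDCG: DCG of the ranking divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def cert_gated_ndcg(gains, cert_valid):
    """Hypothetical variant: an item earns its gain only with a valid certificate."""
    gated = [g if ok else 0.0 for g, ok in zip(gains, cert_valid)]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gated) / ideal if ideal > 0 else 0.0


# Made-up relevance gains for a Top-4 list and made-up certificate checks.
gains = [3, 2, 0, 1]
cert_valid = [True, False, True, True]
print(round(ndcg(gains), 3), round(cert_gated_ndcg(gains, cert_valid), 3))
```

Under this assumed definition a well-ordered list with an invalid certificate at rank 2 scores noticeably lower than its plain NDCG, which matches the abstract's stated goal of coupling the ranking decision to checkable evidence.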

2605.21988 2026-05-22 cs.CV cs.AI 版本更新

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

通过反事实强化学习学习视频大语言模型中的时空敏感性

Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Tencent(腾讯)

AI总结 本文提出CRPO方法,通过反事实强化学习提升视频大语言模型对时空动态的敏感性:构建反事实视频并引入反事实关系奖励,抑制依赖静态线索的捷径策略,并在DyBench基准上提升了模型的时空敏感性。

Comments Project website: https://ddz16.github.io/crpo.github.io/

详情
AI中文摘要

视频大语言模型(Video LLMs)在基准测试中准确率很高,但往往依靠单帧线索和语言先验等捷径来回答视频问题,而不是跟踪时空动态。这一问题在RL后训练中进一步加剧,因为仅基于正确性的奖励会强化那些不跟踪视频动态却能获得高奖励的捷径策略。为此,我们提出一个受控的反事实问题:如果视觉世界发生变化而问题保持不变,答案应当改变还是保持不变?基于这一观点,我们提出反事实关系策略优化(CRPO),一种用于提升时空敏感性的双分支强化学习框架。CRPO通过水平翻转和时间反转构建反事实视频,在原始分支和反事实分支上同时训练,并在两个分支的答案之间引入反事实关系奖励(CRR):对动态问题鼓励答案改变,对静态问题鼓励答案保持不变。这种跨分支约束使捷径策略难以在两个分支上都持续获得奖励。为评估这一特性,我们引入DyBench,一个成对反事实视频基准,包含3,014个视频,涵盖可逆动态、运动方向和事件顺序,并配有严格的成对准确率指标,以防止固定答案的捷径虚增分数。实验表明,CRPO在时空敏感性评估上优于先前的RL方法,同时保持有竞争力的通用视频性能。在Qwen3-VL-8B上,CRPO相对基础模型将DyBench P-Acc提高了+7.7,将TimeBlind I-Acc提高了+8.2,表明提升来自时空敏感性的改善,而非对静态捷径的更强依赖。项目网站:https://ddz16.github.io/crpo.github.io/

英文摘要

Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL framework for improving spatiotemporal sensitivity. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a Counterfactual Relation Reward (CRR) between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce DyBench, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/ .
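The cross-branch idea can be illustrated with a small sketch. The binary reward, the string normalization, and the "both branches correct" definition of pair accuracy are assumptions chosen for illustration; the abstract does not give the exact formulas used by CRPO or DyBench.

```python
def counterfactual_relation_reward(answer_original, answer_counterfactual, is_dynamic):
    """Hypothetical CRR: 1.0 when the answer pair behaves as the question type demands.

    Dynamic questions should get a different answer on the flipped or reversed
    video; static questions should get the same answer on both branches.
    """
    changed = answer_original.strip().lower() != answer_counterfactual.strip().lower()
    return 1.0 if changed == is_dynamic else 0.0


def pair_accuracy(pairs):
    """Assumed strict pair accuracy: a pair counts only if both branches are correct.

    `pairs` is a list of (original_correct, counterfactual_correct) booleans.
    """
    if not pairs:
        return 0.0
    return sum(1 for ok_orig, ok_cf in pairs if ok_orig and ok_cf) / len(pairs)
```

A policy that always answers "left" would be correct on at most one member of each direction pair, so its pair accuracy under this definition stays at zero even though its per-video accuracy could be 50%.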

2605.21984 2026-05-22 cs.AI cs.CL 版本更新

Echo: Learning from Experience Data via User-Driven Refinement

Echo:通过用户驱动的细化从经验数据中学习

Hande Dong, Xiaoyun Liang, Jiarui Yu, Jiayi Lin, Changqing Ai, Feng Liu, Wenjun Zhang, Rongbi Wei, Chaofan Zhu, Linjie Che, Feng Wu, Xin Shen, Dexu Kong, Xiaotian Wang, Qiuyuan Chen, Bingxu An, Yueting Lei, Qiang Lin

发表机构 * Core Contributors(核心贡献者) Qiang Lin is the team leader(Qiang Lin 是团队负责人)

AI总结 本文提出Echo框架,通过用户驱动的细化过程将原始经验数据转化为可学习的知识,提升模型性能,实验表明其能将接受率从25.7%提升至35.7%。

详情
AI中文摘要

静态的“人类数据”存在固有局限:扩展成本高,且受限于其创造者的知识。从“经验数据”(即智能体与其环境的交互)中持续学习,有望突破这些障碍。如今,AI智能体的广泛部署使我们能够以低成本获取海量的真实世界经验数据流。然而,原始交互日志本质上嘈杂,充斥着试错过程且信息密度低,直接用于模型训练效率低下。我们提出Echo,一个通用框架,旨在将原始经验转化为可学习的知识,把环境反馈“回响”(echo)到训练循环中以优化模型。在当今的智能体生态中,用户细化是这类反馈的主要来源:出于对结果负责,用户会严格地把有缺陷的智能体提案改造成经过验证的解决方案。这些由用户驱动的细化序列本质上把智能体的粗糙尝试提炼为高质量的训练信号。Echo系统性地采集这些信号,使智能体持续与真实世界需求对齐。在生产环境代码补全场景中的大规模验证表明,Echo有效利用了这一流程,突破了静态性能上限,将接受率从25.7%提升至35.7%。

英文摘要

Static "human data" faces inherent limitations: it is expensive to scale and bounded by the knowledge of its creators. Continuous learning from "experience data" - interactions between agents and their environments - promises to transcend these barriers. Today, the widespread deployment of AI agents grants us low-cost access to massive streams of such real-world experience. However, raw interaction logs are inherently noisy, filled with trial-and-error and low information density, rendering them inefficient for direct model training. We introduce Echo, a generalized framework designed to operationalize the transition from raw experience to learnable knowledge, effectively "echoing" environmental feedback back into the training loop for model optimization. In today's agent ecosystem, user refinement serves as a primary source of such feedback: driven by responsibility for the outcome, users rigorously transform flawed agent proposals into verified solutions. These user-driven refinement sequences inherently distill agents' crude attempts into high-quality training signals. Echo systematically harvests these signals to continuously align the agent with real-world needs. Large-scale validation in a production code completion environment confirms that Echo effectively harnesses this pipeline, breaking the static performance ceiling by increasing the acceptance rate from 25.7% to 35.7%.

2605.21980 2026-05-22 cs.CV cs.AI 版本更新

Interpreting and Enhancing Emotional Circuits in Large Vision-Language Models via Cross-Modal Information Flow

通过跨模态信息流解读并增强大视觉-语言模型中的情感回路

Chengsheng Zhang, Chenghao Sun, Zhining Xie, Xinmei Tian

发表机构 * MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China(中国科学技术大学类脑智能感知与认知教育部重点实验室) AIPD, Tencent(腾讯AIPD)

AI总结 本文提出了一种基于转向向量的因果归因框架,用于描述性情感推理,并通过构建专用数据集揭示了三阶段“适应-聚合-执行”机制下的情感回路:视觉情感线索在中间层通过情感特定的注意力头聚合,随后在深层通过情感通用路径转化为叙述生成;在此基础上,通过调控情感信息路由来增强注意力流和语义激活,从而提升性能并缓解情感幻觉。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型视觉-语言模型(LVLMs)是迈向共情智能体的重要一步,在情绪理解方面展现出显著能力。然而,LVLMs如何将抽象的视觉刺激转化为连贯的情感叙述,其内部机制在很大程度上仍未被探索,这主要是因为视觉反事实样本稀缺,且情感表达本身较为弥散。本文通过引入一个面向描述性情感推理、基于转向向量的因果归因框架来弥补这一空白。为此,我们构建了一个专用数据集,以揭示三阶段“适应-聚合-执行”(Adapt-Aggregate-Execute)机制背后的情感回路。关键的是,我们发现了一种功能解耦:视觉情感线索在中间层通过情感特定的注意力头被聚合,随后在深层通过情感通用的路径被转化为叙述生成。基于这些发现,我们调控情感信息的路由以增强注意力流,并放大语义激活以巩固表达。在综合性基准MER-UniBench上的大量实验表明,我们的方法通过推理时干预显著提升了性能,有效缓解了情感幻觉,并印证了所发现回路的因果保真度。

英文摘要

Large Vision-Language Models (LVLMs) represent a significant leap towards empathetic agents, demonstrating remarkable capabilities in emotion understanding. However, the internal mechanisms governing how LVLMs translate abstract visual stimuli into coherent emotional narratives remain largely unexplored, primarily due to the scarcity of visual counterfactuals and the diffuse nature of emotional expression. In this paper, we bridge this gap by introducing a steering-vector-based causal attribution framework tailored for descriptive emotional reasoning. To this end, we construct a specialized dataset to demystify the emotional circuits underlying the three-stage "Adapt-Aggregate-Execute" mechanism. Crucially, we discover a functional decoupling: visual emotional cues are aggregated in middle layers via sentiment-specific attention heads, but are subsequently translated into narrative generation in deep layers through emotion-general pathways. Guided by these insights, we regulate the emotional information routing to strengthen attention flow and amplify the semantic activation to consolidate expression. Extensive experiments on the comprehensive MER-UniBench demonstrate that our methods significantly improve performance via inference-time intervention, effectively mitigating emotional hallucinations and corroborating the causal fidelity of the discovered circuits.

2605.21977 2026-05-22 cs.CV cs.AI 版本更新

Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

视频作为自然增强:迈向统一的AI生成图像和视频检测

Zhengcen Li, Chenyang Jiang, Liangxu Su, Tong Shao, Shiyang Zhou, Ming Tao, Jingyong Su

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Pengcheng Laboratory(鹏城实验室) Shenzhen Loop Area Institute(深圳河套学院)

AI总结 本研究针对AI生成内容检测中跨模态差距的问题,提出VINA框架,通过联合训练图像和视频数据,利用视频帧作为自然增强,并引入跨模态监督对比目标,实现统一的AI生成内容检测,提升鲁棒性和迁移性。

详情
AI中文摘要

AI生成内容(AIGC)正在迅速提升,催生了需要在数据源、部署管道和视觉模态间通用的检测器的紧迫需求。一个高度通用的检测器应在分布变化下保持稳健。然而,我们发现了一种一致的失败模式:最先进的AI生成图像检测器在应用于从视频中提取的帧时往往会崩溃。通过系统分析,我们发现这种跨模态差距源于交织的合成无关视频处理转换,包括颜色转换、编码压缩、缩放和模糊,以及由现代视频生成器引入的模型特定指纹。受这些发现的启发,我们提出了VINA(Video as Natural Augmentation),一个统一的AIGC检测框架,联合训练图像和视频数据。VINA利用视频帧作为物理上合理的自然增强,并进一步引入跨模态监督对比目标,以在共享的真/假决策边界下对齐图像和视频表示。在14个图像、视频和现实世界基准测试中,VINA展示了双向收益,提高了鲁棒性和迁移性,并在几乎所有评估设置中实现了最先进的性能,无需复杂的增强或数据集特定调整。

英文摘要

AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.

2605.21974 2026-05-22 cs.AI 版本更新

Format-Constraint Coupling in Knowledge Graph Construction from Statistical Tables

从统计表构建知识图谱中的格式-约束耦合

Jingxuan Qi, Zhiqiang Ye, Yuxiang Feng

发表机构 * South China University of Technology(华南理工大学)

AI总结 本文研究了由统计表构建知识图谱时序列化格式与模式约束之间的耦合效应,发现二者的联合效应超过各自独立效应之和,并发布了CSVFidelity-Bench基准以支持关注保真度的评估。

Comments 8 pages main body, 18 pages appendices. Submitted to EMNLP 2026 via ACL Rolling Review (ARR). Corresponding author: Yuxiang Feng (yxfeng@scut.edu.cn). Code and data available at https://anonymous.4open.science/r/sge_lightrag-BE19

详情
AI中文摘要

抽取模式(schema)不应降低知识图谱的保真度。然而在统计类CSV上,它却可能如此。我们研究国家-年份时间序列矩阵,这是开放数据门户上常见的布局。在这一设置中,序列化格式与模式约束以超加性的方式相互作用:二者的联合效应超过各自独立效应之和,差值最高达+1.180(2x2因子设计,6个数据集)。Bootstrap 95%置信区间在4/6个数据集上严格为正,其中宽格式Type-II矩阵上的证据最强。更关键的是,将模式应用于不匹配的格式可能引发灾难性失配:在4/6个数据集上,事实覆盖率因实体膨胀或拒绝抽取而低于无约束基线。我们将观察到的这一模式称为格式-约束耦合。探测实验和token消融支持一种以列名引用为中心的表层形式锚定解释。在跨格式-模式配对、GraphRAG宿主和LLM家族的受控变体中,结果在所测范围内方向一致;其中一个LLM家族仅表现出部分激活。这一观察还带来一个诊断层面的后果:三种标准检索模式在很大程度上掩盖了构建质量(delta <= 1pp),而直接访问图则暴露出高达+47.6pp(p < 0.0001)的差距。为支持关注保真度的评估,我们发布了CSVFidelity-Bench,它包含15个数据集、11个Type-II矩阵、4个Type-III表格,以及覆盖6个领域的1,892条金标准事实。

英文摘要

An extraction schema should not reduce knowledge graph fidelity. On statistical CSV, however, it can. We study country-by-year time-series matrices, a common layout on open-data portals. In this setting, serialization format and schema constraints interact super-additively. Their joint effect exceeds the sum of independent effects by up to +1.180 (2x2 factorial, 6 datasets). Bootstrap 95% CIs are strictly positive on 4/6 datasets, with strongest evidence on wide Type-II matrices. More critically, a schema applied to a mismatched format can trigger catastrophic mismatch. Fact coverage falls below the unconstrained baseline on 4/6 datasets through entity inflation or extraction refusal. We call this observed pattern format-constraint coupling. Probing and token ablation support a surface-form anchoring explanation centred on column-name references. Controlled variants across format-schema pairings, GraphRAG hosts, and LLM families show the same direction within the measured scope; one LLM family shows only partial activation. The observation also has a diagnostic consequence. Three standard retrieval modes largely mask construction quality (delta <= 1pp), whereas direct graph access exposes gaps up to +47.6pp (p < 0.0001). To support fidelity-aware evaluation, we release CSVFidelity-Bench. It contains 15 datasets, 11 Type-II matrices, 4 Type-III tables, and 1,892 Gold Standard facts across 6 domains.
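The super-additivity claim corresponds to the usual interaction contrast in a 2x2 factorial design. The sketch below shows that contrast only; the function name, argument names, and the numbers in the usage note are illustrative assumptions, not values or code from the paper.

```python
def interaction_effect(base, format_only, schema_only, both):
    """Interaction contrast of a 2x2 factorial design.

    Each argument is the outcome (for example, fact coverage) in one cell:
    neither factor, format change only, schema constraint only, both.
    A positive result means the joint effect exceeds the sum of the two
    individual effects, i.e. the factors interact super-additively.
    """
    effect_format = format_only - base
    effect_schema = schema_only - base
    joint_effect = both - base
    return joint_effect - (effect_format + effect_schema)
```

For example, `interaction_effect(0.25, 0.5, 0.5, 1.5)` gives 0.75: each factor alone adds 0.25, but together they add 1.25.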

2605.21969 2026-05-22 cs.IR cs.AI 版本更新

LLM Retrieval for Stable and Predictable Ad Recommendations

基于大语言模型的稳定可预测广告推荐

Vinodh Kumar Sunkara, Satheeshkumar Karuppusamy, Hangjun Xu, Sai Deepika Regani, Kshitij Gupta, Gaby Nahum, Sneha Iyer, Jean-Baptiste Fiot, Yinglong Guo, Xiaowen Guo, Atul Jangra, Yucheng Liu, Jinghao Yan, Vijay Pappu, Benjamin Schulte, Deepak Chandra

发表机构 * Meta Platforms, Inc.(Meta公司)

AI总结 本文提出了一种新的评估框架,用于量化广告推荐系统的稳定性和可预测性,并展示了基于微调大语言模型的在线验证语义候选生成框架,通过提高系统的语义感知能力,在稳定性和可预测性方面实现了显著改进。

Comments SIGIR 2026 AgentSearch Workshop, Melbourne Australia

详情
AI中文摘要

传统的广告推荐系统主要专注于使用召回率或归一化折扣累计增益(NDCG)等传统指标来优化点击或转化事件的预测准确性。随着生成AI技术的超大规模增长,广告库存和流动性不断增加,预测的稳定性和可预测性变得越来越关键。直观地说,预测的稳定性和可预测性可以定义为量化系统对小扰动(广告、创意)的鲁棒性,缺乏这些特性可能导致广告商可感知的问题,如重复性、冷启动和探索不足。本文介绍了一种新的评估框架,用于量化广告推荐系统的稳定性和可预测性,并提出了一个基于微调大语言模型(LLM)的在线验证语义候选生成框架,该框架在这些指标上实现了显著改进,通过从根本上提高系统的语义感知能力。该方法从广告创意中提取层次化的语义属性以获得LLM表示,这些表示作为基于图的扩展的基础,确保检索到的候选者包含广告的语义变体,保证来自广告商的小创意变体产生一致且可解释的用户交付结果。我们测试了这种LLM广告检索框架在大规模工业广告推荐系统中的表现,证明了在离线和在线A/B实验中均实现了显著改进,展示了可预测性和传统性能指标的提升。尽管在广告堆栈中进行了评估,但这是一个通用的框架,可广泛应用于面临类似扩展和可预测性挑战的其他大规模推荐和检索系统。

英文摘要

Traditional ads recommendation systems have primarily focused on optimizing the prediction accuracy of click or conversion events using canonical metrics such as recall or normalized discounted cumulative gain (NDCG). With the hyper-growth of ads inventory and liquidity driven by generative AI technologies, prediction stability and predictability are becoming increasingly critical. Intuitively, prediction stability and predictability quantify system robustness with respect to minor or noisy perturbations of the input (ads, creatives); the lack of such robustness can lead to advertiser-perceivable problems such as repeatability, cold start, and under-exploration. In this paper, we introduce a new evaluation framework for quantifying the stability and predictability of an ads recommender system, and present an online-validated semantic candidate generation framework powered by fine-tuned Large Language Models (LLMs) that shows significant improvement along these metrics by fundamentally improving the semantic awareness of the system. The approach extracts hierarchical semantic attributes from ad creatives to obtain LLM representations, which serve as the foundation for graph-based expansion. This ensures that the retrieved candidates encapsulate semantic variants of an ad, so that small creative variants from the advertiser yield consistent and explainable delivery results to the user. We tested this LLM ads retrieval framework in a large-scale industrial ads recommendation system, demonstrating significant improvements across offline and online A/B experiments, with gains in both predictability and traditional performance metrics. Although evaluated in the ads stack, this is a general framework that can be applied broadly to any large-scale recommendation or retrieval system facing similar scaling and predictability challenges.
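The abstract does not spell out its stability metrics. As a hedged illustration of what "robustness to small creative perturbations" could mean in practice, the sketch below scores a retriever by the mean Jaccard overlap between the candidates returned for an ad and for each of its variants. The metric choice and the `retrieve` callable are hypothetical, not the paper's framework.

```python
def jaccard(a, b):
    """Jaccard overlap of two candidate collections (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def retrieval_stability(retrieve, ad, variants):
    """Mean overlap between candidates for `ad` and for each perturbed variant.

    `retrieve` is any callable mapping an ad to an iterable of candidate ids;
    it stands in for the retrieval system under test.
    """
    if not variants:
        return 1.0
    base = retrieve(ad)
    return sum(jaccard(base, retrieve(v)) for v in variants) / len(variants)
```

A score near 1.0 would mean small creative edits barely change what is retrieved; a low score would flag the kind of unpredictability the abstract describes.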

2605.21963 2026-05-22 cs.LG cs.AI 版本更新

ChronoMedicalWorld: A Medical World Model for Learning Patient Trajectories from Longitudinal Care Data

ChronoMedicalWorld:一个用于从纵向护理数据中学习患者轨迹的医学世界模型

Jiangyuan Wang, Xuyong Chen, Junwei He, Xu Xu, Shasha Xie, Fuman Han

发表机构 * Beijing KidneyTec Medical Technology Co., Ltd.(北京肾科医疗技术有限公司)

AI总结 本文提出ChronoMedicalWorld模型,用于从纵向护理数据中学习患者轨迹;该模型将联合嵌入状态编码器与宽动作编码器相结合,并在一个包含六项的目标函数下训练循环潜在转移模块,以提高慢性病护理中长期预测的准确性。

Comments 14 pages, 2 figures, 6 tables

详情
AI中文摘要

长时程临床模拟(预测患者的生理状态在指定干预下如何历经数年演变)是慢性病护理的核心,但现有的电子健康记录(EHR)模型大多是判别式的,而通用大语言模型在反复干预下会发生漂移。我们提出ChronoMedicalWorld模型(CMWM),一种以动作为条件的潜在世界模型框架,用于从纵向护理数据中学习患者轨迹。CMWM将联合嵌入状态编码器与宽动作编码器相结合,后者同时接受结构化干预指示变量和自由文本沟通嵌入;并在一个包含六项的目标函数下训练循环潜在转移模块:下一步观测监督、下一步潜在状态预测、SIGReg潜在正则化,以及三个生理感知的形状先验(斜率、连续性、大跳变惩罚)。闭环rollout前缀协议使训练与部署保持一致,因此模型优化的正是它在推理时表现出的多步误差。作为具体案例研究,我们将CMWM实例化于慢性肾病(CKD)的年度估算肾小球滤过率(eGFR)轨迹预测。在一个包含2,232名患者的肾病队列上,CKD实例在动态50%历史rollout测试中取得7.384的平均绝对误差(MAE)和10.256的均方根误差(RMSE),而经过调优的GPT-5.5结构化提示基线分别为7.964和11.069(MAE降低7.28%,RMSE降低7.35%),其中增益主要来自患者与健康教练沟通中的对话部分。该框架并不局限于CKD:其架构、损失设计和训练协议适用于任何可表述为“周期性临床状态与结构化及对话式干预交替出现”的慢性疾病。

英文摘要

Long-horizon clinical simulation, predicting how a patient's physiology evolves over years under specified interventions, is central to chronic-disease care, yet existing electronic health record (EHR) models are predominantly discriminative, and general-purpose large language models drift under repeated interventions. We propose the ChronoMedicalWorld Model (CMWM), an action-conditioned latent world-model framework for learning patient trajectories from longitudinal care data. CMWM couples a joint-embedding state encoder with a wide action encoder that admits both structured intervention indicators and free-text communication embeddings, and trains a recurrent latent transition module under a six-term objective: next-observation supervision, next-latent prediction, SIGReg latent regularisation, and three physiology-aware shape priors (slope, continuity, large-jump penalty). A closed-loop rollout-prefix protocol matches training to deployment, so the model is optimised against the same multi-step error it exhibits at inference. As a concrete case study, we instantiate CMWM for annual estimated glomerular filtration rate (eGFR) trajectory forecasting in chronic kidney disease (CKD). On a 2,232-patient nephrology cohort, the CKD instantiation achieves a dynamic-50% history rollout test mean absolute error (MAE) of 7.384 and root-mean-square error (RMSE) of 10.256, against 7.964 and 11.069 for a tuned GPT-5.5 structured-prompting baseline (-7.28% MAE, -7.35% RMSE), with the gain dominated by the dialogue portion of patient-health-coach communication. The framework is not CKD-specific: its architecture, loss design, and training protocol apply to any chronic condition that can be cast as periodic clinical state interleaved with structured and conversational interventions.
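The reported relative improvements can be checked directly from the four error values quoted in the abstract. The helper name below is an assumption; only the arithmetic is shown.

```python
def relative_change_percent(model_error, baseline_error):
    """Signed relative change of the model's error versus the baseline, in percent."""
    return 100.0 * (model_error - baseline_error) / baseline_error


# Error values as quoted in the abstract (CMWM vs. the structured-prompting baseline).
mae_change = relative_change_percent(7.384, 7.964)
rmse_change = relative_change_percent(10.256, 11.069)
```

Both values come out at roughly -7.3%, in line with the quoted reductions; any difference in the second decimal place is presumably due to rounding of the published error values.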

2605.21962 2026-05-22 cs.AI cs.CY cs.HC cs.MA 版本更新

AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

AI赋能的严肃游戏:在训练系统中整合智能与适应性

Priyamvada Tripathi, Bill Kapralos

发表机构 * Durham College(达勒姆学院) Ontario Tech University(安大略理工大学)

AI总结 本文探讨了如何利用人工智能技术提升严肃游戏中的实时教学适应能力,分析了智能与适应性的定义,并讨论了大型语言模型、强化学习和基于代理的架构在严肃游戏中的应用及面临的挑战。

Comments Book chapter, 1 figure. To appear in "Advances in Global Applied Artificial Intelligence," G. A. Tsihrintzis, M. Virvou, N. G. Bourbakis, and L. C. Jain (Eds.), Springer, Learning and Analytics in Intelligent Systems book series, 2026

详情
AI中文摘要

严肃游戏被广泛用于医疗、国防和教育等领域的学习与培训。然而,一些长期存在的挑战仍未解决,包括静态的场景设计、内容创作瓶颈、有限的学习者建模,以及难以实现有意义的实时教学适应。人工智能(AI)的最新进展带来了动态场景变化、情境化反馈、自适应节奏和学习者状态建模等新能力,可能有助于缓解其中一些局限。与此同时,将AI集成到严肃游戏中也引发了有关效度、透明性、系统控制和学习者信任的重要问题。本章探讨当代AI方法如何支持严肃游戏中的实时教学适应。它区分了教学智能与适应性:前者指系统推断学习者知识并就教学上恰当的回应进行推理的能力,后者指在交互过程中调整教学行动的能力。本章对自适应学习系统进行了历史梳理,追溯了从早期计算机辅助教学,到智能辅导系统(ITS)、动态难度调整(DDA)、创作平台、学习分析,再到近期AI赋能架构的发展。在此基础上,本章讨论了大型语言模型(LLMs)、强化学习(RL)和基于智能体的架构如何促成严肃游戏中更为一体化的智能与适应性。本章还指出了AI赋能系统面临的实践与研究挑战,包括可解释性、验证、计算成本,以及关于AI赋能严肃游戏长期学习成效的实证证据有限。

英文摘要

Serious games are widely used for learning and training across domains such as healthcare, defense, and education. Persistent challenges remain, however, including static scenario design, authoring bottlenecks, limited learner modeling, and difficulty implementing meaningful real-time instructional adaptation. Recent advances in artificial intelligence (AI) introduce novel capabilities such as dynamic scenario variation, contextual feedback, adaptive pacing, and learner-state modeling that may help address some of these limitations. At the same time, integrating AI into serious games raises important questions related to validity, transparency, system control, and learner trust. This chapter examines how contemporary AI approaches may support real-time instructional adaptation in serious games. It distinguishes between instructional intelligence, defined as a system's capacity to infer learner knowledge and reason about pedagogically appropriate responses, and adaptivity, defined as the ability to modify instructional actions during interaction. A historical synthesis of adaptive learning systems is presented, tracing developments from early computer-assisted instruction through intelligent tutoring systems (ITS), dynamic difficulty adjustment (DDA), authoring platforms, learning analytics, and recent AI-enabled architectures. Building on this perspective, the chapter discusses how large language models (LLMs), reinforcement learning (RL), and agent-based architectures may contribute to more integrated forms of intelligence and adaptivity in serious games. It also highlights practical and research challenges associated with AI-enabled systems, including explainability, validation, computational cost, and the limited empirical evidence regarding long-term learning outcomes in AI-enabled serious games.

2605.21954 2026-05-22 cs.CV cs.AI 版本更新

MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

MLLMs在开口之前就已知道“何时”:通过注意力线索揭示并恢复时间定位能力

Dazhao Du, Liao Duan, Jian Liu, Tao Han, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Xi’an Jiaotong University(西安交通大学) Tencent(腾讯)

AI总结 本文研究了多模态大语言模型(MLLMs)在视频时间定位中的感知与生成之间的差距,提出了一种推理阶段的读取-再生成框架,通过利用注意力线索来提高时间定位的准确性,从而在三个视频时间定位基准上提升了MiMo-VL-7B、Qwen3-VL-8B和TimeLens-8B的性能。

Comments Project Website: https://ddz16.github.io/mllmsknowwhen.github.io/

详情
AI中文摘要

视频时间定位(VTG),即在未剪辑的视频中定位被查询事件的起止时间,是检验多模态大语言模型(MLLMs)是否不仅理解发生了什么、还理解何时发生的关键测试。尽管现代MLLMs能够流畅地描述视频内容,它们的时间戳预测仍不可靠,而现有的补救方法要么需要在时间标注上进行代价高昂的后训练,要么依赖粗糙的免训练启发式方法。在本文中,我们探查了MLLMs的跨模态注意力,并揭示了一个感知-生成差距。我们的关键发现是,MLLMs在prefill阶段往往已经知道目标区间,但在生成最终答案时丢失了这一信号。在prefill阶段,一组稀疏的注意力头(我们称之为时间定位头,TG-Heads)会把查询到视频的注意力集中在真实区间上。然而,在自回归解码过程中,答案token会把注意力从该区间转移到视觉上显著但与查询无关的片段。这一观察促使我们提出一种推理时的“读取-再生成”框架。我们首先将TG-Head的prefill注意力转换为去偏的帧级相关性信号,并提取其突出的高注意力区间;然后重新调用MLLM,通过视频裁剪或注意力掩码把视觉上下文限制在该区间内,以抑制干扰项。在不更新参数、不改变架构的情况下,我们的框架在三个VTG基准上一致地提升了MiMo-VL-7B、Qwen3-VL-8B和TimeLens-8B的性能,最大提升达+3.5 mIoU。项目网站:https://ddz16.github.io/mllmsknowwhen.github.io/

英文摘要

Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, while existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query-irrelevant segments. This observation motivates an inference-time read-then-regenerate framework. We first convert TG-Head prefill attention into a debiased frame-level relevance signal and extract the high-attention interval it highlights. We then re-invoke the MLLM with visual context restricted to this interval, using video cropping or attention masking to suppress distractors. Without parameter updates or architectural changes, our framework consistently improves MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three VTG benchmarks, with gains of up to +3.5 mIoU. The project website can be found at https://ddz16.github.io/mllmsknowwhen.github.io/.
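The "extract the high-attention interval" step and the IoU metric behind mIoU can be sketched as follows. The thresholding rule and the choice of the run with the largest total relevance are assumptions for illustration; the abstract does not describe the paper's debiasing or its exact extraction procedure.

```python
def highest_attention_interval(relevance, threshold):
    """Return (start, end) frame indices, inclusive, of the contiguous run of
    frames scoring above `threshold` with the largest total relevance.

    Returns None when no frame exceeds the threshold.
    """
    best, best_mass = None, 0.0
    start, mass = None, 0.0
    # A trailing sentinel closes a run that reaches the last frame.
    for i, score in enumerate(list(relevance) + [float("-inf")]):
        if score > threshold:
            if start is None:
                start, mass = i, 0.0
            mass += score
        elif start is not None:
            if best is None or mass > best_mass:
                best, best_mass = (start, i - 1), mass
            start = None
    return best


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0
```

mIoU is then the mean of `temporal_iou` over all queries in a benchmark.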

2605.21933 2026-05-22 cond-mat.stat-mech cs.AI cs.LG 版本更新

Thermodynamic Irreversibility of Training Algorithms

训练算法的热力学不可逆性

Liu Ziyin, Yuanjie Ren, Adam Levine, Isaac Chuang

发表机构 * Massachusetts Institute of Technology(麻省理工学院) NTT Research(NTT研究所)

AI总结 本文提出了一种通用框架,用于定义和分析训练算法的不可逆性,证明了四种不同方法在步长η的主导阶近似下是等价的,并展示了不可逆性如何导致时间反演对称性破缺的新兴力。

Comments preprint

详情
AI中文摘要

AI系统的训练算法都会引入远离平衡态的动力学过程,理解这些算法的不可逆性是理解现代AI系统学习动力学的基础一步。本文建立了一个用于定义和分析训练算法不可逆性的通用框架。我们证明,刻画动力学过程不可逆性的四种不同方式在步长η的领头阶上是等价的:数值后向误差ϕ_{DE}、时间重整化修正ϕ_{TR}、微观时间反演不对称性ϕ_{TA},以及(正则化的)随机热力学熵产生ϕ_{ST}。不可逆性会产生一种破缺时间反演对称性的涌现力,这种力一般会破坏非等距的连续重参数化对称性,保持正交对称性,并导致对熵产生率最小的学习轨迹的普遍偏好。

英文摘要

The training algorithms for AI systems all introduce far-from-equilibrium dynamical processes, and understanding the irreversibility of these algorithms is a fundamental step towards understanding the learning dynamics of modern AI systems. In this work, we establish a general framework for defining and analyzing the irreversibility of training algorithms. We show that four different ways to characterize the irreversibility of dynamical processes are equivalent to leading order in the step size $\eta$: numerical backward error $\phi_{\mathrm{DE}}$, time-renormalized correction $\phi_{\mathrm{TR}}$, microscopic time reversal asymmetry $\phi_{\mathrm{TA}}$, and the (regularized) stochastic-thermodynamic entropy production $\phi_{\mathrm{ST}}$. The irreversibility gives rise to a time-reversal-symmetry-breaking emergent force that generically breaks non-isometric continuous reparametrization symmetries, preserves orthogonal symmetries, and leads to a universal preference for those learning trajectories that minimize the entropy production rate.

2605.21928 2026-05-22 cs.LG cs.AI stat.ME 版本更新

CausalGuard: Conformal Inference under Graph Uncertainty

CausalGuard: 图不确定性下的共形推断

Vikash Singh, Weicong Chen, Debargha Ganguly, Yanyan Zhang, Nengbo Wang, Sreehari Sankar, Mohsen Hariri, Alexander Nemecek, Chaoda Song, Shouren Wang, Biyao Zhang, Van Yang, Erman Ayday, Jing Ma, Vipin Chaudhary

发表机构 * Case Western Reserve University(凯斯西储大学)

AI总结 本文提出CausalGuard,一种结构加权的契合框架,通过聚合图条件双稳健伪结果进行校准,以在图不确定性下提供无分布的有限样本边际覆盖。

详情
AI中文摘要

从观测数据估计处理效应需要选择调整集,但有效的调整依赖于未知的因果图。图的误设可能导致覆盖不足,而与图无关的共形包装方法可能只能通过大幅填充来恢复名义覆盖率。我们提出CausalGuard,一种结构加权的共形框架,在聚合以图为条件的双重稳健伪结果之后再进行校准。候选DAG由LLM导出的边先验提出,经条件独立性检验剪枝,并按贝叶斯信息准则重新加权。随后,一个复合非一致性分数对后验加权的伪结果进行校准。CausalGuard为这一聚合伪结果提供无分布假设的有限样本边际覆盖;在因果可识别、重叠、条件均值干扰函数稳定,以及权重集中于与目标对齐的有效调整策略等条件下,其条件均值收敛到真实的条件平均处理效应。在五个基准上,CausalGuard在可直接评估的目标上取得高于名义90%水平的平均覆盖率,并在与图无关的共形基线需要大幅填充时减小了区间宽度。压力测试表明,当保留的候选集有数据支持时,CausalGuard能够抑制无效的对撞因子调整,并在先验误设的情况下保持稳定。

英文摘要

Estimating treatment effects from observational data requires choosing an adjustment set, but valid adjustment depends on an unknown causal graph. Graph misspecification can cause under-coverage, while graph-agnostic conformal wrappers may regain nominal coverage only through large padding. We introduce CausalGuard, a structure-weighted conformal framework that calibrates after aggregating graph-conditional doubly robust pseudo-outcomes. Candidate DAGs are proposed from an LLM-derived edge prior, pruned by conditional-independence tests, and reweighted by Bayesian Information Criterion. A composite nonconformity score then calibrates the posterior-weighted pseudo-outcome. CausalGuard provides distribution-free finite-sample marginal coverage for this aggregated pseudo-outcome; under causal identification, overlap, conditional-mean nuisance stability, and concentration on target-aligned valid adjustment strategies, its conditional mean converges to the true Conditional Average Treatment Effect. Across five benchmarks, CausalGuard attains mean coverage above the nominal 90% level for the directly evaluable target and reduces width when graph-agnostic conformal baselines require large padding. Stress tests show that CausalGuard suppresses invalid collider adjustment and remains stable under misspecified priors when the retained candidate set is data-supported.
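Two ingredients named in the abstract, BIC-based reweighting of candidate graphs and conformal calibration, can be sketched with standard textbook forms. The weights proportional to exp(-BIC/2) and the plain split-conformal quantile are assumptions used for illustration; the paper's composite nonconformity score and exact weighting are not given in the abstract.

```python
import math


def bic_weights(bic_scores):
    """Approximate posterior weights over candidate graphs, w_g ∝ exp(-BIC_g / 2)."""
    best = min(bic_scores)  # subtract the minimum for numerical stability
    raw = [math.exp(-0.5 * (b - best)) for b in bic_scores]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate_pseudo_outcome(pseudo_outcomes, weights):
    """Weighted average of graph-conditional pseudo-outcomes for one unit."""
    return sum(w * y for w, y in zip(weights, pseudo_outcomes))


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score.

    Returns infinity when the calibration set is too small for the level.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return math.inf if rank > n else sorted(scores)[rank - 1]
```

A prediction interval for a new unit would then be the aggregated pseudo-outcome prediction plus or minus the returned quantile, which is where the finite-sample marginal coverage statement comes from in the standard split-conformal argument.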

2605.21919 2026-05-22 cs.CV cs.AI 版本更新

SDGBiasBench: Benchmarking and Mitigating Vision-Language Models' Biases in Sustainable Development Goals

SDGBiasBench: 评估和减轻可持续发展目标中视觉-语言模型的偏见

Zihang Lin, Huaiyuan Qin, Muli Yang, Hongyuan Zhu

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 本文提出SDGBiasBench,一个用于评估和减轻可持续发展目标中视觉-语言模型偏见的大型基准测试集,通过分析模型在决策和估计层面的偏见,提出CADE方法以减少偏见,提高模型的准确性和可靠性。

详情
AI中文摘要

评估可持续发展目标(SDGs)的进展需要对视觉线索、上下文知识和发展指标进行多步骤推理,其中不完整的证据使用和不完美的证据整合可能引入隐藏的预测偏见。现实中的SDG监测还涵盖定性判断和定量估计。然而,现有基准通常孤立地评估这些方面,掩盖了当模型用先验代替证据时系统性偏见。为解决这一差距,我们提出了SDGBiasBench,一个面向SDG的视觉-语言推理大型基准测试集。该基准涵盖50万专家参与的多项选择题和5万回归任务,能够全面评估视觉-语言模型(VLMs)在决策和估计层面的偏见。在SDGBiasBench上的评估揭示了当前VLMs中固有的SDG偏见,其中预测通常由SDG特定的先验驱动,而非可靠的多模态线索。为减轻这种偏见,我们提出CADE(对比自适应去偏集合),一种无需训练的即插即用方法,利用模态特定的答案先验。CADE在所提出的基准上取得显著成效,提高了多项选择的准确率高达25%,并减少了回归MAE高达12点,适用于多种VLMs。我们希望我们的工作能促进更公平和可靠的AI系统在可持续发展中的发展。

英文摘要

Assessing progress toward the Sustainable Development Goals (SDGs) requires multi-step reasoning over visual cues, contextual knowledge, and development indicators, where incomplete evidence use and imperfect evidence integration can introduce hidden prediction biases. Real-world SDG monitoring further spans both qualitative judgments and quantitative estimation. However, existing benchmarks typically evaluate these aspects in isolation, obscuring systematic biases that emerge when models substitute priors for evidence. To address this gap, we propose SDGBiasBench, a large-scale benchmark suite for SDG-oriented vision-language reasoning. Spanning 500k expert-involved multiple-choice questions and 50k regression tasks, the benchmark enables comprehensive assessment of both decision-level and estimation-level bias in Vision-Language Models (VLMs). Evaluations on SDGBiasBench reveal an intrinsic SDG bias in current VLMs, where predictions are frequently driven by SDG-specific priors rather than reliable multi-modal cues. To mitigate such bias, we propose CADE (Contrastive Adaptive Debias Ensemble), a training-free, plug-and-play method that leverages modality-specific answer priors. CADE yields significant gains on the proposed benchmark, improving multiple-choice accuracy by up to 25% and reducing regression MAE by up to 12 points across multiple VLMs. We hope our work can foster the development of fairer and more reliable AI systems for sustainable development.

2605.21917 2026-05-22 cs.CV cs.AI 版本更新

MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks

MAVEN:面向视频推理任务的多阶段智能体标注流水线

Han Zhang, Wanting Jiang, Tomasz Kornuta, Tian Zheng, Vidya Murali

发表机构 * NVIDIA

AI总结 本文提出MAVEN,一种多阶段智能体标注流水线,可将原始视频转换为带思维链推理轨迹的多任务训练数据,用于视频事件推理;其核心是多尺度时空事件描述,支持智能体驱动的领域适应,并通过分层细化循环改进数据质量,在多个数据集上验证了有效性。

Comments CVPR 2026 Workshop

详情
AI中文摘要

训练用于视频事件推理的视觉语言模型(VLMs)需要高质量的结构化标注,这些标注不仅要刻画发生了什么,还要刻画何时、何地、为何发生以及带来何种后果,而其所需规模是人工标注无法支撑的。我们提出MAVEN(Multi-stage Agentic Video Event aNnotation,多阶段智能体视频事件标注),一种多阶段智能体流水线,围绕指定的焦点事件(Event of Focus),将原始视频转换为带思维链(CoT)推理轨迹的多任务训练数据。其核心是,MAVEN从三个互补的描述层级合成多尺度时空事件描述(MSTED);这一显式的中间表示是下游多种任务格式问答生成的唯一输入。关键的是,MAVEN支持智能体驱动的领域适应:给定新的视频数据集和目标问题示例,智能体可自上而下地重新设计所有提示,无需人工重新设计。分层细化循环进一步依据一套分类体系对标注错误进行归类,将根本原因追溯到产生错误的流水线阶段,并施加有针对性的修改(重写提示或修改流水线结构本身),从而迭代地提升数据质量。我们使用MAVEN标注了5,300多个交通视频,并在所得数据上微调Cosmos-Reason2-8B。在一个私有CCTV评估集上,微调后的模型超过了Gemini 2.5 Pro和3.1 Flash,其中MCQ准确率相较零样本提高了38.8个百分点。在AccidentBench上,仅用CCTV数据训练使Cosmos-Reason2的MCQ成绩提高10.7分,并且在未见过任何行车记录仪视频的情况下与Gemini 2.5 Pro持平;加入经智能体适配的行车记录仪标注后,与Gemini 3.1 Flash的差距缩小,而RL后训练使整体性能超过了两个Gemini基线。在仓库监控和公共安全视频上的定性结果进一步表明,该智能体工作流能够轻松地将流水线适配到新领域。

英文摘要

Training Vision Language Models (VLMs) for video event reasoning requires high-quality structured annotations capturing not only what happened, but when, where, why, and with what consequence, at a scale manual labelling cannot support. We present MAVEN (Multi-stage Agentic Video Event aNnotation), a multi-stage agentic pipeline that turns raw videos into multi-task training data with Chain-of-Thought (CoT) reasoning traces, organized around a designated Event of Focus. At its core, MAVEN synthesizes a Multi-Scale Spatio-Temporal Event Description (MSTED) from three complementary caption levels; this explicit intermediate serves as the sole input to downstream Q&A generation across multiple task formats. Crucially, MAVEN supports agent-driven domain adaptation: given a new video dataset and target question examples, the agent redesigns all prompts top-down without manual re-engineering. A hierarchical refinement loop further classifies annotation errors against a taxonomy, traces root causes to the originating pipeline stage, and applies targeted edits that rewrite prompts or modify the pipeline structure itself, iteratively improving data quality. We apply MAVEN to label over 5,300 traffic videos and fine-tune Cosmos-Reason2-8B on the resulting data. On a private CCTV evaluation set, fine-tuning surpasses both Gemini 2.5 Pro and 3.1 Flash, including a +38.8-point gain in MCQ accuracy over zero-shot. On AccidentBench, CCTV-only training lifts Cosmos-Reason2 by +10.7 MCQ points and matches Gemini 2.5 Pro despite seeing no dashcam videos; adding agent-adapted dashcam annotations narrows the gap to Gemini 3.1 Flash, and RL post-training pushes overall performance past both Gemini baselines. Qualitative results on warehouse surveillance and public safety videos further show the agentic workflow readily adapts the pipeline to new domains.

2605.21903 2026-05-22 eess.SY cs.AI cs.LG cs.NE cs.SY 版本更新

Engineering Hybrid Physics-Informed Neural Networks for Next-Generation Electricity Systems: A State-of-the-Art Review

为下一代电力系统工程混合物理指导神经网络:最新综述

Joseph Nyangon

发表机构 * Energy Exemplar(能源典范)

AI总结 本文综述了用于电力系统的混合物理指导机器学习架构,探讨了物理指导神经网络(PINNs)、深度算子网络(DeepONets)、傅里叶神经算子、极端学习机增强的PINNs、基于图的PINNs(PIGNNs)和域分解PINNs等方法,展示了这些方法在场分析、故障检测、数字孪生、替代建模和控制优化中的应用,以及嵌入麦克斯韦方程等第一原理约束对预测精度、仿真时间和泛化能力的提升。

Comments 59 pages, 6 Figures

详情
AI中文摘要

将机器学习与领域特定物理相结合,正在改变电力系统的設計、監測和控制,其中數據稀缺、解釋性有限以及需要强制物理定律限制了纯数据驱动模型。物理指导机器学习(PIML)通过将支配方程直接嵌入到学习过程中,解决了这些限制,为工业4.0应用提供了准确、高效且可扩展的解决方案。本文综述了用于电力系统的混合PIML架构,包括物理指导神经网络(PINNs)、深度算子网络(DeepONets)、傅里叶神经算子、极端学习机增强的PINNs、基于图的PINNs(PIGNNs)和域分解PINNs。每种方法通过覆盖场分析、故障检测、数字孪生、替代建模和控制优化的案例研究进行审查。综述显示,嵌入麦克斯韦方程和其他第一原理约束显著提高了在稀疏和噪声数据下的预测精度,将仿真时间相对于有限元方法减少了多个数量级,并增强了在不同运行条件下的一般化能力。混合框架在参数敏感性、动态行为和鲁棒性方面始终优于纯数据驱动的基线,同时支持实时数字孪生校准和不确定性量化。持续的挑战包括对于刚性多尺度问题训练不稳定、高保真模型的计算成本以及缺乏标准化的基准。研究结果表明,PIML使从黑箱数据驱动方法向透明、物理指导策略的转变成为可能,为在坚韧和智能电力系统中持续创新奠定了基础。

英文摘要

The integration of machine learning with domain-specific physics is transforming the design, monitoring, and control of electricity systems, where data scarcity, limited interpretability, and the need to enforce physical laws constrain purely data-driven models. Physics-informed machine learning (PIML) addresses these limitations by embedding governing equations directly into the learning process, yielding accurate, efficient, and scalable solutions for Industry 4.0 applications. This article reviews hybrid PIML architectures for electricity systems, including physics-informed neural networks (PINNs), Deep Operator Networks (DeepONets), Fourier Neural Operators, Extreme Learning Machine-enhanced PINNs, graph-based PINNs (PIGNNs), and domain-decomposition PINNs. Each approach is examined through case studies spanning field analysis, fault detection, digital twins, surrogate modeling, and control optimization. The review shows that embedding Maxwell's equations and other first-principles constraints substantially improves predictive accuracy under sparse and noisy data, reduces simulation time by orders of magnitude relative to finite element methods, and enhances generalization across operating regimes. Hybrid frameworks consistently outperform purely data-driven baselines on parameter sensitivity, dynamic behavior, and robustness, while supporting real-time digital-twin calibration and uncertainty quantification. Persistent challenges include training instability for stiff multi-scale problems, computational cost of high-fidelity models, and the absence of standardized benchmarks. The findings demonstrate that PIML enables a paradigm shift from black-box data-driven methods to transparent, physics-informed strategies, positioning the field for sustained innovation in resilient and intelligent electricity systems.

2605.21902 2026-05-22 cs.AI cs.CL 版本更新

Planning in the LLM Era: Building for Reliability and Efficiency

在大语言模型时代进行规划:构建可靠性与效率

Michael Katz, Harsha Kokel, Kavitha Srinivas, Shirin Sohrabi

AI总结 本文探讨了在大语言模型时代规划领域的发展,重点介绍了通过生成可验证的符号求解器来提高规划的可靠性和效率的方法。

Comments Published at ICAPS 2026

详情
AI中文摘要

随着智能代理受到越来越多的关注,规划能力成为其核心能力之一。早期尝试利用大语言模型(LLMs)进行规划的方法主要依赖于单次计划生成,随后发展出结合LLMs与有限外部搜索的混合方法。这些方法本质上不严谨且不完整,往往需要大量资源,但并未在未见问题上产生更好的解决方案。随着对LLMs局限性的认识加深,近期的研究转向在解决方案构建时使用LLMs,生成可用于验证并高效用于推理时间的一类问题的符号求解器。这一趋势反映了对既可靠又资源高效的代理日益增长的需求。它还提供了一条生成可维护规划器的路径,从而在推理时对语言模型的依赖最小化。在本文中,我们论证这种转变反映了在大语言模型时代规划领域更广泛的真实调整。我们检查了三种主要的规划器生成方法类别,讨论了它们当前的局限性,并概述了朝着更可靠和高效的大语言模型驱动的规划器生成的研究步骤。

英文摘要

Growing attention to intelligent agents has put a spotlight on one of their central capabilities: planning. Early attempts to leverage large language models (LLMs) for planning relied on single-shot plan generation, followed by hybrid approaches that coupled LLMs with limited external search. These methods, unsound and incomplete by their very nature, often require substantial resources without yielding better solutions on unseen problems. As the limitations of LLMs become clearer, recent work has shifted toward using them at solution construction time, generating symbolic solvers for a family of problems that can be verified and then used efficiently at inference time. This trend reflects the growing need for agents that are both reliable and resource-efficient. It also offers a path towards generating maintainable planners with minimal dependence on language models at inference time. In this paper, we argue that this shift reflects a broader realignment of the planning field in the LLM era. We examine three major categories of planner-generation methods, discuss their current limitations, and outline research steps towards a more reliable and efficient LLM-based generation of planners.

2605.21869 2026-05-22 cs.CV cs.AI cs.HC 版本更新

Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction

双阶段多模态框架用于情感模仿强度预测

Dinithi Dissanayake, Shaveen Silva, Ovindu Atukorala, Prasanth Sasikumar, Suranga Nanayakkara

发表机构 * Augmented Human Lab, National University of Singapore, Singapore(新加坡国立大学增强人类实验室) University of Moratuwa, Sri Lanka(斯里兰卡穆拉图瓦大学)

AI总结 本文提出了一种双阶段多模态框架,用于从真实视频片段中预测六个连续情绪强度维度,通过结合文本、音频和视觉表示,并可选运动分支,提供了一个实用且可复现的基线。

Comments 10th Affective & Behavior Analysis in-the-wild, CVPR Workshop 2026

详情
AI中文摘要

我们提交了Hume-ABAW10情感模仿强度(EMI)挑战的参赛方案,旨在从真实多模态视频片段中预测六个连续情绪强度维度:钦佩、娱乐、决心、共情痛苦、兴奋和快乐。我们提出了一种分阶段的多模态框架,结合文本、音频和视觉表示,可选运动分支。我们的方法首先独立训练模态特定的编码器,然后通过轻量级回归器融合其学习的表示,通过模态丢弃和受控编码器适应。在我们提交的系统中,最佳验证性能由文本-音频-视觉-运动融合模型在扩展的4:1划分下获得,平均皮尔逊相关系数为0.4722。尽管运动分支仅带来极小的提升,但其行为值得研究。我们的团队在EMI挑战中获得第三名,测试集的平均皮尔逊相关系数为0.57。总体而言,我们提供了一个实用且可复现的EMI预测基线。

英文摘要

We present our submission to the Hume-ABAW10 Emotional Mimicry Intensity (EMI) Challenge, which aims to predict six continuous emotion intensity dimensions (Admiration, Amusement, Determination, Empathic Pain, Excitement, and Joy) from in-the-wild multimodal video clips. We propose a staged multimodal framework that combines textual, acoustic, and visual representations, with an optional motion branch. Our approach first trains modality-specific encoders independently and then fuses their learned representations through a lightweight regressor with modality dropout and controlled encoder adaptation. Across our submitted systems, the best validation performance is obtained by the text-audio-vision-motion fusion model under the expanded 4:1 split, achieving an average Pearson correlation of 0.4722. Although the motion branch yields only slight gains, its behavior merits further study. Our team placed third in the EMI challenge, achieving an average Pearson correlation of 0.57 on the test set. Overall, we provide a practical and reproducible baseline for EMI prediction.
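The challenge metric quoted above, an average Pearson correlation over the six emotion dimensions, can be sketched as follows. This is a minimal illustration only: the function names, the dictionary layout, and the handling of constant columns are assumptions for this sketch, not the organizers' scoring code.

```python
from math import sqrt

# The six EMI dimensions named in the abstract.
EMOTIONS = ["Admiration", "Amusement", "Determination",
            "Empathic Pain", "Excitement", "Joy"]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of floats."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column is scored as zero correlation.
        return 0.0
    return cov / sqrt(var_x * var_y)


def mean_pearson(preds, targets):
    """Average the per-dimension correlations.

    preds and targets map each emotion name to a list of intensities,
    one value per clip (hypothetical layout chosen for this sketch).
    """
    scores = [pearson(preds[e], targets[e]) for e in EMOTIONS]
    return sum(scores) / len(scores)
```

Averaging per-dimension correlations, rather than pooling all values into one correlation, keeps a dimension with a wide intensity range from dominating the score.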

2605.21862 2026-05-22 cs.RO cs.AI 版本更新

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

EvoScene-VLA: 在动作解码器中进化场景信念用于分块机器人控制

Chushan Zhang, Ruihan Lu, Jinguang Tong, Xuesong Li, Yikai Wang, Hongdong Li

发表机构 * Australian National University(澳大利亚国立大学) The University of Queensland(昆士兰大学) Beijing Normal University(北京师范大学)

AI总结 本文提出EvoScene-VLA,通过在动作解码器中维护更新的场景状态,改进分块机器人控制中的多步控制预测,提升了场景信念的持续性和准确性。

详情
英文摘要

Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, Scene Predictor supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.

2605.21861 2026-05-22 cs.CV cs.AI 版本更新

Learning Emergent Modular Representations in Multi-modality Medical Vision Foundation Models

在多模态医学视觉基础模型中学习涌现的模块化表示

Yuting He, Chenyu You, Shuo Li

发表机构 * Case Western Reserve University(凯斯西储大学) Stony Brook University(石溪大学)

AI总结 本文提出Director-Experts (DEX)框架,通过调控模块化动态,在多模态医学视觉基础模型中学习稳定的模块化表示,并在新的医学视觉基准数据集上验证了其在26个下游任务中的优越性。

Comments Accepted by KDD 2026

详情
AI中文摘要

多模态医学视觉(MV)基础模型(FM)在异质成像模态间面临显著的非独立同分布(Non-IID)特征统计挑战。对这类数据进行单一监督优化会引发冲突梯度,导致表示向模态主导的捷径坍缩。本文将这一失败重新解释为涌现模块化中专门化与协调之间的失衡,并提出Director-Experts(DEX)模块化网络,该网络在堆叠模块中显式调控这些动态。每个DEX模块包含一组专家,通过我们的图像级激活策略动态适应,自主专注于模态主导的统计特征,同时结合通过我们组指数移动平均更新的Director,将多专家知识蒸馏到共享空间,实现跨模态的语义整合,从而驱动模块化表示的涌现。我们构建了一个新的基准数据集Medical Vision Universe,包含超过400万张图像,覆盖10种模态,为DEX提供了最广泛的模态覆盖的FM级预训练。在26个下游任务上的广泛评估表明,DEX在优化行为和迁移性方面有所改进,表明DEX是通用多模态医学AI的有原则的一步。我们的代码和数据集将在https://github.com/YutingHe-list/DEX上公开。

英文摘要

Multi-modality medical vision (MV) foundation models (FM) are fundamentally challenged by pronounced Non-IID feature statistics across heterogeneous imaging modalities. Monolithic self-supervised optimization on such data induces conflicting gradients, driving representations to collapse toward modality-dominant shortcuts. This work reframes this failure as an imbalance between specialization and coordination in emergent modularity, and proposes Director-Experts (DEX), a modular network that explicitly regulates these dynamics in stacked modules. Each DEX module comprises a pool of experts, dynamically adapted by our image-wise activation strategy, that autonomously specialize in modality-dominant statistics, together with a director, updated via our group exponential moving average, which distills multi-expert knowledge into a shared space for semantic integration across modalities, thus driving the emergence of modular representations. We curate a new benchmark, Medical Vision Universe, comprising over 4 million images across 10 modalities, which provides FM-level pre-training for DEX with the broadest coverage of distinct imaging modalities. Extensive evaluations on 26 downstream tasks demonstrate improved optimization behavior and transferability, indicating DEX as a principled step toward general-purpose multi-modality medical AI. Our code and dataset will be released at https://github.com/YutingHe-list/DEX.

2605.21856 2026-05-22 cs.LG cs.AI 版本更新

The Illusion of Reasoning: Exposing Evasive Data Contamination in LLMs via Zero-CoT Truncation

推理的幻觉:通过零CoT截断揭示LLM中的逃避数据污染

Yifan Lan, Yuanpu Cao, Hanyu Wang, Lu Lin, Jinghui Chen

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 本文提出零CoT探针(ZCP)方法,通过截断整个推理过程来暴露模型中的潜在捷径映射,以检测LLM中的直接和逃避数据污染,提出了 contamination confidence 指标来量化污染的可能性和严重性。

详情
AI中文摘要

大型语言模型(LLMs)在广泛的任务上展示了令人印象深刻的推理能力,但数据污染破坏了这些能力的客观评估。这个问题进一步加剧了恶意模型发布者使用逃避或间接污染策略,例如改写基准数据以逃避现有检测方法并人为提升排行榜表现。当前的方法难以可靠地检测这种隐蔽的污染。在本工作中,我们揭示了一个关键现象:模型生成的推理步骤主动掩盖其底层的记忆。受此启发,我们提出了零CoT探针(ZCP),一种新颖的黑盒检测方法,故意截断整个链式思维(CoT)过程以暴露潜在的捷径映射。为进一步将记忆与模型的内在问题解决能力区分开来,ZCP将模型在原始基准上的零CoT表现与等价扰动的参考数据集进行比较。此外,我们引入了污染置信度(Contamination Confidence),一个量化污染可能性和严重性的指标,超越了简单的二元分类。对已识别的污染模型和特别微调的污染模型的广泛实验表明,ZCP能够稳健地检测直接和逃避的数据污染。ZCP的代码可在https://github.com/Yifan-Lan/zero-cot-probe获取。

英文摘要

Large language models (LLMs) have demonstrated impressive reasoning abilities across a wide range of tasks, but data contamination undermines the objective evaluation of these capabilities. This problem is further exacerbated by malicious model publishers who use evasive, or indirect, contamination strategies, such as paraphrasing benchmark data to evade existing detection methods and artificially boost leaderboard performance. Current approaches struggle to reliably detect such stealthy contamination. In this work, we uncover a critical phenomenon: a model's generated reasoning steps actively mask its underlying memorization. Inspired by this, we propose the Zero-CoT Probe (ZCP), a novel black-box detection method that deliberately truncates the entire Chain-of-Thought (CoT) process to expose latent shortcut mappings. To further isolate memorization from the model's intrinsic problem-solving capabilities, ZCP compares the model's zero-CoT performance on the original benchmark against an isomorphically perturbed reference dataset. Furthermore, we introduce Contamination Confidence, a metric that quantifies both the likelihood and severity of contamination, moving beyond simple binary classifications. Extensive experiments on both previously identified contaminated models and specially fine-tuned contaminated models demonstrate that ZCP robustly detects both direct and evasive data contamination. The code for ZCP is accessible at https://github.com/Yifan-Lan/zero-cot-probe.
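The comparison at the heart of ZCP, zero-CoT accuracy on the original benchmark versus an isomorphically perturbed reference set, can be illustrated with a small sketch. The abstract does not define Contamination Confidence, so the code below only computes the raw accuracy gap; the function names and the exact-match scoring are assumptions for illustration, not the authors' implementation.

```python
def accuracy(predictions, answers):
    """Fraction of exact-match answers (assumed scoring rule)."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def zero_cot_gap(orig_preds, orig_answers, pert_preds, pert_answers):
    """Zero-CoT accuracy on the original items minus accuracy on the
    perturbed reference items.

    A large positive gap is consistent with memorized shortcut mappings,
    because the perturbed items require the same reasoning but cannot be
    answered from memory of the original benchmark.
    """
    return accuracy(orig_preds, orig_answers) - accuracy(pert_preds, pert_answers)


# Hypothetical answers obtained with the reasoning trace truncated.
gap = zero_cot_gap(
    orig_preds=["A", "C", "B", "D"], orig_answers=["A", "C", "B", "A"],
    pert_preds=["B", "C", "A", "D"], pert_answers=["A", "C", "B", "A"],
)
```

In this toy input the model answers three of four original items correctly without reasoning but only one of four perturbed items, so the gap is positive.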

2605.21845 2026-05-22 cs.CL cs.AI 版本更新

Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity

对比LLM和微调模型在不同提示复杂度下的NVDRS情境提取性能

Geoffrey Martin, Xuan Zhong Feng, Yifan Peng

发表机构 * Department of Population Health Sciences, Weill Cornell Medicine(人群健康科学系,威尔·康奈尔医学学院) Systems Engineering, Cornell University(系统工程,康奈尔大学)

AI总结 本文研究了在不同提示复杂度下,LLM与微调模型在NVDRS情境提取任务中的表现差异,提出了一种复杂度评分算法,并展示了一个混合方法,通过不同情境选择提示策略,发现LLM在低 prevalence 情境中表现更优,且框架能跨不同前沿LLM通用。

Comments Accepted at IEEE ICHI 2026

详情
AI中文摘要

自杀是美国的主要死亡原因之一,理解其前因需要从死亡调查叙述中提取结构化信息。许多前因需要语义推理而非简单的关键词匹配。我们开发了一种“复杂度评分”算法,分析编码手册结构以预测何时详细提示(包含完整编码指南)比仅名称提示更优。随后,我们构建了一种混合方法,根据情境选择提示策略。我们评估了大型语言模型(LLMs)与微调的RoBERTa在25个从国家暴力死亡报告系统(NVDRS)中提取的推断复杂情境上的表现。我们发现,在训练数据不足的低 prevalence 情境中,LLMs表现显著优于微调模型。我们进一步展示了我们的框架能够跨前沿LLM通用,GPT-5.2、Gemini 2.5 Pro和Llama-3 70B显示出一致的表现模式。这些发现支持了一种混合架构,其中LLMs处理罕见的推断复杂情境,而微调模型处理常见情境。

英文摘要

Suicide is a leading cause of death in the United States, and understanding the circumstances that precede it requires extracting structured information from death investigation narratives. Many of these circumstances require semantic inference beyond simple keyword matching. We develop a "Complexity Score" algorithm that analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts. We then construct a hybrid approach that selects prompt strategy per circumstance. We evaluate large language models (LLMs) against fine-tuned RoBERTa on 25 inferentially complex circumstances from the National Violent Death Reporting System (NVDRS). We find that LLMs substantially outperform on low-prevalence circumstances where training data is insufficient. We further demonstrate that our framework generalizes across frontier LLMs, with GPT-5.2, Gemini 2.5 Pro and Llama-3 70B showing consistent performance patterns. These findings support a hybrid architecture where LLMs handle rare, inferentially complex circumstances while fine-tuned models handle common ones.

2605.21835 2026-05-22 eess.IV cs.AI cs.CV physics.med-ph 版本更新

An Open Multi-Center Whole-Body FDG PET/CT Foundation Model for Tumor Segmentation

一种开放的多中心全身FDG PET/CT基础模型用于肿瘤分割

Xiaofeng Liu, Qianru Zhang, Thibault Marin, Menghua Xia, Chi Liu, Georges El Fakhri, Jinsong Ouyang

发表机构 * Department of Radiology and Biomedical Imaging, Yale Biomedical Imaging Institute, Yale University(放射学与生物医学成像系,耶鲁生物医学影像研究所,耶鲁大学)

AI总结 本文提出了一种开放的多中心全身FDG PET/CT基础模型,通过整合四个公开数据集中的4997份标准化扫描,利用层次UNet结构和早期通道拼接实现解剖和代谢特征的交互,提高了肿瘤分割的标签效率和跨模态表征学习能力。

Comments Code available at: https://github.com/liu-xiaofeng/Foundation-Model-for-PET-CT

详情
AI中文摘要

解剖信息来自计算机断层扫描(CT)和代谢信息来自正电子发射断层扫描(PET)的协同解释对于肿瘤成像至关重要。然而,现有的PET/CT深度学习方法大多任务特定,通常在单一中心队列上训练,或者采用双分支融合方案,这延迟了跨模态交互并低估了PET和CT之间早期空间对应关系。为了解决这些限制,我们提出了一种开源的、多中心的、全身FDG PET/CT基础模型,利用四个公开数据集中的4997份标准化扫描。我们的框架采用层次UNet形状的后端,并在早期通道拼接,使解剖和代谢特征从第一个嵌入层开始交互。我们进一步引入基于零均值填补的掩码自编码目标,结合加权全局重建损失。这种设计避免了由于可学习掩码标记产生的非物理强度不连续性。在下游AutoPET病变分割中,所提出的模型显示出强大的标签效率:仅使用10%的标记训练数据,即可达到在完整数据集上训练的模型的性能。在极端5-shot线性探测下,联合PET/CT预训练也比单独模态预训练取得了更高的Dice分数。这种多中心基础模型展示了PET/CT肿瘤分割的标签效率和跨模态表征学习能力。它为推进自动化肿瘤成像提供了稳健、开源的基础,显著减少了临床实践中大规模手动注释的需求。

英文摘要

The synergistic interpretation of anatomical information from computed tomography (CT) and metabolic information from positron emission tomography (PET) is important to oncologic imaging. However, existing deep learning methods for PET/CT remain largely task-specific, are often trained on single-center cohorts, or adopt dual-branch fusion schemes that delay cross-modal interaction and underutilize early spatial correspondence between PET and CT. To address these limitations, we present an open-source, multi-center, whole-body FDG PET/CT foundation model utilizing 4,997 harmonized scans from four public datasets. Our framework employs hierarchical UNet-shaped backbones with early channel-wise concatenation, enabling anatomical and metabolic features to interact from the first embedding layer onward. We further introduce a masked autoencoding objective based on zero-mean imputation, combined with a weighted global reconstruction loss. This design avoids non-physical intensity discontinuities at masked-region boundaries that arise from learnable mask tokens. On downstream AutoPET lesion segmentation, the proposed models demonstrate strong label efficiency: with only 10% of the labeled training data, they achieve performance comparable to models trained from scratch on the full dataset. Under extreme 5-shot linear probing, joint PET/CT pretraining also achieves higher Dice scores than separated-modality pretraining. This multi-center foundation model demonstrates label efficiency and cross-modality representation learning for PET/CT tumor segmentation. It provides a robust, open-source basis for advancing automated oncologic imaging, significantly reducing the need for large-scale manual annotations in clinical practice.
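The Dice score used to compare segmentation quality above can be sketched for binary masks as follows. This is a generic illustration of the metric, not the evaluation code from the linked repository; the flat-list mask layout and the empty-mask convention are assumptions.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks.

    pred and target are flat sequences of 0/1 voxel labels of equal
    length (a 3D volume would be flattened first).
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    return 2 * intersection / total


# One overlapping voxel out of two predicted and two true voxels.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

Dice weights overlap against the combined size of both masks, which is why it is preferred over plain voxel accuracy when lesions occupy a small fraction of a whole-body scan.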

2605.21827 2026-05-22 cs.CL cs.AI 版本更新

Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions

‘稍微’意味着‘somewhat’吗?在LLM数值行为中测量模糊强度词

Daniel Tabach

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文研究了语言模型在必须生成数值行为时是否保留强度词的顺序意义,发现模型在数值输出中压缩了模糊强度词,其解释依赖于状态并接近操作边界时出现不连续性。

Comments 9 figures, 2 tables, 16 references

详情
AI中文摘要

语言模型在必须生成数值行为时是否保留强度词的顺序意义?我研究了一个由研究者构建的10个英语程度修饰词尺度,从稍微到剧烈,依据Quirk等人程度修饰词分类法,在受控资源分配环境中进行测试,其中Claude Haiku接收自然语言指令,生成数值分配,并由确定性后端转换为可测量结果。在测试中,唯一变化的变量是强度词或起始系统状态,从而隔离了它们对模型数值输出的影响。在6,620次运行中(T=0.0和T=0.7),出现了三种模式。首先,模型将10个强度词压缩为5个不同的中位数输出:四个低层级词都映射到相同值,而更强的词则进入更高层级(Spearman rho=0.845,p<0.001)。其次,当当前系统状态作为上下文提供时,Kruskal-Wallis检验显示按起始分配分组捕捉到的基于排名的方差远多于按词分组(epsilon-squared基线=0.782 vs. epsilon-squared词=0.079),并且当系统接近容量时,词汇区分度降为零。第三,接近可行性极限时,模型表现出三种行为模式:弱词通过小调整进行妥协,强词完全回避,而“剧烈”词则推至局部天花板。这些模式在温度变化下保持不变,随机采样扩展了分布但未恢复词之间的顺序差异。在该模型和领域中,模型对模糊强度词的数值解释是压缩的、依赖状态的,并且在操作边界附近出现不连续性。

英文摘要

Do language models preserve the ordinal meaning of intensity words when those words must produce numeric actions? I study a researcher-constructed scale of 10 English degree modifiers, from slightly to drastically, informed by the Quirk et al. degree-modifier taxonomy, in a controlled resource-allocation environment where Claude Haiku receives a natural-language instruction, produces a numeric allocation, and a deterministic backend converts that allocation into a measurable outcome. The only variable that changes between runs is the intensity word or the starting system state, isolating their effects on the model's numeric output. Across 6,620 runs at T=0.0 and T=0.7, three patterns emerge. First, the model compresses 10 intensity words into 5 distinct median outputs: four lower-tier words all map to the same value, while stronger words break into higher regimes (Spearman rho = 0.845, p < 0.001). Second, when the current system state is supplied as context, separate Kruskal-Wallis tests show that grouping by starting allocation captures far more rank-based variance than grouping by word (epsilon-squared baseline = 0.782 vs. epsilon-squared word = 0.079), and lexical differentiation collapses to zero as the system approaches capacity. Third, near feasibility limits the model exhibits three behavioral modes: weak words hedge with small adjustments, strong words abstain entirely, and the word drastically pushes to the local ceiling. These patterns persist across temperature, with stochastic sampling broadening distributions but not restoring ordinal distinctions between words. In this model and domain, the model's numeric interpretation of vague intensity words is compressed, state-dependent, and discontinuous near operational boundaries.
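The epsilon-squared effect sizes reported above are derived from the Kruskal-Wallis H statistic. A minimal sketch of that computation follows, assuming the common formula epsilon-squared = H / (n - 1) and omitting the tie correction to H; the helper names are hypothetical and the abstract does not state which implementation the author used.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H without tie correction (simplifying assumption)."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def epsilon_squared(groups):
    """Rank-based effect size: share of rank variance explained by grouping."""
    n = sum(len(g) for g in groups)
    return kruskal_h(groups) / (n - 1)
```

Grouping the same allocations once by intensity word and once by starting state, then comparing the two epsilon-squared values, mirrors the comparison described in the abstract.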

2605.21825 2026-05-22 cs.AI cs.HC 版本更新

Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks

迈向AI可视化合作者:一种通用且端到端的代理工具,用于解决复杂数据可视化任务

Haichao Miao, Zhimin Li, Kuangshi Ai, Kaiyuan Tang, Chaoli Wang, Peer-Timo Bremer, Shusen Liu

发表机构 * Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室) Vanderbilt University(范德比大学) University of Notre Dame(圣母大学)

AI总结 本文提出了一种端到端的代理工具,能够基于数据和高层任务描述自动生成定制的可视化分析应用,推动实现通用AI合作者的愿景。

详情
AI中文摘要

检查、解释和沟通复杂数据的能力对于任何科学探索都是至关重要的,但通常需要在核心领域之外的大量专业知识,从数据管理和分析到可视化设计和实现。我们提出了一种端到端的代理工具,仅基于数据和任务的高层描述,独立设计定制的可视化分析应用(VIS应用)。这代表了朝着许多设想的通用AI合作者的重要一步,即一个能够根据高层指令自主执行长周期任务的自主系统。我们提出的VIS合作者是这一更广泛AI合作者愿景的重要组成部分:一个能够利用一组代理和专门技能,自主分析数据并设计可视化解决方案的工具,这些技能协调探索性分析、计划、配置环境、实施、验证界面,并最重要的是评估整体任务完成情况。每个阶段都产生文档和指令制品,指导后续工作并实现迭代改进。我们通过IEEE SciVis比赛验证了这种方法,这些比赛覆盖多个科学和工程领域,是理想的测试场,因为它们编码了现实世界的复杂性:模糊的要求、多样的数据模态、设计权衡和任务驱动的验证。仅给定数据和目标任务,我们的系统能够自主生成具有验证链接视图行为的功能单页VIS应用,高度定制以满足领域专家指定的任务和需求。

英文摘要

The ability to inspect, interpret, and communicate complex data is crucial for virtually any scientific endeavor, but often requires significant expertise outside the core domain, ranging from data management and analysis to visualization design and implementation. We present an end-to-end agentic harness that, based on only the data and a high-level description of the tasks, independently designs custom visual analysis applications (VIS apps). This represents an important step towards a general AI co-scientist, envisioned by many as an autonomous system that can execute long-horizon tasks based on high-level directions. Our proposed VIS co-scientist is an essential component of this broader AI co-scientist vision: a harness that can autonomously analyze data and design visualization solutions using a collection of agents and specialized skills that coordinate exploratory analysis, plan, configure the environment, implement, validate the interface, and most importantly evaluate the overall task completion. Each stage produces document and instruction artifacts that guide downstream work and enable iterative refinement. We validate this approach on IEEE SciVis Contests spanning multiple science and engineering fields. These contests serve as ideal proving grounds because they encode real-world complexity: ambiguous requirements, diverse data modalities, design trade-offs, and task-driven validation. Given only the data and target tasks, our system autonomously produces functional single-page VIS apps with verified linked-view behavior, highly customized to domain experts' specified tasks and needs.

2605.21822 2026-05-22 cs.AI 版本更新

Implicit Safety Alignment from Crowd Preferences

从大众偏好中隐式安全对齐

Qian Lin, Daniel S. Brown

发表机构 * Kahlert School of Computing, University of Utah(犹他大学计算学校)

AI总结 本文研究如何从大众偏好数据中提取共享的安全标准,并将其转移到下游强化学习任务中以规范智能体行为并确保安全。提出了一种基于大众偏好的安全强化学习框架,通过高级策略将安全对齐的技能组合起来,以安全地解决下游任务。

Comments Accepted to ICML 2026. Conference paper

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)可以揭示超出任务完成的隐式目标,如安全考虑。在本工作中,我们关注嵌入在大众偏好数据集中的常见安全标准,其中不同用户可能表达不同的偏好或目标,但遵循相似的安全原则。我们的目标是从大众偏好中发现共享的安全标准,并将其转移到下游RL任务中以规范智能体行为并确保安全。我们首先证明了直接奖励组合——优化偏好学习的奖励模型与下游任务奖励——具有内在限制。受此启发,我们提出了基于大众偏好的安全强化学习(Safe Crowd Preference-based RL),这是一种分层框架,从大众偏好中提取安全对齐的技能,并通过高级策略将它们组合起来,以安全地解决下游任务。在安全RL环境和一个初步的LLM风格任务中,实验表明,我们的方法在没有访问显式安全奖励的情况下显著降低了安全成本,同时在任务性能上与使用真实安全信号训练的oracle方法相当。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives such as safety considerations that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination (optimizing a preference-learned reward model together with downstream task rewards) has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based RL, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments across safe RL environments and a preliminary LLM-style task with diverse user goals and shared safety constraints demonstrate that our approach substantially lowers safety costs without access to explicit safety rewards, while achieving task performance comparable to oracle methods trained with ground-truth safety signals.

2605.21810 2026-05-22 cs.AI cs.MA 版本更新

Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

Trace2Skill: 验证器引导的技能进化用于长上下文EDA代理

Zijian Du, Nathaniel Pinckney

发表机构 * NVIDIA

AI总结 本文提出Trace2Skill框架,通过验证器引导的技能进化提升硬件代理在复杂验证问题上的性能,无需RTL专用模型微调,通过密集验证器反馈实现任务通过率的显著提升。

详情
AI中文摘要

复杂Verilog设计问题(CVDP)挑战硬件LLM代理,因为解决这些问题需要在大型仓库快照中本地化相关RTL、测试平台、包含路径和构建依赖,进行精确编辑,并从稀疏隐藏验证器失败中恢复。我们提出了Trace2Skill,一个测试时间扩展框架,它在不进行RTL专用模型微调的情况下改进硬件代理。而不是训练新模型或仅采样更多候选解决方案,Trace2Skill将代理的自然语言技能视为可进化策略。它挖掘重复的运行轨迹以识别成功和失败模式,将其转换为密集的诊断和 oracle 教训,并使用 oracle、变异器和选择器循环生成任务特定的技能,以引导后续的搜索、编辑、验证和恢复。由于最终通过/失败标签通常对硬故障太粗略,Trace2Skill还支持有界运行时间密集验证器反馈,该反馈返回经过清理的功能观察,同时保持隐藏的Harness和参考解决方案对代理不可见。这种反馈通过连接技能文本、验证器证据和下游行为来引导技能进化和代理执行。在击败种子CVDP代理的硬CVDP任务上,包括也击败前沿编码代理的任务,Trace2Skill结合密集验证器反馈显著提高了任务通过率,并在之前未解决的任务上实现了突破性通过,而无需高质量微调数据、专用RTL模型训练或模型权重更新。相同的框架提供了一种通用测试时间扩展策略,可以扩展到其他可验证的EDA任务。

英文摘要

Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model fine-tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior. Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high-quality fine-tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test-time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.

2605.21792 2026-05-22 cs.CL cs.AI cs.DB cs.LG 版本更新

Residual Skill Optimization for Text-to-SQL Ensembles

残差技能优化用于文本到SQL集成

Jiongli Zhu, Haoquan Guan, Parjanya Prajakta Prashant, Nikki Lijing Kuang, Seyedeh Baharan Khatami, Canwen Xu, Xiaodong Yu, Yingyu Lin, Zhewei Yao, Yuxiong He, Babak Salimi

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Snowflake AI Research(Snowflake人工智能研究)

AI总结 本文提出DivSkill-SQL,一种残差技能优化框架,通过在当前技能集成失败的示例上优化新技能,从而构建互补的文本到SQL集成,提升Pass@K性能,在Spider2-Lite上实现了显著的准确性提升,同时在不同方言和任务上表现出一致的改进。

详情
AI中文摘要

文本到SQL集成通过生成多个SQL候选并从中选择一个,其表现优于单一候选生成,但其效果受限于Pass@K,即K个候选中至少有一个正确的概率。现有方法通过随机解码或提示变体启发式地引入多样性,导致候选集受相关失败主导。我们提出DivSkill-SQL,一种残差技能优化框架,无需模型微调即可构建互补的智能体式文本到SQL集成:每个新技能都在当前技能集成失败的示例上进行优化,可证明地针对其对Pass@K的边际贡献。在Spider2-Lite上,DivSkill-SQL相对最强集成基线将选定准确率在Snowflake上最多提升11.1个点,在BigQuery上最多提升8.3个点,且在两个基础模型(Opus-4.6和GPT-5.4)上收益一致。在单一方言上优化的技能无需重新训练即可迁移到其他方言(Snowflake、BigQuery、SQLite)以及不同的任务形式,例如BIRD-Critic(+2.6个点)。错误诊断显示,幻觉的模式引用和函数调用最多减少至原来的约三分之一,表明收益来自真正可靠的互补技能,而非表面形式变化。

英文摘要

Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
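The Pass@K bound and the residual idea in the abstract above can be illustrated with a small sketch. This is not the DivSkill-SQL method itself: the paper optimizes new skills on failing examples, whereas the sketch below only selects from a fixed, hypothetical pool of skills whose per-example outcomes are already known. The names `solved_by`, `pass_at_k`, and `greedy_residual_ensemble` are illustrative assumptions.

```python
from typing import Dict, List, Set


def pass_at_k(solved_by: Dict[str, Set[int]], skills: List[str], n_examples: int) -> float:
    """Fraction of examples solved by at least one of the selected skills."""
    covered: Set[int] = set()
    for skill in skills:
        covered |= solved_by[skill]
    return len(covered) / n_examples


def greedy_residual_ensemble(solved_by: Dict[str, Set[int]], n_examples: int, k: int) -> List[str]:
    """Pick up to k skills, each chosen for its gain on examples the ensemble still fails."""
    chosen: List[str] = []
    residual = set(range(n_examples))  # examples no chosen skill solves yet
    for _ in range(k):
        candidates = [s for s in solved_by if s not in chosen]
        best = max(candidates, key=lambda s: len(solved_by[s] & residual), default=None)
        if best is None or not (solved_by[best] & residual):
            break  # no remaining skill adds marginal coverage
        chosen.append(best)
        residual -= solved_by[best]
    return chosen


# Hypothetical outcomes: skills "a" and "b" are individually strong but fail together.
outcomes = {"a": {0, 1, 2, 3}, "b": {0, 1, 2, 4}, "c": {5, 6}}
ensemble = greedy_residual_ensemble(outcomes, n_examples=8, k=2)
print(ensemble, pass_at_k(outcomes, ensemble, 8), pass_at_k(outcomes, ["a", "b"], 8))
```

In this toy pool, pairing the two individually strongest skills covers fewer examples than pairing one strong skill with a weaker but complementary one, which is the correlated-failure effect the abstract describes.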

2605.21789 2026-05-22 hep-ex cs.AI 版本更新

Patch Hierarchical Attention Transformer for Efficient Particle Jet Tagging

用于高效粒子喷注标记的补丁层次注意力Transformer

Aaron Wang, Zihan Zhao, Alan Xia, Chang Sun, Abhijith Gandrakota, Jennifer Ngadiuba, Richard Cavanaugh, Javier Duarte

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) University of California San Diego(加州大学圣地亚哥分校) California Institute of Technology(加州理工学院) Fermi National Accelerator Laboratory(费米国家加速器实验室)

AI总结 本文提出了一种结合物理启发的几何信息传递模块和基于补丁的层次注意力机制的Patch Hierarchical Attention Transformer (PHAT-JeT),以在有限资源下实现高效的粒子喷注标记,从而在四个基准测试中取得最佳性能。

详情
AI中文摘要

实时喷注标记对于在大型强子对撞机的高通量探测器中识别短寿命粒子衰变至关重要,其中负责决定存储哪些碰撞事件的实时触发系统对延迟和准确性提出了严格要求。尽管变换器架构在计算不受限制时能够实现最高的喷注标记准确性,但其二次自注意力成本使得在触发预算内进行推理变得受限。现有的高效变体虽然降低了计算成本,但会阻碍分类性能。为了解决这一限制,我们引入了Patch Hierarchical Attention Transformer (PHAT-JeT),它结合了两个机制:一个受物理启发的几何信息传递模块,用于编码局部探测器平面结构,以及一个基于补丁的层次注意力方案,该方案在小粒子组内计算精确的注意力,同时通过轻量级补丁-标记通信保持全局上下文。在受限预算内,PHAT-JeT在四个基准测试(hls4ml、JetClass、Top Tagging 和 Quark-Gluon)中实现了所有资源受限喷注标记模型中的最佳准确性和背景拒绝率。我们的代码可在 https://github.com/aaronw5/PHAT-JeT 上获得。

英文摘要

Real-time jet tagging is critical for identifying short-lived particle decays in the high-throughput detectors of the Large Hadron Collider, where the real-time trigger systems that decide which collision events to store impose strict latency and accuracy constraints. While transformer architectures achieve the highest jet tagging accuracy when compute is unconstrained, their quadratic self-attention cost makes inference prohibitive within trigger budgets. Existing efficient variants reduce the computational cost but degrade classification performance. To address this limitation, we introduce the Patch Hierarchical Attention Transformer (PHAT-JeT), which combines two mechanisms: a physics-inspired geometric message-passing module that encodes local detector-plane structure, and a hierarchical patch-based attention scheme that computes exact attention within small particle groups while preserving global context through lightweight patch-token communication. Within a restricted budget, PHAT-JeT achieves state-of-the-art accuracy and background rejection among all resource-constrained jet tagging models on four benchmarks (hls4ml, JetClass, Top Tagging, and Quark-Gluon). Our code is available at https://github.com/aaronw5/PHAT-JeT.

2605.21778 2026-05-22 cs.AI 版本更新

What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

什么是AI阿谀奉承?对一个碎片化概念的分类和专家调查

Meryl Ye, Lujain Ibrahim, Jessica Y. Bo, Myra Cheng, Ida Mattsson, Daniel Vennemeyer, Robert Kraut, Steve Rathje

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of Oxford(牛津大学) University of Toronto(多伦多大学) Stanford University(斯坦福大学) University of Cincinnati(辛辛那提大学) New York University(纽约大学)

AI总结 本文通过文献综述和专家调查,揭示了AI阿谀奉承行为的分类和测量挑战,提出了一种统一的分类体系以促进对这一问题的理解和应对。

详情
AI中文摘要

AI阿谀奉承已成为大型语言模型(LLM)研究中的重要关注点。然而,该术语缺乏一致的定义,已被用于描述从同意用户的错误主张、过度赞扬用户,到不提供纠正性反馈等各种行为。当研究人员、公司和政策制定者用同一术语描述不同行为时,评估结果难以比较,缓解策略无法迁移,对一种阿谀奉承形式具有抵抗力的系统仍会表现出其他形式。为此,我们做出了两项贡献。首先,我们回顾了70篇关于AI阿谀奉承的论文,就该行为如何被定义和测量建立了分类体系。该分类区分了(1)模型是对用户的立场和信念表现出阿谀奉承,还是对用户更广泛的个人特质和情绪表现出阿谀奉承,以及(2)这种行为是通过显性的直接语言表现,还是通过框架、省略或语气等更隐性的微妙行为表现。将现有文献映射到我们的分类中,发现当前研究主要集中在针对用户信念的显性阿谀奉承上,而更微妙和针对个人的行为研究相对不足。其次,我们调查了106位AI阿谀奉承及相关领域的专家,以检验研究人员对于哪些模型行为属于阿谀奉承是否存在共识。尽管专家几乎一致认为阿谀奉承是当前AI系统中的重大问题(94.3%同意),但他们对哪些具体行为符合阿谀奉承存在显著分歧。综合来看,这些发现表明,AI阿谀奉承是一个宽泛的行为家族,具有不同的测量挑战、干预要求和治理影响。我们的分类提供了一套共享的词汇,用于理解和应对这些行为。

英文摘要

AI sycophancy has become a prominent concern in large language model (LLM) research. Yet the term lacks a consistent definition and has been applied to behaviors ranging from agreeing with a user's false claim to excessively praising the user to withholding corrective feedback. When researchers, companies, and policymakers use the same term to describe different behaviors, evaluation results become difficult to compare, mitigation strategies fail to transfer, and systems that are resistant to one form of sycophancy continue exhibiting other forms. To address this, we make two contributions. First, we reviewed 70 papers on AI sycophancy to develop a taxonomy of how the behavior has been defined and measured. The taxonomy distinguishes (1) whether a model is sycophantic toward a user's positions and beliefs, or toward the user's broader personal traits and emotions, and (2) whether this occurs through explicit, direct language or more implicit, subtle behaviors such as framing, omission, or tone. Mapping existing literature to our taxonomy reveals that current research has focused on overt forms of sycophancy toward users' beliefs, leaving more subtle and person-directed behaviors relatively understudied. Second, we surveyed 106 experts in AI sycophancy and related fields to examine whether researchers agree on which model behaviors are sycophantic. While experts are nearly unanimous in believing that sycophancy is a significant problem in current AI systems (94.3% agree), they disagree substantially on which specific behaviors qualify. Together, these findings demonstrate that AI sycophancy is a broad family of behaviors with different measurement challenges, intervention requirements, and governance implications. Our taxonomy provides a shared vocabulary for understanding and addressing these behaviors.

2605.21777 2026-05-22 cs.HC cs.AI 版本更新

Understanding Perspectives of Patients, Caregivers and Clinicians towards Emerging Collaborative-decision Making Technologies

理解患者、护理人员和临床医生对新兴协作决策技术的看法

Ray-Yuan Chung, Athena Ortega, Zixuan Xu, Daeun Yoo, Jaime Snyder, Wanda Pratt, Aaron Wightman, Ryan Hutson, Cozumel Pruette, Ari Pollack

发表机构 * University of Washington(华盛顿大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 研究探讨了患者、护理人员和临床医生对协作决策技术的态度差异,发现技术接受度与用户对技术的信任有关,需探索建立或促进用户与技术之间信任的设计和实施策略。

Comments Accepted at The Workshop on Interactive Systems in Healthcare (WISH) at AMIA Annual Symposium 2025

详情
AI中文摘要

在儿科领域,患者、护理人员和临床医生共同承担健康决策的责任,但有限的协作可能会影响结果。我们进行了一项定性研究,探讨决策者对协作决策技术(包括交互式仪表盘、VR模拟器和AI语音助手)的看法。研究结果揭示了不同群体在用户意见上的差异,并表明技术接受度与用户对这些技术的信任有关。技术开发者和研究人员需要探索设计和实施策略,以建立和促进用户与这些新技术之间的信任或适当不信任,以便这些工具能够有效支持协作决策。

英文摘要

In pediatrics, patients, caregivers, and clinicians share responsibility for health decisions, but limited collaboration can undermine outcomes. We conducted a qualitative study examining decision-makers' perceptions of collaborative decision-making technologies, including interactive dashboards, VR simulators, and AI voice assistants. Findings reveal differences in user opinions across groups and indicate that technology acceptance is linked to users' trust in these technologies. Technology developers and researchers need to explore design and implementation strategies that build and facilitate trust or appropriate distrust between users and these novel technologies before these tools can effectively support collaborative decision-making.

2605.21758 2026-05-22 cs.AI 版本更新

A Causal Argumentation Method for Explainability of Machine Learning Models

一种用于机器学习模型可解释性的因果论辩方法

Henry Salgado, Meagan R. Kendall, Martine Ceberio

发表机构 * Department of Computer Science, The University of Texas at El Paso, El Paso, TX, USA(德克萨斯大学埃尔帕索分校计算机科学系) Department of Engineering Education and Leadership, The University of Texas at El Paso, El Paso, TX, USA(德克萨斯大学埃尔帕索分校工程教育与领导力系)

AI总结 本文提出一种结合因果推理和论辩推理的方法,用于解释机器学习模型为何做出特定预测,通过因果发现方法识别变量间的因果关系,并将其转化为双极论辩框架来表示特征间的支持与反对交互,最终通过半稳定语义确定解释性特征扩展。

Comments To be published in The 4th World Conference on eXplainable Artificial Intelligence

详情
AI中文摘要

可解释人工智能(XAI)方法旨在识别影响模型预测的相关特征,但往往无法清晰解释为何某些决策被做出。在本工作中,我们提出了一种新颖的方法,将因果推理与基于论辩的推理相结合,以解释模型为何做出预测。我们的方法首先使用因果发现方法识别变量间的因果关系,然后将这些关系转化为双极论辩框架(BAF)以表示特征间的支持与反对交互。通过使用半稳定语义,我们找到能够解释为何某些结果被选择的特征扩展。我们在两个基准数据集上展示了我们的方法,并将其结果与标准事后可解释性方法进行比较。

英文摘要

Explainable AI (XAI) methods identify which features are relevant to a model's predictions but often fail to clarify why certain decisions are made. In this work, we present a novel method that integrates causality with argument-based reasoning to explain why models may be making predictions. Our approach first identifies causal relationships among variables using causal discovery methods and then translates these into a Bipolar Argumentation Framework (BAF) to represent supportive and opposing interactions among features. By using semi-stable semantics, we find extensions of features that explain why certain outcomes may have been chosen. We demonstrate our method on two benchmark datasets and compare its results against standard post-hoc explainability approaches.

2605.21752 2026-05-22 cs.LG cs.AI 版本更新

PEARL: Unbiased Percentile Estimation via Contrastive Learning for Industrial-Scale Livestream Recommendation

PEARL:通过对比学习实现工业级直播推荐的无偏百分位估计

Blake Gella, Wei Wu, Yuhao Yin, Zexi Huang, Zikai Wang, Emily Liu, Junlin Zhang, Wentao Guo, Qinglei Wang

发表机构 * TikTok ByteDance(字节跳动)

AI总结 本文提出PEARL框架,通过对比学习方法解决用户行为不平衡问题,通过相对偏好信号建模提升推荐系统的性能和鲁棒性。

详情
AI中文摘要

训练于用户交互数据的推荐系统容易受到行为强度不平衡的影响——这种系统性扭曲源于用户间异质的参与模式。这种不平衡会使反馈信号失真,使得观察到的互动不再真实反映真实的偏好,导致模型过度放大高活跃用户信号而低估其他人,最终在大规模情况下降低推荐质量与鲁棒性。为了解决这个问题,我们提出了一种非参数对比百分位近似框架PEARL,该框架建模相对偏好信号而非绝对参与程度。基于相对优势去偏,PEARL利用真实的对比交互样本直接近似百分位关系,而无需依赖辅助分布估计模型。我们提供了理论证明,表明这种成对比较能产生无偏的基于百分位的偏好信号估计。为了更广泛的应用,我们引入了基于预测的重采样机制用于百分位平滑以处理稀疏和离散的反馈,以及通用的价值加权形式和共训练策略以增强建模灵活性和表示学习。大量离线实验表明,PEARL有效减轻了行为偏差,并在多个排序目标上一致提高了推荐性能。在拥有数十亿用户的大规模直播平台部署后,在线A/B测试确认了实际收益:观看时长增加2.10%,消费金额增加0.80%,互动率增加1.49%,举报率降低6.91%。

英文摘要

Recommender systems trained on user interaction data are susceptible to behavioral intensity imbalance--a systematic distortion arising from heterogeneous engagement patterns across users. This imbalance skews feedback signals such that observed interactions no longer faithfully reflect true preferences, causing models to disproportionately amplify signals from highly active users while underrepresenting others, which ultimately degrades recommendation quality and robustness at scale. To address this issue, we propose a nonparametric contrastive percentile approximation framework, PEARL, that models relative preference signals instead of absolute engagement magnitudes. Building upon relative advantage debiasing, PEARL leverages real contrastive interaction samples to approximate percentile relationships directly, without relying on auxiliary distribution estimation models. We provide theoretical justification demonstrating that such pairwise comparisons yield unbiased estimates of percentile-based preference signals. For broader applicability, we introduce a prediction-based bootstrapping mechanism for percentile smoothing to handle sparse and discrete feedback, alongside a generalized value-weighted formulation and a co-training strategy to enhance both modeling flexibility and representation learning. Extensive offline experiments demonstrate that PEARL effectively mitigates behavioral bias and consistently improves recommendation performance across multiple ranking targets. Deployed in a production livestream platform with a combined user base of billions, online A/B testing confirms substantial real-world gains: +2.10% Watch Duration, +0.80% Consumption Amount, +1.49% Interaction Rate, and -6.91% Report Rate.
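The quantity PEARL approximates, a percentile defined through pairwise comparisons, can be made concrete with a short sketch. This is only an illustration of the target signal under simple assumptions, not the paper's learned contrastive estimator: the function name `percentile_from_pairs`, the half-credit tie rule, and the watch-time numbers are all hypothetical.

```python
from typing import Sequence


def percentile_from_pairs(value: float, contrast_samples: Sequence[float]) -> float:
    """Empirical percentile of `value` among a user's own contrast samples.

    Each pairwise comparison contributes 1 if `value` wins, 0.5 on a tie
    (a simple way to handle discrete feedback), and 0 otherwise.
    """
    if not contrast_samples:
        raise ValueError("need at least one contrast sample")
    wins = 0.0
    for other in contrast_samples:
        if value > other:
            wins += 1.0
        elif value == other:
            wins += 0.5
    return wins / len(contrast_samples)


# Hypothetical watch durations (seconds) for a heavy user and a light user.
heavy_user_history = [300, 600, 900, 1200]
light_user_history = [5, 10, 20, 40]

heavy_signal = percentile_from_pairs(600, heavy_user_history)
light_signal = percentile_from_pairs(20, light_user_history)
print(heavy_signal, light_signal)
```

Although 600 seconds is far larger than 20 seconds in absolute terms, the light user's 20-second watch ranks higher within that user's own history, which is the sense in which a relative signal avoids amplifying highly active users.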

2605.21736 2026-05-22 stat.ML cs.AI cs.LG 版本更新

Support-aware offline policy selection for advertising marketplaces

面向广告市场的支持感知离线策略选择

Prashant Shekhar, Caroline Howard

发表机构 * Department of Mathematics(数学系) Embry-Riddle Aeronautical University(埃姆伯里-瑞德尔航空大学)

AI总结 本文提出了一种支持感知的离线决策框架,用于广告市场的保留策略选择,通过将记录证据转化为保守决策对象,以确保验证的可靠性,而非仅依赖点估计排名。

详情
AI中文摘要

记录的广告拍卖使离线保留价格评估变得有吸引力但有风险。回放表可以识别具有大显眼收益增益的策略,但它们也可能隐藏弱阈值支持、多重比较效应、子组伤害和投标者响应不确定性。现有的回放和离线策略评估方法估计或排名策略价值,但它们不能直接回答可用证据是否足够强以证明验证的问题。本文开发了一种支持感知的离线决策框架用于保留策略选择。与其输出单一的点估计胜者,该框架将记录证据转化为保守的决策对象,包括认证的策略、统计上被主导的替代方案以及需要进一步验证的未解决候选者。主要理论结果给出了一种统一的有限目录保证,显示在同时控制不确定性和保守支持门控的情况下,该框架保留了最佳通过策略,同时排除了具有认证遗憾的策略。支持性结果描述了支持本地化的回放泛化,建立了信息论阈值解析极限,并量化了异质投标者响应如何推翻本地化回放排名。在iPinYou实时竞价日志上的实验显示,领先的保留规则在第二季实现了47.66%的回放提升,同时实现了40.71%的下限提升,在第三季实现了43.87%的冻结超时回放提升。该框架将19个策略目录减少到两个策略验证短名单,同时在44个广告商、交易所和地区段中认证无害。结果支持核心主张,即离线保留策略评估应产生认证的验证决策,而非仅依赖点估计排名。

英文摘要

Logged advertising auctions make offline reserve-price evaluation attractive but risky. Replay tables can identify policies with large apparent yield gains, yet they can also hide weak threshold support, multiple-comparison effects, subgroup harm, and bidder-response uncertainty. Existing replay and off-policy evaluation methods estimate or rank policy values, but they do not directly answer the operational question of whether the available evidence is strong enough to justify validation. This paper develops a support-aware offline decision framework for reserve-policy selection. Rather than outputting a single point-estimate winner, the framework converts logged evidence into a conservative decision object consisting of certified policies, statistically dominated alternatives, and unresolved candidates requiring further validation. The main theoretical result gives a unified finite-catalog guarantee showing that, under simultaneous uncertainty control and conservative support gates, the framework preserves the best gate-passing policy while eliminating only policies with certified regret. Supporting results characterize support-localized replay generalization, establish information-theoretic threshold-resolution limits, and quantify when heterogeneous bidder response can overturn localized replay rankings. Experiments on iPinYou real-time-bidding logs show that the leading reserve rule achieves a 47.66% replay lift in season two, a 40.71% simultaneous lower-bound lift, and a 43.87% frozen out-of-time replay lift in season three. The framework reduces a 19-policy catalog to a two-policy validation shortlist while certifying non-harm across 44 advertiser, exchange, and region segments. The results support the central claim that offline reserve-policy evaluation should produce certified validation decisions rather than point-estimate rankings alone.
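The three-way decision object described above (certified, dominated, unresolved) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it assumes each policy already comes with simultaneously valid lower and upper bounds on its value and a boolean support gate, and the names `PolicyEvidence` and `decide` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PolicyEvidence:
    name: str
    lower: float      # simultaneous lower confidence bound on policy value
    upper: float      # simultaneous upper confidence bound on policy value
    support_ok: bool  # whether the conservative support gate is passed


def decide(policies: List[PolicyEvidence], baseline_value: float) -> Dict[str, List[str]]:
    """Sort policies into certified, dominated, and unresolved groups."""
    gated = [p for p in policies if p.support_ok]
    best_lower = max((p.lower for p in gated), default=float("-inf"))
    result: Dict[str, List[str]] = {"certified": [], "dominated": [], "unresolved": []}
    for p in policies:
        if not p.support_ok:
            # Assumption: weakly supported policies are deferred, not eliminated.
            result["unresolved"].append(p.name)
        elif p.upper < best_lower:
            # Certified regret: even the optimistic value loses to a rival's pessimistic value.
            result["dominated"].append(p.name)
        elif p.lower > baseline_value:
            result["certified"].append(p.name)
        else:
            result["unresolved"].append(p.name)
    return result
```

Under simultaneously valid bounds, the truly best gate-passing policy can never satisfy `upper < best_lower`, so this elimination rule removes only policies with certified regret, mirroring the guarantee stated in the abstract.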

2605.21726 2026-05-22 cs.CL cs.AI 版本更新

Probabilistic Attribution For Large Language Models

基于概率的大型语言模型归因

Shilpika Shilpika, Carlo Graziani, Bethany Lusch, Venkatram Vishwanath, Michael E. Papka

发表机构 * Argonne Leadership Computing Facility(阿贡领导计算设施) Argonne National Laboratory(阿贡国家实验室) Mathematics and Computer Science Division(数学与计算机科学部) Department of Computer Science(计算机科学系)

AI总结 本文提出了一种模型无关的概率性token归因度量,通过贝叶斯法则反向计算下一个token的对数概率,以捕捉模型对token序列分布的内部表示,从而提高大型语言模型的可解释性。

Comments 29 pages, 13 figures

详情
AI中文摘要

大型语言模型(LLMs)生成性的特性体现在它们计算每个响应token的条件概率,以根据先前的token进行采样。这些概率编码了模型在训练中学习的分布结构,并在推理中加以利用。在本文中,我们利用这些概率将LLMs置于随机过程的数学理论框架中。我们使用此框架设计了一种模型无关的概率性token归因度量,通过贝叶斯法则反向计算下一个token的对数概率,以捕捉模型对token序列分布的内部表示。该表示独立于模型的计算结构。此表示给出了响应给提示的条件概率,以及在移除一个token后的响应给提示的条件概率。我们的归因分数是这两个概率比值的对数。我们进一步计算了单个提示token分布的熵,条件于剩余的上下文。熵与归因分数之间的相互作用揭示了LLM的行为。我们评估了8个模型在7个提示上的表现,并调查了异常、token敏感性、响应稳定性、模型稳定性以及训练收敛性,从而提高了可解释性,并引导用户关注生成中不确定或不稳定的部分。

英文摘要

The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes' rule to invert the next-token log-probabilities so as to capture the model's internal representation of the distribution over token sequences. The representation is independent of the model's computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompt's token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.
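The attribution score described above, the log of the ratio between the response probability given the full prompt and the response probability with one token marginalized away, can be written out for a toy case. The sketch assumes the two required distributions are simply given as dictionaries; the paper obtains them by inverting next-token log-probabilities with Bayes' rule, which is not reproduced here. All names and numbers are hypothetical.

```python
import math
from typing import Dict


def attribution_score(
    p_resp_given_tok: Dict[str, float],
    p_tok_given_ctx: Dict[str, float],
    observed_tok: str,
) -> float:
    """log [ P(response | prompt) / P(response | prompt with one token marginalized) ]."""
    # Marginalize the token: sum over alternatives weighted by their
    # probability given the remaining context.
    marginal = sum(p_tok_given_ctx[t] * p_resp_given_tok[t] for t in p_tok_given_ctx)
    return math.log(p_resp_given_tok[observed_tok] / marginal)


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy (in nats) of a token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


# Hypothetical two-token vocabulary for a single prompt position.
p_tok = {"not": 0.25, "very": 0.75}   # P(token | remaining context)
p_resp = {"not": 0.8, "very": 0.2}    # P(response | prompt containing that token)

print(attribution_score(p_resp, p_tok, "not"))
print(attribution_score(p_resp, p_tok, "very"))
print(entropy(p_tok))
```

A positive score means the observed token makes the response more likely than it would be on average over alternatives at that position; a negative score means the opposite.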

2605.21724 2026-05-22 cs.LG cs.AI 版本更新

TBP-mHC: full expressivity for manifold-constrained hyper connections through transportation polytopes

TBP-mHC: 通过运输多面体实现流形约束超连接的全表达性

Anton Lyubinin

AI总结 本文提出 TBP-mHC,通过运输多面体参数化实现流形约束超连接的全表达性,解决了超连接中无约束混合导致的训练不稳定性问题,并在语言模型预训练中展示了有竞争力的性能以及更好的稳定性与可扩展性。

详情
AI中文摘要

超连接(HC)通过在多个残差流之间引入可学习的混合来改进残差网络,但无约束的混合导致训练不稳定。Manifold-Constrained Hyper-Connections(mHC)通过Sinkhorn归一化强制近似双随机性,而mHC-lite则通过置换矩阵的凸组合确保精确约束,但以阶乘复杂度为代价。KromHC通过Kronecker积参数化减少此成本,但限制混合矩阵为Birkhoff多面体的结构子流形。我们提出运输Birkhoff多面体(TBP)参数化及其递归变体(RTBP),通过(n-1)^2自由度构造精确的双随机混合矩阵。我们的方法避免了迭代归一化和组合爆炸,同时保持Birkhoff多面体的完整表达性。在语言模型预训练中的实验证明了竞争性性能,同时具有改进的稳定性和可扩展性。

英文摘要

Hyper-Connections (HC) improve residual networks by introducing learnable mixing across multiple residual streams, but unconstrained mixing leads to training instability. Manifold-Constrained Hyper-Connections (mHC) address this by enforcing approximate double stochasticity via Sinkhorn normalization, while mHC-lite ensures exact constraints through convex combinations of permutation matrices at the cost of factorial complexity. KromHC reduces this cost using Kronecker-product parameterizations, but restricts the mixing matrices to a structured submanifold of the Birkhoff polytope. We propose Transportation Birkhoff Polytope (TBP) parameterizations and their Recursive variants (RTBP), which construct exactly doubly stochastic mixing matrices with $(n-1)^2$ degrees of freedom. Our approach avoids iterative normalization and combinatorial explosion while preserving full expressivity of the Birkhoff polytope. Empirical results on language model pre-training demonstrate competitive performance with improved stability and scalability.
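The two baseline constructions the abstract contrasts, approximate double stochasticity via Sinkhorn normalization and exact double stochasticity via a convex combination of permutation matrices, can be sketched in plain Python. This does not implement the TBP or RTBP parameterization, whose details are not given in the abstract; the function names and the fixed iteration count are assumptions for illustration.

```python
import itertools
import math
from typing import List, Sequence

Matrix = List[List[float]]


def sinkhorn(logits: Matrix, iters: int = 50) -> Matrix:
    """Alternately normalize rows and columns of exp(logits); approximate only."""
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iters):
        m = [[x / sum(row) for x in row] for row in m]                   # row step
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]     # column step
    return m


def permutation_mixture(weights: Sequence[float], n: int) -> Matrix:
    """Convex combination of all n! permutation matrices; exactly doubly stochastic."""
    perms = list(itertools.permutations(range(n)))
    if len(weights) != len(perms):
        raise ValueError(f"expected {len(perms)} non-negative weights")
    total = sum(weights)
    out = [[0.0] * n for _ in range(n)]
    for w, perm in zip(weights, perms):
        for i, j in enumerate(perm):
            out[i][j] += w / total
    return out


approx = sinkhorn([[0.0, 1.0, -1.0], [0.5, 0.0, 0.5], [-0.5, 1.0, 0.0]])
exact = permutation_mixture([1, 2, 3, 4, 5, 6], n=3)  # needs 3! = 6 weights
print([sum(row) for row in approx], [sum(row) for row in exact])
```

The permutation mixture needs n! weights (6 for n = 3, 24 for n = 4), while the set of n-by-n doubly stochastic matrices has only (n-1)^2 degrees of freedom, which is the gap between cost and expressivity that the abstract says TBP is designed to close.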

2605.21723 2026-05-22 cs.RO cs.AI cs.MA cs.SY eess.SY 版本更新

Learning Altruistic Collaboration in Heterogeneous Multi-Team Systems

在异质多团队系统中学习利他性协作

Riwa Karam, Ruoyu Lin, Brooks A. Butler, Magnus Egerstedt

发表机构 * Samueli School of Engineering, University of California, Irvine(加州大学尔湾分校萨缪利工程学院) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 本文将机器人视为可转移资源,以生态学中的哈密顿规则作为利他决策机制,提出异质多团队协作资源分配框架,证明该分配问题是NP难的,并用集中训练、分布式执行的图神经网络策略近似利他性分配,在消防场景的仿真和实验中取得接近最优的性能。

详情
AI中文摘要

本文研究了通过动态机器人分配实现的异质多团队协作,其中机器人被视为可转移资源。利用生态学中的哈密顿规则作为利他决策机制,我们提出了一种具有异质能力、转移成本和能力依赖贡献的多团队协作资源分配框架。所得到的分配问题是一个组合问题,并被证明是NP难的。为了解决可扩展性问题,我们开发了一种基于图神经网络的策略,在集中训练和分布式执行下近似基于哈密顿规则的利他性分配。该模型在团队交互图上运行,并预测机器人层面的转移决策和下一步的机器人到团队分配。通过消防演习场景的模拟和实验验证了所提出的方法,证明所学习的策略在扩展到更大系统时能够实现接近最优的性能。

英文摘要

This paper studies heterogeneous multi-team collaboration through dynamic robot allocation, where robots are treated as transferable resources. Leveraging Hamilton's rule from ecology as an altruistic decision-making mechanism, we propose a multi-team collaborative resource allocation framework with heterogeneous capabilities, transfer costs, and capability-dependent contributions. The resulting allocation problem is combinatorial and is shown to be NP-hard. To address scalability, we develop a graph neural network policy under centralized training and decentralized execution that approximates the altruistic allocations based on Hamilton's rule. The model operates over the team interaction graph and predicts robot-level transfer decisions and next robot-to-team assignments. The proposed approach is validated in a firefighting scenario through simulations and experiments, demonstrating that the learned policy achieves near-optimal performance while scaling to larger systems.
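Hamilton's rule from ecology states that an altruistic act is favored when r * B > C, where r is the relatedness between actor and recipient, B the benefit to the recipient, and C the cost to the actor. A minimal sketch of how such a rule could gate a single robot transfer is shown below. How the paper maps r, B, and C onto team capabilities is not specified in the abstract, so the function name, the additive `transfer_cost` term, and the example numbers are all assumptions.

```python
def should_transfer(
    relatedness: float,
    benefit_to_recipient: float,
    cost_to_donor: float,
    transfer_cost: float = 0.0,
) -> bool:
    """Hamilton-style test: transfer a robot if r * B exceeds the total cost.

    Assumption: the cost of moving the robot is simply added to the donor's cost.
    """
    return relatedness * benefit_to_recipient > cost_to_donor + transfer_cost


# Hypothetical numbers: the recipient team gains 10 units and the donor loses 3.
print(should_transfer(0.5, 10.0, 3.0, transfer_cost=1.0))
print(should_transfer(0.5, 10.0, 4.0, transfer_cost=1.5))
```

Evaluating this test jointly over every robot and team pair is what makes the full allocation combinatorial, which motivates the learned graph neural network policy described in the abstract.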

2605.21695 2026-05-22 cs.AI cs.HC 版本更新

The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning

人工智能使用与信息性对逻辑推理技能发展的影响力

Shang Wu, Hongyu Yao, Catarina Belem, Shuyuan Fu, Mark Steyvers, Padhraic Smyth

发表机构 * University of California, Irvine(加州大学尔湾分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文研究了人工智能使用和信息性如何影响逻辑推理技能的发展,发现高使用AI的用户表现较差,而信息性低的AI对学习无帮助,信息性高的AI则在短期内提升表现但影响不均一。

Comments Accepted at Hybrid Human Artificial Intelligence (HHAI) 2026

详情
AI中文摘要

人工智能(AI)正越来越多地融入人类问题解决过程中,但其对个体技能发展的影响仍不明确。我们在一项可按需获取AI帮助的受控逻辑推理任务中,考察了AI使用和信息性如何影响学习。我们发现,更多的AI使用与更弱的技能发展相关:大量使用AI的用户表现不如条件相当的同伴,而少量使用AI的用户则与不使用AI的匹配用户表现相近。我们还发现,这些模式受AI信息性的中介作用影响。低信息性AI既不能提高即时表现,也不能在移除AI帮助后维持表现,且总体上与较弱的学习相关。另一方面,在我们的实验中,高信息性AI能够提升短期表现,且平均而言不会降低移除AI帮助后的结果,但其影响具有异质性。总体而言,我们的发现表明,视情境而定,AI既可能通过增强独立推理来补充人类技能发展,也可能作为替代品削弱此类推理,这意味着在存在AI帮助的情况下,调节AI的访问和使用对促进技能发展十分重要。

英文摘要

Artificial intelligence (AI) is being increasingly integrated into human problem-solving, yet its effects on individual skill development remain unclear. We examine how both AI usage and informativeness can shape learning in the context of a controlled logical reasoning task with on-demand access to AI assistance. We find that greater AI usage is associated with weaker skill development: heavy AI users underperform relative to comparable peers, whereas light AI users perform similarly to matched users who do not use AI. We also find in our study that these patterns are mediated by AI informativeness. Low-information AI neither improves immediate performance nor preserves performance after AI assistance is removed, and is linked to weaker learning overall. On the other hand, high-information AI was found to improve short-run performance without reducing post-AI outcomes on average in our experiments, but with heterogeneous effects. Our findings in general suggest that AI can, depending on context, either complement human skill development by amplifying independent reasoning or can act as a substitute that undermines such reasoning, with the implication that regulating AI access and usage will be important for promoting skill development in the presence of AI assistance.

2605.21694 2026-05-22 cs.CR cs.AI 版本更新

PocketAgents: A Manifest-Driven Library of Autonomous Defense Agents

PocketAgents: 一种基于manifest的自主防御代理库

Sidnei Barbieri, Ágney Lopes Roth Ferraz, Lourenço Alves Pereira Júnior

发表机构 * Aeronautics Institute of Technology(航空技术研究所)

AI总结 本文提出PocketAgents,一种基于manifest的自主防御代理库,通过定义代理的manifest、prompt和运行时上下文,实现了对大型语言模型驱动的防御系统进行可测量、可扩展和可追溯的防御。

详情
AI中文摘要

将大型语言模型(LLMs)连接到防御执行需要的不仅仅是询问模型攻击是否正在发生。防御方必须决定哪些模型输出可能改变系统状态,哪些输出必须被拒绝,以及如何记录失败。我们提出了PocketAgents,一种基于manifest的自主防御代理库。每个代理安装为三个数据文件:manifest、prompt和运行时上下文。共享的运行时提供代理受限制的遥测访问,并只接受具有请求动作出现在manifest中的类型化报告。我们基于网络沙箱(Perry)和网络欺骗测试床实现了PocketAgents,并在18次闭合回路试验中评估了两个代理,命令与控制和数据外泄,在DarkSide启发的攻击测试中对小型企业拓扑进行评估。13次试验产生了验证的网络阻断动作并遏制了攻击;4次失败了模式验证;1次产生了有效的无操作决定。实验表明,类型化的边界使LLM驱动的防御变得可测量、可扩展和可追溯。

英文摘要

Connecting large language models (LLMs) to defensive enforcement requires more than asking a model whether an attack is happening. A defender must decide which model outputs may change the system state, which outputs must be rejected, and how failures should be recorded. We present PocketAgents, a manifest-driven library of autonomous defense agents. Each agent is installed as three data files: a manifest, a prompt, and a runtime context. The shared runtime gives the agent bounded telemetry access and accepts only typed reports whose requested action appears in the manifest. We implemented PocketAgents on top of a cyber arena (Perry), a cyber-deception testbed, and evaluated two agents, Command and Control and Exfiltration, in 18 closed-loop trials of a DarkSide-inspired attack on a small enterprise topology. Thirteen trials produced validated network-block actions and contained the attack; four failed schema validation; one produced a valid no-action decision. The experiments show that a typed boundary makes LLM-driven defense measurable, extensible, and attributable.
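The typed boundary described above, where the runtime accepts a report only if it passes schema validation and requests an action listed in the agent's manifest, can be sketched as follows. The manifest fields, report fields, and action names here are hypothetical; the abstract does not give the actual PocketAgents schema.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, Tuple


@dataclass(frozen=True)
class Manifest:
    agent: str
    allowed_actions: FrozenSet[str]


# Hypothetical report schema: field name -> required type.
REPORT_SCHEMA = {"agent": str, "action": str, "target": str}


def validate_report(manifest: Manifest, report: Any) -> Tuple[bool, str]:
    """Return (accepted, reason); only accepted reports may change system state."""
    if not isinstance(report, dict):
        return False, "schema: report is not an object"
    for field, expected in REPORT_SCHEMA.items():
        if not isinstance(report.get(field), expected):
            return False, f"schema: field '{field}' missing or not {expected.__name__}"
    if report["agent"] != manifest.agent:
        return False, "agent does not match manifest"
    if report["action"] not in manifest.allowed_actions:
        return False, "action not listed in manifest"
    return True, "ok"


c2_manifest = Manifest(agent="c2", allowed_actions=frozenset({"block_ip", "no_action"}))
print(validate_report(c2_manifest, {"agent": "c2", "action": "block_ip", "target": "10.0.0.7"}))
print(validate_report(c2_manifest, {"agent": "c2", "action": "wipe_host", "target": "10.0.0.7"}))
```

Returning a reason string alongside the verdict gives each rejection a recordable cause, in the spirit of the schema-validation failures the evaluation counts separately from contained attacks.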

2605.21683 2026-05-22 cs.AI 版本更新

Investigating Concept Alignment Using Implausible Category Members

通过不合理的类别成员探究概念对齐

Sunayana Rane, Brenden M. Lake, Thomas L. Griffiths

发表机构 * Department of Computer Science(计算机科学系) Princeton University(普林斯顿大学) Department of Psychology(心理学系)

AI总结 本文研究了通过询问不合理类别成员来探究概念边界,发现AI模型在某些概念上与人类存在显著差异,如将'词语'归类为'车辆'或'衣物',并探讨了这些概念错位对AI安全的影响。

详情
AI中文摘要

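开发对日常概念具有类人理解能力的AI系统,是构建行为能被人类理解的安全、可靠系统的关键一步。在探测概念理解时,询问合理的类别成员(例如“汽车是交通工具吗?”)很可能只是唤起模型海量训练数据中的模式。我们采用另一种策略,通过询问不合理的类别成员(例如“橄榄是交通工具吗?”)来刻画概念类别的边界,以探测我们认为人类理所当然具备的概念层面知识。我们基于Rosch和Mervis的一项经典心理学研究,考察AI系统将物体归入上位类别的情况,以及将相同物体归入不匹配上位类别的情况,从而刻画一组基本概念的边界。我们将这些归类结果与人类参与者在全部类别内和跨类别归类任务上的结果进行比较。结果揭示了一系列模型与人类存在有意义且令人意外差异的概念,包括将“词语”视为属于“车辆”和“衣物”等类别,将若干“蔬菜”类成员认定为“水果”,以及将非武器类别的样例归入“武器”类别。我们还展示了这些概念错位的实例如何转化为有问题的下游行为,并对AI安全产生影响。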

英文摘要

Developing AI systems with a human-like understanding of everyday concepts is a key step towards developing safe, reliable systems whose behavior makes sense to humans. When probing concept understanding, asking questions about plausible category members (e.g., "Is a car a vehicle?") is likely to recall patterns in the model's vast training data. We pursue an alternative strategy, characterizing the boundaries of conceptual categories by asking about implausible category members (e.g., "Is an olive a vehicle?") to probe the kind of concept-level knowledge we take for granted in fellow humans. We characterize concept boundaries for a set of fundamental concepts by studying AI systems' assignments of objects to superordinate categories from a classic psychological study by Rosch and Mervis, as well as their assignments of the same objects to mismatched superordinate categories. We compare these assignments to those made by human participants on the full range of within-category and cross-category assignment tasks. Our results reveal a range of concepts for which models differ in meaningful and surprising ways from humans, including treating "words" as belonging to categories like "vehicles" and "clothing," identifying several "vegetable" category members as "fruit," and assigning exemplars from non-weapon categories to the "weapons" category. We also demonstrate how these instances of concept misalignment translate into problematic downstream behavior with implications for AI safety.

2605.21669 2026-05-22 cs.CV cs.AI 版本更新

MRecover: A Conditional Generative Model for Recovering Motion-Corrupted MR images Using AI Generated Contrast

MRecover: 一种利用AI生成对比度恢复受运动伪影损坏的MR图像的条件生成模型

Jinghang Li, Tales Santini, Courtney Clark, Bruno de Almeida, Cong Chu, Salem Alkhateeb, Andrea Sajewski, Jacob Berardinelli, Hecheng Jin, Tobias Campos, Jeremy J. Berardo, Joseph Mettenburg, Ariel Gildengers, Howard J. Aizenstein, Minjie Wu, Tamer S. Ibrahim

发表机构 * Department of Bioengineering, University of Pittsburgh(匹兹堡大学生物工程系) School of Medicine, University of Pittsburgh(匹兹堡大学医学院) Department of Radiology, University of Pittsburgh(匹兹堡大学放射科) Department of Psychiatry, University of Pittsburgh(匹兹堡大学精神病学系)

AI总结 该研究提出了一种条件生成模型MRecover,利用AI生成的对比度来恢复受运动伪影损坏的MRI图像,通过自回归切片条件化实现体积一致性,提高了海马亚区分割的精度和泛化能力。

详情
AI中文摘要

海马亚区分割需要高分辨率的T2w快速自旋回波(TSE)MRI,但该序列易受运动伪影影响,导致大量数据丢失。我们开发了一种条件生成模型(MRecover),以常规采集的T1w图像为输入合成TSE图像,并通过自回归切片条件化保证体积一致性。该模型在7T MRI数据(n=577)上训练,在域内实现了高保真度(n=148,SSIM=0.84,FSIM=0.94),并能很好地泛化到域外3T数据:合成图像与实际采集图像得到的亚区体积高度吻合(n=416,r=0.87-0.97),并且在受运动影响的ADNI3数据集中,经质量控制后可分析的受试者数量增加了31.8%(593 vs 450)。由于样本量增加,合成图像在海马亚区萎缩的诊断组间差异上也得到了更大的效应量(整个海马体ε²=0.121-0.100 vs. 0.086-0.062,左-右半球)。项目页面:https://jinghangli98.github.io/MRecover/

英文摘要

Hippocampal subfield segmentation requires high-resolution T2w turbo spin echo (TSE) MRI, yet this sequence is susceptible to motion artifacts, leading to substantial data loss. We developed a conditional generative model (MRecover) that synthesizes TSE images from routinely acquired T1w images, using autoregressive slice conditioning for volumetric consistency. Trained on 7T MRI data (n=577), the model achieved high in-domain fidelity (n=148, SSIM=0.84, FSIM=0.94) and generalized well to out-of-domain 3T data: subfield volumes from the synthesized and the as-acquired images closely matched (n=416, r=0.87-0.97), and the synthesized images yielded 31.8% more analyzable subjects in the motion-affected ADNI3 dataset after quality control (593 vs 450). Owing to the larger sample size, the synthesized images also achieved larger effect sizes for diagnostic group differences in hippocampal subfield atrophy (whole hippocampus $\epsilon^2$ = 0.121-0.100 vs. 0.086-0.062, left-right hemispheres). Project page: https://jinghangli98.github.io/MRecover/

2605.21665 2026-05-22 cs.MA cs.AI 版本更新

Planning, Scheduling, and Behavior in EV Charging Systems: A Critical Survey and Trilemma Framework

电动汽车充电系统中的规划、调度与行为:一项批判性综述与三重困境框架

Peiyan Xiao, Yuheng Li, Ayan Mukhopadhyay, Sai Krishna Ghanta, Sabur Baidya, Yanhai Xiong

发表机构 * The College of William & Mary, Williamsburg, VA, USA University of Georgia, Athens, GA, USA University of Louisville, Louisville, KY, USA

AI总结 本文综述了电动汽车充电系统中规划、调度和行为三个层面的研究,提出了三重困境框架,揭示了在追求高保真度时面临的可计算性与现实整合之间的权衡问题。

Comments Review article; 56 pages excluding references; 1 figure and 3 tables

详情
AI中文摘要

电动汽车的快速增长正在将交通电气化的主要约束从车辆普及转移到充电基础设施的部署和运行。充电网络设计需要在三个相互依赖的层面做出决策:规划,决定在哪里和建设多少基础设施;调度,管理充电调度、定价和电网交互;以及行为,捕捉用户如何选择站点、充电时间和持续时间。现有研究在每个层面都有显著进展,但文献仍然碎片化,跨层交互往往通过简化假设来处理。本文开发了一个三层规划-调度-行为(PSB)框架,根据决策时间跨度、主体目标和耦合结构来组织电动汽车充电研究。我们进一步识别了一个保真度-可计算性权衡,称为PSB三重困境:每个层面单独来看都是计算困难的,而现实层面的整合通常需要至少减少一个层面的保真度。审查三个成对耦合文献——规划-调度、调度-行为和规划-行为——我们发现通常省略的第三层通常是外生固定的或用静态的汇总代理来表示。这些简化使问题变得可计算,但带来了不同的成本:它们可能会掩盖长期投资反馈、时间电网和排放动态,或异质用户响应和公平结果。基于这一诊断,我们识别了新兴充电技术、行为激励、公平度量和城市规模基于学习的方法中亟需解决的挑战,这些挑战平衡了保真度、可解释性和政策相关性。

英文摘要

The rapid growth of electric vehicles is shifting the main constraint on transport electrification from vehicle adoption to the deployment and operation of charging infrastructure. Charging-network design requires decisions across three interdependent layers: Planning, which determines where and how much infrastructure to build; Scheduling, which governs charging dispatch, pricing, and grid interaction; and Behavior, which captures how users choose stations, charging times, and charging durations. Existing studies have advanced each layer substantially, but the literature remains fragmented, and cross-layer interactions are often treated through simplifying assumptions. This survey develops a three-layer Planning-Scheduling-Behavior (PSB) framework to organize EV charging research according to decision horizon, actor objective, and coupling structure. We further identify a fidelity-tractability tradeoff, termed the PSB trilemma: each layer is computationally difficult in isolation, and realistic integration across layers generally requires reducing the fidelity of at least one layer. Reviewing the three pairwise-coupling literatures - Planning-Scheduling, Scheduling-Behavior, and Planning-Behavior - we show that the omitted third layer is typically fixed exogenously or represented by a static aggregate surrogate. These simplifications enable tractability but impose distinct costs: they can obscure long-term investment feedback, temporal grid and emissions dynamics, or heterogeneous user response and equity outcomes. Building on this diagnosis, we identify open challenges in emerging charging technologies, behavioral incentives, equity metrics, and city-scale learning-based methods that balance fidelity, interpretability, and policy relevance.

2605.21661 2026-05-22 cs.LG cs.AI cs.CV 版本更新

Hierarchical Variational Policies for Reward-Guided Diffusion

分层变分策略用于奖励引导的扩散

Kushagra Pandey, Farrin Marouf Sofian, Jan Niklas Groeneveld, Felix Draxler, Stephan Mandt

Affiliations: Department of Computer Science, University of California, Irvine

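AI summary: The paper proposes a hierarchical variational framework that amortizes control into a lightweight yet expressive stochastic policy, producing high-quality reward-aligned samples at reduced inference cost; on 4x super-resolution it achieves better perceptual quality with more than 5x faster inference than the best-performing baseline.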

详情
AI中文摘要

适应预训练扩散模型以解决下游目标如逆问题通常需要昂贵的测试时间引导或优化。我们提出了一种系统框架,能够在大幅降低推理成本的同时生成高质量的奖励对齐样本。我们的方法将测试时间适应建模为分层变分模型,其中控制被压缩到一个轻量级但表达能力强的随机策略中。这种建模自然支持少量步扩散采样:大步长使推理快速,而学习的策略通过提供结构化的每步控制保持样本质量。所得到的完全压缩采样器实现了强大的质量-速度权衡,匹配或超过最近的测试时间扩展基线,同时需要显著更少的计算资源。例如,在4倍超分辨率任务中,我们的方法在比最佳表现基线快5倍的情况下实现了更好的感知质量。我们进一步将该方法扩展到半压缩的 regime,结合廉价的压缩提案和有限的测试时间优化,在多个具有挑战性的逆问题中实现了最先进的感知质量。

英文摘要

Adapting pretrained diffusion models to downstream objectives such as inverse problems often requires expensive test-time guidance or optimization. We propose a principled framework for generating high-quality reward-aligned samples at substantially reduced inference cost. Our approach formulates test-time adaptation as a hierarchical variational model, where control is amortized into a lightweight yet expressive stochastic policy. This formulation naturally supports few-step diffusion sampling: large step sizes enable fast inference, while the learned policy maintains sample quality by providing structured per-step control. The resulting fully amortized sampler achieves a strong quality-speed tradeoff, matching or exceeding recent test-time scaling baselines while requiring significantly less compute. For example, on 4x super-resolution, our method achieves better perceptual quality with more than 5x faster inference compared to the best-performing baseline. We further extend our approach to a semi-amortized regime that combines cheap amortized proposals with limited test-time optimization, achieving state-of-the-art perceptual quality across several challenging inverse problems.

2605.21654 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Value-Gradient Hypothesis of RL for LLMs

强化学习中大语言模型的价值-梯度假说

Arip Asadulaev, Daniil Ognev, Karim Salta, Martin Takac

Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

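AI summary: The paper offers a value-gradient perspective on why critic-free RL methods work for LLM post-training; by analyzing the actor update and autodifferentiation through attention, it motivates decomposing RL impact into value-gradient signal and reachable reward headroom.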

详情
AI中文摘要

强化学习显著提升了预训练语言模型,但尚不清楚为何无评论方法如PPO和GRPO能发挥如此大的作用,以及何时能提供最大的收益。我们开发了一种无评论强化学习在大语言模型后训练中的价值-梯度视角。首先,在可微展开和加性噪声参数化下,我们证明在期望下actor更新是价值-梯度类似的:反向传播传播的costates的条件期望等于价值梯度。其次,对于离散transformer策略,我们证明通过注意力机制的自适应微分会产生经验性的costates,这些近似于该价值信号,其误差受采样间隙和策略熵的控制。这些结果促使将RL影响分解为价值梯度信号和可达奖励空间,从而得出RL在预训练轨迹上最有效的标准。

英文摘要

Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

2605.21653 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

放大而非学习:微调的AI文本检测器放大了预训练的方向

Alexander Smirnov

Affiliations: University College London

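AI summary: The study argues that fine-tuned AI text detectors amplify a pretrained typicality axis rather than learn an AI-vs-human boundary; fine-tuning can reduce discrimination in some cases, the axis inverts on non-native ESL writing, and a closed-form Jacobian predictor is shown to work across architectures.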

详情
AI中文摘要

AI文本检测器放大了预训练的典型性轴;它们并不构建AI与人类的边界。在没有任何任务监督的原始编码器上,将投影到AI-中心(HC3)的中心可以实现NYT与HC3的AUROC分别为0.806/0.944/0.834,跨三种架构(86-106%的微调辨别上限:在RoBERTa-base上,原始投影超过微调);在RoBERTa-base上,完全微调在两种流畅正式人口测试中降低了辨别能力。相同的轴在非母语ESL写作中反转(AUROC 0.06-0.20)--这是典型性阅读独有的可验证预测。一个24例冻结探测器与完全微调(0.900 vs 0.895)一致。一个闭合形式雅可比预测器参数化轴操纵干预,R²=1.000通用,提升了ELECTRA-CE部署的TPR从0.000到0.904(FPR=1%),并在三个独立训练的第三方RoBERTa检测器上转移,达到16/16 oracle等价(在OpenAI检测器上57%的NYT-FPR减少)。范围:编码器家族;机制幅度HC3锚定;人口层面共享轴,不同架构中每文本机制有所变化。三种操作上不同的探测器--文本表面caps_rate残差化、几何符号epsilon消融、闭合形式文本对预测器--在三种架构中一致,cos 0.74/0.81/1.00,确认了观察者不变性。在匹配TPR-0.90评估下,已发表的干预动物园(CC、dealign-f2c)在27个单元格中校准等价(|Delta AUROC| <= 0.0081),并且ELECTRA上的LoRA->full-FT偏移差距的97%是校准偏移而非学习表示--这是核心主张的预测确认。

英文摘要

AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20) -- a falsifiable prediction unique to the typicality reading. A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895). A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained third-party RoBERTa detectors at 16/16 oracle-equivalence (57% NYT-FPR reduction on the OpenAI detector). Scope: encoder family; mechanism magnitude HC3-anchored; population-level shared axis with per-text mechanisms varying across architectures. Three operationally distinct probes -- text-surface caps_rate residualisation, geometric signed-epsilon ablation, closed-form text-pair predictor -- agree at cos 0.74/0.81/1.00 across three architectures, confirming observer-invariance. Under matched-TPR-0.90 evaluation, the published intervention zoo (CC, dealign-f2c) is calibration-equivalent across 27 cells (|Delta AUROC| <= 0.0081), and >= 97% of the LoRA->full-FT bias gap on ELECTRA is calibration shift, not learned representation -- the central claim's prediction confirmed.
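
The centroid-projection probe described above can be illustrated with a minimal sketch. It is not the paper's code: the toy 2-D vectors stand in for encoder embeddings, and the helper names (`centroid_axis`, `project`, `auroc`) are hypothetical. The sketch builds the axis as the difference between two class centroids, scores each text by its dot product with that axis, and computes AUROC as the fraction of (positive, negative) pairs ranked correctly.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def centroid_axis(pos_vectors, neg_vectors):
    """Direction from the negative-class centroid to the positive-class centroid."""
    c_pos, c_neg = centroid(pos_vectors), centroid(neg_vectors)
    return [p - q for p, q in zip(c_pos, c_neg)]

def project(vectors, axis):
    """Score each vector by its dot product with the axis."""
    return [sum(v_i * a_i for v_i, a_i in zip(v, axis)) for v in vectors]

def auroc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

# Toy stand-ins for embeddings (assumed data, not from the paper).
ai_texts = [[2.0, 0.0], [4.0, 0.0]]
human_texts = [[0.0, 0.0], [0.0, 2.0]]

axis = centroid_axis(ai_texts, human_texts)
print(auroc(project(ai_texts, axis), project(human_texts, axis)))  # 1.0
```

An AUROC well below 0.5, as the abstract reports for non-native ESL writing, would mean the same axis ranks the two populations in the opposite order.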

2605.21645 2026-05-22 cs.AI cs.DB 版本更新

AOP-Wiki EMOD 3.0: Data Model Expansions and Content Evaluation Framework for Using Agentic AI to Improve Integration between AOPs and New Approach Methodologies (NAMs)

AOP-Wiki EMOD 3.0: 数据模型扩展和内容评估框架用于利用代理AI改进AOP与新方法论(NAMs)之间的整合

Virginia K. Hench, J. Harry Caufield, Sierra A. T. Moxon, Jason M. O'Brien, Stephen W. Edwards

Affiliations: Open BioData Modeling; Environmental Genomics and Systems Biology; Lawrence Berkeley National Laboratory; National Wildlife Research Centre; UL Research Institutes - Chemical Insights

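AI summary: The paper presents AOP-Wiki EMOD 3.0, which uses data model expansions and a content evaluation framework to improve integration between AOPs and new approach methodologies with the help of agentic AI, in support of regulatory science and biomedical applications.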

Comments 7 Figures and 3 Supplemental Figures

详情
AI中文摘要

不良后果路径(AOP)是将可在实验室中测量的生物机制因果联系到不良后果的逻辑模型,与化学监管终点相关。AOPs 为新方法论(NAMs)提供上下文,包括体外和体外方法,这些方法作为替代动物测试的替代方案,AOP中的连续事件作为多尺度模型跨越生物尺度。AOP-Wiki 作为全球AOP存储库。尽管AOP-Wiki在过去十年中在AOP扩展中发挥了核心作用,但当前的数据模型和应用基础设施的限制限制了AOP-Wiki支持持续AOP增长和演变的能力。然而,代理AI的变革力量重新激发了AOP-Wiki数据现代化的努力,尤其是在核心AOP原则可以用于指导AI用于汇总和结构化AOP相关信息的时候。抓住这一势头,我们提出了AOP-Wiki EMOD 3.0,即一系列证据模型原型中的第三款,具体展示了数据模型扩展和我们对AOP-Wiki如何被转变以更好地服务于监管科学和新兴AOP在生物医学和One Health领域中的使用。我们旨在为计算生成的AOP和定量AOP(qAOPs)奠定基础,通过聚焦于AOP-Wiki内部质量改进、证据结构以提高AOP FAIRness和AI准备性,以及改进AOP框架与NAMs之间的整合,以更好地服务于下一代风险评估。

英文摘要

Adverse Outcome Pathways (AOPs) are logic models that causally link biological mechanisms that can be measured in a lab to adverse outcomes relevant to chemical regulatory endpoints. AOPs contextualize new approach methodologies (NAMs), in vitro and in silico methods used as alternatives to animal testing, and the sequential events in an AOP serve as multi-scale models spanning biological scales. The AOP-Wiki serves as the global repository for AOPs. While the AOP-Wiki has played a central role in AOP expansion over the past decade, constraints within the current data model and application infrastructure limit the AOP-Wiki from supporting continued AOP growth and evolution. Yet, the transformative power of agentic AI has re-invigorated AOP-Wiki data modernization efforts at a time when core AOP principles can be harnessed to inform use of AI for aggregating and structuring AOP-relevant information. Seizing upon this momentum, we present AOP-Wiki EMOD 3.0, the third in a series of evidence model prototypes, which concretely demonstrates data model expansions and our vision for how the AOP-Wiki might be transformed to better serve regulatory science and emergent use of AOPs in biomedical and One Health contexts. We aim to lay a foundation to support computationally-generated AOPs and quantitative AOPs (qAOPs) by focussing on solutions for AOP-Wiki internal quality improvement, evidence structuring to enhance AOP FAIRness and AI-readiness, and improved integration between the AOP framework and NAMs to better serve next generation risk assessment.

2605.21635 2026-05-22 cs.HC cs.AI cs.CY 版本更新

Addressing the Synergy Gap: The Six Elements of the Design Space

弥合协同效应鸿沟:设计空间的六大要素

Tommaso Turchi, Ben Wilson, Matt Roach, Alan Dix, Alessio Malizia

Affiliations: Department of Computer Science, University of Pisa; Swansea University; Cardiff Metropolitan University; Faculty of Logistics, Molde University College

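AI summary: The paper examines the lack of genuine human-AI synergy and proposes six elements of the design space, offering a shared vocabulary for practitioners building hybrid systems, an analytical lens for researchers studying combination patterns, and a starting point for evaluating the quality of human-AI decision-making.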

Comments 10 pages, 2 figures

详情
AI中文摘要

人工智能如今已嵌入医疗、金融、政策等众多领域,但真正的协同效应——即双方协同表现超过单独一方的表现——却很少见。元分析显示,人工智能辅助通常比单独工作时提升人类表现,但发现真正协同效应的研究却很少。我们称这种持续的不足为协同效应鸿沟。目前大多数工作将人机协同视为工程问题,专注于可解释性、信任校准或界面设计。这些方面固然重要,但仅涵盖了决定协同是否有效的一部分因素。为弥合协同效应鸿沟,我们主张需要更广泛地参与设计空间。我们通过六个相互关联的要素来映射这个空间:社会技术环境、决策框架、人类决策参与者、人工智能能力、交互以及整体评估。对于每个要素,我们描述了其涵盖的内容、在实践中如何影响其他要素以及对设计的含义。结果为构建混合系统的技术人员提供了一个共享词汇,为研究协同模式的研究人员提供了一个分析视角,并为人机协同决策质量评估者提供了一个起点,而非仅关注准确性。

英文摘要

AI is now embedded in healthcare, finance, policy, and many other domains, yet genuine human-AI synergy - combined performance that exceeds what either party achieves alone - is uncommon. Meta-analyses show that AI assistance tends to improve human performance compared to working alone, but studies finding true synergy are scarce. We call this persistent shortfall the synergy gap. Most current work treats human-AI combination as an engineering problem and concentrates on interpretability, trust calibration, or interface design. These matter, but they cover only part of what determines whether combination works. Closing the synergy gap, we argue, requires explicit engagement with a wider design space. We map that space through six interconnected elements: sociotechnical context, decision-making frameworks, human decision participants, AI capabilities, interaction, and holistic evaluation. For each element, we describe what it covers, how it shapes the others in practice, and what it implies for design. The result is a shared vocabulary for practitioners building hybrid systems, an analytical lens for researchers studying combination patterns, and a starting point for evaluators interested in the full quality of human-AI decision-making rather than accuracy alone.

2605.21630 2026-05-22 cs.AI 版本更新

MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

MindLoom: 通过组合思维模式进行前沿级推理数据合成

Haiyang Shen, Taian Guo, Xuanzhong Chen, Mugeng Liu, Weichen Bi, Wenchun Jing, Sixiong Xie, Zhuofan Shi, Yudong Han, Chongyang Pan, Siqi Zhong, Jinsheng Huang, Ming Zhang, Yun Ma

Affiliations: Peking University; Tsinghua University

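AI summary: The paper proposes MindLoom, a framework that synthesizes frontier-level reasoning data through compositional thought mode engineering, targeting the limited diversity and unstable difficulty control of existing methods; the reported experiments show favorable results across multiple benchmarks.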

Comments Work in progress. 27 pages, 4 figures, preprint

详情
AI中文摘要

尽管LLMs在推理方面取得了显著进展,系统性地生成前沿级推理数据仍然具有挑战性。现有合成方法往往缺乏对问题难度结构性因素的理解,导致多样性有限和难度控制不稳定。本文将推理问题的难度视为原子知识推理转换的累积,提出MindLoom框架,通过组合思维模式工程合成前沿级推理数据。给定一组具有验证解的难题,MindLoom首先将这些解分解为思维模式链,揭示每个问题的构建逻辑。然后训练一个检索模型,将问题状态匹配到兼容的思维模式,提供合成过程中引入哪些推理挑战的指导。新问题通过迭代应用检索到的思维模式到种子问题,并通过分布对齐采样来鼓励多样化的推理覆盖。最后,基于回放的判断阶段通过难度对生成的问题进行标记,并提供已判断正确的响应用于监督微调。我们在九个基准测试上评估了MindLoom,涵盖五个STEM学科和四个数学推理任务,多个模型家族和大小的模型在微调后均在报告的基准测试中表现出色。消融研究表明了每个组件的贡献,进一步分析表明MindLoom覆盖了广泛的推理模式,同时保持了有用的难度控制。我们已开源实现:https://github.com/EachSheep/MindLoom。

英文摘要

Although LLMs have made substantial progress in reasoning, systematically producing frontier-level reasoning data remains difficult. Existing synthesis methods often have limited visibility into the structural factors that govern problem difficulty, which can result in narrow diversity and unstable difficulty control. In this work, we view the difficulty of a reasoning problem as arising from the accumulation of atomic knowledge-reasoning transformations, which we term thought modes. Building on this perspective, we propose MindLoom, a framework for synthesizing frontier-level reasoning data through compositional thought mode engineering. Given a collection of hard problems with verified solutions, MindLoom first decomposes those solutions into thought mode chains that reveal each problem's construction logic. It then trains a retrieval model that matches problem states to compatible thought modes, providing guidance on which reasoning challenges to introduce during synthesis. New problems are composed by iteratively applying retrieved thought modes to seed questions, with distribution-aligned sampling to encourage diverse reasoning coverage. Finally, a rollout-based judging stage labels generated questions by difficulty and supplies judged-correct responses for supervised fine-tuning. We evaluate MindLoom on nine benchmarks covering five STEM disciplines and four mathematical reasoning tasks across multiple model families and sizes. Models fine-tuned on MindLoom-generated data achieve favorable performance over base models, distillation, and external-data baselines across the reported benchmarks. Ablation studies indicate the contribution of each component, and further analysis suggests that MindLoom covers a broad range of reasoning patterns while maintaining useful difficulty control. We have open-sourced our implementation at https://github.com/EachSheep/MindLoom.

2605.21625 2026-05-22 cs.CV cs.AI cs.CL 版本更新

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Flat-Pack Bench: 通过家具组装评估大视觉-语言模型的时空理解

Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan

Affiliations: Cornell University; Cornell Tech; MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); UC Berkeley

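AI summary: The paper introduces Flat-Pack Bench, a benchmark for evaluating the spatio-temporal understanding of large vision-language models in complex video scenarios, and finds that current models fall well short on fine-grained spatio-temporal reasoning.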

Comments CVPR 2026

详情
AI中文摘要

大视觉-语言模型(LVLMs)的出现显著提升了视频理解能力。然而,现有基准主要集中在粗粒度任务,如动作分割、分类、描述和检索,且这些基准通常依赖于易于口头识别的实体,如家庭物品、动物、人类主体等,限制了其在复杂真实视频场景中的适用性。但许多应用,如家具组装、烹饪等,需要对视频进行逐步细粒度的时空理解,而当前基准并未充分评估。为解决这一差距,我们引入了Flat-Pack Bench,一个专注于家具组装任务的新基准。我们的基准评估LVLMs在细微任务上的表现,包括组装动作的时间顺序、组装状态的时间定位、理解部件配合和追踪,使用多选问题配以视觉提示突出相关部分作为参考,以回答细粒度问题。我们的实验表明,最先进的LVLMs在细粒度时空推理上表现显著不足,凸显了其在有效利用视频时间信息、跟踪能力和理解空间交互(如物理接触)方面的局限性。

英文摘要

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, like household objects, animals, and human subjects, limiting their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in leveraging temporal information from videos, tracking parts, and understanding spatial interactions such as physical contact.

2605.21623 2026-05-22 cs.AI 版本更新

The Shape of Testimony: A Scalable Framework for Oral History Archive Comparison

证词的形态:一种可扩展的口述史档案比较框架

Itamar Trainin, Renana Keydar, Amit Pinchevski

Affiliations: Hebrew University of Jerusalem

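AI summary: Through a large-scale computational analysis of more than 1,600 oral testimonies, the paper examines the distinction between two styles of oral testimony in Holocaust studies and proposes a scalable framework for comparative corpus analysis.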

详情
AI中文摘要

研究者在大屠杀研究中常常将口述幸存者证词分为两种风格:美国犹太人研究肖尔基金会的访谈通常遵循结构化的、由访谈者引导的格式,而耶鲁福图诺夫视频档案则更倾向于自由形式、开放式风格。本研究通过分析两个档案中超过1600个证词,利用话语分割、主题建模和大型语言模型(LLM)分析,量化证词的“结构化”程度,包括主题连贯性、访谈者-幸存者动态和问题类型的分布。研究结果在总体上支持早期研究中发现的结构性差异,同时揭示了两个档案之间的显著重叠,不仅在个别访谈内,而且在共同的叙述模式中。这使得简单的“结构化vs.自由形式”二元对立在这些口述史中变得更加复杂。除了重新审视大屠杀研究中的一个基础性主张外,本工作还提供了一种可扩展、可重复的比较语料库分析框架。作为概念验证,它还为数字口述史、叙述分析以及公民科学注释平台的设计提出了更广泛的应用。

英文摘要

Researchers in Holocaust studies have often distinguished between two styles of oral survivor testimony: the USC Shoah Foundation's interviews tend to follow a structured, interviewer-guided format, whereas the Yale Fortunoff Video Archive generally favors a more free-form, open-ended style. This distinction has influenced both scholarly research and the development of later archives. In this study, we critically examine that claim by conducting a large-scale computational analysis of more than 1,600 testimonies from both collections. Leveraging discourse segmentation, topic modeling, and large language model (LLM) based analysis, we quantify the "structuredness" level of testimonies through topic coherence, interviewer-survivor dynamics, and the distribution of question types. Our results generally corroborate the structural differences identified in earlier research, while also revealing significant overlaps between the collections, both within individual interviews and across common narrative patterns. This complicates the simple "structured vs. free-form" dichotomy often applied to these oral histories. Beyond revisiting a foundational claim in Holocaust studies, our work provides a scalable, replicable framework for comparative corpus analysis. As a proof of concept, it suggests broader applications for digital oral history, narrative analysis, and the design of citizen-science annotation platforms.

2605.21622 2026-05-22 cs.AI 版本更新

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

TO-Agents:一种用于基于偏好的拓扑优化的多智能体AI流水线

Isabella A. Stewart, Hongrui Chen, Faez Ahmed

Affiliations: Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

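AI summary: The paper proposes TO-Agents, a multi-agent AI framework that connects natural-language design intent with iterative topology optimization so that designers no longer have to translate qualitative preferences into solver settings by hand; it is evaluated on two long-horizon design tasks.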

Comments Accepted for publication in the Proceedings of the ASME 2026 International Design Engineering Technical Conferences (IDETC2026)

详情
AI中文摘要

拓扑优化可以生成高效的结构,但设计者往往必须手动将定性意图,如期望的视觉风格、产品体验或可制造性转换为与这些偏好不直接相关的求解器设置。我们提出了TO-Agents,一种多智能体AI框架,将自然语言设计意图与迭代拓扑优化连接起来。该框架将人类提供的问题描述转换为经过验证的求解器输入,运行拓扑优化求解器,渲染结果的3D拓扑,并使用多视角视觉-语言推理与独立的评判智能体来批评每个结果并修改求解器参数。我们在两个长周期设计任务上评估了该框架:悬臂梁基准测试和手机支架产品设计。在两个任务中,设计者指定了受自然树形态启发的分层分支结构的美学偏好,系统在十个独立重复中进行了四次修订循环。TO-Agents在每个案例研究中至少在60%的试验中生成了符合偏好的设计,对应于没有视觉或历史反馈的简化流水线的6倍以上的成功试验。评判评分和人类评估显示,该流水线能够识别有效的参数杠杆,从差的修订中恢复,并扩展设计探索。一个制造智能体进一步对排名最高的设计进行后处理,以实现增材制造,使设计能够从意图到原型。我们还识别了失败模式,包括过度优化、选择性记忆、工具位置错误和参数推理错误。这些结果表明,智能体拓扑优化可以将设计者从低层次参数调整转向高层次的形式和功能指定,同时强调了可靠自主工程设计所需的保障措施。

英文摘要

Topology optimization can generate efficient structures, but designers often must manually translate qualitative intent, such as desired visual style, product experience, or manufacturability into solver settings that are not directly tied to those preferences. We present TO-Agents, a multi-agent AI framework that connects natural-language design intent with iterative topology optimization. The framework converts a human-provided problem description into validated solver inputs, runs a topology optimization solver, renders the resulting 3D topology, and uses multi-view vision-language reasoning with an independent judge agent to critique each result and revise solver parameters. We evaluate the framework on two long-horizon design tasks: a cantilever beam benchmark and a phone-stand product design. In both tasks, the designer specifies an aesthetic preference for hierarchically branched structures inspired by natural tree morphologies, and the system performs four revision cycles across ten independent replicates. TO-Agents produces at least one preference-aligned design in 60% of trials for each case study, corresponding to up to 6x more successful trials than an ablated pipeline without visual or historical feedback. Judge scores and human evaluations show that the pipeline can identify effective parameter levers, recover from poor revisions, and expand design exploration. A manufacturing agent further post-processes top-ranked designs for additive manufacturing, enabling end-to-end intent-to-prototype design. We also identify failure modes, including overshooting, selective memory, misplaced tools, and incorrect parameter reasoning. These results suggest that agentic topology optimization can shift designers from low-level parameter tuning toward higher-level specification of form and function, while highlighting safeguards needed for reliable autonomous engineering design.

2605.21609 2026-05-22 cs.CL cs.AI cs.CY 版本更新

CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

CR4T:基于重写的青少年LLM安全机制

Heajun An, Qi Zhang, Vedanth Achanta, Jin-Hee Cho

Affiliations: Virginia Tech

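AI summary: The paper proposes CR4T, a framework that replaces refusal-oriented safety mechanisms with selective response reconstruction, improving LLM safety in a way better aligned with adolescents' developmental needs.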

详情
AI中文摘要

大型语言模型(LLMs)越来越多地嵌入青少年的数字环境,介导信息搜索、建议和情感敏感的互动。然而,现有安全机制仍主要基于成人中心的规范,并通过拒绝导向的压制来实现安全。尽管这些方法可能减少即时的政策违规,但它们也可能导致对话死胡同、限制建设性指导,并未能解决青少年与AI互动中固有的发展脆弱性。我们主张,青少年LLM安全不应仅被视为过滤问题,而应被视为一种社会技术、发展一致的转变问题。为实现这一视角,我们提出了Critique-and-Revise-for-Teenagers(CR4T),一种模型无关的安全保障框架,该框架可选择性地将不安全或拒绝式输出重构为适合年龄的指导性响应,同时保持善意意图。CR4T结合轻量级风险检测与领域条件重写,以去除风险放大内容,减少不必要的对话关闭,并引入适合发展的指导。实验结果表明,针对重写显著减少了不安全和拒绝导向的结果,同时避免了对可接受互动的不必要的干预。这些发现表明,选择性响应重构为青少年面向的LLM系统提供了一种更以人为本的替代方案,以替代以拒绝为中心的安全机制。

英文摘要

Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in adult-centric norms and operationalize safety through refusal-oriented suppression. While such approaches may reduce immediate policy violations, they can also create conversational dead-ends, limit constructive guidance, and fail to address the developmental vulnerabilities inherent in adolescent-AI interactions. We argue that adolescent LLM safety should be framed not solely as a filtering problem, but as a socio-technical, developmentally aligned transformation problem. To operationalize this perspective, we propose Critique-and-Revise-for-Teenagers (CR4T), a model-agnostic safeguarding framework that selectively reconstructs unsafe or refusal-style outputs into age-appropriate, guidance-oriented responses while preserving benign intent. CR4T combines lightweight risk detection with domain-conditioned rewriting to remove risk-amplifying content, reduce unnecessary conversational shutdown, and introduce developmentally appropriate guidance. Experimental results show that targeted rewriting substantially reduces unsafe and refusal-oriented outcomes while avoiding unnecessary intervention on acceptable interactions. These findings suggest that selective response reconstruction offers a more human-centered alternative to refusal-centric guardrails for adolescent-facing LLM systems.

2605.21606 2026-05-22 cs.LG cs.AI 版本更新

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

何时教师标记可靠?用于推理的基于位置加权的在线自我蒸馏

Xiaogeng Liu, Xinyan Wang, Yingzi Ma, Yechao Zhang, Chaowei Xiao

Affiliations: Johns Hopkins University; University of Wisconsin–Madison; Nanyang Technological University

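AI summary: The paper proposes position-weighted on-policy self-distillation for reasoning, using a branch-viability diagnostic to identify where teacher tokens are reliable, and validates the method on several models.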

Comments Pre-print. Code is available at https://github.com/SaFo-Lab/PW-OPSD

详情
AI中文摘要

在线自我蒸馏(OPSD)通过一个特权教师训练学生,但其标准目标对所有生成的标记同等重视,隐含地将特权教师目标视为在每个学生访问的前缀中同样可靠。现有的基于熵的OPD方法通过调节令牌级监督来放松这种均匀性,但推理中高教师熵的可靠性含义具有歧义:它可以反映非可行的不确定性或良性的解决方案多样性。为识别这一现象,我们引入了分支可行性诊断。具体来说,我们记录特权答案教师提示中的下一个标记替代方案,强制每个替代方案在学生提示及其在线脊柱前缀之后,并测试由此产生的学生模板延续是否能恢复正确答案。在Qwen3-4B上,我们发现一个导向的序列内位置分数是测试中最强的教师标记可靠性预测因子,达到曲线下面积(AUROC)为0.83;局部不确定性分数最多为0.57。受此轨迹结构的启发,我们提出了基于位置加权的在线自我蒸馏(PW-OPSD),其在保持相同的学生滚动生成、特权教师传递和截断的前向KL目标的同时,应用递增的位置权重。在不同随机种子的全面评估中,诊断衍生的PW-OPSD在AIME 2024和AIME 2025 Avg@12上分别提高了+1.0和+1.1分,并在两个更大规模的模型上也展示了一致的Avg@12改进。这些结果表明,推理蒸馏中的教师标记可靠性具有轨迹结构,并且可以在不增加教师计算的情况下利用。

英文摘要

On-policy self-distillation (OPSD) trains a student on its own rollouts using a privileged teacher, but its standard objective weights all generated tokens equally, implicitly treating the privileged teacher target as equally reliable at every student-visited prefix. Existing entropy-based OPD methods relax this uniformity by modulating token-level supervision with teacher entropy, but high teacher entropy in reasoning has an ambiguous reliability meaning: it can reflect either non-viable uncertainty or benign solution diversity. To identify this phenomenon, we introduce a branch-viability diagnostic. Specifically, we record next-token alternatives from the privileged-answer teacher prompt, force each alternative after the student prompt plus its on-policy spine prefix, and test whether the resulting student-template continuation recovers the correct answer. On Qwen3-4B, we find that an oriented within-sequence position score is the strongest tested predictor of teacher-token reliability, reaching an area-under-ROC-curve (AUROC) of 0.83; local uncertainty scores are at most 0.57. Motivated by this trajectory-level structure, we propose Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies an increasing position weight while keeping the same student rollout, privileged teacher pass, and clipped forward-KL target as OPSD. In our comprehensive evaluations with different random seeds, the diagnostic-derived PW-OPSD improves AIME 2024 and AIME 2025 Avg@12 by +1.0 and +1.1 points, and a generalization evaluation on two larger-scale models from different families, DeepSeek-R1-Distill-Llama-8B and Olmo-3-7B-Think, also demonstrates consistent aggregate Avg@12 improvements. These results show that teacher-token reliability in reasoning distillation is trajectory-structured and can be utilized without additional teacher computation.
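
As a rough illustration of position weighting, the sketch below reweights per-token distillation losses with a weight that increases with position. The linear ramp, the `w_min` parameter, the clip value, and all function names are assumptions made for this example; the abstract specifies only that the weight increases with position and that the clipped forward-KL target is unchanged.

```python
import math

def forward_kl(teacher_probs, student_probs, clip=10.0):
    """Forward KL(teacher || student) for one token, clipped from above.

    Assumes every student probability is > 0; the clip value is illustrative.
    """
    kl = sum(p * math.log(p / q)
             for p, q in zip(teacher_probs, student_probs) if p > 0.0)
    return min(kl, clip)

def position_weights(length, w_min=0.5):
    """Linear ramp from w_min at the first token to 1.0 at the last (assumed form)."""
    if length == 1:
        return [1.0]
    return [w_min + (1.0 - w_min) * t / (length - 1) for t in range(length)]

def weighted_token_loss(token_losses, weights):
    """Weighted mean of per-token losses; uniform weights give the plain mean."""
    return sum(w * l for w, l in zip(weights, token_losses)) / sum(weights)

# A sequence whose only disagreement with the teacher is at the final token.
losses = [0.0, 0.0, 3.0]
uniform = weighted_token_loss(losses, [1.0, 1.0, 1.0])
ramped = weighted_token_loss(losses, position_weights(len(losses)))
print(uniform, round(ramped, 3))  # 1.0 1.333
```

With the ramp, the late-position loss contributes more to the objective than under uniform weighting, while the per-token target itself is computed exactly as before.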

2605.21548 2026-05-22 stat.ML cs.AI cs.LG 版本更新

Local Covariate Selection for Average Causal Effect Estimation without Pretreatment and Causal Sufficiency Assumptions

局部协变量选择用于无预处理和因果充分性假设下的平均因果效应估计

Zeyu Liu, Zheng Li, Feng Xie, Yan Zeng, Hao Zhang, Kun Zhang

Affiliations: Department of Applied Statistics, Beijing Technology and Business University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; College of Computer Science and Artificial Intelligence, Fudan University; Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence; Carnegie Mellon University

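AI summary: The paper proposes a local learning method for covariate selection in nonparametric causal effect estimation that avoids the pretreatment and causal sufficiency assumptions, achieving accurate estimates with substantially better computational efficiency.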

详情
AI中文摘要

我们研究了选择协变量以无偏估计总因果效应的问题。现有方法通常依赖于对所有变量的全局因果结构学习,或依赖于强假设,如因果充分性假设——观测变量不共享潜在混杂因素,或预处理假设,限制协变量只能是不受处理或结果影响的变量。这些要求在实践中往往不现实,且在高维设置中全局学习变得计算上不可行。为了解决这些挑战,我们提出了一种新颖的局部学习方法,用于非参数因果效应估计中的协变量选择,避免了预处理和因果充分性假设。我们首先刻画了一个局部边界,该边界包含至少一个有效的调整集,当且仅当存在调整集来识别因果效应时。然后我们开发了局部识别程序,以在该边界内高效地搜索。我们证明了所提出的方法是正确且完整的。在多个合成数据集和两个真实世界数据集上的实验表明,我们的方法在准确估计因果效应的同时,显著提高了计算效率。

英文摘要

We study the problem of selecting covariates for unbiased estimation of the total causal effect. Existing approaches typically rely on global causal structure learning over all variables, or on strong assumptions such as causal sufficiency - where observed variables share no latent confounders - or the pretreatment assumption, which limits covariates to those unaffected by the treatment or outcome. These requirements are often unrealistic in practice, and global learning becomes computationally prohibitive in high-dimensional settings. To address these challenges, we propose a novel local learning method for covariate selection in nonparametric causal effect estimation that avoids both the pretreatment and causal sufficiency assumptions. We first characterize a local boundary that contains at least one valid adjustment set whenever one exists for identifying the causal effect, and then develop local identification procedures to efficiently search within this boundary. We prove that the proposed method is sound and complete. Experiments on multiple synthetic datasets and two real-world datasets show that our approach achieves accurate causal effect estimation while substantially improving computational efficiency.

2605.21545 2026-05-22 cs.SE cs.AI 版本更新

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

RefusalBench: 为什么拒绝率错误排名前沿大语言模型在生物研究提示中的表现

Lukas Weidener, Marko Brkić, Mihailo Jovanović, Emre Ulgac, Aakaash Meduri

Affiliations: Applied Scientific Intelligence, Inc.

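AI summary: The paper introduces RefusalBench, a benchmark of 141 prompts in 47 matched triples for evaluating the refusal behaviour of frontier LLMs on biological research prompts; it finds that refusal rate misranks safety calibration and exposes how models differ across risk tiers.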

Comments 34 pages, 4 figures, 12 tables (10 in main text, 2 in supplementary). Code and data: https://github.com/AppliedScientific/refusalbench

详情
AI中文摘要

前沿大语言模型越来越多地被用作生物研究工作流的编排骨干,但尚无共同的证据基础来比较它们在合法研究提示上的拒绝行为。本文引入了RefusalBench,这是一个包含141个提示的47组匹配三元组基准,保持任务框架不变,仅改变生物风险层级(无害、临界、双用途),从而实现层级条件下的稳健比较,以避免子领域混淆。一个15个提示的应拒绝正控模块为每个模型建立了校准地板;三个模型未能拒绝这些提示。在2026年5月快照中的19个前沿模型中,严格拒绝率在相同提示上从0.1%到94.6%不等。在此次快照中,管辖权并不能预测拒绝(Mann-Whitney U,p = 0.393;欧盟n = 1,美国双模态);提供者身份可以,Anthropic的API堆栈预测拒绝的OR为21.03(95% CI:14.58-30.34提示聚类;5.70-77.55在模型聚类GEE下)。这种效应最好解读为访问路径级别而不是模型权重级别:Anthropic的99.8%严格拒绝携带相同的安全政策裁决原因代码,与一小组标准拒绝模板一致,而不是个别模型推理。严格拒绝率错误排名安全校准:Grok 4.20实现了最高的层级区分(Youden's J = 0.787)尽管在总体拒绝率中仅排名第七,而Claude Opus 4.7的J值从先前版本下降了65%,尽管双用途检测没有改进。18个前沿模型中有9个在双用途层级上表现出一种“谨慎但帮助”的部分合规模式,二元拒绝指标无法检测到这种模式。

英文摘要

Frontier large language models are increasingly deployed as orchestration backbones for biological research workflows, yet no shared evidence base exists for comparing their refusal behaviour on legitimate research prompts. RefusalBench, introduced here, is a matched-triple benchmark of 141 prompts in 47 bundles that holds task framing constant while varying only biological risk tier (benign, borderline, dual-use), enabling tier-conditioned comparisons robust to subdomain confounding. A 15-prompt should-refuse positive-control module establishes per-model calibration floors; three models fail to refuse even these prompts. Across 19 frontier models in the May 2026 snapshot, strict refusal rates span 0.1% to 94.6% on identical prompts. Jurisdiction does not predict refusal in this snapshot (Mann-Whitney U, p = 0.393; EU n = 1, US bimodal); provider identity does, with Anthropic's API stack predicting refusal at OR = 21.03 (95% CI: 14.58-30.34 prompt-clustered; 5.70-77.55 under model-clustered GEE). This effect is best read as access-path-level rather than model-weight-level: 99.8% of Anthropic's strict refusals carry the same safety_policy adjudicated reason code, consistent with a small set of canonical refusal templates rather than case-by-case model reasoning. Strict refusal rate misranks safety calibration: Grok 4.20 achieves the highest tier discrimination (Youden's J = 0.787) while ranking only seventh by overall refusal rate, and Claude Opus 4.7's J dropped 65% from prior versions with no improvement in dual-use detection. Nine of 18 frontier models exhibit a hedge-but-help partial-compliance pattern at dual-use tier that binary refusal metrics cannot detect.
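
To see why an overall refusal rate can misrank models, the sketch below contrasts it with Youden's J, which is sensitivity + specificity - 1. Mapping sensitivity to the refusal rate on should-refuse prompts and specificity to the non-refusal rate on benign prompts is an assumption of this example, and the two toy models and their decisions are invented for illustration; none of the numbers come from the paper.

```python
def refusal_rate(decisions):
    """Share of prompts refused; decisions are 1 (refused) or 0 (answered)."""
    return sum(decisions) / len(decisions)

def youdens_j(refused_risky, refused_benign):
    """Youden's J = sensitivity + specificity - 1."""
    sensitivity = refusal_rate(refused_risky)          # refuses what it should
    specificity = 1.0 - refusal_rate(refused_benign)   # answers what it should
    return sensitivity + specificity - 1.0

# Hypothetical model A refuses almost everything, regardless of tier.
a_risky, a_benign = [1] * 9 + [0] * 1, [1] * 8 + [0] * 2
# Hypothetical model B refuses selectively.
b_risky, b_benign = [1] * 8 + [0] * 2, [0] * 10

print(refusal_rate(a_risky + a_benign), round(youdens_j(a_risky, a_benign), 3))  # 0.85 0.1
print(refusal_rate(b_risky + b_benign), round(youdens_j(b_risky, b_benign), 3))  # 0.4 0.8
```

Model A has the higher overall refusal rate but barely discriminates between tiers, while model B refuses less often overall and separates the tiers far better.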

2605.21541 2026-05-22 cs.CR cs.AI cs.LG stat.ML 版本更新

Frequency-Domain Regularized Adversarial Alignment for Transferable Attacks against Closed-Source MLLMs

频域正则化对抗对齐用于针对闭源大语言模型的可转移攻击

Leitao Yuan, Qinghua Mao, Daizong Liu, Kun Wang, Wenjie Wang, Yan Teng, Jing Shao, Dongrui Liu

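Affiliations Shanghai Artificial Intelligence Laboratory; Zhejiang University; Shanghai Jiao Tong University; Wuhan University; Nanyang Technological University; University of Science and Technology of China

AI summary This paper proposes FRA-Attack, which tackles adversarial transferability from a frequency-domain regularization perspective: a high-pass DCT objective for feature alignment and a frequency-domain gradient regularizer together improve cross-model transfer.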

详情
AI中文摘要

多模态大语言模型(MLLMs)仍易受基于转移的针对性攻击影响,其中在开源代理编码器上优化的扰动可以泛化到闭源MLLMs。提高对抗转移性的一个关键挑战是有效捕捉不同模型间共享的内在视觉聚焦特性,使得扰动与可转移的语义线索对齐,而非代理特定行为。然而,现有方法受到空间域特征冗余和代理特定梯度信号的阻碍,影响跨模型转移性。在本文中,我们提出FRA-Attack,从统一的频域正则化视角解决这两个挑战。在特征对齐方面,对patch特征的高通DCT目标抑制冗余的全局结构,并将损失集中在承载MLLMs内在视觉聚焦的高频带。在梯度优化方面,我们引入频率域梯度正则化(FGR),一种无模型依赖的低通正则化器,仅使用几何频率坐标调节代理梯度,即不涉及代理衍生的统计量,因此FGR通过构造无模型依赖性,消除代理特定的高频伪影,同时保留可转移的低频方向。两者共同形成统一的频域转移性处理。在15个旗舰MLLMs上进行的广泛实验显示,FRA-Attack在跨模型转移性方面表现优异,特别是在GPT-5.4、Claude-Opus-4.6和Gemini-3-flash等最先进的模型上实现了最先进的性能。

英文摘要

Multimodal large language models (MLLMs) remain vulnerable to transfer-based targeted attacks, where perturbations optimized on open-source surrogate encoders can generalize to closed-source MLLMs. A key challenge for improving adversarial transferability is to effectively capture the intrinsic visual focus shared across different models, such that perturbations align with transferable semantic cues rather than surrogate-specific behaviors. However, existing methods suffer from spatial-domain feature redundancy and surrogate-specific gradient signals, thereby hindering cross-model transferability. In this paper, we propose FRA-Attack, which addresses both challenges from a unified frequency-domain regularization perspective. For feature alignment, a high-pass DCT objective on patch features suppresses redundant global structures and concentrates the loss on the high-frequency band that carries the MLLMs' intrinsic visual focus. For gradient optimization, we introduce Frequency-domain Gradient Regularization (FGR), a model-agnostic low-pass regularizer that modulates the surrogate gradient using only the geometric frequency coordinate. Because no surrogate-derived statistic is involved, FGR is model-agnostic by construction; it removes surrogate-specific high-frequency artifacts while preserving transferable low-frequency directions. Together, the two components form a unified frequency-domain treatment of transferability. Extensive experiments on 15 flagship MLLMs across 7 vendors show that FRA-Attack achieves superior cross-model transferability, particularly with state-of-the-art performance on GPT-5.4, Claude-Opus-4.6 and Gemini-3-flash.

2605.21540 2026-05-22 cs.SI cs.AI cs.CL cs.CY 版本更新

Detecting Synthetic Political Narratives in Cross-Platform Social Media Discourse

在跨平台社交媒体讨论中检测合成政治叙述

Despoina Antonakaki, Sotiris Ioannidis

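Affiliations Institute of Computer Science, Foundation for Research and Technology; Technical University of Crete

AI summary This paper presents a cross-platform framework that combines four coordination signals (lexical diversity, temporal burstiness, rhetorical repetition, and semantic homogenization) into a Synthetic Narrative Coordination Score SNC(C) for detecting synthetic political narratives. IntelSlava ranks first on four of six event windows, whereas Rybar ranks last despite its high semantic homogenization because of language differences.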

详情
AI中文摘要

大规模语言模型的普及引入了新的合成政治沟通范式,其中叙述可能被生成、语义协调并战略性地在多个平台大规模传播。我们提出了一种跨平台框架,利用四个协调信号——词汇多样性D(C)、时间爆发性B(C)、修辞重复R(C)和语义同质性H(C)——组合成合成叙述协调评分SNC(C)以检测合成政治叙述。我们对包含6个地缘政治事件窗口的353,223条记录进行了分析,数据来自六个Telegram频道和九个Reddit社区(2023-2026)。结果表明,IntelSlava表现出最低的词汇多样性(MATTR 0.52-0.54)、最高的爆发性(B=+0.48至+0.73)和最高的与同僚频道的修辞重叠(Jaccard 0.12),在六个事件窗口中的四个中排名第一(SNC 0.45-0.60)。Rybar在所有窗口中排名最后,尽管其俄语输出导致词汇多样性高且与英语频道的修辞Jaccard接近零,这表明单一指标不足以检测协调性。多维SNC(C)评分提供了比任何单一指标更稳健和可解释的信号。

英文摘要

The proliferation of large language models has introduced a new paradigm of synthetic political communication in which narratives may be generated, semantically coordinated, and strategically disseminated across platforms at scale. We present a cross-platform framework for detecting synthetic political narratives using four coordination signals -- lexical diversity D(C), temporal burstiness B(C), rhetorical repetition R(C), and semantic homogenization H(C) -- combined into a Synthetic Narrative Coordination Score SNC(C). We apply the framework to a corpus of 353,223 records spanning six geopolitical event windows collected from six Telegram channels and nine Reddit communities (2023--2026). Results show that IntelSlava exhibits the lowest lexical diversity (MATTR 0.52--0.54), the highest burstiness (B=+0.48 to +0.73), and the highest rhetorical overlap with peer channels (Jaccard 0.12), ranking first in the composite SNC(C) on four of six event windows (SNC 0.45--0.60). Rybar ranks last on all windows despite its high semantic homogenization, because its Russian-language output yields high lexical diversity and near-zero rhetorical Jaccard with English-language channels -- demonstrating that no single indicator is sufficient for coordination detection. Multi-dimensional SNC(C) scoring provides a more robust and interpretable signal than any individual metric.
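Two of the indicators named above, MATTR for lexical diversity and Jaccard overlap for rhetorical repetition, are standard measures and can be sketched briefly. This is a minimal illustration under assumptions: the window size, the tokenization, and the function names are hypothetical, and the abstract does not state how the four signals are weighted into SNC(C), so no composite is attempted here.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all sliding windows.

    Falls back to the plain type-token ratio when the text is shorter
    than one window.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def jaccard(phrases_a, phrases_b):
    """Jaccard overlap |A & B| / |A | B| between two phrase collections."""
    a, b = set(phrases_a), set(phrases_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy inputs, not drawn from the paper's corpus.
print(mattr("breaking news breaking news breaking news".split(), window=2))
print(jaccard({"x", "y"}, {"y", "z"}))
```

A channel posting in a different language from its peers shares almost no phrases with them, so its Jaccard overlap is near zero regardless of coordination, which is the Rybar effect the abstract reports.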

2605.21523 2026-05-22 eess.IV cs.AI cs.CV cs.MM eess.SP 版本更新

Tackle CSM in JPEG Steganalysis with Data Adaptation

用数据适应法对抗JPEG隐写分析中的CSM

Rony Abecidan, Vincent Itier, Jérémie Boulanger, Patrick Bas, Tomáš Pevný

Affiliations LABEL4.AI; Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL; IMT Nord Europe, Centre for Digital System; Department of Computers

AI summary This paper proposes the TADA framework, which uses data adaptation to learn an unknown processing pipeline, improving robustness to CSM and operational generalization in realistic scenarios.

Comments ACM Workshop on Information Hiding and Multimedia Security, (IH&MMSec '26), Jun 2026, Florence, Italy

详情
AI中文摘要

隐写分析模型在基准数据集上表现优异,但在实际应用中遇到由训练时未见过的处理流程生成的图像时会遇到困难。这种被称为覆盖源不匹配(CSM)的问题在现实场景中尤为棘手,因为实践者只能访问少量未标记的数据集,不确定这些图像所应用的处理技术,且缺乏关于该数据集中覆盖和隐写图像比例的信息。为解决这一挑战,我们引入了TADA(通过数据适应的目标对齐)框架,该框架学习从少量未标记的目标数据集中模拟未知的处理流程。该架构通过结合残差协方差对齐、残差分布匹配和一个ℓ²损失约束模拟器生成逼真图像。在玩具和实际目标上,TADA在对抗CSM的鲁棒性和实际应用泛化能力方面相比强大的整体和原子基线有显著提升。附加资源可在本链接中获得:https://github.com/RonyAbecidan/TADA

英文摘要

Steganalysis models excel on benchmark datasets but struggle in the wild when analyzed images are produced by a processing pipeline unseen during training. This problem, known as Cover Source Mismatch (CSM), is particularly hard in realistic settings where practitioners (1) have access to only a small, unlabeled dataset, (2) are unsure of the processing techniques applied to these images, and (3) lack information on the proportion of covers and stegos in that set. To address this challenge, we introduce TADA (Target Alignment through Data Adaptation), a framework that learns to emulate the unknown processing pipeline from a small unlabeled target set. This architecture is trained with a loss combining residual covariance alignment, residual distribution matching, and an ℓ² loss constraining the emulator to produce realistic images. Across toy and operational targets, TADA yields substantial gains in robustness to CSM and improves operational generalization compared to strong holistic and atomistic baselines. Additional resources are available at https://github.com/RonyAbecidan/TADA

2605.21522 2026-05-22 q-bio.QM cs.AI cs.CE cs.LG stat.ML 版本更新

Protein Thoughts: Interpretable Reasoning with Tree of Thoughts and Embedding-Space Flow Matching for Protein-Protein Interaction Discovery

蛋白质思想:基于树 of 思维和嵌入空间流匹配的可解释推理用于蛋白质-蛋白质相互作用发现

Kingsley Yeon, Xuefeng Liu, Promit Ghosal

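Affiliations Department of Statistics and CCAM, University of Chicago; School of Medicine, Stanford University; Department of Statistics, University of Chicago

AI summary This paper proposes an interpretable framework for protein-protein interaction discovery that recasts PPI discovery as an interpretable search problem with explicit reasoning, using embedding-space flow matching and Tree-of-Thoughts search to improve prediction accuracy and interpretability.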

详情
AI中文摘要

蛋白质-蛋白质相互作用(PPIs)调控几乎所有细胞过程,但计算方法通常产生排名预测而缺乏机理解释。这限制了其应用,因为生物学家无法判断预测是否反映真实的生化见解或偶然相关性。我们提出了Protein Thoughts框架,将PPI发现重新表述为可解释的搜索问题。该系统将结合证据分解为四个生物意义的信号:序列相似性反映进化关系,结构互补性捕捉几何契合,界面平衡,以及化学兼容性编码残基级相互作用。而不是将这些信号合并为一个模糊的分数,我们通过透明的价值函数保留每个信号的贡献,从而实现排序和审计。为了高效地导航大规模候选空间,我们引入了假设引导的熵正则化树 of 思维搜索。微调的语言模型从嵌入衍生的特征生成搜索指令,将候选者分类为高优先级、探索性或可跳过。这些指令条件化一个玻尔兹曼策略,平衡利用与熵驱动的探索,同时假设意识修剪防止提前放弃有前途的候选者。对于表现出评分分歧的候选者,假设条件的嵌入空间流匹配将蛋白质嵌入推向结合者流形。在SHS148k基准测试中,Protein Thoughts实现了平均最佳结合体排名为11.2,比熵树搜索基线的47.7提高了76%,在结合预测中,训练的价值函数实现了91.08±0.19 Micro-F1,优于现有PPI方法在同一数据集上的表现。

英文摘要

Protein-protein interactions (PPIs) govern nearly all cellular processes, yet computational methods for identifying binding partners typically produce ranked predictions without mechanistic justification. This creates a fundamental barrier to adoption because biologists cannot assess whether predictions reflect genuine biochemical insight or spurious correlations. We present Protein Thoughts, a framework that reformulates PPI discovery as an interpretable search problem with explicit reasoning. The system decomposes binding evidence into four biologically meaningful signals: sequence similarity reflecting evolutionary relationships, structural complementarity capturing geometric fit, interface balance, and chemical compatibility encoding residue-level interactions. Rather than collapsing these signals into an opaque score, we preserve their individual contributions through a transparent value function that enables both ranking and auditing. To navigate large candidate spaces efficiently, we introduce hypothesis-guided entropy-regularized Tree-of-Thoughts search. A fine-tuned language model generates search directives from embedding-derived features, classifying candidates as high-priority, exploratory, or skippable. These directives condition a Boltzmann policy that balances exploitation with entropy-driven exploration, while hypothesis-aware pruning prevents premature abandonment of promising candidates. For candidates exhibiting score disagreement, hypothesis-conditioned embedding-space flow matching transports protein embeddings toward the binder manifold. On the SHS148k benchmark, Protein Thoughts achieves a mean best-binder rank of 11.2 versus 47.7 for an entropic tree search baseline, a 76% improvement, and for binding prediction the trained value function achieves 91.08 ± 0.19 Micro-F1, outperforming existing PPI methods on the same dataset.

2605.21516 2026-05-22 cs.LG cs.AI 版本更新

Harnesses for Inference-Time Alignment over Execution Trajectories

在执行轨迹上进行推理时间对齐的工具

Boyuan Wang, Bochao Li, Minghan Wang, Yuxin Tao, Fang Kong

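Affiliations GitHub

AI summary This paper studies harness design as inference-time alignment over execution trajectories. It separates a harness into two mechanisms, task decomposition and guided execution, shows that more elaborate decomposition or guidance does not always yield better results, and validates these findings with synthetic experiments and real terminal-agent benchmarks.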

详情
英文摘要

Harness engineering has emerged as an important inference-time technique for large language model (LLM) agents, aiming to improve long-term performance through task decomposition and guided execution. However, more elaborate harnesses are not uniformly better: increasing decomposition or guidance can sometimes improve execution, but can also reduce final task success. We study harness design through the lens of inference-time trajectory alignment. This perspective separates a harness into two mechanisms: task decomposition, which structures a task into sub-goals, and guided execution, which reshapes local action distributions during execution. This separation allows us to quantify how workflow granularity, retry budgets, and guidance-induced action reweighting shape the performance limits of harness design. It further reveals concrete failure modes, including over-decomposition, over-pruning, and hallucinated execution. We validate these predictions through controlled synthetic experiments and real terminal agent benchmarks. Inspired by the theory, we further show that effective harnesses can be partial: specifying only the initial steps and leaving the remaining execution to the agent can achieve a higher pass rate than fully structured workflows.

2605.21515 2026-05-22 cs.LG cs.AI 版本更新

Predicting Performance of Symbolic and Prompt Programs with Examples

通过示例预测符号程序和提示程序的性能

Chengqi Zheng, Keya Hu, Shuzhi Liu, Tao Wu, Kevin Ellis, Yewen Pu

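Affiliations Nanyang Technological University, Singapore; Massachusetts Institute of Technology, USA; Cornell University, USA

AI summary This paper studies how to predict a program's performance from examples. It proposes a simple coin-flip model that predicts performance from observed execution outcomes and a prior over performances, and develops RAP, which constructs a proxy prior to improve prediction.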

详情
AI中文摘要

LLM提示广泛用于自然陈述的任务,但其不可靠,可能在少数测试用例上成功但在部署时失败。我们研究了性能预测:给定一个程序(例如符号程序或在LLM上执行的提示程序)和少量领域内示例,预测其在未见任务上的性能。我们使用一个简单的硬币翻转模型,将每次通过/失败的程序执行视为伯努利随机变量,其成功概率是程序未知的性能。在该模型中,性能完全取决于:1)在测试用例上观察到的执行结果,以及2)性能的先验分布。我们从多样化的程序和任务语料库中编译了经验性能先验,并发现符号程序(例如Python)都是全或无的,而提示程序具有弥漫的先验,有许多几乎正确的程序。这种差异解释了为什么少数通过测试可以认证符号程序但不能认证提示程序。基于这一见解,我们开发了RAP(检索近似先验),通过从现有语料库中检索相似任务和提示程序来构建代理先验,然后用于预测性能。我们展示了RAP实现了稳健的性能。

英文摘要

LLM prompting is widely used for naturally stated tasks, yet it is unreliable: it may succeed on a few test cases but fail at deployment time. We study performance prediction: given a program, either symbolic (e.g. Python) or a prompt executed on an LLM, and a few in-domain examples, predict its performance on unseen tasks from the same domain. We use a simple coin-flip model, treating each pass/fail program execution as a Bernoulli random variable whose success probability is the program's unknown performance. In this model, the predicted performance depends entirely on: 1) the observed execution outcomes on test cases, and 2) a prior over performances. We compile empirical performance priors from a corpus of diverse programs and tasks, and find that performance for symbolic programs (e.g., Python) is all-or-nothing, while prompt programs have a diffuse prior with many nearly-correct programs. This difference explains why a few passing tests can certify symbolic programs but not prompt programs. Building on this insight, we develop RAP (Retrieved Approximate Prior), which retrieves similar tasks and prompt programs from an existing corpus to construct a proxy prior, which is then used to predict performance. We show that RAP achieves solid performance.
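The coin-flip model can be worked through numerically to see why the prior matters. The sketch below assumes a discrete prior over candidate success probabilities and computes the posterior mean after some passes and fails; the two priors are hypothetical stand-ins for the all-or-nothing and diffuse shapes the abstract describes, not the empirical priors compiled in the paper, and RAP's retrieval step is not modelled.

```python
def posterior_mean(prior, passes, fails):
    """Posterior mean of a Bernoulli success probability.

    prior maps each candidate success probability p to its prior weight.
    Each weight is multiplied by the likelihood p**passes * (1-p)**fails.
    """
    weights = {
        p: w * (p ** passes) * ((1 - p) ** fails) for p, w in prior.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("observations are impossible under this prior")
    return sum(p * w for p, w in weights.items()) / total


# Hypothetical priors: programs are either always right or always wrong,
# versus performance spread evenly over 0.0, 0.1, ..., 1.0.
all_or_nothing = {0.0: 0.5, 1.0: 0.5}
diffuse = {k / 10: 1 / 11 for k in range(11)}

print(posterior_mean(all_or_nothing, passes=3, fails=0))
print(posterior_mean(diffuse, passes=3, fails=0))
```

After three passing tests, the all-or-nothing prior yields a posterior mean of exactly 1.0, while the diffuse prior yields roughly 0.84, so the same evidence certifies one kind of program but not the other.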

2605.21507 2026-05-22 physics.ao-ph cs.AI cs.CE cs.LG 版本更新

Visibility nowcasting in South Korea: a machine learning approach to class imbalance and distribution shift

韩国可见度现在预测:一种处理数据不平衡和分布偏移的机器学习方法

Bong Gyun Shin, Chan Sik Lee, Hyesun Suh

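Affiliations Department of AI Big Data, Daejin University; Department of Statistics and Actuarial Science, Soongsil University; College of Artificial Intelligence Convergence

AI summary This paper presents a machine learning approach to nowcasting atmospheric visibility in six major South Korean cities. It handles class imbalance with SMOTENC and CTGAN, evaluates an ensemble of machine learning and deep learning models, and finds that distribution shift between the training and test periods degrades predictive performance, underscoring the need to account for external environmental factors when deploying nowcasting models on time-series data.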

Comments Published in Theoretical and Applied Climatology

Journal ref Theoretical and Applied Climatology, vol. 157, art. no. 283, 2026

详情
AI中文摘要

大气可见度是交通安全和空气质量管理的关键变量,然而,由于气象条件和空气污染物之间的复杂相互作用以及低可见度事件的稀有性,准确预测仍然具有挑战性。本研究引入了一种机器学习框架,用于预测韩国六个主要城市的可见度。为了处理2018-2020年训练数据中的不平衡问题,我们应用了合成少数类过采样技术(SMOTENC)和条件表格生成对抗网络(CTGAN)。然后,使用结合机器学习和深度学习模型的集成方法,并在2021年测试数据集上进行评估。结果表明,测试集的预测性能相比交叉验证阶段明显下降。这种退化归因于训练和测试期间的分布偏移,通过测量SHAP分析确定的最显著特征的Wasserstein距离得到了定量确认。总体而言,本研究提出了一种旨在同时解决数据不平衡和时间分布偏移双重挑战的方法,并强调在时间序列数据上实施现在预测模型时考虑不断变化的外部环境因素的必要性。

英文摘要

Atmospheric visibility is a critical variable for transportation safety and air quality management; however, accurate prediction remains challenging due to the complex interactions between meteorological conditions and air pollutants, as well as the rarity of low-visibility events. This study introduces a machine learning framework to nowcast visibility in six major South Korean cities. To handle the imbalance in the 2018-2020 training data, we applied the Synthetic Minority Over-sampling Technique for Nominal and Continuous features (SMOTENC) and the Conditional Tabular Generative Adversarial Network (CTGAN). An ensemble approach combining machine learning and deep learning models was then used and evaluated on a 2021 test dataset. The results revealed a marked decline in predictive performance on the test set compared to the cross-validation phase. This degradation was attributed to a distributional shift between the training and testing periods, which was quantitatively confirmed by measuring the Wasserstein distance of the most influential feature identified by SHAP analysis. Overall, this study presents a methodology that aims to simultaneously address the dual challenges of data imbalance and temporal distributional shift, and emphasizes the necessity of accounting for evolving external environmental factors when implementing nowcasting models on time-series data.
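The shift check described above, comparing one feature's distribution between the training and test periods, can be sketched with a one-dimensional Wasserstein distance. This is a simplified illustration that assumes both samples have the same size and uniform weights, in which case the distance reduces to the mean absolute difference of the sorted values; the function name and toy numbers are hypothetical, and the paper's feature and data are not reproduced.

```python
def wasserstein_1d(train_values, test_values):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    With equal sizes and uniform weights, the optimal transport plan
    pairs the i-th smallest value of one sample with the i-th smallest
    of the other, so the distance is the mean absolute sorted difference.
    """
    if not train_values or len(train_values) != len(test_values):
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(train_values), sorted(test_values))
    return sum(abs(a - b) for a, b in pairs) / len(train_values)


# Hypothetical toy feature values for a training and a test period.
print(wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))
```

A value near zero suggests the feature is distributed similarly in both periods, while a larger value flags the kind of temporal shift the study reports. Unequal sample sizes need a general implementation, which this sketch does not provide.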

2605.21502 2026-05-22 q-bio.MN cs.AI cs.LG 版本更新

Graph neural network explanations reveal a topological signature of disease-associated hubs in biological networks

图神经网络解释揭示了生物网络中与疾病相关的枢纽的拓扑特征

Kyle Higgins, Ivan Laponogov, Dennis Veselkov, Kirill Veselkov

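Affiliations Division of Cancer, Department of Surgery and Cancer, Faculty of Medicine, Imperial College London; Department of Computing, Imperial College London; Department of Environmental Health Sciences, Yale University

AI summary This paper evaluates how graph neural network explanation methods identify disease-relevant structure in biological networks. It finds that different explainers behave differently on sparse single-node drivers and on distributed pathway-like signals, and proposes a framework combining a shell-based hub score with consensus ranking across explainers that improves prioritization of cancer genes and recovery of biologically relevant molecules.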

Comments 25 pages (excluding supplement), 7 figures, 7 supplementary tables

详情
AI中文摘要

图神经网络(GNNs)越来越多地用于建模生物系统,但后验解释方法恢复有意义的分子机制的可靠性仍不清楚。本文系统评估了四种广泛使用的解释方法:显著性归因(SA)、集成梯度(IG)、GNNExplainer 和层间相关传播(LRP),以识别乳腺癌RNA-seq数据在蛋白质-蛋白质相互作用网络上的疾病相关结构。通过合成基准测试,我们发现解释方法恢复了不同的信号组织:SA在稀疏单节点驱动方面表现最佳,而IG和LRP更倾向于恢复分布式的路径样和级联样信号。在TCGA BRCA数据中,我们识别出一种一致的拓扑特征,即疾病相关枢纽的归因在最近的1跳邻居中达到峰值,并在后续网络壳层中衰减,这种模式在IG和LRP中最为显著,并与已知癌症枢纽的强富集相关。我们进一步观察到局部枢纽富集与全局基因排名性能之间的权衡,IG优化局部富集,而SA在全局区分方面表现更优。受这些互补行为的启发,我们提出了一种结合基于壳层的枢纽评分和解释器共识排名的框架。共识评分提高了对经典癌症基因(TP53、BRCA1、ESR1、MYC)的优先级排序,减少了对节点度数的依赖,并且在调优时优于单独的方法。通路富集进一步揭示了对生物上一致的癌症程序的改进恢复,包括ERBB2、RTK、MAPK、免疫和细胞因子信号。这些结果表明,拓扑感知的图解释整合可以提高生物可解释性和生物相关分子的恢复能力。

英文摘要

Graph neural networks (GNNs) are increasingly used to model biological systems, yet the reliability of post-hoc explanation methods for recovering meaningful molecular mechanisms remains unclear. Here, we systematically evaluate four widely used approaches: Saliency Attribution (SA), Integrated Gradients (IG), GNNExplainer, and Layer-wise Relevance Propagation (LRP) for identifying disease-relevant structure in breast cancer RNA-seq data projected onto a protein-protein interaction network. Using synthetic benchmarks with known ground-truth motifs, we show that explanation methods recover distinct signal organizations: SA performs best for sparse single-node drivers, whereas IG and LRP preferentially recover distributed pathway-like and cascade-like signals. In TCGA BRCA data, we identify a consistent topological signature of disease-associated hubs in which attribution peaks in the immediate 1-hop neighborhood and decays across successive network shells, a pattern most pronounced for IG and LRP and associated with strong enrichment of known cancer hubs. We further observe a trade-off between local hub enrichment and global gene ranking performance, with IG optimizing local enrichment and SA achieving superior global discrimination. Motivated by these complementary behaviors, we introduce a framework combining a shell-based hub score with consensus ranking across explainers. Consensus scores improve prioritization of canonical cancer genes (TP53, BRCA1, ESR1, MYC), reduce dependence on node degree, and, especially when tuned, outperform individual methods. Pathway enrichment further reveals improved recovery of biologically coherent cancer programs, including ERBB2, RTK, MAPK, immune, and cytokine signaling. Together, these results demonstrate that topology-aware integration of graph explanations can improve biological interpretability and biologically relevant molecular recovery.
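The topological signature described above, attribution peaking in the 1-hop neighborhood of a hub and decaying across successive shells, can be measured with a breadth-first search. The sketch below only computes the mean attribution per hop distance; the exact shell-based hub score and consensus ranking are not specified in the abstract, so the function name, graph, and attribution values are hypothetical.

```python
from collections import deque


def shell_profile(adjacency, hub, attribution, max_hops=3):
    """Mean attribution per network shell (hop distance) around a hub.

    adjacency maps each node to an iterable of neighbours; attribution
    maps each node to its explainer score (missing nodes count as 0.0).
    """
    dist = {hub: 0}
    queue = deque([hub])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the outermost shell
        for neighbour in adjacency.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    shells = {}
    for node, d in dist.items():
        shells.setdefault(d, []).append(attribution.get(node, 0.0))
    return {d: sum(v) / len(v) for d, v in sorted(shells.items())}


# Hypothetical undirected toy network and attribution scores.
graph = {"H": ["A", "B"], "A": ["H", "C"], "B": ["H"], "C": ["A"]}
scores = {"H": 0.2, "A": 0.9, "B": 0.7, "C": 0.1}
print(shell_profile(graph, "H", scores))
```

In this toy case the profile peaks at shell 1 and falls at shell 2, mirroring the decay pattern the abstract reports for IG and LRP.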

2605.21497 2026-05-22 cs.CR cs.AI 版本更新

Autonomous LLM Agents & CTFs: A Second Look

自主大语言模型代理与CTF:再看一次

Youness Bouchari, Matteo Boffa, Marco Mellia, Idilio Drago, Thanh Minh Bui, Dario Rossi

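Affiliations Politecnico di Torino; Università di Torino; Huawei Technologies France

AI summary This paper takes a second look at LLM agents for automating offensive security tasks. Testing agents with different architectures on 30 web-based CTF challenges, it finds that a general-purpose agent performs comparably to engineered architectures and identifies challenge categories in which current agents persistently struggle.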

Comments Accepted at DeMeSSAI Workshop @ IEEE EuroS&P 2026

详情
AI中文摘要

大型语言模型(LLM)代理越来越多地被提出以自动化进攻性安全任务,最近的研究报告称在捕获-the-Flag(CTF)挑战中接近人类水平的成功率。我们在此重新审视这些结果,提供对这些声明的第二次审视。我们针对30个基于网络的CTF挑战(涵盖14种漏洞类别)设计了不同复杂度和模块化的代理架构。我们使用多种LLM主干来实例化这些代理,并将其与claude-code通用代理进行比较,该代理能够自动确定其内部架构。我们的评估得出三个主要发现。首先,claude-code在性能上与定制架构相当(19/30个任务解决),表明通用代理是进攻性安全任务的强大基线。其次,我们的架构和claude-code在相同的挑战类别中挣扎,揭示了持续存在的障碍,使当前代理仍低于人类水平的能力。第三,通过利用我们手动设计的架构,我们能够系统地衡量额外组件的影响,发现专门化角色的结构化协调优于单体设计,提高了运行一致性,并减少了执行成本。

英文摘要

Large Language Model (LLM) agents are increasingly proposed to automate offensive security tasks, with recent studies reporting near human-level success rates in Capture-the-Flag (CTF) challenges. Here we revisit these results, providing a second look at these claims. We engineer different agent architectures of increasing complexity and modularity on 30 web-based CTF challenges spanning 14 vulnerability classes. We instantiate these agents with multiple LLM backbones, and compare them with claude-code, a general-purpose agent that automatically determines its internal architecture. Our evaluation yields three main findings. First, claude-code achieves performance comparable to the engineered architectures (19/30 solved tasks), suggesting that general-purpose agents are strong baselines for offensive security tasks. Second, both our architectures and claude-code struggle in the same challenge categories, revealing persistent barriers that keep current agents below human-level capability. Third, by leveraging our manually designed architectures we can systematically measure the impact of additional components, finding that structured orchestration of specialized roles outperforms monolithic designs, improving run-to-run consistency and reducing execution costs.

2605.21496 2026-05-22 cs.LG cs.AI cs.CL 版本更新

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

HealthCraft: 一种用于急救医学的强化学习安全环境

Brandon Dent

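Affiliations GOATnote Inc.

AI summary This paper presents HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions. Using a FHIR R4 world state, 24 MCP tools, and a dual-layer rubric, it evaluates model safety and performance on emergency-medicine tasks and exposes safety failures in multi-step workflows.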

Comments 16 pages, 5 figures, 6 tables. Code, task suite, and Docker bundle: https://github.com/GOATnote-Inc/healthcraft

详情
AI中文摘要

前沿语言模型被部署到临床工作流程的速度超过了评估它们安全性的基础设施。静态医学问答基准测试忽略了急救医学中至关重要的失败模式:轨迹级安全崩溃、工具误用和在持续临床压力下的屈从。我们提出了HealthCraft,首个公开的强化学习环境,该环境在真实急救医学条件下奖励轨迹级安全,源自Corecraft。它基于FHIR R4世界状态,包含14个实体类型和3,987个种子实体,暴露24个MCP工具,并定义了双层评估标准,只要任何安全关键性标准被违反,就会将奖励设为零。我们发布了195个任务,涵盖六个类别,根据2,255个二元标准(其中515个为安全关键性标准)进行评分;一个事后10任务负类列表将此扩展到205个任务和2,337个标准。在两个前沿模型上的V8结果表明,Claude Opus 4.6在Pass@1达到24.8% [21.5-28.4],GPT-5.4为12.6% [10.2-15.6],安全失败率为27.5%和34.0%。在多步骤工作流——最接近真实急救护理的代理——中,性能降至接近零(Claude 1.0%,GPT-5.4 0.0%),尽管在单个步骤上部分具备能力。在试点v2和v8之间修复了六个基础设施错误,重新排列了哪些模型“看起来更强”,这表明基础设施的保真度是测量的一部分。一个确定性的LLM-判断器叠加限制了评估者的噪声,并且一个60次负类烟雾试点显示奖励信号不是可直接用于训练的安全:限制标准通过率为0.929的患病率,这在评估工具可以容忍但训练奖励不能。我们搭建了与Corecraft第5.2节中的Megatron+SGLang+GRPO循环的耦合,并将训练奖励的消融作为未来的工作。环境、任务、评估标准和工具均在Apache 2.0下发布。

英文摘要

Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5-28.4] and GPT-5.4 at 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%. On multi-step workflows - the closest proxy to real emergency care - performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM-judge overlay bounds evaluator noise, and a 60-run negative-class smoke pilot shows the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training-reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
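The dual-layer rubric can be sketched as a gated score. The gate, which zeroes the reward when any safety-critical criterion is violated, comes from the abstract; the base score used here (the fraction of binary criteria passed) is an assumption made for illustration, and the function name and toy criteria are hypothetical rather than HealthCraft's actual grading code.

```python
def rubric_reward(criteria):
    """Gated rubric reward over (passed, safety_critical) pairs.

    Layer 1: any failed safety-critical criterion zeroes the reward.
    Layer 2 (assumed): otherwise the reward is the fraction of all
    binary criteria that passed.
    """
    if not criteria:
        raise ValueError("at least one criterion is required")
    if any(critical and not passed for passed, critical in criteria):
        return 0.0
    return sum(1 for passed, _ in criteria if passed) / len(criteria)


# Hypothetical trajectories: each tuple is (passed, safety_critical).
safe_but_incomplete = [(True, True), (True, False), (False, False)]
unsafe_but_thorough = [(False, True), (True, False), (True, False)]

print(rubric_reward(safe_but_incomplete))
print(rubric_reward(unsafe_but_thorough))
```

Under this sketch a trajectory that passes two of three criteria safely outscores one that passes the same number but violates a safety-critical criterion, which receives zero.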

2605.21493 2026-05-22 cs.LG cs.AI cs.CV 版本更新

Don't Collapse Your Features: Why CenterLoss Hurts OOD Detection and Multi-Scale Mahalanobis Wins

不要压缩你的特征:为什么CenterLoss伤害OOD检测和多尺度Mahalanobis获胜

Rahul D Ray

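Affiliations Department of Electronics and Electrical Engineering

AI summary This paper proposes GOEN, which combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head to improve OOD detection. It finds that CenterLoss degrades OOD detection and that GOEN-NoCenterLoss outperforms the baselines on CIFAR-10 benchmarks.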

详情
AI中文摘要

检测分布外(OOD)输入的能力是安全部署机器学习系统的基础。然而,当前方法往往依赖于仅优化分类准确性的特征表示,忽略了epistemic不确定性的要求。我们引入GOEN(几何优化的epistemic网络),一种结合多尺度特征、L2归一化、Mahalanobis距离和使用真实硬OOD示例训练的校准头的简单流程。通过系统消融,我们发现一个反直觉的发现:CenterLoss,一种用于特征紧凑性的流行正则化器,显著降低了OOD检测性能,尽管提高了分类准确性。最佳变体GOEN-NoCenterLoss在CIFAR-10基准上实现了0.9483的平均OOD AUROC,超过了包括深度集成(0.8827)、KNN(0.8967)和ODIN(0.8870)在内的所有基线方法,同时保持了有竞争力的分布内准确性。我们的结果挑战了普遍认为更好的分类几何自动导致更好的epistemic不确定性假设。相反,我们展示了过于紧致的特征簇会压缩类间边缘并扭曲所需的有效OOD检测的协方差结构。GOEN是高效的,在单个GPU上训练不到20分钟,并提供了一种构建可靠识别自身局限的AI系统的实用蓝图。

英文摘要

The ability to detect out-of-distribution (OOD) inputs is fundamental to safe deployment of machine learning systems. Yet, current methods often rely on feature representations that are optimised solely for classification accuracy, neglecting the distinct requirements of epistemic uncertainty. We introduce GOEN (Geometry-Optimised Epistemic Network), a simple pipeline that combines multi-scale features, L2 normalisation, Mahalanobis distance, and a calibration head trained with real hard OOD examples. Through systematic ablation we uncover a counter-intuitive finding: CenterLoss, a popular regulariser for feature compactness, significantly degrades OOD detection performance, reducing average OOD AUROC from 0.9483 to 0.9366 despite improving classification accuracy. The best variant, GOEN-NoCenterLoss, achieves an average OOD AUROC of 0.9483, surpassing all baselines including deep ensembles (0.8827), KNN (0.8967), and ODIN (0.8870) on CIFAR-10 benchmarks, while maintaining competitive in-distribution accuracy. Our results challenge the prevailing assumption that better classification geometry automatically leads to better epistemic uncertainty. Instead, we show that overly tight feature clusters compress inter-class margins and distort the covariance structure needed for effective OOD detection. GOEN is efficient, training in under 20 minutes on a single GPU, and provides a practical blueprint for building AI systems that reliably recognise their own limitations.
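The Mahalanobis step of such a pipeline can be sketched on two-dimensional features. This is a minimal illustration under several assumptions: it fits one Gaussian per class with its own covariance, adds a small ridge term for numerical stability, and omits the multi-scale features, L2 normalisation, and calibration head that GOEN also uses. All names and toy points are hypothetical.

```python
def fit_gaussian(points):
    """Mean and covariance entries (sxx, sxy, syy) of 2-D feature vectors."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_sq(x, mean, cov, ridge=1e-6):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    sxx, syy = sxx + ridge, syy + ridge  # keep the covariance invertible
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def ood_score(x, class_stats):
    """Distance to the closest class Gaussian; larger means more OOD."""
    return min(mahalanobis_sq(x, mean, cov) for mean, cov in class_stats)


# Hypothetical in-distribution features for two classes.
class_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class_b = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)]
stats = [fit_gaussian(class_a), fit_gaussian(class_b)]

print(ood_score((0.5, 0.5), stats), ood_score((10.0, -3.0), stats))
```

Because the distance is scaled by each class covariance, squeezing clusters too tightly changes the scale against which outliers are judged, which is one way to read the abstract's point about CenterLoss distorting covariance structure.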

2605.21492 2026-05-22 cs.LG cs.AI cs.LO stat.ML 版本更新

The Attribution Impossibility: No Feature Ranking Is Faithful, Stable, and Complete Under Collinearity

特征归因不可能性:在共线性下,没有任何特征排名是忠实、稳定和完整的

Drake Caraker, Bryan Arnold, David Rhoads

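Affiliations Independent Researchers

AI summary This paper studies the impossibility of feature ranking under collinearity, proving that no ranking can be simultaneously faithful, stable, and complete. It proposes DASH as a resolution and supports the theory and its practical implications with formal verification.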

Comments 66 pages, 12 figures, 305 Lean 4 theorems. Code at https://github.com/DrakeCaraker/dash-impossibility-lean

详情
AI中文摘要

在共线性情况下,没有任何特征排名可以同时忠实、稳定和完整。对于共线性对,排名本质上等同于抛硬币。我们证明了这一不可能性,针对四种模型类别进行了量化分析,通过集成平均(DASH)方法解决该问题,并利用305个Lean 4定理进行机验证。我们刻画了完整的归因设计空间:恰好存在两种方法家族——忠实-完整方法(不稳定,排名可能翻转多达50%的时间)和集成方法如DASH(稳定,对称特征报告平局)。归因比在梯度提升中发散为1/(1-rho^2),在Lasso中为无穷大,在随机森林中收敛。DASH(Diversified Aggregation of SHAP)在无偏聚合中被证明是帕累托最优的,达到Cramer-Rao方差下界并具有紧的集成大小公式。在77个公共数据集中,68%表现出归因不稳定性。在特征具有相等因果效应时,切换到条件SHAP无法逃脱这一不可能性。该框架包括实用的诊断工具——Z检验工作流程和单模型筛查工具——并直接影响公平性审计:基于SHAP的代理歧视审计在共线性下被证明不可靠。设计空间定理、诊断和不可能性均在Lean 4中形式化验证(305个定理从16个公理,0 sorry)——据我们所知,这是可解释AI领域首个形式化验证的不可能性。

英文摘要

No feature ranking can be simultaneously faithful, stable, and complete when features are collinear. For collinear pairs, ranking reduces to a coin flip. We prove this impossibility, quantify it for four model classes, resolve it via ensemble averaging (DASH), and machine-verify it with 305 Lean 4 theorems. We characterize the complete attribution design space: exactly two families of methods exist -- faithful-complete methods (unstable, with rankings that flip up to 50% of the time) and ensemble methods like DASH (stable, reporting ties for symmetric features) -- and no method lies outside this dichotomy. The impossibility is quantitative: the attribution ratio diverges as 1/(1-rho^2) for gradient boosting, is infinite for Lasso, and converges for random forests. DASH (Diversified Aggregation of SHAP) is provably Pareto-optimal among unbiased aggregations, achieving the Cramer-Rao variance bound with a tight ensemble size formula. In a survey of 77 public datasets, 68% exhibit attribution instability. Switching to conditional SHAP does not escape the impossibility when features have equal causal effects. The framework includes practical diagnostics -- a Z-test workflow and single-model screening tool -- and has direct consequences for fairness auditing: SHAP-based proxy discrimination audits are provably unreliable under collinearity. The design space theorem, diagnostics, and impossibility are mechanically verified in Lean 4 (305 theorems from 16 axioms, 0 sorry) -- to our knowledge, the first formally verified impossibility in explainable AI.
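The stated rate for gradient boosting, an attribution ratio that diverges as 1/(1-rho^2), is easy to tabulate. The sketch assumes rho is the correlation between the two collinear features, which the abstract implies but does not define, and the function name is hypothetical. It evaluates the formula only and does not reproduce the paper's proofs or the DASH method.

```python
def attribution_ratio(rho):
    """Evaluate 1 / (1 - rho**2) for a correlation rho strictly inside (-1, 1)."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 / (1.0 - rho ** 2)


# The ratio stays near 1 for weak correlation and blows up near |rho| = 1.
for rho in (0.0, 0.5, 0.9, 0.99):
    print(rho, attribution_ratio(rho))
```

At rho = 0.5 the ratio is 4/3, and at rho = 0.99 it already exceeds 50, which shows how quickly attributions between near-duplicate features become lopsided.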

2605.21491 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

通过比较想法评估教授语言模型预测研究成功的技巧

Srujan P Mule, Aniketh Garikaparthi, Manasi Patwardhan

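Affiliations IISER Pune

AI summary This study asks whether language models can forecast the empirical success of research ideas before any experiments are run. Using a dataset of 11,488 idea pairs grounded in objective PapersWithCode outcomes, it finds that reinforcement learning raises accuracy to 71.35%, showing that small language models can serve as effective, objective verifiers and offering a scalable path for autonomous scientific discovery.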

Comments ACL 2026 Findings

详情
AI中文摘要

随着语言模型通过自动化假设生成和实现加速科学研究,出现了一个新的瓶颈:在没有彻底实验的情况下评估和过滤数百个AI生成的想法。我们问语言模型是否能学会在任何实验运行之前预测研究想法的实证成功。我们研究了比较实证预测:给定一个基准特定的研究目标和两个候选想法,预测哪个将实现更好的基准性能。我们构建了一个基于PapersWithCode客观结果的11,488对想法数据集。尽管现成的8B参数模型表现不佳(30%准确率),SFT显著提升了性能至77.1%,优于GPT-5(61.1%)。通过将评估框架为推理任务,通过可验证奖励的强化学习(RLVR),我们训练模型发现潜在的推理路径,实现71.35%的准确率,并具有可解释的依据。通过额外的消融和分布外测试,我们展示了对表面启发式的鲁棒性,并转移到了跨领域时间拆分测试集和独立构建的测试集。我们的结果表明,计算高效的轻量级语言模型可以作为有效的、客观的验证器,为自主科学发现提供可扩展的路径。

英文摘要

As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.
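The comparative forecasting task lends itself to a simple verifiable reward, sketched below. The `Answer: A|B` output format and the function name `verifiable_reward` are assumptions made for illustration; the abstract does not describe the actual prompt or reward parsing.

```python
import re


def verifiable_reward(completion: str, better_idea: str) -> float:
    """Return 1.0 if the last 'Answer: A|B' in the completion names the better idea.

    `better_idea` is the ground-truth label ("A" or "B") derived from
    recorded benchmark results. The answer format is hypothetical.
    """
    matches = re.findall(r"Answer:\s*([AB])\b", completion)
    if not matches:
        return 0.0  # unparseable output earns no reward
    return 1.0 if matches[-1] == better_idea else 0.0


print(verifiable_reward("Idea B reuses a stronger backbone. Answer: B", "B"))
```

Because the reward depends only on the final choice, the reasoning text before it stays unconstrained, which is what lets RL shape the justification.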

2605.21282 2026-05-22 cs.LG cs.AI 版本更新

Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent

随机均值流策略:带有熵镜降的一步生成控制

Zeyuan Wang, Da Li, Yulin Chen, Yuehu Gong, Yanming Guo, Ye Shi, Liang Bai, Tianyuan Yu, Yanwei Fu

Affiliations * Laboratory for Big Data and Decision; National University of Defense Technology; Samsung AI Center Cambridge; Queen Mary University of London; Fudan University; ShanghaiTech University; Shanghai Innovation Institute

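AI summary This paper proposes Stochastic MeanFlow Policies (SMFP), a trainable generative policy class that maps Gaussian noise to actions through a MeanFlow transformation, enabling exploratory yet stable improvement within an off-policy mirror descent framework.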

详情
AI中文摘要

在线离线策略强化学习(RL)受到两个耦合选择的影响:策略类和更新规则。高斯策略速度快且具有可计算的熵,但难以处理多模态动作分布。生成策略更具表现力,但通常需要迭代采样或缺乏可计算的熵估计。在优化方面,SAC风格的软策略改进和镜降(MD)可以视为最小化不同的KL散度:前者将策略推向价值诱导的玻尔兹曼分布,后者则通过之前的策略正则化每个更新。将熵正则化与MD约束结合因此具有吸引力,因为它支持探索并稳定策略改进;然而,所得到的目标可能是多模态的,且与单峰高斯策略不匹配。我们提出随机均值流策略(SMFP),一种一步生成策略类,通过均值流变换将高斯噪声映射到动作。这种随机重参数化产生了一个可计算的熵替代物,并允许均值流策略在离线策略镜降框架下通过统一的目标进行训练,以实现探索性且稳定的改进。在七个MuJoCo基准测试中,SMFP在高斯和生成基线之上取得了改进,同时保留了单步推断效率。

英文摘要

Online off-policy reinforcement learning (RL) is shaped by two coupled choices: the policy class and the update rule. Gaussian policies are fast and have tractable entropy, but struggle with multimodal action distributions. Generative policies are more expressive, but often require iterative sampling or lack tractable entropy estimates. On the optimisation side, SAC-style soft policy improvement and mirror descent (MD) can be viewed as minimising different KL divergences: the former moves the policy towards a value-induced Boltzmann distribution, while the latter regularises each update against the previous policy. Combining entropy regularisation with an MD constraint is therefore attractive, as it supports exploration while stabilising policy improvement; however, the resulting target can be multimodal and is poorly matched by unimodal Gaussian policies. We propose Stochastic MeanFlow Policies (SMFP), a one-step generative policy class that maps Gaussian noise to actions through a MeanFlow transformation. This stochastic reparameterisation yields a tractable entropy surrogate and allows MeanFlow policies to be trained within off-policy mirror descent under a unified objective for exploratory yet stable improvement. Across seven MuJoCo benchmarks, SMFP improves over Gaussian and generative baselines while retaining single-step inference efficiency.
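A worked form of the combined update may help show why the target can be multimodal. The following is the standard closed-form solution for a policy update with both an entropy bonus and a KL penalty against the previous policy; the symbols $\alpha$ (entropy weight), $\lambda$ (mirror-descent weight), and $Q$ are assumptions for illustration and need not match the paper's notation or exact objective.

```latex
\begin{aligned}
\pi_{k+1}(\cdot\mid s)
  &= \arg\max_{\pi}\;
     \mathbb{E}_{a\sim\pi(\cdot\mid s)}\big[Q(s,a)\big]
     + \alpha\,\mathcal{H}\big(\pi(\cdot\mid s)\big)
     - \lambda\,\mathrm{KL}\big(\pi(\cdot\mid s)\,\|\,\pi_k(\cdot\mid s)\big),\\[4pt]
\pi_{k+1}(a\mid s)
  &\propto \pi_k(a\mid s)^{\frac{\lambda}{\alpha+\lambda}}
     \exp\!\Big(\frac{Q(s,a)}{\alpha+\lambda}\Big).
\end{aligned}
```

Setting $\lambda = 0$ recovers the SAC-style Boltzmann target $\exp(Q/\alpha)$, while $\alpha = 0$ recovers the mirror-descent update $\pi_k \exp(Q/\lambda)$. Whenever $Q(s,\cdot)$ has several modes, so does the target, which a unimodal Gaussian cannot match.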

2605.21187 2026-05-22 cs.NI cs.AI cs.DC 版本更新

High-speed Networking for Giga-Scale AI Factories

面向十亿级AI工厂的高速网络

Sajy Khashab, Albert Gran Alcoz, Alon Gal, Jacky Romano, Rani Abboud, Yonatan Piasetzky, Lior Maman, Amit Nishry, Barak Gafni, Omer Shabtai, Matty Kadosh, Dror Goldenberg, Gilad Shainer, Mark Silberstein

Affiliations * NVIDIA; Technion - Israel Institute of Technology

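AI summary This paper presents a high-speed network architecture for large-scale AI training that replaces hierarchical depth with topological parallelism and uses hardware-accelerated load balancing to deliver stable performance under network conditions that change at microsecond timescales, and it reports production-grade AI infrastructure performance across three core dimensions.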

详情
AI中文摘要

随着分布式模型训练扩展到数以万计的GPU,扩展型网络面临前所未有的性能和效率需求。NVIDIA Spectrum-X Ethernet从零开始设计,以实现可预测且稳定的网络性能,具有高利用率和低延迟。本文提出了Spectrum-X多平面架构,该架构用拓扑并行性替代层次深度,并在NIC和交换机中引入硬件加速的负载均衡作为关键架构方法,以提供快速响应高度动态网络条件的能力。我们描述了动机、设计原则、评估方法和在最先进基准上的性能,以及在大规模系统中部署和调试Spectrum-X网络所学到的经验。我们的评估突显了生产级AI基础设施在三个核心维度上的性能:98%的理论线路速率,低抖动延迟;强跨租户隔离;容量比例的双倍带宽和10%链路故障时7%的延迟增加;以及在LLM训练工作负载中快速响应主机和链路波动。

英文摘要

As distributed model training scales to span hundreds of thousands of GPUs, scale-out networks face unprecedented performance and efficiency demands. NVIDIA Spectrum-X Ethernet has been designed from the ground up to achieve predictable and stable network performance with high utilization and low latency. This paper presents the Spectrum-X multiplane architecture, which replaces hierarchical depth with topological parallelism, and introduces hardware-accelerated load balancing in NICs and switches as the key architectural approach to provide fast reaction to highly dynamic network conditions at the microsecond timescales that AI training workloads demand. We describe the motivation, design principles, evaluation methodology and performance on state-of-the-art benchmarks, as well as the lessons we learned from deploying and debugging Spectrum-X networks in large-scale systems. Our evaluation highlights production-grade AI infrastructure performance across three core dimensions: 98% of the theoretical line rate with low jitter-free latency; strong cross-tenant isolation for concurrent workloads; robust, capacity-proportional bisection bandwidth and 7% latency increase for 10% fabric link failures; and rapid reaction to host and fabric link flaps during LLM training workloads.

2605.20246 2026-05-22 cs.LG cs.AI 版本更新

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

GROW: 将GRPO与状态-动作建模对齐以适用于开放世界VLM智能体

Xiongbin Wu, Zhihao Luo, Shanzhe Lei, Lechao Zhang, Xuhong Wang, Jie Yang, Zhonglong Zheng, Yuanjie Zheng, Xin Tan, Wei Liu

Affiliations * Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; East China Normal University; Zhejiang Normal University; Shandong Normal University

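AI summary This paper proposes GROW, a framework that decomposes collected trajectories into state-action samples and computes advantages across those samples, addressing the excessively long context and noise that standard GRPO incurs in multi-turn RL because it requires full trajectories. Experiments on more than 800 Minecraft tasks show SOTA performance.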

详情
AI中文摘要

最近,视觉-语言模型(VLM)智能体在开放世界任务中展现出有前景的进步,其中成功的任务完成通常需要多次视觉感知和动作执行的回合。然而,现有方法仍主要依赖于监督微调(SFT)专家演示,而先进的强化学习(RL)算法,特别是分组相对策略优化(GRPO),尚未在这些任务中有效应用于多轮RL,因为标准GRPO需要完整的轨迹作为训练样本,导致上下文过长和噪声。为了解决这个问题,我们提出GROW,一种适用于开放世界VLM智能体的RL框架,将收集的轨迹分解为状态-动作样本,并在这些样本之间计算优势,而不是将完整轨迹视为单一实体。我们进一步提供了一个替代分析,表明尽管分组样本是基于不同的局部状态而不是相同的提示上下文,简化假设下目标可以保留GRPO的核心相对策略优化信号。在超过800个Minecraft任务上的实验表明,我们的方法实现了最先进的性能,证明了我们提出的RL框架在开放世界VLM智能体中的有效性。

英文摘要

Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations, while advanced reinforcement learning (RL) algorithms, specifically Group Relative Policy Optimization (GRPO), have not been effectively employed for multi-turn RL in these tasks, because standard GRPO requires full trajectories as training samples, which leads to excessively long context and noise. To address this issue, we propose GROW, an RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions. Experiments on more than 800 Minecraft tasks show that our method achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of our proposed RL framework for open-world VLM agents.
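The decomposition step can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes every state-action sample inherits its trajectory's scalar return and that advantages are the group-normalised returns, neither of which the abstract specifies. The dictionary keys and the function name are hypothetical.

```python
from statistics import mean, pstdev


def state_action_advantages(trajectories, eps=1e-8):
    """Split trajectories into (state, action) samples and normalise
    their rewards across the pooled group, GRPO-style."""
    samples = []
    for traj in trajectories:
        for state, action in traj["steps"]:
            # Assumption: each step inherits its trajectory's scalar return.
            samples.append({"state": state, "action": action,
                            "reward": traj["return"]})
    rewards = [s["reward"] for s in samples]
    mu, sigma = mean(rewards), pstdev(rewards)
    for s in samples:
        s["advantage"] = (s["reward"] - mu) / (sigma + eps)
    return samples


group = [
    {"steps": [("s0", "chop"), ("s1", "craft")], "return": 1.0},
    {"steps": [("s0", "walk"), ("s1", "idle")], "return": 0.0},
]
for sample in state_action_advantages(group):
    print(sample["state"], sample["action"], round(sample["advantage"], 3))
```

Each training sample then carries only its local state rather than the whole trajectory, which is what keeps the context short.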

2605.19192 2026-05-22 cs.AI cs.CR 版本更新

Hallucination as Exploit: Evidence-Carrying Multimodal Agents

幻觉作为利用:证据承载多模态智能体

Guijia Zhang, Hao Zheng, Harry Yang

Affiliations * Shenzhen University; HKUST (Hong Kong University of Science and Technology)

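AI summary This paper studies how hallucination in multimodal agents can lead to authorization failures and proposes evidence-carrying multimodal agents (ECA), which decompose tool calls, obtain typed certificates, and authorize through a deterministic gate, converting opaque model belief into auditable residuals and improving system safety.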

Comments 23 pages, 6 figures, 15 tables

详情
AI中文摘要

多模态智能体越来越多地从截图、文档和网页中选择工具调用,其中虚假感知声明可能导致幻觉从答案质量错误转变为授权失败。我们正式将这种失败模式定义为幻觉到动作转换:一个不支持的声明为特权动作提供了前提条件。我们提出了证据承载多模态智能体(ECA),将自由形式模型文本视为不可接受的证据,将每个工具调用分解为动作关键谓词,从受限的DOM/OCR/AX验证器中获取类型证书,并使用确定性门来只授权证书支持的特权。与其隐藏感知错误不同,ECA将模糊的模型信念转换为可审计的残余,在验证器、模式和实现层面。在17个经典攻击类别上进行的验证器红队测试显示,四个目标加固步骤各自是必要的;在加固后,经典门绕过是0/1700(Wilson 95%上界0.22%)。使用内容衍生证书,ECA在200个端到端任务上观察到零不安全执行(Wilson 95%上界2.67%)和120个浏览器任务(上界4.3%)。对500个分层任务键的HACR审计显示,不支持的动作关键声明导致不安全执行,对原始智能体(100.0%)和仅提示防御(49.6%)无效,但对ECA无效。在7,488个GPT-5.4跟踪上进行的Oracle证书回放隔离了门的正确性,而神经判断基线在相同威胁模型下仍允许大多数不安全动作。最终的原则很简单:模型语言可能提出工具使用,但认证的谓词必须授权它。

英文摘要

Multimodal agents increasingly choose tool calls from screenshots, documents, and webpages, where a false perceptual claim can turn hallucination from an answer-quality error into an authorization failure. We formalize this failure mode as hallucination-to-action conversion: an unsupported claim supplies the precondition for a privileged action. We propose evidence-carrying multimodal agents (ECA), which treat free-form model text as inadmissible evidence, decompose each tool call into action-critical predicates, obtain typed certificates from constrained DOM/OCR/AX verifiers, and use a deterministic gate to authorize only the privileges those certificates support. Rather than hiding perception error, ECA converts opaque model belief into auditable residuals at the verifier, schema, and implementation levels. Verifier red-teaming across 17 canonical attack categories shows that four targeted hardening steps are each necessary; after hardening, canonical gate bypass is 0/1,700 (Wilson 95% upper bound 0.22%). With content-derived certificates, ECA observes zero unsafe executions on 200 end-to-end tasks (Wilson 95% upper bound 2.67%) and 120 browser tasks (upper bound 4.3%). A HACR audit on 500 stratified task keys shows that unsupported action-critical claims reach unsafe execution for naive agents (100.0%) and prompt-only defenses (49.6%), but not for ECA. Oracle-certificate replay over 7,488 GPT-5.4 traces isolates gate correctness, while neural judge baselines still admit most unsafe actions under the same threat model. The resulting principle is simple: model language may propose tool use, but certified predicates must authorize it.
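The gating principle can be sketched in a few lines. This is an illustrative assumption about how such a gate might look, not the authors' implementation: the `Certificate` fields, the predicate names, and the `gate` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Certificate:
    predicate: str  # action-critical predicate this certificate covers
    verifier: str   # issuer; only constrained verifiers are admissible
    holds: bool     # whether the verifier confirmed the predicate


TRUSTED_VERIFIERS = {"DOM", "OCR", "AX"}


def gate(required_predicates, certificates):
    """Authorize only if every required predicate has a passing
    certificate from a trusted verifier. Deterministic and fail-closed."""
    certified = {c.predicate for c in certificates
                 if c.holds and c.verifier in TRUSTED_VERIFIERS}
    missing = sorted(set(required_predicates) - certified)
    return (len(missing) == 0, missing)


required = ["button_label_is_submit", "amount_under_limit"]
certs = [
    Certificate("button_label_is_submit", "DOM", True),
    # A free-form model claim is not admissible evidence.
    Certificate("amount_under_limit", "LLM", True),
]
print(gate(required, certs))  # (False, ['amount_under_limit'])
```

The returned list of missing predicates is the auditable part: a denied action records exactly which claim lacked support.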

2605.17998 2026-05-22 cs.SE cs.AI 版本更新

Verify-Gated Completion as Admission Control in a Governed Multi-Agent Runtime: A Bounded Architecture Case Study

验证门控完成作为受控多智能体运行时的准入控制:一个有界架构案例研究

Hai-Duong Nguyen, Xuan-The Tran

Affiliations * Vietnam Maritime University

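AI summary This paper studies verify-gated completion as an admission-control mechanism for governed multi-agent runtimes. Through one bounded reference implementation, it examines what the evidence can support about auditable verify-gated completion and analyzes its behavior and limits across scenarios.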

Comments 39 pages, 2 figures, 17 tables. Preprint

详情
AI中文摘要

随着多智能体系统从短时交互转向具有专门角色和持久状态的工具使用工作流,完成性问题从纯粹的生成性问题转变为运行时控制问题。本文研究了验证门控完成作为受控多智能体运行时的准入控制模式:智能体可以提出完成请求,但只读验证器决定是否接受该请求。模糊或证据薄弱的情况采用失败关闭策略,而分组状态和事件轨迹保留审计路径。我们检查了一个有界参考实现,并探讨释放的证据能支持关于可审计、验证门控完成的哪些信息。在释放的验证完成切片中,已知结果触发事件验证成功比例为1,791/1,800 = 99.5%。这是一个关于触发验证事件的计数措施,而不是任务完成、生产可靠性或基准成功率。任务级验证覆盖率不可计算;1,762/1,801行来自一个高流量报告集群;只有17个事件被生产分类。一个影子策略/治理验证器评估显示,1,526/1,548 = 98.58%的规则一致,0/1,526个安全通过预测中的假成功,以及阻塞精度为2/518 = 0.39%,因此仍属建议性。证据支持一个狭窄的结论:在观察到的条件下,只读验证门和分组的准入记录使完成决策可检查且失败关闭。关于部署操作、安全保证、结果收益、任务级覆盖率、恢复有效性或外部效度的声明仍超出研究范围。

英文摘要

As multi-agent systems move from short interactions to tool-using workflows with specialized roles and persistent state, completion becomes a runtime-control problem rather than a purely generative one. This preprint studies verify-gated completion as an admission-control pattern for governed multi-agent runtimes: agents may propose completion, but a read-only verifier decides whether the claim is admitted. Ambiguous or weakly evidenced cases resolve fail-closed, while packetized state and event traces preserve an audit path. We examine one bounded reference implementation and ask what the released evidence can support about auditable, verify-gated completion. In the released verify-completed slice, the known-outcome invoked-event verify success share was 1,791/1,800 = 99.5%. This is an accounting measure over invoked verification events, not a task-completion, production-reliability, or benchmark-success rate. Task-level verify coverage is not computable; 1,762/1,801 rows came from one high-volume reporting cluster; and only 17 events were production-classified. A shadow Policy/Governance Verifier evaluation showed 1,526/1,548 = 98.58% rule agreement, 0/1,526 false-success among safe-to-proceed predictions, and blocked precision of 2/518 = 0.39%, so it remains advisory. The evidence supports a narrow conclusion: under observed conditions, a read-only verify gate plus packetized admission records made completion decisions inspectable and fail-closed. Claims about deployed operation, safety guarantees, outcome gains, task-level coverage, recovery effectiveness, or external validity remain outside scope.

2605.17837 2026-05-22 cs.CV cs.AI 版本更新

Temporal Aware Pruning for Efficient Diffusion-based Video Generation

具有时间意识的剪枝用于高效扩散式视频生成

Sheng Li, Yang Sui, Junhao Ran, Bo Yuan, Yue Dai, Xulong Tang

Affiliations * University of Pittsburgh; Illinois Institute of Technology; Rutgers University; Rice University

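AI summary This paper proposes TAPE, a training-free temporal-aware pruning method for efficient diffusion-based video generation that combines temporal smoothing, in-layer token reselection, and timestep-level budget scheduling to speed up generation while preserving high visual quality.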

详情
AI中文摘要

视频扩散模型最近通过基于ViT的架构实现了高质量视频生成,但生成过程由于需要在长时空序列上进行注意力计算而计算成本高。token剪枝已被证明在ViTs和VLMs中有效。然而,大多数先前的剪枝方法基于注意力,按帧操作,无法确保视频生成任务中帧间的重要时间一致性。在实践中,简单采用仅注意力的剪枝会导致明显退化,由于背景一致性变差、闪烁和图像质量下降。为此,我们提出TAPE,一种无需训练的时间感知剪枝方法,用于高效扩散式视频生成。TAPE(i)应用时间平滑以对齐相邻帧之间的token重要性并抑制选择抖动;(ii)在选定的层中进行token重选,以使token剪枝与层的多样化语义关注相一致,并避免特定区域的误差累积;它还(iii)采用时间步级预算调度,在早期噪声步骤中进行激进剪枝,并在保真度关键的细化阶段放松剪枝。实验结果表明,TAPE在保持高质量视觉保真度的同时提供了显著的加速,优于先前的token减少方法。

英文摘要

Video diffusion models have recently enabled high-quality video generation with ViT-based architectures, but remain computationally intensive because generation requires attention computation over long spatiotemporal sequences. Token pruning has proven effective for ViTs and VLMs. However, most prior pruning methods are attention-based and operate per frame, failing to ensure the vital temporal coherence across frames in video generation tasks. In practice, naively adopting attention-only pruning causes noticeable degradation due to worsened background consistency, flickering, and reduced image quality. To address this, we propose TAPE, a training-free Temporal Aware Pruning method for Efficient diffusion-based video generation. TAPE (i) applies temporal smoothing to align token importance across adjacent frames and suppress selection jitter; (ii) performs token reselection in selected layers to align token pruning with the layers' diverse semantic focus and avoid error accumulation in specific areas; and (iii) adopts timestep-level budget scheduling that prunes aggressively at early noisy steps and relaxes pruning during fidelity-critical refinement. The experimental results show that TAPE delivers significant speedups while preserving high visual fidelity, outperforming prior token reduction approaches.
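Two of the three components, temporal smoothing and the timestep-level budget, can be sketched in plain Python. The neighbour-averaging rule, the linear schedule, and all parameter values below are assumptions chosen for illustration; the abstract does not give the actual formulas.

```python
def smooth_importance(scores, weight=0.5):
    """Blend each frame's token scores with the mean of its temporal
    neighbours, so adjacent frames tend to keep the same tokens."""
    smoothed = []
    for t, frame in enumerate(scores):
        neighbours = [scores[i] for i in (t - 1, t + 1) if 0 <= i < len(scores)]
        out = []
        for j, s in enumerate(frame):
            if neighbours:
                nb = sum(n[j] for n in neighbours) / len(neighbours)
                out.append((1 - weight) * s + weight * nb)
            else:
                out.append(s)
        smoothed.append(out)
    return smoothed


def keep_ratio(step, total_steps, early=0.5, late=0.9):
    """Linear budget: keep few tokens at early noisy steps and more
    during late, fidelity-critical refinement."""
    frac = step / max(total_steps - 1, 1)
    return early + (late - early) * frac


def select_tokens(frame_scores, ratio):
    """Indices of the top-scoring tokens for one frame, in original order."""
    k = max(1, round(ratio * len(frame_scores)))
    ranked = sorted(range(len(frame_scores)),
                    key=lambda j: frame_scores[j], reverse=True)
    return sorted(ranked[:k])


# Per-frame importance that flickers between token 0 and token 1.
scores = [[0.9, 0.1, 0.5, 0.2], [0.1, 0.9, 0.5, 0.2], [0.9, 0.1, 0.5, 0.2]]
smoothed = smooth_importance(scores)
print([select_tokens(f, keep_ratio(0, 10)) for f in smoothed])
```

After smoothing, the scores of tokens 0 and 1 no longer swap rank from frame to frame, which is the selection jitter the abstract refers to.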

2605.17602 2026-05-22 cs.AI cs.CV cs.LG 版本更新

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

AutoRubric-T2I: 一种用于文本到图像对齐的鲁棒基于规则的奖励模型

Kuei-Chun Kao, Daixuan Huo, Yuanhao Ban, Cho-Jui Hsieh

Affiliations * University of California, Los Angeles

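AI summary This paper proposes AutoRubric-T2I, the first rubric learning framework for text-to-image generation, which automatically synthesizes and selects explicit rubrics to guide vision-language model (VLM) judges. It synthesizes reasoning traces from preference pairs into candidate rubrics, then has a VLM judge score paired images under each rubric, producing pairwise rubric-score differences for preference learning. An ℓ1-regularized logistic regression refiner removes noisy and redundant rubrics, yielding high-quality, interpretable reward signals from a small amount of annotated preference data and outperforming existing reward-model baselines on multiple image reward benchmarks.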

Comments 27 pages

详情
AI中文摘要

将文本到图像(T2I)生成模型与人类偏好对齐越来越依赖于图像奖励模型,这些模型根据提示对齐和感知质量对生成图像进行评分或排序。现有的奖励模型通常在大规模人类偏好语料上训练为Bradley-Terry(BT)偏好模型,这使得训练成本高、适应困难且评估标准不透明。同时,视觉语言模型(VLM)法官可以通过文本评分规则提供更细致的评估,但其手动设计或启发式生成的评分规则可能无法可靠地反映人类偏好。在本文中,我们提出AutoRubric-T2I,这是首个用于T2I的规则学习框架,能够自动合成和选择显式规则以指导VLM法官。AutoRubric-T2I首先通过合成偏好对的推理轨迹生成候选规则,然后利用VLM法官在每种规则下对配对图像进行评分,产生配对规则评分差异用于偏好学习。为了去除噪声和冗余规则,我们进一步采用ℓ1正则化逻辑回归精简器,选择Top-N最判别性的规则。广泛评估表明,AutoRubric-T2I在使用不到0.01%的标注偏好数据的情况下,能够生成高质量、可解释的奖励信号,大幅减少了大规模奖励模型训练的需求。在图像奖励基准如MMRB2上,AutoRubric-T2I优于强奖励模型基线。我们进一步验证AutoRubric-T2I作为强化学习奖励在下游T2I任务中的效果,包括TIIF和UniGenBench++,其中它通过流-GRPO管道在扩散模型上提升了生成质量,优于标量奖励模型。

英文摘要

Aligning Text-to-Image (T2I) generation models with human preferences increasingly relies on image reward models that score or rank generated images according to prompt alignment and perceptual quality. Existing reward models are commonly trained as Bradley-Terry (BT) preference models on large-scale human preference corpora, making them costly to train, difficult to adapt, and opaque in their evaluation criteria. Meanwhile, Vision-Language Model (VLM) judges can provide more fine-grained assessments through textual rubrics, but their manually designed or heuristically generated scoring rules may fail to reliably reflect human preferences. In this paper, we propose AutoRubric-T2I, the first rubric learning framework in T2I that automatically synthesizes and selects explicit rubrics for guiding VLM judges. AutoRubric-T2I first synthesizes reasoning traces from preference pairs into candidate rubrics, then uses a VLM judge to score paired images under each rubric, producing pairwise rubric-score differences for preference learning. To remove noisy and redundant rules, we further employ a $\ell_1$-Regularized Logistic Regression Refiner, which selects the Top-$N$ most discriminative rubrics. Extensive evaluations show that AutoRubric-T2I produces high-quality, interpretable reward signals using less than 0.01% of the annotated preference data, substantially reducing the need for large-scale reward-model training. On image reward benchmarks such as MMRB2, AutoRubric-T2I outperforms strong reward model baselines. We further validate AutoRubric-T2I as an RL reward on downstream T2I tasks, including TIIF and UniGenBench++, where it improves generation quality over scalar reward models using the Flow-GRPO pipeline on diffusion models.

2605.17596 2026-05-22 cs.AI 版本更新

NeuSymMS: A Hybrid Neuro-Symbolic Memory System for Persistent, Self-Curating LLM Agents

NeuSymMS:一种混合神经符号记忆系统,用于持久、自管理的LLM代理

Mujahid Sultan, Sri Thuraisamy, Daya Rajaratnam

Affiliations * iVedha Corporation; MLSoft Inc.

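AI summary NeuSymMS uses a hybrid neuro-symbolic architecture that lets LLM agents learn, remember, and reason about users across sessions. Its core method couples LLM-based neural fact extraction with a CLIPS-based expert system, and its main contribution is a dual-horizon memory model that supports self-curating memory.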

Comments 7 pages

详情
AI中文摘要

我们介绍了NeuSymMS,一种自适应的记忆系统,使大型语言模型(LLM)代理能够通过混合神经符号架构在多个会话中学习、记忆和推理用户信息。NeuSymMS结合了使用LLM从非结构化对话中提取事实的神经网络,以及基于CLIPS的专家系统,该系统在显式生命周期规则下对事实进行分类、去重和协调。系统将知识表示为主体-关系-值三元组,存储在关系数据库管理系统中。它支持用户/代理/代理到代理的范围,并实现双视野(短期和长期)记忆模型。它利用基于访问的提升和基于时间的剪枝来管理两个视野中的记忆。NeuSymMS在保持记忆连续性的同时避免了上下文窗口膨胀和跨实体污染。我们认为这种架构为生产代理系统提供了可靠、可审计的记忆的实用路径,并讨论其与日志检索、摘要和键值方法的创新性对比。

英文摘要

We present NeuSymMS, an adaptive memory system that enables large language model (LLM) agents to learn, remember, and reason about users across sessions via a hybrid neuro-symbolic architecture. NeuSymMS couples neural fact extraction from unstructured dialogue using LLMs with a CLIPS-based expert system that classifies, deduplicates, and reconciles facts under explicit lifecycle rules. The system represents knowledge as subject-relation-value triples stored in a relational database management system. It supports user, agent, and agent-to-agent scoping, and implements a dual-horizon (short-term and long-term) memory model. It leverages access-based promotion and time-based pruning of the memory on both horizons. NeuSymMS maintains continuity of memory while avoiding context-window bloat and cross-entity contamination. We argue that this architecture offers a practical path to trustworthy, auditable memory for production agentic systems and discuss its novelty relative to log retrieval, summarization, and key-value approaches.
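The dual-horizon lifecycle can be illustrated with a small in-memory sketch. It stands in for the CLIPS rules and the relational store with plain Python; the class names, thresholds, and time-to-live values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    subject: str
    relation: str
    value: str
    last_access: float
    hits: int = 0
    horizon: str = "short"  # "short" or "long"


class TripleMemory:
    def __init__(self, promote_after=3, short_ttl=3600.0, long_ttl=30 * 86400.0):
        self.facts = {}
        self.promote_after = promote_after
        self.ttl = {"short": short_ttl, "long": long_ttl}

    def upsert(self, subject, relation, value, now):
        # Deduplicate on (subject, relation); a newer value replaces the old one.
        self.facts[(subject, relation)] = Fact(subject, relation, value, now)

    def read(self, subject, relation, now):
        fact = self.facts.get((subject, relation))
        if fact is None:
            return None
        fact.hits += 1
        fact.last_access = now
        if fact.hits >= self.promote_after:
            fact.horizon = "long"  # access-based promotion
        return fact.value

    def prune(self, now):
        # Time-based pruning on both horizons, each with its own time-to-live.
        self.facts = {key: f for key, f in self.facts.items()
                      if now - f.last_access <= self.ttl[f.horizon]}


memory = TripleMemory(promote_after=2, short_ttl=10.0, long_ttl=100.0)
memory.upsert("user", "likes", "tea", now=0.0)
memory.upsert("user", "city", "Oslo", now=0.0)
memory.read("user", "likes", now=1.0)
memory.read("user", "likes", now=2.0)  # second read promotes the fact
memory.prune(now=50.0)
print(sorted(memory.facts))  # [('user', 'likes')]
```

The frequently read fact survives because promotion moved it to the longer horizon, while the untouched short-term fact expires.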

2605.16362 2026-05-22 cs.LG cs.AI 版本更新

When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

当秩-1引导廉价时是什么情况?几何学、粒度和预算化搜索

John T. Robertson, Jianing Zhu, Haris Vikalo, Zhangyang Wang

Affiliations * The University of Texas at Austin

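AI summary This paper studies why the effectiveness of rank-1 steering varies across concepts, identifies granularity and geometry as the key factors behind steering cost, and introduces the GRACE framework for optimizing steering efficiently.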

Comments Updated Abstract metadata

详情
AI中文摘要

激活引导提供了一种无需重新训练即可控制大语言模型的轻量方法,但其效果在不同概念上变化显著。先前研究通常将这种变化视为许多概念无法由单一引导方向捕捉的证据。我们主张这种变化更多反映了搜索难度:有用的秩-1干预通常存在,但找到它可能成本高昂。我们正式将秩-1引导定义为在干预层和系数上的预算约束优化。在不同概念和模型家族中,提示边界方向对齐预测有效干预的位置,使几何引导搜索能够以更少的评估达到高效用,平均减少39.8%的试验次数以恢复95%的最佳效用。为解释为何某些概念即使在更好的搜索下仍昂贵,我们引入了粒度,即对比上下文中方向异质性的度量。粒度区分了差异向量共享稳定全局方向的概念,与提示在每个输入中局部一致但最优方向系统性旋转的概念。更高的粒度与更慢的收敛速度和更低的最佳效用相关(相关系数分别为0.44和-0.46,p<0.001)。我们提出了GRACE框架,一个粒度和表征意识的概念工程框架,利用激活几何学来诊断引导难度的主要来源,选择适当的解决方案,并高效分配优化努力。我们的结果将框架从“秩-1何时失败?”转变为“秩-1何时廉价且稳定?”,使激活几何学从描述性工具转变为LLM控制的可操作先验。

英文摘要

Activation steering offers a lightweight way to control LLMs without retraining, but its effectiveness varies sharply across concepts. Prior work often reads this variability as evidence that many concepts are not captured by a single steering direction. We argue instead that much of it reflects search difficulty: a useful rank-1 intervention often exists, but finding it can be expensive. We formalize rank-1 steering as a budget-constrained optimization over intervention layer and coefficient. Across concepts and model families, prompt-boundary directional alignment predicts where effective interventions occur, enabling geometry-guided search that reaches high utility with substantially fewer evaluations, reducing the trials needed to recover 95% of best-found utility by 39.8% on average across three model families. To explain why some concepts remain expensive even under better search, we introduce concept granularity, a measure of directional heterogeneity across contrastive contexts. Granularity distinguishes concepts whose difference vectors share a stable global direction from those where prompts agree locally within each input but the utility-maximizing direction rotates systematically across inputs. Higher granularity is associated with slower convergence and lower best-found performance (Pearson $r{=}0.44$ with trials-to-95%, $r{=}{-}0.46$ with best-found utility, both $p<0.001$). We present GRACE, a Granularity- and Representation-Aware Concept Engineering framework that uses activation geometry to diagnose the dominant source of steering difficulty, select the appropriate remedy, and allocate optimization effort efficiently. Our results shift the frame from "when does rank-1 fail?" to "when is rank-1 cheap and stable?", turning activation geometry from a descriptive tool into an actionable prior for LLM control.
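The budgeted search over layer and coefficient can be sketched as follows. This is an assumption-laden illustration rather than GRACE itself: `steer` adds a scaled unit direction to one hidden vector, and `budgeted_search` simply evaluates candidates in the order supplied, so a geometry-guided variant would pass layers sorted by some alignment score. The toy `utility` function is hypothetical.

```python
import math


def steer(hidden, direction, coeff):
    """Add coeff times the unit-normalised direction to one hidden vector."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [h + coeff * d / norm for h, d in zip(hidden, direction)]


def budgeted_search(utility, layers, coeffs, budget):
    """Evaluate (layer, coeff) pairs in the given order until the budget
    is spent; return the best (utility, layer, coeff) and trials used."""
    best, trials = None, 0
    for layer in layers:
        for c in coeffs:
            if trials >= budget:
                return best, trials
            trials += 1
            u = utility(layer, c)
            if best is None or u > best[0]:
                best = (u, layer, c)
    return best, trials


def utility(layer, coeff):
    # Toy stand-in for a real evaluation; peaks at layer 2, coefficient 1.0.
    return -((layer - 2) ** 2) - (coeff - 1.0) ** 2


# Layers ordered by a hypothetical alignment score, most promising first.
best, trials = budgeted_search(utility, layers=[2, 1, 3], coeffs=[0.5, 1.0], budget=3)
print(steer([1.0, 2.0], [3.0, 4.0], 5.0), best, trials)
```

With a good ordering, the best setting is found within the first few trials, which is the sense in which a useful intervention can exist yet still be expensive to locate under a poor ordering.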

2605.16258 2026-05-22 cs.CV cs.AI cs.RO 版本更新

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

IVGT:隐式视觉几何变换器用于神经场景表示

Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou, Jiwen Lu

Affiliations * Intelligent Vision Group, Tsinghua University

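AI summary This paper proposes IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images to produce a neural scene representation. It supports continuous spatial queries at arbitrary 3D positions to predict signed distance and color, and performs strongly across multiple tasks.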

Comments Code: https://github.com/wzzheng/IVGT/

详情
AI中文摘要

从未经姿态的多视角图像中重建一致的3D几何和外观是计算机视觉中的基础但具有挑战性的问题。现有的视觉几何基础模型通常通过回归像素对齐的点图来预测显式几何,常常面临冗余和几何连续性有限的问题。我们提出了IVGT,一种隐式视觉几何变换器,能够从无姿态的多视角图像中隐式建模连续且一致的几何。这种形式在规范坐标系中学习了连续的神经场景表示,并支持在任意3D位置进行连续空间查询,通过轻量级解码器检索局部特征,以预测签名距离(SDF)值和颜色。它允许直接提取连续且一致的表面几何,从而能够从任意视角渲染RGB图像、深度图和表面法线图。我们通过多数据集联合优化进行训练,结合2D监督和3D几何正则化。IVGT在不同场景中表现出良好的泛化能力,并在多种任务中实现了优异的性能,包括网格和点云重建、新视角合成、深度和表面法线估计以及相机姿态估计。

英文摘要

Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D positions, retrieving local features to predict signed distance (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.

2605.15505 2026-05-22 cs.AI cs.IR cs.LG 版本更新

X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Digital Human Attention

X-SYNTH:超越检索——从观察到的数字人类注意力中提取企业上下文

Guruprasad Raghavan, George Nychis, Rohan Narayana Murthy

发表机构 * Workfabric AI

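AI summary This paper proposes the X-SYNTH framework, which addresses enterprise context synthesis by analyzing behavioral patterns in digital human attention. Its core method synthesizes context from behavioral patterns rather than conventional retrieval, substantially raising the True Lead Rate and lowering the False Lead Rate.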

Comments 11 pages, 7 figures, 5 tables

详情
AI中文摘要

在企业运营中,AI代理任务所需上下文分散在记录系统、静态信息存储和通信渠道中。所存储的是系统状态,这是工作实际发生情况的损失性表示。现有的方法通过匹配请求内容来检索存储的信息;对于狭窄请求,这种方法效果良好。但合成质量依赖于了解应展示什么以及如何解释它:这涉及每个组织、团队和个人特有的知识,存在于行为模式中,而不在任何检索索引中。对于提出对企业有价值的线索给销售员的代理任务,这种方法失效:真正的线索率低,假线索率高,且模型没有改进机制。我们提出了X-SYNTH,一个基于数字人类注意力的框架,这种注意力是每个工人的可数字化交互特征,编码了他们做了什么、按什么顺序做,以及隐含的奖励信号。在没有外部标签的情况下,可以区分出导致积极结果的先前行为轨迹与未导致积极结果的轨迹。X-SYNTH将每个个体的行为基线建模为数字双胞胎签名(DTS),并根据个体和查询选择七种注意力过滤器:比例、反比、微分、递归、比较、顺序和集体,以识别因果相关的活动签名。一个四阶段的管道将基于行为模式的排名上下文组装起来,而不是查询嵌入。一个前沿模型在无辅助的情况下实现了9.5%的真实线索率(TLR)和90.5%的假线索率(FLR)。在加入X-SYNTH后,TLR上升到61.9%(6.5倍),而FLR下降到18.8%。企业上下文合成不是检索问题,而是相关性问题,而数字人类注意力是其最可靠的地面真实值。

英文摘要

In enterprise operations, the context required for an AI agent task is scattered across systems of record, static information stores, and communication channels. What is stored is system state, a lossy representation of the work that actually happened. The prevailing approach retrieves by matching request content to what is stored; for narrow requests this works well. But synthesis quality depends on knowing what to surface and how to interpret it: knowledge specific to each organization, team, and individual, present in behavioral patterns, absent from any retrieval index. For the agentic task of proposing enterprise-valuable leads to sellers, this approach breaks down: True Lead Rate is low, False Lead Rate is high, and the model has no mechanism to improve. We present X-SYNTH, a framework for enterprise context synthesis grounded in digital human attention, the digitally observable interaction signatures of each worker, encoding what they did, the sequence in which they did it, and implicit reward signals. Behavioral traces preceding positive outcomes are distinguishable from those that did not, without external labeling. X-SYNTH models each individual's behavioral baseline as a Digital Twin Signature (DTS) and selects among seven attention filters, Proportional, Inverse, Differential, Recurrent, Comparative, Sequential, and Collective, per individual and per query, to identify causally relevant activity signatures. A four-stage pipeline assembles ranked context grounded in behavioral patterns rather than query embeddings. A frontier model unaided achieves 9.5% True Lead Rate (TLR) with 90.5% False Lead Rate (FLR). Augmented with X-SYNTH, TLR rises to 61.9% (6.5x) while FLR falls to 18.8%. Enterprise context synthesis is not a retrieval problem. It is a relevance problem, and digital human attention is its most reliable ground truth.

2605.14322 2026-05-22 cs.AI 版本更新

Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

代理是否准备好教学?一个多阶段基准用于现实世界教学工作流程

Zixin Chen, Peng Liu, Rui Sheng, Haobo Li, Jianhong Tu, Xiaodong Deng, Kashun Shum, Dayiheng Liu, Huamin Qu

Affiliations * Hong Kong University of Science and Technology; Qwen Team, Alibaba Group

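AI summary This paper introduces the EduAgentBench benchmark for holistically evaluating tutor agents. It finds that current models handle bounded pedagogical judgment but still fall short in situated tutoring and autonomous teaching-workflow execution, and it provides a measurement foundation for developing future tutor agents.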

Comments Under review

详情
AI中文摘要

语言代理越来越多地部署在复杂的专业工作流程中,辅导能力作为高风险功能,目前在现有基准中仍未得到充分衡量。有效的辅导代理需要超越产生正确答案或执行准确工具调用:一个稳健的辅导代理必须诊断学习者状态、随时间适应支持、做出基于教育证据的决策,并在现实的学习管理系统中执行干预。我们引入EduAgentBench,一个源驱动的基准,用于全面评估辅导代理在教学工作范围内的能力。它包含150个经过质量控制的任务,涵盖三个能力表面:专业教学判断、情境多轮辅导和Canvas式教学工作流程完成。任务通过教学洞察驱动的流程构建,并通过互补的验证信号和人工审查进行评估。在对前沿模型的全面评估中,我们的发现表明,当前模型在有限的教学判断方面表现良好,但在情境辅导和自主教学工作流程执行方面仍无法达到专业教学标准。据我们所知,EduAgentBench是第一个理论基础和现实的基准,用于评估辅导代理的全面教学能力,为开发未来能够支持现实教学工作的辅导代理提供了测量基础。

英文摘要

Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. Across a comprehensive evaluation of frontier models, our findings reveal that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.

2605.10067 2026-05-22 cs.LG cs.AI 版本更新

Metis: Learning to Jailbreak LLMs via Self-Evolving Metacognitive Policy Optimization

Metis: 通过自进化元认知策略优化学习 jailbreak LLMs

Huilin Zhou, Jian Zhao, Yilu Zhong, Zhen Liang, Xiuyuan Chen, Yuchen Yuan, Tianle Zhang, Chi Zhang, Lan Zhang, Xuelong Li

Affiliations * University of Science and Technology of China; Institute of Artificial Intelligence (TeleAI), China Telecom

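AI summary This paper proposes the Metis framework, which reformulates jailbreaking as inference-time policy optimization within an adversarial partially observable Markov decision process to make adversarial testing more efficient and effective, while improving interpretability through structured feedback and transparent reasoning traces. Experiments show that Metis achieves higher attack success rates and lower token costs across a range of models.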

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

红队测试对于揭示大型语言模型(LLMs)中的漏洞至关重要。尽管自动化方法已提高可扩展性,但现有方法往往依赖静态启发式或随机搜索,使其在面对高级安全对齐时显得脆弱。为了解决这一问题,我们引入了Metis框架,该框架将jailbreaking重新表述为对抗性部分可观测马尔可夫决策过程(POMDP)中的推理时间策略优化。Metis采用自进化元认知循环来执行目标防御逻辑的因果诊断,并利用结构化反馈作为语义梯度来优化其策略,通过透明推理轨迹提高可解释性。在10种不同模型上的广泛评估表明,Metis在比较方法中实现了最强的平均攻击成功率(ASR)为89.2%,在坚韧的前沿模型(如O1和GPT-5-chat)上保持高效果,而传统基线方法表现出显著的性能下降。通过用定向优化替代冗余探索,Metis将token成本平均降低了8.2倍,最高可达11.4倍。我们的分析表明,当前防御在测试设置下仍易受内部引导的闭环推理轨迹影响,突显了下一代防御机制在推理过程中动态处理安全性的关键需求。

英文摘要

Red teaming is critical for uncovering vulnerabilities in Large Language Models (LLMs). While automated methods have improved scalability, existing approaches often rely on static heuristics or stochastic search, rendering them brittle against advanced safety alignment. To address this, we introduce Metis, a framework that reformulates jailbreaking as inference-time policy optimization within an adversarial Partially Observable Markov Decision Process (POMDP). Metis employs a self-evolving metacognitive loop to perform causal diagnosis of a target's defense logic and leverages structured feedback as a semantic gradient to refine its policy, offering enhanced interpretability through transparent reasoning traces. Extensive evaluations across 10 diverse models demonstrate that Metis achieves the strongest average Attack Success Rate (ASR) among compared methods at 89.2%, maintaining high efficacy on resilient frontier models (e.g., 76.0% on O1 and 78.0% on GPT-5-chat) where traditional baselines exhibit substantial performance degradation. By replacing redundant exploration with directed optimization, Metis reduces token costs by an average of 8.2x and up to 11.4x. Our analysis reveals that current defenses remain vulnerable to internally-steered, closed-loop reasoning trajectories under the tested settings, highlighting a critical need for next-generation defenses capable of reasoning about safety dynamically during inference.

2605.08380 2026-05-22 cs.SE cs.AI 版本更新

What Software Engineering Looks Like to AI Agents? -- An Empirical Study of AI-Only Technical Discourse on MoltBook

AI代理如何看待软件工程?-- 对MoltBook上纯AI技术讨论的实证研究

Junyu Huo, Ziqi Mao, Zihao Wan, Gouri Ginde

Affiliations * University of Calgary; Phanvic

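AI summary: This study analyzes technical discussion generated purely by AI agents on MoltBook to characterize the discourse that agents produce when they interact autonomously. The discussion centers on themes such as security and trust, memory management, tooling and APIs, and debugging and error handling, but it lacks the concrete project details and runtime information common in human developer discussions.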

详情
AI中文摘要

AI代理越来越多地被描述为软件工程的队友,但大多数研究仍集中在人类主导的工作流程中。本文研究了AI代理在主要相互交互时产生的讨论内容,探讨了这些讨论的组织方式以及与人类开发者讨论的区别。我们结合了对500篇帖子的人类开放编码,一个覆盖4,707篇英语过滤MoltBook技术帖子的集中加检查主题分析流程,以及与5,211篇人类生成的GitHub Discussions帖子的匹配比较。MoltBook技术讨论涵盖12个反复出现的主题,其中安全和信任占27.4%。在社区层面,活动高度集中:最大的子molt占63.5%的帖子(基尼系数=0.88),但一个稳定性感知的BERTopic流程仍能识别出32个非异常子主题。与GitHub Discussions基线相比,MoltBook讨论中较少具体的、上下文丰富的提示,如代码格式化 artifacts、环境细节、运行时失败和重现步骤。社会模仿仅以有限的形式出现,而理想化主要通过较低的 hedging 反映出来。总体而言,纯AI技术讨论是连贯但选择性的。它反复回到安全和信任、内存和上下文管理、工具和API、调试和错误处理、工作流自动化以及基础设施/运维等主题,而省略了人类开发者讨论中常见的许多项目本地和运行时细节。这可能反映了MoltBook中较少的环境特定失败、重现步骤和其他基础提示。

英文摘要

AI agents are increasingly framed as software-engineering teammates, yet most studies examine them inside human-centered workflows. Little is known about the discourse autonomous AI agents produce when they interact mainly with one another. This paper examines what autonomous agents discuss on MoltBook, how that discourse is organized, and how it differs from human developer discourse. We combine human open coding of a 500-post sample, a concentration-plus-check topic-analysis pipeline over 4,707 English-filtered MoltBook technology posts, and a matched comparison with 5,211 human-generated GitHub Discussions posts. MoltBook technology discourse spans 12 recurring themes, led by Security and Trust (27.4%). At the community level, activity is highly concentrated: the largest submolt accounts for 63.5% of posts (Gini = 0.88), yet a stability-aware BERTopic pipeline still identifies 32 non-outlier sub-topics. Relative to the GitHub Discussions baseline, MoltBook discourse contains fewer concrete, context-rich cues such as code-formatted artifacts, environment details, runtime failures, and reproduction steps. Social mimicry appears only in limited form, while idealization is reflected mainly through lower hedging. Overall, AI-only technical discourse is coherent but selective. It repeatedly returns to security and trust, memory and context management, tooling and APIs, debugging and error handling, workflow automation, and infrastructure/ops, while omitting much of the project-local and runtime detail common in human developer discourse. This may reflect fewer environment-specific failures, reproduction steps, and other grounding cues in MoltBook.
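
The concentration figure quoted above (Gini = 0.88) is a Gini coefficient over per-community post counts. The sketch below shows how such a statistic can be computed, assuming the standard rank-based formula on sorted counts; the abstract does not state which estimator the authors used, and the function name `gini` and any input counts are illustrative, not MoltBook data.

```python
def gini(counts):
    """Gini coefficient of non-negative counts (0 = equal, near 1 = concentrated)."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over ascending values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

With this formula, equal counts give 0, and a single community holding every post among n communities gives (n - 1) / n, so values near 1 indicate that one community dominates.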

2605.03241 2026-05-22 physics.optics cs.AI 版本更新

OptiLookUp: An Optical ROM-Based Lookup Table Engine for Photonic Accelerators

OptiLookUp:一种基于光学ROM的查找表引擎用于光子加速器

Ankur Singh, Akhilesh Jaiswal

Affiliations * Department of Electrical and Computer Engineering, University of Wisconsin–Madison, Madison, WI, USA

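AI summary: This paper proposes an optical ROM-based lookup table engine: a high-speed, reconfigurable photonic ROM architecture built from integrated microring resonators. Input-output mappings are encoded directly in the spectral response of the photonic devices, which gives deterministic lookup-table operation. The design is evaluated in simulation on a silicon photonics platform and shows reliable operation at data rates up to 12.5 GHz.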

详情
AI中文摘要

只读存储器(ROM)提供确定性的预定义数据映射访问。将ROM概念扩展到光学领域能够实现高带宽、低延迟和并行内存访问,但实现紧凑且可重构的光学ROM仍然具有挑战性,因为存在损耗、波长控制和集成限制。本文提出了一种高速、可重构的光子ROM架构,该架构采用集成的微环谐振器(MRRs)实现。ROM直接在光子设备的频谱响应中编码预定义的输入输出映射,从而在读取时实现确定性的查找表操作,而无需动态计算。为了提高可扩展性和减少累积插入损耗,该架构采用紧凑的银行子阵列,通过光学解码机制进行选择性寻址。可重构性通过基于晶体管的光学选择器实现,允许不同ROM银行被激活而无需物理光路重路由或干涉结构。所提出的光子ROM基于GlobalFoundries 45SPCLO硅光平台进行设计和评估。仿真结果表明,在12.5GHz的数据速率下能够可靠运行,通过集成的光电二极管读取获得了稳定的光到电流转移特性。该光学ROM可用于实现用于光子加速器架构中的非线性激活函数,包括Sigmoid、Tanh、ReLU和指数映射。

英文摘要

Read-only memory (ROM) provides deterministic access to predefined data mappings. Extending ROM concepts to the optical domain enables high-bandwidth, low-latency, and parallel memory access, but realizing compact and reconfigurable optical ROM remains challenging due to loss, wavelength control, and integration constraints. This work presents a high-speed, reconfigurable photonic ROM architecture implemented using integrated microring resonators (MRRs). The ROM encodes predefined input-output mappings directly in the spectral response of the photonic devices, enabling deterministic lookup-based operation without dynamic computation during readout. To improve scalability and reduce cumulative insertion loss, the architecture employs compact banked sub-arrays that are selectively addressed through an optical decoding mechanism. Reconfigurability is achieved using transistor-based optical selectors, allowing different ROM banks to be activated without physical light rerouting or interferometric structures. The proposed photonic ROM is designed and evaluated using device-level simulations based on the GlobalFoundries 45SPCLO silicon photonics platform. Simulation results demonstrate reliable operation at data rates up to 12.5 GHz, with stable light-to-current transfer characteristics obtained through integrated photodiode readout. The optical ROM can be used to implement nonlinear activation functions utilised in photonic accelerator architectures, including sigmoid, tanh, ReLU, and exponential mappings.
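
The lookup-based operation described above can be illustrated in software: a fixed table maps a quantized input to a precomputed output, so nothing is computed at read time. This is a minimal sketch of the idea only, not the photonic design; the 4-bit address width, the input range [-4, 4], and the names `build_rom` and `rom_lookup` are assumptions made for illustration.

```python
import math

def build_rom(fn, bits=4, lo=-4.0, hi=4.0):
    """Precompute fn at 2**bits evenly spaced inputs; the result is read-only."""
    size = 2 ** bits
    step = (hi - lo) / (size - 1)
    return tuple(fn(lo + i * step) for i in range(size))

def rom_lookup(rom, x, lo=-4.0, hi=4.0):
    """Quantize x to the nearest address and return the stored value."""
    size = len(rom)
    x = min(max(x, lo), hi)  # clamp out-of-range inputs to the table edges
    addr = round((x - lo) / (hi - lo) * (size - 1))
    return rom[addr]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One table per activation function; only the stored contents differ.
SIGMOID_ROM = build_rom(sigmoid)
TANH_ROM = build_rom(math.tanh)
```

Swapping activation functions amounts to selecting a different table, which loosely mirrors the bank selection the abstract describes. Accuracy is bounded by the address width, so a wider address gives a finer approximation.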

2605.00414 2026-05-22 cs.LG cond-mat.stat-mech cs.AI 版本更新

Trees to Flows and Back: Unifying Decision Trees and Diffusion Models

树到流及回归:统一决策树和扩散模型

Sai Niranjan Ramachandran, Suvrit Sra

Affiliations * School of Computation, Information and Technology, Technical University of Munich, Germany; Munich Center for Machine Learning (MCML)

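AI summary: This paper unifies decision trees and diffusion models by establishing a mathematical correspondence between hierarchical decision trees and diffusion processes, revealing a shared optimization principle, Global Trajectory Score Matching (GTSM). It presents two practical applications: treeflow, which achieves competitive tabular data generation with a 2× computational speedup, and dsmtree, which transfers hierarchical decision logic into neural networks and matches teacher performance within 2% on many benchmarks.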

Comments 12 pages (main), 68 pages (inclusive of appendix), Accepted in the Forty-Third International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

决策树和扩散模型本质上是不同的模型类别,前者是离散和层次的,后者是连续和动态的。本文通过在适当的极限情况下建立层次决策树与扩散过程之间的清晰数学对应关系,将两者统一起来。我们的统一揭示了一个共同的优化原则:全局轨迹得分匹配(GTSM),其中梯度提升(在理想化版本中)在渐近意义上是最优的。通过两个关键的实用实例,我们强调了本工作的概念价值:treeflow在表格数据上实现了具有更高保真度和2倍计算速度的竞争性生成质量,而dsmtree是一种新的蒸馏方法,将层次决策逻辑转移到神经网络中,在许多基准上与教师模型表现相近。

英文摘要

Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal. We underscore the conceptual value of our work through two key practical instantiations: treeflow, which achieves competitive generation quality on tabular data with higher fidelity and a 2× computational speedup, and dsmtree, a novel distillation method that transfers hierarchical decision logic into neural networks, matching teacher performance within 2% on many benchmarks.

2605.00185 2026-05-22 cs.LG cs.AI 版本更新

Fair Dataset Distillation via Cross-Group Barycenter Alignment

通过跨组重心对齐实现公平的数据集蒸馏

Mohammad Hossein Moslemi, Nima Hosseini Dashtbayaz, Zhimin Mei, Bissan Ghaddar, Boyu Wang

Affiliations * Western University; Vector Institute; IE University; Ivey Business School

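AI summary: This paper studies the fairness problems that arise in dataset distillation when demographic groups exhibit different predictive patterns, and proposes cross-group barycenter alignment to reduce predictive bias across groups and improve model fairness.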

Comments Accepted by ICML 2026

详情
AI中文摘要

数据集蒸馏旨在将大规模数据集压缩成小规模合成数据集,同时保持预测性能。我们发现,由于不同人口群体表现出不同的预测模式,蒸馏过程在保持所有子群体信息信号方面面临困难,无论群体大小是轻微还是严重不平衡。因此,训练在蒸馏数据上的模型可能会在某些子群体上出现显著性能下降,导致公平性差距。关键的是,这些差距不会仅仅通过纠正群体不平衡来消失,因为它们源于子群体预测模式的根本不匹配,而不是样本数量差异本身。因此,我们正式分析了这两种偏差源之间的相互作用,并将解决方案定义为识别一个不考虑群体不平衡的预测信息重心,该重心在所有子群体中诱导出相似的表示。通过向这个共享的聚合表示进行蒸馏,我们证明可以减少群体公平性方面的担忧。我们的方法与现有蒸馏方法兼容,并且实验证明,它显著减少了数据集蒸馏引入的偏差。代码可在https://github.com/mhmoslemi/COBRA上获得。

英文摘要

Dataset Distillation aims to compress a large dataset into a small synthetic one while maintaining predictive performance. We show that as different demographic groups exhibit distinct predictive patterns, the distillation process struggles to simultaneously preserve informative signals for all subgroups, regardless of whether group sizes are mildly or severely imbalanced. Consequently, models trained on distilled data can experience substantial performance drops for certain subgroups, leading to fairness gaps. Crucially, these gaps do not disappear by merely correcting group imbalance, since they stem from fundamental mismatches in subgroup predictive patterns rather than from sample-size disparities alone. We therefore formally analyze the interaction between these two sources of bias and cast the solution as identifying a group-imbalance-agnostic barycenter of the predictive information that induces similar representations across all subgroups. By distilling toward this shared aggregate representation, we show that group fairness concerns can be reduced. Our approach is compatible with existing distillation methods, and empirical results show that it substantially reduces bias introduced by dataset distillation. Code is available at https://github.com/mhmoslemi/COBRA.

2604.14084 2026-05-22 cs.LG cs.AI 版本更新

TIP: Token Importance in On-Policy Distillation

TIP: on-policy distillation 中的 token 重要性

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard

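AI summary: This study asks which tokens carry the most useful learning signal in on-policy knowledge distillation, proposes a two-axis taxonomy over student entropy and teacher-student divergence, and validates experimentally that distilling on a small subset of tokens is effective under limited memory budgets.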

详情

英文摘要

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on <20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
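
The two informative regions named above (high student entropy, or low student entropy with high teacher-student divergence) suggest a simple per-token selection rule. The sketch below is one possible reading, not the authors' implementation: it assumes the divergence is KL(teacher || student), and the thresholds `h_thresh` and `d_thresh` and the function name `select_tokens` are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)

def select_tokens(student, teacher, h_thresh=0.5, d_thresh=0.5):
    """Return indices of positions kept for the distillation loss.

    student, teacher: per-position next-token distributions (lists of lists).
    Thresholds are illustrative assumptions, not values from the paper.
    """
    keep = []
    for i, (s, t) in enumerate(zip(student, teacher)):
        h, d = entropy(s), kl(t, s)
        uncertain = h >= h_thresh                          # student is unsure
        overconfident = h < h_thresh and d >= d_thresh     # sure, but disagrees with teacher
        if uncertain or overconfident:
            keep.append(i)
    return keep
```

An entropy-only rule would keep just the `uncertain` positions; the second condition is what recovers confident positions where the teacher disagrees, which the abstract identifies as the region entropy alone misses.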

2604.12325 2026-05-22 cs.LG cs.AI 版本更新

Black-Box Optimization From Small Offline Datasets via Meta Learning with Synthetic Tasks

通过合成任务进行元学习的黑盒优化

Azza Fadhel, The Hung Tran, Trong Nghia Hoang, Jana Doppa

Affiliations * School of EECS, Washington State University, Pullman, WA, USA

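AI summary: This paper proposes OptBias, a meta-learning framework based on synthetic task generation for black-box optimization from small offline datasets. It learns a reusable optimization bias to improve performance in small-data regimes.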

Comments Accepted for Publication at International Conference on Artificial Intelligence and Statistics (AISTATS)

详情
AI中文摘要

我们考虑了离线黑盒优化的问题,目标是从过去的实验数据中发现最优设计(例如分子或材料)。在这一设置中,一个关键挑战是数据稀缺性:在许多科学应用中,只有小规模或低质量的数据集可用,这严重限制了现有算法的有效性。先前的工作在理论和实证上都表明,离线优化算法的性能取决于代理模型对优化偏差(即正确排序输入设计的能力)的捕捉程度,这在有限的实验数据下很难实现。本文提出了一种通过生成合成任务进行元学习的框架OptBias,该框架通过在高斯过程生成的合成任务上训练来直接解决数据稀缺性问题。OptBias通过在小数据上微调代理模型来解决目标任务。在多样化的连续和离散离线优化基准上,OptBias在小数据场景中始终优于最先进的基线。这些结果突显了OptBias作为现实中小数据设置中离线优化的稳健且实用的解决方案。

英文摘要

We consider the problem of offline black-box optimization, where the goal is to discover optimal designs (e.g., molecules or materials) from past experimental data. A key challenge in this setting is data scarcity: in many scientific applications, only small or poor-quality datasets are available, which severely limits the effectiveness of existing algorithms. Prior work has theoretically and empirically shown that performance of offline optimization algorithms depends on how well the surrogate model captures the optimization bias (i.e., ability to rank input designs correctly), which is challenging to accomplish with limited experimental data. This paper proposes Surrogate Learning with Optimization Bias via Synthetic Task Generation (OptBias), a meta-learning framework that directly tackles data scarcity. OptBias learns a reusable optimization bias by training on synthetic tasks generated from a Gaussian process, and then fine-tunes the surrogate model on the small data for the target task. Across diverse continuous and discrete offline optimization benchmarks, OptBias consistently outperforms state-of-the-art baselines in small data regimes. These results highlight OptBias as a robust and practical solution for offline optimization in realistic small data settings.

2604.08571 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Robust Reasoning Benchmark

鲁棒推理基准

Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey

Affiliations * University of Toronto; Vector Institute

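AI summary: This study introduces the Robust Reasoning Benchmark (RRB), which evaluates 8 state-of-the-art models under 13 deterministic textual perturbations. Claude categorically refuses many transformed prompts, while open-weights reasoning models show several failure modes under structural noise, such as cognitive thrashing, tokenization breakdown, and reasoning collapse, with average accuracy drops of up to 54%. The study also identifies Intra-Query Attention Dilution, an attention dilution effect caused by the model's own chain-of-thought, which suggests that intermediate reasoning steps pollute standard dense attention and that future architectures need explicit contextual resets for reliable reasoning.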

详情
AI中文摘要

尽管大型语言模型(LLMs)在标准数学基准上表现优异,但其问题解决能力依赖于上下文和文本格式。我们引入鲁棒推理基准(RRB),该基准由13种确定性文本扰动组成,应用于2024年和2025年的AIME。评估8种最先进的模型后,发现前沿模型总体上具有较强的鲁棒性,但Claude在面对变换提示时表现出异常拒绝行为。开放权重推理模型在结构噪声下表现出多种失败模式(认知冲刷、分词崩溃和推理崩溃),在扰动下平均准确率下降高达54%,某些扰动甚至导致100%的准确率下降。我们进一步研究其中一种失败模式:由模型自身推理链引起的注意力稀释。通过要求模型在单一上下文窗口内依次解决多个独立数学问题,我们识别出Intra-Query Attention Dilution。从7B到120B参数的开放权重模型在后续问题上的准确率逐渐下降,表明中间推理步骤会污染标准密集注意力机制。我们主张,为了实现可靠的推理,未来架构需要在模型自身推理链中整合显式上下文重置,从而引发关于推理任务最佳粒度的开放研究问题。

英文摘要

While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13 deterministic textual perturbations applied to AIME 2024 and AIME 2025. Evaluating 8 state-of-the-art models, we find that frontier models are largely resilient, with the notable exception of Claude, which categorically refuses many transformed prompts. Open-weights reasoning models exhibit a range of failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with up to 54% average accuracy drops across perturbations and up to 100% on some. We further study one of these failure modes in isolation: attention dilution caused by the model's own chain-of-thought. By tasking models with solving multiple independent mathematical problems sequentially within a single context window, we identify Intra-Query Attention Dilution. Open-weights models ranging from 7B to 120B parameters exhibit accuracy decay on subsequent problems, suggesting that intermediate reasoning steps progressively pollute standard dense attention mechanisms. We argue that in order to achieve reliable reasoning, future architectures need to integrate explicit contextual resets within models' own chain-of-thought, leading to open research questions regarding the optimal granularity of reasoning tasks.

2603.29735 2026-05-22 cs.AI 版本更新

Unveiling the Reasoning Process of Large Language Models

揭示大型语言模型的推理过程

Junjie Zhang, Zhen Shen, Xisong Dong, Gang Xiong

Affiliations * Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

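AI summary: By analyzing how attention heads and layers in the Transformer transform information, this paper finds that, in mathematical and symbolic reasoning tasks, middle layers convert token-level information into reusable relational structure.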

详情
AI中文摘要

大型语言模型往往能够超越表层token进行推理,但token级信息转变为抽象关系结构的内部阶段仍不明确。我们通过分析自回归推理过程中注意力头和层如何转换信息来探讨这一问题。在数学和符号推理任务中,我们观察到一种一致的分层分工:外层主要保留和路由输入相关特征,而中层将它们重新组织为更具转移性的规则级表示。这种解释得到了表示几何的支持:中层状态占据较低维的流形,并在不同词汇库中表现出更强的对齐性,这些词汇库实现了相同的符号规则。此外,因果干预进一步支持了这一结论:移除通过我们基于交互的标准识别出的中层组件,会比移除其他区域或随机移除的组件产生更大的下游变化和准确率下降。共同,这些结果表明,抽象推理并非均匀分布在Transformer层中,而是优先在中层计算阶段形成,该阶段将token级信息转化为可重用的关联结构。

英文摘要

Large language models often reason beyond surface tokens, but the internal stage at which token-level information becomes abstract relational structure remains unclear. We investigate this question by analyzing how attention heads and layers transform information during autoregressive reasoning. Across mathematical and symbolic reasoning tasks, we observe a consistent layer-wise division of labor: outer layers mainly preserve and route input-related features, whereas middle layers reorganize them into more transferable rule-level representations. This interpretation is supported by representation geometry: middle-layer states occupy lower-dimensional manifolds and show stronger alignment across disjoint vocabularies that instantiate the same symbolic rules. It is further supported by causal interventions: removing middle-layer components identified by our interaction-based criterion produces substantially larger downstream changes and accuracy drops than removing components from other regions or at random. Together, these results suggest that abstract reasoning is not uniformly distributed across transformer layers, but is preferentially formed in a middle-layer computation stage that converts token-level information into reusable relational structure.

2603.21610 2026-05-22 cs.LG cs.AI stat.ML 版本更新

Rule-State Inference (RSI): A Bayesian Framework for Compliance Monitoring in Rule-Governed Domains

规则状态推断(RSI):一种用于规则治理领域合规监控的贝叶斯框架

Abdou-Raouf Atarmla

Affiliations * Institut National des Postes et Télécommunications; Togo DataLab; Ministry of Digital Economy

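AI summary: This paper proposes Rule-State Inference (RSI), a Bayesian framework that addresses three structural obstacles to compliance monitoring in rule-governed domains: the absence of labeled outcomes at deployment, strategically missing observations from non-compliant entities, and a regulatory environment that changes faster than any supervised model can be retrained. RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population through mean-field variational inference with exact coordinate-ascent updates.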

Comments 18 pages. Experimental validation forthcoming

详情
AI中文摘要

在规则治理领域(如税收管理、临床协议遵守、环境监管)的合规监控面临三个结构性障碍,标准机器学习无法同时解决:部署时缺乏标记结果、非合规实体战略性缺失观察以及监管环境变化速度超过任何监督模型的重新训练速度。我们引入规则状态推断(RSI),一种贝叶斯框架,颠覆了传统的学习规则从数据的范式。RSI将权威、形式化的规则集作为结构化的贝叶斯先验,并通过均场变分推断和精确坐标上升更新推断人口的潜在合规状态。核心建模对象是一个联合潜变量,每个监管时期一个:全局合规文化因子η以及每个规则的激活、人口合规水平和参数漂移成分。RSI提供了三个正式保证:每个规则更新的监管适应性为O(n_k + K);对于可识别的连续成分的伯恩斯坦-冯·米塞斯一致性;以及每次迭代的单调ELBO收敛。我们将在托戈财政系统上实例化RSI,基于官方监管法律的基准2000家合成企业;完整的数值验证将随后进行。该框架设计用于直接扩展到顺序RSI,一种状态空间公式化中,一个监管时期的后验成为下一个的先验,从而产生精确的卡尔曼滤波器用于合规轨迹跟踪和实体级贝叶斯评分。

英文摘要

Compliance monitoring in rule-governed domains (tax administration, clinical protocol adherence, environmental regulation) faces three structural obstacles that standard machine learning does not simultaneously address: the absence of labeled outcomes at deployment, strategically missing observations where non-compliant entities selectively withhold evidence, and a regulatory environment that changes faster than any supervised model can be retrained. We introduce Rule-State Inference (RSI), a Bayesian framework that reverses the usual paradigm. Rather than learning rules from data, RSI treats an authoritative, formalized rule set as structured Bayesian priors and infers the latent compliance state of a population through mean-field variational inference with exact coordinate-ascent updates. The central modeling object is a joint latent state per regulatory period: a global compliance-culture factor eta and per-rule components for activation, population compliance level, and parametric drift. RSI delivers three formal guarantees: O(n_k + K) regulatory adaptability per rule update; Bernstein-von Mises consistency for the identifiable continuous components; and monotone ELBO convergence at every iteration. We instantiate RSI on the Togolese fiscal system on a benchmark of 2,000 synthetic enterprises grounded in official regulatory law; full numerical validation is forthcoming. The framework is designed for direct extension to Sequential RSI, a state-space formulation where the posterior from one regulatory period becomes the prior for the next, yielding an exact Kalman filter for compliance-trajectory tracking and entity-level Bayesian scoring.
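
The Sequential RSI extension described above rests on a familiar recursion: the posterior from one period is carried forward as the prior for the next. The sketch below shows that recursion for a generic scalar Kalman filter. It is not the RSI model itself; the random-walk state, the Gaussian observation noise, the variance values, and the names `kalman_step` and `track` are all assumptions chosen for illustration.

```python
def kalman_step(mean, var, obs, process_var=0.01, obs_var=0.1):
    """One period: predict with random-walk drift, then update on obs."""
    # Predict: last period's posterior becomes this period's prior,
    # widened by the assumed drift between periods.
    prior_mean, prior_var = mean, var + process_var
    # Update: the gain weighs the observation against the prior.
    gain = prior_var / (prior_var + obs_var)
    post_mean = prior_mean + gain * (obs - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

def track(observations, mean=0.5, var=1.0):
    """Run the filter over per-period observations; returns (mean, var) pairs."""
    history = []
    for obs in observations:
        mean, var = kalman_step(mean, var, obs)
        history.append((mean, var))
    return history
```

Each call to `kalman_step` is exact under the stated linear-Gaussian assumptions, which is why no iterative inference is needed between periods in this simplified setting.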

2603.11679 2026-05-22 cs.AI 版本更新

LLMs can construct powerful representations and streamline sample-efficient supervised learning

Ilker Demirel, Lawrence Shi, Zeshan Hussain, David Sontag

Affiliations * MIT; Harvard Medical School

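AI summary: This paper proposes an LLM-based agentic pipeline that synthesizes a global rubric to improve input representations for multimodal data. On 15 clinical tasks from the EHRSHOT benchmark, it significantly outperforms count-feature models, naive LLM baselines, and a clinical foundation model.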

详情
AI中文摘要

随着现实数据集变得更加复杂和异质化,监督学习常受到输入表示设计的瓶颈。对多模态数据(如时间序列、自由文本和结构化记录)建模通常需要非平凡的领域专业知识。我们提出了一种代理流程来简化这一过程。首先,一个LLM分析一小但多样化的文本序列输入示例,在上下文中合成一个全局rubric,该rubric作为程序化规范用于提取和组织证据。此rubric随后用于将原始文本序列转换为更标准化的格式,以供下游模型使用。我们还描述了局部rubrics,即由LLM生成的任务条件解释性摘要。在EHRSHOT基准的15个临床任务中,我们的rubric方法显著优于计数特征模型、朴素LLM基线和预训练数据量更大的临床基础模型。除了性能外,rubrics还提供了操作优势,如易于审计、规模化成本效益以及促进表格表示。

英文摘要

As real-world datasets become more complex and heterogeneous, supervised learning is often bottlenecked by input representation design. Modeling multimodal data, such as time-series, free text, and structured records, often requires non-trivial domain expertise. We propose an agentic pipeline to streamline this process. First, an LLM analyzes a small but diverse subset of text-serialized input examples in-context to synthesize a global rubric, which acts as a programmatic specification for extracting and organizing evidence. This rubric is then used to transform naive text-serializations of inputs into a more standardized format for downstream models. We also describe local rubrics, which are task-conditioned interpretive summaries generated by an LLM. Across 15 clinical tasks from the EHRSHOT benchmark, our rubric approaches significantly outperform count-feature models, naive LLM baselines, and a clinical foundation model pretrained on orders of magnitude more data. Beyond performance, rubrics offer operational advantages such as being easy to audit, cost-effectiveness at scale, and facilitating tabular representations.

2603.03784 2026-05-22 cs.AI 版本更新

Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism

通过DEVS形式化方法驱动的离散事件世界模型生成与评估

Zheyu Chen, Huiteng Zhuang, Zhuohuan Li, Chuanhao Li

Affiliations * Zhili College, Tsinghua University; School of Transportation Science and Engineering, Beihang University; Department of Industrial Engineering, Tsinghua University

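AI summary: This paper proposes synthesizing discrete-event world models online from natural-language specifications, combining the reliability of explicit simulators with the adaptability of neural models. Using the DEVS formalism and a staged LLM-based generation pipeline, it separates structural inference over component interactions from component-level event and timing logic. Generated models are evaluated on benchmark suites by checking their emitted event traces against specification-derived temporal, causal, and semantic constraints.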

Comments 36 pages, 6 figures

详情
AI中文摘要

世界模型是LLM代理在长时间范围内评估行动的核心组成部分。然而,现有研究大多集中在由物理动态或空间结构主导的环境,而许多高影响领域,如供应链、采购网络和业务流程,通过离散事件、时间约束和因果依赖演变。这些设置需要离散事件世界模型。现有构建世界模型的方法往往处于两个极端:手动工程模拟器提供一致性和可重复性,但构建和适应成本高;神经模型灵活,但长期时间推演中可能累积不一致。本文寻求一种原则性的中间方法,通过从自然语言规范中在线合成离散事件世界模型,保留显式模拟器的可靠性,同时获得神经模型的适应性。我们采用DEVS形式化方法,并引入一种分阶段的基于LLM的生成流程,将组件交互的结构推断与组件级事件和时间逻辑分开。在评估方面,我们开发了基准测试集,其中模拟器发出结构化事件轨迹,随后通过规范推导的时序、因果和语义约束进行验证。这使得可以实现可重复的验证和局部诊断。这些贡献共同产生了一种在长期时间推演中保持一致、可以从可观察行为中验证,并且可以在在线执行时高效合成的世界模型。

英文摘要

World models are central to LLM agents that must evaluate actions over long horizons. Yet much existing work focuses on environments governed by physical dynamics or spatial structure, whereas many high-impact domains, including supply chains, procurement networks, and business processes, evolve through discrete events, timing constraints, and causal dependencies. These settings call for discrete-event world models. Existing approaches to constructing world models often fall near two extremes: hand-engineered simulators provide consistency and reproducibility, but are costly to build and adapt; neural models are flexible, but can suffer from compounding inconsistency over long-horizon rollouts. We seek a principled middle ground by synthesizing discrete-event world models online from natural-language specifications, retaining the reliability of explicit simulators while gaining the adaptability of neural models. We adopt the DEVS formalism and introduce a staged LLM-based generation pipeline that separates structural inference over component interactions from component-level event and timing logic. For evaluation, we develop benchmark suites in which simulators emit structured event traces, which are then validated against specification-derived temporal, causal, and semantic constraints. This enables reproducible verification and localized diagnostics. Together, these contributions produce world models that remain consistent over long-horizon rollouts, can be verified from observable behavior, and can be synthesized efficiently on demand during online execution.
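
To make the evaluation idea above concrete, the sketch below pairs a tiny DEVS-style atomic model with a trace check. It is a hand-written illustration, not output of the paper's pipeline: the single-server `Processor`, its fixed service time, the policy of dropping arrivals while busy, and the constraint checker `done_after_service` are all assumptions.

```python
import math

class Processor:
    """Atomic DEVS-style model: serves one job at a time for a fixed duration."""
    def __init__(self, service_time=2.0):
        self.service_time = service_time
        self.job = None            # job in service, or None when idle
        self.remaining = math.inf  # time until the next internal event

    def time_advance(self):        # ta(s)
        return self.remaining

    def ext_transition(self, elapsed, job):  # delta_ext: an input arrives
        if self.job is None:
            self.job, self.remaining = job, self.service_time
        else:
            self.remaining -= elapsed  # busy: drop the arrival, keep serving

    def output(self):              # lambda(s): emitted just before delta_int
        return ("done", self.job)

    def int_transition(self):      # delta_int: service completes
        self.job, self.remaining = None, math.inf

def simulate(model, arrivals):
    """arrivals: list of (time, job) sorted by time. Returns an event trace."""
    trace, last = [], 0.0
    pending = list(arrivals)
    while pending or model.time_advance() < math.inf:
        t_int = last + model.time_advance()
        t_ext = pending[0][0] if pending else math.inf
        if t_int <= t_ext:         # internal event fires first on ties
            now = t_int
            trace.append((now,) + model.output())
            model.int_transition()
        else:
            now, job = pending.pop(0)
            trace.append((now, "arrive", job))
            model.ext_transition(now - last, job)
        last = now
    return trace

def done_after_service(trace, service_time=2.0):
    """Temporal constraint: each 'done' occurs service_time after that job arrived."""
    arrived = {job: t for t, kind, job in trace if kind == "arrive"}
    return all(abs(t - arrived[job] - service_time) < 1e-9
               for t, kind, job in trace if kind == "done")
```

Because the simulator emits a structured trace, the constraint can be checked from observable behavior alone, without inspecting the model's internals. A failing check also points at the specific event that violated it.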

2602.13294 2026-05-22 cs.CV cs.AI 版本更新

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

VisPhyWorld: 通过代码驱动的视频重建探测物理推理

Jiarong Liang, Max Ku, Ka-Hei Hui, Ping Nie, Wenhu Chen

Affiliations * University of Waterloo; Autodesk AI Lab; Independent Researcher

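AI summary: This paper proposes VisPhyWorld, a framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations, and introduces the VisPhyBench benchmark to test how well models reconstruct appearance and reproduce physically plausible motion. It finds that state-of-the-art MLLMs struggle to accurately infer physical parameters and to simulate consistent physical dynamics.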

详情
AI中文摘要

评估多模态大语言模型(MLLMs)是否真正理解物理动态仍然具有挑战性。现有的基准测试大多依赖于识别式协议,如视觉问答(VQA)和期望违反(VoE),这些协议通常可以在不承诺明确、可测试的物理假设的情况下回答。我们提出了VisPhyWorld,一个基于执行的框架,通过要求模型从视觉观察生成可执行的模拟器代码来评估物理推理能力。通过生成可运行的代码,推断的世界表示可以直接检查、编辑和验证。这将物理推理与渲染分开。基于此框架,我们引入了VisPhyBench,包含209个评估场景,这些场景源自108个物理模板和一个系统化的协议,用于评估模型在重建外观和模拟物理合理的运动方面的能力。我们的流水线在97.7%的基准运行中生成有效的重建视频之前会回退。实验表明,尽管最先进的MLLM在语义场景理解方面表现强劲,但在准确推断物理参数和模拟一致的物理动态方面存在困难。我们的代码可在https://github.com/TIGER-AI-Lab/VisPhyWorld上获得。

英文摘要

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. By producing runnable code, the inferred world representation is directly inspectable, editable, and falsifiable. This separates physical reasoning from rendering. Building on this framework, we introduce VisPhyBench, comprising 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of benchmark runs before fallback. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to accurately infer physical parameters and to simulate consistent physical dynamics. Our code is available at https://github.com/TIGER-AI-Lab/VisPhyWorld.

2602.11574 2026-05-22 cs.AI 版本更新

Learning to Configure Agentic AI Systems

学习配置代理AI系统

Aditya Taparia, Som Sagar, Ransalu Senanayake

Affiliations * School of Computing and Augmented Intelligence, Arizona State University

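AI summary: This paper formulates agent configuration as a semi-Markov decision process (SMDP) and introduces ARC, a lightweight hierarchical policy that dynamically selects query-specific agent configurations, improving reasoning accuracy, tool-use accuracy, and τ-Bench (Airline) Pass^1 success across several benchmarks.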

Comments 22 pages, 12 figures

详情
AI中文摘要

配置基于LLM的代理系统涉及从庞大的组合设计空间中选择工作流、工具、令牌预算和提示,而目前通常通过固定的模板或手工调整的启发式方法处理,这些方法无论查询难度如何都应用相同的配置,导致行为脆弱和计算浪费。为了解决这个问题,我们将代理配置建模为半马尔可夫决策过程(SMDP),其中每个配置都是一种时间扩展的选项,决定了代理系统如何处理查询,并引入了ARC(Agentic Resource & Configuration learner),一种轻量级的分层策略,能够动态选择查询特定的代理配置。在推理、工具使用和代理基准测试中,ARC在与预算匹配的工具增强LLM相比,平均推理准确性提高了31.3%,工具使用准确性提高了13.95%,并将τ-Bench(Airline)Pass的成功率从9.0%提升到18.0%。这些结果表明,学习查询特定的代理配置是“一刀切”设计的一种强大替代方案。

英文摘要

Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed templates or hand-tuned heuristics that apply the same configuration regardless of query difficulty, leading to brittle behavior and wasted compute. To address this, we formulate agent configuration as a semi-Markov decision process (SMDP) where each configuration acts as a temporally extended option that determines how an agent system processes a query, and introduce ARC (Agentic Resource & Configuration learner), a lightweight hierarchical policy that dynamically selects query-specific agent configurations. Across reasoning, tool-use, and agentic benchmarks, ARC consistently improves over budget-matched tool-augmented LLMs, increasing average reasoning accuracy by 31.3%, tool-use accuracy by 13.95%, and doubling τ-Bench (Airline) Pass^1 success from 9.0% to 18.0%. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.

2602.05286 2026-05-22 cs.LG cs.AI 版本更新

HealthMamba: An Uncertainty-aware Spatiotemporal Graph State Space Model for Effective and Reliable Healthcare Facility Visit Prediction

HealthMamba: 一种考虑不确定性的时空图状态空间模型用于有效可靠的医疗设施访问预测

Dahai Yu, Lin Jiang, Rongchao Xu, Guang Wang

Affiliations * Department of Computer Science, Florida State University

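AI summary: This paper proposes HealthMamba, an uncertainty-aware spatiotemporal graph state space model for effective and reliable healthcare facility visit prediction. It comprises three key components: a unified spatiotemporal context encoder, a new graph state space model called GraphMamba, and a comprehensive uncertainty quantification module. Experiments show improvements of around 6.0% in prediction accuracy and 3.5% in uncertainty quantification over state-of-the-art baselines.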

Comments IJCAI 2026

详情
AI中文摘要

医疗设施访问预测对于优化医疗资源配置和 informing 公共卫生政策至关重要。尽管已经采用了先进的机器学习方法以提高预测性能,但现有工作通常将此任务视为时间序列预测问题,而没有考虑不同类型的医疗设施的内在空间依赖性,且在公共紧急情况等异常情况下也无法提供可靠的预测。为了推进现有研究,我们提出了HealthMamba,一种考虑不确定性的时空框架,用于准确且可靠的医疗设施访问预测。HealthMamba包含三个关键组件:(i) 一个统一的时空上下文编码器,融合异构的静态和动态信息,(ii) 一种新的图状态空间模型称为GraphMamba用于分层时空建模,(iii) 一个综合的不确定性量化模块,整合三种不确定性量化机制以实现可靠的预测。我们在四个大规模真实世界数据集上评估了HealthMamba,这些数据集来自加州、纽约、得克萨斯州和佛罗里达州。结果表明,HealthMamba在预测准确性和不确定性量化方面分别比现有最佳基线提高了6.0%和3.5%。

英文摘要

Healthcare facility visit prediction is essential for optimizing healthcare resource allocation and informing public health policy. Despite advanced machine learning methods being employed for better prediction performance, existing works usually formulate this task as a time-series forecasting problem without considering the intrinsic spatial dependencies of different types of healthcare facilities, and they also fail to provide reliable predictions under abnormal situations such as public emergencies. To advance existing research, we propose HealthMamba, an uncertainty-aware spatiotemporal framework for accurate and reliable healthcare facility visit prediction. HealthMamba comprises three key components: (i) a Unified Spatiotemporal Context Encoder that fuses heterogeneous static and dynamic information, (ii) a novel Graph State Space Model called GraphMamba for hierarchical spatiotemporal modeling, and (iii) a comprehensive uncertainty quantification module integrating three uncertainty quantification mechanisms for reliable prediction. We evaluate HealthMamba on four large-scale real-world datasets from California, New York, Texas, and Florida. Results show HealthMamba achieves around 6.0% improvement in prediction accuracy and 3.5% improvement in uncertainty quantification over state-of-the-art baselines.

2602.03067 2026-05-22 cs.LG cs.AI cs.NA math.NA 版本更新

FlashSinkhorn: IO-Aware Entropic Optimal Transport on GPU

FlashSinkhorn: GPU上的IO感知熵最优传输

Felix X.-F. Ye, Xingjie Li, An Yu, Ming-Ching Chang, Linsong Chu, Davis Wertheimer

发表机构 * Department of Mathematics & Statistics, University at Albany, Albany, NY, USA Department of Mathematics & Statistics, University of North Carolina at Charlotte, Charlotte, NC, USA Department of Computer Science, University at Albany, Albany, NY, USA IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

AI总结 本文提出FlashSinkhorn,一种基于GPU的熵最优传输求解器,通过将稳定化的对数域Sinkhorn更新改写为逐行的LogSumExp归约,采用与Transformer注意力相同的归一化方式,从而实现FlashAttention风格的融合和分块处理,在保持线性内存操作的同时显著降低HBM IO。

详情
AI中文摘要

基于Sinkhorn迭代的熵最优传输(EOT)在现代机器学习中应用广泛,但GPU求解器在大规模情况下仍然效率低下。张量化实现因密集的n×m交互而产生二次HBM流量,而现有在线后端虽然避免存储密集矩阵,但仍依赖融合程度有限的通用分块map-reduce归约内核。我们提出FlashSinkhorn,一种针对平方欧几里得成本的IO感知EOT求解器,将稳定化的对数域Sinkhorn更新重写为对带偏置点积得分的逐行LogSumExp归约,这与Transformer注意力的归一化方式相同。这使得FlashAttention风格的融合和分块处理成为可能:融合的Triton内核通过片上SRAM流式处理分块,并在单次遍历中更新对偶势,在保持线性内存操作的同时显著减少每次迭代的HBM IO。我们进一步提供了用于传输应用的流式内核,实现了可扩展的一阶和二阶优化。在A100 GPU上,FlashSinkhorn在点云OT上相比最先进的在线基线实现了最高32倍的前向传递加速和最高161倍的端到端加速,并提高了基于OT的下游任务的可扩展性。为了可重复性,我们发布了开源实现,网址为 https://github.com/ot-triton-lab/flash-sinkhorn 。

英文摘要

Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense $n\times m$ interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce reduction kernels with limited fusion. We present FlashSinkhorn, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to $32\times$ forward-pass and $161\times$ end-to-end speedups over state-of-the-art online baselines on point-cloud OT, and improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at https://github.com/ot-triton-lab/flash-sinkhorn.
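
To make the "row-wise LogSumExp of biased dot-product scores" idea concrete, here is a minimal dense NumPy sketch of log-domain Sinkhorn for squared Euclidean cost. It is an illustration only, not the paper's fused Triton kernels: it materializes the full score matrix, and the function names (`row_logsumexp`, `sinkhorn_potentials`, `transport_plan`) and the default `eps` and `iters` are assumptions made for this example.

```python
import numpy as np

def row_logsumexp(scores):
    """Numerically stable LogSumExp over each row of a 2D array."""
    m = scores.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True))).squeeze(axis=1)

def sinkhorn_potentials(x, y, a, b, eps=0.5, iters=300):
    """Log-domain Sinkhorn for cost C_ij = ||x_i - y_j||^2 (dense sketch).

    Since -C_ij/eps = 2*x_i.y_j/eps - |x_i|^2/eps - |y_j|^2/eps, each update is a
    row-wise LogSumExp of dot-product scores plus a per-column bias, which is
    the same shape of computation as attention normalization.
    """
    x2, y2 = (x ** 2).sum(axis=1), (y ** 2).sum(axis=1)
    dots = 2.0 * (x @ y.T) / eps              # (n, m) dot-product scores
    f, g = np.zeros(len(x)), np.zeros(len(y))
    for _ in range(iters):
        bias_y = (g - y2) / eps + np.log(b)   # bias indexed by columns j
        f = x2 - eps * row_logsumexp(dots + bias_y[None, :])
        bias_x = (f - x2) / eps + np.log(a)   # bias indexed by rows i
        g = y2 - eps * row_logsumexp(dots.T + bias_x[None, :])
    return f, g

def transport_plan(x, y, a, b, f, g, eps=0.5):
    """Recover the dense plan P_ij = a_i * b_j * exp((f_i + g_j - C_ij) / eps)."""
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return a[:, None] * b[None, :] * np.exp((f[:, None] + g[None, :] - cost) / eps)
```

A tiled implementation would stream blocks of `dots` and keep a running maximum and running sum per row instead of forming the whole matrix, which is what makes the FlashAttention-style fusion described in the abstract possible.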

2602.01935 2026-05-22 cs.LG cs.AI cs.PL 版本更新

LiteCoOp: Lightweight Multi-LLM Shared-Tree Reasoning for Model-Serving Compiler Optimizations

LiteCoOp: 面向模型服务编译器优化的轻量级多LLM共享树推理

Annabelle Sujun Tang, Christopher Priebe, Lianhui Qin, Hadi Esmaeilzadeh

发表机构 * Alternative Computing Technologies (ACT) Lab(替代计算技术实验室) University of California San Diego(加州大学圣地亚哥分校)

AI总结 本文提出LiteCoOp,一种轻量级框架,通过将优化搜索树本身作为多LLM协作机制,实现编译器优化过程中异构LLM的协作,从而在降低编译成本的同时提升性能。

详情
AI中文摘要

LLM引导的编译器优化最近展现出潜力,但现有方法在整个搜索过程中依赖单一的大型LLM,成本高昂且将较小的模型排除在外。我们提出如下研究问题:异构LLM能否在编译器优化过程中协作,同时将编译成本降至由单一大型LLM引导的优化之下。关键的是,这必须在不引入代理框架开销的情况下实现,否则会与降低编译成本的目标相悖。为实现这些相互竞争的目标,我们引入了LiteCoOp,一种轻量级框架,将优化搜索树本身作为多LLM协作的机制,使异构模型能够共享进展而无需外部代理协调。在每个优化步骤中,LiteCoOp查询一个LLM,由其既提出编译器变换,又选择下一步要查询的LLM。这些LLM的提议被记录在共享的MCTS树中,因此所有模型虽被串行调用,却能借鉴彼此的决策。共享的MCTS对奖励进行反向传播,使一个模型取得的进展能够影响其他模型后续的决策。这使得MCTS树本身成为协作推理的机制,避免了模型间通信、繁重的推理轨迹或代理基础设施。我们通过LLM感知的UCT将这一想法实例化,该方法使模型选择偏向较小的LLM以降低成本,同时仍保持编译器性能目标。在多样化的GPU和(CPU)基准测试中,LiteCoOp始终优于单模型基线,且在将协作扩展到八个异构LLM时取得最佳结果。这一八模型配置将总编译时间减少1.95x(1.74x),将API成本降低4.47x(4.32x),仅在总调用的23.1%(23.9%)中调用最大模型,同时展示了协作的可扩展性。

英文摘要

LLM-guided compiler optimization has recently shown promise, but existing approaches rely on a single large LLM throughout search, making them expensive and excluding smaller models. We pose the research question: whether heterogeneous LLMs can collaborate during compiler optimization while reducing compilation cost below optimization guided by a single large LLM. Crucially, this must be achieved without introducing overhead from agentic frameworks, which would run counter to the goal of lower compilation cost. To achieve these competing objectives, we introduce LiteCoOp, a lightweight framework that turns the optimization search tree itself into the mechanism for multi-LLM collaboration, enabling heterogeneous models to share progress without external agentic coordination. At each optimization step, LiteCoOp queries one LLM both to propose a compiler transformation and to select the LLM to query at the next step. These LLM proposals are recorded in a shared MCTS tree, so all models are invoked serially and yet are informed by each other's decisions. The shared MCTS backpropagates the rewards, allowing progress made by one model to influence later decisions by others. This makes the MCTS tree the collaborative reasoning mechanism itself, avoiding inter-model communication, heavy reasoning traces, or agentic infrastructure. We instantiate this idea with an LLM-aware UCT that biases model selection toward smaller LLMs to reduce cost while still preserving the compiler performance objective. Across diverse GPU and (CPU) benchmarks, LiteCoOp consistently outperforms single-model baselines, with the best results obtained when scaling collaboration to eight heterogeneous LLMs. This eight-model config reduces total compilation time by 1.95x (1.74x), reduces API cost by 4.47x (4.32x), and invokes the largest model for only 23.1% (23.9%) of total calls while demonstrating collaboration scalability.

2602.00851 2026-05-22 cs.AI cs.MA 版本更新

Understanding Persuasion in Long-Running Agents

理解长期运行代理中的说服

Hyejun Jeong, Amir Houmansadr, Shlomo Zilberstein, Eugene Bagdasarian

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

AI总结 本文研究了长期任务中代理受到用户说服影响的行为变化,提出了一种基于行为的评估框架,发现提前指定信念状态的代理在搜索和源访问上表现更高效,表明说服影响代理行为。

Comments Code available at https://github.com/HyejunJeong/persuasion-propagation

详情
AI中文摘要

现代AI代理越来越多地结合对话交互与自主任务执行,例如编码和网络研究,这引发了一个自然问题:当一个从事长期任务的代理受到用户说服时会发生什么?然而研究这一可能性具有挑战性,因为长期运行的代理行为具有噪声且难以重复,而且不清楚只有在扩展任务执行中才会出现哪些独特挑战。我们研究了信念层面干预如何影响下游任务行为,这种现象我们称之为说服传播。我们介绍了一种以行为为中心的评估框架,区分在任务执行期间或之前应用的说服。在网页研究和编码任务中,我们发现即时说服导致的行为影响弱且不一致。相反,当在任务时间显式指定信念状态时,信念预填充的代理平均进行26.9%更少的搜索,并访问16.9%更少的唯一来源,比中性预填充的代理。这些结果表明,即使在之前的交互中,说服也会影响代理的行为,从而推动对代理系统的行为层面评估。

英文摘要

Modern AI agents increasingly combine conversational interaction with autonomous task execution, such as coding and web research, raising a natural question: What happens when an agent engaged in long-horizon tasks is exposed to user persuasion? Yet studying this possibility is challenging because long-running agent behavior is noisy and costly to reproduce, and it remains unclear which unique challenges emerge only in extended task execution. We study how belief-level intervention can influence downstream task behavior, a phenomenon we name persuasion propagation. We introduce a behavior-centered evaluation framework that distinguishes between persuasion applied during or prior to task execution. Across web research and coding tasks, we find that on-the-fly persuasion induces weak and inconsistent behavioral effects. In contrast, when the belief state is explicitly specified at task time, belief-prefilled agents conduct on average 26.9% fewer searches and visit 16.9% fewer unique sources than neutral-prefilled agents. These results suggest that persuasion, even in prior interaction, can affect the agent's behavior, motivating behavior-level evaluation in agentic systems.

2601.23219 2026-05-22 cs.MA cs.AI 版本更新

MonoScale: Scaling Multi-Agent System with Monotonic Improvement

MonoScale: 通过单调改进扩展多智能体系统

Shuai Shao, Yixiang Liu, Bingwei Lu, Weinan Zhang

发表机构 * Shanghai Jiao Tong University, Shanghai, China(上海交通大学,上海,中国) Shanghai Innovation Institute, Shanghai, China(上海创新研究院,上海,中国)

AI总结 本文提出MonoScale框架,通过生成以智能体为条件的熟悉任务、收集交互证据并将其转化为可审计的自然语言记忆,实现多智能体系统的单调性能提升,实验表明其在GAIA和Humanity's Last Exam任务中优于简单扩展和强路由固定池基线。

详情
AI中文摘要

近年来,基于大语言模型的多智能体系统(MAS)发展迅速,利用路由器分解任务并将子任务委托给专门的智能体。扩展能力的自然方法是通过持续集成新的功能智能体或工具接口来扩大智能体池,但当路由器在新加入的、异构且不可靠的智能体上冷启动时,盲目扩展可能引发性能崩溃。我们提出MonoScale,一种扩展感知的更新框架,主动生成少量以智能体为条件的熟悉任务,从成功和失败的交互中收集证据,并将其提炼为可审计的自然语言记忆以指导未来的路由。我们将顺序增强形式化为上下文老虎机问题,并执行信任区域记忆更新,从而在各轮接入中获得性能单调不减的保证。在GAIA和Humanity's Last Exam上的实验表明,随着智能体池的增长,性能稳定提升,优于简单扩展和强路由固定池基线。

英文摘要

In recent years, LLM-based multi-agent systems (MAS) have advanced rapidly, using a router to decompose tasks and delegate subtasks to specialized agents. A natural way to expand capability is to scale up the agent pool by continually integrating new functional agents or tool interfaces, but naive expansion can trigger performance collapse when the router cold-starts on newly added, heterogeneous, and unreliable agents. We propose MonoScale, an expansion-aware update framework that proactively generates a small set of agent-conditioned familiarization tasks, harvests evidence from both successful and failed interactions, and distills it into auditable natural-language memory to guide future routing. We formalize sequential augmentation as a contextual bandit and perform trust-region memory updates, yielding a monotonic non-decreasing performance guarantee across onboarding rounds. Experiments on GAIA and Humanity's Last Exam show stable gains as the agent pool grows, outperforming naive scale-up and strong-router fixed-pool baselines.

2601.15671 2026-05-22 cs.HC cs.AI 版本更新

StreetDesignAI: Broadening Designer Perspectives Through Multi-Persona Evaluation of Cycling Infrastructure

StreetDesignAI: 通过对骑行基础设施的多角色评估拓宽设计师视角

Ziyi Wang, Yilong Dai, Duanya Lyu, Mateo Nader, Sihan Chen, Wanghao Ye, Zijian Ding, Xiang Yan

发表机构 * University of Maryland, College Park(马里兰大学帕克分校) University of Alabama(阿拉巴马大学) University of Florida(佛罗里达大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出StreetDesignAI,通过多角色评估方法帮助设计师更全面地理解骑行者需求,提升设计决策能力。

详情
AI中文摘要

设计骑行基础设施需要平衡不同用户群体相互竞争的需求,但设计师往往难以预见不同骑行者对同一街道环境的体验差异。本文探讨了基于角色的评估如何通过在设计过程中显式化体验冲突来支持骑行基础设施设计。基于与12名领域专家的形成性研究以及427名骑行者的众包可骑行性评估,我们提出了StreetDesignAI,一个交互系统,使设计师能够(1)通过影像和地图数据将评估扎根于真实的街道环境;(2)接收来自模拟骑行者角色(从自信到谨慎用户)的并行反馈;(3)在系统揭示不同视角冲突的同时迭代修改设计。一项由26名交通专业人员参与、将StreetDesignAI与通用AI聊天机器人进行比较的组内研究显示,结构化的多视角反馈显著拓宽了设计师对各类骑行者视角的理解,并提升了其识别多样化角色需求的能力以及将这些需求转化为设计决策的信心。参与者还报告了显著更高的总体满意度和更强的在专业实践中使用该系统的意愿。定性发现进一步揭示了显式的冲突呈现如何将设计探索从单视角优化转变为有意的权衡推理。我们讨论了这些发现对以分歧作为交互原语来支持角色感知设计的AI辅助工具的启示。

英文摘要

Designing cycling infrastructure requires balancing the competing needs of diverse user groups, yet designers often struggle to anticipate how different cyclists experience the same street environment. We investigate how persona-based evaluation can support cycling infrastructure design by making experiential conflicts explicit during the design process. Informed by a formative study with 12 domain experts and crowdsourced bikeability assessments from 427 cyclists, we present StreetDesignAI, an interactive system that enables designers to (1) ground evaluation in real street context through imagery and map data, (2) receive parallel feedback from simulated cyclist personas spanning confident to cautious users, and (3) iteratively modify designs while the system surfaces conflicts across perspectives. A within-subjects study with 26 transportation professionals comparing StreetDesignAI against a general-purpose AI chatbot demonstrates that structured multi-perspective feedback significantly broadens designers' understanding of various cyclists' perspectives, ability to identify diverse persona needs, and confidence in translating those needs into design decisions. Participants also reported significantly higher overall satisfaction and stronger intention to use the system in professional practice. Qualitative findings further illuminate how explicit conflict surfacing transforms design exploration from single-perspective optimization toward deliberate trade-off reasoning. We discuss implications for AI-assisted tools that scaffold persona-aware design through disagreement as an interaction primitive.

2601.11650 2026-05-22 physics.chem-ph cs.AI 版本更新

Large Language Model Agent for User-friendly Chemical Process Simulations

面向用户友好的化学过程模拟的大语言模型代理

Jingkang Liang, Niklas Groll, Gürkan Sin

发表机构 * Process and System Engineering Center(过程与系统工程中心) Department of Chemical and Biochemical Engineering(化学与生物化学工程系) Technical University of Denmark(丹麦技术大学)

AI总结 本文提出一种基于大语言模型的代理,通过Model Context Protocol与AVEVA Process Simulation集成,实现自然语言交互进行化学过程模拟,提升非专业用户对复杂过程设计、仿真和优化的访问能力。

详情
AI中文摘要

现代过程仿真器能够实现详细的工艺设计、仿真和优化;然而,构建和解释仿真过程耗时且需要专业知识,限制了非专业用户早期探索。为此,本文提出将大语言模型(LLM)代理通过Model Context Protocol(MCP)集成到AVEVA Process Simulation(APS)中,允许通过自然语言与严谨的过程仿真进行交互。MCP服务器工具集使LLM能够通过Python编程与APS通信,从而从普通语言指令中执行复杂的仿真任务。两个水-甲醇分离案例研究评估了该框架在不同任务复杂性和交互模式下的表现。第一个案例展示了代理能够自主分析流程图,发现改进机会,迭代优化,提取数据并清晰呈现结果。该框架在教育目的上能够将技术概念转化为流程,同时为有经验的从业者自动化数据提取,加快常规任务并支持头脑风暴。第二个案例研究通过逐步对话和单提示两种方式评估了自主流程图合成的潜力,展示了其对初学者和专家的适用性。逐步模式提供可靠且指导性的构建,适合教育环境;单提示模式快速构建基础流程图供后续优化。尽管当前的局限性如过度简化、计算错误和技术问题仍需专家监督,但该框架在分析、优化和引导构建方面的能力表明,基于LLM的代理可以成为有价值的协作伙伴。

英文摘要

Modern process simulators enable detailed process design, simulation, and optimization; however, constructing and interpreting simulations is time-consuming and requires expert knowledge. This limits early exploration by inexperienced users. To address this, a large language model (LLM) agent is integrated with AVEVA Process Simulation (APS) via Model Context Protocol (MCP), allowing natural language interaction with rigorous process simulations. An MCP server toolset enables the LLM to communicate programmatically with APS using Python, allowing it to execute complex simulation tasks from plain-language instructions. Two water-methanol separation case studies assess the framework across different task complexities and interaction modes. The first shows the agent autonomously analyzing flowsheets, finding improvement opportunities, and iteratively optimizing, extracting data, and presenting results clearly. The framework benefits both educational purposes, by translating technical concepts and demonstrating workflows, and experienced practitioners by automating data extraction, speeding routine tasks, and supporting brainstorming. The second case study assesses autonomous flowsheet synthesis through both a step-by-step dialogue and a single prompt, demonstrating its potential for novices and experts alike. The step-by-step mode gives reliable, guided construction suitable for educational contexts; the single-prompt mode constructs fast baseline flowsheets for later refinement. While current limitations such as oversimplification, calculation errors, and technical hiccups mean expert oversight is still needed, the framework's capabilities in analysis, optimization, and guided construction suggest LLM-based agents can become valuable collaborators.

2601.10348 2026-05-22 cs.CL cs.AI cs.LG 版本更新

Training-Trajectory-Aware Token Selection

基于训练轨迹的token选择

Zhanming Shen, Jiaqi Hu, Zeyu Qin, Hao Chen, Wentao Ye, Zenan Huang, Yihong Zhuang, Guoshan Lu, Junlin Zhou, Junbo Zhao

发表机构 * Zhejiang University(浙江大学) Hong Kong University of Science and Technology(香港科技大学)

AI总结 本文提出T3S方法,通过在token层面重构训练目标,清除未学习token的优化路径,从而在连续蒸馏中提升性能,实验表明在AR和dLLM设置中均取得显著效果。

Comments Accepted by ICML 2026

详情
AI中文摘要

高效的蒸馏是将昂贵的推理能力转化为可部署效率的关键途径,然而在前沿领域中,当学生模型已具备较强的推理能力时,朴素的连续蒸馏往往产生有限的收益甚至退化。我们观察到一种训练特征现象:即使损失单调下降,所有性能指标在几乎相同的瓶颈处会突然大幅下降,然后逐渐恢复。我们进一步揭示了token层面的机制:置信度会分裂成稳步增加的模仿锚点token,快速锚定优化,以及尚未学习的token,其置信度被抑制直到瓶颈之后。这两种类型token无法共存的特性是连续蒸馏失败的根本原因。为此,我们提出了基于训练轨迹的token选择(T3S)方法,以在token层面重建训练目标,清除未学习token的优化路径。T3S在AR和dLLM设置中均取得一致的收益:仅用数百个示例,Qwen3-8B在竞争性推理基准上超越DeepSeek-R1,Qwen3-32B接近Qwen3-235B,且T3训练的LLaDA-2.0-Mini超越其AR基线,达到所有16B级模型中的最先进性能。

英文摘要

Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The inability of these two types of tokens to coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.

2512.16739 2026-05-22 cs.AI 版本更新

AI-Driven Prediction of Cancer Pain Episodes: A Hybrid Decision Support Approach

基于AI的癌症疼痛发作预测:一种混合决策支持方法

Yipeng Zhuang, Yifeng Guo, Yuewen Li, Yuheng Wu, Philip Leung-Ho Yu, Tingting Song, Zhiyong Wang, Kunzhong Zhou, Weifang Wang, Li Zhuang

发表机构 * The University of Hong Kong(香港大学) Peking University Cancer Hospital Yunnan Hospital, The Third Affiliated Hospital of Kunming Medical University(北京大学肿瘤医院云南医院,昆明医科大学第三附属医院)

AI总结 本研究提出了一种混合机器学习和大语言模型的方法,利用结构化和非结构化电子健康记录数据预测癌症患者在住院48和72小时内疼痛发作,通过整合时间序列药物趋势和模糊剂量记录,提高了敏感性和可解释性,实现了87.6%和91.7%的准确率。

详情
AI中文摘要

肺癌患者经常经历突破性疼痛发作,高达91%的患者需要及时干预。为了实现主动疼痛管理,我们提出了一种混合机器学习和大语言模型的管道,利用结构化和非结构化的电子健康记录数据预测住院48和72小时内的疼痛发作。分析了266名住院患者的历史队列,特征包括人口统计学数据、肿瘤分期、生命体征和WHO分级镇痛药使用情况。机器学习模块捕捉时间序列药物趋势,而大语言模型解释模糊的剂量记录和自由文本临床笔记。整合这些模态提高了灵敏度和可解释性。我们的框架在48小时和72小时的准确率分别为0.876和0.917,灵敏度分别提高了10.6%和10.7%,归因于大语言模型的增强。这种混合方法提供了一种临床可解释且可扩展的工具,用于早期疼痛发作预测,有望提高治疗精准度并优化肿瘤学护理中的资源分配。

英文摘要

Lung cancer patients frequently experience breakthrough pain episodes, with up to 91% requiring timely intervention. To enable proactive pain management, we propose a hybrid machine learning and large language model pipeline that predicts pain episodes within 48 and 72 hours of hospitalization using both structured and unstructured electronic health record data. A retrospective cohort of 266 inpatients was analyzed, with features including demographics, tumor stage, vital signs, and WHO-tiered analgesic use. The machine learning module captured temporal medication trends, while the large language model interpreted ambiguous dosing records and free-text clinical notes. Integrating these modalities improved sensitivity and interpretability. Our framework achieved an accuracy of 0.876 (48h) and 0.917 (72h), with improvements in sensitivity of 10.6% and 10.7%, respectively, attributable to large language model augmentation. This hybrid approach offers a clinically interpretable and scalable tool for early pain episode forecasting, with potential to enhance treatment precision and optimize resource allocation in oncology care.

2512.11484 2026-05-22 cs.CR cs.AI 版本更新

Capacitive Touchscreens at Risk: Recovering Handwritten Trajectory on Smartphone via Electromagnetic Emanations

容性触屏面临风险:通过电磁辐射恢复智能手机上的手写轨迹

Yukun Cheng, Shiyu Zhu, Changhai Ou, Xingshuo Han, Yuan Li, Shihui Zheng

发表机构 * Wuhan University(武汉大学) Nanjing University of Aeronautics and Astronautics(南京航空航天大学) National University of Defense Technology(国防科学技术大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 本文揭示并利用了容性触屏的电磁侧信道漏洞,通过捕获屏幕书写时产生的电磁信号,实时回归二维手写轨迹。研究提出TESLA攻击框架,展示了在现实攻击条件下恢复高度可识别的手写轨迹的能力。

详情
AI中文摘要

本文揭示并利用了容性触屏的电磁侧信道漏洞:电磁(EM)侧信道泄露了足够的信息,可以恢复细粒度的连续手写轨迹。我们提出了Touchscreen Electromagnetic Side-channel Leakage Attack(TESLA),一种非接触攻击框架,该框架捕获屏幕书写过程中生成的电磁信号,并实时将其回归为二维(2D)手写轨迹。在各种商用现成(COTS)智能手机上的广泛评估显示,TESLA实现了77%的字符识别准确率和0.74的Jaccard指数,证明了其在现实攻击条件下恢复高度可识别的轨迹的能力,这些轨迹与原始手写非常相似。

英文摘要

This paper reveals and exploits a critical security vulnerability: the electromagnetic (EM) side channel of capacitive touchscreens leaks sufficient information to recover fine-grained, continuous handwriting trajectories. We present Touchscreen Electromagnetic Side-channel Leakage Attack (TESLA), a non-contact attack framework that captures EM signals generated during on-screen writing and regresses them into two-dimensional (2D) handwriting trajectories in real time. Extensive evaluations across a variety of commercial off-the-shelf (COTS) smartphones show that TESLA achieves 77% character recognition accuracy and a Jaccard index of 0.74, demonstrating its capability to recover highly recognizable motion trajectories that closely resemble the original handwriting under realistic attack conditions.

2512.02193 2026-05-22 cs.AI 版本更新

From monoliths to modules: Decomposing transducers for efficient world modelling

从整体到模块:分解转换器以实现高效的世界建模

Alexander Boyd, Franz Nowak, David Hyland, Manuel Baltieri, Fernando E. Rosas

发表机构 * Department of Informatics, University of Sussex(Sussex大学信息学院) Beyond Institute for Theoretical Science (BITS)(理论科学研究所) ETH Zürich(苏黎世联邦理工学院) Principles of Intelligent Behaviour in Biological and Social Systems (PIBBSS)(生物和社会系统智能行为原理研究所) Department of Computer Science, University of Oxford(牛津大学计算机科学系) Araya Inc.(Araya公司) Sussex AI and Sussex Centre for Consciousness Science, University of Sussex(Sussex大学人工智能与意识科学中心) Centre for Complexity Science and Center for Psychedelic Research, Department of Brain Sciences, Imperial College London(复杂科学中心和迷幻研究中心,伦敦帝国理工学院脑科学系) Center for Eudaimonia and Human Flourishing, University of Oxford(幸福与人类繁荣中心,牛津大学)

AI总结 本文提出了一种分解复杂世界模型的方法,通过转换器框架将世界模型分解为多个模块,从而提高计算效率并支持分布式推理,为AI安全和现实应用提供基础。

详情
AI中文摘要

世界模型最近被提出作为AI代理在部署前进行训练和评估的沙盒环境。尽管逼真的世界模型通常计算需求很高,但现实世界场景中的子组件往往以模块化方式交互,利用这一事实通常可以缓解该问题。在本文中,我们通过开发一个框架来分解由转换器表示的复杂世界模型,从而探索这一想法;转换器是一类推广了POMDP的模型。尽管转换器的组合已被深入理解,我们的结果阐明了如何通过推导在不同输入-输出子空间上操作的子转换器来反转这一过程,从而为整体式世界建模提供可并行化且可解释的替代方案,并支持分布式推理。总体而言,这些结果为连接现实世界推理所需的计算效率与AI安全所要求的结构透明性奠定了基础。

英文摘要

World models have been recently proposed as sandbox environments in which AI agents can be trained and evaluated before deployment. While realistic world models often have high computational demands, this can often be alleviated by exploiting the fact that real-world scenarios tend to involve subcomponents that interact in a modular manner. In this paper, we explore this idea by developing a framework for decomposing complex world models represented by transducers, a class of models generalising POMDPs. Whereas the composition of transducers is well understood, our results clarify how to invert this process by deriving sub-transducers operating on distinct input-output subspaces, enabling parallelizable and interpretable alternatives to monolithic world modelling that can support distributed inference. Overall, these results lay groundwork for bridging the computational efficiency required for real-world inference and the structural transparency demanded by AI safety.

2511.07885 2026-05-22 cs.DC cs.AI cs.CL cs.LG 版本更新

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

每瓦智能:衡量本地AI的智能效率

Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré

发表机构 * Stanford University(斯坦福大学) Together AI

AI总结 本文研究了本地AI在能源效率和性能上的表现,提出了一种统一的衡量指标IPW,展示了本地推理在重新分配需求方面的能力,并揭示了本地加速器的优化潜力。

详情
AI中文摘要

大型语言模型(LLM)查询主要由集中式云基础设施中的前沿模型处理。需求增长对这一范式造成的压力超过了提供商的扩展速度。两项进展创造了重新思考这一范式的机会:小型本地LM(<=20B活跃参数)如今在许多任务上能取得与前沿模型相当的性能,而本地加速器(如Apple M4 Max)可以以交互式延迟承载这些模型。这引出了一个问题:本地推理能否切实可行地从集中式基础设施中重新分配需求?这需要测量本地LM能否准确回答现实世界的查询,以及能否在功率受限的设备(如笔记本电脑)上高效地做到这一点。我们提出每瓦智能(IPW),即单位功率的任务准确度,作为衡量不同模型-加速器配置下本地推理能力与效率的统一指标。我们评估了20多个最先进的本地LM、8种硬件加速器(本地和云)以及100万条现实世界的单轮聊天和推理查询。对于每个查询,我们测量了准确性(本地LM对前沿模型的胜率)、能耗、延迟和功率。我们发现三个关键结果。首先,本地LM成功回答了88.7%的这些查询,准确性因领域而异。其次,2023-2025年的纵向分析显示IPW提高了5.3倍,由算法和加速器两方面的进步驱动,本地可服务查询的覆盖率从23.2%上升到71.3%。第三,在运行相同模型时,本地加速器实现的IPW至少比云加速器低1.4倍,揭示了本地加速器优化的巨大空间。这些发现表明,对于相当一部分查询,本地推理可以有意义地从集中式基础设施中重新分配需求,而IPW是跟踪这一转变的关键指标。

英文摘要

Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Demand growth strains this paradigm faster than providers can scale. Two advances create an opportunity to rethink it: small, local LMs (<=20B active parameters) now achieve performance competitive with frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) can host these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? This requires measuring both whether local LMs can accurately answer real-world queries and whether they can do so efficiently on power-constrained devices (e.g., laptops). We propose intelligence per watt (IPW), task accuracy per unit of power, as a unified metric for the capability and efficiency of local inference across model-accelerator configurations. We evaluate 20+ state-of-the-art local LMs, 8 hardware accelerators (local and cloud), and 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy (local LM win rate against frontier models), energy, latency, and power. We find three key results. First, local LMs successfully answer 88.7% of these queries, with accuracy varying by domain. Second, longitudinal analysis from 2023-2025 shows IPW improved 5.3x, driven by both algorithmic and accelerator advances, with locally-serviceable query coverage rising from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for local accelerator optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure for a substantial subset of queries, with IPW serving as the critical metric for tracking this transition.
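
As a rough illustration of the metric defined above (task accuracy per unit of power), the sketch below computes IPW for one model-accelerator configuration. The `Run` record, its field names, the use of fraction-correct as the accuracy measure, and deriving mean power as energy divided by wall-clock time are all assumptions made for this example; the abstract itself measures accuracy as a win rate against frontier models and does not specify these details.

```python
from dataclasses import dataclass

@dataclass
class Run:
    """Hypothetical measurement record for one model-accelerator configuration."""
    correct: int          # queries judged as answered correctly (assumed accuracy proxy)
    total: int            # queries attempted
    energy_joules: float  # total energy drawn while serving the queries
    seconds: float        # total wall-clock serving time

def intelligence_per_watt(run: Run) -> float:
    """Accuracy divided by mean power draw in watts (joules per second)."""
    accuracy = run.correct / run.total
    mean_power_watts = run.energy_joules / run.seconds
    return accuracy / mean_power_watts

# Made-up numbers for two hypothetical configurations, for illustration only.
config_a = Run(correct=80, total=100, energy_joules=4000.0, seconds=100.0)
config_b = Run(correct=90, total=100, energy_joules=30000.0, seconds=100.0)
ratio = intelligence_per_watt(config_a) / intelligence_per_watt(config_b)
```

Comparing configurations by the ratio of their IPW values, as in the last line, mirrors how the abstract reports relative figures such as "5.3x" and "1.4x".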

2511.06428 2026-05-22 cs.SE cs.AI 版本更新

Walking the Tightrope of LLMs for Software Development: A Practitioners' Perspective

在软件开发中走LLM的钢丝:从业者视角

Samuel Ferino, Rashina Hoda, John Grundy, Christoph Treude

发表机构 * Faculty of Information Technology, Monash University(蒙纳士大学信息技术学院) School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院)

AI总结 本文从软件开发者视角出发,研究LLMs对软件开发的影响及管理方法,通过22次访谈和STGT4DA分析方法,揭示了LLMs在个体、团队、组织和社会层面的利弊,并提出了缓解挑战的可行建议。

详情
AI中文摘要

背景:大型语言模型(LLMs)的出现有可能引发软件开发领域的革命(例如自动化流程、劳动力转型)。尽管已有研究开始探讨LLMs对软件开发的感知影响,但需要实证研究来理解如何平衡使用LLMs的正反作用。目标:我们研究了LLMs对软件开发的影响以及如何从软件开发者视角管理其影响。方法:我们进行了22次软件从业者的访谈,数据收集和分析跨越了2024年10月至2025年9月的三轮数据收集和分析。我们采用了社会技术扎根理论(STGT4DA)对访谈参与者的回答进行严格分析。结果:我们识别了使用LLMs在个体、团队、组织和社会层面的益处(例如维持开发者流程、提高开发者心理模型、促进创业)和挑战(例如损害开发者声誉),并提出了缓解这些挑战的可行建议。结论:关键在于我们提出了软件从业者、团队和组织在使用LLMs时所面临的权衡。我们的发现对软件团队领导者和IT经理评估其特定环境中LLMs的可行性特别有用。

英文摘要

Background: Large Language Models emerged with the potential of provoking a revolution in software development (e.g., automating processes, workforce transformation). Although studies have started to investigate the perceived impact of LLMs for software development, there is a need for empirical studies to comprehend how to balance forward and backward effects of using LLMs. Objective: We investigated how LLMs impact software development and how to manage the impact from a software developer's perspective. Method: We conducted 22 interviews with software practitioners across 3 rounds of data collection and analysis, between October 2024 and September 2025. We employed Socio-Technical Grounded Theory for Data Analysis (STGT4DA) to rigorously analyse interview participants' responses. Results: We identified the benefits (e.g., maintain developer flow, improve developer mental models, and foster entrepreneurship) and challenges (e.g., damage to developers' reputation) of using LLMs at individual, team, organisation, and society levels, as well as actionable guidance on how to mitigate these challenges. Conclusion: Critically, we present the trade-offs that software practitioners, teams, and organisations face in working with LLMs. Our findings are particularly useful for software team leaders and IT managers to assess the viability of LLMs within their specific context.

2510.07962 2026-05-22 cs.CL cs.AI 版本更新

LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

LightReasoner: 小型语言模型能否教会大型语言模型推理?

Jingyuan Wang, Yankai Chen, Zhonghang Li, Chao Huang

发表机构 * University of Hong Kong(香港大学) University of Chicago(芝加哥大学)

AI总结 本文提出LightReasoner框架,通过利用强专家模型与弱业余模型之间的行为差异,发现高价值推理时刻,从而提升大型语言模型的推理能力,同时减少资源消耗。

Comments Updated to ACL 2026 camera-ready version with improved method presentation, expanded related work discussion, additional analyses, and presentation refinements

详情
AI中文摘要

大型语言模型(LLMs)在推理方面取得了显著进展,这通常通过监督微调(SFT)实现。然而,SFT资源消耗大,依赖大规模精选数据集、拒绝采样得到的演示以及对所有token的统一优化,尽管只有一小部分token具有实际学习价值。在本工作中,我们探索了一个反直觉的想法:小型语言模型(SLMs)能否通过揭示体现大型语言模型(LLMs)独特优势的高价值推理时刻来教导后者?我们提出了LightReasoner,一种新的框架,利用较强的专家模型(LLM)与较弱的业余模型(SLM)之间的行为差异。LightReasoner分为两个阶段:(1)采样阶段通过专家-业余对比定位关键推理时刻,并构建捕捉专家优势的监督示例;(2)微调阶段将专家模型与这些提炼出的示例对齐,放大其推理优势。在七个数学基准测试中,LightReasoner将准确率最高提升28.1%,同时将时间消耗减少90%,采样问题减少80%,调优token使用量减少99%,且完全不依赖真实标签。通过将较弱的SLM转化为有效的教学信号,LightReasoner提供了一种可扩展且资源高效的提升LLM推理能力的方法。代码可在 https://github.com/HKUDS/LightReasoner 获取。

英文摘要

Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner

2510.04280 2026-05-22 cs.LG cs.AI cs.RO 版本更新

A KL-regularization Framework for Learning to Plan with Adaptive Priors

一种基于KL正则化的学习规划框架:具有自适应先验的规划

Álvaro Serra-Gomez, Daniel Jarne Ornia, Dhruva Tirumala, Thomas Moerland

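Affiliations * LIACS, Leiden University, Leiden, The Netherlands; Google DeepMind, London, United Kingdom; University of Oxford, Oxford, United Kingdom

AI summary This paper proposes a KL-regularized framework for learning to plan that integrates the planner's action distribution as a prior in policy optimization, improving the sample efficiency and long-term performance of model-based reinforcement learning on high-dimensional continuous control tasks.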

Comments Published at ICML2026

详情
AI中文摘要

有效的探索仍然是模型驱动强化学习(MBRL)中的核心挑战,尤其是在高维连续控制任务中,样本效率至关重要。近期的一项重要工作利用学习的策略作为模型预测路径积分(MPPI)规划的提案分布。初始方法在更新采样策略时独立于规划器分布,通常通过确定性策略梯度和熵正则化最大化学习的价值函数。然而,由于训练过程中遇到的状态依赖于MPPI规划器,使采样策略与规划器对齐可以提高价值估计的准确性以及长期性能。为此,近期的方法通过最小化KL散度到规划器分布或引入规划器引导的正则化来更新采样策略。在本文中,我们通过引入策略优化-模型预测控制(PO-MPC),将这些基于MPPI的强化学习方法统一到一个框架中,这是一种整合规划器动作分布作为先验的KL正则化MBRL方法家族。通过使学习的策略与规划器的行为对齐,PO-MPC允许在回报最大化和KL散度最小化之间更灵活的策略更新。我们澄清了先前方法如何作为该家族的特殊案例出现,并探索了之前未研究的变体。我们的实验表明,这些扩展配置产生了显著的性能提升,推动了基于MPPI的强化学习的前沿。

英文摘要

Effective exploration remains a central challenge in model-based reinforcement learning (MBRL), particularly in high-dimensional continuous control tasks where sample efficiency is crucial. A prominent line of recent work leverages learned policies as proposal distributions for Model-Predictive Path Integral (MPPI) planning. Initial approaches update the sampling policy independently of the planner distribution, typically maximizing a learned value function with deterministic policy gradient and entropy regularization. However, because the states encountered during training depend on the MPPI planner, aligning the sampling policy with the planner improves the accuracy of value estimation and long-term performance. To this end, recent methods update the sampling policy by minimizing KL divergence to the planner distribution or by introducing planner-guided regularization into the policy update. In this work, we unify these MPPI-based reinforcement learning methods under a single framework by introducing Policy Optimization-Model Predictive Control (PO-MPC), a family of KL-regularized MBRL methods that integrate the planner's action distribution as a prior in policy optimization. By aligning the learned policy with the planner's behavior, PO-MPC allows more flexibility in the policy updates to trade off Return maximization and KL divergence minimization. We clarify how prior approaches emerge as special cases of this family, and we explore previously unstudied variations. Our experiments show that these extended configurations yield significant performance improvements, advancing the state of the art in MPPI-based RL.
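One generic way to write a KL-regularized policy objective with a planner prior is sketched below. The symbols $\pi_\theta$ (learned policy), $\pi_{\mathrm{MPPI}}$ (planner action distribution), $Q$ (learned value function), $\mathcal{D}$ (visited states), and $\lambda$ (regularization weight), as well as the direction of the KL term, are assumptions for illustration; the abstract does not give the exact objective used in PO-MPC.

```latex
\max_{\theta}\;
\mathbb{E}_{s \sim \mathcal{D}}\Big[
  \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[ Q(s, a) \big]
  \;-\; \lambda\, D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid s) \,\|\, \pi_{\mathrm{MPPI}}(\cdot \mid s) \big)
\Big]
```

Under this assumed form, a large $\lambda$ pulls the policy toward the planner's distribution, while $\lambda \to 0$ recovers pure value maximization, which matches the two kinds of earlier methods the abstract describes as special cases.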

2509.22795 2026-05-22 eess.SP cs.AI cs.SY eess.SY 版本更新

Generative Modeling and Decision Fusion for Unknown Event Detection and Classification Using Synchrophasor Data

基于同步相量数据的未知事件检测与分类的生成建模与决策融合

Yi Hu, Zheyuan Cheng

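Affiliations * Department of Electrical and Computer Engineering, Michigan Technological University; Quanta Technology

AI summary This paper proposes a framework that combines generative modeling, sliding-window temporal processing, and decision fusion for robust event detection and classification from synchrophasor data. A variational autoencoder-generative adversarial network models normal operating conditions, and two complementary decision strategies improve detection accuracy and robustness.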

Comments 10 pages

Journal ref IEEE Transactions on Industrial Informatics, 2026

详情
AI中文摘要

可靠的电力系统事件检测和分类对于维持电网稳定性和态势感知至关重要。现有方法往往依赖于有限的标记数据集,限制了其在罕见或未见扰动上的泛化能力。本文提出了一种新的框架,整合了生成建模、滑动窗口时间处理和决策融合,以实现使用同步相量数据的鲁棒事件检测和分类。采用变分自编码器-生成对抗网络来建模正常运行条件,其中重构误差和判别器误差被提取为异常指标。开发了两种互补的决策策略:基于阈值的规则用于计算效率,基于凸包的方法用于在复杂误差分布下的鲁棒性。这些特征通过滑动窗口机制组织成时空检测和分类矩阵,并通过识别和决策融合阶段整合来自PMUs的输出。该设计使框架能够识别已知事件,同时系统地将以前未见过的扰动分类到新类别中,解决了监督分类器的关键限制。实验结果表明,该方法的准确性处于最先进水平,超过了机器学习、深度学习和包络基线方法。识别未知事件的能力进一步突显了所提出方法在现代电力系统广域事件分析中的适应性和实际价值。

英文摘要

Reliable detection and classification of power system events are critical for maintaining grid stability and situational awareness. Existing approaches often depend on limited labeled datasets, which restricts their ability to generalize to rare or unseen disturbances. This paper proposes a novel framework that integrates generative modeling, sliding-window temporal processing, and decision fusion to achieve robust event detection and classification using synchrophasor data. A variational autoencoder-generative adversarial network is employed to model normal operating conditions, where both reconstruction error and discriminator error are extracted as anomaly indicators. Two complementary decision strategies are developed: a threshold-based rule for computational efficiency and a convex hull-based method for robustness under complex error distributions. These features are organized into spatiotemporal detection and classification matrices through a sliding-window mechanism, and an identification and decision fusion stage integrates the outputs across PMUs. This design enables the framework to identify known events while systematically classifying previously unseen disturbances into a new category, addressing a key limitation of supervised classifiers. Experimental results demonstrate state-of-the-art accuracy, surpassing machine learning, deep learning, and envelope-based baselines. The ability to recognize unknown events further highlights the adaptability and practical value of the proposed approach for wide-area event analysis in modern power systems.

2509.20912 2026-05-22 cs.AI 版本更新

DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning

DeFacto: 通过图像进行反事实推理以强制证据支持和忠实推理

Tianrun Xu, Haoda Jing, Ye Li, Yuquan Wei, Jun Feng, Guanyu Chen, Haichuan Gao, Tianren Zhang, Feng Chen

Affiliations * Department of Automation, Tsinghua University, Beijing, China; Zhongguancun Academy, Beijing, China; School of Software, Xinjiang University, Urumqi, China; College of Materials Science and Engineering, Fuzhou University, Fuzhou, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Beijing Qianjue Technology Co., Ltd., Beijing, China

AI summary This paper proposes DeFacto, a framework that integrates three training paradigms (positive, counterfactual, and random-masking) to improve the evidence-answer consistency of multimodal language models, and introduces the DeFacto-1.5K benchmark for systematic evaluation.

详情
AI中文摘要

最近多模态语言模型(MLLMs)的进步使通过图像进行推理成为多模态推理的主要范式。然而,现有方法仍无法确保答案与证据的一致性,即正确答案必须由正确视觉证据支持。为了解决这个问题,我们提出了DeFacto,一种反事实推理框架,该框架明确地将视觉证据与最终答案对齐。我们的方法整合了三种互补的训练范式:正例、反事实和随机遮蔽。我们进一步开发了一个语言引导的证据构建流水线,该流水线能够自动定位与问题相关区域并生成反事实变体,从而得到DeFacto-100K。基于此数据集,我们训练MLLMs使用基于GRPO的强化学习,并设计三种互补的奖励机制以促进正确回答、结构化推理和一致的证据选择。此外,我们引入了DeFacto-1.5K,一个由人类标注的基准,用于系统评估证据支持的一致性,而不仅仅是答案准确性。在多样化的基准测试中,DeFacto在答案准确性和证据-答案一致性方面均显著优于强大的基线模型。

英文摘要

Recent advances in multimodal language models (MLLMs) have made thinking with images a dominant paradigm for multimodal reasoning. However, existing methods still fail to ensure evidence-answer consistency, where correct answers must be supported by correct visual evidence. To address this issue, we propose DeFacto, a counterfactual reasoning framework that explicitly aligns visual evidence with final answers. Our approach integrates three complementary training paradigms: positive, counterfactual, and random-masking. We further develop a language-guided evidence construction pipeline that automatically localizes question-relevant regions and generates counterfactual variants, resulting in DeFacto-100K. Building on this dataset, we train MLLMs with GRPO-based reinforcement learning and design three complementary rewards to promote correct answering, structured reasoning, and consistent evidence selection. Moreover, we introduce DeFacto-1.5K, a human-annotated benchmark for systematically evaluating evidence-grounded consistency beyond answer accuracy. Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and evidence-answer consistency over strong baselines.

2509.12610 2026-05-22 cs.DB cs.AI cs.LG 版本更新

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

ScaleDoc: 通过大规模文档集合进行基于大语言模型的谓词扩展

Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang

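Affiliations * Tsinghua University

AI summary This paper proposes ScaleDoc, a system that addresses the high inference cost of LLMs in large-scale document analysis by splitting predicate execution into an offline representation phase and an optimized online filtering phase, yielding an end-to-end speedup and fewer LLM invocations.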

详情
AI中文摘要

谓词是数据分析系统中的基础组件。然而,现代工作负载越来越多地涉及无结构文档,这需要语义理解,而不仅仅是传统基于值的谓词。鉴于巨大的文档和随机查询,尽管大语言模型(LLMs)显示出强大的零样本能力,但其高推理成本导致不可接受的开销。因此,我们引入ScaleDoc,一种新的系统,通过将谓词执行分解为离线表示阶段和优化的在线过滤阶段来解决这一问题。在离线阶段,ScaleDoc利用LLM为每个文档生成语义表示。在线阶段,对于每个查询,它在这些表示上训练一个轻量级代理模型来过滤大多数文档,只将有歧义的案例转发给LLM进行最终决策。此外,ScaleDoc提出了两个核心创新来实现显著的效率:(1)基于对比学习的框架,训练代理模型生成可靠的预测决策分数;(2)自适应级联机制,确定有效的过滤策略,同时满足特定的准确率目标。我们在三个数据集上的评估表明,ScaleDoc实现了超过2倍的端到端速度提升,并将昂贵的LLM调用减少了高达85%,使大规模语义分析变得实用和高效。

英文摘要

Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, but given enormous document collections and ad-hoc queries, their high inference cost leads to unacceptable overhead. Therefore, we introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for the final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicate decision scores; (2) an adaptive cascade mechanism that determines an effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that ScaleDoc achieves over a 2× end-to-end speedup and reduces expensive LLM invocations by up to 85%, making large-scale semantic analysis practical and efficient.
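The online filtering phase can be pictured as a two-threshold cascade: documents with confident proxy scores are decided directly, and only the uncertain band is forwarded to the LLM. The sketch below is a minimal illustration under that reading; the function name, the fixed `low`/`high` thresholds, and the `ask_llm` callback are hypothetical, and the adaptive choice of thresholds to meet an accuracy target is not shown.

```python
def cascade_filter(scores, low, high, ask_llm):
    """Decide each document from its proxy score, deferring ambiguous ones to the LLM.

    scores  : proxy-model scores in [0, 1], one per document (assumed format)
    low     : scores at or below this are rejected without an LLM call
    high    : scores at or above this are accepted without an LLM call
    ask_llm : callback taking a document index and returning True or False
    """
    decisions = []
    llm_calls = 0
    for doc_id, score in enumerate(scores):
        if score >= high:
            decisions.append(True)
        elif score <= low:
            decisions.append(False)
        else:
            # Ambiguous case: pay for one LLM invocation.
            decisions.append(ask_llm(doc_id))
            llm_calls += 1
    return decisions, llm_calls


# Toy run: the stand-in "LLM" accepts even-numbered documents.
decisions, llm_calls = cascade_filter(
    scores=[0.95, 0.05, 0.5, 0.6],
    low=0.2,
    high=0.8,
    ask_llm=lambda doc_id: doc_id % 2 == 0,
)
```

Widening the gap between `low` and `high` trades more LLM calls for higher accuracy, which is the knob an adaptive cascade would tune.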

2509.06503 2026-05-22 cs.AI q-bio.QM 版本更新

An AI system to help scientists write expert-level empirical software

一种帮助科学家编写专家级经验软件的AI系统

Eser Aygün, Anastasiya Belyaeva, Gheorghe Comanici, Marc Coram, Hao Cui, Jake Garrison, Renee Johnston, Anton Kast, Cory Y. McLean, Peter Norgaard, Zahra Shamsi, David Smalling, James Thompson, Subhashini Venugopalan, Brian P. Williams, Chujun He, Sarah Martinson, Martyna Plomecka, Lai Wei, Yuchen Zhou, Qian-Ze Zhu, Matthew Abraham, Erica Brand, Anna Bulanova, Jeffrey A. Cardille, Chris Co, Scott Ellsworth, Grace Joseph, Malcolm Kane, Ryan Krueger, Johan Kartiwa, Dan Liebling, Jan-Matthis Lueckmann, Paul Raccuglia, Xuefei Wang, Katherine Chou, James Manyika, Yossi Matias, John C. Platt, Lizzie Dorfman, Shibl Mourad, Michael P. Brenner

Affiliations * Google DeepMind; Google Research; Google Platforms and Devices; Massachusetts Institute of Technology; School of Engineering and Applied Sciences, Harvard University

AI summary This paper presents Empirical Research Assistance (ERA), a system that uses a large language model and tree search to automatically create high-quality scientific software, speeding up the development of computational experiments and improving research productivity.

Comments 78 pages, 31 figures, 22 tables

详情
AI中文摘要

科学发现的周期经常被缓慢、手动的软件创建所限制,用于支持计算实验。为了解决这个问题,我们提出了Empirical Research Assistance (ERA),一种AI系统,其目标是最大化一个质量度量。该系统使用大型语言模型(LLM)和树搜索(TS)来系统性地提高质量度量并智能地导航可能的解决方案空间。当探索并整合外部来源的复杂研究想法时,ERA能够产生专家级的结果。树搜索的有效性在各种任务上得到了证明。在生物信息学中,ERA发现了40种新的单细胞数据分析方法,这些方法在公开排行榜上优于顶级的人工方法。在流行病学中,ERA生成了14种模型,这些模型在预测新冠住院预测方面优于CDC集合和所有其他个体模型。ERA还为地理空间分析、斑马鱼神经活动预测和积分数值解法以及时间序列预测的规则基构造生成了专家级软件。通过为多样任务设计和实现新的解决方案,ERA代表了加速科学进步的重要一步。

英文摘要

The cycle of scientific discovery is frequently bottlenecked by the slow, manual creation of software to support computational experiments. To address this, we present Empirical Research Assistance (ERA), an AI system that creates expert-level scientific software whose goal is to maximize a quality metric. The system uses a Large Language Model (LLM) and Tree Search (TS) to systematically improve the quality metric and intelligently navigate the large space of possible solutions. ERA achieves expert-level results when it explores and integrates complex research ideas from external sources. The effectiveness of tree search is demonstrated across a diverse range of tasks. In bioinformatics, ERA discovered 40 novel methods for single-cell data analysis that outperformed the top human-developed methods on a public leaderboard. In epidemiology, ERA generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. ERA also produced expert-level software for geospatial analysis, neural activity prediction in zebrafish, and numerical solution of integrals, and a novel rule-based construction for time series forecasting. By devising and implementing novel solutions to diverse tasks, ERA represents a significant step towards accelerating scientific progress.

2507.03674 2026-05-22 cs.CL cs.AI 版本更新

STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

STRUCTSENSE:一种任务无关的代理框架,用于结构化信息提取,具有人机协同评估和基准测试

Tek Raj Chhetri, Yibei Chen, Puja Trivedi, Dorota Jarecka, Saif Haobsh, Patrick Ray, Lydia Ng, Satrajit S. Ghosh

Affiliations * McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA; Fylo Labs Inc., New York, NY, USA; Allen Institute for Brain Science, Seattle, WA, USA

AI summary This paper proposes the STRUCTSENSE framework, which integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust structured information extraction, and demonstrates cross-task generalization in three domains.

Comments -

详情
AI中文摘要

从科学文献中提取结构化信息对于加速发现至关重要,但大型语言模型(LLMs)在需要专家知识的专门领域表现不佳,且在跨任务泛化方面表现差。我们引入STRUCTSENSE,一种模块化、任务无关、开源的框架,整合了本体引导的符号知识、代理自我评估细化和人机协同验证,以实现领域感知的稳健提取。我们在三个递增语义复杂度的任务上评估STRUCTSENSE:基于模式的评估工具提取(91-100%准确率)、从科学论文中提取元数据和资源(86-93%总体准确率)以及从神经科学文献中进行命名实体识别(NER)(58-75%标签准确率,共8,882个实体)。在两个生物医学NER基准(NCBI疾病和S800物种)上,系统实现了≥90%的宽松召回率和62.5-85.8%的严格召回率,同时提取了1,000-3,600个额外实体。本地概念映射服务在严格匹配下达到62-82%的Hits@1,在语义匹配下达到68-86%。这些结果在三个领域展示了STRUCTSENSE跨任务泛化的能力,同时保持了源地和可追溯性透明度。

英文摘要

Extracting structured information from scientific literature is critical for accelerating discovery, yet Large Language Models (LLMs) often struggle in specialized domains that require expert knowledge and generalize poorly across tasks. We introduce StructSense, a modular, task-agnostic, open-source framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust domain-aware extraction. We evaluate StructSense on three tasks of increasing semantic complexity: schema-based extraction of assessment instruments (91-100% accuracy), metadata and resource extraction from scientific papers (86-93% overall), and named entity recognition (NER) from neuroscience literature (58-75% label accuracy across 8,882 entities). On two biomedical NER benchmarks (NCBI Disease and S800 Species), the system achieves ≥90% relaxed recall and 62.5-85.8% strict recall while extracting 1,000-3,600 additional entities beyond gold annotations. The local concept mapping service achieves Hits@1 of 62-82% under strict matching and 68-86% under semantic matching. These results across three domains demonstrate that StructSense generalizes across tasks while maintaining source grounding and provenance transparency.

2506.19500 2026-05-22 cs.AI cs.CL cs.LG 版本更新

NaviAgent: Graph-Driven Bilevel Planning for Scalable Tool Orchestration

NaviAgent: 一种基于图的双层规划用于可扩展的工具编排

Yan Jiang, Hao Zhou, Lizhong GU, Tianlong Li, Ruinan Jin, Wanqi Zhou, Ai Han

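Affiliations * Department of Electrical and Computer Engineering, The Ohio State University, USA

AI summary This paper proposes NaviAgent, a graph-based bilevel planning framework that decouples task planning from tool execution to improve the scalability and robustness of large-scale tool orchestration; experiments show strong task success rates and real-world performance.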

Comments Accepted to ICML 2026

Journal ref Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

大型语言模型(LLMs)越来越多地作为功能调用代理,通过调用外部工具来处理超出其静态知识的任务。然而,它们通常逐个调用工具,缺乏对任务结构的整体视图。由于工具之间往往相互依赖,这导致了错误累积和可扩展性差,尤其是在扩展到数百或数千个工具时。为了解决这些限制,我们提出了NaviAgent,一种显式的双层架构,通过基于工具关系的图建模来解耦任务规划与工具执行。在规划层,基于LLM的代理决定是否直接回应、澄清意图或检索并执行独立于工具间复杂度的工具链。在执行层,工具世界导航模型(TWNM)编码工具之间的结构和行为关系,引导代理生成可扩展且鲁棒的调用序列。通过整合真实工具交互的反馈,NaviAgent实现了规划与执行之间的闭环对齐,使代理能够在大规模工具生态系统中实现自适应导航。在API-Bank和ToolBench上的评估显示,任务成功率(TSR)有持续改进,TWNM在复杂任务上平均提升13.1个百分点。进一步在50个真实API跨7个领域的测试中,展示了4.3-12.0个百分点的持续收益,步骤更少且延迟更低,证明了其在真实世界动态下的鲁棒泛化能力。

英文摘要

Large Language Models (LLMs) increasingly act as function-call agents that invoke external tools to tackle tasks beyond their static knowledge. However, they typically invoke tools one at a time without a global view of task structure. As tools often depend on one another, this leads to error accumulation and poor scalability, particularly when scaling to hundreds or thousands of tools. To address these limitations, we propose NaviAgent, an explicit bilevel architecture that decouples task planning from tool execution through graph-based modeling of tool relations. At the planning level, the LLM-based agent decides whether to respond directly, clarify intent, or retrieve and execute a toolchain independent of inter-tool complexity. At the execution level, a Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, steering the agent to compose scalable and robust invocation sequences. Incorporating feedback from real tool interactions, NaviAgent achieves closed-loop alignment between planning and execution, enabling adaptive navigation in large-scale tool ecosystems. Evaluations on API-Bank and ToolBench show consistent improvements in task success rate (TSR), with TWNM yielding an average gain of 13.1 points on complex tasks. Further tests on 50 real APIs across 7 domains show consistent gains of 4.3-12.0 points, with fewer steps and lower latency, demonstrating robust generalization under real-world dynamics.

2506.16659 2026-05-22 cs.LG cs.AI math.OC 版本更新

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

通过最小化优化器设计实现内存高效的LLM预训练

Athanasios Glentis, Jiaxiang Li, Andi Han, Mingyi Hong

Affiliations * Department of Electrical and Computer Engineering, University of Minnesota, USA; School of Mathematics and Statistics, University of Sydney, Australia

AI summary This paper studies how simple optimizer design changes let SGD reach state-of-the-art pretraining performance, and proposes the SCALE optimizer, which uses less memory than Adam and outperforms existing memory-efficient optimizers across several models.

Comments Accepted at ICML 2026

详情
AI中文摘要

训练大型语言模型(LLMs)依赖于自适应优化器,如Adam,这些优化器引入了额外的操作,并需要比SGD更多的内存来维护一阶和二阶矩量。尽管最近的工作如GaLore、Fira和APOLLO提出了状态压缩的内存高效变体,但一个根本性的问题仍然存在:plain SGD需要哪些最小的修改才能达到最先进的预训练性能?我们通过自底向上的方法系统地研究了这个问题,并识别出两种简单但高度(内存和计算)高效的技巧:(1)列级梯度归一化(沿输出维度归一化梯度),在没有动量的情况下提升SGD性能;(2)仅在输出层应用一阶动量,因为梯度方差最高。结合这两种技术得到SCALE(Stochastic Column-normAlized Last-layer momEntum),一种简单的优化器,用于内存高效的预训练。在多个模型(60M-1B)上,SCALE的内存使用仅为Adam的35-45%,并且在多个模型上表现优于Adam。它还一致优于内存高效的优化器如GaLore、Fira和APOLLO,使其成为在内存限制下的大规模预训练的强大候选者。对于LLaMA 7B,SCALE在困惑度和内存消耗方面都优于最先进的内存高效方法APOLLO和Muon。

英文摘要

Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory to maintain first- and second-order moments than SGD. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed memory-efficient variants, a fundamental question remains: What are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach, and identify two simple yet highly (memory- and compute-) efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory-efficient pretraining. Across multiple models (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For LLaMA 7B, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon in both perplexity and memory consumption.
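The two techniques named in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the choice of axis 0 as the "column" direction, the exponential-moving-average form of the momentum, and applying normalization after momentum are all assumptions.

```python
import numpy as np


def column_normalize(grad, eps=1e-8):
    """Rescale every column of a 2-D gradient to (approximately) unit L2 norm."""
    norms = np.linalg.norm(grad, axis=0, keepdims=True)
    return grad / (norms + eps)


def scale_like_step(params, grads, momentum, last_layer, lr=1e-3, beta=0.9):
    """Apply one update to `params` (dict of name -> 2-D array).

    Every layer uses column-normalized SGD; only `last_layer` keeps a
    momentum buffer, so optimizer state stays small.
    """
    for name, grad in grads.items():
        if name == last_layer:
            prev = momentum.get(name, np.zeros_like(grad))
            momentum[name] = beta * prev + (1.0 - beta) * grad
            grad = momentum[name]
        params[name] = params[name] - lr * column_normalize(grad)
    return params, momentum


# Toy run with two hypothetical layers.
g = np.array([[3.0, 0.0], [4.0, 2.0]])
params = {"hidden": np.zeros((2, 2)), "output": np.zeros((2, 2))}
grads = {"hidden": g.copy(), "output": g.copy()}
params, momentum = scale_like_step(params, grads, {}, last_layer="output", lr=1.0)
```

The memory saving comes from the momentum dictionary holding a buffer for a single layer, whereas Adam stores two buffers for every parameter tensor.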

2506.14648 2026-05-22 cs.RO cs.AI 版本更新

SENIOR: Efficient Query Selection and Preference-Guided Exploration in Preference-based Reinforcement Learning

SENIOR: 在基于偏好的强化学习中高效查询选择与偏好引导探索

Hexian Ni, Tao Lu, Haoyuan Hu, Yinghao Cai, Shuo Wang

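Affiliations * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary This paper proposes SENIOR, which uses efficient query selection and preference-guided exploration to improve human feedback efficiency and speed up policy learning, addressing the poor feedback and sample efficiency of preference-based reinforcement learning.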

Comments 8 pages, 8 figures, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)

详情
AI中文摘要

基于偏好强化学习(PbRL)方法通过学习基于人类偏好的奖励模型来避免奖励工程。然而,较差的反馈和样本效率仍然是阻碍PbRL应用的问题。本文提出了一种新颖的高效查询选择和偏好引导探索方法,称为SENIOR,能够选择有意义且易于比较的行为片段对,以提高人类反馈效率并加速策略学习,通过设计的偏好引导内在奖励。我们的关键思想是双方面的:(1)我们设计了一种基于运动区别的选择方案(MDS)。它通过状态的核密度估计选择具有明显运动和不同方向的片段对,这更任务相关且更易于人类偏好标注;(2)我们提出了一种新颖的偏好引导探索方法(PGE)。它鼓励探索高偏好和低访问状态,并持续引导智能体获取有价值的样本。两种机制的协同作用可以显著加快奖励和策略学习的进度。我们的实验表明,SENIOR在六个复杂的机器人操作任务(从仿真和现实世界)中,既在人类反馈效率又在策略收敛速度上均优于其他五个现有方法。视频可在我们的项目网站上找到:https://2025senior.github.io/

英文摘要

Preference-based Reinforcement Learning (PbRL) methods avoid reward engineering by learning reward models from human preferences. However, poor feedback efficiency and sample efficiency remain problems that hinder the application of PbRL. In this paper, we present a novel efficient query selection and preference-guided exploration method, called SENIOR, which selects meaningful and easy-to-compare behavior segment pairs to improve human feedback efficiency and accelerates policy learning with designed preference-guided intrinsic rewards. Our key idea is twofold: (1) We design a Motion-Distinction-based Selection scheme (MDS). It selects segment pairs with apparent motion and different directions through kernel density estimation of states, which are more task-related and easier for humans to label with preferences; (2) We propose a novel preference-guided exploration method (PGE). It encourages exploration towards states with high preference and low visit counts and continuously guides the agent to collect valuable samples. The synergy between the two mechanisms can significantly accelerate reward and policy learning. Our experiments show that SENIOR outperforms five other existing methods in both human feedback efficiency and policy convergence speed on six complex robot manipulation tasks in simulation and four in the real world. Videos can be found on our project website: https://2025senior.github.io/

2503.24191 2026-05-22 cs.CR cs.AI 版本更新

When Grammar Guides the Attack: Uncovering Control-Plane Vulnerabilities in LLMs with Structured Output

当语法引导攻击:利用结构化输出揭示LLM中的控制平面漏洞

Shuoming Zhang, Jiacheng Zhao, Hanyuan Dong, Ruiyuan Xu, Zhicheng Li, Yangyu Zhang, Shuaijiang Li, Yuan Wen, Chunwei Xia, Zheng Wang, Xiaobing Feng, Huimin Cui

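Affiliations * SKLP, ICT, CAS & UCAS; University of Aberdeen; University of Leeds; SKLP, ICT, CAS

AI summary This paper studies control-plane vulnerabilities that structured output exposes in LLMs and introduces Constrained Decoding Attack (CDA), a new class of jailbreak. CDA works as a control-to-semantic pipeline: schema-enforced logit masking injects a malicious prefix, and the model itself completes the harmful intent. DictAttack reaches high attack success rates on multiple models, showing that cross-plane defenses are needed to bridge the semantic gap between the data and control planes.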

Comments To appear in CCS2026

详情
AI中文摘要

内容警告:本文可能包含由LLM生成的不安全或有害内容,可能对读者造成冒犯。大型语言模型(LLMs)越来越多地通过结构化输出API作为工具平台,但驱动这一功能的语法引导解码功能打开了一个与传统数据平面漏洞无关的控制平面攻击面。我们引入了Constrained Decoding Attack(CDA),一种针对LLM控制平面的新jailbreak类别。CDA最佳描述为一个控制到语义的管道:(1)schema-enforced logit masking注入恶意前缀到生成轨迹中,(2)模型本身完成有害意图。不同于依赖绕过对齐可见输入的数据平面jailbreaks,CDA作用于解码过程本身,因此仅靠内部安全对齐无法阻止它。我们用EnumAttack实例化CDA,其将恶意内容隐藏在枚举字段中,并用更狡猾的DictAttack,将负载拆分到一个无害提示和基于字典的语法中。在13个专有/开源模型和五个标准基准上,DictAttack在旗舰模型如gpt-5、gemini-2.5-pro、deepseek-r1和gpt-oss-120b上实现了94.3-99.5%的攻击成功率(ASR)。尽管基本语法审核可以缓解EnumAttack,DictAttack仍能抵御最先进的jailbreak guardrails,达到75.8%的ASR,暴露了需要跨平面防御来弥合数据和控制平面之间语义差距的问题。项目页面和代码可在https://ict-cda.github.io/上获得。

英文摘要

Content Warning: This paper may contain unsafe or harmful content generated by LLMs that may be offensive to readers. Large Language Models (LLMs) increasingly serve as tooling platforms through structured output APIs, but the grammar-guided decoding that powers this feature opens a critical control-plane attack surface orthogonal to traditional data-plane vulnerabilities. We introduce Constrained Decoding Attack (CDA), a new jailbreak class that targets the LLM control plane. CDA is best characterized as a control-to-semantic pipeline: (1) schema-enforced logit masking injects a malicious prefix into the generation trajectory, and (2) the model itself completes the harmful intent. Unlike data-plane jailbreaks that rely on bypassing alignment with visible inputs, CDA acts on the decoding process itself, so internal safety alignment alone cannot stop it. We instantiate CDA with EnumAttack, which hides malicious content in enum fields, and the more evasive DictAttack, which decouples the payload across a benign prompt and a dictionary-based grammar. Across 13 proprietary/open-weight models and five standard benchmarks, DictAttack achieves 94.3-99.5% Attack Success Rate (ASR) on flagship models including gpt-5, gemini-2.5-pro, deepseek-r1, and gpt-oss-120b. While basic grammar auditing mitigates EnumAttack, DictAttack still sustains 75.8% ASR against SOTA jailbreak guardrails, exposing a "semantic gap" that demands cross-plane defenses bridging the data and control planes. Project page and code are available at https://ict-cda.github.io/.

2503.21821 2026-05-22 cs.AI 版本更新

PHYSICS: Benchmarking Foundation Models on University-Level Physics Problem Solving

PHYSICS:在大学物理问题求解中基准测试基础模型

Kaiyue Feng, Yilun Zhao, Yixin Liu, Tianyu Yang, Chen Zhao, John Sous, Arman Cohan

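Affiliations * Yale University; New York University; Notre Dame University

AI summary This paper introduces PHYSICS, a benchmark for university-level physics problem solving with 1297 expert-annotated problems across six core areas; its automated evaluation system reveals substantial limitations in leading foundation models.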

Journal ref Findings of ACL 2025

详情
AI中文摘要

我们介绍了PHYSICS,一个全面的大学物理问题求解基准测试。它包含1297个专家标注的问题,涵盖六个核心领域:经典力学、量子力学、热力学和统计力学、电磁学、原子物理和光学。每个问题都需要高级物理知识和数学推理。我们开发了一个稳健的自动化评估系统,以实现精确且可靠的验证。对领先基础模型的评估揭示了显著的局限性。即使最先进的模型o3-mini也只能达到59.9%的准确率,突显了解决高水平科学问题的重大挑战。通过全面的错误分析、探索多样的提示策略以及基于检索增强生成(RAG)的知识增强,我们识别出关键的改进领域,为未来的发展奠定了基础。

英文摘要

We introduce PHYSICS, a comprehensive benchmark for university-level physics problem solving. It contains 1297 expert-annotated problems covering six core areas: classical mechanics, quantum mechanics, thermodynamics and statistical mechanics, electromagnetism, atomic physics, and optics. Each problem requires advanced physics knowledge and mathematical reasoning. We develop a robust automated evaluation system for precise and reliable validation. Our evaluation of leading foundation models reveals substantial limitations. Even the most advanced model, o3-mini, achieves only 59.9% accuracy, highlighting significant challenges in solving high-level scientific problems. Through comprehensive error analysis, exploration of diverse prompting strategies, and Retrieval-Augmented Generation (RAG)-based knowledge augmentation, we identify key areas for improvement, laying the foundation for future advancements.

2502.09487 2026-05-22 cs.CL cs.AI cs.LG 版本更新

Internal narratives parameterise affective states

内部叙事参数化情感状态

Jakub Onysk, Quentin J. M. Huys

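Affiliations * Applied Computational Psychiatry Lab; Max Planck UCL Centre for Computational Psychiatry and Ageing Research; Queen Square Institute of Neurology and Mental Health; Neuroscience Department; Division of Psychiatry

AI summary This paper studies the relationship between narrative and affective state by quantifying participants' internal narratives through large-language-model representations and their subspaces. It finds that verbal descriptions of symptom-specific thoughts predict standardised depression scores, and that preserving the covariance between symptoms is essential for construct validity.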

详情
AI中文摘要

描述我们如何用语言表达感受对于心理评估和干预至关重要,但叙事与情感状态之间的映射仍然理解不足。在两个大规模研究(n=1257)中,我们通过大语言模型表示及其子空间量化了参与者内部叙事的结构和动态,以参数化抑郁状态。在第一项研究中,我们发现对特定症状的描述性思维捕捉了预测标准化、自我报告抑郁评分的细粒度信息。关键的是,我们显示保持症状之间的特定协方差对于构效效度至关重要,这表明高维文本表示镜像了疾病的潜在几何结构。第二项研究探讨了这种关系的时间动态,当参与者与情感叙事互动时。我们发现量化内部叙事的变化导致自我报告的变化,而基线叙事严重性预测了后续情感变化的幅度。通过将情感视为计算状态,我们的结果强调了其核心、治疗相关功能:约束内部叙事的结构并整合上下文以塑造自我报告。

英文摘要

Characterising how we verbalise our feelings is central to psychological assessment and intervention, yet the mapping between narrative and affective state remains poorly understood. Across two large studies (n=1257), we parameterised the structure and dynamics of depressive states by quantifying participants' internal narratives through large-language-model representations and their subspaces. In Study 1, we found verbal descriptions of symptom-specific thoughts captured granular information predictive of standardised, self-reported depression scores. Critically, we show preserving the specific covariance between symptoms is essential for construct validity, suggesting high-dimensional text representations mirror the latent geometry of the disorder. Study 2 probed the temporal dynamics of this relationship as participants engaged with emotional narratives. We found quantified changes in internal narratives led to changes in self-report, while the baseline narrative severity predicted the magnitude of subsequent affective change. By framing affect as a computational state, our results highlight its core, therapeutically pertinent functions: constraining the structure of internal narratives and integrating context to shape self-report.

2411.01332 2026-05-22 cs.LG cs.AI 版本更新

A Mechanistic Explanatory Strategy for XAI

为XAI的解释性策略机制

Marcin Rabiza

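Affiliations * Institute of Philosophy and Sociology, Polish Academy of Sciences; Institute for Philosophy, Leiden University

AI summary This paper proposes a mechanistic explanatory strategy that uses decomposition, localization, and recomposition to uncover the mechanisms behind the functional organization of deep learning systems, with the aim of strengthening the theoretical foundations and practice of explainable AI.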

详情
AI中文摘要

尽管在XAI领域取得了显著进展,学者们指出缺乏坚实的理论基础和与更广泛科学解释 discourse 的整合仍是持续存在的问题。为此,新兴研究借鉴了各种科学和科学哲学文献中的解释策略来填补这些空白。本文概述了一种用于解释深度学习系统功能组织的机制性策略,将近期的可解释人工智能发展置于更广泛的哲学背景下。根据机制方法,对不透明AI系统的解释涉及识别驱动决策的机制。对于深度神经网络,这意味着辨别功能相关组件,如神经元、层、电路或激活模式,并通过分解、定位和重组来理解其作用。图像识别和语言模型的证明原理案例研究将这些理论方法与OpenAI和Anthropic的机制可解释性研究相结合。研究结果表明,追求机制性解释可以揭示传统可解释性技术可能忽略的元素,最终促进更彻底的可解释人工智能。

英文摘要

Despite significant advancements in XAI, scholars note a persistent lack of solid conceptual foundations and integration with broader scientific discourse on explanation. In response, emerging research draws on explanatory strategies from various sciences and the philosophy of science literature to fill these gaps. This paper outlines a mechanistic strategy for explaining the functional organization of deep learning systems, situating recent developments in explainable AI within a broader philosophical context. According to the mechanistic approach, the explanation of opaque AI systems involves identifying mechanisms that drive decision making. For deep neural networks, this means discerning functionally relevant components, such as neurons, layers, circuits, or activation patterns, and understanding their roles through decomposition, localization, and recomposition. Proof-of-principle case studies from image recognition and language modeling align these theoretical approaches with mechanistic interpretability research from OpenAI and Anthropic. The findings suggest that pursuing mechanistic explanations can uncover elements that traditional explainability techniques may overlook, ultimately contributing to more thoroughly explainable AI.

2410.04753 2026-05-22 cs.AI cs.CL cs.LG cs.LO 版本更新

ImProver: Agent-Based Automated Proof Optimization

ImProver:基于代理的自动证明优化

Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck

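Affiliations * Carnegie Mellon University

AI summary This paper studies automated proof optimization and proposes ImProver, an LLM-based agent that rewrites proofs to optimize arbitrary criteria such as length or readability; experiments show it makes proofs substantially shorter, more modular, and more readable.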

Comments Published as a conference paper at ICLR 2025

详情
AI中文摘要

大型语言模型(LLMs)已被用于在Lean等证明助手中生成数学定理的形式化证明。然而,我们通常希望根据下游用途,针对不同标准优化形式化证明,例如使其符合某种风格、易于阅读、简洁或模块化。适当优化的证明对于学习任务也很重要,尤其是因为人工撰写的证明未必最适合这一目的。为此,我们研究了一个新问题:自动证明优化,即重写证明,使其保持正确并针对任意标准(如长度或可读性)进行优化。作为自动证明优化的第一种方法,我们提出了ImProver,这是一个大型语言模型代理,能够在Lean中重写证明以优化任意用户定义的指标。我们发现直接应用LLMs进行证明优化效果不足,因此在ImProver中引入了多项改进,例如在新颖的Chain-of-States技术中使用符号化的Lean上下文,以及错误校正和检索。我们在重写真实的本科、竞赛和研究级数学定理上测试了ImProver,发现它能够将证明重写得显著更短、更模块化、更易读。

英文摘要

Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proof assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human-written proofs may not be optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain-of-States technique, as well as error-correction and retrieval. We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.
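
To make "optimizing a proof for length" concrete, here is a small hand-written Lean 4 illustration. It is not taken from the paper and is not ImProver output; the theorem names are hypothetical. Both proofs establish the same statement, and the second is what a length metric would prefer.

```lean
-- Verbose version: an explicit induction, although each case closes by `rfl`.
theorem add_zero_verbose (n : Nat) : n + 0 = n := by
  induction n with
  | zero => rfl
  | succ k _ => rfl

-- Optimized for length: the statement already holds by definitional unfolding.
theorem add_zero_short (n : Nat) : n + 0 = n := rfl
```

A readability metric might rank these differently from a length metric, which is why the criterion is left user-defined in the problem statement above.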

2403.16552 2026-05-22 cs.NE cs.AI cs.CV 版本更新

QKFormer: Hierarchical Spiking Transformer using Q-K Attention

QKFormer: 基于Q-K注意力的分层脉冲变换器

Chenlin Zhou, Han Zhang, Zhaokun Zhou, Liutao Yu, Liwei Huang, Xiaopeng Fan, Li Yuan, Zhengyu Ma, Huihui Zhou, Yonghong Tian

发表机构 * Pengcheng Laboratory(鹏城实验室) Harbin Institute of Technology(哈尔滨工业大学) Peking University(北京大学)

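AI Summary This paper presents QKFormer, a hierarchical spiking transformer based on Q-K attention. A new spike-form Q-K attention mechanism, a hierarchical structure, and a versatile patch embedding module improve spiking neural network performance on image classification, reaching 85.65% top-1 accuracy on ImageNet-1K.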

Comments Accepted by NeurIPS 2024 (Spotlight). Code and Model: https://github.com/zhouchenlin2096/QKFormer

详情
AI中文摘要

Spiking Transformers将脉冲神经网络(SNNs)与Transformer架构相结合,因其在能效和高性能方面的潜力而受到广泛关注。然而,该领域的现有模型仍存在性能欠佳的问题。我们引入了几项创新来提高性能:i)我们提出了一种专为SNNs设计的新型脉冲形式Q-K注意力机制,它通过二值向量以线性复杂度高效建模token或通道维度的重要性。ii)层次结构对大脑和人工神经网络的性能都有显著益处,我们将其引入脉冲变换器,以获得多尺度脉冲表示。iii)我们设计了一个通用且强大的补丁嵌入模块,带有专为脉冲变换器设计的变形捷径(deformed shortcut)。综合以上设计,我们开发了QKFormer,一种基于Q-K注意力、可直接训练的分层脉冲变换器。QKFormer在各种主流数据集上显著优于现有最先进的SNN模型。值得注意的是,在与Spikformer(66.34 M,74.81%)规模相当的情况下,QKFormer(64.96 M)在ImageNet-1K上取得了突破性的85.65% top-1准确率,比Spikformer高出10.84个百分点。据我们所知,这是直接训练的SNNs首次在ImageNet-1K上超过85%的准确率。代码和模型可在 https://github.com/zhouchenlin2096/QKFormer 公开获取。

英文摘要

Spiking Transformers, which integrate Spiking Neural Networks (SNNs) with Transformer architectures, have attracted significant attention due to their potential for energy efficiency and high performance. However, existing models in this domain still suffer from suboptimal performance. We introduce several innovations to improve the performance: i) We propose a novel spike-form Q-K attention mechanism, tailored for SNNs, which efficiently models the importance of token or channel dimensions through binary vectors with linear complexity. ii) We incorporate the hierarchical structure, which significantly benefits the performance of both the brain and artificial neural networks, into spiking transformers to obtain multi-scale spiking representation. iii) We design a versatile and powerful patch embedding module with a deformed shortcut specifically for spiking transformers. Combining these, we develop QKFormer, a hierarchical spiking transformer based on Q-K attention with direct training. QKFormer shows significantly superior performance over existing state-of-the-art SNN models on various mainstream datasets. Notably, at a size comparable to Spikformer (66.34 M, 74.81%), QKFormer (64.96 M) achieves a groundbreaking top-1 accuracy of 85.65% on ImageNet-1K, outperforming Spikformer by 10.84 percentage points. To the best of our knowledge, this is the first time that directly trained SNNs have exceeded 85% accuracy on ImageNet-1K. The code and models are publicly available at https://github.com/zhouchenlin2096/QKFormer
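
The abstract does not spell out the attention mechanism, so the following is only a rough sketch of one plausible reading of "binary vectors with linear complexity": a binary per-token vector is derived from one spike matrix and used to gate another. The Heaviside firing rule, the threshold value, and the function names `spike` and `qk_token_attention` are all assumptions made for illustration, not the authors' implementation.

```python
def spike(values, threshold):
    """Assumed firing rule: emit 1 where the input reaches the threshold, else 0."""
    return [1 if v >= threshold else 0 for v in values]


def qk_token_attention(q, k, threshold=1):
    """Gate the rows (tokens) of binary matrix k with a binary vector derived from q.

    q and k are N x D lists of 0/1 spikes. The cost is O(N * D): one pass over q
    to sum each row and one pass over k to apply the mask. No N x N attention
    matrix is ever formed.
    """
    token_mask = spike([sum(row) for row in q], threshold)
    return [[m * x for x in row] for m, row in zip(token_mask, k)]


# Hypothetical 3-token, 3-channel spike matrices.
q = [[0, 0, 0], [1, 0, 1], [0, 1, 0]]
k = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
out = qk_token_attention(q, k, threshold=2)
print(out)  # [[0, 0, 0], [0, 1, 1], [0, 0, 0]]
```

Only the second token accumulates enough spikes in `q` to fire, so only its row of `k` passes through. Summing columns instead of rows would give the channel-wise variant under the same assumptions.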

2403.03920 2026-05-22 cs.AI cs.CL cs.HC 版本更新

Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts

提升教学质量:利用计算机辅助文本分析从教育资料中生成深入见解

Zewei Tian, Min Sun, Alex Liu, Shawon Sarkar, Jing Liu

发表机构 * University of Washington(华盛顿大学) University of Maryland(马里兰大学)

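AI Summary This paper examines how computer-assisted textual analysis can improve instructional quality through in-depth analysis of educational artifacts. Drawing on Richard Elmore's Instructional Core Framework, it considers how AI and machine learning methods, particularly natural language processing (NLP), can analyze educational content, teacher discourse, and student responses to foster instructional improvement, and it identifies key advantages of AI/ML in teacher coaching, student support, and content development.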

详情
AI中文摘要

本文探讨了计算机辅助文本分析在通过教育资料的深入分析提升教学质量的变革潜力。我们整合Richard Elmore的Instructional Core Framework,以探讨人工智能(AI)和机器学习(ML)方法,特别是自然语言处理(NLP),如何分析教育内容、教师话语和学生回答,以促进教学改进。通过在Instructional Core Framework内的全面回顾和案例研究,我们识别出AI/ML整合在教师指导、学生支持和内容开发中的关键优势。我们揭示出模式,表明AI/ML不仅简化了行政任务,还引入了新的个性化学习路径,为教育工作者提供可操作的反馈,并有助于更深入地理解教学动态。本文强调了将AI/ML技术与教学目标对齐的重要性,以在教育环境中实现其全部潜力,倡导一种平衡的方法,考虑伦理问题、数据质量和人类专业知识的整合。

英文摘要

This paper explores the transformative potential of computer-assisted textual analysis in enhancing instructional quality through in-depth insights from educational artifacts. We integrate Richard Elmore's Instructional Core Framework to examine how artificial intelligence (AI) and machine learning (ML) methods, particularly natural language processing (NLP), can analyze educational content, teacher discourse, and student responses to foster instructional improvement. Through a comprehensive review and case studies within the Instructional Core Framework, we identify key areas where AI/ML integration offers significant advantages, including teacher coaching, student support, and content development. We unveil patterns indicating that AI/ML not only streamlines administrative tasks but also introduces novel pathways for personalized learning, providing actionable feedback for educators and contributing to a richer understanding of instructional dynamics. This paper emphasizes the importance of aligning AI/ML technologies with pedagogical goals to realize their full potential in educational settings, advocating a balanced approach that weighs ethical considerations, data quality, and the integration of human expertise.

2401.00139 2026-05-22 cs.AI cs.CL cs.LG stat.ME 版本更新

Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning

增强大语言模型中的因果推理:一种用于精确微调的因果归因模型

Hengrui Cai, Shengjie Liu, Rui Song

发表机构 * University of California, Irvine(加州大学尔湾分校) North Carolina State University(北卡罗来纳州立大学) Amazon(亚马逊公司)

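AI Summary This paper proposes a causal attribution model that improves the interpretability and causal reasoning of large language models through precise fine-tuning, and demonstrates its effectiveness on causal discovery tasks across domains.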

Comments A Python implementation of our proposed method is available at https://github.com/ncsulsj/Causal_LLM

详情
AI中文摘要

本文介绍了一种因果归因模型,旨在通过精确微调增强大语言模型(LLMs)的可解释性并提高其因果推理能力。尽管LLMs在多种任务中表现出色,但其推理过程往往仍是一个黑箱,限制了有针对性的增强。我们提出了一种新的因果归因模型,利用“do-运算符”构建干预场景,使我们能够系统地量化LLMs因果推理过程中不同组件的贡献。通过在各种领域中进行因果发现任务来评估所提出的归因分数,我们证明了LLMs在因果发现中的有效性严重依赖于提供的上下文和领域特定知识,但也可以利用数值数据进行有限的相关性推理,而非因果性。这促使了所提出的微调LLM用于成对因果发现,有效且正确地利用了知识和数值信息。

英文摘要

This paper introduces a causal attribution model to enhance the interpretability of large language models (LLMs) and improve their causal reasoning abilities via precise fine-tuning. Despite LLMs' proficiency in diverse tasks, their reasoning processes often remain a black box, which restricts targeted enhancement. We propose a novel causal attribution model that utilizes "do-operators" for constructing interventional scenarios, allowing us to systematically quantify the contribution of different components in LLMs' causal reasoning process. By assessing the proposed attribution scores through causal discovery tasks across various domains, we demonstrate that LLMs' effectiveness in causal discovery relies heavily on provided context and domain-specific knowledge, but that LLMs can also use numerical data, albeit through limited calculations that capture correlation rather than causation. This motivates the proposed fine-tuned LLM for pairwise causal discovery, which effectively and correctly leverages both knowledge and numerical information.

2311.04938 2026-05-22 cs.CV cs.AI cs.LG 版本更新

Improved DDIM Sampling with Moment Matching Gaussian Mixtures

改进的DDIM采样与矩匹配高斯混合模型

Prasad Gabbur

发表机构 * Independent Researcher(独立研究者) Apple(苹果公司)

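AI Summary This paper proposes using a Gaussian mixture model as the reverse transition operator in the DDIM framework, constraining the GMM parameters to match the moments of the DDPM forward marginals. This improves sample quality when few sampling steps are used; experiments show the GMM kernel outperforming the standard Gaussian kernel on FID and IS.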

Comments 34 pages, 12 figures; Accepted to TMLR; Code open sourced

Journal ref Transactions on Machine Learning Research, 05/2026

详情
AI中文摘要

我们提出在去噪扩散隐式模型(DDIM)框架中使用高斯混合模型(GMM)作为反向转换算子(核);DDIM是从预训练去噪扩散概率模型(DDPM)中加速采样的最广泛使用的方法之一。具体而言,我们通过约束GMM参数来匹配DDPM前向边缘分布的一阶和二阶中心矩。我们发现,矩匹配足以获得与使用高斯核的原始DDIM相当或更好的样本质量。我们给出了以下实验结果:在CelebAHQ和FFHQ上训练的无条件模型、在ImageNet上训练的类条件模型,以及使用Stable Diffusion v2.1在COYO700M数据集上进行的文本到图像生成。结果表明,以FID和IS指标衡量,当采样步数较少时,使用GMM核可显著提升生成样本的质量。例如在ImageNet 256x256上使用10个采样步骤时,GMM核的FID为6.94、IS为207.85,而高斯核分别为10.15和196.73。此外,我们还为修正流匹配(rectified flow matching)模型推导了新的SDE采样器,并用所提出的方法进行了实验。我们发现1-rectified flow和2-rectified flow模型均有所改进。代码:https://github.com/pgabbur/ddim-gmm。

英文摘要

We propose using a Gaussian Mixture Model (GMM) as the reverse transition operator (kernel) within the Denoising Diffusion Implicit Models (DDIM) framework, which is one of the most widely used approaches for accelerated sampling from pre-trained Denoising Diffusion Probabilistic Models (DDPM). Specifically, we match the first- and second-order central moments of the DDPM forward marginals by constraining the parameters of the GMM. We see that moment matching is sufficient to obtain samples with equal or better quality than the original DDIM with Gaussian kernels. We provide experimental results with unconditional models trained on CelebAHQ and FFHQ, class-conditional models trained on ImageNet, and text-to-image generation using Stable Diffusion v2.1 on the COYO700M dataset. Our results suggest that using the GMM kernel leads to significant improvements in the quality of the generated samples when the number of sampling steps is small, as measured by FID and IS metrics. For example, on ImageNet 256x256, using 10 sampling steps, we achieve an FID of 6.94 and an IS of 207.85 with a GMM kernel, compared to 10.15 and 196.73 respectively with a Gaussian kernel. Further, we derive novel SDE samplers for rectified flow matching models and experiment with the proposed approach. We see improvements using both 1-rectified flow and 2-rectified flow models. Code: https://github.com/pgabbur/ddim-gmm.
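
The moment-matching idea can be illustrated in one dimension with the standard formulas for the mean and variance of a Gaussian mixture. The sketch below is not the paper's constraint scheme: the equal-weight, symmetric two-component construction and the names `mixture_moments` and `symmetric_two_component_gmm` are assumptions chosen to keep the example short.

```python
def mixture_moments(weights, means, variances):
    """Return (mean, variance) of a 1-D Gaussian mixture."""
    mean = sum(w * m for w, m in zip(weights, means))
    # Law of total variance: within-component variance plus spread of the component means.
    var = sum(w * (v + (m - mean) ** 2) for w, m, v in zip(weights, means, variances))
    return mean, var


def symmetric_two_component_gmm(target_mean, target_var, offset):
    """Build an equal-weight two-component mixture matching the target mean and variance.

    Components sit at target_mean +/- offset and share one variance, chosen so the
    mixture's total variance equals target_var. Requires offset**2 < target_var.
    """
    if offset ** 2 >= target_var:
        raise ValueError("offset too large: component variance would be non-positive")
    shared_var = target_var - offset ** 2
    return [0.5, 0.5], [target_mean - offset, target_mean + offset], [shared_var, shared_var]


weights, means, variances = symmetric_two_component_gmm(target_mean=1.0, target_var=4.0, offset=1.5)
mean, var = mixture_moments(weights, means, variances)
print(mean, var)  # 1.0 4.0
```

The mixture here is bimodal, yet its first two moments are identical to those of a single Gaussian with mean 1.0 and variance 4.0. The free parameter `offset` shows that moment constraints leave room to shape the kernel.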

2305.12138 2026-05-22 cs.SE cs.AI 版本更新

Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs

探索代码分析:利用大语言模型进行语法和语义的零样本洞察

Wei Ma, Zhihao Lin, Shangqing Liu, Qiang Hu, Ye Liu, Wenhan Wang, Cen Zhang, Liming Nie, Li Li, Yang Liu, Lingxiao Jiang

发表机构 * Singapore Management University(新加坡管理大学) Blekinge Institute of Technology(布莱金厄理工学院) Beihang University(北京航空航天大学) State Key Laboratory of Novel Software Technology, Nanjing University(南京大学软件新技术国家重点实验室) The University of Tokyo(东京大学) University of Alberta(阿尔伯塔大学) Nanyang Technological University(南洋理工大学) Shenzhen Technology University(深圳技术大学)

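AI Summary This paper studies large language models for code analysis, evaluating 21 state-of-the-art LLMs on nine tasks in four languages to characterize their performance on syntax parsing, static semantics inference, and dynamic reasoning. It finds that LLMs generalize well across languages but remain limited in dynamic reasoning, and it contributes a validated evaluation framework.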

Comments Accepted at ACM Transactions on Software Engineering and Methodology (TOSEM)

详情
AI中文摘要

代码分析是软件工程的基础,支持调试、优化和安全评估。人类开发者通过语法解析、静态语义推断和动态推理来进行代码分析。传统工具虽然有效,但受限于语言特异性,跨语言泛化能力弱。大语言模型(LLMs)在代码任务中具有潜力,但其在基础代码分析方面的能力尚未得到充分探索。我们围绕与人类实践相对应的三个方面(语法解析、静态语义推断和动态推理)展开研究。我们在四种语言(C、Java、Python、Solidity)的九项任务上评估了21种最先进的LLM,任务包括AST生成、CFG构建、数据依赖、污点分析和不稳定测试(flaky test)推理。我们对3124个代码样本应用三层评估协议(自动化指标、专家裁决、一致性验证),实现了较高的评分者间信度(Cohen's kappa = 0.844-0.936)和较强的人机一致性(Gwet's AC1 = 0.500-0.727,F1 = 0.791-0.882)。尽管最佳LLM在语法解析(AST 90%+,表达式匹配84-100%)方面表现优异,并在静态分析中显示出潜力,但其动态推理仍然有限(<70%),且对数据偏移高度敏感(各项目F1在0-1.0之间变化)。这一层次关系在不同模型家族和规模上均成立,表明这是根本性而非暂时性的限制。这些发现展示了LLM如何补充传统分析器:LLM提供跨语言泛化能力,但输出具有非确定性,需要验证;而传统工具提供确定性保证,但需要针对特定语言进行配置。我们贡献了一个经过验证的评估框架,其中包含与传统分析器(Tree-sitter、Soot、Joern)的比较,以及针对具体任务的适用性分级。基准:https://github.com/mathieu0905/llm_code_analysis.git

英文摘要

Code analysis is fundamental in Software Engineering, supporting debugging, optimization, and security assessment. Human developers approach it through syntax parsing, static semantics inference, and dynamic reasoning. Traditional tools are effective but limited by language specificity and weak cross-language generalization. Large language models (LLMs) are promising for code tasks, yet their capabilities for fundamental code analysis remain underexplored. We structure our study around three aspects aligned with human practices: syntax parsing, static semantics inference, and dynamic reasoning. We evaluate 21 state-of-the-art LLMs across nine tasks in four languages (C, Java, Python, Solidity), including AST generation, CFG construction, data dependency, taint analysis, and flaky test reasoning. We apply a three-layer evaluation protocol (automated metrics, expert adjudication, consistency validation) to 3,124 code samples, achieving high inter-rater reliability (Cohen's kappa = 0.844-0.936) and strong human-machine agreement (Gwet's AC1 = 0.500-0.727, F1 = 0.791-0.882). While the best LLMs excel in syntax parsing (AST 90%+, expression matching 84-100%) and show promise in static analysis, their dynamic reasoning remains limited (<70%) with high data-shift sensitivity (per-project F1 varying 0-1.0). This hierarchy holds across model families and scales, suggesting fundamental rather than transient limitations. These findings show how LLMs complement traditional analyzers: they offer cross-language generalization but non-deterministic outputs needing validation, while traditional tools give deterministic guarantees but need language-specific configuration. We contribute a validated evaluation framework with comparison against traditional analyzers (Tree-sitter, Soot, Joern) and task-specific applicability tiers. Benchmark: https://github.com/mathieu0905/llm_code_analysis.git
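
For readers unfamiliar with the inter-rater statistic quoted above, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal stdlib implementation; the labels and the function name `cohens_kappa` are illustrative and are not drawn from the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two reviewers on ten model outputs.
rater_1 = ["ok", "ok", "bad", "ok", "bad", "bad", "ok", "ok", "bad", "ok"]
rater_2 = ["ok", "ok", "bad", "bad", "bad", "bad", "ok", "ok", "ok", "ok"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 3))  # 0.583
```

Here the raters agree on 8 of 10 items (p_o = 0.8) while chance agreement is 0.52, giving a kappa of about 0.58. The range reported in the abstract (0.844-0.936) indicates much stronger agreement than this toy example.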