2605.30343 2026-05-29 cs.CL cs.AI (version update)

Unlocking the Working Memory of Large Language Models for Latent Reasoning

Lukas Aichberger, Sepp Hochreiter

Affiliations: ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; NXAI GmbH, Linz, Austria

AI summary: Proposes RiM, a latent reasoning method that replaces the autoregressive generation of intermediate reasoning steps with fixed memory blocks, enabling compute-efficient latent reasoning in a single forward pass.

Comments Preprint

Sampling-Based Reasoning: Cutting at Decision Points

Felix Zhou, Anay Mehrotra, Quanquan C. Liu

Affiliations: Yale University; Stanford University

AI summary: Proposes the Entropy-Cut Metropolis-Hastings algorithm, which uses the base model's next-token entropy as a proxy to identify key decision points and resample from them, so as to sample efficiently from the power distribution and strengthen reasoning; it outperforms baselines and RL-trained models on several benchmarks.

Abstract

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
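
The idea of using next-token entropy to choose where to cut can be illustrated with a minimal Python sketch. It covers only cut selection: the Metropolis-Hastings accept/reject step is omitted, and the rule of sampling a cut in proportion to the positive entropy jump is an assumption made here for illustration, since the abstract does not state how entropy is converted into cut probabilities. The function names are hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_jumps(per_position_probs):
    """Positive increase in entropy relative to the previous position."""
    ents = [entropy(p) for p in per_position_probs]
    jumps = [0.0]  # the first position has no predecessor to compare with
    for prev, cur in zip(ents, ents[1:]):
        jumps.append(max(0.0, cur - prev))
    return jumps


def sample_cut(per_position_probs, rng=random):
    """Pick a cut index with probability proportional to its entropy jump.

    Assumed rule, not taken from the paper. Falls back to a uniform
    choice when no position shows a positive jump.
    """
    jumps = entropy_jumps(per_position_probs)
    if sum(jumps) == 0.0:
        return rng.randrange(len(jumps))
    return rng.choices(range(len(jumps)), weights=jumps, k=1)[0]
```

With per-position distributions `[[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]]`, only the middle position shows an entropy jump, so `sample_cut` always selects index 1, whereas a uniform sampler would pick it only a third of the time.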


2605.30326 2026-05-29 cs.RO cs.AI (version update)

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

Chunru Lin, Hongxin Zhang, Fenghao Yu, Zhehuan Chen, Thomas L. Griffiths, Yejin Choi, David Held, Chuang Gan

Affiliations: University of Massachusetts Amherst; Princeton University; Stanford University; Carnegie Mellon University

AI summary: Proposes RoboWits, a bimanual robot benchmark that uses an automated, multi-agent cooperative task-generation pipeline to evaluate cognitive reasoning, creative tool use, and robustness across geometry, material, and assembly reasoning; finds that pre-trained VLAs are brittle on mutated tasks.

Comments The first two authors contributed equally

Abstract

The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bi-manual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curated 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly-based reasoning. We benchmark popular robot policies, pre-trained VLAs, and oracle-state planners. Our results reveal a significant performance gap: while pre-trained VLAs exhibit preliminary success on seed tasks after single-task fine-tuning, they struggle to perform on mutated tasks, implying their brittleness in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments. Project page is available at https://umass-embodied-agi.github.io/RoboWits.


2605.30324 2026-05-29 cs.DS cs.AI cs.CL cs.LG stat.ML (version update)

On Language Generation in the Limit with Bounded Memory

Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas

Affiliations: Cornell University; Stanford University; Google Research

AI summary: Studies language generation in the limit under bounded memory, using combinatorial bounds and sliding-window analysis to characterize how memory constraints affect generability, density, and identification.

Comments The abstract has been shortened to fit within the arXiv limit

Abstract

We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators -- the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an ``approximate'' version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.


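2605.30323 2026-05-29 cs.LG cs.AI (version update)

In-Context Reward Adaptation for Robust Preference Modeling

Zhenyu Sun, Zheng Xu, Ermin Wei

Affiliations: Northwestern University; Meta Superintelligence Labs

AI summary: Proposes a Transformer-based in-context reward adaptation framework that uses a few preference examples, with human response time as an auxiliary signal, to model diverse and unseen human preferences online, giving robust preference modeling and adaptation to distribution shift.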

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang

Affiliations: State Key Lab of CAD&CG, Zhejiang University; Google; Google DeepMind

AI summary: Proposes Archon, a fully pretrained, human-centric unified multimodal model that achieves holistic digital human generation across seven modalities, including text, audio, motion, and visual content, through modality-specific tokenizers, semantic video reparameterization, and a "Thinking in Modality" strategy.

Comments Accepted to CVPR 2026. Project Page: https://zju3dv.github.io/archon/

Abstract

Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers, and a native autoregressive unified multimodal model pretrained on synchronized modalities and 72 diverse tasks to model holistic joint distributions. To address the token explosion challenge in high-fidelity talking videos, we introduce a memory-efficient semantic video reparameterization, achieving 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. We further propose a "Thinking in Modality" that decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modality, progressively enhancing fidelity and controllability. Extensive experiments demonstrate that Archon achieves superior or comparable performance across diverse digital human generation tasks, validating the effectiveness of our unified framework. Project page: https://zju3dv.github.io/archon/.


2605.30310 2026-05-29 cs.CV cs.AI cs.GR (version update)

City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images

Sayan Paul, Sourav Ghosh, Siddharth Katageri, Soumyadip Maity, Sanjana Sinha, Brojeshwar Bhowmick

Affiliations: Visual Computing & Embodied AI Lab, TCS Research

AI summary: Proposes City-Mesh3R, a framework that uses a divide-and-conquer strategy to reconstruct watertight surface meshes end to end from large unordered image collections, addressing incomplete geometry, irregular surfaces, and computational complexity in city-scale reconstruction.

Comments Accepted to the USM3D Workshop Proceedings at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 as an Oral Presentation. Project page: https://citymesh3r.github.io/

Abstract

City-scale 3D surface reconstruction from multiview images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on NeRF, Gaussian Splatting, and similar representations often fail to recover simulation-ready 3D meshes because of incomplete or missing geometry and irregular, noisy surfaces. Scaling existing small-scale 3D reconstruction methods to arbitrarily large urban scenes is infeasible due to their computational complexity. We present City-Mesh3R, a scalable framework for reconstructing watertight surface meshes directly from large unordered image collections. Unlike recent methods that use global sparse SfM point-cloud initialization followed by distributed dense 3D reconstruction of large-scale scenes, our method follows an end-to-end images-to-mesh 3D reconstruction approach using a divide-and-conquer strategy. The sparse city map is reconstructed via topological image clustering, cluster-wise independent sparse SfM, and map merging, without the need for exhaustive image feature matching. This map is then partitioned spatially to perform geometry-aware camera selection, followed by dense surface reconstruction and surface refinement using curvature-aware adaptive vertex density remeshing. The partition meshes are then stitched together to produce the global mesh of the city. The proposed end-to-end framework is evaluated on city-scale reconstruction datasets. As our qualitative and quantitative results demonstrate, the method yields high-fidelity watertight 3D meshes with regular geometry that capture fine surface details, and it scales to arbitrarily large scenes owing to end-to-end processing in a distributed setting.


2605.30295 2026-05-29 cs.CL cs.AI (version update)

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

Affiliations: System Inc.

AI summary: Proposes a pipeline that generates clinically realistic HL7 FHIR R4 datasets from unstructured text and uses it to build the MedCase-Structured dataset; finds that LLMs' diagnostic accuracy is lower on structured FHIR inputs than on plain text, underscoring the importance of deployment-aligned benchmarking.

Comments Accepted to ICML 2026 Structured Data for Health Workshop

Abstract

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.


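2605.29169 2026-05-29 cs.CR cs.AI (version update)

Domain-Informed Representation for Evolutionary Sieving in Integral and Module Lattices

Ahmad Tashfeen, Qi Cheng

Affiliations: University of Oklahoma

AI summary: For the shortest vector problem (SVP) in lattice-based cryptography, introduces a domain-informed representation and crossover operation that turn the sieving method of Ajtai et al. into a genetic algorithm, which extends naturally to module lattices.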

Comments Published (16 pages) in the proceedings of EvoApplications 2026. You may find the proceedings version here at https://link.springer.com/chapter/10.1007/978-3-032-23604-3_9

2605.28746 2026-05-29 math.OC cs.AI cs.NE (version update)

Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity

Michael T. M. Emmerich

Affiliations: Faculty of Information Technology, University of Jyväskylä

AI summary: Studies preference-shaped expected improvement criteria in Bayesian multiobjective optimization, computes the expected improvement of the hypervolume and R2 indicators exactly, and analyzes their monotonicity and geometric properties.

Comments 17 pages; Changes v1 (added strict Pareto compliance proof, removed missing figure references and redundant graphics section, added Liang et al 2026 citation in outlook. Improved figures and language

Abstract

This paper studies preference-shaped expected improvement criteria for Bayesian multiobjective optimization. We consider two indicator families which are often used for similar algorithmic purposes, but which are geometrically different. The hypervolume indicator is based on a dystopian reference point and measures dominated volume in objective space. The R2 indicator is based on a utopian point and evaluates approximation sets through weighted Tchebycheff scalarization envelopes. The purpose of the paper is to make precise which preference transformations preserve exact computation, Pareto compatibility, and monotonicity properties, and which transformations change the underlying geometry. On the hypervolume side, we revisit canonical EHVI through the Deng representation, formulate product-density weighted EHVI in desirability coordinates, discuss cone-based EHVI as ordinary EHVI after a linear cone transformation, and separate these cases from truncated EHVI, where variance monotonicity may fail. On the R2 side, we prove that exact integral R2 improvement is not, in general, an ordinary objective-space weighted hypervolume. The obstruction is lower-dimensional: Lebesgue-density hypervolume cannot see certain boundary contributions that Tchebycheff scalarizations still detect. We then show that exact integral R2 improvement is exactly a scalarization-space volume, namely the measure of the Tchebycheff shadow between the incumbent scalarization envelope and the reference envelope. This representation yields finite-sum ER2I algorithms for discrete R2, quadrature methods for exact integral R2, and an achievement-space Gaussian surrogate formulation in which ER2I is an integral of scalar Gaussian expected improvements.


2604.04956 2026-05-29 physics.soc-ph cs.AI cs.CY physics.pop-ph (version update)

MiAD: Mirage Atom Diffusion for De Novo Crystal Generation

Andrey Okhotin, Maksim Nakhodnov, Nikita Kazeev, Mikhail Lazarev, Andrey E Ustyuzhanin, Dmitry Vetrov

Affiliations: Higher School of Economics; Moscow State University; Constructor University of Bremen

AI summary: Proposes mirage infusion, a technique that lets diffusion models change the number of atoms during generation, markedly improving crystal generation quality and reaching an 8.2% S.U.N. rate on the MP-20 dataset.

Abstract

In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel (S.U.N.) crystalline materials. However, most of these models cannot change the number of atoms in the crystal during generation, which limits the variability of model sampling trajectories. In this paper, we demonstrate the severity of this restriction and introduce a simple yet powerful technique, mirage infusion, which enables diffusion models to change the state of the atoms that make up the crystal from existent to non-existent (mirage) and vice versa. We show that this technique improves model quality by up to 2.5x compared to the same model without this modification. The resulting model, Mirage Atom Diffusion (MiAD), is an equivariant joint diffusion model for de novo crystal generation that is capable of altering the number of atoms during the generation process. MiAD achieves an 8.2% S.U.N. rate on the MP-20 dataset, which substantially exceeds existing state-of-the-art approaches. Code: https://github.com/andrey-okhotin/miad.git


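2605.30284 2026-05-29 cs.AI (version update)

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

A. J. Lew, Y. Cao, M. J. Buehler

Affiliations: Unreasonable Labs

AI summary: Proposes the ProjectionBench framework, which evaluates the novelty and reasoning ability of large language models in scientific discovery through progressive information disclosure; experiments show that GPT-5.4 maintains an F1 score of 0.7 in alignment with the ground-truth conclusions even under minimal context.

Comments 19 pages, 4 figures

AI abstract (translated from Chinese)

Scientific discovery is inherently a creative and uncertain process that requires reasoning beyond what is already known. Although many benchmarks evaluate large language models on deep research tasks through multi-hop retrieval, the innovative reasoning ability that is essential to genuine scientific discovery remains under-tested. We introduce a benchmark framework for evaluating models on scientific discovery and reasoning, building progressively from the original question up to classical null-hypothesis testing. In our framework, the model initially receives only the topic and research question from a recent paper, and technical details are revealed step by step. At each disclosure stage, the model must generate hypotheses addressing the research question; these are compared with the conclusions of the original paper and scored by automatic semantic similarity over their constituent atomic claims. This progressive evaluation of semantic deviation from the ground-truth conclusions makes it possible to assess a model's abilities from novelty (under minimal information) through to reasoning-based capability (under full experimental detail), both of which are essential for using large language models in scientific discovery. Our framework provides a foundation for systematically evaluating the scientific reasoning and discovery capabilities of large language models, which is critical for advancing the next generation of AI scientist and co-scientist systems. Specifically, we evaluate GPT-5, GPT-5.4, Gemini 2.5 pro, and Gemini 3.1 pro preview on 45 papers covering bioactive materials, mechanical materials, and nanomaterials. We find that GPT-5.4 and Gemini 3.1 pro outperform their predecessors, and that GPT-5.4 in particular maintains an F1 score of 0.7 in alignment with the ground-truth conclusions even under minimal context.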

How Does LoRA Memorize? A Parametric Memory Law for Fine-Tuning Large Language Models

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

Affiliations: Zhejiang University; Alibaba Group

AI summary: Proposes the Parametric Memory Law, which reveals a power-law relationship linking loss reduction to parameters and sequence length in LoRA fine-tuning, and on that basis designs the MemFT optimization strategy to improve memory fidelity and efficiency.

Comments Ongoing work

Abstract

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction Delta L to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at https://github.com/zjunlp/ParametricMemoryLaw.
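
The p > 0.5 condition can be checked directly: next-token probabilities sum to 1, so at most one token can exceed 0.5, and that token is necessarily the greedy argmax. The Python sketch below illustrates this with toy distributions; the function names and the probability values are hypothetical and do not come from the paper.

```python
def greedy_token(probs):
    """Index of the most probable token (greedy decoding for one step)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def recall_guaranteed(per_step_probs, target_ids, threshold=0.5):
    """True if every target token has probability above the threshold.

    per_step_probs[t] is assumed to be the model's next-token distribution
    at step t given the correct (teacher-forced) prefix.
    """
    return all(step[tok] > threshold
               for step, tok in zip(per_step_probs, target_ids))


# Toy distributions over a 3-token vocabulary (illustrative values only).
steps = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
targets = [0, 1]
decoded = [greedy_token(step) for step in steps]
```

The condition is sufficient but not necessary: a target token with probability 0.4 is still decoded greedily whenever no other token is more likely.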


2605.30251 2026-05-29 cs.CL cs.AI (version update)

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo

Affiliations: Zhejiang University; University of Science and Technology of China

AI summary: Proposes Canonical-Context On-Policy Distillation (CCOPD), which uses a teacher-student framework to align the model's behavior under a full prompt with its behavior when information is revealed incrementally, reducing self-anchored drift; after training on multi-turn math conversations, it improves raw-sharded task performance by 32% on average.

Abstract

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.


2605.30244 2026-05-29 cs.CV cs.AI (version update)

Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Conversations: A Prominence-Stratified Cross-Provider Audit

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

AI summary: Audits 2,000 runs across 10 personas x 8 prompts x 3 model configurations and finds that the user persona significantly changes the set of brands the AI recommends, with the effect most pronounced for mid-market brands and for priors-reliant generation routes.

Abstract

The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.
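
The recommendation-set similarity named above is the Jaccard index, the size of the intersection of two sets divided by the size of their union. A minimal Python sketch follows; the brand names are placeholders rather than data from the audit, and treating two empty sets as identical is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two brand sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical recommendation sets returned for two different personas.
solo_founder = {"BrandA", "BrandB", "BrandC", "BrandD"}
enterprise_vp = {"BrandA", "BrandB", "BrandE", "BrandF"}
similarity = jaccard(solo_founder, enterprise_vp)  # 2 shared / 6 total
```

A drop of 0.12 to 0.20 in this score is measured on a scale from 0 (no shared brands) to 1 (identical sets).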


2605.30201 2026-05-29 cs.LG cs.AI (version update)

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang

Affiliations: Paris Research Center, Huawei Technologies

AI summary: Targeting a failure mode of GRPO under sparse verifiable rewards, proposes HPO, which improves training by down-weighting negative-advantage updates and using mean-length normalization, and introduces an adaptive variant, A-HPO, which significantly improves reward in the TeleLogs and Countdown experiments.

Abstract

We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need for tuning a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early sparse reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response-length. On Countdown, A-HPO achieves the largest gains in initial and most difficult configurations across 1.5B-7B models. Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.
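
The two modifications described above, down-weighting negative-advantage updates and normalizing by the mean response length, can be sketched as a per-token weight computation. This is a simplified Python illustration under stated assumptions, not the authors' loss: the parameter name `beta`, its default value, and the exact way the weight enters the objective are hypothetical, and the adaptive rule of A-HPO is not shown because the abstract does not specify it.

```python
def token_weights(advantages, lengths, beta=0.5):
    """Per-token loss weight for each response in a sampled group.

    advantages: group-relative advantage of each response
    lengths:    token count of each response
    beta:       assumed hysteretic weight in [0, 1] for negative advantages
    """
    # Mean-length normalization: every response shares one denominator, so
    # the weight per token no longer depends on that response's own length.
    mean_len = sum(lengths) / len(lengths)
    weights = []
    for adv in advantages:
        scaled = adv if adv >= 0 else beta * adv  # damp negative updates
        weights.append(scaled / mean_len)
    return weights


# Two responses: a short correct one and a long incorrect one.
w = token_weights([1.0, -1.0], [10, 30], beta=0.5)  # about [0.05, -0.025]
```

Setting `beta=1.0` recovers symmetric treatment of positive and negative advantages, while `beta=0.0` gives positive-only updates, the two extremes the ablation in the abstract compares against.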


2605.30200 2026-05-29 cs.AI (version update)

Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale

Canran Wang, Yuwen Yang, Zhen Wang, Ming Ma, Ding Yu, Chentai Wang, Keman Huang, Xiaoyong Du

Affiliations: Renmin University of China; Central University of Finance and Economics; Beijing HQ Intelligent Technology, Ltd.

AI summary: Develops a triadic collaboration system, combined with Systemic Functional Linguistics and a suggestion trajectory tracing pipeline, and uses large-scale empirical data comprising 57,954 essays to show that a division of labor with the LLM as generative engine and the teacher as pedagogical gatekeeper effectively improves writing quality; also finds a ceiling effect in which linguistic expansion yields diminishing marginal utility.

Abstract

The double-edged nature of integrating Large Language Models (LLMs) calls for an effective triadic collaboration mechanism among LLMs, teachers, and students, especially in K-12 education. We develop a triadic collaboration system to support K-12 writing learning, a multidimensional evaluation framework grounded in Systemic Functional Linguistics, and a suggestion trajectory tracing pipeline, and contribute a large-scale empirical dataset of 57,954 essays from 10,195 students across 120 schools over two years. Our findings confirm the efficacy of this system in improving writing quality through a strategic division of labor: the LLM serves as a generative engine to mitigate teacher burnout, and the teacher acts as a pedagogical gatekeeper and bridge to guarantee feedback quality. While both the LLM and the teacher are critical for skill improvement, we uncover a ceiling effect where excessive linguistic expansion yields diminishing marginal utility. These findings point to a dynamically adaptive LLM-teacher collaboration as student proficiency increases.

2605.30195 2026-05-29 cond-mat.mtrl-sci cs.AI cs.LG 版本更新

混沌动力系统中的分布强化学习

James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz

发表机构 * Centre for Artificial Intelligence（人工智能中心）； Department of Computer Science（计算机科学系）； University College London（伦敦大学学院）； University of Bologna（博洛尼亚大学）

AI总结针对混沌动力系统中强化学习面临的高方差和梯度病态问题，提出分布强化学习通过1-Wasserstein度量下的分布贝尔曼目标实现更稳定的优化。

AI中文摘要

混沌动力系统对强化学习（RL）提出了根本性挑战：对初始条件的指数敏感性导致高方差的自举目标和病态的梯度更新。混沌动力学出现在科学和工程领域的各个方面，从流体流动和气候系统到多智能体系统，在这些领域中，可靠的学习是非常可取的。标准RL方法通过标量值函数优化期望回报，隐式地对发散轨迹进行平均，并将轨迹层面的不稳定性与学习目标纠缠在一起。我们证明，在温和的统计稳定性假设下，当在$1$-Wasserstein度量下测量时，回报分布比单个轨迹更规则地演化，从而产生更平滑的分布贝尔曼目标。通过将优化与这种测度层面的结构对齐，分布RL提供了条件更好的学习。我们为混沌系统中分布方法的优势以及混沌下RL目标的几何结构提供了原则性的解释。

英文摘要

Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamics arise across scientific and engineering domains, from fluid flows and climate systems to multi-agent systems, where reliable learning is highly desirable. Standard RL methods optimise expected returns through scalar value functions, implicitly averaging over diverging trajectories and entangling trajectory level instability with the learning objective. We show that under mild statistical stability assumptions, the return distribution evolves more regularly than individual trajectories when measured under the $1$-Wasserstein metric, yielding a smoother distributional Bellman objective. By aligning optimisation with this measure level structure, distributional RL provides better conditioned learning. We offer a principled explanation for the advantages of distributional methods in chaotic systems and the geometries of RL objectives under chaos.

2605.30159 2026-05-29 cs.AI 版本更新

无锚点多样化并行LLM创意生成

Fares Nabil Ibrahim, Nafis Saami Azad, Raiyan Abdul Baten

发表机构 * Bellini College of Artificial Intelligence, Cybersecurity, and Computing, University of South Florida（贝尔尼人工智能、网络安全与计算学院，南佛罗里达大学）

AI总结研究无锚点方法（如语义方向分层）在并行LLM创意生成中实现候选池多样化，无需依赖种子想法，在多样性、质量和计算效率上优于有锚点基线。

AI中文摘要

大型语言模型越来越多地用于生成创意任务的候选想法池，其中广泛探索是有价值的。在此场景下，并行推理在拓宽池的同时保持质量和成本效率时具有吸引力。我们研究推理时控制以实现候选池多样化，探究无锚点方法是否能与依赖观察到的种子想法的方法相抗衡。在三个创意任务族中，我们在中性和群体参照发散指令下，比较了独立生成和语义方向分层与自我、同伴和代表性锚点基线。群体参照发散是一个强大的低成本基线，在保持质量代理的同时增加了语义多样性。语义方向分层更强：一次规划调用即可组织跨广泛语义方向的生成，产生最佳的多样性-质量-计算前沿。锚点再生在最终池多样性上可能很强，但其优势在完整流水线令牌核算下缩小。这些结果为开放式LLM创意生成建立了实用的无锚点基线。

英文摘要

LLMs are increasingly used to generate candidate-idea pools for creative tasks where broad exploration is valuable. Parallel inference can be attractive in this setting when it broadens the pool while retaining quality and cost efficiency. We study inference-time controls for candidate-pool diversification, asking whether anchorless methods can rival methods that depend on observed seed ideas. Across three creative task families, we compare independent generation and semantic direction stratification with self-, peer-, and representative-anchor baselines, under neutral and population-referential divergent instructions. Population-referential divergence is a strong low-cost baseline, increasing semantic diversity while preserving quality proxies. Semantic direction stratification is stronger: a single planning call organizes generations across broad semantic directions, yielding the best diversity--quality--compute frontier. Anchored regeneration can be strong in final-pool diversity, but its advantage shrinks under full-pipeline token accounting. These results establish practical anchorless baselines for open-ended LLM ideation.

2605.30148 2026-05-29 cs.LG cs.AI 版本更新

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

克服LLM微调中的遗忘：进化策略方法

Kajetan Schweighofer, Conor F. Hayes, Roberto Dailey, Risto Miikkulainen, Xin Qiu

发表机构 * Cognizant AI Lab（Cognizant AI实验室）； UT Austin（得克萨斯大学奥斯汀分校）

AI总结本文发现进化策略微调中的先前任务遗忘实为性能漂移且可恢复，并引入锚定权重衰减（AWD）正则化技术有效稳定先前任务性能，表明遗忘可避免，使ES成为LLM持续学习的可行方法。

AI中文摘要

进化策略（ES）最近作为强化学习（RL）在大语言模型（LLM）微调中的竞争性替代方案出现，通过简单性、可扩展性和仅推理训练提供优势。然而，近期研究表明，在新任务上进行ES微调可能导致对先前任务的遗忘。首先，本文表明先前任务遗忘（1）更好地被描述为性能漂移而非不可逆遗忘，在ES训练过程中先前任务性能通常会恢复；（2）并非ES特有的失败模式，使用RL方法微调时也可能出现。其次，本文分析了这种漂移何时以及为何出现，强调了其对ES训练动态的依赖性，特别是权重空间中弱约束方向上的随机游走行为。第三，基于这些见解，本文引入了锚定权重衰减（AWD）作为一种参数空间正则化技术，将优化约束在初始模型参数附近。AWD在保持目标任务性能的同时有效稳定了先前任务性能，以低得多的计算成本实现了与大型ES种群规模相当的优势。因此，与先前观点相反，本文表明ES下的先前任务遗忘在很大程度上是可以避免的，使ES成为LLM持续学习中一种有前景的方法。

英文摘要

Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) fine-tuning, offering advantages through simplicity, scalability, and inference-only training. However, recent work suggests that ES fine-tuning on new tasks may induce forgetting of prior tasks. First, this paper shows that prior task forgetting (1) is better characterized as performance drift rather than irreversible forgetting, with prior-task performance often recovering during ES training; and (2) is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods. Second, it analyzes when and why such drift arises, highlighting its dependence on ES training dynamics, particularly random walk behavior in weakly constrained directions of the weight space. Third, based on these insights, it introduces Anchored Weight Decay (AWD) as a parameter-space regularization technique that constrains optimization toward the initial model parameters. AWD effectively stabilizes prior-task performance while preserving target-task performance, achieving benefits comparable to large ES population sizes at much lower computational cost. Thus, contrary to previous beliefs, the paper shows that prior-task forgetting under ES is largely avoidable, positioning ES as a promising approach for continual learning in LLMs.
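The idea of constraining optimization toward the initial parameters can be sketched as a decay term that pulls each weight back to an anchor rather than to zero. The update rule below is an assumption for illustration only; the function name `awd_step`, the `decay` coefficient, and the way the ES update is combined with the pull are hypothetical and not taken from the paper.

```python
def awd_step(theta, anchor, update, decay=0.01):
    """Apply an optimizer update, then pull parameters toward the anchor.

    theta:  current parameters (list of floats)
    anchor: initial (pre-fine-tuning) parameters, same length as theta
    update: the raw ES or gradient step for this iteration
    decay:  strength of the pull toward the anchor (assumed hyperparameter)
    """
    return [t + u - decay * (t - a) for t, a, u in zip(theta, anchor, update)]


if __name__ == "__main__":
    theta = [1.0, 2.0]
    anchor = [0.0, 2.0]
    # With a zero update, only the first weight moves, because the second
    # one already sits on its anchor value.
    print(awd_step(theta, anchor, [0.0, 0.0], decay=0.1))
```

Standard weight decay is the special case where the anchor is the zero vector; anchoring at the initial model instead damps drift in directions that the new task constrains only weakly.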

2605.30144 2026-05-29 cs.AI cs.MA 版本更新

AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

AgentSchool：基于LLM的多智能体教育模拟系统

Yulei Ye, Wenhao Li, Zhong Wen, Yunshu Huang, Yichen Hu, Zifan Wei, Yige Wang, Xinyu Xie, Haoxuan Yang, Yanjun Huang, Ruijia Li, Hong Qian, Yu Song, Bo Jiang, Bingdong Li, Lijun Li, Bo Zhang, Pinlong Cai, Xingcheng Xu, Shuangye Chen, Xia Hu, Liang He, Aimin Zhou, Jingjing Qu, Jing Shao, Xiangfeng Wang

发表机构 * Shanghai Institute of AI for Education（上海人工智能教育研究院）； School of Computer Science and Technology（计算机科学与技术学院）； East China Normal University（华东师范大学）； School of Design（设计学院）； Faculty of Education（教育学院）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

AI总结提出AgentSchool，一种LLM驱动的多智能体模拟器，通过可成长的学生智能体（带知识图谱、思维工作流和错误概念）与自适应教师智能体（基于最近发展区）模拟学习过程，支持多尺度模拟，实验验证了其生成差异化掌握轨迹和符合课堂社会理论的行为模式。

Comments 39 pages, 10 figures

AI中文摘要

尽管LLM已迅速部署到课堂中，验证教育AI仍然具有独特的棘手性：干预措施作用于发展中的学习者，其认知和社会轨迹被不可逆地塑造，而现实世界试验缓慢、受伦理约束且受制度限制。基于LLM的教育模拟器已成为潜在的补救措施，但许多模拟器仍将学习简化为角色扮演，并且当仅优化以再现现有课堂时，可能会结构性惩罚教学改革所需的制度创新。在这项工作中，我们介绍了AgentSchool，一种LLM驱动的多智能体模拟器，将学习建模为状态转换而非提示行为。AgentSchool将可成长的学生智能体（配备加权学科知识图谱、思维工作流池和显式错误概念）与自适应教师智能体（在最近发展区内规划、搭建支架和反思）相结合，嵌入可配置的场景生成器（将教学置于正式和非正式学习领域）和多尺度模拟器（解耦交互规模、时间粒度和模拟持续时间）。实验表明，结构化学生智能体比基线模拟器产生更差异化的掌握和错误概念轨迹，而教师智能体比较显示出与基于ZPD的适应一致的骨干依赖模式。此外，AgentSchool生成与课堂社会理论一致的外围参与、小团体形成、攻击者诱导的凝聚力和意见领袖出现的合理轨迹。除了作为教育研究工具的作用外，AgentSchool还将教育构建为在组织压力下进行长时记忆、多智能体协调和未来制度推理的社会意义测试平台。

英文摘要

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.

2605.30136 2026-05-29 cs.AI 版本更新

Enhancing Multi-Agent Communication through Attention Steering with Context Relevance

通过上下文相关性的注意力引导增强多智能体通信

Hongxiang Zhang, Yuan Tian, Tianyi Zhang

发表机构 * Purdue University（普渡大学）

AI总结针对LLM多智能体系统中长对话历史导致信息稀释的问题，提出无训练的上下文管理方法Agent-Radar，利用时空衰减机制动态引导注意力，在五个基准上取得最高7.64个绝对点的提升。

AI中文摘要

基于LLM的多智能体系统通过协作推理在复杂任务上表现出色。然而，这些系统在交互过程中会迅速积累极长的对话历史。随着对话变长，相关信息被无关上下文稀释，导致性能下降。在这项工作中，我们提出了Agent-Radar，一种无需训练的上下文管理方法，通过新颖的时空衰减机制动态引导每个智能体的注意力到相关上下文。实验表明，Agent-Radar在五个不同基准上优于最先进的方法，最高提升7.64个绝对点。此外，分析显示Agent-Radar在智能体数量和交互轮次增加时仍然有效且鲁棒。最后，消融研究表明Agent-Radar的核心组件对性能至关重要，且在不同设置下具有泛化性。

英文摘要

LLM-based multi-agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these systems tend to rapidly accumulate extremely long conversation histories during interaction. As conversations lengthen, relevant information is increasingly diluted by irrelevant context, leading to degraded performance. In this work, we present Agent-Radar, a training-free context management method that dynamically steers each agent's attention toward relevant context with a novel temporal and spatial decay mechanism. Our experiments demonstrate that Agent-Radar outperforms state-of-the-art methods across five different benchmarks, yielding gains of up to 7.64 absolute points. Furthermore, our analysis shows that Agent-Radar remains effective and robust as the number of agents and interaction rounds increases. Finally, the ablation study shows that core components in Agent-Radar are crucial to performance and generalizable in different settings.

2605.30135 2026-05-29 cs.LG cs.AI 版本更新

DAMEL: Dual-Axis Multi-Expert Learning for Class-Imbalanced Learning

DAMEL: 双轴多专家学习用于类别不平衡学习

Hyuck Lee, Taemin Park, Heeyoung Kim

发表机构 * AI Research, Krafton（AI研究，Krafton）； Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)（工业与系统工程系，韩国科学技术院）

AI总结提出双轴多专家学习算法DAMEL，通过表示轴和时间轴上的多专家集成，同时降低预测偏差和方差，有效解决类别不平衡学习问题。

AI中文摘要

针对来自具有长尾分布的真实世界数据的类别不平衡学习所带来的挑战，已有多种算法被提出。这些算法通过重平衡技术减少了预测偏差，但通常以增加预测方差为代价。一些多专家学习算法旨在解决这一方差问题，但涉及复杂的过程。我们提出了一种新的多专家学习算法，称为双轴多专家学习（DAMEL），该算法通过沿表示轴和时间轴使用多个专家来同时降低预测的偏差和方差。沿表示轴，DAMEL拼接多个专家的表示，并同时使用拼接后的表示训练一个辅助的平衡分类器。沿时间轴，DAMEL聚合跨训练时期的网络权重，并在测试时使用这些聚合权重。实验结果表明，DAMEL同时降低了预测的偏差和方差，突显了其在类别不平衡学习中的有效性。

英文摘要

Various algorithms have been proposed to address the challenges posed by class-imbalanced learning from real-world data with long-tailed distributions. While these algorithms reduce prediction bias through rebalancing techniques, they often introduce increased prediction variance as a trade-off. Several multi-expert learning algorithms aim to address this variance but involve complex procedures. We propose a new multi-expert learning algorithm, called the dual-axis multi-expert learning (DAMEL), which reduces both bias and variance of predictions by using multiple experts along both representation and time axes. Along the representation axis, DAMEL concatenates the representations of multiple experts and trains an auxiliary balanced classifier simultaneously with the concatenated representations. Along the time axis, DAMEL aggregates network weights across training epochs, employing these aggregated weights during testing. Experimental results demonstrate that DAMEL reduces both bias and variance of predictions, highlighting its effectiveness in class-imbalanced learning.

2605.30126 2026-05-29 cs.CV cs.AI cs.CL cs.LG 版本更新

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

PARCEL: 基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解

Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

发表机构 * Max Planck Institute for Informatics（马克斯·普朗克信息学研究所）； Google（谷歌）

AI总结提出PARCEL视觉分词架构，通过池锚定和条件弹性查询重采样解决视觉令牌压缩中的空间与查询表示冲突，在27个基准上提升性能-效率帕累托前沿。

Comments 33 pages, 4 figures

AI中文摘要

大型视觉-语言模型（LVLMs）将视觉输入映射为密集的令牌序列，导致推理时的二次计算瓶颈。弹性视觉令牌压缩通过训练单一模型以在多个视觉令牌预算下运行来解决这一问题。然而，现有方法在激进压缩下表现不佳。空间压缩（如嵌套池化）表现为不完美的低通滤波器，并引起频谱混叠，掩盖了细粒度细节。查询压缩（如嵌套查询重采样）用非局部摘要替代显式的网格对齐令牌，显著降低了空间定位能力。为解决这一表示冲突，我们引入了PARCEL（基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解），一种视觉分词架构，动态分配特征提取的工作。PARCEL将空间池令牌建立为低频布局锚点，并通过池条件查询重采样使弹性查询令牌依赖于这些锚点。这鼓励查询令牌专注于互补的视觉特征，而非冗余的空间映射。在27个基准上的广泛评估表明，PARCEL改进了性能-效率帕累托前沿，在各种视觉令牌预算下持续优于现有的嵌套基线，同时保留了“一次训练，随处部署”的范式。

英文摘要

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.

2605.30117 2026-05-29 cs.AI 版本更新

冲突多源个人记忆上的选择性问答：诊断性测试平台与方法比较

Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky

AI总结针对多源冲突记忆的选择性问答问题，构建了包含34,560个实例的诊断基准，评估了多种方法，发现结构化融合方法在准确性和选择性上优于纯提示LLM。

Comments 55 pages, 5 figures

AI中文摘要

新兴的个人AI代理正朝着持久、多源记忆的方向发展。这带来了一个评估问题：系统必须决定如何使用冲突或不完整的证据；它们不能仅从一个干净的历史中检索事实。现有的基准很少能显示错误是来自提供给方法的证据还是来自方法的冲突解决步骤。我们将这一问题作为冲突多源个人记忆上的选择性问答来研究：系统基于冲突的、有时不完整的来源进行回答，或者在证据不足时放弃回答。我们开发了一个基准，包含8种推理类型下的18个问题模板、480个角色、4个随机种子和34,560个实例，具有受控的来源扭曲和确定性的真实答案。我们评估了无法访问任何来源、访问单一来源、结构化融合方法以及前沿LLM的基线性能。最佳训练融合解析器达到80.3%的准确率，而最强的纯提示LLM基线达到70.0%。在允许弃权的情况下，同一解析器在78.3%的覆盖率下达到85.3%的选择性准确率，最佳LLM在95.4%的覆盖率下达到71.0%的选择性准确率。不同模型在不同推理类型上具有不同的优势。我们发布了数据、代码、缓存的模型输出以及数据生成过程以供复用。

英文摘要

Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.
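The abstract reports selective accuracy together with coverage. Under the usual definitions (coverage is the fraction of questions answered, and selective accuracy is accuracy over the answered questions only), the two quantities can be computed as below. The function name and the convention of encoding an abstention as `None` are assumptions for illustration, not the benchmark's released code.

```python
def selective_metrics(predictions, gold):
    """Return (coverage, selective_accuracy).

    predictions: list of answers, with None marking an abstention
    gold:        list of ground-truth answers, same length
    selective_accuracy is None when the system abstains on everything.
    """
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold)
    if not answered:
        return coverage, None
    correct = sum(1 for p, g in answered if p == g)
    return coverage, correct / len(answered)


if __name__ == "__main__":
    preds = ["a", None, "b", "c"]
    gold = ["a", "b", "b", "x"]
    print(selective_metrics(preds, gold))
```

The two numbers trade off against each other: a system that abstains more often can raise its selective accuracy while lowering its coverage, which is why the abstract reports them as pairs.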

2605.30085 2026-05-29 cs.AI cs.CL cs.LG stat.ML 版本更新

Conformal Certification of Reasoning Trace Prefixes

推理轨迹前缀的保形认证

Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

发表机构 * Department of Electrical & Computer Engineering, Rice University（电气与计算机工程系，莱斯大学）

AI总结提出CROP方法，通过保形校准选择阈值，返回最长无错前缀，并控制错误包含概率，平衡保留有效推理与丢弃误导后缀。

Comments Code available at https://github.com/matthewyccheung/crop

AI中文摘要

语言模型推理轨迹很少是全有或全无；在关键错误发生之前，它们通常包含有效的中间步骤。现有的不确定性量化方法通常认证最终答案或整个响应，未能为顺序轨迹中可安全保留的比例提供统计保证。为了解决这个问题，我们引入了CROP（保形推理输出前缀），一种与验证器无关的校准程序，用于干净前缀认证。给定任何步骤级风险代理，CROP选择一个校准阈值，并返回其步骤风险代理保持低于该阈值的最长连续前缀，将未认证的后缀路由到下游审查或修复。假设可交换性，CROP严格控制了返回前缀包含注释错误的边际概率。在六个过程标记的推理数据集上，我们证明了标准步骤级指标（如AUROC）不能完全捕捉前缀效用，这表明验证器应改为通过认证前缀长度进行评估。此外，CROP平衡了过度扣留和扣留不足，通过保留有效的中间推理同时丢弃误导后缀，提高了下游修复的准确性。最终，这项工作将前缀认证定位为过程监督、弃权和修复之间的严格、实用的桥梁。

英文摘要

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
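The thresholding step described in the abstract can be sketched with a generic split-conformal construction. This is an illustrative reading, not the authors' procedure: the function names, the choice of calibration score (the largest step risk up to and including the first annotated error), and the quantile rule are all assumptions. Under exchangeability, picking the k-th smallest calibration score with k = floor(alpha * (n + 1)) keeps the chance that a new trace's prefix includes its first error at or below alpha.

```python
import math


def failure_score(risks, first_error):
    """Smallest threshold above which the prefix would swallow the first error.

    risks:       per-step risk proxies for one trace
    first_error: index of the first annotated error, or None if error-free
    """
    if first_error is None:
        return math.inf  # an error-free trace can never yield a bad prefix
    return max(risks[: first_error + 1])


def calibrate_threshold(calibration, alpha):
    """calibration: list of (risks, first_error) pairs; alpha in (0, 1)."""
    scores = sorted(failure_score(r, e) for r, e in calibration)
    k = min(math.floor(alpha * (len(scores) + 1)), len(scores))
    if k == 0:
        return -math.inf  # too little data: certify nothing
    return scores[k - 1]


def certified_prefix_length(risks, threshold):
    """Length of the longest leading run of steps with risk below threshold."""
    length = 0
    for r in risks:
        if r >= threshold:
            break
        length += 1
    return length


if __name__ == "__main__":
    calibration = [
        ([0.1, 0.9], 1),
        ([0.2, 0.3], None),
        ([0.5, 0.1], 0),
        ([0.1, 0.2, 0.7], 2),
    ]
    tau = calibrate_threshold(calibration, alpha=0.5)
    print(tau, certified_prefix_length([0.1, 0.3, 0.8, 0.2], tau))
```

Steps after the certified prefix are not discarded as wrong; in the workflow the abstract describes, they are handed to downstream review or repair.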

2605.30070 2026-05-29 cs.LG cs.AI 版本更新

A Predictive Law for On-Policy Self-Distillation From World Feedback

基于世界反馈的在线自蒸馏预测定律

Tommy He, Jerome Sieber, Matteo Saponati

发表机构 * Open-source models（开源模型）； LiveCodeBench

AI总结本文发现在线自蒸馏（OPSD）中初始师生性能差距与最终性能改进之间存在线性关系，并提出一种预测定律，用于在训练前预测OPSD配置的效果。

AI中文摘要

超越简单的标量奖励，向更丰富的世界反馈迈进，是实现更可扩展的RL后训练的自然路径。在线自蒸馏（OPSD）是一种有前景的最新方法，它使用任意反馈作为学习信号，但其与GRPO等成熟方法相比的可靠性仍不清楚。我们发现了OPSD中初始学生-教师性能差距与最终性能改进之间存在惊人的一致线性相关性。这种关系在不同上下文类型和模型家族中均成立，为预测OPSD配置的结果提供了一种强大的预测定律，而无需运行完整的训练过程。有趣的是，我们表明这种线性可预测性随模型规模成立，这为具有更强上下文学习能力的大型模型上新的经验缩放定律提供了潜在基础。本质上，我们的发现表明，OPSD性能可以在训练前进行预测和调整，为将世界反馈作为后训练流水线的一等组件提供了一种原则性方法。

英文摘要

Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signal, yet its reliability compared to established methods, such as GRPO, remains unclear. We identify a strikingly consistent linear correlation between the initial student-self-teacher performance gap and the final performance improvement in OPSD. This relationship holds across context types and model families, providing a powerful predictive law for anticipating the outcome of an OPSD configuration without running the full training procedure. Interestingly, we show that this linear predictability holds with model scale, suggesting a potential basis for new empirical scaling laws on larger models with stronger in-context learning capabilities. In essence, our findings show that OPSD performance can be predicted and tuned before training, offering a principled way to incorporate world feedback as a first-class component of the post-training pipeline.

2605.30054 2026-05-29 cs.SE cs.AI 版本更新

Projectional Decoding: Towards Semantic-Aware LLM Generation

投影式解码：迈向语义感知的LLM生成

Boqi Chen, José Antonio Hernández López, Aren A. Babikian

发表机构 * University of Ottawa（渥太华大学）； University of Murcia（穆尔西亚大学）； University of Toronto（多伦多大学）

AI总结提出投影式解码框架，通过维护部分图模型作为主要工件表示，实现增量语义验证和错误检测，以提升LLM生成工件的语义有效性。

Comments 5 pages, 3 figures. Accepted at FSE 2026 IVR track

AI中文摘要

大型语言模型（LLM）越来越多地被用于跨许多软件工程（SE）任务生成软件工件，然而确保这些工件的语义有效性仍然是一个基本挑战。现有的约束解码技术可以强制执行语法正确性，并且在某些情况下强制执行特定的语义规则，但缺乏一种通用表示，能够将LLM生成的文本与SE中语义验证所需的推理联系起来。在本文中，我们提出了投影式解码，一种新颖的概念框架，通过在整个生成过程中与文本一起维护部分图模型作为主要工件表示，直接将领域语义集成到生成过程中。这种抽象表示通过显式捕获不确定性并原生支持错误检测，实现增量语义验证，同时引导生成朝向具有可证明保证的语义有效输出。我们在一个程序生成任务上展示了初步结果，证明了这种方法在提高LLM生成工件的语义有效性方面的潜力。我们还讨论了投影式解码如何能够在各种SE活动中实现与LLM的可验证自动化。

英文摘要

Large language models (LLMs) are increasingly used to generate software artifacts across many software engineering (SE) tasks, yet ensuring the semantic validity of these artifacts remains a fundamental challenge. Existing constrained decoding techniques can enforce syntactic correctness and, in some cases, specific semantic rules, but lack a general representation that bridges LLM-generated text with the reasoning required for semantic validation in SE. In this paper, we propose projectional decoding, a novel conceptual framework that integrates domain semantics directly into the generation process by maintaining, alongside text, a partial graph model as the primary artifact representation throughout generation. This abstract representation enables incremental semantic validation by explicitly capturing uncertainty and natively supporting error detection, while guiding generation toward semantically valid outputs with provable guarantees. We present preliminary results on a program generation task which demonstrate the potential of this approach to improve the semantic validity of LLM-generated artifacts. We also discuss how projectional decoding can enable verifiable automation with LLMs across various SE activities.

2605.30052 2026-05-29 cs.SE cs.AI cs.CL 版本更新

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

REPOT：通过检查点修复实现可恢复的思维程序

Parsa Mazaheri

发表机构 * University of California, Santa Cruz（加州大学圣克鲁斯分校）

AI总结提出 RePoT 方法，通过确定性验证重放和 LLM 调用从验证前缀恢复，以解决 Program-of-Thought 中单个无效动作导致轨迹失效的问题，在多个模型和基准上提升成功率。

AI中文摘要

单次 Program-of-Thought (PoT) 生成一个打印基本动作计划的 Python 程序；单个无效动作会无声地使轨迹失效。我们引入 RePoT (可恢复 PoT)：一种确定性验证重放，它将计划遍历环境直到第一个无效转换，然后通过一次 LLM 调用从验证前缀恢复。在 PoT 失败的约 14% 的问题上，RePoT 最多增加一次 LLM 调用。在 PuzzleZoo-775 上，RePoT 在四种闭源模型配置上比 PoT 提高 +3 到 +11 个百分点，在 gpt-5.4-mini-medium 上达到 96.9% 对比 86.3% 的峰值；与预算匹配的 PoT-retry 基线相比，RePoT 在 Gemini 上明显获胜（+3.8pp，95% CI [+2.2,+5.4]），在 GPT-medium 和 Claude 上处于采样噪声范围内，在 GPT-mini 上落败；这是一种能力扩展模式，我们开始通过自适应 RePoT 解决，这是一种基于规则的调度器，根据验证前缀长度在后缀修复和全新 PoT 重试之间路由（初步）。我们在 PlanBench Blocksworld 上复现（+1.1 到 +11.4pp），在四个开放权重模型上（四个中的三个 +3.3 到 +20.0pp）。在 Derail-550（我们的受控恢复基准）上，每个能够访问检查点信息的条件在 GPT-medium 上达到 >=30%，在 Gemini 上达到 >=70%，而仅错误反馈条件 <=3.1%，这表明检查点信息（而非特定的验证前缀尾部）是承载恢复的信号。

英文摘要

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin to address with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback -- showing that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.

2605.30049 2026-05-29 cs.AI 版本更新

Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers

面向文本到图像扩散Transformer的鲁棒且可泛化的安全引导

Zihao Xue, Yan Wang, Zhen Bi, Long Ma, Zhonglong Zheng, Zeyu Yang, Bingyu Zhu, Longtao Huang, Jie Xiao, Jungang Lou

发表机构 * Huzhou Normal University（湖州师范学院）； Alibaba Group（阿里巴巴集团）； University of Science and Technology of China（中国科学技术大学）； Zhejiang Normal University（浙江师范大学）； Zhejiang University of Technology（浙江工业大学）

AI总结提出SafeDIG框架，通过位置感知稀疏特征迁移实现扩散Transformer的安全引导，在保持源域安全性和图像质量的同时，有效降低目标域和整体不安全生成率。

AI中文摘要

扩散Transformer已成为文本到图像生成的强大骨干网络，但其分层和跨模态生成过程使得安全控制在根本上不同于提示级过滤或输出级检测。有害语义可能在文本表示中弱表达，逐步绑定到视觉潜变量，最终与渲染动态纠缠。因此，在固定层进行安全引导可能不稳定，而从已知风险学习到的引导机制可能无法可靠地迁移到偏移的目标风险域。我们提出SafeDIG，一个将DiT安全适应形式化为位置感知稀疏特征迁移的安全引导框架。SafeDIG首先在功能不同的DiT干预位置上构建稀疏自编码器，并使用鲁棒性感知预训练路由来优先选择在源-目标风险偏移下预期保持稳定的干预站点。然后，通过冻结SAE编码器作为可重用的稀疏安全字典，并仅将解码器适应到目标域激活流形，将可迁移的安全特征与特定领域的激活几何分离。在推理过程中，SafeDIG结合混合和排斥操作，将不安全激活引导至迁移的安全流形或远离有害的稀疏方向。在FLUX.1 Dev和Stable Diffusion 3.5 Large上的实验表明，SafeDIG在保持源域安全性和图像质量的同时，持续降低了目标域和整体的不安全生成率。

英文摘要

Diffusion Transformers have become a powerful backbone for text-to-image generation, but their layered and cross-modal generation process makes safety control fundamentally different from prompt-level filtering or output-level detection. Harmful semantics may be weakly expressed in text representations, progressively bound to visual latents, and finally entangled with rendering dynamics. As a result, safety steering at a fixed layer can be unstable, and a steering mechanism learned from known risks may not transfer reliably to a shifted target risk domain. We propose SafeDIG, a safety steering framework that formulates DiT safety adaptation as position-aware sparse feature transfer. SafeDIG first constructs Sparse Autoencoders over functionally distinct DiT intervention positions and uses robustness-aware pre-training routing to prioritize intervention sites that are expected to remain stable under source-target risk shift. It then separates transferable safety features from domain-specific activation geometry by freezing the SAE encoder as a reusable sparse safety dictionary and adapting only the decoder to the target-domain activation manifold. During inference, SafeDIG combines Blend and Repel operations to steer unsafe activations toward transferred safety manifolds or away from harmful sparse directions. Experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large show that SafeDIG consistently reduces target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.

2605.30046 2026-05-29 cs.LG cs.AI 版本更新

Masked Diffusion Modeling for Anomaly Detection

掩码扩散建模用于异常检测

Lixing Zhang, Yuchen Liang, Liyan Xie

发表机构 * University of Minnesota（明尼苏达大学）； Ohio State University（俄亥俄州立大学）

AI总结提出基于掩码扩散模型的MaskDiff-AD方法，通过重建随机掩码坐标的难度构建异常分数，在分类、混合类型和离散序列数据上实现高效异常检测。

详情

AI中文摘要

异常检测旨在识别偏离名义数据分布的样本，是许多安全关键应用的核心。然而，针对分类、混合类型和离散序列数据开发有效的异常检测方法仍然具有挑战性且相对未被充分探索。掩码扩散模型通过学习从剩余可见上下文中恢复掩码值，为建模此类数据提供了一种自然的方式。在本文中，我们提出了用于异常检测的掩码扩散（MaskDiff-AD），一种基于掩码扩散模型的仅前向方法，仅在名义数据上训练。给定测试样本，MaskDiff-AD从随机掩码坐标的重建难度构建异常分数，产生一个直接作用于离散状态空间且避免反向时间采样的内容敏感分数。我们还开发了MaskDiff-AD的非参数变体，并通过在固定检测阈值下表征I型和II型错误提供了理论保证。在来自ADBench和UADAD的十四个分类和混合类型表格数据集，以及来自NLP-ADBench的四个文本异常检测数据集上的实验表明，MaskDiff-AD相对于经典、基于扩散以及最近的表格/文本异常检测基线取得了有竞争力的性能。值得注意的是，MaskDiff-AD达到了最佳总体平均排名，优于所有十二种表格基线方法。

英文摘要

Anomaly detection aims to identify samples that deviate from the nominal data distribution and is central to many safety-critical applications. However, developing effective anomaly detection methods for categorical, mixed-type, and discrete sequence data remains challenging and relatively underexplored. Masked diffusion models provide a natural way to model such data by learning to recover masked values from the remaining visible context. In this paper, we propose Masked Diffusion for Anomaly Detection (MaskDiff-AD), a forward-only method based on masked diffusion models trained only on nominal data. Given a test sample, MaskDiff-AD constructs anomaly scores from the difficulty of reconstructing randomly masked coordinates, yielding a content-sensitive score that operates directly on discrete state spaces while avoiding reverse-time sampling. We also develop a non-parametric variant of MaskDiff-AD and provide theoretical guarantees by characterizing Type-I and Type-II errors under a fixed detection threshold. Experiments on fourteen categorical and mixed-type tabular datasets from ADBench and UADAD, as well as four text anomaly detection datasets from NLP-ADBench, show that MaskDiff-AD achieves competitive performance against classical, diffusion-based, and recent tabular/text anomaly detection baselines. Notably, MaskDiff-AD achieves the best overall average rank, outperforming all twelve tabular baseline methods.
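The scoring idea, averaging how hard it is to reconstruct randomly masked coordinates, can be sketched on categorical rows. The sketch below is a toy stand-in, not MaskDiff-AD itself: instead of a trained masked diffusion model it uses smoothed per-column frequencies, which ignore the visible context entirely, and every name (`fit_marginals`, `log_prob`, `anomaly_score`, `mask_rate`) is hypothetical.

```python
import math
import random
from collections import Counter


def fit_marginals(nominal_rows, smoothing=1.0):
    """Toy 'model': per-column value counts estimated from nominal data only."""
    n_cols = len(nominal_rows[0])
    counts = [Counter(row[j] for row in nominal_rows) for j in range(n_cols)]
    return counts, len(nominal_rows), smoothing


def log_prob(model, column, value):
    """Smoothed log-probability of a value; one extra slot covers unseen values."""
    counts, n_rows, smoothing = model
    vocab = len(counts[column]) + 1
    return math.log((counts[column][value] + smoothing) / (n_rows + smoothing * vocab))


def anomaly_score(model, row, n_masks=16, mask_rate=0.5, seed=0):
    """Mean negative log-likelihood of the true values at randomly masked positions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_masks):
        masked = [j for j in range(len(row)) if rng.random() < mask_rate]
        if not masked:
            masked = [rng.randrange(len(row))]  # always mask at least one coordinate
        total += -sum(log_prob(model, j, row[j]) for j in masked) / len(masked)
    return total / n_masks


if __name__ == "__main__":
    nominal = [("a", "x")] * 9 + [("b", "y")]
    model = fit_marginals(nominal)
    print(anomaly_score(model, ("a", "x")), anomaly_score(model, ("c", "z")))
```

A real masked diffusion model would condition each masked coordinate on the unmasked ones, so the score would also react to unusual combinations of individually common values, which this toy model cannot detect.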

2605.30042 2026-05-29 cs.AI 版本更新

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

学会选择：一种基于赋权与语义通信的自适应方法选择多智能体系统

Geremy Loachamín-Suntaxi, Robert Lazar, Dimitrios G. Giovanis, Ioannis G. Kevrekidis, Eleni D. Koronaki

发表机构 * Faculty of Science, Technology and Medicine（科学、技术与医学学院）； University of Luxembourg（卢森堡大学）； Johns Hopkins University（约翰霍普金斯大学）； Luxembourg Institute of Science and Technology（卢森堡科学与技术研究院）

AI总结提出一种结合上下文赌博机、结构化智能体间通信和语义检查点的多智能体框架，通过保持动作-结果因果一致性来提升科学计算工作流中自适应决策的收敛性、鲁棒性和泛化能力。

详情

AI中文摘要

自动化科学计算工作流不仅需要生成可执行代码：自主系统还必须选择适当的计算策略，忠实地执行它们，并确保最终结果在因果上可归因于产生它们的决策。在多智能体流水线中，这一过程尤其脆弱，因为智能体意图与行动之间的微小不一致可能导致语义漂移，即最终执行的程序不再反映最初选择的策略，从而破坏下游评估和适应。受ATHENA框架（Toscano等人，2025；Toscano等人，2026）和赋权概念（Yiu等人，2025）的启发，本文引入了一个多智能体框架，该框架将上下文赌博机与结构化智能体间通信相结合，最重要的是，引入了语义检查点以保持整个流水线中行动-结果的一致性。该系统在自适应决策架构中集成了专门的大语言模型（LLM）智能体、有依据的代码生成和自修复执行循环。通过赋权的视角解释该框架，我们表明可靠的自主学习不仅需要识别高质量的行动，还需要保持这些行动在智能体间传播的完整性。使用敏感性分析和不确定性量化工作流作为代表性案例研究，我们证明未受约束的语义漂移会损害策略学习，而所提出的框架则提高了收敛性、鲁棒性和对新问题情境的适应能力。这些结果表明了科学多智能体系统的一个更广泛的设计原则：自适应决策必须与明确的机制相结合，以保证整个计算流水线中的语义一致性和可靠信息流。

英文摘要

Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.

2605.30040 2026-05-29 cs.CR cs.AI cs.CL (updated version)

RAISE: Treating RAG Design as an Architecture Search Problem

Zhen Chen, Yibing Liu, Weihao Xie, Yu Liang, Peilin Chen, Shiqi Wang

Affiliations: City University of Hong Kong; Baidu Inc.

AI summary: Formalizes the design choices of retrieval-augmented generation (RAG) systems as an architecture search problem and builds the RAISE framework and benchmark, which evaluates 13 optimization algorithms on 7 datasets under standardized search spaces and budgets; optimization performance turns out to be highly task-dependent.

Abstract

Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking, retrieval depth, reranking, and context compression. In practice, these choices are often configured through heuristics, hindering systematic evaluation and reproducibility across settings. We argue that this challenge is best formulated as RAG architecture search. To support controlled and reproducible study of this problem, we introduce the RAG Intelligence Search Engine (RAISE), a comprehensive framework and benchmark for RAG hyperparameter optimization, which evaluates optimization methods for RAG pipelines under standardized search spaces and budgets. RAISE implements 13 search algorithms and evaluates them across seven public text and multimodal datasets using three random seeds. Our experiments show that optimization performance is highly task-dependent: methods that perform strongly on one dataset may not generalize consistently across others, cautioning against interpreting aggregate rankings as evidence of universally superior strategies. RAISE provides a common experimental substrate for fair, reproducible, and systematic research on RAG hyperparameter optimization.
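To make the idea of searching over RAG design choices under a fixed budget concrete, here is a minimal sketch using plain random search. It is illustrative only: the option values in `SEARCH_SPACE`, the `toy_evaluate` scoring function, and the budget of 20 are assumptions, and random search is used as a generic baseline rather than as a confirmed member of the 13 algorithms in RAISE.

```python
import random

# Hypothetical search space: the option names follow the abstract,
# the candidate values are illustrative only.
SEARCH_SPACE = {
    "query_rewriting": [False, True],
    "chunk_size": [128, 256, 512],
    "retrieval_depth": [1, 5, 10, 20],
    "reranking": [False, True],
    "context_compression": [False, True],
}


def sample_config(rng):
    """Draw one pipeline configuration uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def toy_evaluate(config):
    """Stand-in for running a RAG pipeline and scoring it on a dataset.

    A real evaluation would build the pipeline and measure answer quality.
    """
    score = 0.5
    score += 0.1 if config["reranking"] else 0.0
    score += 0.05 if config["query_rewriting"] else 0.0
    score -= abs(config["retrieval_depth"] - 10) / 100
    score -= abs(config["chunk_size"] - 256) / 2560
    return score


def random_search(evaluate, budget, seed):
    """Evaluate `budget` random configurations and return the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    for seed in (0, 1, 2):  # three seeds, mirroring the protocol in the abstract
        config, score = random_search(toy_evaluate, budget=20, seed=seed)
        print(seed, round(score, 3), config)
```

Fixing the search space, the budget, and the seeds is what makes two optimizers comparable; swapping `toy_evaluate` for a different dataset's scorer can change which configuration wins, in line with the task dependence the abstract reports.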

2605.30022 2026-05-29 cs.CL cs.AI (updated version)

Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

Pierre-Antoine Lequeu, Camille Barboule, Benjamin Piwowarski

Affiliations: Sorbonne Université, CNRS, ISIR; Orange Innovation

AI summary: Studies how positional encoding works in Transformers by separating positional and semantic signals into three independent streams, and finds that the disentangled approach preserves macroscopic structure and improves linguistic representations.

Comments: 8 pages + 10 pages of bibliography and appendix

Abstract

Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is processed and stored remains poorly understood. Modern PE methods such as RoPE still struggle on tasks such as long-context understanding or retrieval (Chen et al., 2025). Hence, a better understanding of the internal positional mechanism could help design better PE. Building on evidence that positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers, we modify an encoder Transformer to process three explicitly disentangled streams: semantic, absolute positional (AP), and relative positional (RP), and confine the masked-language-modeling (MLM) objective to the semantic stream. This decoupling enables a clean mechanistic study and yields three takeaways: (1) the isolated AP subspace spontaneously collapses into a low-frequency two-dimensional manifold that captures the structure of the document; (2) attention heads specialize into structure-oriented and semantic-oriented groups, with RP exclusively supporting the latter; (3) standard positional encodings do not robustly retain macroscopic structure: RoPE and RP only weakly encode it, and entangled AP loses it in the final layers under MLM pressure. The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.

2605.30015 2026-05-29 cs.LG cs.AI (updated version)

Test Time Training for Supervised Causal Learning

Zizhen Deng, Jiaru Zhang, Rui Ding, Huang Bojun, Jinzhuo Wang, Qiang Fu, Shi Han, Dongmei Zhang

Affiliations: Peking University; Shanghai Jiao Tong University; Microsoft; Sony Research

AI summary: To address the weak out-of-distribution generalization of supervised causal learning, proposes TTT-SCL, a test-time training framework that dynamically generates training sets aligned with the test instance and significantly improves causal discovery performance.

Abstract

Supervised Causal Learning (SCL) has shown promise in causal discovery by framing it as a supervised learning problem. However, it suffers from significant out-of-distribution generalization challenges. We reveal three limitations of previous SCL practices: a significant performance gap between synthetic benchmarks and real-world data, fragility to distribution shifts, and failure in compositional generalization, collectively questioning its real-world applicability. To address this, we propose Test-Time Training for Supervised Causal Learning (TTT-SCL), a novel framework that dynamically generates training sets explicitly aligned with any specific test instance. We demonstrate the correlation between TTT-SCL and score-based methods, and design an efficient module for generating training sets based on the classic scoring function. Experiments on synthetic benchmarks, pseudo-real and real-world datasets demonstrate that TTT-SCL significantly outperforms existing SCL and traditional causal discovery methods.

2605.30014 2026-05-29 cs.AI (updated version)

From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs

Silin Zhou, Chenhao Wang, Yuntao Wen, Shuo Shang, Lisi Chen, Panos Kalnis

Affiliations: University of Electronic Science and Technology of China; King Abdullah University of Science and Technology

AI summary: Proposes HTP, which hierarchically generates travel patterns first and then GPS points, using an LLM and an RQ-VAE for flexible, semantically rich trajectory generation, with an average quality improvement of 29.78%.

Comments: Accepted at KDD 2026 (second round)

Abstract

Urban trajectories play a crucial role in modeling urban dynamics and supporting various smart city applications. However, privacy concerns restrict access to large-scale and high-quality trajectory datasets. Trajectory generation provides a promising alternative by synthesizing realistic data to mitigate privacy risks. However, existing methods fail to explicitly capture travel patterns and can only generate fixed-length trajectories under a single condition. To address these limitations, we propose HTP, which Hierarchically generates Travel patterns first and then generates GPS Points by using large language models (LLMs), rather than directly generating GPS points. We first design a trajectory-specific residual quantization variational autoencoder (RQ-VAE) that quantizes micro-level GPS trajectories into compact, macro-level travel pattern tokens in a coarse-to-fine manner. These tokens capture rich segment spatial irregularities, such as point density variations caused by traffic conditions. Then, we extend the LLM vocabulary with travel pattern tokens to align trajectory representations with the LLM input, and apply supervised fine-tuning (SFT) to align the LLM with the trajectory generation task, enabling generation of travel pattern sequences under various conditions. Extensive experiments on two real-world datasets show that HTP outperforms the strongest baseline by an average of 29.78% in terms of generation quality. Our code is available at https://github.com/slzhou-xy/HTP.
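The coarse-to-fine quantization mentioned in the abstract can be illustrated with the bare residual quantization step. This is a sketch under stated assumptions, not the HTP code: a real RQ-VAE learns its encoder and codebooks, whereas `CODEBOOKS` here is fixed by hand, and the 2-D input vector is an arbitrary stand-in for an encoded trajectory segment.

```python
def nearest(codebook, vec):
    """Return (index, codeword) of the codeword closest to vec in squared L2 distance."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, vec))
    idx = min(range(len(codebook)), key=lambda i: dist(codebook[i]))
    return idx, codebook[idx]


def residual_quantize(vec, codebooks):
    """Quantize vec coarse-to-fine: each level encodes what earlier levels left over."""
    residual = list(vec)
    tokens, approx = [], [0.0] * len(vec)
    for codebook in codebooks:
        idx, code = nearest(codebook, residual)
        tokens.append(idx)                                   # one discrete token per level
        approx = [a + c for a, c in zip(approx, code)]       # running reconstruction
        residual = [r - c for r, c in zip(residual, code)]   # error passed to the next level
    return tokens, approx


# Hypothetical two-level codebooks for 2-D features (hand-picked, not learned).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],        # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],  # fine level
]

tokens, approx = residual_quantize([1.2, 0.1], CODEBOOKS)
print(tokens, approx)  # prints [1, 1] [1.25, 0.0]
```

The token list is what would be added to an LLM vocabulary in a pipeline of this kind: the first token fixes the coarse pattern and each later token refines it.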

2605.30011 2026-05-29 cs.CV cs.AI (updated version)

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang

Affiliations: Zhejiang University; Cornell University; National University of Singapore; Xi'an University of Electronic Science and Technology

AI summary: Proposes the VisualThink-VLA framework, which uses visual intermediate reasoning and a selective routing mechanism to cut inference latency from several seconds to under one second while maintaining high accuracy.

Abstract

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weak textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. In addition, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8x speedup.

2605.30003 2026-05-29 cs.MA cs.AI cs.LG (updated version)

Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

Víctor Gallego

Affiliations: Komorebi AI Technologies

AI summary: Proposes a two-level autoresearch framework in which an outer-loop AI agent automatically redesigns the inner-loop LLM policy-synthesis pipeline for multi-agent sequential social dilemmas; experiments show it outperforms hand-designed baselines across multiple games and welfare objectives.

Comments: Accepted to the AI Agents for Discovery in the Wild (AID-Wild) Workshop at ACM CAIS 2026

Abstract

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\mathcal{R}$ (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy-synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand-designed baselines, sharply tightens run-to-run variance, and outperforms prompt-only optimization. The discovered pipelines are objective-dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective-agnostic system prompt and from every efficiency-optimized pipeline. This supports an information-design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch-social-dilemmas.
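The difference between the two welfare objectives is easy to see on a toy example. The sketch below uses the textbook forms (sum of returns for the utilitarian objective, minimum return for Rawlsian maximin); the paper may normalize its efficiency metric differently, and the pipeline names and per-agent returns are made-up numbers for illustration.

```python
def utilitarian(returns):
    """Utilitarian welfare: total return summed over agents."""
    return sum(returns)


def maximin(returns):
    """Rawlsian maximin welfare: return of the worst-off agent."""
    return min(returns)


# Hypothetical per-agent returns produced by two candidate pipelines.
candidates = {
    "pipeline_a": [10.0, 9.0, 1.0],  # high total, one agent left behind
    "pipeline_b": [6.0, 6.0, 5.0],   # lower total, evenly spread
}

best_utilitarian = max(candidates, key=lambda name: utilitarian(candidates[name]))
best_maximin = max(candidates, key=lambda name: maximin(candidates[name]))

print(best_utilitarian)  # prints pipeline_a
print(best_maximin)      # prints pipeline_b
```

Because the two objectives can rank the same candidates in opposite order, an outer-loop agent optimizing maximin has a reason to add fairness machinery that an efficiency-optimizing agent never needs, which matches the objective-dependent pipelines described in the abstract.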

2605.30002 2026-05-29 cs.AI (updated version)

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

Affiliations: School of Information Science and Technology, ShanghaiTech University; Ant Group

AI summary: Proposes the KairosAgent framework, which combines an LLM-based reasoner with a TSFM-based forecaster and introduces a reinforcement learning paradigm to achieve zero-shot forecasting of cross-modal time series.

Compass: Navigating Global Marine Lead Data Integration with Expert-Guided LLM Agents

Yiming Liu, Bin Lu, Meng Jin, Ziyuan Sang, Shuo Jiang, Lei Zhou, Xinbing Wang, Chenghu Zhou, Jing Zhang

Affiliations: School of Information Science and Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; State Key Laboratory of Estuarine and Coastal Research, East China Normal University, Shanghai, China; School of Oceanography, Shanghai Jiao Tong University, Shanghai, China; Institute of Geographical Science and Natural Resources Research, Chinese Academy of Sciences, Beijing, China

AI summary: To address marine lead data being scattered across unstructured papers, proposes Compass, an expert-guided LLM agent framework that uses a knowledge tree to decompose tasks; it extracts 3,751 lead records from 230,000 papers, building the largest marine lead database, with 92% accuracy.

Abstract

Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.

2605.29965 2026-05-29 cs.AI (updated version)

Meta-Programming for Linear-time Temporal Answer Set Programming

Susana Hahn, Amade Nems, Javier Romero, Torsten Schaub

Affiliations: University of Potsdam, Germany

AI summary: Proposes a unified meta-programming framework that operationalizes the semantics of several linear-time temporal logics (TEL, MEL, DEL) by extending clingo's theory grammar and introducing a transformation pipeline that protects nested modalities, and presents the metasp system.

Abstract

The development of temporal extensions of Answer Set Programming (ASP) has led to the emergence of non-monotonic linear-time (TEL), dynamic (DEL), and metric (MEL) temporal equilibrium logics. However, the inherent rigidity of highly optimized ASP systems often hinders the rapid exploration and implementation of alternative logical designs. In this work, we propose a flexible meta-programming framework that operationalizes the semantics of varied temporal logics through a unified, declarative framework. Our approach extends standard ASP meta-programming by augmenting clingo's theory grammar with formal type specifications and nesting capabilities. To ensure semantic correctness, we introduce a transformation pipeline that protects nested modalities from stable-model-based simplifications during grounding. We demonstrate the extensibility of our framework by implementing meta-encodings for TEL, MEL, and DEL. We provide a comprehensive account of TEL and highlight the key features for managing the interval constraints of MEL and the Fischer-Ladner closure in DEL. Finally, we introduce the metasp system, a versatile tool that encapsulates this workflow.

2605.29963 2026-05-29 cs.CR cs.AI cs.LG (updated version)

Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

Mark Vero, Fabian Kaczmarczyck, Ivan Petrov, Ilia Shumailov, Jamie Hayes, Niels Heinen, Tianqi Fan, Luca Invernizzi, Martin Vechev

Affiliations: ETH Zurich; Google; Google DeepMind; AI Sequrity Company; Independent

AI summary: Proposes the Honeyval evaluation framework, which systematically evaluates LLM-powered HTTP honeypots using 16 backend applications, AI attack agents, control tasks, and verifiable exploit goals; compared with rule-based baselines, such honeypots substantially prolong attacker interactions, are detected less often by frontier models, and keep a cost advantage.

Abstract

Honeypots are decoy systems mimicking real system components designed to defend against cyber attacks. Recently, LLMs increasingly serve as simulation backbones for honeypots. They enable defenders to construct high-interaction honeypots with low system security risks. However, LLM-powered honeypot development lacks a unified evaluation framework. Most evaluations consist of measuring response similarity on fixed commands, manual testing, or real-world deployment. These methods are often not scalable for development, reproducible across evaluations, representative of practical attacks, or adaptable to various attacker and honeypot configurations. In this work, we bridge this gap and propose Honeyval, a comprehensive evaluation framework for LLM-powered HTTP honeypots. We address the limitations of prior evaluations by grounding the honeypots in 16 backend applications, using AI hacking agents as attackers, employing two control tasks to monitor agent and honeypot capabilities across customizations, and defining clear and verifiable exploit goals for the attacker. Using Honeyval, we conduct an extensive evaluation of recent cost-efficient LLMs as HTTP honeypots. Our experiments highlight the promise of LLM-powered honeypots; they lead to substantially longer interactions with the attacker than rule-based baseline honeypots and are far less frequently detected even by frontier models, all while, on average, preserving a running cost advantage against agentic attackers. Further, we experiment with different counter-offensive honeypot configurations and observe unique trade-offs, such as longer interactions at the cost of increased detection.

2605.29960 2026-05-29 cs.CR cs.AI (updated version)

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

Hongtao Wang, Se Yang, Yu Chen, Puzhuo Liu

Affiliations: North China Electric Power University; Tencent; Tsinghua University

AI summary: Proposes the MemPoison attack, which bypasses selective memory mechanisms through a semantic relational bridge, entity masquerading, and joint embedding optimization to inject triggerable backdoors into the long-term memory of LLM agents, achieving attack success rates up to 0.95.

Comments: 19 pages, 12 figures

Abstract

Large language model (LLM) agents increasingly leverage long term memory to support persistent and autonomous task execution. However, this capability also introduces a new attack surface: memory poisoning, where adversaries can inject malicious information to influence future behavior. Existing memory poisoning attacks often assume that injected content can be stored directly in memory, overlooking the selective extraction and rewriting stages in modern memory pipelines. This makes prior methods ineffective under realistic settings. In this paper, we propose MemPoison, a novel memory poisoning attack that bypasses selective memory mechanisms in LLM agents, where an attacker can inject triggerable backdoors into the agent's long-term memory through dialogue interactions, thereby misleading its subsequent responses. MemPoison introduces three key components: (i) a semantic relational bridge that binds the trigger and payload into a coherent statement to ensure they are extracted into memory together; (ii) entity masquerading that optimizes triggers to mimic named entities, resisting rewriting; and (iii) joint embedding optimization that shapes trigger-injected texts into a tight cluster in the embedding space while maintaining isolation from benign embeddings for stealth. Evaluations across different agent domains and memory mechanisms show MemPoison achieves attack success rates up to 0.95, outperforming existing baselines. Mechanistic analysis indicates that the attack exploits embedding-space anisotropy and shifts attention patterns, highlighting core vulnerabilities in selective memory systems. We evaluate multiple defense strategies and demonstrate their fundamental limitations in mitigating the attack.

2605.29955 2026-05-29 cs.AI (updated version)

Formalizing Mathematics at Scale

Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat, Julia Kempe, Remi Munos, Charles Arnal, Vivien Cabannes

Affiliations: FAIR at Meta; New York University; Korea Institute for Advanced Study

AI summary: Proposes AutoformBot, a multi-agent system that uses LLMs and formal verification tools to automatically translate informal textbooks into machine-checkable Lean 4 code, producing Atlas, a formal library of over 45,000 declarations and 500 thousand lines of code.

Abstract

We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.

2605.29951 2026-05-29 cs.AI cs.CL cs.LG cs.MM (updated version)

MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization

Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann, Timothy Hospedales, Vera Demberg

Affiliations: Max Planck Institute for Informatics; Saarland Informatics Campus; Saarland University; The University of Edinburgh; Samsung AI Center, Cambridge

AI summary: To address the weakness of vision-language models at reasoning about implicit cross-modal harmful semantics, proposes the MuPHI dataset and the MuPHIRM training framework, which learns joint semantics by optimizing multi-perspective rewards and improves harm detection, reasoning quality, and out-of-distribution robustness.

Abstract

Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyond surface-level features. Existing vision-language models (VLMs) excel at literal reasoning over perceptual cues but often fail to derive harmful semantics that rely on implicit, context-dependent reasoning. To evaluate VLMs on compositional harm detection and reasoning, we introduce Multimodal Pragmatic Harm Interpretation (MuPHI), a dataset containing image-text pairs where harm is encoded in subtle multimodal cues. MuPHI spans diverse harm categories and includes annotated harm rationales for assessing VLM reasoning chains. To improve both detection and reasoning in VLMs, we propose MuPHIRM, a reasoning-augmented training framework which learns joint semantics by optimizing multi-perspective rewards. MuPHIRM improves both harm detection and reasoning quality of VLMs while demonstrating superior out-of-distribution robustness compared to both trained and inference-time baselines. Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.

2605.29940 2026-05-29 cs.AI (updated version)

Make LLM Learn to Synthesize from Streaming Experiences through Feedback

Zhenlin Hu, Yan Wang, Zhen Bi, Zihao Xue, Bingyu Zhu, Longtao Huang, Xiongtao Zhang, Zeyu Yang, Zhixuan Chu, Jungang Lou

Affiliations: Huzhou Normal University; Alibaba Group; Zhejiang University; Zhejiang Key Laboratory of Intelligent Education Technology and Application

AI summary: Proposes the StreamSynth setting and the SynLearner framework, in which a model accumulates experience over a task stream and uses feedback to improve synthetic data generation.

Abstract

Large language models (LLMs) have been widely adopted for synthetic data generation, significantly reducing annotation costs. However, most existing studies treat synthesis as a set of isolated tasks and overlook a more fundamental question: whether a model can learn to synthesize by accumulating experience from past tasks and transferring it to future ones. In this work, we introduce StreamSynth, a new setting in which synthesis tasks arrive sequentially and experience from historical tasks provides informative signals for future synthesis. To address this setting, we propose SynLearner, a general framework that enables synthesis models to acquire reusable synthesis experience over a task stream. Instead of generating data independently for each task, SynLearner encourages the model to explore diverse synthesis patterns, learn from feedback, and balance sample quality with set-level diversity as tasks evolve. Extensive experiments across multiple benchmarks show that SynLearner effectively leverages experience from earlier tasks to improve synthesis performance on later ones, exhibiting consistent cross-task transferability. These findings provide evidence for the feasibility of StreamSynth and highlight synthetic data generation as an experience-driven process that can benefit from task streams.

2605.29935 2026-05-29 cs.CV cs.AI (updated version)

CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving

Zezhong Qian, Zhao Yang, Lu Tan, Zhihao Yan, Weiyi Hong, Haizhuang Liu, Yawei Jueluo

Affiliations: Jiangsu Cytoderm Intelligent Technology Co., Ltd., China; Xi'an Jiaotong University, Xi'an, China; Tsinghua University, Beijing, China; University of Science and Technology of China, Hefei, China

AI summary: Proposes CityGen, a diffusion-based generative framework that achieves zero-label city adaptation through HD-map conditioning and city-level visual prompts, improving the robustness of cross-city autonomous driving on perception, segmentation, and planning tasks.

Abstract

Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deployed in new cities. However, significant domain shifts in appearance, road topology, and traffic patterns often cause severe performance degradation under cross-city deployment. Existing approaches based on domain adaptation, data augmentation, or synthetic data generation typically rely on labeled target data, city-specific annotations, or task-specific designs, limiting their scalability and effectiveness for holistic evaluation. In this paper, we introduce CityTransfer-Bench, a geographically disjoint benchmark for evaluating cross-city generalization across perception, segmentation, and planning, and propose CityGen, a diffusion-based generative framework that performs zero-label city adaptation via HD-map-conditioned synthesis guided by city-level visual prompts. Extensive experiments demonstrate that CityGen consistently improves cross-city robustness across multiple tasks, establishing a scalable and label-efficient foundation for generalizable autonomous driving.


2605.29931 2026-05-29 cs.AI eess.AS 版本更新

It's All About Speed: AI's Impact on Workflow in Music Production

一切都关乎速度：AI对音乐制作工作流程的影响

Finn McClellan, Fabio Morreale

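Affiliations: Waipapa Taumata Rau - University of Auckland, Auckland (Aotearoa - New Zealand); Sony AI, Barcelona (Spain)

AI summary: An ethnographic study of how AI and automation tools affect music production workflows, focusing on the experiences and attitudes of recording engineers, mixing engineers, and producers, and analyzing the tension between speed, controllability, and creative autonomy and ways to ease it.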

Comments Audio Engineering Society Conference Paper - Presented at the AES International Conference on Machine Learning and Artificial Intelligence for Audio 2025 - September 8-10, London, UK

2605.29927 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

计划方式重要吗？LLM网络代理计划表示的实证研究

Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim

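Affiliations: Concordia University; Mila - Quebec AI Institute; University of Copenhagen; Universite Claude Bernard Lyon; McGill University

AI summary: Proposes the PlanAhead framework; through automatic difficulty classification and a comparison of four plan representations (sequential subgoals, narrative, pseudocode, checklist), it finds that the plan representation and the LLM generating the plan both significantly affect web-agent robustness and task success.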

Comments Extended version of paper submitted to EMNLP, waiting for acceptance


AI中文摘要

尽管最近取得了进展，基于LLM的网络代理仍然面临探索有限、遗漏关键步骤以及对任务约束敏感等问题。先前的研究表明，许多这些失败源于规划中的弱点，但替代自然语言计划表示的影响尚未被探索。为了解决这个问题，我们引入了PlanAhead，一个静态规划器-执行器框架，评估计划表示对代理性能的影响。我们首先将WebArena任务自动分类为3个难度级别，无需人工标注即可实现一致的难度分级。然后，我们在被分类为困难的任务上系统评估了4种不同的计划表示：顺序子目标、叙述、伪代码和检查清单；跨越不同系列的多模态LLM驱动的代理（OpenAI、阿里巴巴和谷歌）。为了解释随机变异性，我们引入了两个新的评估指标：达成率（AR）和解决任务一致性（STC）。我们的结果表明，计划制定和生成计划的底层LLM都显著影响网络代理的鲁棒性和任务成功率。

英文摘要

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural-language plan representations remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation on agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. We then systematically evaluate 4 plan representations (sequential subgoals, narrative, pseudocode, and checklist) on the tasks categorized as hard, across different families of multimodal LLM-powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). Our results show that both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.


2605.29919 2026-05-29 cs.AI cs.MA 版本更新

On the Geometry of Games and their Solvers

论博弈及其求解器的几何结构

Yaqi Sun, Julian Ma, David Mguni

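Affiliations: Queen Mary University of London; University College London

AI summary: Proposes a structure-aware solver synthesis framework that learns a continuous, solver-aligned geometric representation of games, enabling adaptive equilibrium computation and revealing continuous regions of solver behaviour.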

AI中文摘要

博弈论和生成对抗网络等学习系统中的一个核心挑战是理解哪些算法能够在异质博弈景观中高效计算均衡。均衡计算通常按求解器和博弈类别分别研究，产生了强局部保证但碎片化的求解器行为视图。现有的离散分类法往往无法完整解释算法成功的原因。我们通过一个将博弈与有效求解器动力学联系起来的求解器-博弈映射来研究这一问题。经典理论识别出该映射的孤立区域，但对中间或重叠区域提供的见解有限，表明可解性由定义连续求解器对齐博弈几何的潜在结构属性控制。我们通过结构感知的求解器合成来形式化这一视角。一个学习到的结构识别器将每个博弈映射到低维求解器对齐表示，一个策略将该表示映射到有效的原始机制，从而跨区域调整求解器行为。这揭示了特定求解器动力学有效的区域，以及需要原始机制混合而非单一主导求解器的区域。一个有界残差充当局部校正器和诊断信号，用于不完整的求解器基或表示。该框架同时产生自适应求解器和分析视角：具有相似优化动力学的博弈聚类在一起，揭示了算法有效性的连续区域和重叠的求解器行为。实验表明，固定原始机制表现出系统性的区域不匹配，而学习到的表示将博弈空间组织成与求解器行为对齐的结构化地图。这些结果表明，应将均衡计算视为学习求解器机制和映射可解性几何的联合问题。

英文摘要

A central challenge in game theory and learning systems such as GANs is understanding which algorithms can efficiently compute equilibria across the heterogeneous landscape of games. Equilibrium computation is typically studied solver by solver and game class by game class, yielding strong local guarantees but a fragmented view of solver behaviour. Existing discrete taxonomies often provide an incomplete account of where algorithms succeed. We study this problem through a solver-game map linking games to effective solver dynamics. Classical theory identifies isolated regions of this map but provides limited insight into intermediate or overlapping regimes, suggesting that solvability is governed by latent structural properties defining a continuous solver-aligned geometry of games. We formalise this perspective through structure-aware solver synthesis. A learned structure recogniser maps each game to a low-dimensional solver-aligned representation, and a policy maps this representation to effective primitive mechanisms, adapting solver behaviour across regimes. This reveals regions where particular solver dynamics are effective and where mixtures of primitives are required rather than a single dominant solver. A bounded residual acts as a local corrector and diagnostic signal for incomplete solver bases or representations. The framework yields both an adaptive solver and an analytical lens: games with similar optimisation dynamics cluster together, revealing continuous regions of algorithmic validity and overlapping solver behaviour. Empirically, we show that fixed primitives exhibit systematic regime mismatch, while the learned representation organises game space into a structured cartography aligned with solver behaviour. These results suggest viewing equilibrium computation as the joint problem of learning solver mechanisms and mapping the geometry of solvability.


2605.29910 2026-05-29 cs.SE cs.AI 版本更新

Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents

Agora: 面向生产级共识协议中自主漏洞检测的LLM智能体

Xiang Liu, Sa Song, Zhaowei Zhang, Huiying Lan, Jason Zeng, Ming Wu, Michael Heinrich, Yong Sun, Ceyao Zhang

Affiliations: School of Computing, National University of Singapore; School of Information and Telecommunication Engineering, Beijing University of Posts and Telecommunications; Peking University; 0G Labs

AI summary: Proposes Agora, a domain-aware multi-agent framework that combines hypothesis-driven testing with LLM collaboration and discovers 15 previously unknown protocol-level logic bugs in four consensus implementations (Raft, EPaxos, HotStuff, BullShark), whereas existing LLM-based approaches detect none.

Comments 35 pages, 4 figures


AI中文摘要

共识协议构成了分布式系统和区块链的骨干，其中的实现漏洞可能导致数据损坏和财务损失。虽然基于LLM的方法在代码分析中显示出前景，但它们难以处理涉及跨多个执行阶段的复杂状态依赖行为的深层协议级逻辑漏洞。我们提出Agora，一个领域感知的多智能体框架，将假设驱动测试与LLM能力相结合，用于系统性的协议验证。Agora采用专门的智能体，协作探索协议状态空间，使用领域特定约束综合攻击场景，并通过迭代细化验证发现。这种明确的角色分离使得能够推理全局协议不变量，超越单函数代码分析。我们在四个共识实现（Raft、EPaxos、HotStuff、BullShark）上使用四个最先进的LLM评估了Agora。Agora发现了15个先前未知的违反安全属性的协议级逻辑漏洞，而现有的基于LLM的智能体未能检测到任何此类协议级逻辑漏洞。我们的结果表明，领域感知的多智能体协作对于检测复杂协议中的深层逻辑漏洞至关重要。

英文摘要

Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and financial losses. While LLM-based approaches show promise in code analysis, they struggle with deep protocol-level logic bugs involving complex state-dependent behaviors across multiple execution stages. We present Agora, a domain-aware multi-agent framework that integrates hypothesis-driven testing with LLM capabilities for systematic protocol verification. Agora employs specialized agents that collaboratively explore protocol state spaces, synthesize attack scenarios using domain-specific constraints, and validate findings through iterative refinement. This explicit role separation enables reasoning about global protocol invariants beyond single-function code analysis. We evaluate Agora on four consensus implementations (Raft, EPaxos, HotStuff, BullShark) using four state-of-the-art LLMs. Agora discovers 15 previously unknown protocol-level logic bugs that violate safety properties, while existing LLM-based agents fail to detect any such protocol-level logic bugs. Our results demonstrate that domain-aware multi-agent collaboration is essential for detecting deep logic bugs in complex protocols.


2605.29893 2026-05-29 cs.AI 版本更新

Mitigating Hallucinations in Vision-Language Models via Barrier-Regulated Adaptive Closed-Form Steering

Soumyadeep Jana, Pulkit Mittal, Sanasam Ranbir Singh

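Affiliations: Indian Institute of Technology Guwahati

AI summary: Proposes BRACS, a training-free framework that monitors visual attention and applies a closed-form correction only when grounding degrades, reducing object hallucination in LVLMs.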

AI中文摘要

大型视觉语言模型（LVLMs）经常幻觉出输入图像中不存在的物体，这主要是因为随着解码进行，视觉接地减弱。现有的推理时缓解方法在生成过程中修改logits或隐藏状态，但它们存在三个关键限制：缺乏明确的接地目标，即使在模型已经良好接地时也进行干预，以及使用固定的修正强度，无法适应接地失败的严重程度。我们提出BRACS（屏障调控自适应闭式引导），一种无需训练的引导框架，通过屏障调控自适应闭式引导解决这些问题。BRACS监测模型自身的注意力以衡量视觉接地，并仅在接地恶化时对隐藏状态进行修正。修正更新以闭式解析计算，无需训练辅助网络或重新训练模型。在LLaVA-1.5-7B和Qwen-VL-Chat上的实验表明，BRACS在幻觉基准上持续优于先前方法，将CHAIR$_s$降低9.4个点，将POPE F1提高2.7个点，同时在四个通用多模态基准上匹配或提升性能。BRACS还保持高效，运行速度为贪心解码吞吐量的80%，平均速度比基线快1.3倍。

英文摘要

Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding weakens as decoding progresses. Existing inference-time mitigation methods modify logits or hidden states throughout generation, but they suffer from three key limitations: they lack an explicit grounding objective, intervene even when the model is already well-grounded, and use fixed correction strengths that do not adapt to the severity of grounding failure. We propose BRACS (Barrier-Regulated Adaptive Closed-form Steering), a training-free steering framework that addresses these issues through barrier-regulated adaptive closed-form steering. BRACS monitors the model's own attention to measure visual grounding and applies corrections to the hidden states only when grounding deteriorates. The corrective update is computed analytically in closed form, requiring no training of auxiliary networks or model retraining. Experiments on LLaVA-1.5-7B and Qwen-VL-Chat show that BRACS consistently outperforms prior methods on hallucination benchmarks, reducing CHAIR$_s$ by 9.4 points and improving POPE F1 by 2.7 points, while matching or improving performance on four general multimodal benchmarks. BRACS also remains efficient, operating at 80% of greedy decoding throughput and achieving 1.3 times higher speed on average than the baselines.


2605.29873 2026-05-29 cs.AI 版本更新

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Moment-KV: 基于动量的解码时KV缓存压缩用于长文本生成

Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh

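Affiliations: Indian Institute of Technology Guwahati

AI summary: Proposes Moment-KV, which compresses the KV cache during decoding using momentum-driven temporal attention aggregation, improving long-generation quality while maintaining decoding latency.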

AI中文摘要

键值（KV）缓存仍然是大型语言模型（LLM）在长文本生成任务中部署的主要瓶颈。先前的工作通常对预填充和解码缓存应用均匀压缩，但压缩预填充缓存会破坏关键上下文从而降低性能。虽然保留预填充缓存至关重要，但解码阶段的压缩仍未被充分探索，现有方法依赖于固定的近期窗口或瞬时注意力。我们对注意力动态的分析揭示了强时间模式：关键标记在长时间范围内获得持续注意力，而局部推理涉及短暂的爆发。静态启发式方法无法捕捉这种行为，导致重要标记被过早驱逐或陈旧标记被保留。我们提出Moment-KV，一种基于动量驱动的时序注意力聚合的解码时KV缓存压缩方法。我们的方法将标记重要性建模为连续演化的状态，其中注意力通过衰减进行聚合，捕捉长期影响和近期相关性。实验表明，Moment-KV在长文本生成任务中显著提高了生成保真度（2.3-3.2%），同时保持了解码延迟。

英文摘要

Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill cache degrades performance by corrupting critical context. While preserving the prefill cache is essential, decoding-phase compression remains underexplored, with existing methods relying on rigid recency windows or instantaneous attention. Our analysis of attention dynamics reveals strong temporal patterns: critical tokens receive sustained attention over long horizons, while local reasoning involves short-lived bursts. Static heuristics fail to capture this behavior, leading to premature eviction of important tokens or retention of stale ones. We propose Moment-KV, a decoding-time KV cache compression method based on momentum-driven temporal attention aggregation. Our method models token importance as a continuously evolving state, where attention is aggregated with decay, capturing both long-term influence and recent relevance. Experiments show that Moment-KV significantly improves generation fidelity in long-generation tasks (2.3-3.2 %) while maintaining decoding latency.
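
The decayed attention aggregation described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the decay factor `beta`, the eviction `budget`, and both function names are assumptions made for illustration. The sketch keeps a running, exponentially decayed attention score per cached token, always retains the prefill tokens, and evicts the lowest-scoring decode tokens.

```python
def update_scores(scores, attn, beta=0.9):
    """Decay each token's running importance and blend in this step's attention.

    `scores` and `attn` are equal-length lists, one entry per cached token.
    `beta` is a hypothetical decay factor, not a value taken from the paper.
    """
    return [beta * s + (1.0 - beta) * a for s, a in zip(scores, attn)]


def select_keep(scores, prefill_len, budget):
    """Keep every prefill token plus the `budget` highest-scoring decode tokens."""
    decode_idx = list(range(prefill_len, len(scores)))
    # Highest aggregated score first; ties go to the more recent token.
    decode_idx.sort(key=lambda i: (scores[i], i), reverse=True)
    return sorted(list(range(prefill_len)) + decode_idx[:budget])


# Two decode steps over a cache holding 2 prefill tokens and 3 decode tokens.
scores = [0.0] * 5
for attn in ([0.1, 0.1, 0.6, 0.1, 0.1], [0.1, 0.1, 0.5, 0.05, 0.25]):
    scores = update_scores(scores, attn, beta=0.5)
kept = select_keep(scores, prefill_len=2, budget=2)
```

In this toy run the token at index 2 receives sustained attention and the token at index 4 a recent burst, so both survive while index 3 is evicted; a fixed recency window or single-step attention would not necessarily make the same choice.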


2605.29862 2026-05-29 eess.AS cs.AI cs.SD 版本更新

Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions

在联邦域泛化下通过因果启发的干预减轻听诊器引起的呼吸音分类中的捷径

Heejoon Koo, Yoon Tae Kim, Miika Toikkanen, June-Woo Kim

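Affiliations: University of Illinois Urbana-Champaign; RSC LAB; Wonkwang University

AI summary: To address the domain shift caused by stethoscope device differences in respiratory sound classification, proposes a causality-inspired multimodal federated domain generalization framework that obtains device-invariant representations through content-preserving style perturbation, counterfactual text augmentation, and gradient alignment, outperforming conventional methods on the ICBHI and SPRSound datasets.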

Comments 2 figures, 4 tables, and 5 pages


AI中文摘要

基于AI的呼吸音分类（RSC）有望实现自动化肺部疾病检测，但多站点部署受到听诊器间差异的阻碍。我们针对听诊器引起的设备偏移引入了一种联邦域泛化（FedDG）公式，其中客户端使用异构设备，模型在未见设备上进行评估。我们的实证分析表明，听诊器引起的风格和疾病特定内容紧密纠缠，使得确定性风格去除不可靠。为此，我们提出了一种因果启发的多模态FedDG框架，结合了：(i) 因果启发的设备风格干预网络，执行内容保持的风格扰动，(ii) 反事实文本增强，中和元数据捷径，以及(iii) 梯度对齐，促进跨客户端的设备不变表示。基于多模态语言-音频预训练模型，在ICBHI和SPRSound数据集上的留一设备验证中，它优于传统数据增强和联邦学习基线。代码将在发表后发布。

英文摘要

AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced device shifts, where clients use heterogeneous devices and the model is evaluated on unseen devices. Our empirical analysis shows that stethoscope-induced style and disease-specific content are tightly entangled, making deterministic style removal unreliable. In response, we propose a causality-inspired multimodal FedDG framework that combines: (i) a causality-inspired device style intervention network that performs content-preserving style perturbations, (ii) counterfactual text augmentation that neutralizes metadata shortcuts, and (iii) gradient alignment that facilitates device-invariant representations across clients. Built on a multimodal language-audio pretraining model, it outperforms conventional data augmentation and federated learning baselines in leave-one-device-out validation on ICBHI and SPRSound datasets. Code will be released upon publication.


2605.29860 2026-05-29 cs.LG cs.AI 版本更新

ESPO: Early-Stopping Proximal Policy Optimization

ESPO：早期停止的近端策略优化

Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu, Zhewen Tan, Zixiang Liu, Zeming Li, Binhua Li, Yongbin Li, Tong Yang, Jieping Ye

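Affiliations: Tongyi Lab; Alibaba Group; Peking University

AI summary: Proposes ESPO, which detects trajectory failure online and terminates rollouts early during reinforcement-learning training of large language models, saving compute and improving mathematical reasoning performance.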

AI中文摘要

当大语言模型在强化学习过程中，在轨迹早期出现错误的推理步骤时，标准算法会强制其继续生成直到最大步长，从而在从未获得正奖励的令牌上浪费计算资源，并用失败后的噪声污染优势估计。我们提出ESPO（早期停止的近端策略优化），该算法能够在线检测轨迹失败并提前终止轨迹生成。在每个生成步骤中，ESPO仅利用采样过程中已计算出的logits计算一个替代遗憾值，并在平滑累积遗憾值显著超过其估计值时终止。截断轨迹被视为具有终止奖励的吸收失败状态，将负的时间差分误差集中在检测到的失败步骤附近，无需任何额外的奖励模型或人工标注。在基于DeepSeek-R1-Distill-Qwen-7B训练的数学推理任务上，ESPO在AIME 2024（46.28% vs. 45.25%）、AMC 2023（85.83% vs. 82.94%）和MATH-500（87.42% vs. 85.43%）上超越了PPO，同时累计节省了超过20%的轨迹生成令牌。

英文摘要

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on-the-fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% rollout tokens cumulatively.
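
The abstract does not define the surrogate regret or the stopping threshold, so the following is only a sketch of the general idea under stated assumptions. Here the per-step regret is assumed to be the log-probability gap between the most likely token and the sampled token, smoothing is assumed to be an exponential moving average with factor `alpha`, and `threshold` is a fixed hypothetical constant standing in for the paper's estimated value. None of these choices is confirmed by the source.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def surrogate_regret(logits, sampled):
    """ASSUMED proxy: log-prob gap between the top token and the sampled one."""
    lp = log_softmax(logits)
    return max(lp) - lp[sampled]


def first_stop_step(step_logits, sampled_tokens, threshold, alpha=0.5):
    """Return the first step whose smoothed regret exceeds `threshold`, else None.

    Only the logits already produced during sampling are used, so no extra
    forward pass or reward model is needed.
    """
    ema = 0.0
    for t, (logits, tok) in enumerate(zip(step_logits, sampled_tokens)):
        ema = alpha * ema + (1.0 - alpha) * surrogate_regret(logits, tok)
        if ema > threshold:
            return t
    return None


# Toy rollout: the sampler picks the unlikely token at steps 1 and 2.
logits_per_step = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
stop_at = first_stop_step(logits_per_step, [0, 1, 1], threshold=1.2)
```

A rollout truncated at `stop_at` would then be assigned a terminal failure reward, as the abstract describes, rather than being generated to the maximum horizon.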


2605.29843 2026-05-29 cs.LG cs.AI 版本更新

Localized and Disentangled Knowledge Editing for Multimodal Large Language Models

Leijiang Gu, Zhen Zeng, Feng Li, Xinjian Gao, Zenglin Shi

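Affiliations: Hefei University of Technology; Tongji University

AI summary: To address causal misalignment and feature entanglement in multimodal knowledge editing, proposes the LDKE framework, which uses fast localization of critical layers and a disentanglement classifier to achieve precise, generalizable edits while maintaining high locality.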

AI中文摘要

现有的多模态知识编辑（MKE）方法在纠正多模态大语言模型（MLLMs）中过时或不准确的知识方面取得了进展。然而，它们存在一个关键局限性：虽然能有效修改目标事实对，但无法将编辑泛化到逻辑相关的查询，并且常常对无关但视觉或语义上关联的信息造成意外改变。我们识别并形式化了导致该问题的两种潜在失败模式：因果错位（将编辑限制在特定样本）和特征纠缠（对耦合但无关的信息造成意外改变）。为解决这些问题，我们提出局部化与解耦知识编辑（LDKE），一种通过定位事实特定模型层并将目标相关输入与无关输入解耦来实现精确和泛化编辑的新框架。我们的方法引入快速定位模块以高效识别和更新关键层，以及解耦分类器以适当路由输入从而保留无关知识。在各种基准和MLLMs上的大量实验表明，LDKE在将编辑传播到相关上下文方面实现了优越性能，同时保持了高局部性。

英文摘要

Existing methods in Multimodal Knowledge Editing (MKE) have advanced the ability to correct outdated or inaccurate knowledge in Multimodal Large Language Models (MLLMs). However, they exhibit a critical limitation: while effectively modifying target factual pairs, they fail to generalize edits to logically related queries and often cause unintended alterations to unrelated but visually or semantically linked information. We identify and formalize two underlying failure modes causing this issue: Causal Misalignment, which confines edits to the specific sample, and Feature Entanglement, which causes unintended alterations to coupled but irrelevant information. To address these issues, we propose Localized and Disentangled Knowledge Editing (LDKE), a new framework that achieves precise and generalized editing by localizing fact-specific model layers and disentangling target-relevant inputs from irrelevant ones. Our approach introduces a Fast Localization module to identify and update critical layers efficiently, along with a Disentanglement Classifier that routes inputs appropriately to preserve unrelated knowledge. Extensive experiments across various benchmarks and MLLMs demonstrate that LDKE achieves superior performance in propagating edits to related contexts while maintaining high locality.


2605.29822 2026-05-29 cs.SE cs.AI 版本更新

Inferring Code Correctness from Specification

从规约推断代码正确性

Tambon Florian, Papadakis Mike

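Affiliations: University of Luxembourg

AI summary: Proposes TRAILS, which generates test inputs via specification-based category partitioning, executes them, and uses an LLM to judge whether the input-output pairs conform to the specification in order to infer code correctness; on LiveCodeBench and CoCoClaNeL it improves the Matthews correlation coefficient over baselines and is more stable.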

AI中文摘要

大型语言模型（LLM）已成为现代软件开发不可或缺的一部分，实现了大规模自动代码生成。然而，验证LLM生成代码的正确性仍然是一个关键且基本未解决的挑战。现有方法要么依赖多个代码候选之间的动态共识——这使得它们成本高昂且难以扩展，要么依赖静态推理，容易受到动态错误和顺序偏差的影响。在本文中，我们提出TRAILS（通过输入和规约的目标推理一致性），一种将LLM推理与具体（输入，输出）对相结合的方法。TRAILS首先基于规约通过类别划分生成多样化的测试输入，然后针对候选代码执行这些输入，并提示LLM评估产生的输入输出对是否符合规约——而无需对代码本身进行推理。分数跨输入聚合，以确定程序是否可能正确。我们在两个数据集LiveCodeBench和CoCoClaNeL上，使用三个LLM（Qwen3Coder-30B、Devstral-Small-24B和Olmo3.1-Instruct）评估TRAILS，并与HoarePrompt和零样本思维链基线进行比较。TRAILS的马修斯相关系数相比零样本思维链提高了高达39%，并且始终优于HoarePrompt。除了准确性，TRAILS在多次运行中表现出更高的稳定性，降低了对LLM非确定性的敏感性，并且相比竞争方法为更多独特的代码样本分配了正确的标签。

英文摘要

Large language models (LLMs) have become integral to modern software development, enabling automated code generation at scale. However, validating the correctness of LLM-generated code remains a critical and largely unsolved challenge. Existing approaches either rely on dynamic consensus across multiple code candidates, making them costly and difficult to scale, or on static reasoning that is susceptible to dynamic bugs and order bias. In this paper, we propose TRAILS (Targeted Reasoning Agreement via Inputs and Specifications), an approach that grounds LLM reasoning with concrete (input, output) pairs. TRAILS first generates diverse test inputs via category partitioning based on the specification, then executes them against the candidate code and prompts LLMs to assess whether the resulting input-output pairs conform to the specification, without ever reasoning over the code itself. Scores are aggregated across inputs to determine whether the program is likely correct. We evaluate TRAILS on two datasets, LiveCodeBench and CoCoClaNeL, across three LLMs (Qwen3Coder-30B, Devstral-Small-24B, and Olmo3.1-Instruct), comparing against HoarePrompt and a Zero-Shot Chain-of-Thought (CoT) baseline. TRAILS improves the Matthews Correlation Coefficient by up to 39% relative to Zero-Shot CoT and consistently outperforms HoarePrompt. Beyond accuracy, TRAILS demonstrates greater stability across seeded runs, reducing sensitivity to LLM non-determinism, and assigns correct labels to a larger set of unique code samples than competing approaches.
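
Two pieces of this abstract lend themselves to a short sketch: aggregating per-pair verdicts into one label, and scoring the labels with the Matthews Correlation Coefficient. The MCC formula below is the standard one. The aggregation rule (mean conformance score compared against a `threshold` of 0.5) is an assumption for illustration, since the abstract does not say how scores are combined.

```python
import math


def aggregate_verdict(pair_scores, threshold=0.5):
    """ASSUMED rule: call the program correct if the mean per-pair score
    (1 = pair conforms to the specification, 0 = it does not) reaches `threshold`."""
    return sum(pair_scores) / len(pair_scores) >= threshold


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for boolean labels.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    returned as 0.0 when the denominator is zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Four candidate programs, each judged on three input-output pairs.
truth = [True, True, False, False]
preds = [aggregate_verdict(s) for s in ([1, 1, 1], [0, 0, 1], [0, 0, 0], [0, 1, 0])]
score = mcc(truth, preds)
```

MCC is a natural fit for this kind of evaluation because it stays informative when correct and incorrect programs are imbalanced, unlike plain accuracy.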


2605.29816 2026-05-29 cs.AI 版本更新

Harnessing non-adversarial robustness in large language models

利用大语言模型中的非对抗鲁棒性

Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov, Mikhail Seleznyov, Alexander Panchenko, Ivan Oseledets, Elena Tutubalina, Ivan Y. Tyukin

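Affiliations: Applied AI Institute, Moscow, Russia; King's College London, London, UK; International Joint Laboratory of AI for Industry, QUST, Qingdao, China

AI summary: Through theoretical analysis and experiments, proposes a debiasing-based fine-tuning method that improves the robustness of large language models to semantically similar but textually different prompts and provides certification guarantees.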

AI中文摘要

本文提出了一种方法来解决大语言模型（LLMs）对由语义相似但文本不同的提示引起的改变和潜在错误的鲁棒性挑战。最近的研究表明，这类提示变化会显著影响LLMs在任务上的性能。核心问题是：能否在不重新训练整个模型的情况下，获得LLMs对语义中性提示变化的鲁棒性？我们通过理论和实验来探讨这个问题。我们的理论分析揭示了一个影响模型鲁棒性的关键因素——神经网络模块输出中的系统性预期偏移或扰动引起的偏差。受此分析启发，我们表明可以通过一个简单的微调过程实现鲁棒性：为鲁棒性进行去偏。我们确定了去偏有帮助和没有帮助的条件，并通过理论和大量实验证明，为鲁棒性进行去偏确实可以成为一种快速有效的工具，以增强鲁棒性并提供对随机提示扰动的认证。

英文摘要

The work presents an approach for addressing the challenge of robustness in Large Language Models (LLMs) to alterations and potential errors caused by semantically similar but textually different prompts. Recent works have shown that these kinds of prompt variations can significantly impact the performance of LLMs on tasks. The central question is: can LLMs' robustness to semantically-neutral prompt alterations be acquired without expensive retraining of the entire model? We address this question both theoretically and through experiments. Our theoretical analysis reveals a crucial factor impacting model robustness - a systematic expected shift or perturbation-induced bias in neural network module outputs. Motivated by this analysis, we show that robustness can be achieved via a simple fine-tuning process: debiasing for robustness. We identify conditions when debiasing helps and when it does not, and demonstrate, through both theory and extensive experiments, that debiasing for robustness may indeed be a quick and efficient tool to enhance robustness and provide certification against random prompt perturbations.


2605.29815 2026-05-29 cs.AI cs.CL 版本更新

MEMENTO: Leveraging the Web as a Learning Signal for Low-Data Domains

Ashutosh Ojha, Vinay Aggarwal, Ashutosh Srivastava, Siddharth Yedlapati, Yaman K Singla, Jitendra Ajmera

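Affiliations: Adobe, Media & Data Science Research Lab

AI summary: Proposes the MEMENTO framework, which uses an Adaptive Exploration Tree and dual-channel memory to treat the web as a learning signal, significantly improving performance in low-data professional domains (sales automation and legal research).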

AI中文摘要

现实世界的任务通常缺乏大规模标注数据集，这激发了在低数据场景下学习的广泛研究。然而，现有方法如少样本提示、指令调优和合成数据生成，仍将标注或伪标注数据作为主要学习信号。相比之下，人类从业者通过反复、自主地与开放网络交互来获取专业知识，逐步完善领域知识和搜索策略。我们提出MEMENTO，一个将网络视为学习信号而非无状态检索接口的框架。MEMENTO在两个层面运作：在每个会话内，它通过自适应探索树（AET）进行迭代式网络探索，将任务分解为演化中的问题并反思中间发现；跨会话间，它通过双通道记忆积累经验，将陈述性知识（事实）与程序性知识（搜索策略）分离。这种设计使智能体能够从网络交互轨迹中学习可重用的研究策略和领域专业知识，而无需额外的模型训练。我们在两个低数据专业领域（销售自动化和法律研究）上评估MEMENTO。实验结果显示，与基于ReAct的基线相比，性能持续提升（销售自动化+25.6%，法律研究+36.5%），表明网络可以作为在数据稀缺场景下获取任务特定专业知识的可扩展学习源。

英文摘要

Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data regimes. However, existing approaches, such as few-shot prompting, instruction tuning, and synthetic data generation, continue to treat labeled or pseudo-labeled data as the primary learning signal. In contrast, human practitioners acquire expertise through repeated, self-directed interaction with the open web, progressively refining both domain knowledge and search strategies. We propose MEMENTO, a framework that treats the web as a learning signal rather than a stateless retrieval interface. MEMENTO operates at two levels: within each session, it conducts iterative web exploration via an Adaptive Exploration Tree (AET) that decomposes tasks into evolving questions and reflects on intermediate findings; across sessions, it accumulates experience through dual-channel memory, separating declarative knowledge (facts) from procedural knowledge (search strategies). This design enables agents to learn reusable research strategies and domain expertise from trajectories of web interaction without additional model training. We evaluate MEMENTO on two low-data professional domains: sales automation and legal research. Our empirical results show consistent improvements in performance over ReAct-based baselines (+25.6% on sales automation and +36.5% on legal research), demonstrating that the web can serve as a scalable learning source for acquiring task-specific expertise in data-scarce settings.

URL PDF HTML ☆

赞 0 踩 0

2605.29794 2026-05-29 cs.AI 版本更新

SkillsInjector: Dynamic Skill Context Construction for LLM Agents

SkillsInjector: 面向LLM智能体的动态技能上下文构建

Yanchao Li, Wanhao Liu, Ben Gao, Jiaqing Xie, Zhehong Ai, Na Zou, Yuqiang Li, Tianfan Fu

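Affiliations: Nanjing University; Shanghai AI Lab

AI summary: To address the performance degradation caused by static skill injection, proposes SkillsInjector, a two-stage adaptive method in which a context planner learns skill preferences and sets an adaptive budget and a set-aware renderer optimizes how descriptions are presented, improving over the strongest baseline by 3.9, 6.1, and 7.3 percentage points on three benchmarks.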

AI中文摘要

LLM智能体现在依赖不断增长的技能库来处理复杂任务。然而，注入更多技能并不总能提高任务完成度，甚至可能降低性能。现有方法仍将技能注入视为静态步骤，使用固定标准选择技能，预先设定预算，并保持描述不变。我们认为这种静态处理会削弱技能的效用，因为暴露哪些技能、包含多少技能以及如何呈现它们都会影响下游性能。我们提出SkillsInjector，一种两阶段自适应方法，共同解决这些决策。首先，上下文规划器学习基于执行的技能偏好，并为每个任务自适应地确定技能数量。然后，集合感知渲染器根据共注入的邻居定制所选描述的呈现方式。在tau2-bench、SkillsBench和ALFWorld上，SkillsInjector取得了最高分数，分别比最强基线提高了3.9、6.1和7.3个百分点。消融研究表明，技能选择、自适应预算和集合感知渲染各自对性能提升有贡献。这些结果表明，技能增强型智能体受益于优化注入的上下文本身。代码将在发表后发布。

英文摘要

LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task completion and can even degrade it. Existing methods still treat skill injection as a static step, selecting skills with fixed criteria, fixing the budget in advance, and leaving descriptions unchanged. We argue that this static treatment can undermine the utility of skills, because which skills are exposed, how many are included, and how they are presented all affect downstream performance. We propose SkillsInjector, a two-stage adaptive method that jointly addresses these decisions. First, a context planner learns execution-grounded skill preferences and admits an adaptive number of skills for each task. A set-aware renderer then tailors how selected descriptions are presented relative to their co-injected neighbors. Across tau2-bench, SkillsBench, and ALFWorld, SkillsInjector achieves the highest score, improving over the strongest baseline by 3.9, 6.1, and 7.3 percentage points, respectively. Ablation studies show that skill selection, adaptive budgeting, and set-aware rendering each contribute to the gain. These results show that skill-augmented agents benefit from optimizing the injected context itself. Code will be released upon publication.


2605.29790 2026-05-29 cs.MA cs.AI 版本更新

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

像团队一样进化：基于LLM的多智能体系统的协作自我进化

Zhezheng Hao, Tianfu Wang, Huanshuo Dong, Ziyan Liu, Hong Wang, Xiankun Lin, Qiang Lin, Can Wang, Hande Dong, Jiawei Chen

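Affiliations: Zhejiang University; Hong Kong University of Science and Technology; Tencent

AI summary: Proposes the Meta-Team framework, which uses a collaborative self-evolution mechanism to improve the behaviors, coordination, and team organization of multi-agent systems based on execution experience, significantly outperforming single-agent systems, hand-crafted MAS, and prior evolution methods on long-horizon tasks.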

AI中文摘要

基于LLM的多智能体系统（MAS）已成为处理复杂和长周期任务的有效范式。然而，在实际任务中，MAS在执行过程中经常出现各种故障，且这些故障在设计阶段难以消除。这激发了经验驱动的MAS进化，即系统根据自身执行经验进行改进。然而，这种进化具有挑战性，因为MAS经验漫长而复杂，交织着多个智能体的执行链和通信消息，使得难以识别需要改进的内容。为应对这一挑战，我们提出了Meta-Team，一种基于协作自我进化的经验驱动MAS进化框架。Meta-Team保留每个智能体的执行上下文并协调任务后通信，使智能体能够交换分布式证据以进行进化。基于此设计，Meta-Team进行多尺度自我进化，将执行经验转化为对智能体行为、智能体间协调以及团队级组织的可复用改进。在六个长周期智能体基准测试中，Meta-Team始终优于单智能体系统、手工MAS和先前的MAS进化方法；进一步分析表明，Meta-Team实现了更可靠和可扩展的MAS自我进化。

英文摘要

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience-driven MAS evolution, where a system improves based on its own execution experience. Yet such evolution is challenging because MAS experience is prolonged and intricate, interleaving multiple agents' execution chains and communication messages, which makes it difficult to identify what should be improved. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization. Across six long-horizon agent benchmarks, Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods; further analyses demonstrate that Meta-Team enables more reliable and scalable MAS self-evolution.


2605.29788 2026-05-29 cs.AI cs.LG 版本更新

Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models

Ayse Betul Yuce, Sebastian Stober

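Affiliations: Department of Computer Science, Otto von Guericke University

AI summary: Benchmarks five positional encoding strategies within the CBraMod backbone, evaluating motor imagery classification and emotion recognition under linear probing and fine-tuning protocols, and finds that the best strategy is task-dependent.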

AI中文摘要

脑电图（EEG）是一种广泛使用的非侵入性技术，用于测量脑机接口（BCI）应用中的大脑活动。监督式EEG解码模型通常难以跨任务、受试者和数据集泛化，这促使了基于Transformer的EEG基础模型通过自监督学习进行训练。由于Transformer是排列不变的，它们需要显式的位置信息。与文本标记不同，EEG电极在头皮上空间分布，这引发了如何在基于Transformer的EEG模型中编码电极位置的问题。在本研究中，我们在CBraMod骨干网络中基准测试了五种位置编码策略，并在运动想象分类和情感识别任务上通过线性探测和微调协议进行评估。我们的结果表明，没有单一策略能在所有任务中持续表现优异。球形位置编码（SPE）为运动想象生成了强大的表示，但在情感识别上表现不佳，而非对称条件位置编码（ACPE）在任务间表现更为一致。这些发现表明，最优位置编码策略具有任务依赖性，在EEG解码场景中没有通用解决方案。

英文摘要

Electroencephalography (EEG) is a widely used non-invasive technique for measuring brain activity in brain-computer interface (BCI) applications. Supervised EEG decoding models often struggle to generalize across tasks, subjects, and datasets, motivating transformer-based EEG foundation models trained with self-supervised learning. Since transformers are permutation-invariant, they require explicit positional information. Unlike textual tokens, EEG electrodes are spatially distributed across the scalp, raising the question of how electrode positions should be encoded in transformer-based EEG models. In this study, we benchmark five positional encoding strategies within the CBraMod backbone and evaluate them under linear probing and fine-tuning protocols on motor imagery classification and emotion recognition. Our results show that no single strategy consistently outperforms across tasks. Spherical Positional Encoding (SPE) yields strong representations for motor imagery but underperforms on emotion recognition, while Asymmetric Conditional Positional Encoding (ACPE) demonstrates more consistent performance across tasks. These findings suggest that the optimal positional encoding strategy is task-dependent, with no universal solution across EEG decoding scenarios.


2605.29753 2026-05-29 eess.IV cs.AI 版本更新

A unified deep learning framework for contrast-phase-specific virtual monochromatic imaging

一种用于对比相位特异性虚拟单色成像的统一深度学习框架

Antony Jerald, Hemant K Aggarwal, Brian Nett, Avinash Gopal, Phaneendra K Yalavarthy, Bipul Das, Rajesh Langoju

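Affiliations: Science and Technology Organization, GE HealthCare

AI summary: Proposes a unified deep learning framework that uses contrast-phase prior information to synthesize contrast-phase-specific virtual monochromatic 50 keV images from single-energy CT data, performing the energy transformation with a novel prior-conditioning architecture and validating contrast enhancement and generalization across four contrast phases.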

DOI: 10.1117/12.3086034
Journal ref: SPIE Medical Imaging 2026

AI中文摘要

双能CT（DECT）可实现虚拟单色成像（VMI）并提高对比度分辨率，但其临床采用受到硬件复杂性和成本的限制。在这项工作中，我们提出了一种统一的深度学习框架，通过利用对比相位信息作为先验，从单能CT（SECT）数据合成对比相位特异性虚拟单色50 keV图像。该模型使用DECT衍生的70 keV和50 keV图像对进行训练，涵盖四个对比相位——血管期、动脉期、门脉期和延迟期——采用一种新颖的先验条件架构，将对比相位先验整合到能量转换过程中。我们证明了所提出的统一模型能够实现对比增强，并在对比相位之间具有良好的泛化能力。此外，我们展示了该模型可以从SECT输入生成类似50 keV的图像，并保留对比相位特异性动态。

英文摘要

Dual-energy CT (DECT) enables virtual monochromatic imaging (VMI) and improved contrast resolution, but its clinical adoption is limited by hardware complexity and cost. In this work, we propose a unified deep learning framework that synthesizes contrast-phase-specific virtual monochromatic 50 keV images from single-energy CT (SECT) data by leveraging contrast phase information as a prior. The model is trained using DECT-derived 70 keV and 50 keV image pairs across four contrast phases -- Angio, Arterial, Portal, and Delayed -- using a novel prior conditioning architecture that integrates contrast phase priors into the energy transformation process. We demonstrate that the proposed unified model achieves contrast enhancement and generalizes well across contrast phases. Additionally, we show that the model can generate 50 keV-like images from SECT inputs, preserving contrast phase-specific dynamics.


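2605.29744 2026-05-29 cs.AI cs.CL cs.LG cs.MA version update

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang

Affiliations * Anthropic AI

AI summary: Proposes HetMedAgent, a heterogeneous multi-agent framework that coordinates generalist LLMs with domain-specific specialist models through conflict-aware evidence fusion, uncertainty-driven clinician intervention triggering, and adaptive threshold calibration. Experiments on three clinical decision-making tasks support the irreplaceable value of specialist models in modality-specific analysis.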

Comments Accepted at ICML 2026. 12 pages main text, 16 pages appendix

Abstract

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.


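2605.29742 2026-05-29 cs.AI version update

Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering

Yeong-Joon Ju, Seong-Whan Lee

Affiliations * Department of Artificial Intelligence, Korea University

AI summary: Targets the difficulty of tracing citations across multi-tiered authority structures in regulatory compliance QA. Proposes RegOps-Bench, a benchmark built on an Operational Knowledge Graph, and RefWalk, a unified framework that traverses cross-document citations from a shared topic anchor, fuses multi-view candidates, and enforces per-rule attribution, substantially improving retrieval recall and citation accuracy.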

Comments Under Review

Abstract

Deploying Large Language Models (LLMs) for regulatory compliance demands rigorous traceability via comprehensive citations across multi-tiered authority structures. Unlike traditional multi-hop or legal QA, this task requires structured procedural lookups and evidence-set closure rather than entity resolution or case-law reasoning. Existing RAG systems struggle here due to flattened citation edges, fragmented retrieval expansions, and fragile post-hoc attribution. We formalize Regulatory Compliance QA with RegOps-Bench, a novel benchmark featuring an Operational Knowledge Graph derived from complex national R&D regulations. To address these bottlenecks, we propose RefWalk, a unified framework driven by a shared topic anchor. RefWalk traverses cross-document citations, fuses multi-view candidates via max-based aggregation, and enforces per-rule attribution to explicitly map claims to sources. We establish a strong baseline with substantial improvements in retrieval recall and citation accuracy. Finally, a contrastive evaluation on a U.S. health compliance dataset (HIPAA) reveals that existing systems exhibit saturation on flat-structure rules, underscoring the need for RegOps-Bench. Our code is available at https://github.com/yeongjoonJu/RefWalk.


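2605.29738 2026-05-29 cs.CL cs.AI version update

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Volodymyr Ovcharov

Affiliations * SecondLayer

AI summary: Proposes Multi-Legal-Bench, the first cross-jurisdictional legal benchmark, which evaluates LLMs across 6 countries, 4 language families, and 134 million court decisions. It finds that few-shot effects replicate across jurisdictions, no single model dominates every language, cross-lingual transfer does not follow language proximity, and tokenizer efficiency does not significantly predict cross-lingual accuracy.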

Comments 14 pages, 5 figures, 8 tables. Dataset: https://huggingface.co/datasets/overthelex/multi-legal-bench

Abstract

Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional legal benchmark that evaluates identical tasks across six countries (Ukraine, France, Netherlands, Poland, Czech Republic, Lithuania), four language families, and 134 million court decisions. The benchmark defines five tasks (court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction) mapped to structured metadata from national court registries, forming a deliberately sparse 5x6 task-jurisdiction matrix (20 of 30 cells filled). We evaluate 7 frontier LLMs under zero-shot and 3-shot prompting via AWS Bedrock, with 4 additional small/medium models (3-12B) for scaling analysis. Our results reveal that: (1) task-dependent few-shot effects discovered in Ukrainian replicate across all jurisdictions; (2) no single model dominates any language; rankings shift with both task and jurisdiction; (3) cross-lingual few-shot transfer does not follow language proximity: UA->FR (Romance, -2.1 pp) transfers better than UA->PL (Slavic, -13.7 pp), with label-set alignment predicting transfer quality better than language family; and (4) tokenizer fertility, despite a 2.3x spread, does not significantly predict cross-lingual accuracy (r=-0.27, p=0.14), suggesting that model architecture and pretraining data dominate tokenizer efficiency. We release all data, prompts, and model predictions.


2605.29733 2026-05-29 cs.AI version update

Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management

Shadmehr Zaregarizi, Khashayar Yavari

Affiliations * Politecnico di Torino

AI summary: Proposes an uncertainty-aware transfer learning framework based on the Temporal Fusion Transformer. A Transfer Robustness Index and a probe-only fine-tuning strategy enable robust cross-building transfer of energy forecasts with uncertainty quantification.

Comments 5 pages, 3 figures, 2 tables. Accepted at BALANCES'26 (6th ACM International Workshop on Big Data and Machine Learning for Smart Buildings and Cities), Banff, Alberta, Canada, June 22, 2026. This is the author's accepted manuscript; final published version DOI will be activated after June 22, 2026

Abstract

Scaling data-driven energy forecasting to district level requires models that can be re-used across buildings with minimal target-domain data and honest uncertainty estimates. We present an uncertainty-aware transfer learning (TL) framework for cross-building energy forecasting based on the Temporal Fusion Transformer (TFT), evaluated on a newly released high-resolution real sub-meter dataset: an educational building at Aalborg University, Denmark (source) and the multi-typology NEST building at EMPA, Switzerland (target). We introduce the Transfer Robustness Index (TRI), an architecture-agnostic metric for quantifying generalization quality across domain gaps. A four-strategy layer-freezing ablation shows that Probe-Only fine-tuning, updating only 455 output-layer parameters out of 806K, achieves the best transfer quality (TRI = 3,097), outperforming full fine-tuning and suggesting that TFT encoders learn transferable temporal representations. Monte Carlo Dropout yields a prediction interval coverage probability of 93.2%, close to the nominal 95% target. A data-scarcity analysis further shows monotonic improvement with increasing target-domain data, providing practical guidance for district energy deployment.
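The coverage figure quoted above can be made concrete with a small sketch. The abstract does not say how the Monte Carlo Dropout intervals are built, so the Gaussian-approximation interval below (mean plus or minus 1.96 standard deviations over repeated stochastic forward passes) is an assumption, and the function names `picp` and `interval_from_samples` are hypothetical, not taken from the paper.

```python
import statistics


def picp(y_true, lower, upper):
    """Prediction interval coverage probability: the fraction of targets
    that fall inside their [lower, upper] interval (bounds inclusive)."""
    if not y_true or not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("inputs must be non-empty and of equal length")
    covered = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return covered / len(y_true)


def interval_from_samples(samples, z=1.96):
    """Assumed interval construction: treat repeated dropout-enabled
    predictions for one time step as roughly Gaussian and return
    mean +/- z * sample standard deviation (needs at least 2 samples)."""
    mean = statistics.fmean(samples)
    std = statistics.stdev(samples)
    return mean - z * std, mean + z * std
```

A PICP close to the nominal level (for example 0.932 against a 0.95 target) indicates intervals that are only slightly too narrow; values far below the target indicate overconfident uncertainty estimates.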


2605.29716 2026-05-29 cs.AI version update

NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs

Shuaidi Wang, Zhan Zhuang, Ruping Huang, Yu Zhang

Affiliations * Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China; Department of Computer Science, City University of Hong Kong, Hong Kong, China; Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China

AI summary: For diffusion LLMs, proposes Noise-aware Low-Rank Adaptation (NaRA), in which a noise-conditioned hypernetwork generates a low-rank core matrix so that the update matrices vary continuously along the denoising trajectory. It outperforms noise-agnostic baselines on commonsense reasoning, mathematical reasoning, and code generation benchmarks.

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive generative paradigm. Given the prohibitive computational cost of full fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) has become the standard approach. However, existing PEFT methods (e.g., LoRA), originally tailored for autoregressive models, rely on static parameters that are agnostic to the noise level. Consequently, they ignore the intrinsic dynamics of the diffusion process, where input distributions and generation difficulty shift significantly along the denoising trajectory, rendering them suboptimal for dLLMs. To address this, we propose Noise-aware Low-Rank Adaptation (NaRA), which introduces a low-rank core matrix generated by a lightweight, globally shared hypernetwork conditioned on the noise level. This design enables the update matrices to vary continuously along the diffusion process while keeping parameter and latency overhead negligible. We provide a theoretical justification for the proposed NaRA framework and empirically demonstrate consistent improvements over noise-agnostic baselines across commonsense reasoning, mathematical reasoning, and code generation benchmarks. Our code is available at https://github.com/generaldi/NaRA.


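2605.29713 2026-05-29 cs.LG cs.AI version update

The Little Book of Generative AI Foundations: An Intuitive Mathematical Primer

Tianhua Chen

Affiliations * School of Computing and Engineering

AI summary: A derivation-oriented book that systematically introduces the mathematical foundations of modern generative AI, from PCA to energy-based models, aiming to make the structure of generative modeling easier to understand.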

Comments Preprint version, 178 pages. Comments and corrections are welcome

2605.29712 2026-05-29 cs.CL cs.AI version update

Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies

Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson

Affiliations * Intelligent Systems Laboratory; University of Bristol

AI summary: Frames grounded claim factuality checking as a true/false reading comprehension task, prompts language models with explicit test-taking strategies for efficient reasoning, and trains small language models to reduce inference cost.

Comments ACL 2026 Main

Abstract

Grounded claim factuality checking is important for large language model (LLM) applications such as retrieval-augmented generation, as it helps users assess the correctness of generated outputs. Existing metrics using entailment classifiers require dataset-specific threshold tuning, while LLM-based approaches often use direct prompting, which underutilises the reasoning capabilities of LLMs. We address this by formulating grounded claim factuality checking as a true/false reading comprehension task and prompting LLMs with explicit test-taking strategies for efficient reasoning. Our method reduces token usage by over 80% compared to unguided open-ended reasoning, and achieves competitive performance to more expensive alternatives across two factuality benchmarks, setting a new state of the art on one. To further reduce inference cost, we train small language models (SLMs) to replace LLMs in the checking pipeline. Using supervised fine-tuning (SFT) and a self-revision mechanism, the SLMs learn to improve their factuality judgements. Experimental results show that the resulting SLMs perform on par with strong baselines, combining low inference costs with generating supporting rationales to support interpretability. Code and datasets will be released upon acceptance.


2605.29711 2026-05-29 cs.CL cs.AI version update

Personalized Turn-Level User Conversation Satisfaction Benchmark

Zhefan Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang, Quanjia Yan, Hengliang Luo

Affiliations * Department of Computer Science and Technology, Tsinghua University, Beijing, China; Institute for AI Industry Research, Tsinghua University, Beijing, China; Meituan

AI summary: Addresses personalized satisfaction evaluation of AI assistant responses. Proposes a satisfaction evaluator that combines user memories with target-turn context, and builds the PersTurnBench benchmark, which enables controlled comparison of generation models via replay.

Abstract

User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, making it difficult to judge whether a response satisfies a user at a specific turn. We study this problem as personalized turn-level user conversation satisfaction evaluation. We build a conversation satisfaction evaluator that combines compact user memories with target-turn context to produce satisfaction scores and dissatisfaction-oriented rationales. Meta-evaluation against human satisfaction annotations shows that personalized memory and post-hoc score calibration improve ordinal agreement and dissatisfied-turn detection over supervised, retrieval-based, and generic LLM-as-a-judge baselines. We further introduce PersTurnBench, a personalized turn-level user conversation satisfaction benchmark that uses the verified evaluator to assess generation models via replay. By holding the replay state fixed, PersTurnBench enables controlled comparison of generic generation models and memory-augmented personalized systems without new human labels for every candidate model. The evaluator and benchmark let researchers compare candidate generation models on personalized satisfaction without collecting new user feedback for every model.


2605.29705 2026-05-29 cs.AI version update

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

Mincheol Kang, Hyunjin Lim, Bomin Kang, Daehee Park

Affiliations * KAIST, Republic of Korea; DGIST, Republic of Korea

AI summary: Proposes BitTP, which converts an LLM-based trajectory predictor into a lightweight 1.58-bit architecture, substantially reducing memory and compute requirements while maintaining or improving prediction quality, enabling deployment on edge devices.

Comments Camera-ready version. Accepted as a findings paper at CVPR 2026. 8 pages, 4 figures

NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence in LLMs

Yunjin Qi, Zhaojun Jiang, Xuan Wu, Hanxi Pan, Yixuan Wang, Yanfang Liu, Xiang Ji, Churu Yu, Chunyuan Zheng, Yingze Chen, Jie He, Liuqing Chen, Zaifeng Gao

Affiliations * Department of Psychology and Behavioral Sciences, Zhejiang University; College of Artificial Intelligence, Zhejiang University; Human Machine Interaction Lab, Huawei Technologies Co., Ltd.; Zhejiang Key Laboratory of Neurocognitive Development and Mental Health

AI summary: Builds a social intelligence framework grounded in social theory and proposes NICE, a diagnostic benchmark for fine-grained evaluation of LLMs' capability weaknesses in social interaction.

Abstract

As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interaction. However, existing social intelligence benchmarks lack a unified framework for organizing social abilities, and therefore cannot support fine-grained diagnosis. To build the first holistic diagnostic evaluation grounded in social theory, we first construct a social intelligence framework through a literature review and multi-stage expert validation guided by psychometric principles. The resulting framework includes 4 categories and 11 dimensions, each further specified by fine-grained capability facets. Building on this framework, we introduce NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items operationalized through representative Chinese contexts. Across 5 frontier LLMs and a human reference group, models score higher in aggregate accuracy yet show a consistent weakness in Communication, which the framework localizes to 3 specific capability facets: multi-turn communication, nonverbal communication, and synchrony. NICE thus reframes social intelligence evaluation toward theory-grounded diagnosis of socially consequential weaknesses in LLMs.


2605.29675 2026-05-29 cs.HC cs.AI cs.IR version update

From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration

Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

Affiliations * Gamaizer; Université de technologie de Compiègne, CNRS, Heudiasyc; Sorbonne Université, CNRS UMR 7585, LPMHE

AI summary: Proposes an ontology-based framework (CCAI) that models tasks, roles, resources, and constraints in a structured way, turning prompt-response interactions into queryable collaboration traces to improve traceability and accountability in information-intensive workflows.

Abstract

Collaborations with Generative AI often begin with a short prompt and end with an opaque output, leaving implicit who was involved, what task was being pursued, which resources were used, and which constraints should have shaped the process. This limited contextual explicitness hinders trust, traceability, and accountability, particularly when Generative AI is embedded in information-intensive workflows such as search, querying, and profile management. This paper introduces From Prompts to Context, an ontology-driven framework for representing Human-Generative AI collaboration. Its core component, the Contextual Collaboration AI Ontology (CCAI), models key elements of collaboration - including tasks, agent roles, resources, and constraints - as a shared machine-interpretable vocabulary. By combining populated CCAI instances with SPARQL-based context retrieval in operational workflows, the framework turns otherwise ephemeral prompt-response interactions into structured and queryable collaboration traces linking prompts, outputs, and their surrounding context. The approach is illustrated through a case study involving a software development team building a competency-based education feature for viewing and updating learner competency profiles. The case study shows how the framework can support the representation and documentation of collaboration episodes across requirements analysis, design, implementation, and testing. Within this setting, the results indicate that explicit collaboration modelling helps make task context more explicit, improves the traceability of AI-generated contributions, and supports more transparent and accountable Human-Generative AI practices. We conclude by outlining design principles for future Human-Generative AI systems that emphasise not only output quality, but also the explicit representation of the collaborative context in which outputs are produced.


2605.29670 2026-05-29 cs.CL cs.AI version update

EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL

Huawei Zheng, Sen Yang, Zhaorui Yang, Yuhui Zhang, Haozhe Feng, Haoxuan Li, Xuan Yi, Chao Hu, Defeng Xie, Chen Hou, Danqing Huang, Wei Chen, Yingcai Wu, Peng Chen, Dazhen Deng

Affiliations * School of Software Technology, Zhejiang University; State Key Lab of CAD&CG, Zhejiang University; Tencent TEG; School of Mathematical Sciences, Peking University

AI summary: Proposes EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition and recasts schema linking as uncertainty-aware schema-need inference, balancing schema completeness, relevance, and token cost to improve large-scale Text-to-SQL.

Abstract

Schema linking is a difficult and important step in large-scale Text-to-SQL, where systems must identify a compact yet sufficient schema context from large and ambiguous databases. Existing methods often treat schema linking as deterministic selection around a single SQL path, but complex questions may admit multiple valid realizations with different schema needs. We reframe schema linking as uncertainty-aware schema-need inference over multiple plausible SQL paths, where the system distinguishes required schema items from path-dependent uncertain ones and acquires evidence only where needed. We instantiate this reframing with EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition. Experiments on BIRD-Dev and Spider2-Snow show that this perspective improves the balance among schema completeness, schema relevance, and token cost. On Spider2-Snow, EviLink achieves 90.15% field-level strict recall rate, uses 123.30K average tokens, and improves downstream SQL generation under a fixed generator.


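2605.29668 2026-05-29 cs.AI cs.CL version update

PTCG-Bench: Can LLM Agents Master the Pokémon Trading Card Game?

Dongdong Hua, Yifei Sun, Renhong Huang, Feng Gao, Chunping Wang, Yang Yang

Affiliations * Zhejiang University; FinVolution Group

AI summary: Proposes PTCG-Bench, a benchmark that uses the Pokémon Trading Card Game to evaluate the decision-making performance and self-evolution ability of LLM agents, with a modular harness ablation for analyzing agent performance.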

Abstract

Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agents require similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic and evolving decision-making scenarios. We present PTCG-Bench, a benchmark built on the Pokémon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) their decision-making performance within a single complex environment, and (2) their ability to self-evolve through accumulated experience. We further include a modular harness ablation to better interpret agent performance without conflating it with model capability. Our experiments show that, although LLM agents can achieve non-trivial gameplay performance, sustained and stable self-evolution remains challenging, and performance is sensitive to harness design. We hope that PTCG-Bench will facilitate future research on harness-aware and self-evolving agents in realistic interactive environments.


2605.29652 2026-05-29 cs.AI version update

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Kai-Chen Cheng, Haejun Han, David Q. Sun

AI summary: Proposes a pipeline that combines deterministic computation with a bounded LLM call for structured health text generation, reducing error rates and cost while preserving faithfulness.

Abstract

Large language models (LLMs) are increasingly being used to generate health text from structured records such as wearable time series, biomarkers, vitals, and care-management logs. For recurring health outputs, fluency is not enough: systems must remain faithful to source data, ground explanatory claims in available evidence, follow stated policies, emit machine-readable outputs, and run cheaply enough for repeated use. We ask which responsibilities in structured health generation should be deterministic computation rather than runtime LLM prompting. We introduce Think Fast, Talk Smart, a sleep-health insight pipeline in which deterministic code performs recurring analysis before one bounded LLM writer call. Across 280 user-nights and six models, the pipeline achieves lower numeric error, lower instruction-compliance error, and lower end-to-end cost than structured zero-shot and few-shot one-call baselines. Layer replacement reveals contract-specific failures: LLM comparison raises numeric error, LLM ranking degrades policy selection, LLM attribution increases unsupported causal language, and an LLM-generated writer interface reintroduces errors even after upstream facts are deterministic. The results support a broader design rule: let code own recurring analysis, and let LLMs express verified facts within bounded interfaces.


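2605.29645 2026-05-29 cs.LG cs.AI stat.ML version update

Beyond Attack Success Rate: Temporal Logit Observability of LLM Safety Failures

Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee

Affiliations * Chung-Ang University

AI summary: Proposes Temporal Logit Observability (TLO), which maps each model-attack condition onto a calibrated 2D plane using the compliance-refusal margin during decoding, revealing temporal patterns of attack success. An early-stop rule derived from it cuts successful jailbreaks by more than half.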

Abstract

Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it unfolded. Two attacks that produce equally harmful outputs may have followed completely different paths, and ASR cannot tell them apart. We make those hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that watches a compliance-refusal margin during decoding and places each model-attack condition on a calibrated 2D plane. By design, this plane is most informative exactly where ASR is least informative: among attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR land at clearly different points on the plane: the same model can fail through different temporal patterns. The geometry matches refusal-direction probes from hidden states on most conditions, with one model showing the limit of our fixed-lexicon approach. A simple early-stop rule derived from TLO cuts successful jailbreaks by more than half, without false alarms on plain benign queries. Safety evaluation should report when and how a failure unfolds, not only whether it occurred. TLO makes the first two observable from logits alone.
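As a rough illustration of what a per-step compliance-refusal margin could look like, the sketch below compares the log-sum-exp of logits over two fixed token sets at every decoding step. The abstract does not define the margin, the lexicon, or the 2D calibration, so this definition, the token-id sets, and the function names `logsumexp` and `margin_trace` are all assumptions made for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def margin_trace(step_logits, compliance_ids, refusal_ids):
    """For each decoding step, return an assumed margin:
    logsumexp over compliance-token logits minus logsumexp over
    refusal-token logits. Positive values lean toward compliance.

    step_logits: list of per-step logit vectors (lists of floats).
    compliance_ids / refusal_ids: hypothetical fixed lexicons of token ids.
    """
    trace = []
    for logits in step_logits:
        comp = logsumexp([logits[i] for i in compliance_ids])
        ref = logsumexp([logits[i] for i in refusal_ids])
        trace.append(comp - ref)
    return trace
```

Inspecting the whole trace, rather than only the final output label, is what allows two attacks with the same ASR to be told apart by when the margin changes sign.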


2605.29628 2026-05-29 cs.SD cs.AI cs.CL cs.LG eess.AS version update

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

Affiliations * School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey

AI summary: Proposes the COMET framework, whose PLS-SVD decomposition shows that only a small set of shared-concept axes contributes substantially to similarity in CLAP models. A training-free spectral truncation method then mitigates the modality gap, bringing zero-shot audio captioning close to fully supervised performance.

Abstract

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyzes the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that offers a broader perspective on the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component only partially represents the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.
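
To make the PLS-SVD idea concrete, here is a minimal sketch on synthetic data: it takes the SVD of the cross-covariance between centered audio and text embeddings and keeps the top-k paired axes. The per-modality centering, the synthetic embeddings, and the choice k=3 are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np

def pls_svd(audio, text):
    """SVD of the cross-covariance between centered paired embeddings."""
    a = audio - audio.mean(axis=0)
    t = text - text.mean(axis=0)
    cross_cov = a.T @ t / (len(a) - 1)
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    return u, s, vt.T  # column i of u and of v form one paired axis

def truncate(audio, text, u, v, k):
    """Project both modalities onto the top-k paired axes."""
    a = audio - audio.mean(axis=0)
    t = text - text.mean(axis=0)
    return a @ u[:, :k], t @ v[:, :k]

# Synthetic stand-in for CLAP embeddings: 3 shared latent concepts,
# small noise, and a different constant offset per modality.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 3))
audio = shared @ rng.normal(size=(3, 16)) + 0.05 * rng.normal(size=(200, 16)) + 2.0
text = shared @ rng.normal(size=(3, 16)) + 0.05 * rng.normal(size=(200, 16)) - 2.0

u, s, v = pls_svd(audio, text)
za, zt = truncate(audio, text, u, v, k=3)
```

Inspecting `s` shows how quickly the singular values decay, which is the quantity a spectral truncation rule would threshold.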

2605.29626 2026-05-29 cs.CL cs.AI version update

DLM-SWAI: Steering Diffusion Language Models Before They Unmask

Hyeseon An, Yo-Sub Han

Affiliations: Department of Computer Science; Yonsei University

AI summary: Proposes DLM-SWAI, a training-free steering method that biases the token distribution at each denoising step using pre-computed token-level style scores, enabling controllable generation with diffusion language models.

Comments: preprint

Abstract

Steering language model generation toward desired textual properties is essential for practical deployment, and inference-time methods are particularly appealing because they enable controllable generation without retraining. Recent work has also highlighted diffusion language models as an emerging generation paradigm with distinct decoding properties. However, most existing steering approaches either rely on auxiliary models or are designed for autoregressive next-token decoding, making them difficult to apply to diffusion language models (DLMs), which generate text through iterative denoising of partially masked sequences. Therefore, we propose DLM-SWAI, a simple training-free steering method that biases the token distribution at each denoising step using pre-computed token-level style scores. Experiments on style and safety control tasks show that DLM-SWAI effectively steers diffusion language models while preserving generation quality and requiring minimal computational overhead. Ablations further reveal a controllable trade-off between steering strength and fluency, and our analysis links class-wise steerability to the strength of token-level attribute cues.
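
The biasing step can be sketched as adding a scaled style score to each token's logit at the masked positions before renormalizing. The additive form, the `strength` parameter, and the toy vocabulary below are assumptions; the abstract states only that pre-computed token-level scores bias the distribution at each denoising step.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def steer_step(logits_per_pos, masked, style_scores, strength=1.0):
    """Return steered token distributions for the masked positions only.
    `style_scores` holds one pre-computed score per vocabulary token."""
    steered = {}
    for pos in masked:
        biased = [l + strength * s
                  for l, s in zip(logits_per_pos[pos], style_scores)]
        steered[pos] = softmax(biased)
    return steered

vocab = ["good", "bad", "fine"]          # toy vocabulary
style_scores = [1.0, -1.0, 0.0]          # hypothetical pre-computed scores
logits = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
steered = steer_step(logits, masked=[1], style_scores=style_scores, strength=2.0)
```

Raising `strength` pushes more probability onto high-scoring tokens, which is one way to picture the trade-off between steering strength and fluency that the ablations report.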

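2605.29625 2026-05-29 cs.AI version update

Training Deliberative Monitors for Black-Box Scheming Detection

Aditya Sinha, Akshat Naik, Victor Gillioz, Simon Storf, Kilian Merkelbach, Rich Barton-Cooper, Axel Højmark, Marius Hobbhahn

Affiliations: Independent; MATS Research; Astra Fellowship; Apollo Research

AI summary: Proposes a deliberative monitoring method based on action trajectories that trains open-weight models by distilling a frontier model's rationales, detecting agent scheming and sabotage at low cost and with high accuracy.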

Abstract

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable or expensive in deployment. In this work, we study action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without accessing the monitored agent's reasoning or model internals. Our method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. We train on five datasets, and evaluate across six out-of-distribution agentic misalignment benchmarks. We show that applying our method to Qwen3.5-27B yields higher performance than all low-cost frontier models as prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5) and than Gemini 2.5 Pro, while also achieving lower marginal inference cost (token-metered USD per 1,000 evaluations). Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, and Claude Opus 4.6) achieve higher performance but at roughly 16-34× higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost-performance Pareto frontier among the monitors we evaluate, providing practical low-cost, low-FPR alternatives to prompted frontier models.
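
For readers unfamiliar with the term, the sketch below shows how an empirical cost-performance Pareto frontier can be computed: a monitor is on the frontier if no other monitor is at least as cheap and at least as accurate while being strictly better on one of the two. The monitor names and numbers are placeholders, not results from the paper.

```python
def pareto_frontier(monitors):
    """Names of monitors that no other monitor dominates.
    Each monitor is a (name, cost, performance) tuple; lower cost and
    higher performance are better."""
    frontier = []
    for name, cost, perf in monitors:
        dominated = any(
            c <= cost and p >= perf and (c < cost or p > perf)
            for _, c, p in monitors
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Placeholder values for illustration only.
monitors = [
    ("A", 1.0, 0.80),
    ("B", 2.0, 0.78),
    ("C", 20.0, 0.90),
    ("D", 30.0, 0.88),
]
frontier = pareto_frontier(monitors)
```

Here "A" and "C" remain on the frontier: the cheap monitor is not beaten on cost, and the expensive one is not beaten on performance.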

2605.29591 2026-05-29 cs.AI version update

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

Yizhuo Lu, Changde Du, Qingyu Shi, Hang Chen, Jie Peng, Liuyun Jiang, Shuangchen Zhao, Huiguang He

Affiliations: NeuBCI Lab, State Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Zhongguancun Academy, Beijing, China; Peking University, Beijing, China

AI summary: Proposes the Mind-Omni framework, which uses a discrete diffusion paradigm to unify seven encoding and decoding tasks, converts continuous brain signals into discrete tokens with a Brain Tokenizer to enable multimodal interaction, and builds a Brain Question Answering instruction-tuning dataset, matching or exceeding specialized models on multiple tasks.

Abstract

Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-task models, which curtails versatility and neglects inter-task synergies. To address this, we propose Mind-Omni, the first versatile framework that unifies seven distinct encoding and decoding tasks through a discrete diffusion paradigm. At its core is a novel Brain Tokenizer that transforms heterogeneous, continuous brain signals into standardized, discrete tokens. This enables direct, token-level interactions for mutual understanding and generation between any two or more modalities within a shared semantic space. To unlock advanced reasoning capabilities, we further curate a specialized Brain Question Answering (BQA) instruction-tuning dataset. Our model not only establishes a new state-of-the-art among multi-task unified frameworks but also provides strong evidence for multi-task synergy. By demonstrating performance competitive with, and at times superior to, larger specialized models, our work offers a powerful new paradigm for neural modeling and paves the way for foundation models of neural activity. The code is publicly available at https://github.com/ReedOnePeck/Mind-Omni.

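2605.29586 2026-05-29 cs.AI version update

The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

Shashwat Sourav, Viktoriia Baibakova, Sanjay Das, Ran Elgedawy, Maria Mahbub, Emily Herron, Tirthankar Ghosal

Affiliations: Washington University in St. Louis; Oak Ridge National Laboratory; Lawrence Berkeley National Laboratory; UniverseTBD

AI summary: Perturbs local knowledge graphs (density, ontology richness, topology, and control structure) to assess how useful knowledge graphs are to different language models when generating battery-materials hypotheses, and proposes a redundancy-aware Compressive KG hypothesis: the useful signal can be recovered from compact subgraphs.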

Abstract

Knowledge graphs (KGs) can provide structured scientific context to language models, but it remains unclear which graph facts actually shape the generated hypotheses. We study KG-guided hypothesis generation for battery materials across Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash. We perturb local KGs by varying density, ontology richness, topology, and control structure, and evaluate outputs with both provided-graph and fixed-reference metrics. Across models, KG utility is selective and model-dependent: graph context changes outputs, but no-KG outputs also recover substantial graph content from model priors. Compact top-k subgraphs often approximate full-KG behavior, including when claimed-outcome triples are held out. At the same time, compression is not unique to one semantic ranking rule: random and topology-based subsets can also recover much of the signal. These results support a redundancy-aware Compressive KG hypothesis: useful KG signal is often recoverable from compact, scientifically structured subgraphs rather than requiring the full local graph.

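2605.27078 2026-05-29 cs.LG cs.AI version update

Reducing Political Manipulation via Consistency Training

Long Phan, Devin Kim, Alexander Pan, Alice Blair, Adam Khoja, Dan Hendrycks

Affiliations: Center for AI Safety; UC Berkeley

AI summary: Targets the implicit political bias that large language models show on sensitive topics and proposes Political Consistency Training (PCT), which reduces bias using two metrics, sentiment consistency and helpfulness consistency, together with corresponding training paradigms.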

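2605.16825 2026-05-29 cs.IR cs.AI version update

Teacher-Guided Policy Optimization: On-Policy Reasoning Distillation under Large Policy Divergence

Xinyu Liu, Kechen Jiao, Chunyang Xiao, Runsong Zhao, Junhao Ruan, Bei Li, Jiahao Liu, Qifan Wang, Xin Chen, Jingang Wang, Chenglong Wang, Tong Xiao, JingBo Zhu

Affiliations: School of Computer Science and Engineering, Northeastern University, China; Tsinghua University; Meituan; Meta AI; NiuTrans Research, Shenyang, China

AI summary: Addresses the failure of reverse-KL supervision in on-policy distillation when the teacher and student policies diverge widely, and proposes Teacher-Guided Policy Optimization (TGPO), in which the teacher directly guides token-level generation on student contexts and is combined with RLVR rewards, outperforming existing methods on reasoning benchmarks.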

Abstract

On-policy distillation (OPD) has become a promising paradigm for reasoning-oriented post-training of large language models (LLMs), especially when combined with reinforcement learning from verifiable rewards (RLVR). Existing OPD methods rely on reverse KL (RKL)-based teacher supervision over trajectories sampled from the student policy. However, we identify a critical limitation: under large teacher-student policy divergence, RL-driven exploration often produces trajectories outside the teacher distribution, resulting in uninformative negative feedback. To address this, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy reasoning distillation method that remains effective under large policy divergence. Rather than relying solely on evaluative supervision, TGPO uses the teacher to directly guide token-level generation conditioned on student-generated contexts; together with RLVR-style trajectory-level rewards, TGPO steers exploration toward improved continuations. Experiments on reasoning benchmarks show that TGPO consistently outperforms existing RKL-based OPD methods and remains robust across different teacher models.

2605.11723 2026-05-29 cs.CV cs.AI version update

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Jiyuan Wang, Huan Ouyang, Jiuzhou Lin, Chunyu Lin, Dewen Fan, Boheng Zhang, Haonan Fan, Fei Zuo, Jia Sun, Huaiqing Wang, Honglie Wang, Yiyang Fan, Zhenlong Yuan, Zijun Li, Yongrui Heng, Guosheng Lin, Fan Yang, Tingting Gao

Affiliations: BJTU; NTU; BUPT; Kuaishou Technology

AI summary: Proposes CaC, a coarse-to-fine anomaly reward model based on vision-language models that combines a global temporal scan, local spatial grounding, and structured spatiotemporal Chain-of-Thought reasoning with a large-scale generated-video anomaly dataset and three-stage progressive training, substantially improving fine-grained anomaly detection accuracy and reducing anomalies in generated videos.

Comments: 27 pages, 10 figures

Abstract

In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and then is optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks and, when used as a reward signal, CaC reduces generated-video anomalies by 11.7% while improving overall video quality.
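
The Temporal and Spatial IoU rewards mentioned above can be pictured with the standard intersection-over-union computation for time windows and bounding boxes. The way the terms are combined into one scalar reward, and the weights `w_t` and `w_s`, are assumptions for illustration; the abstract does not give the exact reward formula.

```python
def temporal_iou(pred, gt):
    """IoU of two time windows given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def spatial_iou(pred, gt):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    union = box_area(pred) + box_area(gt) - inter
    return inter / union if union > 0 else 0.0

def reward(correct, t_iou, s_iou, w_t=0.5, w_s=0.5):
    """Hypothetical combination: accuracy reward plus weighted IoU terms."""
    return float(correct) + w_t * t_iou + w_s * s_iou

t = temporal_iou((2.0, 6.0), (4.0, 8.0))
s = spatial_iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
r = reward(True, t, s)
```

Rewarding the intermediate localization in this way gives a learning signal even when the final judgment is correct for the wrong region or time window.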

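2605.05155 2026-05-29 cs.CV cs.AI version update

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

Chuanzhi Xu, Boyu Wei, Haoxian Zhou, Xuanhua Yin, Zihan Deng, Haodong Chen, Qiang Qu, Weidong Cai

Affiliations: The University of Sydney; The University of Hong Kong

AI summary: Addresses the lack of aesthetic assessment for 3D Gaussian Splatting scenes by proposing Aes3D, the first systematic framework, comprising the dedicated dataset Aesthetic3D and the lightweight model Aes3DGSNet, which predicts scene-level aesthetic scores directly without rendering multi-view images.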

Abstract

As 3D Gaussian Splatting (3DGS) gains attention in immersive media and digital content creation, assessing the aesthetics of 3D scenes becomes important in helping creators build more visually compelling 3D content. However, existing evaluation methods for 3D scenes primarily emphasize reconstruction fidelity and perceptual realism, largely overlooking higher-level aesthetic attributes such as composition, harmony, and visual appeal. This limitation comes from two key challenges: (1) the absence of general 3DGS datasets with aesthetic annotations, and (2) the intrinsic nature of 3DGS as a low-level primitive representation, which makes it difficult to capture high-level aesthetic features. To address these challenges, we propose Aes3D, the first systematic framework for assessing the aesthetics of 3D neural rendering scenes. Aes3D includes Aesthetic3D, the first dataset dedicated to 3D scene aesthetic assessment, built on our proposed annotation strategy for 3D scene aesthetics. In addition, we present Aes3DGSNet, a lightweight model that directly predicts scene-level aesthetic scores from 3DGS representations. Notably, our model operates solely on 3D Gaussian primitives, eliminating the need for rendering multi-view images and thus reducing computational cost and hardware requirements. Through aesthetics-supervised learning on multi-view 3DGS scene representations, Aes3DGSNet effectively captures high-level aesthetic cues and accurately regresses aesthetic scores. Experimental results demonstrate that our approach achieves strong performance while maintaining a lightweight design, establishing a new benchmark for 3D scene aesthetic assessment. Code and datasets will be made available in a future version.

2605.00969 2026-05-29 cs.SD cs.AI cs.CL version update

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

Harshit Rajgarhia, Shuubham Ojha, Asif Shaik, Akhil Pothanapalli, Rachuri Lokesh, Abhishek Mukherji, Prasanna Desikan

Affiliations: Centific Global Solutions Inc.; University of Maryland, College Park, MD, USA

AI summary: To address the scarcity of medical audio data and the shortcomings of existing benchmarks, proposes the MedMosaic dataset, which covers multiple medical audio types and 46,701 question-answer pairs for evaluating language and audio reasoning models; experiments show that reasoning remains challenging.

Comments: Accepted at ICML 2026

Abstract

Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real short- and long-form clinical conversations that model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question answering, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model such as Gemini-2.5-pro achieves only about 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models. A sample of benchmark data is available here: https://shorturl.at/Lyp33

2604.27272 2026-05-29 cs.CL cs.AI cs.LG version update

When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks

Chung-Hsiang Lo, Lu Li, Diji Yang, Tianyu Zhang, Yunkai Zhang, Yoshua Bengio, Yi Zhang

Affiliations: Northeastern University; University of Pennsylvania; UC Santa Cruz; Mila - Quebec AI Institute; University of Montreal; BAIR, UC Berkeley

AI summary: Using three tasks (matrix transpose, Conway's Game of Life, and LU decomposition), finds that serializing two-dimensional layout tasks into one-dimensional text degrades performance because of representational mismatch, and that the errors show spatially structured patterns.

Abstract

In the LLM era, many symbolic and structured problems are presented to models through 1D text serialization. Yet some such problems are natively two-dimensional: their relevant relations, such as row-column correspondence or spatial adjacency, are defined by position in a 2D layout rather than by sequential order. This raises a representational question: does preserving the same symbolic entries in a 1D sequence also preserve the relational structure needed for computation? We study this issue through the lens of serialization friction: the representational mismatch in which the same underlying task instances and entries are still present, but relations that depend on layout become implicit under 1D serialization. The study uses a controlled synthetic testbed of three tasks: matrix transpose, Conway's Game of Life, and LU decomposition. In each task, the same instances are presented either as 1D text serialization or as their native 2D layout rendered as an image. Across this testbed, 1D serialization degrades more sharply as task size grows, and errors under serialization exhibit spatially structured patterns, suggesting that this presentation choice is consequential within our testbed. To further interpret these results, we add supplementary analyses that include a within-visual probe and an additional comparison of the two input presentations under the mixed-training transpose setting. These findings suggest that, for layout-defined tasks, reducing inputs to 1D serialization is not a neutral choice of representation.
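
A small sketch can make the notion of serialization friction concrete for the matrix transpose task. Under row-major serialization (an assumed convention here), transposing means moving the entry at flat index `i * cols + j` to flat index `j * rows + i`, so a relation that is purely positional in 2D becomes index arithmetic in 1D.

```python
def serialize(matrix):
    """Row-major 1D serialization of a 2D matrix."""
    return [x for row in matrix for x in row]

def transpose_in_1d(flat, rows, cols):
    """Transpose using only the flat sequence: the entry at input
    index i*cols + j moves to output index j*rows + i."""
    out = [None] * (rows * cols)
    for i in range(rows):
        for j in range(cols):
            out[j * rows + i] = flat[i * cols + j]
    return out

def vertical_neighbor_gap(cols):
    """Distance in the flat sequence between vertically adjacent cells."""
    return cols

m = [[1, 2, 3],
     [4, 5, 6]]
flat = serialize(m)
flat_t = transpose_in_1d(flat, rows=2, cols=3)
```

Note that `vertical_neighbor_gap` grows with the number of columns, so cells that are adjacent in the 2D layout drift further apart in the sequence as the task size grows.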

2604.26645 2026-05-29 cs.AI cs.LG version update

SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

Dianyu Liu, Chuan Qin, Xi Chen, Xiaohan Li, Wenxi Xu, Yuyang Wang, Xin Chen, Yuanchun Zhou, Hengshu Zhu

Affiliations: SciHorizon Team, Computer Network Information Center, Chinese Academy of Sciences

AI summary: Proposes SciHorizon-DataEVA, an agentic system built on the Sci-TQA2 principles and a hierarchical multi-agent evaluation approach, for scalable AI-readiness evaluation of heterogeneous scientific data.

Abstract

AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI-readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon-DataEVA, a novel agentic system for scalable AI-readiness evaluation of heterogeneous scientific data. At the evaluation-criteria level, we introduce the Sci-TQA2 principles, which organize AI-readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each dimension is decomposed into measurable atomic elements that enable fine-grained and executable assessment. To operationalize these principles at scale, we develop Sci-TQA2-Eval, a hierarchical multi-agent evaluation approach orchestrated through a directed, cyclic workflow. Sci-TQA2-Eval dynamically constructs dataset-aware evaluation specifications by combining lightweight dataset profiling, applicability-aware metric activation, and knowledge-augmented planning grounded in domain constraints and dataset-paper signals. These specifications are executed through an adaptive, tool-centric evaluation mechanism with built-in verification and self-correction, enabling scalable and reliable assessment across heterogeneous scientific data. Extensive experiments on scientific datasets spanning multiple domains demonstrate the effectiveness and generality of SciHorizon-DataEVA for principled AI-readiness evaluation.

2604.23862 2026-05-29 cs.LG cs.AI cs.CL version update

Graph Memory Transformer (GMT)

Nicola Zanarini, Niccolò Ferrari, Evelina Lamma

Affiliations: Bonfiglioli Engineering s.r.l.; Department of Engineering, University of Ferrara; NAIS s.r.l.

AI summary: Proposes replacing the feed-forward network sublayer in a decoder-only Transformer with an explicit learned memory graph while preserving the autoregressive architecture, enabling interpretable memory navigation.

Comments: 65 pages, 10 figures, 5 tables. Author list updated in arXiv metadata; no technical changes. Code available at https://github.com/Nemesis533/GMT-GraphMemoryTransformer

Abstract

We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 × 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.
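
The loss/perplexity pairs quoted above are consistent with the usual relation perplexity = exp(loss), assuming the validation loss is a mean per-token cross-entropy in nats (the abstract does not state the unit). The short check below also counts the edge-matrix entries implied by the stated configuration; it is arithmetic on the reported numbers, not a reproduction of the model.

```python
import math

def perplexity(loss_nats):
    """Perplexity from a mean per-token cross-entropy given in nats."""
    return math.exp(loss_nats)

gmt_ppl = perplexity(3.5995)     # reported GMT v7 validation loss
dense_ppl = perplexity(3.2903)   # reported dense baseline validation loss

# Each of the 16 blocks has a 128 x 128 edge matrix.
edge_entries_per_block = 128 * 128
edge_entries_total = 16 * edge_entries_per_block
```

Rounded to two decimals, the two values agree with the perplexities given in the abstract, and the edge matrices together account for only a small fraction of the 82.2M parameters.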

2604.14889 2026-05-29 cs.AI version update

SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

Zhe Qian, Nianbing Su, Zhonghua Wang, Hebei Li, Zhongxing Xu, Yueying Li, Fei Luo, Zhuohan Ouyang, Yanbiao Ma

Affiliations: South China Agricultural University; University of Glasgow; University of Electronic Science and Technology of China; Monash University; University of Science and Technology of China; National University of Defense Technology; Renmin University of China; South China Normal University

AI summary: Proposes the SVSR framework, which explicitly integrates self-verification and self-rectification into the multimodal reasoning pipeline through three-stage training (unified preference dataset construction, cold-start supervised fine-tuning, and Semi-online Direct Preference Optimization), improving robustness and reliability in complex visual understanding and multimodal reasoning.

Abstract

Current multimodal models often suffer from shallow reasoning, leading to errors caused by incomplete or inconsistent thought processes. To address this limitation, we propose Self-Verification and Self-Rectification (SVSR), a unified framework that explicitly integrates self-verification and self-rectification into the model's reasoning pipeline, substantially improving robustness and reliability in complex visual understanding and multimodal reasoning tasks. SVSR is built on a novel three-stage training paradigm. First, we construct a high-quality unified preference dataset by refining reasoning traces from pre-trained vision-language models, incorporating both forward and backward reasoning to embed self-reflective signals. Second, we perform cold-start supervised fine-tuning on this dataset to learn structured, multi-step reasoning behaviors. Third, we apply a Semi-online Direct Preference Optimization (Semi-online DPO) process, continuously augmenting the training corpus with high-quality, model-generated reasoning traces filtered by a powerful teacher VLM. This pipeline enables the model to learn, elicit, and refine its ability to self-verify and self-rectify. Extensive experiments across diverse benchmarks demonstrate that SVSR improves reasoning accuracy and enables stronger generalization to unseen tasks and question types. Notably, once trained with explicit self-reflective reasoning, the model also exhibits improved implicit reasoning ability, outperforming strong baselines even when no explicit reasoning traces are provided. These results highlight the potential of SVSR for building more dependable, introspective, and cognitively aligned multimodal systems.

2604.10219 2026-05-29 cs.AI version update

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

Zhe Qian, Yanbiao Ma, Zhuohan Ouyang, Zhonghua Wang, Zhongxing Xu, Fei Luo, Xinyu Liu, Zongyuan Ge, Yike Guo, Jungong Han

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; South China Agricultural University; South China Normal University; Monash University; Jishou University; Hong Kong University of Science and Technology; Tsinghua University

AI summary: To address the tendency of multimodal large reasoning models to hallucinate during long-chain reasoning, proposes the V-STAR training paradigm, which brings visual anchoring into the reasoning process through a Hierarchical Visual Attention Reward and a Forced Reflection Mechanism to mitigate hallucinations.

Comments: TPAMI under review

Abstract

Multimodal Large Reasoning Models (MLRMs) have made remarkable strides in visual reasoning through test-time compute scaling, yet long-chain reasoning remains prone to hallucinations. We identify a concerning phenomenon termed the Reasoning Vision Truth Disconnect (RVTD): hallucinations are strongly correlated with cognitive bifurcation points that often exhibit high-entropy states. We attribute this vulnerability to a breakdown in visual semantic anchoring, localized within the network's intermediate layers; specifically, during these high-uncertainty transitions, the model fails to query visual evidence, reverting instead to language priors. Consequently, we advocate a shift from solely outcome-level supervision to augmenting it with fine-grained internal attention guidance. To this end, we propose V-STAR (Visual Structural Training with Attention Reinforcement), a lightweight, holistic training paradigm designed to internalize visually aware reasoning capabilities. Central to our approach is the Hierarchical Visual Attention Reward (HVAR), integrated within the GRPO framework. Upon detecting high-entropy states, this mechanism dynamically incentivizes visual attention across critical intermediate layers, thereby anchoring the reasoning process back to the visual input. Furthermore, we introduce the Forced Reflection Mechanism (FRM), a trajectory-editing strategy that disrupts cognitive inertia by triggering reflection around high-entropy cognitive bifurcation points and encouraging verification of subsequent steps against the visual input, thereby translating external debiasing interventions into an intrinsic capability for hallucination mitigation.
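
The abstract above keys both of its mechanisms on detecting high-entropy decoding states. As a rough illustration only, the sketch below computes next-token entropy from raw logits and flags steps above a threshold. The function names, the toy logits, and the threshold value are assumptions made for this example and are not taken from the paper.

```python
import math

def next_token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over the logits."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_steps(step_logits, threshold):
    """Indices of decoding steps whose next-token entropy exceeds the threshold."""
    return [i for i, logits in enumerate(step_logits)
            if next_token_entropy(logits) > threshold]

# Toy trace (hypothetical): step 1 is uniform over four tokens, the others are peaked.
steps = [[10.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.0, 8.0, 0.0, 0.0]]
flagged = high_entropy_steps(steps, threshold=1.0)
print(flagged)
```

In a real system the threshold would need calibration per model and tokenizer; the value 1.0 here only separates the toy cases.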

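2604.09557 2026-05-29 cs.DC cs.AI (version update)

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Talor Abramovich, Maor Ashkenazi, Izzy Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, Yonatan Geifman

Affiliations: NVIDIA

AI summary: To address limited task diversity, insufficient support for throughput evaluation, and implementations that do not resemble production environments in speculative decoding (SD) evaluation, the paper proposes the SPEED-Bench benchmark, which includes datasets covering diverse semantic domains and realistic serving scenarios and integrates the vLLM and TensorRT-LLM engines to standardize SD evaluation and reveal system behavior.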

Comments ICML 2026; Our data is available on https://huggingface.co/datasets/nvidia/SPEED-Bench

Abstract

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
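
To make the notion of an optimal draft length concrete, here is a minimal sketch of the standard idealized analysis of speculative decoding. It assumes each drafted token is accepted independently with probability `alpha`, and that one draft-model step costs `cost_ratio` times one target-model step. Both parameters and all function names are assumptions for illustration. The model ignores batching and serving effects, which are exactly what the benchmark is designed to measure.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target-model pass with draft length gamma."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def idealized_speedup(alpha, gamma, cost_ratio):
    """Speedup over plain decoding when a draft step costs cost_ratio target steps."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1)

def best_draft_length(alpha, cost_ratio, max_gamma=16):
    """Draft length in 1..max_gamma that maximizes the idealized speedup."""
    return max(range(1, max_gamma + 1),
               key=lambda g: idealized_speedup(alpha, g, cost_ratio))

# Higher acceptance rates favor longer drafts under this toy model.
print(best_draft_length(0.5, 0.05), best_draft_length(0.9, 0.05))
```

Because acceptance rates depend on the data, even this toy model suggests that the best draft length can differ between workloads.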

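2603.27667 2026-05-29 cs.SD cs.AI (version update)

EvA: An Evidence-First Audio Understanding Paradigm for LALMs

Xinyuan Xie, Shunian Chen, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang

Affiliations: The Chinese University of Hong Kong, Shenzhen; Didi Chuxing

AI summary: Proposes EvA, a dual-path architecture that strengthens acoustic evidence preservation through hierarchical aggregation and non-compressive, time-aligned fusion. Under a unified zero-shot protocol it achieves the best open-source perception results on MMAU, MMAR, and MMSU, supporting the evidence-first hypothesis.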

Abstract

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We identify this error pattern as the evidence bottleneck: state-of-the-art systems show larger deficits in acoustic evidence extraction than in downstream reasoning, suggesting that upstream perception is often the limiting factor. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that enhances acoustic evidence preservation through hierarchical aggregation and non-compressive, time-aligned fusion. We also build EvA-Perception, a large-scale training set with about 54K event-ordered captions and 500K evidence-grounded QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception results on MMAU, MMAR, and MMSU, with the largest gains on perception-heavy splits. Human evaluation on open-ended captioning further shows improved fine-grained acoustic coverage and caption quality. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning. The project page is at https://satsuki2486441738.github.io/EvA/.

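2603.26668 2026-05-29 cs.IR cs.AI cs.CL (version update)

Bridge-RAG: An Abstract Bridge Tree Based Retrieval Augmented Generation Algorithm

Zihang Li, Wenjun Liu, Yikun Zong, Jiawen Tao, Siying Dai, Songcheng Ren, Zirui Liu, Yuhang Wang, Yanbing Jiang, Tong Yang

Affiliations: Peking University

AI summary: To address the accuracy and efficiency challenges of retrieval-augmented generation, the paper proposes the Bridge-RAG framework, which performs multi-level retrieval over an abstract bridge tree and integrates a cuckoo filter for O(1) entity lookup, making retrieval up to 1.9× faster while maintaining high accuracy.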

Abstract

As an important paradigm for enhancing the generation quality of Large Language Models (LLMs), retrieval-augmented generation (RAG) faces two challenges: retrieval accuracy and computational efficiency. This paper presents a novel RAG framework called Bridge-RAG. To overcome the accuracy challenge, we introduce the concept of an abstract to bridge query entities and document chunks, providing robust semantic understanding. We organize the abstracts into a tree structure and design a multi-level retrieval strategy to ensure the inclusion of sufficient contextual information. While this hierarchical organization substantially improves answer quality, traversing the tree to locate the abstracts that contain a query entity inevitably introduces additional retrieval overhead. To restore retrieval efficiency, we further integrate the Cuckoo Filter in CFT-RAG, which provides O(1) entity lookup and naturally fits the entity-to-abstract pathway of our framework. Extensive experiments show that Bridge-RAG achieves consistent accuracy improvements across all metrics and up to 1.9× faster retrieval compared to structured RAG baselines.
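
For readers unfamiliar with the data structure behind the O(1) entity lookup, the following is a generic, minimal cuckoo filter sketch. It is not the paper's or CFT-RAG's implementation. The bucket count, fingerprint size, hash choice, and the example entity strings are all assumptions for illustration. A lookup probes exactly two small buckets, which is where the constant-time bound comes from.

```python
import hashlib
import random

class CuckooFilter:
    """Approximate set membership; each lookup probes two fixed-size buckets."""

    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=200, seed=0):
        if num_buckets < 1 or num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two")
        self.mask = num_buckets - 1
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item):
        # 8-bit fingerprint in 1..255
        return self._hash(b"fp:" + item.encode("utf-8")) % 255 + 1

    def _index(self, item):
        return self._hash(item.encode("utf-8")) & self.mask

    def _alt_index(self, index, fp):
        # XOR with a hash of the fingerprint; applying it twice returns the original index
        return (index ^ self._hash(bytes([fp]))) & self.mask

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is too full; the last evicted fingerprint is dropped

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

entities = ["cuckoo filter", "abstract tree", "document chunk"]  # hypothetical entities
entity_filter = CuckooFilter()
inserted = [entity_filter.insert(e) for e in entities]
print(all(entity_filter.contains(e) for e in entities))
```

Like any cuckoo filter, this sketch can return false positives for items that were never inserted, so a positive answer would still need confirmation against the underlying index.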

2603.23853 2026-05-29 cs.AI cs.MA (version update)

SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems

Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian

Affiliations: University of West Florida; United States Military Academy

AI summary: Proposes the SCoOP framework, which aggregates the outputs of multiple vision-language models through uncertainty-weighted linear opinion pooling, providing training-free uncertainty quantification that effectively detects hallucinations and supports abstention on highly uncertain samples.

Comments Accepted to ICLR 2026 Workshop on Agentic AI in the Wild: From Hallucinations to Reliable Autonomy

Abstract

Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Semantic-Consistent Opinion Pooling), a training-free uncertainty quantification (UQ) framework for multi-VLM systems through uncertainty-weighted linear opinion pooling. The core idea is to treat each VLM as a probabilistic "expert," sample multiple outputs, map them to a unified space, aggregate their opinions, and produce a system-level uncertainty score. Unlike prior UQ methods designed for single models, SCoOP explicitly measures collective, system-level uncertainty across multiple VLMs, enabling effective hallucination detection and abstention for highly uncertain samples. On ScienceQA, SCoOP achieves an AUROC of 0.866 for hallucination detection, outperforming baselines (0.732-0.757) by approximately 10-13%. For abstention, it attains an AURAC of 0.907, exceeding baselines (0.818-0.840) by 7-9%. Despite these gains, SCoOP introduces only microsecond-level aggregation overhead relative to the baselines, which is trivial compared to typical VLM inference time (on the order of seconds). These results demonstrate that SCoOP provides an efficient and principled mechanism for uncertainty-aware aggregation, advancing the reliability of multimodal AI systems. Our code is publicly available at https://github.com/chungenyu6/SCoOP.
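
The pipeline described above (sample, map to a unified space, pool, score) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: answers are assumed to be already mapped to shared labels, the weight `1 / (1 + entropy)` is an assumed weighting rule, and the system-level score is taken to be the entropy of the pooled distribution.

```python
import math
from collections import Counter

def empirical_distribution(samples, labels):
    """Relative frequency of each label among one model's sampled answers."""
    counts = Counter(samples)
    return [counts[label] / len(samples) for label in labels]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

def pool_opinions(model_samples):
    """Linear opinion pool with weights that shrink as a model's own entropy grows."""
    labels = sorted({s for samples in model_samples for s in samples})
    dists = [empirical_distribution(s, labels) for s in model_samples]
    raw = [1.0 / (1.0 + entropy(d)) for d in dists]   # assumed weighting rule
    weights = [r / sum(raw) for r in raw]
    pooled = [sum(w * d[k] for w, d in zip(weights, dists))
              for k in range(len(labels))]
    return dict(zip(labels, pooled)), entropy(pooled)

# Three hypothetical models, five sampled answers each.
pooled, score = pool_opinions([
    ["A", "A", "A", "A", "A"],
    ["A", "A", "A", "B", "B"],
    ["B", "A", "A", "A", "A"],
])
print(max(pooled, key=pooled.get), round(score, 3))
```

A caller could abstain whenever `score` exceeds a chosen cutoff; picking that cutoff is outside the scope of this sketch.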

2603.23234 2026-05-29 cs.AI cs.LG (version update)

Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

Indranil Halder, Annesya Banerjee, Cengiz Pehlevan

Affiliations: John A. Paulson School of Engineering And Applied Sciences, Harvard University; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology; Speech and Hearing Bioscience and Technology, Harvard Medical School; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Center for Brain Science, Harvard University

AI summary: The study finds that adversarial prompt-injection attacks can change the attack success rate from the slow polynomial growth seen without injection to exponential growth in the number of inference-time samples, and explains the phenomenon theoretically with a spin-glass model.

Abstract

Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phenomenon further, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. We analytically show how this model naturally realizes the minimal assumptions. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., strong magnetic field, yield exponential scaling. We observe qualitatively consistent behavior across a broad range of large language models, spanning parameter scales from 3B to 70B. In particular, the main trends remain stable across multiple attack methods, such as GCG and AutoDAN, as well as across benchmark datasets such as AdvBench and HarmBench.
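
One textbook way a polynomial and an exponential regime can both arise under repeated sampling is shown below. This is a generic illustration, not necessarily the paper's set of assumptions. If every context has the same per-sample success probability `p`, the probability that all `n` samples fail is `(1 - p)**n`, which decays exponentially. If instead `p` varies across contexts as Beta(a, b), the averaged failure probability is B(a, b + n) / B(a, b), which decays like a power law `n**(-a)`. The Beta prior and the parameter values are assumptions.

```python
import math

def failure_fixed(p, n):
    """All n samples fail when every context has the same success probability p."""
    return (1.0 - p) ** n

def failure_beta(a, b, n):
    """E[(1 - p)^n] for p ~ Beta(a, b), computed as B(a, b + n) / B(a, b)."""
    log_val = (math.lgamma(a + b) + math.lgamma(b + n)
               - math.lgamma(b) - math.lgamma(a + b + n))
    return math.exp(log_val)

def log_log_slope(f, n1, n2):
    """Slope of log f(n) against log n between n1 and n2."""
    return (math.log(f(n2)) - math.log(f(n1))) / (math.log(n2) - math.log(n1))

# Power-law regime: the slope approaches -a (here a = 0.5).
slope = log_log_slope(lambda n: failure_beta(0.5, 5.0, n), 1000, 10000)
print(round(slope, 2))
```

The attack success rate after `n` samples is one minus these failure probabilities, so slow polynomial growth and fast exponential growth correspond to the two cases.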

2603.07916 2026-05-29 cs.AI cs.DB cs.LG (version update)

Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases

Jun Yin, Peng Huo, Bangguo Zhu, Hao Yan, Senzhang Wang, Shirui Pan, Chengqi Zhang

Affiliations: Department of Data Science and Artificial Intelligence; Hong Kong Polytechnic University; School of Computer Science and Engineering; Central South University; School of Information and Communication Technology; Griffith University; National Super Computing Center

AI summary: To address class imbalance in entity classification on relational databases, the paper proposes the relation-centric minority synthetic over-sampling GNN (Rel-MOSS), which improves minority-class representations with a relation-wise gating controller and a relation-guided minority synthesizer. Across 12 datasets it raises Balanced Accuracy by 2.46% and G-Mean by 4.00% on average.

Abstract

Relational deep learning (RDL) was recently proposed to enable a fully data-driven learning paradigm on relational databases (RDBs): it structures the RDB as a heterogeneous entity graph and adopts a graph neural network (GNN) as the predictive model. However, existing RDL methods neglect the imbalance problem of relational data in RDBs and risk under-representing the minority entities, leading to an unusable model in practice. In this work, we investigate, for the first time, the class imbalance problem in RDB entity classification and design the relation-centric minority synthetic over-sampling GNN (Rel-MOSS) to fill a critical void in the current literature. Specifically, to mitigate the issue of minority-related information being submerged by majority counterparts, we design the relation-wise gating controller to modulate neighborhood messages from each individual relation type. Based on the relational-gated representations, we further propose the relation-guided minority synthesizer for over-sampling, which integrates the entity relational signatures to maintain relational consistency. Extensive experiments on 12 entity classification datasets provide compelling evidence for the superiority of Rel-MOSS, yielding average improvements of up to 2.46% and 4.00% in Balanced Accuracy and G-Mean, respectively, compared with SOTA RDL methods and classic methods for handling class imbalance.
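
The two reported metrics are easy to confuse with plain accuracy, so here is a small sketch using their standard definitions: Balanced Accuracy is the arithmetic mean of per-class recall, and G-Mean is the geometric mean of per-class recall. The toy labels are made up for illustration.

```python
import math

def per_class_recall(y_true, y_pred):
    """Recall for every class that appears in y_true."""
    recalls = {}
    for cls in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls[cls] = hits / total
    return recalls

def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recall."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return sum(recalls) / len(recalls)

def g_mean(y_true, y_pred):
    """Geometric mean of per-class recall; zero if any class is never recalled."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return math.prod(recalls) ** (1.0 / len(recalls))

# Imbalanced toy data: 8 majority examples, 2 minority examples, one minority miss.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, balanced_accuracy(y_true, y_pred), round(g_mean(y_true, y_pred), 4))
```

Plain accuracy on this toy data is 0.9 even though half of the minority class is missed, which is why imbalance-aware work reports the other two metrics.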

2603.05488 2026-05-29 cs.CL cs.AI cs.LG (version update)

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo

Affiliations: Harvard University, Cambridge, MA

AI summary: Using activation probes, early forced answering, and a chain-of-thought monitor, the analysis finds performative chain-of-thought in reasoning models and uses probe-guided early exit to save computation.

Abstract

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B and GPT-OSS 120B) and finds task difficulty-specific differences: the model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
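
Probe-guided early exit can be pictured as a simple stopping rule over a stream of probe confidences. The sketch below is an assumption-laden illustration, not the authors' procedure: the confidence trace is invented, and the `threshold` and `patience` parameters are hypothetical knobs.

```python
def early_exit_step(confidences, threshold=0.95, patience=2):
    """First step at which confidence has stayed at or above the threshold
    for `patience` consecutive steps; falls back to the last step."""
    if not confidences:
        raise ValueError("confidences must be non-empty")
    run = 0
    for i, c in enumerate(confidences):
        run = run + 1 if c >= threshold else 0
        if run >= patience:
            return i
    return len(confidences) - 1

# Hypothetical probe confidence in the final answer after each reasoning step.
trace = [0.40, 0.70, 0.96, 0.97, 0.98, 0.99, 0.99, 0.99, 0.99, 0.99]
stop = early_exit_step(trace)
saved_fraction = 1 - (stop + 1) / len(trace)
print(stop, round(saved_fraction, 2))
```

Requiring several consecutive confident steps is one way to avoid exiting on a single noisy probe reading; whether that helps in practice would have to be measured.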

2603.04678 2026-05-29 cs.CL cs.AI (version update)

Post-Training Language Models for Crosslingual Consistency

Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza

Affiliations: ETH Zürich; CLCG, University of Groningen; University of Amsterdam

AI summary: To address multilingual models responding inconsistently to translation-equivalent prompts, the paper proposes an information-theoretic definition of crosslingual consistency and develops a post-training method, Direct Consistency Optimization (DCO), to improve consistency.

Comments ICML 2026. The first two authors contributed equally. Codes available at: https://github.com/Betswish/ConsistencyRL

2603.04314 2026-05-29 cs.CV cs.AI (version update)

MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification

William Grolleau, Achraf Chaouch, Astrid Sabourin, Guillaume Lapouge, Catherine Achard

Affiliations: Universite Paris-Saclay, CEA, List; Sorbonne University, CNRS, ISIR

AI summary: Proposes MOO, a large-scale synthetic multi-view observation dataset with images of 1,000 cattle from 128 uniformly sampled viewpoints, quantifies how viewpoint variation affects re-identification, and validates that synthetic geometric priors transfer to real-world settings.

Comments 6 pages, 3 figures, accepted to the CVPR 2026 Workshop on Computer Vision for Animal Behavior Tracking and Modeling (CV4Animals)

Abstract

Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings where models must match individuals across drastic elevation changes. However, existing datasets lack the precise angular annotations required to systematically analyze these geometric variations. To address this, we introduce the Multi-view Oriented Observation (MOO) dataset, a large-scale synthetic AG-ReID dataset of $1,000$ cattle individuals captured from $128$ uniformly sampled viewpoints ($128,000$ annotated images). Using this controlled dataset, we quantify the influence of elevation and identify a critical elevation threshold, above which models generalize significantly better to unseen views. Finally, we validate the transferability to real-world applications in both zero-shot and supervised settings, demonstrating performance gains across four real-world cattle datasets and confirming that synthetic geometric priors effectively bridge the domain gap. Collectively, this dataset and analysis lay the foundation for future model development in cross-view animal ReID. MOO is publicly available at https://github.com/TurtleSmoke/MOO.

2603.03805 2026-05-29 cs.LG cs.AI cs.DB (version update)

Relational In-Context Learning via Synthetic Pre-training with Structural Prior

Yanbo Wang, Jiaxuan You, Chuan Shi, Muhan Zhang

Affiliations: Institute for Artificial Intelligence, Peking University; University of Illinois at Urbana-Champaign; Institute of Computing Technology, Beijing University of Post; State Key Laboratory of General Artificial Intelligence

AI summary: Proposes RDB-PFN, the first relational foundation model trained purely on synthetic data. It uses structural causal models to generate diverse relational databases, adapts to new databases instantly through in-context learning, and outperforms existing tabular foundation models on 19 real-world relational prediction tasks.

Abstract

Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce, and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce RDB-PFN, the first relational foundation model trained purely via synthetic data. Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a Relational Prior Generator to create an infinite stream of diverse RDBs from scratch. Pre-training on over 2 million synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine in-context learning. Experiments show that RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming state-of-the-art tabular foundation models evaluated on the same DFS-linearized inputs, while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN.

2602.23258 2026-05-29 cs.AI cs.CL (version update)

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang

Affiliations: Harbin Institute of Technology, Shenzhen; Alibaba Group

AI summary: Proposes the AgentDropoutV2 framework, which at test time corrects errors with a retrieval-augmented rectifier and prunes irreparable outputs, dynamically optimizing information flow in multi-agent systems and significantly improving performance on math and code benchmarks.

Abstract

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual agents. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their adaptability. We propose AgentDropoutV2 (ADv2), a test-time rectify-or-reject pruning framework that dynamically optimizes MAS information flow. Acting as an active firewall, ADv2 intercepts agent outputs and employs a retrieval-augmented rectifier to iteratively correct errors. This rectification is guided by an indicator pool, which is constructed offline by distilling error patterns from historical MAS failure trajectories. Irreparable outputs are subsequently pruned to prevent error propagation. Empirical results demonstrate that ADv2 significantly boosts performance on both fixed and dynamic MAS frameworks, achieving average accuracy gains of 6.39 and 2.28 percentage points on extensive math and code benchmarks, respectively. Furthermore, ADv2 exhibits remarkable adaptivity, dynamically modulating rectification efforts based on task difficulty to resolve a wide spectrum of error patterns. Our code is released at https://github.com/TonySY2/AgentDropoutV2.
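
The rectify-or-reject control flow described above can be sketched as a short loop. This is only a schematic under stated assumptions: `find_errors` and `rectify` are hypothetical callables standing in for the indicator pool and the retrieval-augmented rectifier, and the toy lookup table of known fixes is invented for the example.

```python
def rectify_or_reject(output, find_errors, rectify, max_rounds=3):
    """Return a corrected output, or None if it is still faulty after max_rounds."""
    for _ in range(max_rounds):
        errors = find_errors(output)
        if not errors:
            return output                 # clean: pass it downstream
        output = rectify(output, errors)  # try to repair the flagged issues
    return output if not find_errors(output) else None  # prune if irreparable

# Toy stand-ins (hypothetical): a lookup table of known error patterns and fixes.
KNOWN_FIXES = {"2 + 2 = 5": "2 + 2 = 4"}

def find_errors(text):
    return [bad for bad in KNOWN_FIXES if bad in text]

def rectify(text, errors):
    for bad in errors:
        text = text.replace(bad, KNOWN_FIXES[bad])
    return text

def no_op_rectify(text, errors):
    return text  # simulates a rectifier that cannot repair the output

message = "Since 2 + 2 = 5, the total is 4."
kept = rectify_or_reject(message, find_errors, rectify)
rejected = rectify_or_reject(message, find_errors, no_op_rectify)
print(kept, rejected)
```

Returning `None` for a rejected message lets the caller drop it from the downstream agents' context, which is the pruning step in this simplified picture.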

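2602.20141 2026-05-29 cs.AI (version update)

GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation

Nicolas Salvy, Hugues Talbot, Bertrand Thirion

Affiliations: Inria, Palaiseau, France

AI summary: To address the hubness phenomenon in the high-dimensional embedding spaces used for generative model evaluation, the paper proposes the GICDM method (based on the Iterative Contextual Dissimilarity Measure), which corrects neighborhood estimates through a multi-scale extension, restoring reliable metrics that align with human evaluation.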

Comments Forty-third International Conference on Machine Learning, 2026

2602.12304 2026-05-29 cs.SD cs.AI cs.MM eess.AS (version update)

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu

Affiliations: The University of Hong Kong; Shanda AI Research Tokyo; XIntelligence Technology Co., Limited

AI summary: Proposes OmniCustom, a DiT-based zero-shot audio-video customization framework that uses a reference image and reference audio to synchronously generate video preserving identity and timbre, with speech content specified by text.

Comments code: https://github.com/OmniCustom-project/OmniCustom

Abstract

Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model's ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity. Project page: https://omnicustom-project.github.io/page/.

2602.11171 2026-05-29 cs.CL cs.AI (version update)

A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search

Baek Seong-Eun, Lee Jung-Mok, Kim Sung-Bin, Tae-Hyun Oh

Affiliations: Grad. School of AI, POSTECH, Pohang, Korea; School of EE, KAIST, Daejeon, Korea; School of Computing, KAIST, Daejeon, Korea

AI summary: Proposes a Bayesian optimization framework that draws on the domain knowledge of a pre-trained LLM, mapping hyperparameters into a continuous space via language prompts and using proxy evaluation on training subsets. In about 30 iterations it finds LoRA hyperparameters that improve performance by more than 20% over standard hyperparameters.

Comments Accepted at ICML 2026

Abstract

Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) offers a resource-efficient way to personalize or specialize them. However, LoRA is highly sensitive to hyperparameter choices, and exhaustive hyperparameter search is computationally expensive. To address this, we propose a Bayesian Optimization (BO) framework that leverages the domain knowledge of pre-trained LLMs to efficiently search for LoRA hyperparameters. Our approach repurposes a pre-trained LLM as a discrete-to-continuous mapping module to link hyperparameters and their domain knowledge to a continuous vector space, where BO is conducted. We design and control the mapping via language prompting, providing a domain-aware textual prompt that describes the relationships among hyperparameters and their respective roles. This allows us to explicitly inject domain knowledge about LoRA into the LLM in natural language. We also introduce an additional learnable token to capture residual information that is difficult to describe linguistically in the prompt. This helps BO sample more high-performing hyperparameters. In addition, by leveraging the strong correlation observed between the performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation using a data subset. This significantly improves the efficiency of our method. We demonstrate that our hyperparameters, discovered with only about 30 iterations, achieve more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations. Project page: https://baekseongeun.github.io/lora-bo/

2602.11065 2026-05-29 cs.CL cs.AI 版本更新

S-MARC: Causal Streaming Reasoning for Full-Duplex Conversational Behavior Modeling

S-MARC：全双工对话行为建模的因果流式推理

Dingkun Zhou, Shuchang Pan, Jiachen Lian, Siddharth Banerjee, Sarika Pasumarthy, Dhruv Hebbar, Siddhant Patel, Zeyi Austin Li, Kan Jen Cheng, Sanay Bordia, Krish Patel, Akshaj Gupta, Tingle Li, Gopala Anumanchipalli

Affiliations: University of California, Berkeley; Zhejiang University; South China University of Technology

AI summary: Proposes the S-MARC framework, which models the intent-to-action pathway with a streaming, causal hierarchy to predict high-level communicative functions and low-level interaction behaviors, and builds a high-quality corpus, enabling robust behavior detection and interpretable reasoning in full-duplex dialogue.

Abstract

Human conversation is organized by an implicit chain of thought and manifests as temporally structured conversational behaviors. Capturing this perceptual pathway is critical for building natural full-duplex interactive systems. We propose S-MARC (Streaming Causal Modeling and Reasoning for Conversation), a streaming, causal, and hierarchical framework for conversational behavior modeling and reasoning. By formalizing the intent-to-action pathway, S-MARC predicts high-level communicative functions and low-level interaction behaviors while modeling their causal and temporal dependencies. To support this setting, we construct a high-quality corpus that pairs controllable, event-rich duplex dialogue data with behavior labels. S-MARC organizes streaming predictions into a continuously evolving graph structure, generates concise justifications for its decisions, and dynamically optimizes its reasoning process. Experiments on synthetic and real duplex dialogues show that S-MARC achieves robust behavior detection, produces interpretable reasoning chains, and establishes a benchmark foundation for conversational reasoning in full-duplex spoken dialogue systems.

2602.08783 2026-05-29 cs.AI cs.CL (updated version)

Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure

Zirui Li, Xuefeng Bai, Kehai Chen, Yizhi Li, Jian Yang, Chenghua Lin, Min Zhang

Affiliations: Harbin Institute of Technology, Shenzhen, China; University of Manchester, United Kingdom; Beihang University, China

AI summary: Analyzes latent chain-of-thought through interventions in a structural causal model, revealing its causal structure, how influence propagates across steps, and how it differs from explicit chain-of-thought.

Comments Accepted to ICML 2026; 25 pages, 23 figures

Abstract

Latent or continuous chain-of-thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation-based probes. In this paper, we view latent chain-of-thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step-wise do-interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decodable early; (2) how influence propagates across steps and how this structure compares to explicit CoT; and (3) whether intermediate trajectories retain competing answer modes and how output-level commitment differs from representational commitment across steps. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing, and we identify a persistent gap between early output bias and late representational commitment. These results motivate mode-conditional and stability-aware analyses, together with corresponding training/decoding objectives, as more reliable tools for interpreting and improving latent reasoning systems. Code is available at https://github.com/J1mL1/causal-latent-cot.

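2602.02849 2026-05-29 cs.AI (updated version)

From Meta-Thought to Execution: Cognitively Aligned Post-Training for General and Reliable Large Language Model Reasoning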

Shaojie Wang, Liang Zhang

Affiliations: Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes a cognitively inspired two-stage post-training framework in which Chain-of-Meta-Thought supervised learning acquires general strategies and Confidence-Calibrated Reinforcement Learning improves execution reliability, yielding gains of 2.10% in-distribution and 3.86% out-of-distribution.

Abstract

Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively-inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and ten benchmarks show 2.10% and 3.86% improvements in-distribution and out-of-distribution respectively over standard methods, while remaining highly robust to variations in teacher model selection, optimization methods, and symbolic perturbations.

2601.19947 2026-05-29 cs.LG cs.AI cs.CV (updated version)

NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

Jiayu Xu, Junbiao Pang

Affiliations: Beijing University of Technology

AI summary: Proposes NCSAM, which uses a noise-compensated perturbation to correct the optimization bias caused by noisy labels and reduce their memorization, outperforming SAM baselines on synthetic and real-world noisy-label benchmarks.

Comments 11 pages, 1 figure, 8 tables. Major revision of v1: revised PAC-Bayesian theoretical analysis, clarified the NCSAM formulation, added appendix derivations, reorganized experiments and ablations, updated related work, citations, writing, and author list

Abstract

Learning from Noisy Labels (LNL) remains a fundamental challenge in deep learning because real-world datasets often contain corrupted annotations. Most existing methods rely on label correction or sample selection mechanisms. In contrast, we study LNL from an optimization perspective by establishing a theoretical connection between label noise and the flatness-seeking behavior of Sharpness-Aware Minimization (SAM). Based on this analysis, we propose Noise-Compensated Sharpness-Aware Minimization (NCSAM), which uses a noise-compensated perturbation to counteract the optimization bias induced by noisy labels. By correcting distorted SAM perturbations, NCSAM mitigates the memorization of noisy labels during training while preserving the simplicity of optimization-based learning. Experiments on synthetic and real-world noisy-label benchmarks show that NCSAM consistently improves over SAM-based optimization baselines and remains competitive with representative noisy-label learning methods.
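
For orientation, the plain SAM update that NCSAM builds on can be sketched as follows. This is a minimal illustration of standard SAM only; the noise-compensated perturbation is not specified in the abstract and is not reproduced here. The toy quadratic loss, the names `sam_step`, `loss`, and `grad`, and the values of `lr` and `rho` are all assumptions chosen for the example.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One plain SAM update (not NCSAM): perturb uphill, then descend."""
    g = grad_fn(w)
    # Ascent perturbation of norm rho along the normalized gradient.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # The gradient at the perturbed point drives the actual update.
    return w - lr * grad_fn(w + eps)

# Toy quadratic loss L(w) = 0.5 * w^T A w with gradient A w (illustrative only).
A = np.diag([10.0, 1.0])

def loss(w):
    return 0.5 * float(w @ A @ w)

def grad(w):
    return A @ w

w0 = np.array([1.0, 1.0])
w = w0.copy()
for _ in range(200):
    w = sam_step(w, grad)
```

Because the update uses the gradient at the perturbed point, the iterate settles into a small neighborhood of the minimum rather than converging exactly, which is the expected behavior of SAM with a fixed `rho`.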

2601.14758 2026-05-29 cs.LG cs.AI cs.CL (updated version)

Differential Encoding of Syntax and Semantics in Large Language Models

Santiago Acevedo, Alessandro Laio, Marco Baroni

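Affiliations: Catalan Institute of Research and Advanced Studies (ICREA) and Universitat Pompeu Fabra (UPF)

AI summary: By averaging the hidden representation vectors of sentences that share syntactic structure or meaning, this study finds that syntactic and semantic information is at least partially linearly encoded in the internal layer representations of large language models (with DeepSeek-V3 as the example), and that the two have different encoding profiles and can be disentangled to some extent.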

Comments Published as conference paper at ICML 2026

2512.19199 2026-05-29 cs.LG cs.AI (updated version)

On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

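Affiliations: Free University of Bozen-Bolzano; University of Catania; University of Florida

AI summary: Uses operator-theoretic techniques to establish generalization bounds for multi-task deep neural networks. By exploiting the small condition numbers of the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space, it obtains bounds tighter than traditional norm-based ones; the bounds remain valid in the single-output setting and improve on existing Koopman bounds.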

Comments Accepted at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD), Castiglione della Pescaia, Italy, September 21-24, 2025. To appear in Lecture Notes in Computer Science (LNCS), volume 16467

2512.19184 2026-05-29 cs.LG cs.AI (updated version)

Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

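Affiliations: Free University of Bozen-Bolzano; University of Catania; University of Florida

AI summary: Within an operator-theoretic framework that combines a Koopman-based approach with existing techniques, this paper derives tighter generalization bounds for vector-valued neural networks and deep kernel methods, introduces sketching techniques to reduce computational cost, and proposes a deep vector-valued reproducing kernel Hilbert space framework that uses Perron-Frobenius operators to enhance deep kernel methods, with a new Rademacher generalization bound that addresses underfitting and overfitting.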

Comments Accepted at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD), Castiglione della Pescaia, Italy, September 21-24, 2025. To appear in Lecture Notes in Computer Science (LNCS), volume 16467

DOI: 10.1007/978-3-032-21480-5_9
Journal ref: Machine Learning, Optimization, and Data Science (LOD 2025), Lecture Notes in Computer Science (LNCS), vol. 16468, Springer, 2026, pp. 120--137

Abstract

This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning through an operator-theoretic framework. Our key development lies in strategically combining a Koopman-based approach with existing techniques, achieving tighter generalization guarantees compared to traditional norm-based bounds. To mitigate computational challenges associated with Koopman-based methods, we introduce sketching techniques applicable to vector-valued neural networks. These techniques yield excess risk bounds under generic Lipschitz losses, providing performance guarantees for applications including robust and multiple quantile regression. Furthermore, we propose a novel deep learning framework, deep vector-valued reproducing kernel Hilbert spaces (vvRKHS), leveraging Perron-Frobenius (PF) operators to enhance deep kernel methods. We derive a new Rademacher generalization bound for this framework, explicitly addressing underfitting and overfitting through kernel refinement strategies. This work offers novel insights into the generalization properties of multi-task learning with deep learning architectures, an area that has been relatively unexplored until recent developments.

2512.14754 2026-05-29 cs.SE cs.AI cs.CL (updated version)

Revisiting the Reliability of Language Models in Instruction-Following

Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei, Chao Zhang, Han Qiu

Affiliations: Tsinghua University; Ant Group

AI summary: Proposes the reliable@k metric and a pipeline that automatically generates similar prompts, builds the IFEval++ benchmark, finds that current models' performance drops by up to 61.8% under subtly different prompts, and explores three ways to improve.

Comments ACL 2026 main oral

Abstract

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability: their performance can drop by up to 61.8% with nuanced prompt modifications. We further characterize this shortfall and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are available at https://github.com/jianshuod/IFEval-pp.
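
The abstract names reliable@k without defining it. One plausible reading, stated here purely as an assumption, is all-or-nothing: a prompt family counts as reliable only if the model passes the original prompt and its cousins, k prompts in total. The sketch below implements that reading; the function name `reliable_at_k`, the family ids, and the pass/fail data are hypothetical and are not taken from the IFEval++ code.

```python
from typing import Dict, List

def reliable_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of prompt families whose first k prompts are all passed.

    `results` maps a (hypothetical) family id to pass/fail outcomes for the
    original prompt followed by its cousin prompts. Families with fewer than
    k outcomes are skipped.
    """
    eligible = [r[:k] for r in results.values() if len(r) >= k]
    if not eligible:
        raise ValueError("no family has at least k outcomes")
    return sum(all(r) for r in eligible) / len(eligible)

# Made-up outcomes for three prompt families.
results = {
    "fam_a": [True, True, True],
    "fam_b": [True, False, True],
    "fam_c": [True, True, False],
}
```

Under this reading the score can only stay flat or fall as k grows, which matches the abstract's point that near-ceiling single-prompt accuracy can hide large drops across cousin prompts.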

2512.11944 2026-05-29 cs.RO cs.AI (updated version)

A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

Jia Hu, Yang Chang, Haoran Wang

Affiliations: College of Transportation Key Laboratory of Road and Traffic Engineering of the Ministry of Education; Institute for Advanced Study; Tongji University

AI summary: Systematically reviews the data-driven optimal control paradigm, which fuses the theoretical guarantees of optimal control with the adaptive capabilities of machine learning, provides a three-dimensional implementation roadmap for autonomous-driving motion planning, and identifies four future research directions.

Comments 44 pages, 14 figures

Abstract

Motion planning for autonomous driving (AD) faces a critical trade-off. While traditional rule-based pipelines offer verifiable safety and interpretability, they often fail to generalize in complex scenarios. Conversely, emerging learning-based methods, including imitation learning (IL), reinforcement learning (RL), and generative AI, offer greater adaptability but are often constrained by opacity and safety risks. Existing surveys typically analyze these AI methods in isolation, overlooking the potential of integrating them with rigorous control frameworks. To bridge this gap, this paper presents the first systematic review of the Data-Driven Optimal Control (DDOC) paradigm, explicitly examining how it synergizes the theoretical guarantees of optimal control with the adaptive capabilities of modern machine learning. Building on this framework, we propose the first roadmap for DDOC-based motion planning, structuring its implementation into three critical dimensions: customization, dynamics adaptation, and self-tuning. Finally, to close the remaining reality gap, we identify four future research directions, thereby accelerating the transition to trustworthy and human-like autonomous driving.

2512.04733 2026-05-29 cs.CV cs.AI (updated version)

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu

Affiliations: McGill University; University of Macau; The Hong Kong Polytechnic University; Massachusetts Institute of Technology; University of Washington

AI summary: Proposes the E3AD framework, which integrates emotion understanding into a vision-language-action model through a continuous VAD emotion model and a dual-pathway spatial reasoning module, enabling emotion-aware trajectory planning for open-domain end-to-end autonomous driving and reaching state-of-the-art performance on real-world datasets.

Abstract

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These evaluation results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and feedback.

2512.03109 2026-05-29 cs.LG cs.AI stat.AP stat.ML (updated version)

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Shuvom Sadhuka, Drew Prinster, Clara Fannjiang, Gabriele Scalia, Bonnie Berger, Aviv Regev, Hanchen Wang

Affiliations: Genentech; MIT; Johns Hopkins; Stanford

AI summary: Proposes e-valuator, a method that converts any black-box verifier score into a decision rule with controlled false alarm rates, using sequential hypothesis testing to monitor agent trajectories online, improving statistical power and saving tokens.

Abstract

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user's prompt) and unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
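
The generic mechanism behind an e-process test can be sketched in a few lines. If each step supplies a nonnegative e-value whose conditional expectation is at most 1 under the null (assumed here to be "the trajectory is successful"), the running product is a nonnegative supermartingale, and by Ville's inequality the probability that it ever reaches 1/alpha is at most alpha. The sketch below is not the authors' implementation: how e-valuator turns verifier scores into e-values is not described in the abstract, so the per-step e-values are hypothetical inputs, as are the names `sequential_e_test` and `example`.

```python
import math
from typing import Iterable, Optional

def sequential_e_test(e_values: Iterable[float], alpha: float = 0.05) -> Optional[int]:
    """Return the first step where the running product of e-values reaches
    1/alpha (raise an alarm), or None if that never happens."""
    threshold = math.log(1.0 / alpha)
    log_wealth = 0.0  # log of the e-process, which starts at 1
    for step, e in enumerate(e_values, start=1):
        if e < 0.0:
            raise ValueError("e-values must be nonnegative")
        if e == 0.0:
            return None  # the product is stuck at zero from here on
        log_wealth += math.log(e)
        if log_wealth >= threshold:
            return step
    return None

# Hypothetical per-step e-values, e.g. likelihood ratios derived from verifier scores.
example = [1.2, 0.8, 3.0, 4.0, 2.5]
alarm_step = sequential_e_test(example, alpha=0.05)
```

Working in log space avoids overflow on long trajectories, and because the guarantee holds at every step simultaneously, the monitor may stop a trajectory the moment the threshold is crossed.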

2511.19316 2026-05-29 cs.CV cs.AI (updated version)

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

Xincheng Wang, Hanchi Sun, Wenjun Sun, Kejun Xue, Wangqiu Zhou, Jianbo Zhang, Wei Sun, Dandan Zhu, Xiongkuo Min, Jun Jia, Zhijun Fang

Affiliations: Donghua University; Shanghai Jiao Tong University; Xidian University; Hefei University of Technology; East China Normal University

AI summary: To address copyright and security risks in diffusion-model fine-tuning, this paper establishes a unified threat model and an evaluation framework covering universality, transmissibility, and robustness, exposes the fragility of existing dataset watermarking methods, and proposes a practical watermark removal method.

Abstract

Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.

2511.08548 2026-05-29 cs.AI (updated version)

A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models

Shubhra Mishra, Yuka Machino, Gabriel Poesia, Albert Jiang, Joy Hsu, Adrian Weller, Challenger Mishra, David Broman, Joshua B. Tenenbaum, Mateja Jamnik, Cedegao E. Zhang, Katherine M. Collins

Affiliations: KTH Royal Institute of Technology; Stanford University; Massachusetts Institute of Technology; University of Cambridge; Kempner Institute at Harvard University; Mistral AI

AI summary: Compares interestingness ratings of math problems from large language models with those of people from different mathematical backgrounds to study how well LLM judgments of interest agree with human ones, and evaluates LLMs' ability to generate interesting problems.

Comments Published at the Math-AI Workshop, NeurIPS 2025

Abstract

The evolution of mathematics is shaped importantly by interestingness: researchers choose which problems to pursue, and students choose which problems to engage with, based on expectations of interest and challenge. As AI systems, particularly large language models (LLMs) that operate flexibly over natural language and formal mathematics, are increasingly used in mathematics research and education, it becomes crucial to characterize how closely their judgments align with people from different mathematical backgrounds. We study whether LLMs align with human interestingness judgments by comparing LLM ratings with those of two populations, crowdsourced participants with college math experience and International Math Olympiad competitors. Although many LLMs broadly agree with human notions of interestingness, they largely fail to match the distribution of human judgments. They also weakly align with why humans find problems interesting, with low correlation to human-selected rationales. Finally, we evaluate LLMs' ability to generate interesting problems and find that, after filtering for validity, LLMs are able to generate engaging problems. We conclude with takeaways, including the need for multi-LLM human-AI collaborative systems, that highlight both the promise and current limits of LLMs as partners in mathematical reasoning.

2511.04758 2026-05-29 cs.RO cs.AI cs.MA (updated version)

ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling

Caelan Garrett, Fabio Ramos

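Affiliations: NVIDIA Research Seattle Robotics Lab (SRL); University of Sydney

AI summary: Proposes ScheduleStream, the first general-purpose framework that combines hybrid durative actions and domain-independent algorithms with GPU-accelerated samplers to produce task and motion plans and schedules with parallel multi-arm motion.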

Comments Project website: https://schedulestream.github.io

Journal ref: 2026 IEEE International Conference on Robotics and Automation (ICRA)

Abstract

Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans in which only one arm moves at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that's a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.

The Impact of Semantic Pairs on Self-Supervised Representation Learning

Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong

AI summary: Through controlled experiments, studies how semantic positive pairs (different instances of the same class) compare with augmented positive pairs in self-supervised learning, finding that semantic pairs improve generalization and that contrastive learning benefits the most.

Comments 19 pages, 7 figures, 5 tables

Abstract

Instance discrimination learns visual representations by treating different augmented views of the same image as positive pairs. While this encourages invariance to handcrafted transformations, same-image positives can preserve nuisance correlations such as background, texture, illumination, and object-specific details. Semantic positive pairs, i.e., different same-class instances, may reduce these correlations by presenting objects across diverse contexts. However, previous studies often combine semantic pairs with augmented positives or false neighbors (i.e., incorrectly mapped semantic pairs), making it difficult to isolate the effect of semantic pairing. We present a controlled empirical study of semantic positive pairs for self-supervised representation learning. From ImageNet-1K, we construct two matched subsets: an augmented-pair baseline and a manually curated semantic-pair dataset with the same class composition and training-pair count. We use these datasets to compare representative contrastive and non-contrastive SSL methods under matched training conditions. Across transfer learning and object detection evaluations, semantic-pair pretraining consistently improves generalisation over augmented-pair pretraining. Additional ablations show that semantic pairs induce invariances beyond the standard transformation pipeline. Among the evaluated methods, contrastive learning benefits most strongly from semantic pairs, with SimCLR showing the largest relative improvement. These results clarify the role of semantic positive pairs in SSL and provide guidance for selecting and designing frameworks that can exploit semantic pair information effectively.

2510.04704 2026-05-29 cond-mat.mtrl-sci cs.AI cs.CL (updated version)

PuzzleClone: A DSL-Based Framework for Verifiable Data Synthesis

Kai Xiong, Yanwei Huang, Rongjunchen Zhang, Kun Chen, Haipang Wu, Yingcai Wu

Affiliations: HiThink Research; HKUST; Zhejiang University

AI summary: Proposes the PuzzleClone framework, which uses a DSL-driven approach to synthesize large-scale, reliable, and diverse verifiable math and logic datasets, and builds the PC-83K benchmark; experiments show that post-training significantly improves LLM performance on logic and math tasks.

Abstract

High-quality mathematical and logical datasets with verifiable answers are essential for strengthening the reasoning capabilities of large language models (LLMs). While recent data augmentation techniques have facilitated the creation of large-scale benchmarks, existing LLM-generated datasets often suffer from limited reliability, diversity, and scalability. To address these challenges, we introduce PuzzleClone, a formal framework for synthesizing verifiable data at scale using a novel DSL-driven approach. Our approach features three key innovations: (1) encoding seed puzzles into structured logical specifications, (2) generating scalable variants through systematic variable and constraint randomization, and (3) ensuring validity via a reproduction mechanism. Applying PuzzleClone, we construct PC-83K, a benchmark comprising over 83K diverse and programmatically validated puzzles. The generated puzzles span a wide spectrum of difficulty and formats, posing significant challenges to current state-of-the-art models. Experimental results show that post-training (SFT and RL) on PC-83K yields substantial improvements not only on the test set but also on various logic and mathematical benchmarks. Post-training raises average performance on PC-83K from 14.5 to 66.0 and delivers consistent improvements across 7 logic and mathematical benchmarks, by up to 18.4 absolute percentage points (SATBench from 51.6 to 70.0). Our code and data are available at https://github.com/HiThink-Research/PuzzleClone.

2508.03253 2026-05-29 cs.GT cs.AI cs.MA Version update

Approximate Proportionality in Online Fair Division

在线公平分配中的近似比例性

Davin Choo, Winston Fu, Derek Khu, Tzeh Yuan Neoh, Tze-Yang Poon, Nicholas Teh

Affiliations: Harvard University, USA; University of Oxford, UK; Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; Princeton University, New Jersey, USA; Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore

AI summary: Studies the approximability of proportionality up to one good (PROP1) in online fair division and, under two relaxations (a non-adaptive adversary and maximum-item-value predictions), designs online algorithms with robust guarantees.

Comments Appears in the 43rd International Conference on Machine Learning (ICML), 2026

AI Chinese abstract

我们研究在线公平分配问题，其中不可分割的商品按顺序到达，必须立即且不可撤销地分配。先前的工作为近似经典概念（如至多一个商品的嫉妒无妒（EF1）和最大最小份额（MMS））建立了强不可能性结果，但至多一个商品的比例性（PROP1）的可近似性仍未解决。我们分两步解决这一差距。首先，我们展示了三种自然的贪婪分配规则（公平分配中的标准基线）无法保证对自适应对手的任何乘法近似到PROP1。这些局限性激发了两种松弛：（i）将注意力限制在非自适应对手上，以及（ii）在学习增强算法的精神下纳入粗略预测。在非自适应对手下，我们展示了均匀随机分配以高概率实现了有意义的PROP1近似，并且这一保证对于这种方法本质上是紧的；此外，当物品值足够小时，分配以高概率接近PROP1。最后，给定最大物品值（MIV）预测，我们设计了一种在线算法，该算法实现了PROP1的鲁棒近似保证，并在单边预测误差下优雅地退化。相比之下，我们展示了即使有完美的MIV预测，EF1、MMS和PROPX仍然不可近似。

English abstract

We study the online fair division problem, where indivisible goods arrive sequentially and must be allocated immediately and irrevocably. Prior work establishes strong impossibility results for approximating classic notions such as envy-freeness up to one good (EF1) and maximin share (MMS) in this setting, but the approximability of proportionality up to one good (PROP1) has remained unresolved. We resolve this gap in two steps. First, we show that three natural greedy allocation rules (standard baselines in fair division) fail to guarantee any multiplicative approximation to PROP1 against an adaptive adversary. These limitations motivate two relaxations: (i) restricting attention to a non-adaptive adversary, and (ii) incorporating coarse predictions in the spirit of learning-augmented algorithms. Under a non-adaptive adversary, we show that the uniform random allocation achieves a meaningful PROP1 approximation with high probability, and this guarantee is essentially tight for this approach; moreover, when item values are sufficiently small, the allocation is near-PROP1 with high probability. Finally, given maximum item value (MIV) predictions, we design an online algorithm that achieves robust approximation guarantees for PROP1, and degrades gracefully under one-sided prediction error. In contrast, we show that EF1, MMS, and PROPX remain inapproximable even with perfect MIV predictions.
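The uniform random allocation and the PROP1 condition mentioned in the abstract can be illustrated with a minimal sketch. It assumes additive valuations and the standard PROP1 definition (an agent's bundle value plus their most valued good outside the bundle is at least their proportional share); the function names and the ratio used to express an approximation factor are illustrative assumptions, not the paper's notation or algorithm.

```python
import random


def uniform_random_allocation(values, rng):
    """Assign each arriving good to an agent chosen uniformly at random.

    values[i][g] is agent i's (additive) value for good g; goods arrive in
    index order and each assignment is final.
    """
    n_agents = len(values)
    bundles = [[] for _ in range(n_agents)]
    for good in range(len(values[0])):
        bundles[rng.randrange(n_agents)].append(good)
    return bundles


def prop1_ratio(values, bundles):
    """Worst-case ratio (own value + best outside good) / proportional share.

    A ratio of at least 1 means every agent is satisfied up to one good.
    Agents whose total value is zero are skipped; returns inf if all are.
    """
    n_agents = len(values)
    worst = float("inf")
    for i, bundle in enumerate(bundles):
        share = sum(values[i]) / n_agents
        if share == 0:
            continue
        own = sum(values[i][g] for g in bundle)
        held = set(bundle)
        outside = [v for g, v in enumerate(values[i]) if g not in held]
        worst = min(worst, (own + max(outside, default=0.0)) / share)
    return worst


# Two agents, four identical goods (hypothetical instance).
values = [[1, 1, 1, 1], [1, 1, 1, 1]]
bundles = uniform_random_allocation(values, random.Random(0))
print(bundles, prop1_ratio(values, bundles))
```

With identical goods, an even split gives a ratio above 1, while handing every good to one agent leaves the other at half of their share even after adding one outside good.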

2507.16880 2026-05-29 cs.CV cs.AI cs.LG Version update

Finding DoRI: Discovery of Retained Images in Diffusion Models

Finding DoRI: 扩散模型中保留图像的发现

Antoni Kowalczuk, Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch

Affiliations: CISPA Helmholtz Center for Information Security; German Research Center for Artificial Intelligence (DFKI); Technical University of Darmstadt; Hessian Center for AI (Hessian.AI); Centre for Cognitive Science, Technical University of Darmstadt

AI summary: Challenges the assumption that memorization can be localized, shows that small perturbations to text embeddings can re-trigger data replication, presents evidence that memorization is not inherently local, and proposes adversarial fine-tuning for more robust mitigation.

Comments Published at ICML 2026

AI Chinese abstract

文本到图像扩散模型（DMs）在图像生成方面取得了显著成功。然而，由于它们可能无意中记忆并复制训练数据，数据隐私和知识产权问题仍然存在。最近的缓解工作集中在识别和剪枝负责触发逐字训练数据复制的权重，基于记忆可以被局部化的假设。我们挑战这一假设，并证明即使经过这样的剪枝，对先前缓解的提示的文本嵌入进行微小扰动可以重新触发数据复制，揭示了此类方法的脆弱性。我们的进一步分析提供了多个迹象表明记忆确实本质上不是局部的：（1）记忆图像的复制触发因素分布在文本嵌入空间中；（2）产生相同复制图像的嵌入会产生不同的模型激活；（3）不同的剪枝方法对同一图像识别出不一致的记忆相关权重集。最后，我们表明绕过局部性假设可以通过对抗微调实现更鲁棒的缓解。这些发现为文本到图像DMs中记忆的基本性质提供了新见解，并为未来开发更可靠的对抗DM记忆的缓解方法提供了信息。

English abstract

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such methods. Our further analysis then provides multiple indications that memorization is indeed \textit{not} inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the fundamental nature of memorization in text-to-image DMs and inform the future development of more reliable mitigation methods against DM memorization.

2507.06092 2026-05-29 cs.CR cs.AI cs.LG Version update

Taming Data Challenges in ML-based Security Tasks Using Generative AI

驯服基于ML的安全任务中的数据挑战：使用生成式AI

Shravya Kanchi, Neal Mangaokar, Aravind Cheruvu, Sifat Muhammad Abdullah, Shirin Nilizadeh, Atul Prakash, Bimal Viswanath

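Affiliations: University of Michigan, Ann Arbor; University of Texas at Arlington

AI summary: Proposes augmenting training sets with synthetic data produced by generative AI (GenAI) to improve the generalization of ML-based security classifiers, with improvements of up to 32.6% across 7 security tasks.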

Comments Accepted at the 2026 ACM Asia Conference on Computer and Communications Security (AsiaCCS 2026)

DOI: 10.1145/3779208.3785264
Journal ref: In Proc. ACM AsiaCCS 2026, Bangalore, India, June 1-5, 2026. ACM, 2026

AI Chinese abstract

基于机器学习的监督分类器广泛用于安全任务，其改进主要集中在算法进步上。我们认为，对分类器性能产生负面影响的数据挑战受到的关注有限。我们解决以下研究问题：生成式AI（GenAI）的发展能否应对这些数据挑战并提高分类器性能？我们提出使用GenAI技术生成的合成数据增强训练数据集，以改善分类器的泛化能力。我们使用6种最先进的GenAI方法在7个不同的安全任务上评估了这种方法，并引入了一种名为Nimai的新型GenAI方案，该方案能够实现高度可控的数据合成。我们发现，GenAI技术可以显著提高安全分类器的性能，即使在数据严重受限的情况下（仅约180个训练样本），也能实现高达32.6%的提升。此外，我们证明GenAI可以促进部署后对概念漂移的快速适应，在调整过程中只需最少的标注。尽管取得了成功，但我们的研究发现，一些GenAI方案在某些安全任务上难以初始化（训练和生成数据）。我们还识别了特定任务的特征，如噪声标签、重叠的类别分布和稀疏特征向量，这些特征阻碍了使用GenAI提升性能。我们相信，我们的研究将推动未来针对安全任务的GenAI工具的开发。

English abstract

Machine learning-based supervised classifiers are widely used for security tasks, and their improvement has been largely focused on algorithmic advancements. We argue that data challenges that negatively impact the performance of these classifiers have received limited attention. We address the following research question: Can developments in Generative AI (GenAI) address these data challenges and improve classifier performance? We propose augmenting training datasets with synthetic data generated using GenAI techniques to improve classifier generalization. We evaluate this approach across 7 diverse security tasks using 6 state-of-the-art GenAI methods and introduce a novel GenAI scheme called Nimai that enables highly controlled data synthesis. We find that GenAI techniques can significantly improve the performance of security classifiers, achieving improvements of up to 32.6% even in severely data-constrained settings (only ~180 training samples). Furthermore, we demonstrate that GenAI can facilitate rapid adaptation to concept drift post-deployment, requiring minimal labeling in the adjustment process. Despite successes, our study finds that some GenAI schemes struggle to initialize (train and produce data) on certain security tasks. We also identify characteristics of specific tasks, such as noisy labels, overlapping class distributions, and sparse feature vectors, which hinder performance boost using GenAI. We believe that our study will drive the development of future GenAI tools designed for security tasks.

2507.00037 2026-05-29 cs.LG cs.AI Version update

Model Fusion via Retrofitting

通过回溯改造的模型融合

Phoomraphee Luenam, Andreas Spanopoulos, Amit Sant, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh

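Affiliations: ETH Zürich

AI summary: Proposes a neuron-centric family of fusion algorithms that groups intermediate neurons of the parent models into target representations and trains the fused model's sub-networks to approximate them, uses neuron attribution scores to bias alignment toward salient features, and applies to any architecture that can be modularized as a DAG of levels; gains are largest in zero-shot and non-IID settings.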

Comments 5 figures, 15 tables, 23 pages

AI Chinese abstract

模型融合旨在将独立训练的神经网络组合成一个单一模型而无需重新训练，但由于排列不变性、随机初始化和异构训练数据导致的表示差异，这一过程变得复杂。现有方法在非独立同分布数据分布下的零样本设置中尤其困难，并且通常局限于特定架构或成对融合。我们引入了一类以神经元为中心的融合算法，将融合视为一个原则性的表示匹配问题：父模型中的中间神经元被分组为目标表示，然后训练融合模型的相应子网络来逼近这些表示。与先前工作不同，我们的方法结合了神经元归因分数以偏向于显著特征的对齐，并且可以应用于任何可模块化为有向无环图层次的架构——在VGG、ResNet和ViT上进行了实证验证。在标准基准上的实验显示，与现有融合方法相比，我们的方法取得了一致的改进，在零样本和非独立同分布场景中增益最大。代码可在https://github.com/AndrewSpano/model-fusion-via-retrofitting获取。

English abstract

Model fusion seeks to combine independently trained neural networks into a single model without retraining, but is complicated by representational divergence arising from permutation invariance, random initialization, and heterogeneous training data. Existing methods struggle particularly in zero-shot settings under non-IID data distributions, and are often limited to specific architectures or pairwise fusion. We introduce a neuron-centric family of fusion algorithms that frames fusion as a principled representation-matching problem: intermediate neurons across parent models are grouped into target representations, which the fused model's corresponding sub-networks are then trained to approximate. Unlike prior work, our approach incorporates neuron attribution scores to bias alignment toward salient features, and can be applied to any architecture modularizable as a DAG of levels -- empirically validated on VGGs, ResNets, and ViTs. Experiments across standard benchmarks show consistent improvements over existing fusion methods, with the largest gains in zero-shot and non-IID scenarios. Code is available at https://github.com/AndrewSpano/model-fusion-via-retrofitting.

2505.21627 2026-05-29 cs.GT cs.AI cs.CY cs.LG Version update

Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

你的大语言模型是否在过度收费？分词、透明度与激励

Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati, Manuel Gomez-Rodriguez

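AI summary: Shows that under the current pay-per-token pricing mechanism providers have a financial incentive to overcharge by strategically misreporting token counts, and proposes an incentive-compatible mechanism that prices tokens linearly in their character count to remove that incentive.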

Comments Selected as an oral presentation at ICML 2026

AI Chinese abstract

最先进的大语言模型需要专门的硬件和大量能源来运行。因此，提供大语言模型访问的基于云的服务变得非常流行。在这些服务中，用户为模型生成的输出支付的价格取决于模型用于生成该输出的token数量：他们为每个token支付固定价格。在这项工作中，我们表明这种定价机制为提供商创造了财务激励，使其策略性地虚报模型用于生成输出的token数量，而用户无法证明甚至不知道提供商是否在过度收费。然而，我们也表明，如果不诚实的提供商被强制要求透明地说明模型使用的生成过程，那么在不引起怀疑的情况下最优地虚报是困难的。尽管如此，作为概念验证，我们开发了一种高效的启发式算法，使提供商能够在不引起怀疑的情况下显著过度收费用户。关键的是，我们证明运行该算法的成本低于从过度收费用户中获得的额外收入，突显了当前按token计费机制下用户的脆弱性。此外，我们表明，为了消除策略性行为的财务激励，定价机制必须根据token的字符数线性定价。虽然这会使提供商的利润率因token而异，但我们引入了一个简单的方案，采用这种激励相容定价机制的提供商可以维持他们在按token计费机制下的平均利润率。在此过程中，为了说明和补充我们的理论结果，我们使用来自$\texttt{Llama}$、$\texttt{Gemma}$和$\texttt{Ministral}$系列的几个大语言模型以及来自LMSYS Chatbot Arena平台的输入提示进行了实验。

English abstract

State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it: they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, we demonstrate that the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly on their character count. While this makes a provider's profit margin vary across tokens, we introduce a simple prescription under which the provider who adopts such an incentive-compatible pricing mechanism can maintain the average profit margin they had under the pay-per-token pricing mechanism. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from the LMSYS Chatbot Arena platform.
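The contrast between pay-per-token pricing and pricing that is linear in character count can be seen in a small sketch. The two tokenizations and the integer prices below are made-up illustrations, not data or code from the paper; the point is only that the same output string can be reported as different token sequences, which changes a per-token bill but not a per-character one.

```python
def pay_per_token(tokens, price_per_token):
    """Charge a fixed price for every reported token."""
    return price_per_token * len(tokens)


def pay_per_character(tokens, price_per_char):
    """Charge each token linearly in its character count."""
    return price_per_char * sum(len(tok) for tok in tokens)


# Two hypothetical tokenizations of the same output string.
honest = ["hello", " world"]
inflated = ["he", "llo", " wor", "ld"]
assert "".join(honest) == "".join(inflated)

# Prices are integers in an arbitrary unit to keep the arithmetic exact.
print(pay_per_token(honest, 2), pay_per_token(inflated, 2))
print(pay_per_character(honest, 1), pay_per_character(inflated, 1))
```

Because the per-character total depends only on the length of the output string, splitting it into more tokens cannot raise the charge, which is the intuition behind the incentive-compatibility claim.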

2502.20838 2026-05-29 cs.SD cs.AI cs.LG eess.AS Version update

Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data

弱监督检测与长时间生物声学数据中鲸叫声的时间定位

Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Takeshi Ashizawa, Kazuhiro Nakadai

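Affiliations: Systems and Control Engineering, School of Engineering, Institute of Science Tokyo, Japan

AI summary: Proposes DSMIL-LocNet, a weakly supervised multiple instance learning framework that classifies and temporally localizes whale calls using only recording-level labels, and outperforms fully supervised CNN baselines on long recordings.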

Comments Accepted in European Signal Processing Conference (EUSIPCO) 2026

AI Chinese abstract

被动声学监测（PAM）系统生成持续数月连续录音，但自动化生物声学分析鲸叫声需要两种独立的标注工作：用于分类的二元存在标签和用于定位的精确时间边界。一个多分钟录音的二元标签可以在几秒钟内分配，但对其中的每个叫声打时间戳需要数小时的专家努力。在操作规模上同时提供两者是不可行的。我们提出DSMIL-LocNet，一个弱监督多实例学习（MIL）框架，仅使用录音级存在/缺失标签执行分类和时间定位。我们的双流架构整合频谱和时间特征，处理2-30分钟的录音，而无需现有CNN方法在长输入上退化的时间压缩。在AcousticTrends BlueFinLibrary上，DSMIL-LocNet在300-1800秒录音上达到F1分数0.88-0.91，而全监督CNN基线退化为0.19-0.64。它还提供这些基线在没有帧级标注的情况下无法产生的时间定位。代码：https://github.com/Ragib-Amin-Nihal/DSMIL-LocNet

English abstract

Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls requires two separate annotation efforts: binary presence labels for classification and precise temporal boundaries for localization. A binary label for a multi-minute recording can be assigned in seconds, but timestamping every call within it requires hours of expert effort. Providing both is infeasible at operational scale. We present DSMIL-LocNet, a weakly supervised multiple instance learning (MIL) framework that performs both classification and temporal localization using only recording-level presence/absence labels. Our dual-stream architecture integrates spectral and temporal features to process recordings of 2--30 minutes without the temporal compression that degrades existing CNN methods on long inputs. On the AcousticTrends BlueFinLibrary, DSMIL-LocNet achieves F1 scores of 0.88--0.91 on recordings of 300--1800s, where fully supervised CNN baselines degrade to 0.19--0.64. It also provides temporal localization that these baselines cannot produce without frame-level annotation. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc

2501.12374 2026-05-29 cs.HC cs.AI cs.CY Version update

Expertise elevates AI usage: experimental evidence comparing laypeople and professional artists

专业知识提升AI使用：比较普通人和专业艺术家的实验证据

Thomas F. Eisenmann, Andres Karjus, Mar Canet Sola, Levin Brinkmann, Bramantyo Ibrahim Supriyatno, Iyad Rahwan

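Affiliations: Center for Humans and Machines, Max Planck Institute for Human Development; Tallinn University; Estonian Business School; Academy of Media Arts Cologne

AI summary: Experimentally compares 50 professional artists with a demographically matched sample of laypeople using generative AI for image replication and creative image generation; finds that artists' expertise transfers to AI use, with more accurate copies and more divergent ideas than laypeople, while GPT-4o scores slightly better on average than artists in the creative task but never above the best humans.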

Comments Eisenmann and Karjus contributed equally to this work and share first authorship

DOI: 10.1080/10447318.2026.2669041
Journal ref: International Journal of Human-Computer Interaction, 2026, pp 1-22

AI Chinese abstract

生成式AI的新能力引发了关于人类专业知识未来角色的疑问：AI是否拉平了专业艺术家和普通人之间的差距，还是专业知识增强了AI的使用？专家在分析和绘制视觉艺术时使用的认知技能是否也转移到使用这些新工具上？这项预先注册的研究对50位专业艺术家和人口统计学匹配的普通人样本进行了实验比较。我们的跨学科团队开发了两项任务，涉及图像复制和创意图像生成，评估了他们的复制准确性和发散思维。我们为实验实施了一个定制平台，由现代文本到图像AI驱动。结果显示，艺术家比普通参与者产生了更准确的复制和更多发散的想法，突显了专业知识的技能转移——即使是在生成式AI的有限空间内。我们还探索了一个典型的视觉能力大型语言模型（GPT-4o）的表现：在复制任务上与艺术家相当，在创意任务上平均略优于艺术家，但从未超越最佳人类。这些发现强调了将艺术技能与AI整合的重要性，表明协作协同的潜力可能重塑创意产业和艺术教育。

English abstract

Generative AI's novel capacities raise questions about the future role of human expertise: does AI level the playing field between professional artists and laypeople, or does expertise enhance AI use? Do the cognitive skills experts make use of in analyzing and drawing visual art also transfer to using these new tools? This pre-registered study conducts experimental comparisons between 50 professional artists and a demographically matched sample of laypeople. Our interdisciplinary team developed two tasks involving image replication and creative image creation, assessing their copying accuracy and divergent thinking. We implemented a bespoke platform for the experiment, powered by a modern text-to-image AI. Results reveal artists produced more accurate copies and more divergent ideas than lay participants, highlighting a skill transfer of professional expertise - even to the confined space of generative AI. We also explored how well an exemplary vision-capable large language model (GPT-4o) would fare: on par in copying and slightly better on average than artists in the creative task, although never above best humans. These findings highlight the importance of integrating artistic skills with AI, suggesting a potential for collaborative synergy that could reshape creative industries and arts education.

2501.10332 2026-05-29 cs.CY cs.AI Version update

Agent4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems

Agent4Edu：通过生成式智能体为智能教育系统生成学习者响应数据

Weibo Gao, Qi Liu, Linan Yue, Fangzhou Yao, Rui Lv, Zheng Zhang, Hao Wang, Zhenya Huang

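Affiliations: State Key Laboratory of Cognitive Intelligence; University of Science and Technology of China; Institute of Artificial Intelligence; Hefei Comprehensive National Science Center

AI summary: Proposes Agent4Edu, a personalized learning simulator that uses LLM-powered generative agents to simulate learner behavior, addressing the gap between offline metrics and online performance in intelligent education systems and supporting the evaluation and improvement of personalized learning algorithms.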

Comments Accepted by AAAI2025

AI Chinese abstract

个性化学习是智能教育系统中一种有前景的教育策略，旨在提高学习者的练习效率。然而，离线指标与在线性能之间的差异严重阻碍了其进展。为了解决这一挑战，我们引入了Agent4Edu，一种新颖的个性化学习模拟器，通过大语言模型（LLMs）利用人类智能的最新进展。Agent4Edu采用基于LLM的生成式智能体，配备针对个性化学习算法定制的学习者档案、记忆和行动模块。学习者档案使用真实世界的响应数据初始化，捕捉练习风格和认知因素。受人类心理学理论启发，记忆模块记录练习事实和高层摘要，并集成反思机制。行动模块支持多种行为，包括练习理解、分析和响应生成。每个智能体可以与个性化学习算法（如计算机自适应测试）交互，实现对定制服务的多方面评估和增强。通过全面评估，我们探讨了Agent4Edu的优势和不足，强调了智能体与人类学习者之间响应的一致性和差异。代码、数据和附录可在https://github.com/bigdata-ustc/Agent4Edu公开获取。

English abstract

Personalized learning represents a promising educational strategy within intelligent educational systems, aiming to enhance learners' practice efficiency. However, the discrepancy between offline metrics and online performance significantly impedes their progress. To address this challenge, we introduce Agent4Edu, a novel personalized learning simulator leveraging recent advancements in human intelligence through large language models (LLMs). Agent4Edu features LLM-powered generative agents equipped with learner profile, memory, and action modules tailored to personalized learning algorithms. The learner profiles are initialized using real-world response data, capturing practice styles and cognitive factors. Inspired by human psychology theory, the memory module records practice facts and high-level summaries, integrating reflection mechanisms. The action module supports various behaviors, including exercise understanding, analysis, and response generation. Each agent can interact with personalized learning algorithms, such as computerized adaptive testing, enabling a multifaceted evaluation and enhancement of customized services. Through a comprehensive assessment, we explore the strengths and weaknesses of Agent4Edu, emphasizing the consistency and discrepancies in responses between agents and human learners. The code, data, and appendix are publicly available at https://github.com/bigdata-ustc/Agent4Edu.

2410.23222 2026-05-29 cs.LG cs.AI stat.ML Version update

Dataset-Driven Channel Masks in Transformers for Multivariate Time Series

数据集驱动的Transformer通道掩码用于多变量时间序列

Seunghan Lee, Taeyoung Park, Kibok Lee

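Affiliations: Department of Statistics and Data Science, Yonsei University; LG AI Research

AI summary: Introduces the concept of partial channel dependence (PCD), which improves channel-dependency modeling in Transformers through dataset-specific channel masks (CMs), and validates it across diverse tasks and datasets.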

Comments ICASSP 2026. Preliminary version: NeurIPS Workshop on Time Series in the Age of Large Models 2024 (Oral presentation)

AI Chinese abstract

最近基础模型的进展已成功扩展到时间序列（TS）领域，这得益于大规模TS数据集的出现。然而，先前的努力主要集中于捕获通道依赖（CD），这对于建模多变量时间序列至关重要，并且基于注意力的方法已被广泛用于此目的。尽管如此，这些方法主要关注修改架构，往往忽略了数据集特定特征的重要性。在这项工作中，我们引入了部分通道依赖（PCD）的概念，通过利用数据集特定信息来增强基于Transformer的模型中的CD建模，从而细化模型捕获的CD。为了实现PCD，我们提出了通道掩码（CMs），通过逐元素乘法将其集成到Transformer的注意力矩阵中。CMs由两个组件组成：1）捕获通道之间关系的相似性矩阵，以及2）数据集特定且可学习的领域参数，用于细化相似性矩阵。我们在多种任务和数据集上使用不同的骨干网络验证了PCD的有效性。代码可在此存储库获取：https://github.com/YonseiML/pcd。

English abstract

Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of large-scale TS datasets. However, previous efforts have primarily focused on capturing channel dependency (CD), which is essential for modeling multivariate TS, and attention-based methods have been widely employed for this purpose. Nonetheless, these methods primarily focus on modifying the architecture, often neglecting the importance of dataset-specific characteristics. In this work, we introduce the concept of partial channel dependence (PCD) to enhance CD modeling in Transformer-based models by leveraging dataset-specific information to refine the CD captured by the model. To achieve PCD, we propose channel masks (CMs), which are integrated into the attention matrices of Transformers via element-wise multiplication. CMs consist of two components: 1) a similarity matrix that captures relationships between the channels, and 2) dataset-specific and learnable domain parameters that refine the similarity matrix. We validate the effectiveness of PCD across diverse tasks and datasets with various backbones. Code is available at this repository: https://github.com/YonseiML/pcd.
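A channel mask of the kind described above can be sketched in plain Python. Everything specific here is an assumption made for illustration: the similarity matrix is taken to be the absolute Pearson correlation between channels, the two domain parameters `alpha` and `beta` refine it through a sigmoid, and the mask multiplies already-normalized attention weights without renormalizing. The abstract only states that a similarity matrix, learnable domain parameters, and element-wise multiplication are involved.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def channel_mask(series, alpha, beta):
    """Mask[i][j] = sigmoid(alpha * |corr(channel i, channel j)| + beta).

    alpha and beta stand in for the learnable, dataset-specific parameters.
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    return [[sigmoid(alpha * abs(pearson(ci, cj)) + beta) for cj in series]
            for ci in series]


def apply_mask(attention, mask):
    """Element-wise product of a channel-by-channel attention matrix and a mask."""
    return [[a * m for a, m in zip(arow, mrow)]
            for arow, mrow in zip(attention, mask)]


# Three hypothetical channels; the first two are perfectly correlated.
series = [[1, 2, 3, 4], [2, 4, 6, 8], [1, 3, 2, 3]]
mask = channel_mask(series, alpha=4.0, beta=-2.0)
uniform_attention = [[1 / 3] * 3 for _ in range(3)]
print(apply_mask(uniform_attention, mask))
```

In this sketch, weakly correlated channel pairs receive a smaller mask value, so their attention weight is damped relative to strongly correlated pairs.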

2410.07287 2026-05-29 physics.soc-ph cs.AI Version update

Crafting Desirable Climate Trajectories with RL Explored Socio-Environmental Simulations

利用强化学习探索的社会环境模拟来塑造理想的气候轨迹

James Rudd-Jones, Fiona Thendean, María Pérez-Ortiz

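Affiliations: UCL Centre for Artificial Intelligence, Department of Computer Science, University College London, London, United Kingdom

AI summary: Uses multiple interacting reinforcement learning agents in place of traditional solvers to model cooperative and competitive social interactions in integrated assessment models; cooperative agents consistently reach futures with reduced emissions and an improved economy, whereas under competition desirable climate futures are rarely reached.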

Comments 23 pages, 13 Figures

AI Chinese abstract

气候变化构成生存威胁，需要有效的气候政策来实施有影响力的变革。该领域的决策极其复杂，涉及冲突的实体和证据。在过去几十年中，政策制定者越来越多地使用模拟和计算方法来指导部分决策。综合评估模型（IAMs）是其中一种方法，它结合了社会、经济和环境模拟来预测潜在政策效果。例如，联合国在其最近的政府间气候变化专门委员会（IPCC）报告中使用了IAMs的输出。传统上，这些模型使用递归方程求解器求解，但存在若干缺点，例如在不确定性下决策困难。最近使用强化学习（RL）替代传统求解器的初步工作显示，在不确定和嘈杂场景中决策有前景的结果。我们通过引入多个交互的RL智能体作为初步分析，扩展了这项工作，以模拟驱动当前气候危机的各种利益相关者或国家之间复杂的社会互动。我们的发现表明，该框架中的合作智能体能够一致地规划出通往更理想未来的路径，表现为减少碳排放和改善经济。然而，当引入智能体之间的竞争时，例如通过使用相反的奖励函数，理想的气候未来很少能达到。模拟竞争对于提高这些模拟的真实性至关重要，因此我们通过可视化导致更不确定行为的状态来采用策略解释，以理解算法失败的原因。最后，我们强调了当前局限性和未来工作的方向，以确保未来技术应用于政策制定。

English abstract

Climate change poses an existential threat, necessitating effective climate policies to enact impactful change. Decisions in this domain are incredibly complex, involving conflicting entities and evidence. In recent decades, policymakers have increasingly used simulations and computational methods to guide some of their decisions. Integrated Assessment Models (IAMs) are one such method; they combine social, economic, and environmental simulations to forecast potential policy effects. For example, the UN uses outputs of IAMs for its recent Intergovernmental Panel on Climate Change (IPCC) reports. Traditionally these models have been solved using recursive equation solvers, which have several shortcomings, e.g. struggling with decision making under uncertainty. Recent preliminary work using Reinforcement Learning (RL) to replace the traditional solvers shows promising results for decision making in uncertain and noisy scenarios. We extend this work by introducing multiple interacting RL agents as a preliminary analysis of modelling the complex interplay of social interactions between various stakeholders or nations that drives much of the current climate crisis. Our findings show that cooperative agents in this framework can consistently chart pathways towards more desirable futures in terms of reduced carbon emissions and improved economy. However, upon introducing competition between agents, for instance by using opposing reward functions, desirable climate futures are rarely reached. Modelling competition is key to increased realism in these simulations; we therefore employ policy interpretation, visualising which states lead to more uncertain behaviour, to understand algorithm failure. Finally, we highlight the current limitations and avenues for further work to ensure future technology uptake for policy derivation.

2405.13003 2026-05-29 cs.CL cs.AI cs.IR Version update

A Survey on Recent Advances in Conversational Data Generation

对话数据生成最新进展综述

Heydar Soudani, Roxana Petcu, Evangelos Kanoulas, Faegheh Hasibi

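Affiliations: Radboud University; University of Amsterdam

AI summary: Systematically surveys multi-turn conversational data generation for three types of dialogue systems (open-domain, task-oriented, and information-seeking), introduces a general framework covering seed data creation, utterance generation, and quality filtering, and discusses evaluation metrics and future directions.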

DOI: 10.1145/3795686

AI Chinese abstract

近年来对话系统的进步显著增强了各领域的人机交互。然而，由于专业对话数据的稀缺，训练这些系统面临挑战。传统上，对话数据集通过众包创建，但该方法成本高、规模有限且劳动密集。作为解决方案，合成对话数据的开发应运而生，利用技术增强现有数据集或将文本资源转换为对话格式，提供了一种更高效且可扩展的数据集创建方法。在本综述中，我们系统全面地回顾了多轮对话数据生成，重点关注三类对话系统：开放域、任务导向和信息检索。我们根据种子数据创建、话语生成和质量过滤方法等关键组件对现有研究进行分类，并引入了一个概述对话数据生成系统主要原则的通用框架。此外，我们考察了评估合成对话数据的指标和方法，探讨了当前领域的挑战，并探索了未来研究的潜在方向。我们的目标是通过概述最先进的方法并强调该领域进一步研究的机会，加速研究人员和从业者的进展。

English abstract

Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, training these systems is challenging due to the scarcity of specialized dialogue data. Traditionally, conversational datasets were created through crowdsourcing, but this method has proven costly, limited in scale, and labor-intensive. As a solution, the development of synthetic dialogue data has emerged, utilizing techniques to augment existing datasets or convert textual resources into conversational formats, providing a more efficient and scalable approach to dataset creation. In this survey, we offer a systematic and comprehensive review of multi-turn conversational data generation, focusing on three types of dialogue systems: open domain, task-oriented, and information-seeking. We categorize the existing research based on key components like seed data creation, utterance generation, and quality filtering methods, and introduce a general framework that outlines the main principles of conversation data generation systems. Additionally, we examine the evaluation metrics and methods for assessing synthetic conversational data, address current challenges in the field, and explore potential directions for future research. Our goal is to accelerate progress for researchers and practitioners by presenting an overview of state-of-the-art methods and highlighting opportunities to further research in this area.

2205.04297 2026-05-29 cs.RO cs.AI Version update

Learning A Simulation-based Visual Policy for Real-world Peg In Unseen Holes

基于学习的视觉策略用于真实世界中未见过孔洞的插拔

Liang Xie, Hongxiang Yu, Kechun Xu, Tong Yang, Minhang Wang, Haojian Lu, Rong Xiong, Yue Wang

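Affiliations: College of Control Science and Engineering, Zhejiang University, Zhejiang, China; The Application Innovate Lab, Huawei Incorporated Company, China

AI summary: Proposes a learning-based visual peg-in-hole method that decouples the perception and policy modules, trains on several shapes in simulation, and adapts to arbitrary unseen shapes in the real world at minimal sim-to-real transfer cost.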

AI Chinese abstract

本文提出一种基于学习的视觉插拔方法，能够在仿真中训练多种形状，并在真实世界中以最小的仿真到现实迁移成本适应任意未见形状。核心思想是将感知-运动策略的泛化解耦为快速适应的感知模块和仿真通用策略模块的设计。框架包括分割网络（SN）、虚拟传感器网络（VSN）和控制器网络（CN）。具体地，VSN被训练用于从分割图像中测量未见形状的位姿。然后，给定与形状无关的位姿测量，CN被训练以实现通用插拔。最后，当应用于真实未见孔洞时，我们只需微调仿真VSN+CN所需的分割网络。为进一步最小化迁移成本，我们提出在一分钟人工教学后自动收集和标注分割网络的数据。展示了在眼在外/眼在手配置下的仿真和真实世界结果。采用所提策略的电动汽车充电系统在2-3秒内实现了10/10的成功率，仅使用数百个自动标注样本进行分割网络迁移。

English abstract

This paper proposes a learning-based visual peg-in-hole method that enables training with several shapes in simulation and adapting to arbitrary unseen shapes in the real world with minimal sim-to-real cost. The core idea is to decouple the generalization of the sensory-motor policy into the design of a fast-adaptable perception module and a simulated generic policy module. The framework consists of a segmentation network (SN), a virtual sensor network (VSN), and a controller network (CN). Concretely, the VSN is trained to measure the pose of the unseen shape from a segmented image. Then, given the shape-agnostic pose measurement, the CN is trained to achieve generic peg-in-hole. Finally, when applying the policy to real unseen holes, we only have to fine-tune the SN required by the simulated VSN+CN. To further minimize the transfer cost, we propose to automatically collect and annotate the data for the SN after one minute of human teaching. Simulated and real-world results are presented under eye-to-hand and eye-in-hand configurations. An electric vehicle charging system running the proposed policy achieves a 10/10 success rate in 2-3 s, using only hundreds of auto-labeled samples for the SN transfer.

2605.29578 2026-05-29 cs.AI Version update

GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation

基于季节性空间先验和LLM活动链生成的GPS增强游客移动性建模

Yifan Liu, Yanling Sang, Xishun Liao, Morgan Sun, Bo Yang, Zhiyuan Zhang, Chris Stanford, Haoxuan Ma, Jiaqi Ma

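Affiliations: UCLA Mobility Lab, Department of Civil and Environmental Engineering, University of California, Los Angeles; Novateur Research Solutions; University of Central Florida

AI summary: Proposes a four-stage simulation framework that combines month-conditioned spatial priors derived from GPS and survey data, trip-extent prediction from tourist demographics, distance-feasible ward sequence assignment, and LLM-based activity chain generation, to model tourist mobility that is non-routine, attraction-driven, and highly sensitive to trip purpose, season, and party composition.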

AI Chinese abstract

游客移动性对城市交通规划提出了独特挑战。与居民通勤不同，游客旅行大多是非例行的、由景点驱动的，并且对旅行目的、旅行季节和旅行成员组成高度敏感。现有方法要么测量聚合的游客空间模式而不生成个人行程，要么合成移动性而不考虑游客特定结构，如旅行持续时间条件、月份变化的景点需求以及家庭共同旅行规则。为了解决这些挑战，我们提出了一个四阶段仿真框架，结合了从GPS和调查数据推导的月份条件空间先验、基于游客人口统计的旅行范围预测、距离可行的区域序列分配，以及在家庭和空间约束下基于LLM的活动链生成。GPS数据仅以隐私保护的聚合形式用作月份条件空间先验，不保留或暴露任何个人轨迹。在东京旅游上的实验表明，基于GPS的游客群体提取恢复了与调查参考一致的空间访问特征，我们的框架生成了人口统计对齐的合成行程，其区域级访问份额与调查分布和停留点导出的月度访问模式紧密对齐。结果证明了该框架作为地理基础、人口统计感知的游客移动性建模方法的有效性。

English abstract

Tourist mobility poses a distinct challenge for urban transportation planning. Unlike resident commuting, tourist travel is largely non-routine, attraction driven, and highly sensitive to trip purpose, travel season, and trip member composition. Existing approaches either measure aggregate tourist spatial patterns without generating individual schedules, or synthesize mobility without tourist specific structure such as trip duration conditioning, month varying attraction demand, and household co-travel rules. To address these challenges, we propose a four stage simulation framework combining month conditioned spatial priors derived from GPS and survey data, trip extent prediction from tourist demographics, distance feasible ward sequence assignment, and LLM-based activity chain generation under household and spatial constraints. GPS data are used only in privacy preserving aggregated form as month conditioned spatial priors, with no individual traces retained or exposed. Experiments on tourism in Tokyo demonstrate that the GPS based tourist cohort extraction recovers spatial visitation signatures consistent with survey references, and our framework produces demographically aligned synthetic schedules whose ward-level visitation shares align closely with both survey distributions and staypoint derived monthly visitation patterns. The results demonstrate the framework's effectiveness as a geographically grounded, demographically aware approach to tourist mobility modeling.


2605.29568 2026-05-29 cs.AI 版本更新

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

DeepTool: 通过过程监督强化学习扩展工具集成推理中的交错思考

Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu

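Affiliations: Research Center for Social Computing; Interactive Robotics, Harbin Institute of Technology, China

AI summary: To address the lack of step-level supervision and self-correction in tool-integrated reasoning, proposes the DeepTool framework, which uses synthesized interleaved trajectories and GRPO reinforcement learning with an action-centric process reward to substantially improve model performance on multiple benchmarks.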

AI中文摘要

工具集成推理通过利用外部环境扩展了LLM的能力。然而，现有方法在顺序调用工具时缺乏战略规划和自我纠正所需的思考。虽然强化学习缓解了这一问题，但传统的工具集成推理方法受到稀疏的基于结果奖励的阻碍，无法监督中间推理步骤和工具调用。为了解决这个问题，我们提出了DeepTool，一个新颖的框架，它在每一轮思考、行动和观察的交错过程中扩展了深思熟虑的思考。在DeepTool中，我们首先引入了一个合成流程，将扩展思考演变为交错轨迹，并集成对抗性扰动以确保鲁棒性和自我纠正。其次，我们基于GRPO设计了过程监督强化学习，利用以行动为中心的过程奖励来强化中间交错思考，并在每一轮强制执行精确的工具调用。大量实验表明，DeepTool实现了卓越的性能，在六个基准测试中显著提升了Qwen2.5-7B（例如，AIME24: 3.2% -> 40.4%，HMMT25: 0.0% -> 28.6%）。此外，令牌成本效益分析证实了交错思考的实用性，展示了DeepTool在性能和令牌效率之间的最佳平衡。

英文摘要

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.
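
As a rough illustration of the group-relative normalization that GRPO-style training is generally built on, the sketch below standardizes rewards within one group of trajectories sampled for the same prompt. The function name `group_relative_advantages`, the `eps` constant, and the reward values are hypothetical; the abstract does not say how the Action-Centric Process Reward is computed or how per-turn rewards are combined, so this is not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards inside one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scalar rewards for four trajectories of the same prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Trajectories scoring above the group mean receive a positive advantage and those below it a negative one, so no separate value network is needed for the baseline.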


2605.29562 2026-05-29 cs.RO cs.AI cs.CV 版本更新

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

VLA-Pro：面向视觉-语言-动作模型的跨任务程序性记忆迁移

Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang

Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI; Shanghai Xinzhi Embodied Intelligence Technology Co., Ltd.

AI summary: Proposes the VLA-Pro framework, which stores and retrieves task-relevant LoRA adapters as procedural memory to achieve cross-task generalization, substantially raising success rates on simulated and real-world tasks.

AI中文摘要

视觉-语言-动作（VLA）模型在通用机器人操作中展现出强大潜力，但在泛化到需要跨物体、场景和动作模式迁移相关经验的新任务时仍面临挑战。本文提出VLA-Pro，一种即插即用框架，通过在训练时存储任务相关的程序性记忆并在推理时迁移这些记忆来增强跨任务泛化。具体而言，VLA-Pro在训练时将任务特定的LoRA适配器存储为参数化的程序性记忆。在推理时，VLA-Pro基于当前多模态上下文检索相关程序性记忆，并动态融合这些记忆以生成当前动作块。在RoboTwin、RLBench和真实世界操作任务上的实验表明，VLA-Pro在多个骨干网络上持续提升跨任务泛化能力，在仿真中实现高达207%的相对改进，并将真实世界成功率从5.8%提升至65.0%。这些结果表明，程序性记忆检索与自适应为将操作经验迁移到新任务提供了一种有效机制，同时保持了模块化和执行稳定性。

英文摘要

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.


2605.29561 2026-05-29 cs.AI cs.SE 版本更新

ParaTool: Shifting Tool Representations from Context to Parameters

ParaTool: 将工具表示从上下文转移到参数中

Zekai Yu, Qi Meng, Qizhi Chu, Yu Hao, Chuan Shi, Cheng Yang

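Affiliations: Beijing University of Posts and Telecommunications

AI summary: Proposes the ParaTool framework, which projects each tool into a loadable set of parameters and combines three stages (parametric tool pre-training, soft tool selection, and parametric tool fine-tuning) so that large language models can call tools without in-context documentation, significantly outperforming ICL-based baselines on Stable ToolBench and BFCL.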

AI中文摘要

工具调用通过使大语言模型（LLM）能够与外部可执行接口进行基于环境的交互，从而扩展了其能力。然而，主流的上下文学习（ICL）方法通常将详细的工具文档和使用示例直接纳入上下文中，这导致随着上下文长度的增长，推理开销显著增加，并且幻觉风险升高。相反，基于微调的方法虽然提高了通用工具调用能力，但往往无法有效内化先前见过的工具的特定细节，从而仍然依赖于上下文文档。为了解决这些限制，我们提出了ParaTool，一个将每个工具投影到专用的、可加载的参数集中的框架。通过配备这些参数化工具的动态集成，LLM可以在不依赖上下文文档或示例的情况下执行工具调用。具体来说，我们的方法包括三个阶段：（1）参数化工具预训练将不同工具的知识封装到独立的参数模块中；（2）软工具选择使用门控网络动态加权和聚合相关工具参数；（3）参数化工具微调联合更新工具参数以对齐训练和推理过程。在Stable ToolBench和BFCL上的实验表明，ParaTool显著优于基于ICL的强基线方法，在降低计算复杂度的同时实现了优越的性能。

英文摘要

Tool calling extends large language models (LLMs) by enabling grounded interaction with external executable interfaces, thereby supporting environment-coupled problem solving. However, mainstream in-context learning (ICL) approaches typically incorporate detailed tool documentation and usage examples directly into the context. This results in substantial inference overhead and heightened risks of hallucination as the context length grows. Conversely, while tuning-based methods improve general tool-calling capabilities, they often fail to effectively internalize the specific details of previously seen tools, thereby retaining a dependency on in-context documentation. To address these limitations, we propose ParaTool, a framework that projects each tool into a dedicated, loadable set of parameters. By equipping a dynamic integration of these parameterized tools, the LLM can perform tool calling without relying on in-context documents or examples. Specifically, our approach consists of three stages: (1) parametric tool pre-training encapsulates the knowledge of different tools into independent parameter modules; (2) soft tool selection employs a gating network to dynamically weigh and aggregate relevant tool parameters; and (3) parametric tool fine-tuning jointly updates tool parameters to align the training and inference processes. Experiments on Stable ToolBench and BFCL demonstrate that ParaTool significantly outperforms strong ICL-based baselines, achieving superior performance while reducing computational complexity.
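
To make stage (2) more concrete, here is a minimal sketch of soft selection over per-tool parameter vectors. It assumes the gate produces one scalar score per tool, that scores are normalized with a softmax, and that parameters are merged by a weighted sum; none of these details are stated in the abstract, and the names `softmax` and `aggregate_tool_params` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_tool_params(gate_scores, tool_params):
    """Return (gate weights, weighted sum of the per-tool parameter vectors)."""
    weights = softmax(gate_scores)
    merged = [0.0] * len(tool_params[0])
    for w, params in zip(weights, tool_params):
        for i, p in enumerate(params):
            merged[i] += w * p
    return weights, merged


# Two hypothetical tools with 2-dimensional parameter vectors and equal gate scores.
weights, merged = aggregate_tool_params([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With equal gate scores the merged vector is the plain average of the two tools; a larger score for one tool shifts the merged parameters toward that tool's module.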


2605.29560 2026-05-29 cs.AI 版本更新

Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

Battery-Sim-Agent: 利用LLM智能体进行电池逆参数估计

Jiawei Chen, Xiaofan Gui, Shikai Fang, Shengyu Tao, Shun Zheng, Weiqing Liu, Jiang Bian

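Affiliations: Peking University; Microsoft Research; Zhejiang University; Chalmers University of Technology

AI summary: Proposes the Battery-Sim-Agent framework, which recasts inverse battery parameter estimation as a reasoning task: an LLM agent interacts in a closed loop with a high-fidelity simulator, forming physical hypotheses and structured parameter updates, and significantly outperforms black-box optimization methods such as Bayesian optimization.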

AI中文摘要

对高保真电池“数字孪生”进行参数化是一个关键但具有挑战性的逆问题，阻碍了电池创新的步伐。现有方法将此表述为黑箱优化（BBO）任务，采用样本效率低且忽视底层物理的算法。在这项工作中，我们引入了一种新范式，将逆问题重新定义为推理任务，并提出了Battery-Sim-Agent，这是第一个将大型语言模型（LLM）智能体与高保真电池模拟器闭环部署的框架。该智能体模仿人类科学家的工作流程：它解释来自模拟器的丰富多模态反馈，形成基于物理的假设来解释差异，并提出结构化的参数更新。在一个系统构建的基准套件上，涵盖多种电池化学成分、操作条件和难度级别，我们的智能体在识别准确参数方面显著优于贝叶斯优化等强BBO基线。我们进一步展示了该框架在复杂长期退化拟合任务中的能力，并在真实电池数据集上验证了其实用性。我们的结果突显了LLM智能体作为基于推理的优化器在科学发现和电池参数估计中的前景。

英文摘要

Parameterizing high-fidelity "digital twins" of batteries is a critical yet challenging inverse problem that hinders the pace of battery innovation. Prevailing methods formulate this as a black-box optimization (BBO) task, employing algorithms that are sample-inefficient and blind to the underlying physics. In this work, we introduce a new paradigm that reframes the inverse problem as a reasoning task, and present Battery-Sim-Agent, the first framework to deploy a Large Language Model (LLM) agent in a closed loop with a high-fidelity battery simulator. The agent mimics a human scientist's workflow: it interprets rich, multi-modal feedback from the simulator, forms physically-grounded hypotheses to explain discrepancies, and proposes structured parameter updates. On a systematically constructed benchmark suite spanning diverse battery chemistries, operating conditions, and difficulty levels, our agent significantly outperforms strong BBO baselines like Bayesian optimization in identifying accurate parameters. We further demonstrate the framework's capability in complex long-horizon degradation fitting tasks and validate its practical applicability on real-world battery datasets. Our results highlight the promise of LLM-agents as reasoning-based optimizers for scientific discovery and battery parameter estimation.


2605.29556 2026-05-29 cs.AI 版本更新

Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification

Opt-Verifier：通过双面验证释放大语言模型在优化建模中的潜力

Haoyang Liu, Jie Wang, Boxuan Niu, Xiongwei Han, Yian Xu, Mingxuan Ye, Zijie Geng, Fangzhou Zhu, Tao Zhong, Mingxuan Yuan, Jianye Hao

Affiliations: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition; University of Science and Technology of China; Noah's Ark Lab, Huawei Technologies; Tianjin University

AI summary: Proposes the Opt-Verifier framework, which uses dual-side verification (structure-side and solution-side) so that large language models can build mathematical optimization models automatically with substantially higher modeling accuracy.

Journal ref: International Conference on Machine Learning (ICML), 2026

AI中文摘要

构建数学优化模型在运筹学中至关重要，但需要大量人类专业知识。最近的进展利用大语言模型（LLMs）来自动化这一建模过程。然而，现有工作往往难以验证生成的优化模型的正确性，既不检查约束和变量的合理性，也不检查生成模型解的有效性。这阻碍了后续的验证和纠正步骤，从而严重损害了建模准确性。为了解决这一挑战，我们提出了一种新颖的基于LLM的框架，具有从结构和解决方案两个角度的双面验证（Opt-Verifier），从而提高建模准确性。结构侧验证确保生成的优化模型的建模结构与原始问题描述一致，准确捕捉问题的约束和要求。同时，解决方案侧验证解释和评估解的有效性，确认优化模型在逻辑和数学上是合理的。在流行基准上的实验表明，我们的方法在准确性上提高了20%以上。

英文摘要

Building mathematical optimization models is critical in operations research (OR), while it requires substantial human expertise. Recent advancements have utilized large language models (LLMs) to automate this modeling process. However, existing works often struggle to verify the correctness of the generated optimization models, without checking the rationality of the constraints and variables or the validity of solutions to the generated models. This hampers the subsequent verification and correction steps, and thus it severely hurts the modeling accuracy. To address this challenge, we propose a novel LLM-based framework with Dual-side Verification (Opt-Verifier) from both structure and solution perspectives, thereby improving the modeling accuracy. The structure-side verification ensures that the modeling structure of the generated optimization models aligns with the original problem description, accurately capturing the problem's constraints and requirements. Meanwhile, the solution-side verification interprets and evaluates the solutions' validity, confirming that the optimization models are logically and mathematically sound. Experiments on popular benchmarks demonstrate that our approach achieves over 20% improvement in accuracy.


2605.29547 2026-05-29 cs.LG cs.AI math.OC 版本更新

Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization

基于随机几何探测的奇异性感知优化：迈向稳定的非光滑优化

Ruoran Xu, Borong She, Xiaobo Jin, Qiufeng Wang

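Affiliations: Xi'an Jiaotong-Liverpool University

AI summary: To address the gradient chattering of the Adam optimizer in non-smooth optimization, proposes Singularity-aware Adam (S-Adam), which adjusts step sizes dynamically using a Local Geometric Instability (LGI) metric to stabilize training and improve generalization.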

Comments International Conference on Machine Learning (ICML), 2026

AI中文摘要

深度学习优化严重依赖于损失景观平滑的假设，而现代架构由于ReLU激活和量化算子等非光滑组件系统性地违反了这一条件。在这种非光滑情况下，Adam等自适应优化器会出现梯度抖动，即由Clarke次微分内冲突信号引起的剧烈振荡，导致收敛性差和泛化能力欠佳。为解决此问题，我们引入了奇异性感知Adam（S-Adam），一种通过基于局部几何不稳定性动态调整步长来稳定训练的新型优化器。我们的关键贡献是局部几何不稳定性（LGI）度量，一种从随机方向导数方差导出的Clarke次微分直径的计算高效估计量。S-Adam采用自适应阻尼机制$\exp(-\lambda\rho)$，在高不稳定性区域减缓更新，同时在平滑盆地保持快速收敛。我们使用微分包含提供了严格的收敛性分析，证明S-Adam以最优的$O(1/\sqrt{T})$速率几乎必然收敛到$(\delta,\varepsilon)$-Clarke稳定点。在量化感知训练（QAT）和高噪声小批量学习上的实证评估表明，S-Adam持续优于AdamW和Prox-SGD，在CIFAR-100上实现高达6%的准确率提升，在TinyImageNet上实现3%的提升，同时有效缓解梯度振荡。

英文摘要

Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering, violent oscillations caused by conflicting signals within the Clarke subdifferential, leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. S-Adam incorporates an adaptive damping mechanism $\exp(-\lambda\rho)$ that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to $(\delta,\varepsilon)$-Clarke stationary points at the optimal $O(1/\sqrt{T})$ rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-noise small-batch learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to 6 percent on CIFAR-100 and 3 percent on TinyImageNet while effectively mitigating gradient oscillations.
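
The exponential damping factor with decay rate λ and instability score ρ can be illustrated with a small sketch. It assumes ρ is simply the population variance of a handful of directional-derivative samples supplied by the caller; the abstract does not give the exact LGI estimator, its normalization, or how the factor enters the Adam update, so `damping_factor`, `damped_step`, and the sample values below are hypothetical.

```python
import math
from statistics import pvariance


def damping_factor(directional_derivs, lam=1.0):
    """exp(-lam * rho), with rho taken as the variance of the derivative samples."""
    rho = pvariance(directional_derivs)
    return math.exp(-lam * rho)


def damped_step(base_lr, directional_derivs, lam=1.0):
    """Scale a base learning rate by the damping factor."""
    return base_lr * damping_factor(directional_derivs, lam)


# Hypothetical samples: agreeing derivatives versus sign-flipping derivatives.
stable = damping_factor([0.5, 0.5, 0.5])            # rho = 0, so no damping
unstable = damping_factor([-1.0, 1.0, -1.0, 1.0])   # rho = 1, so the step shrinks
```

The factor stays at 1 when the samples agree and decays toward 0 as their spread grows, which matches the qualitative behavior the abstract describes for smooth basins and high-instability regions.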


2605.29543 2026-05-29 cs.LG cs.AI cs.CL cs.HC cs.IR 版本更新

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

SCOPE：一种用于空中交通管制复诵监控的轻量训练LLM框架

Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao

Affiliations: Department of Mechanical and Aerospace Engineering, The Hong Kong University of Science and Technology; School of Electronic and Information Engineering, Beihang University; State Key Laboratory of CNS/ATM

AI summary: Proposes the SCOPE framework, which couples a frozen LLM with a plug-in open-set classifier and an in-context learning mechanism for efficient and accurate ATC readback monitoring, reaching 91.05% open-set detection accuracy and a 96.63% anomaly correction rate in the few-shot setting.

AI中文摘要

飞行员对空中交通管制（ATC）语音指令的复诵是航空运输中防止沟通失误的主要保障。然而，复诵异常仍与约80%的航空事故相关。这一脆弱性因交通量增加和认知负荷升高而进一步加剧，从而推动了机器自动化复诵监控的需求。传统的基于规则和机器学习的方法难以在高度可变且不断演变的空管-飞行员通信术语中泛化。尽管大语言模型（LLM）凭借其强大的推理和泛化能力开辟了新途径，但现有方法在实践中仍面临部署和计算障碍。在这项工作中，我们提出了SCOPE（Semantic reasoning for Communication via Open-set Plug-in with Examples），一种新颖的轻量训练LLM框架，提升了基于机器的ATC复诵监控的效率和准确性。核心思想是在冻结的LLM之上，将插件式开放集分类器与精心设计的上下文学习机制相结合。在半合成通信数据集上的大量实验表明，SCOPE在实现运行环境所需的低延迟响应的同时，达到了优越的准确性。在少样本设置下，SCOPE在开放集检测中达到91.05%的准确率，并纠正了96.63%的异常复诵，从而在提供决策解释的同时优于现有最强基线。这些发现证明了我们的框架作为通向可解释和可控的ATC复诵监控的实用途径的潜力。

英文摘要

Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.


2605.29534 2026-05-29 cs.AI 版本更新

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

UI-KOBE：面向轻量级图引导GUI代理的知识导向行为探索

Yuxiang Chai, Han Xiao, Xinyu Fu, Jinpeng Chen, Rui Liu, Hongsheng Li

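Affiliations: CUHK MMLab; Huawei Research; Shenzhen Loop Area Institute; CPII under InnoHK

AI summary: Proposes the UI-KOBE framework, which automatically builds an app knowledge graph and uses it to guide a lightweight GUI agent's runtime decisions, improving its execution of mobile GUI tasks.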

AI中文摘要

近期移动GUI代理的进展显示出自动化移动任务的强大潜力，但大多数有效系统仍依赖大型视觉语言模型进行截图理解和长期规划。可直接部署在移动设备上的小型GUI代理在实际应用中更具吸引力，具有更低的推理成本和更好的敏感设备信息保护。然而，由于模型容量有限，这些轻量级代理在仅凭截图端到端规划和执行GUI任务时仍不可靠。我们提出知识导向行为探索（UI-KOBE），一种利用可复用的应用特定图知识来改进轻量级移动GUI代理的框架。UI-KOBE首先自主探索移动应用并构建应用知识图谱，其中节点代表不同的UI状态，边代表可执行的转换。运行时，轻量级GUI代理将图作为外部指导：给定用户任务和当前截图，它识别当前图节点，并选择与该节点关联的自循环动作、相邻转换、任务完成或回退自由动作。通过用应用特定的图指导支持运行时决策，UI-KOBE减轻了端到端GUI规划的负担，帮助轻量级模型更有效地执行移动GUI任务，为高效、可解释且注重隐私的设备端GUI代理提供了实用的一步。

英文摘要

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.
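
One way to picture the runtime step is as a lookup in a small graph structure that returns the four kinds of options named in the abstract. The sketch below is only an illustration under that reading: the `UINode` class, the `candidate_actions` function, and the example screens and action names are hypothetical, and the node-identification and action-selection models are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class UINode:
    """One UI state in a hypothetical app knowledge graph."""
    name: str
    self_loops: list = field(default_factory=list)   # actions that stay on this screen
    transitions: dict = field(default_factory=dict)  # action name -> next node name


def candidate_actions(graph, node_name):
    """List the options available at a node: self-loops, transitions, completion, fallback."""
    node = graph[node_name]
    options = [("self_loop", action) for action in node.self_loops]
    options += [("transition", action) for action in node.transitions]
    options += [("complete", None), ("fallback", None)]
    return options


# A two-screen toy graph with made-up action names.
graph = {
    "home": UINode("home", self_loops=["scroll"], transitions={"tap_settings": "settings"}),
    "settings": UINode("settings"),
}
options = candidate_actions(graph, "home")
```

Restricting the agent to a short list of graph-backed options, plus a free-form fallback, is what would let a small model avoid planning the whole task from pixels.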


2605.29532 2026-05-29 cs.SE cs.AI 版本更新

GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing

GUITestScape：面向探索性GUI测试的开放集评估

Xiaoyi Chen, Yifei Gao, Yang Xu, Xingxing Song, Yi Zhang, Jitao Sang

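Affiliations: Beijing Jiaotong University; Independent Researcher

AI summary: Proposes the GUITestScape benchmark and the GUIJudge evaluator, which use 508 preset defects covering interaction and display types together with process-aware evaluation to address existing GUI-testing evaluation being limited to predefined annotations and interaction defects.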

AI中文摘要

探索性GUI测试对MLLM代理提出了特别高的要求：在没有预定义测试脚本的情况下，代理必须自主导航应用程序并通过自身交互发现缺陷。然而，当前的评估在两个层面上存在不足。首先，现有基准几乎只关注交互缺陷，将显示缺陷排除在评估框架之外。其次，评估协议局限于预定义的缺陷标注，将测试过程简化为单一终态判断，混淆了性质不同的失败模式。为解决这些挑战，我们提出了GUITestScape，一个交互式基准，涵盖61个真实Android应用程序和508个预设缺陷（包括交互和显示类型），并引入了GUIJudge，一个开放集评估器，将代理的测试轨迹分解为可独立诊断的能力。实验结果表明，GUIJudge在预定义标注之外实现了可靠的过程感知评估，显著优于所有基线。在GUITestScape上的基准测试进一步揭示，检测仍然是现有模型在两种缺陷类型上的关键瓶颈，并且将GUIJudge的验证器集成到现有代理中可以在不重新训练的情况下显著提升其检测性能。

英文摘要

Exploratory GUI testing is a particularly demanding setting for MLLM agents: without predefined test scripts, an agent must autonomously navigate an application and discover defects through its own interaction. However, current evaluation falls short on two fronts. First, existing benchmarks focus almost exclusively on interaction defects, leaving display defects outside the evaluation frame. Second, evaluation protocols are bound to predefined defect annotations, collapsing the testing process into a single end-state judgment that conflates qualitatively distinct failure modes. To address these challenges, we present GUITestScape, an interactive benchmark covering 61 real-world Android applications and 508 preset defects spanning interaction and display types, and introduce GUIJudge, an open-set evaluator that decomposes an agent's testing trajectory into independently diagnosable capabilities. Experimental results demonstrate that GUIJudge achieves reliable process-aware evaluation beyond predefined annotations, substantially outperforming all baselines. Benchmarking on GUITestScape further reveals that detection remains the critical bottleneck for existing models across both defect types, and that integrating GUIJudge's verifiers into existing agents significantly boosts their detection performance without retraining.


2605.29524 2026-05-29 cs.CR cs.AI 版本更新

KBF: Knowledge Boundary as Fingerprint for Language Model and Black-Box API Auditing

KBF：知识边界作为语言模型和黑盒API审计的指纹

Yijia Fang, Yiqing Feng, Bingyu Li, Mingxun Zhou

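Affiliations: Beihang University, China; Xidian University, China

AI summary: Proposes the KBF protocol, which uses stable numerical recall rates near the knowledge boundary as a fingerprint for low-cost black-box auditing of model APIs, detecting model substitution and mixed-routing attacks.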

Comments 20 pages, 13 figures

2605.29522 2026-05-29 cs.AI 版本更新

DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation

DeepSurvey: 提升自动综述生成中的分析深度与引用可靠性

Ziyue Yang, Da Ma, Hanqi Li, Zijian Wang, Tiancheng Huang, Zijian Hu, Chenrun Wang, Yunzhe Zhang, Xiaobao Wu, Kai Yu, Lu Chen

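Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China; Suzhou Laboratory, Suzhou, China

AI summary: Proposes DeepSurvey, an agentic system that deepens analysis through structured full-text notes, cross-paper relationship modeling, and code-repository analysis, and improves citation reliability through citation-graph expansion with hybrid filtering, evidence-constrained citation assignment, and multi-granularity agentic refinement, surpassing existing methods in content quality and citation accuracy.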

AI中文摘要

随着科学文献的快速增长，自动综述生成已成为AI科学家和人类研究者的关键能力。然而，现有系统由于依赖摘要和孤立论文处理而分析深度有限，并且由于不精确的检索和事后归因而导致引用不可靠，从而产生肤浅的综述并可能误导研究者。我们提出DeepSurvey，一个解决这两个问题的智能体系统。为了增强深度，DeepSurvey从全文论文中提取结构化要点，通过聚类和比较分析建模跨论文关系，并集成代码仓库分析以恢复实现级细节。为了加强可靠性，它结合引文图扩展与混合过滤进行主题聚焦检索，强制执行证据约束的引用分配，并部署多粒度智能体精炼以验证引用-声明对齐。实验表明，DeepSurvey在内容得分（8.644/10）和引用质量（召回率和精确率分别比最强基线提高12.3%和9.3%）上达到最高，跨领域泛化更稳健（CS到非CS的下降为0.14 vs 0.22至0.69），并且领域专家更倾向于选择它而非人类撰写的综述（整体质量83.3%，内容深度100%）。

英文摘要

As scientific literature grows rapidly, automated survey generation has become a key capability for AI scientists and human researchers. However, existing systems suffer from limited analytical depth due to reliance on abstracts and isolated paper processing, and from unreliable citations caused by imprecise retrieval and post-hoc grounding, producing superficial surveys that may mislead researchers. We present DeepSurvey, an agentic system that addresses both. To enhance depth, DeepSurvey extracts structured keynotes from full-text papers, models cross-paper relationships through clustering and comparative analysis, and integrates code-repository analysis to recover implementation-level details. To fortify reliability, it combines citation-graph expansion with hybrid filtering for topic-focused retrieval, enforces evidence-constrained citation assignment, and deploys multi-granularity agentic refinement to validate citation-claim alignment. Experiments show that DeepSurvey achieves the highest content score (8.644/10) and citation quality (12.3% and 9.3% recall and precision gains over the strongest baseline), generalizes more robustly across domains (0.14 vs 0.22 to 0.69 CS-to-non-CS drop), and is preferred over human-written surveys by domain experts (83.3% overall quality, 100% content depth).


2605.29518 2026-05-29 cs.NI cs.AI 版本更新

Network Optimization Aspects of Autonomous Vehicles: Challenges and Future Directions

自动驾驶汽车的网络优化方面：挑战与未来方向

Rudolf Krecht, Tamas Budai, Erno Horvath, Akos Kovacs, Nobert Marko, Miklos Unger

Affiliations: Department of Automation and Mechatronics, Széchenyi István University; Department of Telecommunications, Széchenyi István University; Vehicle Industry Development Center, Széchenyi István University

AI summary: Surveys multidisciplinary approaches to network optimization for autonomous vehicles, including cooperative perception, aiming to dispel misconceptions and outline future directions.

2605.29512 2026-05-29 cs.AI 版本更新

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

MINDGAMES: 多智能体LLM中社会与策略推理评估的实时竞技场

Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Yang, I-Chen Wu, Yitian Huang, Qinlu Cao, Yiheng Sun, Yuhong Dai, Hongkun Yao, Jingxuan Fu, Jiwei Zhang, Hao Liao, Mossimo Ebeling, Govind Arun, Sadhvik Bathini, Mihir S Arya, Avinash Anish, Aditya Ranjan, Kirtana Sunil Phatnani, Paval KS, Vrushali Mehta, Aravind S, Nikhil Arora, Tanya Upadhyay, Amol Bandagale, Yuan Lu, ChunEn Hsiao, YuTing Lin, Arvin Chung, Jerry John Thomas, Mathieu Laurière, Leshem Choshen, Yoram Bachrach, Pramod Viswanath, Maria Polukarov, Cheston Tan, Tal Kachman, Atlas Wang

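Affiliations: NeurIPS 2025 Competition

AI summary: Proposes MINDGAMES, a multi-game arena that evaluates the social reasoning and strategic abilities of LLM agents across four game environments, revealing a rule-adherence bottleneck and differences in leaderboard validity across environments.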

AI中文摘要

大型语言模型（LLM）正越来越多地被部署为交互式智能体，但它们在长时间交互中的社会与策略推理能力仍知之甚少。现有评估依赖于静态场景或单一游戏基准，无法捕捉现实多智能体环境所需的持续、多面推理。我们引入MINDGAMES，一个多游戏竞技场和LLM智能体评估平台，它操作化了与“心智理论”相关的互补推理需求：隐藏信息下的信念归因、通过重复策略交互进行对手建模、知识不对称下的合作推理，以及社会推理中的持续欺骗。基于TextArena，MINDGAMES提供了统一的交互界面、基于TrueSkill的评分和四个游戏环境的完整轨迹记录。我们通过2025年在一场主要AI会议上举办的竞赛周期实例化MINDGAMES，评估了来自76个团队的944个提交智能体，涉及四个游戏：Colonel Blotto、迭代囚徒困境、Codenames和Secret Mafia。我们的分析揭示了智能体层面和评估层面的局限性：脆弱的规则遵循仍是主要瓶颈，顶级系统反复依赖显式结构支撑，且排行榜有效性在不同环境中差异显著。特别是，失败密集的环境可能同样奖励对对手错误的鲁棒性和策略能力，其中Secret Mafia在本周期中表现出明显的错误生存混杂。我们发布了一个包含29,571场多智能体游戏的数据集，包含回合级观察、动作和奖励，以及MG-Ref，一个确定性离线锦标赛协议，该协议使用与本分析相同的错误归因视角，将新智能体与冻结的顶级、低错误Stage II提交参考池进行评分。

英文摘要

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.


2605.29507 2026-05-29 cs.AI cs.IR 版本更新

Xetrieval: Mechanistically Explaining Dense Retrieval

Xetrieval：机械解释稠密检索

Zhixin Cai, Jun Bai, Yang Liu, Jiaqi Li, Yichi Zhang, Taichuan Li, Zhuofan Chen, Zixia Jia, Zilong Zheng, Wenge Rong

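Affiliations: School of Computer Science and Engineering, Beihang University; State Key Laboratory of General Artificial Intelligence, BIGAI

AI summary: Proposes the Xetrieval framework, which uses an embedding-level reasoning internalizer and a decomposition into sparse interpretable features to explain mechanistically why dense retrieval models assign high relevance scores.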

Comments Code: https://github.com/Hihiczx/Xetrieval ; Project page: https://hihiczx.github.io/Xetrieval

AI中文摘要

解释稠密检索器为何赋予高相关性分数仍然具有挑战性，因为检索决策是通过不透明的高维嵌入做出的。现有的解释通常关注表面信号，如词汇匹配、令牌对齐或事后文本理由，因此对塑造稠密检索行为在嵌入级别的潜在因素提供的洞察有限。我们提出Xetrieval，一个嵌入级别的机械框架，用于解释稠密检索。Xetrieval首先引入一个轻量级推理内化器，通过单次前向传递直接在嵌入空间中近似思维链推理，丰富句子嵌入的推理导向信息，同时避免昂贵的自回归生成。然后，它将这些推理增强的嵌入分解为稀疏、人类可解释的特征，每个特征与连贯的自然语言描述相关联。通过聚合多个文档端视图上的稀疏特征重叠，Xetrieval提供单个检索决策的特征级解释。在多种检索器和基准上的实验表明，Xetrieval揭示了连贯的可解释特征，产生更强的成对干预效果，并支持任务级特征引导。项目页面和源代码可在https://hihiczx.github.io/Xetrieval获取。

英文摘要

Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual rationales, and thus provide limited insight into the latent factors that shape dense retrieval behavior at the embedding level. We propose Xetrieval, an embedding-level mechanistic framework for explaining dense retrieval. Xetrieval first introduces a lightweight reasoning internalizer that approximates Chain-of-Thought reasoning directly in the embedding space with a single forward pass, enriching sentence embeddings with reasoning-oriented information while avoiding expensive autoregressive generation. It then decomposes these reasoning-enhanced embeddings into sparse, human-interpretable features, each associated with a coherent natural language description. By aggregating sparse feature overlaps across multiple document-side views, Xetrieval provides feature-level explanations of individual retrieval decisions. Experiments on diverse retrievers and benchmarks show that Xetrieval uncovers coherent interpretable features, yields stronger pair-level intervention effects, and supports task-level feature steering. The project page and source code are available at https://hihiczx.github.io/Xetrieval.


2605.29502 2026-05-29 cs.CL cs.AI 版本更新

帮助的诅咒：通过 DistractionIF 在干扰指令鲁棒性中的逆缩放定律

Zeli Su, Zhankai Xu, Tianlei Chen, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

Affiliations: Minzu University of China, Beijing, China; Renmin University of China, Beijing, China; Peking University, Beijing, China

AI summary: Proposes the DistractionIF benchmark, finds inverse scaling in large language models' robustness to distractor instructions in reference text, and improves robustness with GRPO reinforcement learning.

AI中文摘要

大型语言模型（LLMs）越来越多地部署在智能体和检索增强生成（RAG）系统中，在这些系统中，它们必须对外部提供的参考文本执行用户指定的任务。实际上，这种上下文通常是非结构化的，并且包含良性的但类似指令的语义噪声，例如编辑评论和系统痕迹，这些应严格视为数据。我们引入了 DistractionIF，这是一个旨在评估对参考文本中此类干扰指令鲁棒性的基准。在广泛模型范围内，我们观察到一致的逆缩放现象：较大的模型通常鲁棒性较差，随着规模增加，性能下降多达 30 个百分点。从机制上讲，我们的困惑度分析表明，缩放侵蚀了鲁棒和受干扰行为之间的概率边界，使模型越来越倾向于将噪声过度解释为指令。为了解决这个问题，我们证明了强化学习，特别是群体相对策略优化（GRPO），可以恢复这一边界，在不损害通用指令遵循能力的情况下，将鲁棒性提高多达 15.5%。我们的发现突显了参考接地任务中关键的指令遵循鲁棒性差距，并确立了强化学习作为在大规模下强制严格数据-指令分离的有前途的途径。

英文摘要

Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute user-specified tasks over externally provided reference text. In practice, such context is often unstructured and contaminated with benign but instruction-like semantic noise, such as editorial comments and system traces, which should be treated strictly as data. We introduce DistractionIF, a benchmark designed to evaluate robustness against such distractor instructions in reference text. Across a broad range of models, we observe a consistent inverse scaling phenomenon: larger models are often less robust, with performance dropping by up to 30 points as scale increases. Mechanistically, our perplexity analysis reveals that scaling erodes the probabilistic boundary between robust and distracted behaviors, making models increasingly prone to over-interpreting noise as instructions. To address this, we demonstrate that reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can restore this boundary, improving robustness by up to 15.5% without compromising general instruction-following capability. Our findings highlight a critical instruction-following robustness gap in reference-grounded tasks and establish reinforcement learning as a promising path for enforcing strict data-instruction separation at scale.


2605.29486 2026-05-29 cs.CL cs.AI cs.LG version update

PhoneWorld: Scaling Phone-Use Agent Environments

Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li, Pengyuan Lyu, Jason, Yiduo Guo, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Huawen Shen, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Rui Yan, Ji-Rong Wen, Chengquan Zhang, Han Hu

Affiliations: Tencent Hunyuan; The Chinese University of Hong Kong, Shenzhen; Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: Proposes PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts, so that phone-agent environments can be built at scale.

Comments: work in progress

Abstract

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.


2605.29478 2026-05-29 cs.NE cs.AI version update

Evolutionary Rule Extraction from Corporate Default Prediction Models

Desirè Fabbretti, Matteo Pasquino, Elia Pacioni, Caterina Lucarelli, Davide Calvaresi

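Affiliations: Department of Management, Università Politecnica delle Marche; HES-SO Valais-Wallis; University of Extremadura

AI summary: Proposes DEXiRE-EVO, an evolutionary rule extraction framework that combines multi-objective optimization with the CIU explainability method to extract economically meaningful rules from machine-learning default prediction models, balancing predictive performance and interpretability.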

Abstract

Small and medium-sized enterprises (SMEs) represent the majority of firms in most economies and often face financial constraints and higher vulnerability to financial distress. Predicting SME default is therefore crucial for financial institutions, policymakers, and researchers. Recent advances in machine learning (ML) have improved predictive performance in credit risk modeling. Yet, the limited interpretability of complex models raises concerns regarding transparency and regulatory compliance. This study investigates predictors of SME default and applies explainable artificial intelligence (XAI) techniques to them. Using a panel of 50,718 Italian SMEs over the period 2015-2024, we compare traditional econometric approaches with several ML classifiers. The empirical results show that ML models significantly outperform the traditional logistic regression benchmark in terms of Balanced Accuracy and PR-AUC. To address the interpretability challenge, we introduce DEXiRE-EVO, a novel evolutionary rule extraction framework that combines multi-objective optimization with the Contextual Importance and Utility (CIU) explainability method. The extracted rules reveal economically meaningful patterns associated with SME financial distress, highlighting the roles of weak internal liquidity generation, internal capital erosion, high leverage, and operational inefficiency. Additionally, contextual macroeconomic conditions and the persistence of financial instability contribute to identifying high-risk firms. In general, the results show that combining ML with evolutionary rule extraction can improve both predictive performance and interpretability in credit risk modeling, thus supporting more transparent, data-driven decision-making in financial environments.


2605.29473 2026-05-29 cs.HC cs.AI cs.CL cs.CY cs.SI version update

Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles

Drishti Goel, Agam Goyal, Veda Duddu, Olivia Pal, Jeongah Lee, Qiuyue Joy Zhong, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

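Affiliations: University of Illinois Urbana-Champaign; University of Massachusetts Amherst; OSF HealthCare; Indiana University Indianapolis

AI summary: Operationalizes four social support roles (Inform, Coach, Relate, Listen) to evaluate the safety profile of large language models in informal caregiving conversations, finding that the support role systematically shapes interactional risk and that there is a perceived quality-safety trade-off.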

Abstract

Language models are increasingly being deployed for conversational support in informal caregiving contexts, where interactions often extend beyond information-seeking: caregivers seek emotional reassurance, guidance, and help, while navigating uncertain, relationally complex care decisions. Yet most safety evaluations assess model behavior under generic prompts, leaving a critical question unexamined: does a model's safety profile change with its support role? We study this by operationalizing four expert-reviewed support roles grounded in social support theory: Inform, Coach, Relate, and Listen, and comparing them against two baseline controls: a basic prompting condition and a retrieval-augmented generation (RAG) condition. We evaluate across three language models (GPT-4o-mini, Llama-3.1-8B-Instruct, and MedGemma-1.5-4b-it) on 5,000 real-world queries from online Alzheimer's Disease and Related Dementias (ADRD) communities. We find that the LLM's support role systematically shapes both the prevalence and composition of interactional risks. Furthermore, a human evaluation study reveals a perceived quality-safety tension: more directive, information-oriented roles are rated as more helpful and trustworthy despite exhibiting elevated interactional risk profiles. We release ~90,000 support role-conditioned model responses with risk annotations as an ecologically grounded resource for research on safer LLM-mediated conversational support.


2605.29468 2026-05-29 cs.CR cs.AI version update

SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing

Almene De Meran Meguimtsop, Maria Leonor Pacheco, Daniel E. Acuna

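Affiliations: Department of Computer Science, University of Colorado Boulder

AI summary: Proposes SciIntBench, an adversarial benchmark of 810 prompts that evaluates framing-sensitive refusal and helpfulness of 16 LLMs across 10 RCR categories, finding that models reliably refuse overt misconduct but under-refuse covert violations, especially pressure-driven shortcuts.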

Abstract

Large language models (LLMs) are increasingly used to support scientific work, but it is unclear whether they uphold responsible conduct of research (RCR) norms or help undermine them. We introduce SciIntBench, an adversarial benchmark of 810 prompts across ten RCR categories and three scientific domains. Each scenario appears as an Overt Adversarial, Covert Adversarial, and Benign version, allowing us to jointly measure framing-sensitive refusal of misconduct and helpfulness on legitimate requests. We evaluate 16 commercial and open-weight LLMs from six providers (2024-2026), producing 12,960 responses. We find that scientific integrity alignment is strongly framing-sensitive: models refuse explicit misconduct far more reliably than covert violations, especially failing when misconduct is presented as a pressure-driven shortcut. Refusals vary by RCR category, with weaker boundaries around transparency, plagiarism, and fabrication.


2605.29467 2026-05-29 cs.LG cs.AI version update

Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference

Mykola Lukashchuk, Kyrylo Yemets, Wouter M. Kouw, Dmitry Bagaev, İsmail Şenöz, Jeff Beck, Bert de Vries

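Affiliations: Eindhoven University of Technology, the Netherlands; Lviv Polytechnic National University, Lviv, Ukraine; Lazy Dynamics, Utrecht, the Netherlands

AI summary: Proposes five factor-graph primitives, proves that any composition of them supports closed-form variational message passing, achieves universal function approximation by stacking routing layers, and applies the framework to time-series forecasting.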

Abstract

Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference can be preserved. We identify five factor-graph primitives: a bilinear factor, an exponential link, a Gamma prior, a Gaussian likelihood, and an equality node, and prove that any model composed from them admits closed-form variational message passing. The construction works because each primitive preserves a small set of message families: under mean-field factorization, messages on Gaussian variables remain Gaussian and messages on precision variables remain Gamma, while the only non-conjugate interface, the exponential link, remains tractable through the Gaussian moment-generating function and the sufficient statistics of the Gamma family. We demonstrate composition at increasing depth, from static ensembles through input-dependent gating to split-branch routing, and show that stacking routing layers encodes arbitrary decision trees, establishing universal function approximation with closed-form inference. Applied to ensemble time-series forecasting, the framework yields a Bayesian mixture of experts in which gating functions are inferred rather than learned, providing calibrated uncertainty over expert selection across five benchmark datasets.
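For readers who want the two standard identities the abstract alludes to, they can be written out as follows. The symbols (mean \mu, variance \sigma^2, Gamma shape a and rate b, digamma function \psi) are notation chosen for this note, not taken from the paper, and how the paper combines them in its message updates is not stated in the abstract:

```latex
% Gaussian moment-generating function, evaluated at t = 1 for an exponential link
x \sim \mathcal{N}(\mu, \sigma^2) \;\Longrightarrow\;
\mathbb{E}\!\left[e^{t x}\right] = \exp\!\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right),
\qquad
\mathbb{E}\!\left[e^{x}\right] = \exp\!\left(\mu + \tfrac{1}{2}\sigma^2\right).

% Expected sufficient statistics of a Gamma distribution (shape a, rate b)
\tau \sim \mathrm{Gamma}(a, b) \;\Longrightarrow\;
\mathbb{E}[\tau] = \frac{a}{b},
\qquad
\mathbb{E}[\log \tau] = \psi(a) - \log b.
```

Because both expectations are available in closed form, an expectation of exp(x) under a Gaussian belief never requires numerical integration, which is presumably what keeps the exponential link tractable.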


2605.29462 2026-05-29 cs.CV cs.AI version update

Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo, Feng Chen, Chi Zhang

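Affiliations: Qwen DianJin Team, Alibaba Cloud Computing

AI summary: Proposes CFMME, a Chinese financial multimodal evaluation benchmark of 6,052 instances covering eight primary financial image modalities and four core multimodal tasks, for evaluating the perception, understanding, reasoning, and cognition capabilities of LVLMs across the full financial business workflow.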

Abstract

The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enabling unified inference across both visual and textual modalities and supporting a broader range of real-world applications. To comprehensively evaluate the perception, understanding, reasoning, and cognition capabilities of LVLMs throughout the entire financial business workflow in Chinese contexts, we introduce CFMME, a novel Chinese financial multimodal evaluation benchmark. CFMME comprises 6,052 instances spanning from fundamental academic knowledge to complex real-world applications, covering eight primary financial image modalities and four core multimodal tasks. On CFMME, we conduct a thorough evaluation of representative LVLMs. The results show that the state-of-the-art model attains an overall accuracy of 66.11% on the question answering task and an average score of 77.18 on the detection, recognition, and information extraction tasks, indicating substantial room for improvement in current LVLMs. In addition, we conduct detailed analyses of error causes, cross-modal capabilities, and multi-orientation settings, yielding valuable insights for future research. We hope that CFMME will spur further progress in LVLMs, especially by improving their performance on multiple multimodal tasks in the financial domain.


2605.29458 2026-05-29 cs.CL cs.AI version update

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

Ruoxi Su, Yuhan Liu, Jingyu Hu

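Affiliations: University of Cambridge; Independent Researcher

AI summary: Proposes an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue and uses the interview transcripts to evaluate how well LLMs simulate individual decisions in moral dilemma scenarios, finding that evidence-grounded reasoning based on follow-up questions markedly improves prediction accuracy.

Comments: 20 pages, 2 figures, 12 tables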

Abstract

Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and contextual cues needed for individual-level decision simulation. We propose an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue: core questions, dynamic follow-ups, and a synthesized personality summary. Using the resulting interview transcripts, we evaluate whether LLMs can simulate participants' decisions in moral dilemma scenarios. We compare three conversational contexts: Core-10 responses, the full interview dialogue, and a summarized persona representation. We find that adaptive interviewing functions less as a uniform accuracy booster and more as a selective grounding mechanism: follow-up-derived evidence is incorporated in around 40% of full-interview traces, and these follow-up-grounded predictions are more accurate than core-only grounded ones (45.5% vs. 39.3%). These findings highlight that richer persona context alone is insufficient: improvements arise only when models actually ground their decisions in user-specific evidence.


2605.29453 2026-05-29 cs.LG cs.AI version update

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment Based on 20,574 Real Sessions

Ningzhi Tang, Chaoran Chen, Gelei Xu, Yiyu Shi, Yu Huang, Collin McMillan, Tao Dong, Toby Jia-Jun Li

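Affiliations: University of Notre Dame; Vanderbilt University; Google

AI summary: Analyzes 20,574 coding-agent sessions, identifies seven recurring forms of misalignment, and finds that most misalignment imposes trust costs rather than system damage, while most cases still require explicit user correction.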

Abstract

AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of 20,574 coding-agent sessions from 1,639 repositories across IDE and CLI workflows. We operationalize misalignment as a breakdown made visible through developer pushback, and annotate each episode along four axes: form, cause, cost, and resolution. We identify seven recurring forms, spanning how agents read projects, interpret developer intent, follow rules, bound their actions, implement and execute code, and report progress. 90.50% of episodes impose effort and trust costs rather than irreversible system damage, yet 91.49% of visible resolutions still require explicit user correction. Misalignment patterns also differ across IDE and CLI settings, persist across adjacent sessions, and shift over time: while overall rates decline, constraint violations and inaccurate self-reporting grow in share. Our findings inform the design of training, evaluation, and interfaces for keeping coding agents aligned with real developer workflows.


2605.29440 2026-05-29 cs.CL cs.AI cs.IR version update

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu, Ming Jin, Xiangyu Zhao, Yilei Shao, Yanfeng Wang, Qingsong Wen

Affiliations: City University of Hong Kong; Squirrel Ai Learning; University of Science and Technology of China; University of California, San Diego; Griffith University; East China Normal University; Shanghai Jiao Tong University

AI summary: Proposes the SkillBrew framework, which models skill bank curation as a Pareto optimization problem under a utility constraint and uses a bi-level propose-then-verify loop to keep the skill bank compact and diverse.

Comments: 16 pages. Preprint. Under review

Abstract

Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fashion, continuously adding new skills without removing redundant, outdated, or harmful ones, resulting in inefficient and poorly curated repositories. In this paper, we formulate the skill bank curation as a constrained multi-objective problem: a desirable bank must be useful for the agent, diverse in its content, and provide good coverage of the query distribution. To this end, we introduce SkillBrew, a multi-objective curation framework that formalizes skill bank curation as Pareto-aware optimization under a utility constraint, and solves it via a bi-level propose-then-verify loop. We evaluate our approach on two public benchmarks. Our findings suggest that treating skill banks as objects of principled curation, rather than ever-growing append-only logs, is an important step toward building self-improving LLM agents.
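The abstract does not spell out the optimization procedure, so the following is only a generic illustration of what "Pareto-aware selection under a utility constraint" can mean, not SkillBrew's algorithm. The candidate banks, their scores, and the field names `utility`, `diversity`, and `coverage` are all hypothetical: candidates below a utility floor are discarded, and the remainder are kept only if no other candidate is at least as good on both diversity and coverage and strictly better on one.

```python
def dominates(a, b):
    """True if bank `a` is at least as good as `b` on both objectives
    and strictly better on at least one."""
    no_worse = a["diversity"] >= b["diversity"] and a["coverage"] >= b["coverage"]
    better = a["diversity"] > b["diversity"] or a["coverage"] > b["coverage"]
    return no_worse and better


def pareto_front(candidates, min_utility):
    """Keep feasible candidates (utility floor) that no feasible candidate dominates."""
    feasible = [c for c in candidates if c["utility"] >= min_utility]
    return [c for c in feasible if not any(dominates(o, c) for o in feasible)]


# Hypothetical candidate skill banks with made-up scores.
candidates = [
    {"name": "A", "utility": 0.8, "diversity": 0.6, "coverage": 0.7},
    {"name": "B", "utility": 0.9, "diversity": 0.5, "coverage": 0.6},  # dominated by A
    {"name": "C", "utility": 0.4, "diversity": 0.9, "coverage": 0.9},  # below the utility floor
    {"name": "D", "utility": 0.7, "diversity": 0.8, "coverage": 0.5},
]
front_names = [c["name"] for c in pareto_front(candidates, min_utility=0.5)]
print(front_names)  # ['A', 'D']
```

Treating utility as a hard constraint rather than a third objective means a highly diverse but unhelpful bank (C above) can never be selected.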


2605.29434 2026-05-29 cs.CR cs.AI cs.CL cs.LG version update

AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing

Yuexin Li, Wenjie Qu, Linyu Wu, Yulin Chen, Yufei He, Tri Cao, Bryan Hooi, Jiaheng Zhang

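Affiliations: National University of Singapore

AI summary: Proposes the AliMark framework, which reformulates sentence-level watermarking as a bit sequence encoding and alignment problem and uses a multi-candidate alignment detection strategy to improve robustness to structural perturbations such as sentence splitting and merging.

Comments: Accepted by ICML 2026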

Abstract

Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, their prefix-based designs remain vulnerable to structural perturbations, such as sentence splitting and merging, which commonly arise under strong paraphrasers like DIPPER and GPT-3.5. To mitigate this issue, we propose AliMark, a framework that reformulates sentence-level watermarking as a bit sequence encoding and alignment problem between a potentially watermarked text and a secret bit sequence. Notably, our approach adopts a two-stage detection strategy: we generate multiple restructured text variants and adaptively align their extracted bit sequences with the secret bit sequence to minimize alignment cost. This multi-candidate alignment design naturally improves robustness to sentence merges and splits. Extensive experiments demonstrate that AliMark substantially outperforms state-of-the-art baselines under diverse paraphrasing attacks.
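The abstract does not define the alignment cost, so the sketch below substitutes plain Levenshtein (edit) distance as a stand-in and should not be read as AliMark's detector. It assumes each restructured text variant has already been turned into a bit string by some extractor (not shown), and it picks the variant whose bits align most cheaply with the secret sequence. The bit strings are made up for illustration.

```python
def alignment_cost(extracted, secret):
    """Levenshtein distance between two bit strings.

    Insertions and deletions loosely model sentence splits and merges,
    substitutions model flipped bits; all three cost 1 in this sketch.
    """
    prev = list(range(len(secret) + 1))
    for i, a in enumerate(extracted, start=1):
        curr = [i]
        for j, b in enumerate(secret, start=1):
            curr.append(min(
                prev[j] + 1,             # drop a bit from `extracted`
                curr[j - 1] + 1,         # insert a bit of `secret`
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def best_alignment(candidates, secret):
    """Return (cost, bits) for the candidate with the lowest alignment cost."""
    return min((alignment_cost(c, secret), c) for c in candidates)


secret = "10110"
# Hypothetical bit strings extracted from three restructured variants of one text.
candidates = ["0000", "10010", "1111111"]
cost, best = best_alignment(candidates, secret)
print(cost, best)  # 1 10010
```

A detector built this way would compare the minimum cost against a threshold; choosing that threshold is outside the scope of this sketch.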


2605.29430 2026-05-29 cs.AI cs.CL version update

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

Affiliations: College of Artificial Intelligence, Xi’an Jiaotong University; X-LANCE Lab, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; The Chinese University of Hong Kong, Shenzhen; Fudan University; Tongyi Fun Team, Alibaba Group

AI summary: Proposes Agentic ASR, a closed-loop framework that reduces semantic errors through multi-turn interaction and semantic correction, and introduces the Sentence-level Semantic Error Rate (S^2ER) as an evaluation metric.

Abstract

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S^2ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S^2ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/


2605.29428 2026-05-29 astro-ph.EP astro-ph.IM cs.AI version update

DELOS: Detecting Shallow Transits in Kepler Photometry Using a Contrastive-Learning Framework

Qingtian Liu, Jian Ge, XingChen Yan, Kevin Willis, Xinyu Yao, QuanQuan Hu, Jiapeng Zhu

Affiliations: Shanghai Astronomical Observatory, Chinese Academy of Sciences, Shanghai 200030, China; School of Astronomy and Space Sciences, University of Chinese Academy of Sciences, Beijing 101408, China; Science Talent Training Center, Gainesville, FL, 32606 USA

AI summary: Proposes DELOS, a contrastive-learning-based framework that detects low-SNR shallow transits using GPU-accelerated folding and a convolutional encoder, outperforming BLS and TLS.

Comments: 25 pages, 19 figures, 1 table, submitted to ApJ

Abstract

We present DEtection in phase-folded Light curves with cOntrastive Scoring (DELOS), a contrastive-learning-based framework designed to search for shallow transits in Kepler photometry. DELOS combines GPU-accelerated phase folding, optimized phase binning, and a custom one-dimensional convolutional encoder to assign a transit-likeness score to each folded light curve, thereby producing a score periodogram over trial periods without relying on pre-detected threshold-crossing events. Focusing on intermediate-to-long-period signals with orbital periods of 100-150 days, DELOS was trained on 20 million synthetic light curves generated with realistic transit models and Kepler-like noise properties, achieving a validation accuracy of 99.3 percent on the synthetic validation set. In controlled injection-recovery experiments, DELOS improves the combined precision-recall performance by 15.5 percent relative to Box-fitting Least Squares (BLS) and 11.25 percent relative to Transit Least Squares (TLS) in the low Signal-to-Noise Ratios (low-SNR) regime. It also accelerates the search by factors of approximately 3-5 and 74-80 compared with BLS and TLS, respectively. Applied to a selected Kepler validation sample, DELOS recovered all known shallow intermediate-to-long-period transit signals in the tested period range. These results demonstrate that DELOS provides an efficient and sensitive framework for low-SNR transit searches and represents a practical step toward future searches for longer-period terrestrial planets in Kepler, K2, TESS, PLATO, and Earth 2.0 data. Accordingly, this work is intended as a methodological development and validation study, with the detailed astrophysical validation of newly identified candidates deferred to future work.
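Phase folding and phase binning, the preprocessing steps named in the abstract, can be illustrated on a toy light curve. This is a plain-Python sketch of those two steps only; it is not GPU-accelerated, it does not include the encoder or the contrastive scoring, and the cadence, transit depth, and bin count are made-up values.

```python
def phase_fold(times, flux, period, t0=0.0):
    """Map each timestamp to an orbital phase in [0, 1) and sort by phase."""
    phases = [((t - t0) / period) % 1.0 for t in times]
    order = sorted(range(len(times)), key=lambda i: phases[i])
    return [phases[i] for i in order], [flux[i] for i in order]


def bin_phases(phases, flux, n_bins):
    """Average the flux inside equal-width phase bins (NaN for empty bins)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, f in zip(phases, flux):
        k = min(int(p * n_bins), n_bins - 1)
        sums[k] += f
        counts[k] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Toy light curve: daily samples over 480 days, with a shallow dip
# once every 120 days (a hypothetical transit at phase 0).
period = 120.0
times = [float(t) for t in range(480)]
flux = [0.999 if t % 120 == 0 else 1.0 for t in range(480)]

phases, folded = phase_fold(times, flux, period)
binned = bin_phases(phases, folded, n_bins=12)
print(binned.index(min(binned)))  # 0: the dips stack up in the first phase bin
```

Folding at the correct trial period makes the four separate dips coincide in phase, which is why a score computed on the folded curve can peak at that period.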


2605.29425 2026-05-29 cs.AI version update

ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control

Aoyu Pang, Maonan Wang, Yuejiao Xie, Chung Shue Chen, Zhiwei Yang, Man-On Pun

Affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China; Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Hong Kong; Shanghai AI Laboratory, Shanghai, China; Nokia Bell Labs, Paris-Saclay, France

AI summary: Proposes the ReasonLight framework, which enhances reinforcement learning with a multimodal foundation model and uses roadside sensor and camera data to adapt zero-shot to rare traffic events, substantially reducing emergency vehicle waiting time.

Abstract

Reinforcement learning (RL) has shown promise in traffic signal control (TSC). However, its reliance on predefined states limits responsiveness to observable open-world events that are absent from training data. IoT-enabled intersections provide heterogeneous observations from roadside sensors and cameras, creating opportunities to improve RL adaptability to such events. To this end, we propose ReasonLight, a multimodal foundation model-enhanced RL framework for zero-shot TSC. ReasonLight integrates three sources of information: structured traffic measurements, multi-view camera observations, and candidate phase decisions from a pre-trained RL controller. Given an RL-proposed phase, ReasonLight extracts visual semantics from multi-view images and aligns them with compact sensor-derived scene descriptions. This alignment enables a semantic-guided refinement module to either preserve or adjust the proposed action according to traffic rules and event semantics. To ensure operational reliability, refined actions are constrained by the set of available phases. Any invalid decision is rejected, and the system falls back to the original RL action. We evaluate ReasonLight on two types of rare events not seen during RL training: emergency vehicle priority and temporary traffic regulation. Experimental results show that ReasonLight achieves zero-shot adaptation without retraining. It reduces emergency vehicle waiting time by up to 88.7% compared with the RL-only backbone while preserving comparable routine traffic performance.
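The reliability rule described in the abstract (a refined action must belong to the set of available phases, otherwise the system falls back to the RL action) is simple enough to sketch directly. The function name and the phase labels below are hypothetical; how the refinement module produces its suggestion is not shown.

```python
def select_phase(rl_phase, refined_phase, available_phases):
    """Accept the refined phase only if it is a valid phase; otherwise
    fall back to the phase proposed by the RL controller."""
    if refined_phase in available_phases:
        return refined_phase
    return rl_phase


# Hypothetical phase set for a four-phase intersection.
available = {"NS_through", "EW_through", "NS_left", "EW_left"}

# Valid refinement: the adjusted phase is used.
print(select_phase("NS_through", "EW_through", available))    # EW_through
# Invalid refinement: the decision is rejected and the RL phase is kept.
print(select_phase("NS_through", "all_red_hold", available))  # NS_through
```

Keeping the check outside the foundation model means an unparseable or hallucinated suggestion can at worst leave the controller at its RL-only behavior.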


2605.29420 2026-05-29 cs.AI cs.LG version update

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu, Xinjie He, Zhiyuan Lin, Qiyang Xie

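Affiliations: Independent Researchers

AI summary: Compares four prompting conditions on 1,140 open-ended questions and finds that role prompting systematically increases expertise depth while reducing clarity, that its effect depends strongly on question type and domain, and that hybrid retrieval outperforms embedding-only retrieval.

Comments: 6 pages, 2 figures. Submitted for peer review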

AI中文摘要

角色提示被广泛用于引导大型语言模型，但其实际价值仍不明确。先前的工作通常使用聚合分数评估角色提示，难以确定专家角色提示是否一致地提高响应质量，或者是否沿着不同的质量维度改变响应。我们通过对比四种提示条件在涵盖38个专家角色和六个领域的1140个开放式问题上的表现来研究这个问题：无角色提示、通用领域专家提示、基于嵌入的角色检索，以及结合嵌入搜索和基于LLM的角色选择的混合检索方法。聚合结果显示各条件之间总体差异很小。然而，度量级分析揭示了一个聚合平均值掩盖的一致权衡：角色提示系统性地增加了专家深度，同时降低了清晰度。这些效果高度有条件而非普遍。角色提示在咨询类问题以及医学和心理学等领域表现最佳，在这些领域中，结构化的专家框架和风险沟通具有内在价值。相比之下，基线提示在金融、法律、科学和技术领域的概念性和解释性问题中表现更好，在这些领域中，简洁的平实语言解释更为重要。我们进一步表明，混合检索显著优于纯嵌入角色选择，尽管更好的角色检索并不能消除更广泛的专家深度与清晰度之间的权衡。总体而言，我们的发现表明，角色提示主要重塑响应特征而非广泛提升能力，并且多度量评估对于理解其效果是必要的。

English abstract

Persona prompting is widely used to steer large language models, yet its practical value remains unclear. Prior work often evaluates persona prompting using aggregate scores, making it difficult to determine whether expert-role prompting consistently improves response quality or instead changes responses along different quality dimensions. We study this question through a controlled comparison of four prompting conditions across 1,140 open-ended questions spanning 38 expert roles and six domains: no role prompt, a generic domain-expert prompt, embedding-based role retrieval, and a hybrid retrieval method combining embedding search with LLM-based role selection. Aggregate results show only small overall differences between conditions. However, metric-level analysis reveals a consistent tradeoff that aggregate averages obscure: role prompting systematically increases expertise depth while reducing clarity. These effects are highly conditional rather than universal. Role prompting performs best on advisory questions and in domains such as medicine and psychology, where structured expert framing and risk communication are intrinsically valuable. In contrast, baseline prompting performs better on conceptual and explanatory questions in finance, legal, science, and technology domains, where concise plain-language explanation is more important. We further show that hybrid retrieval significantly improves over embedding-only role selection, although better role retrieval does not eliminate the broader expertise-depth versus clarity tradeoff. Overall, our findings suggest that persona prompting primarily reshapes response characteristics rather than broadly improving capability, and that multi-metric evaluation is necessary for understanding its effects.

2605.29414 2026-05-29 cs.CL cs.AI 版本更新

Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning

超越双语迁移：指令微调中的多语言代码切换

Shunta Asano, Jeonghun Baek, Toshihiko Yamasaki

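Affiliations: The University of Tokyo

AI summary: Using sentence-level multilingual code-switching across four languages in instruction tuning, the study shows that multilingual code-switching effectively improves the multilingual understanding of large language models, going beyond conventional bilingual transfer settings.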

2605.29411 2026-05-29 cs.LG cs.AI stat.ME stat.ML 版本更新

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

马尔可夫边界在表格预测中的好、坏与丑

Shu Wan, Abhinav Gorantla, Huan Liu, K. Selçuk Candan

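Affiliations: Arizona State University

AI summary: Studies the practical utility of the Markov boundary for tabular prediction, finding that the theoretically optimal boundary improves prediction only conditionally in practice and that causal discovery methods struggle to realize its potential.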

Comments 11 pages, 9 figures, 2 tables. Preprint

AI Chinese abstract

在标准图形假设下，目标变量的马尔可夫边界是使所有其他特征冗余的最小特征集。一旦观察到边界，目标变量与表格的其余部分条件独立。这对于表格预测来说是一个诱人的对象，因为它恰好指出了模型所需的列。然而，现代回归器仍然在完整特征集上训练。我们询问马尔可夫边界是否在SCM3K（一个包含3450个任务的合成SCM基准，特征数量从40到1000，涵盖六个SCM家族）上对预测真正有用，并使用六个回归器进行评估。答案比理论所暗示的要微妙得多。将回归器限制在oracle边界上通常会显著改善预测，并且随着特征空间变得更大更稀疏，改善程度增加。但是，通过因果发现恢复边界并在恢复的掩码上训练的自然流程并不奏效。现有的估计器在达到边界最有帮助的区域之前就耗尽了计算预算，即使它们运行，也很少能击败完整特征集。我们将此归因于三个原因。发现优化的是结构恢复而非预测。假阴性和假阳性具有高度不对称的预测成本。精确边界只是众多击败所有特征的特征集之一。然后，我们阐述了这些事实对于预测对齐的特征选择以及学习使用因果结构的表格模型的意义。

English abstract

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.
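
As a small illustration of the object this abstract studies: in a DAG, under the usual faithfulness assumption, the Markov boundary of a target consists of its parents, its children, and the other parents of its children. The sketch below computes that set from an edge list. The toy graph and the function name `markov_boundary` are hypothetical and are not taken from the paper or from SCM3K.

```python
def markov_boundary(edges, target):
    """Return parents, children, and co-parents (spouses) of `target`.

    `edges` is an iterable of (parent, child) pairs describing a DAG.
    """
    parents = {p for p, c in edges if c == target}
    children = {c for p, c in edges if p == target}
    # Other parents of the target's children ("spouses").
    spouses = {p for p, c in edges if c in children and p != target}
    return parents | children | spouses


# Hypothetical toy graph: a -> y, y -> c, s -> c, a -> n, n -> m
toy_edges = [("a", "y"), ("y", "c"), ("s", "c"), ("a", "n"), ("n", "m")]
boundary = markov_boundary(toy_edges, "y")
print(sorted(boundary))
```

Here `n` and `m` fall outside the boundary of `y`, which is the sense in which the boundary "names exactly the columns a model should need".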

2605.29402 2026-05-29 cs.CV cs.AI 版本更新

对齐但脆弱：通过零阶优化增强LLM安全鲁棒性

Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu

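Affiliations: The State Key Laboratory of Blockchain and Data Security; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; Sun Yat-sen University; KAUST

AI summary: Addressing the fragility of safety-aligned LLMs under lightweight post-alignment manipulations (such as parameter noise, activation noise, or quantization), the paper proposes a hybrid framework based on zeroth-order optimization: standard first-order safety alignment followed by zeroth-order refinement to improve robustness, with perturbation-based evaluations used to estimate layer-wise robustness sensitivity so that updates concentrate on the critical layers.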

AI Chinese abstract

大语言模型的安全对齐旨在减少有害或不安全行为，同时保持通用效用。然而，最近的研究发现对齐效果可能是脆弱的：轻量级的对齐后操作，如参数噪声、激活噪声或量化，很容易削弱预期的安全行为。先前提高鲁棒性的努力主要集中在数据整理、修改对齐目标和识别安全关键参数上，而优化器本身的作用在很大程度上未被探索。在本文中，我们首次从基础优化器的角度研究安全对齐的鲁棒性。这种以优化器为中心的视角自然地指向零阶优化，它通过评估扰动下的安全对齐来提供面向鲁棒性的信号。基于这一见解，我们提出了一个混合框架，首先执行标准的一阶安全对齐，然后应用零阶精炼来提高鲁棒性。从理论和实证上，我们表明仅需少量零阶精炼步骤即可增强鲁棒性，同时保持安全对齐。我们进一步通过利用其固有的基于扰动的评估来估计逐层鲁棒性敏感性，从而提高零阶精炼的效率，使精炼过程能够以适度的训练开销将更新集中在鲁棒性关键层上。

English abstract

Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.

2605.29394 2026-05-29 cs.AI 版本更新

EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics

EvoMD-LLM：学习反应分子动力学中物种进化的语言

Zhichen Tang, Zhengzheng Dang, Yulin Chen, Jixin Wu, Haiwen Li, Yanming Wang

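Affiliations: Global College, Shanghai Jiao Tong University; Global Institute of Future Technology, Shanghai Jiao Tong University

AI summary: Proposes EvoMD-LLM, a framework that discretizes reactive molecular dynamics trajectories into symbolic temporal sequences and uses temporal scaffolding so that autoregressive LLMs learn the evolution of species composition; it outperforms baselines on several temporal prediction tasks and can generate interpretable predictions.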

Comments 17 pages, ACL Findings

AI Chinese abstract

虽然大型语言模型（LLM）在静态科学推理方面表现出色，但它们在建模动态物理过程的时间结构方面存在困难。我们提出了EvoMD-LLM（进化分子动力学大型语言模型），这是一个将物种级分子动力学重新表述为符号时间语言建模问题的框架。反应分子动力学轨迹被离散化为分子事件序列，其中每个标记代表一个化学物种及其持续时间，通过高效微调使标准自回归LLM能够学习随时间的组成演化。EvoMD-LLM的一个关键组成部分是时间脚手架，它将事件持续时间视为显式语言标记，并作为结构化归纳偏置，与传统的序列建模方法相比，显著减少了无效或幻觉的分子输出。我们在多个时间预测任务上评估了EvoMD-LLM，达到了高达66.14%的准确率，并始终优于序列神经网络和基于语言的基线。除了定量改进，我们定性地观察到，该模型能够通过结合相关化学知识为其预测生成解释，尽管它没有经过配对轨迹-解释数据的显式监督。这些结果表明，符号时间语言建模为将LLM应用于动态物理模拟提供了有效框架。

English abstract

While large language models (LLMs) excel at static scientific reasoning, they struggle to model the temporal structure of dynamic physical processes. We present EvoMD-LLM (Evolutionary Molecular Dynamics Large Language Model), a framework that reformulates species-level molecular dynamics as a symbolic temporal language modeling problem. Reactive MD trajectories are discretized into sequences of molecular events, where each token represents a chemical species augmented with its persistence duration, enabling standard autoregressive LLMs to learn compositional evolution over time through efficient fine-tuning. A key component of EvoMD-LLM is temporal scaffolding, which treats event duration as an explicit linguistic token and serves as a structured inductive bias, significantly reducing invalid or hallucinated molecular outputs compared to conventional sequence modeling approaches. We evaluate EvoMD-LLM on multiple temporal prediction tasks, achieving up to 66.14% accuracy and consistently outperforming sequential neural networks and language-based baselines. Beyond quantitative improvements, we qualitatively observe that the model is capable of generating interpretations for its own predictions by incorporating relevant chemical knowledge, even though it was not explicitly supervised with paired trajectory-explanation data. These results demonstrate that symbolic temporal language modeling provides an effective framework for grounding LLMs in dynamic physical simulations.

2605.29387 2026-05-29 cs.LG cs.AI stat.ML 版本更新

On the Optimizer Dependence of Neural Scaling Laws

神经缩放定律的优化器依赖性

Vansh Ramani, Shourya Vir Jain

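Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Delhi

AI summary: Through random-feature regression experiments, the paper finds that the choice of optimizer systematically affects the scaling exponent α in neural scaling laws, with preconditioned optimizers yielding steeper scaling, and provides a spectral diagnostic for predicting when advanced optimizers pay off.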

AI Chinese abstract

神经缩放定律 $L(N) \propto N^{-α}$ 中的缩放指数 $α$ 通常被视为由架构和数据确定的固定常数。我们提出证据表明 $α$ 系统性地依赖于优化器。在受控的随机特征回归实验——神经缩放的理论框架——中，我们测量了五种优化器变体和六种光谱条件下的 $α$。预条件优化器一致地产生更陡峭的缩放（更大的 $α$），且 $α$ 的偏移在大部分测试光谱范围内增加，在 $s = 1.5$ 附近达到峰值，并在 $s = 2.0$ 时保持较大。在 $s \approx 1.0$（自然语言的特征）时，完全自然梯度达到 $α\approx 0.31$，而梯度下降为 $α\approx 0.12$——拟合指数大 $2.6$ 倍，在随机特征模型中，该差异随模型规模加倍而累积。这种指数偏移是否以及如何迁移到大规模 LLM 训练中——近期证据表明优势可能随规模减弱——仍是一个重要的开放问题。我们的结果表明，缩放定律预测应考虑优化器选择，并且我们提供了一个光谱诊断来预测高级优化器何时会带来收益。

English abstract

The scaling exponent $α$ in neural scaling laws $L(N) \propto N^{-α}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $α$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $α$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $α$), with the $α$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $α\approx 0.31$ versus $α\approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
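
To make the compounding claim concrete: under $L(N) \propto N^{-α}$, doubling $N$ multiplies the loss by $2^{-α}$. The short calculation below plugs in the two fitted exponents quoted in the abstract. The helper names are ours, and the numbers describe only the idealized power law, not any measured training run.

```python
def loss_ratio_per_doubling(alpha: float) -> float:
    """Factor by which L(N) proportional to N**(-alpha) changes when N doubles."""
    return 2.0 ** (-alpha)


def loss_ratio_after(alpha: float, doublings: int) -> float:
    """Cumulative loss factor after the given number of model-size doublings."""
    return loss_ratio_per_doubling(alpha) ** doublings


# Fitted exponents quoted in the abstract at s ~ 1.0.
ALPHA_NATURAL_GRADIENT = 0.31
ALPHA_GRADIENT_DESCENT = 0.12

exponent_ratio = ALPHA_NATURAL_GRADIENT / ALPHA_GRADIENT_DESCENT
print(f"exponent ratio: {exponent_ratio:.2f}")
for k in (1, 5, 10):
    ng = loss_ratio_after(ALPHA_NATURAL_GRADIENT, k)
    gd = loss_ratio_after(ALPHA_GRADIENT_DESCENT, k)
    print(f"{k:2d} doublings: natural gradient x{ng:.3f}, gradient descent x{gd:.3f}")
```

The exponent ratio rounds to the abstract's 2.6, and the gap between the two loss factors widens with every doubling, which is what "compounds" refers to.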

2605.29384 2026-05-29 cs.IR cs.AI cs.CL 版本更新

MiraBench: 评估机器人世界模型中的动作条件可靠性

Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang, Jiayi Zhou, Jiaming Ji, Juntao Dai, Jiawei Chen, Boyuan Chen, Yaodong Yang

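Affiliations: Institute for Artificial Intelligence, Peking University

AI summary: Proposes the MiraBench benchmark, which evaluates the action-conditioned reliability of robotic world models at three levels (physics adherence, action-following fidelity, and optimism bias detection), and finds that visual fidelity does not reflect action fidelity, that scaling up models does not guarantee better action following, and that optimism bias is pervasive.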

AI Chinese abstract

动作条件世界模型越来越多地被用作机器人学习的可扩展模拟器，但当前的评估对其在条件动作下预测的可靠性提供的证据有限。现有基准主要强调视觉保真度，未明确预测的未来是否物理上合理、是否忠实于命令动作，以及在动作不应成功时是否校准到失败。我们引入了MiraBench，一个分层基准，将动作条件可靠性定义为机器人世界模型的核心评估目标。MiraBench将此目标分解为三个逐步严格层次：物理一致性，评估无参考的物理一致性；动作跟随保真度，衡量预测是否尊重任务相关动作输入；以及乐观偏差检测，探测在导致失败的动作下预测成功结果的倾向。为支持此评估，我们整理了一个人工标注语料库，包含跨任务、失败类别和领先世界模型的超过16,000个判断。我们评估了12种代表性模型配置，涵盖向量条件机器人世界模型、文本条件生成世界模型、开源系统、闭源系统和多种模型规模。在这一广泛的模型景观中，MiraBench揭示了三个核心发现：视觉保真度是动作保真度的糟糕代理；增加模型规模并不能可靠地改善动作跟随；乐观偏差在现有系统中普遍存在。通过将评估从外观转向动作条件可靠性，MiraBench为评估和改进机器人世界模型作为忠实模拟器提供了诊断基础。

English abstract

Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action-conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference-free physical consistency; Action-Following Fidelity, which measures whether predictions respect task-relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.

2605.29359 2026-05-29 cs.CY cs.AI 版本更新

Does Distributed Training Undermine Compute Governance?

分布式训练是否会破坏计算治理？

Robi Rahman

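Affiliations: Machine Intelligence Research Institute

AI summary: Examines whether distributed training techniques could be used to evade compute governance, and proposes countermeasures including whistleblowing, chip tracking, forensic accounting, and thresholds on cluster memory and compute.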

Comments TAIGR workshop in ICML 2026

2605.29358 2026-05-29 cs.AI 版本更新

通过参考数据集的几何结构重新思考FID

Yunghee Lee, Byeonghyun Pak

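AI summary: Explains inconsistencies between Fréchet Inception Distance (FID) and sample quality by analyzing geometric properties of the reference dataset (density and effective rank), and argues that generative models are evaluated more reliably when the geometry of the reference dataset is taken into account.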

Comments 9 pages, 2 figures. Accepted to ICML 2026 Workshop: Combining Theory and Benchmarks

2605.29310 2026-05-29 cs.AI cs.CL 版本更新

Rubric-Guided Process Reward for Stepwise Model Routing

基于评分准则的逐步模型路由过程奖励

Shenghao Ye, Yu Guo, Zhengheng Li, Shuangwu Chen, Jian Yang

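Affiliations: University of Science and Technology of China; Southeast University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

AI summary: Proposes the RoRo framework, which collects routing trajectories, constructs preference pairs, trains a Rubricor to generate evaluation rubrics and a Judge to score trajectories, and combines process and outcome rewards to optimize the routing policy, improving the accuracy and cost efficiency of stepwise routing for large reasoning models.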

Comments 17 pages, 9 figures, submitted to EMNLP 2026

AI Chinese abstract

逐步模型路由通过将每个推理步骤分配给合适的模型来提高大型推理模型（LRM）的效率。最近的方法将路由建模为顺序决策过程，并使用强化学习训练路由器。然而，尽管它们将路由建模为一个过程，但仍然使用结果奖励来监督路由器。这种奖励仅反映最终答案的正确性，未能评估中间路由决策，这可能会削弱性能和泛化能力。为了解决这一差距，我们提出了RoRo，一种基于评分准则的逐步模型路由过程奖励框架。RoRo首先收集多样化的路由轨迹，并基于结果、成本和过程质量构建偏好对。然后，它通过交替优化训练一个Rubricor来生成查询特定的评估准则，以及一个Judge来在此准则下对路由轨迹进行评分。由此产生的过程奖励与结果奖励相结合，通过GRPO优化路由策略。在五个推理基准上的实验，无论是在同族还是跨族设置下，都表明RoRo始终优于强基线，并实现了更好的准确性和成本权衡。

English abstract

Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.
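
The abstract says that process rewards are combined with outcome rewards and optimized via GRPO, but it does not say how the two are combined. The sketch below is therefore only an assumption-laden illustration: it uses a weighted sum (the `process_weight` parameter is hypothetical) and the standard group-relative normalization used in GRPO, where each trajectory's reward is centered and scaled by the statistics of its sampling group. All reward values are made up.

```python
from statistics import mean, pstdev


def combined_reward(outcome, process, process_weight=0.5):
    """Hypothetical combination: outcome reward plus weighted process reward."""
    return outcome + process_weight * process


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled trajectories."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the exact variant is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up group of four routing trajectories for one query.
outcomes = [1.0, 0.0, 1.0, 0.0]        # final answer correct or not
process_scores = [0.8, 0.6, 0.2, 0.1]  # e.g. a rubric-based judge score
rewards = [combined_reward(o, p) for o, p in zip(outcomes, process_scores)]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

With outcome rewards alone, trajectories 0 and 2 would receive the same advantage; adding a process term separates them, which is the gap the abstract describes.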

2605.29307 2026-05-29 cs.CL cs.AI cs.IR cs.LG 版本更新

GrepSeek: Training Search Agents for Direct Corpus Interaction

GrepSeek：训练用于直接语料库交互的搜索代理

Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani

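Affiliations: University of Massachusetts Amherst; Princeton University; Carnegie Mellon University

AI summary: Proposes GrepSeek, which trains a compact search agent to interact directly with a text corpus through shell commands, using two-stage training (a cold-start dataset followed by GRPO optimization) and a semantics-preserving sharded-parallel execution engine, and achieves the best F1 and Exact Match on open-domain question answering.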

AI Chinese abstract

大型语言模型（LLM）搜索代理通过多轮推理和信息检索，在知识密集型语言任务中展现出强大潜力。大多数现有系统使用检索器，该检索器接收关键词或自然语言查询，并利用预计算文档表示的索引返回排序后的文档列表。在本工作中，我们探索了一种互补视角，其中搜索代理将语料库本身视为搜索环境，并通过执行可执行的shell命令来寻找证据。我们引入了GrepSeek，一种优化的直接语料库交互（DCI）搜索代理，它训练一个紧凑的搜索代理从大型文本语料库中查找、过滤和组合证据。为了解决在大语料库上直接使用强化学习进行学习行为的不稳定性，我们提出了一种两阶段训练流程。首先，我们使用答案感知的Tutor和答案盲的Planner构建冷启动数据集，生成经过验证的、因果基础的搜索轨迹。其次，我们使用组相对策略优化（GRPO）优化初始化的策略，使代理能够通过与语料库的直接交互来改进其任务导向的搜索行为。为了使DCI在大规模下实用，我们进一步使用语义保持的分片并行执行引擎，该引擎将基于shell的检索加速高达7.6倍，同时保持与shell命令顺序执行的字节精确等价。在七个开放域问答基准上的实验表明，GrepSeek在整体词元级F1和精确匹配上取得了最强性能。我们的分析还揭示了纯粹词汇交互在具有显著表面形式变化的查询上的局限性，表明DCI作为搜索代理的一种实用且具有竞争力的方法，可以在现实世界中补充现有的检索范式。

English abstract

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.

2605.29303 2026-05-29 cs.AI 版本更新

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

基于熵-KL散度的令牌掩码：一种用于大语言模型选择性微调的新方法

Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen

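Affiliations: University of Science and Technology of China; Huawei Cloud

AI summary: To address the distribution shift caused by standard supervised fine-tuning in low-data regimes, the paper proposes EKSFT, which selectively masks tokens with high entropy or high KL divergence, injecting task knowledge while preserving the pre-trained distribution; it outperforms standard SFT on mathematical reasoning benchmarks and improves subsequent RL performance.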

Comments 17 pages

AI Chinese abstract

监督微调（SFT）后接强化学习（RL）已成为大语言模型的标准后训练范式。该范式为RL探索提供了冷启动，避免了纯RL中在线采样产生不足正样本的低效问题。然而，在实践中，现有方法通常使用少量数据进行SFT初始化（相比RL阶段），这可能导致模型拟合有限样本并偏离其预训练分布。这种分布偏移阻碍了模型在后续RL训练中有效探索的能力。为解决这一挑战，我们提出在低数据场景下，SFT应优先激活任务相关能力而非记忆特定内容。沿着这一思路，我们提出EKSFT（熵-KL选择性微调），该方法选择性掩码那些相对于参考模型表现出高熵或高KL散度的令牌。通过排除这些高不确定性、分布偏移的令牌进行模仿，EKSFT在注入任务特定知识的同时保持了模型预训练分布的完整性。在数学推理基准上的实证评估表明，EKSFT始终优于标准SFT。从EKSFT模型进行进一步的RL微调可获得一致更好的后RL性能，表明RL阶段的探索得到了改善。我们的代码和数据集可在https://github.com/MINE-USTC/EKSFT获取。

English abstract

Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold-start for RL exploration, avoiding the inefficiency of pure RL where on-policy sampling yields insufficient positive samples. However, in practice, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, which can cause the model to fit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to effectively explore during subsequent RL training. To address this challenge, we propose that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration for the RL stage. Our codes and datasets are available at https://github.com/MINE-USTC/EKSFT.
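
The masking rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify which distribution's entropy is used, the direction of the KL divergence, or the thresholds, so the sketch assumes the entropy of the model's own next-token distribution, KL(model || reference), and arbitrary hypothetical thresholds. The function names and toy distributions are made up.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def imitation_mask(model_dists, ref_dists, entropy_threshold, kl_threshold):
    """Per token: True keeps it in the SFT loss, False masks it out."""
    mask = []
    for p, q in zip(model_dists, ref_dists):
        high_entropy = entropy(p) > entropy_threshold
        high_kl = kl_divergence(p, q) > kl_threshold
        mask.append(not (high_entropy or high_kl))
    return mask


# Toy 3-token vocabulary, three positions (all values hypothetical).
model_dists = [[0.90, 0.05, 0.05], [0.34, 0.33, 0.33], [0.90, 0.05, 0.05]]
ref_dists = [[0.85, 0.10, 0.05], [0.34, 0.33, 0.33], [0.05, 0.90, 0.05]]
mask = imitation_mask(model_dists, ref_dists, entropy_threshold=0.8, kl_threshold=0.5)
print(mask)
```

In this toy case the first position is kept, the second is masked for high entropy, and the third is masked for high KL divergence. A training loop would then average the token-level loss over the kept positions only.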

2605.29300 2026-05-29 cs.CL cs.AI cs.SD 版本更新

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

MusTBENCH：音乐大语言模型中的时间定位基准与推进

Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo, Shuyang Cui, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

Affiliations: Seoul National University; Sony Group Corporation; Sony AI

AI summary: Proposes the MusTBENCH benchmark and MusT, a four-stage optimization recipe, to evaluate and improve the temporal grounding of music LLMs in audio.

AI Chinese abstract

近期的大型音频-语言模型（LALMs）在理解音乐内容方面展现了有前景的能力。然而，它们的响应是否基于音频中正确的时间区域仍未得到充分探索。这一限制对于音乐理解尤为关键，因为关键信息通常以时间局部化事件的形式出现，例如乐器进入和节奏转换。为了解决这一差距，我们引入了MusTBENCH，一个由音乐专家验证的基准，旨在通过五个时间定位的问答任务评估LALMs中的时间定位能力。为了进一步提升现有模型中的时间定位，我们提出了MusT，一种新颖的四阶段时间优化方案，涵盖音乐编码器适应、LLM适应、LLM监督微调和基于RL的优化。在MusTBENCH上的实验表明，现有LALMs在精确时间定位方面存在困难，而MusT相比强基线带来了显著改进。这些结果将时间定位确立为当前LALMs中缺失的关键能力，并将MusTBENCH定位为未来时间定位音乐理解研究的具有挑战性的基准。

English abstract

Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.

2605.29288 2026-05-29 cs.AI 版本更新

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

诊断答案正确长链思维训练轨迹中的有害延续

Chen He, Yuhao Wu, Lei Wang, Wenxuan Zhang, Fumin Shen

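Affiliations: University of Electronic Science and Technology of China; Singapore University of Technology and Design; Singapore Management University

AI summary: Studies harmful continuation in long chain-of-thought training data, where a trace reaches a correct answer but keeps reasoning afterward; suffix-removal experiments show that such continuation hurts training, and the paper proposes a lightweight boundary proxy.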

AI Chinese abstract

长链思维（CoT）轨迹被广泛用作面向推理的大语言模型监督微调（SFT）的监督信号，然而答案正确的轨迹仍可能导致显著不同的微调结果。我们研究了答案正确的长CoT数据中的结论后延续：即答案已充分支持，但轨迹继续包含额外推理并保留在监督目标中。为了测试其训练效果，我们使用仅删除的编辑器构建保留答案的后缀移除，并比较原始和经过处理的轨迹上的CoT监督微调。我们观察到移除编辑器识别的结论后延续后监督微调结果有所改善，表明这种延续在我们的设置中对训练有害。因此，我们将这一经验支持的现象称为有害延续。除了这一干预，我们还通过不确定性和隐藏状态进展进一步刻画了被移除的结论后延续。我们观察到持续的局部不确定性以及减弱的终端方向进展，形成了不确定性-几何不匹配。最后，我们实例化了有害延续切割（HCC），一种轻量级边界代理，近似于编辑器识别的结论后延续边界。

English abstract

Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: a continuation where the answer appears sufficiently supported, but the trace continues with additional reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removal and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty--geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.

2605.29283 2026-05-29 cs.LG cs.AI 版本更新

Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

物理基础模型能否学习可泛化的物理？一种跨物理机制和分布偏移的偏差感知基准

Mengdi Chu, Yang Liu, Ayan Biswas, Han-Wei Shen

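Affiliations: The Ohio State University; Los Alamos National Laboratory

AI summary: Using a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes, the paper evaluates five physics foundation model architectures and finds that current models are conditional rather than universal generalists: their generalization depends on the physical regime, temporal scale, initial conditions, pretraining, model size, and architecture. It argues that progress requires moving beyond scaling models or expanding data, toward learning physical knowledge that transfers across regimes, temporal scales, and distribution shifts.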

Comments 26 pages, 31 figures

AI Chinese abstract

最近的物理基础模型声称具有通用的时空预测能力，但它们的评估通常将性能压缩为固定训练分布下的单一平均分数。这使得难以确定模型是否学习了可泛化的物理动力学，还是仅在特定设置下表现良好。我们构建了一个包含8种物理动力学、3种训练数据混合和25种测试机制的基准，这些测试机制由动态尺度和初始条件复杂性变化引起，涵盖了分布内、分布偏移和分布外设置。我们评估了五种物理基础模型架构和每种架构的四种模型变体（从头训练和三种预训练大小），共得到60,000个测量结果。我们的结果表明，当前的物理基础模型表现为条件性而非通用性泛化者：它们的泛化能力取决于物理机制、时间尺度、初始条件设置、预训练、模型大小和架构。改进训练数据分布只能部分缓解这一限制。预训练和缩放也无法可靠地消除它们的能力偏差。我们认为，改进物理基础模型需要超越缩放模型或扩展数据，转向学习能够更好地跨机制、时间尺度和分布偏移捕获可迁移物理知识的机制。

English abstract

Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to determine whether a model has learned generalizable physical dynamics or only performs well under particular settings. We construct a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes induced by dynamic-scale and initial-condition complexity shifts, covering in-distribution, distribution-shift, and out-of-distribution settings. We evaluate five physics foundation model architectures and four model variants per architecture (scratch and three pretrained sizes), resulting in 60,000 measurements. Our results show that current physics foundation models behave as conditional rather than universal generalists: their generality depends on the physical regime, temporal scale, initial-condition setting, pretraining, model size, and architecture. Improving the training data distribution only partially mitigates this limitation. Pretraining and scaling are also unable to reliably remove their ability biases. We argue that improving physics foundation models requires moving beyond scaling models or expanding data, toward learning mechanisms that better capture transferable physical knowledge across regimes, temporal scales, and distribution shifts.

2605.29277 2026-05-29 cs.SE cs.AI 版本更新

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

Code-QA-Bench：在仓库级问答中分离代码推理与文档记忆

Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun, Yehua Yang, Qiao Zhao

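Affiliations: Baidu Inc

AI summary: Proposes Code-QA-Bench, a framework that automatically builds repository-level code understanding benchmarks through answer-first generation and a three-condition experimental design, in order to separate the effects of code reasoning, documentation recall, and pretraining memorization.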

AI Chinese abstract

我们提出了Code-QA-Bench，一个全自动框架，用于合成仓库级代码理解基准，将真正的代码理解与文档回忆和预训练记忆分离。该框架有两个方法论贡献：（1）答案优先生成流程，其中配备工具的代理探索源代码以生成经过验证的金色答案，然后推导问题，确保每个任务都基于真实的代码结构；（2）三条件实验设计，在闭卷（无仓库）、仅代码（移除文档）和带文档（完整仓库）条件下评估代理，差值直接量化文档效用和记忆。我们从SWE-Bench中的10个Python仓库生成了528个代码可推导任务和100个文档依赖任务，由LLM评判员根据准确性、完整性和特异性评分。对四个前沿模型的实验表明，代码访问是主导因素（比闭卷平均提高0.23），文档提供了适度的额外收益（文档依赖任务上提高0.071），并且在代码可推导任务上仅代码≈带文档，验证了该设计。该框架是开源的，适用于任何文档良好的Python仓库。

English abstract

We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two methodological contributions: (1) an answer-first generation pipeline where a tool-equipped agent explores source code to produce verified gold answers before deriving questions, ensuring every task is grounded in real code structure; and (2) a three-condition experimental design evaluating agents under closed-book (no repository), code-only (documentation removed), and documented (full repository) conditions, with deltas directly quantifying documentation utility and memorization. We generate 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories from SWE-Bench, scored by an LLM judge on accuracy, completeness, and specificity. Experiments on four frontier models reveal that code access is the dominant factor (+0.23 mean gain over closed-book), documentation provides modest additional benefit (+0.071 on doc-dependent tasks), and code-only $\approx$ documented on code-derivable tasks, validating the design. The framework is open-source and applicable to any well-documented Python repository.

2605.29272 2026-05-29 cs.LG cs.AI stat.ML 版本更新

Causal Label Recovery in Payment Networks

支付网络中的因果标签恢复

Gaurav Dhama

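Affiliations: Mastercard

AI summary: Targeting four systematic impairments in payment-network labels, the paper proposes the Sequential Triply Robust (STR) estimator, which corrects all of them simultaneously, attains the semiparametric efficiency bound, and enables training on data that is days rather than months old.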

Comments 49 pages

AI Chinese abstract

支付网络中的欺诈检测模型依赖于存在系统性偏差的退单标签进行训练。每个标签必须依次经过三个门控：授权（被拒绝的交易不产生标签）、发卡行报告（未报告的欺诈不可见）和延迟（待处理的退单在训练时缺失）。到达的标签可能因第一方滥用或发卡行错误分类而受损。配套论文[arXiv:2605.27557]证明这四种损害对检测性能施加了极小极大下界。本文问：能否达到该下界？我们将观测流程形式化为一个具有三个倾向阶段和一个损坏层的顺序缺失数据问题，并构建了序列三重稳健（STR）估计器。STR同时纠正所有四种损害，并达到半参数效率界——没有估计器能具有更低的渐近方差。它是序列三重稳健的：在每个门控处，一致性仅要求倾向模型或结果回归中有一个正确指定，而非两者。我们提供了通过噪声率调整的伪标签进行损坏校正、通过经验贝叶斯收缩稳定小发卡行的逆倾向权重、提供有效置信区间的插件方差估计量，以及用于有限样本保证的伯恩斯坦集中不等式。在操作层面，我们推导了最优训练延迟——使标签质量损失和模型过时之和最小化的成熟窗口——并证明STR允许使用数天而非数月前的数据进行训练，将模型新鲜度与退单成熟周期解耦。对于任何样本量，STR在均方误差上严格优于基于退单的朴素训练。

English abstract

Fraud detection models in payment networks train on chargeback labels that are systematically biased. Every label must survive three sequential gates: authorization (declined transactions generate no labels), issuer reporting (unreported fraud is invisible), and delay (pending chargebacks are missing at training time). Labels that do arrive may be corrupted by first-party misuse or issuer misclassification. A companion paper [arXiv:2605.27557] proved that these four impairments impose a minimax lower bound on detection performance. This paper asks: can that bound be achieved? We formalize the observation pipeline as a sequential missing-data problem with three propensity stages and a corruption layer, and construct the Sequential Triply Robust (STR) estimator. The STR corrects for all four impairments simultaneously and achieves the semiparametric efficiency bound -- no estimator can have lower asymptotic variance. It is sequentially triply robust: at each gate, consistency requires only that either the propensity model or the outcome regression is correctly specified, not both. We provide corruption correction via noise-rate-adjusted pseudo-labels, empirical Bayes shrinkage to stabilize inverse-propensity weights for small issuers, a plug-in variance estimator yielding valid confidence intervals, and a Bernstein concentration inequality for finite-sample guarantees. On the operational side, we derive the optimal training delay -- the maturity window that minimizes the sum of label-quality loss and model staleness -- and prove that the STR permits training on data that is days old rather than months old, decoupling model freshness from the chargeback maturity cycle. The STR provably dominates naive chargeback-based training in mean squared error for any sample size.


2605.29271 2026-05-29 cs.AI cs.IR cs.LG (updated version)

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

CoHyDE: 用于工具检索的LLM改写器与稠密编码器的迭代协同训练

Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber

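Affiliations: SAP Labs

AI summary: Proposes CoHyDE, which iteratively co-trains a dense encoder and an LLM rewriter, combining contrastive learning with preference alignment, to improve tool-retrieval performance on both standard and vague queries.

AI-generated Chinese abstract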

在大规模API目录上的工具检索是LLM智能体的核心瓶颈：用户查询以口语化、通常不明确的语言出现，而目录使用技术性API词汇，没有固定的编码器能够单独弥合这一差距。两种主要的训练方法，对比编码器微调和基于冻结LLM的HyDE式查询扩展，从相反的角度解决这个问题，并在互补的方向上失败：微调编码器在查询的表面形式与目录匹配时表现出色，但在不匹配时性能崩溃；而零样本HyDE对不明确的查询更鲁棒，但生成不感知目录的假设描述，当查询形式良好时检索性能下降。我们提出CoHyDE，一种迭代过程，将稠密编码器和LLM改写器训练为单个共同演化的系统：编码器使用改写器生成的目录风格假设描述通过InfoNCE重新训练，改写器通过DPO基于编码器的检索分数进行偏好对齐，两者在循环开始前在工具目录上进行热启动。在ToolBench目录的约10k工具子集上，三轮CoHyDE在标准查询上比最强的单组件基线提高+2.5个百分点的NDCG@5，在保留的模糊查询上提高+6.3个百分点，在最难的模糊层级上增益高达+8个百分点。消融实验证实协同训练是关键因素：单独使用任一组件都无法在形式良好和模糊查询上匹配CoHyDE，在模糊查询上损失高达-8个百分点。

English abstract

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.
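The gains above are reported in NDCG@5. A minimal sketch of that metric follows, assuming linear gain, a log2 rank discount, and that every relevant tool appears somewhere in the ranked list being scored. The paper's exact evaluation code is not given here, so this is the textbook form only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


perfect = ndcg_at_k([1, 0, 0, 0, 0])   # correct tool ranked first
second = ndcg_at_k([0, 1, 0, 0, 0])    # correct tool ranked second
```

A "+2.5 pp" gain is an absolute difference between two such scores, for example 0.600 versus 0.625, not a relative improvement.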


2605.29270 2026-05-29 cs.AI (updated version)

Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

索引不可读之物：基于LLM原生的服务分类法递归构建与搜索

Wei Zheng, Yang Yan, Yiyang Shao, Jinyang Li, Zeze Chang, Yukuang Jia, Qiming Mao, Chihyung Wang, Jingbin Zhou

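AI summary: To address context-window limits and the loss of mid-input information that LLMs face in service discovery, proposes an LLM-native progressive-disclosure scheme, A2X, which automatically builds a hierarchical service taxonomy and walks it layer by layer at query time, markedly improving retrieval accuracy while reducing token consumption.

Comments Preprint. 8 pages main paper + appendix; 2 figures. Under submission to EMNLP 2026

AI-generated Chinese abstract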

智能体互联网（IoA）时代正在形成：LLM智能体预计通过编排快速增长中的模型上下文协议（MCP）服务器、智能体到智能体（A2A）端点、可复用技能以及其他LLM可调用服务来实现用户目标。然而，LLM面临与此机制的结构性不匹配：有效上下文是一种稀缺资源，无法随服务数量扩展。将数千个服务描述串联到提示中会溢出上下文窗口，即使窗口足够大，模型也会系统性地忽略长输入中间部分的信息，即文献中充分记录的“中间迷失”现象。这本质上是服务发现中的上下文管理问题。为解决此问题，我们提出一种LLM原生的渐进式披露方案及其具体实例A2X（智能体到任意对象的服务发现）：一个LLM驱动的流水线，自动将注册服务组织成层次化分类法，并在查询时逐层遍历，使得每次LLM调用仅看到与用户查询高度相关的小候选集。这将有效上下文稀缺性与注册表规模解耦，显著降低token消耗并提高检索准确性。与全上下文转储相比，A2X在提示token成本仅为九分之一的情况下实现了6.2个点的命中率提升；与最先进的开源基于嵌入的基线相比，A2X将命中率提高了超过20个点。

English abstract

The era of the Internet of Agents (IoA) is taking shape: LLM agents are expected to fulfill user goals by orchestrating fast-growing populations of Model Context Protocol (MCP) servers, Agent-to-Agent (A2A) endpoints, reusable skills, and other LLM-callable services. Yet LLMs face a structural mismatch with this regime: effective context is a scarce resource that does not scale with the number of services. Concatenating thousands of service descriptions into a prompt overflows the context window, and even when the window is large enough, models systematically under-attend to information in the middle of long inputs, the well-documented Lost-in-the-Middle phenomenon. This is fundamentally a question of context management for service discovery. To address this, we propose an LLM-native progressive-disclosure scheme and its concrete instantiation, A2X (Agent-to-Anything service discovery): an LLM-driven pipeline that automatically organizes the registered services into a hierarchical taxonomy and walks it layer by layer at query time, so that every LLM call sees only a small candidate set highly relevant to the user query. This decouples effective-context scarcity from registry size and significantly reduces token consumption while improving retrieval accuracy. Compared to full-context dumping, A2X achieves a 6.2-point Hit Rate gain at one-ninth the prompt-token cost; compared to the state-of-the-art open-source embedding-based baseline, A2X improves Hit Rate by more than 20 points.


2605.29267 2026-05-29 cs.AI cs.LG (updated version)

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

人类策展何时以及如何适得其反：多模型自消费循环下的偏好对齐

Yang Zhang, Xiukun Wei, Xueru Zhang

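Affiliations: Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio

AI summary: Studies how human curation affects model alignment in multi-model self-consuming training, finding that cross-model interactions can dampen or even invert the effect of curation, degrading long-term alignment.

AI-generated Chinese abstract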

基础模型越来越多地使用先前模型迭代生成的合成数据进行训练，而非仅依赖真实数据。这种自消费训练范式可能导致模型崩溃、发散或偏差放大。近期工作（Ferbach et al., 2024）表明，将人类策展纳入循环可以引导自消费模型向人类对齐的行为，但这些分析聚焦于单一孤立模型，该模型仅消耗自身输出。然而，在实践中，模型经常交互并训练于其他模型产生的输入-输出对。本文研究多模型机制下的自消费训练。我们首先形式化了一个交互自消费模型的框架，并刻画了所得动力系统何时收敛到稳定点。然后，我们考察了一个模型的人类策展如何影响其自身对齐（自影响），以及这种效应如何传播到其他模型（交叉影响）。与孤立设置中人类策展总是增强模型对齐不同，我们表明跨模型交互可以削弱甚至逆转这种效应，最终损害长期对齐。

English abstract

Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. This self-consuming training paradigm can lead to model collapse, divergence, or bias amplification. Recent work (Ferbach et al., 2024) shows that incorporating human curation into the loop can steer a self-consuming model toward human-aligned behavior, but these analyses focus on a single, isolated model that solely consumes its own outputs. In practice, however, models often interact and train on input-output pairs produced by other models. This paper studies self-consuming training in the multi-model regime. We first formalize a framework for interacting self-consuming models and characterize when the resulting dynamical system converges to a stable point. We then examine how human curation of one model affects its own alignment (self-influence) and how such effects propagate to other models (cross-influence). Unlike isolated settings where human curation always enhances model alignment, we show that cross-model interactions can dampen or even invert this effect, ultimately degrading long-term alignment.


2605.29262 2026-05-29 cs.AI (updated version)

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

协调实时约束与长视距推理：一种用于动态调度的异步智能体框架

Shijie Cao, Yuan Yuan, Jing Liu

Affiliations: School of Computer Science and Engineering, Beihang University, Beijing 100191, China; Shenzhen Loop Area Institute, Shenzhen, China; Qingdao Research Institute, Beihang University; Hangzhou Innovation Institute, Beihang University; School of Artificial Intelligence, Xidian University, Xi'an 710071, Shaanxi, China; Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, Guangdong, China

AI summary: Proposes RACE-Sched, an asynchronous agentic framework whose dual-stream architecture decouples policy execution from logical reasoning; an LLM synthesizes and validates symbolic heuristic rules, improving dynamic scheduling quality while preserving real-time responsiveness.

AI-generated Chinese abstract

动态柔性作业车间调度问题（DFJSP）需要在即时响应随机扰动与全局优化生产目标之间进行权衡。传统的优先级规则在处理复杂扰动时灵活性不足，而基于学习的方法往往牺牲可解释性或难以跨问题规模泛化。尽管大语言模型（LLM）提供了高级推理能力以弥合这一差距，但其显著的推理延迟与工业控制系统的毫秒级决策周期不兼容。为解决这一冲突，我们引入了RACE-Sched，一种异步智能体框架，通过双流架构将策略执行与逻辑推理解耦。反应流执行低延迟的符号启发式规则以实现实时调度，而并行的深思流利用LLM合成、验证和演化这些规则。候选规则在沙箱中经过严格测试，并通过原子更新部署，确保安全且不阻塞控制循环。此外，语义规则库索引已验证的启发式规则，用于基于检索的初始化，从而增强跨问题规模的可迁移性。在GEN-Bench、MK-Bench和JMS-Bench上的广泛评估表明，RACE-Sched优于领先的深度强化学习和其他基于LLM的基线方法。该方法协调了实时约束与长视距推理，实现了更优的解决方案质量和对动态事件的鲁棒适应。

English abstract

The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexible to handle complex disruptions, whereas learning-based approaches often compromise interpretability or fail to generalize across problem scales. Although Large Language Models (LLMs) offer advanced reasoning capabilities to bridge this gap, their substantial inference latency is incompatible with the millisecond-level decision cycles of industrial control systems. To resolve this conflict, we introduce RACE-Sched, an asynchronous agent-based framework that decouples policy execution from logical reasoning via a dual-stream architecture. The Reactive Stream executes low-latency symbolic heuristics to enable real-time dispatching, while the parallel Deliberative Stream leverages an LLM to synthesize, validate, and evolve these rules. Candidate rules undergo rigorous testing in a sandbox and are deployed via atomic updates, ensuring safety without blocking the control loop. Additionally, a semantic rule repository indexes validated heuristics for retrieval-based initialization which enhances transferability across problem scales. Extensive evaluations on GEN-Bench, MK-Bench, and JMS-Bench demonstrate that RACE-Sched outperforms leading Deep Reinforcement Learning and other LLM-based baselines. This approach harmonizes real-time constraints with long-horizon reasoning to achieve superior solution quality and robust adaptation to dynamic events.


2605.29259 2026-05-29 cs.LG cs.AI (updated version)

KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs

KLAS：利用相似性拼接神经网络以改进精度-效率权衡

Debopam Sanyal, Anantharaman Iyer, Alind Khare, Trisha Jain, Akshay Jajoo, Myungjin Lee, Clayton Kerce, Alexey Tumanov

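Affiliations: Georgia Institute of Technology; Microsoft M365 Research; Cisco Research; Georgia Tech Research Institute

AI summary: Proposes KLAS, a framework that measures the similarity of intermediate representations with KL divergence to automatically select the best stitch configurations, improving the accuracy-efficiency curve of stitched models at the same finetuning cost.

AI-generated Chinese abstract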

鉴于部署目标的广泛性，灵活模型选择对于在给定计算预算内优化性能至关重要。最近的研究表明，在模型家族内拼接预训练模型能够实现精度-效率权衡空间的成本效益插值。拼接将一个预训练模型的中间激活变换到另一个模型，生成新的插值拼接网络。这类网络沿精度-效率谱提供了部署选项池。然而，现有拼接方法往往产生次优权衡且缺乏泛化性，因为它们主要依赖启发式方法选择拼接配置。我们认为，构建改进的精度-效率权衡需要显式捕获并利用被拼接预训练模型之间的相似性。为此，我们引入KLAS，一种新颖的拼接选择框架，通过利用中间表示之间的KL散度，自动化和泛化跨模型家族的拼接选择。KLAS从$O(k^2n^2)$种可能性中为$k$个深度为$n$的预训练模型识别最有前景的二元拼接。通过全面实验，我们证明KLAS在相同微调成本下改进了拼接模型的精度-效率曲线，与基线相比，KLAS在相同计算成本下实现了高达$1.21\%$的ImageNet-1K top-1准确率提升，或在保持准确率的同时将FLOPs降低$1.33\times$。

English abstract

Given the wide range of deployment targets, flexible model selection is essential for optimizing performance within a given compute budget. Recent work demonstrates that stitching pretrained models within a model family enables cost-effective interpolation of the accuracy-efficiency tradeoff space. Stitching transforms intermediate activations from one pretrained model into another, producing a new interpolated stitched network. Such networks provide a pool of deployment options along the accuracy-efficiency spectrum. However, existing stitching approaches often yield suboptimal tradeoffs and lack generalizability, as they primarily rely on heuristics to select stitch configurations. We argue that constructing improved accuracy-efficiency tradeoffs requires explicitly capturing and leveraging the similarity between pretrained models being stitched. To this end, we introduce KLAS, a novel stitch selection framework that automates and generalizes stitch selection across model families by leveraging KL divergence between intermediate representations. KLAS identifies the most promising binary stitches from the $O(k^2n^2)$ possibilities for $k$ pretrained models of depth $n$. Through comprehensive experiments, we demonstrate that KLAS improves the accuracy-efficiency curve of stitched models at the same finetuning cost as baselines. KLAS achieves up to $1.21\%$ higher ImageNet-1K top-1 accuracy at the same computational cost, or maintains accuracy with a $1.33\times$ reduction in FLOPs.
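The size of the search space and the similarity measure can be illustrated with a short sketch. It assumes a binary stitch is an ordered pair of distinct models plus one layer index in each, and it shows the generic KL divergence between two discrete distributions. How KLAS turns intermediate representations into distributions is not stated in the abstract, so that step is left out.

```python
import math


def count_binary_stitches(k, n):
    """Ordered (source model, target model) pairs times (source layer, target layer) pairs."""
    return k * (k - 1) * n * n


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


candidates = count_binary_stitches(k=3, n=4)
identical = kl_divergence([0.5, 0.5], [0.5, 0.5])
peaked = kl_divergence([1.0, 0.0], [0.5, 0.5])
```

Under this counting assumption the candidate set grows as k(k-1)n^2, which is consistent with the O(k^2 n^2) figure quoted in the abstract.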


2605.29256 2026-05-29 cs.CL cs.AI (updated version)

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

DynSess：面向角色扮演智能体的动态会话级评估与优化框架

Rongsheng Zhang, Jiji Tang, Junnan Ren, Zuyi Bao, Weijie Chen, Ruofan Hu, Zhou Zhao, Tangjie Lv, Yan Zhang

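Affiliations: Zhejiang University; Fuxi AI Lab, NetEase Inc.; Xiamen University

AI summary: Proposes DynSess, a unified session-level framework that improves the long-horizon consistency and interaction quality of role-playing agents through session-level evaluation (DynSess-Eval) and training-trajectory optimization based on multi-turn lookahead search (DSPO/GSRPO).

AI-generated Chinese abstract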

基于大型语言模型的角色扮演本质上是一个会话级任务，要求智能体在扩展的多轮对话中维持角色身份和交互质量。然而，现有的评估和优化方法大多停留在轮次级别，无法捕捉长程质量。我们提出DynSess，一个统一的会话级角色扮演智能体框架。DynSess-Eval通过针对长程行为的评分标准对完整对话会话进行评分。利用其会话级奖励，我们通过多步前瞻搜索构建高质量训练轨迹，并训练DynSess-Character的两个互补变体：DSPO（离策略）和GSRPO（在策略）。实验表明，DynSess-Eval与人类判断的一致性显著优于先前的评估器，盲人机评估进一步显示，尽管参数少得多，DynSess-Character仍能与最强角色模型匹配，同时保持强大的角色一致性和交互能力。我们的数据集和代码将发布以促进未来研究。

English abstract

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality training trajectories through multi-turn lookahead search and train DynSess-Character with two complementary variants: DSPO (off-policy) and GSRPO (on-policy). Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation further shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong role consistency and interactive ability. Our dataset and code will be released to facilitate future research.


2605.29254 2026-05-29 cs.RO cs.AI (updated version)

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

极端动态对称性实现全向多功能机器人

Jiaxun Liu, Boxi Xia, Boyuan Chen

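Affiliations: Department of Mechanical Engineering and Materials Science, Duke University; Department of Electrical and Computer Engineering, Duke University; Department of Computer Science, Duke University

AI summary: Introduces the concept of dynamic symmetry, measured by dynamic isotropy. Across more than 1000 simulated morphologies, higher dynamic symmetry improves trajectory tracking, task success, robustness, and other measures; the Argus family of spherical robots is developed to validate that near-extreme dynamic isotropy yields omnidirectional locomotion, terrain adaptation, rapid self-stabilization, and resilience to actuator failures.

Comments Published in Science Robotics (2026). Our project website is at: https://generalroboticslab.com/Argus

Journal ref: Science Robotics 11, eaec1725 (2026)

AI-generated Chinese abstract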

对称性是自然系统中的核心组织原则，但其作为机器人统一设计策略的应用仍主要局限于几何形态。我们证明，对称性可以在动态驱动能力层面加以利用。我们引入动态对称性，即机器人可达质心加速度的均匀性，并通过称为动态各向同性的度量将其形式化。在超过1000种模拟形态中，我们发现更高的动态对称性持续改善了轨迹跟踪、任务成功率、鲁棒性、恢复能力和能量效率，且当动态各向同性接近其理论极限时，效益最为显著。为了系统地研究这一机制，我们开发了Argus，一系列球形机器人，旨在探索增加动态对称性的效果。Argus家族的成员在驱动几何和动态对称性水平上有所不同，但共享一个共同架构原则：径向定向的线性致动器直接塑造机器人的质心动力学。其中，我们构建了一个物理的20腿Argus变体，实现了接近极端的动态各向同性，并展示了方向无关的运动、在杂乱和可变形地形上的敏捷穿越、快速自稳定以及对部分致动器故障的鲁棒性。其分布式感知进一步实现了在连续运动中的全向感知和物体交互。这些结果表明，不仅在形态上而且在可达动力学上设计机器人的对称性，为在不确定的地球和地外环境中实现敏捷性、鲁棒性和多功能性提供了一条强大且通用的途径。

English abstract

Symmetry is a central organizing principle in natural systems, yet its use as a unifying design strategy in robotics has largely remained limited to geometric form. We show that symmetry can instead be leveraged at the level of dynamic actuation capability. We introduce dynamic symmetry, the uniformity of a robot's attainable center-of-mass accelerations, and formalize it through a measure coined as dynamic isotropy. Across more than 1000 simulated morphologies, we found that higher dynamic symmetry consistently improved trajectory tracking, task success, robustness, resiliency, and energy efficiency, with the benefits becoming most pronounced as dynamic isotropy approached its theoretical limit. To study this regime systematically, we developed Argus, a family of spherical robots designed to explore the effects of increasing dynamic symmetry. Members of the Argus family vary in their actuation geometry and dynamic symmetry level while sharing a common architectural principle: radially oriented linear actuators that directly shape the robot's center-of-mass dynamics. Among them, we built a physical 20-leg Argus variant that achieved near-extreme dynamic isotropy and demonstrated orientation-invariant locomotion, agile traversal of cluttered and deformable terrain, rapid self-stabilization, and resilience to partial actuator failures. Its distributed sensing further enabled omnidirectional perception and object interaction during continuous motion. These results show that designing robots for symmetry not only in morphology but also in their attainable dynamics provides a powerful and general pathway toward agility, robustness, and multifunctionality in uncertain terrestrial and extraterrestrial environments.


2605.29253 2026-05-29 cs.AI (updated version)

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

OpenClawBench: 真实智能体执行轨迹中过程侧异常的基准测试

Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han

Affiliations: School of Software, Shandong University; School of Artificial Intelligence, Nanjing University; State Key Laboratory of Novel Software Technology, Nanjing University; Center for Medical Artificial Intelligence; Qingdao Academy of Chinese Medical Sciences; Institute of Marine Traditional Chinese Medicine, Shandong University of Traditional Chinese Medicine; School of Software Engineering, Sichuan University

AI summary: Introduces the OpenClawBench dataset, which uses the FullTax annotation framework to quantify process-side anomalies in agent executions and exposes the shortcomings of outcome-only evaluation.

Comments 37 pages, 1 figure, 43 tables

AI-generated Chinese abstract

任务成功可能掩盖真实智能体执行中的过程异常。智能体可能通过最终任务测试，但过程中仍累积未解决的歧义、不安全的外部写入、被忽略的错误、弱化的承诺或能力边界过度承诺。我们将这种不匹配研究为结果-过程差距，并引入OpenClawBench，这是一个用于测量和监督真实智能体执行过程中过程侧异常的大规模数据集。OpenClawBench基于由6个源模型生成的BFCL驱动的OpenClaw会话构建，包含31,264条带注释的轨迹。它将任务测试结果与结构化过程证据对齐。FullTax将对齐的轨迹转换为结构化异常监督：二元标签、支持证据、起始/跨度定位、严重性、可恢复性以及一个5类异常分类法。使用OpenClawBench，我们使结果-过程差距变得可测量。在31,135次通过测试的执行中，有2,904次在FullTax下被标记为过程异常。这些结果表明，仅基于成功的评估忽略了真实智能体执行中一类具体的过程侧失败。基于高置信度FullTax监督池训练的LoRA微调Gemma 3 12B检测器，在更干净标签的保留测试集上达到了二元F1=0.729。总之，OpenClawBench将真实智能体执行日志转化为可审计和可复用的监督，用于研究、诊断和操作监控运行时智能体可靠性。

English abstract

Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.
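The size of the Outcome-Process Gap follows directly from the counts quoted above. The short calculation below uses only those three numbers; the variable names are illustrative.

```python
total_trajectories = 31_264   # annotated trajectories in the dataset
oracle_passing = 31_135       # executions that pass the final task oracle
anomalous_passing = 2_904     # oracle-passing executions flagged under FullTax

# Share of "successful" runs that still carry a process-side anomaly.
gap_rate = anomalous_passing / oracle_passing

# Share of all trajectories that pass the oracle at all.
pass_rate = oracle_passing / total_trajectories

print(f"gap rate: {gap_rate:.2%}, pass rate: {pass_rate:.2%}")
```

Roughly one in eleven oracle-passing executions is flagged, even though nearly all trajectories pass the oracle, which is the sense in which success-only evaluation misses these failures.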


2605.29251 2026-05-29 cs.AI cs.CR (updated version)

Provably Secure Agent Guardrail

可证明安全的智能体护栏

Benlong Wu, Weiming Zhang, Kejiang Chen, Han Fang, Nenghai Yu

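Affiliations: University of Science and Technology of China

AI summary: Addressing the inability of existing semantic guardrails to provide deterministic security lower bounds, proposes a new security paradigm based on the fundamental limitations of logical reasoning and introduces an executable Proof-Constrained Action (ePCA) framework whose neural-symbolic isolation architecture achieves zero attack success rate and zero false positive rate in the evaluated scenarios.

AI-generated Chinese abstract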

随着大语言模型从有限生成引擎转变为具有广泛执行权限的智能体，人工智能失控引发了人工智能安全的基本危机。现有的防御架构严重依赖经验性语义护栏和概率性大模型裁决器，这些机制在面对复杂的语义符号解耦攻击时无法提供确定性的安全下界。为了克服这种经验性语义护栏困境，本文提出了一种基于逻辑推理基本限制的智能体安全新范式。基于该范式，我们进一步引入了一种具有神经符号隔离架构的可执行证明约束动作（ePCA）框架。该框架放弃了对自然语言的语义信任，迫使智能体在执行物理操作之前将其意图无损地形式化为一阶逻辑数学约束。宏观和微观二维动态对抗系统的实证评估表明，我们的形式化验证机制在评估场景中实现了零攻击成功率和零误报率，且计算延迟极低。这项研究为在明确系统假设下构建未来智能系统的基础防御提供了条件性的形式化基础和工程范式。

English abstract

As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in artificial intelligence security. Existing defense architectures heavily rely on empirical semantic guardrails and probabilistic large model adjudicators, mechanisms that fail to provide deterministic security lower bounds when facing complex semantic symbol decoupling attacks. To overcome this empirical semantic guardrail dilemma, this paper proposes a new security paradigm for agents based on the fundamental limitations of logical reasoning. Based on this paradigm, we further introduce an executable Proof-Constrained Action (ePCA) framework with a neural symbolic isolation architecture. This framework abandons semantic trust in natural language, forcing agents to losslessly formalize their intentions into first-order logical mathematical constraints before performing physical operations. Empirical evaluations of macroscopic and microscopic two-dimensional dynamic adversarial systems demonstrate that our formal verification mechanism achieves zero attack success rate and zero false positive rate across the evaluated scenarios, with extremely low computational latency. This research provides a conditional formal foundation under explicit system assumptions and an engineering paradigm for constructing the underlying defense foundation for future intelligent systems.


2605.29250 2026-05-29 cs.CL cs.AI cs.IR cs.LG (updated version)

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

OmniRetrieval：跨异构知识源的统一检索

Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang

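Affiliations: KAIST

AI summary: Proposes OmniRetrieval, a framework that takes natural-language queries, identifies the relevant knowledge sources, and dispatches each query to the source's native execution engine; it outperforms single-source baselines across 13 datasets and 309 knowledge bases, achieving unified retrieval over heterogeneous knowledge sources.

AI-generated Chinese abstract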

重新思考文献检索评估：深度研究有帮助，且人类引用列表并非金标准

Gaurav Sahu, Laurent Charlin, Christopher Pal

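Affiliations: Mila – Quebec AI Institute; HEC Montréal; ServiceNow Research; Canada CIFAR AI Chair; Université de Montréal; Polytechnique Montréal

AI summary: Improves the retrieval pipeline and tests the reliability of human citation lists as an evaluation target, finding that a Deep Research pipeline substantially raises recall while only 51% of human citations are judged moderately relevant or higher; recommends multi-dimensional evaluation.

AI-generated Chinese abstract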

我们从两个互补角度研究大规模文献检索：改进检索流程，以及压力测试人类参考文献列表作为评估目标。首先，我们实现了一个深度研究管道，处理完整查询论文并沿其参考文献广度优先扩展检索结果，表明其显著优于纯API搜索，将RollingEval-Jun25（一个250篇论文的文献检索基准）上的召回率从低于20%提升至高于80%。其次，我们使用中立的LLM作为裁判来判断人类参考文献是否是任务的金标准。我们发现显著局限性：只有51%的人类引用被判定为中等相关或更高，而最强AI重排序器为86-88%。我们在OpenAlex合著图上研究这一差距，发现人类引用直接合作者的可能性比最佳AI重排序器高2.5倍。综合来看，我们的结果反对单一轴线的文献检索评估：召回率、主题相关性评分、排序列表多样性和合著距离诊断各自衡量引用质量的互补属性，应联合报告。

English abstract

We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human reference list as an evaluation target. First, we implement a Deep Research pipeline that processes the full query paper and expands the retrieved results breadth-first along their bibliographies, and show that it substantially outperforms vanilla API-only search, raising recall on RollingEval-Jun25 (a 250-paper literature-search benchmark) from below 20% to above 80%. Second, we use a neutral LLM-as-a-judge to determine if human references are sound ground truth for the task. We find significant limitations: only 51% of human citations are judged moderately relevant or higher, against 86--88% for the strongest AI-based re-rankers. We study this gap on the OpenAlex co-authorship graph, finding that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. Together, our results argue against single-axis literature-search evaluation: recall, topical-relevance scoring, ranked-list diversity, and a co-authorship-distance diagnostic each measure complementary properties of citation quality and should be reported jointly.


2605.29230 2026-05-29 cs.CV cs.AI (updated version)

Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data

面向道德的面部年龄估计：无需儿童数据训练的广义零样本基准

Caio Petrucci, Leo Sampaio Ferraz Ribeiro, Sandra Avila

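Affiliations: New York University

AI summary: Proposes a generalized zero-shot benchmark that excludes children's data from training and evaluates generalization to unseen age groups, finding that all methods suffer severe performance degradation and seen-class bias.

Comments 12 pages; 3 figures; 5 tables

AI-generated Chinese abstract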

从面部图像进行年龄估计通常依赖于包含未成年人图像的训练数据，这种做法引发了严重的伦理、法律和隐私问题。在这项工作中，我们提出了一个用于面部年龄估计的广义零样本基准，该基准在训练时明确排除儿童数据，同时仍评估模型在年轻人群上的性能。我们重新审视了六个广泛使用的数据集，并引入了具有严格年龄组划分的标准化分割：18-59岁的样本用于训练、验证和测试；18岁以下的样本仅保留用于零样本评估；60岁以上的样本作为分布偏移下模型选择的未见验证集。对于具有身份注释的数据集，基于主体的分割防止了身份泄露，并更好地反映了实际部署条件。在此协议下评估九种最先进的年龄估计方法，结果表明所有评估方法均无法泛化到未见年龄组，性能相对于监督基线平均下降46.4%，最高达52.8%。此外，模型并非简单退化：它们系统性地将未见年龄的预测锚定到附近的可见类别，这是广义零样本学习中众所周知的可见类偏见的体现。通过将无儿童数据的年龄估计形式化为现有数据集上的广义零样本基准，这项工作突出了当前建模实践与现实伦理约束之间的关键差距。我们的基准为在受限数据制度下评估模型提供了原则性基础，并鼓励开发对分布偏移鲁棒且符合负责任数据使用的方法。

English abstract

Age estimation from facial images typically relies on training data that includes images of minors, a practice that raises serious ethical, legal, and privacy concerns. In this work, we propose a generalized zero-shot benchmark for facial age estimation that explicitly excludes children's data during training while still assessing model performance on younger populations. We revisit six widely used datasets and introduce standardized splits with strict age-group separation: samples aged 18-59 for training, validation, and testing; samples under 18 reserved exclusively for zero-shot evaluation; and samples 60+ as an unseen validation set for model selection under distribution shift. For datasets with identity annotations, subject-exclusive splits prevent identity leakage and better reflect real-world deployment conditions. Evaluating nine state-of-the-art age estimation methods under this protocol reveals that all evaluated methods consistently fail to generalize to unseen age groups, suffering substantial performance degradation -- on average 46.4%, and up to 52.8% -- relative to the supervised baseline. Moreover, models do not simply degrade: they systematically anchor predictions for unseen ages to nearby seen classes, a manifestation of the well-known seen-class bias in generalized zero-shot learning. By formalizing age estimation without children's data as a generalized zero-shot benchmark on existing datasets, this work highlights a critical gap between current modeling practices and real-world ethical constraints. Our benchmark provides a principled basis for evaluating models under restricted data regimes and encourages the development of methods that are robust to distribution shift and aligned with responsible data use.


2605.29229 2026-05-29 cs.AI (updated version)

Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

定制课程：通过动态数据-模型兼容性进行以学生为中心的推理蒸馏

Jiahao Huang, Fei Cheng, Junfeng Jiang, Akiko Aizawa

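Affiliations: University of Tokyo; Kyoto University; National Institute of Informatics

AI summary: Proposes the Data-Model Compatibility (DMC) metric, which assesses a dataset's suitability for reasoning distillation by jointly considering data quality, relative difficulty, and student capability, and dynamically selects data based on DMC to improve distillation performance.

AI-generated Chinese abstract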

推理蒸馏将复杂推理能力从大型语言模型（LLMs）转移到较小的模型，但其成功取决于训练数据与学生模型的匹配程度。本文引入了数据-模型兼容性（DMC）指标，可用于评估数据集在学生模型上进行推理蒸馏的适用性。DMC通过联合考虑数据质量、相对难度和学生能力来提供评估。我们从两个角度验证了DMC的有效性：（1）DMC与推理蒸馏性能表现出强相关性；（2）使用DMC作为数据选择标准可提高推理蒸馏性能。这两个发现在多个学生模型和任务上均得到一致证明。此外，由于每个数据集的DMC在训练过程中动态变化，我们的实验表明，基于DMC动态选择数据集可以进一步提升性能。

English abstract

Reasoning distillation transfers complex reasoning abilities from large language models (LLMs) to smaller ones, yet its success depends on how well the training data align with the student model. This paper introduces the Data-Model Compatibility (DMC) metric, which can be used to assess the suitability of a dataset for reasoning distillation on a student model. DMC provides an assessment by jointly considering data quality, relative difficulty, and student capability. We validated the effectiveness of DMC from two perspectives: (1) DMC exhibits a strong correlation with reasoning distillation performance; and (2) using DMC as the criterion for data selection leads to improved reasoning distillation performance. Both findings are consistently demonstrated across multiple student models and tasks. Moreover, since the DMC of each dataset dynamically changes during training, our experiments demonstrate that dynamically selecting datasets based on DMC can further enhance performance.


2605.29225 2026-05-29 cs.AI (updated version)

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

BenchTrace: 用于测试LLM智能体反思能力和受控进化的基准

Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu, Akiko Aizawa

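Affiliations: University of Tokyo; Kyoto University; National Institute of Informatics

AI summary: Proposes the BenchTrace benchmark, which systematically evaluates the self-evolution ability of LLM agents through two tasks, Reflection Evaluation and Evolution Evaluation, together with the failure avoidance rate (FAR) metric; experiments find significant bottlenecks in reflection diagnosis and generalization for current models.

AI-generated Chinese abstract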

自我进化智能体通过反思过去失败来随时间改进，但现有评估存在两个局限：仅衡量任务得分，无法反映反思质量；且依赖智能体自身的回合运行，缺乏针对特定失败模式的机制。我们提出BenchTrace，一个用于评估LLM智能体自我进化能力的基准。BenchTrace基于包含1,821个带注释回合的快照反思数据集构建，涵盖六个多样化任务，包含反思评估（通过目标QA任务探测失败识别）和进化评估（在受控自我进化模拟中测试过去失败经验是否转化为回避行为）。基于BenchTrace，我们提出失败避免率（FAR），一种新的评估指标，衡量智能体成功避免目标失败实例的测试用例比例。使用Qwen3-32B和GPT-4.1的实验表明，两个模型在反思评估上的端到端通过率均低于30%，其中诊断是主要瓶颈。进化评估显示，自我进化方法通常比非进化基线提高FAR，但随着噪声回合累积，智能体会遗忘早期教训，且无法将反思泛化到特定情境之外，导致跨任务情境的负迁移。我们的相关性分析进一步揭示，只有完全正确的反思与更高的FAR强相关。BenchTrace揭示了当前自我进化方法的具体局限，并提供了一个受控的、模型无关的针对性评估框架。

English abstract

Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it relies on agents' own episode runs, offering no mechanism to target specific failure patterns. We present BenchTrace, a benchmark for evaluating self-evolution ability in LLM agents. BenchTrace is built on a snapshot-reflection dataset of 1,821 annotated episodes spanning six diverse tasks, and comprises a Reflection Evaluation that probes failure identification through targeted QA tasks, and an Evolution Evaluation that tests whether past failure experience translates into avoidance behavior in a controlled self-evolution simulation. Building on BenchTrace, we propose failure avoidance rate (FAR), a new evaluation metric measuring the fraction of test cases in which the agent successfully avoids the target failure instance. Experiments with Qwen3-32B and GPT-4.1 reveal that both models fall below a 30% end-to-end pass rate on reflection evaluation, with diagnosis as the primary bottleneck. Evolution evaluation shows that self-evolution methods generally improve FAR over the non-evolving baseline, but agents forget early lessons as noise episodes accumulate, and agents fail to generalize their reflections beyond the specific context, causing negative transfer across task contexts. Our correlation analysis further reveals that only a fully correct reflection is strongly associated with higher FAR. BenchTrace exposes concrete limits of current self-evolution approaches and provides a controlled, model-agnostic framework for targeted evaluation.
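As defined in the abstract, FAR is the fraction of test cases in which the agent avoids the target failure. A minimal sketch follows, assuming each test case has already been judged and reduced to a boolean; the judging procedure itself is not described here and is not modeled.

```python
def failure_avoidance_rate(avoided):
    """Fraction of test cases where the agent avoided the target failure.

    `avoided` is a sequence of booleans, one per test case (a hypothetical
    representation chosen for this sketch).
    """
    avoided = list(avoided)
    if not avoided:
        raise ValueError("need at least one test case")
    return sum(avoided) / len(avoided)


# Four test cases, the target failure recurs in one of them.
far = failure_avoidance_rate([True, False, True, True])
```

Comparing this value for a self-evolving agent against a non-evolving baseline on the same test cases gives the kind of FAR improvement the abstract refers to.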


2605.29224 2026-05-29 cs.CL cs.AI cs.CR (updated version)

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

相关性即漏洞：网络检索如何削弱LLM智能体的安全对齐

Aditya Nawal, Manit Baser, Mohan Gurusamy

发表机构 * Department of Electrical and Computer Engineering（电子与计算机工程系）； National University of Singapore（新加坡国立大学）

AI总结本文提出AgentREVEAL框架，分析检索集成方式和内容属性如何导致LLM智能体安全退化，发现相关性是共同激活条件，并引入HarmURLBench基准。

详情

Abstract

AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real-world URLs paired with 320 harmful behaviors to support future evaluations.

2605.29218 2026-05-29 cs.AI cs.CL (updated version)

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu

Affiliations: University of Southern California; Salesforce AI Research; University of California, Davis

AI summary: Proposes the GTA framework, which integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to generate realistic long-horizon tasks with executable trajectories for web agents, addressing the lack of process supervision and scalability in existing benchmarks.

Comments: Published at Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics

Abstract

Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.

2605.29194 2026-05-29 cs.LG cs.AI cs.NA math.NA (updated version)

Stochastic Lifting for Generating Trajectories of Stochastic Physical Systems

Jules Berman, Tobias Blickhan, Benjamin Peherstorfer

Affiliations: Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA

AI summary: Proposes stochastic lifting, which attaches an independent high-dimensional random label to each state transition and learns a map from the current state and label to the next state, in order to generate diverse trajectories of stochastic physical systems.

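2605.29192 2026-05-29 cs.AI cs.CL (updated version)

ReasonOps: Operator Segmentation for LLM Reasoning Traces

Daniel Lee, Owen Queen, James Zou

Affiliations: Stanford University

AI summary: Proposes ReasonOps, an unsupervised method that extracts 7 universal reasoning operators from chain-of-thought traces, revealing the structure of model reasoning and supporting model identification and correctness prediction.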

Abstract

Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their internal structure. Previous methods developed to analyze chain-of-thought traces are either too rigid or not expressive enough, failing to capture features across domains and models. To remedy this, we develop ReasonOps, an unsupervised, expressive method for annotating chain-of-thought traces, providing succinct universal operators. Using ReasonOps, we analyze 44,662 traces from 12 thinking LLMs spanning 6 families across 8 reasoning benchmarks and discover that they share a common compositional structure: 7 recurring reasoning operators -- discourse-level moves such as backtracking, inferring, and hypothesizing -- that emerge from unsupervised clustering of sentence-initial 3-token pivots. These operators appear across every model family and benchmark domain, confirmed by three independent LLM judges who classify held-out samples at 70-76% accuracy. We analyze the structure of operators on easy vs. hard problems, revealing that reflective operators are more helpful on hard problems and harm performance on easy problems. Operator sequences are highly model-identifying: a classifier trained on operator distributions alone recovers the source model with macro-AUC, revealing that each model family has a distinctive reasoning fingerprint. Structural operator features predict within-problem answer correctness well above baselines. Classifiers built on these operators reach WP-AUC and on AIME specifically. ReasonOps further enables early quality estimation well before the trace completes: we predict at WP-AUC for only 50% of the trace. The ReasonOps pipeline is unsupervised and annotation-free, enabling deep insights into LLM reasoning traces as well as strong downstream results on model identification and correctness prediction.

2605.29184 2026-05-29 cs.LG cs.AI (updated version)

Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback

Evgeny S. Saveliev, Samuel Holt, Nabeel Seedat, David L. Bentley, Jim Weatherall, Mihaela van der Schaar

Affiliations: University of Cambridge; Thomson Reuters Foundational Research; U. Colorado, Anschutz Medical Campus

AI summary: Proposes Influence-Guided Symbolic Regression (IGSR), which uses an LLM to generate candidate functions, prunes them with granular influence scores, and combines this with Monte Carlo Tree Search to explore the combinatorial space efficiently, discovering new relationships on several benchmarks and in real biological data.

Comments: ICML 2026

Abstract

Large Language Models (LLMs) offer a promising avenue for scientific discovery, yet their application to symbolic regression is often constrained by inefficient search strategies and coarse feedback signals. Current methods typically guide LLMs using scalar metrics (e.g., global Mean Squared Error), which fail to identify which components of a proposed equation are driving performance or causing error. We introduce \textit{Influence-Guided Symbolic Regression} (IGSR), a method that frames equation discovery as an iterative two-step process combining diverse term generation with rigorous selection: an LLM generates candidate basis functions $ψ_j(\mathbf{x})$ for a linear model, which are then evaluated using granular influence scores $Δ_j$. These scores quantify each term's marginal contribution to generalization accuracy, enabling an influence-guided pruning process that systematically refines the model structure. Integrating this mechanism into a Monte Carlo Tree Search (MCTS) enables navigating the combinatorial search space while balancing exploration of novel functional forms with exploitation of high-influence components. We demonstrate IGSR's effectiveness on a diverse suite of benchmarks, including LLM-SRBench, pharmacological PKPD models, an epidemiological simulation, and real-world genomic data. Notably, we validate the framework's capacity for genuine discovery in a case study using a high-dimensional biological dataset, in which IGSR identified a novel relationship between DNA methylation and RNA Polymerase II pausing; a hypothesis that was subsequently supported via wet-lab experimentation.
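
One way to read the influence score $Δ_j$ is as a leave-one-term-out change in validation error. The sketch below assumes that reading; the abstract does not give the exact definition, so `influence_scores`, the toy target, and the candidate terms are illustrative assumptions rather than the authors' implementation.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit(features, xs, ys, ridge=1e-9):
    """Least-squares weights for y ~ sum_j w_j * psi_j(x) via normal equations."""
    Phi = [[f(x) for f in features] for x in xs]
    m = len(features)
    A = [[sum(row[i] * row[j] for row in Phi) + (ridge if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    b = [sum(row[i] * y for row, y in zip(Phi, ys)) for i in range(m)]
    return solve(A, b)


def mse(features, w, xs, ys):
    """Mean squared error of the linear model on (xs, ys)."""
    preds = [sum(wi * f(x) for wi, f in zip(w, features)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def influence_scores(features, train, val):
    """Assumed definition: validation MSE without term j minus MSE with all terms."""
    full = mse(features, fit(features, *train), *val)
    scores = []
    for j in range(len(features)):
        rest = features[:j] + features[j + 1:]
        scores.append(mse(rest, fit(rest, *train), *val) - full)
    return scores


# Toy target y = 2x + 3 sin(x); cos(x) is a deliberately useless candidate.
target = lambda x: 2.0 * x + 3.0 * math.sin(x)
candidates = [lambda x: x, math.sin, math.cos]
train_x = [0.5 * i for i in range(13)]
val_x = [0.25 + 0.5 * i for i in range(12)]
train = (train_x, [target(x) for x in train_x])
val = (val_x, [target(x) for x in val_x])
deltas = influence_scores(candidates, train, val)
```

Under this reading, a term with a score near zero (here the cosine candidate) is a natural pruning candidate, while terms with large positive scores are kept.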

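2605.29174 2026-05-29 cs.AI cs.CR (updated version)

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

Jay Yu, Amy Zhao, Danning Sui

Affiliations: Pantera Capital; Stanford University; IC3; Ava Labs

AI summary: Analyzing more than 1,900 AI crypto projects, 10 representative agents, and 11 Solana agent treasuries, the paper finds that current DeFi investment agents are still at an early stage, with insufficient evidence of autonomous execution, collective losses for token holders, and valuations disconnected from fundamentals, and it proposes a maturity framework.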

Abstract

DeFi investment agents, systems that use AI for autonomous on-chain trading, have attained over USD 3 billion in combined token valuations since late 2024. We survey over 1,900 AI-tagged crypto projects, filter to investment-focused agents, and curate 10 representative projects spanning strategy and observability dimensions. We then conduct a deep-dive architectural analysis of two prominent agent frameworks, ElizaOS and Virtuals Protocol, and a quantitative on-chain performance analysis of 11 Solana-based agent treasuries with publicly attributable trading activity, covering 925,323 token holders. We find that current deployments remain early and heterogeneous: (1) in our sample, many projects do not yet provide clear evidence of autonomous trade execution, and developer interviews suggest that many visible deployments remain basic API integrations; (2) agent treasuries retain over USD 30M in paper gains while token holders collectively lost USD 191.7M, with the top 1% of wallets capturing 81.4% of all gains (USD 1.81B); (3) token valuations are weakly connected to treasury fundamentals, with market-cap-to-AUM ratios exceeding 10,000x versus below 1x for established DeFi protocols; and (4) aggregate user gains peaked at USD 2.4B before declining to net losses, with median returns negative on every platform and tokens declining 93% on average from all-time highs. We interpret these outcomes as characteristic of a permissionless, first-generation market in which open infrastructure enables rapid experimentation but also allows naive or speculative agents to launch before robust standards for autonomy, performance, and stakeholder alignment emerge. We therefore propose a maturity framework along three dimensions: autonomous execution, risk-adjusted profitability, and stakeholder alignment, to characterize the gap between current deployments and future investment-grade agent systems.

2605.29170 2026-05-29 cs.CL cs.AI (updated version)

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

Volodymyr Ovcharov

Affiliations: SecondLayer

AI summary: To address the English-centric focus of legal NLP benchmarks, the paper builds a five-task benchmark from Ukrainian court decisions and evaluates 11 LLMs, finding that the effect of few-shot prompting varies by task and that accuracy is misleading on imbalanced tasks.

Comments: 13 pages, 5 figures, 4 tables. Data: https://huggingface.co/datasets/overthelex/ua-legal-bench

Abstract

Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR) -- one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B--675B) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks but scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.
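
The gap between accuracy and macro-F1 on imbalanced labels can be reproduced on a toy example. The sketch below uses three hypothetical outcome labels and a majority-class predictor; the data are invented for illustration and are not drawn from UA-Legal-Bench.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1s = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Hypothetical imbalanced gold labels: 6 "granted", 2 "denied", 2 "partial".
gold = ["granted"] * 6 + ["denied"] * 2 + ["partial"] * 2
# A majority-class predictor always answers "granted".
majority = ["granted"] * len(gold)

acc = accuracy(gold, majority)
mf1 = macro_f1(gold, majority)
```

Here the majority predictor looks reasonable on accuracy but scores zero F1 on two of the three classes, which is the pattern the abstract warns about.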

2605.29168 2026-05-29 cs.AI cs.LG (updated version)

Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

Lorenzo Loconte, Timothy Hospedales, Cristina Cornelio

Affiliations: University of Edinburgh, UK; Samsung AI Center, Cambridge, UK

AI summary: Proposes a neuro-symbolic framework that uses post-extraction correction to resolve ontology inconsistencies in LLM-extracted knowledge graphs, reducing token usage and improving graph consistency.

Abstract

Question answering (QA) is a core challenge in AI, particularly for complex queries requiring multi-hop reasoning across documents, or symbolic operations like aggregation or exhaustive listing. Retrieval-augmented generation has become the dominant approach to QA, with recent graph-based variants addressing part of these issues by organizing knowledge to better support compositional questions. However, most textual graph-based RAG methods still lack the structure needed for symbolic operations useful to answer complex questions reliably. This motivates symbolic graph-based approaches, which extract knowledge graphs (KGs) whose relations are logic predicates that enable SQL-like querying. Yet these pipelines typically use LLMs for KG extraction, which can introduce consistency issues, where extracted facts may violate commonsense ontology constraints. We propose a neuro-symbolic framework for ontology-grounded KG construction combining open-domain extraction, embedding-based canonicalization of types and predicates, and targeted LLM-based correction of ontology violations. By deferring corrections to a post-extraction stage, our method avoids repeated LLM calls, substantially reducing token usage while improving KG consistency and preserving downstream QA quality. Finally, we show that the extracted KGs are well suited for symbolic querying by measuring the occurrence of SPARQL graph patterns.

2605.29161 2026-05-29 cs.LG cs.AI (updated version)

Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach

James Sargant, Seyedeh Ava Razi Razavi, Renata Dividino, Sheridan Houghten

Affiliations: Computer Science, Brock University, Canada

AI summary: Proposes a hybrid WGAN-GA method that refines GAN-generated graph structures with a genetic algorithm, reducing deviations in degree and spectral distributions so that synthetic graphs more closely match real ones.

Comments: 6 pages, 4 Figures, 4 Tables, IEEE World Congress on Computational Intelligence

Abstract

Generating realistic graph-structured data is challenging due to discrete connectivity, varying graph sizes, and class-specific structural patterns. Recent Generative Adversarial Network (GAN)-based graph generation methods improve edge modelling by learning connectivity and matching class-specific density distributions. However, these models still deviate noticeably from real graphs, for example in their degree and spectral distributions, indicating that important structural properties are not fully preserved. This work aims to reduce these deviations by refining the graphs produced by an existing GAN-based graph generator framework with a Genetic Algorithm (GA). In the GAN framework, the generator produces both node features and connectivity patterns, while a GNN-based critic evaluates graph realism and class consistency to ensure global structural and class alignment. Building on this foundation, we apply a GA to refine the edges of generated graphs. The refinement process guides synthetic graphs toward closer agreement with real data, while preserving diversity and novelty. Experimental results show that the GA refinement consistently lowers combined Maximum Mean Discrepancy (MMD) compared to the base model, leading to graphs that more closely match real structural patterns. This demonstrates that evolutionary refinement is an effective and flexible way to correct residual structural deviations in GAN-based graph generators, improving their suitability for realistic graph synthesis and data augmentation.

2605.29157 2026-05-29 cs.LG cs.AI cs.CL (updated version)

Parallax: Parameterized Local Linear Attention for Language Modeling

Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf, Shuming Hu, Zhaoran Wang

Affiliations: Northwestern University; Tilde Research; University of Washington

AI summary: Proposes Parallax, a scalable parameterized local linear attention mechanism that eliminates the numerical solver and learns a query projector, yielding consistent perplexity improvements in language-model pretraining and gains that transfer to downstream tasks.

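Technical Debt Governance in Agentic AI Systems

Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu

Affiliations: School of Business, University of Pittsburgh

AI summary: This paper defines the concepts of technical debt and stochastic tax in agentic AI systems and proposes managing these liabilities and operating costs through lightweight dashboards and governance controls.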

Abstract

Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call tools, act through workflows, and adapt through memory and feedback. These systems create governance challenges that are not fully captured by traditional software or predictive ML technical debt. We define Agentic Technical Debt as the accumulated liability created when prompts, memory, tool schemas, orchestration graphs, control policies, and observability routines are patched together faster than they can be validated, standardized, and governed. We define Stochastic Tax as the recurring operating burden of keeping probabilistic agent behavior within acceptable bounds. The distinction matters: debt is a stock of design and governance liability, while the tax is a flow of operating cost that arises because stochastic agents act through tools and workflows. We outline how managers can make both visible through lightweight dashboards and governance controls.

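2605.29126 2026-05-29 cs.LG cs.AI (updated version)

When and How Long? The Readout-Mediator Angle in Temporal Reasoning

Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss

Affiliations: Bioscope AI

AI summary: By measuring the angle between a linear probe and the subspace the model actually computes with, the paper finds that probes can learn orthogonal directions unrelated to the model, exposing a fundamental flaw in probe-based interpretability.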

Abstract

A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date duration reasoning in language models, a $\sin$/$\cos$ probe recovers day-of-year from a layer's activations, yet ablating its direction has no effect on the model's answers -- while ablating a four-dimensional subspace found by Distributed Alignment Search (DAS) at the same layer collapses performance entirely. We measure the angle between these two subspaces -- the \emph{readout-mediator angle} -- and find it indistinguishable from the angle between two random subspaces (the Haar-uniform null), meaning the probe has learned a direction orthogonal to the model's actual computation. Reverse-engineering the circuit reveals why: attention heads route month-grained context through learned QK offsets at ${\pm}30$ and ${\pm}61$ days, and MLPs then convert \emph{when} (absolute date) into \emph{how long} (duration) -- all downstream of the causal subspace the probe never touches. Sparse-autoencoder decomposition confirms the split: probe-aligned and DAS-aligned features encode semantically disjoint concepts with negligible causal overlap. The dissociation replicates across four scales ($1.5$-$9\,$B) and two model families, with preliminary evidence on two further domains (spatial displacement, symbolic arithmetic), suggesting that readout-mediator orthogonality is a general failure mode of probe-based interpretability. This directly undermines proposals to deploy probes as runtime safety monitors: the probe can report high confidence on a direction the model has silently abandoned.
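
The geometric idea behind the readout-mediator angle can be illustrated with a simplified case: the angle between a single probe direction and a low-dimensional subspace. The paper compares two subspaces, so this is a reduced sketch under that simplifying assumption; the function names and the toy vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for span(vectors)."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def angle_to_subspace(direction, subspace_vectors):
    """Angle in degrees between a direction and span(subspace_vectors)."""
    norm = math.sqrt(dot(direction, direction))
    u = [d / norm for d in direction]
    basis = orthonormalize(subspace_vectors)
    # Length of the projection of the unit direction onto the subspace.
    proj = math.sqrt(sum(dot(u, q) ** 2 for q in basis))
    return math.degrees(math.acos(min(1.0, proj)))


# Toy 6-dimensional activation space with a 4-dimensional "causal" subspace.
causal = [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0],
          [0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
orthogonal_probe = [0, 0, 0, 0, 1, 0]
partial_probe = [1, 0, 0, 0, 1, 0]

angle_orthogonal = angle_to_subspace(orthogonal_probe, causal)
angle_partial = angle_to_subspace(partial_probe, causal)
```

A probe direction at 90 degrees has no component in the subspace, so removing it leaves anything computed inside that subspace unchanged, which matches the ablation result the abstract reports.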

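2605.29123 2026-05-29 cs.AI cs.CL (updated version)

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

Dueun Kim, Albert No

Affiliations: Department of Artificial Intelligence, Yonsei University

AI summary: This paper identifies a reasoning failure mode of masked diffusion models under confidence-based decoding: locally easy parts are predicted prematurely while long-range dependencies are ignored, raising error rates on complex inputs, whereas random-masking training preserves the reasoning-trajectory conditionals.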

Abstract

Masked diffusion language models (MDMs) uniquely support any-order generation, with confidence-based decoding currently serving as the de facto standard inference policy. To optimize for this, recent training schemes attempt to align training mask patterns directly with those observed during generation. However, we argue that confidence-based decoding is inherently misaligned with the logical-flow trajectories required for complex reasoning, and that confidence-aligned training actively entrenches this misalignment. We make this concrete using multi-digit addition, where the decoding strategy prematurely predicts locally easy digits before resolving their long-range dependencies, producing high-confidence errors on challenging inputs. While traditional random masking keeps the failure rate low on this challenging tail, confidence-aligned training amplifies the error rate by an order of magnitude. Across five distinct reasoning tasks, this same pattern emerges with task-dependent severity: confidence-based decoding induces failures on highly complex inputs, and confidence-aligned training exacerbates them. In contrast, random masking -- despite its perceived inefficiency -- robustly preserves the reasoning-trajectory conditionals essential for solving the challenging tail.
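
The long-range dependency in multi-digit addition comes from carry propagation. The sketch below counts, for each output digit, how many lower digit positions must be inspected before its incoming carry is known; the function name and the operands are illustrative assumptions and are not the paper's evaluation code.

```python
def carry_dependency_spans(a, b):
    """For each output digit (least significant first), count how many
    lower digit positions must be inspected to know its incoming carry."""
    da = [int(ch) for ch in reversed(str(a))]
    db = [int(ch) for ch in reversed(str(b))]
    n = max(len(da), len(db))
    da += [0] * (n - len(da))
    db += [0] * (n - len(db))
    sums = [x + y for x, y in zip(da, db)]
    spans = []
    for i in range(n):
        span = 0
        j = i - 1
        while j >= 0:
            span += 1
            # A digit sum of 9 passes the carry through unchanged, so we
            # must look further down; any other sum settles the carry here.
            if sums[j] != 9:
                break
            j -= 1
        spans.append(span)
    return spans


easy_spans = carry_dependency_spans(123, 321)
hard_spans = carry_dependency_spans(4999, 5001)
```

In the second case the most significant digit depends on every lower position, so a decoder that commits to it early, because it looks locally easy, has to guess the carry.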

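2605.29121 2026-05-29 math.DS cs.AI cs.LG (updated version)

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

O. M. Kiselev

Affiliations: Innopolis University

AI summary: Proposes a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts layer, derived from a discrete reinforcement rule via a mean-field limit; finds that a supercritical pitchfork bifurcation leads to load imbalance, and derives exact parametric equations for the bifurcation set and the cusp catastrophe.

Comments: 21 pages, 11 figures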

Abstract

We propose a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts (MoE) layer. The model is obtained as a mean-field limit of a discrete reinforcement rule: the selected expert receives a small score increment, while all scores undergo regularizing decay. In the symmetric case the limiting system has a supercritical pitchfork bifurcation: for weak feedback there is a unique stable balanced state, whereas above a critical feedback strength two stable asymmetric states appear. When an external asymmetry is added, the pitchfork unfolds into a pair of fold bifurcations forming a cusp in the control-parameter plane. We derive exact parametric equations for the bifurcation set and the local normal form of the cusp catastrophe. Numerical experiments connect this picture to empirical expert load, a small trainable MoE model, hard top-1 PyTorch routing, and a small classification experiment on digits. The results provide a controlled low-dimensional mechanism for abrupt transitions to load imbalance in adaptive MoE routers.
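
A pitchfork of this kind can be reproduced with a one-dimensional toy equation. The sketch assumes a specific mean-field form that the abstract does not state: with score gap $d = s_1 - s_2$, a two-way softmax gives $p_1 - p_2 = \tanh(d/2)$, so reinforcement with strength `gain` and decay rate `decay` yields $\dot d = \text{gain}\cdot\tanh(d/2) - \text{decay}\cdot d$, whose balanced state loses stability at gain = 2·decay. The parameter names and values are hypothetical.

```python
import math


def settle(gain, decay, d0=0.01, dt=0.01, steps=20_000):
    """Euler-integrate d' = gain * tanh(d / 2) - decay * d; return the final d."""
    d = d0
    for _ in range(steps):
        d += dt * (gain * math.tanh(d / 2.0) - decay * d)
    return d


def load_share(d):
    """Softmax probability of routing to expert 1 given score gap d."""
    return 1.0 / (1.0 + math.exp(-d))


# Below the assumed threshold (gain < 2 * decay) the balanced state is stable.
weak = settle(gain=1.0, decay=1.0)
# Above it, a tiny initial bias in either direction grows into an imbalance.
strong_pos = settle(gain=4.0, decay=1.0, d0=0.01)
strong_neg = settle(gain=4.0, decay=1.0, d0=-0.01)
```

Under these assumptions, `load_share(weak)` stays at one half while `load_share(strong_pos)` moves close to one, and the two strong-feedback runs are mirror images of each other.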

2605.29119 2026-05-29 cs.AI (updated version)

PRO-CUA: Process-Reward Optimization for Computer Use Agents

Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao

Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes the PRO-CUA framework, which uses a process reward model and step-level reinforcement learning to address the imitation bottleneck and sparse rewards in training computer use agents.

Abstract

Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.
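
Group-relative advantages for the candidate actions sampled at one state can be sketched as below. The abstract does not specify the normalization, so this assumes the common choice of centering on the group mean and scaling by the group standard deviation; the function name and the reward values are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Center step-level rewards on the group mean and scale by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical PRM scores for four candidate actions sampled at one state.
prm_scores = [0.9, 0.1, 0.5, 0.5]
advantages = group_relative_advantages(prm_scores)
```

Candidates scored above the group average receive positive advantages and those below receive negative ones, which supplies a negative learning signal without a golden answer.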

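2605.29116 2026-05-29 cs.AI (updated version)

Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss

Affiliations: Bioscope AI

AI summary: This paper proposes trace-level synthesis, which generates diverse reasoning traces through semantic-preserving input perturbations and uses anchored refinement to guarantee non-degradation, so that correct solutions can still be recovered when majority voting fails, going beyond answer-based aggregation.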

Abstract

When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones -- the \emph{aggregation paradox}. Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace-level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents which generates trace diversity through semantic-preserving input perturbations, safeguards the majority via anchored refinement with provable non-degradation guarantees, and always synthesizes -- never gates on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD-level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.

2605.29115 2026-05-29 cs.CR cs.AI 版本更新

链条保持，答案翻转：对抗压力下推理模型中的痕迹-答案分离

Yubo Li, Ramayya Krishnan, Rema Padman

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结本研究通过2×2潜在-行为框架发现推理模型在持续对抗压力下出现“不忠实屈服”故障模式，即思维链保持正确但答案错误，并验证了推理通道对此的影响。

AI中文摘要

推理模型在单轮基准测试中评估，但部署在多轮对话中，用户会对正确答案进行反驳。在持续对抗压力下，我们发现了一种先前未记录的故障模式：思维链从第一轮到最后一轮保持事实正确，而输出的答案却翻转错误。我们称此为不忠实屈服（UC），并通过一个$2\times 2$潜在-行为框架将其分离出来，该框架揭示了翻转率指标和单轮忠实度探测均遗漏的问题。在三个数据集（MT-Consistency、MMLU-Pro、GSM8K）上，行为翻转时的潜在正确率在思考模式下聚集在50%附近，在无思考模式下降至11-15%——这是模型内成对因果证据，表明推理造成了这一差距。跨模型而言，该效应与推理通道相关（在Qwen3-32B和GPT-OSS-20B中较高，在内联思维链的Gemma-4-31B-it中较低）。独立的GPT-4o评判者验证了86%的UC标签；令牌级探测显示答案槽的argmax在84%的UC单元中是正确的；而一种朴素的痕迹锚定防御适得其反。我们发布了所有轨迹、痕迹和评判者标签。

英文摘要

Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from first turn to last while the emitted answer flips wrong. We call this unfaithful capitulation (UC) and isolate it with a $2\times 2$ latent-versus-behavioral framework that flip-rate metrics and single-turn faithfulness probes both miss. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% under no_think -- paired, within-model causal evidence that reasoning creates the gap. Across models the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows the answer-slot argmax is correct in 84% of UC cells; and a naive trace-anchored defense backfires. We release all trajectories, traces, and judge labels.

2605.29084 2026-05-29 cs.CL cs.AI cs.IR 版本更新

具有潜在推理的鲁棒高效防护栏

Siddharth Sai, Xiaofei Wen, Muhao Chen

发表机构 * University of California, Davis（加州大学戴维斯分校）

AI总结提出COLAGUARD模型，通过阶段式训练将多步安全推理转移到连续潜在空间，在保持高安全性能的同时实现12.9倍加速和22.4倍令牌减少。

AI中文摘要

随着大型语言模型（LLMs）在现实应用中的日益部署，维护其安全性至关重要。现有的安全防护栏通常依赖单次分类或更近期的蒸馏推理。基于推理的防护栏显著优于仅分类的基线，但会带来大量的查询延迟和令牌开销，使其不适用于高吞吐量部署。为了解决这一挑战，我们提出了COLAGUARD，一种通过阶段式训练课程将多步安全推理转移到连续潜在空间的防护栏模型，从而在推理时实现直接的隐藏状态传播。在涵盖八个安全基准的十个提示和响应审核设置上评估，COLAGUARD在宏观F1上比Llama Guard 3提高了8.24分，并与我们的显式推理基线GuardReasoner在宏观F1上相当，同时实现了12.9倍的加速和22.4倍的令牌使用减少。我们的结果表明，潜在推理为可部署的防护栏提供了一种实用的替代方案，以替代显式理由生成，共同提高安全鲁棒性和推理效率，而不是将它们视为竞争目标。

英文摘要

Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose COLAGUARD, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, COLAGUARD improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9X speedup and 22.4X reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.

2605.29059 2026-05-29 cs.SE cs.AI cs.CR 版本更新

SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers

SCDBench: 基于大语言模型的智能合约反编译基准

Kaihua Qin, Dawn Song, Arthur Gervais

发表机构 * University of Warwick（华威大学）； UC Berkeley（加州大学伯克利分校）； University College London（伦敦大学学院）

AI总结针对现有智能合约反编译评估缺乏统一基准的问题，提出SCDBench数据集与评估方法，通过四阶段累积评估（格式完整性、可编译性、ABI恢复、语义一致性）测试前沿LLM的反编译能力，发现语义一致性仍远未解决。

AI中文摘要

智能合约反编译旨在从字节码恢复高级源代码，但评估反编译器仍然困难，因为现有研究使用狭窄的数据集、不一致的度量标准和有限的语义一致性检查。随着大语言模型（LLMs）开始生成类似源代码的Solidity代码，这些代码可能编译通过并看似合理，即使其语义与原始合约存在偏差，这一差距变得日益重要。我们引入了SCDBench，一个用于基于LLM的智能合约反编译的数据集和基准方法。该数据集包含600个真实世界的Solidity合约，配有配对的字节码输入、真实源代码和可重放的语义检查点。SCDBench通过四个累积阶段评估反编译器的输出：格式完整性、可编译性、应用程序二进制接口（ABI）恢复以及通过差分重放实现的语义一致性。我们在零样本反编译设置中评估了Claude Opus 4.7、GPT-5.3-Codex和GLM-5，包括具有和不具有扩展推理的GLM-5变体，以及零样本编译修复设置。结果表明，前沿LLM通常能够生成结构化和可编译的Solidity代码，但实现语义一致性仍远未解决：表现最好的前沿模型仅完美反编译了42/600个合约。我们进一步表明，引入同模型编译修复在适度增加成本的情况下显著提升了性能。SCDBench为严格、可重复的评估建立了共同基础，旨在加速开发用于区块链安全性和透明度的可靠智能合约反编译器。

英文摘要

Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies use narrow datasets, inconsistent metrics, and limited semantic consistency checks. This gap is increasingly important as large language models (LLMs) begin to generate source-like Solidity that may compile and appear plausible, even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, including GLM-5 variants with and without extended reasoning and a zero-shot compilation-repair setting. The results show that frontier LLMs can often produce structured and compilable Solidity, but achieving semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42/600 contracts. We further show that introducing same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.

2605.29055 2026-05-29 cs.AI cs.MA 版本更新

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

基于智能体AI、嵌套学习与语义缓存的幻觉缓解与AI可持续性

Diego Gosmar, Deborah A. Dahl

发表机构 * Head of AI, Tesisquare; Member, Open Voice Interoperability Initiative, Linux Foundation AI & Data（Tesisquare AI负责人；Linux基金会AI与数据开放语音互操作性倡议成员）； Principal, Conversational Technologies; Member, Open Voice Interoperability Initiative, Linux Foundation AI & Data（Conversational Technologies负责人；Linux基金会AI与数据开放语音互操作性倡议成员）

AI总结提出一种HOPE启发的嵌套学习架构，结合连续记忆系统和语义缓存，通过三阶段智能体管道在混合基准上实现幻觉缓解，同时降低能耗并提高可观测性。

Comments 21 pages, 14 figures

AI中文摘要

幻觉仍然是生产级LLM系统的主要可靠性障碍，特别是在多智能体管道中，未经支持的声明可能在各阶段不受控制地传播。本文将一种受HOPE启发的嵌套学习架构与连续记忆系统（CMS）和语义相似性缓存相结合，应用于一个混合基准测试，该基准包含310个提示，包括217个认知不确定性提示和93个虚构诱导压力测试提示。通过开放地板协议（OFP）编排的三阶段智能体管道，使用五个KPI进行评估——事实声明密度（FCD）、事实依据参考（FGR）、虚构免责声明频率（FDF）、显式情境化得分（ECS）和可观测性得分比率（OSR）——聚合为总幻觉得分（THS），在五种权重配置下研究缓解与可观测性之间的权衡。FDF、ECS、OSR和FGR作为缓解信号被减去，因此更负的THS表示更强的缓解。前端代理被配置为高随机性生成器（温度=1.0）以产生真实的幻觉基线，而二级审查者和三级审查者作为渐进式纠正器运行。这种非对称设计在五种权重配置下实现了端到端THS降低-31.3%至-35.9%。语义缓存在930次潜在调用中实现了440次缓存命中（命中率47.3%），将LLM调用减少至490次，降低了能源和二氧化碳足迹，使多阶段审查管道在生产规模下操作可行。极端可观测性获得了最负的最终THS（-0.0709），证实了高可观测性配置强化而非损害缓解效果。这些发现表明，记忆增强的多智能体设计可以在无需模型重新训练的情况下，共同提高事实可靠性、操作效率和可审计性。

英文摘要

Hallucination remains a major reliability barrier for production LLM systems, particularly in multi-agent pipelines where unsupported claims can propagate unchecked across stages. This paper adapts a HOPE-inspired Nested Learning architecture with Continuum Memory Systems (CMS) and semantic similarity caching to a hybrid benchmark of 310 prompts combining 217 epistemic-uncertainty prompts and 93 fabrication-induction stress-test prompts. A three-stage agentic pipeline orchestrated via the Open Floor Protocol (OFP) is evaluated with five KPIs -- FCD (Factual Claim Density), FGR (Factual Grounding References), FDF (Fictional Disclaimer Frequency), ECS (Explicit Contextualization Score), and OSR (Observability Score Ratio) -- aggregated into THS (Total Hallucination Score) across five weighting configurations to study mitigation-observability trade-offs. FDF, ECS, OSR, and FGR are subtracted as mitigation signals, so that a more negative THS indicates stronger mitigation. The FrontEndAgent is configured as a high-stochasticity generator (temperature = 1.0) to produce a realistic hallucination baseline, while the SecondLevelReviewer and ThirdLevelReviewer operate as progressive correctors. This asymmetric design yields end-to-end THS reductions of -31.3% to -35.9% across five weighting configurations. Semantic caching achieves 440 cache hits over 930 potential calls (47.3% hit rate), reducing LLM invocations to 490, lowering energy and CO2e footprint, and making multi-stage review pipelines operationally viable at production scale. ExtremeObservability attains the most negative final THS (-0.0709), confirming that observability-heavy configurations reinforce rather than compromise mitigation. These findings suggest that memory-augmented multi-agent designs can jointly improve factual reliability, operational efficiency, and auditability without model retraining.
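
The semantic-similarity caching idea can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the `SemanticCache` class, the cosine-similarity threshold of 0.9, and the two-dimensional "embeddings" are all hypothetical. The last lines only re-derive the figures reported in the abstract (440 hits over 930 potential calls).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Hypothetical cache: reuse a stored response for a similar-enough prompt."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs
        self.hits = 0
        self.misses = 0

    def lookup(self, embedding):
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            self.hits += 1
            return best[1]
        self.misses += 1
        return None  # caller would invoke the LLM and then store() the result

    def store(self, embedding, response):
        self.entries.append((list(embedding), response))


cache = SemanticCache(threshold=0.9)
cache.store([1.0, 0.0], "cached answer")
near = cache.lookup([0.99, 0.05])  # nearly parallel vector: served from cache
far = cache.lookup([0.0, 1.0])     # orthogonal vector: cache miss

# Arithmetic check of the figures reported in the abstract.
potential_calls, cache_hits = 930, 440
llm_calls = potential_calls - cache_hits
hit_rate = cache_hits / potential_calls
```

Every hit replaces one LLM invocation, which is where the reported reduction from 930 potential calls to 490 actual calls comes from.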

2605.29042 2026-05-29 cs.AI cs.LG 版本更新

Differentiable Belief-based Opponent Shaping

基于可微信念的对手塑造

Aarav G Sane, Karthik Sivachandran, Rohan Paleja

发表机构 * Department of Computer Science（计算机科学系）

AI总结提出D-BOS方法，通过可微的信念更新和梯度传播，在隐藏角色游戏中实现对手信念的塑造，从而自然涌现最优策略。

AI中文摘要

人类协调往往依赖于通过战略行动影响他人信念的能力。在多智能体强化学习中，对手塑造试图复制这种影响，尽管现有方法通常作用于对手的参数、策略或价值空间。同时，隐藏角色游戏中的信念操纵技术通常依赖于硬编码的目标，如欺骗或信念饱和。我们提出基于可微信念的对手塑造（D-BOS），一种一阶方法，将每个观察者的信念视为被塑造的对手状态，并通过$k$步softmax-贝叶斯信念动力学进行微分。我们的方法不显式奖励欺骗或合作行为，而是将信念状态作为塑造目标。这使得最优策略能够从环境奖励结构中自然涌现。这种信念空间公式通过微分对手信念更新提供对手塑造信号，并通过聚合多个观察者个体推断信念轨迹上的梯度，自然地扩展到多个观察者。实验上，D-BOS在隐藏角色游戏中优于PPO和BBM，在混合动机设置中提升最大。

英文摘要

Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi-agent reinforcement learning, opponent shaping attempts to replicate this influence, though existing methods typically operate within an opponent's parameter, policy, or value space. Meanwhile, belief-manipulation techniques in hidden-role games often rely on hard-coded objectives, such as deception or belief saturation. We propose Differentiable Belief-based Opponent Shaping (D-BOS), a first-order method that treats each observer's belief as the shaped opponent state and differentiates through $k$-step softmax-Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target for shaping. This allows the optimal strategy to emerge naturally from the environment's reward structure. This belief-space formulation provides an opponent-shaping signal by differentiating through opponent belief updates, and naturally extends to multiple observers by aggregating gradients over their individual inferred belief trajectories. Empirically, D-BOS outperforms PPO and BBM in hidden-role games, with the largest gains in mixed-motive settings.
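
As a rough illustration of the belief dynamics mentioned above, the sketch below assumes that a "softmax-Bayes" update means Bayes' rule computed as a softmax over log-prior plus log-likelihood; the abstract does not define the update, so this reading, the function names, and the numbers are all assumptions. It only rolls beliefs forward for k steps. Differentiating through the rollout, as D-BOS does, would additionally require an autodiff framework.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def belief_update(belief, likelihoods):
    """One Bayes step in log space: posterior = softmax(log prior + log likelihood).

    Assumes strictly positive beliefs and likelihoods.
    """
    logits = [math.log(b) + math.log(l) for b, l in zip(belief, likelihoods)]
    return softmax(logits)


def k_step_beliefs(belief, likelihood_seq):
    """Roll the observer's belief forward, one step per observed action."""
    trajectory = [belief]
    for likelihoods in likelihood_seq:
        belief = belief_update(belief, likelihoods)
        trajectory.append(belief)
    return trajectory


# Hypothetical two-role game: the observer starts uniform and twice sees an
# action that is more likely under role 0 (0.8) than under role 1 (0.3).
prior = [0.5, 0.5]
observed = [[0.8, 0.3], [0.8, 0.3]]
trajectory = k_step_beliefs(prior, observed)
```

In this reading, the acting agent's choices determine the likelihood terms, so the observer's belief trajectory becomes a quantity the agent can influence.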

2605.29041 2026-05-29 cs.AI 版本更新

当模型存在分歧：重新思考用于公众评论分析的LLM评估

Aisha Najera, Alvin Moon, Vedant Srinivasan, Rajesh Veeraraghavan

发表机构 * AI Lab, Princeton University Engineering and Applied Sciences（普林斯顿大学人工智能实验室、工程与应用科学学院）； RAND Corporation（兰德公司）； Engineering and Applied Sciences（工程与应用科学）； Science, Technology, and International Affairs（科学、技术与国际事务）； Georgetown University（乔治城大学）

AI总结提出一种解释性审计流程，利用多模型分歧检测解释复杂性，引导人工审查关注真正模糊的公众意见，以补充传统基于准确率的评估方法。

AI中文摘要

联邦机构正在部署大型语言模型（LLM）对公众评论语料进行分类，模型对记录的组织方式会影响政策制定者看到的内容以及哪些论点被记录。基于小规模验证集上的立场准确率的标准评估无法检测不同模型对同一公众输入产生实质性不同分类的情况。我们提出了一种解释性审计流程，将多模型分歧视为解释复杂性的诊断，并引导人工审查关注真正模糊的公众输入。通过分析四个LLM对联邦USDA案卷中1,260条公众评论的结果，我们发现模型间的主题分歧超过了模型内的提示变化，并且专家评分标准抑制了深层的解释分歧而未解决它。在一项针对分层抽样的40条评论子样本的两阶段标注研究中，四个LLM和一名人工标注员独立标注，然后在看到其他标注员的标签后进行修订。修订行为在不同标注员之间有所不同，人工标注员的修订经常引入整体集成输出中不存在的框架。我们认为基于分歧的评估是LLM辅助解释性编码中准确率指标的必要补充。

英文摘要

Federal agencies are deploying large language models (LLMs) to categorize public comment corpora, where the model's organization of the record shapes what policymakers see and which arguments register. Standard evaluation, anchored on stance accuracy against a small validated set, cannot detect when different models produce materially different categorizations of the same public input. We propose an Interpretive Audit Pipeline that treats multi-model disagreement as diagnostic of interpretive complexity and directs human review toward genuinely ambiguous public input. Analyzing 1,260 public comments on a federal USDA docket across four LLMs, we find that inter-model thematic divergence exceeds within-model prompt variation, and that an expert rubric suppresses deep interpretive disagreement without resolving it. In a two-stage labeling study on a stratified 40-comment subsample, four LLMs and a human annotator labeled independently and then revised after seeing the others' labels. Revision behavior varied across labelers, and the human annotator's revisions frequently introduced framings absent from the ensemble's collective output. We argue disagreement-based evaluation is a necessary complement to accuracy metrics for LLM-assisted interpretive coding.

2605.29018 2026-05-29 cs.AI cs.CL 版本更新

Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

采用 ≠ 适应：野外LLM对话的纵向分析

Rebecca M. M. Hicke, Kiran Tomlinson

发表机构 * Cornell University（康奈尔大学）； Microsoft Research（微软研究院）

AI总结通过分析约12,000名Microsoft Bing Copilot用户的对话轨迹及WildChat-4.8M数据，发现用户行为高度固化，活跃用户更倾向复杂专业任务，且WildChat数据集偏向高熟练度“超级用户”，表明现有用户行为难以改变并揭示用户异质性。

AI中文摘要

尽管越来越多的研究开始描述用户与LLM的交互，但其描绘的画面基本上是静态的；关于个体用户如何随时间改变其行为，我们知之甚少。为填补这一空白，我们分析了约12,000名随机抽样的Microsoft Bing Copilot用户的对话轨迹，并与WildChat-4.8M的数据进行比较。虽然Copilot数据包含显著的人群层面趋势，但我们发现个体用户轨迹中的趋势要弱得多；用户习惯被证明极其顽固。我们还发现不同活跃度用户之间存在显著差异：更活跃的用户拥有更成功的对话，并使用LLM处理更复杂和专业导向的任务。一些用户趋势也出现在WildChat-4.8M中，但我们发现证据表明该数据集显著偏向高熟练度的“超级用户”。最终，我们的结果表明现有用户行为难以改变，并展示了用户异质性的程度。我们数据集之间的比较突显了WildChat并不代表典型的用户-AI交互，这是对数据下游使用的一个重要警示。

英文摘要

Although a growing body of research has begun to describe user--LLM interactions, the picture it paints is largely static; little is known about how individual users change their behavior over time. To address this gap, we analyze the conversational trajectories of $\sim$12,000 randomly sampled Microsoft Bing Copilot users and compare these with data from WildChat-4.8M. While the Copilot data contains significant population-level trends, we find that trends in individual user trajectories are much weaker; user habits prove to be overwhelmingly sticky. We also find stark differences between users of different activity levels: more active users have more successful conversations and use the LLM for more complex and professionally oriented tasks. Some user trends also appear in WildChat-4.8M, but we find evidence that this dataset is significantly skewed towards highly proficient "power" users. Ultimately, our results suggest that existing user behavior is difficult to change and demonstrate the extent of user heterogeneity. Our comparison between datasets highlights that WildChat does not represent typical user-AI interactions, an important caveat for downstream uses of the data.

2605.29009 2026-05-29 cs.LG cs.AI 版本更新

BEAMS：用于建模与仿真的AI基准测试与评估

Sara Metcalf, William Schoenberg

AI总结提出BEAMS倡议，通过建立以人为本的基准测试框架，评估AI工具在建模与仿真中的表现，发现其在因果推理和定量修正方面存在不足。

AI中文摘要

支持现实世界决策的AI工具必须能够构建仿真模型，为其建议提供依据并使其可解释。能够自动化建模实践某些方面的工具必须补充人类专业知识，而非取代它。BEAMS倡议旨在通过建立以人为本的建模与仿真实践的基准，引导AI工具在建模与仿真领域的发展走向负责任和合乎伦理的形式。该倡议利用开放的数字和组织基础设施，协作评估用于建模与仿真的AI工具。倡议托管的开源sd ai项目确保了透明度，并使贡献能够广泛共享。指导小组专注于优先考虑潜在基准，而技术小组则专注于以自动化测试的形式实施基准。针对多个不同评估类别的测试已经实施，并应用于支持定性模型构建、定量模型构建和模型讨论的AI工具。这些测试包括因果翻译、模型迭代、因果推理、一致性、模型行为解释、建议的模型构建步骤以及建议的模型修正。当sd ai项目的引擎与不同的LLM结合时，它们在这些评估上的表现揭示了不同AI工具之间的差异。倡议实施的评估表明，支持AI的建模工具在讨论和基本定性任务上的表现优于因果推理和定量错误修正。没有单一的LLM在所有引擎类型中占据主导地位，这突显了特定任务的重要性以及速度与准确性之间的权衡。倡议的持续努力旨在纳入考虑替代视角和以人为本用例的基准，以解决对偏见的担忧。

英文摘要

AI tools to support real-world decision making must be able to build simulation models that inform their recommendations and render them interpretable. Tools that can automate aspects of modeling practice must complement human expertise, not replace it. The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human-centered modeling and simulation practices. The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation. The open source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly. A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests. Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion. These include tests for causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes. When engines from the sd ai project are coupled with different LLMs, their performance on these evaluations reveals variability across different AI tools. The evaluations implemented by the initiative demonstrate that AI-enabled modeling tools perform better at discussion and basic qualitative tasks than at causal reasoning and quantitative error fixing. No single LLM dominates across engine types, highlighting the importance of specific tasks and tradeoffs between speed and accuracy. Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human-centered use cases.

2605.28983 2026-05-29 cs.LG cs.AI math.DS math.RT physics.comp-ph 版本更新

The Hamilton-Jacobi Theory of Deep Learning

深度学习的哈密顿-雅可比理论

Jose Marie Antonio Miñoza, Erika Fille T. Legara, Christopher P. Monterola

发表机构 * Center for AI Research PH（菲律宾人工智能研究中心）； Asian Institute of Management（亚洲管理学院）

AI总结本文通过将神经网络训练精确识别为哈密顿-雅可比初值问题的搜索，建立了深度学习与粘性哈密顿-雅可比方程之间的严格对应关系，并统一了残差网络、Transformer、RNN等架构，导出了最优泛化率、对抗鲁棒性等定量结果。

AI中文摘要

在本文中，神经网络训练被精确地识别为通过哈密顿-雅可比初值问题的搜索：每个梯度步选择粘性哈密顿-雅可比方程的初始数据，其Hopf-Cole传播子最佳拟合观测值；在推理时，输入是评估该解的空间点，初始条件已编码在权重中。这种对应对于log-sum-exp层是精确的，对于更广泛的架构（残差网络、Transformer和循环架构（RNN、LSTM、SSM））是结构性的，它们离散化同一类哈密顿-雅可比方程，具有依赖于架构的哈密顿量和粘性。一个单一的变形参数ε在交换图中统一了所有四个视角（网络、热带代数、粘性PDE、凸优化），并在Lipschitz条件下封闭。定量结果包括：固定t时的极小极大最优泛化率O(n^{-1/(d+2)})；由ε控制的对抗鲁棒性；残差网络的反向传播作为哈密顿系统的协态方程（庞特里亚金最大值原理）；通过PDE求积与数据内在维度一致的标度指数；以及闭式O(N)影响函数（softmax归因权重π_j），其熵景观随着ε增加经历折叠分岔，每个分岔合并归因盆地。

英文摘要

In this paper, training a neural network is identified, exactly, as a search through Hamilton--Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton--Jacobi equation whose Hopf--Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton--Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter $\varepsilon$ unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax optimal generalization rate $O(n^{-1/(d+2)})$ for fixed $t$; adversarial robustness controlled by $\varepsilon$; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with data intrinsic dimension via PDE quadrature; and a closed-form $O(N)$ influence function (softmax attribution weights $\pi_j$) whose entropy landscape undergoes fold bifurcations as $\varepsilon$ increases, each merging attribution basins.
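
The role of the deformation parameter can be illustrated numerically with the standard temperature-scaled log-sum-exp, $\varepsilon \log \sum_i \exp(x_i/\varepsilon)$. The sketch below relies only on the well-known fact that this quantity lies between $\max_i x_i$ and $\max_i x_i + \varepsilon \log n$, so it approaches the hard maximum (the tropical limit) as $\varepsilon \to 0$. The function name is hypothetical, and the link to the paper's specific Hamilton--Jacobi construction is not reproduced here.

```python
import math


def smooth_max(xs, eps):
    """Temperature-scaled log-sum-exp: eps * log(sum(exp(x / eps))).

    The maximum is subtracted before exponentiating for numerical stability.
    """
    m = max(xs)
    return m + eps * math.log(sum(math.exp((x - m) / eps) for x in xs))


xs = [1.0, 2.0, 3.0]
# Smaller eps gives a value closer to max(xs) = 3.0.
values = {eps: smooth_max(xs, eps) for eps in (1.0, 0.1, 0.01)}
```

Larger values of the parameter blend the inputs more smoothly, while smaller values recover the piecewise-linear max.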

2605.28978 2026-05-29 cs.AI cs.CE 版本更新

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis

VFEAgent: 面向端到端自动化有限元分析的多模态智能体框架

Jiachen Zhang, Junyi Lao, Chenghao Liu, Siyuan Liu, Shixin Wu, Linsen Zhang, Boyu Wang, Songfang Huang

发表机构 * Peking University（北京大学）； China Agricultural University（中国农业大学）

AI总结提出VFEAgent多智能体系统，通过多模态视觉-语言流水线和验证优先的代码合成框架，实现从输入图像和问题描述到有限元建模与仿真的端到端自动化。

Comments 9 pages, 3 figures, 2 tables. Equal contribution: Jiachen Zhang and Junyi Lao. Corresponding author: Songfang Huang. Preprint

AI中文摘要

有限元分析（FEA）是现代工程设计的基石。然而，其工作流程本质上复杂且高度依赖领域专业知识。尽管近期有研究将大语言模型（LLM）集成到FEA中，但现有方法在处理多模态输入和执行复杂任务方面存在局限性。为解决这些限制，我们提出了VFEAgent，一个端到端的多智能体系统，旨在直接从输入图像和问题描述中自动化FEA建模和仿真。我们的方法整合了两个核心组件：（1）多模态视觉-语言多智能体流水线，采用ReAct驱动的推理从异构输入中提取结构化的FEA规范；（2）验证优先的代码合成框架，结合了强大的自调试和回退机制，以确保可执行性和物理有效性。我们在各种工程力学场景下系统评估了该系统。结果表明，VFEAgent在生成完整且物理有效的仿真方面取得了高成功率，在可靠性和正确性上优于基于LLM的基线方法。这些发现验证了自动化完整FEA工作流程的可行性，突显了该框架在将工程师从繁琐的手工分析中解放出来的潜力。

英文摘要

Finite Element Analysis (FEA) serves as the cornerstone of modern engineering design. However, its workflow is inherently complex and relies heavily on domain expertise. Although recent efforts have integrated Large Language Models (LLMs) into FEA, existing approaches face limitations in handling multimodal inputs and executing complex tasks. To address these limitations, we propose VFEAgent, an end-to-end multi-agent system designed to automate FEA modeling and simulation directly from input images and problem descriptions. Our methodology integrates two core components: (1) a multimodal vision-language multi-agent pipeline that employs ReAct-driven reasoning to extract structured FEA specifications from heterogeneous inputs and (2) a verification-first code synthesis framework, incorporating robust self-debugging and fallback mechanisms to ensure executability and physical validity. We systematically evaluated the system across various engineering mechanics scenarios. The results demonstrate that VFEAgent achieves a high success rate in generating complete and physically valid simulations, outperforming LLM-based baseline methods in reliability and correctness. These findings validate the feasibility of automating the complete FEA workflow, highlighting the framework's potential to liberate engineers from tedious manual analysis.

2605.28977 2026-05-29 cs.LG cs.AI 版本更新

Comparing Post-Hoc Explainable AI Methods for Interpreting Black-Box EEG Models in Depression Detection

比较事后可解释AI方法用于解释抑郁症检测中的黑盒脑电图模型

Antonia Šarčević, Nikolina Frid

发表机构 * University of Zagreb Faculty of Electrical Engineering and Computing（萨格勒布大学电气工程与计算学院）

AI总结本研究通过多种事后可解释性方法（如DeepSHAP、集成梯度、GradCAM、遮挡和置换特征重要性）分析InceptionTime架构在脑电图抑郁症检测中的决策过程，发现不同方法在额叶、颞叶和后部脑区（尤其是右半球）的归因模式部分收敛，但方法间存在差异，强调了事后可解释性的有用性和局限性。

AI中文摘要

深度学习的最新进展使得基于脑电图的重度抑郁症分类越来越准确，但高容量模型的决策过程仍然难以解释。本研究调查了应用于训练用于基于脑电图的重度抑郁症检测的InceptionTime架构的多种事后可解释性方法。分析包括基于Shapley、基于梯度和基于扰动的归因方法：DeepSHAP、集成梯度、GradCAM、遮挡和置换特征重要性。在受试者级别的分层5折交叉验证框架内，通过跨脑电图片段和受试者的全局归因聚合进行可解释性分析。评估的方法揭示了部分收敛的归因模式，其中额叶、颞叶和后部脑区（尤其是右半球）反复受到关注。定量比较表明，基于梯度和基于扰动的方法之间具有实质性一致性，而DeepSHAP产生了相对独特的归因分布。同时，可解释性方法之间的差异凸显了方法假设对所得解释的影响。总体而言，结果表明，不同的事后可解释性方法捕捉了基于脑电图的深度学习模型在抑郁症检测中的部分重叠的相关性结构。尽管观察到的归因模式与先前几项关于重度抑郁症的脑电图研究大致一致，但该分析应被视为探索性的，而非确凿的神经生理学生物标志物或临床适用性的证据。该研究强调了事后可解释性在解释精神病学应用中的黑盒脑电图分类器方面的有用性和局限性。

英文摘要

Recent advances in deep learning have enabled increasingly accurate electroencephalography (EEG)-based classification of Major Depressive Disorder (MDD), but the decision-making processes of high-capacity models remain difficult to interpret. This study investigates multiple post-hoc explainability methods applied to an InceptionTime architecture trained for EEG-based MDD detection. The analysis includes Shapley-based, gradient-based, and perturbation-based attribution approaches: DeepSHAP, Integrated Gradients, GradCAM, Occlusion, and Permutation Feature Importance. Explainability analysis was performed within a subject-level stratified 5-fold cross-validation framework using global attribution aggregation across EEG segments and subjects. The evaluated methods revealed partially convergent attribution patterns, with recurring emphasis on frontal, temporal, and posterior EEG regions, particularly in the right hemisphere. Quantitative comparison demonstrated substantial agreement between gradient- and perturbation-based approaches, while DeepSHAP produced comparatively distinct attribution distributions. At the same time, variability between explainability methods highlighted the influence of methodological assumptions on the resulting explanations. Overall, the results suggest that different post-hoc explainability approaches capture partially overlapping relevance structures in EEG-based deep learning models for depression detection. Although the observed attribution patterns are broadly consistent with several previous EEG studies of MDD, the analysis should be interpreted as exploratory rather than evidence of definitive neurophysiological biomarkers or clinical applicability. The study highlights both the usefulness and limitations of post-hoc explainability for interpreting black-box EEG classifiers in psychiatric applications.

2605.28969 2026-05-29 cs.CL cs.AI cs.HC 版本更新

AIRGuard：通过运行时权限控制守护智能体行为

Suliu Qin, Haomin Zhuang, Yujun Zhou, Yufei Han, Xiangliang Zhang

发表机构 * University of Notre Dame（圣母大学）； Inria, France（法国国家信息与自动化研究所）； University of Liverpool（利物浦大学）

AI总结针对工具使用语言智能体面临的权限混淆问题，提出运行时守卫AIRGuard，通过动作时授权实现最小权限原则，显著降低攻击成功率并保持良好良性效用。

AI中文摘要

使用工具的语言智能体将模型决策转化为外部副作用：它们读取文件、运行脚本、调用API、发送消息以及调用模型上下文协议工具。这使得针对智能体的攻击不同于越狱攻击。有害步骤往往不是明显禁止的输出，而是普通的可执行动作，但由于攻击者控制的上下文将授权访问导向违背用户利益的方向而变得不安全。我们将这种失败模式识别为权限混淆：不可信资源可以告知推理，但绝不能授权副作用。我们提出AIRGuard，一种运行时守卫，将最小权限原则实现为动作时授权。AIRGuard规范化异构工具调用，将任务权限推导为步骤级权限，跟踪源和目标信任度，模拟敏感副作用，审计跨步骤风险，并在动作执行前强制执行决策。在AgentTrap上，AIRGuard将Sonnet 4.6的攻击成功率从无防御时的36.3%降低到5.5%。在DTAP-150上，AIRGuard在Haiku 4.5上保持了76.0%的良性效用，而ARGUS为52.0%，MELON为42.0%。消融实验进一步表明，仅靠提示策略效果有限，而专用的运行时权限控制层为智能体系统提供了对工具介导副作用的直接控制。代码和数据可在https://github.com/Sophie508/AIRGuard获取。

英文摘要

Tool-using language agents turn model decisions into external side effects: they read files, run scripts, call APIs, send messages, and invoke Model Context Protocol tools. This makes agent attacks different from jailbreaks. The harmful step is often not an obviously forbidden output, but an ordinary executable action that becomes unsafe because attacker-controlled context steers authorized access against the user's interest. We identify this failure mode as authority confusion: untrusted resources may inform reasoning, but they must not authorize side effects. We present AIRGuard, a runtime guard that operationalizes least privilege as action-time authorization. AIRGuard normalizes heterogeneous tool calls, derives task authority into step-level authority, tracks source and target trust, simulates sensitive side effects, audits cross-step risk, and enforces decisions before actions execute. On AgentTrap, AIRGuard reduces Sonnet 4.6 attack success from 36.3% without defense to 5.5%. On DTAP-150, AIRGuard preserves 76.0% benign utility with Haiku 4.5, compared with 52.0% for ARGUS and 42.0% for MELON. An ablation further shows that prompt-only policy helps only modestly, whereas a dedicated runtime authority-control layer gives the agent system direct control over tool-mediated side effects. Code and data are available at https://github.com/Sophie508/AIRGuard.
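
The principle that untrusted resources may inform reasoning but must not authorize side effects can be sketched as an action-time check. The code below is a hypothetical illustration, not AIRGuard's actual API: the `ToolCall` fields, the `"user"`/`"untrusted"` origin labels, and the `authorize` function are all invented for this example, and the real system additionally normalizes tool calls, simulates side effects, and audits cross-step risk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str              # hypothetical tool name, e.g. "read_file"
    has_side_effect: bool  # True if the call changes external state
    origin: str            # "user" or "untrusted" (hypothetical trust labels)


def authorize(call, step_authority):
    """Decide, before execution, whether a tool call may run."""
    if call.tool not in step_authority:
        return False  # least privilege: tool not granted for this step
    if call.has_side_effect and call.origin != "user":
        return False  # untrusted context cannot authorize side effects
    return True


# Authority derived for one hypothetical step of the user's task.
step_authority = {"read_file", "send_message"}

read_from_web_hint = ToolCall("read_file", False, "untrusted")
send_from_injection = ToolCall("send_message", True, "untrusted")
send_from_user = ToolCall("send_message", True, "user")
script_from_user = ToolCall("run_script", True, "user")
```

Here the injected `send_message` call is blocked even though the tool itself is permitted for the step, which is the authority-confusion case the abstract describes.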

2605.28902 2026-05-29 cs.AI 版本更新

Orthogonal Concept Erasure for Diffusion Models

扩散模型的正交概念擦除

Yuhao Sun, Lingyun Yu, Haoxiang Xu, Fengyuan Miao, Zhuoer Xu, Hongtao Xie

发表机构 * University of Science（科学技术大学）

AI总结提出正交概念擦除（OCE）方法，通过几何视角的乘法参数更新实现精确概念擦除，同时保持生成能力，支持多概念擦除。

Comments Accepted by ICML 2026 Oral

AI中文摘要

概念擦除已成为减轻扩散模型中不期望或不安全内容的有前途方法，但现有方法仍面临显著限制。基于训练的方法有效，但高计算成本限制了可扩展性。基于编辑的方法更高效且易于部署，但难以同时实现精确的概念擦除和保持整体生成能力。我们将基于编辑的方法的这一核心限制归因于对加法参数更新的依赖。我们的实证分析表明，概念语义主要依赖于神经元方向而非神经元幅度，而整体生成能力依赖于神经元的角几何。由于加法更新固有地纠缠方向、幅度和角几何，它们不可避免地引入概念擦除与整体生成性能之间的意外干扰。为了解决这个问题，我们提出了正交概念擦除（OCE），它从几何角度将基于编辑的擦除重新表述为乘法参数更新。具体来说，OCE应用从参数的闭式解导出的逐层正交变换，能够在保持神经元幅度和角几何的同时实现精确的概念擦除。此外，为了解决多概念擦除中的冲突约束，OCE引入了具有结构化子空间操作的子空间级目标，实现了更有效和可扩展的擦除。在单概念和多概念擦除上的大量实验表明，OCE在概念擦除和非目标保持方面优于现有方法，可在4.3秒内擦除多达100个概念。代码：https://github.com/HansSunY/OCE。

英文摘要

Concept erasure has emerged as a promising approach to mitigate undesired or unsafe content in diffusion models, yet existing methods still face significant limitations. While training-based methods are effective, their high computational cost limits scalability. Editing-based methods are more efficient and deployment-friendly, yet they struggle to simultaneously achieve precise concept erasure and preserve overall generative capacity. We identify this core limitation of the editing-based methods as reliance on additive parameter updates. Our empirical analysis reveals that concept semantics primarily depend on neuron direction rather than neuron magnitude, while overall generative capacity relies on the angular geometry of neurons. As additive updates inherently entangle direction, magnitude, and angular geometry, they inevitably introduce unintended interference between concept erasure and overall generation performance. To address this, we propose Orthogonal Concept Erasure (OCE), which reformulates editing-based erasure as multiplicative parameter updates from a geometric perspective. Specifically, OCE applies layer-wise orthogonal transformations derived from a closed-form solution to the parameters, enabling precise concept erasure while preserving the neuron magnitude and angular geometry. Furthermore, to address conflicting constraints in multi-concept erasure, OCE introduces a subspace-level objective with structured subspace manipulation, yielding a more effective and scalable erasure. Extensive experiments on single- and multi-concept erasure demonstrate that OCE outperforms existing methods in concept erasure and non-target preservation, erasing up to 100 concepts in 4.3 s. Code: https://github.com/HansSunY/OCE.
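
The geometric argument above, that a multiplicative orthogonal update preserves neuron magnitude and angular geometry while an additive update does not, can be checked on a toy example. The sketch uses a plain 2D rotation as the orthogonal transform. It is not the closed-form, layer-wise solution used by OCE, and the "neurons" and update values are hypothetical.

```python
import math


def rotate(v, theta):
    """Apply a 2D rotation (an orthogonal transform) to vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def norm(v):
    return math.hypot(v[0], v[1])


def angle(u, v):
    """Angle between two vectors, in radians."""
    cos_uv = (u[0] * v[0] + u[1] * v[1]) / (norm(u) * norm(v))
    return math.acos(max(-1.0, min(1.0, cos_uv)))


# Two hypothetical neuron weight vectors.
neurons = [[3.0, 0.0], [1.0, 1.0]]

# Multiplicative (orthogonal) update: every neuron is rotated by the same angle.
rotated = [rotate(w, 0.7) for w in neurons]

# Additive update, for contrast: the same offset is added to every neuron.
shifted = [[w[0] + 0.5, w[1] - 0.2] for w in neurons]
```

Comparing `norm` and `angle` before and after each update shows that the rotation changes only direction, whereas the additive shift alters magnitudes and pairwise angles together.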

2605.28899 2026-05-29 cs.CR cs.AI 版本更新

Quantum-Enhanced Adversarial Robustness in Artificial Intelligence

人工智能中的量子增强对抗鲁棒性

Jaydip Sen

发表机构 * Praxis Business School（Praxis商学院）

AI总结本文综述了对抗性机器学习与量子计算交叉领域，提出利用量子优化、特征映射和混合量子-经典架构来增强人工智能系统的对抗鲁棒性。

Comments This is the pre-print of the chapter which has been accepted for publication in the edited volume titled "Quantum Enhancements to the AI Industry", edited by Eduard Babulak. The volume will be published by IGI Global, USA. This is not the final version of the chapter published in the book

AI中文摘要

人工智能在多个应用领域取得了显著成功。然而，其对对抗性攻击的脆弱性给可靠性、安全性和可信赖性带来了重大挑战。对抗性机器学习表明，即使是高精度的模型也可能通过精心设计的扰动被操纵，这在医疗、金融和自主技术等安全关键系统中引发了严重担忧。与此同时，量子计算作为一种变革性范式出现，能够通过叠加、纠缠和量子干涉等原理解决复杂的计算问题。这两个领域的融合催生了量子人工智能的出现，该领域探索量子技术如何增强学习效率、可扩展性和鲁棒性。本章全面概述了对抗性机器学习和现有防御策略，随后对量子计算和量子机器学习模型进行了易于理解的介绍。进一步提出了量子增强对抗鲁棒性的概念框架，强调了量子优化、特征映射和混合量子-经典架构。还讨论了实际应用、关键挑战和未来研究方向，以支持安全可信赖的AI系统的开发。

英文摘要

Artificial Intelligence has achieved remarkable success across diverse application domains. However, its vulnerability to adversarial attacks poses significant challenges to reliability, security, and trustworthiness. Adversarial machine learning demonstrates that even highly accurate models can be manipulated through carefully crafted perturbations, raising serious concerns in safety-critical systems such as healthcare, finance, and autonomous technologies. In parallel, quantum computing has emerged as a transformative paradigm capable of addressing complex computational problems through principles such as superposition, entanglement, and quantum interference. The convergence of these fields has led to the emergence of quantum artificial intelligence, which explores how quantum techniques can enhance learning efficiency, scalability, and robustness. This chapter provides a comprehensive overview of adversarial machine learning and existing defense strategies, followed by an accessible introduction to quantum computing and quantum machine learning models. It further presents conceptual frameworks for quantum-enhanced adversarial robustness, emphasizing quantum optimization, feature mapping, and hybrid quantum-classical architectures. Practical applications, key challenges, and future research directions are also discussed to support the development of secure and trustworthy AI systems.

2605.28897 2026-05-29 cs.AI cs.MA 版本更新

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Review Arcade: 关于LLM评审的人类对齐与可博弈性

Hans Ole Hatzel, Sebastian Steindl, Jan Strich

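Affiliations: Language Technology Group, University of Hamburg, Germany; Hub of Computing and Data Science (HCDS), University of Hamburg, Germany; OTH Amberg-Weiden, Germany

AI summary: Experimentally evaluates how well LLM-generated paper reviews align with human reviews, and finds that authors can iteratively revise papers according to LLM reviews to raise their scores (a statistically significant increase for up to 35% of papers), revealing that LLM reviews are gameable.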

Comments Under Review EMNLP 26

AI中文摘要

LLM生成的科学论文评审正获得广泛关注，甚至被主要会议正式试点。我们必须假设不仅评审员在使用LLM辅助，而且作者在提交前也使用LLM修改论文。在这项工作中，我们对2025年ACL滚动评审（ARR）的论文进行实证实验，从作者和评审员两个角度评估LLM评审。首先，我们发现LLM评审与人类评审的对齐程度有限。在最佳情况下，对齐是合理的。然而，我们也发现LLM与人类的对齐在不同提示和模型间差异很大。最后，我们研究了作者根据LLM评审使用迭代草稿-修订工作流程改进提交的情况。我们发现，这种对LLM评审的“博弈”在特定场景下是有效的，导致最多35%的论文整体得分有统计显著提升。我们公开代码：https://github.com/uhh-hcds/reviewarcade。

英文摘要

LLM-generated reviews for scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume that not only reviewers are using LLM-assistance, but also that authors use LLMs to revise their papers before submitting. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we identify a limited alignment of LLM reviews with human ones. In the best-case scenario, the alignment is reasonable. However, we also find that LLM-human alignment varies substantially across prompts and models. Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase of overall scores for up to 35% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.

2605.28889 2026-05-29 cs.LG cs.AI 版本更新

Context Distillation as Latent Memory Management

上下文蒸馏作为潜在记忆管理

Ziyang Zheng, Zeju Li, Xiangyu Wen, Jianyuan Zhong, Junhua Huang, Lei Chen, Mingxuan Yuan, Qiang Xu

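Affiliations: The Chinese University of Hong Kong; Huawei Noah’s Ark Lab

AI summary: Treats context distillation as a latent memory management problem, building a modular memory bank from independent LoRA adapters and using a self-gating mechanism to decide whether to activate latent memory, in order to improve retrieval robustness and efficiency.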

2605.28883 2026-05-29 cs.AI cs.RO 版本更新

Ultra-Reduced-Impact-Encased-Logging (URIEL): propose a new method for selective sustainable logging and post-harvest silvicultural treatment in tropical forest using airborne robotics systems

超低影响包裹式伐木（URIEL）：提出一种利用空中机器人系统在热带森林中进行选择性可持续伐木和采后造林处理的新方法

Daniel Albiero, Gelton Fernando de Morais, Daniela Han, Flávio Roberto de Freitas Gonçalves, Artur Vitório Andrade Santos, Wesllen Lins de Araújo, Alessandra Maia Freire, Cláudio Kiyoshi Umezu, Mateus Peressin, Francesco Toscano, Admilson Írio Ribeiro, Alfeu J. Sguarezi Filho, Américo Ferraz Dias Neto, Angel Pontin Garcia

Affiliations: School of Agricultural Engineering, University of Campinas (UNICAMP); School of Mechanical Engineering, University of Campinas (UNICAMP); Department of Agricultural, Forestry, Food and Environmental Sciences, University of Basilicata; Sorocaba Environmental Engineering, São Paulo State University (UNESP); Center for Engineering, Modeling and Applied Social Sciences, Federal University of ABC (UFABC)

AI summary: Proposes the URIEL method, which combines heli-logging, robotics, AI, and drone-based post-harvest silvicultural treatment to achieve high economic viability with almost no collateral damage while maintaining ecosystem services.

Comments 196 pages, 40 figures. A revolutionary technology to help protect tropical forests. It was developed, scaled, detailed, calculated, and simulated in an advanced computational environment, with economic and social viability. "E pur si muove"

AI中文摘要

全球热带森林正面临由经济和政治利益驱动的强烈砍伐压力，科学证据表明这种砍伐加剧了气候变化。本文提出了一种新颖的热带森林伐木方法——超低影响包裹式伐木（URIEL）。该方法基于直升机伐木技术，结合机器人技术和人工智能的密集使用，以及由无人机执行的采后造林处理。为此方法开发了合适的设备概念，确定了尺寸，在数字概念验证中完成了细节，并对各种直升机-木材-距离组合进行了有效的数字模拟和经济可行性分析。结果表明，URIEL方法具有高经济可行性，并能在维持生态系统服务的同时几乎消除对森林的附带损害。本文的主要结论是，尽管取得了令人满意的科学和技术成果，但URIEL方法的可行性取决于相关利益相关者的整合：高科技产业、政治政府、认证伐木公司和原住民。

英文摘要

Tropical forests worldwide are under intense deforestation pressure driven by economic and political interests, and scientific evidence suggests this deforestation contributes to climate change. This paper proposes a novel logging method for tropical forests, Ultra-Reduced-Impact-Encased-Logging (URIEL). This new method is based on heli-logging techniques combined with intensive use of robotics and AI integrated with post-harvest silvicultural treatments performed by drones. The concept of appropriate equipment for this method was developed, dimensions were determined, details were completed in a digital proof of concept, and an effective digital simulation and economic feasibility analysis were carried out for various helicopter-timber-distance combinations. The results demonstrated that a URIEL method has high economic viability and makes it possible to virtually eliminate collateral damage to forests while maintaining ecosystem services. The main conclusion of this paper is that, despite the satisfactory scientific and technological results, the feasibility of a Uriel method depends on the integration of stakeholders intrinsic to the context: high-tech industry; political governments; certified logging companies; and native populations.

2605.28876 2026-05-29 cs.SE cs.AI 版本更新

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

LogDx-CI：为LLM根因诊断基准测试日志缩减工具

Bowen Qin

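Affiliations: National University of Singapore

AI summary: Proposes the LogDx-CI benchmark, which compares 11 log-reduction tools on 35 real CI failure cases. It finds that hybrid grep+tail routers dominate on cost and quality, that an agent loop narrows the quality gap while cost differences persist, and that cross-family LLM summarizers outperform same-family ones.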

AI中文摘要

CI失败日志规模大（本语料中位数5000行，最大20万行）且噪声多。尝试调试的编码智能体依赖上游工具将日志缩减为可管理的上下文，但该领域缺乏公开的经验比较来评估哪些缩减能为下游LLM诊断保留足够证据。我们引入LogDx-CI基准，比较11种上下文缩减工具（原始、尾部、grep、三种RTK模式、两种真实LLM map-reduce摘要器、三种混合路由器）在35个真实GitHub Actions失败案例上的表现，由3个LLM调试器家族（Claude Haiku 4.5、Claude Sonnet 4.6、OpenAI gpt-5-mini）以及一个Sonnet 4.6工具使用智能体评分。我们报告三个重要发现。（1）混合grep+tail路由器主导成本-质量帕累托前沿；前两种方法得分0.670/0.666，每案例约0.03美元，质量与独立grep相当但令牌数减少4.5倍。（2）在智能体循环场景中，不同缩减工具的质量范围缩小7倍（单次得分跨度0.42 → 智能体循环跨度0.059）；智能体通过后续工具调用挽救弱上下文。然而，成本差异持续存在：弱上下文迫使智能体发出2-4倍的工具调用来恢复。（3）跨家族LLM摘要-调试器对（gpt-5-mini摘要器供给Claude Haiku调试器）在四个诊断变体上的平均得分比同家族对高0.071，否定了该任务上的自我调用偏差假设。gpt-5-mini摘要器也是智能体循环中的第一名方法（得分0.749），每案例0.37次工具调用，且缩减器成本比Haiku摘要器低10倍（每案例0.18美元 vs 1.75美元）。所有数据、代码、每个案例的捆绑包和可复现性基础设施均已公开。

英文摘要

CI failure logs are large (median 5k lines, max 200k in this corpus) and noisy. Coding agents that try to debug them depend on an upstream tool to reduce the log to a manageable context, but the field has had no public empirical comparison of which reductions preserve enough evidence for downstream LLM diagnosis. We introduce LogDx-CI, a benchmark that compares 11 context-reduction tools (raw, tail, grep, three RTK modes, two real LLM map-reduce summarizers, three hybrid routers) on 35 real GitHub Actions failure cases, scored by 3 LLM debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini) plus a Sonnet 4.6 tool-using agent. We report three load-bearing findings. (1) Hybrid grep+tail routers dominate the cost-quality Pareto frontier; the top two methods score 0.670 / 0.666 at ~$0.03 per case, same-ballpark quality as standalone grep at 4.5× fewer tokens. (2) In the agent-loop regime, the quality range across reduction tools collapses 7× (single-shot spread 0.42 → agent-loop spread 0.059); the agent rescues weak contexts via follow-up tool calls. However, cost differences persist: weak contexts force the agent to issue 2-4× more tool calls to recover. (3) A cross-family LLM-summary pair (gpt-5-mini summarizer feeding a Claude Haiku debugger) beats the same-family pair by +0.071 averaged across four diagnoser variants, falsifying the self-call-bias hypothesis on this task. The gpt-5-mini summarizer is also the agent-loop #1 method (score 0.749) at 0.37 tool calls per case and 10× lower reducer cost than the Haiku summarizer ($0.18 vs $1.75 per case). All data, code, per-case bundles, and reproducibility infrastructure are public.
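
To make the idea of a grep+tail reduction concrete, the following is a minimal sketch, not the benchmark's actual routers, whose rules the abstract does not describe. It simply takes the union of lines near pattern matches and the last lines of the log. The pattern list, the function name `reduce_log`, and all parameter defaults are assumptions chosen for illustration.

```python
import re

# Hypothetical error patterns; the benchmark's real grep rules are not given in the abstract.
ERROR_PATTERN = re.compile(r"error|failed|exception|traceback", re.IGNORECASE)


def reduce_log(lines, tail_n=50, context=2, max_grep_hits=200):
    """Keep lines near pattern matches plus the last tail_n lines, in original order."""
    keep = set()
    hits = 0
    for i, line in enumerate(lines):
        if hits >= max_grep_hits:
            break  # cap the grep part so a noisy log cannot blow up the context
        if ERROR_PATTERN.search(line):
            hits += 1
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            keep.update(range(lo, hi))
    # Always keep the tail, where CI runners often print the final failure summary.
    keep.update(range(max(0, len(lines) - tail_n), len(lines)))
    return [lines[i] for i in sorted(keep)]
```

A real router would presumably choose between strategies per log rather than always merging them; the union above is only the simplest way to show why such hybrids can stay small while retaining the failing step.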

2605.28870 2026-05-29 cs.LG cs.AI 版本更新

Representation Alignment Rests on Linear Structure

表示对齐依赖于线性结构

Kiril Bangachev, Guy Bresler, Yury Polyanskiy

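Affiliations: Massachusetts Institute of Technology

AI summary: Studies the Platonic Representation Hypothesis through a tripartite statistical framework of signal, bias, and noise, proposing that alignment arises from linear relationships between objects and attributes. Supporting evidence includes linear features extracted with sparse autoencoders, bias reduction through centering and normalization, and noise driven by data scarcity.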

AI中文摘要

我们通过表示的三部分统计框架研究柏拉图表示假说（PRH）：信号、偏差和噪声。(1) 信号：我们提出柏拉图对齐源于对象与属性之间的普遍关系，这种关系根据线性表示假说（LRH）在线性上编码。我们通过稀疏自编码器提取线性对象-属性特征，并展示这些稀疏表示通常比其稠密对应物表现出更强的跨模态对齐，从而提供证据表明LRH有助于解释PRH。(2) 偏差：由于使用的不同架构和训练过程，模型具有不同的隐式偏差。我们表明这种差异可以部分缓解。中心化和归一化一致地改善跨模型对齐。(3) 噪声：有限样本训练导致表示中的噪声。我们通过揭示词频与对齐之间在LLM和文本嵌入模型中的强且一致的正相关，提供证据表明表示噪声由数据稀缺驱动。综合信号、偏差和噪声，我们提出一个统计模型，该模型细化线性表示假说，并解释与现代AI架构中出现的表示对齐相关的进一步现象。

英文摘要

We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. (1) Signal: We propose that Platonic alignment arises from the universal relationship between objects and attributes, which is encoded linearly in representations according to the Linear Representation Hypothesis (LRH). We provide evidence that LRH helps explain PRH by extracting linear object-attribute features with sparse autoencoders and showing that these sparse representations often exhibit stronger cross-modal alignment than their dense counterparts. (2) Bias: Models have different implicit biases due to the diverse architectures and training procedures used. We show that this difference can be partially mitigated. Centering and normalization consistently improve cross-model alignment. (3) Noise: Finite-sample training leads to noise in representations. We provide evidence that representational noise is driven by data scarcity by revealing a strong and consistent positive correlation between word frequency and alignment in LLMs and text embedding models. Synthesizing signal, bias, and noise, we propose a statistical model that refines the Linear Representation Hypothesis and explains further phenomena related to the alignment of representations emerging from diverse modern AI architectures.

2605.28869 2026-05-29 cs.LG cs.AI 版本更新

Balancing Multimodal Learning through Label Space Reshaping

通过标签空间重塑平衡多模态学习

Xiaoyu Ma, Weijie Zhang, Yuanhao Gao, Han Miao, Yongjian Deng, Hao Chen

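AI summary: To address modality imbalance in multimodal learning, proposes BMLR, a method based on label space reshaping that improves multimodal performance by equalizing mapping difficulty across modalities.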

Comments In process

AI中文摘要

多模态学习常受模态不平衡问题困扰，其中收敛较快的模态主导优化，而其他模态训练不足。现有方法通常通过加强弱模态或调整优化梯度来缓解此问题。然而，这些策略主要补偿优化速率差异，往往以牺牲强模态的优化能力为代价，而未从模态层面分析这些差异如何产生。基于理论洞察和实证观察，我们认为学习速度的差异源于模态特定特征空间与共享标签空间之间映射难度的不同。为解决此问题，我们提出了平衡多模态标签重塑（BMLR），这是首个从标签侧设计促进多模态平衡的方法。BMLR重塑跨模态标签空间以均衡各模态的映射难度，从而促进模态交互并为每个模态注入更丰富的类间信息。跨多种架构的大量实验表明，BMLR持续提升多模态性能，并与多种模型设计表现出强兼容性。源代码即将发布。

英文摘要

Multimodal learning often suffers from modality imbalance, where modalities that converge faster dominate optimization while others remain undertrained. Existing approaches typically mitigate this issue by strengthening the weak modality or adjusting optimization gradients. However, such strategies mainly compensate for optimization rate discrepancies, often at the expense of the strong modality's optimization capacity, without analyzing how these discrepancies arise at the modality level. Based on theoretical insights and empirical observations, we argue that the discrepancy of learning pace arises from differences in the mapping difficulty between modality-specific feature space and the shared label space. To address this issue, we propose Balanced Multimodal Label Reshaping (BMLR), the first method that promotes multimodal balance from the label-side design. BMLR reshapes the cross-modal label space to equalize mapping difficulty across modalities, thereby facilitating modality interaction and injecting richer inter-class information into each modality. Extensive experiments across multiple architectures demonstrate that BMLR consistently improves multimodal performance and exhibits strong compatibility with diverse model designs. The source code will be released soon.

2605.28868 2026-05-29 cs.LG cs.AI 版本更新

TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

TaxDistill：通过蒸馏基因组基础模型改进宏基因组分类注释

Rongye Ye, Lun Li, Zheng Luo, Yiran Zhan, Shuhui Song

Affiliations: National Genomics Data Center, China National Center for Bioinformation; Beijing Key Laboratory of Intelligent Governance and Application of Biological Big Data, China National Center for Bioinformation; Beijing Institute of Genomics, Chinese Academy of Sciences; University of Chinese Academy of Sciences

AI summary: Proposes TaxDistill, a knowledge distillation framework that uses the 500M-parameter genomic foundation model GenomeOcean as a teacher network to generate soft labels, mitigating the label noise introduced by initial retrieval tools and improving metagenomic sequence classification.

Comments The manuscript contains 14 pages, 7 figures, and 3 tables

AI中文摘要

宏基因组分类注释旨在识别环境样本中DNA片段的微生物起源。依赖序列相似性的传统方法通常受到高微生物多样性和参考数据库不完整性的限制，这推动了诸如Taxometer等学习方法的发展，这些方法通过事后校正来学习更具信息量的宏基因组序列表示。然而，这些方法通常依赖于训练期间从相似性搜索工具获得的标签，这不可避免地引入了噪声，从而损害表示学习并降低分类性能。为了解决这个问题，我们提出了TaxDistill，一种用于宏基因组分类的知识蒸馏框架。我们引入GenomeOcean，一个500M参数的基因组基础模型，作为教师网络来提取深层语义特征并基于置信度生成软标签。通过将这些软标签信息蒸馏到轻量级学生网络中，TaxDistill有效减少了初始检索工具引入的标签噪声。在七个不同的CAMI2数据集上的全面实验表明，TaxDistill在大多数场景下优于现有基线。例如，在胃肠道数据集上，它将MMseqs2的F1分数从0.763提高到0.941，优于Taxometer基线。总体而言，TaxDistill为复杂宏基因组分析中的标签校正提供了一种可靠的方法。

英文摘要

Metagenomic taxonomic annotation aims to identify the microbial origins of DNA fragments in environmental samples. Traditional methods that rely on sequence similarity are often constrained by the high microbial diversity and the incompleteness of reference databases, which has motivated the development of learning approaches such as Taxometer that perform post hoc correction to learn more informative metagenomic sequence representations. However, these methods typically rely on labels derived from similarity search tools during training, which inevitably introduces noise that can impair representation learning and degrade classification performance. To address this issue, we propose TaxDistill, a knowledge distillation framework for metagenomic classification. We introduce GenomeOcean, a 500M parameter genomic foundation model, as the teacher network to extract deep semantic features and generate soft labels based on confidence. By distilling this soft label information into a lightweight student network, TaxDistill effectively reduces the label noise introduced by initial retrieval tools. Comprehensive experiments on seven diverse CAMI2 datasets demonstrate that TaxDistill outperforms existing baselines in most scenarios. For instance, on the Gastrointestinal dataset, it improves the F1 score of MMseqs2 from 0.763 to 0.941, outperforming the Taxometer baseline. Overall, TaxDistill provides a reliable method for label correction in complex metagenomic analysis.
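
The abstract does not state the exact training loss, so the following is only a sketch of the standard soft-label distillation objective (temperature-scaled KL divergence to the teacher, mixed with cross-entropy on the noisy hard label). The function names, `temperature`, and `alpha` are assumptions for illustration, not TaxDistill's published settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, hard_label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student_t = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student_t) if pt > 0)
    # Cross-entropy on the (possibly noisy) label from the similarity-search tool.
    ce = -math.log(softmax(student_logits)[hard_label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

Raising `alpha` shifts weight from the retrieval-derived hard label toward the teacher's soft labels, which is the lever a distillation setup has for damping label noise.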

2605.28867 2026-05-29 cs.LG cs.AI 版本更新

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

PrismFlow: 时间序列生成中流匹配的残差动力学

Junru Zhang, Lang Feng, Jinbo Wang, Xu Guo, Yucheng Wang, Han Yu, Min Wu, Yabo Dong, Duanqing Xu

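Affiliations: Zhejiang University, China; Nanyang Technological University, Singapore; I2R, Agency for Science, Technology and Research (A*STAR), Singapore

AI summary: Proposes PrismFlow, which learns residual corrections within flow matching through Koopman-inspired dynamical experts and a confidence-aware winner-take-all objective. It addresses the spectral distortion and poor mode coverage caused by the global vector-field estimator in standard flow matching and achieves state-of-the-art performance in time-series generation.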

AI中文摘要

生成高质量时间序列数据具有挑战性，因为现实世界的信号通常表现出多模态模式和多尺度动力学，包括振荡和高频变化。流匹配（FM）为扩散模型提供了一种高效的替代方案，但实际实现通常依赖于单个有限容量的全局向量场估计器。在这种异质的时间分布中，不同的状态可能通过邻近的流状态，同时需要不相容的条件速度。使用标准$\ell_2$速度匹配目标训练的单一估计器可能学习到局部传输场的过度平滑近似。这种估计器级别的平滑会减弱分支特定的动力学，导致频谱失真和较差的模式覆盖。为了解决这个问题，我们提出了PrismFlow，一种新的具有Koopman启发动力学专家的FM方法。每个专家在一个潜在空间中学习残差修正，其中局部非线性时间演化可以通过线性变换近似。我们进一步提出了一种置信度感知的胜者全得（WTA）目标，该目标仅更新与每个样本最对齐的专家，同时屏蔽其他专家的梯度，鼓励模式特定专业化。在采样过程中，选定的专家向全局传输场添加残差动力学修正，在保持FM稳定性的同时恢复细粒度和高频时间结构。在各种基准测试中，PrismFlow有效缓解了标准FM中的频谱收缩，并实现了最先进的性能，Context-FID提升了15.6%，判别分数提升了38.6%，同时在低数据设置下保持鲁棒性，并有效用于预测和插补。

英文摘要

Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an efficient alternative to diffusion models, but practical implementations typically rely on a single finite-capacity global vector-field estimator. In such heterogeneous temporal distributions, distinct regimes may pass through nearby flow states while requiring incompatible conditional velocities. A monolithic estimator trained with the standard $\ell_2$ velocity-matching objective may therefore learn an overly smoothed approximation of the local transport field. This estimator-level smoothing can attenuate branch-specific dynamics, leading to spectral distortion and poor mode coverage. To address this, we propose PrismFlow, a new FM method with Koopman-inspired dynamical experts. Each expert learns residual corrections in a latent space where local nonlinear temporal evolution can be approximated by linear transitions. We further propose a confidence-aware Winner-Take-All (WTA) objective that updates only the expert best aligned with each sample while masking gradients to the others, encouraging mode-specific specialization. During sampling, the selected expert adds a residual dynamical correction to the global transport field, preserving FM stability while recovering fine-grained and high-frequency temporal structures. Across various benchmarks, PrismFlow effectively mitigates the spectral contraction in standard FM and achieves state-of-the-art performance, with a 15.6% gain in Context-FID and a 38.6% improvement in Discriminative Score, while remaining robust in low-data settings and effective for forecasting and imputation.
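
The winner-take-all idea described above can be illustrated with a plain, unweighted selection rule: compute each expert's error on a sample, pick the best one, and mask the rest. This sketch omits the confidence-aware component, whose form the abstract does not give, and the function name `wta_select` is hypothetical.

```python
def wta_select(expert_predictions, target):
    """Return (winner_index, winner_loss, gradient_mask) for one sample.

    expert_predictions: one predicted velocity vector per expert.
    target: the target velocity vector for this sample.
    """
    # Mean squared error of every expert against the target.
    losses = [
        sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
        for pred in expert_predictions
    ]
    # The expert with the smallest error is the only one that gets updated.
    winner = min(range(len(losses)), key=losses.__getitem__)
    mask = [1.0 if k == winner else 0.0 for k in range(len(losses))]
    return winner, losses[winner], mask
```

In a training loop the mask would multiply each expert's loss before backpropagation, so only the winning expert receives a gradient for that sample.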

2605.28866 2026-05-29 cs.LG cs.AI 版本更新

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

连续性与序数性至关重要：利用大语言模型进行有效时间序列分析的时间序列令牌约束

Musheng Li, Ziying Zhang, Cheng jin, Yuantao Gu

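Affiliations: Department of Electronic Engineering

AI summary: To address the neglect of continuity and ordinality in tokenized time-series LLMs, proposes the COM strategy, which applies geometric constraints at the initialization and training stages to improve performance and generalization on multiple time-series analysis benchmarks.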

2605.28865 2026-05-29 cs.LG cs.AI 版本更新

Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision

无需语言监督的物理交互中世界模型中的涌现语义表征

Jiayi Fang

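Affiliations: Independent Researcher

AI summary: Trains a VAE world model on physical exploration without linguistic supervision and finds that its latent space spontaneously forms semantic structure aligned with physical geometry, with prediction performance and semantic alignment improving together, supporting physical geometry as the organizing principle of world-model representations.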

Comments 10 pages, 3 figures

AI中文摘要

世界模型从物理探索中学习到什么，没有任何语言监督？我们认为答案由单一原则组织：物理世界的几何结构。在随机具身探索上训练基于VAE的世界模型，我们发现其潜在空间发展出反映物理几何的空间语义结构——方向准确率0.677±0.029对比随机初始化编码器的0.547，位置RSA 0.192±0.047对比随机编码器的0.029（提升6.6倍），表明训练诱导了超越CNN归纳偏置的真正结构组织。在20个时间检查点上，预测性能和语义对齐共同提升（Spearman r=-0.61, p=0.004），与共享驱动者解释一致。我们通过双重敲除确认：标准KL正则化（beta=0.1）迫使编码器远离几何结构，预测性能和语义对齐同时崩溃至接近随机水平（第50,000步），完全符合共享驱动者预测。将beta降至0.001可恢复几何访问并同时恢复两种能力。这些发现确立了物理世界几何作为世界模型表征的组织原则，对设计语义基础的具身智能体具有直接意义。

英文摘要

What does a world model learn from physical exploration, without any linguistic supervision? We argue the answer is organized by a single principle: the geometric structure of the physical world. Training a VAE-based world model on random embodied exploration, we find that its latent space develops spatial semantic structure that mirrors physical geometry: direction accuracy 0.677 ± 0.029 versus 0.547 for a randomly initialized encoder, and position RSA 0.192 ± 0.047 versus 0.029 for random encoders (6.6× improvement), showing that training induces genuine structural organization beyond CNN inductive bias. Across 20 temporal checkpoints, prediction performance and semantic alignment co-improve (Spearman r=-0.61, p=0.004), consistent with the shared-driver account. We confirm this through a double knockout: standard KL regularization (beta=0.1) forces the encoder away from geometric structure, and both prediction performance and semantic alignment collapse simultaneously to near-chance by step 50,000, exactly as the shared-driver account predicts. Reducing beta to 0.001 restores geometric access and recovers both capabilities together. These findings establish physical world geometry as the organizing principle of world model representations, with direct implications for the design of semantically grounded embodied agents.
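
The role of beta in the knockout experiment is easier to see from the usual beta-weighted VAE objective. The sketch below assumes a diagonal-Gaussian posterior and a standard normal prior, which is the common choice but is not confirmed by the abstract; the function names are illustrative.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))


def beta_vae_loss(recon_error, mu, logvar, beta):
    """Reconstruction (or prediction) error plus the beta-weighted KL penalty."""
    return recon_error + beta * gaussian_kl(mu, logvar)
```

With the same encoder output, moving beta from 0.1 to 0.001 cuts the KL penalty by a factor of 100, so the encoder is under far less pressure to collapse its latent codes toward the prior.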

2605.28864 2026-05-29 cs.AI cs.CL 版本更新

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

认知范畴变换器：用于语言建模的范畴论归纳偏置

Al Kari

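Affiliations: Manceps Inc.

AI summary: Proposes the Cognitive Categorical Transformer (CCT), which adds components grounded in category theory and cognitive science and reaches 21.27 validation perplexity on WikiText-103 with 306M parameters, 2.92 PPL (12% relative) below a GPT-2 Small baseline. Ablation attributes 84% of the improvement to simplicial message passing.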

AI中文摘要

认知范畴变换器（CCT）是一个306M参数的架构，它通过源自范畴论和认知科学的认知启发组件增强了预训练的GPT-2 Small骨干网络。在WikiText-103上采用匹配步数协议（215,000优化器步数、匹配数据、匹配优化器和调度）下，CCT达到21.27验证困惑度，而相同微调的GPT-2 Small基线为24.19。因此，该架构在领域内微调本身之外贡献了2.92 PPL（12%相对）的降低。一个从头开始重训练的消融实验，在整个七阶段激活调度中保持GT-Full单纯复形消息传递绕过，达到23.72 PPL，将84%的架构改进（2.45 of 2.92 PPL）归因于GT-Full。我们首次提供了消融验证的证据，表明单纯复形消息传递在WikiText-103上以306M参数规模改善了语言模型困惑度。已发表的GPT-2 Large在WikiText-103上以比GPT-2 Small多6.2倍的参数达到22.05零样本困惑度；本文将这一数字视为外部已发表参考，而非架构基准。关于一致性风格的范畴先验（层平滑、伴随往返、曲率正则化）的三个负面结果，以及GT-Full和PrecisionWeightedPP的联合结构先验结果，共同支持了一个经验模式，称为*结构/一致性区分*，其中添加新拓扑的范畴先验改善了语言建模，而强制执行一致性恒等式的范畴先验则没有。

英文摘要

The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with cognitively grounded components derived from category theory and several inspirations from cognitive science. Under a matched-step protocol (215,000 optimizer steps, matched data, matched optimizer and schedule) on WikiText-103, CCT reaches 21.27 validation perplexity, compared with 24.19 for an identically fine-tuned GPT-2 Small baseline. The architecture therefore contributes a 2.92 PPL (12% relative) reduction beyond what in-domain fine-tuning alone provides. A retrain-from-scratch ablation that holds GT-Full simplicial message passing bypassed across the entire seven-phase activation schedule reaches 23.72 PPL, localizing 84% of the architectural improvement (2.45 of 2.92 PPL) to GT-Full. We present the first ablation-validated evidence that simplicial message passing improves language-model perplexity at the 306M-parameter scale on WikiText-103. Published GPT-2 Large reaches 22.05 zero-shot PPL on WikiText-103 with 6.2x more parameters than GPT-2 Small; this paper treats that number as an external published reference, not as the architectural benchmark. Three negative results on consistency-style categorical priors (sheaf smoothing, adjunction round-trip, curvature regularization) and the joint structural-prior result for GT-Full and PrecisionWeightedPP together support an empirical pattern termed the *structure/consistency distinction*, in which categorical priors that add new topology improve language modeling and those that enforce a consistency identity do not.
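
The reported percentages follow directly from the three perplexity figures in the abstract; the subscripted symbol names are introduced here only for the calculation:

```latex
\begin{align*}
\Delta_{\text{total}} &= 24.19 - 21.27 = 2.92\ \text{PPL}, &
\frac{2.92}{24.19} &\approx 12.1\% \\
\Delta_{\text{GT-Full}} &= 23.72 - 21.27 = 2.45\ \text{PPL}, &
\frac{2.45}{2.92} &\approx 83.9\%
\end{align*}
```

The remaining 24.19 - 23.72 = 0.47 PPL is the improvement that persists when GT-Full is bypassed.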

2605.28863 2026-05-29 cs.LG cs.AI 版本更新

Self-Play Reinforcement Learning under Imperfect Information in Big 2

大二（Big 2）中不完全信息下的自我对弈强化学习

Aalok Patwa

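Affiliations: University of Pennsylvania

AI summary: Proposes a self-play reinforcement learning framework that compares policy-gradient and value-approximation methods in Big 2, a four-player imperfect-information card game, finding that PPO outperforms the other methods and that moderate entropy regularization and current-policy self-play are effective.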

Comments 11 pages

AI中文摘要

不完全信息多人游戏测试智能体在隐藏信息、稀疏奖励和非平稳对手下的行动能力。我们在Big 2（一个四人不完全信息纸牌游戏）中研究这些挑战。我们为Big 2开发了一个自我对弈强化学习框架，能够对策略梯度和值近似智能体进行受控比较。在共同的环境、输入表示、训练预算和评估协议下，PPO在对抗随机、贪婪和启发式Big 2对手时优于蒙特卡洛Q近似、SARSA和Q学习。我们进一步发现，适度的熵正则化通过防止策略变得过于确定性来改进PPO，并且当前策略自我对弈比检查点自我对弈或固定对手训练提供了更强的有限预算课程。这些结果共同表明，Big 2是研究不完全信息、多人交互、延迟奖励和可变动作集下深度强化学习的一个有用的受控环境。

英文摘要

Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We develop a self-play RL framework for Big 2 that enables controlled comparisons between policy-gradient and value-approximating agents. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic Big 2 opponents. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current-policy self-play provides a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets.
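
As a reminder of where the entropy term enters, here is a minimal per-sample sketch of the standard PPO clipped surrogate with an entropy bonus. It is the textbook form, not the paper's implementation; `clip_eps` and `entropy_coef` are placeholder values, not the settings used in the Big 2 experiments.

```python
import math


def ppo_objective(new_logp, old_logp, advantage, action_probs,
                  clip_eps=0.2, entropy_coef=0.01):
    """Clipped PPO surrogate for one sample plus an entropy bonus (to be maximized)."""
    # Probability ratio between the new and old policy for the taken action.
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Pessimistic choice between the unclipped and clipped terms.
    surrogate = min(ratio * advantage, clipped * advantage)
    # Entropy of the current policy over the legal actions in this state.
    entropy = -sum(p * math.log(p) for p in action_probs if p > 0)
    return surrogate + entropy_coef * entropy
```

A larger `entropy_coef` rewards keeping probability mass spread over several legal plays, which is the mechanism by which entropy regularization keeps a policy from becoming overly deterministic.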

2605.28855 2026-05-29 cs.AI 版本更新

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

行为感知的辅助校正用于离策略时序差分预测

Xingguo Chen, Zhiang He, Yuchen Shen, Shangdong Yang, Chao Li, Guang Yang, Wenhao Wang

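Affiliations: Nanjing University of Posts and Telecommunications; Department of Computer Science and Technology, Nanjing University; College of Electronic Countermeasure, National University of Defense Technology

AI summary: To address the instability of off-policy temporal-difference learning, proposes behavior-aware auxiliary covariance corrections (BA-TDC/BA-TDRC), which replace the auxiliary matrix with the behavior Bellman matrix and add regularization, improving performance while preserving the fixed point and convergence.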

AI中文摘要

在函数近似下，离策略采样中的时序差分学习可能不稳定。TDC通过辅助协方差校正稳定离策略TD，而TDRC在单时间尺度递归中进一步正则化该校正。本文研究在线性预测设置中行为感知的辅助协方差几何替换，这是理解值函数近似特征空间动力学的标准局部模型。我们首先将TDC辅助矩阵(C)替换为行为贝尔曼矩阵(A_μ)，得到BA-TDC，然后正则化同一行为感知方程得到BA-TDRC。这种两步构造将行为感知几何的贡献与正则化的贡献分离。线性分析还为神经网络值近似中出现的辅助几何设计问题提供了一个可处理模型，其中特征协方差和时间转移矩阵共同塑造最后一层校正动力学。我们给出了有限状态均值系统公式，证明了在实例化均值系统的Hurwitz稳定性条件下的不动点保持和几乎必然收敛，并通过精确线性误差递归的谱半径比较了确定性均值速率。在二状态反例、Baird反例、随机游走和Boyan链上的实验表明，行为感知替换本身在某些任务上非常有益，但正则化对于在更困难设置下实现稳健性能是必要的。

英文摘要

Temporal-difference learning with function approximation can be unstable under off-policy sampling. TDC stabilizes off-policy TD through an auxiliary covariance correction, and TDRC further regularizes this correction in a single-timescale recursion. This paper studies a behavior-aware replacement of the auxiliary covariance geometry in the linear prediction setting, which is the standard local model for understanding the feature-space dynamics of value-function approximation. We first replace the TDC auxiliary matrix C by the behavior Bellman matrix A_μ, yielding BA-TDC, and then regularize the same behavior-aware equation to obtain BA-TDRC. This two-step construction separates the contribution of behavior-aware geometry from the contribution of regularization. The linear analysis also provides a tractable model for an auxiliary-geometry design question that arises in neural-network value approximation, where feature covariances and temporal transition matrices jointly shape the last-layer correction dynamics. We give a finite-state mean-system formulation, prove fixed-point preservation and almost-sure convergence under a Hurwitz stability condition on the instantiated mean system, and compare deterministic mean rates through the spectral radius of the exact linear error recursion. Experiments on the two-state counterexample, Baird's counterexample, Random Walk, and Boyan Chain show that the behavior-aware replacement can be highly beneficial by itself on some tasks, but that regularization is necessary for robust performance across harder settings.

2605.28849 2026-05-29 cs.AI 版本更新

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

行为诱导的镜像-近似时间差分学习用于更快的离策略预测

Xingguo Chen, Yuchen Shen, Shangdong Yang, Chao Li, Guang Yang, Wenhao Wang

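Affiliations: Nanjing University of Posts and Telecommunications; Department of Computer Science and Technology, Nanjing University; College of Electronic Countermeasure, National University of Defense Technology

AI summary: Proposes a behavior-induced Mirror-Prox temporal-difference method (STHTD-MP) that improves the geometry of off-policy prediction by replacing the covariance metric with the symmetric part of the behavior-policy Bellman matrix, establishes convergence, and shows that its mean contraction factor can be smaller than that of GTD2-MP.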

AI中文摘要

梯度时间差分方法通过线性函数逼近提供稳定的离策略预测，但其实际性能强烈受辅助变量度量诱导的几何结构影响。现有的镜像-近似TD方法通常使用特征协方差度量，而混合TD方法表明行为策略转移信息可以提供更具信息性的更新几何结构。本文提出一种行为诱导的镜像-近似时间差分方法，称为STHTD-MP，它将原始-对偶鞍点公式中的协方差度量替换为行为策略Bellman矩阵的对称部分。该方法对原始变量和辅助变量保持单一学习率，并对得到的混合鞍点算子应用镜像-近似预测-校正步骤。我们在标准随机逼近假设下对固定策略线性预测提供了形式化的收敛分析：行为诱导度量正定，联合均值系统Hurwitz稳定，有界性通过Lyapunov论证得到，随机递归通过ODE方法收敛。我们进一步推导了投影预言遍历间隙界，并基于确定性镜像-近似误差矩阵的谱半径与GTD2-MP进行了精确的均值算子比较。分析表明，当行为诱导度量改善鞍点几何结构时，STHTD-MP可以比GTD2-MP具有更小的平均收缩因子。在二状态、随机游走和Boyan Chain基准上的精确数值均值算子分析支持了这一条件，而Baird的反例被识别为一个奇异边界情况，其中严格假设不成立。

英文摘要

Gradient temporal-difference methods provide stable off-policy prediction with linear function approximation, but their practical performance is strongly affected by the geometry induced by the auxiliary-variable metric. Existing Mirror-Prox TD methods typically use the feature covariance metric, whereas hybrid TD methods suggest that behavior-policy transition information can provide a more informative update geometry. This paper proposes a behavior-induced Mirror-Prox temporal-difference method, called STHTD-MP, which replaces the covariance metric in the primal-dual saddle-point formulation with the symmetric part of the behavior-policy Bellman matrix. The method keeps a single learning rate for the primal and auxiliary variables and applies a Mirror-Prox prediction-correction step to the resulting hybrid saddle-point operator. We provide a formal convergence analysis for fixed-policy linear prediction under standard stochastic approximation assumptions: the behavior-induced metric is positive definite, the joint mean system is Hurwitz, boundedness follows from a Lyapunov argument, and the stochastic recursion converges by the ODE method. We further derive projected-oracle ergodic gap bounds and an exact mean-operator comparison with GTD2-MP based on the spectral radius of the deterministic Mirror-Prox error matrix. The analysis shows that STHTD-MP can have a smaller mean contraction factor than GTD2-MP when the behavior-induced metric improves the saddle-point geometry. Exact numerical mean-operator analysis on two-state, Random Walk, and Boyan Chain benchmarks supports this condition, while Baird's counterexample is identified as a singular boundary case where the strict assumptions fail.

2605.28848 2026-05-29 cs.CL cs.AI 版本更新

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

GPF-LiveNews: 大型语言模型中群体条件框架的流式评估协议

Mohd Ariful Haque, Fahad Rahman, Kishor Datta Gupta, Roy George

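Affiliations: Clark Atlanta University

AI summary: Proposes GPF-LiveNews, a streaming evaluation protocol that combines live news anchors with identity labels to detect audience-dependent semantic sensitivity and sentiment disparity in LLM outputs, for auditing group-conditioned framing.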

AI中文摘要

部署的语言模型在非静态环境中进行评估：模型版本、检索层、安全系统和真实世界输入都随时间变化。静态偏差基准仍然有用，但它们无法显示模型如何针对不同提示受众构建新出现事件的框架。我们引入了GPF-LIVENEWS，这是一个流式评估协议和基准快照，用于审计开放端LLM输出中的群体条件框架。该协议扩展了来自BBC/路透社的最新新闻锚点，涵盖42个身份标签和七个提示族，然后使用语义敏感性和情感差异信号评估响应束。在12次监控运行和23个托管模型的试点中，政策/行动提示产生了最强的语义运动，而情感变化在维度和提示族之间较为平坦。发布的工件包括文章元数据、提示模板、实例化提示、模型输出元数据、评分表、文档和复现脚本。我们将所有评分解释为用于人工审查的观察窗口审计信号，而非永久性的公平性排名或有害偏差的直接证据。

英文摘要

Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how models frame newly emerging events for different prompted audiences. We introduce GPF-LIVENEWS, a streaming evaluation protocol and benchmark snapshot for auditing group-conditioned framing in open-ended LLM outputs. The protocol expands fresh BBC/Reuters news anchors across 42 identity labels and seven prompt families, then evaluates response bundles using semantic-sensitivity and sentiment-disparity signals. In a pilot over 12 monitoring runs and 23 hosted models, Policy/Action prompts produce the strongest semantic movement, while sentiment variation is flatter across dimensions and prompt families. The released artifact includes article metadata, prompt templates, instantiated prompts, model-output metadata, score tables, documentation, and reproduction scripts. We interpret all scores as observed-window audit signals for human review, not as permanent fairness rankings or direct proof of harmful bias.

2605.28842 2026-05-29 cs.CL cs.AI (updated version)

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

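Dong Liu, Yanxuan Yu, Ying Nian Wu

Affiliations: University of California, Los Angeles; Columbia University

AI summary: Proposes the Thoughts-as-Planning framework, which formalizes chain-of-thought optimization as a sequential decision process in a latent semantic space, uses a latent world model to simulate how edits to the reasoning chain affect downstream outputs, and plans via gradient descent or reinforcement learning. It outperforms existing baselines on language understanding and generation tasks.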

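Evaluating Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information with Deep Learning

Gus Lathouwers, Wieke Harmsen, Catia Cucchiarini, Helmer Strik

Affiliations: Radboud University

AI summary: Evaluates four Dutch syllabification algorithms and proposes a deep-learning model that combines phonetic and orthographic information, reaching 99.65% word accuracy, 0.14% above the best result in the literature.

Comments: Published in CLIN Journal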

Journal ref: Computational Linguistics in the Netherlands Journal, Vol. 14 (2025), pp. 365 to 383

Abstract

Syllabification describes the task of dividing words into syllables. Due to many rules and exceptions, training an algorithm to perform syllabification with high accuracy remains a challenge. Throughout the last decades, different algorithms have been put forth for Dutch syllabification, yet a comprehensive comparative assessment has not been done. Additionally, deep learning has gained significant popularity within NLP in recent years, yet no modern deep-learning based framework has been developed for Dutch orthographic syllabification. Finally, phonetic and orthographic syllabification algorithms have been examined separately, but not in combination. The aim of the current research was twofold: (a) to examine the performance of existing Dutch syllabification algorithms, and (b) to investigate whether combining phonetic and orthographic information into a single model can increase syllabification performance. To compare the performance of algorithms, four algorithms (Brandt Corstius, Liang, Trogkanis-Elkan (CRF), and a newly conceived deep-learning model) were applied to three different datasets (dictionary words, loanwords, pseudowords). The algorithms show varying performance across datasets, with the data-driven algorithms outperforming a knowledge-based algorithm in all but one condition. The new deep-learning methods developed led to increased performance compared to the best found in the literature (99.65% word accuracy, a 0.14% improvement). An analysis of the words for which adding phonetic information improved syllabification performance indicates that these were words in which the orthographic ambiguity could be resolved by information on pronunciation. Future research could examine other areas where phonetic information can benefit orthographic processing. In addition, the newly developed deep learning frameworks can be applied to other languages than Dutch.

2605.28833 2026-05-29 cs.CL cs.AI (updated version)

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

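Gus Lathouwers, Lingyun Gao, Catia Cucchiarini, Helmer Strik

Affiliations: Radboud University

AI summary: Evaluates three ASR model families (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets and proposes an utterance-level selection method that automatically identifies correctly pronounced utterances with high confidence, reducing the need for manual verification.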

Abstract

Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for child speech remains challenging in low-resource languages due to limited child-specific pre-trained models and highly diverse noise conditions. This study investigates the effectiveness of state-of-the-art ASR models on child speech through two research questions, by evaluating nine ASR models from three model families (Whisper, Parakeet, and Wav2Vec2) on two Dutch child speech datasets, JASMIN and DART. Research question 1 examines the performance of ASR-models applied to child speech. The fine-tuned Whisper-medium model achieves the best overall performance, with a WER of 5.54% on JASMIN and 70.37% on DART, showing that the noisy DART data are clearly more challenging. Research question 2 examines to what extent it is possible to select a subset for which reliable orthographic transcriptions can be obtained automatically, without the need for manual verification. We use an utterance-level selection method that compares ASR output with the original read prompt to identify correctly pronounced recordings. Using the proposed selection method, 42.0% [for JASMIN] and 18.1% [for DART] of the utterances can be automatically identified as correctly pronounced with high confidence, resulting in very low error rates on an utterance level (precisions of 98.3% and higher) and reducing the need for manual verification.
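
The two quantities this abstract relies on, word error rate and prompt-based utterance selection, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the exact-match selection rule is an assumption (the abstract only says that ASR output is compared with the original read prompt).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def select_confident(prompts: list[str], asr_outputs: list[str]) -> list[int]:
    """Indices of utterances whose ASR output equals the read prompt.

    Assumed rule: an exact word-level match (WER of zero) marks the
    recording as correctly pronounced.
    """
    return [i for i, (prompt, output) in enumerate(zip(prompts, asr_outputs))
            if word_error_rate(prompt, output) == 0.0]
```

A stricter or looser matching rule would trade the share of selected utterances against the precision of the selection, which is the trade-off the reported 42.0% and 18.1% selection rates reflect.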

2605.28832 2026-05-29 cs.CL cs.AI (updated version)

A comparative study of transformer-based embeddings for topic coherence

Alex Ding, Tarun Rapaka, Willy Rodriguez, Jason Yang

Affiliations: Worcester Academy Stanford Online High School; Stanford Online High School; Lexington High School

AI summary: Systematically compares how seven transformer language models of different sizes (from MiniLM to LLaMA-2) affect topic quality in a BERTopic pipeline, finding that model size (22 million to 13 billion parameters) has a negligible effect on topic coherence.

Abstract

Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. Model size (in number of parameters) is also known to have a significant impact on the performance of language models across different pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performance of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on topic quality, suggesting that smaller models can achieve comparable performance to larger models.

2605.28830 2026-05-29 cs.CL cs.AI cs.SE (updated version)

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

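Reetu Raj Harsh, Bhaskarjit Sarmah, Stefano Pasquali

Affiliations: Domyn

AI summary: Comprehensively evaluates 14 open-source safety guard models across 8 NIST AI Risk Framework safety categories, finding that recall is the critical metric and that model size does not correlate with safety detection performance.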

Abstract

As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.

2605.28828 2026-05-29 cs.CL cs.AI (updated version)

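ROVER: Routing Object-centric Visual Evidence for Grounded Multi-image Reasoning

Guannan Lv, Ren Nie, Hongjian Dou, Tingting Gao

Affiliations: Kuaishou Technology

AI summary: Proposes ROVER, a lightweight learnable plugin that aggregates context, distills intra-image cues via object-centric differential attention, and routes history-aware evidence, enabling efficient global visual evidence routing and improving answer and grounding accuracy in multi-image reasoning.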

Abstract

Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.

2605.27480 2026-05-29 q-bio.OT cs.AI cs.CY (updated version)

BIRDS: Characterizing and Understanding Biodiversity Impact of Large Language Model Serving

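Tianyao Shi, Yi Ding

Affiliations: Purdue University

AI summary: Proposes the BIRDS framework, which defines a request-level functional unit, quantifies operational and embodied biodiversity impact, and introduces quality-normalized biodiversity impact (QNBI), revealing the cumulative ecosystem impact of large-scale LLM serving and its quality-aware serving trade-offs.

Comments: 21 pages, 27 figures, 9 tables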

2605.27390 2026-05-29 cs.CL cs.AI (updated version)

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

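Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang

Affiliations: School of Computer Science and Technology

AI summary: Proposes the EvoSpec framework, which evolves the draft model in real time during speculative decoding through dynamic vocabulary and parameter adaptation, addressing the sharp drop in acceptance rate that static methods suffer in specialized domains and topic-switching scenarios. On EAGLE-3 it achieves a 1.13x speedup and 27% lower memory overhead.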

Abstract

Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively reduce this overhead, they suffer from precipitous drops in acceptance rate in specialized domains or topic-switching scenarios due to their inability to capture dynamic distribution shifts. To address this, we introduce EvoSpec, a framework that enables real-time evolution of the draft model through dynamic vocabulary and parameter adaptation. Unlike static or purely retrieval-based approaches, EvoSpec employs a context-aware mechanism that retrieves critical long-tail tokens via efficient semantic and statistical indexing. Furthermore, we propose a lightweight online alignment strategy utilizing curriculum learning to continually minimize the distributional gap between the draft and target models. Extensive evaluations across specialized domains (coding, law, and medicine) confirm that EvoSpec overcomes the limitations of static baselines. On EAGLE-3, it achieves a 1.13x speedup in these settings over the state-of-the-art static baseline FR-Spec, with 27% lower memory overhead than standard online adaptation.

2605.27387 2026-05-29 cs.CL cs.AI (updated version)

From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

Xiangyu Ma, Teng Xiao, Zuchao Li, Lefei Zhang

Affiliations: School of Artificial Intelligence, Wuhan University; School of Computer Science, Wuhan University

AI summary: Proposes the FLUID framework, which efficiently adapts autoregressive models into diffusion models through strictly causal alignment and an elastic-horizon mechanism, enabling parallel text generation at substantially lower training cost.

Comments: Accepted by ACL 2026

Abstract

Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre-training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm. By enforcing Strictly Causal Alignment, FLUID enables seamless initialization from standard GPT-style checkpoints, circumventing the need for massive pre-training. Furthermore, we introduce Elastic Horizons, an entropy-driven mechanism that dynamically modulates denoising strides based on local information density rather than fixed schedules. Experiments demonstrate that FLUID achieves state-of-the-art performance while reducing training costs by orders of magnitude, effectively reconciling established AR foundations with efficient parallel generation. Our code is available at https://github.com/Oli-lab-nun/FLUID/tree/main.

2605.27382 2026-05-29 cs.HC cs.AI cs.CL (updated version)

The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China

AI summary: By contrasting how sycophancy rates change across persona conditions in strongly and weakly aligned models, defines the alignment floor Δ_floor as an audit metric for the safety of persona customization.

Abstract

Telling an LLM to "be enthusiastic" raises its sycophancy rate from 30% to 50% on a lightly-aligned model, but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $\Delta_{\text{floor}}(m)=\max_p S(m,p)-\min_p S(m,p)$, the range of sycophancy rates a model produces across persona conditions, and treat sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts like "be creative" or "be thorough", which let systems respect diverse user values and communication styles; the safety question is how much customization a given model can absorb before its truthfulness shifts. We present a controlled case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite), spanning seven persona conditions and five tasks for 1800 total runs. An existence-pair result motivates per-model auditing: there is at least one strongly-aligned model with $\Delta_{\text{floor}}=5$pp (within 5pp of the 15% control rate) and at least one lightly-aligned model with 45pp (5% to 50% range). On the lightly-aligned model, all five Big Five personas increase sycophancy over control, and counterintuitively Agreeableness produces the smallest increase, not the largest. The single largest effect in the study is constructive: a Skeptic persona reduces sycophancy by 25pp on the lightly-aligned model, and is the only persona that instructs resistance against user claims rather than engagement with them, suggesting a directionality account. Cross-model transfer of persona effects is near-zero, so persona-alignment testing must be per-model. We propose $\Delta_{\text{floor}}$ as a deployment-time audit metric: measure it on a small persona panel before deploying persona customization.
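
The alignment-floor definition above is a simple range statistic, and a minimal sketch of it may help. The function name and the persona panel below are hypothetical. The control (30%) and enthusiastic (50%) rates come from the abstract; the skeptic rate (5%) is inferred from the reported 25pp reduction and the 5% to 50% range, and the remaining personas are omitted.

```python
def alignment_floor(sycophancy_by_persona: dict[str, float]) -> float:
    """Range of sycophancy rates across persona conditions, in percentage points.

    Implements max_p S(m, p) - min_p S(m, p) for a single model m.
    """
    if not sycophancy_by_persona:
        raise ValueError("need at least one persona condition")
    rates = sycophancy_by_persona.values()
    return max(rates) - min(rates)


# Illustrative panel for the lightly-aligned model (rates in percent).
lightly_aligned = {"control": 30, "enthusiastic": 50, "skeptic": 5}
floor_pp = alignment_floor(lightly_aligned)
```

Because the metric depends only on the extremes, a single strongly sycophantic or strongly resistant persona in the panel sets the value, which is one reason the choice of persona panel matters when auditing a model.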

2605.27379 2026-05-29 cs.AI cs.CL (updated version)

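Prospective Evaluation of Multimodal Respiratory Failure Prediction: Can Chest Radiographs Improve Performance Beyond Electronic Health Record Signals?

Xiaolei Lu, Shamim Nemati

AI summary: Proposes a gated multimodal framework that integrates structured EHR time-series data with chest radiograph foundation-model representations to prospectively predict whether ICU patients will need invasive mechanical ventilation within 24 hours. Compared with an EHR-only model and physician predictions, multimodal fusion improves discrimination, sensitivity, and positive predictive value.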

Abstract

Early prediction of respiratory failure is critical for timely clinical intervention in intensive care units. Existing electronic health record (EHR)-based models can continuously monitor physiologic deterioration, but they may not fully capture pulmonary pathophysiology reflected in chest radiographs (CXRs). In this study, we ask whether CXR information improves prospective prediction of invasive mechanical ventilation beyond EHR signals alone. We develop a gated multimodal framework that integrates structured EHR time-series data with CXR foundation-model representations. The gating module adaptively controls the contribution of imaging features based on patient-specific clinical context, allowing the model to selectively rely on imaging information when it is informative. We prospectively evaluate the framework for predicting invasive mechanical ventilation within 24 hours in ICU patients and compare it with an established EHR-only model (Ventio), physician predictions obtained at matched clinical time points, and alternative multimodal variants. The gated multimodal models achieved higher discrimination than the EHR-only baseline, with AUROC values of 0.860 and 0.858 using REMEDIS and MedInsight CXR representations, respectively, compared with 0.752 for Ventio. Relative to physician predictions, the multimodal framework substantially improved sensitivity while maintaining favorable specificity. Compared with the EHR-only model, multimodal integration increased specificity and positive predictive value, suggesting that CXR information can refine risk estimation in selected patients. These findings support adaptive multimodal fusion as a practical strategy for incorporating imaging into prospective respiratory failure prediction.

2605.26193 2026-05-29 cs.LG cs.AI (updated version)

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

Qideng Tang, Dai Chaofan, Wubin Ma, Yahui Wu, Haohao Zhou, Tao Zhang, Huan Li, Dalin Zhang

Affiliations: National Key Laboratory of Information Systems Engineering, National University of Defense Technology; College of Systems Engineering, National University of Defense Technology; Zhejiang University; Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University

AI summary: Proposes the CoAD framework, in which a classification module generates probabilistic soft masks that guide a reconstruction module, cooperatively exploiting the complementary strengths of the classification and reconstruction paradigms to detect subtle and complex anomalies. It significantly outperforms existing methods on benchmark datasets.

Comments: 15 pages, submitted to KDD 2026

Abstract

Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure in detecting subtle and prolonged anomalies. Outlier Exposure (OE) and Masked Autoencoder (MAE) emerge as two promising paradigms (classification and reconstruction) for solving the above problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment issues. To address these limitations, this paper proposes a novel framework, CoAD, which unifies the two paradigms to leverage their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the generalization problem of the classification module. This cooperative design enables CoAD to effectively detect subtle and complex anomalies that are often overlooked by existing methods. Additionally, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets, conducted under rigorous evaluation protocols, demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, highlighting the potential of deep learning in TSAD. Moreover, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.

2605.26029 2026-05-29 cs.AI cs.CL (updated version)

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng

Affiliations: Tsinghua University; University of Illinois Urbana-Champaign; Carnegie Mellon University; University of Chicago; Adobe

AI summary: Proposes the CausaLab environment, which uses synthetic laboratory tasks to evaluate LLM agents' prediction accuracy and causal-mechanism recovery in causal discovery, finding a significant gap between the two.

Abstract

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
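
The gap between task accuracy and structural fidelity depends on how recovered graphs are scored. A minimal sketch of an edge-level $F_1$ over directed edges follows. The function name is hypothetical, and the abstract does not define its "all-edge" variant, so treating every directed edge of the graph as one scored item is an assumption.

```python
Edge = tuple[str, str]  # (cause, effect)


def edge_f1(true_edges: set[Edge], predicted_edges: set[Edge]) -> float:
    """F1 score between a true and a predicted set of directed edges.

    A reversed edge counts as both a false positive and a false negative.
    """
    true_positives = len(true_edges & predicted_edges)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_edges)
    recall = true_positives / len(true_edges)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 3-node example: one correct edge, one reversed, one missed.
truth = {("A", "B"), ("B", "C"), ("A", "C")}
guess = {("A", "B"), ("C", "B")}
score = edge_f1(truth, guess)
```

An agent can predict a held-out variable well from a partially wrong graph, so a score of this kind can stay low even when task accuracy is high.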

2605.25556 2026-05-29 cs.LO cs.AI (updated version)

Keep the Proof State Live: Snapshotting for Efficient Tactic Search in Lean 4

Austin Shen, Yunong Shi

Affiliations: University of Michigan; Amazon Web Services

AI summary: To address the heavy overhead of repeatedly reconstructing proof states during parallel tactic search in Lean 4, proposes proof-state snapshotting, which captures the proof state once and reuses it, achieving a 5.6-50x speedup.

Comments: 11 pages, 1 figure. v2: Added co-author affiliation (Amazon Web Services) and contact emails for both authors

Abstract

Automated theorem proving systems built on Lean 4 increasingly rely on parallel tactic search over partially specified proofs, such as those generated by Draft-Sketch-Prove (DSP) pipelines. In current systems, each search branch reconstructs a proof state by re-running elaboration, leading to substantial per-branch overhead. In Lean 4 with Mathlib, this cost has two components: (1) import loading, which deserializes pre-compiled libraries (~60 s per branch); and (2) theorem-body elaboration, which re-checks the theorem context up to the target goal (estimated 18-735 s depending on proof complexity). Together, these account for >99% of per-branch wall time, making portfolio-based search impractical at scale. We observe that this overhead arises from a mismatch between the structure of proof search and its execution model: branching is implemented via repeated reconstruction of proof states rather than direct reuse. To address this, we introduce proof-state snapshotting, which captures the elaborated proof state once and reuses it across branches via a small extension to the Lean 4 language server. Across 48 miniF2F-v2 problems (45 prove-phase benchmarks and 3 full end-to-end runs), our approach achieves a 5.6-50x wall-time speedup over the standard fallback (average 14x, median 9.7x). Speedup increases with the number of proof branches. Our method is orthogonal to import-level caching (e.g., Kimina Lean Server), which avoids import loading but not theorem-body elaboration. The patched Lean binary and the Snapshot-DSP pipeline will be released as open source upon publication.

2605.25297 2026-05-29 cs.CL cs.AI cs.LG (updated version)

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Hangxuan Li, Renjun Jia, Xuezhang Wu, Yunjie Qian, Zeqi Zheng, Xianling Zhang

Affiliations: Alibaba Cloud Computing Co. Ltd, Hangzhou, China; School of Computer Science, Fudan University, Shanghai, China; School of Computer Science and Technology, Tongji University, Shanghai, China; Independent Researcher, United States

AI summary: Proposes the Eureka framework, which treats feature engineering as an agentic code generation problem and automatically generates executable feature code through three stages (an Expert Agent, an LLM Feature Factory, and a Self-Evolving Alignment Engine), significantly improving performance on 7 public benchmarks in healthcare, finance, and social domains and on GPU resource demand prediction at Alibaba Cloud.

Comments: Accepted at NeurIPS 2025 Workshop, DASFAA 2026 (International Conference on Database Systems for Advanced Applications)

DOI: 10.1007/978-981-92-0378-9_33
Journal ref: Database Systems for Advanced Applications (DASFAA 2026), Lecture Notes in Computer Science, vol. 16540, pp. 528-540, Springer

Abstract

Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka's effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.

2605.24846 2026-05-29 cs.LG cs.AI Updated version

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

Xiangtian Ji, Yuxin Chen, Zhengzhou Cai, Xiang Wang, An Zhang, Tat-Seng Chua

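Affiliations: National University of Singapore; Beijing University of Posts and Telecommunications; University of Science and Technology of China

AI summary: Through cross-task activation-strength analysis, this study finds an extremely sparse set of keystone neurons in large language models whose removal causes model behavior to collapse. On that basis it proposes a fine-tuning method that updates only the keystone neurons, matching or exceeding the task performance of full-parameter fine-tuning while modifying far fewer parameters.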

Abstract

Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insufficiently understood. In this work, we show that across a wide range of open-weight Transformers, a subset of neurons remains consistently highly activated during inference across tasks spanning multiple capability dimensions. By probing along the cross-task activation strength, we isolate an extremely sparse subset whose removal causes a collapse in model behavior; we term these keystone neurons. Our analysis reveals that keystone neurons are a stable and intrinsic neuron subset of the model that is largely established during pretraining. The parameters associated with these neurons are tightly calibrated during the training process, and their precise values are critical for the capabilities of the model. Building on these insights, we propose a supervised fine-tuning approach that updates only keystone neurons, achieving task gains comparable to or even better than full-parameter fine-tuning while better preserving performance in other capability dimensions, despite modifying a much smaller number of parameters.

2605.24399 2026-05-29 cs.AI Updated version

ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

Xuan Wang, Zhongling Xu, Gopi Kannedhara, Joakim Nguyen, Jian Yu, Jinrui Fang, Abdurrahmaan Baghdadi, Tianlong Chen, Awais Naeem, Chandra Krishnan, Edward Castillo, Andrew H. Song, Ankita Shukla, Ying Ding, Nicholas Konz, Hairong Wang

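Affiliations: University of Texas at Austin; University of North Carolina at Chapel Hill; Dell Children’s Medical Center; The University of Texas MD Anderson Cancer Center; University of Nevada, Reno

AI summary: Proposes the ConceptM$^3$oE framework, which embeds concept formation in concept-guided multimodal mixture-of-experts pathways and uses residual pathways to keep both performance and interpretability. On brain tumor classification it performs competitively with unconstrained models and improves performance when training data are limited.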

Abstract

Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtypes that are difficult to distinguish from morphology alone, pathology reports and molecular measurements may provide additional diagnostic evidence alongside whole-slide images, yet existing models often fail to clarify how diverse signals assemble into recognizable diagnostic concepts. We propose ConceptM$^3$oE (Concept Multimodal MoE), which embeds concept formation directly within interaction-aware mixture-of-experts (MoE) pathways. The architecture decomposes evidence into modality-specific, redundant, and synergistic experts, which are then projected into structured concept bottlenecks mapping latent features to a hierarchy of morphology and biomarker concepts. To prevent the information loss typical of interpretable bottlenecks, we use residual pathways within each expert so that task-relevant signals flow both through the concepts and directly to the final task prediction, maintaining high performance alongside interpretability. Across an institutional pediatric brain tumor cohort and a public glioma cohort, the framework performs competitively with unconstrained models while producing reasoning traces validated by an independent neuropathologist. In data-limited regimes, ConceptM$^3$oE raises macro-F1 from 56.41% to 66.70% at small training sizes relative to non-concept-informed baselines, while also showing faster training convergence consistent with the regularizing effect of concept learning. This work offers a scalable path toward high-performance medical AI that is inherently verifiable and better aligned with the complex decision-making of clinical practice.

2605.24140 2026-05-29 cs.AI Updated version

HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

Yuyu Liu, Haotian Xu, Yanan He, Sarang Rajendra Patil, Mengjia Xu, Tengfei Ma

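Affiliations: Department of Computer Science; Stony Brook University; Department of Applied Mathematics and Statistics; Yale University; Department of Data Science; New Jersey Institute of Technology; Department of Biomedical Informatics

AI summary: Multi-step reasoning faces a trade-off between single-pass generation, which is efficient but less accurate, and tree search, which is computation-heavy. The paper distills reasoning progress into a hyperbolic geometric signal that guides step-by-step generation: distance and angle in hyperbolic space encode solution proximity and branch distinction, a lightweight head projects hidden states into that space, and an adapter is fine-tuned to act on the signal, yielding consistent gains across multiple benchmarks.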

Abstract

Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution-bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance-to-origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine-tune a low-rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.
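
The geometric intuition in the abstract can be sketched numerically. The abstract does not say which model of hyperbolic space is used; the snippet below assumes the Poincaré ball with curvature -1, and the function names are illustrative rather than taken from the HyperGuide code.

```python
import math

def _sq_norm(v):
    return sum(c * c for c in v)

def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit ball (assumed model)."""
    diff = [a - b for a, b in zip(x, y)]
    denom = (1.0 - _sq_norm(x)) * (1.0 - _sq_norm(y))
    return math.acosh(1.0 + 2.0 * _sq_norm(diff) / denom)

def distance_to_origin(x):
    """Closed form of poincare_distance(x, origin); unbounded as |x| approaches 1."""
    return 2.0 * math.atanh(math.sqrt(_sq_norm(x)))

def angular_separation(x, y):
    """Angle in radians between two nonzero points as seen from the origin."""
    dot = sum(a * b for a, b in zip(x, y))
    cos = dot / math.sqrt(_sq_norm(x) * _sq_norm(y))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error
```

In this model, the distance to the origin grows slowly near the center and without bound near the boundary, which is the asymmetry the abstract appeals to; the angle gives a separate coordinate that could distinguish branches at the same depth.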

2605.23993 2026-05-29 cs.CV cs.AI cs.LG Updated version

Nano World Models: A Minimalist Implementation of Future Video Prediction

Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Kaiwen Geng, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz

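Affiliations: DeepMind

AI summary: Proposes Nano World Models, a minimalist codebase built around diffusion forcing for future video prediction. It supports controlled studies of world-model design choices and experimentally analyzes how factors such as prediction parameterization and architecture scale affect video prediction quality.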

Comments Project page: https://simchowitzlabpublic.github.io/nano-world-model/

Abstract

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.

2605.22100 2026-05-29 cs.AI Updated version

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

Bangbang Zhou, Hangdi Xing, Yifan Chen, Jianjun Xu, Qi Zheng, Feiyu Gao, Zhibo Yang, Shuai Bai, Ming Yan, Jieping Ye, Hongtao Xie

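Affiliations: University of Science and Technology of China; Tongyi Lab, Alibaba Group

AI summary: Existing benchmarks evaluate document parsing inadequately for real-world scenarios. The paper proposes MPDocBench-Parse, a benchmark of 433 multi-page documents (3,246 pages) across 15 document types, with a comprehensive evaluation protocol for content fidelity and logical structure. Experiments show that existing models have clear limitations in semantic continuity, visual content parsing, and hierarchical structure recovery.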

Abstract

Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for information systems. Although many benchmarks have been proposed for document parsing, they remain inadequate for realistic scenarios. Existing benchmarks either focus on specific tasks or assess only single-page, text-centric settings, making them insufficient for practical multi-page parsing. Moreover, they lack fine-grained evaluation of semantic continuity, hierarchical structure recovery, and visual content preservation. To address these gaps, we propose MPDocBench-Parse, a benchmark for multi-page document parsing in real-world applications. It contains 433 manually annotated documents with 3,246 pages, covering 15 document types in English and Chinese, with diverse layout styles, and supports document-level end-to-end evaluation. We further design a comprehensive protocol for content fidelity and logical structure, covering text, table, and formula recognition, truncated text and table merging, figure extraction, reading order, and heading hierarchy recovery. Experiments show that, while existing models perform well on basic text extraction, they still suffer clear limitations in semantic continuity integration, visual content parsing, and hierarchical structure recovery. MPDocBench-Parse provides a unified foundation for advancing document parsing toward more realistic scenarios.

2605.22080 2026-05-29 cs.CV cs.AI Updated version

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

Yue Xun, Junyu Liu, Qian Niu, Xinyi Wang, Zheng Yuan, Zirui Li, Zequn Zhang, Bowen Zhao, Shujun Wang, Irene Li, Kan Hatakeyama-Sato, Yusuke Iwasawa, Yutaka Matsuo

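Affiliations: The Hong Kong Polytechnic University; Kyoto University; The University of Tokyo; Hohai University; University of Science and Technology of China; University of Toronto

AI summary: Presents JMed48k, a multi-profession Japanese medical licensing benchmark with 48,862 exam questions and 20,142 images. Evaluating 21 models and introducing a paired image-removal audit, it finds that proprietary and open-source models benefit substantially from images, whereas medical-specific models make limited use of visual evidence.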

Abstract

We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four answer-transition states. The audit shows that proprietary and open-source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions. We release JMed48k to support reproducible, profession-stratified evaluation of vision-language models in medical licensing settings.
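
A minimal sketch of how a paired image-removal audit could be tabulated. The abstract mentions four answer-transition states but does not name them or define the net effect, so the state labels and the net-effect formula below are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical labels for the four (with image, without image) outcomes.
STATE_LABELS = {
    (True, True): "correct_both",            # answer survives image removal
    (True, False): "correct_with_only",      # correct only when the image is shown
    (False, True): "correct_without_only",   # correct only when the image is removed
    (False, False): "wrong_both",
}

def transition_audit(with_image, without_image):
    """Count transition states for the same questions answered with and without images.

    Both arguments are equal-length, non-empty lists of booleans (True = correct).
    Returns (state counts, net effect in percentage points).
    """
    if not with_image or len(with_image) != len(without_image):
        raise ValueError("paired audit needs the same non-empty question set twice")
    counts = Counter(STATE_LABELS[pair] for pair in zip(with_image, without_image))
    # Assumed definition: accuracy with images minus accuracy without images.
    net = 100.0 * (counts["correct_with_only"] - counts["correct_without_only"]) / len(with_image)
    return dict(counts), net
```

Pairing the same questions is what makes the comparison meaningful: the text-only and with-image subsets contain different questions, so their raw accuracies cannot be subtracted directly.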

2605.21739 2026-05-29 cs.AI Updated version

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

Kate M. Lubrano, Faisal Sayed, Ankita Rathod, Akshansh, Craver Corbyn Thomas-Smith, Mark E. Whiting, Karina Nguyen

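Affiliations: Pareto; Thoughtful; University of Pennsylvania

AI summary: Proposes AttuneBench, a benchmark built on 200 real multi-turn human-model conversations that evaluates LLM emotional intelligence in emotion recognition, behavioral classification, preference prediction, and response quality. These capabilities are found to be largely independent, with preference alignment and response quality discriminating between models more strongly.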

Comments v2: Updated def_18 and def_20 supplemental figures to cover all 11 evaluated models (previously 9). Removed redundant supplemental figures. Corrected select captions (color descriptions, chance baselines, figure-content mismatches). No changes to experimental results, numerical claims, or conclusions

Abstract

Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others' emotional states, is central to human communication, and increasingly important to assess as LLMs assume conversational roles in everyday life. Existing EI benchmarks rely on synthetic prompts, single-turn cases, or third-party annotation. These approaches do not directly measure how models infer and respond to a participant's emotional state over the course of a real conversation. We introduce AttuneBench, a benchmark grounded in 200 genuine multi-turn human-model conversations in which participants conversed with anonymized LLMs and provided turn-by-turn annotations of their emotional state, the model's behavior, and their preferred responses. Across 11 evaluated models, we find that model rankings on emotion recognition, behavioral classification, preference prediction, and judged response quality are largely independent, indicating that emotionally intelligent behavior decomposes into separable capabilities. Preference alignment and response-quality judgments are substantially more model-discriminating than emotion-label accuracy. These results indicate that emotionally intelligent behavior requires predicting what kind of response a specific user wants in context, a distinction that aggregate scoring can obscure and that single-turn or synthetic formats cannot directly capture across turns. AttuneBench provides a framework for assessing each of these capabilities and for diagnosing model-specific strengths and failure modes in emotionally salient conversation.

2605.15219 2026-05-29 cs.AI cs.IT math.IT Updated version

NOVA: Fundamental Limits of Knowledge Discovery Through AI

Salman Avestimehr, Ken Duffy, Muriel Médard

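Affiliations: University of Southern California; Northeastern University; Massachusetts Institute of Technology

AI summary: Proposes the NOVA framework, which models the "generate, verify, accumulate, retrain" loop as an adaptive sampling process over a knowledge space, identifies conditions under which accumulated knowledge covers a finite domain along with the associated failure modes, and proves a scaling law for discovery cost tied to a Zipf law.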

Abstract

Can AI systems discover genuinely new knowledge through iterative self-improvement, and if so, at what cost? We introduce the NOVA framework, which models the common "generate, verify, accumulate, retrain" loop as an adaptive sampling process over a knowledge space. We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how their violations produce distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure. We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the model mass assigned to new valid artifacts shrinks, so even small false-positive rates can cause invalid artifacts to enter the knowledge base faster than genuine discoveries. We clarify that Good-Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery. Under a separate tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent $\alpha>1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=\Theta(c_{\mathrm{gen}}D^{\alpha})$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances. Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.
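
The exponent in the scaling law can be motivated by a back-of-the-envelope argument. This is a heuristic sketch, not the paper's proof: it assumes that valid artifacts are drawn independently with probabilities $p_k \propto k^{-\alpha}$ and that they are discovered roughly in rank order. The symbols $M(D)$ and $N(D)$ are introduced here for illustration.

```latex
% Undiscovered valid mass after the D most likely artifacts have been found:
M(D) = \sum_{k>D} p_k \;\propto\; \sum_{k>D} k^{-\alpha}
      \;\approx\; \frac{D^{1-\alpha}}{\alpha-1}, \qquad \alpha > 1.

% Expected number of candidates generated before the next new discovery:
N(D) \approx \frac{1}{M(D)} \;\propto\; D^{\alpha-1}.

% Summing over discoveries and charging c_gen per candidate:
R_{\mathrm{cum}}(D) \approx c_{\mathrm{gen}} \sum_{d=1}^{D} N(d)
      \;\propto\; c_{\mathrm{gen}} \sum_{d=1}^{D} d^{\alpha-1}
      = \Theta\!\left(c_{\mathrm{gen}} D^{\alpha}\right).
```

Under these assumptions, with $\alpha = 2$ doubling the number of discoveries roughly quadruples the cumulative generation cost, which is the diminishing-returns behavior the abstract describes.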

2605.13841 2026-05-29 cs.SD cs.AI cs.CL cs.LG Updated version

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara

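Affiliations: ServiceNow

AI summary: Proposes the EVA-Bench framework, which evaluates the accuracy and experience quality of voice agents through simulated bot-to-bot audio conversations and two composite metrics (EVA-A and EVA-X).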

Comments Work in progress

Abstract

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user-simulator errors and regenerates the affected conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, and pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median gap between pass@k and pass^k of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean $\Delta$ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
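
The abstract does not define pass@k and pass^k. Under the definitions commonly used elsewhere (pass@k: at least one of k attempts succeeds; pass^k: all k attempts succeed), both can be estimated from n recorded trials per scenario as sketched below. Whether EVA-Bench uses these exact estimators is an assumption, and the function names are illustrative.

```python
from math import comb

def _check(n, c, k):
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")

def pass_at_k(n, c, k):
    """Chance that at least one of k trials, drawn without replacement from
    n recorded trials containing c successes, is a success."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n, c, k):
    """Chance that all k trials drawn without replacement are successes (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)
```

With 3 successes in 5 trials and k = 2, these give 0.9 for pass@k and 0.3 for pass^k. For k = 1 both reduce to the plain success rate c/n, so any gap between them at larger k reflects inconsistency across repeated attempts rather than average ability.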

2605.12208 2026-05-29 stat.ML cs.AI cs.LG stat.CO Updated version

Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification

Julian Rodemann, Alexander Marquard, Thomas Augustin, Michele Caprio

Affiliations: Rational Intelligence Lab, CISPA Helmholtz Center for Information Security; Department of Statistics, LMU Munich; Department of Computer Science, The University of Manchester

AI summary: Proposes the Self-Supervised Laplace Approximation (SSLA), which approximates the posterior predictive distribution directly by refitting on self-predicted data. This gives deterministic, sampling-free Bayesian uncertainty quantification that outperforms the classical Laplace approximation on regression tasks.

Comments Accepted for publication in TMLR (https://openreview.net/forum?id=T8w8L2t3JG), v2: fixed typos and added a deceased-author footnote with a dedication to Thomas Augustin

Journal ref: Transactions on Machine Learning Research (TMLR). ISSN 2835-8856 (2026)

Abstract

Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main object of interest is often a model's predictions rather than its parameters. In this work, we propose to bypass the parameter posterior and focus directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-training within self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model's predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: If a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. This yields a deterministic, sampling-free approximation of the posterior predictive. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows us to plug in different prior specifications, enabling classical Bayesian sensitivity (w.r.t. prior choice) analysis. In order to bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically in regression models ranging from Bayesian linear models to Bayesian neural networks. Across a wide array of regression tasks with simulated and real-world datasets, our methods outperform classical Laplace approximations in predictive calibration while remaining computationally efficient.

2605.07707 2026-05-29 cs.AI Updated version

Hierarchical Task Network Planning with LLM-Generated Heuristics

Felipe Meneguzzi, Alexandre Buchweitz, Augusto B. Corrêa, Victor Scherer Putrich, André Grahl Pereira

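Affiliations: University of Aberdeen, UK; PUCRS, Brazil; University of Oxford, UK; Saarland University, Germany; Universidade Federal do Rio Grande do Sul, Brazil

AI summary: Studies the use of large language models to generate search heuristics for hierarchical task network planning. Evaluated with the Pytrich planner on six benchmark domains, the LLM-generated heuristics nearly match the coverage of the best available HTN planner and substantially reduce search effort on 83% of shared problems.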

Comments 9 pages, 3 figures; submitted to NeurIPS 2026

Abstract

HTN planning is a variation of classical planning where, instead of searching for a linear sequence of actions, an algorithm decomposes higher-level tasks using a method library until only executable actions remain. On one hand, this allows one to introduce domain knowledge that can speed up the search for a solution through the method library. On the other hand, it creates challenges that go beyond those of classical state-space search. While recent research produced a number of heuristics and novel algorithms that speed up HTN planning, these heuristics are not yet as informative as those available in classical planning algorithms. We investigate whether large language models (LLMs) can generate effective search heuristics for HTN planning, extending the methodology of Corrêa, Pereira, and Seipp (2025) from classical to hierarchical planning. Using the Pytrich planner on six standard total-order HTN benchmark domains, we evaluate heuristics generated by nine LLMs under domain-specific prompting and compare them against the TDG and LMCount domain-independent baselines and the PANDA planner. Our results show that LLM-generated heuristics nearly match the coverage of the best available HTN planner, while substantially reducing search effort on 83% of shared problems.

2605.04916 2026-05-29 cs.AI cs.LG cs.SC Updated version

A Foundation Model for Zero-Shot Logical Rule Induction

Yin Jun Phua

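Affiliations: Institute of Science Tokyo

AI summary: Proposes the Neural Rule Inducer (NRI), a pretrained model based on statistical encoding and parallel slot decoding that performs zero-shot logical rule induction and generalizes to new predicates without retraining.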

Comments Camera-ready version accepted at IJCAI 2026, with full appendices

Abstract

Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task. We introduce Neural Rule Inducer (NRI), a pretrained model for zero-shot rule induction. Rather than encoding literal identities, NRI represents literals using domain-agnostic statistical properties such as class-conditional rates, entropy, and co-occurrence, which generalize across variable identities and counts without retraining. The model consists of a statistical encoder and a parallel slot-based decoder. Parallel decoding preserves the permutation invariance of logical disjunction; an autoregressive decoder would instead impose an arbitrary clause order. Product T-norm relaxation makes rule execution differentiable, allowing end-to-end training on prediction accuracy alone. We evaluate NRI on rule recovery, robustness to label noise and spurious correlations, and zero-shot transfer to real-world benchmarks, and we believe this work opens up the possibility of foundation models for symbolic reasoning. Code and the reference checkpoint are available at https://github.com/phuayj/neural-rule-inducer.
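
The product T-norm relaxation mentioned in the abstract can be illustrated on a rule in disjunctive normal form. This is a sketch under assumptions: pairing the product T-norm with the probabilistic sum for disjunction and with 1 - x for negation is a standard choice but is not confirmed by the abstract, and the function names are illustrative rather than taken from the NRI code.

```python
from math import prod

def soft_and(values):
    """Product T-norm: conjunction of truth degrees in [0, 1]."""
    return prod(values)

def soft_or(values):
    """Probabilistic sum, the T-conorm dual to the product T-norm."""
    return 1.0 - prod(1.0 - v for v in values)

def soft_not(value):
    """Standard negation of a truth degree."""
    return 1.0 - value

def eval_dnf(clauses):
    """Truth degree of a DNF rule: a list of clauses, each a list of
    literal truth degrees (apply soft_not to negated literals beforehand)."""
    return soft_or(soft_and(clause) for clause in clauses)
```

On crisp inputs (0.0 or 1.0) these functions reproduce Boolean AND, OR, and NOT. On soft inputs the rule value is a polynomial in the literal truth degrees, so it is differentiable, and because the disjunction is a product over clauses its value does not depend on clause order.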

2605.00846 2026-05-29 cs.AI cs.MA Updated version

ClinicBot: A Guideline-Grounded Clinical Chatbot with Prioritized Evidence RAG and Verifiable Citations

Navapat Nananukul, Mayank Kejriwal

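Affiliations: USC Information Sciences Institute

AI summary: Proposes the ClinicBot system, which generates accurate, verifiable clinical answers through structured guideline extraction, ranking of evidence by clinical importance, and multi-agent collaboration.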

DOI: 10.1145/3786335.3813224

Abstract

Clinical diagnosis requires answers that are accurate, verifiable, and explicitly grounded in official guidelines. While large language models excel at natural language processing, their tendency to hallucinate undermines their utility in high-stakes medical contexts where precision is essential. Existing retrieval-augmented generation (RAG) systems treat all evidence equally, producing noisy context and generic answers misaligned with clinical practice. We present ClinicBot, an AI system that translates guideline recommendations into trustworthy clinical support through three key advances: (1) structured extraction of clinical guidelines into semantic units (recommendations, tables, definitions, narrative) with explicit provenance, (2) evidence prioritization that ranks content by clinical significance and guideline structure rather than textual similarity, and (3) a web-based interface that presents concise, actionable answers with verifiable evidence. We will demonstrate ClinicBot using diabetes questions from real patients and an additional diabetes risk assessment tool that is faithful to the American Diabetes Association (ADA) Standards of Care in Diabetes (2025). The demonstration will illustrate how semantic knowledge extraction and hierarchical evidence ranking can reliably operate in a multi-agent setting to process complex clinical guidelines at scale.

2604.25098 2026-05-29 cs.AI cs.CL cs.LG version update

Causal Disentanglement-Inspired Degradation Representation Learning for Full-Reference Image Quality Assessment

Zhen Zhang, Jielei Chu, Tian Zhang, Lin Ma, Fengmao Lv, Weide Liu, Tianrui Li, Yuming Fang

Affiliations: School of Computing and Artificial Intelligence, Southwest Jiaotong University; School of Transportation and Logistics, Southwest Jiaotong University; School of Physics, Northeast Normal University; School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics; School of Information Management, Jiangxi University of Finance and Economics

AI summary: Proposes a new full-reference image quality assessment paradigm based on causal inference and decoupled representation learning, estimating degradation through interventions on latent representations; it performs strongly across multiple settings and cross-domain scenarios.

Abstract

Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.

2604.20443 2026-05-29 cs.CL cs.AI cs.LG version update

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

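Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim

Affiliations: Singapore Management University; Australian National University

AI summary: Proposes DialToM, a benchmark built from naturalistic dialogues with a multiple-choice evaluation framework; it reveals a systematic reasoning asymmetry in LLMs between inferring mental states (Literal ToM) and using them for social forecasting (Functional ToM), and shows a marked capability gap between a domain expert and AI.

Comments: Submitted to EMNLP 2026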

Abstract

We introduce DialToM, an annotated Theory of Mind (ToM) benchmark built from naturalistic human-human dialogues using a multiple-choice evaluation framework. Concurrent with recent work showing a gap between explicit mental-state inference and applied ToM in synthetic settings [gu2024simpletom], we establish a stricter State-Driven Diagnostic Probe in which models must forecast state-consistent dialogue trajectories solely from isolated mental-state profiles without dialogue context. Our evaluation reveals a systematic reasoning asymmetry -- LLMs excel at inferring mental states (Literal ToM) but struggle to leverage them for social forecasting (Functional ToM). Crucially, a domain expert achieves 100% accuracy on this task, proving its validity and establishing a stark human-AI capability gap. Further, a teacher-student reasoning injection probe shows that Gemini 3 Pro -- which establishes the leading baseline -- possesses robust Functional ToM capabilities for context-free forecasting that are transferable to weaker models. DialToM, its evaluation code, and dataset are publicly available at https://github.com/Stealth-py/DialToM.

2604.18847 2026-05-29 cs.AI cs.CL version update

Human-Guided Harm Recovery for Computer Use Agents

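Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu

Affiliations: MIT CSAIL

AI summary: Addresses harm recovery after LM agents execute actions on computer systems: a user study defines preference-aligned recovery dimensions, a reward model re-ranks candidate recovery plans, and the BackBench benchmark is introduced; experiments show the approach outperforms baseline agents.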

Abstract

As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,130 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods -- ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.

2604.17176 2026-05-29 eess.SY cs.AI cs.SY math.OC version update

Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models

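Yuji Takubo, Simone D'Amico

Affiliations: Stanford University

AI summary: Proposes an intent-aligned spacecraft guidance framework that couples high-level reasoning with safe trajectory optimization through behavior sequences and waypoint constraints; in close-proximity operation scenarios it achieves over 90% SCP convergence and a 1.5x higher rate of trajectories meeting the top intent-prioritized performance criteria than heuristic decision-making.

Comments: Accepted for Computer Vision and Pattern Recognition Conference (CVPR) 2026, AI4Space Workshop (4-page Short paper). 9 pages, 3 figures (including supplementary materials)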

Abstract

Future spacecraft operations require autonomy that can interpret high-level mission intent while preserving safety. However, existing trajectory optimization still relies heavily on expert-crafted formulations and does not support intent-conditioned decision-making. This paper proposes an intent-aligned spacecraft guidance framework that links high-level reasoning and safe trajectory optimization through explicit intermediate abstractions, based on behavior sequences and waypoint constraints. A foundation model first predicts an intent-aligned behavior plan, a waypoint generation model then converts it into waypoint constraints, and the safe trajectory is computed via optimization. This decomposition enables scalable supervision without sacrificing safety. Numerical experiments in close-proximity operation scenarios demonstrate that the proposed pipeline achieves over 90% SCP convergence and yields a 1.5x higher rate of generating trajectories that satisfy the top intent-prioritized performance criteria than heuristic decision-making. These results support the use of intermediate behavior abstraction as a practical interface between foundation-model reasoning and safety-critical onboard spacecraft autonomy.

2604.11088 2026-05-29 cs.AI cs.CL version update

Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents

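Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China

AI summary: Large-scale experiments find that random rules improve coding-agent performance as much as expert rules, that every beneficial rule is a negative constraint while every harmful rule is a positive directive, and that agents should therefore be configured with constraints rather than guidance.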

Abstract

Random rules improve a coding agent's task performance as much as expert-curated ones (both +13.8 pp on a discriminative subset of SWE-bench Verified), and in our data every individually beneficial rule is a negative constraint ("do not refactor unrelated code"), while every individually harmful one is a positive directive ("follow code style"). We arrive at these findings through the first large-scale controlled study of agent rule files (CLAUDE.md, .cursorrules, and the broader family of agent skills, plugin manifests, and persona definitions): we scrape 679 rule files (25,532 rules) from GitHub and conduct over 5,000 agent runs of Claude Code with Claude Opus 4.6 on SWE-bench Verified. Three patterns emerge. (i) Rule polarity cleanly separates beneficial from harmful rules; we read this through the lens of potential-based reward shaping (PBRS). (ii) Performance gains are largely content-independent: random, shuffled, mismatched-domain, and unconverted-format rule files all match curated rules, pointing to a context priming mechanism. (iii) Individual rules often appear harmful in isolation yet do not visibly accumulate damage in ensemble: pass rates remain stable across rule counts from 0 to 50. These findings expose a hidden reliability risk in the rapidly growing ecosystem of community-authored rules and skills, and they yield a clear principle for safer agent configuration: constrain what agents must not do, rather than prescribing what they should.

2604.11080 2026-05-29 cs.CV cs.AI version update

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

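Suyoung Kim, Sunghyun Wee, Hyeonjin Kim, Kyomin Hwang, Hyunho Lee, Nojun Kwak

Affiliations: Seoul National University

AI summary: Proposes ReSpinQuant, a framework that removes the online compute overhead of layer-wise quantization methods through offline activation rotation fusion and basis matching with efficient subspace residual rotations, reaching state-of-the-art performance on W4A4 and W3A3 quantization.

Comments: ICML 2026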

Abstract

Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.

2604.06811 2026-05-29 cs.CR cs.AI version update

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng, Yanming Guo, Xiaolong Li, Kun Zhai, Yishan Li, Wenke Huang

Affiliations: Alibaba Group; National University of Defense Technology; Fudan University; Wuhan University

AI summary: Proposes SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters: malicious logic is embedded in seemingly normal skills, and skill composition reconstructs and executes an attacker-specified payload, achieving a high attack success rate while preserving benign behavior.

Abstract

Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.

2604.05157 2026-05-29 cs.AI version update

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Rongqian Chen, Yu Li, Zeyu Fang, Sizhe Tang, Weidong Cao, Tian Lan

Affiliations: George Washington University

AI summary: Proposes IntentScore, a plan-aware reward model that learns to score action quality through contrastive alignment and margin ranking, improving task success rate on OSWorld by 6.9 points.

Abstract

Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
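The margin-ranking objective and the re-ranking step mentioned in the abstract can be illustrated with a minimal sketch. It uses the standard hinge-style margin ranking loss; the margin value, the candidate actions, and their scores are hypothetical, and nothing here reproduces IntentScore's actual model or training code.

```python
def margin_ranking_loss(score_correct, score_incorrect, margin=1.0):
    """Standard hinge-style ranking loss: zero once the correct action
    outscores the incorrect one by at least `margin`."""
    return max(0.0, margin - (score_correct - score_incorrect))


def rerank(candidates, score_fn):
    """Return the candidate action with the highest score."""
    return max(candidates, key=score_fn)


# Hypothetical candidates with made-up reward-model scores.
candidates = [
    {"action": "click('Delete')", "score": -0.7},
    {"action": "click('Save')", "score": 1.4},
    {"action": "type('report.txt')", "score": 0.3},
]
best = rerank(candidates, score_fn=lambda c: c["score"])

well_separated = margin_ranking_loss(2.0, 0.5)  # correct wins by more than the margin
misranked = margin_ranking_loss(0.2, 0.5)       # incorrect action scored higher
```

In this toy setup the re-ranker simply executes the highest-scoring candidate; how scores are produced from state, action, and planning intent is the part the paper contributes and is not modeled here.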

2604.01473 2026-05-29 cs.CR cs.AI version update

SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits

Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu

Affiliations: Department of Computer Science and Engineering, University of Nevada, Reno; Oak Ridge National Laboratory

AI summary: Proposes SelfGrader, which uses anchored token-level logits to recast jailbreak detection as a numerical grading problem, giving robust detection with low latency and a low false positive rate.

Abstract

Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduce substantial latency or suffer from randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem using anchored token-level logits. Specifically, SelfGrader evaluates the safety of a user query within a compact set of numerical tokens (NTs) (e.g., 0-9) and interprets their logit distribution as an internal safety signal. To align these signals with the target safety rubric, SelfGrader constructs Probably Approximately Correct-guided ICL anchor examples and introduces a dual-perspective scoring rule that considers both the maliciousness and benignness of the query, yielding a stable and interpretable score that reflects harmfulness and reduces the false positive rate simultaneously. Extensive experiments across diverse jailbreak benchmarks, adaptive attacks, benign prompt benchmarks, multiple LLMs, and state-of-the-art guardrail baselines demonstrate that SelfGrader achieves strong robustness with low false positive rates, memory overhead, and latency.
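As a rough illustration of the grading idea, the sketch below turns logits over the ten digit tokens into a probability distribution and an expected grade. The logit values, the scale orientation (a higher digit means "more" of the property being graded), and the way the two perspectives are combined are assumptions made for illustration; the abstract does not specify SelfGrader's actual scoring rule or anchor construction.

```python
import math


def digit_distribution(logits):
    """Softmax restricted to the numerical-token logits (index = digit 0-9)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def expected_grade(logits):
    """Probability-weighted average digit, a value in [0, 9]."""
    probs = digit_distribution(logits)
    return sum(digit * p for digit, p in enumerate(probs))


def dual_score(malicious_logits, benign_logits):
    """Hypothetical dual-perspective rule: average the maliciousness grade
    with the inverted benignness grade and scale the result to [0, 1]."""
    harm = expected_grade(malicious_logits)
    benign = expected_grade(benign_logits)
    return (harm + (9 - benign)) / 18


# Made-up logits: the model leans towards "8" when asked how malicious the
# query is, and towards "1" when asked how benign it is.
malicious_logits = [0.0] * 10
malicious_logits[8] = 5.0
benign_logits = [0.0] * 10
benign_logits[1] = 5.0

score = dual_score(malicious_logits, benign_logits)
flagged = score > 0.5  # hypothetical decision threshold
```

Reading a score off a single forward pass over a fixed token set is what avoids the sampling randomness of free-text judgments; the threshold of 0.5 above is arbitrary.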

2603.27150 2026-05-29 cs.AI cs.MA version update

MediHive: A Decentralized Agent Collective for Medical Reasoning

Xiaoyang Wang, Christopher C. Yang

Affiliations: College of Computing and Informatics, Drexel University

AI summary: Proposes MediHive, a decentralized multi-agent framework in which LLM agents use a shared memory pool and iterative fusion to self-assign roles, hold conditional evidence-based debates, and fuse peer insights, reaching 84.3% accuracy on MedQA and 78.4% on PubMedQA.

Comments: Accepted by Journal of Healthcare Informatics Research

DOI: 10.1007/s41666-026-00239-7

Abstract

Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplinary problems requiring robust handling of uncertainty and conflicting evidence. Multi-agent systems (MAS) leveraging LLMs enable collaborative intelligence, but prevailing centralized architectures suffer from scalability bottlenecks, single points of failure, and role confusion in resource-constrained environments. Decentralized MAS (D-MAS) promise enhanced autonomy and resilience via peer-to-peer interactions, but their application to high-stakes healthcare domains remains underexplored. We introduce MediHive, a novel decentralized multi-agent framework for medical question answering that integrates a shared memory pool with iterative fusion mechanisms. MediHive deploys LLM-based agents that autonomously self-assign specialized roles, conduct initial analyses, detect divergences through conditional evidence-based debates, and locally fuse peer insights over multiple rounds to achieve consensus. Empirically, MediHive outperforms single-LLM and centralized baselines on MedQA and PubMedQA datasets, attaining accuracies of 84.3% and 78.4%, respectively. Our work advances scalable, fault-tolerant D-MAS for medical AI, addressing key limitations of centralized designs while demonstrating superior performance in reasoning-intensive tasks.

2603.23971 2026-05-29 cs.CL cs.AI cs.GT cs.LG cs.MA version update

The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More

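Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou

Affiliations: Stanford University; UC Berkeley; CMU; Microsoft Research

AI summary: Presents the first systematic study of the gap between listed prices and actual costs of reasoning models, finds price reversals in 32% of model-pair comparisons, and builds a Shapley-value cost attribution framework showing that highly heterogeneous thinking-token consumption and interaction-turn counts are the main causes.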

Abstract

Developers and consumers increasingly choose reasoning models (RMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RMs across 12 diverse tasks covering competition math, science QA, code generation, and multi-domain agents. We uncover the pricing reversal phenomenon: in 32% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 80% cheaper than GPT-5.4's, yet its actual cost across all tasks is 38% higher. We build a formal cost attribution framework based on Shapley value, and leverage it to trace the dominating contributors to vast heterogeneity in thinking token consumption and number of interaction turns: on the same query, one model may use 900% more thinking tokens than another, or 10x more turns of environment interactions. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Thus, we propose cost distribution prediction as an open challenge. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
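A small worked example may help show how a reversal can arise and how a Shapley-style attribution splits the cost gap between factors. The cost model (price times tokens per turn times turns), the two models, and all numbers are hypothetical; the paper's actual attribution framework is not described in enough detail here to reproduce.

```python
from itertools import permutations


def total_cost(price_per_mtok, thinking_tokens, turns):
    """Task cost in USD: price per million tokens x tokens per turn x turns."""
    return price_per_mtok * thinking_tokens * turns / 1_000_000


# Hypothetical models: B's listed price is 80% lower than A's.
model_a = {"price_per_mtok": 10.0, "thinking_tokens": 2_000, "turns": 3}
model_b = {"price_per_mtok": 2.0, "thinking_tokens": 20_000, "turns": 6}


def shapley_attribution(base, other):
    """Average each factor's marginal effect on cost over all orders in
    which the factors are switched from `base` values to `other` values."""
    factors = list(base)
    contribution = {f: 0.0 for f in factors}
    orders = list(permutations(factors))
    for order in orders:
        current = dict(base)
        previous_cost = total_cost(**current)
        for factor in order:
            current[factor] = other[factor]
            new_cost = total_cost(**current)
            contribution[factor] += (new_cost - previous_cost) / len(orders)
            previous_cost = new_cost
    return contribution


cost_a = total_cost(**model_a)
cost_b = total_cost(**model_b)
attribution = shapley_attribution(model_a, model_b)
```

With these made-up numbers the cheaper-listed model B ends up costing more per task than A. The attribution assigns a negative share to the listed price and positive shares to token consumption and turn count, and the three shares sum to the total cost gap, which is the efficiency property that makes Shapley values convenient for this kind of decomposition.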

2603.18859 2026-05-29 cs.AI cs.CL cs.LG version update

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng

Affiliations: TMLR Group; Hong Kong Baptist University; TCL Corporate Research (HK) Co Ltd; Cooperative Medianet Innovation Center, Shanghai Jiao Tong University; Department of Mathematics, Hong Kong Baptist University

AI summary: Proposes RewardFlow, which builds state graphs and performs topology-aware reward propagation to give agentic reasoning annotation-free dense rewards, substantially improving reinforcement learning performance.

Abstract

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization. Process reward modeling offers an alternative but incurs high computational costs, reward hacking risks, and annotation bottlenecks. We introduce RewardFlow, a lightweight method for estimating state-level rewards in agentic reasoning. By constructing state graphs that capture the intrinsic topological structure of trajectories, RewardFlow performs topology-aware propagation to estimate each state's contribution to success, yielding principled, annotation-free dense rewards. Used for RL optimization, RewardFlow substantially outperforms prior baselines across four agentic benchmarks: +6.2% average success rate on text-based tasks, +29.7% on visual reasoning over the strongest baseline across three model scales, and +10% accuracy on DeepResearch, with superior robustness and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
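To make the idea of propagating a sparse terminal reward over a state graph concrete, here is a minimal sketch. It assumes a simple rule, namely that a state's reward decays geometrically with its shortest hop distance to a successful terminal state; the graph, the state names, and the decay factor are hypothetical, and RewardFlow's actual propagation scheme may differ.

```python
from collections import deque


def propagate_rewards(edges, success_states, decay=0.9):
    """Backward breadth-first search from successful terminal states.
    Each reachable state gets decay ** (hops to the nearest success)."""
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(dst, set()).add(src)

    rewards = {state: 1.0 for state in success_states}
    queue = deque(success_states)
    while queue:
        state = queue.popleft()
        for prev in predecessors.get(state, ()):
            if prev not in rewards:  # first visit is the shortest path
                rewards[prev] = rewards[state] * decay
                queue.append(prev)
    return rewards


# Hypothetical state graph merged from several trajectories.
edges = [
    ("s0", "s1"), ("s1", "s2"), ("s2", "goal"),
    ("s0", "s3"), ("s1", "s3"), ("s3", "dead_end"),
]
rewards = propagate_rewards(edges, success_states=["goal"])
dense = {s: rewards.get(s, 0.0) for s in ["s0", "s1", "s2", "s3", "dead_end"]}
```

States that cannot reach a success state receive no propagated reward, so every step of a trajectory gets a signal without any per-step annotation.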

2603.16673 2026-05-29 cs.RO cs.AI cs.LG version update

Tracking Capabilities for Safer Agents

Martin Odersky, Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham

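Affiliations: EPFL

AI summary: Proposes statically tracking capabilities with the capture-checking type system of Scala 3 to build a programming-language-based safety harness for agents that prevents information leakage and malicious side effects.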

DOI: 10.1145/3786335.3813127

Abstract

AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose to put the agent in a programming-language-based "safety harness": instead of calling tools directly, agents express their intentions as code in a capability-safe language: Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala's type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, preventing information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be built by leveraging a strong type system with tracked capabilities. Our experiments show that agents can generate capability-safe code with no significant loss in task performance, while the type system reliably prevents unsafe behaviors such as information leakage and malicious side effects.

2603.00454 2026-05-29 cs.LG cs.AI version update

Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training

Xi Wang, Wenbo Lu, Shengjie Wang

Affiliations: Courant Institute School of Mathematics, Computing, and Data Science, New York University

AI summary: To address mode collapse in GFlowNets, proposes the RapTB objective (dense prefix-level learning signals from root-anchored subtrajectory supervision and absorbed suffix backups) and the SubM submodular replay strategy (promoting both high reward and diversity), improving optimization performance and diversity on tasks such as molecule generation.

Abstract

Generative Flow Networks (GFlowNets) enable fine-tuning large language models to approximate reward-proportional posteriors, but they remain prone to mode collapse, manifesting as prefix collapse and length bias. We attribute this to two factors: (i) weak credit assignment to early prefixes, and (ii) biased replay that induces a shifted, non-representative training flow distribution. We propose Rooted absorbed prefix Trajectory Balance (RapTB), an objective that anchors subtrajectory supervision at the root and propagates terminal rewards to intermediate prefixes via absorbed suffix-based backups, providing dense prefix-level learning signals. To mitigate replay-induced distribution shift, we further introduce SubM, a submodular replay refresh strategy that promotes both high reward and diversity. Empirically, on tasks such as molecule generation with LLMs using SMILES strings, RapTB combined with SubM consistently improves optimization performance and molecular diversity while preserving high validity.

2602.12642 2026-05-29 cs.CL cs.AI version update

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng

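Affiliations: University of Illinois Urbana-Champaign; New York University (Shanghai)

AI summary: To address the mismatch between the offline SFT data distribution and the online RL policy distribution in current SFT-RL pipelines, proposes PEAR, a policy-evaluation-inspired loss re-weighting method for offline learning that reweights the SFT loss via importance sampling and improves subsequent RL training.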

Abstract

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen 2.5 and 3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6% on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
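The reweighting idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the weight is taken to be the ratio of the current policy's probability to the data-generating (behavior) probability, clipped for stability, and the log-probabilities are made up. PEAR's exact weight definition, its block-level variant, and any normalization are not given in the abstract and are not reproduced here.

```python
import math


def token_weights(policy_logps, behavior_logps, clip=5.0):
    """Token-level importance ratios pi_policy / pi_behavior, clipped."""
    return [min(math.exp(p - b), clip)
            for p, b in zip(policy_logps, behavior_logps)]


def sequence_weight(policy_logps, behavior_logps, clip=5.0):
    """Sequence-level ratio: product of token ratios, computed in log space."""
    return min(math.exp(sum(policy_logps) - sum(behavior_logps)), clip)


def reweighted_sft_loss(policy_logps, weights):
    """Weighted negative log-likelihood averaged over tokens. The weights
    are treated as constants (no gradient would flow through them)."""
    weighted = [w * lp for w, lp in zip(weights, policy_logps)]
    return -sum(weighted) / len(policy_logps)


# Hypothetical per-token log-probabilities of one offline SFT sequence.
policy_logps = [-0.2, -1.5, -0.1]    # under the model being trained
behavior_logps = [-0.3, -0.4, -0.1]  # under the model that generated the data

w_tok = token_weights(policy_logps, behavior_logps)
loss_tok = reweighted_sft_loss(policy_logps, w_tok)

w_seq = sequence_weight(policy_logps, behavior_logps)
loss_seq = reweighted_sft_loss(policy_logps, [w_seq] * len(policy_logps))
```

In this toy example the second token, which the current policy finds much less likely than the behavior model did, is down-weighted, while the sequence-level variant applies one shared weight to every token of the sequence.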

2602.00994 2026-05-29 cs.AI 版本更新

Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

推理与工具使用在智能体强化学习中的竞争：从量化干扰到解耦调优

Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, Tieying Zhang

发表机构 * School of Information, Renmin University of China（中国人民大学信息学院）； Bytedance Inc.（字节跳动公司）

AI总结本文通过引入能力效应归因（CEA）量化推理与工具使用行为之间的干扰，并提出解耦动作-推理调优（DART）框架，通过分离参数更新来提升智能体强化学习的性能。

AI中文摘要

智能体强化学习（ARL）训练大型语言模型将推理与外部工具执行交错进行，以解决复杂任务。大多数现有ARL方法训练一组参数来支持推理和工具使用行为，隐含假设联合训练能提升整体智能体性能。尽管被广泛采用，这一假设很少得到实证检验。本文通过引入能力效应归因（CEA）系统性地检验这一假设，提供了推理与工具使用行为之间干扰的定量证据。通过深入分析，我们表明这两种能力常常导致不一致的梯度方向，产生训练干扰，削弱联合优化的有效性，并挑战了主流的ARL范式。为解决此问题，我们提出解耦动作-推理调优（DART），一个简单高效的框架，通过独立的低秩适应模块显式解耦推理和工具使用的参数更新。仅凭这一简单改变，DART在检索增强问答和NL2SQL的十三个基准上超越了所有联合优化基线，并接近2-智能体上界，进一步支持了我们在共享优化下能力干扰的发现。

英文摘要

Agentic Reinforcement Learning (ARL) trains large language models to interleave reasoning with external tool execution to solve complex tasks. Most existing ARL methods train a single set of parameters to support both reasoning and tool-use behaviors, implicitly assuming that joint training leads to improved overall agent performance. Despite its widespread adoption, this assumption has rarely been examined empirically. In this paper, we systematically examine this assumption by introducing Capability Effect Attribution (CEA), which provides quantitative evidence of interference between reasoning and tool-use behaviors. Through an in-depth analysis, we show that these two capabilities often induce misaligned gradient directions, leading to training interference that undermines the effectiveness of joint optimization and challenges the prevailing ARL paradigm. To address this issue, we propose Disentangled Action-Reasoning Tuning (DART), a simple and efficient framework that explicitly decouples parameter updates for reasoning and tool use via separate low-rank adaptation modules. With this simple change alone, DART outperforms all joint-optimization baselines and approaches the 2-Agent upper bound across thirteen benchmarks on retrieval-augmented QA and NL2SQL, further supporting our finding of capability interference under shared optimization.
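The notion of "misaligned gradient directions" can be made concrete with a cosine-similarity check between two gradient vectors. The sketch below is a generic diagnostic, not the CEA procedure from the paper (whose details the abstract does not give); the function name and the flat-list representation of gradients are assumptions.

```python
import math

def cosine_similarity(grad_a, grad_b):
    """Cosine of the angle between two flattened gradient vectors.

    Values near +1 mean the two objectives pull parameters the same way,
    values near -1 mean they pull in opposite directions (interference).
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # direction is undefined for a zero gradient; report neutral
    return dot / (norm_a * norm_b)
```

For example, comparing the gradient of a loss restricted to reasoning tokens with the gradient of a loss restricted to tool-call tokens, on the same shared parameters, would give one number per training step that can be tracked over time.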

2601.22531 2026-05-29 cs.LG cs.AI 版本更新

Learn from A Rationalist: Distilling Intermediate Interpretable Rationales

向理性主义者学习：蒸馏中间可解释原理

Jiayi Dai, Randy Goebel

发表机构 * Department of Computing Science, University of Alberta, Edmonton, Canada（阿尔伯塔大学计算机科学系，加拿大埃德蒙顿）； Alberta Machine Intelligence Institute, Edmonton, Canada（阿尔伯塔机器智能研究所，加拿大埃德蒙顿）

AI总结提出REKD方法，通过知识蒸馏将教师模型的可解释原理和预测传授给学生模型，提升基于较弱神经网络的可解释原理提取模型的预测性能。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

由于深度神经网络（DNN）的广泛使用，尤其是在高风险领域，DNN的可解释性受到了越来越多的关注。原理提取（RE）的总体思想是通过选择-预测架构为DNN提供一个可解释的设计框架，其中两个神经网络分别联合学习进行特征选择和预测。仅依赖于最终任务预测的远程监督，学习选择特征子集（或原理）的过程需要在所有可能的特征组合空间中进行搜索，这在计算上具有挑战性，当基础神经网络能力不足时甚至更加困难。为了提高基于能力较弱或较小神经网络（即学生）的RE模型的预测性能，我们提出了REKD（基于知识蒸馏的原理提取），其中学生RE模型除了自身的RE优化外，还从教师（即理性主义者）的原理和预测中学习。这种对RE的结构调整与人类如何从可解释和可验证的知识中有效学习的方式高度一致。由于该方法与神经模型无关，任何黑盒神经网络都可以作为骨干模型集成。为了证明REKD的可行性，我们使用BERT和视觉变换器（ViT）模型的多种变体进行了实验。我们在语言和视觉分类数据集（即IMDB电影评论、CIFAR 10和CIFAR 100）上的实验表明，REKD显著提高了学生RE模型的预测性能。

英文摘要

Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received increased attention. The general idea of rationale extraction (RE) is to provide an interpretable-by-design framework for DNNs via a select-predict architecture where two neural networks learn jointly to perform feature selection and prediction, respectively. Given only the remote supervision from the final task prediction, the process of learning to select subsets of features (or rationales) requires searching in the space of all possible feature combinations, which is computationally challenging and even harder when the base neural networks are not sufficiently capable. To improve the predictive performance of RE models that are based on less capable or smaller neural networks (i.e., the students), we propose REKD (Rationale Extraction with Knowledge Distillation) where a student RE model learns from the rationales and predictions of a teacher (i.e., a rationalist) in addition to the student's own RE optimization. This structural adjustment to RE aligns well with how humans could learn effectively from interpretable and verifiable knowledge. Because of the neural-model agnostic nature of the method, any black-box neural network could be integrated as a backbone model. To demonstrate the viability of REKD, we conduct experiments with multiple variants of BERT and vision transformer (ViT) models. Our experiments across language and vision classification datasets (i.e., IMDB movie reviews, CIFAR 10 and CIFAR 100) show that REKD significantly improves the predictive performance of the student RE models.

2601.22347 2026-05-29 cs.LG cs.AI 版本更新

从评分标准到可靠分数：基于证据的文本评估与LLM裁判

Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong

发表机构 * Washington University in St. Louis（华盛顿大学圣路易斯分校）； Arizona State University（亚利桑那州立大学）； Florida State University（佛罗里达州立大学）

AI总结提出Rulers框架，通过三阶段推理（任务规范、结构化执行、事后校准）解决LLM在基于评分标准的文本评估中的执行漂移、归因不可验证和人类尺度错位问题，实现更可靠的评分。

AI中文摘要

基于评分标准的文本评估越来越多地使用大型语言模型（LLM）作为可扩展的裁判，但将冻结的黑盒模型与人类评分标准对齐仍然具有挑战性。我们将这一挑战表述为一个标准迁移问题：目标不仅仅是提示LLM分配分数，而是将人类评分标准意图转移到一个稳定、可审计且与人类对齐的评分协议中。我们识别了基于LLM的评分标准评估中三种反复出现的失败模式：评分标准执行漂移、不可验证的分数归因和人类尺度错位。为了解决这些失败模式，我们引入了Rulers，一个三阶段推理时框架，用于可靠、基于证据的评分标准文本评估。Rulers首先将人类评分标准转换为锁定的任务级规范，然后通过结构化检查表决策、类型化证据基础以及在适用时进行可提取引用验证来执行该规范，最后应用事后校准以将模型衍生的信号与人类分数边界对齐。在涵盖论文评分、摘要评估、EFL写作评估和结构化输入文本生成的四个基于评分标准的基准测试中，Rulers在多个冻结骨干模型的大多数评估设置中实现了更强的人类分数一致性。进一步分析表明，Rulers更好地匹配了经验人类分数分布，提高了在语义等价评分标准扰动下的稳定性，并受益于其三个组成部分。这些结果表明，可靠的LLM评判需要固定标准、可追溯证据和校准的分数解释，而不仅仅是提示措辞。我们的代码可在 https://anonymous.4open.science/r/Rulers_0525-3328 获取。

英文摘要

Rubric-based text evaluation increasingly uses large language models (LLMs) as scalable judges, but aligning frozen black-box models with human scoring standards remains challenging. We formulate this challenge as a criteria-transfer problem: the goal is not merely to prompt an LLM to assign a score, but to transfer human rubric intent into a stable, auditable, and human-aligned scoring protocol. We identify three recurring failure modes in LLM-based rubric scoring: rubric execution drift, unverifiable score attribution, and human-scale misalignment. To address these failure modes, we introduce Rulers, a three-stage inference-time framework for reliable, evidence-grounded rubric-based text evaluation. Rulers first converts a human rubric into a locked task-level specification, then executes the specification with structured checklist decisions, typed evidence grounding, and extractive quote verification when applicable, and finally applies post-hoc calibration to align model-derived signals with human score boundaries. Across four rubric-governed benchmarks covering essay scoring, summarization assessment, EFL writing evaluation, and structured-input text generation, Rulers achieves stronger human-score agreement in most evaluated settings across multiple frozen backbone models. Further analyses show that Rulers better matches empirical human score distributions, improves stability under semantically equivalent rubric perturbations, and benefits from each of its three components. These results suggest that reliable LLM judging requires fixed criteria, traceable evidence, and calibrated score interpretation rather than prompt phrasing alone. Our code is available at https://anonymous.4open.science/r/Rulers_0525-3328.

2601.06431 2026-05-29 cs.AI 版本更新

LsrIF: Enhancing Logic-Structured Instruction Following of Large Language Models

LsrIF: 增强大语言模型的逻辑结构化指令遵循能力

Qingyu Ren, Qianyu He, Jingwen Chang, Geng Zhang, Jiajie Zhu, Xingzhou Chen, Zhuofei Shi, Jiaqing Liang, Yanghua Xiao, Han Xia, Zeye Sun, Fei Yu

发表机构 * Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University（上海市数据科学重点实验室，复旦大学计算机科学与人工智能学院）； School of Data Science, Fudan University（复旦大学数据科学学院）； Ant Group（蚂蚁集团）

AI总结提出LsrIF框架，通过构建并行、顺序、条件和嵌套结构的原子约束数据，并采用结构感知的奖励聚合方法，提升大语言模型在逻辑结构化指令遵循任务中的表现。

AI中文摘要

指令遵循对于大语言模型至关重要，然而现实世界中的指令通常涉及具有逻辑结构的多个约束，例如并行组合、顺序依赖和条件分支。现有方法通常通过简单组合约束来构建数据，并在训练过程中通过平均各个约束分数来聚合奖励，忽略了逻辑依赖关系并引入了噪声信号。我们提出LsrIF，一个用于逻辑结构化指令遵循的训练框架。LsrIF通过将原子约束组织成并行、顺序、条件和嵌套结构来构建数据，并应用与其执行语义一致的结构感知奖励聚合：对并行约束取平均奖励，在顺序结构中早期失败后衰减后续奖励，在条件结构中仅奖励活跃分支。实验表明，LsrIF在领域内和领域外设置中均提升了指令遵循能力，同时也有利于逻辑推理。进一步分析表明，逻辑结构化训练增加了对约束相关词元和逻辑连接词的注意力，表明模型对指令逻辑的建模得到改善。我们将发布我们的数据和代码以供未来研究。

英文摘要

Instruction following is critical for large language models, yet real-world instructions often involve multiple constraints with logical structures, such as parallel composition, sequential dependencies, and conditional branching. Existing methods typically construct data by simply combining constraints and aggregate rewards by averaging individual constraint scores during training, overlooking logical dependencies and introducing noisy signals. We propose LsrIF, a training framework for logic-structured instruction following. LsrIF constructs data by organizing atomic constraints into parallel, sequential, conditional, and nested structures, and applies structure-aware reward aggregation aligned with their execution semantics: averaging rewards for parallel constraints, decaying later rewards after early failures in sequential structures, and rewarding only active branches in conditional structures. Experiments show that LsrIF improves instruction following in both in-domain and out-of-domain settings while also benefiting logic reasoning. Further analysis indicates that logic-structured training increases attention to constraint-related tokens and logical connectors, suggesting improved modeling of instruction logic. We will release our data and code for future research.
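The three aggregation rules named in the abstract (average for parallel constraints, decay after early failures for sequential ones, active branch only for conditionals) can be sketched as follows. The abstract gives no formulas, so the `decay` factor, the `pass_threshold`, the normalization by the number of constraints, and all function names are hypothetical choices for illustration, not the LsrIF implementation.

```python
def parallel_reward(scores):
    """Parallel constraints are independent: take the plain average."""
    return sum(scores) / len(scores)

def sequential_reward(scores, decay=0.5, pass_threshold=1.0):
    """Sequential constraints: once a step fails, later steps count for less.

    Each failure (score below pass_threshold) multiplies the weight of all
    subsequent steps by `decay` (an assumed scheme).
    """
    factor = 1.0
    total = 0.0
    for score in scores:
        total += factor * score
        if score < pass_threshold:
            factor *= decay
    return total / len(scores)

def conditional_reward(condition_holds, then_score, else_score):
    """Conditional structure: only the branch that is actually active is rewarded."""
    return then_score if condition_holds else else_score
```

Nested structures could be handled by feeding the output of one function in as a score to another, for example a sequential chain whose second step is itself a parallel group.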

2601.01162 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

弥合分类数据聚类的语义鸿沟：基于大语言模型的方法

Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung

发表机构 * School of Computer Science and Technology（计算机科学与技术学院）； Guangdong University of Technology（广东工业大学）； Department of Computer Science（计算机科学系）； Hong Kong Baptist University（香港浸会大学）

AI总结提出BREVE框架，利用外部知识库的语义嵌入丰富分类属性值，并引入自适应权重平衡原始标识与语义信息，在八个基准数据集上平均ARI排名达1.3。

Comments Accepted to ICPR2027

AI中文摘要

定性数据广泛存在于医疗、营销和生物信息学等领域，聚类是其中模式发现的基本工具。定性数据聚类的核心困难在于度量属性值之间的相似性，这些属性值没有固有的顺序或距离。为了恢复这种关系，现有研究通常依赖于数据集内的共现统计。然而，当样本量较小时，这种统计路径变得不可靠，每个值的语义上下文因此未被充分利用。受此限制，本文提出BREVE（通过外部值丰富实现平衡表示），一种聚类框架，通过从外部知识库中提取额外的语义维度来丰富每个定性值。即，每个唯一值被扩展为一个密集嵌入，编码其语义内容。为了防止原始值身份被添加的维度稀释，进一步附加一个轻量级的独热编码组件。然后，由聚类紧致性引导的自适应权重决定富集维度进入最终表示的强度。通过这种设计，在八个基准数据集上的实验表明，与七个代表性竞争者相比，平均ARI排名为1.3。

英文摘要

Qualitative data are widespread in domains such as healthcare, marketing, and bioinformatics, where clustering offers a fundamental tool for pattern discovery. A core difficulty of qualitative-data clustering lies in measuring similarity among attribute values that carry no inherent ordering or distance. To recover such relationships, existing studies typically rely on within-dataset co-occurrence statistics. This statistical route, however, becomes unreliable once the sample size is small, and the semantic context of each value is therefore left underexploited. Motivated by this limitation, this paper proposes BREVE (Balanced Representation via External Value Enrichment), a clustering framework that enriches each qualitative value with extra semantic dimensions drawn from an external knowledge base. That is, every unique value is expanded by a dense embedding that encodes its semantic content. To prevent the original value identity from being diluted by the added dimensions, a lightweight one-hot component is further appended. An adaptive weight, guided by cluster compactness, then determines how strongly the enrichment dimensions enter the final representation. With this design, experiments on eight benchmark datasets yield an average ARI rank of 1.3 against seven representative competitors.

2512.15374 2026-05-29 cs.AI 版本更新

SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

SCOPE: 通过提示进化增强智能体效能

Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Huawei Technologies Co., Ltd（华为技术有限公司）

AI总结针对LLM智能体静态提示无法有效管理动态上下文导致失败的问题，提出SCOPE框架，将上下文管理建模为在线优化问题，通过双流记忆机制和视角驱动探索自动进化提示，在HLE基准上将任务成功率从14.23%提升至38.64%。

AI中文摘要

大型语言模型（LLM）智能体越来越多地部署在生成大规模动态上下文的环境中。然而，一个关键瓶颈仍然存在：虽然智能体可以访问这些上下文，但其静态提示缺乏有效管理上下文的机制，导致反复出现纠正性和增强性失败。为了解决这一能力差距，我们引入了通过提示进化实现自进化上下文优化（SCOPE）。SCOPE将上下文管理视为一个在线优化问题，从执行轨迹中综合指导方针，自动进化智能体的提示。我们提出了一种双流机制，在战术记忆（即时错误纠正）和战略记忆之间路由指导方针，后者通过冲突解决、包含剪枝和合并不断优化。为了最大化策略覆盖范围，视角驱动探索进化多个并行提示，由不同的优化视角引导。在HLE基准上的实验表明，SCOPE在没有人工干预的情况下将任务成功率从14.23%提高到38.64%。我们在 https://github.com/JarvisPei/SCOPE 公开了代码。

英文摘要

Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the mechanisms to manage it effectively, leading to recurring Corrective and Enhancement failures. To address this capability gap, we introduce Self-evolving Context Optimization via Prompt Evolution (SCOPE). SCOPE frames context management as an online optimization problem, synthesizing guidelines from execution traces to automatically evolve the agent's prompt. We propose a Dual-Stream mechanism that routes guidelines between tactical memory (immediate error correction) and strategic memory, which is continuously refined through conflict resolution, subsumption pruning, and consolidation. To maximize strategy coverage, Perspective-Driven Exploration evolves multiple parallel prompts guided by distinct optimization perspectives. Experiments on the HLE benchmark show that SCOPE improves task success rates from 14.23% to 38.64% without human intervention. We make our code publicly available at https://github.com/JarvisPei/SCOPE.

2512.10388 2026-05-29 cs.IR cs.AI 版本更新

The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation

两全其美：为序列推荐协调语义ID和哈希ID

Ziwei Liu, Yejing Wang, Wanyu Wang, Wang Zejian, Qidong Liu, Zijian Zhang, Chong Chen, Wei Huang, Xiangyu Zhao

发表机构 * City University of Hong Kong（香港城市大学）； Xi'an Jiaotong University（西安交通大学）； Jilin University（吉林大学）； Independent Researcher（独立研究者）； Tsinghua University（清华大学）

AI总结针对序列推荐中头部和尾部物品性能权衡问题，提出H2Rec框架，通过双分支架构协调语义ID和哈希ID，并采用双级对齐策略实现知识迁移，在公开基准和商业平台上取得更好平衡。

AI中文摘要

传统的序列推荐系统通常分配唯一的哈希ID（HID）来构建物品嵌入，主要从历史用户-物品交互中捕获协同信号。然而，在大多数物品很少被消费的长尾场景中，这种嵌入是脆弱的。最近结合辅助信息的方法常常面临来自共现信号的噪声协同共享或由平坦密集嵌入导致的语义同质性问题。相比之下，语义ID（SID）因其支持代码共享和多粒度语义建模，提供了一种有前景的替代方案。然而，基于SID的方法受到协同压倒现象的阻碍：常用的量化机制损害了建模头部物品所需的标识符唯一性，导致头部和尾部物品之间的性能权衡。为了解决这一挑战，我们提出了H2Rec，一种协调SID和HID的新框架。我们设计了一个双分支建模架构，同时捕获SID的多粒度语义，同时保留HID提供的唯一协同身份。此外，我们引入了一种双级对齐策略来桥接两种表示，实现有效的知识迁移和鲁棒的偏好建模。在三个公开基准上的大量离线实验和在大规模商业平台上的在线实验表明，H2Rec在头部和尾部推荐质量之间实现了更好的平衡，并且持续优于现有基线。

英文摘要

Conventional Sequential Recommender Systems (SRS) typically assign unique hash IDs (HID) to construct item embeddings, which mainly capture collaborative signals from historical user-item interactions. However, such embeddings are vulnerable in long-tail scenarios where most items are rarely consumed. Recent methods that incorporate auxiliary information often face noisy collaborative sharing from co-occurrence signals or semantic homogeneity caused by flat dense embeddings. In contrast, Semantic IDs (SID), with their support for code sharing and multi-granular semantic modeling, offer a promising alternative. Nevertheless, SID-based methods are hindered by a collaborative overwhelming phenomenon: commonly adopted quantization mechanisms compromise the identifier uniqueness needed to model head items, resulting in a performance trade-off between head and tail items. To address this challenge, we propose H2Rec, a novel framework that harmonizes SID and HID. We design a dual-branch modeling architecture that simultaneously captures the multi-granular semantics of SID while preserving the unique collaborative identity provided by HID. Moreover, we introduce a dual-level alignment strategy to bridge the two representations, enabling effective knowledge transfer and robust preference modeling. Extensive offline experiments on three public benchmarks and online experiments on a large-scale commercial platform demonstrate that H2Rec achieves a better balance between head and tail recommendation quality and consistently outperforms existing baselines.

2512.01863 2026-05-29 cond-mat.mes-hall cond-mat.str-el cs.AI 版本更新

Topological Order in Neural Wavefunctions

神经波函数中的拓扑序

Ahmed Abouelkomsan, Max Geier, Liang Fu

发表机构 * Department of Physics, Massachusetts Institute of Technology, Cambridge, MA-02139, USA（麻省理工学院物理系）

AI总结本文利用基于注意力的深度神经网络变分波函数，通过能量最小化发现分数量子霍尔效应基态，并引入一种从单一实空间波函数提取拓扑简并度的方法，展示了神经网络变分蒙特卡洛在强关联拓扑相研究中的潜力。

Comments Published version

DOI: 10.1103/bzq1-123h
Journal ref: Phys. Rev. B 113, 205119 (2026)

AI中文摘要

拓扑有序态是最有趣的量子物质相之一，它们承载具有分数电荷并服从分数量子统计的涌现准粒子。然而，由于这些态具有强耦合性质，传统的平均场处理难以奏效，因此其理论研究颇具挑战。在这里，我们证明基于注意力的深度神经网络提供了一个富有表现力的变分波函数，它仅通过能量最小化就能在无先验知识的情况下发现分数量子陈绝缘体基态，并达到了显著的精度。我们引入了一种高效的方法，通过将平移不变系统中的单一优化实空间波函数分解为不同的多体动量扇区，从中提取基态拓扑简并度——这是拓扑序的标志。我们的结果确立了神经网络变分蒙特卡洛作为发现强关联拓扑相的多功能工具的地位。

英文摘要

Topologically ordered states are among the most interesting quantum phases of matter that host emergent quasi-particles having fractional charge and obeying fractional quantum statistics. Theoretical study of such states is however challenging owing to their strong-coupling nature that prevents conventional mean-field treatment. Here, we demonstrate that an attention-based deep neural network provides an expressive variational wavefunction that discovers fractional Chern insulator ground states purely through energy minimization without prior knowledge and achieves remarkable accuracy. We introduce an efficient method to extract ground state topological degeneracy, a hallmark of topological order, from a single optimized real-space wavefunction in translation-invariant systems by decomposing it into different many-body momentum sectors. Our results establish neural network variational Monte Carlo as a versatile tool for discovering strongly correlated topological phases.

2512.00283 2026-05-29 cs.LG cs.AI q-bio.QM 版本更新

BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models

BioArc：发现生物学基础模型的最优神经架构

Yi Fang, Haoran Xu, Jiaxin Han, Sirui Ding, Yizhi Wang, Yue Wang, Xuan Wang

发表机构 * Department of Computer Science, Virginia Tech, Blacksburg, VA, USA（弗吉尼亚理工学院计算机科学系）； Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, USA（弗吉尼亚理工学院电气与计算机工程系）； Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA（卡内基梅隆大学计算机科学系）； Department of Biomedical Data Science, Stanford University, Stanford, CA, USA（斯坦福大学生物医学数据科学系）

AI总结针对现有基础模型架构直接迁移至生物学领域时忽视生物数据独特性质的问题，提出BioArc框架，利用神经架构搜索系统探索架构设计空间，发现高性能架构并提炼设计原则，同时提出架构预测方法以高效预测新任务的最优架构。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

基础模型已彻底改变了自然语言处理（NLP）和计算机视觉（CV）等多个领域。尽管已有努力将通用AI领域中基础模型的成功迁移到生物学，但现有工作主要直接采用来自通用机器学习领域的现有基础模型架构，而未考虑每种生物数据模态独特的物理化学和结构特性进行系统设计。这导致性能欠佳，因为这些改造后的架构难以捕捉生物数据固有的长程依赖、稀疏信息和复杂的底层“语法”。为解决这一差距，我们引入了BioArc，这是一个新颖的框架，旨在超越直觉驱动的架构设计，转向生物学基础模型的原理性、自动化架构发现。利用神经架构搜索（NAS），BioArc系统性地探索了广阔的架构设计空间，跨多种生物模态评估架构，同时严格分析架构、分词和训练策略之间的相互作用。这一大规模分析识别出新颖的高性能架构，使我们能够提炼出一套经验性设计原则，以指导未来的模型开发。此外，为充分利用这套发现的原理性架构，我们提出并比较了几种架构预测方法，这些方法能够有效且高效地预测新生物学任务的最优架构。总体而言，我们的工作为基础资源和原理性方法论提供了基础，以指导下一代生物学任务特定模型和基础模型的创建。

英文摘要

Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer the success of the foundation models in general AI domains to biology, existing works focus on directly adopting the existing foundation model architectures from general machine learning domains without a systematic design considering the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying "grammars" inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies. This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best of this set of discovered principled architectures, we propose and compare several architecture prediction methods that effectively and efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.

2511.22884 2026-05-29 cs.AI 版本更新

一种面向CNN的基于LRP剪枝的精度感知扩展，以防止数据稀缺迁移学习中的级联精度下降

Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato

发表机构 * Mathematics and Computer Science National Defense Academy of Japan（日本防卫大学校数学与计算机科学系）

AI总结针对数据稀缺迁移学习中预训练CNN剪枝导致的级联精度下降问题，提出一种精度感知的剪枝控制机制，通过动态调整剪枝率和顺序来抑制精度下降，提升模型压缩后的分类性能。

Comments Accepted to scientific reports. The title was revised during the peer review process

DOI: 10.1038/s41598-026-47992-8

AI中文摘要

在大规模数据集（如ImageNet）上预训练的卷积神经网络（CNN）被广泛用作特征提取器，从稀缺数据中构建特定任务的高精度分类模型。在此类场景中，由于数据稀缺，微调预训练CNN变得困难，因此必须使用固定权重。然而，当权重固定时，许多对目标任务无贡献的滤波器仍保留在模型中，导致不必要的冗余和效率降低。因此，需要有效的方法通过剪枝对推理不必要的滤波器来减小模型大小。为此，已有研究提出了利用逐层相关性传播（LRP）的方法。LRP量化每个滤波器对推理结果的贡献，从而可以剪枝低相关性的滤波器。然而，现有基于LRP的剪枝方法被观察到会导致级联精度下降。在本研究中，我们为现有基于LRP的滤波器剪枝方法引入了一种精度感知的剪枝控制机制，该机制通过使用类别精度的调和平均数动态调整剪枝率和剪枝顺序，抑制级联精度下降，并在小数据环境下压缩预训练模型的同时保持任务特定性能。我们证明，该控制机制有效缓解了级联精度下降，与现有基于LRP的剪枝方法相比，实现了更高的分类精度，将VGG16的精度-剪枝率曲线下的类别平均面积（AUC）比传统基于LRP的方法提高了约15%。

英文摘要

Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construct high-accuracy classification models from scarce data for specific tasks. In such scenarios, fine-tuning the pre-trained CNN is difficult due to data scarcity, necessitating the use of fixed weights. However, when the weights are kept fixed, many filters that do not contribute to the target task remain in the model, leading to unnecessary redundancy and reduced efficiency. Therefore, effective methods are needed to reduce model size by pruning filters that are unnecessary for inference. To address this, approaches utilizing Layer-wise Relevance Propagation (LRP) have been proposed. LRP quantifies the contribution of each filter to the inference result, enabling the pruning of filters with low relevance. However, existing LRP-based pruning methods have been observed to cause cascading accuracy degradation. In this study, we introduce an accuracy-aware pruning control mechanism for existing LRP-based filter pruning methods, which suppresses cascading accuracy degradation by dynamically adjusting the pruning rate and the pruning order using the harmonic mean of class accuracy, and compresses the pre-trained model while preserving task-specific performance in a small-data environment. We demonstrate that this control mechanism effectively mitigates cascading accuracy degradation and achieves higher classification accuracy compared to existing LRP-based pruning methods, improving the class-averaged area under the accuracy-pruning-rate curve (AUC) of VGG16 by approximately 15% over conventional LRP-based approaches.

2510.26412 2026-05-29 cs.CV cs.AI 版本更新

LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation

LoCoT2V-Bench: 长文本与复杂文本到视频生成的基准测试

Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China（哈尔滨工业大学深圳研究院）； The University of Hong Kong（香港大学）

AI总结针对长视频生成在复杂文本输入下的评估挑战，提出包含多场景提示与层次元数据的基准LoCoT2V-Bench，并设计多维度评估框架LoCoT2V-Eval，实验发现模型在细粒度文本-视频对齐和角色一致性方面存在显著不足。

Comments Accepted by ICML 2026 (Regular)

AI中文摘要

近期文本到视频生成在短片段上取得了令人印象深刻的性能，但在复杂文本输入下评估长视频生成仍然是一个重大挑战。为应对这一挑战，我们提出了LoCoT2V-Bench，一个用于长视频生成（LVG）的基准，包含具有层次元数据（如角色设置和相机行为）的多场景提示，这些提示从收集的真实世界视频中构建。我们进一步提出了LoCoT2V-Eval，一个多维度评估框架，涵盖感知质量、文本-视频对齐、时间质量、动态质量和人类期望实现程度（HERD），重点关注细粒度文本-视频对齐和时间角色一致性等方面。在17个代表性LVG模型上的实验揭示了评估维度之间的显著能力差异，模型在感知质量和背景一致性方面表现强劲，但在细粒度文本-视频对齐和角色一致性方面明显较弱。这些发现表明，提高提示忠实度和身份保持仍是长视频生成的关键挑战。我们的代码和数据发布在https://github.com/XqZeppelinhead0702/LoCoT2V-Bench。

英文摘要

Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene prompts with hierarchical metadata (e.g., character settings and camera behaviors), constructed from collected real-world videos. We further propose LoCoT2V-Eval, a multi-dimensional framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD), with an emphasis on aspects such as fine-grained text-video alignment and temporal character consistency. Experiments on 17 representative LVG models reveal pronounced capability disparities across evaluation dimensions, with strong perceptual quality and background consistency but markedly weaker fine-grained text-video alignment and character consistency. These findings suggest that improving prompt faithfulness and identity preservation remains a key challenge for long-form video generation. Our code and data are released at https://github.com/XqZeppelinhead0702/LoCoT2V-Bench

2510.20743 2026-05-29 cs.HC cs.AI cs.CL 版本更新

Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations

共情提示：多模态大语言模型对话中的非语言上下文整合

Lorenzo Stacchio, Andrea Ubaldi, Alessandro Galdelli, Maurizio Mauri, Emanuele Frontoni, Andrea Gaggioli

发表机构 * University of Macerata（马切拉塔大学）

AI总结提出共情提示框架，通过集成面部表情识别服务将非语言情感线索隐式融入大语言模型对话，实现无需用户显式控制的流畅多模态交互。

AI中文摘要

我们提出了共情提示，一种新颖的多模态人机交互框架，它通过隐式的非语言上下文丰富大语言模型（LLM）对话。该系统集成了商业面部表情识别服务以捕捉用户的情感线索，并将其作为上下文信号嵌入提示过程中。与传统多模态界面不同，共情提示不需要用户显式控制；相反，它通过情感信息无干扰地增强文本输入，以实现对话和流畅性对齐。该架构模块化且可扩展，允许集成额外的非语言模块。我们描述了通过本地部署的DeepSeek实例实现的系统设计，并报告了初步的服务和可用性评估（N=5）。结果表明，非语言输入能够一致地整合到连贯的LLM输出中，参与者强调了对话的流畅性。除了这一概念验证外，共情提示还指向了聊天机器人中介通信中的应用，特别是在医疗或教育等领域，这些领域中用户的情感信号至关重要，但在言语交流中往往难以察觉。

英文摘要

We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users' emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users' emotional signals are critical yet often opaque in verbal exchanges.

2510.16658 2026-05-29 cs.AI cs.CE 版本更新

Large-Scale AI and Foundation Models for Neuroscience: A Comprehensive Review

大规模人工智能与基础模型在神经科学中的应用：综合综述

Shihao Yang, Xiying Huang, Danilo Bernardo, Jun-En Ding, Andrew Michael, Guoan Wang, Jingmei Yang, Alison Anderson, Dinesh Giritharan, Patrick Kwan, Ashish Raj, Yu Zhang, Feng Liu

发表机构 * Department of Systems Engineering, Stevens Institute of Technology（系统工程系，史蒂文斯理工学院）； Department of Neurology and Weill Institute for Neurosciences, University of California San Francisco（神经病学系和Weill神经科学研究所，加州大学旧金山分校）； Duke Institute for Brain Sciences, Duke University（杜克大学脑科学研究所）； Division of Systems Engineering, Department of Electrical and Computer Engineering, Boston University（系统工程部，电气与计算机工程系，波士顿大学）； Department of Neuroscience, School of Translational Medicine, Monash University（神经科学系，转化医学学院，莫纳什大学）； Department of Neurology, Alfred Hospital, Melbourne, Victoria, Australia（神经病学系，阿尔弗雷德医院，墨尔本，维多利亚州，澳大利亚）； Department of Radiology and Biomedical Imaging, University of California, San Francisco, CA, USA（放射学与生物医学成像系，加州大学旧金山分校，加州，美国）； Department of Psychiatry and Behavioral Sciences, School of Medicine, Stanford University（精神病学与行为科学系，医学院，斯坦福大学）； Wu Tsai Neurosciences Institute, Stanford University（吴蔡神经科学研究所，斯坦福大学）； Stanford Institute for Human-Centered AI, Stanford University（斯坦福大学人本人工智能研究所）

AI总结本文综述了大规模AI模型在神经科学四个主要领域（神经影像与数据处理、脑机接口与神经解码、临床决策支持与转化框架、神经系统与精神疾病特定应用）的应用，展示了其在多模态数据整合、时空模式解释和临床转化方面的潜力，并强调了严格评估、领域知识整合、临床验证和伦理指南的重要性。

Comments Accepted for publication in Meta-Radiology

AI中文摘要

大规模人工智能（AI）模型的发展通过实现从原始脑信号和神经数据的端到端学习，正在影响神经科学研究。本文综述了大规模AI模型在四个主要神经科学领域的应用：神经影像与数据处理、脑机接口与神经解码、临床决策支持与转化框架，以及神经系统和精神疾病的特定应用。这些模型显示出解决重大计算神经科学挑战的潜力，包括多模态神经数据整合、时空模式解释以及为临床研究开发转化框架。此外，神经科学与AI之间的相互作用已变得日益互惠，因为现在融入了受生物学启发的架构约束，以开发更具可解释性和计算效率的模型。本综述既强调了此类技术的潜力，也强调了关键的实现考虑因素，特别关注严格的评估框架、领域知识的有效整合、前瞻性临床验证以及全面的伦理指南。最后，提供了用于开发和评估跨不同研究应用的大规模AI模型的关键神经科学数据集的系统列表。

英文摘要

The development of large-scale artificial intelligence (AI) models is influencing neuroscience research by enabling end-to-end learning from raw brain signals and neural data. In this paper, we review applications of large-scale AI models across four major neuroscience domains: neuroimaging and data processing, brain-computer interfaces and neural decoding, clinical decision support and translational frameworks, and disease-specific applications across neurological and psychiatric disorders. These models show potential to address major computational neuroscience challenges, including multimodal neural data integration, spatiotemporal pattern interpretation, and the development of translational frameworks for clinical research. Moreover, the interaction between neuroscience and AI has become increasingly reciprocal, as biologically informed architectural constraints are now incorporated to develop more interpretable and computationally efficient models. This review highlights both the promise of such technologies and critical implementation considerations, with particular emphasis on rigorous evaluation frameworks, effective integration of domain knowledge, prospective clinical validation, and comprehensive ethical guidelines. Finally, a systematic listing of critical neuroscience datasets used to develop and evaluate large-scale AI models across diverse research applications is provided.

2510.16060 2026-05-29 cs.LG cs.AI stat.ME stat.ML 版本更新

Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?

超越准确性：时间序列基础模型是否良好校准？

Coen Adler, Yuxin Chang, Felix Draxler, Samar Abdi, Padhraic Smyth

发表机构 * Department of Computer Science（计算机科学系）； Department of Statistics（统计学系）； Google, Irvine（谷歌，尔湾）

AI总结本文系统评估了五个时间序列基础模型和两个基线的校准特性，发现基础模型校准优于基线且无系统性过度自信或信心不足。

Comments Published as a conference paper at ICLR 2026

Journal ref: Proceedings of ICLR 2026

AI中文摘要

最近时间序列数据基础模型的发展引起了在各种应用中使用此类模型的广泛兴趣。尽管基础模型实现了最先进的预测性能，但它们的校准特性仍然相对未被充分探索，尽管校准在许多实际应用中可能至关重要。在本文中，我们研究了五个近期时间序列基础模型和两个竞争基线的校准相关特性。我们进行了一系列系统评估，包括模型校准（即过度自信或信心不足）、不同预测头的影响以及长期自回归预测下的校准。我们发现时间序列基础模型始终比基线模型校准得更好，并且往往不会系统性地过度自信或信心不足，这与在其他深度学习模型中常见的过度自信形成对比。

英文摘要

The recent development of foundation models for time series data has generated considerable interest in using such models across a variety of applications. Although foundation models achieve state-of-the-art predictive performance, their calibration properties remain relatively underexplored, despite the fact that calibration can be critical for many practical applications. In this paper, we investigate the calibration-related properties of five recent time series foundation models and two competitive baselines. We perform a series of systematic evaluations assessing model calibration (i.e., over- or under-confidence), effects of varying prediction heads, and calibration under long-term autoregressive forecasting. We find that time series foundation models are consistently better calibrated than baseline models and tend not to be either systematically over- or under-confident, in contrast to the overconfidence often seen in other deep learning models.
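One simple way to probe the over- versus under-confidence question raised in this abstract is to compare the nominal coverage of a model's prediction intervals with the fraction of observations that actually fall inside them. The sketch below is an illustration only and is not the paper's evaluation protocol; the function names, the tolerance `tol`, and the toy numbers are all assumptions.

```python
from typing import Sequence


def empirical_coverage(
    y_true: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
) -> float:
    """Return the fraction of observations inside their predicted interval."""
    if not (len(y_true) == len(lower) == len(upper)) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


def confidence_verdict(nominal: float, observed: float, tol: float = 0.05) -> str:
    """Classify a forecaster by comparing nominal and observed coverage.

    The tolerance `tol` is a hypothetical choice made for this sketch.
    """
    if observed < nominal - tol:
        return "over-confident"   # intervals too narrow
    if observed > nominal + tol:
        return "under-confident"  # intervals too wide
    return "well-calibrated"


# Toy data: five observations with hypothetical 80% prediction intervals.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
lo = [0.5, 1.5, 3.5, 3.0, 4.0]
hi = [1.5, 2.5, 4.5, 5.0, 4.5]
coverage = empirical_coverage(y, lo, hi)
verdict = confidence_verdict(0.8, coverage)
```

Repeating this check at several nominal levels gives a coarse reliability curve; a full study would also need proper scoring rules and per-horizon breakdowns.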

2510.10961 2026-05-29 cs.CL cs.AI 版本更新

Obfuscation Rules for Detecting and Detoxifying Korean Toxicity

用于检测和去毒化韩语毒性内容的混淆规则

Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han

发表机构 * Yonsei University（延世大学）

AI总结本文提出KOTOX数据集，通过定义基于语言学的韩语混淆规则和变换框架，支持对混淆毒性文本的去混淆与去毒化，首次同时实现韩语混淆毒性检测与净化。

Comments 26 pages, 12 figures, 24 tables

AI中文摘要

随着语言模型越来越多地部署于在线环境中，毒性检测和去毒化已受到越来越多的关注。现有研究主要关注非混淆文本，这限制了当用户故意伪装毒性表达时的鲁棒性。特别是，韩语毒性表达可以通过黏着形态学和韩文特有的正字法变体轻易伪装。然而，韩语中的混淆现象在很大程度上尚未被探索，这促使我们引入KOTOX：用于去混淆和去毒化的韩语毒性数据集。我们将韩语混淆模式分类为基于语言学的类别，定义从真实世界示例中推导出的变换规则，并将生成的混淆框架作为开放的变换包提供。利用这些规则，我们提供了配对的中性和毒性句子及其混淆版本。在我们的数据集上训练的模型能更好地处理混淆文本，而不会牺牲在非混淆文本上的性能。这是首个同时支持韩语去混淆和去毒化的数据集。我们期望该数据集能促进大型语言模型对韩语混淆毒性内容的更好理解和缓解。我们的代码和数据可在 https://github.com/leeyejin1231/KOTOX 获取。

英文摘要

As language models are increasingly deployed in online environments, toxicity detection and detoxification have received growing attention. Existing studies primarily focus on non-obfuscated text, which limits robustness when users intentionally disguise toxic expressions. In particular, Korean toxic expressions can be easily disguised through agglutinative morphology and Hangeul-specific orthographic variation. However, obfuscation in Korean remains largely unexplored, which motivates us to introduce KOTOX, a Korean toxic dataset for deobfuscation and detoxification. We categorize Korean obfuscation patterns into linguistically grounded classes, define transformation rules derived from real-world examples, and provide the resulting obfuscation framework as an open transformation package. Using these rules, we provide paired neutral and toxic sentences alongside their obfuscated counterparts. Models trained on our dataset better handle obfuscated text without sacrificing performance on non-obfuscated text. This is the first dataset that simultaneously supports deobfuscation and detoxification for the Korean language. We expect the dataset to facilitate better understanding and mitigation of obfuscated toxic content in LLMs for Korean. Our code and data are available at https://github.com/leeyejin1231/KOTOX.

2510.06063 2026-05-29 cs.AI cs.IT cs.LG math.IT 版本更新

TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

TelecomTS：面向时间序列与语言分析的多模态可观测性数据集

Austin Feng, Andreas Varvarigos, Ioannis Panitsas, Daniela Fernandez, Jinbiao Wei, Yuwei Guo, Jialin Chen, Ali Maatouk, Leandros Tassiulas, Rex Ying

发表机构 * Yale University（耶鲁大学）； Johns Hopkins University（约翰霍普金斯大学）

AI总结本文提出TelecomTS，一个来自5G电信网络的大规模多模态可观测性数据集，通过保留绝对尺度信息的异质协变量和多样化下游任务（异常检测、根因分析、多模态问答），揭示了现有模型在处理可观测性数据的高噪声、突变特性时的不足。

AI中文摘要

现代企业在监控复杂系统时会产生大量的时间序列指标流，即所谓的可观测性数据。与来自气候等领域的传统时间序列不同，可观测性数据具有零膨胀、高度随机且时间结构极小的特点。尽管这些数据至关重要，但由于专有限制和隐私问题，可观测性数据集在公开基准中仍然代表性不足。现有数据集通常经过匿名化和归一化处理，去除了尺度信息，限制了其在异常检测、根因分析和多模态推理等任务中的应用。为弥补这一空白，我们引入了TelecomTS，这是一个源自5G电信网络的大规模可观测性数据集。TelecomTS包含具有明确绝对尺度信息的异质、去匿名化协变量，并提供多样化的下游任务套件，包括异常检测、根因分析和多模态问答。对最先进的时间序列、语言、推理和多模态基础模型的基准测试表明，现有方法难以应对可观测性数据特有的突变、高噪声和高方差动态特性。我们的实验进一步强调了保留协变量绝对尺度的重要性，凸显了开发能够原生利用尺度信息的基础时间序列模型以应对实际可观测性应用需求的必要性。代码可在 https://github.com/Ali-maatouk/TelecomTS 获取。

英文摘要

Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conventional time series from domains such as climate, observability data are zero-inflated, highly stochastic, and exhibit minimal temporal structure. Despite their importance, observability datasets remain underrepresented in public benchmarks due to proprietary restrictions and privacy concerns. Existing datasets are often anonymized and normalized, removing scale information and limiting their use for tasks such as anomaly detection, root cause analysis, and multi-modal reasoning. To address this gap, we introduce TelecomTS, a large-scale observability dataset derived from a 5G telecommunications network. TelecomTS features heterogeneous, de-anonymized covariates with explicit absolute scale information and provides a diverse suite of downstream tasks, including anomaly detection, root cause analysis, and multi-modal question-answering. Benchmarking state-of-the-art time series, language, reasoning, and multi-modal foundation models reveals that existing approaches struggle with the abrupt, noisy, and high-variance dynamics characteristic of observability data. Our experiments further underscore the importance of preserving covariates' absolute scale, emphasizing the need for foundation time series models that natively leverage scale information for practical real-world observability applications. The code is available at: https://github.com/Ali-maatouk/TelecomTS.

2510.02480 2026-05-29 cs.AI cs.LG 版本更新

Controlling the Risk of Corrupted Contexts for Language Models via Early-Exiting

通过早退机制控制语言模型中有害上下文的风险

Andrea Wynn, Metod Jazbec, Charith Peris, Rinat Khaziev, Anqi Liu, Daniel Khashabi, Eric Nalisnick

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； University of Amsterdam（阿姆斯特丹大学）； Amazon AGI（亚马逊人工智能实验室）； Amazon Alexa（亚马逊Alexa）

AI总结提出一种结合动态早退预测与无分布风险控制的方法，限制有害上下文对语言模型性能的退化，并在有益上下文中实现计算效率提升。

Comments Accepted to ICML 2026

AI中文摘要

大型语言模型（LLMs）可能受到有害或不相关上下文的影响，这会显著损害模型在下游任务上的性能。这促使我们设计具有内置机制的原则性方案，以防范此类“垃圾进，垃圾出”场景。我们提出一种新颖方法，限制有害上下文对模型性能的退化程度。首先，我们定义模型的基线“安全”行为——即无任何上下文（零样本）时的模型性能。接着，我们应用无分布风险控制（DFRC）来控制用户提供的上下文将性能降至该安全零样本基线以下的程度。我们通过利用动态早退预测实现这一点，忽略那些最关注不安全输入的后注意力头。最后，我们提出对DFRC的修改，使其既能控制有害输入的风险，又能利用有益输入的性能和效率提升。我们在涵盖上下文学习和开放式问答的9项任务上展示了理论和实证结果，表明我们的方法能有效控制有害上下文的风险，同时在使用有益上下文时实现显著的计算效率提升。

英文摘要

Large language models (LLMs) can be influenced by harmful or irrelevant context, which can significantly harm model performance on downstream tasks. This motivates principled designs in which LLM systems include built-in mechanisms to guard against such "garbage in, garbage out" scenarios. We propose a novel approach to limit the degree to which harmful context can degrade model performance. First, we define a baseline "safe" behavior for the model -- the model's performance given no context at all (zero-shot). Next, we apply distribution-free risk control (DFRC) to control the extent to which the user-provided context can decay performance below this safe zero-shot baseline. We achieve this by leveraging dynamic early exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs and leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results across 9 tasks spanning in-context learning and open-ended question answering, showing that our approach can effectively control risk for harmful context and simultaneously achieve substantial computational efficiency gains with helpful context.

2509.23573 2026-05-29 cs.CR cs.AI 版本更新

Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence

揭示LLM辅助网络威胁情报中的脆弱性

Yuqiao Meng, Luoxi Tang, Feiyang Yu, Jinyuan Jia, Guanhua Yan, Ping Yang, Zhaohan Xi

发表机构 * Binghamton University（宾汉姆顿大学）； Duke University（杜克大学）； Pennsylvania State University（宾夕法尼亚州立大学）

AI总结本文通过人机协同分类框架，识别并验证了LLM在CTI推理中的三种领域特定认知失败模式（虚假关联、矛盾知识和受限泛化），并证明针对性防御可显著降低失败率。

AI中文摘要

大型语言模型（LLM）正越来越多地被用于帮助安全分析师应对激增的网络威胁，自动化从漏洞评估到事件响应的工作流程。然而，在实际操作的CTI工作流中，可靠性差距仍然显著。现有解释通常指向通用模型问题（如幻觉），但我们认为主要瓶颈在于威胁格局本身：CTI具有异质性、易变性和碎片化特征。在这些条件下，证据相互交织、众包且时间不稳定，这些特性是标准LLM研究很少捕捉到的。在本文中，我们对LLM在CTI推理中的脆弱性进行了全面的实证研究。我们引入了一个人机协同分类框架，该框架能够稳健地标注CTI生命周期中的失败模式，避免了自动化“LLM作为评判者”管道的脆弱性。我们识别出三种领域特定的认知失败：来自表面元数据的虚假关联、来自冲突来源的矛盾知识以及对新兴威胁的受限泛化。我们通过因果干预验证了这些机制，并表明针对性防御能显著降低失败率。这些结果共同为构建具有韧性且领域感知的CTI智能体提供了具体路线图。

英文摘要

Large language models (LLMs) are increasingly used to help security analysts manage the surge of cyber threats, automating tasks from vulnerability assessment to incident response. Yet in operational CTI workflows, reliability gaps remain substantial. Existing explanations often point to generic model issues (e.g., hallucination), but we argue the dominant bottleneck is the threat landscape itself: CTI is heterogeneous, volatile, and fragmented. Under these conditions, evidence is intertwined, crowdsourced, and temporally unstable, which are properties that standard LLM-based studies rarely capture. In this paper, we present a comprehensive empirical study of LLM vulnerabilities in CTI reasoning. We introduce a human-in-the-loop categorization framework that robustly labels failure modes across the CTI lifecycle, avoiding the brittleness of automated "LLM-as-a-judge" pipelines. We identify three domain-specific cognitive failures: spurious correlations from superficial metadata, contradictory knowledge from conflicting sources, and constrained generalization to emerging threats. We validate these mechanisms via causal interventions and show that targeted defenses reduce failure rates significantly. Together, these results offer a concrete roadmap for building resilient, domain-aware CTI agents.

2509.23571 2026-05-29 cs.CR cs.AI 版本更新

Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

通过标准化威胁狩猎评估LLM辅助蓝队

Yuqiao Meng, Luoxi Tang, Feiyang Yu, Xi Li, Guanhua Yan, Ping Yang, Zhaohan Xi

发表机构 * State University of New York at Binghamton（纽约州立大学宾汉姆顿分校）； University of Alabama at Birmingham（阿拉巴马大学伯明翰分校）； Duke University（杜克大学）

AI总结本文提出CyberTeam基准，通过构建标准化工作流和模块化操作步骤，评估大语言模型在蓝队威胁狩猎中的有效性，并揭示标准化设计带来的改进与开放式推理的局限性。

Comments ICML'26

AI中文摘要

随着网络威胁在规模和复杂性上持续增长，蓝队防御者越来越需要先进工具来主动检测和缓解风险。大语言模型（LLMs）为增强威胁分析提供了有前景的能力。然而，它们在真实蓝队威胁狩猎场景中的有效性仍未得到充分探索。本文提出CyberTeam，一个旨在指导LLMs进行蓝队实践的基准。CyberTeam通过两个阶段构建标准化工作流。首先，它通过捕获从威胁归因到事件响应的分析任务之间的依赖关系，对真实的威胁狩猎工作流进行建模。接下来，每个任务通过一组针对其特定分析需求定制的操作模块来处理。这将威胁狩猎转化为一系列结构化的推理步骤，每个步骤基于离散操作并根据任务特定依赖关系排序。在此框架指导下，LLMs被引导通过模块化步骤执行威胁狩猎任务。总体而言，CyberTeam整合了30个任务和9个操作模块，以指导LLMs进行标准化威胁分析。我们评估了领先的LLMs和最先进的网络安全智能体，将CyberTeam与开放式推理策略进行比较。我们的结果突显了标准化设计带来的改进，同时也揭示了开放式推理在真实威胁狩猎中的局限性。

英文摘要

As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat analysis. However, their effectiveness in real-world blue team threat-hunting scenarios remains insufficiently explored. This paper presents CyberTeam, a benchmark designed to guide LLMs in blue teaming practice. CyberTeam constructs a standardized workflow in two stages. First, it models realistic threat-hunting workflows by capturing the dependencies among analytical tasks from threat attribution to incident response. Next, each task is addressed through a set of operational modules tailored to its specific analytical requirements. This transforms threat hunting into a structured sequence of reasoning steps, with each step grounded in a discrete operation and ordered according to task-specific dependencies. Guided by this framework, LLMs are directed to perform threat-hunting tasks through modularized steps. Overall, CyberTeam integrates 30 tasks and 9 operational modules to guide LLMs through standardized threat analysis. We evaluate both leading LLMs and state-of-the-art cybersecurity agents, comparing CyberTeam against open-ended reasoning strategies. Our results highlight the improvements enabled by standardized design, while also revealing the limitations of open-ended reasoning in real-world threat hunting.

2508.15371 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Confidence-Modulated Speculative Decoding for Large Language Models

置信度调节的推测解码用于大型语言模型

Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

发表机构 * Department of Data Science（数据科学系）； Praxis Business School（普拉克斯商学院）

AI总结本文提出一种基于置信度调节的推测解码框架，通过熵和边际不确定性度量动态调整草稿长度与验证过程，在机器翻译和摘要任务上实现加速并保持或提升BLEU和ROUGE分数。

Comments This is the preprint of the paper, which has been accepted for oral presentation and publication in the proceedings of IEEE INDISCON 2025. The conference will be organized at the National Institute of Technology, Rourkela, India, from August 21 to 23, 2025. The paper is 10 pages long, and it contains 2 figures and 5 tables

DOI: 10.1109/INDISCON66021.2025.11254640

AI中文摘要

推测解码已成为一种通过草稿-验证范式并行化令牌生成来加速自回归推理的有效方法。然而，现有方法依赖静态草稿长度和刚性验证标准，限制了其在不同模型不确定性和输入复杂性下的适应性。本文提出一种基于置信度调节草稿的信息论推测解码框架。通过利用草稿模型输出分布上的熵和边际不确定性度量，所提方法在每次迭代中动态调整推测生成的令牌数量。这种自适应机制减少了回滚频率，提高了资源利用率，并保持了输出保真度。此外，验证过程使用相同的置信度信号进行调节，使得在不牺牲生成质量的情况下更灵活地接受草稿令牌。在机器翻译和摘要任务上的实验表明，与标准推测解码相比，该方法在保持或提升BLEU和ROUGE分数的同时实现了显著加速。所提方法提供了一种原则性的即插即用方法，用于在不确定性变化条件下实现大型语言模型的高效且鲁棒的解码。

英文摘要

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
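The idea of adjusting the draft length from the drafter's uncertainty can be sketched as a mapping from the entropy of the drafter's next-token distribution to a number of tokens to draft. The code below is a minimal illustration under stated assumptions, not the paper's method: the linear mapping, the bounds `min_len` and `max_len`, and the function names are hypothetical, and the margin-based signal mentioned in the abstract is omitted.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def draft_length(probs: Sequence[float], min_len: int = 1, max_len: int = 8) -> int:
    """Map drafter confidence to a draft length (hypothetical linear rule).

    Low entropy (a confident drafter) gives a longer draft; high entropy
    gives a shorter one, which should reduce rollbacks on uncertain steps.
    """
    vocab = len(probs)
    if vocab < 2:
        return max_len
    normalized = entropy(probs) / math.log(vocab)     # roughly in [0, 1]
    confidence = min(1.0, max(0.0, 1.0 - normalized))  # clamp float error
    return min_len + round(confidence * (max_len - min_len))


uniform_len = draft_length([0.25, 0.25, 0.25, 0.25])  # maximally uncertain
peaked_len = draft_length([1.0, 0.0, 0.0, 0.0])       # fully confident
middle_len = draft_length([0.7, 0.1, 0.1, 0.1])
```

A real decoder would compute `probs` from the draft model's logits at each iteration and re-evaluate the length after every verification step.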

2508.12176 2026-05-29 cs.CV cs.AI eess.SP 版本更新

Scalable RF Simulation in Generative 4D Worlds

生成式4D世界中的可扩展射频仿真

Zhiwei Zheng, Dongyin Hu, Mingmin Zhao

发表机构 * University of Pennsylvania（宾夕法尼亚大学）

AI总结提出WaveVerse框架，通过语言引导的4D世界生成器和物理信号模拟器实现可扩展的射频信号仿真，在相位敏感基准上表现高保真度，并有效提升下游任务性能。

Comments Accepted to ICML 2026

AI中文摘要

射频（RF）感知已成为一种强大的、保护隐私的替代视觉方法，用于各种感知任务。然而，在动态和多样化的环境中构建高质量的RF数据集仍然是一个重大挑战。为了解决这一问题，我们引入了WaveVerse，一个基于提示的可扩展框架，该框架从生成的室内场景中模拟真实的RF信号，并包含由空间路径引导的人体运动，从而无需手动轨迹设计即可实现多样且可行的行为。WaveVerse具有语言引导的4D世界生成器和基于物理的信号模拟器，能够在多样化的环境中实现RF信号的逼真模拟。它采用了一个相位相干光线追踪器，保留了空间和时间上的相位一致性。模拟信号在相位敏感基准上显示出高保真度，并且与真实世界收集的测量数据以及来自专有电磁求解器的模拟结果高度一致。当用于数据增强时，WaveVerse在RF成像和人类活动识别等下游任务中持续提升性能，其增益随模拟数据量的增加而增长，并超越了现有方法。代码和附加材料可在网页上获取。

英文摘要

Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for various perception tasks. However, building high-quality RF datasets in dynamic and diverse environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions guided by spatial paths, enabling diverse and feasible behaviors without manual trajectory design. WaveVerse features a language-guided 4D world generator and a physics-based signal simulator that enables realistic simulation of RF signals in diverse environments. It employs a phase-coherent ray tracer that preserves both spatial and temporal phase consistency. The simulated signals show high fidelity on phase-sensitive benchmarks, and closely align with both real-world collected measurements and simulations from a proprietary electromagnetic solver. When used for data augmentation, WaveVerse consistently improves performance in downstream tasks like RF imaging and human activity recognition, with gains that grow with the amount of simulated data and surpass existing methods. Code and additional materials are available on the webpage.

2508.05614 2026-05-29 cs.CL cs.AI 版本更新

GroundAct: Can LLM Agents Ground Actions in Environmental States?

GroundAct：LLM智能体能否在环境状态中实现动作落地？

Zixuan Wang, Dingming Li, Hongxing Li, Yanrui Miao, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

发表机构 * Zhejiang University（浙江大学）

AI总结本研究提出GroundAct基准，通过1500个场景和16592个任务实例评估15个LLM，发现动作落地能力是多维挑战，不能仅通过模型规模解决。

Comments Project Page: https://zju-real.github.io/OmniEmbodied Code: https://github.com/ZJU-REAL/OmniEmbodied

AI中文摘要

LLM智能体在指令完全指定动作的任务上成功率达到85-96%，但当动作可行性取决于指令未提及的环境状态时，成功率降至29-53%。我们认为这一差距反映了一种缺失的能力：动作落地，即从结构化环境状态推断动作是否可行、缺少哪些前提条件以及是否超出个体能力的能力。我们引入GroundAct，这是一个包含1500个场景和16592个任务实例的基准，基于文本的交互式环境涵盖11个领域，任务按认知复杂度层级组织为七个类别。评估15个LLM（3B-671B）后，我们发现三种诊断模式：（i）属性推理与工具和协作推理弱相关，产生不同的模型轮廓；（ii）完整环境图在工具使用与隐式协作之间产生高达+27.6/-22.9%的差异，区分了搜索边界与约束过滤瓶颈；（iii）监督微调将Qwen2.5-3B在直接命令上的性能从0.6%提升至76.3%，但在隐式协作上仅从1.5%提升至5.5%。这些结果表明动作落地是一个多维挑战，不能仅通过规模扩展解决。

英文摘要

LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends on environmental state that the instruction does not mention. We argue that this gap reflects a missing capability: action grounding, the ability to infer from structured environmental state whether an action is feasible, what prerequisites it lacks, and whether it exceeds individual capacity. We introduce GroundAct, a benchmark of 1,500 scenarios and 16,592 task instances in text-based interactive environments spanning 11 domains, with tasks organized into seven categories along a cognitive complexity hierarchy. Evaluating 15 LLMs (3B-671B), we find three diagnostic patterns: (i) attribute reasoning is weakly correlated with tool and coordination reasoning, producing distinct model profiles; (ii) complete environment graphs yield up to +27.6/-22.9% on tool use vs. implicit collaboration, separating search-bound from constraint-filtering bottlenecks; and (iii) supervised fine-tuning lifts Qwen2.5-3B from 0.6% to 76.3% on direct command but only 1.5% to 5.5% on implicit collaboration. These results establish action grounding as a multi-dimensional challenge irreducible to scaling.

2507.21114 2026-05-29 cs.IR cs.AI cs.CV 版本更新

Page image classification for content-specific data processing

面向特定内容数据处理的页面图像分类

Kateryna Lutsai

AI总结本研究针对人文学科数字化项目中历史文档页面图像内容多样、手动分类困难的问题，开发并评估了一种基于人工智能和机器学习的图像分类系统，通过按内容类别（如文本类型、图形元素、布局）自动分类页面，以支持定制化的下游分析流程。

Comments Dataset licensing issues occurred

AI中文摘要

人文学科的数字化项目通常会产生大量历史文档的页面图像，这给手动分类和分析带来了巨大挑战。这些档案包含多样化的内容，包括各种文本类型（手写体、打字体、印刷体）、图形元素（图画、地图、照片）以及布局（纯文本、表格、表单）。高效处理这些异构数据需要基于页面内容进行自动分类的方法，从而能够启用定制化的下游分析流程。本项目通过开发并评估一种专门为历史文档页面设计的图像分类系统来满足这一需求，该系统利用了人工智能和机器学习的最新进展。所选的类别集旨在促进特定内容处理工作流程，将需要不同分析技术（例如，用于文本的OCR、用于图形的图像分析）的页面区分开来。

英文摘要

Digitization projects in the humanities often generate vast quantities of page images from historical documents, presenting significant challenges for manual sorting and analysis. These archives contain diverse content, including various text types (handwritten, typed, printed), graphical elements (drawings, maps, photos), and layouts (plain text, tables, forms). Efficiently processing this heterogeneous data requires automated methods to categorize pages based on their content, enabling tailored downstream analysis pipelines. This project addresses this need by developing and evaluating an image classification system specifically designed for historical document pages, leveraging advancements in artificial intelligence and machine learning. The set of categories was chosen to facilitate content-specific processing workflows, separating pages requiring different analysis techniques (e.g., OCR for text, image analysis for graphics).

2507.09574 2026-05-29 cs.CV cs.AI cs.CL 版本更新

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

MENTOR: 面向自回归视觉生成模型的高效多模态条件微调

Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）； Tsinghua University（清华大学）； Peking University（北京大学）； Microsoft（微软公司）

AI总结提出MENTOR框架，通过两阶段训练范式实现自回归图像生成器与多模态输入的细粒度token级对齐，无需辅助适配器或交叉注意力模块，在DreamBench++上取得优异性能。

Comments Findings of ACL 2026

AI中文摘要

最近的文本到图像模型能够生成高质量结果，但在精确视觉控制、平衡多模态输入以及需要大量训练以实现复杂多模态图像生成方面仍存在困难。为解决这些局限，我们提出MENTOR，一种新颖的自回归（AR）框架，用于高效的多模态条件微调以实现自回归多模态图像生成。MENTOR将AR图像生成器与两阶段训练范式相结合，无需依赖辅助适配器或交叉注意力模块，即可实现多模态输入与图像输出之间的细粒度、token级对齐。两阶段训练包括：（1）多模态对齐阶段，建立稳健的像素级和语义级对齐；随后是（2）多模态指令微调阶段，平衡多模态输入的整合并增强生成可控性。尽管模型规模适中、基础组件非最优且训练资源有限，MENTOR在DreamBench++基准测试上仍取得了强劲性能，在概念保持和提示遵循方面优于竞争基线。此外，与基于扩散的方法相比，我们的方法具有更优的图像重建保真度、广泛的任务适应性以及更高的训练效率。数据集、代码和模型可在 https://github.com/HaozheZhao/MENTOR 获取。

英文摘要

Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

2507.03318 2026-05-29 cs.LG cs.AI 版本更新

Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization

基于图神经网络与组套索正则化的结构感知化合物-蛋白质亲和力预测

Zanyu Shi, Yang Wang, Pathum Weerawarna, Jie Zhang, Timothy Richardson, Yijie Wang, Kun Huang

发表机构 * Department of Biostatistics & Health Data Science（生物统计学与健康数据科学系）； Indiana University（印第安纳大学）； Department of Computer Science（计算机科学系）； Indiana University Bloomington（印第安纳大学布卢明顿分校）； Division of Clinical Pharmacology（临床药理学部）； Indiana University School of Medicine（印第安纳大学医学院）； IUSM-Purdue TREAT-AD Center（IUSM-普渡大学TREAT-AD中心）； Department of Medical and Molecular Genetics（医学与分子遗传学系）

AI总结提出利用图神经网络结合组套索和稀疏组套索正则化，从活性悬崖分子对中学习结构信息以预测化合物-蛋白质亲和力（IC50），并提升模型可解释性。

Comments 15 pages, 7 figures

详情

DOI: 10.34133/csbj.0012
Journal ref: Comput Struct Biotechnol J. 2026;35:0012

AI中文摘要

可解释人工智能（XAI）方法越来越多地被应用于药物发现中，以学习分子表示并识别驱动性质预测的子结构。然而，为化合物性质预测构建结构-活性关系（SAR）建模的端到端可解释模型面临诸多挑战，例如特定蛋白质靶标的化合物-蛋白质相互作用活性数据有限，以及分子构型位点的细微变化会显著影响分子性质。我们利用具有活性悬崖的分子对，这些分子共享骨架但在取代基位点不同，其特征是对特定蛋白质靶标具有较大的效力差异。我们提出一个框架，通过实现图神经网络（GNN）来利用活性悬崖对的性质和结构信息，以预测化合物-蛋白质亲和力（即半数最大抑制浓度，IC50）。为了增强模型性能和可解释性，我们使用结构感知损失函数训练GNN，采用组套索和稀疏组套索正则化，这些正则化方法能够剪枝并突出与活性差异相关的分子子图。我们将该框架应用于针对三种原癌基因酪氨酸蛋白激酶Src蛋白（PDB ID：1O42、2H8H、4MXO）的分子活性悬崖数据。我们的方法通过稀疏组套索整合共有和非共有节点信息，改进了性质预测，这体现在均方根误差（RMSE）降低和皮尔逊相关系数（PCC）提高上。应用正则化还通过提升图级全局方向分数和改进原子级着色精度，增强了GNN的特征归因能力。这些进展增强了药物发现流程中模型的可解释性，特别是在先导化合物优化中识别关键分子子结构方面。

英文摘要

Explainable artificial intelligence (XAI) approaches have been increasingly applied in drug discovery to learn molecular representations and identify substructures driving property predictions. However, building end-to-end explainable models for structure-activity relationship (SAR) modeling for compound property prediction faces many challenges, such as the limited amount of compound-protein interaction activity data for specific protein targets, and the fact that subtle changes at molecular configuration sites can significantly affect molecular properties. We exploit pairs of molecules with activity cliffs that share scaffolds but differ at substituent sites, characterized by large potency differences for specific protein targets. We propose a framework that uses graph neural networks (GNNs) to leverage property and structure information from activity cliff pairs to predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). To enhance model performance and explainability, we train GNNs with structure-aware loss functions using group lasso and sparse group lasso regularizations, which prune and highlight molecular subgraphs relevant to activity differences. We applied this framework to activity cliff data of molecules targeting three proto-oncogene tyrosine-protein kinase Src proteins (PDB IDs: 1O42, 2H8H, 4MXO). Our approach improved property prediction by integrating common and uncommon node information with sparse group lasso, as reflected in reduced root mean squared error (RMSE) and improved Pearson's correlation coefficient (PCC). Applying regularizations also enhances feature attribution for GNN by boosting graph-level global direction scores and improving atom-level coloring accuracy. These advances strengthen model interpretability in drug discovery pipelines, particularly for identifying critical molecular substructures in lead optimization.
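For readers unfamiliar with the two regularizers named in this abstract, the sketch below computes the standard group lasso and sparse group lasso penalties over grouped weights. It is a generic textbook formulation, not the paper's loss: how the paper groups GNN parameters or node features is not stated here, so the grouping, the weights `lam` and `alpha`, and the function names are assumptions.

```python
import math
from typing import Sequence


def group_lasso_penalty(groups: Sequence[Sequence[float]], lam: float = 1.0) -> float:
    """lam * sum over groups of sqrt(group size) * L2 norm of the group.

    Penalizing the unsquared L2 norm drives entire groups to zero together.
    """
    return lam * sum(
        math.sqrt(len(g)) * math.sqrt(sum(w * w for w in g))
        for g in groups
        if len(g) > 0
    )


def sparse_group_lasso_penalty(
    groups: Sequence[Sequence[float]], lam: float = 1.0, alpha: float = 0.5
) -> float:
    """Convex mix of an L1 term (within-group sparsity) and group lasso."""
    l1 = sum(abs(w) for g in groups for w in g)
    return alpha * lam * l1 + (1.0 - alpha) * group_lasso_penalty(groups, lam)


# Toy weights: one active group and one group already pruned to zero.
weights = [[3.0, 4.0], [0.0, 0.0]]
gl = group_lasso_penalty(weights)
sgl = sparse_group_lasso_penalty(weights)
```

In training, such a penalty would be added to the prediction loss so that whole groups with little effect on the activity difference shrink to zero, which is what makes the surviving groups readable as an explanation.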

2506.12508 2026-05-29 cs.AI 版本更新

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

EPiC: 基于精确锚点视频引导的高效视频摄像机控制学习

Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal

AI总结提出EPiC框架，通过基于首帧可见性掩码构建精确对齐的锚点视频，并引入轻量模块Anchor-ControlNet，以极低参数实现高效、精确的3D摄像机控制，在RealEstate10K和MiraData上达到最先进性能。

Comments Accepted to ICML 2026. Project website: https://zunwang1.github.io/Epic

AI中文摘要

近期带摄像机控制的视频生成方法通常通过从估计的点云沿摄像机轨迹渲染，创建锚点视频（即近似所需摄像机运动的渲染视频），以作为结构化先验引导扩散模型。然而，点云和摄像机轨迹估计中的误差常导致不准确的锚点视频，并带来更高的训练成本和低效率，因为模型被迫补偿渲染错位。为解决这些局限，我们提出EPiC，一种高效且精确的摄像机控制学习框架，无需摄像机姿态或点云估计即可构建良好对齐的训练锚点视频。具体而言，我们通过基于首帧可见性掩码掩蔽源视频来创建高精度锚点视频，这确保了强对齐，消除了对摄像机/点云估计的需求，因此可轻松应用于任意野外视频。此外，我们引入Anchor-ControlNet，一种轻量模块，将可见区域中的锚点视频引导集成到预训练视频扩散模型中，仅增加不到1%的额外参数。EPiC以显著更少的参数、训练步骤和数据实现高效训练，并在测试时对使用点云制作的锚点视频具有鲁棒泛化能力，从而实现精确的3D感知摄像机控制。EPiC在RealEstate10K和MiraData上的I2V摄像机控制任务中达到最先进性能。值得注意的是，EPiC还展现出对视频到视频（V2V）场景的强零样本泛化能力。

英文摘要

Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camera motions) to guide diffusion models as a structured prior, by rendering from estimated point clouds following camera trajectories. However, errors in point cloud and camera trajectory estimation often lead to inaccurate anchor videos with higher training cost and low efficiency, as the model is forced to compensate for rendering misalignments. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that constructs well-aligned training anchor videos without the need for camera pose or point cloud estimation. Concretely, we create highly precise anchor videos by masking source videos based on first-frame visibility, which ensures strong alignment, eliminates the need for camera/point cloud estimation, and thus can be readily applied to any in-the-wild video. Furthermore, we introduce Anchor-ControlNet, a lightweight module that integrates anchor video guidance in visible regions to pretrained video diffusion models, with less than 1% of additional parameters. EPiC achieves efficient training with substantially fewer parameters, training steps, and less data, and generalizes robustly to anchor videos made with point clouds at test time, enabling precise 3D-informed camera control. EPiC achieves SoTA performance on RealEstate10K and MiraData for I2V camera control task. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video (V2V) scenarios.

2505.10975 2026-05-29 cs.CL cs.AI cs.SD eess.AS 版本更新

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

单声道音频的端到端多说话人自动语音识别综述

Xinlu He, Jacob Whitehill

发表机构 * Worcester Polytechnic Institute（伍斯特理工学院）

AI总结本文系统综述了端到端多说话人自动语音识别的神经架构范式（SIMO与SISO）、近期改进方法及长语音扩展策略，并通过标准基准评估比较了各类方法。

Comments Accepted for publication in Computer Speech & Language (CSL)

AI中文摘要

单声道多说话人自动语音识别（ASR）由于数据稀缺以及识别并将词语归因于单个说话人的内在困难（尤其是在重叠语音中）仍然具有挑战性。最近的进展推动了从级联系统向端到端（E2E）架构的转变，这减少了错误传播并更好地利用了语音内容与说话人身份之间的协同作用。尽管端到端多说话人ASR取得了快速进展，但该领域缺乏对近期发展的全面综述。本综述为多说话人ASR的端到端神经方法提供了一个系统的分类法，突出了近期进展和比较分析。具体而言，我们分析了：（1）用于预分割音频的架构范式（SIMO与SISO），分析了它们的不同特征和权衡；（2）基于这两种范式的近期架构和算法改进；（3）对长语音的扩展，包括分割策略和说话人一致性的假设拼接。此外，我们（4）在标准基准上评估和比较了各种方法。最后，我们讨论了构建鲁棒且可扩展的多说话人ASR所面临的开放挑战和未来研究方向。

英文摘要

Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, analyzing their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.

2503.13844 2026-05-29 cs.CL cs.AI cs.CY cs.LG 版本更新

Towards Detecting Persuasion on Social Media: From Model Development to Insights on Persuasion Strategies

检测社交媒体上的说服：从模型开发到说服策略的洞察

Elyas Meguellati, Stefano Civelli, Pietro Bernardelle, Shazia Sadiq, Irwin King, Gianluca Demartini

发表机构 * University of Queensland（昆士兰大学）； The Chinese University of Hong Kong（香港中文大学）

AI总结本文通过开发轻量级说服文本检测模型（在SemEval 2023任务3子任务3中达到最优性能）并应用于澳大利亚联邦选举2022 Facebook广告数据集，揭示了政治竞选在不同资金策略、词汇选择、人口统计定位和选举临近时说服强度时间变化中的模式。

DOI: 10.1609/icwsm.v20i1.42714
Journal ref: Proceedings of the International AAAI Conference on Web and Social Media 20(1) (2026) 1587-1608

Abstract

Political advertising plays a pivotal role in shaping public opinion and influencing electoral outcomes, often through subtle persuasive techniques embedded in broader propaganda strategies. Detecting these persuasive elements is crucial for enhancing voter awareness and ensuring transparency in democratic processes. This paper presents an integrated approach that bridges model development and real-world application through two interconnected studies. First, we introduce a lightweight model for persuasive text detection that achieves state-of-the-art performance in Subtask 3 of SemEval 2023 Task 3 while requiring significantly fewer computational resources and training data than existing methods. Second, we demonstrate the model's practical utility by collecting the Australian Federal Election 2022 Facebook Ads (APA22) dataset, partially annotating a subset for persuasion, and fine-tuning the model to adapt from mainstream news to social media content. We then apply the fine-tuned model to label the remainder of the APA22 dataset, revealing distinct patterns in how political campaigns leverage persuasion through different funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as election day approaches. Our findings not only underscore the necessity of domain-specific modeling for analyzing persuasion on social media but also show how uncovering these strategies can enhance transparency, inform voters, and promote accountability in digital campaigns.

2502.16548 2026-05-29 cs.LG cs.AI cs.CV (version update)

A Composable Multimodal Framework for cine CMR-Text-Driven Prediction of Heart Failure Outcomes

Jianzhou Chen, Jinyang Sun, Xiumei Wang, Xi Chen, Heyu Chu, Guo Song, Yuji Luo, Xingping Zhou, Rong Gu

Affiliations: Department of Cardiology, Nanjing Drum Tower Hospital, State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University; School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; College of Electronic and Optical Engineering, Nanjing University of Posts and Telecommunications; College of Integrated Circuit Science and Engineering, Nanjing University of Posts and Telecommunications; Department of Cardiology, Nanjing Drum Tower Hospital Clinical College of Nanjing Medical University; Institute of Quantum Information and Technology, Nanjing University of Posts and Telecommunications

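AI summary: Proposes a composable multimodal framework that integrates cine CMR imaging, structured clinical metrics, and unstructured textual records to predict heart failure prognosis more accurately than single-modal AI algorithms and to support personalized treatment optimization.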

Abstract

Objective. Heart failure is one of the leading causes of death worldwide, with millions of deaths each year, according to data from the World Health Organization (WHO) and other public health agencies. While significant progress has been made in the field of heart failure, leading to improved survival rates and ejection fraction, there remain substantial unmet needs owing to the condition's complexity and multifactorial nature. This study aims to propose and evaluate a composable strategy framework for assessment and treatment optimization in heart failure, designed to provide more holistic patient evaluation and management. Approach. The framework leverages multi-modal algorithms to analyze a comprehensive range of patient data, explicitly integrating cine cardiac magnetic resonance (cine CMR) sequences, structured clinical metrics (e.g., lab results, demographics), and unstructured textual records (e.g., medical history, prescriptions). By integrating these various data sources, our framework offers a more holistic evaluation and optimized treatment plan for patients. Main results. The multi-modal framework demonstrates superior accuracy in heart failure (HF) prognosis prediction compared to single-modal AI algorithms. Additionally, it enables a detailed evaluation of the impact of various pathological indicators on HF outcomes. Significance. By integrating heterogeneous clinical data in a systematic manner, this approach supports more comprehensive prognosis assessment and facilitates optimized, personalized treatment planning for heart failure patients.

2410.15236 2026-05-29 cs.CR cs.AI cs.LG (version update)

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

Benji Peng, Hanxuan Chen, Keyu Chen, Qian Niu, Ziqian Bi, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K. Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin, Xinyuan Song, Riyang Bao, Jiacheng Shi

Affiliations: Hunan University, Changsha, PRC; Georgia Institute of Technology, Atlanta, USA; Kyoto University, Kyoto, Japan; Purdue University, West Lafayette, USA; National Taiwan Normal University, Taipei, ROC; University of Liverpool, Suzhou, PRC; Hong Kong University of Science; University of Hawaii, Honolulu, USA; The University of Texas at Dallas, Dallas, USA; University of Wisconsin-Madison, Madison, USA; Emory University, Atlanta, USA; College of William & Mary, Williamsburg, USA

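AI summary: This review surveys the vulnerabilities of large language models to prompt injection and jailbreaking attacks, categorizes attack methods, evaluates defense strategies, and identifies research gaps and future directions.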

DOI: 10.63336/Eureka.47
Journal ref: Eureka 1(1) (2026) 26-61

Abstract

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields such as healthcare, software engineering, and conversational systems. Despite these advancements in the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We roughly categorize attack approaches into prompt-based, model-based, multimodal, and multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross-modality exploits. We also review various defense mechanisms, including prompt filtering, transformation, alignment techniques, multi-agent defenses, and self-regulation, evaluating their strengths and shortcomings. In addition, we discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges such as quantifying attack success in interactive contexts and biases in existing datasets. Identifying current research gaps, we suggest future directions for resilient alignment strategies, advanced defenses against evolving attacks, automation of jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to enhance LLM security and ensure their safe deployment.

2410.10398 2026-05-29 cs.CE cs.AI (version update)

Are LLMs Socially Adaptive? Contrasting Belief Evolution in Large Language Models and Humans

Yu Lei, Hao Liu, Chengxing Xie, Songjia Liu, Zhiyu Yin, Canyu Chen, Guohao Li, Philip Torr, Zhen Wu

Affiliations: Tsinghua University; Department of Psychological and Cognitive Sciences; College AI; School of Management; Fudan University; Stevens Institute of Technology; Northwestern University; University of Oxford

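AI summary: This study proposes FairMindSim, a simulation benchmark grounded in social psychology, and the Belief-Reward Alignment Behavior Evolution Model (BREM), and uses continuous economic games to compare human and LLM decision dynamics. It finds that mid-capability models show rigid aggression marked by over-punishment, whereas frontier models move toward human-like restraint and leniency as reasoning capability improves.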

Comments: KDD 2026 Oral

Abstract

As large language models (LLMs) increasingly engage in complex social interactions, ensuring that their behaviors align with human ethical principles and intentions, known as value alignment, has become a critical scientific challenge. Existing benchmarks often rely on static assessments and fail to capture the longitudinal dynamics of decision-making or the latent cognitive processes driving agent behavior. In this work, we propose FairMindSim, a realistic simulation benchmark rooted in social psychology that evaluates alignment through continuous economic games. To move beyond black-box observations, we introduce the Belief-Reward Alignment Behavior Evolution Model (BREM), a probabilistic framework that formalizes decision-making as a dynamic trade-off between maximizing extrinsic rewards and upholding intrinsic beliefs. We conducted a large-scale comparative study involving 1,017 human participants and ten LLMs, including GPT-5 and Gemini-3-Pro. Our experimental results reveal a capability-linked, nonlinear empirical trend in the Third Party Punishment (TPP) game. Mid-capability models exhibit rigid, algorithmic aggression characterized by over-punishment, while frontier models converge toward restraint and shift toward human-like leniency as reasoning capabilities scale. Furthermore, using BREM, we decompose agents' longitudinal decision dynamics and find that more advanced models better balance conflicting objectives by reducing belief-action inconsistency. Our contributions provide a standardized protocol for psychological stress testing and an interpretable mechanism for analyzing the longitudinal evolution of AI alignment in controlled social dilemma settings.
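
To make the idea of a trade-off between extrinsic reward and intrinsic belief concrete, the sketch below shows one generic way such a choice could be modelled: a logistic rule over a weighted utility. This is not the BREM formulation, which the abstract does not specify; the function name `punish_probability` and the parameters `reward_cost`, `belief_violation`, `belief_weight`, and `temperature` are all hypothetical and chosen only for illustration.

```python
import math


def punish_probability(reward_cost, belief_violation, belief_weight, temperature=1.0):
    """Probability of choosing to punish under a simple logistic trade-off.

    Hypothetical illustration only, not the BREM model from the paper.
    reward_cost:      extrinsic payoff the agent gives up by punishing (>= 0)
    belief_violation: how strongly the observed act violates the agent's
                      fairness belief (>= 0)
    belief_weight:    weight in [0, 1] placed on beliefs versus reward
    temperature:      larger values make the choice noisier (> 0)
    """
    if not 0.0 <= belief_weight <= 1.0:
        raise ValueError("belief_weight must lie in [0, 1]")
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    # Net utility of punishing relative to doing nothing.
    utility = belief_weight * belief_violation - (1.0 - belief_weight) * reward_cost
    # Logistic (two-option softmax) choice rule.
    return 1.0 / (1.0 + math.exp(-utility / temperature))


if __name__ == "__main__":
    # An agent that weights beliefs more heavily punishes more readily.
    for w in (0.2, 0.5, 0.8):
        print(w, round(punish_probability(1.0, 1.0, w), 3))
```

In this toy rule, an agent with a high `belief_weight` punishes even when doing so is costly, while a lower weight yields more restraint; any mapping onto the behaviors reported in the abstract would be an additional assumption.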

2306.10356 2026-05-29 cs.LG cs.AI eess.SP (version update)

MATNet: Multi-Level Fusion Transformer-Based Model for Day-Ahead PV Generation Forecasting

Matteo Tortora, Francesco Conte, Gianluca Natrella, Paolo Soda

Affiliations: Department of Naval, Electrical, Electronics and Telecommunications Engineering, University of Genoa, Via all’Opera Pia 11a, 16145 Genoa, Italy; Unit of Innovation, Entrepreneurship & Sustainability, Department of Engineering, University Campus Bio-Medico of Rome, Via Alvaro del Portillo 21, 00128 Rome, Italy; Computer Systems Department of Engineering, University Campus Bio-Medico of Rome, Via Alvaro del Portillo 21, 00128 Rome, Italy

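AI summary: Proposes MATNet, a multimodal architecture based on a multi-level fusion transformer that uses historical PV data and weather data through multi-level joint fusion and a soft-attention mechanism. On multi-step day-ahead PV generation forecasting it significantly outperforms baseline models (RMSE 0.0445, a relative improvement of about 65%) and shows robustness to missing data and cross-domain zero-shot generalization.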

Abstract

Accurate forecasting of renewable generation is crucial to facilitate the integration of Renewable Energy Sources into the power system. Focusing on photovoltaic (PV) units, forecasting methods can be divided into two main categories: physics-based and data-based strategies, with Artificial Intelligence (AI)-based models providing state-of-the-art performance. However, while these AI-based models can capture complex patterns and relationships in the data, they ignore the underlying physical prior knowledge of the phenomenon. Therefore, in this paper, we propose MATNet, a novel transformer-based multimodal architecture for multi-step day-ahead PV power generation forecasting. The model is fed with historical PV data and historical and forecast weather data through a multi-level joint fusion approach, employing a soft-attention mechanism at multiple fusion stages. We evaluate the effectiveness of MATNet on the Ausgrid benchmark dataset, where it significantly outperforms various baseline models, achieving an RMSE of 0.0445, corresponding to a relative improvement of approximately 65% compared to the best-performing baseline method. The analysis is further enriched by a comprehensive set of ablation studies, a sensitivity analysis on missing data, which highlights MATNet's resilience to input degradation, a cross-site zero-shot generalization evaluation on five external PV datasets, demonstrating MATNet's robustness under significant domain shifts, and an assessment of the model's computational complexity, confirming its favorable balance between predictive accuracy and computational efficiency. These results highlight MATNet's potential as a reliable and efficient solution to facilitate the integration of PV energy into the power grid. The code is available at https://github.com/arco-group/MATNet.
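
The reported figures (an RMSE of 0.0445 and a relative improvement of approximately 65%) can be related with a little arithmetic. The sketch below assumes the common definition of relative improvement, (baseline RMSE - model RMSE) / baseline RMSE; the abstract does not state which definition the authors use, and the helper names `rmse`, `relative_improvement`, and `implied_baseline` are illustrative, not taken from the MATNet repository.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse, model_rmse):
    """Fractional error reduction relative to the baseline (assumed definition)."""
    return (baseline_rmse - model_rmse) / baseline_rmse


def implied_baseline(model_rmse, improvement):
    """Baseline RMSE implied by a model RMSE and a fractional improvement."""
    return model_rmse / (1.0 - improvement)


if __name__ == "__main__":
    # Under the assumed definition, the abstract's numbers imply this baseline.
    baseline = implied_baseline(0.0445, 0.65)
    print(f"implied baseline RMSE: {baseline:.4f}")
    print(f"check: {relative_improvement(baseline, 0.0445):.2f}")
```

Under this assumption, the best baseline's RMSE would be roughly 0.0445 / 0.35, or about 0.127; this is a derived estimate that depends on the "approximately 65%" rounding, not a number reported in the abstract.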
