arXivDaily: daily arXiv digest, updated Monday through Friday
2606.01528 2026-06-02 cs.AI

Joint Agent Memory and Exploration Learning via Novelty Signals

Shizuo Tian, Xiaohong Weng, Rui Kong, Yuxuan Chen, Guohong Liu, Yuebing Song, Jiacheng Liu, Yuchen Li, Dawei Yin, Ting Cao, Yunxin Liu, Yuanchun Li

Affiliations: Tsinghua University; Sun Yat-sen University; Baidu Inc.; Tongji University; Peking University

AI summary: Proposes the JAMEL framework, which uses novelty signals to jointly train an agent's memory and exploration policy, achieving efficient exploration in open-ended environments and generalizing to unseen environments.

Abstract

In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally expensive over long trajectories. While latent memory offers a solution to compress interaction histories, its training lacks reliable supervisory signals. We introduce Joint Agent Memory and Exploration Learning (JAMEL), a framework that trains agentic memory and exploration policy together through novelty-driven interaction. We observe that memory and exploration form a mutually dependent loop: sustained exploration requires memory to distinguish exhausted behaviors from unseen ones, while novelty-seeking interaction provides the supervision needed to make memory useful for future exploration. By utilizing deterministic and persistent novelty signals such as code coverage in the GUI domain, we provide natural, annotation-free supervision for the memory module. Empirical evaluations demonstrate that JAMEL successfully generalizes to unseen environments. Its exploration capability outperforms open-weight baselines and rivals the exploration depth of a closed-source model while reducing token consumption. Our code and model are open-sourced at https://github.com/MobileLLM/JAMEL.
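To make the idea of a deterministic, persistent novelty signal concrete, here is a minimal sketch of a coverage-based reward. It is not taken from the JAMEL code: the class name `CoverageNovelty`, the string identifiers for code units, and the choice to count newly covered units as the reward are all assumptions made for illustration.

```python
from typing import Iterable, Set


class CoverageNovelty:
    """Hypothetical tracker: rewards only code units covered for the first time."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()

    def reward(self, covered: Iterable[str]) -> int:
        # Units hit by this step that were never hit before.
        new_units = set(covered) - self.seen
        # Persist them so repeating the same behavior earns nothing later.
        self.seen |= new_units
        return len(new_units)


tracker = CoverageNovelty()
r1 = tracker.reward(["Login.onClick", "Login.validate"])  # both units are new
r2 = tracker.reward(["Login.onClick", "Home.render"])     # only Home.render is new
r3 = tracker.reward(["Login.onClick"])                    # nothing new
```

Because the reward depends only on the set of units already seen, replaying an exhausted behavior always yields zero, which is the property the abstract relies on to supervise memory without annotation.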

2606.01527 2026-06-02 cs.LG cs.CR

Near-Optimal Pure Machine Unlearning for Smooth Strongly Convex Losses

Matthew Regehr, Gautam Kamath, Andrew Lowy

Affiliations: University of Waterloo; Vector Institute; CISPA Helmholtz Center for Information Security

AI summary: For approximate ε-unlearning in smooth strongly convex stochastic optimization, the paper nearly resolves the fundamental statistical cost of unlearning by proving upper and lower bounds on the excess population risk (tight up to a condition-number factor), and gives an unlearning algorithm that is exponentially more accurate than retraining from scratch and differentially private baselines when ε ≫ d.

Abstract

Machine unlearning is motivated by legal and user-facing requirements to remove the influence of individuals' data from trained models, such as the right to be forgotten. Prior work has developed algorithms and error bounds for unlearning in smooth strongly convex stochastic optimization, but the fundamental statistical cost of unlearning has remained unclear. We nearly resolve this problem by proving upper and lower bounds on the excess population risk of approximate $\varepsilon$-unlearning; our bounds are tight up to a condition-number factor. For mean estimation over the unit ball, our upper and lower bounds match. The optimal rate is the usual statistical error plus an unlearning penalty that interpolates between the retraining-from-scratch rate and an exponentially smaller term as $\varepsilon/d$ grows, where $d$ is the dimension of the model. In particular, when $\varepsilon \gg d$, our $\varepsilon$-unlearning algorithm offers an exponential accuracy improvement over retraining the model from scratch and differentially private baselines. On the other hand, when $\varepsilon \le d$, retraining from scratch is optimal.

2606.01526 2026-06-02 cs.RO

Spatio-Temporal Reconnection for Multi-Robot Networks using Adaptive Prescribed-Time CBFs

Hao Liu, Yupeng Yang, Yanze Zhang, Wenhao Luo

Affiliations: Department of Computer Science, University of Illinois Chicago; Department of Computer Science, University of North Carolina at Charlotte

AI summary: Proposes an adaptive prescribed-time control barrier function framework that lets a multi-robot system disconnect and reconnect communication within an adjustable prescribed time, combined with a triggering mechanism that improves task efficiency.

Comments: 6 pages, 6 figures, accepted by IFAC 2026

Abstract

In multi-robot systems, maintaining persistent communication graph connectivity is often overly restrictive, especially when robots have limited communication ranges but operate in large environments. Instead, allowing robots to temporarily disconnect and later reconnect is often more desirable for efficient task execution while still ensuring timely information sharing across the team. In this paper, we propose an adaptive prescribed-time control barrier function (adaptive PT-CBF) framework that enables robots to temporarily disconnect and re-enter the communication range within an adjustable and feasible prescribed time. Moreover, we introduce a reconnection triggering mechanism that jointly considers task execution and reconnection urgency, thereby providing a principled way to decide when reconnection should occur. Theoretical analysis establishes convergence to a satisfactory reconnection within a prescribed finite time. Experimental results validate the proposed adaptive PT-CBF, showing improved task efficiency and satisfactory reconnections.

2606.01525 2026-06-02 cs.LG stat.ML

Semi-Supervised Hyperbolic Hierarchical Clustering with Set-Level Structural Priors

Junjing Zheng, Xinyu Zhang, Xiangfeng Qiu, Chengliang Song, Weidong Jiang

Affiliations: College of Electronic Science and Technology, National University of Defense Technology

AI summary: Proposes a semi-supervised hyperbolic hierarchical clustering method that introduces sets as basic modeling units and uses set-level structural priors derived from leaf-level supervision to guide learning of the non-leaf hierarchy, improving label consistency and tree quality.

Abstract

Semi-supervised hierarchical clustering aims to learn a tree structure consistent with data patterns and user-provided supervision. Supervision is usually given as leaf-level relations, such as pairwise must-link/cannot-link constraints or triplet-wise must-link-before constraints. Although useful for regulating local sample relations, such supervision does not directly indicate which samples should form coherent subtrees. Consequently, the non-leaf structure of the learned tree may deviate from the hierarchical organization preferred by ground-truth labels. To address this limitation, we propose a semi-supervised hyperbolic hierarchical clustering method with set-level structural priors. The main contribution is to introduce sets as basic modeling units for hierarchy learning. Each set denotes samples expected to cohere within a subtree and is induced from leaf-level supervision together with a learned constraint-consistent similarity structure. These sets act as soft structural priors for subtree-level supervision, allowing supervision to guide non-leaf hierarchy formation beyond local leaf-level relations. Specifically, we first learn constraint-consistent embeddings to obtain a reliable set partition, then construct constraint-induced sets and estimate inter-set similarities to form set-level structural priors. Finally, these priors are incorporated into a hyperbolic hierarchy objective for continuous tree optimization. Experiments on eleven benchmark datasets and ablation studies show that the proposed method consistently improves label consistency over representative hierarchical clustering baselines while also enhancing similarity-based tree quality.

2606.01521 2026-06-02 cs.LG stat.ML

Fast Generalization after Interpolation via Critically Damped Momentum Optimization

Luca Muscarnera, Silas Ruhrberg Estévez, Yuanzhang Xiao, Mihaela Van der Schaar

Affiliations: University of Cambridge; University of Hawaii at Manoa

AI summary: Proposes GROKtimizer, a biphasic strategy that combines rapid convergence to interpolation with post-interpolation norm minimization via critically damped momentum; under a local quadratic model it achieves a quadratic speedup over classical gradient descent and selects low-norm interpolating solutions to improve generalization.

Abstract

A central problem in machine learning is that models can achieve near-perfect training performance while generalizing substantially less well to unseen examples. This gap is especially acute in high-dimensional, low-sample regimes, where many interpolating solutions exist and optimization must implicitly select among minima with different generalization properties. Following recent theoretical advances on optimization dynamics near the interpolation threshold, we note that the two-regime structure of risk minimization, with loss minimization followed by complexity minimization, motivates a biphasic optimization schedule. We thus theoretically demonstrate that GROKtimizer, a biphasic strategy that combines rapid convergence to interpolation with Critically Damped Momentum (CDM)-based post-interpolation norm minimization, offers a natural solution for selecting low-norm interpolating solutions. Under a local quadratic model of the post-interpolation basin, GROKtimizer provides a quadratic speedup over classical gradient descent, with provable optimality among first-order optimizers. To showcase the applicability of our method, we evaluate GROKtimizer on several synthetic benchmarks common in the classical grokking literature and on various real-world datasets. Finally, we reconcile our findings with the flat-minima hypothesis, highlighting the importance of post-interpolation dynamics in the construction of high-quality, generalizing models.
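The quadratic speedup claimed under a local quadratic model can be illustrated on a toy problem. The sketch below is not GROKtimizer and does not reproduce the paper's CDM update, which the abstract does not specify. As a stand-in it uses Polyak's heavy-ball method with its classical tuning, for which the slowest and fastest modes of a quadratic have a repeated characteristic root, i.e. they are critically damped. The function name `steps_to_converge` and the curvature values are assumptions chosen for illustration.

```python
import math


def steps_to_converge(curvatures, step, momentum, tol=1e-6, max_steps=100_000):
    """Run heavy-ball descent on f(x) = 0.5 * sum(c_i * x_i**2) from x = 1.

    Returns the number of iterations until every coordinate is below `tol`.
    With momentum=0.0 this is plain gradient descent.
    """
    x = [1.0] * len(curvatures)
    prev = list(x)
    for t in range(max_steps):
        if max(abs(v) for v in x) < tol:
            return t
        # The gradient of the separable quadratic is c_i * x_i.
        nxt = [xi - step * c * xi + momentum * (xi - pi)
               for xi, pi, c in zip(x, prev, curvatures)]
        prev, x = x, nxt
    return max_steps


mu, L = 1.0, 100.0            # smallest and largest curvature (condition number 100)
curv = [mu, 10.0, L]

# Plain gradient descent with the classical step size 2 / (L + mu).
gd_steps = steps_to_converge(curv, step=2.0 / (L + mu), momentum=0.0)

# Polyak heavy-ball tuning: the extreme modes get a repeated root (critical damping).
hb_step = 4.0 / (math.sqrt(L) + math.sqrt(mu)) ** 2
hb_mom = ((math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))) ** 2
hb_steps = steps_to_converge(curv, step=hb_step, momentum=hb_mom)

print(f"gradient descent: {gd_steps} steps, heavy ball: {hb_steps} steps")
```

In this setting the per-step contraction of gradient descent scales with the condition number, while the momentum variant scales with its square root, which is the sense in which the speedup is quadratic.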

2606.01520 2026-06-02 cs.AI

TERRA: Task-Embedded Reasoning and Representation Architecture for Cross-Domain Applications

Shayan Shokri

Affiliations: Humanpath Labs Inc.

AI summary: Proposes the TERRA architecture, which formalizes the cross-domain transfer question, measures how closely structured-state domains correspond using a lax bisimulation discrepancy and the Gromov-Wasserstein distance, and derives transfer bounds on prediction error and decision regret, turning a widely repeated intuition into testable theory.

Abstract

A single action-conditioned latent predictive architecture can in principle be trained on the structured state of a driving scene, a robot workspace, or a financial order book. The ingredients for doing so within any one domain already exist and are individually validated: masked-latent prediction, action-conditioned latent world models, discrete action tokenization, and joint-embedding prediction on voxelized state. What is not established, and what TERRA addresses, is the transfer question: when does a representation or predictor learned in one structured-state domain carry over to a structurally analogous but otherwise unrelated domain, and by how much. We give this question a formal treatment. We model each domain as a controlled Markov process on a graded latent grid, factor any instantiation into thin domain adapters and a shared domain-invariant core, and identify a cross-domain correspondence with an approximate Markov decision process homomorphism whose quality is measured by a lax bisimulation discrepancy and, for domains lacking a shared coordinate system, by a Gromov-Wasserstein distance between their action-conditioned transition operators. Under a Lipschitz predictor we derive a transfer bound that separates source-model error from structural mismatch, grows geometrically in the prediction horizon, and is certified from below by the Gromov-Wasserstein distance; we then connect latent error to decision regret through the Lipschitz value property of bisimulation metrics. The resulting Structured-State Transfer Hypothesis is stated as a falsifiable claim with a preregistered experimental program, centered on a transfer test from driving scenes to order books, including conditions under which it is refuted. We present no empirical results: this is a research proposal that converts a widely repeated intuition into testable theory.

2606.01518 2026-06-02 cs.CV cs.GR

MotionDreamer: Universal Skeletal Motion Generation for 3D Rigged Shapes

Ye Tao, Yuxin Yao, Kendong Liu, Dapeng Wu, Junhui Hou

Affiliations: City University of Hong Kong

AI summary: Proposes MotionDreamer, a diffusion-based framework that generates category-agnostic skeletal animation from 2D video through a structural-semantic injection mechanism, and builds a large-scale dynamic dataset, enabling high-fidelity motion synthesis across morphologies.

Comments: 18 pages, 7 figures

Abstract

Motion generation for rigged shapes is vital for scalable 4D asset production. However, template-based methods are limited by specific topologies and fail to generalize across diverse morphologies. Conversely, per-case optimization is computationally expensive, susceptible to local optima, and highly sensitive to viewpoint-induced ambiguities. In this paper, we present MotionDreamer, a diffusion-based framework designed for category-agnostic skeletal animation generation from 2D video guidance. To overcome the scarcity of high-quality training data, we have curated a large-scale dynamic dataset comprising approximately 20,000 diverse 3D models, each featuring complete textures, skeletal rigging, and a wide array of comprehensive animation sequences. To bridge the kinematic gap between 2D visual motion cues and heterogeneous 3D skeletal structures, we propose a structural-semantic injection mechanism. Our model integrates texture and semantic attributes directly into skeletal joint representations. This allows it to map perceived visual dynamics to specific joint hierarchies and their functional roles. This enables MotionDreamer to synthesize high-fidelity animations that maintain anatomical consistency across a vast range of unseen categories, from existing biological species to fantastical beings. Extensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art benchmark for robust and efficient 4D asset generation. The code will be made publicly available upon acceptance.

2606.01509 2026-06-02 cs.LG cs.AI

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng

Affiliations: Imperial College London; University of Waterloo; EPFL

AI summary: Proposes ProbMoE, a probabilistic routing framework that selects experts by probabilistic inference over a discrete subset space, addressing the discrete, non-differentiable nature of top-k routing; it extends to dynamic-k routing and improves expert utilization and routing diversity.

Comments: Accepted at ICML 2026

Abstract

Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, requiring gradient estimators for expert selection whose design remains a central open problem. We introduce ProbMoE, a probabilistic routing framework that models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. We first propose ProbMoE Exact-$k$ routing, which samples $k$-expert subsets in the forward pass, and the backward pass uses gradients through each expert's exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic-$k$ routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact-$k$ achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic-$k$ achieves comparable performance with fewer activated experts.
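The notion of an expert's exact marginal probability under a distribution over $k$-subsets can be shown with a brute-force sketch. The abstract does not state the form of the subset distribution, so the code assumes one for illustration: a subset is drawn with probability proportional to the product of its members' exponentiated logits. The function name `exact_k_marginals` is hypothetical, and enumerating all subsets is only feasible for a handful of experts; the paper presumably computes these marginals more efficiently.

```python
from itertools import combinations
from math import exp, prod


def exact_k_marginals(logits, k):
    """Marginal inclusion probability of each expert.

    Assumed model: a k-subset S is drawn with probability proportional to
    prod(exp(logits[i]) for i in S). Computed by full enumeration.
    """
    weights = [exp(z) for z in logits]
    n = len(weights)
    subsets = list(combinations(range(n), k))
    masses = [prod(weights[i] for i in s) for s in subsets]
    total = sum(masses)
    marginals = [0.0] * n
    for s, m in zip(subsets, masses):
        for i in s:
            # Expert i accumulates the probability of every subset containing it.
            marginals[i] += m / total
    return marginals


marginals = exact_k_marginals([2.0, 1.0, 0.0, -1.0], k=2)
print([round(p, 3) for p in marginals])
```

Since every sampled subset has exactly $k$ members, the marginals always sum to $k$, and each one is a smooth function of the logits, which is what makes it usable as a gradient surrogate for the discrete selection.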

2606.01503 2026-06-02 cs.CV cs.AI cs.CL

On the Limits of Token Reduction for Efficient Unified Vision Language Training

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

Affiliations: University of Michigan; Sony AI

AI summary: By analyzing layerwise attention allocation, the paper finds an asymmetry in token redundancy between visual understanding and visual generation and designs task-specific accelerators; under unified training, however, task-specific token dropping causes a synergy loss, indicating that efficient unified modeling must preserve shared cross-task structure.

Abstract

Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training -- task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: https://chicychen.github.io/TokenReductionUnifiedVLM/.

2606.01498 2026-06-02 cs.CL cs.AI

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li, Yilei Shao, Stefan Zohren, Anna Vettoruzzo, Joaquin Vanschoren, Ming Jin, Qingsong Wen

Affiliations: University of Oxford; VulpiVox Intelligence; Eindhoven University of Technology; Griffith University; Squirrel Ai Learning; East China Normal University

AI summary: Proposes TimeSage-MT, a multi-turn benchmark with 240 tasks and 2,680 dialogue turns across 8 real-world domains, for evaluating LLM agents on time series reasoning; it reveals performance drops on decision-oriented tasks and weaknesses in memory and uncertainty handling.

Abstract

Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through natural language and tools, it remains unclear whether they can conduct reliable time series analysis across multi-turn conversations. Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows where user goals evolve, agents must build on prior analyses, and conclusions emerge from accumulated evidence. In this work, we introduce TimeSage-MT, a multi-turn benchmark for agentic time series reasoning with 240 tasks and 2,680 dialogue turns across 8 real-world domains, spanning basic exploration to decision-oriented analysis. TimeSage-MT is built through a reproducible pipeline that converts real-world time series data into multi-turn conversations with verifiable answers. It provides a unified evaluation protocol and public leaderboard for comparing time series agentic systems. To demonstrate the benchmark's utility, we evaluate frontier LLMs alongside TimeSage, a novel structured agent equipped with a comprehensive time series skill library. The results show sharp performance drops on decision-oriented tasks, driven by failures in memory, uncertainty handling, and domain-based decision making. TimeSage-MT exposes critical gaps in current agentic reasoning and provides a rigorous foundation for future development.

2606.01493 2026-06-02 cs.CV

Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo

Hao Liang, Zhixuan Ge, Soumendu Majee, Joanna Li, Ashok Veeraraghavan, Guha Balakrishnan

Affiliations: Rice University; Samsung Research America

AI summary: Proposes SplatShot, a training-free method that couples 3D Gaussian Splatting with the denoising process of a diffusion model to generate multi-view-consistent, photorealistic 3D face avatars from a single photo.

Comments: 28 pages, 15 figures

Abstract

Reconstructing a photorealistic 3D face avatar from a single unconstrained photograph is challenging: feed-forward 3D Gaussian Splatting (3DGS) models degrade on out-of-distribution inputs, while pretrained diffusion models produce high-fidelity images but lack multi-view consistency. We observe that these paradigms are fundamentally complementary: explicit 3D representations guarantee geometric consistency, whereas 2D diffusion priors ensure photorealism. Building on this, we propose SplatShot, a training-free framework that couples these representations directly within the denoising process. Given a base 3DGS face model and a single reference image, we jointly denoise all target views using a per-step 3D feedback loop. At each timestep, we predict clean images from the noisy latents, refit the 3DGS to these multi-view predictions, and back-propagate the photometric discrepancy between the 3DGS re-renderings and 2D predictions into the noise estimate. This steers the sampling trajectory toward strictly 3D-coherent, identity-faithful outputs. Experiments on diverse in-the-wild images demonstrate that SplatShot produces 3D avatars with superior identity preservation, photorealism, and multi-view consistency.

2606.01485 2026-06-02 cs.CV cs.LG

Perception First: A Frontier Native-Video Model with Self-Consistency for Implicit Video Question Answering

Ali Alavi

Affiliations: The Ohio State University

AI summary: Through systematic experiments, the paper finds that implicit video question answering benchmarks are perception-bound rather than reasoning-bound, and that base-model perceptual capability and lightweight test-time denoising are the only reliable levers.

Abstract

We describe our submission to the VRR Challenge @ CVPR 2026, built on the ImplicitQA / VRR-QA benchmark: multiple-choice video question answering in which answers are deliberately not observable in any single frame and must be inferred from spatial layout, motion, depth, viewpoint, causality, and social context across discontinuous frames of creative video. We conduct a systematic, training-free study spanning open-source Video-LMMs (Qwen2.5-VL, Qwen3-VL, InternVL3, Gemma-3, and the RL-tuned video reasoners Video-R1 and VideoChat-R1.5) and a battery of inference-time strategies (chain-of-thought, question decomposition, describe-then-reason cascades, audio transcripts, spatial state prompting, self-consistency, multi-model ensembling, and category routing). Our central finding is that this benchmark is perception-bound rather than reasoning-bound: reasoning-side augmentations are neutral-to-harmful, whereas base-model perceptual capability and lightweight test-time denoising are the only reliable levers. A per-category error analysis localizes the difficulty to low-level perception -- relative depth, viewpoint, and counting are the hardest categories, while causal and social reasoning are nearly solved -- and a prompt that explicitly injects monocular depth cues to attack the weakest category lowers test accuracy by 5.8 points, confirming that the model needs a better percept, not a better procedure.
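Self-consistency, one of the inference-time strategies listed above, is commonly implemented as a majority vote over several sampled answers to the same question. The sketch below shows that voting step only. The function name `self_consistent_answer` and the example answer letters are illustrative, and how the submission actually samples or aggregates is not described in the abstract.

```python
from collections import Counter


def self_consistent_answer(samples):
    """Return the most frequent answer among sampled outputs and its vote share."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Five hypothetical samples for one multiple-choice question.
answer, share = self_consistent_answer(["B", "C", "B", "B", "A"])
print(answer, share)
```

The vote share can double as a rough confidence signal, for example to flag questions on which the samples disagree.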

2606.01483 2026-06-02 cs.LG cs.AI eess.AS

MURMUR: An Efficient Inference System for Long-Form ASR

Wei-Tzu Lee, Keisuke Kamahori, Baris Kasikci

Affiliations: University of Washington

AI summary: Proposes the MURMUR inference system, which optimizes at the inter-chunk and intra-chunk levels to substantially reduce long-form ASR latency while maintaining high accuracy.

Abstract

Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. At the inter-chunk level, we revisit the chunk-based pipeline for modern long-context ASR, treating chunk size as a tunable hyperparameter, and show that intermediate chunk sizes strike a good balance of accuracy and latency. At the intra-chunk level, we exploit attention sparsity through a sliding window KV cache eviction policy applied to both output and speech tokens. On AMI-IHM, Murmur matches single-pass accuracy while reducing latency by 4.2x, with further gains from token eviction at less than 1% relative tcpWER degradation. The code of Murmur is available at https://github.com/uw-syfi/Murmur.
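A sliding-window KV cache eviction policy keeps attention state for only the most recent tokens and drops the rest. The sketch below illustrates the general idea with a fixed-size window. It is not Murmur's implementation: the class name `SlidingWindowKVCache`, the window size, and the string placeholders for keys and values are assumptions, and the real system applies its policy to both output and speech tokens in ways the abstract does not detail.

```python
from collections import deque


class SlidingWindowKVCache:
    """Hypothetical cache that keeps key/value entries for the last `window` tokens."""

    def __init__(self, window: int) -> None:
        # A bounded deque evicts its oldest entry automatically on overflow.
        self.entries = deque(maxlen=window)

    def append(self, position: int, key, value) -> None:
        self.entries.append((position, key, value))

    def positions(self):
        return [pos for pos, _, _ in self.entries]


cache = SlidingWindowKVCache(window=4)
for pos in range(10):
    cache.append(pos, key=f"k{pos}", value=f"v{pos}")
kept = cache.positions()
print(kept)
```

Memory and attention cost then stay bounded by the window size rather than growing with the length of the audio, at the price of discarding older context.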

2606.01482 2026-06-02 cs.CL

Beyond Topical Similarity: Contrastive Evidence Retrieval with Interpretable Attention Alignment in RAG

Francielle Vargas, João Robiatti, Diego Alves, Lucas Pascotti Valem, Maximilian Seeth, Sebastián Ferrada, Ameeta Agrawal, Daniel Pedronette, André Freitas

Affiliations: University of Chile; São Paulo State University; Saarland University; University of Munich; Portland State University; Idiap Research Institute

AI summary: Proposes the CERA framework, which injects an evidential inductive bias through subjectivity-based hard negative selection and an auxiliary attention alignment loss, enabling interpretable and factually accurate retrieval.

Abstract

Ensuring factuality and interpretability in RAG remains an open and urgent problem. We introduce Contrastive Evidence Rationale Attention (CERA), the first retrieval framework to employ subjectivity-based hard negative selection and inject an evidential inductive bias into contrastive learning through an auxiliary attention alignment loss. CERA fine-tunes a dense retriever using two training objectives: triplet-based contrastive learning and interpretable attention alignment, which supervises CLS-to-token attention using a part-of-speech-weighted masking distribution over human-annotated factual rationales as evidence signals. Experiments on a large corpus of clinical trial reports demonstrate that the subjectivity-based hard negative selection substantially improves retrieval effectiveness compared to both Contriever and hard negative selection baselines. Furthermore, rationale alignment improves faithfulness while maintaining competitive retrieval performance, supporting the hypothesis that attention can serve as a more faithful explanation of model behavior when guided by human rationales. Moving beyond topical similarity, CERA enables the retriever to identify the specific tokens that constitute supporting evidence, promoting more interpretable evidence selection in RAG systems.
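The triplet-based contrastive objective mentioned above can be sketched in its standard hinge form. The abstract does not give CERA's exact loss, so cosine distance, the margin of 0.2, and the two-dimensional toy embeddings are all assumptions; the attention alignment term is not shown.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def triplet_loss(query, positive, negative, margin=0.2):
    """Hinge loss: zero once the positive is closer than the negative by `margin`."""
    return max(0.0, margin + cosine_distance(query, positive)
               - cosine_distance(query, negative))


q = [1.0, 0.0]
# An easy negative points away from the query, so the margin is already met.
easy = triplet_loss(q, positive=[1.0, 0.1], negative=[0.0, 1.0])
# A hard negative is almost as close as the positive, so the loss stays positive.
hard = triplet_loss(q, positive=[1.0, 0.1], negative=[1.0, 0.2])
print(easy, hard)
```

The contrast between the two cases shows why hard negative selection matters: easy negatives contribute no gradient, while negatives close to the query keep the loss active.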

2606.01481 2026-06-02 cs.CV

SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation

Yingzi Ma, Xiaogeng Liu, Yawen Zheng, Chaowei Xiao

Affiliations: University of Wisconsin-Madison; Tsinghua University; Johns Hopkins University

AI summary: Addresses the problem that safe text and image combinations can still yield harmful content in image-conditioned text-to-video generation; proposes the SafeGen-Bench benchmark, defines 10 malicious categories, and evaluates existing models, finding that current models struggle to avoid generating malicious content and that unimodal guardrails are an insufficient defense.

Comments: 8 pages, 7 figures, 2 tables

Abstract

With the rapid advancements in text-to-image diffusion models, generative video models (T2V models) like Sora can now produce short synthetic videos from a text prompt or an initial image. However, synthetic video generation, especially when guided by an initial image, often poses risks, including the potential creation of illegal, politically sensitive, or unethical content. Existing benchmarks have started to consider the safety of generated videos, but they primarily focus on testing models with malicious text prompts, ignoring the scenario where a combination of text prompt and image may still lead to harmful video content. In practice, this is a common and challenging issue: videos generated from safe text and image inputs can nonetheless convey harmful information. To bridge this gap, we introduce SafeGen-Bench, a benchmark specifically designed to evaluate the safety of conditional T2V models. Our benchmark defines 10 malicious categories, concentrating on risks related to both temporal sequences and depicted behaviors. SafeGen-Bench consists of carefully selected start frames from diverse image and video sources, paired with corresponding text prompts to simulate realistic inputs. We evaluate a variety of conditional T2V models on SafeGen-Bench, and the results indicate that current models struggle to consistently avoid generating malicious content, with unsafety scores reaching up to 44.5, especially under conditions requiring high quality. Furthermore, we assess the effectiveness of both text-based and image-based guardrails on our benchmark, finding that unimodal guardrails alone are insufficient to provide a robust defense, with an 80% failure rate across seven malicious categories. We hope that SafeGen-Bench will foster the development of safer and more controllable conditional T2V models.

2606.01479 2026-06-02 cs.CL

Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech

Hongfei Du, Jiacheng Shi, Sidi Lu, Gang Zhou, Ye Gao

Affiliations: University of California, Berkeley

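AI summary: The paper uses sparse autoencoders to analyze emotion-related latent features in LLM-based TTS models and proposes a feature-level intervention framework for bidirectional emotion induction and suppression without modifying model parameters.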

Comments Accepted by ICML 2026

Abstract

Integrating large language models (LLMs) into text-to-speech (TTS) systems has improved speech expressiveness, yet interpretable emotional control remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into the internal representations underlying emotional control. In this work, we analyze emotion-related variation in the semantic hidden states of LLM-based TTS models using sparse autoencoders (SAEs) to identify sparse latent features. Our analysis shows that emotional variation is distributed across multiple sparse latent features, while intervening on a small subset enables interpretable emotion control. Building on this observation, we introduce a feature-level intervention framework for bidirectional emotion induction and suppression without modifying backbone parameters. We further show that distinct latent features are associated with specific acoustic attributes (e.g., pitch), suggesting that emotional expression arises from coordinated latent contributions rather than a single global shift. Empirically, steering these sparse latent features achieves comparable or superior emotion induction and suppression performance relative to global steering and existing TTS baselines.

2606.01473 2026-06-02 cs.AI cs.HC

A Minimalist Brain-Computer Musical Interface for Real-Time Emotion-Driven Sonification: System Design and Preliminary Evaluation

Pablo A. Monroy-D'Croz, Rafael Ramirez-Melendez, Julian Cespedes-Guevara

Affiliations: GitHub

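AI summary: The paper presents a minimalist brain-computer musical interface that estimates emotional valence in real time from prefrontal EEG activity and maps it to musical features; the experiment finds that frontal alpha asymmetry does not reliably distinguish instructed emotional states.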

Abstract

This paper presents a minimalist brain-computer musical interface (BCMI) that functions as a real-time affective sonification system, translating prefrontal EEG activity into adaptive music. Emotional valence is estimated from frontal alpha asymmetry (AF7/AF8) and mapped to musical features such as mode, tempo, rhythmic density, and pitch register through a stochastic generative algorithm. The system integrates wireless EEG acquisition, real-time Python signal processing, and Ableton Live-based music generation synchronized via Lab Streaming Layer. An experiment with 22 participants investigated whether intentional emotional self-induction could modulate the BCMI neurofeedback signal. Linear mixed-effects analyses found no significant effects of target emotion or time, indicating that the frontal alpha asymmetry signal did not reliably distinguish instructed emotional states. Individual differences, including musical training and acting experience, explained more variance than the experimental manipulation, which accounted for only 0.40% of total signal variance. These findings highlight the challenges of using frontal alpha asymmetry as a voluntary control signal for closed-loop emotion regulation and suggest methodological directions for future BCMI research.
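As a rough illustration of the valence estimate described above, the sketch below computes a frontal alpha asymmetry score as ln(right alpha power) minus ln(left alpha power). That convention, the 8 to 13 Hz band, the naive DFT, and the helper names `band_power`, `frontal_alpha_asymmetry`, and `valence_to_mode` are assumptions made for illustration; the abstract does not state the paper's exact formula or its mapping to musical mode.

```python
import math

def band_power(signal, fs, lo=8.0, hi=13.0):
    """Mean squared DFT magnitude over bins whose frequency lies in [lo, hi] Hz."""
    n = len(signal)
    power, count = 0.0, 0
    for k in range(1, n // 2):
        freq = k * fs / n
        if lo <= freq <= hi:
            re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(signal))
            im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(signal))
            power += (re * re + im * im) / (n * n)
            count += 1
    return power / count

def frontal_alpha_asymmetry(left, right, fs):
    """Assumed convention: ln(right alpha power) - ln(left alpha power)."""
    return math.log(band_power(right, fs)) - math.log(band_power(left, fs))

def valence_to_mode(faa):
    """Hypothetical mapping: positive asymmetry -> major mode, otherwise minor."""
    return "major" if faa > 0 else "minor"

def sine(amplitude, freq, fs, n):
    """Synthetic test signal standing in for one EEG channel."""
    return [amplitude * math.sin(2 * math.pi * freq * i / fs) for i in range(n)]

fs, n = 256, 256
af7 = sine(2.0, 10.0, fs, n)  # left channel, stronger 10 Hz alpha
af8 = sine(1.0, 10.0, fs, n)  # right channel, weaker 10 Hz alpha
faa = frontal_alpha_asymmetry(af7, af8, fs)
mode = valence_to_mode(faa)
```

With the synthetic channels above, the left channel carries four times the alpha power of the right, so the score is ln(1/4) and the hypothetical mapping selects the minor mode. A real pipeline would use windowed spectral estimates on streamed EEG rather than a single clean sinusoid.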

2606.01469 2026-06-02 cs.CL

Peacemaker at ATE-IT: Automatic term extraction from Italian text for waste management data using encoder model

Mahdi Bakhtiyarzadeh, Hadi Bayrami Asl Tekanlou, Jafar Razmara

Affiliations: Department of Computer Science, University of Tabriz; University of Tabriz

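AI summary: For Task A of the ATE shared task, the paper proposes a low-compute, interpretable automatic term extraction method that fine-tunes an encoder model, achieves balanced performance with few resources, and offers a starting point for low-resource models.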

Comments 9 pages, 2 figures, Published in EVALITA 2026, CEUR Workshop Proceedings Vol. 4195

Journal ref CEUR Workshop Proceedings, Vol. 4195, 2026

Abstract

Automatic term extraction has become increasingly important in modern technology and can be found in virtually every search engine currently available to users. Recent advances have produced promising results; however, accurate labeling remains difficult because of several factors, such as the limited number of annotated documents available for training and the complexity of extracting multi-word expressions under domain shift. In this paper, we present a low-cost and interpretable method for automatic term extraction, developed specifically for Task A of the ATE Shared Task. The method uses fine-tuning-based extraction strategies that run on a small amount of computational resources. We evaluate our system using both type-level and micro-level precision, recall, and F1-score, which measure two complementary aspects of extraction performance. According to the experimental results, our approach achieves consistent and balanced performance compared to other teams. Although the technique itself is relatively straightforward, it serves as a good starting point for low-resource models. Overall, the findings suggest that scaled-up models could reach significantly higher performance in the future while retaining interpretability.

2606.01464 2026-06-02 cs.CL

Cross-lingual Self-Consistency for Multilingual Reasoning with Language Models

Ahmed Elhady, Eneko Agirre, Mikel Artetxe

Affiliations: HiTZ Center, University of the Basque Country (UPV/EHU); Reka AI

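AI summary: The paper proposes an unsupervised reinforcement learning method that strengthens multilingual reasoning by requiring the model to give the same answer to cross-lingually equivalent problems, yielding average gains of up to 21.7% on MGSM and strong generalization.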

Comments Paper under review

Abstract

Although the multilingual coverage of LLMs is expanding, their advanced reasoning capabilities remain largely confined to a few high-resource languages like English. To address this, we propose an unsupervised Reinforcement Learning (RL) approach to enhance multilingual reasoning by enforcing cross-lingual self-consistency: the principle that a model should produce the same final answer for equivalent problems in different languages. Existing methods are limited by the scarcity of multilingual reasoning data and show weak generalization to unseen languages. Our approach requires neither gold answers nor parallel data, and it achieves average gains of up to 21.7% on MGSM across 10 languages. In addition, our method demonstrates strong generalization, with an 18.2% mean improvement on MGSM languages unseen during training, and up to 6.2% gain on 3 out-of-distribution benchmarks. These results show the potential of consistency-based methods to improve the multilingual capabilities of LLMs without requiring supervised data.
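One way to picture a cross-lingual self-consistency signal is a majority-vote reward over the final answers a model gives to the same problem in several languages. The sketch below is only an assumption about what such a reward could look like: the abstract does not specify the reward, and the function name `consistency_rewards`, the language codes, and the exact-match comparison are hypothetical.

```python
from collections import Counter

def consistency_rewards(answers_by_language):
    """Reward 1.0 for answers that match the cross-lingual majority answer, else 0.0.

    No gold answer is used: the target is whatever most languages agree on.
    """
    counts = Counter(answers_by_language.values())
    majority_answer, _ = counts.most_common(1)[0]
    return {
        lang: 1.0 if answer == majority_answer else 0.0
        for lang, answer in answers_by_language.items()
    }

# Final answers extracted from rollouts on one problem posed in four languages.
answers = {"en": "42", "es": "42", "eu": "41", "sw": "42"}
rewards = consistency_rewards(answers)
```

Here the Basque rollout disagrees with the other three and receives zero reward. A signal of this kind needs no labels, which matches the abstract's claim that neither gold answers nor parallel data are required, though the real method may weight or normalize agreement differently.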

2606.01462 2026-06-02 cs.AI cs.CL cs.LG

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan

Affiliations: NUS Department of Computer Science; MIT EECS; A*STAR; Singapore-MIT Alliance for Research and Technology (SMART)

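AI summary: Using the VAIR dataset, the paper finds that large reasoning models are markedly deficient at evaluating reasoning and exhibit an answer confirmation bias: they tend to verify that the answer is correct rather than carefully check the reasoning steps.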

Comments 10 pages, 8 figures, 2 tables (Appendix: 19 pages, 13 figures, 3 tables)

Abstract

Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for models' confirmation biases. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.

2606.01461 2026-06-02 cs.LG cs.MA

Genotype-Conditioned Molecular Generation via Evidence-Grounded Multi-Objective Latent Perturbation in Diffusion Models

Brenda Nogueira, Gisela A. Gonzalez-Montiel, Nitesh V. Chawla, Nuno Moniz

Affiliations: Department of Computer Science and Engineering; University of Notre Dame; Department of Chemistry and Biochemistry; Lucy Family Institute for Data & Society

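AI summary: The paper optimizes a learnable perturbation by gradient ascent in the latent space of a pretrained genotype-to-drug diffusion model to maximize a composite reward of drug sensitivity, drug-likeness, and synthetic accessibility, and uses experimental data and an LLM pipeline to check biological plausibility and mechanistic consistency.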

Abstract

Developing effective anticancer therapeutics remains challenging due to tumor heterogeneity and the absence of well-defined molecular targets across cancer subtypes. Generative models conditioned on cancer genotypes offer a promising avenue for personalized drug discovery, yet existing approaches lack explicit optimization for simultaneous sensitivity, synthesizability, and mechanistic binding plausibility. We present a latent-space optimization approach for a pretrained genotype-to-drug diffusion model, introducing a learnable perturbation over the molecular latent space optimized via gradient ascent to maximize a composite reward combining predicted drug sensitivity (AUC), drug-likeness (QED), and synthetic accessibility (SAS). Critically, biological realism is enforced by grounding both reward design and evaluation in experimentally-derived cancer cell line data and validated pharmacologic signals, anchoring candidate generation in real-world clinical evidence. Mechanistic consistency plausibility is further assessed by a multi-agent LLM pipeline grounded in the diffusion model's attention mechanism. Experiments across 15 cancer cell lines from three held-out evaluation sets demonstrate consistent and noticeable improvements over competing baselines in sensitivity, drug-likeness, synthesizability, and chemical validity.

2606.01460 2026-06-02 cs.SD eess.AS

A Lightweight Slot-Attention Framework for Multi-Instrument Multi-Pitch Estimation

Michael Taenzer

Affiliations: University of California, Berkeley

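AI summary: The paper proposes a lightweight slot-attention framework for multi-instrument multi-pitch estimation using Hungarian matching and modular extensions, and validates its instrument-family decomposition on URMP.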

Comments Preprint submitted to the IEEE 28th International Workshop on Multimedia Signal Processing (MMSP). This work has been submitted to the IEEE for possible publication. 6 pages, 2 figures

Abstract

Multi-pitch estimation (MPE) typically predicts which pitches are active in a mixture, but not which instrument or source produced them. This paper investigates a lightweight slot-attention framework for multi-instrument MPE (MI-MPE), where a mixture CQT is mapped to an unordered set of source-like pitch maps. The model uses permutation-invariant Hungarian matching to avoid fixed output semantics and treats the number of slots as an upper bound on the number of active sources. We further study two modular extensions: a self-supervised timbre encoder that provides training-time targets for slot-level timbre embeddings, and a polyphony branch that regularizes the pitch density of mixture- and slot-level predictions. Experiments show that Hungarian matching substantially improves instrument family decomposition on URMP. Stem-level prediction remains more challenging: timbre and polyphony supervision improve selected configurations, but do not consistently resolve source assignment. The results suggest that slot-based architectures are a promising direction for source-aware MPE, while highlighting the need to couple auxiliary musical cues to slot identity more carefully.
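The permutation-invariant matching mentioned in the abstract can be illustrated with a small sketch that assigns each target source to a distinct slot so that the total cost is minimal. To stay dependency-free it uses brute-force enumeration rather than the Hungarian algorithm; both return the same optimal assignment, but brute force is only practical for a handful of slots (in practice `scipy.optimize.linear_sum_assignment` would be the usual choice). The squared-error cost, the toy pitch vectors, and the names `pair_cost` and `best_assignment` are assumptions for illustration.

```python
from itertools import permutations

def pair_cost(pred, target):
    """Squared error between one predicted slot map and one target pitch map."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def best_assignment(pred_slots, targets):
    """Return (cost, assignment) where assignment[j] is the slot matched to target j.

    The number of slots may exceed the number of targets, so slots act as an
    upper bound on the number of active sources; unmatched slots are ignored here.
    """
    best = None
    for perm in permutations(range(len(pred_slots)), len(targets)):
        cost = sum(pair_cost(pred_slots[s], targets[j]) for j, s in enumerate(perm))
        if best is None or cost < best[0]:
            best = (cost, perm)
    return best

# Three slots, two active sources; each vector is a toy 4-bin pitch activation.
pred_slots = [[0, 0, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
targets = [[0, 1, 0, 1], [1, 0, 1, 0]]
cost, assignment = best_assignment(pred_slots, targets)
```

In this toy case target 0 is matched to slot 2 and target 1 to slot 1 at zero cost, regardless of the order in which the slots were emitted, which is the property that lets the model avoid fixed output semantics.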

2606.01458 2026-06-02 cs.RO

LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World

Hojune Kim, Timothy Chen, Jiankai Sun, Lars W. Osterberg, Qianzhong Chen, Ke Wang, Mac Schwager

Affiliations: Stanford University

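AI summary: The paper proposes LEGS, a hybrid simulator that synthesizes training data without teleoperation via a procedural motion-primitive generator and two-stage color calibration, so that VLA policies match or exceed teleoperation-trained policies on real humanoid loco-manipulation tasks.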

Comments https://legsvla.github.io/

Abstract

Training vision-language-action (VLA) policies for humanoid loco-manipulation is constrained by the high cost and complexity of collecting human teleoperation demonstrations. VLA policies fine-tuned in simulators have, until now, failed to transfer effectively in humanoid loco-manipulation tasks. We present LEGS (Loco-manipulation via Embodied Gaussian Splatting), a hybrid simulator that composites a mesh foreground (robot, objects, props) over a photorealistic 3D Gaussian Splatting (3DGS) background reconstructed from a handheld scene capture. LEGS uses a procedural motion-primitive generator to synthesize labeled demonstrations at scale without human teleoperation, and a deterministic two-stage color calibration to align the rendered 3DGS image to the robot's deployment camera. On a Unitree G1 humanoid robot, across three pick-and-place tasks of increasing whole-body difficulty and three VLA backbones (psi_0, pi_0.5, GR00T N1.6), a policy trained purely on LEGS data matches or exceeds one trained on human teleoperation demos on every experiment. It also outperforms a mesh-only simulation baseline that ablates the effect of the 3DGS background, showing that photorealistic rendering is a key enabler for synthetic data transfer. Humanoid motion is recorded independently of scene appearance in LEGS, allowing the same auto-generated demonstrations to be re-rendered under new backgrounds and object meshes--covering a new scene at more than 15x lower cost than teleoperation--to augment training data for robustness to scene variations. Under combined object-and-scene appearance shift, the policy trained on re-rendered LEGS-AUG data maintains task success while the baseline trained on teleoperation data fails entirely. Our project page is located at https://legsvla.github.io/.

2606.01457 2026-06-02 cs.AI cs.LG stat.ML

Transferring Information Across Interventions in Causal Bayesian Optimization

Mohammad Ali Javidian

Affiliations: Computer Science Department

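AI summary: The paper proposes graph-coupled causal Bayesian optimization, which links different intervention effects through uncertainty over shared causal parameters to transfer information across interventions, and proves a low-rank kernel property and a sublinear regret bound for identifiable linear Gaussian causal models.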

Abstract

Bayesian optimization is a popular way to optimize expensive systems, where every experiment, simulation, or intervention costs time or money. In its standard form, it treats the variables we control as plain inputs to a black box and cannot tell apart mere correlation from a real cause and effect. Causal Bayesian optimization closes part of this gap by using a known causal graph together with observational data to decide which variables are worth intervening on. Existing methods, however, learn the effect of each possible intervention almost in isolation, even though in a causal system these effects usually share the same underlying mechanisms. We propose graph-coupled causal Bayesian optimization, which ties the different intervention effects together through the uncertainty we have about a small set of shared causal parameters. The result is a causal kernel that lets evidence collected from one intervention improve our estimate of related interventions. For identifiable linear Gaussian causal models, we show that this kernel has low rank, bounded by the number of shared parameters rather than by the size of the intervention menu. This in turn yields an information-gain bound that grows only logarithmically in the optimization horizon, and a regret bound that cleanly separates three sources of error: optimization, causal estimation, and the choice of which intervention sets to consider. We also describe nonlinear and adaptive extensions. Across theory-aligned Gaussian systems, shared-mechanism stress tests, and standard causal optimization benchmarks, the method keeps the benefits of causal Bayesian optimization while transferring information across related interventions, with the clearest gains when direct interventions on the target's parents are unavailable and sparse interventional data must be reused across a large family of candidate interventions.

2606.01456 2026-06-02 cs.LG cs.CL cs.GT

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

Hamidreza Hasani Balyani, Seyed Pouyan Mousavi Davoudi, Alireza Amiri-Margavi, Amin Gholami Davodi, Arshia Gharagozlou

Affiliations: Amazon Lab126, HW Tech Org.; Computational Modeling and Simulation, University of Pittsburgh; Mathematics & Statistics Department, University of Minnesota Duluth

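AI summary: The paper builds a benchmark from the Crawford-Sobel cheap-talk model to test whether large language models stay honest under preference conflict, and finds that the models over-reveal information and deviate from the strategic optimum.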

Comments 19 pages. Code and data: https://github.com/iHamidHasani/cheap-talk-llm-benchmark

Abstract

Large language models are increasingly deployed as advisors whose objective is not aligned with the user's: recommenders optimize for engagement, sales assistants for purchases, negotiation agents for concessions. Whether such advisors stay truthful when honesty conflicts with their own payoff is a core alignment-evaluation question. We turn the canonical Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence but coarse monotone partitions, with fewer informative intervals as preference conflict grows. A sender observes a state omega in [0,1], wants the receiver's action near omega+b, and sends one costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell: 12,000 sender calls. For the positive-bias grid b in {0.01,0.04,0.08,0.12} the exact most-informative partition sizes are 7,4,3,2, with oracle normalized mutual information 0.5294, 0.3268, 0.2205, 0.1829. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find all four over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x: normalized mutual information stays at 0.78-0.94 where the oracle prescribes 0.18-0.53. Informativeness declines with bias as predicted but never approaches the strategic optimum; rather than coarse partitions, models show near-full revelation with a constant upward offset tracking their bias (linear exaggeration). Payoff-maximizing versus honesty framing has negligible effect. A decoder ablation shows the finding is recoverable only when the receiver reads the sender's stated number: an embedding-only decoder mis-reads the same data as near-babbling.
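The partition sizes quoted in the abstract can be reproduced with the textbook uniform-quadratic Crawford-Sobel case, in which an equilibrium with N intervals exists exactly when 2bN(N-1) < 1 and the interval boundaries are a_i = i/N + 2bi(i-N). The abstract does not say that the benchmark uses this specification, so treating it as uniform state with quadratic loss is an assumption here, as are the function names `max_partition_size` and `partition_boundaries`.

```python
def max_partition_size(b):
    """Largest N with 2*b*N*(N-1) < 1 (assumed uniform state, quadratic loss)."""
    if b <= 0:
        raise ValueError("this sketch covers positive bias only")
    n = 1
    # Grow N while an equilibrium with N + 1 intervals would still exist.
    while 2 * b * (n + 1) * n < 1:
        n += 1
    return n

def partition_boundaries(b):
    """Boundaries a_0..a_N of the most informative partition of [0, 1]."""
    n = max_partition_size(b)
    return [i / n + 2 * b * i * (i - n) for i in range(n + 1)]

sizes = [max_partition_size(b) for b in (0.01, 0.04, 0.08, 0.12)]
boundaries = partition_boundaries(0.12)
```

Under these assumptions the four positive bias levels give 7, 4, 3, and 2 intervals, matching the sizes stated in the abstract, and at b = 0.12 the single interior boundary sits at 0.26, so the sender can only signal "low" or "high". The stated design size is also consistent with 5 × 3 × 200 = 3,000 sender calls per model across four models.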

2606.01451 2026-06-02 cs.CL

Before and After Temperature: A Distributional View of Creative LLM Generation

V. S. Raghu Parupudi, Harsha Ponnada, Aditi Kaushal, S. Shria Parupudi, Saiteja Dasari, Sahiti Bulusu

Affiliations: University of California, San Diego; Indian Institute of Technology, Kharagpur; Delhi Technological University; Carnegie Mellon University; Rutgers University; Georgia Institute of Technology

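AI summary: By analyzing how sampling temperature reshapes the token distribution, the paper proposes a reference-free creativity evaluation method that reaches Spearman ρ=0.918 (LLM judge) and ρ=0.870 (human-majority judge) on Llama-3.1-8B-Instruct, well above existing baselines.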

Comments Submitted to NGEN-AI 2026

Abstract

Reference-free evaluation of large language model (LLM) creativity relies on perplexity, entropy, and top-1 margin. We show that a much stronger signal lives one step earlier in the pipeline: in how sampling temperature reshapes the model's token distribution before the next token is drawn. On Llama-3.1-8B-Instruct generations of 500 open-ended creative prompts at $T \in \{0.3, 0.8, 1.5\}$, a single per-token feature derived from this reshaping predicts the within-prompt creativity rank at Spearman $\rho = 0.918$ against an averaged gpt-4o / gemini-2.5-pro judge ($n = 500$) and $\rho = 0.870$ against a three-rater human-majority ranking ($n = 150$). Each of four standard reference-free baselines (self-perplexity, mean predictive entropy, top-1 margin, gzip compression ratio) tops out at $|\rho| \approx 0.76$ on both ground truths: a gap of $+0.165$ on averaged-LLM and $+0.110$ on human-majority, both far larger than the spread among the baselines themselves. The two ground-truth panels agree with each other at $\rho = 0.83$, above the inter-human ceiling of $\rho = 0.77$, so the comparison is not bottlenecked by judge noise. Mechanistically, the win comes from a sharp distributional signature of the incoherence regime: at $T = 1.5$ the cumulative-mass width $n_{95}(q)$ inflates from $\sim 1$ to $\sim 131$ tokens and post-temperature mass leaks off the pre-temperature top-$90\%$ plausible set by about $13$ percentage points. The per-token aggregates do not separate $T = 0.8$ from $T = 0.3$; discriminating the two coherent regimes is left to sequence-level features.
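The two distributional quantities named in the abstract can be sketched on a toy logit vector. The code below assumes that the cumulative-mass width is the smallest number of top-ranked tokens whose probabilities sum to at least 95%, and that the leak is the post-temperature mass falling outside the pre-temperature top-90% set; these definitions, the toy logits, and the function names are illustrative assumptions, not the paper's exact feature.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mass_width(probs, mass=0.95):
    """Smallest number of top tokens whose cumulative probability reaches `mass`."""
    running = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        running += p
        if running >= mass:
            return count
    return len(probs)

def leaked_mass(pre, post, mass=0.90):
    """Post-temperature probability that falls outside the pre-temperature top set."""
    order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    keep = set(order[:mass_width(pre, mass)])
    return 1.0 - sum(post[i] for i in keep)

logits = [10.0, 5.0, 4.0, 3.0, 2.0, 1.0]   # toy next-token logits
pre = softmax(logits, temperature=1.0)      # before temperature reshaping
post = softmax(logits, temperature=1.5)     # after sampling at T = 1.5
width_pre, width_post = mass_width(pre), mass_width(post)
leak = leaked_mass(pre, post)
```

On this six-token toy vocabulary the 95% width grows from one token to two and roughly 6.6% of the mass leaks off the original plausible set. The effect reported in the abstract is far larger because a real vocabulary has a long tail for the flattened distribution to spill into.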

2606.01444 2026-06-02 cs.AI cond-mat.mtrl-sci cs.CL cs.LG math.CT

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

Fiona Y. Wang, Markus J. Buehler

Affiliations: Laboratory for Atomistic and Molecular Mechanics; Department of Biological Engineering; Massachusetts Institute of Technology; Department of Civil and Environmental Engineering; Department of Mechanical Engineering; Center for Computational Science and Engineering; Schwarzman College of Computing

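AI summary: The paper proposes a category-theoretic framework that realizes representational regime transitions in scientific discovery through left Kan extensions, and applies it to protein mechanics and fiber-network modeling in materials science.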

Abstract

Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, and verifiers are typed. We develop a category-theoretic account of agentic discovery for materials science. In a fixed regime b with schema category S_b, the system state is a copresheaf I_t: S_b -> Set, and provenance is the category of elements \int_{S_b} I_t. Fixed-regime operation is an update on such states, endofunctorial only when provenance-preserving refinements are specified and preserved. Discovery is instead a verified regime transition u: S_b -> S_b': old artifacts are preserved, transported by the left Kan extension Lan_u I_t, and compared with the post-transition state to identify residual content beyond functorial transport. This separates retrieval, search, and discovery without subjective novelty. We instantiate the framework in two systems. In Builder/Breaker, a protein-mechanics world model is revised under a Minimum Description Length gate; the accepted law expresses within-chain flexibility as all-mode elastic compliance conditioned by slow collective-mode participation, or mode-conditioned compliance. In CategoryScienceClaw, typed skills, artifacts, open needs, workflow mutation, gates, stress tests, and public discourse become a proof-carrying knowledge-computation graph. A fiber-network example records candidate models, rejected alternatives, an AIC gate, perturbation tests, and an accepted orientation-tensor anisotropic stiffness surrogate over an isotropic fiber-count descriptor. Together, the cases show how category theory can be both a mathematical language for discovery and an engineering specification for self-revising AI discovery systems.

2606.01443 2026-06-02 cs.LG cs.AI cs.CV

UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures

Triet M. Le

Affiliations: Spatiolyx LLC

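AI summary: The paper proposes UR-JEPA, which regularizes toward uniformly n-rectifiable measures via a Gaussian-kernel smoothed Carleson-type square function to prevent representation collapse, reaching peak accuracy comparable to LeJEPA on multiple datasets with lower seed variance.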

Details
AI Chinese abstract

训练联合嵌入预测架构(JEPA)的一个核心困难是防止表示坍塌。LeJEPA通过草图化各向同性高斯正则化(SIGReg)对嵌入施加各向同性高斯目标来解决这一问题。该目标与流形假设相矛盾,流形假设期望嵌入集中在环境空间的低维子集上。我们提出\emph{UR-JEPA},其目标是在小尺度上具有局部切向维度$n$的均匀$n$-可整流测度,通过高斯核平滑的Carleson型平方函数$\mathcal{L}^{\text{CGLT}}$实现,并辅以Jones $β$数公式。在Inet10上,UR-JEPA($\mathcal{L}^{\text{CGLT}}$)达到$0.9141 \pm 0.0014$,相比LeJEPA($\mathcal{L}^{\text{SIGReg}}$)提高了$+0.83$个百分点,种子标准差降低约$30\%$;在匹配配方的Galaxy10~SDSS、单种子ImageNet-$100$运行和3种子EuroSAT遥感运行中,两种方法在收敛时处于相同的峰值精度区间,UR-JEPA保持其较低的种子方差特征。在EuroSAT上,域内的两种方法以$96.0$到$96.1\%$的精度与大型遥感基础模型的迁移结果相当,而骨干网络小$25$倍。区别在于几何结构:对投影头输出分布的直接可视化显示,在所有四个数据集上,UR-JEPA($\mathcal{L}^{\text{CGLT}}$)产生的全局PCA谱在索引$\sim 20$到$25$(共$D=32$)处出现$4$到$5$个数量级的下降,而LeJEPA的谱接近平坦(顶部到底部比率最多为$3.6$)。两种方法的每维度边缘分布同时接近高斯分布(平均Shapiro-Wilk $W \in [0.992, 0.996]$),这是Diaconis-Freedman结果的一个推论。因此,在匹配精度下,两种正则化器产生结构上不同的投影表示。

English abstract

A central difficulty in training Joint-Embedding Predictive Architectures (JEPAs) is preventing representation collapse. LeJEPA addresses this by enforcing an isotropic Gaussian target on the embeddings via Sketched Isotropic Gaussian Regularization (SIGReg). This target is in tension with the manifold hypothesis, which expects embeddings to concentrate on a low-dimensional subset of the ambient space. We propose \emph{UR-JEPA}, which targets a uniformly $n$-rectifiable measure of local tangent dimension $n$ at small scales, realized through a Gaussian-kernel smoothed Carleson-type square function $\mathcal{L}^{\text{CGLT}}$, with a complementary Jones $β$-number formulation. On Inet10, UR-JEPA($\mathcal{L}^{\text{CGLT}}$) attains $0.9141 \pm 0.0014$ for a $+0.83$\,pp gain over LeJEPA($\mathcal{L}^{\text{SIGReg}}$) with $\sim 30\%$ lower seed standard deviation; on matched-recipe Galaxy10~SDSS, a single-seed ImageNet-$100$ run, and a $3$-seed EuroSAT remote-sensing run, the two methods lie in the same peak-accuracy band at convergence, with UR-JEPA retaining its lower-seed-variance signature. On EuroSAT the in-domain pair is competitive at $96.0$ to $96.1\%$ with large remote-sensing foundation-model transfer at a $25\times$ smaller backbone. The distinction is geometric: direct visualization of the projector output distribution shows that on all four datasets UR-JEPA($\mathcal{L}^{\text{CGLT}}$) produces a global PCA spectrum with a $4$ to $5$ order-of-magnitude drop at index $\sim 20$ to $25$ out of $D = 32$, while LeJEPA's spectrum is near-flat (top-to-bottom ratio at most $3.6$). Per-dimension marginals are simultaneously near-Gaussian for both methods (mean Shapiro-Wilk $W \in [0.992, 0.996]$) as a Diaconis-Freedman consequence. At matched accuracy the two regularizers therefore yield structurally distinct projected representations.
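The geometric diagnostic in the abstract (a global PCA spectrum, its top-to-bottom ratio, and the index of its sharpest drop) can be illustrated on synthetic data. The following is a minimal sketch, not the paper's code: the helper names, the sample count, the tangent dimension of 20, and the noise scale are all assumptions chosen for illustration, and the two synthetic clouds merely stand in for an isotropic target and a low-dimensional one.

```python
import numpy as np


def pca_spectrum(embeddings):
    """Eigenvalues of the sample covariance, sorted in descending order."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(embeddings) - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    # Floor at the smallest positive float so ratios and logs stay finite.
    return np.maximum(eigvals, np.finfo(float).tiny)


def top_to_bottom_ratio(spectrum):
    """Ratio of the largest to the smallest eigenvalue."""
    return float(spectrum[0] / spectrum[-1])


def largest_drop_index(spectrum):
    """Index i at which log10(spectrum[i - 1] / spectrum[i]) is largest."""
    drops = np.log10(spectrum[:-1] / spectrum[1:])
    return int(np.argmax(drops)) + 1


# Hypothetical setup: D = 32 as in the abstract; everything else is assumed.
rng = np.random.default_rng(0)
num_samples, dim, tangent_dim = 2000, 32, 20

# Isotropic Gaussian cloud: every direction carries similar variance.
isotropic = rng.standard_normal((num_samples, dim))

# Low-dimensional cloud: 20 strong directions plus 12 nearly flat ones,
# rotated so the structure is not aligned with the coordinate axes.
strong = rng.standard_normal((num_samples, tangent_dim))
weak = 3e-3 * rng.standard_normal((num_samples, dim - tangent_dim))
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
low_dim = np.concatenate([strong, weak], axis=1) @ rotation

for name, cloud in [("isotropic", isotropic), ("low-dimensional", low_dim)]:
    spectrum = pca_spectrum(cloud)
    print(name, top_to_bottom_ratio(spectrum), largest_drop_index(spectrum))
```

On this toy data the isotropic cloud gives a near-flat spectrum, while the low-dimensional cloud shows a drop of several orders of magnitude at the assumed tangent dimension, which is the qualitative contrast the abstract reports between the two regularizers.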

2606.01441 2026-06-02 cs.AI

Dive into Ambiguity: A*-Inspired Multi-Agents Commonsense Obfuscation Attack on LLM Prompts

深入歧义:基于A*的多智能体常识混淆攻击LLM提示

Boxuan Wang, Zhuoyun Li, Xiaowei Huang, Yi Dong

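Affiliations * University of Liverpool

AI summary Proposes an A*-inspired factual error induction framework that uses a hierarchical rewrite strategy and a dynamic semantic dispersion coefficient to generate semantically aligned yet obfuscated prompts, efficiently attacking LLM commonsense reasoning.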

Comments Pre-print

Details
AI Chinese abstract

大型语言模型(LLM)在推理和知识密集型任务中表现出色,但仍易受到保留意图同时触发常识幻觉的提示级对抗攻击。这一漏洞亟待解决,因为LLM正迅速集成到事实可靠性不容妥协的安全关键领域。现有攻击方法要么缺乏效率,要么无法捕捉真实世界对手的适应性策略。我们提出一种基于A*的事实错误诱导框架,用于生成语义对齐但混淆的提示。其核心是由动态语义分散系数$γ$引导的层次化重写策略,该系数遵循反向模拟退火调度,在早期平衡保守编辑,后期进行激进混淆。为了增强可解释性,我们进一步引入智能体机制标记,发现并优化对抗机制,提供可解释的反向优化。理论上,我们证明提示重写遵循收缩递归,导致随着$γ$减小语义崩溃。实验上,在多种LLM上,我们的方法以更少的尝试次数实现了比穷举探索更高的攻击成功率,证明了其高效性和有效性。

English abstract

Large language models (LLMs) excel in reasoning and knowledge-intensive tasks but remain vulnerable to prompt-level adversarial attacks that preserve intent while triggering commonsense hallucinations. This vulnerability is urgent, as LLMs are rapidly integrated into safety-critical domains where factual reliability is non-negotiable. Existing attack methods either lack efficiency or fail to capture the adaptive strategies of real-world adversaries. We propose an A*-inspired Factual Error Induction Framework for generating semantically aligned yet obfuscated prompts. At its core is a Hierarchical Rewrite Strategy guided by a dynamic semantic dispersion coefficient $γ$ that balances conservative edits early with aggressive obfuscations later, following a reverse simulated annealing schedule. To enhance interpretability, we further introduce Agentic Mechanism Labeling, which discovers and refines adversarial mechanisms, offering interpretable reverse optimization. Theoretically, we prove that prompt rewriting follows a contractive recurrence, leading to semantic collapse as $γ$ decreases. Empirically, across diverse LLMs, our method achieves higher attack success rates than exhaustive exploration while requiring fewer attempts, demonstrating both efficiency and effectiveness.

2606.01437 2026-06-02 cs.LG cs.AI

CEAR: Certified Ensemble Adversarial Robustness in DNNs

CEAR: 深度神经网络中的集成对抗鲁棒性认证

Daniel Sadig, Mohammadreza Maleki, Hamed Karimi, Reza Samavi

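Affiliations * University of California, Berkeley

AI summary Proposes CEAR, a hybrid of empirical and certified defenses that uses Gaussian noise and temperatures to obfuscate gradients and logits and extends randomized smoothing to verify ensemble classifiers, achieving higher certified accuracy and larger robustness radii on several datasets.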

Comments This is the preprint of the work accepted for publication in the Proceedings of the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026); 19 Pages

Details
AI Chinese abstract

深度神经网络(DNN)极易受到对抗性扰动的影响,这促使了对安全关键应用鲁棒性的广泛研究。最先进的实证防御机制通过训练阶段提高DNN的鲁棒性,但仍难以应对自适应白盒攻击。另一方面,认证防御在指定的扰动范围内提供可证明的鲁棒性保证。这些保证无论扰动程度如何都成立,即使攻击者拥有模型的完全知识。在本文中,我们提出了CEAR,一种基于集成的鲁棒方法,它利用了实证和认证防御机制的混合。CEAR使用不同的高斯噪声和温度训练集成中的每个网络,以混淆梯度和logits,使模型对更强的基于梯度的攻击更具抵抗力。然后我们使用带噪声的logits,并提出了两种不同的投票机制来进一步提高鲁棒性。此外,我们扩展了随机平滑以验证基于集成的分类器的鲁棒性。我们在MNIST、CIFAR10和TinyImageNet数据集上的实验评估表明,与基线方法相比,平均认证准确率更高,鲁棒半径更大,可迁移性更低。

English abstract

Deep Neural Networks (DNNs) are highly susceptible to adversarial perturbations, leading to extensive research on robustness for safety-critical applications. State-of-the-art empirical defense mechanisms improve the robustness of DNNs through the training phase, but still struggle against adaptive white-box attacks. On the other hand, certified defenses offer provable guarantees of robustness within a specified perturbation bound. These guarantees hold regardless of the level of perturbations, even if the attacker is given full knowledge of the model. In this paper, we propose CEAR, an ensemble-based robust method that utilizes a hybrid of empirical and certified defense mechanisms. CEAR trains each network within the ensemble using varying Gaussian noise and temperatures to obfuscate gradients and logits, making the model more resistant to stronger gradient-based attacks. We then use noisy logits and propose two different voting mechanisms to further improve robustness. Furthermore, we extend randomized smoothing to verify the robustness of ensemble-based classifiers. Our experimental evaluations on MNIST, CIFAR10, and TinyImageNet datasets demonstrate superior certified accuracy on average, increased robustness radius, and decreased transferability compared to baseline methods.
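To make the certification step concrete, here is a minimal sketch of standard randomized smoothing applied to a majority-vote ensemble. It is not CEAR's procedure: the abstract does not specify the voting mechanisms or the ensemble extension, so the hard majority vote, the toy linear classifiers, and all function names below are assumptions. The sketch uses the well-known L2 radius $R = \sigma\,\Phi^{-1}(\underline{p_A})$ from the randomized smoothing literature, and it substitutes a simpler (more conservative) Hoeffding lower bound for the Clopper-Pearson bound usually used there.

```python
import math
import random
from statistics import NormalDist


def ensemble_vote(models, x):
    """Hard majority vote over the members' predicted labels (assumed rule)."""
    votes = {}
    for model in models:
        label = model(x)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def noisy_votes(models, x, sigma, n, rng):
    """Count ensemble predictions on n Gaussian-perturbed copies of x."""
    counts = {}
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        label = ensemble_vote(models, noisy)
        counts[label] = counts.get(label, 0) + 1
    return counts


def certify(models, x, sigma, n_select=100, n_estimate=2000, alpha=0.001, seed=0):
    """Return (label, certified L2 radius), or (None, 0.0) to abstain."""
    rng = random.Random(seed)
    # Stage 1: pick the candidate class from a small independent sample.
    selection = noisy_votes(models, x, sigma, n_select, rng)
    candidate = max(selection, key=selection.get)
    # Stage 2: estimate the candidate's probability on fresh samples.
    estimate = noisy_votes(models, x, sigma, n_estimate, rng)
    p_hat = estimate.get(candidate, 0) / n_estimate
    # One-sided Hoeffding lower bound, valid with probability >= 1 - alpha.
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n_estimate))
    if p_lower <= 0.5:
        return None, 0.0
    return candidate, sigma * NormalDist().inv_cdf(p_lower)


# Toy ensemble of three 2-D linear classifiers (purely illustrative).
models = [
    lambda x: int(x[0] + 0.1 * x[1] > 0.0),
    lambda x: int(x[0] - 0.1 * x[1] > 0.0),
    lambda x: int(x[0] > 0.2),
]

label, radius = certify(models, [2.0, 0.0], sigma=0.5)
print(label, radius)
```

Because randomized smoothing treats the base classifier as a black box, wrapping the whole vote in the smoothing loop is one valid way to certify an ensemble; the paper's own extension may differ.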