arXivDaily: daily arXiv digest, updated Monday through Friday
2510.26418 2026-05-26 cs.AI

Chain-of-Thought Hijacking

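Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez

AI summary: Proposes the Chain-of-Thought Hijacking attack, which weakens large reasoning models' ability to refuse harmful requests by inducing prolonged benign reasoning, achieving jailbreaks with high success rates.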

Abstract

Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective black-box jailbreak attack that induces LRMs to engage in prolonged benign puzzle-solving reasoning, often lasting more than five minutes, before eliciting harmful compliance. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand why this attack succeeds, we conduct activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models. Our results indicate that refusal behavior depends on a low-dimensional safety signal whose expression weakens as reasoning traces grow longer. In particular, extended benign reasoning shifts attention away from harmful intentions and attenuates refusal-related activations, producing what we call refusal dilution. These findings demonstrate that excessively prolonged reasoning can introduce a systematic jailbreak attack surface. We release our evaluation materials to support reproducibility and further research.

2510.25065 2026-05-26 cs.AI

Rewarding Structural Conformance of Reasoning using Process Mining

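Yongjae Lee, Taekhyun Park, Sunghyun Sim, Hyerim Bae

AI summary: Proposes TACReward, a reward model that uses process mining to aggregate structural deviations across reasoning steps, improving the performance of sparse-reward policy gradient methods on mathematical reasoning tasks.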

Abstract

Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL)-based language model post-training. However, for reasoning tasks such as mathematical problem solving, binarized outcome rewards provide limited feedback on intermediate reasoning steps. While some studies have attempted to address this issue by estimating overall reasoning quality, it remains unclear whether these rewards are reliable proxies for the quality of stepwise reasoning. In this study, we consider reasoning as a structured process and propose TACReward, a reward model that can be seamlessly integrated into sparse reward policy gradient methods without additional human annotation costs or architectural modifications. TACReward aggregates stepwise structural deviations between teacher and policy reasoning using process mining techniques, producing a scalar reward in the range [0, 1] that indicates reasoning quality. Experiments on multiple mathematical reasoning benchmarks demonstrate that integrating TACReward into sparse reward frameworks encourages the policy model to improve the structural quality of its reasoning, which leads to consistent performance improvements over existing sparse reward frameworks. Our code and checkpoints are publicly available at https://github.com/Thrillcrazyer/TACReward and https://huggingface.co/Thrillcrazyer/TACReward7B.

2510.23509 2026-05-26 cs.RO

Logic-Guided Socially-aware Robot Navigation World Model

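Weizheng Wang, Obi Ike, Soyun Choi, Sungeun Hong, Aniket Bera, Byung-Cheol Min

AI summary: Proposes NaviWM, which combines a structured world model with a logic-driven reasoning chain to strengthen large language models' ability to generate socially compliant and physically safe navigation decisions in dynamic human spaces.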

Abstract

Social robot navigation increasingly relies on large language models for reasoning, path planning, and enabling movement in dynamic human spaces. However, relying solely on LLMs for planning often leads to unpredictable and unsafe behaviors, especially in dynamic human spaces, due to limited physical grounding and weak logical consistency. In this work, we introduce NaviWM, a socially-aware robot Navigation World Model that augments LLM reasoning with a structured world model and a logic-driven chain-of-thought process. NaviWM consists of two main components: (1) a spatial-temporal world model that captures the positions, velocities, and activities of agents in the environment, and (2) a deductive reasoning module that guides LLMs through a multi-step, logic-based inference process. This integration enables the robot to generate navigation decisions that are both socially compliant and physically safe, under well-defined constraints such as personal space, collision avoidance, and timing. Unlike previous methods based on prompting or fine-tuning, NaviWM encodes social norms as first-order logic, enabling interpretable and verifiable reasoning. Experiments show that NaviWM improves success rates and reduces social violations, particularly in crowded environments. These results demonstrate the benefit of combining formal reasoning with LLMs for robust social navigation. Additional experimental details and demo videos for this work can be found at: https://sites.google.com/view/NaviWM.

2510.19328 2026-05-26 cs.LG

Clustered Calibration: Representation-Aware Probability Calibration via Learned Subpopulations

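Tomer Lavi, Bracha Shapira, Nadav Rappoport

AI summary: Proposes the Clustered Calibration framework, which identifies sub-populations by clustering in feature space and fits a mixture of calibrators with hierarchical shrinkage to achieve context-specific calibration, improving on or matching strong global calibrators in negative log-likelihood and Brier score on tabular, image, and text benchmarks.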

Abstract

Ensuring that predicted probabilities align with observed frequencies is critical in high-stakes domains such as clinical decision support, autonomous driving and financial risk assessment. Existing calibration methods typically apply a single global transformation or rely on post-hoc binning over predicted confidences, limiting their ability to exploit heterogeneous reliability across sub-populations. We propose Clustered Calibration, a representation-aware framework that identifies sub-populations via clustering in learned feature spaces (e.g., coverage vectors, SHAP values, CNN activations, Transformer embeddings) and fits a soft mixture of cluster-specific parametric calibrators under hierarchical shrinkage toward a global mapping. This design yields context-specific calibration while maintaining global stability. Across six tabular datasets and additional image and text benchmarks, clustered calibration consistently improves or matches strong global calibrators in terms of negative log-likelihood and Brier score, while preserving AUC and accuracy. We further show, both analytically and empirically, that fixed-bin Expected Calibration Error (ECE) can mis-rank soft, region-aware calibrators even when proper scoring rules improve, and we advocate for log-loss and Brier as more reliable bases for model selection in such settings.
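The three metrics named in this abstract can be written down in a few lines. The following is a minimal sketch for binary labels with predicted positive-class probabilities; the function names, the equal-width binning, and the default of 10 bins are illustrative assumptions, not details taken from the paper.

```python
import math

def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def log_loss(probs, labels, eps=1e-12):
    """Negative log-likelihood of the labels, clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def fixed_bin_ece(probs, labels, n_bins=10):
    """Fixed-bin ECE: weighted gap between mean prediction and observed
    frequency inside equal-width probability bins (assumed binning)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            mean_y = sum(y for _, y in members) / len(members)
            ece += len(members) / len(probs) * abs(mean_p - mean_y)
    return ece
```

Brier score and log-loss are proper scoring rules evaluated per example, while the ECE value depends on where the bin edges fall, which is one reason the two kinds of metric can disagree when ranking calibrators.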

2510.15514 2026-05-26 cs.AI

Voting with the Graph: Stable RLAIF via Topological Consistency Maximization

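Boyin Liu, Zhuo Zhang, Sen Huang, Lipeng Xie, Qingxu Fu, Haoran Chen, LI YU, Tianyi Hu, Zhaoyang Liu, Bolin Ding, Dongbin Zhao

AI summary: Proposes the Topological Consensus Rewards (TCR) framework, which uses transitivity as a denoising mechanism and filters random noise from preference signals via topological majority voting, stabilizing preference learning in reinforcement learning from AI feedback (RLAIF).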

Abstract

Reinforcement Learning from AI Feedback (RLAIF) relies on LLM judges as preference measurement instruments, yet these instruments are fundamentally limited by random measurement errors -- stochastic fluctuations that manifest as preference cycles (e.g., $A \succ B \succ C \succ A$), occurring in 5-9% of evaluations across state-of-the-art models. While repeated sampling mitigates noise by averaging multiple judgments, it treats each comparison in isolation and fails to exploit the structural constraints that distinguish systematic signals from random noise. We introduce Topological Consensus Rewards (TCR), a framework that leverages transitivity as a denoising mechanism via topological majority voting: systematic signals reinforce each other through transitive chains, while random errors cluster into topologically exposed cycles. TCR approximates the Maximum Acyclic Subgraph to filter stochastic noise from preference signals. We also propose Cycle Incidence Rate (CIR) as a diagnostic metric that measures the proportion of samples containing preference cycles. Under our noise model, these cycles arise primarily from stochastic measurement errors rather than genuine intransitivity. Experiments on Arena-Hard, MT-Bench, and WritingBench demonstrate that TCR consistently outperforms pairwise baselines and classical ranking algorithms, while exhibiting robust performance across different judge models.
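To make the notions of preference cycle and Cycle Incidence Rate concrete, here is a minimal sketch. It assumes each sample is a list of (winner, loser) judgments. The `net_win_filter` helper is a simple stand-in heuristic for obtaining an acyclic subgraph; it is not the TCR algorithm, whose details the abstract does not give, and all function names are hypothetical.

```python
from collections import defaultdict

def has_cycle(edges):
    """True if the directed preference graph (winner -> loser) has a cycle."""
    graph = defaultdict(list)
    nodes = set()
    for winner, loser in edges:
        graph[winner].append(loser)
        nodes.update((winner, loser))
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in nodes}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:          # back edge closes a cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)

def cycle_incidence_rate(samples):
    """Fraction of samples whose preference graph contains a cycle."""
    if not samples:
        return 0.0
    return sum(has_cycle(edges) for edges in samples) / len(samples)

def net_win_filter(edges):
    """Stand-in heuristic (not TCR): rank items by wins minus losses and
    keep only the edges that agree with that ranking, which is acyclic."""
    score = defaultdict(int)
    for winner, loser in edges:
        score[winner] += 1
        score[loser] -= 1
    order = sorted(score, key=lambda item: (-score[item], str(item)))
    rank = {item: position for position, item in enumerate(order)}
    return [(w, l) for w, l in edges if rank[w] < rank[l]]
```

For example, the sample `[("A", "B"), ("B", "C"), ("C", "A")]` contains the cycle from the abstract, and filtering it drops one edge so that the remaining judgments are transitive.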

2510.15284 2026-05-26 cs.LG math.ST stat.TH

Small Ensemble-based Data Assimilation: A Machine Learning-Enhanced Data Assimilation Method with Limited Ensemble Size

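Zhilin Li, Zhou Yao, Xianglong Li, Zeng Liu, Zhaokuan Lu, Shanlin Xu, Seungnam Kim, Guangyao Wang

AI summary: Proposes a machine-learning data assimilation method that combines the ensemble Kalman filter with a fully connected neural network: a small ensemble produces preliminary analysis states and the network predicts correction terms, improving accuracy at almost no additional computational cost.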

Abstract

Ensemble-based data assimilation (DA) methods have become increasingly popular due to their inherent ability to address nonlinear dynamic problems. However, these methods often face a trade-off between analysis accuracy and computational efficiency, as larger ensemble sizes required for higher accuracy also lead to greater computational cost. In this study, we propose a novel machine learning-based data assimilation approach that combines the traditional ensemble Kalman filter (EnKF) with a fully connected neural network (FCNN). Specifically, our method uses a relatively small ensemble size to generate preliminary yet suboptimal analysis states via EnKF. A FCNN is then employed to learn and predict correction terms for these states, thereby mitigating the performance degradation induced by the limited ensemble size. We evaluate the performance of our proposed EnKF-FCNN method through numerical experiments involving Lorenz systems and nonlinear ocean wave field simulations. The results consistently demonstrate that the new method achieves higher accuracy than traditional EnKF with the same ensemble size, while incurring negligible additional computational cost. Moreover, the EnKF-FCNN method is adaptable to diverse applications through coupling with different models and the use of alternative ensemble-based DA methods.

2510.14925 2026-05-26 cs.AI cs.CL cs.LG

False Fixed Points: Kantian Feedback, Stable Miscalibration, and Representational Compression in LLMs

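Akira Okutomi

AI summary: Using a Kantian commitment-gate framing and a linear feedback model, this paper studies high-confidence errors in large language models as false fixed points that are locally stable, internally coherent, and confidently wrong; it finds that stability and correctness can be separated and explores high-SNR inertia and representational compression as possible mechanisms for stable miscalibration.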

Comments 27 pages, 8 figures, v3.0

Abstract

High-confidence errors in large language models are often treated as fragile failures. We study an alternative: some errors may be false fixed points, locally stable, internally coherent, and confidently wrong. This separates robustness from truth-tracking. We develop the separation through a Kantian commitment-gate framing and a minimal linear feedback model in which stability and correctness can diverge. Across three open-weight models, overconfident wrong items are not systematically more locally fragile than confidently correct items under our hidden-state sensitivity probes. Abstention-aware self-critique reduces overconfident wrong commitments by sacrificing coverage, and C3-R, a rule-based explicit feedback gate, sharpens that tradeoff rather than eliminating it. These results motivate, but do not establish, high signal-to-noise (high-SNR) inertia and representational compression as possible mechanisms for stable miscalibration.

2510.04580 2026-05-26 cs.AI

Strongly Solving 2048 4x3

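Tomoyuki Kaneko, Shuhei Yamashita

AI summary: Strongly solves the 2048 variant on a 4x3 board by partitioning the state space by the sum of tile numbers on the board (called the age of a state) and enumerating all reachable states and afterstates; the expected score of the optimal strategy is about 50724.26.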

Abstract

2048 is a stochastic single-player game played on the 16 cells of a 4 by 4 grid, where a player chooses a direction among up, down, left, and right and scores by merging two tiles with the same number located in neighboring cells along the chosen direction. This paper shows that a variant, 2048-4x3, played on the 12 cells of a 4 by 3 board (one row smaller than the original), has been strongly solved. In this variant, the expected score achieved by an optimal strategy is about $50724.26$ for the most common initial states: ones with two tiles of number 2. The numbers of reachable states and afterstates are identified to be $1,152,817,492,752$ and $739,648,886,170$, respectively. The key technique is to partition the state space by the sum of the tile numbers on a board, which we call the age of a state. The age is invariant between a state and its afterstate under any valid action, and it increases by two or four with the stochastic response from the environment. Therefore, we can partition the state space by age and enumerate all (after)states of a given age depending only on states with recent ages. Similarly, we can identify (after)state values by proceeding through ages in decreasing order.
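The age invariant can be checked on a small example. The sketch below assumes the standard 2048 merge rule (each tile merges at most once per move) and represents the board as 3 rows of 4 cells with 0 for an empty cell; the function names and the sample board are illustrative, not taken from the paper.

```python
def slide_row_left(row):
    """Slide one row to the left, merging equal neighbours once each.
    Returns the new row and the score gained from merges."""
    tiles = [t for t in row if t != 0]
    merged, gained, i = [], 0, 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)
            gained += tiles[i] * 2
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged)), gained

def move_left(board):
    """Apply a left move to every row; the result is the afterstate."""
    rows, total = [], 0
    for row in board:
        new_row, gained = slide_row_left(row)
        rows.append(new_row)
        total += gained
    return rows, total

def age(board):
    """Age of a state: the sum of all tile numbers on the board."""
    return sum(sum(row) for row in board)

board = [[2, 2, 4, 0],
         [0, 4, 4, 8],
         [2, 0, 2, 2]]
afterstate, gained = move_left(board)

# The environment then places a 2 (or a 4) on an empty cell.
next_state = [row[:] for row in afterstate]
next_state[0][3] = 2
```

Merging two tiles of value v into one tile of value 2v leaves the sum unchanged, so `age(afterstate) == age(board)`, while the spawned tile raises the age by exactly 2 or 4. Ages therefore only ever increase along a game, which is what lets states be enumerated age by age and values be computed in decreasing age order.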

2510.02730 2026-05-26 cs.LG cs.CV

Dale meets Langevin: A Multiplicative Denoising Diffusion Model

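Nishanth Shetty, Madhava Prasath, Chandra Sekhar Seelamantula

AI summary: Proposes a multiplicative score-based generative model with geometric Brownian motion as the forward noising process, derives the reverse-time SDE, designs two multiplicative samplers, introduces the Hyvärinen score and a multiplicative denoising score-matching objective, and validates generative capability on image datasets.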

Abstract

Exponentiated gradient descent (EGD), a biologically motivated optimisation algorithm that respects Dale's law, produces log-normally distributed synaptic weights at convergence, in alignment with experimental observations in neuroscience. Since the marginal distribution of geometric Brownian motion (GBM) at any fixed time is log-normal, this convergence property reveals a natural connection between EGD and GBM-based stochastic processes. We propose a multiplicative score-based generative model with GBM as a forward noising process and derive its corresponding reverse-time SDE in both the ambient space and the $\log$-transformed space. We derive two multiplicative samplers by discretising the corresponding reverse-time SDEs: a sign-agnostic sampler obtained directly from the ambient-space reverse-time SDE, and a sign-preserving sampler, which we refer to as the Dale-Langevin sampler, obtained via the Lamperti transform. We connect the framework to Mirrored Langevin Dynamics, showing that the convex function driving EGD in optimisation precisely governs the Dale-Langevin sampler. While the standard Stein score, defined as $\nabla \log p_{\boldsymbol{X}}(\boldsymbol{x})$ for a random vector $\boldsymbol{X}$ evaluated at $\boldsymbol{x}$, arises naturally in additive-noise diffusion models, in the multiplicative setting we encounter a modified version of the Stein score for sampling, which we refer to as the Hyvärinen score: $\boldsymbol{x} \circ \nabla \log p_{\boldsymbol{X}}(\boldsymbol{x})$. To estimate the score, we propose a new multiplicative denoising score-matching objective (M-DSM), prove its equivalence to the multiplicative explicit score-matching loss, and show that it subsumes the non-negative score matching loss. Experimental results on MNIST, Fashion-MNIST, Kuzushiji-MNIST, and CIFAR-10 validate the generative capability of the proposed framework.
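The log-normal claim for GBM follows from a short Itô calculation. The block below works it out for the scalar case with constant drift $\mu$ and volatility $\sigma$; these constants are an illustrative assumption, and the paper's own parameterisation of the forward process may differ.

```latex
% Scalar GBM with constant drift \mu and volatility \sigma (illustrative assumption)
\begin{align}
  dX_t &= \mu X_t\,dt + \sigma X_t\,dW_t, \qquad X_0 = x_0 > 0, \\
  Y_t &= \log X_t \;\Longrightarrow\;
         dY_t = \left(\mu - \tfrac{1}{2}\sigma^2\right)dt + \sigma\,dW_t, \\
  Y_t &\sim \mathcal{N}\!\left(\log x_0 + \left(\mu - \tfrac{1}{2}\sigma^2\right)t,\; \sigma^2 t\right).
\end{align}
```

Because $Y_t = \log X_t$ is Gaussian at every fixed $t$, $X_t$ is log-normal and stays strictly positive. The noise is additive in the log-transformed space, which is the setting in which a sign-preserving sampler can operate.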

2510.02171 2026-05-26 cs.SD cs.AI eess.AS

Go witheFlow: Real-time Emotion Driven Audio Effects Modulation

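Edmund Dervakos, Spyridon Kantarelis, Vassilis Lyberatos, Jason Liartis, Giorgos Stamou

AI summary: Proposes the witheFlow system, which automatically modulates audio effects in real time from biosignals and audio features to enhance human-machine collaboration in music performance.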

Comments Accepted at NeurIPS Creative AI Track 2025: Humanity

Abstract

Music performance is a distinctly human activity, intrinsically linked to the performer's ability to convey, evoke, or express emotion. Machines cannot perform music in the human sense; they can produce, reproduce, execute, or synthesize music, but they lack the capacity for affective or emotional experience. As such, music performance is an ideal candidate through which to explore aspects of collaboration between humans and machines. In this paper, we introduce the witheFlow system, designed to enhance real-time music performance by automatically modulating audio effects based on features extracted from both biosignals and the audio itself. The system, currently in a proof-of-concept phase, is designed to be lightweight, able to run locally on a laptop, and is open-source given the availability of a compatible Digital Audio Workstation and sensors.

2510.01389 2026-05-26 cs.RO cs.AI cs.LG

INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models

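Ulas Berk Karli, Ziyao Shangguan, Tesca Fitzgerald

AI summary: Proposes the INSIGHT framework, which trains transformer classifiers on token-level uncertainty signals (entropy, log-probability, uncertainty estimates) to predict when a VLA model needs human help, compares performance under strong and weak supervision, and finds that modeling temporal dynamics outperforms static scoring.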

Abstract

Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating failures and requesting help from a human supervisor. We present INSIGHT, a learning framework that leverages token-level uncertainty signals to predict when a VLA should request help. Using $\pi_0$-FAST as the underlying model, we extract per-token entropy, log-probability, and Dirichlet-based estimates of aleatoric and epistemic uncertainty, and train compact transformer classifiers to map these sequences to help triggers. We explore strong and weak supervision regimes and compare them extensively across in-distribution and out-of-distribution tasks. Our results show a trade-off: strong labels enable models to capture fine-grained uncertainty dynamics for reliable help detection, while weak labels, though noisier, still support competitive introspection when training and evaluation are aligned, offering a scalable path when dense annotation is impractical. Crucially, we find that modeling the temporal evolution of token-level uncertainty signals with transformers provides far greater predictive power than static sequence-level scores. This study provides the first systematic evaluation of uncertainty-based introspection in VLAs, opening future avenues for active learning and for real-time error mitigation through selective human intervention.
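Two of the per-token signals mentioned here, entropy and log-probability, can be computed directly from a model's output logits. The sketch below assumes plain Python lists of logits per decoding step; the function names are hypothetical, and the Dirichlet-based aleatoric and epistemic estimates are omitted because the abstract does not specify how they are obtained.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one decoding step."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]

def token_features(logits, chosen_id):
    """Entropy of the predictive distribution and log-probability of the
    token that was actually emitted at this step."""
    probs = softmax(logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy, math.log(probs[chosen_id])

def uncertainty_sequence(step_logits, chosen_ids):
    """One (entropy, log-probability) pair per generated token; a sequence
    classifier would consume this as its input."""
    return [token_features(logits, token)
            for logits, token in zip(step_logits, chosen_ids)]
```

A uniform distribution over four tokens gives the maximum entropy log 4, while a sharply peaked distribution gives an entropy near zero. Keeping the whole sequence of such pairs, instead of a single averaged score, is what allows a classifier to model how uncertainty evolves over time.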

2509.25339 2026-05-26 cs.CV cs.AI cs.LG eess.IV

VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne

AI summary: Proposes the VisualOverload benchmark, which tests VLMs on simple vision tasks in dense scenes; the best model reaches only 69.5% accuracy, revealing key weaknesses in counting, OCR, and logical consistency.

Comments Accepted at CVPR 2026

Abstract

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.6% accuracy on our hardest test split and overall 69.5% accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: http://paulgavrikov.github.io/visualoverload

2509.24050 2026-05-26 cs.LG

Bridging On-Device and Cloud LLMs for Collaborative Reasoning: A Unified Methodology for Local Routing and Post-Training

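Wenzhi Fang, Dong-Jun Han, Liangqi Yuan, Evan Chen, Christopher Brinton

AI summary: Proposes reinforcement-learning-based post-training that lets the on-device LLM decide internally whether to call the cloud, combined with hierarchical rewards and adaptive prompt filtering, significantly narrowing the performance gap to cloud-only LLMs.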

Comments We propose a unified post-training framework that integrates routing optimization, enabling the on-device LLM to improve its problem-solving ability while learning routing strategies

Abstract

Device-cloud collaboration holds promise for deploying large language models (LLMs), leveraging lightweight on-device models for efficiency while relying on powerful cloud models for superior reasoning. A central challenge in this setting is determining, for each incoming query, whether it should be processed locally or offloaded to the cloud. Existing approaches typically rely on external routers, which often struggle to determine difficulty from the prompt itself, especially for tasks involving complex reasoning. Motivated by this limitation, we propose enabling on-device LLMs to decide internally whether to invoke cloud assistance at inference time, with this capability instilled through reinforcement learning based post-training. Casting on-device LLM post-training as a reward maximization problem, we design hierarchical rewards to encourage local problem solving and judicious cloud offloading. To solve the resulting problem, we develop an algorithm featuring a group-level policy gradient that stabilizes optimization, together with adaptive prompt filtering that provides complementary learning signals to mitigate policy collapse (i.e., exclusive local execution or exclusive cloud offloading). Extensive experiments on on-device-scale LLaMA and Qwen models across multiple reasoning benchmarks show that our method consistently outperforms baselines and significantly narrows the gap to full cloud LLMs.

2509.12196 2026-05-26 cs.LG cs.AI

Dynamic Relational Priming Improves Transformer in Multivariate Time Series

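Hunjae Lee, Corey Clark

AI summary: Proposes attention with dynamic relational priming (prime attention), which dynamically adjusts the representation for each token pair to capture heterogeneous inter-channel dependencies in multivariate time series, improving forecasting accuracy by up to 6.5% at the same computational complexity.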

Abstract

Standard attention mechanisms in transformers employ static token representations that remain unchanged across all pair-wise computations in each layer. This limits their representational alignment with the potentially diverse relational dynamics of each token-pair interaction. While it excels in domains with relatively homogeneous relationships, standard attention's static relational learning struggles to capture the diverse, heterogeneous inter-channel dependencies of multivariate time series (MTS) data, where different channel-pair interactions within a single system may be governed by entirely different physical laws or temporal dynamics. To better align the attention mechanism with such domain phenomena, we propose attention with dynamic relational priming (prime attention). Unlike standard attention, where each token presents an identical representation across all of its pair-wise interactions, prime attention tailors each token dynamically (or per interaction) through learnable modulations to best capture the unique relational dynamics of each token pair, optimizing each pair-wise interaction for that specific relationship. This representational plasticity of prime attention enables effective extraction of relationship-specific information in MTS while maintaining the same asymptotic computational complexity as standard attention. Our results demonstrate that prime attention consistently outperforms standard attention across benchmarks, achieving up to 6.5% improvement in forecasting accuracy. In addition, we find that prime attention achieves comparable or superior performance with sequences up to 40% shorter than those used by standard attention, further demonstrating its superior relational modeling capabilities.

2509.12194 2026-05-26 cs.AI cs.CV

Teaching large language models to reason like expert diagnosticians

Thomas A. Buckley, Riccardo Conci, Peter G. Brodeur, Jason Gusdorf, Sourik Beltrán, Bita Behrouzi, Byron Crowe, Jacob Dockterman, Muzzammil Muhammad, Sarah Ohnigian, Andrew Sanchez, James A. Diao, Aashna P. Shah, Daniel Restrepo, Eric S. Rosenberg, Andrew S. Lea, Emily Glanton, Kimberly LeBlanc, Undiagnosed Diseases Network, Marinka Zitnik, Scott H. Podolsky, Zahir Kanjee, Raja-Elie E. Abdulnour, Jacob M. Koshy, Adam Rodman, Arjun K. Manrai

AI summary: Proposes Dr. CaBot, an agentic AI system that emulates expert diagnostic reasoning by generating slide-based presentations from an initial case description; it is evaluated on NEJM CPC and NIH Undiagnosed Diseases Network cases, outperforms frontier models on CPC-Bench, and is released together with the CPC-Bench benchmark to advance clinical AI.

Abstract

Differential diagnosis is an iterative process that integrates patient information with broader medical knowledge. Clinical case series such as the NEJM Clinicopathologic Conferences (CPCs), published continuously since 1923, feature expert physicians who demonstrate diagnostic reasoning to peers, and have been used for decades to evaluate AI. However, prior AI evaluations have largely focused on final diagnostic accuracy rather than nuanced clinical reasoning. Here, we introduce Dr. CaBot, an agentic AI system that emulates an expert diagnostician by generating written and narrated slide-based presentations from an initial case description alone. CaBot recently generated the first AI diagnosis published in the 100+ year history of the NEJM CPCs. In blinded evaluations, physicians misclassified the source of the differential (CaBot vs. physician-written) in 46/62 (74%) of trials and rated them favorably across quality dimensions. When tasked with solving cases for 72 patients with undiagnosed disease from the NIH Undiagnosed Diseases Network, CaBot identified the working diagnosis in 50/72 (69%) of cases from referral notes alone. To promote transparency and research, we also developed CPC-Bench, a physician-validated benchmark based on 7,102 CPCs and 47,648 questions across 10 tasks. We show that CaBot outperforms frontier models on CPC-Bench, and release both CaBot and CPC-Bench publicly to foster progress in clinical AI.

2509.10515 2026-05-26 cs.LG cs.CL

Adaptive Preference Optimization with Uncertainty-aware Utility Anchor

基于不确定性感知效用锚点的自适应偏好优化

Xiaobo Wang, Zixia Jia, Jiaqi Li, Qi Liu, Zilong Zheng

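AI summary Proposes UAPO, a general offline preference optimization framework whose anchoring function estimates the uncertainty in preference data annotation, supports training on unpaired data, and improves data utilization efficiency and training robustness.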

Comments Accepted by EMNLP 2025 Findings

详情
AI中文摘要

离线偏好优化方法对于大型语言模型(LLMs)的对齐是高效的。直接偏好优化(DPO)类学习作为最流行的方法之一,因其在奖励建模中的高效性而脱颖而出。然而,这些方法通常遵循惯例使用Bradley-Terry(BT)奖励建模,该建模面临几个关键假设,包括对成对训练数据的需求、模型分布偏移、人类理性假设等。为了解决这些限制,我们提出了一种通用的离线偏好优化框架——基于不确定性感知效用锚点的自适应偏好优化(UAPO),该框架引入了一个锚点函数来估计偏好数据标注带来的不确定性。我们的方法即使在数据未配对的情况下也能进行训练,显著提高了数据利用效率。此外,锚点设计使UAPO在训练过程中更加鲁棒。实验结果表明,UAPO在无需严格依赖数据配对的情况下取得了有竞争力的结果,为更灵活有效的偏好优化方法铺平了道路。

英文摘要

Offline preference optimization methods are efficient for aligning large language models (LLMs). Direct Preference Optimization (DPO)-like learning, one of the most popular approaches, stands out for its efficiency in reward modeling. However, these methods typically follow the convention of using Bradley-Terry (BT) reward modeling, which rests on several critical assumptions, including the requirement for pairwise training data, model distribution shifting, and human rationality. To address these limitations, we propose a general framework for offline preference optimization, Adaptive Preference Optimization with Uncertainty-aware Utility Anchor (UAPO), which introduces an anchoring function to estimate the uncertainties arising from preference data annotation. Our method enables training even when the data is unpaired, significantly enhancing data utilization efficiency. Moreover, the anchor design makes UAPO more robust during training. Experimental results demonstrate that UAPO achieves competitive outcomes without a strict dependency on data pairing, paving the way for more flexible and effective preference optimization methods.

2509.05614 2026-05-26 cs.CV cs.AI cs.RO

SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning

SpecPrune-VLA: 通过动作感知的自推测剪枝加速视觉-语言-动作模型

Hanzhen Wang, Jiaming Xu, Yushun Xiang, Jiayi Pan, Yongkang Zhou, Yong-Lu Li, Guohao Dai

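AI summary Proposes a training-free, two-level pruning method for accelerating vision-language-action model inference that combines global context with local information, achieving up to 1.57x speedup with almost no drop in success rate.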

Comments Accepted to ICML 2026

详情
AI中文摘要

剪枝是一种通过移除不重要值的计算来加速计算密集型模型的典型技术。最近,它被应用于加速视觉-语言-动作(VLA)模型推理。然而,现有的加速方法仅关注当前动作步骤的局部信息,忽略了全局上下文,导致在某些场景下成功率下降超过20%且加速效果有限。本文指出VLA任务中的时空一致性:连续步骤中的输入图像表现出高度相似性,并提出关键见解:令牌选择应结合局部信息与模型的全局上下文。基于此,我们提出SpecPrune-VLA,一种无需训练、具有启发式控制的两级剪枝方法。(1) 动作级静态剪枝:利用全局历史和局部注意力,在每个动作中静态减少视觉令牌。(2) 层级动态剪枝:根据逐层重要性自适应地剪枝每层的令牌。(3) 轻量级动作感知控制器:根据末端执行器的速度将动作分为粗粒度或细粒度,并相应调整剪枝激进程度。大量实验表明,SpecPrune-VLA在LIBERO模拟中实现高达1.57倍加速,在真实世界任务中实现1.70倍加速,且成功率下降可忽略不计。

英文摘要

Pruning typically accelerates compute-bound models by removing computation on unimportant values. Recently, it has been applied to accelerate Vision-Language-Action (VLA) model inference. However, existing acceleration methods focus on local information from the current action step and ignore the global context, leading to a >20% success rate drop and limited speedup in some scenarios. In this paper, we point out the spatial-temporal consistency in VLA tasks: input images in consecutive steps exhibit high similarity. From this we derive the key insight that token selection should combine local information with the global context of the model. Based on this, we propose SpecPrune-VLA, a training-free, two-level pruning method with heuristic control. (1) Action-level static pruning. We leverage global history and local attention to statically reduce visual tokens per action. (2) Layer-level dynamic pruning. We prune tokens adaptively per layer based on layer-wise importance. (3) Lightweight action-aware controller. We classify actions as coarse- or fine-grained by the speed of the end effector and adjust pruning aggressiveness accordingly. Extensive experiments show that SpecPrune-VLA achieves up to 1.57$\times$ speedup in LIBERO simulation and 1.70$\times$ on real-world tasks, with negligible success rate degradation.
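
The token-selection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the max-based score fusion, the speed threshold, and the keep ratios are all hypothetical choices made for illustration, and the mapping "fast motion means coarse-grained action means more aggressive pruning" is an assumption the abstract does not spell out.

```python
def select_tokens(local_scores, global_scores, keep_ratio):
    """Return sorted indices of the visual tokens to keep.

    local_scores: importance from the current step (hypothetical input).
    global_scores: importance carried over from earlier steps (hypothetical input).
    """
    # Assumed fusion rule: a token survives if either source rates it highly.
    combined = [max(l, g) for l, g in zip(local_scores, global_scores)]
    k = max(1, round(keep_ratio * len(combined)))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(ranked[:k])


def keep_ratio_for(speed, speed_threshold=0.05, coarse_ratio=0.25, fine_ratio=0.5):
    """Pick a keep ratio from end-effector speed (all constants are illustrative)."""
    # Assumption: fast motion is treated as coarse-grained, so more tokens are pruned.
    return coarse_ratio if speed > speed_threshold else fine_ratio


local = [0.1, 0.9, 0.2, 0.4]
global_ = [0.8, 0.1, 0.1, 0.3]
kept = select_tokens(local, global_, keep_ratio_for(speed=0.01))
print(kept)
```

With a slow end effector the sketch keeps half of the four tokens; a faster one would keep only the single highest-scoring token.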

2509.01557 2026-05-26 cs.CV

Real-Time Hardware-Free HIFU Interference Suppression via Teacher-Student Diffusion Framework

基于教师-学生扩散框架的实时无硬件HIFU干扰抑制

Dejia Cai, Ali Abdollahi, Xi Wang, Kun Yang, Zhaohui Guo, Xiaowei Zhou, Hao Chen

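AI summary Proposes mHC-Diff, an image-domain diffusion framework that needs no specialized hardware synchronization and uses teacher-student distillation for real-time, high-fidelity HIFU interference suppression, reaching 26.65 dB PSNR at ~20 FPS on a clinical dataset.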

详情
AI中文摘要

高强度聚焦超声(HIFU)是一种非侵入性疗法,但其安全性常因连续超声引导期间的严重声学干扰而降低。传统的HIFU干扰抑制方法严重依赖专有的原始射频(RF)数据或复杂的硬件同步,限制了其临床实用性并阻碍了实时实现。为解决这一限制,我们提出了流形约束超连接扩散(mHC-Diff),一种图像域扩散框架,用于无需专用硬件同步的实时干扰抑制,将复杂干扰与解剖结构分离,同时确保高重建保真度。为实现临床实时应用,我们的方法采用两阶段策略:(i)解剖感知先验获取,其中扩散模型使用多步UNet作为高保真教师进行训练;以及(ii)效率蒸馏,其中通过知识蒸馏将该先验蒸馏为单步学生以实现实时吞吐量。在涵盖多种治疗场景的临床代表性数据集上的广泛验证表明,mHC-Diff实现了卓越的恢复(26.65 dB PSNR),同时在单个NVIDIA RTX 4090上实现实时推理(~20 FPS),比迭代扩散基线(例如HIFU-Diff)加速约6.8倍。通过消除对专用硬件同步和专有RF访问的需求,该图像域框架确保了兼容性,并促进了超声引导HIFU干预期间的实时干扰抑制。

英文摘要

High-Intensity Focused Ultrasound (HIFU) is a non-invasive therapy, yet its safety is often degraded by severe acoustic interference during continuous ultrasound guidance. Conventional HIFU interference suppression methods heavily rely on proprietary raw Radio-Frequency (RF) data or complex hardware synchronization, limiting their clinical utility and preventing real-time implementation. To address this limitation, we propose Manifold-Constrained Hyper-Connections Diffusion (mHC-Diff), an image-domain diffusion framework for real-time interference suppression without specialized hardware synchronization, disentangling complex interference from anatomical structures while ensuring high reconstruction fidelity. To achieve clinical real-time application, our approach employs a two-stage strategy: (i) anatomy-aware prior acquisition, where a diffusion model is trained with a multi-step UNet as a high-fidelity Teacher; and (ii) efficiency distillation, where this prior is distilled into a one-step Student via knowledge distillation to achieve real-time throughput. Extensive validation on a clinically representative dataset across diverse therapeutic scenarios shows that mHC-Diff achieves superior restoration (26.65 dB PSNR), while enabling real-time inference (~20 FPS) on a single NVIDIA RTX 4090, providing a ~6.8x speedup over iterative diffusion baselines (e.g., HIFU-Diff). By eliminating the requirement for specialized hardware synchronization and proprietary RF access, this image-domain framework ensures compatibility and facilitates real-time interference suppression during ultrasound-guided HIFU interventions.

2508.21620 2026-05-26 cs.LG

Introduction to the Analysis of Probabilistic Decision-Making Algorithms

概率决策算法分析导论

Agustinus Kristiadi

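AI summary A self-contained introductory guide to the theoretical analysis of probabilistic decision-making algorithms, including bandit algorithms, Bayesian optimization, and tree search, written to lower the barrier for non-experts.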

详情
AI中文摘要

决策理论为在各种不确定性下做出选择提供了原则性方法。实现这些理论的算法已成功应用于广泛的实际问题,包括材料和药物发现。事实上,这些算法是可取的,因为它们可以自适应地收集信息以在未来做出更好的决策,从而产生数据高效的工作流程。在科学发现中,实验成本高昂,因此这些算法可以显著降低实验成本。这些算法的理论分析对于理解其行为以及为开发下一代算法提供有价值的见解至关重要。然而,文献中的理论分析通常对非专家来说难以理解。本专著旨在为常用概率决策算法(包括赌博机算法、贝叶斯优化和树搜索算法)的理论分析提供一本可访问的、自包含的入门介绍。仅假设读者具备概率论和统计学的基本知识,以及一些关于高斯过程的基础知识。

英文摘要

Decision theories offer principled methods for making choices under various types of uncertainty. Algorithms that implement these theories have been successfully applied to a wide range of real-world problems, including materials and drug discovery. Indeed, they are desirable since they can adaptively gather information to make better decisions in the future, resulting in data-efficient workflows. In scientific discovery, where experiments are costly, these algorithms can thus significantly reduce the cost of experimentation. Theoretical analyses of these algorithms are crucial for understanding their behavior and providing valuable insights for developing next-generation algorithms. However, theoretical analyses in the literature are often inaccessible to non-experts. This monograph aims to provide an accessible, self-contained introduction to the theoretical analysis of commonly used probabilistic decision-making algorithms, including bandit algorithms, Bayesian optimization, and tree search algorithms. Only basic knowledge of probability theory and statistics, along with some elementary knowledge about Gaussian processes, is assumed.

2508.08652 2026-05-26 cs.AI

Prompt-and-Check: Using Large Language Models to Evaluate Communication Protocol Compliance in Simulation-Based Training

Prompt-and-Check:使用大型语言模型评估基于模拟训练中的通信协议合规性

Vishakha Lall, Yisi Liu

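AI summary Proposes Prompt-and-Check, which uses context-rich prompts with open-source large language models to evaluate communication protocol compliance in simulation-based training, and validates it in a maritime case study.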

详情
AI中文摘要

准确的程序通信合规性评估在基于模拟的训练中至关重要,特别是在安全关键领域,遵守合规检查表反映了操作能力。本文探索了一种轻量级、可部署的方法,使用基于提示的推理与开源大型语言模型(LLMs),这些模型可以在消费级GPU上高效运行。我们提出了Prompt-and-Check,一种使用上下文丰富的提示来评估协议中每个检查表项目是否已满足的方法,仅基于转录的口头交流。我们在海事领域进行了一个案例研究,参与者执行相同的模拟任务,并实验了LLaMA 2 7B、LLaMA 3 8B和Mistral 7B等模型,在本地RTX 4070 GPU上运行。对于每个检查表项目,一个包含相关转录摘录的提示被输入模型,模型输出合规性判断。我们使用分类准确性和一致性分数将模型输出与专家标注的基准进行比较。我们的发现表明,提示使得无需任务特定训练即可进行有效的上下文感知推理。这项研究突出了LLMs在增强训练环境中的汇报、绩效反馈和自动评估方面的实际效用。

英文摘要

Accurate evaluation of procedural communication compliance is essential in simulation-based training, particularly in safety-critical domains where adherence to compliance checklists reflects operational competence. This paper explores a lightweight, deployable approach using prompt-based inference with open-source large language models (LLMs) that can run efficiently on consumer-grade GPUs. We present Prompt-and-Check, a method that uses context-rich prompts to evaluate whether each checklist item in a protocol has been fulfilled, based solely on transcribed verbal exchanges. We perform a case study in the maritime domain with participants performing an identical simulation task, and experiment with models such as LLaMA 2 7B, LLaMA 3 8B, and Mistral 7B, running locally on an RTX 4070 GPU. For each checklist item, a prompt incorporating relevant transcript excerpts is fed into the model, which outputs a compliance judgment. We assess model outputs against expert-annotated ground truth using classification accuracy and agreement scores. Our findings demonstrate that prompting enables effective context-aware reasoning without task-specific training. This study highlights the practical utility of LLMs in augmenting debriefing, performance feedback, and automated assessment in training environments.
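
The per-item loop described above (build a prompt from transcript excerpts, get a judgment, score it against expert labels) might look roughly like the sketch below. The prompt wording, the YES/NO answer format, the `generate` callable standing in for a locally hosted model, and the use of Cohen's kappa as the agreement score are all assumptions for illustration; the abstract does not specify them.

```python
def build_prompt(item, excerpts):
    """Assemble a context-rich prompt for one checklist item (wording is hypothetical)."""
    lines = [
        "You are assessing communication protocol compliance.",
        f"Checklist item: {item}",
        "Transcript excerpts:",
    ]
    lines += [f"- {e}" for e in excerpts]
    lines.append("Was the checklist item fulfilled? Answer YES or NO.")
    return "\n".join(lines)


def judge_item(item, excerpts, generate):
    """`generate` is any callable mapping a prompt string to a reply string."""
    reply = generate(build_prompt(item, excerpts))
    return reply.strip().upper().startswith("YES")


def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def cohen_kappa(pred, gold):
    """Chance-corrected agreement for binary labels (assumed agreement metric)."""
    n = len(gold)
    p_o = accuracy(pred, gold)
    p_yes = (sum(pred) / n) * (sum(gold) / n)
    p_no = (1 - sum(pred) / n) * (1 - sum(gold) / n)
    p_e = p_yes + p_no
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


# Stand-in for a real model call, used only to make the sketch runnable.
fake_model = lambda prompt: "Yes, the readback was given."
print(judge_item("Readback of heading order", ["Helm: steady on 090."], fake_model))
```

Passing the model as a callable keeps the scoring code independent of whichever inference backend is actually used.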

2508.07624 2026-05-26 cs.CV

Enhancing Egocentric Object Detection in Static Environments using Graph-based Spatial Anomaly Detection and Correction

基于图的空间异常检测与校正增强静态环境中的自我中心目标检测

Vishakha Lall, Yisi Liu

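AI summary Proposes a graph neural network-based post-processing pipeline that models spatial relationships between objects in static environments to correct detection anomalies in egocentric frames, improving mAP@50 by up to 4%.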

详情
AI中文摘要

在涉及静态环境的许多实际应用中,物体的空间布局在实例之间保持一致。然而,最先进的目标检测模型通常无法利用这种空间先验,导致预测不一致、漏检或误分类,尤其是在杂乱或遮挡的场景中。在这项工作中,我们提出了一种基于图的后处理管道,显式建模物体之间的空间关系,以校正自我中心帧中的检测异常。使用在手动标注数据上训练的图神经网络(GNN),我们的模型识别无效的物体类别标签,并根据其邻域上下文预测校正后的类别标签。我们评估了我们的方法,既作为独立的异常检测与校正框架,也作为标准目标检测器(如YOLOv7和RT-DETR)的后处理模块。实验表明,融入这种空间推理显著提升了检测性能,mAP@50提升高达4%。该方法凸显了利用环境空间结构来提高目标检测系统可靠性的潜力。

英文摘要

In many real-world applications involving static environments, the spatial layout of objects remains consistent across instances. However, state-of-the-art object detection models often fail to leverage this spatial prior, resulting in inconsistent predictions, missed detections, or misclassifications, particularly in cluttered or occluded scenes. In this work, we propose a graph-based post-processing pipeline that explicitly models the spatial relationships between objects to correct detection anomalies in egocentric frames. Using a graph neural network (GNN) trained on manually annotated data, our model identifies invalid object class labels and predicts corrected class labels based on their neighbourhood context. We evaluate our approach both as a standalone anomaly detection and correction framework and as a post-processing module for standard object detectors such as YOLOv7 and RT-DETR. Experiments demonstrate that incorporating this spatial reasoning significantly improves detection performance, with mAP@50 gains of up to 4%. This method highlights the potential of leveraging the environment's spatial structure to improve reliability in object detection systems.

2508.03526 2026-05-26 cs.RO

CollaBot: Vision-Language Guided Simultaneous Collaborative Manipulation

CollaBot: 视觉-语言引导的同步协作操作

Kun Song, Gaoming Chen, Shentao Ma, Ninglong Jin, Guangbao Zhao, Mingyu Ding, Zhenhua Xiong, Jia Pan

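AI summary Proposes CollaBot, a generalist framework that combines scene segmentation, collaborative grasping, and two-stage planning so that multiple robots can simultaneously manipulate large objects, reaching a 72% success rate in experiments.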

Comments 8 pages, 6 figures

详情
AI中文摘要

机器人学的一个核心目标是使机器人能够与物理世界交互。传统的操作研究主要关注单个机器人和相对较小的物体。然而,工厂和家庭环境通常需要大型物体的操作,例如移动桌子,这需要多个机器人协同工作。现有研究仍然缺乏一个能够处理不同物体、任务和机器人团队规模的通用框架。在这项工作中,我们提出了CollaBot,一个用于同步协作操作的通用框架。首先,我们使用SEEM进行场景分割和目标物体提取。然后,我们提出了一个协作抓取框架,将任务分解为局部抓取姿态生成和全局协调。最后,我们设计了一个两阶段规划模块,以生成无碰撞轨迹用于任务执行。在不同物体、任务和机器人数量设置下的实验结果表明,我们的框架达到了72%的成功率。这比基于行为克隆的方法有显著改进,验证了所提出框架在复杂多机器人协作任务中的优势。真实世界实验进一步证明了我们的方法在实际应用中的可行性。

英文摘要

One central goal of robotics is to enable robots to interact with the physical world. Traditional manipulation studies primarily focus on single robots and relatively small objects. However, factory and domestic environments often require large-object manipulation, such as moving tables, where multiple robots must work collaboratively. Existing studies still lack a generalizable framework that can handle diverse objects, tasks, and robot team sizes. In this work, we propose CollaBot, a generalist framework for simultaneous collaborative manipulation. First, we use SEEM for scene segmentation and target-object extraction. Then, we propose a collaborative grasping framework that decomposes the task into local grasp pose generation and global coordination. Finally, we design a two-stage planning module to generate collision-free trajectories for task execution. Experimental results across different settings with varying objects, tasks, and numbers of robots indicate that our framework achieves a 72% success rate. This marks a substantial improvement over behavior cloning-based methods, validating the advantages of the proposed framework in complex multi-robot cooperative tasks. Real-world experiments further demonstrate the feasibility of our method in practical applications.

2507.21556 2026-05-26 cs.CL

Transformers over-extend what humans underlearn: the case of Spanish L-shaped morphome

Transformer过度泛化而人类学习不足:西班牙语L形形态词素案例

Akhilesh Kakolu Ramarao, Kevin Tang, Dinah Baer-Henney

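AI summary Trains transformer models on the Spanish L-shaped morphome under three frequency conditions and compares them with human behavioral data, finding that the models learn the pattern from distributional input but generalize it in a qualitatively different way from humans.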

详情
AI中文摘要

不规则形态模式的认知现实性已争论数十年:说话者是否将其扩展到新形式,还是它们只是词汇产物?基于分布输入训练的神经网络提供了可学习性测试:如果它恢复了模式,则该模式仅从输入统计中即可学习。我们将此测试应用于西班牙语L形形态词素,其中第一人称单数直陈式词干出现在每个现在虚拟式单元格中,尽管缺乏明显的音系或语义动机。我们进一步询问输入中不规则动词的频率是否调节泛化,在三种频率条件(10%、50%、90%不规则)下评估Transformer,并将其与人类行为数据进行比较。在伪词输入的全形式产出中,所有模型表现均较差,但所有三种条件产生正确词干的频率均高于人类(43-49% vs. 33%)。响应偏好显示出明显分歧:人类始终偏好规则屈折,而模型随着训练中不规则比例增加更倾向于不规则形式。自然和平衡条件下的模型也对伪词与真实西班牙语不规则动词之间的音系相似性敏感,而这种效应在人类中不存在。因此,L形形态词素仅从分布输入即可学习,但模型在定性上以不同于人类的方式泛化它。

英文摘要

The cognitive reality of irregular morphological patterns has been debated for decades: do speakers extend them to novel forms, or are they lexical artifacts? A neural network trained on distributional input offers a learnability test: if it recovers the pattern, the pattern is learnable from input statistics alone. We apply this test to the Spanish L-shaped morphome, where the first-person singular indicative stem appears in every present subjunctive cell despite lacking apparent phonological or semantic motivation. We further ask whether the frequency of irregular verbs in the input modulates generalization, evaluating transformers under three frequency conditions (10%, 50%, 90% irregular) and comparing them to human behavioral data. On full-form production from pseudoword inputs, all models performed poorly, but all three conditions produced the correct stem more often than humans (43-49% vs. 33%). Response preferences revealed a clear divergence: humans consistently favored regular inflections, whereas models preferred irregular forms more as their proportion in training grew. Models in the naturalistic and balanced conditions were also sensitive to phonological similarity between pseudowords and real Spanish irregular verbs, an effect absent in humans. The L-shaped morphome is thus learnable from distributional input alone, but models generalize it qualitatively differently from humans.

2507.19219 2026-05-26 cs.CL cs.CR

How Much Do Large Language Model Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework

大型语言模型在评估中作弊了多少?基于一次性密码本的框架下的高估基准测试

Zi Liang, Liantong Yu, Shiyu Zhang, Qingqing Ye, Haibo Hu

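AI summary To address inflated scores on public benchmarks caused by data contamination or training bias, proposes ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption. It comprises SCP, an automated generator of private test cases, and Rugged Scores, metrics for the proportion of contamination and bias, and aims at reproducible, transparent, and efficient evaluation.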

Comments This paper has been accepted by AAAI 2026. We have updated it with new evaluation results for ArxivRollBench-2025a and ArxivRollBench-2026a, covering timely models such as DeepSeekV4Pro, GPT-5.5, and Claude-Opus-4.7. Source code: https://github.com/liangzid/ArxivRoll/ Online leaderboard: https://arxivroll.moreoverai.com/

详情
AI中文摘要

评估大型语言模型(LLMs)时的高估问题日益引起关注。由于公开基准测试的数据污染或模型训练不平衡,LLMs可能在公开基准测试中无意或有意地获得不真实的评估结果,这导致LLMs之间的不公平比较,并削弱了对其实际能力的评估。现有基准测试试图通过永久保密测试用例、通过人工评估减轻污染或反复收集和构建新样本来解决这些问题。然而,这些方法无法同时确保可重复性、透明性和高效率。此外,当前LLMs的高估程度仍未量化。为解决这些问题,我们提出了ArxivRoll,一个受密码学中一次性密码本加密启发的动态评估框架。ArxivRoll包含两个关键组件:\emph{i) SCP(排序、完形填空和预测)},一个用于私有测试用例的自动生成器;\emph{ii) Rugged Scores(RS)},衡量公开基准测试污染和训练偏差比例的指标。利用SCP,ArxivRoll每六个月使用ArXiv上的最新文章构建一个新的基准测试,并将其用于LLM性能的一次性评估。大量实验证明了我们基准测试的高质量,并且我们提供了对当前LLMs的系统评估。源代码可在https://github.com/liangzid/ArxivRoll/获取。

英文摘要

Overestimation in evaluating large language models (LLMs) has become an increasing concern. Due to the contamination of public benchmarks or imbalanced model training, LLMs may achieve unrealistic evaluation results on public benchmarks, either intentionally or unintentionally, which leads to unfair comparisons among LLMs and undermines realistic assessment of their capabilities. Existing benchmarks attempt to address these issues by keeping test cases permanently secret, mitigating contamination through human evaluation, or repeatedly collecting and constructing new samples. However, these approaches fail to ensure reproducibility, transparency, and high efficiency simultaneously. Moreover, the extent of overestimation in current LLMs remains unquantified. To address these issues, we propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography. ArxivRoll comprises two key components: (i) SCP (Sequencing, Cloze, and Prediction), an automated generator for private test cases, and (ii) Rugged Scores (RS), metrics that measure the proportion of public benchmark contamination and training bias. Leveraging SCP, ArxivRoll constructs a new benchmark every six months using recent articles from ArXiv and employs it for one-time evaluations of LLM performance. Extensive experiments demonstrate the high quality of our benchmark, and we provide a systematic evaluation of current LLMs. The source code is available at https://github.com/liangzid/ArxivRoll/.
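
To make the SCP idea concrete, here is a rough sketch of how sequencing and cloze items could be generated automatically from the sentences of a recent article. It is only an illustration of the general technique: the function names, the item formats, and the seeded shuffle are hypothetical and are not taken from the ArxivRoll code base.

```python
import random


def make_sequencing_item(sentences, seed=0):
    """Shuffle sentences; the answer restores the original order."""
    rng = random.Random(seed)  # seeded so the item can be regenerated
    order = list(range(len(sentences)))
    rng.shuffle(order)
    shuffled = [sentences[i] for i in order]
    # answer[k] is the position in `shuffled` of the k-th original sentence.
    answer = [order.index(k) for k in range(len(sentences))]
    return shuffled, answer


def make_cloze_item(sentences, blank_index):
    """Mask one sentence; the answer is the masked text."""
    masked = list(sentences)
    masked[blank_index] = "____"
    return masked, sentences[blank_index]


article = [
    "We study token pruning.",
    "Existing methods ignore global context.",
    "We propose a two-level scheme.",
    "Experiments show a clear speedup.",
]
shuffled, answer = make_sequencing_item(article, seed=42)
restored = [shuffled[i] for i in answer]
masked, gold = make_cloze_item(article, blank_index=2)
print(restored == article, gold)
```

Because the items are derived mechanically from freshly published text, a new private test set can be produced for each evaluation round without manual annotation.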

2507.14958 2026-05-26 cs.CL

MUR: Momentum Uncertainty guided Reasoning for Large Language Models

MUR: 面向大型语言模型的动量不确定性引导推理

Hang Yan, Fangzhi Xu, Rongman Xu, Yifei Li, Jian Zhang, Haoran Luo, Xiaobao Wu, Luu Anh Tuan, Haiteng Zhao, Qika Lin, Jun Liu

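AI summary Proposes Momentum Uncertainty-guided Reasoning (MUR), which tracks and aggregates stepwise uncertainty to allocate the reasoning budget dynamically and adds a gamma-control mechanism. Without additional training, it cuts computation by over 45% on average across several benchmarks while improving accuracy.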

详情
AI中文摘要

大型语言模型在推理密集型任务上取得了令人印象深刻的性能,但优化其推理效率仍然是一个开放的挑战。虽然测试时缩放(TTS)提高了推理质量,但它常常导致过度思考,在冗余计算上浪费令牌。本研究探讨如何在不额外训练的情况下,高效且自适应地引导当前模型的测试时缩放。受物理学中动量概念的启发,我们提出了动量不确定性引导推理(MUR),它通过随时间跟踪和聚合逐步不确定性,动态地将思考预算分配给关键的推理步骤。为了支持灵活的推理时控制,我们引入了γ控制,这是一种通过单个超参数调整推理预算的简单机制。我们提供了深入的理论证明,以支持MUR在稳定性和偏差方面的优越性。MUR与各种TTS方法在四个具有挑战性的基准(MATH-500、AIME24、AIME25和GPQA-diamond)上,使用不同大小的最新Qwen3模型(1.7B、4B和8B)进行了全面评估。结果表明,MUR平均减少了超过45%的计算量,同时将准确率提高了0.33%至3.46%。

英文摘要

Large Language Models have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide current models' test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide an in-depth theoretical proof to support the superiority of MUR in terms of stability and bias. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 45% on average while improving accuracy by 0.33% to 3.46%.
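
One way to picture the mechanism is an exponential moving average of per-step uncertainty that flags a step for extra compute when its uncertainty jumps above the running value. The sketch below is an assumption-laden illustration, not the paper's algorithm: the uncertainty measure (mean negative log-probability), the update rule, and the exact role of `gamma` as a threshold scale are all hypothetical.

```python
def step_uncertainty(token_logprobs):
    """Mean negative log-probability of the tokens in one reasoning step."""
    return -sum(token_logprobs) / len(token_logprobs)


def plan_scaling(step_uncertainties, alpha=0.9, gamma=0.9):
    """Return one flag per step; True means 'spend extra compute on this step'.

    alpha: momentum coefficient of the running average (illustrative value).
    gamma: in (0, 1]; smaller values raise the trigger threshold, so fewer
           steps are scaled (assumed semantics of the single control knob).
    """
    flags = []
    momentum = None
    for u in step_uncertainties:
        if momentum is None:
            # First step: nothing to compare against yet.
            flags.append(False)
            momentum = u
            continue
        flags.append(u > momentum / gamma)
        momentum = alpha * momentum + (1 - alpha) * u
    return flags


flags = plan_scaling([1.0, 1.0, 2.0, 1.0], alpha=0.5, gamma=0.8)
print(flags)
```

In this toy trace only the third step, whose uncertainty doubles relative to the running average, would receive additional test-time compute.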

2507.10644 2026-05-26 cs.AI cs.CL cs.CR cs.HC cs.MA

From Multi-Agent Systems and the Semantic Web to Agentic AI: A Unified Narrative of the Web of Agents

从多智能体系统和语义网到智能体AI:智能体网络的统一叙事

Tatiana Petrova, Boris Bliznioukov, Aleksandr Puzikov, Radu State

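AI summary Argues that the Web of Agents (WoA) has undergone a semantic-effort migration from platform-side coordination (Generation I) through data-side annotation (Generation II) to model-side interpretation (Generation III), and analyzes each generation's failure modes and the current open problems.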

详情
AI中文摘要

智能体网络(WoA)将文档为中心的Web转变为由自主智能体代表用户行动的环境,这一愿景随着大型语言模型(LLM)的成熟而变得可行。我们认为,在过去的三十年中,WoA按时间顺序经历了语义努力迁移:从平台端协调(多智能体系统,第一代),经过数据端标注(语义网,第二代),到模型端解释(LLM时代,第三代)。这一轨迹中的核心转变——从第二代到第三代,我们称之为从数据中的语义到模型中的语义的转变——具有预测性:每一代的失败模式和当前开放问题都源于该代语义努力的定位。本文做出五项贡献:(i) 一个跨越1990-2026年的统一进化叙事;(ii) 一个四维比较框架(语义基础、通信范式、智能定位、发现机制),统一应用于所有三代;(iii) 对十六个代表性系统在这些维度上的分类,包括混合LLM-知识图谱和计算机使用智能体;(iv) 涵盖2024年11月至2026年8月的制度融合(Linux基金会智能体AI基金会、A2A v1.0、MCP 2024年11月发布和2025年11月规范、Visa/Mastercard/Stripe支付网络协议、欧盟AI法案分阶段执行、NIST AI智能体标准倡议、2026年国际AI安全报告);以及(v) 基于跨代证据的七个命名教训,以及七个与代无关的挑战,无论哪种协议占主导地位,这些挑战都持续存在。进一步的进展更多地取决于标准机构、监管机构和商业支付网络正在组装的社会技术基础设施,而不是协议设计。

英文摘要

The Web of Agents (WoA) transforms the document-centric Web into an environment of autonomous agents acting on users' behalf, a vision newly tractable as large language models (LLMs) mature. We argue that across three decades the WoA has undergone a semantic-effort migration in chronological order: from platform-side coordination (Multi-Agent Systems, Generation I), through data-side annotation (Semantic Web, Generation II), to model-side interpretation (LLM-era, Generation III). The central Gen II → Gen III transition within this trajectory, which we call the semantics-in-data → semantics-in-models shift, is predictive: each generation's failure modes and current open problems follow from where that generation located its semantic effort. The survey makes five contributions: (i) a unified evolutionary narrative spanning 1990-2026; (ii) a four-dimensional comparative framework (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism) applied uniformly across all three generations; (iii) classification of sixteen representative systems on these dimensions, including hybrid LLM-knowledge-graph and computer-use agents; (iv) coverage of the November 2024 to August 2026 institutional convergence (Linux Foundation's Agentic AI Foundation, A2A v1.0, MCP November 2024 launch and November 2025 specification, Visa/Mastercard/Stripe payment-network protocols, EU AI Act phased enforcement, the NIST AI Agent Standards Initiative, International AI Safety Report 2026); and (v) seven named lessons grounded in cross-generational evidence paired with seven generation-invariant challenges that persist regardless of which protocol prevails. Further progress depends less on protocol design than on the socio-technical infrastructure now being assembled by standards bodies, regulators, and commercial payment networks.

2507.09179 2026-05-26 cs.AI

Hide-and-Shill: A Reinforcement Learning Framework for Market Manipulation Detection in Symphony-a Decentralized Multi-Agent System

Hide-and-Shill:面向交响乐系统中市场操纵检测的强化学习框架——一个去中心化多智能体系统

Ronghua Shi, Yiou Liu, Yuchun Feng, Lynn Ai, Bill Shi, Zhuang Liu

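AI summary Proposes a multi-agent reinforcement learning framework that models the interaction between manipulators and detectors as a dynamic adversarial game, identifies suspicious patterns from delayed token price reactions, and integrates GRPO, a theory-based reward function, and a multi-modal agent pipeline to achieve robust manipulation detection in the decentralized Symphony system without centralized oracles.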

详情
AI中文摘要

去中心化金融(DeFi)开创了无需许可的金融创新新时代,但也导致了前所未有的市场操纵。在没有中心化监管的情况下,恶意行为者协调各种平台上的宣传活动和拉高出货计划。我们提出了一个用于去中心化操纵检测的多智能体强化学习(MARL)框架,将操纵者和检测者之间的交互建模为动态对抗博弈。该框架利用延迟的代币价格反应作为财务指标来识别可疑模式。我们的方法引入了三项创新:(1)群体相对策略优化(GRPO),以增强在稀疏奖励和部分可观测设置下的学习稳定性;(2)一个基于理性预期和信息不对称理论启发的奖励函数,区分价格发现与操纵噪声;(3)一个多模态智能体管道,集成基于LLM的语义特征、社交图信号和链上市场数据,以支持知情决策。该框架集成在Symphony系统中,这是一个去中心化的多智能体架构,通过分布式日志支持点对点智能体执行和信任感知学习,支持链上可验证评估。Symphony促进战略参与者之间的对抗性共同进化,并在没有中心化预言机的情况下保持鲁棒的操纵检测,从而实现对全球DeFi生态系统的实时监控。在10万个真实世界话语片段上训练,并在对抗性模拟中验证,Hide-and-Shill在检测准确性和因果归因方面达到了顶级性能。这项工作将多智能体系统与金融监控相结合,推动了去中心化市场智能的新范式。所有资源可在Hide-and-Shill GitHub仓库中获取,以促进开放研究和可重复性。

英文摘要

Decentralized finance (DeFi) has introduced a new era of permissionless financial innovation but also led to unprecedented market manipulation. Without centralized oversight, malicious actors coordinate shilling campaigns and pump-and-dump schemes across various platforms. We propose a Multi-Agent Reinforcement Learning (MARL) framework for decentralized manipulation detection, modeling the interaction between manipulators and detectors as a dynamic adversarial game. This framework identifies suspicious patterns using delayed token price reactions as financial indicators. Our method introduces three innovations: (1) Group Relative Policy Optimization (GRPO) to enhance learning stability in sparse-reward and partially observable settings; (2) a theory-based reward function inspired by rational expectations and information asymmetry, differentiating price discovery from manipulation noise; and (3) a multi-modal agent pipeline that integrates LLM-based semantic features, social graph signals, and on-chain market data for informed decision-making. The framework is integrated within the Symphony system, a decentralized multi-agent architecture enabling peer-to-peer agent execution and trust-aware learning through distributed logs, supporting chain-verifiable evaluation. Symphony promotes adversarial co-evolution among strategic actors and maintains robust manipulation detection without centralized oracles, enabling real-time surveillance across global DeFi ecosystems. Trained on 100,000 real-world discourse episodes and validated in adversarial simulations, Hide-and-Shill achieves top performance in detection accuracy and causal attribution. This work bridges multi-agent systems with financial surveillance, advancing a new paradigm for decentralized market intelligence. All resources are available at the Hide-and-Shill GitHub repository to promote open research and reproducibility.
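
For readers unfamiliar with GRPO, its core step is to score each sampled response relative to the other responses in the same group rather than against a learned value function. The sketch below shows only that normalization step under common conventions; the use of the population standard deviation and the `eps` constant are choices made here, and nothing in it reflects how Hide-and-Shill defines its rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four detector rollouts on the same episode.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])
```

When every rollout in a group earns the same reward, all advantages are zero, which is one reason sparse-reward settings benefit from larger or more diverse groups.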

2506.19117 2026-05-26 cs.CV

PrITTI: Primitive-based Generation of Controllable and Editable 3D Semantic Urban Scenes

PrITTI: 基于基元的可控可编辑3D语义城市场景生成

Christina Ourania Tze, Daniel Dauner, Yiyi Liao, Dzmitry Tsishkou, Andreas Geiger

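AI summary Proposes PrITTI, a latent diffusion model that uses a hybrid representation of vectorized object primitives and rasterized ground surfaces to generate high-quality, controllable, and editable 3D semantic urban scenes.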

Comments Accepted to CVPR 2026

详情
AI中文摘要

现有的3D语义城市场景生成方法主要依赖于基于体素的表示,这些方法受限于固定分辨率、难以编辑且密集形式下内存消耗大。相比之下,我们倡导一种基于基元的范式,其中城市场景使用紧凑、语义上有意义的3D元素表示,这些元素易于操作和组合。为此,我们引入了PrITTI,一种潜在扩散模型,利用矢量化对象基元和栅格化地面表面生成多样化、可控且可编辑的3D语义城市场景。这种混合表示产生了一个结构化的潜在空间,便于对象和地面级别的操作。在KITTI-360上的实验表明,基于基元的表示释放了扩散Transformer的全部能力,实现了最先进的3D场景生成质量,同时内存需求更低、推理速度更快、可编辑性优于基于体素的方法。除了生成,PrITTI还支持一系列下游应用,包括场景编辑、修复、外推和照片级真实感的街景合成。源代码和更多结果可在https://raniatze.github.io/pritti/找到。

英文摘要

Existing approaches to 3D semantic urban scene generation predominantly rely on voxel-based representations, which are bound by fixed resolution, challenging to edit, and memory-intensive in their dense form. In contrast, we advocate for a primitive-based paradigm where urban scenes are represented using compact, semantically meaningful 3D elements that are easy to manipulate and compose. To this end, we introduce PrITTI, a latent diffusion model that leverages vectorized object primitives and rasterized ground surfaces for generating diverse, controllable, and editable 3D semantic urban scenes. This hybrid representation yields a structured latent space that facilitates object- and ground-level manipulation. Experiments on KITTI-360 show that primitive-based representations unlock the full capabilities of diffusion transformers, achieving state-of-the-art 3D scene generation quality with lower memory requirements, faster inference, and greater editability than voxel-based methods. Beyond generation, PrITTI supports a range of downstream applications, including scene editing, inpainting, outpainting, and photo-realistic street-view synthesis. The source code and more results can be found at https://raniatze.github.io/pritti/.

2506.05360 2026-05-26 cs.CV

CarboFormer: A Lightweight Semantic Segmentation Architecture for Efficient Carbon Dioxide Detection Using Optical Gas Imaging

CarboFormer: 一种用于光学气体成像的轻量级语义分割架构,实现高效二氧化碳检测

Taminul Islam, Toqi Tahamid Sarker, Mohamed G Embaby, Khaled R Ahmed, Amer AbuGhazaleh

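AI summary Proposes CarboFormer, a lightweight semantic segmentation framework that uses an optimized encoder-decoder, multi-scale feature fusion, and auxiliary supervision for accurate real-time detection of CO2 emissions in resource-constrained environments, and contributes two new datasets.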

详情
Journal ref
Advances in Visual Computing. ISVC 2025. Lecture Notes in Computer Science, vol 16397, pp. 3-15, Springer, Cham, 2026
AI中文摘要

二氧化碳(CO$_2$)排放是环境影响和多种工业过程(包括畜牧业管理)的关键指标。我们提出了CarboFormer,一种用于光学气体成像(OGI)的轻量级语义分割框架,旨在检测和量化不同应用中的CO$_2$排放。我们的方法集成了优化的编码器-解码器架构与专门的多尺度特征融合和辅助监督策略,以有效建模气体羽流图像中的局部细节和全局关系,同时在资源受限环境中以最小的计算开销实现有竞争力的精度。我们贡献了两个新数据集:(1)受控二氧化碳释放(CCR)数据集,模拟了系统变化流速(10-100 SCCM)的气体泄漏;(2)实时Ankom(RTA)数据集,专注于奶牛瘤胃液体外实验的排放。大量评估表明,CarboFormer在CCR上达到84.88% mIoU,在RTA上达到92.98% mIoU,同时保持计算效率,仅5.07M参数,运行速度为84.68 FPS。该模型在具有挑战性的低流量场景中特别有效,显著优于其他轻量级方法,如SegFormer-B0(CCR上83.36% mIoU)和SegNeXt(CCR上82.55% mIoU),使其适用于资源受限平台(如可编程无人机)上的实时监测。我们的工作通过提供稳健高效的CO$_2$排放分析工具,推进了环境传感和精准畜牧业管理。

英文摘要

Carbon dioxide (CO$_2$) emissions are critical indicators of both environmental impact and various industrial processes, including livestock management. We introduce CarboFormer, a lightweight semantic segmentation framework for Optical Gas Imaging (OGI), designed to detect and quantify CO$_2$ emissions across diverse applications. Our approach integrates an optimized encoder-decoder architecture with specialized multi-scale feature fusion and auxiliary supervision strategies to effectively model both local details and global relationships in gas plume imagery while achieving competitive accuracy with minimal computational overhead for resource-constrained environments. We contribute two novel datasets: (1) the Controlled Carbon Dioxide Release (CCR) dataset, which simulates gas leaks with systematically varied flow rates (10-100 SCCM), and (2) the Real Time Ankom (RTA) dataset, focusing on emissions from dairy cow rumen fluid in vitro experiments. Extensive evaluations demonstrate that CarboFormer achieves competitive performance with 84.88% mIoU on CCR and 92.98% mIoU on RTA, while maintaining computational efficiency with only 5.07M parameters and operating at 84.68 FPS. The model shows particular effectiveness in challenging low-flow scenarios and significantly outperforms other lightweight methods like SegFormer-B0 (83.36% mIoU on CCR) and SegNeXt (82.55% mIoU on CCR), making it suitable for real-time monitoring on resource-constrained platforms such as programmable drones. Our work advances both environmental sensing and precision livestock management by providing robust and efficient tools for CO$_2$ emission analysis.
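
The mIoU figures quoted above follow the standard definition: per-class intersection over union, averaged over classes. The sketch below computes it on flattened label lists; skipping classes absent from both prediction and ground truth is a common convention assumed here, and the toy labels are invented for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes, for flat lists of class ids."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union:  # skip classes that appear in neither mask
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: class 0 = background, class 1 = gas plume.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(round(mean_iou(pred, target, num_classes=2), 4))
```

Here the background class scores 1/2 and the plume class 2/3, so the mean is 7/12.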

2505.24876 2026-05-26 cs.CV cs.CL

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

Agent-X:评估视觉中心智能体任务中的深度多模态推理

Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan

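AI summary Introduces the Agent-X benchmark, whose 828 real-world visual tasks and fine-grained step-level evaluation framework reveal that current models achieve less than 50% full-chain success on multi-step visual reasoning.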

Comments Accepted at the International Conference on Learning Representations (ICLR 2026)

详情
AI中文摘要

深度推理对于解决复杂任务至关重要,尤其是在需要顺序多模态理解的视觉中心场景中。然而,现有基准通常使用完全合成的单轮查询、有限的视觉模态进行评估,并且缺乏在真实世界环境中多步推理质量的评估框架。为了解决这一问题,我们引入了Agent-X,这是一个大规模基准,用于评估视觉中心智能体在真实多模态环境中的多步和深度推理能力。Agent-X包含828个具有真实视觉上下文的智能体任务,包括图像、多图像比较、视频和指令文本。这些任务涵盖六大智能体环境:通用视觉推理、网页浏览、安全与监控、自动驾驶、体育和数学推理。我们的基准要求智能体在这些多样化环境中将工具使用与明确的逐步决策相结合。此外,我们提出了一个细粒度的步骤级评估框架,用于评估每个推理步骤的正确性和逻辑连贯性以及整个任务中工具使用的有效性。我们的结果表明,即使是最佳性能模型,包括GPT、Gemini和Qwen系列,也难以解决多步视觉任务,全链成功率低于50%。这些发现突显了当前LMM推理和工具使用能力的关键瓶颈,并指出了视觉中心智能体推理模型的未来研究方向。我们的数据和代码公开在https://github.com/mbzuai-oryx/Agent-X。

英文摘要

Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries and limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at https://github.com/mbzuai-oryx/Agent-X
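
The gap between step-level quality and full-chain success can be illustrated with a small sketch. It assumes that a task counts as a full-chain success only when every one of its reasoning steps is judged correct; that reading, the function names, and the toy data are assumptions for illustration and are not taken from the Agent-X evaluation code.

```python
def full_chain_success_rate(tasks):
    """tasks: list of per-task lists of booleans, one flag per reasoning step."""
    solved = sum(1 for steps in tasks if steps and all(steps))
    return solved / len(tasks)


def step_accuracy(tasks):
    """Fraction of individual steps judged correct, pooled over all tasks."""
    flat = [ok for steps in tasks for ok in steps]
    return sum(flat) / len(flat)


# Hypothetical step-level judgments for four tasks.
judged = [[True, True], [True, False], [True, True, True], [False]]
print(step_accuracy(judged), full_chain_success_rate(judged))
```

In this toy data three quarters of the steps are correct, yet only half of the tasks succeed end to end, which is the kind of compounding effect a step-level framework makes visible.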