arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.00443 2026-05-26 cs.SD cs.MM eess.AS

RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models

Ruinan Jin, Xinting Liao, Hanlin Yu, Deval Pandya, Xiaoxiao Li

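AI summary Introduces the RVCBench dataset and benchmark, which uses 18 robustness evaluations, 225 speakers, and 14,370 utterances to systematically assess voice cloning models under realistic conditions such as noise, multilingual input, long-form text, post-processing, and adversarial perturbations.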

Comments 65 pages, 10 figures

Abstract

Modern voice cloning, also known as zero-shot text-to-speech (TTS), can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practice, these systems often face noisy reference audio, imperfect text prompts, multilingual and long-form generation, post-processing, and adversarial perturbations, all of which can weaken robustness. Despite rapid progress in codec-token language models and diffusion-based TTS, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive dataset and benchmark for evaluating robustness in voice cloning. RVCBench provides task-aligned tests covering controlled text-audio pairing, multilingual and long-form scenarios, expressive prompts, post-processing conditions, and passive or proactive audio perturbations. Across 18 robustness evaluations, 225 speakers, and 14,370 utterances, RVCBench supports unified evaluation of input sensitivity, generation stability, output resilience, perturbation robustness, speaker similarity, and deepfake detectability. We evaluate 18 representative open-source voice cloning models and reveal systematic vulnerabilities in content consistency, speaker similarity, long-form stability, post-processing resilience, adversarial robustness, and detector-facing separability. We release the code and dataset to support reproducible evaluation and future research on robust voice cloning, speech synthesis, and audio generation. Code: https://github.com/Nanboy-Ronan/RVCBench. Dataset: https://huggingface.co/datasets/Nanboy/RVCBench.

2601.23164 2026-05-26 cs.LG

Stochastic Linear Bandits with Parameter Noise

Daniel Ezer, Alon Peled-Cohen, Yishay Mansour

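AI summary Studies stochastic linear bandits with parameter noise, proposes a simple explore-exploit algorithm whose regret matches the lower bound up to logarithmic factors, and shows that the optimal regret order differs from that of the classic additive noise model.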

Comments 8 pages

Abstract

We study the stochastic linear bandits with parameter noise model, in which the reward of action $a$ is $a^\top \theta$ where $\theta$ is sampled i.i.d. We show a regret upper bound of $\widetilde{O}(\sqrt{d T \log(K/\delta)\, \sigma^2_{\max}})$ for a horizon $T$, general action set of size $K$ of dimension $d$, and where $\sigma^2_{\max}$ is the maximal variance of the reward for any action. We further provide a lower bound of $\widetilde{\Omega}(d \sqrt{T \sigma^2_{\max}})$ which is tight (up to logarithmic factors) whenever $\log(K) \approx d$. For more specific action sets, $\ell_p$ unit balls with $p \leq 2$ and dual norm $q$, we show that the minimax regret is $\widetilde{\Theta}(\sqrt{dT \sigma^2_q})$, where $\sigma^2_q$ is a variance-dependent quantity that is always at most $4$. This is in contrast to the minimax regret attainable for such sets in the classic additive noise model, where the regret is of order $d \sqrt{T}$. Surprisingly, we show that this optimal (up to logarithmic factors) regret bound is attainable using a very simple explore-exploit algorithm.
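
The abstract does not spell out the explore-exploit algorithm, so the following is only a minimal sketch of a generic explore-then-commit strategy in the parameter-noise setting, where a fresh $\theta$ is drawn in every round. The function names, the round-robin exploration schedule, and the Gaussian `sample_theta` are illustrative assumptions, not the authors' method.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def explore_then_commit(actions, sample_theta, horizon, explore_rounds, rng):
    """Generic explore-then-commit (an assumption, not the paper's algorithm).

    Plays actions round-robin for `explore_rounds` rounds, then commits to the
    action with the highest empirical mean reward for the remaining rounds.
    """
    sums = [0.0] * len(actions)
    counts = [0] * len(actions)
    committed = None
    total_reward = 0.0
    for t in range(horizon):
        exploring = t < explore_rounds
        if exploring:
            idx = t % len(actions)
        else:
            if committed is None:
                committed = max(
                    range(len(actions)),
                    key=lambda i: sums[i] / max(counts[i], 1),
                )
            idx = committed
        # Parameter noise: a fresh theta is sampled i.i.d. in each round.
        reward = dot(actions[idx], sample_theta(rng))
        if exploring:
            sums[idx] += reward
            counts[idx] += 1
        total_reward += reward
    return committed, total_reward


def sample_theta(rng):
    """Hypothetical noise model: mean (1.0, 0.2) with Gaussian noise."""
    return [rng.gauss(1.0, 0.1), rng.gauss(0.2, 0.1)]


actions = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]
committed, total_reward = explore_then_commit(
    actions, sample_theta, horizon=2000, explore_rounds=300, rng=random.Random(0)
)
```

In this toy instance the first action has the largest mean reward, so a reasonable exploration budget should lead the learner to commit to index 0. How long to explore is the design choice that the regret analysis would govern; the value 300 here is arbitrary.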

2601.22709 2026-05-26 cs.CV cs.AI

Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li

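AI summary Proposes GRACE, a framework that unifies knowledge distillation and quantization-aware training under the Information Bottleneck principle. Using confidence-gated decoupled distillation, relational centered kernel alignment, and an adaptive controller, its INT4 models outperform FP16 baselines and approach teacher performance while substantially reducing memory use and increasing throughput.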

Comments Accepted to the International Conference on Machine Learning (ICML 2026)

Abstract

Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training (QAT) for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernels, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment. Code and data are available at: https://github.com/ForeverBlue816/GRACE.
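
To make the capacity constraint concrete, here is a minimal sketch of symmetric per-tensor INT4 quantization, in which every weight is mapped to one of 16 signed integer codes. The abstract does not describe GRACE's actual quantizer, so the per-tensor scaling and the function names `quantize_int4` and `dequantize` are assumptions made for illustration.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude onto code 7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [1.4, -0.6, 0.25, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error of each weight is bounded by half the scale, which is one way to see quantization as a fixed information budget that training then has to spend carefully.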

2601.21924 2026-05-26 cs.LG stat.ML

One-Step Bellman Alignment Enables Provably Efficient Transfer in Online RL

Elynn Chen, Enpei Zhang, Jinhang Chai, Yujun Yan

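AI summary Identifies one-step Bellman alignment as the correct abstraction for transfer in online reinforcement learning, implements an operator-level correction via re-weighted targeting (RWT), and establishes regret bounds under RKHS function approximation that scale with the complexity of the task shift.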

Abstract

We study online transfer reinforcement learning (RL) in episodic Markov decision processes, where experience from related source tasks is available during learning on a target task. A fundamental difficulty is that task similarity is typically defined in terms of rewards or transitions, whereas online RL algorithms operate on Bellman regression targets. As a result, naively reusing source Bellman updates introduces systematic bias and invalidates regret guarantees. We identify one-step Bellman alignment as the correct abstraction for transfer in online RL and propose re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch via a change of measure. RWT reduces task mismatch to a fixed one-step correction and enables statistically sound reuse of source data. This alignment yields a two-stage RWT $Q$-learning framework that separates variance reduction from bias correction. Under RKHS function approximation, we establish regret bounds that scale with the complexity of the task shift rather than the target MDP. We further show the required density ratios admit a constructive RKHS estimator with finite-sample guarantees, and empirically validate robustness to estimated and mis-specified ratios. Empirical results in both tabular and neural network settings demonstrate consistent improvements over single-task learning and naïve pooling, highlighting Bellman alignment as a model-agnostic transfer principle for online RL.

2601.21601 2026-05-26 cs.LG cs.AI

Dynamics Reveals Structure: Challenging the Linear Propagation Assumption

Hoyeon Chang, Bálint Mucsányi, Seong Joon Oh

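AI summary Uses relation algebra to study the geometric limits of the Linear Propagation Assumption in neural networks, proving that it is feasible for the involutive operations (negation, converse) but meets a fundamental obstruction for composition that forces the feature map to collapse, and points to a common root for failures in knowledge editing, the reversal curse, and multi-hop reasoning.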

Abstract

Neural networks adapt through first-order parameter updates, yet it remains unclear whether such updates preserve logical coherence. We investigate the geometric limits of the Linear Propagation Assumption (LPA), the premise that local updates coherently propagate to logical consequences. To formalize this, we adopt relation algebra and study three core operations on relations: negation flips truth values, converse swaps argument order, and composition chains relations. For negation and converse, we prove that guaranteeing direction-agnostic first-order propagation necessitates a tensor factorization separating entity-pair context from relation content. However, for composition, we identify a fundamental obstruction. We show that composition reduces to conjunction, and prove that any conjunction well-defined on linear features must be bilinear. Since bilinearity is incompatible with negation, this forces the feature map to collapse. These results suggest that failures in knowledge editing, the reversal curse, and multi-hop reasoning may stem from common structural limitations inherent to the LPA.

2601.21463 2026-05-26 cs.SD cs.AI

Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs

Jun Xue, Yi Chai, Yanzhen Ren, Jinshen He, Zhiqiang Tang, Zhuolin Yi, Yihuan Huang, Yuankun Xie, Yujie Chen

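AI summary Proposes a unified framework based on Audio LLMs that jointly handles speech editing detection and content localization through a generative formulation, and introduces a prior-enhanced prompting strategy and an acoustic consistency-aware loss to improve performance.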

Abstract

Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or limited editing operations, resulting in restricted diversity and poor coverage of realistic editing scenarios. Meanwhile, current SED methods rely heavily on frame-level supervision to detect observable acoustic anomalies, which fundamentally limits their ability to handle deletion-type edits, where the manipulated content is entirely absent from the signal. To address these challenges, we present a unified framework that bridges speech editing detection and content localization through a generative formulation based on Audio Large Language Models (Audio LLMs). We first introduce AiEdit (https://huggingface.co/datasets/JunXueTech/AiEdit), a large-scale bilingual dataset (approximately 140 hours) that covers addition, deletion, and modification operations using state-of-the-art end-to-end speech editing systems, providing a more realistic benchmark for modern threats. Building upon this, we reformulate SED as a structured text generation task, enabling joint reasoning over edit type identification and content localization. To enhance the grounding of generative models in acoustic evidence, we propose a prior-enhanced prompting strategy that injects word-level probabilistic cues derived from a frame-level detector. Furthermore, we introduce an acoustic consistency-aware loss that explicitly enforces the separation between normal and anomalous acoustic representations in the latent space. Experimental results demonstrate that the proposed approach consistently outperforms existing methods across both detection and localization tasks.

2601.20738 2026-05-26 cs.LG cs.DC eess.SP math.OC stat.ML

SA-PEF: Step-Ahead Partial Error Feedback for Efficient Federated Learning

Dawit Kiros Redie, Reza Arablouei, Stefan Werner

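AI summary Proposes SA-PEF, which combines step-ahead correction with partial error feedback to accelerate federated learning convergence under non-IID data and partial client participation, and proves that its convergence rate is comparable to that of Fed-SGD.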

Journal ref Transactions on Machine Learning Research, 2026

Abstract

Biased gradient compression with error feedback (EF) reduces communication in federated learning (FL), but under non-IID data, the residual error can decay slowly, causing gradient mismatch and stalled progress in the early rounds. We propose step-ahead partial error feedback (SA-PEF), which integrates step-ahead (SA) correction with partial error feedback (PEF). SA-PEF recovers EF when the step-ahead coefficient $α=0$ and step-ahead EF (SAEF) when $α=1$. For non-convex objectives and $δ$-contractive compressors, we establish a second-moment bound and a residual recursion that guarantee convergence to stationarity under heterogeneous data and partial client participation. The resulting rates match standard non-convex Fed-SGD guarantees up to constant factors, achieving $O((η\,η_0 T R)^{-1})$ convergence to a variance/heterogeneity floor with a fixed inner step size. Our analysis reveals a step-ahead-controlled residual contraction $ρ_r$ that explains the observed acceleration in the early training phase. To balance SAEF's rapid warm-up with EF's long-term stability, we select $α$ near its theory-predicted optimum. Experiments across diverse architectures and datasets show that SA-PEF consistently reaches target accuracy faster than EF.
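
As background for the $α=0$ case, the sketch below shows plain error feedback with a top-$k$ compressor: whatever the compressor drops is stored as a residual and added back before the next compression. The step-ahead and partial-feedback updates are not specified in the abstract and are deliberately not reproduced; the names `top_k` and `ef_step` are hypothetical.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude coordinates and zero out the rest."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in keep:
        out[i] = vec[i]
    return out


def ef_step(grad, residual, k):
    """One round of standard error feedback on a single client."""
    # Add back what earlier rounds failed to transmit.
    corrected = [g + e for g, e in zip(grad, residual)]
    message = top_k(corrected, k)
    # The untransmitted part becomes the new residual.
    new_residual = [c - m for c, m in zip(corrected, message)]
    return message, new_residual


residual = [0.0, 0.0, 0.0]
message1, residual = ef_step([3.0, -1.0, 0.5], residual, k=1)
message2, residual = ef_step([0.1, -1.0, 0.5], residual, k=1)
```

In the second round the accumulated residual pushes the second coordinate above the others, so it is finally transmitted. The slow drain of that residual is the kind of delay the abstract attributes to EF under non-IID data.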

2601.16763 2026-05-26 cs.CV

Flow Matching for Probabilistic Monocular 3D Human Pose Estimation

Cuong Le, Pavlo Melnyk, Bastian Wandt, Mårten Wadenbäck

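AI summary Proposes FMPose, which uses a flow matching generative model to learn the distribution of 3D poses from 2D keypoints, models the 2D lifting condition with graph convolutional networks, and substantially speeds up inference while maintaining accuracy.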

Comments 12 pages, 2 figures, 8 tables, accepted to TMLR

Abstract

Recovering 3D human poses from a monocular camera view is a highly ill-posed problem due to depth ambiguity. Earlier studies on lifting 3D human poses from 2D often produce incorrect yet overconfident 3D estimates. To mitigate the problem, emerging probabilistic approaches treat the 3D estimates as a distribution, taking into account the uncertainty of the poses. In the same category, we propose FMPose, a probabilistic 3D human pose estimation method based on the flow matching generative approach. Conditioned on the 2D cues, the flow matching scheme learns the optimal transport from a simple source distribution to the plausible 3D human pose distribution via continuous normalizing flows. The 2D lifting condition is modeled via graph convolutional networks, leveraging the learnable connections between human body joints as the graph structure for feature aggregation. Although there is a trade-off between processing time and precision, FMPose is already significantly faster than the diffusion model in the equal-accuracy comparison, and it also offers another configuration that is both faster and more accurate. Experimental results show major improvements of FMPose over current state-of-the-art methods on two common benchmarks for 3D human pose estimation, namely Human3.6M and MPI-INF-3DHP. Additionally, FMPose shows competitive performance on the more challenging 3DPW dataset. The code implementation is available at https://github.com/cuongle1206/FMPose.
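
The flow matching idea can be illustrated without a neural network. The sketch below assumes the common linear interpolation path between a source sample `x0` and a target sample `x1`, whose regression target is the constant velocity `x1 - x0`, and samples by Euler integration. The oracle velocity field for a single known target is purely illustrative; FMPose instead learns a field conditioned on 2D keypoints, and all names here are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity field along the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target_pose = [0.3, -1.2, 2.0]  # hypothetical 3D joint coordinates


def oracle_velocity(x, t):
    """Exact field when every path ends at the single known target."""
    return [(b - xi) / (1.0 - t) for xi, b in zip(x, target_pose)]


source = [0.0, 0.0, 0.0]
midpoint = interpolate(source, target_pose, 0.5)
sample = euler_sample(oracle_velocity, source, steps=10)
```

With a learned field the number of Euler steps becomes the speed-versus-precision knob that the abstract refers to.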

2601.15544 2026-05-26 cs.LG cs.AI

RDumb++: Drift-Aware Continual Test-Time Adaptation

Himanshu Mishra

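AI summary Addresses performance collapse in continual test-time adaptation caused by rapid distribution changes or long-term drift. The proposed RDumb++ combines entropy-based and KL-divergence drift detection with adaptive reset strategies and gains about 3% absolute accuracy on the CCC benchmark.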

Abstract

Continual Test-Time Adaptation (CTTA) seeks to update a pretrained model during deployment using only the incoming, unlabeled data stream. Although prior approaches such as Tent and EATA provide meaningful improvements under short evolving shifts, they struggle when the test distribution changes rapidly or over extremely long horizons. This challenge is exemplified by the CCC benchmark, where models operate over streams of 7.5M samples with continually changing corruption types and severities. We propose RDumb++, a principled extension of RDumb that introduces two drift-detection mechanisms, namely entropy-based drift scoring and KL-divergence drift scoring, together with adaptive reset strategies. These mechanisms allow the model to detect when accumulated adaptation becomes harmful and to recover before prediction collapse occurs. Across CCC-medium with three speeds and three seeds (nine runs, each containing one million samples), RDumb++ consistently surpasses RDumb, yielding approximately 3% absolute accuracy gains while maintaining stable adaptation throughout the entire stream. Ablation experiments on drift thresholds and reset strengths further show that drift-aware resetting is essential for preventing collapse and achieving reliable long-horizon CTTA.
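
One plausible reading of entropy-based drift scoring is sketched below: track a running average of prediction entropy and signal a reset when the current entropy departs from it by more than a threshold. The exact score, the momentum, and the threshold used by RDumb++ are not given in the abstract, so every value and name here is an assumption.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


class EntropyDriftMonitor:
    """Flags drift when entropy departs from its running average."""

    def __init__(self, momentum=0.9, threshold=0.5):
        self.momentum = momentum    # hypothetical smoothing factor
        self.threshold = threshold  # hypothetical drift threshold
        self.running = None

    def update(self, probs):
        """Return True if the caller should reset the adapted weights."""
        h = entropy(probs)
        if self.running is None:
            self.running = h
            return False
        drift = abs(h - self.running)
        self.running = self.momentum * self.running + (1 - self.momentum) * h
        return drift > self.threshold


monitor = EntropyDriftMonitor()
confident = [0.98, 0.01, 0.01]
uniform = [1 / 3, 1 / 3, 1 / 3]
flags = [monitor.update(confident) for _ in range(5)]
flags.append(monitor.update(uniform))
```

A KL-divergence score could be wired in the same way by comparing the current averaged prediction against a reference distribution instead of comparing entropies.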

2601.09931 2026-05-26 cs.SD

Diffusion-based Frameworks for Unsupervised Speech Enhancement

Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel, Xavier Alameda-Pineda

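AI summary Proposes an unsupervised diffusion framework that jointly models speech and noise as latent variables sampled together in the E-step, and introduces a diffusion-based noise model, significantly improving speech enhancement performance.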

Abstract

This paper addresses unsupervised diffusion-based single-channel speech enhancement (SE). Prior work in this direction combines a score-based diffusion model trained on clean speech with a Gaussian noise model whose covariance is structured by non-negative matrix factorization (NMF). This combination is used within an iterative expectation-maximization (EM) scheme, in which a diffusion-based posterior-sampling E-step estimates the clean speech. We first revisit this framework and propose to explicitly model both speech and acoustic noise as latent variables, jointly sampling them in the E-step instead of sampling speech alone as in previous approaches. We then introduce a new semi-supervised SE framework that replaces the NMF noise prior with a diffusion-based noise model, learned jointly with the speech prior in a single conditional score model. Within this framework, we derive two variants: one that implicitly accounts for noise and one that explicitly treats noise as a latent variable. Experiments on WSJ0-QUT and VoiceBank-DEMAND show that explicit noise modeling systematically improves SE performance for both NMF-based and diffusion-based noise priors. Under matched conditions, the diffusion-based noise model attains the best overall quality and intelligibility among unsupervised methods, while under mismatched conditions the proposed NMF-based explicit-noise framework is more robust and suffers less degradation than several supervised baselines. Code, demo, and supplementary materials are publicly available.

2601.08205 2026-05-26 cs.CV cs.LG

FUME: Fused Unified Multi-Gas Emission Network for Livestock Rumen Acidosis Detection

Taminul Islam, Toqi Tahamid Sarker, Mohamed Embaby, Khaled R Ahmed, Amer AbuGhazaleh

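AI summary Proposes the FUME network, which uses dual-gas (CO2 and CH4) optical imaging with a lightweight dual-stream architecture and channel attention fusion to achieve high-accuracy segmentation and classification for rumen acidosis.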

Comments 10 pages, 5 figures

Journal ref Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2026, pp. 510-519

Abstract

Ruminal acidosis is a prevalent metabolic disorder in dairy cattle causing significant economic losses and animal welfare concerns. Current diagnostic methods rely on invasive pH measurement, limiting scalability for continuous monitoring. We present FUME (Fused Unified Multi-gas Emission Network), the first deep learning approach for rumen acidosis detection from dual-gas optical imaging under in vitro conditions. Our method leverages complementary carbon dioxide (CO2) and methane (CH4) emission patterns captured by infrared cameras to classify rumen health into Healthy, Transitional, and Acidotic states. FUME employs a lightweight dual-stream architecture with weight-shared encoders, modality-specific self-attention, and channel attention fusion, jointly optimizing gas plume segmentation and classification of dairy cattle health. We introduce the first dual-gas OGI dataset comprising 8,967 annotated frames across six pH levels with pixel-level segmentation masks. Experiments demonstrate that FUME achieves 80.99% mIoU and 98.82% classification accuracy while using only 1.28M parameters and 1.97G MACs, outperforming state-of-the-art methods in segmentation quality with 10x lower computational cost. Ablation studies reveal that CO2 provides the primary discriminative signal and dual-task learning is essential for optimal performance. Our work establishes the feasibility of gas emission-based livestock health monitoring, paving the way for practical, in vitro acidosis detection systems. Code is available at https://github.com/taminulislam/fume.

2601.06870 2026-05-26 cs.LG cs.AI

QASA: Quality-Aware Semantic Augmentation for Robust Multimodal Sentiment Analysis

Jiazhang Liang, Jianheng Dai, Miaosen Luo, Menghua Jiang, Sijie Mai

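AI summary Proposes the QASA framework, which uses diffusion models to generate augmented visual and auditory samples and assigns training weights through a decoupled quality-aware scoring module, addressing the scarcity of high-quality data and improving the robustness and generalization of multimodal sentiment analysis.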

Comments 11 pages, 4 figures

Abstract

Multimodal large language models have demonstrated strong ability in capturing semantic representations for multimodal sentiment analysis. Their capacity to learn stable and generalizable multimodal features is limited, however, by the scarcity of high-quality training data. To address this, we propose QASA (Quality-Aware Semantic Augmentation), which uses diffusion models to generate augmented visual and auditory samples, thereby enlarging the training dataset and supporting multimodal learning. The generated samples can vary in quality and may exhibit cross-modal inconsistencies. To manage this, we introduce a decoupled quality-aware scoring module that assigns training weights based on the reliability of each augmented sample. This approach reduces the influence of low-quality data and contributes to more stable and robust model training. The framework combines the generative capabilities of diffusion models with the semantic reasoning of multimodal large models, providing an automated data augmentation strategy that does not require human annotation while improving generalization and robustness under limited high-quality data. Experiments on the CH-SIMS dataset show that QASA yields a relative increase of 18.0% and 5.9% in five-class accuracy (Acc5) and binary accuracy (Acc2), respectively, and it also outperforms existing methods on the CMU-MOSI and MUStARD benchmarks.

2601.03014 2026-05-26 cs.CL cs.AI

SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering

Junli Liang, Pengfei Zhou, Wangqiu Zhou, Wenjie Qing, Qi Zhao, Ziwen Wang, Qi Song, Xiangyang Li

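AI summary Proposes SentGraph, a sentence-level graph-based RAG framework that builds a hierarchical sentence graph and models fine-grained logical relationships to address incomplete evidence chains in multi-hop question answering.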

Abstract

Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces significant limitations in multi-hop question answering tasks, which require combining evidence from multiple documents. Existing chunk-based retrieval often provides irrelevant and logically incoherent context, leading to incomplete evidence chains and incorrect reasoning during answer generation. To address these challenges, we propose SentGraph, a sentence-level graph-based RAG framework that explicitly models fine-grained logical relationships between sentences for multi-hop question answering. Specifically, we construct a hierarchical sentence graph offline by first adapting Rhetorical Structure Theory to distinguish nucleus and satellite sentences, and then organizing them into topic-level subgraphs with cross-document entity bridges. During online retrieval, SentGraph performs graph-guided evidence selection and path expansion to retrieve fine-grained sentence-level evidence. Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.

2512.20063 2026-05-26 cs.LG

PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models

Mingue Park, Jisung Hwang, Seungwoo Yoo, Kyeongmin Yeo, Minhyuk Sung

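AI summary Proposes PairFlow, a lightweight preprocessing method that constructs paired source-target samples through closed-form inversion, enabling few-step sampling for discrete flow models without a pretrained teacher and matching or even surpassing two-stage finetuning.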

Comments ICLR 2026

Abstract

We introduce $\texttt{PairFlow}$, a lightweight preprocessing step for training Discrete Flow Models (DFMs) to achieve few-step sampling without requiring a pretrained teacher. DFMs have recently emerged as a new class of generative models for discrete data, offering strong performance. However, they suffer from slow sampling due to their iterative nature. Existing acceleration methods largely depend on finetuning, which introduces substantial additional training overhead. $\texttt{PairFlow}$ addresses this issue with a lightweight preprocessing step. Inspired by ReFlow and its extension to DFMs, we train DFMs from coupled samples of source and target distributions, without requiring any pretrained teacher. At the core of our approach is a closed-form inversion for DFMs, which allows efficient construction of paired source-target samples. Despite its extremely low cost, taking only up to 1.7% of the compute needed for full model training, $\texttt{PairFlow}$ matches or even surpasses the performance of two-stage training involving finetuning. Furthermore, models trained with our framework provide stronger base models for subsequent distillation, yielding further acceleration after finetuning. Experiments on molecular data as well as binary and RGB images demonstrate the broad applicability and effectiveness of our approach.

2512.16710 2026-05-26 cs.CV

A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry

Chiara Di Vece, Zhehua Mao, Netanell Avisdris, Brian Dromey, Raffaele Napolitano, Dafna Ben Bashat, Francisco Vasconcelos, Danail Stoyanov, Leo Joskowicz, Sophia Bano

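AI summary To address time-consuming, operator-dependent manual annotation in fetal ultrasound biometry, builds a public benchmark dataset of 4,513 images from 7 devices at 3 clinical centres, provides a standardised evaluation pipeline and baseline results, and shows that single-centre training overestimates performance, offering a benchmark for multi-centre generalisation research.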

Comments 11 pages, 5 figures, 3 tables

Journal ref Scientific Reports (2026)

Abstract

Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset comprises 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.

2512.15922 2026-05-26 cs.AI

Leveraging Spreading Activation for Improved Document Retrieval in Knowledge-Graph-Based RAG Systems

利用传播激活改进基于知识图谱的RAG系统中的文档检索

Jovan Pavlović, Miklós Krész, László Hajdu

AI总结 提出一种基于自动构建异构知识图谱的传播激活算法,用于多跳问答中的文档检索,减少对语义知识图谱和LLM引导的依赖,性能优于或持平现有方法。

Comments 20 pages, 5 figures

详情
AI中文摘要

尽管初始成功且架构多样,检索增强生成系统在复杂推理任务中仍难以可靠地检索和连接多步证据。大多数标准RAG框架将所有检索到的信息视为同等可靠,忽视了大型文本语料库中不同的可信度和相互关联性。GraphRAG方法通过集成知识图谱(将信息结构化为节点和边,捕获实体关系,支持多步逻辑遍历)为RAG系统提供了潜在改进。然而,GraphRAG并非总是理想方案,因为它依赖于语料库的高质量图表示,这类表示通常需要人工构建知识图谱(构建和更新成本高)或自动图构建流程(往往不可靠)。此外,遵循该范式的系统通常使用大语言模型引导图遍历和证据检索。本文提出一种新颖的RAG框架,使用传播激活算法从由自动构建的异构知识图谱连接的文档语料库中检索信息。该方法减少了对语义知识图谱的依赖(后者因信息提取过程中的信息丢失而常不完整),避免了LLM引导的图遍历,并提高了多跳问答的性能。实验表明,我们的方法在多项最先进RAG方法上达到更好或相当的性能,并可作为即插即用模块集成到不同的迭代RAG流程中。与思维链迭代检索结合时,在答案正确性上相比朴素RAG实现了高达39%的绝对提升,且使用小型开源语言模型即可达到这些结果。

英文摘要

Despite initial successes and a variety of architectures, retrieval-augmented generation systems still struggle to reliably retrieve and connect the multi-step evidence required for complicated reasoning tasks. Most of the standard RAG frameworks regard all retrieved information as equally reliable, overlooking the varying credibility and interconnected nature of large textual corpora. GraphRAG approaches offer potential improvement to RAG systems by integrating knowledge graphs, which structure information into nodes and edges, capture entity relationships, and enable multi-step logical traversal. However, GraphRAG is not always an ideal solution, as it depends on high-quality graph representations of the corpus. Such representations usually rely on manually curated knowledge graphs, which are costly to construct and update, or on automated graph-construction pipelines that are often unreliable. Moreover, systems following this paradigm typically use large language models to guide graph traversal and evidence retrieval. In this paper, we propose a novel RAG framework that uses a spreading activation algorithm to retrieve information from a corpus of documents connected by an automatically constructed heterogeneous knowledge graph. This approach reduces reliance on semantic knowledge graphs, which are often incomplete due to information loss during information extraction, avoids LLM-guided graph traversal, and improves performance on multi-hop question answering. Experiments show that our method achieves better or comparable performance to several state-of-the-art RAG methods and can be integrated as a plug-and-play module with different iterative RAG pipelines. When combined with chain-of-thought iterative retrieval, it yields up to a 39% absolute improvement in answer correctness over naive RAG, while achieving these results with small open-weight language models.
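As a rough illustration of spreading activation over a document graph, the sketch below propagates activation from query-matched seed nodes along weighted edges. The graph layout, the `decay` and `threshold` parameters, and the node names are assumptions for illustration, not the paper's algorithm or settings.

```python
from collections import defaultdict

def spread_activation(graph, seeds, decay=0.5, threshold=0.05, max_steps=3):
    """graph: node -> list of (neighbor, edge_weight); seeds: node -> initial activation."""
    activation = defaultdict(float, seeds)
    frontier = dict(seeds)
    for _ in range(max_steps):
        next_frontier = defaultdict(float)
        for node, energy in frontier.items():
            for neighbor, weight in graph.get(node, []):
                passed = energy * weight * decay  # energy attenuates at every hop
                if passed >= threshold:           # weak signals are not propagated
                    next_frontier[neighbor] += passed
        if not next_frontier:
            break
        for node, energy in next_frontier.items():
            activation[node] += energy
        frontier = next_frontier
    return dict(activation)

# Toy graph: a query entity linked to one document directly and to another via a second entity.
graph = {
    "q_entity": [("doc_a", 1.0), ("entity_x", 0.8)],
    "entity_x": [("doc_b", 1.0)],
}
scores = spread_activation(graph, {"q_entity": 1.0})
ranked_docs = sorted((n for n in scores if n.startswith("doc_")), key=scores.get, reverse=True)
```

Here the two-hop document `doc_b` is still retrieved, with a lower score than the directly linked `doc_a`. No language model call is involved in the traversal.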

2512.13323 2026-05-26 cs.AI cs.LG

Error-Driven Prompt Optimization for Arithmetic Reasoning

基于错误驱动的算术推理提示优化

Árpád Pándy, Róbert Lakatos, András Hajdu

AI总结 提出一种错误驱动的提示优化框架,通过聚类错误预测迭代优化提示规则,使小型本地语言模型在算术推理任务中准确率达到70.8%,超越GPT-3.5 Turbo。

详情
Journal ref
IEEE Access, vol. 14, pp. 62570-62583, 2026
AI中文摘要

人工智能的最新进展激发了人们对工业代理的兴趣,这些代理能够在表格数据工作流中支持金融和医疗等受监管领域的分析师。此类系统的关键能力是对结构化数据执行准确的算术运算,同时确保敏感信息永远不会离开安全的本地环境。在此,我们引入了一种用于算术推理的错误驱动优化框架,该框架增强了代码生成代理(CGA),特别应用于本地小型语言模型(SLM)。通过对领先的SLM(Qwen3 4B)进行系统评估,我们发现虽然基础模型在算术任务中表现出基本局限性,但我们提出的错误驱动方法通过聚类错误预测来迭代优化提示规则,显著提升了性能,将模型准确率提高到70.8%。我们的结果表明,开发可靠、可解释且可工业部署的AI助手不仅可以通过昂贵的微调实现,还可以通过系统的、错误驱动的提示优化来实现,从而使小型模型以符合隐私要求的方式超越大型语言模型(GPT-3.5 Turbo)。

英文摘要

Recent advancements in artificial intelligence have sparked interest in industrial agents capable of supporting analysts in regulated sectors, such as finance and healthcare, within tabular data workflows. A key capability for such systems is performing accurate arithmetic operations on structured data while ensuring sensitive information never leaves secure, on-premises environments. Here, we introduce an error-driven optimization framework for arithmetic reasoning that enhances a Code Generation Agent (CGA), specifically applied to on-premises small language models (SLMs). Through a systematic evaluation of a leading SLM (Qwen3 4B), we find that while the base model exhibits fundamental limitations in arithmetic tasks, our proposed error-driven method, which clusters erroneous predictions to refine prompt rules iteratively, dramatically improves performance, elevating the model's accuracy to 70.8%. Our results suggest that developing reliable, interpretable, and industrially deployable AI assistants can be achieved not only through costly fine-tuning but also via systematic, error-driven prompt optimization, enabling small models to surpass larger language models (GPT-3.5 Turbo) in a privacy-compliant manner.

2512.10548 2026-05-26 cs.CV

Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding

Blink: 动态视觉令牌分辨率增强多模态理解

Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

AI总结 提出Blink框架,通过注意力引导的令牌超分辨率和动态丢弃机制,在单次前向传播中模拟人类眨眼式扫描,提升多模态大语言模型的视觉感知能力。

Comments CVPR 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在各种视觉-语言任务上取得了显著进展,但其视觉感知仍然有限。相比之下,人类通过动态扫描并顺序地以“眨眼式”过程聚焦于显著区域,高效地感知复杂场景。受此策略启发,我们首先研究MLLMs是否表现出类似行为。我们的初步分析表明,MLLMs自然地关注不同层的视觉区域,并且选择性地将更多计算分配给显著令牌可以增强视觉感知。基于这一见解,我们提出Blink,一种动态视觉令牌分辨率框架,在单次前向传播中模拟人类启发的过程。具体来说,Blink包括两个模块:显著性引导扫描和动态令牌分辨率。它首先基于注意力图估计每层视觉令牌的显著性,并通过即插即用的令牌超分辨率(TokenSR)模块扩展重要令牌。在下一层,当扩展令牌失去焦点时,它会丢弃它们。这种动态机制平衡了广泛探索和细粒度聚焦,从而自适应且高效地增强视觉感知。大量实验验证了Blink在增强视觉感知和多模态理解方面的有效性。

英文摘要

Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.

2512.08254 2026-05-26 cs.CV

Real-World Scene Recovery for Scattering-Degraded Images Using Spatial and Frequency Priors

使用空间和频率先验的散射退化图像真实场景恢复

Yun Liu, Tao Li, Guanghui Yue, Wenqi Ren, Cosmin Ancuti, Weisi Lin

AI总结 提出空间和频率先验(SFP)方法,通过空间域传输图估计和频率域自适应增强策略,实现散射退化图像的真实场景恢复,在多种真实场景中优于现有方法。

Comments 18 pages, 22 figures, submitted to IEEE T-PAMI

详情
AI中文摘要

从受散射效应(如雾、沙尘暴、水下和遥感条件)退化的真实图像中恢复场景,仍然是计算机视觉中一个基本但具有挑战性的问题。现有方法要么依赖单一先验(本质上不足以表征多样的散射退化),要么使用在合成数据上训练的深度网络(通常对真实场景的泛化能力有限)。在本文中,我们提出空间和频率先验(SFP)用于散射诱导退化下的真实场景恢复。在空间域,我们观察到散射退化图像的逆在其光谱方向上揭示了一个与底层场景传输相关的投影。基于这一观察,我们制定了一个空间先验来估计传输图,从而能够在散射效应下有效恢复场景辐射。在频率域,我们设计了一种由两个新先验引导的自适应频率增强策略。第一个先验假设退化图像中跨通道的直流(DC)分量的平均强度近似于对应清晰图像的平均强度。第二个先验基于观察:在清晰图像中,窄带内的低径向频率仅占整个频谱的一小部分。这些先验能够针对不同频带的散射诱导衰减进行补偿。最后,对空间域和频率域的结果进行加权融合,得到最终的恢复图像。在多种真实世界散射退化场景上的大量实验验证,与最先进方法相比,我们的SFP实现了优越的性能和强大的泛化能力。

英文摘要

Scene recovery from real-world images degraded by scattering effects, such as haze, sandstorm, underwater, and remote sensing conditions, remains a fundamental yet challenging problem in computer vision. Existing methods either rely on a single prior, which is inherently insufficient to characterize diverse scattering degradations, or employ deep networks trained on synthetic data, which often suffer from limited generalization to real-world scenarios. In this paper, we propose Spatial and Frequency Priors (SFP) for real-world scene recovery under scattering-induced degradations. In the spatial domain, we observe that the inverse of a scattering-degraded image reveals a projection along its spectral direction that correlates with the underlying scene transmission. Based on this observation, a spatial prior is formulated to estimate the transmission map, enabling effective recovery of scene radiance under scattering effects. In the frequency domain, we design an adaptive frequency enhancement strategy guided by two novel priors. The first prior assumes that the mean intensity of the direct current (DC) components across channels in degraded images approximates that of the corresponding clear images. The second prior is based on the observation that, in clear images, low radial frequencies within a narrow band contribute only a small proportion of the overall spectrum. These priors enable targeted compensation for scattering-induced attenuation across different frequency bands. Finally, a weighted fusion of the spatial and frequency domain results is performed to obtain the final recovered image. Extensive experiments on diverse real-world scattering-degraded scenarios verify that our SFP achieves superior performance and strong generalization capability compared to state-of-the-art methods.

2512.06393 2026-05-26 cs.AI cs.CL cs.LG cs.LO

Conflict-Aware Fusion: Mitigating Logic Inertia in Large Language Models via Structured Cognitive Priors

冲突感知融合:通过结构化认知先验缓解大语言模型中的逻辑惯性

Qiming Bao, Xiaoxuan Fu, Michael Witbrock

AI总结 针对大语言模型在规则系统结构扰动下表现脆弱的问题,提出冲突感知融合训练流程,通过验证-演绎结构先验和符号推理奖励,在多个压力测试中实现鲁棒性饱和。

详情
AI中文摘要

大型语言模型(LLM)在许多推理基准上取得了高准确率,但在基于规则系统的结构扰动下仍然脆弱。我们引入了一个包含四个压力测试的诊断框架——冗余与必要规则删除、矛盾规则注入、逻辑保持重写和多定律堆叠——并用它来揭示逻辑惯性:生成式LLM(Qwen2/3、TinyLlama、GPT-4o、Gemma-3-4B-IT)和仅编码器BERT基线在矛盾前提下沿学习到的演绎轨迹持续推理的倾向。这种崩溃是剧烈的:未经处理的基线在基础任务上的准确率从1.00下降到矛盾注入时的0.00(实例级精确匹配),而GPT-4o仅解决了56.0%的矛盾案例。我们提出冲突感知融合,这是一个四阶段训练流程,将验证-演绎作为学习到的结构先验强制执行:(i)SFT建立验证前缀;(ii)DPO锐化矛盾停止决策边界;(iii)逻辑不变正则化(LIRE)通过对称KL惩罚逻辑等价规则公式之间的差异;(iv)来自验证反馈的强化学习(RLVF)使用符号前向链接引擎作为确定性预言奖励,联合优化不变性和敏感性。该流程在1.5B和8B骨干网络上均使所有四个主要压力测试达到饱和。我们进一步验证了第二阶段扩展,用Lean 4内核替换命题预言机,在分层187个问题的Lean翻译样本中,对105个经典可推导(T)问题达到99.0%的内核一致性(整体71.7%,涵盖两种极性),为形式化验证的RL训练提供了可靠的升级路径。代码和基准:https://github.com/14H034160212/lemo

英文摘要

Large language models (LLMs) achieve high accuracy on many reasoning benchmarks but remain brittle under structural perturbations of rule-based systems. We introduce a diagnostic framework with four stress tests -- redundant vs. essential rule deletion, contradictory-rule injection, logic-preserving rewrites, and multi-law stacking -- and use it to expose Logic Inertia: the tendency of generative LLMs (Qwen2/3, TinyLlama, GPT-4o, Gemma-3-4B-IT) and the encoder-only BERT baseline to persist along learned deductive trajectories under inconsistent premises. The collapse is sharp: untreated baselines fall from accuracy 1.00 on the base task to 0.00 on contradiction injection (instance-level exact match), and GPT-4o resolves only 56.0% of contradiction cases. We propose Conflict-Aware Fusion, a four-stage training pipeline that enforces verification-before-deduction as a learned structural prior: (i) SFT establishes the verification preamble; (ii) DPO sharpens the halt-on-contradiction decision boundary; (iii) Logical Invariance REgularisation (LIRE) penalises divergence between logically equivalent rule formulations via symmetric KL; (iv) Reinforcement Learning from Verification Feedback (RLVF) uses a symbolic forward-chaining engine as a deterministic oracle reward, jointly optimising invariance and sensitivity. The pipeline saturates all four primary stress tests for both 1.5B and 8B backbones. We further validate a Phase 2 extension that replaces the propositional oracle with a Lean 4 kernel, attaining 99.0% kernel agreement on the 105 classically-derivable (T) questions within a stratified 187-question Lean-translated sample (overall 71.7% across both polarities), providing a sound upgrade path to formally verified RL training. Code and benchmark: https://github.com/14H034160212/lemo
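The verification-before-deduction idea, with a symbolic forward-chaining engine acting as a deterministic oracle, can be sketched as follows. The literal encoding (`"~p"` for negation), the function names, and the three return labels are assumptions for illustration and are not taken from the released code.

```python
def forward_chain(facts, rules):
    """Compute the closure of facts under rules given as (premises, conclusion) pairs."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known

def negate(literal):
    return literal[1:] if literal.startswith("~") else "~" + literal

def verify_then_deduce(facts, rules, query):
    """Check the rule base for a contradiction first, and only then answer the query."""
    closure = forward_chain(facts, rules)
    if any(negate(lit) in closure for lit in closure):
        return "CONTRADICTION"  # halt instead of continuing the deduction
    return "TRUE" if query in closure else "UNKNOWN"

base_rules = [(("a",), "b")]
clean_answer = verify_then_deduce({"a"}, base_rules, "b")
# Contradictory-rule injection: the same premise now also derives the negation of b.
injected_answer = verify_then_deduce({"a"}, base_rules + [(("a",), "~b")], "b")
```

Because the oracle is deterministic, its output could serve as a reward signal: a model response is rewarded when it matches the oracle's label for the same rule base.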

2512.05765 2026-05-26 cs.AI cs.LG

AGI Requires a Coordination Layer on Top of Pattern Repositories

AGI 需要在模式存储库之上建立协调层

Edward Y. Chang

AI总结 本文提出大型语言模型(LLM)并非AGI的死胡同,而是缺少系统2协调层,通过UCCT和RCA实现语义锚定与因果验证,并设计MACI多智能体协调栈,实验表明自适应控制优于静态提示。

Comments 15 pages, 5 figures, 7 tables

详情
AI中文摘要

在本文中,我们认为那些将大型语言模型(LLM)视为AGI死胡同的有影响力的批评误判了瓶颈:它们混淆了海洋与渔网。模式存储库是必要的系统1基础;缺失的组件是一个系统2协调层,该层能够招募相关模式、验证其使用、保持状态并控制收敛。我们将常常被混淆的两种控制用途分开。由UCCT(统一上下文控制理论)形式化的语义锚定,通过由有效支持(rho_d)、表征不匹配(d_r)和自适应锚定预算(gamma log k)控制的相变,将标签和任务意图绑定到学习到的模式区域。由递归因果审计(RCA)实现的追踪-答案验证,测试最终因果判断是否在其自身推理轨迹的压力下得到支持。我们将这些思想转化为MACI,一个多智能体协调栈,通过诱饵(PID调节辩论)、过滤(苏格拉底式和因果审计)和持久性(事务性内存)整合多样性和控制。在因果判断和谄媚-偏执权衡上的实证验证表明,静态提示失败的地方,自适应控制成功。通过将常见反对意见重新定义为可测试的协调失败,我们认为通往AGI的道路是通过LLM,而不是绕过它们。能力不是协调。

英文摘要

In this paper we argue that influential critiques dismissing Large Language Models (LLMs) as a dead end for AGI misidentify the bottleneck: they confuse the ocean with the net. Pattern repositories are the necessary System-1 substrate; the missing component is a System-2 coordination layer that recruits relevant patterns, verifies their use, preserves state, and governs convergence. We separate two uses of control that are often conflated. Semantic anchoring, formalized by UCCT (Unified Contextual Control Theory), binds labels and task intent to learned pattern regions through a phase transition governed by effective support (rho_d), representational mismatch (d_r), and an adaptive anchoring budget (gamma log k). Trace-answer verification, implemented by Recursive Causal Audit (RCA), tests whether a final causal judgment is warranted by its own reasoning trace under pressure. We translate these ideas into MACI, a multi-agent coordination stack that integrates diversity and control via baiting (PID-modulated debate), filtering (Socratic and causal audit), and persistence (transactional memory). Empirical validation on causal judgment and the sycophancy-paranoia trade-off demonstrates that static prompting fails where adaptive control succeeds. By reframing common objections as testable coordination failures, we argue that the path to AGI runs through LLMs, not around them. Capability is not coordination.

2512.01382 2026-05-26 cs.CV

Reversible Inversion for Training-Free Exemplar-guided Image Editing

可逆反演用于免训练示例引导图像编辑

Yuke Li, Lianli Gao, Ji Zhang, Pengpeng Zeng, Lichuan Xiang, Hongkai Wen, Heng Tao Shen, Jingkuan Song

AI总结 提出可逆反演(ReInversion)方法,通过两阶段去噪和掩码引导选择性去噪策略,实现免训练的高效示例引导图像编辑,达到最优性能且计算开销最低。

详情
AI中文摘要

示例引导图像编辑(EIE)旨在根据视觉参考修改源图像。现有方法通常需要大规模预训练来学习源图像和参考图像之间的关系,计算成本高。作为一种免训练的替代方案,反演技术可用于将源图像映射到潜在空间进行操作。然而,我们的实证研究表明,标准反演对于EIE是次优的,导致质量差和效率低。为了解决这一挑战,我们引入了可逆反演(ReInversion),用于有效且高效的EIE。具体来说,ReInversion作为一个两阶段去噪过程运行,首先以源图像为条件,然后以参考图像为条件。此外,我们引入了一种掩码引导选择性去噪(MSD)策略,将编辑限制在目标区域,保持背景的结构一致性。定性和定量比较都表明,我们的ReInversion方法以最低的计算开销实现了最先进的EIE性能。

英文摘要

Exemplar-guided Image Editing (EIE) aims to modify a source image according to a visual reference. Existing approaches often require large-scale pre-training to learn relationships between the source and reference images, incurring high computational costs. As a training-free alternative, inversion techniques can be used to map the source image into a latent space for manipulation. However, our empirical study reveals that standard inversion is sub-optimal for EIE, leading to poor quality and inefficiency. To tackle this challenge, we introduce Reversible Inversion (ReInversion) for effective and efficient EIE. Specifically, ReInversion operates as a two-stage denoising process, which is first conditioned on the source image and subsequently on the reference. In addition, we introduce a Mask-Guided Selective Denoising (MSD) strategy to constrain edits to target regions, preserving the structural consistency of the background. Both qualitative and quantitative comparisons demonstrate that our ReInversion method achieves state-of-the-art EIE performance with the lowest computational overhead.

2512.00375 2026-05-26 cs.RO

DPNet: Doppler LiDAR Motion Planning for Highly-Dynamic Environments

DPNet: 面向高动态环境的多普勒激光雷达运动规划

Wei Zuo, Zeyi Ren, Chengyang Li, Yikun Wang, Mingle Zhao, Shuai Wang, Wei Sui, Fei Gao, Yik-Chung Wu, Chengzhong Xu

AI总结 提出DPNet,通过多普勒卡尔曼神经网络跟踪快速障碍物并利用多普勒调谐模型预测控制实现高动态环境下的高频高精度运动规划。

Comments Accepted to IEEE Robotics and Automation Letters in April, 2026

详情
AI中文摘要

现有的运动规划方法由于对环境变化理解不足,常常难以应对快速移动的障碍物。为了解决这一问题,我们提出将运动规划器与多普勒激光雷达集成,后者不仅提供测距测量,还提供瞬时点速度。然而,由于高精度和高频率的要求,这种集成并非易事。为此,我们引入了多普勒规划网络(DPNet),通过基于多普勒模型的学习来跟踪和应对快速障碍物。我们首先提出了一种多普勒卡尔曼神经网络(D-KalmanNet),用于在部分可观测的高斯状态空间模型下跟踪障碍物状态。然后,我们利用预测的障碍物运动构建了一个多普勒调谐模型预测控制(DT-MPC)框架用于自我运动规划,实现了控制器参数的运行时自动调优。这两个模块使得DPNet能够从最少数据中学习快速环境变化,同时保持轻量级,在跟踪和规划中实现高频率和高精度。在高保真模拟器和真实世界数据集上的实验表明,DPNet优于广泛的基准方案。代码可在 https://github.com/UUwei-zuo/DPNet 获取。

英文摘要

Existing motion planning methods often struggle with rapid-motion obstacles due to an insufficient understanding of environmental changes. To address this, we propose integrating motion planners with Doppler LiDARs, which provide not only ranging measurements but also instantaneous point velocities. However, this integration is nontrivial due to the requirements of high accuracy and high frequency. To this end, we introduce Doppler Planning Network (DPNet), which tracks and reacts to rapid obstacles via Doppler model-based learning. We first propose a Doppler Kalman neural network (D-KalmanNet) to track obstacle states under a partially observable Gaussian state space model. We then leverage the predicted motions of obstacles to construct a Doppler-tuned model predictive control (DT-MPC) framework for ego-motion planning, enabling runtime auto-tuning of controller parameters. These two modules allow DPNet to learn fast environmental changes from minimal data while remaining lightweight, achieving high frequency and high accuracy in both tracking and planning. Experiments on a high-fidelity simulator and real-world datasets demonstrate the superiority of DPNet over extensive benchmark schemes. Code is available at https://github.com/UUwei-zuo/DPNet
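The per-point velocity a Doppler LiDAR reports is usually the radial component, that is, the projection of the relative velocity onto the line of sight. The sketch below shows that measurement model in 2D. It assumes a static sensor at the origin, and the function name is hypothetical; it is not taken from the DPNet code.

```python
import math

def radial_velocity(position, velocity):
    """Projection of a 2D velocity onto the line of sight from a sensor at the origin.

    Negative values mean the point is approaching the sensor.
    """
    rng = math.hypot(position[0], position[1])
    return (position[0] * velocity[0] + position[1] * velocity[1]) / rng

# Obstacle at (3, 4) m moving at 1 m/s in the -x direction.
v_r = radial_velocity((3.0, 4.0), (-1.0, 0.0))
```

A single radial reading constrains only one velocity component, which is one reason a state estimator is still needed to recover the full obstacle motion.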

2512.00125 2026-05-26 cs.CV cs.LG

Hybrid Synthetic Data Generation with Domain Randomization Enables Zero-Shot Vision-Based Part Inspection Under Extreme Class Imbalance

混合合成数据生成与域随机化实现极端类别不平衡下基于视觉的零样本零件检测

Ruo-Syuan Mei, Sixian Jia, Guangze Li, Soo Yeon Lee, Brian Musser, William Keller, Sreten Zakula, Jorge Arinez, Chenhui Shao

AI总结 提出一种结合仿真渲染、域随机化和真实背景合成的混合合成数据生成框架,仅用合成数据训练YOLOv8n和MobileNetV3-small模型,在极端类别不平衡下实现零样本工业零件检测,检测mAP@0.5达0.995,分类准确率96%,平衡准确率90.1%。

Comments Submitted to the NAMRC 54

详情
AI中文摘要

机器学习,特别是深度学习,正在改变工业质量检测。然而,训练鲁棒的机器学习模型通常需要大量高质量标注数据,这在制造业中获取成本高昂、耗时且劳动密集。此外,缺陷样本本身稀少,导致严重的类别不平衡,降低模型性能。这些数据约束阻碍了基于机器学习的质量检测方法在实际生产环境中的广泛采用。合成数据生成(SDG)通过高效、经济且可扩展的方式创建大规模、平衡且完全标注的数据集,提供了一种有前景的解决方案。本文提出一种混合SDG框架,集成了基于仿真的渲染、域随机化和真实背景合成,无需人工标注即可实现基于计算机视觉的工业零件检测的零样本学习。该SDG流水线通过改变零件几何、光照和表面属性,并将合成零件合成到真实图像背景上,在一小时内生成12,960张标注图像。利用YOLOv8n骨干网络进行目标检测、MobileNetV3-small进行质量分类的两阶段架构,仅使用合成数据训练,并在300个真实工业零件上评估。所提方法在检测上达到mAP@0.5为0.995,分类准确率96%,平衡准确率90.1%。与基于少量真实数据的基线方法相比,性能显著提升。在严重类别不平衡下,所提基于SDG的方法达到90-91%的平衡准确率,而基线仅达到50%准确率。这些结果表明,所提方法能够为真实制造应用实现免标注、可扩展且鲁棒的质量检测。

英文摘要

Machine learning, particularly deep learning, is transforming industrial quality inspection. Yet, training robust machine learning models typically requires large volumes of high-quality labeled data, which are expensive, time-consuming, and labor-intensive to obtain in manufacturing. Moreover, defective samples are intrinsically rare, leading to severe class imbalance that degrades model performance. These data constraints hinder the widespread adoption of machine learning-based quality inspection methods in real production environments. Synthetic data generation (SDG) offers a promising solution by enabling the creation of large, balanced, and fully annotated datasets in an efficient, cost-effective, and scalable manner. This paper presents a hybrid SDG framework that integrates simulation-based rendering, domain randomization, and real background compositing to enable zero-shot learning for computer vision-based industrial part inspection without manual annotation. The SDG pipeline generates 12,960 labeled images in one hour by varying part geometry, lighting, and surface properties, and then compositing synthetic parts onto real image backgrounds. A two-stage architecture utilizing a YOLOv8n backbone for object detection and MobileNetV3-small for quality classification is trained exclusively on synthetic data and evaluated on 300 real industrial parts. The proposed approach achieves an mAP@0.5 of 0.995 for detection, 96% classification accuracy, and 90.1% balanced accuracy. Comparative evaluation against few-shot real-data baseline approaches demonstrates significant improvement. The proposed SDG-based approach achieves 90-91% balanced accuracy under severe class imbalance, while the baselines reach only 50% accuracy. These results demonstrate that the proposed method enables annotation-free, scalable, and robust quality inspection for real-world manufacturing applications.
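The gap between plain accuracy and balanced accuracy under class imbalance can be seen in a small worked example. The confusion counts below are made-up illustrative numbers, not results from the paper.

```python
def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)

def balanced_accuracy(tp, fn, tn, fp):
    recall_defective = tp / (tp + fn)  # fraction of defective parts caught
    recall_good = tn / (tn + fp)       # fraction of good parts passed
    return (recall_defective + recall_good) / 2

# Hypothetical inspection set: 95 good parts, 5 defective parts.
# A degenerate classifier that labels every part "good" misses all defects.
acc = accuracy(tp=0, fn=5, tn=95, fp=0)
bal = balanced_accuracy(tp=0, fn=5, tn=95, fp=0)
```

The degenerate classifier reaches 0.95 accuracy but only 0.5 balanced accuracy, which is why balanced accuracy is the more informative figure when defects are rare.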

2511.19211 2026-05-26 cs.RO

Soft Pneumatic Grippers: Topology optimization, 3D-printing and Experimental validation

软体气动夹爪:拓扑优化、3D打印与实验验证

Prabhat Kumar, Chandra Prakash, Josh Pinskier, David Howard, Matthijs Langelaar

AI总结 提出一种考虑载荷设计依赖性的软体气动夹爪拓扑优化框架,通过2D软臂单元优化、3D打印制造及实验验证,证明其优于传统矩形设计。

Comments 11 Figures

详情
AI中文摘要

本文提出了一种系统性的拓扑优化框架,用于设计软体气动夹爪(SPG),明确考虑了驱动载荷的设计依赖性。载荷使用达西定律并添加排水项进行建模。通过将问题表述为使用鲁棒公式的柔顺机构设计问题,优化了一个2D软臂单元。该问题被设定为最小-最大优化,其中考虑了蓝图设计和侵蚀设计的输出变形。对蓝图部分施加体积约束,对侵蚀部分施加应变能约束。采用MMA求解优化问题并获得优化的软单元。使用Ogden材料模型进行有限元分析证实,优化后的2D单元在气动载荷下优于传统的矩形设计。将优化后的2D单元拉伸得到3D模块,并组装十个这样的单元以形成软臂。分析了优化臂在不同压力载荷下的变形曲线。对四个臂进行3D打印,并与支撑结构集成以实现所提出的SPG。在具有不同重量、尺寸、刚度和形状的物体上展示了SPG的抓取性能。

英文摘要

This paper presents a systematic topology optimization framework for designing a soft pneumatic gripper (SPG), explicitly considering the design-dependent nature of the actuating load. The load is modeled using Darcy's law with an added drainage term. A 2D soft arm unit is optimized by formulating it as a compliant mechanism design problem using the robust formulation. The problem is posed as a min-max optimization, where the output deformations of blueprint and eroded designs are considered. A volume constraint is imposed on the blueprint part, while a strain-energy constraint is enforced on the eroded part. The MMA is employed to solve the optimization problem and obtain the optimized soft unit. Finite element analysis with the Ogden material model confirms that the optimized 2D unit outperforms a conventional rectangular design under pneumatic loading. The optimized 2D unit is extruded to obtain a 3D module, and ten such units are assembled to create a soft arm. Deformation profiles of the optimized arm are analysed under different pressure loads. Four arms are 3D-printed and integrated with a supporting structure to realize the proposed SPG. The gripping performance of the SPG is demonstrated on objects with different weights, sizes, stiffness, and shapes.

2511.15407 2026-05-26 cs.AI cs.CV cs.LG

IPR-1: Interactive Physical Reasoner

IPR-1:交互式物理推理器

Mingyu Zhang, Lifeng Zhuo, Tianxi Tan, Guocan Xie, Xian Nie, Yan Li, Renjie Zhao, Zizhu He, Ziyu Wang, Jiting Cai, Yong-Lu Li

AI总结 提出IPR模型,通过世界模型滚动评分和强化VLM策略,结合物理中心动作代码PhysCode,在1000+异构游戏基准上实现鲁棒的物理推理,性能超越GPT-5并零样本迁移至未见游戏。

Comments Accepted by CVPR 2026. 13 pages of main text and 20 pages of appendices. Project page: https://mybearyzhang.github.io/ipr-1

详情
AI中文摘要

人类通过观察、与环境交互以及内化物理和因果关系来学习。在这里,我们旨在探究一个智能体是否能够通过交互类似地获得类人推理能力,并随着更多经验不断改进。为此,我们引入了一个包含1000+异构游戏的Game-to-Unseen (G2U)基准,这些游戏展现出显著的视觉领域差异。现有方法(包括VLM和世界模型)难以捕捉底层物理和因果关系,因为它们不关注核心机制且过度拟合视觉细节。VLM/VLA智能体能够推理,但在交互设置中缺乏前瞻性,而世界模型进行想象但模仿视觉模式而非分析物理和因果关系。因此,我们提出IPR(交互式物理推理器),利用世界模型滚动来评分和强化VLM的策略,并引入PhysCode,一种以物理为中心的动作代码,将语义意图与动力学对齐,为预测和推理提供共享动作空间。在1000+游戏上预训练后,我们的IPR在从原始直觉到目标驱动推理的各个层次上表现稳健,甚至在总体上超越了GPT-5。我们发现,性能随着训练游戏和交互步骤的增加而提升,并且模型还能零样本迁移到未见过的游戏。这些结果支持以物理为中心的交互作为稳步提升物理推理的路径。更多演示和项目详情请见https://mybearyzhang.github.io/ipr-1。

英文摘要

Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. To study this, we introduce a Game-to-Unseen (G2U) benchmark of 1,000+ heterogeneous games that exhibit significant visual domain gaps. Existing approaches, including VLMs and world models, struggle to capture underlying physics and causality since they are not focused on core mechanisms and overfit to visual details. VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), using world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code aligning semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, our IPR performs robustly on levels from primitive intuition to goal-driven reasoning, and even surpasses GPT-5 overall. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning. Further demos and project details can be found at https://mybearyzhang.github.io/ipr-1.

2511.12378 2026-05-26 cs.AI

Learning to Trust: Bayesian Adaptation to Varying Suggester Reliability in Sequential Decision Making

学会信任:序列决策中针对不同建议者可靠性的贝叶斯自适应方法

Dylan M. Asmar, Mykel J. Kochenderfer

AI总结 提出一种贝叶斯框架,通过将建议者质量融入信念表示并引入显式“询问”动作,使智能体在部分可观测环境中动态学习和适应变化的建议者可靠性,平衡信息获取与成本。

Comments Repo: https://github.com/dylan-asmar/learning_to_trust

详情
AI中文摘要

在不确定性下执行序列决策任务的自主智能体可以从外部动作建议中受益,这些建议提供了有价值的指导,但其可靠性固有地变化。现有整合此类建议的方法通常假设静态且已知的建议者质量参数,限制了实际部署。我们引入了一个框架,在部分可观测环境中动态学习并适应变化的建议者可靠性。首先,我们将建议者质量直接整合到智能体的信念表示中,使智能体能够通过贝叶斯推断建议者类型来推断并调整对建议的依赖。其次,我们引入了一个显式的“询问”动作,允许智能体在关键时刻策略性地请求建议,平衡信息获取与获取成本。实验评估表明,该框架在不同建议者质量下具有稳健性能,能够适应变化的可靠性,并策略性地管理建议请求。这项工作通过解决不确定环境中的建议不确定性,为自适应人机协作奠定了基础。

英文摘要

Autonomous agents operating in sequential decision-making tasks under uncertainty can benefit from external action suggestions, which provide valuable guidance but inherently vary in reliability. Existing methods for incorporating such advice typically assume static and known suggester quality parameters, limiting practical deployment. We introduce a framework that dynamically learns and adapts to varying suggester reliability in partially observable environments. First, we integrate suggester quality directly into the agent's belief representation, enabling agents to infer and adjust their reliance on suggestions through Bayesian inference over suggester types. Second, we introduce an explicit "ask" action allowing agents to strategically request suggestions at critical moments, balancing informational gains against acquisition costs. Experimental evaluation demonstrates robust performance across varying suggester qualities, adaptation to changing reliability, and strategic management of suggestion requests. This work provides a foundation for adaptive human-agent collaboration by addressing suggestion uncertainty in uncertain environments.
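Bayesian inference over suggester types can be sketched with a discrete belief update. The two types, their agreement probabilities, and the simplification that the agent can tell whether a suggestion agreed with the best action are assumptions for illustration. In the paper's partially observable setting the belief is maintained jointly with the environment state.

```python
def update_belief(belief, reliability, agreed):
    """Posterior over suggester types after observing one suggestion.

    belief: type -> prior probability.
    reliability: type -> probability that this type suggests the best action.
    agreed: whether the observed suggestion matched the best action.
    """
    posterior = {}
    for kind, prior in belief.items():
        p_agree = reliability[kind]
        likelihood = p_agree if agreed else 1.0 - p_agree
        posterior[kind] = prior * likelihood
    total = sum(posterior.values())
    return {kind: p / total for kind, p in posterior.items()}

reliability = {"reliable": 0.9, "unreliable": 0.3}  # hypothetical suggester types
belief = {"reliable": 0.5, "unreliable": 0.5}       # uniform prior
belief = update_belief(belief, reliability, agreed=True)
```

After one good suggestion the belief in the reliable type rises from 0.5 to 0.75. Repeated bad suggestions would push it the other way, which lets the agent reduce its reliance on the suggester.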

2511.09048 2026-05-26 cs.LG

Guaranteeing Conservation of Integrals with Projection in Physics-Informed Neural Networks

在物理信息神经网络中通过投影保证积分守恒

Anthony Baez, Wang Zhang, Ziwen Ma, Lam Nguyen, Subhro Das, Luca Daniel

AI总结 提出一种投影方法,通过求解约束非线性优化问题,在物理信息神经网络中分别或联合保证线性和二次积分量的守恒,将守恒误差降低三到四个数量级。

详情
AI中文摘要

我们提出了一种新颖的投影方法,能够保证物理信息神经网络(PINNs)中积分量的守恒。尽管PINNs用于强制执行偏微分方程(PDEs)结构的软约束在训练过程中提供了必要的灵活性,但也允许发现的解违反物理定律。为了解决这个问题,我们引入了一种投影方法,分别和联合保证线性和二次积分的守恒。我们通过求解约束非线性优化问题推导了投影公式,并发现经过投影修改的PINN(称为PINN-Proj)相比软约束,将这些量的守恒误差降低了三到四个数量级,并略微减少了PDE解误差。我们还发现,投影通过改善损失景观的条件性来改善收敛。我们的方法有望成为一个通用框架,只要存在可解的方案,就能保证PINN中任何积分量的守恒。

英文摘要

We propose a novel projection method that guarantees the conservation of integral quantities in Physics-Informed Neural Networks (PINNs). While the soft constraint that PINNs use to enforce the structure of partial differential equations (PDEs) enables necessary flexibility during training, it also permits the discovered solution to violate physical laws. To address this, we introduce a projection method that guarantees the conservation of the linear and quadratic integrals, both separately and jointly. We derived the projection formulae by solving constrained non-linear optimization problems and found that our PINN modified with the projection, which we call PINN-Proj, reduced the error in the conservation of these quantities by three to four orders of magnitude compared to the soft constraint and marginally reduced the PDE solution error. We also found evidence that the projection improved convergence through improving the conditioning of the loss landscape. Our method holds promise as a general framework to guarantee the conservation of any integral quantity in a PINN if a tractable solution exists.
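For a linear integral approximated by a quadrature sum, the closest field (in the Euclidean sense) that satisfies the conservation constraint has a closed form. The sketch below applies that standard projection to a discretized prediction. The quadrature weights and variable names are assumptions for illustration, and the paper's own formulae, in particular for the quadratic and joint cases, may differ.

```python
def project_linear(u, w, target):
    """Closest vector to u (Euclidean norm) with sum(w_i * u_i) == target.

    u: predicted field values on a grid; w: quadrature weights; target: conserved value.
    Solves min ||u' - u||^2 subject to w . u' = target, giving
    u' = u + w * (target - w . u) / (w . w).
    """
    current = sum(wi * ui for wi, ui in zip(w, u))
    scale = (target - current) / sum(wi * wi for wi in w)
    return [ui + scale * wi for ui, wi in zip(u, w)]

u_pred = [1.0, 2.0, 3.0]   # network output on three grid points
weights = [0.5, 0.5, 0.5]  # uniform quadrature weights (grid spacing 0.5)
u_proj = project_linear(u_pred, weights, target=2.0)
conserved = sum(wi * ui for wi, ui in zip(weights, u_proj))
```

The projection is a hard constraint: the conserved quantity matches the target up to floating-point error regardless of how well the network was trained.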

2511.02721 2026-05-26 cs.CL

PETra: A Multilingual Corpus of Pragmatic Explicitation in Translation

PETra:翻译中语用显化的多语语料库

Doreen Osmelak, Koel Dutta Chowdhury, Uliana Sentsova, Cristina España-Bonet, Josef van Genabith

AI总结 提出首个多语语料库PragExTra及检测框架,通过空对齐和主动学习识别语用显化,跨语言准确率达0.88,F1达0.82。

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
AI中文摘要

译者常常用背景细节丰富文本,使隐含的文化含义对新受众变得明确。这种现象称为语用显化,在翻译理论中已被广泛讨论,但很少用计算模型处理。我们介绍了PragExTra,这是第一个用于语用显化的多语语料库和检测框架。该语料库涵盖来自TED-Multi和Europarl的八种语言对,并包括实体描述、测量转换和译者评论等补充内容。我们通过空对齐识别候选显化案例,并使用主动学习结合人工标注进行精炼。我们的结果表明,实体和系统层面的显化最为常见,主动学习将分类器准确率提高了7-8个百分点,跨语言达到0.88的准确率和0.82的F1值。PragExTra将语用显化确立为可测量的跨语言现象,并向构建文化感知的机器翻译迈出了一步。关键词:翻译,多语制,显化

英文摘要

Translators often enrich texts with background details that make implicit cultural meanings explicit for new audiences. This phenomenon, known as pragmatic explicitation, has been widely discussed in translation theory but rarely modeled computationally. We introduce PragExTra, the first multilingual corpus and detection framework for pragmatic explicitation. The corpus covers eight language pairs from TED-Multi and Europarl and includes additions such as entity descriptions, measurement conversions, and translator remarks. We identify candidate explicitation cases through null alignments and refine them using active learning with human annotation. Our results show that entity and system-level explicitations are most frequent, and that active learning improves classifier accuracy by 7-8 percentage points, achieving up to 0.88 accuracy and 0.82 F1 across languages. PragExTra establishes pragmatic explicitation as a measurable, cross-linguistic phenomenon and takes a step towards building culturally aware machine translation. Keywords: translation, multilingualism, explicitation

2510.27118 2026-05-26 cs.CL

Probability Distributions Computed by Autoregressive Transformers

自回归Transformer计算的概率分布

Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski, Ryan Cotterell, David Chiang

AI总结 研究自回归Transformer作为语言模型时能表达的概率分布,揭示自回归和概率化对表达力的影响。

Comments 20 pages

详情
AI中文摘要

大多数关于Transformer的表达力结果将其视为语言识别器——接受或拒绝字符串的设备——而不是像实际使用中那样:作为自回归和概率生成字符串的语言模型。我们刻画了Transformer语言模型可以表达的概率分布。我们表明,使Transformer语言识别器自回归有时可以增加其表达力,而使其概率化可以打破非概率情况下成立的等价关系。我们的总体贡献是厘清Transformer在其最常见的用例——作为语言模型——中能够表达哪些函数。

英文摘要

Most expressivity results for transformers treat them as language recognizers -- devices that accept or reject strings -- rather than as they are used in practice: as language models that generate strings autoregressively and probabilistically. We characterize the probability distributions that transformer language models can express. We show that making transformer language recognizers autoregressive can sometimes increase their expressivity, and that making them probabilistic can break equivalences that hold in the non-probabilistic case. Our overall contribution is to tease apart what functions transformers are capable of expressing in their most common use case as language models.
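The difference between a recognizer and an autoregressive language model can be made concrete. A language model assigns each string a probability through the chain rule: the product of next-symbol probabilities, ending with an end-of-string symbol. In the sketch below, `toy_next_symbol_probs` is a hypothetical stand-in for a trained model, not a transformer.

```python
EOS = "</s>"

def toy_next_symbol_probs(prefix):
    """Hypothetical model: P(next symbol | prefix), the same for every prefix."""
    return {"a": 0.5, EOS: 0.5}

def string_probability(model, symbols):
    """Chain rule: P(w) = prod_t P(w_t | w_<t) * P(EOS | w)."""
    prob = 1.0
    prefix = []
    for symbol in list(symbols) + [EOS]:
        prob *= model(tuple(prefix))[symbol]
        prefix.append(symbol)
    return prob

p_aa = string_probability(toy_next_symbol_probs, "aa")
p_empty = string_probability(toy_next_symbol_probs, "")
```

Under this toy model the string "aa" has probability 0.5 * 0.5 * 0.5 = 0.125 and the empty string has probability 0.5, so the model defines a distribution over all strings and not just an accept/reject decision.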