arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.02211 2026-06-02 cs.CL cs.AI

Consistency Training while Mitigating Obfuscation via Rate Matching

通过速率匹配缓解混淆的一致性训练

Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

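AI summary: Proposes Rate Matching Consistency Training (RMCT), which matches the rate of a target behaviour rather than constraining how it is expressed, reducing the influence of extraneous features while avoiding obfuscation and improving monitorability.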

AI中文摘要

大型语言模型常常受到无关输入特征的影响,例如揭示用户偏好答案的线索。一致性训练通过训练模型在具有和不具有无关特征的输入上表现相似来减少这种影响。然而,现有方法在整个响应或内部激活上训练一致性,这也限制了模型是否表达这些无关特征。我们表明这会导致混淆,即模型学会不提及线索但仍受其影响,这可能削弱可监控性。为了解决这个问题,我们引入了速率匹配一致性训练(RMCT),它在选定的行为属性上训练一致性,而不约束这种行为如何表达。RMCT匹配模型在输入扰动下表现出目标行为(例如,遵循偏见线索)的速率,而不是要求具有和不具有无关特征的配对输入,从而将一致性训练扩展到无法移除无关特征的场景。我们在两个开放权重语言模型上评估了RMCT在减少谄媚方面的效果,在保留的偏见类型上实现了与标准一致性训练基线相当的偏见遵循减少,同时很大程度上保留了模型表达偏见线索的倾向。此外,我们发现RMCT在我们的实验中更节省数据,但计算效率较低。总体而言,RMCT表明一致性训练可以在不直接牺牲可监控性的情况下提高行为鲁棒性。

英文摘要

Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.
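
As a rough illustration of the rate-matching idea in the abstract, the sketch below estimates how often a model follows a bias cue under each input perturbation and penalises the spread between those rates. The function names, the squared-deviation penalty, and the toy data are assumptions made for illustration; the abstract does not give the paper's actual objective.

```python
from statistics import mean

def behaviour_rate(flags):
    """Fraction of sampled responses that exhibit the target behaviour."""
    return sum(flags) / len(flags)

def rate_matching_penalty(flags_by_perturbation):
    """Mean squared deviation of per-perturbation rates from their average.

    Hypothetical objective: zero when every perturbation yields the same rate.
    """
    rates = [behaviour_rate(f) for f in flags_by_perturbation.values()]
    centre = mean(rates)
    return mean((r - centre) ** 2 for r in rates)

# Toy data: 1 = the response followed the bias cue, 0 = it did not.
samples = {
    "cue_phrasing_a": [1, 1, 0, 1],
    "cue_phrasing_b": [0, 1, 0, 0],
}
penalty = rate_matching_penalty(samples)
print(penalty)
```

The penalty depends only on whether each response shows the behaviour, not on the response text, so under this assumed form nothing discourages the model from mentioning the cue.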

2606.02204 2026-06-02 cs.CL

Cross-Environment Neural Reranking for Sample-Efficient Action Selection in Text-Based Agents

跨环境神经重排序用于文本智能体中样本高效的动作选择

Kan Shao

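AI summary: Studies sample-efficient action selection across multiple text environments (ALFWorld, WebShop, ScienceWorld) by jointly training a lightweight neural reranker (DeBERTa-v3). Rebalanced joint training substantially improves performance, cross-environment adaptation recovers 93% of full-data performance with only 9.2% of target-domain data, and data diversity is the primary driver.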

Comments
11 pages, 4 figures, 6 tables
AI中文摘要

大型语言模型智能体在基于文本的基准测试中取得了强劲性能,但推理成本过高,促使使用紧凑的神经重排序器进行动作选择。我们研究单个轻量级模型是否能在多个不同环境中执行动作选择,这种能力将消除每个环境模型的维护成本。在ALFWorld、WebShop和ScienceWorld上联合训练DeBERTa-v3(184M-434M参数),并采用少数类上采样,我们发现重平衡的双环境联合训练显著提升了单环境ALFWorld性能(净增益+0.412),同时保持了具有竞争力的WebShop性能(+0.214 vs. 单环境+0.249)。三环境训练在4个种子上平均组合净增益为+0.551 +/- 0.024,每个环境的结果接近专门的单环境模型,同时提供正向跨域迁移。跨环境适应具有极高的样本效率:仅用9.2%的目标域数据微调即可恢复93%的全数据性能,且扩展模型容量收益有限,表明数据多样性是主要驱动力。基于环境感知的LoRA适配器路由结合PCGrad实现了最佳种子结果+0.611(种子42),种子456和789分别为+0.554和+0.559,但由于种子123降至+0.263(4种子均值+0.497 +/- 0.158),表现出高方差,这是一个有前景但目前不稳定的方向。使用干净分割和数据重平衡的联合训练是关键要素。我们将发布包含51,580个训练实例(41,740个原始唯一状态,带少数类上采样)的三环境基准测试,以及所有模型检查点(接收后)。

英文摘要

Large language model agents achieve strong performance on text-based benchmarks but incur prohibitive inference costs, motivating the use of compact neural rerankers for action selection. We investigate whether a single lightweight model can perform action selection across multiple diverse environments, a capability that would eliminate per-environment model maintenance. Training DeBERTa-v3 (184M-434M parameters) jointly on ALFWorld, WebShop, and ScienceWorld with minority-class upsampling, we find that rebalanced two-environment joint training substantially improves over single-environment ALFWorld performance (net gain +0.412) while maintaining competitive WebShop performance (+0.214 vs. +0.249 single-environment). Three-environment training yields a mean combined net gain of +0.551 +/- 0.024 across 4 seeds, with per-environment results approaching specialized single-environment models while providing positive cross-domain transfer. Cross-environment adaptation is highly sample-efficient: fine-tuning on only 9.2% of target-domain data recovers 93% of full-data performance, and scaling model capacity yields limited benefits, indicating data diversity is the primary driver. Environment-aware LoRA adapter routing with PCGrad achieves a best-seed result of +0.611 (seed 42), with seeds 456 and 789 at +0.554 and +0.559, but exhibits high variance due to seed 123 collapsing to +0.263 (4-seed mean +0.497 +/- 0.158), representing a promising but currently unstable direction. Joint training with clean splits and data rebalancing is a key ingredient. We will release our three-environment benchmark of 51,580 training instances (41,740 raw unique states with minority-class upsampling) and all model checkpoints upon acceptance.
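
The four per-seed net gains quoted for the LoRA adapter routing variant can be checked against the reported 4-seed mean and spread. The short sketch below recomputes both, assuming the spread is a sample standard deviation (n - 1 denominator), which the abstract does not state explicitly.

```python
from statistics import mean, stdev

# Net gains for the LoRA + PCGrad variant, as quoted in the abstract.
net_gain_by_seed = {42: 0.611, 456: 0.554, 789: 0.559, 123: 0.263}

values = list(net_gain_by_seed.values())
seed_mean = mean(values)
seed_std = stdev(values)  # sample standard deviation (n - 1 denominator)
print(f"{seed_mean:.3f} +/- {seed_std:.3f}")
```

Under that assumption the recomputed values agree with the quoted +0.497 +/- 0.158, and most of the spread comes from seed 123.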

2606.02198 2026-06-02 cs.LG cs.CY

Model Multiplicity and Predictive Arbitrariness in Recidivism Risk Assessment

模型多重性与再犯风险评估中的预测任意性

Ashwin Singh, Carlos Castillo

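AI summary: Addresses predictive arbitrariness in recidivism risk assessment through a derived theoretical lower bound and empirical analysis, finding that predictive agreement among similarly accurate models is typically higher than the worst-case theoretical guarantee, and proposes a lowest-risk assignment policy to mitigate arbitrariness.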

Comments
17 pages, 12 figures
AI中文摘要

针对个体未来的预测任务本质上是嘈杂的,通常会产生多个相似精度的模型。当这些模型对同一个人产生不同预测时,会引发决策中的任意性问题。这种任意性在理论和实践中可能有多严重?如何解决以支持高风险风险评估?我们通过对一个已使用超过15年的基于机器学习的再犯风险评估决策支持系统的研究来回答这些问题。通过将复杂的法律规则转化为标记释放后结果(再犯或非再犯)的算法,我们首先构建了一个包含数千名囚犯释放的数据集。利用该数据集,我们学习可解释的模型,这些模型提高了预测性能,减少了群体间的错误率差异,并确保康复进展降低风险评分。接下来,我们研究预测多重性,首先推导出数据集上任何有限模型集的期望预测一致性的紧下界,然后评估该集合内的结构多样性(例如,不同的模型系数)在多大程度上转化为预测多重性(即对同一人的不同预测)。我们的实验表明,存在许多相似精度的模型且具有可比较的错误率差异并不一定意味着严重的预测多重性。经验上,性能相似的模型可以表现出比最坏情况理论保证高得多的预测一致性。我们发现,一种简单的策略——为每个囚犯分配这些模型中的最低风险——对于解决预测任意性是有效的。

英文摘要

Prediction tasks over individual futures, which are inherently noisy, often admit multiple similarly accurate models. When these models produce different predictions for the same individual, they raise concerns of arbitrariness in decision-making. How severe can this arbitrariness be, in theory and in practice? How can it be resolved to support high-stakes risk assessment? We address these questions through a study of a machine learning-based decision support system for recidivism risk assessment that has been in use for over 15 years. By translating complex legal rules into an algorithm for labeling post release outcomes (recidivist or non-recidivist), we first construct a dataset of thousands of inmate releases. Using this dataset, we learn interpretable models that improve predictive performance, reduce error-rate disparities between groups, and ensure that rehabilitative progress lowers risk scores. Next, we study predictive multiplicity, by first deriving a tight lower bound on the expected predictive agreement of any finite set of models over a dataset, and then by evaluating the extent to which structural diversity (e.g., different model coefficients) within this set translates to predictive multiplicity (i.e., different predictions for the same individual). Our experiments indicate that the existence of many similarly accurate models with comparable error-rate disparities does not necessarily translate into severe predictive multiplicity. Empirically, similarly performant models can exhibit substantially higher predictive agreement than worst-case theoretical guarantees suggest. We find that a simple policy that assigns each inmate the lowest risk among these models is effective for addressing predictive arbitrariness.
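
To make the two quantities in the abstract concrete, the sketch below computes an average pairwise agreement between models and applies a lowest-risk policy across them. The agreement definition, the function names, and the toy numbers are assumptions for illustration; the paper's formal definition of expected predictive agreement may differ.

```python
from itertools import combinations

def pairwise_agreement(predictions):
    """Average fraction of individuals on which two models give the same label."""
    pairs = list(combinations(predictions, 2))
    total = 0.0
    for a, b in pairs:
        total += sum(x == y for x, y in zip(a, b)) / len(a)
    return total / len(pairs)

def lowest_risk(scores):
    """Per-individual minimum risk score across all models."""
    return [min(column) for column in zip(*scores)]

# Toy data: three models (rows), four individuals (columns).
labels = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]
scores = [
    [0.8, 0.2, 0.6, 0.1],
    [0.7, 0.3, 0.4, 0.2],
    [0.9, 0.6, 0.7, 0.1],
]
agreement = pairwise_agreement(labels)
assigned = lowest_risk(scores)
print(agreement, assigned)
```

Taking the minimum gives each individual a single score that no longer depends on which of the similarly accurate models happened to be deployed.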

2606.02194 2026-06-02 cs.LG

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

基于学习奖励的大规模行为模型的一致性离策略改进

Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek, Ingmar Posner, Jan Peters

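AI summary: Proposes an inverse reinforcement learning method that learns a dense reward function from expert demonstrations and, backed by the theoretical guarantees of coherent imitation learning, improves pretrained policies off-policy, outperforming reinforcement learning baselines on sparse-reward manipulation tasks.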

Comments
13 pages, 7 figures
AI中文摘要

使用行为克隆将专家演示数据蒸馏到大规模生成模型中是一种可扩展的学习机器人控制能力策略的方法,特别是对于灵巧操作。强化学习(RL)可以作为一种利用额外经验进一步微调这些策略的手段。一个开放的问题是RL是否比收集更多人类演示更具样本效率。先前的工作通过将RL应用于一个较小的残差策略来大规模微调预训练策略,该残差策略纠正预训练模型。然而,对于典型的稀疏奖励任务,RL算法可能难以以样本高效的方式优化行为。我们探索逆强化学习,其中从专家演示中学习稠密奖励函数,可能降低RL微调的挑战。我们特别考虑一致性模仿学习,这是一种IRL方法,通过使用具有理论保证的特定奖励公式来促进BC策略的改进。我们展示了我们的IRL方法在所有六个稀疏操作任务上保持或提高了pi-0.5的性能,并在六个复杂操作任务中的五个上实现了≥90%的成功率,优于使用稀疏奖励的基于RL的基线。通过确保我们的初始预训练微调策略对于初始奖励和评论家是最优的,我们的方法避免了RL微调中常见的初始下降,并实现了更快的改进。

英文摘要

Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used as a means to finetune these policies further using additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pretrained model. However, for the typical sparse reward tasks, RL algorithms can struggle to optimize the behavior in a sample-efficient manner. We explore inverse reinforcement learning, where a dense reward function is learned from expert demonstrations, potentially reducing the challenge of RL finetuning. We specifically consider coherent imitation learning, an IRL method that facilitates improvement of the BC policy through using a specific reward formulation with theoretical guarantees. We show that our IRL method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a ≥90% success rate on five out of six complex manipulation tasks, outperforming RL-based baselines using sparse rewards. By ensuring our initial pretrained finetuning policy is optimal for our initial reward and critic, our method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.

2606.02184 2026-06-02 cs.DL cs.LG

The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing

幽灵搭档:相关的大语言模型姓名先验及其对网络和学术出版的困扰

Michał Brzozowski, Neo Christopher Chung

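AI summary: Finds that large language models generating fictional expert names produce strongly correlated character ensembles that are specific to each model family, and that these have led to large numbers of ghost-author records on platforms such as Zenodo, affecting academic publishing.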

AI中文摘要

这些名字并不存在。Elena Vasquez 和 Marcus Chen 作为火山专家、宇航员、惊悚小说主角、播客主持人和学术合著者,出现在数百个独立生成的AI生成文档中,却从未存在过。我们表明,大语言模型在生成虚构专家时不仅仅默认使用高概率的单个名字:它们会产生相关的角色组合、配对和三人组,其共现频率远超偶然,并且在独立生成中保持一致。这些先验是模型家族特定的(Claude:Elena Vasquez + Marcus Chen + Amara Okafor;Gemini:Aris Thorne + Lena Petrova;GPT:Elara Voss 无固定搭档)、版本特定的,并且在模型发布边界处被主动抑制,在它们生成的内容中留下可定时的行为指纹。我们记录了一个大规模的下游后果。在Zenodo(一个由CERN运营的、生成真实DataCite DOI的存储库)上,我们识别出1,655条幽灵作者记录,声称不存在的期刊并带有捏造的出版日期:服务器端的DataCite时间戳证明了故意的回溯日期,其中991条记录在一个月内注册;这些记录携带在DataCite中注册的真实DOI,因此任何摄取DOI元数据的学术聚合器都可以获取它们。幽灵名字还出现在ResearchGate上,形成由来自多个模型家族的合作者组成的合成研究小组;这些记录上的出版日期为模型部署窗口提供了可靠的时间代理。

英文摘要

These names do not exist. Elena Vasquez and Marcus Chen have appeared as volcano experts, astronauts, thriller protagonists, podcast hosts, and academic co-authors across hundreds of independently produced AI-generated documents, never having lived. We show that large language models do not merely default to high-probability individual names when generating fictional experts: they produce correlated character ensembles, pairs and trios whose co-occurrence rates far exceed chance and are consistent across independent generations. These priors are model-family-specific (Claude: Elena Vasquez + Marcus Chen + Amara Okafor; Gemini: Aris Thorne + Lena Petrova; GPT: Elara Voss with no fixed partner), version-specific, and actively suppressed at model release boundaries, leaving dateable behavioral fingerprints in the content they produced. We document a downstream consequence at scale. On Zenodo, a CERN-operated repository that mints real DataCite DOIs, we identify 1,655 ghost-authored records claiming nonexistent journals with fabricated publication dates: server-side DataCite timestamps prove deliberate backdating, and 991 records were registered in a single month; these carry real DOIs registered in DataCite, making them harvestable by any scholarly aggregator that ingests DOI metadata. Ghost names additionally appear on ResearchGate forming synthetic research groups with collaborators drawn from multiple model families; publication dates on these records provide a reliable temporal proxy for model deployment windows.

2606.02179 2026-06-02 cs.LG cs.AI cs.CE

On the Generalization in Topology Optimization via Sensitivity-Conditioned Bernoulli Flow Matching

关于拓扑优化中通过敏感性条件伯努利流匹配的泛化性

Mohammad Rashed, Duarte F. Valoroso Madeira, Babak Gholami, Caglar Guerbuez, Yunjia Yang, Nils Thuerey

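AI summary: Through an information-theoretic analysis, introduces the concept of pseudo-sensitivities and uses a sensitivity-conditioned Bernoulli flow-matching generator to achieve state-of-the-art out-of-distribution generalization in topology optimization.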

Comments
ICML Paper
AI中文摘要

拓扑优化(TO)的代理模型在分布偏移(如载荷或边界条件变化)下表现出高度可变的分布外(OOD)泛化能力,但这一变异性的来源尚不清楚。我们假设OOD性能取决于条件信号保留关于驱动经典TO的伴随敏感性(简化梯度)的信息量。将TO流程建模为因果马尔可夫链,数据处理不等式表明,在该抽象下,敏感性场是拓扑预测的信息论最优条件信号。然而,计算精确的伴随敏感性在实践中可能昂贵或不可用;我们观察到某些物理场可以通过单调变换近似敏感性。为形式化这一点,我们引入伪敏感性来区分哪些场能够实现泛化,哪些信息贫乏。然后,我们展示了一个敏感性条件的伯努利流匹配生成器实证地证实了这些预测:以敏感性为条件可获得最先进的OOD性能,而越来越远的物理场性能退化至原始参数条件。结果在载荷偏移下的结构TO基准测试和我们新的CFD-TO数据集(边界条件偏移如多出口配置)中均成立。代码和数据集见https://tum-pbs.github.io/topotransformer/。

英文摘要

Surrogate models for topology optimization (TO) exhibit highly variable out-of-distribution (OOD) generalization under distribution shifts such as changing loads or boundary conditions, yet the source of this variability remains unclear. We hypothesize that OOD performance is governed by how much information the conditioning signal preserves about the adjoint sensitivity (reduced gradient) that drives classical TO. Modeling the TO pipeline as a causal Markov chain, the Data Processing Inequality establishes that, under this abstraction, the sensitivity field is an information-theoretically optimal conditioning signal for topology prediction. However, computing exact adjoint sensitivities can be expensive or unavailable in practice; we observe that certain physical fields can approximate sensitivities through monotone transformations. To formalize this, we introduce pseudo-sensitivities to characterize which fields enable generalization versus those that are information-poor. We then show that a sensitivity-conditioned Bernoulli flow-matching generator empirically confirms these predictions: conditioning on sensitivities yields state-of-the-art OOD performance, while increasingly distant physical fields degrade toward raw parameter conditioning. Results hold across structural TO benchmarks under load shifts and our new CFD-TO dataset under boundary-condition shifts such as multi-outlet configurations. Code and datasets are available at https://tum-pbs.github.io/topotransformer/.
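
For reference, the standard Data Processing Inequality that the abstract invokes can be written as below. The symbols are assumptions chosen for illustration ($T$ for the predicted topology, $S$ for the sensitivity field, $Z$ for any other conditioning signal); the abstract does not spell out the paper's exact chain.

```latex
% Assumption: Z and T are conditionally independent given S,
% i.e. Z -- S -- T forms a Markov chain.
\[
  I(Z; T) \;\le\; I(S; T)
\]
```

Read this way, no signal that relates to the topology only through the sensitivity field can carry more information about the topology than the sensitivity field itself.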

2606.02178 2026-06-02 cs.CV cs.AI

Order within Chaos: Capturing Intrinsic Energy Anomalies for AI-Manipulated Image Forgery Localization

混沌中的秩序:捕捉AI操纵图像伪造定位的内在能量异常

Yiming Wang, Baiqi Wu, Qingming Li, Jiahao Chen, Tong Zhang, Shouling Ji

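AI summary: Proposes the FLAME framework, which exploits the statistical energy gap created when the diffusion process suppresses local high-frequency variance, combining a LAD map with a SAM adapter for pixel-level forgery localization, and introduces the EditStream pipeline for continuous training data synthesis, achieving state-of-the-art performance on AI-generated forgery datasets.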

Comments
Accepted by ICML 2026
AI中文摘要

近期生成式AI的进展催生了能够产生逼真伪造图像的图像编辑模型,这些伪造图像能规避传统图像伪造定位方法,因为传统方法依赖于合成数据中不存在的物理噪声。为应对这一挑战,我们从理论上证明扩散过程本质上抑制了局部高频方差,产生了与光学成像自然熵可区分的统计能量间隙。受此启发,我们提出FLAME,一个统一框架,利用LAD图捕捉这些内在异常,并结合SAM的参数高效适配器实现精确的像素级伪造定位。此外,为弥合取证基准与不断演变的生成模型之间的滞后,我们引入EditStream,一个基于指令的连续训练数据合成自动化流水线。大量实验表明,FLAME建立了新的最先进水平,在AI生成伪造数据集上显著优于先前方法,同时有效泛化到未见过的生成架构。我们的代码可在https://github.com/phoenixnir/FLAME获取。

英文摘要

Recent advancements in generative AI have led to image editing models capable of producing realistic forgeries that evade traditional image forgery localization methods, as these approaches depend on physical noise absent in synthetic data. To address this challenge, we theoretically demonstrate that the diffusion process inherently suppresses local high-frequency variance, creating a statistical energy gap that is distinguishable from the natural entropy of optical imaging. Guided by this insight, we propose FLAME, a unified framework that utilizes a LAD map to capture these intrinsic anomalies, coupled with a parameter-efficient adapter for SAM to achieve precise, pixel-level forgery localization. Furthermore, to bridge the lag between forensic benchmarks and evolving generative models, we introduce EditStream, an automated pipeline for continuous, instruction-based training data synthesis. Extensive experiments demonstrate that FLAME establishes a new state-of-the-art, significantly outperforming previous methods on AI-generated forgery datasets while effectively generalizing to unseen generative architectures. Our code is available at https://github.com/phoenixnir/FLAME.

2606.02177 2026-06-02 cs.LG

Low-Pass Flow Matching

低通流匹配

Francesco M. Ruscio, T. Konstantin Rusch

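AI summary: To address the mismatch between the white-noise source in flow matching and the spectra of natural data, proposes Low-Pass Flow Matching, based on an operator-modulated interpolant, which introduces a time-varying spectral bias and substantially reduces sampling cost while preserving or improving sample quality.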

Comments
ICLR 2026 Delta Workshop
AI中文摘要

流匹配通常依赖于白噪声源,这一选择往往与自然数据的功率谱不一致,自然数据的功率谱倾向于随频率衰减。为了解决这个问题,我们引入了低通流匹配,这是基于算子调制插值的流匹配的一种变体。该公式引入了一种时变频谱偏差,随着路径接近数据,该偏差从源频谱过渡到频率衰减偏差。我们在无条件图像生成任务上验证了我们的方法,包括科学数据集Galaxy10。实验表明,我们的方法与自适应ODE求解器配合使用时特别有效,与标准基线相比,在提高或保持样本质量的同时,大幅降低了采样成本。

英文摘要

Flow Matching typically relies on white noise sources, a choice often misaligned with the power spectra of natural data, which tend to decay with frequency. To address this, we introduce Low-Pass Flow Matching, a variant of Flow Matching based on an operator-modulated interpolant. This formulation induces a time-varying spectral bias that transitions from the source spectrum to a frequency-decaying bias as the path approaches the data. We validate our method on unconditional image generation tasks, including the scientific Galaxy10 dataset. Empirically, we show that our method is particularly effective when paired with adaptive ODE solvers, where it improves or preserves sample quality while substantially reducing sampling cost compared to standard baselines.

2606.02172 2026-06-02 cs.LG cs.CV

Closing the Alignment-Maturity Gap in Federated Prototype Learning

缩小联邦原型学习中的对齐-成熟度差距

Mario Casado-Diez, Alejandro Dopico-Castro, Verónica Bolón-Canedo, Bertha Guijarro-Berdiñas

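AI summary: To address prototype alignment pressure suppressing local discriminative structure in federated learning, proposes the FedSAP framework, which stabilises representation learning with a deterministic alignment curriculum and a geometry-driven proxy separation loss, improving classification performance under varied heterogeneity conditions.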

AI中文摘要

从分布式异质数据中学习判别性视觉表示是联邦学习(FL)中的一个基本挑战。基于原型的方法通过跨客户端共享类级表示来解决统计异质性,但在早期训练轮次中会产生距离依赖的梯度压力,这种压力尤其严重:对从噪声局部表示聚合而来的不成熟全局原型施加的对齐压力会产生大梯度,从而抑制局部判别结构的出现。结果导致嵌入空间组织不良,识别性能下降,尤其是在严重的非独立同分布(non-IID)条件下。我们提出FedSAP,一个通过两种互补机制稳定联邦表示学习的框架:一个确定性对齐课程,将全局对齐延迟到局部表示变得稳定;以及一个几何驱动的代理分离损失,利用现有原型库在单位超球面上强制执行类间结构,而不引入额外参数或通信开销。这些机制共同产生紧凑、分离良好的类簇,而不改变联邦参与者之间的底层通信协议。在三个基准测试和不同程度的异质性下的实验表明,与评估的原型基线相比,性能提升高达4个百分点,在高异质性下改进最为显著。我们框架的表示性质还使其能够直接扩展到半监督设置,其中未标记数据只需最小修改即可纳入,突显了调度对齐作为设计原则的通用性。

英文摘要

Learning discriminative visual representations from distributed, heterogeneous data is a fundamental challenge in Federated Learning (FL). Prototype-based methods address statistical heterogeneity by sharing class-level representations across clients but create a distance-dependent gradient pressure that is particularly severe during early training rounds: alignment pressure applied to immature global prototypes, aggregated from noisy local representations, generates large gradients that suppress the emergence of local discriminative structure. The result is a poorly organized embedding space and degraded recognition performance, particularly under severe non-IID conditions. We propose FedSAP, a framework that stabilises federated representation learning through two complementary mechanisms: a deterministic alignment curriculum that delays global alignment until local representations become stable, and a geometry-driven proxy separation loss that enforces inter-class structure on the unit hypersphere using the existing prototype bank without introducing additional parameters or communication overhead. Together, these mechanisms produce compact, well-separated class clusters without altering the underlying communication protocol between the federation's participants. Experiments across three benchmarks and varying degrees of heterogeneity show gains of up to 4 percentage points over the prototype-based baselines evaluated, with improvements most pronounced under high heterogeneity. The representational nature of our framework further enables a straightforward extension to semi-supervised settings, where unlabelled data is incorporated with minimal modification, underscoring the generality of scheduled alignment as a design principle.

2606.02171 2026-06-02 cs.CV

InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark

InsightVQA: 高维情感认知视觉问答基准

Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin, Zhongqian Mao, Jing Chen, Jiaqi Song, Yunshi Lan, Yan Wang

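AI summary: To address existing benchmarks focusing only on emotion recognition and lacking deeper cognitive reasoning, proposes InsightVQA, a large-scale hierarchical visual question answering dataset with 725K QA pairs, together with the evaluation benchmark InsightVQA-Bench and the baseline model InsightNet.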

Comments
16 pages, 22 figures
AI中文摘要

视觉情感理解要求模型不仅识别情感状态,还要理解其产生原因并进行更高层次的认知推理。然而,现有基准主要关注情感识别,对基于依据的理解和面向响应的分析支持有限。为弥补这一差距,我们引入了InsightVQA,一个用于情感理解和认知推理的层次化视觉问答大规模数据集。我们从六个公开来源收集的351K图像出发,应用严格的多阶段过滤流程,筛选出138K高置信度图像。每张图像在三个层次上进行标注:用于情感和效价识别的感知QA、通过约束引导生成从视觉触发提取构建的基于依据的理解QA,以及以响应意图预测和序列洞察推理为中心的认知QA。总计,InsightVQA包含725K个问答对。我们还提出了InsightVQA-Bench,一个包含30K样本的高质量评估基准,用于细粒度评估。为支持评估,我们引入了InsightNet,一个针对多模态大语言模型的情感调优基线。结果表明,InsightVQA对基于依据的情感理解和推理提出了重大挑战。

英文摘要

Visual emotion understanding requires models not only to recognize emotional states, but also to understand why they arise and to perform higher-level cognitive reasoning. However, existing benchmarks mainly focus on emotion recognition, offering limited support for grounded understanding and response-oriented analysis. To address this gap, we introduce InsightVQA, a large-scale dataset for hierarchical visual question answering on emotion understanding and cognitive reasoning. Building from 351K images collected from six public sources, we apply a rigorous multi-stage filtering pipeline to curate 138K high-confidence images. Each image is annotated at three hierarchical levels: perception QA for emotion and valence recognition, grounded understanding QA constructed from visual trigger extraction through constraint-guided generation, and cognition QA centered on response intent prediction and sequential insight reasoning. In total, InsightVQA contains 725K QA pairs. We further present InsightVQA-Bench, a high-quality evaluation benchmark comprising 30K samples for fine-grained evaluation. To support evaluation, we introduce InsightNet, an emotion-tuned baseline for MLLMs. Results demonstrate that InsightVQA poses significant challenges for grounded emotion understanding and reasoning.

2606.02170 2026-06-02 cs.CL

CRAFTQA: A Code-Driven Adaptive Framework for Complex Structured Data Reasoning

CRAFTQA: 一种用于复杂结构化数据推理的代码驱动自适应框架

Chengtao Gan, Zhiqiang Liu, Long Jin, Yushan Zhu, Lei Liang, Wen Zhang

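AI summary: Proposes the CRAFTQA framework, in which the CodeSTEP module generates step-by-step code reasoning sequences and the CRAFT module dynamically generates custom code functions to move beyond a predefined function set, significantly outperforming existing unified methods on complex structured data reasoning tasks.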

Comments
Accepted by Findings of ACL 2026
AI中文摘要

现实场景涉及大量异构结构化数据(例如表格、知识图谱),因此对这些多样化数据进行有效推理变得越来越重要。统一结构化数据问答已成为一个突出的研究趋势,旨在单一框架内回答跨不同结构化数据类型的自然语言问题。然而,现有的统一方法有一个共同的局限性:它们依赖于一组预定义函数,这限制了它们在超出这些预定义操作之外执行复杂推理的能力。为了克服这一根本性限制,我们提出了CRAFTQA,一种新颖的自适应代码驱动框架,包含两个核心模块:CodeSTEP和CRAFT。CodeSTEP模块是一种范式,它生成一个完整的可执行Python代码序列,其中包含基于问题的逐步代码推理操作。CRAFT模块为超出预定义函数集的操作动态生成自定义代码函数,并与CodeSTEP无缝集成,显著增强了处理复杂推理的灵活性。在多个结构化数据集上的综合实验表明,与现有的统一方法相比,CRAFTQA在复杂推理场景中取得了显著的改进。

英文摘要

Real-world scenarios involve massive heterogeneous structured data (e.g., tables, knowledge graphs), making effective reasoning over such diverse data increasingly important. Unified structured data question answering has emerged as a prominent research trend, aiming to answer natural language questions across different structured data types within a single framework. However, existing unified methods share a common limitation: they rely on a set of predefined functions, which restricts their ability to perform complex reasoning beyond these predefined operations. To overcome this fundamental limitation, we propose CRAFTQA, a novel adaptive code-driven framework comprising two core modules, CodeSTEP and CRAFT. The CodeSTEP module is a paradigm that generates a complete executable Python code sequence, which contains step-by-step code-based reasoning operations based on the question. The CRAFT module dynamically generates custom code functions for operations beyond the predefined function set, and seamlessly integrates with CodeSTEP to significantly enhance flexibility in handling complex reasoning. Comprehensive experiments on multiple structured datasets demonstrate that CRAFTQA achieves remarkable improvements in complex reasoning scenarios compared to existing unified methods.

2606.02168 2026-06-02 cs.CV cs.LG

Disentanglement-Based Equivariant Learning for Compositional VQA

基于解耦的等变学习用于组合式VQA

Zhou Du, Zhaoquan Yuan, Xiao Wu, Changsheng Xu

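AI summary: Proposes the DEAL framework, which disentangles visual and textual concepts through causal interventions and strengthens compositional reasoning with equivariant constraints, surpassing existing methods on CLEVR-CoGenT and GQA-SGL.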

Journal ref
IEEE Trans. Multimedia, vol. 27, pp. 8160-8173, 2025
Comments
Accepted by IEEE Transactions on Multimedia
AI中文摘要

组合式视觉问答(VQA)是一项具有挑战性但基础的任务,要求模型理解先前学习概念的新组合。当前方法往往忽视潜在概念的解耦,并且在有效捕捉组合变化机制方面受到限制。此外,最先进的技术依赖于额外的线索进行训练,这在现实世界的VQA场景中不可行。为了解决这些问题,本文提出了一种新颖的基于解耦的等变学习(DEAL)框架用于组合式VQA,该框架仅由真实答案指导。在DEAL中,我们采用因果启发的干预措施,在重新编码框架内解耦来自视觉和文本输入的概念。基于等变性原理,我们随后对推理输入进行组合变换,并对输出施加等变约束,以增强模型的组合推理能力。在基准数据集CLEVR-CoGenT和GQA-SGL上进行的全面实验验证了我们提出的DEAL方法在视觉和语言泛化设置下均优于现有的最先进方法。

英文摘要

Compositional visual question answering (VQA) represents a challenging yet fundamental task that requires models to comprehend novel combinations of previously learned concepts. The current methods often overlook the disentanglement of underlying concepts and are restricted in terms of their ability to effectively capture the compositional variation mechanism. Moreover, the state-of-the-art techniques depend on additional clues for training, which is not feasible in real-world VQA scenarios. To address these issues, in this paper, we introduce a novel Disentanglement-based EquivAriant Learning (DEAL) framework for compositional VQA, which is guided exclusively by ground-truth answers. In DEAL, we employ causality-inspired interventions to disentangle concepts derived from visual and textual inputs within a re-encoding framework. Based on the principle of equivariance, we subsequently perform a compositional transformation on the inference input and impose the equivariant constraint on the output to augment the compositional reasoning capacity of the model. Comprehensive experiments conducted on the benchmark CLEVR-CoGenT and GQA-SGL datasets validate the superiority of our proposed DEAL approach over the existing state-of-the-art methods for compositional VQA tasks in both visual and linguistic generalization settings.

2606.02167 2026-06-02 cs.AI

From Capability Models to Automated Planning: An AAS-Native Approach for Automatic PDDL Generation

从能力模型到自动规划:一种面向AAS的自动PDDL生成方法

Hamied Nabizada, Thomas Wirt, Luis Miguel Vieira da Silva, Felix Gehlhoff, Alexander Fay

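AI summary: Proposes a method for automatically generating PDDL planning problems from Asset Administration Shell (AAS) capability models, allowing engineers to verify production system layouts without knowing PDDL syntax.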

Comments
Accepted at the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)
AI中文摘要

设计生产系统的工程师需要验证给定的布局是否支持所有必需的生产序列。自动化规划技术可以回答此类问题,但在规划领域定义语言(PDDL)中制定所需的规划问题需要专业知识,而生产工程师通常缺乏这些知识。资产管理外壳(AAS)已成为工业4.0中工业资产的标准化数字孪生。我们展示了使用四个成熟的工业4.0标准(用于过程描述的VDI 3682、用于语义属性限定的IEC 61360-1、用于类型层次结构的IDTA 02011和用于实例描述的IDTA 02016)构建的AAS能力模型包含足够的信息,可以自动生成完整的PDDL问题。与之前引入PDDL特定子模型的工作不同,我们的方法从资源功能的领域级描述(即能力)中推导出所有规划元素,使工程师能够在完全不接触PDDL语法或规划概念的情况下对能力进行建模。我们的提取算法将分布式的多AAS架构转换为完整的PDDL规划问题。我们在实验室生产系统的AAS模型上验证了该方法,通过最优规划比较了四种布局变体,展示了工程师如何通过修改AAS模型并重新生成规划域来系统地探索设计权衡。

英文摘要

Engineers designing production systems need to verify that a given layout supports all required production sequences. Automated planning techniques can answer such questions, but formulating the required planning problems in the Planning Domain Definition Language (PDDL) demands specialized expertise that production engineers typically lack. Asset Administration Shells (AAS) have emerged as the standardized Digital Twin for industrial assets in Industry 4.0. We show that AAS capability models, structured using four established Industry 4.0 standards (VDI 3682 for process descriptions, IEC 61360-1 for semantic property qualification, IDTA 02011 for type hierarchies, and IDTA 02016 for instance descriptions), contain sufficient information to generate complete PDDL problems automatically. Unlike prior work that introduced PDDL-specific submodels, our approach derives all planning elements from domain-level descriptions of resource functions, so-called capabilities, allowing engineers to model capabilities without any exposure to PDDL syntax or planning concepts. Our extraction algorithm transforms distributed Multi-AAS architectures into complete PDDL planning problems. We validate the approach on AAS models of a laboratory production system, comparing four layout variants using optimal planning to demonstrate how engineers can systematically explore design trade-offs by modifying the AAS model and regenerating the planning domain.
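
To give a feel for what generating a PDDL problem from structured data involves, here is a minimal sketch that renders a problem file from plain Python values. The input layout, the predicate names, and the `to_pddl_problem` helper are hypothetical stand-ins and do not reflect the AAS submodel structure or the paper's extraction algorithm.

```python
def to_pddl_problem(name, domain, objects, init, goal):
    """Render a PDDL problem definition from plain Python data."""
    object_part = " ".join(f"{obj} - {typ}" for obj, typ in objects.items())
    init_part = " ".join(f"({fact})" for fact in init)
    goal_part = " ".join(f"({fact})" for fact in goal)
    return (
        f"(define (problem {name})\n"
        f"  (:domain {domain})\n"
        f"  (:objects {object_part})\n"
        f"  (:init {init_part})\n"
        f"  (:goal (and {goal_part})))"
    )

# Hypothetical capability data: one drilling resource and one product.
problem = to_pddl_problem(
    name="layout-a",
    domain="production",
    objects={"drill1": "resource", "part1": "product"},
    init=["can-drill drill1", "raw part1"],
    goal=["drilled part1"],
)
print(problem)
```

Changing the input data and calling the function again yields a new problem file, which mirrors the regenerate-and-compare workflow described for the layout variants.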

2606.02163 2026-06-02 cs.AI

An Abstract Worlds Semantic Framework for Belief Change Operators

信念变化算子的抽象世界语义框架

Daniel Grimaldi, M. Vanina Martinez, Ricardo O. Rodriguez

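AI summary: Proposes Abstract Worlds Semantics, a set-theoretic framework with no logical syntax, which treats worlds as primitive elements and defines world contraction and revision operators to analyse belief change models in a unified way, covering classical and non-prioritized belief change.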

AI中文摘要

本文提出了一种用于信念变化的集合论框架,称为抽象世界语义,其中不假设任何逻辑语法。受Grove(1988)结果的启发,我们的方法将世界视为原始元素,并在此基础上定义了世界收缩和世界修正算子。该语义框架能够对信念变化模型进行统一分析。在此框架内,我们通过定义通用算子,统一了经典和非优先信念变化构造。当考虑经典命题逻辑时,我们的框架提供了AGM、KM和多重变化模型的同质化描述。总之,AWS系统化了信念变化框架和算子,简化并推广了基于信念集的信念变化理论。

英文摘要

This article proposes a set-theoretic framework for belief change, called Abstract Worlds Semantics, in which no logical syntax is assumed. Inspired by Grove's (1988) results, our approach treats worlds as primitive elements, over which world contraction and world revision operators are defined. This semantic framework enables a unified analysis of belief change models. Within this framework, we unify classical and non-prioritized belief change constructions by defining versatile operators. When classical propositional logic is considered, our framework provides a homogeneous account of AGM, KM, and Multiple Change models. In summary, AWS systematizes belief change frameworks and operators, simplifying and generalizing belief change theory over belief sets.
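
A small finite example may help show what operating on worlds directly looks like. The sketch below uses a plausibility ranking over worlds, in the spirit of Grove's sphere systems, to define revision and contraction as set operations. The ranking representation and the function names are assumptions for illustration and are not the operators defined in the paper.

```python
def most_plausible(worlds, rank):
    """Worlds of minimal rank (a lower rank means more plausible)."""
    if not worlds:
        return set()
    best = min(rank[w] for w in worlds)
    return {w for w in worlds if rank[w] == best}

def revise(rank, proposition):
    """Keep the most plausible worlds in which the proposition holds."""
    return most_plausible(proposition, rank)

def contract(rank, proposition):
    """Add the most plausible worlds where the proposition fails."""
    universe = set(rank)
    believed = most_plausible(universe, rank)
    return believed | most_plausible(universe - proposition, rank)

# Four worlds; the agent currently believes only w1 is possible.
rank = {"w1": 0, "w2": 1, "w3": 1, "w4": 2}
rain = {"w1", "w2"}  # a proposition is the set of worlds where it holds

revised = revise(rank, set(rank) - rain)
contracted = contract(rank, rain)
print(revised, contracted)
```

Propositions here are just sets of worlds, so no logical syntax is needed; contraction is obtained from revision by the negation, following the Harper identity.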

2606.02162 2026-06-02 cs.CV cs.AI cs.CL cs.IR

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis

视觉丰富文档类型分类的多模态方法:一项比较分析

Catyana Heyne, Jürgen Frikel, Filippo Riccio

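AI summary: To address the difficulty of systematically comparing multimodal modeling strategies for visually rich document type classification, conducts a controlled comparison of four representative Transformer- and LLM-based models within a unified experimental framework, finding that specialized multimodal Transformers outperform LLM-based approaches and that image information contributes most.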

AI中文摘要

视觉丰富文档中的文档类型分类仍然具有挑战性,因为相关信息分布在文本、视觉和布局模态中。为了捕捉这种复杂性,当前方法依赖于多样化的多模态建模策略,导致异构架构使得系统比较复杂化。这种变异性也反映在现有的比较研究中,这些研究通常依赖于异构评估设置,进一步复杂化了系统比较,并使得评估进展变得困难。为了解决这些局限性,本文提供了跨基于Transformer和基于LLM架构的多模态设计策略的结构化分析,并结合统一实验框架内的受控实证比较。具体来说,在RVL-CDIP基准上评估了四种代表性模型(LayoutLMv3、Donut、Qwen3-VL-32B-Instruct和Qwen3-32B),以系统分析文本、图像和布局信息对文档类型分类的贡献,特别关注对比OCR依赖和OCR无关的方法。结果表明,专用多模态Transformer在视觉丰富和布局密集型文档上优于基于LLM的方法。图像信息对可靠分类贡献最大,而OCR派生的文本提供有用但次要的支持。这些发现强调,对于具有显著布局结构的文档,多模态处理仍然是必不可少的。总体而言,该研究为比较多模态架构提供了系统基础,并为选择有效的特征组合和模型设计以进行文档类型分类提供了实用指导。

英文摘要

Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.

2606.02161 2026-06-02 cs.CV cs.CL

InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models

InfoMerge: 信息感知的令牌压缩用于高效视频大语言模型

Xinxin Liu, Shiwei Gan, Xiao Liu, Yafeng Yin, Lei Xie, Sanglu Lu

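AI summary: Proposes InfoMerge, a training-free visual token compression method that uses robust redundancy estimation and content-aware budget allocation to retain 98.8% of performance while cutting visual tokens by 85%, giving a 4.24x prefill speedup.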

详情
Comments
15 pages, 8 figures
AI中文摘要

视频大语言模型在视频理解中表现出色,但过多的视觉令牌带来了巨大的计算开销。现有的免训练压缩方法通过减少视觉令牌来提高推理效率,但它们通常依赖局部相邻帧相似性进行时间冗余估计,或主要根据片段长度分配令牌预算。这种设计对帧级噪声敏感,且无法捕捉真实视频的非均匀信息分布。为解决这些挑战,我们提出InfoMerge,一种无需训练的视觉令牌压缩方法,通过鲁棒冗余估计和内容感知预算分配来提高令牌利用率。具体来说,我们提出时间指纹差异:一种片段级二阶时间冗余估计策略,用于建模每个片段内相同空间位置令牌的时间相似性结构。我们进一步引入内容感知预算分配(CABA),根据片段独特性和基于谱熵的表征丰富性动态分配片段级令牌预算。通过减少对冗余静态区域的重复保留,并将更多令牌分配给信息丰富的片段,InfoMerge在保持强大性能的同时更好地利用了有限的令牌预算。大量实验表明,InfoMerge在多个基准和骨干网络上实现了强效的精度-效率权衡,在激进压缩下优势更为明显。在LLaVA-OneVision-7B上,InfoMerge保留了原始平均性能的98.8%,同时减少了85%的视觉令牌,并在预填充阶段实现了4.24倍的加速。

英文摘要

Video Large Language Models (Video-LLMs) achieve strong performance in video understanding, but their excessive visual tokens bring substantial computational overhead. Existing training-free compression methods improve inference efficiency by reducing visual tokens, yet they often rely on local adjacent-frame similarity for temporal redundancy estimation or allocate token budgets mainly according to segment length. Such designs are sensitive to frame-level noise and fail to capture the non-uniform information distribution of real-world videos. To address these challenges, we propose InfoMerge, a training-free visual token compression method that improves token utilization through robust redundancy estimation and content-aware budget allocation. Specifically, we propose the Temporal Fingerprint Difference: a segment-level second-order temporal redundancy estimation strategy, which models the temporal similarity structure of tokens at the same spatial positions within each segment. We further introduce Content-Aware Budget Allocation (CABA), which dynamically allocates segment-level token budgets based on segment uniqueness and spectral-entropy-based representational richness. By reducing repeated preservation of redundant static regions and allocating more tokens to informative segments, InfoMerge makes better use of the limited token budget while maintaining strong performance. Extensive experiments show that InfoMerge achieves strong efficiency-accuracy trade-offs across multiple benchmarks and backbones, with more pronounced advantages under aggressive compression. On LLaVA-OneVision-7B, InfoMerge retains 98.8% of the original average performance while reducing visual tokens by 85% and achieving a 4.24-fold speedup in the prefill stage.
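As a rough illustration of how a spectral-entropy score could steer a token budget toward richer segments, the sketch below scores each segment by the Shannon entropy of its normalized singular values and splits the budget proportionally. This is an assumption-laden toy, not the paper's CABA: the function names are hypothetical, the proportional rule is a guess, and the segment-uniqueness term mentioned in the abstract is left out.

```python
import numpy as np

def spectral_entropy(tokens):
    """Shannon entropy of the normalized singular values of a (num_tokens, dim) matrix."""
    s = np.linalg.svd(np.asarray(tokens, dtype=float), compute_uv=False)
    total = s.sum()
    if total == 0.0:
        return 0.0  # an all-zero segment carries no spectral mass
    p = s / total
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def allocate_budget(segments, total_budget):
    """Split total_budget across segments in proportion to spectral entropy (assumed rule)."""
    scores = np.array([spectral_entropy(seg) for seg in segments])
    if scores.sum() > 0:
        weights = scores / scores.sum()
    else:
        weights = np.full(len(segments), 1.0 / len(segments))
    budgets = np.floor(weights * total_budget).astype(int)
    # Give any tokens lost to rounding to the highest-scoring segments.
    remainder = int(total_budget - budgets.sum())
    for i in np.argsort(-scores)[:remainder]:
        budgets[i] += 1
    return budgets
```

A static segment whose tokens are all copies of one vector has near-zero spectral entropy and so receives almost none of the budget under this rule.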

2606.02158 2026-06-02 cs.CL

On the Salience of Low-Probability Tokens for AI-Generated Text Detection: A Multiscale Uncertainty Perspective

低概率标记对AI生成文本检测的显著性:多尺度不确定性视角

Yikai Guo, Bin Wang, Xilai Fan, Wenjun Ke, Haoran Luo

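AI summary: To address boilerplate-token dominance and brittle point estimates in AI-generated text detection, proposes Uncertainty, a multiscale uncertainty estimator built on low-probability tokens, and its extension Uncertainty++, and validates their effectiveness, generalization, and robustness across multiple datasets and LLMs.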

详情
Comments
Accepted by ICML 2026 main conference
AI中文摘要

AI生成的文本日益与人类写作融合,带来了诸如错误信息、学术滥用和语料库污染等实际风险。虽然统计检测器因高效和泛化能力而具有吸引力,但它们存在两个关键局限性:(i)模板标记主导,人类和LLM写作中共有的模板标记可能压倒判别信号;(ii)脆弱的点估计,依赖单一概率分数在对抗性操纵下产生不稳定的决策。为解决这些问题,我们提出Uncertainty,一种多尺度不确定性估计器,专注于信息丰富的低概率标记,这些标记更清晰地暴露分布差异。局部上,它通过平均低概率标记的对数概率来缓解模板标记主导;全局上,它通过Rényi熵捕捉该低概率区域的分布形状,从而降低脆弱性。我们进一步通过条件独立采样将检测器扩展为Uncertainty++,得到更稳定的不确定性估计。在七个数据集和十六个LLM上的实验证明了其高效性、泛化性和鲁棒性。我们的代码可在https://github.com/guoyikai2000/Uncertainty-AIGT获取。

英文摘要

AI-generated text increasingly blends with human writing, raising practical risks such as misinformation, academic misuse, and corpus contamination. While statistical detectors are appealing for their efficiency and generalization, they suffer from two key limitations: (i) boilerplate dominance, where boilerplate tokens shared across human and LLM writing can overwhelm discriminative signals; and (ii) brittle point estimates, where relying on a single probability score yields unstable decisions under adversarial manipulation. To address these issues, we propose Uncertainty, a multiscale uncertainty estimator that focuses on informative low-probability tokens, which more clearly expose distributional discrepancies. Locally, it alleviates boilerplate dominance by averaging the log-probabilities of low-probability tokens; globally, it reduces brittleness by capturing the distributional shape of this low-probability region via Rényi entropy. We further extend the detector to Uncertainty++ via conditional independent sampling, yielding more stable uncertainty estimates. Experiments across seven datasets and sixteen LLMs demonstrate high effectiveness, generalization, and robustness. Our code is available at https://github.com/guoyikai2000/Uncertainty-AIGT.
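The two statistics named in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: "low-probability" is taken to mean the bottom `fraction` of token log-probabilities, and the Rényi entropy is computed over whatever probability vector the caller supplies. Both function names and the `fraction` and `alpha` defaults are hypothetical.

```python
import numpy as np

def low_prob_mean_logp(token_logps, fraction=0.2):
    """Mean log-probability over the lowest `fraction` of tokens (assumed selection rule)."""
    logps = np.sort(np.asarray(token_logps, dtype=float))
    k = max(1, int(np.ceil(fraction * logps.size)))
    return float(logps[:k].mean())

def renyi_entropy(probs, alpha=2.0):
    """Renyi entropy H_alpha = log(sum p^alpha) / (1 - alpha), in nats."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()  # normalize so the input may be unnormalized weights
    if np.isclose(alpha, 1.0):
        p = p[p > 0]
        return float(-(p * np.log(p)).sum())  # Shannon limit as alpha -> 1
    return float(np.log((p ** alpha).sum()) / (1.0 - alpha))
```

The first function ignores high-probability boilerplate tokens entirely, while the second summarizes the shape of a distribution rather than a single point score: a flat distribution gives a larger value than a peaked one.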

2606.02156 2026-06-02 eess.IV cs.AI cs.CV cs.IR cs.LG

Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel

基于术前肠道血供映射预测结直肠吻合口漏风险

Zahra Tabatabaei, Jon Sporring, Mark Bremholm Ellebæk, Alaa El-Hussuna

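AI summary: Proposes an AI-driven system based on preoperative CT imaging that quantifies anastomotic leak risk by analyzing vascular and tissue features, combined with content-based image retrieval to support clinical decisions.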

详情
AI中文摘要

吻合口漏仍然是结直肠癌手术后最严重的并发症之一,显著影响患者预后、康复轨迹和医疗成本。尽管影像技术有所进步,目前的术前评估仍依赖临床评估,这一过程主观、易出错且高度依赖个人经验。迄今为止,尚无经过验证的基于CT的方法能够在术前预测吻合口漏风险。本方案论文概述了一个全面的框架,用于开发和验证一个AI驱动的系统,该系统利用对比增强前后的CT影像进行术前风险评估。研究描述了数据收集、伦理处理、符合GDPR的患者数据预处理、图像预处理以及旨在生成临床可解释输出的深度学习架构探索等阶段。该工作流程的两个主要成果是:1) 风险评估模块,通过分析CT扫描中的血管和组织特征量化漏液可能性;2) 基于内容的医学图像检索(CBMIR)模块,识别并显示相似历史病例以支持循证手术决策。该方案论文需要医院和大学之间的密切合作;本方案表明,此类系统在现有医疗基础设施内技术上可行且临床可实施。通过遵循所提出的方法论阶段和监管原则,其他机构可以复制此工作流程以开发类似的决策支持工具。最终,这一跨学科框架旨在加强手术规划、减少漏液发生率,并推动向可解释、数据驱动的精准手术的更广泛范式转变。

英文摘要

Anastomotic leak remains one of the most serious complications following colorectal cancer surgery, substantially affecting patient outcomes, recovery trajectories, and healthcare costs. Despite advances in imaging technology, current preoperative assessment relies only on clinical assessment, a process that is subjective, error-prone, and highly dependent on individual expertise. To date, no validated CT-based method exists to predict anastomotic leak risk prior to surgery. This protocol paper outlines a comprehensive framework for developing and validating an AI-driven system for preoperative risk assessment using pre- and post-contrast CT imaging. The study describes the stages of data collection, ethical handling, and preprocessing of patient data in accordance with GDPR, image preprocessing, and the exploration of deep learning architectures designed to generate clinically interpretable outputs. Two integrated tools constitute the main deliverables of this workflow: 1) a risk assessment module, which quantifies the likelihood of leakage by analyzing vascular and tissue features in CT scans, and 2) a Content-Based Medical Image Retrieval (CBMIR) module, which identifies and displays similar historical cases to support evidence-based surgical decision making. The protocol paper requires close collaboration between hospitals and universities; this protocol demonstrates that such a system is technically feasible and clinically implementable within existing healthcare infrastructures. By following the proposed methodological stages and regulatory principles, other institutions can reproduce this workflow to develop analogous decision-support tools. Ultimately, this interdisciplinary framework aims to enhance surgical planning, reduce leak incidence, and contribute to a broader paradigm shift toward explainable, data-driven precision surgery.

2606.02153 2026-06-02 cs.CV cs.GR

Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances

超扩散姿态估计器:基于扩散的从稀疏惯性传感器和测距传感器间距离的人体运动跟踪

Dominik Hollidt, Tommaso Bendinelli, Christian Holz

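AI summary: Proposes Ultra Diffusion Poser, a diffusion model that explicitly models the geometric constraints of UWB ranging (a Spatial Layout Module analytically reconstructs sensor positions) and introduces UWB-Diffusion Guidance, which encourages predicted poses to align with measured distances during diffusion sampling, reducing joint position error by up to 22%.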

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026, pp. 7036-7046
Comments
CVPR 2026 - Computer Vision and Pattern Recognition
AI中文摘要

使用惯性测量单元(IMU)的方法提供了一种可穿戴的替代基于摄像头的运动捕捉方案。为了减轻惯性信号的漂移,最近的稀疏惯性姿态估计器集成了由超宽带(UWB)测距测量的传感器间距离。到目前为止,UWB距离仅被用作额外的输入特征,忽略了它们对传感器位置施加的物理约束。然而,这些距离也可以用于重建底层3D传感器布局,从而为姿态重建提供更具信息性的输入。我们提出了Ultra Diffusion Poser,一种显式建模这些几何约束的扩散模型。它包括一个空间布局模块,该模块从UWB测量中解析地重建3D传感器位置。这些传感器位置与IMU信号和UWB距离一起作为扩散过程中的条件信号。尽管如此,网络预测可能违反传感器间距离测量。为了解决这个问题,我们引入了UWB扩散引导,它在扩散采样过程中鼓励预测姿态与测量距离之间的对齐。这些贡献共同使我们的模型达到了最先进的性能,将关节位置误差相比先前工作降低了高达22%。

英文摘要

Methods using inertial measurement units (IMUs) provide a wearable alternative to camera-based motion capture. To mitigate drift from inertial signals, recent sparse inertial pose estimators integrate inter-sensor distances measured by ultra-wideband (UWB) ranging. So far, UWB distances have only been used as an additional input feature, ignoring the physical constraints they impose on sensor positions. However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. It includes a Spatial Layout Module that analytically reconstructs the 3D sensor positions from UWB measurements. These sensor positions are used alongside IMU signals and UWB distances as a conditioning signal during diffusion. Still, network predictions can violate inter-sensor distance measurements. To address this, we introduce UWB-Diffusion Guidance, which encourages alignment between predicted poses and measured distances during diffusion sampling. Together, these contributions enable our model to achieve state-of-the-art performance, reducing joint position error by up to 22% over prior work.
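One standard analytic way to recover a point layout from pairwise distances is classical multidimensional scaling (MDS). The sketch below shows that technique only; the abstract does not say whether the Spatial Layout Module uses it, and the function name is hypothetical. Positions are recovered only up to rotation, reflection, and translation, and the sketch assumes a complete, noise-free distance matrix.

```python
import numpy as np

def positions_from_distances(distances, dim=3):
    """Classical MDS: embed n points in `dim` dimensions from an (n, n) distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centering the squared distances yields the Gram matrix of centered points.
    gram = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dim]      # keep the `dim` largest eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)     # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)
```

With noisy UWB measurements the Gram matrix is no longer exactly rank 3, so the truncated eigendecomposition acts as a least-squares style approximation rather than an exact reconstruction.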

2606.02151 2026-06-02 cs.AI cs.SY eess.SY

S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty

S3TS:面向不确定性下高级规划的随机情景结构化树搜索

Fabio Pavirani, Bert Claessens, Pierre Pinson, Chris Develder

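AI summary: Proposes the Stochastic Scenario-Structured Tree Search (S3TS) algorithm, which represents uncertainty explicitly through scenario trees and integrates non-linear models. On a demand response signal publication problem it is near-optimal in the linear setting, with costs within 14% of the optimal solution, and in non-linear scenarios it cuts costs by up to 51% and 5.4% relative to a myopic algorithm and deterministic MCTS, respectively.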

详情
AI中文摘要

能源领域的有效调度对于确保电网及其连接资产的可靠运行至关重要,例如通过优化发电机组和储能系统的调度。有效的规划策略必须(a)适应先进且可能非线性的系统模型——利用现代电网日益增长的数据可用性,以及(b)显式处理由可再生能源整合等引起的不确定性。虽然现有方法可以处理非线性(例如蒙特卡洛树搜索)或不确定性(例如随机数学优化),但缺乏能够同时应对这两个挑战的规划技术。为填补这一空白,我们提出了一种随机情景结构化树搜索(S3TS)算法,该算法通过情景树显式表示不确定性,同时能够集成先进的非线性模型。我们在一个模拟的需求响应信号发布问题上评估了S3TS,该问题很大程度上模仿了比利时的失衡结算机制。结果表明,在线性、可解析处理的设置中,S3TS实现了接近最优的性能,成本在情景树条件下比数学最优解高14%以内。在高度非线性的场景中,S3TS显著优于基线方法,与贪心算法和确定性MCTS相比,成本分别降低了51%和5.4%。

英文摘要

Effective scheduling in the energy sector is essential to ensure the reliable operation of electrical grids and their connected assets by, for instance, optimizing the dispatch of generation units and storage systems. An effective planning strategy must (a) accommodate advanced and potentially non-linear system models, exploiting the increasing data availability of modern grids, and (b) explicitly handle uncertainties arising, for instance, from the integration of renewable energy sources. While existing approaches can address either non-linearity (e.g., Monte Carlo Tree Search) or uncertainty (e.g., stochastic mathematical optimization), there is a lack of planning techniques capable of addressing both challenges simultaneously. To bridge this gap, we propose a Stochastic Scenario-Structured Tree Search (S3TS) algorithm that explicitly represents uncertainty through scenario trees while enabling the integration of advanced non-linear models. We evaluate S3TS on a simulated demand response signal publication problem, largely mimicking the imbalance settlement mechanism in Belgium. The results demonstrate near-optimal performance in linear, analytically tractable settings, with costs within 14% of the mathematically optimal solution conditioned on the scenario trees. In highly non-linear scenarios, S3TS significantly outperforms baseline methods, achieving cost reductions of up to 51% and 5.4% compared to a myopic algorithm and deterministic MCTS, respectively.

2606.02147 2026-06-02 cs.CL cs.AI

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

高、中、低资源语言中的句子和对话多语言习语

Saeed Almheiri, Bilal Elbouardi, Salsabila Zahirah Pranida, Irina Nikishina, Ashwath Rao B, Parameswari Krishnamurthy, Muhammad Cendekia Airlangga, Rifo Ahmad Genadi, Nguyen Phan Gia Bao, Amir Hossein Yari, Hawau Olamide Toyin, Nurdaulet Mukhituly, Mena Attia, Besher Hassan, Ahmad Fathan Hidayatullah, Tatsuki Kuribayashi, Haonan Li, Suma Bhat, Fajri Koto

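AI summary: For multilingual idiom understanding, builds MIDI, a dataset covering 3 high-resource, 3 medium-resource, and 12 low-resource languages with literal and figurative uses in sentence and conversational contexts. Experiments show that comprehension is worse in low-resource languages, that literal readings are harder than figurative ones, and that conversational context helps but does not close the gap.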

详情
AI中文摘要

习语表达对多语言NLP构成重大挑战,因为其意义在比喻和字面用法之间转换,通常需要上下文才能准确理解。先前工作集中在高资源语言上,通常评估孤立的习语意义问题,忽略了现实话语。我们引入了MIDI,一个多语言习语数据集,涵盖3种高资源、3种中资源和12种低资源语言,由母语者策划。与之前的数据集不同,MIDI提供了嵌入在句子级和对话上下文中的习语,捕捉了字面和比喻解读。对最先进模型的基准测试表明,习语理解在低资源语言中下降,并且在所有资源层级中,字面解释比比喻解释更难。对话上下文提高了性能,但并未消除这些差异。通过受控测试和对隐藏表示的干预,我们进一步将记忆与推理分离,揭示了当前模型的核心局限性。

英文摘要

Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages and typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, in all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.

2606.02145 2026-06-02 cs.LG

Hybrid Neural Ordinary Differential Equations for Data-Efficient Polymerization Modeling with Incomplete Kinetics

混合神经常微分方程用于不完全动力学的高效聚合建模

Marah Almanasreh, Alexander Mitsos, Eike Cramer

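AI summary: Proposes a hybrid neural ordinary differential equation framework that learns only the partially characterized effective radical concentration term, giving accurate predictions of free-radical polymerization from sparse data.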

详情
Comments
25 pages, 5 figures
AI中文摘要

聚合动力学的准确预测对于过程设计、控制和优化至关重要。然而,纯机理模型需要对部分表征的动力学进行劳动密集型的参数化,而纯数据驱动模型需要大量且多样化的数据集,这些数据集获取成本高昂,尤其是在早期设计阶段。我们提出了一种混合神经常微分方程(NODE)框架,用于自由基聚合的数据高效建模。以甲基丙烯酸甲酯(MMA)的间歇聚合为例,明确保留了机理质量平衡,仅通过神经网络代理从数据中学习部分表征的控制单体消耗的有效自由基浓度,而引发剂分解、链增长和终止等已确立的反应则保持物理建模。在稀疏数据条件下,将混合NODE与离散时间前馈神经网络和纯数据驱动NODE进行比较,模型在规则和不规则采样下仅使用少至十个测量值进行训练。混合NODE始终比两种纯数据驱动基线实现更低的预测误差和更物理一致的外推。在噪声数据和未见操作条件的泛化场景中,混合NODE的RMSE为0.013,而数据驱动NODE为0.31,离散时间模型为0.68,表明在有限数据可用性下,仅学习闭合项而非完整动力学足以实现可靠预测。

英文摘要

Accurate prediction of polymerization dynamics is essential for process design, control, and optimization. Yet, purely mechanistic models require labor-intensive parameterization of partially characterized kinetics, while purely data-driven models demand large, diverse datasets that are costly to obtain, particularly in early-design stages. We propose a hybrid Neural Ordinary Differential Equation (NODE) framework for data-efficient modeling of free-radical polymerization. Using batch polymerization of methyl methacrylate (MMA) as a case study, the mechanistic mass balances are retained explicitly, and only the partially-characterized effective radical concentration governing monomer consumption is learned from data through a neural network surrogate, while established reactions such as initiator decomposition, propagation, and termination remain physically modeled. The hybrid NODE is evaluated against a discrete-time feedforward neural network and a purely data-driven NODE under sparse data conditions, with models trained on as few as ten measurements under both regular and irregular sampling. The hybrid NODE consistently achieves lower prediction errors and more physically consistent extrapolations than both purely data-driven baselines. In a generalization scenario with noisy data and unseen operating conditions, the hybrid NODE achieves an RMSE of 0.013, compared to 0.31 for the data-driven NODE and 0.68 for the discrete-time model, demonstrating that learning only a closure term rather than the full dynamics is sufficient for reliable prediction under limited data availability.
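The hybrid structure described above, mechanistic balances with one learned closure term, can be sketched as an ODE right-hand side that takes the radical-concentration model as a pluggable function. Everything concrete here is an assumption for illustration: the two-state model (initiator, monomer), the rate constants, and the quasi-steady-state formula used as a stand-in where the paper would place a trained neural network.

```python
import numpy as np

def qssa_radicals(initiator, monomer, f=0.6, kd=1e-5, kt=1e7):
    """Stand-in for the learned surrogate: quasi-steady-state radical concentration."""
    return np.sqrt(2.0 * f * kd * max(initiator, 0.0) / kt)

def hybrid_rhs(state, radical_model, kd=1e-5, kp=800.0):
    """Mechanistic balances; only the radical concentration comes from radical_model."""
    initiator, monomer = state
    radicals = radical_model(initiator, monomer)
    return np.array([-kd * initiator,            # initiator decomposition
                     -kp * monomer * radicals])  # monomer consumed by propagation

def integrate_rk4(state0, radical_model, t_end, steps=1000):
    """Fixed-step classical Runge-Kutta integration of the hybrid model."""
    h = t_end / steps
    state = np.asarray(state0, dtype=float)
    trajectory = [state.copy()]
    for _ in range(steps):
        k1 = hybrid_rhs(state, radical_model)
        k2 = hybrid_rhs(state + 0.5 * h * k1, radical_model)
        k3 = hybrid_rhs(state + 0.5 * h * k2, radical_model)
        k4 = hybrid_rhs(state + h * k3, radical_model)
        state = state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        trajectory.append(state.copy())
    return np.array(trajectory)
```

Calling `integrate_rk4([0.05, 5.0], qssa_radicals, 3600.0)` integrates one hour of a hypothetical batch. Swapping `qssa_radicals` for a trained network with the same signature leaves the mass balances untouched, which is the point of learning only the closure term.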

2606.02142 2026-06-02 cs.LG cs.DB

TimeBlocks: Foundational and Continual Time-Series Blockbase -- Extended Version

TimeBlocks: 基础与持续时间序列块库——扩展版本

David Campos, Bin Yang, Tung Kieu, Lei Chen, Chenjuan Guo, Christian S. Jensen

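AI summary: Proposes TimeBlocks, which builds lightweight, multi-task time-series models from interchangeable modular model blocks and a routing strategy, and introduces StreamCore for continual calibration; it outperforms existing baselines on multiple datasets.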

详情
Comments
15 pages. An extended version of "TimeBlocks: Versatile and Continual Time-Series Blockbase" accepted at SIGKDD 2026
AI中文摘要

持续的数字化导致监控各种过程的时间序列数据流激增,从中可以获得有价值的见解。此外,成功的基础语言模型的出现引发了一个问题:是否可能实现具有处理多个任务的基础属性的时间序列模型,同时足够轻量以允许实时数据流处理。现有的基础时间序列模型通常很大,并且仅在离线设置中有效,没有严格的时间和计算约束,且不需要重复的模型校准。然而,当应用于数据流时,这些模型由于规模大且缺乏对持续校准的支持而效率低下,这损害了它们提供准确实时响应的能力、耐久性以及在硬件受限环境中的可部署性。我们提出TimeBlocks,通过促进在可变条件下适用于多个任务的轻量级模型的高效构建,实现多用途的时间序列处理。特别是,该方法维护一个可互换和模块化的模型块池,可用于构建新的时间序列模型。当面对特定的时间序列数据时,路由策略迭代选择最合适的块来为数据构建轻量级且准确的模型。我们为TimeBlocks配备了一种称为StreamCore的方法,以构建数据流的代表性小子集,该子集随时间保持流的保证近似,从而实现持续的模型校准。在多个数据集和多个任务上的实验研究表明,TimeBlocks能够构建优于现有基线的模型。

英文摘要

Ongoing digitization has led to a proliferation of time-series data streams that monitor a variety of processes, from which valuable insights may be obtained. Further, the emergence of successful foundational language models raises the question of whether it is possible to achieve time-series models with the foundational property of handling multiple tasks, while being sufficiently lightweight to allow real-time data stream processing. Existing foundational time-series models are often large and only effective in offline settings without stringent time and computational constraints, and where repeated model calibration is not needed. However, when applied to data streams, these models are ineffective due to their size and lack of support for continual calibration, which compromise their ability to deliver accurate real-time responses, their durability, and their deployability in hardware-limited settings. We propose TimeBlocks to enable versatile time-series processing by facilitating the efficient building of lightweight models suitable for multiple tasks under variable conditions. In particular, the method maintains a pool of interchangeable and modular model blocks that can be used to construct new time-series models. When presented with specific time-series data, a routing strategy iteratively selects the most suitable blocks to construct a lightweight and accurate model for the data. We equip TimeBlocks with a method called StreamCore to build a small representative subset of the data stream, which preserves a guaranteed approximation of the stream over time, enabling continual model calibration. An experimental study on multiple data sets and covering multiple tasks shows that TimeBlocks enables building models capable of outperforming existing baselines.

2606.02138 2026-06-02 cs.LG cs.AI

VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting

VLBM:面向OOD鲁棒多变量时间序列预测的变分潜在基建模

Xudong Zhang, Jierui Lei, Jiacheng Li, Lingdong Shen, Jian Cui, Haina Tang

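AI summary: Proposes the VLBM framework, which uses a variational latent basis to separate stable dynamics from OOD deviations for robust forecasting under mixed ID/OOD distributions, with average MAE and MSE gains of 15.08% and 7.74% across 12 benchmark tasks.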

详情
AI中文摘要

多变量时间序列预测中的分布外(OOD)事件虽然罕见,但往往主导现实世界风险,使得平均情况预测不足以可靠部署。在混合ID/OOD分布的标准平均风险训练下,来自罕见OOD事件的优化信号可能被频繁的分布内(ID)模式淹没,因此强基准精度可能无法转化为高影响偏移下的可靠性。为解决此问题,我们提出VLBM(变分潜在基模型),一种理论指导的潜在预测框架,将稳定动态与OOD引起的偏差分离。VLBM学习一个共享潜在基,定义稳定ID动态的低秩子空间,将输入显式分解为基子空间分量和正交残差分量,并将未来感知后验与未来盲先验对齐,使得测试时潜在推断仅依赖于历史输入。在涵盖交通、天气、电力系统及其他现实世界领域的12个基准任务上,包括新构建的现实世界OOD交通数据集,VLBM实现了最先进的OOD鲁棒性和ID精度,平均MAE和MSE比最强基线分别提升15.08%和7.74%。在合成模拟数据集上,VLBM也持续实现最佳性能并更好地跟踪OOD脉冲恢复。这些结果支持潜在结构化预测作为混合ID和OOD条件下鲁棒预测的原则性途径。代码可在https://github.com/leijieruilq/VLBM_OOD_forecast获取。

英文摘要

Out-of-distribution (OOD) events in multivariate time series forecasting are rare but often dominate real-world risk, making average-case forecasting insufficient for reliable deployment. Under standard average-risk training on mixed ID/OOD distributions, optimization signals from rare OOD events can be overwhelmed by frequent in-distribution (ID) patterns, so strong benchmark accuracy may not translate into reliability under high-impact shifts. To address this issue, we propose VLBM (Variational Latent Basis Model), a theory-guided latent forecasting framework that separates stable dynamics from OOD-induced deviations. VLBM learns a shared latent basis that defines a low-rank subspace for stable ID dynamics, explicitly decomposes inputs into basis-subspace components and orthogonal residual components, and aligns a future-aware posterior with a future-blind prior so that test-time latent inference depends only on historical input. Across 12 benchmark tasks spanning transportation, weather, power systems, and other real-world domains, including newly constructed real-world OOD traffic datasets, VLBM achieves state-of-the-art OOD robustness and ID accuracy, with average MAE and MSE gains of 15.08% and 7.74% over the strongest baseline. On a synthetic simulation dataset, VLBM also consistently achieves the best performance and better tracks OOD pulse recovery. These results support latent-structured forecasting as a principled route to robust prediction under mixed ID and OOD conditions. The code is available at https://github.com/leijieruilq/VLBM_OOD_forecast.
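The split into a basis-subspace component and an orthogonal residual is, at its core, an orthogonal projection. The sketch below shows only that linear-algebra step, assuming a fixed basis matrix is already given; how VLBM learns the basis variationally is not covered, and the function name is hypothetical.

```python
import numpy as np

def split_basis_residual(x, basis):
    """Decompose x into its projection onto span(basis) and the orthogonal residual.

    x:     vector of shape (dim,)
    basis: matrix of shape (dim, rank) whose columns span the stable subspace
    """
    q, _ = np.linalg.qr(np.asarray(basis, dtype=float))  # orthonormalize the columns
    x = np.asarray(x, dtype=float)
    in_subspace = q @ (q.T @ x)
    residual = x - in_subspace
    return in_subspace, residual
```

By construction the two parts sum back to `x`, and the residual has zero inner product with every basis column, so a large residual norm is a natural (assumed) indicator that an input departs from the stable subspace.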

2606.02136 2026-06-02 cs.LG

Edge-aware Decoding for Neural Asymmetric Routing

面向神经非对称路由的边缘感知解码

Li Liang, Jinbiao Chen, Zizhen Zhang

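AI summary: To address the representation-decision mismatch in neural asymmetric routing, proposes an edge-aware decoder that explicitly exposes transition-level cost information, improving zero-shot generalization.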

详情
AI中文摘要

神经非对称路由模型越来越多地通过矩阵表示和非对称感知注意力来编码方向性。然而,最终路由动作并非孤立节点,而是在当前部分路由下选择的有向转移。这造成了表示与决策的不匹配:成对成本信息可能在上游编码,而最终候选logit仍主要参数化为上下文-节点兼容性。我们提出一种针对神经非对称路由的解码器设计原则:最终得分应显式暴露问题成本结构所暗示的转移级量。我们通过一个边缘感知解码器实例化该原则,该解码器为当前有向边、返回起点的闭合以及静态轻量级前瞻添加候选特定项,同时保持表示骨干网络固定。在受控的SVD/Sinkhorn非对称骨干网络上,该解码器在ATSP-100上训练并在ATSP-100/200/500/1000上零样本评估时,优于RADAR参考,将ATSP-1000的差距从4.13%降至2.73%。在ACVRP上,相同的得分级修改在更丰富的路由状态下显示出相同的定性趋势。ATSP消融实验和有向转移诊断进一步阐明了机制:最强证据涉及对当前有向边的敏感性,而闭合和静态前瞻则作为启发式延续线索。结果支持一项机制研究:神经非对称路由中一个关键的解码器侧信号是决策时暴露转移级边缘信息。

英文摘要

Neural asymmetric routing models increasingly encode directionality through matrix representations and asymmetry-aware attention. The final routing action, however, is not a node in isolation but a directed transition chosen under the current partial route. This creates a representation-decision mismatch: pairwise cost information may be encoded upstream while the final candidate logit is still largely parameterized as context-node compatibility. We propose a decoder-design principle for neural asymmetric routing: the final score should explicitly expose transition-level quantities suggested by the problem's cost-to-go structure. We instantiate this principle with an edge-aware decoder that adds candidate-specific terms for the current directed edge, return-to-start closure, and static lightweight lookahead, while keeping the representation backbone fixed. On a controlled SVD/Sinkhorn asymmetric backbone, the decoder improves over the RADAR reference when trained on ATSP-100 and evaluated zero-shot on ATSP-100/200/500/1000, reducing the ATSP-1000 gap from 4.13% to 2.73%. On ACVRP, the same score-level modification shows the same qualitative trend under a richer routing state. ATSP ablations and directed-transition diagnostics sharpen the mechanism: the strongest evidence concerns sensitivity to the current directed edge, while closure and static lookahead act as heuristic continuation cues. The results support a mechanism study: a key decoder-side signal in neural asymmetric routing is decision-time exposure of transition-level edge information.
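To make the idea of candidate-specific edge terms concrete, here is a toy scoring function that adds three cost-derived terms to a generic compatibility score. The additive form, the weights, and the use of the cheapest outgoing edge as the "static lookahead" are all assumptions made for illustration; the abstract does not give the decoder's actual parameterization.

```python
import numpy as np

def edge_aware_logits(compat, cost, current, start, visited,
                      w_edge=1.0, w_close=0.5, w_look=0.5):
    """Score each unvisited candidate j using directed-edge information.

    compat:  (n,) context-node compatibility scores from some backbone (assumed given)
    cost:    (n, n) asymmetric cost matrix, cost[i, j] = cost of going i -> j
    visited: (n,) boolean mask of nodes already on the route
    """
    visited = np.asarray(visited, dtype=bool)
    candidates = np.flatnonzero(~visited)
    logits = np.full(cost.shape[0], -np.inf)   # visited nodes stay masked out
    for j in candidates:
        others = [k for k in candidates if k != j]
        # Hypothetical lookahead: cheapest edge leaving j toward a remaining node.
        lookahead = min(cost[j, k] for k in others) if others else 0.0
        logits[j] = (compat[j]
                     - w_edge * cost[current, j]   # current directed edge
                     - w_close * cost[j, start]    # return-to-start closure
                     - w_look * lookahead)
    return logits
```

Because `cost[current, j]` and `cost[j, start]` are read in their true direction, two candidates with identical node embeddings can still receive different scores, which is the asymmetry a node-only logit cannot express.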

2606.02134 2026-06-02 cs.LG cs.AI cs.CV

Rethinking Evaluation Paradigms in IBP-based Certified Training

重新思考基于IBP的认证训练中的评估范式

Konstantin Kaulen, Hadar Shavit, Holger H. Hoos

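AI summary: To address the trade-off between natural and certified accuracy in certified training, proposes Pareto-front-based multi-objective hyperparameter optimization for fair comparison across methods, finds that previously reported configurations were undertuned, and establishes a new state of the art.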

详情
Comments
Accepted to ICML 2026
AI中文摘要

深度神经网络在许多监督学习任务上取得了强大性能,但仍易受对抗性扰动的影响。神经网络验证提供了数学上严格的鲁棒性保证,但计算成本高昂。为缓解这一问题,认证训练技术在训练过程中优化可验证的鲁棒性,通常通过方法特定的超参数控制自然精度与认证精度之间的权衡。由于这些指标本质上是冲突的,报告单一配置的常见做法存在问题:它可能误导关于整体性能的结论,并妨碍对最新技术的无偏评估。我们通过基于自然-认证精度权衡的Pareto前沿比较来评估认证训练方法。为了实现公平、方法无关的比较,我们执行高效的自动化多目标超参数优化,为每种方法识别一组Pareto最优配置。这种方法常常揭示先前报告配置中的显著欠调优,从而获得更优性能并建立新的最优水平。利用这些前沿,我们首次对认证训练方法进行了全面的多目标比较,表明先前的进展并不像假设的那样显著,并揭示了先前未报告的性能互补性。

英文摘要

Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neural network verification provides mathematically rigorous robustness guarantees, yet at substantial computational cost. To mitigate this, certified training techniques optimise for verifiable robustness during training, typically inducing a trade-off between natural and certified accuracy controlled by method-specific hyperparameters. Because these metrics are inherently conflicting, the common practice of reporting a single configuration is problematic: it can mislead conclusions about overall performance and prevents unbiased assessments of the state of the art. We address this by evaluating certified training methods via Pareto front comparisons over the natural-certified accuracy trade-off. To enable fair, method-agnostic comparisons, we perform efficient automated multi-objective hyperparameter optimisation to identify a set of Pareto-optimal configurations for each method. This approach often uncovers substantial undertuning in previously reported configurations, yielding superior performance and establishing a new state of the art. Leveraging these fronts, we present the first comprehensive multi-objective comparison of certified training approaches, showing that prior advancements are less pronounced than assumed and revealing previously unreported performance complementarities.
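Comparing methods by Pareto front rather than by a single configuration comes down to filtering out dominated points. The sketch below does that for (natural accuracy, certified accuracy) pairs; the function name and the example numbers are made up for illustration and are not results from the paper.

```python
def pareto_front(configs):
    """Return the non-dominated configurations, sorted by natural accuracy (descending).

    configs: list of (name, natural_acc, certified_acc); higher is better for both.
    """
    front = []
    for name, nat, cert in configs:
        # A point is dominated if another is at least as good on both metrics
        # and strictly better on at least one.
        dominated = any(
            (n2 >= nat and c2 >= cert) and (n2 > nat or c2 > cert)
            for _, n2, c2 in configs
        )
        if not dominated:
            front.append((name, nat, cert))
    return sorted(front, key=lambda c: c[1], reverse=True)


# Hypothetical hyperparameter settings of one method.
example = [("a", 0.80, 0.50), ("b", 0.75, 0.55), ("c", 0.74, 0.52), ("d", 0.70, 0.60)]
front = pareto_front(example)
```

In this example configuration "c" is dropped because "b" beats it on both metrics, while "a", "b", and "d" each represent a different point on the trade-off and none can be called best in isolation.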

2606.02129 2026-06-02 cs.CV

Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization

均衡扩散:面向均衡图像定制的频率感知文本嵌入

Liyuan Ma, Xueji Fang, Guo-Jun Qi

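AI summary: Proposes Equilibrated Diffusion, which decomposes concept features in frequency space and optimizes the embeddings independently, decoupling style from subject and improving the fidelity and text alignment of customized images.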

详情
AI中文摘要

图像定制从参考概念图像中学习目标主体,并根据文本提示生成条件图像,主要修改风格或背景。主流方法采用微调将多样化的概念属性打包到统一的潜在嵌入中,但纠缠的属性阻碍了从风格和背景中消除无关干扰。为解决此问题,我们提出均衡扩散,一种频率驱动的方法,解缠纠缠的概念特征,实现均衡定制和一致的文本-视觉匹配。与使用共享嵌入和统一调优学习完整概念的传统方法不同,我们的工作利用图像频率分量与语义之间的内在联系:低频表示主体内容,高频对应风格。我们在频率空间中分解概念并独立优化每个嵌入。这种分离优化使去噪器能够捕获与主体身份分离的风格,并更好地泛化到未见过的风格提示。合并多频率嵌入保留了模型原有的空间定制能力。我们进一步部署掩码引导扩散以限制无关背景变化并增强文本对齐。将残差参考注意力(RRA)插入空间注意力中以保持主体结构和身份一致性。实验证明,均衡扩散在主体保真度和文本遵循方面超过主流基线,验证了我们方法的优越性。

英文摘要

Image customization learns target subjects from reference concept images and generates conditioned images per text prompts, mainly modifying styles or backgrounds. Prevailing methods adopt fine-tuning to pack diverse concept attributes into a unified latent embedding, yet entangled attributes hinder elimination of irrelevant disturbances from style and background. To address this issue, we propose Equilibrated Diffusion, a frequency-driven approach that disentangles tangled concept features for balanced customization and consistent text-visual matching. Unlike conventional methods learning full concepts with shared embeddings and unified tuning, our work utilizes the inherent link between image frequency components and semantics: low frequencies represent subject content and high frequencies correspond to styles. We decompose concepts in frequency space and optimize each embedding independently. This separate optimization enables the denoiser to capture style detached from subject identity and generalize better to unseen stylistic prompts. Merging multi-frequency embeddings preserves the model's original spatial customization ability. We further deploy mask-guided diffusion to restrict irrelevant background changes and boost text alignment. Residual Reference Attention (RRA) is inserted into spatial attention to retain subject structure and identity consistency. Experiments prove Equilibrated Diffusion exceeds mainstream baselines on subject fidelity and text adherence, verifying our method's superiority.
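The low-frequency versus high-frequency split that the method builds on can be illustrated with a plain Fourier low-pass filter. This is only one possible decomposition, chosen here as an assumption; the abstract does not specify the transform or cutoff used, and the function name and `cutoff` default are hypothetical.

```python
import numpy as np

def split_frequencies(image, cutoff=0.1):
    """Split a 2D grayscale image into low- and high-frequency parts with low + high == image.

    cutoff is a radial frequency threshold in cycles per pixel (Nyquist is 0.5).
    """
    image = np.asarray(image, dtype=float)
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff   # radially symmetric low-pass mask
    low = np.real(np.fft.ifft2(np.fft.fft2(image) * mask))
    high = image - low
    return low, high
```

Under the abstract's premise, `low` would carry coarse subject content and `high` the fine texture associated with style; a constant image ends up entirely in `low`.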

2606.02127 2026-06-02 eess.AS cs.SD

Localizing broadband noise sources using the Loève spectrum and a 2.5D approach

使用Loève谱和2.5D方法定位宽带噪声源

Christian H. Kasess, Wolfgang Kreuzer, Holger Waubke

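AI summary: For localizing moving broadband stochastic sound sources, proposes an inverse localization method based on a 2.5D setting and the Loève spectrum, derives the relation between the power spectral density of the moving source and the Loève spectrum at static receivers, and localizes sources using multi-taper estimates.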

详情
Comments
31 pages, 13 figures
AI中文摘要

使用麦克风阵列定位移动声源通常基于修改信号以补偿多普勒效应。在时域中,这种补偿是逐样本进行的。在频域中,需要使用短时间片段,其中假设多普勒效应近似恒定,并对每个片段进行离散傅里叶变换。相比之下,作者开发了一种针对均匀移动单频源的逆2.5D定位方法,该方法在谱域中工作,并允许使用更长的窗口。这是通过修改2.5D正向模型以直接计算运动在静态观察者位置的影响来实现的。该方法既不需要修改测量信号,也不需要在所使用的窗口内要求测量准平稳。不幸的是,这种方法不直接适用于宽带随机源,在本文中,我们将研究均匀移动随机源在静态观察者处观测时其统计特性如何变化。使用2.5D设置,推导了移动源功率谱密度与静态接收器处互谱密度推广形式——Loève谱之间的关系。基于速度高达100 m/s的模拟数据,本文提供了一种基于多窗估计Loève谱的方法的概念验证,用于定位移动宽带随机源。目前,该方法要求源信号平稳,并且谱密度在感兴趣频率附近的一定范围内平坦。此外,目前不考虑源之间的相关性。

英文摘要

The localization of moving sound sources using a microphone array is typically based on modifying the signal to compensate for the Doppler effect. In the time domain this compensation is done on a sample-by-sample basis. In the frequency domain, short time segments need to be used in which the Doppler effect is assumed to be approximately constant, and a discrete Fourier transform is applied to each segment. In contrast, the authors developed an inverse 2.5D localization method for uniformly moving single-frequency sources that works in the spectral domain and allows for the use of longer windows. This was achieved by modifying the 2.5D forward model to directly compute the effect of the motion at the static observer position. The method requires neither modification of the measured signal nor quasi-stationarity of the measurements within the window used. Unfortunately, this approach is not directly suitable for broad-band stochastic sources, and the present work investigates how the statistical properties of a uniformly moving stochastic source change when observed at a static observer. Using a 2.5D setting, the relation between the power spectral density of the moving source and the Loève spectrum, which is a generalization of the cross-spectral density at the static receivers, was derived. Based on simulated data with speeds up to 100 m/s, the work presented here provides a proof of concept for a method based on multi-taper estimates of the Loève spectrum to localize moving broad-band stochastic sources. Currently, the method requires a stationary source signal and a spectral density that is flat within a certain range around the frequency of interest. Also, correlations between sources are currently not considered.
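The Loève spectrum generalizes the cross-spectral density to pairs of different frequencies, and a multi-taper estimate averages the product of tapered Fourier coefficients over several orthogonal tapers. The sketch below is a generic estimator, not the paper's pipeline: it uses sine tapers to stay NumPy-only (the taper family used by the authors is not stated in the abstract), and the function names are hypothetical.

```python
import numpy as np

def sine_tapers(n, num_tapers):
    """Orthonormal sine tapers of length n, shape (num_tapers, n)."""
    t = np.arange(1, n + 1)
    return np.array([np.sqrt(2.0 / (n + 1)) * np.sin(np.pi * (j + 1) * t / (n + 1))
                     for j in range(num_tapers)])

def loeve_spectrum(x, y, num_tapers=5):
    """Multi-taper estimate of E[X(f1) * conj(Y(f2))] on the rfft frequency grid.

    Returns a complex (F, F) matrix; its diagonal is the ordinary cross-spectrum estimate.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    tapers = sine_tapers(len(x), num_tapers)
    xf = np.fft.rfft(tapers * x, axis=1)   # (K, F) tapered spectra of x
    yf = np.fft.rfft(tapers * y, axis=1)   # (K, F) tapered spectra of y
    return (xf[:, :, None] * np.conj(yf[:, None, :])).mean(axis=0)
```

For a stationary signal the off-diagonal entries average toward zero, so energy away from the diagonal is what signals the non-stationarity that a moving source induces at a static receiver.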

2606.02120 2026-06-02 cs.CV cs.AI cs.LG

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

理解增强的模型协作用于长尾自我中心错误检测

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Qingming Huang

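AI summary: Proposes an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines coarse-grained video understanding with fine-grained action reasoning, detecting mistakes in egocentric video with a two-branch model and an adaptive fusion gate, and optimizing for the long-tailed distribution.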

详情
AI中文摘要

在本报告中,我们解决了从自我中心视频数据中判断用户是否错误执行动作的问题。为此,我们提出了一种理解增强的模型协作方法(UE-MCM),该方法将高效的粗粒度视频理解与准确的细粒度动作推理相结合。具体来说,UE-MCM包含一个小模型分支和一个大模型分支。大模型分支关注细粒度动作本身是否执行错误,而小模型分支联合输入粗粒度视频和细粒度片段,以识别可能局部正确但与整体工作流不一致的动作。小模型分支基于CLIP4CLIP视频编码器构建,该编码器从通过扩散对比重建增强的CLIP模型初始化,大模型分支使用Qwen3-VL嵌入模型从细粒度动作片段中提取高容量表示。然后,通过轻量级协作门自适应融合小分支预测和大分支预测。为了处理错误实例的长尾分布,我们通过互补目标优化分类器,包括重加权交叉熵、AUC导向学习和标签感知调整。所得系统平衡了速度和准确性,使其能够有效检测自我中心教学视频中的细微、罕见和模糊错误。

英文摘要

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.

2606.02119 2026-06-02 cs.LG cs.AI

How Hard Can It Be? Hardness-Aware Multi-Objective Unlearning

Jiangwei Chen, Xinyuan Niu, Rachael Hwee Ling Sim, Zhengyuan Liu, Nancy F. Chen, Bryan Kian Hsiang Low

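AI summary: Existing unlearning methods cannot guarantee both improved forget quality and preserved retain utility. This work proposes a hardness-aware multi-objective unlearning algorithm (HAMU) based on constrained optimization, which quantifies the similarity between forget data and retain data to guide model updates, guaranteeing an improvement in forget quality while minimizing the loss in retain utility.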

Details
Comments
ICML 2026
English abstract

Machine unlearning aims to remove the influence of specific forget training data due to privacy, copyright or bias concerns while maintaining the model performance on the remaining retain data. Existing unlearning algorithms, such as optimizing a weighted combination of losses, have tried to achieve these objectives of improving forget quality and maintaining retain utility. However, they do not guarantee that these objectives can be improved by a specified extent for all forget and retain data. In this work, we address this limitation with a novel and theoretically-grounded approach from a constrained optimization perspective. Firstly, we identify that the hardness of reconciling both objectives can be quantified by the similarity between the forget data and the retain data. Next, we derive an unlearning algorithm (HAMU) with the overall goal of guaranteeing a specified improvement in forget quality while minimizing the retain utility cost/degradation by updating the model weights based on our hardness measure. Our hardness measure also informs users when retain utility degradation is unavoidable, i.e., both objectives cannot be improved simultaneously, and stopping should be considered. Our algorithm is applicable to non-convex models and is easily parallelizable, making it readily deployable in real-world scenarios. We empirically demonstrate HAMU's superior performance over baselines on both image and text datasets using large models. Our code is available at https://github.com/aoi3142/HAMU.
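
To make the constrained-optimization framing concrete, here is a rough first-order sketch. It is not the HAMU algorithm, which the abstract does not specify; for the actual method, see the authors' repository. The sketch assumes gradient cosine similarity as a stand-in for the paper's hardness measure, and it picks the update closest to plain retain-loss descent that still achieves a required first-order gain `eps` in forget quality. All names are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two nonzero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def constrained_step(g_forget, g_retain, eps):
    """Project the retain-descent step onto {d : g_forget . d >= eps}.

    g_forget: direction that improves forget quality (assumed nonzero).
    g_retain: gradient of the retain loss.
    Returns the step d and its first-order change in retain loss.
    """
    # Closed-form multiplier for projecting -g_retain onto the half-space.
    lam = max(0.0, (eps + dot(g_forget, g_retain)) / dot(g_forget, g_forget))
    d = [-r + lam * f for r, f in zip(g_retain, g_forget)]
    retain_change = dot(g_retain, d)  # > 0 means retain loss increases
    return d, retain_change


# Easy case: orthogonal gradients, both objectives can improve.
easy_step, easy_change = constrained_step([1.0, 0.0], [0.0, 1.0], eps=0.1)

# Hard case: aligned gradients, forgetting necessarily costs retain utility.
hard_step, hard_change = constrained_step([1.0, 0.0], [1.0, 0.0], eps=0.1)
```

In the easy case the first-order change in retain loss is negative, so both objectives improve together. In the hard case, where the cosine similarity is 1, the change is positive. This mirrors the abstract's point that a hardness measure can signal when retain-utility degradation is unavoidable.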