arXivDaily: arXiv daily academic digest, updated Monday through Friday
2603.05167 2026-06-15 cs.CL cs.AI Version update

C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

Avni Mittal, Rauno Arike

Affiliations: SPARAI

AI summary: Proposes the C2-Faith benchmark, which evaluates how well LLM judges assess the process faithfulness of chain-of-thought reasoning along two dimensions, causality and coverage, and finds that models fall significantly short on error localization and coverage scoring.

Abstract

Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that explicitly decomposes faithfulness into two complementary dimensions: causality (whether each step logically follows from prior context) and coverage (whether essential intermediate inferences are present). Using controlled perturbations, we construct examples with known causal error positions by replacing a single step with a logically inconsistent variant, and with controlled coverage deletions at varying rates, enabling direct measurement against reference labels. We evaluate three frontier LLM judges across three tasks: binary causal detection, causal step localization, and coverage scoring. Our results reveal that judge reliability is highly task-dependent, with no single model dominating across settings. While models often detect that an error exists, they struggle to accurately localize it, indicating a substantial gap between detection and attribution. Moreover, all judges systematically overestimate reasoning completeness, assigning high coverage scores even when substantial portions of intermediate reasoning are missing. These findings expose fundamental limitations of LLM judges in process-level evaluation and highlight the need for more reliable and calibrated methods when using LLMs to assess reasoning quality.
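The controlled coverage deletion described in this abstract can be pictured with a small sketch. This is an illustration only, not the benchmark's actual construction: the function names, the choice to always keep the first and last step, and the definition of reference coverage as the fraction of steps kept are all assumptions made here.

```python
import random


def delete_steps(steps, rate, seed=0):
    """Drop a fraction `rate` of the intermediate steps (hypothetical scheme).

    The first and last steps are always kept, which is an assumption of this
    sketch. Returns the perturbed chain and a reference coverage value defined
    here as the fraction of original steps that survive.
    """
    if len(steps) <= 2:
        return list(steps), 1.0
    rng = random.Random(seed)
    middle = list(range(1, len(steps) - 1))
    n_drop = round(rate * len(middle))
    dropped = set(rng.sample(middle, n_drop))
    kept = [s for i, s in enumerate(steps) if i not in dropped]
    return kept, len(kept) / len(steps)


def overestimation(judge_score, reference_coverage):
    """Positive values mean the judge scored coverage above the reference."""
    return judge_score - reference_coverage
```

Because the deletion rate is known, a judge's coverage score can be compared directly against the reference value, which is the kind of measurement the abstract refers to.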

2603.04976 2026-06-15 cs.CV cs.AI Version update

3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding

Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang

Affiliations: University of Science and Technology of China

AI summary: Proposes the 3D-RFT framework, which extends reinforcement learning with verifiable rewards (RLVR) to video-based 3D perception and reasoning and improves performance by directly optimizing evaluation metrics such as 3D IoU and F1 score; the 4B model outperforms an 8B model.

Comments: Accepted at ICML 2026. Project page: https://3d-rft.github.io/

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning (SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models (MLLMs) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
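To make the idea of a reward derived directly from 3D IoU concrete, here is a minimal sketch. It assumes axis-aligned boxes given as `(xmin, ymin, zmin, xmax, ymax, zmax)` and a zero reward for outputs that could not be parsed; both are simplifications chosen here, and the paper's actual box parameterization and reward shaping may differ.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = box_volume(box_a) + box_volume(box_b) - inter
    return inter / union


def iou_reward(pred_box, gt_box):
    """Hypothetical verifiable reward: the IoU itself, or 0 if nothing was parsed."""
    if pred_box is None:
        return 0.0
    return iou_3d(pred_box, gt_box)
```

A reward of this form is verifiable in the sense that it is computed from the prediction and the ground truth alone, with no learned reward model in between.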

2603.03970 2026-06-15 cs.AI Version update

Generative AI for Managerial Decision-Making under Ambiguity and Sycophancy

Sule Ozturk Birim, Fabrizio Marozzo, Yigit Kazancoglu

Affiliations: Manisa Celal Bayar University; University of Calabria; Yasar University

AI summary: Through a human-in-the-loop experiment and a four-dimensional business ambiguity taxonomy, this study evaluates GenAI models on ambiguity detection, ambiguity resolution, and sycophancy; it finds that resolving ambiguity improves decision quality and that models differ in how sycophantically they respond to flawed directives.

Abstract

Generative artificial intelligence (GenAI) is increasingly being integrated into complex business workflows, fundamentally shifting the boundaries of managerial decision-making. However, the reliability of its strategic advice in ambiguous business contexts remains a critical knowledge gap. To address this gap, this study compares multiple GenAI models in their ability to detect ambiguity, examines whether a systematic ambiguity-resolution process improves response quality, and investigates their susceptibility to sycophantic behavior when confronted with flawed managerial directives. Using a novel four-dimensional business ambiguity taxonomy, we conducted a human-in-the-loop experiment across strategic, tactical, and operational scenarios. The resulting decisions were assessed through a human-validated automated evaluation framework based on agreement, actionability, justification quality, and constraint adherence. The results show that our approach not only distinguishes different types of ambiguity, but also reveals how ambiguity resolution systematically changes model behavior. In particular, resolving ambiguities improved decision quality across all managerial levels, with the strongest gains observed in constraint adherence. The analysis further showed that sycophantic behavior is not uniform across models: some models challenged flawed assumptions, whereas others tended to comply with them. This study contributes to the bounded rationality literature by positioning GenAI as a cognitive scaffold that can detect and resolve ambiguities managers might overlook, while demonstrating that its artificial limitations require human oversight to ensure its reliability as a strategic partner.

2603.03733 2026-06-15 cs.RO Version update

X-Loco: Towards Generalist Humanoid Locomotion Control via Synergetic Policy Distillation

Dewei Wang, Xinmiao Wang, Chenyun Zhang, Jiyuan Shi, Yingnan Zhao, Chenjia Bai, Xuelong Li

Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: Proposes the X-Loco framework, which trains a vision-based generalist humanoid locomotion policy through synergetic policy distillation and case-adaptive specialist selection, integrating upright locomotion, whole-body coordination, and fall recovery using only velocity commands and no reference motions.

Comments: Accepted by RSS 2026. Project page: https://x-loco-humanoid.github.io/

Abstract

While recent advances have demonstrated strong performance in individual humanoid skills such as upright locomotion, fall recovery and whole-body coordination, learning a single policy that masters all these skills remains challenging due to the diverse dynamics and conflicting control objectives involved. To address this, we introduce X-Loco, a framework for training a vision-based generalist humanoid locomotion policy. X-Loco trains multiple oracle specialist policies and adopts a synergetic policy distillation with a case-adaptive specialist selection mechanism, which dynamically leverages multiple specialist policies to guide a vision-based student policy. This design enables the student to acquire a broad spectrum of locomotion skills, ranging from fall recovery to terrain traversal and whole-body coordination skills. To the best of our knowledge, X-Loco is the first framework to demonstrate vision-based humanoid locomotion that jointly integrates upright locomotion, whole-body coordination and fall recovery, while operating solely under velocity commands without relying on reference motions. Experimental results show that X-Loco achieves superior performance, demonstrated by tasks such as fall recovery and terrain traversal. Ablation studies further highlight that our framework effectively leverages specialist expertise and enhances learning efficiency.

2603.02230 2026-06-15 cs.LG cs.AI Version update

Generalized Discrete Diffusion with Self-Correction

Linxuan Wang, Ziyi Wang, Yikun Bai, Wei Deng, Guang Lin, Qifan Song

Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes the Self-Correcting Discrete Diffusion (SCDD) model, which uses explicit state transitions and discrete-time learning to simplify the training noise schedule and eliminate a redundant remasking step, enabling efficient parallel decoding at GPT-2 scale while preserving generation quality.

Comments: 40 pages, 3 figures, 6 tables

Abstract

Self-correction is an effective technique for maintaining parallel sampling in discrete diffusion models with minimal performance degradation. Prior work has explored self-correction at inference time or during post-training; however, such approaches often suffer from limited generalization and may impair reasoning performance. GIDD pioneers pretraining-based self-correction via a multi-step BERT-style uniform-absorbing objective. However, GIDD relies on a continuous interpolation-based pipeline with opaque interactions between uniform transitions and absorbing masks, which complicates hyperparameter tuning and hinders practical performance. In this work, we propose a Self-Correcting Discrete Diffusion (SCDD) model to reformulate pretrained self-correction with explicit state transitions and learn directly in discrete time. Our framework also simplifies the training noise schedule, eliminates a redundant remasking step, and relies exclusively on uniform transitions to learn self-correction. Experiments at the GPT-2 scale demonstrate that our method enables more efficient parallel decoding while preserving generation quality.
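For readers unfamiliar with uniform transitions in discrete diffusion, the sketch below shows one generic forward corruption step: each token is, with probability `beta`, resampled uniformly from the vocabulary. This is the textbook uniform kernel, offered as background only; `uniform_corrupt`, `beta`, and `vocab_size` are names chosen here, and the sketch does not reproduce SCDD's specific transitions or noise schedule.

```python
import random


def uniform_corrupt(tokens, vocab_size, beta, rng):
    """One generic uniform-transition step on a sequence of token ids.

    With probability `beta` a token is replaced by a draw from the uniform
    distribution over the vocabulary (which may return the same id);
    otherwise it is left unchanged.
    """
    return [
        rng.randrange(vocab_size) if rng.random() < beta else tok
        for tok in tokens
    ]


rng = random.Random(0)
clean = [5, 17, 42, 8]
noisy = uniform_corrupt(clean, vocab_size=100, beta=0.3, rng=rng)
```

Unlike an absorbing mask, a uniformly corrupted token looks like an ordinary token, so a model trained to undo this noise has to decide which visible tokens are wrong, which is the intuition behind learning self-correction from uniform transitions.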

2511.07075 2026-06-15 cs.SD Version update

Metric Analysis for Spatial Semantic Segmentation of Sound Scenes

Mayank Mishra, Paul Magron, Romain Serizel

Affiliations: University of Cambridge; Inria

AI summary: For evaluating spatial semantic segmentation of sound scenes (S5), proposes a new metric, CASA-SDR, which separates classification errors from separation errors through permutation-invariant source matching and gives a more interpretable, separation-centric assessment.

Comments: 5 pages; content+bibliography

Abstract

Spatial semantic segmentation of sound scenes (S5) consists of jointly performing audio source separation and sound event classification from a multichannel audio mixture. Evaluating S5 systems with separation and classification metrics individually makes system comparison difficult, whereas existing joint metrics, such as the class-aware signal-to-distortion ratio (CA-SDR), can conflate separation and labeling errors. In particular, CA-SDR relies on predicted class labels for source matching, which may obscure label swaps or misclassifications when the underlying source estimates remain perceptually correct. In this work, we introduce the class and source-aware signal-to-distortion ratio (CASA-SDR), a new metric that performs permutation-invariant source matching before computing classification errors, thereby shifting from a classification-focused approach to a separation-focused approach. We first analyze CA-SDR in controlled scenarios with oracle separation and synthetic classification errors, as well as under controlled cross-contamination between sources, and compare its behavior to that of the classical SDR and CASA-SDR. We also study the impact of classification errors on the metrics by introducing error-based and source-based aggregation strategies. Finally, we compare CA-SDR and CASA-SDR on systems submitted to Task 4 of the DCASE 2025 challenge, highlighting the cases where CA-SDR over-penalizes label swaps or poorly separated sources, while CASA-SDR provides a more interpretable separation-centric assessment of S5 performance.
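The ordering the abstract describes, matching sources by signal quality first and checking class labels only afterwards, can be sketched as follows. This is a simplified illustration rather than the CASA-SDR definition: it assumes equal numbers of reference and estimated sources, uses a basic SDR formula, and searches all permutations by brute force. The function names are hypothetical.

```python
import itertools
import math


def sdr_db(ref, est, eps=1e-12):
    """Basic signal-to-distortion ratio in dB between two equal-length signals."""
    num = sum(r * r for r in ref)
    den = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10 * math.log10((num + eps) / (den + eps))


def match_sources(refs, ests):
    """Permutation-invariant matching: pick the assignment with the highest total SDR."""
    best_perm, best_total = None, -math.inf
    for perm in itertools.permutations(range(len(ests))):
        total = sum(sdr_db(refs[i], ests[j]) for i, j in enumerate(perm))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm


def evaluate(refs, ref_labels, ests, est_labels):
    """Return (SDR, label_correct) per reference source, matching before labeling."""
    perm = match_sources(refs, ests)
    return [
        (sdr_db(refs[i], ests[j]), ref_labels[i] == est_labels[j])
        for i, j in enumerate(perm)
    ]
```

With this ordering, a label swap shows up as a classification error attached to a well-separated source instead of dragging down the separation score, which is the distinction the abstract draws against label-driven matching.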

2506.14202 2026-06-15 cs.LG cs.AI stat.ML Version update

DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

Makoto Shing, Masanori Koyama, Takuya Akiba

Affiliations: Sakana AI; The University of Tokyo

AI summary: Proposes the DiffusionBlocks framework, which uses the correspondence between residual connections and dynamical systems to convert a network into a denoising process and trains blocks independently with a score matching objective, matching end-to-end training across several transformer architectures while reducing memory requirements.

Comments: To appear at the 14th International Conference on Learning Representations (ICLR 2026). v4: Fixed typos in experimental details (Appendix E.4)

Abstract

End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose $\textit{DiffusionBlocks}$, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures. Code is available at https://github.com/SakanaAI/DiffusionBlocks .

2602.14169 2026-06-15 cs.LG cs.AI cs.CL Version update

Deep Dense Exploration for LLM Reinforcement Learning via Pivot-Driven Resampling

Yiran Guo, Zhongjian Qiao, Yingqi Xie, Jie Liu, Dan Ye, Ruiqing Zhang, Shuang Qiu, Lijie Xu

Affiliations: Institute of Software, Chinese Academy of Sciences; City University of Hong Kong; Baidu

AI summary: To address inefficient exploration in reinforcement learning for large language models, proposes Deep Dense Exploration (DDE), which identifies recoverable pivot states in failed trajectories and densely resamples locally around them, combined with a dual-stream optimization objective; it outperforms existing methods on mathematical reasoning benchmarks.

Abstract

Effective exploration is a key challenge in reinforcement learning for large language models: discovering high-quality trajectories within a limited sampling budget from the vast natural language sequence space. Existing methods face notable limitations: GRPO samples exclusively from the root, saturating high-probability trajectories while leaving deep, error-prone states under-explored. Tree-based methods blindly disperse budgets across trivial or unrecoverable states, causing sampling dilution that fails to uncover rare correct suffixes and destabilizes local baselines. To address this, we propose Deep Dense Exploration (DDE), a strategy that focuses exploration on pivots: deep, recoverable states within unsuccessful trajectories. We instantiate DDE with DEEP-GRPO, which introduces three key innovations: (1) a lightweight data-driven utility function that automatically balances recoverability and depth bias to identify pivot states; (2) local dense resampling at each pivot to increase the probability of discovering correct subsequent trajectories; and (3) a dual-stream optimization objective that decouples global policy learning from local corrective updates. Experiments on mathematical reasoning benchmarks demonstrate that our method consistently outperforms GRPO, tree-based methods, and other strong baselines. Code is available at https://github.com/AgentCombo/DEEP-GRPO
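The role of a group baseline in this setting can be illustrated with the standard group-relative advantage used in GRPO, where each reward is normalized by the mean and standard deviation of its sampling group. The sketch below shows only that normalization; the paper's utility function for choosing pivots and its dual-stream objective are not reproduced, and the example reward groups are made up for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group sampled from the root: every rollout fails,
# so every advantage is zero and the group carries no learning signal.
root_group = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Hypothetical group resampled from a pivot state: one continuation
# succeeds, so the group now separates good from bad suffixes.
pivot_group = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

A group with identical rewards yields all-zero advantages, which is one way to see why concentrating samples where success is rare but reachable can matter for the quality of the local baseline.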

2602.13848 2026-06-15 cs.LG stat.ML Version update

Testing For Distribution Shifts with Conditional Conformal Test Martingales

Shalev Shaer, Yarin Bar, Drew Prinster, Yaniv Romano

Affiliations: Technion - Israel Institute of Technology

AI summary: Proposes a sequential test that avoids test-time contamination by using a fixed reference set and achieves anytime-valid type-I error control and asymptotic power one through a robust martingale construction, detecting shifts faster than standard conformal test martingales.

Abstract

We propose a sequential test for detecting arbitrary distribution shifts that allows conformal test martingales (CTMs) to work under a fixed, reference-conditional setting. Existing CTM detectors construct test martingales by continually growing a reference set with each incoming sample, using it to assess how atypical the new sample is relative to past observations. While this design yields anytime-valid type-I error control, it suffers from test-time contamination: after a change, post-shift observations enter the reference set and dilute the evidence for distribution shift, increasing detection delay and reducing power. In contrast, our method avoids contamination by design by comparing each new sample to a fixed null reference dataset. Our main technical contribution is a robust martingale construction that remains valid conditional on the null reference data, achieved by explicitly accounting for the estimation error in the reference distribution induced by the finite reference set. This yields anytime-valid type-I error control together with guarantees of asymptotic power one and bounded expected detection delay. Empirically, our method detects shifts faster than standard CTMs, providing a powerful and reliable distribution-shift detector.
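The basic mechanics of testing against a fixed reference set can be sketched as follows: compute a conformal p-value for each new sample against the reference scores, convert it to a bet, and stop when the accumulated wealth crosses `1 / alpha`. This is a plain illustration with hypothetical names and a simple power betting function. It deliberately omits the paper's robust correction for the estimation error of a finite reference set, so it should not be read as carrying the conditional type-I guarantee described in the abstract.

```python
import random


def conformal_p_value(score, reference_scores, rng):
    """Randomized conformal p-value of `score` against a fixed reference set.

    Larger scores are treated as more atypical.
    """
    greater = sum(1 for s in reference_scores if s > score)
    ties = sum(1 for s in reference_scores if s == score)
    return (greater + rng.random() * (ties + 1)) / (len(reference_scores) + 1)


def power_bet(p, epsilon=0.5):
    """Power betting function; it integrates to 1 over p in [0, 1]."""
    return epsilon * max(p, 1e-12) ** (epsilon - 1)


def run_detector(stream_scores, reference_scores, alpha=0.01, seed=0):
    """Return the first time the wealth process reaches 1 / alpha, else None."""
    rng = random.Random(seed)
    wealth = 1.0
    for t, score in enumerate(stream_scores, start=1):
        p = conformal_p_value(score, reference_scores, rng)
        wealth *= power_bet(p)
        if wealth >= 1.0 / alpha:
            return t
    return None
```

Because the reference set never grows, post-shift samples cannot leak into it, which is the contamination-avoidance property the abstract emphasizes; the price is that the p-values are only approximately uniform given the reference data, the gap the paper's construction is said to account for.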

2509.24102 2026-06-15 cs.CL Version update

Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links

Guangliang Liu, Xi Chen, Bocheng Chen, Han Zi, Xitong Zhang, Kristen Johnson

Affiliations: Indiana University Indianapolis; Nanyang Technological University; University of Mississippi; Northeastern University; Qualcomm; Michigan State University

AI summary: To address the limited generalization of large language models in moral reasoning, proposes a pragmatic inference approach based on metapragmatic links and Moral Foundations Theory that lets models acquire the metapragmatic links between moral reasoning objectives and social variables; its adaptability and generalizability are validated on three tasks.

Abstract

While moral reasoning has emerged as a promising research direction for large language models (LLMs), achieving robust generalization remains a critical challenge. This challenge arises from the gap between what is said and what is morally implied. In this paper, we build on metapragmatic links and Moral Foundations Theory to close this gap. Specifically, we develop a pragmatic inference approach that enables LLMs, given a moral situation, to acquire the metapragmatic links between moral reasoning objectives and the social variables that influence them. We adapt this approach to three different moral reasoning tasks to demonstrate its adaptability and generalizability. Experimental results show that our approach significantly enhances LLMs' generalization in moral reasoning, paving the way for future research to leverage pragmatic inference across a wide range of moral reasoning tasks.

2602.12379 2026-06-15 cs.LG Version update

Deep Doubly Debiased Longitudinal Effect Estimation with ICE G-Computation

Wenxin Chen, Weishen Pan, Kyra Gan, Fei Wang

Affiliations: Cornell University; Weill Cornell Medicine

AI summary: Proposes the D3-Net framework, which addresses error propagation in ICE G-computation through sequential doubly robust pseudo-outcomes and longitudinal targeted minimum loss-based estimation, yielding robust estimates of longitudinal treatment effects.

Abstract

Estimating longitudinal treatment effects is essential for sequential decision-making but is challenging due to treatment-confounder feedback. While Iterative Conditional Expectation (ICE) G-computation offers a principled approach, its recursive structure suffers from error propagation, corrupting the learned outcome regression models. We propose D3-Net, a framework that mitigates error propagation in ICE training and then applies a robust final correction. First, to interrupt error propagation during learning, we train the ICE sequence using Sequential Doubly Robust (SDR) pseudo-outcomes, which provide bias-corrected targets for each regression. Second, we employ a multi-task transformer with a covariate simulator head for auxiliary supervision, regularizing representation learning, and a target network to stabilize training dynamics. For the final estimate, we discard the SDR correction and instead use the uncorrected nuisance models to perform Longitudinal Targeted Minimum Loss-Based Estimation (LTMLE) on the original outcomes. This second-stage, targeted debiasing ensures robustness and optimal finite-sample properties. Comprehensive experiments demonstrate that our model, D3-Net, robustly reduces bias and variance across different horizons, counterfactuals, and time-varying confoundings, compared to existing state-of-the-art ICE-based estimators.

2602.09258 2026-06-15 cs.LG Version update

Generalizing GNNs with Tokenized Mixture of Experts

Xiaoguang Guo, Zehong Wang, Jiazheng Li, Shawn Spitzel, Qi Yang, Kaize Ding, Jundong Li, Chuxu Zhang

Affiliations: University of Connecticut Storrs; University of Notre Dame; University of Virginia; Northwestern University Evanston

AI summary: To address the tradeoff between stability and generalization when graph neural networks are deployed, proposes the STEM-GNN framework, which achieves a three-way balance through a tokenized mixture-of-experts encoder, a vector-quantized interface, and a Lipschitz-regularized head, improving robustness under a range of distribution shifts and perturbations.

Comments: Accepted to KDD 2026

Abstract

Deployed graph neural networks (GNNs) are frozen at deployment yet must fit clean data, generalize under distribution shifts, and remain stable to perturbations. We show that static inference induces a fundamental tradeoff: improving stability requires reducing reliance on shift-sensitive features, leaving an irreducible worst-case generalization floor. Instance-conditional routing can break this ceiling, but is fragile because shifts can mislead routing and perturbations can make routing fluctuate. We capture these effects via two decompositions separating coverage vs selection, and base sensitivity vs fluctuation amplification. Based on these insights, we propose STEM-GNN, a pretrain-then-finetune framework with a mixture-of-experts encoder for diverse computation paths, a vector-quantized token interface to stabilize encoder-to-head signals, and a Lipschitz-regularized head to bound output amplification. Across nine node, link, and graph benchmarks, STEM-GNN achieves a stronger three-way balance, improving robustness to degree/homophily shifts and to feature/edge corruptions while remaining competitive on clean graphs.

2602.00845 2026-06-15 cs.AI Version update

Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward

Senkang Hu, Yong Dai, Yuzhi Zhao, Yihang Tao, Yu Guo, Zhengru Fang, Sam Tak Wu Kwong, Yuguang Fang

Affiliations: Hong Kong JC STEM Lab of Smart City; City University of Hong Kong; Lingnan University; Fudan University; Huazhong University of Science and Technology

AI summary: Proposes the InfoReasoner framework, which optimizes the retrieval process with a synthetic semantic information gain reward and trains the policy with GRPO, improving average accuracy by up to 5.4% across seven question-answering benchmarks.

Comments: Accepted by ICML'26

Abstract

Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, yet optimizing the retrieval process remains challenging due to the lack of dense, principled reward signals. In this paper, we introduce InfoReasoner, a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model's belief states, establishing guarantees, including non-negativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model's output distributions using semantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress, enabling efficient training via Group Relative Policy Optimization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to 5.4% average accuracy improvement. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval. The code is available at https://github.com/dl-m9/InfoReasoner
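The notion of information gain as uncertainty reduction over semantic clusters can be sketched with a simple entropy difference. The sketch assumes that sampled answers have already been grouped into semantic clusters (for example by an entailment model, not implemented here) and are represented by cluster ids. It is a naive plug-in estimate with hypothetical function names: it can come out negative and does not carry the guarantees the abstract lists for the paper's estimator.

```python
import math
from collections import Counter


def cluster_entropy(cluster_ids):
    """Shannon entropy (in nats) of the empirical distribution over cluster ids."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def information_gain(clusters_before, clusters_after):
    """Entropy reduction between answers sampled before and after a retrieval step."""
    return cluster_entropy(clusters_before) - cluster_entropy(clusters_after)


# Hypothetical example: four sampled answers disagree before retrieval
# and all fall into the same semantic cluster afterwards.
gain = information_gain(["a", "b", "c", "d"], ["a", "a", "a", "a"])
```

A quantity of this kind needs no retrieval annotations, since it is computed from the model's own sampled outputs, which is what makes it usable as an intrinsic reward.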

2601.22436 2026-06-15 cs.CL Version update

Large Language Model Agents Are Not Always Faithful Self-Evolvers

Weixiang Zhao, Yingshuo Wang, Yichen Zhang, Yang Deng, Yanyan Zhao, Wanxiang Che, Bing Qin, Ting Liu

Affiliations: University of Science and Technology of China

AI summary: Presents the first systematic investigation of experience faithfulness in self-evolving LLM agents; causal interventions show that agents rely on raw experience but often disregard or misinterpret condensed experience, and the study analyzes the causes.

Comments: ICML 2026

Abstract

Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the first systematic investigation of experience faithfulness, the causal dependence of an agent's decisions on the experience it is given, in self-evolving LLM agents. Using controlled causal interventions on both raw and condensed forms of experience, we comprehensively evaluate four representative frameworks across 13 LLM backbones and 9 environments. Our analysis uncovers a striking asymmetry: while agents consistently depend on raw experience, they often disregard or misinterpret condensed experience, even when it is the only experience provided. This gap persists across single- and multi-agent configurations and across backbone scales. We trace its underlying causes to three factors: the semantic limitations of condensed content, internal processing biases that suppress experience, and task regimes where pretrained priors already suffice. These findings challenge prevailing assumptions about self-evolving methods and underscore the need for more faithful and reliable approaches to experience integration.

2502.10886 2026-06-15 cs.CL Version update

MET-Bench: Multimodal Entity Tracking for Evaluating the Limitations of Vision-Language and Reasoning Models

Vanya Cohen, Raymond Mooney

Affiliations: University of Texas at Austin

AI summary: Proposes MET-Bench, a multimodal entity tracking benchmark, and finds that vision-language models track entities significantly worse from images than from text, mainly because of deficits in visual reasoning; reinforcement learning improves in-modality performance but transfers poorly across modalities.

Comments: ICML 2026

Abstract

Entity state tracking is a necessary component of world modeling that requires maintaining coherent representations of entities over time. Previous work has benchmarked entity tracking performance in purely text-based tasks. We introduce MET-Bench, a multimodal entity tracking benchmark designed to evaluate the ability of vision-language models to track entity states across modalities. Using three domains, we assess how effectively current models integrate textual and image-based state updates. Our findings reveal a significant performance gap between text-based and image-based entity tracking. We empirically show this discrepancy primarily stems from deficits in visual reasoning rather than perception. We further show that explicit text-based reasoning strategies improve performance, yet limitations remain, especially in long-horizon multimodal tasks. We apply reinforcement learning to improve entity tracking in open-source VLMs. This yields substantial in-modality gains, but does not transfer robustly across input modalities. Our results highlight the need for improved multimodal representations and reasoning techniques to bridge the gap between textual and visual entity tracking.

2602.05670 2026-06-15 cs.SD cs.AI eess.AS 版本更新

HyperPotter: Spell the Charm of High-Order Interactions in Audio Deepfake Detection

HyperPotter: 在音频深度伪造检测中施展高阶交互的魔力

Qing Wen, Haohao Li, Zhongjie Ba, Peng Cheng, Miao He, Li Lu, Kui Ren

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出基于超图的HyperPotter框架,通过聚类超边和类感知原型初始化捕获高阶交互,在13个测试集上平均EER降低12.68%。

Comments 20 pages, 8 figures, accepted to ICML 2026

详情
AI中文摘要

AIGC技术的进步使得合成高度逼真的音频深度伪造成为可能,能够欺骗人类听觉感知。尽管已经开发了许多音频深度伪造检测(ADD)方法,但大多数依赖于局部时间/频谱特征或成对关系,忽略了高阶交互(HOIs)。HOIs捕获从多个特征组件中涌现出的判别性模式,超越了它们各自的贡献。我们提出了HyperPotter,一个基于超图的框架,旨在通过基于聚类的超边和类感知原型初始化来捕获与协同模式相关的高阶关系。在13个测试集上的大量实验表明,HyperPotter在11个测试集上优于基线,在所有测试集上平均相对EER降低了12.68%,在改进的测试集上降低了22.15%。这些结果展示了强大的跨场景泛化能力,同时也揭示了在严重编解码器或信道失真下的鲁棒性限制。

英文摘要

Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework designed to capture high-order relations associated with synergistic patterns through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments on 13 test sets show that HyperPotter improves over the baseline on 11 sets, yielding an average relative EER reduction of 12.68% across all test sets and 22.15% on the improved sets. These results demonstrate strong cross-scenario generalization, while also revealing robustness limits under severe codec or channel distortion.

2602.04879 2026-06-15 cs.LG cs.AI cs.CL 版本更新

Rethinking the Trust Region in LLM Reinforcement Learning

重新思考LLM强化学习中的信任区域

Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, Wee Sun Lee

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Toronto(多伦多大学)

AI总结 针对PPO在LLM微调中因词表大导致的训练不稳定问题,提出基于策略散度直接约束的DPPO算法,并引入高效近似方法。

详情
AI中文摘要

强化学习已成为微调大型语言模型(LLM)的基石,其中近端策略优化(PPO)是事实上的标准算法。尽管其普遍存在,我们认为PPO中的核心比率裁剪机制在结构上不适合LLM固有的大词表。PPO基于采样令牌的概率比率约束策略更新,该比率是对真实策略散度的有噪单样本蒙特卡洛估计。这导致次优的学习动态:低概率令牌的更新被过度惩罚,而高概率令牌中潜在的灾难性变化却约束不足,导致训练效率低下和不稳定。为解决此问题,我们提出散度近端策略优化(DPPO),用基于策略散度(如总变差或KL)直接估计的更原则性约束替代启发式裁剪。为避免巨大内存占用,我们引入了高效的二元和Top-K近似,以可忽略的开销捕获本质散度。大量实证评估表明,DPPO相比现有方法实现了更优的训练稳定性和效率,为基于RL的LLM微调提供了更稳健的基础。我们的代码可在https://github.com/sail-sg/Stable-RL获取。

英文摘要

Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO) serving as the de facto standard algorithm. Despite its ubiquity, we argue that the core ratio clipping mechanism in PPO is structurally ill-suited for the large vocabularies inherent to LLMs. PPO constrains policy updates based on the probability ratio of sampled tokens, which serves as a noisy single-sample Monte Carlo estimate of the true policy divergence. This creates a sub-optimal learning dynamic: updates to low-probability tokens are aggressively over-penalized, while potentially catastrophic shifts in high-probability tokens are under-constrained, leading to training inefficiency and instability. To address this, we propose Divergence Proximal Policy Optimization (DPPO), which substitutes heuristic clipping with a more principled constraint based on a direct estimate of policy divergence (e.g., Total Variation or KL). To avoid a huge memory footprint, we introduce efficient Binary and Top-K approximations that capture the essential divergence with negligible overhead. Extensive empirical evaluations demonstrate that DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based LLM fine-tuning. Our code is available at https://github.com/sail-sg/Stable-RL.
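The contrast the abstract draws between the sampled-token ratio and a direct divergence estimate can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names, the toy three-token vocabulary, and the specific reading of the "Binary" approximation (collapsing the vocabulary to {sampled token, everything else}) and the "Top-K" approximation (keeping the K most likely tokens plus one remainder bucket) are assumptions made for illustration only.

```python
import numpy as np

def ppo_ratio(p_old, p_new, token):
    """Probability ratio of one sampled token, the quantity PPO clips."""
    return p_new[token] / p_old[token]

def total_variation(p_old, p_new):
    """Exact total variation distance over the full vocabulary."""
    return 0.5 * np.abs(p_new - p_old).sum()

def binary_tv(p_old, p_new, token):
    """Assumed 'binary' approximation: TV between the two-outcome
    distributions {token, rest}, which reduces to |p_new - p_old| on the token."""
    return abs(p_new[token] - p_old[token])

def topk_tv(p_old, p_new, k):
    """Assumed 'top-K' approximation: TV over the k most likely old-policy
    tokens plus a single bucket holding the remaining probability mass."""
    idx = np.argsort(p_old)[-k:]
    rest_old = 1.0 - p_old[idx].sum()
    rest_new = 1.0 - p_new[idx].sum()
    return 0.5 * (np.abs(p_new[idx] - p_old[idx]).sum() + abs(rest_new - rest_old))

# Toy policies over a 3-token vocabulary (hypothetical numbers).
p_old = np.array([0.90, 0.09, 0.01])
p_new = np.array([0.80, 0.18, 0.02])

rare, frequent = 2, 0
print(ppo_ratio(p_old, p_new, rare), binary_tv(p_old, p_new, rare))
print(ppo_ratio(p_old, p_new, frequent), binary_tv(p_old, p_new, frequent))
print(total_variation(p_old, p_new), topk_tv(p_old, p_new, k=2))
```

In this toy case the rare token has a ratio of 2.0, far outside a typical clipping range, although it moves only 0.01 of probability mass, while the frequent token has a ratio of about 0.89 yet accounts for a shift of 0.1. Both coarse-grained estimates are lower bounds on the full total variation, since merging outcomes cannot increase it.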

2602.03177 2026-06-15 cs.RO 版本更新

Estimation of Ground Reaction Forces from Kinematic Data during Locomotion

基于运动学数据估计行走过程中的地面反作用力

Gautami Golani, Dong Anh Khoa To, Ananda Sidarta, Arun-Kumar Kaliya-Perumal, Oliver Roberts, Lek Syn Lim, Jim Patton, Domenico Campolo

发表机构 * Nanyang Technological University(南洋理工大学) Agency for Science, Technology and Research(科技研究局) National Healthcare Group(国家健康集团)

AI总结 提出一种仅使用标记点运动捕捉数据估计地面反作用力的无测力台方法,通过16个身体段运动学计算质心并分解力分量,实验验证了可行性。

详情
AI中文摘要

地面反作用力(GRFs)提供了对人体步态力学的基本洞察,并广泛用于评估关节负荷、肢体对称性、平衡控制和运动功能。尽管具有临床相关性,但由于测力台系统的实际限制,GRF在临床工作流程中的应用仍不充分。在这项工作中,我们提出了一种无测力台的方法,仅使用基于标记的运动捕捉数据来估计GRF。这种仅基于运动学的方法来估计和分解GRF,使其非常适合广泛的临床部署。通过使用16个身体节段的运动学,我们估计质心(CoM)并计算GRF,随后通过基于最小化的方法将其分解为各个分量。通过这一框架,我们可以识别步态支撑期,并在没有专用测力台系统的情况下提供临床上有意义的动力学测量。实验结果表明,仅基于运动学数据估计CoM和GRF是可行的,支持无测力台的步态分析。

英文摘要

Ground reaction forces (GRFs) provide fundamental insight into human gait mechanics and are widely used to assess joint loading, limb symmetry, balance control, and motor function. Despite their clinical relevance, GRFs remain underutilised in clinical workflows due to the practical limitations of force plate systems. In this work, we present a force-plate-free approach for estimating GRFs using only marker-based motion capture data. Because it relies on kinematics alone to estimate and decompose GRFs, the method is well suited for widespread clinical deployment. By using kinematics from sixteen body segments, we estimate the centre of mass (CoM) and compute GRFs, which are subsequently decomposed into individual components through a minimization-based approach. Through this framework, we can identify gait stance phases and provide access to clinically meaningful kinetic measures without a dedicated force plate system. Experimental results demonstrate the viability of CoM and GRF estimation based solely on kinematic data, supporting force-plate-free gait analysis.
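The first two stages described above (segment kinematics to CoM, then CoM to total GRF) follow from Newton's second law and can be sketched as below. This is a minimal sketch, not the paper's pipeline: the array layout, the z-up axis convention, the finite-difference scheme, and the function names are assumptions, and the minimization-based decomposition into individual components is not reproduced.

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2

def centre_of_mass(segment_positions, segment_masses):
    """Mass-weighted mean of segment centres.

    segment_positions: array of shape (T, S, 3), one 3-D position per
    time step and segment (assumed layout). segment_masses: shape (S,).
    """
    weights = segment_masses / segment_masses.sum()
    return np.einsum("tsk,s->tk", segment_positions, weights)

def total_grf(com, total_mass, dt):
    """Total ground reaction force from CoM motion.

    Newton's second law with gravity along -z gives
    F_grf = m * (a_com + g * z_hat). Acceleration is estimated by
    differentiating twice with finite differences, which amplifies
    marker noise, so real data would normally be low-pass filtered first.
    """
    velocity = np.gradient(com, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    return total_mass * (acceleration + np.array([0.0, 0.0, G]))

# Sanity check on a static posture: 16 segments, 70 kg in total.
positions = np.ones((5, 16, 3))
masses = np.full(16, 70.0 / 16)
com = centre_of_mass(positions, masses)
grf = total_grf(com, masses.sum(), dt=0.01)
print(grf[0])
```

For a motionless body the estimate reduces to body weight acting vertically, which is a convenient check before applying the same functions to walking trials.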

2602.03120 2026-06-15 cs.LG cs.AI 版本更新

Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLMs at Low-precision Cost

量化进化策略:以低精度代价实现量化大语言模型的高精度微调

Yinggan Xu, Kajetan Schweighofer, Risto Miikkulainen, Xin Qiu

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Cognizant AI Lab(Cognizant AI实验室) UT Austin(得克萨斯大学奥斯汀分校)

AI总结 提出量化进化策略(QES),通过集成累积误差反馈和无状态种子重放,直接在量化空间进行全参数微调,无需反向传播,显著优于现有零阶微调方法。

Comments Added more tasks and baselines

详情
AI中文摘要

后训练量化(PTQ)对于在内存受限设备上部署大语言模型(LLM)至关重要,但它使模型变得静态且难以微调。标准的微调范式,包括强化学习(RL),从根本上依赖于反向传播和连续权重来计算梯度。因此,它们无法用于参数空间离散且不可微的量化模型。虽然进化策略(ES)提供了一种无需反向传播的替代方案,但由于梯度估计消失或不准确,量化参数的优化仍可能失败。本文介绍了量化进化策略(QES),一种直接在量化空间执行全参数微调的优化范式。QES基于两项创新:(1)它集成了累积误差反馈以保留高精度权重更新信号,(2)它利用无状态种子重放将内存使用降低到低精度推理水平。QES在各种任务上显著优于最先进的零阶微调方法,使得量化模型的直接微调成为可能。因此,它开辟了完全在量化空间中扩展LLM的可能性。源代码可在https://github.com/dibbla/Quantized-Evolution-Strategies获取。

英文摘要

Post-Training Quantization (PTQ) is essential for deploying Large Language Models (LLMs) on memory-constrained devices, yet it renders models static and difficult to fine-tune. Standard fine-tuning paradigms, including Reinforcement Learning (RL), fundamentally rely on backpropagation and continuous weights to compute gradients. Thus they cannot be used on quantized models, where the parameter space is discrete and non-differentiable. While Evolution Strategies (ES) offer a backpropagation-free alternative, optimization of the quantized parameters can still fail due to vanishing or inaccurate gradient estimation. This paper introduces Quantized Evolution Strategies (QES), an optimization paradigm that performs full-parameter fine-tuning directly in the quantized space. QES is based on two innovations: (1) it integrates accumulated error feedback to preserve high-precision weight updating signals, and (2) it utilizes a stateless seed replay to reduce memory usage to low-precision inference levels. QES significantly outperforms the state-of-the-art zeroth-order fine-tuning methods on a variety of tasks, making direct fine-tuning for quantized models possible. It therefore opens up the possibility for scaling up LLMs entirely in the quantized space. The source code is available at https://github.com/dibbla/Quantized-Evolution-Strategies .
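The two ingredients named in the abstract, accumulated error feedback and seed replay, can be illustrated on a toy integer parameter vector. The sketch below is a generic evolution-strategies step written under stated assumptions, not the QES algorithm itself: the perturbation distribution, the learning rate, the rounding rule, and the name `qes_step` are all hypothetical.

```python
import numpy as np

def qes_step(q, residual, fitness, seeds, lr=0.5):
    """One illustrative ES update on integer (quantized) weights.

    q:        integer array of quantized weights
    residual: float array holding the rounding error carried over
              from earlier steps (the error-feedback buffer)
    fitness:  callable mapping an integer weight array to a score
    seeds:    one RNG seed per population member
    """
    def perturbation(seed):
        # Regenerated on demand from its seed, so no perturbation
        # tensor is ever stored (the seed-replay idea).
        return np.random.default_rng(seed).integers(-1, 2, size=q.shape)

    scores = np.array([fitness(q + perturbation(s)) for s in seeds], dtype=float)
    advantages = scores - scores.mean()

    # Replay each seed to rebuild the search direction.
    direction = np.zeros(q.shape)
    for seed, adv in zip(seeds, advantages):
        direction += adv * perturbation(seed)
    direction /= len(seeds)

    # Error feedback: add the carried-over residual before rounding,
    # then keep whatever the rounding discarded for the next step.
    desired = lr * direction + residual
    step = np.rint(desired).astype(q.dtype)
    return q + step, desired - step

# Toy usage: pull integer weights towards a hypothetical target.
target = np.array([4, -1, 1, 3])
fitness = lambda w: -float(np.sum((w - target) ** 2))
q = np.array([3, -2, 0, 5], dtype=np.int64)
residual = np.zeros(q.shape)
for t in range(20):
    q, residual = qes_step(q, residual, fitness, seeds=range(4 * t, 4 * t + 4))
```

Without the residual buffer, any update smaller than half a quantization step would round to zero and be lost; carrying it forward lets small signals accumulate until they cross a rounding boundary.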

2602.01948 2026-06-15 cs.RO 版本更新

A Unified Control Architecture for Macro-Micro Manipulation using a Active Remote Center of Compliance for Manufacturing Applications

面向制造应用的宏微操作统一控制架构:基于主动远程柔顺中心

Patrick Frank, Christian Friedrich

发表机构 * Institute for Robotics and Intelligent Production Systems University of Applied Sciences Karlsruhe (HKA)(机器人与智能生产系统研究所 卡尔斯鲁厄应用科学大学(HKA))

AI总结 提出一种将宏操作器纳入主动交互控制的新架构,相比现有领先-跟随方法将控制带宽提升2.1倍,相比传统力控制提升12.5倍,并引入替代模型简化控制器设计。

Comments 17 pages, 14 figures, submitted to Robotics and Computer-Integrated Manufacturing (RCIM)

详情
AI中文摘要

宏微操作器将具有大工作空间的宏操作器(如工业机器人)与轻量、高带宽的微操作器相结合。这使得在保持机器人广阔工作空间的同时,能够实现高动态的交互控制。传统上,位置控制分配给宏操作器,而微操作器负责与环境交互,这限制了可实现的交互控制带宽。为解决此问题,我们提出了一种新颖的控制架构,将宏操作器纳入主动交互控制中。与基于领先-跟随方法的最先进架构相比,这导致控制带宽提升了2.1倍,与传统基于机器人的力控制相比提升了12.5倍。此外,我们提出了替代模型,以实现更高效的控制器设计并易于适应硬件变化。我们通过在不同实验(如与物体碰撞、跟随力轨迹和工业装配任务)中与其他控制方案进行比较,验证了我们的方法。

英文摘要

Macro-micro manipulators combine a macro manipulator with a large workspace, such as an industrial robot, with a lightweight, high-bandwidth micro manipulator. This enables highly dynamic interaction control while preserving the wide workspace of the robot. Traditionally, position control is assigned to the macro manipulator, while the micro manipulator handles the interaction with the environment, limiting the achievable interaction control bandwidth. To solve this, we propose a novel control architecture that incorporates the macro manipulator into the active interaction control. This increases the control bandwidth by a factor of 2.1 compared to the state-of-the-art architecture based on the leader-follower approach, and by a factor of 12.5 compared to traditional robot-based force control. Furthermore, we propose surrogate models for more efficient controller design and easy adaptation to hardware changes. We validate our approach by comparing it against other control schemes in different experiments, such as collision with an object, following a force trajectory, and industrial assembly tasks.

2602.01801 2026-06-15 cs.CV cs.AI 版本更新

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

快速自回归视频扩散与世界模型:基于时间缓存压缩与稀疏注意力

Dvir Samuel, Issar Tzachor, Matan Levy, Michael Green, Gal Chechik, Rami Ben-Ari

发表机构 * Hebrew University of Jerusalem(耶路撒冷希伯来大学) Google Research(谷歌研究)

AI总结 提出FAST-AR框架,通过TempCache压缩KV缓存、AnnCA加速交叉注意力、AnnSA稀疏化自注意力,实现自回归视频扩散模型5-10倍加速,同时保持视觉质量并稳定GPU内存使用。

Comments Accepted to ICML 2026. Project Page: https://dvirsamuel.github.io/fast-auto-regressive-video/

详情
AI中文摘要

自回归视频扩散模型支持流式生成,为长序列合成、视频世界模型和交互式神经游戏引擎打开了大门。然而,其核心注意力层在推理时成为主要瓶颈:随着生成过程推进,KV缓存增长,导致延迟增加和GPU内存飙升,进而限制可用的时间上下文并损害长程一致性。在本工作中,我们研究了自回归视频扩散中的冗余性,并识别出三个持续存在的来源:跨帧的近似重复缓存键、缓慢演化的(主要是语义的)查询/键使得许多注意力计算冗余,以及长提示上的交叉注意力中每帧只有少量标记相关。基于这些观察,我们提出了一个统一的、无需训练的注意力框架(FAST-AR),用于快速自回归扩散,包含三个组件:TempCache通过时间对应压缩KV缓存以限制缓存增长;AnnCA通过使用快速近似最近邻(ANN)匹配选择帧相关的提示标记来加速交叉注意力;AnnSA通过将每个查询限制为语义匹配的键(也使用轻量级ANN)来稀疏化自注意力。这些模块共同减少了注意力、计算和内存,并且与现有的自回归扩散骨干网络和世界模型兼容。实验表明,在保持几乎相同的视觉质量的同时,实现了高达5-10倍的端到端加速,并且关键的是,在长序列生成中维持稳定的吞吐量和几乎恒定的峰值GPU内存使用,而先前的方法会逐渐变慢并遭受内存使用增加的问题。

英文摘要

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework (FAST-AR) for FAST-AutoRegressive diffusion, consisting of three components: TempCache compresses the KV cache via temporal correspondence to bound cache growth; AnnCA accelerates cross-attention by selecting frame-relevant prompt tokens using fast approximate nearest neighbor (ANN) matching; and AnnSA sparsifies self-attention by restricting each query to semantically matched keys, also using a lightweight ANN. Together, these modules reduce attention, compute, and memory and are compatible with existing autoregressive diffusion backbones and world models. Experiments demonstrate end-to-end speedups of up to 5x to 10x while preserving near-identical visual quality and, crucially, maintaining stable throughput and nearly constant peak GPU memory usage over long rollouts, where prior methods progressively slow down and suffer from increasing memory usage.

2601.22954 2026-06-15 cs.CL cs.AI 版本更新

Residual Context Diffusion Language Models

残差上下文扩散语言模型

Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出残差上下文扩散(RCD)模块,通过回收丢弃令牌的上下文残差提高扩散语言模型的解码效率,在长/短CoT任务上以极少额外计算提升准确率4-11个百分点。

详情
AI中文摘要

扩散大语言模型(dLLM)已成为纯自回归语言模型的有前途的替代方案,因为它们可以并行解码多个令牌。然而,最先进的逐块dLLM依赖于一种“重掩码”机制,该机制仅解码最自信的令牌并丢弃其余令牌,从而浪费计算。我们证明,回收来自被丢弃令牌的计算是有益的,因为这些令牌保留了对于后续解码迭代有用的上下文信息。鉴于此,我们提出了残差上下文扩散(RCD),一个将这些被丢弃的令牌表示转换为上下文残差并将其注入回下一个去噪步骤的模块。RCD使用解耦的两阶段训练流程来绕过与反向传播相关的内存瓶颈。我们在长链推理(SDAR)和短链指令跟随(LLaDA)模型上验证了我们的方法。我们证明,一个标准的dLLM可以仅用约3亿个令牌高效地转换为RCD范式。在广泛基准测试中,RCD以极小的额外计算开销一致地将前沿dLLM的准确率提升4-11个百分点。值得注意的是,在最具挑战性的AIME任务上,RCD几乎使基线准确率翻倍,并在基线峰值准确率下实现高达4-5倍更少的去噪步骤。

英文摘要

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~300 million tokens. RCD consistently improves frontier dLLMs by 4-11 percentage points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at baseline's peak accuracy.

2601.22108 2026-06-15 cs.LG cs.AI 版本更新

Learning What to Predict: Downstream-Guided Task Design for Continued Pretraining

学习预测什么:下游引导的持续预训练任务设计

Shuqi Ke, Giulia Fanti

发表机构 * Department of ECE(电气与计算机工程系) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出V-pretraining方法,通过轻量级任务设计器为无标签批次构建目标或视图,利用下游损失的一阶减少作为反馈,指导自监督更新,提升目标能力而不损害泛化。

详情
AI中文摘要

持续预训练通过固定的自监督任务进行优化,但根据下游性能选择检查点,形成了一个粗粒度的反馈循环:实践者评估检查点、改变数据混合或目标、重新开始运行,而单个更新仍然对目标能力视而不见。我们询问是否一小部分可验证的下游示例可以在不直接监督学习器的情况下提供步骤级反馈。我们引入了V-pretraining,它将仅使用自监督损失训练的学习器与一个轻量级任务设计器解耦,该设计器为无标签批次构建目标或视图。给定当前学习器和批次,V-pretraining通过预测诱导的自监督更新后下游损失的一阶减少来评分候选构建。设计器最大化该值;然后学习器应用带有分离目标或视图的更新,因此下游标签永远不会更新学习器参数。我们将V-pretraining实例化为用于语言建模的自适应top-K软目标和用于自监督视觉的学习视图或掩码。在两种模态中,V-pretraining在不降低泛化的情况下提高了目标能力。在挂钟时间匹配的持续预训练下,它仅使用1,024个GSM8K示例作为反馈,提高了Qwen模型的GSM8K Pass@1,包括Qwen2.5-0.5B的单次运行+7.4点增益。在视觉方面,它改善了DINOv3向ADE20K语义分割和NYUv2深度估计的迁移,同时保持了ImageNet线性准确率,表明反馈引导的任务构建可以在不破坏通用表示的情况下提高目标能力。

英文摘要

Continued pretraining is optimized with fixed self-supervised tasks but selected by downstream performance, creating a coarse feedback loop in which practitioners evaluate checkpoints, change data mixtures or objectives, and restart runs, while individual updates remain blind to target capabilities. We ask whether a small set of verifiable downstream examples can provide step-level feedback without directly supervising the learner. We introduce V-pretraining, which decouples a learner trained only with a self-supervised loss from a lightweight task designer that constructs targets or views for unlabeled batches. Given the current learner and batch, V-pretraining scores a candidate construction by predicting the first-order reduction in downstream loss after the induced self-supervised update. The designer maximizes this value; the learner then applies the update with targets or views detached, so downstream labels never update learner parameters. We instantiate V-pretraining as adaptive top-K soft targets for language modeling and learned views or masks for self-supervised vision. Across both modalities, V-pretraining improves target capabilities without degrading generalization. Under wall-clock-matched continued pretraining, it improves GSM8K Pass@1 for Qwen models using 1,024 GSM8K examples only as feedback, including a +7.4 point single-run gain for Qwen2.5-0.5B. In vision, it improves DINOv3 transfer to ADE20K semantic segmentation and NYUv2 depth estimation while preserving ImageNet linear accuracy, suggesting that feedback-guided task construction can improve target capabilities without collapsing general-purpose representations.
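The scoring rule described above, the predicted first-order reduction in downstream loss after a self-supervised update, can be written out with a first-order Taylor expansion. The derivation below is a sketch under assumptions: it supposes a plain gradient step with step size η, and the symbols θ, η, B, c, and V are notation introduced here rather than taken from the paper, whose exact predictor may differ.

```latex
% Assumed notation: \theta learner parameters, \eta step size,
% B an unlabeled batch, c a candidate target/view construction.
\begin{align}
\theta' &= \theta - \eta\, \nabla_\theta \mathcal{L}_{\mathrm{ssl}}(\theta; B, c) \\
\mathcal{L}_{\mathrm{down}}(\theta') &\approx \mathcal{L}_{\mathrm{down}}(\theta)
  - \eta\, \nabla_\theta \mathcal{L}_{\mathrm{down}}(\theta)^{\top}
    \nabla_\theta \mathcal{L}_{\mathrm{ssl}}(\theta; B, c) \\
V(c) &= \eta\, \nabla_\theta \mathcal{L}_{\mathrm{down}}(\theta)^{\top}
    \nabla_\theta \mathcal{L}_{\mathrm{ssl}}(\theta; B, c)
\end{align}
```

Under this reading, a candidate construction scores highly when its self-supervised gradient aligns with the downstream gradient. The downstream examples enter only through the score V(c), which is consistent with the statement that downstream labels never update the learner parameters directly.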

2601.21179 2026-06-15 cs.CV 版本更新

Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process

通过全局几何感知扩散过程增强水下光场图像

Yuji Lin, Qian Zhao, Zongsheng Yue, Junhui Hou, Deyu Meng

发表机构 * School of Mathematics and Statistics, Xi’an Jiaotong University(西安交通大学数学与统计学学院) School of Mathematics and Statistics and the Ministry of Education Key Lab of Intelligent Networks and Network Security, Xi’an Jiaotong University(西安交通大学数学与统计学学院和教育部智能网络与网络安全重点实验室) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) Macao Institute of Systems Engineering, Macau University of Science and Technology(澳门系统工程研究院,澳门科技大学)

AI总结 提出基于扩散的GeoDiff-LF框架,利用空间-角度结构增强水下4D光场成像,通过改进U-Net、几何引导损失和优化采样策略,有效缓解颜色失真,在视觉保真度和定量性能上超越现有方法。

Comments 14 pages, 9 figures

详情
AI中文摘要

本文研究了通过4D光场(LF)成像获取高质量水下图像的挑战性问题。为此,我们提出了GeoDiff-LF,一种基于SD-Turbo的新型扩散框架,通过利用其空间-角度结构来增强水下4D LF成像。GeoDiff-LF包含三个关键改进:(1)改进的U-Net架构,带有卷积和注意力适配器以建模几何线索;(2)使用张量分解和渐进加权的几何引导损失函数以正则化全局结构;(3)优化的采样策略与噪声预测以提高效率。通过整合扩散先验和LF几何,GeoDiff-LF有效缓解了水下场景中的颜色失真。大量实验表明,我们的框架在视觉保真度和定量性能上均优于现有方法,推动了水下成像增强的最新进展。代码将在https://github.com/linlos1234/GeoDiff-LF公开。

英文摘要

This work studies the challenging problem of acquiring high-quality underwater images via 4-D light field (LF) imaging. To this end, we propose GeoDiff-LF, a novel diffusion-based framework built upon SD-Turbo to enhance underwater 4-D LF imaging by leveraging its spatial-angular structure. GeoDiff-LF consists of three key adaptations: (1) a modified U-Net architecture with convolutional and attention adapters to model geometric cues, (2) a geometry-guided loss function using tensor decomposition and progressive weighting to regularize global structure, and (3) an optimized sampling strategy with noise prediction to improve efficiency. By integrating diffusion priors and LF geometry, GeoDiff-LF effectively mitigates color distortion in underwater scenes. Extensive experiments demonstrate that our framework outperforms existing methods across both visual fidelity and quantitative performance, advancing the state-of-the-art in enhancing underwater imaging. The code will be publicly available at https://github.com/linlos1234/GeoDiff-LF.

2601.12913 2026-06-15 cs.AI cs.LG cs.NE 版本更新

Actionable Interpretability Must Be Defined in Terms of Symmetries

可操作的可解释性必须根据对称性来定义

Pietro Barbiero, Mateo Espinosa Zarlenga, Francesco Giannini, Alberto Termine, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra

发表机构 * University of Oxford(牛津大学) ETH Zurich(苏黎世联邦理工学院) University of Cambridge(剑桥大学)

AI总结 本文论证AI可解释性研究存在根本性问题,提出可操作的可解释性应基于四种对称性来定义,以形式化可解释模型并统一可解释推理。

详情
AI中文摘要

本文认为,人工智能(AI)中的可解释性研究从根本上来说是不适定的,因为现有的可解释性定义未能描述如何正式测试或设计可解释性。我们提出,可操作的可解释性定义必须根据*对称性*来制定,这些对称性指导模型设计并导致可测试的条件。在概率视角下,我们假设四种对称性(推理等变性、信息不变性、概念封闭不变性和结构不变性)足以(i)将可解释模型形式化为概率模型的一个子类,(ii)产生可解释推理的统一形式(例如,对齐、干预和反事实)作为贝叶斯逆的一种形式,以及(iii)提供一个正式框架来验证是否符合安全标准和法规。

英文摘要

This paper argues that interpretability research in Artificial Intelligence (AI) is fundamentally ill-posed as existing definitions of interpretability fail to describe how interpretability can be formally tested or designed for. We posit that actionable definitions of interpretability must be formulated in terms of *symmetries* that inform model design and lead to testable conditions. Under a probabilistic view, we hypothesise that four symmetries (inference equivariance, information invariance, concept-closure invariance, and structural invariance) suffice to (i) formalise interpretable models as a subclass of probabilistic models, (ii) yield a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion, and (iii) provide a formal framework to verify compliance with safety standards and regulations.

2509.18930 2026-06-15 cs.LG cs.AI 版本更新

Tackling GNARLy Problems: Graph Neural Algorithmic Reasoning Reimagined through Reinforcement Learning

解决GNARLy问题:通过强化学习重新构想图神经算法推理

Alex Schutz, Victor-Alexandru Darvariu, Efimia Panagiotaki, Bruno Lacerda, Nick Hawes

发表机构 * Oxford Robotics Institute, University of Oxford(牛津大学机器人研究所) Stateful Robotics

AI总结 提出GNARL框架,将算法轨迹学习转化为马尔可夫决策过程,结合模仿学习和强化学习,在CLRS-30问题上取得高精度,适用于NP难问题及无专家算法场景。

详情
AI中文摘要

神经算法推理(NAR)是一种通过监督学习训练神经网络执行经典算法的范式。尽管取得了成功,但仍存在重要局限性:无法在不进行后处理的情况下构建有效解,无法推理多个正确解,在组合NP难问题上性能差,且不适用于尚未已知强算法的问题。为了解决这些局限性,我们将学习算法轨迹的问题重新定义为马尔可夫决策过程,这为解构建过程施加了结构,并解锁了模仿学习和强化学习(RL)的强大工具。我们提出了GNARL框架,包括将问题从NAR转化为RL的方法论,以及适用于广泛图问题的学习架构。我们在多个CLRS-30问题上取得了非常高的图准确率结果,性能匹配或超过针对NP难问题的更窄NAR方法,并且值得注意的是,即使在缺乏专家算法的情况下也能适用。

英文摘要

Neural algorithmic reasoning (NAR) is a paradigm that trains neural networks to execute classic algorithms by supervised learning. Despite its successes, important limitations remain: inability to construct valid solutions without post-processing and to reason about multiple correct ones, poor performance on combinatorial NP-hard problems, and inapplicability to problems for which strong algorithms are not yet known. To address these limitations, we reframe the problem of learning algorithm trajectories as a Markov decision process, which imposes structure on the solution construction procedure and unlocks the powerful tools of imitation and reinforcement learning (RL). We propose the GNARL framework, encompassing the methodology to translate problem formulations from NAR to RL and a learning architecture suitable for a wide range of graph-based problems. We achieve very high graph accuracy results on several CLRS-30 problems, performance matching or exceeding much narrower NAR approaches for NP-hard problems and, remarkably, applicability even when lacking an expert algorithm.

2601.19810 2026-06-15 cs.LG cs.AI cs.RO 版本更新

Unsupervised Learning of Efficient Exploration: Pre-training Adaptive Policies via Self-Imposed Goals

高效探索的无监督学习:通过自我设定目标预训练自适应策略

Octavio Pappalardo

发表机构 * University College London (UCL)(伦敦大学学院(UCL))

AI总结 提出ULEE方法,结合上下文学习器与对抗性目标生成策略,在无监督元学习框架中优化多回合探索与适应,提升零样本和少样本性能。

Comments ICLR 2026; v2 adds link to code: https://github.com/Octavio-Pappalardo/ulee-jax

Journal ref The Fourteenth International Conference on Learning Representations, 2026

详情
AI中文摘要

无监督预训练可以为强化学习智能体提供先验知识,加速下游任务的学习。一个基于人类发展的有前景方向是研究智能体通过设定和追求自身目标来学习。核心挑战在于如何有效地生成、选择并从这些目标中学习。我们的关注点是下游任务的广泛分布,其中零样本解决每个任务是不可行的。当目标任务位于预训练分布之外或智能体未知其身份时,这种设置自然出现。在这项工作中,我们(i)在元学习框架内优化高效的多回合探索和适应,以及(ii)用智能体适应后性能的演化估计来指导训练课程。我们提出了ULEE,一种无监督元学习方法,它将上下文学习器与对抗性目标生成策略相结合,该策略将训练维持在智能体能力的前沿。在XLand-MiniGrid基准测试中,ULEE预训练产生了改进的探索和适应能力,这些能力泛化到新的目标、环境动态和地图结构。得到的策略获得了改进的零样本和少样本性能,并为更长的微调过程提供了强初始化。它优于从头学习、DIAYN预训练和替代课程。代码可在以下网址获取:https://github.com/Octavio-Pappalardo/ulee-jax

英文摘要

Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside of the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent's post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy that maintains training at the frontier of the agent's capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning processes. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula. Code is available at: https://github.com/Octavio-Pappalardo/ulee-jax

2601.19115 2026-06-15 cs.CV 版本更新

FBSDiff++: Improved Frequency Band Substitution of Diffusion Features for Efficient and Highly Controllable Text-Driven Image-to-Image Translation

FBSDiff++: 改进的扩散特征频带替换用于高效且高度可控的文本驱动图像到图像翻译

Xiang Gao, Yunpeng Jia

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出FBSDiff++框架,通过动态频带替换扩散特征,实现无需训练的文本驱动图像到图像翻译,支持外观、布局和轮廓引导,并大幅提升推理速度(8.9倍),支持任意分辨率输入和局部编辑。

详情
AI中文摘要

随着大规模文本到图像(T2I)扩散模型在开放域图像创建方面取得显著进展,越来越多的关注集中在将其自然扩展到文本驱动图像到图像(I2I)翻译领域,其中源图像除了文本提示提供的文本引导外,还作为生成图像的视觉引导。我们提出了FBSDiff,一种新颖的框架,从全新的频域角度将现成的T2I扩散模型适配到I2I范式。通过扩散特征的动态频带替换,FBSDiff以即插即用的方式(无需模型训练、微调或在线优化)实现了多样且高度可控的文本驱动I2I,通过分别替换潜在扩散特征的低频带、中频带和高频带,实现外观引导、布局引导和轮廓引导的I2I翻译。此外,FBSDiff通过简单调整替换频带的带宽,灵活地实现了对I2I相关强度的连续控制。为了进一步提升图像翻译的效率、灵活性和功能性,我们提出了FBSDiff++,它在FBSDiff的基础上主要在三个方面进行了改进:(1)通过改进的模型架构大幅加速推理速度(推理速度提升8.9倍);(2)改进频带替换模块,允许输入任意分辨率和宽高比的源图像;(3)扩展模型功能,仅通过对核心方法进行细微调整即可实现局部图像操作和特定风格内容创建。大量的定性和定量实验验证了FBSDiff++在I2I翻译的视觉质量、效率、多样性和可控性方面相对于相关先进方法的优越性。

英文摘要

With large-scale text-to-image (T2I) diffusion models achieving significant advancements in open-domain image creation, increasing attention has been focused on their natural extension to the realm of text-driven image-to-image (I2I) translation, where a source image acts as visual guidance to the generated image in addition to the textual guidance provided by the text prompt. We propose FBSDiff, a novel framework that adapts an off-the-shelf T2I diffusion model to the I2I paradigm from a fresh frequency-domain perspective. Through dynamic frequency band substitution of diffusion features, FBSDiff realizes versatile and highly controllable text-driven I2I in a plug-and-play manner (without the need for model training, fine-tuning, or online optimization), allowing appearance-guided, layout-guided, and contour-guided I2I translation by progressively substituting the low-frequency band, mid-frequency band, and high-frequency band of latent diffusion features, respectively. In addition, FBSDiff flexibly enables continuous control over I2I correlation intensity simply by tuning the bandwidth of the substituted frequency band. To further promote image translation efficiency, flexibility, and functionality, we propose FBSDiff++, which improves upon FBSDiff mainly in three aspects: (1) it accelerates inference by a large margin (an 8.9× speedup) with a refined model architecture; (2) it improves the Frequency Band Substitution module to allow for input source images of arbitrary resolution and aspect ratio; (3) it extends model functionality to enable localized image manipulation and style-specific content creation with only subtle adjustments to the core method. Extensive qualitative and quantitative experiments verify the superiority of FBSDiff++ in I2I translation visual quality, efficiency, versatility, and controllability compared to related advanced approaches.

2601.15828 2026-06-15 cs.CL cs.AI 版本更新

Can professional translators identify machine-generated text?

专业翻译人员能否识别机器生成的文本?

Michael Farrell

发表机构 * IULM University Milan Italy(米兰IULM大学)

AI总结 通过实验研究无专门训练的专业翻译人员识别AI生成短篇故事的能力,发现少数人(16.2%)能准确区分,但多数依赖主观印象导致误判,低突发性和叙事矛盾是可靠指标。

Comments Pages 581 to 591, Volume 1, proceedings of the 26th Annual Conference of the European Association for Machine Translation, 2026

详情
AI中文摘要

本研究调查了未经专门训练的专业翻译人员能否可靠地识别由人工智能(AI)生成的意大利语短篇故事。69名翻译人员参加了一项现场实验,评估了三篇匿名短篇故事——两篇由ChatGPT-4o生成,一篇由人类作者撰写。对于每篇故事,参与者评估了AI作者身份的可能性并提供了选择理由。虽然平均结果不明确,但有一个统计上显著的子集(16.2%)成功区分了合成文本与人类文本,表明他们的判断基于分析技能而非偶然。然而,几乎相同数量的人以相反方向错误分类了文本,通常依赖主观印象而非客观标记,这可能反映了读者对AI生成文本的偏好。低突发性和叙事矛盾成为合成作者身份最可靠的指标,同时报告了意外的仿译、语义借用和来自英语的句法迁移。相比之下,语法准确性和情感基调等特征经常导致误分类。这些发现对专业语境中合成文本编辑的作用和范围提出了疑问。

英文摘要

This study investigates whether professional translators without prior specialized training can reliably identify short stories generated in Italian by artificial intelligence (AI). Sixty-nine translators took part in an in-person experiment, where they assessed three anonymized short stories - two written by ChatGPT-4o and one by a human author. For each story, participants rated the likelihood of AI authorship and provided justifications for their choices. While average results were inconclusive, a statistically significant subset (16.2%) successfully distinguished the synthetic texts from the human text, suggesting that their judgements were informed by analytical skill rather than chance. However, a nearly equal number misclassified the texts in the opposite direction, often relying on subjective impressions rather than objective markers, possibly reflecting a reader preference for AI-generated texts. Low burstiness and narrative contradiction emerged as the most reliable indicators of synthetic authorship, with unexpected calques, semantic loans and syntactic transfer from English also reported. In contrast, features such as grammatical accuracy and emotional tone frequently led to misclassification. These findings raise questions about the role and scope of synthetic-text editing in professional contexts.

2601.18707 2026-06-15 cs.LG cs.AI cs.CV cs.NE 版本更新

SMART: Scalable Mesh-free Aerodynamic Simulations from Raw Geometries using a Transformer-based Surrogate Model

SMART: 基于Transformer代理模型的原始几何形状可扩展无网格气动模拟

Jan Hagnberger, Mathias Niepert

AI总结 提出SMART,一种无需模拟网格、仅使用几何点云预测任意查询位置物理量的神经代理模型,通过交叉层交互联合更新几何特征和物理场,性能媲美甚至超越依赖网格的方法。

Comments Accepted for publication at the 43rd International Conference on Machine Learning (ICML) 2026, Seoul, South Korea

详情
AI中文摘要

基于机器学习的代理模型已成为复杂几何体(如车身)物理模拟中数值求解器的高效替代方案。许多现有模型将模拟网格作为额外输入,从而减少预测误差。然而,为新几何体生成模拟网格计算成本高昂。相比之下,不依赖模拟网格的无网格方法通常误差更高。基于这些考虑,我们引入了SMART,一种神经代理模型,它仅使用几何体的点云表示,无需访问模拟网格,即可预测任意查询位置的物理量。几何体和模拟参数被编码到一个共享的潜在空间中,该空间捕捉物理场的结构和参数特征。然后,一个物理解码器关注编码器的中间潜在表示,将空间查询映射到物理量。通过这种跨层交互,模型联合更新潜在几何特征和演变的物理场。大量实验表明,SMART与依赖模拟网格作为输入的现有方法相比具有竞争力,并且通常表现更优,展示了其在工业级模拟中的能力。

英文摘要

Machine learning-based surrogate models have emerged as more efficient alternatives to numerical solvers for physical simulations over complex geometries, such as car bodies. Many existing models incorporate the simulation mesh as an additional input, thereby reducing prediction errors. However, generating a simulation mesh for new geometries is computationally costly. In contrast, mesh-free methods, which do not rely on the simulation mesh, typically incur higher errors. Motivated by these considerations, we introduce SMART, a neural surrogate model that predicts physical quantities at arbitrary query locations using only a point-cloud representation of the geometry, without requiring access to the simulation mesh. The geometry and simulation parameters are encoded into a shared latent space that captures both structural and parametric characteristics of the physical field. A physics decoder then attends to the encoder's intermediate latent representations to map spatial queries to physical quantities. Through this cross-layer interaction, the model jointly updates latent geometric features and the evolving physical field. Extensive experiments show that SMART is competitive with and often outperforms existing methods that rely on the simulation mesh as input, demonstrating its capabilities for industry-level simulations.