arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.00726 2026-06-02 cs.AI

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs

Jiakang Li, Guanyu Zhu, Can Jin, Chenxi Huang, Dexu Yu, Ronghao Chen, Yang Zhou, Hongwu Peng, Xuanqi Lan, Dimitris N. Metaxas, Youhua Li

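Affiliations: Rutgers University; South China Agricultural University; Columbia University; Fenz.AI; QuantaAlpha; Adobe; Santa Clara University; City University of Hong Kong

AI summary: Proposes Latent Reward Steering (LRS), a framework that implicitly promotes cognitive behaviors by optimizing sparse-autoencoder latent states. A latent reward model trained on final-answer correctness estimates the quality of intermediate states and supplies state-specific correction directions at inference time. Experiments show improved reasoning performance and repair of the original reasoning errors.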

Abstract

Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse-autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces by final answer correctness to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that LRS consistently improves performance over various baselines, and post-hoc analyses further indicate that LRS implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: https://github.com/jiakanglee/Latent-Reward-Steering.
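
The gated reward-gradient step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the logistic reward model, the gate thresholds, the step size, and the names `reward`, `reward_gradient`, and `steer` are all assumptions made for illustration, and a plain list of floats stands in for an SAE latent state.

```python
import math

def reward(z, w, b):
    """Toy latent reward model (assumed logistic): scores a latent state z in (0, 1)."""
    s = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1.0 / (1.0 + math.exp(-s))

def reward_gradient(z, w, b):
    """Gradient of the logistic reward with respect to z: r * (1 - r) * w."""
    r = reward(z, w, b)
    return [r * (1.0 - r) * wi for wi in w]

def steer(z, w, b, confidence, reward_gate=0.5, confidence_gate=0.9, step=1.0):
    """Move z along the reward gradient only when the gate flags it as fragile.

    The gate (hypothetical thresholds) requires both a low reward and a low
    model confidence; otherwise the state is returned unchanged.
    """
    fragile = reward(z, w, b) < reward_gate and confidence < confidence_gate
    if not fragile:
        return list(z)
    g = reward_gradient(z, w, b)
    return [zi + step * gi for zi, gi in zip(z, g)]
```

Because the correction direction is the gradient at the current state, two fragile states generally receive different corrections, which is the state-specific property the abstract emphasizes over fixed steering vectors.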

2606.00724 2026-06-02 cs.CL cs.AI

WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering

Jinnan Yang, Yan Wang, Zhen Bi, Kehao Wu, Xiaojie Li, Jungang Lou, Zechao Li, Jing Liu

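Affiliations: Nanjing University of Science and Technology; Alibaba Group; Huzhou Normal University; Institute of Automation, Chinese Academy of Sciences

AI summary: To address the high computational overhead and inference latency of diffusion LLMs on long-context tasks, proposes WaveFilter, a training-free, general-purpose caching framework that uses the wavelet transform to decompose long sequences, identify key tokens precisely, and build a sparse KV cache, improving the performance of existing KV cache methods on complex long-context tasks.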

Comments 8 pages, 3 figures

Abstract

Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in long-context tasks have become core bottlenecks restricting their large-scale deployment. When processing long sequences, existing Key-Value (KV) caching mechanisms often suffer drastic degradation in generation quality; the core challenge lies in precisely and efficiently filtering critical tokens within ultra-long contexts. Inspired by the human reading process, we propose WaveFilter, a universal and training-free caching framework. This framework introduces the wavelet transform to decompose long sequences and precisely identify key tokens, and on that basis constructs a sparse KV cache to compute the final contextual representation. Experimental results demonstrate that WaveFilter, as a plug-and-play generic framework, significantly enhances the performance of existing mainstream KV cache methods in complex long-context tasks.
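
The abstract does not say which signal is decomposed or how wavelet coefficients are ranked, so the following Python sketch only illustrates the general idea under loud assumptions. A hypothetical per-token importance score is passed through one level of the Haar wavelet transform, and the token pairs with the largest detail coefficients (the sharpest local changes) are kept for a sparse cache. The names `haar_step` and `select_tokens` are illustrative and do not come from the paper.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length sequence.

    Returns (approximation, detail) coefficient lists, one pair per two samples.
    """
    approx, detail = [], []
    for i in range(0, len(signal), 2):
        a, b = signal[i], signal[i + 1]
        approx.append((a + b) / math.sqrt(2))
        detail.append((a - b) / math.sqrt(2))
    return approx, detail

def select_tokens(scores, keep_pairs):
    """Return sorted token indices from the pairs with the largest |detail|.

    `scores` is an assumed per-token importance signal (for example, an
    attention-derived score); `keep_pairs` is the number of pairs to retain.
    """
    _, detail = haar_step(scores)
    ranked = sorted(range(len(detail)), key=lambda k: abs(detail[k]), reverse=True)
    kept = []
    for k in ranked[:keep_pairs]:
        kept.extend([2 * k, 2 * k + 1])
    return sorted(kept)
```

A real system would likely use several decomposition levels and combine approximation and detail coefficients; a single Haar level is used here only to keep the example short.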

2606.00722 2026-06-02 cs.CL cs.AI

EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models

Hyundong Jin, Yo-Sub Han

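Affiliations: Yonsei University

AI summary: Proposes EPIC, a framework that combines lexing memoization, Earley-style parsing for validation, and relaxed compatible subset selection to address the inefficiency and loss of parallelism in CFG-constrained decoding for diffusion language models, reducing inference time by up to 67.5% and additional overhead by up to 90.5%.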

Abstract

Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have extended output control beyond regular constraints to context-free grammar (CFG) constraints. Existing methods, however, can be up to four times slower than unconstrained decoding. More importantly, they substantially diminish one of the key advantages of diffusion language models over autoregressive models, namely parallel decoding. This slowdown arises because sequential validity checking introduces significant overhead during parallel generation. We propose an efficient CFG-constrained decoding framework, EPIC, that addresses this limitation. Our method improves decoding efficiency by combining lexing memoization, validation using Earley-style parsing instead of deterministic automata, and relaxed compatible subset selection for parallel commit. It reduces repeated lexing and validation overhead while allowing multiple compatible tokens to be committed together. Experiments on three benchmarks using four models show that our method reduces inference time by up to 67.5% and decreases the additional overhead by up to 90.5% compared with existing CFG-constrained decoding methods. Our implementation is available at https://github.com/hyundong98/EPIC-Decoding.git.
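
Of the three ingredients named in the abstract, lexing memoization is the simplest to illustrate. The Python sketch below is not EPIC's implementation: the toy token specification and the function name `lex` are assumptions, and the cache is simply Python's standard `functools.lru_cache`. It shows only why repeated validity checks on the same text fragment can skip the lexer.

```python
import re
from functools import lru_cache

# Toy terminal definitions (assumed for illustration, not from the paper).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/()]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

@lru_cache(maxsize=None)
def lex(text):
    """Lex `text` into a tuple of terminal names, caching results per string.

    Raises ValueError if some character matches no terminal.
    """
    terminals, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"cannot lex {text!r} at position {pos}")
        if match.lastgroup != "SKIP":
            terminals.append(match.lastgroup)
        pos = match.end()
    return tuple(terminals)
```

The resulting terminal sequence would then be handed to a grammar validator (Earley-style in the paper); that stage is omitted here.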

2606.00718 2026-06-02 cs.AI math.OC

LLM-Driven Co-Evolutionary Automated Heuristic Design for Bi-Component Coupled Combinatorial Optimization

Mingen Kuang, Xudong Deng, Xi Lin, Ye Fan, Jianyong Sun, Jialong Shi

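Affiliations: Xi’an Jiao Tong University; Northwestern Polytechnical University

AI summary: Proposes CoEvo-AHD, a framework that uses LLMs to co-evolve two operator populations and discovers complementary operator logic through cooperative evaluation and joint crossover, targeting coupled combinatorial optimization problems such as the Traveling Thief Problem.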

Abstract

While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing methods typically generate and evolve heuristics as a single operator or search strategy, limiting their ability to model strong coupling among multiple decision substructures in problems such as the Traveling Thief Problem (TTP) and the Traveling Purchaser Problem (TPP). In this work, we propose CoEvo-AHD, an LLM-driven dual-population co-evolutionary framework for automated heuristic design in coupled combinatorial optimization. Unlike prior methods that evolve individual heuristics in isolation, CoEvo-AHD leverages LLMs to co-evolve two closely related operator populations. A cooperative evaluation mechanism explicitly captures interactions between route and selection operators, while pairwise scoring and synergistic joint crossover help discover complementary operator logic for joint improvement across coupled decision subspaces. We further design a tool-invocation environment library that encapsulates frequently used core operations, such as local-search delta computation, into callable functions, enabling LLM-generated operators to use standardized interfaces instead of reimplementing inefficient and error-prone problem-specific loops. Experiments on TTP and TPP show that CoEvo-AHD automatically discovers cooperative heuristic combinations and achieves competitive solution quality against traditional heuristics.

2606.00717 2026-06-02 cs.LG cs.AI stat.ML

Multi-Agent Conformal Prediction with Personalized Statistical Validity

Martin V. Vejling, Christophe A. N. Biscio, Adrien Mazoyer, Petar Popovski, Shashi Raj Pandey

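Affiliations: Department of Electronic Systems; Aalborg University; Department of Mathematical Sciences; Institut de Mathématiques de Toulouse; Université de Toulouse

AI summary: Proposes a personalized federated weighted conformal prediction framework that uses local density ratio weighting and weighted quantile aggregation to correct for data heterogeneity while preserving privacy, giving each participating agent asymptotically valid marginal and calibration-conditional coverage guarantees.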

Abstract

Uncertainty quantification is essential in high-stakes machine learning tasks. However, one of the principled solutions, conformal prediction, faces challenges under limited local calibration data, privacy constraints, and data heterogeneity. In multi-agent settings, existing works do not address these challenges simultaneously and satisfactorily: their guarantees are either limited to averages across agents or lose validity in heterogeneous settings. Hence, we propose personalized federated weighted conformal prediction (PFWCP), a framework that combines local density ratio weighting with weighted quantile aggregation to correct for heterogeneity while preserving privacy. The method yields asymptotically valid marginal and calibration-conditional coverage guarantees for each participating agent and supports protocols with one-shot communication. Theoretical analysis presents an adjustment to the coverage variance, governed by an effective sample size expression, which is necessary in the context of weighted conformal prediction, and experiments on synthetic and real datasets show improved calibration quality over state-of-the-art federated conformal baselines.
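
Two building blocks mentioned in the abstract, the weighted quantile and the effective sample size, can be sketched in a few lines of Python. This is a single-agent illustration of standard weighted conformal prediction, not the PFWCP protocol: the federated aggregation and privacy mechanisms are omitted, the Kish formula is an assumption about what "effective sample size" refers to, and the function names are hypothetical.

```python
def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration scores.

    Each calibration score carries a density-ratio weight; the test point's
    weight is placed at +infinity, as in standard weighted conformal prediction.
    Returns float('inf') when the calibration weights alone cannot reach the
    target mass.
    """
    total = sum(weights) + test_weight
    target = (1.0 - alpha) * total
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= target:
            return score
    return float("inf")

def effective_sample_size(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)
```

With uniform weights the quantile reduces to ordinary split conformal prediction; highly uneven weights shrink the effective sample size, which is one way to see why a variance adjustment becomes necessary.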

2606.00716 2026-06-02 cs.LG eess.SP

Graph Transfer Learning via Shared Latent Geometry: Theory and Applications

Tong Wu, Andrew Campbell, Anna Scaglione

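Affiliations: University of Central Florida, USA; Cornell University, USA

AI summary: Proposes an asymmetric two-pathway architecture in which a teacher encoder learns operator-polynomial features from a high-fidelity simulator and a student encoder learns the same latent geometry from sparse data, enabling zero-shot transfer with a provable error bound.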

Abstract

Inference and control in engineered physical systems pay a heavy physics cost at deployment: state estimators, inverse-problem solvers, model-predictive controllers, schedulers, and observers are often not closed-form and must re-solve a numerical optimization per instance, with the operator re-supplied each time. Physics-informed learning moves this cost to training, but uses a single encoder pathway whose latent geometry de-learns under fine-tuning and admits no quantitative transfer guarantee. We propose an asymmetric two-pathway architecture that resolves both issues. A teacher encoder consumes privileged dense states from a high-fidelity simulator and represents the system through operator-polynomial features stable under spectral perturbation; a student encoder learns the same latent geometry from sparse field data and operator descriptors. At deployment the teacher is discarded, and the frozen student runs in a single forward pass with a transfer certificate. The design connects to privileged-information learning, knowledge distillation, and cross-modal distillation, but targets cross-instance transfer rather than fixed-instance prediction: topology and operator may change, while the latent task does not. We establish sufficient and near-necessary transfer conditions via Wasserstein proximity between latent laws, yielding a zero-shot error bound, and develop a finite-sample certification protocol with active expansion when coverage is incomplete. The framework applies wherever a system admits an operator with reportable spectrum. On power-system estimation, it achieves zero-shot transfer to 100 unseen topologies, a 95% certificate pass rate, accuracy competitive with topology-aware Newton-Raphson, and sub-millisecond inference. These results suggest asymmetric pathways plus operator-anchored latent geometry provide a foundation for certified zero-shot inference and control.

2606.00712 2026-06-02 cs.CV

CASTLE2026 Team WDL Technical Report

Zhengyang Li, Zhenglin Du, Yi Wen, Fang Liu, Shuo Li, Xu Liu

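Affiliations: Key Laboratory of Intelligent Perception and Image Understanding

AI summary: Proposes a Qwen-based, evidence-aware multimodal reasoning pipeline that answers long-video questions through prompt routing and confidence-weighted voting, ranking first in the CASTLE Challenge.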

Comments 4 pages

Abstract

The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over 600+ hours of multi-perspective recordings. Each four-choice question requires evidence from videos, transcripts, auxiliary photos, people, days, rooms, and temporal context. We propose an evidence-aware multimodal reasoning pipeline based on Qwen. Our system parses question hints, retrieves ASR chunks, attaches auxiliary images, samples candidate video frames, and routes questions into static visual, speech/text, temporal, and mixed types with specialized prompts. Multiple inference passes are aggregated by confidence-weighted voting and converted into the official Codabench format. In ablation, LoRA improves the score from 0.21 to 0.50, and more sampled frames further raise it to 0.58. Our final system ranks first in the CASTLE Challenge @ EgoVis 2026.
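
The aggregation step, confidence-weighted voting over multiple inference passes, can be sketched as follows. The report's exact weighting scheme is not given in the abstract, so summing raw confidences per answer choice and breaking ties alphabetically are assumptions, as is the function name `weighted_vote`.

```python
from collections import defaultdict

def weighted_vote(predictions):
    """Pick the answer whose summed confidence is largest.

    `predictions` is a list of (choice, confidence) pairs, one per inference
    pass. Ties are broken in favor of the alphabetically first choice.
    """
    totals = defaultdict(float)
    for choice, confidence in predictions:
        totals[choice] += confidence
    return max(sorted(totals), key=lambda choice: totals[choice])
```

Unlike a plain majority vote, a single high-confidence pass can outweigh two low-confidence passes that agree with each other.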

2606.00709 2026-06-02 cs.RO

BEVIO: Efficient Bird's-Eye-View based Sparse-Update Visual-Inertial Odometry for Lunar Day-Night Navigation

Mohit Singh, Shehryar Khattak, Ashish Goel, Michael Paton, Kostas Alexis, Issa A. Nesnas

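Affiliations: Jet Propulsion Laboratory, California Institute of Technology; Autonomous Robots Lab at the Norwegian University of Science and Technology

AI summary: Proposes a bird's-eye-view-based image matching scheme that enables reliable visual-inertial odometry at very low visual update rates, suited to day-night navigation on resource-constrained lunar rovers.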

Comments Accepted at the 2026 IEEE International Conference on Robotics and Automation, Vienna

Abstract

Visual-Inertial Odometry (VIO) provides smooth, high-rate state estimates and has been widely used for robotic navigation in both terrestrial and planetary applications. However, its performance is typically dependent on the frequency of visual updates, which is a challenge for planetary rovers operating under extreme resource constraints and low frame rates. This work investigates enabling reliable VIO with very sparse visual updates for lunar rover applications, addressing both day and night-time operations where feature associations become especially difficult under self-illumination conditions. We propose a Bird's Eye View (BEV)-based image matching scheme that remains robust to larger inter-frame motions and achieves more reliable feature matching despite significant visual appearance changes. We extensively evaluate our proposed approach, BEVIO, through high-fidelity photorealistic lunar simulation and real-time robotic experiments conducted using a half-scale lunar rover in a long-term day-night deployment at Plaster City, CA, USA. The results demonstrate that our method enables reliable day and nighttime self-illuminated traverses at visual update rates as low as 0.25 Hz, underscoring its suitability for navigation on power- and compute-limited lunar rovers.

2606.00708 2026-06-02 cs.AI cs.LG

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Yifan Bao, Xinyu Xi, Xinyu Liu, Wen Ge, Lei Jiang, Kevin Zhang, Raad Khraishi, Yihao Ang, Anthony K. H. Tung, Lukasz Szpruch, Hao Ni

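Affiliations: Department of Computer Science, National University of Singapore; University College London; University of Edinburgh; Data & Analytics, Digital X; Alan Turing Institute

AI summary: Proposes MOSAIC, a framework that uses structured agentic orchestration, memory-grounded model selection, and blueprint construction to turn automated data science into a verifiable, reusable model-selection problem, outperforming AutoML and agentic baselines on financial time-series tasks.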

Abstract

Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process, but typically search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer greater flexibility through retrieval, code generation, and execution feedback, yet their modelling decisions are often unstructured, difficult to verify, and hard to reuse. We introduce MOSAIC (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and dataset, MOSAIC builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation specifying selected modelling components, composition, interface constraints, and execution requirements. This blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate MOSAIC on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behaviour. Experiments against AutoML and agentic baselines show that MOSAIC improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, and execution-grounded model selection.

2606.00704 2026-06-02 cs.CV

VICR: Visual In-Context Restoration for Real-World Image Super-Resolution

Qichang Zhang, Hailong Wang, Baiang Li, Linhao Wang, Rong Fu, Erkang Cheng, Simon James Fong

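Affiliations: Faculty of Science and Technology, University of Macau; Nullmax; Hefei University of Technology; Shandong Normal University

AI summary: Proposes a Diffusion Transformer-based visual in-context restoration framework that formulates real-world image super-resolution as image completion through a decoupled visual prior injection mechanism, balancing structural fidelity and detail synthesis.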

Comments 28 pages, 11 figures, 9 tables

Abstract

Real-world image super-resolution (Real-ISR) requires balancing structural fidelity to degraded observations with realistic detail synthesis. However, existing generative Real-ISR methods often rely on entangled conditioning mechanisms, leading to structural drift or semantically inconsistent details. To address this issue, we propose Visual In-Context Restoration (VICR), a Diffusion Transformer (DiT)-based framework that formulates Real-ISR as image completion. Specifically, we introduce a decoupled visual prior injection mechanism that derives local and global cues from the low-quality (LQ) image: local cues help recover image structures and support high-frequency detail synthesis, while global cues guide overall generation and promote semantic consistency. For ambiguous regions under severe degradation, VICR employs an inference-time agent to refine semantic prompts using visual evidence from the LQ input while keeping model parameters fixed. Experiments show that VICR achieves state-of-the-art performance across multiple Real-ISR benchmarks with only 127M trainable parameters.

2606.00702 2026-06-02 cs.RO cs.AI

Shape Your Body: Value Gradients for Multi-Embodiment Robot Design

Nico Bohlinger, Jan Peters

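Affiliations: Technical University of Darmstadt; Robotics Institute Germany (RIG); German Research Center for AI (DFKI); hessian.AI

AI summary: Proposes turning generalist multi-embodiment value functions into reusable models that optimize robot designs through value gradients, without rerunning reinforcement learning co-design for each robot.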

Abstract

We propose to turn generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcement learning co-design loop for each robot, we first train an embodiment-aware policy and value function across many robot designs. After training, the frozen value function is used as a differentiable surrogate to optimize candidate embodiments through value gradients. We evaluate our approach across different robot design settings, from perturbed single robots to held-out robots across morphology classes, with single models trained on up to 50 robots and design spaces of over 1100 continuous embodiment parameters. Beyond optimizing complete embodiments, we show that value gradients can identify performance-limiting design and control parameters, enabling both the optimization and the analysis of new robot designs.
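
The core loop, gradient ascent on design parameters through a frozen value function, can be shown with a toy Python sketch. Everything concrete here is an assumption: `toy_value` is a hand-written quadratic standing in for a learned value network, the two design parameters are hypothetical, and central finite differences replace the automatic differentiation a real implementation would use.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x (stand-in for autodiff)."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def optimize_design(value_fn, design, lr=0.1, steps=200):
    """Gradient ascent on the design; the value function itself is never updated."""
    d = list(design)
    for _ in range(steps):
        g = numerical_gradient(value_fn, d)
        d = [di + lr * gi for di, gi in zip(d, g)]
    return d

def most_limiting_parameter(value_fn, design):
    """Index of the design parameter with the largest absolute value gradient."""
    g = numerical_gradient(value_fn, design)
    return max(range(len(g)), key=lambda i: abs(g[i]))

def toy_value(design):
    """Toy frozen value function with its maximum at leg_length=1.0, mass=0.5."""
    leg_length, mass = design
    return -(leg_length - 1.0) ** 2 - (mass - 0.5) ** 2
```

The same gradient serves both uses named in the abstract: following it optimizes a design, while inspecting its largest components points to the parameters that limit performance most.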

2606.00700 2026-06-02 cs.LG cs.AI

COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs

Sheng'en Li, Dongmian Zou

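Affiliations: Shanghai Jiao Tong University

AI summary: For online link recommendation on evolving graphs, proposes COPF, a framework that achieves deployment-stable fairness monitoring and control through exposure-counterfactual opportunity gaps, explicit exploration, and residual outcome indistinguishability auditing.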

Comments Accepted at ICML 2026

Abstract

Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which links form and what feedback it later observes. Consequently, fairness estimates from logged outcomes can be misleading and may drift after deployment when the recommendation policy is updated. We introduce COPF (Counterfactual Online Performative Fairness), a decision-layer framework for deployment-stable fairness monitoring and control in online link recommendation. COPF (i) defines group-level opportunity gaps over exposure (shown vs. not shown) counterfactuals, (ii) makes them estimable by explicit exploration and by logging the probability (propensity) that each candidate is shown, and (iii) audits and controls fairness using residual outcome indistinguishability (OI) over a configurable auditor family with graph-aware doubly robust (GA-DR) estimators. We provide a noisy transfer theorem showing that Residual-OI on estimated GA-DR residuals implies bounds on exposure-counterfactual group gaps under temporal mixing and bounded local interference, and we instantiate an online multicalibration auditor together with a primal-dual controller. Experiments on two TGB streams and a controlled synthetic bipartite stream show that COPF reduces worst-case spikes in exposure-counterfactual group disparities with modest impact on ranking utility. Our code is available at https://github.com/lsnnnnnnnn/COPF.
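
Logged propensities are what make the exposure counterfactuals estimable. The Python sketch below uses the textbook doubly robust estimator, not the paper's graph-aware GA-DR variant, and it ignores interference between candidates. The record layout and the function names `dr_estimate` and `group_gap` are assumptions for illustration.

```python
def dr_estimate(records):
    """Doubly robust estimate of the mean outcome if every candidate were shown.

    Each record is (shown, outcome, propensity, predicted), where `propensity`
    is the logged probability of being shown and `predicted` is an outcome
    model's guess for the shown case. Unshown candidates contribute only the
    model prediction; shown candidates add a propensity-weighted residual.
    """
    total, count = 0.0, 0
    for shown, outcome, propensity, predicted in records:
        correction = (outcome - predicted) / propensity if shown else 0.0
        total += predicted + correction
        count += 1
    return total / count

def group_gap(records_a, records_b):
    """Difference between two groups' doubly robust estimates."""
    return dr_estimate(records_a) - dr_estimate(records_b)
```

The estimate stays consistent if either the propensities or the outcome model are correct, which is why explicit exploration with logged probabilities matters: it guarantees that the propensity side is known exactly.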

2606.00694 2026-06-02 cs.CV

FROST-STA: Frozen Dense Features for the Ego4D Short-Term Object Interaction Anticipation

Chaoyang Wang, Lexuan Xu

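Affiliations: Beihang University

AI summary: Proposes FROST-STA, a model that uses frozen dense image-video features and object-centric decoding, placing second in the Ego4D Short-Term Object Interaction Anticipation Challenge.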

Abstract

Short-term anticipation in egocentric video requires more than recognizing the current scene: a system must infer which object the camera wearer will contact, which action will follow, and how soon the contact will happen. This report describes FROST-STA, our submission to the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. For each query time, the model produces a ranked set of structured hypotheses containing an active-object box, noun label, verb label, time-to-contact (TTC), and confidence. FROST-STA builds on the V-JEPA 2.1 STA evaluation protocol, but adapts it to the challenge by using object-centric decoding, multi-head prediction, and a submission-oriented training and ensembling recipe. We keep the V-JEPA 2.1 ViT-G backbone fixed and extract two dense token streams: video tokens from a short clip resized to 384 pixels before the query, and image tokens from the last observed high-resolution frame. A compact alignment module, consisting of an attentive probe and frame-guided temporal pooling, maps the clip representation onto the spatial reference of the final frame before fusing it with image features. The fused maps are decoded by Faster R-CNN-style STA heads that estimate box offsets, nouns, verbs, TTC values, and interaction quality. For the final leaderboard entry, we train for 25 epochs with the official training split plus additional permitted validation annotations, and combine predictions across eight heads and checkpoints from epochs 15-25. FROST-STA obtains 5.13 Overall Top-5 mAP on the official test server, ranking second in the challenge and showing that frozen dense image-video features can serve as a strong basis for object-level interaction forecasting.

2606.00690 2026-06-02 cs.LG

DistMatch: Adaptive Binning via Distribution Matching for Robust Sequential Conformal Prediction

Enver Menadjiev, Jihyeon Seong, Jisu Yeo, Jaesik Choi

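Affiliations: University of California, Berkeley; University of Tokyo

AI summary: Proposes DistMatch, which recursively partitions residuals using the Kolmogorov-Smirnov statistic to obtain approximate exchangeability and applies online quantile regression for locally adaptive inference, improving the robustness of sequential conformal prediction to distribution shift.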

Comments ICML 2026 (34 pages, 12 figures, 16 tables)

Abstract

Sequential conformal prediction (CP) provides valid uncertainty quantification under the assumption of residual exchangeability. However, this assumption is often violated in real-world time series due to temporal dependencies and distributional shifts. While recent methods attempt to approximate exchangeability through reweighting, identifying optimal weights remains an open challenge. To address this limitation, we propose DistMatch, a binning-based method that recursively partitions residuals within a binary tree using the Kolmogorov-Smirnov (KS) statistic. We theoretically show that this partitioning induces approximately exchangeable leaves, thereby avoiding the need for reweighting. By applying quantile regression with online updates within each leaf, DistMatch enables locally adaptive inference and improves robustness to distributional shifts. Extensive experiments demonstrate that DistMatch outperforms existing sequential CP methods.
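
The abstract does not give the split rule, so the following Python sketch is only one plausible reading of KS-guided recursive partitioning. It assumes residuals are split in time order at the point where the two sides' empirical distributions differ most, and that recursion stops when no split exceeds a threshold. The threshold, the minimum leaf size, and the function names are hypothetical; the per-leaf online quantile regression is omitted.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def partition(residuals, min_size=4, threshold=0.5):
    """Recursively split a time-ordered residual list into homogeneous leaves.

    A split is made at the index with the largest KS statistic between the two
    sides, provided that statistic exceeds `threshold`; otherwise the segment
    becomes a leaf.
    """
    n = len(residuals)
    if n < 2 * min_size:
        return [residuals]
    best_split, best_gap = None, threshold
    for k in range(min_size, n - min_size + 1):
        gap = ks_statistic(residuals[:k], residuals[k:])
        if gap > best_gap:
            best_split, best_gap = k, gap
    if best_split is None:
        return [residuals]
    left = partition(residuals[:best_split], min_size, threshold)
    right = partition(residuals[best_split:], min_size, threshold)
    return left + right
```

Each leaf then holds residuals that look alike in distribution, so a quantile computed inside a leaf needs no reweighting; that is the intuition behind the abstract's claim of approximately exchangeable leaves.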

2606.00689 2026-06-02 cs.CV

Wavelet-Fusion Diffusion Model for Multimodal Brain MRI Synthesis with Modality and Metadata Conditioning

Muhammad Nabi Yasinzai, Remika Mito, Mangor Pedersen

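Affiliations: Department of Psychology & Neuroscience, Auckland University of Technology; Department of Psychiatry, University of Melbourne

AI summary: Proposes the Wavelet-Fusion Diffusion Model (WFDM), which combines a Wavelet-Fusion variational autoencoder (WF-VAE) with a conditional 3D U-Net diffusion model and uses explicit modality and metadata conditioning for multimodal brain MRI synthesis, addressing uneven modality coverage and dataset heterogeneity and achieving the strongest distributional alignment among the evaluated generators.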

Comments 51 pages, 7 figures, including supplementary material. Submitted to Imaging Neuroscience

英文摘要

Multimodal MRI provides complementary information for neuroimaging analysis, where different imaging modalities capture distinct anatomical, tissue, and pathological features that support the development and evaluation of downstream AI applications. Although large-scale structural MRI resources are increasingly available, their modality coverage is often uneven across public and pooled neuroimaging datasets. This uneven modality coverage is further complicated by heterogeneity across sites, scanners, and acquisition protocols, as well as demographic and clinical variables that are often sparse, inconsistently recorded, or unavailable across studies. Synthetic MRI generation can help address this imbalance by synthesizing target-modality volumes for dataset augmentation and controlled synthetic cohort creation. However, many existing MRI synthesis approaches are trained on narrow modality sets or relatively homogeneous cohorts, limiting their applicability to large pooled neuroimaging resources where modality availability, acquisition protocols, and metadata coverage vary substantially across datasets. Diffusion models have become an attractive approach for MRI synthesis because of their strong sample fidelity and diversity, but sampling directly in 3D voxel space is computationally expensive and slow at inference. Latent diffusion improves practicality by synthesizing MRI in a learned, 3D latent space, although generation quality depends on the autoencoder's reconstruction fidelity and the resulting latent distribution. Our approach combines a Wavelet-Fusion variational autoencoder (WF-VAE) latent compressor with a conditional 3D U-Net diffusion model trained in the learned latent space using explicit modality and metadata conditioning. Our proposed Wavelet-Fusion Diffusion Model (WFDM) achieved the strongest distributional alignment among the evaluated synthetic MRI generators.

2606.00688 2026-06-02 cs.CV

Shape-Prior-Based Point Cloud Completion for Single-Stage Fully Sparse 3D Object Detection

基于形状先验的点云补全用于单阶段全稀疏3D目标检测

Kaizheng Wang, Mingqian Ji, Jian Yang, Shanshan Zhang

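Affiliations * School of Computer Science and Engineering, Nanjing University of Science and Technology

AI summary: Addresses sparse and incomplete point clouds in single-stage fully sparse 3D detectors with a shape-prior-based point cloud completion method; an Instance Selection module and an Alignment-Based Point Completion module significantly improve detection performance on KITTI.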

详情
AI中文摘要

单阶段全稀疏3D目标检测器依赖点云数据在自动驾驶场景中检测目标。然而,点云的稀疏性和不完整性严重限制了3D目标检测的性能。为解决此问题,本文提出一种专门针对单阶段全稀疏检测器的点云补全方法。整个基于形状先验的补全过程由两个连续步骤组成。第一步,我们设计了一个新颖的实例选择模块,即使在基线模型未生成提议的情况下,也能识别对应前景目标的点云,同时有效忽略背景区域的点云。第二步,我们引入了一种新颖的基于对齐的点补全模块,该模块将前景目标的点云在中心和朝向上与原型对齐。随后,从原型中选择点来填充前景目标的缺失部分。我们在KITTI数据集上使用两个单阶段全稀疏检测器评估了我们的方法。实验结果表明,所提方法显著提升了检测性能,证实了其有效性和泛化能力。

英文摘要

Single-stage fully sparse 3D object detectors rely on point cloud data to detect objects in autonomous driving scenarios. However, the sparsity and incompleteness of point clouds significantly limit the performance of 3D object detection. To address this issue, this paper proposes a point cloud completion method specifically designed for single-stage fully sparse detectors. The entire shape-prior-based completion process consists of two consecutive steps. In the first step, we design a novel Instance Selection module, which is capable of identifying point clouds corresponding to foreground objects even when the baseline model does not generate proposals, while effectively ignoring the point clouds of background regions. In the second step, we introduce a novel Alignment-Based Point Completion module, which aligns the point clouds of foreground objects with prototypes in terms of both their centers and orientations. Subsequently, points are selected from the prototype to fill in the missing parts of the foreground object. We evaluated our method on two single-stage fully sparse detectors using the KITTI dataset. The experimental results demonstrate that the proposed method significantly improves the detection performance, confirming its effectiveness and generalizability.

2606.00686 2026-06-02 cs.LG

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

对齐的辩证法:利用不安全知识实现动态安全路由

Maryam Hashemzadeh, Jerry Huang, Minseon Kim, Marc-Alexandre Côté, Sarath Chandar

Affiliations * Chandar Research Lab; Mila – Quebec AI Institute; Université de Montréal; Microsoft Research; Polytechnique Montréal; Canada CIFAR AI Chair

AI summary: Proposes SafeMoE, a Mixture-of-Experts framework that isolates unsafe knowledge in domain-specific low-rank adapters (LoRA experts) and trains a lightweight gating network to route among them dynamically, producing informative responses while maintaining safety.

详情
AI中文摘要

大语言模型(LLM)对齐的主流范式通过擦除、过滤不安全数据或训练模型严格拒绝有害提示来运作。虽然这种方法能有效降低即时毒性,但根本上限制了模型的认识论范围,导致系统过度谨慎,对敏感但良性的查询输出无信息量的全面拒绝。在这项工作中,我们挑战了不安全数据必须丢弃的正统观念。我们提出了一种对齐的辩证方法,认为不安全数据编码了丰富的、领域特定的知识,对于细致、安全且信息丰富的生成至关重要。为实现这一点,我们引入了SafeMoE,一个混合专家(MoE)框架,将不安全知识隔离到仅在有害语料上训练的领域特定低秩适配器(LoRA专家)中。为了从这些不安全基元中综合安全性,我们使用最小、高度精选的安全信息响应集训练一个轻量级门控网络。在推理时,该路由器动态编排不安全专家,有效引导生成轨迹以利用其深层领域知识,同时严格执行安全约束。在严格的安全基准上的广泛实证评估表明,SafeMoE不仅更安全,安全响应率相对提高了20%以上(绝对增益超过15%),而且在安全性和危害性至关重要时能生成更具信息量的响应。此外,路由机制在未见领域和更广泛的安全任务上表现出强大的零样本泛化能力,无需领域特定监督。我们的发现表明对齐的范式转变:真正的安全不需要掩盖不安全知识,而是需要其受控整合。

英文摘要

The prevailing paradigm in large language model (LLM) alignment operates via erasure, filtering unsafe data or training models to strictly refuse harmful prompts. While effective at reducing immediate toxicity, this approach fundamentally constricts the model's epistemological scope, resulting in over-cautious systems that output uninformative blanket refusals to sensitive yet benign queries. In this work, we challenge the orthodoxy that unsafe data must be discarded. We propose a dialectical approach to alignment, positing that unsafe data encodes rich, domain-specific knowledge critical for nuanced, safe, and informative generation. To operationalize this, we introduce SafeMoE, a Mixture-of-Experts (MoE) framework that isolates unsafe knowledge into domain-specific Low-Rank Adapters (LoRA experts) trained exclusively on harmful corpora. To synthesize safety from these unsafe primitives, we train a lightweight gating network using a minimal, highly curated set of safe-informative responses. During inference, this router dynamically orchestrates the unsafe experts, effectively steering the generation trajectory to harness their deep domain knowledge while strictly enforcing safety constraints. Extensive empirical evaluations across stringent safety benchmarks demonstrate that SafeMoE is not only safer, achieving over a 20% relative improvement in safe response rate (more than a 15% absolute gain), but also produces more informative responses when safety and harmfulness are of paramount concern. Furthermore, the routing mechanism exhibits strong zero-shot generalization to unseen domains and broader safety tasks without domain-specific supervision. Our findings suggest a paradigm shift in alignment: true safety requires not the masking of unsafe knowledge, but its controlled integration.

2606.00685 2026-06-02 cs.LG

Prior-Guided Multi-Omic Transformers for Single-Cell Gene Regulatory Network Inference

先验引导的多组学Transformer用于单细胞基因调控网络推断

Tianyang Xu, Tianci Liu, Niraj Rayamajhi, Ryan Patrick, Kranthi Varala, Ying Li, Jing Gao

Affiliations * Elmore Family School of Electrical and Computer Engineering; Purdue University; Department of Horticulture and Landscape Architecture; School of Biological Sciences

AI summary: Proposes EpiAwareNet, a prior-guided multi-omic Transformer framework that reconstructs gene regulatory networks from paired single-cell data using a gene-peak cross-attention module and a bulk-derived GRN prior.

Comments 12 pages, 6 figures. Accepted to the KDD 2026 AI4Sciences Track

详情
AI中文摘要

基因调控网络(GRN)捕捉转录因子-靶标相互作用,是理解细胞状态调控和疾病的核心。从配对的单细胞转录组和染色质可及性数据重建GRN具有前景但充满挑战:scATAC极其稀疏,且大多数方法依赖于固定的峰值-基因链接和弱监督。我们提出EpiAwareNet,一个先验引导的多组学Transformer框架,仅使用轻量级生物学先验从配对单细胞数据重建GRN。在第一阶段,EpiAwareNet通过基因-峰值交叉注意力模块学习联合基因-峰值表示,实现数据驱动的、基因特异性的可及性信号聚合,而非硬编码的峰值-基因分配。在第二阶段,EpiAwareNet引入批量数据衍生的GRN先验作为噪声正边,在标签稀缺情况下提供弱监督,同时保持对先验噪声的鲁棒性,细化调控分数。在我们的实验中,EpiAwareNet在GRN重建上优于代表性的单组学和多组学基线,并产生更具生物学合理性的GRN,例如改善已知调控相互作用的恢复,这表明当与自适应跨模态表示学习结合时,来自批量数据的轻量级生物学先验可以有效指导单细胞GRN推断。代码和数据将在https://github.com/tianyang-x/EpiAwareNet_pub提供。

英文摘要

Gene regulatory networks (GRNs) capture transcription factor-target interactions and are central to understanding cell-state regulation and disease. Reconstructing GRNs from paired single-cell transcriptomic and chromatin accessibility data is promising but challenging: scATAC is extremely sparse, and most methods rely on fixed peak-to-gene links and weak supervision. We present EpiAwareNet, a prior-guided multi-omic Transformer framework that reconstructs GRNs from paired single-cell data using only lightweight biological priors. In Stage 1, EpiAwareNet learns joint gene-peak representations with a gene-peak cross-attention module, enabling data-driven, gene-specific aggregation of accessibility signals rather than hard-coded peak-to-gene assignments. In Stage 2, EpiAwareNet incorporates a bulk-derived GRN prior as noisy positive edges to provide weak supervision under label scarcity, refining regulatory scores while remaining robust to prior noise. In our experiments, EpiAwareNet improves GRN reconstruction over representative single- and multi-omic baselines and yields GRNs with greater biological plausibility, such as improved recovery of known regulatory interactions, suggesting that lightweight biological priors from bulk data can effectively guide single-cell GRN inference when combined with adaptive cross-modal representation learning. Code and data will be available at https://github.com/tianyang-x/EpiAwareNet_pub.

2606.00683 2026-06-02 cs.CL

OCC-RAG: Optimal Cognitive Core for Faithful Question Answering

OCC-RAG:面向忠实问答的最优认知核心

Maksim Savkin, Mikhail Goncharov, Alexander Gambashidze, Alla Chepurova, Dmitrii Tarasov, Nikita Andriianov, Daria Pugacheva, Vasily Konovalov, Andrey Galichin, Ivan Oseledets

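Affiliations * OCC Team

AI summary: Proposes OCC-RAG, a small language model trained on synthetic multi-context, multi-hop QA data that matches or exceeds general-purpose models 2-6x its size on faithful question answering.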

详情
AI中文摘要

近年来,语言模型的发展由规模定义,每一代模型都将更多世界知识吸收进其参数中。然而,许多实际应用更受益于稳健推理而非广泛的参数化知识。在此背景下,任务专用的小语言模型(SLM)提供了一种原则性的设计选择。我们提出最优认知核心(OCC),一个基于此前提构建的SLM家族。作为OCC的变体,我们提出OCC-RAG,针对基于给定上下文的忠实问答(QA)进行了优化。该任务与OCC设计方法直接对齐,需要在提供的段落上进行多跳推理,同时忽略记忆的知识。为训练OCC-RAG,我们实现了一种新颖的流水线,用于大规模合成多上下文、多跳QA数据,生成了一个包含超过三百万个样本的语料库,针对多跳推理、严格上下文忠实性和校准的弃权进行了优化。我们发布了OCC-RAG-0.6B和OCC-RAG-1.7B,两者均在此语料库上进行了中期训练。这些模型生成带有源引用的结构化推理轨迹,这些引用基于上下文中的逐字引用。通过OCC-RAG,我们证明了紧凑的任务专用SLM可以在多跳推理(HotpotQA、MuSiQue、TAT-QA)、忠实性(ConFiQA)和拒绝(MuSiQue-Un)基准测试中匹配或超越规模为其2-6倍的通用模型。

英文摘要

Recent progress in the development of language models has been defined by scale, with each generation absorbing more of the world's knowledge into its weights. However, many practical applications benefit more from robust reasoning than from extensive parametric knowledge. In this setting, task-specialized small language models (SLMs) offer a principled design choice. We introduce Optimal Cognitive Core (OCC), a family of SLMs built around this premise. As a variant of OCC, we present OCC-RAG, optimized for faithful question answering (QA) grounded in the provided context. This task directly aligns with the OCC design approach, requiring multi-hop reasoning over supplied passages while ignoring memorized knowledge. To train OCC-RAG, we implement a novel pipeline for synthesizing multi-context, multi-hop QA data at scale, producing a corpus of over three million examples targeting multi-hop reasoning, strict context faithfulness, and calibrated abstention. We release OCC-RAG-0.6B and OCC-RAG-1.7B, both mid-trained on this corpus. The models produce structured reasoning traces with source citations grounded in literal quotes from the context. Through OCC-RAG, we demonstrate that compact, task-specialized SLMs can match or exceed general-purpose models 2-6x their size across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un) benchmarks.

2606.00677 2026-06-02 cs.LG

Limits of Resolution Equivariance in Fourier Neural Operators

傅里叶神经算子中的分辨率等变性极限

Alex Colagrande, Paul Caillon, Eva Feillet, Alexandre Allauzen

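Affiliations * Miles Team, LAMSADE, Université Paris Dauphine-PSL; Université Paris-Saclay, CNRS, LISN; ESPCI PSL, Paris

AI summary: Compares two strategies, direct fine-grid inference versus coarse-grid inference followed by Fourier zero-padding upsampling, and finds that Fourier Neural Operators do not reliably generalize across resolutions. A layerwise spectral analysis points to nonlinear aliasing as a key obstacle to zero-shot resolution equivariance.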

Comments Published as a paper at AI&PDE: ICLR 2026 Workshop on AI and Partial Differential Equations. 6 pages, 2 figures

详情
AI中文摘要

傅里叶神经算子通常被认为能够跨空间分辨率泛化,从而可以在粗网格上训练并在细网格上部署。我们通过对比从训练分辨率 $s$ 到测试分辨率 $S>s$ 时的两种推理选择来检验这一假设:直接在 $S$ 上运行 FNO,或者在 $s$ 上运行并通过傅里叶零填充将预测上采样到 $S$。在达西流问题上,我们观察到直接细网格推理并非总是有益的,甚至可能比低网格加上采样基线更差。我们进一步分析了层间频谱,发现在傅里叶截断下,中间表示的能量越来越集中在低频,而高频输出主要由后期的非线性/解码器阶段产生。这为 FNO 在保留少量模式时仍能表现良好,但对分辨率变化敏感的现象提供了机制性解释。我们的发现强调了一个简单但强大的跨分辨率评估基线,并指出非线性混叠是零样本分辨率等变性的关键障碍。

英文摘要

Fourier Neural Operators are often assumed to generalize across spatial resolutions, enabling training on a coarse grid and deployment on a finer grid. We test this assumption by contrasting two inference-time choices when moving from training resolution $s$ to test resolution $S>s$: running FNO directly at $S$, or running at $s$ and upsampling the prediction to $S$ via Fourier zero-padding. On Darcy flow, we observe that direct fine-grid inference is not reliably beneficial and can be worse than the low-grid-plus-upsampling baseline. We further analyze layerwise spectra and find that, under Fourier truncation, intermediate representations increasingly concentrate energy in low frequencies, with high-frequency output produced mainly by late nonlinear/decoder stages. This offers a mechanistic explanation for why FNO can perform well while retaining few modes, yet remain sensitive under resolution shifts. Our findings highlight a simple but strong baseline for cross-resolution evaluation and point to nonlinear aliasing as a key obstacle to zero-shot resolution equivariance.
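
The upsampling baseline mentioned above, Fourier zero-padding, can be sketched in one dimension. This is a minimal illustration under stated assumptions: the signal is real and periodic, the coarse length $s$ is odd (which avoids special handling of the Nyquist mode), and a naive DFT is used so the example needs only the standard library. The function names are hypothetical, and the paper's setting (2D Darcy flow fields) is not reproduced here.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform, O(n^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def fourier_upsample(x, S):
    """Upsample a real periodic signal of odd length s to length S > s."""
    s = len(x)
    assert s % 2 == 1 and S > s
    X = dft(x)
    half = s // 2
    padded = [0j] * S
    padded[:half + 1] = X[:half + 1]   # non-negative frequencies 0..half
    padded[S - half:] = X[s - half:]   # negative frequencies -half..-1
    # Rescale by S / s so amplitudes match the coarse signal.
    return [(S / s) * v.real for v in idft(padded)]

# One period of a cosine sampled on 5 points, upsampled to 10 points.
coarse = [math.cos(2 * math.pi * t / 5) for t in range(5)]
fine = fourier_upsample(coarse, 10)
```

Because the cosine is band-limited below the coarse grid's cutoff, the upsampled values coincide with the cosine sampled directly on the fine grid; no frequencies above the coarse cutoff are created by this step.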

2606.00676 2026-06-02 cs.CV

A Modelling and Evaluation Framework for EuroCrops-Driven Sentinel-2 Crop Segmentation

基于EuroCrops驱动的Sentinel-2作物分割的建模与评估框架

Alexandra Nicoleta Scarlat, Ioana Cristina Plajer, Alexandra Baicoianu

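Affiliations * Transilvania University of Braşov

AI summary: Presents a configurable pipeline that builds semantic segmentation datasets from EuroCrops annotations and Sentinel-2 imagery, then trains a U-Net and evaluates it on in-domain and out-of-domain datasets.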

详情
AI中文摘要

本工作提出了一个可配置的流水线,用于从Sentinel-2影像和EuroCrops地块级标注生成适用于语义分割的农业数据集。该流程通过标签统一、Sentinel-2产品选择、空间对齐、栅格化、图块提取、质量过滤和类别感知样本选择,将异质的矢量作物标注转化为对齐的多光谱图像-掩码对。生成的数据集包含来自五个欧洲国家的67,337个图块,并使用简化的十种作物类别加上背景的分类法。 使用10个Sentinel-2光谱波段和组合损失(类别加权交叉熵和Dice损失)训练了一个带有组归一化的四层U-Net。在基于EuroCrops的内部测试集上,模型实现了平均交并比(mIoU)0.7665、像素准确率0.8693和平均类别准确率0.9072。与光谱和空间上下文随机森林基线相比,U-Net显示了学习多尺度空间表示对于作物分割的重要性。 在未见过的比利时EuroCrops子集、DACIA5和PASTIS上进行了外部评估。结果显示,在外部和跨数据集评估下存在明显的性能差距,尤其是对于具有不同分类法、标注协议、空间覆盖或时间组织的基准。模型更可靠地转移到分类法对齐的优势类别(如玉米和小麦),而对于几个少数类别以及适应后的单日期PASTIS设置,性能仍然有限。这些发现突出了在现实域偏移下使用EuroCrops衍生监督进行Sentinel-2作物分割的潜力和局限性。

英文摘要

This work presents a configurable pipeline for generating semantic-segmentation-ready agricultural datasets from Sentinel-2 imagery and EuroCrops parcel-level annotations. The workflow transforms heterogeneous vector crop annotations into aligned multispectral image-mask pairs through label harmonization, Sentinel-2 product selection, spatial alignment, rasterization, patch extraction, quality filtering, and class-aware sample selection. The generated dataset contains 67,337 patches from five European countries and uses a reduced taxonomy of ten crop classes plus background. A four-level U-Net with Group Normalization was trained using 10 Sentinel-2 spectral bands and a composite loss combining class-weighted cross-entropy and Dice loss. On the internal EuroCrops-based test split, the model achieved a mean Intersection over Union (mIoU) of 0.7665, a pixel accuracy of 0.8693, and a mean class accuracy of 0.9072. Compared with spectral and spatial-context Random Forest baselines, the U-Net showed the importance of learned multi-scale spatial representations for crop segmentation. External evaluation was performed on unseen Belgian EuroCrops subsets, DACIA5, and PASTIS. The results show a clear performance gap under external and cross-dataset evaluation, especially for benchmarks with different taxonomies, annotation protocols, spatial coverage, or temporal organization. The model transfers more reliably to dominant and taxonomically aligned classes such as maize and wheat, while performance remains limited for several minority classes and for the adapted single-date PASTIS setting. These findings highlight both the potential and the limitations of using EuroCrops-derived supervision for Sentinel-2 crop segmentation under realistic domain shifts.
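
For readers unfamiliar with the reported metrics, the sketch below computes mean Intersection over Union and pixel accuracy on flattened label lists. It is a generic illustration, not the authors' evaluation code: the helper names are hypothetical, and skipping classes absent from both prediction and target is an assumption, since the abstract does not say how such classes are handled.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumption: ignore classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious)

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the target."""
    return sum(1 for p, t in zip(pred, target) if p == t) / len(target)

# Toy 6-pixel example with three classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
acc = pixel_accuracy(pred, target)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so mIoU is 0.5 while pixel accuracy is 4/6; the two metrics can diverge noticeably when classes are imbalanced.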

2606.00674 2026-06-02 cs.LG cs.AI

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

结果优化的悖论:LLM中推理捷径的因果信息论界限

Zihan Chen, Yiming Zhang, Wenxiang Geng, Zenghui Ding, Yining Sun

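Affiliations * HFIPS, Chinese Academy of Sciences; University of Science and Technology of China

AI summary: Addresses the brittle out-of-distribution reasoning of LLMs trained with outcome-based reinforcement learning: a causal information-theoretic framework explains Reward-Induced Manifold Collapse, and Process Reward Models are shown to act as topological filters that rule out low-complexity shortcuts.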

详情
AI中文摘要

通过基于结果的强化学习(RL)对齐的大型语言模型(LLM)经常表现出一种关键失败模式:它们在分布内基准测试上取得高性能,但在分布外(OOD)任务上推理能力脆弱。我们将这种现象称为奖励诱导的流形坍缩。我们建立了一个理论框架,将结构因果模型(SCM)和信息瓶颈(IB)原理联系起来,以解释这一悖论。我们将推理定义为高复杂度的因果过程,将捷径学习定义为利用低复杂度的虚假相关性。在随机梯度下降(SGD)的隐式归纳偏置下,只要训练分布允许对真实因果机制进行“马尔可夫筛选”,优化结果奖励的模型就会偏向于捷径解。我们基于语义覆盖度量($\eta$)而非样本量推导了一个新的泛化界限,说明了为什么在同质分布上扩展数据可能无法纠正推理缺陷。我们还表明,过程奖励模型(PRM)作为拓扑滤波器,通过强制执行逐步互信息约束,使得低复杂度的捷径流形不可行。这些结果为过程监督在简单信用分配之外的作用提供了数学基础。

英文摘要

Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve high performance on in-distribution benchmarks while demonstrating brittle reasoning capabilities on out-of-distribution (OOD) tasks. We term this phenomenon Reward-Induced Manifold Collapse. We establish a theoretical framework bridging Structural Causal Models (SCM) and the Information Bottleneck (IB) principle to explain this paradox. We define reasoning as a high-complexity causal process and shortcut learning as the exploitation of low-complexity spurious correlations. Under the implicit inductive bias of Stochastic Gradient Descent (SGD), models optimized for outcome rewards are biased toward shortcut solutions whenever the training distribution allows for a "Markovian Screening" of the true causal mechanism. We derive a new generalization bound based on a Semantic Coverage Measure ($η$) rather than sample size, showing why data scaling on homogeneous distributions may fail to correct reasoning flaws. We also show that Process Reward Models (PRMs) function as Topological Filters, enforcing step-wise mutual information constraints that render the low-complexity shortcut manifold inadmissible. These results provide a mathematical grounding for the role of process supervision beyond simple credit assignment.

2606.00673 2026-06-02 cs.CV

T-CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining

T-CLIP:面向对比语言-图像预训练的热感知

Tayeba Qazi, Ayush Maheshwari, Prerana Mukherjee, Brejesh Lall

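Affiliations * Indian Institute of Technology Delhi, India; NVIDIA AI Technology Center, India; Jawaharlal Nehru University, India

AI summary: To address CLIP's failure to align thermal images with text, proposes IR-Cap, a physics-aware thermal captioning pipeline and dataset, and T-CLIP, a decoupled dual-LoRA framework for scene-level and object-level thermal understanding; T-CLIP improves on all baselines in cross-modal retrieval.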

Comments 34 pages (including references and appendix), 13 figures

详情
AI中文摘要

热成像在低光照和恶劣天气等挑战性条件下提供了可见光谱视觉的强大替代方案,然而像CLIP这样的基础视觉-语言模型由于根本性的热感知差距,无法将热图像与文本描述对齐。我们识别出三个主要挑战:缺乏带标题的热数据集、标准LLM无法推理热现象,以及热成像中的一个关键表示挑战——全局场景上下文和对象级热信号在单个嵌入空间中同时学习时会产生冲突。为了解决这些问题,我们引入了IR-Cap,这是第一个物理感知的热标题生成管道和数据集,在三个公开基准上提供互补的全局和细粒度热描述;以及T-CLIP,一个解耦的双LoRA框架,独立地适配CLIP用于场景级和对象级热理解。T-CLIP在三个热基准的跨模态检索中相对于所有基线取得了一致的改进,并且我们初步展示了其在文本条件热图像生成中的适用性。

英文摘要

Thermal imaging offers a powerful alternative to visible-spectrum vision under challenging conditions such as low illumination and adverse weather, yet foundational vision-language models like CLIP fail to align thermal images with textual descriptions due to a fundamental thermal perception gap. We identify three major challenges: the lack of captioned thermal datasets, the inability of standard LLMs to reason about thermal phenomena, and a key representational challenge in thermal imaging where global scene context and object-level heat signatures conflict when learned together in a single embedding space. To address these, we introduce IR-Cap, the first physics-aware thermal captioning pipeline and dataset providing complementary global and fine-grained thermal descriptions across three public benchmarks, and T-CLIP, a decoupled dual-LoRA framework that independently adapts CLIP for scene-level and object-level thermal understanding. T-CLIP achieves consistent improvements over all baselines across three thermal benchmarks in cross-modal retrieval, and we provide an exploratory demonstration of its applicability to text-conditioned thermal image generation.

2606.00672 2026-06-02 cs.AI cs.LG

Medication-Aware Financial Exploitation Detection for Alzheimer's Patients Using Edge-Aware Interaction Risk Modeling

基于边缘感知交互风险建模的阿尔茨海默病患者药物感知金融剥削检测

Farzana Akter, Lisan Al Amin, Rakib Hossain, Chaitanya Gunupudi, Faisal Quader

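Affiliations * Cognitive Links LLC; University of Maryland, College Park

AI summary: Proposes a medication-aware framework that synchronizes medication adherence with transaction monitoring and uses an interaction-aware logistic model to improve detection of cognitively risky financial events; recall within medication-induced vulnerability windows rises from 0.7442 to 0.9070.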

详情
AI中文摘要

金融剥削对阿尔茨海默病患者日益构成威胁,尤其是在认知稳定性下降期间。传统欺诈检测系统通常仅依赖金融行为,忽略可能改变脆弱性的临床相关因素。本文提出一种药物感知框架,将药物依从性与交易级监控同步,以改进对认知风险金融事件的检测。构建了180名患者45天的混合模拟数据集,产生8,100条药物记录和30,855笔交易。该框架通过纯金融、加性药物感知和交互感知逻辑模型评估金额异常、商家新颖性、交易频率、时间偏差和药物依从性。结果表明,纯金融基线获得了最高的全局F1分数0.5000,但交互感知模型在药物诱导脆弱窗口期内将召回率从0.7442提升至0.9070,并在排名高风险案例中实现了最高平均精度。研究结果表明,药物依从性作为金融风险的上下文修饰因子比作为孤立预测因子更有用。

英文摘要

Financial exploitation is a growing concern for people with Alzheimer's disease, especially during periods of reduced cognitive stability. Conventional fraud detection systems usually rely on financial behavior alone and ignore clinically relevant factors that may alter vulnerability. This paper proposes a medication-aware framework that synchronizes medication adherence with transaction-level monitoring to improve detection of cognitively risky financial events. A hybrid simulation dataset was constructed for 180 patients across 45 days, producing 8,100 medication records and 30,855 transactions. The framework evaluates amount anomaly, vendor novelty, transaction frequency, time deviation, and medication adherence through financial-only, additive medication-aware, and interaction-aware logistic models. Results show that the financial-only baseline obtained the highest global F1-score of 0.5000, but the interaction-aware model improved recall during medication-induced vulnerability windows from 0.7442 to 0.9070 and achieved the highest average precision for ranked high-risk cases. The findings suggest that medication adherence is most useful as a contextual modifier of financial risk rather than as an isolated predictor.
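
The difference between the three model variants named above can be made concrete with a small sketch. All coefficients are hypothetical placeholders chosen for illustration, not fitted values from the paper, and only two of the listed financial features are used for brevity. The point is structural: the additive model gives adherence its own term, while the interaction-aware model lets a missed dose amplify the weight of a financial anomaly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def risk_score(amount_anomaly, vendor_novelty, missed_dose, model="interaction"):
    """Risk score in (0, 1) for one transaction.

    Coefficients are hypothetical placeholders, not values from the paper.
    """
    # Financial-only terms.
    z = -3.0 + 2.0 * amount_anomaly + 1.5 * vendor_novelty
    if model in ("additive", "interaction"):
        # Adherence as an isolated main effect.
        z += 0.5 * missed_dose
    if model == "interaction":
        # Adherence as a contextual modifier of financial risk.
        z += 2.0 * missed_dose * amount_anomaly
    return sigmoid(z)

# Dataset size reported in the abstract: 180 patients x 45 days.
medication_records = 180 * 45

# Same anomalous transaction during a missed-dose window, scored three ways.
scores = {m: risk_score(1.0, 0.0, missed_dose=1, model=m)
          for m in ("financial", "additive", "interaction")}
```

With these placeholder weights the interaction-aware score rises most sharply inside a missed-dose window and is identical to the financial-only score when adherence is normal, which mirrors the abstract's framing of adherence as a contextual modifier.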

2606.00671 2026-06-02 cs.AI cs.CL cs.LG

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

AXIOM: 一种用于可验证数学推理的信任优先神经符号执行架构

Alessio Bruno

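Affiliations * Independent researcher

AI summary: Proposes AXIOM, an architecture that restricts the language model to the role of canonicalizer and delegates derivation and verification to a deterministic computer-algebra-system pipeline, reaching 94.36% correctness at 100% trust on four MATH categories.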

Comments Preprint. 12 pages, 2 figures. Live interactive demo: https://huggingface.co/spaces/Squagghy/axiom-solver. Paper artifact and dataset on Zenodo (concept-DOI): 10.5281/zenodo.20440225

详情
AI中文摘要

我们提出AXIOM,一种用于自然语言数学推理的信任优先神经符号执行架构。在AXIOM中,语言模型严格作为规范化器:它将非正式问题文本重写为狭窄的模式,由确定性计算机代数系统(CAS)管道消费,该管道推导并验证答案,或作为第一类输出弃权。路由遵循问题形状正则表达式、特定模式提示和封闭形式CAS处理器之间的1:1:1对齐,已交付3100多条这样的路由,并在250多个连续提交中零LOST_CORRECT回归。我们在4个MATH类别上报告了实证结果,累积正确率为94.36%(2,592/2,747),可解析问题的信任度为100.00%(在整个2,747条记录基准测试中零自信错误答案),所有四个领域均高于每个领域70/90/70的阈值,每个领域信任度为100.0%,仅规则处理器的中位延迟为1毫秒(在lm-eval算术20,000条记录基准测试中占88%的记录)。该架构通过公共部署已服务约30,000次生产查询。我们强调的贡献不是最终的准确率数字,而是该架构建立的向前动态:生产中的每个记录弃权在一次发布周期后都是候选正确,因为新任务在不回归注册表的情况下组合。支撑这一特性的操作纪律——数学模板分桶、LOST_CORRECT扫描作为回归预言机、可解析优先接入以及弃权作为第一类输出——构成了一个可迁移的框架,适用于数学之外的值得信赖的神经符号系统。

英文摘要

We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, with 3,100+ such routes shipped and zero LOST_CORRECT regressions across 250+ consecutive ship commits. We report empirical results on 4 MATH categories with a cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable problems (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and median latency of 1 ms on rule-only handlers (88% of records on the lm-eval arithmetic 20,000-record benchmark). The architecture has served ~30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: every logged abstain in production is a candidate correct after one ship cycle, since new tasks compose without regressing the registry. The operational discipline behind this property (math-template bucketing, LOST_CORRECT scan as regression oracle, parseable-first onboarding, and abstain as first-class output) constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.
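
The route-then-verify-or-abstain pattern described above can be sketched with a single toy route. This is an illustration under assumptions, not AXIOM's code: the regex, the handler `handle_linear`, and the `ABSTAIN` sentinel are hypothetical, and exact rational arithmetic from the standard library stands in for a real computer algebra system.

```python
import re
from fractions import Fraction

ABSTAIN = "ABSTAIN"

def handle_linear(match):
    """Solve a*x + b = c exactly, then verify by substitution."""
    a, b, c = (Fraction(match.group(k)) for k in ("a", "b", "c"))
    if a == 0:
        return ABSTAIN  # no unique solution, so do not guess
    x = (c - b) / a
    return x if a * x + b == c else ABSTAIN

# One problem-shape regex paired with one closed-form handler.
ROUTES = [
    (re.compile(r"^solve (?P<a>-?\d+)x \+ (?P<b>-?\d+) = (?P<c>-?\d+)$"),
     handle_linear),
]

def answer(canonical_text):
    """Return a verified answer, or ABSTAIN when no route applies."""
    for pattern, handler in ROUTES:
        match = pattern.match(canonical_text)
        if match:
            return handler(match)
    return ABSTAIN

# Consistency check of the figure reported in the abstract.
correctness = round(100 * 2592 / 2747, 2)
```

For example, `answer("solve 2x + 3 = 11")` returns `Fraction(4, 1)`, while an input that matches no route returns `ABSTAIN` instead of a guess. In this design an abstention is a logged gap that a new route can later close without touching existing ones.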

2606.00670 2026-06-02 cs.SD cs.AI

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

超越口部:声学不确定性下视听句子识别中的上半脸情感线索

Zhou Yang, Yueyi Yang

Affiliations * Faculty of Education and Psychology, University of Oulu, Finland; Center for Machine Vision and Signal Analysis, University of Oulu, Finland

AI summary: Uses the CREMA-D corpus and feature-based classifiers to test whether upper-face affective information aids audiovisual sentence recognition under acoustic degradation; upper-face cues improve model calibration and robustness.

详情
AI中文摘要

面对面言语理解本质上是多模态的,整合了声学信号与可见的发音、面部表情、头部运动及其他社交相关线索。虽然视听言语系统通常将口部区域作为语言信息的主要视觉来源,但情感面部表情常被单独视为情感识别目标。本文研究在声学退化条件下,上半脸情感信息是否有助于视听句子识别,超越音频和口部区域线索。使用CREMA-D视听情感言语语料库,我们在四种线索条件下训练基于特征的句子分类器:仅音频(A)、音频加口部/下半脸特征(A+M)、音频加上半脸特征(A+U)以及音频加口部和上半脸特征(A+M+U)。模型在干净音频和粉红噪声条件下(+10 dB、+5 dB和0 dB SNR)进行评估,采用演员独立划分。结果表明,在退化音频下,口部/下半脸特征提供了显著的鲁棒性优势。在0 dB SNR下,A+M相比A准确率提升0.0794,演员自举95%置信区间为[0.0296, 0.1298]。上半脸情感线索表现出更微妙的效果。尽管A+M+U相比A+M的直接准确率增益很小,但全脸模型在不同SNR水平上持续改善校准,并且在噪声条件下优于打乱的上半脸对照。这些发现表明,情感面部信息可能支持声学不确定性下的多模态鲁棒性和置信度估计,而不直接编码词汇内容。更广泛地说,该研究强调了社交表达性面部线索在以人为中心的视听交互系统中的潜在作用。

英文摘要

Face-to-face speech comprehension is inherently multimodal, integrating acoustic signals with visible articulation, facial expression, head motion, and other socially relevant cues. While audiovisual speech systems typically focus on the mouth region as the primary visual source of linguistic information, affective facial expressions are often treated separately as emotion-recognition targets. This paper investigates whether upper-face affective information contributes to audiovisual sentence recognition beyond audio and mouth-region cues, particularly under acoustic degradation. Using the CREMA-D audiovisual emotional speech corpus, we train feature-based sentence classifiers under four cue conditions: audio only (A), audio plus mouth/lower-face features (A+M), audio plus upper-face features (A+U), and audio plus both mouth and upper-face features (A+M+U). Models are evaluated on clean audio and pink-noise conditions at +10 dB, +5 dB, and 0 dB SNR using actor-independent splits. Results show that mouth/lower-face features provide substantial robustness benefits under degraded audio. At 0 dB SNR, A+M improves accuracy over A by 0.0794, with an actor-bootstrap 95% confidence interval of [0.0296, 0.1298]. Upper-face affective cues exhibit a more nuanced effect. Although the direct accuracy gain of A+M+U over A+M is small, full-face models consistently improve calibration across SNR levels and outperform shuffled upper-face controls under noisy conditions. These findings suggest that affective facial information may support multimodal robustness and confidence estimation under acoustic uncertainty without directly encoding lexical content. More broadly, the study highlights the potential role of socially expressive facial cues in human-centered audiovisual interaction systems.
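
The actor-bootstrap confidence interval quoted above can be approximated with a percentile bootstrap that resamples actors rather than individual utterances. The sketch below is a generic version under assumptions: the per-actor accuracy gains are made-up numbers, the function name is hypothetical, and the paper's exact resampling scheme is not specified in the abstract.

```python
import random

def actor_bootstrap_ci(per_actor_gain, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean gain, resampling whole actors."""
    rng = random.Random(seed)
    actors = list(per_actor_gain)
    means = []
    for _ in range(n_boot):
        # Draw as many actors as in the original set, with replacement.
        sample = [per_actor_gain[rng.choice(actors)] for _ in actors]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-actor accuracy gains of A+M over A at 0 dB SNR.
gains = {"actor01": 0.05, "actor02": 0.12, "actor03": 0.08,
         "actor04": 0.03, "actor05": 0.10, "actor06": 0.09}
ci_low, ci_high = actor_bootstrap_ci(gains)
```

Resampling at the actor level keeps all utterances from one speaker together, so the interval reflects variation across speakers, which matches the actor-independent evaluation described in the abstract.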

2606.00664 2026-06-02 cs.RO cs.CV

SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models

SKIP: 用于高效具身世界模型的稀疏关键帧插值范式

Ziheng He, Yixiang Chen, Ning Yang, Zhanqian Wu, Qisen Ma, Yuan Xu, Jiabing Yang, Peiyan Li, Xiangnan Wu, Xiaofeng Wang, Zheng Zhu, Jing Liu, Nianfeng Liu, Yan Huang

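Affiliations * UCAS (University of Chinese Academy of Sciences); CASIA (Institute of Automation, Chinese Academy of Sciences); NJU (Nanjing University); GigaAI; THU (Tsinghua University); FiveAges

AI summary: Proposes the Sparse Keyframe Interpolation Paradigm (SKIP), which identifies task-relevant keyframes, generates only those frames, and interpolates the missing frames conditioned on robot actions. On LIBERO it runs 4.16x faster and reduces FVD by 89%, and policies trained on the generated videos lose little performance.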

Comments 25 pages, 10 figures

详情
AI中文摘要

具身世界模型通过预测机器人动作如何影响周围场景,已成为机器人学中一种有前景的范式。然而,在像素空间中进行 rollout 推理在计算上仍然昂贵,因为长时程操作视频通常必须逐帧生成。这种成本不能通过不加区分地丢弃帧来轻易降低,因为下游策略依赖于对稀疏任务相关事件(如接近、接触、抓取和释放)的完整保留。为了解决这一挑战,我们提出了稀疏关键帧插值范式(SKIP),这是一种事件保留的稀疏到密集框架,避免了密集的逐帧生成。SKIP 首先通过利用机器人感知的多模态特征来识别任务相关的关键帧。然后,它仅用稀疏视频扩散模型合成这些关键帧。一个学习到的间隙预测器和一个动作条件插值器随后根据机器人动作重建缺失的间隔。在 LIBERO 上,SKIP 生成密集 rollouts 的速度比密集基线快 4.16 倍,同时提高了视觉保真度并将聚合 FVD 降低了 89.0%。重要的是,SKIP 生成的视频是有效的策略训练数据。即使它们完全替代真实演示,π_{0.5} 的成功率在 LIBERO 模拟中仅下降 1.3 个百分点,在真实机器人上下降 6.7 个百分点,而完全密集的逐帧生成则下降 48 到 58 个百分点。

英文摘要

Embodied world models have emerged as a promising paradigm in robotics by predicting how robot actions affect the surrounding scene. However, the rollout inference remains computationally expensive in pixel space, as long-horizon manipulation videos typically have to be generated frame by frame. This cost cannot be easily reduced by indiscriminately dropping frames, since downstream policies rely on complete preservation of sparse task-relevant events such as approach, contact, grasp, and release. To address this challenge, we propose Sparse Keyframe Interpolation Paradigm (SKIP), an event-preserving sparse-to-dense framework that avoids dense frame-by-frame generation. SKIP first identifies task-relevant keyframes by leveraging robot-aware multimodal features. It then synthesizes only these keyframes with a sparse video diffusion model. A learned gap predictor and an action-conditioned interpolator subsequently reconstruct the missing intervals according to the robot actions. On LIBERO, SKIP generates dense rollouts $4.16\times$ faster than a dense baseline while improving visual fidelity and reducing aggregate FVD by $89.0\%$. Importantly, SKIP-generated videos are effective policy-training data. Even when they fully replace real demonstrations, $π_{0.5}$ success drops only $1.3$ pp in LIBERO simulation and $6.7$ pp on the real robot, whereas fully dense frame-by-frame generation collapses by $48$ to $58$ pp.

2606.00662 2026-06-02 cs.CV

TAP-JEPA: Frozen Future-Latent Probing and Two-Stage Score Fusion for EPIC-KITCHENS-100 Action Anticipation

TAP-JEPA:冻结的未来潜在探测与两阶段分数融合用于EPIC-KITCHENS-100动作预测

Chaoyang Wang, Lexuan Xu

Affiliations * Beihang University

AI summary: Proposes TAP-JEPA, which uses frozen V-JEPA 2.1 features and two-stage score fusion, and placed second in the EPIC-KITCHENS-100 Action Anticipation Challenge.

Comments The runner-up solution for the Action Anticipation Challenge, EPIC-KITCHENS-100 at the CVPR EgoVis Workshop 2026

详情
AI中文摘要

本报告介绍了TAP-JEPA,我们在EgoVis 2026的EPIC-KITCHENS-100(EK-100)动作预测挑战中获得亚军的提交方案。该任务是从目标动作开始前结束的自我中心视频片段中预测下一个动词、名词以及动词-名词动作。TAP-JEPA没有微调大型视频骨干网络,而是在冻结的V-JEPA 2.1特征上构建了一个紧凑的预测模型:ViT-G/384编码器提取可见的动作前令牌,预训练的潜在预测器从观察到的上下文估计近未来的令牌,两组令牌通过带有动词、名词和动作对特定查询的注意力探针进行融合。在最终提交中,我们使用官方训练集和大部分验证集扩展了监督训练,保留了一小部分用于合理性检查和定性观察,并采用了两阶段分数融合:首先在每个epoch内平均八个独立初始化的探针副本,然后合并epoch 12-20的候选结果,并应用依赖于类别的权重。在官方开放测试排行榜上,我们的sunshinesky条目达到了27.91%的整体动作平均Top-5召回率(MT5R),排名第二,仅比最高分低0.04个百分点。

英文摘要

This report presents TAP-JEPA, our runner-up submission to the EPIC-KITCHENS-100 (EK-100) Action Anticipation Challenge at EgoVis 2026. The task is to anticipate the next verb, noun, and verb-noun action from an egocentric clip that ends before the target action begins. Instead of fine-tuning a large video backbone, TAP-JEPA builds a compact anticipation model on frozen V-JEPA 2.1 features: a ViT-G/384 encoder extracts visible pre-action tokens, the pre-trained latent predictor estimates near-future tokens from the observed context, and both token groups are fused by attentive probes with task-specific queries for verbs, nouns, and action pairs. For the final submission, we expand supervised training with the official training split and most of the validation split, reserving a small subset for sanity checks and qualitative inspection, and adopt a two-stage score fusion that first averages eight independently initialized probe replicas within each epoch and then merges candidates from epochs 12-20 with field-dependent weights. On the official open-testing leaderboard, our sunshinesky entry achieves 27.91 percent overall action Mean Top-5 Recall (MT5R), ranking second and only 0.04 percentage points behind the top score.
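
The two-stage score fusion described above can be sketched for a single output field. The structure (average replicas within an epoch, then take a weighted merge across epochs) follows the abstract; the function names, the toy scores, and the weight values are hypothetical. In the described setup, verbs, nouns, and actions would each get their own weights, which here would mean calling the function once per field.

```python
def mean_scores(score_lists):
    """Element-wise mean of several equal-length score vectors."""
    n = len(score_lists)
    return [sum(column) / n for column in zip(*score_lists)]

def two_stage_fusion(replica_scores_by_epoch, epoch_weights):
    """Fuse class scores for one field (verb, noun, or action)."""
    # Stage 1: average independently initialized probe replicas per epoch.
    per_epoch = {epoch: mean_scores(replicas)
                 for epoch, replicas in replica_scores_by_epoch.items()}
    # Stage 2: weighted merge of the per-epoch candidates.
    total = sum(epoch_weights[epoch] for epoch in per_epoch)
    num_classes = len(next(iter(per_epoch.values())))
    return [sum(epoch_weights[e] * per_epoch[e][c] for e in per_epoch) / total
            for c in range(num_classes)]

# Toy example: two epochs, two replicas each, three classes.
replica_scores = {
    12: [[0.2, 0.5, 0.3], [0.4, 0.3, 0.3]],
    13: [[0.6, 0.2, 0.2], [0.4, 0.4, 0.2]],
}
weights = {12: 1.0, 13: 3.0}  # hypothetical field-specific weights
fused = two_stage_fusion(replica_scores, weights)
```

Averaging replicas first reduces variance from random initialization, and the second stage then decides how much each epoch's candidate contributes to the final ranking.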

2606.00660 2026-06-02 cs.CL

FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search

FineVerify: 通过细粒度自验证扩展智能搜索的测试时计算

James Xu Zhao, Hui Chen, Bryan Hooi, See-Kiong Ng

Affiliations * National University of Singapore(新加坡国立大学)

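AI summary Proposes FineVerify, a fine-grained self-verification framework that decomposes each question into checkable sub-questions, verifies every candidate answer against each sub-question, and aggregates the scores; it consistently outperforms standard scaling baselines on four agentic search benchmarks.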

Comments 8+18 pages, 6 tables, 11 figures

Details
AI Chinese abstract

智能搜索需要语言模型代理探索多个来源并回答复杂的信息寻求问题。扩展测试时计算是改进这些代理的一种有前景的方法,但当前方法可能失败,因为正确答案通常稀疏且基于分数的选择依赖于模型校准。我们提出FineVerify,一个细粒度自验证框架,将每个问题分解为可检查的子问题,对采样候选答案针对每个子问题进行验证,并选择聚合得分最高的候选答案。这种逐项检查结构将选择转化为更简单的局部判断,并在相同的显式标准下产生分数。在四个智能搜索基准和两个模型上,FineVerify始终优于标准扩展基线。仅使用四个采样轨迹,它在GPT-5-mini上平均提高8.2个准确率点,在Gemini-3-flash上提高5.6%。使用12个样本,FineVerify使GPT-5-mini在BrowseComp-Plus上超越前沿GPT-5。除了准确性,FineVerify还生成可解释的验证轨迹,有助于审计基准错误,表明其在检查智能搜索系统方面具有更广泛的应用。代码和数据可在https://github.com/XuZhao0/fineverify获取。

English abstract

Agentic search requires language model agents to explore many sources and answer complex information-seeking questions. Scaling test-time compute is a promising way to improve these agents, but current approaches can fail, because correct answers are often sparse and score-based selection depends on model calibration. We propose FineVerify, a fine-grained self-verification framework that decomposes each question into checkable sub-questions, verifies sampled candidates against each sub-question, and selects the candidate with the highest aggregated score. This per-check structure turns selection into simpler local judgments and produces scores under the same explicit criteria. Across four agentic search benchmarks and two models, FineVerify consistently outperforms standard scaling baselines. With only four sampled trajectories, it improves GPT-5-mini by 8.2 accuracy points and Gemini-3-flash by 5.6% on average. With 12 samples, FineVerify enables GPT-5-mini to surpass frontier GPT-5 on BrowseComp-Plus. Beyond accuracy, FineVerify produces interpretable verification traces that help audit benchmark errors, suggesting broader applications for inspecting agentic search systems. Code and data are available at https://github.com/XuZhao0/fineverify
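The decompose, verify, aggregate, and select loop described in the abstract can be sketched as follows. This is a minimal illustration, not the FineVerify implementation: `select_candidate`, the `verify` callable, and the use of a plain mean as the aggregate are all assumptions. In the paper the verifier is the language model itself, and the abstract does not specify how per-check scores are aggregated.

```python
from typing import Callable, List, Sequence, Tuple

# One trace entry: (candidate, per-check scores, aggregated score).
Trace = Tuple[str, List[float], float]


def select_candidate(
    candidates: Sequence[str],
    checks: Sequence[str],
    verify: Callable[[str, str], float],
) -> Tuple[str, List[Trace]]:
    """Score every candidate against every check and return the best one.

    verify(candidate, check) is a hypothetical stand-in for a model call
    that returns a score for one candidate on one sub-question.
    """
    if not candidates or not checks:
        raise ValueError("need at least one candidate and one check")
    traces: List[Trace] = []
    for candidate in candidates:
        per_check = [float(verify(candidate, check)) for check in checks]
        # Mean aggregation is an assumption of this sketch.
        traces.append((candidate, per_check, sum(per_check) / len(per_check)))
    # max() keeps the first candidate on ties.
    best = max(traces, key=lambda trace: trace[2])
    return best[0], traces
```

Returning the per-check scores alongside the winner mirrors the abstract's point that the verification traces stay inspectable after selection.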

2606.00658 2026-06-02 cs.CV cs.AI

Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models

Wan2.2双专家视频扩散模型的协同少步蒸馏与低位量化

Jinyang Du, Shenghao Jin, Ziqian Xu, Ruihao Gong, Shiqiao Gu, Yang Yong, Jinyang Guo, Xianglong Liu

Affiliations * IEEE ICME 2026 GCC Low-Bit-width Large Model Quantization Challenge(GCC 低精度大模型量化挑战)

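AI summary Presents a deployment-oriented compression pipeline for the Wan2.2-T2V-A14B video diffusion model that combines few-step distribution-matching distillation with low-bit quantization; it calibrates the two expert denoising branches separately, protects sensitive entrance layers, and uses a HiF4-style low-bit representation to reduce compute cost while preserving quality.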

Details
AI Chinese abstract

大型视频扩散模型实现了强大的视觉质量,但由于每个样本需要大量去噪步骤和较大的驻留参数足迹,部署成本仍然很高。本文研究了一种面向部署的压缩流程,针对Wan2.2-T2V-A14B模型,结合少步分布匹配蒸馏与低位量化。该流程遵循模型的双专家去噪路线,分别校准高噪声和低噪声分支,保护敏感入口层,并使用HiF4风格的低位表示以改善动态范围覆盖。量化在蒸馏后的少步学生模型上校准,而非在原始的长步轨迹上,从而减少推理过程中的激活分布不匹配。所提出的协同设计使量化模型保持接近相同步数的全精度模型,并在8步和20步时平均超越原始全精度基线。在测试配置中,20步设置提供了最佳的质量-效率权衡。

English abstract

Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps and a large resident parameter footprint. This paper studies a deployment-oriented compression pipeline for Wan2.2-T2V-A14B by combining few-step distribution-matching distillation with low-bit quantization. The pipeline follows the model's dual-expert denoising route, calibrates the high-noise and low-noise branches separately, protects sensitive entrance layers, and uses HiF4-style low-bit representation to improve dynamic-range coverage. Quantization is calibrated on the distilled few-step student rather than on the original long-step trajectory, reducing activation-distribution mismatch during inference. The proposed co-design keeps the quantized model close to the same-step full-precision model and surpasses the original full-precision baseline at 8 and 20 steps on average. The 20-step setting gives the best quality-efficiency trade-off in the tested configurations.
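Calibrating the high-noise and low-noise branches separately can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the `boundary` threshold that routes a sample to a branch, and the symmetric absmax signed-integer scheme, which stands in for the HiF4-style representation whose details the abstract does not give.

```python
def absmax_scale(values, bits=4):
    """Symmetric scale mapping the largest magnitude onto the top integer level."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    peak = max(abs(v) for v in values)
    return peak / qmax if peak > 0 else 1.0


def quantize(values, scale, bits=4):
    """Round to the nearest integer level and clamp to the signed range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]


def dequantize(levels, scale):
    """Map integer levels back to approximate real values."""
    return [level * scale for level in levels]


def calibrate_per_branch(samples, boundary):
    """Collect activations per branch and compute one scale for each.

    samples yields (noise_level, activations) pairs; boundary is a
    hypothetical threshold deciding which expert handles a step.
    """
    buckets = {"high_noise": [], "low_noise": []}
    for noise_level, activations in samples:
        branch = "high_noise" if noise_level >= boundary else "low_noise"
        buckets[branch].extend(activations)
    return {name: absmax_scale(vals) for name, vals in buckets.items() if vals}
```

With a shared scale, the branch with the smaller activation range would use only a few of the available integer levels; separate scales avoid that, which is the motivation this sketch is meant to show.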