arXivDaily: arXiv daily academic digest, updated Monday through Friday

2606.08671 2026-06-09 cs.LG New submission

SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History

Zhiwei Li, Yong Hu

Affiliations: WeChat, Tencent Inc., China

AI summary: Proposes SkillHone, a harness that records diagnoses, revisions, and evidence in a persistent decision history to enable continual evolution of agent skills, outperforming existing methods on open-web deep-research benchmarks.

Comments
Work in progress
English abstract

Agent skills extend language-model agents with task-specific procedures, scripts, and references, but the tasks and environments they target continually change. Existing methods improve skills in bounded runs and retain only the final artifact, discarding the decision history that later agents need to interpret prior revisions, evaluations, and rejected alternatives. We introduce SkillHone, a harness for continual agent skill evolution grounded in persistent decision history. SkillHone pairs skill revisions with evaluation-side evidence that supplies practice feedback, recording structured histories of diagnoses, revisions, evidence, and outcomes. Role-separated subagents run candidate skills on practice probes with redacted reporting and propose revisions informed by prior decisions, enabling cross-session refinement without rediscovering past rationale. We evaluate SkillHone on deep-research benchmarks in a raw open-web setting, where agents are not given an integrated search stack and must organize retrieval through portable skills. We compare against a deep-research agent backed by commercial retrieval services. With Qwen3.6-35B-A3B as the evaluation-time backbone, the resulting skills outperform the deep-research agent by 15.8 points on GAIA and 3.2 points on WebWalkerQA-EN, while also exceeding prior skill-evolution methods.

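2606.08670 2026-06-09 cs.CV New submission

WaveDiT: Distribution-Aware Wavelet Flow Matching for Efficient 3D Brain MRI Synthesis

Danilo Danese, Angela Lombardi, Giuseppe Fasano, Matteo Attimonelli, Tommaso Di Noia

Affiliations: Politecnico di Bari; Sapienza University of Rome

AI summary: Proposes WaveDiT, a conditional flow matching framework that operates in the coefficient space of a 3D Haar discrete wavelet transform. It combines factorized spatio-depth attention with band-wise heteroscedastic uncertainty modeling based on higher-order wavelet statistics, synthesizes full-resolution 3D brain MRI efficiently on a single GPU, and outperforms existing methods on distribution alignment and downstream tasks.
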
Comments
Provisionally accepted at MICCAI 2026
English abstract

Large and demographically balanced datasets are essential for reliable neuroimaging biomarkers. Full-resolution 3D brain MRI synthesis can support data augmentation in this setting, but existing approaches either incur prohibitive computational cost at volumetric scale or rely on lossy latent compression that may compromise anatomical detail. As a result, practical 3D generative augmentation often requires specialized compute infrastructure. We propose WaveDiT, a conditional flow matching framework operating in the coefficient space of a 3D Haar Discrete Wavelet Transform. The model combines factorized spatio-depth attention with band-wise heteroscedastic uncertainty modeling derived from higher-order wavelet statistics. Predicted log-variance is integrated directly into both the flow objective and conditioning pathway, enabling adaptive precision consistent with the heavy-tailed and input-dependent variance structure of anatomical detail. This formulation supports full-resolution 3D synthesis under practical memory and time constraints on a single modern GPU. Evaluation on a multi-site cohort demonstrates improved alignment between generated and real MRI distributions, together with enhanced downstream brain age prediction and region-level anatomical agreement relative to diffusion, latent, and wavelet-based baselines. Code is available at https://github.com/sisinflab/WaveDiT
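
As a rough illustration of the coefficient space mentioned in the abstract, the sketch below applies a single-level orthonormal 3D Haar transform to one 2x2x2 block of voxels. The function name `haar3d_block` and the block-wise formulation are assumptions made for illustration; the abstract does not describe how WaveDiT implements the transform.

```python
import math
from itertools import product

def haar3d_block(block):
    """Orthonormal single-level 3D Haar transform of one 2x2x2 block.

    block[i][j][k] holds the voxel at offset (i, j, k). The result maps a
    sub-band key such as "LLL" or "HLH" to one coefficient. Illustrative
    helper only, not taken from the WaveDiT code base.
    """
    scale = 1.0 / math.sqrt(8.0)  # factor 1/sqrt(2) per axis, three axes
    coeffs = {}
    for a, b, c in product((0, 1), repeat=3):
        total = 0.0
        for i, j, k in product((0, 1), repeat=3):
            # A high-pass filter (bit = 1) along an axis flips the sign
            # of the second sample on that axis.
            sign = (-1) ** (a * i + b * j + c * k)
            total += sign * block[i][j][k]
        key = "".join("H" if bit else "L" for bit in (a, b, c))
        coeffs[key] = scale * total
    return coeffs

# A constant block has energy only in the low-pass (LLL) band.
flat = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
bands = haar3d_block(flat)

# A varying block spreads energy over the eight sub-bands.
ramp = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
ramp_bands = haar3d_block(ramp)
```

Because the transform is orthonormal, the sum of squared coefficients equals the sum of squared voxel values, so no information is lost at this step, unlike a lossy latent compression.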

2606.08669 2026-06-09 cs.SD cs.LG New submission

A Comparison of SSL-Based Feature Extractors and Back-End Classifiers for Spoofing Detection: A Multi-Corpus Training and Cross-Linguistic Analysis

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

Affiliations: Avignon Universite; EURECOM

AI summary: Benchmarks four self-supervised learning feature extractors paired with four back-end classifiers for spoofing detection, using multi-corpus training and cross-linguistic analysis. The study exposes a domain bias in the ASVspoof 5 dataset and finds that fine-tuning on just 8 hours of target-language data improves detection robustness.

English abstract

Voice biometric systems face growing threats from spoofing attacks, yet the evaluation of detection models remains inconsistent across datasets. To investigate these unpredictable fluctuations, we conduct a comprehensive benchmark of four self-supervised learning feature extractors paired with four back-end classifiers. We compare the hierarchical local feature extraction of ResNet with the global sequence and relational modeling of attention and graph-based back-ends. Through multi-corpus training across three scenarios and six evaluation datasets, our empirical analysis yields two critical findings. First, we expose a domain bias within the ASVspoof 5 dataset, showing that naive data scaling actively degrades performance. Second, our cross-linguistic analysis reveals that fine-tuning with just 8 hours of target-language data enhances detection robustness. Together, these findings emphasize the critical need for domain-aware and language-specific adaptation in spoofing detection.

2606.08666 2026-06-09 cs.RO New submission

Language as a Sensor: Calibrated Spatial Belief Estimation in 3D Scenes from Natural Language

Aryan Naveen, Jason Xinyu Liu, Luca Carlone, Andreea Bobu

Affiliations: MIT Laboratory for Information & Decision Systems; MIT Computer Science & Artificial Intelligence Laboratory

AI summary: Proposes a Language Sensor Model (LSM) that converts natural-language descriptions into calibrated spatial distributions and fuses them into the probabilistic VL-Map framework for more accurate target localization.

Comments
18 pages, 7 figures, 3 tables
English abstract

Robots deployed in human-centric environments routinely receive natural-language descriptions of spatial information ("I left my backpack on the table") that reference parts of the world beyond their perceptual field of view. Traditional metric-semantic mapping ignores this signal, while off-the-shelf multimodal models remain limited in 3D spatial reasoning and are not directly amenable to fusion with other sensor modalities. To convert language observations into a calibrated spatial distribution, we train a Language Sensor Model (LSM) that maps each utterance and its scene-graph context to a multimodal distribution, with mixture weights encoding referential ambiguity (e.g., "which table") and component covariances encoding spatial uncertainty (e.g., where "on the table" the target lies). We then introduce VL-Map (Vision-Language Metric-Semantic Mapping), a probabilistic framework that treats these language predictions as stochastic observations and fuses them with onboard perception within a unified belief map. On the VLA-3D benchmark as well as on a real-world mobile robot, LSM is the only language predictor whose covariance estimates remain within the calibrated regime; fused into VL-Map, it leads to more accurate predictions of the target object location (~70% more probability mass on the true target compared to the strongest foundation-model baseline).
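
To make the roles of mixture weights and component covariances concrete, here is a minimal sketch of a 2D Gaussian mixture belief over a target location. It is a simplification under stated assumptions: covariances are isotropic (a single standard deviation per component), the numbers are invented, and none of the function names come from the paper.

```python
import math

def isotropic_gaussian_pdf(point, mean, sigma):
    """Density of a 2D isotropic Gaussian with standard deviation sigma."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return norm * math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))

def mixture_pdf(point, components):
    """Density of a mixture; each component is (weight, mean, sigma)."""
    return sum(w * isotropic_gaussian_pdf(point, m, s) for w, m, s in components)

def responsibilities(point, components):
    """Posterior probability that the point belongs to each component."""
    joint = [w * isotropic_gaussian_pdf(point, m, s) for w, m, s in components]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical belief for "the backpack is on the table" with two tables.
belief = [
    (0.7, (1.0, 2.0), 0.3),  # table A: more likely referent, tight spread
    (0.3, (4.0, 0.5), 0.6),  # table B: less likely referent, wider spread
]
posterior = responsibilities((1.1, 2.1), belief)
```

In this toy setup the weights (0.7, 0.3) stand in for referential ambiguity between the two tables, while each `sigma` stands in for uncertainty about where on a table the object lies.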

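2606.08655 2026-06-09 cs.RO cs.CV New submission

PhysGraph: A Physics-aware 3D Scene Graph for Perception and Reasoning

Haoyu Li, Aaron Thomas, Shuyan Zhou, Xianyi Cheng

Affiliations: Duke University

AI summary: Proposes PhysGraph, a framework that combines symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes, achieving state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction.
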
English abstract

To perform a wide range of daily tasks, robots need to construct a 3D representation that is semantically rich, physically grounded, and structured enough to support task planning and affordance prediction. However, existing approaches primarily focus on semantic retrieval, often overlooking physical and kinematic factors. Methods that attempt to model physical properties typically rely on narrow training sets or single-object modeling, limiting scalability and generalization across diverse object types. To address these challenges, we present PhysGraph, a framework that unifies symbolic reasoning with structured 3D geometry to model kinematic and physical properties in cluttered scenes. Given RGB-D observations, PhysGraph reconstructs object-centric 3D geometry and associates object instances across views. It then decomposes objects into functional parts and infers materials and articulations through visual reasoning. Evaluated on both synthetic and real-world datasets, PhysGraph achieves state-of-the-art results in semantic segmentation, multi-object mass estimation, and articulation prediction. With its simple yet effective design, PhysGraph produces physically consistent and semantically structured scene graphs, serving as a structured 3D representation for downstream tasks such as constraint-aware 3D affordance prediction and real-to-sim transfer, both of which are demonstrated in our experiments.

2606.08654 2026-06-09 cs.LG cs.NA math.AP math.NA stat.AP New submission

Operator learning for the 2D incompressible Navier-Stokes equations: a conformal prediction approach in the data-scarce regime

Weinan Wang, Bowen Gang, Hao Deng

Affiliations: University of Oklahoma; Fudan University

AI summary: Addresses uncertainty quantification for operator learning in the data-scarce regime with a perturbation-based conformal prediction framework. On a 2D Navier-Stokes benchmark it produces narrower conformal bands than existing methods while maintaining the target coverage.

English abstract

In this paper, we propose a perturbation-based conformal prediction framework for uncertainty quantification in operator learning, with a focus on the 2D Navier-Stokes equations. While neural operators provide fast surrogates for expensive PDE solvers, they do not by themselves provide calibrated uncertainty for spatiotemporal field predictions. Our approach wraps a trained Fourier Neural Operator (FNO) with split conformal prediction and constructs the local uncertainty scale by comparing the predictions of two operators trained on nearly identical datasets: one on the original labels and one on labels perturbed by small Gaussian noise. We consider this procedure in the data-scarce regime, where the total label budget is fixed and methods that require a separate uncertainty network must divide training data between multiple models. On the 2D Navier-Stokes benchmark, the perturbation-based method produces substantially narrower conformal bands than existing methods under matched total data budgets while maintaining the target simultaneous coverage. These results suggest that perturbation sensitivity is a practical and sample-efficient uncertainty proxy for conformalized neural operators.
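
The general recipe of split conformal prediction with a locally adaptive scale can be sketched as follows. This is a scalar, pointwise simplification and not the paper's procedure: the abstract targets simultaneous coverage over spatiotemporal fields, and the exact score it uses is not stated. Here the scale is assumed to be the absolute disagreement between a model and its perturbed-label twin, and all function names are hypothetical.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-indexed order statistic
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

def calibrate(cal_targets, cal_preds, cal_preds_perturbed, alpha, eps=1e-8):
    """Quantile of residuals normalized by a perturbation-sensitivity scale."""
    scores = []
    for y, f, g in zip(cal_targets, cal_preds, cal_preds_perturbed):
        scale = abs(f - g) + eps  # disagreement between the two models
        scores.append(abs(y - f) / scale)
    return conformal_quantile(scores, alpha)

def predict_band(pred, pred_perturbed, q, eps=1e-8):
    """Return the band pred +/- q * scale for a new input."""
    half_width = q * (abs(pred - pred_perturbed) + eps)
    return pred - half_width, pred + half_width

# Invented calibration data: targets, base predictions, perturbed predictions.
targets = [1.0, 2.0, 3.0, 4.0, 5.0]
base = [1.1, 1.9, 3.2, 4.1, 5.0]
perturbed = [1.2, 2.0, 3.0, 4.0, 5.1]
q_hat = calibrate(targets, base, perturbed, alpha=0.2)
low, high = predict_band(2.0, 2.5, 2.0, eps=0.0)
```

The design point illustrated here is that no second uncertainty network is trained on held-out labels: both models see the same inputs, which is why the abstract frames the approach as sample-efficient under a fixed label budget.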

2606.08653 2026-06-09 cs.CV cs.AI cs.LG cs.RO New submission

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du

Affiliations: University of Chinese Academy of Sciences; Hebei Key Laboratory of Cognitive Intelligence, Xiong’an Institute of Innovation; Hebei University of Technology; Beijing Information Science and Technology University

AI summary: Proposes FiberTune, which uses an online action probe to filter out action-predictive feature directions, aligns the remaining visual residuals to a frozen teacher, and regularizes their effective rank, improving VLA policy performance in six simulation settings and on a physical task.

Comments
Project page: https://fibertune.github.io/
English abstract

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
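
Two building blocks named in the abstract, filtering a feature direction and measuring effective rank, can be sketched in a few lines. Both functions are illustrative assumptions: the filter removes a single given direction by orthogonal projection, and effective rank follows one common definition (the exponential of the Shannon entropy of the normalized singular values). The abstract does not specify which definition or filtering scheme FiberTune uses.

```python
import math

def remove_direction(vector, direction):
    """Project vector onto the orthogonal complement of direction."""
    norm_sq = sum(d * d for d in direction)
    coeff = sum(v * d for v, d in zip(vector, direction)) / norm_sq
    return [v - coeff * d for v, d in zip(vector, direction)]

def effective_rank(singular_values):
    """exp(entropy) of the normalized singular-value distribution."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Removing the first axis leaves only the component orthogonal to it.
residual = remove_direction([3.0, 4.0], [1.0, 0.0])

# Equal singular values give full effective rank; one dominant value gives ~1.
spread = effective_rank([1.0, 1.0, 1.0, 1.0])
collapsed = effective_rank([5.0, 0.0])
```

Under this definition, a representation whose variance collapses onto a few directions has a low effective rank, which is the kind of collapse a rank regularizer is meant to discourage.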

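2606.08644 2026-06-09 cs.CL cs.AI New submission

A retrieval conditioned rebinding circuit for dynamic entity tracking in large language models

Soyoung Oh, Vera Demberg

Affiliations: Saarland University; Max Planck Institute for Informatics

AI summary: Uses causal interventions to identify a retrieval-conditioned rebinding mechanism behind dynamic state tracking in large language models: a compact attention-head circuit that encodes binding information and reinstates it at readout, with a representational signature that differs across model families.
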
English abstract

To interpret context correctly and retrieve relevant information, large language models must bind entities to their attributes and update these bindings as state changes. We analyze how LLMs implement this binding process in dynamic state tracking. Using causal interventions, we identify a retrieval-conditioned rebinding mechanism, a compact attention-head circuit that encodes swap-relevant binding information and reinstates it at readout. Across Gemma and Llama models, this circuit supports rebinding behavior, but the representational signature of the mechanism differs across model families. In Gemma models, the binding signature is clearly expressed in the query/key subspaces of the relevant attention heads, whereas in Llama models, the binding information is carried primarily in key vectors. Overall, our results reveal an interpretable mechanism for context-dependent state tracking in LLMs.

2606.08641 2026-06-09 cs.CV New submission

Learnable Token Sparsification for Efficient Gigapixel Whole Slide Image Reasoning

Jingzhi Chen, Landi He, Zhuo Chen, Shawn Young, Lijian Xu

Affiliations: Shenzhen University of Advanced Technology

AI summary: Addresses the excessive number of whole slide image tokens in vision language models with a learnable sparsification method, trained through a SparseLearn component and a differentiable Soft Top-K operator. At inference only 32 tokens are kept, reaching 73.32% accuracy on SlideBench.

English abstract

The processing of gigapixel whole slide images within vision language models faces a major difficulty due to an excessive number of visual tokens. Existing solutions typically rely on spatial downsampling or heuristic pruning strategies that operate without training, and these methods often discard subtle but clinically meaningful patterns because pathological evidence is scattered irregularly across the tissue. To overcome this limitation, we reformulate token reduction in whole slide images as a trainable sparsification problem, allowing the model to learn an optimal selection strategy instead of following fixed heuristics. We propose a decoupled routing architecture. To enable gradient propagation through the nondifferentiable pruning operation during training, we introduce a component called SparseLearn. This component uses a variance-preserving noise gate that regulates the information flow of each patch via a differentiable Soft Top-K operator, together with a diagonal attention denoiser that recovers perturbed representations without leaking spatial information. At inference time, the SparseLearn module is entirely discarded, and the trained scorer applies a deterministic Hard Top-K operator to keep only the highest scoring 32 tokens, incurring no extra computation. By compressing the visual sequence down to a sparse set of just 32 tokens, which represents as little as 0.78% of the original length, our framework achieves 73.32% overall accuracy on SlideBench (TCGA), consistently surpassing sampling-based baselines and general-purpose vision language models. It also demonstrates strong zero shot generalization on SlideBench (BCNB) and WSI VQA*. By resolving the visual context bottleneck and preventing the dilution of sparse diagnostic evidence, this work provides a highly efficient paradigm for end to end gigapixel whole slide image reasoning.
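
The contrast between a hard and a soft top-k selection can be sketched as below. The soft version is one simple sigmoid relaxation picked for illustration; the abstract does not define the paper's Soft Top-K operator, and both function names are hypothetical. As a side calculation, 32 / 4096 is about 0.78%, so the quoted ratio is consistent with an original sequence of roughly 4096 tokens, although the abstract does not state that length.

```python
import math

def hard_top_k(scores, k):
    """Indices of the k highest scores, returned in original token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def soft_top_k_gates(scores, k, temperature=0.1):
    """Sigmoid relaxation: gates near 1 for the top k, near 0 otherwise.

    The threshold sits midway between the k-th and (k+1)-th largest
    scores, so this sketch requires 0 < k < len(scores).
    """
    ordered = sorted(scores, reverse=True)
    threshold = 0.5 * (ordered[k - 1] + ordered[k])
    return [1.0 / (1.0 + math.exp(-(s - threshold) / temperature)) for s in scores]

scores = [0.1, 0.9, 0.5, 0.7]
kept = hard_top_k(scores, 2)         # discrete choice, used at inference
gates = soft_top_k_gates(scores, 2)  # smooth weights, usable for gradients
kept_fraction_percent = round(32 / 4096 * 100, 2)
```

The hard selection has zero gradient almost everywhere with respect to the scores, which is why a smooth surrogate is needed during training and can be dropped afterwards.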

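2606.08634 2026-06-09 cs.CV New submission

SSAFE: Simple and Strong AI-Generated Image Detection via Frozen Vision Encoders

Seunghyun Lee, Byoungkwon Kim, Jaehyun Nam, Kyungmin Lee, Jinwoo Shin

Affiliations: KAIST; Google Cloud AI

AI summary: Finds that frozen multimodal vision encoders naturally separate real and synthetic images in their embedding space, so a linear classifier achieves strong detection performance. It also proposes a representation-aware data curation strategy that trains on only 10K images and performs well across multiple benchmarks.
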
Comments
Preprint. 22 pages, 10 figures, supplementary material included
English abstract

The rapid advancement of generative models has blurred the boundary between synthetic and real imagery, creating an urgent need for reliable deepfake detection. Yet most existing approaches rely on massive real-fake datasets, which are increasingly difficult to maintain as new generators continue to emerge. In this work, we investigate how much information about image authenticity is already encoded in modern multimodal vision representations. We find that frozen multimodal encoders naturally separate real and synthetic images in their embedding space, enabling a simple linear classifier to achieve strong performance without task-specific fine-tuning. Motivated by this observation, we develop a representation-aware data curation strategy that selects a compact set of representative generators for training. The resulting training set contains only 10K images, compared to 288K in AIGIBench and 4M in OpenFake, while improving robustness to unseen generators and distribution shifts. We additionally introduce RealWorldBench, a benchmark consisting of modern camera photographs, contemporary stock images, and outputs from recent commercial generators. Experiments across multiple benchmarks show that combining frozen multimodal representations with carefully curated training data provides a simple and effective approach to AI-generated image detection.
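
The idea of a linear classifier on frozen embeddings can be illustrated with a small logistic-regression probe. Everything here is a toy assumption: the 2D "embeddings" are invented stand-ins for encoder outputs, the encoder itself is omitted, and the training loop is plain batch gradient descent rather than whatever the authors used.

```python
import math

def train_linear_probe(embeddings, labels, lr=0.5, steps=200):
    """Logistic regression trained by batch gradient descent."""
    dim = len(embeddings[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(embeddings)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            prob = 1.0 / (1.0 + math.exp(-logit))
            err = prob - y  # gradient of the log loss w.r.t. the logit
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return 1 (synthetic) if the logit is positive, else 0 (real)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Hypothetical frozen-encoder embeddings; only the probe is trained.
real = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9)]
synthetic = [(-1.0, -0.8), (-1.1, -1.2), (-0.9, -1.0)]
X = real + synthetic
y = [0] * len(real) + [1] * len(synthetic)
w, b = train_linear_probe(X, y)
```

Because only the weight vector and bias are learned, the probe has very few parameters, which fits the abstract's point that a compact training set can suffice when the representation already separates the two classes.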

2606.08633 2026-06-09 cs.AI cs.LG New submission

Towards Long-Horizon Vessel Trajectory and Destination Forecasting with Reasoning Large Language Models

Hongwei Wang, Miao Zhou, Fengde Wang, Yuting Wang, Jiewen Yu, Jun-Yan He, Bohao Qu, Wanbing Zhang, Xiuju Fu, Qing Guo, Zipei Fan, Yingying Xing, Yi Yuan

Affiliations: Institute of High Performance Computing (IHPC), A*STAR, Singapore; The Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University; Meituan Inc., Shenzhen, China; Centre for Frontier AI Research (CFAR), A*STAR, Singapore; Nankai University; School of Artificial Intelligence, Jilin University

AI summary: Proposes a Maritime LLM post-training framework based on Reinforcement Learning with Verifiable Reward (RLVR). Trajectories are converted into semantic text, and physical-validity constraints together with hierarchical matching improve long-horizon (30-day) forecasting accuracy; the 4B model performs best.

Comments
The IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026, Naples, Italy
English abstract

Long-horizon maritime trajectory prediction is important for shipping management, logistics planning, and maritime risk analysis, yet month-level forecasting remains insufficiently studied. Existing deep learning methods mainly focus on short- and mid-term coordinate extrapolation and often struggle to preserve route feasibility and destination correctness over extended horizons. This paper investigates joint long-horizon vessel trajectory and destination forecasting with reasoning-capable large language models, and develops a Maritime LLM post-training framework based on Reinforcement Learning with Verifiable Reward (RLVR). An AIS-based benchmark is constructed with 60-day historical trajectories and 30-day forecasting horizons, where trajectories are converted into semantic textual representations for RL prompt construction. RLVR aligns LLMs with maritime forecasting objectives by enforcing physical validity, providing early-weighted trajectory supervision, and evaluating destination correctness through hierarchical matching and curriculum learning. Experimental results show that RLVR-trained LLMs substantially improve over zero-shot LLMs and representative deep learning baselines, especially on destination-related metrics. Among the evaluated RLVR-trained variants, 4B LLMs achieve the best overall performance, suggesting that reward-compatible optimization and task-specific capacity matching are more important than simply using larger 8B or 14B LLMs. The results also show that LSTM remains a strong deep learning baseline under limited fine-tuning data, while Transformer-style spatio-temporal models typically require larger datasets and richer structured inputs. Overall, this work advances semantic, verifier-aligned maritime forecasting for operational decision support.

2606.08630 2026-06-09 cs.LG cs.AI New submission

Tyan-WP: A Wind Power Foundation Model for Ultra-Short-Term Probabilistic Forecasting

Jiahui Huang, Ao Luo, Lei Liu, Hongwei Zhao, Tengyuan Liu, Ruibo Guo, Bo Wang, Zhao Wang, Bin Li

Affiliations: School of Information Science and Technology, University of Science and Technology of China; China Electric Power Research Institute

AI summary: Proposes Tyan-WP, described as the first wind power foundation model. It uses static site embedding and a power-aware meteorological fusion module to deliver ultra-short-term probabilistic forecasts in zero-shot settings, significantly outperforming conventional models.

English abstract

Global wind power capacity, especially in China, is booming, with new farms spanning diverse terrains and climates. The industry urgently needs accurate wind power foundation models to shorten commissioning and accelerate grid connection. This is because site-specific time series models (TSMs) are not well suited to data-scarce scenarios and generalize poorly, while generic large time series models (LTSMs) are mostly limited to univariate inputs and cannot fully exploit static site attributes or the dependencies between power and meteorological covariates, leading to insufficient accuracy. To fill this gap, we propose Tyan-WP, the first wind power foundation model for ultra-short-term probabilistic forecasting. Pretrained on a large-scale wind power dataset covering more than 126,000 U.S. sites over seven years, Tyan-WP further improves zero-shot forecasting through two domain-specific module designs: static site embedding using coordinate, terrain, and ecoregion metadata, and a power-aware meteorological fusion (PAMF) module that models interactions between historical power and meteorological covariates. Under a unified evaluation protocol, Tyan-WP surpasses eight site-specific supervised TSMs on 10 in-domain sites and outperforms eleven generic LTSMs on 127 in-domain sites, reducing MAE by 19.9%, RMSE by 16.6%, CRPS by 22.2%, and AQL by 21.7%, while raising R^2 by 16.7%. It further demonstrates strong cross-geography generalization on six real U.K. sites. These results show that the wind power foundation model can achieve accurate zero-shot forecasting without target-site training, providing a practical pathway for rapid turbine onboarding and probabilistic risk management at new wind farms.
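
For readers unfamiliar with the reported metrics, the sketch below computes MAE, RMSE, and a quantile (pinball) loss on invented numbers. AQL is assumed here to denote an average quantile loss; the abstract does not expand the acronym, and the paper's exact evaluation protocol may differ.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error of point forecasts."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error of point forecasts."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))

def pinball_loss(y_true, q_pred, tau):
    """Average quantile (pinball) loss at quantile level tau in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, q_pred):
        diff = y - q
        # Under-prediction is weighted by tau, over-prediction by 1 - tau.
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)

# Invented normalized power values for illustration only.
observed = [1.0, 2.0, 3.0]
forecast = [1.0, 2.0, 5.0]
point_mae = mae(observed, forecast)
point_rmse = rmse(observed, forecast)
under = pinball_loss([1.0], [0.0], tau=0.9)  # forecast too low
over = pinball_loss([0.0], [1.0], tau=0.9)   # forecast too high
```

At a high quantile level such as 0.9, forecasting too low is penalized nine times more than forecasting too high, which is what makes the loss suitable for scoring probabilistic forecasts.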

2606.08625 2026-06-09 cs.CL New submission

From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape

Hao Chen, Ziyu Han, Yukun Yan, Qingfu Zhu, Maosong Sun, Wanxiang Che

Affiliations: Research Center for Social Computing and Interactive Robotics; Department of Computer Science and Technology, Institute for AI

AI summary: Presents rubrics as a unifying framework that connects human intent and machine behavior at three levels: decomposing holistic judgments into verifiable dimensions, providing process-level feedback, and emerging dynamically from the model's own behavior.

English abstract

As Large Language Models (LLMs) advance toward open-ended autonomous agents, the mechanisms used to evaluate and guide their behavior must evolve accordingly. This work introduces the rubric as a unifying framework capturing this evolution, characterizing rubrics as a dynamic response to successive LLM paradigm shifts that recurs across otherwise independent efforts in evaluation, reinforcement learning, and safety alignment. We define rubrics as explicit criteria sets that transform complex quality judgments into structured and actionable standards, and demonstrate that their recurrence across these research threads is not coincidental. We systematically organize existing rubric designs, examine their construction and optimization, and analyze their role across evaluation and training. Rubrics manifest at three progressively deeper levels: at the evaluative level, they decompose holistic judgments into verifiable dimensions; at the training level, they serve as dense feedback signals providing process-level guidance where scalar rewards fall short; at the intrinsic level, they emerge dynamically from model behaviors, driving self-improvement. We further assess rubric reliability across generation quality, execution fidelity, theoretical constraints, and security threats, before surveying rubric-based benchmarks across diverse domains. By rendering assessment transparent and decomposable, rubrics translate human value expectations into machine-learnable signals, serving as the enduring bridge between human intentions and machine behavior.

2606.08617 2026-06-09 cs.CL New submission

Cross-Source Reasoning-based Correction for Author Name Disambiguation

Fanjin Zhang, Yunhe Pang, Bo Chen, Zhiyu Shen, Yanghui Rao, Evgeny Kharlamov, Jie Tang

Affiliations: Renmin University of China; Sun Yat-Sen University; Tsinghua University; Robert Bosch GmbH; University of Oslo

AI summary: Proposes CrossND, a framework that reasons over inconsistent assignments across sources and combines data refinement, supervised fine-tuning, and test-time scaling to correct author name disambiguation errors without human intervention.

Comments
Accepted at KDD 2026 ADS track
English abstract

Author name disambiguation is a critical challenge in academic search systems, often addressed through from-scratch and real-time disambiguation approaches. However, current algorithms remain vulnerable to cumulative errors in paper-author assignments and overlook inconsistent assignments across different sources. Resorting to expert annotation is resource-intensive. To this end, this paper explores a new perspective for author name disambiguation: cross-source correction by leveraging inconsistent assignments across sources. We propose CrossND, a full-stack framework that integrates data refinement, cross-source reasoning, and test-time scaling. First, a chain-of-refinement pipeline denoises author profiles and produces more accurate paper-author matching probabilities. Second, a supervised fine-tuning process incorporates these refined signals and a probabilistic soft logic-based cross-correction module to infer which sources' assignments are incorrect. Third, test-time scaling further enhances the accuracy and robustness of the predictions. Experiments on real-world datasets indicate that CrossND consistently outperforms 17 baselines by leveraging cross-source reasoning without human intervention.

2606.08615 2026-06-09 cs.CV cs.CL New submission

Harnessing Streaming Video in the Wild

Dingyu Yao, Shuhuan Gu, Qingyi Si, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Naibin Gu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang

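Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; JD.COM

AI summary: Proposes the Streaming Harness system, which, together with the Streaming-Train-248K dataset and a training objective, gives vision-language models proactive interaction, long-term memory, and real-time processing, and builds the Streaming-Eval benchmark to evaluate streaming video understanding.

Details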
English abstract

Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.

2606.08612 2026-06-09 cs.CV New submission

Facial Expression Recognition in the Deep Learning Era: A Systematic Multi-Criteria Review of Methods, Models, Datasets, Performance, Challenges, and Future Research Directions

Spyridon Georgiou, Aggelos Psiris, Spyridon Evangelatos, Thomas Lagkas, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

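Affiliations: International Hellenic University; University of Thessaly; Democritus University of Thrace; University of Peloponnese; Harokopio University of Athens

AI summary: Systematically reviews recent advances in deep learning-based facial expression recognition, proposing a five-phase evolution framework and a multi-criteria taxonomy, analyzing strengths and limitations along seven axes, and summarizing datasets, performance comparisons, and open challenges.

Details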
English abstract

Facial Expression Recognition (FER) has advanced rapidly over the last decade, driven by the shift from handcrafted descriptors and shallow classifiers to deep convolutional, attention-based, vision-language, and foundation-model architectures, and by the parallel growth of large-scale in-the-wild benchmarks spanning categorical, dimensional, compound, micro-expression, Action Unit (AU), and intensity-estimation tasks. Yet the deep learning-based FER landscape has so far been reviewed only along narrow task-, architecture-, or application-specific axes, leaving a holistic, systematically organized account of its recent advances missing. This survey addresses that gap with a comprehensive review of recent deep learning-based FER, explicitly linked to the wider Facial Affect Recognition (FAR) domain. Its main contributions are: (a) a description of FER's evolution into five distinct phases, from handcrafted features and classical machine learning to attention-based, vision-language, and foundation-model approaches, with the key milestone works of each; (b) a multi-criteria taxonomy analyzing the literature along seven complementary axes: recognition task, input modality, face pre-processing pipeline, network architecture, learning strategy, acquisition setting, and application domain; (c) a per-criterion comparative analysis, with critical insights into the strengths and limitations of each category under in-the-wild conditions; (d) a task-organized review of public FER datasets, with their annotation schemes, modalities, and evaluation protocols; (e) a compilation of performance metrics and a per-task quantitative comparison of representative state-of-the-art methods on widely adopted benchmarks; and (f) a discussion of current challenges and promising future directions.

2606.08610 2026-06-09 cs.RO cs.AI New submission

HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning

Zechu Li, Yufeng Jin, Xiaoyang Liu, Puze Liu, Vignesh Prasad, Carlo D'Eramo, Georgia Chalvatzaki

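Affiliations: TU Darmstadt; Honda Research Institute Europe; Columbia University; Tongji University; Shanghai Research Institute for Intelligent Autonomous Systems; University of Würzburg; Hessian.AI

AI summary: Proposes the HARBOR framework, which treats robot reinforcement learning automation as a harness-engineering problem and uses specialized agents, standardized commands, and reusable knowledge to automate the full workflow in simulation from environment setup to policy training, validated on 6 benchmarks and 16 tasks.

Details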
English abstract

Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adoption remains limited by the engineering pipeline surrounding the algorithms. Building tasks, shaping rewards, and tuning hyperparameters require substantial expert effort, making RL workflows costly and difficult to scale. We introduce HARBOR, an agentic framework that frames robot RL automation as a harness-engineering problem: given a simulator codebase and a task specification, it automates the workflow from environment setup to policy training in simulation. HARBOR decomposes such high-level objectives into bounded stages executed by specialized agents through standardized commands, persistent artifacts, executable gates, and reusable knowledge, and scales iteration via decentralized parallel trials and experience learning across runs. We evaluate HARBOR across 6 benchmarks and 16 tasks in total, spanning manipulation, locomotion, and bimanual dexterous control. We demonstrate that HARBOR automates the simulation RL workflow end-to-end, designs rewards, tunes algorithms to match or improve over default configurations, and reduces engineering effort at practical token and wall-clock cost; the resulting policies can also be transferred to real robots.

2606.08605 2026-06-09 cs.CL New submission

Multilingual Fact-Checking at Scale: Fine-Tuned Compact Models vs LLMs

Pratuat Amatya, Vinay Setty

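Affiliations: Factiverse

AI summary: Presents a multilingual fact-checking system that fine-tunes XLM-RoBERTa, mmBERT, and SetFit models; on claim detection across 114 languages and veracity prediction across 28 languages, the compact models show efficient and stable performance compared with LLMs such as GPT-5.2.

Details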
English abstract

We present a multilingual fact-checking system deployed at Factiverse, designed for high-throughput and low-latency operation across diverse languages. The system follows a modular pipeline with three stages: claim detection, evidence retrieval and re-ranking, and veracity prediction. We fine-tune XLM-RoBERTa-Large for claim detection, mmBERT-base for three-label stance classification (Supports/Refutes/Mixed), and a SetFit-based multilingual re-ranker for claim-evidence matching. We compare these components against strong LLM baselines, including GPT-5.2, Claude Opus 4.6, and Qwen3-8b. Experiments on production data spanning 114 languages for claim detection and 28 languages for veracity prediction show that task-specific fine-tuning provides strong and stable multilingual performance, while the fine-tuned retrieval model remains competitive with modern proprietary embeddings. Same-hardware latency measurements further show large efficiency gains for encoder-based components, supporting their use in production deployments with tight cost and privacy constraints. Overall, compact fine-tuned, self-hosted models remain a practical and effective foundation for multilingual fact-checking at scale. Code and data used for this study are available at https://github.com/factiverse/factcheck-editor.

2606.08602 2026-06-09 cs.LG cs.AI New submission

Reinforcement Learning for Flow-Matching Policies with Density Transport

Boshu Lei, Kostas Daniilidis, Antonio Loquercio

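Affiliations: University of Pennsylvania

AI summary: Proposes RLDT, an online reinforcement learning algorithm that builds a transport field with Stein Variational Gradient Descent to fine-tune pretrained flow-matching policies, stabilizes training through expected-target estimation, and outperforms baselines on continuous-control tasks.

Details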
English abstract

We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key insight is to view RL-based policy improvement as a transport of action densities towards regions of high reward, which naturally aligns with the transport formulation of flow matching models. Prior methods either approximate the current or optimal policy distribution or resort to distillation, which introduces biased gradients or sacrifices multimodal modeling capacity. In contrast, our approach for RL with Density Transport, which we name RLDT, constructs a transport field from a maximum-entropy RL objective using Stein Variational Gradient Descent (SVGD). It then fine-tunes a pretrained flow matching policy to align with this field. Training with this alignment objective is nontrivial because flow-matching policies generate actions via a multi-step process, making direct gradient-based optimization challenging. To overcome this challenge and stabilize training, we approximate policy actions from intermediate denoising steps via expected-target estimation. This allows the transport-field update to propagate into the network parameters without unstable backpropagation through time. Experimental results demonstrate that RLDT outperforms competitive baselines in reward quality and convergence speed. This performance holds across diverse continuous-control tasks, encompassing both dense and sparse rewards, as well as state- and vision-based long-horizon robot manipulation. The project webpage is https://rpfey.github.io/rldt/.
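
As a rough illustration of the SVGD step mentioned in the abstract, the sketch below computes a generic SVGD direction for a batch of candidate actions. It is not the RLDT implementation: the RBF kernel, the `bandwidth` value, the function name `svgd_direction`, and the toy score (standing in for the gradient of a maximum-entropy objective with respect to actions) are all assumptions made for this example.

```python
import numpy as np

def svgd_direction(actions, score, bandwidth=1.0):
    """Generic SVGD direction for particles `actions` of shape (n, d).

    `score` has shape (n, d) and holds the gradient of the log target density
    at each particle (hypothetically, grad_a Q(s, a) / temperature).
    """
    diffs = actions[:, None, :] - actions[None, :, :]   # diffs[i, j] = a_i - a_j
    kernel = np.exp(-np.sum(diffs ** 2, axis=-1) / (2.0 * bandwidth ** 2))
    attract = kernel @ score                             # kernel-weighted scores
    # Gradient of the RBF kernel w.r.t. a_j is k(a_j, a_i) * (a_i - a_j) / h^2.
    repulse = np.sum(kernel[:, :, None] * diffs, axis=1) / bandwidth ** 2
    return (attract + repulse) / actions.shape[0]

# Toy target (assumption): log-density -0.5 * ||a - goal||^2, so score = goal - a.
goal = np.array([1.0, -1.0])
actions = np.array([[0.0, 0.0], [0.2, 0.1], [-0.3, 0.4]])
for _ in range(200):
    actions = actions + 0.1 * svgd_direction(actions, goal - actions)
```

The first term moves particles toward higher-scoring actions and the second pushes them apart, which is the property that lets a particle-based transport field keep several action modes instead of collapsing onto one.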

2606.08601 2026-06-09 cs.AI New submission

InA-Probe: Instruction-Aware Active Probing for Time Series Forecasting with LLMs

Peiliang Gong, Emadeldeen Eldele, Chenyu Liu, Ziyu Jia, Yi Ding, Xinliang Zhou, Lianchao Gu, Qi Zhu, Yang Liu, Daoqiang Zhang, Xiaoli Li

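Affiliations: Nanyang Technological University; Khalifa University; Nanjing University of Aeronautics and Astronautics; Singapore University of Technology and Design

AI summary: Proposes Instruction-aware Active Probing (InA-Probe), which combines multi-level instruction injection and adaptive query generation with a dual-stage attention mechanism, outperforms existing methods on 7 benchmarks, and reduces cross-domain error by up to 37%.

Details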
English abstract

Large Language Models (LLMs) have recently demonstrated impressive potential for time series forecasting. However, existing methods predominantly rely on passive modality alignment or static task reprogramming, which often fail to capture fine-grained, non-stationary temporal patterns or to adapt to nuanced task intents. In this paper, we propose Instruction-aware Active Probing (InA-Probe), which shifts the paradigm from passive alignment toward an active, instruction-driven probing mechanism. Specifically, we design a Multi-Level Instruction Injection mechanism that enriches the model with both global task objectives and fine-grained, patch-level semantic priors. Building on this, an Adaptive Query Generation module produces sample-specific probes that are dynamically modulated by the temporal context. These probes are then refined through a dual-stage attention process: they first internalize task-specific intents via Instruction-Aware Self-Attention, and subsequently interrogate the projected temporal representations through Temporal Cross-Attention to extract salient patterns. Comprehensive experiments on seven real-world benchmarks show that InA-Probe consistently outperforms state-of-the-art deep learning and LLM-based baselines, excelling in both one-for-all generalization and zero-shot transfer while reducing forecasting error by up to 37% in challenging cross-domain scenarios. Ablation studies further confirm that the synergy between adaptive querying and fine-grained instructions is key to unlocking the reasoning power of LLMs for complex time series.

2606.08594 2026-06-09 cs.LG eess.SP New submission

How Much Capacity Does EEG Denoising Need? Ultra-Compact Networks reveal Benchmark Saturation and Metric-Utility Gap

Jasmeet Singh Bindra, Siddharth Panwar, Shubhajit Roy Chowdhury

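Affiliations: Indian Knowledge Systems and Mental Health Applications (IKSMHA) Center, Indian Institute of Technology Mandi; School of Computing and Electrical Engineering, Indian Institute of Technology Mandi

AI summary: With a fixed architecture and only channel width varied (1.05K-40.26K parameters), finds that EEG denoising reconstruction performance saturates at 3-6.5K parameters and that reconstruction metrics do not predict downstream BCI utility; ultra-compact models (33-46 KB) are suitable for edge deployment.

Details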
Comments
17 pages, will be submitted to peer-reviewed journal
English abstract

Deep learning EEG denoising architectures have scaled from tens of thousands to tens of millions of parameters, yet no prior study has isolated model capacity as the experimental variable or tested whether reconstruction metrics predict downstream neural-signal utility. We address both gaps by fixing architecture, loss, data split, and training recipe while sweeping only channel width from 1.05K to 40.26K parameters in a minimal depthwise-separable convolutional U-Net. Models were evaluated on the EEGDenoiseNet benchmark, cross-dataset BCI transfer tests, controlled baseline retraining, and downstream motor-imagery classification with five decoder families across all nine BCI Competition IV-2a subjects. Reconstruction performance saturated by 3-6.5K parameters, with post-elbow gains of at most 0.015 correlation coefficient per log10-parameter unit. An 8.46M-parameter baseline retrained under the same pipeline matched the 40.26K compact variant on EOG (a 200x parameter gap yielding no advantage), while a Patch-Transformer control reproduced the same diminishing-return shape. Downstream evaluation exposed a classifier-dependent metric-utility gap: reconstruction-optimized denoising significantly degraded CSP+LDA classification across all nine subjects and three artifact types (best denoised accuracy 0.547 vs. 0.612 noisy baseline; Bonferroni p=0.0488), persisting on naturally recorded trials (Delta=-0.047; BH-FDR q=0.0049). End-to-end neural decoders showed variable or neutral effects. Standard EEG denoising benchmarks are saturated far below current model capacity, and reconstruction metrics do not predict BCI utility. Ultra-compact models at 33-46 KB and 1.27-2.61M FLOPs/segment are practical for edge deployment. These findings argue for capacity-controlled evaluation, harder task-aware benchmarks, and mandatory downstream validation.
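
To make the capacity arithmetic concrete, the sketch below counts parameters for a standard 1D convolution and for a depthwise-separable one as channel width grows. The 1D layers, kernel size 3, bias terms, and the widths in the loop are assumptions for illustration; the abstract does not describe the actual U-Net layout. The last line only recomputes the ratio of the two parameter counts quoted in the abstract.

```python
def conv1d_params(in_ch, out_ch, kernel_size):
    """Weights plus biases of a standard 1D convolution."""
    return kernel_size * in_ch * out_ch + out_ch

def depthwise_separable_params(in_ch, out_ch, kernel_size):
    """Depthwise conv (one filter per input channel) followed by a 1x1 pointwise conv."""
    depthwise = kernel_size * in_ch + in_ch
    pointwise = in_ch * out_ch + out_ch
    return depthwise + pointwise

# Hypothetical width sweep: only the channel count changes.
for width in (4, 8, 16, 32):
    print(width, conv1d_params(width, width, 3), depthwise_separable_params(width, width, 3))

# Ratio between the 8.46M baseline and the 40.26K compact variant from the abstract.
print(round(8.46e6 / 40.26e3))  # → 210
```

The ratio comes out near 210, which the abstract rounds to 200x.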

2606.08592 2026-06-09 cs.LG quant-ph New submission

Quantum Global Variational Learning for Quantum Error Correction

Shun Ryuzaki, Hideo Mukai

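Affiliations: Meiji University

AI summary: Proposes a quantum neural network with a global structure that reduces the number of unitary matrices in quantum circuits, cutting training time by 97%, improving the training completion rate by up to 25%, achieving a 100% training success rate, and surpassing the error-correction performance of previous studies.

Details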
Comments
24 pages, 22 figures
English abstract

Efficient quantum error correction is essential for the advancement of quantum computing. We propose a quantum neural network with a global structure that reduces the number of unitary matrices required in quantum circuits. This approach resulted in a 97% reduction in training time and up to a 25% improvement in the training completion rate, ultimately achieving a 100% success rate in training while surpassing the error correction performance reported in previous studies. In addition, we demonstrated the enhanced robustness of quantum error correction against internal network noise. Moreover, the fidelity of quantum error correction under internal network noise increased by up to 15% due to the reduced computational load.

2606.08584 2026-06-09 cs.LG New submission

Convolutional Sparse Coding via the Locally Competitive Algorithm on Loihi 2

Geoffrey Kasenbacher, Daniel Ruepp, Gerrit A. Ecke

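Affiliations: Mercedes-Benz AG; Institut für Robotik und Kognitive Systeme, Universität zu Lübeck

AI summary: Implements the Locally Competitive Algorithm for convolutional sparse coding on the Loihi 2 neuromorphic chip and compares it against a GPU baseline, demonstrating its feasibility and advantages for structured sparse inference.

Details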
English abstract

Sparse coding provides a principled framework for signal representation by expressing an input as a linear combination of only a small number of basis functions. The Locally Competitive Algorithm (LCA) is particularly attractive in the context of neuromorphic computing because its dynamics (leaky integration, thresholding, and lateral inhibition) map naturally to neuromorphic hardware. While prior work has studied non-convolutional LCA on Loihi 2, the convolutional setting is of particular interest because it introduces spatial structure, weight sharing, overlapping receptive fields, and scaling behavior that are more representative of practical sparse inference workloads. In this work, we present a Loihi 2 implementation of convolutional sparse coding via the LCA and evaluate it against a conventional GPU baseline on the same inference problems. The implementation follows a one-layer recurrent LCA formulation and extends it to convolutional feature maps with local inhibitory kernels derived from pairwise filter interactions. To the best of our knowledge, this is the first implementation and benchmark of convolutional LCA on Loihi 2. Our goal is not only to demonstrate feasibility, but also to clarify in which operating regimes convolutional sparse inference becomes attractive on neuromorphic hardware. The resulting study positions convolutional LCA as a useful benchmark for structured sparse inference on emerging neuromorphic systems.
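
The three ingredients of the LCA dynamics named in the abstract can be sketched in a few lines of NumPy. This is the plain, non-convolutional textbook form with floating-point Euler integration, not the Loihi 2 or convolutional implementation the paper describes; the function names, `lam`, `tau`, `steps`, and the identity dictionary in the demo are assumptions chosen for the example.

```python
import numpy as np

def soft_threshold(u, lam):
    """Thresholding: membrane potentials below lam produce no activation."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def lca(x, dictionary, lam=0.1, tau=10.0, steps=200):
    """Sparse code of x over the columns of `dictionary` via LCA dynamics."""
    n_atoms = dictionary.shape[1]
    drive = dictionary.T @ x                                   # feed-forward input
    inhibition = dictionary.T @ dictionary - np.eye(n_atoms)   # lateral inhibition weights
    u = np.zeros(n_atoms)                                      # membrane potentials
    for _ in range(steps):
        a = soft_threshold(u, lam)
        u += (drive - u - inhibition @ a) / tau                # leaky integration step
    return soft_threshold(u, lam)

# Demo with an orthonormal dictionary, where atoms do not inhibit each other.
dictionary = np.eye(3)
x = np.array([1.0, 0.05, 0.0])
codes = lca(x, dictionary)
```

With overlapping atoms the `inhibition` matrix is nonzero off the diagonal, so active units suppress correlated neighbors; in the convolutional case described in the abstract that matrix is replaced by local inhibitory kernels.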

2606.08583 2026-06-09 cs.LG eess.SP New submission

A spectral audit framework reveals task-dependent aperiodic reliance across EEG and ECG deep learning

Jasmeet Singh Bindra, Siddharth Panwar, Shubhajit Roy Chowdhury

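Affiliations: Indian Knowledge Systems and Mental Health Applications (IKSMHA) Center, Indian Institute of Technology Mandi; School of Computing and Electrical Engineering, Indian Institute of Technology Mandi

AI summary: Proposes a spectral audit framework combining aperiodic/periodic decomposition, phase-preserving Fourier interventions, and related controls; finds that deep learning models' reliance on aperiodic components is task-dependent and architecture-general, being pronounced for sleep-wake classification, moderate for clinical abnormality detection, and minimal for motor imagery, and that it extends to ECG.

Details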
Comments
25 pages, being prepared for submission to peer-reviewed journal
English abstract

Deep learning on physiological time series is interpreted through domain-specific features (oscillatory rhythms in EEG, morphological complexes in ECG), yet these signals sit atop a broadband aperiodic 1/f-like envelope that covaries with arousal, age, and pathology. We introduce a spectral audit framework combining aperiodic/periodic decomposition, phase-preserving Fourier interventions, sham controls, and simulation validation. Aperiodic reliance was task-dependent and architecture-general: across six neural architectures, flattening drops exceeded 0.42 balanced-accuracy points for sleep-wake classification, reached 0.07-0.13 for clinical abnormality detection, and remained minimal for motor imagery. Six of seven EEG foundation models showed FDR-significant aperiodic reliance on clinical EEG; age/sex and recording-era controls reduced but did not eliminate the effect. Applying the audit to PTB-XL ECG revealed neural drops of 0.32-0.36 persisting after demographic matching, confirming this confound class extends beyond EEG. Aperiodic controls should become standard for interpretable physiological time-series deep learning.
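
One way to picture a phase-preserving Fourier intervention is the sketch below, which fits a straight line to the log-log amplitude spectrum, divides that trend out, and resynthesizes the signal with the original phases. This is a simplified stand-in, not the paper's procedure: a single `np.polyfit` line over the whole spectrum replaces a proper aperiodic/periodic decomposition, and the function names, the sampling rate, and the synthetic test signal are assumptions.

```python
import numpy as np

def flatten_aperiodic(signal, fs):
    """Divide out a fitted power-law amplitude trend while keeping every phase.

    Returns the flattened signal and the fitted log-log slope.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    amplitude = np.abs(spectrum)
    keep = (freqs > 0) & (amplitude > 0)          # skip DC and empty bins
    slope, intercept = np.polyfit(np.log10(freqs[keep]), np.log10(amplitude[keep]), 1)
    trend = np.ones_like(amplitude)
    trend[keep] = 10.0 ** (intercept + slope * np.log10(freqs[keep]))
    flat = (amplitude / trend) * np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(flat, n=signal.size), slope

def synthetic_one_over_f(n, fs, seed=0):
    """Hypothetical test signal: 1/f amplitude spectrum with random phases."""
    rng = np.random.default_rng(seed)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum = np.zeros(freqs.size, dtype=complex)
    spectrum[1:] = np.exp(1j * rng.uniform(0, 2 * np.pi, freqs.size - 1)) / freqs[1:]
    return np.fft.irfft(spectrum, n=n)

x = synthetic_one_over_f(n=1001, fs=250.0)
flat_x, slope = flatten_aperiodic(x, fs=250.0)
```

Feeding `flat_x` instead of `x` to a trained classifier and comparing accuracy would give a flattening drop of the kind the abstract reports, under the assumptions above.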

2606.08578 2026-06-09 cs.LG New submission

Lost in the Non-convex Loss Landscape: How to Fine-tune the Large Time Series Model?

Xu Zhang, Peang Wang, Wei Wang

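Affiliations: Shanghai Key Laboratory of Data Science; College of Computer Science and Artificial Intelligence; Fudan University

AI summary: To address overfitting caused by the non-convex loss landscape when fine-tuning pre-trained large time series models, proposes Smoothed Full Fine-tuning (SFF), which smooths the loss landscape by interpolating with a randomly initialized auxiliary model to improve trainability, yielding consistent improvements on eight representative models.

Details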
Comments
This paper has been accepted by The Fourteenth International Conference on Learning Representations (ICLR 2026). The code is available at https://github.com/Meteor-Stars/SFF
English abstract

Recently, large time series models (LTSMs) have gained increasing attention due to their similarities to large language models, including flexible context length, scalability, and task generality, outperforming advanced task-specific models. However, prior studies indicate that pre-trained LTSMs may exhibit a poorly conditioned non-convex loss landscape, leading to limited trainability. As a result, direct fine-tuning tends to cause overfitting and suboptimal performance, sometimes even worse than training from scratch, substantially diminishing the benefits of pre-training. To overcome this limitation, we propose Smoothed Full Fine-tuning (SFF), a novel fine-tuning technique. Specifically, we construct an auxiliary LTSM via random initialization to obtain a smoother loss landscape, and then linearly interpolate its weights with those of the pre-trained model to smooth the original landscape. This process improves trainability while preserving pre-trained knowledge, thereby enabling more effective downstream fine-tuning. From an optimization perspective, SFF perturbs sharp minima without significantly harming flat regions, facilitating escape from poor local basins toward smoother and more generalizable solutions. Extensive experiments on benchmark datasets demonstrate consistent improvements across eight representative LTSMs, including Timer, TimesFM, MOMENT, UniTS, MOIRAI, Chronos, TTMs, and Sundial, on diverse downstream tasks. The code is available at https://github.com/Meteor-Stars/SFF.
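
The weight interpolation described in the abstract can be sketched as follows. This is a minimal illustration and not the released SFF code: the function name `interpolate_weights`, the coefficient name `alpha`, its value of 0.1, and the toy two-tensor "model" are assumptions, and the abstract does not say how the coefficient is chosen or scheduled.

```python
import numpy as np

def interpolate_weights(pretrained, auxiliary, alpha):
    """Return (1 - alpha) * pretrained + alpha * auxiliary for every tensor."""
    if pretrained.keys() != auxiliary.keys():
        raise ValueError("both models must share the same parameter names")
    return {
        name: (1.0 - alpha) * pretrained[name] + alpha * auxiliary[name]
        for name in pretrained
    }

rng = np.random.default_rng(0)
# Toy stand-ins for a pre-trained model and a randomly initialized auxiliary one.
pretrained = {"w": np.ones((2, 2)), "b": np.zeros(2)}
auxiliary = {"w": rng.normal(size=(2, 2)), "b": rng.normal(size=2)}

# A small alpha keeps most of the pre-trained weights (assumed value).
smoothed = interpolate_weights(pretrained, auxiliary, alpha=0.1)
```

Fine-tuning would then start from `smoothed` rather than from the pre-trained weights directly.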

2606.08572 2026-06-09 cs.CV New submission

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

Jiahao Wang, An Ping, Yanghai Wang, Yuanxing Zhang, Shihao Li, Hanyan Bian, Yichi Ren, Yize Zhang, Han Wang, Haowen Chen, Junze Li, Jiaqi Wang, Yiyang Hu, Zhuze Xu, Zijie Zhang, Jiaheng Liu

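Affiliations: NJU-LINK Team, Nanjing University; Kling Team, Kuaishou Technology

AI summary: Proposes OmniCap-IF, the first instruction-following benchmark for omni-modal captioning, which evaluates 50 constraint types on format and content correctness, reveals a format-content tradeoff, and introduces the 54K instruction-tuning dataset OmniCap-IF-54K and the model OmniCaptioner-IF.

Details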
English abstract

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating Temporal Grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.

2606.08566 2026-06-09 cs.CV New submission

Towards Accurate Emotion-Attributed Video Captioning via Fine-grained Emotion-Cause Pair Extraction

Weidong Chen, Cheng Ye, Zhendong Mao, Liping Wang, Xinyan Liu, Yongdong Zhang

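Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Harbin Institute of Technology (Weihai)

AI summary: Proposes a fine-grained emotion-cause pair extraction framework that uses concept-aware visual semantic decomposition and visual-guided emotion interpretable learning to improve the accuracy and richness of emotional video captions.

Details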
English abstract

Emotional Video Captioning (EVC) is a challenging task that aims to generate factually accurate and emotionally rich descriptions for videos. Existing EVC methods leverage holistic visual features to mine global emotional cues and then aggregate multimodal features to guide emotional caption generation, which ignores a critical characteristic of the EVC task: visual emotions are evoked by specific motivational causes, which are usually only implied in core video segments. Holistic mining therefore introduces significant information redundancy and inaccurate emotional cues, whereas fine-grained visual cause extraction benefits both emotion perception and emotion-attributed caption generation. To this end, we propose a fine-grained emotion-cause pair extraction framework for emotion-attributed video captioning. Specifically, we learn pair-wise emotion and cause features in two rounds: 1) We propose a Concept-aware Visual Semantic Decomposition module to augment visual features by exploring scene, object, and motion concepts. In addition, to enhance emotional features, we propose a Visual-guided Emotion Interpretable Learning module, which guides emotion refinement with visual temporal dynamics and augments the interpretable refinement process with reliable VAD-vector constraints. 2) We achieve emotion-cause pair extraction by cross-coupling the visual and emotional features before and after refinement, and leverage a contrastive loss to enforce semantic alignment. Overall, our approach optimizes complex semantic understanding and emotion perception of videos, leading to promising performance in emotional captioning. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and of each proposed module, e.g., achieving the best performance with gains of +4.4% and +5.4% in BLEU-2 and ROUGE-L, respectively, on the EVC-MSVD dataset.

2606.08564 2026-06-09 cs.RO New submission

Real-IKEA: Physical Fidelity is the Prerequisite for Robust Manipulation

Real-IKEA:物理保真度是鲁棒操作的前提

Kunqi Xu, Zhenhao Huang, Siyuan Luo, Ziqiu Zeng, Fan Shi

Affiliations: National University of Singapore; Peking University

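AI summary: To address the poor manipulation robustness caused by the physics gap between simulation and the real world, the paper presents Real-IKEA, a dataset and simulation framework whose high-fidelity assets and resistance-calibrated configurations let a reinforcement learning policy discover robust strategies that favor mechanical advantage.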

Details
AI Chinese abstract

机器人操作的鲁棒性常常因简化仿真与充满阻力的现实世界之间的物理差距而失败。在这项工作中,我们强调在铰接交互中的物理真实性是鲁棒策略学习的重要因素。我们提出了Real-IKEA,一个以物理精度为首要目标的数据集和仿真框架。Real-IKEA提供了1,079个铰接资产配置,源自83个真实的IKEA把手和旋钮,经过细致的六步物理工作流程处理。对于接触几何精度,我们引入了一个双向表面偏差度量来量化碰撞网格。对于动力学真实性,我们建立了阻力校准配置,改变阻尼和摩擦。关键的是,我们通过强化学习策略证明,高保真资产能够发现鲁棒的“钩”和“杠杆”策略,这些策略优先考虑机械优势而非脆弱的摩擦拉动。总之,这些结果使Real-IKEA成为开发能够在铰接物体任务中达到人类水平鲁棒性的操作策略的关键基准。

English abstract

Robotic manipulation robustness often founders on the physics gap between simplified simulations and the resistance-laden real world. In this work, we emphasize that physical realism in articulated interaction is an important ingredient for robust policy learning. We present Real-IKEA, a dataset and simulation framework designed with physical accuracy as a first-class goal. Real-IKEA provides 1,079 articulated asset configurations, derived from 83 authentic IKEA handles and knobs processed through a meticulous six-step physical workflow. For contact-geometry accuracy, we introduce a bidirectional surface-deviation metric to quantify collision meshes. For dynamics realism, we establish resistance-calibrated configurations that vary damping and friction. Crucially, we demonstrate through a Reinforcement Learning (RL) policy that high-fidelity assets enable the discovery of robust "hooking" and "levering" strategies that prioritize mechanical advantage over fragile friction-pulling. Together, these results position Real-IKEA as a critical benchmark for developing manipulation policies capable of human-level robustness in articulated object tasks.
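
The abstract names a bidirectional surface-deviation metric but does not define it. One plausible reading is a two-way nearest-neighbor distance between points sampled on the visual mesh and on the collision mesh. The sketch below assumes that reading; the function names and the brute-force search are hypothetical, not the Real-IKEA implementation.

```python
import math


def one_way_deviation(source_points, target_points):
    """Mean distance from each source point to its nearest target point."""
    return sum(
        min(math.dist(p, q) for q in target_points) for p in source_points
    ) / len(source_points)


def bidirectional_deviation(visual_points, collision_points):
    """Return (visual->collision, collision->visual, worst of the two)."""
    forward = one_way_deviation(visual_points, collision_points)
    backward = one_way_deviation(collision_points, visual_points)
    return forward, backward, max(forward, backward)


# The collision mesh covers the visual surface but adds one extra point.
visual = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
collision = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
forward, backward, worst = bidirectional_deviation(visual, collision)
print(forward, backward, worst)
```

Measuring both directions matters here: the visual-to-collision deviation is zero, so a one-way check would miss the extra collision geometry that the reverse direction exposes.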

2606.08563 2026-06-09 cs.LG physics.ao-ph New submission

Physics-Guided Dual Decoding and Spectral Supervision for Global 3D Hydrometeor Prediction

物理引导的双解码与频谱监督用于全球三维水凝物预测

Dandan Chen, Yaqiang Wang

Affiliations: Chinese Academy of Meteorological Sciences; Xiong’an Institute of Meteorological Artificial Intelligence

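AI summary: To counter the over-smoothing that zero-inflated, long-tailed distributions cause in three-dimensional hydrometeor forecasting, the paper proposes PredHydro-Net, a physics-guided dual-decoding framework that combines a decoupled architecture, wavelet-based frequency decoupling, and adversarial training, and outperforms the compared baselines (Earthformer, PredRNNv2, and GFS) in extreme-event detection and spectral representation.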

Details
AI Chinese abstract

虽然全球数据驱动模型在预测连续大气变量方面表现出色,但由于水凝物变量呈零膨胀长尾分布,三维水凝物预测仍然具有挑战性。标准的深度学习优化通常会产生过度平滑的预测,削弱极端事件和空间纹理。我们提出了PredHydro-Net,一个物理引导的双解码框架,以缓解这种平滑。为了解决多变量优化冲突,它采用了解耦架构,其中宏观热力学和动力学场单向调节水凝物的生成。通过集成基于小波的频率解耦、频谱幅度匹配和对抗训练,该模型在定量准确性和空间保真度之间实现了有利的权衡。在72小时全球评估中,PredHydro-Net在极端事件检测和频谱表征方面优于时空深度学习基线(Earthformer和PredRNNv2)以及业务全球预报系统(GFS)。此外,它与全球降水测量(GPM)卫星反演表现出良好的气候一致性。该模型合理地再现了极端天气事件(如飓风伊恩)中的三维云结构。特征归因证实了其对物理前兆(如相对湿度和风辐合)的依赖,为长尾大气预测提供了一种稳健的、基于物理信息的方法。

English abstract

While global data-driven models excel at predicting continuous atmospheric variables, three-dimensional hydrometeor forecasting remains challenging due to the zero-inflated, long-tailed distributions of these variables. Standard deep learning optimization often yields overly smooth forecasts, attenuating extreme events and spatial textures. We propose PredHydro-Net, a physics-guided dual-decoding framework that mitigates this smoothing. To resolve multi-variable optimization conflicts, it employs a decoupled architecture where macroscopic thermodynamic and dynamic fields unidirectionally modulate hydrometeor generation. By integrating wavelet-based frequency decoupling, spectral amplitude matching, and adversarial training, the model achieves a favorable trade-off between quantitative accuracy and spatial fidelity. In a 72-h global evaluation, PredHydro-Net outperforms both spatiotemporal deep learning baselines (Earthformer and PredRNNv2) and the operational Global Forecast System (GFS) in extreme-event detection and spectral representation. Furthermore, it demonstrates strong climatological consistency with Global Precipitation Measurement (GPM) satellite retrievals. The model reasonably reproduces the three-dimensional cloud structures in extreme weather events, such as Hurricane Ian. Feature attribution confirms its dependence on physical precursors such as relative humidity and wind convergence, offering a robust, physics-informed approach to long-tailed atmospheric prediction.
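
To illustrate what spectral amplitude matching can mean, here is a minimal one-dimensional sketch that penalizes the gap between the Fourier amplitudes of a forecast and of the target. The abstract does not give the loss, so the function names, the plain DFT, and the mean-squared form are assumptions; the real model works on global three-dimensional fields.

```python
import cmath
import math


def dft_amplitude(signal):
    """Amplitudes of the non-negative DFT frequencies, normalized by length."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal))) / n
        for k in range(n // 2 + 1)
    ]


def spectral_amplitude_loss(prediction, target):
    """Assumed loss: mean squared difference between amplitude spectra."""
    amp_pred = dft_amplitude(prediction)
    amp_true = dft_amplitude(target)
    return sum((a - b) ** 2 for a, b in zip(amp_pred, amp_true)) / len(amp_true)


target = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
smoothed = [x / 3.0 for x in target]   # over-smooth forecast, damped texture
shifted = target[1:] + target[:1]      # same texture, displaced by one cell

loss_smoothed = spectral_amplitude_loss(smoothed, target)
loss_shifted = spectral_amplitude_loss(shifted, target)
print(loss_smoothed, loss_shifted)
```

The damped forecast is penalized while the displaced one is not, because amplitudes ignore phase. A loss of this kind therefore rewards texture but not location, which is consistent with the abstract describing a trade-off between quantitative accuracy and spatial fidelity.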

2606.08555 2026-06-09 cs.RO New submission

FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation

FAWAM: 面向闭环密集接触操作的力感知世界动作模型

Haotian He, Zeyu Yan, Qipeng Liu, Ning Guo, Wenzhao Lian

Affiliations: School of Mathematical Sciences, Peking University; School of Artificial Intelligence, Shanghai Jiao Tong University

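AI summary: Proposes FAWAM, which incorporates force information at three levels (perception, prediction, and closed-loop execution); by jointly predicting actions and end-effector wrenches and adding a residual correction module, it improves success rates in contact-rich manipulation.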

Details
AI Chinese abstract

力信号为接触丰富的机器人操作提供了关键的交互线索。然而,现有方法大多将力作为额外的观测模态,未能充分利用其在建模未来交互动态或指导执行时反馈校正中的作用。本文提出FAWAM,一种力感知世界动作模型,在三个层次融入力信息:感知、预测和闭环执行。FAWAM首先编码历史六轴力/力矩信号以调节动作生成,然后联合预测未来动作和末端执行器力旋量以显式建模接触演化。它进一步引入残差校正模块,使用预测的力旋量轨迹作为执行时参考,基于实时力反馈在线优化动作。跨多个接触丰富任务的实际实验表明,FAWAM相比纯视觉基线平均成功率提升36.25%,相比现有力感知基线提升21.25%,证明了我们的力感知框架在鲁棒密集接触操作中的有效性。

English abstract

Force signals provide critical interaction cues for contact-rich robotic manipulation. However, existing methods mostly use force as an additional observation modality, without fully exploiting its role in modeling future interaction dynamics or guiding execution-time feedback correction. In this paper, we propose FAWAM, a force-aware world action model that incorporates force information at three levels: perception, prediction, and closed-loop execution. FAWAM first encodes historical 6-axis force/torque signals to modulate action generation, then jointly predicts future actions and end-effector wrenches to explicitly model contact evolution. It further introduces a residual correction module that uses the predicted wrench trajectory as an execution-time reference to refine actions online based on real-time force feedback. Real-world experiments across multiple contact-rich tasks show that FAWAM improves the average success rate by 36.25% over vision-only baselines and 21.25% over existing force-aware baselines, demonstrating the effectiveness of our force-aware framework for robust contact-rich manipulation.
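
The closed-loop idea of using the predicted wrench as a reference and nudging the action by the measured deviation can be sketched with a simple proportional rule. This is only an illustration: the abstract describes a residual correction module without giving its form, and it may well be learned. The function name, the gain, the step limit, and the assumption that each action axis lines up with the same wrench axis are all hypothetical.

```python
def residual_correction(action, predicted_wrench, measured_wrench,
                        gain=0.001, max_step=0.005):
    """Shift each action axis in proportion to the wrench tracking error.

    Assumes a 6-D end-effector delta action aligned with the 6-axis wrench,
    and that backing off along an axis reduces the force on that axis.
    """
    corrected = []
    for a, w_ref, w_meas in zip(action, predicted_wrench, measured_wrench):
        delta = gain * (w_ref - w_meas)
        # Clip the residual so a force spike cannot cause a large jump.
        delta = max(-max_step, min(max_step, delta))
        corrected.append(a + delta)
    return corrected


nominal = [0.0] * 6
reference = [0.0, 0.0, 5.0, 0.0, 0.0, 0.0]   # predicted wrench at this step
mild = residual_correction(nominal, reference, [0.0, 0.0, 8.0, 0.0, 0.0, 0.0])
spike = residual_correction(nominal, reference, [0.0, 0.0, 100.0, 0.0, 0.0, 0.0])
print(mild[2], spike[2])
```

When the measured force exceeds the reference, the correction retreats along that axis, and the clip bounds the retreat under a large spike. Axes with no tracking error are left unchanged.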