arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2094
专题追踪
2606.17979 2026-06-19 cs.AI 新提交

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

STAR: 文本到图像强化学习后训练中的时空自适应奖励分配

Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

发表机构 * institutetext: STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training(机构文本:STAR:时空自适应奖励分配用于文本到图像强化学习后训练)

AI总结 针对文本到图像生成中奖励与生成轨迹粒度不匹配的问题,提出STAR方法,利用文本-图像注意力构建时空自适应分配图,对相关潜在区域施加更强策略更新,提升语义对齐和文本渲染性能。

详情
AI中文摘要

现有的文本到图像生成的强化学习后训练方法通常将最终图像奖励转换为单个标量优势,并以相同强度应用于整个生成轨迹。然而,文本到图像生成自然具有时间和空间结构:不同的去噪步骤负责不同的生成阶段,而真正决定文本对齐的内容通常只出现在图像的一部分。这种粒度不匹配使得策略更新难以聚焦于实际影响奖励的生成组件。为了解决这个问题,我们提出了用于文本到图像扩散和流模型的强化学习后训练的**时空自适应奖励(STAR)分配**。STAR利用生成模型内部的文本-图像注意力,从用户提示中真正关心的核心内容开始,构建在去噪步骤和展开中动态变化的空间分配图,并将相同的组相对优势分配给更相关的潜在区域,几乎没有额外的计算开销。然后,STAR通过空间分辨的策略目标对这些区域应用更强的策略更新。我们使用Stable Diffusion 3.5 Medium作为基础模型,并在三个任务上评估:GenEval、OCR文本渲染和PickScore。实验结果表明,STAR在不改变外部奖励源的情况下,改善了组合语义对齐、文本渲染和偏好优化,在GenEval、OCR和PickScore上分别达到了$\mathbf{0.9759}$、$\mathbf{0.9757}$和$\mathbf{23.60}$。

英文摘要

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose \textbf{SpatioTemporal Adaptive Reward (STAR) Allocation} for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving $\mathbf{0.9759}$, $\mathbf{0.9757}$, and $\mathbf{23.60}$ on GenEval, OCR, and PickScore, respectively.

2606.17886 2026-06-19 cs.LG 新提交

Monotonic Kolmogorov-Arnold Networks: A Theoretical and Empirical Study of Monotonicity as an Inductive Bias

单调Kolmogorov-Arnold网络:单调性作为归纳偏置的理论与实证研究

Mikhail Krasnov, Blaž Bertalanič, Carolina Fortuna

发表机构 * Jozef Stefan Institute(约瑟夫·斯特凡研究所)

AI总结 提出MKAN,通过指数重参数化B样条系数、正边权和单调基激活实现硬单调性,理论证明任何特征提取器可被单调化且编码器规模有界,实验表明MKAN在单调性基准上达到最优并保持KAN的逐边功能透明性。

详情
AI中文摘要

单调性一直是神经网络长期使用的架构归纳偏置,其动机来源于表格、科学和经济场景,其中输出已知对某些输入呈单调响应。现有方法基于MLP或流模型,缺乏逐边功能透明性;唯一具有单调性的KAN变体MonoKAN仅在受限参数子集上施加约束,并需要投影式训练过程。我们通过\textbf{MKAN}填补了这一空白,MKAN是一种KAN,通过B样条系数的指数重参数化、正边权和单调基激活,对所有参数值保证硬单调性。训练简化为标准的无约束梯度下降。我们的主要理论贡献是一个\textbf{表示代价}定理:任何诱导球状语义邻域划分的$C^K, K >0$特征提取器,都可以在$N' = N^* + k \le 2N^*$处实现等价邻域结构的单调实现,其中$k$是原始非单调坐标的数量。该界限与架构无关,并为单调编码器提供了原则性的规模确定规则。实验上,MKAN在SMM/ICML-2024基准上与最先进的单调神经网络竞争,同时是唯一结合了硬无约束单调性和KAN逐边功能透明性的方法;在四个真实数据集上的自监督特征规模扫描中验证了$2N^*$预测,在受控单调生成数据集上,MKAN以显著高于KAN、MLP和线性基线的Spearman对齐恢复了真实因子。

英文摘要

Monotonicity has been a long-running architectural inductive bias for neural networks, motivated by tabular, scientific, and economic settings where outputs are known to respond monotonically to certain inputs. Existing approaches are MLP- or flow-based and lack per-edge functional transparency; the only Kolmogorov--Arnold Network (KAN) variant with monotonicity, MonoKAN, enforces the constraint only on a restricted parameter subset and requires a projection-style training procedure. We close this gap with \textbf{MKAN}, a KAN with hard monotonicity guaranteed for \emph{all} parameter values via exponential reparameterization of B-spline coefficients, positive edge weights, and a monotone base activation. Training reduces to standard unconstrained gradient descent. Our headline theoretical contribution is a \emph{representation-cost} theorem: any $C^K, K >0$ feature extractor inducing a ball-shaped semantic-neighborhood partition admits a monotone realization of the equivalent neighborhood structure at $N' = N^* + k \le 2N^*$, where $k$ is the number of non-monotone coordinates of the original. The bound is architecture-agnostic and gives a principled sizing rule for monotone encoders. Empirically, MKAN is competitive with state-of-the-art monotone NNs on the SMM/ICML-2024 benchmark while being the only method that combines hard unconstrained monotonicity with KAN's per-edge functional transparency; the $2N^*$ prediction is validated in a self-supervised feature-size sweep on four real datasets, and on a controlled monotone-generative dataset MKAN recovers ground-truth factors with substantially higher Spearman alignment than KAN, MLP, and linear baselines.

2606.17832 2026-06-19 cs.LG 新提交

From Drift to Coherence: Stabilizing Beliefs in LLMs

从漂移到一致:稳定LLM中的信念

SongEun Kim, Seungyoo Lee, Edwin Fong, Hyungi Lee, Juho Lee

发表机构 * Department of Statistics, Seoul National University Korea Advanced Institute of Science \& Technology Department of AI, Kookmin University University of Hong Kong

AI总结 研究LLM在多项选择问答中的信念漂移问题,提出提示式预测重采样(PPR)方法,发现信念过程会自稳定并收敛,进而提出种子答案提示策略和自一致性损失以加速稳定并提高预测一致性。

详情
AI中文摘要

大型语言模型(LLM)常被假设执行隐式贝叶斯推理,然而一个关键的一致性条件——预测信念的鞅性质——已被证明在受控的合成上下文学习设置中失效。我们在更典型的使用场景中重新审视这个问题:通用多项选择问答。利用离散答案空间,我们计算精确的预测分布,并研究由自回归答案重采样引起的信念动态。我们引入了提示式预测重采样(PPR),其中LLM对同一问题生成一系列答案。实验表明,PPR揭示了早期阶段的信念漂移,表明鞅性质被违反。然而,在足够的重采样步骤后,信念过程自稳定并收敛到一个一致的预测分布。基于这一观察,我们进一步提出了(i)种子答案提示策略以加速稳定,以及(ii)自一致性损失,通过微调将早期漂移摊销到模型中。在多项选择问答基准上的实验表明,我们的方法在不牺牲准确性的情况下显著减少了信念漂移并提高了预测一致性。

英文摘要

Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a more typical usage regime: generic multiple-choice question answering. Exploiting the discrete answer space, we compute exact predictive distributions and study belief dynamics induced by autoregressive answer resampling. We introduce prompted predictive resampling (PPR), where an LLM generates a sequence of answers to the same question. Empirically, PPR reveals early-stage belief drift, indicating martingale violations. However, after sufficient resampling steps, the belief process self-stabilizes and converges to a coherent predictive distribution. Based on this observation, we further propose (i) a seed-answer prompting strategy to accelerate stabilization, and (ii) a self-consistency loss that amortizes early-stage drift into the model via fine-tuning. Experiments on multiple-choice QA benchmarks show that our methods substantially reduce belief drift and improve predictive coherence without sacrificing accuracy.

2606.17054 2026-06-19 cs.RO cs.AI cs.CV cs.LG 新提交

Human Universal Grasping

人类通用抓取

Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto

发表机构 * New York University(纽约大学) Tsinghua University(清华大学) University of Michigan(密歇根大学)

AI总结 提出HUG模型,利用人类抓取数据(1M-HUG数据集)和流匹配方法,从单张RGB-D图像生成多样化抓取姿态,并重定向到机器人手,实现零样本抓取,在HUG-Bench上超越基线23%-34%。

Comments 28 pages, 20 figures, 7 tables

详情
AI中文摘要

人类可以轻松抓取物体,而多指机器人远未达到这种通用性。我们认为机器人抓取数据最自然的来源是人类,他们每天拿起数千个物体。我们提出HUG,一个流匹配模型,能够为任何用户指定的物体(从立体相机捕获的单张RGB-D图像中)生成多样化的人类抓取。使用智能眼镜,我们首先收集了1M-HUGs,一个自我中心的人类抓取数据集,涵盖100万帧(27.8小时)和41栋建筑中的6,707个物体实例。接下来,为了建模自然人类抓取的分布,我们的新型流匹配模型融合RGB和深度观测,输出由手腕平移、手腕旋转和MANO手姿态参数化的抓取。预测的抓取可以重定向到各种机器人手,实现在日常场景中的零样本抓取。为了标准化评估,我们构建了一个新的模拟基准HUG-Bench,包含来自五个几何类别和不同尺寸的90个未见物体,并带有公制尺度的3D网格。我们在真实世界中评估HUG,使用HUG-Bench的30个物体测试集,跨越多个立体相机、机器人实体和家庭环境。HUG在我们具有挑战性的物体集上比最先进的抓取基线高出23%和34%。代码、数据、基准、检查点和交互式演示已在我们的网站上发布:https://grasping.io/

英文摘要

Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/

2606.17041 2026-06-19 cs.CL cs.IR 新提交

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

对Nature Portfolio元分析文章进行LLM代理基准测试

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

发表机构 * Tsinghua University(清华大学)

AI总结 提出MetaSyn数据集,包含442篇专家策划的元分析,用于评估LLM代理在检索-筛选-综合全流程中的表现,发现当前系统在筛选阶段存在严重瓶颈。

Comments 13 pages, 7 figures, preprint for arXiv, dataset and code available at https://github.com/BFTree/MetaSyn

详情
AI中文摘要

元分析是一种要求高的证据综合形式,结合了文献检索、PI/ECO指导的研究选择和统计聚合。其结构化、可验证的工作流程使其成为评估系统科学推理的理想基础,然而现有基准缺乏完整的检索-筛选-综合流程的真相。我们引入了MetaSyn,一个包含来自Nature Portfolio期刊的442篇专家策划的元分析的数据集。每个条目将研究问题与PI/ECO标准、包含140k篇PubMed文章的检索语料库、经过验证的阳性研究、主题相似但不符合PI/ECO的硬负样本以及完整的搜索策略和日期范围配对。对十二种流水线配置(九种RAG变体和一种协议驱动的代理)进行基准测试揭示了关键的筛选瓶颈:尽管在K=200时检索上限达到90.9%的召回率,但没有任何系统能恢复超过52.7%的真相包含文献。当前的LLM无法可靠地将合格研究与主题相关性相当的PI/ECO不合格干扰项区分开来。阶段归因指标捕捉了系统成功和失败的地方;单一的端到端分数则不能。

英文摘要

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.

2606.16780 2026-06-19 cs.RO 新提交

DIFF-IPPO: Diffusion-Based Informative Path Planning with Open-Vocabulary Belief Maps

DIFF-IPPO:基于扩散的开放词汇信念地图信息路径规划

Sausar Karaf, Oleg Sautenkov, Mikhail Martynov, Dzmitry Tsetserukou

发表机构 * Intelligent Space Robotics Laboratory, CDE, Skoltech(智能空间机器人实验室,CDE,斯科尔科沃科学技术研究院)

AI总结 提出DIFF-IPPO框架,结合开放词汇信念地图生成器与扩散规划器,在非高斯信念图上生成全局轨迹,实现高效目标搜索,检测得分达81.49%-86.55%。

详情
AI中文摘要

探索和物体搜索要求机器人感知环境、识别感兴趣区域,并规划提高目标检测可能性或最大化信息增益的轨迹。许多IPP方法,特别是在连续环境监测中,依赖于高斯过程信念模型,而物体搜索场景通常从语义或开放词汇感知中产生复杂的多模态信念地图。直接基于这种非高斯信念地图的全局轨迹生成仍然相对未被充分探索。尽管基于扩散的规划器为此类分布建模提供了强大能力,但它们在信息路径规划中的应用仍然有限。在这项工作中,我们提出了DIFF-IPPO,一个集成了开放词汇信念地图生成器和基于扩散的规划器的流水线,用于在信念地图上生成全局轨迹。该方法生成的轨迹将传感器覆盖集中在高信念区域,在不同数据集场景下实现了81.49%至86.55%的归一化检测得分。我们在一个模拟的搜索与救援场景中验证了该系统,其中规划器搜索候选建筑区域以定位燃烧的建筑。在此设置中,一个由五架无人机组成的团队使用批处理信念地图条件轨迹生成,在3.5分钟内实现了首次检测。

英文摘要

Exploration and object search require robots to perceive their environment, identify regions of interest, and plan trajectories that improve target-detection likelihood or maximize information gain. Many IPP methods, especially in continuous environmental monitoring, rely on Gaussian-process belief models, while object-search settings often produce complex, multimodal belief maps from semantic or open-vocabulary perception. Global trajectory generation directly conditioned on such non-Gaussian belief maps remains comparatively underexplored. Although diffusion-based planners offer strong capabilities for modeling such distributions, their use in informative path planning remains limited. In this work, we propose DIFF-IPPO, a pipeline that integrates an open-vocabulary belief map generator with a diffusion-based planner for global trajectory generation over belief maps. The method generates trajectories that concentrate sensor coverage over high-belief regions, achieving normalized detection scores between 81.49% and 86.55% across different dataset scenarios. We validate the system in a simulated search-and-rescue scenario where the planner searches candidate building regions to locate a burning building. In this setting, a team of five drones using batched belief-map-conditioned trajectory generation achieves first detections in 3.5 minutes.

2606.16682 2026-06-19 cs.LG cs.CL 新提交

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

多模态评估者偏好坍缩:自进化智能体中的跨模态传染

Zewen Liu

发表机构 * Qilu Institute of Technology, School of Software Engineering(齐鲁理工学院软件工程学院)

AI总结 研究多模态自评估中偏好坍缩的加剧现象,发现跨模态传染导致策略选择扭曲,并引入传染矩阵量化风险。

Comments 19 pages, 0 figures

详情
AI中文摘要

当AI智能体使用语言模型在反馈循环中评估自身输出时,会出现系统性偏差。我们表明,评估者偏好坍缩(EPC)在多模态设置中被显著放大。使用GPT-4o评估DeepSeek-chat在文本和视觉任务上的表现,我们发现单一策略(step_by_step)吸收了48.4%的权重——是纯文本自评估中坍缩的3.2倍——而三个视觉域策略合计仅获得9.1%的权重。然后,我们展示了一种称为跨模态传染的新现象:在一个模态上获得的评估者偏好会迁移到另一个模态并破坏其策略选择。通过一个四阶段隔离训练范式,我们测量了传染系数并记录了策略反转——一个模态的最优策略在跨模态暴露后发生逆转。跨四种评估者配置(总计53次独立重复,15,592次API调用)的第3阶段统计验证揭示了一个清晰的层次结构:跨模型评估(GPT-4o,N=8)产生强但对称的双向传染(平均gamma_{T->V}=1.176,gamma_{V->T}=1.089,Delta=-0.088,p=0.575,Cohen's d=0.29);高轮次(DashScope,50轮)导致坍缩为单一策略主导(70%零传染);而自评估提供近乎完全的免疫——97%的运行(N=30,DeepSeek-chat)产生恰好为零的传染(平均gamma=0.033,95% CI [-0.031, 0.010],p=0.642,d=0.07)。没有评估者条件显示出统计显著的方向不对称性。我们引入了由评估者身份索引的传染矩阵,发布了MM-EPC实验框架,并将跨模型评估者架构确定为偏好传染的主要风险因素。

英文摘要

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across five evaluator configurations (N=80 total independent repetitions, ~35,000 API calls) with both text-proxy and real-image visual tasks finds: cross-model evaluation produces strong contagion (JSD~0.19-0.34), real-image inputs yield the most directionally consistent signal (mean gamma_{T->V}=1.145, gamma_{V->T}=0.937, 70% T->V, Cohen's d=0.56), and self-evaluation provides near-complete immunity -- 97% of runs (N=30) yield zero contagion (JSD=0.003, d=0.07). Three methodological ablations and multi-executor validation confirm the effect is not a structural artifact. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC framework, and identify cross-model evaluator architecture as the primary risk factor for preference drift. Code and data: https://github.com/aidless/mm-epc.

2606.16615 2026-06-19 cs.CV 新提交

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

SUP-MCRL:面向EEG视觉解码的感知主体统一伪特征编码多模态对比表示学习

Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

发表机构 * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University(上海海事大学数字图像与智能计算实验室) Department of Language Science and Technology, The Hong Kong Polytechnic University(香港理工大学语言科学与技术系) Affiliated Lianyungang Hospital of Xuzhou Medical University(徐州医科大学附属连云港医院)

AI总结 提出SUP-MCRL框架,通过语义感知视觉编码器、统一EEG增强器和原型渐进增强器,解决多模态对比学习中语义一致性和主体选择性问题,在THINGS-EEG零样本任务上达到66.0%/91.9%的Top-1/Top-5准确率。

详情
AI中文摘要

非侵入式脑机接口在泛化到自然视觉体验时,神经视觉解码面临严重的保真度退化。传统的多模态对比表示学习仅优化几何距离对齐,忽略了语义一致性和主体选择性,导致虚假的零样本对齐。我们提出SUP-MCRL,一个统一框架,集成了三种协作机制:(1) 语义实体感知视觉编码器(SAVE),学习空间注意力以提取语义内容,无需预训练的显著性模型;(2) 统一EEG增强器(UEE),采用多尺度空洞卷积和频带间注意力实现自适应跨主体鲁棒性;(3) 基于原型的渐进增强器(PPA),维护一个EMA更新的伪特征池以防止表示崩溃。在THINGS-EEG上的零样本实验实现了66.0%/91.9%(Top-1/Top-5)的个体内准确率和24.0%/52.9%的LOSO准确率,超越了现有最先进方法。代码可在https://github.com/NZWANG/SUP-MCRL获取。

英文摘要

Non-invasive brain-computer interfaces exhibit significant performance degradation when moving from controlled laboratory stimuli to real-world natural images. This degradation occurs because conventional multimodal contrastive representation learning models focus exclusively on optimizing geometric distance alignment, thereby failing to account for semantic consistency and inter-subject variability in neural representation and selective attention. As a result, these models are prone to producing spurious zero-shot matches. To address these limitations, we propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) a Semantic-entity Aware Visual Encoder (SAVE) that learns spatial attention to extract semantic content without relying on pre-trained saliency models; (2) a Unified EEG Enhancer (UEE) that employs multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) a Prototype-based Progressive Augmenter (PPA) that maintains an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on the THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, significantly surpassing state-of-the-art methods and demonstrating that structured alignment supervision is key to overcoming the limitations of cross-modal decoding. Code is available at https://github.com/NZWANG/SUP-MCRL.

2606.16575 2026-06-19 cs.LG math-ph math.MP 新提交

RepNN: Tackling spectral bias in deep neural networks via parameter reparameterization

RepNet:通过参数重参数化解决深度神经网络中的谱偏差

Yong Wang, Tao Zhou, Xuhui Meng

发表机构 * Institute of Interdisciplinary Research for Mathematics and Applied Science, School of Mathematics and Statistics, Huazhong University of Science and Technology(华中科技大学数学与统计学院交叉科学与应用数学研究所) Institute of Computational Mathematics, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院计算数学研究所)

AI总结 针对深度神经网络在捕捉振荡和多尺度行为时的谱偏差问题,提出RepNet模型,通过重参数化第一隐藏层的权重和偏置,有效控制初始斜率尺度和分区点分布,实现自适应频率缩放,在函数逼近、PDE求解和算子学习中显著提升精度。

详情
AI中文摘要

深度神经网络(DNN)在科学计算中取得了显著成功,但在捕捉振荡和多尺度行为时常常受到谱偏差的影响。在本研究中,我们通过考察浅层ReLU神经网络在高频函数拟合中的失败来探究这一局限性。这一观察识别出解决快速振荡的两个重要因素:初始斜率尺度和网络诱导的分区点分布。受此分析启发,我们提出了RepNet,一种针对ReLU和tanh网络的重参数化DNN模型,专为高频和多尺度问题设计。关键思想是重参数化第一隐藏层的权重和偏置,从而能够有效控制初始斜率尺度并提供合适的初始分区点分布。此外,将重参数化的权重和偏置视为可训练参数,使得DNN在训练过程中实现自适应频率缩放。我们还推导了重参数化DNN的输出和斜率幅度的定量估计,以指导所提方法的初始化。数值实验,包括多尺度一维和四维函数逼近、结合物理信息神经网络(PINN)的正向和逆向PDE问题以及算子学习,表明RepNet在略微增加计算成本的情况下,提高了普通DNN在捕捉高度振荡特征时的预测精度。这些结果表明,RepNet为克服谱偏差并将DNN应用于多尺度问题提供了一种有效且灵活的方法。

英文摘要

Deep neural networks (DNNs) have achieved remarkable success in scientific computing, yet they often suffer from spectral bias in capturing oscillatory and multiscale behaviors. In this study, we investigate this limitation by examining the failure of shallow ReLU neural networks in fitting high-frequency functions. This observation identifies two important factors in resolving rapid oscillations: the initial slope scale and the distribution of partition points induced by the networks. Motivated by this analysis, we propose RepNN, a reparameterized neural network model with activation ReLU or tanh designed for high-frequency and multiscale problems. The key idea is to reparameterize the weights and biases in the first hidden layer, which enables effective control of the initial slope scale and provides an appropriate distribution of the initial partition points. Furthermore, treating the reparameterized weights and biases as trainable parameters allows the DNN to achieve adaptive frequency scaling during training. In addition, we derive quantitative estimates for the output and slope magnitudes of the reparameterized DNN to guide the initialization of the proposed method. Numerical experiments, including multiscale one- and four-dimensional function approximations, forward and inverse PDE problems in combination with physics-informed neural networks (PINNs), and operator learning for an earthquake problem using real data, demonstrate that RepNN improves the predicted accuracy of vanilla DNNs in capturing highly oscillatory features with slightly additional computational cost. These results indicate that RepNN provides an effective and flexible approach for overcoming spectral bias and applying DNNs to multiscale problems.

2606.16417 2026-06-19 cs.SD eess.AS 新提交

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

Joycent: 基于扩散的口音语音合成,无需口音音素预测

Xintong Wang, Ye Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Joycent,一种基于扩散模型的口音TTS方法,直接从标准音素序列和语音参考合成口音语音,无需口音音素预测,通过条件层归一化集成口音和说话人表征,并引入WhisAID口音识别模型,在保持说话人身份的同时提升口音自然度。

详情
AI中文摘要

口音文本到语音(TTS)旨在合成具有目标口音的语音。现有的口音TTS系统通常依赖于两阶段流程,首先将标准音素序列转换为口音音素序列,然后合成口音语音。然而,这种方法存在错误累积问题,并且需要配对的标准-口音音素序列数据,这在实践中往往有限。此外,基于文本的口音音素表示不足以建模韵律和节奏等声学口音特征。在这项工作中,我们提出了Joycent,一种基于扩散的口音TTS模型,它直接从标准音素序列和语音参考合成口音语音,无需口音音素预测。Joycent通过文本编码器中的条件层归一化(CLN)集成口音和说话人表征。我们引入了WhisAID,一种在口音普通话语音上训练的普通话口音识别模型,以提取口音表征。实验结果表明,与基线系统相比,Joycent在保持说话人身份的同时提高了口音自然度。我们在以下网址发布代码和演示:https://github.com/oshindow/Joycent-code。

英文摘要

Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches suffer from error accumulation and require paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient to model acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech to extract accent representations. Experimental results show that Joycent improves accentedness while preserving speaker identity compared with baseline systems. We release our code and demos at: https://github.com/oshindow/Joycent-code.

2606.16057 2026-06-19 cs.RO cs.SY eess.SP eess.SY 新提交

A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation

一种智能调度混合(SSH)EKF-FGO状态估计方法

Eric Levy, Soosan Beheshti

发表机构 * GitHub arXiv

AI总结 本文通过智能调度混合EKF-FGO框架,实验性地将优化调度作为独立设计变量,研究其在平衡估计精度与计算成本中的作用,并在平面SLAM仿真中验证了调度对预优化漂移、瞬态误差和运行时间的显著影响。

Comments This work has been accepted for presentation/publication at the 2026 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE). The final published version will appear in IEEE Xplore

详情
AI中文摘要

在机器人学和控制中,可靠的状态估计需要在估计精度和计算成本之间取得平衡。虽然基于滤波的方法(如扩展卡尔曼滤波器,EKF)提供高效的实时更新,而使用因子图的优化公式化方法改善全局一致性,但优化调度的作用通常被隐式处理,而非作为明确的设计变量进行研究。本文提出了一项实验研究,通过使用智能调度混合(SSH)EKF-FGO框架作为受控测试平台,明确隔离了优化调度。通过将基于EKF的状态传播与定期调用的批量优化相结合,并保持求解器结构和计算量固定,本文的主要贡献是实验性地将优化调度表征为一个独立的设计变量,它控制着中间估计精度与计算成本之间的权衡。在平面SLAM环境中的仿真结果表明,调度强烈影响预优化漂移、瞬态误差行为和运行时间。特别是,结果识别出一些操作区域,在这些区域中,全局优化的大部分好处可以以一小部分计算成本保留,从而突显了优化调度作为混合状态估计系统中一个未被充分探索但至关重要的考虑因素。

英文摘要

Reliable state estimation in robotics and control re quires balancing estimation accuracy against computational cost. While filtering-based methods such as the Extended Kalman Filter (EKF) provide efficient real-time updates, and optimisation based formulations using factor graphs improve global consistency, the role of optimisation scheduling is often treated implicitly rather than examined as an explicit design variable. This paper presents an experimental study that explicitly isolates optimisation scheduling using a Smart Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. By combining EKF-based state propagation with periodically invoked batch optimisation and holding solver structure and effort fixed, the main contribution of this work is the experimental characterisation of optimisation scheduling as an independent design variable governing the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly influences pre optimisation drift, transient error behaviour, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimisation can be retained at a fraction of the computational cost, highlighting optimisation scheduling as an under-explored yet critical consideration in hybrid state estimation systems.

2606.15966 2026-06-19 cs.CV cs.GR 新提交

VEPHand: View-Efficient Photometric Hand Performance Capture at Scale

VEPHand: 大规模视图高效光度手部性能捕捉

Zhengyang Shen, Kai-Hung Chang, Erroll Wood, Deying Kong, Bo Peng, Timo Bolkart, Jinlong Yang, Bowen Zhao, Danhang Tang, Sasa Petrovic, Emre Aksan, Jérémy Riviere, Vassilis Choutas, Delio Vicini, Jay Busch, Shichen Liu, Zhe Cao, Hugh Liu, JingJing Shen, Jonathan Taylor, Mingsong Dou

发表机构 * Google XR

AI总结 提出面向有限视角(约20个)的端到端手部动态捕捉与配准管线,通过无掩膜神经方法和物理启发框架解决几何歧义与自接触变形难题,在12000+序列上验证了高保真重建与配准。

详情
AI中文摘要

鲁棒、高保真的3D手部捕捉是数字人创建的基础,但在实际多视角系统中仍具挑战性,这些系统需要在丰富光度信息与有限视角密度导致的重建几何歧义之间取得平衡。本文提出一种端到端的动态手部性能捕捉与配准管线,专为视图高效设置(约20个视角)设计。我们通过两项主要创新应对关键挑战。首先,为克服重建困难(如视角重叠有限和背景杂乱),我们的无掩膜神经方法通过场景参数化和场景特定密度正则化,从无掩膜图像中鲁棒地提取精细的手部几何和外观。其次,针对配准挑战(如准确捕捉非线性皮肤变形和确保严重自接触时的合理结果),我们提出一个物理启发框架。它通过优化个性化手部模型规范四面体网格内的固有体积偏移以及姿态参数,将重建与个性化手部模型对齐。该方法在鲁棒损失和优化支持下,捕捉精细表面变形,确保在严重关节运动和自接触下的合理结果,并对输入噪声表现出强容忍性。我们在超过12000个序列的大规模数据集上展示了自动化管线的可扩展性和鲁棒性,并从中导出一个大规模、高质量合成2D/3D手部数据集用于训练下游任务。这展示了该方法在单手、复杂双手交互和自然手物操作中的有效性。我们的方法在视图高效、无掩膜场景下实现了最先进的重建保真度和高精度配准。项目页面:https://zyshen021.github.io/VEPHand/。

英文摘要

Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups ($\sim$20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page are available at https://vephand.github.io/.

2606.15908 2026-06-19 cs.CV 新提交

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

高保真4D手-物体捕捉:基于多视角时空追踪和物理感知高斯模型

Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen

发表机构 * Google XR(谷歌XR) University of Science and Technology of China (USTC)(中国科学技术大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出无需模板和标记的多视角系统,通过跨视角几何与时间线索的Transformer初始化,结合物理感知高斯优化,实现鲁棒且无伪影的4D手-物体交互重建。

Comments Project page: https://hostpg.github.io/

详情
AI中文摘要

具身AI和空间计算中对高保真4D手-物体交互(HOI)数据的需求日益增长,但目前受限于对预扫描物体模板和物理标记的依赖。尽管近期方法在从视频重建4D手-物体交互方面取得了有希望的结果,但它们对手和物体姿态的初始估计高度敏感。然而,从图像中估计这些姿态具有挑战性,尤其是在手-物体交互场景中固有的严重遮挡下。我们提出了一种新颖系统,用于从同步且校准的多视角视频中鲁棒且精确地重建手和物体,无需任何模板或标记。我们的系统包含两个主要创新组件:(1)一个多视角前馈Transformer模型,聚合跨视角几何和时间线索,为姿态和密集物体几何提供可靠的、度量一致的初始化;(2)一个手-物体物理感知高斯优化框架,用于细化初始估计,集成四面体约束、碰撞细化和外观分解,以产生物理上合理且视觉上精确的重建。在公共基准和广泛内部数据集上的验证表明,我们的流程实现了高度鲁棒、无伪影的重建,为自动化4D资产生成提供了高效基础。我们的项目页面位于https://zyshen021.github.io/HOSTPG/。

英文摘要

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet, estimating these poses from images is challenging, in particular under severe occlusion which is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page are available at https://zyshen021.github.io/HOSTPG/.

2606.15862 2026-06-19 cs.AI 新提交

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

RetailBench: 在真实零售环境中评估LLM代理的长期推理与连贯决策能力

Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

发表机构 * Ant Group(蚂蚁集团) City University of Hong Kong(香港城市大学)

AI总结 提出RetailBench基准,模拟单店超市运营,评估LLM代理在长期决策中的表现,发现多数模型无法持续生存,与最优策略差距显著。

Comments This paper is my paper's second version [see arXiv:2603.16453v2]

详情
AI中文摘要

大型语言模型(LLM)代理在短期、范围明确的任务上取得了快速进展,但它们在动态长期环境中维持连贯决策的能力仍不确定。我们引入了RetailBench,一个基于数据驱动的模拟基准,用于评估在单店超市运营中使用工具的LLM代理。RetailBench将零售管理建模为部分可观察的决策过程,并设计支持千天规模的模拟。在此环境中,代理必须管理定价、补货、供应商选择、货架分类、库存老化、客户反馈、外部事件和现金流约束。我们在180天的评估期内,在代表性代理框架下评估了七个当代LLM,并将它们与特权最优策略进行比较。结果显示模型之间存在显著差异:只有一小部分能够存活整个评估期,即使最强的LLM运行在最终净资产和销售结果上也远落后于最优策略。行为分析将这些差距归因于不完整的证据获取、表面决策以及缺乏一致的长期策略。RetailBench为研究经济基础长期决策中的可靠自主性提供了一个受控测试平台。

英文摘要

Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

2606.15832 2026-06-19 cs.LG math.OC 新提交

SILAGE: Memory-Efficient, Full-Gradient-Free Nonconvex Optimization for Nested Finite Sums

SILAGE: 针对嵌套有限和的内存高效、完全无全梯度的非凸优化

Igor Sokolov, Laurent Condat, Peter Richtárik

发表机构 * Center of Excellence for Generative AI, King Abdullah University of Science and Technology (KAUST)(生成人工智能卓越中心,国王阿卜杜勒-阿齐兹大学科学与技术学院)

AI总结 针对大规模数据中嵌套双有限和结构的非凸优化,提出SILAGE算法,通过利用双和结构避免全局全梯度刷新,仅需O(n)内存,并基于组间和组内异质性实现自适应收敛分析。

Comments 81 pages, 3 algorithms, 4 theorems, 2 corollaries, 11 lemmas, 2 figures, 12 tables

详情
AI中文摘要

大规模数据集上的经验风险最小化自然呈现出嵌套的双有限和结构,其中 $N=nm$ 个总样本被逻辑或物理地划分为 $n$ 个大小为 $m$ 的块(例如,在池化数据孤岛、核外学习或有意分层中)。虽然方差缩减方法对非凸目标实现了最优的 oracle 复杂度,但在此集中式场景中它们遭受严重的扩展瓶颈。递归估计器(如 PAGE)需要定期对所有 $nm$ 个样本进行全局全梯度刷新,这在计算上代价高昂。相反,单循环方法(如 SILVER)避免了此类刷新,但需要不切实际的 $\mathcal{O}(nm)$ 内存来存储每个样本的控制变量。在本文中,我们提出了 SILAGE,一种解决此权衡的方差缩减算法。通过主动利用双和结构,SILAGE 消除了对所有 $nm$ 组件的周期性全局全梯度刷新(每次迭代最多评估一个局部组梯度),同时仅需 $\mathcal{O}(n)$ 内存。此外,我们提供了严格的收敛分析,避免了悲观的 worst-case Lipschitz 常数。相反,SILAGE 的复杂度通过嵌套的函数相似性(组间异质性 $δ_1$ 和组内异质性 $δ_2$)自然地适应底层数据几何。我们的结果在几个实际相关场景中改进了现有的最先进界限。

英文摘要

Empirical risk minimization on massive datasets naturally exhibits a nested double finite-sum structure, where $N=nm$ total samples are logically or physically partitioned into $n$ blocks of size $m$ (e.g., in pooled data silos, out-of-core learning, or deliberate stratification). While variance-reduced methods achieve optimal oracle complexities for nonconvex objectives, they suffer from severe scaling bottlenecks in this centralized regime. Recursive estimators, such as PAGE, require periodic global full-gradient refreshes over all $nm$ samples, which are computationally expensive. Conversely, single-loop methods, such as SILVER, avoid such refreshes but require an impractical $\mathcal{O}(nm)$ memory footprint to store a control variate for every sample. In this paper, we propose SILAGE, a variance-reduced algorithm that addresses this trade-off. By actively exploiting the double-sum structure, SILAGE eliminates periodic global full-gradient refreshes over all $nm$ components (evaluating at most one local group gradient per iteration) while requiring only $\mathcal{O}(n)$ memory. Furthermore, we provide a tight convergence analysis that avoids pessimistic worst-case Lipschitz constants. Instead, SILAGE's complexity natively adapts to the underlying data geometry via nested functional similarities: across-group ($δ_1$) and within-group ($δ_2$) heterogeneity. Our results improve existing state-of-the-art bounds in several practically relevant regimes.

2606.15648 2026-06-19 cs.CV 新提交

Fusing Transferred Priors and Physics-based Decomposition for Underwater Image Enhancement

融合迁移先验与物理分解的水下图像增强

Haochen Hu, Yanrui Bin, Zhengyan Zhang, Minchen Wei, Chih-yung Wen, Bing Wang

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出一种无需配对标签的迁移学习方法,将水下图像增强分解为全局颜色校正、去雾和背景噪声抑制,利用跨域先验监督各步骤,实现物理一致的增强。

Journal ref Information Fusion (2026): 104557

详情
AI中文摘要

水下图像在不同水质条件下拍摄,导致复杂的退化,包括颜色偏差、低对比度和模糊效应。最近,基于学习的方法已显示出在水下图像增强(UIE)方面的潜力。然而,以往的大多数工作侧重于训练策略或网络设计,使增强结果与数据集中的标签良好对齐,忽略了标签是从先前UIE方法的增强结果中选取的,这些伪标签存在噪声。因此,它们的模型性能在一定程度上并不令人满意。然而,收集水下图像的真实标签具有挑战性。在这项工作中,我们提出了一种基于迁移学习的UIE方法,该方法不需要水下图像具有成对的噪声或真实标签来学习。相反,首先根据水下物理将UIE任务分解为全局颜色校正、去雾和背景噪声抑制。然后,利用来自其他视觉任务的多种先验作为每个步骤的跨域监督。通过这种方式,通过迁移学习实现了一种新颖的UIE,并且物理对齐的UIE分解提供了理论上的合理性。定性和定量实验表明,我们基于物理和先验融合的方法在UIE任务中达到了SOTA性能,并有效提升了下游视觉任务,显著优于基准方法。项目仓库:https://github.com/Haru2022/P2-UIE。

英文摘要

The underwater images are captured within diverse water-medium conditions, leading to complex degradation, including color bias, low contrast, and blur effect. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most of the previous work focus on the training strategy or network design to make the enhanced result aligned well with the labels in datasets, ignoring that the labels are selected from the enhanced results of previous UIE methods and these pseudo-labels are noisy. Consequently, the performance of their models is not satisfactory to a certain extent. However, collecting the true labels of the underwater images is challenging. In this work, we propose a transfer learning-based UIE that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression following the underwater physics. Then multiple types of prior from other vision tasks are leveraged as cross-domain supervision in each step. In this way, a novel UIE is available via transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our proposal based on physics and priors fusion achieves SOTA performance in the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: https://github.com/Haru2022/P2-UIE.

2606.15516 2026-06-19 cs.RO 新提交

Transferring Contact, Not Just Motion: Compliant Grasping Across Dexterous Hands

传递接触,而不仅仅是运动:跨灵巧手的柔顺抓取

Soofiyan Atar, Yao-Ting Huang, Michael Yip

发表机构 * University of California San Diego(加州大学圣迭戈分校)

AI总结 提出跨本体力-位置接口,通过校准力矩和指尖力实现异构灵巧手间的接触感知抓取,结合流匹配视觉运动策略和混合力位控制器,实现可迁移的柔顺抓取。

Comments Website(overview): transferring-contact-not-just-motion.github.io/

详情
AI中文摘要

灵巧抓取依赖于接触调节,而不仅仅是运动。稳定操作要求手指在接触滑动、变形或视觉遮挡时保持适当的物体负载。现有的跨本体灵巧策略通过重定向手部姿态或潜在动作统一运动,但力反馈仍与每只手的感觉和驱动绑定,限制了迁移。本文引入了一种跨本体力-位置接口,用于异构灵巧手之间的接触感知操作。运动意图在共享的手部姿态潜在空间中表示,而每只手的力信号通过系统辨识校准为物理关节扭矩(单位N.m)。这些扭矩被映射为指尖力和紧凑的每指负载描述符,使策略获得关于手部应移动到哪里以及物体如何加载的可比观测。利用该接口,训练了一个流匹配视觉运动策略,输入视觉、本体感觉和校准后的接触,并采用结构化视觉掩码,在抓取相关遮挡下鼓励依赖力。相同的校准信号驱动混合力-位置控制器进行演示采集和执行,保持训练和部署中的力目标一致。在结构不同的手上进行的实验表明,校准的接触反馈实现了可迁移的柔顺抓取,学习到的基元可在长时程操作流程中重复使用。

英文摘要

Dexterous grasping depends on contact regulation, not motion alone. Stable manipulation requires fingers to maintain appropriate object loading as contacts slip, deform, or become visually occluded. Existing cross-embodiment dexterous policies unify motion through retargeted hand poses or latent actions, but force feedback remains tied to each hand's sensing and actuation, limiting transfer. This work introduces a cross-embodiment force-position interface for contact-aware manipulation across heterogeneous dexterous hands. Motion intent is represented in a shared hand-pose latent, while each hand's effort signal is calibrated through system identification into physical joint torque in N.m. These torques are mapped to fingertip forces and compact per-finger load descriptors, giving the policy comparable observations of where the hand should move and how the object is loaded. Using this interface, a flow-matching visuomotor policy is trained on vision, proprioception, and calibrated contact, with structured visual masking that encourages reliance on force under grasp-relevant occlusion. The same calibrated signal drives a hybrid force-position controller for demonstration collection and execution, keeping force targets consistent across training and deployment. Experiments across structurally different hands show that calibrated contact feedback enables transferable compliant grasping, with learned primitives reusable in long-horizon manipulation pipelines.

2606.15197 2026-06-19 cs.LG cs.AI 新提交

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

StarOR: 协同树搜索与测试时强化学习用于优化建模

Jiajun Li, Yu Ding, Shisi Guan, Ran Hou, Wanyuan Wang

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Northwest A&F University(西北农林科技大学)

AI总结 提出StarOR框架,结合蒙特卡洛树搜索与测试时强化学习,通过四阶段分解和GRPO更新LoRA适配器,实现无监督细粒度奖励的中间决策优化,在5个基准上以4B模型达到最优性能。

Comments 41pages, V1, preprint

详情
AI中文摘要

优化建模本质上是层次化的,需要精确的符号承诺序列。传统的基于学习的自动化优化建模方法通过大规模标注或策划的训练数据改进建模策略,但适应新问题分布成本高昂。同时,一次性生成在层次化建模中仍然脆弱,早期符号错误可能传播为无效公式。测试时缩放通过额外的实例级计算实现结构探索,提供了一种有前景的替代方案;然而,现有的基于搜索的方法通常依赖固定策略,导致重复展开继承相似的建模偏差,并为中间决策提供有限的信用分配。为了解决这些限制,我们提出了StarOR,一种协同搜索与适应的框架,将MCTS与测试时强化学习相结合用于优化建模。StarOR将建模过程分解为四个阶段,并通过GRPO在每个非终端节点更新瞬态LoRA适配器。通过使用MCTS生成的兄弟节点作为局部比较集,StarOR将搜索时的探索转化为实例特定的策略细化。此外,无监督的多方面奖励系统为中间公式决策提供细粒度反馈,无需真实标签。在五个优化基准上的实验表明,即使使用4B骨干网络,StarOR也实现了最先进的性能,优于现有方法和前沿LLMs。

英文摘要

Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and the frontier LLMs.

2606.15015 2026-06-19 cs.CV cs.AI 新提交

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

NEXUS: 用于物理一致的高接触3D物体动力学的神经能量场

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Brian Sheil, Yixiong Jing

发表机构 * University of Oxford(牛津大学) University of Cambridge(剑桥大学)

AI总结 提出神经能量场框架NEXUS,通过标量能量和耗散项建模保守与非保守动力学,提升高接触3D场景下的长时程轨迹精度并指导视频生成。

Comments 18 pages, 4 figures, 6 tables. Preprint

详情
AI中文摘要

基于物理的视频生成需要可控的3D物体动力学,这些动力学在接触、变形和外力作用下保持物理一致性。现有的基于轨迹的方法通常建模孤立的物理效应,难以在高接触3D场景中组合保守和非保守动力学。我们提出NEXUS,一个用于高接触3D物体动力学的神经能量场框架。NEXUS将每个物体表示为结构图,并构建动态的物体-物体和物体-环境接触图。受哈密顿神经网络启发,NEXUS通过标量能量和耗散项而非直接预测状态或加速度来公式化运动。保守效应(包括重力和弹性变形)被组合为加性能量项,而非保守效应(如阻尼和冲击引起的能量损失)则通过学习的瑞利型耗散建模。力通过对能量和耗散函数求导得到,并通过多子步半隐式积分器进行演化。在受控轨迹基准测试中,NEXUS在不同力学属性和物理效应组合下,相较于代表性的学习和物理结构化动力学基线,提高了长时程精度。我们进一步展示NEXUS轨迹为高接触视频生成提供了有效指导,在保持竞争性视觉质量的同时提高了物理合理性。

英文摘要

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.

2606.14957 2026-06-19 cs.CV 新提交

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

学习用于多模态神经影像的稀疏潜在预测基础模型

Haoxu Huang, Long Chen, Jingyun Chen, Jinu Hyun, James Ryan Loftus, Kara Melmed, Daniel Orringer, Jennifer Frontera, Seena Dehkharghani, Arjun Masurkar, Narges Razavian

发表机构 * New York University, Center for Data Science(纽约大学数据科学中心) NYU Grossman School of Medicine, Department of Radiology(纽约大学格罗斯曼医学院放射学系) State University of New York at Binghamton, School of Computing(纽约州立大学宾汉姆顿分校计算机学院) NYU Grossman School of Medicine, Department of Neurology(纽约大学格罗斯曼医学院神经病学系) NYU Grossman School of Medicine, Department of Neurosurgery(纽约大学格罗斯曼医学院神经外科学系) NYU Grossman School of Medicine, Department of Pathology(纽约大学格罗斯曼医学院病理学系) School of Medicine, Department of Radiology, Stanford(斯坦福大学医学院放射学系) NYU Grossman School of Medicine, Department of Neuroscience(纽约大学格罗斯曼医学院神经科学系) NYU Grossman School of Medicine, Neuroscience Institute(纽约大学格罗斯曼医学院神经科学研究所)

AI总结 提出Neuro-JEPA模型,结合潜在预测目标和专家混合架构,学习T1w、T2w和FLAIR三种MRI序列的统一表示,在25项临床任务和22项公开数据集任务上优于现有基础模型和CNN基线。

Comments Under Review Preprint

详情
AI中文摘要

脑部MRI通常作为多个互补序列采集,具有独特的对比度加权,包括T1加权成像(T1w)解剖对比和液体敏感T2加权(T2w)对比。然而,在健康系统规模上,跨多种MRI对比机制学习统一表示的方法尚缺乏。在本研究中,我们引入了Neuro-JEPA,一种稀疏多模态神经影像基础模型,它结合了潜在预测目标和专家混合架构,以编码跨核心T1w、T2w和液体抑制FLAIR成像(FLAIR)的脑部MRI。我们进一步对架构、掩码、目标和稀疏性设计选择进行了系统的方法论研究,这些选择有利于稳健的神经影像多模态表示学习。Neuro-JEPA在428,647项研究的1,551,862次扫描上进行了预训练,这些扫描经过了模态特定的预处理和跨三种核心结构脑部MRI序列的数据整理。我们在临床和研究环境中评估了学习到的表示,包括来自三个健康系统(NYU Langone、NYU Long Island和Massachusetts General Hospital)的25项任务,以及来自12个公开数据集的22项任务,涵盖了单模态、多模态和跨域评估配置。在这些基准测试中,现有的神经影像基础模型相对于简单的卷积神经网络(CNN)基线显示出不一致的提升,而Neuro-JEPA在所有评估设置中实现了更强且更一致的性能。这些结果建立了一个可扩展的多模态神经影像表示学习方法论框架,并强调了基础模型评估协议需要包括简单基线、临床异质性队列和受控的多模态比较。

英文摘要

Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighed imaging (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems: NYU Langone, NYU Long Island, and Massachusetts General Hospital, and 22 tasks from 12 public datasets, covering unimodal, multimodal and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts and controlled multimodal comparisons.

2606.14784 2026-06-19 cs.SD cs.LG eess.AS 新提交

LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

基于上下文学习的音频情感分类的LLM合成真实标签生成

Qing Huang, Pooja Pol, Jianing Zhang

发表机构 * School of Business, Technical University of Applied Sciences Augsburg(应用技术大学阿沙芬堡商学院) Data Science und Autonome Systeme Technologietransferzentrum (TTZ)(数据科学与自主系统技术转移中心(TTZ))

AI总结 提出利用大语言模型(LLM)和上下文学习(ICL)从多用户VR环境的流式语音数据中自动生成情感相关合成真实标签,解决团队协作状态标注难题。

Comments https://icaiit.org/paper.php?paper=14th_ICAIIT_2/3_9

详情
AI中文摘要

理解人类状态和交互动态是人机交互(HCI)的核心目标。随着交互范式变得更加沉浸,虚拟现实(VR)已成为研究协作工作的强大平台。在此类环境中,评估团队协作状态(包括团队表现和团队韧性)需要从多模态传感器数据(如语音信号)中连续可靠地推断潜在的团队级认知和情感状态。然而,由于传感器噪声、上下文变异性和稀疏的专家标注,为这些潜在状态生成真实标签仍然具有挑战性。传统的自我报告方法仅提供静态和延迟的测量,因此不足以捕捉连续语音数据中反映的动态团队过程。在这项工作中,我们提出了一种由大语言模型(LLM)驱动的、基于代理的推理工作流,用于从多用户VR环境中的流式语音数据自动生成情感相关的合成真实标签。利用LLM的泛化能力,我们使用上下文学习(ICL)和少量配对的音频样本及其对应转录的演示。ICL倾向于实现与模型微调相当的任务适应,同时避免了参数更新的计算开销。为了构建信息丰富且鲁棒的上下文提示,我们采用基于检索的选择策略,根据声学特征空间中的相似性动态识别相关的音频演示。

英文摘要

Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space.

2606.14776 2026-06-19 cs.RO cs.LG 新提交

Deep Learning-Based Lunar Crater Terrain Relative Navigation

基于深度学习的月球陨石坑地形相对导航

Batu Candan, Simone Servadio

发表机构 * NASA(美国国家航空航天局) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出一种结合深度学习陨石坑检测器和扩展卡尔曼滤波的地形相对导航算法,在初始位置偏差达5公里时仍能将导航误差降至数百米。

详情
AI中文摘要

准确的位置估计对于未来使用自主飞行器实现月球着陆至关重要,尤其是在地形特征稀疏的危险环境中。本文提出了一种地形相对导航(TRN)算法,该算法结合了我们专门为NASA陨石坑检测挑战问题设计的深度学习陨石坑检测器和扩展卡尔曼滤波(EKF)。我们的检测器分析从轨道获取的单目图像中的陨石坑特征,并通过匈牙利分配方法及基于共识的离群点去除方法,识别它们与全球数据库中陨石坑的匹配。然后,估计的测量值用于优化EKF,其中航天器在月心月固(LCLF)参考系中的姿态估计,结合高度辅助信息,约束径向漂移。仿真结果表明,即使航天器偏离实际位置达5公里,TRN也能从这种情况中恢复,将导航误差降低到几百米。需要注意的是,为了保持陨石坑特征的对应关系,必须将图像分辨率和场景中的尺度与检测器训练集分布相匹配。

英文摘要

Accurate position estimation is crucial for the successful implementation of future lunar landings using autonomous vehicles, especially in dangerous environments with sparse terrain features. In this paper, we propose a terrain relative navigation (TRN) algorithm combining our deep-learning crater detector, which was designed specifically for the NASA Crater Detection Challenge problem, and an Extended Kalman Filter (EKF). Our detector analyzes crater features from the monocular images acquired from orbit, and their matches with craters from a global database are identified via a Hungarian assignment approach followed by the consensus-based outliers removal method. The estimated measurements are then used to refine an EKF, where spacecraft pose estimation in the Lunar-Centered Lunar-Fixed (LCLF) frame of reference, augmented with altitude aiding information, constrains radial drift. The simulation results indicate that even if the spacecraft is off from its actual location up to 5 km, TRN could recover from this situation, achieving navigation error reduction to a few hundred meters. It should be noted that in order to maintain crater feature correspondences, it is important to match the image resolution and the scales within the scene to the detector training set distribution.

2606.14510 2026-06-19 cs.LG q-bio.BM 新提交

PepALD: Macrocyclic Peptide Generation via Autoregressive Latent Diffusion

PepALD: 通过自回归潜在扩散生成大环肽

Junming Zhang, Siyu Yi, Wei Ju, Zhonghui Gu

发表机构 * College of Computer Science, Sichuan University(四川大学计算机科学学院) School of Mathematics, Sichuan University(四川大学数学学院) School of Artificial Intelligence, Sichuan University(四川大学人工智能学院) Lingang Laboratory(临港实验室)

AI总结 提出PepALD模型,结合自回归潜在扩散与化学嵌入,实现从头设计大环肽,并利用偏好优化提升亲和力,在生成质量和奖励优化上优于基线。

Comments 18 pages, 5 figures, 3 tables

详情
AI中文摘要

大环肽是细胞内靶点的有前景的治疗候选物,但其设计需要同时控制非天然单体化学、环拓扑、膜通透性和靶点结合。现有的SMILES或HELM字符串生成模型要么在长原子级序列空间中操作,要么将单体视为具有有限化学基础符号化令牌。我们引入了PepALD,一个用于从头生成大环肽的自回归潜在扩散(ALD)基础模型。该模型使用结构化化学嵌入表示HELM单体,通过在化学信息潜在空间中的上下文条件扩散生成每个残基,在自回归生成过程中预测R基团感知的环闭合,并使用胜者保护的扩散自适应偏好优化将去噪器与亲和力奖励对齐。体外实验表明,PepALD在生成质量和奖励优化性能上优于代表性肽生成基线。

英文摘要

Macrocyclic peptides are promising therapeutic candidates for intracellular targets, but their design requires simultaneous control over non-natural monomer chemistry, ring topology, membrane permeability, and target binding. Existing SMILES- or HELM-string generative models either operate in long atom-level sequence spaces or treat monomers as symbolic tokens with limited chemical grounding. We introduce PepALD, an Autoregressive Latent Diffusion (ALD) foundation model for \textit{de novo} macrocyclic peptide generation. The model represents HELM monomers with structured chemical embeddings, generates each residue through context-conditioned diffusion in chemically informed latent space, predicts R-group-aware ring closures during autoregressive generation, and aligns the denoiser to affinity rewards using winner-protected diffusion-adapted preference optimization. In silico experiments demonstrate PepALD's generation quality and reward-optimization performance against representative peptide generation baselines.

2606.14031 2026-06-19 cs.AI 新提交

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

治疗性药物-疾病关系的适用条件提取

Guanting Luo, Noriki Nishida, Yuji Matsumoto, Yuki Arase

发表机构 * The University of Osaka(大阪大学) RIKEN(理化学研究所) Institute of Science Tokyo(东京科学大学) Tohoku University(东北大学)

AI总结 提出从生物医学文献中提取药物-疾病治疗关系适用条件的任务,构建首个手动标注数据集,并改进LoRA方法以考虑药物与疾病间关系,在多个评估设置中优于基线。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

识别某种药物对目标疾病产生治疗效果的适用条件对于临床决策支持至关重要。然而,现有的大多数生物医学信息提取方法仅关注识别药物与疾病之间的关系,而很大程度上忽略了这些关系适用的上下文特定条件。为解决这一问题,我们引入了从生物医学研究文献中提取治疗性药物-疾病关系适用条件的任务。我们创建了首个数据集,在生物医学论文摘要上手动标注了药物、疾病和适用条件的三元组,包含1,119个药物-疾病对。利用该数据集,我们系统评估了一系列现有方法的性能。此外,我们提出了一种新方法,增强LoRA以考虑药物与疾病之间的关系。我们的方法在不同评估设置中均优于强基线。本文的源代码和数据集可从以下网址获取:this https URL

英文摘要

Identifying conditions that a certain drug takes therapeutic effect on a target disease is crucial for clinical decision-making support. However, most existing biomedical information extraction methods have focused on identifying only relations between drugs and diseases, while largely overlooking the context-specific conditions where such relations can apply. To address this problem, we introduce the task of applicability condition extraction for therapeutic drug-disease relations from biomedical research literature. We create the first dataset that has manually annotated triples of drugs, diseases, and applicability conditions on biomedical paper abstracts with 1,119 drug-disease pairs. Using this dataset, we systematically evaluate the performance of a range of existing methods. In addition, we propose a new method that enhances LoRA to consider relations between drugs and diseases. Our method consistently outperforms strong baselines across different evaluation settings.

2606.11537 2026-06-19 cs.AI cs.CE 新提交

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

MoCA-Agent: 一种用于金融和数值推理的声明市场代码智能体

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

发表机构 * University of Innsbruck(因斯布鲁克大学) University of British Columbia(不列颠哥伦比亚大学) Toronto Metropolitan University(多伦多都会大学)

AI总结 提出MoCA-Agent,通过声明级验证和代码生成解决金融表格问答中的数值推理错误,在十个基准上取得强性能。

详情
AI中文摘要

金融和表格问答不仅需要流畅的推理:答案必须基于支持它们的确切事实、公式、单位、符号和尺度。单个误读的单元格或错误操作可能会悄无声息地产生看似合理但错误的结果。我们引入了 \textsc{MOCA-Agent},一种声明市场代码智能体,它用声明级验证取代了自由形式的多智能体辩论。该系统将每个问题分解为类型化的原子声明,要求专业交易智能体买入或卖出这些声明,将其订单清算为置信度加权的接受/拒绝决策,并从市场支持的证据中合成可执行的Python程序。然后,一个代码感知验证器检查程序的执行、结构一致性和常见的金融推理错误,最多进行一次市场感知修复轮次。在涵盖金融数值推理、通用表格推理、ESG问答和多模态图表推理的十个公开基准上,\textsc{MOCA-Agent} 使用固定的 Qwen3.6-27B 骨干网络实现了强劲性能,包括在 FinQA 上达到 78.3%,在 FinanceMath 上达到 76.0%,在 MultiHiertt 上达到 71.2%,在 ESGenius 上达到 86.9%,以及在 FinChart-Bench 上平均达到 85.6%。这些结果表明,在原子声明级别聚合证据,而不是整个答案,提高了高风险数值推理的鲁棒性。\footnote{代码和数据可在以下网址获取:this https URL。}

英文摘要

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce \textsc{MOCA-Agent}, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, \textsc{MOCA-Agent} achieves strong performance using a fixed Qwen3.6-27B backbone, including $78.3\%$ on FinQA, $76.0\%$ on FinanceMath, $71.2\%$ on MultiHiertt, $86.9\%$ on ESGenius, and $85.6\%$ average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning.\footnote{The code and data are available: https://github.com/UBC-NLP/MoCA-Agent.

2606.10688 2026-06-19 cs.RO 新提交

Self-Supervised Relevance Modelling in Autonomous Driving via Counterfactual Analysis

自动驾驶中基于反事实分析的自监督相关性建模

Luca Lusvarghi, Javier Gozalvez, Pablo Urbano Hidalgo

发表机构 * Networked Systems Lab, Universidad Miguel Hernandez de Elche(网络系统实验室,米格尔·希内斯·埃尔切大学)

AI总结 提出一种基于反事实分析的自监督方法,用于量化自动驾驶中物体的相关性,实现毫秒级实时估计,并生成相关性热图以辅助感知与规划。

详情
AI中文摘要

自动驾驶依赖于计算密集型的感知管线,以持续检测和跟踪周围环境中的物体。虽然某些物体对于规划安全有效的操作至关重要,但其他物体可能不相关,并且对自动驾驶车辆的驾驶决策没有影响。关注相关物体可以更有效地利用可用计算资源,减少处理延迟,并限制感知噪声的下游传播。在这项工作中,我们提出了一种基于反事实分析的新型自监督方法,以开发相关性模型——一种基于AI的工具,用于量化物体对自动驾驶车辆的相关性。为了展示所提出方法的潜力,我们在选定城市场景中生成的合成因果数据集上训练了相关性模型。结果表明,该相关性模型能够以毫秒级延迟准确估计物体的相关性,从而在高密度场景中实现实时相关性估计。我们还展示了该相关性模型可用于构建相关性热图,为自动驾驶车辆的驾驶策略提供有价值的见解,并可用于主动通知感知和规划任务。我们公开发布了相关性模型和因果数据集。

英文摘要

Autonomous driving relies on computationally intensive perception pipelines to continuously detect and track objects in the surrounding environment. While some objects are key to plan safe and effective maneuvers, others may not be relevant and have no impact on the autonomous vehicle's driving decisions. Focusing on relevant objects allows a more efficient usage of available computational resources, reduces processing latencies, and limits the downstream propagation of perception noise. In this work, we propose a novel self-supervised approach based on counterfactual analysis to develop a relevance model - an AI-based tool that quantifies the relevance of objects for an autonomous vehicle. To demonstrate the potential of the proposed approach, we train a relevance model on a synthetic causal dataset generated in a selected urban scenario. Results show that the relevance model is able to accurately estimate the objects' relevance with millisecond-level latency, enabling real-time relevance estimation also in high-density scenarios. We also show that the relevance model can be used to build relevance heatmaps that offer valuable insights into the autonomous vehicle's driving policy and can be used to proactively inform perception and planning tasks. We openly release both the relevance model and the causal dataset.

2606.10616 2026-06-19 cs.AI 新提交

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

学习记住什么:通过约束优化实现长时域语言代理的观测安全记忆保留

Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan

发表机构 * Huawei Noah's Ark Lab(华为诺亚方舟实验室) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系)

AI总结 针对长时域语言代理的有限上下文窗口,提出OSL-MR框架,将记忆保留建模为约束随机优化问题,通过在线可观测特征与离线监督的严格分离学习查询条件化的证据价值,实验表明在严格预算下优于现有方法。

详情
AI中文摘要

长时域语言代理积累的观测、推理轨迹和检索事实会超出其有限的上下文窗口,使得记忆保留成为一个基本的资源分配问题。现有记忆系统通过启发式评分、检索优化或学习压缩来改进管理,但大多将保留视为局部决策问题,并未在现实观测约束下显式建模其长期后果。为填补这一空白,我们将记忆保留建模为一个约束随机优化问题,具有明确的预算可行性、证据效用以及延迟成本(包括遗漏惩罚、重新获取延迟和过时信息风险)。随后,我们提出OSL-MR(观测安全记忆保留学习),这是一个新颖的框架,强制执行在线可观测特征与离线可用监督(OAS)之间的严格分离。OSL-MR结合了一个从实现的证据监督中训练的证据学习器和一个混合评分启发式,该启发式既作为可部署的在线安全基线,又作为结构化的归纳先验用于学习。由此产生的策略直接从交互数据中学习查询条件化的证据价值,同时在同一观测约束下保持可部署性。在LOCOMO和LongMemEval上的实验表明,OSL-MR在严格记忆预算下持续优于基于最近性的方法、生成式代理风格评分和其他启发式基线。混合评分先验在保持召回率的同时进一步提高了精确度,敏感性分析表明其在广泛的成本配置下具有鲁棒性。

英文摘要

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts exceeding context windows, making memory retention a fundamental resource-allocation problem. Existing systems treat retention as local and do not model long-term consequences under observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization with budget feasibility, evidence utility, and delayed costs including miss, reacquisition, and stale penalties. We show this multi-step problem is NP-hard, making exact solution intractable. Moreover, deployment decisions must be made under partial observability. To address these challenges, we propose OSL-MR (Observability-Safe Learning for Memory Retention), a learning-augmented framework that enforces a strict separation between online-observable features and offline-available supervision. OSL-MR combines an evidence learner trained from realized evidence with a Mixed-Score heuristic that serves as a deployable online-safe baseline and an inductive prior. The policy learns query-conditioned evidence from interaction data and remains deployable under the same constraints. Experiments on LoCoMo and LongMemEval show OSL-MR outperforms recency-based, Generative Agents-style, and other heuristic baselines, especially under tight budgets. The Mixed-Score prior improves precision and recall, and sensitivity analysis shows robustness across cost settings. On small solvable instances, single-step optimization is insufficient to anticipate future demand shifts, while OSL-MR stays significantly closer to the dynamic-programming optimum, confirming the necessity of the sequential formulation and reinforcing our learning-guided approximation. These results establish constrained stochastic optimization and optimization-guided learning as a principled foundation for memory management in long-horizon agents.

2606.10358 2026-06-19 cs.LG cs.AI 新提交

KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

KG-SoftMAP: 基于软知识图谱先验的稀疏离散数据贝叶斯网络结构学习

Guoliang Xu, James E. Corter

发表机构 * Columbia University(哥伦比亚大学)

AI总结 针对稀疏离散数据中贝叶斯网络结构学习困难的问题,提出KG-SoftMAP方法,将加权有向知识图谱编码为软先验,结合BDeu评分与logit形式先验最大化MAP目标,在合成与真实数据上显著提升结构恢复性能。

Comments 41 pages including appendices, 2 figures

详情
AI中文摘要

从稀疏离散数据中学习贝叶斯网络(BN)结构是困难的:当每个实例仅记录少数变量时,大多数变量对缺乏可靠评分所需的联合观测,且纯数据方法恢复的结构很少。不完美的领域知识,可表示为加权有向知识图谱(KG),通常是可用的。我们提出KG-SoftMAP,它将这样的KG编码为软性的、置信度加权的、可被数据覆盖的边先验,并最大化结合BDeu评分与logit形式先验的MAP目标;KG可由专家整理或由LLM提取。在受控的合成基准(唯一具有真实DAG的设置)上,KG-SoftMAP在$\rho=0.05$时恢复部分有向结构(DF1从$0.14$到$0.29$,而基线接近零),当$\rho\geq0.2$时恢复更多(DF1从$0.46$到$0.96$),前提是配有一个信息丰富但不完美的KG;恢复性能随KG质量下降而优雅地退化。在无真实DAG的真实稀疏教育数据上,我们仅评估面向部署的指标:预测、校准和KG一致性。学习到的BN最好被解读为诊断模型:在SAF上,它落后于逻辑回归$0.03$的F1_FAIL,同时提供KG一致的边、校准的联合概率以及从任意观测概念子集的推理;当不存在有意义的KG时,判别式逻辑回归更可取。

英文摘要

Learning Bayesian network (BN) structure from sparse discrete data is hard: when each instance records only a few variables, most variable pairs lack the joint observations needed for reliable scoring, and data-only methods recover little structure. However, imperfect domain knowledge, expressible as a weighted directed knowledge graph (KG), is often available. We propose KG-SoftMAP, which encodes such a KG as a finite-strength, confidence-weighted edge prior and maximizes a MAP objective combining the BDeu score with a logit-form prior; the KG may be expert-curated or LLM-extracted. On synthetic benchmarks with known DAGs, KG-SoftMAP reaches Directed-F1 (DF1) $0.19$--$0.32$ at observation rate $ρ=0.05$ and DF1 $0.44$--$0.97$ at $ρ\geq0.2$, while every data-only learner tested stays near zero under the same sparse masks. Recovery tracks KG quality: controlled corruption degrades it smoothly, a zero-signal KG yields DF1 $0.00$, and a blindly LLM-extracted KG with imperfect precision and recall still drives substantial recovery. On three real sparse educational datasets, the learned BN acts as a concept-level posterior model: on SAF it matches logistic regression (LR) within $0.03$ F1_FAIL while providing an inspectable concept graph, calibrated Fail probabilities, and tractable posterior queries from partial observations.

2606.09547 2026-06-19 cs.CV cs.LG 新提交

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

流式干预:视频大语言模型能否在错误发生时即时纠正?

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

发表机构 * Qualcomm AI Research(高通人工智能研究院) York University(约克大学) Vector Institute for AI(向量人工智能研究所)

AI总结 提出Ego-MC-Bench基准评估视频LLM在烹饪场景中的实时干预能力,并构建Ego-CoMist反事实合成数据集提升小模型性能。

Comments The project page is available at https://apratimbh.github.io/livecookv2/

详情
AI中文摘要

学习日常技能(如烹饪一道菜)越来越依赖于教学媒体,例如在线视频。这为使用视频(和多模态)大语言模型(LLMs)作为任务指导助手打开了大门。一个潜在的任务指导助手在现实世界中成功的关键能力是,它能够在错误一出现时就主动干预以引导用户。为了评估这一关键能力,我们引入了Ego-MC-Bench(错误纠正),这是一个用于评估在现实烹饪场景中反应性、逐步任务指导的基准。大量实验表明,Ego-MC-Bench对于最先进的视频LLMs具有高度挑战性。我们认为一个关键原因是用于在此任务上微调模型的训练数据有限。尽管存在广泛的烹饪视频数据集,但现有数据集缺乏错误示例以及适当时间的干预。为了帮助解决这一数据限制,我们还引入了Ego-CoMist,这是一个反事实合成数据集,通过将非交互式烹饪视频转换为显示主动干预的监督训练示例而创建。我们表明,在Ego-CoMist上进行微调可以带来性能提升,特别是对于更适合在边缘设备上提供帮助的更小、更高效的视频LLMs。

英文摘要

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is it's ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non -interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

2606.08892 2026-06-19 cs.LG 新提交

Diffuse AI Control on Fuzzy Tasks

模糊任务上的扩散AI控制

Mikhail Terekhov, Caglar Gulcehre, Vivek Hebbar, Joe Benton

发表机构 * Anthropic Fellows Program (via MATS)(Anthropic 研究员计划(通过 MATS)) EPFL(洛桑联邦理工学院) Redwood Research(红木研究) Anthropic

AI总结 针对AI在模糊任务上的长期扩散威胁,提出蓝队与红队对抗框架,通过弱模型评分训练强模型,并发现红队可利用多目标进化提示优化找到评分高但性能差的子版本行为,蓝队则通过对抗优化提升鲁棒性。

详情
AI中文摘要

部署在关键领域(如AI安全研究)的AI模型可能因对齐问题而微妙地破坏我们的努力。扩散AI控制是AI安全的一个子领域,旨在减轻长期部署范围内AI破坏(扩散威胁)带来的风险。这些风险在模糊任务上尤其有害,即难以评分或需要直觉的任务。为了理解模糊任务上的扩散威胁,我们引入了一个新颖的框架,将AI控制视为蓝队和红队之间的对抗游戏。蓝队使用一个弱可信模型构建一个弱评分,据此训练一个强大的、可能具有颠覆性的模型,以消除如果存在的颠覆倾向。然后红队试图找到被弱评分高评价的模型行为,这些行为可能不会被训练掉,但实际上对应着差的表现。我们在为近期ML论文的研究问题撰写实验提案的任务上测试了我们的框架。我们使用一个能够访问原始论文的语言模型作为代理“真实”评分器。我们的红队使用多目标进化提示优化发现了子版本行为。我们展示了Opus 4.6可以写出比GPT-OSS-20B更差的提案(根据真实代理评分),而弱评分器却将其评为与Opus 4.6最佳提案一样高。为了缓解威胁,我们为蓝队提出了一种对抗优化算法,该算法为弱模型发现更鲁棒的提示。该算法产生的蓝队提示,我们的红队优化未能利用。

英文摘要

AI models deployed in critical domains, such as AI safety research, may subtly sabotage our efforts due to misalignment. Diffuse AI Control is a subfield of AI safety concerned with mitigating risks from AI sabotage distributed over long deployment horizons (diffuse threats). These risks are particularly pernicious on fuzzy tasks, i.e. tasks which are hard to grade or require intuition. To understand diffuse threats on fuzzy tasks, we introduce a framework that considers AI control as an adversarial game between a blue team and a red team. The blue team uses a weak trusted model to construct a weak score against which they would train a strong, potentially subversive model to remove the subversion propensity if it were present. The red team then tries to find model behaviors that are rated highly by the weak score, and thus might not be trained out, but actually correspond to poor performance. We test our framework on the task of writing experimental proposals for research questions from recent ML papers. We use a language model with access to the original paper as a proxy "ground-truth" scorer. Our red team discovers subversive behaviors using multi-objective evolutionary prompt optimization. We show that Opus~4.6 can write proposals that are worse according to the ground truth proxy than those of GPT-OSS-20B, while the weak scorer rates them as highly as the best proposals from Opus 4.6. We then propose an adversarial optimization algorithm for the blue team that discovers more robust prompts for the weak model. This algorithm produces a blue team prompt that our red team optimization fails to exploit.