arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.32040 2026-07-01 cs.CV New submission

FaceMoE: Mixture of Experts for Low-Resolution Face Recognition

Kartik Narayan, Vishal M. Patel

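Affiliations: Johns Hopkins University

AI summary: To address difficult feature extraction and the domain gap in low-resolution face recognition, the paper proposes FaceMoE, a Mixture-of-Experts Transformer architecture that uses multiple FFN experts and a top-k router for resolution-aware feature extraction, and outperforms existing methods on eleven datasets.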

Comments ECCV 2026, Project Page: https://kartik-3004.github.io/FaceMoE/

Abstract

Low-resolution face recognition (LR-FR) remains a challenging task due to poor feature extraction and aggregation, as probe images often contain limited identity information resulting from extreme degradations such as blur, occlusion, and low contrast. Additionally, the domain gap between high-resolution (HR) gallery images and low-resolution (LR) probe images poses a significant challenge. A single feature encoder struggles to generalize effectively across both domains when fine-tuned on an LR dataset, and this issue is further magnified by catastrophic forgetting. To address these challenges, we propose FaceMoE, an effective adaptation of the Mixture of Experts (MoE) transformer architecture for low-resolution face recognition. Specifically, we introduce multiple specialized feed-forward network (FFN) experts and incorporate a top-k router, which dynamically assigns tokens to appropriate experts. This design emergently promotes specialization across experts for different semantic regions of the face, which enables FaceMoE to perform resolution-aware feature extraction. Moreover, the top-k router facilitates sparse expert activation, enabling the model to preserve pretrained knowledge when fine-tuned on an LR dataset, while increasing model capacity without proportional computational overhead. FaceMoE is trained with a combined face recognition loss, router z-loss, and load balancing loss to ensure expert specialization and stable training. To the best of our knowledge, this is the first work leveraging MoE for LR-FR. Extensive experiments across eleven datasets, spanning HR, mixed-quality, and LR benchmarks, demonstrate that FaceMoE significantly outperforms state-of-the-art methods. Code: https://github.com/Kartik-3004/FaceMoE
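
Illustrative note: the routing and auxiliary losses named in the abstract can be sketched in a few lines of plain Python. This is a minimal sketch under assumptions, not the authors' code: the function names are hypothetical, the z-loss is taken to be the mean squared log-sum-exp of the router logits, and the load-balancing loss is taken to be the common "number of experts times the sum of dispatch fraction times mean router probability" form. The paper may define either term differently.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Return [(expert_index, gate_weight)] for the k largest router logits."""
    probs = softmax(logits)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalise gates over the chosen experts
    return [(i, probs[i] / norm) for i in top]

def router_z_loss(batch_logits):
    """Assumed form: mean over tokens of (log-sum-exp of router logits) squared."""
    def lse(row):
        m = max(row)
        return m + math.log(sum(math.exp(x - m) for x in row))
    return sum(lse(row) ** 2 for row in batch_logits) / len(batch_logits)

def load_balance_loss(batch_logits, k):
    """Assumed form: n_experts * sum_i f_i * p_i, where f_i is the share of
    routed slots sent to expert i and p_i is its mean router probability."""
    n_tokens, n_experts = len(batch_logits), len(batch_logits[0])
    counts = [0] * n_experts
    mean_prob = [0.0] * n_experts
    for row in batch_logits:
        for i, _ in top_k_route(row, k):
            counts[i] += 1
        for i, p in enumerate(softmax(row)):
            mean_prob[i] += p / n_tokens
    frac = [c / (n_tokens * k) for c in counts]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))
```

With only k experts active per token, the per-token compute stays roughly constant as experts are added, which is the "capacity without proportional overhead" property the abstract refers to.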

2606.32039 2026-07-01 cs.CV New submission

GEAR: Guided End-to-End AutoRegression for Image Synthesis

Bin Lin, Zheyuan Liu, Chenguo Lin, Sixiang Chen, Yunyang Ge, Yunlong Lin, Jianwei Zhang, Miles Yang, Zhao Zhong, Liefeng Bo, Li Yuan

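Affiliations: Peking University; Tencent Hunyuan

AI summary: Proposes GEAR, which jointly trains a VQ tokenizer and an autoregressive generator through representation alignment, works around the gradient problem caused by non-differentiable indices, enables end-to-end optimization, and substantially speeds up ImageNet gFID convergence while improving feature quality.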

Abstract

Visual generative models are typically trained in two stages. A tokenizer is first trained for reconstruction and then frozen, after which a generator is trained on its discrete indices or continuous latents. This decoupling leaves the tokenizer unaware of what the generator finds easy to model. We present GEAR (Guided End-to-end AutoRegression), which trains a vector-quantized (VQ) tokenizer and an autoregressive (AR) generator jointly and end-to-end, guided by representation alignment. The key obstacle is that the VQ index fed to the AR model is non-differentiable, so gradients cannot reach the tokenizer, and a straight-through estimator collapses. GEAR resolves this with a dual read-out of the codebook assignment. A hard, one-hot branch trains the AR with next-token prediction, while a differentiable soft branch carries a representation-alignment loss that flows back to guide only the tokenizer. The AR model thereby steers its tokenizer toward an index distribution it can predict more easily. This shifts the alignment burden from the tokenizer to the AR: the tokenizer's own features become less DINOv2-like while the AR's become more so, the opposite of diffusion-side recipes that make the latent itself semantic. GEAR speeds up ImageNet gFID convergence by up to 10x relative to the strong LlamaGen-REPA baseline, learns markedly better patch-level and spatially-coherent features, and generalizes across quantizers (VQVAE, LFQ, IBQ) and to text-to-image generation.
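
Illustrative note: the "dual read-out of the codebook assignment" can be pictured as computing one set of distances and reading it out twice. The sketch below is an assumption-laden toy, not the GEAR implementation: `dual_readout` and `temperature` are hypothetical names, and the soft branch is assumed to be a softmax over negative squared distances. The abstract does not specify the actual form.

```python
import math

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dual_readout(latent, codebook, temperature=1.0):
    """Return (hard_index, soft_assignment) for one latent vector.

    hard_index: nearest codebook entry (non-differentiable argmin), the kind
                of discrete index an AR model would be trained on.
    soft_assignment: softmax over negative distances (assumed form), a smooth
                function of the latent that an alignment loss could use.
    """
    dists = [squared_distance(latent, code) for code in codebook]
    hard_index = min(range(len(codebook)), key=lambda i: dists[i])
    scaled = [-d / temperature for d in dists]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    soft = [e / total for e in exps]
    return hard_index, soft
```

In an autodiff framework, gradients would flow only through the soft branch, which matches the abstract's statement that the alignment loss guides only the tokenizer.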

2606.32038 2026-07-01 cs.CL cs.AI cs.LG New submission

Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

Zifan Carl Guo, Laura Ruis, Jacob Andreas, Belinda Z. Li

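Affiliations: MIT EECS

AI summary: Studies how, when language models are trained to generate explanations, a fixed dataset of counterfactual explanations can lead the model to produce introspective explanations that track changes in its own behavior without updated supervision.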

Comments 32 pages, 19 figures

Abstract

When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their behavior, using models' counterfactual behavior on modified inputs as supervision. Surprisingly, we find that LMs trained on fixed counterfactual explanations derived from earlier checkpoints of themselves, or even from behaviorally similar models in different families, frequently produce explanations more faithful to their own current behaviors than to those of their training targets. This "introspective" coupling between LM explanations and behaviors occurs when training explanations remain sufficiently correlated with current behaviors over the course of training, even as behaviors themselves shift. We also show that introspective coupling tracks behavior shifts: when explanation training is provided concurrently with other post-training objectives, explanations track those shifts without requiring updated supervision. This phenomenon appears in multiple tasks, including sycophancy and refusal, and is robust to label noise. Overall, our results show that even fixed datasets of counterfactual explanations can provide scalable and generalizable post-training signal for introspection.

2606.32036 2026-07-01 cs.CV New submission

PointSplat: Compact Gaussian Splatting via Human-Centric Prediction

Yujie Guo, Yudong Jin, Lingteng Qiu, Zehong Shen, Zhen Xu, Jing Zhang, Xianchao Shen, Hujun Bao, Sida Peng, Xiaowei Zhou

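Affiliations: State Key Lab of CAD&CG, Zhejiang University; ByteDance; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes PointSplat, a human-centric method that infers Gaussian primitives directly from an input point set; predicting in 3D space reduces redundancy, giving a compact representation and high-quality novel-view rendering.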

Comments Project Page: https://zju3dv.github.io/pointsplat

Abstract

Producing 3D human representations from input views on the fly is essential for immersive live streaming systems, where representation compactness is as critical as high fidelity given limited computational power and transmission bandwidth. Although recent feed-forward reconstruction methods achieve impressive quality through the view-centric prediction of 3D representations, they repeatedly encode the same subject content across multiple views, leading to significant inter-view redundancy. Our key insight is to perform predictions directly in 3D space, enabling the network to learn and produce a highly compact representation. To this end, we propose PointSplat, a novel human-centric approach that directly infers Gaussian primitives from an input point set. The proposed method first estimates a coarse geometric proxy and performs ray casting to prune redundant points and establish explicit 2D-3D correspondences. Subsequently, it employs a Point-Image Transformer to fuse appearance and geometry features, predicting Gaussian attributes in a single forward pass. This design restricts predictions to foreground regions of interest, substantially reducing the total number of Gaussians while improving novel-view rendering quality. Extensive experiments demonstrate that PointSplat achieves higher efficiency and quality while exhibiting strong robustness to variations in view count and image resolution across multiple datasets.

2606.32034 2026-07-01 cs.LG cs.AI cs.CL New submission

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

Sergio Hernández-Gutiérrez, Matteo Merler, Ilze Amanda Auzina, Joschka Strüber, Ameya Prabhu, Matthias Bethge

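Affiliations: Tübingen AI Center, University of Tübingen; Fondazione Bruno Kessler

AI summary: Proposes QVal, a training-free benchmark that evaluates the quality of dense supervision signals directly through Q-value alignment, avoiding the cost of downstream training. Experiments covering 21 methods and 4 environments show that simple prompting baselines outperform existing methods.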

Comments 10 pages, 5 figures in main text; 48 pages, 6 figures with appendix

Abstract

LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide guidance that is too sparse, failing to inform the model about the goodness of intermediate actions. Dense supervision methods aim to solve this problem by scoring intermediate steps, from intrinsic confidence to self-distillation and embedding similarities. However, it is common practice to evaluate them by measuring the downstream performance of a training pipeline that integrates them. This is expensive, conflates supervision quality with training engineering confounders, and renders different methodological families requiring distinct training setups incomparable. As a result, dense supervision methods are rarely benchmarked on common ground. We introduce QVal, a training-free testbed for directly evaluating dense supervision signals. Given a state-action pair, QVal measures how well a method's score is Q-aligned: whether it orders actions according to the Q-values of a strong reference policy. This lets us compare signals before any training run and separate signal quality from other engineering choices. We instantiate QVal as QVal-v1.0, benchmarking 21 dense supervision methods across four diverse environments and seven methodological families, with over 1.2K evaluation experiments across six open-weight model backbones. We find that simple prompting baselines consistently outperform recent dense supervision methods from the literature, and that performance clusters strongly by family. These findings hold across model sizes, environments, and observation modalities. QVal is designed to be easily extensible to new environments and methods, enabling researchers to iterate on dense supervision methods before any training run.
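
Illustrative note: "Q-aligned" is described as ordering actions the way a reference policy's Q-values do. One simple way to score that, assumed here purely for illustration, is the fraction of action pairs whose relative order the signal reproduces. The function name `q_alignment` and the pairwise metric are hypothetical; the abstract does not say which ranking statistic QVal actually uses.

```python
from itertools import combinations

def q_alignment(scores, q_values):
    """Fraction of action pairs (with distinct Q-values) ordered the same way
    by the supervision scores and by the reference Q-values."""
    agree = total = 0
    for i, j in combinations(range(len(scores)), 2):
        if q_values[i] == q_values[j]:
            continue  # ties in Q carry no ordering information
        total += 1
        if (scores[i] - scores[j]) * (q_values[i] - q_values[j]) > 0:
            agree += 1
    return agree / total if total else float("nan")
```

Because the check needs only scores and reference Q-values for candidate actions at a state, it can be run before any policy training, which is the cost argument the abstract makes.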

2606.32033 2026-07-01 cs.CV New submission

SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE

Or Hirschorn, Aaron Olender, Eli Alshan, Ianir Ideses, Lior Fritz, Sagie Benaim

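Affiliations: Amazon Prime Video; Tel-Aviv University; Hebrew University of Jerusalem

AI summary: Proposes a zero-shot, training-free, optimization-free framework that injects spherical priors into pretrained diffusion Transformers to generate 360-degree panoramic images and videos directly, with no fine-tuning or multi-step optimization.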

Abstract

We present a zero-shot, training-free and optimization-free framework for generating 360 panoramic images and videos by directly injecting spherical priors into pre-trained diffusion transformers. Existing methods either rely on costly fine-tuning on scarce panoramic data that limits generalization, or leverage multi-step optimization that incurs prohibitive inference latency. We observe that contemporary generative models natively exhibit some panoramic priors from large-scale training. However, these emergent capabilities are insufficient, as the models fundamentally fail to satisfy the rigorous topological constraints imposed by equirectangular projection (ERP). We introduce a zero-shot and optimization-free approach that resolves these constraints at inference time. Spherical RoPE replaces standard rotary position embeddings: low-frequency channels are re-parameterized as 3D Cartesian coordinates to natively encode the spherical manifold, while high-frequency channels are harmonically quantized to enforce exact periodicity. Coupled with complementary Semantic Distortion classifier-free guidance (CFG) that explicitly steers geometry, we avoid retraining and inherit the full creative breadth of state-of-the-art models. Our approach generalizes across diverse backbones and 360 generation modalities. We demonstrate this across text-to-panorama using Flux.1, Flux.2, and LTX-Video backbones, achieving competitive performance against baselines, all while remaining training-free. Project page: https://orhir.github.io/SpheRoPE
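
Illustrative note: the abstract says low-frequency position channels are re-parameterized as 3D Cartesian coordinates on the sphere. The sketch below shows only the standard equirectangular-to-unit-sphere conversion that such a scheme could start from. It is not Spherical RoPE itself, and the function name and axis conventions are assumptions.

```python
import math

def erp_to_unit_sphere(u, v, width, height):
    """Map an equirectangular pixel position (u, v) to a point on the unit sphere.

    Assumed convention: u in [0, width] spans longitude -pi..pi and
    v in [0, height] spans latitude +pi/2 (top) to -pi/2 (bottom).
    """
    lon = 2.0 * math.pi * (u / width) - math.pi
    lat = math.pi / 2.0 - math.pi * (v / height)
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return (x, y, z)
```

In these coordinates the left and right image edges coincide, so a position encoding built on them sees the horizontal wrap-around that a flat 2D grid index does not.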

2606.32032 2026-07-01 cs.CL cs.AI New submission

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

Gabrielle Kaili-May Liu, Avi Caciularu, Gal Yona, Idan Szpektor, Arman Cohan

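Affiliations: Yale University; Google Research

AI summary: Proposes reinforcement learning with metacognitive feedback (RLMF) and metacognitive data selection, which use the quality of the model's self-judgments to refine preference learning, achieving faithful calibration (FC) and markedly improving how faithfully LLMs express uncertainty while preserving accuracy.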

Comments Code: https://github.com/yale-nlp/RLMF

Abstract

Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent their internal uncertainty, undermining trustworthiness and reliability. Since monitoring task performance and adapting behavior accordingly are central to metacognition, we posit that models capable of accurately judging their own performance are better positioned to improve it. We operationalize this idea via two novel mechanisms: reinforcement learning with metacognitive feedback (RLMF), a paradigm to refine completion rankings during preference optimization based on the quality of a model's self-judgments of performance, and metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning. We apply these innovations to the problem of faithful calibration (FC), a task that is itself fundamentally metacognitive: the goal is to align expressed with intrinsic uncertainty, difficult even for frontier LLMs. We adopt a two-stage, decoupled approach, first using these methods to calibrate the faithfulness of models' self-reported confidence scores, then mapping to natural, context-adaptable linguistic uncertainty via targeted output editing. Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy. Further, RLMF surpasses standard RL by up to 63% while enhancing models' ability to assess and express their own capability limits. This positions RLMF as a promising paradigm to enhance LLM metacognition toward improved abilities and alignment, and suggests metacognitive performance as an effective RL signal to overcome limits of prior intrinsic feedback methods.

2606.32029 2026-07-01 cs.CL cs.AI New submission

When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors

Yuqing Yang, Qi Zhu, Zhen Han, Boran Han, Zhengyuan Shen, Shuai Wang, Vassilis N. Ioannidis, Huzefa Rangwala

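Affiliations: University of Southern California; AWS AI Labs

AI summary: Systematically evaluates data referencing errors made by large language models on table tasks and proposes critic-based filtering and rejection sampling, improving answer accuracy by up to 12.0%.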

Comments ACL 2026 (Oral)

Abstract

While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., incorrectly citing or omitting table values, despite understanding the table structure. Beyond final-answer accuracy, DREs directly compromise the correctness and reliability of intermediate reasoning steps. Yet prior studies have only offered limited, small-scale analyses. In this work, we present the first systematic evaluation of tabular data referencing errors across different models and tasks. Our results show that DREs occur across all tested models (1.7B to 20B parameters). Furthermore, we demonstrate that incorporating data referencing as a critic significantly improves answer accuracy, by up to 12.0%, through critic-based filtering and rejection sampling. Finally, we train a lightweight 4B-parameter critic model that achieves an average F1 score of 78.2% in detecting both in-distribution and out-of-distribution DREs, and effectively assists inference for larger models.

2606.32028 2026-07-01 cs.RO New submission

DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

Ziyu Shan, Zhenyu Wu, Xiaofeng Wang, Zheng Zhu, Ziwei Wang

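Affiliations: Nanyang Technological University, Singapore; Beijing University of Posts and Telecommunications, Beijing, China; GigaAI

AI summary: Proposes the Disentangled Video Generation World Model (DVG-WM), which decomposes world modeling into dynamics learning and visual synthesis, uses flow matching to map dynamics directly to video latents, and introduces a latent degradation mechanism to regenerate contact-rich details, achieving up to 3.97x acceleration on LIBERO and real-world platforms.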

Abstract

Video-based embodied world models provide an appealing substrate for robotic manipulation by predicting future states, yet current approaches remain limited by a fundamental entanglement: accurately modeling dynamics typically requires low-level temporal reasoning, while producing high-resolution frames demands expansive visual synthesis according to high-level semantics. This entanglement results in slow inference speed for iterative planning or too coarse predictions to retain contact-rich details. To solve this dilemma, we present Disentangled Video Generation World Model (DVG-WM), an efficient framework that explicitly decomposes world modeling into dynamics learning and visual synthesis. Conditioned on an initial observation and a language instruction, our model first generates a plausible sequence of intermediate visual states to preview the physical interaction and refines them to obtain high-fidelity videos. Furthermore, an efficient cascading mechanism is proposed, where DVG-WM uses flow matching to directly map the dynamics to video latents, and introduces a latent degradation mechanism to regenerate contact-rich details. Experiments on LIBERO and real-world platforms demonstrate improved video quality with up to 3.97 times acceleration, validating that disentangled video generation can be an efficient embodied world model for robotic manipulation.

2606.32027 2026-07-01 cs.RO cs.AI cs.LG New submission

Freeform Preference Learning for Robotic Manipulation

Marcel Torne, Anubha Mahajan, Abhijnya Bhat, Chelsea Finn

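Affiliations: Stanford University

AI summary: Proposes Freeform Preference Learning (FPL), which defines preference axes in natural language and learns a language-conditioned reward model; on long-horizon manipulation tasks it improves over sparse-reward and binary-preference methods by 38 percentage points and supports behavior composition and test-time policy steering.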

Abstract

Reward design remains a central bottleneck for autonomous robot policy improvement, especially in long-horizon manipulation tasks where sparse success labels provide too little signal and binary preferences collapse many competing notions of quality into one ambiguous signal. We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences. Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis. These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward. We use this model to train a reward-conditioned policy that optimizes across the multiple human-specified dimensions. Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points. Beyond improved performance, FPL learns dense progress signals without explicit subtask segmentation, shows compositionality of behavior not present in the data, and allows users to steer the policy towards different behaviors at test time without retraining. Blog post with videos available at https://freeform-pl.github.io/fpl.website/

2606.32026 2026-07-01 cs.LG cs.AI New submission

AdaJEPA: An Adaptive Latent World Model

Ying Wang, Oumayma Bounou, Yann LeCun, Mengye Ren

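Affiliations: New York University; AMI Labs

AI summary: Proposes AdaJEPA, a latent world model that performs test-time adaptation inside the closed loop of model predictive control, continually recalibrating the model from self-supervised signals and substantially improving planning success.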

Abstract

Latent world models enable planning from high-dimensional observations by predicting future states in a compact latent space. However, these models are typically kept frozen at test time: when their predictions become inaccurate, planning can fail, especially under test-time distribution shift. To address this, we propose AdaJEPA, an adaptive latent world model that performs test-time adaptation within the closed loop of model predictive control (MPC). After training, AdaJEPA plans and executes the first action chunk, uses the observed next-state transition as a self-supervised adaptation signal, and replans with the updated model. This closed-loop update continuously recalibrates the world model without additional expert demonstrations. Across a range of goal-reaching tasks, AdaJEPA substantially improves planning success with as few as one gradient step per MPC replanning step.
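
Illustrative note: the plan, execute, adapt, replan loop in the abstract can be mimicked with a one-parameter toy. Everything here is a stand-in chosen for illustration: the "world model" is a single scalar gain `b_hat`, the "planner" is a one-step inverse of that model, and the environment is the scalar system `s' = s + b_true * a`. None of it is AdaJEPA's actual architecture or planner.

```python
def run_closed_loop(goal, steps, adapt, b_true=0.5, b_hat=0.2, lr=0.01):
    """Toy closed loop: plan with the model, act, then optionally take one
    gradient step on the model's next-state prediction error."""
    state = 0.0
    for _ in range(steps):
        action = (goal - state) / b_hat        # plan using the current model
        predicted = state + b_hat * action     # model's predicted next state
        next_state = state + b_true * action   # environment's actual transition
        if adapt:
            # Gradient of (predicted - next_state)**2 with respect to b_hat.
            grad = 2.0 * (predicted - next_state) * action
            b_hat -= lr * grad                 # one gradient step per replanning step
        state = next_state
    return state, b_hat
```

In this toy the frozen model overshoots by a growing margin because its gain is badly underestimated, while the adapted model reaches the goal. It is meant only to show why a single observed transition per step can be a useful self-supervised signal.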

2606.32025 2026-07-01 cs.CL New submission

Generative Skill Composition for LLM Agents

Xinyu Zhao, Zhen Tan, Vaishnav Tadiparthi, Nakul Agarwal, Kwonjoon Lee, Ehsan Moradi Pari, Hossein Nourkhiz Mahjoub, Tianlong Chen

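Affiliations: University of North Carolina at Chapel Hill; Arizona State University; Honda Research Institute USA

AI summary: Proposes SkillComposer, which formalizes skill composition as task-conditioned skill sequence prediction; a constrained autoregressive decoder jointly decides the skill subset, count, and order. It is trained on a real skill library and shown to improve downstream task success.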

Abstract

Recent LLM agents benefit from skills for solving complex tasks. Skills encapsulate modular packages of procedural knowledge and instructions for performing specialized tasks, such as setting up a sandboxed environment, running a test suite, or refactoring a function across multiple files. As skill libraries grow and become reusable across tasks and domains, selecting an appropriate skill composition has emerged as a central bottleneck. Existing approaches fall into two categories. One exposes the agent's reasoning to the entire skill collection; the other performs skill retrieval via embeddings or LLM-based rerankers. Both provide useful insights; however, they miss the structural nature of skill composition, which is a joint decision over which skills, how many, and in what order: three dimensions that cannot be decoupled. We formalize this as structured skill composition: given a task and a skill library, predict an executable skill plan that jointly specifies the activated subset, count, and execution order. We propose SkillComposer, which instantiates structured skill composition as task-conditioned skill sequence prediction. SkillComposer uses a constrained autoregressive decoder over skill identifiers, so subset, count, and order emerge jointly from a single decoding pass, and dependencies between successive skills are captured naturally. We build a training set of task-composition pairs from a real, human-curated skill library. We then evaluate SkillComposer along two axes: composition quality on a held-out test set, and downstream task success on SkillsBench across two production-grade coding agents. On GPT-5.2-Codex and Gemini-3-Pro-Preview, SkillComposer raises the pass rate by +23.1 and +18.2 percentage points, respectively, over the no-skill baseline, surpassing top-3 retrieval and matching the gold-skill retrieval upper bound at lower prompt-token cost.
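
Illustrative note: the idea that subset, count, and order "emerge jointly from a single decoding pass" can be shown with a greedy constrained decoder over skill identifiers. This is a hedged sketch, not SkillComposer: `compose_skills`, `toy_score`, the `<end>` token, and the no-repeat constraint are all assumptions made for the example, and a real system would score candidates with a learned model rather than the hand-written toy scorer.

```python
END = "<end>"

def compose_skills(score_fn, task, library, max_skills=5):
    """Greedy constrained decoding: at each step choose among unused skill
    identifiers from the library, or END to stop."""
    plan = []
    while len(plan) < max_skills:
        candidates = [s for s in library if s not in plan] + [END]
        best = max(candidates, key=lambda s: score_fn(task, plan, s))
        if best == END:
            break  # the decoder itself decides how many skills to emit
        plan.append(best)
    return plan

def toy_score(task, plan, candidate):
    """Hypothetical scorer: prefers skills named in the task, in a fixed
    dependency order, then END once they are all emitted."""
    order = ["setup_sandbox", "refactor_function", "run_tests"]
    wanted = [s for s in order if s in task]
    nxt = wanted[len(plan)] if len(plan) < len(wanted) else END
    return 1.0 if candidate == nxt else 0.0
```

Because each step conditions on the plan so far, the stopping point (count) and the sequence (order) fall out of the same loop that picks the subset, in contrast to retrieving a fixed top-k set.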

2606.32023 2026-07-01 cs.CV cs.AI New submission

FLORA: A deep learning approach to predict forest attributes from heterogeneous LiDAR data

Emilie Vautier, Clément Mallet, Cédric Vega

Affiliations: Univ Gustave Eiffel, Géodata Paris, IGN, LASTIG; Univ Gustave Eiffel, Géodata Paris, IGN, LIF; Université de Lorraine, Géodata Paris, IGN, LIF

AI summary: Proposes FLORA, a framework that combines an octree backbone with ecological and spatiotemporal auxiliary variables to predict six forest attributes from heterogeneous LiDAR point clouds; validated on 32,052 plots in France, with an rRMSE of about 12.3% for dominant height.

Abstract

Forest attributes are essential for national-scale resource monitoring. Airborne LiDAR metrics are among the auxiliary variables most strongly correlated with forest attributes used in National Forest Inventory (NFI) estimates. However, producing wall-to-wall predictions remains challenging when LiDAR data are acquired under heterogeneous conditions. As national LiDAR programs expand across Europe, variability in sensors, flight parameters, seasons, and scan angles limits the robustness of existing models, which are often calibrated for local conditions. We present FLORA (Forest LiDAR Octree Regression with Auxiliary Data), a deep learning framework that predicts six forest attributes (dominant height, total volume, deciduous volume, coniferous volume, basal area, and stem density) from heterogeneous LiDAR point clouds. FLORA combines an octree-based backbone with ecological and spatiotemporal auxiliary variables through a late-fusion gating mechanism. Models are trained and evaluated on 32,052 National Forest Inventory plots across mainland France using data from the French LiDAR HD program. A single model trained on both leaf-on and leaf-off acquisitions outperforms season-specific models and improves cross-season robustness. Auxiliary variables provide modest overall gains but contribute more strongly to species-specific volume prediction. FLORA achieves an rRMSE of about 12.3% (R² = 0.88) for dominant height and 39% (R² = 0.74) for total volume, providing a robust baseline for large-scale forest attribute estimation from heterogeneous national LiDAR programs.
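
Illustrative note: the two metrics quoted in the abstract can be computed as below. The sketch assumes rRMSE means RMSE divided by the mean of the observed values, expressed as a percentage, which is a common convention in forest inventory work. The abstract does not state the normalization, and the function names are hypothetical.

```python
import math

def rrmse_percent(observed, predicted):
    """Relative RMSE in percent: 100 * RMSE / mean(observed) (assumed convention)."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return 100.0 * rmse / (sum(observed) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Normalizing by the mean makes errors comparable across attributes with different units, which is why a 12.3% error on height and a 39% error on volume can be reported side by side.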

2606.32022 2026-07-01 cs.LG cs.CL New submission

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

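Affiliations: Monash University; Technical University of Munich; Chongqing University

AI summary: Proposes the Semantic Reference Frame (SemRF), which fixes anchors to obtain stable measurements of residual-stream semantic coordinates and uses the Voronoi trace to define a minimum-action path, relating layerwise computation to parameter efficiency.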

Comments an early-stage version

Abstract

Residual-stream analysis asks how language-model computation evolves across depth, but intermediate decoding requires comparable readout coordinates across layers. If embedding anchors and unembedding readout disagree on the chosen span, apparent motion may reflect measurement drift rather than computation. We introduce Semantic Reference Frames (SemRF), an anchor-based formalism separating semantic measurement from residual dynamics. A SemRF fixes anchors and measures states against them. Pseudo-inverse tying gives exact synchronization; under restricted bi-invertibility, SemRF yields stable semantic-basis coordinates, distortion bounds, and near-identity changes. With the frame fixed, residual computation becomes a depthwise semantic trajectory. The anchors induce a semantic Voronoi diagram: distance, or evidence such as logits, assigns each layer to a coarse cell, while coordinates retain within-cell motion and margins. We define layerwise steps, contribution profiles, and imbalance diagnostics, then use the Voronoi trace to define a margin-relaxed tube. The canonical trace is the minimum-action path inside this tube; when nonempty with positive quadratic weight, it is unique and obeys a discrete spline equation away from active constraints. Excess action controls step, curvature, and profile mismatch. Low curvature implies piecewise-linear compressibility and local knowledge density: lower trace complexity means fewer semantic knots. Through the parameter-to-trajectory map, this gives a conditional link to parameter efficiency: among admissible settings fitting data, lower-action and lower-complexity traces use fewer semantic degrees of freedom. The guarantees require controlled interface error and small projection residual under explicit tube constraints.
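The idea of a Voronoi trace, where each layer's state is assigned to the cell of its nearest anchor, can be sketched in a toy setting. The two-dimensional anchors, the layer states, and the use of Euclidean distance are all assumptions made for illustration; the paper's actual anchors come from the model's embedding and unembedding maps.

```python
import math

def nearest_anchor(state, anchors):
    """Name of the anchor closest to `state` under Euclidean distance."""
    return min(anchors, key=lambda name: math.dist(state, anchors[name]))

def voronoi_trace(layer_states, anchors):
    """Coarse cell label for every layer, in depth order."""
    return [nearest_anchor(state, anchors) for state in layer_states]

def step_lengths(layer_states):
    """Distance moved between consecutive layers (the layerwise steps)."""
    return [math.dist(a, b) for a, b in zip(layer_states, layer_states[1:])]

# Hypothetical fixed anchors and per-layer residual states in a 2-D toy space.
anchors = {"cat": (0.0, 0.0), "dog": (4.0, 0.0)}
layer_states = [(0.5, 0.0), (1.5, 0.5), (2.5, 0.5), (3.5, 0.0)]

trace = voronoi_trace(layer_states, anchors)
print(trace)  # ['cat', 'cat', 'dog', 'dog']
print(step_lengths(layer_states))
```

The trace only records which cell each layer falls in; the coordinates themselves are kept alongside it so that motion inside a cell is not lost.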

2606.32020 2026-07-01 cs.CV 新提交

Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers

跨空间蒸馏:用现代扩散教师训练一步学生模型

Anh Nguyen, Ngan Nguyen, Duc Vu, Trung Dao, Viet Nguyen, Quan Dao, Kien Nguyen, Chi Tran, Phong Nguyen, Khoi Nguyen, Cuong Pham, Dimitris Metaxas, Vishal M. Patel, Anh Tran

发表机构 * Qualcomm AI Research(高通AI研究院) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Johns Hopkins University(约翰霍普金斯大学) Rutgers University(罗格斯大学)

AI总结 提出跨空间蒸馏框架,通过轻量级潜在接口Bridge解决教师与学生模型潜在空间不匹配问题,实现异构教师蒸馏到紧凑学生模型,显著提升性能。

Comments ECCV 2026

详情
AI中文摘要

现代一步扩散模型通过基于分布的时步蒸馏实现了令人印象深刻的生成质量。然而,它们依赖于一个关键假设:教师和学生必须位于相同的潜在空间中。这种共享空间约束阻止了知识从现代高容量教师(例如SD 3.5和Flux)迁移到紧凑、部署友好的学生模型(例如SD 1.5),后者的潜在分辨率和VAE参数化与教师不同。我们将这种被忽视的情况形式化为跨空间蒸馏,其中教师和学生在潜在分辨率和VAE空间上均存在差异。为了在这种不匹配下实现蒸馏,我们引入了Bridge,一个轻量级潜在接口,它将学生潜在映射到教师空间,而无需修改学生主干。Bridge结合了冻结的学生VAE解码器作为空间先验和紧凑的可学习投影器,并通过潜在重建和注意力保真度目标进行训练,以实现稳定的教师空间对齐。在多种现代教师模型上,Bridge为紧凑的一步学生模型带来了显著的性能提升;例如,它将SD 1.5的HPSv3从5.4提高到9.4,同时保持一步推理、低延迟和广泛的生态系统兼容性。这些结果表明,通过轻量级潜在空间接口,异构的大教师模型可以被蒸馏到高效、可部署的主干网络中。

英文摘要

Modern one-step diffusion models achieve impressive quality through distribution-based timestep distillation. Yet, they rely on a critical assumption: Teacher and Student must inhabit the same latent space. This Shared-Space constraint prevents knowledge transfer from modern high-capacity Teachers (e.g., SD 3.5 and Flux) into compact, deployment-friendly Students such as SD 1.5, whose latent resolution and VAE parameterization differ from the Teacher. We formalize this overlooked regime as Cross-Space Distillation, where Teacher and Student differ in both latent resolution and VAE space. To enable distillation under this mismatch, we introduce the Bridge, a lightweight latent interface that maps Student latents into the Teacher space without modifying the Student backbone. Bridge combines a frozen Student VAE decoder as a spatial prior with a compact learnable projector, and is trained with latent reconstruction and attention fidelity objectives for stable Teacher-space alignment. Across diverse modern Teachers, Bridge enables substantial gains for compact one-step Students; for example, it improves SD 1.5 from 5.4 to 9.4 HPSv3 while preserving one-step inference, low latency, and broad ecosystem compatibility. These results show that heterogeneous large Teachers can be distilled into efficient, deployable backbones through a lightweight latent-space interface.

2606.32018 2026-07-01 cs.CV cs.LG 新提交

Automated Background Swapping for Robustness against Spurious Backgrounds

自动背景交换以增强对虚假背景的鲁棒性

Cesar Roder, Kajetan Schweighofer

发表机构 * Johannes Kepler University Linz(林茨约翰·开普勒大学) Cognizant AI Lab(Cognizant AI实验室)

AI总结 提出自动背景交换(AutoBackSwap)方法,通过辅助网络分离前景和背景并合成完整背景,增强训练数据,减少分类器对虚假背景的依赖,在多个任务上优于先前方法。

详情
AI中文摘要

基于深度神经网络的分类器在各个领域表现出色,但如果它们依赖于虚假相关性(即训练数据中与目标标签相关但无因果联系的特征),则可能灾难性地失败。对于视觉领域,许多此类虚假相关性体现在图像的背景中,只有前景与类别标签相关。本文提出自动背景交换(AutoBackSwap)以减少分类器对此类虚假背景的依赖。AutoBackSwap 使用辅助网络分离前景和背景,然后通过填充合成完整背景,最后将不同的前景和修复后的背景组合以增强训练数据。我们发现,仅需几百个样本的逐块标注即可训练辅助网络,并在具有挑战性的图像分类任务上自动增强整个训练数据集。与许多先前方法不同,即使训练数据中没有单个样本打破虚假相关性,AutoBackSwap 也证明非常有效。在一系列具有虚假背景的图像分类任务中,AutoBackSwap 始终优于先前方法。

英文摘要

Classifiers based on Deep Neural Networks exhibit strong performance across domains, yet can fail catastrophically if they rely on spurious correlations, i.e., features that are predictive of the target label in the training data but are not causally linked and thus fail to generalize. For the vision domain, many such spurious correlations manifest themselves within the background of the image, where only the foreground is predictive of the class label. In this paper, we introduce Automated Background Swapping (AutoBackSwap) to reduce the reliance of classifiers on such spurious backgrounds. AutoBackSwap uses a secondary network to disentangle the foreground and background, followed by infilling to synthesize complete backgrounds, and finally combines different foregrounds and inpainted backgrounds to augment the training data. We find that patch-wise labeling of just a few hundred samples suffices to train the secondary network and automatically augment the full training dataset on challenging image classification tasks. In contrast to many previous methods, AutoBackSwap proves very effective even if there is not a single sample in the training data breaking the spurious correlation. Across a range of image classification tasks with spurious backgrounds, AutoBackSwap consistently outperforms prior methods.
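The final compositing step, combining one image's foreground with another image's inpainted background, can be sketched as follows. This is a minimal illustration on tiny integer grids that assumes a binary foreground mask is already available; in AutoBackSwap the mask comes from the secondary network and the background from infilling, neither of which is modeled here.

```python
def swap_background(image, mask, new_background):
    """Keep pixels where mask is 1 (foreground); take new_background elsewhere."""
    return [
        [fg if m else bg for fg, m, bg in zip(img_row, mask_row, bg_row)]
        for img_row, mask_row, bg_row in zip(image, mask, new_background)
    ]

# Toy 2x3 "images": 9 marks the object, 1 the original (spurious) background.
image = [[9, 9, 1],
         [9, 9, 1]]
mask = [[1, 1, 0],
        [1, 1, 0]]
background = [[5, 5, 5],
              [5, 5, 5]]

augmented = swap_background(image, mask, background)
print(augmented)  # [[9, 9, 5], [9, 9, 5]]
```

Pairing each foreground with backgrounds drawn from other classes is what breaks the background-label correlation in the augmented training set.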

2606.32017 2026-07-01 cs.LG cs.AI 新提交

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

TRIAGE:面向智能体强化学习的角色类型化信用分配

Yuanda Xu, Zhengze Zhou, Hejian Sang, Xiaomin Li, Jiaxin Zhang, Xinchen Du, Zhipeng Wang, Alborz Geramifard

发表机构 * LinkedIn Corporation(领英公司) Harvard University(哈佛大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出TRIAGE框架,通过角色类型化信用分配修正GRPO仅依赖结果信用的盲点,在ALFWorld等任务中提升成功率并减少交互步数。

详情
AI中文摘要

智能体强化学习需要将信用分配给环境交互动作,如搜索、点击、编辑、导航命令和对象交互。标准GRPO使用最终验证器结果作为所有动作令牌的统一优势。该结果信号有用但结构不完整:它在失败轨迹中惩罚有用的探索,并在成功轨迹中强化冗余或倒退动作。我们提出TRIAGE,一种角色类型化信用分配框架,为结果信用添加语义角色轴。结构化评判器将每个片段分类为决定性进展、有用探索、无进展基础设施或倒退,固定角色条件规则将这些标签映射到有界片段级过程奖励。这保持了验证器结果作为优化方向的来源,同时纠正了仅结果信用的两个主要盲点。我们进一步证明,角色条件信用是从角色标签本身可表达的最优片段级修正——将每片段优势残差投影到角色变量上——因此当评判器可靠时,固定角色常数减少优势估计误差,并将其与低方差策略梯度联系起来。在ALFWorld、Search-QA和WebShop上,TRIAGE在两种策略模型上均优于GRPO的成功率,并优于标量评判器推导的过程奖励和结果监督共享骨干价值基线。消融实验表明,增益来自角色类型化而非仅仅添加密集奖励:成功轨迹内倒退的可靠检测是主要贡献者,而探索信用提供一致的次要增益;在完成的ALFWorld和WebShop轨迹上,TRIAGE相对于GRPO还额外减少了10.4%和14.8%的环境交互步数。

英文摘要

Agentic reinforcement learning requires assigning credit to environment-facing actions such as searches, clicks, edits, navigation commands, and object interactions. Standard GRPO uses the final verifier outcome as a uniform advantage over all action tokens. This outcome signal is useful but structurally incomplete: it punishes useful exploration in failed rollouts and reinforces redundant or regressive actions in successful rollouts. We propose TRIAGE, a role-typed credit assignment framework that adds a semantic role axis to outcome credit. A structured judge classifies each segment as decisive progress, useful exploration, no-progress infrastructure, or regression, and a fixed role-conditioned rule maps these labels to bounded segment-level process rewards. This keeps verifier outcomes as the source of optimization direction while correcting the two main blind spots of outcome-only credit. We further show that role-conditioned credit is the optimal segment-level correction expressible from role labels alone (a projection of the per-segment advantage residual onto the role variable), so that the fixed role constants reduce advantage estimation error whenever the judge is reliable, and we connect this to lower-variance policy gradients. Across ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO for two policy models and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline. Ablations show that the gain comes from role typing rather than merely adding dense rewards: reliable detection of regression inside successful trajectories is the dominant contributor, while exploration credit provides a consistent secondary gain; on completed ALFWorld and WebShop rollouts, TRIAGE also reduces environment-facing turns by an additional 10.4% and 14.8% relative to GRPO.
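The fixed role-conditioned rule can be pictured as a lookup from role label to a bounded constant that is added to the shared outcome advantage. The sketch below is only an illustration: the constants in `ROLE_BONUS`, the clipping bound, and the additive combination are assumptions, since the abstract does not give the actual values or the exact combination rule.

```python
# Hypothetical role constants; the paper's actual values are not given in the abstract.
ROLE_BONUS = {
    "decisive_progress": 0.5,
    "useful_exploration": 0.2,
    "no_progress_infrastructure": 0.0,
    "regression": -0.5,
}

def segment_advantages(outcome_advantage, roles, bonus=ROLE_BONUS, clip=1.0):
    """Add a bounded, role-dependent process reward to the outcome advantage."""
    advantages = []
    for role in roles:
        process_reward = max(-clip, min(clip, bonus[role]))
        advantages.append(outcome_advantage + process_reward)
    return advantages

# Failed rollout: exploration is punished less than regression.
print(segment_advantages(-1.0, ["useful_exploration", "regression"]))
# Successful rollout: a regressive segment earns less credit than decisive progress.
print(segment_advantages(+1.0, ["decisive_progress", "regression"]))
```

Because the process reward is bounded below the magnitude of the outcome term in this sketch, the sign of each segment's advantage on a successful rollout still follows the verifier outcome.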

2606.32016 2026-07-01 cs.LG 新提交

FedLAB: Traceable Semantic Codebooks for Federated Multimodal Graph Foundation Learning

FedLAB: 面向联邦多模态图基础学习的可追踪语义码本

Zekai Chen, Kairui Yang, Xuaner Chen, Xunkai Li, Xun Wu, Rong-Hua Li, Guoren Wang

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 提出FedLAB框架,通过类型化分层码本组织模态证据、节点语义和拓扑上下文,实现联邦多模态图学习中的可追踪语义,在10个基准和6个下游任务上提升达7.53%。

详情
AI中文摘要

多模态图基础模型旨在从富含文本、图像、属性和关系拓扑的图中学习可重用知识,从而支持多样化的以图为中心和以模态为中心的任务。然而,在实践中,此类多模态图通常分布在去中心化的客户端上,由于隐私约束,原始内容和局部结构无法集中共享。这推动了联邦多模态图基础学习,它不仅需要可迁移的表示学习,还需要在严格数据隔离下实现内在的语义可追踪性。现有方法通常通过参数、原型、嵌入或紧凑码本交换或存储知识,这些方法支持优化和迁移,但并未明确揭示模态证据、节点语义和拓扑上下文如何共同支持预测。为弥补这一差距,我们提出了FedLAB,一个可追踪的语义码本框架,它将多模态图知识组织为用于模态证据、节点语义和拓扑上下文的类型化分层码本。FedLAB通过联邦语义重心预训练进一步细化这些追踪单元,同时保持原始多模态内容和图结构的本地性。在10个基准和6个下游任务上的大量实验表明,FedLAB相比最先进的基线提升了高达7.53%,同时保留了原生的语义追踪接口。

英文摘要

Multimodal graph foundation models aim to learn reusable knowledge from graphs enriched with text, images, attributes, and relational topology, thereby supporting diverse graph-centric and modality-centric tasks. In practice, however, such multimodal graphs are often distributed across decentralized clients, where raw contents and local structures cannot be centrally shared due to privacy constraints. This motivates federated multimodal graph foundation learning, which requires not only transferable representation learning but also intrinsic semantic traceability under strict data isolation. Existing methods usually exchange or store knowledge through parameters, prototypes, embeddings, or compact codebooks, which support optimization and transfer but do not explicitly expose how modality evidence, node semantics, and topology context jointly support predictions. To bridge this gap, we propose FedLAB, a traceable semantic codebook framework that organizes multimodal graph knowledge into typed hierarchical codebooks for modality evidence, node semantics, and topology context. FedLAB further refines these trace units through federated semantic barycenter pre-training while keeping raw multimodal contents and graph structures local. Extensive experiments on 10 benchmarks and 6 downstream tasks show that FedLAB improves over state-of-the-art baselines by up to 7.53%, while preserving a native semantic trace interface.

2606.32012 2026-07-01 cs.LG cs.CV 新提交

CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation

CoMet:多模态不确定性估计的上下文与多样性分解

Sanghyuk Chun, William Yang, Amaya Dharmasiri, Olga Russakovsky

发表机构 * Princeton University(普林斯顿大学)

AI总结 提出CoMet方法,将多模态大模型的不确定性分解为上下文项和多样性项,通过轻量级后验模块高效估计,无需自回归生成或重复采样,在多个基准上优于现有方法。

Comments 33 pages, 13.3MB

详情
AI中文摘要

不确定性估计一直是AI模型中的长期挑战;它相当于“知道自己不知道”,而元认知即使对人类来说也是出了名的困难(参见邓宁-克鲁格效应)。尽管即使在更简单的分类系统中也远未解决,但在多模态大语言模型(MLLMs)中解决这一问题正变得越来越重要。在MLLMs中,不确定性可能源于多种来源及其关系,还可能源于开放设定中无界答案。为解决这些问题,我们提出CoMet,一种通过将不确定性分解为上下文特定项和多样性特定项来进行MLLM不确定性估计的方法。前者捕获给定上下文(例如任务或提示)引起的歧义,而后者捕获由上下文确定的合理答案中有多少与给定输入兼容。我们训练一个轻量级的后验不确定性模块来估计这些量,从而无需自回归答案生成或重复采样即可实现高效的不确定性估计。在各种开放多模态基准、幻觉检测和多项选择视觉问答基准上的实验表明,CoMet在保持实际效率的同时,持续改进了现有基线的不确定性估计。代码:https://github.com/princetonvisualai/comet_uncertainty

英文摘要

Uncertainty estimation has been a long-standing challenge in AI models; it amounts to "knowing what you don't know," and metacognition is notoriously difficult even for humans (cf. the Dunning-Kruger effect). Although it is still far from solved even in simpler classification systems, tackling it in multimodal large language models (MLLMs) is becoming increasingly important. Within MLLMs, uncertainty can stem from any of the diverse sources as well as from their relationships, and further can stem from the unbounded answers in the open-ended setting. To tackle the issues, we propose CoMet, an MLLM uncertainty estimation method by decomposing uncertainty into a context-specific term and a multiplicity-specific term. The former captures ambiguity induced by the given context (e.g., task or prompt), while the latter captures how many plausible answers determined by the context remain compatible with the given input. We train a lightweight post-hoc uncertainty module to estimate these quantities, which enables efficient uncertainty estimation without autoregressive answer generation or repeated sampling. Experiments on various open-ended multimodal benchmarks, hallucination detection, and multiple-choice visual question answering benchmarks show that CoMet consistently improves uncertainty estimation over existing baselines while remaining efficient in practice. Code is available at https://github.com/princetonvisualai/comet_uncertainty

2606.32009 2026-07-01 cs.RO 新提交

Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos with Human-Aligned Embodiments

人类作为人形机器人:通过人类对齐的具身从自我-外部人类视频实现零样本人形机器人学习

Xiaopeng Lin, Ruoqi Yang, Shijie Lian, Zhaolong Shen, Bin Yu, Changti Wu, Haibao Liu, Yuxiang Zhang, Hong Li, Qiyuan Su, Haochen Liu, Xuguo He, Yukun Shi, Cong Huang, Zhirui Zhang, Bojun Cheng, Kai Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) DeepCybo ZGCA ZGCI Harbin Institute of Technology(哈尔滨工业大学) Huazhong University of Science and Technology(华中科技大学) Beihang University(北京航空航天大学)

AI总结 提出Human-as-Humanoid框架,通过对齐机器人本体、感知设置和动作标签接口,将大规模人类演示视频转化为可执行的动作监督,实现高自由度人形机器人的零样本学习。

Comments 20 pages, 9 figures

详情
AI中文摘要

跨机器人本体的视觉-语言-动作(VLA)模型需要高质量的观察-动作监督来学习可部署的动作分布,但扩展此类机器人数据仍然困难,尤其是对于高自由度的人形机器人。遥操作提供控制器对齐的监督,而人类自我中心视频捕捉多样的双手操作,但并未直接提供可执行的机器人动作。我们引入Human-as-Humanoid,一个人到人形机器人的监督框架,能够实现近实时的人类中心动作生成,通过联合对齐机器人本体、感知设置和动作标签接口,使人类演示可用于高自由度人形机器人VLA训练。基于PrimeU(一个人类对齐的60自由度上半身人形机器人),Human-as-Humanoid使用同步的自我-外部视频将部署对齐的自我中心观察与外部运动恢复配对,通过分阶段逆运动学(IK)将恢复的人类运动重定向为控制器对齐的60自由度动作块,并使用前向运动学(FK)感知的监督训练VLA模型,以保留手腕和指尖的任务空间几何。这将大规模人类演示从视觉观察转换为目标人形机器人的可执行观察-动作监督。实验在运动恢复、机器人动作空间和真实机器人部署层面验证了转换链。Human-as-Humanoid在我们的数据收集分析中比人形机器人遥操作实现了4.8-7.2倍的原始演示吞吐量增益,并且在多个下游任务中,仅使用转换后的人类标签进行后训练的策略能够泛化到真实机器人部署,无需目标任务机器人演示。官方项目网站:https://zgc-embodyai.github.io/Human-as-Humanoid

英文摘要

Vision-language-action (VLA) models across robot embodiments require high-quality observation-action supervision to learn deployable action distributions, yet scaling such robot data remains difficult, especially for high-DoF humanoids. Teleoperation provides controller-aligned supervision, while human egocentric videos capture diverse bimanual manipulation but do not directly provide executable robot actions. We introduce Human-as-Humanoid, a human-to-humanoid supervision framework that enables near-real-time human-centric action generation, making human demonstrations usable for high-DoF humanoid VLA training by jointly aligning the robot embodiment, the sensing setup, and the action-label interface. Built on PrimeU, a human-aligned 60-DoF upper-body humanoid, Human-as-Humanoid uses synchronized ego-exo videos to pair deployment-aligned egocentric observations with exocentric motion recovery, retargets the recovered human motion through staged Inverse Kinematics (IK) into controller-aligned 60-DoF action chunks, and trains the VLA model with Forward Kinematics (FK)-aware supervision to preserve wrist and fingertip task-space geometry. This converts large-scale human demonstrations from visual observations into executable observation-action supervision for the target humanoid. Experiments validate the conversion chain at the motion-recovery, robot-action-space, and real-robot deployment levels. Human-as-Humanoid yields a 4.8-7.2x raw demonstration-throughput gain over humanoid teleoperation in our data-collection analysis, and on several downstream tasks, policies post-trained only with the converted human labels generalize to real-robot deployment without target-task robot demonstrations. The official project website is available at https://zgc-embodyai.github.io/Human-as-Humanoid.

2606.32008 2026-07-01 cs.LG 新提交

Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?

替代保真度:开放LLM何时能解释封闭LLM?

Philippe Chlenski, Zachariah Carmichael, Ayush Warikoo, Chia-Tse Shao, Yingxiao Ye, Aobo Yang, Vivek Miglani, Nehal Bandi

发表机构 * Meta

AI总结 研究开放模型能否替代解释封闭模型,发现预测保真度显著高估归因保真度,模型虽答案一致但理由不同,机械解释不能自动迁移。

详情
AI中文摘要

机械可解释性(MI)需要完全访问模型内部,但大多数广泛部署的语言模型的API最多只暴露输出令牌的对数概率。这产生了一个替代问题:对开放模型进行的测量何时能让我们对封闭模型做出断言?我们在预测、归因和表示层面评估替代保真度。对于二分类任务,对数几率提供了模型表示空间的API兼容标量读数,而留一法归因提供了对模型行为的洞察。跨越四个系列(Llama、Qwen、GPT和Gemini)的十一个模型,我们发现预测保真度显著高估了归因保真度:对答案是什么达成一致的模型往往对为什么有分歧。我们记录了一个访问-有效性反转:像注意力模式和扰动幅度这样的白盒信号在模型间高度稳定,但仅弱预测因果归因,而黑盒输入消融通过设计捕获了这些归因。机械见解不会自动迁移到封闭目标,预测层面的同意不足以保证这种迁移。代码和结果:https://github.com/facebookresearch/surrogate

英文摘要

Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do measurements made on open models allow us to make claims about a closed model? We evaluate surrogate fidelity at the prediction, attribution, and representation levels. For binary classification tasks, log-odds provide an API-compatible scalar readout of the model's representation space, and leave-one-out attributions provide insight into model behavior. Across eleven models spanning four families (Llama, Qwen, GPT, and Gemini), we find that prediction fidelity substantially overstates attribution fidelity: models that agree on what the answer is often disagree on why. We document an access-validity inversion: white-box signals like attention patterns and perturbation magnitudes are highly stable across models but only weakly predictive of causal attributions, which black-box input ablations capture by design. Mechanistic insight does not automatically transfer to closed targets, and prediction-level agreement is insufficient to warrant such transfer. Code and results are available at https://github.com/facebookresearch/surrogate.
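Leave-one-out attribution on a log-odds readout, and the way two models can agree on the answer while disagreeing on why, can be shown with a toy sketch. The two "models" below are hypothetical bag-of-words scorers invented for illustration; with a real closed model, the probability would instead be read from the API's output log-probabilities.

```python
import math

def log_odds(p):
    """Log-odds of a binary-classification probability."""
    return math.log(p / (1.0 - p))

def leave_one_out(prob_fn, tokens):
    """Attribution of token i = drop in log-odds when token i is removed."""
    base = log_odds(prob_fn(tokens))
    return [
        base - log_odds(prob_fn(tokens[:i] + tokens[i + 1:]))
        for i in range(len(tokens))
    ]

def make_toy_model(weights):
    """Hypothetical scorer: sigmoid of the summed per-token weights."""
    def prob_fn(tokens):
        z = sum(weights.get(token, 0.0) for token in tokens)
        return 1.0 / (1.0 + math.exp(-z))
    return prob_fn

open_model = make_toy_model({"great": 2.0, "film": 0.0})
closed_model = make_toy_model({"great": 0.0, "film": 2.0})
tokens = ["great", "film"]

# Both models give the same positive prediction...
print(open_model(tokens) > 0.5, closed_model(tokens) > 0.5)
# ...but attribute it to different tokens.
print(leave_one_out(open_model, tokens))
print(leave_one_out(closed_model, tokens))
```

Only input ablations and output probabilities are used, so the same procedure applies to any model that exposes log-probabilities, open or closed.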

2606.32007 2026-07-01 cs.AI 新提交

AxDafny: Agentic Verified Code Generation in Dafny

AxDafny: Dafny中的智能验证代码生成

Benjamin Breen, Austin Letson, Borja Requena Pozo, Leopoldo Sarra

发表机构 * Axiomatic AI, Boston, MA, USA(Axiomatic AI,波士顿,马萨诸塞州,美国)

AI总结 提出AxDafny框架,通过验证器引导的修复迭代生成Dafny代码和证明,在LCB-Pro-Dafny和DafnyBench上显著提升验证成功率。

详情
AI中文摘要

我们研究Dafny中的智能代码生成,其中模型必须生成可执行代码和用于验证的证明工件。我们提出了AxDafny,一个验证器引导的修复框架,迭代生成实现、不变量、断言和终止参数。我们还引入了LiveCodeBench-Pro-Dafny (LCB-Pro-Dafny),一个包含250道竞赛风格编程问题的基准测试,这些问题被翻译成Dafny并带有形式化规范和基于验证器的评估工具。在LCB-Pro-Dafny上,AxDafny相比基线GPT-5.5显著提高了验证成功率。在DafnyBench上,AxDafny达到了92.7%的验证成功率,比之前最强的证明提示基线高出6.5个百分点。最后,我们表明验证成功率和运行时测试性能衡量了生成代码的不同方面。

英文摘要

We study agentic code generation in Dafny, where a model must generate both executable code and the proof artifacts for verification. We present AxDafny, a verifier-guided repair framework that iteratively generates implementations, invariants, assertions, and termination arguments. We also introduce LiveCodeBench-Pro-Dafny (LCB-Pro-Dafny), a benchmark of 250 competition-style programming problems translated into Dafny with formal specifications and a verifier-based evaluation harness. On LCB-Pro-Dafny, AxDafny substantially improves verification success over baseline GPT-5.5 performance. On DafnyBench, AxDafny achieves 92.7% verification success, outperforming the strongest previously reported proof-hint baseline by 6.5 percentage points. Lastly, we show that verification success and runtime test performance measure different aspects of generated code.
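The general shape of a verifier-guided repair loop can be sketched as below. This is a generic illustration, not AxDafny's implementation: `toy_generate` and `toy_verify` are hypothetical stand-ins for a language model and for a call to the Dafny verifier, and the string check on "invariant" merely simulates a verification failure.

```python
def repair_loop(generate, verify, max_rounds=5):
    """Regenerate a candidate with verifier feedback until it verifies."""
    feedback = None
    for round_index in range(1, max_rounds + 1):
        candidate = generate(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate, round_index
    return None, max_rounds

def toy_generate(feedback):
    """Hypothetical model: adds a loop invariant once the verifier asks for one."""
    if feedback and "invariant" in feedback:
        return "while i < n invariant 0 <= i <= n { i := i + 1; }"
    return "while i < n { i := i + 1; }"

def toy_verify(candidate):
    """Hypothetical verifier: accepts only candidates that state an invariant."""
    if "invariant" in candidate:
        return True, ""
    return False, "loop invariant missing"

program, rounds = repair_loop(toy_generate, toy_verify)
print(rounds)   # 2
print(program)
```

Passing the verifier's message back into the next generation call is the step that distinguishes repair from plain resampling.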

2606.32004 2026-07-01 cs.AI cs.LG cs.LO cs.SC 新提交

PolicyGuard: From Organizational Policies to Neuro-Symbolic Compliance Review Engines

PolicyGuard: 从组织策略到神经符号合规审查引擎

Sameer Malik, Ayush Singh, Amar Prakash Azad

发表机构 * Fujitsu Research India Private Limited, Bengaluru, India(富士通研究印度私人有限公司,印度班加罗尔)

AI总结 提出PolicyGuard框架,将组织策略转化为可执行的神经符号审查引擎,通过分离策略形式化、局部文档解释和符号合规评估,实现显式、可维护且可系统测试的文档合规审查。

详情
AI中文摘要

基于策略的文档审查需要确定目标文档是否符合组织特定的策略、指南或手册。虽然大型语言模型可以协助策略解释和文档分析,但端到端提示使应用的策略逻辑隐含,导致合规决策难以检查、更新和测试。我们提出了PolicyGuard,一个用于基于策略的文档合规审查的神经符号框架。PolicyGuard将组织策略指南转化为一个可执行的审查引擎,该引擎由类型化的关系逻辑规则和原子级提取问题组成。在审查过程中,LLMs使用检索到的文档证据回答这些局部问题,符号评估器应用形式规则检测不合规情况。我们在公司特定的NDA合规审查上实例化并评估了PolicyGuard,其中合同条款必须根据组织特定的谈判策略进行检查。通过分离策略形式化、局部文档解释和符号合规评估,PolicyGuard使文档审查更加显式、可维护且可系统测试。

英文摘要

Policy-grounded document review requires determining whether a target document complies with organization-specific policies, guidelines, or playbooks. While large language models can assist with policy interpretation and document analysis, end-to-end prompting leaves the applied policy logic implicit, making compliance decisions difficult to inspect, update, and test. We present PolicyGuard, a neuro-symbolic framework for policy-grounded document compliance review. PolicyGuard converts organizational policy guidance into an executable review engine consisting of typed relational logic rules and atom-level extraction questions. During review, LLMs answer these local questions using retrieved document evidence, and a symbolic evaluator applies the formal rules to detect non-compliance. We instantiate and evaluate PolicyGuard on company-specific NDA compliance review, where contract clauses must be checked against organization-specific negotiation policies. By separating policy formalization, local document interpretation, and symbolic compliance evaluation, PolicyGuard makes document review more explicit, maintainable, and systematically testable.
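The separation between local extraction and symbolic evaluation can be sketched as follows. The atoms, rule names, and thresholds are hypothetical NDA policy examples invented for illustration; in PolicyGuard the atom values would come from LLM answers over retrieved document evidence, and the rules from formalized policy guidance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Atom = Union[bool, int, str]

@dataclass
class Rule:
    name: str
    violated: Callable[[Dict[str, Atom]], bool]  # True when the policy is breached

# Hypothetical policy rules (illustration only).
RULES: List[Rule] = [
    Rule("term_too_long", lambda atoms: atoms["term_years"] > 5),
    Rule("missing_mutuality", lambda atoms: not atoms["is_mutual"]),
]

def review(atoms: Dict[str, Atom], rules: List[Rule] = RULES) -> List[str]:
    """Apply every rule to the extracted atoms and list the violations."""
    return [rule.name for rule in rules if rule.violated(atoms)]

# Atom values as they might be extracted from one contract.
atoms = {"term_years": 7, "is_mutual": True}
violations = review(atoms)
print(violations)  # ['term_too_long']
```

Because the rules are ordinary data, each one can be unit-tested against hand-written atom values without calling a language model.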

2606.32002 2026-07-01 cs.AI cs.LG 新提交

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

自我学习再思考:从自生成问答中学习的隐藏脆弱性

Ekaterina Alimaskina, Denis Shveykin, Gleb Molodtsov, Igor Shalygin, Alexey Kadeishvili, Aleksandr Beznosikov

发表机构 * BRAIn Lab(BRAIn实验室) Yandex Research(Yandex研究院)

AI总结 本文揭示语言模型从自生成问答对学习中存在的脆弱性:生成阶段会偏向选择显著内容并服从文本中的指令式段落,导致覆盖不均和注入偏差。通过固定目标问题和过滤指令式跨度可显著降低偏差。

详情
AI中文摘要

语言模型越来越多地从合成的问答监督中学习:模型生成关于文档的问题,从同一文本中回答这些问题,得到的问答对被用于微调、蒸馏或将知识压缩到另一个模型中。我们表明,这个生成步骤并非中立的预处理。它是一个隐式策略,既选择哪些证据成为训练信号,又决定如何回答这些证据,并且在两个阶段都很脆弱。在选择问什么时,生成器不会均匀地扫描文档。覆盖范围早期饱和并集中在显著跨度上,不同的提示收敛到相同区域,而看起来值得提问的内容由局部呈现驱动。因此,诸如清理不当的标记等显著伪影可以跨模型系列和规模劫持问题生成。在回答时,产生监督的模型倾向于服从文本中嵌入的指令式段落。这种服从取决于段落的意图和表面形式而非其严格性,并且在任务冲突下最严重,此时更大的模型更常服从。这些失败模式源于问答生成过程中的选择,因此可以在不改变训练循环的情况下减少。将每个问题绑定到固定目标可减少有偏选择,在回答前过滤指令式跨度可将平均注入服从率从我们的评估中的88%降低到13%,同时保留几乎所有的干净文本。

英文摘要

Language models are increasingly taught from synthetic question-answer (QA) supervision: a model generates questions about a document, answers them from the same text, and the resulting pairs are used to fine-tune, distill, or compress knowledge into another model. We show that this generation step is not neutral preprocessing. It is an implicit policy that both selects which evidence becomes training signal and decides how that evidence is answered, and it is fragile at both stages. When choosing what to ask, generators do not scan a document uniformly. Coverage saturates early and concentrates on salient spans, diverse prompts converge on the same regions, and what looks question-worthy is driven by local presentation. As a result, salient artifacts such as poorly cleaned markup can hijack question generation across model families and scales. When answering, the model that produces the supervision tends to obey instruction-like passages embedded in the text. This compliance depends on the intent and surface form of the passage rather than its strictness, and is worst under task conflict, where larger models comply more often. These failure modes arise from choices made during QA generation, so they can be reduced without changing the training loop. Tying each question to a fixed target reduces biased selection, and filtering instruction-like spans before answering lowers mean injection compliance from 88% to 13% in our evaluation while retaining nearly all clean text.

2606.32000 2026-07-01 cs.LG cs.AI 新提交

Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization

径向抑制加速算法泛化:延迟泛化的几何分析

Srijan Tiwari, Aditya Chauhan, Manjot Singh

发表机构 * Indian Institute of Technology Roorkee(印度理工学院罗尔基分校)

AI总结 通过几何分析揭示交叉熵优化下隐藏表示的径向膨胀导致记忆-泛化延迟,提出径向惩罚加速泛化,在模算术和加法任务上实现最高6倍加速。

Comments 16 pages, 5 figures, 10 tables. Presented at the Workshop on High-dimensional Learning Dynamics at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

为什么神经网络在泛化之前会长时间记忆算法训练数据?我们通过一个几何案例研究表明,在需要发现结构化低维电路的任务上,记忆-泛化延迟是由交叉熵优化下隐藏表示的径向膨胀驱动的。我们形式化了激活空间动力学的径向-角度分解,并推导出三个可检验的命题:(i)惩罚径向膨胀会导致各向异性、数据相关的权重正则化;(ii)它将径向梯度能量抑制到各向同性随机基线以下,迫使更新主要为角度更新;(iii)它使收敛偏向更平坦的极小值。为了实证验证这些命题,我们研究了一个单超参数范数惩罚,该惩罚将激活软约束到sqrt(d)半径的超球面上。在模算术上,该惩罚在MLP和Transformer上加速了“顿悟”现象高达6倍,并在10M参数的nanoGPT上进行3位数加法时减少了训练步骤的一半。

英文摘要

Why do neural networks memorize algorithmic training data long before they generalize? We present a geometric case study demonstrating that, on tasks where generalization requires discovering structured low-dimensional circuits, the memorization-generalization delay is driven by radial inflation of hidden representations under cross-entropy optimization. We formalize a radial-angular decomposition of activation-space dynamics and derive three testable propositions: (i) that penalizing radial inflation induces anisotropic, data-dependent weight regularization; (ii) that it suppresses radial gradient energy below the isotropic random baseline, forcing predominantly angular updates; and (iii) that it biases convergence toward flatter minima. To empirically validate these propositions, we study a single-hyperparameter norm penalty that softly constrains activations to a sqrt(d)-radius hypersphere. On modular arithmetic, this penalty accelerates grokking up to 6x across MLPs and Transformers, and halves training steps for a 10M-parameter nanoGPT on 3-digit addition.
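A norm penalty that softly pulls activations toward the sqrt(d)-radius hypersphere can be sketched as below. The squared-deviation form and the name `strength` for the single hyperparameter are assumptions; the abstract does not give the exact functional form of the penalty.

```python
import math

def radial_penalty(hidden, strength=1.0):
    """Assumed form: strength * (||h|| - sqrt(d))^2 for one activation vector."""
    d = len(hidden)
    norm = math.sqrt(sum(x * x for x in hidden))
    return strength * (norm - math.sqrt(d)) ** 2

def radial_penalty_grad(hidden, strength=1.0):
    """Gradient w.r.t. hidden: 2 * strength * (||h|| - sqrt(d)) * h / ||h||."""
    d = len(hidden)
    norm = math.sqrt(sum(x * x for x in hidden))
    scale = 2.0 * strength * (norm - math.sqrt(d)) / norm
    return [scale * x for x in hidden]

on_sphere = [1.0, 1.0, 1.0, 1.0]   # norm 2 = sqrt(4): no penalty
inflated = [2.0, 2.0, 2.0, 2.0]    # norm 4: penalized

print(radial_penalty(on_sphere))      # 0.0
print(radial_penalty(inflated))       # 4.0
print(radial_penalty_grad(inflated))  # [2.0, 2.0, 2.0, 2.0]
```

The gradient is always parallel to the activation vector, so in this form the penalty only shrinks or grows the norm and leaves the direction of the representation untouched.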

2606.31993 2026-07-01 cs.RO 新提交

OopsieVerse: A Safety Benchmark with Damage-Aware Simulation for Robot Manipulation

OopsieVerse: 一个具有损伤感知仿真的机器人操作安全基准

Arnav Balaji, Arpit Bahety, Sriniket Ambatipudi, Daniel Lam, Junhong Xu, Roberto Martín-Martín

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出OopsieVerse框架,通过DamageSim检测和量化接触力、温度等损伤信号,在OmniGibson和RoboCasa仿真器中实现损伤感知,用于安全策略学习与评估。

Comments Project website: https://robin-lab.cs.utexas.edu/oopsieverse/. The first two authors contributed equally; order decided by dice roll. Accepted to Robotics: Science and Systems (RSS 2026)

详情
AI中文摘要

尽管机器人操作能力发展迅速,物理安全仍然是部署家用机器人的主要障碍:如果机器人损坏自身或周围环境,任务成功是不够的。仿真为昂贵且危险的真实世界训练和评估提供了无伤害的替代方案,但现有仿真器缺乏检测、量化和表示损伤的通用机制。为了解决这一差距,我们引入了OOPSIEVERSE,一个统一的仿真框架和基准,用于损伤感知的家用操作。OOPSIEVERSE通过将接触力、温度变化和液体相互作用等来源转换为相应的机械、热或流体损伤,将损伤作为显式、物理基础且任务无关的信号提供。OOPSIEVERSE包含两个核心元素:(1) DAMAGESIM,一个与仿真器无关的框架,用于在导航和操作过程中检测和量化损伤,以及(2)一套家用任务,旨在评估常见的损伤模式并区分任务完成和安全执行。我们通过在两个具有不同物理后端的仿真器OmniGibson(Nvidia Omniverse)和RoboCasa(MuJoCo)中实例化DAMAGESIM来展示我们框架的通用性。我们进一步展示了OOPSIEVERSE在多个用例中的实用性,包括(1)通过实时损伤反馈指导更安全的演示收集,(2)通过损伤条件模仿学习和强化学习学习更安全的操作策略,(3)对最先进的视觉语言动作策略进行安全基准测试,以及(4)提高仿真到现实迁移策略的真实世界安全性。总之,我们的结果凸显了OOPSIEVERSE作为安全机器人操作系统化、可扩展研究的开源基础的潜力。有关代码和更多信息,请参阅 https://robin-lab.cs.utexas.edu/oopsieverse/

英文摘要

While robotic manipulation capabilities have advanced rapidly, physical safety remains a major barrier to deploying household robots: task success is insufficient if the robot damages itself or its surroundings. Simulation offers a harm-free alternative to costly and dangerous real-world training and evaluation, yet existing simulators lack general mechanisms to detect, quantify, and represent damage. To address this gap, we introduce OOPSIEVERSE, a unified simulation framework and benchmark for damage-aware household manipulation. OOPSIEVERSE provides damage as an explicit, physically grounded, and task-agnostic signal by converting sources such as contact forces, temperature changes, and liquid interactions into corresponding mechanical, thermal, or fluid damage. OOPSIEVERSE comprises two core elements: (1) DAMAGESIM, a simulator-agnostic framework for detecting and quantifying damage during navigation and manipulation, and (2) a suite of household tasks designed to evaluate common damage modes and distinguish between task completion and safe execution. We demonstrate the generality of our framework by instantiating DAMAGESIM in two simulators with different physics backends, OmniGibson (Nvidia Omniverse) and RoboCasa (MuJoCo). We further showcase the utility of OOPSIEVERSE across multiple use cases, including (1) guiding safer demonstration collection via real-time damage feedback, (2) learning safer manipulation policies through damage-conditioned imitation learning and reinforcement learning, (3) benchmarking the safety of state-of-the-art Vision Language Action policies, and (4) improving real-world safety of sim-to-real transferred policies. Together, our results highlight the potential of OOPSIEVERSE as an open-source foundation for systematic, scalable research on safe robot manipulation. For code and more information, please refer to https://robin-lab.cs.utexas.edu/oopsieverse/

2606.31991 2026-07-01 cs.LG cs.AI 新提交

Amplifying Membership Signal Through Chained Regeneration

通过链式再生放大成员信号

Wojciech Łapacz, Stanisław Pawlak

发表机构 * Warsaw University of Technology(华沙理工大学)

AI总结 提出MADreMIA框架,利用链式再生迭代轨迹增强生成模型的成员推断和数据集推理攻击,在低误报率下提升信号强度。

详情
AI中文摘要

大型生成模型记忆训练数据的趋势使得样本验证对于隐私审计和版权执行至关重要。当前的成员推断(MIA)和数据集推理(DI)攻击通常依赖于一次性生成,这会产生弱信号并在不同模态间具有有限的敏感性。受模型自噬障碍(MAD)的启发,我们引入了MADreMIA,这是一个模型无关的框架,增强了白盒、灰盒和黑盒MIA及DI。我们的框架不依赖于影子模型训练(对于大型生成模型通常不可行),而是通过迭代轨迹利用固有信号促进可扩展的推理。该过程利用跨不同模态的链式生成,其中每个输出作为后续输入,以在低误报率下改善成员证据。我们证明,与非成员生成相比,记忆的训练样本在迭代再生过程中表现出显著更高的连贯性和更慢的退化。我们的结果表明,MADreMIA在不同模型家族和模态中提供了更丰富的信号;我们提供了对IAR、扩散和语言模型的全面评估,以及初步结果展示了其在音频模型上的潜力。

英文摘要

The tendency of large generative models to memorize training data makes sample verification critical for privacy auditing and copyright enforcement. Current membership (MIA) and dataset inference (DI) attacks often rely on one-shot generations, which yield weak signals and limited sensitivity across modalities. Inspired by Model Autophagy Disorder (MAD), we introduce MADreMIA, a model-agnostic framework that enhances white-, gray-, and black-box MIA and DI. Rather than relying on shadow model training (often infeasible for large generative models), our framework facilitates scalable inference by leveraging inherent signals through iterative trajectories. This process utilizes chained generations across diverse modalities, where each output serves as the subsequent input, to improve membership evidence at low FPR. We demonstrate that memorized training samples exhibit significantly higher coherence and slower degradation during iterative regeneration than non-member generations. Our results show that MADreMIA provides richer signals across diverse model families and modalities; we present comprehensive evaluations for IARs, diffusion, and language models, alongside preliminary results demonstrating its potential for audio models.
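Chained regeneration, where each output is fed back as the next input and the drift from the original sample is tracked, can be sketched in a toy form. The `toy_regenerate` model (which reproduces memorized strings exactly and truncates everything else) and the length-based distance are hypothetical stand-ins chosen only to make the loop runnable; they are not the scores used by MADreMIA.

```python
def chain_drift(sample, regenerate, distance, steps=4):
    """Feed each output back as input; record distance to the original sample."""
    drifts, current = [], sample
    for _ in range(steps):
        current = regenerate(current)
        drifts.append(distance(sample, current))
    return drifts

# Hypothetical "model": reproduces memorized text exactly, degrades the rest.
MEMORIZED = {"the quick brown fox"}

def toy_regenerate(text):
    return text if text in MEMORIZED else text[:-1]

def length_gap(a, b):
    return abs(len(a) - len(b))

member_drift = chain_drift("the quick brown fox", toy_regenerate, length_gap)
non_member_drift = chain_drift("a lazy dog", toy_regenerate, length_gap)

print(member_drift)      # [0, 0, 0, 0]
print(non_member_drift)  # [1, 2, 3, 4]
```

A membership score could then be any summary of the drift sequence, such as its negative mean; the longer the chain, the wider the gap between slowly and quickly degrading samples in this toy setting.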

2606.31986 2026-07-01 cs.CV 新提交

CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts

CoLT: 教会多模态模型通过潜在思维链进行思考

Lianyu Hu, Shengqian Qin, Zeqin Liao, Qing Guo, Liang Wan, Wei Feng, Yang Liu

发表机构 * Nanyang Technological University(南洋理工大学) Shanghai Jiao Tong University(上海交通大学) NKIARI AAIS & VCIP, Nankai University(南开大学AAIS & VCIP) Tianjin University(天津大学)

AI总结 提出CoLT框架,用潜在思维链代替文本推理,通过前向和后向解码器监督及内部监督实现高效多模态推理,在8个基准上优于现有方法,推理时间减少10.1倍。

Comments Accepted by ECCV 2026. Code is available at https://github.com/hulianyuyy/CoLT

详情
AI中文摘要

思维链(CoT)推理通过生成自然语言中的显式中间推理步骤,使多模态大语言模型(MLLMs)能够处理复杂的视觉推理任务。然而,这种基于文本的推理范式在推理时本质上较慢,甚至需要数千个token,并且从根本上受限于自然语言的表达能力。在本文中,我们提出CoLT(潜在思维链),一种新颖的框架,教会多模态模型通过潜在思维表示链而不是冗长的文本token进行推理,只需3步即可完成思考。强行让模型用潜在状态思考容易产生无意义的语义并使训练不稳定。为了有效规范潜在推理过程,我们引入一个轻量级的外部解码器,为每个潜在推理步骤提供两个互补方向上的步骤级监督:前向模式将潜在思维解码为下一步的文本推理,后向模式在给定前文文本上下文的情况下,将解码器隐藏状态与模型的潜在思维对齐。我们还加入了内部监督,鼓励连贯的逐步潜在转换。在推理过程中,解码器和内部监督被移除,以保持潜在推理的高效率。在八个基准上的大量实验表明,CoLT不仅优于现有的潜在推理方法(如CODI和SIM-CoT),还超越了依赖辅助图像且需要昂贵标注的潜在视觉推理方法。与文本CoT方法相比,CoLT可以将推理时间显著减少10.1倍,文本解码时间减少22.6倍。代码已发布在此https URL。

英文摘要

Chain-of-thought (CoT) reasoning has enabled multi-modal large language models (MLLMs) to tackle complex visual reasoning tasks by generating explicit intermediate reasoning steps in natural language. However, this text-based reasoning paradigm is inherently slow at inference time, sometimes requiring thousands of tokens, and is fundamentally constrained by the expressiveness of natural language. In this paper, we propose CoLT (Chain of Latent Thoughts), a novel framework that teaches multi-modal models to reason through a chain of latent thought representations instead of verbose text tokens, completing its thinking in as few as 3 steps. Naively forcing the model to think with latent states easily produces meaningless semantics and makes training unstable. To effectively regulate the latent reasoning process, we introduce a lightweight external decoder that provides step-level supervision for each latent reasoning step in two complementary directions: a forward mode that decodes latent thoughts into the textual reasoning of the next step, and a backward mode that aligns decoder hidden states with the model's latent thoughts given the preceding textual context. We further incorporate internal supervision that encourages coherent step-by-step latent transitions. The decoder and internal supervision are removed during inference to maintain the high efficiency of latent reasoning. Extensive experiments on eight benchmarks demonstrate that CoLT not only outperforms existing latent reasoning methods such as CODI and SIM-CoT, but also surpasses latent visual reasoning approaches that rely on auxiliary images with costly annotation requirements. Compared to text CoT methods, CoLT can reduce the inference time by 10.1× and the text decoding time by 22.6×. Code is released at https://github.com/hulianyuyy/CoLT.
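To make the contrast with token-by-token CoT concrete, here is a minimal sketch of a latent reasoning loop. It assumes the common latent-reasoning pattern in which each step's hidden state is fed straight back as the next input, with no token sampled in between; the abstract does not confirm that CoLT works exactly this way. `toy_backbone`, the weight matrix, and the state size are hypothetical stand-ins, and the training-time decoder and internal supervision are omitted.

```python
import math


def toy_backbone(state, weights):
    """Hypothetical stand-in for one model forward pass.

    Applies a tanh-activated linear map to the input state.
    """
    return [
        math.tanh(sum(w * x for w, x in zip(row, state)))
        for row in weights
    ]


def latent_chain(question_state, weights, num_latent_steps=3):
    """Run a short chain of latent thoughts.

    Each hidden state is fed back as the next input, so no text token is
    decoded between reasoning steps.
    """
    thoughts = []
    state = question_state
    for _ in range(num_latent_steps):
        state = toy_backbone(state, weights)
        thoughts.append(state)
    return thoughts


# Illustrative values only (assumptions, not taken from the paper).
weights = [[0.5, -0.2], [0.1, 0.8]]
thoughts = latent_chain([1.0, 0.0], weights, num_latent_steps=3)
print(len(thoughts), thoughts[-1])
```

The cost of the loop is one forward pass per latent step, which is why a chain of 3 steps can be far cheaper than decoding a long textual rationale. The final latent state would then condition ordinary decoding of the answer.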

2606.31982 2026-07-01 cs.CV 新提交

ERA: Entropy-Guided Visual Token Pruning with Rectified Attention for Efficient MLLMs

ERA:基于熵引导的视觉令牌剪枝与修正注意力机制的高效多模态大语言模型

Yuhao Wang, Mu Qiao, Haiwen Diao, Yunzhi Zhuge, Pingping Zhang, Xindong Zhang, Lei Zhang, Huchuan Lu

发表机构 * School of Future Technology, Dalian University of Technology(大连理工大学未来技术学院) S-Lab, Nanyang Technological University(南洋理工大学S-Lab) OPPO Research Institute(OPPO研究院) Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算机系)

AI总结 提出ERA框架,通过熵引导的视觉令牌剪枝和修正注意力机制,解决令牌减少导致的注意力对数坍塌问题,实现高效多模态大语言模型推理加速。

Comments 17 pages, 7 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)由于长视觉令牌序列而产生高昂的推理成本。无训练的视觉令牌减少提供了一种高效解决方案。然而,现有方法扭曲了注意力分布,导致我们称之为注意力对数坍塌的现象。为了解决这个问题,我们提出了ERA,一种基于熵引导的视觉令牌剪枝框架,具有修正注意力机制,用于高效MLLMs。具体来说,ERA包含三个关键组件:双视角熵剪枝(DEP)、偏差感知令牌回收(BTR)和对数保持注意力修正(LAR)。首先,DEP通过联合建模视觉多样性和头部显著性来识别代表性锚点令牌。然后,BTR将剪枝的令牌回收至其对应的锚点,同时估计聚类级别的对数偏差。在此基础上,LAR将估计的偏差注入注意力对数,有效修正由令牌减少引起的坍塌。这些组件共同在激进压缩下保留视觉证据,使广泛MLLMs在单图像、多图像和视频设置中实现稳健性能。除了提供实际加速外,ERA将对数保持视觉令牌剪枝建立为高效MLLMs的原则性框架,统一了理论基础、算法设计和实际部署。代码位于此https URL。

英文摘要

Multimodal Large Language Models (MLLMs) incur prohibitive inference costs due to long visual token sequences. Training-free visual token reduction provides an efficient solution. However, existing methods distort attention distributions, giving rise to a phenomenon we term Attention Logit Collapse. To address this issue, we propose ERA, an Entropy-guided visual token pruning framework with Rectified Attention for efficient MLLMs. Specifically, ERA comprises three crucial components: Dual-view Entropy Pruning (DEP), Bias-aware Token Recycling (BTR), and Logit-preserving Attention Rectification (LAR). First, DEP identifies representative anchor tokens by jointly modeling visual diversity and head-wise saliency. BTR then recycles pruned tokens into their corresponding anchors while estimating a cluster-level logit bias. Building upon this, LAR injects the estimated bias into attention logits, effectively rectifying the collapse induced by token reduction. Together, these components preserve visual evidence even under aggressive compression, enabling robust performance across single-image, multi-image, and video settings on a wide range of MLLMs. Beyond delivering practical acceleration, ERA establishes logit-preserving visual token pruning as a principled framework for efficient MLLMs, unifying theoretical foundation, algorithmic design, and practical deployment. The code is at https://github.com/924973292/ERA.
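The effect of injecting a bias into attention logits can be seen in a small worked example. This is a sketch under simplifying assumptions, not ERA's estimator: the abstract does not say how the cluster-level bias is computed, so the example uses the simple correction log(cluster size), and it assumes all tokens in a cluster share the same logit. The logit values are invented.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits (assumption): one text token and a cluster of
# four visual tokens that all receive the same logit.
text_logit = 2.0
visual_logit = 1.0
cluster_size = 4

# Full sequence: attention mass received by the whole visual cluster.
full = softmax([text_logit] + [visual_logit] * cluster_size)
mass_full = sum(full[1:])

# Naive pruning: keep one anchor, drop the rest. The cluster's mass shrinks.
mass_pruned = softmax([text_logit, visual_logit])[1]

# Rectified: add log(cluster_size) to the anchor's logit before softmax.
bias = math.log(cluster_size)
mass_rectified = softmax([text_logit, visual_logit + bias])[1]

print(round(mass_full, 4), round(mass_pruned, 4), round(mass_rectified, 4))
```

Under these assumptions the rectified anchor receives the same attention mass as the original cluster, because exp(logit + log n) equals n · exp(logit). When the pruned tokens' logits differ from the anchor's, the required bias is no longer exactly log n, which is presumably why a method would estimate it per cluster.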

2606.31981 2026-07-01 cs.CV cs.AI 新提交

LUNA: Learning Universal 3D Human Animation Beyond Skinning

LUNA: 超越蒙皮的学习通用3D人体动画

Peng Li, Rawal Khirodkar, Junxuan Li, Yuan Dong, Chen Cao, Yuan Liu, Wenhan Luo, Yike Guo, Shunsuke Saito

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Codec Avatars Lab, Meta(Meta编解码虚拟形象实验室)

AI总结 提出无LBS的通用神经动画模型LUNA,通过Transformer运动回归器将多种2D控制映射为3D高斯变形,结合混合监督实现零样本跨身份泛化。

Comments ECCV 2026, Project page: https://penghtyx.github.io/LUNA/

详情
AI中文摘要

从单目图像创建逼真、可动画化的3D人体化身仍然在很大程度上依赖于线性混合蒙皮(LBS)和参数化身体模型,这限制了表现力并常因拟合不完美而引入伪影。我们提出LUNA,一种无LBS的通用神经动画模型,直接将多种2D控制(如图像、关键点、草图及未见角色)映射为3D高斯变形,绕过了显式身体拟合。其核心是一个基于Transformer的运动回归器,将全局刚体运动与细粒度局部动态解耦,以捕捉连贯运动和细微的非刚性效果。为解决2D到3D提升的固有歧义并扩展到拟合数据集之外,我们引入了混合监督,从LBS教师中蒸馏软结构先验,并设计了一种损失函数,支持在有限的拟合数据和大量野外未标记视频上训练。大量实验表明,与基于LBS的方法相比,LUNA实现了具有竞争力的视觉保真度,同时提供了逼真的人体运动和跨多种驱动模态的零样本跨身份泛化。据我们所知,LUNA是第一个支持隐式2D驱动的端到端3D可动画化模型。

英文摘要

Creating photorealistic, animatable 3D human avatars from monocular images still largely depends on Linear Blend Skinning (LBS) and parametric body models, which constrain expressivity and often introduce artifacts due to imperfect fitting. We propose LUNA, an LBS-free universal neural animation model that directly maps diverse 2D controls, such as images, keypoints, sketches, and unseen characters, into 3D Gaussian deformations, bypassing explicit body fitting. At its core, a transformer-based motion regressor disentangles global rigid motion from fine-grained local dynamics to capture both coherent movement and subtle non-rigid effects. To resolve the inherent ambiguity of 2D-to-3D lifting while scaling beyond fitted datasets, we introduce hybrid supervision that distills soft structural priors from an LBS teacher, together with a loss that supports training on both limited fitted data and large-scale in-the-wild unlabeled videos. Extensive experiments show that LUNA achieves competitive visual fidelity compared to LBS-based approaches, while delivering realistic human motion and zero-shot cross-identity generalization across diverse driving modalities. To the best of our knowledge, LUNA is the first end-to-end 3D animatable model that supports implicit 2D driving.