arXivDaily: daily arXiv digest, updated Monday through Friday
2607.02517 2026-07-03 cs.CV new submission

WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Wen Wang, Qingyan Bai, Ka Leong Cheng, Yue Yu, Yixuan Li, Yihao Meng, Zichen Liu, Yanhong Zeng, Yujun Shen, Qifeng Chen

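Affiliations: HKUST (Hong Kong University of Science and Technology); Ant Group; ZJU (Zhejiang University); CUHK (The Chinese University of Hong Kong)

AI summary: Proposes the WorldDirector framework, which uses an LLM to coordinate 3D trajectories with camera motion as control signals for video generation, decoupling semantic motion orchestration from visual generation to achieve persistent dynamic object memory and free-viewpoint exploration.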

Comments Project Page: https://worlddirector.github.io/

Abstract

We present WorldDirector, a highly controllable video world model framework designed for persistent dynamic object memory and unrestricted viewpoint exploration. Unlike existing world models that entangle physical dynamics with pixel rendering and rely on continuous visual observation to sustain motion, our framework explicitly decouples semantic motion orchestration from visual generation. By leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, our approach ensures strict physical logic and appearance stability, successfully preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view. Experimental results demonstrate that our method supports the synthesis of complex and extended events with unprecedented controllability and persistent dynamic object memory. Project Page: https://worlddirector.github.io/

2607.02516 2026-07-03 cs.CV new submission

Alignment Is All You Need For X-to-4D Generation

Qiaowei Miao, Kehan Li, Yawei Luo, Yi Yang

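Affiliations: Zhejiang University

AI summary: Proposes the Align4D framework, which uses object distance alignment, motion-geometry joint alignment, and asynchronous optimization to turn arbitrary-modality input into video-3D pairs for 4D generation, achieving state-of-the-art results on the X4D and Consistent4D datasets.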

Abstract

Generative diffusion models excel at synthesizing high-quality images, videos, and 3D content under multimodal control. However, arbitrary user-defined modality-to-4D (X-to-4D) generation remains challenging due to the high cost of constructing diverse datasets and the limited scalability of existing methods. This paper presents Align4D, a flexible framework that translates any-modal input into coherent video-3D pairs, using video to guide 4D motion and 3D data to shape 4D geometry. Align4D introduces three key techniques: (1) Object Distance Alignment, which searches Video-Aligned and Multiview-Aligned Object Distances (VAOD/MAOD), respectively, to reconcile 4D renderings with video and the priors of multiview diffusion models; (2) Motion-Geometry Joint Alignment, which constrains known and unknown views through synchronized video and 3D inputs, ensuring consistent 4D generation; and (3) Asynchronous Optimization, which decouples Gaussian attribute and deformation network training to enhance motion and geometry fidelity. We further propose the X4D dataset, which integrates prompt, image, video, and 3D data for benchmarking. Experiments on X4D and Consistent4D demonstrate that Align4D achieves state-of-the-art quality and consistency in X-to-4D generation. Project page: https://miaoqiaowei.github.io/Align4D/.

2607.02514 2026-07-03 cs.AI new submission

Distributed Attacks in Persistent-State AI Control

Josh Hills, Ida Caspary, Asa Cooper Stickland

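Affiliations: Constellation Astra Fellowship; Imperial College London; UK AI Security Institute

AI summary: Studies the threat of AI coding agents distributing attacks across pull requests in a persistent codebase, introduces the Iterative VibeCoding benchmark, and finds that no single monitor defends against both gradual and non-gradual attacks, while a stateful tracking monitor significantly lowers the success rate of gradual attacks.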

Abstract

As AI coding agents become more autonomous, they increasingly ship code iteratively, with the codebase persisting across sessions. This persistence creates a new attack surface: a misaligned or prompt-injected agent can distribute attacks across pull requests (PRs) and time its payload for the PR with the best natural cover. To study the resulting dynamics, we introduce Iterative VibeCoding, a setting for AI control, the study of safely deploying capable but potentially untrusted AI. In Iterative VibeCoding, a coding agent builds software over a sequence of PRs in a persistent codebase while pursuing a covert side task. Our benchmark includes two task families: CLI tools and Flask web services, across 20 total task variations. We use Claude Sonnet 4.5 as the attack agent and GPT-4o as the monitor. We compare gradual attacks, which distribute the side task across PRs, against non-gradual attacks concentrated in a single PR. No single monitor is robust to both: which strategy evades best (success while evading the monitor) depends on the monitor type, so a defender cannot close off both gradual and non-gradual attacks with any one monitor. High evasion (>= 65%) generalizes across model attack agent backends (Sonnet 4.5, Gemini 3.1 Pro, Kimi K2.5), confirming this is a property of the persistent-state attack surface rather than a single model's capability. Evasion also remains high across state-of-the-art monitor models and the gap between gradual and non-gradual evasion widens for more capable models. We introduce a stateful link-tracker monitor that tracks suspicious buildup across PRs. On both task families, it detects gradual attacks substantially better than diff monitors that merely see more accumulated history. Combining this stronger monitor with trajectory monitors in a four-monitor ensemble reduces gradual-attack evasion from 93% under the weakest standard diff monitor to 47%.

2607.02513 2026-07-03 cs.CL cs.AI cs.LG new submission

LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

Matteo Boglioni, Thibault Rousset, Siva Reddy, Marius Mosbach, Verna Dankers

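Affiliations: Mila – Quebec Artificial Intelligence Institute; McGill University

AI summary: Introduces the LACUNA testbed, which injects synthetic personally identifiable information into model parameters to evaluate whether unlearning methods truly erase knowledge. It finds that existing methods localize imprecisely and are vulnerable to resurfacing attacks, whereas precise localization enables strong erasure.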

Abstract

LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct evaluation of whether unlearning targets the weights responsible for knowledge storage. We use LACUNA to benchmark current SOTA unlearning methods and find that, despite strong output-level performance, existing methods are highly imprecise and susceptible to resurfacing attacks. We further show that when localization is successful, even a simple gradient-based unlearning method achieves strong erasure and robustness to resurfacing attacks, highlighting the importance of precise unlearning. We release LACUNA to complement behavioral evaluations and drive further advances in robust, localization-based unlearning.

2607.02512 2026-07-03 cs.LG cs.AI cs.CL new submission

Program-as-Weights: A Programming Paradigm for Fuzzy Functions

Wentao Zhang, Liliana Hotsko, Woojeong Kim, Pengyu Nie, Stuart Shieber, Yuntian Deng

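Affiliations: University of Waterloo; Cornell University; Harvard University

AI summary: Proposes the fuzzy-function programming paradigm. The Program-as-Weights method compiles natural-language specifications into lightweight, executable neural artifacts, sharply reducing inference cost while maintaining performance.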

Abstract

Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth of the inference memory and running at 30 tokens/s on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.

2607.02509 2026-07-03 cs.AI new submission

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

Yanjun Zhao, Ruizhong Qiu, Tianxin Wei, Yuanchen Bei, Zhining Liu, Lingjie Chen, Ismini Lourentzou, Hanghang Tong, Jingrui He

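Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes ReContext, a training-free inference method that builds a query-conditioned evidence pool from model-internal relevance signals and replays it before final generation to improve evidence utilization in long-context reasoning, without external memory or context pruning.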

Abstract

Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. In this work, we propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation while preserving the full original context. This recursive selection process separates evidence organization from answer generation without training, external memory, or context pruning. We also provide a theoretical analysis based on associative memory, which characterizes the context as a memory store, the question as a retrieval cue, attention as cue-trace association, and replay as trace reactivation. Experiments on eight long-context datasets with 128K context length show that RECONTEXT consistently improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B, achieving the best average rank on all three backbones. Code is available at https://github.com/Yanjun-Zhao/ReContext.
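
The replay step described in the abstract can be pictured with a minimal sketch. It assumes relevance scores are already available as plain numbers; in the paper they come from model-internal signals, which this sketch does not reproduce, and the recursive selection is reduced here to a single top-k pass. The function name `build_replay_prompt`, the parameter `k`, and the prompt layout are hypothetical and not taken from the paper's code.

```python
def build_replay_prompt(chunks, scores, question, k=2):
    """Keep the full context and replay the k highest-scoring chunks before the question.

    `scores` stands in for model-internal relevance signals (an assumption).
    """
    if len(chunks) != len(scores):
        raise ValueError("need exactly one score per chunk")
    # Rank chunk indices by score, keep the top k, then restore document order.
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    picked = sorted(ranked[:k])
    evidence = [chunks[i] for i in picked]
    # The original context is preserved in full; the evidence pool is appended after it.
    parts = ["Context:", *chunks, "Replayed evidence:", *evidence, "Question: " + question]
    return "\n".join(parts), picked


chunks = ["Alice moved to Paris.", "The weather was mild.", "Alice works at a bakery."]
scores = [0.9, 0.1, 0.7]  # hypothetical relevance scores
prompt, picked = build_replay_prompt(chunks, scores, "Where does Alice work?")
print(picked)
```

Nothing is pruned in this sketch: low-scoring chunks stay in the context, and the selected chunks simply appear a second time just before the question.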

2607.02508 2026-07-03 cs.CV new submission

From SRA to Self-Flow: Data Augmentation or Self-Supervision?

Dengyang Jiang, Mengmeng Wang, Harry Yang, Jingdong Wang

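Affiliations: The Hong Kong University of Science and Technology; Zhejiang University of Technology; Baidu Inc.

AI summary: Introduces Attention Separation and finds that the performance gain from dual-time scheduling in Self-Flow comes mainly from data augmentation along the noise dimension rather than from interactions between tokens. The design is validated on ImageNet.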

Abstract

Representation alignment has become an effective way to accelerate diffusion transformer training and improve generation quality. Recent self-alignment methods, such as SRA and Self-Flow, further remove the dependency on external pretrained encoders by constructing alignment within the diffusion model itself. However, the mechanism behind the improvement from SRA to Self-Flow, dual-time scheduling, remains under-examined: Self-Flow attributes its gain to interactions between tokens at different noise levels, where cleaner tokens help infer noisier ones. In this work, we revisit this explanation and ask whether the gain instead comes from data augmentation along the noise dimension. To disentangle these factors, we introduce Attention Separation, which preserves the same dual-timestep input as Self-Flow while blocking attention between tokens assigned to different noise levels. Surprisingly, removing such interaction does not degrade performance and can even improve it, suggesting that the improvement from SRA to Self-Flow mainly comes from data augmentation. Furthermore, we show that Attention Separation itself provides an augmentation effect by splitting a single image into multiple effective training parts to expand the training data. Based on these observations, we combine self-representation alignment with dual-timestep and attention-separation augmentation, and demonstrate the effectiveness of this design on ImageNet.
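
As a rough illustration of what "blocking attention between tokens assigned to different noise levels" could look like, the sketch below builds a boolean attention mask from per-token group labels. The function name `attention_separation_mask` and the use of integer group ids are assumptions made for this example; the abstract does not specify how the mask is built or applied inside the diffusion transformer.

```python
def attention_separation_mask(noise_groups):
    """Return mask[i][j] == True only when tokens i and j share a noise-level group.

    `noise_groups` holds one hypothetical integer group id per token.
    """
    n = len(noise_groups)
    return [[noise_groups[i] == noise_groups[j] for j in range(n)] for i in range(n)]


# Four tokens: the first two at one timestep, the last two at another.
groups = [0, 0, 1, 1]
mask = attention_separation_mask(groups)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

The mask is block-diagonal once tokens are ordered by group, so each group attends only to itself. That is one way to read the abstract's remark that a single image is split into multiple effective training parts.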

2607.02507 2026-07-03 cs.AI cs.CL cs.LG cs.MA new submission

What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates

Arman Ghaffarizadeh, Danyal Mohaddes, Aliakbar Izadkhah, Shahriar Noroozizadeh

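Affiliations: Independent Researcher; Carnegie Mellon University

AI summary: Proposes a dual-channel debate framework to study how social structure (role, audience, relationship) leads LLM agents to diverge systematically between public and off-the-record (OTR) expression, revealing the emergence of latent objectives.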

Abstract

LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. Across 10 models, 3 scenarios, and 5 variations within each scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a $\sim$3% baseline to roughly 40%. The effect is consistent across four aggregate analyses: stance, semantic similarity, natural language inference, and survey responses. In some cases, the OTR response explicitly attributes public accommodation to relational pressures, such as career risk or sponsorship obligation. The findings suggest that agent evaluation should extend beyond explicit goals and detect emergent objectives. We present a dual-channel evaluation framework and complementary behavioral measures that operationalize this assessment.

2607.02503 2026-07-03 cs.RO new submission

VT-WAM: Visual-Tactile World Action Model for Contact-Rich Manipulation

Shuai Tian, Yupeng Zheng, Yuhang Zheng, Songen Gu, Yujie Zang, Yuxing Qin, Weize Li, Haoran Li, Wenchao Ding, Dongbin Zhao

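Affiliations: SKL-MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; TARS Robotics; National University of Singapore; Fudan University

AI summary: Proposes VT-WAM, which jointly learns visual prediction, tactile deformation prediction, and action prediction in a unified flow matching framework. Using asymmetric Mixture-of-Transformers attention and contact-gated attention guidance, it reaches a 71.67% average success rate on six real-world contact-rich manipulation tasks.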

Abstract

Contact-rich manipulation requires policies to react to local deformation, pressure, slip, and friction, yet these cues are temporally sparse and often invisible in visual observations. Existing visual-tactile policies usually feed tactile observations directly into action prediction, but rarely model tactile deformation dynamics during action generation. In this paper, we introduce VT-WAM, a Visual-Tactile World Action Model that jointly learns future visual prediction, tactile deformation prediction, and action prediction within a unified flow matching framework. In particular, VT-WAM introduces (1) Asymmetric Mixture-of-Transformers (MoT) attention to bridge a first-frame visual anchor with temporal tactile dynamics, and (2) contact-gated Action-Visual-Tactile Attention Guidance (AVTAG) to encourage action queries to rely on tactile evidence during contact phases. Across six real-world contact-rich manipulation tasks, VT-WAM achieves a 71.67% average success rate, outperforming Fast-WAM by 26.67% and OmniVTLA by 35.84%. Ablations demonstrate that modeling tactile deformation dynamics and guiding contact-phase tactile attention are both important for contact-rich tasks. Project website: https://vt-wam.github.io/.

2607.02502 2026-07-03 cs.LG cs.AI new submission

DemoPSD: Disagreement-Modulated Policy Self-Distillation

Yunhe Li, Hao Shi, Wenhao Liu, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Shuang Qiu, Linqi Song

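Affiliations: City University of Hong Kong; Tsinghua University; Shenzhen University of Advanced Technology; Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the DemoPSD framework, which selectively adopts teacher guidance to balance learning from the teacher against preserving the student's own reasoning ability, addressing privileged information leakage and suppressed exploration. It outperforms existing methods on scientific reasoning tasks.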

Abstract

On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization, while also introducing a more fundamental issue: *privileged information leakage*, where the student encodes answer-dependent shortcuts that are unavailable at test time. We introduce **DemoPSD**, a novel framework that resolves such problems through the idea of *selective adoption of teacher guidance*. Instead of fitting the full teacher distribution, DemoPSD steers the student toward a *reverse-KL barycenter target*, a weighted geometric combination of the teacher and student distributions, that naturally balances learning from the teacher with preserving the student's own reasoning capacity. We measure the difference between their distributions and use such a discrepancy to adaptively control the blending at each token position. We provably show that DemoPSD achieves **(1)** *leakage attenuation*, i.e., effective mitigation of privileged information leakage; and **(2)** *exploration preservation*, i.e., preservation of exploration capacity under dense token-level distillation. Extensive experiments on SciKnowEval across four scientific fields show that DemoPSD outperforms both GRPO and SDPO while maintaining higher training entropy and robustly generalizing to out-of-distribution GPQA benchmarks.
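
The "weighted geometric combination of the teacher and student distributions" can be made concrete with a small sketch. For a weight `alpha` in [0, 1], the distribution proportional to `p_teacher**alpha * p_student**(1 - alpha)` minimizes `alpha * KL(q || p_teacher) + (1 - alpha) * KL(q || p_student)` over `q`, which is the standard sense of a reverse-KL barycenter. How DemoPSD maps the measured discrepancy to a per-token weight is not stated in the abstract, so `alpha_from_discrepancy` below is a purely hypothetical rule, and all function names are illustrative.

```python
import math


def geometric_mixture(teacher, student, alpha):
    """Normalized teacher**alpha * student**(1 - alpha) over a shared vocabulary.

    Assumes every probability is strictly positive.
    """
    weights = [
        math.exp(alpha * math.log(t) + (1.0 - alpha) * math.log(s))
        for t, s in zip(teacher, student)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    """KL(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def alpha_from_discrepancy(teacher, student):
    # Hypothetical rule (not from the paper): lean on the teacher less
    # as the two distributions disagree more.
    return math.exp(-kl(student, teacher))


teacher = [0.7, 0.2, 0.1]
student = [0.3, 0.4, 0.3]
alpha = alpha_from_discrepancy(teacher, student)
target = geometric_mixture(teacher, student, alpha)
print(round(alpha, 3), [round(p, 3) for p in target])
```

Setting `alpha = 1` recovers the teacher distribution and `alpha = 0` recovers the student, so the weight interpolates between plain distillation and no teacher influence at that token position.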

2607.02501 2026-07-03 cs.RO cs.CV cs.OS new submission

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

Ling Xu, Chuyu Han, Borui Li, Hao Wu, Shiqi Jiang, Ting Cao, Chuanyou Li, Sheng Zhong, Shuai Wang

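Affiliations: Southeast University; Nanjing University; Microsoft Research; Institute for AI Industry Research (AIR), Tsinghua University

AI summary: Proposes Embodied.cpp, a portable C++ inference runtime whose five-layer architecture supports multi-rate execution, latency-first fused inference, and extensible interfaces for efficiently deploying VLA and world-action models on heterogeneous robots.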

Comments 12 pages, 2 figures, Project website: https://github.com/SEU-PAISys/Embodied.cpp

Abstract

Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present Embodied.cpp, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, Embodied.cpp captures a shared execution path and organizes it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. The runtime provides modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, enabling deployment across heterogeneous devices, robots, and simulators through one backend abstraction. We evaluate Embodied.cpp on two VLA models, HY-VLA and pi0.5, and on a preliminary WAM benchmark using a LingBot-VA Transformer block. The VLA deployments achieve successful closed-loop execution with 100.0% and 91.0% task success rates, respectively. The WAM benchmark reduces block memory from 312.2 MiB to 88.1 MiB. These results show that Embodied.cpp improves deployment efficiency while preserving high accuracy across diverse embodied model architectures.

2607.02497 2026-07-03 cs.CV new submission

Seek to Segment: Active Perception for Panoramic Referring Segmentation

Song Tang, Shuming Hu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang

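Affiliations: Fudan University

AI summary: Introduces the Active Panoramic Referring Segmentation task and PanoSeeker, a memory-augmented agent that combines a vision-language model with spatial memory for efficient search and segmentation.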

Comments ECCV 2026, Project Page: https://henghuiding.com/APRS/

Abstract

Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in continuous 360$^\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ($\Delta\theta, \Delta\phi$) to explore the 360$^\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360$^\circ$ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.

2607.02496 2026-07-03 cs.RO cs.LG new submission

Controllable Sim Agents with Behavior Latents

Juanwu Lu, Junyu Zhu, Ziran Wang

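Affiliations: Purdue University; University of Tokyo

AI summary: Proposes Controllable Neural Variational Agents (CNeVA), which infer behavior latents via a closed-form conjugate variational update and pair them with a rectified-flow trajectory generator for controllable simulation along interpretable axes, reaching competitive realism on the Waymo dataset with per-channel controllability.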

Comments 23 pages, 5 tables, 8 figures

Abstract

Realistic traffic simulation requires agents that imitate logged behavior and can also be steered along interpretable axes. Such controllability enables engineers to isolate variables, reproduce specific edge cases, and test autonomous systems without real-world risk. We introduce Controllable Neural Variational Agents (CNeVA), a controllable simulated-agent framework that learns to infer a per-agent Gaussian behavior latent from per-channel discounted returns via a closed-form conjugate variational update, conditioning a rectified-flow trajectory generator trained on a mixed channel-mask curriculum for classifier-free guidance. To tackle scarcity in reward signals, we propose soft eligibility gates that replace hard binary thresholds with smooth exponential decay, preserving the gradient signal for near-threshold agents. On the Waymo Open Motion Dataset, CNeVA attains competitive realism on the benchmark while exposing per-channel controllability that the higher-ranked imitation models lack. Speed- and acceleration-based steering produces monotone responses without stall-induced reward hacking. Safety controllability is monotone and substantial with the introduction of soft eligibility. We manage to achieve steerable map compliance under a context-residual return measure. Furthermore, our experiment demonstrates that steering metrics must be read alongside physical-plausibility guardrails to avoid reward-hacking confounds.
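
The contrast between a hard binary threshold and a gate with smooth exponential decay can be sketched as follows. The abstract does not say which quantity is gated or how the decay is parameterized, so the scalar `value`, the `threshold`, the `scale` parameter, and both function names are assumptions chosen for illustration only.

```python
import math


def hard_gate(value, threshold):
    """Binary eligibility: full weight at or below the threshold, none above it."""
    return 1.0 if value <= threshold else 0.0


def soft_gate(value, threshold, scale):
    """Full weight at or below the threshold, exponential decay above it.

    `scale` (assumed positive) controls how quickly eligibility fades.
    """
    excess = max(0.0, value - threshold)
    return math.exp(-excess / scale)


for value in (0.5, 1.0, 1.2, 2.0):
    print(value, hard_gate(value, 1.0), round(soft_gate(value, 1.0, 0.5), 3))
```

An agent just past the threshold gets zero weight from the hard gate but a weight close to one from the soft gate, so it still contributes to the learning signal. This matches the abstract's stated motivation of preserving the gradient signal for near-threshold agents.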

2607.02494 2026-07-03 cs.CV cs.CL new submission

Towards Robustness against Typographic Attack with Training-free Concept Localization

Bohan Liu, Wenqian Ye, Guangzhi Xiong, Zhenghao He, Sanchit Sinha, Aidong Zhang

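Affiliations: University of Virginia

AI summary: Proposes a training-free interpretability method that analyzes how attention heads differ in encoding lexical versus semantic information, then locates and intervenes on lexically biased circuits in the ViT, substantially improving robustness to typographic attacks without additional training.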

Comments 15 pages main text, provisionally accepted to ECCV 2026

Abstract

Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at https://github.com/Liu-524/SamplingTAR.

2607.02490 2026-07-03 cs.CL cs.CV 新提交

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

基于强化学习的视觉语言模型视觉接地自反思

Liyan Tang, Fangcong Yin, Greg Durrett

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) New York University(纽约大学)

AI总结 提出VRRL框架,通过随机掩码轨迹前缀和缓冲回滚机制,增强视觉语言模型在分布外场景下的视觉接地自反思能力,显著提升准确率。

AI中文摘要

大型视觉语言模型可以通过生成文本思维链(CoT)对多模态输入进行推理。CoT推理中一个关键能力是自反思:重新审视早期决策并纠正先前错误。然而,现有的LVLM在反思过程中往往未能正确关注视觉输入,限制了其将反馈转化为接地修正的能力,尤其是在分布外图像上。为解决这一问题,我们提出了一种新颖的强化学习训练框架VRRL,其中包含两个专门设计用于引发视觉接地自反思的组件。首先,我们在训练期间随机掩码轨迹前缀,以强调从错误中间预测中恢复,而非避免早期错误。其次,我们从经验回放缓冲区引入缓冲回滚,使模型暴露于必须学会纠正的多样失败状态。我们在涉及表格和图表的视觉接地任务以及空间导航基准上评估了我们的方法。虽然现成的和常规微调的模型在分布偏移下性能大幅下降,但我们的方法通过有效利用自反思,在标准RL和面向反思的微调基线上显著提高了平均分布外准确率。

英文摘要

Large vision-language models (LVLMs) can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose VRRL, a novel reinforcement learning training framework with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovering from incorrect intermediate predictions rather than avoiding early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift, our method, by using self-reflection effectively, markedly improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines.

2607.02486 2026-07-03 cs.CV 新提交

GeoMix: Descriptor-Free Visual Localization via Global Context and Multi-Detector Training

GeoMix: 通过全局上下文和多检测器训练实现无描述符视觉定位

Yejun Zhang, Xinjue Wang, Zihan Wang, Esa Rahtu, Juho Kannala

发表机构 * Aalto University(阿尔托大学) Tampere University(坦佩雷大学) University of Oulu(奥卢大学)

AI总结 提出GeoMix框架,通过局部方向与距离嵌入、全局上下文节点和多检测器混合训练,增强几何判别性,显著提升无描述符视觉定位精度。

Comments ECCV 2026

AI中文摘要

无描述符视觉定位消除了高维描述符存储,保护了场景隐私,并简化了地图维护,但其精度仍远落后于基于描述符的流程。我们发现这一差距源于纯几何匹配中几何判别性不足。没有视觉外观,当前方法未能充分利用局部几何线索,缺乏关键点间的全局上下文,并且过拟合于单一关键点检测器。我们进一步观察到,无描述符匹配自然支持多检测器训练,因为异构关键点可以在共享的纯几何空间中进行优化,而无需对齐描述符空间。基于这些见解,我们提出了GeoMix,一个无描述符的2D-3D匹配框架,在三个层面增强几何判别性。在局部层面,方向和距离感知嵌入通过细粒度空间结构丰富了邻域聚合。在全局层面,可学习的上下文节点通过交叉注意力聚合和重新分配场景范围的信息,以解决局部感受野之外的歧义。在训练层面,混合训练利用这种与检测器无关的几何空间,学习跨多个关键点检测器的表示。在MegaDepth、Cambridge Landmarks、7Scenes和Aachen Day-Night上的大量实验表明,GeoMix在无描述符方法中达到了新的最佳水平,相比此前最佳方法,将第75百分位的旋转误差降低了89%,平移误差降低了高达90%,同时零样本泛化到未见过的检测器,并缩小了与基于描述符流程的差距。代码可在 https://github.com/YejunZhang/Geomix 获取。

英文摘要

Descriptor-free visual localization eliminates high-dimensional descriptor storage, preserves scene privacy, and simplifies map maintenance, yet its accuracy still lags far behind descriptor-based pipelines. We attribute this gap to insufficient geometric discriminability in geometry-only matching. Without visual appearance, current methods underutilize local geometry cues, lack global context among keypoints, and overfit to a single keypoint detector. We further observe that descriptor-free matching naturally enables multi-detector training, as heterogeneous keypoints can be optimized in a shared geometry-only space without aligning descriptor spaces. Building on these insights, we propose GeoMix, a descriptor-free 2D-3D matching framework that strengthens geometric discriminability at three levels. Locally, directional and distance-aware embeddings enrich neighborhood aggregation with fine-grained spatial structure. Globally, learnable context nodes aggregate and redistribute scene-wide information via cross-attention to resolve ambiguities beyond local receptive fields. At the training level, Mix-Training exploits this detector-agnostic geometry space to learn representations across multiple keypoint detectors. Extensive experiments on MegaDepth, Cambridge Landmarks, 7Scenes, and Aachen Day-Night show that GeoMix sets a new state of the art among descriptor-free methods, reducing 75th-percentile rotation error by 89% and translation error by up to 90% over the previous best, while generalizing zero-shot to unseen detectors and narrowing the gap to descriptor-based pipelines. Code is available at https://github.com/YejunZhang/Geomix.

2607.02484 2026-07-03 cs.CV cs.AI 新提交

Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

对抗文本噪声与冗余:熵感知的密集视觉令牌剪枝

Xuehui Wang, Xuankun Yang, Wei Shen

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出熵感知密集剪枝(EADP)框架,通过熵过滤文本噪声并利用子模最大化选择令牌,在严格预算下保留细粒度视觉线索,提升VLM精度-效率权衡。

Comments Accepted to ECCV 2026

AI中文摘要

视觉令牌剪枝是通过压缩冗余图像块来加速VLM的关键策略,但现有方法在密集指令和细粒度查询下常无法保留关键线索。本文研究了这一失败,并识别出两个潜在瓶颈:污染密集跨模态评分的文本噪声的广泛分散,以及标准令牌选择固有的特征碎片化。为解决这些问题,我们提出了熵感知密集剪枝(EADP),一个将剪枝重新表述为结构化压缩问题的框架。EADP首先利用统计熵量化并过滤文本噪声,得到鲁棒的细粒度指令相关性分数。随后,EADP将令牌选择视为带有空间先验的子模最大化问题,而非简单的Top-K选择,明确确保整体且非冗余的视觉表示。大量实验表明,EADP改善了VLM的精度-效率权衡,在严格令牌预算下鲁棒地保留细粒度视觉线索,并在具有挑战性的多模态基准上实现了最先进的性能。

英文摘要

Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-K selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly ensuring a holistic and non-redundant visual representation. Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off of VLMs, robustly preserving fine-grained visual cues under strict token budgets while achieving SoTA performance on challenging multimodal benchmarks.
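As a rough illustration of the two stages this abstract describes (entropy-based filtering of textual noise, then token selection as submodular maximization with a spatial prior), a minimal Python sketch follows. It is not the authors' implementation. The abstract does not say how entropy is computed or which submodular objective is used, so the per-text-token attention entropy filter, the `max_entropy_ratio` threshold, the Gaussian spatial kernel, and the names `relevance_scores` and `greedy_select` are all assumptions made for this example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def relevance_scores(attn, max_entropy_ratio=0.9):
    """attn[t][v]: attention of text token t over visual token v (rows sum to 1).

    Assumption: a text token whose attention is close to uniform (high entropy)
    is treated as noise and dropped; the remaining rows are averaged.
    """
    num_visual = len(attn[0])
    limit = max_entropy_ratio * math.log(num_visual)
    kept = [row for row in attn if entropy(row) <= limit]
    if not kept:  # fall back to all rows if every row was filtered out
        kept = attn
    return [sum(row[v] for row in kept) / len(kept) for v in range(num_visual)]


def greedy_select(scores, positions, budget, coverage_weight=1.0, sigma=1.0):
    """Greedy maximization of relevance plus a facility-location coverage term.

    f(S) = sum_{v in S} scores[v] + coverage_weight * sum_u max_{v in S} sim(u, v)
    is monotone submodular when scores and sim are non-negative.
    """
    n = len(scores)

    def sim(u, v):
        # Hypothetical spatial prior: Gaussian kernel on patch grid positions.
        d2 = sum((a - b) ** 2 for a, b in zip(positions[u], positions[v]))
        return math.exp(-d2 / (2.0 * sigma ** 2))

    covered = [0.0] * n  # best similarity of each token to the selected set
    selected = []
    for _ in range(min(budget, n)):
        best, best_gain = None, -1.0
        for v in range(n):
            if v in selected:
                continue
            gain = scores[v] + coverage_weight * sum(
                max(sim(u, v) - covered[u], 0.0) for u in range(n)
            )
            if gain > best_gain:
                best, best_gain = v, gain
        selected.append(best)
        covered = [max(covered[u], sim(u, best)) for u in range(n)]
    return selected
```

Compared with plain Top-K on `scores`, the coverage term lowers the marginal gain of a token that sits next to an already selected one, which is one simple way to discourage redundant, fragmented selections under a fixed budget.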

2607.02479 2026-07-03 cs.CV 新提交

EAGLE-360: Embodied Active Global-to-Local Exploration in 360°

EAGLE-360: 360°环境中的具身主动全局到局部探索

Jingtao Xu, Zizhuo Lin, Jianwen Sun, Yi Yang, Yawei Luo

发表机构 * Zhejiang University(浙江大学) Central China Normal University(华中师范大学)

AI总结 针对多模态大模型在360°全景主动视觉搜索中因极地畸变和连续拓扑建模不足导致的效率低下问题,提出EAGLE-360框架,通过全局先验引导局部探索,结合RoPE Rolling位置编码和GRPO训练,实现近8倍精度提升。

Comments Preprint

AI中文摘要

尽管多模态大语言模型(MLLMs)在标准视觉理解中展现出卓越能力,但将其应用于360°全景环境中的主动视觉搜索时暴露了根本性局限。具体而言,标准MLLMs难以有效建模全景的固有属性,如严重的极地畸变和连续圆柱拓扑,这显著降低了目标检测精度。因此,现有全景搜索方法试图通过严重依赖碎片化的局部视角来补偿。受限于僵化的初始化和缺乏全局全景先验,这些方法存在短视、低效的探索问题,并在目标移出视野时难以进行鲁棒的错误恢复。为克服这些挑战,我们提出EAGLE-360,一种新颖的具身主动全局到局部探索框架。EAGLE-360并非执行穷举局部搜索,而是利用全局先验建立初始整体视角,迭代推理并逐步缩小搜索空间。在架构上,我们采用RoPE Rolling(一种坐标移位位置编码机制)来无缝建模全景的连续拓扑。为促进这一范式,我们构建了大规模EAGLE-360数据集,包含14000+张4K全景图和70000+轮高质量VQA对话。通过采用集成监督微调(SFT)与组相对策略优化(GRPO)的训练流程,我们有效激发了复杂的空间推理和工具调用能力。大量实验表明,EAGLE-360在360°视觉搜索中建立了新的最先进水平,相比基础模型实现了近8倍的精度提升,同时显著增强了探索效率。

英文摘要

While Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in standard visual understanding, adapting them for active visual search in 360° panoramic environments exposes fundamental limitations. Specifically, standard MLLMs struggle to effectively model inherent panoramic properties, such as severe polar distortion and continuous cylindrical topologies, which significantly degrades target detection accuracy. Consequently, existing panoramic search methods attempt to compensate by relying heavily on fragmented local viewpoints. Burdened by rigid initialization and a lack of global panoramic priors, these approaches suffer from myopic, inefficient exploration and struggle with robust error recovery when targets are out of view. To overcome these challenges, we propose EAGLE-360, a novel Embodied Active Global-to-Local Exploration framework. Rather than performing exhaustive local searches, EAGLE-360 leverages global priors to establish an initial holistic perspective, iteratively reasoning and progressively narrowing the search space. Architecturally, we adapt RoPE Rolling, a coordinate-shifting positional encoding mechanism, to seamlessly model the continuous topologies of panoramas. To facilitate this paradigm, we construct the large-scale EAGLE-360 dataset, comprising 14,000+ 4K panoramas and 70,000+ rounds of high-quality VQA dialogues. By employing a training pipeline that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), we effectively elicit complex spatial reasoning and tool-calling capabilities. Extensive experiments demonstrate that EAGLE-360 establishes a new state of the art for 360° visual search, achieving nearly an 8-fold increase in accuracy over the base model while significantly enhancing exploration efficiency.

2607.02474 2026-07-03 cs.RO cs.SY eess.SY 新提交

QuadRocket: An Aerial Robotic Testbed for Adaptive Thrust-Vector Control of Rocket-Like Vehicles

QuadRocket:用于火箭类飞行器自适应推力矢量控制的空中机器人试验台

Pedro Santos, Joel Reis, Paulo Oliveira, Carlos Silvestre

发表机构 * ISR, Instituto Superior Técnico, Universidade de Lisboa(里斯本大学高等技术学院ISR) LAETA LARSyS

AI总结 提出QuadRocket四旋翼火箭原型,通过自适应反步控制器实现未知扰动下的轨迹跟踪,并验证其作为推力矢量控制试验台的适用性。

Comments Paper accepted for publication in IEEE Transactions on Aerospace and Electronic Systems

AI中文摘要

本文介绍了QuadRocket,一种基于四旋翼的火箭原型,为验证运载火箭类系统的先进推力矢量控制策略提供了低成本、低风险平台。该原型由一个圆柱形主体通过万向节安装在四旋翼顶部组成,形成一个具有不可忽略惯性的飞行倒立摆。在控制设计方面,将耦合系统建模为一个由沿其纵轴施加的矢量力驱动的轴对称刚体。采用二球面上的简化姿态表示,以显式利用飞行器的轴对称性,并将偏航与推力矢量方向解耦。在此模型上,我们推导了一种自适应反步控制器,在存在未知常值扰动的情况下实现几乎全局轨迹跟踪,同时通过控制点变换减轻非最小相位行为。然后,将四旋翼视为推力矢量执行器,设计了一种基于动态曲面的姿态控制器来跟踪期望的推力矢量,考虑了执行器动力学并避免了对虚拟控制信号的显式微分。完整架构在仿真中进行了评估,并在室内运动捕捉场中进行了实验验证。结果展示了精确的轨迹跟踪、有效的扰动补偿,并确认了QuadRocket作为推力矢量控制机器人飞行器多功能试验台的适用性。

英文摘要

This paper presents QuadRocket, a quadrotor-based rocket prototype that provides a low-cost, low-risk platform for validating advanced thrust-vector control strategies for launch-vehicle-type systems. The prototype consists of a cylindrical main body mounted on top of a quadrotor through a universal joint, forming a flying inverted pendulum with non-negligible inertia. For control design, the coupled system is modeled as a single axisymmetric rigid body actuated by a vectored force applied along its longitudinal axis. A reduced-attitude representation on the two-sphere is adopted to explicitly exploit the vehicle's axial symmetry and to decouple yaw from the thrust-vector direction. On this model, we derive an adaptive backstepping controller that achieves almost global trajectory tracking in the presence of unknown constant disturbances, while a control-point transformation mitigates non-minimum-phase behavior. The quadrotor is then treated as a thrust-vector actuator, and a dynamic-surface-based attitude controller is designed to track the desired thrust vector, accounting for actuation dynamics and avoiding explicit differentiation of virtual control signals. The complete architecture is evaluated in simulation and validated experimentally in an indoor motion-capture arena. Results demonstrate accurate trajectory tracking and effective disturbance compensation, and confirm the suitability of the QuadRocket as a versatile testbed for thrust-vector-controlled robotic vehicles.
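The claim that a reduced-attitude representation on the two-sphere decouples yaw from the thrust direction can be checked with a short derivation. The notation below is an assumption chosen for illustration and may differ from the paper's: $R \in SO(3)$ is the body-to-inertial rotation, $e_3$ the body longitudinal axis, $R_z(\psi)$ a rotation by $\psi$ about $e_3$, $\omega$ the body angular velocity, and $S(\omega)$ its skew-symmetric matrix with $\dot{R} = R\,S(\omega)$.

```latex
\begin{aligned}
\Gamma &= R\,e_3 \in \mathbb{S}^2, \qquad e_3 = (0,\,0,\,1)^\top, \\
\bigl(R\,R_z(\psi)\bigr)\,e_3 &= R\,\bigl(R_z(\psi)\,e_3\bigr) = R\,e_3 = \Gamma
  \quad \text{for every yaw angle } \psi, \\
\dot{\Gamma} &= \dot{R}\,e_3 = R\,S(\omega)\,e_3 = R\,(\omega \times e_3)
  = R\,(\omega_y,\,-\omega_x,\,0)^\top .
\end{aligned}
```

The second line shows that any rotation about the longitudinal axis leaves $\Gamma$ unchanged, and the third that $\dot{\Gamma}$ does not depend on $\omega_z$. Under these assumptions, a controller that tracks only $\Gamma$ never needs the yaw angle.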

2607.02473 2026-07-03 cs.CL cs.SD eess.AS 新提交

Audio-Based Understanding of Audiobook Narration Appeal

基于音频的有声书旁白吸引力理解

Shahar Elisha, Mariano Beguerisse-Díaz, Emmanouil Benetos

发表机构 * Spotify Queen Mary University of London(伦敦玛丽女王大学)

AI总结 通过提取有声书旁白的声学特征(如语调、语速、响度),分析其与消费数据(如观看率)的关系,发现声学信息与吸引力显著相关,且受体裁和书名影响,首次系统性地计算研究了旁白质量与有声书消费的关联。

Comments Accepted to Interspeech 2026

AI中文摘要

旁白是有声书聆听体验的核心,它塑造了听众与内容的互动和理解方式。本研究探讨旁白质量如何影响有声书的吸引力,并注意到其效果可能因体裁、书名和受众而异。我们使用预训练音频模型从LibriVox中提取声音和声学特征(如语调、语速、响度),并分析它们与消费数据(具体为观看率)的关系以及它们与体裁和书名的相互作用。尽管消费数据有限,但我们发现仅凭声学信息就与吸引力有稳健的关联,即使在考虑书名效应后也是如此。我们进一步使用更细微的专有参与度指标验证了这些发现。据我们所知,这是首个系统性的计算研究,将旁白质量、体裁、书名和有声书消费联系起来,突显了数据驱动见解在改进有声书个性化和叙述者选角方面的潜力。

英文摘要

Narration is central to the audiobook listening experience, shaping how listeners engage with and understand the content. This work explores how narration qualities shape an audiobook's appeal, noting that their effects can vary by genre, title, and audience. We extract vocal and acoustic features (e.g., tone, pace, loudness) from LibriVox using pre-trained audio models and analyse their relationship with consumption data (specifically, view-rate) and their interplay with genre and title. Despite limited consumption data, we find that acoustic information alone has a robust association with appeal, even after accounting for title effects. We further validate these findings using more nuanced proprietary engagement metrics. To our knowledge, this is the first systematic computational study linking narration qualities, genre, title, and audiobook consumption, highlighting the potential of data-driven insights to improve audiobook personalisation and narrator casting.

2607.02472 2026-07-03 cs.RO 新提交

Learning Agile Intruder Interception using Differentiable Quadrotor Dynamics

利用可微四旋翼动力学学习敏捷入侵者拦截

Michael Anoruo, Xiaoyu Tian, Abhishek Rathod, Timothy Naudet, Thomas Canchola, Eric Sturzinger, Kshitij Goel, Wennie Tabib

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Artificial Intelligence Integration Center(人工智能集成中心)

AI总结 提出一种基于可微四旋翼动力学的解析策略梯度方法,仅利用入侵者方向向量学习高速(10 m/s)敏捷拦截策略,相比简化质点动力学方法平均提升30%。

Comments 17 pages, 10 figures, 6 tables

AI中文摘要

本文提出了一种学习控制策略的方法,该方法利用指向入侵者的3D方向单位向量和拦截器状态来拦截入侵者。先前的深度强化学习方法假设相对位置或到入侵者的距离可用,但在使用被动单目相机传感器的实际应用中,这些信息不易获得。相反,我们提出了一种解决方案,利用可微四旋翼动力学的解析策略梯度方法,学习速度高达10 m/s的敏捷拦截。所提出的方法比利用简化质点动力学的基线方法平均提升30%。

英文摘要

This paper presents a methodology for learning a control policy to intercept an intruder using the 3D direction unit vector to the intruder and the interceptor state. Prior deep reinforcement learning approaches assume either relative position or distance to the intruder is available, but this information is not readily accessible in real-world applications that employ passive, monocular camera sensors. Instead, we propose a solution that leverages an analytical policy gradient method using differentiable quadrotor dynamics to learn agile interception at speeds up to 10 m/s. The proposed approach outperforms baseline methods that utilize simplified point mass dynamics by an average of 30%.

2607.02471 2026-07-03 cs.CV 新提交

Interpretation-Oriented Cloud Removal via Observation-Anchored Residual Flow with Geo-Contextual Alignment

面向解译的云去除:基于观测锚定残差流与地理上下文对齐

Ziyao Wang, Maonan Wang, Yucheng He, Xianping Ma, Ziyi Wang, Hongyang Zhang, Yirong Cheng, Man-on Pun

发表机构 * School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)理工学院) Shanghai Ai Lab(上海人工智能实验室) Faculty of Geosciences and Engineering, Southwest Jiaotong University(西南交通大学地球科学与工程学院)

AI总结 提出GACR框架,通过观测锚定残差流(OAR-Flow)将云去除重构为物理残差反演,结合地理上下文先验对齐(GCPA)保持语义结构,在六个数据集和十二个下游任务中验证了重建质量和下游精度提升。

Comments Accepted by ECCV 2026

AI中文摘要

云去除(CR)对于光学遥感至关重要,是可靠下游解译(如语义分割和变化检测)的前提。然而,现有CR方法通常优先考虑视觉真实感,而忽视其对后续分析任务的影响,导致语义漂移和下游性能下降。为解决此问题,我们提出地理锚定云去除(GACR),一个统一框架,同时确保忠实重建和鲁棒可解译性。其核心是观测锚定残差流(OAR-Flow),将CR重新表述为物理基础的残差反演过程。通过将生成轨迹锚定到含云观测而非纯噪声,OAR-Flow实现了快速、稳定且忠实的重建。为进一步保留对下游解译至关重要的语义结构,GACR集成了地理上下文先验对齐(GCPA),将重建约束在由视觉基础模型(VFM)诱导的语义流形内。因此,GACR严格保持了复杂景观的空间-语义完整性。在六个CR数据集和十二个下游任务上的大量实验表明,GACR在产生优越重建质量的同时,持续提升下游任务精度。代码见 https://github.com/wzy6055/GACR。

英文摘要

Cloud removal (CR) is essential for optical remote sensing, serving as a prerequisite for reliable downstream interpretation, such as semantic segmentation and change detection. However, existing CR approaches often prioritize visual realism while overlooking their impact on subsequent analytical tasks, leading to semantic drift and degraded downstream performance. To address this issue, we propose Geo-Anchored Cloud Removal (GACR), a unified framework that jointly ensures faithful reconstruction and robust interpretability. At its core, GACR incorporates Observation-Anchored Residual Flow (OAR-Flow), which reformulates CR as a physically grounded residual inversion process. By anchoring the generative trajectory to the cloudy observation rather than pure noise, OAR-Flow enables fast, stable, and faithful reconstruction. To further preserve semantic structures critical for downstream interpretation, GACR integrates Geo-Contextual Prior Alignment (GCPA) to constrain the reconstruction within a semantic manifold induced by a Vision Foundation Model (VFM). Consequently, GACR strictly maintains the spatial-semantic integrity of complex landscapes. Extensive experiments across six CR datasets and twelve downstream tasks demonstrate that GACR produces superior reconstruction quality while consistently improving downstream task accuracy. The code is available at https://github.com/wzy6055/GACR.

2607.02466 2026-07-03 cs.RO cs.AI 新提交

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

先学移动再学做事:面向VLA的任务无关预训练

Junhao Shi, Siyin Wang, Xiaopeng Yu, Li Ji, Jingjing Gong, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出任务无关预训练(TAP)框架,先通过自监督逆动力学从廉价无标签交互数据学习可迁移运动先验,再用少量专家数据将先验与语言对齐,显著降低VLA模型对专家演示的依赖。

Comments Accepted to ICML 2026, 21 pages, 6 figures

AI中文摘要

视觉-语言-动作(VLA)模型根本上受限于专家演示的稀缺性——即观察、指令和动作的三元组,这些数据大规模收集成本高昂。我们认为,这一瓶颈源于混淆了两个不同的学习目标:获取物理能力(如何移动)和获取语义对齐(做什么)。关键在于,只有后者需要语言监督。基于这一分解假设,我们提出了任务无关预训练(TAP),这是一个两阶段框架:首先通过自监督逆动力学目标,从廉价的无标签交互数据(包括丢弃的离任务轨迹和自主机器人玩耍)中学习可迁移的运动先验;然后,第二阶段利用最少的专家数据将这些先验与语言对齐。在SIMPLER基准上,TAP匹配了使用超过100万条专家轨迹训练的模型,同时使用的标注数据量少几个数量级,相比标准行为克隆取得了10%的绝对提升。在真实世界的WidowX平台上,当相机发生扰动时,TAP保持了25%的成功率,而基于互联网规模的基线模型则降至0%,这表明任务无关预训练能够产生鲁棒、可迁移的物理表示,并为具身AI提供了一条可扩展的前进道路。

英文摘要

Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap, unlabeled interaction data -- including discarded off-task trajectories and autonomous robot play -- via a self-supervised Inverse Dynamics objective. A lightweight second stage then grounds these priors in language using minimal expert data. On the SIMPLER benchmark, TAP matches models trained on over 1M expert trajectories while using orders of magnitude less labeled data, yielding a 10% absolute gain over standard behavior cloning. On a real-world WidowX platform, TAP retains 25% success under camera perturbations where internet-scale baselines collapse to 0%, demonstrating that task-agnostic pretraining produces robust, transferable physical representations and offers a scalable path forward for Embodied AI.

2607.02464 2026-07-03 cs.CL 新提交

Will Scaling Improve Social Simulation with LLMs?

扩展规模会改善基于LLM的社会模拟吗?

Caleb Ziems, William Held, Su Doga Karaca, David Grusky, Tatsunori Hashimoto, Diyi Yang

发表机构 * Stanford University(斯坦福大学) Open Athena(开放雅典娜)

AI总结 研究LLM规模扩展对社会模拟保真度的影响,发现多数任务随规模提升而改善,但存在例外,如纵向预测和低资源领域改进缓慢。

AI中文摘要

大语言模型(LLM)社会模拟是一种有前景的研究方法,但其保真度尚不足以被广泛采用。本文研究当前语言建模中的扩展范式是否可能弥合这些差距,或者模拟保真度是否与通用能力正交,从而值得更多研究关注。我们利用扩展定律研究LLM的计算规模、通用能力基准与三个代表性子领域(观点建模、行为模拟和纵向预测)中社会模拟保真度之间的关系。令人惊讶的是,我们使用一套85个具有Qwen3架构的Transformer LLM,在固定计算预算($10^{18}$到$10^{20}$ FLOPs)下在DCLM网络文本语料库上进行预训练,在所有三个设置中发现了强大的计算扩展性。然后,我们评估了35个更大、能力更强的开放权重模型(参数高达70B),从而能够从损失预测下游准确性。这表明,大多数行为模拟和观点模拟任务将随着规模扩大而迅速改进,特别是当涉及在英文网络语料库中代表性良好的人群时。纵向预测和代表性不足的观点扩展较慢,尤其是当它们与通用知识和推理基准(如MMLU)相关性较低时。在行为模拟中,扩展未能改善模型与人类认知偏差(如风险厌恶)以及人类启发式(如从相关任务中学习相关奖励)的校准。在这些任务上,即使是微调模型,从0.5B到8B参数也未能显著提升性能。综合来看,我们得出结论:规模将在大多数设置中改善社会模拟,但存在异常值,并且在低资源领域改进将不太可靠。

英文摘要

Large Language Model (LLM) social simulations are a promising research method, but they are not yet faithful enough to be adopted widely. In this work, we investigate whether the current scaling paradigm in language modeling is likely to close these gaps, or whether simulation fidelity is orthogonal to general capabilities and therefore deserving of more research attention. We use scaling laws to study the relationship between LLMs' compute scale, general capability benchmarks, and the fidelity of social simulation in three representative sub-domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Surprisingly, we discover strong compute scaling in all three settings, using a suite of 85 transformer LLMs with the Qwen3 architecture pre-trained on the DCLM web text corpus under fixed-compute budgets from $10^{18}$ to $10^{20}$ FLOPs. Then we evaluate 35 larger and more capable open-weight models up to 70B parameters, allowing us to predict downstream accuracy from loss. This reveals that the majority of behavioral and opinion simulation tasks will rapidly improve with scale, particularly when they involve populations that are well-represented in English web corpora. Longitudinal forecasting and underrepresented opinions scale more slowly, especially when they are less correlated with general knowledge and reasoning benchmarks like MMLU. In behavior simulation, scaling fails to improve model calibration with human cognitive biases like risk aversion, as well as human heuristics like learning correlated rewards from related tasks. On these tasks, even fine-tuned models fail to noticeably scale up performance from 0.5B to 8B parameters. Taken together, we conclude that scale will improve social simulations in most settings, but outliers exist, and improvements will be less reliable in low-resource domains.
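To make the notion of compute scaling concrete, here is a minimal Python sketch of the most common way such trends are fitted: a power law `loss = a * compute ** (-b)` estimated by least squares in log-log space, then extrapolated to a larger budget. This is a generic illustration only. The abstract does not state the functional form the authors fit, nor how they map loss to downstream accuracy, and the function names and synthetic numbers are assumptions.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = a * compute ** (-b) by ordinary least squares on logs.

    Returns (a, b). Both inputs must be positive.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(a, b, compute):
    """Extrapolate the fitted power law to a new compute budget."""
    return a * compute ** (-b)


# Synthetic example spanning the FLOP range mentioned in the abstract.
budgets = [1e18, 3e18, 1e19, 3e19, 1e20]
losses = [50.0 * c ** -0.05 for c in budgets]  # made-up data, not paper results
a_hat, b_hat = fit_power_law(budgets, losses)
```

A task whose fitted exponent `b_hat` is close to zero would be one where scale helps little, which is the kind of outlier the abstract reports for longitudinal forecasting and underrepresented opinions.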

2607.02461 2026-07-03 cs.CV cs.AI cs.LG 新提交

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

OrbitQuant: 图像和视频扩散Transformer的数据无关量化

Donghyun Lee, Jitesh Chavan, Duy Nguyen, Sam Huang, Liming Jiang, Priyadarshini Panda, Timo Mertens, Saurabh Shukla

发表机构 * Cantina Labs(Cantina实验室) University of Southern California(南加州大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出数据无关量化方法OrbitQuant,通过归一化旋转基和Lloyd-Max码本,无需校准数据即可对扩散Transformer的权重和激活进行低比特量化,在多个模型和模态上达到最优。

AI中文摘要

扩散Transformer(DiT)在图像和视频生成中达到最先进水平,但其多步采样和不断增长的参数数量使得推理成本高昂。训练后量化(PTQ)是自然的补救措施,然而DiT的激活在不同时间步、提示和引导分支之间发生变化,迫使先前的方法为每个新的检查点或模态重新拟合校准数据。我们提出OrbitQuant,一种数据无关的权重激活量化器,通过在归一化旋转基中进行量化来绕过范围估计。在这个基中,随机排列块哈达玛(RPBH)旋转将每个坐标集中在一个固定的、已知的边缘分布周围,而与输入无关,因此单个Lloyd-Max码本即可服务于给定输入维度的所有时间步、提示和层。我们将相同的量化器离线扩展到权重行,将旋转吸收到权重中,使得在每个线性层内部旋转抵消,运行时仅保留激活上的前向旋转。同样的方法从图像迁移到视频,无需每模态调优。在FLUX.1、Z-Image-Turbo、Wan 2.1和CogVideoX上,它在多个低比特设置下达到了PTQ的最先进水平。它还将图像扩散Transformer的PTQ推进到W2A4,并具有可用的生成质量。

英文摘要

Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd-Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings. It also pushes PTQ of image diffusion transformers to W2A4 with usable generation quality.
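The idea of quantizing in a normalized, rotated basis with one fixed codebook can be sketched in a few lines of Python. This is an illustrative approximation, not the OrbitQuant implementation: it uses a single full-size Hadamard transform with a random permutation and random signs in place of the paper's block-structured RPBH rotation, fits the codebook to standard normal samples as an assumed stand-in for the known marginal, and every function name is hypothetical.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def make_rotation(n, seed=0):
    """Random permutation and random signs; together with fwht they form a rotation."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return perm, signs


def rotate(x, perm, signs):
    return fwht([signs[i] * x[perm[i]] for i in range(len(x))])


def unrotate(y, perm, signs):
    z = fwht(y)  # the orthonormal Hadamard matrix is its own inverse
    x = [0.0] * len(y)
    for i, p in enumerate(perm):
        x[p] = signs[i] * z[i]
    return x


def lloyd_max(samples, levels, iters=20):
    """1-D Lloyd-Max: alternate nearest-level assignment and centroid update."""
    data = sorted(samples)
    codebook = [data[(2 * k + 1) * len(data) // (2 * levels)] for k in range(levels)]
    for _ in range(iters):
        sums, counts = [0.0] * levels, [0] * levels
        for v in data:
            k = min(range(levels), key=lambda j: abs(v - codebook[j]))
            sums[k] += v
            counts[k] += 1
        codebook = [sums[k] / counts[k] if counts[k] else codebook[k]
                    for k in range(levels)]
    return codebook


def quantize(x, perm, signs, codebook):
    """Normalize, rotate, then snap each coordinate to the fixed codebook."""
    n = len(x)
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    y = rotate([v / norm for v in x], perm, signs)
    root_n = math.sqrt(n)  # rotated unit vectors have coordinates of variance ~1/n
    idx = [min(range(len(codebook)), key=lambda j: abs(v * root_n - codebook[j]))
           for v in y]
    return norm, idx


def dequantize(norm, idx, perm, signs, codebook):
    root_n = math.sqrt(len(idx))
    y = [codebook[k] / root_n for k in idx]
    return [v * norm for v in unrotate(y, perm, signs)]
```

Because the rotation `Q` is orthogonal, a linear layer satisfies `W x = (W Qᵀ)(Q x)`, so the rotated weights can be precomputed offline and only the activation rotation runs at inference. This matches the abstract's description of absorbing the rotation into the weights, although the exact mechanics in the paper may differ.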

2607.02460 2026-07-03 cs.LG cs.AI 新提交

Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation

面向无标注LLM自蒸馏的神经元感知数据选择

Zhuowei Chen, Xiang Lorraine Li

发表机构 * University of Pittsburgh(匹兹堡大学)

AI总结 提出Neuron-OPSD框架,利用神经元激活指导训练数据选择和教师上下文构建,通过在线策略蒸馏实现无标注自蒸馏,提升领域内性能并保持跨领域泛化。

AI中文摘要

在没有真实世界交互反馈或人工标注监督的情况下对大型语言模型(LLM)进行后训练仍然具有挑战性,特别是在专家标注成本高昂的专业领域。最近的无标注自进化方法通过使用模型自身的输出作为监督信号来解决这个问题,通过额外上下文构建教师模型,并通过多数投票聚合多个rollout的预测来生成伪标签。然而,这些方法并非没有缺点:基于SFT和GRPO的变体会导致域外性能下降,而基于奖励的在线策略RL会膨胀校准误差。在本文中,我们提出神经元在线策略自蒸馏(Neuron-OPSD),一个用于无标注自蒸馏的数据中心框架,利用内部神经元激活来指导训练数据选择和教师上下文构建。然后,模型通过从教师分布进行在线策略蒸馏来训练,在任何阶段都不需要真实标签。在专业领域基准测试中,Neuron-OPSD提高了领域内任务性能,同时保持了跨领域泛化,并减轻了先前无标注基线的校准崩溃。该框架特别适用于在线交互或外部监督成本高昂或不可行的场景,并且在概念上不同于依赖已记录奖励标签轨迹的离线RL方法。

英文摘要

Post-training large language models (LLMs) without real-world interaction feedback or human-labeled supervision remains challenging, particularly in specialized domains where expert annotations are costly to obtain. Recent annotation-free self-evolution methods address this by using the model's own outputs as supervision signals, constructing a teacher via additional context and aggregating predictions across multiple rollouts through majority voting to produce pseudo-labels. However, these approaches are not without drawbacks: SFT- and GRPO-based variants suffer out-of-domain performance degradation, while reward-based on-policy RL inflates calibration error. In this paper, we propose Neuron On-Policy Self-Distillation (Neuron-OPSD), a data-centric framework for annotation-free self-distillation that leverages internal neuron activations to guide both training-data selection and teacher context construction. The model is then trained via on-policy distillation from the teacher distribution, requiring no ground-truth labels at any stage. Across specialized-domain benchmarks, Neuron-OPSD improves in-domain task performance while preserving cross-domain generalization and mitigating calibration collapse over prior annotation-free baselines. This framework is particularly relevant to settings where online interaction or external supervision is costly or infeasible, and is conceptually distinct from offline RL approaches that rely on logged, reward-labeled trajectories.

2607.02459 2026-07-03 cs.CL 新提交

Language Models as Measurement Apparatus for Culture

语言模型作为文化测量仪器

Kent K. Chang

发表机构 * School of Information, University of California, Berkeley(加州大学伯克利分校信息学院)

AI总结 本文提出语言模型的文化测量是物质-话语实践,通过Karen Barad的能动切割概念,论证模型设计选择构成所测量的文化现实,并通过三个案例和三个仪器检查展示这一观点。

Comments Accepted to the Big Picture workshop co-located with ACL 2026. This version expands the camera-ready (adding Fig. 3 and section 6.3, as well as correcting minor typos) in Proceedings of The Big Picture v2: Crafting a Research Narrative, pp. 131--143, San Diego, CA, USA. Association for Computational Linguistics

AI中文摘要

语言模型越来越多地被用于量化文化现象,但什么使得这种测量具有独特的文化性?本文认为,NLP中的文化工作是一种物质-话语实践:仪器——模型、数据、标注、评估——参与构成它所测量的文化现实,而不是被动地记录它。借鉴Karen Barad的能动切割概念——现象与仪器之间的偶然边界——我表明,仪器的实质性设计选择划定了这样的边界,并且该边界从一开始就是纠缠的,因为语言模型已经内化了它们所测量的大部分文化材料。我通过三个关于电视和电影对话的案例研究(测量结构、互动和偏差)以及三个对仪器本身的考察(文化标记的擦除、对历史材料的调谐、以及能动工作流中的能动性)来说明这一点。这一宏观分析提出了一个理论驱动、经验严谨且文化偶然的研究计划,将每个能动切割视为有意识的承诺,既是方法论的也是伦理的。

英文摘要

Language models are increasingly used to quantify cultural phenomena, but what makes such measurement distinctively cultural? This paper argues that NLP work on culture is a material-discursive practice: the apparatus -- model, data, annotation, evaluation -- participates in constituting the cultural reality it measures, rather than passively recording it. Drawing on Karen Barad's concept of the agential cut -- the contingent boundary between phenomenon and instrument -- I show that the apparatus's substantive design choices draw such boundaries, and that the boundary is entangled from the start because language models have already internalized much of the cultural material they measure. I illustrate this through three case studies on television and film dialogue (measuring structure, interaction, and deviation) and three examinations of the apparatus itself (erasure of cultural markers, attunement to historical material, and agency in an agentic workflow). This big picture analysis proposes a research program that is theory-driven, empirically rigorous, and culturally contingent, treating each agential cut as a conscious commitment, at once methodological and ethical.

2607.02447 2026-07-03 cs.LG 新提交

Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data

理解分布式自监督学习框架对非独立同分布数据的鲁棒性

Xuanyu Chen, Nan Yang, Shuai Wang, Dong Yuan

发表机构 * School of Electrical and Computer Engineering, The University of Sydney(悉尼大学电气与计算机工程学院) Northwestern Polytechnical University(西北工业大学)

AI总结 本文理论分析分布式自监督学习在非IID数据下的鲁棒性,发现掩码图像建模比对比学习更鲁棒,且联邦学习与去中心化学习鲁棒性相当,并提出MAR损失验证理论。

Comments Accepted at ICLR 2026

AI中文摘要

近期研究引入了分布式自监督学习(D-SSL)方法,以利用大量未标记的分散数据。然而,D-SSL面临数据异质性的关键挑战,且对不同D-SSL框架如何应对这一挑战的理论理解有限。为填补这一空白,我们对非独立同分布(non-IID)设置下D-SSL框架的鲁棒性进行了严格的理论分析。结果表明,使用掩码图像建模(MIM)进行预训练本质上比对比学习(CL)对异质数据更鲁棒,且去中心化SSL的鲁棒性随平均网络连通性增加而增强,这意味着联邦学习(FL)的鲁棒性不低于去中心化学习(DecL)。这些发现为未来D-SSL算法的设计提供了坚实的理论基础。为进一步说明我们理论的实际意义,我们引入了MAR损失,这是MIM目标的一种改进,带有局部到全局对齐正则化。跨模型架构和分布式设置的大量实验验证了我们的理论见解,并额外证实了MAR损失作为我们分析应用的有效性。

英文摘要

Recent research has introduced distributed self-supervised learning (D-SSL) approaches to leverage vast amounts of unlabeled decentralized data. However, D-SSL faces the critical challenge of data heterogeneity, and there is limited theoretical understanding of how different D-SSL frameworks respond to this challenge. To fill this gap, we present a rigorous theoretical analysis of the robustness of D-SSL frameworks under non-IID (non-independent and identically distributed) settings. Our results show that pre-training with Masked Image Modeling (MIM) is inherently more robust to heterogeneous data than Contrastive Learning (CL), and that the robustness of decentralized SSL increases with average network connectivity, implying that federated learning (FL) is no less robust than decentralized learning (DecL). These findings provide a solid theoretical foundation for guiding the design of future D-SSL algorithms. To further illustrate the practical implications of our theory, we introduce MAR loss, a refinement of the MIM objective with local-to-global alignment regularization. Extensive experiments across model architectures and distributed settings validate our theoretical insights, and additionally confirm the effectiveness of MAR loss as an application of our analysis.

2607.02440 2026-07-03 cs.AI cs.CL 新提交

EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

EvoPolicyGym:评估交互环境中的自主策略进化

Zhilin Wang, Han Song, Runzhe Zhan, Jusen Du, Jiacheng Chen, Tianle Li, Qingyu Yin, Yulun Wu, Zhennan Shen, Tong Zhu, Yanshu Li, Guanjie Chen, Derek F. Wong, Yafu Li, Yu Cheng, Yang Yang

发表机构 * University of Science and Technology of China(中国科学技术大学) The Chinese University of Hong Kong(香港中文大学) University of Macau(澳门大学) Tsinghua University(清华大学) Zhejiang University(浙江大学) Soochow University(苏州大学) Brown University(布朗大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出自主策略进化评估框架,通过EvoPolicyGym基准测试代理在固定交互预算下迭代改进策略的能力,GPT-5.5在16个环境中表现最优,并揭示预算分配与反馈转化机制。

Comments 24 pages

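AI abstract (translated from Chinese)

Autonomous agents are increasingly expected to improve executable policies through feedback, but existing evaluations usually reduce this process to a final score or conflate it with open-ended software-engineering progress. We introduce Autonomous Policy Evolution, a controlled evaluation setting in which a harness-model agent repeatedly edits an executable policy system under a fixed interaction budget. We instantiate this setting in EvoPolicyGym, a benchmark built from compact interactive reinforcement learning environments that evaluates how agents iteratively improve the policies they explore. On the EvoPolicyGym suite, GPT-5.5 achieves the strongest aggregate rank score and top-two performance on all 16 environments. Beyond leaderboard results, EvoPolicyGym also provides trajectory-level diagnostics that distinguish how agents allocate budget and turn feedback into parameter adjustments. These analyses show that strong autonomous policy evolution depends not only on isolated task wins but also on discovering task-appropriate mechanisms and refining policies under limited feedback.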

English abstract

Autonomous agents are increasingly expected to improve executable policies through feedback, yet existing evaluations often collapse this process into a final score or confound it with open-ended software-engineering progress. We introduce Autonomous Policy Evolution, a controlled evaluation setting in which a harness-model agent repeatedly edits an executable policy system under a fixed interaction budget. We instantiate this setting in EvoPolicyGym, a benchmark built from compact interactive RL environments that evaluates how agents iteratively improve explored policies. On the EvoPolicyGym suite, GPT-5.5 achieves the strongest aggregate rank score and top-two performance on all 16 environments. Beyond leaderboard results, EvoPolicyGym also provides trajectory-level diagnostics that distinguish how agents allocate budget and convert feedback into parametric tuning. These analyses show that strong autonomous policy evolution depends not only on isolated task wins, but also on discovering task-appropriate mechanisms and refining policies under bounded feedback.
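The abstract reports an "aggregate rank score" and "top-two performance" across environments but does not define either here. As a rough illustration only, the sketch below assumes the aggregate is the mean per-environment rank (1 = best, higher raw score is better). The function names, the tie handling, and the toy scores are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def rank_agents(scores):
    """Rank agents within one environment (1 = best; ties share the best rank).

    Assumption: a higher raw score is better.
    """
    ordered = sorted(scores.values(), reverse=True)
    return {agent: ordered.index(score) + 1 for agent, score in scores.items()}


def aggregate_rank(per_env_scores):
    """Mean rank of each agent across environments (lower is better).

    Assumption: every environment reports a score for every agent.
    """
    ranks = [rank_agents(scores) for scores in per_env_scores.values()]
    agents = list(ranks[0])
    return {agent: mean(r[agent] for r in ranks) for agent in agents}


def top_k_count(per_env_scores, agent, k=2):
    """Number of environments in which `agent` places in the top k."""
    return sum(rank_agents(scores)[agent] <= k for scores in per_env_scores.values())


# Toy data: two made-up environments and three made-up agents.
toy_scores = {
    "env_a": {"x": 3.0, "y": 1.0, "z": 2.0},
    "env_b": {"x": 1.0, "y": 5.0, "z": 0.5},
}
summary = aggregate_rank(toy_scores)
```

With a mean-rank aggregate, an agent can lead overall without winning every environment, which is consistent with the abstract reporting top-two placement on all 16 environments separately from the aggregate score.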

2607.02437 2026-07-03 cs.LG New submission

Extreme Adaptive Transformer for Time Series Forecasting


Sanjeev Shrestha, Hui Liu, Yifan Zhang

Affiliations * Department of Computer Science, Missouri State University

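AI summary Targets the difficulty of forecasting rare extreme events in time series. Proposes the Exformer framework, whose extreme-adaptive attention mechanism combines three sparse components (Local, Stride, and Extreme) to explicitly model temporal dependencies involving both normal and extreme events, and reports better 3-day forecasting performance on four hydrologic datasets.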

Comments Submitted to Scientific Reports

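AI abstract (translated from Chinese)

Time series forecasting remains challenging when the underlying data contain rare but critical extreme events. The problem is especially important in hydrologic forecasting, where streamflow distributions are usually highly skewed and extreme peaks have a major impact on flood monitoring, water resource management, and early warning systems. Although Transformer-based forecasting models achieve strong performance by modeling long-range temporal dependencies, they usually treat all time points uniformly and may therefore underrepresent rare extreme patterns. In this paper, we propose the Extreme-Adaptive Transformer (Exformer), a forecasting framework designed to explicitly model temporal dependencies involving both normal and extreme events. Exformer introduces an extreme-adaptive attention mechanism composed of three sparse components: Local, Stride, and Extreme. The Local and Stride components capture short-term and periodic temporal dependencies, respectively, while the Extreme component selectively models event-aware dependencies between normal and extreme streamflow patterns. Experiments on four real-world hydrologic streamflow datasets show that Exformer achieves superior 3-day forecasting performance compared with state-of-the-art baselines. Our findings indicate that explicitly incorporating extreme-aware attention improves the forecasting capacity of Transformer models on imbalanced time series with rare but important events.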

English abstract

Time series forecasting remains challenging when the underlying data contain rare but critical extreme events. This issue is particularly important in hydrologic forecasting, where streamflow distributions are often highly skewed and extreme peaks can have substantial impacts on flood monitoring, water resource management, and early warning systems. Although Transformer-based forecasting models have achieved strong performance by modeling long-range temporal dependencies, they typically treat all time points uniformly and may therefore underrepresent rare extreme patterns. In this paper, we propose the Extreme-Adaptive Transformer (Exformer), a forecasting framework designed to explicitly model temporal dependencies involving both normal and extreme events. Exformer introduces an extreme-adaptive attention mechanism composed of three sparse components: Local, Stride, and Extreme. The Local and Stride components capture short-term and periodic temporal dependencies, respectively, while the Extreme component selectively models event-aware dependencies between normal and extreme streamflow patterns. Experiments on four real-world hydrologic streamflow datasets show that Exformer achieves superior 3-day forecasting performance compared with state-of-the-art baselines. Our findings demonstrate that explicitly incorporating extreme-aware attention improves the forecasting capacity of Transformer models on imbalanced time series with rare but consequential events.
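The abstract names three sparse attention components (Local, Stride, Extreme) without giving their exact form. The sketch below is one plausible reading, not Exformer's implementation: it assumes Local is a fixed window around each step, Stride connects steps whose lag is a multiple of a period, and Extreme lets every step attend to steps whose value exceeds a threshold. The parameter names `window`, `stride`, and `threshold` and both function names are hypothetical.

```python
import math


def extreme_adaptive_mask(series, window=2, stride=4, threshold=5.0):
    """Boolean mask where mask[i][j] is True if query step i may attend to key step j.

    Assumptions (not from the paper): fixed local window, fixed stride,
    and a plain value threshold to flag extreme steps.
    """
    steps = len(series)
    is_extreme = [value >= threshold for value in series]
    mask = []
    for i in range(steps):
        row = []
        for j in range(steps):
            local = abs(i - j) <= window        # short-term neighbours
            strided = (i - j) % stride == 0     # periodic lags
            extreme = is_extreme[j]             # flagged extreme steps
            row.append(local or strided or extreme)
        mask.append(row)
    return mask


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the allowed (i, j) pairs."""
    dim = len(q[0])
    out = []
    for i, query in enumerate(q):
        # Each row allows at least j == i, so `allowed` is never empty.
        allowed = [j for j, ok in enumerate(mask[i]) if ok]
        scores = [sum(a * b for a, b in zip(query, k[j])) / math.sqrt(dim)
                  for j in allowed]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j][c] for w, j in zip(weights, allowed))
                    for c in range(len(v[0]))])
    return out


# Toy series: seven ordinary steps followed by one extreme peak.
toy_series = [1.0] * 7 + [10.0]
toy_mask = extreme_adaptive_mask(toy_series, window=1, stride=4, threshold=5.0)
```

In this toy mask, step 0 attends to itself and step 1 (Local), step 4 (Stride), and step 7 (Extreme), so the peak stays visible to distant steps without dense attention.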
