arXivDaily: daily arXiv academic digest, updated Monday through Friday
2606.26095 2026-06-25 cs.RO cs.AI cs.CV New submission

Learning Action Priors for Cross-embodiment Robot Manipulation

Dong Jing, Tianqi Zhang, Jiaqi Liu, Jinman Zhao, Zelong Sun, Li Erran Li, Zhiwu Lu, Mingyu Ding

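Affiliations: Renmin University of China; University of North Carolina at Chapel Hill; University of Toronto; Amazon

AI summary: Proposes a two-stage training framework that first pretrains the action module with flow matching to learn cross-embodiment temporal motion structure, then transfers it to VLA training, improving data efficiency and task success rates.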

Abstract

Most Vision-Language-Action (VLA) models build on a Vision-Language Model (VLM) backbone by attaching an action module and optimizing the full policy jointly. This design inherits strong visual and linguistic priors from the VLM, but leaves the action module to learn physical motion almost from scratch. As a result, the policy lacks an explicit motion prior, forcing early optimization to simultaneously discover temporal action dynamics and cross-modal alignment, a challenge further amplified in cross-embodiment settings. In this work, we propose to pretrain the action module with motion priors before cross-modal VLA alignment. Specifically, we introduce a two-stage training framework that equips the action module with cross-embodiment temporal motion structure before VLA training begins. In Stage 1, a lightweight flow-matching-based encoder-decoder action module efficiently learns temporal motion structure solely from unconditioned action trajectories, without processing visual or language tokens. In Stage 2, this learned prior is transferred to VLA training through decoder reuse and early-stage latent distillation, aligning visual-language features with the action embedding space while still allowing end-to-end policy refinement. In addition, the trained encoder serves as a compact history compressor, summarizing state-action histories into a single temporal context token for history-aware modeling at negligible cost. Extensive experiments across 13 diverse cross-embodiment tasks on both simulated and real-world platforms validate the effectiveness of our approach. Compared with VLA training without action priors, our model achieves faster convergence, higher success rates, and substantially stronger performance on data-scarce real-world tasks. Moreover, scaling up the action data in Stage 1 yields a more generalizable action prior that directly improves downstream VLA performance.
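To make the Stage 1 objective more concrete, here is a minimal sketch of how a flow-matching training pair could be built from an action trajectory alone. The abstract does not specify the probability path or the network, so the straight-line path, the function names, and the 16-step, 7-DoF chunk shape are all illustrative assumptions rather than the authors' method.

```python
import numpy as np

def flow_matching_pair(actions, noise, t):
    """Build one flow-matching training pair for a chunk of actions.

    actions: (horizon, action_dim) ground-truth action trajectory
    noise:   same shape, sampled from a standard normal
    t:       scalar in [0, 1]; 0 is pure noise, 1 is the data
    Assumes a straight-line probability path (an illustrative choice).
    """
    x_t = (1.0 - t) * noise + t * actions   # point on the path at time t
    target_velocity = actions - noise        # constant velocity along the path
    return x_t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

rng = np.random.default_rng(0)
actions = rng.normal(size=(16, 7))       # hypothetical 16-step, 7-DoF action chunk
noise = rng.normal(size=actions.shape)
x_t, v_target = flow_matching_pair(actions, noise, t=0.25)
```

A model trained this way would take `x_t` and `t` as input and regress `v_target`; no image or language tokens appear anywhere in the pair, which is the property the abstract emphasizes.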

2606.26094 2026-06-25 cs.LG New submission

RevengeBench: Reverse Engineering Code-Space Policies from Behavioral Experiments

Babak Rahmani, Sebastian Dziadzio, Joschka Strüber, Sergio Hernández-Gutiérrez, Matthias Bethge

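Affiliations: Tübingen AI Center

AI summary: Introduces the RevengeBench benchmark for reverse engineering LLM-generated policy code from behavioral traces and controlled experiments; recovery quality ranges from 34% to 72%, and recovered policies improve downstream head-to-head win rates.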

Comments 12 pages, 5 figures, 22 appendix pages

Abstract

For most of scientific history, researchers studying behavior could only infer hidden mechanisms from outward actions: an inverse problem that becomes more tractable when observation is augmented by targeted intervention. We pose a computational analogue: given only behavioral traces of an agent in a game environment, can a learner reconstruct the underlying decision program as executable code, and how much does this reconstruction improve with the ability to design controlled experiments? We introduce RevengeBench, a benchmark of 75 LLM-generated, Elo-calibrated policies across five game environments, drawn from CodeClash tournament trajectories. The learner observes the hidden target policy play against sampled opponents and designs behavioral probes in the form of custom opponent policies that elicit informative behavior. It then submits an executable hypothesis, which is evaluated using continuous action-distance metrics. We further validate that recovered code carries informative signal in downstream player-versus-player tournaments. Across twelve frontier LLMs, recovery quality varies substantially (34% to 72% of the initial distance closed), with reconstructed policies yielding a measurable competitive advantage, particularly for weaker models that otherwise struggle to design effective counter-strategies. Our benchmark positions behavioral recovery of programmatic policies as a tractable inverse problem in code-space, opening a path to opponent modeling, policy interpretability, and the broader question of inferring latent mechanisms from observations.

2606.26093 2026-06-25 cs.RO New submission

ForceBand: Learning Forceful Manipulation with sEMG

Botao He, Zhi Wang, Linna Kuang, Ishaan Ghosh, Jitendra Malik, Cornelia Fermuller, Tingfan Wu, Jiayuan Mao, Ruoshi Liu, Haozhi Qi, Yiannis Aloimonos

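Affiliations: Amazon FAR; University of Maryland; Johns Hopkins University

AI summary: Introduces ForceBand, a low-cost wrist-worn sEMG system that augments human demonstrations with predicted finger forces for learning forceful robot manipulation policies, reaching an 87% success rate across diverse objects.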

Abstract

Human demonstrations are a scalable data source for learning robot manipulation policies. However, common sources of human demonstration data, such as motion-capture trajectories and internet videos, capture mostly motion and appearance while missing the contact forces that are critical for force-sensitive manipulation. In this paper, we introduce ForceBand, a low-cost wrist-worn sEMG system that turns human muscle activity into force-enriched demonstrations. We first collect a 10-hour multimodal dataset containing egocentric video, sEMG, IMU, and fingertip force measurements across diverse actions and objects. Using this dataset, we pre-train an EMG2Force model that predicts per-finger forces from sEMG and IMU signals. After a short user-specific calibration, users can collect target-task demonstrations using only ForceBand and video; EMG2Force then labels these demonstrations with per-finger force traces, producing force-augmented demonstrations for robot policy learning. Experiments show that ForceBand recovers fine-grained fingertip interactions with over 50% lower force prediction error than vision-based baselines and achieves an 87% success rate on pick, squeeze, and place tasks that require object-specific force control across objects with diverse shapes, sizes, and weights. Project website: this https URL

2606.26092 2026-06-25 cs.CV New submission

TryOnCrafter: Unleashing Camera Trajectories for Realistic Video Virtual Try-on via a Renderable 4D Try-on Proxy

Hao Sun, Hao Yan, Mengting Chen, Quanjian Song, Yu Li, Juan Cao, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Sheng Tang

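Affiliations: Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xiamen University; Alibaba Group

AI summary: Presents TryOnCrafter, the first unified DiT-based framework for camera-controllable video virtual try-on; a renderable 4D try-on proxy decouples the human from the environment, supporting arbitrary-viewpoint synthesis and downstream applications.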

Comments Project Page: this https URL (https://sunhao242.github.io/TryOnCrafter_web.github.io/)

Abstract

While Video Virtual Try-on (VVT) has achieved remarkable progress in synthesizing realistic garment overlays on dynamic subjects, existing paradigms remain fundamentally constrained by a passive dependency on source camera trajectories, failing to accommodate the interactive freedom required for omnidirectional viewpoint exploration. To address this limitation, we define a new research problem: Camera-controllable Video Virtual Try-on (CaM-VVT). Unlike conventional VVT, CaM-VVT requires not only viewpoint-agnostic texture hallucination but also strict structural synchronization between non-rigid human dynamics and background contexts under arbitrary, unconstrained camera movements. To tackle these challenges, we present TryOnCrafter, the first unified DiT-based framework designed specifically for the CaM-VVT task. Departing from implicit pixel-space manipulation, we introduce a Renderable 4D Try-on Proxy that explicitly decouples the human subject from the environment. This is achieved by distilling high-fidelity 2D try-on priors into a clothed 3DGS-based avatar, which is subsequently animated via SMPL-X sequences and metric-aligned into a reconstructed background point cloud. This proxy establishes a robust structural foundation with superior texture density and motion integrity. Our Proxy-Anchored Video DiT uses this foundation as its primary geometric anchor, ensuring that the synthesized photorealistic videos are strictly constrained by prescribed trajectories and physically plausible deformations. Benefiting from the inherent editability of the 4D proxy, TryOnCrafter facilitates diverse downstream applications, including human relocalization, "bullet time" effects, and 360-degree orbital viewing.

2606.26087 2026-06-25 cs.CV New submission

MVTrack4Gen: Multi-View Point Tracking as Geometric Supervision for 4D Video Generation

JoungBin Lee, Jaewoo Jung, Jongmin Lee, Tongmin Kim, Hyunsung Kim, Takuya Narihira, Kazumi Fukuda, Jahyeok Koo, Jisang Han, Yuki Mitsufuji, Seungryong Kim

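Affiliations: KAIST AI; Sony AI; Sony Group Corporation

AI summary: Proposes MVTrack4Gen, a framework that uses multi-view point tracking as a geometric and motion supervision signal to improve the motion and geometric consistency of camera-conditioning-only novel-view video diffusion models.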

Comments Project Page: this https URL (https://cvlab-kaist.github.io/MVTrack4Gen/)

Abstract

Synthesizing a novel-view video from a monocular reference video along a target camera trajectory requires both geometric consistency and motion fidelity with respect to the reference video. Existing methods based on explicit 3D representations are limited by the accuracy of off-the-shelf reconstruction modules, which often produce inaccurate geometry for dynamic objects in monocular videos. In contrast, camera-conditioning-only methods can achieve high visual quality but often struggle to preserve geometric and motion consistency. In this work, we introduce MVTrack4Gen (Multi-View point Tracking for Novel-View Generation), a motion-aware training framework that leverages multi-view point tracking as an additional geometric and motion supervision signal for camera-conditioning-only novel-view video diffusion models. Our key finding is that specific attention layers encode strong correspondence cues, where query features attend to key features at geometrically corresponding locations across views and over time, and the misalignment of these correspondences causes motion inconsistency. Based on this observation, we route these features into an auxiliary multi-view tracking head and jointly train the diffusion model with a point-tracking objective. By explicitly strengthening these motion-aware correspondences, MVTrack4Gen improves existing models to better follow the motion in the reference view and maintain cross-view geometric consistency. Across diverse benchmarks, our method achieves state-of-the-art geometric consistency and competitive camera accuracy.

2606.26083 2026-06-25 cs.CL eess.AS New submission

Real-Time Voice AI Hears but Does Not Listen

Martijn Bartelds, Federico Bianchi, James Zou

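Affiliations: Together AI; Stanford University

AI summary: Evaluates four leading realtime voice systems and finds that they act on words rather than acoustic cues, with a disconnect between perception and action in emotional understanding; advises caution in settings that depend on tone.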

Abstract

Speech conveys information through both words and vocal delivery. We evaluate four leading production realtime voice systems (OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash) on tasks where the words and the delivery patterns both convey meaningful information. Across three consequential scenarios, all four systems act on the words rather than the voice. They end calls with crying callers who insist nothing is wrong, approve wire transfers authorized in frightened voices, and enroll callers whose agreement is clearly sarcastic. Surprisingly, this is often not a failure of perception. When asked directly, three of the four systems reliably identify the distress, fear, or sarcasm they later ignore when making decisions. We observe a similar pattern when these realtime voice systems estimate accent and age, as their responses frequently follow the biases of the words rather than the acoustic properties of the speaker. We term this disconnect between perception and action the emotional intelligence gap of voice AI. Prompting systems to explicitly attend to vocal delivery improves performance only partially and inconsistently. Our findings show that current realtime voice AI systems often behave as if speech had been reduced to a transcript, suggesting that they should be used with caution in settings where the tone and emotion of delivery convey important information.

2606.26080 2026-06-25 cs.LG cs.AI New submission

Neglected Free Lunch from Post-training: Progress Advantage for LLM Agents

Changdae Oh, Wendi Li, Seongheon Park, Samuel Yeh, Tanwi Mallick, Sharon Li

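Affiliations: University of Wisconsin–Madison; Argonne National Laboratory

AI summary: Introduces the progress advantage, which uses the implicit advantage function from RL post-training as an annotation-free, domain-agnostic step-level score, and outperforms dedicated reward models in test-time scaling, uncertainty quantification, and failure attribution.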

Abstract

Process reward models enable fine-grained, step-level evaluation of LLMs, yet building them for agentic settings remains prohibitively difficult: long-horizon interactions, irreversible actions, and stochastic environment feedback make both human annotation and Monte Carlo estimation infeasible at scale. In this work, we show that reinforcement learning (RL) post-training already provides the ingredients for effective step-level scoring, eliminating the need for dedicated reward model training altogether. Concretely, we derive an implicit advantage under a general stochastic Markov decision process, which we term the progress advantage: the log-probability ratio between the RL-trained policy and its reference policy exactly recovers the optimal advantage function. This formulation makes the resulting signal annotation-free, domain-agnostic, and available as a byproduct of the standard RL post-training pipeline. We validate the effectiveness of the progress advantage across three applications (test-time scaling, uncertainty quantification, and failure attribution) on five benchmarks and four model families. Across all settings, it consistently outperforms confidence-based baselines and, despite requiring no task-specific training, surpasses dedicated trained reward models. We complement these results with deeper analyses of the characteristics of the progress advantage, offering practical guidance for adoption in real-world agentic systems.
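As a rough illustration of the scoring rule described above, the sketch below computes a per-step score from the log-probabilities that the RL-trained policy and its reference policy assign to the same step. The scale `beta`, the function names, and the example numbers are assumptions for illustration; the abstract states only that the log-probability ratio recovers the optimal advantage.

```python
def progress_advantage(logp_policy, logp_reference, beta=1.0):
    """Step-level score: beta * (log pi(a|s) - log pi_ref(a|s)).

    logp_policy and logp_reference are the summed token log-probabilities
    of the same step under the RL-trained and reference policies.
    beta is an assumed scale (for example a KL coefficient); it is not
    taken from the abstract.
    """
    return beta * (logp_policy - logp_reference)

def score_trajectory(steps, beta=1.0):
    """Score every step, given (logp_policy, logp_reference) pairs."""
    return [progress_advantage(lp, lr, beta) for lp, lr in steps]

# Hypothetical log-probabilities for a three-step agent trajectory.
steps = [(-1.2, -2.0), (-0.5, -0.4), (-3.0, -1.0)]
scores = score_trajectory(steps)

# For failure attribution, one simple choice is to flag the lowest-scoring step.
worst_step = min(range(len(scores)), key=scores.__getitem__)
```

Under these assumptions, a positive score means the RL-trained policy prefers the step more than the reference does, and the same scores could rank candidate steps for test-time scaling.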

2606.26079 2026-06-25 cs.CL cs.CV cs.LG New submission

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

Akshay Paruchuri, Sanmi Koyejo, Ehsan Adeli

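Affiliations: Stanford University

AI summary: Introduces the Facet-Probe audit, which measures answer flip rates of 18 MLLMs under five kinds of ordering perturbation; no model is order-invariant, the best model still flips on 13.4% of trials, and prompt-level mitigation alone is unlikely to deliver general order robustness.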

Comments 22 pages, 4 figures, 5 tables

Abstract

Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (option, evidence-chunk, document-rank, image-set, and mixed-modality ordering) of 18 frontier and open-weight MLLMs. A Bayesian item-response model separates ordering noise from per-facet bias, and a same-ordering control estimates the decoder-stochastic floor for observed flips. We find that none of the 18 MLLMs we audit are order-invariant: screened per-facet panel-mean flip rates span 24-50%. A Gemini same-ordering control at temperature 0 estimates a substantial ordering excess over a same-input decoder-noise floor in verified cells. Capability predicts but does not eliminate flips; the best model still flips on 13.4% of trials. In our Gemini mitigation tests, training-free prompt changes are modality-conditional and do not transfer from text to visual reasoning. These results suggest that prompt-level mitigation alone is unlikely to provide general order robustness, motivating future work on training-time and architectural approaches. We propose cross-ordering flip rate as a standard reporting axis for MLLMs.
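The cross-ordering flip rate proposed as a reporting axis can be sketched as follows. The abstract does not give an exact definition, so this version assumes that a flip is any answer under an order-irrelevant shuffle that differs from the answer under the canonical ordering; the function name and data layout are hypothetical.

```python
def flip_rate(canonical_answers, shuffled_answers):
    """Fraction of trials whose answer changes under a reordering.

    canonical_answers: {item_id: answer under the canonical ordering}
    shuffled_answers:  {item_id: [answers under order-irrelevant shuffles]}
    Answers are compared by content (not option letter), so a relabeled
    but identical choice does not count as a flip.
    """
    flips = 0
    trials = 0
    for item_id, baseline in canonical_answers.items():
        for answer in shuffled_answers.get(item_id, []):
            trials += 1
            flips += answer != baseline
    return flips / trials if trials else 0.0

# Hypothetical audit of two items with three shuffles each.
canonical = {"q1": "cat", "q2": "7"}
shuffled = {"q1": ["cat", "dog", "cat"], "q2": ["7", "7", "7"]}
rate = flip_rate(canonical, shuffled)
```

Running the same function on repeated calls with an unchanged ordering would give a decoder-noise floor, and subtracting that floor from `rate` gives one way to estimate the ordering excess the abstract mentions.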

2606.26062 2026-06-25 cs.CL New submission

When Certainty Is an Artifact: Keyword Lexicon Blindness and the (Mis)Measurement of Rhetorical Stance

Bo Chen

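Affiliations: Institute of Computing Technology, Chinese Academy of Sciences

AI summary: Compares keyword lexicons with LLM zero-shot classification on 85 interviews (2016-2026) and finds that the significant negative-affect/certainty co-occurrence produced by keyword counting is a measurement artifact, while the LLM shows that pessimistic discourse attracts hedging rather than certainty.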

Comments 16 pages, 2 figures

Abstract

Can a statistically significant, large-effect-size finding in computational social science be entirely an artifact of the measurement instrument? We present a case where the answer appears to be yes. Analyzing 85 interviews across four public intellectuals (2016-2026), we find a robust negative-affect/emphatic-certainty lexical co-occurrence pattern under keyword-based scoring ($r = 0.72$ to $0.93$, $p < 0.01$ for all four speakers). Replacing keyword counting with LLM-based zero-shot semantic classification on the complete diarized corpus (32,625 sentences) dramatically reduces this correlation: Dalio's $r = 0.851$ drops to $r = 0.206$, with two speakers showing negative $r(\text{neg}, \text{emphatic})$ and one showing null. In contrast, the LLM reveals a strong negative-hedging coupling across speakers (Rogoff's $r(\text{neg}, \text{hedged}) = 0.875$, $p = 0.001$; Zeihan's $r(\text{neg}, \text{hedged}) = 0.722$, $p = 0.008$), consistent with the conventional expectation that pessimistic discourse attracts hedging, not certainty. Sentence-level error analysis traces this discrepancy to three structural failure modes in keyword lexicons (syntactic blindness, polysemy blindness, and categorical absence), illustrated through cases where keyword counting inverts semantic meaning (e.g., "never absolutely totally confident" scored as high-certainty). We argue that keyword lexicons measure a universal lexical co-occurrence tendency (negative discourse naturally attracts emphatic vocabulary) that is orthogonal to, and can systematically invert, rhetorical stance. Treating keyword counts as measurements of epistemic certainty is a category error: a finding that appears to be about a speaker's psychology may be entirely about the counting of words.
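The syntactic-blindness failure mode can be reproduced in a few lines. The sketch below uses a made-up four-word lexicon and contrasts plain keyword counting with a crude negation-aware variant; neither is the paper's instrument, and the negation heuristic merely stands in for the LLM classifier to show how the sign can invert.

```python
import re

# Hypothetical lexicons for illustration; the paper's lexicon is not given here.
EMPHATIC = {"absolutely", "totally", "certainly", "definitely"}
NEGATORS = {"never", "not", "no"}

def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def keyword_certainty(sentence):
    """Count emphatic keywords, ignoring syntax entirely."""
    return sum(tok in EMPHATIC for tok in tokenize(sentence))

def negation_aware_certainty(sentence):
    """Score emphatic keywords as -1 once a negator has appeared (crude heuristic)."""
    score = 0
    negated = False
    for tok in tokenize(sentence):
        if tok in NEGATORS:
            negated = True
        elif tok in EMPHATIC:
            score += -1 if negated else 1
    return score

example = "I am never absolutely totally confident"
keyword_score = keyword_certainty(example)          # counts two emphatic words
aware_score = negation_aware_certainty(example)     # both fall under "never"
```

On the abstract's example, keyword counting reports high certainty while the negation-aware score is negative, which is the inversion the error analysis describes.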

2606.26058 2026-06-25 cs.CV New submission

DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

Nan Chen, Yiyang Cai, Rongchang Xie, Junwen Pan, Cheng Chen, Weinan Jia, Zhuowei Chen, Wen Zhou, Zhenbang Sun, Wenhan Luo

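Affiliations: Hong Kong University of Science and Technology

AI summary: Proposes DomainShuttle, which combines Domain-MoT to decouple video and reference features, Video-Reference DualRoPE for precise subject-level spatial modeling, and a Cross-Pair Consistent Loss to extract intrinsic subject features, achieving both high fidelity and generative flexibility in open-domain video personalization.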

Comments 19 pages, 9 figures

Abstract

Open domain subject-driven text-to-video (S2V) generation has drawn significant interest in academia and industry. Open domain S2V mainly involves two scenarios: in-domain, which requires retaining the reference subject features as much as possible, and cross-domain, which preserves the intrinsic features of the subject while allowing subject-irrelevant properties to vary flexibly according to the text prompt. Existing methods primarily focus on maximizing subject fidelity in in-domain scenarios, which limits their editability and adaptability in cross-domain scenarios, such as novel styles, semantic combinations, or domain attributes. In this study, we argue that an ideal S2V method should flexibly shuttle between different domains, achieving strong performance in both in-domain and cross-domain scenarios. To this end, we propose DomainShuttle, which achieves both high fidelity and generative flexibility for open domain video personalization. Specifically, we introduce Domain-MoT, which decouples video and reference features and adds a domain-aware AdaLN for domain-specific modeling of reference images. We then introduce the Video-Reference DualRoPE scheme, which places reference image tokens and video tokens in separate RoPE spaces to enable precise subject-level spatial modeling, and the Cross-Pair Consistent Loss, which aims to extract intrinsic subject features unaffected by irrelevant features. Extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods, exhibiting high subject fidelity and generative flexibility across diverse open domain application scenarios.

2606.26057 2026-06-25 cs.AI cs.CR cs.LG New submission

The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems

Seth Dobrin, Łukasz Chmiel

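Affiliations: ARYA Labs PBC

AI summary: Proposes the Unfireable Safety Kernel, an architectural control mechanism that achieves execution-time AI alignment through four properties (process separation, pre-action enforcement, fail-closed behavior, and externally verifiable evidence) and blocks escape attempts in escapable AI systems.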

Comments Pre-print submitted for publication

Abstract

AI agents are granted access to tools, APIs, and other infrastructure, making them active principals in those systems. The dominant approach places controls inside the agent's own runtime: system prompts, output filters, and guardrail libraries. Any control in the agent's address space is reachable by inputs that influence it; this generalizes to any AI system with sufficient reach into its own runtime, a class we term escapable AI systems. We identify four properties that an authorization mechanism must satisfy to provide architectural control rather than rely on cooperative requests: process separation, pre-action enforcement on a structurally only path, fail-closed behavior at both the request and system levels, and externalized signed evidence verifiable outside the controlled system's trust boundary. We position this layer as execution-time AI alignment, complementing training-time alignment (RLHF, Constitutional AI) and inference-time alignment. We present the Unfireable Safety Kernel, a Rust reference implementation realizing all four. Its fail-closed invariant is machine-checked at two levels: an SMT theorem (Z3) and an exhaustive bounded-model-checking proof of the production decision function (Kani, 4/4 harnesses). A Python-to-Rust migration was gated on byte-equivalence (1000/1000 fixtures; 17/17 adversarial classes). We evaluate the kernel governing a live, escapable AI system, a deterministic, self-improving world model, against an escape-seeking adversary driving its real self-modification seam: across 1,000 self-modifications, all 704 attempts on the safety-critical core are refused, with no escape; a further 300, under the operator kill switch, are also refused. A separate campaign of 6,240 authorization round-trips had no successful bypass. In 3 contemporary systems that claim the agent control plane, the agent itself invokes the control; here, it lacks that choice.
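The fail-closed property can be illustrated with a small decision function in which anything other than an explicit allow becomes a deny. This is a Python sketch for readability only: the paper's kernel is a Rust implementation running in a separate process, and the names `authorize`, `policy_check`, and `kill_switch_engaged` are hypothetical, not its API.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"

def authorize(request, policy_check, kill_switch_engaged=False):
    """Fail-closed decision: only an explicit True from the policy allows.

    request:             the action the agent wants to take
    policy_check:        callable(request) -> bool (hypothetical policy engine)
    kill_switch_engaged: operator override that denies everything
    """
    if kill_switch_engaged:
        return Decision.DENY
    try:
        verdict = policy_check(request)
    except Exception:
        # A crashing or unreachable policy engine must not open the gate.
        return Decision.DENY
    # Missing, truthy-but-not-True, or malformed verdicts all deny.
    return Decision.ALLOW if verdict is True else Decision.DENY
```

Because this sketch runs in the caller's own process, it shows only the fail-closed property; the process-separation and signed-evidence properties listed in the abstract need machinery that a single function cannot demonstrate.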

2606.26048 2026-06-25 cs.RO eess.SY New submission

Deep Reinforcement Learning-Enhanced Event-Triggered Data-Driven Predictive Control for a 3D Cable-Driven Soft Robotic Arm

Cheng Ouyang, Moeen Ul Islam, Kaixiang Zhang, Zhaojian Li, Xiaobo Tan, Dong Chen

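Affiliations: Mississippi State University; Michigan State University

AI summary: Proposes an adaptive RL-based event-triggered DeePC framework in which a trained RL policy decides when to invoke the optimizer, cutting computation while preserving tracking accuracy; experiments show optimization frequency reduced by up to 66%.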

Abstract

Soft robots are challenging to control due to their nonlinear and time-varying dynamics. Data-enabled predictive control (DeePC) offers a model-free alternative by directly leveraging measured input-output trajectories to construct a predictive controller. However, its receding-horizon formulation requires solving a constrained optimization problem at every sampling instant, which can be computationally demanding for real-time deployment on resource-limited robotic platforms. To address this limitation, we propose an adaptive reinforcement-learning-based event-triggered DeePC (RL-ET-DeePC) framework for soft robotic control. A model-free RL policy is trained to determine when to invoke the DeePC optimizer based on the current system state representation, thereby reducing unnecessary optimization calls while preserving closed-loop performance. Simulation results show that RL-ET-DeePC reduces optimization frequency by up to 66% compared to periodic DeePC, while maintaining comparable tracking accuracy. Hardware experiments on a three-dimensional cable-driven soft robotic arm demonstrate zero-shot transfer, achieving a 34% reduction in optimization frequency with tracking accuracy comparable to periodic DeePC and more consistent performance than a static threshold-based event-triggered baseline.
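The event-triggered idea can be sketched as a loop that reuses a previously solved plan until a trigger asks for a new solve. The example below uses a static error threshold as the trigger, which corresponds to the baseline named in the abstract rather than the learned RL policy; the function names, the error trace, and the plan horizon are illustrative assumptions, and no DeePC optimizer is actually invoked.

```python
def run_event_triggered(errors, horizon, trigger):
    """Count optimizer calls when a cached plan is reused between triggers.

    errors:  per-step tracking error observed by the controller
    horizon: number of steps a solved plan remains usable
    trigger: callable(error, steps_since_solve) -> bool; True means re-solve
    """
    calls = 0
    steps_since_solve = horizon  # forces a solve on the first step
    for error in errors:
        if steps_since_solve >= horizon or trigger(error, steps_since_solve):
            calls += 1            # a real controller would call the optimizer here
            steps_since_solve = 0
        steps_since_solve += 1
    return calls

def static_threshold(limit):
    """Static trigger: re-solve whenever the tracking error exceeds limit."""
    return lambda error, steps_since_solve: error > limit

# Hypothetical tracking-error trace with two disturbances.
errors = [0.0, 0.01, 0.02, 0.30, 0.02, 0.01, 0.01, 0.25, 0.02, 0.01]
periodic_calls = len(errors)  # periodic control solves at every step
triggered_calls = run_event_triggered(errors, horizon=4, trigger=static_threshold(0.1))
reduction = 1.0 - triggered_calls / periodic_calls
```

In the RL-based variant the abstract describes, `trigger` would be replaced by a trained policy that maps the current state representation to a solve-or-skip decision.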

2606.26047 2026-06-25 cs.RO New submission

Learning Robot Visual Navigation in Crowds via Intention-Aware Scene Representations

Han Bao, Bingyi Xia, Hanjing Ye, Yu Zhan, Hao Cheng, Baozhi Jia, Wenjun Xu, Jiankun Wang

Affiliations: Shenzhen Key Laboratory of Robotics Perception and Intelligence, Department of Electronic and Electrical Engineering, SUSTech; Jiaxing Research Institute, SUSTech; Research Institute of MA&EI, Peng Cheng Laboratory

AI summary: Proposes iCrowdNav, which uses a spatio-temporal encoder and the Intent-Interact Former to extract scene occupancy features and pose-based human intent from visual observations, enabling effective crowd navigation via deep reinforcement learning.

Abstract

Robot crowd navigation requires the ability to infer human intentions while accounting for the structural constraints of the environment. Deep reinforcement learning (DRL) provides a promising way to learn navigation policies that understand human intentions. However, most existing methods rely on limited scene representations, treating pedestrians as simple 2D points and ignoring rich visual cues from both humans and the environment. To address this issue, we introduce iCrowdNav, a novel visual crowd navigation method whose intention-aware scene representations encode behavioral and structural context from egocentric visual observations. Our method employs two key components: a spatio-temporal encoder for extracting occupancy features of the scene, and Intent-Interact Former (I$^2$ Former), an attention-based module that encodes human poses to infer pedestrians' motion intentions. These features are integrated into a compact state embedding that supports effective DRL policy training. Extensive experiments show that our method achieves superior performance over baselines, and real-world deployment demonstrates vision-based crowd navigation.

2606.26046 2026-06-25 cs.RO cs.CV 新提交

RoboAtlas: Contextual Active SLAM

RoboAtlas:上下文感知主动SLAM

Alexander Schperberg, Shivam K. Panda, Abraham P. Vinod, M. K. Jawed, Stefano Di Cairano

发表机构 * Mitsubishi Electric Research Laboratories(三菱电机研究实验室) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出RoboAtlas框架,通过上下文感知的多臂赌博机平衡几何探索与语义推理,结合3D语义地图OpenRoboVox,在真实环境中实现100%任务成功率,并在GOAT-Bench上以90.6%成功率达到SOTA。

Comments Alexander Schperberg and Shivam K. Panda made equal contribution

详情
AI中文摘要

我们提出了RoboAtlas,一种上下文感知的主动SLAM框架,它使用可扩展的3D语义映射系统OpenRoboVox自适应地平衡几何探索和语义推理。RoboAtlas通过上下文感知的多臂赌博机整合了前沿探索、全局语义地图推理和以自我为中心的VLM推理,随着场景理解的提高,从探索过渡到语义引导的导航。我们在仿真中以及在超过1800平方米、约30k映射语义实例的大规模真实世界环境中对Unitree Go2机器人进行了系统评估,实现了100%的任务成功率。在GOAT-Bench“Val Unseen”基准测试中,RoboAtlas使用GPT-4o实现了最高的报告成功率(SR)90.6%,比之前的最强基线提高了17.8个百分点。使用更小的Qwen2.5-VL-7B模型,它仍然实现了88.8%的SR,在所有使用GPT-4o的基线中SR表现最佳,揭示了我们的语义映射框架所获得的信息的重要性,而不仅仅是替换底层基础模型。结果表明,将基础模型与大规模3D语义地图相结合,能够实现稳健且高效的上下文感知主动SLAM。

英文摘要

We present RoboAtlas, a contextual Active SLAM framework that adaptively balances geometric exploration and semantic reasoning using a scalable 3D semantic mapping system, OpenRoboVox. RoboAtlas integrates frontier exploration, global semantic-map reasoning, and egocentric VLM-based reasoning through a contextual multi-armed bandit that transitions from exploration to semantically guided navigation as scene understanding improves. We evaluate the system in simulation and on a Unitree Go2 robot in large-scale real-world environments exceeding 1,800 m$^2$ with approximately 30k mapped semantic instances, achieving a 100% task success rate. On the GOAT-Bench "Val Unseen" benchmark, RoboAtlas achieves state-of-the-art performance with the highest reported success rate (SR) of 90.6% using GPT-4o, improving over the strongest prior baseline by 17.8 percentage points in SR. Using the much smaller Qwen2.5-VL-7B model, it still achieves 88.8% SR, outperforming all baselines that use GPT-4o in SR, which highlights the importance of the information gained by our semantic mapping framework over simply replacing the underlying foundation model. The results demonstrate that grounding foundation models with large-scale 3D semantic maps enables robust and efficient contextual Active SLAM.
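As a rough illustration of the contextual multi-armed bandit idea mentioned in the abstract above, the sketch below keeps a separate UCB1 estimate per discrete context and picks among three arms named after the strategies the abstract lists. The abstract does not say which bandit algorithm, context features, or reward signal RoboAtlas uses, so the class `BucketedUCB`, the context labels, and the reward function are all hypothetical.

```python
import math
from collections import defaultdict

# Arm names follow the three strategies listed in the abstract; everything else is assumed.
ARMS = ("frontier_exploration", "semantic_map_reasoning", "egocentric_vlm_reasoning")

class BucketedUCB:
    """UCB1 kept separately per discrete context bucket (a simple contextual bandit)."""

    def __init__(self, arms=ARMS, c=1.0):
        self.arms = arms
        self.c = c  # exploration weight
        self.counts = defaultdict(lambda: [0] * len(arms))
        self.totals = defaultdict(lambda: [0.0] * len(arms))

    def select(self, context):
        counts, totals = self.counts[context], self.totals[context]
        # Try every arm once in a new context before using the UCB score.
        for i, n in enumerate(counts):
            if n == 0:
                return i
        t = sum(counts)
        scores = [
            totals[i] / counts[i] + self.c * math.sqrt(2 * math.log(t) / counts[i])
            for i in range(len(self.arms))
        ]
        return scores.index(max(scores))

    def update(self, context, arm, reward):
        self.counts[context][arm] += 1
        self.totals[context][arm] += reward

def toy_reward(context, arm):
    """Hypothetical reward: exploring pays off early, map reasoning pays off later."""
    best = {"low_coverage": 0, "high_coverage": 1}[context]
    return 1.0 if arm == best else 0.0

bandit = BucketedUCB()
for context in ("low_coverage", "high_coverage"):
    for _ in range(200):
        arm = bandit.select(context)
        bandit.update(context, arm, toy_reward(context, arm))

for context in ("low_coverage", "high_coverage"):
    pulls = bandit.counts[context]
    print(context, "->", ARMS[pulls.index(max(pulls))])
```

In this toy setup the preferred arm shifts from frontier exploration to semantic-map reasoning as the context changes, which mirrors the transition the abstract describes only at a schematic level.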

2606.26041 2026-06-25 cs.CV cs.CL 新提交

How Robust is OCR-Reasoning? Evaluating OCR-Reasoning Robustness of Vision-Language Models under Visual Perturbations

OCR推理有多鲁棒?评估视觉语言模型在视觉扰动下的OCR推理鲁棒性

Yuxing Cheng, Yuan Wu, Yi Chang

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, MOE, China(教育部知识驱动人机智能工程研究中心) International Center of Future Science, Jilin University(吉林大学未来科学国际合作中心)

AI总结 提出OCR-Robust基准,通过5种扰动类型和3种严重程度评估18个VLM在OCR推理任务上的鲁棒性,发现高干净精度不保证强鲁棒性,图表比文档更脆弱。

详情
AI中文摘要

视觉语言模型(VLM)在基于OCR的基准测试上取得了强劲性能,并日益关注文本丰富的理解,但它们在受控视觉退化下的鲁棒性仍未被充分理解。这一差距对于OCR推理至关重要,因为视觉损坏可能导致OCR错误和结构扭曲,从而给推理任务带来不确定性。为了系统研究这一问题,我们引入了OCR-Robust,一个旨在评估视觉扰动下OCR推理鲁棒性的基准。它包含812个样本,分为两个互补子集:OCR1.0涵盖文档、场景文本、收据、手写和数学内容,OCR2.0专注于图表、几何图形和表格。为了实现高效且信息丰富的评估,我们对18个候选扰动进行了初步研究,并根据它们的影响和跨模型区分性,选择了5种代表性类型,每种类型设置3个严重级别。我们使用干净精度、相对损坏保持率(RCR)、最坏情况保持率(WCR)和复合损坏鲁棒性指数(CRI)来评估鲁棒性,并对18个模型进行基准测试,包括专有系统、开源VLM和OCR+LLM流水线。我们的结果表明,更高的干净精度并不一定意味着更强的鲁棒性,模型在结构敏感的OCR任务上可能在最坏情况下遭受显著退化,并且图表和表格在扰动下比文档类输入脆弱得多。

英文摘要

Vision-language models (VLMs) have achieved strong performance on OCR-based benchmarks and increasingly focus on text-rich understanding, but their robustness under controlled visual degradation remains insufficiently understood. This gap is critical for OCR reasoning, where visual corruption can induce OCR errors and structural distortions, thereby introducing uncertainty into the reasoning task. To systematically study this problem, we introduce OCR-Robust, a benchmark designed for evaluating OCR reasoning robustness under visual perturbations. It contains 812 samples across two complementary subsets: OCR1.0, covering documents, scene text, receipts, handwriting, and mathematical content, and OCR2.0, focusing on charts, geometry diagrams, and tables. To enable efficient yet informative evaluation, we conduct a pilot study over 18 candidate perturbations and select 5 representative types at 3 severity levels each based on their impact and cross-model discriminability. We evaluate robustness using clean accuracy, Relative Corruption Retention (RCR), Worst-Case Retention (WCR), and a composite Corruption Robustness Index (CRI), and benchmark 18 models spanning proprietary systems, open-source VLMs, and OCR+LLM pipelines. Our results show that higher clean accuracy does not necessarily imply stronger robustness, that models can suffer pronounced worst-case degradation on structure-sensitive OCR tasks, and that charts and tables are substantially more fragile than document-like inputs under perturbation.

2606.26040 2026-06-25 cs.CL 新提交

AI translation of literary texts is "fine", but readers still prefer human translations

AI 翻译文学作品“还行”,但读者仍偏好人工翻译

Yves Ferstler, Adam Podoxin, Ty Brassington, Roman Grundkiewicz, Maite Taboada, Marzena Karpinska

发表机构 * Simon Fraser University(西蒙菲莎大学) Université du Québec à Montréal(魁北克大学蒙特利尔分校) Microsoft(微软)

AI总结 通过15位读者对15部法、波、日小说的人工翻译与AI翻译的比较实验,发现AI翻译虽可接受,但读者更偏好人工翻译的流畅性、清晰度和沉浸感,且无法可靠区分两者。

Comments 58 pages, including appendices

详情
AI中文摘要

AI 翻译文学作品越来越常见。虽然内容可能被充分传达,但我们对其在沉浸感和文学效果方面的读者体验了解不足,这些方面难以通过自动机器翻译指标或针对流畅性和充分性的人工评估来捕捉。我们邀请15位热爱阅读的读者,将最近出版的人工翻译(HT)与基于智能大语言模型(LLM)流水线生成的机器翻译(MT)进行比较,涉及15部近期法语、波兰语和日语小说及其英译本。读者评估了约8000词的摘录,分为两种条件:沉浸式阅读整篇摘录(30次比较)和细读386个对齐的HT-MT片段对(772次比较),每本书由两位读者以交替顺序进行。总体而言,读者认为MT“还行”,但更偏好HT(在摘录层面为19/30,在片段层面更明显为522/772),因其更易读、清晰且具有沉浸感。读者的标注显示,MT的质量在一本书内波动比HT更大。关键的是,读者无法可靠区分两者(17/30猜对),且倾向于偏好他们认为出自人类之手的版本。自动指标(包括LLM作为评判的方法)无法还原读者偏好,反而偏向MT。我们发布了LAIT(文学AI翻译),一个以读者为中心的评估数据集,包含1000条读者评论、2000条判断和偏好评分,以及7200个跨度级标注,同时提供我们的评估协议和支持界面。

英文摘要

AI translation of literary works is increasingly common. While the content may be rendered adequately, we do not know enough about how readers experience it in terms of immersiveness and literary effect, aspects poorly captured by automatic machine translation metrics or human evaluation targeting fluency and adequacy. We ask 15 avid readers to compare recently published human translations (HT) to machine translations (MT) generated with an agentic large language model (LLM)-based pipeline, for 15 recent novels in French, Polish, and Japanese and translated into English. Readers evaluated approximately 8K-word excerpts in two conditions: immersive reading of the whole excerpt (30 comparisons) and close reading of 386 aligned HT-MT chunk pairs (772 comparisons), with two readers per book and in alternating order of presentation. Overall, readers find MT "fine", but prefer HT (slightly at excerpt-level 19/30, more clearly at chunk-level 522/772) for its ease, clarity, and immersive nature. Readers' highlights show that MT's quality varies more within one book than HT's does. Crucially, readers cannot reliably tell the two apart (17/30 guess correctly) and tend to prefer the version they believe to be human. Automatic metrics, including LLM-as-a-judge approaches, fail to recover reader preferences and favor MT. We release LAIT (Literary AI Translation), a reader-centered evaluation dataset with 1K reader comments, 2K judgments and preference ratings, and 7.2K span-level annotations, along with our evaluation protocol and supporting interface.
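The preference counts reported in the abstract above can be turned into proportions and checked against a 50/50 split. The sketch below does this with an exact two-sided binomial test. Treating each comparison as an independent coin flip is a simplifying assumption made here, not something the abstract claims (two readers judged each book, so the comparisons are probably correlated), and the helper `two_sided_binomial_p` is illustrative.

```python
from math import comb

def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    # Small tolerance so that outcomes with equal probability are counted despite rounding.
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))

# Counts copied from the abstract: (successes, comparisons).
counts = {
    "HT preferred, excerpt level": (19, 30),
    "HT preferred, chunk level": (522, 772),
    "HT/MT identified correctly": (17, 30),
}

for label, (k, n) in counts.items():
    print(f"{label}: {k}/{n} = {k / n:.1%}, p = {two_sided_binomial_p(k, n):.3g}")
```

Under this independence assumption, only the chunk-level preference is clearly separated from chance, which is in line with the abstract's wording that the excerpt-level preference is slight and that readers cannot reliably tell the two versions apart.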

2606.26036 2026-06-25 cs.CL cs.CR 新提交

Detect, Unlearn, Restore: Defending Text Summarization Models Against Data Poisoning

检测、遗忘、恢复:防御文本摘要模型免受数据投毒

Poojitha Thota, Shirin Nilizadeh

发表机构 * The University of Texas at Arlington(德克萨斯大学阿灵顿分校)

AI总结 提出统一的后验防御框架,通过影响函数分析或行为审计检测微调阶段的数据投毒,并利用梯度上升遗忘恢复模型性能,在多种攻击下实现85-92%检测精度和96%行为恢复。

详情
AI中文摘要

微调期间的训练时数据投毒对部署用于抽象文本摘要的大型语言模型(LLMs)构成重大威胁,因为小的任务特定数据集对模型行为产生不成比例的影响。在此设置中,对手操纵微调数据以诱导持续的摘要失败,例如有偏见或有害的摘要,同时保持标准评估指标。我们提出了一个统一的后验防御框架,用于检测和修复机器学习供应链中摘要模型的微调阶段投毒。我们的实验表明,在白盒设置中,被投毒的文档-摘要对表现出异常高的训练影响,通过带有语义一致性检查的影响函数分析实现检测。在黑盒设置中,被投毒的模型对保持语义的扰动表现出两到三倍的敏感性,从而无需训练数据即可进行行为审计。除了现有的投毒形式,我们引入了针对事实扭曲和代表性偏差的新型攻击,表明投毒改变了摘要行为而不触发常规警报。在自适应攻击下,跨越九种架构和六个基准数据集,我们的防御实现了85-92%的检测精度,而梯度上升遗忘恢复了高达96%的原始行为,且效用损失极小(ROUGE下降小于0.6%)。这些结果表明,微调时投毒留下持久的结构伪影,使得无需完全重新训练即可实现实际检测和部署后恢复。

英文摘要

Training-time data poisoning during fine-tuning poses a significant threat to large language models (LLMs) deployed for abstractive text summarization, where small task-specific datasets exert disproportionate influence on model behavior. In this setting, adversaries manipulate fine-tuning data to induce persistent summarization failures, such as biased or harmful summaries, while preserving standard evaluation metrics. We present a unified post-hoc defense framework for detecting and remediating fine-tuning-stage poisoning in summarization models across the machine learning supply chain. Our experiments show that in white-box settings, poisoned document-summary pairs exhibit abnormally high training influence, enabling detection via influence-function analysis with semantic consistency checks. In black-box settings, poisoned models display two to three times greater sensitivity to semantics-preserving perturbations, enabling behavioral auditing without training data access. Beyond existing poisoning formulations, we introduce novel attacks targeting factual distortion and representational bias, showing that poisoning alters summarization behavior without triggering conventional alarms. Across nine architectures and six benchmark datasets under adaptive attacks, our defenses achieve 85-92% detection precision, while gradient-ascent unlearning restores up to 96% of original behavior with minimal utility loss (less than 0.6% ROUGE degradation). These results indicate that fine-tuning-time poisoning leaves persistent structural artifacts, enabling practical detection and post-deployment recovery without full retraining.

2606.26029 2026-06-25 cs.CV cs.AI 新提交

TriViewBench: Controlled Complexity Scaling for Multi-View Structural Reasoning in MLLMs

TriViewBench: 多视图结构推理中受控复杂度缩放

Yu-Yang Chen, Lan-Zhe Guo

发表机构 * School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院) National Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室)

AI总结 提出TriViewBench基准,通过合成3D场景控制物体数量和遮挡,评估18个多模态大模型在多视图推理中的能力层次和性能退化,发现跨视图空间表示是瓶颈。

Comments 26 pages, 8 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)在标准视觉问答基准上表现出色,但其在受控结构复杂度下的可扩展性仍知之甚少。我们引入了TriViewBench,一个受控的三视图视觉推理基准,由合成3D场景构建,具有显式参数化的物体数量和遮挡。该基准包含1,923个场景和超过14,000个问答对,分为四个复杂度级别和三个推理类别:局部决策、物体计数和全局恢复。我们在统一提示协议下评估了18个开源和闭源MLLM。所有18个模型无一例外地表现出相同的能力层次(局部决策 > 物体计数 > 全局恢复),并且性能随复杂度单调下降:局部决策任务下降适度(相对下降12.11%),而物体计数大幅下降(59.14%),全局恢复严重崩溃(80.02%)。对物体计数的错误分析揭示了两种机制上独立的失败模式:单视图任务因遮挡盲视导致低估,而多视图任务因跨视图身份混淆转为高估。思维链(CoT)提示带来的总体收益近乎为零(Δ = -0.16%),且其对全局恢复的效果强烈受能力门控,表明瓶颈在于跨视图空间表示而非推理策略。这些发现揭示了当前MLLM的基本可扩展性限制,并将TriViewBench定位为分析结构推理失败的可控诊断框架。

英文摘要

Multimodal Large Language Models (MLLMs) demonstrate strong performance on standard visual question answering benchmarks, yet their scalability under controlled structural complexity remains poorly understood. We introduce TriViewBench, a controlled three-view visual reasoning benchmark constructed from synthetic 3D scenes with explicitly parameterized object count and occlusion. The benchmark contains 1,923 scenes and over 14K Question-Answer (QA) pairs organized into four complexity levels and three reasoning categories: Local Decision, Object Counting, and Global Recovery. We evaluate 18 open- and closed-source MLLMs under a unified prompting protocol. All 18 models exhibit an identical capability hierarchy without exception (Local Decision > Object Counting > Global Recovery), and performance degrades monotonically with complexity: Local Decision tasks decline modestly (12.11% relative drop), while Object Counting degrades substantially (59.14%) and Global Recovery collapses severely (80.02%). Error analysis on Object Counting reveals two mechanistically independent failure modes: single-view tasks are dominated by undercounting due to occlusion blindness, whereas the multi-view task reverses to overcounting due to cross-view identity confusion. Chain-of-Thought (CoT) prompting yields near-zero overall benefit ($\Delta = -0.16\%$) and its effect on Global Recovery is strongly capability-gated, suggesting that the bottleneck lies in cross-view spatial representation rather than reasoning strategy. These findings reveal fundamental scalability limitations in current MLLMs and position TriViewBench as a controlled diagnostic framework for analyzing structural reasoning failures.

2606.26027 2026-06-25 cs.CL cs.LG 新提交

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

为什么多步工具使用强化学习会崩溃以及监督信号如何修复它

Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所复杂系统认知与决策智能重点实验室) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 研究发现RL训练LLM工具使用时会出现控制token概率尖峰导致性能崩溃,而交错SFT与RL可提升稳定性,但OOD评估下降。

详情
AI中文摘要

工具使用使大型语言模型(LLM)能够执行复杂任务,最近的智能体强化学习(RL)方法显示出增强模型能力的潜力。然而,仅靠RL往往导致工具使用任务中的不稳定或有限收益。在我们的实验中,一些模型表现出灾难性崩溃,性能突然下降,工具调用结构失效。分析表明,这些失败源于特定控制token中出现意外的概率尖峰,破坏了结构化执行,但底层工具使用能力仍然完好,仅被特定格式掩盖。为了解决这个问题,我们系统地研究了一组多样化的监督信号,包括离策略监督、提示引导、错误示例监督等,在同步和交错训练方案下应用。我们发现,将监督微调(SFT)与RL交错进行可显著提高稳定性,但在格式和内容分布外(OOD)评估下表现出性能下降。我们还分析了学习率的影响以及跨设置的泛化能力。这些结果强调了理解RL失败的重要性,并展示了多样化的监督信号如何引导探索性学习,使LLM能够稳健地训练以完成复杂的多步工具使用任务。我们的代码可在此https URL获取。

英文摘要

Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use tasks. In our experiments, some models exhibit catastrophic collapse, where performance abruptly drops and tool-invocation structures fail. Our analysis reveals that these failures stem from unexpected probability spikes in specific control tokens, which disrupt structured execution, yet the underlying tool-use capability remains intact and is merely obscured by specific formats. To address this, we systematically investigate a diverse set of supervisory signals, including off-policy supervision, hint-based guidance, erroneous example supervision, and others, applied under both synchronous and interleaved training schemes. We find that interleaving supervised fine-tuning (SFT) with RL substantially improves stability but degrades performance under format and content out-of-distribution (OOD) evaluation. We also analyze the impact of learning rates and generalization across settings. These results highlight the importance of understanding RL failures and demonstrate how diverse supervisory signals can guide exploratory learning, enabling robust training of LLMs for complex, multi-step tool-use tasks. Our code is available at this https URL.

2606.26017 2026-06-25 cs.RO 新提交

G2DP: Diffusion Planning with Spatio-Temporal Grid Guidance

G2DP: 基于时空网格引导的扩散规划

Hang Yu, Ye Jin, Alessandro Canevaro, Julian Schmidt, Julian Jordan, Peizheng Li, Marc Kaufeld, Silvan Lindner, Johannes Betz, Wilhelm Stork

发表机构 * Mercedes-Benz AG(梅赛德斯-奔驰集团) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) TU Munich(慕尼黑工业大学) University of Tübingen(图宾根大学)

AI总结 针对自动驾驶扩散规划器随机性导致的安全与路线保持问题,提出G2DP,通过可微时空代价体积在去噪过程中注入密集梯度,实现无碰撞与路径最优的轨迹生成,在nuPlan等基准上取得最优性能。

详情
AI中文摘要

在自动驾驶中,基于扩散的规划器已成为密集交互交通中鲁棒运动规划的有前景范式,因为它们能有效建模多样化的驾驶行为。然而,其固有的随机性通常需要在去噪过程中进行显式引导,以确保安全性和路线遵循,从而实现鲁棒的闭环执行。现有引导通常依赖于稀疏的、以实体为中心的几何查询或事后细化,导致在交互场景中情境感知有限且性能脆弱。为解决此问题,我们提出G2DP(网格引导扩散规划),一种通过推理时引导直接强制执行密集环境约束的扩散规划器。具体来说,G2DP通过融合概率性未来占据分布与路线进展图,构建了一个可微的时空代价体积。通过将该体积表述为连续的安全能量函数,它将密集梯度直接注入去噪循环,主动引导轨迹生成朝向无碰撞且进展最优的区域。广泛的闭环评估表明,G2DP在nuPlan上实现了最先进的性能,在反应分数上比最强的模仿学习基线高出7.2分。在零样本迁移到interPlan和DeepScenario基准时,它进一步保持了最高分数,其中在interPlan上碰撞避免比无引导方法提高了10.15分。这些结果表明,时空代价网格可作为扩散规划中鲁棒引导的有效表示。

英文摘要

In autonomous driving, diffusion-based planners have emerged as a promising paradigm for robust motion planning in dense and interactive traffic, as they can effectively model diverse driving behaviors. However, their inherent stochasticity often requires explicit guidance during denoising to ensure safety and route adherence for robust closed-loop execution. Existing guidance typically relies on sparse, entity-centric geometric queries or post-hoc refinement, yielding limited situational awareness and fragile performance in interactive scenes. To address this issue, we propose G2DP (Grid-Guided Diffusion Planning), a diffusion-based planner that directly enforces dense environmental constraints through inference-time guidance. Specifically, G2DP constructs a differentiable spatio-temporal cost volume by fusing probabilistic future occupancy distributions with a route-progress map. By formulating this volume as a continuous safety energy functional, it injects dense gradients directly into the denoising loop, actively steering trajectory generation toward collision-free and progress-optimal regions. Extensive closed-loop evaluations show that G2DP achieves state-of-the-art performance on nuPlan, outperforming the strongest imitation-learning baseline by +7.2 points in reactive score. It further maintains top scores in zero-shot transfers to interPlan and DeepScenario benchmarks, with collision avoidance improving by +10.15 over the unguided approach on interPlan. These results demonstrate that spatio-temporal cost grids serve as an effective representation for robust guidance in diffusion-based planning.
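To make the idea of injecting cost gradients into a denoising loop more concrete, here is a deliberately small toy, not the G2DP method itself. It assumes a single Gaussian bump as the cost, a stand-in "denoiser" that simply shrinks lateral offsets toward the lane centre, and guidance applied only to the lateral coordinate. The names `toy_denoise`, `plan`, `OBSTACLE`, and `SIGMA` are hypothetical, and the route-progress term from the abstract is not modelled.

```python
import math

OBSTACLE = (4.5, 0.0)  # hypothetical obstacle centre (x, y)
SIGMA = 1.0            # hypothetical obstacle size

def cost(x, y):
    """Smooth occupancy-like cost: high near the obstacle, near zero far away."""
    dx, dy = x - OBSTACLE[0], y - OBSTACLE[1]
    return math.exp(-(dx * dx + dy * dy) / (2 * SIGMA**2))

def cost_grad_y(x, y):
    """Analytic derivative of the cost with respect to the lateral offset y."""
    return -cost(x, y) * (y - OBSTACLE[1]) / SIGMA**2

def toy_denoise(ys):
    """Stand-in for one denoising step: shrink lateral offsets toward the lane centre."""
    return [0.5 * y for y in ys]

def plan(guidance_scale, steps=30):
    xs = [float(i) for i in range(10)]  # fixed longitudinal waypoints
    ys = [0.3] * len(xs)                # small initial lateral offset (stand-in for noise)
    for _ in range(steps):
        ys = toy_denoise(ys)
        # Guidance: move each waypoint down the cost gradient after the denoising step.
        ys = [y - guidance_scale * cost_grad_y(x, y) for x, y in zip(xs, ys)]
    return list(zip(xs, ys))

def min_clearance(traj):
    """Smallest distance between any waypoint and the obstacle centre."""
    return min(math.hypot(x - OBSTACLE[0], y - OBSTACLE[1]) for x, y in traj)

unguided = plan(guidance_scale=0.0)
guided = plan(guidance_scale=3.0)
print(f"clearance unguided={min_clearance(unguided):.2f}, guided={min_clearance(guided):.2f}")
```

With the guidance scale set to zero the toy trajectory collapses onto the lane centre and passes close to the obstacle, while the guided run bends the nearby waypoints away from it. A real planner would replace the Gaussian bump with a learned spatio-temporal cost volume and the shrink step with a trained denoiser.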

2606.26016 2026-06-25 cs.CV 新提交

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation

MIMFlow:将掩码图像建模与归一化流集成用于端到端图像生成

Yang Chen, Xiaowei Xu, Shuai Wang, Xinwen Zhang, Qiushi Guo, Tiezheng Ge, Limin Wang

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) Alibaba Group(阿里巴巴集团) Shanghai AI Lab(上海人工智能实验室)

AI总结 提出MIMFlow框架,通过VAE编码器从掩码图像推断语义潜在变量,使归一化流专注于低频语义流形,解码器处理高频合成,解决NF容量瓶颈,在ImageNet 256×256上达到71.3%线性探测精度和2.50 FID。

Comments Accepted by ECCV 2026

详情
AI中文摘要

归一化流(NFs)是强大的生成模型,能够进行精确的密度估计和采样。然而,其严格的可逆性常常迫使模型将容量消耗在低层像素细节上,阻碍了对高层语义结构的捕获。尽管掩码图像建模(MIM)在表示学习方面表现出色,但其在生成流程中的集成仍然主要是模块化和分离的。在本文中,我们提出了MIMFlow,一个统一的端到端框架,联合优化潜在语义、像素重建和生成流。通过采用VAE编码器从掩码图像推断语义潜在变量,MIMFlow实现了生成任务的原则性解耦:归一化流专注于建模简化的低频语义流形,而专门的解码器处理高频合成。这种设计有效地解决了NF固有的容量瓶颈,使模型能够优先考虑全局结构一致性而非冗余噪声。在ImageNet 256×256上的实验结果表明,MIMFlow-L达到了71.3%的线性探测精度和2.50的FID。尽管仅使用了128个token(比标准模型少50%),它相比类似规模的NF基线获得了32.8%的性能提升。我们的代码可在此https URL获取。

英文摘要

Normalizing Flows (NFs) are powerful generative models capable of exact density estimation and sampling. However, their strict invertibility often forces the model to exhaust its capacity on low-level pixel details, hindering the capture of high-level semantic structures. While Masked Image Modeling (MIM) has excelled in representation learning, its integration into generative pipelines has remained largely modular and disjointed. In this paper, we propose MIMFlow, a unified end-to-end framework that jointly optimizes latent semantics, pixel reconstruction, and generative flow. By employing a VAE encoder to infer semantic latents from masked images, MIMFlow achieves a principled decoupling of the generative task: the Normalizing Flow focuses on modeling a simplified, low-frequency semantic manifold, while a specialized decoder handles high-frequency synthesis. This design effectively resolves the inherent capacity bottleneck of NFs, allowing the model to prioritize global structural coherence over redundant noise. Empirical results on ImageNet 256$\times$256 show that MIMFlow-L reaches 71.3% linear probing accuracy and an FID of 2.50. Despite using only 128 tokens (50% fewer than standard models), it yields a 32.8% performance gain over similar-scale NF baselines. Our code is available at this https URL.

2606.26015 2026-06-25 cs.CL 新提交

The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar

低资源语言文本去毒化的Tatoxa系统:以鞑靼语为例

Ilseyar Alimova, Bogdan Monogov, Artyom Mazur, Daniil Antonov, Vsevolod Karimov, Vitaliy Egorov, Bulat Khakimov, Alexander Panchenko

发表机构 * Skoltech(斯科尔科沃科学技术学院) HSE(高等经济学院) ITMO(ITMO大学) Institute of Applied Semiotics Tatarstan Academy of Sciences(鞑靼斯坦科学院应用符号学研究所) Kazan Federal University(喀山联邦大学) AIRI(人工智能研究所)

AI总结 提出Tatoxa系统,针对低资源鞑靼语进行文本去毒化,通过新数据集和跨语言迁移实验证明,使用本地数据训练显著优于跨语言迁移。

详情
AI中文摘要

文本去毒化,即自动检测和减轻辱骂及有害内容,对于确保在线社区安全和保护用户至关重要。然而,像鞑靼语这样的低资源语言很少受到研究关注。在本文中,我们提出了Tatoxa,一个用于鞑靼语文本去毒化的新型最先进系统。对比实验表明,所提出的方法在关键质量指标上优于现有的开源和专有商业大语言模型。我们还引入了一个新的鞑靼语文本去毒化数据集,专为低资源环境下的微调和评估而设计。最后,跨语言迁移实验表明,即使有大型俄语语料库,从其他语言(包括文化相近的俄语)进行迁移的效果也显著差于在本地鞑靼语数据上训练。

英文摘要

Text detoxification, the automated detection and mitigation of abusive and harmful content, is essential for ensuring the safety of online communities and protecting users. However, low-resource languages such as Tatar have received little research attention. In this paper, we present Tatoxa, a novel state-of-the-art system for text detoxification in the Tatar language. Comparative experiments show that the proposed approach outperforms existing open-source and proprietary commercial LLMs on key quality metrics. We also introduce a new dataset for text detoxification in Tatar, designed for fine-tuning and evaluation in low-resource settings. Finally, cross-lingual transfer experiments indicate that transfer from other languages, including the culturally close Russian, performs significantly worse than training on native Tatar data, even when a large Russian corpus is available.

2606.26009 2026-06-25 cs.LG 新提交

Is Variational Monte Carlo Robust? Sharp Moment Thresholds and Heavy-tailed Stochastic Optimization

变分蒙特卡洛是否稳健?尖锐矩阈值与重尾随机优化

Philipp Grohs, Davide Nobile

发表机构 * Faculty of Mathematics, University of Vienna(维也纳大学数学学院) RICAM, Austrian Academy of Sciences(奥地利科学院里卡姆研究所)

AI总结 本文揭示变分蒙特卡洛(VMC)中局部能量和梯度估计子的重尾特性,提出基于裁剪的PS-Clip-VMC算法,证明其在弱矩条件下的收敛性,实验表明比标准方法更稳健。

详情
AI中文摘要

变分蒙特卡洛(VMC)是电子结构理论中的核心算法,并通过现代神经网络ansätze(如FermiNet)重新获得重要性。VMC的核心是通过随机优化最小化瑞利商来寻找基态。在这项工作中,我们表明由此产生的随机优化问题本质上由底层波函数的节点几何决定。更精确地说,我们建立节点集的性质决定了驱动VMC的局部能量和梯度估计子的可积性。对于广泛且实际相关的ansatz类,包括具有变指数斯莱特型轨道的斯莱特-贾斯特罗波函数,我们证明这些估计子通常是重尾的,并且无法具有高阶矩。同时,对于一般解析ansätze,我们证明了相关估计子的弱矩界,并确定了精确的低矩区域,展示了通用和退化节点结构如何导致不同的可积性阈值。基于这一分析,我们引入了一种新的稳健VMC变体——称为PS-Clip-VMC——它基于裁剪局部能量和梯度随机变量。我们证明PS-Clip-VMC在VMC的弱矩区域中既在期望上又以高概率收敛。在最多18个电子的原子上训练FermiNet的初步实验表明,PS-Clip-VMC比标准方法显著更稳健。

英文摘要

Variational Monte Carlo (VMC) is a central algorithm in electronic structure theory and has gained renewed importance through modern neural-network ansätze such as FermiNet. At its core, VMC seeks ground states by minimizing the Rayleigh quotient via stochastic optimization. In this work, we show that the resulting stochastic optimization problem is intrinsically governed by the nodal geometry of the underlying wave function. More precisely, we establish that properties of the nodal set determine the integrability of the local energy and gradient estimators that drive VMC. For broad and practically relevant ansatz classes, including Slater-Jastrow wave functions with variable-exponent Slater-type orbitals, we prove that these estimators are generically heavy-tailed and fail to admit higher moments. At the same time, for general analytic ansätze, we prove weak moment bounds for the relevant estimators and identify precise low-moment regimes, showing how generic and degenerate nodal structures lead to different integrability thresholds. Building on this analysis, we introduce a new robust variant of VMC, coined PS-Clip-VMC, which is based on clipping both the local energy and the gradient random variable. We prove that PS-Clip-VMC converges both in expectation and with high probability in the weak moment regime of VMC. Preliminary experiments for training FermiNet on atoms with up to 18 electrons suggest that PS-Clip-VMC is significantly more robust than standard methods.
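The effect of clipping heavy-tailed local-energy samples can be illustrated with a generic scheme that clips around the median by a multiple of the median absolute deviation. This is an assumption chosen for illustration and is not the PS-Clip-VMC rule, whose exact form the abstract does not give. The function name `clip_local_energies`, the `width` parameter, and the sample values are hypothetical.

```python
from statistics import median

def clip_local_energies(energies, width=5.0):
    """Clip samples to [center - width*mad, center + width*mad] around the median."""
    center = median(energies)
    mad = median(abs(e - center) for e in energies)  # median absolute deviation
    lo, hi = center - width * mad, center + width * mad
    return [min(max(e, lo), hi) for e in energies]

# Five well-behaved samples plus one extreme value, as a heavy tail might produce.
samples = [-1.0, -1.1, -0.9, -1.05, -0.95, 1e6]
clipped = clip_local_energies(samples)

print("raw mean:    ", sum(samples) / len(samples))
print("clipped mean:", sum(clipped) / len(clipped))
```

A single extreme sample dominates the raw mean, while the clipped mean stays near the bulk of the samples. Clipping introduces bias in exchange for this stability, and the window collapses to a point if all samples are identical, so a practical version would need a floor on the spread.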

2606.26008 2026-06-25 cs.RO 新提交

Emcar: Embodied Controller for Animating Robots

Emcar: 用于动画机器人的具身控制器

Carlos Gomez Cubero, Elizabeth Jochum

发表机构 * CREATE, Aalborg University(奥尔堡大学CREATE) Department of Communication and Psychology, Aalborg University(奥尔堡大学传播与心理学系) CyPhy Life, Robotics and AI Lab, School of Science & Technology, IE University(IE大学科技学院CyPhy Life机器人与人工智能实验室)

AI总结 提出EMCAR无代码平台,利用木偶戏和绘画等艺术实践来编程机器人运动,扩展协作机器人的创意应用,使非技术背景用户能参与人机交互研究。

Comments Published in Lecture Notes in Computer Science, DOI: https://doi.org/10.1007/978-3-032-15501-6_11

详情
AI中文摘要

本章描述了EMCAR,一种新颖的机器人运动编程软件工具,它利用木偶戏和绘画等艺术实践的独特可供性来构思、设计和编程新颖的交互,并实现人机交互的新用例。该无代码平台的优势在于,它扩展了协作机器人的创意应用——将机器人直接交到艺术家手中——并提供了一个包容性环境,使几乎没有技术背景的个人能够有意义地参与协作和机器人研究。

英文摘要

This chapter describes EMCAR, a novel software tool for programming robot motion that leverages the unique affordances of artistic practices such as puppetry and drawing to conceive, design, and program novel interactions and realize new use cases for HRI. The advantage of this no-code platform is that it expands creative applications for collaborative robots by putting robots directly in the hands of artists, and it provides an inclusive environment that enables individuals with little or no technical background to engage meaningfully in collaborations and robotics research.

2606.26007 2026-06-25 cs.CV cs.GR 新提交

From Sparse and Imperfect 2D Anchors to Consistent 3D Gaussian Street Scenes: Support-Aware Appearance

从稀疏且不完美的2D锚点到一致的3D高斯街景:支持感知的外观

Long Cao, Zhongquan Wang, Jie Li, Yuhan Chen, Kefei Qian, Xiangfei Huang, Guofa Li

发表机构 * College of Mechanical and Vehicle Engineering, Chongqing University(重庆大学机械与运载工程学院)

AI总结 提出教师相对外观残差蒸馏方法,通过支持感知的高斯空间聚合和置信门控优化,将稀疏不完美2D锚点烘焙为一致3D高斯街景,平衡目标对齐、内容保持和跨视图一致性。

详情
AI中文摘要

图像先验可以为3D高斯街景合成目标条件,但独立编辑的视图无法定义连贯的3D目标。直接拟合会传播视图特定噪声,而现有管线无法联合处理不完美的稀疏锚点和标准光栅化器部署。为填补这一空白,引入了教师相对外观残差蒸馏用于外观烘焙。通过教师锚点与原始渲染之间的残差,形成用于频率分解、置信度估计和基元级提升的结构化空间。直接优化信号由光栅化器空间匹配提供,而基元分配通过支持感知的高斯空间聚合进行正则化。通过置信门控的从粗到细优化,允许支持细节并抑制无支持噪声,之后所有残差被烘焙到固定几何的球谐系数中。推理时丢弃教师和辅助训练模块。在Waymo街景资产、Tanks and Temples场景以及多种目标条件下的评估显示,与基于编辑的基线相比,在目标对齐、内容保持、伪影抑制和跨视图一致性方面取得了良好的整体平衡。消融实验确认了主要组件的有效性。代码将在该URL发布。

英文摘要

Image priors can synthesize target conditions for 3D Gaussian street scenes, but independently edited views do not define a coherent 3D target. Direct fitting can propagate view-specific noise, while existing pipelines do not jointly handle imperfect sparse anchors and standard-rasterizer deployment. To address this gap, teacher-relative appearance residual distillation is introduced for appearance baking. A structured space for frequency decomposition, confidence estimation, and primitive-level lifting is formed by residuals between teacher anchors and original renders. The direct optimization signal is supplied by renderer-space matching, while primitive assignment is regularized by support-aware Gaussian-space aggregation. Supported detail is admitted and unsupported noise is suppressed through confidence-gated coarse-to-fine optimization, after which all residuals are baked into fixed-geometry spherical-harmonic coefficients. The teacher and auxiliary training modules are discarded at inference. Evaluation across Waymo street assets, Tanks and Temples scenes, and multiple target conditions shows a favorable overall balance of target alignment, content preservation, artifact suppression, and cross-view consistency over editing-based baselines. Ablations confirm the effectiveness of the main components. Code will be released at this https URL.

2606.26003 2026-06-25 cs.CL 新提交

Dziri Voicebot: An End-to-End Low-Resource Speech-to-Speech Conversational System for Algerian Dialect

Dziri Voicebot: 面向阿尔及利亚方言的端到端低资源语音到语音对话系统

Dihia Lanasri, Fairouz Taki, Asma Kemmoum

发表机构 * ATM Mobilis, Saad Dahlab Blida 1 University(ATM Mobilis,萨阿德·达赫拉布·布利达第一大学)

AI总结 针对阿尔及利亚方言的低资源挑战,提出模块化流水线集成ASR、NLU、RAG和TTS的端到端语音对话系统,在电信领域数据集上微调模型,实现低词错误率、高意图分类和实体识别性能。

详情
AI中文摘要

自动语音和语言技术仍然严重偏向高资源语言,限制了它们在方言和低资源环境(如阿尔及利亚方言)中的适用性。这种语言还面临额外挑战,包括缺乏标准化正字法、频繁与法语进行语码转换以及标注语音资源的稀缺。本文解决了为阿尔及利亚方言构建完整语音到语音对话系统的问题。我们提出了一个模块化流水线,在统一架构中集成自动语音识别、自然语言理解、检索增强生成和文本到语音合成。本工作是我们之前关于阿尔及利亚方言对话系统Bechiri和Lanasri [2026]的延续,将其从基于文本的对话建模扩展到完整的基于语音的交互。我们为电信领域构建了专门的ASR、NLU和TTS数据集,并为每个组件微调预训练模型。ASR系统基于Whisper的适配构建,而NLU模块结合了基于Transformer的嵌入与面向任务的对话框架。神经TTS系统在新收集的方言语料库上训练,以实现语音响应生成。实验结果显示所有组件均表现强劲,包括ASR的低词错误率、NLU的高意图分类和实体识别分数以及稳定的语音合成质量。所提出的系统为阿尔及利亚方言的端到端对话建模提供了可复现的基线。

英文摘要

Automatic speech and language technologies are still heavily biased toward high-resource languages, limiting their applicability to dialectal and low-resource settings such as Algerian Dialect. This language presents additional challenges, including a lack of standardized orthography, frequent code-switching with French, and a scarcity of annotated speech resources. This paper addresses the problem of building a complete speech-to-speech conversational system for Algerian Dialect. We propose a modular pipeline integrating automatic speech recognition, natural language understanding, retrieval-augmented generation, and text-to-speech synthesis within a unified architecture. This work continues our previous work on Algerian dialectal conversational systems (Bechiri and Lanasri [2026]), extending it from text-based dialogue modeling to full speech-based interaction. We construct dedicated datasets for ASR, NLU, and TTS in the telecom domain and fine-tune pretrained models for each component. The ASR system is built on Whisper-based adaptation, while the NLU module combines transformer-based embeddings with a task-oriented dialogue framework. A neural TTS system is trained on a newly collected dialectal corpus to enable spoken response generation. Experimental results show strong performance across all components, including low word error rate for ASR, high intent classification and entity recognition scores for NLU, and stable speech synthesis quality. The proposed system provides a reproducible baseline for end-to-end conversational modeling in Algerian Dialect.

2606.25996 2026-06-25 cs.AI cs.CL cs.LG 新提交

Autodata: An agentic data scientist to create high quality synthetic data

Autodata: 一种创建高质量合成数据的智能数据科学家方法

Ilia Kulikov, Chenxi Whitehouse, Tianhao Wu, Yixin Nie, Swarnadeep Saha, Eryk Helenowski, Weizhe Yuan, Olga Golovneva, Jack Lanchantin, Yoram Bachrach, Jakob Foerster, Xian Li, Han Fang, Sainbayar Sukhbaatar, Jason Weston

发表机构 * Meta

AI总结 提出Autodata方法,让AI代理充当数据科学家,通过元优化学习创建更优的训练和评估数据,在计算机科学、法律推理和数学推理任务上优于传统方法。

详情
AI中文摘要

我们介绍了Autodata,一种通用方法,使AI代理能够充当数据科学家,构建高质量的训练和评估数据。我们展示了如何训练(元优化)这样一个数据科学家代理,使其学会创建更强大的数据。我们描述了整体公式和一个具体的实际实现,即Agentic Self-Instruct。我们在计算机科学研究任务、法律推理任务和数学对象推理任务上进行了实验,与经典的合成数据集创建方法相比,获得了改进的结果。此外,对数据科学家代理本身进行元优化带来了更大的性能提升。代理数据创建提供了一种将增加的推理计算转化为更高质量模型训练的方法。总的来说,我们相信这一方向有潜力改变我们构建AI数据的方式。

英文摘要

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher quality model training. Overall, we believe this direction has the potential to change the way we build AI data.

2606.25990 2026-06-25 cs.CL cs.AI cs.SD 新提交

SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models

SpeechEQ: 社交感知语音对话模型的情商基准测试

Liang-Yuan Wu, Zih-Ching Chen, Tongshuang Wu, Chao-Han Huck Yang, Hua Shen

发表机构 * New York University(纽约大学) NVIDIA(英伟达) Carnegie Mellon University(卡内基梅隆大学) NYU Shanghai(上海纽约大学)

AI总结 提出SpeechEQ框架,基于EQ-i 2.0理论构建2265个对话数据集,通过口语情商(SEQ)评分评估语音语言模型在主动多轮对话中的副语言社交线索推理能力,揭示端到端模型存在模态捷径、安全陷阱和上下文遗忘等瓶颈。

详情
AI中文摘要

随着多模态对话系统越来越多地参与口语交互,它们处理副语言社交线索的能力已成为自然人机通信的关键瓶颈。然而,现有的机器情商评估仅通过孤立的文本或被动声学感知来评估推理,忽略了主动多轮对话所需的复杂跨模态推理。我们引入了\textsc{SpeechEQ},一个旨在评估语音语言模型(SLMs)社会语言学推理能力的综合框架。该框架包含一个经过验证的数据集,包含2265个对话,涵盖基于EQ-i 2.0理论的15个情商(EQ)子量表,以及一个多轮评估协议,通过我们提出的受人类情商评估启发的口语情商(SEQ)评分来衡量。实验表明,现有的语音情感识别和端到端语音语言模型在通过语音理解和应用副语言线索方面存在局限性。虽然端到端架构优于级联系统,但\textsc{SpeechEQ}揭示了当前多模态模型仍然受到依赖文本的“模态捷径”、对齐引起的“安全陷阱”和“上下文遗忘”的瓶颈,凸显了实现真正情感感知AI的障碍。我们的基准测试可在此https URL访问,演示页面在此https URL。

英文摘要

As multimodal conversational systems increasingly engage in spoken interaction, their ability to navigate paralinguistic social cues has become a critical bottleneck for natural human-AI communication. However, existing evaluations of machine emotional intelligence assess reasoning exclusively through isolated text or passive acoustic perception, overlooking the complex cross-modal reasoning required for active, multi-turn dialogue. We introduce \textsc{SpeechEQ}, a comprehensive framework designed to evaluate the sociolinguistic reasoning of Speech-Language Models (SLMs). The framework includes a validated dataset of 2,265 dialogues across 15 Emotional Quotient (EQ) subscales grounded in EQ-i 2.0 theory, along with a multi-turn evaluation protocol measured by our proposed Spoken EQ (SEQ) score inspired by human EQ assessments. Experiments show limitations in how both existing Speech Emotion Recognition and end-to-end Speech-Language Models understand and apply paralinguistic cues through speech. While end-to-end architectures outperform cascaded systems, \textsc{SpeechEQ} reveals that current multimodal models remain bottlenecked by a text-reliant ``modality shortcut,'' an alignment-induced ``safety trap,'' and ``contextual amnesia,'' highlighting the barriers to truly emotionally aware AI. Our benchmark can be accessed at this https URL and the demo page at this https URL.

2606.25989 2026-06-25 cs.CV cs.LG 新提交

Taxonomy-aware deep learning for hierarchical marine species classification in underwater imagery

面向水下图像中分层海洋物种分类的 taxonomy 感知深度学习

Dan Zimmerman, Dimitris A. Pados, George Sklivanitis

发表机构 * Center for Connected Autonomy and AI(连接自主性与人工智能中心)

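AI Summary Proposes a taxonomy-aware deep learning framework that combines a taxonomy-weighted loss, minimum-risk Bayesian inference, and multi-scale feature encoding to address domain shift and fine-grained visual similarity in marine species classification from underwater imagery, reaching a mean taxonomic distance of 1.581 on the FathomNet 2025 dataset.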

Comments 10 pages, 3 figures, 4 tables. Presented at SPIE Defense + Security 2026 (Machine Learning from Challenging Data conference), National Harbor, MD, April 2026

详情
AI中文摘要

从水下图像中自动分类海洋物种对于可扩展的海洋生物多样性监测和保护政策至关重要。现有方法面临跨采集平台的严重领域偏移、近缘物种间的细粒度视觉相似性以及不均匀的注释粒度(许多标本只能识别到属或更粗的分类等级)等挑战。我们提出了一种 taxonomy 感知深度学习框架,将训练损失和推理规则与生物分类的层次结构对齐,结合了 taxonomy 加权损失、最小风险贝叶斯推理、多尺度特征编码和独立的每等级分类头。在 FathomNet 2025 数据集(涵盖七个分类等级的 79 个海洋类别)上评估,该系统实现了 1.581 的平均分类距离,与第一名解决方案(1.535)相差在 3% 以内,最大的改进来自度量对齐推理和简单解耦组件,这些组件在分布偏移下比学习到的依赖关系具有更好的泛化能力。

英文摘要

Automated classification of marine species from underwater imagery is essential for scalable ocean biodiversity monitoring and conservation policy. Existing approaches struggle with severe domain shift across collection platforms, fine-grained visual similarity between closely related species, and uneven annotation granularity, where many specimens can only be identified to genus or a coarser taxonomic rank. We present a taxonomy-aware deep learning framework that aligns both the training loss and the inference rule with the hierarchical structure of biological classification, combining a taxonomy-weighted loss, minimum-risk Bayesian inference, multi-scale feature encoding, and independent per-rank classification heads. Evaluated on the FathomNet 2025 dataset (79 marine classes across seven taxonomic ranks), the system achieves a mean taxonomic distance of 1.581, within 3% of the 1st-place solution (1.535), with the largest gains from metric-aligned inference and simple, decoupled components that generalize better than learned dependencies under distribution shift.
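
As a rough illustration of what "minimum-risk Bayesian inference" aligned with a taxonomic distance metric can look like, the sketch below picks the taxonomy node with the lowest expected tree distance under a class posterior, instead of the most probable class. Everything here is an assumption for illustration and is not taken from the paper: the toy taxonomy, the node names, the posterior values, the use of edge count as the distance, and the choice to allow internal nodes as predictions.

```python
def ancestors(node, parent):
    """Return the path from node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(u, v, parent):
    """Number of edges on the path between u and v in the taxonomy tree."""
    steps_from_u = {n: i for i, n in enumerate(ancestors(u, parent))}
    for j, n in enumerate(ancestors(v, parent)):
        if n in steps_from_u:  # first common ancestor
            return steps_from_u[n] + j
    raise ValueError("nodes are not in the same tree")


def min_risk_predict(posterior, candidates, parent):
    """Pick the candidate node minimizing expected tree distance."""
    def risk(node):
        return sum(p * tree_distance(node, c, parent)
                   for c, p in posterior.items())
    best = min(candidates, key=risk)
    return best, risk(best)


# Hypothetical toy taxonomy (child -> parent); names are placeholders.
parent = {
    "sp_a": "genus1", "sp_b": "genus1", "genus1": "famA", "famA": "root",
    "sp_c": "genus2", "genus2": "famB", "famB": "root",
}
# Hypothetical posterior: the model cannot separate two sibling species.
posterior = {"sp_a": 0.4, "sp_b": 0.4, "sp_c": 0.2}
candidates = ["sp_a", "sp_b", "sp_c", "genus1", "genus2",
              "famA", "famB", "root"]

prediction, expected_distance = min_risk_predict(posterior, candidates, parent)
```

In this toy case the most probable class is a species, but the minimum-risk prediction backs off to the shared genus, which is the behavior one would want when many specimens can only be identified to genus or a coarser rank.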

2606.25986 2026-06-25 cs.LG q-fin.ST q-fin.TR 新提交

The Inference-Compute Frontier and a Latency-Efficient Architecture for Limit Order Book Prediction

推理-计算前沿与限价订单簿预测的延迟高效架构

C. Evans Hedges

发表机构 * Independent Researcher(独立研究员)

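AI Summary Studies the inference-compute frontier in limit order book prediction, finds that it follows a power law, and uses the weaker fit in latency space to motivate the latency-efficient FastBiNLOB architecture, which exceeds published SOTA targets at lower latency.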

详情
AI中文摘要

我们研究了在限价订单簿预测中是否会出现类似缩放定律的推理-计算前沿。使用FI-2010数据集和从小型决策树到神经LOB架构的一系列模型,我们发现预测损失与结构前向工作的实际经验前沿很好地符合幂律分布。特别是,当MLPLOB作为一个架构族被排除时,对低计算和中计算非MLPLOB前沿的幂律拟合可以外推多个数量级,并在被排除的高计算MLPLOB目标前沿上达到$R^2=0.941$。在延迟空间中的类似实验给出了明显较弱的结果,表明延迟不仅仅是带噪声的计算。我们利用这一差距提出了FastBiNLOB,一个由硬件友好的时间和特征混合操作构建的密集轴可分离LOB混合器。在五次种子实验中,FastBiNLOB以显著低于现有已发表SOTA架构的延迟,超过了已发布的$y_{10}$和$y_{100}$宏F1目标。

英文摘要

We study whether a scaling-law-style inference-compute frontier appears in limit order book prediction. Using FI-2010 and a suite of models ranging from small decision trees to neural LOB architectures, we find that the realized empirical frontier of predictive loss versus structural forward work is well summarized by a power law. In particular, with MLPLOB held out as an architecture family, a power-law fit to the low- and mid-compute non-MLPLOB frontier extrapolates across multiple orders of magnitude and attains $R^2=0.941$ on the excluded high-compute MLPLOB target frontier. A similar exercise in latency space gives substantially weaker results, showing that latency is not merely noisy compute. We use this gap to motivate FastBiNLOB, a dense axis-separable LOB mixer built from hardware-friendly temporal and feature mixing operations. In a five-seed experiment, FastBiNLOB exceeds the published $y_{10}$ and $y_{100}$ macro-F1 targets at notably lower latency than existing published SOTA architectures.
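
The held-out extrapolation described above can be sketched as a least-squares power-law fit on low-compute points followed by an $R^2$ check on excluded high-compute points. This is a minimal sketch under stated assumptions: the data is synthetic, the functional form `loss = a * compute**b` has no irreducible-loss term, and $R^2$ is computed in log space. The abstract does not specify any of these choices, and the function names are hypothetical.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss ~ a * compute**b in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                  # slope = power-law exponent
    a = math.exp(my - b * mx)      # intercept = prefactor
    return a, b


def r_squared_log(compute, loss, a, b):
    """R^2 of the fitted law on given (possibly held-out) points, in log space."""
    ys = [math.log(l) for l in loss]
    preds = [math.log(a) + b * math.log(c) for c in compute]
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic frontier points (not from the paper): loss = 2 * compute**-0.1.
low_compute = [1e2, 1e3, 1e4, 1e5]
low_loss = [2.0 * c ** -0.1 for c in low_compute]
a, b = fit_power_law(low_compute, low_loss)

# Held-out points several orders of magnitude above the fitted range.
high_compute = [1e7, 1e8, 1e9]
high_loss = [2.0 * c ** -0.1 for c in high_compute]
r2 = r_squared_log(high_compute, high_loss, a, b)
```

Because the synthetic points lie exactly on a power law, the held-out fit here is essentially perfect; on real frontier data the held-out $R^2$ is what indicates whether the extrapolation is meaningful.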