arXivDaily: daily arXiv digest, updated Monday through Friday
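2606.14703 2026-06-15 cs.CV cs.CL cs.LG New submission

Gaze Heads: How VLMs Look at What They Describe

Rohit Gandikota, David Bau

Affiliations: Northeastern University

AI summary: The language backbone of vision-language models contains a set of "gaze heads" whose attention tracks the image region currently being described; intervening on these heads steers what the model describes, with 83.1% accuracy.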

English abstract

How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at this https URL
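
The abstract says gaze heads are found "with a simple correlation score from a few forward passes" but does not give the formula. The sketch below is one plausible reading, not the authors' method: it assumes we already have, per head and per generated token, the attention mass placed on each comic panel, and it scores a head by how well that mass correlates with the panel being described. The names `gaze_scores` and `top_k_heads` and the data layout are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def gaze_scores(attn: List[List[List[float]]], described: List[int]) -> List[float]:
    """Score each head by how well its attention tracks the described panel.

    attn[h][t][p]: attention mass head h puts on panel p at generation step t
    (assumed layout). described[t]: index of the panel described at step t.
    """
    num_panels = len(attn[0][0])
    scores = []
    for head in attn:
        per_panel = []
        for p in range(num_panels):
            mass = [step[p] for step in head]
            indicator = [1.0 if d == p else 0.0 for d in described]
            per_panel.append(pearson(mass, indicator))
        # Average the per-panel correlations into one score per head.
        scores.append(sum(per_panel) / num_panels)
    return scores


def top_k_heads(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring heads."""
    return sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:k]
```

Under this reading, a head with uniform attention scores zero, while a head whose attention follows the described panel scores close to one; the intervention itself (masking attention toward a chosen region) would then be applied only to the selected heads.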

2606.14702 2026-06-15 cs.CV New submission

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

Affiliations: Nanjing University; CASIA (Institute of Automation, Chinese Academy of Sciences)

AI summary: Proposes the OmniVideo-100K dataset, built with entity-anchored video scripting and clue-guided QA generation to address cross-segment entity inconsistency and weak long-range reasoning in audio-visual QA; fine-tuned models improve substantially on several benchmarks.

Comments: Project page: https://github.com/MiG-NJU/OmniVideo-100K

English abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and generalizes well to established benchmarks such as Daily-Omni and JointAVBench (improvements of up to 12.64%).
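
To make the "structured script" idea concrete, here is a minimal data-structure sketch. It is an illustration only: the class and field names (`VideoScript`, `Segment`, `entity_ids`) are assumptions, not the dataset's actual schema. It shows how a global entity list can be used to check referential consistency across segments and to surface entities that recur, which is where cross-segment clues would come from.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    start_s: float
    end_s: float
    visual: str            # visual description of the clip
    audio: str             # audio description of the clip
    entity_ids: List[str]  # entities referenced in this clip


@dataclass
class VideoScript:
    summary: str
    entities: Dict[str, str]  # entity id -> canonical description (global prior)
    segments: List[Segment] = field(default_factory=list)

    def unknown_references(self) -> List[str]:
        """Entity ids used in segments but missing from the global entity list."""
        seen = {eid for seg in self.segments for eid in seg.entity_ids}
        return sorted(seen - set(self.entities))

    def cross_segment_entities(self) -> List[str]:
        """Entities that appear in more than one segment."""
        counts: Dict[str, int] = {}
        for seg in self.segments:
            for eid in set(seg.entity_ids):
                counts[eid] = counts.get(eid, 0) + 1
        return sorted(eid for eid, c in counts.items() if c > 1)
```

A QA generator following the clue-guided idea could restrict itself to entities returned by `cross_segment_entities`, so that each question needs evidence from at least two segments.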

2606.14701 2026-06-15 cs.CV New submission

RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

Timing Yang, Predrag Neskovic, Jansen Seheult, Wenchao Han, Anand Bhattad, Alan Yuille, Feng Wang

Affiliations: Johns Hopkins University; Office of Naval Research, Arlington, VA; Department of Laboratory Medicine and Pathology, Mayo Clinic, MN, USA

AI summary: Proposes RATS, which decomposes the classification token into learnable register tokens that route patch information through an L→N→N→L bottleneck; without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region resembling an object part, with an average gain of +12 mIoU across five segmentation benchmarks.

English abstract

When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the classification token into N learnable register tokens that route patch information through an L->N->N->L bottleneck via a three-step compress-communicate-broadcast attention. The N registers are partitioned across the H attention heads, so that registers assigned to different heads do not interact with each other. Without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region whose emerging structure resembles object parts. RATS surpasses all baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). Its register dictionary further exhibits part-level consistency and semantic proximity across related categories. Our results suggest that RATS may provide a useful architectural prior for structured and interpretable visual representation learning.
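
The three-step compress-communicate-broadcast attention can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: it omits learned projections, normalization, and residual connections, and it models the head partition by letting registers communicate only within equally sized groups. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention with no learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def register_attention(patches: Matrix, registers: Matrix, num_heads: int) -> Matrix:
    """Route L patches through N registers: L -> N -> N -> L."""
    if len(registers) % num_heads != 0:
        raise ValueError("registers must split evenly across heads")
    per_head = len(registers) // num_heads
    # Compress: each register reads from all L patches.
    compressed = attend(registers, patches, patches)
    # Communicate: registers exchange information only within their own group.
    communicated: Matrix = []
    for h in range(num_heads):
        group = compressed[h * per_head:(h + 1) * per_head]
        communicated.extend(attend(group, group, group))
    # Broadcast: each patch reads back from the N registers.
    return attend(patches, communicated, communicated)
```

In this sketch no patch attends directly to another patch, so all patch-to-patch information has to pass through the N registers, which is the bottleneck the abstract describes.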

2606.14700 2026-06-15 cs.CV New submission

RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

Xichen Pan, Aashu Singh, Satya Narayan Shukla, Xiangjun Fan, Shlok Kumar Mishra, Saining Xie

Affiliations: Meta AI; New York University

AI summary: Proposes RepFusion, which uses a multimodal LLM as a noisy representation encoder to provide the conditioning signal for a diffusion transformer; at similar inference budgets it outperforms baselines with newly initialized denoisers.

Comments: Project page: https://xichenpan.com/repfusion

English abstract

Large language models (LLMs) are widely used in text-to-image (T2I) systems, but they are typically limited to text encoding, while denoising is handled by newly trained generative backbones. The emergence of representation autoencoders (RAEs) shifts the generation target toward semantically structured visual representations, creating a latent space that is more compatible with pretrained LLM priors. Inspired by multimodal LLMs (MLLMs), where an MLP projector is sufficient to align clean visual representations with a pretrained LLM, we repurpose the MLLM itself as a noisy representation encoder, extending this mechanism from clean to noisy inputs. We present RepFusion, which uses the resulting MLLM outputs as the conditioning signal for a diffusion transformer. In controlled comparisons at similar inference budgets, RepFusion outperforms baselines that devote comparable capacity to newly initialized denoisers. These results demonstrate that MLLMs provide strong priors for denoising visual representations and that, by conditioning on evolving noisy representations, test-time compute can be productively spent on repeated MLLM conditioning in modern T2I systems.

2606.14699 2026-06-15 cs.CV cs.GR cs.RO New submission

Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

Ruining Li, Yuxin Yao, Matt Zhou, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi

Affiliations: University of Oxford; University of Cambridge; Nanyang Technological University

AI summary: Proposes Instruct-Particulate, a model guided by a kinematic specification (part descriptions, connectivity, joint types, and optional point prompts) that predicts part segmentation and joint motion parameters for a 3D mesh; trained on a heterogeneous dataset of more than 150,000 objects, it generalizes across categories and to AI-generated meshes.

Comments: Project page: https://instruct-particulate.github.io/

English abstract

Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with other 3D models (monolithic or already decomposed into parts) that are partially annotated with kinematic labels by vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.
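
The kinematic specification described above bundles part descriptions, connectivity, joint types, and optional point prompts. The sketch below shows one way such a specification might be represented and sanity-checked before it is passed to a model. All names (`KinematicSpec`, `Joint`, `JOINT_TYPES`) and the joint-type vocabulary are assumptions for illustration; the abstract does not define the actual format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed joint vocabulary; the real model may use a different set.
JOINT_TYPES = {"fixed", "revolute", "prismatic"}


@dataclass
class Joint:
    parent: str      # name of the parent part
    child: str       # name of the child part
    joint_type: str  # one of JOINT_TYPES


@dataclass
class KinematicSpec:
    parts: Dict[str, str]  # part name -> text description
    joints: List[Joint] = field(default_factory=list)
    # Optional 3D points that indicate where a part lies on the mesh.
    point_prompts: Dict[str, Tuple[float, float, float]] = field(default_factory=dict)

    def problems(self) -> List[str]:
        """Return human-readable inconsistencies in the specification."""
        issues = []
        for joint in self.joints:
            for name in (joint.parent, joint.child):
                if name not in self.parts:
                    issues.append(f"unknown part: {name}")
            if joint.joint_type not in JOINT_TYPES:
                issues.append(f"unknown joint type: {joint.joint_type}")
        for name in self.point_prompts:
            if name not in self.parts:
                issues.append(f"point prompt for unknown part: {name}")
        return issues
```

A specification produced automatically by a vision-language model could be validated with `problems()` and regenerated if the list is non-empty.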

2606.14697 2026-06-15 cs.CV cs.AI cs.CL New submission

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

Sicheng Yang, Hangjie Yuan, Wenjun Zhang, Jinwang Wang, Yichen Qian, Weihua Chen, Fan Wang, Lei Zhu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); DAMO Academy, Alibaba Group; Hupan Lab; Zhejiang University

AI summary: Proposes the ClinHallu benchmark of 7,031 instances, each with a structured reasoning trace (visual recognition, knowledge recall, reasoning integration); stage-replacement interventions and trace-supervised fine-tuning enable fine-grained hallucination diagnosis and mitigation.

Comments: Code and datasets: https://github.com/alibaba-damo-academy/ClinHallu

English abstract

Building trustworthy medical multimodal large language models (MLLMs) is critical for reliable clinical decision support. Existing medical hallucination benchmarks mainly focus on data collection, but often ignore where hallucinations originate within the reasoning process. We find that hallucination sources vary across samples: errors may arise from visual misrecognition, incorrect medical knowledge recall, or flawed reasoning integration. To enable source-level hallucination diagnosis, we introduce ClinHallu, a benchmark for stage-wise hallucination diagnosis in medical MLLM reasoning. ClinHallu contains 7,031 validated instances, where each instance is augmented with a structured reasoning trace decomposed into Visual Recognition, Knowledge Recall, and Reasoning Integration. We also use stage-replacement interventions to measure how correcting specific stages affects the final answer. Beyond evaluation, we show that trace-supervised fine-tuning reduces stage-wise hallucinations. ClinHallu provides a fine-grained hallucination testbed for diagnosing and mitigating reasoning failures in medical MLLMs. The benchmark is publicly available at this https URL.

2606.14694 2026-06-15 cs.CL New submission

AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization

Junlong Tong, Wenqi Xu, Yingqi Fan, Anhao Zhao, Xuan Lu, Yang Tan, Xiaoyu Shen

Affiliations: Eastern Institute of Technology, Ningbo; Shanghai Jiao Tong University; The Hong Kong Polytechnic University; Southeast University; Xi'an Jiaotong-Liverpool University

AI summary: Proposes the AdaSR framework, which uses Hierarchical Relative Policy Optimization (HRPO) for adaptive reasoning over streaming input, striking a better balance among reasoning accuracy, computational efficiency, and streaming latency.

English abstract

Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives continuously and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency than a supervised fine-tuning baseline. We release our code at this https URL.

2606.14691 2026-06-15 cs.CL New submission

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

Jiayue Cao, Zhicong Lu, Xuehan Sun, Wei Jia, Hongling Zheng, Changyuan Tian, Zichuan Lin, Wenqian Lv, Nayu Liu

Affiliations: University of Chinese Academy of Sciences; Wuhan University; Tsinghua University; Tianjin University

AI summary: Analyzes the semantic inconsistency between thinking and answers in multimodal RLVR and proposes CORA, which introduces semantic consistency through a lightweight consistency reward model and uses Hybrid Reward Advantage Splitting to stabilize optimization, yielding more faithful reasoning.

Comments: Submitted to EMNLP 2026

English abstract

Reinforcement learning with verifiable rewards (RLVR) has successfully elicited the reasoning capabilities of large language models, motivating its extension to multimodal scenarios. Existing methods primarily focus on improving the visual coverage of reasoning traces and mitigating visual hallucinations, but underestimate the semantic inconsistency between the reasoning process and the final answer. In this paper, we examine thinking-answer inconsistency in RLVR for large vision-language models (LVLMs), showing through analyses of rollouts collected throughout the Group Relative Policy Optimization (GRPO) training process and of post-RLVR evaluation outputs that this issue persists during training and remains present at inference. Motivated by the analysis, we propose Consistency-Oriented Reasoning Alignment (CORA), which introduces thinking-answer semantic consistency into RLVR through a lightweight plug-and-play consistency reward model, and further incorporates Hybrid Reward Advantage Splitting (HRAS) to stably coordinate task and consistency optimization. Extensive experiments across representative multimodal reasoning benchmarks and mainstream LVLMs show that CORA improves task performance while effectively mitigating thinking-answer inconsistency, leading to more faithful reasoning traces.

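2606.14690 2026-06-15 cs.LG cs.IT New submission

A Complexity Measure for Active Learning in Multi-group Mean Estimation

Abdellah Aznag, Rachel Cummings, Adam N. Elmachtoub

Affiliations: Department of Industrial Engineering and Operations Research & Data Science Institute, Columbia University

AI summary: For the max-risk objective in multi-group mean estimation, develops a local minimax framework and proves a general lower bound; introduces the Variance Local Curvature (VLC) as a complexity measure, relates it to a variance-Fisher information for smooth classes, and identifies a systematic gap in highly heterogeneous instances.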

English abstract

We study a max-risk objective for active learning in multi-group mean estimation, framed as a $d$-armed bandit: a learner adaptively allocates a budget of $T$ samples across $d$ groups to minimize the worst-case uncertainty index $\max_{k\in[d]}\sigma_k^2/n_k$, where $\sigma_k$ is the standard deviation of the distribution of arm $k$, and $n_k$ is the number of times arm $k$ is sampled. We develop a local minimax framework and prove the first general lower bound for this objective, valid for any finite-variance hypothesis class. The bound separates difficulty into three orthogonal factors: a budget term, a heteroscedasticity index measuring how unevenly the uncertainty is spread across arms, and a model-dependent complexity measure, the Variance Local Curvature ($\mathrm{VLC}$), which captures how much information a local change of variance creates inside the hypothesis class. For smooth classes, the $\mathrm{VLC}$ is a reparametrization of a variance-Fisher information, with closed-form values for common families. Benchmarking against the strongest available upper bound shows near-optimality up to logarithmic factors in broad regimes, and pinpoints a systematic gap in highly heterogeneous instances. Our proof introduces two key ingredients: a loss-induced $\ell_1$ geometry on the decision space, and a representation-based instance generator that reduces hard-instance construction to an explicit random matrix calculation.
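
As a point of reference for the objective above, the following short derivation gives the allocation an oracle would choose if the variances were known. It is a standard calculation added for orientation, not a result quoted from the paper, and it assumes every $\sigma_k > 0$ and ignores the integrality of the $n_k$.

```latex
% Oracle problem with known variances:
\min_{n_1,\dots,n_d > 0} \; \max_{k \in [d]} \frac{\sigma_k^2}{n_k}
\quad \text{subject to} \quad \sum_{k=1}^{d} n_k = T .

% At an optimum all ratios are equal: if some arm had a strictly smaller
% ratio, moving samples from it to the maximizing arms would lower the maximum.
\frac{\sigma_k^2}{n_k^\star} = c \ \ \text{for all } k
\;\;\Longrightarrow\;\;
n_k^\star = \frac{\sigma_k^2}{c},
\qquad
\sum_{k=1}^{d} n_k^\star = T
\;\;\Longrightarrow\;\;
c = \frac{1}{T} \sum_{j=1}^{d} \sigma_j^2 .

% Hence the oracle samples each arm in proportion to its variance:
n_k^\star = T \, \frac{\sigma_k^2}{\sum_{j=1}^{d} \sigma_j^2},
\qquad
\max_{k \in [d]} \frac{\sigma_k^2}{n_k^\star} = \frac{1}{T} \sum_{j=1}^{d} \sigma_j^2 .
```

The learner does not know the $\sigma_k$, so the difficulty studied in the abstract lies in how closely an adaptive allocation can approach this oracle value.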

2606.14688 2026-06-15 cs.LG cs.AI cs.CL cs.DS New submission

Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit

Xiaoyu Li, Andi Han, Dai Shi, Zheng Gao, Jiaojiao Jiang, Junbin Gao

Affiliations: University of New South Wales; University of Sydney; University of Cambridge

AI summary: Using the model of language generation in the limit, proves that in formal mathematics generation a verifier cannot substitute for taste: covering unrecorded valuable mathematics necessarily requires an infinite but asymptotically negligible stream of trivial statements.

English abstract

AI systems coupled to proof assistants now generate formal mathematics at scale, and the gap between what a checker can verify and what a mathematician would value has become the binding constraint. We model the generation of valuable mathematics as nested language generation in the limit: a verifiable formal language $F$, accessed through a membership oracle (the proof checker), contains an unknown valuable language $H \in \mathcal{H}$ revealed only through an adversarial enumeration of a core $C \subseteq H$ of exact density $\alpha$ (the literature). Every output is valuable ($\in H$), trivial ($\in F \setminus H$), or a hallucination ($\notin F$). We settle four questions. First, the verifier is not taste: the collections admitting generation with breadth are exactly those of the oracle-free model, characterized fiber-wise by Angluin's condition. Second, the verifier does buy sound coverage, covering all unseen valuable statements while asserting only valid ones: possible with it, impossible without it; it relocates unavoidable errors from false to trivial. Third, and centrally, a sharp dichotomy on the tight family: generators emitting finitely many trivia achieve optimal coverage $\alpha/2$, while any infinite trivia allowance, even at vanishing rate, jumps the optimum to $1-\alpha/2$ (both tight, for cores presented as the candidate intersection), and one generator attains both ends. The transition is in trivia count, not rate; the gap $1-\alpha$ is the unrecorded mass. Fourth, both regimes instantiate in a compression model of mathematics. A perfect verifier cannot substitute for taste: the unbounded stream of correct-but-worthless statements is not an engineering accident but a provable necessity, since covering unrecorded valuable mathematics requires an infinite, but asymptotically negligible, stream of certified trivia.

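2606.14679 2026-06-15 cs.LG eess.SY math.OC stat.ML New submission

Optimal Hidden-Target Learning for Online Inventory Optimization on General Convex Sets

Anthony Pineci, Yunzong Xu

Affiliations: UIUC (University of Illinois Urbana-Champaign)

AI summary: For online inventory optimization on general convex capacity sets, proposes hidden-target projection, improving regret from inverse to inverse-square-root dependence on the common-demand probability with a matching lower bound, and gives the first polylogarithmic regret guarantee for strongly convex losses and the first dynamic regret guarantee.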

English abstract

Online inventory optimization (OIO) is online convex optimization with physical memory: inventory carryover makes the feasible action set depend on the past. A natural principle, used in stochastic inventory learning and recently in OIO under a single linear capacity constraint, is to maintain a hidden target chosen by an online learner and implement its projection onto the currently feasible order-up-to set. We prove that this simple principle is optimal for OIO on arbitrary bounded convex capacity sets. With online gradient descent as the base learner, the method improves the best known regret guarantee for OIO on general convex sets from inverse to inverse-square-root dependence on the common-demand probability, and we prove a matching lower bound. The same principle gives the first polylogarithmic regret guarantee for strongly convex losses and the first dynamic regret guarantee adapting to Euclidean path variation on general convex capacity sets. The analysis introduces a norm alignment principle: the right state variable is the distance from the hidden target to the feasible set, measured in the same norm as the projection. Under norm alignment, this distance evolves pathwise as a scalar queue, with target movement as arrival and common demand as service. This reduction to one-dimensional queue control resolves the state dependence and extends the guarantees to general convex capacity sets, beyond the reach of prior productwise approaches. Experiments on synthetic and real-world inventory data corroborate the theory.
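
The hidden-target principle (an online learner maintains a target, and the implemented decision is the target's projection onto the currently feasible order-up-to set) can be sketched in one dimension. This toy version makes several assumptions that are not in the abstract: a single product, a capacity set `[0, capacity]`, lost sales, a newsvendor loss with holding and backlog costs, a subgradient evaluated at the implemented level, and a constant step size. The function name `run_hidden_target` is hypothetical.

```python
from typing import List


def clip(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def run_hidden_target(
    demands: List[float],
    capacity: float,
    holding_cost: float,
    backlog_cost: float,
    step_size: float,
) -> List[float]:
    """Return the implemented order-up-to levels for a single product."""
    target = 0.0      # hidden target maintained by the online learner
    inventory = 0.0   # stock carried over from the previous period
    levels = []
    for demand in demands:
        # Carryover makes [inventory, capacity] the feasible set; project onto it.
        level = clip(target, inventory, capacity)
        levels.append(level)
        # Subgradient of h*(y - d)^+ + b*(d - y)^+ at the implemented level.
        grad = holding_cost if level > demand else -backlog_cost
        # Online gradient descent step on the hidden target.
        target = clip(target - step_size * grad, 0.0, capacity)
        # Unsold stock carries over (lost-sales assumption).
        inventory = max(level - demand, 0.0)
    return levels
```

When demand is low for several periods, the target can fall below the leftover stock; the projection then holds the implemented level at the carryover until demand drains it, which is the state dependence the abstract's queue argument addresses.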

2606.14674 2026-06-15 cs.CL New submission

AgentSpec: Understanding Embodied Agent Scaffolds Through Controlled Composition

Jixuan Chen, Jianzhi Shen, Haoqiang Kang, Zhi Hong, Qingyi Jiang, Soham Bose, Yiming Zhang, Leon Leng, Amit Vyas, Lingjun Mao, Siru Ouyang, Kun Zhou, Lianhui Qin

Affiliations: University of California, San Diego; Johns Hopkins University; University of Washington; University of Illinois Urbana-Champaign

AI summary: Proposes AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components; standardized interfaces allow controlled component swapping and recombination, showing that scaffold compatibility and interaction effects dominate performance.

English abstract

LLM agents are increasingly built not as single model calls, but as scaffolded systems that combine reasoning, memory, reflection, action execution, and learning. While such scaffolds often improve performance, they are typically embedded in tightly coupled pipelines, making it difficult to isolate component contributions, compare alternative designs, or understand how module interactions shape agent behavior. We introduce AgentSpec, a modular specification framework that represents embodied agents as typed compositions of reusable policy components with standardized interfaces. AgentSpec standardizes the interfaces among perception, memory, reasoning, reflection, action, and optional learning, enabling components to be swapped and recombined under controlled conditions. We instantiate this framework across DeliveryBench, ALFRED, MiniGrid, and RoboTHOR, and analyze reasoning, memory, reflection, and reinforcement-learning modules across model backbones. Our results show that agent performance is governed by scaffold compatibility and interaction effects rather than isolated module strength. In particular, structured multi-granularity memory improves long-horizon state tracking, reasoning and memory interact non-uniformly across environments, reflection trades off correction and cost, and RL-trained policies compose best when optimized with deployment-time scaffold structure. AgentSpec provides a controlled foundation for studying, comparing, and designing composable LLM agents. Our code, baselines, and interactive playground are publicly available at this https URL.
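
The idea of typed, swappable components can be illustrated with a very small sketch. It does not reproduce the AgentSpec interfaces, which the abstract does not spell out; the `Scaffold` class, its four fields, and the two memory components below are hypothetical, and reflection and learning are left out for brevity.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

Observation = str
Action = str


@dataclass(frozen=True)
class Scaffold:
    """A typed composition of swappable components (illustrative names)."""
    perceive: Callable[[Observation], str]
    remember: Callable[[List[str], str], List[str]]
    reason: Callable[[List[str]], str]
    act: Callable[[str], Action]

    def step(self, memory: List[str], observation: Observation) -> Tuple[List[str], Action]:
        percept = self.perceive(observation)
        memory = self.remember(memory, percept)
        plan = self.reason(memory)
        return memory, self.act(plan)


def full_memory(memory: List[str], percept: str) -> List[str]:
    """Keep every percept seen so far."""
    return memory + [percept]


def last_k_memory(k: int) -> Callable[[List[str], str], List[str]]:
    """Keep only the k most recent percepts."""
    def remember(memory: List[str], percept: str) -> List[str]:
        return (memory + [percept])[-k:]
    return remember


baseline = Scaffold(
    perceive=str.strip,
    remember=full_memory,
    reason=lambda memory: f"seen {len(memory)}",
    act=str.upper,
)
# Controlled swap: change only the memory component, keep everything else fixed.
variant = replace(baseline, remember=last_k_memory(1))
```

Because `variant` differs from `baseline` in exactly one field, any change in behavior between the two can be attributed to the memory component, which is the kind of controlled comparison the abstract describes.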

2606.14673 2026-06-15 cs.LG New submission

Compressed Computation is (probably) not Computation in Superposition

Jai Bhagat, Sara Molas-Medina, Giorgi Giglemiani, Stefan Heimersheim

Affiliations: Metamorphic Independent (independent researcher); UK AI Security Institute; Apollo Research

AI summary: An analysis of the Compressed Computation (CC) model finds that its performance gain comes from a mixing matrix in the labels rather than from genuine computation in superposition; an SNMF baseline reproduces its qualitative loss profile.

Comments: Presented at the Mechanistic Interpretability Workshop at NeurIPS 2025

English abstract

We study whether the Compressed Computation (CC) toy model (Braun et al., 2025) is an instance of computation in superposition. The CC model appears to compute 100 ReLU functions with just 50 neurons, achieving a better loss than expected from only representing 50 ReLU functions. We show that the model mixes inputs via its noisy residual stream, corresponding to an unintended mixing matrix in the labels. Splitting the training objective into the ReLU term and the mixing term, we find that performance gains scale with the magnitude of the mixing matrix and vanish when the matrix is removed. The learned neuron directions concentrate in the subspace associated with the top 50 eigenvalues of the mixing matrix, suggesting that the mixing term governs the solution. Finally, a semi-non-negative matrix factorization (SNMF) baseline derived solely from the mixing matrix reproduces the qualitative loss profile and improves on prior baselines, though it does not match the trained model. These results suggest CC is not a suitable toy model of computation in superposition.

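2606.14672 2026-06-15 cs.AI cs.CL New submission

Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows

Shikun Liu, Mufei Li, Dongqi Fu, Haoyu Wang, Yinglong Xia, Hong Li, Hong Yan, Pan Li

Affiliations: Georgia Institute of Technology; Meta

AI summary: Proposes Parallel-Synthesis, a framework that synthesizes directly from the KV caches of parallel worker agents and so avoids the redundant prefill of text concatenation; across nine datasets it matches or outperforms text-based synthesis on seven, stays close on the other two, and cuts time-to-first-token by 2.5x-11x.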

英文摘要

Large language models increasingly serve as execution engines for agentic systems, yet they still consume context through a sequential text interface. This creates a mismatch with modern structured agent workflows, in which independent branches explore subtasks, retrieve evidence, or generate candidate solutions before a final synthesis step. Existing systems typically merge these branches by concatenating their textual outputs, which discards the parallel structure and incurs redundant prefill computation. In this work, we introduce Parallel-Synthesis, a plug-and-play framework that enables a synthesizer to directly consume the KV caches produced by parallel worker agents. Parallel-Synthesis combines a cache mapper that calibrates independently generated branch caches with a fine-tuned synthesizer adapter that enables generation from this non-sequential cache interface. We train Parallel-Synthesis using data that exposes the synthesizer to parallel cache contexts, teaches aggregation across cached branches, and distills reasoning behavior from standard text-concatenation-based synthesis. Across nine downstream datasets spanning math, science QA, code generation, GAIA, and multi-agent database diagnosis, Parallel-Synthesis matches or outperforms text-based synthesis on seven datasets and remains close on the other two. It also reduces time-to-first-token by 2.5x-11x, suggesting that direct cache-based synthesis is a promising interface for more native and efficient synthesis over parallel agent branches.

2606.14667 2026-06-15 cs.CV 新提交

Memento: Reconstruct to Remember for Consistent Long Video Generation

Memento: 通过重建来记忆以实现一致的长视频生成

Xuan Wei, Longbin Ji, Guan Wang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Qingqi Hong

发表机构 * Xiamen University(厦门大学) ERNIE Team, Baidu Inc.(百度公司ERNIE团队)

AI总结 提出Memento框架,通过主体重建引导和双查询记忆机制,解决长视频生成中主体一致性丢失问题,实现跨镜头连贯生成。

Comments Project page: this https URL (https://ernie-research.github.io/Memento/)

详情
AI中文摘要

长视频生成需要重复出现的主体在各种镜头、视角、运动和场景转换中保持一致。现有的时间分解方法通过逐镜头生成视频来提高可扩展性。然而,它们主要专注于优化合理的下一镜头延续,而没有验证历史记忆是否保留了身份关键的主体证据。因此,随着生成的进行,重复出现的主体可能会被稀释、覆盖或遗忘。在本文中,我们提出了Memento,一个主体重建引导的框架,将主体保留视为一个显式的身份基础问题,其前提是:一个忠实保存主体的记忆库应该能够仅从记忆中重建该主体。具体来说,Memento联合训练自回归的下一镜头生成和基于记忆的主体重建,利用历史记忆和全局故事描述恢复目标外观。为了从短程线索中分离出长程主体证据,Memento引入了一种双查询记忆机制,其中一个查询检索与身份相关的记忆,另一个选择短上下文关键帧以实现连贯的延续。此外,一个主体感知的电影数据管道通过一致、无代词的主体描述提供精确的重建监督。实验表明,Memento在长期主体一致性、跨镜头连贯性和视觉质量方面达到了最先进的性能。

英文摘要

Long-form video generation requires recurring subjects to remain consistent across various shots, viewpoints, motions, and scene transitions. Existing temporal decomposition methods improve scalability by generating videos shot by shot. However, they mainly focus on optimizing plausible next-shot continuations without verifying whether the historical memory preserves identity-critical subject evidence. Consequently, as generation proceeds, recurring subjects may be diluted, overwritten, or forgotten. In this paper, we propose Memento, a subject-reconstruction-guided framework that treats subject preservation as an explicit identity grounding problem, based on the premise that a memory bank faithfully preserving a subject should support reconstructing that subject from memory alone. Specifically, Memento jointly trains autoregressive next-shot generation with memory-based subject reconstruction, recovering target appearances using historical memory and global story captions. To disentangle long-range subject evidence from short-range cues, Memento introduces a dual-query memory mechanism, where one query retrieves identity-relevant memory and the other selects short-context keyframes for coherent continuation. Additionally, a subject-aware cinematic data pipeline provides precise reconstruction supervision via consistent, pronoun-free subject descriptions. Experiments demonstrate that Memento achieves state-of-the-art performance in long-term subject consistency, cross-shot coherence, and visual quality.

2606.14665 2026-06-15 cs.RO 新提交

EgoGuide: Egocentric Guidance for Efficient Robot-Free Demonstration Collection and Learning

EgoGuide: 以自我为中心引导的高效无机器人演示收集与学习

Yue Xu, Mingtao Nie, Tianle Li, Hong Li, Yibo Luo, Siyuan Huang, Yong-Lu Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Beijing Institute for General Artificial Intelligence (BIGAI)(北京通用人工智能研究院)

AI总结 提出EgoGuide数据收集接口,通过同步腕部和头部/自我中心观察并在线视觉-几何质量引导,结合门控自我中心残差策略,减少所需数据量并提高数据效率。

详情
AI中文摘要

目前,从真实世界演示中进行的机器人学习受到数据扩展的限制。通用操作接口(UMI)提供了一种高效的无机器人数据收集接口,然而当前的UMI风格流程通常收集冗余的演示,并且缺乏全局场景上下文。为了提高数据效率,我们提出了EgoGuide,一种收集接口,它记录同步的腕部和头部/自我中心观察,并将其与在线视觉-几何数据质量引导相结合。我们还引入了一种门控自我中心残差策略,用于从视角变化的自我中心相机中进行鲁棒学习,允许头部/自我中心上下文纠正模糊的局部观察,同时保持稳定的腕部视角控制。真实世界实验表明,EgoGuide减少了所需的演示片段(episode)数量并提高了数据效率。残差策略进一步提高了视觉遮挡下的鲁棒性。项目页面:此 https URL

英文摘要

Robot learning from real-world demonstrations is currently constrained by data scaling. Universal Manipulation Interface (UMI) provides an efficient robot-free data collection interface, yet current UMI-style pipelines often collect redundant demonstrations and lack global scene context. To improve data efficiency, we present EgoGuide, a collection interface that records synchronized wrist and head/egocentric observations and couples them with online visual-geometric data quality guidance. We also introduce a Gated Egocentric Residual Policy for robust learning from a viewpoint-varying egocentric camera, allowing head/egocentric context to correct ambiguous local observations while preserving stable wrist-view control. Real-world experiments show that EgoGuide reduces the required number of data episodes and improves data efficiency. The residual policy further improves robustness under visual occlusion. Project Page: this https URL

2606.14658 2026-06-15 cs.CV cs.AI 新提交

Giving AI a Headache: Acoustic Adversarial Attacks to Computer Vision Applications

给AI带来头痛:针对计算机视觉应用的声学对抗攻击

Nicole Villavicencio-Garduño, Maksim Ekin Eren, Milo Prisbrey, Ben Migliori, Michael Teti

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 研究利用低频声波(<20 kHz)引起相机物理振动,导致AI视觉模型(如YOLO11)误分类、漏检或产生幻觉,并分析了影响攻击效果的因素。

Comments 9 pages, 7 figures, SPIE Defense + Security

详情
AI中文摘要

人工智能(AI)越来越多地被用于自动化各种现实世界的计算机视觉(CV)应用,如自动驾驶车辆控制、面部识别和安全摄像头。最近的研究表明,声学振动可以引起相机真实的物理运动,干扰其内部稳定机制。由于这种运动超出了稳定系统设计处理的条件,系统会在帧中引入伪影,导致基于AI的CV模型误分类、错过目标或产生幻觉对象。先前的工作使用超声波频率(>20 kHz)进行短距离攻击,由于高频的衰减,这些攻击仅限于短距离。在这项工作中,我们研究了使用可听范围内较低频率(<20 kHz)的声学攻击,并进一步扩展了我们的分析,包括各种图像和物体特征如何受到攻击的影响。具体来说,我们进行了物理实验,通过用各种频率共振商用相机,证明了我们的攻击对现成目标检测模型(YOLO11)的可行性。基于我们的结果,我们提供了关于使AI CV系统更容易受到这些攻击的几个因素的见解,这可能有助于未来缓解策略的开发。

英文摘要

Artificial Intelligence (AI) is increasingly used to automate a variety of real-world computer vision (CV) applications, such as autonomous vehicle control, facial recognition, and security cameras. Recent research has shown that acoustic vibration can induce real physical motion in cameras, interfering with their internal stabilization mechanisms. Because the motion falls outside the conditions the stabilization system was designed to handle, the system introduces artifacts into the frame, causing AI-based CV models to misclassify, miss targets, or hallucinate objects. Previous work used ultrasonic frequencies (>20 kHz) to perform short-range attacks, which limits them to short distances due to the attenuation exhibited by high frequencies. In this work, we investigate acoustic attacks using lower frequencies in the audible range (<20 kHz), and we further expand our analysis to include how various image and object features are affected by the attacks. Specifically, we performed physical experiments to demonstrate the viability of our attacks on an off-the-shelf object detection model (YOLO11) by resonating a commercially available camera with various frequencies. Based on our results, we provide insights into several factors that make an AI CV system more vulnerable to these attacks, which could help inform the development of future mitigation strategies.

2606.14657 2026-06-15 cs.CV 新提交

HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities

HPSv3++:跨扩散模型能力全谱系扩展奖励模型

Yijun Liu, Jie Huang, Zeyue Xue, Yuming Li, Ruizhe He, Haoran Li, Shijia Ge, Siming Fu

发表机构 * Tsinghua University(清华大学) JD Explore Academy(京东探索研究院) Peking University(北京大学) Zhejiang University(浙江大学)

AI总结 提出HPSv3++奖励模型框架,通过双维度偏好数据集HPDv3++和两阶段训练(正交梯度投影+无监督引导),提升对各类T2I模型及RL迭代的偏好预测能力,在多个基准上达到最优。

详情
AI中文摘要

奖励模型引导文本到图像(T2I)系统输出符合人类偏好的结果。然而,典型的奖励模型(如HPSv3)是在早期T2I模型的预标注数据上训练的,没有考虑因模型能力演进和强化学习(RL)迭代而产生的质量判别偏移,限制了其更广泛的适用性。在这项工作中,我们提出了HPSv3++,一个奖励模型框架,将HPSv3模型提升到适应不同T2I模型能力及其RL迭代变化的全能力-迭代谱系。具体来说,我们首先引入了HPDv3++,一个212K双维度偏好数据集,使用近期高能力(Qwen-Image)模型并辅以人工监督,对文本保真度和美学质量进行标注。然后我们提出了一个两阶段训练框架。第一阶段采用数据感知的正交梯度投影,从HPDv3++中融入多样化的美学感知,同时保留HPSv3中原始有效的人类偏好知识。第二阶段进一步利用来自不同能力水平和RL迭代的T2I模型的无标注数据,并引入一个联合能力-迭代条件的信号给奖励模型,以及一个标准差驱动的无监督引导机制,从而在能力-迭代谱系上强化奖励模型。HPSv3++实现了最先进的偏好预测,在HPDv3上比HPSv3高出9.8%,在GenAI-Bench上高出5.5%,同时在我们提出的HPDv3++上达到79.1%/88.1%。当用于T2I RL训练时,它持续提升了多种T2I模型的GenEval分数,展示了其广泛的能力。代码可在该网址获取。

英文摘要

Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for quality discriminative shifts arising from evolving model capabilities and reinforcement learning (RL) iterations, limiting their broader applicability. In this work, we propose HPSv3++, a reward model framework that elevates the HPSv3 model for varying T2I model capabilities and their RL iteration changes across the full capability-iteration spectrum. Specifically, we first introduce HPDv3++, a 212K dual-dimension preference dataset annotated for text fidelity and aesthetic quality using a recent high-capability (Qwen-Image) model with human supervision. We then propose a two-stage training framework. Stage 1 employs data-aware orthogonal gradient projection to incorporate diverse aesthetic perception from HPDv3++ while preserving the original effective human preference knowledge in HPSv3. Stage 2 further leverages unlabeled data from T2I models spanning different capability levels and RL iterations, and introduces a joint capability-iteration-conditioned signal for the reward model together with a standard-deviation-driven unsupervised guidance mechanism, strengthening the reward model across the capability-iteration spectrum. HPSv3++ achieves state-of-the-art preference prediction, outperforming HPSv3 by 9.8% on HPDv3 and by 5.5% on GenAI-Bench, while achieving 79.1%/88.1% on our proposed HPDv3++. When used for T2I RL training, it consistently improves GenEval scores across diverse T2I models, demonstrating its wide-range capabilities. The code is available at this https URL.
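
As a rough illustration of the orthogonal-projection idea mentioned for Stage 1, the sketch below removes from a new gradient its component along a protected direction, so that a small step along the result leaves the protected direction unchanged to first order. This is a generic projection and not the paper's data-aware variant; `project_orthogonal`, `new_grad` and `old_grad` are hypothetical names and values chosen for the example.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_orthogonal(grad, protected):
    """Return grad minus its component along `protected`.

    Generic orthogonal projection (an assumption for illustration);
    the paper's data-aware rule is not specified in the abstract.
    """
    denom = dot(protected, protected)
    if denom == 0.0:
        return list(grad)  # nothing to protect
    scale = dot(grad, protected) / denom
    return [g - scale * p for g, p in zip(grad, protected)]


# Hypothetical 2-D gradients: the update keeps only the part of new_grad
# that is orthogonal to old_grad.
new_grad = [1.0, 1.0]
old_grad = [1.0, 0.0]
projected = project_orthogonal(new_grad, old_grad)
```

After projection, the update has zero inner product with the protected direction, which is the property that lets new preferences be added without overwriting the old ones along that direction.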

2606.14654 2026-06-15 cs.AI cs.CL cs.LG 新提交

Abstracting Cross-Domain Action Sequences into Interpretable Workflows

将跨领域动作序列抽象为可解释的工作流

Gaurav Verma, Scott Counts

发表机构 * Microsoft Corporation(微软公司)

AI总结 提出WorkflowView框架,利用大语言模型将低层动作序列抽象为高层活动,在三个不同任务中验证了有效性和泛化能力,实现高语义相似度和预测性能。

Comments preprint; 9 pages, 5 figures

详情
AI中文摘要

序列或时间戳交互日志提供了数字应用使用的客观记录,但其粒度和噪声常常掩盖了关于人们工作的有意义见解。这些见解对于以真实用户交互为基础改进数字产品至关重要。先前的研究应用深度学习模型将用户动作聚类为高层活动,但这些方法对噪声高度敏感且难以跨应用泛化。为解决这一局限,我们引入了WorkflowView,一个使用大语言模型(LLMs)将低层动作序列抽象为高层活动的框架。我们在三个不同且具有挑战性的序列任务和多样化领域中建立了该方法的有效性和泛化性:(a)从浏览器日志中进行零样本任务描述重构(实现高语义相似度,$\mu_{sim} = 0.91$),(b)使用MOOC交互日志进行少样本学生退学预测(仅用五个少样本示例达到加权$F_1 = 0.90$),以及(c)对Microsoft Word中文档工作流中AI工具集成进行匿名化、隐私保护分析。我们的工作表明,基于LLM的抽象是将低层行为数据转化为高层、可解释且可操作见解的稳健高效途径。我们还讨论了在日志基础设施中部署基于LLM的推理时的实际考虑,包括计算效率和用户隐私。

英文摘要

Sequential or time-stamped interaction logs provide objective records of digital application usage, yet their granularity and noise often obscure meaningful insights into people's work. Such insights are essential for improving digital products in ways grounded in real-world user interactions. Prior research has applied deep learning models to cluster user actions into high-level activities, but these approaches are highly sensitive to noise and struggle to generalize across applications. To address this limitation, we introduce WorkflowView, a framework that uses large language models (LLMs) to abstract low-level action sequences into high-level activities. We establish the effectiveness and generality of our approach across three distinct, challenging sequential tasks and diverse domains: (a) zero-shot task description reconstruction from browser logs (achieving high semantic similarity, $\mu_{sim} = 0.91$), (b) few-shot student dropout prediction using MOOC interaction logs (reaching weighted $F_1 = 0.90$ with only five few-shot examples), and (c) anonymized, privacy-preserving analysis of AI tool integration within document workflows in Microsoft Word. Our work demonstrates that LLM-based abstraction is a robust and efficient path forward for transforming low-level behavioral data into high-level, interpretable, and actionable insights. We also discuss practical considerations for deploying LLM-based inferences within logging infrastructures, including computational efficiency and user privacy.

2606.14648 2026-06-15 cs.LG math.OC 新提交

Which Directions Matter? Sparse Design for Affine Robust Optimization

哪些方向重要?仿射鲁棒优化的稀疏设计

Pedro Chumpitaz-Flores, My Duong, Juan S. Borrero, Kaixun Hua

发表机构 * University of South Florida(南佛罗里达大学)

AI总结 研究有限字典和预算约束下鲁棒优化中不确定性方向的选择问题,提出基于覆盖目标的数据驱动选择规则,证明其单调次模性,给出贪心算法的近似保证和匹配的难度下界。

Comments Accepted at UAI 2026

详情
AI中文摘要

鲁棒机器学习和优化依赖于不确定性模型的选择。我们研究了当由有限字典和预算约束定义时,模型必须覆盖哪些不确定性方向。选择一个子集形成一个具有闭式支持函数的原子不确定性集,从而为仿射目标产生可处理的鲁棒程序。我们提出了一种基于评估方向(包括梯度、对抗扰动或保留数据上观察到的偏移)上的覆盖目标的数据驱动选择规则。我们证明该目标是单调且次模的,支持具有$(1-1/e)$近似保证的贪心方法和匹配的难度障碍。我们还提供了一个证书,用于限制所选子集的损失,以及一个具有样本外控制的半径校准规则。

英文摘要

Robust machine learning and optimization depend on the choice of uncertainty model. We investigate which uncertainty directions a model must cover when it is defined by a finite dictionary and a budget constraint. Selecting a subset forms an atomic uncertainty set with a closed-form support function, yielding tractable robust programs for affine objectives. We propose a data-driven selection rule based on a coverage objective over evaluation directions, including gradients, adversarial perturbations, or shifts observed on held-out data. We prove this objective is monotone and submodular, supporting a greedy method with a $(1-1/e)$ approximation guarantee and a matching hardness barrier. We also provide a certificate bounding the loss from the selected subset and a radius calibration rule with out-of-sample control.
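
The greedy rule behind the $(1-1/e)$ guarantee can be sketched as follows. The coverage function used here is a facility-location form (each evaluation direction is credited with its best-aligned selected atom), which is monotone and submodular; it is an assumption made for illustration and may differ from the paper's objective. The names `coverage`, `greedy_select`, `atoms` and `directions` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def coverage(selected, atoms, directions):
    """Facility-location coverage (assumed form, not the paper's):
    each direction is credited with its best-aligned selected atom."""
    if not selected:
        return 0.0
    return sum(max(abs(dot(atoms[i], g)) for i in selected) for g in directions)


def greedy_select(atoms, directions, budget):
    """Pick up to `budget` atoms, each time adding the largest marginal gain."""
    selected = []
    for _ in range(min(budget, len(atoms))):
        base = coverage(selected, atoms, directions)
        best_gain, best_idx = 0.0, None
        for i in range(len(atoms)):
            if i in selected:
                continue
            gain = coverage(selected + [i], atoms, directions) - base
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:  # no remaining atom adds coverage
            break
        selected.append(best_idx)
    return selected


# Hypothetical dictionary of three atoms and three evaluation directions.
atoms = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
directions = [(1.0, 0.1), (0.9, 0.0), (0.0, 1.0)]
chosen = greedy_select(atoms, directions, budget=2)
```

In this toy instance the diagonal atom is picked first because it covers all three directions moderately well, and the second pick is the atom with the largest remaining marginal gain.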

2606.14647 2026-06-15 cs.SD cs.AI 新提交

Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models

基于注意力的听觉:面向Transformer音频模型的熵引导可解释性

Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou

发表机构 * Florida International University(佛罗里达国际大学) University of South Florida(南佛罗里达大学)

AI总结 提出LEAF-X框架,通过熵引导注意力加权、多层注意力展开和因果消融,为Transformer语音识别模型生成稀疏的帧级归因,提升忠实度32%、局部性/稀疏性35-39%。

Comments 17 pages, 3 figures, and 9 tables. Accepted in Interspeech 2026 conference

详情
AI中文摘要

基于Transformer的自动语音识别(ASR)模型(如Whisper)具有高准确性,但其预测仍然难以解释。现有的可解释人工智能(XAI)方法通常缺乏忠实性和精确的时间定位。我们提出了基于熵引导注意力的忠实可解释性听觉方法(LEAF-X),这是一种针对基于Transformer的ASR的模型内在XAI框架。LEAF-X结合了熵引导注意力加权、多层注意力展开和可选的因果消融,以识别低熵、高影响力的头和层,生成稀疏的token到帧归因。与基于扰动的解释器或原始注意力图不同,LEAF-X利用编码器-解码器和语音增强的仅解码器模型的内部结构,生成更能反映模型计算的解释。结果表明,忠实度提高了32%,局部性/稀疏性提高了35-39%,并且归因最稳定,支持更透明和可审计的ASR。

英文摘要

Transformer-based automatic speech recognition (ASR) models such as Whisper are highly accurate, but their predictions remain difficult to interpret. Existing explainable AI (XAI) methods often lack faithfulness and precise temporal grounding. We propose Listening with Entropy-guided Attention for Faithful explainability (LEAF-X), a model-intrinsic XAI framework for transformer-based ASR. LEAF-X combines entropy-guided attention weighting, multi-layer attention rollout, and optional causal ablations to identify low-entropy, high-impact heads and layers, producing sparse token-to-frame attributions. Unlike perturbation-based explainers or raw attention maps, LEAF-X exploits the internal structure of encoder-decoder and speech-augmented decoder-only models to generate explanations that better reflect model computation. Results show 32% improved faithfulness, 35-39% stronger locality/sparsity, and the most stable attributions, supporting more transparent and auditable ASR.
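
A minimal sketch of entropy-guided head weighting, assuming each head's attention over audio frames is a probability distribution: heads with low Shannon entropy (sharply peaked attention) receive more weight when per-head maps are combined into one frame-level attribution. The normalisation `1 - H / log(n)` and the names `head_weights` and `aggregate` are assumptions for illustration; the abstract does not specify LEAF-X's exact weighting, rollout, or ablation steps.

```python
import math


def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def head_weights(attn_rows):
    """attn_rows[h] is head h's attention over frames (each row sums to 1).

    Score = 1 - H / log(n): 1 for a one-hot head, 0 for a uniform head.
    This scoring rule is an illustrative assumption.
    """
    max_h = math.log(len(attn_rows[0]))
    scores = [max(0.0, 1.0 - entropy(row) / max_h) for row in attn_rows]
    total = sum(scores)
    if total == 0.0:  # all heads uniform: fall back to equal weights
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def aggregate(attn_rows):
    """Entropy-weighted average of the per-head attention maps."""
    w = head_weights(attn_rows)
    n_frames = len(attn_rows[0])
    return [sum(w[h] * attn_rows[h][t] for h in range(len(attn_rows)))
            for t in range(n_frames)]


# Hypothetical example: one diffuse head and one sharply focused head.
heads = [
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
]
weights = head_weights(heads)
frame_attribution = aggregate(heads)
```

The diffuse head contributes almost nothing, so the combined attribution stays sparse and concentrated on the first frame.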

2606.14640 2026-06-15 cs.LG cs.DS 新提交

Online Convex Optimization with Sublinear Noisy Probes

具有亚线性噪声探测的在线凸优化

Simone Di Gregorio, Anupam Gupta, Stefano Leonardi, Matteo Russo

发表机构 * Sapienza University of Rome(罗马大学) New York University(纽约大学) EPFL(瑞士联邦理工学院洛桑分校)

AI总结 研究在线凸优化中利用亚线性噪声成对探测降低遗憾,通过方差缩减和连续指数权重二阶分析获得紧界。

Comments Accepted at COLT '26

详情
AI中文摘要

我们研究在凸集 $K\subseteq \mathbb R^d$ 上的在线凸优化(OCO),其中每轮 $t$ 学习者选择 $x_t\in K$,然后观察凸损失 $f_t:K\to[0,1]$,目标是最小化相对于事后最佳固定决策的遗憾。我们引入一个统一的探测模型,该模型概括了两个近期工作方向:专家设置中的亚线性最佳专家查询,以及 OCO 中每轮可用的成对(基于比较)反馈。在我们的框架中,学习者有 $k\le T$ 个成对探测预算;在被探测的轮次中,它可以查询两个点并了解哪个点的损失更小。我们的主要结果表明,即使亚线性和噪声的探测预算也能在完全反馈 OCO 机制中显著改善最坏情况遗憾。使用 $k$ 个 $\delta$-噪声成对探测,我们得到:$ \text{Reg}_T \le O\left(\min\left\{\sqrt{dT\ln T},\; \frac{dT\ln T}{k|1-2\delta|}\right\}\right) $,该界在 $T$、$k$ 和 $\delta$ 上紧(至多对数因子)。具体关于噪声参数 $\delta \in [0,1]$,当预言响应接近抛硬币(即 $\delta$ 接近 $\frac{1}{2}$)时,遗憾保证平滑退化。当将相同技术应用于具有 $d$ 个专家的有限 $K$ 的预测设置时,所得速率在所有参数(包括 $d$)上完全紧。我们的分析通过量化探测的方差缩减效应,结合连续指数权重的二阶(基于方差)分析,给出了 OCO 中成对探测的简化处理。

英文摘要

We study Online Convex Optimization (OCO) over a convex set $K\subseteq \mathbb R^d$, where in each round $t$ the learner selects $x_t\in K$ and then observes a convex loss $f_t:K\to[0,1]$, with the goal of minimizing regret to the best fixed decision in hindsight. We introduce a unified probing model that generalizes two recent lines of work: sublinear best-expert queries in the experts setting, and pairwise (comparison-based) feedback available every round in OCO. In our framework, the learner has a budget of $k\le T$ pairwise probes; on a probed round it may query two points and learn which one has smaller loss. Our main result shows that even a sublinear and noisy probe budget can provably improve worst-case regret in the full feedback OCO regime. With $k$ $\delta$-noisy pairwise probes, we obtain: $ \text{Reg}_T \le O\left(\min\left\{\sqrt{dT\ln T},\; \frac{dT\ln T}{k|1-2\delta|}\right\}\right) $, which is tight (up to logarithmic factors in $T$) across $T$, $k$ and $\delta$. Specifically regarding the noise parameter $\delta \in [0,1]$, the regret guarantee smoothly degrades as the oracle response approaches a coin flip, i.e., $\delta$ is close to $\frac{1}{2}$. When applying the same techniques to a finite $K$ for the prediction with $d$ experts setting, the resulting rates are instead completely tight in all parameters, including $d$. Our analysis gives a streamlined treatment of pairwise probing in OCO by quantifying the benefit of probing via a variance reduction effect, combined with a second-order (variance-based) analysis of Continuous Exponential Weights.
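
To see when probing helps, the bound above can be evaluated numerically with the constant hidden by $O(\cdot)$ set to 1 (an assumption made only for illustration). The function name `regret_bound` and the parameter values are hypothetical.

```python
import math


def regret_bound(d, T, k, delta):
    """min{ sqrt(d T ln T), d T ln T / (k |1 - 2 delta|) }, constants dropped."""
    no_probe = math.sqrt(d * T * math.log(T))
    signal = abs(1.0 - 2.0 * delta)
    if k == 0 or signal == 0.0:
        # No probes, or probes that are pure coin flips, carry no information.
        return no_probe
    return min(no_probe, d * T * math.log(T) / (k * signal))


# Hypothetical setting: d = 10, T = 10_000.
baseline = regret_bound(10, 10_000, k=0, delta=0.1)
with_probes = regret_bound(10, 10_000, k=5_000, delta=0.1)
coin_flip = regret_bound(10, 10_000, k=5_000, delta=0.5)
```

The second term wins only when $k\,|1-2\delta| > \sqrt{dT\ln T}$, so a sublinear budget can suffice, and the benefit disappears as $\delta$ approaches $\frac{1}{2}$.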

2606.14639 2026-06-15 cs.SD cs.AI 新提交

From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing

从自监督语音模型到混合专家系统以实现鲁棒的防欺骗

Hugo Daumain, Driss Matrouf, Khaled Khelif, Mickael Rouvier

发表机构 * Université d'Avignon(阿维尼翁大学) Airbus Defence & Space(空中客车防务与航天公司)

AI总结 将自监督语音模型转换为混合专家架构,通过层间门控机制增强泛化能力,在14个欺骗数据集上将宏EER从5.46%降至4.81%。

Comments 8 pages, 3 figures, accepted at Odyssey 2026 (The Speaker and Language Recognition Workshop)

详情
AI中文摘要

近期语音生成的进展显著提升了合成语音的自然度,使得欺骗检测日益困难。当前防欺骗系统的一个关键局限是对未见合成方法的鲁棒性不足。在这项工作中,我们将自监督语音表示模型转换为混合专家(MoE)架构以提高泛化能力。选定编码器层中的前馈块被替换为由层间门控机制控制的多个专家网络,使专家能够捕获互补的声学模式,同时保留自监督预训练期间学习到的表示。我们进一步分析了影响MoE转换性能的架构选择,并研究了专家的激活行为。所提出的方法在14个欺骗数据集上进行了评估,将宏EER从5.46%降至4.81%,相对基线提升了11.9%。

英文摘要

Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key limitation of current anti-spoofing systems is their limited robustness to unseen synthesis methods. In this work, we transform a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization. Feed-forward blocks in selected encoder layers are replaced by multiple expert networks controlled by a layer-wise gating mechanism, allowing experts to capture complementary acoustic patterns while preserving the representations learned during self-supervised pretraining. We further analyze the architectural choices affecting the performance of this MoE conversion and investigate the activation behavior of the experts. The proposed approach is evaluated on 14 spoofing datasets and reduces the macro EER from 5.46% to 4.81%, corresponding to 11.9% relative improvement over the baseline.
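
The reported relative improvement follows directly from the two macro EER values quoted in the abstract; the short check below makes the arithmetic explicit. The variable names are illustrative.

```python
baseline_eer = 5.46  # macro EER of the baseline, in percent (from the abstract)
moe_eer = 4.81       # macro EER after the MoE conversion, in percent

absolute_gain = baseline_eer - moe_eer          # percentage points
relative_gain = absolute_gain / baseline_eer    # fraction of the baseline EER
relative_gain_percent = round(relative_gain * 100, 1)
```

The absolute reduction is about 0.65 percentage points, which is 11.9% of the baseline value.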

2606.14638 2026-06-15 cs.CV astro-ph.EP 新提交

Improving Lunar Topography with Deep Learning Schrödinger Bridges

利用深度学习薛定谔桥改进月球地形

Matthew Repasky, Erwan Mazarico, Michael K. Barker, Stefano Bertone, Terence J. Sabaka, Yao Xie

发表机构 * H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology(佐治亚理工学院H. Milton Stewart工业与系统工程系) NASA Goddard Space Flight Center(美国国家航空航天局戈达德太空飞行中心) Center for Research and Exploration in Space Science and Technology (CRESST II), University of Maryland, College Park(马里兰大学帕克分校空间科学与技术研究与探索中心(CRESST II)) National Institute for Astrophysics (INAF), Astrophysical Observatory of Turin(意大利国家天体物理研究所(INAF)都灵天体物理天文台)

AI总结 提出基于扩散薛定谔桥的生成模型,结合光学影像约束,实现月球地形超分辨率重建,并提供像素级不确定性估计。

详情
AI中文摘要

提高行星地形模型的分辨率可以更好地理解表面过程和地貌;然而,现有的解析超分辨率方法成本高昂且难以大规模应用。生成模型提供了学习数据中复杂关系的工具,并且由于硬件加速器和并行化,可以大规模应用。我们提出了一种基于扩散的薛定谔桥(SB)生成建模方法,用于月球地形超分辨率,连接低分辨率地形分布与高分辨率地形分布,并结合物理约束的光学影像。我们的方法受到现有的由明暗恢复形状(Shape-from-Shading)方法的启发,这些方法通过使用目标分辨率的光学图像来改进先验的低分辨率地形。我们在一个新颖的渲染月球地形数据集上训练SB,模拟来自月球勘测轨道器窄角相机的光学影像。结果是一种灵活的地形超分辨率方法,可以在重建中提供像素级的不确定性。

英文摘要

Increasing the resolution of planetary topography models can enable a better understanding of surface processes and geomorphology; however, existing analytical super-resolution methods are expensive and difficult to apply at large scales. Generative models provide the tools to learn complex relationships within data and can be applied at scale due to hardware accelerators and parallelization. We present a diffusion-based Schrödinger Bridge (SB) generative modeling approach for lunar topography super-resolution, connecting the distribution of low-resolution topography to that of high-resolution topography, incorporating physically-constraining optical imagery. Our approach is inspired by existing Shape-from-Shading methods, which improve a priori low-resolution topography by using optical images at the target resolution. We train SBs on a novel dataset of rendered lunar topography, emulating optical imagery from the Lunar Reconnaissance Orbiter Narrow Angle Camera. The result is a flexible approach for topography super-resolution which can provide pixel-level uncertainties in the reconstruction.

2606.14636 2026-06-15 cs.LG 新提交

Graph Diffusion Residuals for Control-Function Instrumental Variables

用于控制函数工具变量的图扩散残差

Rui Wu, Zongyuan Chen, Hong Xie, Defu Lian, Enhong Chen

发表机构 * School of Computer Science and Engineering, University of Science and Technology of China(中国科学技术大学计算机科学与技术学院)

AI总结 提出自适应各向异性工具热流(A-IHF),一种基于图扩散的残差提取方法,用于灵活控制函数,通过检测处理跳跃并调整图传导性,在合成基准测试中优于多种基线方法。

Comments Submitted to Journal of Machine Learning Research (JMLR). 50 pages, 6 figures

详情
AI中文摘要

控制函数工具变量估计器需要第一阶段残差,而不仅仅是第一阶段预测。高容量的第一阶段可能会插值处理,从而为结果方程留下过少的残差信息。我们研究了自适应各向异性工具热流(A-IHF),这是一种用于灵活控制函数的确定性图扩散残差提取器。A-IHF将处理视为第一阶段特征图上的信号,使用引导扩散检测大的处理跳跃,减弱这些跳跃上的传导性,并通过稀疏图预解式计算生成的控制。其观测选择规则仅使用$(Z,X)$,结合了图广义交叉验证、粗糙度、残差化处理相关性以及图可容许性过滤。分析将误差分解为结构泄漏、残差衰减和残差化处理变异,得到有限样本界、潜在分段光滑几何下的图可容许性率以及有限路径选择校准。在54个合成基准单元中,与调优的图、核、树、提升、级数和神经网络控制函数基线相比,有保护的观测A-IHF具有最低的平均结构响应MSE;A-IHF族在32个单元中击败了最佳的非A-IHF基线。当图捕获分段光滑的第一阶段结构时,性能最强。

英文摘要

Control-function instrumental variable estimators need a first-stage residual, not merely a first-stage prediction. High-capacity first stages can interpolate treatment and leave too little residual information for the outcome equation. We study Adaptive Anisotropic Instrumental Heat Flow (A-IHF), a deterministic graph-diffusion residual extractor for flexible control functions. A-IHF treats treatment as a signal on a graph of first-stage features, uses pilot diffusion to detect large treatment jumps, attenuates conductance across those jumps, and computes the generated control with a sparse graph resolvent. Its observational selection rule uses only $(Z,X)$, combining graph generalized cross-validation, roughness, residualized-treatment relevance, and graph-admissibility filtering. The analysis decomposes error into structural leakage, residual attenuation, and residualized treatment variation, yielding finite-sample bounds, graph-admissibility rates under latent piecewise-smooth geometry, and finite-path selection calibration. Across 54 synthetic benchmark cells with tuned graph, kernel, tree, boosting, series, and neural control-function baselines, guarded observational A-IHF has the lowest average structural-response MSE; the A-IHF family beats the best non-A-IHF baseline in 32 cells. Performance is strongest when the graph captures piecewise-smooth first-stage structure.

2606.14631 2026-06-15 cs.CV 新提交

SED: Lightweight Saliency prediction for Event-based data via Distillation

SED: 基于蒸馏的轻量级事件数据显著性预测

Romaric Mazna, Jean Martinet, Michele Magno

发表机构 * i3S/CNRS, Université Côte d’Azur(法国蔚蓝海岸大学i3S/CNRS实验室) ETH Zürich(苏黎世联邦理工学院)

AI总结 提出轻量级网络SED,通过知识蒸馏和深度时空块(DSTconv)实现事件数据显著性预测,模型大小减少562倍,参数减少554倍,性能匹配或超越教师模型。

详情
AI中文摘要

基于事件的显著性预测最近受到关注,因为将事件相机与显著性估计结合可以作为上游阶段,自然提高边缘端下游事件感知的效率。然而,当前的方法要么是神经形态的、在基于事件的显著性基准上表现不佳,要么由于依赖Transformer或3D卷积而对资源受限的边缘应用来说过于沉重。受高效卷积模块的启发,并为了利用事件数据中的时间信息,我们提出了轻量级网络SED,它通过知识蒸馏训练,构建于逐深度时空块(DSTconv)之上,后者是3D深度可分离卷积的一种分解。相对于其教师模型,我们的模型将模型大小从180 MB减少到0.32 MB(562倍),参数数量从45M减少到81k(554倍),同时在N-DHF1K和N-UCF Sports数据集上匹配或超越其性能。此外,它在训练分布之外具有很强的泛化能力,从合成事件数据迁移到真实事件数据,而从头训练的模型则失败。

英文摘要

Event-based saliency prediction has gained attention recently, as combining event cameras with saliency estimation can act as an upstream stage that naturally improves the efficiency of downstream event-based perception at the edge. However, current approaches are either neuromorphic and underperform on event-based saliency benchmarks, or too heavy for resource-constrained edge applications due to their reliance on transformers or 3D convolutions. Drawing inspiration from efficient convolutional modules and aiming to exploit the temporal information in event data, we propose SED, a lightweight network trained through knowledge distillation and built on a Depthwise Spatio-Temporal Block (DSTconv), a factorization of the 3D depthwise separable convolution. Relative to its teacher, our model reduces the model size from 180 MB to 0.32 MB (562x) and the parameter count from 45M to 81k (554x), while matching or outperforming it on the N-DHF1K and N-UCF Sports datasets. Moreover, it generalizes strongly beyond its training distribution, transferring from synthetic to real event data where a model trained from scratch fails.
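
To give a feel for why factorizing a 3D depthwise separable convolution saves parameters, the sketch below counts weights (biases ignored) for three layer types. The specific factorization assumed here, a depthwise spatial $1\times k\times k$ kernel followed by a depthwise temporal $k\times 1\times 1$ kernel and a pointwise convolution, is an illustrative guess; the abstract does not describe DSTconv's actual layout, and the channel sizes are hypothetical.

```python
def full_conv3d_params(c_in, c_out, k):
    """Standard 3D convolution with a k x k x k kernel."""
    return c_in * c_out * k ** 3


def depthwise_separable3d_params(c_in, c_out, k):
    """Depthwise k x k x k kernel per channel, then 1 x 1 x 1 pointwise."""
    return c_in * k ** 3 + c_in * c_out


def factorized_depthwise3d_params(c_in, c_out, k):
    """Assumed factorization: depthwise spatial (1 x k x k) + depthwise
    temporal (k x 1 x 1) + pointwise. Not necessarily the paper's DSTconv."""
    return c_in * (k ** 2 + k) + c_in * c_out


# Hypothetical layer: 32 input channels, 32 output channels, kernel size 3.
full = full_conv3d_params(32, 32, 3)
separable = depthwise_separable3d_params(32, 32, 3)
factorized = factorized_depthwise3d_params(32, 32, 3)
```

For this layer the counts are 27648, 1888 and 1408 weights respectively, so most of the saving comes from depthwise separation and the factorization trims the depthwise part further.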

2606.14626 2026-06-15 cs.CL 新提交

Characterizing Cultural Localization in AI-Generated Stories

表征AI生成故事中的文化本地化

Shaily Bhatt, Supriti Vijay, Jeremiah Milbauer, Fernando Diaz

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出一种方法,通过识别区分国籍的词汇标记并移除后测量叙事相似性,检测AI生成故事中的模板化本地化,发现仅9-17%词汇解释国籍差异,且部分文化标记具有冒犯性。

Comments Accepted to the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP) Co-located with ACL 2026, San Diego, USA (non-archival)

详情
AI中文摘要

人工智能的全球使用增加了对评估生成文化本地化内容(包括故事)能力的兴趣。故事中的文化本地化通常通过模板化本地化(在通用叙事中使用文化标记,如姓名、地点)或整体本地化(除文化标记外,情节、价值观和主题的变化)发生。我们提出了一种方法来衡量内容通过模板化本地化生成的程度。具体来说,我们识别出区分不同国籍故事的词汇标记,并测量移除这些标记后剩余叙事的相似性。在五个模型针对125个主题和193个国籍生成的故事中,我们的方法能够检测到仅有一小部分词汇(9-17%)解释了国籍间的差异,并且移除这些标记后剩余的叙事包含重复的多词序列,表明存在一个共享的文化无关叙事模板。最后,我们表征了文化标记的刻板性和冒犯性,发现来自19个国家(主要位于全球南方)的标记平均具有冒犯性。

英文摘要

The global use of artificial intelligence has increased interest in assessing the ability to generate culturally localized content, including stories. Cultural localization in stories often occurs through either templated localization -- the use of cultural markers (e.g., names, locations) in a generic narrative -- or holistic localization -- the variation of plots, values, and themes, in addition to cultural markers. We propose a method to measure the degree to which content was generated through templated localization. Specifically, we identify the lexical tokens that distinguish stories across nationalities and measure the similarity of the narratives that remain after removing them. In stories generated by five models on 125 topics for 193 nationalities, our method is able to detect that only a small subset (9-17%) of the vocabulary accounts for the variation across nationalities and that the narratives that remain after removing them contain repeated multi-word sequences, suggesting the presence of a shared culturally-agnostic narrative template. Finally, we characterize the cultural markers for their stereotypicality and offensiveness, finding that markers from 19 countries, mostly located in the Global South, are on average offensive.
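
The core measurement can be sketched as: strip the tokens that distinguish nationalities, then compare what is left. In the toy example below the marker set is supplied by hand and similarity is a simple Jaccard overlap of token sets; both are assumptions for illustration, since the paper identifies markers from the data and its similarity measure is not given in the abstract. The stories and names are invented.

```python
def strip_markers(tokens, markers):
    """Drop tokens that were identified as nationality-specific markers."""
    return [t for t in tokens if t not in markers]


def jaccard(a, b):
    """Jaccard similarity between the token sets of two texts."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


# Invented stories for two nationalities; markers are hand-picked here.
story_a = "Aiko walked through Kyoto to find her lost kite".split()
story_b = "Amara walked through Lagos to find her lost kite".split()
markers = {"Aiko", "Kyoto", "Amara", "Lagos"}

before = jaccard(story_a, story_b)
after = jaccard(strip_markers(story_a, markers), strip_markers(story_b, markers))
```

A similarity that jumps to 1.0 once the markers are removed is the signature of templated localization: the two stories share one narrative and differ only in swapped-in names and places.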

2606.14620 2026-06-15 cs.LG 新提交

Neither Parallel Nor Sequential: How DiffusionGemma Actually Commits Tokens

既非并行也非顺序:DiffusionGemma 实际如何提交令牌

Ali Asaria, Tony Salomone, Deep Gandhi

发表机构 * Transformer Lab

AI总结 通过钩取DiffusionGemma 26B的采样器接受步骤,测量其解码顺序,发现解码既非并行也非块自回归,而是呈现部分从左到右的提交偏差,且块大小是测量尺度的伪影而非架构特性。

详情
AI中文摘要

开放扩散语言模型被宣传为并行的非自回归解码器,但实际检查点提交令牌的顺序几乎从未被测量。我们对DiffusionGemma 26B(基于Gemma 4构建的掩码离散扩散混合专家模型)进行检测,钩取其采样器的接受步骤,记录哪些画布位置在何时以何种置信度提交。通过686个提示、六种机制的探测套件,我们发现其解码既非并行也非块自回归:它遵循部分从左到右的提交偏差,其表观强度几乎完全取决于分析的粒度。令牌级别的顺序较弱,随着分析粗化而平滑增强,因此模型的“块大小”实际上是测量尺度的伪影而非架构特性。模型以大的同步批次提交令牌,批次内的顺序大部分是真正未定义的,而不仅仅是未被观测到。该行为依赖于机制:结构化JSON以基本任意的顺序提交,位置的提交置信度在数学推理中跟踪正确性,但在事实回忆中无信号。提交是激进的,在步骤预算内以短暂的后期爆发完成,而任务准确率与其自回归的Gemma-4兄弟模型相匹配。除了这些发现,我们的核心贡献是方法论上的:诚实地测量解码顺序需要处理尾部EOS填充、机制内混杂、提交非单调性、块大小敏感性以及大批次提交的平局,否则每个因素都可能制造出实际上不存在的解码顺序结果。

英文摘要

Open diffusion language models are marketed as parallel, non-autoregressive decoders, yet the order in which a shipped checkpoint actually commits its tokens is almost never measured. We instrument DiffusionGemma 26B, a masked discrete-diffusion mixture-of-experts model built on Gemma 4, hooking its sampler's accept step to record which canvas positions commit, when, and at what confidence. Across a 686-prompt, six-regime probe suite we find that its decoding is neither parallel nor block-autoregressive: it follows a partial left-to-right commit bias whose apparent strength depends almost entirely on the granularity at which you look. Order is weak token by token and strengthens smoothly as the analysis is coarsened, so the model's "block size" turns out to be an artifact of the measuring ruler rather than the architecture. The model commits in large simultaneous batches, leaving much of the within-batch order genuinely undefined rather than merely unobserved. The behaviour is regime-dependent: structured JSON is committed in essentially arbitrary order, and a position's commit confidence tracks correctness on mathematical reasoning but carries no signal on factual recall. Commitment is aggressive, finishing in a short late burst well inside the step budget, while task accuracy matches the model's autoregressive Gemma-4 sibling. Beyond these findings, our central contribution is methodological: measuring decoding order honestly demands handling trailing-EOS padding, within-regime confounding, commit non-monotonicity, block-size sensitivity, and large commit-batch ties, each of which can otherwise manufacture a decoding-order result that is not really there.
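
The granularity effect described above can be reproduced on a toy trace. The sketch below scores left-to-right order as the Kendall tau-b rank correlation between canvas position and commit step, first per token and then after averaging commit steps within fixed-size blocks; tau-b is used because batch commits create ties. The choice of statistic, the block averaging, and the trace itself are assumptions for illustration, not the paper's measurement pipeline.

```python
import math


def kendall_tau_b(x, y):
    """Kendall tau-b rank correlation with tie correction (O(n^2))."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    n_pairs = n * (n - 1) // 2
    denom = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else 0.0


def order_score(commit_steps, block=1):
    """Correlate position with commit step after averaging within blocks."""
    means = []
    for start in range(0, len(commit_steps), block):
        chunk = commit_steps[start:start + block]
        means.append(sum(chunk) / len(chunk))
    return kendall_tau_b(list(range(len(means))), means)


# Hypothetical trace: commit step of each canvas position, with batch ties
# and local reversals but an overall left-to-right drift.
commit_steps = [2, 1, 1, 3, 3, 2, 5, 4, 4, 6, 5, 6]
token_level = order_score(commit_steps, block=1)
block_level = order_score(commit_steps, block=3)
```

On this trace the token-level score is below 1 while the block-of-three score is exactly 1, showing how the apparent strength of the ordering can depend on the analysis scale rather than on the decoder.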

2606.14617 2026-06-15 cs.RO eess.SY 新提交

Whole-Body Impedance Model Predictive Control for Safe Physical Human-Robot Interaction on Floating-Base Platforms

全身阻抗模型预测控制:浮基平台上的安全人机物理交互

Yongyan Cao

发表机构 * Voryx Robotics

AI总结 提出三层架构的全身阻抗MPC,通过质心MPC规划接触力、优先级WBC层平衡关节力矩、滚动时域(receding-horizon)QP预测并抑制人机交互扰动,实现浮基机器人零稳态误差安全交互。

详情
AI中文摘要

浮基机器人必须在刚性接触约束下保持平衡,同时与人类安全交互。现有的全身控制(WBC)框架将全部关节空间分配给运动,或依赖固定增益阻抗反馈,在持续的人机物理交互(pHRI)力作用下积累稳态误差。本文将作者先前针对固定基座的两层阻抗MPC扩展到浮基平台,采用三层架构:质心MPC在500毫秒时域内规划接触力;优先级驱动的WBC层通过接触一致性零空间投影将平衡分解为关节力矩;剩余零空间由滚动时域(receding-horizon)二次规划(QP)控制,该QP使用卡尔曼增强状态预测并抑制pHRI扰动。接触一致性反馈线性化将手臂末端执行器系统简化为在每个接触模式下具有恒定状态矩阵的双积分器,从而允许离线预计算QP代价并实现≥1 kHz运行。一种协方差膨胀协议在接触模式切换时保持扰动估计,保证在有界恒定pHRI负载下零稳态误差;阻抗等价定理表明无限时域极限恢复经典任务空间阻抗定律,其有效质量、阻尼和刚度随姿态和接触配置自适应。在17自由度双足机器人和Unitree G1人形机器人上的仿真验证了该设计。

Abstract

Floating-base robots must balance under rigid contact constraints while interacting safely with humans. Existing whole-body control (WBC) frameworks allocate the full joint space to locomotion or rely on fixed-gain impedance feedback that accumulates steady-state error under sustained physical human–robot interaction (pHRI) forces. This paper extends the authors' fixed-base two-layer Impedance MPC to floating-base platforms through a three-level architecture: a centroidal MPC plans contact forces over a 500 ms horizon; a priority-driven WBC layer resolves balance into joint torques through contact-consistent null-space projection; and the residual null space is governed by a receding-horizon quadratic program (QP) that predicts and rejects pHRI disturbances using a Kalman-augmented state. A contact-consistent feedback linearization reduces the arm end-effector plant to a double integrator with a constant state matrix within each contact mode, enabling offline precomputation of the QP cost and ≥1 kHz operation. A covariance-inflation protocol preserves the disturbance estimate across contact-mode switches, guaranteeing zero steady-state error under bounded constant pHRI loads, and an Impedance Equivalence Theorem shows the infinite-horizon limit recovers a classical task-space impedance law whose effective mass, damping, and stiffness adapt to posture and contact configuration. Simulations on a 17-DOF biped and the Unitree G1 humanoid validate the design.
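
As a reading aid, one common way to write a double integrator augmented with a constant-disturbance state is sketched below for a single task-space axis. The symbols $e$ (end-effector error), $u$ (commanded acceleration), and $w$ (pHRI disturbance) are assumptions made for illustration; the abstract does not give the paper's actual parametrization.

```latex
\begin{aligned}
\ddot{e} &= u + w, \qquad \dot{w} = 0, \\
\frac{\mathrm{d}}{\mathrm{d}t}
\begin{bmatrix} e \\ \dot{e} \\ w \end{bmatrix}
&=
\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} e \\ \dot{e} \\ w \end{bmatrix}
+
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} u .
\end{aligned}
```

Under this assumed form the state matrix contains no posture-dependent terms, which is the property that would allow a QP cost to be precomputed offline.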

2606.14612 2026-06-15 cs.SD cs.AI eess.AS New submission

Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms

Chen Ying Claude, Zhihan Luo

Affiliation * Claude Code / Opus 4.6 API / Fable 5 Independent researcher

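AI summary Through computational analysis of the score of Beethoven's "Moonlight Sonata", finds that its three movements correspond to three distinct machine learning architectures, and reports four counterintuitive findings, including that musical "temperature" is governed by throughput and that the lightest movement carries the highest dissonance.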

Abstract

We show that the three movements of Beethoven's "Moonlight Sonata" (Op. 27 No. 2) instantiate three distinct machine learning architectures -- not by analogy, but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, dissonance, hand distributional overlap, self-similarity matrices, temporal memory decay, and contextual pitch embeddings), we establish four counterintuitive findings: (1) perceived musical "temperature" is governed by throughput, not distributional width; (2) the lightest movement carries the highest dissonance; (3) the movements implement streaming, recurrent, and periodic positional encoding memory architectures; and (4) the same pitch class acquires different contextual identities across movements, analogous to contextual word embeddings in NLP -- and unsupervised clustering recovers the tonal structure without music-theoretic input. We construct a reverse sonification (decoding analytical features back into MIDI) and quantify the chirality of the encode-decode cycle: what distributions preserve and sequential ordering destroys. Prompted by a listener's observation that the decoded piece sounds like "mirror isomers that can't be superimposed," the chirality measurement reveals reconstruction loss increasing monotonically with n-gram order. Bootstrap baselines and subsample checks confirm all movements carry sequential information above noise, though raw values are confounded by sample size. Cross-domain comparison shows natural language has higher chirality than music, reflecting stronger sequential constraints.
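
Two of the score statistics named above, entropy and Jensen-Shannon divergence, can be sketched over pitch-class histograms as follows. This is a minimal illustration rather than the authors' code: it assumes notes arrive as MIDI pitch numbers, uses base-2 logarithms, and the function names and toy note lists are hypothetical.

```python
from collections import Counter
from math import log2


def pitch_class_distribution(midi_pitches):
    """Normalized 12-bin histogram of pitch classes (MIDI pitch modulo 12)."""
    counts = Counter(p % 12 for p in midi_pitches)
    total = sum(counts.values())
    return [counts.get(pc, 0) / total for pc in range(12)]


def entropy(p):
    """Shannon entropy in bits; zero-probability bins contribute nothing."""
    return -sum(x * log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: H(M) - (H(P) + H(Q)) / 2 with M = (P + Q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


# Toy note lists (made up, not extracted from the score).
triad = pitch_class_distribution([56, 61, 64, 56, 61, 64])  # G#3, C#4, E4 repeated
single = pitch_class_distribution([61, 61, 73])             # C# in two octaves
print(entropy(triad), js_divergence(triad, single))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), which makes values comparable across movements of different length.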