arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.18359 2026-05-27 cs.CV

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

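Xi Leng, Xinhong Ma, Ziqiang Dong, Feng Zhang, Xiaoying Tang, Yang Yang, Guanjun Jiang

Affiliations: Qwen Business Unit of Alibaba; The Chinese University of Hong Kong, Shenzhen; Beijing Institute of Technology

AI summary: To address cross-modal misallocation and intra-visual imbalance in the standard attention of large multimodal models, the paper proposes RAVE, a lightweight pair-gating mechanism that reallocates visual attention through a learned query-key bias. It improves multiple multimodal benchmarks by 3 points on average, with the largest gains on perception-intensive tasks.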

Abstract

Large multimodal models (LMMs) inherit the self-attention mechanism of pretrained language backbones, yet standard attention can exhibit suboptimal allocation, including cross-modal misallocation between textual and visual evidence and intra-visual imbalance among visual tokens. We propose RAVE (Re-Allocating Visual Attention), a lightweight pair-gating mechanism that adds a learned query-key bias to pre-softmax attention scores over visual keys, derived from pre-RoPE query and key features. RAVE requires no architectural modification to the backbone and can be trained end-to-end with the rest of the model. Across a suite of multimodal benchmarks, RAVE improves over standard attention by an average of 3 points, with the largest gains on perception-intensive tasks, including multilingual OCR, chart understanding, document VQA, and scene text VQA, where accurate visual grounding is critical.
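The idea of biasing pre-softmax scores at visual key positions can be illustrated with a toy, single-query sketch. It assumes the bias is a plain per-key scalar; the abstract does not say how the bias is computed from pre-RoPE features, so the `bias` values and the function name `attend_with_visual_bias` are hypothetical stand-ins rather than the authors' gate.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend_with_visual_bias(scores, is_visual, bias):
    """Add a bias to pre-softmax scores at visual key positions only.

    scores, is_visual, and bias are parallel lists for a single query.
    The bias values are hypothetical placeholders for a learned gate.
    """
    biased = [s + b if v else s for s, v, b in zip(scores, is_visual, bias)]
    return softmax(biased)


# One query over four keys: two text keys followed by two visual keys.
scores = [2.0, 1.0, 0.5, 0.0]
is_visual = [False, False, True, True]
bias = [0.0, 0.0, 1.5, 0.5]  # assumed values, for illustration only

baseline = softmax(scores)
reallocated = attend_with_visual_bias(scores, is_visual, bias)

# Total attention mass on visual keys before and after the bias.
visual_mass_before = sum(p for p, v in zip(baseline, is_visual) if v)
visual_mass_after = sum(p for p, v in zip(reallocated, is_visual) if v)
```

Because the bias enters before the softmax, the weights still sum to one; a positive bias on visual keys moves mass toward them and away from text keys.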

2605.17774 2026-05-27 cs.CL

Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

Yuval Shemla, Ayal Yakobe, Tanmay Agarwal, Dhaval Patel, Kaoutar El Maghraoui

Affiliations: Columbia School of General Studies, Columbia University, NY, USA; Columbia Engineering, Columbia University, NY, USA; IBM Research, NY, USA

AI summary: The paper studies internalizing tool knowledge in small language models through parameter-efficient QLoRA fine-tuning. On AssetOpsBench, fine-tuned Gemma 4 E4B and Qwen3-4B models under description-free inference outperform an unfine-tuned baseline that receives full tool descriptions, cutting input length by 82.6% while improving planning scores.

Abstract

Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models. This paper investigates whether tool-use knowledge can be internalized into small language models through parameter-efficient fine-tuning, enabling structured planning without explicit tool descriptions at inference time. Using AssetOpsBench as the primary benchmark, we fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on approximately 1,700 tool-use examples spanning tool knowledge, question-to-plan mappings, and execution-style traces. We evaluate the resulting models under description-free inference, where the prompt omits the tool catalog entirely. The fine-tuned models outperform an informed unfine-tuned baseline that receives full tool descriptions, reducing input length by 82.6% while improving structural and LLM-judge planning scores. In the best Gemma run, the model achieves an AT-F1 of 0.65 and an overall judge score of 3.88, compared with 0.47 and 2.88 for the informed baseline. Qwen3-4B achieves a strong overall judge score of 3.78 while using 62% less memory and running 2.5× faster than Gemma, though it also exhibits greater catastrophic forgetting on general multiple-choice benchmarks. Additional ablations show that LoRA rank controls a quality-retention trade-off, with r=32 maximizing planning quality and smaller ranks preserving more general knowledge. These results suggest that, for fixed tool catalogs, QLoRA fine-tuning can shift tool knowledge from prompt context into model weights, substantially reducing inference overhead while maintaining or improving tool-planning quality.
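The rank ablation is easier to read with the standard LoRA parameter count in mind: a rank-r adapter on a weight of shape d_out × d_in trains two factors, A (r × d_in) and B (d_out × r), so its size grows linearly in r. The sketch below works through that arithmetic. The 2560 × 2560 layer shape is an illustrative assumption, not a figure from the paper.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def full_params(d_in, d_out):
    """Parameters of the frozen dense weight the adapter is attached to."""
    return d_in * d_out


# Hypothetical square projection layer; not taken from the paper.
D_IN = D_OUT = 2560

counts = {r: lora_trainable_params(D_IN, D_OUT, r) for r in (8, 16, 32)}
fraction_at_32 = counts[32] / full_params(D_IN, D_OUT)
```

Under these assumed dimensions, doubling the rank doubles the adapter size, and even r=32 trains only a few percent of the layer's weights.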

2605.17617 2026-05-27 cs.AI

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

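Yiwen Zhu, Joyce Cahoon, Anna Pavlenko, Qiushi Bai, Nima Shahbazi, Divya Vermareddy, Meina Wang, Mathieu Demarne, Swati Bararia, Wenjing Wang, Hemkesh Vijaya Kumar, Hannah Lerner, Katherine Lin, Steve Toscano, Miso Cilimdzic, Subru Krishnan

Affiliations: Microsoft, USA; University of Illinois Chicago, USA; Microsoft, Spain

AI summary: Proposes GraphMind, a system that automates workflows for cloud database incident investigation through offline extraction of causal workflow graphs, online multi-agent traversal and execution, and Adaptive Traversal Reinforcement (ATR). It requires 8x less retrieval context than the baseline, and its ATR layer reduces the hallucination rate by 26%.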

Abstract

Complex operational workflows coordinating personnel, tools, and information are central to system operations, yet end-to-end automation remains challenging due to extensive human input requirements and limited ability to adapt over time. We present GraphMind, a system that constructs, executes, and evolves action-centric workflow graphs with minimal human effort. The system operates in three phases. First, a scalable offline pipeline extracts structured workflow graphs from large volumes of human resolution traces, capturing problems, actions, and their causal relationships. Second, an online multi-agent traversal engine navigates the graph to dynamically construct and execute workflows, combining graph-guided retrieval with LLM-driven reasoning at each step. Third, Adaptive Traversal Reinforcement (ATR) reinforces successful traversal paths, enabling execution-informed graph adaptation. GraphMind has been deployed across four production cloud database services for incident investigation. Evaluated on 93 held-out incidents and validated via blind expert review, the system outperforms an Agentic Summary-RAG baseline in mitigation reach, hallucination rate, and diagnostic throughput while requiring 8x less retrieval context. The ATR layer reduces hallucination rate by 26%, demonstrating that workflow graphs can learn from execution feedback. A 12-week field study confirms practical value: 97% of scored conversations yield actionable results within interactive latency.

2605.17482 2026-05-27 cs.CL cs.LG

RSD: A Local Triangulation Audit Primitive for Learned Vector Blocks

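Seungmin Jin

Affiliations: HSE University

AI summary: Proposes RSD (Relational Semantic Decomposition) as a local triangulation audit that fits simplex memberships and coordinate poles and combines a relation decoder with coordinate residuals to audit learned vector blocks for interpretability.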

Comments 8 pages, 1 figure. Revised version with clarified scope, experiments, and limitations

Abstract

Local XAI audits compare a finite block of learned vectors with a weak side signal. Baselines such as nearest-neighbor lookup, low-rank coordinate models, and relation factorization expose different parts of this audit. We introduce Relational Semantic Decomposition, abbreviated as RSD, as a local triangulation audit for learned vector blocks. Given coordinates X and a declared bounded weak affinity proxy A, RSD fits simplex memberships S and coordinate poles C. It reuses S in a relation decoder for A and reports the coordinate residual R=X-SC. This yields a scoped audit unit: compatibility for the chosen block, proxy, decoder class, and loss budget, plus component mass and residual readouts. Synthetic controls check simplex reconstruction, proxy decoding, and fixed-S residual decomposition. The theorem-statement, month, and dog/wolf blocks illustrate why low proxy loss should be read with component mass, residual readouts, and block size.
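The residual readout R = X - SC can be worked through on a tiny example. The sketch below assumes S and C are already given and only checks that each row of S lies on the simplex before computing the residual; it does not attempt the fitting procedure or the relation decoder, which the abstract does not specify. The helper names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def is_simplex_row(row, tol=1e-9):
    """True if the row is non-negative and sums to one (within tol)."""
    return all(v >= -tol for v in row) and abs(sum(row) - 1.0) <= tol


def coordinate_residual(X, S, C):
    """Return R = X - S C, requiring every row of S to be a simplex membership."""
    if not all(is_simplex_row(row) for row in S):
        raise ValueError("each row of S must be non-negative and sum to 1")
    SC = matmul(S, C)
    return [[x - y for x, y in zip(x_row, sc_row)] for x_row, sc_row in zip(X, SC)]


# Three vectors in 2-D, two poles. The middle vector sits slightly off the
# segment between the poles, so only its residual is non-zero.
X = [[1.0, 0.0], [0.5, 0.6], [0.0, 1.0]]
S = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
C = [[1.0, 0.0], [0.0, 1.0]]

R = coordinate_residual(X, S, C)
```

Here the first and last vectors coincide with the poles and reconstruct exactly, while the second carries a residual of about 0.1 in its second coordinate.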

2605.05204 2026-05-27 cs.CV

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

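Dengyang Jiang, Xin Jin, Dongyang Liu, Zanyi Wang, Mingzhe Zheng, Ruoyi Du, Xiangpeng Yang, Qilong Wu, Zhen Li, Peng Gao, Harry Yang, Steven Hoi

Affiliations: The Hong Kong University of Science and Technology; Z-Image Team, Alibaba Group; University of California, San Diego; The Chinese University of Hong Kong

AI summary: Proposes D-OPSD, an on-policy self-distillation training paradigm that lets step-distilled diffusion models retain few-step inference during supervised fine-tuning. The model acts as both teacher and student under different contexts (the student sees only text features, the teacher sees multimodal features) and is trained to bring the two predicted distributions together, learning new concepts and styles without sacrificing its few-step capability.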

Comments Project Page: https://vvvvvjdy.github.io/d-opsd/

Abstract

The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying commonly used fine-tuning techniques compromises their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that modern diffusion models that use an LLM/VLM as the encoder can inherit the encoder's in-context capabilities. This enables us to formulate the training as an on-policy self-distillation process. Specifically, during training, we make the model act as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing the original few-step capacity.

2603.04639 2026-05-27 cs.RO cs.AI

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

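Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, Joyce Chai

Affiliations: University of Michigan; Stanford University; Figure AI

AI summary: Proposes the RoboMME benchmark, whose 16 manipulation tasks evaluate the memory of VLA models in long-horizon, history-dependent scenarios, and explores 14 memory-augmented variants built on the π0.5 backbone, finding that the effectiveness of memory representations is highly task-dependent.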

Comments Accepted to ICML 2026

Abstract

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

2412.18084 2026-05-27 cs.AI

Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models

Xuan Lin, Long Chen, Yile Wang, Yangyang Chen, Xiangxiang Zeng

Affiliations: School of Computer Science, Xiangtan University; College of Computer Science and Software Engineering, Shenzhen University; Department of Computer Science, University of Tsukuba; College of Computer Science and Electronic Engineering, Hunan University

AI summary: Proposes the PEIT framework, which uses multimodal-alignment pre-training and instruction tuning to improve LLM performance on molecule captioning, text-based molecule generation, molecular property prediction, and multi-constraint molecule generation.

Comments 9

Abstract

Large language models (LLMs) are widely applied in various natural language processing tasks such as question answering and machine translation. However, due to the lack of labeled data and the difficulty of manual annotation for biochemical properties, the performance for molecule generation tasks is still limited, especially for tasks involving multi-property constraints. In this work, we present a two-step framework PEIT (Property Enhanced Instruction Tuning) to improve LLMs for molecular-related tasks. In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN, by aligning multi-modal representations to synthesize instruction data. In the second step, we fine-tune existing open-source LLMs with the synthesized data; the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks. Experimental results show that our pre-trained PEIT-GEN outperforms MolT5, BioT5, MolCA and Text+Chem-T5 in molecule captioning, demonstrating that the modalities align well between textual descriptions, structures, and biochemical properties. Furthermore, PEIT-LLM shows promising improvements in multi-task molecule generation, demonstrating the effectiveness of the PEIT framework for molecular tasks. The code and appendix are available at https://github.com/chenlong164/PEIT.

2604.27019 2026-05-27 cs.LG cs.CL cs.CR

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

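Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu, Junbin Yang, Haihua Shen, Yijun Yang

Affiliations: University of Chinese Academy of Sciences; Inner Mongolia University of Technology; Tsinghua University; Shandong University

AI summary: Studies how dynamic adversarial fine-tuning changes the refusal-control carriers (low-dimensional subspaces that causally modulate refusal) in safety-aligned language models, finding that R2D2 reorganizes geometry along a robustness-utility frontier without establishing adaptive robustness.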

Abstract

Safety-aligned language models must refuse harmful requests without broad over-refusal, but it remains unclear how dynamic adversarial fine-tuning changes refusal-control carriers: Kullback-Leibler (KL)-constrained directions or small subspaces that causally modulate refusal without large safe-prompt distribution shifts. We study a 7B backbone under supervised fine-tuning (SFT) and Robust Refusal Dynamic Defense (R2D2), aligning HarmBench, StrongREJECT, and XSTest evaluations with five-anchor geometry measurements, causal interventions, and sparse adaptive stress tests. R2D2 drives fixed-source HarmBench attack success to zero at early checkpoints; however, these checkpoints also exhibit maximal XSTest refusal and fail a benign-utility audit. Later checkpoints partially recover utility-facing behavior while reopening attack success, with adaptive GCG attack success rate rising to 0.415 at step 250 and 0.613 at step 500. Internally, R2D2 preserves a late-layer admissible refusal-control carrier through step 100 and then relocates the best admissible carrier to an early layer; SFT relocates earlier yet remains less robust. Effective rank stays near 1.24, and SFT shows larger principal-angle drift, arguing against both dimensional expansion and drift magnitude as sufficient explanations. Causal interventions support a low-dimensional but utility-coupled carrier. These results support a geometry-reorganization account of R2D2 along a robustness-utility frontier, without establishing adaptive robustness.

2601.15891 2026-05-27 cs.CV

RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture

Anas Anwarul Haq Khan, Mariam Husain, Pratik Jalan, Kshitij Jadhav

Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Bombay; Department of Biomedical Engineering, Johns Hopkins University; Koita Centre for Digital Health, Indian Institute of Technology Bombay

AI summary: Proposes RadJEPA, a self-supervised framework requiring no language supervision that is pretrained with a Joint Embedding Predictive Architecture on approximately 840K unlabeled chest X-ray images to predict latent representations of masked regions; it matches or exceeds existing baselines on radiology report generation and other tasks.

Abstract

Vision-language pretraining has driven much of the recent progress in medical image representation learning, but this paradigm is constrained by the availability of paired image-text data and by the reporting bias of clinical narratives. We ask whether competitive radiology encoders can be learned without any language supervision. We introduce RadJEPA, a self-supervised framework built on a Joint Embedding Predictive Architecture and pretrained on approximately 840K unlabeled chest X-ray images. The model learns to predict latent representations of masked target regions from a visible context region, an objective that differs from both image-text contrastive pretraining and DINO-style self-distillation by explicitly modelling conditional structure in representation space. We evaluate RadJEPA primarily on radiology report generation with a frozen Vicuna-7B decoder, and additionally substitute its encoder into four widely used vision-language backbones (MedLLaVA, Qwen-2.5, BLIP-2, and Phi-4). For completeness we also report disease classification and semantic segmentation results. Across two datasets and four metrics, RadJEPA matches or exceeds the strongest image-only and vision-language baselines while using a ViT-B/14 backbone at 224 x 224 resolution.

2605.15477 2026-05-27 cs.CV

EgoExo-WM: Unlocking Exo Video for Ego World Models

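Danny Tran, Roberto Martín-Martín, Kristen Grauman

Affiliations: The University of Texas at Austin

AI summary: Proposes extracting structured body pose from exocentric video and converting that video to egocentric video using a human kinematics prior, so that abundant in-the-wild exocentric data can be used to train egocentric world models, significantly improving prediction quality and downstream planning performance.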

Comments Project Page: https://vision.cs.utexas.edu/projects/EgoExo-WM/

Abstract

Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans' physical actions. In contrast, exocentric video is abundant and reveals body poses well, but it lacks direct alignment with an agent's action space and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.

2605.14473 2026-05-27 cs.CL cs.AI

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

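Yihang Chen, Pin Qian, Su Wang, Sipeng Zhang, Huan Xu, Shuhuai Lin, Xinpeng Wei

Affiliations: Georgia Institute of Technology; Carnegie Mellon University; University of California San Diego

AI summary: Proposes Context-Driven Decomposition (CDD), an inference-time method that probes and intervenes on conflicts between retrieved context and parametric knowledge in retrieval-augmented generation, revealing context-compliance patterns and improving robustness.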

Comments 12 pages, 4 figures, 3 tables

Abstract

The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross-model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families (CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus), but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake-injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines.

2605.11651 2026-05-27 cs.CV cs.AI cs.CL

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

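Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son

Affiliations: KAIST; NVIDIA; POSTECH

AI summary: Proposes a reasoning-prefix masking distillation framework that masks the student's salient reasoning prefixes, forcing it to rely more on visual evidence while reasoning; this mitigates visual forgetting in long reasoning traces and improves multimodal reasoning performance.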

Comments Pre-print

Abstract

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace, as long think-answer traces suffer from visual forgetting issues. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between the teacher and student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.
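A mask that blocks both future tokens and selected earlier reasoning tokens can be sketched as a small boolean matrix. The version below starts from the causal mask and, for each query position, additionally blocks the highest-salience earlier positions up to a fixed budget. The salience scores, the fixed per-row budget, and the choice to exempt visual positions from masking are all assumptions made for illustration; the abstract does not say how salience is measured or how the budget schedule is set, and `salient_prefix_mask` is a hypothetical name.

```python
def salient_prefix_mask(seq_len, salience, budget, protected):
    """Build allowed[i][j]: True if query position i may attend to key position j.

    Starts from the causal mask (j <= i). For each query, up to `budget` of the
    highest-salience strictly earlier keys are then blocked. Positions listed in
    `protected` (assumed here to be visual tokens) are never blocked.
    """
    allowed = []
    for i in range(seq_len):
        row = [j <= i for j in range(seq_len)]
        candidates = [j for j in range(i) if j not in protected]
        candidates.sort(key=lambda j: salience[i][j], reverse=True)
        for j in candidates[:budget]:
            row[j] = False
        allowed.append(row)
    return allowed


# Position 0 stands in for a visual token; positions 1-3 are reasoning tokens.
# salience[i][j] is a hypothetical influence of key j on the prediction at i.
salience = [
    [0.0, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
    [0.2, 0.7, 0.0, 0.0],
    [0.1, 0.3, 0.6, 0.0],
]

mask = salient_prefix_mask(seq_len=4, salience=salience, budget=1, protected={0})
```

With a budget of one, the last position loses access to its most influential reasoning token (position 2) but keeps the visual token, which is the behaviour the abstract motivates: the model must lean on visual evidence when a textual cue is hidden.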

2605.14799 2026-05-27 cs.CV cs.CR cs.SI

Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Xianxun Zhu, Abdenour Hadid

Affiliations: Laboratory of IEMN, CNRS, Centrale Lille, UMR 8520, Univ. Polytechnique Hauts-de-France; Khalifa University; School of Communication and Information Engineering, Shanghai University; Sorbonne Center for Artificial Intelligence, Sorbonne University Abu Dhabi

AI summary: Systematically evaluates Vision Mamba models for AI-generated image detection, comparing them with CNN, ViT, and VLM-based detectors in terms of accuracy, efficiency, and generalization.

Abstract

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

2605.14664 2026-05-27 cs.CV

MiVE: Multiscale Vision-language features for reference-guided video Editing

Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu

Affiliations: MT Lab, Meitu Inc., Beijing 100083, China; Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China; Beijing University of Posts and Telecommunications

AI summary: Proposes MiVE, a framework that feeds multiscale hierarchical VLM features (early layers preserve spatial detail, deeper layers encode global semantics) into a unified self-attention Diffusion Transformer, addressing the modality gap and the loss of fine-grained information and reaching state-of-the-art performance in reference-guided video editing.

Comments ICML 2026

英文摘要

Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.

2605.14480 2026-05-27 cs.CL

Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ

《会同馆华夷译语》中的跨语言转写与音系表征

Ji-eun Kim

发表机构 * Department of Korean Language and Literature, Duksung Women’s University(德成女子大学韩国语言文学系)

AI总结 本研究将《会同馆华夷译语》视为一个连贯的多语言转写系统,通过数字化和音系分析,揭示了其主要转写和补充转写的跨语言规律,并论证了该系统作为历史音系证据的价值。

Comments 49 pages; 1 figure; 40 tables; SLE2019; under review

详情
AI中文摘要

目的:本研究调查《会同馆华夷译语》(HHY)的转写原则,该系列多语词汇集由明朝政府在15至16世纪间编纂,用于译员培训。本研究不将HHY视为孤立语言材料的集合,而是将其视为一个连贯的多语言转写系统,通过汉字表征非汉语语言的口语形式。方法:将HHY的绝大部分数字化,并与汉语音韵范畴对齐。对先前各语言部分的重建进行批判性审查,并整合到一个统一的比较数据库中。分析聚焦于八个语言部分中主要转写(MT)和补充转写(ST)的跨语言规律。结果:MT通常表征与当时汉语音节结构兼容的音,而ST主要编码与汉语音系兼容性较差的语音特征。分析进一步表明,汉语音韵范畴在外语转写中的使用比先前假设的更为灵活。因此,HHY作为一种相对系统的语音近似方法,而非汉语音系对非汉语语言的直接投射。结论:HHY可被分析为一个内部结构化的转写系统,而不仅仅是词汇集的集合。更广泛地说,该研究表明历史转写系统可为历史音系学提供宝贵证据,尤其对于历史记录有限的亚洲语言。

英文摘要

Purpose: This study investigates the transcription principles underlying Huìtóngguǎnxì Huáyíyìyǔ (HHY), a series of multilingual glossaries compiled by the Ming government between the fifteenth and sixteenth centuries for interpreter training. The study treats HHY not as a collection of isolated language materials, but as a coherent multilingual transcription system representing spoken forms of non-Chinese languages through Chinese characters. Methods: A substantial portion of HHY was digitized and aligned with Chinese phonological categories. Previous reconstructions of individual language sections were critically reviewed and integrated into a unified comparative database. The analysis focuses on cross-linguistic regularities in Main Transcription (MT) and Supplementary Transcription (ST) across eight language sections. Results: MT generally represents sounds compatible with the Chinese syllable structure of the period, whereas ST mainly encodes phonetic features less compatible with Chinese phonology. The analysis further shows that Chinese phonological categories were used more flexibly in foreign-language transcription than previously assumed. HHY therefore functioned as a relatively systematic method of phonetic approximation rather than a direct projection of Chinese phonology onto non-Chinese languages. Conclusion: HHY can be analyzed as an internally structured transcription system rather than merely as a collection of glossaries. More broadly, the study demonstrates that historical transcription systems can provide valuable evidence for historical phonology, particularly for under-documented Asian languages with limited historical records.

2605.13779 2026-05-27 cs.LG cs.AI cs.DC

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

MinT:用于训练和服务数百万LLM的托管基础设施

Mind Lab: Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao, Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Zhihui Li, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Sueky Zhang, Ya Zhang, Wei Zhao, Ada Zhou, Changhai Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang

发表机构 * Mind Lab

AI总结 提出MinT系统,通过LoRA适配器管理实现大规模基础模型上的高效训练与在线服务,支持百万级策略目录。

Comments 30 pages, technical report

详情
AI中文摘要

我们提出MindLab Toolkit (MinT),一个用于低秩适配(LoRA)后训练和在线服务的托管基础设施系统。MinT针对这样一种场景:在少量昂贵的基础模型部署之上产生大量训练好的策略。MinT不将每个策略物化为合并后的完整检查点,而是让基础模型保持驻留,并让导出的LoRA适配器修订版依次经过采样(rollout)、更新、导出、评估、服务和回滚(rollback)等阶段,将分布式训练、服务、调度和数据移动隐藏在服务接口之后。MinT沿三个维度扩展这一流程。Scale Up将LoRA RL扩展到前沿规模的密集和MoE架构,包括MLA和DSA注意力路径,训练和服务已在超过1T总参数的规模上得到验证。Scale Down仅移动导出的LoRA适配器,在秩1设置下其大小可低于基础模型的1%;仅移交适配器使实测步骤耗时在4B密集模型上缩短18.3倍,在30B MoE上缩短2.85倍,而并发多策略GRPO将挂钟时间缩短1.77倍和1.45倍,且不提高峰值内存。Scale Out将持久的策略可寻址性与CPU/GPU工作集解耦:张量并行部署支持10^6规模的可寻址目录(单引擎扫描实测至100K)以及集群规模下上千适配器的活跃波次,冷加载被视为可调度的服务工作,打包的MoE LoRA张量将在线引擎加载速度提高8.5-8.7倍。因此,MinT能够管理百万规模的LoRA策略目录,同时在共享的1T级基础模型上训练和服务选定的适配器修订版。

英文摘要

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
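
The "under 1% of base-model size in rank-1 settings" figure can be sanity-checked with the standard LoRA parameter count: a rank-r adapter on a d_out x d_in weight matrix adds r * (d_in + d_out) parameters. The sketch below is only back-of-the-envelope arithmetic; the 4096 x 4096 layer shape is a hypothetical choice, not a dimension reported for MinT.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by a LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter size relative to the dense weight matrix it adapts."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


# Hypothetical square projection layer; real layer shapes vary by model.
rank1_fraction = lora_fraction(4096, 4096, rank=1)
print(f"rank-1 adapter / dense weight = {rank1_fraction:.4%}")
```

Under these assumptions the ratio covers only the adapted matrix; a whole-model ratio also depends on which layers carry adapters.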

2605.13455 2026-05-27 cs.CV

Bayesian In Vivo Tracking of Synapses using Joint Poisson Deconvolution and Diffeomorphic Registration

使用联合泊松反卷积和微分同胚配准的贝叶斯体内突触追踪

Shashwat Kumar, Dominic M. Padova, Binish Narang, Gabrielle I. Coste, Austin R. Graves, Richard L. Huganir, Adam S. Charles, Michael I. Miller, Anuj Srivastava

发表机构 * Department of Biomedical Engineering, Johns Hopkins University(约翰霍普金斯大学生物医学工程系) Department of Neuroscience, Johns Hopkins University(约翰霍普金斯大学神经科学系) Kavli Neuroscience Discovery Institute, Johns Hopkins University(约翰霍普金斯大学Kavli神经科学发现研究所) Data Science and AI Institute, Johns Hopkins University(约翰霍普金斯大学数据科学与人工智能研究所) Department of Applied Mathematics and Statistics, Johns Hopkins University(约翰霍普金斯大学应用数学与统计学系)

AI总结 提出一种基于模板的贝叶斯框架,通过联合泊松反卷积和微分同胚配准,同时实现突触检测、去噪、荧光强度推断、组织运动校正和置信区间估计,用于低信噪比体内显微镜数据中的突触追踪。

详情
AI中文摘要

突触是密集排列的亚微米结构,在学习和记忆形成过程中动态重组。纵向体内成像荧光标记的突触受体为研究大规模突触动力学以及这些过程在神经疾病中如何被破坏提供了有希望的机会。然而,使用双光子显微镜的体内成像采用低激光功率,因此受到低信噪比和高散粒噪声、天与天之间的非线性组织运动、突触荧光的非平稳波动以及显微镜点扩散函数引起的显著模糊的影响。这些因素共同使得检测和追踪突触变得具有挑战性,尤其是在突触密度高的区域。本文提出了一种新颖的基于模板的框架,将突触建模为在非线性组织变形下移动的可变亮度点源。采用统一的贝叶斯方法,我们通过推导一个后验分布来将该模型应用于显微镜数据,该后验分布包含用于域扭曲的微分同胚映射、用于成像过程的高斯点扩散函数以及用于原始光子计数的泊松观测模型。贝叶斯解决方案同时:(1) 构建突触位置的概率模板,(2) 对图像数据进行去噪和反卷积,(3) 推断荧光强度,(4) 执行微分同胚图像配准以校正组织运动,以及(5) 为这些参数估计提供置信区域。我们在一个2D+t模拟数据集和一个在小鼠两周内成像的荧光突触的3D+t纵向体内显微镜数据集上展示了该框架。

英文摘要

Synapses are densely packed submicron structures that dynamically reorganize during learning and memory formation. Longitudinal in vivo imaging of fluorescently tagged synaptic receptors offers a promising opportunity to study large-scale synaptic dynamics and how these processes are disrupted in neurological disease. However, in vivo imaging with 2-photon microscopy uses low laser power and therefore suffers from low signal-to-noise ratio (SNR) and high shot noise, nonlinear tissue motion between days, nonstationary fluctuations in synaptic fluorescence, and significant blur induced by the microscope point spread function (PSF). Together, these factors make it challenging to detect and track synapses, especially in regions with high synaptic density. This paper presents a novel template-based framework for modeling synapses as varying luminance point sources that move under a nonlinear tissue deformation. Taking a unified Bayesian approach, we apply this model to microscopy data by deriving a posterior that incorporates a diffeomorphic mapping for domain warping, a Gaussian point spread function for the imaging process, and a Poisson observation model for raw photon counts. The Bayesian solution simultaneously: (1) constructs a probabilistic template of synapse locations, (2) denoises and deconvolves the image data, (3) infers fluorescence intensities, (4) performs diffeomorphic image registration to correct for tissue motion, and (5) provides confidence regions for these parameter estimates. We demonstrate the framework on both a 2D+t simulated dataset and a 3D+t longitudinal in vivo microscopy dataset of fluorescent synapses imaged in a mouse over two weeks.
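
To make the observation model concrete, here is a minimal 1D sketch of a point source blurred by a Gaussian PSF and scored with a Poisson log-likelihood. It covers only that one ingredient: there is no diffeomorphic registration, no prior, and no posterior inference, and every number (amplitude, PSF width, background, grid size) is a hypothetical choice rather than a value from the paper.

```python
import math


def expected_counts(mu, amplitude=50.0, sigma=1.5, background=0.5, n_pixels=21):
    """Poisson rate per pixel: one point source at `mu` blurred by a Gaussian PSF."""
    return [
        background + amplitude * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
        for x in range(n_pixels)
    ]


def poisson_log_likelihood(counts, rates):
    """Sum over pixels of log Poisson(y | lambda)."""
    return sum(
        y * math.log(lam) - lam - math.lgamma(y + 1)
        for y, lam in zip(counts, rates)
    )


true_mu = 10
# Noise-free stand-in for photon counts: the true rates rounded to integers.
counts = [round(lam) for lam in expected_counts(true_mu)]

# Score every candidate source position on the pixel grid.
scores = {mu: poisson_log_likelihood(counts, expected_counts(mu)) for mu in range(21)}
best_mu = max(scores, key=scores.get)
print(best_mu)
```

In this toy setting the likelihood peaks at the true source position. The framework described above additionally has to infer intensities, the deformation, and uncertainty jointly.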

2604.22546 2026-05-27 cs.CV

ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

ReLIC-SGG: 开放词汇场景图生成的关系格补全

Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang

发表机构 * Amirkabir University of Technology(阿米尔卡比尔理工大学)

AI总结 针对开放词汇场景图生成中标注不完整导致大量有效关系被误判为负例的问题,提出ReLIC-SGG框架,通过构建语义关系格建模谓词间的相似、蕴含和矛盾关系,将未标注关系视为潜在变量而非确定负例,结合视觉-语言兼容性、图上下文和语义一致性推断缺失正关系,并采用正-无标记图学习减少假负例监督,格引导解码生成紧凑且语义一致的场景图。

Comments Some errors in the experimental sections

详情
AI中文摘要

开放词汇场景图生成(SGG)旨在用超越固定谓词集的灵活关系短语描述视觉场景。现有方法通常将标注的三元组视为正例,所有未标注的对象-对关系视为负例。然而,场景图标注本质上是不完整的:许多有效关系缺失,且同一交互可以以不同粒度描述,例如 on、standing on、resting on 和 supported by。由于开放词汇SGG的关系空间更大,这一问题变得更加严重。我们提出ReLIC-SGG,一种关系不完整性感知框架,将未标注关系视为潜在变量而非确定负例。ReLIC-SGG构建语义关系格来建模开放词汇谓词间的相似性、蕴含和矛盾关系,并利用它从视觉-语言兼容性、图上下文和语义一致性中推断缺失的正关系。正-无标记图学习目标进一步减少假负例监督,而格引导解码生成紧凑且语义一致的场景图。在常规、开放词汇和全景SGG基准上的实验表明,ReLIC-SGG改进了稀有和未见谓词的识别,并更好地恢复了缺失关系。

英文摘要

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible relation phrases beyond a fixed predicate set. Existing methods usually treat annotated triplets as positives and all unannotated object-pair relations as negatives. However, scene graph annotations are inherently incomplete: many valid relations are missing, and the same interaction can be described at different granularities, e.g., on, standing on, resting on, and supported by. This issue becomes more severe in open-vocabulary SGG due to the much larger relation space. We propose ReLIC-SGG, a relation-incompleteness-aware framework that treats unannotated relations as latent variables rather than definite negatives. ReLIC-SGG builds a semantic relation lattice to model similarity, entailment, and contradiction among open-vocabulary predicates, and uses it to infer missing positive relations from visual-language compatibility, graph context, and semantic consistency. A positive-unlabeled graph learning objective further reduces false-negative supervision, while lattice-guided decoding produces compact and semantically consistent scene graphs. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that ReLIC-SGG improves rare and unseen predicate recognition and better recovers missing relations.

2509.09544 2026-05-27 cs.CL

MetaGraph: A Large-Scale Meta-Analysis of GenAI in Financial NLP (2022-2025)

MetaGraph:金融NLP中GenAI的大规模元分析(2022-2025)

Paolo Pedinotti, Peter Baumann, Nathan Jessurun, Leslie Barrett, Enrico Santus

发表机构 * Bloomberg(彭博)

AI总结 提出MetaGraph方法,利用本体引导的LLM从科学语料中提取类型化知识图谱,对681篇GenAI在金融领域的论文进行结构化趋势分析,揭示了三个阶段:早期LLM驱动的任务和数据集扩展、对局限性和风险的日益关注、以及向模块化系统导向方法的转变。

Comments 8 pages, appendices, GEM, ACL

详情
AI中文摘要

自2022年底以来,金融NLP迅速发展,超越了叙述性综述。我们引入了MetaGraph,一种使用本体引导的LLM从科学语料中提取类型化知识图谱的方法,以实现结构化的大规模趋势分析。应用于681篇关于金融领域GenAI的论文(2022-2025),MetaGraph揭示了三个阶段:早期LLM驱动的任务和数据集扩展、对局限性和风险的日益关注、以及向模块化系统导向方法(如检索增强设计)的转变。我们发布了生成的资源和工件,以支持可重复的元分析和未来对该领域的监测。

英文摘要

Financial NLP has evolved rapidly since late 2022, outpacing narrative surveys. We introduce MetaGraph, a methodology for extracting typed knowledge graphs from scientific corpora using ontology-guided LLM extraction to enable structured, large-scale trend analysis. Applied to 681 papers on GenAI in Finance (2022-2025), MetaGraph reveals three phases: early LLM-driven expansion of tasks and datasets, growing emphasis on limitations and risk, and a shift toward modular, system-oriented methods (e.g., retrieval-augmented designs). We release the resulting resource and artifacts to support reproducible meta-analysis and future monitoring of the field.

2605.12271 2026-05-27 cs.CV

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

超越文本提示:视觉到视觉生成作为统一范式

Yaofang Liu, Kangning Cui, Meng Chu, Zhaoqing Li, Suiyun Zhang, Jean-Michel Morel, Xiaodong Cun, Haoxuan Che, Rui Liu, Raymond H. Chan

发表机构 * City University of Hong Kong(香港城市大学) City University of Hong Kong (Dongguan)(香港城市大学(东莞)) The Hong Kong University of Science and Technology(香港科技大学) The Chinese University of Hong Kong(香港中文大学) Celia Research HK(Celia研究香港) Great Bay University(大湾区大学) Lingnan University(岭南大学)

AI总结 提出视觉到视觉(V2V)生成范式及无需训练的V2V-Zero框架,通过利用视觉页面隐藏状态替代文本条件,在多个任务上达到或接近优化后的文本到图像性能。

Comments Project Page: https://yaofang-liu.github.io/V2V_Web

详情
AI中文摘要

人类通常通过视觉制品(如排版表、草图、参考图像和标注场景)来指定和创作。然而,现代视觉生成器仍然要求用户将这种意图序列化为文本,这一瓶颈压缩了空间结构、精确外观和字形形状等信号。我们提出视觉到视觉(V2V)生成,其中用户使用视觉规范页面(而非文本提示)来条件化生成模型。该页面不是编辑目标,而是指定所需输出的视觉文档。我们引入V2V-Zero,一个无需训练的框架,通过用从视觉页面提取的最终层隐藏状态替换纯文本条件,在现有的视觉语言模型(VLM)条件化生成器中暴露此接口,利用了冻结的VLM已将文本和图像映射到生成器条件空间的事实。在GenEval上,V2V-Zero使用冻结的Qwen-Image骨干网络达到0.85,接近其优化后的文本到图像性能而无需微调。为评估更广泛的V2V空间,我们引入Simple-V2V Bench,涵盖七个视觉条件化任务和七个模型,包括GPT Image 2、Nano Banana 2、Seedream 5.0 Lite、开源权重基线和视频扩展。V2V-Zero得分为32.7/100,优于评估的开源图像基线,并揭示了清晰的能力层次:属性绑定强,内容生成不可靠,结构控制即使对商业系统也困难。HunyuanVideo-1.5扩展得分为20.2/100,表明该接口可迁移到图像之外。机制分析显示默认推理路径主要通过视觉路由,95.0%的条件化token注意力集中在视觉页面隐藏状态上。

英文摘要

Humans often specify and create through visual artifacts: typography sheets, sketches, reference images, and annotated scenes. Yet modern visual generators still ask users to serialize this intent into text, a bottleneck that compresses signals like spatial structure, exact appearance, and glyph shape. We propose visual-to-visual (V2V) generation, in which the user conditions a generative model with a visual specification page rather than a text prompt. The page is not an edit target, but a visual document that specifies the desired output. We introduce V2V-Zero, a training-free framework that exposes this interface in existing vision-language model (VLM) conditioned generators by replacing text-only conditioning with final-layer hidden states extracted from visual pages, exploiting the fact that the frozen VLM already maps both text and images into the generator's conditioning space. On GenEval, V2V-Zero reaches 0.85 with a frozen Qwen-Image backbone, closely matching its optimized text-to-image performance without fine-tuning. To evaluate the broader V2V space, we introduce Simple-V2V Bench, spanning seven visual-conditioning tasks and seven models, including GPT Image 2, Nano Banana 2, Seedream 5.0 Lite, open-weight baselines, and a video extension. V2V-Zero scores 32.7/100, outperforming evaluated open-weight image baselines and revealing a clear capability hierarchy: attribute binding is strong, content generation is unreliable, and structural control remains hard even for commercial systems. A HunyuanVideo-1.5 extension scores 20.2/100, showing the interface transfers beyond images. Mechanistic analysis shows the default reasoning path is primarily visually routed, with 95.0% of conditioning-token attention mass on visual-page hidden states.

2605.11867 2026-05-27 cs.CV

When Brains Disagree: Biological Ambiguity Underlies the Challenge of Amyloid PET Synthesis from Structural MRI

当大脑存在分歧:生物模糊性是结构MRI合成淀粉样蛋白PET挑战的基础

Louise E. G. Baron, Ross Callaghan, David M. Cash, Philip S. J. Weston, Hojjat Azadbakht, Hui Zhang

发表机构 * Hawkes Institute, University College London, UK(霍克斯研究所,伦敦大学学院,英国) Department of Medical Physics and Biomedical Engineering, University College London, UK(医学物理与生物医学工程系,伦敦大学学院,英国) AINOSTICS Ltd, Manchester, UK(AINOSTICS有限公司,曼彻斯特,英国) Dementia Research Centre, UCL Queen Square Institute of Neurology, University College London, UK(痴呆症研究中心,伦敦大学学院女王广场神经病学研究所,英国) UK Dementia Research Institute, London, UK(英国痴呆症研究所,伦敦,英国) Department of Computer Science, University College London, UK(计算机科学系,伦敦大学学院,英国)

AI总结 通过控制实验证明,结构MRI到淀粉样蛋白PET合成性能受限的根本原因是生物模糊性(MRI与PET测量时间解耦的病理过程),而非模型架构能力,并表明引入血浆生物标志物等多模态信息可解决该问题。

Comments MICCAI 2026 accepted paper (no rebuttal)

详情
AI中文摘要

结构MRI到淀粉样蛋白PET合成已被提出作为阿尔茨海默病(AD)中淀粉样蛋白评估的非侵入性替代方法。然而,相同模型的报告性能在不同研究中差异很大,且日益复杂的架构并未带来一致的提升。这种不一致性被认为是由基本的生物模糊性引起的:MRI捕捉神经退行性变,而PET测量淀粉样蛋白病理——这两个过程在AD中常常在时间上解耦。因此,相似的MRI模式可能对应不同的淀粉样蛋白状态,产生模糊的一对多映射。因此,MRI到淀粉样蛋白PET合成可能本质上是病态的;然而,这一想法尚未得到科学验证。本工作的目的是通过两个控制实验来检验这一假设。我们首先通过根据淀粉样蛋白和神经退行性变状态对配对的MRI-PET数据进行分层来控制训练分布。在控制设计下使用两种标准合成模型,我们表明生物学上明确的映射可以单独学习,但当引入数据模糊性时性能崩溃。这表明数据分布中的模糊性(而非架构容量)限制了性能。其次,我们表明引入血浆生物标志物形式的正交生物学信息可以解决这种模糊性。当整合多模态输入时,性能提高且稳定性恢复。总之,这些发现表明MRI到淀粉样蛋白PET合成中有限且不一致的性能是由内在的生物模糊性解释的,稳定、有意义的进展需要多模态整合而非架构复杂性。

英文摘要

Structural MRI-to-amyloid PET synthesis has been proposed as a non-invasive alternative for amyloid assessment in Alzheimer's disease (AD). However, reported performance of identical models varies widely across studies, and increasingly complex architectures have not led to consistent gains. This inconsistency is thought to be caused by a fundamental biological ambiguity: MRI captures neurodegeneration, while PET measures amyloid pathology - two processes that are often temporally decoupled in AD. As a result, similar MRI patterns may correspond to different amyloid states, creating ambiguous one-to-many mappings. MRI-to-amyloid PET synthesis may therefore be intrinsically ill-posed; however, this idea has yet to be tested scientifically. The aim of this work is to test this hypothesis through two controlled experiments. We first control the training distribution by stratifying paired MRI-PET data by amyloid and neurodegeneration status. Using two standard synthesis models under a controlled design, we show that biologically unambiguous mappings are learnable in isolation, but performance collapses when data ambiguity is introduced. This demonstrates that ambiguity in the data distribution, rather than architectural capacity, constrains performance. Second, we show that introducing orthogonal biological information in the form of plasma biomarkers resolves this ambiguity. When multimodal inputs are incorporated, performance improves and stability is restored. Together, these findings suggest that limited and inconsistent performance in MRI-to-amyloid PET synthesis is explained by intrinsic biological ambiguity, and that stable, meaningful progress requires multimodal integration rather than architectural complexity.

2605.06152 2026-05-27 cs.LG cs.CL math.OC stat.ML

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

Grokking 还是 Glitching?低精度如何驱动 Slingshot 损失尖峰

Liu Hanqing, Jianjun Cao, Yuanze Li, Zijian Zhou

发表机构 * Tsinghua University(清华大学) The University of Tokyo(东京大学)

AI总结 本文证明深度神经网络训练中的 Slingshot 损失尖峰现象是由浮点精度限制导致的数值特征膨胀(NFI)机制引起的,并解释了参数范数快速增长和梯度消失等现象。

Comments 28 pages, 13 figures; ICML 2026 Workshop on High-dimensional Learning Dynamics (Spotlight)

详情
AI中文摘要

深度神经网络在无正则化的长期训练中会出现周期性的损失尖峰,这种现象被称为“Slingshot 机制”。现有工作通常将其归因于内在的优化动力学,但其触发机制仍不清楚。本文证明这种现象是浮点算术精度限制的结果。当训练进入高置信度阶段时,正确类别的 logit 与其他 logit 之间的差异可能超过吸收误差阈值。然后在反向传播中,正确类别的梯度被精确舍入为零,而错误类别的梯度保持非零。这打破了跨类别的梯度零和约束,并在分类器层的参数更新中引入了系统性漂移。我们证明这种漂移与特征形成正反馈循环,导致全局分类器均值和全局特征均值呈指数增长。我们将这种机制称为数值特征膨胀(NFI)。该机制解释了 Slingshot 尖峰前的快速范数增长、随后梯度的重新出现以及由此产生的损失尖峰。我们进一步表明,NFI 并不等同于观察到的损失尖峰:在更实际的任务中,部分吸收可能不会产生可见的尖峰,但它仍然可以打破零和约束并驱动参数范数的快速增长。我们的结果将 Slingshot 重新解释为有限精度训练的一种数值动力学,并为训练后期异常参数增长和 logit 发散提供了可检验的解释。

英文摘要

Deep neural networks exhibit periodic loss spikes during unregularized long-term training, a phenomenon known as the "Slingshot Mechanism." Existing work usually attributes this to intrinsic optimization dynamics, but its triggering mechanism remains unclear. This paper proves that this phenomenon is a result of floating-point arithmetic precision limits. As training enters a high-confidence stage, the difference between the correct-class logit and the other logits may exceed the absorption-error threshold. Then during backpropagation, the gradient of the correct class is rounded exactly to zero, while the gradients of the incorrect classes remain nonzero. This breaks the zero-sum constraint of gradients across classes and introduces a systematic drift in the parameter update of the classifier layer. We prove that this drift forms a positive feedback loop with the feature, causing the global classifier mean and the global feature mean to grow exponentially. We call this mechanism Numerical Feature Inflation (NFI). This mechanism explains the rapid norm growth before a Slingshot spike, the subsequent reappearance of gradients, and the resulting loss spike. We further show that NFI is not equivalent to an observed loss spike: in more practical tasks, partial absorption may not produce visible spikes, but it can still break the zero-sum constraint and drive rapid growth of parameter norms. Our results reinterpret Slingshot as a numerical dynamic of finite-precision training, and provide a testable explanation for abnormal parameter growth and logit divergence in late-stage training.
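
The absorption effect described above can be reproduced in a few lines. This is a minimal sketch, assuming a three-class softmax cross-entropy whose logit gradient is p - onehot; the logit gap of 20 is a hypothetical value picked so that exp(-20) falls below float32 rounding at 1.0, and it is not a threshold taken from the paper.

```python
import numpy as np


def ce_logit_grad(logits, label, dtype):
    """Gradient of softmax cross-entropy w.r.t. the logits (p - onehot), computed in `dtype`."""
    z = np.asarray(logits, dtype=dtype)
    z = z - z.max()          # usual max-subtraction for numerical stability
    e = np.exp(z)
    p = e / e.sum()          # in float32 the tiny terms are absorbed: the sum rounds to 1.0
    g = p.copy()
    g[label] -= 1            # subtract the one-hot target
    return g


logits = [20.0, 0.0, 0.0]    # confident, correct prediction for class 0
g32 = ce_logit_grad(logits, 0, np.float32)
g64 = ce_logit_grad(logits, 0, np.float64)

print("float32 grads:", g32, "sum:", g32.sum())
print("float64 grads:", g64, "sum:", g64.sum())
```

In float32 the correct-class gradient comes out as exactly zero while the incorrect-class gradients stay positive, so the per-sample gradients no longer sum to zero. In float64 the same inputs keep the sum at zero up to rounding.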

2604.22274 2026-05-27 cs.CV

CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

CAGE-SGG:用于开放词汇场景图生成的反事实主动图证据

Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen

发表机构 * Institute of Intelligent Vision and Embodied Cognition(智能视觉与具身认知研究院)

AI总结 提出基于反事实关系验证的开放词汇场景图生成框架,通过分解谓词为软证据基并使用反事实验证器确保关系有视觉证据支持,从而提升可靠性、可解释性和泛化能力。

Comments This manuscript has been withdrawn by the authors because we found a methodological flaw in the formulation and evaluation of the proposed approach. The issue affects the reliability of the experimental results and the conclusions drawn from them. Therefore, the authors consider the current version unsuitable for citation or further use

详情
AI中文摘要

开放词汇场景图生成(SGG)旨在用超出固定谓词词汇表的灵活且细粒度的关系短语描述视觉场景。虽然最近的视觉语言模型极大地扩展了SGG的语义覆盖范围,但它们也引入了一个关键的可信性问题:预测的关系可能由语言先验或对象共现驱动,而非基于视觉证据。在本文中,我们提出了一种基于反事实关系验证的证据充分的开放词汇SGG框架。我们的方法不是直接接受合理的关系提议,而是验证每个候选关系是否得到关系特定的视觉、几何和上下文证据的支持。具体来说,我们首先使用视觉语言提议器生成开放词汇关系候选,然后将谓词短语分解为软证据基,如支撑、接触、包含、深度和状态。关系条件证据编码器提取谓词相关线索,而反事实验证器测试当必要证据被移除时关系分数是否下降,并在无关扰动下保持稳定。我们进一步引入矛盾感知谓词学习和图级偏好优化,以改进细粒度区分和全局图一致性。在常规、开放词汇和全景SGG基准上的实验表明,我们的方法一致地改进了标准召回率指标、未见谓词泛化和反事实基础质量。这些结果表明,从关系生成转向关系验证可产生更可靠、可解释且基于证据的场景图。

英文摘要

Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary evidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.

2511.15572 2026-05-27 cs.CV

From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

从逐图像低秩到编码不匹配:重新思考视觉Transformer中的特征蒸馏

Huiyuan Tian, Bonan Xu, Shijian Li

发表机构 * Zhejiang University(浙江大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文通过发现编码不匹配现象,提出Lift或WideLast两种简单修复方法,显著提升视觉Transformer特征蒸馏在压缩场景下的性能。

Comments 22 pages, 22 figures. Accepted at ICML 2026

详情
AI中文摘要

特征图知识蒸馏(KD)在规模相当的视觉Transformer(ViT)之间能很好地传递内部表示,但在压缩场景下常常失败。我们重新审视这一失败并揭示了一个悖论。逐样本SVD表明每个图像高度可压缩,这似乎暗示一个带有线性投影器的窄学生网络“原则上”应该匹配教师网络。然而,数据集层面的视图与这一直觉相矛盾:PCA表明教师网络是低秩子空间的并集,且不同输入间存在显著的子空间旋转。我们进一步引入token级别的频谱能量模式(SEP),发现一个架构无关的编码定律:即使token存在于低秩子空间中,它们也会在通道模式上广泛分布能量,造成带宽不匹配。我们将这一组合现象称为编码不匹配。我们提出两种最小修复方法:Lift或WideLast。(i)Lift在推理时保留一个轻量级的提升投影器以提供更宽的通道,或(ii)WideLast仅加宽学生网络的最后一个块,实现输入依赖的扩展。在ImageNet-1K上,这些修复方法复兴了ViT压缩的特征KD,将从CaiT-S24蒸馏的DeiT-Tiny的top-1准确率从74.86%提升至77.53%/78.23%,并且也增强了未经蒸馏训练的学生网络。我们的分析阐明了特征图KD何时以及为何失败,以及如何修复。代码和原始数据见https://github.com/thy960112/From-Per-Image-Low-Rank-to-Encoding-Mismatch。

英文摘要

Feature-map knowledge distillation (KD) transfers internal representations well between comparably sized Vision Transformers (ViTs), but it often fails in compression. We revisit this failure and uncover a paradox. Sample-wise SVD shows that each image is highly compressible, which seems to suggest that a narrow student with a linear projector should match the teacher "in principle". However, a dataset-level view contradicts this intuition: PCA shows that the teacher is a union of low-rank subspaces with significant subspace rotation across inputs. We further introduce token-level Spectral Energy Patterns (SEP) and find an architecture-invariant encoding law: tokens spread energy broadly across channel modes even when they live in a low-rank subspace, creating a bandwidth mismatch. We refer to this combined phenomenon as an encoding mismatch. We propose two minimal remedies: (i) Lift retains a lightweight lifting projector at inference to provide a wider channel, and (ii) WideLast widens only the student's last block, enabling an input-dependent expansion. On ImageNet-1K, these fixes revive feature KD for ViT compression, improving DeiT-Tiny distilled from CaiT-S24 from 74.86% to 77.53%/78.23% top-1 accuracy, and they also strengthen students trained without distillation. Our analyses clarify when and why feature-map KD fails and how to fix it. Code and raw data are provided in https://github.com/thy960112/From-Per-Image-Low-Rank-to-Encoding-Mismatch.
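
The paradox between the per-image view and the dataset-level view can be shown with synthetic data. The sketch below is a toy illustration, not the paper's analysis: each "image" is a random token-by-channel matrix of rank 2, and all sizes (8 tokens, 16 channels, 6 images) are hypothetical. Each image is low-rank on its own, yet the stacked features span a much larger subspace because every image uses a different one.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, R, N = 8, 16, 2, 6  # tokens, channels, per-image rank, images (all hypothetical)


def make_image_features():
    """Random (T x D) feature map confined to its own R-dimensional channel subspace."""
    return rng.standard_normal((T, R)) @ rng.standard_normal((R, D))


images = [make_image_features() for _ in range(N)]

# Per-image view: every feature map is rank R.
per_image_ranks = [int(np.linalg.matrix_rank(x)) for x in images]

# Dataset-level view: stack all tokens and measure the rank of the union.
dataset_rank = int(np.linalg.matrix_rank(np.vstack(images)))

print(per_image_ranks, dataset_rank)
```

A student only 2 channels wide could represent any single image here, but not all six at once. That is the intuition behind keeping extra width, though the toy says nothing about how Lift or WideLast are implemented.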

2509.26469 2026-05-27 cs.LG

DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick

DiVeQ: 使用重参数化技巧的可微分向量量化

Mohammad Hassan Vali, Tom Bäckström, Arno Solin

发表机构 * ELLIS Institute Finland & Department of Computer Science, Aalto University, Finland(芬兰ELLIS研究所及阿尔托大学计算机科学系) Department of Information and Communications Engineering, Aalto University, Finland(芬兰阿尔托大学信息与通信工程系)

AI总结 提出DiVeQ方法,通过重参数化技巧将量化视为添加模拟量化失真的误差向量,实现前向传播硬量化而梯度可流动,并引入空间填充变体SF-DiVeQ减少量化误差并充分利用码本,在VQ-VAE、VQGAN和DAC任务中提升重建质量和样本质量。

详情
AI中文摘要

向量量化在深度模型中很常见,但其硬分配会阻止梯度传播并阻碍端到端训练。我们提出DiVeQ,将量化视为添加一个模拟量化失真的误差向量,保持前向传播为硬量化的同时让梯度流动。我们还提出一种空间填充变体(SF-DiVeQ),将输入分配到由码字间连线构成的曲线上,从而减少量化误差并充分利用码本。两种方法均无需辅助损失或温度调度即可实现端到端训练。在VQ-VAE图像压缩、VQGAN图像生成和DAC语音编码任务中,我们的方法在不同数据集上相比其他量化方法提高了重建质量和样本质量。

英文摘要

Vector quantization is common in deep models, yet its hard assignments block gradients and hinder end-to-end training. We propose DiVeQ, which treats quantization as adding an error vector that mimics the quantization distortion, keeping the forward pass hard while letting gradients flow. We also present a space-filling variant (SF-DiVeQ) that assigns input to a curve constructed by the lines connecting codewords, resulting in less quantization error and full codebook usage. Both methods train end-to-end without requiring auxiliary losses or temperature schedules. In VQ-VAE image compression, VQGAN image generation, and DAC speech coding tasks across various data sets, our proposed methods improve reconstruction and sample quality over alternative quantization approaches.

2601.16578 2026-05-27 cs.RO cs.SY eess.SY

Zero-Shot MARL Benchmark in the Cyber-Physical Mobility Lab

Cyber-Physical Mobility Lab中的零样本多智能体强化学习基准测试

Julius Beerwerth, Jianye Xu, Simon Schäfer, Fynn Belderink, Bassam Alrifaee

发表机构 * Cyber-Physical Mobility Lab(信息物理移动实验室) University of the Bundeswehr Munich(慕尼黑联邦国防军大学) RWTH Aachen University(亚琛工业大学)

AI总结 本文基于Cyber-Physical Mobility Lab构建了一个可复现的基准测试平台,用于评估联网自动驾驶汽车多智能体强化学习策略的仿真到现实迁移,并揭示了性能下降的两个互补来源。

详情
AI中文摘要

我们提出了一个可复现的基准测试,用于评估联网自动驾驶汽车(CAV)的多智能体强化学习(MARL)策略的仿真到现实迁移。该平台基于Cyber-Physical Mobility Lab(CPM Lab)[1],集成了仿真、高保真数字孪生和物理测试平台,能够对MARL运动规划策略进行结构化的零样本评估。我们通过在所有三个领域部署SigmaRL训练的策略[2]来展示其用途,揭示了性能下降的两个互补来源:仿真与硬件控制栈之间的架构差异,以及由环境真实性增加引起的仿真到现实差距。开源设置使得在现实且可复现的条件下,能够系统分析MARL中的仿真到现实挑战。

英文摘要

We present a reproducible benchmark for evaluating sim-to-real transfer of Multi-Agent Reinforcement Learning (MARL) policies for Connected and Automated Vehicles (CAVs). The platform, based on the Cyber-Physical Mobility Lab (CPM Lab) [1], integrates simulation, a high-fidelity digital twin, and a physical testbed, enabling structured zero-shot evaluation of MARL motion-planning policies. We demonstrate its use by deploying a SigmaRL-trained policy [2] across all three domains, revealing two complementary sources of performance degradation: architectural differences between simulation and hardware control stacks, and the sim-to-real gap induced by increasing environmental realism. The open-source setup enables systematic analysis of sim-to-real challenges in MARL under realistic, reproducible conditions.

2605.09156 2026-05-27 cs.CL cs.AI

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

迷失在翻译中?探索从拉丁语到奥克语的语法性别转变

Ahan Chatterjee, Matthias Schöffel, Matthias Aßenmacher, Marinus Wiedner, Esteban Garces Arias

发表机构 * Bavarian Academy of Sciences (BAdW)(巴伐利亚科学院) LMU Munich(慕尼黑大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) University of Freiburg(弗赖堡大学)

AI总结 本文提出一个可解释的深度学习框架,通过词法和上下文层面分析拉丁语到奥克语的语法性别系统从三分(阳性、阴性、中性)到二分(阳性、阴性)的演变,并展示了改进的分词策略和形态特征、词性对性别预测的贡献。

Comments Accepted at NLP4DH @ ACL 2026

详情
AI中文摘要

从拉丁语到罗曼语族的历时演变涉及语法性别系统的重组,在大多数罗曼语中从三分结构(阳性、阴性、中性)变为二分结构(阳性、阴性)。在这项工作中,我们引入了一个可解释的深度学习框架,在词法和上下文层面研究这一现象。首先,我们表明传统的分词策略对于这种低资源历史设置不够稳健,而我们提出的分词器在这些基线上提高了性能。在词法层面,我们评估了形态特征对性别预测的贡献。在上下文层面,我们量化了不同词性类别对语法性别预测的贡献。这些分析共同刻画了性别信息在词元及其句子上下文之间的分布。我们在 https://github.com/ahan-2000/Lost-in-Translation- 公开了我们的代码库、数据集和结果。

英文摘要

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.

2605.08455 2026-05-27 cs.LG cs.PL cs.SE

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging


Shiyang Li, Haoyang Chen, Mattia Fazzini, Caiwen Ding

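Affiliations University of Minnesota

AI summary Proposes the CUDABEAVER benchmark, which evaluates how well LLMs repair CUDA code using the protocol-conditional metric pass@k(M,C,A), and shows how the tolerated performance loss affects the measured success rate.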

Comments 25 pages, 5 figures

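AI Chinese abstract

Debugging CUDA programs has long been challenging, because failures often stem from subtle interactions among hardware behavior, compiler decisions, the memory hierarchy, and asynchronous execution. More importantly, as GPU use expands rapidly across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming mostly overlook this setting: a model can pass correctness tests through a degenerate repair that simplifies the CUDA code into a safer but slower program and abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging built from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate code, native build/test commands, raw error evidence, and one editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower replacement that passes the tests, and reports results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, the corpus C, and the protocol axes A explicit. Applying this metric to 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: fixers look stronger when the tolerated performance loss is high, but even a slightly stricter performance requirement can sharply reduce measured success, changing scores by up to 40 percentage points.

English abstract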

Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, the corpus C, and the protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a slightly stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.
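
The abstract does not define pass@k(M,C,A), so the following is only a minimal sketch of how a performance tolerance could change a pass@k score. It assumes the standard unbiased pass@k estimator used in code-generation evaluation and treats a maximum allowed slowdown as one hypothetical protocol axis; the names `pass_at_k`, `count_successes`, and `max_slowdown` are illustrative and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n = attempts sampled, c = attempts counted as successful, k = budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def count_successes(attempts, max_slowdown: float) -> int:
    """Count attempts that pass the tests AND stay within the slowdown limit.

    Each attempt is (passed_tests, slowdown), where slowdown is the repaired
    runtime divided by a reference runtime (an assumption of this sketch).
    """
    return sum(1 for passed, slowdown in attempts
               if passed and slowdown <= max_slowdown)


# Hypothetical repair attempts for one task: (passed_tests, slowdown).
attempts = [(True, 1.02), (True, 1.8), (False, 1.0), (True, 3.5)]

lenient = pass_at_k(len(attempts), count_successes(attempts, 4.0), k=1)
strict = pass_at_k(len(attempts), count_successes(attempts, 1.1), k=1)
print(lenient, strict)
```

With these made-up attempts, three repairs count under the lenient tolerance but only one under the strict tolerance, so the same fixer scores very differently depending on the protocol.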

2605.04635 2026-05-27 cs.CV

UniPCB: A Generation-Assisted Detection Framework for PCB Defect Inspection


Huan Zhang, Lianghong Tan, Yichu Xu, Zishan Su, Jiangzhong Cao, Huanqi Wu, Linwei Zhu, Xu Zhang

Affiliations School of Information Engineering, Guangdong University of Technology; School of Computer Science, Wuhan University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

AI summary Proposes the UniPCB framework, which synthesizes defect samples with a multi-modal condition generator to augment the data and designs Inverted Residual Shift Attention and a Cross-level Complementary Fusion block to improve detection, reaching 98.0% mAP@0.5 on DsPCBSD+.

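AI Chinese abstract

In the Industrial Internet of Things (IIoT), intelligent, real-time printed circuit board (PCB) defect inspection is essential for ensuring product reliability. However, existing IIoT-based visual inspection systems face two compounding challenges: scarce and imbalanced defect samples limit model training, and feature representation is insufficient under complex circuit backgrounds. Existing generation methods rely on single-modality conditions with coarse structural control, while detection methods improve architectures without addressing the data bottleneck. To address both challenges jointly, we propose a generation-assisted PCB defect inspection framework that integrates controlled defect synthesis with task-specific defect detection in an IIoT-enabled pipeline. On the generation side, a multi-modal condition generator extracts complementary edge, depth, and text conditions in parallel. A ScaleEncoder then embeds these conditions into the diffusion U-Net at four resolutions, and condition modulation applies FiLM-style spatially adaptive modulation at each scale, producing structurally aligned, defect-aware samples that augment the scarce IIoT dataset. On the detection side, Inverted Residual Shift Attention combines self-attention with shift-wise convolution to capture global context and local texture together, and a Cross-level Complementary Fusion block generates pixel-level gates for selective cross-level feature fusion. The synthesized samples directly enrich the detection training set, so that improvements in generation compound with improvements in detection. Extensive experiments on DsPCBSD+ show that UniPCB reaches an mAP@0.5 of 98.0% and an mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods, while the generation branch attains an FID of 129.61 and an SSIM of 0.619, outperforming existing conditional generation methods.

English abstract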

In the Industrial Internet of Things (IIoT), enabling intelligent, real-time Printed Circuit Board (PCB) defect inspection is critical for ensuring product reliability. However, existing IIoT-based visual inspection systems face two compounding challenges: scarce and imbalanced defect samples that limit model training, and insufficient feature representation under complex circuit backgrounds. Existing generation methods rely on single-modality conditions with coarse structural control, while detection methods improve architectures without addressing the data bottleneck. To resolve both challenges jointly, we propose a generation-assisted PCB defect inspection framework that integrates controlled defect synthesis with task-specific defect detection within an IIoT-enabled pipeline. On the generation side, a Multi-modal Condition Generator extracts complementary edge, depth, and text conditions in parallel. A ScaleEncoder then embeds these conditions into the diffusion U-Net at four resolutions, and a Condition Modulation applies FiLM-style spatially-adaptive modulation at each scale, enabling structurally aligned and defect-aware sample synthesis to augment the scarce IIoT dataset. On the detection side, an Inverted Residual Shift Attention couples self-attention with shift-wise convolution to jointly capture global context and local texture, and a Cross-level Complementary Fusion Block generates pixel-level gates for selective cross-level feature fusion. The synthesized samples directly enrich the detection training set, so that improvements in generation compound with improvements in detection. Extensive experiments on DsPCBSD+ demonstrate that UniPCB achieves mAP@0.5 of 98.0% and mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods, while the generation branch attains an FID of 129.61 and SSIM of 0.619, outperforming existing conditional generation approaches.
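
As a rough illustration of the FiLM-style spatially adaptive modulation mentioned above, the sketch below applies a per-location scale and shift to a single-channel feature map. The abstract does not say how UniPCB predicts its modulation parameters, so `gamma` and `beta` are simply given as inputs here, and the function name `film_modulate` is hypothetical.

```python
def film_modulate(features, gamma, beta):
    """Apply out[i][j] = gamma[i][j] * features[i][j] + beta[i][j].

    All three arguments are equally sized 2D lists (one channel). In a FiLM
    layer, gamma and beta would be predicted from the conditioning signal;
    here they are supplied directly as an assumption of this sketch.
    """
    return [
        [g * x + b for x, g, b in zip(f_row, g_row, b_row)]
        for f_row, g_row, b_row in zip(features, gamma, beta)
    ]


# Toy 2x2 feature map with made-up modulation maps.
features = [[1.0, 2.0], [3.0, 4.0]]
gamma = [[1.0, 0.5], [2.0, 0.0]]   # per-location scale
beta = [[0.0, 1.0], [-1.0, 0.5]]   # per-location shift

modulated = film_modulate(features, gamma, beta)
print(modulated)
```

Because the scale and shift vary per location, the condition can amplify, suppress, or offset features differently across the image, which is what distinguishes spatially adaptive modulation from a single per-channel scale and shift.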

2605.03929 2026-05-27 cs.SD cs.AI cs.LG eess.SP

PHALAR: Phasors for Learned Musical Audio Representations


Davide Marincione, Michele Mancusi, Giorgio Strano, Luca Cerovaz, Donato Crisostomi, Roberto Ribuoli, Emanuele Rodolà

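Affiliations Department of Computer Science, Sapienza University of Rome, Italy; Moises Systems, Inc.; Paradigma, Inc.

AI summary Proposes PHALAR, a contrastive framework that uses learned spectral pooling and a complex-valued head to obtain pitch and phase equivariance; on stem retrieval it uses fewer than 50% of the parameters, trains 7× faster, improves accuracy by up to about 70% in relative terms, and captures robust musical structure.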

Comments Accepted at ICML 2026

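AI Chinese abstract

Stem retrieval, the task of matching a missing stem to a given audio submix, is a key challenge that is currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework that achieves a relative accuracy improvement of up to about 70% over the state of the art while using fewer than 50% of the parameters and training 7× faster. By using a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR sets new state-of-the-art retrieval results on MoisesDB, Slakh, and ChocoChorales, and correlates significantly better with human coherence judgments than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structure beyond the retrieval task.

English abstract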

Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to approximately 70% over the state of the art while using fewer than 50% of the parameters and training 7× faster. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes a new retrieval state of the art across MoisesDB, Slakh, and ChocoChorales, and correlates significantly better with human coherence judgments than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
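
To make the notion of phase equivariance concrete, the sketch below checks a general property of complex-linear maps: rotating the input by a global phase rotates the output by the same phase. It is only an illustration of that property under assumed toy values; the abstract does not describe PHALAR's actual head, and `complex_linear`, `rotate`, and the weights are hypothetical.

```python
import cmath


def complex_linear(weights, z):
    """Complex matrix-vector product y = W z (no bias, no nonlinearity)."""
    return [sum(w * x for w, x in zip(row, z)) for row in weights]


def rotate(z, theta):
    """Multiply every entry of z by the global phase factor exp(i * theta)."""
    phase = cmath.exp(1j * theta)
    return [phase * x for x in z]


# Made-up weights and input, used only to check the property numerically.
weights = [[1 + 2j, 0.5 - 1j], [-1j, 2 + 0j]]
z = [0.3 + 0.4j, -1.0 + 0.2j]
theta = 0.7

rotated_then_mapped = complex_linear(weights, rotate(z, theta))
mapped_then_rotated = rotate(complex_linear(weights, z), theta)

# Equivariance: both orders agree up to floating-point error.
max_gap = max(abs(a - b)
              for a, b in zip(rotated_then_mapped, mapped_then_rotated))
print(max_gap < 1e-12)
```

A real-valued head applied to separate real and imaginary parts would not generally have this property, which is one plausible reason to keep the head complex-valued.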