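arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.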

2605.18359 2026-05-27 cs.CV

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

Xi Leng, Xinhong Ma, Ziqiang Dong, Feng Zhang, Xiaoying Tang, Yang Yang, Guanjun Jiang

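Affiliations: Qwen Business Unit of Alibaba; The Chinese University of Hong Kong, Shenzhen; Beijing Institute of Technology

AI summary: To address cross-modal misallocation and intra-visual imbalance in the standard attention mechanism of large multimodal models, the paper proposes RAVE, a lightweight pairwise gating mechanism that reallocates visual attention by learning query-key biases. It reports an average gain of 3 percentage points across multiple multimodal benchmarks, with the largest gains on perception-intensive tasks.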

2605.17774 2026-05-27 cs.CL

Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

Yuval Shemla, Ayal Yakobe, Tanmay Agarwal, Dhaval Patel, Kaoutar El Maghraoui

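Affiliations: Columbia School of General Studies, Columbia University, NY, USA; Columbia Engineering, Columbia University, NY, USA; IBM Research, NY, USA

AI summary: The paper studies internalizing tool knowledge in small language models through parameter-efficient QLoRA fine-tuning. On the AssetOpsBench benchmark, fine-tuned Gemma 4 E4B and Qwen3-4B models running description-free inference outperform an unfine-tuned baseline that receives full tool descriptions, with input length reduced by 82.6% and higher planning scores.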

Abstract

Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models. This paper investigates whether tool-use knowledge can be internalized into small language models through parameter-efficient fine-tuning, enabling structured planning without explicit tool descriptions at inference time. Using AssetOpsBench as the primary benchmark, we fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on approximately 1,700 tool-use examples spanning tool knowledge, question-to-plan mappings, and execution-style traces. We evaluate the resulting models under description-free inference, where the prompt omits the tool catalog entirely. The fine-tuned models outperform an informed unfine-tuned baseline that receives full tool descriptions, reducing input length by 82.6% while improving structural and LLM-judge planning scores. In the best Gemma run, the model achieves an AT-F1 of 0.65 and an overall judge score of 3.88, compared with 0.47 and 2.88 for the informed baseline. Qwen3-4B achieves a strong overall judge score of 3.78 while using 62% less memory and running 2.5$\times$ faster than Gemma, though it also exhibits greater catastrophic forgetting on general multiple-choice benchmarks. Additional ablations show that LoRA rank controls a quality--retention trade-off, with $r=32$ maximizing planning quality and smaller ranks preserving more general knowledge. These results suggest that, for fixed tool catalogs, QLoRA fine-tuning can shift tool knowledge from prompt context into model weights, substantially reducing inference overhead while maintaining or improving tool-planning quality.
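The low-rank update that LoRA adds to a frozen weight matrix can be sketched as follows. This is a minimal illustration, not the paper's training code: the function name `lora_forward`, the scaling `alpha / r`, and the toy shapes and values are assumptions for the example, and the 8-bit quantization of the base weights used by QLoRA is omitted.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(x, W, A, B, alpha):
    """Return W x + (alpha / r) * B (A x) for a frozen W and trainable A, B.

    W has shape (d_out, d_in); A has shape (r, d_in); B has shape (d_out, r).
    The rank r is read off A. All names here are illustrative assumptions.
    """
    r = len(A)
    base = matvec(W, x)                 # frozen base-model path
    update = matvec(B, matvec(A, x))    # low-rank adapter path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy example: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]
B_zero = [[0.0], [0.0]]       # common initialisation: adapter starts as a no-op
B_trained = [[0.5], [-0.5]]   # hypothetical values after fine-tuning
x = [1.0, 2.0, 3.0]

y_init = lora_forward(x, W, A, B_zero, alpha=1.0)
y_tuned = lora_forward(x, W, A, B_trained, alpha=1.0)
```

A larger `r` gives the adapter more capacity, which is consistent with the abstract's observation that rank controls a trade-off between planning quality and retention of general knowledge.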

2605.17617 2026-05-27 cs.AI

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

Yiwen Zhu, Joyce Cahoon, Anna Pavlenko, Qiushi Bai, Nima Shahbazi, Divya Vermareddy, Meina Wang, Mathieu Demarne, Swati Bararia, Wenjing Wang, Hemkesh Vijaya Kumar, Hannah Lerner, Katherine Lin, Steve Toscano, Miso Cilimdzic, Subru Krishnan

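Affiliations: Microsoft, USA; University of Illinois Chicago, USA; Microsoft, Spain

AI summary: The paper proposes GraphMind, a system that automates workflows for cloud database incident investigation through offline extraction of causal workflow graphs, online multi-agent traversal and execution, and adaptive traversal reinforcement. Compared with baseline methods, it retrieves 8x less context and lowers the hallucination rate by 26%.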

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu, Junbin Yang, Haihua Shen, Yijun Yang

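Affiliations: University of Chinese Academy of Sciences; Inner Mongolia University of Technology; Tsinghua University; Shandong University

AI summary: The paper studies how dynamic adversarial fine-tuning changes the causal control carriers of refusal behavior (low-dimensional subspaces) in safety-aligned language models. It finds that R2D2 reorganizes refusal geometry along a robustness-utility frontier but does not establish adaptive robustness.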

Abstract

Safety-aligned language models must refuse harmful requests without broad over-refusal, but it remains unclear how dynamic adversarial fine-tuning changes refusal-control carriers: Kullback--Leibler (KL)-constrained directions or small subspaces that causally modulate refusal without large safe-prompt distribution shifts. We study a 7B backbone under supervised fine-tuning (SFT) and Robust Refusal Dynamic Defense (R2D2), aligning HarmBench, StrongREJECT, and XSTest evaluations with five-anchor geometry measurements, causal interventions, and sparse adaptive stress tests. R2D2 drives fixed-source HarmBench attack success to zero at early checkpoints; however, these checkpoints also exhibit maximal XSTest refusal and fail a benign-utility audit. Later checkpoints partially recover utility-facing behavior while reopening attack success, with adaptive GCG attack success rate rising to 0.415 at step 250 and 0.613 at step 500. Internally, R2D2 preserves a late-layer admissible refusal-control carrier through step 100 and then relocates the best admissible carrier to an early layer; SFT relocates earlier yet remains less robust. Effective rank stays near 1.24, and SFT shows larger principal-angle drift, arguing against both dimensional expansion and drift magnitude as sufficient explanations. Causal interventions support a low-dimensional but utility-coupled carrier. These results support a geometry-reorganization account of R2D2 along a robustness--utility frontier, without establishing adaptive robustness.
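To see how an effective rank can take a non-integer value such as 1.24, one common definition is the exponential of the Shannon entropy of the normalized singular values. The sketch below uses that definition as an assumption; the abstract does not state which definition the authors use, and the singular values shown are made up for illustration.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s).

    This is one common definition and is assumed here, not taken from the paper.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# One dominant direction plus a weak second one gives a value between 1 and 2.
print(effective_rank([1.0, 0.0, 0.0]))   # a single direction
print(effective_rank([1.0, 1.0]))        # two equally strong directions
print(effective_rank([1.0, 0.05]))       # hypothetical near-rank-1 spectrum
```

Under this definition, a value close to 1 indicates that almost all of the spectrum sits in one direction, which matches the abstract's reading of a low-dimensional carrier.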

2601.15891 2026-05-27 cs.CV

Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

Mamadou Keita, Wassim Hamidouche, Hessen Bougueffa Eutamene, Abdelmalik Taleb-Ahmed, Xianxun Zhu, Abdenour Hadid

Affiliations: Laboratory of IEMN, CNRS, Centrale Lille, UMR 8520, Univ. Polytechnique Hauts-de-France; Khalifa University; School of Communication and Information Engineering, Shanghai University; Sorbonne Center for Artificial Intelligence, Sorbonne University Abu Dhabi

AI summary: The study systematically evaluates Vision Mamba models for AI-generated image detection, comparing them with CNN, ViT, and VLM-based detectors in terms of accuracy, efficiency, and generalization.

Abstract

In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration, in this rapidly evolving field. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

2605.14664 2026-05-27 cs.CV

MiVE: Multiscale Vision-language features for reference-guided video Editing

Tong Wang, Meng Zou, Chengjing Wu, Xiaochao Qu, Luoqi Liu, Xiaolin Hu, Ting Liu

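Affiliations: MT Lab, Meitu Inc., Beijing 100083, China; Department of Computer Science and Technology, BNRist, IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing 100084, China; Beijing University of Posts and Telecommunications

AI summary: The paper proposes MiVE, a framework that feeds multiscale hierarchical VLM features (early layers retain spatial detail, deeper layers encode global semantics) into a unified self-attention Diffusion Transformer. This addresses the modality gap and the loss of fine-grained information, and reaches state-of-the-art performance in reference-guided video editing.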

Comments ICML 2026

Abstract

Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically -- early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.

2605.14480 2026-05-27 cs.CL

Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ

Ji-eun Kim

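Affiliations: Department of Korean language and literature, Duksung Women’s University

AI summary: The study treats the Huìtóngguǎnxì Huáyíyìyǔ as a coherent multilingual transcription system. Through digitization and phonological analysis, it identifies cross-linguistic regularities in the main and supplementary transcriptions and argues for the system's value as evidence for historical phonology.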

Comments 49 pages; 1 figure; 40 tables; SLE2019; under review

Abstract

Purpose: This study investigates the transcription principles underlying Huìtóngguǎnxì Huáyíyìyǔ (HHY), a series of multilingual glossaries compiled by the Ming government between the fifteenth and sixteenth centuries for interpreter training. The study treats HHY not as a collection of isolated language materials, but as a coherent multilingual transcription system representing spoken forms of non-Chinese languages through Chinese characters. Methods: A substantial portion of HHY was digitized and aligned with Chinese phonological categories. Previous reconstructions of individual language sections were critically reviewed and integrated into a unified comparative database. The analysis focuses on cross-linguistic regularities in Main Transcription (MT) and Supplementary Transcription (ST) across eight language sections. Results: MT generally represents sounds compatible with the Chinese syllable structure of the period, whereas ST mainly encodes phonetic features less compatible with Chinese phonology. The analysis further shows that Chinese phonological categories were used more flexibly in foreign-language transcription than previously assumed. HHY therefore functioned as a relatively systematic method of phonetic approximation rather than a direct projection of Chinese phonology onto non-Chinese languages. Conclusion: HHY can be analyzed as an internally structured transcription system rather than merely as a collection of glossaries. More broadly, the study demonstrates that historical transcription systems can provide valuable evidence for historical phonology, particularly for under-documented Asian languages with limited historical records.

2605.13779 2026-05-27 cs.LG cs.AI cs.DC

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

Mind Lab: Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao, Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Zhihui Li, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Sueky Zhang, Ya Zhang, Wei Zhao, Ada Zhou, Changhai Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang

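Affiliations: Mind Lab

AI summary: The paper presents MinT, a system that manages LoRA adapters to enable efficient training and online serving on top of large base models, supporting policy catalogs at the million scale.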

Comments 30 pages, technical report

Abstract

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
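The claim that a rank-1 adapter can be under 1% of base-model size follows from simple parameter counting. As a rough sketch, assuming a single square weight matrix and the standard LoRA factorization into a `d_out x r` and an `r x d_in` matrix (the dimension 4096 is an illustrative assumption, not a figure from the report):

```python
def lora_param_fraction(d_in, d_out, r):
    """Ratio of LoRA adapter parameters to the full weight matrix they adapt."""
    adapter_params = r * (d_in + d_out)   # A is (r, d_in), B is (d_out, r)
    full_params = d_in * d_out
    return adapter_params / full_params


# Hypothetical 4096 x 4096 projection with a rank-1 adapter.
fraction = lora_param_fraction(4096, 4096, r=1)
print(f"{fraction:.4%}")
```

Because the adapter grows linearly in the layer width while the full matrix grows quadratically, the fraction shrinks further as the base model gets wider, which is why moving only the adapter is cheap at large scale.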

2605.13455 2026-05-27 cs.CV

Bayesian In Vivo Tracking of Synapses using Joint Poisson Deconvolution and Diffeomorphic Registration

Shashwat Kumar, Dominic M. Padova, Binish Narang, Gabrielle I. Coste, Austin R. Graves, Richard L. Huganir, Adam S. Charles, Michael I. Miller, Anuj Srivastava

Affiliations: Department of Biomedical Engineering, Johns Hopkins University; Department of Neuroscience, Johns Hopkins University; Kavli Neuroscience Discovery Institute, Johns Hopkins University; Data Science and AI Institute, Johns Hopkins University; Department of Applied Mathematics and Statistics, Johns Hopkins University

AI summary: The paper proposes a template-based Bayesian framework that, through joint Poisson deconvolution and diffeomorphic registration, simultaneously performs synapse detection, denoising, fluorescence intensity inference, tissue motion correction, and confidence region estimation, for tracking synapses in low-SNR in vivo microscopy data.

Abstract

Synapses are densely packed submicron structures that dynamically reorganize during learning and memory formation. Longitudinal in vivo imaging of fluorescently tagged synaptic receptors offers a promising opportunity to study large-scale synaptic dynamics and how these processes are disrupted in neurological disease. However, in vivo imaging with 2-photon microscopy uses low laser power and therefore suffers from low signal-to-noise ratio (SNR) and high shot noise, nonlinear tissue motion between days, nonstationary fluctuations in synaptic fluorescence, and significant blur induced by the microscope point spread function (PSF). Together, these factors make it challenging to detect and track synapses, especially in regions with high synaptic density. This paper presents a novel template-based framework for modeling synapses as varying luminance point sources that move under a nonlinear tissue deformation. Taking a unified Bayesian approach, we apply this model to microscopy data by deriving a posterior that incorporates a diffeomorphic mapping for domain warping, a Gaussian point spread function for the imaging process, and a Poisson observation model for raw photon counts. The Bayesian solution simultaneously: (1) constructs a probabilistic template of synapse locations, (2) denoises and deconvolves the image data, (3) infers fluorescence intensities, (4) performs diffeomorphic image registration to correct for tissue motion, and (5) provides confidence regions for these parameter estimates. We demonstrate the framework on both a 2D+t simulated dataset and a 3D+t longitudinal in vivo microscopy dataset of fluorescent synapses imaged in a mouse over two weeks.
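The observation model in the abstract (point sources blurred by a Gaussian PSF, with Poisson photon counts) can be illustrated in one dimension. This is a minimal sketch under stated assumptions: the function names, the constant background term, and all numeric values are hypothetical, and the diffeomorphic registration and the Bayesian posterior from the paper are not modeled.

```python
import math


def expected_counts(pixels, positions, amplitudes, sigma, background):
    """Expected photon count per pixel: background plus Gaussian-blurred point sources."""
    rates = []
    for x in pixels:
        signal = sum(
            a * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
            for mu, a in zip(positions, amplitudes)
        )
        rates.append(background + signal)
    return rates


def poisson_log_likelihood(counts, rates):
    """Sum over pixels of log Poisson(k | rate) = k log(rate) - rate - log(k!)."""
    return sum(
        k * math.log(lam) - lam - math.lgamma(k + 1)
        for k, lam in zip(counts, rates)
    )


# Hypothetical 1D scene: one synapse at pixel 5 on an 11-pixel line.
pixels = list(range(11))
true_rates = expected_counts(pixels, [5.0], [50.0], sigma=1.5, background=1.0)
counts = [round(lam) for lam in true_rates]   # noise-free stand-in for photon counts

ll_true = poisson_log_likelihood(counts, true_rates)
shifted_rates = expected_counts(pixels, [8.0], [50.0], sigma=1.5, background=1.0)
ll_shifted = poisson_log_likelihood(counts, shifted_rates)
```

Here `ll_true` exceeds `ll_shifted`, which is the kind of signal a likelihood-based tracker can use to prefer the correct synapse position over a displaced one.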

2604.22546 2026-05-27 cs.CV

ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

Amir Hosseini, Sara Farahani, Xinyi Li, Suiyang Guang

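Affiliations: Amirkabir University of Technology

AI summary: To address incomplete annotations in open-vocabulary scene graph generation, which cause many valid relations to be treated as negatives, the paper proposes ReLIC-SGG. The framework builds a semantic relation lattice that models similarity, entailment, and contradiction among predicates, and treats unannotated relations as latent variables rather than definite negatives. It infers missing positive relations from vision-language compatibility, graph context, and semantic consistency, uses positive-unlabeled graph learning to reduce false-negative supervision, and applies lattice-guided decoding to produce compact, semantically consistent scene graphs.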

Comments Some errors in the experimental sections

DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick

Mohammad Hassan Vali, Tom Bäckström, Arno Solin

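Affiliations: ELLIS Institute Finland & Department of Computer Science, Aalto University, Finland; Department of Information and Communications Engineering, Aalto University, Finland

AI summary: The paper proposes DiVeQ, which uses the reparameterization trick to treat quantization as adding an error vector that mimics the quantization distortion, so that the forward pass remains hard-quantized while gradients can still flow. It also introduces a space-filling variant, SF-DiVeQ, that reduces quantization error and makes full use of the codebook. The methods improve reconstruction and sample quality on VQ-VAE, VQGAN, and DAC tasks.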

2601.16578 2026-05-27 cs.RO cs.SY eess.SY

Zero-Shot MARL Benchmark in the Cyber-Physical Mobility Lab

Julius Beerwerth, Jianye Xu, Simon Schäfer, Fynn Belderink, Bassam Alrifaee

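Affiliations: Cyber-Physical Mobility Lab; University of the Bundeswehr Munich; RWTH Aachen University

AI summary: Building on the Cyber-Physical Mobility Lab, the paper constructs a reproducible benchmark platform for evaluating the sim-to-real transfer of multi-agent reinforcement learning policies for connected and automated vehicles, and identifies two complementary sources of performance degradation.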

2605.09156 2026-05-27 cs.CL cs.AI

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

Ahan Chatterjee, Matthias Schöffel, Matthias Aßenmacher, Marinus Wiedner, Esteban Garces Arias

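Affiliations: Bavarian Academy of Sciences (BAdW); LMU Munich; Munich Center for Machine Learning (MCML); University of Freiburg

AI summary: The paper proposes an interpretable deep learning framework that analyzes, at the lexical and contextual levels, how the grammatical gender system evolved from tripartite (masculine, feminine, neuter) in Latin to bipartite (masculine, feminine) in Occitan. It also shows the benefit of an improved tokenization strategy and the contributions of morphological features and part-of-speech categories to gender prediction.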

Comments Accepted at NLP4DH @ ACL 2026

Abstract

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.

2605.08455 2026-05-27 cs.LG cs.PL cs.SE

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

Shiyang Li, Haoyang Chen, Mattia Fazzini, Caiwen Ding

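Affiliations: University of Minnesota

AI summary: The paper proposes the CUDABEAVER benchmark, which evaluates how well LLMs repair CUDA code using the protocol-conditional metric pass@k(M,C,A), and shows how performance-loss tolerance affects the measured success rate.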

Comments 25 pages, 5 figures

Abstract

Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of GPU usage across scientific computing, machine learning, graphics, and systems workloads, CUDA debugging has become more challenging than ever. Current evaluations of LLM-based CUDA programming largely miss this setting: a model can pass correctness tests with repair by degeneration, simplifying the CUDA code into a safer but slower program that abandons the original optimization structure. We introduce CUDABEAVER, a benchmark for CUDA debugging from real failing workspaces produced during LLM-based CUDA generation. Each task provides the broken candidate, native build/test commands, raw error evidence, and a single editable file. CUDABEAVER evaluates whether a fixer truly repairs the failing CUDA code or merely finds a slower test-passing replacement, reporting results by failure category, debugging trajectory, stagnation mode, and performance preservation. We further propose pass@k(M,C,A), a protocol-conditional CUDA debugging metric that makes the fixer M, corpus C, and protocol axes A explicit. Using this metric across 213 tasks and seven frontier LLMs, we show that protocol-aware evaluation gives a more faithful view of CUDA debugging ability: when performance-loss tolerance is high, fixers appear much stronger, but even a slightly stricter performance requirement can sharply reduce measured success, shifting scores by up to 40 percentage points.
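One way to picture a protocol-conditional pass@k is to apply the widely used unbiased pass@k estimator after filtering attempts by a performance-loss tolerance. The sketch below is an assumption about how such a metric could be computed, not the benchmark's actual implementation: the `Attempt` record, the `max_slowdown` parameter, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class Attempt:
    passed_tests: bool   # did the repaired code pass the correctness tests?
    slowdown: float      # repaired runtime / reference runtime (hypothetical field)


def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k given c successes among n sampled attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def protocol_pass_at_k(attempts, k, max_slowdown):
    """pass@k where a success must also stay within the slowdown tolerance."""
    successes = sum(
        1 for a in attempts if a.passed_tests and a.slowdown <= max_slowdown
    )
    return pass_at_k(len(attempts), successes, k)


# Four hypothetical repair attempts for one task.
attempts = [
    Attempt(True, 1.05),   # real fix, nearly as fast as the reference
    Attempt(True, 3.00),   # passes tests but is much slower ("repair by degeneration")
    Attempt(False, 1.00),
    Attempt(False, 1.00),
]

lenient = protocol_pass_at_k(attempts, k=1, max_slowdown=5.0)
strict = protocol_pass_at_k(attempts, k=1, max_slowdown=1.1)
```

Tightening the tolerance from 5.0 to 1.1 halves the score in this toy case, mirroring the abstract's point that stricter performance requirements reduce measured success.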

2605.04635 2026-05-27 cs.CV

UniPCB: A Generation-Assisted Detection Framework for PCB Defect Inspection

Huan Zhang, Lianghong Tan, Yichu Xu, Zishan Su, Jiangzhong Cao, Huanqi Wu, Linwei Zhu, Xu Zhang

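Affiliations: School of Information Engineering, Guangdong University of Technology; School of Computer Science, Wuhan University; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

AI summary: The paper proposes the UniPCB framework, which synthesizes defect samples with a multi-modal condition generator to augment the data, and designs Inverted Residual Shift Attention and Cross-level Complementary Fusion modules to improve detection, achieving 98.0% mAP@0.5 on DsPCBSD+.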

Abstract

In the Industrial Internet of Things (IIoT), enabling intelligent, real-time Printed Circuit Board (PCB) defect inspection is critical for ensuring product reliability. However, existing IIoT-based visual inspection systems face two compounding challenges: scarce and imbalanced defect samples that limit model training, and insufficient feature representation under complex circuit backgrounds. Existing generation methods rely on single-modality conditions with coarse structural control, while detection methods improve architectures without addressing the data bottleneck. To resolve both challenges jointly, we propose a generation-assisted PCB defect inspection framework that integrates controlled defect synthesis with task-specific defect detection within an IIoT-enabled pipeline. On the generation side, a Multi-modal Condition Generator extracts complementary edge, depth, and text conditions in parallel. A ScaleEncoder then embeds these conditions into the diffusion U-Net at four resolutions, and a Condition Modulation applies FiLM-style spatially-adaptive modulation at each scale, enabling structurally aligned and defect-aware sample synthesis to augment the scarce IIoT dataset. On the detection side, an Inverted Residual Shift Attention couples self-attention with shift-wise convolution to jointly capture global context and local texture, and a Cross-level Complementary Fusion Block generates pixel-level gates for selective cross-level feature fusion. The synthesized samples directly enrich the detection training set, so that improvements in generation compound with improvements in detection. Extensive experiments on DsPCBSD+ demonstrate that UniPCB achieves mAP@0.5 of 98.0% and mAP@0.5:0.95 of 61.8% on defect detection, surpassing all compared methods, while the generation branch attains an FID of 129.61 and SSIM of 0.619, outperforming existing conditional generation approaches.
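FiLM-style modulation scales and shifts each feature by condition-dependent parameters; in the spatially-adaptive form, the scale and shift vary per location. A minimal sketch on a tiny 2D feature map, assuming the scale `gamma` and shift `beta` have already been predicted from the conditions (how UniPCB predicts them is not described in the abstract, and the values below are illustrative):

```python
def film_modulate(features, gamma, beta):
    """Apply y[i][j] = gamma[i][j] * x[i][j] + beta[i][j] at every spatial location."""
    return [
        [g * x + b for x, g, b in zip(row_x, row_g, row_b)]
        for row_x, row_g, row_b in zip(features, gamma, beta)
    ]


# 2 x 2 single-channel feature map with hypothetical modulation maps.
features = [[1.0, 2.0], [3.0, 4.0]]
gamma = [[1.0, 0.5], [2.0, 0.0]]   # per-location scale
beta = [[0.0, 1.0], [-1.0, 3.0]]   # per-location shift

modulated = film_modulate(features, gamma, beta)
```

Setting `gamma` to all ones and `beta` to all zeros leaves the features unchanged, so the conditions only alter the locations where the modulation maps depart from that identity.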

2605.03929 2026-05-27 cs.SD cs.AI cs.LG eess.SP

PHALAR: Phasors for Learned Musical Audio Representations

Davide Marincione, Michele Mancusi, Giorgio Strano, Luca Cerovaz, Donato Crisostomi, Roberto Ribuoli, Emanuele Rodolà

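Affiliations: Department of Computer Science, Sapienza University of Rome, Italy; Moises Systems, Inc.; Paradigma, Inc.

AI summary: The paper proposes PHALAR, a contrastive framework that uses learned spectral pooling and a complex-valued head to achieve pitch and phase equivariance. On stem retrieval it uses 50% fewer parameters, trains 7x faster, improves accuracy by roughly 70% in relative terms, and captures robust musical structure.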

Comments Accepted at ICML 2026

AI and Large Models

Vision and Robotics

Science and Medicine

RAVE: Re-Allocating Visual Attention in Large Multimodal Models

Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

RSD: A Local Triangulation Audit Primitive for Learned Vector Blocks

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

RadJEPA: Radiology Encoder for Chest X-Rays via Joint Embedding Predictive Architecture

EgoExo-WM: Unlocking Exo Video for Ego World Models

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

MiVE: Multiscale Vision-language features for reference-guided video Editing

Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

Bayesian In Vivo Tracking of Synapses using Joint Poisson Deconvolution and Diffeomorphic Registration

ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation

MetaGraph: A Large-Scale Meta-Analysis of GenAI in Financial NLP (2022-2025)

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

When Brains Disagree: Biological Ambiguity Underlies the Challenge of Amyloid PET Synthesis from Structural MRI

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation

From Per-Image Low-Rank to Encoding Mismatch: Rethinking Feature Distillation in Vision Transformers

DiVeQ: Differentiable Vector Quantization Using the Reparameterization Trick

Zero-Shot MARL Benchmark in the Cyber-Physical Mobility Lab

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

UniPCB: A Generation-Assisted Detection Framework for PCB Defect Inspection

PHALAR: Phasors for Learned Musical Audio Representations