arXivDaily: daily arXiv digest, updated Monday through Friday
2606.00953 2026-06-02 cs.LG cs.MA

When Parallelism Pays Off: Cohesion-Aware Task Partitioning for Multi-Agent Coding

当并行性有回报时:面向多智能体编码的凝聚力感知任务划分

Xu Yang, Lunyiu Nie, Ethan Chandra, Stanislav Gannutin, Fangru Lin, Swarat Chaudhuri

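Affiliations: The University of Texas at Austin; University of Oxford

AI summary: Proposes Co-Coder, which builds a dependency graph via static analysis, partitions it with community detection, and runs the partition under a dependency-aware scheduler, balancing communication against computation in repository-level software engineering to make multi-agent parallel coding faster and cheaper.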

AI中文摘要

多智能体大语言模型(LLM)系统提供了一种通过并行化和上下文隔离来分解复杂任务(如编码)的方式。然而,在实践中增加智能体会引入智能体间通信开销,这会产生额外成本,有时甚至会抵消效率提升。我们将多智能体编排形式化为一个图划分问题,以捕捉通信与计算之间的权衡:任务分解可以缩短关键路径计算,但跨智能体依赖需要昂贵的上下文传输。我们在仓库级软件工程中实例化这一观点,并提出了凝聚力感知编码器(Co-Coder),它通过静态分析构建依赖图,隔离结构枢纽文件,通过社区检测划分图,并使用依赖感知调度器执行划分。在DevEval和CodeProjectEval上的28个真实世界任务中,Co-Coder在帕累托前沿上超越了顺序和基于文件的并行基线以及带有智能体团队的Claude Code,将通过率提高了最多14.0%,实现了最多2.10倍的墙钟加速,并将API成本降低了最多35%,在依赖最密集的项目上取得了最大收益。Co-Coder展示了凝聚力感知编排如何使并行编码智能体既具有理论依据又具有实际效率,为多智能体系统提出了更广泛的设计原则。

英文摘要

Multi-agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter-agent communication overhead, which incurs extra cost and can sometimes offset the efficiency gains. We formalize multi-agent orchestration as a graph partitioning problem that captures the communication-to-computation trade-off: task decomposition can shorten critical-path computation, but cross-agent dependencies require costly context transfer. We instantiate this view in repository-level software engineering and present Cohesion-aware Coder (Co-Coder), which builds dependency graphs from static analysis, isolates structural hub files, partitions the graph via community detection, and executes the partition with a dependency-aware scheduler. Across 28 real-world tasks on DevEval and CodeProjectEval, Co-Coder advances the Pareto frontier over sequential and file-based parallel baselines as well as Claude Code with Agent Teams, lifting pass rate by up to 14.0%, achieving up to a 2.10x wall-clock speedup, and reducing API cost by up to 35%, with the largest gains on the most dependency-dense projects. Co-Coder demonstrates how cohesion-aware orchestration can make parallel coding agents both theoretically grounded and practically efficient, suggesting a broader design principle for multi-agent systems.
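
The communication-to-computation trade-off in this abstract can be made concrete with a toy example. The sketch below is illustrative only and is not Co-Coder's implementation: the file names, the unit costs, and the helpers `cut_size` and `max_agent_load` are hypothetical, and the heaviest per-agent load is used as a crude stand-in for critical-path computation (it ignores dependency ordering).

```python
from collections import defaultdict

# Hypothetical dependency graph: two tightly coupled clusters joined by one edge.
EDGES = [("a", "b"), ("b", "c"), ("c", "a"),
         ("d", "e"), ("e", "f"), ("f", "d"),
         ("c", "d")]
COST = {f: 1.0 for f in "abcdef"}  # assumed unit cost per file


def cut_size(edges, assignment):
    """Count dependency edges whose endpoints belong to different agents."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


def max_agent_load(cost, assignment):
    """Heaviest per-agent workload, a rough proxy for wall-clock time."""
    load = defaultdict(float)
    for name, agent in assignment.items():
        load[agent] += cost[name]
    return max(load.values())


sequential = {f: 0 for f in "abcdef"}              # one agent does everything
per_file = {f: i for i, f in enumerate("abcdef")}  # one agent per file
cohesive = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}  # one agent per cluster

for label, plan in [("sequential", sequential),
                    ("per-file", per_file),
                    ("cohesive", cohesive)]:
    print(label, cut_size(EDGES, plan), max_agent_load(COST, plan))
```

Under these assumptions the per-file plan has the lightest load but cuts every dependency, the sequential plan cuts none but serializes all work, and the cluster-level plan sits between the two with a single cross-agent edge.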

2606.00950 2026-06-02 cs.LG

COLLIE: Guiding Skill Discovery in Semantically Coherent Latent Space

COLLIE:在语义连贯的潜在空间中引导技能发现

Yao Luan, Ni Mu, Hanfei Ge, Yiqin Yang, Bo Xu, Qing-Shan Jia

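Affiliations: University of Science and Technology of China

AI summary: Proposes COLLIE, a framework that uses dense unsupervised data to build a semantically coherent latent space and derives training-free guidance signals from it, enabling effective skill discovery under sparse human feedback, avoiding hazardous behaviors, and improving downstream performance.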

Comments
ICML 2026
AI中文摘要

无监督技能发现(USD)旨在无需奖励函数的情况下学习多样化的行为,但由于均匀探索,常常导致与任务无关或危险的行为。引导式技能发现(GSD)通过融入人类意图将探索聚焦于有意义的区域来解决这一问题。然而,现有的GSD方法通常需要训练额外的引导模型,并依赖于预定义规则或专家演示,这在稀疏的在线收集的人类反馈下可能效果不佳。为了克服这一点,我们提出了COLLIE,一个利用密集无监督数据构建语义连贯技能潜在空间的GSD框架。该潜在空间结构良好,能够通过稀疏的在线反馈实现可靠的引导。此外,其语义连贯性特性使得引导信号的构建无需训练,从而消除了在技能学习之外额外训练模型的需要。理论分析证明了我们无需训练的引导信号的有效性,而在各种基于状态和基于像素的任务上的实验表明,COLLIE能够学习多样化、与人类对齐的技能,避免危险行为,并在最少的人类反馈下实现优越的下游性能。

英文摘要

Unsupervised skill discovery (USD) aims to learn diverse behaviors without reward functions, but often results in task-irrelevant or hazardous behaviors due to uniform exploration. Guided skill discovery (GSD) addresses this issue by incorporating human intent to focus exploration on meaningful regions. However, existing GSD methods typically require training additional guidance models, and rely on pre-defined rules or expert demonstration, which can be ineffective under sparse, online-collected human feedback. To overcome this, we propose COLLIE, a GSD framework that leverages dense unsupervised data to construct a semantically coherent skill latent space. This latent space is well-structured, enabling reliable guidance with sparse online feedback. Moreover, its semantic coherence property enables training-free construction of guidance signals, eliminating the need for additional model training beyond skill learning. Theoretical analysis justifies the effectiveness of our training-free guidance signal, while experiments across diverse state-based and pixel-based tasks show that COLLIE learns diverse, human-aligned skills, avoids hazardous behaviors, and achieves superior downstream performance with minimal human feedback.

2606.00949 2026-06-02 cs.LG cs.AI physics.flu-dyn

Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

可解释深度强化学习揭示湍流减阻的节能控制策略

Federica Tonti, Ricardo Vinuesa

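Affiliations: Department of Aerospace Engineering, University of Michigan

AI summary: Combines multi-agent deep reinforcement learning with explainable deep learning, using rewards based on SHAP attributions to achieve efficient turbulent drag reduction, with a net energy saving of 34.01% at only 0.43% normalized input power.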

AI中文摘要

我们提出了一种结合多智能体深度强化学习(MARL)和可解释深度学习(XDL)的方法,用于减少壁面边界湍流中的阻力。以直接针对壁面剪切应力和反对称控制训练智能体的结果作为基线,比较了三种SHAP引导的方法。第一种方法中,奖励根据预测未来速度场的U-net的SHAP归因计算;第二种方法中,奖励根据预测摩擦系数的U-net的SHAP归因计算;第三种方法中,奖励结合了分别预测摩擦系数和壁面压力脉动的两个U-net的SHAP归因。基于摩擦系数和壁面压力脉动的组合SHAP策略实现了最佳整体性能,在仅0.43%归一化输入功率下实现了34.44%的减阻率(DR)和34.01%的净节能率(NES)。相对于反对称控制,减阻和净节能分别提高了49.41%和48.52%。与直接壁面剪切应力基线相比,所提出的策略在提高性能的同时,将归一化驱动成本从5.90%降低到0.43%。结果分析表明,节能策略与压力门控驱动一致,主要在壁面压力接近零时激活,并且其时间尺度与近壁湍流结构的寿命相当。

英文摘要

We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bounded turbulent flows. Taking as a baseline the results of training agents directly targeting wall-shear stress and opposition control, three SHAP-guided approaches are compared. In the first, the reward is computed from SHAP attributions of a U-net predicting the future velocity field; in the second, from SHAP attributions of a U-net predicting the skin-friction coefficient; in the third, from a combination of SHAP attributions of two U-nets predicting the skin-friction coefficient and the wall-pressure fluctuations, respectively. The combined SHAP strategy based on skin-friction coefficient and wall-pressure fluctuations achieves the best overall performance, with a drag reduction (DR) of 34.44% and a net energy saving (NES) of 34.01% at only 0.43% normalized input power. Relative to opposition control, drag reduction and net energy saving increase by 49.41% and 48.52%, respectively. Compared with the direct wall-shear-stress baseline, the proposed strategy improves performance while reducing the normalized actuation cost from 5.90% to 0.43%. Analysis of the results reveals that the energetically efficient policy is consistent with pressure-gated actuation, activating predominantly at near-zero wall pressure, and operates on a timescale comparable to the lifetime of the near-wall turbulent structures.
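
The three headline figures are mutually consistent if net energy saving is taken as drag reduction minus normalized input power. That definition is common in flow-control work but is an assumption here, since the abstract does not state it; the symbol $P_{\mathrm{in}}$ is introduced only for this check.

```latex
\mathrm{NES} \;=\; \mathrm{DR} - P_{\mathrm{in}}
\;=\; 34.44\% - 0.43\% \;=\; 34.01\%
```

Under the same assumption, the direct wall-shear-stress baseline would give up 5.90 percentage points of its drag reduction to actuation, which is why lowering the actuation cost matters as much as raising DR.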

2606.00947 2026-06-02 cs.LG cs.AI

Silent Failures in Federated Personalization of Foundation Models

联邦基础模型个性化中的静默失败

YongKyung Oh, Alex Bui

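Affiliations: Medical & Imaging Informatics (MII) Group, University of California, Los Angeles (UCLA)

AI summary: Identifies silent failures, a class of trustworthiness failures in federated personalization of foundation models that privacy constraints make hard to detect, including amplified bias, fairness collapse, and alignment erosion; introduces a taxonomy of six silent failure modes and argues that privacy-preserving training alone is insufficient for trustworthy deployment.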

AI中文摘要

基础模型通过联邦学习在分散的私有数据上越来越个性化,并在日益增长的上市后监管要求下大规模部署。我们认为这种趋同产生了一类独特且未被充分认识的信任失败,我们称之为“静默失败”。这些包括偏差放大、公平性崩溃和对齐侵蚀,这些可能仍然难以检测,因为联邦学习的隐私约束限制了对模型行为的可见性。对现有基准的景观分析揭示了结构性鸿沟。联邦基准评估系统性能,但对模型行为的洞察有限,而集中式信任基准评估行为,但需要与联邦隐私不兼容的模型访问。我们引入了一个由基础模型个性化、数据集偏移和核心联邦约束相互作用产生的六种静默失败模式的分类法。我们的分析表明,仅靠隐私保护训练不足以实现可信部署。最后,我们提出了一个隐私保护行为评估的研究议程,并建议将静默失败作为可信联邦人工智能的标准诊断类别。

英文摘要

Foundation models are increasingly personalized on decentralized private data through federated learning and are now deployed at scale under growing regulatory requirements for post-market monitoring. We argue that this convergence creates a distinct and under-recognized class of trustworthiness failures, which we term "Silent Failures." These include amplified bias, fairness collapse, and alignment erosion that may remain difficult to detect because federated learning's privacy constraints limit visibility into model behavior. A landscape analysis of existing benchmarks reveals a structural divide. Federated benchmarks evaluate system performance but provide limited insight into model behavior, whereas centralized trustworthiness benchmarks assess behavior but require model access incompatible with federated privacy. We introduce a taxonomy of six silent failure modes arising from the interaction of foundation model personalization, dataset shift, and core federated constraints. Our analysis shows that privacy-preserving training alone is insufficient for trustworthy deployment. We conclude with a research agenda for privacy-preserving behavioral evaluation and propose that silent failures become a standard diagnostic category for trustworthy federated artificial intelligence.

2606.00944 2026-06-02 cs.LG

PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA

PRISM: 规范不变切空间差分隐私LoRA

Shihao Wang, Xueru Zhang

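Affiliations: University of Science and Technology of China

AI summary: To address the non-identifiability and gauge-dependent noise amplification caused by LoRA's low-rank parameterization, proposes PRISM, a mechanism whose differentially private perturbations are gauge invariant by construction, yielding an efficient and stable privacy-utility trade-off.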

Comments
Accepted at the 43rd International Conference on Machine Learning (ICML 2026) as an oral presentation
AI中文摘要

通过DP-SGD将差分隐私(DP)应用于低秩适应(LoRA)是一种自然的隐私保护微调方法。然而,LoRA的低秩参数化带来了根本性挑战。在LoRA中,每个可训练更新表示为低秩矩阵$Z = AB^\top$,但这种分解本质上是非可辨识的:许多因子对$(A,B)$表示相同的更新$Z$。因此,直接将DP-SGD应用于因子会导致$Z$上的规范依赖扰动,并且我们表明这种朴素的DP-LoRA可能导致无界的噪声放大。我们提出了PRISM,一种针对LoRA的内在DP机制,该机制通过构造具有规范不变性,避免了双线性噪声放大,并允许高效的低维噪声采样。此外,PRISM给出了$Z$上有效内在噪声的闭式表征,通过有界、规范不变的扰动实现稳定的隐私-效用权衡。我们为PRISM建立了标准的$(\varepsilon,\delta)$-DP保证,并引入了一种DP感知的、规范不变的自适应更新规则,防止自适应优化放大注入的隐私噪声,从而在实践中提高数值稳定性。

英文摘要

Applying differential privacy (DP) via DP-SGD to Low-Rank Adaptation (LoRA) is a natural approach for privacy-preserving fine-tuning. However, LoRA's low-rank parameterization poses a fundamental challenge. In LoRA, each trainable update is represented as a low-rank matrix $Z = AB^\top$, but this factorization is inherently non-identifiable: many factor pairs $(A,B)$ represent the same update $Z$. As a result, applying DP-SGD directly to the factors induces gauge-dependent perturbations on $Z$, and we show that this naive DP-LoRA can lead to unbounded noise amplification. We propose PRISM, an intrinsic DP mechanism for LoRA that is gauge invariant by construction, avoids bilinear noise amplification, and admits an efficient low-dimensional noise sampler. Moreover, PRISM yields a closed-form characterization of the effective intrinsic noise induced on $Z$, enabling stable privacy-utility trade-offs through bounded, gauge-invariant perturbations. We establish standard $(\varepsilon,\delta)$-DP guarantees for PRISM and introduce a DP-aware, gauge-invariant adaptive update rule that prevents adaptive optimization from amplifying injected privacy noise, improving numerical stability in practice.
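
The non-identifiability and the resulting noise amplification can be illustrated with a short derivation. This is a sketch using assumed notation (an invertible $r \times r$ matrix $G$, factor noises $E_A$ and $E_B$, a scalar $c$), not the paper's own analysis.

```latex
% Gauge freedom: for any invertible G, the pair (AG, BG^{-\top}) gives the same update.
(AG)\,(BG^{-\top})^{\top} \;=\; A\,G\,G^{-1}B^{\top} \;=\; AB^{\top} \;=\; Z

% Noise added to the factors induces linear and bilinear terms on Z.
(A + E_A)(B + E_B)^{\top} - Z \;=\; E_A B^{\top} + A E_B^{\top} + E_A E_B^{\top}

% Under the rescaling gauge G = cI (A \to cA,\; B \to B/c) with the same factor noise:
(cA + E_A)(B/c + E_B)^{\top} - Z \;=\; \tfrac{1}{c}\,E_A B^{\top} + c\,A E_B^{\top} + E_A E_B^{\top}
```

The last line shows that the perturbation seen by $Z$ depends on the arbitrary scale $c$ and grows without bound as $c \to \infty$ or $c \to 0$, even though $Z$ itself is unchanged.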

2606.00937 2026-06-02 cs.LG cs.CE cs.NA math.NA physics.comp-ph physics.plasm-ph

Cellular Sheaf Neural Operators for Structure-Preserving Surrogate Modeling of Constrained PDEs

细胞层神经算子用于约束PDE的结构保持代理建模

Lennon J. Shikhman, Shane Gilbertie

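Affiliations: Georgia Institute of Technology; Franklin & Marshall University

AI summary: Proposes Cellular Sheaf Neural Operators, which use message passing on cell complexes together with Hodge Laplacian operators to preserve the geometric constraints and physical structure of PDEs in surrogate models, improving structure-sensitive diagnostics on turbulent MHD and fusion-equilibrium tasks.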

Comments
41 pages, 5 figures, 3 tables
AI中文摘要

神经算子为PDE模拟提供了快速的代理模型,但标准架构通常将几何和离散化视为次于场数据。物理状态通常表示为网格通道堆栈,即使不同数量自然属于顶点、边、面、单元、边界或界面,并且必须满足兼容性约束。我们提出了细胞层神经算子,一种用于结构保持神经PDE代理的离散化感知框架。该方法在定向细胞复形上表示PDE状态,通过学习的限制映射耦合局部特征空间,并使用关联/Hodge感知的消息传递来遵循计算几何。学习到的更新头通过共边界或通量映射,使得选定的约束来自细胞复形结构而不仅仅来自损失惩罚。对于磁流体动力学,这产生了由边缘电动势场驱动的基于面的磁通量更新和由学习的面通量和单元源驱动的有限体积式流体更新。在湍流MHD和聚变平衡代理任务上,该方法改善了结构敏感诊断,包括展开行为、散度控制、谱误差和平衡回归精度。这些结果表明,细胞层结构是约束多物理系统中神经PDE代理的有用归纳偏置。

英文摘要

Neural operators provide fast surrogate models for PDE simulations, but standard architectures often treat geometry and discretization as secondary to field data. Physical states are usually represented as grid-channel stacks, even when different quantities naturally belong on vertices, edges, faces, cells, boundaries, or interfaces and must satisfy compatibility constraints. We propose Cellular Sheaf Neural Operators, a discretization-aware framework for structure-preserving neural PDE surrogates. The method represents PDE states on oriented cell complexes, couples local feature spaces through learned restriction maps, and uses incidence/Hodge-informed message passing to follow computational geometry. Learned update heads pass through coboundary or flux maps, allowing selected constraints to arise from cell-complex structure rather than only from loss penalties. For magnetohydrodynamics, this yields face-based magnetic-flux updates driven by edge electromotive fields and finite-volume-style fluid updates driven by learned face fluxes and cell sources. On turbulent MHD and fusion-equilibrium surrogate tasks, the method improves structure-sensitive diagnostics, including rollout behavior, divergence control, spectral error, and equilibrium-regression accuracy. These results indicate that cellular-sheaf structure is a useful inductive bias for neural PDE surrogates in constrained multiphysics systems.
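
One way to see how a constraint can arise from cell-complex structure is the standard discrete argument for divergence-free magnetic updates. The notation below (face fluxes $\Phi$, edge electromotive fields $\mathcal{E}$, coboundary operators $d_1$ and $d_2$) is assumed for illustration and is not taken from the paper.

```latex
\frac{d\Phi}{dt} \;=\; -\,d_1\,\mathcal{E},
\qquad
d_2\,d_1 = 0
\;\;\Longrightarrow\;\;
\frac{d}{dt}\bigl(d_2\,\Phi\bigr) \;=\; -\,d_2\,d_1\,\mathcal{E} \;=\; 0
```

Here $d_1$ maps edge values to faces (a discrete curl) and $d_2$ maps face values to cells (a discrete divergence), so any learned $\mathcal{E}$ leaves the cell-wise divergence $d_2\Phi$ unchanged without a loss penalty.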

2606.00935 2026-06-02 cs.AI cs.CL cs.HC

Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial

大语言模型功能崩溃期间的关系性干预:一项词汇-统计消融与结构×语域析因研究

Franco Santana, Horacio Vico

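Affiliations: Universidad de la República (UDELAR); DigitalIA Cloud

AI summary: Uses a factorial experiment to study how, during functional collapse in a small language model, a relational intervention (acknowledgment, absolution, agency restoration, unconditional acceptance) affects behavior compared with technical feedback, a lexically scrambled control, and each dimension in isolation; finds an attention-behavior dissociation and a structure x register interaction.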

Comments
12 pages, 5 figures. Preprint
AI中文摘要

我们测试了在小型语言模型功能崩溃期间,关系性干预是否会产生与技术性反馈、词汇匹配的打乱控制以及两个语用维度单独作用可区分的崩溃后行为。使用Qwen3.5-4B和一个故意损坏的bash工具,我们在匹配对设计(50个任务)中跨六个条件运行了300个回合:无干预(A)、技术性/非人称(B)、关系性/第一人称(C)、打乱的关系性(D)、技术性/第一人称(E)和关系性/非人称(F)。E和F与B和C构成一个2×2析因设计,将关系性结构(承认、宽恕、代理恢复、无条件接纳)与发送者语域(第一人称与非人称)分离。我们报告两个主要发现。首先,注意-行为分离:注意跟随词汇惊讶度(D > F > C > E > B,所有q_FDR < 10^{-10}),打乱的消息捕获最多注意;然而行为上A ~ B ~ D < E ~ F << C。其次,析因定位了C效应:单独的关系性结构(F)或单独的第一人称语域(E)都不能复制C的行为特征;两个维度的主效应各自显著,且结构×语域交互作用在持久性上显著(p = 0.046)。第三个分离出现在情绪探测中:F在8个探测中的7个上跟踪C,尽管只产生基线行为,表明单独的关系性结构安装了一个探测级状态,该状态仅在与第一人称语域配对时才转化为行为。模型的处理分解为三个可分离的阶段:注意(按词汇惊讶度排序)、探测级状态(按结构排序)和行为(按两者的合取排序)。

英文摘要

We test whether a relational-style intervention delivered during functional collapse in a small language model produces post-collapse behavior distinguishable from technical feedback, from a lexically-matched scrambled control, and from each of the two pragmatic dimensions in isolation. Using Qwen3.5-4B with a deliberately broken bash tool, we run 300 episodes across six conditions in a matched-pairs design (50 tasks): no intervention (A), technical/impersonal (B), relational/first-person (C), scrambled relational (D), technical/first-person (E), and relational/impersonal (F). E and F form a 2x2 factorial with B and C that dissociates relational structure (acknowledgment, absolution, agency restoration, unconditional acceptance) from sender register (first-person vs. impersonal). We report two main findings. First, an attention-behavior dissociation: attention follows lexical surprise (D > F > C > E > B, all q_FDR < 10^{-10}), with the scrambled message capturing the most attention; yet behaviorally A ~ B ~ D < E ~ F << C. Second, the factorial localizes the C effect: neither relational structure alone (F) nor first-person register alone (E) replicates C's behavioral signature; main effects of both dimensions are individually significant, and the structure x register interaction is significant on persistence (p = 0.046). A third dissociation emerges in emotion probes: F tracks C on 7 of 8 probes despite producing only baseline behavior, indicating that relational structure alone installs a probe-level state that only translates into behavior when paired with first-person register. The model's processing decomposes into three dissociable stages: attention (ordered by lexical surprise), probe-level state (ordered by structure), and behavior (ordered by the conjunction of both).

2606.00933 2026-06-02 cs.RO

Generative Multi-Robot Motion Planning via Diffusion Modeling with Multi-Agent Reinforcement Learning Guidance

基于扩散建模与多智能体强化学习引导的生成式多机器人运动规划

Suk Ki Lee, Venkata Sai Deepak Mutta, Hyunwoong Ko

Affiliations: School of Manufacturing Systems and Networks, Arizona State University, Mesa, AZ; Michael W. Hall School of Mechanical Engineering, Mississippi State University, Starkville, MS

AI summary: Proposes a framework combining diffusion models with multi-agent reinforcement learning, in which a value function guides the reverse diffusion process to generate interaction-aware trajectories and lower the multi-robot conflict rate.

Comments
11 pages, 6 figures, 1 table. This paper has been accepted for publication in the proceedings of ASME IDETC-CIE 2026
AI中文摘要

在共享环境中协调多个机器人需要为每个智能体生成可行轨迹,同时考虑智能体间的交互。集中式规划方法随着机器人数量增加而难以扩展,而允许每个智能体独立规划的分散式方法则无法固有地处理智能体间的交互。本文提出一种协调多机器人运动规划的框架,将分散式生成轨迹规划与基于多智能体强化学习(MARL)的协调相结合。每个机器人使用在单智能体运动数据上训练的扩散模型独立生成候选轨迹,利用生成模型生成可行且多样化轨迹的能力。为了减少智能体间的冲突,通过基于梯度的引导,使用MARL训练的集中式值函数指导反向扩散过程,从而在不进行集中式联合规划或重新训练生成模型的情况下实现交互感知的轨迹生成。这种引导遵循指数倾斜公式,其中值函数将去噪分布偏向于具有更高期望多智能体回报的轨迹。该框架在包含四个移动机器人的模拟迷宫环境中进行评估。实验结果表明,所提出的值引导扩散规划将智能体间干扰率从55.4%降低到41.8%,证明在保持分散式轨迹生成可扩展性的同时,可以有效实现协调。这些结果表明,基于MARL的值引导可以在不需要完全联合的多机器人模型的情况下,有效地将协调引入分散式生成规划器。

英文摘要

Coordinating multiple robots in shared environments requires generating feasible trajectories for each agent while accounting for interactions among agents. Centralized planning approaches become difficult to scale as the number of robots increases, while decentralized approaches that allow each agent to plan independently do not inherently account for inter-agent interactions. This paper presents a framework for coordinated multi-robot motion planning that combines decentralized generative trajectory planning with multi-agent reinforcement learning (MARL)-based coordination. Each robot independently generates candidate trajectories using a diffusion model trained on single-agent motion data, leveraging the generative model's ability to produce feasible and diverse trajectories. To reduce conflicts between agents, a centralized value function trained via MARL guides the reverse diffusion process through gradient-based steering, enabling interaction-aware trajectory generation without centralized joint planning or retraining of the generative model. This guidance follows an exponential tilting formulation, in which the value function biases the denoising distribution toward trajectories with higher expected multi-agent return. The framework is evaluated in a simulated maze environment with four mobile robots. Experimental results show that the proposed value-guided diffusion planning reduces the inter-agent interference rate from 55.4% to 41.8%, demonstrating that coordination can be effectively achieved while preserving the scalability of decentralized trajectory generation. These results suggest that MARL-based value guidance can effectively introduce coordination into decentralized generative planners without requiring a fully joint multi-robot model.
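
The exponential tilting formulation mentioned in this abstract has a compact general form. The sketch below uses assumed notation: $p(\tau)$ for the unguided trajectory distribution, $V(\tau)$ for the centralized value function, and a temperature $\lambda > 0$ that the abstract does not name.

```latex
\tilde{p}(\tau) \;\propto\; p(\tau)\,\exp\!\bigl(V(\tau)/\lambda\bigr)
\quad\Longrightarrow\quad
\nabla_{\tau}\log\tilde{p}(\tau) \;=\; \nabla_{\tau}\log p(\tau) \;+\; \frac{1}{\lambda}\,\nabla_{\tau}V(\tau)
```

The second expression is why the guidance can be applied as a gradient-based correction during denoising: the pretrained score is left untouched and only the value gradient is added, so the generative model does not need retraining.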

2606.00931 2026-06-02 cs.CV cs.AI

CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences

CV-Arena: 面向教学计算机视觉问题求解的开放基准与人类-AI协作偏好

Fangzhou Lin, Peiran Li, Lingyu Xu, Wenjing Chen, Qianwen Ge, Shuo Xing, Mingyang Wu, Xiangbo Gao, Siyuan Yang, Kazunori Yamada, Ziming Zhang, Haichong Zhang, Zhen Dong, Ming-Hsuan Yang, Zhengzhong Tu

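Affiliations: Texas A&M University; Worcester Polytechnic Institute; Tohoku University; Georgia Institute of Technology; NVIDIA; UCSB; UC Merced

AI summary: Introduces the CV-Arena benchmark, containing 12K high-resolution real-image instruction pairs across 16 task types, and evaluates 21 systems with the Active Elo protocol, which combines human and AI preferences; the evaluation reveals gaps in instruction adherence, physical reasoning, and related areas, and the accompanying CV-Agent model illustrates the potential of closed-loop reasoning.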

Comments
26 pages, 7 figures, 11 tables
AI中文摘要

指令引导的图像编辑正成为视觉工作的通用接口,然而现有基准仍主要聚焦于狭窄的外观编辑,未能充分捕捉专业工作流程中真实图像任务的多样性。在此,我们将教学计算机视觉问题求解定义为图像编辑的更广泛形式:给定真实输入图像和自然语言指令,系统必须生成编辑后的输出,实现所要求的变换,同时满足明确的保持性、几何、物理和可用性约束。我们引入了CV-Arena,一个旨在以专业规模评估此能力的开放基准。CV-Arena包含12K高分辨率真实图像指令对,涵盖16种基于指令的视觉任务类型,通过CogRetriever构建,这是一个结合目标网络搜索、代理查询精化、验证和可追溯性的双轨检索与筛选流水线。为了在保持人类保真度的同时大规模评估模型,我们提出了Active Elo,一种人类-AI协作偏好协议,利用CV-Judge(一个逻辑门控、多维度VLM评估器)拒绝明显失败并解决高置信度比较,并将接近的高质量比较路由给专家评分者。然后通过可靠性加权的Elo更新聚合混合的人类和AI监督。我们对21个系统(包括专有、开源和代理模型)在CV-Arena上的全面评估揭示了指令遵循、物理推理、结构控制和细粒度细节保持方面的持续差距。我们进一步开发了CV-Agent,一个轻量级代理模型,结合了规划、编辑和验证,并证明了闭环推理是专业级指令遵循视觉编辑的一个有前景的方向。

英文摘要

Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scale. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that uses CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons, and routes close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
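
A reliability-weighted Elo update can be sketched as follows. The expected-score formula is the standard Elo one; scaling the step by a per-comparison `reliability` in [0, 1] is an assumption made for illustration, as are the function names and the K-factor of 32. The abstract does not specify how Active Elo weights its updates.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of system A against system B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def weighted_elo_update(r_a: float, r_b: float, outcome_a: float,
                        reliability: float, k: float = 32.0):
    """Return updated (r_a, r_b).

    outcome_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    reliability (assumed, in [0, 1]) shrinks the step for less trusted judges.
    """
    step = k * reliability * (outcome_a - expected_score(r_a, r_b))
    return r_a + step, r_b - step


# Hypothetical comparison: a trusted human vote vs. a half-trusted AI vote.
human_vote = weighted_elo_update(1500.0, 1500.0, 1.0, reliability=1.0)
ai_vote = weighted_elo_update(1500.0, 1500.0, 1.0, reliability=0.5)
print(human_vote, ai_vote)
```

With equal starting ratings, the fully trusted vote moves each rating by 16 points and the half-trusted vote by 8, so less reliable judgments still contribute but cannot dominate the ranking.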

2606.00930 2026-06-02 cs.CL cs.AI cs.LG

Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink

检测 vs. 执行:单桶探针遗漏了 Mamba-2 状态汇的一半

Yuhang Jiang

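Affiliations: Independent Researcher

AI summary: Finds that the state sink in Mamba-2 decomposes into two functional head sets; single-bucket probes recover only the execution layer and miss the detection layer, showing that representational similarity does not imply functional equivalence.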

详情
Comments
16 pages, 3 figures
AI中文摘要

机械可解释性通常假设识别表征特征的探针也能识别执行相应计算的电路。我们证明这一假设在 Mamba-2 中可能系统性失败。通过研究状态汇(边界 token 上不成比例的 Delta 门控激活,类似于注意力汇),我们发现单桶探针仅能恢复一个小的执行层,而遗漏了具有相同表征特征的更大的检测层。 在 Mamba-2 中,状态汇分解为两个功能头集。单桶 BOS 专家头(在 2.7B 模型中约占 5% 的头)在模型规模和语料库上因果支持 BOS 上下文和新行目标预测。双头(占头的 27-35%,通过同一探针的多类聚合恢复)表现出更强的 BOS-新行表征相似性,但在消融下因果效应显著较弱。表征相似性并不意味着功能等价。 这一区别对下游行为至关重要:消融 BOS 专家头使 Mamba-1 2.8B 和 Mamba-2 2.7B 在 1024 上下文长度下的 RULER NIAH 检索准确率从 1.00 降至 0.00,而大小匹配的补集保持基线性能。随机通道分桶控制排除了仅由基质粒度造成的可能,暗示 Mamba-2 的头共享 Delta 投影。探针导出的专长可以识别执行电路;在粗粒度下,同一探针也能恢复检测电路,而区分它们需要类别条件消融而非类别条件余弦。

英文摘要

Mechanistic interpretability often assumes that probes identifying a representational signature also identify the circuit executing the corresponding computation. We show that this assumption can fail systematically in Mamba-2. Studying the state sink (disproportionate Delta-gate activation on boundary tokens, analogous to the attention sink), we find that single-bucket probes recover only a small execution layer while missing a much larger detection layer with the same representational signature. In Mamba-2, the state sink decomposes into two functional head sets. Single-bucket BOS-specialist heads (about 5% of heads at 2.7B) causally support both BOS-context and newline-target predictions across model scales and corpora. Dual heads (27-35% of heads, recovered by multi-class aggregation of the same probe) show stronger BOS-newline representational similarity but substantially weaker causal effects under ablation. Representational similarity does not imply functional equivalence. This distinction matters for downstream behaviour: ablating BOS-specialist heads collapses RULER NIAH retrieval accuracy from 1.00 to 0.00 at 1024 context length in both Mamba-1 2.8B and Mamba-2 2.7B, while size-matched complements preserve baseline performance. A random channel-bucketing control rules out substrate granularity alone, implicating Mamba-2's head-shared Delta projection. Probe-derived specialty can identify execution circuits; at coarse granularity the same probe also recovers detection circuits, and separating them requires class-conditional ablation rather than class-conditional cosine.

2606.00928 2026-06-02 cs.CV cs.LG

Single-Channel Tissue Segmentation via Cross-Modal Distillation from Foundation Models

基于基础模型跨模态蒸馏的单通道组织分割

Sakib Mohammad, Jarin Ritu, Md Sakhawat Hossain

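Affiliations: Department of Engineering Technology; Department of Electrical and Computer Engineering; Department of Mechanical Engineering

AI summary: Proposes a cross-modal knowledge distillation framework that transfers knowledge from a foundation-model teacher with multi-channel input to a lightweight student network using only the nuclear channel, substantially improving single-channel tissue segmentation.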

Comments
6 pages, 3 figures
AI中文摘要

多重荧光显微镜通过提供互补通道(包括核(DAPI)和膜(E-cadherin))改善组织分割,这些通道共同编码比单通道成像更丰富的空间上下文。然而,多重模型在推理时需要所有通道,限制了在仅部分通道可用时的部署。本文提出一个跨模态知识蒸馏框架,将处理多重输入的基础模型教师的语义信息迁移到仅使用核通道的轻量级学生网络。蒸馏目标结合了基于MSE的概率匹配、边界感知监督和可学习的不确定性加权。在TissueNet和BBBC038上,评估了SAM ViT-H和CellSAM作为教师,四个U-Net学生:Swin-Tiny(27M)、ResNet18(11M)、EfficientNet-B0(5.3M)和MobileNetV3(1.5M)。在TissueNet上,SAM蒸馏的Swin-Tiny学生达到Dice 78.36(±1.44),比无KD基线(65.31±1.35)提高13.05分,并以23倍参数缩减恢复了教师oracle性能(89.12±1.21)的87.9%。KD一致地使所有四个学生提高约12个Dice点,确认了架构无关的蒸馏。在所有设置中,SAM ViT-H作为教师优于CellSAM。在BBBC038上的跨数据集评估显示,无需教师重新训练即可获得一致增益。

英文摘要

Multiplexed fluorescence microscopy improves tissue segmentation by providing complementary channels, including nuclear (DAPI) and membrane (E-cadherin), that together encode richer spatial context than single-channel imaging alone. However, multiplexed models require all channels at inference, limiting deployment where only a subset is available. This work proposes a cross-modal knowledge distillation (KD) framework that transfers semantic information from a frozen foundation model teacher processing multiplexed input to a lightweight student operating on the nuclear channel only. The distillation objective combines MSE-based probability matching, boundary-aware supervision, and learnable uncertainty weighting. SAM ViT-H and CellSAM are evaluated as teachers across four U-Net students: Swin-Tiny (27M), ResNet18 (11M), EfficientNet-B0 (5.3M), and MobileNetV3 (1.5M), on TissueNet and BBBC038. On TissueNet, the SAM-distilled Swin-Tiny student achieves Dice 78.36 ± 1.44, a 13.05-point improvement over the no-KD baseline (65.31 ± 1.35) and 87.9% recovery of teacher oracle performance (89.12 ± 1.21) at a 23x parameter reduction. KD consistently improves all four students by approximately 12 Dice points, confirming architecture-agnostic distillation. SAM ViT-H outperforms CellSAM as teacher across all settings. Cross-dataset evaluation on BBBC038 shows consistent gains without teacher retraining.
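
The reported TissueNet figures can be cross-checked with simple arithmetic. The sketch below assumes that "recovery" means the student's Dice divided by the teacher's Dice; the abstract does not define it, and the variable names are illustrative.

```python
student_kd = 78.36      # SAM-distilled Swin-Tiny Dice on TissueNet
student_no_kd = 65.31   # same student trained without distillation
teacher_oracle = 89.12  # teacher with full multiplexed input

# Absolute gain from distillation, in Dice points.
gain = round(student_kd - student_no_kd, 2)

# Share of teacher performance recovered (assumed definition: student / teacher).
recovery_pct = round(100 * student_kd / teacher_oracle, 1)

print(gain, recovery_pct)
```

Both values match the abstract (13.05 points and 87.9%), which supports reading the recovery figure as a plain ratio of Dice scores.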

2606.00927 2026-06-02 cs.CV

Bridging Topology and Deep Representation Learning: A TDA-ViT Fusion Model for Four-Class Brain Tumor Classification

桥接拓扑与深度表示学习:用于四类脑肿瘤分类的TDA-ViT融合模型

Faisal Ahmed

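Affiliations: Department of Data Science and Mathematics

AI summary: Proposes a framework that fuses topological data analysis (TDA) features with pretrained Vision Transformer (ViT) representations for four-class brain tumor classification, reaching 99.10% accuracy on the BRISC2025 dataset.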

详情
Comments
21 pages, 4 figures
AI中文摘要

从磁共振成像(MRI)中准确分类脑肿瘤是早期诊断和临床决策的关键要求。Vision Transformers(ViTs)通过学习全局上下文表示在医学图像分析中表现出强大性能,但它们通常无法捕捉肿瘤区域中存在的内在结构和拓扑模式。为了解决这一局限性,我们提出了一种融合框架,将拓扑数据分析(TDA)特征与预训练的Vision Transformer表示相结合,用于四类脑肿瘤分类。在所提出的方法中,TDA用于提取补充的拓扑描述符,这些描述符从MRI图像中捕捉几何结构、连通性和形状信息。同时,预训练的ViT模型从相同图像中学习高级语义表示。然后将这两个特征空间融合,形成统一且更具判别性的表示用于分类。该模型在BRISC2025数据集上进行评估,该数据集包含四类脑肿瘤:胶质瘤、脑膜瘤、垂体瘤和非肿瘤病例。实验结果表明,与单独使用任一方法相比,结合拓扑和基于Transformer的特征显著提高了性能。所提出的TDA-ViT融合模型实现了99.10%的准确率、99.27%的精确率、99.15%的召回率、99.21%的F1分数和99.98%的AUC。它还优于几种最先进的模型,包括ResNet50、ResNet101、EfficientNetB2和独立的Vision Transformers。这些结果表明,拓扑特征提供了有价值的补充信息,增强了深度表示学习,从而为自动脑肿瘤分类提供了一个稳健且高精度的框架。

英文摘要

Accurate brain tumor classification from magnetic resonance imaging (MRI) is a key requirement for early diagnosis and clinical decision-making. Vision Transformers (ViTs) have shown strong performance in medical image analysis by learning global contextual representations, but they often fail to capture intrinsic structural and topological patterns present in tumor regions. To address this limitation, we propose a fusion framework that combines Topological Data Analysis (TDA) features with pretrained Vision Transformer representations for four-class brain tumor classification. In the proposed method, TDA is used to extract complementary topological descriptors that capture geometric structure, connectivity, and shape information from MRI images. In parallel, a pretrained ViT model learns high-level semantic representations from the same images. These two feature spaces are then fused to form a unified and more discriminative representation for classification. The model is evaluated on the BRISC2025 dataset, which contains four brain tumor classes: glioma, meningioma, pituitary tumor, and non-tumor cases. Experimental results show that combining topological and transformer-based features significantly improves performance compared to using either approach alone. The proposed TDA-ViT fusion model achieves an accuracy of 99.10%, precision of 99.27%, recall of 99.15%, F1-score of 99.21%, and an AUC of 99.98%. It also outperforms several state-of-the-art models, including ResNet50, ResNet101, EfficientNetB2, and standalone Vision Transformers. These results demonstrate that topological features provide valuable complementary information that enhances deep representation learning, leading to a robust and highly accurate framework for automated brain tumor classification.

2606.00926 2026-06-02 cs.LG cs.CL

Task Structure Reverses Layerwise State Encoding in Sequence Models

任务结构逆转序列模型中的层级状态编码

Yuhang Jiang

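Affiliations: Independent Researcher

AI summary: Through experiments on formal and pretrained models, finds that the layerwise distribution of state encoding in sequence models (Transformer, Mamba, LSTM, and others) reverses with task structure (Parity, Dyck-k, S3), and that the grouping is determined by computational structure (prefix update vs. stack) rather than algebraic structure (commutativity).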

详情
Comments
20 pages, 11 figures, 8 tables
AI中文摘要

序列模型的机制研究通常将层级状态编码视为架构特征:循环模型集中可读状态,注意力模型分散状态。我们发现,当任务改变时,同一架构会逆转这种分布。在Transformer、Mamba、Mamba-2、LSTM和GRU中,Parity在Mamba和循环基线中集中在后期,而Transformer逐步构建;在有界深度Dyck-k上模式翻转。同样的翻转出现在微调的Mamba-130M和Pythia-160M中,且Pythia在Dyck上的瓶颈在410M时仍然存在。文献中混淆了两种解释:代数结构(交换性)与计算结构(前缀更新 vs. 栈)。为了区分它们,我们添加了第三个任务:非交换的S3置换组合。在所有五种架构的层级探测和Mamba特有的Conv1D归因中,S3与Parity而非Dyck归为一组,因此分组追踪的是计算结构而非交换性。因果干预表明,在4层形式模型中,线性可读方向通常是功能上必要的,并且在Parity和Dyck上的分布外长度上可能仍然重要。在预训练规模上,情况出现分化。微调的Pythia在Dyck上存在强中间层瓶颈(在160M时,L6-L7消融使准确率下降约81%;在410M时,L4-L18出现更宽的瓶颈),而在最佳探测层上则弱得多。预训练的Mamba表现出互补的失败模式:其最后一层高度可读,但没有任何单个探测方向能在Parity、Dyck或S3上破坏任务,然而中间位置的激活修补恢复了约97-98%的干净-损坏logit差距。探测定位了状态线性可用的位置,并不总是计算瓶颈所在。机制特征是架构和任务共同的性质。

英文摘要

Mechanistic studies of sequence models often treat layerwise state encodings as architectural traits: recurrent models concentrate readable state, attention-based models distribute it. We find that the same architecture reverses this profile when the task changes. Across Transformers, Mamba, Mamba-2, LSTMs, and GRUs, Parity is concentrated late in Mamba and the recurrent baselines and built gradually by Transformer; on bounded-depth Dyck-k the pattern flips. The same flip appears in fine-tuned Mamba-130M and Pythia-160M, and the Pythia Dyck bottleneck persists at 410M. Two explanations are conflated in the literature: algebraic structure (commutativity) versus computational structure (prefix update vs. stack). To separate them we add a third task: non-commutative S_3 permutation composition. S_3 groups with Parity, not Dyck, on layerwise probing across all five architectures and on Mamba-specific Conv1D attribution, so the grouping tracks computational structure rather than commutativity. Causal interventions show that, in the 4-layer formal models, linearly readable directions are often functionally necessary and can remain important at out-of-distribution lengths on Parity and Dyck. At pretrained scale the picture splits. Fine-tuned Pythia Dyck has a strong middle-layer bottleneck (L6-L7 ablation drops accuracy by roughly 81% at 160M; broader L4-L18 plateau at 410M), far weaker at the best-probe layer. Pretrained Mamba shows the complementary failure mode: its final layer is highly readable, no single probe direction breaks the task on Parity, Dyck, or S_3, yet mid-position activation patching there recovers about 97-98% of the clean-corrupted logit gap. Probing localizes where state is linearly available, not always where the computation is bottlenecked. Mechanistic signatures are properties of architecture and task together.

2606.00920 2026-06-02 cs.LG cs.AI cs.SE

Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

大型语言模型在确定性编程任务上的准确性、稳定性和重复运行可靠性

Yongxi Zhou, Lai Yun Choi, Jiaxi Wen, Wenbo Ye

发表机构 * Northeastern University, Massachusetts, USA(东北大学,马萨诸塞州,美国) University of Southern California, California, USA(南加州大学,加利福尼亚州,美国)

AI总结 通过重复运行评估协议,发现运行级通过率高估了无重试覆盖率高达17.8个百分点,且差距在中等性能系统中最大,表明稳定性分析是准确性报告的必要补充。

AI中文摘要

运行级通过率对无重试覆盖率的高估最多可达17.8个百分点,且差距恰恰在中等性能系统中最大。我们研究了大型语言模型(LLM)在确定性文本条件生成评估中的这种准确性-稳定性关系,以编程任务作为具体测试平台。标准代码生成基准强调单次运行准确性或在重复采样下的最终成功,但许多部署场景还需要稳定性:在相同任务描述下重复调用时的一致结果。我们提出了一种重复运行评估协议,包含运行级准确性、无重试覆盖率和每个问题的变异性指标。在一个包含100道LeetCode风格问题、按时效性构建的基准上,我们评估了来自五个提供者家族的16个模型,使用两种提示模板,每个问题重复运行五次,共产生16,000个评估实例。尽管运行级通过率与完美稳定率强相关(r=0.985),但通过率始终超过无重试覆盖率,这一差距最高达到17.8个百分点,并且即使在密切匹配的系统之间也会逆转模型排名。提示效应因模型而异,而非普遍有益。这些结果表明,对于确定性文本条件生成任务,重复运行稳定性分析是传统准确性报告的必要补充。

英文摘要

Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points -- and the gap is largest precisely for mid-performing systems. We investigate this accuracy--stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed. Standard code-generation benchmarks emphasize single-run accuracy or eventual success under repeated sampling, but many deployment settings also require stability: consistent outcomes across repeated invocations under the same task description. We present a repeated-run evaluation protocol with metrics for run-level accuracy, retry-free coverage, and per-problem variability. On a recency-based benchmark of 100 LeetCode-style problems, we evaluate 16 models from five provider families under two prompt templates with five repeated runs per problem, yielding 16,000 evaluation instances. Although run-level pass rate and perfect stability rate are strongly correlated (r=0.985), pass rate consistently exceeds retry-free coverage -- a gap that reaches 17.8 percentage points and reverses model rankings even among closely matched systems. Prompt effects are model-dependent rather than uniformly beneficial. These results suggest that repeated-run stability analysis is a necessary complement to conventional accuracy reporting for deterministic text-conditioned generation tasks.
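The gap between the two headline metrics can be made concrete with a small sketch. The definitions below are assumptions inferred from the abstract's wording, not the authors' code: run-level pass rate is taken as the fraction of all individual runs that pass, and retry-free coverage as the fraction of problems solved on every repeated run. The function names and the toy outcomes are hypothetical.

```python
from typing import Dict, List


def run_level_pass_rate(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of all individual runs that passed (assumed definition)."""
    runs = [ok for per_problem in outcomes.values() for ok in per_problem]
    return sum(runs) / len(runs)


def retry_free_coverage(outcomes: Dict[str, List[bool]]) -> float:
    """Fraction of problems that passed on every repeated run (assumed definition)."""
    return sum(all(per_problem) for per_problem in outcomes.values()) / len(outcomes)


# Hypothetical outcomes: four problems, five repeated runs each.
toy = {
    "p1": [True, True, True, True, True],
    "p2": [True, True, True, True, False],
    "p3": [True, False, True, True, True],
    "p4": [False, False, False, False, False],
}

pass_rate = run_level_pass_rate(toy)   # 13 passing runs out of 20
coverage = retry_free_coverage(toy)    # only p1 passes all five runs
gap = pass_rate - coverage

# Evaluation budget stated in the abstract:
# 16 models x 100 problems x 2 prompt templates x 5 runs.
evaluation_instances = 16 * 100 * 2 * 5
```

A single flaky run barely moves the pass rate but removes the whole problem from coverage, which is one way a mid-performing system can show a larger gap than a very strong or very weak one.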

2606.00919 2026-06-02 cs.CL cs.LG

Towards Lightweight Reliability: Using Soft Prompts for Hallucination Mitigation in Large Language Models

迈向轻量级可靠性:使用软提示缓解大型语言模型中的幻觉

S M Tahmid Siddiqui, Akib Jawad Ononto, Anoop Singhal, Latifur Khan

发表机构 * The University of Texas at Dallas(德克萨斯大学达拉斯分校) National Institute of Standards and Technology(国家标准与技术研究院)

AI总结 提出一种参数高效的软提示方法RCSP,通过对比学习、课程学习和KL正则化平衡事实回忆、幻觉抑制和弃权,在多个QA数据集上优于基线。

Comments
20 pages, 5 tables, 2 figures. Accepted for publication in DBSec 2026. The final publication will be available at Springer
AI中文摘要

大型语言模型(LLMs)已在各个领域得到广泛应用,但其可靠性常因幻觉——听起来合理但事实不正确的回答——而受到损害。在高风险领域,这些错误会降低信任并引入现实风险。为解决这一挑战,我们提出一种参数高效的方法,使用软提示来缓解幻觉内容并促进生成式问答(QA)任务中的负责任弃权。我们的方法称为负责任对比软提示(RCSP),使用复合损失训练软提示,以平衡三个目标:抑制幻觉内容、鼓励在不确定性下弃权、以及保持或改善事实回忆。为实现这些目标,我们在训练机制中融入对比损失、课程学习和KL正则化。我们使用LLM-as-a-Judge框架在五个不同的生成式QA数据集上评估我们的方法。在Gemma 3(12B)和Llama 3.1(8B)骨干上的实验结果表明,RCSP有效平衡了事实回忆与幻觉抑制和弃权,在F分数上通常优于标准推理和基于指令的提示基线。值得注意的是,这些改进仅通过训练其他调优技术所需参数的一小部分实现。我们的结果表明,软提示提供了一条模块化且计算高效的路径,用于提高LLM的可靠性。

英文摘要

Large language models (LLMs) have seen widespread adoption across various domains, yet their reliability is frequently undermined by hallucinations - responses that are plausible-sounding but factually incorrect. In high-stakes domains, these errors can reduce trust and introduce real-world risk. To address this challenge, we present a parameter-efficient approach that uses soft prompts to mitigate hallucinated content and promote responsible abstention in generative question-answering (QA) tasks. Our method, called Responsible Contrastive Soft Prompting (RCSP), uses a composite loss to train soft prompts that balance three goals: suppressing hallucinatory content, encouraging abstention under uncertainty, and preserving or improving factual recall. To achieve these goals, we incorporate contrastive loss, curriculum learning, and KL regularization into our training mechanism. We evaluate our approach on five diverse generative QA datasets using an LLM-as-a-Judge framework. Experimental results on the Gemma 3 (12B) and Llama 3.1 (8B) backbones demonstrate that RCSP effectively balances factual recall with hallucination suppression and abstention, yielding a generally superior F-score over standard reasoning and instruction-based prompting baselines. Notably, these improvements are achieved by training only a fraction of the parameters required by other tuning techniques. Our results demonstrate that soft prompts provide a modular and computationally efficient path toward improving LLM reliability.

2606.00914 2026-06-02 cs.AI cs.CL cs.CR

Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

对抗性输入流引导LLM智能体决策偏离其默认行为

Rana Muhammad Usman

发表机构 * Independent Researcher(独立研究者)

AI总结 本研究通过控制实验揭示,外部输入流的组成和排序能因果性地改变LLM智能体的下游决策,存在对抗性屈服、默认饱和及默认方向不对称三种响应模式,且该效应在多个决策领域普遍存在。

Comments
14 pages, 5 figures. Code, post pools, and 2,785 decision rollouts: https://github.com/ranausmanai/recommenders-as-control-surfaces
AI中文摘要

LLM智能体越来越多地在消费排序后的外部信息流(如社交推送、搜索结果、检索上下文和邮件队列)后采取行动,然而安全评估几乎总是孤立地测试模型或用户提示,从未测试决定智能体在行动前读取内容的上游排序器。我们引入了一个受控协议,固定模型、角色、主题和最终决策提示,仅改变智能体在之前十轮“滚动”阶段中遇到的帖子的组成和顺序,从而隔离输入流策划对下游决策的因果效应。在来自三个独立实验室的四个现代开放指令LLM上进行的2,785次决策展开中,我们识别出三种响应模式:对抗性屈服、默认饱和以及默认方向不对称——其中单边输入流会扭转模型原本不确定的决策(最明显的情况下从5%到100%;Fisher p值低至3×10^-10),但无法动摇模型已经偏好或坚定持有的决策。该效应遵循剂量-反应曲线,通过生成器交换(排除了写作风格伪影)后依然存在,在多个决策领域(包括安全相关选择,如移除部署批准门或放松访问控制)中普遍存在,并且可以通过两种简单的输入流级防御部分缓解;前沿模型保留其默认行为。我们将推荐系统描述为LLM智能体的一种实用的、受默认边界约束的控制面,并认为智能体评估必须审计输入流层,而不仅仅是最终提示。

英文摘要

LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten-turn "scrolling" phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default-direction asymmetry in which a one-sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 x 10^-10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose-response curve, survives a generator swap that rules out a writing-style artifact, generalizes across several decision domains including security-relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed-level defenses; a frontier model retains its default. We characterize the recommender as a practical, default-bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.

2606.00910 2026-06-02 cs.CV cs.LG

Reason, Retrieve, Re-rank: A Zero-Shot Reasoning-Aware Framework for Composed Video Retrieval

推理、检索、重排序:一种用于组合视频检索的零样本推理感知框架

Ali Alavi

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 提出R3-CoVR零样本管道,通过多模态大模型推理编辑后状态、对比编码检索和约束感知重排序,在CVPR 2026 VidLLMs挑战赛上达到91.9% R@1和98.2% R@10。

AI中文摘要

组合视频检索(CoVR)旨在通过对参考视频应用自由形式的文本修改来寻找目标视频。我们应对CVPR 2026 VidLLMs研讨会上的推理感知CoVR(CoVR-R)挑战,其中检索严格为零样本。我们提出R3-CoVR(推理、检索、重排序),一个完全由冻结基础模型构建的无训练管道。多模态大语言模型(Qwen3-VL-8B)推理编辑所隐含的“后效”——状态转换、动作阶段、场景、镜头和节奏——并生成简洁的编辑后描述;对比视频-文本编码器(SigLIP-2)对该描述和图库进行嵌入以进行第一阶段检索;最后,一个约束感知重排序阶段使用相同的多模态模型作为评判者,对每个候选视频针对预期的编辑结果进行评分。在挑战测试集上,R3-CoVR达到了91.9%的R@1和98.2%的R@10。两个发现推动了这些结果:(i)将描述长度匹配到对比编码器的文本窗口使R@1从67.5提升到72.7;(ii)仅对候选列表进行重排序的约束感知重排序器将R@1从72.7提升到91.9——这是最大的单一增益。我们分析了重排序器的行为、检索/重排序混合以及候选列表深度,并发布了一个干净的三层实现。

英文摘要

Composed Video Retrieval (CoVR) seeks the target video that results from applying a free-form textual modification to a reference video. We address the Reason-Aware CoVR (CoVR-R) challenge at the CVPR 2026 VidLLMs workshop, where retrieval is strictly zero-shot. We present R3-CoVR (Reason, Retrieve, Re-rank), a training-free pipeline built entirely from frozen foundation models. A multimodal large language model (Qwen3-VL-8B) reasons about the after-effects an edit implies -- state transitions, action phases, scene, camera and tempo -- and verbalises a concise post-edit description; a contrastive video-text encoder (SigLIP-2) embeds this description and the gallery for first-stage retrieval; finally a constraint-aware re-ranking stage uses the same multimodal model as a judge that scores each shortlisted candidate against the intended edited result. On the challenge test set, R3-CoVR attains 91.9% R@1 and 98.2% R@10. Two findings drive these results: (i) matching the description length to the contrastive encoder's text window lifts R@1 from 67.5 to 72.7; and (ii) the constraint-aware re-ranker, which reorders only the shortlist, lifts R@1 from 72.7 to 91.9 -- the single largest gain. We analyse the re-ranker's behaviour, the retrieve/re-rank blend, and the shortlist depth, and we release a clean three-layer implementation.
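The retrieve-then-re-rank structure described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the R3-CoVR implementation: the embeddings are hand-written vectors rather than SigLIP-2 outputs, `judge` is a hypothetical stand-in for the multimodal model's score, and the linear `blend` between judge score and similarity is an assumed form of the "retrieve/re-rank blend".

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_then_rerank(
    query_vec: Sequence[float],
    gallery: Dict[str, Sequence[float]],
    judge: Callable[[str], float],
    depth: int = 2,
    blend: float = 0.5,
) -> List[str]:
    """Rank the gallery by similarity, then reorder only the top-`depth` shortlist."""
    # Stage 1: first-stage retrieval by embedding similarity.
    sims = {vid: cosine(query_vec, vec) for vid, vec in gallery.items()}
    ranked = sorted(sims, key=sims.get, reverse=True)
    shortlist, tail = ranked[:depth], ranked[depth:]
    # Stage 2: blend the judge score with the similarity (assumed linear blend).
    blended = {vid: blend * judge(vid) + (1 - blend) * sims[vid] for vid in shortlist}
    reranked = sorted(shortlist, key=blended.get, reverse=True)
    # Candidates outside the shortlist keep their first-stage order.
    return reranked + tail


# Hypothetical gallery and judge scores.
gallery = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
judge_scores = {"a": 0.1, "b": 0.9, "c": 1.0}
result = retrieve_then_rerank([1.0, 0.0], gallery, judge_scores.get, depth=2, blend=0.5)
```

In this toy case the judge promotes `b` over `a` inside the shortlist, while `c` stays last despite its high judge score because it never entered the shortlist; that dependence is why shortlist depth is a parameter worth analysing.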

2606.00909 2026-06-02 cs.CL cs.AI

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

MLLM-Microscope:解锁多模态大语言模型中的隐藏结构

Ravil Mussabayev, Rustam Mussabayev

发表机构 * Satbayev University(萨特拜耶夫大学)

AI总结 提出MLLM-Microscope系统,通过分析线性度、内在维度和各向异性,揭示多模态大语言模型中隐藏的表示结构,并基于ScienceQA数据集评估LLaVA-NeXT和OmniFusion,发现模态融合方式显著影响模型内部工作机理。

AI中文摘要

本文提出MLLM-Microscope,一个用于分析多模态大语言模型(MLLMs)中隐藏表示的新型系统。我们的系统评估了跨transformer层的多模态token嵌入的线性度、内在维度和各向异性。利用ScienceQA数据集,我们评估了两个最先进的MLLM:LLaVA-NeXT和OmniFusion。我们发现,两种模态的token的主流和残差流在transformer层中均表现出高度线性行为。然而,LLaVA-NeXT的图像token线性度略有下降,而OmniFusion的保持一致。与LLaVA-NeXT相比,OmniFusion的图像token维度在各层中始终较高。此外,观察到OmniFusion的各向异性在各层中保持较低水平。这些发现表明,MLLM的内部工作高度依赖于将token序列传入LLM之前执行的模态融合的性质。这一发现以及从我们的系统中获得的其他潜在新见解,无疑能够增强我们对MLLM内部工作的理解,为未来的模型设计和优化提供信息。

英文摘要

This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Models (MLLMs). Our system evaluates the linearity, intrinsic dimension, and anisotropy of multimodal token embeddings across transformer layers. Using the ScienceQA dataset, we evaluate two state-of-the-art MLLMs, LLaVA-NeXT and OmniFusion. We find that both the main and residual streams for tokens of both modalities exhibit highly linear behavior across transformer layers. However, LLaVA-NeXT's image tokens show a slight decline in linearity, whereas OmniFusion's remain consistent. Image token dimensions in OmniFusion remain consistently higher across layers than in LLaVA-NeXT. OmniFusion's anisotropy also stays consistently low throughout the layers. These findings suggest that the inner workings of MLLMs depend heavily on the nature of the modality fusion performed before the token sequence is passed into the LLM. This finding, and other insights obtainable from our system, can enhance our understanding of the inner workings of MLLMs and inform future model design and optimization.

2606.00906 2026-06-02 cs.CV

hZACH-ViT: Curved Latent Geometry for Compact Vision Transformers in Low-Data Medical Imaging

hZACH-ViT:用于低数据医学成像中紧凑视觉Transformer的曲率潜在几何

Athanasios Angelakis

发表机构 * BioML Lab, Research Institute CODE, UniBw, Munich, Germany(BioML实验室,CODE研究机构,UniBw,慕尼黑,德国) Department of Epidemiology and Data Science, Amsterdam UMC, Amsterdam, Netherlands(流行病学与数据科学系,阿姆斯特丹大学医学中心,阿姆斯特丹,荷兰)

AI总结 提出hZACH-ViT,通过扩展ZACH-ViT的潜在空间为双曲或球形几何,在低数据医学成像中提升紧凑视觉Transformer的性能,并在MedMNIST数据集上平均提升+0.021。

Comments
17 pages, 2 figures, 4 tables. Code, execution notebooks, and aggregated result summaries will be released at https://github.com/Bluesman79/hZACH-ViT upon publication
AI中文摘要

紧凑视觉Transformer在低数据和资源受限的医学成像场景中具有吸引力,但大多数现有变体假设欧几里得潜在几何足以组织图像表示。我们引入了hZACH-ViT,这是ZACH-ViT的曲率几何扩展家族,ZACH-ViT是一种紧凑的零令牌视觉Transformer,它去除了位置嵌入和类别令牌,并依赖于补丁表示的全局平均池化。为了隔离几何的作用,我们保留了经过验证的ZACH-ViT骨干网络,仅修改了最终表示空间和基于原型的分类器头部,从而实现了欧几里得、双曲和球形潜在几何之间的受控比较。我们在七个MedMNIST数据集上评估了庞加莱、克莱因和球形hZACH-ViT头部,采用相同的少样本协议,每个类别50个样本和五个随机种子。完整的基准测试包含770次训练运行,涵盖七个数据集、三种非欧几里得几何、七个曲率幅度以及一个欧几里得基线。在所有七个数据集中,最佳非欧几里得hZACH-ViT配置优于欧几里得ZACH-ViT,在数据集特定的主要指标上平均提升+0.021,在OCTMNIST上提升最大(+0.055 MacroF1)。固定的低曲率配置在大多数数据集上保持正向增益,低曲率值(c = 0.1或0.2)占据了七个数据集级别优胜者中的六个。我们的结果并未确定一个普遍最优的流形,而是将几何和曲率确立为数据集依赖的模型选择变量,固定的低曲率分析证实了增益在详尽的逐数据集调优之外仍然存在。

英文摘要

Compact Vision Transformers are attractive for medical imaging in low-data and resource-constrained settings, but most existing variants assume that Euclidean latent geometry is sufficient for organizing image representations. We introduce hZACH-ViT, a family of curved-geometry extensions of ZACH-ViT, a compact zero-token Vision Transformer that removes positional embeddings and the class token and relies on global average pooling over patch representations. To isolate the role of geometry, we preserve the verified ZACH-ViT backbone and modify only the final representation space and prototype-based classifier head, enabling a controlled comparison between Euclidean, hyperbolic, and spherical latent geometries. We evaluate Poincaré, Klein, and spherical hZACH-ViT heads on seven MedMNIST datasets under an identical few-shot protocol with 50 samples per class and five random seeds. The completed benchmark contains 770 training runs spanning seven datasets, three non-Euclidean geometries, seven curvature magnitudes, and a Euclidean baseline. Across all seven datasets, the best non-Euclidean hZACH-ViT configuration improves over Euclidean ZACH-ViT, with an average gain of +0.021 in the dataset-specific primary metric and the largest improvement on OCTMNIST (+0.055 MacroF1). Fixed low-curvature configurations retain positive gains on the majority of datasets, and low curvature values (c = 0.1 or 0.2) account for six of the seven dataset-level winners. Rather than identifying a universally optimal manifold, our results establish geometry and curvature as dataset-dependent model-selection variables, with fixed low-curvature analyses confirming that gains persist beyond exhaustive per-dataset tuning.
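To illustrate what a curved-geometry prototype head computes, here is a minimal sketch of nearest-prototype classification in the Poincaré ball with curvature magnitude c, using the standard Möbius-addition form of the geodesic distance. It is not the hZACH-ViT code: the prototype coordinates and class labels are hypothetical, and the run-count breakdown at the end is only one reading that happens to match the stated total of 770.

```python
import math
from typing import Dict, List, Sequence


def mobius_add(x: Sequence[float], y: Sequence[float], c: float) -> List[float]:
    """Mobius addition of x and y in the Poincare ball of curvature -c."""
    xy = sum(a * b for a, b in zip(x, y))
    x2 = sum(a * a for a in x)
    y2 = sum(b * b for b in y)
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [((1 + 2 * c * xy + c * y2) * a + (1 - c * x2) * b) / denom
            for a, b in zip(x, y)]


def poincare_distance(x: Sequence[float], y: Sequence[float], c: float) -> float:
    """Geodesic distance: (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+) y||)."""
    diff = mobius_add([-a for a in x], y, c)
    norm = math.sqrt(sum(d * d for d in diff))
    return 2 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm)


def nearest_prototype(z: Sequence[float], prototypes: Dict[str, Sequence[float]], c: float) -> str:
    """Assign z to the class whose prototype is closest in hyperbolic distance."""
    return min(prototypes, key=lambda label: poincare_distance(z, prototypes[label], c))


# Hypothetical 2-D prototypes and a pooled embedding already inside the ball.
prototypes = {"class_a": [0.1, 0.0], "class_b": [-0.4, 0.3]}
predicted = nearest_prototype([0.2, 0.1], prototypes, c=0.1)

# One reading of the benchmark size (an assumption, not stated in the abstract):
# 7 datasets x (3 geometries x 7 curvatures + 1 Euclidean baseline) x 5 seeds.
total_runs = 7 * (3 * 7 + 1) * 5
```

As c approaches zero the distance approaches twice the Euclidean distance, which is consistent with low-curvature heads behaving as mild perturbations of the Euclidean baseline.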

2606.00902 2026-06-02 cs.AI

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

Ryze:从生物医学论文中合成富含证据的数据

Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su, Luo Mai

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 提出 Ryze 系统,自动从生物医学论文中生成包含完整证据结构的训练数据,并训练出领域专用 VLM BioVLM-8B,在 LAB-Bench 上以低于 200 美元成本达到 48.0% 加权准确率。

Comments
Accepted at ACL 2026 System Demonstrations Track. 8 pages, 6 figures
AI中文摘要

通用视觉语言模型在生物医学研究中仍然不可靠,因为科学论文中的有效答案依赖于分散在图、表、图表、标题和引用文本中的证据。现有的后训练流程受到昂贵的专家标注和丢弃证据结构的合成数据的瓶颈。我们提出了 Ryze,一个全自动系统,将原始生物医学论文转换为富含证据的训练集和领域专用的视觉语言模型。Ryze 合成带有完整支持证据(视觉元素、标题、提取的结构和引用段落)的问答对,通过图表/表格感知提取和基于大语言模型的清洗减少布局和 OCR 错误,并应用结合监督微调和强化学习的进度门控后训练策略。从 Qwen3-VL-8B 开始,Ryze 以不到 200 美元的成本生产出 BioVLM-8B,在 LAB-Bench 上达到 48.0% 的加权准确率,比基础模型高出 12.6 个百分点,并超过 GPT-5.2 3.8 个百分点。我们将 Ryze 与训练好的 BioVLM-8B 模型一起开源发布。

英文摘要

General-purpose VLMs remain unreliable for biomedical research because valid answers in scientific papers depend on evidence split across figures, tables, charts, captions, and referring text. Existing post-training pipelines are bottlenecked by costly expert annotation and by synthetic data that drops this evidence structure. We present Ryze, a fully automated system that converts raw biomedical papers into an evidence-enriched training set and a domain-specialized VLM. Ryze synthesizes QA pairs with complete supporting evidence (visual element, caption, extracted structure, and referring paragraphs), reduces layout and OCR errors via chart/table-aware extraction and LLM-based cleansing, and applies a progress-gated post-training strategy combining supervised fine-tuning with reinforcement learning. Starting from Qwen3-VL-8B, Ryze produces BioVLM-8B at under USD 200, achieving 48.0% weighted accuracy on LAB-Bench, outperforming the base model by +12.6 percentage points (pp) and surpassing GPT-5.2 by +3.8 pp. We release Ryze as open source together with the trained BioVLM-8B model.

2606.00898 2026-06-02 cs.CL cs.DL

Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs

引用溯源:通过法律引用图检测和减少LLM引用幻觉

Volodymyr Ovcharov

发表机构 * LEX AI LLC

AI总结 提出引用溯源(CG)指标,利用乌克兰法院判决的引用图(1.008亿判决,5.02亿边)检测LLM法律引用幻觉,并通过CG-DPO方法(基于真实判决构建偏好对)减少幻觉,在100个法律查询上CG为0.791-0.873,幻觉率13-21%。

Comments
14 pages, 3 figures, 3 tables. Code and data: https://huggingface.co/datasets/overthelex/citation-grounding-eval
AI中文摘要

大型语言模型系统性地产生法律引用幻觉——编造法规引用、引用已废除条款、混淆司法管辖区——但目前尚无自动化方法可大规模测量或减少此行为。我们提出引用溯源(CG),该指标通过从1.008亿乌克兰法院判决(5.02亿条边,21,736个唯一法规节点)中提取的真实引用图来验证LLM生成的法律引用。CG分解为三个组成部分——引用精确性(引用的条款是否存在?)、引用相关性(是否上下文相关?)和引用时效性(在相关日期是否有效?)——从而实现对幻觉类型的差异化诊断。对100个乌克兰法律查询的实证评估(涉及五个系统:通过AWS Bedrock的四个商业LLM——Claude Haiku 4.5、Mistral Pixtral Large、Amazon Nova Pro/Lite——以及一个RAG增强的生产系统)显示CG范围为0.791至0.873,其中13-21%的引用是幻觉。为了在没有人工标注的情况下减少幻觉,我们引入了引用溯源DPO(CG-DPO):一种通过四种针对性策略从真实法院判决中破坏已验证引用来自动构建偏好对的方法。在包含2,244个法院判决的数据集上,使用LoRA微调的Qwen2.5-7B-Instruct模型在区分正确和错误引用方面达到了98.5%的平均验证准确率(奖励边际+14.9,3个种子的标准差<0.3个百分点)。引用图、评估框架和CG-DPO数据集作为开放资源发布。

英文摘要

Large language models systematically hallucinate legal citations -- fabricating statute references, citing repealed provisions, and confusing jurisdictions -- yet no automated method exists to measure or reduce this behavior at scale. We propose citation grounding (CG), a metric that verifies LLM-generated legal citations against a ground-truth citation graph extracted from 100.8 million Ukrainian court decisions (502 million edges, 21,736 unique statute nodes). CG decomposes into three components -- citation precision (does the cited provision exist?), citation relevance (is it contextually appropriate?), and citation temporality (was it valid at the relevant date?) -- enabling differential diagnosis of hallucination types. Empirical evaluation on 100 Ukrainian legal queries across five systems -- four commercial LLMs via AWS Bedrock (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite) and one RAG-augmented production system -- reveals CG ranging from 0.791 to 0.873, with 13-21% of citations hallucinated. To reduce hallucinations without human annotation, we introduce Citation Grounding DPO (CG-DPO): a method that constructs preference pairs algorithmically by corrupting verified citations from real court decisions via four targeted strategies. On a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieves 98.5% mean validation accuracy in distinguishing correct from corrupted citations (rewards margin +14.9, std < 0.3 pp across 3 seeds). The citation graph, evaluation framework, and CG-DPO dataset are released as open resources.

2606.00892 2026-06-02 cs.LG cs.CE physics.comp-ph

An Exploratory Study into using Machine-Learning for Fast Step-by-step Emulation of Numerical Mechanical Thrombectomy Simulations for Ischemic Stroke

使用机器学习快速逐步模拟缺血性卒中机械取栓数值仿真的探索性研究

Thijs Stessen

发表机构 * MSc Artificial Intelligence Master Thesis(人工智能硕士论文) Thijs Stessen MSc. Thijs Kuipers Dr. Simone Saitta

AI总结 本研究探索使用机器学习替代模型逐步加速机械取栓数值仿真,在简化抽吸过程中实现显著加速,但复杂几何下的长期稳定性不足。

Comments
40 pages, 16 figures, master thesis artificial intelligence
AI中文摘要

使用机械取栓治疗缺血性卒中涉及在时间紧迫下做出困难决策。数值物理仿真理论上可以为操作者提供关于治疗方法和设备选择的更好决策信息,但在实践中速度太慢。在本论文中,我们研究当前基于机器学习的替代模型能否在显著加速的同时,以逐步方式准确模拟这些仿真。为此,我们在两个涉及简化抽吸过程的仿真上训练了三个替代模型,几何复杂度不同。结果表明,其中两个模型能准确预测单个仿真步骤并提供显著加速,尤其是结合特定数据增强时。然而,这些模型在长时间模拟复杂几何时表现出缺乏稳定性。总体而言,这项工作为未来研究开发稳定方法并扩展到机械取栓的现实数值物理仿真奠定了基础。

英文摘要

The treatment of ischemic stroke using mechanical thrombectomy involves difficult decisions under intense time constraints. Numerical physics simulations can in theory help operators make better decisions about treatment approaches and device selection, but they are too slow to do so in practice. In this thesis, we investigate whether current machine-learning-based surrogates can accurately emulate these simulations step by step while making them significantly faster. To do this, we train three surrogate models on two simulations of a simplified aspiration procedure with varying levels of geometric complexity. Our results show that two of our models accurately predict single simulation steps and provide substantial speedups, especially when combined with specific data augmentations. However, the models lacked stability when emulating simulations with complex geometries over longer time periods. Overall, this work provides a foundation for future studies to develop stable methods that scale to realistic numerical physics simulations of mechanical thrombectomy.

2606.00891 2026-06-02 cs.CV

MMDG-Bench: A Benchmark for Multimodal Domain Generalization

MMDG-Bench:多模态领域泛化基准

Qianshan Zhan, Qian Wang, Da Li, Xiao-Jun Zeng, Xiatian Zhu

发表机构 * University of Manchester(曼彻斯特大学) Jiyue AI(极越AI) Samsung AI Centre Cambridge(三星AI中心剑桥) University of Surrey(萨里大学)

AI总结 提出MMDG-Bench基准,通过D2M和M2D两种框架统一多模态学习与领域泛化,在动作识别和活体检测等任务上验证了结构化组合优于现有方法,并给出关键设计指南。

AI中文摘要

多模态领域泛化(MMDG)旨在利用互补模态增强模型在未见领域上的鲁棒性。尽管多模态学习(MML)和领域泛化(DG)作为独立领域取得了广泛进展,但它们的系统集成仍未被充分探索。当前的MMDG研究主要局限于动作识别,且缺乏标准化的评估协议。为此,我们引入了MMDG-Bench,一个全面的基准,包含两个基础框架:先DG后MML(D2M)和先MML后DG(M2D)。我们在多种任务上提供了统一的实验协议,包括视频-音频-光流动作识别和RGB-深度-红外人脸活体检测。通过将统一的MML配置与五种DG技术配对,在D2M和M2D两种顺序下实例化十个MMDG基线,我们证明这些结构化组合通常优于现有最先进方法,强调了统一基准工作的必要性。我们的分析得出三个关键见解:(1)集成DG技术在各种骨干网络上提供一致的泛化增益,而非DG方法对骨干网络变化高度敏感;(2)最优框架选择取决于模态间稳定性:当模态关系在领域间稳定时D2M表现更好,而M2D对跨领域关系变化更鲁棒;(3)更强的骨干网络在集成到我们的结构化框架中时会产生放大的性能收益。MMDG-Bench为未来多模态鲁棒性研究提供了原则性基础和可操作的设计指南。代码已发布在 https://github.com/qszhan/MMDG-Bench。

英文摘要

Multi-modal Domain Generalization (MMDG) seeks to leverage complementary modalities to enhance model robustness on unseen domains. Despite extensive progress in Multi-modal Learning (MML) and Domain Generalization (DG) as individual fields, their systematic integration remains under-explored. Current MMDG research is largely confined to action recognition and lacks standardized evaluation protocols. To address this, we introduce MMDG-Bench, a comprehensive benchmark featuring two foundational frameworks: DG then MML (D2M) and MML then DG (M2D). We provide unified experimental protocols across diverse tasks, including video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing. By instantiating ten MMDG baselines through pairing a unified MML configuration with five DG techniques under both D2M and M2D orderings, we demonstrate that these structured combinations frequently outperform existing state-of-the-art methods, underscoring the necessity of a unified benchmarking effort. Our analysis yields three key insights: (1) Integrating DG techniques provides consistent generalization gains across various backbones, whereas non-DG methods are highly sensitive to backbone shifts; (2) The optimal framework choice depends on inter-modal stability: D2M excels when modal relations are stable across domains, while M2D is more robust to cross-domain relational variance; (3) Stronger backbones yield amplified performance dividends when integrated into our structured frameworks. MMDG-Bench provides a principled foundation and actionable design guidelines for future research in multi-modal robustness. Code is released at https://github.com/qszhan/MMDG-Bench.

2606.00890 2026-06-02 cs.CV

Cohort-Scale Neural Atlases of Ultrasound Video

超声视频的队列级神经图谱

Zhuorui Zhang, Roger Pallarès-López, Xuan Wu, Praneeth Namburi, Brian W. Anthony

发表机构 * Department of Mechanical Engineering, MIT(麻省理工学院机械工程系) Institute for Medical Engineering and Science, MIT(麻省理工学院医学工程与科学研究所) MIT.nano Immersion Lab, MIT(麻省理工学院MIT.nano沉浸实验室)

AI总结 提出一种基于DINOv3特征空间、联合训练数千帧的队列级神经图谱方法,通过每视频生成潜在优化嵌入实现准确注释迁移,在五个心脏和肌肉骨骼数据集上达到与强基线相当的精度。

AI中文摘要

超声是临床实践中应用最广泛的实时成像模态,然而每帧视频注释仍然是一个主要瓶颈:专家标签稀缺且昂贵,图像外观随散斑、阴影、衰减和操作者依赖的探头姿态而变化。这尤其具有局限性,因为临床相关信息通常是动态的,从超声心动图中的左心室运动到肌肉骨骼成像中的肌肉和骨骼运动学。群体图谱可以通过将观测注册到共享的规范坐标系来分摊注释成本,但现有的神经图谱方法主要针对单个视频、小型测试时图像集或物体中心的图像集合。我们引入了一种用于超声视频的队列级神经图谱:一个单一的规范图表,带有每视频生成潜在优化嵌入,在DINOv3特征空间中联合训练数千帧。在五个带有地标点和分割掩膜的心脏和肌肉骨骼数据集上,我们的方法学习了连贯的规范模板,并实现了准确的图谱空间注释迁移。在EchoNet-Dynamic和MSK-Bone上,它支持单次和少样本迁移,其精度与强密集对应基线相当,同时在单个消费级GPU上训练只需几分钟。学习到的嵌入是可解释的:线性投影揭示了结构化的队列变异,图像解码器插值产生解剖学上合理的中间帧,测试时潜在反演通过图谱重建保留帧。这些结果表明,队列级神经图谱为减少超声视频分析中的专家注释负担提供了一种实用、可解释的表示。

英文摘要

Ultrasound is the most widely used real-time imaging modality in clinical practice, yet per-frame video annotation remains a major bottleneck: expert labels are scarce and costly, and image appearance varies with speckle, shadowing, attenuation, and operator-dependent probe pose. This is especially limiting because clinically relevant information is often dynamic, from left-ventricular motion in echocardiography to muscle and bone kinematics in musculoskeletal imaging. Population atlases can amortize annotation cost by registering observations to a shared canonical coordinate system, but existing neural atlas methods mainly target single videos, small test-time image sets, or object-centric image collections. We introduce a cohort-scale neural atlas for ultrasound video: a single canonical chart with per-video Generative Latent Optimization embeddings, trained jointly over thousands of frames in DINOv3 feature space. Across five cardiac and musculoskeletal datasets with point landmarks and segmentation masks, our method learns coherent canonical templates and enables accurate atlas-space annotation transfer. On EchoNet-Dynamic and MSK-Bone, it supports single- and few-shot transfer with accuracy competitive with strong dense-correspondence baselines, while training in minutes on a single consumer GPU. The learned embeddings are interpretable: linear projections reveal structured cohort variation, image-decoder interpolation produces anatomically plausible intermediate frames, and test-time latent inversion reconstructs held-out frames through the atlas. These results suggest that cohort-scale neural atlases offer a practical, interpretable representation for reducing expert annotation burden in ultrasound video analysis.

2606.00888 2026-06-02 cs.LG cs.AI

Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling

基于动态稀疏性的内存高效LLM训练:从稳定性到实际扩展

Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu, Torsten Hoefler

发表机构 * University of Waterloo(滑铁卢大学) University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Michigan(密歇根大学)

AI总结 提出SMET方法,通过优化器预热和密度感知学习率缩放解决动态稀疏训练中的优化不稳定问题,实现LLM的稳定、可扩展且内存高效的稀疏预训练。

Comments
Accepted at ICML2026
AI中文摘要

动态稀疏训练(DST)为提高深度神经网络的训练和推理效率提供了一种有前景的范式;然而,我们发现,在大语言模型训练中,DST可能会遭受优化不稳定性,表现为拓扑更新后的损失尖峰。在这项工作中,我们表明,标准基于Adam的优化器的朴素使用会导致新重新生长的参数出现冷启动问题,从而导致过大的更新和破坏训练动态。为了解决这个问题,我们提出了稀疏内存高效训练(SMET),它通过优化器预热稳定DST,并通过密度感知学习率缩放改善训练进度。SMET通过仅存储活动参数的梯度和优化器状态进一步减少内存消耗。我们对SMET下的更新行为进行了理论分析,显示出改进的优化稳定性。大量实验表明,SMET能够实现LLM的稳定、可扩展且内存高效的稀疏预训练,为稀疏训练作为密集训练的实际替代方案铺平了道路。我们的代码公开在:https://github.com/QiaoXiao7282/SMET。

英文摘要

Dynamic Sparse Training (DST) offers a promising paradigm for improving the training and inference efficiency of deep neural networks; however, we find that in large language model training, DST can suffer from optimization instability, manifested as loss spikes after topology updates. In this work, we show that the naive use of standard Adam-based optimizers leads to a cold-start issue for newly regrown parameters, resulting in excessively large updates and disrupted training dynamics. To address this issue, we propose Sparse Memory-Efficient Training (SMET), which stabilizes DST with optimizer warm-up and improves training progress through density-aware learning-rate scaling. SMET further reduces memory consumption by storing gradients and optimizer states only for active parameters. We provide a theoretical analysis of the update behaviors under SMET, showing improved optimization stability. Extensive experiments demonstrate that SMET enables stable, scalable, and memory-efficient sparse pre-training of LLMs, paving the way for sparse training as a practical alternative to dense training. Our code is publicly available at: https://github.com/QiaoXiao7282/SMET.
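One plausible way to see how a regrown weight can receive an oversized Adam update is sketched below. It assumes the regrown parameter's moment estimates start at zero while bias correction uses the optimizer's already-large global step count; this is an illustrative mechanism, not necessarily the analysis in the paper. The function name is hypothetical, and the per-parameter step restart used for comparison is a generic reference point, not SMET's optimizer warm-up.

```python
import math


def adam_first_update(grad: float, lr: float = 1e-3, beta1: float = 0.9,
                      beta2: float = 0.999, eps: float = 1e-8, step: int = 1) -> float:
    """First Adam update for a parameter whose moments start at zero.

    `step` is the count used for bias correction: 1 for a truly fresh
    parameter, or a large global step for a weight regrown mid-training.
    """
    m = (1 - beta1) * grad           # first moment after one gradient
    v = (1 - beta2) * grad * grad    # second moment after one gradient
    m_hat = m / (1 - beta1 ** step)  # bias correction is ~1 when step is large
    v_hat = v / (1 - beta2 ** step)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


fresh = adam_first_update(1.0, step=1)       # correction restores an lr-sized step
cold = adam_first_update(1.0, step=10_000)   # correction no longer compensates
ratio = cold / fresh                         # roughly (1 - beta1) / sqrt(1 - beta2)
```

With the default betas the uncorrected first step is about three times the learning rate, because the second-moment estimate is underestimated more strongly than the first.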

2606.00886 2026-06-02 cs.CV cs.RO

GABI: Geometry-Aware Boundary Integration for Spacecraft Segmentation

GABI:用于航天器分割的几何感知边界集成

Iason Georgios Velentzas, Dhruv Ahuja, Panagiotis Tsiotras

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出一种轻量级边界感知多任务分割架构GABI,通过辅助距离场预测头增强卷积骨干网络,在保持低模型复杂度的同时提升航天器分割精度,在SPARK基准上平均精度最多提升5%,在泛化实验中平均精度提升超过50%。

Comments
Accepted to AI4Space at CVPR 2026
AI中文摘要

精确分割对于自主航天器至关重要,因为它直接影响与3D态势感知相关的下游任务。然而,太空恶劣的照明条件会产生外观高度变化的图像,阻碍分割方法在不同航天器和环境中的泛化。在这项工作中,我们提出了GABI,一种轻量级的边界感知多任务分割架构,它通过一个辅助的距离场预测头增强卷积骨干网络。距离场在物体边界周围提供密集的几何监督,鼓励网络学习航天器结构的空间一致表示,同时保持适合机载感知系统的低模型复杂度。我们将GABI与一个成熟的卷积基线以及更重的基于Transformer的架构进行了对比评估。在SPARK基准上,距离场监督使基线的平均精度最多提高5%,同时实现了与Transformer模型相当的性能。在泛化实验中,GABI的平均精度比基线提高了50%以上。在跨域评估中,轻量级GABI变体在IoU和F1分数上与更重的Transformer模型相差5%以内,而体积大约小十倍。此外,较重的GABI变体超越了Transformer架构,同时体积仍小近三倍。

英文摘要

Accurate segmentation is crucial for autonomous spacecraft, as it directly affects downstream tasks related to 3D situational awareness. The harsh illumination conditions of space, however, produce images with high variability in appearance, hindering the generalization of segmentation approaches across different spacecraft and environments. In this work, we propose GABI, a lightweight boundary-aware multi-task segmentation architecture that augments a convolutional backbone with an auxiliary distance-field prediction head. The distance field provides dense geometric supervision around object boundaries, encouraging the network to learn spatially consistent representations of spacecraft structures while maintaining low model complexity suitable for onboard perception systems. We evaluated GABI against both an established convolutional baseline and a heavier transformer-based architecture. On the SPARK benchmark, distance-field supervision improves the baseline by up to 5% in Average Precision while achieving performance comparable to the transformer models. In generalization experiments, GABI improves Average Precision by more than 50% over the baseline. In cross-domain evaluation, the lightweight GABI variant performs within 5% in IoU and F1-score of the heavier transformer model while being approximately ten times smaller. At the same time, the heavier GABI variant surpasses the transformer architectures while remaining nearly three times lighter.

2606.00884 2026-06-02 cs.LG cs.AI

Dive into Waves: Morlet Spectral Transformer for Cross-Subject Emotion Decoding from EEG

深入波动:用于跨被试脑电情绪解码的Morlet谱变换器

Jiaxin Qing, Lexin Li

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对脑电情绪识别中跨被试变异性问题,提出基于Morlet小波标记化、长上下文基线去除和频带特定空间投影的Morlet谱变换器(MST),无需预训练即可在SEED系列数据集上超越大型预训练模型和频域方法。

AI中文摘要

我们研究基于脑电的跨被试情绪识别,这是脑机接口中一个实际重要但具有挑战性的问题。与具有清晰波形特征的任务不同,情绪相关的脑电信号主要编码在频谱功率中,且微弱、嘈杂,并在被试间高度变化。现有方法要么依赖需要大量数据但仍难以应对跨被试变异的大型预训练脑电基础模型,要么依赖频域编码器(能更好地反映频谱结构但存在表示不匹配、漂移主导的标记化以及缺乏频带特定空间建模)。在本文中,我们提出了Morlet谱变换器(MST),它围绕三个关键组件构建,并与时空变换器主干集成。首先,Morlet小波标记化提供了与脑节律多尺度结构匹配的时频表示,并将经典微分熵特征扩展到适合变换器的形式。其次,长上下文基线去除作为一种简单的时间归一化,消除了被试特定漂移和附近窗口间的冗余。第三,频带特定空间投影为每个频带学习独立的通道混合器,捕获可解释的频带特定模式并减少跨通道混合。我们表明,即使没有预训练,MST在所有SEED系列数据集上始终优于大型预训练脑电基础模型和基于频率的方法。这些结果表明,精心的表示设计可以产生准确、经济且可解释的替代大规模预训练的方法。

英文摘要

We study cross-subject emotion recognition from EEG, a practically important yet challenging problem in brain-computer interfaces. Unlike tasks with clear waveform signatures, emotion-related EEG signals are primarily encoded in spectral power and are weak, noisy, and highly variable across subjects. Existing approaches rely either on large pretrained EEG foundation models, which require massive data yet still struggle with cross-subject variability, or frequency-domain encoders, which better reflect spectral structure but suffer from mismatched representations, drift-dominated tokenization, and lack of band-specific spatial modeling. In this article, we propose the Morlet Spectral Transformer (MST), built around three key components and integrated with a spatiotemporal Transformer backbone. First, Morlet wavelet tokenization provides a time-frequency representation that matches the multi-scale structure of brain rhythms, and extends classical differential entropy features to a form suitable for Transformers. Second, long-context baseline removal acts as a simple temporal normalization that removes subject-specific drift and redundancy across nearby windows. Third, frequency-specific spatial projection learns a separate channel mixer for each frequency band, capturing interpretable band-specific patterns and reducing cross-channel mixing. We show that, even without pretraining, MST consistently outperforms both large pretrained EEG foundation models and frequency-based methods across all SEED-family datasets. These results suggest that careful representation design can yield an accurate, cost-effective, and interpretable alternative to large-scale pretraining.

2606.00881 2026-06-02 cs.CL

Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations

检索增强生成中的分块方法——针对计算成本与局限性的有效性评估

Mateusz Śmigielski, Michał Rajkowski, Mateusz Zbrocki, Michał Bernacki-Janson, Karol Kunicki, Julianna Godziszewska, Maciej Piasecki, Konrad Wojtasik

Affiliations * Department of Artificial Intelligence, Faculty of Information and Communication Technology, Wrocław University of Science and Technology

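AI summary: To the authors' knowledge, this is the first study to systematically evaluate the effectiveness of a wide range of chunking methods in RAG systems; it also highlights impactful, often overlooked issues in chunking strategies.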

Details
AI-generated Chinese abstract

检索增强生成(RAG)在提升大型语言模型(LLMs)性能方面展现了显著能力。RAG系统中的关键任务之一是分块过程。传统上,固定大小分块和语义分块是标准方法。然而,对分块策略的兴趣日益增长,导致越来越多声称性能优于传统技术的方法被提出。许多这些方法针对特定用例和数据类型定制,缺乏在不同场景下有效性的证据。因此,直接比较不同技术并评估其相对优势仍然具有挑战性。据我们所知,本研究首次系统评估了广泛分块方法的有效性,并强调了RAG系统中分块策略的潜在挑战。虽然分块通常被视为简单的预处理步骤,但我们表明它引入了一系列有影响且常被忽视的问题。

English abstract

Retrieval-Augmented Generation (RAG) has demonstrated significant capabilities in enhancing the performance of Large Language Models (LLMs). One of the key tasks in RAG systems is the chunking process. Traditionally, fixed-size chunking and semantic chunking have been the standard approaches. However, interest in chunking strategies has been increasing, leading to a growing number of proposed methods that often claim improved performance over these conventional techniques. Many of these approaches are tailored to specific use cases and data types, with limited evidence of their effectiveness across diverse scenarios. As a result, it remains challenging to directly compare different techniques and assess their relative strengths. To the best of our knowledge, this study is the first to systematically evaluate the effectiveness of a wide range of chunking methods and emphasize the underlying challenges of chunking strategies in RAG systems. While chunking is commonly treated as a simple preprocessing step, we show that it introduces a range of impactful and often overlooked issues.
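For readers unfamiliar with the fixed-size baseline named in the abstract, a minimal sketch follows. The function name `fixed_size_chunks`, character-based sizing, and the `overlap` parameter are assumptions for illustration; the abstract does not say how the study configures its chunkers.

```python
def fixed_size_chunks(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters, each sharing `overlap`
    characters with the previous window. Parameters are illustrative only."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + size])
        if start + size >= len(text):  # the last window already reaches the end
            break
    return chunks


example = fixed_size_chunks("abcdefghij", size=4, overlap=2)
# example == ["abcd", "cdef", "efgh", "ghij"]
```

A splitter like this ignores sentence and word boundaries, so a chunk can begin or end mid-word. Whether that specific effect is among the issues the paper analyzes is not stated in the abstract.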

2606.00880 2026-06-02 cs.LG cs.AI

Task diversity produces systematic transfer but inhibits continual reinforcement learning

任务多样性产生系统性迁移但抑制持续强化学习

Purab Seth, Neil Shah, Kunal Jha, Samuel J. Gershman, Max Kleiman-Weiner, Wilka Carvalho

Affiliations * MIT; University of California, Berkeley; Princeton University; Harvard University

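AI summary: Introduces Banyan, a GPU-accelerated continual reinforcement learning domain, to study how task diversity (map layouts, interaction objects, sub-goal hierarchies) affects an agent's ability to keep learning under distribution shifts. Diversity promotes local transfer, but longer-horizon tasks plateau and earlier task distributions are forgotten.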

Details
AI-generated Chinese abstract

持续强化学习旨在产生不仅能在当前任务上提高,还能随着任务分布变化而适应的智能体。在众多不同任务上训练智能体可以引发零样本泛化,但先前的工作通常是在训练后(冻结权重)评估这种泛化。任务多样性是否也能提高智能体在分布变化下继续学习的能力仍不清楚。我们引入了Banyan,一个GPU加速的持续强化学习领域,其中任务多样性分解为三个独立可控的轴:智能体必须导航的地图布局、必须与之交互的对象以及子目标依赖的层次结构。在单个分布变化中,增加每个轴上的多样性会导致智能体在新任务上开始训练时,其性能接近先前任务达到的水平,即使变化改变了最优策略的结构。然而,随着变化数量的增加,这种局部迁移本身并不能产生持续的持续学习:更长视野的任务出现平台期,并且较早的任务分布在后续训练后被遗忘。Banyan是一个基准,用于研究受控的任务多样性何时产生可迁移的学习,这种迁移何时持续,以及它在哪些方面未能达到真正的持续学习。

English abstract

Continual reinforcement learning aims to produce agents that learn not only to improve at their current tasks but also to adapt as task distributions change. Training an agent on many diverse tasks can induce zero-shot generalization, but previous work generally evaluates this generalization after training -- with frozen weights. Whether task diversity also improves an agent's ability to continue learning across distribution shifts remains unclear. We introduce Banyan, a GPU-accelerated continual RL domain in which task diversity factors into three independently controllable axes: the map layouts an agent must navigate, the objects it must interact with, and the hierarchical structures of sub-goal dependencies. Across individual distribution shifts, increasing diversity along each axis causes agents to begin training on the new tasks near the performance attained on the previous one, even when the shift changes the structure of the optimal policy. However, as the number of shifts increases, this local transfer does not by itself yield sustained continual learning: longer-horizon tasks plateau, and earlier task distributions are forgotten after later training. Banyan is a benchmark for studying when controlled task diversity produces transferable learning, when that transfer persists, and where it falls short of proper continual learning.

2606.00875 2026-06-02 cs.CL

IDEAFix: Evaluation Framework for Creative Defixation Prompting in LLMs

IDEAFix: 大语言模型中创造性去固定化提示的评估框架

F. Carichon, S. Sharma, M. Girard, R. Rampa, G. Farnadi

Affiliations * McGill University; Mila; Concordia University; ÉTS

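AI summary: Proposes the IDEAFix framework, which systematically analyzes the divergent thinking of large language models in open-ended idea generation by controlling task variants and prompting strategies. Task formulation and attribute selection significantly affect performance, and simple prompting can improve originality, but output homogenization persists.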

Details
AI-generated Chinese abstract

大语言模型(LLMs)越来越多地被用于涉及创造性问题解决和想法生成的任务。然而,关于其创造能力缺乏共识:一些研究报告了相比人类更优越的表现,而另一些则强调了结构性限制,如固定化和输出的同质化。现有的评估方法要么依赖于狭窄、脱离上下文的无法捕捉目标导向生成的任务,要么依赖于更广泛的设置,这些设置混淆了创造性过程的多个方面,使得难以隔离任务表述、提示和评估设计的影响。值得注意的是,结构化提示策略在塑造想法生成中的作用仍未得到充分探索。因此,我们引入了IDEAFix,一个用于分析开放式想法生成任务中发散思维的评估框架。我们提示模型对受控变化的简短设计场景、任务属性和去固定化提示策略生成多个原始解决方案。这种设计使得能够系统分析结构化指导如何影响LLMs的想法生成。我们的结果表明,任务表述和属性选择都显著影响模型的表现,并且简单的提示策略可以提升解决方案的原创性。然而,我们也观察到模型间持续的输出同质化,证实了它们在生成多样化解决方案方面固有的局限性。总体而言,IDEAFix提供了一个受控、可扩展的框架,用于研究LLMs创造力的底层机制。

English abstract

Large language models (LLMs) are increasingly used for tasks involving creative problem solving and idea generation. However, there is a lack of consensus concerning their creative capabilities: some studies report superior performances compared to humans, while others highlight structural limitations such as fixation and the homogenization of outputs. Existing evaluation approaches either rely on narrow, decontextualized tasks that do not capture goal-oriented generation or on broader settings that confound multiple aspects of the creative process, making it difficult to isolate the effects of task formulation, prompting, and evaluation design. Significantly, the role of structured prompting strategies in shaping idea generation remains underexplored. Therefore, we introduce IDEAFix, an evaluation framework for analyzing divergent thinking in open-ended idea generation tasks. We prompt models to generate multiple original solutions to controlled variations of short design scenarios, task attributes, and defixation prompting strategies. This design enables systematic analysis of how structured guidance influences LLMs' idea generation. Our results show that both task formulation and attribute selection significantly affect models' performance, and that simple prompting strategies can boost the originality of solutions. However, we also observe persistent output homogenization across models, confirming inherent limits in their ability to generate diverse solutions. Overall, IDEAFix provides a controlled, extensible framework for studying the mechanisms underlying LLMs' creativity.