arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03939 2026-06-03 cs.LG cs.AI cs.PF

FlashbackCL: Mitigating Temporal Forgetting in Federated Learning

Mubarak A. Ojewale, Adriana E. Chis, Jorge M. Cortes-Mendoza, Bernardo Pulido-Gaytan, Horacio Gonzalez-Velez

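AI summary: To address the temporal forgetting that arises when client data distributions drift over time in federated learning, the paper proposes FlashbackCL, which combines temporally decayed label counts, class-balanced reservoir sampling replay, and server-side active coreset curation. On CIFAR-10 it achieves a 6.9% to 10.0% relative improvement over Flashback and reduces temporal forgetting by up to 68%.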

Abstract

Federated Learning (FL) of foundation and edge models increasingly targets deployments where client data distributions drift over time, yet existing forgetting-mitigation methods assume each client's distribution is stationary. Flashback, the strongest recent FL method against cross-client (spatial) forgetting, uses monotonically accumulating per-class label counts as a knowledge proxy; this proxy becomes miscalibrated under temporal distribution shift and anchors the global model to an outdated class balance. We formalise temporal forgetting in FL with a per-phase metric isolated from protocol-level fluctuations and propose Flashback Continual Learning (FlashbackCL), a drop-in extension of Flashback with (i) temporally-decayed label counts; (ii) a device-aware replay buffer with Class-Balanced Reservoir Sampling (CBRS); and (iii) server-side active coreset curation on the public distillation set. The results show that FlashbackCL achieves a 6.9% to 10.0% relative improvement over Flashback on CIFAR-10 with 50 clients and three controlled temporal shift modes, while simultaneously reducing temporal forgetting by up to 68%. A 5-variant ablation identifies CBRS replay as the critical component. FlashbackCL also improves on Flashback by 3.5 points on stationary CIFAR-100, suggesting that class-balanced replay regularises spatial heterogeneity as well as temporal shift.
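To make components (i) and (ii) more concrete, here is a minimal Python sketch. It is not the authors' implementation: the exponential decay rule, the `decay` value, the helper names, and the toy data are all assumptions for illustration, and the buffer is a simplified take on class-balanced reservoir sampling rather than the exact procedure used in the paper.

```python
import random
from collections import defaultdict


def update_counts(counts, round_counts, decay=1.0):
    """Per-class update: decay * old + new. decay=1.0 is plain accumulation."""
    classes = set(counts) | set(round_counts)
    return {c: decay * counts.get(c, 0.0) + round_counts.get(c, 0.0) for c in classes}


def class_share(counts, cls):
    """Fraction of the total count mass that belongs to `cls`."""
    total = sum(counts.values())
    return counts.get(cls, 0.0) / total if total else 0.0


class ClassBalancedReservoir:
    """Simplified class-balanced reservoir buffer (illustrative only)."""

    def __init__(self, capacity, seed=0):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.buffer = defaultdict(list)  # label -> stored samples
        self.seen = defaultdict(int)     # label -> samples observed so far
        self.full = set()                # labels that have been a largest class
        self.rng = random.Random(seed)

    def size(self):
        return sum(len(items) for items in self.buffer.values())

    def add(self, sample, label):
        self.seen[label] += 1
        if self.size() < self.capacity:
            self.buffer[label].append(sample)
            return
        largest_size = max(len(items) for items in self.buffer.values())
        largest = [c for c, items in self.buffer.items() if len(items) == largest_size]
        self.full.update(largest)
        if label not in self.full:
            # Minority class: evict a random sample from one of the largest classes.
            victim = self.buffer[self.rng.choice(largest)]
            victim.pop(self.rng.randrange(len(victim)))
            self.buffer[label].append(sample)
        else:
            # Full class: ordinary reservoir replacement within that class.
            stored = len(self.buffer[label])
            if stored and self.rng.random() < stored / self.seen[label]:
                self.buffer[label][self.rng.randrange(stored)] = sample


# Hypothetical client: 10 rounds dominated by class 0, then 3 rounds dominated by class 1.
rounds = [{0: 90, 1: 10}] * 10 + [{0: 10, 1: 90}] * 3
cumulative, decayed = {}, {}
for r in rounds:
    cumulative = update_counts(cumulative, r, decay=1.0)
    decayed = update_counts(decayed, r, decay=0.5)

# Hypothetical stream: 1000 samples of an old class followed by 50 of a new one.
buf = ClassBalancedReservoir(capacity=20)
for i in range(1000):
    buf.add(("old", i), label=0)
for i in range(50):
    buf.add(("new", i), label=1)
```

In this toy run the cumulative counts still report class 0 as the majority after the shift, while the decayed counts follow the recent class balance; the buffer ends up holding both classes in equal numbers even though the stream is heavily imbalanced.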

2606.03936 2026-06-03 cs.LG physics.geo-ph

Correcting Neural Operator Spectral Bias via Diffusion Posterior Sampling with Sparse Observations

Niccolò Perrone, Fanny Lehmann, Stefania Fresca, Filippo Gatti

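AI summary: Proposes FreqNO-DPS, which combines diffusion posterior sampling with a spectrally shaped guidance score to correct the high-frequency attenuation (spectral bias) of neural operators given sparse observations, reaching near-zero spectral bias.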

Abstract

Neural operator surrogates (NO) approximate PDE solutions orders of magnitude faster than numerical solvers, but suffer from spectral bias: high-frequency content is systematically attenuated, limiting reliability where fine-scale structure matters. Sparse sensor measurements of the field are often available too, offering pointwise accuracy without spectral distortion but covering only a small fraction of the domain. We address this by treating NO predictions as auxiliary observations in a diffusion posterior sampling framework. Our method, FreqNO-DPS (https://github.com/niccoloperrone/FreqNO-DPS), combines an unconditional score-based diffusion prior, trained on high-fidelity simulations, with diffusion posterior sampling (DPS) conditioned on sparse observations and guided by a frozen neural operator. Naive integration reintroduces the surrogate's spectral bias; we resolve this with a closed-form, spectrally shaped guidance score that weights the surrogate by its frequency-dependent accuracy and needs no denoiser backpropagation. A distribution-free analysis bounds the approximation error across the frequency-diffusion-time plane and shows the guidance's frequency dependence is preserved regardless of distributional assumptions. On 3D elastic wavefield prediction at 5% and 2% sensor coverage, the method reaches near-zero spectral bias across all bands, where both the surrogate and sensor-only DPS show systematic high-frequency attenuation. Isotropic guidance, the natural baseline, improves pointwise accuracy but carries the bias into the posterior nearly intact, confirming that frequency-dependent calibration is essential, not merely beneficial. The framework needs only paired surrogate/reference data and exploits no problem-specific structure beyond the residual's approximate spectral diagonality, verifiable for new surrogates via the coherence diagnostic we provide.

2606.03935 2026-06-03 cs.NE cs.LG

Quadratic integrate-and-fire neurons exhibit less fragmented loss landscapes and outperform leaky integrate-and-fire neurons in spike-based gradient descent

Carlo Wenig, Raoul-Martin Memmesheimer, Christian Klos

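AI summary: Compares LIF and QIF neurons on the Spiking Heidelberg Digits dataset and finds that QIF neurons have smoother loss landscapes and gradients, and therefore perform better when training spiking neural networks.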

Comments
9 pages, 5 figures (main part)
Abstract

The ability to train spiking neural networks is essential for modeling biological neural networks as well as for neuromorphic computing. However, for the extensively used leaky integrate-and-fire (LIF) neurons, arbitrarily small parameter changes can induce spike (dis)appearances that disrupt subsequent activity, leading to unstable neural representations and permanently silent neurons during exact spike-based gradient descent. Recent work shows that a class of neuron models, which includes the quadratic integrate-and-fire (QIF) neuron, avoids these discontinuities and enables continuous and even smooth spike-based gradient descent. However, it remains unclear whether these advantages translate into practice. Here, we demonstrate that they do so via a controlled comparison between networks of LIF and QIF neurons on the popular Spiking Heidelberg Digits dataset. Specifically, in a first step, we perform a thorough hyperparameter search to optimize both models, revealing a clear performance advantage of QIF neurons. In a second step, we visualize the loss and gradient landscapes. Consistent with their inferior performance, we find that the loss landscapes of LIF neurons, which are discontinuous, appear more fragmented and the related gradients more erratic. An analysis of the landscapes of single samples indicates that these features arise from changes in the temporal order of spikes, which often cause disruptive spike (dis)appearances. Overall, our results advocate replacing LIF neurons with neuron models exhibiting continuous spiking dynamics, such as QIF neurons, for gradient descent training.
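For orientation, the two neuron models are commonly written in the following textbook (dimensionless) form. The abstract does not give the paper's exact parametrization, so the symbols below (time constant, rest and reset values, input current) are assumptions.

```latex
% Leaky integrate-and-fire (LIF): linear subthreshold dynamics,
% spike when V reaches a fixed threshold V_\theta, then reset.
\tau \frac{dV}{dt} = -(V - V_{\mathrm{rest}}) + I(t),
\qquad V \ge V_\theta \;\Rightarrow\; \text{spike},\quad V \leftarrow V_{\mathrm{reset}}

% Quadratic integrate-and-fire (QIF): quadratic subthreshold dynamics,
% spike when V diverges (in practice, when it reaches a large peak value).
\tau \frac{dV}{dt} = (V - V_{\mathrm{rest}})(V - V_{T}) + I(t),
\qquad V \to \infty \;\Rightarrow\; \text{spike},\quad V \leftarrow V_{\mathrm{reset}}
```

In the LIF model a voltage trajectory can touch the threshold tangentially, so an arbitrarily small parameter change can add or remove that spike; this is the kind of discontinuity the abstract refers to.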

2606.03931 2026-06-03 cs.RO cs.SY eess.SY

Multi-Robot Bearing-only Pose Estimation via Angle Rigidity

J. Francisco Presenza, Leonardo J. Colombo, Ignacio Mas, Juan I. Giribet

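AI summary: Proposes a distributed bearing-only pose estimator that computes positions from angles between body-frame bearings and then recovers orientations; it requires only angle rigidity and achieves local uniform exponential stability.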

Abstract

This letter proposes a novel distributed bearing-based pose estimator for time-varying multi-robot systems. The method uses angles computed from body-frame bearings to estimate the robots' positions in $\mathbb{R}^3$ without knowledge of their orientations. The orientations in $\mathrm{SO}(3)$ are recovered from the estimated positions, the bearings, and the bearing derivatives. The proposed observer only requires the (directed) sensing topology to be angle-rigid, a weaker condition than commonly used ones such as bearing rigidity. Local uniform exponential stability of the proposed observer is established under the assumption of persistently exciting motions for a subset of robots. Simulations are presented and discussed to evaluate the scheme's effectiveness and practicality.
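The reason angles can be used without knowing orientations is that the angle between two bearings measured in the same body frame does not depend on that frame's rotation. The sketch below checks this numerically; the positions, rotations, and function names are made up for illustration and are not taken from the paper.

```python
import math


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def rot_z(yaw):
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_x(roll):
    c, s = math.cos(roll), math.sin(roll)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def body_bearing(R_i, p_i, p_j):
    """Unit bearing from robot i to robot j in i's body frame: R_i^T (p_j - p_i) / |p_j - p_i|."""
    w = unit(sub(p_j, p_i))
    return [sum(R_i[r][c] * w[r] for r in range(3)) for c in range(3)]


def subtended_angle(R_i, p_i, p_j, p_k):
    """Angle at robot i between its bearings to robots j and k."""
    c = dot(body_bearing(R_i, p_i, p_j), body_bearing(R_i, p_i, p_k))
    return math.acos(max(-1.0, min(1.0, c)))


# Hypothetical positions and two different orientations of robot i.
p_i, p_j, p_k = [0.0, 0.0, 0.0], [2.0, 1.0, 0.5], [-1.0, 3.0, 1.0]
identity = rot_z(0.0)
tilted = matmul(rot_z(0.7), rot_x(-1.1))

angle_a = subtended_angle(identity, p_i, p_j, p_k)
angle_b = subtended_angle(tilted, p_i, p_j, p_k)
```

The individual bearings differ between the two orientations, but `angle_a` and `angle_b` agree up to floating-point error, because rotations preserve inner products.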

2606.03928 2026-06-03 cs.LG cs.CL

Value-Aware Stochastic KV Cache Eviction for Reasoning Models

Ting-Yun Chang, Harvey Yiyun Fu, Deqing Fu, Chenghao Yang, Jesse Thomason, Robin Jia

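AI summary: To address the KV cache bottleneck caused by the long outputs of reasoning models, the paper proposes VaSE, a value-aware stochastic eviction method that protects large-magnitude value states and introduces randomness into eviction; at 4x compression it improves accuracy over the strongest eviction method by more than 4%.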

Comments
Codes: https://github.com/terarachang/VaSE
Abstract

Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse attention alternatives, which keep the full KV cache. We identify key factors crucial to KV cache eviction accuracy. First, a small fraction of value states have abnormally large magnitudes, and evicting them causes catastrophic failure where models enter repetitive reasoning loops. Second, introducing stochasticity during eviction improves accuracy by increasing cache diversity. Based on these findings, we propose Value-aware Stochastic KV Cache Eviction (VaSE), a training-free recipe that protects large-magnitude value states and promotes diverse eviction decisions. Across six reasoning tasks, Qwen3 models using VaSE with 4x KV cache compression yield higher average accuracies than the state-of-the-art selection method at the same sparsity, while outperforming the strongest eviction method by more than 4%. Overall, VaSE bridges the gap between efficiency and accuracy, supporting FlashAttention2 and enabling a static memory footprint for reasoning models.
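The two findings can be illustrated with a toy selection rule. This is only a sketch of the general idea, not the VaSE algorithm from the paper or its repository: the function name, the `protect_frac` parameter, and the choice of score-weighted sampling without replacement are assumptions.

```python
import random


def select_kept_tokens(scores, value_norms, budget, protect_frac=0.05, seed=0):
    """Pick `budget` cache positions to keep.

    Assumed rule (illustrative): always keep the positions with the largest
    value-state norms, then fill the rest of the budget by sampling positions
    with probability proportional to a non-negative importance score.
    """
    n = len(scores)
    if len(value_norms) != n or not 0 < budget <= n:
        raise ValueError("inconsistent inputs")
    rng = random.Random(seed)

    # Value-aware step: protect the largest-norm value states.
    n_protect = min(budget, max(1, int(protect_frac * n)))
    by_norm = sorted(range(n), key=lambda i: value_norms[i], reverse=True)
    kept = set(by_norm[:n_protect])

    # Stochastic step: weighted sampling without replacement from the rest.
    candidates = [i for i in range(n) if i not in kept]
    while len(kept) < budget and candidates:
        weights = [scores[i] + 1e-12 for i in candidates]
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        kept.add(candidates.pop(pick))
    return sorted(kept)


# Hypothetical 16-token cache compressed 4x; token 3 has an outlier value norm
# but a near-zero importance score, so a pure top-k rule would evict it.
scores = [0.9, 0.1, 0.4, 0.0, 0.7, 0.2, 0.3, 0.05,
          0.6, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75]
value_norms = [1.0] * 16
value_norms[3] = 50.0
kept = select_kept_tokens(scores, value_norms, budget=4)
```

Changing `seed` changes which of the unprotected tokens survive, which is one simple way to obtain the cache diversity the abstract mentions; token 3 is kept regardless.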

2606.03927 2026-06-03 cs.LG cs.AI

FFR: Forward-Forward Learning for Regression

Xinyang Liu, Xuanyu Liang, Shiqi Ding, Boyang Li, Zhiqiang Que, Jiayang Li, Guosheng Hu

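AI summary: Proposes the FFR framework, which extends the Forward-Forward algorithm to regression through an ordinal competitive goodness function, a stratified ladder architecture, and hierarchical prediction; across several datasets it recovers 98.6% of BP's accuracy while substantially reducing memory and time costs.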

Abstract

The Forward-Forward (FF) algorithm offers a computationally efficient and biologically plausible alternative to backpropagation (BP) by training neural networks through purely local, layer-wise optimization. However, FF is inherently designed for classification via contrastive positive-negative sample pairs, and extending it to regression poses fundamental challenges: continuous target spaces lack natural "opposites" for contrastive learning, and the standard goodness function carries no information about target magnitude or ordering. We propose FFR (Forward-Forward for Regression), to our knowledge the first framework to extend FF to real-world regression and demonstrate competitive performance across diverse real-world datasets. FFR introduces three key innovations: (1) an ordinal competitive goodness function that replaces contrastive pairs with competitive learning between partitioned neuron groups under distance-aware ordinal supervision; (2) a stratified ladder architecture where shallow layers learn coarse ordinal discrimination and deeper layers refine into fine-grained regression, with multi-scale feature aggregation for inter-layer collaboration; and (3) hierarchical prediction with uncertainty estimation, where multi-scale predictors jointly provide robust predictions and prediction confidence as a free lunch. Extensive experimental results show FFR recovers on average 98.6% of BP's accuracy across five real-world regression benchmarks while reducing peak training memory to only 27% of BP's at depth 8 and 8% at depth 32, with per-iteration time around 72% of BP's, and substantially outperforms all BP-free competitors.

2606.03926 2026-06-03 cs.HC cs.LG

DiffUNet^2: Bidirectional Prediction, Probabilistic Generation and Collaborative Visual Discovery for Scientific Data

Mengdi Chu, Jiaxin Yang, Angus G. Forbes, Nathan Debardeleben, Earl Lawrence, Ayan Biswas, Han-Wei Shen

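AI summary: Proposes DiffUNet^2, a diffusion-based conditional generation framework that supports bidirectional, any-step prediction over time series and captures probability distributions, combined with interactive visualization to support scientific exploration.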

Comments
12 pages, 20 figures
Abstract

Modeling temporal evolution is important to analyzing and reasoning about scientific phenomena, yet most machine learning methods provide deterministic forward predictions that overlook multiple plausible outcomes and rarely support backward reasoning, limiting their usefulness in practical scientific workflows. We present a framework that integrates diffusion-based generative modeling with interactive visual analytics for scientific exploration. We introduce DiffUNet^2, a conditional diffusion model that enables bidirectional, any-to-any generation across time and captures distributions of plausible system evolutions. Built upon the model, our interactive system supports branching timeline exploration, user-guided state editing, and probability-space navigation, enabling scientists to actively explore alternative hypotheses rather than passively observe predictions. We evaluate the model on 5 datasets across different scientific domains to validate its predictive accuracy and probability-space ensemble quality. In collaboration with domain experts, we demonstrate the effectiveness of our approach in supporting practical scientific temporal data analysis workflows. By integrating modeling and visual interaction, our approach enables scientists to interactively explore system dynamics, transforming generative models into tools for hypothesis-driven scientific analysis.

2606.03925 2026-06-03 cs.CV

Adaptive Causal Alignment for High-Confidence Adversarial Training

Zhiming Luo, Kejia Zhang, Yingxin Lai, Junwei Wu, Juanjuan Weng, Shaozi Li

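AI summary: To address over-reliance on non-causal background correlations in high-confidence adversarial training, the paper proposes the HICAT framework, which achieves causal alignment through a learnable background-bias estimator and an adaptive debiasing mechanism, improving robust generalization.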

Abstract

Inverse adversarial training leverages high-confidence predictions to stabilize robust learning, yet we uncover a critical paradox: high confidence often stems from overfitting to non-causal background correlations rather than intrinsic object semantics. Our investigation reveals that visual context functions as a dual-natured signal, serving as either a necessary supportive prior or a spurious confounder. This insight renders existing blind suppression strategies flawed, as they inevitably lead to severe Feature Loss. To resolve this, we propose High-Confidence Causally Aligned Training (HICAT), a unified framework that establishes a Semantic Equilibrium. Operating on a "Measure-Debias-Align" pipeline, HICAT integrates a Learnable Background-Bias Estimator (LBBE) to adaptively diagnose context utility. Guided by this diagnosis, an Adaptive Debiasing mechanism performs surgical logit rectification, complemented by a geometrically grounded Foreground Logit Orthogonal Enhancement (FLOE) loss to enforce rigorous feature disentanglement. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet-1K demonstrate that HICAT consistently improves over matched baselines across diverse architectures (CNNs and ViTs) while significantly reducing the robust generalization gap.

2606.03924 2026-06-03 cs.CL

Knowledge Editing in Masked Diffusion Language Models

Haewon Park, Yohan Jo

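AI summary: Transfers locate-then-edit methods from autoregressive models to masked diffusion models, finds that the edit location transfers but multi-token edit performance degrades, and proposes a simple correction that optimizes the edit for intermediate states.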

Abstract

Knowledge editing aims to update or correct factual knowledge in a language model. A widely used approach, locate-then-edit, does this in two steps: it first localizes a fact within the model, then edits the weights there. To date, such methods have been developed exclusively on autoregressive models (ARMs). Whether their underlying assumptions hold for masked diffusion models (MDMs), which model text bidirectionally and generate by iterative denoising rather than next-token prediction, remains an open question. We address it by transferring locate-then-edit to MDMs and comparing two MDMs (LLaDA, Dream) with two ARMs (LLaMA, Qwen) at matched scale. Our central finding has two parts. First, where an edit is applied transfers across paradigms: causal tracing highlights the same early-to-mid-layer MLP at the last subject token in both, and editing is most effective there. Second, this shared location does not guarantee a shared outcome. Single-token edits succeed in both, but as targets grow longer, editing degrades systematically in the MDMs but not the ARMs. The failure stems from how the edited fact is generated: producing a multi-token target requires passing through partially unmasked intermediate states for which the edit was never optimized. Guided by this diagnosis, we introduce a simple correction that optimizes the edit for these states, substantially restoring multi-token performance.

2606.03923 2026-06-03 cs.LG

Contrastive Neural Algorithmic Reasoning for Graph Coloring

Thien Le, Tianyu Zhao, Melanie Weber

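AI summary: Proposes a contrastive learning framework that learns transferable coloring geometry; a GNN encoder produces low-conflict colorings and generalizes to graphs of different sizes.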

Comments
52 pages, 5 figures, 45 tables
Abstract

Graph coloring seeks to assign colors to a graph's nodes so that adjacent nodes receive different colors, using as few colors as possible. Here, we study approximate $k$-coloring, where the goal is to use at most $k$ colors while minimizing the number of monochromatic edges. This problem is central to graph theory and has applications in areas such as scheduling and resource allocation. Recent unsupervised GNN approaches optimize each instance directly, precluding generalization across graph sizes and distributions. We instead propose a contrastive learning framework that learns transferable coloring geometry where the embeddings of same-color nodes align, while adjacent nodes' representations are pushed toward distinct directions. We analyze the resulting population objective over bounded-size graphs. For unit-norm embeddings, we show that its optima have a line-prototype structure: Representations of nodes of the same color collapse to a shared one-dimensional subspace, and edges connect orthogonal subspaces. This geometry yields stationarity conditions in the supervised setting and is preserved by projected subgradient dynamics under a balanced-coloring assumption. In an unnormalized variant, gradient descent has a max-margin bias governed by a quotient-graph hard-margin problem. Experiments on synthetic and real-world graphs show that contrastive GNN encoders generalize effectively and produce low-conflict colorings, matching and sometimes improving on greedy approaches.
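The objective of approximate $k$-coloring, the number of monochromatic edges, is simple to compute. The sketch below pairs it with a generic greedy heuristic to show what a low-conflict coloring means; the heuristic is an illustrative stand-in and not necessarily the greedy baseline used in the paper.

```python
def count_conflicts(edges, coloring):
    """Number of monochromatic edges, i.e. edges whose endpoints share a color."""
    return sum(1 for u, v in edges if coloring[u] == coloring[v])


def greedy_k_coloring(n, edges, k):
    """Visit nodes in index order and give each the color least used by its
    already-colored neighbors (ties go to the smallest color index)."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    coloring = [None] * n
    for u in range(n):
        used = [0] * k
        for v in adj[u]:
            if coloring[v] is not None:
                used[coloring[v]] += 1
        coloring[u] = min(range(k), key=lambda c: used[c])
    return coloring


# A 5-cycle is 3-colorable but not 2-colorable.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
conflicts_k3 = count_conflicts(cycle, greedy_k_coloring(5, cycle, k=3))
conflicts_k2 = count_conflicts(cycle, greedy_k_coloring(5, cycle, k=2))
```

With three colors the greedy pass finds a proper coloring of the 5-cycle; with two colors one monochromatic edge is unavoidable because the cycle has odd length.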

2606.03921 2026-06-03 cs.CV

GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

Jiahao Sun, Dingkun Wei, Zehong Shen, Hongyu Zhou, Yujun Shen, Liang Li

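AI summary: Proposes GARDEN, a framework that uses a gravity prior to reconstruct multi-view RGB images into a structured hybrid scene representation with explicit rigid bodies and a decoupled background, supporting direct physics simulation.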

Abstract

Converting multi-view RGB observations into simulation-ready 3D environments remains challenging because current reconstruction pipelines produce monolithic scene representations without explicit physical structure. They are typically defined up to an arbitrary global rotation and entangle rigid foreground objects with background geometry, which hinders stable physical interaction. Existing solutions often recover interactivity by replacing reconstructed objects with retrieved CAD assets, but this introduces a slow retrieval-and-replacement stage and weakens scene-specific geometric fidelity. We propose GARDEN, an RGB-only framework that reformulates reconstruction as physically-grounded scene factorization and outputs a structured hybrid scene representation. The key idea is to use gravity as a universal physical prior: we first align the reconstruction to a unified Gravity-View frame to resolve gauge ambiguity, then recover object-centric rigid meshes with accurate 6-DoF placement, and finally remove duplicate object geometry from the background through conditional 3D point classification. The resulting representation combines explicit rigid bodies with a decoupled background, enabling direct physics simulation while preserving visual realism. Experiments on both simulated and real multi-view scenes show that GARDEN improves object placement reliability, disentanglement quality, and rendering-simulation efficiency compared with retrieval-based baselines.
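One ingredient of gravity alignment can be shown in isolation: given an estimated gravity direction in the reconstruction's arbitrary frame, compute a rotation that maps it onto a canonical down axis. The sketch below uses Rodrigues' formula; how GARDEN estimates gravity and how it fixes the remaining rotation about the gravity axis are not described in the abstract, so the function name, the choice of $-z$ as "down", and the sample vector are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def apply(R, v):
    return [dot(row, v) for row in R]


def rotation_aligning(src, dst=(0.0, 0.0, -1.0)):
    """Rotation matrix R with R @ unit(src) == unit(dst)."""
    a, b = unit(src), unit(dst)
    c = dot(a, b)
    if c < -1.0 + 1e-9:
        # Antiparallel vectors: rotate by pi about any axis n perpendicular to a,
        # which gives R = 2 n n^T - I.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        n = unit(cross(a, helper))
        return [[2.0 * n[i] * n[j] - (1.0 if i == j else 0.0) for j in range(3)]
                for i in range(3)]
    v = cross(a, b)
    K = [[0.0, -v[2], v[1]],
         [v[2], 0.0, -v[0]],
         [-v[1], v[0], 0.0]]
    K2 = matmul(K, K)
    f = 1.0 / (1.0 + c)
    # Rodrigues' formula for unit vectors: R = I + K + K^2 / (1 + c).
    return [[(1.0 if i == j else 0.0) + K[i][j] + f * K2[i][j] for j in range(3)]
            for i in range(3)]


# Hypothetical gravity estimate in an arbitrarily rotated reconstruction frame.
gravity_est = [0.3, -0.5, -0.8]
R = rotation_aligning(gravity_est)
aligned = apply(R, unit(gravity_est))
```

Applying `R` to every point and camera pose of the reconstruction would place the scene in a frame whose $-z$ axis points along the estimated gravity; a rotation about that axis remains free and has to be fixed by some other convention.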

2606.03920 2026-06-03 cs.CV

Benchmarking Visual State Tracking in Multimodal Video Understanding

Sihyun Yu, Nanye Ma, Pinzhi Huang, Hyunseok Lee, Shusheng Yang, June Suk Choi, Ellis Brown, Oscar Michel, Boyang Zheng, Jinwoo Shin, Saining Xie

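AI summary: Proposes the VSTAT benchmark, which evaluates visual state tracking in multimodal large language models using questions that require continuous perception and integration of the entire video stream; current models fall far below human performance, with failures stemming mainly from visual perception rather than textual reasoning.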

Comments
Website: https://vision-x-nyu.github.io/vstat-site/
Abstract

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.

2606.03919 2026-06-03 cs.SI cs.CY cs.DL cs.LG physics.soc-ph

Forecasting Conceptual Diffusion in Science: The Case of Quantum Computing

Thomas Maillart, Thibaut Chataing, David Dosu, Paul Bagourd, Julian Jang-Jaccard, Alain Mermoud

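AI summary: Builds a temporally resolved concept co-occurrence network and trains LightGBM models to study how predictable endogenous consolidation and exogenous diffusion are for quantum computing concepts. Exogenous diffusion and entropy are strongly predictable ($R^2$ up to 0.78), whereas endogenous consolidation is nearly unpredictable in quantum computing but markedly more predictable in neuro implants ($R^2 = 0.83$), suggesting that conceptual diffusion is governed by stable structural regularities in semantic and citation environments.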

Comments
19 pages, 5 figures, 6 tables. Code and manuscript sources: https://github.com/wazaahhh/breakthroughs-diffusion . An earlier version was presented at the Global Tech Mining Conference (GTM) 2026 (submission #117)
Abstract

Understanding and anticipating scientific change requires models that distinguish between endogenous consolidation and exogenous diffusion of scientific concepts. Using the quantum computing subtree of concepts in OpenAlex, we construct a temporally resolved concept co-occurrence network and track each concept pair through its upstream citation lineage and downstream diffusion. We train LightGBM models on distributional and diversity-aware features to predict four outcomes: endogenous reinforcement, exogenous diffusion, their ratio, and diffusion entropy. After controlling for overall publication growth of the scientific body, endogenous reinforcement proves largely unpredictable in the primary quantum-computing benchmark. In contrast, exogenous diffusion and entropy are strongly predictable ($R^2$ up to $0.78$) and are driven by upstream heterogeneity, citation breadth, and distributional dispersion, as shown by SHAP analyses; replications on robotics, advanced materials, and neuro implants confirm that exogenous diffusion remains the top-ranked target across fields ($R^2_{\text{test}} \sim 0.60\text{--}0.87$), while endogenous predictability rises markedly in neuro implants ($R^2_{\text{test}} = 0.83$), indicating that the quantum-computing asymmetry does not generalise uniformly. Case studies reveal that sharp entropy increases coincide with the opening of new conceptual frontiers, while entropy collapses signal technological convergence or paradigm displacement. These results demonstrate that conceptual diffusion is governed by stable structural regularities embedded in semantic and citation environments. By identifying early diversity-based signals of cross-domain uptake, the approach provides a scalable foundation for anticipatory scientometrics, technology foresight, and innovation-oriented policy analysis in rapidly evolving research fields.
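The abstract does not define diffusion entropy. One common choice, assumed here purely for illustration, is the Shannon entropy of the distribution of fields in which the works citing a concept pair fall; the field labels below are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical downstream fields of citing works for two concept pairs.
concentrated = ["physics"] * 8
spread = ["physics", "chemistry", "computer science", "materials"] * 2

h_concentrated = shannon_entropy(concentrated)
h_spread = shannon_entropy(spread)
```

Under this reading, an entropy increase means uptake is spreading across more fields, and an entropy collapse means it is concentrating in a few, which matches the qualitative interpretation given in the case studies.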

2606.03918 2026-06-03 cs.AI

Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

Eric Cho, Shawn Huang, Alice Lu, Andy Lyu

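AI summary: Proposes Hedge-Bench, a benchmark of 102 tasks based on reasoning traces from the actual work of hedge fund analysts, for evaluating AI agents on open-ended financial reasoning; frontier models score below 16%.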

Comments
Dataset and evaluation harness available at github.com/Trata-Inc/trata-hedge-bench
Abstract

AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16% on the benchmark. We publish the dataset and evaluation harness at github.com/Trata-Inc/trata-hedge-bench.

2606.03915 2026-06-03 cs.CV

PatchScene: Patch-based Voxel Diffusion for Large-Scale Scene Completion

Qingdong Xu, Jiajun Zhu, Shilin Zhu, Xinjing He, Chao Lu, Huanran Wang, Jiyao Zhang

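AI summary: Proposes PatchScene, a patch-based voxel diffusion framework that achieves large-scale LiDAR scene completion through fine-grained generation in local 3D regions, confidence-guided spatio-temporal fusion, and an Annular-Flow diffusion strategy, reaching state-of-the-art performance on SemanticKITTI and showing strong generalization.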

Comments
10 pages, 5 figures, 5 tables
Abstract

We propose PatchScene, a novel diffusion-based framework for large-scale LiDAR scene completion. Unlike existing methods that rely on global latent representations or dense voxel grids, PatchScene adopts a patch-based voxel diffusion paradigm that explicitly generates fine-grained geometry within localized 3D regions. To ensure coherent reconstruction at both spatial and temporal scales, we introduce a confidence-guided spatio-temporal fusion mechanism that integrates overlapping patches and adjacent frames in a unified generative process. Furthermore, we design an Annular-Flow diffusion strategy that leverages the radial density pattern of LiDAR scans to progressively propagate high-fidelity information from near-range to far-range regions, enabling spatially unbounded scene completion. Extensive experiments on the SemanticKITTI benchmark demonstrate that PatchScene achieves state-of-the-art performance across all standard metrics, surpassing previous approaches in both geometric accuracy and temporal consistency. Remarkably, the model trained on 20 m LiDAR ranges generalizes effectively to 50 m scenes without retraining, highlighting its strong scalability and generalization capability for real-world autonomous driving applications.

2606.03911 2026-06-03 cs.CV

Bootstrap Your Generator: Unpaired Visual Editing with Flow Matching

Bootstrap Your Generator: 基于流匹配的无配对视觉编辑

Yoad Tewel, Yuval Atzmon, Gal Chechik, Lior Wolf

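AI Summary: Proposes the Bootstrap Your Generator (ByG) framework, which uses the base model's knowledge and flow matching to train image and video editing models without paired data or external signals, reaching state-of-the-art performance in data-scarce scenarios.

Details
Comments
Accepted at ICML 2026. Project page: https://research.nvidia.com/labs/par/byg/
AI Chinese Abstract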

现代生成模型对视觉内容有深刻理解,但训练它们进行图像编辑通常需要大量配对示例数据集。这限制了可扩展性,尤其是对于视频编辑,收集配对数据成本过高。我们提出了Bootstrap Your Generator (ByG),一个用于流匹配编辑模型无配对训练的通用框架。它利用基础模型的知识,无需任何外部信号。我们的方法将从冻结模型中提取的指令遵循线索与用于结构保持的循环一致性相结合。为了使这可行,我们提出将来自干净预测的下游损失的梯度路由到噪声训练状态。我们在具有挑战性的数据稀缺图像和视频编辑场景中展示了最先进的结果。大量评估和用户研究表明,我们的方法有效泛化到未见过的领域,并优于在数百万样本上训练的监督基线。分析表明,我们的梯度路由弥合了训练-推理差距,从基础模型中提取语义线索提供了强大的训练信号,消除了对外部奖励模型的需求。

English Abstract

Modern generative models possess a deep understanding of visual content, yet training them for image editing typically requires massive datasets of paired examples. This limits scalability, especially for video editing where collecting paired data is prohibitively expensive. We propose Bootstrap Your Generator (ByG), a general framework for unpaired training of flow matching editing models. It leverages the base model's knowledge without any external signal. Our approach pairs instruction-following cues extracted from the frozen model with cycle-consistency for structure preservation. To make this tractable, we propose to route gradients from downstream losses over clean predictions to noisy training states. We demonstrate state-of-the-art results on challenging data-scarce image and video editing scenarios. Extensive evaluations and user studies show that our method effectively generalizes to unseen domains and outperforms supervised baselines trained on millions of samples. Analysis reveals that our gradient routing bridges the train-inference gap, and extracting semantic cues from a base model provides a robust training signal that obviates the need for external reward models.

2606.03910 2026-06-03 cs.PF cs.AI cs.DC cs.NI

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

NetKV: 面向分解式LLM推理的网络感知解码实例选择

Mubarak Adetunji Ojewale

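AI Summary: To address the increase in time to first token (TTFT) caused by KV cache transfer in disaggregated LLM inference, proposes NetKV, a network-cost-aware scheduler that selects decode instances with a greedy algorithm and reduces mean TTFT by up to 21.2% on a 64-GPU fat-tree simulator.

Details
AI Chinese Abstract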

分解式LLM推理迫使KV缓存在解码开始前穿越数据中心网络,因此传输时间直接计入首令牌时间(TTFT)预算。当前调度器仅根据计算负载和前缀缓存局部性进行路由,忽略了预填充和解码实例之间的拓扑距离和动态拥塞。我们通过一个轻量级的算子到调度器接口(网络成本预言机)来弥补这一差距,并证明忽略网络项会导致仅缓存感知的调度在上下文长度增长时任意次优。NetKV是一个每请求O(|D|)的贪心算法,它消耗该预言机,其层级排名对过时遥测数据具有可证明的鲁棒性。在由Mooncake轨迹驱动的64-GPU四层胖树模拟器上,NetKV相比轮询调度平均降低TTFT达21.2%,相比调优的缓存+负载感知调度器降低17.6%,将SLO达标率提升最多20.1个百分点,并在所有测试条件下将令牌间时间开销保持在0.5毫秒以下,无需对传输、推理引擎或硬件进行任何更改。

English Abstract

Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.
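
As a rough illustration of the per-request greedy selection described above, the sketch below scores each decode instance by an estimated KV-transfer time plus a load term and picks the minimum in a single O(|D|) pass. The cost model, the field names (`load_ms`, `bandwidth_gbps`), and the `network_cost_ms` stand-in for the network cost oracle are all assumptions made for illustration; the abstract does not specify NetKV's actual scoring function.

```python
from dataclasses import dataclass


@dataclass
class DecodeInstance:
    name: str
    load_ms: float         # hypothetical queueing-delay estimate for this instance
    bandwidth_gbps: float  # hypothetical available bandwidth from the prefill node


def network_cost_ms(kv_bytes: int, inst: DecodeInstance) -> float:
    """Stand-in for a network cost oracle: estimated KV transfer time in ms."""
    bits = kv_bytes * 8
    return bits / (inst.bandwidth_gbps * 1e9) * 1e3


def select_decode_instance(kv_bytes: int, instances: list) -> DecodeInstance:
    """One pass over the candidates, so the cost is O(|D|) per request."""
    return min(instances, key=lambda d: d.load_ms + network_cost_ms(kv_bytes, d))


candidates = [
    DecodeInstance("same-rack", load_ms=12.0, bandwidth_gbps=100.0),
    DecodeInstance("cross-pod", load_ms=2.0, bandwidth_gbps=10.0),
]
short_ctx = select_decode_instance(10 * 1024**2, candidates)  # 10 MiB KV cache
long_ctx = select_decode_instance(2 * 1024**3, candidates)    # 2 GiB KV cache
```

With these made-up numbers, the lightly loaded remote instance wins for the small cache, while the transfer term dominates for the large one and the nearby instance is chosen, which mirrors the abstract's point that the network term matters more as context length grows.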

2606.03909 2026-06-03 cs.CV

SparseStreet: Sparse Gaussian Splatting for Real-Time Street Scene Simulation

SparseStreet: 用于实时街景模拟的稀疏高斯泼溅

Qingpo Wuwu, Xiaobao Wei, Peng Chen, Nan Huang, Zhongyu Zhao, Hao Wang, Ming Lu, Ningning Ma, Shanghang Zhang

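AI Summary: To address redundant Gaussian primitives in street scene reconstruction, proposes a framework combining node-based learnable pruning with background compression, achieving a compression ratio of up to 80% with minimal quality loss.

Details
AI Chinese Abstract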

尽管3D高斯泼溅在街景重建中显示出有希望的结果,现有方法需要大量高斯原语来捕捉细节,导致存储成本过高和渲染速度缓慢。我们观察到动态对象(如车辆和行人)需要高保真表示以保持时间一致性,而静态背景区域通常包含大量冗余。受此启发,我们提出SparseStreet,一种专为街景设计的通用压缩框架。首先,我们引入基于节点的可学习剪枝策略,系统性地移除低贡献高斯原语,同时保留视觉关键区域。其次,在场景表示稳定后,我们应用背景压缩,进一步减少静态区域中的冗余。我们的方法有效保留了动态对象的几何和外观,同时显著减少了高斯原语的总数。在Waymo和nuScenes上的大量实验表明,SparseStreet实现了高达80%的压缩比,且质量退化极小,实现了资源高效、高保真的动态场景重建。项目网站:https://sparsestreet.github.io/。

English Abstract

While 3D Gaussian Splatting has shown promising results in street scene reconstruction, existing methods require massive numbers of Gaussian primitives to capture fine details, leading to prohibitive storage costs and slow rendering speeds. We observe that dynamic objects (e.g., vehicles and pedestrians) demand high-fidelity representations to maintain temporal consistency, while static background regions often contain substantial redundancy. Motivated by this, we propose SparseStreet, a general compression framework specifically designed for street scenes. First, we introduce a node-based learnable pruning strategy that systematically removes low-contributing Gaussian primitives while preserving visually critical regions. Second, after the scene representation stabilizes, we apply background compression, further reducing redundancy in static regions. Our method effectively preserves the geometry and appearance of dynamic objects while significantly reducing the total number of Gaussian primitives. Extensive experiments on Waymo and nuScenes demonstrate that SparseStreet achieves a compression ratio of up to 80% with minimal quality degradation, enabling resource-efficient, high-fidelity dynamic scene reconstruction. Project website: https://sparsestreet.github.io/.

2606.03907 2026-06-03 cs.SE cs.AI cs.HC

The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol

配置智能体AI编码工具对构建vs购买决策的影响:一项研究协议

Jai Lal Lulla, Matthias Galster, Jie M. Zhang, Sebastian Baltes, Christoph Treude

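AI Summary: Through a controlled experimental protocol, this study examines how configuration mechanisms affect the build-versus-buy behavior of agentic AI coding tools such as Claude Code and OpenAI Codex, and will release a reusable benchmark dataset and analysis pipeline.

Details
Comments
14 pages, 1 table. Accepted at the 20th International Symposium on Empirical Software Engineering and Measurement (ESEM 2026), Registered Reports track
AI Chinese Abstract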

智能体AI编码工具以越来越高的自主性编写代码,并在此过程中决定何时导入库以及何时从头实现功能。这些决策——是从头构建功能还是购买外部库(以下称为构建vs购买)——对软件安全性、许可合规性、性能和长期可维护性有直接影响。然而,尚无受控实验研究探讨智能体AI编码工具中构建vs购买决策的支配因素。配置机制,即开发人员根据项目或工作流程定制智能体AI编码工具行为的手段,是实践者影响这些决策的主要方式之一。但尚不清楚哪些配置机制最有效地影响构建vs购买决策。我们提出了一项预注册协议,研究配置机制如何改变两种流行的智能体AI编码工具(Claude Code和OpenAI Codex)中的构建vs购买行为。我们将执行来自阶段性项目基准的受控编程任务,每个任务围绕可识别的构建vs购买点构建,并操纵提供给每个工具的配置,范围从无配置、包含软偏好和明确禁止的上下文文件,到技能(可自主发现的指令)、支持MCP的库发现工具和权限控制,测量工具选择的库、是否披露新引入的库以及这些披露是否完整准确。九个预注册假设构成了该协议。生成的基准数据集和分析流程将作为可复用工件发布,用于评估智能体AI编码工具中的构建vs购买行为。

English Abstract

Agentic AI coding tools write code with increasing autonomy and in doing so decide when to import a library and when to implement functionality from scratch. These decisions, whether to build functionality from scratch or buy into an external library, hereafter build-versus-buy, carry direct consequences for software security, licensing compliance, performance, and long-term maintainability. Yet no controlled experimental study has examined what governs build-versus-buy decisions in agentic AI coding tools. Configuration mechanisms, i.e., the means by which developers tailor agentic AI coding tool behavior to a project or workflow, are one of the primary means by which practitioners can influence these decisions. However, it is unclear which configuration mechanisms influence build-versus-buy decisions most effectively. We present a pre-registered protocol to study how configuration mechanisms alter build-versus-buy behavior in two popular agentic AI coding tools: Claude Code and OpenAI Codex. We will execute controlled programming tasks drawn from a benchmark of staged projects, each constructed around identifiable build-versus-buy points, and will manipulate the configuration supplied to each tool, ranging from no configuration, through context files with soft preferences and explicit prohibitions, to Skills (instructions that can be autonomously discovered), MCP-enabled library discovery tools, and permission controls, measuring which libraries the tool selects, whether it discloses newly introduced libraries, and whether those disclosures are complete and accurate. Nine pre-registered hypotheses structure the protocol. The resulting benchmark dataset and analysis pipeline will be released as a reusable artifact for evaluating build-versus-buy behavior in agentic AI coding tools.

2606.03906 2026-06-03 cs.AI

scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

scTranslation:单细胞多组学模态翻译的综合基准

Jiabei Cheng, Jingbo Zhou, Jun Xia, Changkai Li, Zhen Lei, Chang Yu, Stan Z. Li

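AI Summary: For single-cell multi-omics modality translation, proposes scTranslation, a comprehensive benchmark comprising diverse datasets, state-of-the-art models, and comprehensive evaluation metrics, and systematically studies influencing factors such as feature selection, feature quality, and few-shot settings.

Details
AI Chinese Abstract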

在单细胞中同时测量多种组学模态使研究人员能够更全面地理解细胞状态和调控机制。然而,由于高实验成本、显著噪声和不完全的模态覆盖,近年来出现了多种用于模态翻译的计算方法。尽管翻译模型有所发展,但在数据集、评估指标和影响因素方面仍缺乏系统的基准评估。为此,我们提出了scTranslation,一个用于单细胞多组学模态翻译任务的综合基准。它包括多样化的翻译数据集,整合了最先进的模型,并提供了全面的评估指标。此外,我们评估了模型在不同场景下的性能,如特征选择、特征质量和少样本设置。这些因素显著影响模型性能,但此前很少被系统研究。利用该基准,我们对当前方法进行了大规模研究,报告了许多有洞察力的发现,为未来发展开辟了新的可能性。该基准已开源以促进未来研究。代码匿名发布于 https://github.com/Bunnybeibei/scTranslation。

English Abstract

Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. To address this, we present scTranslation, a comprehensive benchmark for single-cell multi-omics modality translation tasks. It includes diverse translation datasets, integrates state-of-the-art models, and provides comprehensive evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few-shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development. The benchmark is open-sourced to facilitate future research. The code is anonymously released at https://github.com/Bunnybeibei/scTranslation.

2606.03905 2026-06-03 cs.RO

Semantic-weighted ICP for LiDAR Odometry: Class-Aware Residual Reweighting for Robust Scan Registration

语义加权ICP用于LiDAR里程计:基于类感知残差加权的鲁棒扫描配准

Vasco Carvalho, Tiago Barros, Urbano J. Nunes

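AI Summary: Proposes a semantic-weighted ICP method that weights residuals by the geometric stability of each semantic class, improving the robustness of LiDAR odometry pose estimation in dynamic and complex environments.

Details
AI Chinese Abstract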

LiDAR里程计是自主机器人系统的基本组成部分,依赖于连续点云之间的几何配准来估计自运动。然而,传统的几何方法在动态或非结构化环境中常常退化,原因是移动物体、稀疏几何特征、植被和语义模糊结构导致不可靠的对应关系。现有工作表明,其中一些限制可以通过在配准过程中引入环境的语义信息来解决。在这项工作中,我们在此基础上进一步表明,并非环境中的所有元素对配准同等重要。因此,我们提出了一种用于LiDAR里程计的语义类加权ICP。所提出的方法不是严格过滤掉属于特定语义类别的点,而是根据其预期的几何稳定性对属于语义类别的点的残差进行加权。这种策略使得信息丰富但可能不稳定的结构能够对配准过程做出贡献,同时减轻动态物体的影响。实验评估在SemanticKITTI和RELLIS-3D数据集上进行,这些数据集包括城市、高速公路、乡村和越野环境。实证结果表明,所提出的语义加权ICP改进了位姿估计,特别是在传统刚性特征稀缺的具有挑战性的越野场景中。此外,分析表明,这种加权策略的有效性高度依赖于环境,受场景的结构和语义组成影响。

English Abstract

LiDAR odometry is a fundamental component of autonomous robotic systems, relying on geometric registration between consecutive point clouds to estimate ego-motion. However, traditional geometric approaches often degrade in dynamic or unstructured environments due to unreliable correspondences caused by moving objects, sparse geometric features, vegetation, and semantically ambiguous structures. Existing works have shown that some of these limitations can be addressed by introducing semantic information from the environment into the registration process. In this work, we build on this and show that not all elements in the environment are equally relevant for registration. Hence, we propose a semantic class-weighted ICP for LiDAR odometry. Instead of strictly filtering out points belonging to specific semantic classes, the proposed approach weights the residuals of points belonging to semantic categories based on their expected geometric stability. This strategy enables informative but potentially unstable structures to contribute to the registration process while mitigating the influence of dynamic objects. The experimental evaluation was conducted on the SemanticKITTI and RELLIS-3D datasets, which include urban, highway, rural, and off-road environments. The empirical results show that the proposed Semantic-weighted ICP improves pose estimation, especially in challenging off-road scenarios where conventional rigid features are scarce. Furthermore, the analysis reveals that the effectiveness of this weighting strategy is highly environment-dependent, influenced by the structural and semantic composition of the scene.
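
To make the idea of class-aware residual weighting concrete, the sketch below solves a single weighted rigid alignment step (a weighted Kabsch fit) in which each correspondence is weighted by its semantic class rather than kept or discarded. It is a minimal illustration under several assumptions: correspondences are taken as given (a full ICP loop would re-estimate them each iteration), the point-to-point residual is used, and the class names and weight values in `CLASS_WEIGHTS` are invented for the example and are not taken from the paper.

```python
import numpy as np

# Hypothetical per-class weights: stable structure high, dynamic objects low.
CLASS_WEIGHTS = {"building": 1.0, "vegetation": 0.5, "car": 0.1}


def weighted_rigid_fit(src, dst, weights):
    """Solve min over (R, t) of sum_i w_i * ||R @ src_i + t - dst_i||^2."""
    w = weights / weights.sum()
    src_mean = (w[:, None] * src).sum(axis=0)
    dst_mean = (w[:, None] * dst).sum(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    H = (w[:, None] * src_c).T @ dst_c        # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t


rng = np.random.default_rng(0)
labels = np.array(["building"] * 30 + ["car"] * 10)
src = rng.normal(scale=10.0, size=(40, 3))
angle = np.deg2rad(5.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 0.0, 0.0])
dst = src @ R_true.T + t_true
dst[labels == "car"] += np.array([2.0, 0.0, 0.0])  # cars moved between scans

w_sem = np.array([CLASS_WEIGHTS[c] for c in labels])
_, t_sem = weighted_rigid_fit(src, dst, w_sem)
_, t_uni = weighted_rigid_fit(src, dst, np.ones(len(src)))
err_sem = np.linalg.norm(t_sem - t_true)  # translation error, semantic weights
err_uni = np.linalg.norm(t_uni - t_true)  # translation error, uniform weights
```

In this toy setup the moving "car" points still contribute to the fit, but their reduced weight limits how far they pull the translation estimate away from the true motion.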

2606.03904 2026-06-03 cs.LG cs.CV

MAdam: Metric-Aware Multi-Objective Adam

MAdam: 度量感知的多目标Adam

Fengbei Liu, Rachit Saluja, Sunwoo Kwak, Ruibo Wang, Ruining Deng, Heejong Kim, Johannes C. Paetzold, Mert R. Sabuncu

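AI Summary: Proposes MAdam, which preconditions the reconciled direction in multi-objective optimization with preference-conditioned curvature, resolving the weighting and geometric mismatches between Adam and the solver, and consistently improves performance on multi-task learning, Pareto-front recovery, and other tasks.

Details
AI Chinese Abstract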

多目标优化是许多机器学习问题的基础,然而跨损失平衡、梯度平衡和基于帕累托的求解器家族几乎都将它们协调后的方向交给Adam处理。我们表明这种耦合在求解器的意图和优化器的执行之间引入了两个系统性差距。第一个是权重失配:Adam的二阶矩分母将时变偏好向量与梯度统计量纠缠在一起,将偏好边缘化为历史平均值,并将不同的帕累托权衡压缩为近乎均匀的混合。第二个是几何失配:Adam的自适应度量扭曲了多目标优化求解器假设的欧几里得几何,将对齐的目标转化为明显的冲突。为了共同解决这两个问题,我们引入了MAdam(度量感知的多目标Adam),这是一个即插即用的包装器,不改变求解器和优化器。MAdam通过标量化目标的偏好条件曲率对协调方向进行预处理;在此白化输入上,Adam的二阶矩退化为单位矩阵,因此实际更新由偏好条件度量主导。在多任务学习、帕累托前沿恢复、物理信息神经网络和医学成像中,MAdam在每个求解器家族上都一致优于Adam。

English Abstract

Multi-objective optimization (MOO) underlies many machine learning problems, yet MOO solvers across the loss-balancing, gradient-balancing, and Pareto-based families almost universally hand their reconciled directions to Adam (Kingma and Ba, 2015). We show this coupling introduces two systematic gaps between the solver's intent and the optimizer's execution. The first is a weighting mismatch: Adam's second-moment denominator entangles the time-varying preference vector with gradient statistics, marginalizing the preference into a history average and collapsing distinct Pareto trade-offs toward a near-uniform mixture. The second is a geometric mismatch: Adam's adaptive metric distorts the Euclidean geometry MOO solvers assume, turning aligned objectives into apparent conflicts. To resolve both jointly, we introduce MAdam (Metric-Aware Multi-Objective Adam), a drop-in wrapper that leaves both solver and optimizer unchanged. MAdam preconditions the reconciled direction by the preference-conditioned curvature of the scalarized objective; on this whitened input, Adam's second moment collapses to identity, so the realized update is governed by the preference-conditioned metric. Across multi-task learning, Pareto-front recovery, physics-informed neural networks, and medical imaging, MAdam consistently improves over Adam for every solver family.

2606.03903 2026-06-03 cs.CV

An Attention-Based Denoising Model for Diffusion Weighted Imaging

一种基于注意力的扩散加权成像去噪模型

Prithviraj Verma, Pawan Kumar, Chandan Deshani, Prasun Chandra Tripathi

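AI Summary: Proposes a noise-aware, attention-driven denoising framework that combines Swin Transformer window attention with multi-dimensional gated refinement to handle signal-dependent Rician noise in DWI, achieving a mean PSNR of 33.69 dB and SSIM of 0.8539 across noise levels from 1% to 15%.

Details
AI Chinese Abstract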

扩散加权成像(DWI)用于全身癌症筛查,但通常需要较长的采集时间。当扫描时间减少时,图像质量往往会下降,导致扫描中的噪声增加。DWI中的幅度重建引入了信号依赖的Rician噪声,这使得传统的基于卷积的方法去噪更具挑战性。为了解决这一限制,我们提出了一种噪声感知的注意力驱动去噪框架,该框架将分层Swin Transformer窗口注意力与基于transformer的多维门控精化相结合,用于DWI恢复。该模型结合了显式噪声水平调节和残差重建,以实现对广泛损坏水平下异方差噪声的自适应抑制。在损坏的DWI扫描上的实验评估显示了强大的恢复性能。我们的模型在1%至15%的噪声水平下实现了平均PSNR 33.69 dB和SSIM 0.8539,同时在严重噪声条件下保持稳定行为。这些结果表明,注意力引导的上下文建模与通道自适应精化相结合,为DWI去噪提供了稳健且可推广的解决方案。

English Abstract

Diffusion-weighted imaging (DWI) is used for whole-body cancer screening, but it typically requires a long acquisition time. When the scan time is reduced, the image quality often suffers, leading to increased noise in the scans. Magnitude reconstruction in DWI introduces signal-dependent Rician noise, which makes denoising more challenging for conventional convolution-based methods. To address this limitation, we propose a noise-aware attention-driven denoising framework that integrates hierarchical Swin Transformer window attention with transformer-based multi-dimensional gated refinement for DWI restoration. The model incorporates explicit noise-level conditioning and residual reconstruction to enable adaptive suppression of heteroscedastic noise across a wide range of corruption levels. Experimental evaluation on corrupted DWI scans demonstrates strong restoration performance. Our model achieves a mean PSNR of 33.69 dB and SSIM of 0.8539 across noise levels from 1% to 15%, while maintaining stable behavior under severe noise conditions. These results indicate that attention-guided contextual modeling combined with channel-adaptive refinement provides a robust and generalizable solution for DWI denoising.
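
The signal-dependent character of Rician noise can be seen in a small simulation: Gaussian noise is added to the real and imaginary channels and the magnitude is taken, which leaves a positive noise floor in signal-free regions. The sketch below is illustrative only. In particular, interpreting a "noise level" as a fraction of peak intensity is an assumption, since the abstract does not define it, and the synthetic square image stands in for a real DWI scan.

```python
import numpy as np


def add_rician_noise(image, level, rng):
    """Magnitude of a complex signal with Gaussian noise on both channels."""
    sigma = level * image.max()  # assumption: level is a fraction of peak intensity
    real = image + rng.normal(0.0, sigma, image.shape)
    imag = rng.normal(0.0, sigma, image.shape)
    return np.sqrt(real**2 + imag**2)


def psnr(reference, estimate):
    """Peak signal-to-noise ratio in dB, using the reference peak value."""
    mse = np.mean((reference - estimate) ** 2)
    return 10.0 * np.log10(reference.max() ** 2 / mse)


rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0                    # bright square on a dark background
noisy_15 = add_rician_noise(clean, 0.15, rng)
noisy_01 = add_rician_noise(clean, 0.01, rng)

background_bias = noisy_15[clean == 0].mean()  # positive noise floor where signal is 0
psnr_15 = psnr(clean, noisy_15)
psnr_01 = psnr(clean, noisy_01)
```

Because the magnitude operation rectifies the noise, the background mean is strictly positive rather than zero, which is one reason denoisers designed for additive zero-mean Gaussian noise can struggle on magnitude DWI data.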

2606.03895 2026-06-03 cs.OS cs.AI cs.CR

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

Agent libOS: 一种受库操作系统启发的运行时,用于长时间运行、能力受控的LLM智能体

Yingqi Zhang

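AI Summary: Proposes the Agent libOS runtime, which models an LLM agent as an AgentProcess with process identity, lifecycle, capability control, and audit records, and achieves safe scheduling and resource control through libc-like tool wrappers and authority checks at runtime primitive boundaries.

Details
Comments
14 pages, 1 figure, 2 tables
AI Chinese Abstract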

大型语言模型(LLM)智能体正在从请求-响应助手演变为长时间运行的软件参与者:它们在模型调用之间维护状态,分叉子任务,等待外部事件,请求人类授权,生成工具,并执行必须被恢复和审计的副作用。本文提出Agent libOS,一种受库操作系统启发的LLM智能体运行时基础。Agent libOS运行在传统主机操作系统之上;它不实现硬件驱动、内核模式隔离或POSIX兼容操作系统。相反,它将智能体视为一个AgentProcess:一个可调度的执行主体,具有进程标识、父子谱系、生命周期状态、从AgentImage派生的工具表、类型化对象内存、显式能力、人类队列、检查点、事件和审计记录。其核心设计原则是工具是类似libc的包装器;运行时原语是权限边界。文件系统访问、对象访问、睡眠、人类批准、JIT工具注册和外部副作用都在显式能力和策略下在原语边界进行检查。我们描述了设计、威胁模型、Python原型和面向安全的评估。当前原型实现了异步调度、命名空间本地对象内存、运行时集成的人类批准、一次性权限授予、每进程工作目录、shell和图像注册原语、基于libOS系统调用代理的Deno/TypeScript JIT工具、文件系统/对象桥接工具、可注入的资源提供者基础、确定性演示、真实模型烟雾脚本以及撰写时的123个回归测试。Agent libOS不是提高规划器准确性,而是展示了一种运行时基础,在该基础上,长时间运行的LLM智能体可以被调度、授权、恢复和审计,而无需将工具分发视为信任边界。

English Abstract

Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resumed and audited. This paper presents Agent libOS, a library-OS-inspired runtime substrate for LLM agents. Agent libOS runs above a conventional host operating system; it does not implement hardware drivers, kernel-mode isolation, or a POSIX-compatible operating system. Instead, it treats an agent as an AgentProcess: a schedulable execution subject with process identity, parent-child lineage, lifecycle state, a tool table derived from an AgentImage, typed Object Memory, explicit capabilities, human queues, checkpoints, events, and audit records. Its central design rule is that tools are libc-like wrappers, while runtime primitives are the authority boundary. Filesystem access, object access, sleeps, human approval, JIT tool registration, and external side effects are checked at primitive boundaries under explicit capabilities and policy. We describe the design, threat model, Python prototype, and safety-oriented evaluation. The current prototype implements async scheduling, namespace-local Object Memory, runtime-integrated human approval, one-shot permission grants, per-process working directories, shell and image-registration primitives, Deno/TypeScript JIT tools over a libOS syscall broker, filesystem/object bridge tools, an injectable Resource Provider Substrate, deterministic demos, real-model smoke scripts, and 123 regression tests at the time of writing. Rather than improving planner accuracy, Agent libOS demonstrates a runtime substrate in which long-running LLM agents can be scheduled, authorized, resumed, and audited without treating tool dispatch as the trust boundary.
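
The design rule that tools are thin wrappers while runtime primitives enforce authority can be sketched in a few lines. The example below is a toy model, not the paper's prototype: the class layout, the `fs.read` capability name, and the methods `grant_once` and `sys_read_file` are all hypothetical names chosen for illustration, and no real file I/O is performed.

```python
class CapabilityError(PermissionError):
    """Raised when a primitive is called without the required capability."""


class AgentProcess:
    """Hypothetical, heavily simplified stand-in for an AgentProcess."""

    def __init__(self, pid, capabilities=()):
        self.pid = pid
        self.capabilities = set(capabilities)  # standing grants
        self.one_shot = set()                  # grants consumed on first use
        self.audit = []                        # (primitive, argument, allowed)

    def grant_once(self, cap):
        self.one_shot.add(cap)

    def _authorize(self, cap, primitive, arg):
        if cap in self.capabilities:
            allowed = True
        elif cap in self.one_shot:
            self.one_shot.remove(cap)          # one-shot grant is used up
            allowed = True
        else:
            allowed = False
        self.audit.append((primitive, arg, allowed))
        if not allowed:
            raise CapabilityError(f"pid {self.pid} lacks capability {cap!r}")

    def sys_read_file(self, path):
        """Runtime primitive: the authority check happens here."""
        self._authorize("fs.read", "read_file", path)
        return f"<contents of {path}>"         # placeholder, no real file I/O


def read_tool(proc, path):
    """Tool: a thin wrapper that carries no authority of its own."""
    return proc.sys_read_file(path).upper()


proc = AgentProcess(pid=1)
proc.grant_once("fs.read")
first = read_tool(proc, "notes.txt")       # allowed, consumes the one-shot grant
try:
    read_tool(proc, "notes.txt")           # denied, the grant was already used
    second_allowed = True
except CapabilityError:
    second_allowed = False
```

Since the check sits inside the primitive, any tool, including one generated at run time, goes through the same authorization and audit path regardless of how it was dispatched.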

2606.03893 2026-06-03 cs.CV

Electromagnetic Navigation for Femoral Osteotomy Using High-Accuracy X-ray-to-CT Registration

基于高精度X光到CT配准的股骨截骨电磁导航

Roman Flepp, Arend Nieuwland, Bastian Sigrist, Philipp Fürnstahl, Lilian Calvet, Thomas Dreher

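AI Summary: Proposes an electromagnetic-tracking-based navigation system for femoral osteotomy that enables real-time, fluoroscopy-free navigation from a one-time intraoperative C-arm calibration and registration of two X-ray images; in experiments on synthetic femora, its total angular error is significantly lower than free-hand execution and statistically equivalent to patient-specific instrumentation.

Details
Comments
Will be published in the International Journal of Computer Assisted Radiology and Surgery
AI Chinese Abstract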

矫正性股骨截骨术中准确执行术前计划仍具挑战。当前技术受限于精度不一、侵入性和辐射暴露,徒手方法和患者特异性器械(PSI)分别通常需要>30和>6次荧光透视图像。我们提出一种集成的、基于电磁跟踪(EMT)的股骨截骨导航系统,可最小化剥离和术中荧光透视。该系统将基于CT的术前计划与一次性术中C臂标定以及从初始化时获取的两幅X光图像进行精确的X光到CT配准相结合。这使得锯片和骨碎片相对于术前计划的实时、无荧光EMT导航成为可能,并兼容单平面和双平面截骨。在使用18个合成股骨的可行性研究中,EMT引导在总角度误差上显著优于徒手执行((3.05 ± 0.75)° vs. (6.32 ± 2.36)°,p = 0.031),假设两者具有相同的最小手术暴露。EMT引导试验均未超过5°的临床阈值,而徒手6次试验中有4次异常值。该系统在总角度误差(p ≤ 0.02)和总平移误差(p = 0.048)上与PSI达到统计等效(±2°,±2 mm),用户问卷评分无显著差异。通过仅使用两幅X光图像转移术前计划,同时匹配PSI精度且无需额外手术暴露,所提出的系统为后续尸体和临床验证提供了动力。

English Abstract

Accurate execution of preoperative plans in corrective femoral osteotomies remains challenging. Current techniques are limited by variable accuracy, invasiveness, and radiation exposure, with free-hand methods and patient-specific instrumentation (PSI) often requiring >30 and >6 fluoroscopic images, respectively. We present an integrated, electromagnetic tracking (EMT)-based navigation system for femoral osteotomies that minimizes dissection and intraoperative fluoroscopy. The system couples CT-based preoperative planning with one-time intraoperative C-arm calibration and accurate X-ray-to-CT registration from two fluoroscopic images acquired at initialization. This enables real-time, fluoroscopy-free EMT navigation of the saw blade and bone fragments relative to the preoperative plan, and is compatible with uniplanar and biplanar osteotomies. In a feasibility study using 18 synthetic femora, EMT guidance significantly outperformed free-hand execution in total angular error ((3.05 ± 0.75)° vs. (6.32 ± 2.36)°, p = 0.031), assuming the same minimal surgical exposure for both. No EMT-guided trial exceeded the 5° clinical threshold, whereas free-hand execution produced 4 outliers in 6 trials. The system achieved statistical equivalence (±2°, ±2 mm) to PSI for total angular (p ≤ 0.02) and total translational (p = 0.048) errors, with no significant differences in user questionnaire scores. By transferring preoperative plans using only two fluoroscopic images while matching PSI accuracy without additional surgical exposure, the proposed system motivates subsequent cadaveric and clinical validation.

2606.03890 2026-06-03 cs.CV

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

OVO-S-Bench:多模态大语言模型中流式空间智能的分层基准

Yifei Li, Pengyiang Liu, Yuhang Zang, Zhongyue Shi, Qi Fu, Hongye Hao, Jiwen Lu

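AI Summary: Proposes OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence with 1,680 questions spanning four levels of abstraction; an evaluation of 38 MLLMs finds that Gemini-3.1-Pro trails human experts by 27 points and that streaming and spatially fine-tuned MLLMs underperform their own backbones.

Details
Comments
48 pages, 12 figures, 15 tables. Project page: https://internlm.github.io/OVO-S-Bench/
AI Chinese Abstract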

机器人、增强现实和自动驾驶中的多模态智能体必须从连续的自我中心流中推理地点和布局,通常使用当前视野之外的证据。现有基准要么在完整视频上进行离线评估,要么针对事件而非空间结构。我们引入了OVO-S-Bench,一个完全人工标注的流式空间智能基准,包含来自348个源视频的1680个问题。标注涉及12名训练有素的标注员,每人还担任盲审交叉评审,耗时约804人小时的多轮质量保证。每个问题带有一个查询时间戳和一个证据区间,评估时模型仅看到查询之前的前缀。问题涵盖四个抽象层次:瞬时自我中心感知、时空上下文跟踪、空间模拟与推理、以及异中心映射。在38个专有和开源MLLM中,Gemini-3.1-Pro落后人类专家27分(59.2 vs. 86.6),异中心映射是主要瓶颈。值得注意的是,流式和空间微调的MLLM表现不如其骨干模型。我们进一步发现,当链式思维推理未基于流时,会放大空间错误。通过暴露这些局限性,OVO-S-Bench为下一代流式空间MLLM建立了一个高要求的测试平台。

English Abstract

Multimodal agents in robotics, AR, and autonomous driving must reason about places and layouts from continuous egocentric streams, often using evidence outside the current view. Existing benchmarks either evaluate offline over full videos or target events rather than spatial structure. We introduce OVO-S-Bench, a fully human-annotated benchmark for streaming spatial intelligence, comprising 1,680 questions over 348 source videos. Annotation involves 12 trained annotators, each also serving as a blind cross-reviewer, across roughly 804 person-hours of multi-round quality assurance. Each question carries a query timestamp and an evidence interval, and at evaluation, the model sees only the prefix preceding the query. Questions span four levels of increasing abstraction: instantaneous egocentric perception, spatiotemporal context tracking, spatial simulation and reasoning, and allocentric mapping. Across 38 proprietary and open-source MLLMs, Gemini-3.1-Pro trails human experts by 27 points (59.2 vs. 86.6), with allocentric mapping as the dominant bottleneck. Notably, streaming and spatially fine-tuned MLLMs underperform their own backbones. We further find that chain-of-thought reasoning amplifies spatial errors when ungrounded in the stream. By exposing these limitations, OVO-S-Bench establishes a demanding testbed for next-generation streaming spatial MLLMs.

2606.03888 2026-06-03 cs.CV cs.LG

CoralBay: A Self-Supervised CT Foundation Model

CoralBay: 一种自监督CT基础模型

Ioannis Gatopoulos, Nicolas Känzig, Sebastian Otálora, Fei Tang

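AI Summary: Proposes the CoralBay framework, which learns multi-scale features with a hierarchical 3D Swin backbone and self-distillation to enable self-supervised pre-training on volumetric CT data, improving performance on downstream radiological tasks.

Details
AI Chinese Abstract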

自监督学习已在2D自然图像上实现了大规模预训练,产生了跨任务有效迁移的通用视觉表示。然而,许多医学成像模态(如CT扫描)本质上是三维的,在结构和语义上与自然图像根本不同。体积模态捕捉空间连续性、器官解剖和基于强度的组织特性(如亨氏单位),这些无法通过2D预训练充分建模。为弥补这一差距,我们引入了CoralBay,一种自蒸馏框架,通过使用分层3D Swin骨干网络并将自蒸馏应用于拼接的多尺度特征,扩展了DINO,实现了数据高效的自监督学习,编码了全局语义和细粒度局部结构的丰富空间表示。因此,CoralBay有效迁移到广泛的下游放射学任务,在多样化的解剖目标上展现出强大且一致的性能。此外,我们通过引入一个公开、可复现的3D放射学排行榜,为开源eva框架做出贡献,该排行榜统一了多个数据集,并建立了评估体积表示学习方法的标准化基准。

English Abstract

Self-supervised learning has enabled large-scale pre-training on 2D natural images, producing general-purpose visual representations that transfer effectively across tasks. However, many medical imaging modalities, such as CT scans, are inherently three-dimensional and differ fundamentally from natural images in both structure and semantics. Volumetric modalities capture spatial continuity, organ anatomy, and intensity-based tissue properties (e.g., Hounsfield Units), which are not adequately modeled by 2D pre-training. To bridge this gap, we introduce CoralBay, a self-distillation framework that extends DINO by using a hierarchical 3D Swin backbone and applying self-distillation to concatenated multi-scale features, enabling data-efficient self-supervised learning of rich spatial representations that encode both global semantics and fine-grained local structure. As a result, CoralBay transfers effectively to a wide range of downstream radiological tasks, demonstrating strong and consistent performance across diverse anatomical targets. In addition, we contribute to the open-source eva framework by introducing a public, reproducible 3D radiology leaderboard that unifies multiple datasets and establishes a standardized benchmark for evaluating volumetric representation learning methods.

2606.03885 2026-06-03 cs.LG

Attribution via Distributional Paths for Information Revelation

通过分布路径进行信息揭示的归因

Kieran A. Murphy, Shameen Shrestha

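AI Summary: Proposes Reveal-IG, which lifts path attribution from input space to a space of structured probe distributions, progressively revealing information and attributing changes in the model's expected output, retaining completeness while avoiding input-space path artifacts.

Details
Comments
Code: https://github.com/murphyka/Reveal-IG
AI Chinese Abstract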

特征归因方法通过为输入特征分配重要性分数来解释预测。基于路径的方法(如积分梯度)特别有吸引力,因为它们满足完整性:归因总和等于模型输出在参考状态和输入之间的变化。然而,大多数路径方法在输入空间中定义轨迹,沿着所选路径通过逐点扰动输入来解释模型。输入空间路径积分模型在每个经过点的原始响应,无法控制特征被查询的分辨率;轨迹中靠近基线的早期部分与输入本身对解释的贡献相同。在这里,我们将路径归因从输入空间提升到围绕感兴趣示例的结构化探针分布空间,并将我们的方法称为Reveal-IG。Reveal-IG不是遍历原始输入值,而是逐步揭示关于输入的信息,并归因模型期望输出沿此分布路径的变化。结果是一个路径归因框架,它保留了对期望模型响应的完整性,并自然地适应多尺度图像探针和表格数据中的特征级不确定性。综合诊断表明,Reveal-IG避免了影响输入空间方法的路径伪影,并且在ImageNet分类和表格回归中,它产生稳定的、有符号的归因,在使用归因符号的指标上领先,同时在其余指标上保持竞争力。

English Abstract

Feature attribution methods explain predictions by assigning importance scores to input features. Path-based methods such as Integrated Gradients are especially appealing because they satisfy completeness: attributions sum to the change in model output between a reference state and the input. Yet most path methods define this trajectory in input space, explaining a model through pointwise perturbed inputs along a chosen path. An input-space path integrates the model's raw response at each point it passes through, with no control over the resolution at which a feature is queried; the early, baseline-adjacent part of the trajectory contributes to the explanation on equal footing with the input itself. Here, we lift path attribution from input space to a space of structured probe distributions around the example of interest, and call our method Reveal-IG. Rather than traversing raw input values, Reveal-IG progressively reveals information about the input and attributes changes in the model's expected output along this distributional path. The result is a path-attribution framework that retains completeness with respect to the expected model response, and naturally accommodates multiscale image probes and feature-wise uncertainty in tabular data. Synthetic diagnostics show that Reveal-IG avoids path artifacts that affect input-space methods, and across ImageNet classification and tabular regression it produces stable, signed attributions, leading on metrics that use attribution sign while remaining competitive on the rest.
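
The completeness property mentioned above can be checked numerically for standard, input-space Integrated Gradients. Note that this sketch shows the baseline method the abstract contrasts with, not Reveal-IG itself, whose distributional path is not specified in enough detail here to reproduce. The toy model `f`, its hand-written gradient, and the zero baseline are assumptions chosen so the example stays self-contained.

```python
import numpy as np


def f(x):
    """Toy differentiable model (hypothetical)."""
    return x[0] * x[1] + np.sin(x[2])


def grad_f(x):
    """Analytic gradient of the toy model."""
    return np.array([x[1], x[0], np.cos(x[2])])


def integrated_gradients(x, baseline, steps=1000):
    """Midpoint Riemann sum of the gradient along the straight-line path."""
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = [grad_f(baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(path_grads, axis=0)


x = np.array([2.0, 3.0, 1.0])
baseline = np.zeros(3)
attributions = integrated_gradients(x, baseline)

# Completeness: attributions should sum to f(x) - f(baseline).
gap = attributions.sum() - (f(x) - f(baseline))
```

The residual `gap` is only the numerical integration error, so it shrinks as `steps` grows; every point on the straight-line path, including those near the baseline, enters the average with equal weight, which is the behavior the abstract questions.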

2606.03883 2026-06-03 cs.AI cs.LG

Reasoning Structure of Large Language Models

大型语言模型的推理结构

Frédéric Berdoz, Luca A. Lanzendörfer, Fabian Farestam, Roger Wattenhofer

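AI Summary: To address the fact that standard evaluations of large reasoning models hide differing reasoning structures, proposes a logic-puzzle benchmark and a method for converting unstructured traces into verifiable reasoning graphs, and defines a reasoning efficiency metric to quantitatively analyze reasoning topology.

Details
Comments
Accepted at ICML 2026 and presented at the ICLR 2026 workshop on LLM reasoning
AI Chinese Abstract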

大型推理模型(LRMs)通常使用最终答案准确率或token数量等指标进行评估。然而,这些指标上的相同分数可能隐藏着根本不同的推理结构。为了解决这一局限性,我们引入了一个可扩展的逻辑谜题LRM基准测试,以及一个将非结构化轨迹转化为包含声明和依赖关系的可验证推理图的流程。这将推理转化为一个结构化的、可测量的对象,其拓扑结构可以定量分析。在此基础上,我们定义了一个推理效率指标,用于量化模型逻辑流的集中程度。我们对开源推理模型的分析表明,结构度量能够区分token数量和准确率所混淆的行为,为诊断失败模式和比较推理如何随谜题难度扩展提供了实用工具。

English Abstract

Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model's logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.

2606.03879 2026-06-03 cs.CV cs.AI

Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs

超越编码器累加:衡量多编码器视觉语言模型中编码器的作用

Wei Ding, Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Yu Wang

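AI Summary: By retraining all 31 non-empty subsets, proposes a Capacity-Necessity decomposition and a pre-projector rank analysis, showing that encoder roles in multi-encoder vision-language models are not simply additive, and gives a principle for choosing the best pair.

Details
AI Chinese Abstract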

随着基础模型向融合更多异构视觉流扩展,理解不同编码器在联合训练下的交互成为原则性设计的前提。然而,大型视觉语言模型目前缺乏相应的工具,且参数高效的编码器配置在训练前难以识别。为了重新审视联合训练下的编码器角色,我们在16基准的Cambrian-1套件上,在统一流程下重新训练并评估了五个常见视觉编码器的所有31个非空子集(总计约2万GPU小时),并报告了三个发现。首先,从头重新训练每个子集揭示了与在固定检查点上掩码编码器所得不同的编码器排名,包括哪个编码器整体排名第一。其次,我们将每个编码器的贡献分解为两个维度:容量(编码器自身达到的分数)和必要性(从完整池中移除时的下降)。这两个维度不可互换。配对两个最高容量的编码器是次优的,而将一个高容量锚点与一个自适应补充配对则匹配完整的五编码器模型。在此配对之外添加更多编码器仅带来边际收益。第三,在固定参数数量下,每个编码器的预投影器有效秩解释了残差分数变化。最强的配对结合了一个秩在联合训练中存活的锚点和一个秩在联合训练下扩展的补充,这表明更高秩、更少坍缩的投影器输入对应着编码器-投影器接口处更有利的优化机制。总之,容量-必要性分解和预投影器秩分析,连同通过重新训练进行的全面评估,揭示了多编码器视觉语言模型设计中的方法论差距,并提供了弥补这一差距的具体原语。

English Abstract

As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall. Second, we decompose each encoder's contribution into two axes: Capacity, the score an encoder reaches on its own, and Necessity, the drop in score when it is removed from the full pool. The two axes are not interchangeable. Pairing the two highest-Capacity encoders is suboptimal, while pairing a high-Capacity anchor with an adaptive complement matches the full five-encoder model. Adding further encoders beyond this pair yields only marginal gains. Third, at fixed parameter count, per-encoder pre-projector effective rank explains the residual score variation. The strongest pairs combine an anchor whose rank survives joint training with a complement whose rank expands under it, suggesting that higher-rank, less-collapsed projector inputs correspond to a more favorable optimization regime at the encoder-projector interface. Together, the Capacity-Necessity decomposition and the pre-projector rank analysis, along with comprehensive evaluation through retraining, expose a methodological gap in multi-encoder LVLM design, and offer concrete primitives for closing it.
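
The subset enumeration and the two axes defined above are simple to express in code. The sketch below enumerates the 2^5 - 1 = 31 non-empty subsets of five encoders and computes Capacity and Necessity from a table of per-subset scores. The encoder names and every score are made up: the scoring rule is a toy stand-in for the benchmark averages of retrained models, which the abstract does not list.

```python
from itertools import combinations

ENCODERS = ["A", "B", "C", "D", "E"]  # placeholder encoder names


def non_empty_subsets(items):
    """All non-empty subsets, as frozensets so they can be dictionary keys."""
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def capacity(scores, enc):
    """Score the encoder reaches on its own."""
    return scores[frozenset([enc])]


def necessity(scores, enc, pool):
    """Drop in score when the encoder is removed from the full pool."""
    full = frozenset(pool)
    return scores[full] - scores[full - {enc}]


subsets = non_empty_subsets(ENCODERS)

# Made-up scores: best solo score in the subset plus a small bonus per extra encoder.
solo = {"A": 60.0, "B": 58.0, "C": 50.0, "D": 45.0, "E": 40.0}
scores = {s: max(solo[e] for e in s) + 0.5 * (len(s) - 1) for s in subsets}

cap_a = capacity(scores, "A")
nec_a = necessity(scores, "A", ENCODERS)
nec_e = necessity(scores, "E", ENCODERS)
```

Capacity reads only the singleton entry while Necessity reads only the full pool and its leave-one-out subsets, so the two quantities draw on different retrained models, which is why the abstract treats them as separate axes rather than interchangeable rankings.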