arXivDaily: arXiv daily academic digest, updated Monday to Friday
2606.00450 2026-06-02 cs.CV cs.GR

Optimizing 3D Gaussian Splatting via Point Cloud Upsampling

通过点云上采样优化3D高斯泼溅

Adrian Ramlal, Yan Song Hu, John S. Zelek

Affiliations: Vision and Image Processing Group, Systems Design Engineering, University of Waterloo

AI summary: Proposes several point cloud upsampling methods and a depth-guided point lifting technique to improve the initialization quality of 3D Gaussian Splatting; experiments show that different strategies suit different scenes.

Journal ref
Journal of Computational Vision and Imaging Systems, Vol. 10, No. 1, p. 47, 2024
Comments
Accepted in Journal of Computational Vision and Imaging Systems (JCVIS)
AI Chinese abstract

3D高斯泼溅(3DGS)是一种用于创建和渲染3D场景的技术,但其性能严重依赖于初始种子点的质量。为了改进3DGS初始化,本研究提出并评估了几种点云上采样方法:线性插值、三角插值、基于样条的曲面重建、移动最小二乘曲面拟合和基于Voronoi的点生成。此外,本研究引入了一种深度引导的点提升方法,利用深度图保持与运动恢复结构(SfM)重建的几何一致性。通过在Mip-NeRF360和Replica数据集上的大量实验,所提出的方法在多种场景类型中展示了重建质量的提升。结果表明,不同的上采样策略在不同场景中表现优异:曲面重建方法在处理有机、细节丰富的场景时表现更好,而简单的插值方法更适合以分段平滑几何为主的场景。相比之下,深度引导方法在添加整个场景中的几何感知点方面显示出潜力,尤其是在纹理缺失区域。这些发现为根据场景特征和计算约束选择合适的上采样方法提供了初步实用指南,增进了对点云初始化如何影响3DGS质量的理解。

English abstract

3D Gaussian Splatting (3DGS) is a technique for creating and rendering 3D scenes, but its performance depends heavily on the quality of the initial seed points. To improve 3DGS initialization, this study presents and evaluates several point cloud upsampling approaches: linear interpolation, triangular interpolation, spline-based surface reconstruction, moving least squares surface fitting, and Voronoi-based point generation. Additionally, this research introduces a depth-guided point lifting method that leverages depth maps to maintain geometric consistency with Structure-from-Motion (SfM) reconstructions. Through extensive experiments on the Mip-NeRF360 and Replica datasets, the proposed methods demonstrate improvements in reconstruction quality across diverse scene types. Results indicate that different upsampling strategies excel in different scenarios: surface reconstruction methods perform better with organic, detailed scenes, while simpler interpolation approaches are better suited to scenes dominated by piecewise-smooth geometries. In comparison, the depth-guided approach shows promise for adding geometry-aware points across the entire scene, particularly in texture-less regions. These findings provide preliminary practical guidelines for selecting appropriate upsampling methods based on scene characteristics and computational constraints, and they advance the understanding of how point cloud initialization affects 3DGS quality.
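
The abstract does not say how the linear interpolation variant is implemented. As a rough illustration only, the sketch below assumes the simplest reading: insert the midpoint between each seed point and its nearest neighbour. The function name `upsample_midpoints` and the parameter `k` are hypothetical and not taken from the paper.

```python
import math

def upsample_midpoints(points, k=1):
    """Return the original points plus midpoints to each point's k nearest neighbours.

    Assumption: "linear interpolation" is read here as midpoint insertion.
    """
    new_points = []
    seen = set()
    for i, p in enumerate(points):
        # Rank the other points by Euclidean distance to p (brute force, O(n^2)).
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        for j in neighbours:
            pair = (min(i, j), max(i, j))
            if pair in seen:
                continue  # do not insert the same midpoint twice
            seen.add(pair)
            q = points[j]
            new_points.append(tuple((a + b) / 2 for a, b in zip(p, q)))
    return list(points) + new_points

# Three sparse seed points, e.g. from an SfM reconstruction.
sparse = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
dense = upsample_midpoints(sparse, k=1)
```

A real pipeline would likely use a spatial index instead of the brute-force neighbour search and would also interpolate colour; both are omitted here to keep the idea visible.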

2606.00449 2026-06-02 cs.RO

ROG-Grasp: Root-Oriented Geometry for Robotic Grasping and Placement

ROG-Grasp:面向根部的几何方法用于机器人抓取与放置

Zijian An, Augustus Sroka, Ran Yang, Bill Cai, Satoru Eto, Brian Poon, Kelvin Cai, Shijie Geng, Feng Liu, Yiming Feng, Lifeng Zhou

Affiliations: Department of Electrical and Computer Engineering, Drexel University; Virginia Seafood Agricultural Research and Extension Center, and Department of Biological Systems Engineering, Virginia Tech; Amazon Store Foundation AI (SFAI)

AI summary: Proposes ROG-Grasp, a framework based on root surface geometry that estimates produce orientation from RGB-D perception and combines a YOLO detector with point cloud plane fitting to generate stable grasp poses, achieving high success rates and fast execution in tomato and onion experiments.

Comments
7 pages, 6 figures. Video: https://youtu.be/Ir2UtGODdMo
AI Chinese abstract

朝向感知操作在采后农业加工中至关重要,其中农产品必须以一致的配置被抓取和放置。本文提出ROG-Grasp,一种基于几何的机器人抓取和放置框架,通过RGB-D感知从根部表面几何估计农产品朝向。使用基于YOLO的根部检测器和点云平面拟合来推断根部法线,从而生成稳定的抓取姿态和朝向约束的笛卡尔运动规划。在番茄和洋葱上的实验表明,在孤立和杂乱场景中均具有高成功率和稳定的执行时间。与视觉-语言-动作(VLA)策略相比,所提出的方法实现了更可靠、更准确的抓取完成,且执行速度更快。这些结果突显了几何驱动感知对于实际朝向控制操作任务的有效性。我们的论文视频可在网上获取:https://youtu.be/Ir2UtGODdMo。

English abstract

Orientation-aware manipulation is essential in post-harvest agricultural processing, where produce must be grasped and placed in consistent configurations. This paper presents ROG-Grasp, a geometry-based robotic grasping and placement framework that estimates produce orientation from root surface geometry using RGB-D perception. A YOLO-based root detector and point cloud plane fitting are used to infer the root normal, enabling stable grasp pose generation and orientation-constrained Cartesian motion planning. Experiments on tomatoes and onions demonstrate high success rates and stable execution times in both isolated and cluttered scenarios. Compared with vision-language-action (VLA) policies, the proposed method achieves more reliable and accurate grasp completion with faster execution. These results highlight the effectiveness of geometry-driven perception for practical orientation-controlled manipulation tasks. A video of our paper is available online at https://youtu.be/Ir2UtGODdMo.
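
The plane-fitting step can be pictured with a standard least-squares fit. The sketch below is not the authors' code: it assumes the root region has already been cropped to an (N, 3) point array and uses the common SVD approach, with the hypothetical helper name `fit_plane_normal`.

```python
import numpy as np

def fit_plane_normal(points):
    """Least-squares plane through an (N, 3) point set; returns (centroid, unit normal)."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The right singular vector with the smallest singular value is the
    # direction of least variance, which is the plane normal.
    _, _, vt = np.linalg.svd(pts - centroid)
    normal = vt[-1]
    return centroid, normal / np.linalg.norm(normal)

# Synthetic "root patch": points lying on the plane z = 0.1.
patch = [(0, 0, 0.1), (1, 0, 0.1), (0, 1, 0.1), (1, 1, 0.1), (0.5, 0.5, 0.1)]
centroid, normal = fit_plane_normal(patch)
```

The sign of the normal is ambiguous in this formulation; a practical system would typically flip it to face the camera and add an outlier-robust step such as RANSAC, neither of which is shown.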

2606.00447 2026-06-02 cs.CV cs.AI

GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video

GeoSAM-3D: 用于从单目视频进行开放词汇3D场景分割的测地线提示传播

Arun Sharma

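Affiliations: University of Minnesota, Twin Cities

AI summary: Proposes GeoSAM-3D, which uses frozen vision foundation models and a monocular 3D Gaussian Splatting reconstruction, and propagates user prompts over the scene graph with a differentiable graph-geodesic propagation kernel, enabling open-vocabulary 3D scene segmentation from monocular video.

AI Chinese abstract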

开放词汇的3D场景分割通常假设有RGB-D视频、校准的多视角图像或重建的网格。GeoSAM-3D研究了一种更轻的设置:用户上传一段短的单目视频,在一帧中点击或命名一个物体,并在高斯场景上接收传播的3D掩码。该实现结合了冻结的图像和视频基础模型、单目3D高斯泼溅重建以及在高斯质心上可微分的图-测地线传播核。核心设计选择是通过重建场景图上的热核距离传播提示,而不是通过3D中的欧几里得最近邻。这保持了曲面周围的连续性,并减少了附近但不相连物体之间的泄漏。本文描述了仓库状态、在geosam3d.propagate中实现的数学核、从Segment Anything掩码训练的特征头以及代码库中已有的验证。评估协议将实现验证、图传播质量、泄漏控制和交互延迟分开。

English abstract

Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studies a lighter setting: a user uploads a short monocular video, clicks or names an object in one frame, and receives a propagated 3D mask over a Gaussian scene. The implementation combines frozen image and video foundation models with a monocular 3D Gaussian Splatting reconstruction and a differentiable graph-geodesic propagation kernel over Gaussian centroids. The central design choice is to propagate prompts by heat-kernel distance on the reconstructed scene graph, rather than by Euclidean nearest neighbors in 3D. This preserves continuity around curved surfaces and reduces leakage across nearby but disconnected objects. This paper describes the repository state, the mathematical kernel implemented in geosam3d.propagate, the feature head trained from Segment Anything masks, and the validation already present in the codebase. The evaluation protocol separates implementation validation, graph propagation quality, leakage control, and interactive latency.
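
To see why graph-based propagation limits leakage, consider a toy graph with two disconnected components whose nodes may be close in 3D. The sketch below is an assumption-laden stand-in for the kernel in `geosam3d.propagate`, whose actual form is not given in the abstract: it approximates heat diffusion with explicit Euler steps on an unweighted graph Laplacian.

```python
def heat_diffuse(adjacency, seed, steps=50, dt=0.1):
    """Explicit-Euler heat diffusion u <- u - dt * L u on an unweighted graph.

    adjacency maps node index -> list of neighbour indices; L = D - A.
    """
    n = len(adjacency)
    u = [1.0 if i == seed else 0.0 for i in range(n)]
    for _ in range(steps):
        # (L u)_i = deg(i) * u_i - sum of u over the neighbours of i
        lu = [len(adjacency[i]) * u[i] - sum(u[j] for j in adjacency[i])
              for i in range(n)]
        u = [u[i] - dt * lu[i] for i in range(n)]
    return u

# Object A is the chain 0-1-2; object B is the chain 3-4. Nodes 2 and 3 could
# be spatial neighbours in 3D, but no graph edge connects them.
adjacency = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
heat = heat_diffuse(adjacency, seed=0)
```

The prompt placed on node 0 spreads along object A and never reaches object B, whereas a Euclidean nearest-neighbour rule could hand the label from node 2 to node 3.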

2606.00445 2026-06-02 cs.CV cs.AI cs.LG

DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection

DarkVesselNet: 用于暗船检测的多模态遥感和轨迹推理

Arun Sharma

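Affiliations: University of Minnesota, Twin Cities

AI summary: Proposes DarkVesselNet, which fuses Sentinel-1 SAR, Sentinel-2 optical imagery, geospatial foundation models, AIS trajectory reasoning, TGARD gap detection, and a Pi-DPM anomaly head for multi-modal remote-sensing dark vessel detection.

AI Chinese abstract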

暗船检测需要融合船只通过AIS报告的信息与卫星通过雷达和光学传感器观测到的信息。DarkVesselNet是一个多模态遥感堆栈,结合了Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型骨干、AIS轨迹推理、TGARD风格的间隙检测以及受Pi-DPM启发的异常头。该仓库将系统呈现为经过测试的Python包和公开的Hugging Face Space。本文介绍了传感器堆栈、骨干抽象、融合路径、异常头和当前的验证。目前可用的证据是基于软件的:针对SAR散斑滤波、光学波段比、Haversine距离、TGARD间隙发射、传感器配准、骨干token形状和可微分异常评分的测试。

English abstract

Dark vessel detection requires fusing what vessels report through AIS with what satellites observe through radar and optical sensors. DarkVesselNet is a multi-modal remote sensing stack that combines Sentinel-1 SAR, Sentinel-2 optical imagery, geospatial foundation model backbones, AIS trajectory reasoning, TGARD-style gap detection, and a Pi-DPM-inspired anomaly head. The repository exposes the system as a tested Python package and a public Hugging Face Space. The paper presents the sensor stack, backbone abstraction, fusion path, anomaly head, and current validation. The evidence currently available is software-grounded: tests for SAR speckle filtering, optical band ratios, Haversine distance, TGARD gap emission, sensor coregistration, backbone token shapes, and differentiable anomaly scoring.
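
Two of the tested components, Haversine distance and AIS gap emission, are easy to illustrate. The sketch below is a simplification under stated assumptions: `find_gaps` is a hypothetical threshold rule (silence longer than `max_silence_s`), not the TGARD procedure, and the track format `(time_s, lat, lon)` is invented for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def find_gaps(track, max_silence_s=3600):
    """Return (start_s, end_s, km) for consecutive fixes separated by more than max_silence_s."""
    gaps = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(track, track[1:]):
        if t1 - t0 > max_silence_s:
            gaps.append((t0, t1, haversine_km(lat0, lon0, lat1, lon1)))
    return gaps

# A vessel on the equator that goes silent for 8400 s and reappears 1 degree east.
track = [(0, 0.0, 0.0), (600, 0.0, 0.05), (9000, 0.0, 1.05)]
gaps = find_gaps(track)
```

Each emitted gap gives a time window and a displacement, which is the kind of cue a downstream step could match against SAR or optical detections.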

2606.00444 2026-06-02 cs.CV cs.GR

Real-Time Physics Simulation with Dynamic Mesh-Gaussian Reconstructions

基于动态网格-高斯重建的实时物理仿真

Adrian Ramlal, John S. Zelek

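Affiliations: University of Waterloo

AI summary: To address the topological incompatibility between dynamic reconstruction and physics simulation, proposes a dual-representation framework of fixed-topology meshes and Gaussian splatting that enables real-time physics simulation, and shows that high-quality reconstruction and physics-compatible topology are fundamentally at odds.

Journal ref
Journal of Computational Vision and Imaging Systems, Vol. 11, No. 1, 2025
AI Chinese abstract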

将动态3D重建集成到物理仿真中需要固定的网格拓扑以实现高效的碰撞检测,但像DG-Mesh这样的先进方法会生成针对几何质量优化的可变拓扑。我们研究了拓扑转换是否能在保持重建保真度的同时实现物理集成。我们提出了一种双表示框架,将用于物理的固定拓扑网格与用于渲染的高斯泼溅相结合,通过运行时顶点缓冲区更新实现了比可变拓扑基线快4.65倍的加速。我们在DG-Mesh数据集上评估了两种转换策略(时间对应跟踪和基于模板的投影)与原生固定拓扑方法(MaGS)的性能。我们的评估表明,两种转换方法都会导致65-80%的几何退化,尽管DG-Mesh具有优越的初始质量,但产生的结果不如MaGS。这表明高质量重建和物理兼容拓扑代表了根本不同的目标,无法通过后处理来调和。我们的发现为未来物理感知重建方法的发展提供了信息,并且我们的框架能够与任何固定拓扑方法实现实时仿真。

English abstract

Integrating dynamic 3D reconstructions into physics simulation requires fixed mesh topology for efficient collision detection, but state-of-the-art methods like DG-Mesh produce varying topology optimized for geometric quality. We investigate whether topology conversion can enable physics integration while preserving reconstruction fidelity. We propose a dual-representation framework combining fixed-topology meshes for physics with Gaussian splatting for rendering, achieving a 4.65$\times$ speedup over varying-topology baselines through runtime vertex buffer updates. We evaluate two conversion strategies, temporal correspondence tracking and template-based projection, against a native fixed-topology method (MaGS) on the DG-Mesh dataset. Our evaluation reveals that both conversion approaches incur 65-80% geometric degradation, producing results inferior to MaGS despite DG-Mesh's superior initial quality. This demonstrates that high-quality reconstruction and physics-compatible topology represent fundamentally distinct objectives that cannot be reconciled through post-processing. Our findings inform future development of physics-aware reconstruction methods, and our framework enables real-time simulation with any fixed-topology approach.

2606.00442 2026-06-02 cs.LG math.OC stat.ML

Exploiting weight-space symmetries for approximating curvature

利用权重空间对称性近似曲率

Artem Artemev, Rui Xia, Benjamin M. Boyd, Youjing Yu, Felix Dangel, Guillaume Hennequin, Alberto Bernacchia

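Affiliations: DeepMind, London, UK

AI summary: Builds structured Hessian approximations from single gradients by analytically averaging over group actions that leave the loss invariant, thereby exploiting weight-space symmetries to approximate the curvature of the loss function.

Comments
Published at ICML 2026. 35 pages, 11 figures. Code: https://github.com/mtkresearch/symm_opt
AI Chinese abstract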

许多机器学习技术依赖于近似损失函数的曲率,但在现代深度网络的规模下,这通常很难做到。令人惊讶的是,之前没有工作利用损失景观中众所周知的权重空间对称性所产生的曲率约束。通过解析平均化保持损失不变的群作用,我们从单个梯度构建了结构化的Hessian近似,这些近似可以易于估计、存储和求逆。用户指定的对称群直接控制近似精度与计算成本之间的权衡。此外,我们的框架为审视现有方法提供了统一的理论视角;特别地,特定的对称群选择可以恢复Shampoo/Muon类的曲率估计。我们在多种网络架构上验证了我们的方法,并将其应用于二阶优化基准测试,包括一个小型语言模型。我们的曲率估计框架可能在机器学习其他问题中找到应用,如不确定性估计、持续学习、压缩/剪枝、训练数据归因等。

English abstract

Many machine learning techniques rely on approximating a loss function's curvature, but this is notoriously hard to do at the scale of modern deep networks. Surprisingly, no previous work has exploited the curvature constraints that arise from well-known weight-space symmetries in loss landscapes. By analytically averaging over group actions that leave the loss invariant, we construct structured Hessian approximations from single gradients that can be tractably estimated, stored, and inverted. The choice of user-specified symmetry group directly governs the trade-off between approximation accuracy and computational cost. Moreover, our framework provides a unifying theoretical lens for viewing existing methods; in particular, a specific choice of symmetry group recovers Shampoo/Muon-like curvature estimates. We validate our method on a range of network architectures and apply it to second-order optimization benchmarks, including a small language model. Our curvature estimation framework might find applications in other machine learning problems such as uncertainty estimation, continual learning, compression/pruning, training data attribution, and more.

2606.00440 2026-06-02 cs.AI

SDR: Set-Distance Rewards for Radiology Report Generation

SDR:用于放射学报告生成的集合距离奖励

Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert

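Affiliations: Stanford University; Stanford University School of Medicine; Ghent University

AI summary: To address the incompatibility of standard rewards with chest X-ray report generation, proposes continuous, permutation-invariant rewards based on set distances, which significantly improve performance through GRPO post-training and test-time scaling.

AI Chinese abstract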

具有可验证奖励的强化学习已迅速推进了视觉-语言模型中的推理能力。然而,对于胸部X光报告生成,标准奖励(即精确匹配准确率和逐步过程)并不兼容,因为报告由无序且正交的发现组成,而非因果推理链。我们通过基于集合的视角来解决这一差距:每个报告被分割成句子,并由冻结的句子变换器嵌入,生成无序的嵌入集合。我们提出使用生成嵌入与参考嵌入之间的集合到集合距离作为连续的、置换不变的奖励。在两个数据集和三个视觉-语言模型(Qwen3-VL-2B/4B, Gemma3-4B)上,通过GRPO使用基于集合到集合距离的奖励进行后训练,在所有主要指标(BERTScore、RadGraph F1和CheXbert F1,分别相对提升平均6.80%、7.82%和4.45%)上一致优于监督微调和精确匹配GRPO。相同的集合距离还实现了测试时的最佳N选:通过候选与训练报告嵌入的距离进行评分,在我们训练的模型以及三个闭源LLM(Mistral-Small、Gemini-2.5 Flash-Lite、GPT-4o-mini)上,平均相对提升BERTScore 16.4%,优于随机选择。作为流式信号使用时,它们支持更高效的测试时缩放形式:在生成过程中修剪低分候选,可将生成的令牌减少50%以上,同时保持完整最佳N选的结果质量。这些结果共同确立了集合距离奖励作为胸部X光报告生成中后训练和测试时缩放的统一信号。我们的代码已公开。

English abstract

Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision-language models. However, for chest X-ray report generation, the standard rewards (i.e., exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision-language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1, and CheXbert F1, with average relative improvements of 6.80%, 7.82%, and 4.45%, respectively). The same set distances also enable test-time best-of-$N$ selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini), with an average relative improvement of 16.4% on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50% while preserving the Findings quality of full best-of-$N$ selection. Together these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly available at https://anonymous.4open.science/r/Set-Distance-Rewards-CXR-BFDA.
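
The abstract does not name the specific set-to-set distance. As one plausible instance, the sketch below uses a symmetric Chamfer distance over cosine distances; this choice, the toy 2-D "embeddings", and the function names are all assumptions made for illustration.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def chamfer_distance(set_a, set_b):
    """Symmetric average nearest-neighbour distance between two embedding sets."""
    a_to_b = sum(min(cosine_distance(a, b) for b in set_b) for a in set_a) / len(set_a)
    b_to_a = sum(min(cosine_distance(a, b) for a in set_a) for b in set_b) / len(set_b)
    return 0.5 * (a_to_b + b_to_a)

def set_distance_reward(generated, reference):
    # Smaller distance means higher reward; the negation is an assumed convention.
    return -chamfer_distance(generated, reference)

# Toy sentence embeddings: two findings, the same two in another order, and a
# report that omits one finding.
reference = [(1.0, 0.0), (0.0, 1.0)]
shuffled = [(0.0, 1.0), (1.0, 0.0)]
partial = [(1.0, 0.0)]
```

Reordering the sentences leaves the reward unchanged, while dropping a finding lowers it gradually rather than to zero, which is the behaviour an exact-match reward cannot provide.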

2606.00439 2026-06-02 cs.CV

Physical Object Understanding with a Physically Controllable World Model

基于物理可控世界模型的物理对象理解

Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Wanhee Lee, Gia Ancone, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Honglin Chen, Daniel Bear, Stefan Stojanov, Daniel LK Yamins

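Affiliations: Stanford University; OpenAI; Noetik Inc.; Google

AI summary: Proposes a class of probabilistic world models, trained efficiently with autoregressive sequence modeling, that infer objects and their physical interactions from video, enabling object discovery, 3D manipulation, and computation of physical relationships.

Comments
CVPR 2026 Highlight. Project page at: https://neuroailab.github.io/psi-website/blog.html
AI Chinese abstract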

视觉智能的一个核心挑战是从原始视频中学习场景的物理结构:区域如何形成对象以及支配它们交互的规律。解决这些任务需要能够从部分观测中推断世界分布状态的世界模型——当前架构无法提供这种能力。我们引入了一类新的概率世界模型,支持估计任何视觉变量(如外观和动态)在给定其他变量条件下的概率。在这里,我们发现这些模型可以通过自回归序列建模高效训练,从而产生能够涌现丰富对象理解的世界模型。首先,我们展示了我们的模型通过顺序推理生成多个合理的未来世界状态,捕捉了支配对象如何运动的物理规律。然后,通过分析这些未来状态中的运动相关性,我们提取出对象及其关节子部分。在发现这些对象后,我们展示了我们的世界模型可以在3D中操控它们。最后,我们演示了如何从世界模型计算对象之间的物理关系,从而实现了诸如视觉叠叠乐等应用。

English abstract

A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations, a capability that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. We find that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract objects and articulated object subparts. Having discovered these objects, we show that our world model can manipulate them in 3D. Finally, we demonstrate how physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.

2606.00437 2026-06-02 cs.LG

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

EST-PRM:在过程奖励模型成为关键依赖之前对其进行压力测试

Ibne Farabi Shihab, Fariya Afrin, Sanjeda Akter, Anuj Sharma

Affiliations: Department of Computer Science, Iowa State University; Department of Computer Science, Kalinga Institute of Industrial Technology; Department of Civil, Construction & Environmental Engineering, Iowa State University

AI summary: Proposes the EST-PRM framework, which stress-tests process reward models with three transformations (step inflation, dependency-aware reordering, and confidence markers), and finds marked differences across models in reward inflation and loss of correctness sensitivity.

AI Chinese abstract

过程奖励模型(PRM)在具有密集步骤级监督的语言模型训练中被广泛使用。它们假设在标签保持变换下,PRM分数是步骤正确性的稳定代理。这些变换改变推理结构但保留最终答案。我们认为这一假设未得到充分验证。此类变换可能改变PRM分数与正确性信号之间的关系,导致不同模型出现不同的故障模式。为弥补这一空白,我们引入了EST-PRM,一个用于密集过程奖励的压力测试框架。它应用三种变换:(1)步骤膨胀,(2)依赖感知步骤重排序,以及(3)置信度标记。定义了一个脆弱性分解,将奖励膨胀与正确性敏感性损失分开。在来自MATH-500、GSM8K和PRMBench的4,687条推理链上评估了五种PRM风格模型。结果表明不同模型的脆弱性模式存在明显差异。Math-Shepherd对位置扰动表现出最强的敏感性,Pearson相关系数下降$0.152 \pm 0.038$,分数膨胀率为$32.8 \pm 4.9\%$。Qwen2.5-Math-PRM受步骤膨胀影响最大,膨胀率达到$47.6 \pm 4.3\%$。基于置信度的扰动也会扭曲奖励校准,揭示正确性估计中的不一致性。评估了三种缓解策略,突出了鲁棒性覆盖率和假阳性率之间的权衡。

English abstract

Process reward models (PRMs) are widely used in language-model training with dense step-level supervision. They assume PRM scores are stable proxies for step correctness under label-preserving transformations. These transformations change reasoning structure but preserve final answers. We argue this assumption is not well validated. Such transformations can change how PRM scores relate to correctness signals, leading to different failure modes across models. To address this gap, we introduce EST-PRM, a stress-testing framework for dense process rewards. It applies three transformations: (1) step inflation, (2) dependency-aware step reordering, and (3) confidence markers. A vulnerability decomposition is defined that separates reward inflation from loss of correctness sensitivity. Five PRM-style models are evaluated on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench. The results indicate clear differences in vulnerability patterns across models. Math-Shepherd shows the strongest sensitivity to position perturbations, with a Pearson correlation drop of $0.152 \pm 0.038$ and a $32.8 \pm 4.9\%$ score inflation rate. Qwen2.5-Math-PRM is most affected by step inflation, reaching a $47.6 \pm 4.3\%$ inflation rate. Confidence-based perturbations also distort reward calibration, revealing inconsistencies in correctness estimation. Three mitigation strategies are evaluated, highlighting trade-offs between robustness coverage and false-positive rates.

2606.00435 2026-06-02 cs.CV cs.AI

Detect Before You Leap: Mirage Detection in Vision-Language Models

在跳跃前检测:视觉语言模型中的幻象检测

Sayeed Shafayet Chowdhury, Md. Shaown Miah

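Affiliations: Indiana University Indianapolis; Bangladesh University of Engineering and Technology

AI summary: To address mirages, where vision-language models give confident but ungrounded answers when visual evidence is lacking, proposes Text-Conditioned Layer-wise Internal Alignment, which analyzes the alignment trajectory between patch tokens at each vision-encoder layer and the question embedding and combines it with pixel statistics, zero-shot domain routing, and structured self-assessment to achieve high-accuracy pre-response mirage detection.

AI Chinese abstract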

视觉语言模型(VLM)即使在所需视觉证据缺失、空白或与问题无关时,也能产生自信的视觉答案。这种失败模式被称为幻象(Asadi et al. 2026),在医学和文档视觉问答中尤其令人担忧,因为看似合理但缺乏视觉依据的响应可能被误认为是基于图像的证据。我们研究预发布幻象检测:给定图像-问题对,目标是在VLM生成响应之前确定其应回答还是弃权。我们提出文本条件层内对齐(TC-LIA),一种模型无关的方法,探测CLIP ViT-H/14视觉编码器各层的补丁令牌表示。TC-LIA将逐层图像补丁令牌投影到最终CLIP嵌入空间,并测量它们与问题嵌入的相似度,从而跟踪问题相关视觉证据是否在视觉层中出现。得到的对齐轨迹通过最终图像-文本余弦相似度、后期层top-k补丁-文本对齐、早期到后期增益和逐层斜率进行总结。这些特征与像素统计空白/噪声检测、零样本域路由和结构化VLM自评估相结合,形成一个集成系统。在五个VQA领域、三种输入条件和十二个VLM骨干网络上,最佳系统实现了约94.6-94.7%的三类检测准确率,幻象率低于3%,而基线幻象率范围为21.7%至66.6%。

English abstract

Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, known as mirage (Asadi et al. 2026), is especially concerning in medical and document visual question answering, where plausible but visually ungrounded responses may be mistaken for image-based evidence. We study pre-release mirage detection: given an image-question pair, the goal is to determine whether a VLM should answer or abstain before producing a response. We propose Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a model-agnostic method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. TC-LIA projects layer-wise image patch tokens into the final CLIP embedding space and measures their similarity to the question embedding, allowing the method to track whether question-relevant visual evidence emerges across vision layers. The resulting alignment trajectory is summarized using final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic blank/noise detection, zero-shot domain routing, and structured VLM self-assessment in an ensemble. Across five VQA domains, three input conditions, and twelve VLM backbones, the best systems achieve approximately 94.6-94.7% three-class detection accuracy with mirage rates below 3%, while baseline mirage rates range from 21.7% to 66.6%.

2606.00432 2026-06-02 cs.LG

Grounded Decoding: Retrieval-Anchored Probability Fusion for Faithful RAG

Grounded Decoding: 面向忠实RAG的检索锚定概率融合

Ibne Farabi Shihab, Fariya Afrin, Sanjeda Akter, Anuj Sharma

Affiliations: Department of Computer Science, Iowa State University; Department of Computer Science, Kalinga Institute of Industrial Technology; Department of Civil, Construction & Environmental Engineering, Iowa State University

AI summary: Proposes Grounded Decoding, a training-free inference-time decoding framework that fuses the full-RAG and retrieval-only distributions through a KL-barycenter objective and introduces conflict-aware adaptive weighting to improve the factual consistency of RAG.

AI Chinese abstract

随着检索增强生成(RAG)系统的扩展,确保忠实基于外部证据变得越来越具有挑战性。当冲突出现时,大型语言模型仍可能优先考虑参数化知识而非检索信息。我们提出了一种新颖的无训练解码框架Grounded Decoding,旨在不修改模型参数的情况下提高RAG的事实一致性。与依赖单一条件分布的标准方法不同,我们的方法在每个生成步骤构建两个匹配提示分布:(1)以查询、检索文档和生成前缀为条件的完整RAG分布,以及(2)仅以检索证据和相同前缀为条件的仅检索分布。最终的下一词分布被推导为概率单纯形上KL-重心目标的唯一解,产生两个分布的归一化几何融合。当接地权重为零时,该公式自然恢复标准RAG,并随着接地强度增加平滑地将概率质量移向检索证据。我们进一步引入了一种冲突感知自适应加权方案,该方案基于分布分歧和检索器置信度动态调整接地。在ALCE、Natural Questions和FActScore上的实验表明,与标准RAG和有竞争力的解码时基线相比,在事实准确性和引用质量上取得了一致改进,同时保持了流畅性。我们的结果表明,概率级融合为忠实RAG解码提供了一种强大且高效的替代对数级干预方法。

English abstract

As retrieval-augmented generation (RAG) systems scale, it becomes increasingly challenging to ensure faithful grounding in external evidence. Large language models may still prioritize parametric knowledge over retrieved information when conflicts arise. We propose a novel training-free decoding framework, Grounded Decoding, designed to improve factual consistency in RAG without modifying model parameters. Unlike standard approaches that rely on a single conditional distribution, our method constructs two matched-prompt distributions at every generation step: (1) a full RAG distribution conditioned on the query, retrieved documents, and generated prefix, and (2) a retrieval-only distribution conditioned solely on retrieved evidence and the same prefix. The final next-token distribution is derived as the unique solution to a KL-barycenter objective over the probability simplex, yielding a normalized geometric fusion of the two distributions. This formulation naturally recovers standard RAG when the grounding weight is zero and smoothly shifts probability mass toward retrieved evidence as grounding strength increases. We further introduce a conflict-aware adaptive weighting scheme that dynamically adjusts grounding based on distributional disagreement and retriever confidence. Experiments on ALCE, Natural Questions, and FActScore demonstrate consistent improvements in factual accuracy and citation quality over standard RAG and competitive decoding-time baselines, while maintaining fluency. Our results indicate that probability-level fusion provides a strong and efficient alternative to logit-level intervention methods for faithful RAG decoding.
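
A normalized geometric fusion of two next-token distributions can be sketched in a few lines. The parameterization below, q proportional to p_rag^(1-lam) * p_ret^lam, is an assumption chosen so that lam = 0 recovers the full RAG distribution as the abstract describes; the paper's exact weighting, and its adaptive rule for choosing the weight, may differ. The function name `fuse` and the three-token vocabulary are hypothetical.

```python
import math

def fuse(p_rag, p_ret, lam):
    """Normalized geometric mean q ~ p_rag^(1 - lam) * p_ret^lam over a shared vocabulary."""
    eps = 1e-12  # floor to avoid log(0); an implementation choice for this sketch
    logits = [(1 - lam) * math.log(max(a, eps)) + lam * math.log(max(b, eps))
              for a, b in zip(p_rag, p_ret)]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in logits]
    z = sum(weights)
    return [w / z for w in weights]

p_rag = [0.7, 0.2, 0.1]  # full RAG distribution, favours token 0
p_ret = [0.1, 0.8, 0.1]  # retrieval-only distribution, favours token 1
q0 = fuse(p_rag, p_ret, 0.0)
q_half = fuse(p_rag, p_ret, 0.5)
```

With lam = 0.5 the fused distribution moves its mode from token 0 to token 1, which shows how a larger grounding weight shifts probability mass toward the retrieved evidence.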

2606.00428 2026-06-02 cs.LG cs.AI cs.CL

Finer Parameter Steps for Low-Rank PEFT: A Controlled Study with CP Tensor Adapters

低秩PEFT的更细参数步长:基于CP张量适配器的控制研究

Xinjue Wang, Xiuheng Wang, Yejun Zhang, Sergiy A. Vorobyov, Esa Ollila, Zhi-Yong Wang

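Affiliations: University of Science and Technology of China

AI summary: Uses fixed-component canonical polyadic (CP) tensor adapters to obtain finer parameter steps and studies their effect on the accuracy-budget trade-off of low-rank adapters, finding that CP adapters fill the gaps between LoRA ranks but that the effect is task-dependent.

Comments
Accepted at the ICML 2026 Workshop on CoLoRAI
AI Chinese abstract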

低秩适配器通常通过扫描少量秩进行比较,但秩也固定了参数预算的分辨率。对于一个$2048{\times}2048$的OPT注意力投影,增加LoRA的一个秩会存储$4096$个可训练标量,导致可行的低预算适配器大小之间存在较大间隙。本文探讨具有更细容量增量的张量化适配器是否会改变观察到的精度-预算权衡。我们通过固定组件的规范多路分解(CP)张量适配器来实例化这个问题。在$32{\times}64{\times}32{\times}64$的张量化下,一个归一化的CP组件每个投影存储$193$个可训练标量,比LoRA的一个秩步长小约21倍。我们在OPT-1.3B上,在匹配的目标模块、训练协议、数据上限和种子调度下,比较了CP适配器和LoRA在SST-2、RTE和BoolQ上的表现。CP训练稳定,并填补了LoRA秩之间的空白,但效果依赖于任务:SST-2早期达到低预算平台,BoolQ在略低于LoRA饱和之前受益于额外的CP组件,而RTE仍然偏好LoRA。因此,更细的参数步长有助于诊断PEFT预算敏感性,但它们本身并不能保证更好的精度-预算曲线。

English abstract

Low-rank adapters are usually compared by sweeping a small set of ranks, but the rank also fixes the resolution of the parameter budget. For a $2048{\times}2048$ OPT attention projection, increasing LoRA by one rank stores $4096$ trainable scalars, leaving large gaps between feasible low-budget adapter sizes. This paper asks whether a tensorized adapter with finer capacity increments changes the observed accuracy-budget trade-off. We instantiate this question with fixed-component canonical polyadic (CP) tensor adapters. Under a $32{\times}64{\times}32{\times}64$ tensorization, one normalized CP component stores $193$ trainable scalars per projection, about $21$ times smaller than one LoRA rank step. We compare CP adapters and LoRA on OPT-1.3B across SST-2, RTE, and BoolQ under matched target modules, training protocol, data caps, and seed schedules. CP trains stably and fills the gaps between LoRA ranks, but the effect is task-dependent: SST-2 reaches an early low-budget plateau, BoolQ benefits from additional CP components before saturating slightly below LoRA, and RTE remains LoRA-favored. Finer parameter steps are therefore useful for diagnosing PEFT budget sensitivity, but they do not by themselves guarantee a better accuracy-budget curve.
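
The parameter counts quoted above can be checked with simple arithmetic. In the sketch below, the LoRA count follows directly from the $2048{\times}2048$ shape; the CP count assumes that the 193 scalars are the four factor vectors (32 + 64 + 32 + 64 = 192) plus one scalar weight, which is an inference from "normalized" and is not stated in the abstract. Function names are hypothetical.

```python
def lora_params_per_rank(d_out, d_in):
    # One extra rank adds one column to B (d_out values) and one row to A (d_in values).
    return d_out + d_in

def cp_params_per_component(mode_sizes, scale=True):
    # One CP component is one factor vector per tensor mode, plus (assumed here)
    # a single scalar weight for the normalized component.
    return sum(mode_sizes) + (1 if scale else 0)

lora_step = lora_params_per_rank(2048, 2048)         # 4096
cp_step = cp_params_per_component((32, 64, 32, 64))  # 193
ratio = lora_step / cp_step                          # roughly 21.2
```

The tensorization is shape-consistent because 32 * 64 * 32 * 64 equals 2048 * 2048, so the CP adapter addresses the same weight entries in increments about 21 times finer than a LoRA rank.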

2606.00427 2026-06-02 cs.LG

Topology-Aware State Abstraction with Tangle Cores for Markov Decision Processes

基于纠缠核的马尔可夫决策过程拓扑感知状态抽象

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

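Affiliations: Department of Computer Science, Iowa State University; Department of Civil, Construction & Environmental Engineering, Iowa State University

AI summary: Proposes the tangle-core abstraction framework, which builds overlapping state abstractions from graph tangles of empirical transition graphs, guarantees value preservation under an action-consistency condition, and is shown experimentally to outperform existing methods in bottlenecked domains.

AI Chinese abstract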

强化学习中的状态抽象通常被形式化为基于奖励和转移相似性的状态划分。这排除了导航、图和分层决策问题中的常见结构模式:接口状态(如门、枢纽和瓶颈)自然参与多个区域。我们引入了纠缠核抽象,一种基于经验转移图的图纠缠的重叠状态抽象框架。该方法从一致定向的低阶分离中构建抽象状态,并通过隶属核而非硬划分来表示共享接口。我们在显式动作一致性条件下给出了诱导的重叠抽象MDP的价值保持保证,识别了内部同质性/边界泄漏误差分解,并证明了一个定量接口重叠结果,表明硬划分何时会引入可避免的边界误差。实验上,在瓶颈表格领域、程序生成迷宫和MiniGrid表示中,纠缠核抽象在压缩-回报权衡上优于奖励感知、学习、拓扑映射和图划分基线。我们还识别了一个清晰的失败机制,即转移拓扑无信息时,纠缠可预测地几乎没有益处。这些结果将图纠缠定位为具有共享接口结构的决策问题的有效拓扑感知抽象先验。

English abstract

State abstraction in reinforcement learning is usually formulated as a partition of states based on reward and transition similarity. This excludes a common structural pattern in navigation, graph, and hierarchical decision problems: interface states such as doors, hubs, and bottlenecks naturally participate in more than one region. We introduce tangle-core abstraction, an overlapping state-abstraction framework based on graph tangles of empirical transition graphs. The method constructs abstract states from consistently oriented low-order separations and represents shared interfaces through a membership kernel rather than a hard partition. We give value-preservation guarantees for the induced overlapping abstract MDP under an explicit action-consistency condition, identify an interior-homogeneity/boundary-leakage error decomposition, and prove a quantitative interface-overlap result showing when hard partitions incur an avoidable boundary error. Empirically, tangle-core abstractions achieve favorable compression-return trade-offs against reward-aware, learned, topological-map, and graph-partitioning baselines across bottlenecked tabular domains, procedurally generated mazes, and MiniGrid representations. We also identify a clear failure regime in which transition topology is uninformative, where tangles predictably offer little benefit. These results position graph tangles as an effective topology-aware abstraction prior for decision problems with shared interface structure.

2606.00426 2026-06-02 cs.LG

Canonicalized Stable-List Replay for Private Federated Continual Learning over Language-Model Embeddings

规范化稳定列表回放:面向语言模型嵌入的私有联邦持续学习

Ibne Farabi Shihab, Abu Sa-Adat Mohamed Moon-Im Al Ahsan, Anuj Sharma

Affiliations: Department of Computer Science, Iowa State University; Department of Computer Science & Engineering, BRAC University; Department of Civil, Construction & Environmental Engineering, Iowa State University

AI summary: To address unordered replay lists under differential privacy in federated continual learning, proposes Canonicalized Stable-List Replay (CSLR), which aligns client distributions using signatures of public anchor sentences and improves performance on multiple benchmarks.

AI Chinese abstract

联邦持续学习(FCL)允许分布式客户端在不共享原始文本的情况下,将语言模型头部适应不断演变的NLP任务。在用户级差分隐私(DP)下,基于回放的持续学习面临一个结构性障碍:客户端只能发布候选回放摘要的小型噪声列表,且这些列表在客户端之间是无序的。我们引入了规范化稳定列表回放(CSLR),其中客户端在共享的句子嵌入空间上私有地生成候选回放分布,服务器使用公共锚句诱导的签名对齐它们。锚点提供聚合的可识别性,而不是额外的回放数据。我们证明,在可观测的锚签名间隔下,$O(\log(N/\eta)/p)$个锚点以至少$1-\eta$的概率区分$N$个候选列表元素,并给出了无序标签预言机模型的范围性无锚不可识别性结果。在持续分类、NER和对话基准的五个随机种子上,CSLR在报告的回放发布预算下,在$\varepsilon=4$时,最终平均任务指标比最强的非CSLR DP基线提高了3.9-5.6个点,同时也优于匈牙利匹配和最优传输匹配。形式化隐私保证涵盖回放发布;端到端私有训练还需要与用于任务头更新的私有优化器组合。

English abstract

Federated continual learning (FCL) lets distributed clients adapt language-model heads to evolving NLP tasks without sharing raw text. Under user-level differential privacy (DP), replay-based continual learning faces a structural obstacle: clients can release only small noisy lists of candidate replay summaries, and those lists are unordered across clients. We introduce Canonicalized Stable-List Replay (CSLR), where clients privately produce candidate replay distributions over a shared sentence-embedding space and the server aligns them using signatures induced by public anchor sentences. The anchors provide identifiability for aggregation rather than additional replay data. We prove that, under an observable anchor-signature margin, $O(\log(N/\eta)/p)$ anchors distinguish $N$ candidate list elements with probability at least $1-\eta$, and we give a scoped anchorless non-identifiability result for unordered-label oracle models. Across five seeds on continual classification, NER, and dialogue benchmarks, CSLR improves the final average task metric by 3.9-5.6 points over the strongest non-CSLR DP baseline at $\varepsilon=4$ under the reported replay-release budget, while also outperforming Hungarian and optimal-transport matchers. The formal privacy guarantee covers replay release; end-to-end private training additionally requires composition with a private optimizer for task-head updates.

2606.00424 2026-06-02 cs.AI

Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight

弱批评者造就强学习者:用于可扩展监督的在策略批评蒸馏

Can Jin, Jiakang Li, Rui Wu, Eddy Zhang, Dimitris N. Metaxas

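Affiliations: University of Cambridge; University of California, Berkeley; UC Berkeley AI Lab

AI summary: Proposes on-policy critique distillation (OPCD), which uses a weak model as a critic to provide revision directions and distills critique-guided behavior through adaptive self-teacher signals, improving strong models on reasoning and alignment benchmarks.

AI Chinese abstract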

随着大型语言模型变得更强,弱监督者可能无法为复杂输出提供可靠的标签、偏好或最终判断,限制了弱到强的泛化和可扩展监督。我们研究了一种更易处理的弱监督形式:使用弱模型作为批评者,而不是作为标注者或评判者。弱批评者不需要解决任务或选择正确答案,只需提供非误导性的修订方向,帮助强模型更好地利用自身知识。我们将这种设置称为*弱批评者强监督*。我们首先表明,弱批评可以在推理时改进冻结的强模型,并且批评质量是这种改进的关键。然后,我们提出渐进式在策略批评蒸馏(**OPCD**),它过滤高质量的批评,并通过自适应自教师信号将批评引导的行为蒸馏到强模型中。在推理和对齐基准上的实验表明,我们的方法在训练轮次中改进了强模型,为使用弱监督的可扩展监督提供了一条有效路径。

英文摘要

As large language models become stronger, weak supervisors may fail to provide reliable labels, preferences, or final judgments for complex outputs, limiting both weak-to-strong generalization and scalable oversight. We study a more tractable form of weak supervision: using a weak model as a critic rather than as a labeler or judge. Instead of solving the task or selecting the correct answer, the weak critic only needs to provide a non-misleading revision direction that helps the strong model better use its own knowledge. We call this setting *weak-critic strong oversight*. We first show that weak critiques can improve frozen strong models at inference time, and that critique quality is key to this improvement. We then propose progressive on-policy critique distillation (**OPCD**), which filters high-quality critiques and distills critic-guided behavior into the strong model through adaptive self-teacher signals. Experiments on reasoning and alignment benchmarks show that our method improves strong models over training epochs, suggesting an effective path for scalable oversight with weak supervision.

2606.00418 2026-06-02 cs.RO cs.HC

Literary Emotions in Motion: A Soft Robotics Installation for Tactile Storytelling

文学情感在运动中:用于触觉叙事的软体机器人装置

Carolina Silva-Plata, Abraham Villavicencio-Carmona, Miguel Silva Plata, Stefan Escaida, Ruben Fernandez

发表机构 * Department of Mechanical Engineering, University of Chile(智利大学机械工程系) Independent Researcher(独立研究员) Bolivian Catholic University(玻利维亚天主大学) Institute of Engineering Sciences, University of O’Higgins(奥希金斯大学工程科学研究所)

AI总结 提出一种将叙事文本语义情感分析映射到软体气动模块可变刚度的交互装置,通过用户研究评估刚度与LED强度多感官耦合对情感感知的影响。

详情
Journal ref
IEEE Robotics and Automation Magazine, 2026
Comments
8 pages, 8 figures
AI中文摘要

软体机器人越来越多地在艺术语境中被探索,其中触觉交互为观众提供了超越视觉或听觉信号的具身参与。本作品展示了一个交互装置,将叙事文本的语义情感分析映射到软体气动模块的可变刚度。一个自然语言模型从预定义的六种情感中识别出两种主导情感,驱动七个六边形排列的软体执行器充气。中心执行器代表主要情感,而周围的执行器表达次要情感。我们开发并机械表征了称为软模块的硅胶执行器,其具有薄膜层,展示了这种形态控制如何扩展可实现的刚度范围,同时保持简单性和低成本制造。一项包含十名参与者的用户研究进一步评估了刚度和LED强度的多感官耦合如何影响情感感知。结果表明,伴随颜色变化的刚度调制可以支持软体机器人装置中具有情感意义和吸引力的触觉交互。

英文摘要

Soft robotics is increasingly explored in artistic contexts, where tactile interaction provides audiences with embodied engagement beyond visual or auditory signals. This work presents an interactive installation that maps semantic emotion analysis of narrative text onto the variable stiffness of soft pneumatic modules. A natural language model identifies two dominant emotions from a predefined set of six, driving the inflation of seven hexagonally arranged soft actuators. The central actuator represents the primary emotion, while the surrounding ones express the secondary emotion. We develop and mechanically characterize silicone actuators, called soft modules, featuring a thin membrane layer, demonstrating how this morphological control expands the achievable stiffness range while preserving simplicity and low-cost fabrication. A user study with ten participants further evaluates how the multisensory coupling of stiffness and LED intensity influences emotional perception. The results suggest that stiffness modulation accompanied by color change can support emotionally meaningful and engaging tactile interaction in soft robotic installations.

2606.00416 2026-06-02 cs.CV

4D Radar Meets LiDAR and Camera: Cooperative Perception under Adverse Weather

4D雷达与激光雷达和相机的结合:恶劣天气下的协同感知

Melih Yazgan, Iramm Hamdard, Qiyuan Wu, J. Marius Zoellner

发表机构 * FZI Research Center for Information Technology(FZI信息技术研究中心) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 针对恶劣天气下相机和激光雷达性能下降的问题,提出集成4D成像雷达作为鲁棒模态,并引入多普勒引导的空间注意力机制进行多智能体融合,显著提升雾雨环境下的协同感知鲁棒性。

详情
Comments
Accepted by CVPR - DriveX Workshop
AI中文摘要

协同感知对于自动驾驶至关重要,但在恶劣天气下,当相机和激光雷达性能下降时,其可靠性会受到影响。我们通过将4D成像雷达作为一种对天气鲁棒的模态集成到协同感知中,并引入多普勒引导的空间注意力机制用于多智能体融合,来解决这一挑战。我们的方法扩展了两种代表性骨干网络:一种是雷达-相机流水线,其中雷达替代激光雷达;另一种是激光雷达-雷达流水线,其中雷达补充激光雷达。为了支持评估,我们发布了雷达增强的基准数据集OPV2V-R和Adver-City-R,并加入了基于物理的激光雷达退化模拟。实验表明,在雾和雨条件下,该方法获得了显著的鲁棒性提升,特别是在雷达替代退化激光雷达时改进明显。在MAN TruckScenes上的额外验证证明了该方法在仿真之外的迁移能力。总体而言,我们的结果突出了4D成像雷达作为一种适用于全天候协同感知的鲁棒模态。数据集和代码可在以下网址获取:https://url.fzi.de/SlimComm。

英文摘要

Cooperative perception is important for autonomous driving but remains fragile when cameras and LiDAR degrade in adverse weather. We address this challenge by integrating 4D imaging radar as a weather-robust modality into collaborative perception and introducing a Doppler-guided spatial attention mechanism for multi-agent fusion. Our approach extends two representative backbones: a radar-camera pipeline where radar substitutes for LiDAR, and a LiDAR-radar pipeline where radar complements LiDAR. To support evaluation, we release radar-augmented benchmarks, OPV2V-R and Adver-City-R, with physics-based LiDAR degradation. Experiments show strong robustness gains in fog and rain, including substantial improvements when radar replaces degraded LiDAR. Additional validation on MAN TruckScenes demonstrates transfer beyond simulation. Overall, our results highlight 4D imaging radar as a robust modality for all-weather collaborative perception. Dataset and code are available at: https://url.fzi.de/SlimComm.

2606.00414 2026-06-02 cs.LG

Auditing Near-Optimal Policies Can Be Exponentially Hard: Conditional Query Lower Bounds via Occupancy Rashomon Capacity

审计近最优策略可能是指数级困难的:通过占用Rashomon容量的条件查询下界

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

发表机构 * Department of Computer Science, Iowa State University(计算机科学系,爱荷华州立大学) Department of Civil, Construction & Environmental Engineering, Iowa State University(土木、建设与环境工程系,爱荷华州立大学)

AI总结 本文通过引入占用Rashomon容量概念,证明了在存在多个近最优策略时,审计这些策略的行为差异需要指数级数量的查询,并给出了精确和噪声查询下的下界。

详情
AI中文摘要

当许多强化学习策略达到近最优回报时,事后审计员可能必须区分许多行为不同但回报等价的策略。我们通过Rashomon容量的占用度量类比来形式化这一现象:近最优占用区域的度量熵,相对于被审计的部署类别计算。由于占用度量仅识别到占用等价,我们在占用类别级别上制定审计,并区分精确局部查询预言机和噪声样本查询预言机。我们的主要精确查询结果是条件性的:如果被审计类别包含一个$2/H$-分离的近最优填充,其局部签名是$b$-稀疏的,那么精确局部查询审计需要$\Omega(M/b)$次查询;当填充实现部署类别容量且$b=O(1)$时,这变为$\Omega(2^{H_{\mathrm{opt}}^{\mathcal{F}}(\varepsilon)})$。我们给出了一个有限折扣隐藏分支MDP达到此界,并展示了精确贝叶斯成功律。对于噪声隐藏触发测试,我们证明了阶为$M/\beta$的混合下界,其中$\beta$是每样本KL信号,对于容量阶填充且$\beta=O(\rho^2\Delta^2)$,得到$\Omega(2^{H_{\mathrm{opt}}^{\mathcal{F}}(\varepsilon)}/(\rho^2\Delta^2))$。我们还提供了静态目标识别信息下界、一个转录兼容的预言机覆盖验证上界,以及一个规范占用正则化器,当存在可信参考占用时,其正则化审计容量会崩溃。受控基准将正稀疏签名实例与精确审计容易的高容量阴性对照区分开来,并将噪声触发律映射到后处理的连续控制和视觉RL审计体制。

英文摘要

When many reinforcement-learning policies achieve near-optimal return, a post-hoc auditor may have to distinguish among many behaviorally distinct but return-equivalent policies. We formalize this phenomenon through an occupancy-measure analogue of Rashomon capacity: the metric entropy of the near-optimal occupancy region, computed relative to an audited deployment class. Because occupancy measures identify behavior only up to occupancy equivalence, we formulate auditing at the occupancy-class level and distinguish exact local-query oracles from noisy sample-query oracles. Our main exact-query result is conditional: if the audited class contains a $2/H$-separated near-optimal packing whose local signatures are $b$-sparse, then exact local-query auditing requires $Ω(M/b)$ queries; when the packing realizes deployment-class capacity and $b=O(1)$, this becomes $Ω(2^{H_{\mathrm{opt}}^{\mathcal{F}}(\varepsilon)})$. We give a finite discounted hidden-branch MDP attaining this bound and show the exact Bayes success law. For noisy hidden-trigger testing, we prove a mixture lower bound of order $M/β$, where $β$ is the per-sample KL signal, yielding $Ω(2^{H_{\mathrm{opt}}^{\mathcal{F}}(\varepsilon)}/(ρ^2Δ^2))$ for capacity-order packings with $β=O(ρ^2Δ^2)$. We also provide a static target-recognition information lower bound, a transcript-compatible oracle-cover verification upper bound, and a canonical occupancy regularizer whose regularized audited capacity collapses when a trusted reference occupancy is available. Controlled benchmarks distinguish positive sparse-signature instances from high-capacity negative controls where exact auditing is easy, and map the noisy-trigger law to post-processed continuous-control and visual-RL auditing regimes.

2606.00404 2026-06-02 cs.CV cs.LG

Rethinking Amortized Neural Representations for High-Resolution Terrain Elevation Data

重新思考高分辨率地形高程数据的摊销神经表示

Haoan Feng, Xin Xu, Leila De Floriani

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 针对地形高程数据,提出HUVR+SIREN超网络方法,通过替换坐标解码器为平滑可微版本,在统一基准上实现最佳高度和导数保真度,且支持后训练量化压缩。

详情
Comments
12 pages, 7 figures, 10 tables
AI中文摘要

隐式神经表示(INR)将信号建模为连续的坐标到值函数。对于地形高程数据,这支持解析导数、任意分辨率解码以及底层高度场的平滑表面模型。然而,为每个瓦片拟合和存储单独的INR无法扩展到大型地形数据集。摊销神经表示通过共享网络降低了这一成本:新瓦片被映射到紧凑的每瓦片载荷,共享解码器从中重建高度场。大多数此类方法是超网络,通过单次前向传递预测载荷,而其他方法则通过短时的每瓦片优化恢复载荷。这些方法主要针对自然图像开发,其在地形高度场上的适用性尚不清楚。我们在1米/像素的地形数据集上引入了受控基准,并在统一协议下评估了三种代表性方法。观察到明显的跨领域差距后,我们提出了HUVR+SIREN,这是一种超网络,它通过将坐标解码器替换为平滑、解析可微的解码器来适应最强的基准方法(HUVR)。它在基准上实现了最佳的高度和导数保真度,无需额外的每瓦片存储且解码成本更低,并且能够容忍激进的后训练量化而质量损失可忽略,从而形成了紧凑的地形神经格式。消融和诊断进一步确定了哪些设计选择可迁移到地形,并表明每瓦片瓶颈已接近其有用极限,剩下的差距在于共享超网络的架构设计。

英文摘要

Implicit neural representations (INRs) model a signal as a continuous coordinate-to-value function. For terrain elevation data, this supports analytic derivatives, arbitrary-resolution decoding, and a smooth surface model of the underlying heightfield. However, fitting and storing a separate INR for every tile does not scale to large terrain datasets. Amortized neural representations reduce this cost with a shared network: a new tile is mapped to a compact per-tile payload, and a shared decoder reconstructs the heightfield from it. Most such methods are hypernetworks that predict the payload in a single forward pass, while others recover it through a short per-tile optimization. These methods were developed primarily for natural images, and their suitability for terrain heightfields remains unclear. We introduce a controlled benchmark on a 1 m/pixel terrain dataset and evaluate three representative methods under a unified protocol. Observing a clear cross-domain gap, we propose HUVR+SIREN, a hypernetwork that adapts the strongest benchmarked method (HUVR) by replacing its coordinate decoder with a smooth, analytically differentiable one. It attains the best height and derivative fidelity on the benchmark with no additional per-tile storage and lower decode cost, and tolerates aggressive post-training quantization with negligible quality loss, giving a compact terrain neural format. Ablations and diagnostics further identify which design choices transfer to terrain and show that the per-tile bottleneck is already near its useful limit, leaving the remaining gap in the shared hypernetwork's architectural design.
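The claim that a sine-activated (SIREN-style) decoder gives analytic derivatives of the heightfield can be illustrated with a minimal sketch. It assumes a single hidden layer and made-up weights (`PARAMS`, `omega0=30.0`, and both function names are hypothetical); it is not the HUVR+SIREN architecture, only the chain-rule idea behind a smooth, analytically differentiable decoder.

```python
import math

def siren_height(x, y, params, omega0=30.0):
    """Tiny one-hidden-layer sine network mapping (x, y) to a height value."""
    w1, b1, w2, b2 = params
    h = 0.0
    for (wx, wy), b, v in zip(w1, b1, w2):
        h += v * math.sin(omega0 * (wx * x + wy * y + b))
    return h + b2

def siren_gradient(x, y, params, omega0=30.0):
    """Closed-form (dh/dx, dh/dy) via the chain rule, with no finite differences."""
    w1, b1, w2, _ = params
    gx = gy = 0.0
    for (wx, wy), b, v in zip(w1, b1, w2):
        c = v * omega0 * math.cos(omega0 * (wx * x + wy * y + b))
        gx += c * wx
        gy += c * wy
    return gx, gy

# Hypothetical weights, chosen only for illustration:
# (input weights, hidden biases, output weights, output bias)
PARAMS = (
    [(0.03, -0.01), (0.02, 0.04), (-0.05, 0.01)],
    [0.1, -0.2, 0.3],
    [1.5, -0.7, 0.4],
    10.0,
)

slope_x, slope_y = siren_gradient(0.3, 0.7, PARAMS)
```

Because the slope comes from the same weights as the height, terrain derivatives such as slope can be queried at any coordinate without resampling a raster.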

2606.00400 2026-06-02 cs.LG

Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning

动态代理混合:将重放控制器从小模型迁移到大模型以进行持续指令微调

Ibne Farabi Shihab, Fariya Afrin, Anuj Sharma

发表机构 * Department of Computer Science, Iowa State University(爱荷华州立大学计算机科学系) Department of Computer Science, Kalinga Institute of Industrial Technology(卡林加工业技术学院计算机科学系) Department of Civil, Construction & Environmental Engineering, Iowa State University(爱荷华州立大学土木、建筑与环境工程系)

AI总结 提出PROXYMIX框架,通过在小代理模型上学习动态重放控制器并冻结迁移至大模型,以解决持续指令微调中固定重放比例导致的灾难性遗忘问题,在LLaMA-3-8B上平均准确率提升3.4个百分点,最终遗忘降低3.5个百分点,安全性得分提升5.8个百分点。

详情
AI中文摘要

持续指令微调通过一系列新领域更新语言模型,但每次更新会逐渐侵蚀先前学到的能力和对齐行为。重放是标准的缓解方法,但固定重放比例本质上有限,因为最优混合比例随当前领域、训练阶段以及先前行为的脆弱性而变化。我们提出PROXYMIX框架,该框架在小代理模型上学习动态重放控制器,并将冻结的控制器迁移到更大的目标模型。控制器从未见过未来任务,而是从归一化的验证损失及其时间动态构建状态,生成当前任务和可访问重放缓冲区的掩码混合。我们的核心经验假设是遗忘镜像:即使绝对损失大小不同,任务脆弱性排名在不同模型规模上基本一致。在跨规模迁移控制器之前,我们通过实验验证了这一假设。在LLaMA-3-8B上跨越五个持续指令微调序列,相比最强的非神谕基线,PROXYMIX将平均准确率提高了3.4个百分点,最终遗忘降低了3.5个百分点,安全性得分提高了5.8个百分点,而策略学习成本比神谕目标强化学习(Oracle Target RL)低约50倍。该框架在接口层面无泄漏且架构无关,我们还确定了代理假设失效的设置,突出了鲁棒部署的局限性。

英文摘要

Continual instruction tuning updates a language model through a sequence of new domains, yet each update can progressively erode previously learned capabilities and alignment behavior. Replay is the standard mitigation, but fixed replay ratios are inherently limited because the optimal mixture varies with the current domain, the training stage, and the evolving vulnerability of prior behaviors. We propose PROXYMIX, a framework that learns a dynamic replay controller on a small proxy model and transfers the frozen controller to a larger target. The controller never observes future tasks and constructs its state from normalized validation losses and their temporal dynamics, producing a masked mixture over the current task and accessible replay buffers. Our core empirical hypothesis is forgetting mirroring: task vulnerability rankings remain largely consistent across model scales even when absolute loss magnitudes differ. We validate this assumption empirically before transferring controllers across scales. On LLaMA-3-8B across five continual instruction tuning sequences, PROXYMIX improves average accuracy by 3.4 points, reduces final forgetting by 3.5 points, and raises the safety score by 5.8 points over the strongest non-oracle baseline, at roughly 50x lower policy-learning cost than Oracle Target RL. The framework is leakage-free and architecture-independent at the interface level, and we also identify settings where the proxy assumption breaks down, highlighting limitations for robust deployment.
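The controller interface described above (a state built from normalized validation losses and their change over time, and an output that is a masked mixture over the current task and the accessible replay buffers) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's controller: `controller_state`, `masked_mixture`, the per-source baseline normalization, and the example scores are all hypothetical.

```python
import math

def controller_state(val_losses, prev_val_losses, baseline_losses):
    """State = baseline-normalized validation losses plus their change since the last check."""
    norm = [l / b for l, b in zip(val_losses, baseline_losses)]
    prev = [l / b for l, b in zip(prev_val_losses, baseline_losses)]
    delta = [n - p for n, p in zip(norm, prev)]
    return norm + delta

def masked_mixture(scores, accessible):
    """Softmax over controller scores, with inaccessible sources forced to weight 0."""
    if not any(accessible):
        raise ValueError("at least one source must be accessible")
    top = max(s for s, ok in zip(scores, accessible) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, accessible)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical example: the current task plus three replay buffers, the last of
# which belongs to a task that has not been seen yet and is therefore masked out.
weights = masked_mixture([0.5, 1.0, 2.0, 3.0], [True, True, True, False])
```

Masking before normalization keeps the mixture a valid probability distribution over only the sources the controller is allowed to sample from, which is one simple way to respect the no-future-tasks constraint.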

2606.00399 2026-06-02 cs.LG

Multi-Objective Reference-Aligned Machine Unlearning

多目标参考对齐机器遗忘

Rasa Khosrowshahli, Stephen Asobiela, Beatrice Ombuki-Berman, Shahryar Rahnamayan

发表机构 * arXiv

AI总结 提出多目标框架RAUL,通过将遗忘样本的预测对齐到参考分布(均匀分布或保留集经验分布)来约束遗忘目标,并利用雅可比下降解决多目标优化,实现接近完整重训练的遗忘效果。

详情
Comments
Accepted as a short paper at Canadian AI 2026. Author version with an added framework overview figure for clarity
AI中文摘要

机器遗忘旨在移除特定训练样本的影响,同时保持模型的效用。现有的单目标方法,如梯度上升或随机重标,常常由于冲突的优化动态和无界的遗忘目标导致灾难性遗忘,使模型偏离其预训练知识。我们提出参考对齐遗忘(RAUL),一个联合优化遗忘和保留的多目标框架。它将遗忘样本上的无界损失最大化替换为有界的KL对齐,使其预测对齐到代表未见数据的参考分布(可实例化为均匀分布或来自保留参考集的经验分布),从而约束遗忘目标并减少与保留目标的梯度冲突。由此产生的多目标优化(MOO)问题通过雅可比下降求解,该算法将多个梯度聚合到无冲突的方向。我们的结果表明,RAUL与完全重训练的差距最小。

英文摘要

Machine unlearning aims to remove the influence of specific training samples while preserving the model's utility. Existing single-objective approaches, such as gradient ascent or random relabeling, often induce catastrophic forgetting due to conflicting optimization dynamics and unbounded forgetting objectives that cause the model to drift from its pre-trained knowledge. We propose Reference-Aligned UnLearning (RAUL), a multi-objective framework that jointly optimizes forgetting and retention. It replaces unbounded loss maximization with a bounded KL alignment of predictions on forgotten samples toward a reference distribution representing unseen data, instantiated either as a uniform distribution or as an empirical distribution from a held-out reference set. This constrains the forgetting objective and reduces gradient conflict with retention. The resulting multi-objective optimization (MOO) problem is solved via Jacobian descent, which aggregates multiple gradients into a direction that does not conflict. Our results demonstrate that RAUL achieves the smallest gap to full retraining.
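Why a KL alignment toward a reference distribution is bounded, while gradient ascent on the loss is not, can be checked numerically. The sketch below assumes the uniform-reference case and the direction KL(p || u), which equals log K minus the entropy of p and so lies in [0, log K]; the abstract does not state the KL direction, so that choice and the function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_to_uniform(logits):
    """KL(p || u) with p = softmax(logits) and u uniform over K classes.

    Equals log K - H(p), so the value is confined to [0, log K].
    """
    p = softmax(logits)
    k = len(p)
    return sum(pi * math.log(pi * k) for pi in p if pi > 0.0)

def ascent_objective(logits, label):
    """What gradient ascent on cross-entropy minimizes: log p[label], unbounded below."""
    return math.log(softmax(logits)[label])

bounded = kl_to_uniform([100.0, 0.0, 0.0, 0.0])   # cannot exceed log(4)
unbounded = ascent_objective([0.0, 50.0], 0)      # keeps decreasing as the logit gap grows
```

The bounded objective reaches its minimum once predictions on forgotten samples match the reference, so there is no incentive to keep pushing the weights, whereas the ascent objective can always be lowered further.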

2606.00397 2026-06-02 cs.RO cs.SY eess.SY

SoFiE: Soft Finger Exoskeleton for Intelligent Grasping

SoFiE: 用于智能抓取的软手指外骨骼

Magnus Malthe Sigsgaard Nielsen, Nicklas Nikolaj Grønvall, Xiaofeng Xiong, Saravana Prashanth Murali Babu

发表机构 * SDU Soft Robotics, SDU Biorobotics, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark (SDU)(SDU柔性机器人实验室、SDU生物机器人实验室、马士基麦金尼莫勒研究所、南丹麦大学)

AI总结 本文提出一种模块化软手指外骨骼SoFiE,采用3D打印柔性材料、肌腱驱动和集成触觉传感,实现轻量化、低轮廓的抓取辅助与智能感知。

详情
AI中文摘要

软体可穿戴机器人系统已成为辅助手部功能减退个体的有前景解决方案。本文提出SoFiE,一种模块化软手指外骨骼,旨在辅助抓取任务中的食指屈曲。该系统主要采用3D打印柔性材料制造,实现了轻量、低轮廓和模块化设计。驱动通过紧凑型直流电机驱动的肌腱机构实现,而被动伸展由柔性导电弹簧提供。该元件称为StretchSense,通过变形下的电阻变化也作为本体感受传感器。此外,引入了一种新颖的触觉传感方法MagSense,使用嵌入软指尖结构中的磁铁和磁力计对来估计接触力和物体柔顺性。该系统完全无线,并由嵌入式微控制器控制。此外,通过电机编码器反馈的驱动器级传感能够估计系统状态,为安全和自适应控制策略提供基础。实验验证表明,该系统能够提供可靠的姿态估计,区分不同刚度的材料,并在不同抓取任务中生成独特的传感器特征。本文详细介绍了所提出的外骨骼的设计、制造和传感概念,作为模块化、软体和辅助可穿戴机器人的概念验证。

英文摘要

Soft wearable robotic systems have emerged as a promising solution for assisting individuals with reduced hand function. This paper presents SoFiE, a modular soft finger exoskeleton designed to assist index-finger flexion during grasping tasks. The proposed system is primarily fabricated using 3D-printed flexible materials, enabling a lightweight, low-profile, and modular design. Actuation is achieved through a tendon-driven mechanism powered by a compact DC motor, while passive extension is provided by a compliant conductive spring. This element, termed StretchSense, also functions as a proprioceptive sensor by exhibiting resistance changes under deformation. Furthermore, a novel tactile sensing approach, MagSense, is introduced, using a magnet and magnetometer pair embedded in a soft fingertip structure to estimate contact force and object compliance. The system is fully untethered and controlled by an embedded microcontroller. In addition, actuator-level sensing through motor encoder feedback enables estimation of the system state, providing a foundation for safe and adaptive control strategies. Experimental validation demonstrates the capability of the system to provide reliable pose estimation, distinguish between materials with different stiffness, and generate distinct sensor signatures across different grasping tasks. This paper details the design, fabrication, and sensing concepts of the proposed exoskeleton as a proof of concept toward modular, soft, and assistive wearable robotics.

2606.00392 2026-06-02 cs.LG cs.AI

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

通过约束策略优化实现检测器规避的LLM释义

Mingyi Wang, Zhuoer Shen, Yuheng Bu, Shaofeng Zou

发表机构 * School of ECEE, Arizona State University(亚利桑那州立大学电气、计算机与能源工程学院) Department of Computer Science, University of California, Santa Barbara(加州大学圣巴巴拉分校计算机科学系)

AI总结 提出DEPO算法,将检测器规避的LLM释义建模为约束马尔可夫决策过程,通过拉格朗日对偶强化学习在保持语义的同时实现高效规避。

详情
AI中文摘要

AI文本检测器易受释义和检测器引导的释义攻击,但现有规避方法缺乏对语义保持的精确控制。特别是,直接优化检测器规避会降低细粒度语义,而标量化奖励设计仅提供间接、权重敏感的规避-语义权衡控制。我们通过将检测器规避的LLM释义建模为约束马尔可夫决策过程来解决这一限制,其中检测器规避是主要目标,语义保持作为显式约束强制执行。我们提出检测器规避策略优化(DEPO),一种拉格朗日原始-对偶强化学习算法,具有新颖的GRPO风格组基策略更新。DEPO在训练期间自适应平衡语义保持和检测器规避,使策略能够在规定的语义保持区域内提高攻击成功率。在MAGE、M4、RAID和同行评审数据集上的实验,针对MAGE、RoBERTa、RADAR、Binoculars和Fast-DetectGPT检测器进行评估,表明DEPO在精确满足语义保持约束的同时实现了强大的检测器规避。DEPO还表现出跨领域、跨检测器和提示级别的鲁棒性。

英文摘要

AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack precise control over semantic preservation. In particular, optimizing directly for detector evasion can degrade fine-grained semantics, whereas scalarized reward designs provide only indirect, weight-sensitive control over the evasion-semantics trade-off. We address this limitation by formulating detector-evasive LLM paraphrasing as a Constrained Markov Decision Process, where detector evasion is the primary objective and semantic preservation is enforced as an explicit constraint. We propose Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual reinforcement learning algorithm with a novel GRPO-style group-based policy update. DEPO adaptively balances semantic preservation and detector evasion during training, enabling the policy to improve attack success within a prescribed semantic-preservation region. Experiments on MAGE, M4, RAID, and peer-review datasets, evaluated against MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors, show that DEPO achieves strong detector evasion while precisely satisfying the semantic preservation constraint. DEPO also exhibits cross-domain, cross-detector, and prompt-level robustness.
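The constrained formulation (maximize detector evasion subject to a semantic-preservation constraint, solved by a Lagrangian primal-dual method with group-based updates) follows a generic pattern that can be sketched in a few lines. This is a textbook-style illustration, not DEPO itself: the reward shapes, the threshold `tau`, the step size, and all function names are hypothetical.

```python
import statistics

def lagrangian_rewards(evasion, semantic, lam, tau):
    """Per-sample reward: evasion score plus lam times the constraint slack (semantic - tau)."""
    return [e + lam * (s - tau) for e, s in zip(evasion, semantic)]

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group of sampled paraphrases."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def dual_update(lam, semantic, tau, step=0.5):
    """Projected dual ascent: raise lam when mean semantic score falls below tau, else lower it."""
    violation = tau - statistics.fmean(semantic)
    return max(0.0, lam + step * violation)

# Hypothetical group of four paraphrases for one prompt.
evasion = [1.0, 0.0, 1.0, 0.0]      # 1.0 means the detector was fooled
semantic = [0.9, 0.6, 0.7, 0.8]     # similarity to the source text
lam = dual_update(1.0, semantic, tau=0.85)
advantages = group_advantages(lagrangian_rewards(evasion, semantic, lam, tau=0.85))
```

Unlike a fixed scalarization weight, the multiplier adapts during training: it grows while the semantic constraint is violated and shrinks toward zero once the constraint holds with slack.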

2606.00390 2026-06-02 cs.CV cs.AI

Zamba2-VL Technical Report

Zamba2-VL 技术报告

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 提出基于混合架构Zamba2的视觉语言模型Zamba2-VL,在图像理解等基准上媲美Transformer模型,且首次令牌延迟降低约一个数量级。

详情
Comments
16 pages, 2 figures
AI中文摘要

我们提出Zamba2-VL,这是一套基于Zamba2构建的视觉语言模型,Zamba2是一种混合语言模型架构,结合了Mamba2状态空间层和少量共享的Transformer块。在广泛的图像理解、推理、OCR、定位和计数基准测试中,Zamba2-VL与同等规模的主流基于Transformer的开源VLM(包括Molmo2、Qwen3-VL和InternVL3.5系列)具有竞争力,并且显著优于之前的基于SSM和混合的VLM,如VL-Mamba、Cobra和mmMamba。继承了其Zamba2骨干网络的近线性预填充计算和小的、近乎恒定的循环状态,Zamba2-VL在匹配参数规模下,首次令牌延迟(TTFT)比这些Transformer基线低大约一个数量级,在最适合设备和边缘部署的较小1.2B和2.7B规模上效率差距最为明显。我们发布了三个模型——1.2B、2.7B和7B——以及推理代码,网址为https://huggingface.co/collections/Zyphra/zamba2-vl。

英文摘要

We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales most relevant to on-device and edge deployment. We release three models -- 1.2B, 2.7B, and 7B -- together with inference code at https://huggingface.co/collections/Zyphra/zamba2-vl.

2606.00386 2026-06-02 cs.CV

αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion

αDepth: 学习用于立体转换的单次软边界分解

Xiang Zhang, Yang Zhang, Lukas Mehl, Karlis Martins Briedis, Markus Gross, Christopher Schroers

发表机构 * ETH Zürich(苏黎世联邦理工学院) DisneyResearch|Studios(迪士尼研究|工作室)

AI总结 提出αDepth表示,通过圆形Alpha表示(CAR)将软边界分解为局部层次,实现高保真立体转换,无需用户干预。

详情
AI中文摘要

精确建模软边界(例如头发和散焦模糊)是立体转换中的一个基本挑战,因为前景和背景的模糊混合。现有的深度模型主要预测单层深度,导致软边界处的深度对应关系模糊。虽然抠图技术可以捕获用于分层建模的不透明度,但它们在具有多个目标的复杂场景中通常表现不佳,并且通常需要用户干预。本文介绍了αDepth,一种分层表示,用于分解软边界以实现高保真立体转换。具体来说,我们首先通过估计软边界处的分层颜色和深度值来解决混合颜色和深度模糊问题。考虑到复杂的多目标场景,我们设计了一种圆形Alpha表示(CAR),将范式从全局目标提取转变为局部边界分解。与先前仅限于单个前景/背景的抠图方法不同,CAR无需手动指导即可实现高效的场景级推理。大量评估表明,αDepth在立体转换中实现了最先进的性能,消除了软边界处的背景渗色和结构失真。

英文摘要

Accurately modeling soft boundaries, e.g., hair and defocus blur, is a fundamental challenge in stereo conversion due to the ambiguous blending of foreground and background. Existing depth models primarily predict single-layer depth, leading to ambiguity in depth correspondence at soft boundaries. While matting techniques can capture opacity for layered modeling, they often struggle in complex scenes with multiple targets and usually require user intervention. This paper introduces αDepth, a layered representation that decomposes soft boundaries for high-fidelity stereo conversion. Specifically, we first resolve mixed color and depth ambiguity by estimating layered color and depth values at soft boundaries. Considering complex multi-target scenes, we design a Circular Alpha Representation (CAR) that shifts the paradigm from global target extraction to local boundary decomposition. Unlike prior matting methods restricted to a single foreground/background, CAR enables efficient scene-level inference without manual guidance. Extensive evaluations demonstrate that αDepth achieves state-of-the-art performance in stereo conversion, eliminating background bleeding and structural distortions at soft boundaries.

2606.00382 2026-06-02 cs.LG

CRMA: A Spectrally-Bounded Backbone for Modular Continual Fine-Tuning of LLMs

CRMA:用于大语言模型模块化持续微调的谱约束骨干

Kiran Nayudu, Aswini Nutakki, Sai Vinay Naidu, Ashwin Shanmugasundaram

发表机构 * ModelBrew AI

AI总结 提出CRMA残差适配器,通过Sinkhorn归一化确保混合矩阵双随机,从而在结构上约束谱范数,实现共享基座的持续训练与跨任务正向迁移,无需回放或蒸馏。

详情
Comments
38 pages, 10 figures. Patent-pending construction details deferred to companion technical report (in preparation)
AI中文摘要

大语言模型的顺序微调面临两难选择:要么让共享基座持续学习并接受灾难性遗忘,要么在第一个任务后冻结它并放弃跨任务优化。每任务适配器方法(LoRAHub、AdapterFusion、PackNet、Progressive Networks)选择了后一条路径。我们提出CRMA(约束残差混合适配器),一种残差适配器,其内部混合矩阵M通过Sinkhorn归一化在每次前向传播时保持双随机,因此根据Birkhoff定理,||M||_2 <= 1由构造保证——这是一种结构约束,而非惩罚项。CRMA的谱约束骨干提供了一个持续训练的共享基座,这是早期模块化方法无法实现的,同时保留了它们的遗忘保证。在Mistral-7B上,跨越5个顺序领域和3个种子,基于CRMA骨干的模块化每任务LoRA将损失相对漂移从+42.96% ± 5.5(朴素顺序微调)降低到-0.17% ± 0.17,且每种子范围不重叠,并将先前任务的保留损失比匹配的冻结基座基线提高了1.99% ± 0.54。三个独立的实验设置(Mistral-7B 4领域受控消融、TinyLlama 3领域污染控制复现、Mistral-7B 7B跨领域探测)均显示出正向反向迁移——无需回放缓冲区、无需增加每任务内存、无需蒸馏。在Gemma-2-9B上的推理时消融证实CRMA介导了对顺序训练知识的访问:在相同权重和相同问题上,仅通过切换CRMA注入,结果从38/100提升到98/100。867次记录的训练步骤验证了||M||_2 = 1.0在float32精度内(最大偏差1.2×10^-7)。遗忘预防效果在1.1B-9.2B参数和四个架构系列中均成立。

英文摘要

Sequential fine-tuning of large language models forces a choice: let the shared substrate keep learning and accept catastrophic forgetting, or freeze it after task one and foreclose cross-task refinement. Per-task adapter methods (LoRAHub, AdapterFusion, PackNet, Progressive Networks) take the second path. We introduce CRMA (Constrained Residual Mixing Adapter), a residual adapter whose internal mixing matrix M is doubly-stochastic at every forward pass via Sinkhorn normalization, so by Birkhoff's theorem ||M||_2 <= 1 holds by construction -- a structural bound, not a penalty. CRMA's spectrally bounded backbone provides a continuously trained shared substrate that earlier modular methods could not, while preserving their forgetting guarantees. On Mistral-7B across 5 sequential domains and 3 seeds, modular per-task LoRA on a CRMA backbone reduces loss-relative drift from +42.96% +/- 5.5 (naive sequential fine-tuning) to -0.17% +/- 0.17, with disjoint per-seed ranges, and improves prior-task holdout loss by 1.99% +/- 0.54 over a matched frozen-substrate baseline. Three independent experimental setups (Mistral-7B 4-domain controlled ablation, TinyLlama 3-domain contamination-controlled replication, Mistral-7B cross-domain probes at 7B) all show positive backward transfer -- without replay buffers, without growing per-task memory, and without distillation. An inference-time ablation on Gemma-2-9B confirms CRMA mediates access to sequentially trained knowledge: 98/100 vs. 38/100 on the same weights and same questions with only CRMA injection toggled. 867 logged training steps verify ||M||_2 = 1.0 within float32 precision (max deviation 1.2 x 10^-7). The forgetting-prevention effect holds across 1.1B-9.2B parameters and four architecture families.
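The structural bound mentioned above can be reproduced on a toy matrix. A doubly-stochastic matrix is a convex combination of permutation matrices (Birkhoff), each with spectral norm 1, so its spectral norm is at most 1; because every row sums to 1, the all-ones vector is mapped to itself and the norm is exactly 1. The sketch below is a plain-Python illustration with hypothetical logits and iteration counts, not CRMA's implementation.

```python
import math

def sinkhorn(logits, iters=200):
    """Alternately normalize rows and columns of exp(logits) (square matrix)."""
    m = [[math.exp(v) for v in row] for row in logits]
    n = len(m)
    for _ in range(iters):
        for row in m:                      # make each row sum to 1
            s = sum(row)
            for j in range(n):
                row[j] /= s
        for j in range(n):                 # make each column sum to 1
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m

def spectral_norm(m, iters=500):
    """Largest singular value via power iteration on M^T M."""
    n = len(m)
    v = [float(i + 1) for i in range(n)]   # deliberately not the all-ones vector
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    sigma = 0.0
    for _ in range(iters):
        u = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]   # M v
        w = [sum(m[i][j] * u[i] for i in range(n)) for j in range(n)]   # M^T (M v)
        norm_w = math.sqrt(sum(x * x for x in w))
        v = [x / norm_w for x in w]
        sigma = math.sqrt(norm_w)          # ||M^T M v|| converges to sigma_max^2
    return sigma

# Hypothetical 3x3 mixing logits.
LOGITS = [[0.2, -1.0, 0.5], [1.3, 0.0, -0.7], [-0.4, 0.8, 0.1]]
M = sinkhorn(LOGITS)
norm_of_m = spectral_norm(M)
```

Since the bound follows from the normalization itself, it holds for any logits the optimizer produces, which is what separates a structural constraint from a penalty term that is only approximately satisfied.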

2606.00380 2026-06-02 cs.CV cs.AI

SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation

SUPREME: 一个用于可复现图像遗忘方法评估的多GPU框架

Petros Andreou, Jamie Lanyon, Axel Finke, Georgina Cosma

发表机构 * Department of Computer Science, School of Science, Loughborough University(拉夫堡大学理学院计算机科学系) School of Mathematics, Statistics and Physics, Newcastle University(纽卡斯尔大学数学、统计与物理学院)

AI总结 提出SUPREME框架,通过多GPU分布式架构加速图像分类遗忘方法的评估,支持新方法注册和多精度模式。

详情
Comments
17 pages. Code available at https://github.com/pedroandreou/supreme-unlearning
AI中文摘要

机器遗忘旨在从已训练模型中移除特定训练数据的影响,而无需从头重新训练。评估遗忘方法需要在多个种子下重复训练、遗忘和评估,计算成本高昂。据我们所知,现有的图像分类遗忘框架在单个GPU上运行,限制了在合理时间内可评估的种子数量。我们提出SUPREME,一个开源框架,将这些阶段分布到多个GPU上。SUPREME做出三项贡献:基于注册表的设计,用于添加新方法、指标、模型和场景;支持多种加速器和精度模式的多GPU架构;以及在Pins Face Recognition上使用ResNet18和ViT在十种种子下进行全类和随机样本遗忘的演示。该框架可在https://github.com/pedroandreou/supreme-unlearning获取。

英文摘要

Machine unlearning removes the influence of specific training data from a trained model without retraining it from scratch. Evaluating an unlearning method requires repeating training, unlearning, and evaluation across multiple seeds, which is computationally expensive. To our knowledge, existing image classification unlearning frameworks run on a single GPU, which limits how many seeds can be evaluated in reasonable time. We introduce SUPREME, an open-source framework that distributes these stages across multiple GPUs. SUPREME makes three contributions: a registry-based design for adding new methods, metrics, models, and scenarios; a multi-GPU architecture supporting multiple accelerators and precision modes; and a demonstration on Pins Face Recognition using ResNet18 and ViT under full-class and random-sample unlearning across ten seeds. The framework is available at https://github.com/pedroandreou/supreme-unlearning.
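A registry-based design of the kind mentioned above usually means that new methods are added by registering a callable under a name, so the evaluation loop never needs editing. The following is a generic sketch of that pattern with hypothetical names (`UNLEARNING_METHODS`, `register_method`, `run`); it does not reproduce SUPREME's actual API, which is documented in the linked repository.

```python
from typing import Callable, Dict

# Hypothetical global registry mapping a method name to its implementation.
UNLEARNING_METHODS: Dict[str, Callable] = {}

def register_method(name: str):
    """Decorator that adds an unlearning method to the registry under `name`."""
    def decorator(fn: Callable) -> Callable:
        if name in UNLEARNING_METHODS:
            raise KeyError(f"method {name!r} is already registered")
        UNLEARNING_METHODS[name] = fn
        return fn
    return decorator

@register_method("retrain")
def retrain_from_scratch(model, forget_set):
    """Placeholder body: a real method would retrain without the forget set."""
    return f"retrained {model} without {len(forget_set)} samples"

def run(name, model, forget_set):
    """Look a method up by name, as a config-driven evaluation loop might."""
    return UNLEARNING_METHODS[name](model, forget_set)

result = run("retrain", "resnet18", [1, 2, 3])
```

The same pattern extends to metrics, models, and scenarios by keeping one registry per kind of component.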

2606.00379 2026-06-02 cs.CV

Non-Learning Low-Light Stereo Vision

非学习低光立体视觉

Jason Wang, Lucas Nguyen, Hyunseung Eom, Wei Xu, Qi Guo

发表机构 * arXiv.org

AI总结 提出一种非学习立体框架,利用Field of Junctions (FoJ)提取粗视觉特征,结合边界感知半全局匹配(SGM)从严重噪声图像中估计视差,在基准数据集上获得比近期立体算法更准确的稀疏视差图。

详情
Comments
Accepted to ICIP 2026. Code and data available at https://github.com/guo-research-group/nonlearning-lowlight-stereo
AI中文摘要

我们提出了一种非学习立体框架,用于从严重噪声图像中估计视差。利用Field of Junctions (FoJ),它保留了在严重噪声下稳定的粗视觉特征用于构建代价体,同时丢弃与光子噪声不可分的精细纹理。由此产生的结构信息指导边界感知的半全局匹配(SGM),动态调整平滑惩罚以保留真实的视差不连续性。输出是稀疏视差图,在广泛使用的基准数据集上,在未掩蔽像素上比最近的立体算法更准确。

英文摘要

We present a non-learning stereo framework for disparity estimation from severely noisy images. Using the Field of Junctions (FoJ), it retains coarse visual features stable under severe noise for cost volume construction while discarding fine textures inseparable from photon noise. The resulting structural information guides boundary-aware Semi-Global Matching (SGM) that dynamically adapts smoothness penalties to preserve true disparity discontinuities. The output is a sparse disparity map more accurate than those of recent stereo algorithms over unmasked pixels on widely-used benchmark datasets.

2606.00377 2026-06-02 cs.CV

Score-Control for Hallucination Reduction in Diffusion Models

扩散模型中减少幻觉的分数控制

Mahesh Bhosale, Naresh Kumar Devulapally, Abdul Wasi, Chau Pham, Vishnu Suresh Lokhande, David Doermann

发表机构 * University at Buffalo(布法罗大学)

AI总结 针对扩散模型中的幻觉问题,提出基于方差引导的分数调制策略,通过控制分数雅可比矩阵减少幻觉,在保持高保真度和多样性的同时最多可将幻觉降低约25%。

详情
AI中文摘要

扩散模型已成为现代生成式AI的支柱,推动了视觉、语言、音频及其他模态的进步。尽管取得了成功,但它们存在幻觉问题,即生成真实数据分布支撑集之外的不可信样本,这降低了可靠性和信任度。在这项工作中,我们首先通过实验证实了先前提出的假设,即分数平滑性导致图像生成扩散模型中的幻觉,并提供了基于密度的视角。我们进一步通过将幻觉概率质量与学习到的分数函数的利普希茨常数联系起来,形式化了这一概念。受此启发,我们引入了一种方差引导的分数调制(VSM)策略,该策略控制分数雅可比矩阵,从而降低分数平滑性并更好地逼近真实分数,进而减少幻觉。在合成和真实世界数据集上的实验结果表明,我们的方法在保持高保真度和多样性的同时,最多可将幻觉降低约25%,为更可靠的基于扩散的图像生成提供了原则性步骤。我们还提出了两个具有极端语义变化的基准数据集,用于系统性幻觉评估。代码和数据集公开于https://github.com/bhosalems/VSM。

英文摘要

Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language, audio, and other modalities. Despite their success, they suffer from hallucinations: implausible samples that lie outside the support of the true data distribution and degrade reliability and trust. In this work, we first empirically confirm a previously proposed hypothesis that score smoothness causes hallucinations in image-generation diffusion models, and we provide a density-based perspective. We further formalize this notion by linking the hallucination probability mass to the Lipschitz constant of the learned score function. Motivated by this, we introduce a Variance-Guided Score Modulation (VSM) strategy that controls the score Jacobian, which in turn reduces score smoothness and better approximates the ground-truth score, thereby decreasing hallucinations. Empirical results on synthetic and real-world datasets demonstrate that our approach reduces hallucinations by up to ~25% while maintaining high fidelity and diversity, providing a principled step toward more reliable diffusion-based image generation. We also propose two benchmark datasets with extreme semantic variation for systematic hallucination evaluation. Code and datasets are publicly available at https://github.com/bhosalems/VSM.

2606.00376 2026-06-02 cs.AI cs.CL cs.LG

The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

确定性视界:当扩展推理失败时工具委托变得必要

Dongxin Guo, Jikun Wu, Siu Ming Yiu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过注意力瓶颈定理和确定性视界概念,证明解码器-only注意力在确定性状态追踪任务中存在信息论容量限制,导致扩展推理性能退化,并指出当视界超过19-31时工具委托成为必要。

详情
Comments
Accepted at ICML 2026. 4 figures. 51 pages including appendices
AI中文摘要

扩展的思维链推理可能会在确定性状态追踪任务上降低性能,这不是由于偏好偏差,而是源于仅解码器(decoder-only)注意力的信息论容量限制。我们建立了:(1) 注意力瓶颈定理及互补的可达性构造,将状态追踪容量界定为 $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$;(2) 一个上下文相关的错误模型,导致超指数精度衰减;(3) 状态空间Jaccard度量,区分能力与偏好失败;(4) 确定性视界 $d^* \in [19, 31]$,超过该视界工具委托变得必要。在12个模型和8个任务领域(包括SWE-Bench、WebArena和SQL-Multi)中,工具集成推理始终优于神经思维链;在主要模型套件上,其准确率达到86-94%,而神经思维链仅为24-42%。在最优长度轨迹上进行微调仅带来不到5%的提升,证实了架构上限,并且高跨模型相关性($r = 0.81$-$0.91$)表明这些失败是架构性的而非训练特定的。我们的结果为在代理系统中纯神经推理何时应让位于混合方法提供了原则性指导。

英文摘要

Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks, not because of preference biases but because of limits rooted in the information-theoretic capacity of decoder-only attention. We establish: (1) an Attention Bottleneck Theorem with a complementary achievability construction, bounding state-tracking capacity as $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$; (2) a context-dependent error model yielding super-exponential accuracy decay; (3) the State-Space Jaccard metric distinguishing capability from preference failures; (4) a Deterministic Horizon $d^* \in [19, 31]$ beyond which tool delegation becomes necessary. Across 12 models and 8 task domains (including SWE-Bench, WebArena, and SQL-Multi), tool-integrated reasoning consistently outperforms neural chain-of-thought; on the primary model suite it reaches 86-94% accuracy versus 24-42% for neural chain-of-thought. Fine-tuning on optimal-length traces yields less than 5% improvement, confirming an architectural ceiling, and high cross-model correlation ($r = 0.81$-$0.91$) indicates these failures are architectural rather than training-specific. Our results provide principled guidance for when pure neural reasoning should yield to hybrid approaches in agentic systems.