arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.16196 2026-06-16 cs.LG cs.CV New submission

When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations

Anju Chhetri, Pratik Shrestha, Ramesh Rana, Prashnna Gyawali, Binod Bhattarai

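Affiliations: NepAl Applied Mathematics and Informatics Institute for research; West Virginia University; Kathmandu University; University College London; University of Aberdeen

AI summary: Proposes an interpretable OOD detection framework based on class-conditioned semantic perturbations and sparse autoencoders, which analyzes representation stability both to detect OOD inputs and to explain the model's internal mechanisms.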

Abstract

Deep neural networks have achieved remarkable performance across medical imaging tasks, yet their tendency to overgeneralize under distributional shifts poses a major obstacle to safe clinical deployment. Out-of-Distribution (OOD) detection methods aim to mitigate this risk, but most existing approaches rely on opaque internal signals with poorly understood semantic meaning, limiting trust in safety-critical settings. In this work, we propose an interpretable OOD detection framework that probes the stability of model predictions under class-conditioned semantic perturbations. Leveraging sparse autoencoders (SAEs), we learn class-specific concept vectors from in-distribution data that disentangle dense intermediate representations into sparse, semantically meaningful components. At inference, we perturb deeper-layer representations using the concept vectors associated with the model's predicted class and measure the stability of the class logits. We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept-conditioned stability analysis, our approach provides both a discriminative OOD signal and an interpretable lens into the internal mechanisms driving model uncertainty, making it particularly suitable for high-stakes medical applications.
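The inference-time probe described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the `head` function (a tanh nonlinearity followed by a linear layer), the hand-picked `concepts` vectors (the paper learns them with SAEs), the step size `alpha`, and the choice of the L2 norm of the logit change as the score.

```python
import math

def head(h, weights):
    """Hypothetical classifier head: logits = W @ tanh(h)."""
    act = [math.tanh(x) for x in h]
    return [sum(w * a for w, a in zip(row, act)) for row in weights]

def stability_score(h, weights, concepts, alpha=1.0):
    """L2 shift of the logits after nudging h along the predicted class's concept vector."""
    logits = head(h, weights)
    pred = max(range(len(logits)), key=logits.__getitem__)
    perturbed = [x + alpha * c for x, c in zip(h, concepts[pred])]
    shifted = head(perturbed, weights)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(logits, shifted)))

# Hand-made two-class example (illustrative values only).
weights = [[1.0, 0.0], [0.0, 1.0]]
concepts = [[1.0, 0.0], [0.0, 1.0]]   # one concept direction per class
aligned = [3.0, 0.0]                   # strongly aligned with class 0
misaligned = [0.2, 0.1]                # weak, ambiguous representation

print(stability_score(aligned, weights, concepts))
print(stability_score(misaligned, weights, concepts))
```

In this toy the aligned sample receives the smaller score only because tanh saturates for large activations. Whether real networks behave this way is the hypothesis the paper tests, not something the sketch establishes.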

2606.16193 2026-06-16 cs.CV cs.AI cs.LG New submission

Cascaded Sparse Autoencoders Learn Multi-Level Visual Concepts in Multimodal LLMs

Yusong Zhao, Hengyi Wang, Tanuja Ganu, Akshay Nambi, Hao Wang

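Affiliations: Rutgers University; Microsoft Research

AI summary: Proposes cascaded sparse autoencoders (CSAEs), which learn hierarchical visual concepts by training a second-level SAE on the decoder weights of a first-level SAE. This avoids the drawbacks of nested or stacked SAEs and improves hierarchical concept coherence and intervention effectiveness across multiple MLLMs and datasets.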

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their internal visual representations remain difficult to interpret. Sparse Autoencoders (SAEs) provide a scalable way to decompose dense model activations into sparse, interpretable features. However, existing SAE architectures primarily recover flat feature dictionaries and are less suited for explicit multi-level concept organization. In this paper, we introduce cascaded sparse autoencoders (CSAEs) for learning hierarchical visual concepts in MLLMs. Rather than nesting or stacking SAE sparse activation codes, CSAEs train a second-level SAE directly on the decoder weights of the first-level SAE, treating learned low-level feature directions as inputs for higher-level abstraction. This design enables CSAEs to learn "concepts of concepts" while avoiding the shared-prefix coupling of nested, Matryoshka-style hierarchies and the bottlenecks of naively stacked SAEs. Experiments across Qwen3-VL, Gemma-3, and LLaVA on multiple visual datasets show that CSAEs improve interpretability in terms of hierarchical concept coherence over state-of-the-art SAE baselines. Results on concept steering further demonstrate that the learned concept groups support effective group-level interventions in MLLM outputs.

2606.16188 2026-06-16 cs.CV New submission

TEASR: Training-Efficient Any-Step Diffusion Transformer for Real-World Image Super-Resolution

Xiang Gao, Chenxin Zhu, Yushun Fang, Qiang Hu, Xiaoyun Zhang

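Affiliations: Shanghai Jiao Tong University

AI summary: Proposes the TEASR framework, which uses self-adversarial distillation and a timestep-aware rectification strategy to enable any-step sampling within a single diffusion model, with no auxiliary teacher model; it substantially improves training efficiency and outperforms existing methods.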

Abstract

Diffusion models excel in Real-World Image Super-Resolution (Real-ISR) due to their powerful generative priors but suffer from slow iterative sampling. Although existing one-step distillation methods accelerate inference, they typically require auxiliary teacher models that inflate training memory and restrict scalability to large-scale architectures. Furthermore, these fixed-step models lack the flexibility to trade off speed for quality. In this paper, we propose TEASR, a training-efficient any-step diffusion framework for Real-ISR that enables both one-step and multi-step restoration within a unified model. Our key idea is to perform self-adversarial distillation within a single diffusion model, eliminating the need for auxiliary teachers or discriminators. Specifically, we propose a timestep-aware rectification strategy that stabilizes one-step generation across noise levels. These two designs further enable the distillation of 20B-parameter diffusion models on a single GPU, significantly improving training efficiency. Moreover, we introduce a dual-branch diffusion transformer with decoupled timestep conditioning that separates the current noise state from the denoising target, enhancing sampling quality. Extensive experiments demonstrate that TEASR supports seamless any-step sampling and consistently outperforms state-of-the-art methods across multiple datasets.

2606.16185 2026-06-16 cs.CV New submission

Learned JPEG Compression for DNN Vision

Kaixiang Zheng, Ahmed H. Salamah, Siyu Chen, En-Hui Yang

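Affiliations: University of Waterloo

AI summary: Proposes the J4D framework, which uses a differentiable JPEG codec and an information-theoretic rate estimator to optimize JPEG encoding parameters so that DNN inference performance improves at low compression rates; experiments show a compression-rate reduction of up to 80.05% at the same accuracy.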

Abstract

JPEG, a lossy image compression technique designed for human viewers, has maintained its dominance for decades. However, in the era of artificial intelligence (AI), a substantial portion of image data, often compressed by JPEG, is and will continue to be consumed by deep neural networks (DNNs) instead of humans, thus creating a need to optimize JPEG for DNN inference performance. To this end, we propose learned JPEG compression for DNN vision (J4D), a novel training framework for determining JPEG encoding parameters to minimize compression rate while maximizing DNN inference performance. The major challenge of solving this optimization problem lies in representing the JPEG codec and compression rate in closed form. By incorporating a differentiable soft quantizer based on a probabilistic quantization scheme, we not only obtain a differentiable proxy for the JPEG codec, but are also able to compute the entropy of the coded source analytically, which is a close estimate of the actual compression rate. Equipped with both the differentiable JPEG codec and the information-theoretic rate estimator, we are then able to solve the aforementioned optimization problem with backpropagation. After training, the learned encoding parameters will be subsequently used in actual JPEG encoding based on probabilistic quantization. Extensive experimental results across multiple datasets and DNN architectures demonstrate that J4D consistently and significantly outperforms the default JPEG and other competitive JPEG codecs optimized for DNNs. Notably, compared to the default JPEG, J4D achieves an increase in accuracy by as much as 11.60% at the same rate, or a reduction of compression rate up to 80.05% at the same accuracy. Additionally, with the help of J4D, we show the potential to design universal JPEG encoding parameters for various DNN architectures for the first time.
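To make the link between probabilistic quantization and an analytic rate estimate concrete, here is a minimal sketch. It assumes the simplest probabilistic scheme, stochastic rounding to the two nearest quantization indices, and estimates the rate as the Shannon entropy of the averaged index distribution. The abstract does not specify J4D's actual quantizer or rate model, so the function names and the scheme itself are illustrative assumptions.

```python
import math
from collections import defaultdict

def quantization_probs(x, step):
    """Probability of each neighbouring quantization index for coefficient x.

    Assumed scheme: round up with probability equal to the fractional part.
    """
    scaled = x / step
    low = math.floor(scaled)
    p_high = scaled - low
    probs = {low: 1.0 - p_high}
    if p_high > 0.0:
        probs[low + 1] = p_high
    return probs

def soft_quantize(x, step):
    """Expected dequantized value under the stochastic rounding above."""
    return sum(p * idx * step for idx, p in quantization_probs(x, step).items())

def entropy_bits(coeffs, step):
    """Shannon entropy (bits per coefficient) of the averaged index distribution."""
    totals = defaultdict(float)
    for x in coeffs:
        for idx, p in quantization_probs(x, step).items():
            totals[idx] += p / len(coeffs)
    return -sum(p * math.log2(p) for p in totals.values() if p > 0.0)

coeffs = [0.0, 1.0, 2.0, 3.0]          # made-up DCT coefficients
print(entropy_bits(coeffs, step=1.0))  # finer step: more distinct indices
print(entropy_bits(coeffs, step=4.0))  # coarser step: lower estimated rate
```

Because the index probabilities vary piecewise-linearly with `step`, the entropy estimate changes smoothly as the step size changes. That smoothness is what a gradient-based search over encoding parameters needs.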

2606.16184 2026-06-16 cs.CV cs.MM New submission

Closed-Loop Triplet Synergistic Generation for Long-Form Video

Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu

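Affiliations: University of Science and Technology of China; Microsoft Research Asia

AI summary: Proposes the CoTriSyGen framework, a closed-loop visual-text-memory synergy process in which an analyzer performs intra-shot and inter-shot refinement, to address identity drift and inconsistency in long-form video generation.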

Abstract

Multi-shot long-form video generation remains challenging due to identity drift and compounding inconsistencies across shots. While storyboard-driven pipelines improve controllability, they are often executed in a feed-forward manner, with limited mechanisms to incorporate generated visual evidence back into subsequent conditioning. We propose CoTriSyGen, an agentic framework that formulates multi-shot long-form video generation as a closed-loop visual-text-memory synergy process, where planned intent, persistent memory, and generated visuals are jointly leveraged for iterative correction and long-range coherence. A vision-language-model-based analyzer reasons over this triplet and produces updates to both prompts and memory along two pathways: (i) intra-shot refinement, which triggers targeted regeneration when semantic or compositional violations are detected and refines the image-to-video prompt for coherent motion; and (ii) inter-shot refinement, which rewrites subsequent-shot prompts to propagate newly manifested entities or attributes and improve prompt quality (e.g., compositional grounding and cinematic fluency) based on generated evidence. The loop is grounded in an entity-centric memory modeled as a mutable visual state that evolves as the story progresses. Both the generator and the analyzer continuously update this memory by adding new and evolved entities to reflect appearance changes, accumulated multi-view evidence, and multi-entity compositions. Experiments on our curated StoryBench benchmark demonstrate substantial improvements in cross-shot consistency, prompt adherence, and cinematic continuity over representative methods.

2606.16183 2026-06-16 cs.LG cs.AI cs.CL New submission

LLM-Powered Virtual Population for Demand Simulation and Pricing

Chengpiao Huang, Kaizheng Wang

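Affiliations: Columbia University

AI summary: Proposes an LLM-powered virtual population model that combines a mixture of customer personas with LLM-elicited purchase probabilities to generate a demand distribution and support risk-aware pricing; it performs best among the models considered on an H&M dataset.

Comments: 18 pages, 7 figures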

Abstract

We develop an LLM-powered virtual population model that simulates demand for pricing decisions, in settings where products are described by rich unstructured information, such as text descriptions and images, and where decision makers need not only mean-demand predictions but also uncertainty estimates for counterfactual prices. Our model represents exposed customers as draws from a finite mixture of customer personas. For each persona, product, and candidate price, an LLM elicits a persona-level purchase probability using both structured persona information and unstructured product information. These probabilities are aggregated through calibrated mixture weights to form a predictive distribution of aggregate demand. The resulting simulator can evaluate counterfactual prices under various pricing objectives, including expected revenue and risk-aware criteria such as conditional value at risk. We test the framework on an online H&M fashion dataset with product descriptions and images. The calibrated LLM-based simulator achieves the best overall predictive performance among the models considered, and supports sample-efficient pricing decisions. Our framework provides a practical way to use LLMs as demand simulators for products with limited historical demand data but rich product information. By producing a full predictive demand distribution rather than only a point forecast, it enables managers to compare candidate prices, quantify demand uncertainty, and choose prices that target either average-case revenue or risk-aware objectives.
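The aggregation step and the two pricing objectives named in this abstract can be sketched as follows. The persona purchase probabilities stand in for LLM outputs and are made-up numbers. The binomial form of the demand distribution is an added assumption (customers treated as independent draws from the persona mixture), and the helper names are hypothetical rather than taken from the paper.

```python
import math

def purchase_prob(weights, persona_probs):
    """Mixture-averaged probability that one exposed customer buys."""
    return sum(w * p for w, p in zip(weights, persona_probs))

def demand_pmf(n_customers, p_buy):
    """Binomial predictive distribution of aggregate demand (assumed i.i.d. customers)."""
    return [math.comb(n_customers, k) * p_buy**k * (1 - p_buy) ** (n_customers - k)
            for k in range(n_customers + 1)]

def expected_revenue(price, pmf):
    return price * sum(k * p for k, p in enumerate(pmf))

def cvar_revenue(price, pmf, alpha):
    """Mean revenue over the worst alpha-fraction of outcomes (lower tail)."""
    remaining, total = alpha, 0.0
    for k, p in enumerate(pmf):   # ascending demand means ascending revenue
        take = min(p, remaining)
        total += take * price * k
        remaining -= take
        if remaining <= 0.0:
            break
    return total / alpha

# Hypothetical calibrated weights and persona-level probabilities at one price.
weights = [0.5, 0.5]
persona_probs = [0.2, 0.6]
pmf = demand_pmf(10, purchase_prob(weights, persona_probs))
print(expected_revenue(10.0, pmf), cvar_revenue(10.0, pmf, alpha=0.1))
```

Evaluating both functions over a grid of candidate prices, with persona probabilities re-elicited at each price, would give the kind of comparison between average-case and risk-aware choices that the abstract describes.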

2606.16180 2026-06-16 cs.CV cs.LG New submission

To forget is to preserve: Machine Unlearning for 3D medical image segmentation

Nitesh Kumar Singh, Akhilesh Singh, Arjun Arora

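Affiliations: University of California, San Diego

AI summary: Motivated by data privacy regulations, studies approximate unlearning strategies based on four mechanisms for 3D medical image segmentation; evaluated with the Dice coefficient and MAE, the Noisy Label strategy strikes the best balance between the forget set and the retain set.

Comments: 9 pages, 5 figures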

Abstract

New data privacy laws such as the General Data Protection Regulation (GDPR) [1] allow individuals to ask that any of their personal information be erased from trained machine learning models, which has prompted research into unlearning data from models as a way to comply with these laws. In this regard, we consider several approximate unlearning strategies, based on four mechanisms, applied to the MRBrainS18 dataset [2]. We use a 3D ResNet-50 [3], pre-trained with the Med3D framework [4], as the backbone architecture for segmentation. Taking the pre-trained model as a baseline, we evaluate retention accuracy on two types of subjects: retain and forget. We assess these approaches through their Dice similarity coefficient and mean absolute error (MAE) values over two separate training horizons, 20 and 50 epochs. The results show that the Noisy Label strategy had the best overall trade-off, with a decrease of 93% on the forget set while maintaining 84% accuracy on the retain set after 50 epochs. All other strategies showed extreme levels of forgetting at higher epoch numbers while also suffering catastrophic degradation of their retain-set performance. The results of this study provide a strict baseline of performance metrics for unlearning at the subject-specific level and give practitioners clear criteria for selecting an appropriate strategy.
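For readers unfamiliar with the two evaluation metrics, the sketch below computes them on flat binary masks. It is a simplification: the paper segments 3D volumes, presumably with several tissue classes, and its exact averaging scheme is not given in the abstract. The masks are made-up examples.

```python
def dice(pred, target):
    """Dice similarity coefficient 2|A∩B| / (|A| + |B|) for flat 0/1 masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return 1.0 if denom == 0 else 2.0 * intersection / denom

def mae(pred, target):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Made-up 4-voxel masks: the prediction over-segments by one voxel.
prediction = [1, 1, 0, 0]
ground_truth = [1, 0, 0, 0]
print(dice(prediction, ground_truth), mae(prediction, ground_truth))
```

A successful unlearning strategy would drive Dice down on the forget subjects while keeping it high on the retain subjects, which is the trade-off the abstract reports.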

2606.16178 2026-06-16 cs.RO New submission

Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks

Rutav Shah, Rajat Kumar Jenamani, Xiaohan Zhang, Lingfeng Sun, Roberto Martín-Martín, Yuke Zhu, Deva Ramanan, Karl Schmeckpeper

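Affiliations: University of California, Berkeley; Stanford University; Toyota Research Institute

AI summary: Proposes the PRISM architecture, which uses gated attention and hierarchical compression to extend the short-term memory of visuomotor policies to two minutes, outperforming existing methods on the ReMemBench benchmark by 5% to 12%.

Comments: 14 pages, 9 figures, 8 tables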

Abstract

Many robotic tasks require short-term memory, whether it's retrieving an object that's no longer visible or turning off an appliance after a set period. Yet, most visuomotor policies trained via imitation learning rely only on immediate sensory input without using past experiences to guide decisions. We present PRISM, a transformer-based architecture for visuomotor policies to effectively use short-term memory via two key components: (i) gated attention, which filters retrieved information to suppress irrelevant details, improving performance by reducing the spurious correlations between the history and current action prediction; and (ii) a hierarchical architecture that first compresses local information into compact tokens and then integrates them to capture temporally extended dependencies, improving its compute and memory footprint. Together, these mechanisms enable us to scale short-term memory in visuomotor policies for up to two minutes. To systematically evaluate memory in visuomotor control, we introduce ReMemBench, a benchmark of eight diverse household manipulation tasks spanning four categories of short-term memory, designed to foster general memory mechanisms rather than siloed, task-specific solutions. PRISM consistently outperforms prior works, including recurrent architectures, transformers, and their variants, achieving an absolute improvement of 5% to 12% over the strongest baseline. On the RoboCasa and LIBERO benchmarks, it achieves absolute improvements of 11% to 15% over its no-memory variant and fine-tuned Vision-Language-Action baselines such as GR00T-N1-3B and OpenVLA, despite not leveraging any large-scale pretraining. Together, PRISM and ReMemBench establish a foundation for developing and evaluating short-term memory-augmented visuomotor policies that scale to long-horizon tasks. Additional materials are available at https://shahrutav.github.io/short-term-memory
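The first component, gated attention, can be pictured with a minimal sketch: standard scaled dot-product attention over memory tokens, followed by an elementwise sigmoid gate on the retrieved vector. The abstract does not give PRISM's gate formulation, so this form is an assumption. In a trained model the `gate_logits` would be predicted by the network; here they are fixed by hand.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_attention(query, keys, values, gate_logits):
    """Attend over memory tokens, then scale each retrieved feature by a sigmoid gate."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    retrieved = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    return [g * r for g, r in zip(gates, retrieved)]

# Two memory tokens with two features each (illustrative values).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
gate_logits = [10.0, -10.0]   # keep feature 0, suppress feature 1
print(gated_attention(query, keys, values, gate_logits))
```

With these gate values the second retrieved feature is driven close to zero, which shows how a gate could stop an irrelevant part of the history from reaching the action head.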

2606.16175 2026-06-16 cs.AI New submission

PAL-Bench: Evidence-Grounded Profile Reconstruction from Longitudinal Personal Albums

Qiwei Yan, Zhiqiang Yuan, Zexi Jia, Nanxing Hu, Kailin Lyu, Jie Zhou, Jinchao Zhang

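Affiliations: Tsinghua University; Beijing University of Posts and Telecommunications; University of Chinese Academy of Sciences; Zhejiang University

AI summary: Proposes the PAL-Bench benchmark, which uses synthetic users and a privacy-preserving audit to evaluate the reconstruction of user profiles, social relations, and identity mappings from longitudinal personal albums; it finds that existing systems fall short on identity resolution and evidence citation.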

Abstract

Longitudinal personal albums are weak-schema multimodal databases: noisy perceptual records whose key facts require joins across faces, text, timestamps, locations, and repeated events. Existing visual, video, document, and lifelog benchmarks test sub-problems, but not album-scale profile reconstruction with social identity binding and evidence citation. Benchmarking this task is difficult because the ground truth needed for evaluation (owner profiles, social graphs, face-name maps, and evidence provenance) is private state that real albums cannot safely release. We introduce PAL-Bench, a controlled benchmark for evidence-grounded reconstruction under a public-record contract. Its Evidence Compiler builds latent private worlds, programs target-level evidence paths, renders album pixels, re-measures them through perception pipelines, and exports audited public/private views. Agents receive only perception-derived public records; targets, identifier maps, and evidence paths remain hidden. PAL-Bench contains 50 synthetic users, 36,659 public photo records, and 2,799 targets over owner facts, identities, and relations. A privacy-preserving audit with 10 participants confirms that PAL-Bench evidence structures match real private albums, though equivalent releases remain privacy-prohibitive. Across seven systems and two compute-matched diagnostics, a seven-metric protocol reveals a gap between plausible profile summarization and faithful social reconstruction: systems recover some owner facts but struggle with recurring identities and evidence citation. PAL-TRACE, a reference framework that freezes identity bindings before owner-fact mining, performs best but leaves hard identity resolution far from solved. PAL-Bench provides a testbed for perceptual entity resolution, multimodal data integration, temporal evidence aggregation, and provenance-aware structured prediction.

2606.16173 2026-06-16 cs.AI New submission

TimeVista: Exploring and Exploiting Vision-Language Models as Judges for Time Series Forecasting

Zhi Chen, Yuxuan Wang, Jialong Wu, Yong Liu, Haoran Zhang, Xingjian Su, Jianmin Wang, Mingsheng Long

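Affiliations: School of Software, BNRist, Tsinghua University

AI summary: Proposes the TimeVista framework, which uses vision-language models (VLMs) as judges for time series forecasting, combining micro- and macro-level judgments with contextual information to assess forecast quality; experiments show that VLMs agree with human preferences better than traditional metrics.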

Abstract

High-quality time series forecasting is pivotal for real-world decision-making. However, traditional point-wise metrics often fail to reveal complex temporal patterns and align poorly with human intuitive preferences. While the "LLM-as-a-Judge" paradigm has revolutionized text evaluation by providing flexible, human-aligned judgment, its application to time series remains largely unexplored. In this paper, we leverage Vision-Language Models (VLMs) as judges for time series forecasting, harnessing their ability to comprehend time series plots grounded in textual information. Specifically, we propose a novel framework integrating micro- and macro-level judgments informed by contextual information to evaluate time series forecasting. To this end, we introduce TimeVista, a comprehensive VLM-as-a-Judge benchmark comprising 5563 time series samples paired with detailed evaluation rubrics. Extensive meta-evaluations demonstrate that VLMs are highly reliable judges, achieving significantly higher consistency with human preferences than conventional metrics. Building upon our benchmark, we comprehensively assess recent Time Series Foundation Models (TSFMs) under the VLM-as-a-Judge paradigm. Our results demonstrate that VLMs serve as robust and interpretable judges, providing a comprehensive, human-aligned standard for evaluating time series models.

2606.16168 2026-06-16 cs.CV New submission

Fi-Gaussian: Frequency-Aware Implicit Gaussian Splatting for Single Image Dehazing

Yuhan Chen, Ying Fang, Guofa Li, Wenxuan Yu, Yicui Shi, Kunyang Huang, Wenbo Chu, Keqiang Li

Affiliations: College of Mechanical and Vehicle Engineering, Chongqing University; Department of Electrical and Computer Engineering, Carnegie Mellon University; National Innovation Center of Intelligent and Connected Vehicles; School of Vehicle and Mobility, Tsinghua University

AI summary: Proposes the Fi-Gaussian network, whose frequency-aware implicit Gaussian splatting module decouples high- and low-frequency information and aggregates it adaptively; combined with a physics-driven scattering renormalization mechanism, it achieves state-of-the-art single image dehazing.

Abstract

Single image dehazing continues to be hindered by the loss of high-frequency details and the difficulty of accurate physical scattering modeling. To address these issues, we propose Fi-Gaussian, a frequency-aware implicit Gaussian splatting network for single image dehazing. Unlike explicit rendering methods that rely on 3D point clouds, our method employs implicit Gaussian splatting to adaptively model the underlying distribution of clear images as a continuous representation in 2D feature space. The core of the network is a frequency-aware implicit Gaussian splatting module, which decouples low-frequency structural information and high-frequency texture information in the frequency domain and then performs adaptive Gaussian aggregation with complex-valued weights to recover fine details. In addition, a physics-driven scattering renormalization mechanism is introduced to estimate the transmission map and atmospheric light under the guidance of implicit Gaussian priors. Extensive experiments on multiple benchmark datasets demonstrate that Fi-Gaussian achieves state-of-the-art quantitative performance and produces visually superior dehazed results, validating the effectiveness of implicit Gaussian splatting for low-level vision tasks.
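The transmission map and atmospheric light mentioned in this abstract are the two unknowns of the standard atmospheric scattering model, I = J·t + A·(1 − t), where I is the hazy image, J the clear scene, t the transmission, and A the atmospheric light. The sketch below applies and inverts that model on a few made-up pixel intensities. It illustrates only the classical physical model, not the Fi-Gaussian network, and the lower bound `t_min` is a common numerical safeguard added here as an assumption.

```python
def add_haze(clear, transmission, airlight):
    """Forward scattering model: I = J * t + A * (1 - t), per pixel."""
    return [j * t + airlight * (1.0 - t) for j, t in zip(clear, transmission)]

def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Invert the model: J = (I - A) / max(t, t_min) + A."""
    return [(i - airlight) / max(t, t_min) + airlight
            for i, t in zip(hazy, transmission)]

# Made-up intensities in [0, 1]; a lower transmission means thicker haze.
clear = [0.2, 0.5, 0.9]
transmission = [0.8, 0.5, 0.3]
airlight = 1.0

hazy = add_haze(clear, transmission, airlight)
print(hazy)
print(remove_haze(hazy, transmission, airlight))
```

The inversion recovers the clear values only because the true `transmission` and `airlight` are supplied. Estimating those two quantities from a single hazy image is the hard part that dehazing methods address.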

2606.16167 2026-06-16 cs.AI New submission

AI Pluralism and the Worlds It Misses

Rashid Mushkani

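Affiliations: Rashid Mushkani

AI summary: Argues that AI systems impose ontologies, which leads to ontological flattening, and introduces the Pluralistic Lifecycle Governance framework for documenting ontological openness and accountability conditions.

Comments: To be presented at the ICML Pluralistic Alignment Workshop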

Abstract

AI pluralism is often framed as a problem of representing diverse values, preferences, users, or outputs. This paper argues that this framing is incomplete because AI systems also impose ontologies: they define what counts as an entity, relation, feature, harm, benefit, and valid form of evidence. We define ontological flattening as the conversion of situated, contested, and historically specific meanings into a restricted technical category, proxy, aggregation rule, or benchmark target that is treated as neutral and difficult to contest. The paper develops a bounded conceptual and qualitative synthesis across value pluralism, pluralistic alignment, participatory and democratic AI, procedural justice, science and technology studies, accountability research, aggregate themes from 11 expert interviews, and three urban AI companion cases. The cases illustrate how pluralistic methods can improve or structure model behavior while still compressing categories, proxies, aggregation rules, and revision rights before affected actors have procedural standing. We introduce Pluralistic Lifecycle Governance (PLG) as a preliminary qualitative audit scaffold for documenting ontological openness, epistemic inclusion, procedural authority, evaluation pluralism, and lifecycle accountability. PLG is not presented as a validated scoring instrument; it is a framework for making the evidence and governance conditions of pluralistic AI explicit.

2606.16163 2026-06-16 cs.CV New submission

Dehaze-GaussianImage: Zero-Shot Dehazing via Efficient 2D Gaussian Splatting Representation

Yuhan Chen, Wenxuan Yu, Guofa Li, Kunyang Huang, Ying Fang, Yicui Shi, Wenbo Chu, Keqiang Li

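Affiliations: College of Mechanical and Vehicle Engineering, Chongqing University; Department of Electrical and Computer Engineering, Carnegie Mellon University; National Innovation Center of Intelligent and Connected Vehicles; School of Vehicle and Mobility, Tsinghua University

AI summary: Proposes the first zero-shot framework to bring 2D Gaussian splatting to image dehazing; a reconstruction-decoupling learning strategy embeds the atmospheric scattering model, achieving geometric-level decoupling and clear texture reconstruction, and reaches unsupervised state-of-the-art performance with minimal parameters.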

Abstract

Existing single image dehazing methods are often constrained by computational redundancy in pixel-level optimization and the lack of physical interpretability in implicit neural networks. These limitations hinder the balance between representation efficiency and reconstruction fidelity. To address these issues, we propose Dehaze-GaussianImage, the first zero-shot framework that introduces 2D Gaussian Splatting (2DGS) into the image dehazing domain to break the traditional pixel-grid processing paradigm. Distinct from static convolutional neural networks (CNNs) or Transformers, our approach models hazy images as continuous and dynamically evolvable anisotropic Gaussian fields. Specifically, we propose a novel reconstruction-decoupling zero-shot learning strategy that embeds the atmospheric scattering model into the Gaussian parameter space. This strategy drives Gaussian primitives to adaptively split, clone, and prune during optimization, achieving geometric-level decoupling of the transmission medium and clear textures. Furthermore, explicit structure-preserving constraints are introduced to suppress artifacts commonly caused by traditional physical priors. Experimental results demonstrate that the proposed method achieves state-of-the-art (SOTA) performance in a fully unsupervised manner with minimal parameters, highlighting the potential of explicit Gaussian representation for low-level vision tasks.

2606.16161 2026-06-16 cs.CV New submission

Multimodal LLM-Empowered Re-Ranking for Generalizable Person Re-Identification

Jiachen Li, Xiaojin Gong

发表机构 * College of Information Science and Electronic Engineering, Zhejiang University(浙江大学信息与电子工程学院)

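AI summary: Exploits the generalization ability of multimodal large language models (MLLMs): the MLLM is fine-tuned and then used to compute a μ-distance that improves inference-time re-ranking, raising the performance of domain-generalizable person re-identification.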

详情
AI中文摘要

领域泛化(DG)行人重识别(Re-ID)因其在未见真实场景中部署的潜力而吸引了越来越多的研究兴趣。现有大多数方法通过训练领域泛化编码器来处理DG Re-ID,但忽略了推理阶段可能的改进。相比之下,本文探索了一种替代方向,即改进推理重排序以增强DG Re-ID。传统的重排序方法通常依赖于基于邻域的距离来优化初始排序列表,这本质上依赖于Re-ID编码器生成的特征。然而,由于编码器缺乏足够的泛化能力来在未见场景中产生可靠的特征距离,这些方法在目标域上性能下降。受近期多模态大语言模型(MLLM)卓越泛化能力的启发,我们提出了一种MLLM赋能的距离度量来改进DG Re-ID中的重排序。具体来说,我们首先通过监督微调将MLLM适应于Re-ID数据,其中包含一个领域无关的提示和一种查询-候选难例挖掘方案。然后,使用适应后的MLLM在推理过程中计算μ-距离,该距离对领域差距具有鲁棒性,并显著提升后续重排序性能。我们的方法是模型无关的,可以无缝集成到之前的重排序框架中。大量实验表明,我们的方法在多个DG Re-ID基准上持续带来显著的性能提升。本工作的代码将很快在https://github.com/RikoLi/MUSE发布。

英文摘要

Domain Generalizable (DG) person re-identification (Re-ID) has attracted growing research interest due to its potential for deployment in unseen real-world scenarios. Most existing approaches address DG Re-ID by training domain-generalizable encoders but ignore possible refinements at the inference stage. In contrast, this work explores an alternative direction that improves inference-time re-ranking to enhance DG Re-ID. Conventional re-ranking methods typically rely on neighborhood-based distances to refine the initial ranking list, and so depend inherently on features produced by the Re-ID encoder. However, they deteriorate on target domains because the encoder lacks sufficient generalizability to produce reliable feature distances in unseen scenarios. Inspired by the remarkable generalization capabilities of recent Multimodal Large Language Models (MLLMs), we propose an MLLM-empowered distance metric to improve re-ranking in DG Re-ID. Specifically, we first adapt an MLLM to Re-ID data through supervised fine-tuning, which incorporates a domain-agnostic prompt and a query-candidate hard mining scheme. Then, the adapted MLLM is employed to compute a $\mu$-distance during inference, which is robust to the domain gap and significantly enhances subsequent re-ranking performance. Our approach is model-agnostic and can be seamlessly integrated into previous re-ranking frameworks. Extensive experiments demonstrate that our approach consistently yields substantial performance improvements across multiple DG Re-ID benchmarks. The code for this work will be released at https://github.com/RikoLi/MUSE soon.

2606.16160 2026-06-16 cs.LG cs.AI cs.HC 新提交

A comparative and critical study of EEGNet for fNIRS-driven cognitive load classification

EEGNet在fNIRS驱动的认知负荷分类中的比较与批判性研究

Mehshan Ahmed Khan, Houshyar Asadi, Li Zhang, Mohammad Reza Chalak Qazani, Ghazal Bargshady, Stefanos Gkikas, Christian Arzate, Sam Oladazimi, Zoran Najdovsk, Lei Wei, Chee Peng Lim

Affiliations: Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University; Department of Computer Science, Royal Holloway, University of London; College of Science and Engineering, James Cook University; Faculty of Science and Technology, University of Canberra; Honda Research Institute (HRI), Japan; Swinburne University of Technology, Hawthorn, Victoria

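AI summary: Systematically evaluates EEGNet for fNIRS-based cognitive load classification. Overlapping segmentation with small fixed learning rates performs best under random splits, but accuracy drops sharply under subject-independent (SI) evaluation, where non-overlapping segmentation with PCA features reaches the best accuracy of 56.11%, suggesting that removing temporal redundancy helps the model learn more robust cross-subject representations.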

详情
AI中文摘要

由于时间变异性、受试者间差异以及对预处理选择的敏感性,从功能性近红外光谱(fNIRS)信号中准确分类认知负荷仍然是一个重大挑战。本研究通过系统检查时间分割策略(重叠与非重叠)、窗口长度(10秒、20秒、30秒)、特征提取方法(方差分析(ANOVA)、主成分分析(PCA)、快速独立成分分析(FastICA))、学习率配置(固定和自适应)以及评估协议(随机分割与受试者独立(SI))的影响,对EEGNet在基于fNIRS的认知负荷分类中进行了全面评估。随机分割实验的结果表明,重叠分割结合较小的固定学习率(0.01-0.001)由于时间冗余和血流动力学转变的密集采样而产生了最高的准确率。然而,SI评估显示准确率大幅下降,表明对未见参与者的泛化能力有限。在SI评估下,非重叠分割优于重叠窗口,使用PCA特征、20秒窗口和0.1学习率获得了最佳准确率56.11%。这些发现表明,消除时间冗余有助于模型学习更鲁棒和可泛化的跨个体认知负荷表征。尽管自适应学习率策略提高了训练稳定性,但并未超过最优选择的固定学习率的性能。该研究强调了分割策略和学习率选择在提高模型泛化能力中的关键作用,并指出了开发基于fNIRS的可靠、实时和受试者独立认知负荷分类系统所必需的方法学考虑。

英文摘要

Accurately classifying cognitive load from functional near-infrared spectroscopy (fNIRS) signals remains a significant challenge due to temporal variability, inter-subject differences, and sensitivity to preprocessing choices. This study provides a comprehensive evaluation of EEGNet for fNIRS-based cognitive load classification by systematically examining the effects of temporal segmentation strategies (overlapping vs. non-overlapping), window lengths (10 s, 20 s, 30 s), feature extraction methods (Analysis of Variance (ANOVA), Principal Component Analysis (PCA), Fast Independent Component Analysis (FastICA)), learning rate configurations (fixed and adaptive), and evaluation protocols (random split vs. subject-independent (SI)). Results from random-split experiments show that overlapping segmentation, combined with smaller fixed learning rates (0.001-0.01), yields the highest accuracies, owing to temporal redundancy and dense sampling of hemodynamic transitions. However, SI evaluation reveals a substantial drop in accuracy, demonstrating limited generalization to unseen participants. Under SI evaluation, non-overlapping segmentation outperformed overlapping windows, with the best accuracy of 56.11% achieved using PCA features with a 20-second window and a 0.1 learning rate. These findings indicate that eliminating temporal redundancy helps the model learn more robust and generalizable representations of cognitive load across individuals. Although adaptive learning-rate strategies improved training stability, they did not surpass the performance of optimally selected fixed learning rates. The study highlights the critical role of segmentation strategy and learning rate selection in improving model generalization and identifies methodological considerations essential for developing reliable, real-time, and SI cognitive load classification systems using fNIRS.
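
The two design choices this abstract contrasts, overlapping versus non-overlapping windows and random versus subject-independent splits, can be illustrated with a minimal Python sketch. The function names `segment` and `subject_independent_split`, the 10 Hz sampling rate, and the 50% overlap are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def segment(signal, fs, win_s, overlap=0.0):
    """Cut a (time, channels) recording into fixed-length windows.

    overlap is the fraction of each window shared with the next one;
    overlap=0.0 gives non-overlapping segmentation.
    """
    win = int(round(win_s * fs))
    step = max(1, int(round(win * (1.0 - overlap))))
    starts = range(0, signal.shape[0] - win + 1, step)
    return np.stack([signal[s:s + win] for s in starts])

def subject_independent_split(subject_ids, test_subjects):
    """Boolean train/test masks that keep every test subject out of training."""
    subject_ids = np.asarray(subject_ids)
    test_mask = np.isin(subject_ids, list(test_subjects))
    return ~test_mask, test_mask

# Hypothetical recording: 60 s at 10 Hz with 4 channels (values are placeholders).
fs = 10
recording = np.zeros((60 * fs, 4))
non_overlapping = segment(recording, fs, win_s=20)            # 3 windows of 200 samples
overlapping = segment(recording, fs, win_s=20, overlap=0.5)   # 5 windows of 200 samples
```

With overlap, neighbouring windows share samples, so a random split can place near-duplicate segments in both the training and the test set; splitting by subject rules out that kind of leakage across participants.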

2606.16159 2026-06-16 cs.CV 新提交

Continuous Splatting meets Retinex: Continuous Gaussian Splatting and Implicit Reflectance Modeling for Low-Light Image Enhancement

连续Splatting遇见Retinex:用于低光图像增强的连续高斯Splatting与隐式反射建模

Yuhan Chen, Yicui Shi, Guofa Li, Wenxuan Yu, Ying Fang, Guangrui Bai, Wenbo Chu, Keqiang Li

发表机构 * College of Mechanical and Vehicle Engineering, Chongqing University(重庆大学机械与运载工程学院) School of Engineering Science, University of Science and Technology of China(中国科学技术大学工程科学学院) National Innovation Center of Intelligent and Connected Vehicles(国家智能网联汽车创新中心) School of Vehicle and Mobility, Tsinghua University(清华大学车辆与运载学院)

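AI summary: Proposes CGS-Retinex, a low-light image enhancement framework that combines continuous Gaussian splatting with Retinex theory: a continuous parameter field estimates global illumination and an implicit neural representation models reflectance independently, suppressing noise and overexposure while recovering high-frequency structure and color.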

详情
AI中文摘要

低光图像增强旨在从低照度观测中恢复清晰图像,对高级下游视觉任务至关重要。然而,现有方法在平衡全局平滑光照调整和局部高频细节恢复时经常遇到颜色失真和结构伪影。为了解决这些问题,我们提出了CGS-Retinex,这是第一个基于显式-隐式联合建模的低光图像增强框架。我们的框架深度融合了连续高斯Splatting与Retinex理论。具体来说,我们将图像网格表示为连续参数场,并提出连续高斯渲染器来估计空间连续的全局光照分布。这种方法从根本上消除了离散高斯采样引起的网格伪影。此外,我们引入隐式神经表示来独立建模反射率。我们利用浅层高频特征引导网络准确重建退化的纹理细节。在Retinex框架内,我们加入了物理启发的亮度一致性约束和光照平滑正则化,使显式光照和隐式反射率能够保持适当曝光,并实现高频结构和颜色的高保真恢复。大量实验表明,CGS-Retinex通过精确解耦光照和纹理,显著抑制了暗区噪声和过曝,同时实现了卓越的高频结构保真度和色彩恢复。这项工作为低光图像增强建立了一种新的连续物理表示范式。

英文摘要

Low-light image enhancement aims to recover clear images from low-illumination observations and is crucial for high-level downstream vision tasks. However, existing methods frequently encounter color distortion and structural artifacts when balancing global smooth illumination adjustment and local high-frequency detail recovery. To address these issues, we propose CGS-Retinex as the first low-light image enhancement framework based on explicit-implicit joint modeling. Our framework deeply integrates continuous Gaussian splatting with Retinex theory. Specifically, we represent the image grid as a continuous parameter field and propose a continuous Gaussian renderer to estimate the spatially continuous global illumination distribution. This approach fundamentally eliminates grid artifacts caused by discrete Gaussian sampling. Furthermore, we introduce an implicit neural representation to model reflectance independently. We leverage shallow high-frequency features to guide the network in accurately reconstructing degraded texture details. Within the Retinex framework, we incorporate physics-inspired brightness consistency constraints and illumination smoothness regularization to enable explicit illumination and implicit reflectance to maintain proper exposure and achieve high-fidelity recovery of high-frequency structures and colors. Extensive experiments demonstrate that CGS-Retinex significantly suppresses dark-region noise and overexposure while achieving exceptional high-frequency structural fidelity and color restoration by precisely decoupling illumination and texture. This work establishes a novel continuous physical representation paradigm for low-light image enhancement.

2606.16158 2026-06-16 cs.CV cs.CL 新提交

Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

必要时聚焦:用于无训练视觉定位的自适应路由与协作定位

Yifan Wang, Peiming Li, Shiyu Li, Zhiyuan Hu, Xiaochen Yang, Wenming Yang, Yang Tang, Zheng Wei

发表机构 * East China University of Science and Technology(华东理工大学) Tsinghua University(清华大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Science and Technology of China(中国科学技术大学)

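AI summary: Proposes LazyMCoT, a dynamic framework whose adaptive routing estimates uncertainty, skips extra processing for simple queries, and sends hard samples to a collaborative grounding module for two-stage refinement, improving reasoning accuracy while reducing average inference latency.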

详情
AI中文摘要

虽然多模态大语言模型(MLLMs)在跨模态推理方面表现出色,但它们通常难以感知复杂高分辨率图像中的细粒度细节。最近的无训练方法通过图像缩放和局部裁剪来解决这一问题。然而,不加区分地应用这些操作会导致简单查询的计算冗余,并且可能因截断必要的全局上下文或引入无关的背景噪声而降低准确性。为此,我们提出了LazyMCoT,一个动态且无需训练的框架,能够根据样本难度自适应地分配视觉定位工作。该框架具有自适应路由机制,通过单次前向传递的首词统计量来评估预测不确定性。这有效地绕过了置信度高的案例,同时通过保形校准确保困难样本的召回。对于这些具有挑战性的案例,协作定位模块通过两阶段精炼过程,将模型固有的跨模态注意力与外部视觉专家相结合。该精炼过程生成精确的局部显示,以恢复小目标或被遮挡的目标。在多个基准上的大量实验表明,LazyMCoT通过同时提高推理精度和降低平均推理延迟,与基于训练的方法相媲美。我们的代码可在https://github.com/TencentBAC/LazyMCoT获取。

英文摘要

While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these manipulations indiscriminately introduces computational redundancy for simple queries and can degrade accuracy by truncating essential global context or introducing irrelevant background noise. To this end, we propose LazyMCoT, a dynamic and training-free framework that adaptively allocates visual grounding efforts based on sample difficulty. The framework features an Adaptive Routing mechanism that evaluates predictive uncertainty using first-token statistics from a single forward pass. This efficiently bypasses confident cases while ensuring the recall of difficult samples via conformal calibration. For these challenging cases, a Collaborative Grounding module integrates the inherent cross-modal attention of the model with an external visual expert through a two-stage refinement process. This refinement process generates a precise localized display to recover small or occluded targets. Extensive experiments across diverse benchmarks demonstrate that LazyMCoT rivals training-based approaches by simultaneously improving reasoning accuracy and reducing average inference latency. Our code is available at https://github.com/TencentBAC/LazyMCoT.
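
One way such a router could work is sketched below, under stated assumptions: the uncertainty score is taken to be the entropy of the first-token distribution, and the threshold is set by split conformal calibration on held-out samples known to need grounding. The abstract names neither choice, and all function names are hypothetical.

```python
import numpy as np

def first_token_entropy(logits):
    """Shannon entropy (in nats) of the softmax over first-token logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()                      # stabilize the softmax
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def conformal_threshold(hard_scores, alpha):
    """Threshold tau chosen so that an exchangeable hard sample scores
    below tau (and is wrongly skipped) with probability at most alpha."""
    scores = np.sort(np.asarray(hard_scores, dtype=float))
    k = int(np.floor(alpha * (len(scores) + 1)))
    # k < 1 means too few calibration samples: route everything.
    return -np.inf if k < 1 else float(scores[k - 1])

def needs_grounding(logits, tau):
    """Route a sample to the expensive grounding path when it is uncertain."""
    return first_token_entropy(logits) >= tau

# Hypothetical calibration scores from samples that required grounding.
tau = conformal_threshold([0.9, 0.2, 0.5, 1.1], alpha=0.5)
confident = needs_grounding([10.0, 0.0, 0.0, 0.0], tau)   # peaked distribution
uncertain = needs_grounding([0.0, 0.0, 0.0, 0.0], tau)    # uniform distribution
```

The calibration step trades latency against recall: a smaller `alpha` lowers the threshold, so more samples take the slower grounding path but fewer hard samples are skipped.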

2606.16154 2026-06-16 cs.LG 新提交

A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization

RLVR稳定性与胜者优势策略优化的梯度视角

Prasanth YSS, Zhichen Ren, Rasa Hosseinzadeh, Ilan Gofman, Yuqi Chen, Zhaoyan Liu, Guangwei Yu, Jesse C. Cresswell, Satya Krishna Gorti

发表机构 * Berkeley(伯克利) Layer 6 AI

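AI summary: Analyzes the instability of GRPO through token-level gradient dynamics and proposes WAPO, which updates only on positive-advantage completions, improving training stability and matching or outperforming baselines on mathematical reasoning and multi-hop QA.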

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)改进了语言模型的推理能力,但GRPO风格的优化仍然容易崩溃。我们通过令牌级梯度动力学分析这种不稳定性,推导出一个分类法,预测更新如何影响下一个令牌的概率和熵。该分类法表明,稳定性共同取决于当前策略下的优势符号和令牌分布。受此发现启发,我们提出了胜者优势策略优化(WAPO),一种简单的在线裁剪策略梯度目标,仅更新正优势完成。在数学推理和多跳QA基准测试中,WAPO提高了训练稳定性,并在多个模型家族中匹配或超越基线。完整代码可在https://github.com/layer6ai-labs/wapo找到。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) improves language-model reasoning, but GRPO-style optimization remains prone to collapse. We analyse this instability through token-level gradient dynamics, deriving a taxonomy that predicts how updates affect next-token probabilities and entropy. The taxonomy shows that stability depends jointly on the advantage sign and token distribution under the current policy. Motivated by this finding, we propose Winner Advantage Policy Optimization (WAPO), a simple online clipped policy-gradient objective that updates only on positive-advantage completions. Across mathematical reasoning and multi-hop QA benchmarks, WAPO improves training stability and matches or outperforms baselines across multiple model families. Full code can be found at https://github.com/layer6ai-labs/wapo.
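
The idea of updating only on positive-advantage completions can be sketched as a masked clipped surrogate. This is an illustration, not the paper's exact objective: the group-standardized advantages follow the usual GRPO recipe, the sequence-level probability ratio and the clipping range `eps=0.2` are assumptions, and the function names are hypothetical.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantages: rewards standardized within one prompt's group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def positive_only_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate loss averaged over positive-advantage completions only."""
    adv = np.asarray(advantages, dtype=float)
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    surrogate = np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
    winners = adv > 0                    # completions with negative advantage are ignored
    if not winners.any():
        return 0.0                       # no winner in the group: no update signal
    return float(-surrogate[winners].mean())

# Hypothetical group of four completions with binary verifiable rewards.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
loss = positive_only_clipped_loss([-1.0] * 4, [-1.0] * 4, adv)
```

In this sketch a group in which every completion receives the same reward contributes nothing, because no advantage is strictly positive.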

2606.16153 2026-06-16 cs.CV cs.AI 新提交

A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond

医学图像分割综述:挑战、基准与未来展望

Pengyu Zhu, Xiaojing Zhang, Kunbo Zhang, Chunyan Zhang, Zhenyu Wang

发表机构 * School of Control and Computer Engineering, North China Electric Power University(华北电力大学控制与计算机工程学院) SPIC Digital Technology Co., Ltd(国家电投数字科技有限公司) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Department 6 of Health Care, Second Medical Center, People’s Liberation Army General Hospital(中国人民解放军总医院第二医学中心健康医学科六病区)

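AI summary: Systematically surveys medical image segmentation methods built on the U-Net, Transformer, and SAM architectures and analyzes the main challenges, aiming to guide future research and support clinical translation.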

Comments 12 pages, 3 figures, 1 table. All related resources are available at https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main

详情
AI中文摘要

医学图像分割在临床诊断、治疗规划、疾病监测和神经系统疾病识别中发挥着关键作用。本文对其系统发展进行了全面综述,涵盖了广泛使用的公开数据集、基于U-Net、Transformer和SAM架构的代表性方法及其关键评估指标与差异,随后从多个角度分析了主要挑战。与专注于单一模型家族或特定临床应用的综述不同,本综述将基于U-Net、Transformer和SAM的方法组织在一个统一的分析框架内,特别关注它们在提高分割精度和效率方面的有效性。本工作旨在指导医学图像分割的未来研究并支持临床转化,所有相关资源均可在我们的GitHub仓库中公开获取:https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main。

英文摘要

Medical image segmentation plays a critical role in clinical diagnostics, treatment planning, disease monitoring, and neurological disorder identification. This article presents a comprehensive review of its systematic development, covering widely used public datasets, representative methods built on the U-Net, Transformer, and SAM architectures, and key evaluation metrics with their differences, followed by an analysis of major challenges from multiple perspectives. Unlike surveys that focus on a single model family or a specific clinical application, this review organizes U-Net-, Transformer-, and SAM-based methods within a unified analytical framework, with a particular focus on their effectiveness in improving segmentation accuracy and efficiency. This work aims to guide future research and support clinical translation of medical image segmentation, with all related resources publicly available in our GitHub repository: https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main.

2606.16152 2026-06-16 cs.AI 新提交

The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

质量-效用悖论:为什么高奖励数据会损害小模型的数学推理

Haolong Qian, Xianliang Yang, Yinuo Ma, Lirong Che, Feng Lu, Ye Guo, Lei Song, Jiang Bian, Chun Yuan

发表机构 * University of Science and Technology of China(中国科学技术大学)

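AI summary: Identifies a Quality-Utility Paradox in mathematical reasoning distillation: Oracle-refined, high-reward data raises adaptation cost through distributional drift and underperforms data generated by the SLM itself; Style-Aligned Refinement is proposed to restore utility.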

Comments Accepted at ICML 2026

详情
AI中文摘要

从强大推理模型进行知识蒸馏被广泛用于提升小语言模型(SLM)的数学推理能力,通常假设奖励模型得分更高的轨迹能提供更有用的监督。我们在数学推理蒸馏中发现了一个反直觉的质量-效用悖论。由更强Oracle精炼或合成的数据根据奖励模型获得更高的感知质量,但在Qwen2.5、LLaMA-3和DeepSeek系列中,其表现始终不如SLM自身生成并通过拒绝采样选择的轨迹。我们的分析表明,Oracle精炼将逻辑修复与偏离SLM原生推理分布的分布漂移相结合。这种漂移增加了学习者的适应成本,可能抵消改进推理逻辑带来的收益。为验证这一机制,我们引入风格对齐精炼,在保留Oracle逻辑修复的同时保持SLM的原生轨迹。这种干预降低了适应成本并恢复了下游效用。这些发现表明,有效的数学推理蒸馏应联合优化感知解质量和学习者-数据兼容性,而非仅依赖奖励模型得分。数据集和代码见https://github.com/Dracoqhl/Quality-Utility-Paradox。

英文摘要

Knowledge distillation from powerful reasoning models is widely used to improve Small Language Models (SLMs) on mathematical reasoning, often assuming that traces with higher reward model scores provide more useful supervision. We identify a counterintuitive Quality-Utility Paradox in mathematical reasoning distillation. Data refined or synthesized by a stronger Oracle obtains higher perceived quality according to reward models, yet consistently underperforms traces generated by the SLM itself and selected through rejection sampling across Qwen2.5, LLaMA-3, and DeepSeek families. Our analysis shows that Oracle refinement couples logical repair with distributional drift away from the SLM's native reasoning distribution. This drift increases the learner's adaptation cost and can outweigh the benefit of improved reasoning logic. To test this mechanism, we introduce Style-Aligned Refinement, which preserves the native trajectory of the SLM while retaining logical repair from the Oracle. This intervention lowers adaptation cost and restores downstream utility. These findings suggest that effective mathematical reasoning distillation should jointly optimize perceived solution quality and learner-data compatibility, rather than relying solely on reward-model scores. The datasets and code are available at https://github.com/Dracoqhl/Quality-Utility-Paradox.

2606.16151 2026-06-16 cs.CL 新提交

GRACE: Step-Level Benchmark for Faithful Reasoning over Context

GRACE:基于上下文忠实推理的步骤级基准

Hoang Pham, Dong Le, Anh Tuan Luu

发表机构 * Nanyang Technological University(南洋理工大学) VinUniversity

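AI summary: Introduces GRACE, the first human-annotated step-level faithfulness benchmark, which evaluates chain-of-thought steps in context-grounded reasoning with a data-driven error taxonomy and shows that step-level faithfulness signals improve downstream accuracy and reasoning reliability.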

详情
AI中文摘要

许多推理任务要求模型基于输入上下文进行推理,从文档问答到基于规则的演绎。链式思维提示产生的轨迹看似透明,但单个步骤可能悄然偏离源证据,即使最终答案正确。现有方法在响应级别检测幻觉,但无法识别链中失败的位置或类型。我们引入GRACE,这是首个带人工标注的步骤级忠实性基准,具有数据驱动的错误分类法,用于基于上下文的文本推理。GRACE涵盖来自4个源数据集的10个模型的CoT轨迹,每个步骤都标注了忠实性、错误类别和自然语言解释。通过无监督聚类自底向上发现的数据驱动分类法将失败分为两个轨道:GRACE-Inference(演绎错误)和GRACE-Grounding(事实基础错误),每个轨道包含四个类别。评估集是人工标注的,且设计上具有挑战性。我们的实验揭示了当前模型存在巨大的改进空间。此外,将步骤级忠实性信号集成到强化学习管道中可提高下游准确性和推理可靠性。

英文摘要

Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.

2606.16149 2026-06-16 cs.AI 新提交

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

LiteOdyssey: 一种用于可解释罕见病诊断的轻量级推理AI智能体

Minh-Ha Nguyen, Erica Gray, Chih-Ting Yang, Rizwan Hamid, Lingyao Li, Siyuan Ma, Thomas A. Cassini, Cathy Shyr

发表机构 * Vanderbilt University(范德堡大学) Vanderbilt University Medical Center(范德堡大学医学中心) University of South Florida(南佛罗里达大学)

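AI summary: Proposes LiteOdyssey, a lightweight framework that augments a single reasoning language model with a diagnostic policy developed through human-AI collaboration and with public biomedical tools, reaching state-of-the-art performance on rare-disease diagnosis benchmarks without fine-tuning or multi-agent ensembles.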

Comments 21 pages, 5 main figures, working version 1

详情
AI中文摘要

大多数医疗AI系统通过扩展额外机制来改进:更多的微调数据、更多的智能体和/或更大的检索数据库。然而,在罕见病诊断中,这种扩展可能导致系统难以部署、审计和维护。我们探究是否可以通过扩展单个AI智能体的推理链来实现最先进的诊断性能:通过人类-AI协作开发的诊断策略指导它,并利用可免费获取的生物医学工具进行增强。我们引入了LiteOdyssey,一个轻量级罕见病诊断框架,通过临床遗传学工作流引导推理语言模型。该框架通过人类反馈的策略迭代(PIHF)开发,并动态访问公共生物医学工具。在两个仅提供患者临床特征的高难度基准上,LiteOdyssey取得了最先进的性能,在LIRICAL(n=370)和PhenoPacket Store(n=873)的合并1243个病例中,总体疾病Recall@1达到59.3%。这两个基准中超高罕见病(患病率低于1/1,000,000)的比例很高,分别约为45%和52.8%。在更困难的PhenoPacket子集上(其中因果疾病未在我们的稀有性映射流程中映射到Orphanet),LiteOdyssey实现了60.7%的Recall@1,而相同基线模型(GPT-5.4)不使用工具时为10.7%。这一性能是在没有微调、多智能体集成或大型病例检索数据库的情况下实现的。在开发过程中未见过的病例、真实世界罕见病患者的私人队列以及较小的开源权重模型上也观察到了增益。LiteOdyssey为罕见病AI系统指明了一条路径,使其准确、易于部署且对医生审查更透明。

英文摘要

Most medical AI systems improve by scaling additional machinery: more fine-tuning data, more agents, and/or larger retrieval databases. In rare-disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state-of-the-art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy developed through human-AI collaboration and augmenting it with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare-disease diagnostic framework that guides a reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state-of-the-art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks contain a high proportion of ultra-rare diseases (prevalence below 1 in 1,000,000): approximately 45% and 52.8% of cases, respectively. On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity-mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT-5.4) without tools. This performance was achieved without fine-tuning, multi-agent ensembles, or a large case-retrieval database. Gains were also observed on cases never seen during development, on a private cohort of real-world rare disease patients, and with a smaller open-weights model. LiteOdyssey suggests a path toward rare-disease AI systems that are accurate, easier to deploy, and more transparent for physician review.
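
For reference, Recall@k in this kind of benchmark is the fraction of cases whose true diagnosis appears among the top k ranked candidates. The sketch below uses hypothetical disease identifiers and a hypothetical `recall_at_k` helper; only the case counts 370 and 873 come from the abstract.

```python
def recall_at_k(ranked_predictions, true_labels, k=1):
    """Fraction of cases whose true diagnosis appears in the top-k predictions."""
    hits = sum(truth in preds[:k] for preds, truth in zip(ranked_predictions, true_labels))
    return hits / len(true_labels)

# Case counts quoted in the abstract: the pooled benchmark has 1,243 cases.
n_lirical, n_phenopacket = 370, 873
n_total = n_lirical + n_phenopacket

# Toy cases with hypothetical disease identifiers (not real benchmark data).
preds = [["D1", "D2"], ["D3", "D4"], ["D5", "D6"]]
truth = ["D1", "D4", "D9"]
r1 = recall_at_k(preds, truth, k=1)   # only the first case is a top-1 hit
r2 = recall_at_k(preds, truth, k=2)   # the second case is recovered at rank 2
```

Because the overall figure pools both benchmarks, it behaves like a case-weighted average, so the larger PhenoPacket Store set carries more weight than LIRICAL.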

2606.16140 2026-06-16 cs.AI cs.CL 新提交

VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models

VibeThinker-3B:探索小型语言模型中可验证推理的前沿

Sen Xu, Shixi Liu, Wei Wang, Jixin Min, Yingwei Dai, Zhibin Yin, Yirong Chen, Xin Zhou, Junlin Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

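AI summary: Presents VibeThinker-3B, a compact 3B-parameter model that uses the Spectrum-to-Signal post-training paradigm (curriculum SFT, multi-domain reinforcement learning, offline self-distillation) to reach frontier-level performance on verifiable reasoning tasks, matching or exceeding much larger models without compromising instruction controllability.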

详情
AI中文摘要

本技术报告介绍了VibeThinker-3B,一个具有3B参数的紧凑密集模型,旨在探究在严格的小模型范围内可验证推理能推进到何种程度。基于频谱到信号后训练范式,我们通过优化的流程系统性地增强模型,该流程包括基于课程的监督微调、多域强化学习和离线自蒸馏。实验评估表明,VibeThinker-3B在高度要求的可验证任务上达到了前沿水平。具体来说,它在AIME26上获得94.3分(通过声明级测试时缩放提升至97.1),在LiveCodeBench v6上获得80.2的Pass@1,并在最近的未见LeetCode竞赛中表现出强大的分布外泛化能力,接受率达96.1%。这有效地将其置于一流推理系统的性能区间,匹配或超越规模大数个数量级的旗舰模型,如DeepSeek V3.2、GLM-5和Gemini 3 Pro。此外,IFEval上的93.4分证实了这种极端的推理增强并未损害严格的指令可控性。扩展我们之前的1.5B工作,这些发现推动了参数压缩-覆盖假说,该假说将可验证推理视为可压缩到紧凑推理核心中,而开放域知识和通用能力则需要广泛的参数覆盖事实、概念和长尾场景。这一观点表明,紧凑模型不仅是部署高效的替代品,更是通往参数密集能力领域前沿性能的互补路径。

英文摘要

This technical report introduces VibeThinker-3B, a compact dense model with 3B parameters developed to investigate how far verifiable reasoning can be pushed within a strictly small-model regime. Building upon the Spectrum-to-Signal post-training paradigm, we systematically enhance the model through an optimized pipeline that includes curriculum-based supervised fine-tuning, multi-domain reinforcement learning, and offline self-distillation. Experimental evaluations demonstrate that VibeThinker-3B achieves frontier-level performance on highly demanding verifiable tasks. Specifically, it attains a score of 94.3 on AIME26 (improving to 97.1 with claim-level test-time scaling), an 80.2 Pass@1 on LiveCodeBench v6, and exhibits strong out-of-distribution generalization with a 96.1% acceptance rate on recent unseen LeetCode contests. This effectively places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2, GLM-5, and Gemini 3 Pro. Furthermore, a score of 93.4 on IFEval confirms that this extreme reasoning enhancement does not compromise strict instruction controllability. Extending our previous 1.5B work, these findings motivate the Parametric Compression-Coverage Hypothesis, which views verifiable reasoning as compressible into compact reasoning cores, while open-domain knowledge and general-purpose competence require broad parameter coverage over facts, concepts, and long-tail scenarios. This perspective suggests that compact models are not merely deployment-efficient substitutes, but a complementary path toward frontier-level performance in parameter-dense capability regimes.

2606.16137 2026-06-16 cs.CL cs.AI 新提交

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

基于XAI的语音深度伪造检测解释生成:使用免训练多模态大语言模型

Yupei Li, Qiyang Sun, Xiaoliang Wu, Chenxi Wang, Berrak Sisman, Björn W. Schuller

发表机构 * Imperial College London(帝国理工学院) Technical University of Munich(慕尼黑工业大学) University of Southampton(南安普顿大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Johns Hopkins University(约翰霍普金斯大学)

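AI summary: To address the lack of explainability in speech deepfake detection, proposes a training-free framework that combines XAI evidence with multimodal large language models to generate grounded, specific explanations; on the PartialSpoof dataset, inside accuracy increases by over 45%.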

Comments Accepted at Interspeech 2026

详情
AI中文摘要

语音深度伪造检测(SDD)系统需要可信的解释以进行可靠的决策。现有的解释方式主要分为两类。传统的可解释人工智能(XAI),如基于梯度的归因,产生与模型决策紧密耦合的低级归因信号,且比自然语言解释更难被人类理解。同时,基于大语言模型(LLM)的解释生成通常由于缺乏启发式证据和任务特定监督(源于SDD有限的基于证据的解释数据集)而产生通用且无根据的描述。因此,我们提出一种免训练解释框架,将XAI证据与多模态LLM集成,以生成基于证据的特定解释。使用PartialSpoof数据集,我们构建了一个基于证据的解释数据集,并表明带有XAI的方法将内部准确率提高了超过45%,通过人工评估和忠实性检查得到验证。

英文摘要

Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation approaches fall mainly into two categories. Traditional explainable AI (XAI), such as gradient-based attribution, produces low-level attribution signals that are tightly coupled with model decisions and harder for humans to understand than natural-language explanations. Meanwhile, large language model (LLM)-based explanation generation often produces generic and ungrounded descriptions due to the lack of heuristic evidence and task-specific supervision, which stems from the limited availability of grounded explanation datasets for SDD. We therefore propose a training-free explanation framework that integrates XAI evidence with multimodal LLMs to generate grounded and specific explanations. Using the PartialSpoof dataset, we construct a grounded explanation dataset and show that methods with XAI increase inside accuracy by over 45%, as verified through human evaluation and faithfulness checks.

2606.16131 2026-06-16 cs.CV cs.LG 新提交

Shift-and-Sum Quantization for Visual Autoregressive Models

Shift-and-Sum 量化用于视觉自回归模型

Jaehyeon Moon, Bumsub Ham

发表机构 * Yonsei University(延世大学) Articron

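AI summary: Proposes a post-training quantization framework for visual autoregressive models that uses shift-and-sum quantization to reduce errors in attention-value products and a resampling strategy for calibration data, setting a new state of the art on image generation and related tasks.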

Comments ICLR 2026

详情
AI中文摘要

训练后量化(PTQ)能够使用少量数据实现深度网络的高效部署。然而,其在视觉自回归模型(VAR)上的应用仍相对未被探索。我们识别出将PTQ应用于VAR的两个关键挑战:(i)注意力值乘积中的大重建误差,尤其是在高注意力分数更频繁出现的粗尺度上;(ii)由于有限的校准数据,码本条目的采样频率与其预测概率之间存在差异。为了解决这些挑战,我们提出了一种针对VAR的PTQ框架。首先,我们引入了一种移位求和量化方法,通过聚合值令牌的对称移位副本的量化结果来减少重建误差。其次,我们提出了一种校准数据的重采样策略,使码本条目的采样频率与其预测概率对齐。在类别条件图像生成、修复、外推和类别条件编辑上的实验表明,该方法在VAR架构上取得了一致的改进,为VAR的PTQ建立了新的最先进水平。

英文摘要

Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention-value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, inpainting, outpainting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.
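
A toy version of the shift-and-sum idea is sketched below, assuming a plain round-to-nearest uniform quantizer and a shift of a quarter step (neither is specified in the abstract): each value is duplicated with shifts of +s and -s, both copies are quantized, and the results are averaged. Function names are hypothetical.

```python
import numpy as np

def quantize(x, step):
    """Uniform round-to-nearest quantizer with the given step size."""
    return step * np.round(np.asarray(x, dtype=float) / step)

def shift_and_sum(x, step, shift=None):
    """Average the quantized values of two symmetrically shifted copies of x."""
    x = np.asarray(x, dtype=float)
    shift = step / 4 if shift is None else shift   # quarter-step shift is an assumption
    return 0.5 * (quantize(x + shift, step) + quantize(x - shift, step))

# Hypothetical value-token entries on a dense grid, and a coarse quantization step.
values = np.linspace(-1.0, 1.0, 2001)
step = 0.25
err_plain = np.abs(quantize(values, step) - values).mean()
err_shifted = np.abs(shift_and_sum(values, step) - values).mean()
```

With a quarter-step shift the average lands on a grid twice as fine as the original one, which is why the mean error drops in this toy setting; how the paper applies the idea inside the attention-value product may differ.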

2606.16127 2026-06-16 cs.CL cs.AI cs.LG 新提交

AuAu: A Benchmark for Auditing Authoritarian Alignment in Large Language Models

AuAu: 大型语言模型中威权对齐审计基准

Andreas Einwiller, Max Klabunde, Florian Lemmerich

发表机构 * University of Zurich(苏黎世大学)

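AI summary: Introduces the AuAu benchmark, which combines psychometric questions, contextual behavior vignettes, and user prompts to assess authoritarian tendencies in LLMs; all 17 evaluated models show substantial authoritarian response rates, and a system prompt can manipulate most of them.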

Comments v1, 50 pages

详情
AI中文摘要

全球威权主义的浪潮,加上用户日常生活中日益核心的角色,引发了特定模型在多大程度上展现或促进威权态度和特征的问题。我们引入了AuAu,一个旨在评估LLM生成具有威权倾向响应风险的全面基准。该基准结合了三种评估方法:(i) 来自15个经过人类验证的广泛工具库的心理测量问题;(ii) 在具体情境中探究意图行为的情境行为小故事;(iii) 对现实用户提示的响应。与先前工作不同,AuAu不仅评估对威权主义的一般亲近程度,还评估已建立的子概念:威权攻击、威权服从和传统主义。评估来自中国、欧盟、俄罗斯和美国的17个模型,我们发现所有测试模型在心理测量评估下都表现出显著的威权响应率,尽管在越来越现实的下游任务中,该比率显著下降。我们进一步发现,威权系统提示容易操纵17个模型中的15个以促进增强的威权主义。我们的结果强调了持续、系统性地审计基于LLM的AI系统的必要性,以检测并最终减轻生成输出中不期望的威权倾向。我们的代码和数据可在 https://github.com/andreaseinwiller/AuAu 获取。

英文摘要

The worldwide surge of authoritarianism, combined with the increasingly central role of LLMs in users' everyday lives, raises the question of to what extent specific models exhibit or promote authoritarian attitudes and characteristics. We introduce AuAu, a comprehensive benchmark that aims to assess the risk of LLMs generating responses with authoritarian tendencies. This benchmark combines three evaluation approaches: (i) psychometric questions from an extensive pool of 15 human-validated instruments; (ii) contextual behavior vignettes probing intended actions in concrete situations; and (iii) responses to realistic user prompts. Unlike prior work, AuAu evaluates not only a general closeness to authoritarianism but also the established sub-concepts Authoritarian Aggression, Authoritarian Submission, and Conventionalism. Evaluating 17 models from China, the EU, Russia, and the USA, we find that all tested models exhibit substantial authoritarian response rates under the psychometric evaluation, though rates drop significantly in increasingly realistic downstream tasks. We further find that an authoritarian system prompt easily manipulates 15 out of 17 models to promote increased authoritarianism. Our results underscore the need for continued, systematic auditing of LLM-based AI systems to detect and ultimately mitigate undesired authoritarian tendencies in generated output. Our code and data are available at: https://github.com/andreaseinwiller/AuAu

2606.16124 2026-06-16 cs.CV 新提交

Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

面向遥感图像与视频的无训练开放词汇视觉定位

Ke Li, Di Wang, Yongshan Zhu, Ting Wang, Weiping Ni, Tao Lei, Quan Wang, Xinbo Gao

发表机构 * School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院) Interdisciplinary Institute of Artificial Intelligence, Xidian University(西安电子科技大学跨学科人工智能研究院) School of Artificial Intelligence, Xidian University(西安电子科技大学人工智能学院) Northwest Institute of Nuclear Technology(西北核技术研究所) School of Physics and Information Engineering, Fuzhou University(福州大学物理与信息工程学院)

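AI summary: Proposes RSVG-ZeroOV, a training-free framework that uses frozen generic foundation models and an Overview-Focus-Evolve paradigm for zero-shot open-vocabulary remote sensing visual grounding, extends it to spatio-temporal video grounding, and outperforms existing zero-shot methods on multiple benchmarks.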

详情
AI中文摘要

遥感视觉定位(RSVG)旨在根据自然语言表达在遥感图像或视频中定位所指目标。现有的RSVG方法通常依赖于任务特定的手动标注,这些标注收集成本高昂,且在覆盖真实世界地理空间场景的多样性方面不可避免地存在局限。因此,它们往往难以泛化到涉及新物体、细粒度属性、复杂空间关系和功能语义的开放词汇查询。本文提出RSVG-ZeroOV,一个无训练框架,利用冻结的通用基础模型进行零样本开放词汇RSVG。RSVG-ZeroOV遵循概览-聚焦-演化范式,利用视觉语言模型(VLM)和扩散模型(DM)独特且互补的注意力模式逐步生成精确的定位结果。具体而言:(i) 概览利用VLM提取交叉注意力图,捕获指代表达与视觉区域之间的语义相关性;(ii) 聚焦利用DM的细粒度建模先验,补偿VLM注意力常忽略的物体结构和形状信息;(iii) 演化引入一个简单而有效的注意力演化模块,抑制无关激活,产生纯净的物体掩码。为处理视频输入,我们进一步提出Video RSVG-ZeroOV,通过查询相关关键帧选择器和时序传播器将图像级定位扩展到时空定位,无需视频标注或微调即可实现高效且时序一致的视频定位。在六个图像和视频定位基准上的大量实验表明,RSVG-ZeroOV持续优于现有零样本基线,并与弱监督和全监督方法相比达到有竞争力或更优的性能。

英文摘要

Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results. Specifically, (i) Overview utilizes a VLM to extract cross-attention maps that capture semantic correlations between the referring expression and visual regions; (ii) Focus leverages the fine-grained modeling priors of a DM to compensate for object structure and shape information often overlooked by VLM attention; and (iii) Evolve introduces a simple yet effective attention evolution module to suppress irrelevant activations, yielding purified object masks. To handle video inputs, we further present Video RSVG-ZeroOV, which extends image-level grounding to spatio-temporal grounding through a query-relevant key-frame selector and a temporal propagator, enabling efficient and temporally coherent video grounding without video annotations or fine-tuning. Extensive experiments on six image and video grounding benchmarks show that RSVG-ZeroOV consistently outperforms existing zero-shot baselines and achieves competitive or superior performance compared with weakly- and fully-supervised methods.

2606.16122 2026-06-16 cs.AI New submission

Thinking with Visual Grounding

Junkai Zhang, Yihe Deng, Kai-Wei Chang, Wei Wang

Institutions * University of California, Los Angeles

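AI summary Proposes visually grounded thinking, in which a vision-language model interleaves natural-language reasoning with visual groundings (points or boxes). Trained with a synthetic data pipeline and grounding-aware reinforcement learning, the approach consistently improves performance on counting and spatial reasoning tasks.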

English abstract

Visual thinking should not only sound right; it should show its evidence. While recent vision-language models (VLMs) can produce natural-language reasoning traces, these traces often leave the supporting image regions implicit, making them hard to verify and difficult to supervise. We introduce visually grounded thinking, a reasoning process in which models interleave natural-language thoughts with explicit point or box groundings of the visual evidence used at each step. This lets the model express intermediate reasoning in language while grounding key objects in the image regions they refer to. To train this behavior, we construct a scalable synthesis pipeline that distills correct visual reasoning traces, extracts the visual objects required by the traces, grounds them with a SAM3-based agent, and derives aligned point and box supervision from the resulting masks. We further propose grounding-aware reinforcement learning, which combines answer correctness rewards with dense grounding rewards that score whether generated object references match the correct image evidence. Across two counting benchmarks and four spatial reasoning benchmarks, adding visually grounded thinking to Gemma3-4B-IT consistently improves performance over the original model and the non-grounded thinking baseline. On spatial reasoning, the visually grounded thinking 4B models match, and in some cases surpass, Gemma3-27B-IT from the same model family. Our analysis shows that point grounding is well suited to counting, while box grounding benefits most from explicit grounding rewards on spatial tasks. Overall, our results show that VLMs think better when their intermediate thoughts are tied to the image regions that make them true.
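
The grounding-aware reward described in the abstract above combines an answer correctness term with a dense grounding term. The sketch below is one minimal way such a reward could be shaped for box groundings. The index-based pairing of predicted and reference boxes, the use of IoU as the grounding score, and the `weight` value are assumptions for illustration and are not taken from the paper.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_boxes, ref_boxes):
    """Mean IoU over index-paired boxes; missing or extra boxes score 0."""
    n = max(len(pred_boxes), len(ref_boxes))
    if n == 0:
        return 0.0
    total = sum(box_iou(p, r) for p, r in zip(pred_boxes, ref_boxes))
    return total / n


def total_reward(answer, gold, pred_boxes, ref_boxes, weight=0.5):
    """Answer correctness (0 or 1) plus a weighted grounding term.

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    correct = 1.0 if answer == gold else 0.0
    return correct + weight * grounding_reward(pred_boxes, ref_boxes)


# A correct answer whose single box overlaps the reference by one third.
reward = total_reward("3", "3", [(0, 0, 2, 2)], [(1, 0, 3, 2)])
print(round(reward, 4))
```

A real system would likely match predicted objects to references by name or by an assignment step rather than by list position; index pairing is used here only to keep the example short.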

2606.16119 2026-06-16 cs.CV New submission

EdgeZSAD: Practical Zero-Shot Anomaly Detection on Edge Devices

Taewan Cho, Andrew Jaeyong Choi

Institutions * Gachon University; Plaid Labs Inc.

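AI summary Targeting edge deployment constraints, proposes a compact zero-shot anomaly detection system built on a TinyViT-21M-512 backbone, an asymmetric global-local readout (EdgeGLR), and a reproducible source-side training recipe (Real-IAD-DR). It reaches an average image AUROC of 91.6 on MVTec-AD and 88.2 on VisA and is directly deployable on edge devices.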

English abstract

Industrial inspection needs zero-shot anomaly detection (ZSAD) that remains useful under edge deployment constraints. Recent methods often rely on ViT-L foundation backbones (~300M parameters), which exceed the memory and operator budget of typical embedded hardware. We study this regime through EdgeZSAD, a compact reference system built around a TinyViT-21M-512 backbone, an asymmetric global-local readout (EdgeGLR), and a reproducible source-side training recipe (Real-IAD-DR). We train a single checkpoint in a source-trained, target-unseen protocol and evaluate it across six industrial benchmarks. Across three independent runs, the resulting model reaches an average image AUROC of 91.6 on MVTec-AD and 88.2 on VisA, while remaining directly deployable on Jetson Orin Nano Super (TensorRT FP16) and RB5 Gen2 (QNN GPU FP16). Across the six device-rescored benchmarks, image-AUROC drift stays below 0.2 points, indicating that the exported graph preserves host-side ranking behavior in the evaluated deployment setting.
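
The image-AUROC drift mentioned in the abstract above compares ranking quality between host-side and on-device scores. The sketch below shows one way such a drift could be computed, using the pairwise (rank-based) definition of AUROC. The function names and the toy scores are made up for illustration, and the paper's actual evaluation code may differ.

```python
def auroc(scores, labels):
    """Pairwise AUROC: chance that an anomalous image (label 1) scores
    higher than a normal one (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_drift_points(host_scores, device_scores, labels):
    """Absolute AUROC difference in percentage points (AUROC x 100)."""
    return abs(auroc(host_scores, labels) - auroc(device_scores, labels)) * 100.0


# Toy data: device scores are slightly perturbed (as reduced precision
# might do) but keep the same ordering as the host scores.
labels = [0, 0, 1, 1, 0, 1]
host_scores = [0.10, 0.40, 0.80, 0.35, 0.30, 0.90]
device_scores = [0.11, 0.41, 0.79, 0.34, 0.31, 0.88]
drift = auroc_drift_points(host_scores, device_scores, labels)
print(auroc(host_scores, labels), drift)
```

Because AUROC depends only on the ordering of scores, small numeric changes that preserve the ordering produce zero drift, which is why a small drift can be read as preserved ranking behavior.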

2606.16118 2026-06-16 cs.AI cs.CL cs.LO New submission

Know Your Limits: On the Faithfulness of LLMs as Solvers and Autoformalizers in Legal Reasoning

Olivia Peiyu Wang, Sanna Wong-Toropainen, Daneshvar Amrollahi, Ryan Bai, Tashvi Bansal, Arush Garg, Leilani H. Gilpin

Institutions * UC Santa Cruz; Univ. Helsinki; CodeX, Stanford; Stanford University; Canyon Crest Academy; Monta Vista High School; Los Altos High School

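AI summary Studies whether LLMs perform faithful logical inference in legal reasoning, and finds that the high benchmark performance of LLM-based formal reasoning masks unfaithful patterns such as scope laundering, revealing a fundamental gap between benchmark accuracy and logical faithfulness.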

Comments 10 pages, submitted to COLM 2026 (under review, average score of 6.25 across 4 reviewers) and accepted by the AI4Law workshop at ICML. This version already addresses most of the COLM reviewers' comments.

English abstract

Large Language Models (LLMs) achieve strong performance on reasoning tasks, but whether this reflects faithful logical inference or heuristic approximation remains unclear. We study this question in legal entailment by comparing three paradigms, including pure LLM classification, LLM-based Formal Reasoning, and solver-based Formal Reasoning using the Z3 SMT solver, on a re-annotated subset of ContractNLI across five LLMs. Our re-annotation reveals a systematic and measurable gap between pragmatic legal interpretation and strict formal entailment, where a substantial proportion of legally sound inferences are not formally grounded without additional unstated assumptions. While introducing formal structure improves accuracy, with LLM-based Formal Reasoning achieving the highest benchmark performance, we show that this gain does not imply faithful reasoning. We identify three recurring failure modes: scope laundering, where LLMs report solver-inconsistent classifications without executing the underlying formal reasoning, producing conclusions that appear logically grounded but are not; implicit constraint blindness, where LLMs overlook logical constraints present in formal representations; and program synthesis failures, where LLMs generate incorrect Z3 code despite structured prompting. Critically, scope laundering persists across all models, raising serious concerns about the faithfulness of LLM-based formal reasoning as a proxy for symbolic execution. These results reveal a fundamental gap between benchmark accuracy and logical faithfulness.
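
Strict formal entailment, as contrasted with pragmatic interpretation in the abstract above, is commonly checked by asking a solver whether the premises together with the negated hypothesis are unsatisfiable. The sketch below replaces the Z3 SMT solver with brute-force truth-table enumeration over propositional variables so that it runs with the standard library only. The clause names, the three output labels, and the toy contract are hypothetical and are not drawn from ContractNLI or the paper.

```python
from itertools import product


def models(variables):
    """Yield every truth assignment over the given variable names."""
    for values in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, values))


def satisfiable(formulas, variables):
    """True if some assignment makes every formula true."""
    return any(all(f(m) for f in formulas) for m in models(variables))


def classify(premises, hypothesis, variables):
    """Entailed if premises plus NOT hypothesis is unsatisfiable;
    contradicted if premises plus hypothesis is unsatisfiable."""
    if not satisfiable(premises + [lambda m: not hypothesis(m)], variables):
        return "entailed"
    if not satisfiable(premises + [hypothesis], variables):
        return "contradicted"
    return "not determined"


variables = ["confidential", "may_share", "is_affiliate"]
premises = [
    lambda m: m["confidential"],                              # the data is confidential
    lambda m: (not m["confidential"]) or (not m["may_share"]),  # confidential implies no sharing
]

no_sharing = classify(premises, lambda m: not m["may_share"], variables)
sharing = classify(premises, lambda m: m["may_share"], variables)
affiliate = classify(premises, lambda m: m["is_affiliate"], variables)
print(no_sharing, sharing, affiliate)
```

The third hypothesis comes out as "not determined" because nothing in the premises fixes `is_affiliate`. A reader might still find such a claim plausible in context, which is the kind of inference that needs an extra unstated assumption to be formally grounded.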