arXivDaily: arXiv daily academic digest, updated Monday through Friday
2601.18577 2026-05-21 cs.CV cs.LG

Self-Refining Video Sampling

Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Saining Xie, Jaehong Yoon, Sung Ju Hwang

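AI summary This paper proposes a self-refining video sampling method that uses a pre-trained video generator as its own refiner, enabling iterative inner-loop refinement at inference time without an external verifier or additional training, and improving motion coherence and physics alignment.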

Comments ICML 2026. Project page: https://agwmon.github.io/self-refine-video/

English abstract

Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to the default sampler and guidance-based sampler.

2601.05877 2026-05-21 cs.CL

iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha

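AI summary This paper proposes iReasoner, a self-evolving framework that improves a large model's implicit reasoning by explicitly eliciting chain-of-thought reasoning and rewarding its internal agreement, yielding gains on multimodal reasoning benchmarks in an unsupervised setting.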

Comments ACL 2026 (Findings)

English abstract

Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer--Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to $+2.1$ points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings. Our code is available at https://meghanaasunil.github.io/iReasoner.

2601.04068 2026-05-21 cs.CV cs.AI

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

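AI summary This paper proposes LocalDPO, a new post-training framework that builds localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level to improve video generation quality and human preference scores.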

Comments Accepted by CVPR 2026

English abstract

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment. The code is available at https://github.com/1170300714/Local-DPO.
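
The idea of restricting preference learning to the corrupted region can be sketched in a few lines. This is a minimal illustration only, assuming a Diffusion-DPO-style objective computed on per-element denoising errors; the function names, the `beta` parameter, and the flat-list data layout are hypothetical and are not taken from the LocalDPO code.

```python
import math

def masked_mean(values, mask):
    """Average `values` over positions where `mask` is truthy."""
    selected = [v for v, m in zip(values, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def region_dpo_loss(err_pos, err_neg, ref_err_pos, ref_err_neg, mask, beta=1.0):
    """Hypothetical region-restricted DPO loss.

    err_*     : per-element denoising errors of the trained model
    ref_err_* : the same errors under the frozen reference model
    mask      : 1 where the negative sample was corrupted, 0 elsewhere
    Only masked positions contribute, so unchanged regions carry no signal.
    """
    gap_pos = masked_mean(err_pos, mask) - masked_mean(ref_err_pos, mask)
    gap_neg = masked_mean(err_neg, mask) - masked_mean(ref_err_neg, mask)
    # -log(sigmoid(-beta * (gap_pos - gap_neg))) written as a softplus
    return softplus(beta * (gap_pos - gap_neg))
```

When the trained model matches the reference, the loss equals log 2; it drops below that value once the model fits the positive sample better than the reference does inside the masked region.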

2601.03135 2026-05-21 cs.CL

Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

Aashish Dhawan, Christopher Driggers-Ellis, Christan Grant, Daisy Zhe Wang

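AI summary This study improves neural machine translation for low-resource indigenous languages through synthetic data generation and language-specific preprocessing. Experiments show that synthetic data augmentation improves translation quality, while generic preprocessing has limitations for highly agglutinative languages.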

English abstract

Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.

2601.00473 2026-05-21 cs.LG cs.AI

Deep Neural Networks as Discrete Dynamical Systems: Implications for Physics-Informed Learning

Abhisek Ganguly, Santosh Ansumali, Sauro Succi

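AI summary This paper examines the analogy between deep neural networks and discrete dynamical systems. By comparing numerical/exact solutions of the Burgers' and Eikonal equations with solutions obtained via PINNs, it shows that PINN learning offers a different computational pathway for approximating the same system dynamics, and notes that the dense parameter representations of PINNs may be advantageous in high-dimensional settings.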

English abstract

We revisit the analogy between feed-forward deep neural networks (DNNs) and discrete dynamical systems derived from neural integral equations and their corresponding partial differential equation (PDE) forms. We present a comparative analysis of the numerical/exact solutions of the Burgers' and Eikonal equations and the corresponding solutions obtained via PINNs. We show that PINN learning provides a different computational pathway compared to standard numerical discretization in approximating essentially the same underlying dynamics of the system. Within this framework, DNNs can be interpreted as discrete dynamical systems whose layer-wise evolution approaches attractors, and multiple parameter configurations may yield comparable solutions, reflecting the degeneracy of the inverse mapping. In contrast to the structured operators associated with finite-difference (FD) procedures, PINNs learn dense parameter representations that are not directly associated with classical discretization stencils. This distributed representation generally involves a larger number of parameters, leading to reduced interpretability and increased computational cost. However, the additional flexibility of such representations may offer advantages in high-dimensional settings where classical grid-based methods become impractical.

2512.14896 2026-05-21 cs.CL cs.AI

DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad, Parham Abed Azad, Fatemeh Latifi, Ali Sabzi, Armin Khosravi, Siavash Ahmadi, Babak Khalaj, Mohammad Hossein Rohban, Glolamali Aminian, Zohreh Amoozgar, Tahereh Javaheri

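AI summary This study evaluates large language models on pharmacy licensure-style question answering and develops DrugRAG, a pipeline that integrates structured external drug knowledge to improve LLM accuracy on pharmacy-focused question-answering tasks.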

Comments 14 pages, 2 figures, 2 tables. The revised version includes McNemar's paired statistical analysis, Wilson confidence intervals, expanded methodological clarifications, a revised discussion of evidence retrieval, improved reproducibility details, and updated limitations

English abstract

In our study, we evaluated large language model (LLM) performance on pharmacy licensure-style question-answering tasks and developed an external knowledge integration method to improve accuracy. We benchmarked ten LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset, measuring baseline accuracy without modification. Baseline performance ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores, while smaller open-source models showed substantially lower performance. We then developed DrugRAG, a three-step retrieval-augmented generation (RAG) pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally and requiring no changes to model architecture or parameters. DrugRAG increased accuracy across all five evaluated models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61.0% to 71%, Llama 3.1 8B: 46% to 67%). McNemar analyses demonstrated statistically significant paired improvements primarily in smaller and mid-sized open-source models. These findings demonstrate that integrating structured external drug knowledge via DrugRAG can improve LLM performance on pharmacy-focused question-answering tasks without modifying the underlying models, providing a practical pipeline for enhancing evidence-based pharmacy-focused AI applications.
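
The paired statistics named in this entry (McNemar's test and Wilson confidence intervals) can be illustrated with a short sketch. It assumes the exact binomial form of McNemar's test and the standard Wilson score interval; the paper may use a different variant, and any counts passed to these functions are the caller's own, not figures from the paper.

```python
from math import comb, sqrt

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: questions answered correctly only by the baseline
    c: questions answered correctly only with retrieval
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half
```

Only the discordant pairs enter McNemar's test, which is why a small question set can still give a significant result when nearly all changed answers move in one direction.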

2512.13788 2026-05-21 cs.LG cs.RO

Constrained Policy Optimization via Sampling-Based Weight-Space Projection

Shengfan Cao, Francesco Borrelli, Eunhyek Joa

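AI summary This study proposes SCPO, a sampling-based weight-space projection method for optimizing policies without leaving the safe operating regime. It enforces safety constraints directly in parameter space, keeps training safe and feasible, and ensures closed-loop stability in constrained control tasks.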

Comments Accepted for publication at IFAC World Congress 2026; fixed minor notation inconsistencies

English abstract

Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy rollout-based safety constraints that can be evaluated but not differentiated analytically. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. SCPO constructs a local safe region by combining rollout-based safety evaluations with smoothness bounds relating parameter perturbations to changes in safety metrics, and projects each gradient update via a convex QCQP. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, SCPO further ensures closed-loop stability while enabling safe adaptation beyond the conservative backup. Experiments on constrained regression with harmful supervision and double-integrator imitation with a malicious expert show that SCPO rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful objective improvement.
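
The projection step can be pictured with the simplest possible case. The sketch below assumes the local safe region is a single Euclidean ball of radius `radius` around the current parameters, for which the projection has a closed form; the paper's QCQP may combine several constraints, so the function name and the one-ball setup are illustrative assumptions rather than the SCPO algorithm.

```python
import math

def project_update(theta, step, radius):
    """Apply `step` to `theta`, shrinking it to stay inside a trust ball.

    theta  : current (safe) parameter vector
    step   : proposed gradient update
    radius : radius of the ball assumed safe around `theta` (hypothetical)
    """
    norm = math.sqrt(sum(s * s for s in step))
    # Inside the ball the update is kept; otherwise it is scaled to the boundary.
    scale = 1.0 if norm <= radius else radius / norm
    return [t + scale * s for t, s in zip(theta, step)]
```

Because every accepted iterate stays inside a region certified from the previous safe iterate, safety carries over from one step to the next, which mirrors the safe-by-induction argument in the abstract.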

2512.13402 2026-05-21 cs.CV cs.AI

End2Reg: Learning Task-Specific Segmentation for Markerless Registration in Spine Surgery

Lorenzo Pettinari, Sidaty El Hadramy, Michael Wehrli, Philippe C. Cattin, Daniel Studer, Carol C. Hasler, Maria Licci

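AI summary This paper proposes End2Reg, an end-to-end deep learning framework that jointly optimizes segmentation and registration without segmentation labels or manual steps, improving the accuracy of markerless navigation in spine surgery.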

Comments Early Accepted MICCAI 2026. Code and interactive visualizations: https://lorenzopettinari.github.io/end-2-reg/

English abstract

Intraoperative navigation in spine surgery demands millimeter-level accuracy. Currently, this is achieved through radiation-intensive intraoperative imaging and bone-anchored markers that are invasive and disrupt surgical workflow. Markerless RGB-D registration methods offer a promising alternative. However, existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, potentially propagating errors through the registration process. We present End2Reg, an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for segmentation labels and manual steps. The network learns task-specific segmentation masks optimized for registration, guided solely by the registration objective without explicit segmentation supervision. End2Reg achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% and mean Root Mean Square Error by 61%, while maintaining robust performance under partial occlusions. Ablation results confirm that end-to-end optimization significantly improves registration accuracy. Overall, End2Reg advances towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.

2512.09806 2026-05-21 cs.CV cs.AI

CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing

Jianfei Li, Ines Rosellon-Inclan, Gitta Kutyniok, Jean-Luc Starck

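AI summary This paper proposes CHEM, a method for quantifying and characterizing hallucinated artifacts in image reconstruction models. It localizes hallucination-prone regions using wavelet and shearlet representations, assesses hallucination levels with conformalized quantile regression, and analyzes why U-shaped networks tend to produce hallucination-prone predictions.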

English abstract

Deep learning-based methods have recently achieved significant success in image reconstruction problems. However, challenges have emerged, as these methods may generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a framework for quantifying and characterizing hallucinated artifacts in image reconstruction models. The proposed method, termed the Conformal Hallucination Estimation Metric (CHEM), enables the identification of hallucination-prone regions in model predictions. It leverages wavelet and shearlet representations to localize such regions at the level of image features, and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. A theoretical analysis is provided, characterizing the sensitivity of CHEM to hallucinated artifacts and its relationship to the mean squared error. Building on these insights and adopting a viewpoint grounded in approximation theory, we investigate why U-shaped networks, widely used architectures for image reconstruction, tend to produce hallucination-prone predictions. We assess the effectiveness of the proposed approach on astronomical image deconvolution using the CANDELS dataset with architectures such as U-Net, SwinUNet, and Learnlets, and on natural image super-resolution using the DIV2K dataset with models such as DRUNet, Unfolded DRS, RAM, and DPS.

2512.09447 2026-05-21 cs.RO cs.CV

Query-Calibrated Segmental Admission for Descriptor-Agnostic LiDAR Loop Closure in Repetitive Environments

Jaehyun Kim, Seungwon Choi, Wonseok Kang, Tae-Wan Kim

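AI summary This study proposes a descriptor-agnostic sparse loop-admission policy that stabilizes the pose graph in repetitive environments. It calibrates query-level segment hypotheses and validates representative pairs to reduce false loop-factor admissions, improving the precision and stability of loop closure.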

Comments 8 pages, 3 figures

English abstract

Structurally repetitive environments produce visually plausible but aliased LiDAR loop candidates that can destabilize pose-graph optimization when admitted as loop factors. We propose Query-Calibrated Segmental Admission (QCSA), a descriptor-agnostic sparse loop-admission policy for graph-stability-oriented insertion. The policy scores short descriptor segments against hard negatives, calibrates which query-level segment hypotheses reach geometry, and inserts representative pairs validated by Generalized Iterative Closest Point (G-ICP). We evaluate it on the SNU Library Dataset (SNULib) and HeLiPR overlap routes. Aggregated over seven LiDAR descriptor families on SNULib, QCSA reduces inserted loop factors by 3.8 times, raises factor precision from 0.542 to 0.717, and sharply lowers false admissions per query group. With this sparser graph, it maintains comparable mean absolute trajectory error (ATE) and substantially reduces worst-sequence ATE versus dense Top1+G-ICP, from 1.064 to 0.778 m. The aggregate mean and worst-sequence ATE remain lower than the odometry-only reference. Under a matched factor budget, QCSA also attains lower trajectory error than SeqSLAM and sparse Top1+G-ICP selections. Fixed-transfer validation on HeLiPR, with no route-specific tuning, likewise suppresses hard-negative admissions. These results support the proposed admission layer for aliasing-heavy simultaneous localization and mapping (SLAM). Our implementation and dataset will be released at: https://github.com/wanderingcar/snu_library_dataset.

2511.23152 2026-05-21 cs.LG cond-mat.dis-nn math.OC math.RT stat.ML

A Differentiable Measure of Algebraic Complexity: Provably Exact Discovery of Group Structures

Dongsung Huh, Lior Horesh, Halyun Jeong

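AI summary This paper proposes a differentiable measure of algebraic complexity. Through Cayley-table completion, it proves that the HyperCube operator-valued tensor factorization exactly discovers group structures, resolving the central open conjecture of Huh (2025).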

Comments 29 pages, 3 figures. All theoretical conjectures are formally proven as theorems and verified in Lean 4. v4: Minor typographical corrections

English abstract

Discovering discrete algebraic rules from data is a fundamental challenge in machine learning. We formalize this problem through Cayley-table completion -- an algebraic counterpart to classical matrix completion -- where the degree of associativity violation replaces linear rank as the intrinsic measure of complexity. We provide a rigorous landscape analysis of HyperCube, an operator-valued tensor factorization, on the fully observed target table $δ$, proving that its global infimum $H_{\inf}(δ) := \inf_{Θ\in F_δ} H(Θ)$ implicitly defines an exact differentiable measure for this complexity. We show that HyperCube's native objective $H(Θ)$ decomposes into two components: geometric alignment (collinearity) and an inverse $\ell_2$ penalty. We establish that these continuous variational pressures induce core discrete properties: collinearity enforces associativity (Collinearity--Associativity Equivalence), and the inverse $\ell_2$ penalty reduces to an exact inverse rank penalty within the collinear manifold, driving the parameters toward full-rank unitarity. Consequently, we derive an absolute lower bound $H(Θ) \ge H_{\inf}(δ) \ge 3 \, |δ|$, where $|δ|$ is the target table size. We prove this absolute floor is attained if and only if the target is isotopic to a group, and characterize the global minimizer as the regular representation of the underlying group (up to unitary gauge), resolving the central open conjecture of Huh (2025). This work serves as an existence proof that certain discrete algebraic structures can be exactly characterized by differentiable measures, enabling gradient-based discovery without the need for combinatorial search. All theoretical results are mechanically verified in Lean 4 and confirmed via small-scale experiments.
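
To make "degree of associativity violation" concrete, the sketch below counts the triples on which a finite Cayley table fails associativity. This discrete count is only a simple proxy chosen for illustration; it is not the differentiable objective $H(Θ)$ analyzed in the paper, and the function name is hypothetical.

```python
def associativity_violations(table):
    """Count triples (a, b, c) with (a*b)*c != a*(b*c).

    `table[a][b]` holds the product a*b, with elements labeled 0..n-1.
    A result of 0 means the operation is associative.
    """
    n = len(table)
    return sum(
        1
        for a in range(n)
        for b in range(n)
        for c in range(n)
        if table[table[a][b]][c] != table[a][table[b][c]]
    )

# Addition modulo 3 is a group operation; subtraction modulo 3 is not associative.
add_mod3 = [[(a + b) % 3 for b in range(3)] for a in range(3)]
sub_mod3 = [[(a - b) % 3 for b in range(3)] for a in range(3)]
print(associativity_violations(add_mod3))  # 0
print(associativity_violations(sub_mod3))  # 18
```

For subtraction modulo 3, the two bracketings differ exactly when c is nonzero, which gives 3 × 3 × 2 = 18 violating triples out of 27.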

2511.01482 2026-05-21 cs.CL

Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation

Neha Sharma, Navneet Agarwal, Kairit Sirts

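AI summary This paper explores using large language models as consistent and reliable annotators for cognitive distortion detection and proposes a dataset-agnostic evaluation framework for fairly comparing models trained on different datasets. Results show that GPT-4 produces consistent annotations that improve model performance on this subjective NLP task.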

Journal ref
https://lrec.elra.info/lrec2026-main-851

English abstract

Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
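
Cohen's kappa, which the abstract uses as an effect size measure, corrects raw agreement for the agreement expected by chance. The sketch below is the textbook two-rater formula; how the paper applies it across datasets is not specified here, so the function name and the pairing of label sequences are illustrative assumptions.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two marginal label distributions.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1 - expected)

print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
```

Because chance agreement depends on the label distribution, kappa stays comparable across datasets with different class balance, where a raw F1 score or accuracy does not.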

2511.01219 2026-05-21 cs.RO

Tackling the Kidnapped Robot Problem via Sparse Feasible Hypothesis Sampling and Reliable Batched Multi-Stage Inference

Muhua Zhang, Lei Ma, Ying Wu, Kai Shen, Deqing Huang, Henry Leung

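AI summary This paper proposes a passive 2-D global relocalization framework that efficiently and reliably estimates the global pose from a single LiDAR scan and an occupancy grid map while the robot remains stationary, improving the long-term autonomy of mobile robots. The framework casts global relocalization as a non-convex problem and solves it with a multi-hypothesis scheme that uses batched multi-stage inference and early termination to balance completeness and efficiency.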

Comments 14 pages, 8 figures. Accepted for publication in IEEE Transactions on Instrumentation and Measurement. DOI: 10.1109/TIM.2026.3694741

English abstract

This paper addresses the Kidnapped Robot Problem (KRP), a core localization challenge of relocalizing a robot in a known map without a prior pose estimate upon localization loss or at SLAM initialization. For this purpose, a passive 2-D global relocalization framework is proposed. It estimates the global pose efficiently and reliably from a single LiDAR scan and an occupancy grid map while the robot remains stationary, thereby enhancing the long-term autonomy of mobile robots. The proposed framework casts global relocalization as a non-convex problem and solves it via a multi-hypothesis scheme with batched multi-stage inference and early termination, balancing completeness and efficiency. The Rapidly-exploring Random Tree (RRT), under traversability constraints, asymptotically covers the reachable space to generate sparse, uniformly distributed feasible positional hypotheses, fundamentally reducing the sampling space. The hypotheses are preliminarily ordered by the proposed Scan Mean Absolute Difference (SMAD), a coarse beam-error level metric that facilitates early termination by prioritizing high-likelihood candidates. The SMAD computation is optimized for limited scan measurements. The Translation-Affinity Scan-to-Map Alignment Metric (TAM) is proposed for reliable orientation selection at hypothesized positions and accurate final global pose evaluation, mitigating the degradation of conventional likelihood-field metrics under translational uncertainty induced by sparse hypotheses, as well as under non-panoramic LiDAR scans and environmental changes. Real-world experiments on a resource-constrained mobile robot with non-panoramic LiDAR scans show that the proposed framework achieves competitive performance in success rate, robustness under measurement uncertainty, and computational efficiency.
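
The role of SMAD as a coarse ordering metric can be sketched as follows. The abstract does not give its formula, so this sketch assumes the plain reading of the name: the mean absolute difference between measured beam ranges and the ranges expected from the map at a hypothesized pose. The function names, the use of `None` for invalid beams, and the dictionary layout are all hypothetical.

```python
def scan_mean_abs_diff(measured, expected):
    """Mean absolute range difference over beams valid in both scans."""
    pairs = [(m, e) for m, e in zip(measured, expected)
             if m is not None and e is not None]
    if not pairs:
        return float("inf")  # no comparable beams: rank this hypothesis last
    return sum(abs(m - e) for m, e in pairs) / len(pairs)

def rank_hypotheses(measured, expected_by_hypothesis):
    """Order hypothesis ids from best (lowest difference) to worst.

    expected_by_hypothesis maps a hypothesis id to the ranges a ray cast
    from that pose would return on the occupancy grid (assumed precomputed).
    """
    return sorted(
        expected_by_hypothesis,
        key=lambda h: scan_mean_abs_diff(measured, expected_by_hypothesis[h]),
    )
```

Evaluating candidates in this order lets a later, more expensive stage stop early once a sufficiently good pose is found, which is the purpose the abstract assigns to the ordering.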

2510.23538 2026-05-21 cs.AI cs.CL cs.CV cs.SE

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan

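AI summary This paper proposes JanusCoder, a visual-programmatic interface for code intelligence. By building a large-scale multimodal code corpus and a unified model, it generates code from textual instructions, visual inputs, or both, and shows superior performance on text-centric and vision-centric coding tasks.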

Comments ICLR 2026 Camera Ready Version, with code and data available

English abstract

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.

2510.21583 2026-05-21 cs.CV cs.AI

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

Yifu Luo, Haoyuan Sun, Xinhao Hu, Penghui Du, Keyu Fan, Bo Li, Sinan Du, Xu Wan, Zhiyu Chen, Bo Xia, Tiantian Zhang, Yongzhe Chang, Changqian Yu, Kun Gai, Xueqian Wang

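AI summary This paper proposes GCPO, a chunk-level policy optimization method for flow-matching reinforcement learning. Aggregating consecutive steps into coherent chunks and moving policy optimization to the chunk level mitigates inaccurate advantage attribution, and experiments show it outperforms existing methods on text-to-image generation.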

Comments ICML 2026

详情
AI中文摘要

近期在文本到图像(T2I)生成中的后训练流匹配中,群相对策略优化(GRPO)展示了强大的潜力。然而,其受到关键限制:优势归因不准确。在本文中,我们主张将连续步骤聚合为一个连贯的`chunk'并将策略优化范式从GRPO的步骤级别转移到片段级别,可以有效减轻这一问题的负面影响。基于这一见解,我们提出了群片段策略优化(GCPO),这是首个用于后训练流匹配的片段级强化学习方法。广泛的实验表明,GCPO在标准T2I基准和偏好对齐方面均取得了优越的性能,相对于GRPO最高相对提升达43%,凸显了片段级策略优化的前景。代码可在https://github.com/xingzhejun/GCPO上获得。

英文摘要

Recent progress in post-training flow matching for text-to-image (T2I) generation with Group Relative Policy Optimization (GRPO) has demonstrated strong potential. However, it is hindered by a critical limitation: inaccurate advantage attribution. In this work, we argue that aggregating consecutive steps into a coherent 'chunk' and shifting the policy optimization paradigm from GRPO's step level to the chunk level can effectively mitigate the negative impact of this issue. Building on this insight, we propose Group Chunking Policy Optimization (GCPO), the first chunk-level reinforcement learning approach for post-training flow matching. Extensive experiments demonstrate that GCPO achieves superior performance on both standard T2I benchmarks and preference alignment, with up to 43% relative gains over GRPO, highlighting the promise of chunk-level policy optimization. The code is available at https://github.com/xingzhejun/GCPO.
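The abstract does not spell out how chunks are formed or how the chunk-level objective is written, so the following is only a rough sketch of the two ingredients it names. The group-relative advantage follows the usual GRPO normalization; the fixed `chunk_size` and the choice to sum per-step log-probability differences inside a chunk are assumptions of this sketch, and the function names are hypothetical.

```python
import statistics


def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def chunk_log_ratios(new_logps, old_logps, chunk_size):
    """Sum per-step log-ratios inside each chunk (fixed chunk size is an assumption)."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return [sum(diffs[k:k + chunk_size]) for k in range(0, len(diffs), chunk_size)]


# Two sampled images in one group, four denoising steps grouped into two chunks.
advantages = group_relative_advantages([1.0, 3.0])
ratios = chunk_log_ratios([0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0], chunk_size=2)
print(advantages, ratios)
```

With this grouping, every step inside a chunk shares one ratio, so a single advantage is attributed to the chunk as a whole rather than to each step separately.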

2510.18034 2026-05-21 cs.CV cs.AI cs.RO

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

VLMs能否解锁语义异常检测?一个结构化推理的框架

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

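AI summary This paper proposes SAVANT, a framework that uses structured reasoning to improve VLM performance on semantic anomaly detection, enabling more accurate recognition of rare anomalous scenarios in autonomous driving.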

Comments 8 pages, 5 figures

详情
AI中文摘要

自动驾驶系统仍然对长尾的稀有、分布外语义异常极度脆弱。尽管VLMs已显现为感知的有前途工具,但其在异常检测中的应用仍然主要局限于提示专有模型,限制了可靠性、可重复性和部署可行性。为解决这一差距,我们引入SAVANT(语义异常验证/分析工具包),一种新的模型无关推理框架,将异常检测重新表述为分层语义一致性验证。通过应用SAVANT的两阶段流程——结构化场景描述提取和多模态评估,现有VLMs在输入图像中检测异常驾驶场景的得分得到提升。我们的方法取代了随意提示,通过语义感知推理,将基于VLM的检测转化为四个语义领域之间的原则性分解。我们证明,在平衡的现实驾驶场景集上,应用SAVANT可将VLM的绝对召回率提高约18.5%,相比提示基线。此外,这一增益使大规模注释成为可能:利用我们框架内的最佳专有模型,我们自动标注了约10,000张现实世界图像,具有高置信度。我们使用由此产生的高质量数据集来微调一个7B开源模型(Qwen2.5-VL)以执行单次异常检测,达到90.8%的召回率和93.8%的准确率,超越所有评估模型,同时在接近零成本的情况下实现本地部署。通过将结构化语义推理与可扩展的数据整理相结合,我们为自动驾驶系统中的语义异常检测数据稀缺问题提供了实用的解决方案。补充材料:https://TUM-AVS.github.io/SAVANT/.

英文摘要

Autonomous driving systems remain critically vulnerable to the long tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models - limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline - structured scene description extraction and multi-modal evaluation - existing VLMs improve their scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLMs' absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, we provide a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://TUM-AVS.github.io/SAVANT/.

2510.17269 2026-05-21 cs.CV cs.AI

FineVision: Open Data Is All You Need

FineVision: 你只需要开放数据

Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti

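AI summary This paper introduces FineVision, a high-quality dataset of 24 million samples that unifies more than 200 sources through a semi-automated pipeline, with rigorous data cleaning and human review to ensure quality. Models trained on it perform better across a broad evaluation suite, advancing data-centric vision-language model research.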

详情
AI中文摘要

视觉语言模型(VLMs)的进步受到碎片化、不一致和受污染的公共数据集的阻碍。我们引入了FineVision,一个精心收集、整理和统一的2400万样本数据集,是最大的开放资源。我们通过半自动化、人机协作的流程将超过200个来源整合为185个子集:自动化处理大量数据和模式映射,而审核员检查映射并抽查输出以验证注释的忠实消费、适当的格式和多样性以及安全性;问题会触发针对性的修复和重新运行。该流程进一步在源内和跨源之间应用严格的去重,并针对66个公共基准进行去污染。FineVision还包含具有统一动作空间的代理/GUI任务;审核员验证模式并检查样本轨迹以确认可执行性。在广泛评估套件中,基于FineVision训练的模型始终优于基于现有开放混合数据训练的模型,凸显了规模、数据卫生和平衡自动化与人工监督的好处。我们发布该数据集和整理工具以加速数据驱动的VLM研究。

英文摘要

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.

2510.14444 2026-05-21 cs.LG cs.AI

A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

在LLM压缩中寻找免费午餐:重新审视剪枝后的重新训练

Moritz Wagner, Christophe Roux, Max Zimmer, Sebastian Pokutta

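AI summary This paper studies post-pruning adaptation through local reconstruction, finding that it effectively recovers model quality with far less data and compute, showing how the granularity of the reconstruction parameter window affects final quality, and challenging the prevailing view that post-pruning adaptation is impractical for LLMs.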

详情
AI中文摘要

后训练剪枝可以显著降低LLM推理成本,但除非剩余权重被适应,否则往往会降质。由于在LLM规模上全局重新训练成本高昂,近期研究大多集中在日益复杂的剪枝标准上,旨在选择更好的稀疏模式而不进行适应。我们通过局部重建重新审视这一权衡:在剪枝后,我们依次在校准集上适应模型参数的一个子集,训练其以匹配密集模型的相应中间激活值。我们评估了局部重建在不同模型家族和规模上的表现,最高达到72B参数,并得出三个主要发现。首先,局部重建是LLM的有效适应机制:它在剪枝后重新训练时,使用了超过一个数量级更少的数据和计算资源,即使使用PEFT技术也是如此。其次,重建在粒度上表现出广泛的“免费午餐”区域,即重建参数窗口:只要重建区域包含至少一个非线性子模块,最终质量对窗口大小几乎不敏感,允许粒度主要基于内存约束来选择。相比之下,重建单个矩阵,尽管是文献中常提出的方法,却持续表现不佳,因为小的矩阵级误差会积累成更大的激活漂移。最后,重建减少了剪枝标准的相对重要性:随着模型规模的增加,复杂标准与简单基线之间的性能差距缩小,使简单方法再次具有竞争力。总体而言,我们的结果挑战了LLM剪枝后适应不可行的主流观点。

英文摘要

Post-training pruning can substantially reduce LLM inference costs, but it often degrades quality unless the remaining weights are adapted. Since global retraining is expensive at LLM scale, recent work has largely focused on increasingly sophisticated pruning criteria that aim to select better sparsity patterns without adaptation. We revisit this trade-off through local reconstruction: after pruning, we adapt one subset of the model parameters at a time on a calibration set, training it to match the corresponding intermediate activations of the dense model. We evaluate local reconstruction across model families and scales, up to 72B parameters, and establish three main findings. First, local reconstruction is an effective adaptation mechanism for LLMs: it matches post-pruning retraining while using over an order of magnitude less data and compute, even when using PEFT techniques. Second, reconstruction exhibits a broad "free-lunch" regime in granularity, i.e., the reconstruction parameter window: as long as the reconstructed region contains at least a nonlinear submodule, final quality is largely insensitive to the window size, allowing granularity to be chosen primarily based on memory constraints. In contrast, reconstructing individual matrices, despite being the natural approach often proposed in the literature, consistently underperforms, as small matrix-level errors accumulate into larger activation drift. Lastly, reconstruction reduces the relative importance of the pruning criterion: performance gaps between sophisticated criteria and simple baselines shrink with model scale, making simple methods competitive again. Overall, our results challenge the prevailing view that post-pruning adaptation is impractical for LLMs.

2510.09833 2026-05-21 cs.CV

Post Processing of image segmentation using Conditional Random Fields

利用条件随机场对图像分割进行后处理

Aashish Dhawan, Pankaj Bodani, Vishal Garg

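AI summary This paper studies how conditional random fields can improve the clarity of image segmentation results, analyzing how different CRF types perform on low-quality satellite imagery and high-quality aerial photographs and assessing the strengths and weaknesses of each approach.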

详情
Journal ref
Proc. 2019 6th International Conference on Computing for Sustainable Global Development (INDIACom), pp. 147-151, 2019
AI中文摘要

图像分割过程的输出通常由于卫星图像的低质量特征而不够清晰。本研究旨在寻找合适的条件随机场(CRF)以提高分割图像的清晰度。我们首先尝试了不同类型的CRF,并研究它们为何适合或不适合我们的目的。我们在两个不同的数据集上评估了我们的方法——具有低质量特征的卫星图像和高质量的航拍照片。在研究过程中,我们尝试了各种CRF,找出在图像上表现最佳的CRF,并将我们的结果与这些数据集进行比较,以展示不同方法的陷阱和潜力。

英文摘要

The output of the image segmentation process is usually not very clear due to the low-quality features of satellite images. The purpose of this study is to find a suitable Conditional Random Field (CRF) to achieve better clarity in a segmented image. We started with different types of CRFs and studied why they are or are not suitable for our purpose. We evaluated our approach on two different datasets - satellite imagery with low-quality features and high-quality aerial photographs. During the study we experimented with various CRFs to find which CRF gives the best results on images, and compared our results on these datasets to show the pitfalls and potential of different approaches.

2510.08482 2026-05-21 cs.CV cs.CL

The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

视觉象征性挑战:在手语形式-意义映射上评估视觉-语言模型

Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

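AI summary This paper presents a novel video benchmark for evaluating vision-language models on sign language form-meaning mapping. It adapts psycholinguistic measures to three tasks (phonological sign-form prediction, transparency, and graded iconicity ratings) and finds that models recover some phonological form detail but remain below human performance overall.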

详情
AI中文摘要

象征性,即语言形式与意义之间的相似性,在手语中普遍存在,为视觉 grounding 提供了自然的测试环境。对于视觉-语言模型(VLMs),挑战在于从动态的人类运动中恢复这种本质的映射,而非静态上下文。我们引入了视觉象征性挑战,一个新颖的基于视频的基准测试,将心理语言学测量适应于评估 VLMs 在三个任务上的表现:(i)语音学手语形式预测(例如,手形、位置),(ii)透明度(从视觉形式推断意义),以及(iii)渐进象征性评分。我们评估了13种最先进的VLMs在零样本和少样本设置下在荷兰手语上的表现,并将其与人类基线进行比较。在语音形式预测上,VLMs恢复了一些手形和位置细节,但表现仍低于人类;在透明度上,它们与人类基线相差甚远;只有顶级模型与人类象征性评分有中等相关性。有趣的是,语音形式预测能力更强的模型更能与人类象征性判断相关联,表明它们对视觉基础结构有共同的敏感性。我们的发现验证了这些诊断任务,并推动了以人类为中心的信号和具身学习方法,用于建模象征性和改进多模态模型中的视觉 grounding。

英文摘要

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

2510.06824 2026-05-21 cs.LG

Efficient numeracy in language models through single-token number embeddings

通过单token数字嵌入提升语言模型的数值处理效率

Linus Kreitner, Paul Hager, Jonathan Mengedoht, Georgios Kaissis, Daniel Rueckert, Martin J. Menten

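AI summary This paper proposes BitTokens, a method that encodes each number as a single token using its IEEE 754 binary floating-point representation, letting language models handle numerical computation more efficiently and extending the range of problems they can solve.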

详情
AI中文摘要

为了推动科学和工程领域的进步,大型语言模型(LLMs)必须能够高效处理大量数值数据并解决长计算。目前只能通过外部工具或大量推理链实现,这要么削弱了LLMs的数值表示,要么限制了它们能解决的问题长度。我们发现前沿LLMs解决基本计算需要过多的推理token,这被其分拆单个数字为多个token的分词策略所加剧。这促使了对高效且有效的单token数字编码的需求。我们提出了一组此类编码的准则,并展示现有方法未能满足这些准则。为解决这些不足,我们提出了BitTokens,一种新的编码策略,通过IEEE 754二进制浮点表示将任何数字编码为单个token。通过广泛实验,我们证明我们的BitTokens使即使是小型语言模型也能学习到几乎完美解决基本算术运算的算法。这种新获得的效率可以扩展语言模型能解决的问题长度和复杂性。

英文摘要

To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either weakening the numerical representations of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel encoding strategy that represents any number as a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.
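To make the IEEE 754 idea concrete, here is a minimal sketch that converts a number into its bit pattern and back. It assumes the 64-bit double-precision format; the abstract does not state which width BitTokens uses or how the bits are fed to the model, so this only illustrates the underlying representation. The helper names are hypothetical.

```python
import struct


def float_to_bits(x):
    """Return the 64 IEEE 754 double-precision bits of x, most significant first."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(as_int >> (63 - i)) & 1 for i in range(64)]


def bits_to_float(bits):
    """Inverse of float_to_bits: rebuild the float from its 64 bits."""
    as_int = 0
    for b in bits:
        as_int = (as_int << 1) | b
    (x,) = struct.unpack(">d", struct.pack(">Q", as_int))
    return x


bits = float_to_bits(-2.5)
sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]
print(sign, exponent, mantissa[:4])
print(bits_to_float(bits))
```

The layout is one sign bit, 11 exponent bits, and 52 mantissa bits, so every finite double maps to a fixed-length vector regardless of how many decimal digits it takes to print.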

2510.00520 2026-05-21 cs.CV

CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?

CardioBench: 心脏超声基础模型是否能超越实验室?

Darya Taratynova, Ahmed Aly, Numan Saeed, Mohammad Yaqub

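AI summary This paper proposes CardioBench, a benchmark for evaluating echocardiography foundation models. It unifies multiple public datasets and evaluates models under zero-shot, probing, and alignment protocols, finding that general-purpose encoders transfer well but fall short on fine-grained discrimination tasks, while models that capture temporal cardiac dynamics perform best on functional tasks.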

详情
AI中文摘要

基础模型正在重塑医学影像,但其在心脏超声中的应用仍然有限,受制于对私有数据集的依赖,限制了可重复的比较。心脏超声具有独特的挑战,包括噪声采集、高帧冗余和有限的多样化公开数据集。为了解决这个问题,我们引入了CardioBench,一个全面的心脏超声基础模型基准。具体而言,CardioBench将八个公开可用的数据集统一为一个标准化的套件,涵盖四个回归和五个分类任务,覆盖功能、结构、诊断和视图识别终点。利用这一框架,我们评估了几种领先的基座模型,包括心脏专用、生物医学和通用编码器,在一致的零样本、探测和对齐协议下。我们的分析显示,尽管通用编码器转移良好,往往接近探测,但在视图分类和细微病理识别等细粒度区分任务上表现不佳。结果表明,能够捕捉心脏时间动态的模型在功能任务上表现最佳,而基于检索的方法在跨数据集的泛化上更加一致。通过发布预处理、分割和公开评估流程,CardioBench建立了可重复的参考点,以指导未来心脏超声和可能其他医学影像基础模型的架构设计。

英文摘要

Foundation models are reshaping medical imaging, yet their application in echocardiography remains limited, hindered by a heavy reliance on private datasets that prevent reproducible comparison. Echocardiography poses unique challenges, including noisy acquisitions, high frame redundancy, and limited diverse public datasets. To address this, we introduce CardioBench, a comprehensive benchmark for echocardiography foundation models. Specifically, CardioBench unifies eight publicly available datasets into a standardized suite spanning four regression and five classification tasks, covering functional, structural, diagnostic, and view recognition endpoints. Leveraging this framework, we evaluate several leading foundation models, including cardiac-specific, biomedical, and general-purpose encoders, under consistent zero-shot, probing, and alignment protocols. Our analysis reveals that while general-purpose encoders transfer well and often close the gap with probing, they struggle significantly with fine-grained distinctions like view classification and subtle pathology recognition. Results indicate that models capturing temporal cardiac dynamics perform best on functional tasks, while retrieval-based approaches generalize more consistently across datasets. By releasing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference point to guide the architectural design of future echocardiography and possibly other medical imaging foundation models.

2509.26627 2026-05-21 cs.AI cs.LG cs.RO

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

TimeRewarder: 通过帧间时间距离从被动视频中学习密集奖励

Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao

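AI summary This paper proposes TimeRewarder, which learns dense rewards from passive videos via frame-wise temporal distance to improve reinforcement learning on sparse-reward tasks; experiments show substantial gains in success rate and sample efficiency across multiple tasks.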

Comments ICML 2026 spotlight paper

详情
AI中文摘要

设计密集奖励对于强化学习(RL)至关重要,但在机器人学中往往需要大量的手动工作且缺乏可扩展性。一个有前景的解决方案是将任务进展视为密集奖励信号,因为它量化了动作在时间上推动系统向任务完成迈进的程度。我们提出了TimeRewarder,一种简单而有效的奖励学习方法,通过建模帧对之间的时间距离,从被动视频(包括机器人演示和人类视频)中推导出进展估计信号。然后展示如何通过TimeRewarder提供逐步的代理奖励以指导强化学习。在我们对十个具有挑战性的Meta-World任务的全面实验中,我们表明TimeRewarder显著提高了稀疏奖励任务的强化学习性能,仅在每个任务中进行200,000次环境交互时,就实现了9/10任务的几乎完美成功。该方法在最终成功率和样本效率上均优于先前方法和手动设计的环境密集奖励。此外,我们还展示了TimeRewarder预训练可以利用真实世界的人类视频,突显了其作为从多样化视频源中获取丰富奖励信号的可扩展方法的潜力。

英文摘要

Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 environment interactions per task. This approach outperformed previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable approach to rich reward signals from diverse video sources.
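The abstract describes modeling temporal distances between frame pairs and using the result as a step-wise proxy reward. The sketch below shows one way those two pieces could fit together. The normalized target `(j - i) / (T - 1)`, the function names, and the `predict_distance` callable are all assumptions made for illustration, not details taken from the paper.

```python
def temporal_distance_pairs(num_frames):
    """List (i, j, target) triples with a signed, normalized temporal distance."""
    span = num_frames - 1  # requires num_frames >= 2
    return [
        (i, j, (j - i) / span)
        for i in range(num_frames)
        for j in range(num_frames)
        if i != j
    ]


def stepwise_rewards(observations, predict_distance):
    """Proxy reward per step: predicted progress between consecutive observations."""
    return [predict_distance(a, b) for a, b in zip(observations, observations[1:])]


pairs = temporal_distance_pairs(3)
# Stand-in predictor: observations are progress indices on a 5-frame scale.
rewards = stepwise_rewards([0, 1, 1, 3], lambda a, b: (b - a) / 4)
print(pairs)
print(rewards)
```

In this toy run the middle step earns zero reward because the observation did not advance, which is the behavior a progress-based dense reward is meant to provide.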

2509.25606 2026-05-21 cs.LG

Effective Model Pruning: Measure The Redundancy of Model Components

有效模型剪枝:衡量模型组件的冗余性

Yixuan Wang, Dan P. Guralnik, Saiedeh Akbari, Warren E. Dixon

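AI summary This paper studies a basic question in model pruning and proposes a pruning method based on effective sample size, which determines how many components can be discarded by analyzing the distribution of importance scores; the method is validated on a variety of network architectures.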

Comments 18 pages, 4 figures. Accepted at ICML 2026 (Spotlight)

详情
AI中文摘要

本文开创性地研究了模型剪枝中的基本问题:给定一个分配给模型组件的重要性评分向量s,如何确定在不牺牲性能的情况下可以丢弃多少评分组件?我们提出了有效模型剪枝(EMP),该方法通过粒子过滤中的有效样本大小概念(也称为逆西姆逊指数)直接从评分分布中推导出所需的稀疏性。EMP提供了一个通用的自适应阈值,该阈值基于评分s在模型组件上的分布:EMP将s映射到一个称为有效样本大小的数值N_eff(s)。丢弃N-N_eff分值最低的组件。推导了有效质量s_eff(保留的标准化评分总和)关于N_eff的紧下界。这一过程产生了一个相对于原始密集模型具有可证明上界损失变化的模型。在多种网络架构上进行了数值实验,包括MLPs、CNNs、Transformers、LLMs和KAN。还展示了EMP能够处理多种剪枝标准,如权重大小、注意力评分、KAN重要性评分以及特征级信号如图像像素。

英文摘要

This article initiates the study of a basic question about model pruning. Given a vector $s$ of importance scores assigned to model components, how many of the scored components could be discarded without sacrificing performance? We propose Effective Model Pruning (EMP), which derives the desired sparsity directly from the score distribution using the notion of effective sample size from particle filtering, also known as the inverse Simpson index. Rather than prescribe a pruning criterion, EMP supplies a universal adaptive threshold derived from the distribution of the score $s$ over the model components: EMP maps $s$ to a number $N_{eff}=N_{eff}(s)$, called the effective sample size. The $N-N_{eff}$ lowest scoring components are discarded. A tight lower bound on the effective mass $s_{eff}$ (the sum of retained normalized scores) in terms of $N_{eff}$ is derived. This process yields models with a provable upper bound on the loss change relative to the original dense model. Numerical experiments are performed demonstrating this phenomenon across a variety of network architectures including MLPs, CNNs, Transformers, LLMs, and KAN. It is also shown that EMP addresses a rich set of pruning criteria such as weight magnitude, attention score, KAN importance score, and even feature-level signals such as image pixels.
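The effective sample size from particle filtering is commonly written as $N_{eff}(s) = (\sum_i s_i)^2 / \sum_i s_i^2$, which equals the inverse Simpson index of the normalized scores. The sketch below applies that formula and keeps the top-scoring components. Rounding $N_{eff}$ to the nearest integer and requiring non-negative scores are assumptions of this sketch, and the function names are hypothetical.

```python
def effective_sample_size(scores):
    """N_eff = (sum s)^2 / sum(s^2), the inverse Simpson index of normalized scores."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must be non-negative with a positive sum")
    return total * total / sum(s * s for s in scores)


def emp_keep_indices(scores):
    """Indices kept after discarding the N - N_eff lowest-scoring components."""
    n_keep = max(1, round(effective_sample_size(scores)))  # rounding is an assumption
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


uniform = [1.0, 1.0, 1.0, 1.0]
peaked = [10.0, 1.0, 1.0, 1.0]
print(effective_sample_size(uniform))  # all components equally important
print(effective_sample_size(peaked))   # one component dominates
print(emp_keep_indices(peaked))
```

Uniform scores give $N_{eff} = N$, so nothing is pruned, while a distribution dominated by a few components gives a small $N_{eff}$ and therefore a high sparsity.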

2509.22963 2026-05-21 cs.LG

Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

基于离散扩散策略的强化学习

Haitong Ma, Ofir Nabati, Aviv Rosenberg, Bo Dai, Oran Lang, Craig Boutilier, Na Li, Shie Mannor, Lior Shani, Guy Tenneholtz

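AI summary This paper proposes a new framework for training discrete diffusion models as effective policies in complex combinatorial action spaces. An efficient online training process combined with policy mirror descent yields stable policy improvement and state-of-the-art performance on several challenging combinatorial benchmarks.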

Comments 22 pages, 10 figures. Haitong Ma and Ofir Nabati contributed equally to this paper

详情
AI中文摘要

强化学习(RL)在面对许多现实问题中常见的大规模组合动作空间时面临扩展困难。本文介绍了一种新的框架,用于训练离散扩散模型作为这些复杂设置中的高效策略。我们的关键创新是一个高效的在线训练过程,确保了稳定的策略改进。通过利用策略镜像下降(PMD)来定义一个理想的、正则化的目标策略分布,我们将策略更新框架为一个分布匹配问题,训练具有表现力的扩散模型以复制这个稳定的靶向分布。这种解耦方法稳定了学习过程,并显著提高了训练性能。我们的方法在一系列具有挑战性的组合基准上实现了最先进的结果和优越的样本效率,包括DNA序列生成、具有宏动作的强化学习和多智能体系统。实验表明,我们的扩散策略在与其他基线相比时表现出优越的性能。

英文摘要

Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.
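For intuition about the regularized target distribution, the sketch below uses the standard closed form of KL-regularized policy mirror descent, in which the target is proportional to the old policy times `exp(Q / temperature)`. The abstract does not say which regularizer or value estimate the paper uses, so the KL form, the `temperature` parameter, and the function name are assumptions.

```python
import math


def pmd_target(old_probs, q_values, temperature):
    """Target policy proportional to old_probs * exp(q / temperature)."""
    logits = [math.log(p) + q / temperature for p, q in zip(old_probs, q_values)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    norm = sum(weights)
    return [w / norm for w in weights]


old = [0.25, 0.25, 0.5]
target = pmd_target(old, q_values=[0.0, 1.0, 0.0], temperature=1.0)
print(target)
```

A large temperature keeps the target close to the old policy, while a small one pushes it toward the greedy action. The diffusion model would then be trained to match this target, which is the distribution-matching step the abstract describes.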

2509.17931 2026-05-21 cs.CV physics.med-ph

Multi-needle Localization for Pelvic Seed Implant Brachytherapy based on Tip-handle Detection and Matching

基于尖端-柄检测与匹配的盆腔种子植入近距离放射治疗多针定位

Zhuo Xiao, Fugen Zhou, Jingjing Wang, Chongyu He, Bo Liu, Haitao Sun, Zhe Ji, Yuliang Jiang, Junjie Wang, Qiuwen Wu

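AI summary This paper proposes a new method based on tip-handle detection and matching for multi-needle localization in intraoperative CT images. Using an anchor-free network and a greedy matching and merging method, it achieves higher precision and F1 score on a dataset of 100 patients, providing a more robust and accurate solution for needle localization in complex clinical scenarios.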

详情
AI中文摘要

在术中CT图像中实现准确的多针定位对于优化盆腔种子植入近距离放射治疗中的种子放置至关重要。然而,由于图像对比度差和针管粘附,这一任务具有挑战性。本文提出了一种新颖的方法,将针定位重新框架为尖端-柄检测与匹配问题,以克服这些困难。提出了一种基于HRNet的锚点自由网络,用于提取多尺度特征,并通过解耦分支进行热图回归和极角预测,准确检测针尖和柄。为了将检测到的尖端和柄关联为个体针,提出了一种贪心匹配与合并(GMM)方法,该方法设计用于解决具有约束条件的不平衡分配问题(UAP-C)。GMM方法通过迭代选择最可能的尖端-柄对并基于距离度量进行合并,以重建3D针路径。在100名患者的数据集上评估,所提方法表现出优越的性能,其精度和F1分数优于使用nnUNet模型的基于分割的方法,从而为复杂临床场景中的针定位提供了更稳健和准确的解决方案。

英文摘要

Accurate multi-needle localization in intraoperative CT images is crucial for optimizing seed placement in pelvic seed implant brachytherapy. However, this task is challenging due to poor image contrast and needle adhesion. This paper presents a novel approach that reframes needle localization as a tip-handle detection and matching problem to overcome these difficulties. An anchor-free network, based on HRNet, is proposed to extract multi-scale features and accurately detect needle tips and handles by predicting their centers and orientations using decoupled branches for heatmap regression and polar angle prediction. To associate detected tips and handles into individual needles, a greedy matching and merging (GMM) method designed to solve the unbalanced assignment problem with constraints (UAP-C) is presented. The GMM method iteratively selects the most probable tip-handle pairs and merges them based on a distance metric to reconstruct 3D needle paths. Evaluated on a dataset of 100 patients, the proposed method demonstrates superior performance, achieving higher precision and F1 score compared to a segmentation-based method utilizing the nnUNet model, thereby offering a more robust and accurate solution for needle localization in complex clinical scenarios.
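The greedy selection step can be illustrated with a generic unbalanced assignment sketch. It takes a precomputed cost matrix in which lower cost means a more probable tip-handle pair, and it stops at a cost threshold. The cost values, the `max_cost` constraint, and the function name are hypothetical; the paper's actual pair score and its merging step are not reproduced here.

```python
def greedy_match(cost, max_cost):
    """Greedily pair tips (rows) with handles (columns), cheapest pair first."""
    candidates = sorted(
        (c, i, j) for i, row in enumerate(cost) for j, c in enumerate(row)
    )
    used_tips, used_handles, pairs = set(), set(), []
    for c, i, j in candidates:
        if c > max_cost:
            break  # every remaining candidate violates the constraint
        if i in used_tips or j in used_handles:
            continue  # each tip and each handle is used at most once
        used_tips.add(i)
        used_handles.add(j)
        pairs.append((i, j))
    return pairs


# Three detected tips but only two detected handles (an unbalanced problem).
cost = [
    [0.2, 0.9],
    [0.8, 0.1],
    [0.7, 0.6],
]
print(greedy_match(cost, max_cost=0.5))
```

Here the third tip stays unmatched because its cheapest candidate exceeds the threshold, which is the kind of leftover a subsequent merging stage would have to handle.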

2509.14165 2026-05-21 cs.CV cs.AI

Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

令牌去哪了?在高分辨率下的STEP中理解剪枝行为

Michal Szczepanski, Martyna Poreba, Karim Haroun

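AI summary This paper proposes STEP, a framework that improves efficiency through dynamic patch merging and token pruning, substantially reducing computational cost and increasing throughput on high-resolution semantic segmentation while keeping accuracy high.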

详情
Journal ref
SN Computer Science 2026
AI中文摘要

视觉变换器(ViTs)在语义分割任务中实现了最先进的性能,但受到高计算和内存成本的限制。为了解决这一问题,我们提出了STEP(SuperToken和Early-Pruning),一种混合的令牌减少框架,结合动态补丁合并和令牌剪枝,以提高效率而不显著牺牲准确性。STEP的核心是dCTS,一个轻量级的CNN基政策网络,能够灵活地合并为超补丁。编码器块也集成了早期退出,以移除高置信度的超令牌,从而降低计算负载。我们在高分辨率语义分割基准上评估了我们的方法,包括高达1024x1024像素的图像,并显示当仅应用dCTS时,令牌数量可以比标准的16x16像素补丁方案减少2.5倍。这在使用ViT-Large作为骨干时,导致计算成本减少2.6倍,吞吐量增加3.4倍。应用完整的STEP框架进一步提高效率,达到计算复杂度减少4倍,推理速度提高1.7倍,最大精度下降不超过2.0%。通过提出的STEP配置,可以自信地在到达最终编码器层之前停止多达40%的令牌。

英文摘要

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks also integrate early exits to remove high-confidence supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
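As a quick worked example of the token counts involved, the sketch below computes the number of 16 x 16 patches in a 1024 x 1024 image and applies the reported 2.5x reduction. Treating the factor as applying exactly at this resolution is an assumption for illustration, since the abstract reports it for images up to that size.

```python
def patch_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)


baseline = patch_tokens(1024, 1024, 16)  # 64 * 64 patches
after_dcts = baseline / 2.5              # reported reduction factor for dCTS alone
print(baseline, round(after_dcts))
```

Because self-attention cost grows faster than linearly in the token count, a 2.5x drop in tokens can plausibly translate into the somewhat larger compute savings the abstract reports.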

2509.13648 2026-05-21 cs.LG cs.IR

Sequential Data Augmentation for Generative Recommendation

生成推荐中的序列数据增强

Geon Lee, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Kijung Shin, Neil Shah, Liam Collins

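AI summary This paper studies the effect of data augmentation in generative recommendation and proposes GenPAS, a systematic framework that unifies multiple augmentation strategies through three bias-controlled steps, improving accuracy, data efficiency, and parameter efficiency.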

详情
AI中文摘要

生成推荐在个性化系统中起着关键作用,通过预测用户的历史行为序列来预测用户未来的行为。在训练这些模型时,数据增强是一个关键但尚未充分研究的因素,即从用户交互历史中构建训练数据的过程。通过塑造训练分布,数据增强直接影响模型的泛化能力和性能。然而,在现有工作中,这一过程通常被简化、应用不一致或被视为次要设计选择,而没有系统和原则性的理解。受我们实证发现不同增强策略会产生显著性能差异的启发,我们深入分析了它们如何重塑训练分布并影响与未来目标的对齐以及对未见输入的泛化能力。为了系统化这一设计空间,我们提出GenPAS,一个通用且原则性的框架,将增强建模为输入-目标对上的随机采样过程,包含三个受偏步骤:序列采样、目标采样和输入采样。这种形式将广泛使用的策略作为特殊情况统一起来,并使训练分布的灵活控制成为可能。我们在基准和工业数据集上的大量实验表明,GenPAS在准确率、数据效率和参数效率方面优于现有策略,为生成推荐中原则性的训练数据构建提供了实用指导。我们的代码可在https://github.com/snap-research/GenPAS上获得。

英文摘要

Generative recommendation plays a crucial role in personalized systems, predicting users' future interactions from their historical behavior sequences. A critical yet underexplored factor in training these models is data augmentation, the process of constructing training data from user interaction histories. By shaping the training distribution, data augmentation directly and often substantially affects model generalization and performance. Nevertheless, in much of the existing work, this process is simplified, applied inconsistently, or treated as a minor design choice, without a systematic and principled understanding of its effects. Motivated by our empirical finding that different augmentation strategies can yield large performance disparities, we conduct an in-depth analysis of how they reshape training distributions and influence alignment with future targets and generalization to unseen inputs. To systematize this design space, we propose GenPAS, a generalized and principled framework that models augmentation as a stochastic sampling process over input-target pairs with three bias-controlled steps: sequence sampling, target sampling, and input sampling. This formulation unifies widely used strategies as special cases and enables flexible control of the resulting training distribution. Our extensive experiments on benchmark and industrial datasets demonstrate that GenPAS yields superior accuracy, data efficiency, and parameter efficiency compared to existing strategies, providing practical guidance for principled training data construction in generative recommendation. Our code is available at https://github.com/snap-research/GenPAS.
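The three-step sampling view can be sketched as follows. The abstract names only the steps (sequence, target, and input sampling), so the power-law weights, the exponent names `alpha`, `beta`, and `gamma`, and the choice of a contiguous history ending just before the target are all hypothetical choices made for illustration.

```python
import random


def sample_pair(sequences, rng, alpha=1.0, beta=1.0, gamma=1.0):
    """Draw one (input, target) training pair via three weighted sampling steps."""
    # Step 1, sequence sampling: longer histories are favored when alpha > 0.
    eligible = [s for s in sequences if len(s) >= 2]
    seq = rng.choices(eligible, weights=[len(s) ** alpha for s in eligible])[0]
    # Step 2, target sampling: later positions are favored when beta > 0.
    positions = list(range(1, len(seq)))
    t = rng.choices(positions, weights=[p ** beta for p in positions])[0]
    # Step 3, input sampling: earlier start points (longer inputs) when gamma > 0.
    starts = list(range(t))
    start = rng.choices(starts, weights=[(t - k) ** gamma for k in starts])[0]
    return seq[start:t], seq[t]


rng = random.Random(0)
histories = [["a", "b", "c", "d"], ["x", "y"], ["solo"]]
inputs, target = sample_pair(histories, rng)
print(inputs, target)
```

Setting an exponent to zero makes the corresponding step uniform, which is one way a single parameterization could recover several fixed augmentation strategies as special cases.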

2509.13482 2026-05-21 cs.CV

Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization

通过场景自适应晶格向量量化改进3D高斯散射压缩

Hao Xu, Xiaolin Wu, Xi Zhang

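AI summary This paper proposes scene-adaptive lattice vector quantization (SALVQ) to improve 3D Gaussian Splatting (3DGS) compression, optimizing the lattice basis per scene to improve adaptability and R-D efficiency while reducing computational overhead and training time.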

Comments Accepted by IEEE TIP. Code available at https://github.com/hxu160/SALVQ

详情
AI中文摘要

3D高斯散射(3DGS)因其逼真渲染质量和实时性能而迅速流行,但会产生大量数据。因此,压缩3DGS数据对于其模型的成本效益至关重要。最近,一些基于锚点的神经压缩方法已被提出,实现了良好的3DGS压缩性能。然而,它们都依赖于统一标量量化(USQ)因其简单性。一个引人注目的问题是,更复杂的量化器是否能在极小的额外开销和系统最小变化的情况下改进当前的3DGS压缩方法。答案是肯定的,通过将USQ替换为晶格向量量化(LVQ)。为了更好地捕捉场景特定特性,我们为每个场景优化晶格基矢,提高LVQ的适应性和R-D效率。这种场景自适应LVQ(SALVQ)在向量量化和USQ的低复杂性之间取得了平衡。SALVQ可以无缝集成到现有的3DGS压缩架构中,通过最小的修改和计算开销提高其R-D性能。此外,通过缩放晶格基矢量,SALVQ可以动态调整晶格密度,使单个模型能够适应多种比特率目标。这种灵活性消除了为不同压缩级别训练单独模型的需要,显著减少了训练时间和内存消耗。

英文摘要

3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes by replacing USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ's adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.
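To show what quantizing to a lattice means, here is a two-dimensional sketch that rounds a point's coordinates in the lattice basis (Babai rounding, which only approximates the nearest lattice point for skewed bases). The basis values stand in for a learned per-scene basis, and the `scale` argument illustrates how scaling the basis vectors changes lattice density. None of the names or numbers come from the paper.

```python
def lattice_quantize(x, basis, scale=1.0):
    """Quantize a 2D point to the lattice whose basis vectors are the matrix columns."""
    (a, b), (c, d) = basis
    a, b, c, d = a * scale, b * scale, c * scale, d * scale
    det = a * d - b * c
    if det == 0:
        raise ValueError("basis vectors must be linearly independent")
    # Coordinates of x in the lattice basis (inverse of the 2x2 matrix applied to x).
    u = (d * x[0] - b * x[1]) / det
    v = (-c * x[0] + a * x[1]) / det
    k0, k1 = round(u), round(v)  # integer indices that would be entropy coded
    return (a * k0 + b * k1, c * k0 + d * k1), (k0, k1)


identity = ((1.0, 0.0), (0.0, 1.0))  # reduces to uniform scalar quantization
skewed = ((1.0, 0.5), (0.0, 0.75))   # columns are (1, 0) and (0.5, 0.75)
print(lattice_quantize((1.3, 0.8), identity))
print(lattice_quantize((1.3, 0.8), skewed))
print(lattice_quantize((1.3, 0.8), identity, scale=0.5))
```

With the identity basis the scheme is plain per-dimension rounding, so a uniform scalar quantizer is the special case; a smaller `scale` gives a denser lattice and therefore a finer reconstruction at a higher bit rate.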

2509.09946 2026-05-21 cs.CV

Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation

通过鲁棒的2D跟踪和基于深度的后期聚合实现在线3D多摄像机感知

Vu-Minh Le, Thao-Anh Tran, Duc Huy Do, Xuan Canh Do, Huong Ninh, Hai Tran

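AI summary This paper presents a method that uses depth information to extend existing online 2D multi-camera tracking systems into 3D space, reconstructing targets in point-cloud space and recovering their 3D boxes through clustering and yaw refinement. It also introduces an enhanced online data association mechanism that uses local ID consistency to assign global IDs across frames. Evaluated on the 3D MTMC dataset of the 2025 AI City Challenge, the framework placed third.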

Comments Accepted at ICCVW 2025

详情
AI中文摘要

多目标多摄像机跟踪(MTMC)是自动化大规模监控中的关键计算机视觉任务。通过摄像机标定和深度信息,场景中的目标可以投影到3D空间,提供对3D环境的前所未有的自动感知水平。然而,在3D空间中的跟踪需要替换所有2D跟踪组件,这可能对现有的MTMC系统不可行。本文提出了一种方法,通过利用深度信息将任何在线2D多摄像机跟踪系统扩展到3D空间,通过点云空间重建目标,并通过聚类和偏转细化恢复其3D框。我们还引入了增强的在线数据关联机制,利用目标的局部ID一致性来分配跨帧的全局ID。所提出的框架在2025年AI城市挑战赛的3D MTMC数据集上进行评估,取得了排行榜第三名的成绩。

英文摘要

Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. We also introduce an enhanced online data association mechanism that leverages the target's local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge's 3D MTMC dataset, achieving 3rd place on the leaderboard.
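Lifting a tracked 2D detection into point-cloud space typically starts with pinhole back-projection of pixels using their depth. The sketch below shows that single step under assumed intrinsics (`fx`, `fy`, `cx`, `cy` are illustrative values); it returns camera-frame coordinates, and the camera extrinsics, clustering, and yaw refinement mentioned in the abstract are not covered.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# A pixel 100 columns right of the principal point, 4 m from the camera.
point = backproject(u=740, v=360, depth=4.0, fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
print(point)
```

Applying this to every pixel inside a 2D bounding box yields the per-target point cloud from which a 3D box could then be estimated.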