arXivDaily arXiv daily academic digest, updated Monday through Friday
2602.22918 2026-05-18 cs.CL

Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models

Jonathan Steinberg, Oren Gal

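AI summary This study examines how optical character recognition (OCR) information enters the language processing stream of vision-language models and locates the key bottlenecks in the OCR routing mechanism. Using causal interventions and activation-difference analysis, it finds that the location of OCR-sensitive layers differs across architectures, that the OCR signal is highly low-dimensional, and that principal component analysis directions transfer across datasets. It also shows that, in models with modular OCR circuits, removing OCR information can improve counting performance, suggesting that OCR may interfere with other visual processing tasks.

Abstract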

Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures up to 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
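The low-dimensionality claim in the abstract above can be illustrated with a minimal sketch. It assumes that per-sample activations at one layer are available as two arrays (original image versus text-inpainted image) and that the differences are mean-centered before PCA; the function name, the array layout, and the centering step are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def pc1_explained_variance(acts_original, acts_inpainted):
    """Share of variance captured by the first principal component of
    activation differences, plus that component's direction.

    Both inputs are hypothetical arrays of shape (num_samples, hidden_dim).
    """
    diffs = acts_original - acts_inpainted
    # Assumption: center the differences before PCA.
    centered = diffs - diffs.mean(axis=0, keepdims=True)
    # Singular values squared are proportional to per-component variance.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    return variance[0] / variance.sum(), components[0]
```

A direction returned for one dataset could then be projected out of activations from another dataset to probe the kind of transfer the abstract describes.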

2602.20630 2026-05-18 cs.CV

From Pairs to Sequences: Track-Aware Policy Gradients for Keypoint Detection

Yepeng Liu, Hao Li, Liwen Yang, Fangzhen Li, Xudi Ge, Yuliang Gu, Kuang Gao, Bing Wang, Guang Chen, Hangjun Ye, Yongchao Xu

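AI summary This paper reframes keypoint detection as a sequential decision-making process and proposes TraqPoint, an end-to-end reinforcement learning framework that directly optimizes the long-term trackability of keypoints across image sequences. Its core innovation is a track-quality-aware reward mechanism that uses a policy gradient method to jointly improve the consistency and distinctiveness of keypoints across multiple views. Experiments show that TraqPoint significantly outperforms current state-of-the-art keypoint detection and description methods on sparse matching tasks.

Comments Accepted by CVPR 2026 (Oral)

Abstract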

Keypoint-based matching is a fundamental component of modern 3D vision systems, such as Structure-from-Motion (SfM) and SLAM. Most existing learning-based methods are trained on image pairs, a paradigm that fails to explicitly optimize for the long-term trackability of keypoints across sequences under challenging viewpoint and illumination changes. In this paper, we reframe keypoint detection as a sequential decision-making problem. We introduce TraqPoint, a novel, end-to-end Reinforcement Learning (RL) framework designed to optimize the Track-quality (Traq) of keypoints directly on image sequences. Our core innovation is a track-aware reward mechanism that jointly encourages the consistency and distinctiveness of keypoints across multiple views, guided by a policy gradient method. Extensive evaluations on sparse matching benchmarks, including relative pose estimation and 3D reconstruction, demonstrate that TraqPoint significantly outperforms some state-of-the-art (SOTA) keypoint detection and description methods. The code will be available at https://github.com/xiaomi-research/traqpoint.

2602.20207 2026-05-18 cs.LG cs.AI

Golden Layers and Where to Find Them: Improved Knowledge Editing for Large Language Models Via Layer Gradient Analysis

Shrestha Datta, Hongfu Liu, Anshuman Chhabra

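AI summary This paper studies how to perform knowledge editing efficiently in large language models, that is, updating the model's output for a specific query without degrading its overall behavior. The authors propose a new method based on Layer Gradient Analysis (LGA) that analyzes per-layer gradient information to efficiently identify the "golden layers" that give the best editing performance, avoiding the laborious trial and error of conventional approaches. Experiments show that the method is effective and robust across several large language models and knowledge editing tasks.

Abstract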

Knowledge editing in Large Language Models (LLMs) aims to update the model's prediction for a specific query to a desired target while preserving its behavior on all other inputs. This process typically involves two stages: identifying the layer to edit and performing the parameter update. Intuitively, different queries may localize knowledge at different depths of the model, resulting in different sample-wise editing performance for a fixed editing layer. In this work, we hypothesize the existence of fixed golden layers that can achieve near-optimal editing performance similar to sample-wise optimal layers. To validate this hypothesis, we provide empirical evidence by comparing golden layers against ground-truth sample-wise optimal layers. Furthermore, we show that golden layers can be reliably identified using a proxy dataset and generalize effectively to unseen test set queries across datasets. Finally, we propose a novel method, Layer Gradient Analysis (LGA), that estimates golden layers efficiently via gradient attribution, avoiding extensive trial-and-error across multiple editing runs. Extensive experiments on several benchmark datasets demonstrate the effectiveness and robustness of our LGA approach across different LLM types and various knowledge editing methods.

2602.19069 2026-05-18 cs.AI

Asking the Right Questions: Improving Reasoning with Generated Stepping Stones

Hengyuan Hu, Tingchen Fu, Minqi Jiang, Alexander H Miller, Yoram Bachrach, Jakob Nicolaus Foerster

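AI summary This study explores how generating intermediate "stepping stone" questions can improve large language models on complex reasoning tasks. It proposes ARQ, a framework that adds a question generator to the default reasoning pipeline to help the model decompose tasks step by step and build useful intermediate steps. Experiments show that the generated stepping stone questions are transferable, help models of varying capability solve the target tasks, and can be further improved through post-training.

Abstract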

Recent years have witnessed tremendous progress in enabling LLMs to solve complex reasoning tasks such as math and coding. As we start to apply LLMs to harder tasks that they may not be able to solve in one shot, it is worth paying attention to their ability to construct intermediate stepping stones that prepare them to better solve the tasks. Examples of stepping stones include simplifications, alternative framings, or subproblems. We study properties and benefits of stepping stones in the context of modern reasoning LLMs via ARQ (Asking the Right Questions), a simple framework that introduces a question generator to the default reasoning pipeline. We first show that good stepping stone questions exist and are transferable, meaning that good questions can be generated, and they substantially help LLMs of various capabilities in solving the target tasks. We next frame stepping stone generation as a post-training task and show that we can fine-tune LLMs to generate more useful stepping stones by SFT and RL on synthetic data.

2602.18801 2026-05-18 cs.LG

SGNO: Spectral Generator Neural Operators for Stable Long Horizon PDE Rollouts

Jiayi Li, Penghao Jiang, Hira Saleem, Zhaonan Wang, Piotr Koniusz, Flora D. Salim

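AI summary This paper proposes the Spectral Generator Neural Operator (SGNO) to address error accumulation in long-horizon prediction of time-dependent partial differential equations (PDEs). SGNO uses a structured spectral evolution update that combines a real-valued nonpositive diagonal generator with a correction pathway based on complex-valued spectral mixing, giving stable long-horizon prediction for dissipative, dispersive, transport-dominated, and nonlinear PDEs. Experiments show that SGNO significantly outperforms existing single-step autoregressive methods on multiple mechanism-matched APEBench tasks, especially dispersive and nonlinearly coupled ones.

Abstract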

Autoregressive neural PDE surrogates predict future states by repeatedly applying a learned one-step operator. This is a simple and widely used method, but small one-step errors can accumulate during long rollouts. The resulting drift often appears as spectral amplitude distortion, phase misalignment, and nonlinear mode-interaction error. These effects are especially important for time-dependent PDEs with clear Fourier structure. We introduce the Spectral Generator Neural Operator (SGNO), a structured autoregressive neural operator for long-horizon PDE forecasting. SGNO organizes each learned one-step map as a structured spectral evolution update. A real-valued nonpositive diagonal generator provides a gain-controlled spectral backbone, while a learned correction pathway with complex-valued spectral mixing completes the residual evolution. This design gives the autoregressive step an evolution-like structure while retaining the flexibility needed for dissipative, dispersive, transport-dominated, and nonlinear PDEs. SGNO is designed for periodic linear and semilinear evolution PDEs with Fourier multiplier linear dynamics. Across ten mechanism-matched APEBench tasks spanning this regime, SGNO consistently outperforms strong single-step autoregressive baselines in long-horizon rollout accuracy, reducing GMean100 by a median of 74.8% relative to the strongest available non-SGNO baseline, with per-task reductions ranging from 13.6% to 92.9%. The gains are strongest on dispersive and transport-dominated tasks, as well as tasks involving nonlinear closure and mode coupling. Spectral diagnostics show lower spectral energy error and improved rollout-level phase fidelity. Ablations show that the constrained generator, the structured update, and the learned correction pathway each contribute to performance. The code is available at https://github.com/cruiseresearchgroup/SGNO.
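To make the "real-valued nonpositive diagonal generator" in the abstract above concrete, here is a minimal sketch of one way such a gain-controlled spectral step could look for a periodic 1-D signal. The softplus reparameterization, the parameter name `theta`, and the use of a real FFT are assumptions made for illustration; the learned correction pathway with complex-valued spectral mixing is omitted entirely, so this is not the SGNO architecture itself.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def generator_step(u, theta, dt=1.0):
    """Advance a real periodic 1-D signal by one diagonal spectral generator step.

    theta holds one unconstrained parameter per retained Fourier mode
    (hypothetical name; length must be len(u) // 2 + 1).
    """
    lam = -softplus(theta)       # assumption: enforce a nonpositive generator
    gain = np.exp(dt * lam)      # per-mode gain, never larger than 1
    u_hat = np.fft.rfft(u)
    return np.fft.irfft(gain * u_hat, n=u.shape[-1])
```

Because every mode is scaled by a gain of at most 1, repeated application cannot amplify the signal, which is one plausible reading of why such a backbone helps long rollouts stay stable.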

2602.17363 2026-05-18 cs.LG

2Mamba2Furious: Linear in Complexity, Competitive in Accuracy

Gabriel Mongaras, Eric C. Larson

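AI summary This paper proposes 2Mamba, a linear attention model that aims to close the accuracy gap between linear attention and softmax attention. By simplifying and improving the core components of Mamba-2, 2Mamba approaches the accuracy of softmax attention while remaining memory efficient, particularly at long context lengths. The study also examines the factors that matter most for linear attention performance and provides experiment code.

Abstract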

Linear attention transformers have become a strong alternative to softmax attention due to their efficiency. However, linear attention tends to be less expressive and results in reduced accuracy compared to softmax attention. To bridge the accuracy gap between softmax attention and linear attention, we manipulate Mamba-2, a very strong linear attention variant. We first simplify Mamba-2 down to its most fundamental and important components, evaluating which specific choices make it most accurate. From this simplified Mamba variant (Mamba-2S), we improve the A-mask and increase the order of the hidden state, resulting in a method, which we call 2Mamba, that is nearly as accurate as softmax attention, yet much more memory efficient for long context lengths. We also investigate elements of Mamba-2 that help surpass softmax attention accuracy. Code is provided for all our experiments.

2602.17050 2026-05-18 cs.LG

Multi-Probe Zero Collision Hash (MPZCH): Mitigating Embedding Collisions and Enhancing Model Freshness in Large-Scale Recommenders

Ziliang Zhao, Bi Xue, Emma Lin, Tianqi Lu, Mengjiao Zhou, Kaustubh Vartak, Shakhzod Ali-Zade, Tao Li, Bin Kuang, Rui Jian, Bin Wen, Dennis van der Staay, Yixin Bao, Eddy Li, Chao Deng, Henry Wei, Songbin Liu, Qifan Wang, Kai Ren

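AI summary In large-scale recommendation systems, embedding tables are a key component for handling high-cardinality categorical features, but traditional hash-based indexing suffers collisions when the number of unique IDs is large, hurting model performance and personalization quality. This paper proposes Multi-Probe Zero Collision Hash (MPZCH), a new indexing mechanism based on linear probing that mitigates embedding collisions and, with reasonable table sizing, achieves almost zero collisions. MPZCH uses auxiliary tensors and high-performance CUDA kernels to support configurable probing and active eviction policies, preventing the inheritance of stale embeddings and improving learning for new features. Experiments show that it significantly improves embedding freshness and quality while maintaining training throughput and inference latency.

Comments 9 pages, 6 figures

Abstract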

Embedding tables are critical components of large-scale recommendation systems, facilitating the efficient mapping of high-cardinality categorical features into dense vector representations. However, as the volume of unique IDs expands, traditional hash-based indexing methods suffer from collisions that degrade model performance and personalization quality. We present Multi-Probe Zero Collision Hash (MPZCH), a novel indexing mechanism based on linear probing that effectively mitigates embedding collisions. With reasonable table sizing, it often eliminates these collisions entirely while maintaining production-scale efficiency. MPZCH utilizes auxiliary tensors and high-performance CUDA kernels to implement configurable probing and active eviction policies. By retiring obsolete IDs and resetting reassigned slots, MPZCH prevents the stale embedding inheritance typical of hash-based methods, ensuring new features learn effectively from scratch. Despite its collision-mitigation overhead, the system maintains training QPS and inference latency comparable to existing methods. Rigorous online experiments demonstrate that MPZCH achieves zero collisions for user embeddings and significantly improves item embedding freshness and quality. The solution has been released within the open-source TorchRec library for the broader community.
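The combination of linear probing and eviction described in the abstract above can be sketched in plain Python. This is a toy illustration, not the TorchRec implementation: the class name, the modulo stand-in for a hash function, the time-to-live eviction rule, and the fallback when the probe window is full are all assumptions.

```python
class ProbingIdTable:
    """Toy ID-to-slot map using linear probing with time-based eviction."""

    def __init__(self, size, max_probe, ttl):
        self.size = size
        self.max_probe = max_probe      # how many consecutive slots to try
        self.ttl = ttl                  # steps after which an idle ID may be evicted
        self.ids = [None] * size        # auxiliary array: which ID owns each slot
        self.last_seen = [0] * size     # auxiliary array: step of last access

    def lookup(self, raw_id, step):
        start = raw_id % self.size      # stand-in for a real hash function
        window = [(start + k) % self.size for k in range(self.max_probe)]
        # Pass 1: the ID may already own a slot in its probe window.
        for slot in window:
            if self.ids[slot] == raw_id:
                self.last_seen[slot] = step
                return slot, "hit"
        # Pass 2: claim the first empty or expired slot.
        for slot in window:
            empty = self.ids[slot] is None
            expired = not empty and step - self.last_seen[slot] > self.ttl
            if empty or expired:
                self.ids[slot] = raw_id
                self.last_seen[slot] = step
                return slot, ("insert" if empty else "evict")
        # Assumed fallback: share the home slot when the window is full.
        return start, "collide"
```

In this sketch an `"insert"` or `"evict"` status is the signal that the embedding row at the returned slot should be re-initialized, so a new ID does not inherit the previous owner's learned vector.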

2602.16363 2026-05-18 cs.LG

Improved Bounds for Reward-Agnostic and Reward-Free Exploration

Oran Ridel, Alon Cohen

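AI summary This paper studies reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without external reward signals. For the reward-agnostic setting, the authors propose a new algorithm that significantly relaxes the restriction on the accuracy parameter $ε$; it runs online learning with carefully designed reward functions to construct an exploration policy for data collection, enabling accurate dynamics estimation and subsequent computation of an $ε$-optimal policy. The authors also establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.

Abstract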

We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $ε$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $ε$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter $ε$. We propose a new algorithm that significantly relaxes the requirement on $ε$. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an $ε$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.

2602.16274 2026-05-18 cs.LG stat.ML

Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Rahul Singh, Siddharth Chandak, Eric Moulines, Vivek S. Borkar, Nicholas Bambos

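AI summary This paper gives the first regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. It analyzes Boltzmann Q-learning with decaying temperature and proposes a smoothed exploration scheme that combines $ε_n$-greedy and Boltzmann exploration, proving a regret bound of near-$\tilde{O}(N^{9/10})$ that is robust to the suboptimality gap. The authors also give high-probability sample complexity guarantees and develop a high-probability concentration bound for contractive Markovian stochastic approximation, a result of independent interest.

Abstract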

We present the first regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes (MDPs), without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $ε_n$-Greedy exploration scheme that combines $ε_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. We also obtain sample complexity guarantees, with both regret and sample complexity bounds holding with high probability. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our framework is allowed to converge to one asymptotically.

2602.14896 2026-05-18 cs.LG

Algorithmic Simplification of Neural Networks with Mosaic-of-Motifs

Pedram Bakhtiarifard, Tong Chen, Jonathan Wenshøj, Erik B Dam, Raghavendra Selvan

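AI summary This paper addresses the central question of why deep neural networks are suited for compression and proposes an explanation from the perspective of algorithmic complexity. It hypothesizes that the parameters of trained models have more structure and hence lower algorithmic complexity, and introduces a parameterization based on reusable blocks (motifs) that constrains the choice of parameter blocks to steer optimization toward simpler solutions. Experiments show that the method lowers the algorithmic complexity of networks while preserving model performance, offering a theoretical basis and a new direction for model compression.

Abstract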

Large-scale deep learning models are well-suited for compression. Across a variety of tasks, methods like pruning, quantization, and knowledge distillation have been used to achieve massive reductions in model parameters with only marginal performance drops. This raises the central question: *Why are deep neural networks suited for compression?* In this work, we take up the perspective of algorithmic complexity to explain this behavior. We hypothesize that the parameters of trained models have more structure and, hence, exhibit lower algorithmic complexity compared to the weights at (random) initialization. Furthermore, model compression methods harness this reduced algorithmic complexity to compress models. Although an unconstrained parameterization of model weights, $\mathbf{w} \in \mathbb{R}^n$, can represent arbitrary weight assignments, the solutions found during training exhibit repeatability and structure, making them simpler to implement than a trivial program. To this end, we formalize the Kolmogorov complexity of $\mathbf{w}$ by $\mathcal{K}(\mathbf{w})$. We introduce a constrained parameterization $\widehat{\mathbf{w}}$ that partitions parameters into blocks of size $s$ and restricts each block to be selected from a set of $k$ reusable motifs, specified by a reuse pattern (or mosaic). The resulting method, $\mathit{Mosaic\text{-}of\text{-}Motifs}$ (MoMos), provides a theoretically justified parameterization that biases optimization toward algorithmically simpler solutions. Empirical evidence from multiple experiments shows that MoMos consistently lowers the algorithmic complexity of neural networks during training while preserving the performance of unconstrained models. These results suggest that parameter compressibility is not only observed after training, but can be induced from the optimization domain.
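The constrained parameterization $\widehat{\mathbf{w}}$ in the abstract above (blocks of size $s$, each chosen from $k$ reusable motifs according to a mosaic) can be illustrated with a short sketch. The function names and the list-based representation are hypothetical, and the sketch only shows how a weight vector is rebuilt from motifs and a reuse pattern; it says nothing about how the motifs or the mosaic are learned.

```python
def expand_mosaic(motifs, mosaic):
    """Rebuild the flat weight vector from k motifs and a reuse pattern.

    motifs: list of k blocks, each a list of s floats.
    mosaic: list of motif indices, one per block position in the weight vector.
    """
    block_size = len(motifs[0])
    if any(len(m) != block_size for m in motifs):
        raise ValueError("all motifs must share the same block size s")
    return [value for index in mosaic for value in motifs[index]]


def stored_numbers(motifs, mosaic):
    # Numbers needed to describe the weights:
    # k * s motif entries plus one index per block.
    return len(motifs) * len(motifs[0]) + len(mosaic)
```

For example, two motifs of size 2 reused over six block positions describe 12 weights with 10 stored numbers, and the saving grows as the mosaic gets longer relative to the motif set.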

2602.12262 2026-05-18 cs.CL cs.LG

Few-Step Diffusion Language Models via Trajectory Self-Distillation

Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Chengzhi Mao, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas

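AI summary This paper studies how to achieve efficient, high-quality few-step decoding in diffusion language models. To counter the drop in generation quality caused by few-step decoding, the authors propose a trajectory self-distillation framework in which a few-step student model learns the generative trajectory of a full-step teacher model, mitigating the loss caused by token factorization error. They also introduce Direct Discriminative Optimization, which further improves performance on challenging reasoning tasks and substantially narrows the gap between few-step and full-step decoding.

Abstract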

Diffusion large language models (DLLMs) have emerged as powerful generative models with the promise of fast text generation through parallel decoding. However, realizing this potential in practice remains challenging: reducing the number of decoding steps typically causes a substantial degradation in output quality due to token factorization error. To alleviate this, we propose a self-distillation framework that trains a few-step student to match the generative trajectory of a full-step teacher. We theoretically and empirically show that trajectory-level supervision mitigates this factorization error, thereby enabling effective few-step decoding. We further incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that encourages mode-seeking toward the teacher's modes, yielding stronger performance on challenging reasoning tasks. Across reasoning and code-generation benchmarks, our method substantially narrows the gap between few-step and full-step decoding. The source code is available at https://github.com/Tyrion58/T3D.

2602.10687 2026-05-18 cs.CV cs.AI

OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

Jinjie Shen, Jing Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

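AI summary Existing forgery detection methods are mostly limited to uni-modal or bi-modal settings and struggle with real-world multimodal misinformation. This paper proposes OmniVL-Guard, a unified vision-language forgery detection and grounding framework based on balanced reinforcement learning, which targets the bias that arises from multimodal interaction and multi-task optimization. The method has two core designs, Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization, which improve combined detection and grounding performance and show strong zero-shot generalization on multiple datasets.

Comments Accepted by ICML 2026

Abstract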

Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the interplay between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical "difficulty bias" problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. In particular, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generation and Adaptive Reward Scaling Policy Optimization (ARSPO). Self-Evolving CoT Generation synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, ARSPO dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits robust zero-shot generalization across out-of-domain scenarios. The dataset and code are publicly available at https://github.com/shen8424/OmniVL-Guard.

2602.09297 2026-05-18 cs.LG

Laplacian Heads Improve Transformers by Smoothing Token Representations

Yuchong Zhang, Vardan Papyan

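AI summary This paper proposes improving Transformers by introducing Laplacian heads that smooth token representations. The method replaces the softmax matrix of some attention heads with the corresponding Laplacian matrix, which lets the update also control within-sequence variance and, from a graph perspective, can be interpreted as heat diffusion. Experiments show gains on supervised learning, language modeling, and self-supervised learning tasks, along with better separability and structural alignment of token representations, challenging the conventional view that token oversmoothing is harmful.

Abstract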

Transformers update token representations through multi-head attention and residual connections as $X \leftarrow X + \sum_{i} P^{(i)}XW_{V_i}W_{o_i}$, where $P^{(i)}$ is the softmax attention matrix in head $i$. We propose replacing a subset of $P^{(i)}$'s with the Laplacian $I - P^{(i)}$, giving $X \leftarrow X + \sum_{i \in \mathcal{A}} P^{(i)}XW_{V_i}W_{o_i} + \sum_{i \in \mathcal{L}} (I - P^{(i)})XW_{V_i}W_{o_i}$. Our proposal has two motivations. First, it allows attention heads to update the mean of token representations, while Laplacian heads can directly control within-sequence variance. Second, if tokens are viewed as nodes in a graph with edge weights $P^{(i)}$, then $I - P^{(i)}$ is the corresponding graph Laplacian, and the update can be interpreted as one step of heat diffusion on the graph. We show that this simple modification improves performance across supervised learning, language modeling, and self-supervised learning tasks. To investigate why, we examine the token representations learned with and without Laplacian heads. In supervised learning, Laplacian heads collapse token representations within the same sequence and align the sequence means with the geometry of Neural Collapse. In language modeling, they increase the separability of token representations that share the same next-token prediction. In self-supervised learning, they produce token representations whose principal components are better suited for segmentation. Across modalities, they also lead to faster-decaying spectra, indicating stronger token smoothing. Overall, our findings challenge the prevailing view that token oversmoothing is inherently harmful, showing instead that certain forms of smoothing can be beneficial.
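The update rule in the abstract above, with heads split into an attention set $\mathcal{A}$ and a Laplacian set $\mathcal{L}$, can be written out as a small NumPy sketch. The abstract does not say how $P^{(i)}$ is computed, so the unmasked scaled dot-product softmax and the query/key matrices `Wq` and `Wk` are assumptions, as are the function names.

```python
import numpy as np

def softmax_rows(scores):
    # Row-wise softmax with max-subtraction for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def mixed_head_update(X, heads, laplacian_ids):
    """One residual update X + sum_i M_i X W_V W_o.

    M_i is P_i for ordinary heads and I - P_i for heads listed in laplacian_ids.
    heads is a list of (Wq, Wk, Wv, Wo) tuples (hypothetical layout).
    """
    num_tokens = X.shape[0]
    out = X.copy()
    for i, (Wq, Wk, Wv, Wo) in enumerate(heads):
        # Assumption: unmasked scaled dot-product attention.
        scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
        P = softmax_rows(scores)
        M = np.eye(num_tokens) - P if i in laplacian_ids else P
        out = out + M @ X @ Wv @ Wo
    return out
```

Since each row of $P^{(i)}$ sums to one, $(I - P^{(i)})X$ vanishes when all tokens are identical, so a Laplacian head acts only on within-sequence differences, which matches the variance-control motivation given in the abstract.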

2602.05414 2026-05-18 cs.CV

TSBOW -- Traffic Surveillance Benchmark for Occluded Vehicles Under Various Weather Conditions

Ngoc Doan-Minh Huynh, Duong Nguyen-Ngoc Tran, Long Hoang Pham, Tai Huu-Phuong Tran, Hyung-Joon Jeon, Huy-Hung Nguyen, Duong Khac Vu, Hyung-Min Jeon, Son Hong Phan, Quoc Pham-Nam Ho, Chi Dai Tran, Trinh Le Ba Khanh, Jae Wook Jeon

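AI summary As global warming increases the frequency and severity of extreme weather events, existing traffic surveillance datasets fall short for detecting occluded vehicles under complex weather conditions. This study presents the TSBOW dataset, which contains over 32 hours of real-world urban traffic video covering a range of weather conditions and occlusion scenarios, with more than 48,000 manually annotated frames, and aims to improve the detection of traffic participants in adverse weather. TSBOW provides an important resource for research on intelligent transportation systems and advances CCTV-based traffic monitoring.

Comments This paper has been accepted by the 40th AAAI Conference on Artificial Intelligence (AAAI-26)

Journal ref
Proceedings of the AAAI Conference on Artificial Intelligence. 40(2026). 5239-5247
Abstract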

Global warming has intensified the frequency and severity of extreme weather events, which degrade CCTV signal and video quality while disrupting traffic flow, thereby increasing traffic accident rates. Existing datasets, often limited to light haze, rain, and snow, fail to capture extreme weather conditions. To address this gap, this study introduces the Traffic Surveillance Benchmark for Occluded vehicles under various Weather conditions (TSBOW), a comprehensive dataset designed to enhance occluded vehicle detection across diverse annual weather scenarios. Comprising over 32 hours of real-world traffic data from densely populated urban areas, TSBOW includes more than 48,000 manually annotated and 3.2 million semi-labeled frames, with bounding boxes spanning eight traffic participant classes from large vehicles to micromobility devices and pedestrians. We establish an object detection benchmark for TSBOW, highlighting challenges posed by occlusions and adverse weather. With its varied road types, scales, and viewpoints, TSBOW serves as a critical resource for advancing Intelligent Transportation Systems. Our findings underscore the potential of CCTV-based traffic monitoring and pave the way for new research and applications. The TSBOW dataset is publicly available at: https://github.com/SKKUAutoLab/TSBOW.

2602.04909 2026-05-18 cs.LG

Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment

Youngjae Cho, Jongsuk Kim, Ji-Hoon Kim

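AI summary This paper studies how to make preference alignment of large language models more robust under noisy supervision. To address the way a fixed reference policy becomes miscalibrated as the policy drifts, the authors propose Geometric Anchor Preference Optimization (GAPO), which dynamically generates a local adversarial perturbation of the policy as a pessimistic baseline and uses it to adaptively reweight preference pairs. The method suppresses the influence of brittle samples and improves robustness across noise settings while matching or exceeding existing methods on standard alignment and reasoning tasks.

Comments Under Review

Abstract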

Direct Preference Optimization (DPO) and related methods align large language models from pairwise preferences by regularizing updates against a fixed reference policy. As the policy drifts, a static reference, however, can become increasingly miscalibrated, leading to distributional mismatch and amplifying spurious preference signals under noisy supervision. Conversely, reference-free variants avoid mismatch but often suffer from unconstrained reward drift. We propose Geometric Anchor Preference Optimization (GAPO), which replaces the fixed reference with a dynamic, geometry-aware anchor: an adversarial local perturbation of the current policy within a small radius that serves as a pessimistic baseline. This anchor enables an adaptive reweighting mechanism, modulating the importance of each preference pair based on its local sensitivity. We further introduce the Anchor Gap, the reward discrepancy between the policy and its anchor, and show under smoothness conditions that it approximates worst-case local margin degradation. Optimizing a logistic objective weighted by this gap downweights geometrically brittle instances while emphasizing robust preference signals. Across diverse noise settings, GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.

2602.04163 2026-05-18 cs.LG

BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models

Junyu Chen, Jungang Li, Jing Xiong, Wenjie Wang, Qingyao Yang, He Xiao, Zhen Li, Taiqiang Wu, Mengzhao Chen, Zhen Peng, Chaofan Tao, Long Shi, Hongxia Yang, Ngai Wong

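AI summary In resource-constrained deployments, large language model inference is often limited by memory and bandwidth, making quantization key to efficiency. Existing post-training quantization methods retain high accuracy at 4 bits but degrade markedly at 2-3 bits, mainly because a fixed quantization grid restricts error minimization. This paper proposes BPDQ, a variable-grid quantization method based on bit-plane decomposition, which builds a dynamic quantization grid from bit-planes and scalar coefficients and iteratively refines it with second-order information, progressively compensating for quantization error to minimize output discrepancy. Experiments show that BPDQ can still serve very large models with high accuracy at 2 bits, and theoretical analysis shows that the variable grid expands the feasible set for error minimization.

Abstract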

Large language model inference is often bounded by memory footprint and bandwidth in resource-constrained deployments, making quantization fundamental to efficient serving. While post-training quantization (PTQ) maintains high fidelity at 4-bit, it deteriorates at 2-3 bits. In essence, existing methods enforce a shape-invariant quantization grid (e.g., the fixed uniform intervals of UINT2) for each group, severely restricting the feasible set for error minimization. To address this, we propose Bit-Plane Decomposition Quantization (BPDQ), which constructs a variable quantization grid via bit-planes and scalar coefficients, and iteratively refines them using second-order information while progressively compensating for quantization errors to minimize output discrepancy. In the 2-bit regime, BPDQ enables serving Qwen2.5-72B on a single RTX 3090 with 83.85% GSM8K accuracy (vs. 90.83% at 16-bit). Moreover, we theoretically show that the variable grid expands the feasible set, and that the quantization process consistently aligns with the optimization objective in Hessian-induced geometry. The code is available at https://github.com/KingdalfGoodman/BPDQ.
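The contrast between a fixed uniform grid and a variable grid built from bit-planes and scalar coefficients can be illustrated with a toy sketch. It assumes each quantized value has the form offset plus a sum of coefficients selected by binary bits; that exact form, the `offset` term, and the brute-force nearest-level search are assumptions for illustration, and the second-order refinement and error compensation described in the abstract above are not modeled.

```python
from itertools import product

def grid_levels(coeffs, offset=0.0):
    """All values offset + sum_i coeffs[i] * b_i for bits b_i in {0, 1}, sorted."""
    return sorted(
        offset + sum(c * b for c, b in zip(coeffs, bits))
        for bits in product((0, 1), repeat=len(coeffs))
    )

def quantize(weight, coeffs, offset=0.0):
    """Return the bit pattern and grid level closest to weight (brute force)."""
    best = None
    for bits in product((0, 1), repeat=len(coeffs)):
        level = offset + sum(c * b for c, b in zip(coeffs, bits))
        if best is None or abs(weight - level) < abs(weight - best[1]):
            best = (bits, level)
    return best
```

With coefficients `[1.0, 2.0]` the levels are the uniform 2-bit grid 0, 1, 2, 3, while other coefficient choices such as `[1.0, 5.0]` give non-uniform levels from the same two bit-planes, which is the sense in which a variable grid enlarges the set of reachable quantized values.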

2602.03922 2026-05-18 cs.LG

Online Vector Quantized Attention

Nick Alonso, Tomas Figliolia, Beren Millidge

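AI summary This paper proposes online vector-quantized attention (OVQ-attention), a sequence mixing layer that aims for a better balance between compute and memory efficiency and long-context processing. It keeps linear compute and constant memory while using a sparse memory update to greatly increase memory capacity, and so performs well on long-sequence tasks. Experiments show that OVQ-attention outperforms conventional linear attention and the original vector-quantized attention on several synthetic and language modeling tasks, and matches self-attention at longer sequence lengths.

Abstract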

Standard sequence mixing layers used in language models struggle to balance efficiency and performance. Self-attention performs well on long context tasks but has expensive quadratic compute and linear memory costs, while linear attention and SSMs use only linear compute and constant memory but struggle with long context processing. In this paper, we develop a sequence mixing layer that aims to find a better compromise between memory-compute costs and long-context processing, which we call online vector-quantized (OVQ) attention. OVQ-attention requires linear compute costs and constant memory, but, unlike linear attention and SSMs, it uses a sparse memory update that allows it to greatly increase the size of its memory state and, consequently, memory capacity. We develop a theoretical basis for OVQ-attention based on Gaussian mixture regression, and we test it on a variety of synthetic long context tasks and on long context language modeling. OVQ-attention shows significant improvements over linear attention baselines and the original VQ-attention, by which OVQ-attention was inspired. It demonstrates competitive, and sometimes identical, performance to strong self-attention baselines up to 64k sequence length, despite using a small fraction of the memory of full self-attention.

2602.03812 2026-05-18 cs.LG cs.AI cs.CL

Antidistillation Fingerprinting

Yixuan Even Xu, John Kirchenbauer, Yash Savani, Asher Trockman, Alexander Robey, Tom Goldstein, Fei Fang, J. Zico Kolter

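AI summary This study proposes antidistillation fingerprinting (ADFP), a new method for detecting whether a third-party model has been distilled from a teacher model's outputs. Unlike existing methods that rely on heuristic perturbations, ADFP aligns the fingerprinting objective with the student model's learning dynamics and uses a proxy model to select tokens that maximize fingerprint detectability, improving detection while preserving generation quality. Experiments show that ADFP achieves a better trade-off between detection performance and utility than existing methods on mathematical reasoning, dialogue, and code generation tasks.

Comments 28 pages, 13 figures, ICML 2026

Abstract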

Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K, OASST1, and MBPP demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility across mathematical reasoning, dialogue, and code generation, even when the student model's architecture is unknown.

2602.00841 2026-05-18 cs.CV

Beyond First-Order: Learning Riemannian Geometries for Invariant Visual Place Recognition

Jintao Cheng, Weibin Li, Zhijian He, Jin Wu, Chi Man Vong, Wei Zhang

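AI summary This paper studies how to build feature representations for visual place recognition (VPR) that are robust to drastic environmental and viewpoint changes. To address the loss of structural correlations under extreme shifts and the high adaptation cost of existing methods, it proposes Riemannian Invariant Aggregation (RIA), a framework that models second-order scene structure on the symmetric positive definite manifold, preserving invariant structural information while suppressing noise. Experiments show that RIA reaches performance comparable to supervised methods without extensive supervised training and achieves state-of-the-art recognition accuracy in unstructured environments.

Comments 14 pages, 5 figures

Abstract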

Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Existing aggregation paradigms either depend on extensive supervised training or rely on first-order pooling, often struggling to preserve structural correlations under extreme shifts or incurring high adaptation costs. In this work, we propose Riemannian Invariant Aggregation (RIA), a unified geometric framework that explicitly models second-order scene structure on the Symmetric Positive Definite (SPD) manifold. By treating perturbations as tractable congruence transformations, RIA leverages geometry-aware Riemannian mappings to project covariance descriptors into a linearized Euclidean space, effectively preserving invariant structural components while suppressing noise. Extensive evaluations demonstrate that RIA achieves zero-shot performance comparable to supervised methods, and establishes state-of-the-art accuracy with simple fine-tuning, particularly in unstructured environments. The source code will be released.

2601.21702 2026-05-18 cs.LG cs.CL

Beyond Forgetting: Machine Unlearning Elicits Controllable Side Behaviors and Capabilities

Tien Dang, The-Hai Nguyen, Dinh Mai Phuong, Nguyen Minh Phuong, Anh Bui, Hoang Thanh-Tung, Le-Minh Nguyen, Naoya Inoue

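AI summary This paper studies Representation Misdirection (RM), a class of LLM unlearning methods that achieve forgetting by steering the latent representations of forget-samples toward a target vector. Starting from the Linear Representation Hypothesis, the authors propose that RM not only achieves forgetting but also elicits controllable side behaviors and enhanced capabilities tied to high-level concepts. Experiments show that RM can be used to control behaviors such as a model's truthfulness and sentiment, or to improve its in-context learning ability, revealing both the potential value and the risks of the method for developing controllable models.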

Comments 36 pages, 19 tables, 9 figures

English abstract

We consider Representation Misdirection (RM), a class of large language model (LLM) unlearning methods that achieve forgetting by redirecting the forget-representations, that is, latent representations of forget-samples, toward a target vector. Despite its importance, the role of the target vector used in RM remains underexplored. Here, we revisit RM through the lens of the Linear Representation Hypothesis. Specifically, if one can identify a one-dimensional representation corresponding to a high-level concept, the Linear Representation Hypothesis enables linear operations on this concept vector within the forget-representation space. Under this view, we hypothesize that, beyond forgetting, machine unlearning via RM elicits controllable emergent side behaviors and stronger side capabilities corresponding to the high-level concept. Our hypothesis is empirically validated across a wide range of tasks, including behavioral control (e.g., controlling unlearned models' truthfulness, sentiment, refusal, and language) and capability enhancement (e.g., improving unlearned models' in-context learning (ICL) capability). Our findings reveal that this phenomenon could be either a hidden risk if misused or a mechanism that can be harnessed for developing unlearned models that require stronger capabilities and controllable behaviors.

2601.21294 2026-05-18 cs.LG stat.ML

Missing-Data-Induced Phase Transitions in Spectral PLS for Multimodal Learning

Anders Gjølbye, Ida Kargaard, Emma Kargaard, Lina Skerath, Lars Kai Hansen

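AI summary This paper studies how missing data affects spectral partial least squares (PLS) in multimodal learning. By analyzing the effect of independent, missing-completely-at-random masking on the cross-covariance matrix in a high-dimensional spiked model, it finds that missing data attenuates the signal strength and induces a BBP-type phase transition: below a critical signal-to-noise threshold the leading singular vectors fail to capture the latent shared structure, while above it they achieve nontrivial alignment. The study also states a finite-rank extension as a conjecture and checks the predicted phase diagram and recovery curves with simulations and semi-synthetic experiments.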

Comments Preprint

English abstract

Partial Least Squares (PLS) learns shared structure from paired data via the top singular vectors of the empirical cross-covariance (PLS-SVD), but multimodal datasets often have missing entries in both views. We study PLS-SVD under independent entry-wise missing-completely-at-random masking in a proportional high-dimensional spiked model. After appropriate normalization, the masked cross-covariance behaves like a spiked rectangular random matrix whose effective signal strength is attenuated by $\sqrt{\rho}$, where $\rho$ is the joint entry retention probability. The replica-symmetric analysis predicts a sharp BBP-type phase transition: below a critical signal-to-noise threshold the leading singular vectors are asymptotically uninformative, while above it they achieve nontrivial alignment with the latent shared directions, with closed-form asymptotic overlap formulas. We also state a finite-rank extension as a conjecture, predicting that the same missingness-adjusted threshold applies componentwise when the latent spikes are separated. Simulations and semi-synthetic multimodal experiments agree with the predicted phase diagram and recovery curves across aspect ratios, signal strengths, and missingness levels.
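To make the PLS-SVD setup above concrete, here is a minimal simulation sketch: it draws a rank-one shared signal in two views, masks entries independently, zero-fills them, and measures how well the top singular vectors of the masked cross-covariance align with the planted directions. The specific spiked model, the parameter values, and all names (`masked_pls_svd`, `simulate`, `theta`, `keep_x`, `keep_y`) are assumptions chosen for illustration; the abstract does not give the paper's exact normalization or threshold.

```python
import numpy as np

def masked_pls_svd(X, Y, mask_x, mask_y):
    """Top singular triple of the zero-filled, masked empirical cross-covariance."""
    Xm = np.where(mask_x, X, 0.0)  # missing entries are replaced by zero
    Ym = np.where(mask_y, Y, 0.0)
    C = Xm.T @ Ym / X.shape[0]     # (p, q) cross-covariance estimate
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, 0], s[0], Vt[0]

def simulate(n=2000, p=50, q=40, theta=3.0, keep_x=0.8, keep_y=0.8, seed=0):
    """Return |<u_hat, u>| and |<v_hat, v>| for one hypothetical spiked draw."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(p)
    u /= np.linalg.norm(u)
    v = rng.standard_normal(q)
    v /= np.linalg.norm(v)
    z = rng.standard_normal(n)     # shared latent factor
    X = np.sqrt(theta) * np.outer(z, u) + rng.standard_normal((n, p))
    Y = np.sqrt(theta) * np.outer(z, v) + rng.standard_normal((n, q))
    # Independent entry-wise masks; with zero-filling, the expected
    # cross-covariance is scaled by keep_x * keep_y.
    mask_x = rng.random((n, p)) < keep_x
    mask_y = rng.random((n, q)) < keep_y
    u_hat, _, v_hat = masked_pls_svd(X, Y, mask_x, mask_y)
    return abs(u_hat @ u), abs(v_hat @ v)

overlap_u, overlap_v = simulate()
print(f"overlap with u: {overlap_u:.3f}, overlap with v: {overlap_v:.3f}")
```

Sweeping `theta` or the retention probabilities downward in this sketch is one way to see the overlaps collapse toward zero, which is the qualitative behavior the abstract describes.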

2601.19923 2026-05-18 cs.CL cs.AI

Structure-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation for Web Information Systems

Boxiang Zhao, Qince Li, Zhonghao Wang, Zelin Cao, Yi Wang, Peng Cheng, Bo Lin

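AI summary As large language models (LLMs) take on a central role in web-based autonomous agents and complex web information systems, their ability to accurately convert natural language into structured formats has become critical. This paper proposes Structure-BiEval, a self-supervised framework that needs no human annotation: it decouples structure from content and uses metrics such as Content Semantic Accuracy and Normalized Tree Edit Distance to quantify the structural fidelity of web data. The results show substantial differences across LLMs of different sizes on structuring tasks, and deeply nested structures remain a challenge for all of them.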

English abstract

As Large Language Models (LLMs) evolve into the core of Web-based autonomous agents and complex Web Information Systems, their ability to faithfully translate natural language into rigorous structured formats has become paramount, as this capability is critical for Web API invocation and data exchange. However, evaluating this structural fidelity in Web-native payloads remains a challenge: traditional text metrics fail to capture topological consistency in semi-structured Web data, while manual evaluation is prohibitively costly. To address this, we propose Structure-BiEval, a novel self-supervised framework for quantitative, annotation-free assessment tailored for Web data engineering. By leveraging deterministic Intermediate Representations, our framework effectively decouples structure from content, utilizing Content Semantic Accuracy and Normalized Tree Edit Distance as precise metrics. We empirically benchmark 15 state-of-the-art LLMs across dual Web structural topologies, namely Hierarchical Data (Web backend payloads) and Tabular Data (Web frontend presentation). The results reveal substantial variability in structural performance, including cases where mid-sized models unexpectedly outperform larger counterparts in Web data formatting. Furthermore, our findings show that deep recursive nesting poses a consistent challenge for Web agents across varying parameter scales.

2601.13529 2026-05-18 cs.RO

The OncoReach Stylet for Brachytherapy: Design Evaluation and Pilot Study

Pejman Kheradmand, Kent K. Yamamoto, Emma Webster, Keith Sowards, Gianna Hatheway, Katharine L. Jackson, Sabino Zani, Julie A. Raffi, Diandra N. Ayala-Peacock, Scott R. Silva, Joanna Deaton Bertram, Yash Chitalia

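AI summary This paper presents OncoReach, a steerable stylet for cervical cancer brachytherapy. The device is tendon-driven, compatible with standard 15- and 13-gauge needles, and achieves greater bending compliance while retaining axial stiffness. Its performance was evaluated through experiments and modeling, and a pilot test on a patient-derived, multi-composite phantom showed that it can be steered from less-invasive entry points to lateral targets, supporting the clinical potential of steerable stylets.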

English abstract

Cervical cancer accounts for a significant portion of the global cancer burden among women. Interstitial brachytherapy (ISBT) is a standard procedure for treating cervical cancer; it involves placing a radioactive source through a straight hollow needle within or in close proximity to the tumor and surrounding tissue. However, the use of straight needles limits surgical planning to a linear needle path. We present the OncoReach stylet, a handheld, tendon-driven steerable stylet designed for compatibility with standard ISBT 15- and 13-gauge needles. Building upon our prior work, we evaluated design parameters like needle gauge, spherical joint count and spherical joint placement, including an asymmetric disk design to identify a configuration that maximizes bending compliance while retaining axial stiffness. Free space experiments quantified tip deflection across configurations, and a two-tube Cosserat rod model accurately predicted the centerline shape of the needle for most trials. The best performing configuration was integrated into a reusable handheld prototype that enables manual actuation. A patient-derived, multi-composite phantom model of the uterus and pelvis was developed to conduct a pilot study of the OncoReach steerable stylet with one expert user. Results showed the ability to steer from less-invasive, medial entry points to reach the lateral-most targets, underscoring the significance of steerable stylets.

2601.00678 2026-05-18 cs.CV

Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians

Melonie de Almeida, Daniela Ivanova, Tong Shi, John H. Williamson, Paul Henderson

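AI summary This paper proposes a new method for generating dynamic video from a single image, producing high-quality, temporally consistent video that follows a given camera trajectory. The core idea is to construct a dynamic 3D Gaussian scene representation and generate plausible object motion in a single forward pass, which enables fast camera-controlled video generation. The method performs strongly on several datasets, reaching state-of-the-art video quality and inference efficiency.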

English abstract

Humans excel at forecasting the future dynamics of a scene given just a single image. Video generation models that can mimic this ability are an essential component for intelligent systems. Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation. However, these methods often lack robust user controllability, such as modifying the camera path, limiting their applicability in real-world applications. Most existing camera-controlled image-to-video models struggle with accurately modeling camera motion, maintaining temporal consistency, and preserving geometric integrity. Leveraging explicit intermediate 3D representations offers a promising solution by enabling coherent video generation aligned with a given camera trajectory. Although these methods often use 3D point clouds to render scenes and introduce object motion in a later stage, this two-step process still falls short in achieving full temporal consistency, despite allowing precise control over camera movement. We propose a novel framework that constructs a 3D Gaussian scene representation and samples plausible object motion, given a single image in a single forward pass. This enables fast, camera-guided video generation without the need for iterative denoising to inject object motion into render frames. Extensive experiments on the KITTI, Waymo, RealEstate10K and DL3DV-10K datasets demonstrate that our method achieves state-of-the-art video quality and inference efficiency. The project page is available at https://melonienimasha.github.io/Pixel-to-4D-Website.

2512.21651 2026-05-18 cs.LG

Rethinking Output Alignment For 1-bit Post-Training Quantization of Large Language Models

Dung Anh Hoang, Cuong Pham, Cuong Nguyen, Trung Le, Jianfei Cai, Thanh-Toan Do

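AI summary Large language models (LLMs) perform strongly on natural language processing tasks, but their size limits deployment on resource-constrained devices. Focusing on the difficulty of preserving output behavior in 1-bit post-training quantization (PTQ), this paper identifies two root causes, error accumulation and anisotropic distortion of the representation space, and proposes a new PTQ method that improves 1-bit quantized models. Experiments show that the method outperforms existing methods across multiple tasks.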

English abstract

Large Language Models (LLMs) deliver strong performance across a wide range of NLP tasks, but their massive sizes hinder deployment on resource-constrained devices. To reduce their computational and memory burden, various compression techniques have been proposed, including quantization, pruning, and knowledge distillation. Among these, post-training quantization (PTQ) is widely adopted for its efficiency, as it requires no retraining and only a small dataset for calibration, enabling low-cost deployment. Recent advances in post-training quantization have demonstrated that even near 4-bit methods can maintain most of the original model performance. However, 1-bit quantization remains particularly challenging. A common strategy in 1-bit quantization is to determine binary weights by matching full-precision parameters, following a weight-driven criterion. However, this objective is not directly aligned with the quantized model's objective, which is to preserve the model's output behavior under the impact of quantization. A natural alternative is to adopt output-driven criteria that minimize discrepancies in model outputs using calibration data. Surprisingly, naive output-driven approaches often perform even worse in the 1-bit regime. In this paper, we show that this failure arises from two fundamental issues: error accumulation across layers and, more critically, anisotropic distortion of the representation space. Based on these insights, we propose a novel PTQ method for 1-bit LLMs that explicitly addresses these issues while maintaining computational efficiency. Extensive experiments demonstrate that our approach consistently outperforms existing 1-bit PTQ methods.
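The difference between the weight-driven and output-driven criteria mentioned above can be sketched for a single weight row. The binarization below is the standard closed-form sign-and-scale rule, used here as a generic stand-in rather than the paper's method; the helper names and the toy calibration inputs are hypothetical.

```python
def binarize_row(w):
    """Weight-driven 1-bit quantization of one row.

    Minimizes sum((w_i - alpha * b_i)^2) over alpha > 0 and b_i in {-1, +1}:
    the optimum is b = sign(w) and alpha = mean(|w|).
    """
    alpha = sum(abs(v) for v in w) / len(w)
    signs = [1.0 if v >= 0 else -1.0 for v in w]
    return alpha, signs

def weight_error(w, alpha, signs):
    """Squared distance between full-precision and binarized weights."""
    return sum((v - alpha * s) ** 2 for v, s in zip(w, signs))

def output_error(w, alpha, signs, calib):
    """Squared difference of the row's outputs over calibration inputs."""
    total = 0.0
    for x in calib:
        full = sum(v * xi for v, xi in zip(w, x))
        quant = alpha * sum(s * xi for s, xi in zip(signs, x))
        total += (full - quant) ** 2
    return total

w = [0.5, -1.5, 1.0]                          # toy full-precision row
calib = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # toy calibration inputs
alpha, signs = binarize_row(w)
print(alpha, signs)                            # 1.0 [1.0, -1.0, 1.0]
print(weight_error(w, alpha, signs))           # 0.5
print(output_error(w, alpha, signs, calib))    # 0.5
```

The two errors coincide here only because the toy calibration inputs are unit vectors; with correlated inputs they diverge, which is why the choice of criterion matters.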

2512.19701 2026-05-18 cs.LG cs.AI

LASER: Language Model Regression for Semi-Structured Workflow Resource and Runtime Estimation

Yuxuan Yin, Shengke Zhou, Yunjie Zhang, Ajay Mohindra, Boxun Xu, Peng Li

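AI summary Accurately predicting the resource consumption and runtime of cloud workflow jobs is critical for scheduling efficiency, but the semi-structured nature of job configurations makes it challenging. This paper proposes LASER, a framework that fine-tunes large language models on serialized workflow configurations for multi-target resource and runtime regression, introducing scientific notation output encoding and constrained decoding to improve the accuracy and efficiency of numerical prediction. Experiments show that LASER outperforms human experts and state-of-the-art tabular machine learning methods on large-scale chip design workloads and on the newly built GHARuntime dataset, establishing a new paradigm for LLM-based regression on semi-structured workflow data.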

Comments 20 pages, 7 figures

English abstract

Accurate prediction of resource consumption and runtime for cloud workflow jobs is critical for scheduling efficiency, yet remains challenging due to the semi-structured nature of job configurations -- comprising shell commands, tool-specific parameters, dependency graphs, and hierarchical metadata. Traditional ML approaches require brittle feature engineering to flatten this rich information into fixed-size vectors, losing critical semantic context. We present LASER, a framework that fine-tunes LLMs on serialized workflow job configurations for multi-target resource and runtime regression. To address the challenges of numerical regression via generation, we introduce scientific notation output encoding for targets spanning multiple orders of magnitude, and constrained decoding with prefix filling to enforce output validity while reducing inference latency by over 30%. We further show that full-attention fine-tuning improves accuracy over sliding-window LLMs on long job contexts. Validated on large-scale chip design workloads, and GHARuntime, a new public benchmark derived from 580,000+ GitHub Actions runs across 27,000+ repositories, LASER outperforms human experts and SOTA tabular ML baselines, with clear model- and data-scaling behavior, establishing a new paradigm for LLM-based regression on semi-structured workflow data.
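One way to picture scientific notation output encoding is as a fixed-width text format for regression targets, paired with a validity check of the kind a constrained decoder might enforce. The sketch below is an illustration under assumptions: the three-significant-digit format, the regular expression, and the function names are hypothetical and are not taken from LASER.

```python
import re

# Hypothetical output grammar: one digit, a fractional part, signed exponent.
SCI_PATTERN = re.compile(r"\d\.\d+e[+-]\d{2,3}")

def encode_sci(value: float, digits: int = 3) -> str:
    """Render a positive target with `digits` significant digits."""
    if value <= 0:
        raise ValueError("this sketch only handles positive targets")
    return f"{value:.{digits - 1}e}"

def is_valid(text: str) -> bool:
    """Check that a generated string matches the assumed output grammar."""
    return SCI_PATTERN.fullmatch(text) is not None

def decode_sci(text: str) -> float:
    """Parse a generated string back into a number."""
    if not is_valid(text):
        raise ValueError(f"not in the expected format: {text!r}")
    return float(text)

print(encode_sci(12345.0))      # 1.23e+04
print(encode_sci(0.00056789))   # 5.68e-04
print(decode_sci("1.23e+04"))   # 12300.0
```

With a fixed number of significant digits the relative rounding error stays bounded regardless of magnitude, which is the usual argument for this kind of encoding when targets span several orders of magnitude.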

2512.15067 2026-05-18 cs.LG cs.AI cs.SY eess.SY

EMFusion: An Uncertainty-Aware Conditional Diffusion Framework for Frequency-Selective EMF Forecasting in Wireless Networks

Zijiang Yan, Yixiang Huang, Jianhua Pei, Hina Tabassum, Luca Chiaraviglio

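AI summary With the rapid growth of wireless infrastructure, accurately estimating and forecasting electromagnetic field (EMF) levels has become increasingly important for ensuring compliance, assessing health impacts, and optimizing network planning. This paper proposes EMFusion, an uncertainty-aware conditional diffusion framework for frequency-selective multivariate EMF forecasting in wireless networks. The method uses a residual U-Net with cross-attention to integrate contextual information such as time of day, season, and holidays, provides explicit uncertainty estimates, and adopts an imputation-based sampling strategy to improve the temporal coherence of forecasts. Experiments show that EMFusion outperforms existing methods on several evaluation metrics, improving forecast accuracy and reliability.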

Comments Submission for possible publication

English abstract

The rapid growth in wireless infrastructure has increased the need to accurately estimate and forecast electromagnetic field (EMF) levels to ensure ongoing compliance, assess potential health impacts, and support efficient network planning. While existing studies rely on univariate forecasting of wideband aggregate EMF data, frequency-selective multivariate forecasting is needed to capture the inter-operator and inter-frequency variations essential for proactive network planning. To this end, this paper introduces EMFusion, a conditional multivariate diffusion-based probabilistic forecasting framework that integrates diverse contextual factors, such as time of day, season, and holidays, while providing explicit uncertainty estimates. The proposed architecture features a residual U-Net backbone enhanced by a cross-attention mechanism that dynamically integrates external conditions to guide the generation process. Furthermore, EMFusion integrates an imputation-based sampling strategy that treats forecasting as a structural inpainting task, ensuring temporal coherence even with irregular measurements. Unlike standard point forecasters, EMFusion generates empirical probabilistic prediction intervals from the learned conditional distribution, providing uncertainty-aware probabilistic forecasting rather than simple point estimation. Numerical experiments conducted on frequency-selective EMF datasets demonstrate that EMFusion with the contextual information of working hours outperforms the baseline models with or without conditions. EMFusion outperforms the best baseline by 23.85% in continuous ranked probability score (CRPS), 13.93% in normalized root mean square error, and reduces prediction CRPS error by 22.47%.
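For readers unfamiliar with the metrics above, the following sketch shows how a set of samples from a probabilistic forecaster can be turned into an empirical prediction interval and scored with CRPS. It uses the standard sample-based CRPS estimator, mean |x_i - y| minus half the mean pairwise |x_i - x_j|; the function names and the nearest-rank interval rule are assumptions for illustration, not EMFusion's evaluation code.

```python
def crps_ensemble(samples, y):
    """Sample-based CRPS estimate for one observation y (lower is better)."""
    m = len(samples)
    spread_to_obs = sum(abs(x - y) for x in samples) / m
    pairwise = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return spread_to_obs - pairwise

def prediction_interval(samples, level=0.9):
    """Central interval from sorted samples using a simple nearest-rank rule."""
    s = sorted(samples)
    lo_idx = round((1 - level) / 2 * (len(s) - 1))
    hi_idx = len(s) - 1 - lo_idx
    return s[lo_idx], s[hi_idx]

print(crps_ensemble([2.0], 3.0))        # 1.0, a single sample gives absolute error
print(crps_ensemble([0.0, 2.0], 1.0))   # 0.5
```

For a one-sample forecast the score reduces to the absolute error, which makes CRPS directly comparable with point-forecast errors.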

2512.14671 2026-05-18 cs.CV

ART: Articulated Reconstruction Transformer

Zizhang Li, Cheng Zhang, Zhengqin Li, Henry Howard-Jenkins, Zhaoyang Lv, Chen Geng, Jiajun Wu, Richard Newcombe, Jakob Engel, Zhao Dong

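AI summary This paper proposes ART, a model that reconstructs complete 3D articulated objects from sparse, multi-state RGB images without relying on specific object categories or slow optimization. ART treats an articulated object as an assembly of rigid parts: a purpose-built transformer maps the images to learnable part slots and jointly decodes each part's 3D geometry, texture, and articulation parameters, yielding reconstructions that are physically interpretable and directly usable in simulation. Experiments show that ART performs strongly on multiple benchmarks and clearly surpasses existing methods, establishing a new state of the art.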

Comments Project Page: https://kyleleey.github.io/ART/

English abstract

We introduce ART, Articulated Reconstruction Transformer -- a category-agnostic, feed-forward model that reconstructs complete 3D articulated objects from only sparse, multi-state RGB images. Previous methods for articulated object reconstruction either rely on slow optimization with fragile cross-state correspondences or use feed-forward models limited to specific object categories. In contrast, ART treats articulated objects as assemblies of rigid parts, formulating reconstruction as part-based prediction. Our newly designed transformer architecture maps sparse image inputs to a set of learnable part slots, from which ART jointly decodes unified representations for individual parts, including their 3D geometry, texture, and explicit articulation parameters. The resulting reconstructions are physically interpretable and readily exportable for simulation. Trained on a large-scale, diverse dataset with per-part supervision, and evaluated across diverse benchmarks, ART achieves significant improvements over existing baselines and establishes a new state of the art for articulated object reconstruction from image inputs.

2512.09673 2026-05-18 cs.LG cs.AI cs.NE stat.ML

Drawback of Enforcing Equivariance and its Compensation via the Lens of Expressive Power

Yuzhu Chen, Tian Qin, Xinmei Tian, Fengxiang He, Dacheng Tao

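AI summary This paper studies how enforcing equivariance affects the expressive power of neural networks and finds that the constraint can weaken it. By analyzing boundary hyperplanes and channel vectors, the authors demonstrate the problem constructively, show that it can be compensated for by enlarging the model, and prove upper bounds on the required enlargement. Surprisingly, the enlarged architectures have a lower-dimensional hypothesis space, which may lead to better generalization.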

English abstract

Equivariant neural networks encode the intrinsic symmetry of data as an inductive bias and have achieved impressive performance across a wide range of domains. However, the understanding of their expressive power remains limited. Focusing on 2-layer ReLU networks, this paper investigates the impact of enforcing equivariance constraints on the expressive power. By examining the boundary hyperplanes and the channel vectors, we constructively demonstrate that enforcing equivariance constraints could undermine the expressive power. Naturally, this drawback can be compensated for by enlarging the model size -- we further prove upper bounds on the required enlargement for compensation. Surprisingly, we show that the enlarged neural architectures have reduced hypothesis space dimensionality, implying even better generalizability.

2512.04457 2026-05-18 cs.CL

RapidUn: Influence-Driven Parameter Reweighting for Efficient Large Language Model Unlearning

Guoshenghui Zhao, Huawei Lin, Weijie Zhao

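AI summary This paper proposes RapidUn, an efficient unlearning framework for large language models that uses influence-driven parameter reweighting to remove the influence of specific data from a model. The method first quickly estimates each sample's influence and then maps the scores into adaptive update weights, so that parameters are updated selectively to forget harmful behavior without losing general knowledge. Experiments on multiple models and datasets show that RapidUn is up to 100 times more efficient than full retraining, with better stability and generalization than existing unlearning methods.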

Comments Code available at: https://github.com/eyerf/RapidUn

English abstract

Removing specific data influence from large language models (LLMs) remains challenging, as retraining is costly and existing approximate unlearning methods are often unstable. The challenge is exacerbated when the forget set is small or imbalanced. We introduce RapidUn, an influence-driven and parameter-efficient unlearning framework. It first estimates per-sample influence through a fast estimation module, then maps these scores into adaptive update weights that guide selective parameter updates -- forgetting harmful behavior while retaining general knowledge. On Mistral-7B and Llama-3-8B across Dolly-15k and Alpaca-57k, RapidUn achieves up to 100 times higher efficiency than full retraining and consistently outperforms Fisher, GA, and LoReUn on both in-distribution and out-of-distribution forgetting. These results establish influence-guided parameter reweighting as a scalable and interpretable paradigm for LLM unlearning.