arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 2332
2605.06870 2026-05-13 cs.LG

Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

Xinyu Zhao, Nikita Karagodin, Hamed Hassani, Sinan Hersek, Paul Pu Liang, Yury Polyanskiy

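AI summary: This paper studies dimensional collapse during VQ-VAE training, in which encoded representations degenerate into an extremely low-dimensional subspace, and shows that it causes a loss lower bound that is hard to break. The authors propose a simple and effective remedy: training the model as a plain autoencoder before introducing VQ (AE Warm-Up), which restores the dimension of the encoded representations. Experiments show that the method significantly improves reconstruction quality and perceptual performance on image and audio tasks while raising the effective dimension of the codebook.

Abstract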

While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimensional collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE Warm-Up successfully restores representation dimension, leading to lower reconstruction and perceptual loss at the same training budget. Across codebook sizes $K \in \{2^{10}, 2^{14}, 2^{16}\}$, AE Warm-Up raises VQGAN codebook effective dimension from 3-5 to 17-19 and reduces rFID by 17-35%; on WavTokenizer at $K \in \{2^{13}, 2^{14}\}$, it raises codebook dimension from 4 to 17-19 and improves PESQ by 11-14%. We empirically characterize how warm-up duration governs the achievable final loss. In agreement with experiment, our theoretical analysis predicts downstream performance as a function of warm-up length, enabling an adaptive criterion for switching from AE Warm-Up to VQ-VAE training.
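The warm-up idea can be illustrated with a minimal sketch. It assumes a fixed step threshold `warmup_steps` for the switch (the abstract describes an adaptive criterion, which is not reproduced here), and the function names and toy codebook are hypothetical, not taken from the paper.

```python
def nearest_code(z, codebook):
    """Return the codebook vector closest to z in squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((zi - ci) ** 2 for zi, ci in zip(z, c)))


def latent_for_decoder(z, codebook, step, warmup_steps):
    """Latent handed to the decoder: unquantized during warm-up, quantized after.

    The fixed threshold `warmup_steps` is an assumption made for illustration.
    """
    if step < warmup_steps:
        return list(z)                        # plain autoencoder phase, no VQ
    return list(nearest_code(z, codebook))    # VQ-VAE phase


codebook = [(0.0, 0.0), (1.0, 1.0)]   # toy codebook with K = 2 entries
z = (0.9, 0.7)                        # toy encoder output
during_warmup = latent_for_decoder(z, codebook, step=10, warmup_steps=100)
after_warmup = latent_for_decoder(z, codebook, step=500, warmup_steps=100)
```

In a real training loop the quantized branch would also need a gradient path to the encoder (for example a straight-through estimator) and a codebook loss; both are left out of this sketch.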

2605.06732 2026-05-13 cs.LG

On Training in Imagination

Nadav Timor, Ravid Shwartz-Ziv, Micah Goldblum, Yann LeCun, David Harel

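AI summary: This paper studies imagination-based model-based reinforcement learning, examining how model errors affect policy optimization and returns when policies are trained with learned dynamics and reward models. The authors extend existing analysis and derive, under power-law assumptions, the optimal sample allocation ratio that minimizes an upper bound on return error, and they point out that lowering the Lipschitz constants of the dynamics, reward, and policy tightens this bound. They also analyze REINFORCE under noisy rewards, finding that zero-mean noise leaves the gradient estimate unbiased but increases its variance, and they formulate the problem of trading off the number of rollouts against reward noise under a fixed budget.

Abstract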

State-of-the-art model-based reinforcement learning methods train policies on imagined rollouts. These rollouts are trajectories generated by a learned dynamics model and are scored by a learned reward model, but without querying the true environment during policy updates. We study this training paradigm by quantifying how errors in learned dynamics and reward models affect returns and policy optimization. First, we extend the analysis of Asadi et al. (2018) to MDPs with learned reward models, and derive the optimal sample allocation--the ratio of dynamics samples to reward samples that minimizes a bound on return error under power-law scaling assumptions. We identify lower Lipschitz constants of the learned dynamics, reward, and policy as a representation desideratum that tightens this bound, and we connect this perspective to the temporal-straightening objective of Wang et al. (2026). Second, we examine how policy optimization with REINFORCE tolerates noisy rewards, which are often cheaper to obtain. We show that zero-mean reward noise leaves the gradient estimator unbiased and adds at most a variance term that decreases with the number of rollouts. This introduces a practical tradeoff: given a fixed budget, should one buy more rollouts with cheaper but noisier rewards, or fewer rollouts with more expensive but less noisy rewards? We reduce this choice to a one-dimensional optimization problem and characterize the optimum.
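The unbiasedness claim can be checked in a few lines. The derivation below is a sketch under assumptions that the abstract does not spell out: the noise $\varepsilon_i$ added to the return $R_i$ of rollout $i$ is zero-mean with variance $\sigma_\varepsilon^2$, it is independent of the trajectory, and the $N$ rollouts are i.i.d. The symbol $z_i = \sum_t \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$ for the score of rollout $i$ is introduced here for illustration.

$$\hat g = \frac{1}{N}\sum_{i=1}^{N} z_i\,(R_i + \varepsilon_i), \qquad \mathbb{E}[\hat g] = \mathbb{E}[zR] + \mathbb{E}[z]\,\mathbb{E}[\varepsilon] = \mathbb{E}[zR].$$

$$\operatorname{Cov}(\hat g) = \frac{1}{N}\Big(\operatorname{Cov}(zR) + \sigma_\varepsilon^2\,\mathbb{E}\big[z z^\top\big]\Big).$$

The mean is unchanged because $\mathbb{E}[\varepsilon] = 0$, and the cross term $2\,\mathbb{E}[z z^\top R]\,\mathbb{E}[\varepsilon]$ in the covariance vanishes for the same reason. The extra term $\sigma_\varepsilon^2\,\mathbb{E}[z z^\top]/N$ shrinks as $N$ grows, which is the source of the budget tradeoff between more rollouts and less noisy rewards.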

2605.06440 2026-05-13 cs.LG cs.CV

Hyperbolic Concept Bottleneck Models

Daniel Uyterlinde, Swasti Shreya Mishra, Pascal Mettes

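AI summary: This paper proposes Hyperbolic Concept Bottleneck Models (HypCBM), a new interpretable neural network framework. Unlike conventional methods that embed concepts in Euclidean space, HypCBM organizes concepts in a semantic hierarchy and exploits the geometry of hyperbolic space, representing concept activation as asymmetric geometric containment so that hierarchical relations between concepts are captured more naturally. The method yields sparse, hierarchy-aware activations without additional supervision or learned modules, and it shows stronger hierarchical consistency and robustness to input noise while remaining human-interpretable.

Comments 24 pages, 14 figures

Abstract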

Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, treating them as independent, orthogonal dimensions. Concepts, however, are highly structured and organized in semantic hierarchies. To resolve this mismatch, we propose Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework that grounds the bottleneck in this structure by reformulating concept activation as asymmetric geometric containment in hyperbolic space. Rather than treating entailment cones as a pre-training penalty, we show they encode a natural test-time activation signal: the margin of inclusion within a concept's entailment cone yields sparse, hierarchy-aware activations without any additional supervision or learned modules. We further introduce an adaptive scaling law for hierarchically faithful interventions, propagating user corrections coherently through the concept tree. Empirically, HypCBM rivals post-hoc Euclidean models trained on 20$\times$ more data in sparse regimes required for human interpretability, with stronger hierarchical consistency and improved robustness to input corruptions.

2605.06314 2026-05-13 cs.LG

When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias

Ye Su, Jian Li, Yong Liu

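AI summary: This paper studies benign overfitting of $\ell_2$-Boosting under its $\ell_1$ implicit bias in the high-dimensional setting. Combining the Convex Gaussian Minimax Theorem with asymptotic expansions of truncated Gaussian moments, the authors analyze the risk of continuous-time $\ell_2$-Boosting and show that, under a pure-noise model, the excess variance decays only at a logarithmic rate. They note that the mechanism should persist when signals are present, although the signal-noise decomposition remains an open problem. They also propose a tuning-free early stopping rule that attains the optimal prediction rate under an $\ell_1$ constraint.

Abstract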

Benign overfitting is well-characterized in $\ell_2$ geometries, but its behavior under the $\ell_1$ implicit bias of greedy ensembles remains challenging. The analytical barrier stems from the non-linear coupling of coordinate selection thresholds, which invalidates standard spectral resolvent tools. To isolate this algorithmic bias, we characterize the high-dimensional risk of continuous-time $\ell_2$-Boosting over $p$ features and $n$ samples. By coupling the Convex Gaussian Minimax Theorem with delicate asymptotic expansions of double-sided truncated Gaussian moments, we analytically resolve the non-smooth $\ell_1$ interpolant. Under an isotropic pure-noise model, we prove that benign overfitting fails at the linear rate: greedy selection localizes noise into sparse active sets, and the excess variance decays at a logarithmic rate $\Theta(\sigma^2/\log(p/n))$ for noise variance $\sigma^2$. We remark that while this localization mechanism should persist in the presence of signals, the exact signal-noise decomposition remains an open problem. For spiked-isotropic designs with $k^*$ head eigenvalues and $r_2 = p - k^*$ tail dimensions, the risk converges to zero when $r_{2} \gg n$, but only at a logarithmic rate $\Theta(\sigma^2/\log(r_2/n))$, which is slower than the linear decay observed in $\ell_2$ geometries. To avoid this slow convergence, we analyze the non-smooth subdifferential dynamics of the boosting flow. This yields a tuning-free early stopping rule that, under a bounded $\ell_1$-path condition, recovers the Lasso basic inequality and attains the minimax-optimal empirical prediction rate for $\ell_1$-bounded signals.
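For readers unfamiliar with the greedy selection step, the following is a minimal sketch of discrete componentwise $\ell_2$-Boosting. The abstract analyzes the continuous-time flow; this small-step version is only its textbook analogue, and the function name, step size `lr`, and toy data are assumptions made for illustration.

```python
def l2_boost(X, y, steps=200, lr=0.1):
    """Componentwise L2-boosting: greedy single-coordinate updates on the residual.

    Assumes every column of X is non-zero.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(steps):
        # Least-squares coefficient of the current residual on each single feature.
        coef = [sum(X[i][j] * resid[i] for i in range(n)) / col_sq[j] for j in range(p)]
        # Greedy selection: the feature whose full fit would reduce squared error most.
        j = max(range(p), key=lambda k: coef[k] ** 2 * col_sq[k])
        # Shrunken update of the selected coordinate only.
        beta[j] += lr * coef[j]
        for i in range(n):
            resid[i] -= lr * coef[j] * X[i][j]
    return beta, resid


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 0.0, 2.0]          # generated as y = 2 * x_0, no noise
beta, resid = l2_boost(X, y)
```

Because only one coordinate moves per step, the path stays sparse early on, which is the behavior usually associated with the $\ell_1$ implicit bias. Stopping the loop early plays the role of regularization.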

2605.06218 2026-05-13 cs.LG

AffineLens: Capturing the Continuous Piecewise Affine Functions of Neural Networks

Yi Wei, Xuan Qi, Furao Shen, Jian Zhao, Vittorio Murino, Cigdem Beyan

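AI summary: AffineLens is a unified framework for analyzing the piecewise affine structure of neural networks, designed to accurately capture the continuous piecewise affine nature of a network's input-output map. It computes the neuron-induced hyperplane arrangements and polyhedral structures, then enumerates and visualizes the network's affine regions layer by layer, giving both an intuitive view and a quantitative measure of expressivity. AffineLens supports modern components including batch normalization, pooling, and residual connections, and an empirical study shows how different network designs affect the geometry of the learned function.

Abstract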

Piecewise affine neural networks (PANNs) provide a principled geometric perspective on neural network expressivity by characterizing the input--output map as a continuous piecewise affine (CPA) function whose complexity is governed by the number, arrangement, and shapes of its affine regions. However, existing interpretability and expressivity analyses often rely on indirect proxies (e.g., activation statistics or theoretical upper bounds) and rarely offer practical, accurate tools for enumerating and visualizing the induced region partition under realistic architectures and bounded input domains. In this work, we present AffineLens, a unified framework for computing the hyperplane arrangements and polyhedral structures underlying PANNs. Given a calibrated (bounded) input polytope, AffineLens identifies the subset of neuron-induced hyperplanes that intersect the domain, enumerates the resulting affine sub-regions in a layer-wise manner, and returns provably non-empty maximal CPA regions together with interior representatives. The framework further provides visualizations of region partitioning and decision boundaries, enabling qualitative inspection alongside quantitative region counts. By exploiting the affine restriction property of CPA networks under fixed activation patterns, AffineLens supports a broad class of modern components, including batch normalization, pooling, residual connections, multilayer perceptrons, and convolutional layers. Finally, we use AffineLens to perform a systematic empirical study of architectural expressivity, comparing networks through region complexity metrics and revealing how design choices influence the geometry of learned functions.
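The "affine restriction under a fixed activation pattern" can be made concrete with a small sketch. This is not the AffineLens API; the function names and the toy one-hidden-layer ReLU network are assumptions chosen only to show that, once the activation pattern at a point is fixed, the network coincides with a single affine map on that point's region.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def forward(W1, b1, W2, b2, x):
    """One-hidden-layer ReLU network."""
    h = [max(0.0, v + b) for v, b in zip(matvec(W1, x), b1)]
    return [v + b for v, b in zip(matvec(W2, h), b2)]


def local_affine_map(W1, b1, W2, b2, x):
    """Activation pattern at x and the affine map (A, c) valid on x's region."""
    pre = [v + b for v, b in zip(matvec(W1, x), b1)]
    pattern = [1.0 if v > 0 else 0.0 for v in pre]
    hidden = range(len(W1))
    # A = W2 diag(pattern) W1 and c = W2 diag(pattern) b1 + b2
    A = [[sum(W2[o][k] * pattern[k] * W1[k][i] for k in hidden)
          for i in range(len(x))] for o in range(len(W2))]
    c = [sum(W2[o][k] * pattern[k] * b1[k] for k in hidden) + b2[o]
         for o in range(len(W2))]
    return pattern, A, c


W1, b1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]], [0.0, -1.0, 0.5]
W2, b2 = [[1.0, -2.0, 0.5]], [0.25]
x = [1.0, 0.5]
pattern, A, c = local_affine_map(W1, b1, W2, b2, x)
affine_out = [v + ci for v, ci in zip(matvec(A, x), c)]
```

Each distinct `pattern` that is realized by some input in the domain corresponds to one affine region, so counting regions amounts to counting feasible patterns.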

2605.05971 2026-05-13 cs.LG

Training Transformers for KV Cache Compressibility

Yoav Gelberg, Yam Eitan, Michael Bronstein, Yarin Gal, Haggai Maron

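AI summary: As long-context language models develop, the memory and decode-time access costs of the Key-Value (KV) cache have become a key bottleneck. This paper proposes KV-Compression Aware Training (KV-CAT), which steers a Transformer toward compressible representations by sparsifying the KV cache at train time, so that the model produces internal representations better suited to later compression. Experiments show that the method improves the performance of downstream compression methods on retrieval, long-context question answering, and compressed-prefix continuation.

Comments 32 pages, 4 figures

Abstract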

Long-context language modeling is increasingly constrained by the Key-Value (KV) cache, whose memory and decode-time access costs scale linearly with the prefix length. This bottleneck has motivated a range of context-compression methods, from token-level summarization to recent optimization-based KV compression methods. These post-hoc methods operate on the KV cache of a fixed pretrained model, so their effectiveness is fundamentally limited by how well the model's internal representations can be compressed. In this work, we formalize the notion of KV compressibility and show that it is a property of the learned representations, rather than of the context alone. We prove that almost any sequence-to-vector function admits both highly compressible and inherently non-compressible transformer implementations, highlighting the need to guide transformers toward compressible representations during training. Motivated by this, we propose KV-Compression Aware Training (KV-CAT), a continued pretraining procedure that incentivizes the emergence of compressible representations. We introduce a train-time KV sparsification policy that masks KV slots during training. This forces the model to use fewer KV slots and encourages it to learn representations amenable to post-hoc compression. Empirically, we show that KV-CAT improves the quality-budget tradeoff of downstream compression methods across retrieval, long-context question answering, and perplexity-based evaluation of compressed-prefix continuation.
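Masking KV slots amounts to removing them from the softmax. The sketch below shows single-query attention with a boolean `keep` mask; how KV-CAT chooses which slots to mask is not described in the abstract, so the mask here is supplied by hand and the function name is hypothetical.

```python
import math


def masked_attention(q, keys, values, keep):
    """Single-query softmax attention over the KV slots flagged in `keep`.

    Assumes at least one slot is kept.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) if m else -math.inf
              for k, m in zip(keys, keep)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # masked slots contribute exactly 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out, weights = masked_attention(q, keys, values, keep=[True, False, True])
```

Training under such masks means the model cannot rely on every slot being present, which is the pressure toward compressible representations that the abstract describes.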

2605.05922 2026-05-13 cs.CV

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang, Jinghan Li, Meng Wang, Xintao Wang, Pengfei Wan, Kuien Liu, Xiang Wang

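AI summary: This paper proposes DeScore, a video reward model that addresses the optimization bottleneck arising when reasoning and scoring are coupled in existing models. Its core idea is to decouple the two: a multimodal large language model first generates a detailed reasoning trace, and a separate scoring module then predicts the final reward. The method improves training stability and efficiency while preserving interpretability and generalization.

Abstract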

Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, Generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.

2605.04946 2026-05-13 cs.LG stat.ML

Training-Time Batch Normalization Reshapes Local Partition Geometry in Piecewise-Affine Networks

Xuan Qi, Yi Wei, Fanqi Yu, Furao Shen, Vittorio Murino, Cigdem Beyan

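AI summary: This paper studies the geometric effect of training-time batch normalization (BN) in piecewise-affine networks, showing how BN changes the local region partition by shifting each neuron's reference hyperplane. BN defines for each neuron a hyperplane anchored at the mini-batch centroid, and the offsets of its switching hyperplanes are expressed in standardized coordinates, independent of the raw bias. This mechanism refines the local partition and transfers locally through depth, offering a new function-level geometric view of BN during training.

Abstract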

Batch normalization (BN) is central to modern deep networks, but its effect on the realized function during training remains less understood than its optimization benefits. We study training-time BN in continuous piecewise-affine (CPA) networks through the geometry of switching hyperplanes and the induced affine-region partition. Conditioned on a mini-batch, we show that BN defines for each neuron a reference hyperplane through the batch centroid, and that breakpoint-switching hyperplanes are parallel translates whose offsets are expressed in batch-standardized coordinates and are independent of the raw bias. This yields an exact criterion for when a switching hyperplane intersects a local $\ell_\infty$ window and motivates a local region-density functional based on exact affine-region counts. Under explicit sufficient conditions, we show that BN increases expected local partition refinement in ReLU and more general piecewise-affine networks, and that this mechanism transfers locally through depth inside parent affine regions where the upstream representation map is an affine embedding. These results provide a function-level geometric account of training-time BN as a batch-conditional recentering mechanism near the data.
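Two of the statements above are easy to verify numerically for a single neuron: the BN output does not depend on the raw bias, and with shift $\beta = 0$ the zero level set passes through the batch centroid. The sketch below assumes the usual training-time BN formula with the stabilizing epsilon omitted; the function names and toy batch are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bn_value(x, batch, w, b, gamma=1.0, beta=0.0):
    """BN output of the neuron (w, b) at point x, using mini-batch statistics.

    Assumes the batch pre-activations are not all equal; epsilon is omitted.
    """
    z = [dot(w, xi) + b for xi in batch]
    mu = sum(z) / len(z)
    sd = math.sqrt(sum((v - mu) ** 2 for v in z) / len(z))
    return gamma * (dot(w, x) + b - mu) / sd + beta


batch = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
w = [1.0, 1.0]
centroid = [sum(col) / len(batch) for col in zip(*batch)]

probe = [2.0, 0.5]
same_for_any_bias = abs(bn_value(probe, batch, w, 0.0) - bn_value(probe, batch, w, 3.0))
at_centroid = bn_value(centroid, batch, w, b=7.0)
```

The bias cancels because it shifts both the pre-activation and the batch mean by the same amount. A non-zero `beta` translates the switching hyperplane parallel to itself by an offset of `-beta / gamma` in standardized units.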

2605.04647 2026-05-13 cs.RO

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

Huimin Wang, Yue Wang, Bihao Cui, Pengxiang Li, Ben Lu, Mingqian Wang, Tong Wang, Chuan Tang, Teng Zhang, Kun Zhan

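AI summary: This paper presents ReflectDrive-2, a reinforcement-learning-aligned discrete diffusion planner for autonomous driving. A separate action expert produces discrete trajectory tokens, trajectories are generated by parallel masked decoding, and they can be edited in place. A two-stage training strategy that combines structure-aware perturbations with reinforcement learning markedly improves trajectory generation and editing. Experiments show that ReflectDrive-2 achieves a high PDMS on NAVSIM with efficient inference.

Abstract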

We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision--draft--reflect rollout with reinforcement learning (RL), assigning terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most $0.3$, whereas RL increases its gain to $1.9$. We also co-design an efficient reflective decoding stack for the decision--draft--reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves $91.0$ PDMS with camera-only input and $94.8$ PDMS in a best-of-6 oracle setting, while running at $31.8$ ms average latency on NVIDIA Thor.

2605.04539 2026-05-13 cs.CL cs.AI

RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

Qiming Bao, Juho Leinonen, Paul Denny, Michael J. Witbrock

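AI summary: This paper proposes RLearner-LLM, a method for balancing logical accuracy and fluency of large language models on knowledge-intensive generation tasks. It introduces Hybrid Direct Preference Optimization (Hybrid-DPO), which combines a DeBERTa-v3-based natural language inference signal with a verifier LLM score to improve logical alignment without human annotation. Experiments show that the method significantly improves logical reasoning across several academic domains while preserving fluency, and that it delivers gains on multiple base models.

Abstract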

Direct Preference Optimization (DPO), the efficient alternative to PPO-based RLHF, falls short on knowledge-intensive generation: standard preference signals from human annotators or LLM judges exhibit a systematic verbosity bias that rewards fluency over logical correctness. This blindspot leaves a logical alignment gap -- SFT models reach NLI entailment of only 0.05-0.22 despite producing fluent text. We propose RLearner-LLM with Hybrid-DPO: an automated preference pipeline that fuses a DeBERTa-v3 NLI signal with a verifier LLM score, removing human annotation while overcoming the "alignment tax" of single-signal optimization. Evaluated across five academic domains (Biology, Medicine, Law) with three base architectures (LLaMA-2-13B, Qwen3-8B, Gemma 4 E4B-it), RLearner-LLM yields up to 6x NLI improvement over SFT, with NLI gains in 11 of 15 cells and consistent answer-coverage gains. On Gemma 4 E4B-it (4.5B effective params), Hybrid-DPO lifts NLI in four of five domains (+11.9% to +2.4x) with faster inference across all five, scaling down to compact base models without losing the alignment-tax mitigation. Our Qwen3-8B RLearner-LLM wins 95% of pairwise comparisons against its own SFT baseline; GPT-4o-mini in turn wins 95% against our concise output -- alongside the 69% win the same judge gives a verbose SFT over our DPO model, this replicates verbosity bias on a frontier comparator and motivates logic-aware metrics (NLI, ACR) over LLM-as-a-judge for knowledge-intensive generation.
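A preference pipeline of this kind has two parts: ranking candidate answers by a fused score, then applying the DPO loss to the resulting pair. The sketch below uses the standard DPO loss, but the fusion rule (a convex combination with `weight=0.5`) and all function names are assumptions for illustration; the abstract does not state how the NLI and verifier signals are combined.

```python
import math


def hybrid_score(nli, verifier, weight=0.5):
    """Hypothetical fusion: convex combination of two scores in [0, 1]."""
    return weight * nli + (1.0 - weight) * verifier


def pick_pair(candidates):
    """Return (chosen, rejected): the highest- and lowest-scoring candidates."""
    ranked = sorted(candidates, key=lambda c: hybrid_score(c["nli"], c["verifier"]))
    return ranked[-1], ranked[0]


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, from sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))


candidates = [
    {"text": "A", "nli": 0.9, "verifier": 0.6},
    {"text": "B", "nli": 0.1, "verifier": 0.8},
]
chosen, rejected = pick_pair(candidates)
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)      # policy equals reference
loss_improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)     # policy favors the chosen answer
```

When the policy equals the reference model the margin is zero and the loss is $\log 2$; it falls as the policy raises the chosen answer's log-probability relative to the rejected one.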

2605.02906 2026-05-13 cs.LG

An End-to-End Framework for Building Large Language Models for Software Operations

Jingkai He, Pengfei Chen, Chenghui Wu, Shuang Liang, Ye Li, Gou Tan, Xiadao Wen, Chuanfu Zhang

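AI summary: This paper proposes OpsLLM, an end-to-end framework for building large language models for software operations, targeting the low data quality, fragmented knowledge, and insufficient learning that keep current models from delivering efficient intelligent operations. The framework introduces a human-in-the-loop data curation mechanism and a domain process reward model, improving accuracy and reliability on operations question answering and root cause analysis. Experiments show that OpsLLM outperforms existing open-source and closed-source models on tasks of varying difficulty, and the authors plan to open-source three versions of different parameter sizes together with the fine-tuning dataset.

Abstract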

In the field of software operations, Large Language Models (LLMs) have attracted increasing attention. However, existing research has not yet achieved efficient and effective end-to-end intelligent operations due to low-quality data, fragmented knowledge and insufficient learning. To explore the potential of LLMs in software operations, we propose OpsLLM, a domain-specific LLM that supports both knowledge-based question answering (QA) and root cause analysis (RCA). Moreover, we disclose the detailed workflow for building LLMs specifically in the software operations domain. First, a Human-in-the-Loop mechanism is introduced to curate high-quality data from a large collection of operational raw data and construct a fine-tuning dataset. Then, based on the data, supervised fine-tuning is conducted to obtain a base model. Furthermore, we introduce a domain process reward model (DPRM) during the reinforcement learning stage to optimize the accuracy and reliability of the fine-tuned model on RCA tasks. Experimental results on tasks of diverse difficulty demonstrate that OpsLLM effectively learns and aligns with the infused operational domain knowledge, outperforming existing open-source and closed-source LLMs in accuracy with improvements of 0.2% to 5.7% on QA tasks and 2.7% to 70.3% on RCA tasks, while exhibiting strong transferability. Moreover, we will open-source three versions of OpsLLM with 7B, 14B and 32B parameters, along with a 15K fine-tuning dataset.

2605.00939 2026-05-13 cs.LG cs.AI

From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity

Yee Zhing Liew, Andrew Huey Ping Tan, Anwar P. P. Abdul Majeed

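AI summary: This paper studies "stubborn hallucinations", errors that conventional detection misses because the language model is highly confident in incorrect information. The authors propose a geometric detector based on gradient sensitivity, Embedding-Perturbed Gradient Sensitivity (EPGS), which adds Gaussian noise to the input embeddings and measures the change in gradient magnitude to distinguish stable knowledge from brittle memorization. Experiments show that it significantly outperforms entropy-based and representation-based baselines at detecting high-confidence factual errors.

Comments Accepted to ICML 2026. Camera-ready version

Abstract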

Traditional hallucination detection fails on "Stubborn Hallucinations" - errors where LLMs are confidently wrong. We propose a geometric solution: Embedding-Perturbed Gradient Sensitivity (EPGS). We hypothesize that while robust facts reside in flat minima, stubborn hallucinations sit in sharp minima, supported by brittle memorization. EPGS detects this sharpness by perturbing input embeddings with Gaussian noise and measuring the resulting spike in gradient magnitude. This acts as an efficient proxy for the Hessian spectrum, differentiating stable knowledge from unstable memorization. Our experiments show that EPGS significantly outperforms entropy-based and representation-based baselines, providing a robust signal for detecting high-confidence factual errors.
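The intuition that sharper minima produce larger gradient spikes under perturbation can be shown on a toy problem. The sketch below is not the paper's implementation: it replaces the LLM with analytic quadratic losses, and the function names, the noise scale `sigma`, and the sample count are all assumptions.

```python
import math
import random


def grad_norm(g):
    return math.sqrt(sum(v * v for v in g))


def epgs(grad_fn, embedding, sigma=0.1, samples=64, seed=0):
    """Mean increase in gradient norm under Gaussian perturbations of the input."""
    rng = random.Random(seed)
    base = grad_norm(grad_fn(embedding))
    total = 0.0
    for _ in range(samples):
        noisy = [e + rng.gauss(0.0, sigma) for e in embedding]
        total += grad_norm(grad_fn(noisy)) - base
    return total / samples


# Toy losses 0.5 * a * ||x||^2 with gradient a * x, both minimized at the origin.
def flat_grad(x):
    return [0.1 * v for v in x]      # low curvature


def sharp_grad(x):
    return [10.0 * v for v in x]     # high curvature


x_star = [0.0, 0.0, 0.0]
flat_score = epgs(flat_grad, x_star)
sharp_score = epgs(sharp_grad, x_star)
```

For a quadratic loss the score scales linearly with the curvature, which is why the measurement can act as a cheap stand-in for second-order information without forming a Hessian.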

2604.24801 2026-05-13 cs.LG cs.AI

Architecture Determines Observability of Transformers

Thomas Carmichael

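AI summary: This study examines how architecture affects observability in Transformers, noting that autoregressive Transformers can still make errors that output-confidence monitoring fails to detect. Whether activations carry decision-quality information beyond output confidence is determined mainly by the model's architecture and training. Experiments show that controlling for output confidence removes most of the activation-probe signal, and that the observability of the remaining signal depends on architecture and training, offering a new perspective on model monitoring and training design.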

Comments 31 pages, 8 figures, 14 tables. v3 of arXiv:2604.24801. Code v5.1.0: https://github.com/tmcarmichael/nn-observability/tree/v5.1.0 Changelog: https://github.com/tmcarmichael/nn-observability/blob/v5.1.0/CHANGELOG.md Croissant: https://github.com/tmcarmichael/nn-observability/blob/v5.1.0/croissant.json

Abstract

Autoregressive transformers make confident errors that output-confidence monitoring cannot catch. Activation monitors catch them only when training leaves a decision-quality signal beyond what the output already exposes. This signal is an architectural property of the trained model, fixed upstream of any monitor. Controlling for output confidence removes 60.3% of the raw activation-probe signal on average across 14 models. Raw probe signal is mostly output confidence, and output-side readouts cannot recover the residual. What remains depends on architecture and training. In Pythia's controlled training, both matched-width configurations form the signal early. One preserves it through convergence while another erases it as perplexity continues to improve. Capability and observability are not inherently in tension. Across independently trained families this pattern persists, even as the collapse point shifts. Where the signal survives, monitoring catches what confidence cannot. On downstream QA, a WikiText-trained probe with no task-specific tuning catches about one in eight confident errors that output-confidence monitoring misses, at a 20% flag rate. These results establish signal engineering as a training-time design axis alongside loss and capability. Architecture sets the conditions for observability, and training determines what remains readable.

2604.22099 2026-05-13 cs.LG

Assessing the impact of dimensionality reduction on clustering performance -- a systematic study

Ousmane Assani-Amate, Mohammadreza Bakhtyari, Émilie Roy, Vladimir Makarenkov

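AI summary: This study systematically evaluates how five dimensionality reduction techniques affect the performance of four clustering algorithms, to clarify the role of dimensionality reduction in clustering high-dimensional data. Varying the reduced dimensionality and comparing performance with the Adjusted Rand Index (ARI), it finds that choosing an appropriate reduction method and reduction level is critical to clustering quality, and that both should be matched to the data structure and the clustering algorithm.

Abstract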

Dimensionality reduction is a critical preprocessing step for clustering high-dimensional data, yet comprehensive evaluation of its impact across diverse methods and data types remains limited. In this study, we systematically assess the influence of five dimensionality reduction techniques - Principal Component Analysis (PCA), Kernel Principal Component Analysis (Kernel PCA), Variational Autoencoder (VAE), Isometric Mapping (Isomap), and Multidimensional Scaling (MDS) - on the performance of four popular clustering algorithms - k-means, Agglomerative Hierarchical Clustering (AHC), Gaussian Mixture Models (GMM), and Ordering Points to Identify the Clustering Structure (OPTICS). We evaluate clustering quality using the Adjusted Rand Index (ARI), comparing results without and with dimensionality reduction at different reduction levels recommended in the literature (i.e., k-1, where k is the number of clusters, and 25% and 50% of the original number of dimensions). Our findings underscore the importance of a careful selection of the dimensionality reduction technique and the dimensionality reduction level that should be tailored to intrinsic data geometry and clustering algorithms under consideration.
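The evaluation metric and the three reduction levels can be written out directly. The sketch below implements the standard pair-counting formula for the Adjusted Rand Index; the helper `reduction_levels` and its rounding rule are assumptions, since the abstract does not say how fractional dimensions are rounded.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index from pair counts; assumes at least two samples."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:          # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def reduction_levels(n_dims, n_clusters):
    """Target dimensions from the abstract: k-1, 25% and 50% of the original."""
    return [n_clusters - 1, max(1, round(0.25 * n_dims)), max(1, round(0.5 * n_dims))]


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
levels = reduction_levels(n_dims=100, n_clusters=5)
```

ARI is invariant to relabeling of clusters, so it compares partitions rather than label values, and it is corrected for chance so that a random labeling scores near zero.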

2604.22026 2026-05-13 cs.AI cs.CY cs.DL

Rethinking Publication: A Certification Framework for AI-Enabled Research

Yang Lu, Rabimba Karanjai, Lei Xu, Weidong Shi

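AI summary: This paper proposes a two-layer certification framework for evaluating AI-generated research, responding to the limits of a publication system built on the assumption of human authorship. The framework separates the assessment of knowledge validity from the assessment of human contribution: the former checks that the work is scientifically sound, and the latter establishes how much humans took part in the research. The paper also proposes dedicated benchmark submission slots to give fully automated research a transparent publication path, and it argues that contributions should be judged by epistemic value rather than by author identity.

Comments correct references

Abstract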

AI research pipelines can now generate academic work that may satisfy existing peer review standards for quality, novelty, and methodological rigor. However, the publication system was built around the assumption that research is produced by human authors. It therefore lacks a clear way to evaluate work when the knowledge claim may be valid but the producer is partly or fully automated. This paper proposes a two-layer certification framework for AI-generated research. The first layer evaluates whether the knowledge claim is sound. The second layer evaluates the level of human contribution. This separation allows journals and conferences to assess pipeline-generated work more consistently without creating new institutions. The framework uses normative analysis, conceptual design, and dry-run validation against representative submission cases. It classifies human contribution into three categories: Category A, where the work is reachable by an automated pipeline; Category B, where human direction is required at identifiable stages; and Category C, where the work goes beyond current pipeline capability, especially at the problem-formulation stage. The paper also proposes dedicated benchmark slots for fully disclosed automated research. These slots would provide a transparent publication path and help reviewers calibrate judgments over time. The key argument is that publication has historically certified two things at once: that the knowledge is valid and that a human produced it. AI research pipelines separate these two claims. By decoupling knowledge certification from authorship attribution, the proposed framework responds to a structural change already underway. It can be implemented within existing editorial systems, works even when attribution is uncertain, and recognizes human frontier contribution based on epistemic value rather than human origin alone.

2604.21052 2026-05-13 cs.CV cs.AI

StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling

Liqi Jing, Dingming Zhang, Peinian Li, Lichen Zhu, Yang Xu, Hanyu Xing

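AI summary: StyleVAR is a controllable image style transfer method built on the Visual Autoregressive Modeling (VAR) framework. Images are decomposed into multi-scale representations and encoded as discrete codes, and a transformer performs conditional discrete sequence modeling to blend style and content controllably. The method introduces a blended cross-attention mechanism and scale-dependent blending coefficients to combine style and content information while preserving autoregressive continuity. Experiments show that StyleVAR outperforms the AdaIN baseline on several benchmarks, with strong perceptual similarity and structure preservation, especially for landscapes and architectural scenes.

Abstract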

We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content--style--target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR's multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.

2604.16684 2026-05-13 cs.LG stat.ML

DARLING: Detection Augmented Reinforcement Learning with Non-Stationary Guarantees

Argyrios Gerogiannis, Yu-Han Huang, Venugopal V. Veeravalli

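AI summary: This paper studies model-free reinforcement learning in non-stationary finite-horizon episodic Markov decision processes (MDPs) without prior knowledge of the non-stationarity. For the piecewise stationary (PS) setting, where rewards and transition dynamics change at unknown times, it proposes DARLING, a modular method that applies to both tabular and linear MDPs without knowing the change points in advance. In theory DARLING improves the best known dynamic regret bounds, and it outperforms existing methods on a range of non-stationary benchmarks.

Comments 50 pages, 8 figures

Abstract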

We study model-free reinforcement learning (RL) in non-stationary finite-horizon episodic Markov decision processes (MDPs) without prior knowledge of the non-stationarity. We focus on the piecewise stationary (PS) setting, where both rewards and transition dynamics can change at unknown times. We first revisit existing state-of-the-art approaches and identify theoretical and practical limitations that change the current landscape of performance guarantees. To characterize the difficulty of the problem, we establish the first minimax lower bounds for PS-RL in tabular and linear MDPs. We then introduce Detection Augmented Reinforcement Learning (DARLING), a modular wrapper for PS-RL that applies to both tabular and linear MDPs, without knowledge of the changes. In tabular MDPs, under change-point separability and reachability conditions, DARLING improves the best known dynamic regret bounds and matches our minimax lower bound. In linear MDPs, DARLING matches the minimax lower bound when the relevant reachability parameters are known, and our analysis clarifies the structural obstacles that distinguish this setting from the tabular case. Finally, through extensive experimentation across diverse non-stationary benchmarks, we show that DARLING consistently surpasses the state-of-the-art methods.

2604.15664 2026-05-13 cs.LG cs.AI

Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints

Xinge Liu, Terry Jingchen Zhang, Bernhard Schölkopf, Zhijing Jin, Kristen Menou

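AI summary: This paper introduces Stargazer, a scalable benchmark environment for evaluating AI agents on dynamic model-fitting tasks under astrophysical constraints. Built on radial-velocity time series data, it comprises 120 tasks, including 20 real archival cases, ranging from high-SNR single-planet systems to complex low-SNR multi-planet systems. The study finds that current frontier agents often achieve good statistical fits but frequently fail to recover the physical parameters, and that additional compute yields only limited gains. Stargazer provides a platform for training and evaluating AI agents on model-fitting problems relevant to real research.

Details
English abstract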

The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology to design a simulation-driven environment for AI agents presumably generalizes to many other model-fitting problems across scientific domains. Source code and the project website are available at https://github.com/AIPS-UofT/Stargazer and https://aips-uoft.github.io/Stargazer/, respectively.

2604.14717 2026-05-13 cs.AI cs.CR cs.CY cs.LG

Layered Mutability: Continuity and Governance in Persistent Self-Modifying Agents

Krti Tallam

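AI summary: This paper proposes layered mutability, a framework for analyzing how the behavior of persistent self-modifying language-model agents evolves across five layers: pretraining, alignment, self-narrative, memory, and weight-level adaptation. It argues that governance becomes markedly harder when internal change is rapid, strongly coupled, hard to reverse, and hard to observe, producing a systematic mismatch between the layers that most affect behavior and the layers humans can observe. Using drift, governance-load, and hysteresis quantities together with a preliminary experiment, the paper argues that the main failure mode of such agents is not abrupt misalignment but compositional drift caused by the accumulation of locally reasonable updates.

Comments 17 pages, 2 figures, 3 tables. self-modifying agents; AI governance; identity drift; persistent memory; runtime adaptation; model editing Primary: cs.AI Cross-list: cs.LG, cs.CY

Details
English abstract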

Persistent language-model agents increasingly combine tool use, tiered memory, reflective prompting, and runtime adaptation. In such systems, behavior is shaped not only by current prompts but by mutable internal conditions that influence future action. This paper introduces layered mutability, a framework for reasoning about that process across five layers: pretraining, post-training alignment, self-narrative, memory, and weight-level adaptation. The central claim is that governance difficulty rises when mutation is rapid, downstream coupling is strong, reversibility is weak, and observability is low, creating a systematic mismatch between the layers that most affect behavior and the layers humans can most easily inspect. I formalize this intuition with simple drift, governance-load, and hysteresis quantities, connect the framework to recent work on temporal identity in language-model agents, and report a preliminary ratchet experiment in which reverting an agent's visible self-description after memory accumulation fails to restore baseline behavior. In that experiment, the estimated identity hysteresis ratio is 0.68. The main implication is that the salient failure mode for persistent self-modifying agents is not abrupt misalignment but compositional drift: locally reasonable updates that accumulate into a behavioral trajectory that was never explicitly authorized.

2604.12928 2026-05-13 cs.CL eess.AS

MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, Alexandre Défossez

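AI summary: This paper proposes MoshiRAG, an asynchronous knowledge retrieval method for improving the factual accuracy of full-duplex speech language models. By combining a compact full-duplex interface with selective retrieval, it lets the model access more powerful knowledge sources while preserving real-time interactivity. Experiments show that MoshiRAG reaches factuality comparable to non-duplex models, supports swapping retrieval modules, and performs well on out-of-domain reasoning.

Comments Accepted to ICML 2026

Details
English abstract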

Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.

2604.12923 2026-05-13 cs.CV

Pi-HOC: Pairwise 3D Human-Object Contact Estimation

Sravan Chittupalli, Ayush Jain, Dong Huang

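AI summary: This paper proposes Pi-HOC, a single-pass, instance-aware framework that predicts dense 3D semantic contact for all human-object pairs in an image. It detects instances, creates a dedicated token for each human-object pair, refines the tokens with an InteractionFormer, and then uses a SAM-based decoder to predict dense contact on SMPL human meshes. Experiments show that Pi-HOC significantly improves contact estimation accuracy and localization on multiple datasets with 20x higher throughput, improves image-to-mesh 3D reconstruction through a test-time optimization algorithm, and supports referential contact prediction from language queries.

Details
English abstract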

Resolving real-world human-object interactions in images is a many-to-many challenge, in which disentangling fine-grained concurrent physical contact is particularly difficult. Existing semantic contact estimation methods are either limited to single-human settings or require object geometries (e.g., meshes) in addition to the input image. The current state of the art leverages a powerful VLM for category-level semantics but struggles with multi-human scenarios and scales poorly at inference. We introduce Pi-HOC, a single-pass, instance-aware framework for dense 3D semantic contact prediction of all human-object pairs. Pi-HOC detects instances, creates dedicated human-object (HO) tokens for each pair, and refines them using an InteractionFormer. A SAM-based decoder then predicts dense contact on SMPL human meshes for each human-object pair. On the MMHOI and DAMON datasets, Pi-HOC significantly improves accuracy and localization over state-of-the-art methods while achieving 20x higher throughput. We further demonstrate that predicted contacts improve SAM-3D image-to-mesh reconstruction via a test-time optimization algorithm and enable referential contact prediction from language queries without additional training.

2604.11048 2026-05-13 cs.CL cs.AI

A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities

Jiaqi Chen, Ming Wang, Tingna Xie, Shi Feng, Yongkang Liu

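AI summary: This paper systematically analyzes how inducing specific personality traits in large language models affects their cognitive capabilities. Using the Neuron-based Personality Trait Induction (NPTI) framework, it evaluates the effect of the Big Five personality traits on model performance across six cognitive benchmarks and finds that persona induction changes not only interaction style but also produces stable shifts in cognitive task performance, with effects that vary by task type and trait. It also proposes Dynamic Persona Routing (DPR), a lightweight strategy that outperforms a fixed persona without additional training.

Details
English abstract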

Imbuing Large Language Models (LLMs) with specific personas is prevalent for tailoring interaction styles, yet the impact on underlying cognitive capabilities remains unexplored. We employ the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five personality traits in LLMs and evaluate performance across six cognitive benchmarks. Our findings reveal that persona induction produces stable, reproducible shifts in cognitive task performance beyond surface-level stylistic changes. These effects exhibit strong task dependence: certain personalities yield consistent gains on instruction-following, while others impair complex reasoning. Effect magnitude varies systematically by trait dimension, with Openness and Extraversion exerting the most robust influence. Furthermore, LLM effects show 73.68% directional consistency with human personality-cognition relationships. Capitalizing on these regularities, we propose Dynamic Persona Routing (DPR), a lightweight query-adaptive strategy that outperforms the best static persona without additional training.

2604.06779 2026-05-13 cs.AI

VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

Shivanshu Shekhar, Sagnik Mukherjee, Jia Yi Zhang, Tong Zhang

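AI summary: This paper proposes VASR, a variance-aware systematic resampling method that addresses the rapid lineage collapse of Sequential Monte Carlo (SMC) particles in reward-guided diffusion models. By separating continuation variance from residual variance, it shows that the high offspring-count variance of the commonly used multinomial resampling is the main cause of the collapse, and it proposes a remedy based on variance-optimal mass allocation and systematic resampling. VASR and its variant VASR-Max deliver better sample quality and higher computational efficiency on several tasks, and both are training-free and parallelizable.

Details
English abstract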

Sequential Monte Carlo (SMC) samplers for reward-guided diffusion models often suffer from rapid lineage collapse: a few high-reward particles dominate the population within a handful of resampling steps, destroying diversity and degrading sample quality. We propose a variance-decomposition framework for reward-guided diffusion SMC that separates continuation variance $V_t^{\mathrm{cont}}$ from residual variance $V_t^{\mathrm{res}}$, revealing that high offspring-count variance under the commonly used multinomial resampling drives this collapse. This motivates VASR (Variance-Aware Systematic Resampling), which addresses both variance terms via variance-optimal mass allocation $m_t \propto w_t e^{r_t}$ (minimizing $V_t^{\mathrm{cont}}$) and systematic resampling (controlling $V_t^{\mathrm{res}}$). For latent diffusion models where intermediate rewards are noisy due to stochastic continuations, we propose VASR-Max, a deliberately biased high-selection variant for variance-sensitive reward optimization. Both methods are training-free, fully parallelizable, and add only linear overhead. On MNIST and CIFAR-10, VASR achieves up to 26% better FID than prior SMC methods while remaining 66 times faster than MCTS-based value methods at matched compute. On text-to-image generation, VASR-Max consistently outperforms the strongest SMC baseline across compute budgets and matches MCTS-based methods within 2.5-3% reward at high budgets while being approximately times faster.
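
The two ingredients named in this abstract, the mass allocation $m_t \propto w_t e^{r_t}$ and systematic resampling, can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function names `allocation_mass` and `systematic_resample` are hypothetical, the weights are assumed strictly positive, and the example only shows the textbook systematic resampler applied to the reweighted masses.

```python
import numpy as np

def allocation_mass(weights, rewards):
    """Normalized masses m_i proportional to w_i * exp(r_i), computed in log space.

    Assumes all weights are strictly positive (an assumption of this sketch).
    """
    log_m = np.log(np.asarray(weights, dtype=float)) + np.asarray(rewards, dtype=float)
    log_m -= log_m.max()              # subtract the max for numerical stability
    m = np.exp(log_m)
    return m / m.sum()

def systematic_resample(masses, rng):
    """Return ancestor indices drawn by standard systematic resampling."""
    m = np.asarray(masses, dtype=float)
    m = m / m.sum()
    n = len(m)
    # A single uniform offset shared by n evenly spaced points in [0, 1).
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(m)
    cumulative[-1] = 1.0              # guard against floating-point round-off
    return np.searchsorted(cumulative, positions, side="right")

rng = np.random.default_rng(0)
weights = np.array([0.1, 0.2, 0.3, 0.4])
rewards = np.array([0.0, 1.0, -1.0, 0.5])
masses = allocation_mass(weights, rewards)
ancestors = systematic_resample(masses, rng)
offspring = np.bincount(ancestors, minlength=len(masses))
```

With systematic resampling, each particle's offspring count is either the floor or the ceiling of `n * masses[i]`, whereas multinomial resampling (for example `rng.choice(n, size=n, p=masses)`) lets the counts fluctuate much more widely. That lower offspring-count variance is the property the abstract appeals to.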

2604.06485 2026-05-13 cs.LG cs.AI

Inference-Time Code Selection via Symbolic Equivalence Partitioning

David Cho, Yifan Wang, Fanping Sui, Ananth Grama

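AI summary: This paper studies how to select a correct solution at inference time from multiple candidate programs generated by a large language model. The authors propose Symbolic Equivalence Partitioning (SEP), which uses problem-provided public examples as validity signals and applies symbolic execution to partition candidate programs into functional equivalence classes, then selects the most likely correct solution. Experiments show that the method significantly improves code selection accuracy on multiple benchmarks without auxiliary test generation or learned verifiers.

Details
English abstract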

Sampling multiple candidate programs at inference time is an effective way to improve LLM code generation. However, its benefit depends on reliably selecting a correct solution from the generated pool. We observe that this selection problem has a distinctive semantic structure: correct solutions, despite differences in syntax, implementation, or algorithmic strategy, often converge to the same functional behavior over valid inputs. At the same time, consensus alone is not sufficient for correctness, because models can also produce correlated wrong solutions that implement the same mistaken behavior. We propose Symbolic Equivalence Partitioning (SEP), an inference-time selection framework that first uses problem-provided public examples as lightweight validity signals. SEP then uses symbolic execution to partition the remaining candidate programs into bounded functional equivalence classes and selects from the dominant equivalence class. Across HumanEval+ and LiveCodeBench, SEP consistently improves selection accuracy without auxiliary test generation, learned verifiers, or additional LLM inference. At $N=10$, SEP improves average accuracy from 0.754 to 0.826 on HumanEval+ and from 0.565 to 0.647 on LiveCodeBench, showing that symbolic functional agreement is an effective signal for inference-time code selection.
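
The filter-then-partition idea described here can be sketched in a few lines of Python. One substitution should be stated plainly: the paper partitions candidates with symbolic execution, while this sketch approximates functional equivalence by running each candidate on a handful of concrete probe inputs. All names (`passes_public_examples`, `behavior_signature`, `select_candidate`, and the toy candidates) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def passes_public_examples(func, examples):
    """True if func reproduces every (args, expected) public example."""
    for args, expected in examples:
        try:
            if func(*args) != expected:
                return False
        except Exception:
            return False
    return True

def behavior_signature(func, probe_inputs):
    """Observed behavior on the probe inputs (a stand-in for symbolic execution)."""
    signature = []
    for args in probe_inputs:
        try:
            signature.append(("ok", repr(func(*args))))
        except Exception as exc:
            signature.append(("err", type(exc).__name__))
    return tuple(signature)

def select_candidate(candidates, examples, probe_inputs):
    """Filter by public examples, group by behavior, pick from the largest group."""
    valid = [f for f in candidates if passes_public_examples(f, examples)]
    if not valid:
        return None
    classes = defaultdict(list)
    for f in valid:
        classes[behavior_signature(f, probe_inputs)].append(f)
    dominant = max(classes.values(), key=len)
    return dominant[0]

# Toy candidates for "absolute value".
def cand_a(x): return x if x >= 0 else -x
def cand_b(x): return abs(x)
def cand_c(x): return x      # wrong on negatives, yet passes the public example
def cand_d(x): return -x     # fails the public example

public_examples = [((3,), 3)]
probe_inputs = [(-2,), (0,), (5,)]
chosen = select_candidate([cand_c, cand_a, cand_b, cand_d], public_examples, probe_inputs)
```

In this toy run, `cand_d` is removed by the public example, `cand_c` forms a class of its own, and `cand_a` and `cand_b` share the dominant class, so one of them is returned. As the abstract notes, agreement alone does not guarantee correctness: a pool dominated by correlated wrong solutions would defeat this selection rule too.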

2604.04894 2026-05-13 cs.CL cs.AI cs.LG

Asymmetric Advantage Modulation Calibrates Entropy Dynamics in RLVR

Hengrui Gu, Xiaotian Han, Yujing Bian, Feiyi Wang, Kaixiong Zhou

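AI summary: Reinforcement learning with verifiable rewards (RLVR) improves the reasoning ability of large language models (LLMs), but restricted exploration often keeps the policy from finding diverse solutions. This paper proposes AsymGRPO, a method for regulating entropy dynamics that splits the advantage estimator into positive and negative channels and modulates productive entropy and noisy entropy separately, giving finer control over how the model learns. The method performs well on several mathematical reasoning tasks and clearly outperforms existing RLVR baselines.

Details
English abstract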

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of large language models (LLMs), but it often suffers from restricted exploration, where the policy rapidly concentrates on a narrow set of solutions. A common remedy is entropy regularization, which attempts to preserve exploration by increasing policy entropy. However, for LLM-RL, this intervention is highly sensitive to its coefficient, can introduce semantically weak uncertainty, and often yields limited accuracy gains. This motivates a more precise question: which entropy helps reasoning, and which entropy should be reduced? To study this, we parameterize the advantage estimator in Group Relative Policy Optimization (GRPO) into positive and negative outcome-conditioned channels and analyze their entropy dynamics. Our results show that positive-channel modulation raises productive entropy associated with successful reasoning trajectories, while negative-channel modulation removes noisy entropy associated with failed rollouts and reduces interference with correct paths. Guided by this channel-wise view, we propose AsymGRPO, which decouples the modulation strengths of positive and negative advantages. This enables flexible control over how the model updates across prompt difficulty levels, allowing stronger reinforcement of rare successes on harder prompts or stronger suppression of residual failures on easier prompts without forcing the two channels to share the same modulation strength. Experiments on five mathematical reasoning benchmarks show that AsymGRPO outperforms strong RLVR baselines, with consistent gains across model backbones.
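
To make the channel-wise view concrete, the sketch below computes standard group-relative advantages and then scales the positive and negative parts with separate coefficients. The abstract does not specify the functional form of the modulation, so the plain multiplicative scaling here, along with the names `group_relative_advantages`, `asymmetric_modulation`, `pos_scale`, and `neg_scale`, is an assumption made for illustration rather than the AsymGRPO definition.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one prompt's rollout group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def asymmetric_modulation(advantages, pos_scale, neg_scale):
    """Scale positive and negative advantages independently (assumed form)."""
    a = np.asarray(advantages, dtype=float)
    return np.where(a > 0, pos_scale * a, neg_scale * a)

# A hard prompt: one success among four rollouts (binary verifiable reward).
rewards = [1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
mod = asymmetric_modulation(adv, pos_scale=2.0, neg_scale=0.5)
```

In this example the single successful rollout is reinforced more strongly while the failed rollouts are penalized more gently, which mirrors the "stronger reinforcement of rare successes on harder prompts" behavior the abstract describes.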

2604.03701 2026-05-13 cs.CV

VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning

Shaoyang Cui, Lingbei Meng

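AI summary: VidNum-1.4K is a comprehensive benchmark for evaluating numerical reasoning in video. It contains 1,379 human-annotated video-question pairs covering a variety of complex scenarios and is designed to test how well vision-language models understand temporal events, object permanence, and compositional logic. The benchmark uses a three-level structure that progresses from direct visual perception to multi-step numerical reasoning, requiring models to perform arithmetic, comparisons, and logical deductions. Experiments show that current state-of-the-art models still leave a large performance gap on this task, underscoring both its difficulty and the limitations of existing models.

Comments 7 pages, 5 figures, under review at ACMMM 2026 Dataset Track

Details
English abstract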

Video-based numerical reasoning provides a premier arena for testing whether Vision-Language Models (VLMs) truly "understand" real-world dynamics, as accurate numerical deduction necessitates a profound grasp of temporal events, object permanence, and compositional logic beyond superficial pattern matching. However, existing benchmarks are often confined to narrow domains, such as repetitive athletic motions, or treat simple counting merely as a superficial regression task, failing to assess multi-step numerical logic within the inherent complexity of real-world multimedia content. We introduce VidNum-1.4K, a comprehensive VideoQA benchmark comprising 1,379 strictly human-annotated video-question pairs designed to evaluate genuine numerical reasoning across highly diverse environments, encompassing object, action, and event quantification. VidNum-1.4K is structured into a three-level hierarchy that evolves from direct visual perception to video-based compositional numerical reasoning, requiring models to perform arithmetic operations, comparisons, and logical deductions grounded in temporal evidence. Our evaluations across a diverse suite of state-of-the-art VLMs reveal a striking reasoning gap: while Gemini-3.1-pro barely reaches a 60% accuracy threshold, representative open-source families struggle heavily in the 25%-45% range. These findings demonstrate that current VLMs still lack a stable "internal world model", positioning VidNum-1.4K as a demanding diagnostic testbed for the next generation of numerical video intelligence.

2603.28561 2026-05-13 cs.RO cs.AI

Fine-Tuning Large Language Models for Cooperative Tactical Deconfliction of Small Unmanned Aerial Systems

Iman Sharifi, Alex Zongo, Peng Wei

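AI summary: As small unmanned aerial systems are deployed more widely in low-altitude airspace, reliable tactical deconfliction under safety constraints has become a pressing problem. This paper studies fine-tuning large language models (LLMs) for cooperative multi-agent deconfliction. It proposes a simulation-to-language data generation pipeline based on the BlueSky simulator that produces deconfliction datasets consistent with aviation safety rules, and it fine-tunes a pretrained model with Low-Rank Adaptation (LoRA) and a preference-based strategy. Experiments show that the approach substantially improves decision accuracy, consistency, and separation performance and reduces near mid-air collisions.

Comments 15 pages, 6 figures, to be published in CVPR 2026 Workshop Proceedings

Details
English abstract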

The growing deployment of small Unmanned Aerial Systems (sUASs) in low-altitude airspaces has increased the need for reliable tactical deconfliction under safety-critical constraints. Tactical deconfliction involves short-horizon decision-making in dense, partially observable, and heterogeneous multi-agent environments, where both cooperative separation assurance and operational efficiency must be maintained. While Large Language Models (LLMs) exhibit strong reasoning capabilities, their direct application to air traffic control remains limited by insufficient domain grounding and unpredictable output inconsistency. This paper investigates LLMs as decision-makers in cooperative multi-agent tactical deconfliction using fine-tuning strategies that align model outputs to human operator heuristics. We propose a simulation-to-language data generation pipeline based on the BlueSky air traffic simulator that produces rule-consistent deconfliction datasets reflecting established safety practices. A pretrained Qwen-Math-7B model is fine-tuned using two parameter-efficient strategies: supervised fine-tuning with Low-Rank Adaptation (LoRA) and preference-based fine-tuning combining LoRA with Group-Relative Policy Optimization (GRPO). Experimental results on validation datasets and closed-loop simulations demonstrate that supervised LoRA fine-tuning substantially improves decision accuracy, consistency, and separation performance compared to the pretrained LLM, with significant reductions in near mid-air collisions. GRPO provides additional coordination benefits but exhibits reduced robustness when interacting with heterogeneous agent policies.

2603.28488 2026-05-13 cs.CL cs.AI cs.MA

Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification

Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan, Hasan Mahmud

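AI summary: To address the unreliability of large language models in high-stakes claim verification, this study proposes PROClaim, a courtroom-style multi-agent framework that introduces specialized roles and Progressive RAG (P-RAG) to deepen and sharpen evidence retrieval and reasoning. Through a structured debate process, evidence negotiation, and heterogeneous multi-judge aggregation, the method improves calibration and robustness. In zero-shot evaluation it outperforms standard multi-agent debate by 10 percentage points, supporting its effectiveness for verifying controversial claims.

Comments Under review, 7 figures, 12 tables

Details
English abstract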

Large language models (LLMs) remain unreliable for high-stakes claim verification due to hallucinations and shallow reasoning. While retrieval-augmented generation (RAG) and multi-agent debate (MAD) address this, they are limited by one-pass retrieval and unstructured debate dynamics. We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG) to dynamically expand and refine the evidence pool during the debate. Furthermore, we employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to enforce calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points, with P-RAG driving the primary performance gains (+7.5 pp). We ultimately demonstrate that structural deliberation and model heterogeneity effectively mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.

2603.27358 2026-05-13 cs.CL

Not Worth Mentioning? A Pilot Study on Salient Proposition Annotation

Amir Zeldes, Katherine Conhaim, Lauren Levine

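AI summary: This paper examines how to annotate graded proposition salience in naturally occurring text. It adapts a summarization-based salience metric to the proposition level and defines the corresponding annotation task. Applying the task to a small multi-genre dataset, it evaluates feasibility and offers a preliminary look at how the metric relates to discourse unit centrality in Rhetorical Structure Theory.

Details
English abstract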

Despite a long tradition of work on extractive summarization, which by nature aims to recover the most important propositions in a text, little work has been done on operationalizing graded proposition salience in naturally occurring data. In this paper, we adopt graded summarization-based salience as a metric from previous work on Salient Entity Extraction (SEE) and adapt it to quantify proposition salience. We define the annotation task, apply it to a small multi-genre dataset, evaluate agreement and carry out a preliminary study of the relationship between our metric and notions of discourse unit centrality in discourse parsing following Rhetorical Structure Theory (RST).

2603.24652 2026-05-13 cs.CL cs.LG

Demystifying When Pruning Works via Representation Hierarchies

Shwai He, Guoheng Sun, Haichao Zhang, Yun Fu, Ang Li

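AI summary: This study examines why network pruning works differently across language tasks: it works well for non-generative tasks such as retrieval and multiple-choice selection but often degrades generative tasks. Analyzing the representation hierarchy of language models, it decomposes internal computation into embedding, logit, and probability spaces and finds that the embedding and logit spaces are largely robust to pruning, whereas the nonlinear transformation from logits to probabilities amplifies pruning-induced deviations and harms generation quality. The analysis explains the mechanism behind the task-dependent effect of pruning and offers practical guidance.

Comments ICML 2026. 24 pages, 21 figures, and 3 tables. Includes an appendix with supplementary experiments and derivations

Details
English abstract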

Network pruning, which removes less important parameters or architectures, is often expected to improve efficiency while preserving performance. However, this expectation does not consistently hold across language tasks: pruned models can perform well on non-generative tasks but frequently fail in generative settings. To understand this discrepancy, we analyze network pruning from a representation-hierarchy perspective, decomposing the internal computation of language models into three sequential spaces: embedding (hidden representations), logit (pre-softmax outputs), and probability (post-softmax distributions). We find that representations in the embedding and logit spaces are largely robust to pruning-induced perturbations. However, the nonlinear transformation from logits to probabilities amplifies these deviations, which accumulate across time steps and lead to substantial degradation during generation. In contrast, the stability of the categorical-token probability subspace, together with the robustness of the embedding space, supports the effectiveness of pruning for non-generative tasks such as retrieval and multiple-choice selection. Our analysis disentangles the effects of pruning across tasks and provides practical guidance for its application. Code is available at https://github.com/CASE-Lab-UMD/Pruning-on-Representations