arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11884 2026-05-13 cs.LG

Sobolev Regularized MMD Gradient Flow

Chenyang Tian, Bharath K. Sriperumbudur, Arthur Gretton, Zonghao Chen

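AI summary This paper proposes a Sobolev-regularized maximum mean discrepancy (SrMMD) gradient flow, which improves on the standard MMD gradient flow by imposing a gradient penalty on the witness function. The method mitigates the non-convexity of the MMD objective and comes with provable global convergence guarantees in both continuous and discrete time. Unlike prior work, which applies to only one of generative modeling or sampling, the proposed flow applies both to sampling from unnormalized target distributions and to generative modeling, and its convergence analysis relies not on isoperimetric assumptions about the target distribution but on a regularity condition on the difference between kernel mean embeddings.

English abstract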

We propose Sobolev-regularized Maximum Mean Discrepancy (SrMMD) gradient flow, a regularized variant of maximum mean discrepancy (MMD) gradient flow based on a gradient penalty on the witness function. The proposed regularization mitigates the non-convexity of the MMD objective and yields provable global convergence guarantees in MMD in both continuous and discrete time. A more surprising appeal is that our convergence analysis does not rely on isoperimetric assumptions on the target distribution. Instead, it is based on a regularity condition on the difference between kernel mean embeddings. A key highlight of the proposed flow is that it is applicable in both sampling (from an unnormalized target distribution) -- using Stein kernels -- and generative modeling settings, unlike previous works, where a gradient flow is suitable for only generative modeling or sampling but not both. The effectiveness of the proposed flow is empirically verified on a broad range of tasks in both generative modeling and sampling.
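As a rough illustration of the quantities this abstract refers to, the sketch below estimates the squared MMD between two one-dimensional samples and evaluates the unnormalized witness function, that is, the difference of the two kernel mean embeddings. It covers plain MMD only; the Sobolev gradient penalty and the flow itself are not implemented. The Gaussian kernel, the bandwidth, the sample values, and all function names are assumptions made for this example.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on the real line; the kernel choice is an assumption."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mean_kernel(a, b, bandwidth=1.0):
    """Average kernel value over all pairs drawn from samples a and b."""
    total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
    return total / (len(a) * len(b))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples xs and ys."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))


def witness(z, xs, ys, bandwidth=1.0):
    """Unnormalized witness: mean embedding of xs minus that of ys, evaluated at z."""
    fx = sum(gaussian_kernel(z, x, bandwidth) for x in xs) / len(xs)
    fy = sum(gaussian_kernel(z, y, bandwidth) for y in ys) / len(ys)
    return fx - fy


particles = [0.0, 0.1, -0.1]  # hypothetical current particles
target = [3.0, 3.1, 2.9]      # hypothetical target samples
print(mmd_squared(particles, target))
print(witness(0.0, particles, target), witness(3.0, particles, target))
```

The witness is positive where the particles have more mass than the target and negative where they have less, which is the signal an MMD-type flow uses to move particles.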

2605.11882 2026-05-13 cs.AI

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

Bo Yin, Qi Li, Xinchao Wang

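AI summary This work addresses unsafe behavior that tool-using LLM agents may exhibit during execution and proposes FATE, an on-policy self-evolution framework built on failure trajectories. The method converts verifier-scored failure trajectories into repair supervision that guides the agent's self-improvement, and introduces Pareto-Front Policy Optimization to balance safety against task utility. Experiments show that FATE significantly improves agent safety on several benchmarks while preserving task performance.

English abstract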

Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense trajectory-level information is then used as a supervision signal for agent self-evolution. During this process, we further introduce Pareto-Front Policy Optimization (PFPO), combining supervised warmup with Pareto-aware policy optimization to preserve safety-utility trade-offs. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.

2605.11880 2026-05-13 cs.LG cs.MA

Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning

Yue Deng, Zirui Wang, Yin Zhang

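AI summary This paper studies how to set the TD($λ$) parameter adaptively in cooperative multi-agent reinforcement learning to improve the stability and efficiency of value estimation. The authors propose a method based on parametric likelihood-ratio estimation that avoids the difficulty of computing the policy distribution statistically, and use two replay buffers of different sizes to distinguish the data distributions of past and current policies. By assigning adaptive TD($λ$) values to state-action pairs, the method performs better than or on par with static-$λ$ methods on several benchmark environments.

English abstract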

TD($λ$) in value-based MARL algorithms, like Temporal Difference critic learning in Actor-Critic-based (AC-based) algorithms, integrates elements of Monte-Carlo simulation and Q-function bootstrapping via dynamic programming, which addresses the inherent bias-variance trade-off in value estimation. Building on this, some recent works in single-agent reinforcement learning link an adaptive $λ$ value to the policy distribution. However, because of the large joint action space induced by multiple agents and the limited transition data in Multi-agent Reinforcement Learning, the policy distribution cannot feasibly be computed statistically. To solve this problem in MARL settings, we employ a parametric likelihood-free density ratio estimator with two replay buffers instead of computing the distribution statistically. The two replay buffers, of different sizes, store the historical trajectories that represent the data distributions of the past and current policies, respectively. Based on the estimator, we assign Adaptive TD($λ$), ATD($λ$), values to state-action pairs according to their likelihood under the stationary distribution of the current policy. We apply the proposed method to two competitive baseline methods, QMIX for value-based algorithms and MAPPO for AC-based algorithms, on SMAC benchmarks and Gfootball academy scenarios, and demonstrate consistently competitive or superior performance compared to baseline approaches with static $λ$ values.
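For readers unfamiliar with the bias-variance trade-off mentioned here, the following sketch computes $λ$-returns with a separate $λ$ for every time step. It is a generic single-trajectory illustration, not the ATD($λ$) method: the rule that would set each $λ$ from a density ratio estimate is not shown, and the function name, argument layout, and numbers are assumptions.

```python
def lambda_returns(rewards, next_values, lambdas, gamma=0.99):
    """Backward recursion for lambda-returns with a per-step lambda.

    rewards[t]     : reward received at step t
    next_values[t] : bootstrap value estimate V(s_{t+1})
    lambdas[t]     : mixing weight at step t (0 = one-step TD, 1 = Monte Carlo)
    """
    returns = [0.0] * len(rewards)
    running = next_values[-1]  # bootstrap from the value after the final step
    for t in reversed(range(len(rewards))):
        mixed = (1.0 - lambdas[t]) * next_values[t] + lambdas[t] * running
        running = rewards[t] + gamma * mixed
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0]      # hypothetical trajectory
next_values = [0.5, 0.5, 0.0]  # hypothetical critic outputs
td_targets = lambda_returns(rewards, next_values, [0.0, 0.0, 0.0], gamma=1.0)
mc_targets = lambda_returns(rewards, next_values, [1.0, 1.0, 1.0], gamma=1.0)
print(td_targets, mc_targets)
```

Setting every $λ$ to 0 recovers one-step TD targets (low variance, biased by the critic), and setting every $λ$ to 1 recovers Monte Carlo returns (unbiased, higher variance); an adaptive scheme picks values in between per state-action pair.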

2605.11872 2026-05-13 cs.LG stat.ML

LOFT: Low-Rank Orthogonal Fine-Tuning via Task-Aware Support Selection

Lanxin Zhao, Bamdev Mishra, Pratik Jawanpuria, Lequan Lin, Dai Shi, Junbin Gao, Andi Han

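AI summary This paper proposes LOFT, a low-rank orthogonal fine-tuning framework that addresses the conflation of subspace selection and transformation in existing orthogonal parameter-efficient fine-tuning methods. By viewing orthogonal fine-tuning as a subspace rotation, LOFT unifies several existing methods, treats support selection as a central design element, and proposes practical support selection strategies driven by the task signal. Experiments show that LOFT achieves a favorable efficiency-performance balance across tasks, underscoring the importance of principled support selection for orthogonal fine-tuning.

English abstract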

Orthogonal parameter-efficient fine-tuning (PEFT) adapts pretrained weights through structure-preserving multiplicative transformations, but existing methods often conflate two distinct design choices: the subspace in which adaptation occurs and the transformation applied within that subspace. This paper introduces LOFT, a low-rank orthogonal fine-tuning framework that explicitly separates these two components. By viewing orthogonal adaptation as a multiplicative subspace rotation, LOFT provides a unified formulation that recovers representative orthogonal PEFT methods, including coordinate-, butterfly-, Householder-, and principal-subspace-based variants. More importantly, this perspective exposes support selection as a central design axis rather than a byproduct of a particular parameterization. We develop a first-order analysis showing that useful adaptation supports should be informed by the downstream training signal, motivating practical task-aware support selection strategies. Across language understanding, visual transfer, mathematical reasoning, and multilingual out-of-distribution adaptation, LOFT recovers principal-subspace orthogonal adaptation while gradient-informed supports improve the efficiency-performance trade-off under matched parameter, memory, and compute budgets. These results suggest that principled support selection is an important direction for improving orthogonal PEFT.

2605.11870 2026-05-13 cs.LG cs.IT math.IT

Information theoretic underpinning of self-supervised learning by clustering

Josef Kittler, Sara Atito, Muhammad Awais

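AI summary This paper examines the theoretical foundations of clustering-based self-supervised learning from an information-theoretic perspective, formulating self-supervised learning as a K-L divergence optimization problem and preventing mode collapse through an optimization constraint on the teacher distribution. The study derives a normalization based on inverse cluster priors and shows its theoretical connection to batch centering, providing theoretical support for the distillation and centering techniques commonly used in self-supervised learning.

English abstract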

Self-supervised learning (SSL) is recognized as an essential tool for building foundation models for Artificial Intelligence applications. The advances in SSL have been made thanks to vigorous arguments about the principles of SSL and through extensive empirical research. The aim of this paper is to contribute to the development of the underpinning theory of SSL, focusing on the deep clustering approach. By analogy to supervised learning, we formulate SSL as K-L divergence optimization. Mode collapse is prevented by imposing an optimisation constraint on the teacher distribution. This leads to normalization using inverse cluster priors. We show that, using Jensen's inequality, this normalization simplifies to the popular batch centering procedure. Distillation and centering are common heuristics-based practices in SSL, but our work underpins them theoretically. The theoretical model developed not only supports specific existing successful SSL methods, but also suggests directions for future investigations.
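To make the batch centering step concrete, here is a minimal sketch of centering in one common form: subtracting the per-cluster batch mean from the teacher logits before the softmax. Which exact variant the paper analyses is an assumption on my part, and the function names and toy logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one row of logits."""
    scaled = [v / temperature for v in logits]
    top = max(scaled)
    exps = [math.exp(v - top) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def batch_center(batch_logits):
    """Per-cluster mean of the teacher logits over the batch."""
    n = len(batch_logits)
    k = len(batch_logits[0])
    return [sum(row[j] for row in batch_logits) / n for j in range(k)]


def centered_teacher_probs(batch_logits, temperature=1.0):
    """Subtract the batch center from every row, then apply the softmax."""
    center = batch_center(batch_logits)
    return [softmax([v - c for v, c in zip(row, center)], temperature)
            for row in batch_logits]


# A collapsed teacher: every sample is pushed towards cluster 0.
collapsed = [[5.0, 0.0, 0.0] for _ in range(4)]
print(softmax(collapsed[0]))
print(centered_teacher_probs(collapsed)[0])
```

In the collapsed batch, the uncentered softmax puts almost all mass on one cluster, while the centered version removes the shared offset and returns a uniform assignment, which is the collapse-preventing effect centering is used for.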

2605.11869 2026-05-13 cs.CV cs.LG

FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

Jian Tang, Jiawei Fan, Qingbin Liu, Zheng Wei

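AI summary Although model distillation can speed up inference for video Diffusion Transformers (DiTs), per-step inference latency remains a key bottleneck. Existing methods mainly exploit redundancy along the denoising trajectory, but they are of limited use in few-step inference, where the scarcity of temporal states makes feature reuse difficult. The paper therefore proposes FIS-DiT, a training-free, operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension, using a frame-interleaved sparsity strategy that operates on frame subsets across the model hierarchy for efficient inference. Experiments show that FIS-DiT achieves a 2.11 to 2.41 times speedup on Wan 2.2 and HunyuanVideo 1.5 with almost no degradation across several metrics.

English abstract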

While the overall inference latency of Video Diffusion Transformers (DiTs) can be substantially reduced through model distillation, per-step inference latency remains a critical bottleneck. Existing acceleration paradigms primarily exploit redundancy across the denoising trajectory; however, we identify a limitation where these step-wise strategies encounter diminishing returns in few-step regimes. In such scenarios, the scarcity of temporal states prevents effective feature reuse or predictive modeling, creating a formidable barrier to further acceleration. To overcome this, we propose Frame Interleaved Sparsity DiT (FIS-DiT), a training-free and operator-agnostic framework that shifts the optimization focus from the temporal trajectory to the latent frame dimension. Our approach is motivated by an intrinsic duality within this dimension: the existence of frame-wise sparsity that permits reduced computation, coupled with a structural consistency where each frame position remains equally vital to the global spatiotemporal context. Leveraging this insight, we implement Frame Interleaved Sparsity (FIS) as an execution strategy that manipulates frame subsets across the model hierarchy, refreshing all latent positions without requiring full-scale block computation. Empirical evaluations on Wan 2.2 and HunyuanVideo 1.5 demonstrate that FIS-DiT consistently achieves 2.11--2.41$\times$ speedup with negligible degradation across VBench-Q and CLIP metrics, providing a scalable and robust pathway toward real-time high-definition video generation.

2605.11863 2026-05-13 cs.CV eess.IV

GATA2Floor: Graph attention for floor counting in street-view facades

Ngoc Tan Le, Tzoulio Chamiti, Eirini Papagiannopoulou, Nikos Deligiannis

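AI summary This paper studies how to automatically determine the number of floors of a building from street-view facade images and proposes GATA2Floor, a model based on graph attention. The method models a facade as a graph over windows and doors, uses a multi-head graph attention network to predict the floor count, and assigns elements to latent floor slots through learnable cross-attention queries, yielding interpretable and robust results. To address the shortage of labeled data, the authors also propose a lightweight, annotation-free proposal mechanism that uses self-supervised features and vision-language scoring, demonstrating the effectiveness of graph-attention relational reasoning for facade understanding.

Comments Accepted at IEEE ICIP 2026; 6 pages, 5 figures, 3 tables

English abstract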

Automated analysis of building facades from street-level imagery has great potential for urban analytics, energy assessment, and emergency planning. However, it requires reasoning over spatially arranged elements rather than solely isolated detections. In this work, we model each facade as a graph over window/door detections with a vertical prior on edges. Additionally, we introduce GATA2Floor, a multi-head Graph Attention v2 (GATv2) based model that predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns elements to latent floor slots, yielding interpretable outputs and robustness to irregular designs. To mitigate the lack of labeled datasets, we demonstrate that the proposed graph-based reasoning can be applied without annotations by leveraging a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring. Our approach demonstrates the value of graph-attention-based relational reasoning for facade understanding.

2605.11862 2026-05-13 cs.CL

Concordance Comparison as a Means of Assembling Local Grammars

Juliana Pirovani, Elias de Oliveira, Eric Laporte

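AI summary This paper studies how comparing the concordances of local grammars (LGs) can help build better local grammars and improve person-name recognition. The authors propose a concordance-comparison approach that analyzes relationships of inclusion, intersection, and disjunction between local grammars in order to select and combine the best-performing ones. On Portuguese person-name extraction the method reaches an F-measure of 76.86, 6 points above the previous best method.

Journal ref
Computational Processing of the Portuguese Language. 13th International Conference, PROPOR, Canela, Brazil, September 24-26, 2018, Proceedings, 11122, Springer, pp.57-65, Lecture Notes in Artificial Intelligence
English abstract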

Named Entity Recognition for person names is an important but non-trivial task in information extraction. This article uses a tool that compares the concordances obtained from two local grammars (LG) and highlights the differences. We used the results as an aid to select the best of a set of LGs. By analyzing the comparisons, we observed relationships of inclusion, intersection and disjunction within each pair of LGs, which helped us to assemble those that yielded the best results. This approach was used in a case study on extraction of person names from texts written in Portuguese. We applied the enhanced grammar to the Gold Collection of the Second HAREM. The F-Measure obtained was 76.86, representing a gain of 6 points in relation to the state-of-the-art for Portuguese.
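The inclusion, intersection, and disjunction relationships mentioned in this abstract can be pictured with plain set operations on the matches each grammar produces. The sketch below is only that picture: the tool used in the paper is not shown, and the function name, the extra "identical" case, and the example matches are hypothetical.

```python
def compare_concordances(matches_a, matches_b):
    """Classify how the match sets of two local grammars relate to each other."""
    a, b = set(matches_a), set(matches_b)
    if a == b:
        return "identical"
    if a <= b or b <= a:
        return "inclusion"      # one grammar finds everything the other finds
    if a & b:
        return "intersection"   # partial overlap, each grammar adds something
    return "disjunction"        # no shared matches at all


# Hypothetical concordance entries (matched person-name strings).
lg_titles = {"Dr. Ana Costa", "Sr. Rui Lopes"}
lg_titles_wide = {"Dr. Ana Costa", "Sr. Rui Lopes", "Prof. Ines Melo"}
lg_verbs = {"Sr. Rui Lopes", "Joana Pires"}
lg_other = {"Carlos Nunes"}

print(compare_concordances(lg_titles, lg_titles_wide))
print(compare_concordances(lg_titles, lg_verbs))
print(compare_concordances(lg_titles, lg_other))
```

Under this view, a grammar included in another adds nothing when the two are combined, while intersecting or disjoint grammars are candidates for assembly because each contributes matches the other misses.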

2605.11859 2026-05-13 cs.RO cs.AI

EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models

Zhikai Zhao, Chuanbo Hua, Federico Berto, Zihan Ma, Kanghoon Lee, Jiachen Li, Jinkyoo Park

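AI summary This paper proposes EvoNav, a framework for designing robot navigation reward functions based on evolutionary algorithms and large language models, aimed at the problem that hand-designed reward functions depend on domain expertise and adapt poorly to complex environments. Through a staged warm-up-boost procedure, the method uses a large language model to generate candidate reward functions and combines low-cost proxies with progressively more intensive training, substantially improving design efficiency and navigation policy performance. Experiments show that the navigation policies produced by EvoNav outperform manually designed rewards and existing state-of-the-art methods.

English abstract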

Robot navigation is a crucial task with applications to social robots in dynamic human environments. While Reinforcement Learning (RL) has shown great promise for this problem, the policy quality is highly sensitive to the specification of reward functions. Hand-crafted rewards require substantial domain expertise and embed inductive biases that are difficult to audit or adapt, limiting their effectiveness and leading to suboptimal performance. In this paper, we propose EvoNav, an evolutionary framework that automates the design of robot navigation reward functions via large language models (LLMs). To overcome prohibitively costly policy training, EvoNav evaluates each candidate proposal from the LLM via a progressive three-stage warm-up-boost procedure. EvoNav advances from analytical proxies with low-cost surrogates, such as small datasets and analytic rules, to lightweight rollouts and, finally, to full policy training, enabling computationally efficient exploration under effective feedback. Experiment results show that EvoNav produces more effective navigation policies than manually designed RL rewards and state-of-the-art reward design methods.

2605.11857 2026-05-13 cs.LG

Beyond Parameter Aggregation: Semantic Consensus for Federated Fine-Tuning of LLMs

Amr Abourayya, Jens Kleesiek, Michael Kamp

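AI summary This paper proposes a new federated fine-tuning approach that moves beyond parameter aggregation and has clients collaborate through model behavior rather than parameters. Clients fine-tune models on local data and generate outputs on a shared public prompt set; the server maps these outputs into a semantic space, forms a per-prompt semantic consensus, and returns pseudo-labels for further fine-tuning. The method markedly reduces communication overhead, is independent of model size, and suits heterogeneous architectures and open-ended text generation; in experiments it matches existing methods while greatly reducing communication, runtime, and energy consumption.

English abstract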

Federated fine-tuning of large language models is commonly formulated as a parameter aggregation problem. However, even parameter-efficient methods require transmitting large collections of trainable weights, assume aligned architectures, and rely on white-box access to model parameters. As model sizes continue to grow and deployments become increasingly heterogeneous, these assumptions become progressively misaligned with practical constraints. We consider an alternative formulation in which collaboration is mediated through model behavior rather than parameters. Clients fine-tune local models on private data and exchange generated outputs on a shared, public prompt set. The server maps these outputs into a semantic representation space, forms a per-prompt semantic consensus, and returns pseudo-labels for further local fine-tuning. This formulation fundamentally changes the communication scaling of federated LLM fine-tuning. The amount of information exchanged depends only on the public prompt budget and the size of the communicated behaviors, independent of model size. As a consequence, the protocol naturally accommodates heterogeneous architectures and applies directly to open-ended text generation. We present a theoretical analysis and empirical results demonstrating that this approach can match strong federated fine-tuning baselines while reducing communication by orders of magnitude (e.g., analytically by a factor of $1006$ for Llama3.1-405B) and also reducing runtime and energy consumption. These results suggest that, for generative foundation models, behavior-level consensus provides a more appropriate abstraction for federated adaptation than parameter aggregation.

2605.11856 2026-05-13 cs.CV cs.CL

UniVLR: Unifying Text and Vision in Visual Latent Reasoning for Multimodal LLMs

Houcheng Jiang, Jiajun Fu, Junfeng Fang, Chen Gao, Xiang Wang, Xiangnan He, Yong Li

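AI summary This paper proposes UniVLR, a unified visual latent reasoning framework intended to improve the efficiency and performance of multimodal large language models on image reasoning tasks. The method merges textual reasoning and auxiliary visual information into a shared visual workspace: reasoning traces are rendered together with images and compressed into compact visual latent representations, so that at inference time the model reasons only through visual latents and decodes the answer directly, avoiding explicit text reasoning and external tool calls. Experiments show that UniVLR outperforms existing methods on real-world perception and visual reasoning tasks while generating fewer reasoning tokens, pointing to a more efficient and unified paradigm for visual reasoning.

English abstract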

Multimodal large language models are increasingly expected to perform thinking with images, yet existing visual latent reasoning methods still rely on explicit textual chain-of-thought interleaved with visual latent tokens. This interleaved design limits efficiency and keeps reasoning fragmented across separate text and vision channels. We propose UniVLR, a unified visual latent reasoning framework that treats textual reasoning and auxiliary visual evidence as a shared visual workspace. Instead of preserving text CoT as an independent inference-time path, UniVLR renders reasoning traces together with auxiliary images and learns to compress this unified representation into compact visual latent tokens. At inference time, the model reasons only through visual latents and directly decodes the final answer, avoiding both external tool calls and verbose text reasoning. Experiments on real-world perception and visual reasoning tasks show that UniVLR outperforms prior visual latent reasoning methods while using substantially fewer generated reasoning tokens, suggesting a more unified and efficient paradigm for visual thinking in MLLMs.

2605.11846 2026-05-13 cs.LG cs.AI

Martingale-Consistent Self-Supervised Learning

Moritz Gögl, Hanwen Xing, Christopher Yau

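AI summary This paper studies how to improve the robustness and coherence of self-supervised learning (SSL) when information is incomplete or changing. The authors propose an SSL framework based on martingale theory that keeps coarse and refined predictions consistent in expectation, preventing systematic drift. The method includes prediction-space and latent-space variants and an unbiased Monte Carlo estimator; experiments show that it improves model stability and calibration under partial observation.

English abstract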

Self-supervised learning (SSL) is often deployed under changing information, such as shorter histories, missing features, or partially observed images. In these settings, predictions from coarse and refined views should be coherent: before refinement, the coarse-view prediction should match the average prediction expected after refinement. Martingales formalize this coherence principle, but standard SSL objectives do not enforce it. Unlike invariance objectives that pull views together, martingale consistency constrains only the expected refined prediction, allowing predictions to update as information is revealed while preventing systematic drift. We introduce a martingale-consistent SSL framework that closes this gap, with practical prediction- and latent-space variants and an unbiased two-sample Monte Carlo estimator based on stochastic refinement. We evaluate the approach on synthetic and real time-series, tabular, and image benchmarks under partial-observation regimes, in both semi-self-supervised and fully label-free settings. Across these experiments, our framework improves robustness and calibration under partial observation, yielding more stable representations as information is revealed.
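The coherence condition described here says that the coarse-view prediction should equal the expected refined-view prediction. One standard way to penalize the squared gap without bias is to multiply the residuals from two independent refinements; whether this matches the paper's two-sample estimator is an assumption, and the function names and toy numbers below are hypothetical.

```python
def naive_penalty(coarse, refined):
    """Squared error against a single refinement (also penalizes refinement noise)."""
    return sum((c - r) ** 2 for c, r in zip(coarse, refined))


def two_sample_penalty(coarse, refined_a, refined_b):
    """Product of residuals from two independent refinements.

    Its expectation equals ||coarse - E[refined]||^2, so it is zero on average
    whenever the coarse prediction matches the mean refined prediction.
    """
    return sum((c - a) * (c - b) for c, a, b in zip(coarse, refined_a, refined_b))


# Toy case: the refined prediction is 1.0 or 0.0 with equal probability,
# and the coarse prediction 0.5 is exactly its mean (martingale-consistent).
coarse = [0.5]
outcomes = [[1.0], [0.0]]

pairs = [(a, b) for a in outcomes for b in outcomes]  # all equally likely
mean_two_sample = sum(two_sample_penalty(coarse, a, b) for a, b in pairs) / len(pairs)
mean_naive = sum(naive_penalty(coarse, r) for r in outcomes) / len(outcomes)
print(mean_two_sample, mean_naive)
```

The naive squared error stays positive even for a perfectly coherent coarse prediction, because it also charges for the legitimate spread of the refined predictions; the two-sample form averages to zero in that case, so predictions remain free to update as information is revealed.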

2605.11845 2026-05-13 cs.CL

Probabilistic Calibration Is a Trainable Capability in Language Models

Davide Baldelli, Sruthi Kuriakose, Maryam Hashemzadeh, Amal Zouaq, Sarath Chandar

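AI summary This study examines the problem that, when language models must satisfy user-specified randomness constraints, their generation probabilities are poorly calibrated to the target distribution, and it improves this capability through fine-tuning. The researchers propose two calibration fine-tuning methods: a soft-target method that converts the target distribution into trie-derived next-token targets, and a hard-target method that trains on completions sampled from the target distribution. Experiments show that both methods improve structured-sampling fidelity across a range of distributions and parameter settings, demonstrating that probabilistic calibration can be strengthened through fine-tuning.

English abstract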

Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar-lab/calibration-finetuning.
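The idea of trie-derived next-token targets can be sketched as follows: place every possible output string in a prefix tree, accumulate probability mass along each path, and read the next-token distribution at a prefix from the masses of that node's children. This is an illustrative reconstruction, not the authors' code; in particular it treats each character as one token and uses a made-up end marker, whereas a real tokenizer would split strings differently.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(distribution):
    """Insert every output string into a trie, accumulating probability mass."""
    root = {"mass": 0.0, "children": {}}
    for text, prob in distribution.items():
        node = root
        node["mass"] += prob
        for token in list(text) + [EOS]:  # assumption: one character = one token
            node = node["children"].setdefault(token, {"mass": 0.0, "children": {}})
            node["mass"] += prob
    return root


def next_token_targets(trie, prefix):
    """Soft next-token targets at a prefix: child mass divided by node mass."""
    node = trie
    for token in prefix:
        node = node["children"][token]
    return {tok: child["mass"] / node["mass"]
            for tok, child in node["children"].items()}


# Hypothetical target distribution over complete outputs.
target = {"10": 0.5, "11": 0.25, "2": 0.25}
trie = build_trie(target)
print(next_token_targets(trie, ""))
print(next_token_targets(trie, "1"))
print(next_token_targets(trie, "10"))
```

A hard-target variant would instead sample whole strings from `target` and train on them directly; the soft targets above carry the full conditional distribution at every prefix.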

2605.11840 2026-05-13 cs.CV

Selection, Not Fusion: Radar-Modulated State Space Models for Radar-Camera Depth Estimation

Zhangcheng Hou, Tomoaki Ohtsuki

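AI summary This paper studies how to use radar signals to improve radar-camera depth estimation and proposes Radar-Modulated Selection (RMS), a mechanism built on state space models that injects radar information directly into the model's scan rather than through conventional feature fusion. Radar modulates the scan step size and readout parameters while the image backbone is left unchanged, so that radar influences the model only where it improves accuracy, giving more efficient and accurate depth estimation. Experiments show substantial gains on the nuScenes dataset together with lower latency.

Comments 16 pages, 3 figures, 9 tables

English abstract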

Radar-camera depth estimation must turn an ultra-sparse, all-weather, metric radar signal into a dense per-pixel depth map. Existing methods -- concatenation, confidence-aware gating, sparse supervision, graph-based extraction -- combine radar and image features outside the backbone's sequence operator, and even cross-modal Mamba variants leave the selection mechanism itself unimodal. We argue that the selection mechanism is the right place for radar to enter. We introduce Radar-Modulated Selection (RMS), a minimal and principled way to inject radar into Mamba's selective scan: radar modulates the scan from within, adding zero-initialised perturbations to the step size $Δ$ and readout $\mathbf{C}$ while leaving the input projection $\mathbf{B}$ and state dynamics $\mathbf{A}$ image-only. The construction is exactly equivalent to a pretrained image-only Mamba at initialisation, ensuring radar only influences the model where it improves accuracy. Two further properties follow that out-of-scan fusion cannot offer: linear-cost cross-modal coupling at every recurrence step, and a natural fallback to the image-only backbone when radar is absent. We deploy RMS in a Multi-View Scan Pyramid (MVSP) that matches the fusion operator to radar's spatial reach at each scale. SemoDepth achieves state-of-the-art performance on nuScenes, reducing MAE by 34.0%, 29.9%, and 29.9% over the previous best at 0--50, 0--70, and 0--80m, while attaining the lowest single-frame latency (26.8ms). A further ablation shows that out-of-scan feature blending adds no accuracy on top of RMS, providing empirical validation that in-scan selection can replace out-of-scan fusion.

2605.11838 2026-05-13 cs.LG math.OC

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

Alexander Yukhimchuk, Mladen Kolar, Martin Takáč, Sayantan Choudhury

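AI summary This paper studies how to apply gradient clipping more effectively in modern neural network training and proposes a new method tailored to the matrix structure of parameters. The authors observe that data outliers mainly affect the leading singular values of gradient matrices, and therefore propose clipping based on singular values, stabilizing training by clamping singular values that exceed a threshold. The method generalizes classical vector-norm clipping, comes with a convergence analysis under heavy-tailed noise, and is implemented efficiently via randomized truncated SVD, making it suitable for large neural network layers.

English abstract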

Gradient clipping is a standard safeguard for training neural networks under noisy, heavy-tailed stochastic gradients; yet, most clipping rules treat all parameters as vectors and ignore the matrix structure of modern architectures. We show empirically that data outliers often amplify only a small number of leading singular values in layer-wise gradient matrices, while the rest of the spectrum remains largely unchanged. Motivated by this phenomenon, we propose spectral clipping, which stabilizes training by clamping singular values that exceed a threshold while preserving the singular directions. This framework generalizes classical gradient norm clipping and can be easily integrated into existing optimizers. We provide a convergence analysis for non-convex optimization with spectrally clipped SGD, yielding the optimal $\mathcal{O}\left(K^{\frac{2 - 2α}{3α- 2}}\right)$ rate for heavy-tailed noise. To minimize hyperparameter tuning, we introduce layer-wise adaptive thresholds based on moving averages or sliding-window quantiles of the top singular values. Finally, we develop efficient implementations that clip only the top $r$ singular values via randomized truncated SVD, avoiding full decompositions for large layers. We demonstrate competitive performance across synthetic heavy-tailed settings and neural network training tasks.
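The core operation described here, clamping singular values above a threshold while keeping the singular directions, can be sketched in a few lines of NumPy. This is a minimal illustration that uses a full SVD and a fixed threshold; the paper's randomized truncated SVD and adaptive layer-wise thresholds are not implemented, and the function name and example matrix are assumptions.

```python
import numpy as np


def spectral_clip(grad, threshold):
    """Clamp the singular values of a gradient matrix at `threshold`.

    The singular vectors (directions) are kept; only the spectrum is clipped.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    clipped_s = np.minimum(s, threshold)
    return (u * clipped_s) @ vt  # scales column i of u by clipped_s[i]


# Hypothetical layer gradient with one outlier-inflated singular value.
grad = np.array([[5.0, 0.0],
                 [0.0, 1.0]])
clipped = spectral_clip(grad, threshold=2.0)
print(clipped)
```

Unlike norm clipping, which would rescale the whole matrix and shrink the small singular value too, this leaves the component below the threshold unchanged and only tames the inflated one.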

2605.11836 2026-05-13 cs.LG cs.CL

More Edits, More Stable: Understanding the Lifelong Normalization in Sequential Model Editing

Xin Ma, Wei Chen, Qi Liu, Derong Xu, Zhi Zheng, Tong Xu, Enhong Chen

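AI summary This paper studies the key mechanism that keeps large language models stable during sequential model editing, identifying Lifelong Normalization (LN) as the core strategy and giving the first theoretical explanation of how it works. The study finds that LN, which normalizes gradients using running statistics, creates a self-reinforcing stability loop and, combined with ridge-regularized regression, effectively suppresses forgetting and systemic collapse. Building on these findings, the authors propose StableEdit, which further improves long-horizon editing stability through a warm-up stage and full whitening; experiments validate the theory.

English abstract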

Lifelong Model Editing aims to continuously update evolving facts in Large Language Models while preserving unrelated knowledge and general capabilities, yet it remains plagued by catastrophic forgetting and model collapse. Empirically, we find that recent editors resilient over long horizons share the same core strategy: Lifelong Normalization (LN), which normalizes value gradients using running statistics. Removing LN causes immediate performance collapse, and we observe a counter-intuitive positive cumulative effect where early edits can promote the success of future edits. Yet the mechanism of LN remains a "black box", leaving its precise role in lifelong stability poorly understood. In this work, we provide the first theoretical account of LN in the lifelong regime. Our analysis reveals a self-reinforcing stability loop and proves that, when combined with ridge-regularized regression, LN yields parameter updates with asymptotic orthogonality and bounded norms, directly mitigating forgetting and systemic collapse. Based on these insights, we derive StableEdit, which strengthens this stability loop via an explicit warm-up stage and full whitening, improving long-horizon stability at minimal overhead. Extensive experiments validate our theory and demonstrate competitive performance. Our code is available at https://github.com/MINE-USTC/StableEdit.

2605.10916 2026-05-13 cs.CV cs.AI

Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition

Md. Sultan Al Rayhan

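AI summary Recognizing handwritten Bangla compound characters is challenging because of complex character structures, large intra-class variation, and limited high-quality annotated data. This paper proposes a confidence-guided diffusion augmentation framework to improve recognition of low-resolution Bangla compound characters. The method combines class-conditional diffusion modeling with classifier guidance to generate high-quality synthetic samples, and introduces enhanced residual blocks and a confidence-based filtering mechanism to improve generation quality and retain only highly class-consistent samples. Experiments show performance gains across several mainstream models, with the best model reaching 89.2% classification accuracy on the AIBangla dataset, well above the existing benchmark.

English abstract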

Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In this work, we propose a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition. Our framework combines class-conditional diffusion modeling with classifier guidance to synthesize high-quality handwritten compound character samples. To further improve generation quality, we introduce Squeeze-and-Excitation enhanced residual blocks within the diffusion model's U-Net backbone. We additionally propose a confidence-based filtering mechanism where pre-trained classifiers act as quality gates to retain only highly class-consistent synthetic samples. The filtered synthetic images are fused with the original training data and used to retrain multiple classification architectures. Experiments conducted on the AIBangla compound character dataset demonstrate consistent performance improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. Our best-performing model achieves 89.2% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin. The results demonstrate that quality-aware diffusion augmentation can effectively enhance handwritten character recognition performance in low-resource script domains.

2605.10818 2026-05-13 cs.LG q-bio.NC

On periodic distributed representations using Fourier embeddings

Jakeb Chouinard

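AI summary This paper studies how to build periodic distributed representations with Fourier embeddings in order to better handle periodic signals such as angles. The author proposes high-dimensional, real-valued periodic embeddings that overcome the difficulty scalar angle representations have with nearby angles, and controls the shape of different kernel functions through dot-product similarity. The work focuses on using Spatial Semantic Pointers, a neurally plausible representation scheme, to formalize Dirichlet and periodic Gaussian kernels, offering a new approach to modeling periodic signals.

English abstract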

Periodic signals are critical for representing physical and perceptual phenomena. Scalar, real angular measures, e.g., radians and degrees, make it difficult to process and distinguish nearby angles, especially when their absolute difference exceeds $\pi$. We can avoid this problem by using real-valued, periodic embeddings in high-dimensional space. These representations also allow us to control the nature of their dot product similarities, allowing us to construct a variety of different kernel shapes. In this work, we aim to highlight how these representations can be constructed and focus on the formalization of Dirichlet and periodic Gaussian kernels using the neurally-plausible representation scheme of Spatial Semantic Pointers.
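A small example may help show why a real-valued periodic embedding sidesteps the wrap-around problem. The sketch below uses a generic Fourier feature map with integer frequencies, whose dot product is a normalized Dirichlet kernel of the angle difference. It is not the Spatial Semantic Pointer construction from the paper; the feature map, the number of frequencies, and the function names are assumptions.

```python
import math


def fourier_embedding(theta, num_frequencies=4):
    """Real-valued periodic embedding: [1, sqrt(2)cos(k*theta), sqrt(2)sin(k*theta), ...]."""
    features = [1.0]
    for k in range(1, num_frequencies + 1):
        features.append(math.sqrt(2.0) * math.cos(k * theta))
        features.append(math.sqrt(2.0) * math.sin(k * theta))
    return features


def similarity(a, b, num_frequencies=4):
    """Normalized dot product; equals D_K(a - b) / (2K + 1) for the Dirichlet kernel D_K."""
    ea = fourier_embedding(a, num_frequencies)
    eb = fourier_embedding(b, num_frequencies)
    return sum(x * y for x, y in zip(ea, eb)) / (2 * num_frequencies + 1)


near_wrap = similarity(math.radians(359.0), math.radians(1.0))
far_apart = similarity(math.radians(359.0), math.radians(180.0))
print(near_wrap, far_apart)
```

Because the dot product depends only on the angle difference through cosines, 359 degrees and 1 degree come out as highly similar even though their scalar values differ by 358. Reweighting the frequency components would change the kernel shape, for example towards a periodic Gaussian.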

2605.10684 2026-05-13 cs.LG cs.AI

Is Data Shapley Not Better than Random in Data Selection? Ask NASH

Xiao Tian, Jue Fan, Rachael Hwee Ling Sim, Zixuan Wang, Nancy F. Chen, Bryan Kian Hsiang Low

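AI summary This paper studies the problem of selecting high-quality subsets of training data and examines the effectiveness of Data Shapley and related methods for data selection. To address the inconsistent performance of Data Shapley in practice, the authors propose NASH, a framework that decomposes the target utility function into simpler Shapley-informative components and aggregates them non-linearly for data selection, substantially improving Shapley-based data selection at minimal additional computational cost.

Comments Accepted to the 43rd International Conference on Machine Learning (ICML-26) as a Spotlight paper

English abstract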

Data selection studies the problem of identifying high-quality subsets of training data. While some existing works have considered selecting the subset of data with top-$m$ Data Shapley or other semivalues as they account for the interaction among every subset of data, other works argue that Data Shapley can sometimes perform ineffectively in practice and select subsets that are no better than random. This raises the questions: (I) Are there certain "Shapley-informative" settings where Data Shapley consistently works well? (II) Can we strategically utilize these settings to select high-quality subsets consistently and efficiently? In this paper, we propose a novel data selection framework, NASH (Non-linear Aggregation of SHapley-informative components), which (I) decomposes the target utility function (e.g., validation accuracy) into simpler, Shapley-informative component functions, and selects data by optimizing an objective that (II) aggregates these components non-linearly. We demonstrate that NASH substantially boosts the effectiveness of Shapley/semivalue-based data selection with minimal additional runtime cost.
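For context on the baseline this abstract questions, the sketch below computes exact Data Shapley values for a tiny dataset by averaging each point's marginal contribution over all orderings, then selects the top-$m$ points. It illustrates plain top-$m$ Shapley selection only, not NASH; the toy utility (number of distinct labels covered) stands in for something like validation accuracy, and all names and data are hypothetical.

```python
from itertools import permutations


def shapley_values(num_points, utility):
    """Exact Shapley values by enumerating every ordering (feasible only for tiny n)."""
    totals = [0.0] * num_points
    count = 0
    for order in permutations(range(num_points)):
        coalition = []
        previous = utility(coalition)
        for i in order:
            coalition.append(i)
            current = utility(coalition)
            totals[i] += current - previous  # marginal contribution of point i
            previous = current
        count += 1
    return [t / count for t in totals]


def top_m(values, m):
    """Indices of the m points with the largest Shapley values."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:m]


labels = ["a", "a", "b"]  # hypothetical dataset: two redundant points, one unique


def coverage(indices):
    """Toy utility: how many distinct labels the subset covers."""
    return float(len({labels[i] for i in indices}))


values = shapley_values(len(labels), coverage)
print(values, top_m(values, 1))
```

The two redundant points split the credit for label "a" while the unique point keeps full credit for "b". Note that top-2 selection here must pick two points whose values are fixed individually, without regard to which subset they end up in, which hints at why ranking by Shapley value alone does not always yield the best subset.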

2605.10360 2026-05-13 cs.CV

DySurface: Consistent 4D Surface Reconstruction via Bridging Explicit Gaussians and Implicit Functions

Minje Kim, Younghyun Noh, Jaesoon Kim, Tae-Kyun Kim

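AI summary This paper proposes DySurface, a new framework for reconstructing temporally consistent 4D surfaces in dynamic scenes. The method combines explicit Gaussians with implicit signed distance functions (SDFs): it builds a dynamic sparse voxel grid that provides explicit geometric guidance to the implicit SDF field, which markedly improves surface reconstruction quality and yields more precise boundaries and finer detail. Experiments show that DySurface outperforms existing state-of-the-art methods in geometric accuracy while maintaining good rendering performance.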

English abstract

While novel view synthesis (NVS) for dynamic scenes has seen significant progress, reconstructing temporally consistent geometric surfaces remains a challenge. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) offer powerful dynamic scene rendering capabilities; however, relying solely on photometric optimization often leads to geometric ambiguities. This results in discontinuous surfaces, severe artifacts, and broken surfaces over time. To address these limitations, we present DySurface, a novel framework that bridges the effectiveness of explicit Gaussians with the geometric fidelity of implicit Signed Distance Functions (SDFs) in dynamic scenes. Our approach tackles the structural discrepancy between the forward deformation of 3DGS ($\text{canonical} \rightarrow \text{dynamic}$) and the backward deformation required for volumetric SDF rendering ($\text{dynamic} \rightarrow \text{canonical}$). Specifically, we propose the VoxGS-DSDF branch that leverages deformed Gaussians to construct a dynamic sparse voxel grid, providing explicit geometric guidance to the implicit SDF field. This explicit anchoring effectively regularizes the volumetric rendering process, significantly improving surface reconstruction quality, with watertight boundaries and detailed representations. Quantitative and qualitative experiments demonstrate that DySurface significantly outperforms state-of-the-art baselines in geometric accuracy metrics while maintaining competitive rendering performance.

2605.10288 2026-05-13 cs.LG math.OC

BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization

Hengrui Zhang, Boao Kong, Engao Zhang, Kun Yuan

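AI summary This paper proposes BROS, an efficient single-loop bilevel optimization method aimed at problems in deep learning such as hyperparameter learning and data reweighting. The method performs gradient updates in randomized subspaces and, combined with a Rademacher bi-probe correction, obtains an unbiased estimate of the Hessian operator, reducing memory consumption while keeping a convergence rate close to that of exact single-loop methods. Experiments show that, compared with existing methods, BROS reduces peak memory usage by up to 44.9% across several tasks while maintaining similar performance.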

English abstract

Stochastic bilevel optimization (SBO) has become a standard framework for hyperparameter learning, data reweighting, representation learning, and data-mixture optimization in deep learning. Existing exact single-loop SBO methods and memory-efficient surrogate SBO methods either create severe memory pressure for large lower-level neural networks or lack competitive convergence guarantees under standard assumptions. In this paper, we propose BROS, a memory-efficient single-loop SBO method with the same convergence rate order as exact single-loop SBO methods. BROS performs lower and auxiliary updates in randomized subspaces with a Rademacher bi-probe correction that recovers an unbiased Hessian-action estimator. We prove that BROS preserves the $\mathcal O(\varepsilon^{-2})$ sample complexity of MA-SOBA for finding an $\varepsilon$-stationary point under only standard assumptions. Experiments on hyper-data cleaning, data-mixture learning, hyper-representation learning, and ViT sample reweighting show that BROS reduces peak memory by up to 44.9% while closely matching full-space baseline performance.
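
For readers new to the setting, the generic bilevel problem and its hypergradient can be written as below. This is the standard textbook formulation, assuming a strongly convex lower-level objective; the symbols $f$, $g$, $x$, $y$ are not taken from the abstract, and the abstract does not spell out the form of the BROS estimator itself.

$$\min_{x} F(x) = f\bigl(x, y^*(x)\bigr) \quad \text{s.t.} \quad y^*(x) = \arg\min_{y} g(x, y)$$

$$\nabla F(x) = \nabla_x f(x, y^*) - \nabla^2_{xy} g(x, y^*) \bigl[\nabla^2_{yy} g(x, y^*)\bigr]^{-1} \nabla_y f(x, y^*)$$

The second term involves second-order information about the lower-level objective $g$, which is the part that becomes costly in memory when the lower-level model is a large neural network.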

2605.10235 2026-05-13 cs.CL

Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection

Yiwen Chen, Kuan Li, Fuzhen Zhuang, Deqing Wang, Zhao Zhang, Liwen Zhang, Yong Jiang, Shuai Wang, Minhao Cheng

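AI summary This paper studies how to choose effectively between retrieval-augmented generation (RAG) and long-context (LC) strategies for large language models (LLMs), and proposes Pre-Route, a proactive routing framework. Using lightweight metadata such as document type and length, the method performs structured reasoning and completes task analysis, coverage estimation, and information-need prediction before answering, producing explainable and cost-efficient routing decisions. Experiments show that Pre-Route outperforms existing methods on several benchmarks and achieves better overall cost-effectiveness.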

English abstract

Recent advances in large language models (LLMs) have expanded the context window to beyond 128K tokens, enabling long-document understanding and multi-source reasoning. A key challenge, however, lies in choosing between retrieval-augmented generation (RAG) and long-context (LC) strategies: RAG is efficient but constrained by retrieval quality, while LC supports global reasoning at higher cost and with position sensitivity. Existing methods such as Self-Route adopt failure-driven fallback from RAG to LC, but remain passive, inefficient, and hard to interpret. We propose Pre-Route, a proactive routing framework that performs structured reasoning before answering. Using lightweight metadata (e.g., document type, length, initial snippet), Pre-Route enables task analysis, coverage estimation, and information-need prediction, producing explainable and cost-efficient routing decisions. Our study shows three key findings: (i) LLMs possess latent routing ability that can be reliably elicited with guidelines, allowing single-sample performance to approach that of multi-sample (Best-of-N) results; (ii) linear probes reveal that structured prompts sharpen the separability of the "optimal routing dimension" in representation space; and (iii) distillation transfers this reasoning structure to smaller models for lightweight deployment. Experiments on LaRA (in-domain) and LongBench-v2 (OOD) confirm that Pre-Route outperforms Always-RAG, Always-LC, and Self-Route baselines, achieving superior overall cost-effectiveness.

2605.10094 2026-05-13 cs.RO cs.AI

Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

Jianchao Zhao, Huoren Yang, Yusong Hu, Yuyang Gao, Qiguan Ou, Cong Wan, SongLin Dong, Zhiheng Ma, Yihong Gong

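AI summary This paper studies how to improve the test-time reliability of frozen vision-language-action (VLA) models under persistent deployment. It proposes a test-time adaptation framework based on an online success memory: during deployment the system stores successful observation-action segments, and at inference it retrieves relevant action chunks, filters them by trajectory consistency, and aggregates them into a high-quality action prior. The method introduces confidence-adaptive prior guidance that injects the prior into the action generation process, giving lightweight adaptation without parameter updates. Experiments show that it markedly improves task success rate and closed-loop stability on long-horizon and multi-stage tasks.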

English abstract

Vision-Language-Action (VLA) models show strong potential for general-purpose robotic manipulation, yet their closed-loop reliability often degrades under local deployment conditions. Existing evaluations typically treat test episodes as independent zero-shot trials. However, real robots often operate repeatedly in the same or slowly changing environments, where successful executions provide environment-verified evidence of reliable behavior patterns. We study this persistent-deployment setting, asking whether a partially competent frozen VLA can improve its reliability by reusing its successful test-time experience. We propose an online success-memory guided test-time adaptation framework for generative VLAs. During deployment, the robot stores progress-calibrated successful observation-action segments in a long-term memory. At inference, it retrieves state-relevant action chunks, filters inconsistent candidates via trajectory-level consistency, and aggregates them into an elite action prior. To incorporate this prior into action generation, we introduce confidence-adaptive prior guidance, which injects the elite prior into an intermediate state of the flow-matching action sampler and adjusts the guidance strength based on retrieval confidence. This design allows the frozen VLA to exploit environment-specific successful experience while preserving observation-conditioned generative refinement. This retrieve-then-steer mechanism enables lightweight, non-parametric test-time adaptation without requiring parameter updates. Simulation and real-world experiments show improved task success and closed-loop stability, especially in long-horizon and multi-stage tasks.

2605.09965 2026-05-13 cs.CV

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

Kuan Zhang, Dongchen Liu, Qiyue Zhao, Tianyu Xin, Yue Su, Haisheng Wang, Han Yin, Hongbo Ma, Peize Li, Tianjun Gu, Xiangnan Wu, Xinran Zhang, Yongxuan Li, Zirong Chen, Yiming Li

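AI summary This study explores how foundation models can serve as generalist game players, so that AI can adapt and perform flexibly across a "game multiverse" made up of different rules, objectives, and physics. Starting from four interdependent pillars (datasets, models, harnesses, and evaluation benchmarks), it analyzes the full lifecycle of a generalist game player and identifies five fundamental trade-offs facing current systems. From this holistic view, the paper proposes a five-level roadmap that moves from single-game mastery to an ultimate creator stage in which the agent both creates and evolves within a theoretical game multiverse, offering systematic guidance toward artificial general intelligence (AGI).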

Comments 51 pages, 7 figures, github: https://github.com/THUSI-Lab/Awesome-LFMs-Play-Games

English abstract

The real world unfolds along a single set of physical laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence, the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where the agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within a theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field, and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.

2605.09780 2026-05-13 cs.AI

Attribution-based Explanations for Markov Decision Processes

Paul Kobialka, Andrea Pferscher, Francesco Leofante, Erika Ábrahám, Silvia Lizeth Tapia Tarifa, Einar Broch Johnsen

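AI summary This paper studies how to generate attribution-based explanations for Markov decision processes (MDPs) in order to clarify the behavioral logic of agents in sequential decision-making. The authors propose a formal framework for assigning importance scores to states and execution paths in an MDP, and use strategy synthesis techniques to compute these scores efficiently, overcoming the challenge of non-determinism in MDPs. Five case studies validate the approach and show its value for providing interpretable insights into decisions.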

English abstract

Attribution techniques explain the outcome of an AI model by assigning a numerical score to its inputs. So far, these techniques have mainly focused on attributing importance to static input features at a single point in time, and thus fail to generalize to sequential decision-making settings. This paper fills this gap by introducing techniques to generate attribution-based explanations for Markov Decision Processes (MDPs). We give a formal characterization of what attributions should represent in MDPs, focusing on explanations that assign importance scores to both individual states and execution paths. We show how importance scores can be computed by leveraging techniques for strategy synthesis, enabling the efficient computation of these scores despite the non-determinism inherent in an MDP. We evaluate our approach on five case-studies, demonstrating its utility in providing interpretable insights into the logic of sequential decision-making agents.

2605.09769 2026-05-13 cs.AI

UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification

Dima Galat, Marian-Andrei Rizoiu

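AI summary This paper presents a system for classifying psychological defense mechanisms in emotional support dialogues based on the Defense Mechanism Rating Scales (DMRS); it placed second among 64 teams (F1 0.406). The core idea is to define defense mechanisms by what is absent (missing affect, blocked cognition, denied reality) and to encode this through an affect-cognition integration spectrum in prompt-level clinical rules, which substantially improves classification performance. The system uses a multi-phase council of Gemini 2.5 agents in which class-specific advocates rate evidence strength instead of simply voting, achieving good results without fine-tuning; a targeted override strategy drawing on three fine-tuned Qwen3.5 models then improves performance further.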

English abstract

This paper describes our system for classifying psychological defense mechanisms in emotional support dialogues using the Defense Mechanism Rating Scales (DMRS), placing second (F1 0.406) among 64 teams. A central insight is that defense mechanisms are defined by what is absent: missing affect, blocked cognition, denied reality. We encode this as an affect-cognition integration spectrum in prompt-level clinical rules, which account for the largest single gain (+11.4pp F1). Our architecture is a multi-phase deliberative council of Gemini 2.5 agents where class-specific advocates rate evidence strength rather than voting, achieving F1 0.382 with no fine-tuning - a top-5 result on its own. We find, however, that the council is confidently wrong about minority classes: 59-80% of stable minority predictions are incorrect, driven by a systematic "L7 attractor" in which emotional content defaults to the majority class. A targeted override ensemble from three fine-tuned Qwen3.5 models applies 16 overrides (+2.4pp), selected by a structured multi-agent system (builder, critic, regression guard) that produced a larger F1 gain in one iteration than 8 prior attempts combined.

2605.09271 2026-05-13 cs.AI

Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding

Zhiqin Yang, Yuhan Liu, Jingwen Fu, Pei Fu, Bo Han, Masashi Sugiyama, Nanning Zheng

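AI summary Although natural language is the default input medium for large language models (LLMs), the limits of its expressive capacity create a bottleneck in complex problem-solving. This paper argues that shaping knowledge frameworks (schemas) through advanced language representation is the next key direction for expanding LLM intelligence, and that the structural and symbolic sophistication of a language representation strongly affects how a model activates and organizes its knowledge. Through theoretical exposition and experimental validation, the study shows that carefully designed language representations can markedly improve model performance without changing model parameters or scale, offering new ideas and directions for future research.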

Comments 41 pages, 30 figures

English abstract

Although natural language is the default medium for Large Language Models (LLMs), its limited expressive capacity creates a profound bottleneck for complex problem-solving. While recent advancements in AI have relied heavily on scaling, merely internalizing knowledge does not guarantee its effective application. Defining language representation as the linguistic and symbolic constructs used to map and model the real world, this paper argues that shaping schemas through advanced language representation is the next frontier for expanding LLM intelligence. We posit that an LLM's knowledge activation and organization -- its schema -- depends heavily on the structural and symbolic sophistication of the language used to represent a given task. This paper contributes both a formalization of this claim and the empirical evidence to support it. With a new formalization, we present multiple lines of evidence to support our position: Firstly, we review recent empirical practices and emerging methodologies that demonstrate the substantial performance gains achievable through deliberate language representation design, even without modifying model parameters or scale. Secondly, we conduct controlled experiments showing that LLM performance and its internal feature activations vary under different language representations of the same underlying task. Together, these findings highlight language representation design as a promising direction for future research.

2605.09266 2026-05-13 cs.AI

SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

Kun Xiang, Terry Jingchen Zhang, Zirong Liu, Bokai Zhou, Yueling Tang, Junjie Yu, Jiacong Lu, Shangrui Huang, Heng Li, Likui Zhang, Kunkun Liu, Changzheng Zhang, Yangle Fang, Boqiang Guo, Hui-Ling Zhen, Dandan Tu, Yinya Huang, Xiaodan Liang

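AI summary This paper presents SeePhys Pro, a fine-grained benchmark for studying whether multimodal models preserve the same reasoning capability as information is progressively transferred from text to image. The benchmark contains four semantically aligned variants of each problem with progressively more visual elements. Experiments show that current frontier models lose performance as information moves from language to diagrams, with the grounding of visual variables as the key bottleneck. The study further analyzes the sources of model improvement using blind training and related controls, finds that part of the gains may come from residual textual cues and not from genuine visual evidence, and stresses that evaluation of multimodal reasoning should attend to robustness under modality transfer and to reliance on task-critical visual evidence.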

English abstract

We introduce SeePhys Pro, a fine-grained modality transfer benchmark that studies whether models preserve the same reasoning capability when critical information is progressively transferred from text to image. Unlike standard vision-essential benchmarks that evaluate a single input form, SeePhys Pro features four semantically aligned variants for each problem with progressively increasing visual elements. Our evaluation shows that current frontier models are far from representation-invariant reasoners: performance degrades on average as information moves from language to diagrams, with visual variable grounding as the most critical bottleneck. Motivated by this inference-time fragility, we further develop large training corpora for multimodal RLVR and use blind training as a diagnostic control, finding that RL with all training images masked can still improve performance on unmasked validation sets. To analyze this effect, text-deletion, image-mask-rate, and format-saturation controls suggest that such gains can arise from residual textual and distributional cues rather than valid visual evidence. Our results highlight the need to evaluate multimodal reasoning not only by final-answer accuracy, but also by robustness under modality transfer and by diagnostics that test whether improvements rely on task-critical visual evidence.

2605.09236 2026-05-13 cs.CL cs.AI cs.CY cs.DL cs.IR

Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke

Yu Wu, Ananth Mahadevan, Filip Ginter, Michael Mathioudakis, Mikko Tolonen

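AI summary By studying the dissemination of John Locke's ideas in the 18th century, this paper evaluates how effective semantic search is for analyzing the transmission of ideas in historical corpora. Using expert annotation based on a semantic taxonomy, the study tests whether off-the-shelf semantic search methods can uncover implicit references that traditional methods based on lexical reuse miss. The results show that semantic search retrieves more implicit intellectual influence, but they also reveal that surface lexical overlap constrains retrieval results, highlighting both the potential and the limitations of semantic retrieval for analyzing historical corpora.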

Comments Accepted by NLP4DH 2026

English abstract

While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.

2605.09127 2026-05-13 cs.RO

IMPACT: An Implicit Active-Set Augmented Lagrangian for Fast Contact-Implicit Trajectory Optimization

Jiayun Li, Dejian Gong, Georgia Chalvatzaki

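AI summary IMPACT is an implicit augmented-Lagrangian method for contact-implicit trajectory optimization (CITO), designed to efficiently solve mathematical programs with complementarity constraints. The method dynamically identifies contact-mode branches during trajectory optimization, improving solution efficiency and stability. Experiments show that IMPACT significantly outperforms existing methods on several benchmarks and achieves high-quality control of contact-rich tasks on a real robotic system.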

Comments Accepted to Robotics: Science and Systems (RSS), 2026

English abstract

Contact-implicit trajectory optimization (CITO) has attracted growing attention as a unified framework for planning and control in contact-rich robotic tasks. Recent approaches have demonstrated promising results in manipulation and locomotion without requiring a prescribed contact-mode schedule. It is well known that the underlying mathematical programs with complementarity constraints (MPCCs) remain numerically ill-conditioned, and systematic, scalable solution strategies for CITO remain an active area of research. More efficient and principled solvers that can handle contact constraints are therefore essential to broaden the applicability of CITO. In this work, we develop an augmented-Lagrangian approach to CITO for solving MPCC-based CITO with stationarity guarantees. The method can be interpreted as identifying the implicit contact-mode branches on the fly during the trajectory optimization (TO) iterations; we call this approach IMPACT (IMPlicit contact ACtive-set Trajectory optimization). We provide an efficient C++ implementation tailored to trajectory-optimization workloads and evaluate it on the open-source CITO and contact-implicit model predictive control (CI-MPC) benchmarks. On CITO, IMPACT achieves 2.9x-70x speedups over strong baselines (geometric mean 13.8x). On CI-MPC, we show improved control quality for contact-rich trajectories on dexterous manipulation tasks in simulation. Finally, we demonstrate the proposed method on real robotic hardware on a T-shaped object pushing task.
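
As background for the MPCC terminology, a generic mathematical program with complementarity constraints can be written as follows. This is the standard general form, not the specific formulation used in the paper; the symbols $z$, $f$, $h$, $G$, $H$, $\phi$, and $\lambda$ are introduced here only for illustration.

$$\min_{z} f(z) \quad \text{s.t.} \quad h(z) = 0, \quad 0 \le G(z) \perp H(z) \ge 0$$

Here $\perp$ means $G_i(z)\,H_i(z) = 0$ for every component $i$. In a frictionless contact model, a typical instance pairs the gap distance $\phi(q)$ with the normal contact force $\lambda$:

$$0 \le \phi(q) \perp \lambda \ge 0$$

so a force can act only when the gap is closed. Each component being either "in contact" or "separated" is what gives rise to the contact-mode branches mentioned in the abstract.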