arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10166 2026-05-12 cs.RO

Data-Asymmetric Latent Imagination and Reranking for 3D Robotic Imitation Learning

Lianghao Luo, Xizhou Bu, Ruyan Liu, Qingqiu Huang, Chufeng Tang, Xiaoshuai Hao, Hongbo Wang, Wei Li

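AI summary: This paper studies 3D robotic imitation learning from trajectories of uneven quality and proposes DALI-R, a data-asymmetric latent imagination and reranking framework. The method learns a latent world model over 3D point clouds for imagined rollouts and combines it with a task completion scorer that reranks candidate action chunks, improving decision-making without additional high-quality demonstrations. Experiments show that DALI-R raises task success rates on multiple benchmarks while keeping inference overhead low.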

Abstract

Robotic imitation learning typically assumes access to optimal demonstrations, yet real-world data collection often yields suboptimal, exploratory, or even failed trajectories. Discarding such data wastes valuable information about environment dynamics and failure modes, which can instead be leveraged to improve decision-making. While 3D policies reduce reliance on high-quality demonstrations through strong spatial generalization, they still require large-scale data to achieve high task success. To address this, we propose DALI-R, a Data-Asymmetric Latent Imagination and Reranking framework for 3D robotic imitation learning from mixed-quality trajectories. It learns a Latent World Model over 3D point clouds for imagined rollouts and a Task Completion Scorer that reranks candidate action chunks, improving decision-making without additional high-quality demonstrations. We instantiate DALI-R with both diffusion and efficient flow-matching policies and evaluate it on the Adroit and MetaWorld benchmarks. Across the two evaluated 3D base policies, DALI-R achieves an average 6.8% improvement in success rate while incurring less than 0.7× additional inference overhead.
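
The reranking step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `rerank_action_chunks`, the one-dimensional latent state, and the toy world model and scorer are all hypothetical stand-ins chosen so the example runs on its own.

```python
def rerank_action_chunks(latent, candidates, world_model, scorer):
    """Return the candidate action chunk whose imagined rollout scores highest."""
    best_chunk, best_score = None, float("-inf")
    for chunk in candidates:
        z = latent
        for action in chunk:
            # Imagined rollout: step the latent state without touching the real robot.
            z = world_model(z, action)
        score = scorer(z)  # estimated task completion of the imagined end state
        if score > best_score:
            best_chunk, best_score = chunk, score
    return best_chunk, best_score


# Toy stand-ins (hypothetical): additive 1-D dynamics with the goal at z = 1.0.
def toy_world_model(z, action):
    return z + action


def toy_scorer(z):
    return -abs(z - 1.0)


candidates = [[0.1, 0.1], [0.5, 0.5], [1.0, 1.0]]
best, score = rerank_action_chunks(0.0, candidates, toy_world_model, toy_scorer)
print(best)  # -> [0.5, 0.5]
```

In this sketch the base policy is only responsible for proposing `candidates`; the world model and scorer decide which proposal is executed, so lower-quality trajectories could still be useful for training those two components.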

2605.10164 2026-05-12 cs.LG stat.ML

Hyperparameter Transfer for Dense Associative Memories

Roi Holtzman, Dmitry Krotov, Boris Hanin

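AI summary: This paper studies how to apply hyperparameter transfer to Dense Associative Memory (DenseAM) models, which are neural networks performing temporal dynamics on an energy landscape and which share weights both within and across layers. Because DenseAMs use rapidly peaking activation functions that are rare in conventional feed-forward networks, existing hyperparameter transfer methods are difficult to apply directly. The paper proposes hyperparameter transfer methods for DenseAMs, derives explicit rules for transferring hyperparameters from small models to large ones, and verifies experimentally that the theoretical analysis agrees with empirical results.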

Abstract

Dense Associative Memory (DenseAM) is a promising family of AI architectures that is represented by a neural network performing temporal dynamics on an energy landscape. While hyperparameter transfer methods are well-studied for feed-forward networks, these methods have not been developed for settings in which weights are shared across layers and within the layer, which is common in DenseAMs. Additionally, DenseAMs utilize rapidly peaking activation functions that are rarely used in feed-forward architectures. The confluence of these aspects makes DenseAM a challenging framework for using existing methods for hyperparameter transfer. Our work initiates the development of hyperparameter transfer methods for this class of models. We derive explicit prescriptions for how the hyperparameters tuned on small models can be transferred to models trained at scale. We demonstrate excellent agreement between these theoretical findings and empirical results.

2605.10162 2026-05-12 cs.CV

Active-SAOOD: Active Sparsely Annotated Oriented Object Detection in Remote Sensing Images

Yu Lin, Jianghang Lin, Kai Ye, Shengchuan Zhang, Liujuan Cao

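AI summary: This paper proposes Active-SAOOD, an active-learning-based method for sparsely annotated oriented object detection in remote sensing images, aiming to reduce annotation cost. Using a model state observation module, the method jointly considers orientation, classification, and localization uncertainty together with inter- and intra-class diversity at the instance level, and actively selects the sparse samples most valuable to the current model, enabling stable detection under completely randomly initialized sparse annotations. Experiments show that Active-SAOOD significantly improves the performance and stability of existing sparse-annotation methods on multiple datasets, with a 9% gain at an annotation ratio of only 1%, further enhancing its practical value in remote sensing.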

Abstract

Reducing the annotation cost of oriented object detection in remote sensing remains a major challenge. Recently, sparse annotation has gained attention for effectively reducing annotation redundancy in dense remote sensing scenes. However, (1) the reliance of sparse data on class-dependent sampling and (2) the lack of in-depth investigation into the characteristics of sparse samples hinder its further development. This paper proposes an active learning-based sparsely annotated oriented object detection (SAOOD) method, termed Active-SAOOD. Based on a model state observation module, Active-SAOOD actively selects the most valuable sparse samples at the instance level that are best suited to the current model state, by jointly considering orientation, classification, and localization uncertainty, as well as inter- and intra-class diversity. This design enables SAOOD to operate stably under completely randomly initialized sparse annotations and extends its applicability to broader real-world scenarios. Experiments on multiple datasets demonstrate that Active-SAOOD significantly improves both the performance and stability of existing SAOOD methods under various random sparse annotations. In particular, with an annotation ratio of only 1%, it achieves a 9% performance gain over the baseline, further enhancing the practical value of SAOOD in remote sensing. The code will be made public.

2605.10161 2026-05-12 cs.LG

OUIDecay: Adaptive Layer-wise Weight Decay for CNNs Using Online Activation Patterns

Alberto Fernández-Hernández, Jose I. Mestre, Cristian Pérez-Corral, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí

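AI summary: This paper proposes OUIDecay, an adaptive layer-wise weight decay method for training convolutional neural networks. Based on the Overfitting-Underfitting Indicator (OUI) computed from activation patterns, the method dynamically adjusts each layer's weight decay coefficient, requires no validation data, and is lightweight enough for online use. Experiments show that OUIDecay outperforms fixed decay and gradient-based adaptive methods across several datasets and architectures, improving generalization.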

Abstract

Weight decay remains one of the most widely used regularization mechanisms for training convolutional neural networks, yet it is still commonly applied as a fixed coefficient shared by all layers throughout training. This uniform treatment ignores that different layers may follow different structural dynamics and therefore may require different regularization strengths. In this work, we propose OUIDecay, an adaptive layer-wise and time-dependent weight decay scheduler for CNNs driven by the Overfitting-Underfitting Indicator (OUI), an activation-based metric previously shown to provide early information about regularization quality. OUIDecay uses a lightweight batch-based formulation of OUI to monitor the structural behavior of each layer online and periodically rescales its weight decay relative to the other layers in the network. Unlike gradient-based adaptive decay methods, our approach relies on functional information extracted from activation patterns and does not require validation data. Experiments on EfficientNet-B0 with Stanford Cars, ResNet50 with Food101, DenseNet121 with CIFAR100, and MobileNetV2 with CIFAR10 show that OUIDecay achieves the best mean best-validation-loss in 7 out of 8 evaluated settings. These results indicate that activation-driven weight decay adaptation is a practical and effective alternative to fixed decay and gradient-based adaptive decay, while keeping the method lightweight and suitable for online use.

2605.10159 2026-05-12 cs.LG cs.NA math.NA physics.comp-ph

jNO: A JAX Library for Neural Operator and Foundation Model Training

Leon Armbruster, Rathan Ramesh, Georg Kruse, Christopher Straub

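AI summary: jNO is a JAX-based library for training neural operators and foundation models, with unified support for data-driven and physics-informed training. Its core design is a tracing system that lets users write domains, model calls, residuals, supervised losses, and diagnostics in one symbolic language and compiles them into a single optimization pipeline, so users can switch between tasks without restructuring code. jNO also supports multi-model composition, fine-grained parameter-level control, hyperparameter tuning, and JAX-native workflows for PDE foundation-model families.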

Abstract

jNO (jax Neural Operators) is a JAX-native library for neural operators and foundation models with unified support for both data-driven and physics-informed training. Its core design is a tracing system in which domains, model calls, residuals, supervised losses, and diagnostics are written in one symbolic language and compiled into one optimization pipeline. This allows users to move between operator regression, mesh-aware residual evaluation, and PDE-constrained training without restructuring the surrounding code. jNO also supports multi-model compositions, fine-grained control at parameter level (model, optimizer, and learning rate), hyperparameter tuning, and JAX-native workflows for translated PDE foundation-model families. The source repository is available at https://github.com/FhG-IISB/jNO.

2605.10158 2026-05-12 cs.LG

Unsupervised Process Reward Models

Artyom Gadetsky, Maxim Kodryan, Siba Smarak Panigrahi, Hang Guo, Maria Brbic

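AI summary: This paper proposes an unsupervised process reward model (uPRM) that requires no human supervision, for guiding the reasoning of large language models. The method defines a scoring function from LLM next-token probabilities that jointly assesses the position of the first erroneous step across multiple reasoning trajectories, enabling evaluation and steering of the reasoning process. Experiments show that uPRM performs well at identifying erroneous steps, as a verifier for test-time scaling, and as a reward signal for reinforcement learning, offering a new route to scalable reward modeling for complex reasoning tasks.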

Comments preprint

Abstract

Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over the LLM-as-a-Judge in identifying first erroneous steps on the ProcessBench dataset; (ii) as a verifier for test-time scaling, uPRM performs comparably to supervised PRMs and outperforms the majority voting baseline by up to 6.9%, and (iii) when used as a reward signal in reinforcement learning, uPRM enables more robust policy optimization throughout training compared to a supervised PRM trained using ground-truth labels. Overall, our results open a path toward scalable reward modeling for complex reasoning tasks.

2605.10155 2026-05-12 cs.CL

NyayaAI: An AI-Powered Legal Assistant Using Multi-Agent Architecture and Retrieval-Augmented Generation

Deepanshu, Divi Saxena, Deepali Rana, Ayesha Varshney, Sahinur Rahman Laskar

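AI summary: This paper presents NyayaAI, an AI legal assistant built on a multi-agent architecture and retrieval-augmented generation. It targets Indian legal information, which is hard to access because of complex language and large document volumes. The system combines large language models with a retrieval-augmented generation pipeline built on an Indian legal knowledge base, coordinates multiple agents for legal research, document summarization, case retrieval, and drafting, and includes a compliance module that validates outputs. The reported results are 70% domain-classification precision, 74% retrieval precision, and 72% overall response accuracy, which the authors take as evidence that structured multi-agent LLM systems can improve legal accessibility and workflow efficiency.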

Comments 3 pages, 1 figure

Abstract

Legal information in India remains largely inaccessible due to the complexity of legal language and the sheer volume of legal documentation involved in research and case analysis. This paper presents NyayaAI, an AI-powered legal assistant that automates and simplifies legal workflows for lawyers, law students, and general users. The system combines Large Language Models with a Retrieval-Augmented Generation pipeline grounded in a curated Indian legal knowledge base comprising constitutional provisions, statutes, case laws, and judicial precedents. A multi-agent architecture orchestrated through the Mastra TypeScript framework coordinates a main agent with specialized sub-agents handling legal research, document summarization, case law retrieval, and drafting assistance. A compliance module validates all responses before delivery. Domain classification achieved 70% precision across test samples, with RAG retrieval precision at 74% and overall response accuracy at 72%, demonstrating that structured multi-agent LLM systems can meaningfully improve legal accessibility and workflow efficiency. The code is publicly available at https://github.com/B97784/NyayaAI for the benefit of the research community.

2605.10154 2026-05-12 cs.LG

Stable Long-Horizon PDE Forecasting via Latent Structured Spectral Propagators

Xiaoxiao Lu, Ye Yuan, Jiahao Shi

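AI summary: This paper addresses stable long-horizon forecasting of partial differential equations (PDEs) and proposes a neural forecasting framework based on a latent Structured Spectral Propagator (SSP). By reformulating PDE evolution as structured spectral propagation in a propagation-oriented latent space, the method separates dynamic evolution from spatial detail, improving stability and accuracy. Experiments show that SSP significantly outperforms existing methods on long-horizon forecasting, substantially reducing prediction error and improving the stability of temporal extrapolation.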

Abstract

Long-horizon forecasting of time-dependent partial differential equations (PDEs) is critical for characterizing the sustained evolution of physical systems. While neural operators have emerged as efficient surrogates, they typically learn implicit finite-time transitions from discrete observations. When deployed autoregressively, such propagators often suffer from rapid error accumulation and dynamic drift. To address this, we propose a neural forecasting framework that reformulates PDE rollout as learning a Structured Spectral Propagator (SSP) in a propagation-oriented latent space. Following an analysis-propagation-synthesis design, our framework: (i) maps physical states into a shared, time-consistent spatial representation; (ii) projects this space into a compact propagation state to isolate recurrent dynamics from fine-grained spatial details, thereby decoupling reconstruction fidelity from rollout regularity; and (iii) evolves retained spectral modes using a frequency-conditioned linear backbone complemented by a nonlinear spectral closure to account for truncated interactions. This explicit structuring endows the propagator with a strong inductive bias for coherent modal evolution. Extensive experiments demonstrate that SSP significantly outperforms state-of-the-art baselines, reducing relative $L_2$ errors by up to 48.9% and exhibiting improved stability in temporal extrapolation beyond the supervised horizon.

2605.10153 2026-05-12 cs.SD cs.LG

APEX: Audio Prototype EXplanations for Classification Tasks

Piotr Kawa, Kornel Howil, Piotr Borycki, Miłosz Adamczyk, Przemysław Spurek, Piotr Syga

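AI summary: This paper proposes APEX, an explanation framework for audio classification that addresses the lack of mature explainable-AI methods in the audio domain. The method works on pre-trained audio classifiers and, without fine-tuning, produces explanations that leave the original model's outputs unchanged. APEX decomposes explanations into four perspectives (square-based, time-based, frequency-based, and time-frequency-based), giving intuitive explanations that respect acoustic properties and make classification results easier to interpret.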

Abstract

Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. While prototype reasoning is promising, acoustic similarity remains multidimensional. We introduce APEX (Audio Prototype EXplanations), a post-hoc framework for interpreting pre-trained audio classifiers. Crucially, APEX requires no fine-tuning of the original backbone and strictly preserves output invariance. APEX disentangles explanations into four perspectives: Square-based prototypes to localize transient events, Time-based for temporal patterns, Frequency-based highlighting spectral bands, and Time-Frequency-based integrating both. This yields intuitive, example-based explanations that respect acoustic properties, providing greater semantic clarity than standard gradient-based methods.

2605.10151 2026-05-12 cs.LG cs.SY eess.SY math.OC

Learning to Sparsify Stochastic Linear Bandits

Zhengmiao Wang, Ming Chi, Zhi-Wei Liu, Lintao Ye, Carla Fabiana Chiasserini

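AI summary: This paper studies stochastic linear bandits with a sparsity constraint in high-dimensional spaces, aiming to select sparse actions while minimizing cumulative regret. The authors propose an adaptively phased exploration-and-exploitation framework that uses ordinary least squares for parameter learning and specialized subroutines for sparse action selection. For Euclidean-ball action sets, the algorithm computes optimal sparse actions efficiently and attains a $\tilde{\mathcal{O}}(d\sqrt{T})$ regret bound; for general convex and compact action sets, it uses a greedy subroutine and gives regret upper bounds for the different cases. Experiments validate the algorithms, including in a recommendation-system application.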

Comments Include all the omitted details and proofs from the conference paper accepted to IJCAI 2026

Abstract

This paper addresses the problem of learning to sparsify stochastic linear bandits, where a decision-maker sequentially selects actions from a high-dimensional space subject to a sparsity constraint on the number of nonzero elements in the action vector. The key challenge lies in minimizing cumulative regret while tackling the potential NP-hardness of finding optimal sparse actions due to the inherent combinatorial structure of the problem. We propose an adaptively phased exploration and exploitation algorithmic framework, utilizing ordinary least squares for parameter learning and specialized subroutines for sparse action selection. When the action set is a Euclidean ball, optimal sparse actions can be efficiently computed, enabling us to establish a $\tilde{\mathcal{O}}(d\sqrt{T})$ regret, where $d$ is the dimension of the action vector and $T$ is the time horizon length. For general convex and compact action sets where finding optimal sparse actions is intractable, we employ a greedy subroutine. For general strongly convex action sets, we derive a $\tilde{\mathcal{O}}(d \sqrt{T})$ $α$-regret; for general compact sets lacking strong convexity, we establish a $\tilde{\mathcal{O}}(d T^{2/3})$ $α$-regret, where $α$ pertains to the approximation ratio of the greedy algorithm. Finally, we validate the performance of our algorithms using extensive experiments, including an application to a recommendation system.
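
To see why the Euclidean-ball case is tractable, consider a minimal sketch. It assumes the expected reward is the inner product of a parameter estimate with the action and that the goal is to maximise it. Under that assumption the best action with at most `s` nonzeros keeps the `s` largest-magnitude coordinates of the estimate and rescales them to the ball's radius. The function name `best_sparse_action` and the example vector are hypothetical and not taken from the paper.

```python
import math


def best_sparse_action(theta, s, radius=1.0):
    """Maximise <theta, a> over ||a||_2 <= radius with at most s nonzero entries."""
    # Support: indices of the s largest-magnitude coordinates of theta.
    top = sorted(range(len(theta)), key=lambda i: abs(theta[i]), reverse=True)[:s]
    norm = math.sqrt(sum(theta[i] ** 2 for i in top))
    action = [0.0] * len(theta)
    if norm == 0.0:
        return action  # every action earns zero reward; return the origin
    for i in top:
        # On the chosen support, the maximiser is theta rescaled to the boundary.
        action[i] = radius * theta[i] / norm
    return action


theta_hat = [3.0, -4.0, 1.0, 0.5]  # e.g. a least-squares estimate (made-up values)
a = best_sparse_action(theta_hat, s=2)
print(a)  # first two entries are 3/5 and -4/5, the rest are zero
```

The whole computation is a sort plus a normalisation, so no combinatorial search over supports is needed in this special case. For general action sets no such closed form is available, which is where the abstract's greedy subroutine comes in.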

2605.10149 2026-05-12 cs.CV

Improving Temporal Action Segmentation via Constraint-Aware Decoding

Yeo Keat Ee, Debaditya Roy, Chen Li, Hao Zhang, Basura Fernando

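AI summary: This paper studies how structural prior constraints can improve temporal action segmentation. The authors propose a lightweight constraint-aware decoding framework that integrates statistical structural priors such as transition confidence, action boundary sets, and per-class duration, refining predictions at inference time without adding model complexity. The method improves both fully supervised and semi-supervised segmentation models, particularly when labeled data is limited or the domain is new.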

Comments accepted to ICPR 2026

Abstract

Temporal action segmentation (TAS) divides untrimmed videos into labeled action segments. While fully supervised methods have advanced the field, challenges such as action variability, ambiguous boundaries, and high annotation costs remain, especially in new or low-resource domains. Grammar-based approaches improve segmentation with structural priors but rely on complex parsing, which limits scalability. In this work, we propose a lightweight, constraint-based refinement framework that enhances TAS predictions by integrating statistical structural priors, such as transition confidence, action boundary sets, and per-class duration, that can be directly extracted from annotated data. These constraints are integrated into a modified Viterbi decoding algorithm, allowing inference-time refinement without retraining or added model complexity. Our approach improves both fully and semi-supervised TAS models by correcting structural prediction errors while maintaining high efficiency. Code is available at https://github.com/LUNAProject22/CAD.
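
As a rough illustration of inference-time refinement with a transition prior, here is a plain Viterbi decoder over frame-wise class scores. It is a sketch only: it covers the transition constraint and omits the boundary-set and duration priors that the paper's modified algorithm integrates. The function name, the two-class setup, and the probabilities are made up for the example.

```python
import math


def viterbi(frame_logp, trans_logp):
    """Decode the best label sequence.

    frame_logp[t][c]: log-score of class c at frame t (from the base model).
    trans_logp[p][c]: log-score of moving from class p to class c (the prior).
    """
    num_frames, num_classes = len(frame_logp), len(frame_logp[0])
    score = list(frame_logp[0])
    back = []
    for t in range(1, num_frames):
        new_score, ptr = [], []
        for c in range(num_classes):
            best_p = max(range(num_classes), key=lambda p: score[p] + trans_logp[p][c])
            new_score.append(score[best_p] + trans_logp[best_p][c] + frame_logp[t][c])
            ptr.append(best_p)
        score = new_score
        back.append(ptr)
    # Backtrack from the best final class.
    path = [max(range(num_classes), key=lambda c: score[c])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Toy case: action 0 must precede action 1, so the transition 1 -> 0 is forbidden.
p_class0 = [0.9, 0.9, 0.4, 0.9, 0.1, 0.1]
frame_logp = [[math.log(p), math.log(1.0 - p)] for p in p_class0]
NEG_INF = float("-inf")
trans_logp = [[0.0, 0.0], [NEG_INF, 0.0]]

raw = [max(range(2), key=lambda c: f[c]) for f in frame_logp]
decoded = viterbi(frame_logp, trans_logp)
print(raw)      # -> [0, 0, 1, 0, 1, 1]  (frame-wise argmax, contains a forbidden 1 -> 0)
print(decoded)  # -> [0, 0, 0, 0, 1, 1]
```

The base model's scores are left untouched; only the decoding step changes, which is why this style of refinement needs no retraining.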

2605.10148 2026-05-12 cs.CV

MicroViTv2: Beyond the FLOPS for Edge Energy-Friendly Vision Transformers

Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, Jun-Wei Hsieh

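AI summary: This paper proposes MicroViTv2, a lightweight vision Transformer designed to improve energy efficiency on edge devices. By introducing a reparameterized design, comprising Reparameterized Patch Embedding (RepEmbed) and a Reparameterized Depth-Wise convolution mixer (RepDW), together with a Single Depth-Wise Transposed Attention (SDTA) module, the model achieves higher accuracy while keeping inference fast. Experiments show strong energy efficiency on hardware such as Jetson AGX Orin, supporting the case for evaluating efficiency beyond FLOPs.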

Abstract

The Vision Transformer (ViT) achieves remarkable accuracy across visual tasks but remains computationally expensive for edge deployment. This paper presents MicroViTv2, a lightweight Vision Transformer optimized for real-device efficiency. Built upon the original MicroViT, the proposed model adopts a reparameterized design, specifically Reparameterized Patch Embedding (RepEmbed) and a Reparameterized Depth-Wise convolution mixer (RepDW) for faster inference, and introduces Single Depth-Wise Transposed Attention (SDTA) to capture long-range dependencies with minimal redundancy. Despite slightly higher FLOPs, MicroViTv2 improves accuracy by up to 0.5% over its predecessor and surpasses MobileViTv2, EdgeNeXt, and EfficientViT while maintaining fast inference and high energy efficiency on Jetson AGX Orin. Experiments on ImageNet-1K and COCO demonstrate that hardware-aware design and structural re-parameterization are key to achieving high accuracy and low energy consumption, validating the need to evaluate efficiency beyond FLOPs. Code is available at https://github.com/novendrastywn/MicroViT.

2605.10146 2026-05-12 cs.AI cs.CR

Benchmarking Safety Risks of Knowledge-Intensive Reasoning under Malicious Knowledge Editing

Qinghua Mao, Xi Lin, Jinze Gu, Jun Wu, Siyuan Li, Yuliang Chen

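AI summary: This paper studies safety risks in knowledge-intensive reasoning under malicious knowledge editing. To fill the gap in safety evaluation left by existing benchmarks, the authors propose EditRisk-Bench, which integrates diverse malicious scenarios and complex reasoning tasks to systematically evaluate how malicious knowledge affects reasoning behavior and reliability. Experiments show that malicious knowledge editing can induce incorrect or unsafe reasoning without noticeably affecting overall model capability, revealing how covert and complex these risks are.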

Abstract

Large language models (LLMs) increasingly rely on knowledge editing to support knowledge-intensive reasoning, but this flexibility also introduces critical safety risks: adversaries can inject malicious or misleading knowledge that corrupts downstream reasoning and leads to harmful outcomes. Existing knowledge editing benchmarks primarily focus on editing efficacy and lack a unified framework for systematically evaluating the safety implications of edited knowledge on reasoning behavior. To address this gap, we present EditRisk-Bench, a benchmark for systematically evaluating safety risks of knowledge-intensive reasoning under malicious knowledge editing. Unlike prior benchmarks that mainly emphasize edit success, generalization, and locality, EditRisk-Bench focuses on how injected knowledge affects downstream reasoning behavior and reliability. It integrates diverse malicious scenarios, including misinformation, bias, and safety violations, together with multi-level knowledge-intensive reasoning tasks and representative editing strategies within a unified evaluation framework measuring attack effectiveness, reasoning correctness, and side effects. Extensive experiments on both open-source and closed-source LLMs show that malicious knowledge editing can reliably induce incorrect or unsafe reasoning while largely preserving general capabilities, making such risks difficult to detect. We further identify several key factors influencing these risks, including edit scale, knowledge characteristics, and reasoning complexity. EditRisk-Bench provides an extensible testbed for understanding and mitigating safety risks in knowledge editing for LLMs.

2605.10142 2026-05-12 cs.CV cs.AI

Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality

Mateusz Cedro, Marcin Chlebus

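AI summary: This paper examines whether scaling up vision models improves the quality of localisation-based explanations. Evaluating ResNet, DenseNet, and Vision Transformer models of varying depth and complexity on several image datasets with five post-hoc explanation methods, it finds that larger models do not improve explanation quality in most cases, and that smaller models often perform comparably or better. The study also notes that pretraining improves predictive performance but does not consistently improve localisation, indicating that explainability should be assessed explicitly during model selection to ensure safe deployment.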

Comments 28 pages, 8 figures, 8 tables

Abstract

Artificial intelligence models are increasingly scaled to improve predictive accuracy, yet it remains unclear whether scale improves the quality of post-hoc explanations. We investigate this relationship by evaluating 11 computer vision models representing increasing levels of depth and complexity within the ResNet, DenseNet, and Vision Transformer families, trained from scratch or pretrained, across three image datasets with ground-truth segmentation masks. For each model, we generate explanations using five post-hoc explainable AI methods and quantify mask alignment using two localisation metrics: Relevance Rank Accuracy (Arras et al., 2022) and the proposed Dual-Polarity Precision, which measures positive attributions inside the class mask and negative attributions outside it. Across datasets and methods, increasing architectural depth and parameter count does not improve explanation quality in most statistical comparisons, and smaller models often match or exceed deeper variants. While pretraining typically improves predictive performance and increases the dependence of explanations on learned weights, it does not consistently increase localisation scores. We also observe scenarios in which models achieve strong predictive performance while localisation precision is near zero, suggesting that performance metrics alone may not indicate whether predictions are based on the annotated regions. These results indicate that larger models do not reliably provide higher-quality explanations, and that explainability should therefore be assessed explicitly during model selection for safety-sensitive deployments.
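
For readers unfamiliar with the first localisation metric, the following sketch computes Relevance Rank Accuracy in its usual form: take as many top-ranked pixels as the ground-truth mask contains and report the fraction that fall inside the mask. The flattened-list representation and the sample values are assumptions for illustration. The proposed Dual-Polarity Precision is not sketched because the abstract does not give its formula.

```python
def relevance_rank_accuracy(attribution, mask):
    """Fraction of the top-K attributed pixels inside the mask, with K = mask size.

    attribution: flat list of relevance scores, one per pixel.
    mask: flat list of 0/1 ground-truth labels, same length.
    """
    k = sum(mask)
    if k == 0:
        return 0.0  # no annotated region to localise
    ranked = sorted(range(len(attribution)), key=lambda i: attribution[i], reverse=True)
    top_k = ranked[:k]
    return sum(mask[i] for i in top_k) / k


attr = [0.9, 0.1, 0.8, -0.2, 0.05, 0.7]  # made-up attribution map, flattened
mask = [1, 1, 1, 0, 0, 0]                # made-up segmentation mask, flattened
rra = relevance_rank_accuracy(attr, mask)
print(rra)  # two of the three top-ranked pixels are inside the mask: 2/3
```

A score near zero with high classification accuracy is the situation the abstract warns about: the model predicts well while its highest attributions lie outside the annotated object.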

2605.10141 2026-05-12 cs.AI

FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Zeynel A. Uluşan, Burak S. Akbudak, Can S. Erer, Gözde Gül Şahin

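AI summary: This paper introduces FormalRewardBench, a benchmark for evaluating reward models in formal theorem proving. Targeting the sparse reward assignment of current neural theorem provers trained with verifiable rewards, it uses five expert-designed error injection strategies to build a benchmark of 250 proof preference pairs. Experiments show that frontier LLMs perform best at judging proof quality while specialized theorem-proving models perform poorly, revealing a gap between theorem-proving ability and proof-evaluation ability.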

Abstract

Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap and scalable without reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert-curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). Our results reveal that frontier LLMs achieve the highest performance (59.8%) while specialized theorem provers perform the worst (24.4%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on various error injection mechanisms, highlighting the challenging nature of most injection mechanisms. We release FormalRewardBench publicly to encourage more research on developing reward models in formal mathematics.
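
A preference-pair benchmark of this kind is typically scored by checking how often a reward model ranks the correct proof above its incorrect variant. The sketch below assumes that is the metric behind the reported percentages, which the abstract does not state explicitly. The function `pairwise_accuracy`, the toy reward, and the two proof strings are hypothetical and not drawn from the benchmark.

```python
def pairwise_accuracy(pairs, reward_fn):
    """Fraction of (correct, incorrect) pairs where the correct proof scores strictly higher."""
    wins = sum(1 for good, bad in pairs if reward_fn(good) > reward_fn(bad))
    return wins / len(pairs)


def toy_reward(proof):
    # Hypothetical heuristic: penalise proofs that contain the placeholder `sorry`.
    return -proof.count("sorry")


pairs = [
    # Caught by the heuristic: the incorrect variant uses `sorry`.
    ("theorem t : 1 + 1 = 2 := rfl",
     "theorem t : 1 + 1 = 2 := sorry"),
    # Missed by the heuristic: the incorrect variant cites the wrong lemma.
    ("theorem u (n : Nat) : n + 0 = n := Nat.add_zero n",
     "theorem u (n : Nat) : n + 0 = n := Nat.zero_add n"),
]
acc = pairwise_accuracy(pairs, toy_reward)
print(acc)  # -> 0.5
```

Ties count as failures here, which is one possible convention; a benchmark could equally award half credit, so the handling of ties is also an assumption.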

2605.10136 2026-05-12 cs.LG

Per-Loss Adapters for Gradient Conflict in Physics-Informed Neural Networks

Bum Jun Kim, Gnankan Landry Regis N'guessan

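AI summary: Physics-informed neural networks (PINNs) train a single neural approximation by minimizing multiple physics- and data-derived losses, but the gradients of these losses often conflict and stall optimization. This paper shows that such gradient conflict is not a single failure mode: there are distinct conflict regimes that call for different interventions. The authors propose a diagnostic-first framework that uses one low-rank adapter per loss to create separate parameter subspaces, giving each loss a direct gradient pathway while retaining a shared trunk. Experiments show significant improvements across a range of PDE problems.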

Comments 49 pages, 10 figures

Abstract

Physics-informed neural networks (PINNs) train a single neural approximation by minimizing multiple physics- and data-derived losses, but the gradients of these losses often interfere and can stall optimization. Existing remedies typically treat this pathology either through scalar loss balancing or full-parameter-space gradient surgery, leaving it unclear which intervention is most appropriate. We show that PINN gradient conflict is not a uniform failure mode with one universal remedy. Instead, we identify distinct PINN gradient-conflict regimes, each associated with a different intervention class. Persistent directional conflict may require separate loss-indexed parameter subspaces, magnitude imbalance often favors scalar reweighting, and low or transient conflict may require no extra mitigation. To select between scalar reweighting and a lightweight architectural intervention, we propose a diagnostic-first framework. It profiles a 1000-step unmodified PINN run and, when intervention is warranted, uses one low-rank adapter per loss to create explicit loss-indexed parameter subspaces attached to a shared PINN trunk, providing each loss with a direct gradient pathway. Across more than 60 PDE configurations, including forward, inverse, multi-physics, parameter-varying, and high-dimensional problems up to 50D, persistent directional conflict dominates standard forward $K=3$ benchmarks and a natural $K=4$ thermoelastic system, where adapters combined with reweighting yield significant improvements. In contrast, $K=3$ inverse problems and natural $K=5$ and $K=6$ multi-physics systems are largely magnitude-dominated and often favor reweighting alone, while full-parameter-space gradient surgery can fail on heterogeneous parameter spaces.
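
The distinction between directional conflict and magnitude imbalance can be made concrete with a small profiling sketch. The abstract does not specify the diagnostics used, so this assumes two common proxies: the cosine similarity between two loss gradients (negative means they point against each other) and the ratio of their norms. The function names and the three-step gradient history are made up.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0  # treat a vanishing gradient as "no conflict"
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def profile_pair(history):
    """history: list of (grad_a, grad_b), one pair of loss gradients per training step.

    Returns (fraction of steps with opposing directions, mean norm ratio |a| / |b|).
    Assumes grad_b is nonzero at every step.
    """
    conflict_steps = sum(1 for ga, gb in history if cosine(ga, gb) < 0.0)
    ratios = [norm(ga) / norm(gb) for ga, gb in history]
    return conflict_steps / len(history), sum(ratios) / len(ratios)


history = [
    ([1.0, 0.0], [-1.0, 0.1]),   # opposing directions
    ([1.0, 0.0], [1.0, 0.0]),    # aligned
    ([10.0, 0.0], [0.0, 1.0]),   # orthogonal, but a dominates in magnitude
]
conflict_fraction, mean_ratio = profile_pair(history)
print(conflict_fraction)  # one of three steps is in directional conflict
```

Under this reading, a persistently high conflict fraction would point toward separate per-loss subspaces, while a large norm ratio with little directional conflict would point toward scalar reweighting. The thresholds for either decision are not given in the abstract and are not guessed here.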

2605.10130 2026-05-12 cs.CV

Thermal-Det: Language-Guided Cross-Modal Distillation for Open-Vocabulary Thermal Object Detection

Yasiru Ranasinghe, Elim Schenck, Florence Yellin, Shuowen Hu, Christopher Funk, Vishal M. Patel

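AI summary: Existing open-vocabulary detectors mainly target RGB images and generalize poorly to thermal imagery, where low texture and emissivity variation challenge RGB-based semantics. This paper proposes Thermal-Det, the first open-vocabulary thermal object detector supervised by a large language model (LLM). By building a synthetic dataset of over one million thermally aligned samples and combining cross-modal distillation with a text calibration module, it transfers detection knowledge to the thermal domain without manual annotation. Experiments on public datasets show a 2-4% AP gain over existing open-vocabulary detectors, laying a foundation for language-driven thermal perception.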

Comments Accepted at CVPR 26

Abstract

Existing open-vocabulary detectors focus on RGB images and fail to generalize to thermal imagery, where low texture and emissivity variations challenge RGB-based semantics. We present Thermal-Det, the first large language model (LLM) supervised open-vocabulary detector tailored for thermal images. To enable large-scale training, we develop a synthetic dataset by converting GroundingCap-1M into the thermal domain and filtering captions to remove RGB-specific terms, yielding over one million thermally aligned samples with bounding boxes, grounding texts, and detailed captions. Thermal-Det jointly optimizes detection, captioning, and cross-modal distillation objectives. A frozen RGB teacher provides geometric and semantic pseudo-supervision for paired but unlabeled RGB-thermal data, transferring open-vocabulary knowledge without manual annotation. The model further employs a Thermal-Text Alignment Head for text calibration and a Modality-Fused Cross-Attention module for dual-modality reasoning. Unlike prior domain-adaptation methods, the detector is fully fine-tuned to internalize thermal contrast patterns while preserving language alignment. Experiments on public benchmarks show consistent 2-4% AP gains over existing open-vocabulary detectors, establishing a strong foundation for scalable, language-driven thermal perception.

2605.10129 2026-05-12 cs.CL

Synthetic Pre-Pre-Training Improves Language Model Robustness to Noisy Pre-Training Data

Xu Guo, Runyu Peng, Jian Tong, Yunhua Zhou, Haijun Lv, Zhihui Lu, Qipeng Guo

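AI summary: This paper studies how a lightweight pre-pre-training (PPT) stage can improve the robustness of large language models to noisy pre-training data. The authors propose PPT on synthetic data with learnable temporal structure, which strengthens the model's resistance to noise during the subsequent pre-training stage. Experiments show that the approach improves performance across noise levels and reduces the amount of natural-text pre-training data needed.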

Abstract

Large language models (LLMs) rely on web-scale corpora for pre-training. The noise inherent in these datasets tends to obscure meaningful patterns and ultimately degrade model performance. Data curation mitigates but cannot eliminate such noise, so pre-training corpora remain noisy in practice. We therefore study whether a lightweight pre-pre-training (PPT) stage based on synthetic data with learnable temporal structure helps resist noisy data during the pre-training (PT) stage. Across various corruption settings, our method consistently improves robustness to noise during PT, with larger relative gains at higher noise levels. For a 1B-parameter model, a synthetic PPT stage with only 65M tokens achieves the same final loss as the baseline while using up to 49% fewer natural-text PT tokens across different noise levels. Mechanistic analyses suggest PPT does not immediately suppress attention to noisy tokens. Rather, PPT-initialized models gradually downweight attention between corrupted tokens during noisy PT. This indicates that synthetic PPT inhibits noise self-modeling and shapes the subsequent optimization trajectory. Code is available at https://github.com/guox18/formal-language-prepretraining.

2605.10123 2026-05-12 cs.LG

Complex-Valued Phase-Coherent Transformer

Leona Hioki

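AI summary: This paper proposes the Phase-Coherent Transformer (PCT), a complex-valued Transformer designed to preserve phase information in complex-valued neural networks. Unlike conventional complex-valued Transformers built on softmax attention, PCT applies a real-valued, element-independent, smooth gate to L2-normalised complex query-key similarities, avoiding token competition and maintaining phase coherence across layers. Experiments show that PCT performs well on several mid-scale benchmarks, outperforming the standard softmax Transformer and its complex-valued variant, and remains competitive on tasks traditionally difficult for complex-valued networks.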

Comments 26 pages, 17 tables (no figures). Companion Lean 4 formalization of Theorems 1 and 2 at https://github.com/leohio/phase-coherent-transformer-r-d

Abstract

Complex-valued Transformers have largely inherited softmax attention from real-valued architectures. However, row-normalised token competition is not necessarily aligned with phase-preserving computation. In this paper, we introduce the Phase-Coherent Transformer (PCT), which applies a real-valued, element-independent, smooth gate to L2-normalised complex query-key similarities. PCT replaces token competition with token-non-competing attention and is designed to preserve phase information across layers. Across mid-scale benchmarks spanning long-range memory, hierarchical long-range reasoning, positional retrieval, phase-based memory and superposition, and image classification, PCT shows strong generalisation across task categories. Under parameter-fair comparison, PCT consistently outperforms both the standard softmax Transformer and its direct complex-valued counterpart. Moreover, even on tasks traditionally considered difficult for complex-valued neural networks, such as NIAH and LRA-Text, PCT remains competitive with Multiscreen, the strongest real-valued NN baseline in our comparison. Experiments introducing gates that deliberately violate the PCT conditions show that the design is not incidental: smooth gates that preserve negatively aligned phase components remain strong, whereas gates that delete such components collapse on long-range retrieval, and gates whose outputs become excessively large suffer clear performance degradation. PCT also shows no depth-related accuracy collapse across the tested depth range. These results support introducing multi-layer phase-coherent structure into attention as a promising design principle for achieving generalisation in complex-valued Transformers.

2605.10122 2026-05-12 cs.AI cs.LG

Rethinking Constraint Awareness for Efficient State Embedding of Neural Routing Solver

Canhong Yu, Changliang Zhou, Rongsheng Chen, Zhenkun Wang, Yu Zhou

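AI summary To address the weakness of neural routing solvers on vehicle routing problems (VRPs) with complex constraints, this paper revisits how state embeddings are generated and finds that current methods restrict the observation space during decoding, which becomes a performance bottleneck. The authors propose CARM, a Constraint-Aware Residual Modulation module that adaptively modulates the context embedding with constraint-relevant variables to strengthen the model's constraint awareness. Experiments show that CARM markedly improves several single-task and multi-task routing solvers, most notably when scaling to large instances and generalizing to new VRP variants.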

Details
English abstract

Heavy-Encoder-Light-Decoder (HELD) neural routing solvers have emerged as a promising paradigm due to their broad applicability across multiple vehicle routing problems (VRPs). However, they typically struggle with VRP variants with complex constraints. To address this limitation, this paper systematically revisits existing neural solvers from the perspective of the generation mechanism for state embeddings (i.e., query vector prior to compatibility calculation) during decoding. We identify that current mechanisms restrict the observation space during attention computation, introducing a key bottleneck to achieving high-quality solutions. Through detailed empirical analysis, we demonstrate the necessity of preserving a global observation space. To overcome the constraint-agnostic drawback inherent to global observation spaces, we propose a simple yet powerful Constraint-Aware Residual Modulation (CARM) module. By adaptively modulating the context embedding with constraint-relevant variables, CARM effectively enhances constraint awareness, enabling the neural solver to fully leverage the global observation space and generate an efficient state embedding. Extensive experimental results across two single-task and five multi-task neural routing solvers confirm that the CARM module consistently boosts baseline performance. Notably, solvers equipped with our CARM achieve substantial improvements in scaling to large-scale instances and in generalizing to unseen VRP variants. These findings provide valuable insights for the architectural design of neural routing solvers.

2605.10121 2026-05-12 cs.LG cs.AI cs.HC

Explainability of Recurrent Neural Networks for Enhancing P300-based Brain-Computer Interfaces

Christian Oliva, Vinicio Changoluisa, Francisco B Rodríguez, Luis F Lago-Fernández

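AI summary This paper studies how to make recurrent neural networks more explainable in brain-computer interfaces based on P300 event-related potentials. The authors propose the Post-Recurrent Module (PRM), an additional layer integrated into an RNN architecture to improve both performance and transparency. Using global and local explainability techniques, the approach supports a dual analysis of spatio-temporal signals, identifying the key brain regions and time intervals involved in classification, consistent with established neurophysiological descriptions. Experiments show a 9% performance improvement over existing methods and highlight the importance of inter- and intra-subject variability, providing an effective framework for explainable EEG models.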

Details
English abstract

Brain-Computer Interfaces (BCIs) based on P300 event-related potentials offer promising applications in health, education, and assistive technologies. However, challenges related to inter- and intra-subject variability and the explainability of Deep Learning (DL) models limit their practical deployment. In this work, we present the Post-Recurrent Module (PRM), an additional layer designed to improve both performance and transparency, incorporated into a Recurrent Neural Network (RNN) architecture for classifying P300 signals from EEG data. Our approach enables a dual analysis of spatio-temporal signals through both global and local explainability techniques, allowing us not only to identify the most relevant brain regions and critical time intervals involved in classification, but also to interpret model decisions in terms of spatio-temporal EEG patterns consistent with well-established neurophysiological descriptions of the P300. Experimental results show a 9% improvement in performance over the state of the art, while also revealing the importance of inter- and intra-subject variability, in alignment with established neuroscience literature. By making model decisions transparent and efficient, we present a framework for explainable EEG-based models. This framework is not limited to more efficient P300 detection, but can be generalized to a wide range of EEG-based tasks. Its ability to identify key spatial and temporal features makes it suitable for applications such as motor imagery, steady-state visual evoked potentials, and even cognitive workload assessment.

2605.10120 2026-05-12 cs.CV cs.AI

MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan

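AI summary This paper proposes MicroWorld, a framework that addresses the weak performance of multimodal large language models in specialized microscopic domains such as microscopy. It builds a multimodal attributed property graph (MAPG) to strengthen the model's reasoning, improving performance at inference time without domain-specific fine-tuning. Experiments show that MicroWorld substantially improves Qwen3-VL-8B-Instruct on benchmarks such as MicroVQA, reaching state-of-the-art results, and demonstrates advantages in cross-domain generalization.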

Comments 29 pages, 14 figures

Details
English abstract

Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge the gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image--caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% to achieve a new state-of-the-art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at https://github.com/ieellee/MicroWorld.

2605.10118 2026-05-12 cs.RO

Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

Zhixuan Shen, Jiawei Du, Ziyu Guo, Han Luo, Lilan Peng, Joey Tianyi Zhou, Haonan Luo, Tianrui Li

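AI summary This work addresses the limited performance of vision-language models in embodied navigation caused by a lack of real-world data, and proposes SAGE, a framework based on physics-constrained semantic abstraction. By constructing semantic environments, training with reinforcement learning, and transferring the abstract policy to real-world control, SAGE learns and plans within a simplified physics abstraction. It achieves a markedly higher navigation success rate on the A-EQA dataset and shows good transfer to real robot deployment.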

Comments 28 pages, 15 figures, Extended Version of accepted ICML 2026 Paper

Details
English abstract

Vision-Language Models (VLMs) have demonstrated exceptional general reasoning capabilities. However, their performance in embodied navigation remains hindered by a scarcity of aligned open-world vision and robot control data. Despite simulators providing a cost-effective alternative for data collection, the inherent reliance on photorealistic simulations often limits the transferability of learned policies. To this end, we propose Sandbox-Abstracted Grounded Experience (SAGE), a framework that enables agents to learn within a physics-grounded semantic abstraction rather than a photorealistic simulation, mimicking the human capacity for mental simulation where plans are rehearsed in simplified physics abstractions before execution. The SAGE system operates via three synergistic phases: (1) Genesis: constructing diverse, physics-constrained semantic environments to bootstrap experience; (2) Evolution: distilling experiences through Reinforcement Learning (RL), utilizing a novel asymmetric adaptive clipping mechanism to stabilize updates; (3) Navigation: bridging the abstract policy to open-world control. We demonstrate that SAGE significantly improves planner-assisted embodied navigation, achieving a 53.21% LLM-Match Success Rate on A-EQA (+9.7% over baseline), while showing encouraging transfer to physical indoor robot deployment.

2605.10117 2026-05-12 cs.CV cs.AI

Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving

Donghyun Kim, Jaehyoung Park

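AI summary This paper studies how to adjust perception compute dynamically according to scene complexity in autonomous driving. It proposes Enhanced HOPE, an adaptive perception architecture that estimates the geometric complexity of each LiDAR frame with an unsupervised method and selects a shallow or deep processing path accordingly, improving computational efficiency while maintaining accuracy. The method also introduces a linear-time subspace attention network and a persistent temporal memory module, which improve tracking of occluded objects, and it performs strongly on several benchmarks.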

Details
English abstract

Autonomous driving scenes range from empty highways to dense intersections with dozens of interacting road users, yet current 3D detection models apply a fixed computation budget to every frame, wasting resources on simple scenes while lacking capacity for complex ones. Existing approaches compound this problem: Transformer-based interaction models scale quadratically with the number of detected objects, and frame-by-frame processing causes the system to immediately forget objects the moment they become occluded. We propose Enhanced HOPE, an adaptive perception architecture that measures the geometric complexity of each incoming LiDAR frame using an unsupervised statistical estimator and routes it through a shallow or deep processing path accordingly, requiring no manual scene labels. To keep interaction modeling efficient, we replace quadratic pairwise attention with a linear-time subspace-based network that groups nearby objects into clusters and processes them jointly. The computational savings from these two mechanisms free up resources for a persistent temporal memory module that retains previously detected objects and traffic rules across frames, enabling the system to recall occluded objects seconds after they disappear from view. On the nuScenes and CARLA benchmarks, Enhanced HOPE reduces latency by 38% on simple scenes with no accuracy loss, improves mean Average Precision by 2.7 points on rare long-tail scenarios, and tracks objects through occlusions lasting over 5 seconds, where all tested baselines fail.

2605.10115 2026-05-12 cs.LG cond-mat.mtrl-sci

Generating Symmetric Materials using Latent Flow Matching

Anmar Karmush, Cedric Mathieu Brandenburg, Soheil Ershadrad, Johanna Rosén, Michael Felsberg, Filip Ekström Kelvinius

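AI summary This paper proposes SymADiT, a symmetry-aware materials generation model that improves on the existing All-atom Diffusion Transformer (ADiT). The method represents materials by Wyckoff positions and performs generative modeling in latent space, forcing generated outputs to satisfy the symmetry constraints of the crystal's space group and each atom's Wyckoff position so that the generated materials have more realistic symmetry properties. Experiments show that SymADiT generates stable, symmetric materials with performance comparable to or better than existing models.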

Comments Preprint

Details
English abstract

Tackling the task of materials generation, we aim to enhance the previously proposed All-atom Diffusion Transformer (ADiT) by introducing SymADiT, a symmetry-aware variant. To do so, we use a representation of materials based on Wyckoff positions. We follow ADiT and perform generative modelling in latent space, adapted to our symmetry-aware representation. By forcing the output of the generative model to adhere to the symmetry restrictions imposed by the generated crystal's space group and each atom's Wyckoff position, the generated materials exhibit more realistic symmetry properties. We benchmark our method against both symmetry-aware and symmetry-agnostic models for materials generation and show competitive performance, generating stable, symmetric materials with a simple Transformer architecture.

2605.10114 2026-05-12 cs.CL

SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution

Xiangcheng Meng, Shu Wang, Yixiang Fang

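AI summary SkillRAE is a skill-based context compilation method intended to improve Retrieval-Augmented Execution (RAE) on complex tasks. It has an offline stage and an online stage: the offline stage builds a multi-level skill graph to capture relationships among skills, and the online stage uses skill-ranked retrieval and key-evidence compilation to produce a task context that is compact, reliable, and easy to use. Experiments show that SkillRAE significantly outperforms existing methods on several benchmarks, demonstrating the effectiveness and importance of context compilation.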

Details
English abstract

Large Language Model (LLM)-based agents (e.g., OpenClaw) increasingly rely on reusable skill libraries to solve artifact-rich tasks such as document-centric workflows and data-intensive analysis. As these libraries grow, a few works have begun to study Retrieval-Augmented Execution (RAE), which typically first retrieves external skills and other knowledge, then compiles the context from the retrieved skills, and finally executes the task. Existing works mainly focus on optimizing skill retrieval and task execution, and pay little attention to how to organize the selected skill evidence in a form that is compact, grounded, and immediately usable by downstream executors. To fill this gap, we propose SkillRAE, a two-stage RAE approach focusing on skill-based context compilation, which consists of an offline stage and an online stage. Specifically, in the offline indexing stage, it builds a multi-level skill graph over skill communities, skills, and reusable subunits to capture their relationships. In the online retrieval stage, it first performs skill-ranked retrieval with selected-subunit evidence export in the graph, and then applies rescue-aware compact compilation to recover the key evidence. Together, these components compile a coarse-ranked skill set into a task-specific context that is compact, grounded, and immediately usable. Experiments on two public benchmarks show that SkillRAE achieves a significant improvement over baselines for RAE. For example, on SkillsBench, it achieves an improvement of 11.7% over the SOTA method. Ablation studies further show that our context compilation is crucial, rather than a mere prompt addition.

2605.07846 2026-05-12 cs.CV

BRIDGE: Background Routing and Isolated Discrete Gating for Coarse-Mask Local Editing

Peilin Xiong, Honghui Yuan, Junwen Chen, Keiji Yanai

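AI summary This paper studies boundary distortion in the edited region caused by mask-shape bias in coarse-mask local image editing, and proposes a method called BRIDGE. The method keeps masks outside the DiT backbone and introduces a learnable discrete geometric gating mechanism, satisfying the dual constraint of a stable background and flexible generation in the editable region. Experiments show that BRIDGE markedly improves editing quality on several benchmarks while remaining lightweight.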

Comments 11 pages, 6 figures

Details
English abstract

Coarse-mask local image editing asks a model to modify a user-indicated region while preserving the surrounding scene. In practice, however, rough masks often become unintended shape priors: instead of serving as flexible edit support, the mask can pull the generated object toward its accidental boundary. We study this failure as mask-shape bias and frame the task through a Two-Zone Constraint, where the background should remain stable while the editable region should follow the instruction without being forced to inherit the mask contour. BRIDGE addresses this setting by keeping masks outside the DiT backbone for support construction and blending, avoiding DiT-internal mask injection and copied control branches. It uses BridgePath generation, where a Main Path preserves background context and a Subject Path generates editable content from independent noise. Motivated by a diagnostic Qwen-Image experiment showing that positional embeddings and attention connectivity regulate which image context visual tokens reuse, BRIDGE introduces a learnable Discrete Geometric Gate for token-level positional-embedding routing. This gate lets subject tokens borrow background-anchored coordinates near fusion regions or keep subject-centric coordinates for geometric freedom. We evaluate BRIDGE on BRIDGE-Bench, MagicBrush, and ICE-Bench. On BRIDGE-Bench, BRIDGE improves Local SigLIP2-T from 0.262 with FLUX.1-Fill and 0.390 with ACE++ to 0.503, with parallel gains in local DINO and DreamSim. Zero-shot results on MagicBrush and ICE-Bench further indicate competitive alignment and source preservation beyond the curated benchmark, while the added routing module remains compact at 13.31M parameters compared with ControlNet-style copied branches.

2605.07820 2026-05-12 cs.LG

Scaling Categorical Flow Maps

Oscar Davis, Anastasiia Filippova, Pierre Ablin, Victor Turrisi, Amitis Shidani, Marco Cuturi, Louis Béthune

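AI summary This paper studies how to scale Categorical Flow Maps (CFMs) for large-scale language modeling. It trains a 1.7B-parameter flow model and self-distills it into a CFM that generates high-quality text in 4 steps. The method maintains near-data-level token entropy while achieving performance comparable to discrete diffusion models. The authors also introduce a likelihood bound in the semi-discrete setting and discuss the challenges of large-scale training along with strategies for loss weighting and time scheduling.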

Comments Minor style changes

Details
English abstract

Continuous diffusion and flow matching models could represent a powerful alternative to autoregressive approaches for language modelling (LM), as they unlock a host of advantages currently reserved for continuous modalities, including accelerated sampling and tilting. Recently, several works have demonstrated the possibility of generating discrete data continuously by a simple flow matching process between a Gaussian and the one-hot encoded data distribution. They have further shown the feasibility of accelerated sampling via Categorical Flow Maps (CFMs), resulting in competitive sample quality in the few-step regime. However, this method had only been evaluated at relatively modest scales ($<1$B), leaving the question of its scalability completely open. In this article, we train a $1.7$B-parameter base flow model on $2.1$T tokens and self-distill it into a CFM that generates diverse, high-quality text in as few as $4$ inference steps while maintaining near-data-level token entropy. Furthermore, we introduce a likelihood bound for CFMs in the semi-discrete setting, and show that they can be used to score the model on standard LM benchmarks, achieving results in the same range as discrete diffusion methods. Finally, we uncover some of the challenges that arise from training these models at scale, and we provide prescriptive insights on loss weighting and time scheduling.
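
The flow matching process between a Gaussian and one-hot encoded tokens mentioned above can be sketched for a single token. This is a minimal illustration, assuming a linear interpolation path and a standard Gaussian source, neither of which the abstract specifies; the function names are hypothetical, and the sketch covers only the construction of a training pair, not the self-distilled flow map.

```python
import random


def one_hot(token_id, vocab_size):
    """Encode a token as a one-hot vector (the data endpoint at t = 1)."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]


def flow_matching_pair(token_id, vocab_size, t, rng):
    """Return (x_t, target velocity) for one token on a linear path.

    Assumption: x_t = (1 - t) * x0 + t * x1 with x0 drawn from a standard
    Gaussian, so the regression target for the velocity is x1 - x0.
    """
    x1 = one_hot(token_id, vocab_size)
    x0 = [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity
```

Under this assumed path, moving from `x_t` along the target velocity for the remaining time `1 - t` lands exactly on the one-hot vector, which is what a network trained on such pairs learns to approximate.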

2605.07786 2026-05-12 cs.CV cs.AI

APEX: Assumption-free Projection-based Embedding eXamination Metric for Image Quality Assessment

Caterina Gallegati, Monica Bianchini, Franco Scarselli, Vittorio Murino, Barbara Toniella Corradini

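AI summary As generative models make breakthroughs in visual quality, traditional feature-distribution metrics for image evaluation (such as FID) are still treated as the gold standard, but they are limited by outdated features and parametric assumptions. To address this, the paper proposes APEX, an assumption-free embedding evaluation framework based on the Sliced Wasserstein Distance that does not depend on a specific parametric form and is compatible with multiple embedding models such as CLIP and DINOv2. Experiments show that APEX scales well to high-dimensional spaces, is more robust to visual degradations, and is highly stable in cross-dataset evaluation.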

Details
English abstract

As generative models achieve unprecedented visual quality, the gold standard for image evaluation remains traditional feature-distribution metrics (e.g., FID). However, these metrics are provably hindered by the closed-vocabulary bottleneck of outdated features and the assumptive bias of rigid parametric formulations. Recent alternatives exploit modern backbones to solve the feature bottleneck, yet continue to suffer from parametric limitations. To close this gap, we introduce APEX (Assumption-free Projection-based Embedding eXamination), a novel evaluation framework leveraging the Sliced Wasserstein Distance as a mathematically grounded, assumption-free similarity measure. APEX inherits effective scalability to high-dimensional spaces, as we prove with theoretical and empirical evidence. Moreover, APEX is embedding-agnostic and uses two open-vocabulary foundation models, CLIP and DINOv2, as feature extractors. Benchmarking APEX against established baselines reveals superior robustness to visual degradations. Additionally, we show that APEX metrics exhibit intra- and cross-dataset stability, ensuring highly stable evaluations on out-of-domain datasets.
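
For readers unfamiliar with the Sliced Wasserstein Distance, the standard Monte Carlo estimator can be sketched in a few lines. This is the textbook estimator, not necessarily the exact procedure used in APEX: the number of projections, the exponent `p`, the equal-sample-count restriction, and the function names are all assumptions made for illustration, and the inputs here are plain lists standing in for CLIP or DINOv2 embeddings.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere via a normalised Gaussian."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:
            return [x / norm for x in v]


def sliced_wasserstein(xs, ys, n_projections=128, p=2, seed=0):
    """Monte Carlo estimate of the sliced p-Wasserstein distance.

    xs and ys are equal-sized lists of equal-length feature vectors.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        direction = random_unit_vector(dim, rng)
        # Project both sets to 1-D; sorting gives the optimal 1-D coupling.
        px = sorted(sum(a * d for a, d in zip(x, direction)) for x in xs)
        py = sorted(sum(b * d for b, d in zip(y, direction)) for y in ys)
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / n_projections) ** (1.0 / p)
```

The estimator makes no Gaussian assumption about the feature distributions: each random projection reduces the comparison to one dimension, where the optimal transport cost is obtained by sorting.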

2605.07575 2026-05-12 cs.CV cs.AI

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

Ke Ma, Jiaqi Tang, Bin Guo, Xueting Han, Ruonan Xu, Qingfeng He, Ziheng Wang, Xu Wang, Qifeng Chen, Zhiwen Yu, Yunhao Liu

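AI summary This paper proposes Response-G1, a new framework for deciding when to respond proactively in streaming video understanding. Through explicit scene graph modeling, the method structurally aligns video content with the query's response conditions, improving the accuracy and interpretability of response decisions. The framework comprises three fine-tuning-free stages: online scene graph generation, memory-based semantic retrieval, and retrieval-augmented trigger prompting. Experiments show that it outperforms existing methods on both proactive and reactive tasks.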

Comments Accepted to ACL 2026

Details
English abstract

Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query's expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.