arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.18932 2026-05-26 cs.LG cs.AI

HypergraphFormer: Learning Hypergraphs from LLMs for Editable Floor Plan Generation

Nikita Klimenko, Hesam Salehipour, Parham Eftekhar, Amir Khasahmadi, Ramon Elias Weber

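AI summary: Proposes HypergraphFormer, which uses a large language model to learn hypergraph representations for floor plan generation; it outperforms existing methods on the RPLAN dataset and supports arbitrary boundaries and a high degree of editability.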

Abstract

In this work, we propose HypergraphFormer, a novel and efficient approach to floor plan generation based on learning hypergraph representations with a large language model (LLM). The model is trained via supervised fine-tuning to generate a hypergraph-based textual representation that encodes spatial relationships and connectivity information within floor plans. We train and evaluate our approach on the RPLAN dataset, and further demonstrate its generalizability on a separate out-of-distribution dataset, which we release in this paper. Our method outperforms state-of-the-art techniques based on rasterized or vectorized representations across a diverse set of metrics. We also show improved data efficiency, particularly under distribution shift. The hypergraph formulation enables the generation of floor plans for arbitrary, irregular, user-specified boundaries by decoupling apartment footprints from their functional and geometric subdivisions. Furthermore, we show that the proposed methodology offers a high degree of editability, making it particularly well suited to design-oriented workflows supported by LLMs.

2605.18267 2026-05-26 cs.CV

SRC-Flow: Compact Semantic Representations Enable Normalizing Flows for Image Generation

Longtao Jiang, Jianmin Bao, Zhendong Wang, Xin Tao, Pengfei Wan, Zhihui Li, Xiaojun Chang

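AI summary: Proposes SRC-Flow, which uses a Semantic Representation Compressor to compress high-dimensional RAE features into a low-dimensional semantic space, reducing the modeling burden on normalizing flows; it achieves state-of-the-art generation quality among normalizing-flow methods on ImageNet while retaining exact likelihood computation and deterministic invertible sampling.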

Abstract

Normalizing flows (NFs) provide exact likelihoods and deterministic invertible sampling, but have historically lagged behind diffusion models for large-scale image generation. We identify a key obstacle: NFs are required to learn a single invertible transport over the full ambient space, making them highly sensitive to high-dimensional representations. This leads to a semantic-capacity mismatch in modern visual representation spaces, where semantic information is compact but encoded in overcomplete features. We propose SRC-Flow, which introduces a Semantic Representation Compressor (SRC) that compresses high-dimensional RAE features into a low-dimensional semantic space before flow modeling while preserving reconstruction through the frozen RAE decoder. This compact space reduces the modeling burden of NFs and enables effective likelihood-based generation in semantic representation space. We further adopt constant noise regularization tailored to the fixed unconditional bijection learned by flows. On ImageNet $256 \times 256$ and $512 \times 512$, SRC-Flow achieves state-of-the-art generation quality among normalizing flow methods, with gFID scores of 1.65 and 2.07, respectively, under classifier-free guidance, while retaining exact likelihood computation in the compact semantic representation space and deterministic invertible sampling at the flow level. Code and models will be available at https://github.com/longtaojiang/SRC-Flow.

2605.18224 2026-05-26 cs.LG cs.AI

A Simplex Witness Certificate for Constant Collapse in Variational Autoencoders

Zegu Zhang, Jianhua Peng, Jian Zhang

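AI summary: Proposes a certificate, built from a GMM-based teacher posterior and a simplex witness, for detecting and quantifying whether a VAE encoder mean has undergone input-independent constant collapse; the method is evaluated on MNIST, CIFAR-10, and CIFAR-100.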

Abstract

We study exact constant collapse in variational autoencoders: the deterministic encoder mean becomes independent of the input. The prior remains the standard Gaussian. Before VAE training, we select a fixed teacher posterior from a GMM-based view of the data and attach a fixed latent-only simplex witness to the encoder mean. This construction yields two linked objects. The first is a certificate: if the witness prediction improves on the best constant predictor of the teacher, the encoder mean cannot be an input-independent constant. The second is a local escape direction: on the collapsed manifold, the teacher residual gives a sample-dependent descent direction for the alignment loss. For any full-support teacher posterior, the same geometry also gives a closed-form latent code with zero teacher-witness alignment error. Its scaled versions trace a margin-energy path from the constant predictor to the exact teacher code, which quantifies non-collapse inside the protected witness subspace. We instantiate the method on MNIST, CIFAR-10, and CIFAR-100. With searched unsupervised PCA-GMM teachers, vanilla VAEs fail the teacher-witness certificate in all five seeds on CIFAR-10 and CIFAR-100, while RST variants pass in all five seeds. Under collapse-stress settings with \(\beta_{\mathrm{KL}}\in\{2,4,8\}\), vanilla VAE again fails in all seeds, whereas RST-alpha-prefit remains certificate-positive. Escape trajectories on both natural-image datasets increase the witness margin from a low-margin initialization and exhibit nonzero teacher-induced gradient norms. The analysis is confined to exact constant collapse of the encoder mean; generation quality, decoder use, and other collapse modes remain separate questions.

2605.17606 2026-05-26 cs.LG

The Neural Tangent Kernel for Classification

Jonathan Plenk, Sergio Calvo-Ordonez, Alvaro Cartea, Yarin Gal, Mark van der Wilk, Kamil Ciosek

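AI summary: Extends Neural Tangent Kernel theory to classification by identifying conditions under which wide neural networks remain in the lazy training regime under classification losses; also analyzes how parameter regularization affects kernel constancy and how the distribution of trained predictors relates to Bayesian methods.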

Comments Preprint

Abstract

In wide neural networks, the Neural Tangent Kernel (NTK) remains approximately constant during training, providing a powerful theoretical tool for studying training dynamics, generalization, and connections to kernel methods. However, this theory is largely restricted to regression losses. It was previously thought that training on a classification loss, or more generally losses involving nonlinear output transformations, breaks this property, leading to divergent logits and a breakdown of the linearization. In this paper, we extend NTK theory to classification by identifying conditions under which wide neural networks remain in the lazy training regime. We show that parameter-space regularization ensures a constant NTK during training for cross-entropy loss, while in the absence of regularization the regime is recovered when targets are non-degenerate, i.e. when all classes have strictly positive probability. Under these conditions, training is well-approximated by the linearized model, yielding an explicit characterization of the solution in terms of the NTK. We further analyze the distribution of trained predictors induced by random initialization and relate this notion of model uncertainty to Bayesian methods.
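For readers unfamiliar with the terms, the standard textbook definitions behind "NTK" and "linearized model" can be written as below. This is a sketch in generic notation, assuming a scalar-output network $f(x;\theta)$ with initial parameters $\theta_0$; the symbols are assumptions for illustration and are not taken from the paper.

```latex
% Neural Tangent Kernel at parameters \theta (scalar-output network)
\Theta_\theta(x, x') = \nabla_\theta f(x;\theta)^{\top} \, \nabla_\theta f(x';\theta)

% First-order (linearized) model around the initialization \theta_0
f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top} (\theta - \theta_0)
```

In the lazy training regime, $\Theta_\theta$ stays close to $\Theta_{\theta_0}$ throughout training, which is what allows $f$ to be approximated by $f_{\mathrm{lin}}$.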

2605.17543 2026-05-26 cs.CV cs.GR

HL-OutPaint: Coarse-to-Fine Video Outpainting for High-Resolution Long-Range Videos

Jeongeun Park, Janghyeok Han, Geonung Kim, Hyun-Seung Lee, Kyuha Choi, Youngseok Han, Sunghyun Cho

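AI summary: Proposes HL-OutPaint, a coarse-to-fine two-stage framework that builds Global Coarse Guidance through a global-local frame swapping mechanism, enabling large spatial extrapolation and spatio-temporally consistent generation for high-resolution long videos.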

Comments Supplementary material and video included. Project page: https://koyy001.github.io/Publications/hl-outpaint

Abstract

Video outpainting generates plausible visual content beyond the original spatial extent of a video, playing a key role in adapting videos to diverse display formats. To support such use cases, it must enable large spatial extrapolation over long sequences. However, most existing methods address only one of these challenges or lack explicit mechanisms for ensuring global spatio-temporal consistency, leading to notable limitations. In this paper, we propose HL-OutPaint, a high-resolution video outpainting framework for long sequences. Our approach follows a coarse-to-fine strategy with a two-stage pipeline. We first construct Global Coarse Guidance (GCG), a low-resolution representation that captures global structure and dominant motion across the video. Unlike naive downsampling, GCG is built via a novel global-local frame swapping mechanism that couples sparse global keyframes with local temporal windows and exchanges information during sampling. This enables GCG to encode both long-term structural consistency and short-term temporal dynamics in a unified representation. Guided by this representation, HL-OutPaint then performs high-resolution outpainting to generate spatially detailed and temporally consistent content. By separating global structure modeling from fine-grained synthesis, our framework achieves stable, coherent generation for large spatial expansion and long video sequences. Extensive experiments show that HL-OutPaint outperforms existing methods in challenging scenarios involving wide spatial extrapolation and long video sequences.

2605.17260 2026-05-26 cs.CV

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

Jihwan Kim, Nikhil Parthasarathy, Danfeng Qin, Junhwa Hur, Deqing Sun, Bohyung Han, Ming-Hsuan Yang, Boqing Gong

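AI summary: To address the explosion of visual-token context length when Video LLMs process long videos, proposes LiteFrame, an efficient video encoder trained with Compressed Token Distillation (CTD), in which a compact student model directly predicts the teacher model's information-dense, spatio-temporally compressed representations; compared with InternVL3-8B, it reduces end-to-end latency by 35% while processing 8× more frames and improving video understanding accuracy.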

Comments Project Page: https://jjihwan.github.io/projects/LiteFrame

Abstract

The fundamental challenge in scaling Video Large Language Models (Video LLMs) to long-form video lies in managing the explosion of visual-token context length. Existing strategies predominantly focus on "post-hoc" token reduction -- reducing visual tokens after feature extraction to alleviate the LLM's computational overhead. While these methods effectively reduce the number of visual tokens, we observe that the primary latency bottleneck then shifts from the LLM to the expensive per-frame processing of the vision encoder. To address this, we introduce LiteFrame, a strong, yet highly efficient video encoder backbone for Video LLMs. To train LiteFrame, we propose Compressed Token Distillation (CTD), a novel training framework that teaches a compact student vision encoder to directly predict information-dense, spatio-temporally compressed representations produced by a large teacher vision model, effectively bypassing redundant computation. When coupled with further Language Model Adaptation (LMA), this approach results in a new latency-accuracy Pareto frontier -- compared with InternVL3-8B, LiteFrame provides a 35% reduction in end-to-end latency while processing 8$\times$ more frames and improves average video understanding accuracy across multiple benchmarks. Our results demonstrate a new potential path to unlocking longer-form video understanding under fixed compute budgets.

2605.16302 2026-05-26 cs.LG cs.AI cs.CL

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Fei Ding, Yongkang Zhang, Youwei Wang, Zijian Zeng

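AI summary: Proposes a counterfactual-comparison framework that samples multiple reasoning trajectories and uses their differences to implicitly estimate process-level advantages, converting sparse terminal rewards into step-sensitive signals and improving credit assignment for multi-step LLM reasoning; introduces Implicit Behavior Policy Optimization (IBPO), which improves training stability and the performance ceiling.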

Abstract

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit-assignment problem: the final feedback is propagated uniformly across all intermediate decisions. This leads to high gradient variance, unstable training, and many ineffective updates, ultimately limiting sustained model improvement. We propose a counterfactual-comparison framework for credit assignment. For each input, the framework samples multiple reasoning trajectories and treats their differences as implicit approximations to alternative decisions. This yields an implicit process-level advantage estimator that converts sparse terminal rewards into step-sensitive learning signals. Building on this framework, we introduce Implicit Behavior Policy Optimization (IBPO), which substantially improves training stability and the performance ceiling on mathematical and code-reasoning benchmarks. Our results point to a promising direction for unlocking the reasoning potential of LLMs.

2605.15433 2026-05-26 cs.LG

Spectral Priors vs. Attention: Investigating the Utility of Attention Mechanisms in EEG-Based Diagnosis

Tawsik Jawad, Gowtham Atluri, Vikram Ravindra

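AI summary: Proposes a band-selective spectral feature construction method and shows that, on small EEG datasets, traditional machine learning models can match or exceed SOTA deep learning models, while attention mechanisms fail to extract stable spectral features.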

Abstract

Electroencephalography (EEG) time-series signals are characterized by significant noise and coarse spatial resolution, which complicates the classification of neurodegenerative diseases. Even SOTA deep learning architectures struggle to distinguish between healthy controls and diseased subjects, or between different disease types, due to high intergroup similarity. In this paper, we show that a spectrally selective approach to feature construction enhances class separability. By isolating signal strengths within the primary brainwave bands, we transform high-dimensional raw data into high-value spectral features. Our results demonstrate that in small datasets (a) features derived from the frequency and time-frequency domains allow traditional machine learning models to match or exceed the performance of SOTA deep learning models, (b) attention mechanisms are unable to distill the stable feature signatures that characterize healthy neural activity in both resting and task EEGs, and (c) the limitations of attention-based models in finding relevant spectral features appear to be robust, in that providing frequency-selective time-domain input does not appreciably improve their performance. We validate our methodology across three open-source resting EEG datasets and one task EEG dataset, providing robust empirical evidence for our claims.

2605.14889 2026-05-26 cs.CV cs.AI

SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

Sukju Oh, Sukkyu Sun

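AI summary: Proposes SurgicalMamba, built on Mamba2's structured state-space duality (SSD), which performs online surgical phase recognition with three components (a dual-path SSD block, intensity-modulated stepping, and state regramming) and reaches state-of-the-art performance on multiple benchmarks.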

Comments 28 pages, 7 figures, 10 tables; Code available at https://github.com/sukjuoh/Surgical-Mamba

Abstract

Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components that jointly address these demands: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path's effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 238.74 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.
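The "Cayley rotation" mentioned in this abstract refers to a standard construction: the Cayley transform maps a skew-symmetric matrix to an orthogonal (rotation) matrix. The sketch below shows only that map on a single 2D plane, in closed form. The function names and the single-parameter setup are assumptions for illustration; how the paper parameterizes, learns, or applies its per-chunk rotations to the SSM state is not described in the abstract and is not reproduced here.

```python
import math


def cayley_rotation_2d(a: float) -> list[list[float]]:
    """Cayley transform R = (I - A)^{-1} (I + A) of the skew-symmetric
    matrix A = [[0, -a], [a, 0]], expanded in closed form.

    The result is a rotation by angle theta with tan(theta / 2) = a.
    """
    d = 1.0 + a * a
    c = (1.0 - a * a) / d   # cos(theta)
    s = 2.0 * a / d         # sin(theta)
    return [[c, -s], [s, c]]


def rotate(R: list[list[float]], v: list[float]) -> list[float]:
    """Apply a 2x2 matrix to a 2-vector."""
    return [R[0][0] * v[0] + R[0][1] * v[1],
            R[1][0] * v[0] + R[1][1] * v[1]]


# Hypothetical two-channel state; the rotation mixes the channels
# while preserving the vector's Euclidean norm.
state = [3.0, 4.0]
mixed = rotate(cayley_rotation_2d(0.3), state)
```

Because the output is always orthogonal, an unconstrained real parameter `a` can be optimized freely without the matrix drifting away from a rotation, which is one common reason to use this parameterization.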

2605.14255 2026-05-26 cs.LG cs.CV

Architecture-Aware Explanation Auditing for Industrial Visual Inspection

Sibo Jia, Zihang Zhao, Kunrong Li

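AI summary: Proposes an architecture-aware explanation audit protocol grounded in the native-readout hypothesis; perturbation experiments show that an explanation method's faithfulness is bounded by its structural distance from the model's native decision mechanism, and that faithfulness rankings are joint properties of (model, explainer, perturbation operator) triples.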

Comments Format update

Abstract

Industrial visual inspection systems increasingly rely on deep classifiers whose heatmap explanations may appear visually plausible while failing to identify the image regions that actually drive model decisions. This paper operationalizes an architecture-aware explanation audit protocol grounded in the native-readout hypothesis: the perturbation-based faithfulness of an explanation method is bounded by its structural distance from the model's native decision mechanism. On WM-811K wafer maps (9 classes, 172k images) under a three-seed zero-fill perturbation protocol, ViT-Tiny + Attention Rollout attains Deletion AUC 0.211 against 0.432-0.525 for Swin-Tiny / ResNet18+CBAM / DenseNet121 + Grad-CAM (abs(Cohen's d) > 1.1), despite lower classification accuracy. Swin-Tiny disentangles architecture family from readout structure: despite being a Transformer, its spatial feature-map hierarchy makes it Grad-CAM compatible, showing that the operative factor is readout structure rather than architecture family. A model-agnostic control (RISE) compresses all families to Deletion AUC about 0.1, indicating the gap arises from the explainer pathway; notably, RISE outperforms all native methods, so native readout is a compatibility principle rather than an optimality guarantee. A blur-fill sensitivity analysis shows that the family ordering reverses under a different perturbation baseline, reinforcing that faithfulness rankings are joint properties of (model, explainer, perturbation operator) triples. An exploratory boundary-condition study on MVTec AD (pretrained models) indicates that audit results are dataset/task dependent and identifies conditions requiring qualification. The protocol yields actionable guidance: explanation pathways should be co-designed with model architectures based on readout structure, and deployed heatmaps should be accompanied by quantitative faithfulness metrics.
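As a rough illustration of the Deletion AUC metric quoted above, the sketch below integrates a deletion curve with the trapezoidal rule. It assumes the curve is a list of model confidence scores recorded at evenly spaced deletion fractions from 0 to 1, with pixels removed in order of decreasing attributed importance. The function name and the input format are assumptions; the paper's exact perturbation schedule and normalization are not given in the abstract.

```python
def deletion_auc(scores: list[float]) -> float:
    """Area under a deletion curve via the trapezoidal rule.

    scores[i] is the model's confidence for the original class after
    deleting the fraction i / (len(scores) - 1) of the input, most
    important regions first. Lower values mean confidence drops faster,
    i.e. the explanation ranked truly influential regions first.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two points on the deletion curve")
    step = 1.0 / (n - 1)
    return sum((scores[i] + scores[i + 1]) / 2.0 * step for i in range(n - 1))


# Hypothetical curves for two explainers on the same image.
sharp_drop = deletion_auc([0.9, 0.2, 0.1, 0.05, 0.0])
slow_drop = deletion_auc([0.9, 0.8, 0.6, 0.4, 0.0])
```

Under this reading, the explainer with the smaller area (`sharp_drop`) would be considered more faithful, which matches the abstract's treatment of lower Deletion AUC as better.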

2605.12961 2026-05-26 cs.CV cs.LG

Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering

Feijiang Li, Zhenxiong Li, Jieting Wang, Zizheng Jiu, Saixiong Liu, Liang Du

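AI summary: Proposes the GSEC framework, which reduces bias through generative semantic guidance and variance through a bi-layer ensemble, outperforming 18 state-of-the-art methods on six benchmark datasets.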

Abstract

Image clustering aims to partition unlabeled image datasets into distinct groups. A core aspect of this task is constructing and leveraging prior knowledge to guide the clustering process. Recent approaches introduce semantic descriptions as prior information, most of which rely on matching-based techniques with predefined vocabularies. However, the limited matching space restricts their adaptability to downstream clustering tasks. Moreover, these methods primarily focus on reducing bias to improve performance, frequently overlooking the importance of variance reduction. To address these limitations, we propose GSEC (Image Clustering based on Generative Semantic Guidance and Bi-Layer Ensemble), a framework designed to reduce bias through generative semantic guidance and mitigate variance via ensemble learning. Our method employs Multimodal Large Language Models to generate semantic descriptions and derive image embeddings via weighted averaging. Additionally, a bi-layer ensemble strategy integrates cross-modal information through BatchEnsemble in the inner layer and aligns outputs via an alignment mechanism in the outer layer. Comparative experiments demonstrate that GSEC outperforms 18 state-of-the-art methods across six benchmark datasets, while further analysis confirms its effectiveness in simultaneously reducing both bias and variance. The code is available at https://github.com/2017LI/GSEC.git.

2605.10764 2026-05-26 cs.CV cs.AI

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

Mengqi He, Xinyu Tian, Xin Shen, Shu Zou, Jinhong Ni, Zhaoyuan Yang, Weikang Li, Xuesong Li, Jing Zhang

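AI summary: Proposes UJEM-KL, an attack that maximizes entropy at decision tokens to flip the refusal outputs of vision-language models, achieving untargeted jailbreaks with improved transferability.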

Comments Preprint. 17 pages, 8 figures, 6 tables

Abstract

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and that non-refusal tokens already carry substantial probability mass among the top-ranked candidates before the attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.

2605.10430 2026-05-26 cs.LG cs.AI stat.ML

Real vs. Semi-Simulated: Rethinking Evaluation for Treatment Effect Estimation

George Panagopoulos

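AI summary: A large-scale empirical study comparing counterfactual and observable metrics for evaluating treatment effect estimation models on semi-simulated benchmarks and real datasets; it reveals gaps between the two evaluation regimes and finds that simple meta-learners paired with strong base models are competitive.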

Abstract

Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industrial practice. However, the two communities often evaluate models under markedly different conditions. Methodological work typically relies on semi-simulated benchmarks and metrics that require counterfactual outcomes, whereas real-world applications rely on observable metrics based on ranking or test outcomes. Despite the well-known gap between methodological progress and practical deployment, the relationship between these evaluation regimes has not been examined systematically. We conduct a large-scale empirical study of treatment effect evaluation across standard semi-simulated benchmark families and real-world datasets. Our benchmark covers meta-learners paired with multiple base learners, as well as specialized causal machine learning models. We evaluate these methods using observable metrics common in application-oriented literature, alongside counterfactual metrics commonly used in methods papers. Our results reveal two complementary gaps. First, counterfactual metrics do not reliably recover the estimators preferred by observable metrics, even on the same semi-simulated benchmarks. Second, rankings obtained on semi-simulated benchmarks do not transfer to real datasets. We further find that simple meta-learners with strong base models are consistently competitive, in contrast to specialized causal models. Overall, our findings suggest that progress in treatment effect estimation research should not be assessed solely through counterfactual metrics and semi-simulated benchmarks, but would benefit from incorporating observable metrics and real-data validation.

2605.07733 2026-05-26 cs.LG cs.AI

Intelligent Truck Matching in Full Truckload Shipments using Ping2Hex approach

Srinivas Kumar Ramdas, Jose Mathew, Ankit Singh Chauhan, Dinesh Rajkumar, Aravind Manoj, Mohit Goel

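AI summary: Proposes ITM 2.0, a Ping2Hex-based intelligent truck matching system that uses probabilistic ranking and a LightGBM model to handle matching when vehicle identifiers are missing from GPS data, substantially improving precision and coverage.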

Comments 12 pages, 10 figures, 8 tables. Accepted at iSCSi 2026 (International Conference on Industry Sciences and Computer Sciences Innovation). To appear in Procedia Computer Science (Elsevier)

Journal ref ISCSI(2026)

Abstract

Accurate truck-to-shipment matching using GPS data is foundational for full truckload supply chain visibility, enabling real-time tracking and accurate estimated time of arrival (ETA) predictions. However, missing or corrupted vehicle identifiers prevent traditional matching approaches from working, leaving shipments without visibility. This paper presents Intelligent Truck Matching (ITM) 2.0, a machine learning system that addresses this critical gap by formulating matching as a probabilistic ranking problem. Our approach leverages Uber H3 hexagonal spatial indexing to discretize GPS pings into route similarity features, combined with temporal information, then applies LightGBM gradient boosting with threshold-based post-processing. Through rigorous evaluation including offline model selection (SVM, XGBoost, LightGBM), comprehensive ablation studies, and production shadow testing, we demonstrate substantial gains over rule-based baselines. ITM 2.0 achieves a 26-percentage-point precision improvement in North America and a 14-point improvement in Europe, while doubling coverage. Deployed in production at Project44 handling full truckload shipments, the system demonstrates robustness to geocoding errors up to 1 km, multiple candidate trucks, and sparse pings.
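To make the idea of "discretizing GPS pings into route similarity features" concrete, here is a minimal sketch that scores candidate trucks by the overlap between the set of hexagonal cells on a shipment's planned route and the set of cells each truck's pings fall into. Everything in it is an assumption for illustration: the cell IDs are placeholder strings (in practice they would come from an H3 library), Jaccard overlap stands in for whatever similarity features the paper actually uses, and the simple sort replaces the LightGBM ranking model and threshold post-processing.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two sets of cell IDs (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(
    route_cells: set[str],
    truck_cells_by_id: dict[str, set[str]],
) -> list[tuple[str, float]]:
    """Return (truck_id, similarity) pairs, most route-like truck first."""
    scored = [
        (truck_id, jaccard(route_cells, cells))
        for truck_id, cells in truck_cells_by_id.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical cell IDs standing in for real hexagon indexes.
route = {"c1", "c2", "c3", "c4"}
trucks = {
    "truck_a": {"c1", "c2", "c3", "c9"},  # mostly on the planned route
    "truck_b": {"c7", "c8"},              # somewhere else entirely
}
ranking = rank_candidates(route, trucks)
```

Working on sets of cells rather than raw coordinates is what makes the comparison tolerant of small position errors: two pings a few hundred meters apart can still land in the same cell, depending on the cell resolution chosen.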

2605.06415 2026-05-26 cs.LG cs.AI cs.CL cs.CV

E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology

Qingjun Zhang

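AI summary: Proposes the dimensionless control parameter E = T*H/(O+B); 12 controlled experiments indicate that E >= 0.5 guarantees no dead experts in Mixture-of-Experts models, and six additional findings are reported, including expert resuscitation and dataset-dependent ortho toxicity.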

Comments 12 experiments, 11,000+ training epochs, cross-modal validation (vision + language). Extended version of the Claude-in-the-Loop ecology framework

Details
AI Chinese abstract

我们引入E = T*H/(O+B),这是一个无量纲控制参数,用于预测混合专家(MoE)模型是否会发展出健康的专家生态还是陷入死亡专家。E将四个超参数——路由温度T、路由熵权重H、先知权重O和平衡权重B——组合成一个单一量。通过12个控制实验(8个视觉,4个语言),总计超过11,000个训练周期,我们确定仅E ≥ 0.5就足以保证零死亡专家,消除了手工设计负载平衡辅助损失的必要性。我们在CIFAR-10、CIFAR-100、TinyImageNet-200、WikiText-2和WikiText-103上跨模态验证了这一点。另外还发现了六项结果:(1)死亡专家可以复活——由平衡损失驱动路由器重新探索触发;(2)正交毒性依赖于数据集,并非普遍存在;(3)任务复杂性改变了临界E阈值;(4)模型过拟合与专家生态健康解耦;(5)三层MoE自发崩溃为两层功能结构;(6)生态结构在50倍温度范围内保持不变。我们提出E作为MoE训练的统一诊断指标,类似于流体力学中的雷诺数。

English abstract

We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters -- routing temperature T, routing entropy weight H, oracle weight O, and balance weight B -- into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate -- triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.
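
The control parameter is simple enough to compute directly. The sketch below follows the formula and the 0.5 threshold stated in the abstract; the function names and the example hyperparameter values are hypothetical, and the abstract itself notes that task complexity shifts the critical threshold.

```python
E_THRESHOLD = 0.5  # threshold reported in the abstract


def ecology_parameter(T: float, H: float, O: float, B: float) -> float:
    """Return E = T*H/(O+B).

    T: routing temperature, H: routing entropy weight,
    O: oracle weight, B: balance weight.
    """
    denom = O + B
    if denom <= 0:
        raise ValueError("O + B must be positive for E to be defined")
    return T * H / denom


def predicts_no_dead_experts(T: float, H: float, O: float, B: float) -> bool:
    """True when E meets the reported threshold for a healthy expert ecology."""
    return ecology_parameter(T, H, O, B) >= E_THRESHOLD


# Hypothetical configurations, chosen only to land on either side of the threshold.
healthy = predicts_no_dead_experts(T=2.0, H=0.5, O=1.0, B=1.0)
at_risk = predicts_no_dead_experts(T=1.0, H=0.25, O=1.0, B=1.0)
```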

2605.04295 2026-05-26 cs.LG cs.AI

LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy

通过自适应共形语义熵进行LLM不确定性量化

Hamed Karimi, Vaishali Meyappan, Reza Samavi

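AI summary Proposes Adaptive Conformal Semantic Entropy (ACSE), which clusters sampled responses to compute semantic entropy, adaptively adjusts the uncertainty score, and applies conformal calibration for statistically reliable accept/abstain decisions; it outperforms existing baselines on multiple datasets.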

Comments Accepted for publication in the Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026); 14 Pages

Details
AI Chinese abstract

LLMs的过度自信,特别是在产生幻觉时,对在安全关键环境中部署模型构成了重大挑战,并使得对不确定性进行可靠估计成为必要。现有的不确定性量化方法通常优先考虑词汇或概率度量;然而,这些技术往往忽略了具有相似含义的不同响应的语义差异。在本文中,我们提出了自适应共形语义熵(ACSE),一种通过自适应测量LLMs输出中的语义分散性来估计提示级不确定性的方法。我们的不确定性评分函数基于对同一提示的多个不同响应的语义熵进行聚类。该函数根据每个聚类的语义特征自适应调整不确定性分数。为了确保我们分数的统计可靠性,我们使用共形校准应用决策规则来接受/弃权提示,提供了有限样本、无分布的保证,使得接受响应中的错误率保持在用户指定的容差范围内。我们使用不同LLMs和数据集进行的广泛实验评估表明,我们的方法在判别性能、共形保证和概率校准指标方面始终优于最先进的不确定性量化基线。作为一个亮点,对于TriviaQA数据集,我们方法的AUROC为0.88,而令牌熵方法为0.65。

English abstract

LLMs' overconfidence, particularly when hallucinating, poses a significant challenge for deploying the models in safety-critical settings and makes reliable uncertainty estimation necessary. Existing approaches for uncertainty quantification typically prioritize lexical or probabilistic measures; however, these techniques often ignore the semantic variance of different responses with similar meaning. In this paper, we propose Adaptive Conformal Semantic Entropy (ACSE), a method for estimating prompt-level uncertainty by adaptively measuring semantic dispersion in LLM outputs. Our uncertainty scoring function is based on clustering semantic entropy of multiple diverse responses to the same prompt. The function adaptively adjusts the uncertainty score based on semantic features of each cluster. To ensure the statistical reliability of our score, we use conformal calibration to apply a decision rule that accepts or abstains on each prompt, providing a finite-sample, distribution-free guarantee that the error rate among the accepted responses remains bounded by a user-specified tolerance. Our extensive experimental evaluations using different LLMs and datasets demonstrate that our approach consistently outperforms state-of-the-art uncertainty quantification baselines on discriminative performance, conformal guarantees, and probabilistic calibration indicators. As a highlight, on the TriviaQA dataset, the AUROC of our approach is 0.88, compared to 0.65 for the token entropy approach.
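
The basic quantity behind this method can be sketched in a few lines. The code below computes plain semantic entropy over cluster labels and applies a fixed accept/abstain threshold. It assumes the sampled responses have already been grouped into meaning-equivalent clusters by some external model, and it does not reproduce ACSE's adaptive adjustment or its conformal calibration; `semantic_entropy`, `accept_prompt`, and the threshold value are hypothetical.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids) -> float:
    """Entropy (in nats) of the empirical distribution over semantic clusters.

    cluster_ids holds one cluster label per sampled response to the same prompt.
    """
    n = len(cluster_ids)
    if n == 0:
        raise ValueError("need at least one sampled response")
    counts = Counter(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def accept_prompt(cluster_ids, threshold: float) -> bool:
    """Accept when dispersion is at or below the threshold, otherwise abstain."""
    return semantic_entropy(cluster_ids) <= threshold


# Four samples, three of which share one meaning: low but non-zero dispersion.
decision = accept_prompt([0, 0, 0, 1], threshold=0.7)
```

In a conformal setup the threshold would be chosen from held-out calibration data rather than fixed by hand.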

2605.03509 2026-05-26 cs.CV cs.AI

BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement

BFORE: 蝴蝶-萤火虫优化的Retinex增强用于低光图像质量提升

Ahmed Cherif

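AI summary Proposes the BFORE framework, which combines the Butterfly Optimization Algorithm and the Firefly Algorithm to search automatically for the best Retinex enhancement parameters by maximizing a Gaussian Naturalness Score, substantially improving low-light image quality.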

Details
AI Chinese abstract

低光图像存在可见度差、噪声和颜色失真问题。现有的基于Retinex的增强方法依赖手动调整参数,无法泛化到不同光照条件。本文提出BFORE(蝴蝶-萤火虫优化的Retinex增强),一个自动为每张图像寻找最佳增强参数的框架。BFORE分两阶段工作:(1)蝴蝶优化算法(BOA)搜索最优的多尺度Retinex带颜色恢复(MSRCR)参数,然后(2)萤火虫算法(FA)微调伽马校正、去噪和颜色参数。两个阶段都最大化高斯自然度评分(GNS),一种衡量增强图像自然度的无参考指标。标准质量指标(PSNR、SSIM、NIQE)仅在优化后计算,确保零数据泄露。在30对合成图像上,BFORE达到GNS=0.971,优于次优方法MSRCR(0.894)8.6%。在来自LOL数据集的115张真实图像上,BFORE达到GNS=0.887,优于MSRCR(0.808)9.8%。与三个在相同条件下训练的深度学习基线(Zero-DCE、SCI、IAT)进行受控比较,BFORE在GNS上超过最佳深度学习方法14.7%。消融研究证实,混合BOA+FA策略显著优于单独使用每种优化器,而在三个评估预算下的可扩展性分析表明,一旦计算资源可用,结构化优化器显著优于均匀随机采样(128次评估时p=0.009,300次评估时p=0.021)。所有改进均具有统计显著性(Wilcoxon符号秩检验p<0.0001)。每张图像在CPU上的处理时间为3-6分钟,适用于离线应用。

English abstract

Low-light images suffer from poor visibility, noise, and color distortion. Existing Retinex-based enhancement methods rely on manually tuned parameters that do not generalize across different lighting conditions. This paper proposes BFORE (Butterfly-Firefly Optimized Retinex Enhancement), a framework that automatically finds the best enhancement parameters for each image. BFORE works in two phases: (1) a Butterfly Optimization Algorithm (BOA) searches for optimal Multi-Scale Retinex with Color Restoration (MSRCR) parameters, then (2) a Firefly Algorithm (FA) fine-tunes gamma correction, denoising, and color parameters. Both phases maximize a Gaussian Naturalness Score (GNS), a no-reference metric that measures how natural the enhanced image looks. Standard quality metrics (PSNR, SSIM, NIQE) are computed only after optimization, ensuring zero data leakage. On 30 synthetic image pairs, BFORE achieves GNS = 0.971, outperforming the next-best method MSRCR (0.894) by 8.6%. On 115 real images from the LOL dataset, BFORE achieves GNS = 0.887, outperforming MSRCR (0.808) by 9.8%. A controlled comparison with three deep learning baselines (Zero-DCE, SCI, IAT) trained under identical conditions shows BFORE surpasses the best DL method by 14.7% in GNS. An ablation study confirms that the hybrid BOA+FA strategy significantly outperforms each optimizer in isolation, and a scalability analysis at three evaluation budgets shows that the structured optimizer significantly outperforms uniform random sampling once compute is available (p = 0.009 at 128 evaluations, p = 0.021 at 300 evaluations). All improvements are statistically significant (p < 0.0001, Wilcoxon signed-rank test). Processing time is 3-6 minutes per image on CPU, suitable for offline applications.
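
The percentages quoted in this abstract are relative gains over the baseline GNS, not absolute differences. A short check using only the figures given in the abstract makes this explicit; `relative_gain` is a hypothetical helper name.

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# GNS values quoted in the abstract: BFORE versus MSRCR.
synthetic_gain = relative_gain(0.971, 0.894)  # 30 synthetic image pairs
lol_gain = relative_gain(0.887, 0.808)        # 115 real images from LOL

print(f"synthetic: {synthetic_gain:.1f}%  LOL: {lol_gain:.1f}%")
```

Rounded to one decimal place, the two values reproduce the 8.6% and 9.8% stated in the abstract.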

2605.02044 2026-05-26 cs.LG

NeuroViz: Real-time Interactive Visualization of Forward and Backward Passes in Neural Network Training

NeuroViz:神经网络训练中前向和后向传播的实时交互式可视化

Tanvi Sharma, Reza Rawassizadeh

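AI summary Presents NeuroViz, an interactive visualization tool that shows activations, weight updates, loss progression, and per-neuron equations in real time during fully connected neural network training, substantially improving training transparency and interpretability.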

Comments 9 pages, 4 figures, 6 tables

Details
AI Chinese abstract

训练神经网络难以解释,尤其对于新手。我们介绍了NeuroViz,一个交互式可视化工具,支持全连接神经网络训练的实时探索。用户可以配置网络架构、激活函数、学习率和数据集,然后观察激活值、权重更新和损失进展。NeuroViz将权重变化与前后向传播中的激活信号直接对应可视化,使用户能够区分单个epoch内的更新前后状态,并查看动态更新的逐神经元方程。我们与31名参与者进行了对比用户研究,与六个已有的可视化工具相比,NeuroViz获得了最高的可用性评分(SUS 80.97,属于“优秀”范围),清晰度平均排名2.47,有用性平均排名2.23(越低越好)。超过70%的参与者报告说,可视化显著提高了他们对神经网络训练透明度的感知。实现实例可在https://neuroviz.org访问。

English abstract

Training neural networks is difficult to interpret, particularly for newcomers. We introduce NeuroViz, an interactive visualization tool that supports real-time exploration of fully connected neural network training. Users can configure network architecture, activation functions, learning rates, and datasets, then observe activations, weight updates, and loss progression. NeuroViz visualizes weight changes in direct correspondence with activation signals in both forward and backward passes, enabling users to distinguish pre- and post-update states within individual epochs and view dynamically updating per-neuron equations. We conduct a comparative user study with 31 participants against six established visualization tools, in which NeuroViz achieved the highest usability score (SUS 80.97, in the 'excellent' range), with mean rankings of 2.47 for clarity and 2.23 for usefulness (lower is better). Over 70% of participants reported that the visualizations substantially increased their perception of neural network training transparency. The implemented instance is accessible at https://neuroviz.org.
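
For readers unfamiliar with the System Usability Scale (SUS) score quoted in this abstract, the conventional scoring rule for one respondent can be sketched as follows. The abstract does not describe how the study handled its questionnaire, so this shows only the standard 10-item formula; `sus_score` is a hypothetical name.

```python
def sus_score(responses) -> float:
    """Score one 10-item SUS questionnaire; each response is an integer 1..5.

    Odd-numbered items are positively worded and contribute (response - 1);
    even-numbered items are negatively worded and contribute (5 - response).
    The sum (0..40) is scaled by 2.5 to a 0..100 range.
    """
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("expected ten responses, each between 1 and 5")
    total = 0
    for item, r in enumerate(responses, start=1):
        total += (r - 1) if item % 2 == 1 else (5 - r)
    return total * 2.5


# A study-level score such as the one reported is the mean over respondents.
neutral = sus_score([3] * 10)
```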

2605.02037 2026-05-26 cs.RO cs.AI

VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation

VILAS:一种集成软抓取的VLA低成本机器人操作架构

Zijian An, Hadi Khezam, Bill Cai, Ran Yang, Shijie Geng, Yiming Feng, Yue Zheng, Lifeng Zhou

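AI summary Presents VILAS, a low-cost modular robotic manipulation platform with an integrated soft grasping mechanism that supports end-to-end VLA policy learning and deployment, validated on a grape grasping task.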

Details
AI Chinese abstract

我们提出了VILAS,一个完全低成本、模块化的机器人操作平台,旨在支持端到端视觉-语言-动作(VLA)策略学习并在可访问硬件上部署。该系统集成了法如FR5协作臂、Jodell RG52-50电动夹爪和双摄像头感知模块,通过基于ZMQ的通信架构统一协调遥操作、数据收集和策略部署于单一框架内。为了在不依赖显式力传感的情况下安全操作易碎物体,我们设计了一种基于kirigami的软柔性夹爪扩展件,在压缩载荷下产生可预测变形,提供对脆弱目标的温和且可重复接触。我们在VILAS平台上部署并评估了三种最先进的VLA模型:pi_0、pi_0.5和GR00T N1.6。所有模型均使用通过我们的遥操作流水线收集的相同演示数据集,从公开发布的预训练检查点进行微调。在葡萄抓取任务上的实验验证了所提系统的有效性,证实了有能力的操作策略可以在低成本模块化硬件上成功训练和部署。我们的结果进一步为当前VLA模型在真实环境中的部署特性提供了实践见解。

English abstract

We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) policy learning and deployment on accessible hardware. The system integrates a Fairino FR5 collaborative arm, a Jodell RG52-50 electric gripper, and a dual-camera perception module, unified through a ZMQ-based communication architecture that seamlessly coordinates teleoperation, data collection, and policy deployment within a single framework. To enable safe manipulation of fragile objects without relying on explicit force sensing, we design a kirigami-based soft compliant gripper extension that induces predictable deformation under compressive loading, providing gentle and repeatable contact with delicate targets. We deploy and evaluate three state-of-the-art VLA models on the VILAS platform: pi_0, pi_0.5, and GR00T N1.6. All models are fine-tuned from publicly released pretrained checkpoints using an identical demonstration dataset collected via our teleoperation pipeline. Experiments on a grape grasping task validate the effectiveness of the proposed system, confirming that capable manipulation policies can be successfully trained and deployed on low-cost modular hardware. Our results further provide practical insights into the deployment characteristics of current VLA models in real-world settings.

2605.00908 2026-05-26 cs.CV

Evaluation of Convolutional and Transformer-Based Detectors for Weed Detection in Tomato Plantations

卷积与基于Transformer的检测器在番茄种植园杂草检测中的评估

Alcides Toledo Espinosa, Gerardo Antonio Álvarez Hernández, Ángel Eduardo Zamora-Suárez, Miguel Bolaños, Juan Irving Vásquez

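AI summary Compares CNN-based and transformer-based object detection architectures for early weed detection in tomato plantations, revealing a trade-off between efficiency and contextual modeling.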

Comments 7 pages, 3 figures, and 1 table

Details
AI Chinese abstract

本文对卷积和基于Transformer的目标检测架构在番茄种植园早期杂草检测中进行了比较评估。考虑了每种范式的代表性模型,包括YOLOv26-nano(YOLO系列的最新变体)以及作为基于Transformer架构的RT-DETR Large和RF-DETR Medium。评估在GROUNDBASED_WEED数据集上进行,考虑了六个杂草类别和一个对应于未识别植物的额外类别,从而能够使用精度、召回率、平均精度和推理速度等指标以及非参数统计检验来评估检测准确性和计算效率方面的性能。结果突出了效率与上下文建模之间的明显权衡:基于CNN的检测器以较低的计算成本实现了高性能,而基于Transformer的方法以更高的资源需求为代价提供了更好的全局上下文捕获。这些结果为精准农业应用中的模型选择提供了实用标准。

English abstract

This paper presents a comparative evaluation of convolutional and transformer-based object detection architectures for early weed detection in tomato plantations. Representative models from each paradigm are considered, including YOLOv26-nano, a recent variant of the YOLO family, and RT-DETR Large and RF-DETR Medium as transformer-based architectures. The evaluation was conducted on the GROUNDBASED_WEED dataset, considering six weed classes and an additional category corresponding to unidentified plants, which allowed for the assessment of performance in terms of detection accuracy and computational efficiency using metrics such as precision, recall, average precision, and inference speed, as well as non-parametric statistical tests. The results highlight a clear trade-off between efficiency and contextual modeling: CNN-based detectors achieve high performance at a lower computational cost, while transformer-based approaches offer better global context capture at the expense of higher resource demands. These results provide practical criteria for model selection in precision agriculture applications.

2604.24517 2026-05-26 cs.LG cs.GT

Prior-Agnostic Robust Forecast Aggregation

先验无关的鲁棒预测聚合

Zhi Chen, Cheng Peng, Wei Tang

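AI summary For robust forecast aggregation with an unknown state space and prior, proposes an explicit closed-form log-odds aggregator that linearly pools forecasts in logit space, with nearly tight minimax-regret guarantees across three knowledge regimes.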

Details
AI Chinese abstract

鲁棒预测聚合结合多个信息源的预测,以在所有可能信息结构的最坏情况下表现良好。以往工作主要关注已知二元状态空间(状态为0或1)的设置。我们研究先验无关的鲁棒预测聚合,其中聚合器仅观察专家的报告,但对底层联合信息结构和完整先验(包括底层状态空间)一无所知。与固定二元状态空间{0,1}的标准模型不同,我们允许(二元)未知状态值为[0,1]中的任意数,因此相同的报告概率可能对应不同环境中截然不同的实现结果频率。 我们的主要贡献是一个简单、显式、闭式的对数几率聚合器,它在对数几率空间线性池化预测,并在三种知识体制下给出(近乎)紧的极小极大遗憾界。我们首先证明,在条件独立(CI)信号下,通过建立更大的下界,未知状态空间的鲁棒聚合比已知状态设置严格更难,并且我们的聚合规则可以实现0.0255的最坏情况遗憾。在此过程中,我们还刻画了Blackwell有序结构和一般信息结构的紧遗憾界。在经典设置(已知状态空间{0,1})中,我们的聚合器在CI结构下实现严格低于0.0226的遗憾。据我们所知,这是第一个实现严格低于0.0226遗憾上界的显式闭式聚合器。最后,我们扩展模型,使聚合器额外知道每个专家的边际预测分布;在此设置下,对于CI结构,我们证明广义对数几率规则实现0.0228的遗憾,并补充了0.0225的下界。

English abstract

Robust forecast aggregation combines the predictions of multiple information sources to perform well in the worst case across all possible information structures. Previous work largely focuses on settings with a known binary state space, where the state is either 0 or 1. We study prior-agnostic robust forecast aggregation in which the aggregator observes only experts' reports, yet is ignorant of both the underlying joint information structure and the full prior, including the underlying state space. Unlike the standard model that fixes the binary state space {0, 1}, we allow the (binary) unknown state values to be arbitrary numbers in [0, 1], so the same reported probability may correspond to very different realized outcome frequencies across environments. Our main contribution is a simple, explicit, closed-form log-odds aggregator that linearly pools forecasts in logit space, together with (nearly-)tight minimax-regret guarantees across three knowledge regimes. We first show that under conditionally independent (CI) signals, robust aggregation with an unknown state space is strictly harder than in the known-state setting by establishing a larger lower bound, and our aggregation rule can achieve a worst-case regret of 0.0255. Along the way, we also characterize tight regret bounds for Blackwell-ordered structures and for general information structures. In the classical setting with known state space {0, 1}, our aggregator achieves regret strictly below 0.0226 for CI structures. To the best of our knowledge, this is the first explicit closed-form aggregator that achieves a regret upper bound strictly less than 0.0226. Finally, we extend the model to the case where the aggregator additionally knows each expert's marginal forecast distribution; in this setting, for CI structures, we show that a generalized log-odds rule achieves a regret of 0.0228, complemented by a lower bound of 0.0225.
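
Linear pooling in logit space, the general operation this abstract refers to, can be sketched as below. The abstract does not give the paper's coefficients or closed form, so the weights here are hypothetical placeholders and the function is a generic log-odds pool, not the proposed aggregator.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(z: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds_pool(forecasts, weights=None) -> float:
    """Combine probability forecasts by a weighted sum of their log-odds.

    With no weights given, each forecast gets weight 1/n (a plain average in
    logit space). Weights summing to more than 1 extrapolate beyond the
    individual forecasts.
    """
    if not forecasts or any(p <= 0.0 or p >= 1.0 for p in forecasts):
        raise ValueError("forecasts must be non-empty and lie strictly in (0, 1)")
    if weights is None:
        weights = [1.0 / len(forecasts)] * len(forecasts)
    z = sum(w * logit(p) for w, p in zip(weights, forecasts))
    return sigmoid(z)


averaged = log_odds_pool([0.8, 0.8])                          # stays near 0.8
extrapolated = log_odds_pool([0.8, 0.8], weights=[1.0, 1.0])  # moves toward 1
```

The second call shows why the choice of weights matters: summing the log-odds of two agreeing experts yields a more extreme aggregate than either report.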

2604.22948 2026-05-26 cs.LG stat.CO stat.ML

Score-Repellent Monte Carlo: Toward Efficient Non-Markovian Sampler with Constant Memory in General State Spaces

分数排斥蒙特卡洛:面向一般状态空间中具有恒定内存的高效非马尔可夫采样器

Jie Hu, Lingyun Chen, Geeho Kim, Jinyoung Choi, Bohyung Han, Do Young Eun

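AI summary Proposes the Score-Repellent Monte Carlo (SRMC) framework, which summarizes trajectory history by a running average of score evaluations and builds a surrogate target through an exponential score tilt, giving a constant-memory non-Markovian sampler with lower asymptotic variance and better mode coverage.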

Comments Accepted at ICML 2026 (Spotlight); GitHub Repo: https://github.com/srmc-project/Score-Repellent-Monte-Carlo

Details
AI Chinese abstract

历史依赖采样可以通过阻止冗余重访来降低长期蒙特卡洛方差,但现有方案通常通过有限状态空间上的经验度量编码历史,这在高维离散配置空间中不可行或在连续域中不适定。我们提出分数排斥蒙特卡洛(SRMC)框架,该框架通过 $\mathbb{R}^d$ 中分数评估的运行平均值总结轨迹历史,其中 $d$ 是分数和状态表示的维度。该历史通过指数分数倾斜转换为替代目标,以 $α$ 为索引,表示排斥强度,控制基于历史的排斥幅度。替代族在标准MCMC意义上是无需归一化的,从而产生一个通用包装器:在每次迭代中,任何针对 $π$ 的基础核都可以在当前替代 $π_{θ_n}$ 上运行,同时在线更新历史。我们使用带有受控马尔可夫噪声的随机逼近分析历史递归和蒙特卡洛估计器的耦合演化,建立了几乎必然收敛和联合中心极限定理。我们进一步确定了渐近协方差随 $α$ 增加而减小的区域,缩放比例为 $O(1/α)$,将有限状态历史依赖采样器的近零方差效应扩展到具有恒定内存的一般状态空间。在连续目标和离散能量基模型上的实验表明,估计器方差和模式覆盖得到改善,同时保持 $O(d)$ 内存使用和适度的每次迭代开销。

English abstract

History-dependent sampling can reduce long-run Monte Carlo variance by discouraging redundant revisits, but existing schemes typically encode history through an empirical measure on finite state spaces, which is infeasible in high-dimensional discrete configuration spaces or ill-posed in continuous domains. We propose the Score-Repellent Monte Carlo (SRMC) framework, which summarizes trajectory history by a running average of score evaluations in $\mathbb{R}^d$, where $d$ is the dimension of the score and state representation. This history is converted into a surrogate target through an exponential score tilt, indexed by $α$, the repellence strength that controls the magnitude of the history-based repulsion. The surrogate family is normalization-free in the standard MCMC sense, yielding a generic wrapper: at each iteration, any base kernel targeting $π$ can instead be run on the current surrogate $π_{θ_n}$ while the history is updated online. We analyze the coupled evolution of the history recursion and Monte Carlo estimators using stochastic approximation with controlled Markovian noise, establishing almost sure convergence and a joint central limit theorem. We further identify regimes in which the asymptotic covariance decreases as $α$ increases, with scaling $O(1/α)$, extending the near-zero-variance effect of finite-state history-dependent samplers to general state spaces with constant memory. Experiments on continuous targets and discrete energy-based models demonstrate improved estimator variance and mode coverage, while retaining $O(d)$ memory usage and modest per-iteration overhead.
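
The constant-memory claim rests on the fact that a running average can be updated in place without storing past samples. The sketch below shows only that incremental update, with equal weighting of all past scores as an assumption; the paper's actual step sizes, the form of the exponential tilt, and the base kernel are not given in the abstract and are not reproduced here. `update_running_mean` is a hypothetical name.

```python
def update_running_mean(theta, score, n):
    """Fold one new score vector into the mean of the previous n vectors.

    theta: current mean (length d), score: new score evaluation (length d),
    n: number of vectors already averaged. Only d numbers are kept in memory.
    """
    if len(theta) != len(score):
        raise ValueError("theta and score must have the same dimension")
    return [t + (s - t) / (n + 1) for t, s in zip(theta, score)]


# Hypothetical stream of score evaluations in R^2.
scores = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
theta = [0.0, 0.0]
for n, s in enumerate(scores):
    theta = update_running_mean(theta, s, n)
```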

2604.13088 2026-05-26 cs.LG cs.AI

Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation

序列级奖励的组内学习设计条件:令牌梯度消除

Fei Ding, Yongkang Zhang, youwei wang, Zijian Zeng

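AI summary To address the credit-assignment problem caused by sparse terminal rewards in multi-step LLM reasoning, proposes a counterfactual-comparison framework and Implicit Behavior Policy Optimization (IBPO), which approximates alternative decisions through trajectory differences and converts sparse rewards into step-sensitive signals, improving training stability and reasoning performance.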

Details
AI Chinese abstract

基于大语言模型的多步推理强化学习通常依赖于稀疏的终端奖励,这导致了不良条件的信用分配问题:最终反馈均匀地传播到所有中间决策。这导致高梯度方差、不稳定的训练和许多无效更新,最终限制了模型的持续改进。我们提出了一种用于信用分配的反事实比较框架。对于每个输入,该框架采样多个推理轨迹,并将它们的差异视为替代决策的隐式近似。这产生了一个隐式过程级优势估计器,将稀疏的终端奖励转化为步骤敏感的学习信号。基于此框架,我们引入了隐式行为策略优化(IBPO),显著提高了数学和代码推理基准上的训练稳定性和性能上限。我们的结果指向了一个有希望的方向,以解锁大语言模型的推理潜力。

English abstract

Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which creates a poorly conditioned credit-assignment problem: the final feedback is propagated uniformly across all intermediate decisions. This leads to high gradient variance, unstable training, and many ineffective updates, ultimately limiting sustained model improvement. We propose a counterfactual-comparison framework for credit assignment. For each input, the framework samples multiple reasoning trajectories and treats their differences as implicit approximations to alternative decisions. This yields an implicit process-level advantage estimator that converts sparse terminal rewards into step-sensitive learning signals. Building on this framework, we introduce Implicit Behavior Policy Optimization (IBPO), which substantially improves training stability and the performance ceiling on mathematical and code-reasoning benchmarks. Our results point to a promising direction for unlocking the reasoning potential of LLMs.

2604.03675 2026-05-26 cs.AI cs.CL cs.IR

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

OASES:面向智能搜索的结果对齐搜索-评估协同训练

Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

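AI summary Proposes the OASES framework, which uses outcome-aligned process rewards and search-evaluation co-training to address sparse rewards and unreliable process supervision in agentic search, outperforming strong reinforcement learning baselines on multi-hop QA benchmarks.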

Details
AI Chinese abstract

智能搜索使语言模型能够通过自适应地多步获取外部证据来解决知识密集型任务。具有可验证奖励的强化学习已成为搜索智能体广泛采用的训练范式,但仅结果奖励是稀疏的,并且对中间搜索动作的信用分配有限。因此,现有的过程奖励方法试图通过代理信号、外部评估器或基于似然的信息增益来密集化监督。然而,代理奖励可能偏离最终结果目标,而固定评估器随着搜索策略的演化可能变得过时,导致不可靠的过程监督。为应对这些挑战,我们提出OASES,一种用于智能搜索的结果对齐搜索-评估监督框架。OASES通过评估每个中间搜索状态对回答原始问题的支持程度,推导出结果对齐的过程奖励。它进一步在策略上协同训练搜索策略和状态评估器,使评估器能够适应演化的搜索行为并提供更可靠的过程奖励。在五个多跳问答基准上的实验表明,OASES始终优于强强化学习基线,进一步分析证实了结果对齐过程奖励和搜索-评估协同训练的优势。

English abstract

Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Reinforcement learning with verifiable rewards (RLVR) has emerged as a widely adopted training paradigm for search agents, yet outcome-only rewards are sparse and provide limited credit assignment for intermediate search actions. Existing process-reward methods therefore seek to densify supervision through proxy signals, external evaluators, or likelihood-based information gain. However, proxy rewards can deviate from the final outcome objective, while fixed evaluators can become stale as the search policy evolves, leading to unreliable process supervision. To address these challenges, we propose OASES, an Outcome-Aligned Search-Evaluation Supervision framework for agentic search. OASES derives outcome-aligned process rewards by evaluating how well each intermediate search state supports answering the original question. It further co-trains the search policy and the state evaluator on policy, allowing the evaluator to adapt to evolving search behavior and provide more reliable process rewards. Experiments on five multi-hop QA benchmarks show that OASES consistently outperforms strong RL baselines, with further analyses confirming the benefits of outcome-aligned process rewards and search-evaluation co-training.

2603.29236 2026-05-26 cs.CV

M2H-MX: Multi-Task Semantic and Geometric Perception for Real-Time Monocular 3D Scene Graph Construction

M2H-MX:用于实时单目3D场景图构建的多任务语义与几何感知

U. V. B. L. Udugama, George Vosselman, Francesco Nex

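AI summary Proposes M2H-MX, a multi-task perception model whose lightweight decoder uses register-gated global context and controlled cross-task interaction so that depth and semantic predictions reinforce each other under strict latency constraints; integrated into monocular SLAM, it substantially improves trajectory accuracy and map quality.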

Comments 6 pages, 5 figures, 5 tables. Preprint under review

Details
AI Chinese abstract

单目相机因其低成本且易于部署而对机器人感知具有吸引力,但从单一图像流实现可靠的实时空间理解仍然具有挑战性。虽然最近的多任务密集预测模型改进了逐像素深度和语义估计,但将这些进展转化为稳定的单目建图系统仍然不简单。本文提出了M2H-MX,一种用于单目空间理解的实时多任务感知模型。该模型保留多尺度特征表示,同时在轻量解码器中引入注册门控全局上下文和受控跨任务交互,使深度和语义预测在严格的延迟约束下相互增强。其输出通过紧凑的感知到建图接口直接集成到未修改的单目SLAM流水线中。我们评估了密集预测精度和系统内性能。在NYUDv2上,M2H-MX-L取得了最先进的结果,与代表性多任务基线相比,语义mIoU提高了6.6%,深度RMSE降低了9.4%。当在ScanNet上的实时单目建图系统中部署时,与强单目SLAM基线相比,M2H-MX将平均轨迹误差降低了60.7%,同时生成更清晰的度量-语义地图。这些结果表明,现代多任务密集预测可以可靠地部署于机器人系统中的实时单目空间感知。

English abstract

Monocular cameras are attractive for robotic perception due to their low cost and ease of deployment, yet achieving reliable real-time spatial understanding from a single image stream remains challenging. While recent multi-task dense prediction models have improved per-pixel depth and semantic estimation, translating these advances into stable monocular mapping systems is still non-trivial. This paper presents M2H-MX, a real-time multi-task perception model for monocular spatial understanding. The model preserves multi-scale feature representations while introducing register-gated global context and controlled cross-task interaction in a lightweight decoder, enabling depth and semantic predictions to reinforce each other under strict latency constraints. Its outputs integrate directly into an unmodified monocular SLAM pipeline through a compact perception-to-mapping interface. We evaluate both dense prediction accuracy and in-the-loop system performance. On NYUDv2, M2H-MX-L achieves state-of-the-art results, improving semantic mIoU by 6.6% and reducing depth RMSE by 9.4% over representative multi-task baselines. When deployed in a real-time monocular mapping system on ScanNet, M2H-MX reduces average trajectory error by 60.7% compared to a strong monocular SLAM baseline while producing cleaner metric-semantic maps. These results demonstrate that modern multi-task dense prediction can be reliably deployed for real-time monocular spatial perception in robotic systems.

2603.18444 2026-05-26 cs.LG cs.AI

Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

折扣Beta-Bernoulli奖励估计用于基于可验证奖励的样本高效强化学习

Haechan Kim, Soohyun Ryu, Gyouk Chu, Doohyuk Jang, Eunho Yang

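AI summary To address the low sample efficiency of reinforcement learning with verifiable rewards, proposes Discounted Beta-Bernoulli reward estimation, which uses historical reward statistics to reduce estimation variance and avoid variance collapse, improving performance on multiple reasoning benchmarks.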

Comments 14 pages, 3 figures

Details
AI Chinese abstract

基于可验证奖励的强化学习已成为提升大语言模型推理能力的有效后训练范式。然而,现有的基于组的RLVR方法常遭受严重的样本低效问题。这种低效源于对少量rollout的奖励进行点估计,导致高估计方差、方差崩溃以及生成响应的无效利用。在本工作中,我们从统计估计角度重新审视RLVR,将奖励建模为从策略诱导分布中抽取的样本,并将优势计算视为从有限数据中估计奖励分布的问题。基于此观点,我们提出折扣Beta-Bernoulli奖励估计,该方法利用历史奖励统计量处理非平稳分布。尽管有偏,所得估计量展现出降低且稳定的方差,理论上避免了估计方差崩溃,并在均方误差上优于标准点估计。在六个分布内和三个分布外推理基准上的大量实验表明,使用DBB的GRPO一致优于朴素GRPO,在1.7B和8B模型上分别实现了分布内平均Acc@8提升3.22/2.42点,分布外提升12.49/6.92点,且无需额外计算成本或内存开销。

English abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta-Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.
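
A discounted Beta-Bernoulli estimator of the kind named in this abstract can be sketched as follows. This is a generic textbook-style version under stated assumptions (the prior pseudo-counts are discounted along with the history, and the discount value is arbitrary); the paper's exact update and how it enters the GRPO advantage are not given in the abstract. The class and method names are hypothetical.

```python
class DiscountedBetaBernoulli:
    """Track a drifting success probability with exponentially discounted counts."""

    def __init__(self, discount: float = 0.9,
                 prior_alpha: float = 1.0, prior_beta: float = 1.0):
        if not 0.0 < discount <= 1.0:
            raise ValueError("discount must be in (0, 1]")
        self.discount = discount
        self.alpha = prior_alpha  # pseudo-count of successes
        self.beta = prior_beta    # pseudo-count of failures

    def update(self, successes: int, failures: int) -> None:
        """Decay old evidence, then add the new rollout outcomes."""
        self.alpha = self.discount * self.alpha + successes
        self.beta = self.discount * self.beta + failures

    def mean(self) -> float:
        """Posterior-mean estimate of the success probability."""
        return self.alpha / (self.alpha + self.beta)

    def bernoulli_variance(self) -> float:
        """Variance p(1-p) of a Bernoulli reward at the estimated mean."""
        p = self.mean()
        return p * (1.0 - p)


# A group of 8 rollouts that are all correct: the within-group sample variance
# is zero, while the smoothed estimate keeps a strictly positive variance.
est = DiscountedBetaBernoulli(discount=0.5)
est.update(successes=8, failures=0)
```

Because the estimate is pulled toward past evidence, it is biased, which matches the bias-for-variance trade described in the abstract.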

2603.17198 2026-05-26 cs.LG cs.CL

Structural Abstraction as an Inductive Bias for Non-Stationary Language Model Training

结构抽象作为非平稳语言模型训练的归纳偏置

Elnaz Rahmati, Nona Ghazizadeh, Zhivar Sourati, Nina Rouhani, Morteza Dehghani

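AI summary Proposes Abstraction-Augmented Training (AAT), which jointly optimizes over concrete instances and their structural abstractions to reduce catastrophic interference and improve relational generalization, supporting structural abstraction as a stabilizing learning signal in non-stationary language model training.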

详情
AI中文摘要

认知科学的一个基本原则认为,智能体不是通过将经验存储为孤立实例来学习,而是通过形成捕捉跨情境共享关系结构的抽象图式来学习。尽管这一主张得到了行为和神经影像研究的充分支持,但其作为语言模型计算训练信号的作用仍未得到充分探索。我们针对非平稳语言模型训练中的这一空白,提出疑问:将学习偏向结构抽象是否能如人类结果所预测的那样减少灾难性干扰并提升关系泛化?为研究这一问题,我们引入了抽象增强训练(AAT),这是一种轻量级的损失级修改,联合优化具体实例及其结构抽象,以及两个基准:关系循环基准(RCB)和叙事抽象基准(NAB)。这些资源将核心认知构造操作化:实体掩码作为关系对齐的计算模拟,谚语作为必须跨表面不同情境推断的隐式抽象意义的载体。我们的实证结果表明,AAT持续减少遗忘并提升泛化,其模式与基于图式学习的认知预测一致。除了对持续学习的实际意义外,这些结果提供了初步的计算证据,表明结构抽象是非平稳环境中稳定学习的信号。

English abstract

A foundational principle in cognitive science holds that intelligent agents do not learn by storing experiences as isolated instances, but by forming abstract schemas that capture relational structure shared across situations. Even though this claim is well supported by behavioral and neuroimaging studies, its role as a computational training signal in language models remains underexplored. We target this gap in the setting of non-stationary language model training, asking whether biasing learning toward structural abstraction reduces catastrophic interference and improves relational generalization, as predicted by human results. To study this question, we introduce Abstraction-Augmented Training (AAT), a lightweight loss-level modification that jointly optimizes over concrete instances and their structural abstractions, and two benchmarks, the Relational Cycle Benchmark (RCB) and the Narrative Abstraction Benchmark (NAB). These resources operationalize core cognitive constructs: entity masking as a computational analog of relational alignment, and proverbs as vehicles for implicit abstract meaning that must be inferred across surface-dissimilar situations. Our empirical results demonstrate that AAT consistently reduces forgetting and improves generalization in a pattern that aligns with cognitive predictions for schema-based learning. Beyond the practical implications for continual learning, these results offer preliminary computational evidence that structural abstraction is a signal for stable learning in non-stationary environments.

2603.17044 2026-05-26 cs.LG cs.AI cs.CV

Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models

理解与生成相冲突吗?统一多模态模型DPO的诊断研究

Abinav Rao, Sujan Rachuri

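AI summary Systematic experiments show that generation quality resists alignment when DPO is applied to a unified multimodal model, mainly because understanding and generation gradients are near-orthogonal with an 11-14x magnitude imbalance driven by VQ token count asymmetry.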

Comments Experiments are inconclusive: The claim that architectures such as Chameleon or Emu would exhibit stronger gradient conflict is not supported by experiments or analysis, and all experiments are conducted on Janus-Pro without evaluation on other unified multimodal architectures

Details
AI Chinese abstract

统一多模态模型共享一个语言模型骨干来同时进行理解和生成图像。DPO能否同时对齐这两种能力?我们首次系统研究了这一问题,在Janus-Pro的1B和7B参数上应用DPO,采用七种训练策略和两种事后方法。核心发现是负面的:在该架构下,所有测试条件下生成质量都抵制DPO对齐。在7B规模下,没有任何方法能改善生成CLIPScore(|Δ| < 0.2,每个种子n=200,3个种子,p > 0.5);在1B规模下,所有方法都降低了生成质量,并且该结果在偏好数据类型(真实vs生成和模型vs模型)以及测试的数据量(150-288对)上均成立。梯度分析揭示了原因:理解和生成梯度近乎正交(cos ~ 0),且由于VQ token数量不对称(576个生成token vs. ~30-100个文本token),幅度不平衡达到约11-14倍。这种不平衡是多任务DPO中的主要干扰机制;幅度平衡产生了方向正确的理解增量(VQA +0.01-0.04,虽然单独不显著),但生成差距仍然存在。我们识别出离散VQ tokenization是一个可能的结构瓶颈——生成DPO损失收敛到ln(2)支持了这一点——并为使用基于VQ的统一模型的从业者提供了实用指导。

English abstract

Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities simultaneously? We present the first systematic study of this question, applying DPO to Janus-Pro at 1B and 7B parameters under seven training strategies and two post-hoc methods. The central finding is negative: generation quality resists DPO alignment across all tested conditions on this architecture. No method improves generation CLIPScore at 7B (|Delta| < 0.2, p > 0.5 at n=200 per seed, 3 seeds); at 1B, all methods degrade generation, and the result holds across preference data types (real-vs-generated and model-vs-model) and the data volumes tested (150-288 pairs). Gradient analysis reveals why: understanding and generation gradients are near-orthogonal (cos ~ 0) with ~11-14x magnitude imbalance driven by VQ token count asymmetry (576 generation tokens vs. ~30-100 text tokens). This imbalance is the dominant interference mechanism in multi-task DPO; magnitude-balancing yields directionally positive understanding deltas (+0.01-0.04 VQA, though individually not significant), but the generation gap persists regardless. We identify discrete VQ tokenization as a likely structural bottleneck -- supported by the generation DPO loss converging to ln(2) -- and provide practical guidance for practitioners working with VQ-based unified models.
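
Two quantities in this abstract have simple closed forms worth making concrete. The standard per-pair DPO loss equals ln(2) exactly when the implicit reward margin is zero, so a loss that settles at ln(2) indicates the policy is not separating chosen from rejected samples relative to the reference model. Gradient alignment is measured by cosine similarity. The sketch below uses plain Python lists in place of real gradients; the function names, the `beta` value, and the toy vectors are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def cosine(u, v) -> float:
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Policy identical to the reference: zero margin, loss of ln(2).
loss_at_zero_margin = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Toy orthogonal "gradients" with a 12x difference in norm.
alignment = cosine([1.0, 0.0], [0.0, 12.0])
```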

2603.10267 2026-05-26 cs.CV

A Robust Deep Learning Framework for Bangla License Plate Recognition Using YOLO and Vision-Language OCR

基于YOLO和视觉语言OCR的孟加拉车牌识别鲁棒深度学习框架

Nayeb Hasin, Md. Arafath Rahman Nishat, Mainul Islam, Khandakar Shakib Al Hasan, Asif Newaz

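AI summary Proposes a robust Bangla license plate recognition system that combines two-stage adaptive YOLOv8 training with a ViT + BanglaBERT vision-language OCR model, reaching 97.83% accuracy for plate localization and a character error rate of 0.1323 for character recognition.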

Comments Accepted at the 2026 IEEE International Conference on AI and Data Analytics (ICAD 2026). Final version will appear in IEEE Xplore

Details
AI Chinese abstract

自动车牌识别(ALPR)系统是智能交通管理系统的关键组成部分。然而,由于复杂的字符方案和不均匀的布局,孟加拉车牌检测仍然具有挑战性。本文提出了一种鲁棒的孟加拉车牌识别系统,该系统将基于深度学习的车牌定位目标检测模型与用于文本提取的光学字符识别相结合。比较了多种目标检测架构,包括U-Net和几种YOLO(You Only Look Once)变体,用于车牌定位。本研究提出了一种基于YOLOv8架构的新型两阶段自适应训练策略,以提高定位性能。所提出的方法优于现有模型,达到了97.83%的准确率和91.3%的交并比(IoU)。文本识别问题被表述为基于视觉编码器-解码器架构的序列生成问题,并评估了编码器-解码器的组合。结果表明,ViT + BanglaBERT模型在字符级别上取得了更好的结果,字符错误率为0.1323,词错误率为0.1068。所提出的系统在为此研究目的整理的外部数据集上进行测试时也表现出一致的性能。与训练样本相比,该数据集提供了完全不同的环境和光照条件,表明了所提出框架的鲁棒性。总体而言,我们提出的系统为孟加拉车牌识别提供了鲁棒且可靠的解决方案,并在各种真实场景中有效运行,包括光照、噪声和车牌样式的变化。这些优势使其非常适合部署在智能交通应用中,如自动执法和访问控制。

English abstract

An Automatic License Plate Recognition (ALPR) system is a crucial element of an intelligent traffic management system. However, detecting Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. Text recognition is framed as a sequence generation problem with a VisionEncoderDecoder architecture, and several encoder-decoder combinations are evaluated. The ViT + BanglaBERT model gives better results at the character level, with a Character Error Rate of 0.1323 and a Word Error Rate of 0.1068. The proposed system also shows consistent performance when tested on an external dataset curated for this study. That dataset offers completely different environment and lighting conditions from the training samples, indicating the robustness of the proposed framework. Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
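For readers unfamiliar with the metrics quoted above, the following sketch shows how Intersection over Union, Character Error Rate, and Word Error Rate are conventionally computed. It is not the authors' evaluation code. The box format `(x1, y1, x2, y2)` and the function names are assumptions, and characters are compared as Unicode code points; for Bangla script the choice between code points and grapheme clusters changes the CER, and the abstract does not say which was used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Placeholder strings and boxes, not data from the paper.
example_cer = cer("abcd", "abcf")
example_wer = wer("a b c", "a x c")
example_iou = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Both error rates are normalized by the reference length, so they can exceed 1.0 when the prediction contains many insertions.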

2603.09458 2026-05-26 cs.RO

Stein Variational Ergodic Surface Coverage with SE(3) Constraints

Stein变分遍历曲面覆盖与SE(3)约束

Jiayun Li, Yufeng Jin, Sangli Teng, Dejian Gong, Georgia Chalvatzaki

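AI summary: Proposes a sampling-as-optimization method based on preconditioned SE(3) Stein variational gradient descent that generates ergodic trajectories satisfying SE(3) constraints, achieving high-quality coverage of complex 3D point-cloud surfaces.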

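AI abstract (translated from Chinese)

Surface manipulation tasks require robots to generate trajectories that fully cover complex 3D surfaces while maintaining precise end-effector poses. Existing ergodic trajectory optimization (TO) methods perform well on coverage tasks but have difficulty with point-cloud targets, owing to nonconvex optimization landscapes and inadequate handling of SE(3) constraints in sampling-as-optimization (SAO) techniques. In this work, we introduce a preconditioned SE(3) Stein variational gradient descent (SVGD) method for SAO ergodic trajectory generation. The proposed method includes several innovations. First, we reformulate point-cloud ergodic coverage as a manifold-aware sampling problem. Second, we derive SE(3)-specific SVGD particle updates. Third, we develop a preconditioner to accelerate TO convergence. Compared with strong optimization-based baselines and SAO baselines, our sampling-based framework consistently finds better local optima while preserving the SE(3) geometric structure. Experiments on a 3D point-cloud surface coverage benchmark and a robotic surface drawing task show that, relative to existing TO and SAO methods, our method achieves better coverage quality with tractable computation in our setting, and it is validated in real-robot experiments.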

English abstract

Surface manipulation tasks require robots to generate trajectories that comprehensively cover complex 3D surfaces while maintaining precise end-effector poses. Existing ergodic trajectory optimization (TO) methods succeed in coverage tasks but struggle with point-cloud targets because of nonconvex optimization landscapes and the inadequate handling of SE(3) constraints in sampling-as-optimization (SAO) techniques. In this work, we introduce a preconditioned SE(3) Stein Variational Gradient Descent (SVGD) approach for SAO ergodic trajectory generation. Our proposed approach comprises multiple innovations. First, we reformulate point-cloud ergodic coverage as a manifold-aware sampling problem. Second, we derive SE(3)-specific SVGD particle updates. Third, we develop a preconditioner to accelerate TO convergence. Our sampling-based framework consistently identifies superior local optima compared to strong optimization-based and SAO baselines while preserving the SE(3) geometric structure. Experiments on a 3D point-cloud surface coverage benchmark and robotic surface drawing tasks demonstrate that our method achieves superior coverage quality with tractable computation in our setting relative to existing TO and SAO approaches, and the method is validated in real-world robot experiments.
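As background for the SVGD particle updates mentioned above, here is a minimal sketch of the standard Euclidean SVGD step with an RBF kernel, applied to a one-dimensional Gaussian target. It is not the paper's method: it has no SE(3) structure and no preconditioner, and the target, bandwidth, step size, and function names are all assumptions chosen to keep the example small.

```python
import math

def svgd_step(particles, score, step=0.1, bandwidth=1.0):
    """One vanilla SVGD update for scalar particles with an RBF kernel."""
    n = len(particles)
    h2 = bandwidth ** 2
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2.0 * h2))
            phi += k * score(xj)       # attraction: kernel-weighted score
            phi += k * (xi - xj) / h2  # repulsion: d k(xj, xi) / d xj
        updated.append(xi + step * phi / n)
    return updated

# Assumed toy target: Gaussian with mean MU and standard deviation SIGMA.
MU, SIGMA = 2.0, 1.0

def gaussian_score(x):
    """Gradient of the log-density of the toy Gaussian target."""
    return -(x - MU) / SIGMA ** 2

# Deterministic initial particles placed far from the target mean.
particles = [-3.0 + 0.2 * i for i in range(10)]
for _ in range(2000):
    particles = svgd_step(particles, gaussian_score)

mean = sum(particles) / len(particles)
spread = max(particles) - min(particles)
```

The attraction term pulls particles toward high-density regions, while the repulsion term keeps them from collapsing onto a single mode. That spreading behavior is what makes this family of updates a natural fit for coverage problems.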