arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08185 2026-05-12 cs.RO cs.AI

From Ontology Conformance to Admissible Reconfiguration: A RoSO/SMGI Adequacy Argument for Robotic Service Governance

Aomar Osmani

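AI summary This paper examines how, in service robot systems, a configuration can be guaranteed to still conform to the original service specification after the service has been rebound, recomposed, repaired, or redeployed. It proposes embedding the Robotic Service Ontology (RoSO) into the Structural Model of General Intelligence (SMGI), introducing a structural interface and behavioral semantics so that service descriptions can be governed dynamically. The approach provides a RoSO-to-SMGI adequacy theorem and identity-preserving reconfiguration conditions, giving a formal guarantee that service semantics stay consistent under revision.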

Comments 26 pages

Abstract

The Robotic Service Ontology (RoSO) gives service robotics a typed semantic vocabulary for services, functions, interactions, and deployment-sensitive constraints. Its public revision trail makes visible a harder question than ontology conformance alone can settle: once a service is rebound, recomposed, repaired, or redeployed, under what conditions does the resulting configuration remain an admissible realization of the same protected service? This article argues that the Structural Model of General Intelligence (SMGI) is relevant exactly at that level \citep{osmani2026smgi}. SMGI adds not only a structural interface $θ$, but an induced behavioral semantics $T_θ$ and a governance discipline for norm-respecting change. We show that RoSO can be embedded into SMGI as a typed semantic layer, so that service descriptions become dynamically governable rather than merely well formed. This yields a RoSO-to-SMGI adequacy theorem, identity-preserving reconfiguration criteria, and compositional conditions under which locally acceptable updates remain globally admissible. The resulting claim is not that SMGI replaces RoSO, but that it provides a formal account of what admissible runtime change requires once service semantics must survive revision.

2605.08183 2026-05-12 cs.CV cs.LG

Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery

Bo Ye, Kai Gan, Tong Wei, Min-Ling Zhang

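AI summary This paper studies generalized category discovery (GCD), which aims to discover novel categories from unlabeled data while retaining the ability to classify known categories. To address the limited adaptability and overfitting of existing methods, the authors propose LAGCD, a simple and effective GCD method that embeds a residual linear adapter into each ViT block, improving model flexibility and performance. Experiments show that LAGCD outperforms many sophisticated baselines on several generic and fine-grained datasets.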

Comments Submitted to IEEE TPAMI

Abstract

Generalized Category Discovery (GCD) seeks to identify novel categories from unlabeled data while retaining the classification ability of seen categories. Prior GCD methods commonly leverage transferable representations from pre-trained models, adapting to downstream datasets via partial fine-tuning (updating only the final ViT block) and visual prompt tuning (appending learnable vectors to inputs). However, conventional partial fine-tuning offers limited flexibility, as it fails to adapt the entire model; meanwhile, visual prompt tuning is prone to overfitting, due to its sensitivity to initialization and inherently constrained capacity. To address these limitations, we propose LAGCD, a simple yet effective GCD approach that embeds a residual linear adapter into each ViT block. From the perspective of feature sparsity, we systematically show that non-linearity in conventional adapters impairs performance, whereas our linear adapter enhances it by enabling more flexible model capacity. We further introduce an auxiliary distribution alignment loss to mitigate the negative impact of biased predictions between seen and novel categories. Extensive experiments on both generic and fine-grained datasets confirm that LAGCD consistently improves performance over many sophisticated baselines. The source code is available at https://github.com/yebo0216best/LAGCD
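The residual linear adapter described in the abstract can be pictured with a minimal pure-Python sketch. The function names, the `scale` parameter, the zero initialisation, and the placement of the adapter are assumptions made for illustration; the abstract only states that a residual linear adapter is embedded into each ViT block, so this is not the authors' implementation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (length cols)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def residual_linear_adapter(x: Vector, w: Matrix, scale: float = 1.0) -> Vector:
    """Return x + scale * (W x): a purely linear update added back onto the input.

    There is no activation function between W and the residual sum, which is
    the point of contrast with conventional non-linear adapters.
    """
    delta = matvec(w, x)
    return [xi + scale * di for xi, di in zip(x, delta)]


# Hypothetical 3-dimensional token feature.
x = [1.0, -2.0, 0.5]

# Assumption: a zero-initialised adapter starts as the identity map.
w_zero = [[0.0] * 3 for _ in range(3)]
identity_out = residual_linear_adapter(x, w_zero)

# A non-zero adapter can push features negative; nothing clips them to zero.
w_mix = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
mixed_out = residual_linear_adapter(x, w_mix)
```

Because no ReLU-style clipping is applied, negative components survive the adapter, which is one way to read the abstract's link between non-linearity and feature sparsity.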

2605.08182 2026-05-12 cs.LG cs.AI

Quantile Geometry Regularization for Distributional Reinforcement Learning

Zhaofan Zhang, Minghao Yang, Rufeng Chen, Sihong Xie, Hui Xiong

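AI summary This paper proposes RQIQN, a distributional reinforcement learning method that targets the distorted distribution estimates caused by bootstrapped target quantiles in quantile-based distributional reinforcement learning. By adding a Wasserstein distributionally robust enhancement, the method corrects the Bellman target from a quantile estimation perspective and mitigates distributional degeneration without changing the value objective. Experiments show that RQIQN outperforms existing quantile-based distributional reinforcement learning algorithms on risk-sensitive navigation and Atari games.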

Abstract

Quantile-based distributional reinforcement learning methods learn return distributions through sampled quantile regression, but their bootstrapped target quantiles may induce distorted or degenerate distribution estimates. We propose Robust Quantile-based Implicit Quantile Networks (RQIQN), a lightweight Wasserstein distributionally robust enhancement boosted from a quantile estimation perspective. We first reinterpret a snapshot of IQN loss as a collection of local empirical quantile estimation problems over sampled current fractions. We then robustify each local slot with a Wasserstein distributionally robust quantile estimation formulation, yielding a closed-form, fraction-dependent correction to the Bellman target. This correction directly addresses distributional degeneration: its median antisymmetry preserves the risk-neutral quantile average, while its monotonicity enlarges upper-lower quantile gaps and counteracts collapsed distributional spread. RQIQN thus regularizes quantile geometry without changing the underlying value objective or requiring additional sample set reconstruction. Finally, we empirically show that the proposed RQIQN outperforms other existing quantile-based distributional reinforcement learning algorithms in risk-sensitive navigation and Atari games.
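The two properties the abstract attributes to the correction, antisymmetry about the median and monotonicity in the fraction, can be illustrated with a toy stand-in. The linear form `eps * (2 * tau - 1)` below is an assumption chosen only because it has both properties; it is not the paper's closed-form correction, and the fractions and target values are made up.

```python
from typing import List


def correction(tau: float, eps: float) -> float:
    """Hypothetical correction: zero at tau = 0.5, antisymmetric about it,
    and increasing in tau. Stands in for the paper's closed form."""
    return eps * (2.0 * tau - 1.0)


def corrected_targets(targets: List[float], taus: List[float], eps: float) -> List[float]:
    """Add the fraction-dependent correction to each target quantile."""
    return [z + correction(t, eps) for z, t in zip(targets, taus)]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Fractions placed symmetrically about the median (an assumption of this toy).
taus = [0.1, 0.3, 0.5, 0.7, 0.9]
# A nearly collapsed set of target quantiles.
targets = [1.0, 1.1, 1.2, 1.3, 1.4]

new_targets = corrected_targets(targets, taus, eps=0.5)

old_gap = targets[-1] - targets[0]          # spread before correction
new_gap = new_targets[-1] - new_targets[0]  # spread after correction
```

With fractions symmetric about 0.5 the corrections cancel in the average, so the risk-neutral value is unchanged while the gap between upper and lower quantiles grows.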

2605.08181 2026-05-12 cs.CV cs.AI cs.LG

Text-Guided Multi-Scale Frequency Representation Adaptation

Weicai Yan, Xinhua Ma, Wang Lin, Tao Jin

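AI summary This paper proposes FreqAdapter, a parameter-efficient fine-tuning method that introduces text-guided multi-scale frequency representation adaptation in the frequency domain, addressing the information redundancy of existing methods that operate in the signal space domain and their failure to fully capture the multi-scale characteristics of signals. The method uses a multi-scale adaptation strategy to optimize receptive fields across different frequency ranges, improving the model's representational capacity and efficiency. Experiments show that FreqAdapter improves both performance and efficiency on multimodal models such as CLIP and LLaVA.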

Comments ACL 2026 Main

Abstract

Parameter-efficient fine-tuning methods introduce a small number of training parameters, enabling pre-trained models to adapt rapidly to new data distributions. While these methods have shown promising results, they exhibit notable limitations. First, most existing methods operate in the signal space domain, which results in substantial information redundancy. Second, most existing methods utilize fixed prompts or adaptation layers, failing to fully account for the multi-scale characteristics of signals. To address these challenges, we propose the Multi-Scale Frequency Adapter (FreqAdapter), which integrates textual information and performs multi-scale fine-tuning of signals in the frequency domain. Additionally, we introduce a multi-scale adaptation strategy to optimize receptive fields across different frequency ranges, further enhancing the model's representational capacity. Extensive experiments on multimodal models, including CLIP and LLaVA, demonstrate that FreqAdapter significantly improves both performance and efficiency. FreqAdapter improves performance with minimal cost and fast convergence within one epoch. Code is available at https://github.com/Kelvin-ywc/FreqAdapter.

2605.08178 2026-05-12 cs.LG cs.AI

Generalized Category Discovery in Federated Graph Learning

Zhongzheng Yuan, Lianshuai Guo, Xunkai Li, Wenyu Wang, Meixia Qu

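AI summary This paper studies generalized category discovery in federated graph learning (FGGCD), which aims to collaboratively discover novel categories over distributed graph data while retaining knowledge of known categories. To address the core challenges of the neighborhood absorption effect caused by structural fragmentation and of global semantic inconsistency, it proposes the GCD-FGL framework, which mitigates these problems with a client-side topology-reliable semantic alignment and discovery mechanism and a server-side hierarchical prototype alignment strategy. Experiments show that the method significantly outperforms existing methods on several real-world graph datasets.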

Abstract

Federated Graph Learning (FGL) enables collaborative learning over distributed graph data, yet existing approaches largely rely on a closed-world assumption, limiting their applicability in dynamic environments where novel categories continuously emerge. To bridge this gap, we target the practical scenario of Federated Graph Generalized Category Discovery (FGGCD), aiming to collaboratively discover novel categories across decentralized graph clients while retaining knowledge of known categories. We observe that FGGCD introduces two fundamental challenges: (1) the Neighborhood Absorption Effect, where structural fragmentation leads to biased neighborhood aggregation, causing novel nodes to be misclassified as known categories; and (2) Global Semantic Inconsistency, where the aforementioned local biases propagate to the server and are amplified by heterogeneous subgraph distributions, hindering cross-client knowledge integration. To address these issues, we propose GCD-FGL, an FGL framework for GCD that integrates a client-side Topology-Reliable Semantic Alignment and Discovery process to mitigate the neighborhood absorption effect, and a server-side Hierarchical Prototype Alignment strategy to resolve global semantic inconsistency. Extensive experiments on five real-world graph datasets demonstrate that GCD-FGL consistently outperforms state-of-the-art baselines, achieving an average absolute gain of +4.86 in HRScore.

2605.08177 2026-05-12 cs.LG cs.AI

Echo-LoRA: Parameter-Efficient Fine-Tuning via Cross-Layer Representation Injection

Yihang Peng, Peng Jin, Jie Gong, Xingyuan Chen, Lingjiao Xu, Ning Su, Yan Ran

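AI summary This paper proposes Echo-LoRA, a parameter-efficient fine-tuning method that improves the downstream-task performance of large language models through cross-layer representation injection. During training, the method extracts hidden states from deeper network layers, builds a sample-level echo representation, and injects it into shallow LoRA or DoRA modules through lightweight projection and gating networks, making better use of intermediate representations. Experiments show that Echo-LoRA significantly outperforms existing LoRA baselines on several commonsense reasoning benchmarks and requires no additional parameters or computation at deployment.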

Abstract

Parameter-efficient fine-tuning (PEFT) has become a practical route for adapting large language models to downstream tasks, with LoRA-style methods being particularly attractive because they are inexpensive to train and easy to deploy. Most LoRA variants, however, revise the update rule within the weight space of each layer and leave the intermediate representations formed by deeper layers largely unused. We propose Echo-LoRA, a cross-layer representation injection method for parameter-efficient fine-tuning. During training, Echo-LoRA collects boundary hidden states from deeper source layers, aggregates them into a sample-level echo representation, and uses lightweight projection and gating networks to inject the resulting signal into shallow LoRA or DoRA modules. Answer-only masking, masked distillation, and stochastic routing are used to keep this auxiliary path stable and to reduce the gap between training and inference. On eight commonsense reasoning benchmarks, Echo-LoRA exceeds the reported LoRA baselines by 5.7 percentage points on average across LLaMA-7B, LLaMA2-7B, and LLaMA3-8B. Under reproduced LoRA baselines in our unified implementation, the average gain is 3.0 points; when combined with DoRA, the gain is 2.7 points. The Echo path is discarded after training, so the deployed model keeps the original low-rank LoRA/DoRA form and adds neither inference-time parameters nor inference computation.
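The claim that the Echo path is discarded after training can be pictured with a toy low-rank layer. The standard LoRA form (frozen weight plus a low-rank product) is well established; everything about the echo signal here is a hypothetical simplification: the abstract does not say where or how the signal enters the module, so adding a gated vector to the module output is only an assumption for illustration.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector x (length cols)."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


def lora_forward(
    x: Vector,
    w: Matrix,                      # frozen base weight
    a: Matrix,                      # low-rank down-projection (r x d_in)
    b: Matrix,                      # low-rank up-projection (d_out x r)
    scale: float = 1.0,
    echo: Optional[Vector] = None,  # hypothetical training-only signal
    gate: float = 0.0,              # hypothetical scalar gate
) -> Vector:
    """Compute W x + scale * B (A x), plus an optional gated echo term."""
    base = matvec(w, x)
    low_rank = matvec(b, matvec(a, x))
    out = [bi + scale * li for bi, li in zip(base, low_rank)]
    if echo is not None:
        # Assumed injection point: added to the module output during training.
        out = [oi + gate * ei for oi, ei in zip(out, echo)]
    return out


w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]        # rank r = 1
b = [[0.5], [0.0]]
x = [1.0, 2.0]

train_out = lora_forward(x, w, a, b, echo=[1.0, 1.0], gate=0.5)
deploy_out = lora_forward(x, w, a, b)  # echo path dropped at inference
```

At deployment the call takes no echo argument, so the layer has exactly the parameters of an ordinary LoRA module, matching the abstract's statement that no inference-time parameters or computation are added.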

2605.08176 2026-05-12 cs.LG cs.NE

Physics-Modeled Neural Networks

Raul Felipe-Sosa, Angel Martin del Rey, Maria Flores Ceballos

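AI summary This paper proposes Dynamical Physics-Modeled Neural Networks (DynPMNNs), a continuous-time deep learning architecture whose hidden layers are defined by solutions of ordinary differential equations in place of the static activation functions of conventional neural networks. This gives hidden-layer behavior a dynamical-systems interpretation and allows physically meaningful models to be incorporated. The framework is grounded in the theory of reproducing kernel Banach spaces, which reveals its structural connection to standard neural networks. Experiments show that DynPMNNs achieve performance comparable to Neural ODEs and closed-form continuous-time networks with fewer parameters, demonstrating their potential as a theoretical bridge between deep learning and dynamical systems.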

Abstract

We introduce Dynamical Physics-Modeled Neural Networks (DynPMNNs), a continuous-time deep learning architecture in which each hidden layer is defined as the solution of an ordinary differential equation. Unlike classical feed-forward networks, this approach replaces static activation functions with time-evolving dynamical systems, providing a biologically inspired interpretation of hidden-layer behavior and enabling the integration of physically meaningful models. The framework is rigorously grounded in Reproducing Kernel Banach Spaces (RKBSs), allowing DynPMNNs to be characterized as finite-dimensional solutions of an abstract training problem and revealing structural connections with standard neural networks. We present a concrete implementation based on the FitzHugh-Nagumo model for neuronal activation, where numerical ODE solvers are embedded into the computational graph via Euler-type schemes. Both network weights and dynamical parameters are trained jointly. Through experiments on the California Housing dataset, we compare DynPMNNs with Neural ODEs (NODEs) and Closed-form Continuous-Time Networks (CfCs). Despite using fewer trainable parameters, DynPMNNs achieve competitive performance. These results position DynPMNNs as a principled bridge between dynamical systems and deep learning, with promising directions for further research in expressivity, stability, and physics-based modeling.
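To make the idea of an ODE-defined activation concrete, here is a minimal sketch of the FitzHugh-Nagumo system integrated with explicit Euler steps. The equations are the standard textbook form, dv/dt = v - v^3/3 - w + I and dw/dt = eps * (v + a - b * w). Feeding the pre-activation in as the input current I, reading out the final v, and the parameter values, step size, and step count are all assumptions made for illustration, not details taken from the paper.

```python
def fhn_activation(
    current: float,
    steps: int = 50,
    dt: float = 0.05,
    a: float = 0.7,     # illustrative textbook-style values; in the paper the
    b: float = 0.8,     # dynamical parameters are trained jointly with the
    eps: float = 0.08,  # network weights
) -> float:
    """Integrate FitzHugh-Nagumo from (v, w) = (0, 0) with explicit Euler
    and return the final membrane variable v as the unit's output."""
    v, w = 0.0, 0.0
    for _ in range(steps):
        dv = v - v ** 3 / 3.0 - w + current
        dw = eps * (v + a - b * w)
        # Both updates use the state from the start of the step.
        v, w = v + dt * dv, w + dt * dw
    return v


# Hypothetical pre-activations from a linear layer.
pre_activations = [0.0, 0.5, 1.0]
hidden = [fhn_activation(z) for z in pre_activations]
```

Every operation inside the loop is an ordinary arithmetic step, which is what lets an automatic-differentiation framework treat the unrolled solver as part of the computational graph.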

2605.08175 2026-05-12 cs.CV cs.AI

KARMA-MV: A Benchmark for Causal Question Answering on Music Videos

Archishman Ghosh, Abhinaba Roy, Dorien Herremans

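AI summary Although video question answering and cross-modal understanding have made significant progress, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. This paper presents KARMA-MV, a multiple-choice question answering dataset built from 2,682 YouTube music videos, designed to evaluate a model's ability to integrate temporal audio-visual cues and reason about visual-to-musical influence. The dataset uses large language models for scalable generation and validation and contains 37,737 questions. The paper also introduces a causal knowledge graph approach that gives vision-language models structured retrieval of cross-modal dependencies. Experiments show that the approach yields notable gains, especially for smaller models, providing a new benchmark for causal understanding of music videos.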

Abstract

While significant progress has been made in Video Question Answering and cross-modal understanding, causal reasoning about how visual dynamics drive musical structure in music videos remains under-explored. We introduce KARMA-MV, a large-scale multiple-choice QA dataset derived from 2,682 YouTube music videos, designed to test models' ability to integrate temporal audio-visual cues and reason about visual-to-musical influence across reasoning, prediction, and counterfactual questions. Unlike traditional datasets requiring manual annotation, KARMA-MV leverages LLM reasoning for scalable generation and validation, yielding 37,737 MCQs. We propose a causal knowledge graph (CKG) approach that augments vision-language models (VLMs) with structured retrieval of cross-modal dependencies. Experiments on state-of-the-art VLMs and LLMs show consistent gains from CKG grounding -- especially for smaller models -- establishing the value of explicit causal structure for music-video reasoning. KARMA-MV provides a new benchmark for advancing causal audio-visual understanding beyond correlation.

2605.08174 2026-05-12 cs.LG cs.AI cs.CV

CERSA: Cumulative Energy-Retaining Subspace Adaptation for Memory-Efficient Fine-Tuning

Jingze Ge, Xue Geng, Yun Liu, Wanqi Dong, Wang Zhe Mark, Min Wu, Ngai-Man Cheung, Bharadwaj Veeravalli, Xulei Yang

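AI summary To address memory constraints in fine-tuning large models, this paper proposes CERSA, a new efficient fine-tuning method that uses singular value decomposition (SVD) to retain the principal components accounting for 90% to 95% of the spectral energy and fine-tunes only the resulting low-rank representation, substantially reducing memory consumption. Compared with existing methods such as LoRA, CERSA markedly improves memory efficiency while maintaining high performance, and it performs better across tasks including image recognition, text-to-image generation, and natural language understanding and across models of different scales.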

Comments 10 pages, 7 figures, supplementary material included

Abstract

To mitigate the memory constraints associated with fine-tuning large pre-trained models, existing parameter-efficient fine-tuning (PEFT) methods, such as LoRA, rely on low-rank updates. However, such updates fail to fully capture the rank characteristics of the weight modifications observed in full-parameter fine-tuning, resulting in a performance gap. Furthermore, LoRA and other existing PEFT methods still require substantial memory to store the full set of frozen weights, limiting their efficiency in resource-constrained settings. To address these limitations, we introduce Cumulative Energy-Retaining Subspace Adaptation (CERSA), a novel fine-tuning paradigm that leverages singular value decomposition (SVD) to retain only the principal components responsible for 90% to 95% of the spectral energy. By fine-tuning low-rank representations derived from this principal subspace, CERSA significantly reduces memory consumption. We conduct extensive evaluations of CERSA across models of varying scales and domains, including image recognition, text-to-image generation, and natural language understanding. Empirical results demonstrate that CERSA consistently outperforms state-of-the-art PEFT methods while achieving substantially lower memory requirements. The code will be publicly released.
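The rank-selection step implied by "retain only the principal components responsible for 90% to 95% of the spectral energy" can be sketched as follows. The sketch assumes spectral energy means the sum of squared singular values, which is a common convention but is not confirmed by the abstract, and it takes the singular values as given rather than computing an SVD. The function name and example values are hypothetical.

```python
from typing import Sequence


def rank_for_energy(singular_values: Sequence[float], keep: float = 0.90) -> int:
    """Return the smallest k such that the top-k singular values hold at
    least `keep` of the total energy (assumed: sum of squared values)."""
    energies = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(energies)
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running >= keep * total:
            return k
    return len(energies)  # only reached for an empty input (returns 0)


# Made-up spectrum of a weight matrix: energies are 100, 25, 4, 1 (total 130).
spectrum = [10.0, 5.0, 2.0, 1.0]

rank_90 = rank_for_energy(spectrum, keep=0.90)
rank_95 = rank_for_energy(spectrum, keep=0.95)
rank_99 = rank_for_energy(spectrum, keep=0.99)
```

In this toy spectrum two of the four components already hold more than 95% of the energy, which shows how a steeply decaying spectrum lets a small retained subspace stand in for the full matrix.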

2605.08173 2026-05-12 cs.CV cs.LG

CASISR: Circular Arbitrary-Scale Image Super-Resolution

Honggui Li, Zhengyang Zhang, Dingtai Li, Sinan Chen, Nahid Md Lokman Hossain, Xinfeng Xu, Yinlu Qin, Ruobing Wang, Hantao Lu, Yuting Feng, Maria Trocan, Dimitri Galayko, Amara Amara, Mohamad Sawan

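AI summary This paper proposes CASISR, an arbitrary-scale image super-resolution method based on a closed-loop architecture, which aims to improve the generalization performance of pretrained models on test data. By combining the super-resolution and degradation models, CASISR uses automatic control theory to establish a nonlinear closed-loop equation, and its reasonableness and stability are proven with conditional probability theory and Taylor series approximation. Experiments show that CASISR outperforms eight existing methods in image reconstruction quality and is particularly strong for fractional scale factors and for text and stripe images with sharply changing edges.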

Abstract

The generalization performance (GP) of deep learning-based arbitrary-scale image super-resolution (ASISR) methods is limited because training datasets are finite while testing data are unbounded. It is therefore important to enhance the GP of pretrained ASISR models by making full use of the testing samples. ASISR models usually employ an open-loop architecture from low-resolution (LR) images to super-resolution (SR) images. The degradation model from SR samples to LR samples is known to be bicubic down-sampling for classical ASISR, is assumed to be down-sampling with additive random noise for blind ASISR, and is learnable for real-world ASISR. By combining the ASISR and degradation models, a closed-loop architecture based on automatic control theory can potentially be adopted to strengthen the GP of ASISR methods. This paper therefore proposes a closed-loop architecture, circular ASISR (CASISR), to improve image reconstruction. A nonlinear loop equation is established to describe CASISR, its reasonableness is proven by conditional probability theory, and its stability is proven by Taylor series approximation. First-order and second-order absolute difference images are defined to compare the image reconstruction performance of the ASISR and CASISR methods. Comprehensive simulation experiments show that the proposed CASISR approach outperforms eight state-of-the-art ASISR approaches in image reconstruction quality. In particular, CASISR is well suited to fractional SR scale factors and is highly effective for text and stripe images with sharply changing edges.

2605.08172 2026-05-12 cs.CV cs.LG

Augmented Equivariant Mesh Networks for Anatomical Segmentation

Daniel Saragih

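AI summary This paper proposes EAMS, an equivariant anatomical mesh segmentation model that handles irregular surface geometry in medical imaging and stays robust to changes in patient pose and mesh resolution. The method is built on Equivariant Mesh Neural Networks (EMNN), combines intrinsic mesh descriptors with anatomical priors such as PCA frames for dental arches and liver surfaces, and augments message passing to provide global context. Experiments show that EAMS performs well across several clinical tasks, with high segmentation accuracy and stability on both unperturbed and geometrically perturbed inputs, and that the model has fewer than 2 million parameters, demonstrating its efficiency and generality.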

Comments 21 pages, 7 figures, 14 tables

Abstract

Anatomical mesh segmentation requires models that operate directly on irregular surface geometry while remaining robust to arbitrary patient pose and mesh resolution variation. Existing task-specific mesh and point-cloud methods are not equivariant, and can degrade sharply under test-time perturbation, for example dropping by 25-26 IoU points on intraoral scan segmentation at $40^\circ$ tilt. We present EAMS, an Equivariant Anatomical Mesh Segmentor built on Equivariant Mesh Neural Networks (EMNN), and evaluate it across four clinically distinct tasks spanning edge-, vertex-, and face-level supervision. We combine intrinsic mesh descriptors with anatomy-aware priors, including PCA-derived frames for dental arches and liver surfaces, and augment message passing to provide lightweight global context. Across intracranial aneurysm and intraoral segmentation, EAMS variants are competitive with specialized baselines on unperturbed inputs while remaining stable under geometric perturbations, and on liver surfaces they expose a favorable trade-off between canonical-pose accuracy and rotation robustness. These results show that a lightweight ($<2$M parameters) equivariant framework can deliver robust anatomical mesh segmentation across diverse supervision types without task-specific architectures.

2605.08170 2026-05-12 cs.LG math.FA

Quantitative Sobolev Approximation Bounds for Neural Operators with Empirical Validation on Burgers Equation

Nicole Hao

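AI summary This paper studies the approximation power of neural operators in Sobolev norms, with an empirical analysis on the Burgers equation. The author builds a functional-analytic framework and proves that, under certain conditions, a nonlinear operator can be approximated to a prescribed error by a neural operator with a specified number of parameters, giving an explicit relation between error and parameter count. Experiments show that Fourier Neural Operators approximate well in Sobolev spaces and that their error follows a power law in the number of parameters, supporting the theoretical analysis.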

Abstract

Neural operators have emerged as a powerful tool for learning mappings between infinite-dimensional function spaces. However, their approximation properties in Sobolev norms remain poorly quantified, even though these norms control both function values and derivatives and are the natural metrics for PDE well-posedness, stability, and generalization. We develop a functional-analytic framework for operator learning in Sobolev spaces and connect it to the numerical behavior of Fourier Neural Operators (FNOs) on a prototypical PDE. First, for a continuous nonlinear operator $\mathcal{G}: H^{s}(D)\to H^{t}(D')$ with $s > d/2$ and inputs restricted to a compact subset of $H^{s}(D)$, we prove that $\mathcal{G}$ can be uniformly approximated in $H^{t}$-norm by a neural operator with $\mathcal{O}(\varepsilon^{-d/s})$ trainable parameters. This yields an explicit complexity--error relation of the form $\|\mathcal{G}-\mathcal{G}_θ\|_{H^{t}} \lesssim C N^{-s/d}$. We then study the one-dimensional viscous Burgers solution operator $\mathcal{G}: u_{0}\mapsto u(\cdot,1)$ on a bounded $H^{1}$-ball and train FNOs with an $H^{1}$-loss. Across a sweep of model sizes, we obtain test $H^{1}$-errors down to $\mathcal{O}(10^{-7})$ and relative errors of order $10^{-3}$, with predictions accurately matching both solutions and spatial derivatives on held-out data. A log-log plot of Sobolev error versus parameter count exhibits an approximate power law $\|\mathcal{G}-\mathcal{G}_θ\|_{H^{1}} \approx C N^{-α}$ with empirical exponent $α\approx 1.4$, and long-horizon training reveals optimization instabilities in large FNOs, providing quantitative evidence that Sobolev-space approximation theory meaningfully predicts neural-operator scaling behavior.
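The complexity-error relation follows by inverting the parameter count: if an error of $\varepsilon$ needs $N = \mathcal{O}(\varepsilon^{-d/s})$ parameters, then $\varepsilon \lesssim N^{-s/d}$. An empirical exponent such as the reported $α \approx 1.4$ is typically read off a log-log plot by a straight-line fit. The sketch below shows such a fit in pure Python; the data are synthetic and generated from an exact power law, not the paper's measurements, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(ns: Sequence[float], errors: Sequence[float]) -> Tuple[float, float]:
    """Fit error ~ C * N**(-alpha) by ordinary least squares on
    (log N, log error) and return (C, alpha)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    # log error = log C - alpha * log N, so alpha is minus the slope.
    return math.exp(intercept), -slope


# Synthetic sweep of model sizes with a known exponent of 1.4.
param_counts = [1e3, 1e4, 1e5, 1e6]
synthetic_errors = [2.0 * n ** -1.4 for n in param_counts]

c_hat, alpha_hat = fit_power_law(param_counts, synthetic_errors)
```

On real measurements the points scatter around the line, so the fitted exponent should be reported together with the range of model sizes it was estimated over.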

2605.08169 2026-05-12 cs.CV cs.AI

Optimized Culprit Identification Using Mobilenet and Attention Mechanisms

Savitha N J, Lata B T

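AI summary This paper proposes an optimized deep learning framework that combines the lightweight MobileNet architecture with channel and spatial attention mechanisms to improve the accuracy and computational efficiency of suspect identification in surveillance systems. The attention mechanisms emphasize key feature regions and suppress background interference, improving identification performance. Experiments show that the model reaches a classification accuracy of 97.8% on several benchmark face datasets, outperforms conventional models, and has low computational complexity and inference time, making it suitable for real-time surveillance and edge computing.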

Journal ref
ISSN No: 2096-3246, Link: https://advancedengineeringscience.com/article/2026/2216.html
Abstract

Automated culprit identification in surveillance systems is a critical task that requires high accuracy along with computational efficiency for real-time deployment. In this paper, an optimized deep learning framework is proposed using a lightweight MobileNet architecture integrated with channel and spatial attention mechanisms. The proposed model enhances feature representation by selectively focusing on the most discriminative regions while suppressing irrelevant background information, thereby improving identification performance. The framework incorporates efficient preprocessing, attention based feature refinement, and a robust classification strategy optimized using the Adam Optimizer. Experiments were conducted on benchmark face recognition datasets, including Labelled Faces in the Wild (LFW), CASIA-WebFace, and a subset of VGGFace2, under realistic conditions with variations in illumination, pose, and occlusion. The results demonstrate that the proposed model achieves a high classification accuracy of 97.8%, outperforming conventional models such as baseline CNN, ResNet, and standard MobileNet. The confusion matrix analysis indicates strong class-wise discrimination with minimal misclassification, while ROC-AUC evaluation confirms robust performance across all classes. Additionally, the proposed approach maintains low computational complexity and reduced inference time, making it suitable for real-time surveillance and edge-based applications.

2605.08168 2026-05-12 cs.RO cs.AI cs.LG

Understanding Asynchronous Inference Methods for Vision-Language-Action Models

Ayoub Agouzoul

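AI summary Vision-Language-Action (VLA) models are promising for generalist robot control, but their inference latency causes observation staleness under asynchronous execution. This paper systematically compares four methods for mitigating the problem: inference-time inpainting (IT-RTC), training-time delay simulation (TT-RTC), future-state-aware conditioning (VLASH), and lightweight residual correction (A2C2), evaluating them on several benchmarks with unified codebases and experimental settings. The results show that A2C2 performs best in most settings, while TT-RTC is the most robust training-based method.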

Abstract

Vision-Language-Action (VLA) models offer a promising path to generalist robot control, but their inference latency causes observation staleness when generated actions are executed asynchronously. Several methods have been proposed concurrently to mitigate this problem: inference-time inpainting (IT-RTC), training-time delay simulation (TT-RTC), future-state-aware conditioning (VLASH), and lightweight residual correction (A2C2). Each takes a fundamentally different approach, but they have so far been evaluated independently with different codebases, base policies, and protocols. We present a systematic comparison of these four methods under controlled conditions. We develop two unified codebases that integrate all methods with harmonized library and dataset versions, and we benchmark them on the Kinetix suite with MLPMixer policies and on the LIBERO manipulation benchmark with SmolVLA, sweeping inference delays up to $d=20$ control steps. A2C2's per-step residual correction is the most effective method on Kinetix, holding above 90% solve rate up to $d=8$, and also leads on LIBERO from $d=4$ onwards. IT-RTC is competitive at low delays but degrades sharply under long chunks ($H=30$) and high delays. TT-RTC is the most robust training-based method: stable across $d_\max$ choices, generalizes beyond its training delay distribution, and adds zero inference overhead. VLASH exhibits a clear low-delay vs. high-delay trade-off governed by the fine-tuning delay range $[0,d_\max]$. Code is available at https://github.com/TheAyos/async-vla-inference

2605.08167 2026-05-12 cs.CV cs.AI

Digital Image Forgery Detection Using Transfer Learning

Fatma Betul Buyuk, Gozde Karatas Baydogmus, Ali Buldu, Ayaulym Tulendiyeva, Zhuldyz Baizhumanova

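AI summary With the spread of advanced image editing tools, forged digital images are increasingly common, posing serious challenges for digital forensics and information security. This paper proposes a transfer learning-based image forgery detection framework that combines compression-aware feature enhancement with deep convolutional neural networks; by fusing RGB images with compression difference-based features (FDIFF), it improves detection of subtle forgery traces. Experiments show that the method performs well across several pretrained network models, particularly in lowering the false positive rate and improving classification reliability, and is suitable for image forgery detection in real-world settings.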

Abstract

The increasing availability of advanced image editing tools has led to a significant rise in manipulated digital content, posing serious challenges for digital forensics and information security. This study presents a transfer learning-based framework for digital image forgery detection that integrates compression-aware feature enhancement with deep convolutional neural network (CNN) architectures. The proposed approach introduces a hybrid input representation that combines RGB images with compression difference-based features (FDIFF), explicitly highlighting subtle manipulation artifacts that are often difficult to detect. In addition, a model-specific adaptive threshold optimization strategy based on the Youden Index is employed to improve classification reliability by achieving a better balance between true positive and false positive rates. Experiments conducted on the CASIA v2.0 dataset using multiple pretrained CNN architectures, including DenseNet121, VGG16, ResNet50, EfficientNetB0, MobileNet, and InceptionV3, demonstrate the effectiveness and robustness of the proposed framework. The models are evaluated using comprehensive performance metrics such as accuracy, precision, recall, F1-score, Matthews correlation coefficient (MCC), and area under the ROC curve (AUC). The results show that DenseNet121 achieves the highest accuracy and AUC, while ResNet50 provides the most balanced and reliable predictions with the highest MCC. The findings emphasize that relying solely on accuracy is insufficient for forensic applications, where minimizing false negatives is critical. Overall, the proposed framework improves the visibility of manipulation artifacts and enhances classification robustness, making it suitable for real-world digital image forgery detection scenarios.
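The Youden Index used for threshold selection is J = TPR - FPR (equivalently sensitivity + specificity - 1), and the chosen threshold is the one that maximises J. A minimal pure-Python sketch follows; the scores and labels are made up, and the convention that a score at or above the threshold counts as a positive (forged) prediction is an assumption.

```python
from typing import List, Sequence, Tuple


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (threshold, J) maximising J = TPR - FPR.

    A sample is predicted positive when score >= threshold. Requires at
    least one positive and one negative label. Ties keep the lowest threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    best_threshold, best_j = float("nan"), -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Hypothetical model scores (higher means more likely forged) and true labels.
scores: List[float] = [0.1, 0.4, 0.35, 0.8]
labels: List[int] = [0, 0, 1, 1]

threshold, j_value = youden_threshold(scores, labels)
```

Because J weights true and false positive rates equally, a forensic deployment that cares more about missed forgeries may prefer a lower threshold than the one J selects.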

2605.08161 2026-05-12 cs.CV

Advanced Tumor Segmentation in PET/CT Imaging: A Training Strategy Study with nnU-Net for AutoPET III

Hussain Alasmawi

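AI summary This paper addresses the challenging problem of tumor segmentation in whole-body PET/CT imaging, aiming to develop a segmentation method that generalizes across tracers and multi-center data. Building on the nnU-Net framework with a ResNet encoder, the author systematically explores how training strategies such as intensity normalization, batch Dice optimization, and CraveMix data augmentation affect model performance. Experiments show that these strategies markedly improve robustness in reducing false positives and handling lesion variability; the best configuration reaches a Dice score of 0.80 in the preliminary test phase and ranked third in the AutoPET III challenge.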

Abstract

Tumor segmentation in whole-body PET/CT imaging is crucial for precise disease evaluation and treatment planning. However, it remains challenging due to variability in lesion size, contrast, and anatomical distribution. Relying on manual segmentation makes the process time-consuming and prone to intra- and inter-observer variability. This work presents a whole-body tumor segmentation method developed for the AutoPET III challenge, where the goal is to build models that generalize across tracers and multi-center data. We employ the nnU-Net framework with a ResNet-based encoder as our baseline and systematically investigate the impact of training strategies, including intensity normalization, batch dice optimization, and data augmentation using CraveMix. Our experiments show that these strategies significantly influence model performance, particularly in reducing false positives and improving robustness to lesion variability. The best-performing configuration achieves a Dice score of up to 0.80 on the preliminary test phase, and our method ranked third in the AutoPET III challenge. The code is publicly available here.
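The Dice score, 2|P ∩ T| / (|P| + |T|), and the difference between averaging it per sample and pooling it over a batch can be sketched as below. Reading "batch dice" as pooling intersections and sizes over the whole batch before dividing is an assumption about the training option the abstract mentions, and scoring two empty masks as 1.0 is a convention chosen for this sketch. The masks are made-up flattened binary vectors.

```python
from typing import Sequence


def dice(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice overlap of two binary masks; two empty masks score 1.0 here."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator


def per_sample_dice(preds: Sequence[Sequence[int]], targets: Sequence[Sequence[int]]) -> float:
    """Average of the Dice scores computed separately for each sample."""
    return sum(dice(p, t) for p, t in zip(preds, targets)) / len(preds)


def batch_dice(preds: Sequence[Sequence[int]], targets: Sequence[Sequence[int]]) -> float:
    """Dice computed once after pooling all voxels in the batch."""
    intersection = sum(
        p * t for pred, target in zip(preds, targets) for p, t in zip(pred, target)
    )
    denominator = sum(sum(pred) for pred in preds) + sum(sum(t) for t in targets)
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator


# Sample 0 has a lesion; sample 1 is lesion-free and predicted empty.
preds = [[1, 1, 0, 0], [0, 0, 0, 0]]
targets = [[1, 0, 1, 0], [0, 0, 0, 0]]

sample_mean = per_sample_dice(preds, targets)
pooled = batch_dice(preds, targets)
```

The lesion-free sample lifts the per-sample average but contributes nothing to the pooled score, which is why the choice matters for whole-body scans where many patches contain no tumor.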

2605.08160 2026-05-12 cs.CV cs.AI

WATCH: Wide-Area Archaeological Site Tracking for Change Detection

Girmaw Abebe Tadesse, Titien Bartette, Andrew Hassanali, Allen Kim, Jonathan Chemla, Andrew Zolli, Yves Ubelmann, Caleb Robinson, Inbal Becker-Reshef, Juan Lavista Ferres

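AI summary This paper presents WATCH, a framework for month-level change detection at large numbers of archaeological sites, designed to address the difficulty of identifying disturbances when visual cues are subtle and ground-truth labels are scarce. The method combines three complementary scoring strategies: training-free Temporal Embedding Distance (TED), Self-Supervised Change Detection (SSCD), and a weakly supervised temporal localization model, and it is validated on sites in Afghanistan and several other countries. Experiments show that unsupervised methods based on satellite imagery and foundation-model embeddings perform strongly in change detection, with notable advantages for early warning and precise temporal localization.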

Abstract

Monitoring archaeological sites at scale is vital for protecting cultural heritage, yet pinpointing when disturbances occur remains difficult because visual cues are subtle and ground-truth data are sparse. We introduce WATCH, a framework for month-level change-event localization over PlanetScope satellite mosaics (2017-2024, 4.7 m/px) that supports three complementary scoring approaches: (i) Temporal Embedding Distance (TED), a training-free method that scores month-to-month deviations from a local temporal reference; (ii) Self-Supervised Change Detection (SSCD), an ensemble of reconstruction, forecasting, and latent-novelty signals; and (iii) a Weakly Supervised (WS) temporal localization model trained with sparse event-month labels. We benchmark WATCH on 1,943 archaeological sites in Afghanistan using embeddings from six foundation models (CLIP, GeoRSCLIP, SatMAE, Prithvi-EO-2.0, DINOv3, and Satlas-Pretrain) alongside a handcrafted spectral and texture baseline, and assess cross-regional generalization on sites in Syria, Turkey, Pakistan, and Egypt. The unsupervised approaches (TED, SSCD) consistently outperform the weakly supervised alternative. TED with SatMAE achieves the highest exact-month recall (55% at m=0), while TED with GeoRSCLIP, CLIP, or Satlas-Pretrain reaches 92.5% within a three-month tolerance (m=3). Handcrafted features remain competitive for exact-month detection under weak supervision. Our directional margin analysis reveals systematic temporal biases: SSCD paired with GeoRSCLIP or Prithvi-EO-2.0 exhibits the strongest early-warning profile, detecting anomalies before the recorded event, while TED favors confirmation-oriented detection after a change has materialized. These results show that satellite imagery combined with foundation-model embeddings enables scalable, decision-relevant heritage monitoring. Code: https://github.com/microsoft/WATCH
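A training-free score of the kind TED describes, together with the tolerance-based hit criterion (m=0 for exact month, m=3 for within three months), can be sketched as follows. Using the mean of the previous few months as the local temporal reference, cosine distance as the deviation measure, and the window length are all assumptions for illustration; the abstract does not specify them. The embeddings are toy 2-dimensional vectors.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity, or 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)


def ted_scores(embeddings: Sequence[Sequence[float]], window: int = 3) -> List[float]:
    """Score each month by its distance from the mean embedding of the
    preceding `window` months (assumed form of the local reference)."""
    scores = [0.0] * len(embeddings)
    for t in range(1, len(embeddings)):
        past = embeddings[max(0, t - window):t]
        reference = [sum(column) / len(past) for column in zip(*past)]
        scores[t] = cosine_distance(embeddings[t], reference)
    return scores


def hit_within(predicted_month: int, true_month: int, m: int) -> bool:
    """True if the prediction falls within m months of the recorded event."""
    return abs(predicted_month - true_month) <= m


# Toy monthly embeddings for one site: the appearance changes at month 3.
monthly = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

scores = ted_scores(monthly)
predicted = scores.index(max(scores))
```

Once the changed appearance enters the reference window the score drops again, which fits the abstract's remark that TED favors confirmation after a change has materialized.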

2605.08158 2026-05-12 cs.CV cs.AI

HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

Haopeng Jin, Hongzhu Yi, Wenlong Zhao, Jinwen Luo, Shani Ye, Zhenyu Guan, Shiquan Dong, Tiankun Yang, Tao Yu

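AI summary This paper proposes HY-Himmel, a hierarchical video-language framework that addresses key problems multimodal language models face in long-video understanding: high decoding cost, quadratic growth in token count, and weak motion perception under sparse sampling. The method separates semantic and motion encoding, using a small set of sparse I-frames for object and scene recognition and a lightweight tri-stream adapter to extract motion features from the dense inter-frame information, then injects the motion features into the language model through a differentiable placeholder mechanism. Experiments show that on Video-MME, HY-Himmel improves on the dense 32-frame baseline by 2.3 percentage points while using fewer tokens.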

Comments 59 pages, 42 figures. Technical report

Abstract

Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2 to 63.5%) while using 3.6x fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.

2605.08156 2026-05-12 cs.CV cs.AI

LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment

Junyi Hu, Qiji Zhou, Lei Zhang, Yue Zhang

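AI Summary This study proposes LAGO, a framework for fine-grained recognition in zero-shot visual-text alignment. Existing methods rely on large numbers of redundant image regions, which raises inference cost, and introducing semantic guidance too early tends to create erroneous feedback. LAGO addresses both problems with class-agnostic, object-centric candidate discovery followed by an adaptive language-guided refinement strategy, giving more efficient and more robust alignment. Experiments show that LAGO achieves leading performance on several standard zero-shot benchmarks and under distribution shift, while substantially reducing the number of candidate regions needed at inference time.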

Comments 37 pages, 26 figures, including appendix. Preprint

Details
English abstract

Zero-shot recognition aims to classify an image by selecting the most compatible label description from a set of candidate classes without any task-specific supervision. In fine-grained settings, however, the relevant evidence often lies in localized parts, attributes, or textures rather than in the full image, making whole-image alignment suboptimal. Recent localized visual-text alignment methods address this by comparing class descriptions with multiple image regions, but they typically rely on large sets of random or redundant crops, increasing inference cost and introducing many highly redundant or weakly relevant candidates. Moreover, introducing semantic guidance too early can create an error-amplifying feedback process in which inaccurate intermediate predictions bias later localization and reinforce subsequent mistakes; we refer to this failure mode as the prediction loop. We propose LAGO (LAnguage-Guided adaptive Object-region focus), a framework for efficient and robust zero-shot localized visual-text alignment. LAGO first performs class-agnostic object-centric candidate discovery to obtain a stable visual initialization, and then applies adaptive language-guided refinement with the strength of semantic guidance controlled by intermediate confidence. It further combines object-level, contextual, and full-image evidence through an effective object-context dual-channel aggregation strategy. Extensive experiments show that LAGO consistently achieves state-of-the-art performance on standard zero-shot benchmarks and challenging distribution-shift settings, while requiring substantially fewer candidate regions at inference time.

2605.08153 2026-05-12 cs.LG cs.GT

Temporal-Decay Shapley: A Time-Aware Data Valuation Framework for Time-Series Data

Chuwen Pang, Bing Mi, Kongyang Chen

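AI Summary As machine learning is applied ever more widely to time-series data, accurately assessing the value of training samples has become essential for data selection, noise detection, and model optimization. Traditional data valuation methods, however, usually assume that samples are independent and identically distributed, ignoring the fact that sample value in time-series data changes over time. This paper proposes an improved temporal Shapley data valuation method based on a temporal decay mechanism and a multi-scale fusion strategy. Its three progressively enhanced temporal Shapley methods improve the accuracy of sample valuation for time-series data. Experiments show that the method outperforms traditional approaches on noise detection and high-value data identification, with the advantage most pronounced in settings with strong temporal dependence.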

Details
English abstract

With the rapid development of machine learning applications on time-series data, accurately assessing the value of training samples has become essential for data selection, noise detection, and model optimization. However, traditional data valuation methods usually assume that samples are independent and identically distributed, and thus ignore the time-varying nature of sample value in time-series data. This paper proposes an improved temporal Shapley data valuation method that enables accurate sample valuation for time-series data through a temporal decay mechanism and a multi-scale fusion strategy. Specifically, we propose three progressively enhanced temporal Shapley methods. Temporal-Decay Shapley (TDS) incorporates temporal information into Shapley value computation through exponential decay weights; the improved TDS adopts power exponential decay to better adapt to nonlinear temporal drift; and Multi-Scale Temporal-Decay Shapley (MS-TDS) constructs a multi-scale fusion mechanism that balances the value of short-term hotspot samples and long-term foundational samples through parallel multi-scale valuation and sample-level adaptive fusion. Experimental results show that the proposed methods generally outperform traditional methods in noise detection and high-value data identification tasks, with more evident advantages under most strongly temporal settings, thereby effectively improving the accuracy and robustness of data valuation.
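The two decay schemes named above can be illustrated with a short sketch. The abstract does not give the exact formulas, so the weight definitions, the parameter names `rate` and `power`, and the way the weights rescale per-sample values are all assumptions made for illustration only.

```python
import math

def exponential_decay(age, rate):
    """Weight for a sample that is `age` time units old: exp(-rate * age)."""
    return math.exp(-rate * age)

def power_exponential_decay(age, rate, power):
    """Assumed power-exponential form: exp(-(rate * age) ** power).
    With power == 1 this reduces to plain exponential decay."""
    return math.exp(-((rate * age) ** power))

def decay_weighted_values(values, timestamps, now, rate, power=1.0):
    """Rescale per-sample values (e.g. Shapley estimates) by their temporal weight.
    How TDS actually combines weights with Shapley computation is not stated
    in the abstract; simple multiplication is a placeholder."""
    return [
        v * power_exponential_decay(now - t, rate, power)
        for v, t in zip(values, timestamps)
    ]
```

In this sketch a `power` above 1 keeps recent samples close to full weight and then drops off more sharply, while a `power` below 1 decays quickly at first and flattens later, which is one way a nonlinear drift profile could be accommodated.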

2605.08150 2026-05-12 cs.LG

A PyTorch Library of Turing-Complete Neural Networks

Jonathan Bates

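AI Summary This paper introduces a PyTorch-based library that compiles neural network models directly from Turing machine descriptions, so that the resulting model exactly simulates the specified machine without any training. The library implements two network architectures, each corresponding to a different theoretical result, and shows how ReLU networks implement Boolean circuits and how hard attention implements positional lookup on the Turing machine tape. The tool provides a concrete reference implementation of the bridge between symbolic and neural computation, and lays a foundation for future research on the stability of constructed solutions under gradient-based optimization.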

Details
English abstract

We present a PyTorch package that compiles neural networks and their weights from Turing machine descriptions, producing models that exactly simulate the specified machine without any training. Given a transition function and a set of terminal states, the package constructs a model whose forward pass corresponds to one step of the Turing machine. Two architectures are implemented, each realizing a different theoretical result: (1) a transformer with self-attention, cross-attention, and feedforward layers based on Wei, Chen, and Ma (2021), and (2) a recurrent network based on Siegelmann and Sontag (1995) that encodes the stack in a Cantor set. We develop the constructions from first principles, showing how ReLU networks implement Boolean circuits (AND, OR, NOT, XOR gates and their composition into DNF formulas and binary adders) and how hard attention implements positional lookup on the tape. The package serves as a concrete, runnable reference for the symbolic-neural bridge, and as a foundation for future work on the stability of constructed solutions under gradient-based optimization. Code is available at https://github.com/jonrbates/turing.
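The claim that ReLU networks implement Boolean gates and binary adders can be checked with a small sketch. These are standard textbook constructions on inputs in {0, 1}, written in plain Python rather than PyTorch; they are not taken from the package, whose actual weights and layer layout may differ.

```python
def relu(x):
    return max(0.0, x)

# Each gate is an affine map followed by ReLU (plus an affine readout),
# valid for inputs restricted to {0, 1}.
def NOT(x):
    return relu(1.0 - x)

def AND(x, y):
    return relu(x + y - 1.0)

def OR(x, y):
    return 1.0 - relu(1.0 - x - y)

def XOR(x, y):
    return relu(x + y) - 2.0 * relu(x + y - 1.0)

def add_bits(a_bits, b_bits):
    """Ripple-carry adder composed from the gates above.
    Bits are given least-significant first; the result has one extra carry bit."""
    carry, out = 0.0, []
    for a, b in zip(a_bits, b_bits):
        partial = XOR(a, b)
        out.append(XOR(partial, carry))
        carry = OR(AND(a, b), AND(partial, carry))
    out.append(carry)
    return out
```

For example, `add_bits([1, 1, 0], [1, 0, 1])` adds 3 and 5 bit by bit, and every intermediate value stays in {0, 1} because each gate is exact on Boolean inputs.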

2605.08149 2026-05-12 cs.LG cs.CL

Feature Rivalry in Sparse Autoencoder Representations: A Mechanistic Study of Uncertainty-Driven Feature Competition in LLMs

Harshavardhan

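AI Summary This study examines feature competition in sparse autoencoder (SAE) representations of large language models, termed "feature rivalry", and analyzes its relationship to model uncertainty. Through controlled experiments on Gemma-2-2B, it finds that high-entropy questions significantly strengthen feature rivalry at specific network layers, and that this rivalry is to some extent predictive of the correctness of model outputs. The study also shows that activation steering along the rivalry direction changes model outputs more effectively, pointing to a causal role for feature rivalry in model processing.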

Comments 10 pages, 6 figures

Details
English abstract

Sparse Autoencoders (SAEs) decompose large language model representations into interpretable features, but how these features interact under uncertainty remains poorly understood. We introduce Feature Rivalry -- negatively correlated SAE feature pairs -- and study whether rivalry serves as a mechanistic signature of model uncertainty in Gemma-2-2B using Gemma Scope SAEs. Through a controlled within-domain experiment on PopQA split by response entropy, we find that high-entropy questions produce significantly stronger feature rivalry at layers 0 and 12 relative to low-entropy questions (p=5.3x10^-26 and p=5.8x10^-5 respectively), localizing uncertainty to specific processing stages in the residual stream. We then test whether rivalry is causally upstream of model outputs via activation steering along rivalry axes -- finding that steering along the rivalry direction (vec_A - vec_B) causes more output changes than random directions at low steering multipliers across 15 of 20 rival feature pairs. Finally, a per-prompt rivalry score derived from pairwise cosine similarities of active SAE feature decoder vectors predicts answer correctness (AUROC=0.689), approaching but not matching softmax confidence (AUROC=0.808).
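A per-prompt score built from pairwise cosine similarities of decoder vectors might look like the sketch below. The abstract does not specify the aggregation, so averaging the negative part of each pairwise similarity is an assumption, as are the function names.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rivalry_score(active_decoder_vectors):
    """Mean magnitude of negative pairwise cosine similarity among the decoder
    vectors of features active on a prompt (hypothetical aggregation).
    Returns 0.0 when fewer than two features are active."""
    pairs = list(combinations(active_decoder_vectors, 2))
    if not pairs:
        return 0.0
    return sum(max(0.0, -cosine(a, b)) for a, b in pairs) / len(pairs)
```

Under this definition, prompts whose active features point in opposing directions score higher, and the score could then be evaluated as a correctness predictor in the way the abstract describes.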

2605.08144 2026-05-12 cs.LG cs.AI cs.CV

NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training

Fang Wu, Haokai Zhao, Da Xing, Hanqun Cao, Tinson Xu, Yanchao Li, Xiangru Tang, Zehong Wang, Aaron Tu, Kuan Pang, Hanchen Wang, Hongbin Lin, Zeqi Zhou, Yinxi Li, Peng Xia, Li Erran Li, Molei Tao, Jure Leskovec, Aditya Joshi, Yejin Choi

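AI Summary Diffusion models have achieved remarkable success in generative tasks, but their training usually treats all injected noise as equally informative. This paper proposes NoiseRater, a meta-learning framework for noise valuation that assigns an instance-level importance score to each noise sample during diffusion model training, enabling adaptive reweighting of the training objective. The rater is trained via bilevel optimization, and a two-stage training pipeline is designed around it. Experiments show that focusing on informative noise improves training efficiency and generation quality, offering a new direction for optimizing diffusion model training.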

Details
English abstract

Diffusion models have achieved remarkable success across a wide range of generative tasks, yet their training paradigm largely treats injected noise as uniformly informative. In this work, we challenge this assumption and introduce NoiseRater, a meta-learning framework for instance-level noise valuation in diffusion model training. We propose a parametric noise rater that assigns importance scores to individual noise realizations conditioned on data and timestep, enabling adaptive reweighting of the training objective. The rater is trained via bilevel optimization to improve downstream validation performance after inner-loop diffusion updates. To enable efficient deployment, we further design a decoupled two-stage pipeline that transitions from soft weighting during meta-training to hard noise selection during standard training. Extensive experiments on FFHQ and ImageNet demonstrate that not all noise samples contribute equally, and that prioritizing informative noise improves both training efficiency and generation quality. Our results establish noise valuation as a complementary and previously underexplored axis for improving diffusion model training. Our code is available at: https://anonymous.4open.science/r/NoiseRater-DEB116.
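The contrast between soft weighting and hard noise selection can be sketched as follows. The softmax normalization, the `temperature` parameter, and the top-k rule are assumptions chosen for illustration; the abstract does not describe how scores are turned into weights or selections.

```python
import math

def soft_weights(scores, temperature=1.0):
    """Turn rater scores into normalized weights with a softmax (assumed form)."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reweighted_loss(per_sample_losses, scores, temperature=1.0):
    """Soft stage: weighted sum of per-noise-sample losses."""
    weights = soft_weights(scores, temperature)
    return sum(w * l for w, l in zip(weights, per_sample_losses))

def hard_select(scores, k):
    """Hard stage: indices of the k highest-scoring noise samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

In a two-stage setup of this kind, `reweighted_loss` would be used while the rater itself is being meta-trained, and `hard_select` would pick which candidate noise draws are actually used for an update during standard training.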

2605.08142 2026-05-12 cs.LG cs.CL cs.CV

Reasoning emerges from constrained inference manifolds in large language models

Yanbiao Ma, Fei Luo, Linfeng Zhang, Chuangxin Zhao, Mingxuan Wang, Yinan Wu, Zhe Qian, Yang Lu, Long Chen, Zhao Cao, Xiaoshuai Hao, Ji-Rong Wen, Jungong Han

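AI Summary This study examines the intrinsic dynamics of reasoning in large language models and finds that the evolution of representations during inference self-organizes into low-dimensional manifolds within a high-dimensional space. It shows that geometric compression alone is not sufficient for stable, reliable reasoning. Effective reasoning dynamics require three conditions: adequate expressivity, spontaneous manifold compression, and preservation of non-degenerate information volume within the compressed subspace. Building on these findings, the authors propose a label-free diagnostic and argue that reasoning in large language models is governed jointly by geometric and informational constraints.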

Details
English abstract

Reasoning in large language models is predominantly evaluated through labeled benchmarks, conflating task performance with the quality of internal inference. Here we study reasoning as an intrinsic dynamical process by examining the evolution of internal representations during inference. We find that inference-time dynamics consistently self-organize into low-dimensional manifolds embedded within high-dimensional representation spaces. We also find that such geometric compression, although pervasive, is not sufficient for stable or reliable reasoning. Instead, effective reasoning dynamics emerge within a constrained structural regime characterized by three conditions: adequate representational expressivity, spontaneous manifold compression, and preservation of non-degenerate information volume within the compressed subspace. Models outside this regime exhibit characteristic pathological inference dynamics. Based on these insights, we introduce a unified, label-free diagnostic computed solely from internal dynamics. These findings suggest that reasoning in LLMs is fundamentally governed by geometric and informational constraints, offering a complementary framework to benchmark-centric assessment.

2605.08138 2026-05-12 cs.LG

DataArc-SynData-Toolkit: A Unified Closed-Loop Framework for Multi-Path, Multimodal, and Multilingual Data Synthesis

Zhichao Shi, Cehao Yang, Hao Zhou, Xiaojun Wu, Huajie Li, Xuhui Jiang, Chengjin Xu, Yuanzhuo Wang, Jian Guo

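AI Summary To address the data scarcity that large language models face in specialized domains and low-resource languages, this paper presents DataArc-SynData-Toolkit, an open-source toolkit that provides a unified closed-loop framework for multi-path, multimodal, and multilingual data synthesis. With a configuration-driven end-to-end pipeline, a standardized high-quality generation paradigm, and a highly modular architecture, the toolkit markedly improves the usability, scalability, and cross-modal adaptability of data synthesis. Experiments show that it strikes a good balance between generation efficiency and data quality, lowering the technical barrier to synthetic data generation and model training and accelerating deployment in real-world applications.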

Comments 6 pages

Details
English abstract

Synthetic data has emerged as a crucial solution to the data scarcity bottleneck in large language models (LLMs), particularly for specialized domains and low-resource languages. However, the broader adoption of existing synthetic data tools is severely hindered by convoluted workflows, fragmented data standards, and limited scalability across modalities. To address these limitations, we develop DataArc-SynData-Toolkit, an open-source framework featuring: (1) a configuration-driven, end-to-end pipeline equipped with an intuitive visual interface and simplified CLI for exceptional usability; (2) a unified, quality-controllable synthesis paradigm that standardizes multi-source data generation to ensure high reusability; and (3) a highly modular architecture designed for seamless multimodal, multilingual, and multi-task adaptation. We apply the toolkit in multiple application scenarios. Experimental results demonstrate that our toolkit achieves an optimal balance between generation efficiency and data quality. By offering an end-to-end and visually interactive pipeline, DataArc-SynData-Toolkit significantly lowers the technical barrier to synthetic data generation and subsequent model training, accelerating its practical deployment in real-world applications.

2605.08137 2026-05-12 cs.LG cs.AI cs.CY

Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI

Plawan Kumar Rath, Rahul Maliakkal

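AI Summary This study examines how weight pruning affects the fairness of large language models and finds that activation-aware pruning methods such as Wanda preserve a model's language capability while significantly amplifying its bias. The study compares three pruning methods at different sparsity levels and reveals a "Smart Pruning Paradox": pruning can improve compression efficiency yet may worsen stereotypical behavior. It also notes that pruning delivers limited practical benefit on edge devices and poses a greater risk to model alignment than quantization, underscoring the importance of bias validation before edge AI deployment.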

Comments 8 pages, 7 figures, 8 tables. Accepted at the 7th Annual World AIIoT Congress (AIIoT 2026). This is the author's accepted version; the version of record will appear in IEEE Xplore

Details
English abstract

Weight pruning is widely advocated for deploying Large Language Models on resource-constrained IoT and edge devices, yet its impact on model fairness remains poorly understood. We conduct a controlled empirical study of three instruction-tuned models (Gemma-2-9b-it, Mistral-7B-Instruct-v0.3, Phi-3.5-mini-instruct) across three pruning methods (Random, Magnitude, Wanda) at four sparsity levels (10-70%) on 12,148 BBQ bias benchmark items with 5 random seeds, totaling 2,368,860 inference records. Our results reveal a Smart Pruning Paradox: activation-aware pruning (Wanda) preserves perplexity nearly perfectly (just 3.5% increase at 50% sparsity for Mistral-7B), yet produces the highest bias amplification, with Stereotype Reliance Score increasing 83.7% and 47-59% of previously unbiased items developing new stereotypical behaviors at 70% sparsity. Random pruning destroys language capability entirely (perplexity exceeding $10^4$ and reaching $10^8$) but produces only random-chance bias. We further show that unstructured pruning provides zero storage savings and zero inference latency reduction on real edge hardware, undermining the primary motivation for its use in IoT deployment. Of 180 dense-vs-pruned comparisons, 141 (78.3%) are significant ($p < 0.05$) with mean $|h| = 0.305$. Published quantization studies report up to 21% of responses flipping between biased and unbiased states; our pruning results show transition rates nearly three times higher (47-59%), suggesting pruning poses a categorically greater risk to alignment than quantization. These findings demonstrate that perplexity-based evaluation provides false assurance of behavioral equivalence, and that IoT deployment pipelines require bias-aware validation before deploying pruned models at the edge.
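Of the three pruning methods compared, magnitude pruning is the simplest to illustrate. The sketch below is a generic unstructured magnitude pruner over a flat list of weights; it is not the study's code, and the function name and rounding rule are assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude.
    The output has the same length as the input: unstructured pruning only
    replaces values with zeros, it does not shrink the tensor."""
    n_drop = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:n_drop])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

def measured_sparsity(weights):
    """Fraction of exactly-zero entries."""
    return sum(1 for w in weights if w == 0.0) / len(weights)
```

Because the pruned list is stored densely and still has one entry per original weight, this sketch is consistent with the abstract's observation that unstructured pruning by itself yields no storage savings without a sparse format or hardware support.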

2605.08136 2026-05-12 cs.CV cs.AI cs.RO

Benchmarking ResNet Backbones in RT-DETR: Impact of Depth and Regularization under environmental conditions

Pamela Barboza, Víctor Castelli, Belén Pereira, Ricardo Grando, Bruna de Vargas, Augusto Calfani

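AI Summary This paper studies how ResNet backbones of different depths affect RT-DETR object detection performance in competitive robotics settings, focusing on the effect of lighting and background changes on model confidence, accuracy, and inference latency. Four models (ResNet18, ResNet34, ResNet50, and ResNet101) were trained and evaluated under the same configuration. Environmental conditions mainly affect prediction confidence, while inference latency is largely unaffected and classification accuracy is generally high. ResNet50 performs best under lighting variation and ResNet34 gives more balanced performance under background variation, indicating that the optimal architecture depends on the type of environmental change.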

Comments Accepted at the International Conference on Data Science, Technology and Applications (DATA) 2026

Details
English abstract

Visual perception plays a central role in competitive robotics, where environmental variations can directly affect real-time detection performance. The related literature on transformer-based detectors lacks information regarding the impact of backbone scale and environmental settings on model performance. This work presents a comparative evaluation of RT-DETR for detecting round objects under environmental and hyperparameter variations relevant to competitive robotics. Four ResNet backbones (ResNet18, ResNet34, ResNet50, and ResNet101) were compared using dropout rates, analyzing their effect on confidence and accuracy. All models were trained under the same configuration and evaluated under changes in lighting and background contrast. Environmental conditions primarily impact prediction confidence, while inference latency remains largely unaffected and classification accuracy stays consistently high, approaching or above 1.00 in most cases. Two distinct behaviors were observed. Under illumination variation, ResNet50 achieves the best trade-off, combining near-perfect accuracy, confidence values up to approximately 0.869 and latency around 0.058-0.059 ms. Under background variation, ResNet34 provides the most balanced performance, reaching near-perfect accuracy and higher confidence values up to approximately 0.887. These results indicate that the optimal architecture depends on the type of environmental variation, with intermediate-depth models offering the best balance between performance and efficiency.

2605.08135 2026-05-12 cs.LG

Dendritic Neural Networks with Equilibrium Propagation

Yoshimasa Kubo

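AI Summary This paper investigates the feasibility of combining dendritic neural networks with equilibrium propagation (EP) and proposes a dendritic EP model built on an advanced EP framework. Experiments show that the model matches standard EP on simple tasks and significantly outperforms it on more challenging datasets and deeper networks, approaching the performance of dendritic networks trained with backpropagation through time. Analysis finds that dendritic structure changes the network's internal dynamics, increasing the magnitude and spread of hidden-state activations. This suggests that dendritic structure helps biologically plausible learning algorithms, especially where standard EP performs poorly.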

Comments 8 pages

详情
英文摘要

Equilibrium propagation (EP) is a biologically plausible alternative to backpropagation (BP), but its effectiveness can degrade in deeper and more challenging learning settings. In parallel, dendritic neural networks have demonstrated improved performance and generalization when trained with BP, suggesting that structured, biologically inspired architectures may enhance learning. In this work, we investigate the integration of dendritic neural networks with equilibrium propagation using an advanced EP framework. We evaluate the proposed dendritic EP model on MNIST, Kuzushiji-MNIST (KMNIST), and Fashion-MNIST (FMNIST), considering both shallow and deeper architectures. Our results show that dendritic EP achieves performance comparable to standard EP on simple tasks, while providing consistent improvements on more challenging datasets and deeper models. In particular, dendritic EP significantly outperforms standard EP on KMNIST and FMNIST, and approaches the performance of dendritic networks trained with backpropagation through time. To further understand these improvements, we analyze the evolution of hidden states during the free phase. We observe that dendritic EP exhibits higher activation magnitudes and more distributed hidden-state activity compared to standard EP, indicating that dendritic structure alters the internal network dynamics. These findings suggest that incorporating dendritic structure can enhance the effectiveness of biologically plausible learning algorithms, especially in regimes where standard EP struggles. Our work highlights the importance of architectural design for improving biologically inspired training methods.

2605.08134 2026-05-12 cs.LG cs.AI

DARE: Diffusion Language Model Activation Reuse for Efficient Inference

Natalia Frumkin, Bokun Wang, Hung-Yueh Chiang, Chi-Chih Chang, Mohamed S. Abdelfattah, Diana Marculescu

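AI Summary This paper proposes DARE, a method for improving the inference efficiency of diffusion language models (dLLMs). The authors find that bidirectional self-attention in dLLMs exhibits "token-wise redundancy": attention activations are highly correlated across tokens, so part of the computation can be reused. DARE uses two complementary mechanisms, DARE-KV for reusing key-value activations and DARE-O for reusing output activations, to cut redundant computation substantially. It achieves up to a 1.20x per-layer latency reduction and reuses up to 87% of attention activations while preserving generation quality.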

Details
English abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to auto-regressive (AR) models, offering greater expressive capacity and potential for parallel generation and faster inference. However, open-source dLLMs remain immature, lagging behind AR models in both efficiency and quality. We identify an underexplored property of dLLMs: *token-wise redundancy* in bi-directional self-attention. Self-attention activations are highly correlated across tokens, and temporal changes in query representations can predict redundancy in corresponding key, value, and output activations. We introduce DARE, with two complementary mechanisms: DARE-KV, which reuses cached key-value (KV) activations, and DARE-O, which reuses output activations to reduce redundant computation while preserving quality. DARE achieves up to 1.20x per-layer latency reduction and reuses up to 87% of attention activations, with negligible degradation on reasoning and code-generation benchmarks. DARE-KV and DARE-O incur average performance drops of only 2.0% and 1.2%, respectively. Combined with techniques such as prefix caching and Fast-dLLM, DARE provides additive gains without retraining. These results establish token-wise reuse as an effective strategy for improving the efficiency of diffusion-based LLMs while preserving generation fidelity. Code: https://github.com/enyac-group/DARE
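The reuse idea, where a small change in a token's query between denoising steps signals that its cached activations can be kept, can be sketched as below. The cosine-similarity criterion, the `threshold` value, and all function names are assumptions; the abstract says only that temporal changes in query representations can predict redundancy.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def reuse_mask(prev_queries, curr_queries, threshold=0.99):
    """Per token: True when the query barely changed since the previous step,
    which is taken here as a proxy for redundant activations."""
    return [cosine(p, c) >= threshold for p, c in zip(prev_queries, curr_queries)]

def attention_outputs(curr_queries, mask, cached_outputs, compute_output):
    """Reuse the cached activation where the mask allows it and recompute the
    rest. `compute_output` stands in for the real attention computation."""
    return [
        cached if reuse else compute_output(q)
        for q, reuse, cached in zip(curr_queries, mask, cached_outputs)
    ]
```

The fraction of `True` entries in the mask plays the role of the reuse rate, and the threshold trades that rate against the risk of serving stale activations.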

2605.08131 2026-05-12 cs.LG

Interactive Inverse Reinforcement Learning of Interaction Scenarios via Bi-level Optimization

Yue Mao, Shicheng Liu, Siyuan Xu, Minghui Zhu

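AI Summary This paper studies interactive inverse reinforcement learning (IIRL), in which a learner aims to learn an expert's reward function through interaction with the expert and to derive a corresponding interaction policy. The authors model IIRL as a stochastic bi-level optimization problem: the lower level learns a reward function that explains the expert's behavior, and the upper level learns a policy for interacting with the expert. They propose BISIRL, a double-loop algorithm that solves for the reward function in the inner loop and optimizes the interaction policy in the outer loop. Convergence is guaranteed theoretically, and experiments validate the algorithm's effectiveness.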

Details
English abstract

Inverse reinforcement learning (IRL) learns a reward function and a corresponding policy that best fit the demonstration data of an expert. However, in the current IRL setting, the learner is isolated from the expert and can only passively observe the expert demonstrations. This limits the applicability of IRL to interactive settings, where the learner actively interacts with the expert and needs to infer the expert's reward function from the interactions. To bridge the gap, this paper studies interactive IRL (IIRL), where a learner interacting with an expert aims to learn both the expert's reward function and a policy for that interaction. We formulate IIRL as a stochastic bi-level optimization problem where the lower level learns a reward function to explain the behaviors of the expert, and the upper level learns a policy to interact with the expert. We develop a double-loop algorithm, Bi-level Interactive Scenarios Inverse Reinforcement Learning (BISIRL), which solves the lower-level problem in the inner loop and the upper-level problem in the outer loop. We formally guarantee that BISIRL converges and validate our algorithm through extensive experiments.