arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2604.22951 2026-04-28 cs.AI cs.CL cs.LG

The Power of Power Law: Asymmetry Enables Compositional Reasoning

Zixuan Wang, Xingyu Dang, Jason D. Lee, Kaifeng Lyu

Natural language data follows a power-law distribution, with most knowledge and skills appearing at very low frequency. While a common intuition suggests that reweighting or curating data towards a uniform distribution may help models better learn these long-tail skills, we find a counterintuitive result: across a wide range of compositional reasoning tasks, such as state tracking and multi-step arithmetic, training under power-law distributions consistently outperforms training under uniform distributions. To understand this advantage, we introduce a minimalist skill-composition task and show that learning under a power-law distribution provably requires significantly less training data. Our theoretical analysis reveals that power law sampling induces a beneficial asymmetry that improves the pathological loss landscape, which enables models to first acquire high-frequency skill compositions with low data complexity, which in turn serves as a stepping stone to efficiently learn rare long-tailed skills. Our results offer an alternative perspective on what constitutes an effective data distribution for training models.
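
To make the two sampling regimes compared above concrete, here is a minimal sketch that draws skill indices under a Zipf-style power law and under a uniform distribution. The exponent, the number of skills, and the function names are illustrative assumptions, not values or code from the paper.

```python
import random
from collections import Counter

def power_law_weights(num_skills, exponent=1.0):
    """Return Zipf-style probabilities p_k proportional to 1 / k**exponent."""
    raw = [1.0 / (k ** exponent) for k in range(1, num_skills + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def sample_skills(weights, num_samples, seed=0):
    """Draw 0-based skill indices according to the given probabilities."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)

num_skills = 10  # hypothetical number of skills
power_law = power_law_weights(num_skills, exponent=1.0)
uniform = [1.0 / num_skills] * num_skills

# Empirical frequencies: the power-law sample is dominated by a few skills,
# while the uniform sample spreads draws evenly across all of them.
power_law_counts = Counter(sample_skills(power_law, 10_000))
uniform_counts = Counter(sample_skills(uniform, 10_000))
```

Under the power law the first skill is drawn roughly ten times as often as the tenth, which is the kind of asymmetry the abstract credits with easing learning.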

2604.22939 2026-04-28 cs.CL cs.AI cs.CV cs.IR

Self Knowledge Re-expression: A Fully Local Method for Adapting LLMs to Tasks Using Intrinsic Knowledge

Mengyu Wang, Xiaoying Zhi, Zhiyi Li, Robin Schmucker, Shay B. Cohen, Tiejun Ma, Fran Silavong

While the next-token prediction (NTP) paradigm enables large language models (LLMs) to express their intrinsic knowledge, its sequential nature constrains performance on specialized, non-generative tasks. We attribute this performance bottleneck to the LLMs' knowledge expression mechanism, rather than to deficiencies in knowledge acquisition. To address this, we propose Self-Knowledge Re-expression (SKR), a novel, task-agnostic adaptation method. SKR transforms the LLM's output from generic token generation to highly efficient, task-specific expression. SKR is a fully local method that uses only unannotated data, requiring neither human supervision nor model distillation. Experiments on a large financial document dataset demonstrate substantial improvements: over 40% in Recall@1 for information retrieval tasks, over 76% reduction in object detection latency, and over 33% increase in anomaly detection AUPRC. Our results on the MMDocRAG dataset surpass those of leading retrieval models by at least 12.6%.

2604.22937 2026-04-28 cs.CL cs.LG cs.PL

AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs

Pouya Pezeshkpour, Estevam Hruschka

Verification is becoming central to both reinforcement-learning-based training and inference-time control of large language models (LLMs). Yet current verifiers face a fundamental trade-off: LLM-based verifiers are expressive but hard to control and prone to error, while deterministic executable verifiers are reliable and interpretable but often limited in capability. We study the following question: given a development set of LLM outputs and labels for a target objective, such as correctness, can we automatically induce a minimal set of Python verifiers whose joint satisfaction closely matches that objective? We propose AutoPyVerifier, a framework that uses an LLM to synthesize candidate verifier functions and then refines them through search over a directed acyclic graph (DAG). By navigating the DAG, AutoPyVerifier systematically explores the space of deterministic executable verifiers and selects a compact verifier set whose joint satisfaction best approximates the target objective. Across mathematical reasoning, coding, function calling, and instruction-following benchmarks for several state-of-the-art LLMs, AutoPyVerifier improves target-objective prediction by up to 55.0 F1 points over the initial LLM-generated verifier sets. Additional analyses show that the most useful verification targets vary by benchmark and model, and that the DAG-based search shifts the learned verifier sets toward more structural and semantically grounded checks. We further show that exposing the discovered verifier set to an LLM as an external tool improves downstream accuracy by up to 17.0 points. We release our code
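
The notion of "joint satisfaction" of a verifier set, scored against development labels, can be sketched as follows. The two verifier functions, the sample outputs, and the labels are hypothetical; the abstract does not describe the framework's actual verifiers or its DAG search, which this sketch does not attempt.

```python
def joint_satisfaction(output, verifiers):
    """An output passes only if every deterministic verifier accepts it."""
    return all(check(output) for check in verifiers)

def f1_score(predictions, labels):
    """F1 of boolean predictions against boolean target labels."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical verifiers for a math-style task.
def is_nonempty(output):
    return bool(output.strip())

def ends_with_integer(output):
    tokens = output.split()
    return bool(tokens) and tokens[-1].lstrip("-").isdigit()

# Hypothetical development set: LLM outputs and correctness labels.
dev_outputs = ["The answer is 42", "I am not sure", "", "Result: -7"]
dev_labels = [True, False, False, True]

verifiers = [is_nonempty, ends_with_integer]
predictions = [joint_satisfaction(o, verifiers) for o in dev_outputs]
score = f1_score(predictions, dev_labels)
```

A search procedure would compare `score` across candidate verifier sets and prefer the smallest set that approximates the labels well.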

2604.22934 2026-04-28 cs.AI cs.CL

PExA: Parallel Exploration Agent for Complex Text-to-SQL

Tanmay Parekh, Ella Hofmann-Coyle, Shuyi Wang, Sachith Sri Ram Kothur, Srivas Prasad, Yunmo Chen

Comments Accepted at ACL 2026

LLM-based agents for text-to-SQL often struggle with a latency-performance trade-off, where performance improvements come at the cost of latency or vice versa. We reformulate text-to-SQL generation through the lens of software test coverage, where the original query is prepared with a suite of test cases with simpler, atomic SQLs that are executed in parallel and together ensure semantic coverage of the original query. After iterating on test case coverage, the final SQL is generated only when enough information is gathered, leveraging the explored test case SQLs to ground the final generation. We validated our framework on a state-of-the-art benchmark for text-to-SQL, Spider 2.0, achieving a new state-of-the-art with 70.2% execution accuracy.

2604.22911 2026-04-28 cs.RO

RecoverFormer: End-to-End Contact-Aware Recovery for Humanoid Robots

Zihui Liu

Humanoid robots operating in unstructured environments must recover from unexpected disturbances, a capability that remains challenging for end-to-end control policies. We present RECOVERFORMER, a fully end-to-end humanoid recovery policy that learns when and how to switch among recovery behaviors (including compensatory stepping, hand-environment contact, and center-of-mass reshaping) while maintaining robust performance under model mismatch. The architecture combines a causal transformer over a 50-step observation history with two novel heads: a latent recovery mode that enables smooth transitions among distinct recovery strategies, and a contact affordance head that predicts which environmental surfaces (walls, railings, table edges) are beneficial for stabilization. We evaluate RECOVERFORMER on the Unitree G1 humanoid in MuJoCo. Trained only on open floor, RECOVERFORMER transfers zero-shot to walled environments, achieving 100% recovery success across 100-300 N pushes and across wall distances from 0.25 to 1.4 m. Under zero-shot dynamics mismatch, RECOVERFORMER reaches 75.5% at +25% mass, 89% under 30 ms latency, 91.5% at low friction, and 99% under compound friction, latency, and mass perturbation. The learned latent modes specialize across force regimes without mode-level supervision, validated by t-SNE analysis of 300 episodes. Taken together, these results show that a single end-to-end policy can deliver multi-modal, contact-aware humanoid recovery that generalizes across perturbation magnitude, contact geometry, and dynamics shift.

2604.22909 2026-04-28 cs.LG

Deep Clustering for Climate: Analyzing Teleconnections through Learned Categorical States

Lívia Meinhardt, Dário Oliveira

Understanding and representing complex climate variability is essential for both scientific analysis and predictive modeling. However, identifying meaningful climate regimes from raw variables is challenging, as they exhibit high noise and nonlinear dependencies. In this work, we explore the use of Masked Siamese Networks to discretize climate time series into semantically rich clusters. Focusing on daily minimum and maximum temperature, we show that the resulting representations: (i) yield clusters that reflect meaningful climate states under our modeling assumptions, offering a simplified representation for downstream use; (ii) enable sampling and analysis of specific climate scenarios; and (iii) exhibit statistical associations with El Niño events, underscoring their scientific relevance. Our findings highlight the potential of self-supervised discretization as a tool for climate data analysis and open avenues for incorporating richer climate indicators in future work.

2604.22903 2026-04-28 cs.CV cs.AI

On the Complementarity of Quantum and Classical Features: Adaptive Hybrid Quantum-Classical Feature Fusion for Breast Cancer Classification

Yasmin Rodrigues Sobrinho, João Renato Ribeiro Manesco, João Paulo Papa

Comments 41 pages, 16 figures. This manuscript is a preprint under review at Artificial Intelligence in Medicine

The integration of quantum machine learning with classical deep learning offers promising avenues for medical image analysis by mapping data into high-dimensional Hilbert spaces. However, effectively unifying these distinct paradigms remains challenging due to common optimization asymmetries. In this paper, a novel hybrid quantum-classical architecture for breast cancer diagnosis based on a dual-branch feature-extraction pipeline is proposed. Our framework extracts and unifies complementary representations from classical models and quantum circuits, exploring both trainable and deterministic (non-trainable) quantum paradigms. To integrate these embeddings, three progressive feature fusion strategies are introduced: Static Hybrid Fusion (SHF) for offline extraction, Dynamic Hybrid Fusion (DHF) for end-to-end co-adaptation, and a novel Temperature-Scaled Hybrid Fusion (TSHF). The TSHF strategy incorporates a learnable scalar, inspired by multimodal learning, that dynamically balances hybrid gradient dynamics and resolves optimization bottlenecks. Empirical validation on the BreastMNIST dataset confirms our hypothesis that unifying diverse feature representations creates a richer data context. The TSHF strategy, specifically when pairing a ResNet backbone with a trainable quantum circuit, achieved a peak accuracy of 87.82%, F1-score of 91.77%, and an AUC-ROC of 89.08%, outperforming purely classical baselines. These results demonstrate that the proposed hybrid framework improves classification accuracy and threshold reliability, providing a stable, high-performance architecture for the clinical deployment of quantum-enhanced diagnostic tools.

2604.22901 2026-04-28 cs.LG

Accelerating Frequency Domain Diffusion Models with Error-Feedback Event-Driven Caching

Dong Liu, Haisheng Wang, Yanxuan Yu

Diffusion models achieve remarkable success in time series generation. However, slow inference limits their practical deployment. We propose E$^2$-CRF (Error-Feedback Event-Driven Cumulative Residual Feature caching) to accelerate frequency domain diffusion models. Our method exploits two structural properties: (1) spectral localization, where signal energy concentrates in low frequencies, and (2) mirror symmetry, which halves the effective frequency dimension. E$^2$-CRF uses a closed-loop error-feedback system that adaptively caches transformer KV features across diffusion steps. We trigger recomputation using event-driven residual dynamics instead of fixed schedules. Our method selectively recomputes high-energy or rapidly-changing tokens while reusing cached features for stable high-frequency components. E$^2$-CRF achieves ~2.2$\times$ speedup while maintaining sample quality. We demonstrate effectiveness on 5 datasets. Our caching strategy naturally aligns with the diffusion process's structure-to-detail progression. We include sufficient-condition error and complexity bounds under standard regularity assumptions (Appendix), alongside empirical validation. Our code is available at https://github.com/NoakLiu/FastFourierDiffusion and is also integrated in https://github.com/NoakLiu/FastCache-xDiT.
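
The "mirror symmetry" property can be illustrated with a small check, assuming it refers to the conjugate (Hermitian) symmetry of the discrete Fourier transform of a real-valued signal; the abstract does not spell this out. The signal values below are arbitrary, and the naive DFT is only for demonstration.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

# Arbitrary real-valued time series (illustrative values only).
signal = [0.5, 1.0, -0.25, 2.0, 0.0, -1.5, 0.75, 0.3]
n = len(signal)
spectrum = dft(signal)

# For real input, bin n-k is the complex conjugate of bin k,
# so only the first n//2 + 1 bins carry independent information.
max_mismatch = max(
    abs(spectrum[n - k] - spectrum[k].conjugate()) for k in range(1, n)
)
unique_bins = n // 2 + 1
```

Here `max_mismatch` is at floating-point noise level, so a model operating on the spectrum only needs to process `unique_bins` of the `n` frequency bins.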

2604.22899 2026-04-28 cs.CV

Text-Guided Multimodal Unified Industrial Anomaly Detection

Zewen Li, Shuo Ye, Zitong Yu, Weicheng Xie, Linlin Shen

Comments 12 pages

Industrial anomaly detection based on RGB-3D multimodal data has emerged as a mainstream paradigm for intelligent quality inspection. However, existing unsupervised methods suffer from two critical limitations: ambiguous cross-modal alignment caused by the lack of high-level semantic guidance and insufficient geometric modeling for RGB-to-3D feature mapping. To address these issues, we propose a unified multimodal industrial anomaly detection framework guided by text semantics. The framework consists of two core modules: a Geometry-Aware Cross-Modal Mapper to preserve geometric structure during modality conversion, and an Object-Conditioned Textual Feature Adaptor to align multimodal features with semantic priors. Furthermore, we establish a unified learning paradigm for multimodal industrial anomaly detection, which breaks the one-model-one-class constraint and enables accurate anomaly detection across diverse classes using a single model. Extensive experiments on the MVTec 3D-AD and Eyecandies datasets demonstrate that our method achieves state-of-the-art performance in classification and localization under unsupervised settings.

2604.22893 2026-04-28 cs.LG cs.AI

Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs

Minghui Xu, Qi Luo, Kun Li

Comments 23 pages, 1 figure, 6 tables

Traditional data valuation methods based on ``row-count $\times$ quality coefficient'' paradigms fail to capture the nuanced, nonlinear contributions that data makes to Large Language Model (LLM) capabilities. This paper presents a dynamic data valuation framework that transitions from static accounting to utility-based pricing. Our approach operates on three layers: (1) token-level information density metrics using Shannon entropy and Data Quality Scores; (2) empirical training gain measurement through influence functions, proxy model strategies, and Data Shapley values; and (3) cryptographic verifiability through hash-based commitments, Merkle trees, and a tamper-evident training ledger. We provide comprehensive experimental validation on three real domains (instruction following, mathematical reasoning, and code summarization), demonstrating that proxy-based empirical gain achieves near-perfect ranking alignment with realized utility, substantially outperforming row-count and token-count baselines. This framework enables a fair Data-as-a-Service economy where high-reasoning data is priced according to its actual contribution to model intelligence, while providing the transparency and auditability necessary for trustworthy data markets.
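
As a rough illustration of a token-level information density signal, the sketch below computes the Shannon entropy of a sample's empirical token distribution. Whitespace tokenization and the function name are assumptions; the paper's actual density metric and Data Quality Score are not specified in the abstract.

```python
import math
from collections import Counter

def shannon_entropy_bits(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical samples: a repetitive one and a more varied one.
repetitive = "the the the the".split()
varied = "prove the lemma by induction".split()

repetitive_entropy = shannon_entropy_bits(repetitive)
varied_entropy = shannon_entropy_bits(varied)
```

A row-count or token-count baseline would price these two samples almost identically, whereas the entropy separates them.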

2604.22892 2026-04-28 cs.LG

StackFeat RL: Reinforcement Learning over Iterative Dual Criterion Feature Selection for Stable Biomarker Discovery

A. Yermekov, D. A. Herrera-Martí

Comments 7 pages. Submitted to ECCB 2026

Feature selection in high-dimensional genomic data ($d \gg n$) demands methods that are simultaneously accurate, sparse, and stable. Existing approaches either require manual threshold specification (mRMR, stability selection), produce unstable selections under data perturbation (Lasso, Boruta), or ignore biological structure entirely. We introduce StackFeat-RL, a meta-learning framework that optimises the hyperparameters of an iterative dual-criterion feature selection algorithm via REINFORCE policy gradients. The dual criterion, requiring both coefficient consistency and selection frequency, guards against two failure modes missed by single-criterion methods, while iterative accumulation provides convergence guarantees via the law of large numbers. On COVID-19 miRNA data (GSE240888, 332 features) and three Alzheimer's disease classification tasks (GSE84422, 13237 genes; Normal vs. Possible, Probable, and Definite AD), StackFeat-RL achieves the highest predictive accuracy among all evaluated methods, including ElasticNet, Boruta, mRMR, and stability selection, while requiring 3--4$\times$ fewer features. Keywords: feature selection, reinforcement learning, REINFORCE, elastic net, biomarker discovery, Alzheimer's disease, dual-criterion selection, protein interaction networks

2604.22886 2026-04-28 cs.CV

Breaking Degradation Coupling: A Structural Entropy Guided Decoupled Framework and Benchmark for Infrared Enhancement

Pu Li, Huafeng Li, Yafei Zhang, Yu Liu, Wen Wang

Comments Accepted by CVPR 2026

Thermal infrared image enhancement aims to restore high-quality images from complex compound degradations. Existing all-in-one approaches typically employ a single shared backbone to handle diverse degradations, which causes gradient interference and parameter competition. To address this, we propose a Structural Entropy-Guided Decoupled (SEGD) Framework. Unlike unified modeling paradigms, SEGD decomposes compound degradations into independent sub-processes and models them in a divide-and-conquer manner through Degradation-Specific Residual Modules (DRMs). Each DRM focuses on residual estimation for a specific degradation, enabling task decoupling while remaining jointly trainable, which mitigates parameter contention. A Degradation-Aware Evidential Network further estimates degradation type and intensity, providing priors that adaptively regulate DRM restoration strength. To handle compound cases, DRMs are composed in varying orders to form multiple restoration paths, from which the most informative features are aggregated under a structural-entropy criterion, yielding decoder-ready representations with structural fidelity and degradation awareness. Integrating divide-and-conquer restoration, evidential perception, and entropy-guided adaptation, SEGD achieves fine-grained and interpretable enhancement. We also construct a nighttime TIR benchmark for evaluation under real low-light conditions. Experimental results demonstrate that SEGD surpasses state-of-the-art methods while achieving higher efficiency with fewer parameters.

2604.22885 2026-04-28 cs.CV cs.AI

Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization

Hefeng Zhou, Xuan Liu, Sicheng Chen, Wutong Zhang, Wu Yan, Jiong Lou, Chentao Wu, Guangtao Xue, Wei Zhao, Jie Li

Federated cross-modal retrieval faces severe challenges from heterogeneous client data, particularly non-IID semantic distributions and missing modalities. Under such heterogeneity, a single global model is often insufficient to capture both shared cross-modal knowledge and client-specific characteristics. We propose RCSR, a personalization-friendly federated framework that integrates prototype anchoring, retrieval-centric semantic routing, and optional client-specific adapters. Built on a frozen CLIP backbone, RCSR leverages lightweight shared adapters for global knowledge transfer while supporting efficient local personalization. Prototype anchoring helps unimodal clients align with global cross-modal semantics, and a server-side semantic router adaptively assigns aggregation weights based on retrieval consistency to mitigate alignment drift during heterogeneous updates. Extensive experiments on MS-COCO, Flickr30K, and other benchmarks show that RCSR consistently improves global retrieval accuracy and training stability, while further enhancing client-level retrieval performance, especially for clients with incomplete modalities. Code is available at https://github.com/RezinChow/RCSR-Retrieval-Centric-Semantic-Routing.

2604.22884 2026-04-28 cs.CV cs.AI

Can Multimodal Large Language Models Truly Understand Small Objects?

Fujun Han, Junan Chen, Xintong Zhu, Jingqi Ye, Xuanjie Mao, Tao Chen, Peng Ye

Comments Under Peer Review (26 pages, 9 figures, 6 tables)

Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis, math and physics olympiads. However, their abilities remain unexplored for Small Object Understanding (SOU) tasks. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for exploring the small object understanding capability of existing MLLMs. Specifically, we first design an effective and automatic visual question-answer generation strategy, constructing a new SOU-VQA evaluation dataset, with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). Then, we conduct a comprehensive evaluation on 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervised fine-tuning of the latest MLLM, we demonstrate that SOU-Train can effectively enhance the latest MLLM's ability to understand small objects. Comprehensive experimental results demonstrate that the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation to the community for further developing models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.

2604.22883 2026-04-28 cs.CV cs.AI

NeuroAPS-Net: Neuro-Anatomically Aware Point Cloud Representation for Efficient Alzheimer's Disease Classification

Towhidul Islam, Mufti Mahmud

Comments 6 pages, 3 figures, Accepted at IJCNN 2026

Alzheimer's disease (AD) is a progressive neurodegenerative disorder and a major cause of dementia. Structural MRI is widely used to analyze AD-related brain atrophy; however, most deep learning methods rely on computationally expensive 3D convolutional neural networks (CNNs), limiting deployment in resource-constrained settings. This work introduces two main contributions. First, we propose a pipeline that converts T1-weighted MRI into anatomically informed 2D point clouds using Anatomical Priority Sampling (APS), producing ADNI-2DPC, the first neuroanatomically labeled MRI-derived point cloud dataset. Second, we present NeuroAPS-Net, a lightweight geometric deep learning model that incorporates anatomical priors via region-aware feature encoding and ROI token aggregation. Experiments on ADNI-2DPC demonstrate that NeuroAPS-Net achieves competitive classification accuracy while significantly reducing inference latency and GPU memory compared to state-of-the-art point cloud methods. These results highlight the potential of anatomically guided point cloud learning as an efficient and interpretable alternative to voxel-based CNNs for AD classification.

2604.22882 2026-04-28 cs.LG physics.comp-ph physics.data-an

Predicting Wind Loads on Container Ships in Harbor Environments through Multi-Fidelity Modeling

Matilde Fiore, Andrea Bresciani, Miguel Alfonso Mendez, Jeroen van Beeck

Modern container ships face higher wind loads due to increased windage areas, making accurate predictions of wind loads essential for mooring design. Existing empirical models, largely developed for container ships with smaller windage areas and simpler geometrical configurations than those of modern large-scale vessels, often lack accuracy and do not account for the influence of nearby structures. This study proposes a multi-fidelity surrogate modelling framework for the prediction of wind-load coefficients, combining empirical correlations with simplified and detailed CFD models for ships in open-sea and harbor environments. The approach relies on recursive co-kriging to consistently fuse information across fidelity levels, enabling accurate predictions at a reduced computational cost. A sensitivity analysis is used to identify the most influential geometric parameters, and the resulting reduced parameter space is explored through sequential sampling to efficiently construct the training database. The surrogate models are validated over a wide range of loading configurations and for two distinct harbor environments. The results demonstrate that the multi-fidelity approach significantly improves prediction accuracy compared to single-fidelity models, while substantially reducing the reliance on high-fidelity simulations. In particular, the proposed framework captures the dependence of wind loads on key geometric parameters and consistently outperforms traditional empirical correlations, providing a robust and efficient tool for engineering applications.

2604.22881 2026-04-28 cs.LG cs.AI

MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

Xin Wang, Chi Ma, Shaobin Chen, Pu Wang, Menglei Zhou, Junyi Qiu, Qiaorui Chen, Jiayu Sun, Shijie Liu, Zehuan Wang, Lei Yu, Chuan Liu, Fei Jiang, Wei Lin, Hao Wang, Jiawei Jiang, Xiao Yan

Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1$\times$ speedup while maintaining near-perfect hit ratios (>98.5%).
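
The idea of a small fast tier backed by a larger host-memory tier can be sketched with a toy two-level cache. This is not MTServe's design: plain LRU stands in for the locality-driven replacement policy, the class and method names are hypothetical, and strings stand in for per-user KV states.

```python
from collections import OrderedDict

class TwoTierCache:
    """Toy two-level cache: a small 'GPU' tier that spills to a 'host' tier."""

    def __init__(self, gpu_capacity, host_capacity):
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity
        self.gpu = OrderedDict()   # most recently used entries at the end
        self.host = OrderedDict()

    def put(self, user_id, kv_state):
        """Insert into the GPU tier, demoting the oldest entries if needed."""
        self.host.pop(user_id, None)
        self.gpu[user_id] = kv_state
        self.gpu.move_to_end(user_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, state = self.gpu.popitem(last=False)
            self.host[victim] = state          # demote to host RAM
        while len(self.host) > self.host_capacity:
            self.host.popitem(last=False)      # drop entirely

    def get(self, user_id):
        """Return (state, tier), promoting host hits back to the GPU tier."""
        if user_id in self.gpu:
            self.gpu.move_to_end(user_id)
            return self.gpu[user_id], "gpu"
        if user_id in self.host:
            state = self.host.pop(user_id)
            self.put(user_id, state)
            return state, "host"
        return None, "miss"

cache = TwoTierCache(gpu_capacity=2, host_capacity=2)
for uid in ("u1", "u2", "u3"):
    cache.put(uid, f"kv-{uid}")
first_lookup = cache.get("u1")    # demoted earlier, so served from host
second_lookup = cache.get("u1")   # promoted by the previous lookup
```

A host hit avoids re-encoding the user history but still pays a transfer cost, which is the gap the abstract's asynchronous pipeline and storage layout are meant to narrow.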

2604.22880 2026-04-28 cs.CL

TexOCR: Advancing Document OCR Models for Compilable Page-to-LaTeX Reconstruction

Chengye Wang, Lin Fu, Zexi Kuang, Yilun Zhao

Comments Accepted by ACL 2026 Main

Existing document OCR largely targets plain text or Markdown, discarding the structural and executable properties that make LaTeX essential for scientific publishing. We study page-level reconstruction of scientific PDFs into compilable LaTeX and introduce TexOCR-Bench, a benchmark, and TexOCR-Train, a large-scale training corpus, for this task. TexOCR-Bench features a multi-dimensional evaluation suite that jointly assesses transcription fidelity, structural faithfulness, and end-to-end compilability. Leveraging TexOCR-Train, we train a 2B-parameter model, TexOCR, using supervised fine-tuning (SFT) and reinforcement learning (RL) with verifiable rewards derived from LaTeX unit tests that directly enforce compilability and referential integrity. Experiments across 21 frontier models on TexOCR-Bench show that existing systems frequently violate key document invariants, including consistent section structure, correct float placement, and valid label-reference links, which undermines compilation reliability and downstream usability. Our analysis further reveals that RL with verifiable rewards yields consistent improvements over SFT alone, particularly on structural and compilation metrics.

2604.22873 2026-04-28 cs.LG cs.AI

When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

Elias Hossain, Mohammad Jahid Ibna Basher, Ivan Garibay, Ozlem Garibay, Niloofar Yousefi

Offline reinforcement learning (RL) can learn effective policies from fixed datasets, but deployment objectives may change after training, and in many applications the trained actor cannot be retrained because of data, cost, or governance constraints. We study deployment-time adaptation for frozen offline actors using Product-of-Experts (PoE) composition with a goal-conditioned prior. Our main practical finding is graceful degradation rather than universal performance gain: under degraded or random priors, precision-weighted composition remains anchored to the frozen actor, while additive and prior-only adaptation collapse, and a KL-budget selector often recovers a near-oracle operating point. We also make explicit a closed-form identity in the frozen-actor setting: for diagonal-Gaussian actors and priors, PoE with coefficient alpha yields the same deterministic policy as KL-regularized adaptation with beta = alpha / (1 - alpha), with posterior covariances differing only by a global scalar factor. Empirically, across four D4RL environments (3,900 MuJoCo episodes), we observe a 4/5/3 HELP/FROZEN/HURT split. Extending the analysis to six harder cells and two AntMaze diagnostics reveals an actor-competence ceiling: medium-expert remains HURT in all 9 cells at every tested alpha, while AntMaze with a behavior-cloned frozen actor yields zero success for all composition rules. Overall, PoE and KL-regularized adaptation are best viewed as a single actor-anchored safety mechanism for deployment-time steering.
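
The closed-form identity can be checked numerically under one parameterization that is consistent with the abstract but assumed here, since the abstract does not give the definitions: PoE as the weighted product $\pi_a^{1-\alpha}\,\pi_p^{\alpha}$ and KL-regularized adaptation as $\pi_a\,\pi_p^{\beta}$, both for diagonal Gaussians. All numeric values are arbitrary.

```python
def weighted_product(mu_a, var_a, w_a, mu_p, var_p, w_p):
    """Per-dimension mean and variance of N(mu_a, var_a)**w_a * N(mu_p, var_p)**w_p."""
    means, variances = [], []
    for ma, va, mp, vp in zip(mu_a, var_a, mu_p, var_p):
        precision = w_a / va + w_p / vp
        means.append((w_a * ma / va + w_p * mp / vp) / precision)
        variances.append(1.0 / precision)
    return means, variances

alpha = 0.3
beta = alpha / (1.0 - alpha)

# Arbitrary frozen-actor and goal-conditioned-prior Gaussians (illustrative).
mu_actor, var_actor = [0.2, -1.0], [0.5, 0.1]
mu_prior, var_prior = [1.0, 0.5], [0.25, 2.0]

# Assumed PoE form: actor weighted by (1 - alpha), prior weighted by alpha.
poe_mean, poe_var = weighted_product(
    mu_actor, var_actor, 1.0 - alpha, mu_prior, var_prior, alpha
)
# Assumed KL-regularized form: actor weight 1, prior weight beta.
kl_mean, kl_var = weighted_product(
    mu_actor, var_actor, 1.0, mu_prior, var_prior, beta
)
variance_ratios = [k / p for k, p in zip(kl_var, poe_var)]
```

Under these assumptions the two means coincide, so the deterministic (mean) policies match, and every entry of `variance_ratios` equals `1 - alpha`, i.e. the covariances differ by a single global scalar.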

2604.22872 2026-04-28 cs.CV cs.SY eess.SY

Vision-Based Lane Following and Traffic Sign Recognition for Resource-Constrained Autonomous Vehicles

Md Tanjemul Islam, Md Rafiul Kabir

Comments 2026 International Conference on Intelligent Systems, Blockchain, and Communication Technologies

Autonomous vehicles (AVs) rely on real-time perception systems to understand road environments and ensure safe navigation. However, implementing reliable perception algorithms on resource-constrained embedded platforms remains challenging due to limited computational resources. This paper presents a lightweight vision-based framework that integrates lane detection, lane tracking, and traffic sign recognition for embedded autonomous vehicles. A computationally efficient threshold-based lane segmentation method combined with perspective transformation and histogram-based curvature estimation is used for robust lane tracking under varying illumination conditions. A rule-based steering controller generates steering commands to maintain stable vehicle navigation. For traffic sign recognition, two lightweight convolutional neural networks (CNNs), EfficientNet-B0 and MobileNetV2, are evaluated using a custom dataset captured from the vehicle's onboard camera. Experimental results show that the system achieves real-time performance while maintaining accurate lane tracking with only 3.16% maximum offset RMSE. EfficientNet-B0 achieves a high offline classification accuracy of 98.77% on the test dataset, while achieving 90% accuracy during real-time on-device deployment, outperforming MobileNetV2 in both settings. MobileNetV2, however, offers slightly faster inference and lower computational cost. These results highlight the effectiveness of lightweight vision-based perception pipelines for resource-constrained autonomous driving applications.

2604.22870 2026-04-28 cs.LG cs.AI cs.LO

Towards Understanding the Expressive Power of GNNs with Global Readout

Maurice Funk, Daumantas Kojelis

Comments 17 pages

We study the expressive power of message-passing aggregate-combine-readout graph neural networks (ACR-GNNs). Particularly, we focus on the first-order (FO) properties expressible by this formalism. While a tight logical characterisation remains a difficult open question, we make two contributions towards answering it. First, we show that sum aggregation and readout suffice for GNNs to capture FO properties that cannot be expressed in the logic C2 on both directed and undirected graphs. This strengthens known results by Hauke and Wałęga (2026) where aggregation and readout functions are specially crafted for the task. Second, we identify two natural ways of restoring characterisability (with regard to C2) for ACR-GNNs. One option is to limit local aggregation (without imposing restrictions on global readout), whilst the second is to run ACR-GNNs over graphs of bounded degree (but unbounded size). In both cases, the FO properties captured by GNNs are exactly those definable by a formula in graded modal logic with global counting modalities. Our results thus establish an innate lower- and upper-bound in terms of how far (fragments of) C2 can be taken to characterise GNNs, and imply that it is indeed the unbounded interaction of aggregation and readout that pushes the logical expressive power of GNNs above C2.

2604.22869 2026-04-28 cs.LG cs.AI cs.SY eess.SY

Avionic Main Fuel Pump Simulation and Fault-Diagnosis Benchmark

Felix Leonhard Janzen, Lukas Moddemann, Alexander Diedrich, Oliver Niggemann

Details
English abstract

In many cyber-physical systems, especially in critical applications such as aeroplanes, data to train anomaly detection and diagnosis algorithms is lacking due to data protection issues and partial observability. To combat this inherent lack of data, we introduce a high-fidelity, physics-informed co-simulation of a common aircraft main-fuel-pump system modelled in MATLAB/Simulink Simscape Fluids. We also describe its generated time-series data with health and fault mode annotations. To show feasibility of our benchmark, we apply an unsupervised Recurrent Variational Autoencoder (RNN-VAE) for anomaly detection and a SOM-VAE for operating mode discretization, trained to separate healthy and faulty conditions.

2604.22868 2026-04-28 cs.CV cs.AI

Probing Visual Planning in Image Editing Models

Zhimu Zhou, Yanpeng Zhao, Qiuyu Liao, Bo Zhao, Xiaojian Ma

Comments Accepted to ES-Reasoning Workshop @ ICLR 2026. Our code is available at https://github.com/spatigen/amaze

Details
English abstract

Visual planning represents a crucial facet of human intelligence, especially in tasks that require complex spatial reasoning and navigation. Yet, in machine learning, this inherently visual problem is often tackled through a verbal-centric lens. While recent research demonstrates the promise of fully visual approaches, they suffer from significant computational inefficiency due to the step-by-step planning-by-generation paradigm. In this work, we present EAR, an editing-as-reasoning paradigm that reformulates visual planning as a single-step image transformation. To isolate intrinsic reasoning from visual recognition, we employ abstract puzzles as probing tasks and introduce AMAZE, a procedurally generated dataset that features the classical Maze and Queen problems, covering distinct, complementary forms of visual planning. The abstract nature of AMAZE also facilitates automatic evaluation of autoregressive and diffusion-based models in terms of both pixel-wise fidelity and logical validity. We assess leading proprietary and open-source editing models. The results show that, although they all struggle in the zero-shot setting, finetuning on basic scales enables remarkable generalization to larger in-domain scales and to out-of-domain scales and geometries. However, our best model that runs on high-end hardware fails to match the zero-shot efficiency of human solvers, highlighting a persistent gap in neural visual reasoning.

2604.22865 2026-04-28 cs.CV

MeshLAM: Feed-Forward One-Shot Animatable Textured Mesh Avatar Reconstruction

Yisheng He, Steven Hoi

Comments Accepted to CVPR 2026

Details
English abstract

We introduce MeshLAM, a feed-forward framework for one-shot animatable mesh head reconstruction that generates high-fidelity, animatable 3D head avatars from a single image. Unlike previous work that relies on time-consuming test-time optimization or extensive multi-view data, our method produces complete mesh representations with inherent animatability from a single image in a single forward pass. Our approach employs a dual shape and texture map architecture that simultaneously processes mesh vertices and texture map with extracted image features from a shared transformer backbone, allowing for coherent shape carving and appearance modeling. To prevent mesh collapse and ensure topological integrity during feed-forward deformation, we propose an iterative GRU-based decoding mechanism with progressive geometry deformation and texture refinement, coupled with a novel reprojection-based texture guidance mechanism that anchors appearance learning to the input image. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in reconstruction quality, animation capability, and computational efficiency. Project page at https://meshlam.github.io.

2604.22860 2026-04-28 cs.RO cs.SY eess.SY

Airspeed Forward-Invariance for Unpowered Fixed-Wing Aircraft

Huseyin Emre Tekaslan, Ella M. Atkins

Details
English abstract

Autonomous fixed-wing flight is becoming a key capability in aerial robotics, enabling sensing, mobility, and contingency operations across both small-scale Uncrewed Aircraft Systems and large-scale Advanced Air Mobility. During unpowered operation in fixed-wing platforms, airspeed is regulated solely through potential-kinetic energy exchange, making airspeed dynamics highly sensitive to guidance commands, particularly under wind. This paper presents a viability-based airspeed protection for ground-referenced guidance in steady wind, where airspeed evolution depends explicitly on the commanded flight path angle. Leveraging Nagumo's tangency condition, we derive a closed-form, wind-dependent characterization of admissible guidance commands that guarantees forward invariance of a safe airspeed envelope. These conditions are embedded within an offline quadratic programming framework to certify airspeed-safe maneuver primitives for non-ascending flight at the guidance level. The approach is validated using a high-fidelity unpowered fixed-wing aircraft model on gliding trajectories formed by concatenating certified maneuver primitives, demonstrating strict airspeed boundedness. Future work will address unsteady wind fields and flight experiments.
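For readers unfamiliar with Nagumo's tangency condition, the following states its standard form for a set defined by a smooth function, and then specialises it to an airspeed interval. This is a generic textbook statement under the usual regularity assumptions, not the paper's closed-form result. The symbols $h$, $f$, $V_{\min}$, $V_{\max}$, $\gamma_c$, and $w$ are notation assumed here for illustration.

```latex
% Generic form: S = { x : h(x) >= 0 }, with h continuously differentiable,
% \nabla h \neq 0 on the boundary of S, and \dot{x} = f(x) having unique solutions.
S \text{ is forward invariant}
\iff
\nabla h(x)^{\top} f(x) \ge 0 \quad \text{for all } x \text{ with } h(x) = 0 .

% Specialisation to an airspeed envelope V_min <= V <= V_max, writing the
% airspeed dynamics as \dot{V} = g(V, \gamma_c, w), where \gamma_c is the
% commanded flight path angle and w the steady wind:
\dot{V}\big|_{V = V_{\min}} \ge 0 ,
\qquad
\dot{V}\big|_{V = V_{\max}} \le 0 .
```

Read this way, a guidance command $\gamma_c$ is admissible at the envelope boundary only if it does not push the airspeed further outward, which is the kind of constraint that can be embedded in an optimisation problem over maneuver primitives.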

2604.22858 2026-04-28 cs.CV

A Digital Pathology Resource for Liver Cancer Quantification with Datasets, Benchmarks, and Tools

Ying Xiao, Shimiao Tang, Xitong Ling, Weiming Chen, Jun Wang, Jiawen Li, Huaitian Yuan, Jianghui Yang, Bowen Li, Huan Li, Yiting Meng, Tian Guan, Yonghong He, Hongfang Yin

Details
English abstract

Liver cancer, especially hepatocellular carcinoma (HCC), imposes a substantial global disease burden. Accurate diagnosis and prognostic assessment directly influence treatment selection and patient survival, and pathological examination remains the gold standard for liver cancer diagnosis. Identifying diverse tissue components and pathological subtypes on histopathology slides is crucial for estimating postoperative recurrence risk and overall prognosis. However, most publicly available resources are still provided at the whole-slide image (WSI) level, and well-annotated datasets for fine-grained tissue component identification in liver cancer are scarce, which hinders reproducible model development and the deployment of quantitative analysis tools. To address this gap, we release HepatoBench, a patch-level image database for liver cancer with annotations for seven key tissue categories. Based on HepatoBench, we train and open-source a deep learning classification model as a tissue recognition tool. Furthermore, we train a WSI-level tumor/non-tumor segmentation model to automatically localize lesion regions across entire slides. By integrating the patch-level tissue classifier with the WSI-level segmentation model, we build HepatoQuant, an end-to-end, disease-specific regional quantification tool for liver cancer, enabling a unified workflow from WSIs to tissue composition parsing and quantitative statistics. We also open-source HepatoBench, the benchmarking protocol, and supporting tools, providing a solid foundation for automated regional quantification and fair method comparison in liver cancer pathology.

2604.22857 2026-04-28 cs.CV

IoT-Enhanced CNN-Based Labelled Crack Detection for Additive Manufacturing Image Annotation in Industry 4.0

Mohsen Asghari Ilani, Yaser Mike Banad

Comments 6 Figures, 23 Pages

Details
English abstract

This paper presents an IoT-enhanced deep learning framework for automated crack detection in Additive Manufacturing (AM) surfaces using convolutional neural networks (CNNs). By integrating IoT-enabled real-time monitoring, high-resolution imaging, and edge computing, the system enables continuous in-situ defect detection and classification. Real-time data acquisition supports immediate CNN-based analysis, improving both accuracy and efficiency in AM quality control. The framework supports supervised and semi-supervised learning, enabling robust performance on large, sparsely annotated datasets. Using LabelImg for annotation and OpenCV for preprocessing, the system achieves 99.54% accuracy on 14,982 images, with 96% precision, 98% recall, and a 97% F1-score. Dataset balancing and augmentation significantly improve generalization, increasing accuracy from 32% to 99%. Beyond detection, the framework establishes a linkage between AM process parameters, defect formation, and surface topology, supporting predictive analytics and defect mitigation. Aligned with Industry 4.0, it incorporates Digital Twin (DT) technology for real-time process simulation, predictive maintenance, and adaptive control. Key contributions include an IoT-based monitoring system using edge devices (Raspberry Pi 4B), an optimized CNN with model quantization and batch processing reducing inference latency by 47%, and an MQTT-based low-latency data streaming system with 5G connectivity, lowering transmission overhead by 35%. DT integration further enables predictive defect analysis and dynamic adjustment of AM parameters. This work advances intelligent AM quality control by providing a scalable, high-accuracy, and low-latency framework. Future directions include multimodal data fusion, hybrid architectures, and enhanced Digital Twin simulations for AI-driven defect prevention.

2604.22856 2026-04-28 cs.CV

Attention-Augmented YOLOv8 with Ghost Convolution for Real-Time Vehicle Detection in Intelligent Transportation Systems

Syed Sajid Ullah, Muhammad Zunair Zamir, Ahsan Ishfaq, Salman Khan

Journal ref 10.32604/jai.2025.069008

Details
English abstract

Accurate vehicle detection is a critical component of autonomous driving, traffic surveillance, and intelligent transportation systems. This paper presents an enhanced YOLOv8n-based model that integrates the Ghost Module, Convolutional Block Attention Module (CBAM), and Deformable Convolutional Networks v2 (DCNv2) to improve detection performance. The Ghost Module reduces feature redundancy through efficient feature generation, CBAM refines feature representation via channel and spatial attention, and DCNv2 enhances adaptability to geometric variations in vehicle structures. Evaluated on the KITTI dataset, the proposed model achieves 95.4% mAP@0.5, representing an 8.97% improvement over the baseline YOLOv8n, along with 96.2% precision, 93.7% recall, and a 94.93% F1-score. Comparative analysis against seven state-of-the-art detectors demonstrates consistent superiority across key performance metrics, while ablation studies validate the individual and combined contributions of the integrated modules. By addressing feature redundancy, attention refinement, and spatial adaptability, the proposed approach offers a robust and computationally efficient solution for vehicle detection in diverse and complex traffic environments.
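The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does this; the function name `f1_score` is just an illustrative helper, not code from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract: 96.2% precision, 93.7% recall.
reported_f1 = f1_score(0.962, 0.937)
print(f"{reported_f1:.4f}")  # prints 0.9493
```

The result agrees with the 94.93% F1-score stated in the abstract.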

2604.22855 2026-04-28 cs.CV

Evaluating Remote Sensing Image Captions Beyond Metric Biases

Ziyun Chen, Fan Liu, Liang Yao, Chuanyi Zhang, Yuye Ma, Wei Zhou

Details
English abstract

The core objective of image captioning is to achieve lossless semantic compression from visual signals into textual modalities. However, the reliance on manually curated reference texts for evaluation essentially forces models to mimic specific human annotation styles, thereby masking the true descriptive capabilities of advanced foundation models. This systemic misalignment prompts a critical question: Is task-specific fine-tuning truly necessary for Remote Sensing Image Captioning, or is the perceived performance gap merely an artifact of flawed evaluation criteria? To investigate this discrepancy, we propose ReconScore, a novel reference-free evaluation metric. Rather than computing textual similarities, we assess caption quality by its capability to reconstruct the original visual elements solely from the generated text, effectively neutralizing human annotation biases. Applying this metric, we uncover a profound, counterintuitive truth: inherently powerful, unfine-tuned MLLMs surpass their fine-tuned counterparts in authentic zero-shot RSIC tasks. Driven by this structural discovery, we introduce RemoteDescriber, a completely training-free generation methodology. By employing ReconScore as a self-correction mechanism, we iteratively refine the semantic precision of MLLM outputs without any computational fine-tuning overhead. Comprehensive experiments demonstrate that RemoteDescriber achieves state-of-the-art performance on three datasets. Furthermore, we validate ReconScore's reliability and analyze the flaws of traditional metrics. Our code is available at https://github.com/hhu-czy/RemoteDescriber.

2604.22854 2026-04-28 cs.CV cs.AI

MAE-Based Self-Supervised Pretraining for Data-Efficient Medical Image Segmentation Using nnFormer

R. M. Krishna Sureddi, T. Satyanarayana Murthy, Nomula Varsha Reddy, Adi Kanishka, Nalla Manvika Reddy

Comments 4 pages, 2 figures, 2 tables

Details
English abstract

Transformer architectures, including nnFormer, have demonstrated promising results in volumetric medical image segmentation by capturing long-range spatial interactions. Despite their high performance, these models need large quantities of labeled training data and are prone to overfitting and unstable training. This is a serious practical problem because obtaining medical images annotated by experts is both time-consuming and expensive. Moreover, traditional fully supervised training pipelines do not take advantage of the large amounts of unlabeled medical imaging data that are readily available in clinics. We address these drawbacks by improving the data efficiency of nnFormer with a self-supervised pretraining framework based on Masked Autoencoders (MAE). In this method, the model is pretrained on unlabeled volumetric medical images to reconstruct randomly masked parts of the input. This allows the encoder to learn meaningful anatomical and structural representations. The encoder is then fine-tuned on a labeled dataset for the downstream segmentation task. Experiments show that the proposed method yields higher segmentation performance in terms of Dice score, faster convergence during fine-tuning, and better generalization with limited labeled data. These findings support self-supervised learning combined with transformer-based segmentation models as an appropriate approach to data scarcity in medical image analysis.
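The masked-reconstruction pretext task can be sketched as follows: hide a random subset of cubic patches in a volume and score a reconstruction only on the hidden voxels. This is a simplified illustration, not the authors' pipeline. The function names, the patch size, the mask ratio, and the zero-fill of masked patches are assumptions; the original MAE design instead drops masked patches from the encoder input.

```python
import numpy as np


def random_patch_mask(volume, patch_size, mask_ratio, rng):
    """Zero out a random subset of non-overlapping cubic patches.

    Returns the masked volume and a boolean voxel mask (True = hidden).
    Assumes every volume dimension is divisible by patch_size.
    """
    grid = tuple(dim // patch_size for dim in volume.shape)
    if any(dim % patch_size for dim in volume.shape):
        raise ValueError("volume dimensions must be divisible by patch_size")
    num_patches = int(np.prod(grid))
    num_masked = int(round(mask_ratio * num_patches))
    # Choose which patches to hide, without replacement.
    flat = np.zeros(num_patches, dtype=bool)
    flat[rng.choice(num_patches, size=num_masked, replace=False)] = True
    patch_mask = flat.reshape(grid)
    # Expand the patch-level mask to voxel resolution.
    voxel_mask = (
        patch_mask.repeat(patch_size, axis=0)
        .repeat(patch_size, axis=1)
        .repeat(patch_size, axis=2)
    )
    return np.where(voxel_mask, 0.0, volume), voxel_mask


def masked_mse(target, reconstruction, voxel_mask):
    """Mean squared error computed only over the hidden voxels."""
    diff = reconstruction[voxel_mask] - target[voxel_mask]
    return float(np.mean(diff ** 2))
```

In a pretraining loop, a model would take the masked volume as input and `masked_mse` would serve as the loss, so that only the hidden regions drive the gradient. The pretrained encoder would then be fine-tuned on labeled segmentation data, as the abstract describes.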