arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2602.17929 2026-03-12 cs.CV cs.LG eess.IV

ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging

Athanasios Angelakis

Comments 24 pages, 15 figures, 5 tables. Code and models available at https://github.com/Bluesman79/ZACH-ViT

详情

英文摘要

Vision Transformers rely on positional embeddings and class tokens encoding fixed spatial priors. While effective for natural images, these priors may be suboptimal when spatial layout is weakly informative, a frequent condition in medical imaging. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes positional embeddings and the [CLS] token, achieving permutation-invariant patch processing via global average pooling. Zero-token denotes removal of the dedicated aggregation token and positional encodings. Patch tokens remain unchanged. Adaptive residual projections preserve training stability under strict parameter constraints. We evaluate ZACH-ViT across seven MedMNIST datasets under a strict few-shot protocol (50 samples/class, fixed hyperparameters, five seeds). Results reveal regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves strongest advantage on BloodMNIST and remains competitive on PathMNIST, while relative advantage decreases on datasets with stronger anatomical priors (OCTMNIST, OrganAMNIST), consistent with our hypothesis. Component and pooling ablations show positional support becomes mildly beneficial as spatial structure increases, whereas reintroducing a [CLS] token is consistently unfavorable. These findings support that architectural alignment with data structure can outweigh universal benchmark dominance. Despite minimal size and no pretraining, ZACH-ViT achieves competitive performance under data-scarce conditions, relevant for compact medical imaging and low-resource settings. Code: https://github.com/Bluesman79/ZACH-ViT

URL PDF HTML ☆

赞 0 踩 0

2602.02895 2026-03-12 cs.RO cs.AI

Moving On, Even When You're Broken: Fail-Active Trajectory Generation via Diffusion Policies Conditioned on Embodiment and Task

Gilberto G. Briscoe-Martinez, Yaashia Gautam, Rahul Shetty, Anuj Pasricha, Marco M. Nicotra, Alessandro Roncone

Comments To be published in the 2026 IEEE International Conference on Robotics & Automation

2601.03410 2026-03-12 cs.LG cs.CV eess.IV

Inferring Clinically Relevant Molecular Subtypes of Pancreatic Cancer from Routine Histopathology Using Deep Learning

Abdul Rehman Akbar, Alejandro Levya, Ashwini Esnakula, Elshad Hasanov, Anne Noonan, Lingbin Meng, Susan Tsai, Vaibhav Sahai, Midhun Malla, Sarbajit Mukherjee, Upender Manne, Anil Parwani, Wei Chen, Ashish Manne, Muhammad Khalid Khan Niazi

详情

英文摘要

Molecular subtyping of PDAC into basal-like and classical has established prognostic and predictive value. However, its use in clinical practice is limited by cost, turnaround time, and tissue requirements, thereby restricting its application in the management of PDAC. We introduce PanSubNet, an interpretable deep learning framework that predicts therapy-relevant molecular subtypes directly from standard H&E-stained WSIs. PanSubNet was developed using data from 1,055 patients across two multi-institutional cohorts (PANCAN, n=846; TCGA, n=209) with paired histology and RNA-seq data. Ground-truth labels were derived using the validated Moffitt 50-gene signature refined by GATA6 expression. The model employs dual-scale architecture that fuses cellular-level morphology with tissue-level architecture, leveraging attention mechanisms for multi-scale representation learning and transparent feature attribution. On internal validation within PANCAN using five-fold cross-validation, PanSubNet achieved mean AUC of 88.5% with balanced sensitivity and specificity. External validation on the independent TCGA cohort without fine-tuning demonstrated robust generalizability (AUC 84.0%). PanSubNet preserved and, in metastatic disease, strengthened prognostic stratification compared to RNA-seq based labels. Prediction uncertainty linked to intermediate transcriptional states, not classification noise. Model predictions are aligned with established transcriptomic programs, differentiation markers, and DNA damage repair signatures. By enabling rapid, cost-effective molecular stratification from routine H&E-stained slides, PanSubNet offers a clinically deployable and interpretable tool for genetic subtyping. We are gathering data from two institutions to validate and assess real-world performance, supporting integration into digital pathology workflows and advancing precision oncology for PDAC.

URL PDF HTML ☆

赞 0 踩 0

2512.16950 2026-03-12 cs.CV cs.AI

Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections

Adrian Straker, Paul Magdon, Marco Zullich, Maximilian Freudenberg, Christoph Kleinn, Johannes Breidenbach, Stefano Puliti, Nils Noelke

Comments 34 pages, 17 figures, submitted to Forestry: An International Journal of Forest Research

详情

英文摘要

Aiming to advance research in the field of interpretability of deep learning models for tree species classification using TLS 3D point clouds we present insights in the classification abilities of YOLOv8 through a new framework which enables systematic analysis of saliency maps derived from CAM (Class Activation Mapping). To investigate the contribution of structural tree features to the classification decisions of the models, we link regions with high saliency derived from the application of Finer-CAM to segments of 2D side-view images that correspond to structural tree features. Using TLS 3D point clouds from 2445 trees across seven European tree species, we trained five YOLOv8 models with cross-validation, reaching a mean accuracy of 96% (SD = 0.24%) when applied to the test data. Our results demonstrate that Finer-CAM can be considered faithful in identifying discriminative regions that discriminate target tree species. This renders Finer-CAM suitable for enhancing the interpretability of the tree species classification models. Analysis of 630 saliency maps indicate that the models primarily rely on image regions associated with tree crowns for species classification. While this result is pronounced in Silver Birch, European Beech, English oak, and Norway Spruce, image regions associated with stems contribute more frequently to the differentiation of European ash, Scots pine, and Douglas-fir. We demonstrate that the visibility of detailed structural tree features in the 2D side-view images enhances the discriminative performances of the models, indicating YOLOv8`s abilities to leverage detailed point cloud representations. Our results represent a first step toward enhancing the understanding of the classification decision processes of tree species classification models, aiding in the identification of data set and model limitations, and building confidence in model predictions.

URL PDF HTML ☆

赞 0 踩 0

2510.13310 2026-03-12 cs.CV

InstantSfM: Towards GPU-Native SfM for the Deep Learning Era

Jiankun Zhong, Zitong Zhan, Quankai Gao, Ziyu Chen, Haozhe Lou, Jiageng Mao, Ulrich Neumann, Chen Wang, Yue Wang

2510.11997 2026-03-12 cs.CL

SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

Ryan Shea, Yunan Lu, Liang Qiu, Zhou Yu

2506.12878 2026-03-12 cs.LG

Silhouette-Driven Instance-Weighted $k$-means

Aggelos Semoglou, Aristidis Likas, John Pavlopoulos

Comments 36 pages including appendix

2506.10918 2026-03-12 cs.LG

Sequential-Parallel Duality in Prefix Scannable Models

Morris Yau, Sharut Gupta, Valerie Engelmayer, Kazuki Irie, Stefanie Jegelka, Jacob Andreas

2506.07527 2026-03-12 cs.AI cs.LG

Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, Wentao Zhang

Comments 12 pages, 5 figures

详情

Journal ref: ICLR 2026

英文摘要

Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, \textbf{ReLIFT} (\textbf{Re}inforcement \textbf{L}earning \textbf{I}nterleaved with Online \textbf{F}ine-\textbf{T}uning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13\% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscores the significant potential.

URL PDF HTML ☆

赞 0 踩 0

2505.02787 2026-03-12 cs.CV

Unsupervised training of keypoint-agnostic descriptors for flexible retinal image registration

David Rivas-Villar, Álvaro S. Hervella, José Rouco, Jorge Novo

2503.11627 2026-03-12 cs.SD cs.LG eess.AS

Are Deep Speech Denoising Models Robust to Adversarial Noise?

Will Schwarzer, Neel Chaudhari, Philip S. Thomas, Andrea Fanelli, Xiaoyu Liu

Comments 22 pages, 14 figures. Related conference version accepted to ICLR 2026: see https://openreview.net/forum?id=WtH2JxKJKf

2410.02113 2026-03-12 cs.LG cs.NA math.NA

Mamba Neural Operator: Who Wins? Transformers vs. State-Space Models for PDEs

Chun-Wun Cheng, Jiahao Huang, Yi Zhang, Guang Yang, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero

Comments Accepted in Journal of Computational Physics 2025

2603.10904 2026-03-12 cs.SD cs.AI cs.ET

When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

Anupam Purwar, Aditya Choudhary

Comments We finetune the Qwen 0.5B backbone in an LLM TTS with LoRA to raise MOS speaker similarity and SNR. It works best with diverse training audio with uniform data it can amplify noise so tune decoding and use GGUF quantization for low latency stable quality

2603.10899 2026-03-12 cs.LG cs.AI

LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

Jinwoo Ahn, Ingyu Seong, Akhil Kedia, Junhan Kim, Hyemi Jang, Kangwook Lee, Yongkweon Jeon

Comments ICLR 2026

2603.10895 2026-03-12 cs.LG

Ergodicity in reinforcement learning

Dominik Baumann, Erfaun Noorani, Arsenii Mustafin, Xinyi Sheng, Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis, Thomas B. Schön

Comments Accepted article to appear in Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences

2603.10893 2026-03-12 cs.CV

S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs

Yuzhou Ji, Qijian Tian, He Zhu, Xiaoqi Jiang, Guangzhi Cao, Lizhuang Ma, Yuan Xie, Xin Tan

2603.10891 2026-03-12 cs.AI cs.IR

A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification

Yichi Zhu, Kan Ling, Xu Liu, Hengrun Zhang, Huiqun Yu, Guisheng Fan

Comments 11 pages, 7 figures.Framework for safe prescription auditing and hybrid knowledge-grounded reasoning

2603.10890 2026-03-12 cs.RO cs.SY eess.SY

A gripper for flap separation and opening of sealed bags

Sergi Foix, Jaume Oriol, Carme Torras, Júlia Borràs

Comments 8 pages, Accepted at the 2026 IEEE International Conference on Robotics & Automation (ICRA2026)

2603.10888 2026-03-12 cs.SD

VoxCare: Studying Natural Communication Behaviors of Hospital Caregivers through Wearable Sensing of Egocentric Audio

Tiantian Feng, Kleanthis Avramidis, Anfeng Xu, Deqi Wang, Brandon M Booth, Shrikanth Narayanan

2603.10887 2026-03-12 cs.LG cs.AI

Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji

Comments Accepted to ICLR 2026

2603.10885 2026-03-12 cs.LG cs.AI q-bio.GN

Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements

Jonathan Liu, Kia Ghods

2603.10877 2026-03-12 cs.CL

From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty

2603.10876 2026-03-12 cs.CL cs.AI cs.DL cs.IR

An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took "Use of Practical AI in Digital Libraries" seriously?

Jennifer D'Souza, Sameer Sadruddin, Maximilian Kähler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, Lauro Snidaro, Osma Suominen

Comments 9 pages, 5 figures. Accepted to appear in the Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

2603.10873 2026-03-12 cs.LG q-bio.GN

SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion

Andrea Lampis, Michela Carlotta Massi, Nicola Pirastu, Francesca Ieva, Matteo Matteucci, Emanuele Di Angelantonio

2603.10872 2026-03-12 cs.CV

Bilevel Layer-Positioning LoRA for Real Image Dehazing

Yan Zhang, Long Ma, Yuxin Feng, Zhe Huang, Fan Zhou, Zhuo Su

Comments Accepted by CVPR 2026

2603.10871 2026-03-12 cs.RO

FG-CLTP: Fine-Grained Contrastive Language Tactile Pretraining for Robotic Manipulation

Wenxuan Ma, Chaofan Zhang, Yinghao Cai, Guocai Yao, Shaowei Cui, Shuo Wang

Comments 9 pages, 6 figures

2603.10863 2026-03-12 cs.CV

Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding

Lin Chen, Bolin Ni, Qi Yang, Zili Wang, Kun Ding, Ying Wang, Houwen Peng, Shiming Xiang

2603.10862 2026-03-12 cs.SD

OSUM-Pangu: An Open-Source Multidimension Speech Understanding Foundation Model Built upon OpenPangu on Ascend NPUs

Yujie Liao, Xuelong Geng, Hongfei Xue, Shuiyuan Wang, Lei Xie

Comments 5 pages, 2 figures

2603.10861 2026-03-12 cs.CL

SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0

Nevidu Jayatilleke, Nisansa de Silva, Uthpala Nimanthi, Gagani Kulathilaka, Azra Safrullah, Johan Sofalas

Comments 23 pages, 13 figures, 10 tables, Accepted paper at the 15th Language Resources and Evaluation Conference (LREC 2026)

2603.10858 2026-03-12 cs.RO cs.AI cs.MA

GRACE: A Unified 2D Multi-Robot Path Planning Simulator & Benchmark for Grid, Roadmap, And Continuous Environments

Chuanlong Zang, Anna Mannucci, Isabelle Barz, Philipp Schillinger, Florian Lier, Wolfgang Hönig

Comments ICRA 2026, code will be released soon