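arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.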

2604.05738 2026-04-08 cs.CL

MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models

Han Jang, Junhyeok Lee, Heeseong Eum, Kyu Sung Choi

Comments Accepted at ACL 2026 Findings (Oral). 9 pages, 5 figures, 11 tables, plus appendix

Abstract

Medical Vision-Language Models (Med-VLMs) have achieved expert-level proficiency in interpreting diagnostic imaging. However, current models are predominantly trained on professional literature, limiting their ability to communicate findings in the lay register required for patient-centered care. While text-centric research has actively developed resources for simplifying medical jargon, there is a critical absence of large-scale multimodal benchmarks designed to facilitate lay-accessible medical image understanding. To bridge this resource gap, we introduce MedLayBench-V, the first large-scale multimodal benchmark dedicated to expert-lay semantic alignment. Unlike naive simplification approaches that risk hallucination, our dataset is constructed via a Structured Concept-Grounded Refinement (SCGR) pipeline. This method enforces strict semantic equivalence by integrating Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs) with micro-level entity constraints. MedLayBench-V provides a verified foundation for training and evaluating next-generation Med-VLMs capable of bridging the communication divide between clinical experts and patients.

2604.05732 2026-04-08 cs.LG cs.IR

Graph Topology Information Enhanced Heterogeneous Graph Representation Learning

He Zhao, Zhiwei Zeng, Yongwei Wang, Chunyan Miao

Abstract

Real-world heterogeneous graphs are inherently noisy and usually not in the optimal graph structures for downstream tasks, which often adversely affects the performance of graph representation learning (GRL) models in downstream tasks. Although Graph Structure Learning (GSL) methods have been proposed to learn graph structures and downstream tasks simultaneously, existing methods are predominantly designed for homogeneous graphs, while GSL for heterogeneous graphs remains largely unexplored. Two challenges arise in this context. Firstly, the quality of the input graph structure has a more profound impact on GNN-based heterogeneous GRL models compared to their homogeneous counterparts. Secondly, most existing homogeneous GRL models encounter memory consumption issues when applied directly to heterogeneous graphs. In this paper, we propose a novel Graph Topology learning Enhanced Heterogeneous Graph Representation Learning framework (ToGRL). ToGRL learns high-quality graph structures and representations for downstream tasks by incorporating task-relevant latent topology information. Specifically, a novel GSL module is first proposed to extract downstream task-related topology information from a raw graph structure and project it into topology embeddings. These embeddings are utilized to construct a new graph with smooth graph signals. This two-stage approach to GSL separates the optimization of the adjacency matrix from node representation learning to reduce memory consumption. Following this, a representation learning module takes the new graph as input to learn embeddings for downstream tasks. ToGRL also leverages prompt tuning to better utilize the knowledge embedded in learned representations, thus enhancing adaptability to downstream tasks. Extensive experiments on five real-world datasets show that our ToGRL outperforms state-of-the-art methods by a large margin.

2604.05731 2026-04-08 cs.CV

FoleyDesigner: Immersive Stereo Foley Generation with Precise Spatio-Temporal Alignment for Film Clips

Mengtian Li, Kunyan Dai, Yi Ding, Ruobing Ni, Ying Zhang, Wenwu Wang, Zhifeng Xie

2604.05730 2026-04-08 cs.LG

Controllable Image Generation with Composed Parallel Token Prediction

Jamie Stirling, Noura Al-Moubayed, Chris G. Willcocks, Hubert P. H. Shum

Comments 8 pages + references, 7 figures, accepted to CVPR Workshops 2026 (LoViF). arXiv admin note: substantial text overlap with arXiv:2405.06535

2604.05727 2026-04-08 cs.CV

Single-Stage Signal Attenuation Diffusion Model for Low-Light Image Enhancement and Denoising

Ying Liu, Junchao Zhang, Caiyun Wu

2604.05724 2026-04-08 cs.CV

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP

Yusung Ro, Jaehyun Choi, Junmo Kim

Comments CVPR 2026 Findings

2604.05721 2026-04-08 cs.CV

GaussianGrow: Geometry-aware Gaussian Growing from 3D Point Clouds with Text Guidance

Weiqi Zhang, Junsheng Zhou, Haotian Geng, Kanle Shi, Shenkun Xu, Yi Fang, Yu-Shen Liu

Comments Accepted by CVPR 2026. Project page: https://weiqi-zhang.github.io/GaussianGrow

2604.05716 2026-04-08 cs.AI

Can Large Language Models Reinvent Foundational Algorithms?

Jian Zhao, Haoren Luo, Yu Wang, Yuhan Cao, Pingyue Sheng, Tianxing He

2604.05715 2026-04-08 cs.CV

In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting

Wenhui Xiao, Ethan Goan, Rodrigo Santa Cruz, David Ahmedt-Aristizabal, Olivier Salvado, Clinton Fookes, Leo Lebrat

Comments accepted to CVPR 3DMV Workshop

2604.05700 2026-04-08 cs.LG

Optimal-Transport-Guided Functional Flow Matching for Turbulent Field Generation in Hilbert Space

Li Kunpeng, Wan Chenguang, Qu Zhisong, Lim Kyungtak, Virginie Grandgirard, Xavier Garbet, Yu Hua, Ong Yew Soon

Comments 41 pages, 5 figures, journal paper

2604.05695 2026-04-08 cs.CV

Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs

Chongyu Wang, Ting Huang, Chunyu Sun, Xinyu Ning, Di Wang, Hao Tang

2604.05689 2026-04-08 cs.CV cs.AI

CRFT: Consistent-Recurrent Feature Flow Transformer for Cross-Modal Image Registration

Xuecong Liu, Mengzhu Ding, Zixuan Sun, Zhang Li, Xichao Teng

Comments Accepted to CVPR 2026

2604.05688 2026-04-08 cs.CL cs.AI

Attention Editing: A Versatile Framework for Cross-Architecture Attention Conversion

Zhen Cheng, Hao-Bo Yang, Wan-Yi Huang, Jin-Long Li

2604.05683 2026-04-08 cs.SD

Time-Domain Voice Identity Morphing (TD-VIM): A Signal-Level Approach to Morphing Attacks on Speaker Verification Systems

Aravinda Reddy PN, Raghavendra Ramachandra, K. Sreenivasa Rao, Pabitra Mitra, Kunal Singh

2604.05681 2026-04-08 cs.AI cs.CL cs.GT cs.LG cs.MA

LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo

Ojas Jain, Dhruv Kumar

Comments Under Review

2604.05677 2026-04-08 cs.RO

Dynamic Control Allocation for Dual-Tilt UAV Platforms

Marcello Sorge, Federico Ciresola, Giulia Michieletto, Angelo Cenedese

2604.05656 2026-04-08 cs.CV cs.AI

SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation

Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma

Comments 10 pages, 6 figures, 9 tables

Abstract

Vision-Language-Action (VLA) models based on flow matching -- such as pi0, pi0.5, and SmolVLA -- achieve state-of-the-art generalist robotic manipulation, yet their iterative denoising, typically 10 ODE steps, introduces substantial latency: on a modern GPU, denoising alone accounts for 80% of end-to-end inference time. Naively reducing the step count is unreliable, degrading success on most tasks due to the velocity field being uncalibrated for single-step jumps. We present SnapFlow, a plug-and-play self-distillation method that compresses multi-step denoising into a single forward pass (1-NFE) for flow-matching VLAs. SnapFlow mixes standard flow-matching samples with consistency samples whose targets are two-step Euler shortcut velocities computed from the model's own marginal velocity predictions, avoiding the trajectory drift caused by conditional velocities, as we analyze theoretically. A zero-initialized target-time embedding lets the network switch between local velocity estimation and global one-step generation within a single architecture. SnapFlow requires no external teacher, no architecture changes, and trains in ~12h on a single GPU. We validate on two VLA architectures spanning a 6x parameter range, with identical hyperparameters: on pi0.5 (3B) across four LIBERO suites (40 tasks, 400 episodes), SnapFlow achieves 98.75% average success -- matching the 10-step teacher at 97.75% and slightly exceeding it -- with 9.6x denoising speedup and end-to-end latency reduced from 274ms to 83ms; on SmolVLA (500M), it reduces MSE by 8.3% with 3.56x end-to-end acceleration. An action-step sweep on long-horizon tasks reveals that SnapFlow maintains its advantage across execution horizons, achieving 93% at n_act=5 where the baseline reaches only 90%. SnapFlow is orthogonal to layer-distillation and token-pruning approaches, enabling compositional speedups.
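
To make the idea of a "two-step Euler shortcut velocity" concrete, here is a minimal sketch on a toy one-dimensional velocity field. It is not the authors' implementation: the function names `euler_sample` and `two_step_shortcut_target`, the step size `d`, and the toy field are all assumptions for illustration, and the abstract does not specify the exact training target. The sketch only shows that averaging the velocities of two consecutive Euler steps gives a single velocity whose one jump lands where the two steps would.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x

def two_step_shortcut_target(velocity, x, t, d):
    """Average velocity over two Euler steps of size d starting at (x, t).

    Hypothetical helper: a single jump of size 2*d with this velocity
    reproduces the endpoint of the two smaller Euler steps.
    """
    v1 = velocity(x, t)
    v2 = velocity(x + d * v1, t + d)
    return 0.5 * (v1 + v2)

# Toy velocity field (an assumption, not a learned model): v(x, t) = -x.
toy_velocity = lambda x, t: -x

x_start, d = 1.0, 0.5
target = two_step_shortcut_target(toy_velocity, x_start, 0.0, d)
one_jump = x_start + 2 * d * target              # single step using the shortcut velocity
two_steps = euler_sample(toy_velocity, x_start, 2)
print(one_jump, two_steps)
```

In a distillation setting, a network conditioned on the jump size would be regressed toward such a target so that one forward pass can stand in for several integration steps; how SnapFlow mixes these samples with standard flow-matching samples is described only at a high level in the abstract.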

2604.05655 2026-04-08 cs.CL cs.AI cs.LG

LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals

Lihao Sun, Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan

Comments ACL 2026 (Main)

2604.05651 2026-04-08 cs.CV

Probing Intrinsic Medical Task Relationships: A Contrastive Learning Perspective

Jonas Muth, Zdravko Marinov, Simon Reiß

2604.05649 2026-04-08 cs.CV cs.AI

Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis

Peixi Peng, Housheng Xie, Yanling Wei, Guangcong Ruan, Xiaoyang Zou, Qian Cao, Yongjian Nian, Guoyan Zheng

Abstract

Gastrointestinal diseases impose a growing global health burden, and endoscopy is a primary tool for early diagnosis. However, routine endoscopic image interpretation still suffers from missed lesions and limited efficiency. Although AI-assisted diagnosis has shown promise, existing models often lack generalizability, adaptability, robustness, and scalability because of limited medical data, domain shift, and heterogeneous annotations. To address these challenges, we develop RATNet, a foundation model for gastrointestinal endoscopy imaging based on analogical reasoning. RATNet acquires and transfers knowledge from heterogeneous expert annotations across five gastrointestinal endoscopy datasets through a cyclic pre-training strategy. Its architecture consists of an encoder, a relevance-knowledge acquisition and transfer (RAT) module, a projector, and a multi-task head, and supports fine-tuning, linear probing, and zero-shot transfer. Evaluations show that RATNet outperforms existing foundation models, including GastroNet and GastroVision, across six scenarios: diagnosis of common gastrointestinal diseases, few-shot learning for rare diseases, zero-shot transfer to new medical sites, robustness under long-tailed disease distributions, adaptation to novel diseases, and privacy-preserving deployment via federated learning. Its advantage comes from an analogical reasoning mechanism that matches image-derived posterior knowledge to a learned prior knowledge base and transfers relative knowledge to guide diagnosis, improving generalization and resistance to bias. RATNet is open and cost-effective, supports automatic integration of heterogeneous annotations without manual label unification, and reduces data acquisition costs, making it a practical foundation for intelligent gastrointestinal diagnosis, especially in resource-limited settings.

2604.05648 2026-04-08 cs.RO cs.SY eess.SY

Leaderless Collective Motion in Affine Formation Control over the Complex Plane

Jesus Bautista, Enric Morella, Lili Wang, Hector Garcia de Marina

Comments 16 pages, submitted version to TCNS

2604.05638 2026-04-08 cs.CV

PanopticQuery: Unified Query-Time Reasoning for 4D Scenes

Ruilin Tang, Yang Zhou, Zhong Ye, Wenxi Liu, Yan Huang, Shengfeng He

2604.05636 2026-04-08 cs.CV

Towards Athlete Fatigue Assessment from Association Football Videos

Xavier Bou, Nathan Correger, Alexandre Cloots, Cédric Gavage, Silvio Giancola, Cédric Schwartz, François Delvaux, Rudi Cloots, Marc Van Droogenbroeck, Anthony Cioppa

2604.05635 2026-04-08 cs.LG

From Uniform to Learned Knots: A Study of Spline-Based Numerical Encodings for Tabular Deep Learning

Manish Kumar, Anton Frederik Thielmann, Christoph Weisser, Benjamin Säfken

Comments 20, 9 figures

Abstract

Numerical preprocessing remains an important component of tabular deep learning, where the representation of continuous features can strongly affect downstream performance. Although its importance is well established for classical statistical and machine learning models, the role of explicit numerical preprocessing in tabular deep learning remains less well understood. In this work, we study this question with a focus on spline-based numerical encodings. We investigate three spline families for encoding numerical features, namely B-splines, M-splines, and integrated splines (I-splines), under uniform, quantile-based, target-aware, and learnable-knot placement. For the learnable-knot variants, we use a differentiable knot parameterization that enables stable end-to-end optimization of knot locations jointly with the backbone. We evaluate these encodings on a diverse collection of public regression and classification datasets using MLP, ResNet, and FT-Transformer backbones, and compare them against common numerical preprocessing baselines. Our results show that the effect of numerical encodings depends strongly on the task, output size, and backbone. For classification, piecewise-linear encoding (PLE) is the most robust choice overall, while spline-based encodings remain competitive. For regression, no single encoding dominates uniformly. Instead, performance depends on the spline family, knot-placement strategy, and output size, with larger gains typically observed for MLP and ResNet than for FT-Transformer. We further find that learnable-knot variants can be optimized stably under the proposed parameterization, but may substantially increase training cost, especially for M-spline and I-spline expansions. Overall, the results show that numerical encodings should be assessed not only in terms of predictive performance, but also in terms of computational overhead.
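
For readers unfamiliar with piecewise-linear encoding (PLE), the baseline the abstract names, the following is a minimal NumPy sketch with quantile-based bin edges. It is an illustration under stated assumptions rather than the paper's code: the helper names `quantile_edges` and `ple_encode` are hypothetical, and this simplified variant clips every component to [0, 1], so values outside the fitted range saturate instead of extrapolating.

```python
import numpy as np

def quantile_edges(x, n_bins):
    """Bin edges at evenly spaced quantiles of x (duplicate edges removed)."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    return np.unique(np.quantile(x, qs))

def ple_encode(x, edges):
    """Piecewise-linear encoding: one value in [0, 1] per bin.

    A component is 0 below its bin, 1 above it, and the fractional
    position of x inside the bin otherwise.
    """
    x = np.asarray(x, dtype=float)[:, None]
    lo = edges[:-1][None, :]
    hi = edges[1:][None, :]
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

# Hypothetical feature column; edges are fitted on training data only.
train_feature = np.arange(101, dtype=float)
edges = quantile_edges(train_feature, n_bins=4)
encoded = ple_encode([10.0, 60.0], edges)
print(edges)
print(encoded)
```

Each scalar feature becomes a vector with one entry per bin, which is what the abstract refers to as the encoding's output size; spline-based encodings replace these linear ramps with smoother basis functions evaluated at the same kind of knots.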

2604.05632 2026-04-08 cs.CV

SGANet: Semantic and Geometric Alignment for Multimodal Multi-view Anomaly Detection

Letian Bai, Chengyu Tao, Juan Du

2604.05629 2026-04-08 cs.CV

A Unified Foundation Model for All-in-One Multi-Modal Remote Sensing Image Restoration and Fusion with Language Prompting

Yongchuan Cui, Peng Liu

2604.05624 2026-04-08 cs.CL

YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset

Peace Busola Falola, Jesujoba O. Alabi, Solomon O. Akinola, Folashade T. Ogunajo, Emmanuel Oluwadunsin Alabi, David Ifeoluwa Adelani

Comments LREC 2026

2604.05623 2026-04-08 cs.CV cs.CL cs.MM

DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

Xinran Wang, Yuxuan Zhang, Xiao Zhang, Haolong Yan, Muxi Diao, Songyu Xu, Zhonghao Yan, Hongbing Li, Kongming Liang, Zhanyu Ma

Comments 8 pages, 5 figures. The dataset and code are available at https://zyx-hhnkh.github.io/DetailVerifyBench/

2604.05620 2026-04-08 cs.CV cs.AI

Semantic-Topological Graph Reasoning for Language-Guided Pulmonary Screening

Chenyu Xue, Yiran Liu, Mian Zhou, Jionglong Su, Zhixiang Lu

2604.05616 2026-04-08 cs.CV cs.AI cs.LG

Evaluation of Randomization through Style Transfer for Enhanced Domain Generalization

Dustin Eisenhardt, Timothy Schaumlöffel, Alperen Kantarci, Gemma Roig