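arXivDaily: a daily academic digest synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.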

2604.03231 2026-04-06 cs.CV

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan

Comments 16 pages, 10 figures, 5 tables

Abstract

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.

2604.03226 2026-04-06 cs.LG cs.AI

Enhancing Robustness of Federated Learning via Server Learning

Van Sy Mai, Kushal Chakrabarti, Richard J. La, Dipankar Maity

2604.03225 2026-04-06 cs.CV

VOSR: A Vision-Only Generative Model for Image Super-Resolution

Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Xiangtao Kong, Jixin Zhao, Shihao Wang, Lei Zhang

Comments Accepted by CVPR2026

2604.03216 2026-04-06 cs.CL

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

Sean Wu, Fredrik K. Gustafsson, Edward Phillips, Boyan Gao, Anshul Thakur, David A. Clifton

Comments 24 pages, 7 figures, 6 tables

Abstract

Large language models (LLMs) often produce confident but incorrect answers in settings where abstention would be safer. Standard evaluation protocols, however, require a response and do not account for how confidence should guide decisions under different risk preferences. To address this gap, we introduce the Behavioral Alignment Score (BAS), a decision-theoretic metric for evaluating how well LLM confidence supports abstention-aware decision making. BAS is derived from an explicit answer-or-abstain utility model and aggregates realized utility across a continuum of risk thresholds, yielding a measure of decision-level reliability that depends on both the magnitude and ordering of confidence. We show theoretically that truthful confidence estimates uniquely maximize expected BAS utility, linking calibration to decision-optimal behavior. BAS is related to proper scoring rules such as log loss, but differs structurally: log loss penalizes underconfidence and overconfidence symmetrically, whereas BAS imposes an asymmetric penalty that strongly prioritizes avoiding overconfident errors. Using BAS alongside widely used metrics such as ECE and AURC, we then construct a benchmark of self-reported confidence reliability across multiple LLMs and tasks. Our results reveal substantial variation in decision-useful confidence, and while larger and more accurate models tend to achieve higher BAS, even frontier models remain prone to severe overconfidence. Importantly, models with similar ECE or AURC can exhibit very different BAS due to highly overconfident errors, highlighting limitations of standard metrics. We further show that simple interventions, such as top-$k$ confidence elicitation and post-hoc calibration, can meaningfully improve confidence reliability. Overall, our work provides both a principled metric and a comprehensive benchmark for evaluating LLM confidence reliability.
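
The answer-or-abstain idea in this abstract can be illustrated with a small sketch. It is not the paper's BAS definition, which the abstract does not spell out. It assumes a hypothetical utility: at risk threshold t the model answers only if its confidence is at least t, a correct answer earns +1, a wrong answer costs t / (1 - t), and abstaining earns 0. Under that assumed penalty, answering has non-negative expected utility exactly when the true probability of being correct is at least t. The function names and the threshold grid are illustrative choices.

```python
def realized_utility(confidence, correct, threshold):
    """Utility of one answer-or-abstain decision at one risk threshold.

    Assumed (hypothetical) utility model, not taken from the paper:
    abstain -> 0, correct answer -> +1, wrong answer -> -t / (1 - t).
    """
    if confidence < threshold:
        return 0.0  # the model abstains at this threshold
    if correct:
        return 1.0
    return -threshold / (1.0 - threshold)


def mean_utility_over_thresholds(confidences, corrects, num_thresholds=99):
    """Average realized utility over an evenly spaced grid of thresholds in (0, 1)."""
    thresholds = [(i + 1) / (num_thresholds + 1) for i in range(num_thresholds)]
    total = 0.0
    for t in thresholds:
        for confidence, correct in zip(confidences, corrects):
            total += realized_utility(confidence, correct, t)
    return total / (len(thresholds) * len(confidences))


# One overconfident error versus correct answers at low and high confidence.
overconfident_wrong = mean_utility_over_thresholds([0.95], [False])
underconfident_right = mean_utility_over_thresholds([0.05], [True])
confident_right = mean_utility_over_thresholds([0.95], [True])
```

In this sketch a wrong answer given with high confidence is penalized at every threshold up to that confidence, and the penalty grows as the threshold approaches 1, which is one way to obtain the kind of asymmetry the abstract describes.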

2604.03208 2026-04-06 cs.LG

Hierarchical Planning with Latent World Models

Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, Nicolas Ballas

2604.03203 2026-04-06 cs.CV cs.AI cs.LG

PR3DICTR: A modular AI framework for medical 3D image-based detection and outcome prediction

Daniel C. MacRae, Luuk van der Hoek, Robert van der Wal, Suzanne P. M. de Vette, Hendrike Neh, Baoqiang Ma, Peter M. A. van Ooijen, Lisanne V. van Dijk

Comments 16 pages, 6 figures and 1 table

2604.03201 2026-04-06 cs.AI

Coupled Control, Structured Memory, and Verifiable Action in Agentic AI (SCRAT -- Stochastic Control with Retrieval and Auditable Trajectories): A Comparative Perspective from Squirrel Locomotion and Scatter-Hoarding

Maximiliano Armesto, Christophe Kolb

Comments 15 pages, 4 figures, 3 tables

2604.03200 2026-04-06 cs.RO math.OC

Safety-Critical Centralized Nonlinear MPC for Cooperative Payload Transportation by Two Quadrupedal Robots

Ruturaj S. Sambhus, Yicheng Zeng, Kapi Ketan Mehta, Jeeseop Kim, Kaveh Akbari Hamed

2604.03199 2026-04-06 cs.CL cs.CR cs.LG

Learning the Signature of Memorization in Autoregressive Language Models

David Ilić, Kostadin Cvejoski, David Stanojević, Evgeny Grigorenko

Comments Preprint. 10 pages, 4 figures, 12 tables

Abstract

All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8$\times$ higher TPR at 0.1% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural language texts. Code and trained classifier available at https://github.com/JetBrains-Research/learned-mia.
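
For readers unfamiliar with the hand-crafted baselines named in this abstract, the sketch below shows the usual form of two of them: loss thresholding and a Min-K% style score. It is not the LT-MIA method and is not taken from the linked repository. The function names, the threshold, the value of k, and the per-token log-probabilities are all made up for illustration; in practice the log-probabilities would come from the fine-tuned model under attack.

```python
def mean_nll(token_logprobs):
    """Average negative log-likelihood of a sequence (its per-token loss)."""
    return -sum(token_logprobs) / len(token_logprobs)


def loss_threshold_attack(token_logprobs, threshold):
    """Predict 'member' when the sequence loss falls below a chosen threshold."""
    return mean_nll(token_logprobs) < threshold


def min_k_percent_score(token_logprobs, k=0.2):
    """Mean log-probability of the k fraction of least likely tokens.

    A higher (less negative) score suggests the sequence was seen in training.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Made-up per-token log-probabilities, for illustration only.
seen_example = [-0.1, -0.2, -0.1, -0.3, -0.2]
unseen_example = [-2.0, -0.5, -3.0, -1.0, -2.5]

seen_is_member = loss_threshold_attack(seen_example, threshold=1.0)
unseen_is_member = loss_threshold_attack(unseen_example, threshold=1.0)
seen_score = min_k_percent_score(seen_example, k=0.4)
unseen_score = min_k_percent_score(unseen_example, k=0.4)
```

Both heuristics reduce a sequence to a single hand-chosen statistic, which is the limitation the abstract contrasts with a classifier trained over per-token statistics.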

2604.03198 2026-04-06 cs.CV

The Eleventh NTIRE 2026 Efficient Super-Resolution Challenge Report

Bin Ren, Hang Guo, Yan Shu, Jiaqi Ma, Ziteng Cui, Shuhong Liu, Guofeng Mei, Lei Sun, Zongwei Wu, Fahad Shahbaz Khan, Salman Khan, Radu Timofte, Yawei Li, Hongyuan Yu, Pufan Xu, Chen Wu, Long Peng, Jiaojiao Yi, Siyang Yi, Yuning Cui, Jingyuan Xia, Xing Mou, Keji He, Jinlin Wu, Zongang Gao, Sen Yang, Rui Zheng, Fengguo Li, Yecheng Lei, Wenkai Min, Jie Liu, Keye Cao, Shubham Sharma, Manish Prasad, Haobo Li, Matin Fazel, Abdelhak Bentaleb, Rui Chen, Shurui Shi, Zitao Dai, Qingliang Liu, Yang Cheng, Jing Hu, Xuan Zhang, Rui Ding, Tingyi Zhang, Hui Deng, Mengyang Wang, Fulin Liu, Jing Wei, Qian Wang, Hongying Liu, Mingyang Li, Guanglu Dong, Zheng Yang, Chao Ren, Hongbo Fang, Lingxuan Li, Lin Si, Pan Gao, Moncef Gabbouj, Watchara Ruangsang, Supavadee Aramvith

Comments CVPR 2026 NTIRE Workshop Paper, Efficient Super Resolution Technical Report

2604.03197 2026-04-06 cs.LG physics.comp-ph

Real-Time Surrogate Modeling for Personalized Blood Flow Prediction and Hemodynamic Analysis

Sokratis J. Anagnostopoulos, George Rovas, Vasiliki Bikia, Theodore G. Papaioannou, Athanase D. Protogerou, Nikolaos Stergiopulos

2604.03192 2026-04-06 cs.CL cs.AI

Reliability Gated Multi-Teacher Distillation for Low Resource Abstractive Summarization

Dipto Sumit, Ankan Kumar Roy, Sadia Khair Rodela, Atia Haque Asha, Mourchona Afrin, Niloy Farhan, Farig Yousuf Sadeque

2604.03191 2026-04-06 cs.RO cs.CV cs.LG

The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling

Takuya Shiba

Comments 11 pages, 1 figure

2604.03190 2026-04-06 cs.LG cs.AI

Gradient Boosting within a Single Attention Layer

Saleh Sargolzaei

2604.03189 2026-04-06 cs.LG cs.AI

Reflective Context Learning: Studying the Optimization Primitives of Context Space

Nikita Vassilyev, William Berrios, Ruowang Zhang, Bo Han, Douwe Kiela, Shikib Mehri

Comments Under review at COLM. Github: https://github.com/nvassilyev/RCL

2604.03181 2026-04-06 cs.RO cs.CV

Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

Peiyan Li, Yixiang Chen, Yuan Xu, Jiabing Yang, Xiangnan Wu, Jun Guo, Nan Sun, Long Qian, Xinghang Li, Xin Xiao, Jing Liu, Nianfeng Liu, Tao Kong, Yan Huang, Liang Wang, Tieniu Tan

Comments Project Website: https://lpy1219.github.io/MV-VDP-Web/

2604.03180 2026-04-06 cs.LG cs.CL cs.IR cs.SI

PRISM: LLM-Guided Semantic Clustering for High-Precision Topics

Connor Douglas, Utkucan Balci, Joseph Aylett-Bullock

Comments To appear in Proceedings of the ACM Web Conference 2026 (WWW 26)

2604.03179 2026-04-06 cs.LG cs.AI cs.CV

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

Gengwei Zhang, Jie Peng, Zhen Tan, Mufan Qiu, Hossein Nourkhiz Mahjoub, Vaishnav Tadiparthi, Kwonjoon Lee, Yanyong Zhang, Tianlong Chen

Comments CVPR 2026

2604.03176 2026-04-06 cs.CV cs.MM

SFFNet: Synergistic Feature Fusion Network With Dual-Domain Edge Enhancement for UAV Image Object Detection

Wenfeng Zhang, Jun Ni, Yue Meng, Xiaodong Pei, Wei Hu, Qibing Qin, Lei Huang

Comments Accepted for publication in IEEE Transactions on Multimedia

2604.03174 2026-04-06 cs.CL cs.AI

Beyond the Parameters: A Technical Survey of Contextual Enrichment in Large Language Models: From In-Context Prompting to Causal Retrieval-Augmented Generation

Prakhar Bansal, Shivangi Agarwal

Comments 7 pages, 4 tables

2604.03173 2026-04-06 cs.CL

Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

Delip Rao, Eric Wong, Chris Callison-Burch

Comments 25 pages

2604.03172 2026-04-06 cs.CV

EffiMiniVLM: A Compact Dual-Encoder Regression Framework

Yin-Loon Khor, Yi-Jie Wong, Yan Chai Hum

2604.03157 2026-04-06 cs.AI

Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

Yunfei Bai, Amit Dhanda, Shekhar Jain

Comments In Proceedings of the 32nd ACM-SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

2604.03156 2026-04-06 cs.CV

CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator

Yuhan Pu, Hao Zheng, Ziqian Mo, Hill Zhang, Tianyi Fan, Shuhong Wu, Jiaheng Wei

2604.03154 2026-04-06 cs.LG

DSBD: Dual-Aligned Structural Basis Distillation for Graph Domain Adaptation

Yingxu Wang, Kunyu Zhang, Jiaxin Huang, Mengzhu Wang, Mingyan Xiao, Siyang Gao, Nan Yin

2604.03150 2026-04-06 cs.LG

HyperFitS -- Hypernetwork Fitting Spectra for metabolic quantification of ${}^1$H MR spectroscopic imaging

Paul J. Weiser, Gulnur Ungan, Amirmohammad Shamaei, Georg Langs, Wolfgang Bogner, Malte Hoffmann, Antoine Klauser, Ovidiu C. Andronesi

2604.03141 2026-04-06 cs.CL

Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

Nazanin Jafari, James Allan, Mohit Iyyer

2604.03139 2026-04-06 cs.RO

FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation

Mingao Tan, Yiyang Li, Shanze Wang, Xinming Zhang, Wei Zhang

2604.03127 2026-04-06 cs.CL cs.AI

Domain-Adapted Retrieval for In-Context Annotation of Pedagogical Dialogue Acts

Jinsook Lee, Kirk Vanacore, Zhuqian Zhou, Bakhtawar Ahtisham, Rene F. Kizilcec

Comments 20 pages, 20 tables, 4 figures

2604.03118 2026-04-06 cs.CV eess.IV

Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

Xingtong Ge, Yi Zhang, Yushi Huang, Dailan He, Xiahong Wang, Bingqi Ma, Guanglu Song, Yu Liu, Jun Zhang

Comments under review