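arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.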

2605.06229 2026-05-08 cs.CV

Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search

Faisal Aljehrai, Mohammed A. Alkhrashi, Alreem Almuhrij, Sarah Abuhimed, Noorh Aldossary, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J Khan

2605.06228 2026-05-08 cs.LG cs.AI

Soft Deterministic Policy Gradient with Gaussian Smoothing

Hyunjun Na, Donghwan Lee

Comments 25 pages, 4 figures

2605.06227 2026-05-08 cs.AI

Price of Fairness in Short-Term and Long-Term Algorithmic Selections

Shahin Jabbari, Chen Wang

Comments The short version of this paper appears in the proceedings of IJCAI-26

2605.06221 2026-05-08 cs.CL

UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

Qihang Fan, Huaibo Huang, Zhiying Wu, Bingning Wang, Ran He

Comments code: https://github.com/qhfan/UniPrefill.git

2605.06219 2026-05-08 cs.AI

Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization

Yunzhen Yao, Hongye Wang, Yahong Wang, Michael C. Gastpar, Bo Jiang, Lie He

2605.06216 2026-05-08 cs.CL cs.AI cs.LG

TIDE: Every Layer Knows the Token Beneath the Context

Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Mehrdad Farajtabar, Minsik Cho

2605.06214 2026-05-08 cs.CV

Differentiable Adaptive 4D Structured Illumination for Joint Capture of Shape and Reflectance

Huakeng Ding, Yaowen Chen, Kun Zhou, Hongzhi Wu

Comments Accepted to CVPR 2026. 10 pages, 13 figures

2605.06212 2026-05-08 cs.LG cs.CV

Playing the network backward: A Game Theoretic Attribution Framework

Jakob Paul Zimmermann, Jim Berend, Georg Loho, Sebastian Lapuschkin, Wojciech Samek

2605.06211 2026-05-08 cs.LG cs.AI cs.CL cs.DS

Contrastive Identification and Generation in the Limit

Xiaoyu Li, Andi Han, Jiaojiao Jiang, Junbin Gao

Abstract

In the classical identification in the limit model of Gold [1967], a stream of positive examples is presented round by round, and the learner must eventually recover the target hypothesis. Recently, Kleinberg and Mullainathan [2024] introduced generation in the limit, where the learner instead must eventually output novel elements of the target's support. Both lines of work focus on positive-only or fully labeled data. Yet many natural supervision signals are inherently relational rather than singleton: they encode relationships between examples rather than labels of individual ones. We initiate the study of contrastive identification and generation in the limit, where the learner observes a contrastive presentation of data: a stream of unordered pairs $\{x,y\}$ satisfying $h(x)\ne h(y)$ for an unknown target binary hypothesis $h$, but which element is positive is hidden from the learner. We first present three results in the noiseless setting: an exact characterization of contrastive identifiable classes (a one-line geometric refinement of the tell-tale condition of Angluin [1980]), a combinatorial dimension called the contrastive closure dimension (a contrastive analogue of the closure dimension in Raman et al. [2025]) that exactly characterizes uniform contrastive generation with tight sample complexity, and a strict hierarchy in which contrastive generation and text identification are mutually incomparable. We then prove a sharp reversal under finite adversarial corruption: there exist classes identifiable from contrastive pairs under any finite corruption budget by a single budget-independent algorithm, yet not identifiable from positive examples under even one corrupted observation. The unifying technical object is the common crossing graph, which encodes pairwise ambiguity, family-level generation obstructions, and corruption defects in a single coverage-and-incidence language.
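To make the contrastive presentation concrete, here is a minimal Python sketch of what a stream of pairs with $h(x)\ne h(y)$ rules in and out. It is not the paper's algorithm: the toy domain, the choice of "all binary labelings" as the hypothesis class, and the function name `consistent_hypotheses` are assumptions made only for illustration.

```python
from itertools import product


def consistent_hypotheses(domain, pairs):
    """Return every binary labeling h of `domain` with h(x) != h(y) for all pairs.

    Assumption: the hypothesis class is all labelings of a tiny finite domain,
    chosen only so the example can be enumerated by brute force.
    """
    survivors = []
    for labels in product((0, 1), repeat=len(domain)):
        h = dict(zip(domain, labels))
        if all(h[x] != h[y] for x, y in pairs):
            survivors.append(h)
    return survivors


domain = ["a", "b", "c"]
stream = [("a", "b"), ("b", "c")]  # unordered pairs; which side is positive is hidden
survivors = consistent_hypotheses(domain, stream)
for h in survivors:
    print(h)
```

In this toy run two labelings survive, and they are complements of each other. That follows directly from the definition: if $h$ separates every observed pair, so does $1-h$, so the pairs alone cannot tell which side is positive.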

2605.06207 2026-05-08 cs.CV cs.AI cs.LG

Taming the Entropy Cliff: Variable Codebook Size Quantization for Autoregressive Visual Generation

Bowen Zheng, Weijian Luo, Guang Yang, Colin Zhang, Tianyang Hu

Abstract

Most discrete visual tokenizers rely on a default design: every position in the sequence shares the same codebook. Researchers try to scale the codebook size $K$ to get better reconstruction performance. Such a constant-codebook design hits a fundamental information-theoretic limit. We observe that the per-position conditional entropy of the training set decays so quickly along the sequence that, after a few positions, the conditional distribution becomes essentially deterministic. On ImageNet with $K=16384$, this happens within only 2 out of 256 positions, turning the remaining 254 into a memorization problem. We call this phenomenon the Entropy Cliff and formalize it with a simple expression: $t^{*} = \lceil \log_2 N / \log_2 K \rceil$. Interestingly, this phenomenon is not observed in language, as its natural structure keeps the effective entropy per position well below the codebook capacity. To address this, we propose Variable Codebook Size Quantization (VCQ), where the codebook size $K_t$ grows monotonically along the sequence from $K_{\min}=2$ to $K_{\max}$, leaving the loss function, parameter count, and AR training procedure unchanged. With a vanilla autoregressive Transformer and standard next-token prediction, a base version of VCQ reduces gFID w/o CFG from 27.98 to 14.80 on ImageNet $256\times256$ over the baseline. Scaled up, it reaches gFID 1.71 with 684M autoregressive parameters, without any extra training techniques such as semantic regularization or causal alignment. The extreme information bottleneck at $K_{\min}=2$ naturally induces a coarse-to-fine semantic hierarchy: a linear probe on only the first 10 tokens reaches 43.8% top-1 accuracy on ImageNet, compared to 27.1% for uniform codebooks. Ultimately, these results show that what matters is not only the total capacity of the codebook, but also how that capacity is distributed and organized.
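The Entropy Cliff expression $t^{*} = \lceil \log_2 N / \log_2 K \rceil$ can be checked numerically with a short Python sketch. The abstract does not define $N$; the sketch assumes it is the number of training images and uses the ImageNet-1k training-set size (1,281,167). The function name is hypothetical.

```python
import math


def entropy_cliff_position(num_samples: int, codebook_size: int) -> int:
    """Evaluate t* = ceil(log2(N) / log2(K)) from the abstract."""
    return math.ceil(math.log2(num_samples) / math.log2(codebook_size))


# Assumption: N is the ImageNet-1k training-set size.
IMAGENET_TRAIN_SIZE = 1_281_167

t_star = entropy_cliff_position(IMAGENET_TRAIN_SIZE, 16384)
print(t_star)  # 2, matching "within only 2 out of 256 positions"

# The same expression with the smallest codebook size mentioned, K = 2.
print(entropy_cliff_position(IMAGENET_TRAIN_SIZE, 2))  # 21
```

One way to read the formula, under the same assumption about $N$: $K^t \ge N$ exactly when $t \ge \log_2 N / \log_2 K$, so after $t^{*}$ positions there are at least as many possible token prefixes as training images.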

2605.06206 2026-05-08 cs.LG

Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

Muhammad Shahir Abdurrahman, Chun Deng, Azalia Mirhoseini, Philip Levis

2605.06202 2026-05-08 cs.LG stat.ML

Bandit Learning in General Open Multi-agent Systems

Mengfan Xu

2605.06201 2026-05-08 cs.AI

Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric

Ying Gu, Mei Chee Leong, Hui Li Tan, Shangbo Mao, Liyuan Li, Nancy Chen

2605.06200 2026-05-08 cs.CL

A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang

Abstract

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional overhead, or on tree-based structural rollouts that merely redistribute the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A$^2$TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but redesigns how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides the cumulative normalized IG by the square root of the number of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
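The three components listed in the abstract can be sketched roughly in Python as follows. This is only one plausible reading, not the authors' implementation: the abstract gives no formulas, so the z-score normalization, the accumulation direction (from the current turn to the end of the trajectory), the `tanh` modulation, and every function name and default value here are assumptions.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev


def turn_group_normalize(ig, eps=1e-8):
    """(i) Z-score IG within each (prompt, turn-index) group.

    `ig` maps (prompt_id, rollout_id, turn_idx) -> raw IG (hypothetical layout).
    """
    groups = defaultdict(list)
    for (prompt_id, _rollout_id, turn_idx), value in ig.items():
        groups[(prompt_id, turn_idx)].append(value)
    normalized = {}
    for (prompt_id, rollout_id, turn_idx), value in ig.items():
        peers = groups[(prompt_id, turn_idx)]
        z = (value - mean(peers)) / (pstdev(peers) + eps)
        normalized[(prompt_id, rollout_id, turn_idx)] = z
    return normalized


def rescaled_accumulation(z_turns, gamma=1.0):
    """(ii) Discounted sum of normalized IG divided by sqrt(number of terms).

    Assumption: the sum runs from turn t to the last turn of the trajectory.
    """
    n = len(z_turns)
    advantages = []
    for t in range(n):
        total = sum(gamma ** (k - t) * z_turns[k] for k in range(t, n))
        advantages.append(total / math.sqrt(n - t))
    return advantages


def adaptive_clip_range(z, base_eps=0.2, scale=0.5):
    """(iii) Widen the clip range for high normalized IG, narrow it for low.

    Assumption: a bounded tanh modulation around a base range of 0.2.
    """
    return base_eps * (1.0 + scale * math.tanh(z))


raw_ig = {("p0", 0, 0): 1.0, ("p0", 1, 0): 3.0}
z = turn_group_normalize(raw_ig)
adv = rescaled_accumulation([1.0, 1.0, 1.0, 1.0])
print(z, adv, adaptive_clip_range(2.0), adaptive_clip_range(-2.0))
```

Dividing by the square root of the number of terms keeps the scale of a sum of roughly unit-variance terms near one regardless of how many turns remain, which is the stated goal of keeping advantage magnitudes comparable across turn positions.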

2605.06197 2026-05-08 cs.CV cs.LG

Bridging visual saliency and large language models for explainable deep learning in medical imaging

Paul Valery Nguezet, Elie Tagne Fute, Yusuf Brima, Benoit Martin Azanguezet, Marcellin Atemkeng

Abstract

The opaque nature of deep learning models remains a significant barrier to their clinical adoption in medical imaging. This paper presents a multimodal explainability framework that bridges the gap between convolutional neural network (CNN) predictions and clinically actionable insights for brain tumor classification, leveraging large language models (LLMs) to deliver human-interpretable diagnostic narratives. The proposed framework operates through three coupled stages. First, nine CNN architectures are extended with a dual-output hybrid formulation that simultaneously optimises a classification head and a segmentation head, enabling spatially richer feature learning. Second, visual saliency attribution methods, namely Grad-CAM, Grad-CAM++, and ScoreCAM, are applied to generate class-discriminative heatmaps, which are subsequently refined into binary tumor masks via an adaptive percentile thresholding pipeline. Third, the resulting masks are mapped onto the Harvard-Oxford cortical atlas to translate pixel-level evidence into named neuroanatomical structures, and the extracted findings are encoded into a structured JSON file that conditions three LLMs (Grok3, Mistral, and LLaMA) to generate coherent, radiological-style diagnostic reports. Evaluated on a dataset of 4,834 contrast-enhanced T1-weighted brain MRI images spanning three tumor classes, InceptionResNetV2 achieved the highest classification performance and Grad-CAM++ yielded the best segmentation overlap. Among the language models, Grok3 led in lexical diversity and coherence, while LLaMA achieved the highest readability score. By integrating visual, anatomical, and linguistic modalities into a unified pipeline, the framework produces explanations that are technically grounded and meaningfully interpretable, advancing the transparency and clinical accountability of artificial intelligence assisted brain tumor diagnosis.

2605.06196 2026-05-08 cs.AI cs.CL

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

Chonghan Qin, Xiachong Feng, Ziyun Song, Xiaocheng Feng, Jing Xiong, Lingpeng Kong

Comments 28 pages, including appendices

2605.06192 2026-05-08 cs.CV cs.AI cs.RO

EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, Kai Chen

Comments Preprint. 22 pages, 10 figures

2605.06191 2026-05-08 cs.AI

Systematic Evaluation of Large Language Models for Post-Discharge Clinical Action Extraction

Shivali Dalmia, Ananya Mantravadi, Prasanna Desikan

2605.06190 2026-05-08 cs.LG

Constrained Contextual Bandits with Adversarial Contexts

Dhruv Sarkar, Abhishek Sinha

2605.06188 2026-05-08 cs.AI cs.CL

OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

Jaehoon Kim, Dongha Lee

2605.06187 2026-05-08 cs.LG cs.AI

In-Context Black-Box Optimization with Unreliable Feedback

Nicolas Samuel Blumer, Julien Martinelli, Samuel Kaski

2605.06185 2026-05-08 cs.AI cs.CV

Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios

Peizheng Yan, Yu Zhao, Liang Xie, Juntong Qi, Mingming Wang, Erwei Yin

2605.06183 2026-05-08 cs.AI cs.CL cs.LG

Rethinking Adapter Placement: A Dominant Adaptation Module Perspective

Suoxin Zhang, Run He, Di Fang, Xiang Tan, Kaixuan Chen, Huiping Zhuang

2605.06179 2026-05-08 cs.CV

SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision

Zejian Kang, Xuanyang Xu, Wentao Yang, Kai Zheng, Yuanchen Fei, Hongyuan Zou, Hui Shan, Shuo Yang, Xiangru Huang

2605.06177 2026-05-08 cs.AI

BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

Jinge Wu, Hongjian Zhou, Mingde Zeng, Jiayuan Zhu, Junde Wu, Jiazhen Pan, Sean Wu, Honghan Wu, Fenglin Liu, David A. Clifton

2605.06170 2026-05-08 cs.CV

DynT2I-Eval: A Dynamic Evaluation Framework for Text-to-Image Models

Juntong Wang, Jiarui Wang, Huiyu Duan, Lewei Li, Guangtao Zhai, Xiongkuo Min

2605.06166 2026-05-08 cs.LG

One Algorithm, Two Goals: Dual Scoring for Parameter and Data Selection in LLM Fine-Tuning

Xinrui Chen, Liu Yang, Ou Wu

2605.06165 2026-05-08 cs.AI

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

Richmond Sin Jing Xuan, Rishabh Bhardwaj, Soujanya Poria

2605.06161 2026-05-08 cs.AI cs.SE

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

Shihao Weng, Yang Feng, Xiaofei Xie

Comments 9 pages

2605.06160 2026-05-08 cs.CV

Beyond Forgetting in Continual Medical Image Segmentation: A Comprehensive Benchmark Study

Bomin Wang, Hangqi Zhou, Yibo Gao, Xiahai Zhuang

Comments Submitted to a journal