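arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.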

2511.11910 2026-02-26 cs.CV

Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models

Siyou Li, Huanan Wu, Juexi Shao, Yinghao Ma, Yujian Gan, Yihao Luo, Yuwei Wang, Dong Nie, Lu Wang, Wenqing Wu, Le Zhang, Massimo Poesio, Juntao Yu

Abstract

Despite the recent advances in the video understanding ability of multimodal large language models (MLLMs), long video understanding remains a challenge. One of the main issues is that the number of vision tokens grows linearly with video length, which causes an explosion in attention cost, memory, and latency. To solve this challenge, we present Query-aware Token Selector (QTSplus), a lightweight yet powerful visual token selection module that serves as an information gate between the vision encoder and LLMs. Given a text query and video tokens, QTSplus dynamically selects the most important visual evidence for the input text query by (i) scoring visual tokens via cross-attention, (ii) predicting an instance-specific retention budget based on the complexity of the query, and (iii) selecting Top-n tokens with a differentiable straight-through estimator during training and a hard gate at inference. Furthermore, a small re-encoder preserves temporal order using absolute time information, enabling second-level localization while maintaining global coverage. Integrated into Qwen2.5-VL, QTSplus compresses the vision stream by up to 89% and reduces end-to-end latency by 28% on long videos. The evaluation on eight long video understanding benchmarks shows near-parity accuracy overall when compared with the original Qwen models and outperforms the original model by +20.5 and +5.6 points respectively on TempCompass direction and order accuracies. These results show that QTSplus is an effective, general mechanism for scaling MLLMs to real-world long-video scenarios while preserving task-relevant evidence.
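
The scoring and selection steps described in this abstract can be illustrated with a minimal sketch of the inference-time hard gate only. This is not the authors' implementation: the function names, the single-vector query, the scaled dot-product scoring, and the fixed `retention` ratio (which stands in for the learned, instance-specific budget predictor) are all assumptions made for illustration. The straight-through estimator used during training and the re-encoder are omitted.

```python
import math

def attention_scores(query, tokens):
    """Softmax over scaled dot products between one query vector and each token."""
    d = len(query)
    logits = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def select_top_n(query, tokens, retention):
    """Hard gate: keep the ceil(retention * N) highest-scoring tokens.

    `retention` is a hypothetical stand-in for a predicted budget in (0, 1].
    Indices are returned in ascending order so temporal order is preserved.
    """
    scores = attention_scores(query, tokens)
    n = max(1, math.ceil(retention * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:n]
    return sorted(ranked)

# Toy data: five 2-D "visual tokens" and a query aligned with the first axis.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.2, 0.8]]
query = [1.0, 0.0]
kept = select_top_n(query, tokens, retention=0.5)
print(kept)  # indices of retained tokens, in original order
```

Returning sorted indices rather than score-ranked ones is a deliberate choice here: it keeps the retained tokens in their original temporal order, which matches the abstract's emphasis on preserving order after selection.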

2511.11456 2026-02-26 cs.RO

SimTac: A Physics-Based Simulator for Vision-Based Tactile Sensing with Biomorphic Structures

Xuyang Zhang, Jiaqi Jiang, Zhuo Chen, Yongqiang Zhao, Tianqi Yang, Daniel Fernandes Gomes, Jianan Wang, Shan Luo

2510.26656 2026-02-26 cs.RO cs.LG

Heuristic Adaptation of Potentially Misspecified Domain Support for Likelihood-Free Inference in Stochastic Dynamical Systems

Georgios Kamaras, Craig Innes, Subramanian Ramamoorthy

Comments: 20 pages, 18 figures

2510.20498 2026-02-26 cs.CL

Robust Preference Alignment via Directional Neighborhood Consensus

Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei

Comments: Accepted to ICLR 2026

Abstract

Aligning large language models with human preferences is critical for creating reliable and controllable AI systems. A human preference can be visualized as a high-dimensional vector where different directions represent trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet, because the training data often reflects dominant, average preferences, LLMs tend to perform well on common requests but fall short on specific, individual needs. This mismatch creates a preference coverage gap. Existing methods often address this through costly retraining, which may not generalize to the full spectrum of diverse preferences. This brittleness means that when a user's request reflects a nuanced preference deviating from the training data's central tendency, model performance can degrade unpredictably. To address this challenge, we introduce Robust Preference Selection (RPS), a post-hoc, training-free method that leverages directional neighborhood consensus. Instead of forcing a model to generate a response from a single, highly specific preference, RPS samples multiple responses from a local neighborhood of related preferences to create a superior candidate pool. It then selects the response that best aligns with the user's original intent. We provide a theoretical framework showing our neighborhood generation strategy is provably superior to a strong baseline that also samples multiple candidates. Comprehensive experiments across three distinct alignment paradigms (DPA, DPO, and SFT) demonstrate that RPS consistently improves robustness against this baseline, achieving win rates of up to 69% on challenging preferences from under-represented regions of the space without any model retraining. Our work presents a practical, theoretically grounded solution for enhancing the reliability of preference-aligned models.
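
The sample-then-select idea in this abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the RPS implementation: preferences are reduced to 2-D unit vectors, the neighborhood is taken to be evenly spaced directions within a fixed angle, and `toy_generate` and `toy_reward` are hypothetical stand-ins for a preference-conditioned model and a multi-attribute reward model.

```python
import math

def neighborhood(pref, max_angle_deg, k):
    """Return k unit directions spread evenly within +/- max_angle_deg of `pref`."""
    base = math.atan2(pref[1], pref[0])
    if k == 1:
        return [(math.cos(base), math.sin(base))]
    span = math.radians(max_angle_deg)
    angles = [base - span + 2 * span * i / (k - 1) for i in range(k)]
    return [(math.cos(a), math.sin(a)) for a in angles]

def robust_select(user_pref, generate, reward, max_angle_deg=20.0, k=5):
    """Generate one candidate per neighboring preference, then pick the
    candidate whose reward vector aligns best with the ORIGINAL preference."""
    candidates = [generate(d) for d in neighborhood(user_pref, max_angle_deg, k)]

    def alignment(response):
        r = reward(response)
        return user_pref[0] * r[0] + user_pref[1] * r[1]

    return max(candidates, key=alignment)

# Hypothetical stand-ins: the "response" is just the conditioning direction,
# and the toy model is miscalibrated, so its realized attributes are rotated
# by 10 degrees relative to what it was asked for.
def toy_generate(direction):
    return direction

def toy_reward(response):
    a = math.radians(10.0)
    x, y = response
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))

user_pref = (0.6, 0.8)
best = robust_select(user_pref, toy_generate, toy_reward)
print(best)
```

In this toy setup the candidate generated from the exact user preference is not the winner: the systematic 10-degree drift means a neighboring direction yields the response that best matches the original intent, which is the kind of case a neighborhood-based candidate pool is meant to cover.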

2510.13654 2026-02-26 cs.LG cs.AI

Rethinking Evaluation in the Era of Time Series Foundation Models: (Un)known Information Leakage Challenges

Marcel Meyer, Sascha Kaltenpoth, Kevin Zalipski, Oliver Müller

2510.10625 2026-02-26 cs.LG cs.CR cs.CV

ImpMIA: Leveraging Implicit Bias for Membership Inference Attack

Yuval Golbari, Navve Wasserman, Gal Vardi, Michal Irani

2510.09256 2026-02-26 cs.CV

Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy

Patrick Wienholt, Sophie Caselitz, Robert Siepmann, Philipp Bruners, Keno Bressem, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

Comments: Code is available at https://github.com/TruhnLab/VisionSemanticEntropy

DOI: 10.1007/s00330-026-12384-z
Journal ref: Eur Radiol (2026)

Abstract

To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image-based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 (Generative Pretrained Transformer; OpenAI) answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
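
The clustering and entropy computation described in this abstract can be sketched as below. This is a minimal sketch under assumptions, not the code from the linked repository: the greedy clustering strategy, the use of the natural logarithm without normalization, and the `toy_entails` function (a case-insensitive string match standing in for a real entailment model) are all illustrative choices. Only the 0.3 threshold is taken from the abstract.

```python
import math

def cluster_by_entailment(answers, entails):
    """Greedy clustering: an answer joins the first cluster whose representative
    it both entails and is entailed by (bidirectional entailment)."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            rep = cluster[0]
            if entails(answer, rep) and entails(rep, answer):
                cluster.append(answer)
                break
        else:  # no existing cluster matched
            clusters.append([answer])
    return clusters

def discrete_semantic_entropy(clusters):
    """Shannon entropy (natural log, an assumption) of cluster relative frequencies."""
    total = sum(len(c) for c in clusters)
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)

# Hypothetical stand-in for an entailment model: normalized string equality.
def toy_entails(a, b):
    return a.strip().lower() == b.strip().lower()

answers = ["Pneumonia", "pneumonia", "pneumonia ", "Pleural effusion"]
clusters = cluster_by_entailment(answers, toy_entails)
dse = discrete_semantic_entropy(clusters)

DSE_THRESHOLD = 0.3  # stricter of the two thresholds reported in the abstract
keep_question = dse <= DSE_THRESHOLD
print(len(clusters), dse, keep_question)
```

With three of four sampled answers agreeing, the entropy in this toy case is still above 0.3, so the question would be rejected; a question whose sampled answers all fall into one cluster has entropy zero and would be kept.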

2510.04091 2026-02-26 cs.LG

Rethinking Consistent Multi-Label Classification Under Inexact Supervision

Wei Wang, Tianhao Ma, Ming-Kun Xie, Gang Niu, Masashi Sugiyama

Comments: ICLR 2026

2510.01988 2026-02-26 cs.LG

PepCompass: Navigating peptide embedding spaces using Riemannian Geometry

Marcin Możejko, Adam Bielecki, Jurand Prądzyński, Marcin Traskowski, Antoni Janowski, Hyun-Su Lee, Marcelo Der Torossian Torres, Michał Kmicikiewicz, Paulina Szymczak, Karol Jurasz, Michał Kucharczyk, Cesar de la Fuente-Nunez, Ewa Szczurek

2509.25800 2026-02-26 cs.LG stat.ME

Characterization and Learning of Causal Graphs with Latent Confounders and Post-treatment Selection from Interventional Data

Gongxu Luo, Loka Li, Guangyi Chen, Haoyue Dai, Kun Zhang

2509.21865 2026-02-26 cs.LG

Beyond RAG vs. Long-Context: Learning Distraction-Aware Retrieval for Efficient Knowledge Grounding

Seongwoong Shim, Myunsoo Kim, Jae Hyeon Cho, Byung-Jun Lee

Comments: Accepted at ICLR 2026

2509.18880 2026-02-26 cs.CL cs.AI cs.LG

Diversity Boosts AI-Generated Text Detection

Advik Raj Basani, Pin-Yu Chen

Comments: Accepted to Transactions on Machine Learning Research (TMLR '26). Project page and demos: https://diveye.vercel.app/

2509.07477 2026-02-26 cs.CV cs.LG

MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

Patrick Wienholt, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

Comments: 28 pages, 12 figures

2509.01552 2026-02-26 cs.CV

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen

Comments: Accepted by CVPR 2026. Code is available at https://github.com/xuyang-liu16/V2Drop

2508.21438 2026-02-26 cs.LG q-bio.OT quant-ph

Quantum enhanced ensemble GANs for anomaly detection in continuous biomanufacturing

Rajiv Kailasanathan, William R. Clements, Mohammad Reza Boskabadi, Shawn M. Gibford, Emmanouil Papadakis, Christopher J. Savoie, Seyed Soheil Mansouri

Comments: Accepted in the Journal of Industrial & Engineering Chemistry Research

2508.21421 2026-02-26 cs.LG

Rethinking Layer-wise Model Merging through Chain of Merges

Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara

2508.15427 2026-02-26 cs.RO cs.CV

Lang2Lift: A Language-Guided Autonomous Forklift System for Outdoor Industrial Pallet Handling

Huy Hoang Nguyen, Johannes Huemer, Markus Murschitz, Tobias Glueck, Minh Nhat Vu, Andreas Kugi

Comments: 8 pages, 7 figures

2508.04605 2026-02-26 cs.LG math.DS

Multitask Learning with Stochastic Interpolants

Hugo Negrel, Florentin Coeurdoux, Michael S. Albergo, Eric Vanden-Eijnden

2508.01617 2026-02-26 cs.CV

LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Peijie Qiu, Shao Tang, Xin Li, Yalin Wang

2507.08422 2026-02-26 cs.CV eess.IV

Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers

Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun

2507.08017 2026-02-26 cs.CL cs.AI

Mechanistic Indicators of Understanding in Large Language Models

Pierre Beckmann, Matthieu Queloz

Comments: 38 pages

2507.06593 2026-02-26 cs.CV eess.IV

Capturing Stable HDR Videos Using a Dual-Camera System

Qianyu Zhang, Bolun Zheng, Lingyu Zhu, Hangjia Pan, Zunjie Zhu, Zongpeng Li, Shiqi Wang

2507.00031 2026-02-26 cs.LG

Enhancing Spatio-Temporal Forecasting with Spatial Neighbourhood Fusion: A Case Study on COVID-19 Mobility in Peru

Chuan Li, Jiang You, Hassine Moungla, Vincent Gauthier, Miguel Nunez-del-Prado, Hugo Alatrista-Salas

2506.22685 2026-02-26 cs.LG cs.GR

Mitigating Semantic Collapse in Generative Personalization with Test-Time Embedding Adjustment

Anh Bui, Trang Vu, Trung Le, Junae Kim, Tamas Abraham, Rollin Omari, Amar Kaur, Dinh Phung

2506.13793 2026-02-26 cs.AI

Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection

Zongxian Yang, Jiayu Qian, Zegao Peng, Haoyu Zhang, Yu-An Huang, KC Tan, Zhi-An Huang

2506.10082 2026-02-26 cs.CV

LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning

Chenjian Gao, Lihe Ding, Xin Cai, Zhanpeng Huang, Zibin Wang, Tianfan Xue

Comments: ICLR 2026

2506.07477 2026-02-26 cs.LG cs.AI cs.LO

Premise Selection for a Lean Hammer

Thomas Zhu, Joshua Clune, Jeremy Avigad, Albert Qiaochu Jiang, Sean Welleck

Comments: LeanPremise is available at https://github.com/hanwenzhu/premise-selection and LeanHammer is available at https://github.com/JOSHCLUNE/LeanHammer

2506.05154 2026-02-26 cs.CL cs.AI cs.IR

Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement

Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun, Muhan Chen, Chenfu Bao, Zhonghou Lyu

Comments: Accepted to ICLR 2026

2506.04941 2026-02-26 cs.RO

ArtVIP: Articulated Digital Assets of Visual Realism, Modular Interaction, and Physical Fidelity for Robot Learning

Zhao Jin, Zhengping Che, Tao Li, Zhen Zhao, Kun Wu, Yuheng Zhang, Yinuo Zhao, Zehui Liu, Qiang Zhang, Xiaozhu Ju, Jing Tian, Yousong Xue, Jian Tang

2506.01085 2026-02-26 cs.CV cs.AI

Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection

Shivam Chandhok, Qian Yang, Oscar Manas, Kanishk Jain, Leonid Sigal, Aishwarya Agrawal

Comments: CVPR 2026