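arXivDaily daily academic digest: syncs the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems science, and more.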

2511.09396 2026-03-05 cs.CL cs.AI

Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

Lukas Arana, Julen Etxaniz, Ander Salaberria, Gorka Azkune

Abstract

Current Multimodal Large Language Models (MLLMs) show very strong performance on several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results have not yet been achieved within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own image-text datasets for training and evaluation. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque-instructed backbone LLM is not required to obtain a strong MLLM in Basque. By openly releasing our resources, we pave the way for developing MLLMs for other low-resource languages.
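
The abstract reports that around 20% Basque multimodal data in the training mixture is already enough for solid Basque results. A minimal sketch of assembling a mixture at a fixed Basque ratio follows. It assumes the remaining share comes from a second image-text pool (English is used here only as an example), samples with replacement so a small Basque pool can still fill its share, and `build_mixture` is a hypothetical helper, not the authors' released code.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_mixture(basque: Sequence[T], other: Sequence[T], basque_ratio: float,
                  total: int, seed: int = 0) -> List[T]:
    """Hypothetical helper: sample `total` examples, round(total * basque_ratio) of them Basque."""
    if not 0.0 <= basque_ratio <= 1.0:
        raise ValueError("basque_ratio must be in [0, 1]")
    n_basque = round(total * basque_ratio)
    n_other = total - n_basque
    rng = random.Random(seed)  # fixed seed so the mixture is reproducible
    # Sampling with replacement lets a small Basque pool still fill its share.
    mixture = rng.choices(basque, k=n_basque) + rng.choices(other, k=n_other)
    rng.shuffle(mixture)  # interleave Basque and non-Basque examples
    return mixture


# Toy pools: (language tag, example id) pairs stand in for image-text examples.
basque = [("eu", i) for i in range(50)]
english = [("en", i) for i in range(500)]
mix = build_mixture(basque, english, basque_ratio=0.2, total=1000)
print(sum(lang == "eu" for lang, _ in mix))  # 200
```

Sweeping `basque_ratio` over a few values with the same `total` is one simple way to reproduce the kind of data-mixture comparison the abstract describes.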

2511.08417 2026-03-05 cs.LG cs.CV

NeuCLIP: Efficient Large-Scale CLIP Training with Neural Normalizer Optimization

Xiyuan Wei, Chih-Jen Lin, Tianbao Yang

Comments Accepted to the 40th International Conference on Learning Representations. 32 pages, 5 figures

2511.08269 2026-03-05 cs.CV

Re-coding for Uncertainties: Edge-awareness Semantic Concordance for Resilient Event-RGB Segmentation

Nan Bao, Yifan Zhao, Lin Zhu, Jia Li

Comments Accepted to NeurIPS 2025; code and datasets available at https://github.com/iCVTEAM/ESC

2511.07162 2026-03-05 cs.CL

Categorical Emotions or Appraisals - Which Emotion Model Explains Argument Convincingness Better?

Lynn Greschner, Meike Bauer, Sabine Weber, Roman Klinger

Comments Accepted at LREC 2026

2511.06427 2026-03-05 cs.CL cs.CY

Dutch Metaphor Extraction from Cancer Patients' Interviews and Forum Data using LLMs and Human in the Loop

Lifeng Han, David Lindevelt, Sander Puts, Erik van Mulligen, Suzan Verberne

Comments Ongoing project report, on behalf of 4D PICTURE https://4dpicture.eu/

2511.05854 2026-03-05 cs.AI

Can a Small Model Learn to Look Before It Leaps? Dynamic Learning and Proactive Correction for Hallucination Detection

Zepeng Bao, Shen Zhou, Qiankun Pi, Jianhao Chen, Mayi Xu, Ming Zhong, Yuanyuan Zhu, Tieyun Qian

2511.03950 2026-03-05 cs.CV cs.AI

Improving Multi-View Reconstruction via Texture-Guided Gaussian-Mesh Joint Optimization

Zhejia Cai, Puhua Jiang, Shiwei Mao, Hongkun Cao, Ruqi Huang

Comments 10 pages; corrected errors and clarified details; accepted to 3DV 2026

2511.03441 2026-03-05 cs.CL cs.AI

CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field

Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre

Comments Accepted at LREC 2026. To access the dataset, see https://github.com/bonzid/CareMedEval

2511.01131 2026-03-05 cs.CV

Weakly Supervised Concept Learning with Class-Level Priors for Interpretable Medical Diagnosis

Md Nahiduzzaman, Steven Korevaar, Alireza Bab-Hadiashar, Ruwan Tennakoon

Comments Accepted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026

2510.26905 2026-03-05 cs.AI

Cognition Envelopes for Bounded Decision Making in Autonomous UAS Operations

Pedro Antonio Alarcon Granadeno, Arturo Miguel Bernal Russell, Sofia Nelson, Demetrius Hernandez, Maureen Petterson, Michael Murphy, Walter J. Scheirer, Jane Cleland-Huang

Comments 12 pages, 9 figures

2510.26303 2026-03-05 cs.LG cs.AI math.OC stat.ML

Implicit Bias of Per-sample Adam on Separable Data: Departure from the Full-batch Regime

Beomhan Baek, Minhak Song, Chulhee Yun

Comments Published at ICLR 2026

2510.25191 2026-03-05 cs.RO

SoraNav: Adaptive UAV Task-Centric Navigation via Zeroshot VLM Reasoning

Hongyu Song, Rishabh Dev Yadav, Cheng Guo, Wei Pan

2510.24702 2026-03-05 cs.CL cs.AI

Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig

2510.24178 2026-03-05 cs.CL cs.AI

MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations

Aaron Scott, Maike Züfle, Jan Niehues

2510.19655 2026-03-05 cs.RO

LaViRA: Language-Vision-Robot Actions Translation for Zero-Shot Vision Language Navigation in Continuous Environments

Hongyu Ding, Ziming Xu, Yudong Fang, You Wu, Zixuan Chen, Jieqi Shi, Jing Huo, Yifan Zhang, Yang Gao

Comments ICRA 2026

2510.18573 2026-03-05 cs.CV cs.AI

Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model

Zhenxing Zhang, Jiayan Teng, Zhuoyi Yang, Tiankun Cao, Cheng Wang, Xiaotao Gu, Jie Tang, Dan Guo, Meng Wang

Comments 21 pages, 7 figures

2510.17509 2026-03-05 cs.CL

Annotation-Efficient Universal Honesty Alignment

Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu, Zengxin Han, Xueqi Cheng

Comments ICLR 2026

2510.15040 2026-03-05 cs.CV cs.CL cs.LG

Composition-Grounded Data Synthesis for Visual Reasoning

Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He

Comments ICLR 2026 camera-ready version. Project page: https://cogsynthesis.github.io

2510.14936 2026-03-05 cs.LG cs.AI cs.CL

Circuit Insights: Towards Interpretability Beyond Activations

Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, Sebastian Lapuschkin

2510.10889 2026-03-05 cs.CV cs.AI cs.LG

Topological Alignment of Shared Vision-Language Embedding Space

Junwon You, Dasol Kang, Jae-Hun Jung

Comments 27 pages, 5 figures, 24 tables

2510.08580 2026-03-05 cs.SD cs.AI eess.AS

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

Benjamin Shiue-Hal Chou, Purvish Jajal, Nick John Eliopoulos, James C. Davis, George K. Thiruvathukal, Kristen Yeon-Ji Yun, Yung-Hsiang Lu

Comments Accepted to ICLR 2026

2510.07181 2026-03-05 cs.RO cs.AI cs.CV

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

Yi Han, Enshen Zhou, Shanyu Rong, Jingkun An, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Lu Sheng, Shanghang Zhang

Comments 8 pages, 6 figures

2510.07151 2026-03-05 cs.LG cs.AI cs.RO

ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL Problems

Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov

Comments 31 pages, 15 figures, 8 tables

2510.05091 2026-03-05 cs.CV

Factuality Matters: When Image Generation and Editing Meet Structured Visuals

Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, Hongsheng Li

Comments Accepted by ICLR 2026. Project page: https://structvisuals.github.io

2510.02903 2026-03-05 cs.LG q-bio.CB

Learning Explicit Single-Cell Dynamics Using ODE Representations

Jan-Philipp von Bassewitz, Adeel Pervez, Marco Fumero, Matthew Robinson, Theofanis Karaletsos, Francesco Locatello

Comments 27 pages, 11 figures

2509.25541 2026-03-05 cs.CV cs.AI

Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play

Qinsi Wang, Bo Liu, Tianyi Zhou, Jing Shi, Yueqian Lin, Yiran Chen, Hai Helen Li, Kun Wan, Wentian Zhao

Comments ICLR 2026

Abstract

Although reinforcement learning (RL) has emerged as a promising approach for improving vision-language models (VLMs) and multimodal large language models (MLLMs), current methods rely heavily on manually curated datasets and costly human verification, which limits scalable self-improvement in multimodal systems. To address this challenge, we propose Vision-Zero, a label-free, domain-agnostic multi-agent self-play framework for self-evolving VLMs through competitive visual games generated from arbitrary image inputs. Specifically, Vision-Zero encompasses three main attributes: (1) Strategic Self-Play Framework: Vision-Zero trains VLMs in "Who Is the Spy"-style games, where the models engage in strategic reasoning and actions across multiple roles. Through interactive gameplay, models autonomously generate their training data without human annotation. (2) Gameplay from Arbitrary Images: Unlike existing gamified frameworks, Vision-Zero can generate games from arbitrary images, thereby enhancing the model's reasoning ability across diverse domains and showing strong generalization to different tasks. We demonstrate this versatility using three distinct types of image datasets: CLEVR-based synthetic scenes, charts, and real-world images. (3) Sustainable Performance Gain: We introduce Iterative Self-Play Policy Optimization (Iterative-SPO), a novel training algorithm that alternates between Self-Play and reinforcement learning with verifiable rewards (RLVR), mitigating the performance plateau often seen in self-play-only training and achieving sustained long-term improvements. Despite using label-free data, Vision-Zero achieves state-of-the-art performance on reasoning, chart question answering, and vision-centric understanding tasks, surpassing other annotation-based methods. Models and code have been released at https://github.com/wangqinsi1/Vision-Zero.
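
The abstract describes Iterative-SPO only as alternating between Self-Play and RLVR to avoid the plateau of self-play-only training. As a rough illustration of such an alternation loop, the sketch below switches phases when a monitored evaluation score stops improving. The switching rule (`patience` evaluations without a new best score), the phase names, and the `PhaseScheduler` class are all hypothetical assumptions, not the paper's actual criterion.

```python
from dataclasses import dataclass


@dataclass
class PhaseScheduler:
    """Hypothetical alternation rule between two training phases.

    Switches from "self_play" to "rlvr" (and back) once the active phase has gone
    `patience` consecutive evaluations without beating its own best score.
    """
    patience: int = 2
    phase: str = "self_play"
    best: float = float("-inf")
    stale: int = 0

    def step(self, score: float) -> str:
        """Record one evaluation score for the active phase; return the phase to run next."""
        if score > self.best:
            self.best, self.stale = score, 0
        else:
            self.stale += 1
        if self.stale >= self.patience:
            # Plateau detected: hand over to the other phase and reset its tracking.
            self.phase = "rlvr" if self.phase == "self_play" else "self_play"
            self.best, self.stale = float("-inf"), 0
        return self.phase


sched = PhaseScheduler(patience=2)
print([sched.step(s) for s in [0.1, 0.2, 0.2, 0.2, 0.3, 0.3, 0.3]])
# ['self_play', 'self_play', 'self_play', 'rlvr', 'rlvr', 'rlvr', 'self_play']
```

Resetting the best score on every switch means each phase is judged only against its own recent history; a fixed number of iterations per phase would be an equally plausible, simpler alternative.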

2509.25135 2026-03-05 cs.LG stat.ML

Learning in an Echo Chamber: Online Learning with Replay Adversary

Daniil Dmitriev, Harald Eskelund Franck, Carolin Heinzler, Amartya Sanyal

DOI: 10.1137/1.9781611978971.239
Journal ref: Proceedings of the 2026 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)

Abstract

As machine learning systems increasingly train on self-annotated data, they risk reinforcing errors and becoming echo chambers of their own beliefs. We model this phenomenon by introducing a learning-theoretic framework: Online Learning in the Replay Setting. In round $t$, the learner outputs a hypothesis $\hat{h}_t$; the adversary then reveals either the true label $f^\ast(x_t)$ or a replayed label $\hat{h}_i(x_t)$ from an earlier round $i < t$. A mistake is counted only when the true label is shown, yet classical algorithms such as the SOA or the halving algorithm are easily misled by the replayed errors. We introduce the Extended Threshold dimension, $\mathrm{ExThD}(\mathcal{H})$, and prove matching upper and lower bounds that make $\mathrm{ExThD}(\mathcal{H})$ the exact measure of learnability in this model. A closure-based learner makes at most $\mathrm{ExThD}(\mathcal{H})$ mistakes against any adaptive adversary, and no algorithm can perform better. For stochastic adversaries, we prove a similar bound for every intersection-closed class. The replay setting is provably harder than the classical mistake bound setting: some classes have constant Littlestone dimension but arbitrarily large $\mathrm{ExThD}(\mathcal{H})$. Proper learning exhibits an even sharper separation: a class is properly learnable under replay if and only if it is (almost) intersection-closed. Otherwise, every proper learner suffers $\Omega(T)$ errors, whereas our improper algorithm still achieves the $\mathrm{ExThD}(\mathcal{H})$ bound. These results give the first tight analysis of learning against replay adversaries, based on new results for closure-type algorithms.
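
The replay protocol can be made concrete with a small simulation. The sketch below is illustrative only and is not the paper's algorithm or analysis: it assumes a toy intersection-closed class (integer intervals $[a, b]$) and uses the classical closure learner, which predicts with the smallest interval containing every positively labeled point seen so far and ignores negatives. All function names are hypothetical, and as a simplifying assumption of the sketch the learner updates on every revealed label in the same way, whether it is true or replayed. Only rounds that reveal the true label $f^\ast(x_t)$ count toward mistakes.

```python
from typing import List, Optional, Sequence, Tuple

# A hypothesis is an integer interval [a, b]; None is the empty hypothesis (predicts 0 everywhere).
Interval = Optional[Tuple[int, int]]


def predict(h: Interval, x: int) -> int:
    return int(h is not None and h[0] <= x <= h[1])


def closure(h: Interval, x: int) -> Interval:
    """Smallest interval containing the current hypothesis and the new positive point x."""
    if h is None:
        return (x, x)
    return (min(h[0], x), max(h[1], x))


def run_replay_game(target: Tuple[int, int],
                    rounds: Sequence[Tuple[int, Optional[int]]]) -> Tuple[int, Interval]:
    """Simulate the replay protocol against a fixed adversary script.

    Each round is (x_t, i): i is None if the adversary reveals the true label f*(x_t),
    otherwise the index of an earlier round i < t whose hypothesis h_i labels x_t.
    Returns (number of counted mistakes, final hypothesis).
    """
    history: List[Interval] = []  # history[t] is the hypothesis h_t committed in round t
    h: Interval = None
    mistakes = 0
    for t, (x, replay_i) in enumerate(rounds):
        history.append(h)
        y_hat = predict(h, x)
        if replay_i is None:
            y = predict(target, x)  # true label f*(x_t)
            mistakes += int(y_hat != y)  # only true-label rounds can count as mistakes
        else:
            if not 0 <= replay_i < t:
                raise ValueError("replay index must refer to an earlier round i < t")
            y = predict(history[replay_i], x)  # replayed label h_i(x_t)
        # Closure update: grow on positives, ignore negatives (same rule for true and replayed labels).
        if y == 1:
            h = closure(h, x)
    return mistakes, h


rounds = [(5, None), (9, None), (4, 0), (4, None), (7, None), (6, None), (3, None), (5, 6)]
print(run_replay_game((3, 7), rounds))  # (4, (3, 7))
```

In this toy case every closure hypothesis contains the earlier ones and stays inside the target, so a replayed positive $\hat{h}_i(x_t) = 1$ already lies in the current hypothesis and cannot push it outside the target; this is one intuition for why closure-type learners suit the replay setting.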

2509.25106 2026-03-05 cs.CL cs.AI cs.IR

Towards Personalized Deep Research: Benchmarks and Evaluations

Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian, Pai Liu, Shuofei Qiao, Runnan Fang, He Zhu, Ge Zhang, Minghao Liu, Yuchen Eleanor Jiang, Ningyu Zhang, Wangchunshu Zhou

2509.23124 2026-03-05 cs.CL

Non-Collaborative User Simulators for Tool Agents

Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon KooK, Yohan Jo

Comments Accepted to ICLR 2026

2509.22580 2026-03-05 cs.LG

The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?

Guannan Lai, Da-Wei Zhou, Xin Yang, Han-Jia Ye