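arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.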

2604.18250 2026-04-21 cs.CV

Medical Image Understanding Improves Survival Prediction via Visual Instruction Tuning

Xixi Liu, Jorge Lazo, Andreas Hallqvist, Mikael Johansson, Åse Johnsson, Jonas S Andersson, Ella Äng Eklund, Patrik Sund, Nasser Hosseini, Jennifer Alvén, Ida Häggström

Comments Submitted to MICCAI 2026

Abstract

Accurate prognostication and risk estimation are essential for guiding clinical decision-making and optimizing patient management. While radiologist-assessed features from CT scans provide valuable indicators of disease severity and outcomes, interpreting such images requires expert knowledge, and translating rich visual information into textual summaries inevitably leads to information loss. In this work, we propose a vision-language framework for 3D CT image understanding that leverages large-scale open-sourced CT images paired with radiology reports through visual instruction tuning. This pre-training enables the model to learn clinically meaningful visual-textual representations, which can then be adapted to downstream survival prediction tasks. By incorporating a survival prediction head on top of the pre-trained model, our approach improves survival prediction from CT images and clinical data while generating clinically meaningful language responses to predefined questions. Experimental results demonstrate that our method outperforms baseline methods in survival prediction, particularly when clinical data alone is less predictive. The code will be released upon acceptance.

2604.18249 2026-04-21 cs.CL

Where Do Self-Supervised Speech Models Become Unfair?

Felix Herron, Maja Hjuler, Solange Rossato, Alexandre Allauzen, François Portet

2604.18240 2026-04-21 cs.AI

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Wentao Shi, Yu Wang, Yuyang Zhao, Yuxin Chen, Fuli Feng, Xueyuan Hao, Xi Su, Qi Gu, Hui Su, Xunliang Cai, Xiangnan He

Comments Accepted to ACL 2026 Findings. 43 pages total, 5 figures

2604.18237 2026-04-21 cs.LG cs.AI

Semantic-based Distributed Learning for Diverse and Discriminative Representations

Zhuojun Tian, Chaouki Ben Issaid, Mehdi Bennis

2604.18236 2026-04-21 cs.RO

COFFAIL: A Dataset of Successful and Anomalous Robot Skill Executions in the Context of Coffee Preparation

Alex Mitrevski, Ayush Salunke

Comments Presented as an extended abstract at the 2nd German Robotics Conference (GRC)

2604.18226 2026-04-21 cs.CL

Model in Distress: Sentiment Analysis on French Synthetic Social Media

Pierre-Carl Langlais, Pavel Chizhov, Yannick Detrois, Carlos Rosas Hinostroza, Ivan P. Yamshchikov, Bastien Perroy

2604.18223 2026-04-21 cs.CV

Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation

Zhen Liu, Yuhan Liu, Jinjun Wang, Jianyi Liu, Wei Song, Jingwen Fu

2604.18210 2026-04-21 cs.AI cs.LG cs.MA

TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics

Sheng Xu, Guiliang Liu, Tarak Kharrat, Yudong Luo, Mohamed Aloulou, Javier López Peña, Konstantin Sofeikov, Adam Reid, Paul Roberts, Steven Spencer, Joe Carnall, Ian McHale, Oliver Schulte, Hongyuan Zha, Wei-Shi Zheng

Comments 23 pages

2604.18208 2026-04-21 cs.CV math.GT

Towards Symmetry-sensitive Pose Estimation: A Rotation Representation for Symmetric Object Classes

Andreas Kriegler, Csaba Beleznai, Margrit Gelautz

Comments Published Open-Access in IJCV, see https://link.springer.com/article/10.1007/s11263-026-02770-x . 28 pages, 6 figures, 9 tables, 1 algorithm

Details

DOI: 10.1007/s11263-026-02770-x
Journal ref: Int J Comput Vis 134, 212 (2026)

Abstract

Symmetric objects are common in daily life and industry, yet their inherent orientation ambiguities that impede the training of deep learning networks for pose estimation are rarely discussed in the literature. To cope with these ambiguities, existing solutions typically require the design of specific loss functions and network architectures or resort to symmetry-invariant evaluation metrics. In contrast, we focus on the numeric representation of the rotation itself, modifying trigonometric identities with the degrees of symmetry derived from the objects' shapes. We use our representation, SARR, to obtain canonic (symmetry-resolved) poses for the symmetric objects in two popular 6D pose estimation datasets, T-LESS and ITODD, where SARR is unique and continuous w.r.t. the visual appearance. This allows us to use a standard CNN for 3D orientation estimation whose performance is evaluated with the symmetry-sensitive cosine distance $\text{AR}_{\text{C}}$. Our networks outperform the state of the art using $\text{AR}_{\text{C}}$ and achieve satisfactory performance when using conventional symmetry-invariant measures. Our method does not require any 3D models but only depth, or, as part of an additional experiment, texture-less RGB/grayscale images as input. We also show that networks trained on SARR outperform the same networks trained on rotation matrices, Euler angles, quaternions, standard trigonometrics or the recently popular 6d representation -- even in inference scenarios where no prior knowledge of the objects' symmetry properties is available. Code and a visualization toolkit are available at https://github.com/akriegler/SARR .
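
The abstract does not spell out the SARR formulas. As a rough illustration of the general idea of folding a symmetry order into a trigonometric encoding, the sketch below assumes a single rotation axis with n-fold symmetry and encodes the angle as (cos nθ, sin nθ), so that visually identical orientations map to the same code. The function names `encode_angle` and `decode_angle` and the single-axis setting are assumptions made for illustration and should not be read as the authors' representation.

```python
import math


def encode_angle(theta: float, n: int) -> tuple:
    """Encode a rotation angle (radians) for an object with n-fold symmetry.

    Illustrative assumption: multiplying the angle by the symmetry order n
    makes theta and theta + 2*pi/n produce the same (cos, sin) pair.
    """
    return (math.cos(n * theta), math.sin(n * theta))


def decode_angle(c: float, s: float, n: int) -> float:
    """Recover one canonical angle, approximately within [0, 2*pi/n)."""
    return (math.atan2(s, c) % (2 * math.pi)) / n


if __name__ == "__main__":
    # A 4-fold symmetric object looks the same after a quarter turn.
    a = encode_angle(0.3, 4)
    b = encode_angle(0.3 + math.pi / 2, 4)
    print(a, b)                 # the two encodings agree up to rounding
    print(decode_angle(*b, 4))  # close to 0.3, the canonical angle
```

Because both orientations share one encoding, a regression target built this way changes continuously with the visual appearance, which is the property the abstract emphasizes.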

2604.18206 2026-04-21 cs.AI

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang, Zhicheng Qian, Xingyu Zhou

2604.18205 2026-04-21 cs.CV cs.RO

A Comparative Evaluation of Geometric Accuracy in NeRF and Gaussian Splatting

Mikolaj Zielinski, Eryk Vykysaly, Bartlomiej Biesiada, Jan Baturo, Mateusz Capala, Dominik Belter

2604.18204 2026-04-21 cs.CL

Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages

V. S. D. S. Mahesh Akavarapu, Michael Daniel, Gerhard Jäger

Comments Accepted to ACL 2026 (Findings)

2604.18203 2026-04-21 cs.CL

Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

Samuel G. Balter, Ethan Jerzak, Connor T. Jerzak

Comments To appear in ACL Findings (2026)

Abstract

Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances from a reproducible generator. We also define arithmetic load, C, as the product of the total and non-zero digit count as a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often nearing zero by C > 100. Indeed, C remains predictive of performance across modalities and models, with R-squared often > 0.5, nearing the value from more complex measures of arithmetic load that count the number of intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (> 99%) across modalities, even when multiplication accuracy drops. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes--including columnar multiplication, distributive decomposition, and rounding/compensation. Here, decomposition is favored in both text and vision modalities; heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating the base model maintains a well-tuned internal router.
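
The arithmetic load C defined in the abstract can be illustrated with a short sketch. It assumes that both digit counts are taken over the two operands combined, which the abstract does not state explicitly. The function name `arithmetic_load` is hypothetical.

```python
def arithmetic_load(a: int, b: int) -> int:
    """Return C = (total digit count) * (non-zero digit count).

    Assumption: digits are counted over both operands together.
    """
    digits = str(abs(a)) + str(abs(b))
    total = len(digits)
    nonzero = sum(ch != "0" for ch in digits)
    return total * nonzero


if __name__ == "__main__":
    print(arithmetic_load(12, 34))          # 4 digits * 4 non-zero = 16
    print(arithmetic_load(1000, 2000))      # 8 digits * 2 non-zero = 16
    print(arithmetic_load(123456, 789012))  # 12 digits * 11 non-zero = 132
```

Under this reading, sparse operands such as 1000 and 2000 carry the same load as a dense two-digit product, and a dense six-digit product already exceeds the C > 100 region where the abstract reports accuracy often nearing zero.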

2604.18201 2026-04-21 cs.CV cs.LG

DiffuSAM: Diffusion Guided Zero-Shot Object Grounding for Remote Sensing Imagery

Geet Sethi, Panav Shah, Ashutosh Gandhe, Soumitra Darshan Nayak

Comments Accepted at ICLR 2026 ML4RS Workshop

2604.18199 2026-04-21 cs.CL

Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models

Tobias Grantner, Emanuel Sallinger, Martin Flechl

2604.18194 2026-04-21 cs.LG cs.CV

Attraction, Repulsion, and Friction: Introducing DMF, a Friction-Augmented Drifting Model

Arkadii Kazanskii, Tatiana Petrova, Konstantin Bagrianskii, Aleksandr Puzikov, Radu State

Comments 15 pages, 2 figures, 2 tables

2604.18190 2026-04-21 cs.LG cs.AI

Scalable Neighborhood-Based Multi-Agent Actor-Critic

Tim Goppelsroeder, Rasmus Jensen

2604.18187 2026-04-21 cs.SD cs.CL

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

Xiang He, Chenxing Li, Jinting Wang, Yan Rong, Tianxin Xie, Wenfu Wang, Li Liu, Dong Yu

Abstract

Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (CoT) fine-tuning, which is limited by training data quality, or on reinforcement learning (RL) with coarse rewards that do not directly evaluate reasoning quality. As a result, the generated reasoning chains often appear well-structured yet lack specific acoustic grounding. We propose Audio-DeepThinker, a framework built on two core ideas. First, we introduce a hybrid reasoning similarity reward that directly supervises the quality of generated reasoning chains by combining an LLM evaluator assessing logical path alignment, key step coverage, and analytical depth with an embedding similarity component enforcing semantic alignment with reference reasoning chains. Second, we propose a progressive two-stage curriculum that enables high-quality CoT reasoning to emerge through pure RL exploration, without any supervised reasoning fine-tuning, from an instruction-tuned model that possesses no prior chain-of-thought capability. Stage 1 trains on foundational audio QA with the hybrid reward to foster basic reasoning patterns, while Stage 2 shifts to acoustically challenging boundary cases with an LLM-only reward for greater reasoning diversity. Audio-DeepThinker achieves state-of-the-art results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%), winning 1st Place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track). Interpretability analyses further reveal that RL training primarily reshapes upper-layer MoE gating mechanisms and that reasoning tokens crystallize progressively in the upper transformer layers, offering mechanistic insights into how audio reasoning emerges through exploration.
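
The abstract says the hybrid reward combines an LLM evaluator score with an embedding similarity term but does not give the formula. One simple way such a combination could look is a weighted sum, sketched below. The weighted-sum form, the `weight` parameter, the use of cosine similarity, and the function names are all assumptions for illustration, not the authors' method.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_reward(llm_score, generated_emb, reference_emb, weight=0.5):
    """Blend an LLM judge score in [0, 1] with embedding similarity.

    Hypothetical form: weight * llm_score + (1 - weight) * cosine similarity.
    """
    similarity = cosine_similarity(generated_emb, reference_emb)
    return weight * llm_score + (1.0 - weight) * similarity


if __name__ == "__main__":
    # Toy 2-D embeddings stand in for real sentence embeddings.
    print(hybrid_reward(0.8, [1.0, 0.0], [1.0, 0.0]))  # identical embeddings
    print(hybrid_reward(0.8, [1.0, 0.0], [0.0, 1.0]))  # orthogonal embeddings
```

In this sketch, setting `weight=1.0` reduces the reward to the LLM score alone, loosely mirroring the LLM-only reward the abstract describes for Stage 2.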

2604.18184 2026-04-21 cs.CV

CanonSLR: Canonical-View Guided Multi-View Continuous Sign Language Recognition

Xu Wang, Shengeng Tang, Wan Jiang, Yaxiong Wang, Lechao Cheng, Richang Hong

2604.18176 2026-04-21 cs.AI quant-ph

QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

Songxin Qu, Tai-Ping Sun, Yun-Jie Wang, Huan-Yu Liu, Cheng Xue, Xiao-Fan Xu, Han Fang, Yang Yang, Yu-Chun Wu, Guo-Ping Guo, Zhao-Yun Chen

Comments 25 pages

2604.18169 2026-04-21 cs.CL cs.AI

Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation

Ran Zhang, Steffen Eger, Arda Tezcan, Wei Zhao, Simone Paolo Ponzetto, Lieve Macken

Comments Accepted to ACL 2026 Findings

2604.18168 2026-04-21 cs.CV

Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

Chenxi Zhao, Chen Zhu, Xiaokun Feng, Aiming Hao, Jiashu Zhu, Jiachen Lei, Jiahong Wu, Xiangxiang Chu, Jufeng Yang

Comments CVPR2026

2604.18167 2026-04-21 cs.CV

Embedding Arithmetic: A Lightweight, Tuning-Free Framework for Post-hoc Bias Mitigation in Text-to-Image Models

Venkatesh Thirugnana Sambandham, Torsten Schön

Comments A demo notebook with basic implementations can be found at https://github.com/cvims/EMBEDDING-ARITHMETIC

2604.18161 2026-04-21 cs.LG cs.AI cs.RO

Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?

Ku Onoda, Paavo Parmas, Manato Yaguchi, Yutaka Matsuo

Comments ICLR2026

2604.18159 2026-04-21 cs.CL

FreezeEmpath: Efficient Training for Empathetic Spoken Chatbots with Frozen LLMs

Yun Hong, Yan Zhou, Yang Feng

2604.18158 2026-04-21 cs.AI

State Transfer Reveals Reuse in Controlled Routing

Yanzhen Lu, Zhicheng Qian, Muchen Jiang, Xingyu Zhou

2604.18151 2026-04-21 cs.CV cs.CY

AI-based Waste Mapping for Addressing Climate-Exacerbated Flood Risk

Steffen Knoblauch, Levi Szamek, Iddy Chazua, Benedcto Adamu, Innocent Maholi, Alexander Zipf

2604.18148 2026-04-21 cs.CV cs.LG

Attention-ResUNet for Automated Fetal Head Segmentation

Ammar Bhilwarawala, Mainak Bandyopadhyay

Comments Accepted and Presented at ANTIC 2025, IIITM Gwalior (5th International Conference on Advanced Network Technologies and Intelligent Computing) on 23rd December 2025. Presented with the best paper award in Image Processing Track

2604.18135 2026-04-21 cs.CV cs.AI cs.LG

Soft Label Pruning and Quantization for Large-Scale Dataset Distillation

Xiao Lingao, Yang He

2604.18134 2026-04-21 cs.CV

Can LLM-Generated Text Empower Surgical Vision-Language Pre-training?

Chengan Che, Chao Wang, Jiayuan Huang, Xinyue Chen, Luis C. Garcia-Peraza-Herrera

Comments Accepted at CVPRW 2026 (AI4RWC Oral presentation)