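arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.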

2507.05201 2026-04-08 cs.AI cs.CL cs.CV

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Mercy Asiedu, Ines Mezerreg, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang

Comments Fix references

Details

English abstract

Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment face challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.

2506.22653 2026-04-08 cs.AI

URSA: The Universal Research and Scientific Agent

Michael Grosskopf, Nathan Debardeleben, Russell Bent, Rahul Somasundaram, Isaac Michaud, Arthur Lui, Alexius Wadell, Warren D. Graham, Golo A Wimmer, Sachin Shivakumar, Joan Vendrell Gallart, Harsha Nagarajan, Earl Lawrence

Comments 24 pages, 10 figures

2506.21872 2026-04-08 cs.LG cs.AI

A Survey of Continual Reinforcement Learning

Chaofan Pan, Xin Yang, Yanhua Li, Wei Wei, Tianrui Li, Bo An, Jiye Liang

2506.18027 2026-04-08 cs.CL

PDF Retrieval Augmented Question Answering

Thi Thu Uyen Hoang, Meenakshi Rajendran, Kun Zhang, Yuhan Wu, Viet Anh Nguyen

2506.17697 2026-04-08 cs.AI

Beyond Syntax: Action Semantics Learning for App Agents

Bohan Tang, Dezhao Luo, Jianheng Liu, Jingxuan Chen, Shaogang Gong, Jianye Hao, Jun Wang, Kun Shao

2506.15115 2026-04-08 cs.LG

Towards Reliable Forgetting: A Survey on Machine Unlearning Verification

Lulu Xue, Shengshan Hu, Wei Lu, Yan Shen, Dongxu Li, Peijin Guo, Ziqi Zhou, Minghui Li, Yanjun Zhang, Leo Yu Zhang

Comments Accepted by ACM Computing Surveys 2026

2506.05831 2026-04-08 cs.LG cs.AI

HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding

Yihan Xie, Sijing Li, Tianwei Lin, Zhuonan Wang, Chenglin Yang, Yu Zhong, Wenjie Yan, Wenqiao Zhang, Xiaogang Guo, Jun Xiao, Yueting Zhuang, Beng Chin Ooi

2506.03863 2026-04-08 cs.RO cs.LG

STAR: Learning Diverse Robot Skill Abstractions through Rotation-Augmented Vector Quantization

Hao Li, Qi Lv, Rui Shao, Xiang Deng, Yinchuan Li, Jianye Hao, Liqiang Nie

Comments Accepted by ICML 2025 Spotlight

2505.20858 2026-04-08 cs.CV

ProBA: Probabilistic Bundle Adjustment with the Bhattacharyya Coefficient

Jason Chui, Hector Andrade-Loarca, Daniel Cremers

Comments 14 pages, 5 figures, 3 tables

2505.16932 2026-04-08 cs.LG cs.AI cs.CL cs.NA math.NA math.OC

The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm

Noah Amsel, David Persson, Christopher Musco, Robert M. Gower

Comments 34 pages, 8 figures, 4 algorithms

2505.14226 2026-04-08 cs.CL cs.AI

Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs

Darpan Aswal, Siddharth D Jaiswal

2505.12863 2026-04-08 cs.SD cs.AI cs.CV eess.AS

Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio

Jongmin Jung, Dongmin Kim, Sihun Lee, Seola Cho, Hyungjoon Soh, Irmak Bukey, Chris Donahue, Dasaem Jeong

Comments Submitted to IEEE Transactions on Audio, Speech and Language Processing (TASLPRO)

Details

DOI: 10.1109/TASLPRO.2025.3648794
Journal ref: IEEE Transactions on Audio, Speech and Language Processing, vol. 34, pp. 1876-1891, 2026

English abstract

Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between each modality are established as core tasks of music information retrieval, such as automatic music transcription (audio-to-MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual translation tasks. In this paper, we propose a unified approach, where we train a general-purpose model on many translation tasks simultaneously. Two key factors make this unified approach viable: a new large-scale dataset and the tokenization of each modality. Firstly, we propose a new dataset that consists of more than 1,300 hours of paired audio-score image data collected from YouTube videos, which is an order of magnitude larger than any existing music modal translation datasets. Secondly, our unified tokenization framework discretizes score images, audio, MIDI, and MusicXML into a sequence of tokens, enabling a single encoder-decoder Transformer to tackle multiple cross-modal translation as one coherent sequence-to-sequence task. Experimental results confirm that our unified multitask model improves upon single-task baselines in several key areas, notably reducing the symbol error rate for optical music recognition from 24.58% to a state-of-the-art 13.67%, while similarly substantial improvements are observed across the other translation tasks. Notably, our approach achieves the first successful score-image-conditioned audio generation, marking a significant breakthrough in cross-modal music generation.

2505.04638 2026-04-08 cs.AI cs.CL cs.IR

Advancing AI Research Assistants with Expert-Involved Learning

Tianyu Liu, Simeng Han, Hanchen Wang, Xiao Luo, Pan Lu, Biqing Zhu, Yuge Wang, Keyi Li, Jiapeng Chen, Rihao Qu, Yufeng Liu, Xinyue Cui, Aviv Yaish, Yuhang Chen, Minsheng Hao, Chuhan Li, Kexing Li, Yinsheng Lu, Xinyu Wei, Qinzhe Xing, Antonia Panescu, Mengbo Wang, Vibha Annaswamy, Alicia Sanchez, Jack Cloherty, Arman Cohan, Hua Xu, Mark Gerstein, James Zou, Hongyu Zhao

Comments 43 pages, 7 figures

2505.00472 2026-04-08 cs.AI cs.DC cs.MA cs.NI

UserCentrix: An Agentic Memory-augmented AI Framework for Smart Spaces

Alaa Saleh, Sasu Tarkoma, Praveen Kumar Donta, Anders Lindgren, Naser Hossein Motlagh, Schahram Dustdar, Susanna Pirttikangas, Lauri Lovén

2504.14135 2026-04-08 cs.RO cs.CV cs.GR cs.LG

Unreal Robotics Lab: A High-Fidelity Robotics Simulator with Advanced Physics and Rendering

Jonathan Embley-Riches, Jianwei Liu, Simon Julier, Dimitrios Kanoulas

2504.10163 2026-04-08 cs.RO

Shoulder Range of Motion Rehabilitation Robot Incorporating Scapulohumeral Rhythm for Frozen Shoulder

Hyunbum Cho, Sungmoon Hur, Joowan Kim, Keewon Kim, Jaeheung Park

Comments Published in Journal of Bionic Engineering

Details

DOI: 10.1007/s42235-025-00768-8
Journal ref: J. Bionic Eng. 22, 2456-2473 (2025)

English abstract

This paper presents a novel rehabilitation robot designed to address the challenges of Passive Range of Motion (PROM) exercises for frozen shoulder patients by integrating advanced scapulohumeral rhythm stabilization. Frozen shoulder is characterized by limited glenohumeral motion and disrupted scapulohumeral rhythm, with therapist-assisted interventions being highly effective for restoring normal shoulder function. While existing robotic solutions replicate natural shoulder biomechanics, they lack the ability to stabilize compensatory movements, such as shoulder shrugging, which are critical for effective rehabilitation. Our proposed device features a 6 Degrees of Freedom (DoF) mechanism, including 5 DoF for shoulder motion and an innovative 1 DoF Joint press for scapular stabilization. The robot employs a personalized two-phase operation: recording normal shoulder movement patterns from the unaffected side and applying them to guide the affected side. Experimental results demonstrated the robot's ability to replicate recorded motion patterns with high precision, with Root Mean Square Error (RMSE) values consistently below 1 degree. In simulated frozen shoulder conditions, the robot effectively suppressed scapular elevation, delaying the onset of compensatory movements and guiding the affected shoulder to move more closely in alignment with normal shoulder motion, particularly during arm elevation movements such as abduction and flexion. These findings confirm the robot's potential as a rehabilitation tool capable of automating PROM exercises while correcting compensatory movements. The system provides a foundation for advanced, personalized rehabilitation for patients with frozen shoulders.

2504.08528 2026-04-08 cs.CL cs.SD eess.AS

On The Landscape of Spoken Language Models: A Comprehensive Survey

Siddhant Arora, Kai-Wei Chang, Chung-Ming Chien, Yifan Peng, Haibin Wu, Yossi Adi, Emmanuel Dupoux, Hung-Yi Lee, Karen Livescu, Shinji Watanabe

Comments Published in Transactions on Machine Learning Research

2503.21210 2026-04-08 cs.CV

Toward Generalizable Forgery Detection and Reasoning

Yueying Gao, Dongliang Chang, Bingyao Yu, Haotian Qin, Muxi Diao, Lei Chen, Kongming Liang, Zhanyu Ma

Comments Accepted to IEEE TIP

Details

DOI: 10.1109/TIP.2026.3673967

English abstract

Accurate and interpretable detection of AI-generated images is essential for mitigating risks associated with AI misuse. However, the substantial domain gap among generative models makes it challenging to develop a generalizable forgery detection model. Moreover, since every pixel in an AI-generated image is synthesized, traditional saliency-based forgery explanation methods are not well suited for this task. To address these challenges, we formulate detection and explanation as a unified Forgery Detection and Reasoning task (FDR-Task), leveraging Multi-Modal Large Language Models (MLLMs) to provide accurate detection through reliable reasoning over forgery attributes. To facilitate this task, we introduce the Multi-Modal Forgery Reasoning dataset (MMFR-Dataset), a large-scale dataset containing 120K images across 10 generative models, with 378K reasoning annotations on forgery attributes, enabling comprehensive evaluation of the FDR-Task. Furthermore, we propose FakeReasoning, a forgery detection and reasoning framework with three key components: 1) a dual-branch visual encoder that integrates CLIP and DINO to capture both high-level semantics and low-level artifacts; 2) a Forgery-Aware Feature Fusion Module that leverages DINO's attention maps and cross-attention mechanisms to guide MLLMs toward forgery-related clues; 3) a Classification Probability Mapper that couples language modeling and forgery detection, enhancing overall performance. Experiments across multiple generative models demonstrate that FakeReasoning not only achieves robust generalization but also outperforms state-of-the-art methods on both detection and reasoning tasks. The code is available at: https://github.com/PRIS-CV/FakeReasoning.

2503.03262 2026-04-08 cs.RO cs.AI cs.CV cs.LG

Trajectory Prediction for Autonomous Driving: Progress, Limitations, and Future Directions

Nadya Abdel Madjid, Abdulrahman Ahmad, Murad Mebrahtu, Yousef Babaa, Abdelmoamen Nasser, Sumbal Malik, Bilal Hassan, Naoufel Werghi, Jorge Dias, Majid Khonji

2502.17873 2026-04-08 cs.LG eess.SP

An Efficient Self-Supervised Framework for Long-Sequence EEG Modeling

Jiazhen Hong, Geoffrey Mackellar, Soheila Ghane

Comments 10 pages

Details

DOI: 10.1109/ICDMW69685.2025.00039
Journal ref: Proceedings of the ICDM 2025 Workshop on AI for Time Series

English abstract

Electroencephalogram (EEG) signals generally exhibit low signal-to-noise ratio (SNR) and high inter-subject variability, making generalization across subjects and domains challenging. Recent advances in deep learning, particularly self-supervised learning with Transformer-based architectures, have shown promise in EEG representation learning. However, their quadratic computational complexity increases memory usage and slows inference, making them inefficient for modeling long-range dependencies. Moreover, most existing approaches emphasize either explicit window segmentation of the temporal signal or spectral-only input embedding while neglecting raw temporal dynamics. In this paper, we propose EEGM2, a self-supervised framework that overcomes these limitations. EEGM2 adopts a U-shaped encoder-decoder architecture integrated with Mamba-2 to achieve linear computational complexity, thereby reducing memory usage and improving inference speed. Meanwhile, the selective information propagation mechanism of Mamba-2 enables the model to effectively capture and preserve long-range dependencies in raw EEG signals, where traditional RNN or CNN architectures often struggle. Moreover, EEGM2 employs a self-supervised pre-training objective that reconstructs raw EEG using a combined L1 and spectral (Fourier-based) loss, enhancing generalization by jointly preserving temporal dynamics and spectral characteristics. Experimental results demonstrate that EEGM2 achieves state-of-the-art performance in both short- and long-sequence modeling and classification. Further evaluations show that EEGM2 consistently outperforms existing models, demonstrating strong generalization across subjects and tasks, as well as transferability across domains. Overall, EEGM2 offers an efficient and scalable solution suitable for deployment on resource-constrained brain-computer interface (BCI) devices.
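
As a rough illustration of the "combined L1 and spectral (Fourier-based) loss" mentioned in the abstract above, the sketch below adds a time-domain L1 term to an L1 term on DFT magnitudes. This is not the EEGM2 implementation: the function names, the weighting `alpha`, the comparison of magnitude spectra, and the naive pure-Python DFT are all assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each DFT bin of a real sequence (naive O(n^2) DFT)."""
    n = len(signal)
    mags = []
    for k in range(n):
        acc = 0j
        for t, x in enumerate(signal):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def l1_loss(target, recon):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(target, recon)) / len(target)


def combined_loss(target, recon, alpha=0.5):
    """Hypothetical objective: time-domain L1 plus alpha * L1 on DFT magnitudes.

    alpha is an assumed weighting, not a value taken from the paper.
    """
    time_term = l1_loss(target, recon)
    spectral_term = l1_loss(dft_magnitudes(target), dft_magnitudes(recon))
    return time_term + alpha * spectral_term
```

A perfect reconstruction gives a loss of zero, while a reconstruction with the right shape but the wrong amplitude is penalized by both terms. In practice one would likely use an FFT from a numerical library rather than the naive DFT shown here.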

2502.10573 2026-04-08 cs.LG cs.AI

An Innovative Next Activity Prediction Using Process Entropy and Dynamic Attribute-Wise-Transformer in Predictive Business Process Monitoring

Hadi Zare, Mostafa Abbasi, Maryam Ahang, Homayoun Najjaran

2502.08660 2026-04-08 cs.CL

A Systematic Survey of Semantic Role Labeling in the Era of Pretrained Language Models

Huiyao Chen, Meishan Zhang, Jing Li, Lilja Øvrelid, Jan Hajič, Hao Fei, Min Zhang

Comments 54 pages, 9 figures, 9 tables

2502.06387 2026-04-08 cs.LG cs.GT econ.TH

How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators

Shang Liu, Hanzhao Wang, Zhongyao Ma, Xiaocheng Li

2501.14194 2026-04-08 cs.CV cs.AI

ENTER: Event Based Interpretable Reasoning for VideoQA

Hammad Ayyubi, Junzhang Liu, Ali Asgarov, Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Zhecan Wang, Chia-Wei Tang, Hani Alomari, Md. Atabuzzaman, Xudong Lin, Naveen Reddy Dyava, Shih-Fu Chang, Chris Thomas

2501.14183 2026-04-08 cs.LG cs.AI

VarDrop: Enhancing Training Efficiency by Reducing Variate Redundancy in Periodic Time Series Forecasting

Junhyeok Kang, Yooju Shin, Jae-Gil Lee

Comments Published in AAAI 2025

2501.09411 2026-04-08 cs.CV

Towards Robust and Realistic Human Pose Estimation via WiFi Signals

Yang Chen, Jingcai Guo

Comments 12 pages, 9 figures

2412.08079 2026-04-08 cs.LG cs.NA math.NA physics.ao-ph

Regional climate risk assessment from climate models using probabilistic machine learning

Zhong Yi Wan, Ignacio Lopez-Gomez, Robert Carver, Tapio Schneider, John Anderson, Fei Sha, Leonardo Zepeda-Núñez

Comments 125 pages

2412.00727 2026-04-08 cs.LG cs.CR cs.CV

Perturb and Recover: Fine-tuning for Effective Backdoor Removal from CLIP

Naman Deep Singh, Francesco Croce, Matthias Hein

Comments CVPR 2026 Findings

2411.08937 2026-04-08 cs.CV cs.LG

Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head

Penghui Yang, Chen-Chen Zong, Sheng-Jun Huang, Lei Feng, Bo An

Comments Accepted by KDD 2025

2411.08249 2026-04-08 cs.LG cs.AI

Retrieval Augmented Time Series Forecasting

Kutay Tire, Ege Onur Taga, Muhammed Emrullah Ildiz, Samet Oymak