arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1289
2508.21418 2026-02-16 cs.CV cs.LG

From slides to AI-ready maps: Standardized multi-layer tissue maps as metadata for artificial intelligence in digital pathology

Gernot Fiala, Markus Plass, Robert Harb, Peter Regitnig, Kristijan Skok, Wael Al Zoughbi, Carmen Zerner, Paul Torke, Michaela Kargl, Heimo Müller, Tomas Brazdil, Matej Gallo, Jaroslav Kubín, Roman Stoklasa, Rudolf Nenutil, Norman Zerbe, Andreas Holzinger, Petr Holub

Journal ref Artificial Intelligence in Medicine, Volume 174 (2026), 103368

详情
英文摘要

A Whole Slide Image (WSI) is a high-resolution digital image created by scanning an entire glass slide containing a biological specimen, such as tissue sections or cell samples, at multiple magnifications. These images are digitally viewable, analyzable, and shareable, and are widely used for Artificial Intelligence (AI) algorithm development. WSIs play an important role in pathology for disease diagnosis and oncology for cancer research, but are also applied in neurology, veterinary medicine, hematology, microbiology, dermatology, pharmacology, toxicology, immunology, and forensic science. When assembling cohorts for AI training or validation, it is essential to know the content of a WSI. However, no standard currently exists for this metadata, and such a selection has largely relied on manual inspection, which is not suitable for large collections with millions of objects. We propose a general framework to generate 2D index maps (tissue maps) that describe the morphological content of WSIs using common syntax and semantics to achieve interoperability between catalogs. The tissue maps are structured in three layers: source, tissue type, and pathological alterations. Each layer assigns WSI segments to specific classes, providing AI-ready metadata. We demonstrate the advantages of this standard by applying AI-based metadata extraction from WSIs to generate tissue maps and integrating them into a WSI archive. This integration enhances search capabilities within WSI archives, thereby facilitating the accelerated assembly of high-quality, balanced, and more targeted datasets for AI training, validation, and cancer research.

2508.14746 2026-02-16 cs.LG

MissionHD: Hyperdimensional Refinement of Distribution-Deficient Reasoning Graphs for Video Anomaly Detection

Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Nathaniel D. Bastian, Mohsen Imani

详情
英文摘要

LLM-generated reasoning graphs, referred to as mission-specific graphs (MSGs), are increasingly used for video anomaly detection (VAD) and recognition (VAR). However, they are typically treated as fixed despite being generic and distribution-deficient. Conventional graph structure refinement (GSR) methods are ill-suited to this setting, as they rely on learning structural distributions that are absent in LLM-generated graphs. We propose HDC-constrained Graph Structure Refinement (HDC-GSR), a new paradigm that directly optimizes a decodable, task-aligned graph representation in a single hyperdimensional space without distribution modeling. Leveraging Hyperdimensional Computing (HDC), our framework encodes graphs via binding and bundling operations, aligns the resulting graph code with downstream loss, and decodes edge contributions to refine the structure. We instantiate this approach as MissionHD for weakly supervised VAD/VAR and demonstrate consistent performance gains on benchmark datasets.

2508.08275 2026-02-16 cs.CL cs.AI

MLLM-CTBench: A Benchmark for Continual Instruction Tuning with Reasoning Process Diagnosis

Haiyun Guo, Zhiyan Hou, Yandu Sun, Jinghan He, Yu Chen, Yuzhe Zhou, Yuheng Jia, Jinqiao Wang, Tat-Seng Chua

Comments under review

详情
英文摘要

Continual instruction tuning(CIT) during the post-training phase is crucial for adapting multimodal large language models (MLLMs) to evolving real-world demands. However, the progress is hampered by the lack of benchmarks with rigorous, protocol-consistent evaluation. To bridge this gap, we introduce MLLM-CTBench, a comprehensive benchmark for CIT of MLLMs, covering seven challenging tasks across six diverse domains. MLLM-CTBench makes three key contributions. First, we establish a multidimensional evaluation framework that jointly assesses final-answer accuracy and process-level reasoning quality, where Chain-of-Thought (CoT) traces serve as an observable signal to diagnose catastrophic forgetting beyond answer-only evaluation. Second, we conduct a large-scale evaluation of continual learning methods by systematically assessing eight representative algorithms from four major families under a unified protocol across task orders, providing actionable insights for algorithm design. Third, we expand the scope from Supervised Fine-Tuning (SFT) to Reinforcement Fine-Tuning (RFT) in CIT. By investigating GRPO, an on-policy RL algorithm that stabilizes updates through explicit KL-divergence control to a prior policy, we aim to analyze how this mechanism affects cross-task knowledge retention. Our experiments yield several findings:(1) Process-level reasoning quality is often more resilient to catastrophic forgetting than final-answer accuracy, and forgetting is primarily driven by degradation in domain knowledge. (2) Model capability is critical factor influencing continual learning outcomes, with stronger baseline models exhibiting greater resistance to catastrophic forgetting. (3) On-policy RFT (GRPO), with its inherent KL control, achieves more stable cross-task retention than SFT. While removing KL control can amplify forgetting despite potential gains on new ones.

2508.07675 2026-02-16 cs.LG

Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation

Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong

Comments Accepted to INFOCOM 2026

详情
英文摘要

Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In this paper, we present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions. We formulate both offline optimization and online learning variants of the problem, and develop provably efficient algorithms with state-of-the-art guarantees. We also evaluate our framework on a synthetic dataset, showing that our proposed algorithms perform matching or superior performance compared with baselines.

2508.06095 2026-02-16 cs.RO

Incremental Language Understanding for Online Motion Planning of Robot Manipulators

Mitchell Abrams, Thies Oelerich, Christian Hartl-Nesic, Andreas Kugi, Matthias Scheutz

Comments 8 pages, 9 figures, accepted at IROS 2025

详情
英文摘要

Human-robot interaction requires robots to process language incrementally, adapting their actions in real-time based on evolving speech input. Existing approaches to language-guided robot motion planning typically assume fully specified instructions, resulting in inefficient stop-and-replan behavior when corrections or clarifications occur. In this paper, we introduce a novel reasoning-based incremental parser which integrates an online motion planning algorithm within the cognitive architecture. Our approach enables continuous adaptation to dynamic linguistic input, allowing robots to update motion plans without restarting execution. The incremental parser maintains multiple candidate parses, leveraging reasoning mechanisms to resolve ambiguities and revise interpretations when needed. By combining symbolic reasoning with online motion planning, our system achieves greater flexibility in handling speech corrections and dynamically changing constraints. We evaluate our framework in real-world human-robot interaction scenarios, demonstrating online adaptions of goal poses, constraints, or task objectives. Our results highlight the advantages of integrating incremental language understanding with real-time motion planning for natural and fluid human-robot collaboration. The experiments are demonstrated in the accompanying video at www.acin.tuwien.ac.at/42d5.

2508.05004 2026-02-16 cs.LG cs.AI cs.CL

R-Zero: Self-Evolving Reasoning LLM from Zero Data

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, Dong Yu

详情
英文摘要

Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting the Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.

2508.02872 2026-02-16 cs.CL cs.LG

Highlight & Summarize: RAG without the jailbreaks

Giovanni Cherubin, Andrew Paverd

详情
英文摘要

Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. When interacting with a chatbot, malicious users can input specially crafted prompts that cause the LLM to generate undesirable content or perform a different task from its intended purpose. Existing systems attempt to mitigate this by hardening the LLM's system prompt or using additional classifiers to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. We present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user's question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user's question and extracts ("highlights") relevant passages from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe and implement several possible instantiations of H&S and evaluate their responses in terms of correctness, relevance, and quality. For certain question-answering (QA) tasks, the responses produced by H&S are judged to be as good, if not better, than those of a standard RAG pipeline.

2508.01669 2026-02-16 cs.LG cs.DC

Bridging Generalization Gap of Heterogeneous Federated Clients Using Generative Models

Ziru Niu, Hai Dong, A. K. Qin

Comments Accepted by ICLR 2026 (poster)

Journal ref ICLR 2026

详情
英文摘要

Federated Learning (FL) is a privacy-preserving machine learning framework facilitating collaborative training across distributed clients. However, its performance is often compromised by data heterogeneity among participants, which can result in local models with limited generalization capability. Traditional model-homogeneous approaches address this issue primarily by regularizing local training procedures or dynamically adjusting client weights during aggregation. Nevertheless, these methods become unsuitable in scenarios involving clients with heterogeneous model architectures. In this paper, we propose a model-heterogeneous FL framework that enhances clients' generalization performance on unseen data without relying on parameter aggregation. Instead of model parameters, clients share feature distribution statistics (mean and covariance) with the server. Then each client trains a variational transposed convolutional neural network using Gaussian latent variables sampled from these distributions, and use it to generate synthetic data. By fine-tuning local models with the synthetic data, clients achieve significant improvement of generalization ability. Experimental results demonstrate that our approach not only attains higher generalization accuracy compared to existing model-heterogeneous FL frameworks, but also reduces communication costs and memory consumption.

2508.01504 2026-02-16 cs.LG

Instruction-based Time Series Editing

Jiaxing Qiu, Dongliang Guo, Brynne Sullivan, Teague R. Henry, Thomas Hartvigsen

Comments (KDD 26) Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1

详情
英文摘要

In time series editing, we aim to modify some properties of a given time series without altering others. For example, when analyzing a hospital patient's blood pressure, we may add a sudden early drop and observe how it impacts their future while preserving other conditions. Existing diffusion-based editors rely on rigid, predefined attribute vectors as conditions and produce all-or-nothing edits through sampling. This attribute- and sampling-based approach limits flexibility in condition format and lacks customizable control over editing strength. To overcome these limitations, we introduce Instruction-based Time Series Editing, where users specify intended edits using natural language. This allows users to express a wider range of edits in a more accessible format. We then introduce InstructTime, the first instruction-based time series editor. InstructTime takes in time series and instructions, embeds them into a shared multi-modal representation space, then decodes their embeddings to generate edited time series. By learning a structured multi-modal representation space, we can easily interpolate between embeddings to achieve varying degrees of edit. To handle local and global edits together, we propose multi-resolution encoders. In our experiments, we use synthetic and real datasets and find that InstructTime is a state-of-the-art time series editor: InstructTime achieves high-quality edits with controllable strength, can generalize to unseen instructions, and can be easily adapted to unseen conditions through few-shot learning.

2507.19561 2026-02-16 cs.LG

Harnessing intuitive local evolution rules for physical learning

Roie Ezraty, Menachem Stern, Shmuel M. Rubinstein

Comments 26 pages, 6 figures (with appendices). Submitted to Physical Review E

详情
英文摘要

Machine Learning, however popular and accessible, is computationally intensive and highly power-consuming, prompting interest in alternative physical implementations of learning tasks. We introduce a training scheme for physical systems that minimize power dissipation in which only boundary parameters (i.e. inputs and outputs) are externally controlled. Using this scheme, these Boundary-Enabled Adaptive State Tuning Systems (BEASTS) learn by exploiting local physical rules. Our scheme, BEASTAL (BEAST-Adaline), is the closest analog of the Adaline algorithm for such systems. We demonstrate this autonomous learning in silico for regression and classification tasks. Our approach advances previous physical learning schemes by using intuitive, local evolution rules without requiring large-scale memory or complex internal architectures. BEASTAL can perform any linear task, achieving best performance when the local evolution rule is non-linear.

2507.06971 2026-02-16 cs.CV cs.RO eess.IV

Hallucinating 360°: Panoramic Street-View Generation via Local Scenes Diffusion and Probabilistic Prompting

Fei Teng, Kai Luo, Sheng Wu, Siyu Li, Pujun Guo, Jiale Wei, Jiaming Zhang, Kunyu Peng, Kailun Yang

Comments Accepted to ICRA 2026. The source code will be publicly available at https://github.com/FeiT-FeiTeng/Percep360

详情
英文摘要

Panoramic perception holds significant potential for autonomous driving, enabling vehicles to acquire a comprehensive 360° surround view in a single shot. However, autonomous driving is a data-driven task. Complete panoramic data acquisition requires complex sampling systems and annotation pipelines, which are time-consuming and labor-intensive. Although existing street view generation models have demonstrated strong data regeneration capabilities, they can only learn from the fixed data distribution of existing datasets and cannot leverage stitched pinhole images as a supervisory signal. In this paper, we propose the first panoramic generation method Percep360 for autonomous driving. Percep360 enables coherent generation of panoramic data with control signals based on the stitched panoramic data. Percep360 focuses on two key aspects: coherence and controllability. Specifically, to overcome the inherent information loss caused by the pinhole sampling process, we propose the Local Scenes Diffusion Method (LSDM). LSDM reformulates the panorama generation as a spatially continuous diffusion process, bridging the gaps between different data distributions. Additionally, to achieve the controllable generation of panoramic images, we propose a Probabilistic Prompting Method (PPM). PPM dynamically selects the most relevant control cues, enabling controllable panoramic image generation. We evaluate the effectiveness of the generated images from three perspectives: image quality assessment (i.e., no-reference and with reference), controllability, and their utility in real-world Bird's Eye View (BEV) segmentation. Notably, the generated data consistently outperforms the original stitched images in no-reference quality metrics and enhances downstream perception models. The source code will be publicly available at https://github.com/FeiT-FeiTeng/Percep360.

2507.04103 2026-02-16 cs.AI cs.LG stat.ML

How to Train Your LLM Web Agent: A Statistical Diagnosis

Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia

详情
英文摘要

LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.

2507.03262 2026-02-16 cs.CV cs.AI

Investigating Redundancy in Multimodal Large Language Models with Multiple Vision Encoders

Yizhou Wang, Song Mao, Yang Chen, Yufan Shen, Yinqiao Yan, Pinlong Cai, Ding Wang, Guohang Yan, Zhi Yu, Xuming Hu, Botian Shi

Comments accepted by ICLR2026, project website: https://github.com/MaoSong2022/Encoder-Redundancy

详情
英文摘要

Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show this assumption often fails in practice. Through systematic encoder masking across representative multi encoder MLLMs, we find that performance typically degrades gracefully, and sometimes even improves, when selected encoders are masked, revealing pervasive encoder redundancy. To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder s marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility within a model. Using these tools, we observe: (i) strong specialization on tasks like OCR and Chart, where a single encoder can dominate with a CUR greater than 90 percent, (ii) high redundancy on general VQA and knowledge based tasks, where encoders are largely interchangeable, (iii) instances of detrimental encoders with negative CUR. Notably, masking specific encoders can yield up to 16 percent higher accuracy on a specific task category and 3.6 percent overall performance boost compared to the full model.Furthermore, single and dual encoder variants recover over 90 percent of baseline on most non OCR tasks with substantially lower training resources and inference latency. Our analysis challenges the more encoders are better heuristic in MLLMs and provides actionable diagnostics for developing more efficient and effective multimodal architectures.

2506.11906 2026-02-16 cs.RO

Palpation Alters Auditory Pain Expressions with Gender-Specific Variations in Robopatients

Chapa Sirithunge, Yue Xie, Saitarun Nadipineni, Fumiya Iida, Thilina Dulantha Lalitharatne

Comments 12 pages, 9 figures, journal

详情
英文摘要

Diagnostic errors remain a major cause of preventable mortality, particularly in resource limited settings. Medical training simulators, including robopatients, help reduce such errors by replicating patient responses during procedures such as abdominal palpation. However, generating realistic multimodal feedback especially auditory pain expressions remains challenging due to the complex, nonlinear relationship between applied palpation forces and perceived pain sounds. The high dimensionality and perceptual variability of pain vocalizations further limit conventional modeling approaches. We propose a novel experimental paradigm for adaptive pain expressivity in robopatients that dynamically generates auditory pain responses to palpation forces using human in the loop machine learning. Specifically, we employ Proximal Policy Optimization (PPO), a reinforcement learning algorithm suited for continuous control, to iteratively refine pain sound generation based on real time human evaluative feedback. The system initializes randomized mappings between force inputs and sound outputs, and the learning agent progressively adjusts them to align with human perceptual preferences. Results show that the framework adapts to individual palpation behaviors and subjective sound preferences while capturing a broad range of perceived pain intensities, from mild discomfort to acute distress. We also observe perceptual saturation at lower force ranges, with gender specific thresholds in pain sound perception. This work demonstrates the feasibility of human in the loop reinforcement learning for co-optimizing haptic input and auditory pain expression in medical simulators, highlighting the potential of adaptive and immersive platforms to enhance palpation training and reduce diagnostic errors.

2506.11827 2026-02-16 cs.RO cs.HC

Auditory-Tactile Congruence for Synthesis of Adaptive Pain Expressions in RoboPatients

Saitarun Nadipineni, Chapa Sirithunge, Yue Xie, Fumiya Iida, Thilina Dulantha Lalitharatne

Comments 20 pages, 8 figures, journal

详情
英文摘要

Misdiagnosis can result in delayed treatment and patient harm. Robotic patient simulators (robopatients) provide a controlled framework for training and evaluating clinicians in rare and complex cases. We investigate auditory tactile congruence in the synthesis of adaptive vocal pain expressions for robopatients. The system generates pain vocalizations in response to tactile stimuli applied during abdominal palpation. Haptic input is captured through an abdominal phantom and processed using an internal palpation-to-pain mapping model that drives acoustic output. To evaluate perceptual congruence between palpation force and synthesized pain expressions, we conducted a study comprising 7,680 trials with 20 participants. Participants rated perceived pain intensity based solely on auditory feedback. We analyzed the influence of acoustic parameters on agreement between applied force and perceived pain. Results indicate that amplitude and pitch significantly affect perceptual agreement, independent of pain sound category. Increased palpation force was associated with higher agreement ratings, consistent with psychophysical scaling effects. Among the tested acoustic features, pitch exerted a stronger influence than amplitude on perceived congruence. These findings demonstrate that modulation of pitch and amplitude is critical for achieving perceptually coherent auditory tactile mappings in robotic pain simulation. The proposed framework supports the development of high fidelity robotic patient simulators and provides a platform for studying multidimensional representations of pain in embodied robotic systems.

2506.04166 2026-02-16 cs.LG stat.CO stat.ML

N$^2$: A Unified Python Package and Test Bench for Nearest Neighbor-Based Matrix Completion

Caleb Chin, Aashish Khubchandani, Harshvardhan Maskara, Kyuseong Choi, Jacob Feitelberg, Albert Gong, Manit Paul, Tathagata Sadhukhan, Anish Agarwal, Raaz Dwivedi

Comments 21 pages, 6 figures

详情
英文摘要

Nearest neighbor (NN) methods have re-emerged as competitive tools for matrix completion, offering strong empirical performance and recent theoretical guarantees, including entry-wise error bounds, confidence intervals, and minimax optimality. Despite their simplicity, recent work has shown that NN approaches are robust to a range of missingness patterns and effective across diverse applications. This paper introduces N$^2$, a unified Python package and testbed that consolidates a broad class of NN-based methods through a modular, extensible interface. Built for both researchers and practitioners, N$^2$ supports rapid experimentation and benchmarking. Using this framework, we introduce a new NN variant that achieves state-of-the-art results in several settings. We also release a benchmark suite of real-world datasets, from healthcare and recommender systems to causal inference and LLM evaluation, designed to stress-test matrix completion methods beyond synthetic scenarios. Our experiments demonstrate that while classical methods excel on idealized data, NN-based techniques consistently outperform them in real-world settings.

2505.16814 2026-02-16 cs.CL

Does Synthetic Data Help Named Entity Recognition for Low-Resource Languages?

Gaurav Kamath, Sowmya Vajjala

Comments Accepted at AACL 2025. Camera-ready version

Journal ref https://aclanthology.org/2025.ijcnlp-short.15/

详情
英文摘要

Named Entity Recognition(NER) for low-resource languages aims to produce robust systems for languages where there is limited labeled training data available, and has been an area of increasing interest within NLP. Data augmentation for increasing the amount of low-resource labeled data is a common practice. In this paper, we explore the role of synthetic data in the context of multilingual, low-resource NER, considering 11 languages from diverse language families. Our results suggest that synthetic data does in fact hold promise for low-resource language NER, though we see significant variation between languages.

2505.16308 2026-02-16 cs.LG

Beyond All-to-All: Causal-Aligned Transformer with Dynamic Structure Learning for Multivariate Time Series Forecasting

Xingyu Zhang, Hanyun Du, Zeen Song, Siyu Zhao, Changwen Zheng, Wenwen Qiang

详情
英文摘要

Most existing multivariate time series forecasting methods adopt an all-to-all paradigm that feeds all variable histories into a unified model to predict their future values without distinguishing their individual roles. However, this undifferentiated paradigm makes it difficult to identify variable-specific causal influences and often entangles causally relevant information with spurious correlations. To address this limitation, we propose an all-to-one forecasting paradigm that predicts each target variable separately. Specifically, we first construct a Structural Causal Model from observational data and then, for each target variable, we partition the historical sequence into four subsegments according to the inferred causal structure: endogenous, direct causal, collider causal, and spurious correlation. Furthermore, we propose the Causal Decomposition Transformer (CDT), which integrates a dynamic causal adapter to learn causal structures initialized by the inferred graph, enabling correction of imperfect causal discovery during training. Furthermore, motivated by causal theory, we apply a projection-based output constraint to mitigate collider induced bias and improve robustness. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of the CDT.

2505.12988 2026-02-16 cs.LG

Optimal Formats for Weight Quantisation

Douglas Orr, Luka Ribar, Carlo Luschi

Comments 36 pages, 35 figures

详情
英文摘要

Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large and formats are often chosen empirically. In this paper, we propose a framework for systematic design and analysis of quantisation formats. By connecting the question of format design with the classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. We frame the problem as minimising the KL divergence between original and quantised model outputs under a model size constraint, which can be approximated by minimising the squared quantisation error, a well-studied problem where entropy-constrained quantisers with variable-length codes are optimal. We develop non-linear quantisation curves for block-scaled data across multiple distribution families and observe that these formats, along with sparse outlier formats, consistently outperform fixed-length formats, indicating that they also exploit variable-length encoding. Finally, by using the relationship between the Fisher information and KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model's layers, saving up to 0.25 bits per parameter when applied to large language models.

2505.02467 2026-02-16 cs.CV cs.AI

Timing Is Everything: Finding the Optimal Fusion Points in Multimodal Medical Imaging

Valerio Guarrasi, Klara Mogensen, Sara Tassinari, Sara Qvarlander, Paolo Soda

详情
英文摘要

Multimodal deep learning harnesses diverse imaging modalities, such as MRI sequences, to enhance diagnostic accuracy in medical imaging. A key challenge is determining the optimal timing for integrating these modalities-specifically, identifying the network layers where fusion modules should be inserted. Current approaches often rely on manual tuning or exhaustive search, which are computationally expensive without any guarantee of converging to optimal results. We propose a sequential forward search algorithm that incrementally activates and evaluates candidate fusion modules at different layers of a multimodal network. At each step, the algorithm retrains from previously learned weights and compares validation loss to identify the best-performing configuration. This process systematically reduces the search space, enabling efficient identification of the optimal fusion timing without exhaustively testing all possible module placements. The approach is validated on two multimodal MRI datasets, each addressing different classification tasks. Our algorithm consistently identified configurations that outperformed unimodal baselines, late fusion, and a brute-force ensemble of all potential fusion placements. These architectures demonstrated superior accuracy, F-score, and specificity while maintaining competitive or improved AUC values. Furthermore, the sequential nature of the search significantly reduced computational overhead, making the optimization process more practical. By systematically determining the optimal timing to fuse imaging modalities, our method advances multimodal deep learning for medical imaging. It provides an efficient and robust framework for fusion optimization, paving the way for improved clinical decision-making and more adaptable, scalable architectures in medical AI applications.

2505.01390 2026-02-16 cs.CV cs.AI cs.LG

Multimodal Doctor-in-the-Loop: A Clinically-Guided Explainable Framework for Predicting Pathological Response in Non-Small Cell Lung Cancer

Alice Natalina Caragliano, Claudia Tacconi, Carlo Greco, Lorenzo Nibid, Edy Ippolito, Michele Fiore, Giuseppe Perrone, Sara Ramella, Paolo Soda, Valerio Guarrasi

Comments arXiv admin note: substantial text overlap with arXiv:2502.17503

详情
英文摘要

This study proposes a novel approach combining Multimodal Deep Learning with intrinsic eXplainable Artificial Intelligence techniques to predict pathological response in non-small cell lung cancer patients undergoing neoadjuvant therapy. Due to the limitations of existing radiomics and unimodal deep learning approaches, we introduce an intermediate fusion strategy that integrates imaging and clinical data, enabling efficient interaction between data modalities. The proposed Multimodal Doctor-in-the-Loop method further enhances clinical relevance by embedding clinicians' domain knowledge directly into the training process, guiding the model's focus gradually from broader lung regions to specific lesions. Results demonstrate improved predictive accuracy and explainability, providing insights into optimal data integration strategies for clinical applications.

2505.01096 2026-02-16 cs.CV cs.CL

Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages

Marco Salmè, Rosa Sicilia, Paolo Soda, Valerio Guarrasi

详情
英文摘要

The integration of artificial intelligence in healthcare has opened new horizons for improving medical diagnostics and patient care. However, challenges persist in developing systems capable of generating accurate and contextually relevant radiology reports, particularly in low-resource languages. In this study, we present a comprehensive benchmark to evaluate the performance of instruction-tuned Vision-Language Models (VLMs) in the specialized task of radiology report generation across three low-resource languages: Italian, German, and Spanish. Employing the LLaVA architectural framework, we conducted a systematic evaluation of pre-trained models utilizing general datasets, domain-specific datasets, and low-resource language-specific datasets. In light of the unavailability of models that possess prior knowledge of both the medical domain and low-resource languages, we analyzed various adaptations to determine the most effective approach for these contexts. The results revealed that language-specific models substantially outperformed both general and domain-specific models in generating radiology reports, emphasizing the critical role of linguistic adaptation. Additionally, models fine-tuned with medical terminology exhibited enhanced performance across all languages compared to models with generic knowledge, highlighting the importance of domain-specific training. We also explored the influence of the temperature parameter on the coherence of report generation, providing insights for optimal model settings. Our findings highlight the importance of tailored language and domain-specific training for improving the quality and accuracy of radiological reports in multilingual settings. This research not only advances our understanding of VLMs adaptability in healthcare but also points to significant avenues for future investigations into model tuning and language-specific adaptations.

2505.01091 2026-02-16 cs.CV cs.AI

Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation

Daniele Molino, Francesco di Feola, Linlin Shen, Paolo Soda, Valerio Guarrasi

Comments arXiv admin note: substantial text overlap with arXiv:2501.04614

详情
英文摘要

Generative models have revolutionized Artificial Intelligence (AI), particularly in multimodal applications. However, adapting these models to the medical domain poses unique challenges due to the complexity of medical data and the stringent need for clinical accuracy. In this work, we introduce a framework specifically designed for multimodal medical data generation. By enabling the generation of multi-view chest X-rays and their associated clinical report, it bridges the gap between general-purpose vision-language models and the specialized requirements of healthcare. Leveraging the MIMIC-CXR dataset, the proposed framework shows superior performance in generating high-fidelity images and semantically coherent reports. Our quantitative evaluation reveals significant results in terms of FID and BLEU scores, showcasing the quality of the generated data. Notably, our framework achieves comparable or even superior performance compared to real data on downstream disease classification tasks, underlining its potential as a tool for medical research and diagnostics. This study highlights the importance of domain-specific adaptations in enhancing the relevance and utility of generative models for clinical applications, paving the way for future advancements in synthetic multimodal medical data generation.

2504.07282 2026-02-16 cs.CL

RAISE: Reinforced Adaptive Instruction Selection For Large Language Models

Qingsong Lv, Yangning Li, Zihua Lan, Zishan Xu, Jiwei Tang, Tingwei Lu, Yinghui Li, Wenhao Jiang, Hong-Gee Kim, Hai-Tao Zheng, Philip S. Yu

Comments Accepted by EMNLP 2025 findings

详情
英文摘要

In the instruction fine-tuning of large language models (LLMs), it is widely recognized that a few high-quality instructions are superior to a large number of low-quality instructions. At present, many instruction selection methods have been proposed, but most of these methods select instruction based on heuristic quality metrics, and only consider data selection before training. These designs lead to insufficient optimization of instruction fine-tuning, and fixed heuristic indicators are often difficult to optimize for specific tasks. Therefore, we design a dynamic, task-objective-driven instruction selection framework RAISE(Reinforced Adaptive Instruction SElection), which incorporates the entire instruction fine-tuning process into optimization, selecting instructions at each step based on the expected impact of each instruction on model performance improvement. Our approach is well interpretable and has strong task-specific optimization capabilities. By modeling dynamic instruction selection as a sequential decision-making process, we use RL to train our selection strategy. Extensive experiments and result analysis prove the superiority of our method compared with other instruction selection methods. Notably, RAISE achieves superior performance by updating only 1% of the training steps compared to full-data training, demonstrating its efficiency and effectiveness.

2504.01342 2026-02-16 cs.CL

Foundations and Evaluations in NLP

Jungyeul Park

Comments Mémoire d'habilitation à diriger des recherches, 2025-2026

详情
英文摘要

This memoir explores two fundamental aspects of Natural Language Processing (NLP): the creation of linguistic resources and the evaluation of NLP system performance. Over the past decade, my work has focused on developing a morpheme-based annotation scheme for the Korean language that captures linguistic properties from morphology to semantics. This approach has achieved state-of-the-art results in various NLP tasks, including part-of-speech tagging, dependency parsing, and named entity recognition. Additionally, this work provides a comprehensive analysis of segmentation granularity and its critical impact on NLP system performance. In parallel with linguistic resource development, I have proposed a novel evaluation framework, the jp-algorithm, which introduces an alignment-based method to address challenges in preprocessing tasks like tokenization and sentence boundary detection (SBD). Traditional evaluation methods assume identical tokenization and sentence lengths between gold standards and system outputs, limiting their applicability to real-world data. The jp-algorithm overcomes these limitations, enabling robust end-to-end evaluations across a variety of NLP tasks. It enhances accuracy and flexibility by incorporating linear-time alignment while preserving the complexity of traditional evaluation metrics. This memoir provides key insights into the processing of morphologically rich languages, such as Korean, while offering a generalizable framework for evaluating diverse end-to-end NLP systems. My contributions lay the foundation for future developments, with broader implications for multilingual resource development and system evaluation.

2503.24258 2026-02-16 cs.CV cs.AI

Beyond a Single Mode: GAN Ensembles for Diverse Medical Data Generation

Lorenzo Tronchin, Tommy Löfstedt, Paolo Soda, Valerio Guarrasi

详情
英文摘要

The advancement of generative AI, particularly in medical imaging, confronts the trilemma of ensuring high fidelity, diversity, and efficiency in synthetic data generation. While Generative Adversarial Networks (GANs) have shown promise across various applications, they still face challenges like mode collapse and insufficient coverage of real data distributions. This work explores the use of GAN ensembles to overcome these limitations, specifically in the context of medical imaging. By solving a multi-objective optimisation problem that balances fidelity and diversity, we propose a method for selecting an optimal ensemble of GANs tailored for medical data. The selected ensemble is capable of generating diverse synthetic medical images that are representative of true data distributions and computationally efficient. Each model in the ensemble brings a unique contribution, ensuring minimal redundancy. We conducted a comprehensive evaluation using three distinct medical datasets, testing 22 different GAN architectures with various loss functions and regularisation techniques. By sampling models at different training epochs, we crafted 110 unique configurations. The results highlight the capability of GAN ensembles to enhance the quality and utility of synthetic medical images, thereby improving the efficacy of downstream tasks such as diagnostic modelling.

2503.22809 2026-02-16 cs.LG cs.AI

Data-Driven Worker Activity Recognition and Efficiency Estimation in Manual Fruit Harvesting

Uddhav Bhattarai, Rajkishan Arikapudi, Steven A. Fennimore, Frank N Martin, Stavros G. Vougioukas

Comments Published in Elsevier Biosystems Engineering

Journal ref Biosystems Engineering, Vol. 261, 104326 (2026)

详情
英文摘要

Manual fruit harvesting is common in agriculture, but the amount of time pickers spend on non-productive activities can make it very inefficient. Accurately identifying picking vs. non-picking activity is crucial for estimating picker efficiency and optimising labour management and harvest processes. In this study, a practical system was developed to calculate the efficiency of pickers in commercial strawberry harvesting. Instrumented picking carts (iCarritos) were developed to record the harvested fruit weight, geolocation, and iCarrito movement in real time. The iCarritos were deployed during the commercial strawberry harvest season in Santa Maria, CA. The collected data was then used to train a CNN-LSTM-based deep neural network to classify a picker's activity into "Pick" and "NoPick" classes. Experimental evaluations showed that the CNN-LSTM model showed promising activity recognition performance with an F1 score of 0.97. The recognition results were then used to compute picker efficiency and the time required to fill a tray. Analysis of the season-long harvest data showed that the average picker efficiency was 75.07% with an estimation accuracy of 97.23%. Furthermore, the average tray fill time was 6.85 minutes with an estimation accuracy of 96.78%. When integrated into commercial harvesting, the proposed technology can aid growers in monitoring automated worker activity and optimising harvests to reduce non-productive time and enhance overall harvest efficiency.

2502.17503 2026-02-16 cs.LG cs.AI cs.CV eess.IV

Doctor-in-the-Loop: An Explainable, Multi-View Deep Learning Framework for Predicting Pathological Response in Non-Small Cell Lung Cancer

Alice Natalina Caragliano, Filippo Ruffini, Carlo Greco, Edy Ippolito, Michele Fiore, Claudia Tacconi, Lorenzo Nibid, Giuseppe Perrone, Sara Ramella, Paolo Soda, Valerio Guarrasi

详情
英文摘要

Non-small cell lung cancer (NSCLC) remains a major global health challenge, with high post-surgical recurrence rates underscoring the need for accurate pathological response predictions to guide personalized treatments. Although artificial intelligence models show promise in this domain, their clinical adoption is limited by the lack of medically grounded guidance during training, often resulting in non-explainable intrinsic predictions. To address this, we propose Doctor-in-the-Loop, a novel framework that integrates expert-driven domain knowledge with explainable artificial intelligence techniques, directing the model toward clinically relevant anatomical regions and improving both interpretability and trustworthiness. Our approach employs a gradual multi-view strategy, progressively refining the model's focus from broad contextual features to finer, lesion-specific details. By incorporating domain insights at every stage, we enhance predictive accuracy while ensuring that the model's decision-making process aligns more closely with clinical reasoning. Evaluated on a dataset of NSCLC patients, Doctor-in-the-Loop delivers promising predictive performance and provides transparent, justifiable outputs, representing a significant step toward clinically explainable artificial intelligence in oncology.

2502.09567 2026-02-16 cs.CL cs.AI

MorphNLI: A Stepwise Approach to Natural Language Inference Using Text Morphing

Vlad Andrei Negru, Robert Vacareanu, Camelia Lemnaru, Mihai Surdeanu, Rodica Potolea

Comments 16 pages, 11 figures, 8 tables. Accepted for NAACL 2025 Findings

详情
英文摘要

We introduce MorphNLI, a modular step-by-step approach to natural language inference (NLI). When classifying the premise-hypothesis pairs into {entailment, contradiction, neutral}, we use a language model to generate the necessary edits to incrementally transform (i.e., morph) the premise into the hypothesis. Then, using an off-the-shelf NLI model we track how the entailment progresses with these atomic changes, aggregating these intermediate labels into a final output. We demonstrate the advantages of our proposed method particularly in realistic cross-domain settings, where our method always outperforms strong baselines with improvements up to 12.6% (relative). Further, our proposed approach is explainable as the atomic edits can be used to understand the overall NLI label.

2501.04614 2026-02-16 cs.AI cs.LG

XGeM: A Multi-Prompt Foundation Model for Multimodal Medical Data Generation

Daniele Molino, Francesco Di Feola, Eliodoro Faiella, Deborah Fazzini, Domiziana Santucci, Linlin Shen, Valerio Guarrasi, Paolo Soda

详情
英文摘要

The adoption of Artificial Intelligence in medical imaging holds great promise, yet it remains hindered by challenges such as data scarcity, privacy concerns, and the need for robust multimodal integration. While recent advances in generative modeling have enabled high-quality synthetic data generation, existing approaches are often limited to unimodal, unidirectional synthesis and therefore lack the ability to jointly synthesize multiple modalities while preserving clinical consistency. To address this challenge, we introduce XGeM, a 6.77-billion-parameter multimodal generative model designed to support flexible, any-to-any synthesis between medical data modalities. XGeM constructs a shared latent space via contrastive learning and introduces a novel Multi-Prompt Training strategy, enabling conditioning on arbitrary subsets of input modalities. This design allows the model to adapt to heterogeneous clinical inputs and generate multiple outputs jointly, preserving both semantic and structural coherence. We extensively validate XGeM: first we benchmark it against five competitors on the MIMIC-CXR dataset, a state-of-the-art dataset for multi-view Chest X-ray and radiological report generation. Secondly, we perform a Visual Turing Test with expert radiologists to assess the realism and clinical relevance of the generated data, ensuring alignment with real-world scenarios. Finally, we show how XGeM can support key medical data challenges such as anonymization, class imbalance, and data scarcity, underscoring its utility as a foundation model for medical data synthesis. Project page is at https://cosbidev.github.io/XGeM/.