arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1527
2509.25035 2026-03-13 cs.CL cs.AI cs.LG

Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct

Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu, Weijian Luo, Wei Deng, Guang Lin

Comments [ICLR 2026] 38 pages, 7 figures, 13 tables

详情
英文摘要

Fast and high-quality language generation is the holy grail that people pursue in the age of AI. In this work, we introduce Discrete Diffusion Divergence Instruct (DiDi-Instruct), a training-based method that initializes from a pre-trained diffusion large language model (dLLM) and distills a few-step student for fast generation. The model distilled with DiDi-Instruct matches or surpasses its dLLM teacher and the GPT-2 baseline while providing up to 64$\times$ acceleration. The theoretical foundation of DiDi-Instruct is a novel framework based on integral KL-divergence minimization, which leads to a practical training algorithm. We further introduce grouped reward normalization, intermediate-state matching, and the reward-guided ancestral sampler to improve training stability, model coverage, and inference quality. On the OpenWebText benchmark, DiDi-Instruct achieves perplexity ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline. These gains incur a negligible entropy loss (around $1$%) and reduce additional training wall-clock time by more than $20\times$ compared to competing dLLM distillation methods. We further validate the robustness and effectiveness of DiDi-Instruct through extensive ablation studies, model scaling, downstream task evaluations, and unconditional protein sequence generation. In conclusion, DiDi-Instruct enables efficient and effective distillation for language generation in the blink of an eye.

2509.23097 2026-03-13 cs.CV

Streamline pathology foundation model by cross-magnification distillation

Ziyu Su, Abdul Rehman Akbar, Usama Sajjad, Anil V. Parwani, Muhammad Khalid Khan Niazi

详情
英文摘要

Foundation models (FM) have transformed computational pathology but remain computationally prohibitive for clinical deployment due to their massive parameter counts and high-magnification processing requirements. Here, we introduce XMAG, a lightweight FM developed through corss-magnification distillation that transfers knowledge from state-of-the-art 20x magnification teacher to an efficient 5x magnification student architecture. XMAG employs a compact backbone and operates entirely at 5x, requiring 11.3 times fewer patches per whole slide image (WSI) compared to existing approaches. Our Novel distillation framework incorporates dual-level knowledge transfer, aligning both global image representations and local spatial token mapping. We trained XMAG on 3.49 million images curated from publicly available datasets and evaluated performance across six clinically relevant histopathology analysis tasks spanning multiple cancer types. XMAG achieved diagnostic accuracy within 1% of substantially larger foundation models while delivering 30-fold processing acceleration, reaching 8.8 WSIs per minute processing speed. Our cross-institutional validation confirmed robust generalization. Further, we developed an end-to-end training strategy to further boost our model's performance to approach the larger FMs' performance. These results establish cross-magnification distillation as a viable approach for deploying FM capabilities in resource-constrained clinical environments, potentially enabling real-time pathology AI integration.

2509.22824 2026-03-13 cs.CL

Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning

Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen

详情
英文摘要

Reinforcement Learning (RL) has emerged as a popular training paradigm, particularly when paired with reasoning models. While effective, it primarily focuses on generating responses and lacks mechanisms to explicitly foster critique or reflection. Several recent studies, like Critique-Fine-Tuning (CFT) and Critique-Guided-Distillation (CGD) have shown the benefits of explicitly teaching LLMs how to critique. Motivated by them, we propose Critique Reinforcement Learning (CRL), where the model is tasked with generating a critique for a given (question, solution) pair. The reward is determined solely by whether the final judgment label $c \in \{\texttt{True}, \texttt{False}\}$ of the generated critique aligns with the ground-truth judgment $c^*$. Building on this point, we introduce Critique-Coder, which is trained on a hybrid of RL and CRL by substituting 20% of the standard RL data with CRL data. We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models. We show that Critique-Coder consistently outperforms RL-only baselines on all the evaluated benchmarks. Notably, our Critique-Coder-8B can reach over 60% on LiveCodeBench (v5), outperforming other reasoning models like DeepCoder-14B and GPT-o1. Beyond code generation, Critique-Coder also demonstrates enhanced general reasoning abilities, as evidenced by its better performance on logic reasoning tasks from the BBEH dataset. This indicates that the application of CRL on coding datasets enhances general reasoning and critique abilities, which are transferable across a broad range of tasks. Hence, we believe that CRL works as a great complement to standard RL for LLM reasoning.

2509.20681 2026-03-13 cs.RO cs.AI cs.CV

Efficient Construction of Implicit Surface Models From a Single Image for Motion Generation

Wei-Teng Chu, Tianyi Zhang, Matthew Johnson-Roberson, Weiming Zhi

Comments 9 pages, 6 figures, 2026 IEEE International Conference on Robotics and Automation (ICRA)

详情
英文摘要

Implicit representations have been widely applied in robotics for obstacle avoidance and path planning. In this paper, we explore the problem of constructing an implicit distance representation from a single image. Past methods for implicit surface reconstruction, such as NeuS and its variants generally require a large set of multi-view images as input, and require long training times. In this work, we propose Fast Image-to-Neural Surface (FINS), a lightweight framework that can reconstruct high-fidelity surfaces and SDF fields based on a single or a small set of images. FINS integrates a multi-resolution hash grid encoder with lightweight geometry and color heads, making the training via an approximate second-order optimizer highly efficient and capable of converging within a few seconds. Additionally, we achieve the construction of a neural surface requiring only a single RGB image, by leveraging pre-trained foundation models to estimate the geometry inherent in the image. Our experiments demonstrate that under the same conditions, our method outperforms state-of-the-art baselines in both convergence speed and accuracy on surface reconstruction and SDF field estimation. Moreover, we demonstrate the applicability of FINS for robot surface following tasks and show its scalability to a variety of benchmark datasets. Code is publicly available at https://github.com/waynechu1109/FINS.

2509.18395 2026-03-13 cs.CL

NormGenesis: Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery

Minki Hong, Jangho Choi, Jihie Kim

Comments 39 pages, 17 figures, EMNLP 2025 Main Conference, Senior Area Chair (SAC) Highlights Award

详情
英文摘要

Social norms govern culturally appropriate behavior in communication, enabling dialogue systems to produce responses that are not only coherent but also socially acceptable. We present NormGenesis, a multicultural framework for generating and annotating socially grounded dialogues across English, Chinese, and Korean. To model the dynamics of social interaction beyond static norm classification, we propose a novel dialogue type, Violation-to-Resolution (V2R), which models the progression of conversations following norm violations through recognition and socially appropriate repair. To improve pragmatic consistency in underrepresented languages, we implement an exemplar-based iterative refinement early in the dialogue synthesis process. This design introduces alignment with linguistic, emotional, and sociocultural expectations before full dialogue generation begins. Using this framework, we construct a dataset of 10,800 multi-turn dialogues annotated at the turn level for norm adherence, speaker intent, and emotional response. Human and LLM-based evaluations demonstrate that NormGenesis significantly outperforms existing datasets in refinement quality, dialogue naturalness, and generalization performance. We show that models trained on our V2R-augmented data exhibit improved pragmatic competence in ethically sensitive contexts. Our work establishes a new benchmark for culturally adaptive dialogue modeling and provides a scalable methodology for norm-aware generation across linguistically and culturally diverse languages.

2509.11125 2026-03-13 cs.RO cs.CV

ManiVID-3D: Generalizable View-Invariant Reinforcement Learning for Robotic Manipulation via Disentangled 3D Representations

Zheng Li, Pei Qu, Yufei Jia, Shihui Zhou, Haizhou Ge, Jiahang Cao, Jinni Zhou, Guyue Zhou, Jun Ma

Comments Accepted to RA-L. Project website: https://zheng-joe-lee.github.io/manivid3d/

详情
英文摘要

Deploying visual reinforcement learning (RL) policies in real-world manipulation is often hindered by camera viewpoint changes. A policy trained from a fixed front-facing camera may fail when the camera is shifted -- an unavoidable situation in real-world settings where sensor placement is hard to manage appropriately. Existing methods often rely on precise camera calibration or struggle with large perspective changes. To address these limitations, we propose ManiVID-3D, a novel 3D RL architecture designed for robotic manipulation, which learns view-invariant representations through self-supervised disentangled feature learning. The framework incorporates ViewNet, a lightweight yet effective module that automatically aligns point cloud observations from arbitrary viewpoints into a unified spatial coordinate system without the need for extrinsic calibration. Additionally, we develop an efficient GPU-accelerated batch rendering module capable of processing over 5000 frames per second, enabling large-scale training for 3D visual RL at unprecedented speeds. Extensive evaluation across 10 simulated and 5 real-world tasks demonstrates that our approach achieves a 40.6% higher success rate than state-of-the-art methods under viewpoint variations while using 80% fewer parameters. The system's robustness to severe perspective changes and strong sim-to-real performance highlight the effectiveness of learning geometrically consistent representations for scalable robotic manipulation in unstructured environments.

2509.06322 2026-03-13 cs.LG

Text-Trained LLMs Can Zero-Shot Extrapolate PDE Dynamics, Revealing a Three-Stage In-Context Learning Mechanism

Jiajun Bao, Nicolas Boullé, Toni J. B. Liu, Raphaël Sarfati, Christopher J. Earls

详情
英文摘要

Large language models (LLMs) have demonstrated emergent in-context learning (ICL) capabilities across a range of tasks, including zero-shot time-series forecasting. We show that text-trained foundation models can accurately extrapolate spatiotemporal dynamics from discretized partial differential equation (PDE) solutions without fine-tuning or natural language prompting. Predictive accuracy improves with longer temporal contexts but degrades at finer spatial discretizations. In multi-step rollouts, where the model recursively predicts future spatial states over multiple time steps, errors grow algebraically with the time horizon, reminiscent of global error accumulation in classical finite-difference solvers. We interpret these trends as in-context neural scaling laws, where prediction quality varies predictably with both context length and output length. To better understand how LLMs are able to internally process PDE solutions so as to accurately roll them out, we analyze token-level output distributions and uncover a consistent three-stage ICL progression: beginning with syntactic pattern imitation, transitioning through an exploratory high-entropy phase, and culminating in confident, numerically grounded predictions.

2509.00639 2026-03-13 cs.LG

Disentangling Slow and Fast Temporal Dynamics in Degradation Inference with Hierarchical Differential Models

Mengjie Zhao, Olga Fink

详情
英文摘要

Reliable inference of system degradation from sensor data is fundamental to condition monitoring and prognostics in mechanical and infrastructural systems. Since degradation is rarely directly observable and measurable, it must be inferred to enable accurate health assessment and decision-making. This is particularly challenging because operational and environmental variations dominate system behavior, while degradation introduces only subtle, long-term changes. Consequently, sensor data primarily reflect short-term operational variability, making it difficult to disentangle the underlying degradation process. Most unsupervised degradation inference methods learn nominal system behavior and use residuals as degradation proxies. However, residuals remain strongly entangled with operational history, yielding noisy and unreliable degradation estimates, particularly in infrastructural systems with dominant transient dynamics. Neural Ordinary Differential Equations (NODEs) offer a flexible framework for modeling latent dynamics, but in degraded systems, they suffer from numerical stiffness and degradation disentanglement remains difficult. To address these challenges, we propose a Hierarchical Controlled Differential Equation (H-CDE) framework that jointly models slow degradation dynamics and fast operational dynamics. H-CDE improves numerical efficiency through separate time integration of slow and fast components. Through a learnable path transformation mapping raw inputs to a latent degradation-relevant control path and a monotonicity-enforcing activation function that regularizes the inferred degradation dynamics, H-CDE enables effective disentangled degradation inference. Evaluations on mechanical and infrastructural systems demonstrate that H-CDE outperforms residual-based baselines, yielding more accurate, robust, and interpretable degradation inference in an unsupervised setting.

2508.16777 2026-03-13 cs.AI

Evaluation and LLM-Guided Learning of ICD Coding Rationales

Mingyang Li, Viktor Schlegel, Tingting Mu, Wuraola Oyewusi, Kai Kang, Goran Nenadic

详情
英文摘要

ICD coding is the process of mapping unstructured text from Electronic Health Records (EHRs) to standardised codes defined by the International Classification of Diseases (ICD) system. In order to promote trust and transparency, existing explorations on the explainability of ICD coding models primarily rely on attention-based rationales and qualitative assessments conducted by physicians, yet lack a systematic evaluation across diverse types of rationales using consistent criteria and high-quality rationale-annotated datasets specifically designed for the ICD coding task. Moreover, dedicated methods explicitly trained to generate plausible rationales remain scarce. In this work, we present evaluations of the explainability of rationales in ICD coding, focusing on two fundamental dimensions: faithfulness and plausibility -- in short how rationales influence model decisions and how convincing humans find them. For plausibility, we construct a novel, multi-granular rationale-annotated ICD coding dataset, based on the MIMIC-IV database and the updated ICD-10 coding system. We conduct a comprehensive evaluation across three types of ICD coding rationales: entity-level mentions automatically constructed via entity linking, LLM-generated rationales, and rationales based on attention scores of ICD coding models. Building upon the strong plausibility exhibited by LLM-generated rationales, we further leverage them as distant supervision signals to develop rationale learning methods. Additionally, by prompting the LLM with few-shot human-annotated examples from our dataset, we achieve notable improvements in the plausibility of rationale generation in both the teacher LLM and the student rationale learning models.

2508.10745 2026-03-13 cs.AI cs.CV cs.LG cs.MA cs.MM

Agentic Design Review System

Sayan Nag, K J Joseph, Koustava Goswami, Vlad I Morariu, Balaji Vasan Srinivasan

Comments Project Page: https://sayannag.github.io/AgenticDRS

详情
英文摘要

Evaluating graphic designs involves assessing it from multiple facets like alignment, composition, aesthetics and color choices. Evaluating designs in a holistic way involves aggregating feedback from individual expert reviewers. Towards this, we propose an Agentic Design Review System (AgenticDRS), where multiple agents collaboratively analyze a design, orchestrated by a meta-agent. A novel in-context exemplar selection approach based on graph matching and a unique prompt expansion method plays central role towards making each agent design aware. Towards evaluating this framework, we propose DRS-BENCH benchmark. Thorough experimental evaluation against state-of-the-art baselines adapted to the problem setup, backed-up with critical ablation experiments brings out the efficacy of Agentic-DRS in evaluating graphic designs and generating actionable feedback. We hope that this work will attract attention to this pragmatic, yet under-explored research direction.

2508.09487 2026-03-13 cs.CV

Semantic-Aware Reconstruction Error for Detecting AI-Generated Images

Ju Yeon Kang, Jaehong Park, Semin Kim, Ji Won Yoon, Nam Soo Kim

详情
英文摘要

Recently, AI-generated image detection has gained increasing attention, as the rapid advancement of image generation technologies has raised serious concerns about their potential misuse. While existing detection methods have achieved promising results, their performance often degrades significantly when facing fake images from unseen, out-of-distribution (OOD) generative models, since they primarily rely on model-specific artifacts and thus overfit to the models used for training. To address this limitation, we propose a novel representation, namely Semantic-Aware Reconstruction Error (SARE), that measures the semantic difference between an image and its caption-guided reconstruction. The key hypothesis behind SARE is that real images, whose captions often fail to fully capture their complex visual content, may undergo noticeable semantic shifts during the caption-guided reconstruction process. In contrast, fake images, which closely align with their captions, show minimal semantic changes. By quantifying these semantic shifts, SARE provides a robust and discriminative feature for detecting fake images across diverse generative models. Additionally, we introduce a fusion module that integrates SARE into the backbone detector via a cross-attention mechanism. Image features attend to semantic representations extracted from SARE, enabling the model to adaptively leverage semantic information. Experimental results demonstrate that the proposed method achieves strong generalization, outperforming existing baselines on benchmarks including GenImage and ForenSynths. We further validate the effectiveness of caption guidance through a detailed analysis of semantic shifts, confirming its ability to enhance detection robustness.

2508.04604 2026-03-13 cs.CL cs.AI cs.IR

TURA: Tool-Augmented Unified Retrieval Agent for AI Search

Zhejun Zhao, Yuchen Li, Alley Liu, Yuehu Dong, Xiaolong Wei, Lixue Zheng, Pingsheng Liu, Dongdong Shen, Long Xia, Jiashu Zhao, Dawei Yin

详情
英文摘要

The advent of Large Language Models (LLMs) is transforming search engines into conversational AI search products, primarily using Retrieval-Augmented Generation (RAG) on web corpora. However, this paradigm has significant industrial limitations. Traditional RAG approaches struggle with real-time needs and structured queries that require accessing dynamically generated content like ticket availability or inventory. Limited to indexing static pages, search engines cannot perform the interactive queries needed for such time-sensitive data. Academic research has focused on optimizing RAG for static content, overlooking complex intents and the need for dynamic sources like databases and real-time APIs. To bridge this gap, we introduce TURA (Tool-Augmented Unified Retrieval Agent for AI Search), a novel three-stage framework that combines RAG with agentic tool-use to access both static content and dynamic, real-time information. TURA has three key components: an Intent-Aware Retrieval module to decompose queries and retrieve information sources encapsulated as Model Context Protocol (MCP) Servers, a DAG-based Task Planner that models task dependencies as a Directed Acyclic Graph (DAG) for optimal parallel execution, and a lightweight Distilled Agent Executor for efficient tool calling. TURA is the first architecture to systematically bridge the gap between static RAG and dynamic information sources for a world-class AI search product. Serving tens of millions of users, it leverages an agentic framework to deliver robust, real-time answers while meeting the low-latency demands of a large-scale industrial system.

2507.16083 2026-03-13 cs.CL cs.AI cs.LG

Efficient Compositional Multi-tasking for On-device Large Language Models

Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli

Comments Accepted at EMNLP 2025 (main track, long paper)

详情
英文摘要

Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.

2507.14552 2026-03-13 cs.AI

Large Language Models Assisting Ontology Evaluation

Anna Sofia Lippolis, Mohammad Javad Saeedizade, Robin Keskisärkkä, Aldo Gangemi, Eva Blomqvist, Andrea Giovanni Nuzzolese

详情
英文摘要

Ontology evaluation through functional requirements, such as testing via competency question (CQ) verification, is a well-established yet costly, labour-intensive, and error-prone endeavour, even for ontology engineering experts. In this work, we introduce OE-Assist, a novel framework designed to assist ontology evaluation through automated and semi-automated CQ verification. By presenting and leveraging a dataset of 1,393 CQs paired with corresponding ontologies and ontology stories, our contributions present, to our knowledge, the first systematic investigation into large language model (LLM)-assisted ontology evaluation, and include: (i) evaluating the effectiveness of a LLM-based approach for automatically performing CQ verification against a manually created gold standard, and (ii) developing and assessing an LLM-powered framework to assist CQ verification with Protégé, by providing suggestions. We found that automated LLM-based evaluation with o1-preview and o3-mini perform at a similar level to the average user's performance.

2506.23543 2026-03-13 cs.CV

Pyramidal Patchification Flow for Visual Generation

Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu

Comments ICLR 2026

详情
英文摘要

Diffusion transformers (DiTs) adopt Patchify, mapping patch representations to token representations through linear projections, to adjust the number of tokens input to DiT blocks and thus the computation cost. Instead of a single patch size for all the timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: Large patch sizes are used for high noise timesteps and small patch sizes for low noise timesteps; Linear projections are learned for each patch size; and Unpatchify is accordingly modified. Unlike Pyramidal Flow, our approach operates over full latent representations other than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach through two training manners. Training from scratch achieves a $1.6\times$ ($2.0\times$) inference speed over SiT-B/2 for 2-level (3-level) pyramid patchification with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with small training time. The code and checkpoint are at https://github.com/fudan-generative-vision/PPFlow.

2506.21583 2026-03-13 cs.CL cs.AI

Hope Speech Detection in code-mixed Roman Urdu tweets: A Positive Turn in Natural Language Processing

Muhammad Ahmad, Muhammad Waqas, Ameer Hamza, Ildar Batyrshin, Grigori Sidorov

Comments We are withdrawing this preprint because it contains initial experimental results and an early version of the manuscript. We are currently improving the methodology, conducting additional experiments, and refining the analysis. A substantially revised version will be submitted in the future

详情
英文摘要

Hope is a positive emotional state involving the expectation of favorable future outcomes, while hope speech refers to communication that promotes optimism, resilience, and support, particularly in adverse contexts. Although hope speech detection has gained attention in Natural Language Processing (NLP), existing research mainly focuses on high-resource languages and standardized scripts, often overlooking informal and underrepresented forms such as Roman Urdu. To the best of our knowledge, this is the first study to address hope speech detection in code-mixed Roman Urdu by introducing a carefully annotated dataset, thereby filling a critical gap in inclusive NLP research for low-resource, informal language varieties. This study makes four key contributions: (1) it introduces the first multi-class annotated dataset for Roman Urdu hope speech, comprising Generalized Hope, Realistic Hope, Unrealistic Hope, and Not Hope categories; (2) it explores the psychological foundations of hope and analyzes its linguistic patterns in code-mixed Roman Urdu to inform dataset development; (3) it proposes a custom attention-based transformer model optimized for the syntactic and semantic variability of Roman Urdu, evaluated using 5-fold cross-validation; and (4) it verifies the statistical significance of performance gains using a t-test. The proposed model, XLM-R, achieves the best performance with a cross-validation score of 0.78, outperforming the baseline SVM (0.75) and BiLSTM (0.76), with gains of 4% and 2.63% respectively.

2506.16584 2026-03-13 cs.CL cs.AI cs.LG

Measuring Intent Comprehension in LLMs

Nadav Kunievsky, James A. Evans

详情
英文摘要

People judge interactions with large language models (LLMs) as successful when outputs match what they want, not what they type. Yet LLMs are trained to predict the next token solely from text input, not underlying intent. Because written language is an imperfect proxy for intent, and correlations between phrasing and desired outcomes can break down in training data, models that rely too heavily on surface cues may respond inconsistently to semantically equivalent prompts. This makes it essential to evaluate whether LLMs can reliably infer user intent-especially in high-stakes settings where robustness and generalization are critical. We introduce a formal framework for assessing intent comprehension in LLMs: whether a model demonstrates robust understanding of user intent by producing consistent outputs across semantically equivalent prompts while differentiating between prompts with distinct intents. Our evaluation approach is based on a variance decomposition of model responses into three components: variability due to user intent, user articulation, and model uncertainty. Models that understand what users want, and are not overly sensitive to textual cues, should attribute most output variance to intent differences, rather than articulation style. Applying this framework across diverse domains, we find that, within the five LLaMA and Gemma models we evaluate, larger models typically assign a greater share of variance to intent, indicating stronger comprehension of intent, although gains are uneven and often modest with increasing model size. These results motivate moving beyond accuracy-only benchmarks toward semantic diagnostics that directly assess whether models understand what users intend.

2506.13723 2026-03-13 cs.CV

SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models

Zhanxuan Hu, Qiyu Xu, Yu Duan, Yonghang Tai, Huafeng Li

详情
英文摘要

Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) \textit{Vision-Language Models} (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas \textit{Vision-only Foundation Models} (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose \textbf{SOTA} (\textit{Self-adaptive Optimal TrAnsport}), a \textit{training-free} ensemble framework that integrates the outputs of multiple foundation models~(VFMs or VLMs) by learning a self-adaptive transport plan. Notably, \textbf{SOTA} is prior-free and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of \textbf{SOTA}. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. The implementation code is available at: https://github.com/Afleve/self-adaptive-Optimal-Transport.

2505.20967 2026-03-13 cs.CV

RF4D:Neural Radar Fields for Novel View Synthesis in Outdoor Dynamic Scenes

Jiarui Zhang, Zhihao Li, Chong Wang, Bihan Wen

详情
英文摘要

Neural fields (NFs) have achieved remarkable success in scene reconstruction and novel view synthesis. However, existing NF approaches that rely on RGB or LiDAR inputs often struggle under adverse weather conditions, limiting their robustness in real-world outdoor environments such as autonomous driving. In contrast, millimeter-wave radar is inherently resilient to environmental variations, yet its integration with NFs remains largely underexplored. Moreover, outdoor driving scenes frequently involve dynamic objects, making spatiotemporal modeling crucial for temporally consistent novel view synthesis. To address these challenges, we present RF4D, a radar-based neural field framework tailored for novel view synthesis in outdoor dynamic scenes. RF4D explicitly incorporates temporal information into its representation, enabling more accurate modeling of object motion. A dedicated scene flow module further predicts temporal offsets between adjacent frames, enforcing temporal occupancy coherence during dynamic scene reconstruction. Moreover, we propose a radar-specific power rendering formulation grounded in radar sensing physics, improving both synthesis accuracy and interpretability. Extensive experiments on public radar datasets demonstrate that RF4D substantially outperforms existing methods in radar measurement synthesis and occupancy estimation accuracy, with particularly strong gains in dynamic outdoor environments.

2505.19459 2026-03-13 cs.LG cs.AI

Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation

Kaichao Jiang, He Wang, Xiaoshuai Hao, Xiulong Yang, Ajian Liu, Qi Chu, Yunfeng Diao, Richang Hong

Comments accepted by CVPR2026

详情
英文摘要

Joint Energy-based Models (JEMs) are well known for their ability to unify classification and generation within a single framework. Despite their promising generative and discriminative performance, their robustness remains far inferior to adversarial training (AT), which, conversely, achieves strong robustness but sacrifices clean accuracy and lacks generative ability. This inherent trilemma-balancing classification accuracy, robustness, and generative capability-raises a fundamental question: Can a single model achieve all three simultaneously? To answer this, we conduct a systematic energy landscape analysis of clean, adversarial, and generated samples across various JEM and AT variants. We observe that AT reduces the energy gap between clean and adversarial samples, while JEMs narrow the gap between clean and synthetic ones. This observation suggests a key insight: if the energy distributions of all three data types can be aligned, we might bridge their performance disparities. Building on this idea, we propose Energy-based Joint Distribution Adversarial Training (EB-JDAT), a unified generative-discriminative-robust framework that maximizes the joint probability of clean and adversarial distribution. EB-JDAT introduces a novel min-max energy optimization to explicitly aligning energies across clean, adversarial, and generated samples. Extensive experiments on CIFAR-10, CIFAR-100, and ImageNet subsets demonstrate that EB-JDAT achieves state-of-the-art robustness while maintaining near-original accuracy and competitive generation quality of JEMs, effectively achieving a new trade-off frontier between accuracy, robustness, and generation. The code is released at https://github.com/yujkc/EB-JDAT.

2505.19240 2026-03-13 cs.CL cs.AI cs.LG

LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models

Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen, Steffen Eger

Comments ACM Computing Surveys (CSUR); 56 pages

详情
英文摘要

Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations. In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to early 2025 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification, validated against expert labels, and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that the share of LLM-related papers increases over fivefold in ACL and nearly eightfold in arXiv between 2022 and 2025. Since 2022, LLLMs research grows even faster, reaching over 30% of LLM papers by 2025. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward security risks, alignment, hallucinations, knowledge editing, and multimodality. We offer a quantitative view of trends in LLLMs research and release a dataset of annotated abstracts and a validated methodology, available at: https://github.com/a-kostikova/LLLMs-Survey.

2505.18607 2026-03-13 cs.AI

From Entity-Centric to Goal-Oriented Graphs: Enhancing LLM Knowledge Retrieval in Minecraft

Jonathan Leung, Yongjie Wang, Zhiqi Shen

Comments Accepted at Knowledge-Based Systems

详情
英文摘要

Large Language Models (LLMs) demonstrate impressive general capabilities but often struggle with step-by-step procedural reasoning, a critical challenge in complex interactive environments. While retrieval-augmented methods like GraphRAG attempt to bridge this gap, their fragmented entity-relation graphs hinder the construction of coherent, multi-step plans. In this paper, we propose a novel framework based on Goal-Oriented Graphs (GoGs), where each node represents a goal and edges encode logical dependencies between them. This structure enables the explicit retrieval of causal reasoning paths by identifying a high-level goal and recursively retrieving its prerequisites, forming a coherent chain to guide the LLM. Through extensive experiments on the Minecraft testbed, a domain that demands robust multi-step planning and provides rich procedural knowledge, we demonstrate that GoG substantially improves procedural reasoning and significantly outperforms GraphRAG and other state-of-the-art baselines.

2505.18547 2026-03-13 cs.AI cs.CV

Diffusion Blend: Inference-Time Multi-Preference Alignment for Diffusion Models

Min Cheng, Fatemeh Doudi, Dileep Kalathil, Mohammad Ghavamzadeh, Panganamala R. Kumar

Comments Accepted at ICLR 2026

详情
英文摘要

Reinforcement learning (RL) algorithms have been used recently to align diffusion models with downstream objectives such as aesthetic quality and text-image consistency by fine-tuning them to maximize a single reward function under a fixed KL regularization. However, this approach is inherently restrictive in practice, where alignment must balance multiple, often conflicting objectives. Moreover, user preferences vary across prompts, individuals, and deployment contexts, with varying tolerances for deviation from a pre-trained base model. We address the problem of inference-time multi-preference alignment: given a set of basis reward functions and a reference KL regularization strength, can we design a fine-tuning procedure so that, at inference time, it can generate images aligned with any user-specified linear combination of rewards and regularization, without requiring additional fine-tuning? We propose Diffusion Blend, a novel approach to solve inference-time multi-preference alignment by blending backward diffusion processes associated with fine-tuned models, and we instantiate this approach with two algorithms: DB-MPA for multi-reward alignment and DB-KLA for KL regularization control. Extensive experiments show that Diffusion Blend algorithms consistently outperform relevant baselines and closely match or exceed the performance of individually fine-tuned models, enabling efficient, user-driven alignment at inference-time. The code is available at https://github.com/bluewoods127/DB-2025.

2505.17778 2026-03-13 cs.CV

TextFlux: An OCR-Free DiT Model for High-Fidelity Multilingual Scene Text Synthesis

Yu Xie, Jielei Zhang, Pengyu Chen, Weihang Wang, Longwen Gao, Peiyi Li, Qian Qiao, Zhouhui Lian

Comments Accepted to Eurographics 2026 (Computer Graphics Forum)

详情
英文摘要

Diffusion-based scene text synthesis has progressed rapidly, yet existing methods commonly rely on additional visual conditioning modules and require large-scale annotated data to support multilingual generation. In this work, we revisit the necessity of complex auxiliary modules and further explore an approach that simultaneously ensures glyph accuracy and achieves high-fidelity scene integration, by leveraging diffusion models' inherent capabilities for contextual reasoning. To this end, we introduce TextFlux, a DiT-based framework that enables multilingual scene text synthesis. The advantages of TextFlux can be summarized as follows: (1) OCR-free model architecture. TextFlux eliminates the need for OCR encoders (additional visual conditioning modules) that are specifically used to extract visual text-related features. (2) Strong multilingual scalability. TextFlux is effective in low-resource multilingual settings, and achieves strong performance in newly added languages with fewer than 1,000 samples. (3) Streamlined training setup. TextFlux is trained with only 1% of the training data required by competing methods. (4) Controllable multi-line text generation. TextFlux offers flexible multi-line synthesis with precise line-level control, outperforming methods restricted to single-line or rigid layouts. Extensive experiments and visualizations demonstrate that TextFlux outperforms previous methods in both qualitative and quantitative evaluations.

2505.16211 2026-03-13 cs.SD cs.AI cs.CL eess.AS

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Lionel Z. Wang, Shun Zhang, Xingjian Du, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Gelei Deng, Haoyang Li, Yiming Li, Xiaobin Zhuang, Tianlong Chen, Qingsong Wen, Tianwei Zhang, Yang Liu, Haibo Hu, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, Wenyuan Xu, XiaoFeng Wang, Wei Dong, Xinfeng Li

Comments Accepted to ICLR 2026

详情
英文摘要

The rapid development and widespread adoption of Audio Large Language Models (ALLMs) demand rigorous evaluation of their trustworthiness. However, existing evaluation frameworks are primarily designed for text and fail to capture vulnerabilities introduced by the acoustic properties of audio. We find that significant trustworthiness risks in ALLMs arise from non-semantic acoustic cues, such as timbre, accent, and background noise, which can be exploited to manipulate model behavior. To address this gap, we propose AudioTrust, the first large-scale and systematic framework for evaluating ALLM trustworthiness under audio-specific risks. AudioTrust covers six key dimensions: fairness, hallucination, safety, privacy, robustness, and authenticition. It includes 26 sub-tasks and a curated dataset of more than 4,420 audio samples collected from real-world scenarios, including daily conversations, emergency calls, and voice assistant interactions, and is specifically designed to probe trustworthiness across multiple dimensions. Our comprehensive evaluation spans 18 experimental settings and uses human-validated automated pipelines to enable objective and scalable assessment of model outputs. Experimental results on 14 state-of-the-art open-source and closed-source ALLMs reveal important limitations and failure boundaries under diverse high-risk audio scenarios, providing critical insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are publicly available at https://github.com/JusperLee/AudioTrust.

2505.13903 2026-03-13 cs.CL cs.AI

Let's Verify Math Questions Step by Step

Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang, Zimo Meng, Zhengyang Zhao, Bohan Zeng, Zhengzhou Zhu, Bin Cui, Wentao Zhang

详情
英文摘要

Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math question-answer (QA) data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the correctness of the questions themselves. In this work, we present ValiMath, a benchmark consisting of 2147 human-verified mathematical questions covering a wide range of domains such as arithmetic, algebra, and geometry, which are synthesized and curated from the NuminaMath dataset. Each question is annotated with its logical structure, domain coverage, and question correctness, enabling fine-grained evaluation of question quality. ValiMath serves as a high-quality gold-standard test set for validating mathematical questions in LLM training corpora. Building upon this benchmark, we further propose MathQ-Verify, a pipeline that performs fine-grained parsing of mathematical questions into atomic assumptions and conclusions, and evaluates their semantic soundness through consistency checks. This pipeline achieves high precision in detecting flawed questions and provides a reliable foundation for cleaning noisy mathematical datasets. Experiments show that MathQ-Verify achieves state-of-the-art performance across multiple benchmarks, improving the F1 score by up to 25 percentage points over the direct verification baseline. MathQ-Verify offers a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding unnecessary computation on invalid questions. Our code and data are available at the repository https://github.com/OpenDCAI/MathQ-Verify.

2505.05806 2026-03-13 cs.CV

Image Segmentation via Variational Model Based Tailored UNet: A Deep Variational Framework

Kaili Qi, Wenli Yang, Ye Li, Zhongyi Huang

详情
英文摘要

Traditional image segmentation methods, such as variational models based on partial differential equations (PDEs), offer strong mathematical interpretability and precise boundary modeling, but often suffer from sensitivity to parameter settings and high computational costs. In contrast, deep learning models such as UNet, which are relatively lightweight in parameters, excel in automatic feature extraction but lack theoretical interpretability and require extensive labeled data. To harness the complementary strengths of both paradigms, we propose Variational Model Based Tailored UNet (VM_TUNet), a novel hybrid framework that integrates the fourth-order modified Cahn-Hilliard equation with the deep learning backbone of UNet, which combines the interpretability and edge-preserving properties of variational methods with the adaptive feature learning of neural networks. Specifically, a data-driven operator is introduced to replace manual parameter tuning, and we incorporate the tailored finite point method (TFPM) to enforce high-precision boundary preservation. Experimental results on benchmark datasets demonstrate that VM_TUNet achieves superior segmentation performance compared to existing approaches, especially for fine boundary delineation.

2504.16916 2026-03-13 cs.RO cs.SY eess.SY

Zero-shot Sim-to-Real Transfer for Reinforcement Learning-based Visual Servoing of Soft Continuum Arms

Hsin-Jung Yang, Mahsa Khosravi, Benjamin Walt, Girish Krishnan, Soumik Sarkar

Comments The 7th Annual Learning for Dynamics & Control Conference (L4DC) 2025

详情
Journal ref
PMLR 283:1300-1312, 2025
英文摘要

Soft continuum arms (SCAs) soft and deformable nature presents challenges in modeling and control due to their infinite degrees of freedom and non-linear behavior. This work introduces a reinforcement learning (RL)-based framework for visual servoing tasks on SCAs with zero-shot sim-to-real transfer capabilities, demonstrated on a single section pneumatic manipulator capable of bending and twisting. The framework decouples kinematics from mechanical properties using an RL kinematic controller for motion planning and a local controller for actuation refinement, leveraging minimal sensing with visual feedback. Trained entirely in simulation, the RL controller achieved a 99.8% success rate. When deployed on hardware, it achieved a 67% success rate in zero-shot sim-to-real transfer, demonstrating robustness and adaptability. This approach offers a scalable solution for SCAs in 3D visual servoing, with potential for further refinement and expanded applications.

2504.09940 2026-03-13 cs.LG

TianQuan-S2S: A Subseasonal-to-Seasonal Global Weather Model via Incorporate Climatology State

Guowen Li, Xintong Liu, Yang Liu, Mengxuan Chen, Shilei Cao, Xuehe Wang, Juepeng Zheng, Jinxiao Zhang, Haoyuan Liang, Lixian Zhang, Jiuke Wang, Meng Jin, Hong Cheng, Haohuan Fu

详情
英文摘要

Accurate Subseasonal-to-Seasonal (S2S) forecasting is vital for decision-making in agriculture, energy production, and emergency management. However, it remains a challenging and underexplored problem due to the chaotic nature of the weather system. Recent data-driven studies have shown promising results, but their performance is limited by the inadequate incorporation of climate states and a model tendency to degrade, progressively losing fine-scale details and yielding over-smoothed forecasts. To overcome these limitations, we propose TianQuan-S2S, a global S2S forecasting model that integrates initial weather states with climatological means via incorporating climatology into patch embedding and enhancing variability capture through an uncertainty-augmented Transformer. Extensive experiments on the Earth Reanalysis 5 (ERA5) reanalysis dataset demonstrate that our model yields a significant improvement in both deterministic and ensemble forecasting over the climatology mean, traditional numerical methods, and data-driven models. Ablation studies empirically show the effectiveness of our model designs. Remarkably, our model outperforms skillful numerical ECMWF-S2S and advanced data-driven Fuxi-S2S in key meteorological variables. The code implementation can be found in https://github.com/zhangminglang42/TianQuan.

2504.05662 2026-03-13 cs.CV

InvAD: Inversion-based Reconstruction-Free Anomaly Detection with Diffusion Models

Shunsuke Sakai, Xiangteng He, Chunzhi Gu, Leonid Sigal, Tatsuhito Hasegawa

Comments Accepted to CVPR2026. Project page: https://invad-project.com

详情
英文摘要

Despite the remarkable success, recent reconstruction-based anomaly detection (AD) methods via diffusion modeling still involve fine-grained noise-strength tuning and computationally expensive multi-step denoising, leading to a fundamental tension between fidelity and efficiency. In this paper, we propose InvAD, a novel inversion-based anomaly detection approach ("detection via noising in latent space") that circumvents explicit reconstruction. Importantly, we contend that the limitations in prior reconstruction-based methods originate from the prevailing "detection via denoising in RGB space" paradigm. To address this, we model AD under a reconstruction-free formulation, which directly infers the final latent variable corresponding to the input image via DDIM inversion, and then measures the deviation based on the known prior distribution for anomaly scoring. Specifically, in approximating the original probability flow ODE using the Euler method, we enforce only a few inversion steps to noise the clean image to pursue inference efficiency. As the added noise is adaptively derived with the learned diffusion model, the original features for the clean testing image can still be leveraged to yield high detection accuracy. We perform extensive experiments and detailed analyses across four widely used industrial and medical AD benchmarks under the unsupervised unified setting to demonstrate the effectiveness of our model, achieving state-of-the-art AD performance and approximately 2x inference-time speedup without diffusion distillation.