arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1386
专题追踪
2505.23599 2026-02-12 cs.LG math.RT math.ST stat.ML stat.TH

On Transferring Transferability: Towards a Theory for Size Generalization

Eitan Levin, Yuxin Ma, Mateo Díaz, Soledad Villar

Comments 75 pages, 10 figures, closest to version to be published in NeurIPS

详情
英文摘要

Many modern learning tasks require models that can take inputs of varying sizes. Consequently, dimension-independent architectures have been proposed for domains where the inputs are graphs, sets, and point clouds. Recent work on graph neural networks has explored whether a model trained on low-dimensional data can transfer its performance to higher-dimensional inputs. We extend this body of work by introducing a general framework for transferability across dimensions. We show that transferability corresponds precisely to continuity in a limit space formed by identifying small problem instances with equivalent large ones. This identification is driven by the data and the learning task. We instantiate our framework on existing architectures, and implement the necessary changes to ensure their transferability. Finally, we provide design principles for designing new transferable models. Numerical experiments support our findings.

2505.17512 2026-02-12 cs.AI cs.CL

Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark

Shuhang Xu, Weijian Deng, Yixuan Zhou, Fangwei Zhong

Comments 8 pages

详情
英文摘要

Concepts serve as fundamental abstractions that support human reasoning and categorization. However, it remains unclear whether large language models truly capture such conceptual structures or primarily rely on surface-level pattern memorization. Existing benchmarks are largely static and fact oriented, which limits their ability to probe fine-grained semantic understanding and makes them vulnerable to data leakage and overfitting. To address this limitation, we introduce CK-Arena, a dynamic benchmark for conceptual knowledge evaluation based on a multi agent social deduction game, namely the Undercover game. In this setting, LLM based agents are assigned subtly different concept words and must describe, distinguish, and infer conceptual properties from others' statements. Model performance is evaluated through both game level outcomes and the semantic quality of generated descriptions. Furthermore, CK-Arena leverages the interaction process to automatically construct high quality question answering data for fine grained diagnostic analysis. Experimental results show that conceptual understanding varies substantially across models and categories, and is not strictly aligned with overall model capability. The data and code are available at the project homepage: https://ck-arena.site.

2505.16204 2026-02-12 cs.LG math.ST stat.ML stat.TH

Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks

Ichiro Hashimoto

Comments 41 pages, Accepted at International Conference on Learning Representations 2026 (ICLR 2026)

详情
英文摘要

In this paper, we provide sufficient conditions of benign overfitting of fixed width leaky ReLU two-layer neural network classifiers trained on mixture data via gradient descent. Our results are derived by establishing directional convergence of the network parameters and classification error bound of the convergent direction. Our classification error bound also lead to the discovery of a newly identified phase transition. Previously, directional convergence in (leaky) ReLU neural networks was established only for gradient flow. Due to the lack of directional convergence, previous results on benign overfitting were limited to those trained on nearly orthogonal data. All of our results hold on mixture data, which is a broader data setting than the nearly orthogonal data setting in prior work. We demonstrate our findings by showing that benign overfitting occurs with high probability in a much wider range of scenarios than previously known. Our results also allow us to characterize cases when benign overfitting provably fails even if directional convergence occurs. Our work thus provides a more complete picture of benign overfitting in leaky ReLU two-layer neural networks.

2505.15147 2026-02-12 cs.CV

From Pixels to Images: A Structural Survey of Deep Learning Paradigms in Remote Sensing Image Semantic Segmentation

Quanwei Liu, Tao Huang, Jiaqi Yang, Wei Xiang

Comments 34 pages, 9 figures, 5 tables

详情
英文摘要

Semantic segmentation (SS) of RSIs enables the fine-grained interpretation of surface features, making it a critical task in RS analysis. With the increasing diversity and volume of RSIs collected by sensors on various platforms, traditional processing methods struggle to maintain efficiency and accuracy. In response, deep learning (DL) has emerged as a transformative approach, enabling substantial advances in remote sensing image semantic segmentation (RSISS) by automating hierarchical feature extraction and improving segmentation performance across diverse modalities. As data scale and model capacity have increased, DL-based RSISS has undergone a structural evolution from pixel-level and patch-based classification to tile-level, end-to-end segmentation, and, more recently, to image-level modelling with vision foundation models. However, existing reviews often focus on individual components, such as supervision strategies or fusion stages, and lack a unified operational perspective aligned with segmentation granularity and the training/inference pipeline. This paper provides a comprehensive review by organizing DL-based RSISS into a pixel-patch-tile-image hierarchy, covering early pixel-based methods, prevailing patch-based and tile-based techniques, and emerging image-based approaches. This review offers a holistic and structured understanding of DL-based RSISS, highlighting representative datasets, comparative insights, and open challenges related to data scale, model efficiency, domain robustness, and multimodal integration. Furthermore, to facilitate reproducible research, curated code collections are provided at: https://github.com/quanweiliu/PatchwiseClsFra and https://github.com/quanweiliu/TilewiseSegFra.

2505.13944 2026-02-12 cs.CL

WAVE++: Capturing Within-Task Variance for Continual Relation Extraction with Adaptive Prompting

Bao-Ngoc Dao, Minh Le, Quang Nguyen, Luyen Ngo Dinh, Nam Le, Linh Ngo Van

Comments Accepted in Neurocomputing, Elsevier

详情
英文摘要

Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE, particularly in accurately identifying task identities and mitigating catastrophic forgetting. Existing prompt selection strategies often suffer from inaccuracies, lack robust mechanisms to prevent forgetting in shared parameters, and struggle to handle both cross-task and within-task variations. In this paper, we propose WAVE++, a novel approach inspired by the connection between prefix-tuning and mixture of experts. Specifically, we introduce task-specific prompt pools that enhance flexibility and adaptability across diverse tasks while avoiding boundary-spanning risks; this design more effectively captures both within-task and cross-task variations. To further refine relation classification, we incorporate label descriptions that provide richer, more global context, enabling the model to better distinguish among different relations. We also propose a training-free mechanism to improve task prediction during inference. Moreover, we integrate a generative model to consolidate prior knowledge within the shared parameters, thereby removing the need for explicit data storage. Extensive experiments demonstrate that WAVE++ outperforms state-of-the-art prompt-based and rehearsal-based methods, offering a more robust solution for continual relation extraction. Our code is publicly available at https://github.com/PiDinosauR2804/WAVE-CRE-PLUS-PLUS.

2505.13444 2026-02-12 cs.CL cs.CV

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett

Comments NeurIPS 2025 Datasets & Benchmarks

详情
英文摘要

Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.

2505.11739 2026-02-12 cs.CL cs.AI

ZeroTuning: Unlocking the Initial Token's Power to Enhance Large Language Models Without Training

Feijiang Han, Xiaodong Yu, Jianheng Tang, Delip Rao, Weihua Du, Lyle Ungar

Comments ICLR 2026 Accepted Version: proofread, introduction rewritten, additional experiments and appendix material added

详情
英文摘要

Token-level attention tuning, a class of training-free methods including Post-hoc Attention Steering (PASTA) and Attention Calibration (ACT), has emerged as a promising approach for improving frozen LLMs via interpretable interventions. However, these methods rely on auxiliary heuristics to identify important task-specific tokens, which can introduce bias and limit applicability when token importance is ambiguous or when optimized kernels make attention maps inaccessible. We propose a simpler alternative: intervening only on the initial token (e.g., BOS in LLaMA). We theoretically show that adding lightweight biases to this token's attention logits systematically shifts and reshapes downstream attention patterns - an effect amplified by its natural role as an attention sink. Empirically, we find that this tuning can improve LLM performance and better elicit pretrained knowledge, with stronger effects in early layers and distinct scaling preferences across attention heads. Building on these findings, we introduce ZeroTuning, a training-free method that improves LLM performance by applying head-specific attention adjustments to the initial token, requiring no parameter updates. We present two variants: a supervised mode that calibrates on validation examples, and an unsupervised mode that directly minimizes output entropy. ZeroTuning requires no KV-cache or decoding changes and is kernel-agnostic (works with SDPA and FlashAttention). It requires only four lines of modification to the standard LlamaAttention code, achieves gains across 15 datasets, and outperforms prior, more complex methods. For example, on Llama-3.1-8B, it yields relative improvements of 19.9% on classification, 4.5% on question answering, and 2.1% on dialogue. ZeroTuning also works out of the box with quantized inference and maintains its improvements as context length increases.

2505.11578 2026-02-12 cs.LG cs.AI physics.comp-ph

Spatiotemporal Field Generation Based on Hybrid Mamba-Transformer with Physics-informed Fine-tuning

Peimian Du, Jiabin Liu, Xiaowei Jin, Wangmeng Zuo, Hui Li

详情
英文摘要

This research confronts the challenge of substantial physical equation discrepancies encountered in the generation of spatiotemporal physical fields through data-driven trained models. A spatiotemporal physical field generation model, named HMT-PF, is developed based on the hybrid Mamba-Transformer architecture, incorporating unstructured grid information as input. A fine-tuning block, enhanced with physical information, is introduced to effectively reduce the physical equation discrepancies. The physical equation residuals are computed through a point query mechanism for efficient gradient evaluation, then encoded into latent space for refinement. The fine-tuning process employs a self-supervised learning approach to achieve physical consistency while maintaining essential field characteristics. Results show that the hybrid Mamba-Transformer model achieves good performance in generating spatiotemporal fields, while the physics-informed fine-tuning mechanism further reduces significant physical errors effectively. A MSE-R evaluation method is developed to assess the accuracy and realism of physical field generation.

2505.09651 2026-02-12 cs.CV cs.AI cs.LG

Geospatial Representation Learning: A Survey from Deep Learning to The LLM Era

Xixuan Hao, Yutian Jiang, Xingchen Zou, Jiabo Liu, Yifang Yin, Song Gao, Flora Salim, Tianrui Li, Yuxuan Liang

详情
英文摘要

The ability to transform location-centric geospatial data into meaningful computational representations has become fundamental to modern spatial analysis and decision-making. Geospatial Representation Learning (GRL), the process of automatically extracting latent structures and semantic patterns from geographic data, is undergoing a profound transformation through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM) paradigm. While deep neural networks (DNNs) have demonstrated remarkable success in automated feature extraction from structured and semi-structured geospatial data (e.g., satellite imagery, GPS trajectories), the recent integration of LLMs introduces transformative capabilities for cross-modal geospatial reasoning and unstructured geo-textual data processing. This survey presents a comprehensive review of geospatial representation learning across both technological eras, organizing them into a structured taxonomy based on the complete pipeline comprising: (1) data perspective, (2) methodological perspective, and (3) application perspective. We also highlight current advancements, discuss existing limitations, and propose potential future research directions in the LLM and foundation model era. This work offers a thorough exploration of the field and provides a roadmap for further innovation in GRL. The summary of the up-to-date paper list can be found in https://github.com/CityMind-Lab/Awesome-Geospatial-Representation-Learning and will undergo continuous updates.

2505.05082 2026-02-12 cs.LG cs.IT math.IT math.PR

ItDPDM: Information-Theoretic Discrete Poisson Diffusion Model

Sagnik Bhattacharya, Abhiram Gorle, Ahsan Bilal, Connor Ding, Amit Kumar Singh Yadav, Tsachy Weissman

Comments Published in NeurIPS 2025

详情
英文摘要

Generative modeling of non-negative, discrete data, such as symbolic music, remains challenging due to two persistent limitations in existing methods. Firstly, many approaches rely on modeling continuous embeddings, which is suboptimal for inherently discrete data distributions. Secondly, most models optimize variational bounds rather than exact data likelihood, resulting in inaccurate likelihood estimates and degraded sampling quality. While recent diffusion-based models have addressed these issues separately, we tackle them jointly. In this work, we introduce the Information-Theoretic Discrete Poisson Diffusion Model (ItDPDM), inspired by photon arrival process, which combines exact likelihood estimation with fully discrete-state modeling. Central to our approach is an information-theoretic Poisson Reconstruction Loss (PRL) that has a provable exact relationship with the true data likelihood. ItDPDM achieves improved likelihood and sampling performance over prior discrete and continuous diffusion models on a variety of synthetic discrete datasets. Furthermore, on real-world datasets such as symbolic music and images, ItDPDM attains superior likelihood estimates and competitive generation quality-demonstrating a proof of concept for distribution-robust discrete generative modeling.

2504.11524 2026-02-12 cs.AI cs.CL cs.CY cs.LG

HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation

Haokun Liu, Sicong Huang, Jingyu Hu, Yangqiaoyu Zhou, Chenhao Tan

Comments 32 pages, 6 figures, website link: https://chicagohai.github.io/HypoBench/

详情
英文摘要

There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address this, we introduce HypoBench, a novel benchmark designed to evaluate LLMs and hypothesis generation methods across multiple aspects, including practical utility, generalizability, and hypothesis discovery rate. HypoBench includes 7 real-world tasks and 5 synthetic tasks with 194 distinct datasets. We evaluate four state-of-the-art LLMs combined with six existing hypothesis-generation methods. Overall, our results suggest that existing methods are capable of discovering valid and novel patterns in the data. However, the results from synthetic datasets indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns. Specifically, in synthetic settings, as task difficulty increases, performance significantly drops, with best models and methods only recovering 38.8% of the ground-truth hypotheses. These findings highlight challenges in hypothesis generation and demonstrate that HypoBench serves as a valuable resource for improving AI systems designed to assist scientific discovery.

2504.00437 2026-02-12 cs.CV

ADGaussian: Generalizable Gaussian Splatting for Autonomous Driving via Multi-modal Joint Learning

Qi Song, Chenghong Li, Haotong Lin, Sida Peng, Rui Huang

Comments The paper is accepted by ICRA 2026 and the project page can be found at https://maggiesong7.github.io/research/ADGaussian/

详情
英文摘要

We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from merely single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric clue. Furthermore, we propose a Multi-modal Feature Matching strategy coupled with a Multi-scale Gaussian Decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on Waymo and KITTI demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.

2503.23270 2026-02-12 cs.RO cs.AI cs.LG

Localized Graph-Based Neural Dynamics Models for Terrain Manipulation

Chaoqi Liu, Yunzhu Li, Kris Hauser

详情
英文摘要

Predictive models can be particularly helpful for robots to effectively manipulate terrains in construction sites and extraterrestrial surfaces. However, terrain state representations become extremely high-dimensional especially to capture fine-resolution details and when depth is unknown or unbounded. This paper introduces a learning-based approach for terrain dynamics modeling and manipulation, leveraging the Graph-based Neural Dynamics (GBND) framework to represent terrain deformation as motion of a graph of particles. Based on the principle that the moving portion of a terrain is usually localized, our approach builds a large terrain graph (potentially millions of particles) but only identifies a very small active subgraph (hundreds of particles) for predicting the outcomes of robot-terrain interaction. To minimize the size of the active subgraph we introduce a learning-based approach that identifies a small region of interest (RoI) based on the robot's control inputs and the current scene. We also introduce a novel domain boundary feature encoding that allows GBNDs to perform accurate dynamics prediction in the RoI interior while avoiding particle penetration through RoI boundaries. Our proposed method is both orders of magnitude faster than naive GBND and it achieves better overall prediction accuracy. We further evaluated our framework on excavation and shaping tasks on terrain with different granularity.

2503.16858 2026-02-12 cs.CL cs.AI

MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering

Jialin Chen, Aosong Feng, Ziyu Zhao, Juan Garza, Gaukhar Nurbek, Cheng Qin, Ali Maatouk, Leandros Tassiulas, Yifeng Gao, Rex Ying

Comments 18 pages

详情
英文摘要

Understanding the relationship between textual news and time-series evolution is a critical yet under-explored challenge in applied data science. While multimodal learning has gained traction, existing multimodal time-series datasets fall short in evaluating cross-modal reasoning and complex question answering, which are essential for capturing complex interactions between narrative information and temporal patterns. To bridge this gap, we introduce Multimodal Time Series Benchmark (MTBench), a large-scale benchmark designed to evaluate large language models (LLMs) on time series and text understanding across financial and weather domains. MTbench comprises paired time series and textual data, including financial news with corresponding stock price movements and weather reports aligned with historical temperature records. Unlike existing benchmarks that focus on isolated modalities, MTbench provides a comprehensive testbed for models to jointly reason over structured numerical trends and unstructured textual narratives. The richness of MTbench enables formulation of diverse tasks that require a deep understanding of both text and time-series data, including time-series forecasting, semantic and technical trend analysis, and news-driven question answering (QA). These tasks target the model's ability to capture temporal dependencies, extract key insights from textual context, and integrate cross-modal information. We evaluate state-of-the-art LLMs on MTbench, analyzing their effectiveness in modeling the complex relationships between news narratives and temporal patterns. Our findings reveal significant challenges in current models, including difficulties in capturing long-term dependencies, interpreting causality in financial and weather trends, and effectively fusing multimodal information.

2503.16814 2026-02-12 cs.LG cs.CL

From Belief Entrenchment to Robust Reasoning in LLM Agents

Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun

Comments Accepted to TACL

详情
英文摘要

Multi-Agent Debate (MAD) has emerged as a promising inference scaling method for Large Language Model (LLM) reasoning. However, it frequently suffers from belief entrenchment, where agents reinforce shared errors rather than correcting them. Going beyond merely identifying this failure, we decompose it into two distinct root causes: (1) the model's biased $\textit{static initial belief}$ and (2) $\textit{homogenized debate dynamics}$ that amplify the majority view regardless of correctness. To address these sequentially, we propose $\textbf{DReaMAD}$ $($$\textbf{D}$iverse $\textbf{Rea}$soning via $\textbf{M}$ulti-$\textbf{A}$gent $\textbf{D}$ebate with Refined Prompt$)$. Our framework first rectifies the static belief via strategic prior knowledge elicitation, then reshapes the debate dynamics by enforcing perspective diversity. Validated on our new $\textit{MetaNIM Arena}$ benchmark, $\textbf{DReaMAD}$ significantly mitigates entrenchment, achieving a +9.5\% accuracy gain over ReAct prompting and a +19.0\% higher win rate than standard MAD.

2503.13180 2026-02-12 cs.LG cs.AI cs.DC

GC-Fed: Gradient Centralized Federated Learning with Partial Client Participation

Jungwon Seo, Ferhat Ozgur Catak, Chunming Rong, Kibeom Hong, Minhoe Kim

详情
英文摘要

Federated Learning (FL) enables privacy-preserving multi-source information fusion (MSIF) but is challenged by client drift in highly heterogeneous data settings. Many existing drift-mitigation strategies rely on reference-based techniques--such as gradient adjustments or proximal loss--that use historical snapshots (e.g., past gradients or previous global models) as reference points. When only a subset of clients participates in each training round, these historical references may not accurately capture the overall data distribution, leading to unstable training. In contrast, our proposed Gradient Centralized Federated Learning (GC-Fed) employs a hyperplane as a historically independent reference point to guide local training and enhance inter-client alignment. GC-Fed comprises two complementary components: Local GC, which centralizes gradients during local training, and Global GC, which centralizes updates during server aggregation. In our hybrid design, Local GC is applied to feature-extraction layers to harmonize client contributions, while Global GC refines classifier layers to stabilize round-wise performance. Theoretical analysis and extensive experiments on benchmark FL tasks demonstrate that GC-Fed effectively mitigates client drift and achieves up to a 20% improvement in accuracy under heterogeneous and partial participation conditions.

2503.11297 2026-02-12 cs.CV

GMG: A Video Prediction Method Based on Global Focus and Motion Guided

Yuhao Du, Hui Liu, Haoxiang Peng, Xinyuan Cheng, Chengrong Wu, Jiankai Zhang

详情
英文摘要

Recent years, weather forecasting has gained significant attention. However, accurately predicting weather remains a challenge due to the rapid variability of meteorological data and potential teleconnections. Current spatiotemporal forecasting models primarily rely on convolution operations or sliding windows for feature extraction. These methods are limited by the size of the convolutional kernel or sliding window, making it difficult to capture and identify potential teleconnection features in meteorological data. Additionally, weather data often involve non-rigid bodies, whose motion processes are accompanied by unpredictable deformations, further complicating the forecasting task. In this paper, we propose the GMG model to address these two core challenges. The Global Focus Module, a key component of our model, enhances the global receptive field, while the Motion Guided Module adapts to the growth or dissipation processes of non-rigid bodies. Through extensive evaluations, our method demonstrates competitive performance across various complex tasks, providing a novel approach to improving the predictive accuracy of complex spatiotemporal data.

2503.11093 2026-02-12 cs.CV

OmniDiff: A Comprehensive Benchmark for Fine-grained Image Difference Captioning

Yuan Liu, Saihui Hou, Saijie Hou, Jiabao Du, Shibei Meng, Yongzhen Huang

详情
英文摘要

Image Difference Captioning (IDC) aims to generate natural language descriptions of subtle differences between image pairs, requiring both precise visual change localization and coherent semantic expression. Despite recent advancements, existing datasets often lack breadth and depth, limiting their applicability in complex and dynamic environments: (1) from a breadth perspective, current datasets are constrained to limited variations of objects in specific scenes, and (2) from a depth perspective, prior benchmarks often provide overly simplistic descriptions. To address these challenges, we introduce OmniDiff, a comprehensive dataset comprising 324 diverse scenarios-spanning real-world complex environments and 3D synthetic settings-with fine-grained human annotations averaging 60 words in length and covering 12 distinct change types. Building on this foundation, we propose M$^3$Diff, a MultiModal large language model enhanced by a plug-and-play Multi-scale Differential Perception (MDP) module. This module improves the model's ability to accurately identify and describe inter-image differences while maintaining the foundational model's generalization capabilities. With the addition of the OmniDiff dataset, M$^3$Diff achieves state-of-the-art performance across multiple benchmarks, including Spot-the-Diff, IEdit, CLEVR-Change, CLEVR-DC, and OmniDiff, demonstrating significant improvements in cross-scenario difference recognition accuracy compared to existing methods. The dataset, code, and models will be made publicly available to support further research.

2503.06790 2026-02-12 cs.CV cs.AI eess.IV

GenDR: Lighten Generative Detail Restoration

Yan Wang, Shijie Zhao, Kexin Zhang, Junlin Li, Li Zhang

Comments Accepted by ICLR 2026

详情
英文摘要

Although recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable progress, the misalignment of their targets leads to a suboptimal trade-off between inference speed and detail fidelity. Specifically, the T2I task requires multiple inference steps to synthesize images matching to prompts and reduces the latent dimension to lower generating difficulty. Contrariwise, SR can restore high-frequency details in fewer inference steps, but it necessitates a more reliable variational auto-encoder (VAE) to preserve input information. However, most diffusion-based SRs are multistep and use 4-channel VAEs, while existing models with 16-channel VAEs are overqualified diffusion transformers, e.g., FLUX (12B). To align the target, we present a one-step diffusion model for generative detail restoration, GenDR, distilled from a tailored diffusion model with a larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand the latent space without increasing the model size. Regarding step distillation, we propose consistent score identity distillation (CiD) that incorporates SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also polish the pipeline to achieve a more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.

2503.01882 2026-02-12 cs.LG physics.geo-ph stat.AP stat.ML

Constructing balanced datasets for predicting failure modes in structural systems under seismic hazards

Jungho Kim, Taeyong Kim

Journal ref Engineering Structures, Vol(346), 121637, 2026

详情
英文摘要

Accurate prediction of structural failure modes under seismic excitations is essential for seismic risk and resilience assessment. Traditional simulation-based approaches often result in imbalanced datasets dominated by non-failure or frequently observed failure scenarios, limiting the effectiveness in machine learning-based prediction. To address this challenge, this study proposes a framework for constructing balanced datasets that include distinct failure modes. The framework consists of three key steps. First, critical ground motion features (GMFs) are identified to effectively represent ground motion time histories. Second, an adaptive algorithm is employed to estimate the probability densities of various failure domains in the space of critical GMFs and structural parameters. Third, samples generated from these probability densities are transformed into ground motion time histories by using a scaling factor optimization process. A balanced dataset is constructed by performing nonlinear response history analyses on structural systems with parameters matching the generated samples, subjected to corresponding transformed ground motion time histories. Deep neural network models are trained on balanced and imbalanced datasets to highlight the importance of dataset balancing. To further evaluate the framework's applicability, numerical investigations are conducted using two different structural models subjected to recorded and synthetic ground motions. The results demonstrate the framework's robustness and effectiveness in addressing dataset imbalance and improving machine learning performance in seismic failure mode prediction.

2503.00038 2026-02-12 cs.CL cs.AI cs.CR

from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors

Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Jingyu Lei, Qi Li

Comments arXiv admin note: substantial text overlap with arXiv:2412.12145

详情
英文摘要

Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content by jailbreak attacks. However, they overlook that the direct generation of harmful content from scratch is more difficult than inducing LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason and calibrate about the metaphorical content, thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferable jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs.

2502.19844 2026-02-12 cs.CV

ProAPO: Progressively Automatic Prompt Optimization for Visual Classification

Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, Gang Xiong

Comments Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

详情
英文摘要

Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performances largely depend on the prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination due to the hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human-in-the-loop. An evolution-based algorithm is proposed to progressively optimize language prompts from task-specific templates to class-specific descriptions. Unlike optimizing templates, the search space shows an explosion in class-specific candidate prompts. This increases prompt generation costs, iterative times, and the overfitting problem. To this end, we first introduce several simple yet effective edit-based and evolution-based operations to generate diverse candidate prompts by one-time query of LLMs. Then, two sampling strategies are proposed to find a better initial search point and reduce traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. Meanwhile, we demonstrate that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.

2502.10303 2026-02-12 cs.AI cs.GT

Reinforcement Learning in Strategy-Based and Atari Games: A Review of Google DeepMinds Innovations

Abdelrhman Shaheen, Anas Badr, Ali Abohendy, Hatem Alsaadawy, Nadine Alsayad, Ehab H. El-Shazly

详情
英文摘要

Reinforcement Learning (RL) has been widely used in many applications, particularly in gaming, which serves as an excellent training ground for AI models. Google DeepMind has pioneered innovations in this field, employing reinforcement learning algorithms, including model-based, model-free, and deep Q-network approaches, to create advanced AI models such as AlphaGo, AlphaGo Zero, and MuZero. AlphaGo, the initial model, integrates supervised learning and reinforcement learning to master the game of Go, surpassing professional human players. AlphaGo Zero refines this approach by eliminating reliance on human gameplay data, instead utilizing self-play for enhanced learning efficiency. MuZero further extends these advancements by learning the underlying dynamics of game environments without explicit knowledge of the rules, achieving adaptability across various games, including complex Atari games. This paper reviews the significance of reinforcement learning applications in Atari and strategy-based games, analyzing these three models, their key innovations, training processes, challenges encountered, and improvements made. Additionally, we discuss advancements in the field of gaming, including MiniZero and multi-agent models, highlighting future directions and emerging AI models from Google DeepMind.

2502.04468 2026-02-12 cs.LG eess.IV math.PR

Iterative Importance Fine-tuning of Diffusion Models

Alexander Denker, Shreyas Padhy, Francisco Vargas, Johannes Hertrich

详情
英文摘要

Diffusion models are an important tool for generative modelling, serving as effective priors in applications such as imaging and protein design. A key challenge in applying diffusion models for downstream tasks is efficiently sampling from resulting posterior distributions, which can be addressed using Doob's $h$-transform. This work introduces a self-supervised algorithm for fine-tuning diffusion models by learning the optimal control, enabling amortised conditional sampling. Our method iteratively refines the control using a synthetic dataset resampled with path-based importance weights. We demonstrate the effectiveness of this framework on class-conditional sampling, inverse problems and reward fine-tuning for text-to-image diffusion models.

2502.03366 2026-02-12 cs.LG stat.ML

Rethinking Approximate Gaussian Inference in Classification

Bálint Mucsányi, Nathaël Da Costa, Philipp Hennig

Comments 46 pages

详情
英文摘要

In classification tasks, softmax functions are ubiquitously used as output activations to produce predictive probabilities. Such outputs only capture aleatoric uncertainty. To capture epistemic uncertainty, approximate Gaussian inference methods have been proposed. We develop a common formalism to describe such methods, which we view as outputting Gaussian distributions over the logit space. Predictives are then obtained as the expectations of the Gaussian distributions pushed forward through the softmax. However, such softmax Gaussian integrals cannot be solved analytically, and Monte Carlo (MC) approximations can be costly and noisy. We propose to replace the softmax activation by element-wise normCDF or sigmoid, which allows for the accurate sampling-free approximation of predictives. This also enables the approximation of the Gaussian pushforwards by Dirichlet distributions with moment matching. This approach entirely eliminates the runtime and memory overhead associated with MC sampling. We evaluate it combined with several approximate Gaussian inference methods (Laplace, HET, SNGP) on large- and small-scale datasets (ImageNet, CIFAR-100, CIFAR-10), demonstrating improved uncertainty quantification capabilities compared to softmax MC sampling. Our code is available at https://github.com/bmucsanyi/probit.

2501.18936 2026-02-12 cs.LG cs.CV

Revisit Visual Prompt Tuning: The Expressiveness of Prompt Experts

Minh Le, Anh Nguyen, Huy Nguyen, Chau Nguyen, Anh Tran, Nhat Ho

Comments Accepted to ICLR 2026

详情
英文摘要

Visual Prompt Tuning (VPT) has proven effective for parameter-efficient adaptation of pre-trained vision models to downstream tasks by inserting task-specific learnable prompt tokens. Despite its empirical success, a comprehensive theoretical understanding of VPT remains an active area of research. Building on the recently established connection between Mixture of Experts (MoE) and prompt-based methods, wherein each attention head can be conceptualized as a composition of multiple MoE models, we reinterpret VPT as the introduction of new prompt experts into these MoE structures. We identify a key limitation in existing VPT frameworks: the restricted functional expressiveness of prompt experts, which remain static and thus limited in their adaptability. To address this, we propose Visual Adaptive Prompt Tuning (VAPT), a novel method that endows prompt experts with enhanced expressiveness while preserving parameter efficiency. Empirical evaluations on VTAB-1K and FGVC demonstrate that VAPT achieves substantial performance improvements, surpassing fully fine-tuned baselines by 7.34% and 1.04%, respectively. Moreover, VAPT consistently outperforms VPT while requiring fewer additional parameters. Furthermore, our theoretical analysis indicates that VAPT achieves optimal sample efficiency. Collectively, these results underscore the theoretical grounding and empirical advantages of our approach.

2501.04538 2026-02-12 cs.LG

HypeRL: Hypernetwork-Based Reinforcement Learning for Control of Parametrized Dynamical Systems

Nicolò Botteghi, Stefania Fresca, Mengwu Guo, Andrea Manzoni

详情
英文摘要

In this work, we devise a new, general-purpose reinforcement learning strategy for the optimal control of parametric dynamical systems. Such problems frequently arise in applied sciences and engineering and entail a significant complexity when control and/or state variables are distributed in high-dimensional space or depend on varying parameters. Traditional numerical methods, relying on either iterative minimization algorithms -- exploiting, e.g., the solution of the adjoint problem -- or dynamic programming -- also involving the solution of the Hamilton-Jacobi-Bellman (HJB) equation -- while reliable, often become computationally infeasible. In this paper, we propose HypeRL a deep reinforcement learning (DRL) framework to overcome the limitations shown by traditional methods. HypeRL aims at approximating the optimal control policy directly. Specifically, we employ an actor-critic DRL approach to learn an optimal feedback control strategy that can generalize across the range of variation of the parameters. To effectively learn such optimal control laws for different instances of the parameters, encoding the parameter information into the DRL policy and value function neural networks (NNs) is essential. HypeRL uses two additional NNs, called hypernetworks, to learn the weights and biases of the value function and the policy NNs. In this way, HypeRL effectively embeds the parametric information into the value function and policy. We validate the proposed approach on two parametric control problems, namely (I) a 1D parametric Kuramoto-Sivashinsky equation with in-domain control, and (ii) a navigation problem of particle dynamics in a parametric 2D gyre flow. We show that the knowledge of physical and task-dependent information and the encoding of this information via a hypernetwork, are essential ingredients for learning parameter-dependent control policies.

2412.03462 2026-02-12 cs.RO cs.SY eess.SY

Multi-Momentum Observer Contact Estimation for Bipedal Robots

J. Joe Payne, Daniel A. Hagen, Denis Garagić, Aaron M. Johnson

详情
英文摘要

As bipedal robots become more and more popular in commercial and industrial settings, the ability to control them with a high degree of reliability is critical. To that end, this paper considers how to accurately estimate which feet are currently in contact with the ground so as to avoid improper control actions that could jeopardize the stability of the robot. Additionally, modern algorithms for estimating the position and orientation of a robot's base frame rely heavily on such contact mode estimates. Dedicated contact sensors on the feet can be used to estimate this contact mode, but these sensors are prone to noise, time delays, damage/yielding from repeated impacts with the ground, and are not available on every robot. To overcome these limitations, we propose a momentum observer based method for contact mode estimation that does not rely on such contact sensors. Often, momentum observers assume that the robot's base frame can be treated as an inertial frame. However, since many humanoids' legs represent a significant portion of the overall mass, the proposed method instead utilizes multiple simultaneous dynamic models. Each of these models assumes a different contact condition. A given contact assumption is then used to constrain the full dynamics in order to avoid assuming that either the body is an inertial frame or that a fully accurate estimate of body velocity is known. The (dis)agreement between each model's estimates and measurements is used to determine which contact mode is most likely using a Markov-style fusion method. The proposed method produces contact detection accuracy of up to 98.44% with a low noise simulation and 77.12% when utilizing data collect on the Sarcos Guardian XO robot (a hybrid humanoid/exoskeleton).

2411.15927 2026-02-12 cs.CL cs.AI

Generative Prompt Internalization

Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo

Comments NAACL 2025 (Main Conference)

Journal ref NAACL 2025

详情
英文摘要

Prompts used in recent large language model based applications are often fixed and lengthy, leading to significant computational overhead. To address this challenge, we propose Generative Prompt Internalization (GenPI), a lightweight method that employs a joint training approach. GenPI not only replicates the behavior of models with prompt inputs but also generates the content of the prompt along with reasons for why the model's behavior should change accordingly. We demonstrate that our approach effectively internalizes complex prompts across various agent-based application scenarios. For effective training without interactions with the dedicated environments, we introduce a data synthesis technique that autonomously collects conversational datasets by swapping the roles of the agent and environment. This method is especially useful in scenarios where only a predefined prompt is available without a corresponding training dataset. By internalizing complex prompts, Generative Prompt Internalization enables high performance and efficient inference without the need for explicit prompts.

2411.04534 2026-02-12 cs.LG

Hypercube Policy Regularization Framework for Offline Reinforcement Learning

Yi Shen, Hanyan Huang

Comments Revised version accepted at MLMI 2025

详情
英文摘要

Offline reinforcement learning has received extensive attention from scholars because it avoids the interaction between the agent and the environment by learning a policy through a static dataset. However, general reinforcement learning methods cannot get satisfactory results in offline reinforcement learning due to the out-of-distribution state actions that the dataset cannot cover during training. To solve this problem, the policy regularization method that tries to directly clone policies used in static datasets has received numerous studies due to its simplicity and effectiveness. However, policy constraint methods make the agent choose the corresponding actions in the static dataset. This type of constraint is usually over-conservative, which results in suboptimal policies, especially in low-quality static datasets. In this paper, a hypercube policy regularization framework is proposed, this method alleviates the constraints of policy constraint methods by allowing the agent to explore the actions corresponding to similar states in the static dataset, which increases the effectiveness of algorithms in low-quality datasets. It was also theoretically demonstrated that the hypercube policy regularization framework can effectively improve the performance of original algorithms. In addition, the hypercube policy regularization framework is combined with TD3-BC and Diffusion-QL for experiments on D4RL datasets which are called TD3-BC-C and Diffusion-QL-C. The experimental results of the score demonstrate that TD3-BC-C and Diffusion-QL-C perform better than state-of-the-art algorithms like IQL, CQL, TD3-BC and Diffusion-QL in most D4RL environments in approximate time.