arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 2365
专题追踪
2507.01835 2026-02-24 cs.CV

Modulate and Reconstruct: Learning Hyperspectral Imaging from Misaligned Smartphone Views

Daniil Reutsky, Daniil Vladimirov, Yasin Mamedov, Georgy Perevozchikov, Nancy Mehta, Egor Ershov, Radu Timofte

详情
英文摘要

Hyperspectral reconstruction (HSR) from RGB images is a highly promising direction for accurate color reproduction and material color measurement. While most existing approaches rely on a single RGB image - thereby limiting reconstruction accuracy - the majority of modern smartphones are equipped with two or more cameras. In this work, we propose a novel multi-image-to-hyperspectral reconstruction (MI-HSR) framework that leverages a triple-camera smartphone system, where two lenses are equipped with carefully selected spectral filters. Our easy-to-implement configuration, based on theoretical and empirical analysis, allows to obtain more complete and diverse spectral data than traditional single-chamber setups. To support this new paradigm, we introduce Doomer, the first dataset for MI-HSR, comprising aligned images from three smartphone cameras and a hyperspectral reference camera across diverse scenes. We further introduce a lightweight alignment module for MI-HSR that effectively fuses multi-view inputs while mitigating parallax- and occlusion-induced artifacts. Proposed module demonstrate consistent quality improvements for modern HSR methods. In a nutshell, our setup allows 30% more accurate estimations of spectra compared to an ordinary RGB camera, while the proposed alignment module boosts the reconstruction quality of SotA methods by an additional 5%. Our findings suggest that spectral filtering of multiple views with commodity hardware unlocks more accurate and practical hyperspectral imaging.

2507.01781 2026-02-24 cs.LG cs.AI

Symbolic Branch Networks: Tree-Inherited Neural Models for Interpretable Multiclass Classification

Dalia Rodríguez-Salas

Comments Substantially revised and extended version (previously titled "BranchNet"), introducing the Symbolic Branch Network (SBN) framework with updated architectural formulation, training methodology, and expanded experimental evaluation. Submitted to Neurocomputing

详情
英文摘要

Symbolic Branch Networks (SBNs) are neural models whose architecture is inherited directly from an ensemble of decision trees. Each root-to-parent-of-leaf decision path is mapped to a hidden neuron, and the matrices $W_{1}$ (feature-to-branch) and $W_{2}$ (branch-to-class) encode the symbolic structure of the ensemble. Because these matrices originate from the trees, SBNs preserve transparent feature relevance and branch-level semantics while enabling gradient-based learning. The primary contribution of this work is SBN, a semi-symbolic variant that preserves branch semantics by keeping $W_{2}$ fixed, while allowing $W_{1}$ to be refined through learning. This controlled relaxation improves predictive accuracy without altering the underlying symbolic structure. Across 28 multiclass tabular datasets from the OpenML CC-18 benchmark, SBN consistently matches or surpasses XGBoost while retaining human-interpretable branch attributions. We also analyze SBN*, a fully symbolic variant in which both $W_{1}$ and $W_{2}$ are frozen and only calibration layers are trained. Despite having no trainable symbolic parameters, SBN* achieves competitive performance on many benchmarks, highlighting the strength of tree-derived symbolic routing as an inductive bias. Overall, these results show that symbolic structure and neural optimization can be combined to achieve strong performance while maintaining stable and interpretable internal representations.

2506.22740 2026-02-24 cs.AI stat.ML

Explanations are a Means to an End: Decision Theoretic Explanation Evaluation

Ziyang Guo, Berk Ustun, Jessica Hullman

详情
英文摘要

Explanations of model behavior are commonly evaluated via proxy properties weakly tied to the purposes explanations serve in practice. We contribute a decision theoretic framework that treats explanations as information signals valued by the expected improvement they enable on a specified decision task. This approach yields three distinct estimands: 1) a theoretical benchmark that upperbounds achievable performance by any agent with the explanation, 2) a human-complementary value that quantifies the theoretically attainable value that is not already captured by a baseline human decision policy, and 3) a behavioral value representing the causal effect of providing the explanation to human decision-makers. We instantiate these definitions in a practical validation workflow, and apply them to assess explanation potential and interpret behavioral effects in human-AI decision support and mechanistic interpretability.

2506.08604 2026-02-24 cs.LG cs.AI cs.CE cs.NA math.NA

Physics vs Distributions: Pareto Optimal Flow Matching with Physics Constraints

Giacomo Baldan, Qiang Liu, Alberto Guardone, Nils Thuerey

详情
英文摘要

Physics-constrained generative modeling aims to produce high-dimensional samples that are both physically consistent and distributionally accurate, a task that remains challenging due to often conflicting optimization objectives. Recent advances in flow matching and diffusion models have enabled efficient generative modeling, but integrating physical constraints often degrades generative fidelity or requires costly inference-time corrections. Our work is the first to recognize the trade-off between distributional and physical accuracy. Based on the insight of inherently conflicting objectives, we introduce Physics-Based Flow Matching (PBFM) a method that enforces physical constraints at training time using conflict-free gradient updates and unrolling to mitigate Jensen's gap. Our approach avoids manual loss balancing and enables simultaneous optimization of generative and physical objectives. As a consequence, physics constraints do not impede inference performance. We benchmark our method across three representative PDE benchmarks. PBFM achieves a Pareto-optimal trade-off, competitive inference speed, and generalizes to a wide range of physics-constrained generative tasks, providing a practical tool for scientific machine learning. Code and datasets available at https://github.com/tum-pbs/PBFM.

2506.07078 2026-02-24 cs.LG cs.SD eess.AS

E-BATS: Efficient Backpropagation-Free Test-Time Adaptation for Speech Foundation Models

Jiaheng Dong, Hong Jia, Soumyajit Chatterjee, Abhirup Ghosh, James Bailey, Ting Dang

Comments Accepted by NeurIPS 2025

详情
英文摘要

Speech Foundation Models encounter significant performance degradation when deployed in real-world scenarios involving acoustic domain shifts, such as background noise and speaker accents. Test-time adaptation (TTA) has recently emerged as a viable strategy to address such domain shifts at inference time without requiring access to source data or labels. However, existing TTA approaches, particularly those relying on backpropagation, are memory-intensive, limiting their applicability in speech tasks and resource-constrained settings. Although backpropagation-free methods offer improved efficiency, existing ones exhibit poor accuracy. This is because they are predominantly developed for vision tasks, which fundamentally differ from speech task formulations, noise characteristics, and model architecture, posing unique transferability challenges. In this paper, we introduce E-BATS, the first Efficient BAckpropagation-free TTA framework designed explicitly for speech foundation models. E-BATS achieves a balance between adaptation effectiveness and memory efficiency through three key components: (i) lightweight prompt adaptation for a forward-pass-based feature alignment, (ii) a multi-scale loss to capture both global (utterance-level) and local distribution shifts (token-level) and (iii) a test-time exponential moving average mechanism for stable adaptation across utterances. Experiments conducted on four noisy speech datasets spanning sixteen acoustic conditions demonstrate consistent improvements, with 4.1%-13.5% accuracy gains over backpropagation-free baselines and 2.0-6.4 times GPU memory savings compared to backpropagation-based methods. By enabling scalable and robust adaptation under acoustic variability, this work paves the way for developing more efficient adaptation approaches for practical speech processing systems in real-world environments.

2506.05850 2026-02-24 cs.CL cs.AI

Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models

Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo, Kang Min Yoo

Comments Preprint

详情
英文摘要

Reinforcement learning with verifiable reward (RLVR) has been instrumental in eliciting strong reasoning capabilities from large language models (LLMs) via long chains of thought (CoT). During RLVR training, we formalize and systemically study an empirical phenomenon whereby a multilingual model's CoT reverts to its dominant pre-training language (e.g., English) even when prompted in another language, which we term Cross-lingual Collapse. Because the long-CoT regime magnifies exposure to linguistic priors, the underlying trade-off between maximizing reasoning depth and preserving target-language fidelity has remained under-characterized. To examine this trade-off, we train LLMs with Group-Relative Policy Optimization (GRPO) on translated versions of math datasets widely used to elicit long-CoT reasoning. Throughout training, we track both task accuracy and the language consistency of reasoning chains. Our experiments yield three findings: (i) under RLVR, CoT in LLMs systematically drifts toward the pre-training dominant language as reasoning performance rises; (ii) English-centric priors, long-CoT GRPO optimization, task difficulty, and high-entropy decoding jointly amplify this drift, and the pattern persists beyond mathematics; and (iii) interventions that favor target-language traces--via a language-consistency reward, decoding-time controls, or more balanced backbones--mitigate collapse but reveal a persistent performance-fidelity trade-off.

2506.03867 2026-02-24 cs.CL

EuroGEST: Investigating gender stereotypes in multilingual language models

Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch

Comments In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 32074-32096, Suzhou, China. Association for Computational Linguistics. 9 pages, 5 figures, 1 table

详情
英文摘要

Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are 'beautiful', 'empathetic' and 'neat' and men are 'leaders', 'strong, tough' and 'professional'. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.

2506.00486 2026-02-24 cs.LG cs.AI stat.ML

It Takes a Good Model to Train a Good Model: Generalized Gaussian Priors for Optimized LLMs

Jun Wu, Patrick Huang, Jiangtao Wen, Yuxing Han

详情
英文摘要

Despite rapid progress in large language models (LLMs), the statistical structure of their weights, activations, and gradients-and its implications for initialization, training dynamics, and efficiency-remains largely unexplored. We empirically show that these quantities in LLMs are well modeled by generalized Gaussian (GG) distributions, and introduce a unified, end-to-end optimization framework grounded in this observation. Our contributions are threefold: (1) a GG-based initialization that aligns with trained model statistics, accelerating convergence and improving accuracy; (2) ACT, a progressive activation-constrained training method that reduces redundancy and propagation overhead; and (3) GCT, a gradient-constrained training algorithm that substantially lowers communication cost in distributed training. Experiments across diverse architectures demonstrate consistently smaller, faster models with minimal communication overhead that match or surpass standard baselines. By anchoring LLM optimization in principled statistical modeling, this work advances efficient, scalable, and hardware-aware AI systems.

2505.24183 2026-02-24 cs.LG cs.AR cs.PL

QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation

Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen

详情
英文摘要

Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage "distill-then-RL" training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12~20%, while even exceeding the performance of 671B DeepSeek-R1 on RTLLM. We have released our model, training code, and dataset to facilitate research in EDA and LLM communities.

2505.22774 2026-02-24 cs.CL

Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages

Kaja Dobrovoljc

Comments Accepted manuscript. Published in Corpus Linguistics and Linguistic Theory (2026)

Journal ref Corpus Linguistics and Linguistic Theory, 2026. Advance online publication

详情
英文摘要

This paper presents a novel treebank-driven approach to comparing syntactic structures in speech and writing using dependency-parsed corpora. Adopting a fully inductive, bottom-up method, we define syntactic structures as delexicalized dependency (sub)trees and extract them from spoken and written Universal Dependencies (UD) treebanks in two syntactically distinct languages, English and Slovenian. For each corpus, we analyze the size, diversity, and distribution of syntactic inventories, their overlap across modalities, and the structures most characteristic of speech. Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities. Strikingly, the overlap between spoken and written syntactic inventories is very limited: most structures attested in speech do not occur in writing, pointing to modality-specific preferences in syntactic organization that reflect the distinct demands of real-time interaction and elaborated writing. This contrast is further supported by a keyness analysis of the most frequent speech-specific structures, which highlights patterns associated with interactivity, context-grounding, and economy of expression. We argue that this scalable, language-independent framework offers a useful general method for systematically studying syntactic variation across corpora, laying the groundwork for more comprehensive data-driven theories of grammar in use.

2505.22147 2026-02-24 cs.AI

Lifted Forward Planning in Relational Factored Markov Decision Processes with Concurrent Actions

Florian Andreas Marwitz, Tanya Braun, Ralf Möller, Marcel Gehrke

Comments Accepted at AAMAS 2026

详情
英文摘要

When allowing concurrent actions in Markov Decision Processes, whose state and action spaces grow exponentially in the number of objects, computing a policy becomes highly inefficient, as it requires enumerating the joint of the two spaces. For the case of indistinguishable objects, we present a first-order representation to tackle the exponential blow-up in the action and state spaces. We propose Foreplan, an efficient relational forward planner, which uses the first-order representation allowing to compute policies in space and time polynomially in the number of objects. Thus, Foreplan significantly increases the number of planning problems solvable in an exact manner in reasonable time, which we underscore with a theoretical analysis. To speed up computations even further, we also introduce an approximate version of Foreplan, including guarantees on the error. Further, we provide an empirical evaluation of both Foreplan versions, demonstrating a speedup of several orders of magnitude. For the approximate version of Foreplan, we also empirically show that the induced error is often negligible.

2505.20815 2026-02-24 cs.LG

Interpretable Credit Default Prediction with Ensemble Learning and SHAP

Shiqi Yang, Ziyi Huang, Wengran Xiao, Xinyu Shen

详情
英文摘要

This study focuses on the problem of credit default prediction, builds a modeling framework based on machine learning, and conducts comparative experiments on a variety of mainstream classification algorithms. Through preprocessing, feature engineering, and model training of the Home Credit dataset, the performance of multiple models including logistic regression, random forest, XGBoost, LightGBM, etc. in terms of accuracy, precision, and recall is evaluated. The results show that the ensemble learning method has obvious advantages in predictive performance, especially in dealing with complex nonlinear relationships between features and data imbalance problems. It shows strong robustness. At the same time, the SHAP method is used to analyze the importance and dependency of features, and it is found that the external credit score variable plays a dominant role in model decision making, which helps to improve the model's interpretability and practical application value. The research results provide effective reference and technical support for the intelligent development of credit risk control systems.

2505.17779 2026-02-24 cs.CV cs.LG

U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding

Anjie Le, Henan Liu, Yue Wang, Zhenyu Liu, Rongkun Zhu, Taohan Weng, Jinze Yu, Boyang Wang, Yalun Wu, Kaiwen Yan, Quanlin Sun, Meirui Jiang, Jialun Pei, Siya Liu, Haoyun Zheng, Zhoujun Li, Alison Noble, Jacques Souquet, Xiaoqing Guo, Manxi Lin, Hongcheng Guo

Journal ref ICLR 2026

详情
英文摘要

Ultrasound is a widely-used imaging modality critical to global healthcare, yet its interpretation remains challenging due to its varying image quality on operators, noises, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 23 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.

2505.17543 2026-02-24 cs.SD cs.MM eess.AS

MEGADance: Mixture-of-Experts Architecture for Genre-Aware 3D Dance Generation

Kaixing Yang, Xulong Tang, Ziqiao Peng, Yuxuan Hu, Jun He, Hongyan Liu

Comments NeurIPS 2025

详情
英文摘要

Music-driven 3D dance generation has attracted increasing attention in recent years, with promising applications in choreography, virtual reality, and creative content creation. Previous research has generated promising realistic dance movement from audio signals. However, traditional methods underutilize genre conditioning, often treating it as auxiliary modifiers rather than core semantic drivers. This oversight compromises music-motion synchronization and disrupts dance genre continuity, particularly during complex rhythmic transitions, thereby leading to visually unsatisfactory effects. To address the challenge, we propose MEGADance, a novel architecture for music-driven 3D dance generation. By decoupling choreographic consistency into dance generality and genre specificity, MEGADance demonstrates significant dance quality and strong genre controllability. It consists of two stages: (1) High-Fidelity Dance Quantization Stage (HFDQ), which encodes dance motions into a latent representation by Finite Scalar Quantization (FSQ) and reconstructs them with kinematic-dynamic constraints, and (2) Genre-Aware Dance Generation Stage (GADG), which maps music into the latent representation by synergistic utilization of Mixture-of-Experts (MoE) mechanism with Mamba-Transformer hybrid backbone. Extensive experiments on the FineDance and AIST++ dataset demonstrate the state-of-the-art performance of MEGADance both qualitatively and quantitatively. Code is available at https://github.com/XulongT/MEGADance.

2505.16789 2026-02-24 cs.CL cs.AI cs.LG

Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards

Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin

Comments Second Conference of the International Association for Safe and Ethical Artificial Intelligence (IASEAI 2026)

详情
英文摘要

As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can inadvertently introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Vulnerability, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity across multiple experimental datasets. We then evaluate the adversarial robustness of these fine-tuned models, analyzing persona shifts and interpretability traits to understand how dataset factors contribute to attack success rates. Lastly, we explore causal relationships that offer new insights into adversarial defense strategies, highlighting the crucial role of dataset design in preserving model alignment. Our code is available at https://github.com/psyonp/accidental_vulnerability.

2505.11304 2026-02-24 cs.LG cs.AI

Heterogeneity-Aware Client Sampling for Optimal and Efficient Federated Learning

Shudi Weng, Chao Ren, Ming Xiao, Mikael Skoglund

详情
英文摘要

Federated learning (FL) commonly involves clients with diverse communication and computational capabilities. Such heterogeneity can significantly distort the optimization dynamics and lead to objective inconsistency, where the global model converges to an incorrect stationary point potentially far from the pursued optimum. Despite its critical impact, the joint effect of communication and computation heterogeneity has remained largely unexplored, due to the intrinsic complexity of their interaction. In this paper, we reveal the fundamentally distinct mechanisms through which heterogeneous communication and computation drive inconsistency in FL. To the best of our knowledge, this is the first unified theoretical analysis of general heterogeneous FL, offering a principled understanding of how these two forms of heterogeneity jointly distort the optimization trajectory under arbitrary choices of local solvers. Motivated by these insights, we propose Federated Heterogeneity-Aware Client Sampling, FedACS, a universal method to eliminate all types of objective inconsistency. We theoretically prove that FedACS converges to the correct optimum at a rate of $O(1/\sqrt{R})$, even in dynamic heterogeneous environments. Extensive experiments across multiple datasets show that FedACS outperforms state-of-the-art and category-specific baselines by 4.3%-36%, while reducing communication costs by 22%-89% and computation loads by 14%-105%, respectively.

2505.08783 2026-02-24 cs.LG cs.AI cs.CL cs.NA math.NA

CodePDE: An Inference Framework for LLM-driven PDE Solver Generation

Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar

Comments TMLR. Code available at https://github.com/LithiumDA/CodePDE

详情
英文摘要

Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). With CodePDE, we present a thorough evaluation on critical capacities of LLM for PDE solving: reasoning, debugging, self-refinement, and test-time scaling. CodePDE shows that, with advanced inference-time algorithms and scaling strategies, LLMs can achieve strong performance across a range of representative PDE problems. We also identify novel insights into LLM-driven solver generation, such as trade-offs between solver reliability and sophistication, design principles for LLM-powered PDE solving agents, and failure modes for LLM on hard tasks. These insights offer guidance for building more capable and reliable LLM-based scientific engines.

2505.02515 2026-02-24 cs.LG

FedSDAF: Leveraging Source Domain Awareness for Enhanced Federated Domain Generalization

Hongze Li, Zesheng Zhou, Zhenbiao Cao, Xinhui Li, Wei Chen, Xiaojin Zhang

详情
英文摘要

Traditional Federated Domain Generalization (FedDG) methods focus on learning domain-invariant features or adapting to unseen target domains, often overlooking the unique knowledge embedded within the source domain, especially in strictly isolated federated learning environments. Through experimentation, we discovered a counterintuitive phenomenon: features learned from a complete source domain have superior generalization capabilities compared to those learned directly from the target domain. This insight leads us to propose the Federated Source Domain Awareness Framework (FedSDAF), the first systematic approach to enhance FedDG by leveraging source domain-aware features. FedSDAF employs a dual-adapter architecture that decouples "local expertise" from "global generalization consensus." A Domain-Aware Adapter, retained locally, extracts and protects the unique discriminative knowledge of each source domain, while a Domain-Invariant Adapter, shared across clients, builds a robust global consensus. To enable knowledge exchange, we introduce a Bidirectional Knowledge Distillation mechanism that facilitates efficient dialogue between the adapters. Extensive experiments on four benchmark datasets (OfficeHome, PACS, VLCS, and DomainNet) show that FedSDAF significantly outperforms existing FedDG methods. The source code is available at https://github.com/pizzareapers/FedSDAF.

2505.02161 2026-02-24 cs.CV

Not All Pixels Are Equal: Confidence-Guided Attention for Feature Matching

Dongyue Li

详情
英文摘要

Semi-dense feature matching methods have been significantly advanced by leveraging attention mechanisms to extract discriminative descriptors. However, most existing approaches treat all pixels equally during attention computations, which can potentially introduce noise and redundancy from irrelevant regions. To address this issue, we propose a confidence-guided attention that adaptively prunes attention weights for each pixel based on precomputed matching confidence maps. These maps are generated by evaluating the mutual similarity between feature pairs extracted from the backbone, where high confidence indicates a high potential for matching. Then the attention is refined through two steps: (1) a confidence-guided bias is introduced to adaptively adjust the attention distributions for each query pixel, avoiding irrelevant interactions between non-overlap pixels; (2) the corresponding confidence map is additionally employed to rescale value features during feature aggregation, attenuating the influence of uncertain regions. Moreover, a classification loss is introduced to encourage the backbone's features to discriminate between matchable and non-matchable regions. Extensive experiments on three benchmarks demonstrate that the proposal outperforms existing state-of-the-art methods.

2504.19375 2026-02-24 cs.LG cs.SY eess.SY math.OC stat.ML

$O(1/k)$ Finite-Time Bound for Non-Linear Two-Time-Scale Stochastic Approximation

Siddharth Chandak

Comments Submitted to IEEE Transactions on Automatic Control

详情
英文摘要

Two-time-scale stochastic approximation (SA) is an algorithm with coupled iterations which has found broad applications in reinforcement learning, optimization and game control. In this work, we derive mean squared error bounds for non-linear two-time-scale iterations with contractive mappings. In the setting where both stepsizes are order $Θ(1/k)$, commonly referred to as single time-scale SA with multiple coupled sequences, we obtain the first $O(1/k)$ rate without imposing additional smoothness assumptions. In the setting with true time-scale separation, the previous best bound was $O(1/k^{2/3})$. We improve this to $O(1/k^a)$ for any $a<1$ approaching the optimal $O(1/k)$ rate. The key step in our analysis involves rewriting the original iteration in terms of an averaged noise sequence whose variance decays sufficiently fast. Additionally, we use an induction-based approach to show that the iterates are bounded in expectation. Our results apply to Polyak averaging, as well as to algorithms from reinforcement learning, and optimization, including gradient descent-ascent and two-time-scale Lagrangian optimization.

2504.10917 2026-02-24 cs.LG cs.AI

Towards A Universal Graph Structural Encoder

Jialin Chen, Haolan Zuo, Haoyu Peter Wang, Siqi Miao, Pan Li, Rex Ying

详情
英文摘要

Recent advancements in large-scale pre-training have shown the potential to learn generalizable representations for downstream tasks. In the graph domain, however, capturing and transferring structural information across different graph domains remains challenging, primarily due to the inherent differences in graph topological patterns across various contexts. For example, a social network's structure is fundamentally different from that of a product co-purchase graph. Additionally, most existing models struggle to capture the rich topological complexity of graph structures, leading to inadequate exploration of the graph embedding space. To address these challenges, we propose GFSE, a universal pre-trained graph encoder designed to capture transferable structural patterns across diverse domains such as the web graph, social networks, and citation networks. GFSE is the first cross-domain graph structural encoder pre-trained with multiple self-supervised learning objectives. Built on a Graph Transformer, GFSE incorporates attention mechanisms informed by graph structural information, enabling it to encode intricate multi-level and fine-grained topological features within complex graph structures. The pre-trained GFSE produces generic and theoretically expressive positional and structural encoding for graphs, which can be seamlessly integrated with various downstream graph feature encoders, including graph neural networks for vectorized features and Large Language Models (LLMs) for text-attributed graphs. Comprehensive experiments on synthetic and real-world datasets demonstrate GFSE's capability to significantly enhance the model's performance while requiring substantially less task-specific fine-tuning.

2504.06742 2026-02-24 cs.CV

nnLandmark: A Self-Configuring Method for 3D Medical Landmark Detection

Alexandra Ertl, Stefan Denner, Robin Peretzke, Shuhan Xiao, David Zimmerer, Maximilian Fischer, Markus Bujotzek, Xin Yang, Peter Neher, Fabian Isensee, Klaus H. Maier-Hein

详情
英文摘要

Landmark detection is central to many medical applications, such as identifying critical structures for treatment planning or defining control points for biometric measurements. However, manual annotation is labor-intensive and requires expert anatomical knowledge. While deep learning shows promise in automating this task, fair evaluation and interpretation of methods in a broader context are hindered by limited public benchmarking, inconsistent baseline implementations, and non-standardized experimentation. To overcome these pitfalls, we present nnLandmark, a self-configuring framework for 3D landmark detection that combines tailored heatmap generation, loss design, inference logic, and a robust set of hyperparameters for heatmap regression, while reusing components from nnU-Net's underlying self-configuration and training engine. nnLandmark achieves state-of-the-art performance across five public and one private dataset, benchmarked against three recently published methods. Its out-of-the-box usability enables training strong landmark detection models on new datasets without expert knowledge or dataset-specific hyperparameter tuning. Beyond accuracy, nnLandmark provides both a strong, common baseline and a flexible, standardized environment for developing and evaluating new methodological contributions. It further streamlines evaluation across multiple datasets by offering data conversion utilities for current public benchmarks. Together, these properties position nnLandmark as a central tool for advancing 3D medical landmark detection through systematic, transparent benchmarking, enabling to genuinely measure methodological progress. The code is available on GitHub: https://github.com/MIC-DKFZ/nnLandmark

2504.05806 2026-02-24 cs.AI

Meta-Continual Learning of Neural Fields

Seungyoon Woo, Junhyeog Yun, Gunhee Kim

Comments Accepted at ICLR 2025

详情
英文摘要

Neural Fields (NF) have gained prominence as a versatile framework for complex data representation. This work unveils a new problem setting termed \emph{Meta-Continual Learning of Neural Fields} (MCL-NF) and introduces a novel strategy that employs a modular architecture combined with optimization-based meta-learning. Focused on overcoming the limitations of existing methods for continual learning of neural fields, such as catastrophic forgetting and slow convergence, our strategy achieves high-quality reconstruction with significantly improved learning speed. We further introduce Fisher Information Maximization loss for neural radiance fields (FIM-NeRF), which maximizes information gains at the sample level to enhance learning generalization, with proved convergence guarantee and generalization bound. We perform extensive evaluations across image, audio, video reconstruction, and view synthesis tasks on six diverse datasets, demonstrating our method's superiority in reconstruction quality and speed over existing MCL and CL-NF approaches. Notably, our approach attains rapid adaptation of neural fields for city-scale NeRF rendering with reduced parameter requirement. Code is available at https://github.com/seungyoon-woo/mcl-nf.

2504.02996 2026-02-24 cs.LG cs.CV

Noise-Aware Generalization: Robustness to In-Domain Noise and Out-of-Domain Generalization

Siqi Wang, Aoming Liu, Bryan A. Plummer

Comments Accepted at ICLR 2026

详情
英文摘要

Methods addressing Learning with Noisy Labels (LNL) and multi-source Domain Generalization (DG) use training techniques to improve downstream task performance in the presence of label noise or domain shifts, respectively. Prior work often explores these tasks in isolation, and the limited work that does investigate their intersection, which we refer to as Noise-Aware Generalization (NAG), only benchmarks existing methods without also proposing an approach to reduce its effect. We find that this is likely due, in part, to the new challenges that arise when exploring NAG, which does not appear in LNL or DG alone. For example, we show that the effectiveness of DG methods is compromised in the presence of label noise, making them largely ineffective. Similarly, LNL methods often overfit to easy-to-learn domains as they confuse domain shifts for label noise. Instead, we propose Domain Labels for Noise Detection (DL4ND), the first direct method developed for NAG which uses our observation that noisy samples that may appear indistinguishable within a single domain often show greater variation when compared across domains. We find DL4ND outperforms DG and LNL methods, including their combinations, even when simplifying the NAG challenge by using domain labels to isolate domain shifts from noise. Performance gains up to 12.5% over seven diverse datasets with three noise types demonstrates DL4ND's ability to generalize to a wide variety of settings.

2503.24298 2026-02-24 cs.CV

Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions

Thinesh Thiyakesan Ponbagavathi, Alina Roitberg

Comments Accepted to ICRA 2026

详情
英文摘要

Fine-grained understanding of human actions is essential for safe and intuitive human--robot interaction. We study the challenge of recognizing nearly symmetric actions, such as picking up vs. placing down a tool or opening vs. closing a drawer. These actions are common in close human-robot collaboration, yet they are rare and largely overlooked in mainstream vision frameworks. Pretrained vision foundation models (VFMs) are often adapted using probing, valued in robotics for its efficiency and low data needs, or parameter-efficient fine-tuning (PEFT), which adds temporal modeling through adapters or prompts. However, our analysis shows that probing is permutation-invariant and blind to frame order, while PEFT is prone to overfitting on smaller HRI datasets, and less practical in real-world robotics due to compute constraints. To address this, we introduce STEP (Self-attentive Temporal Embedding Probing), a lightweight extension to probing that models temporal order via frame-wise positional encodings, a global CLS token, and a simplified attention block. Compared to conventional probing, STEP improves accuracy by 4--10% on nearly symmetric actions and 6--15% overall across action recognition benchmarks in human-robot-interaction, industrial assembly, and driver assistance. Beyond probing, STEP surpasses heavier PEFT methods and even outperforms fully fine-tuned models on all three benchmarks, establishing a new state-of-the-art. Code and models will be made publicly available: https://github.com/th-nesh/STEP.

2503.23377 2026-02-24 cs.CV cs.AI cs.SD eess.AS

JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Jiebo Luo, Ziwei Liu, Hao Fei, Tat-Seng Chua

Comments Accepted by ICLR 2026. Homepage: https://javisverse.github.io/JavisDiT-page/

详情
英文摘要

This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Based on the powerful Diffusion Transformer (DiT) architecture, JavisDiT simultaneously generates high-quality audio and video content from open-ended user prompts in a unified framework. To ensure audio-video synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, which consists of 10,140 high-quality text-captioned sounding videos and focuses on synchronization evaluation in diverse and complex real-world scenarios. Further, we specifically devise a robust metric for measuring the synchrony between generated audio-video pairs in real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and data are available at https://javisverse.github.io/JavisDiT-page/.

2503.21258 2026-02-24 cs.CV cs.AI

Learn by Reasoning: Analogical Weight Generation for Few-Shot Class-Incremental Learning

Jizhou Han, Chenhao Ding, Yuhang He, Songlin Dong, Qiang Wang, Xinyuan Gao, Yihong Gong

Comments Accepted by IEEE TCSVT. This is the author's version which has not been fully edited and content may change prior to final publication

详情
英文摘要

Few-shot class-incremental Learning (FSCIL) enables models to learn new classes from limited data while retaining performance on previously learned classes. Traditional FSCIL methods often require fine-tuning parameters with limited new class data and suffer from a separation between learning new classes and utilizing old knowledge. Inspired by the analogical learning mechanisms of the human brain, we propose a novel analogical generative method. Our approach includes the Brain-Inspired Analogical Generator (BiAG), which derives new class weights from existing classes without parameter fine-tuning during incremental stages. BiAG consists of three components: Weight Self-Attention Module (WSA), Weight & Prototype Analogical Attention Module (WPAA), and Semantic Conversion Module (SCM). SCM uses Neural Collapse theory for semantic conversion, WSA supplements new class weights, and WPAA computes analogies to generate new class weights. Experiments on miniImageNet, CUB-200, and CIFAR-100 datasets demonstrate that our method achieves higher final and average accuracy compared to SOTA methods.

2503.14720 2026-02-24 cs.CV

ShapeShift: Text-to-Mosaic Synthesis via Semantic Phase-Field Guidance

Vihaan Misra, Peter Schaldenbrand, Jean Oh

详情
英文摘要

We present ShapeShift, a method for arranging rigid objects into configurations that visually convey semantic concepts specified by natural language. While pretrained diffusion models provide powerful semantic guidance, such as Score Distillation Sampling, enforcing physical validity poses a fundamental challenge. Naive overlap resolution disrupts semantic structure -- separating overlapping shapes along geometrically optimal directions (minimum translation vectors) often destroys the very arrangements that make concepts recognizable. Our intuition is that diffusion model features encode not just what a concept looks like, but its geometric, directional structure -- how it is oriented and shaped -- which we leverage to make overlap resolution semantically aware. We introduce a deformable boundary represented as a phase field that expands anisotropically, guided by intermediate features from the diffusion model, creating space along semantically coherent directions. Experiments demonstrate that ShapeShift, by coupling semantic guidance and feasibility constraint resolution, produces arrangements achieving both semantic clarity and overlap-free validity, significantly outperforming baselines that treat these objectives independently.

2503.12047 2026-02-24 cs.CV

PSGait: Gait Recognition using Parsing Skeleton

Hangrui Xu, Zhengxian Wu, Chuanrui Zhang, Zhuohong Chen, Zhifang Liu, Peng Jiao, Haoqian Wang

Comments Accepted by ICASSP 2026

详情
英文摘要

Gait recognition has emerged as a robust biometric modality due to its non-intrusive nature. Conventional gait recognition methods mainly rely on silhouettes or skeletons. While effective in controlled laboratory settings, their limited information entropy restricts generalization to real-world scenarios. To overcome this, we propose a novel representation called \textbf{Parsing Skeleton}, which uses a skeleton-guided human parsing method to capture fine-grained body dynamics with much higher information entropy. To effectively explore the capability of the Parsing Skeleton, we also introduce \textbf{PSGait}, a framework that fuses Parsing Skeleton with silhouettes to enhance individual differentiation. Comprehensive benchmarks demonstrate that PSGait outperforms state-of-the-art multimodal methods while significantly reducing computational resources. As a plug-and-play method, it achieves an improvement of up to 15.7\% in the accuracy of Rank-1 in various models. These results validate the Parsing Skeleton as a \textbf{lightweight}, \textbf{effective}, and highly \textbf{generalizable} representation for gait recognition in the wild. Code is available at https://github.com/realHarryX/PSGait.

2503.04940 2026-02-24 cs.CL cs.AI

VQEL: Enabling Self-Play in Emergent Language Games via Agent-Internal Vector Quantization

Mohammad Mahdi Samiei Paqaleh, Mehdi Jamalkhah, Mahdieh Soleymani Baghshah

详情
英文摘要

Emergent Language (EL) focuses on the emergence of communication among artificial agents. Although symbolic communication channels more closely mirror the discrete nature of human language, learning such protocols remains fundamentally difficult due to the non-differentiability of symbol sampling. Existing approaches typically rely on high-variance gradient estimators such as REINFORCE or on continuous relaxations such as Gumbel-Softmax, both of which suffer from limitations in training stability and scalability. Motivated by cognitive theories that emphasize intrapersonal processes preceding communication, we explore self-play as a substrate for language emergence prior to mutual interaction. We introduce Vector Quantized Emergent Language (VQEL), a novel architecture that incorporates vector quantization into the message generation process. VQEL enables agents to perform self-play using discrete internal representations derived from a learned codebook while preserving end-to-end differentiability. Moreover, the resulting vector-quantized codebook naturally induces a symbolic vocabulary that can be directly transferred and aligned during subsequent mutual play with other agents. Empirical results show that agents pretrained via VQEL self-play achieve more consistent symbol alignment and higher task success when later engaged in mutual interaction. These findings position self-play as a principled and effective mechanism for learning discrete communication protocols, addressing key optimization and representational challenges in emergent language systems.