arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.18825 2026-02-24 cs.LG cs.CV

Bayesian Lottery Ticket Hypothesis

Nicholas Kuhn, Arvid Weyrauch, Lars Heyen, Achim Streit, Markus Götz, Charlotte Debus

English abstract

Bayesian neural networks (BNNs) are a useful tool for uncertainty quantification, but require substantially more computational resources than conventional neural networks. For non-Bayesian networks, the Lottery Ticket Hypothesis (LTH) posits the existence of sparse subnetworks that can be trained to match or even surpass the accuracy of the original dense network. Such sparse networks can lower the demand for computational resources at inference and during training. If the LTH holds in BNNs and corresponding sparse subnetworks exist, this could motivate the development of sparse training algorithms and provide valuable insights into the underlying training process. Towards this end, we translate the LTH experiments to a Bayesian setting using common computer vision models. We investigate the defining characteristics of Bayesian lottery tickets, and extend our study towards a transplantation method connecting BNNs with deterministic Lottery Tickets. We generally find that the LTH holds in BNNs, and winning tickets of matching and surpassing accuracy are present independent of model size, with degradation at very high sparsities. However, the pruning strategy should rely primarily on magnitude and only secondarily on standard deviation. Furthermore, our results demonstrate that models rely on mask structure and weight initialization to varying degrees.
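The pruning finding in this abstract (magnitude first, standard deviation second) can be illustrated with a minimal sketch. It assumes each weight has a posterior mean and a posterior standard deviation, and it reads "secondly" as a tie-breaker that prunes the more uncertain weight first. Both the lexicographic ordering and the tie-break direction are assumptions made for illustration, and `prune_mask` is a hypothetical name; the abstract does not specify the authors' exact criterion.

```python
def prune_mask(means, stds, sparsity):
    """Return a 0/1 keep-mask that prunes a `sparsity` fraction of weights.

    Assumed criterion (not confirmed by the abstract): rank weights by
    |mean| ascending; among equal magnitudes, prune the larger std first.
    """
    n_prune = int(len(means) * sparsity)
    # Sort weight indices so that the first entries are pruned first.
    order = sorted(range(len(means)), key=lambda i: (abs(means[i]), -stds[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(means))]


# Toy posterior parameters for four weights (illustrative values only).
means = [0.9, -0.1, 0.1, -0.5]
stds = [0.2, 0.3, 0.05, 0.1]

# Weights 1 and 2 tie on magnitude; weight 1 has the larger std and goes first.
mask = prune_mask(means, stds, 0.25)
print(mask)
```

In iterative lottery-ticket style experiments, a mask like this would typically be recomputed after each training round at increasing sparsity.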

2602.18817 2026-02-24 cs.CV

HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation

Chongyang Xu, Shen Cheng, Haipeng Li, Haoqiang Fan, Ziliang Feng, Shuaicheng Liu

Comments Accepted by ICRA 2026

English abstract

Imitation learning for robotic manipulation has progressed from 2D image policies to 3D representations that explicitly encode geometry. Yet purely geometric policies often lack explicit part-level semantics, which are critical for pose-aware manipulation (e.g., distinguishing a shoe's toe from heel). In this paper, we present HeRO, a diffusion-based policy that couples geometry and semantics via hierarchical semantic fields. HeRO employs dense semantics lifting to fuse discriminative, geometry-sensitive features from DINOv2 with the smooth, globally coherent correspondences from Stable Diffusion, yielding dense features that are both fine-grained and spatially consistent. These features are processed and partitioned to construct a global field and a set of local fields. A hierarchical conditioning module conditions the generative denoiser on global and local fields using a permutation-invariant network architecture, thereby avoiding order-sensitive bias and producing a coherent control policy for pose-aware manipulation. In various tests, HeRO establishes a new state-of-the-art, improving success on Place Dual Shoes by 12.3% and averaging 6.5% gains across six challenging pose-aware tasks. Code is available at https://github.com/Chongyang-99/HeRO.

2602.18814 2026-02-24 cs.RO cs.SY eess.SY

RotorSuite: A MATLAB/Simulink Toolbox for Tilt Multi-Rotor UAV Modeling

Nicola Cigarini, Giulia Michieletto, Angelo Cenedese

English abstract

In recent years, aerial platforms have evolved from passive flying sensors into versatile, contact-aware robotic systems, leading to rapid advances in platform design. Standard coplanar and collinear quadrotors have been complemented by modern tilted and tilting multi-rotor platforms with enhanced maneuverability. To properly analyze, control, and validate the performance of these emerging platforms, an accurate modeling step is required; however, this can be time-consuming, user-dependent and error-prone. To address this issue, we propose a MATLAB/Simulink toolbox for modeling and simulating the dynamics of a broad class of multi-rotor platforms through both analytical and physics-based approaches. The toolbox, named RotorSuite, is provided with comprehensive documentation and example use cases, representing a valuable tool for didactic, research, and industrial development purposes.

2602.18813 2026-02-24 cs.RO cs.LG

Habilis-$β$: A Fast-Motion and Long-Lasting On-Device Vision-Language-Action Model

Tommoro Robotics: Jesoon Kang, Taegeon Park, Jisu An, Soo Min Kimm, Jaejoon Kim, Jinu Pahk, Byungju Kim, Junseok Lee, Namheon Baek, Sungwan Ha, Hojun Baek, Eduardo Ayerve Cruz, Wontae Kim, Junghyeon Choi, Yousuk Lee, Joonmo Han, Sunghyun Cho, Sunghyun Kwon, Soyoung Lee, Jun Ki Lee, Seung-Joon Yi, Byoung-Tak Zhang, Theo Taeyeong Kim

English abstract

We introduce Habilis-$β$, a fast-motion and long-lasting on-device vision-language-action (VLA) model designed for real-world deployment. Current VLA evaluation remains largely confined to single-trial success rates under curated resets, which fails to capture the fast-motion and long-lasting capabilities essential for practical operation. To address this, we introduce the Productivity-Reliability Plane (PRP), which evaluates performance through Tasks per Hour (TPH) and Mean Time Between Intervention (MTBI) under a continuous-run protocol that demands both high-speed execution and sustained robustness. Habilis-$β$ achieves high performance by integrating language-free pre-training on large-scale play data for robust interaction priors with post-training on cyclic task demonstrations that capture state drift across consecutive task iterations. The system further employs ESPADA for phase-adaptive motion shaping to accelerate free-space transit, utilizes rectified-flow distillation to enable high-frequency control on edge devices, and incorporates classifier-free guidance (CFG) as a deployment-time knob to dynamically balance instruction adherence and learned interaction priors. In 1-hour continuous-run evaluations, Habilis-$β$ achieves strong performance under the PRP metrics, compared to $π_{0.5}$ in both simulation and real-world environments. In simulation, Habilis-$β$ achieves 572.6 TPH and 39.2 s MTBI (vs. 120.5 TPH and 30.5 s for $π_{0.5}$), while in a real-world humanoid logistics workflow it achieves 124 TPH and 137.4 s MTBI (vs. 19 TPH and 46.1 s for $π_{0.5}$). Finally, Habilis-$β$ achieves the highest reported performance on the standard RoboTwin 2.0 leaderboard across representative tasks, validating its effectiveness in complex manipulation scenarios.
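The two PRP axes named in this abstract can be sketched as simple ratios over a continuous-run log. The definitions below are assumptions inferred from the metric names (tasks completed per hour of run time, and run time divided by the number of human interventions); the abstract does not give the exact formulas, and the function names and log values are hypothetical.

```python
def tasks_per_hour(completed_tasks, run_seconds):
    """Assumed TPH: completed tasks scaled to a one-hour rate."""
    return completed_tasks * 3600.0 / run_seconds


def mean_time_between_interventions(run_seconds, interventions):
    """Assumed MTBI in seconds: run time divided by intervention count."""
    if interventions == 0:
        # Assumed convention: no interventions reports the full run length.
        return float(run_seconds)
    return run_seconds / interventions


# Hypothetical 1-hour run log (the intervention count is illustrative only).
run_seconds = 3600
tph = tasks_per_hour(124, run_seconds)
mtbi = mean_time_between_interventions(run_seconds, 26)
print(f"TPH={tph:.1f}, MTBI={mtbi:.1f} s")
```

Plotting TPH against MTBI for each model would give one point per model on the plane, which is the comparison the abstract reports numerically.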

2602.18812 2026-02-24 cs.AI

GenPlanner: From Noise to Plans -- Emergent Reasoning in Flow Matching and Diffusion Models

Agnieszka Polowczyk, Alicja Polowczyk, Michał Wieczorek

English abstract

Path planning in complex environments is one of the key problems of artificial intelligence because it requires simultaneous understanding of the geometry of space and the global structure of the problem. In this paper, we explore the potential of using generative models as planning and reasoning mechanisms. We propose GenPlanner, an approach based on diffusion models and flow matching, along with two variants: DiffPlanner and FlowPlanner. We demonstrate the application of generative models to find and generate correct paths in mazes. A multi-channel condition describing the structure of the environment, including an obstacle map and information about the starting and destination points, is used to condition trajectory generation. Unlike standard methods, our models generate trajectories iteratively, starting with random noise and gradually transforming it into a correct solution. Experiments conducted show that the proposed approach significantly outperforms the baseline CNN model. In particular, FlowPlanner demonstrates high performance even with a limited number of generation steps.

2602.18811 2026-02-24 cs.CV

Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection

Wanqi Wang, Jingcai Guo, Yuxiang Cai, Zhi Chen

Comments Accepted to CVPR 2026 Findings

English abstract

Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples. While open-vocabulary detectors built on vision-language models (VLMs) transfer well, they depend almost entirely on text prompts, which encode domain-invariant semantics but miss domain-specific visual information needed for precise localization under few-shot supervision. We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain. A Visual Prototype Construction module aggregates class-level prototypes from support RoIs and dynamically generates hard-negative prototypes in query images via jittered boxes, capturing distractors and visually similar backgrounds. In the visual-guided branch, we inject these prototypes into the detection pipeline with components mirrored from the text branch as the starting point for training, while a parallel text-guided branch preserves open-vocabulary semantics. The branches are trained jointly and ensembled at inference by combining semantic abstraction with domain-adaptive details. On six cross-domain benchmark datasets and standard 1/5/10-shot settings, our method achieves state-of-the-art or highly competitive mAP.

2602.18806 2026-02-24 cs.CL cs.AI

Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models

Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma

English abstract

Large Language Models (LLMs) demonstrate strong reasoning performance, yet their ability to reliably monitor, diagnose, and correct their own errors remains limited. We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight dual-process MetaController for adaptive effort allocation. Across diverse reasoning and diagnostic benchmarks (GSM8K, CRUXEval, MBPP, AIME, CorrectBench, and TruthfulQA) using Llama-3 and Qwen-3 (8B), explicit regulatory structuring substantially improves error diagnosis and yields a threefold increase in successful self-correction. Blinded human evaluations over 580 query pairs show an 84% aggregate preference for trustworthiness and metacognitive self-awareness over standard and Chain-of-Thought baselines. Grounding LLM reasoning in established cognitive theory offers a principled path toward more transparent and diagnostically robust AI systems.

2602.18803 2026-02-24 cs.RO

Learning to Localize Reference Trajectories in Image-Space for Visual Navigation

Finn Lukas Busch, Matti Vahs, Quantao Yang, Jesús Gerardo Ortega Peimbert, Yixi Cai, Jana Tumova, Olov Andersson

English abstract

We present LoTIS, a model for visual navigation that provides robot-agnostic image-space guidance by localizing a reference RGB trajectory in the robot's current view, without requiring camera calibration, poses, or robot-specific training. Instead of predicting actions tied to specific robots, we predict the image-space coordinates of the reference trajectory as they would appear in the robot's current view. This creates robot-agnostic visual guidance that easily integrates with local planning. Consequently, our model's predictions provide guidance zero-shot across diverse embodiments. By decoupling perception from action and learning to localize trajectory points rather than imitate behavioral priors, we enable a cross-trajectory training strategy for robustness to viewpoint and camera changes. We outperform state-of-the-art methods by 20-50 percentage points in success rate on conventional forward navigation, achieving 94-98% success rate across diverse sim and real environments. Furthermore, we achieve over 5x improvements on challenging tasks where baselines fail, such as backward traversal. The system is straightforward to use: we show how even a video from a phone camera directly enables different robots to navigate to any point on the trajectory. Videos, demo, and code are available at https://finnbusch.com/lotis.

2602.18802 2026-02-24 cs.SD

Multi-Channel Speech Enhancement for Cocktail Party Speech Emotion Recognition

Youjun Chen, Guinan Li, Mengzhe Geng, Xurong Xie, Shujie Hu, Huimeng Wang, Haoning Xu, Chengxi Deng, Jiajun Deng, Zhaoqing Li, Mingyu Cui, Xunying Liu

Comments Accepted by ICASSP 2026

English abstract

This paper highlights the critical importance of multi-channel speech enhancement (MCSE) for speech emotion recognition (ER) in cocktail party scenarios. A multi-channel speech dereverberation and separation front-end integrating DNN-WPE and mask-based MVDR is used to extract the target speaker's speech from the mixture speech, which is then fed into the downstream ER back-end using HuBERT- and ViT-based speech and visual features. Experiments on mixture speech constructed using the IEMOCAP and MSP-FACE datasets suggest the MCSE output consistently outperforms domain fine-tuned single-channel speech representations produced by: a) Conformer-based metric GANs; and b) WavLM SSL features with optional SE-ER dual task fine-tuning. Statistically significant increases in weighted accuracy, unweighted accuracy and F1 measures by up to 9.5%, 8.5% and 9.1% absolute (17.1%, 14.7% and 16.0% relative) are obtained over the above single-channel baselines. The generalization of IEMOCAP-trained MCSE front-ends is also shown when they are applied zero-shot to out-of-domain MSP-FACE data.

2602.18799 2026-02-24 cs.CV

Rethinking Preference Alignment for Diffusion Models with Classifier-Free Guidance

Zhou Jiang, Yandong Wen, Zhen Liu

English abstract

Aligning large-scale text-to-image diffusion models with nuanced human preferences remains challenging. While direct preference optimization (DPO) is simple and effective, large-scale finetuning often shows a generalization gap. We take inspiration from test-time guidance and cast preference alignment as classifier-free guidance (CFG): a finetuned preference model acts as an external control signal during sampling. Building on this view, we propose a simple method that improves alignment without retraining the base model. To further enhance generalization, we decouple preference learning into two modules trained on positive and negative data, respectively, and form a contrastive guidance vector at inference by subtracting their predictions (positive minus negative), scaled by a user-chosen strength and added to the base prediction at each step. This yields a sharper and controllable alignment signal. We evaluate on Stable Diffusion 1.5 and Stable Diffusion XL with Pick-a-Pic v2 and HPDv3, showing consistent quantitative and qualitative gains.
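The guidance rule described in this abstract (positive minus negative, scaled by a strength, added to the base prediction) can be written as a single per-step update. This is a minimal sketch on plain Python lists standing in for noise-prediction tensors; the function name `contrastive_guidance` and the toy values are hypothetical, and real implementations would operate on model outputs at each sampling step.

```python
def contrastive_guidance(base, pos, neg, strength):
    """Return base + strength * (pos - neg), element-wise.

    `base`, `pos`, `neg` stand in for the predictions of the base model and
    of the modules trained on positive and negative preference data.
    """
    return [b + strength * (p - n) for b, p, n in zip(base, pos, neg)]


# Toy two-dimensional "predictions" (illustrative values only).
base_pred = [1.0, 2.0]
pos_pred = [0.5, 0.5]
neg_pred = [0.25, 1.0]

guided = contrastive_guidance(base_pred, pos_pred, neg_pred, strength=2.0)
print(guided)
```

Setting `strength` to 0 recovers the base prediction unchanged, which is what makes the strength a user-controllable knob at inference time.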

2602.18795 2026-02-24 cs.LG stat.ML

Vectorized Bayesian Inference for Latent Dirichlet-Tree Allocation

Zheng Wang, Nizar Bouguila

Comments Submitted to JMLR, under review

English abstract

Latent Dirichlet Allocation (LDA) is a foundational model for discovering latent thematic structure in discrete data, but its Dirichlet prior cannot represent the rich correlations and hierarchical relationships often present among topics. We introduce the framework of Latent Dirichlet-Tree Allocation (LDTA), a generalization of LDA that replaces the Dirichlet prior with an arbitrary Dirichlet-Tree (DT) distribution. LDTA preserves LDA's generative structure but enables expressive, tree-structured priors over topic proportions. To perform inference, we develop universal mean-field variational inference and Expectation Propagation, providing tractable updates for all DT distributions. We reveal the vectorized nature of the two inference methods through theoretical development, and provide fully vectorized, GPU-accelerated implementations. The resulting framework substantially expands the modeling capacity of LDA while maintaining scalability and computational efficiency.

2602.18793 2026-02-24 cs.LG

From Few-Shot to Zero-Shot: Towards Generalist Graph Anomaly Detection

Yixin Liu, Shiyuan Li, Yu Zheng, Qingfeng Chen, Chengqi Zhang, Philip S. Yu, Shirui Pan

Comments 19 pages, 12 figures, 5 tables

English abstract

Graph anomaly detection (GAD) is critical for identifying abnormal nodes in graph-structured data from diverse domains, including cybersecurity and social networks. The existing GAD methods often focus on the learning paradigms of "one-model-for-one-dataset", requiring dataset-specific training for each dataset to achieve optimal performance. However, this paradigm suffers from significant limitations, such as high computational and data costs, limited generalization and transferability to new datasets, and challenges in privacy-sensitive scenarios where access to full datasets or sufficient labels is restricted. To address these limitations, we propose a novel generalist GAD paradigm that aims to develop a unified model capable of detecting anomalies on multiple unseen datasets without extensive retraining/fine-tuning or dataset-specific customization. To this end, we propose ARC, a few-shot generalist GAD method that leverages in-context learning and requires only a few labeled normal samples at inference time. Specifically, ARC consists of three core modules: a feature Alignment module to unify and align features across datasets, a Residual GNN encoder to capture dataset-agnostic anomaly representations, and a cross-attentive in-Context learning module to score anomalies using few-shot normal context. Building on ARC, we further introduce ARC_zero for the zero-shot generalist GAD setting, which selects representative pseudo-normal nodes via a pseudo-context mechanism and thus enables fully label-free inference on unseen datasets. Extensive experiments on 17 real-world graph datasets demonstrate that both ARC and ARC_zero effectively detect anomalies, exhibit strong generalization ability, and perform efficiently under few-shot and zero-shot settings.

2602.18786 2026-02-24 cs.LG cs.IR

CaliCausalRank: Calibrated Multi-Objective Ad Ranking with Robust Counterfactual Utility Optimization

Xikai Yang, Sebastian Sun, Yilin Li, Yue Xing, Ming Wang, Yang Wang

English abstract

Ad ranking systems must simultaneously optimize multiple objectives including click-through rate (CTR), conversion rate (CVR), revenue, and user experience metrics. However, production systems face critical challenges: score scale inconsistency across traffic segments undermines threshold transferability, and position bias in click logs causes offline-online metric discrepancies. We propose CaliCausalRank, a unified framework that integrates training-time scale calibration, constraint-based multi-objective optimization, and robust counterfactual utility estimation. Our approach treats score calibration as a first-class training objective rather than post-hoc processing, employs Lagrangian relaxation for constraint satisfaction, and utilizes variance-reduced counterfactual estimators for reliable offline evaluation. Experiments on the Criteo and Avazu datasets demonstrate that CaliCausalRank achieves 1.1% relative AUC improvement, 31.6% calibration error reduction, and 3.2% utility gain compared to the best baseline (PairRank) while maintaining consistent performance across different traffic segments.

2602.18776 2026-02-24 cs.CL cs.AI

ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan

English abstract

We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9). We evaluate 71 models from 10 providers using four prompting strategies (zero-shot, zero-shot CoT, few-shot, few-shot CoT) on 210 number reading tasks spanning six contextual categories: pure numerals, addresses, dates, quantities, and prices. Our evaluation comprises 59,010 individual test cases and tracks extraction methods to measure structured output generation. Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies. Few-shot Chain-of-Thought prompting achieves 2.8x higher accuracy than zero-shot approaches (80.06% vs 28.76%). A striking finding emerges: models achieving elite accuracy (98-99%) often produce predominantly unstructured output, with most responses lacking Arabic CoT markers. Only 6 models consistently generate structured output across all test cases, while the majority require fallback extraction methods despite high numerical accuracy. Comprehensive evaluation of 281 model-strategy combinations demonstrates that numerical accuracy and instruction-following represent distinct capabilities, establishing baselines for Arabic number comprehension and providing actionable guidance for model selection in production Arabic NLP systems.
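The headline counts in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above; the variable names are illustrative. Note that 71 models times 4 strategies would give 284 combinations, while the abstract reports 281 and does not explain the difference.

```python
# Figures quoted in the abstract.
combinations = 281          # evaluated model-strategy combinations
tasks = 210                 # number reading tasks
reported_cases = 59_010     # individual test cases

few_shot_cot_acc = 80.06    # percent
zero_shot_acc = 28.76       # percent

# Test cases should equal combinations times tasks.
total_cases = combinations * tasks
print(total_cases == reported_cases)

# Ratio behind the "2.8x higher accuracy" claim.
ratio = few_shot_cot_acc / zero_shot_acc
print(round(ratio, 1))

# Full grid size if every model ran every strategy.
print(71 * 4)
```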

2602.18773 2026-02-24 cs.AI

LAMMI-Pathology: A Tool-Centric Bottom-Up LVLM-Agent Framework for Molecularly Informed Medical Intelligence in Pathology

Haoyang Su, Shaoting Zhang, Xiaosong Wang

English abstract

The emergence of tool-calling-based agent systems introduces a more evidence-driven paradigm for pathology image analysis in contrast to the coarse-grained text-image diagnostic approaches. With the recent large-scale experimental adoption of spatial transcriptomics technologies, molecularly validated pathological diagnosis is becoming increasingly open and accessible. In this work, we propose LAMMI-Pathology (LVLM-Agent System for Molecularly Informed Medical Intelligence in Pathology), a scalable agent framework for domain-specific agent tool-calling. LAMMI-Pathology adopts a tool-centric, bottom-up architecture in which customized domain-adaptive tools serve as the foundation. These tools are clustered by domain style to form component agents, which are then coordinated through a top-level planner hierarchically, avoiding excessively long context lengths that could induce task drift. Based on that, we introduce a novel trajectory construction mechanism based on Atomic Execution Nodes (AENs), which serve as reliable and composable units for building semi-simulated reasoning trajectories that capture credible agent-tool interactions. Building on this foundation, we develop a trajectory-aware fine-tuning strategy that aligns the planner's decision-making process with these multi-step reasoning trajectories, thereby enhancing inference robustness in pathology understanding and its adaptive use of the customized toolset.

2602.18769 2026-02-24 cs.LG cs.AI

GLaDiGAtor: Language-Model-Augmented Multi-Relation Graph Learning for Predicting Disease-Gene Associations

Osman Onur Kuzucu, Tunca Doğan

English abstract

Understanding disease-gene associations is essential for unravelling disease mechanisms and advancing diagnostics and therapeutics. Traditional approaches based on manual curation and literature review are labour-intensive and not scalable, prompting the use of machine learning on large biomedical data. In particular, graph neural networks (GNNs) have shown promise for modelling complex biological relationships. To address limitations in existing models, we propose GLaDiGAtor (Graph Learning-bAsed DIsease-Gene AssociaTiOn pRediction), a novel GNN framework with an encoder-decoder architecture for disease-gene association prediction. GLaDiGAtor constructs a heterogeneous biological graph integrating gene-gene, disease-disease, and gene-disease interactions from curated databases, and enriches each node with contextual features from well-known language models (ProtT5 for protein sequences and BioBERT for disease text). In evaluations, our model achieves superior predictive accuracy and generalisation, outperforming 14 existing methods. Literature-supported case studies confirm the biological relevance of high-confidence novel predictions, highlighting GLaDiGAtor's potential to discover candidate disease genes. These results underscore the power of graph convolutional networks in biomedical informatics and may ultimately facilitate drug discovery by revealing new gene-disease links. The source code and processed datasets are publicly available at https://github.com/HUBioDataLab/GLaDiGAtor.

2602.18766 2026-02-24 cs.CV

Initialization matters in few-shot adaptation of vision-language models for histopathological image classification

Pablo Meseguer, Rocío del Amor, Valery Naranjo

Comments Accepted as oral presentation at CASEIB 2024 held in Sevilla, Spain

English abstract

Vision-language models (VLMs) pre-trained on datasets of histopathological image-caption pairs have enabled zero-shot slide-level classification. The ability of VLM image encoders to extract discriminative features also opens the door for supervised fine-tuning for whole-slide image (WSI) classification, ideally using few labeled samples. Slide-level prediction frameworks require the incorporation of multiple instance learning (MIL) due to the gigapixel size of the WSI. Following patch-level feature extraction and aggregation, MIL frameworks rely on linear classifiers trained on top of the slide-level aggregated features. Classifier weight initialization has a large influence on Linear Probing performance in efficient transfer learning (ETL) approaches based on few-shot learning. In this work, we propose Zero-Shot Multiple-Instance Learning (ZS-MIL) to address the limitations of random classifier initialization, which underperforms zero-shot prediction in MIL problems. ZS-MIL uses the class-level embeddings of the VLM text encoder as the classification layer's starting point to compute each sample's bag-level probabilities. Through multiple experiments, we demonstrate the robustness of ZS-MIL compared to well-known weight initialization techniques both in terms of performance and variability in an ETL few-shot scenario for subtyping prediction.
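The initialization idea in this abstract (start the linear classifier from the text encoder's class embeddings) can be sketched in plain Python. The sketch assumes mean pooling as the patch aggregator and omits feature normalization and temperature scaling; these choices, the function names, and the toy vectors are assumptions for illustration, not the authors' implementation.

```python
import math


def mean_pool(patch_features):
    """Aggregate patch-level features into one bag-level vector (assumed aggregator)."""
    n = len(patch_features)
    return [sum(column) / n for column in zip(*patch_features)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_init_probs(patch_features, class_text_embeddings):
    """Bag-level class probabilities with classifier weights copied from text embeddings."""
    bag = mean_pool(patch_features)
    # The classifier's starting weights are the class text embeddings, one row per class.
    weights = [list(embedding) for embedding in class_text_embeddings]
    logits = [sum(w * x for w, x in zip(row, bag)) for row in weights]
    return softmax(logits)


# Toy slide with two patches and two classes (illustrative vectors only).
patches = [[1.0, 0.0], [1.0, 0.0]]
class_embeddings = [[1.0, 0.0], [0.0, 1.0]]
probs = zero_shot_init_probs(patches, class_embeddings)
print(probs)
```

Before any few-shot training step, this classifier reproduces a zero-shot style prediction, so fine-tuning starts from that baseline rather than from random weights.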

2602.18765 2026-02-24 cs.CV

A high-resolution nationwide urban village mapping product for 342 Chinese cities based on foundation models

Lubin Bai, Sheng Xiao, Ziyu Yin, Haoyu Wang, Siyang Wu, Xiuyuan Zhang, Shihong Du

Comments Submitted to Earth System Science Data

English abstract

Urban Villages (UVs) represent a distinctive form of high-density informal settlement embedded within China's rapidly urbanizing cities. Accurate identification of UVs is critical for urban governance, renewal, and sustainable development. However, due to the pronounced heterogeneity and diversity of UVs across China's vast territory, a consistent and reliable nationwide dataset has been lacking. In this work, we present GeoLink-UV, a high-resolution nationwide UV mapping product that clearly delineates the locations and boundaries of UVs in 342 Chinese cities. The dataset is derived from multisource geospatial data, including optical remote sensing images and geo-vector data, and is generated through a foundation model-driven mapping framework designed to address the generalization issues and improve the product quality. A geographically stratified accuracy assessment based on independent samples from 28 cities confirms the reliability and scientific credibility of the nationwide dataset across heterogeneous urban contexts. Based on this nationwide product, we reveal substantial interregional disparities in UV prevalence and spatial configuration. On average, UV areas account for 8% of built-up land, with marked clustering in central and south China. Building-level analysis further confirms a consistent low-rise, high-density development pattern of UVs nationwide, while highlighting regionally differentiated morphological characteristics. The GeoLink-UV dataset provides an open and systematically validated geospatial foundation for urban studies, informal settlement monitoring, and evidence-based urban renewal planning, and contributes directly to large-scale assessments aligned with Sustainable Development Goal 11. The GeoLink-UV dataset introduced in this article is freely available at https://doi.org/10.5281/zenodo.18688062.

2602.18763 2026-02-24 cs.CV cs.AI

TAG: Thinking with Action Unit Grounding for Facial Expression Recognition

Haobo Lin, Tianyi Bai, Jiajun Zhang, Xuanhao Chang, Sheng Lu, Fangming Gu, Zengjie Hu, Wentao Zhang

Comments 33 pages, 8 figures

English abstract

Facial Expression Recognition (FER) is a fine-grained visual understanding task where reliable predictions require reasoning over localized and meaningful facial cues. Recent vision-language models (VLMs) enable natural language explanations for FER, but their reasoning is often ungrounded, producing fluent yet unverifiable rationales that are weakly tied to visual evidence and prone to hallucination, leading to poor robustness across different datasets. We propose TAG (Thinking with Action Unit Grounding), a vision-language framework that explicitly constrains multimodal reasoning to be supported by facial Action Units (AUs). TAG requires intermediate reasoning steps to be grounded in AU-related facial regions, yielding predictions accompanied by verifiable visual evidence. The model is trained via supervised fine-tuning on AU-grounded reasoning traces followed by reinforcement learning with an AU-aware reward that aligns predicted regions with external AU detectors. Evaluated on RAF-DB, FERPlus, and AffectNet, TAG consistently outperforms strong open-source and closed-source VLM baselines while simultaneously improving visual faithfulness. Ablation and preference studies further show that AU-grounded rewards stabilize reasoning and mitigate hallucination, demonstrating the importance of structured grounded intermediate representations for trustworthy multimodal reasoning in FER. The code will be available at https://github.com/would1920/FER_TAG.

2602.18752 2026-02-24 cs.CV cs.GR

Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement

Yuran Dong, Hang Dai, Mang Ye

Comments ICLR 26

English abstract

Multimodal editing large models have demonstrated powerful editing capabilities across diverse tasks. However, a persistent and long-standing limitation is the decline in facial identity (ID) consistency during realistic portrait editing. Due to the human eye's high sensitivity to facial features, such inconsistency significantly hinders the practical deployment of these models. Current facial ID preservation methods struggle to achieve consistent restoration of both facial identity and edited element IP due to Cross-source Distribution Bias and Cross-source Feature Contamination. To address these issues, we propose EditedID, an Alignment-Disentanglement-Entanglement framework for robust identity-specific facial restoration. By systematically analyzing diffusion trajectories, sampler behaviors, and attention properties, we introduce three key components: 1) Adaptive mixing strategy that aligns cross-source latent representations throughout the diffusion process. 2) Hybrid solver that disentangles source-specific identity attributes and details. 3) Attentional gating mechanism that selectively entangles visual elements. Extensive experiments show that EditedID achieves state-of-the-art performance in preserving original facial ID and edited element IP consistency. As a training-free and plug-and-play solution, it establishes a new benchmark for practical and reliable single/multi-person facial identity restoration in open-world settings, paving the way for the deployment of multimodal editing large models in real-person editing scenarios. The code is available at https://github.com/NDYBSNDY/EditedID.

2602.18749 2026-02-24 cs.AI

Federated Reasoning Distillation Framework with Model Learnability-Aware Data Allocation

Wei Guo, Siyuan Lu, Xiangdong Ran, Yiqi Tong, Yikun Ban, Zelong Xu, Jing Fan, Zixuan Huang, Xiao Zhang, Zhaojun Hu, Fuzhen Zhuang

English abstract

Data allocation plays a critical role in federated reasoning collaboration between large language models (LLMs) and small language models (SLMs). Nevertheless, existing data allocation methods fail to address an under-explored challenge in collaboration: the bidirectional model learnability gap, where client-side SLMs cannot identify high-reward samples matching their learnability constraints for effective knowledge transfer from LLMs, while LLMs struggle to select samples contributing novel knowledge beyond their existing data. Furthermore, these collaboration frameworks face another key challenge: domain-agnostic reasoning transfer, where existing reasoning transfer methods fail to flexibly adapt to the local domain data, preventing SLMs from effectively acquiring step-by-step reasoning abilities from general LLMs. To address these challenges, we propose LaDa, a federated reasoning distillation framework with model learnability-aware data allocation. It introduces a model learnability-aware data filter that adaptively allocates high-reward samples based on the learnability gap between each SLM and LLM pair, effectively facilitating bidirectional knowledge transfer. We further design a domain adaptive reasoning distillation method that aligns joint probabilities of reasoning paths on filtered high-reward samples through contrastive distillation learning between SLM and LLM, enabling the SLM to capture underlying reasoning patterns under the local data distribution. LaDa operates as a plug-in module for existing collaboration frameworks, adapting knowledge transfer based on model learnability gaps.

2602.18747 2026-02-24 cs.CV

Benchmarking Computational Pathology Foundation Models For Semantic Segmentation

Lavish Ramchandani, Aashay Tinaikar, Dev Kumar Das, Rohit Garg, Tijo Thomas

Comments 5 pages, submitted to IEEE ISBI 2026

Details
English abstract

In recent years, foundation models such as CLIP, DINO, and CONCH have demonstrated remarkable domain generalization and unsupervised feature extraction capabilities across diverse imaging tasks. However, systematic and independent evaluations of these models for pixel-level semantic segmentation in histopathology remain scarce. In this study, we propose a robust benchmarking approach to assess 10 foundation models on four histopathological datasets covering both morphological tissue-region and cellular/nuclear segmentation tasks. Our method leverages attention maps of foundation models as pixel-wise features, which are then classified using a machine learning algorithm, XGBoost, enabling fast, interpretable, and model-agnostic evaluation without finetuning. We show that the vision-language foundation model CONCH performed best across datasets when compared to vision-only foundation models, with PathDino a close second. Further analysis shows that models trained on distinct histopathology cohorts capture complementary morphological representations, and concatenating their features yields superior segmentation performance. Concatenating features from CONCH, PathDino, and CellViT outperformed individual models across all the datasets by 7.95% (averaged across the datasets), suggesting that ensembles of foundation models can better generalize to diverse histopathological segmentation tasks.

2602.18745 2026-02-24 cs.CV cs.AI

Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code

Haobo Lin, Tianyi Bai, Chen Chen, Jiajun Zhang, Bohan Zeng, Wentao Zhang, Binhang Yuan

Comments 58 pages, 10 figures

Details
English abstract

Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference, yet current vision-language models struggle with complex geometric constructions due to limited training data and weak visual-symbolic alignment. We propose a pipeline for synthesizing complex multimodal geometry problems from scratch and construct a dataset named GeoCode, which decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering, ensuring consistency across structure, text, reasoning, and images. Leveraging the plotting code provided in GeoCode, we further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task. GeoCode exhibits substantially higher structural complexity and reasoning difficulty than existing benchmarks, while maintaining mathematical correctness through multi-stage validation. Extensive experiments show that models trained on GeoCode achieve consistent improvements on multiple geometry benchmarks, demonstrating both the effectiveness of the dataset and the proposed alignment strategy. The code will be available at https://github.com/would1920/GeoCode.

2602.18744 2026-02-24 cs.LG

RadioGen3D: 3D Radio Map Generation via Adversarial Learning on Large-Scale Synthetic Data

Junshen Chen, Angzi Xu, Zezhong Zhang, Shiyao Zhang, Junting Chen, Shuguang Cui

Details
English abstract

Radio maps are essential for efficient radio resource management in future 6G and low-altitude networks. While deep learning (DL) techniques have emerged as an efficient alternative to conventional ray-tracing for radio map estimation (RME), most existing DL approaches are confined to 2D near-ground scenarios. They often fail to capture essential 3D signal propagation characteristics and antenna polarization effects, primarily due to the scarcity of 3D data and training challenges. To address these limitations, we present the RadioGen3D framework. First, we propose an efficient data synthesis method to generate high-quality 3D radio map data. By establishing a parametric target model that captures 2D ray-tracing and 3D channel fading characteristics, we derive realistic coefficient combinations from minimal real measurements, enabling the construction of a large-scale synthetic dataset, Radio3DMix. Utilizing this dataset, we propose a 3D model training scheme based on a conditional generative adversarial network (cGAN), yielding a 3D U-Net capable of accurate RME under diverse input feature combinations. Experimental results demonstrate that RadioGen3D surpasses all baselines in both estimation accuracy and speed. Furthermore, fine-tuning experiments verify its strong generalization capability via successful knowledge transfer.

2602.18742 2026-02-24 cs.RO cs.AI cs.CV

RoboCurate: Harnessing Diversity with Action-Verified Neural Trajectory for Robot Learning

Seungku Kim, Suhyeok Jang, Byungjun Yoon, Dongyoung Kim, John Won, Jinwoo Shin

Comments 20 pages; 6 figures; Project page is available at https://seungkukim.github.io/robocurate/

Details
English abstract

Synthetic data generated by video generative models has shown promise for robot learning as a scalable pipeline, but it often suffers from inconsistent action quality due to imperfectly generated videos. Recently, vision-language models (VLMs) have been leveraged to validate video quality, but they have limitations in distinguishing physically accurate videos and, even then, cannot directly evaluate the generated actions themselves. To tackle this issue, we introduce RoboCurate, a novel synthetic robot data generation framework that evaluates and filters the quality of annotated actions by comparing them with simulation replay. Specifically, RoboCurate replays the predicted actions in a simulator and assesses action quality by measuring the consistency of motion between the simulator rollout and the generated video. In addition, we unlock observation diversity beyond the available dataset via image-to-image editing and apply action-preserving video-to-video transfer to further augment appearance. We observe RoboCurate's generated data yield substantial relative improvements in success rates compared to using real data only, achieving +70.1% on GR-1 Tabletop (300 demos), +16.1% on DexMimicGen in the pre-training setup, and +179.9% in the challenging real-world ALLEX humanoid dexterous manipulation setting.
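
The replay-and-compare filtering idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how motion consistency is measured, so the mean point-wise distance, the `threshold` value, and the sample layout (`"sim"` and `"video"` keys holding end-effector positions) are all hypothetical stand-ins, not RoboCurate's actual procedure.

```python
from math import dist
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def motion_consistency_error(sim_traj: Sequence[Point],
                             video_traj: Sequence[Point]) -> float:
    """Mean Euclidean distance between time-aligned trajectory points.

    Hypothetical measure: sim_traj comes from replaying the annotated
    actions in a simulator, video_traj from motion seen in the video.
    """
    if not sim_traj or len(sim_traj) != len(video_traj):
        raise ValueError("trajectories must be non-empty and equally long")
    total = sum(dist(a, b) for a, b in zip(sim_traj, video_traj))
    return total / len(sim_traj)


def filter_samples(samples: List[Dict[str, Sequence[Point]]],
                   threshold: float) -> List[Dict[str, Sequence[Point]]]:
    """Keep only samples whose replayed motion stays close to the video."""
    return [s for s in samples
            if motion_consistency_error(s["sim"], s["video"]) <= threshold]


# Toy data: the first sample's actions reproduce the video motion,
# the second sample's actions drift away from it.
consistent = {"sim": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)],
              "video": [(0.0, 0.0, 0.0), (0.1, 0.01, 0.0)]}
inconsistent = {"sim": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)],
                "video": [(0.0, 0.0, 0.0), (0.6, 0.4, 0.0)]}

kept = filter_samples([consistent, inconsistent], threshold=0.05)
```

With these toy values only the first sample survives the filter; a real pipeline would need time alignment and a shared coordinate frame between simulator and video before any such comparison is meaningful.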

2602.18740 2026-02-24 cs.LG cs.AI cs.SY eess.SY

HONEST-CAV: Hierarchical Optimization of Network Signals and Trajectories for Connected and Automated Vehicles with Multi-Agent Reinforcement Learning

Ziyan Zhang, Changxin Wan, Peng Hao, Kanok Boriboonsomsin, Matthew J. Barth, Yongkang Liu, Seyhan Ucar, Guoyuan Wu

Comments 7 pages, 6 figures. Accepted at the 2026 IEEE Intelligent Vehicles Symposium. Final version to appear at IEEE Xplore

Details
English abstract

This study presents a hierarchical, network-level traffic flow control framework for mixed traffic consisting of Human-driven Vehicles (HVs) and Connected and Automated Vehicles (CAVs). The framework jointly optimizes vehicle-level eco-driving behaviors and intersection-level traffic signal control to enhance overall network efficiency and decrease energy consumption. A decentralized Multi-Agent Reinforcement Learning (MARL) approach based on a Value Decomposition Network (VDN) manages cycle-based traffic signal control (TSC) at intersections, while an innovative Signal Phase and Timing (SPaT) prediction method integrates a Machine Learning-based Trajectory Planning Algorithm (MLTPA) to guide CAVs in executing Eco-Approach and Departure (EAD) maneuvers. The framework is evaluated across varying CAV proportions and powertrain types to assess its effects on mobility and energy performance. Experimental results from a 4×4 real-world network demonstrate that the MARL-based TSC method outperforms the baseline model (i.e., the Webster method) in speed, fuel consumption, and idling time. In addition, with MLTPA, HONEST-CAV further benefits the traffic system in energy consumption and idling time. With a 60% CAV proportion, vehicle average speed, fuel consumption, and idling time improve by 7.67%, 10.23%, and 45.83%, respectively, compared with the baseline. Furthermore, CAV proportions and powertrain types are discussed to quantify the performance of the proposed method under the impact of automation and electrification.
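
As background on the value decomposition named in this abstract, the following minimal sketch shows the additive form commonly associated with VDN: the joint value is the sum of per-agent values, so each agent can act greedily on its own values and still maximize the joint value. The Q-values and the three-intersection setup are made-up illustrations, not taken from the paper.

```python
from itertools import product
from typing import List, Sequence, Tuple


def joint_q(per_agent_q: Sequence[Sequence[float]],
            joint_action: Sequence[int]) -> float:
    """Additive decomposition: Q_tot = sum over agents of Q_i(a_i)."""
    return sum(q[a] for q, a in zip(per_agent_q, joint_action))


def decentralized_greedy(per_agent_q: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Each agent independently picks the action with its highest value."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_q)


def centralized_greedy(per_agent_q: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """Brute-force search over all joint actions (for comparison only)."""
    spaces = [range(len(q)) for q in per_agent_q]
    return max(product(*spaces), key=lambda ja: joint_q(per_agent_q, ja))


# Hypothetical per-intersection values for candidate signal plans.
q_values: List[List[float]] = [
    [1.0, 3.0],        # intersection A, two candidate plans
    [0.5, 0.2, 2.0],   # intersection B, three candidate plans
    [4.0, 1.0],        # intersection C, two candidate plans
]

local_choice = decentralized_greedy(q_values)
global_choice = centralized_greedy(q_values)
```

Because the sum is maximized term by term, `local_choice` and `global_choice` coincide here; that property is what lets a decomposed critic be trained centrally while execution stays decentralized.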

2602.18739 2026-02-24 cs.LG

When World Models Dream Wrong: Physical-Conditioned Adversarial Attacks against World Models

Zhixiang Guo, Siyuan Liang, Andras Balogh, Noah Lunberry, Rong-Cheng Tu, Mark Jelasity, Dacheng Tao

Details
English abstract

Generative world models (WMs) are increasingly used to synthesize controllable, sensor-conditioned driving videos, yet their reliance on physical priors exposes novel attack surfaces. In this paper, we present Physical-Conditioned World Model Attack (PhysCond-WMA), the first white-box world model attack that perturbs physical-condition channels, such as HDMap embeddings and 3D-box features, to induce semantic, logic, or decision-level distortion while preserving perceptual fidelity. PhysCond-WMA is optimized in two stages: (1) a quality-preserving guidance stage that constrains reverse-diffusion loss below a calibrated threshold, and (2) a momentum-guided denoising stage that accumulates target-aligned gradients along the denoising trajectory for stable, temporally coherent semantic shifts. Extensive experimental results demonstrate that our approach remains effective while increasing FID by about 9% on average and FVD by about 3.9% on average. Under the targeted attack setting, the attack success rate (ASR) reaches 0.55. Downstream studies further show tangible risk: using attacked videos for training decreases 3D detection performance by about 4% and worsens open-loop planning performance by about 20%. These findings reveal and quantify, for the first time, security vulnerabilities in generative world models, driving more comprehensive security checkers.

2602.18733 2026-02-24 cs.LG

Prior Aware Memorization: An Efficient Metric for Distinguishing Memorization from Generalization in Large Language Models

Trishita Tiwari, Ari Trachtenberg, G. Edward Suh

Details
English abstract

Training data leakage from Large Language Models (LLMs) raises serious concerns related to privacy, security, and copyright compliance. A central challenge in assessing this risk is distinguishing genuine memorization of training data from the generation of statistically common sequences. Existing approaches to measuring memorization often conflate these phenomena, labeling outputs as memorized even when they arise from generalization over common patterns. Counterfactual Memorization provides a principled solution by comparing models trained with and without a target sequence, but its reliance on retraining multiple baseline models makes it computationally expensive and impractical at scale. This work introduces Prior-Aware Memorization, a theoretically grounded, lightweight and training-free criterion for identifying genuine memorization in LLMs. The key idea is to evaluate whether a candidate suffix is strongly associated with its specific training prefix or whether it appears with high probability across many unrelated prompts due to statistical commonality. We evaluate this metric on text from the training corpora of two pre-trained models, LLaMA and OPT, using both long sequences (to simulate copyright risks) and named entities (to simulate PII leakage). Our results show that between 55% and 90% of sequences previously labeled as memorized are in fact statistically common. Similar findings hold for the SATML training data extraction challenge dataset, where roughly 40% of sequences exhibit common-pattern behavior despite appearing only once in the training data. These results demonstrate that low frequency alone is insufficient evidence of memorization and highlight the importance of accounting for model priors when assessing leakage.
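
The key idea described in this abstract, checking whether a suffix is tied to its specific prefix or is simply likely under many unrelated prompts, can be sketched as a log-ratio. This is only an illustration of that intuition under stated assumptions: the exact metric in the paper is not given here, and `toy_logprob`, the probability tables, and the unrelated prompts are hypothetical stand-ins for a real language model.

```python
import math
from typing import Callable, Sequence


def log_mean_exp(values: Sequence[float]) -> float:
    """Numerically stable log of the mean of exp(values)."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values) / len(values))


def prior_aware_score(suffix_logprob: Callable[[str, str], float],
                      prefix: str, suffix: str,
                      unrelated_prefixes: Sequence[str]) -> float:
    """log p(suffix | prefix) minus log of the mean p(suffix | unrelated prompt).

    Near zero: the suffix is about as likely after any prompt (common text).
    Large and positive: the suffix depends on this particular prefix.
    """
    conditional = suffix_logprob(prefix, suffix)
    prior = log_mean_exp([suffix_logprob(p, suffix) for p in unrelated_prefixes])
    return conditional - prior


# Hypothetical toy "model": lookup tables instead of a real LLM.
_SPECIFIC = {("Customer record 8841:", "Jane Q. Example, 12 Elm St"): 0.9}
_BACKGROUND = {"Jane Q. Example, 12 Elm St": 1e-6, "All rights reserved.": 0.5}


def toy_logprob(prefix: str, suffix: str) -> float:
    """Return log p(suffix | prefix) from the toy tables above."""
    p = _SPECIFIC.get((prefix, suffix), _BACKGROUND.get(suffix, 1e-9))
    return math.log(p)


UNRELATED = ["The weather today is", "In chapter two,", "def main():"]

memorized = prior_aware_score(toy_logprob, "Customer record 8841:",
                              "Jane Q. Example, 12 Elm St", UNRELATED)
common = prior_aware_score(toy_logprob, "Copyright 2020 Acme.",
                           "All rights reserved.", UNRELATED)
```

In this toy setup `memorized` is roughly 13.7 while `common` is zero, even though both suffixes are highly likely after their own prefix, which is the distinction the abstract draws between memorization and statistical commonality.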

2602.18731 2026-02-24 cs.AI

Beyond Description: A Multimodal Agent Framework for Insightful Chart Summarization

Yuhang Bai, Yujuan Ding, Shanru Lin, Wenqi Fan

Comments 5 pages, 5 figures

Details
English abstract

Chart summarization is crucial for enhancing data accessibility and the efficient consumption of information. However, existing methods, including those with Multimodal Large Language Models (MLLMs), primarily focus on low-level data descriptions and often fail to capture the deeper insights which are the fundamental purpose of data visualization. To address this challenge, we propose Chart Insight Agent Flow, a plan-and-execute multi-agent framework effectively leveraging the perceptual and reasoning capabilities of MLLMs to uncover profound insights directly from chart images. Furthermore, to overcome the lack of suitable benchmarks, we introduce ChartSummInsights, a new dataset featuring a diverse collection of real-world charts paired with high-quality, insightful summaries authored by human data analysis experts. Experimental results demonstrate that our method significantly improves the performance of MLLMs on the chart summarization task, producing summaries with deep and diverse insights.

2602.18729 2026-02-24 cs.CV cs.AI

MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment

Sagarika Banerjee, Tangatar Madi, Advait Swaminathan, Nguyen Dao Minh Anh, Shivank Garg, Kevin Zhu, Vasu Sharma

Comments EACL 2026, Main, Short Paper

Details
English abstract

Fine-grained image-caption alignment is crucial for vision-language models (VLMs), especially in socially critical contexts such as identifying real-world risk scenarios or distinguishing cultural proxies, where correct interpretation hinges on subtle visual or linguistic clues and where minor misinterpretations can lead to significant real-world consequences. We present MiSCHiEF, a set of two benchmarking datasets based on a contrastive pair design in the domains of safety (MiS) and culture (MiC), and evaluate four VLMs on tasks requiring fine-grained differentiation of paired images and captions. In both datasets, each sample contains two minimally differing captions and corresponding minimally differing images. In MiS, the image-caption pairs depict a safe and an unsafe scenario, while in MiC, they depict cultural proxies in two distinct cultural contexts. We find that models generally perform better at confirming the correct image-caption pair than rejecting incorrect ones. Additionally, models achieve higher accuracy when selecting the correct caption from two highly similar captions for a given image, compared to the converse task. The results, overall, highlight persistent modality misalignment challenges in current VLMs, underscoring the difficulty of precise cross-modal grounding required for applications with subtle semantic and visual distinctions.