arXivDaily: arXiv daily academic digest, updated Monday through Friday
2506.00098 2026-02-26 cs.RO cs.LG

Interactive Imitation Learning for Dexterous Robotic Manipulation: Challenges and Perspectives -- A Survey

Edgar Welte, Rania Rayyes

Comments 27 pages, 4 figures, 3 tables

Abstract

Dexterous manipulation is a crucial yet highly complex challenge in humanoid robotics, demanding precise, adaptable, and sample-efficient learning methods. As humanoid robots are usually designed to operate in human-centric environments and interact with everyday objects, mastering dexterous manipulation is critical for real-world deployment. Traditional approaches, such as reinforcement learning and imitation learning, have made significant strides, but they often struggle due to the unique challenges of real-world dexterous manipulation, including high-dimensional control, limited training data, and covariate shift. This survey provides a comprehensive overview of these challenges and reviews existing learning-based methods for real-world dexterous manipulation, spanning imitation learning, reinforcement learning, and hybrid approaches. A promising yet underexplored direction is interactive imitation learning, where human feedback actively refines a robot's behavior during training. While interactive imitation learning has shown success in various robotic tasks, its application to dexterous manipulation remains limited. To address this gap, we examine current interactive imitation learning techniques applied to other robotic tasks and discuss how these methods can be adapted to enhance dexterous manipulation. By synthesizing state-of-the-art research, this paper highlights key challenges, identifies gaps in current methodologies, and outlines potential directions for leveraging interactive imitation learning to improve dexterous robotic skills.

2505.19610 2026-02-26 cs.CV

JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models

Jiaxin Song, Yixu Wang, Jie Li, Rui Yu, Yan Teng, Xingjun Ma, Yingchun Wang

Comments The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract

Vision-Language Models (VLMs) exhibit impressive performance, yet the integration of powerful vision encoders has significantly broadened their attack surface, rendering them increasingly susceptible to jailbreak attacks. However, lacking well-defined attack objectives, existing jailbreak methods often rely on gradient-based strategies that are prone to local optima and lack precise directional guidance, and they typically decouple visual and textual modalities, thereby limiting their effectiveness by neglecting crucial cross-modal interactions. Inspired by the Eliciting Latent Knowledge (ELK) framework, we posit that VLMs encode safety-relevant information within their internal fusion-layer representations, revealing an implicit safety decision boundary in the latent space. This motivates exploiting the boundary to steer model behavior. Accordingly, we propose JailBound, a novel latent space jailbreak framework comprising two stages: (1) Safety Boundary Probing, which addresses the guidance issue by approximating the decision boundary within the fusion layer's latent space, thereby identifying optimal perturbation directions towards the target region; and (2) Safety Boundary Crossing, which overcomes the limitations of decoupled approaches by jointly optimizing adversarial perturbations across both image and text inputs. This latter stage employs an innovative mechanism to steer the model's internal state towards policy-violating outputs while maintaining cross-modal semantic consistency. Extensive experiments on six diverse VLMs demonstrate JailBound's efficacy: it achieves average attack success rates of 94.32% (white-box) and 67.28% (black-box), which are 6.17% and 21.13% higher than SOTA methods, respectively. Our findings expose an overlooked safety risk in VLMs and highlight the urgent need for more robust defenses. Warning: This paper contains potentially sensitive, harmful and offensive content.

2505.17306 2026-02-26 cs.CL

Refusal Direction is Universal Across Safety-Aligned Languages

Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank

Abstract

Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. This has been demonstrated primarily in an English-centric context; appropriate refusal behavior is important for any language, yet it remains poorly understood beyond English. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.

2505.13667 2026-02-26 cs.RO

Adaptive Diffusion Constrained Sampling for Bimanual Robot Manipulation

Haolei Tong, Yuezhe Zhang, Sophie Lueth, Georgia Chalvatzaki

Comments Accepted by IEEE International Conference on Robotics and Automation 2026 (ICRA 2026)

Abstract

Coordinated multi-arm manipulation requires satisfying multiple simultaneous geometric constraints across high-dimensional configuration spaces, which poses a significant challenge for traditional planning and control methods. In this work, we propose Adaptive Diffusion Constrained Sampling (ADCS), a generative framework that flexibly integrates both equality (e.g., relative and absolute pose constraints) and structured inequality constraints (e.g., proximity to object surfaces) into an energy-based diffusion model. Equality constraints are modeled using dedicated energy networks trained on pose differences in Lie algebra space, while inequality constraints are represented via Signed Distance Functions (SDFs) and encoded into learned constraint embeddings, allowing the model to reason about complex spatial regions. A key innovation of our method is a Transformer-based architecture that learns to weight constraint-specific energy functions at inference time, enabling flexible and context-aware constraint integration. Moreover, we adopt a two-phase sampling strategy that improves precision and sample diversity by combining Langevin dynamics with resampling and density-aware re-weighting. Experimental results on dual-arm manipulation tasks show that ADCS significantly improves sample diversity and generalization across settings demanding precise coordination and adaptive constraint handling.

2505.08246 2026-02-26 cs.CV cs.NA math.NA

Identifying Memorization of Diffusion Models through $p$-Laplace Analysis: Estimators, Bounds and Applications

Jonathan Brokman, Itay Gershon, Amit Giloni, Omer Hofman, Roman Vainshtein, Hisashi Kojima, Guy Gilboa

Comments This manuscript is a substantially extended version of our SSVM 2025 paper, including significant new theoretical results and additional experiments. It is currently under review as a journal submission

Abstract

Diffusion models, today's leading image generative models, estimate the score function, i.e. the gradient of the log probability of (perturbed) data samples, without direct access to the underlying probability distribution. This work investigates whether the estimated score function can be leveraged to compute higher-order differentials, namely the $p$-Laplace operators. We show that these operators can be employed to identify memorized training data. We propose a numerical $p$-Laplace approximation based on the learned score functions, showing its effectiveness in identifying key features of the probability landscape. Furthermore, theoretical error bounds for these estimators are proven and demonstrated numerically. We analyze the structured case of Gaussian mixture models, and demonstrate that the results carry over to text-conditioned image generative models (text-to-image), where memorization identification based on the $p$-Laplace operator is performed for the first time, showing its advantage on 500 memorized prompts ($\sim$3000 generated images) in a post-generation regime, especially when the conditioning text is unavailable.
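
To make the idea of deriving a $p$-Laplace value from a score function concrete, here is a minimal sketch. It assumes the operator is applied to the log-density, i.e. div(|s|^(p-2) s) with s the score, and approximates the divergence with central finite differences. That choice, the analytic Gaussian score standing in for a learned network, and the names `gaussian_score` and `p_laplace_from_score` are illustrative assumptions, not the estimator proposed in the paper.

```python
import math


def gaussian_score(x, mu, sigma):
    """Score (gradient of log-density) of an isotropic Gaussian N(mu, sigma^2 I)."""
    return [-(xi - mi) / sigma ** 2 for xi, mi in zip(x, mu)]


def p_laplace_from_score(score, x, p=2.0, h=1e-4):
    """Approximate div(|s|^(p-2) * s) at x, where s = score(x).

    Hypothetical helper: the divergence is taken with central differences
    of step h along each coordinate.
    """
    def flux(pt):
        s = score(pt)
        norm = math.sqrt(sum(si * si for si in s))
        if p == 2:
            scale = 1.0  # ordinary Laplacian: the flux is the score itself
        else:
            # Crude guard: the flux is set to zero where the score vanishes.
            scale = norm ** (p - 2) if norm > 0 else 0.0
        return [scale * si for si in s]

    div = 0.0
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        # Partial derivative of the i-th flux component along axis i.
        div += (flux(plus)[i] - flux(minus)[i]) / (2 * h)
    return div


if __name__ == "__main__":
    score = lambda z: gaussian_score(z, [0.0, 0.0], 2.0)
    # For p = 2 and a Gaussian, the exact value is -d / sigma^2.
    print(p_laplace_from_score(score, [1.0, -0.5], p=2.0))
```

With a learned model, `score` would be replaced by the network's score estimate at a fixed noise level; the finite-difference step then becomes a tuning parameter.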

2504.06533 2026-02-26 cs.LG cs.AI cs.DS

Rethinking Flexible Graph Similarity Computation: One-step Alignment with Global Guidance

Zhouyang Liu, Ning Liu, Yixin Chen, Jiezhong He, Shuai Ma, Dongsheng Li

Comments Accepted by ICDE 2026

Abstract

Graph Edit Distance (GED) is a widely used measure of graph similarity, valued for its flexibility in encoding domain knowledge through operation costs. However, existing learning-based approximation methods follow a modeling paradigm that decouples local candidate match selection from both operation costs and global dependencies between matches. This decoupling undermines their ability to capture the intrinsic flexibility of GED and often forces them to rely on costly iterative refinement to obtain accurate alignments. In this work, we revisit the formulation of GED, revise the prevailing paradigm, and propose Graph Edit Network (GEN), an implementation of the revised formulation that tightly integrates cost-aware expense estimation with globally guided one-step alignment. Specifically, GEN incorporates operation costs into node matching expense estimation, ensuring match decisions respect the specified cost setting. Furthermore, GEN models match dependencies within and across graphs, capturing each match's impact on the overall alignment. These designs enable accurate GED approximation without iterative refinement. Extensive experiments on real-world and synthetic benchmarks demonstrate that GEN achieves up to a 37.8% reduction in GED predictive errors, while increasing inference throughput by up to 414x. These results highlight GEN's practical efficiency and the effectiveness of the revision. Beyond this implementation, our revision provides a principled framework for advancing learning-based GED approximation.

2503.15133 2026-02-26 cs.CL

EmoGRACE: Aspect-based emotion analysis for social media data

Christina Zorenböhmer, Sebastian Schmidt, Bernd Resch

Abstract

While sentiment analysis has advanced from sentence to aspect-level, i.e., the identification of concrete terms related to a sentiment, the equivalent field of Aspect-based Emotion Analysis (ABEA) is faced with dataset bottlenecks and the increased complexity of emotion classes in contrast to binary sentiments. This paper addresses these gaps by generating a first ABEA training dataset, consisting of 2,621 English Tweets, and fine-tuning a BERT-based model for the ABEA sub-tasks of Aspect Term Extraction (ATE) and Aspect Emotion Classification (AEC). The dataset annotation process was based on the hierarchical emotion theory by Shaver et al. [1] and made use of group annotation and majority voting strategies to facilitate label consistency. The resulting dataset contained aspect-level emotion labels for Anger, Sadness, Happiness, Fear, and a None class. Using the new ABEA training dataset, the state-of-the-art ABSA model GRACE by Luo et al. [2] was fine-tuned for ABEA. The results reflected a performance plateau at an F1-score of 70.1% for ATE and 46.9% for joint ATE and AEC extraction. The limiting factors for model performance were broadly identified as the small training dataset size coupled with the increased task complexity, causing model overfitting and a limited ability to generalize to new data.

2503.07982 2026-02-26 cs.CV

TRACE: Your Diffusion Model is Secretly an Instance Edge Detector

Sanghyun Jo, Ziseok Lee, Wooyeol Lee, Jonghyun Choi, Jaesik Park, Kyungsu Kim

Comments Accepted to ICLR 2026 (Oral)

Abstract

High-quality instance and panoptic segmentation has traditionally relied on dense instance-level annotations such as masks, boxes, or points, which are costly, inconsistent, and difficult to scale. Unsupervised and weakly-supervised approaches reduce this burden but remain constrained by semantic backbone constraints and human bias, often producing merged or fragmented outputs. We present TRACE (TRAnsforming diffusion Cues to instance Edges), showing that text-to-image diffusion models secretly function as instance edge annotators. TRACE identifies the Instance Emergence Point (IEP) where object boundaries first appear in self-attention maps, extracts boundaries through Attention Boundary Divergence (ABDiv), and distills them into a lightweight one-step edge decoder. This design removes the need for per-image diffusion inversion, achieving 81x faster inference while producing sharper and more connected boundaries. On the COCO benchmark, TRACE improves unsupervised instance segmentation by +5.1 AP, and in tag-supervised panoptic segmentation it outperforms point-supervised baselines by +1.7 PQ without using any instance-level labels. These results reveal that diffusion models encode hidden instance boundary priors, and that decoding these signals offers a practical and scalable alternative to costly manual annotation. Project Page: https://shjo-april.github.io/TRACE/

2503.05236 2026-02-26 cs.CV

Unified Reward Model for Multimodal Understanding and Generation

Yibin Wang, Yuhang Zang, Hao Li, Cheng Jin, Jiaqi Wang

Comments project page: https://codegoat24.github.io/UnifiedReward/

Abstract

Recent advances in human preference alignment have significantly improved multimodal generation and understanding. A key approach is to train reward models that provide supervision signals for preference optimization. However, existing reward models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that a reward model that jointly learns to assess multiple vision tasks may foster a synergistic effect, where improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring, providing effective reward signals for vision model preference alignment. Specifically, (1) we first train UnifiedReward on our constructed large-scale human preference dataset, which covers both image and video generation/understanding tasks. (2) Then, we leverage it to automatically construct high-quality pairwise preference data from vision models by progressively filtering their outputs through our two-stage strategy, i.e., pair ranking and point sifting. (3) Finally, we use these data to align vision models with human preferences via Direct Preference Optimization (DPO). Experimental results show that jointly learning to assess diverse visual tasks yields substantial mutual benefits. We further apply our pipeline to both vision understanding and generation, achieving consistent improvements across each domain.
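
The DPO step mentioned in (3) can be illustrated with the standard per-pair DPO objective, which penalizes a policy when it does not prefer the chosen response over the rejected one by a larger margin than a frozen reference model does. The sketch below is a generic instance of that loss, not the authors' training code; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the trainable policy and under the frozen reference.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in an overflow-safe form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # The policy favors the chosen response more than the reference does,
    # so the loss falls below the neutral value log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In a pipeline like the one described, the pairs fed to such a loss would come from the reward model's pair ranking and point sifting stages; in practice the loss is averaged over a batch and differentiated with an autodiff framework.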

2503.03178 2026-02-26 cs.LG math.PR

Active operator learning with predictive uncertainty quantification for partial differential equations

Nick Winovich, Mitchell Daneker, Lu Lu, Guang Lin

Comments Submitted to the Journal of Computational Physics

Abstract

With the increased prevalence of neural operators being used to provide rapid solutions to partial differential equations (PDEs), understanding the accuracy of model predictions and the associated error levels is necessary for deploying reliable surrogate models in scientific applications. Existing uncertainty quantification (UQ) frameworks employ ensembles or Bayesian methods, which can incur substantial computational costs during both training and inference. We propose a lightweight predictive UQ method tailored for Deep operator networks (DeepONets) that also generalizes to other operator networks. Numerical experiments on linear and nonlinear PDEs demonstrate that the framework's uncertainty estimates are unbiased and provide accurate out-of-distribution uncertainty predictions with a sufficiently large training dataset. Our framework provides fast inference and uncertainty estimates that can efficiently drive outer-loop analyses that would be prohibitively expensive with conventional solvers. We demonstrate how predictive uncertainties can be used in the context of Bayesian optimization and active learning problems to yield improvements in accuracy and data-efficiency for outer-loop optimization procedures. In the active learning setup, we extend the framework to Fourier Neural Operators (FNO) and describe a generalized method for other operator networks. To enable real-time deployment, we introduce an inference strategy based on precomputed trunk outputs and a sparse placement matrix, reducing evaluation time by more than a factor of five. Our method provides a practical route to uncertainty-aware operator learning in time-sensitive settings.

2502.18424 2026-02-26 cs.CL

Compressing Language Models for Specialized Domains

Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras

Comments EACL 2026

Abstract

Language models (LMs) excel at tasks across diverse domains, yet require substantial computational resources during inference. Compression techniques such as pruning and quantization offer a practical path towards efficient LM deployment, exemplified by their ability to preserve performance on general-purpose benchmarks. However, general-purpose LM compression methods can negatively affect performance in specialized domains (e.g. biomedical or legal). Recent work has sought to address this issue, but requires a computationally expensive full-parameter fine-tuning pipeline. To this end, we propose MixCal, a novel calibration method designed to improve the in-domain performance of compressed LMs in a post-training setting. Through extensive experimentation, we demonstrate that MixCal substantially outperforms existing approaches on domain-specific tasks and preserves general performance. Notably, these performance gains are achieved while also reducing the computational cost of LM compression.

2501.16443 2026-02-26 cs.LG cs.CV

Object-Centric World Models from Few-Shot Annotations for Sample-Efficient Reinforcement Learning

Weipu Zhang, Adam Jelley, Trevor McInroe, Amos Storkey, Gang Wang

Abstract

While deep reinforcement learning (RL) from pixels has achieved remarkable success, its sample inefficiency remains a critical limitation for real-world applications. Model-based RL (MBRL) addresses this by learning a world model to generate simulated experience, but standard approaches that rely on pixel-level reconstruction losses often fail to capture small, task-critical objects in complex, dynamic scenes. We posit that an object-centric (OC) representation can direct model capacity toward semantically meaningful entities, improving dynamics prediction and sample efficiency. In this work, we introduce OC-STORM, an object-centric MBRL framework that enhances a learned world model with object representations extracted by a pretrained segmentation network. By conditioning on a minimal number of annotated frames, OC-STORM learns to track decision-relevant object dynamics and inter-object interactions without extensive labeling or access to privileged information. Empirical results demonstrate that OC-STORM significantly outperforms the STORM baseline on the Atari 100k benchmark and achieves state-of-the-art sample efficiency on challenging boss fights in the visually complex game Hollow Knight. Our findings underscore the potential of integrating OC priors into MBRL for complex visual domains. Project page: https://oc-storm.weipuzhang.com

2412.10895 2026-02-26 cs.LG stat.ML

Multi-Class and Multi-Task Strategies for Neural Directed Link Prediction

Claudio Moroni, Claudio Borile, Carolina Mattsson, Michele Starnini, André Panisson

Comments 15 pages, 2 figures

Journal ref ECML PKDD 2025
Abstract

Link Prediction is a foundational task in Graph Representation Learning, supporting applications like link recommendation, knowledge graph completion and graph generation. Graph Neural Networks have shown the most promising results in this domain and are currently the de facto standard approach to learning from graph data. However, a key distinction exists between Undirected and Directed Link Prediction: the former just predicts the existence of an edge, while the latter must also account for edge directionality and bidirectionality. This translates to Directed Link Prediction (DLP) having three sub-tasks, each defined by how training, validation and test sets are structured. Most research on DLP overlooks this trichotomy, focusing solely on the "existence" sub-task, where training and test sets are random, uncorrelated samples of positive and negative directed edges. Even in the works that recognize the aforementioned trichotomy, models fail to perform well across all three sub-tasks. In this study, we experimentally demonstrate that training Neural DLP (NDLP) models only on the existence sub-task, using methods adapted from Neural Undirected Link Prediction, results in parameter configurations that fail to capture directionality and bidirectionality, even after rebalancing edge classes. To address this, we propose three strategies that handle the three tasks simultaneously. Our first strategy, the Multi-Class Framework for Neural Directed Link Prediction (MC-NDLP) maps NDLP to a Multi-Class training objective. The second and third approaches adopt a Multi-Task perspective, either with a Multi-Objective (MO-DLP) or a Scalarized (S-DLP) strategy. Our results show that these methods outperform traditional approaches across multiple datasets and models, achieving equivalent or superior performance in addressing the three DLP sub-tasks.
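
One way to picture a multi-class view of directed link prediction is to label each ordered node pair by what exists between the two nodes, so that existence, direction, and bidirectionality are all encoded in a single target. The four-class scheme below is an illustrative assumption, not necessarily the class definition used by MC-NDLP; `pair_class` and the label names are hypothetical.

```python
def pair_class(edges, u, v):
    """Label the ordered pair (u, v) given a set of directed edges.

    Hypothetical four-way labelling:
      "none"          - no edge in either direction
      "forward"       - only u -> v exists
      "reverse"       - only v -> u exists
      "bidirectional" - both u -> v and v -> u exist
    """
    fwd = (u, v) in edges
    bwd = (v, u) in edges
    if fwd and bwd:
        return "bidirectional"
    if fwd:
        return "forward"
    if bwd:
        return "reverse"
    return "none"


if __name__ == "__main__":
    edges = {(0, 1), (1, 0), (1, 2)}
    for pair in [(0, 1), (1, 2), (2, 1), (0, 2)]:
        print(pair, pair_class(edges, *pair))
```

A binary existence-only target would give the pairs (1, 2) and (0, 1) the same label, which is the kind of information loss the abstract attributes to training on the existence sub-task alone.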

2412.06966 2026-02-26 cs.LG cs.AI cs.CY

Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy and Research

A. Feder Cooper, Christopher A. Choquette-Choo, Miranda Bogen, Kevin Klyman, Matthew Jagielski, Katja Filippova, Ken Liu, Alexandra Chouldechova, Jamie Hayes, Yangsibo Huang, Eleni Triantafillou, Peter Kairouz, Nicole Elyse Mitchell, Niloofar Mireshghallah, Abigail Z. Jacobs, James Grimmelmann, Vitaly Shmatikov, Christopher De Sa, Ilia Shumailov, Andreas Terzis, Solon Barocas, Jennifer Wortman Vaughan, danah boyd, Yejin Choi, Sanmi Koyejo, Fernando Delgado, Percy Liang, Daniel E. Ho, Pamela Samuelson, Miles Brundage, David Bau, Seth Neel, Hanna Wallach, Amy B. Cyphert, Mark A. Lemley, Nicolas Papernot, Katherine Lee

Comments NeurIPS 2025 (Oral)

Abstract

"Machine unlearning" is a popular proposed solution for mitigating the existence of content in an AI model that is problematic for legal or moral reasons, including privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of specific information from a generative-AI model's parameters, e.g., a particular individual's personal data or the inclusion of copyrighted content in the model's training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual's data or reflect the concept of "Spiderman." Both of these goals--the targeted removal of information from a model and the targeted suppression of information from a model's outputs--present various technical and substantive challenges. We provide a framework for ML researchers and policymakers to think rigorously about these challenges, identifying several mismatches between the goals of unlearning and feasible implementations. These mismatches explain why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact.

2411.06657 2026-02-26 cs.CV cs.AI cs.CL cs.LG

Renaissance: Investigating the Pretraining of Vision-Language Encoders

Clayton Fields, Casey Kennington

Comments 9 pages

Abstract

In the past several years there has been an explosion of available models for vision-language (VL) tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. Additionally, the limited programming tools available for modeling make conducting VL research more difficult than necessary. In this paper, we seek to answer several questions related to the pretraining of VL encoders through meta-analysis. To conduct these experiments, we introduce a VL evaluation framework called Renaissance. In our first set of experiments, we show that we can save significant compute at little to no cost to downstream performance, by freezing large parts of VL models during pretraining. In our second set of experiments, we examine the effect of basing a VL transformer on a vision model versus a text model. Renaissance offers a great deal of flexibility in creating, training and evaluating transformer encoders for VL modeling. The source code for Renaissance can be found at https://github.com/bsu-slim/renaissance.

2411.04997 2026-02-26 cs.CV cs.CL

LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation

Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Usman Naseem, Chunyu Wang, Qi Dai, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu, Liang Hu

Journal ref AAAI 2026
Abstract

CLIP is a seminal multimodal model that maps images and text into a shared representation space through contrastive learning on billions of image-caption pairs. Inspired by the rapid progress of large language models (LLMs), we investigate how the superior linguistic understanding and broad world knowledge of LLMs can further strengthen CLIP, particularly in handling long and complex captions. We introduce an efficient fine-tuning framework that embeds an LLM into a pretrained CLIP while incurring nearly the same training cost as standard CLIP fine-tuning. Our method first converts the LLM into an embedding-compatible form for the CLIP setting, and then couples it with the pretrained CLIP vision encoder through a lightweight adaptor trained on only a few million image-caption pairs. With this strategy, we achieve large performance gains without large-scale retraining, outperforming state-of-the-art CLIP variants such as EVA02 and SigLIP-2. The LLM-enhanced CLIP delivers consistent improvements across a wide range of downstream tasks, including linear-probe classification, zero-shot image-text retrieval with both short and long captions (in English and other languages), zero-shot and supervised image segmentation, object detection, and serving as a tokenizer backbone for multimodal large-model benchmarks. Code and models are available at: https://aka.ms/llm2clip

2411.03941 2026-02-26 cs.LG cs.AI

Modular Deep Learning for Multivariate Time-Series: Decoupling Imputation and Downstream Tasks

Joseph Arul Raj, Linglong Qian, Zina Ibrahim

Abstract

Missing values are pervasive in large-scale time-series data, posing challenges for reliable analysis and decision-making. Many neural architectures have been designed to model and impute the complex and heterogeneous missingness patterns of such data. Most existing methods are end-to-end, rendering imputation tightly coupled with downstream predictive tasks and leading to limited reusability of the trained model, reduced interpretability, and challenges in assessing model quality. In this paper, we call for a modular approach that decouples imputation and downstream tasks, enabling independent optimisation and greater adaptability. Using the largest open-source Python library for deep learning-based time-series analysis, PyPOTS, we evaluate a modular pipeline across six state-of-the-art models that perform imputation and prediction on seven datasets spanning multiple domains. Our results show that a modular approach maintains high performance while prioritising flexibility and reusability - qualities that are crucial for real-world applications. Through this work, we aim to demonstrate how modularity can benefit multivariate time-series analysis, achieving a balance between performance and adaptability.

2410.16718 2026-02-26 cs.LG

Learning Partial Graph Matching via Optimal Partial Transport

Gathika Ratnayaka, James Nichols, Qing Wang

Abstract

Partial graph matching extends traditional graph matching by allowing some nodes to remain unmatched, enabling applications in more complex scenarios. However, this flexibility introduces additional complexity, as both the subset of nodes to match and the optimal mapping must be determined. While recent studies have explored deep learning techniques for partial graph matching, a significant limitation remains: the absence of an optimization objective that fully captures the problem's intrinsic nature while enabling efficient solutions. In this paper, we propose a novel optimization framework for partial graph matching, inspired by optimal partial transport. Our approach formulates an objective that enables partial assignments while incorporating matching biases, using weighted total variation as the divergence function to guarantee optimal partial assignments. Our method can achieve efficient, exact solutions within cubic worst-case time complexity. Our contributions are threefold: (i) we introduce a novel optimization objective that balances matched and unmatched nodes; (ii) we establish a connection between partial graph matching and the linear sum assignment problem, enabling efficient solutions; (iii) we propose a deep graph matching architecture with a novel partial matching loss, providing an end-to-end solution. The empirical evaluations on standard graph matching benchmarks demonstrate the efficacy of the proposed approach.
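
The link between partial matching and the linear sum assignment problem can be sketched with a common dummy-node construction: the cost matrix is padded so that every node may instead be assigned to its own "unmatched" slot at a fixed penalty. This is a generic textbook-style reduction shown under stated assumptions, not necessarily the formulation derived in the paper; `partial_match`, `skip_src`, and `skip_tgt` are hypothetical names, and the brute-force search is only suitable for tiny inputs.

```python
from itertools import permutations

INF = float("inf")


def partial_match(cost, skip_src, skip_tgt):
    """Minimum-cost partial matching via a padded square assignment problem.

    cost[i][j]  : cost of matching source node i to target node j
    skip_src[i] : penalty for leaving source node i unmatched
    skip_tgt[j] : penalty for leaving target node j unmatched
    """
    n, m = len(cost), len(cost[0])
    size = n + m
    aug = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n and j < m:
                aug[i][j] = cost[i][j]  # real match
            elif i < n:
                # Source i may only use its own dummy column (m + i).
                aug[i][j] = skip_src[i] if j - m == i else INF
            elif j < m:
                # Target j may only be taken by its own dummy row (n + j).
                aug[i][j] = skip_tgt[j] if i - n == j else INF
            # Dummy-to-dummy cells keep cost 0.

    # Exhaustive search over assignments; factorial time, illustration only.
    best_cost, best_perm = INF, None
    for perm in permutations(range(size)):
        total = sum(aug[i][perm[i]] for i in range(size))
        if total < best_cost:
            best_cost, best_perm = total, perm

    pairs = [(i, best_perm[i]) for i in range(n) if best_perm[i] < m]
    return best_cost, pairs


if __name__ == "__main__":
    cost = [[1, 9], [9, 8]]
    print(partial_match(cost, skip_src=[3, 3], skip_tgt=[3, 3]))
```

In this example only the cheap pair is matched, because leaving the second source and target nodes unmatched costs less than pairing them. Replacing the exhaustive loop with a Hungarian-style solver, such as SciPy's `scipy.optimize.linear_sum_assignment`, is what would bring the running time down to the cubic regime the abstract refers to.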

2409.20469 2026-02-26 cs.CV

PoseAdapt: Sustainable Human Pose Estimation via Continual Learning Benchmarks and Toolkit

Muhammad Saif Ullah Khan, Didier Stricker

Comments Accepted in WACV 2026 Applications Track

Abstract

Human pose estimators are typically retrained from scratch or naively fine-tuned whenever keypoint sets, sensing modalities, or deployment domains change--an inefficient, compute-intensive practice that rarely matches field constraints. We present PoseAdapt, an open-source framework and benchmark suite for continual pose model adaptation. PoseAdapt defines domain-incremental and class-incremental tracks that simulate realistic changes in density, lighting, and sensing modality, as well as skeleton growth. The toolkit supports two workflows: (i) Strategy Benchmarking, which lets researchers implement continual learning (CL) methods as plugins and evaluate them under standardized protocols; and (ii) Model Adaptation, which allows practitioners to adapt strong pretrained models to new tasks with minimal supervision. We evaluate representative regularization-based methods in single-step and sequential settings. Benchmarks enforce a fixed lightweight backbone, no access to past data, and tight per-step budgets. This isolates adaptation strategy effects, highlighting the difficulty of maintaining accuracy under strict resource limits. PoseAdapt connects modern CL techniques with practical pose estimation needs, enabling adaptable models that improve over time without repeated full retraining.

2409.18745 2026-02-26 cs.RO

A study on the effects of mixed explicit and implicit communications in human-artificial-agent interactions

Ana Christina Almada Campos, Bruno Vilhena Adorno

Comments Main paper with 28 pages, 14 figures, 4 tables. Supplementary material with 39 pages, 44 figures, 2 tables. Submitted to Intelligent Service Robotics

Details
English abstract

Communication between humans and artificial agents is essential for their interaction. This is often inspired by human communication, which uses gestures, facial expressions, gaze direction, and other explicit and implicit means. This work presents interaction experiments where humans and artificial agents interact through explicit and implicit communication to evaluate the effect of mixed explicit-implicit communication against purely explicit communication and the impact of the task difficulty in this evaluation. Results obtained using Bayesian parameter estimation show that the task execution time did not significantly change when mixed explicit and implicit communications were used in either of our experiments, which varied in the type of artificial agent (virtual agent and humanoid robot) used and task difficulty. The number of errors was affected by the communication only when the human was executing a more difficult task, and an impact on the perceived efficiency of the interaction was only observed in the interaction with the robot, for both easy and difficult tasks. In contrast, acceptance, sociability, and transparency of the artificial agent increased when using mixed communication modalities in both our experiments and task difficulty levels. This suggests that task-related measures, such as time, number of errors, and perceived efficiency of the interaction, as well as the impact of the communication on them, are more sensitive to the type of task and the difficulty level, whereas the combination of explicit and implicit communications more consistently improves human perceptions about artificial agents.

2408.05861 2026-02-26 cs.AI cs.LG

Temporal Knowledge-Graph Memory in a Partially Observable Environment

Taewoon Kim, Vincent François-Lavet, Michael Cochez

Details
English abstract

Agents in partially observable environments require persistent memory to integrate observations over time. While KGs (knowledge graphs) provide a natural representation for such evolving state, existing benchmarks rarely expose agents to environments where both the world dynamics and the agent's memory are explicitly graph-shaped. We introduce the Room Environment v3, a configurable environment whose hidden state is an RDF KG and whose observations are RDF triples. The agent may extend these observations into a temporal KG when storing them in long-term memory. The environment is easily adjustable in terms of grid size, number of rooms, inner walls, and moving objects. We define a lightweight temporal KG memory for agents, based on RDF-star-style qualifiers (time_added, last_accessed, num_recalled), and evaluate several symbolic baselines that maintain and query this memory under different capacity constraints. Two neural sequence models (LSTM and Transformer) serve as contrasting baselines without explicit KG structure. Agents train on one layout and are evaluated on a held-out layout with the same dynamics but a different query order, exposing train-test generalization gaps. In this setting, temporal qualifiers lead to more stable performance, and the symbolic TKG (temporal knowledge graph) agent achieves roughly fourfold higher test QA (question-answer) accuracy than the neural baselines under the same environment and query conditions. The environment, agent implementations, and experimental scripts are released for reproducible research at https://github.com/humemai/agent-room-env-v3 and https://github.com/humemai/room-env.
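As a rough illustration of a capacity-bounded temporal KG memory, the sketch below stores RDF-like triples together with the three qualifiers named in the abstract (time_added, last_accessed, num_recalled). The class names, the least-recently-accessed eviction rule, and the "most recently added wins" query rule are assumptions for this example, not the released agent implementation.

```python
from dataclasses import dataclass


@dataclass
class Qualifiers:
    time_added: int
    last_accessed: int
    num_recalled: int = 0


class TemporalKGMemory:
    """Hypothetical long-term memory: triple -> temporal qualifiers."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}  # (subject, predicate, object) -> Qualifiers

    def add(self, triple, now):
        if triple in self.store:
            # Re-observing a known triple only refreshes its access time.
            self.store[triple].last_accessed = now
            return
        if len(self.store) >= self.capacity:
            # Assumed policy: evict the least recently accessed triple.
            victim = min(self.store, key=lambda t: self.store[t].last_accessed)
            del self.store[victim]
        self.store[triple] = Qualifiers(time_added=now, last_accessed=now)

    def query(self, subject, predicate, now):
        matches = [t for t in self.store if t[0] == subject and t[1] == predicate]
        if not matches:
            return None
        # Assumed policy: trust the most recently added matching triple.
        best = max(matches, key=lambda t: self.store[t].time_added)
        qualifiers = self.store[best]
        qualifiers.last_accessed = now
        qualifiers.num_recalled += 1
        return best[2]


memory = TemporalKGMemory(capacity=2)
memory.add(("phone", "at_location", "kitchen"), now=0)
memory.add(("phone", "at_location", "office"), now=1)
answer = memory.query("phone", "at_location", now=2)   # newest observation wins
memory.add(("tom", "at_location", "lab"), now=3)       # evicts the stale kitchen triple
```

Swapping the key functions in `add` and `query` is enough to try other policies, for example evicting by `num_recalled` instead of `last_accessed`.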

2406.17115 2026-02-26 cs.CV cs.AI

Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models

Bei Yan, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen

Details
English abstract

Despite the outstanding performance in multimodal tasks, Large Vision-Language Models (LVLMs) have been plagued by the issue of hallucination, i.e., generating content that is inconsistent with the corresponding visual inputs. While previous works have proposed various benchmarks to evaluate this issue, the quality of these evaluations remains unverified. We observe that some of these benchmarks may produce inconsistent evaluation results across repeated tests or fail to align with human evaluation. To address this, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages specific indicators to assess both reliability and validity. Our empirical analysis using HQM reveals and pinpoints potential evaluation issues in existing benchmarks, exposing a critical gap in current hallucination evaluation. To bridge this gap, we propose HQH, a High-Quality Hallucination benchmark, which demonstrates superior reliability and validity under HQM, serving as a credible evaluation tool. Our large-scale evaluation of popular LVLMs on HQH reveals severe hallucination problems, which occur not only in the models' main answer to a question but also in additional analysis. This highlights the necessity for future model improvements to effectively mitigate hallucinations and reduce the associated security risks in real-world applications. Our benchmark is publicly available at https://github.com/HQHBench/HQHBench.

2406.05085 2026-02-26 cs.CL cs.AI cs.IR

Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michal Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Torsten Hoefler

Details
English abstract

Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by retrieving supporting documents into the prompt, but existing methods do not explicitly target queries that require fetching multiple documents with substantially different content. Such multi-aspect queries are challenging because relevant documents can be far apart in embedding space, making joint retrieval difficult. We introduce Multi-Head RAG (MRAG), which addresses this gap with a simple yet powerful idea: using Transformer multi-head attention activations, rather than the standard decoder-layer embedding, as retrieval keys. It leverages the observation that different heads capture different semantic aspects. This yields multi-aspect embeddings for both documents and queries, improving retrieval accuracy on complex queries. We show MRAG's design advantages over 18 RAG baselines, up to 20% higher retrieval success ratios for real-world use cases, and improved downstream LLM generation. MRAG integrates seamlessly with existing RAG frameworks and benchmarks.

2402.13604 2026-02-26 cs.CL econ.EM

Breaking the HISCO Barrier: Automatic Occupational Standardization with OccCANINE

Christian Møller Dahl, Torben Johansen, Christian Vedel

Comments All code and guides on how to use OccCANINE are available on GitHub https://github.com/christianvedels/OccCANINE

Details
English abstract

This paper introduces OccCANINE, an open-source tool that maps occupational descriptions to HISCO codes. Manual coding is slow and error-prone; OccCANINE replaces weeks of work with results in minutes. We fine-tune CANINE on 15.8 million description-code pairs from 29 sources in 13 languages. The model achieves 96 percent accuracy, precision, and recall. We also show that the approach generalizes to three systems - OCC1950, OCCICEM, and ISCO-68 - and release them open source. By breaking the "HISCO barrier," OccCANINE democratizes access to high-quality occupational coding, enabling broader research in economics, economic history, and related disciplines.

2305.15929 2026-02-26 cs.CL

Emergence of a phonological bias in ChatGPT

Juan Manuel Toro

Comments 15 pages, 1 figure, corrected typo

Details
English abstract

Current large language models, such as OpenAI's ChatGPT, have captured the public's attention because of how remarkable they are in their use of language. Here, I demonstrate that ChatGPT displays phonological biases that are a hallmark of human language processing. More concretely, just like humans, ChatGPT has a consonant bias. That is, the chatbot has a tendency to use consonants over vowels to identify words. This is observed across languages that differ in their relative distribution of consonants and vowels such as English and Spanish. Despite the differences in how current artificial intelligence language models are trained to process linguistic stimuli and how human infants acquire language, such training seems to be enough for the emergence of a phonological bias in ChatGPT.

2303.00799 2026-02-26 cs.AI cs.LG cs.MA

Fairness for Workers Who Pull the Arms: An Index Based Policy for Allocation of Restless Bandit Tasks

Arpita Biswas, Jackson A. Killian, Paula Rodriguez Diaz, Susobhan Ghosh, Milind Tambe

Comments 22nd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2023), 10 pages

Details
English abstract

Motivated by applications such as machine repair, project monitoring, and anti-poaching patrol scheduling, we study intervention planning of stochastic processes under resource constraints. This planning problem has previously been modeled as restless multi-armed bandits (RMAB), where each arm is an intervention-dependent Markov Decision Process. However, the existing literature assumes all intervention resources belong to a single uniform pool, limiting their applicability to real-world settings where interventions are carried out by a set of workers, each with their own costs, budgets, and intervention effects. In this work, we consider a novel RMAB setting, called multi-worker restless bandits (MWRMAB) with heterogeneous workers. The goal is to plan an intervention schedule that maximizes the expected reward while satisfying budget constraints on each worker as well as fairness in terms of the load assigned to each worker. Our contributions are two-fold: (1) we provide a multi-worker extension of the Whittle index to tackle heterogeneous costs and per-worker budget and (2) we develop an index-based scheduling policy to achieve fairness. Further, we evaluate our method on various cost structures and show that our method significantly outperforms other baselines in terms of fairness without sacrificing much in reward accumulated.

2002.10764 2026-02-26 cs.AI cs.GT

FairRec: Two-Sided Fairness for Personalized Recommendations in Two-Sided Platforms

Gourab K Patro, Arpita Biswas, Niloy Ganguly, Krishna P. Gummadi, Abhijnan Chakraborty

Comments In Proceedings of The Web Conference (WWW) 2020

Details
English abstract

We investigate the problem of fair recommendation in the context of two-sided online platforms, comprising customers on one side and producers on the other. Traditionally, recommendation services in these platforms have focused on maximizing customer satisfaction by tailoring the results according to the personalized preferences of individual customers. However, our investigation reveals that such customer-centric design may lead to unfair distribution of exposure among the producers, which may adversely impact their well-being. On the other hand, a producer-centric design might become unfair to the customers. Thus, we consider fairness issues that span both customers and producers. Our approach involves a novel mapping of the fair recommendation problem to a constrained version of the problem of fairly allocating indivisible goods. Our proposed FairRec algorithm guarantees at least Maximin Share (MMS) of exposure for most of the producers and Envy-Free up to One item (EF1) fairness for every customer. Extensive evaluations over multiple real-world datasets show the effectiveness of FairRec in ensuring two-sided fairness while incurring a marginal loss in the overall recommendation quality.
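The EF1 guarantee mentioned above can be stated as a simple check. The sketch below uses the standard definition for additive valuations: an agent may envy another bundle only to the extent that removing a single item from that bundle removes the envy. It is a generic checker, not the FairRec algorithm; the function name and the toy customers and products are hypothetical.

```python
def is_ef1(valuations, bundles):
    """Return True if the allocation is envy-free up to one item (EF1).

    valuations[a][g] is agent a's additive value for item g;
    bundles[a] is the list of items allocated to agent a.
    """
    for i in bundles:
        own_value = sum(valuations[i][g] for g in bundles[i])
        for k in bundles:
            if i == k or not bundles[k]:
                continue  # an empty bundle cannot be envied
            other_value = sum(valuations[i][g] for g in bundles[k])
            # Drop the item in k's bundle that agent i values most.
            best_item = max(valuations[i][g] for g in bundles[k])
            if own_value < other_value - best_item:
                return False
    return True


# Hypothetical customers "a" and "b" with values for products x, y, z.
valuations = {
    "a": {"x": 5, "y": 1, "z": 1},
    "b": {"x": 1, "y": 1, "z": 1},
}
fair = is_ef1(valuations, {"a": ["y"], "b": ["x", "z"]})
unfair = is_ef1(valuations, {"a": [], "b": ["x", "y", "z"]})
```

In the first allocation customer "a" envies "b", but the envy disappears once product x is removed from b's bundle, so the allocation still counts as EF1.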

2602.21717 2026-02-26 cs.LG cs.DB

C$^{2}$TC: A Training-Free Framework for Efficient Tabular Data Condensation

Sijia Xu, Fan Li, Xiaoyang Wang, Zhengyi Yang, Xuemin Lin

Details
English abstract

Tabular data is the primary data format in industrial relational databases, underpinning modern data analytics and decision-making. However, the increasing scale of tabular data poses significant computational and storage challenges to learning-based analytical systems. This highlights the need for data-efficient learning, which enables effective model training and generalization using substantially fewer samples. Dataset condensation (DC) has emerged as a promising data-centric paradigm that synthesizes small yet informative datasets to preserve data utility while reducing storage and training costs. However, existing DC methods are computationally intensive due to reliance on complex gradient-based optimization. Moreover, they often overlook key characteristics of tabular data, such as heterogeneous features and class imbalance. To address these limitations, we introduce C$^{2}$TC (Class-Adaptive Clustering for Tabular Condensation), the first training-free tabular dataset condensation framework that jointly optimizes class allocation and feature representation, enabling efficient and scalable condensation. Specifically, we reformulate the dataset condensation objective into a novel class-adaptive cluster allocation problem (CCAP), which eliminates costly training and integrates adaptive label allocation to handle class imbalance. To solve the NP-hard CCAP, we develop HFILS, a heuristic local search that alternates between soft allocation and class-wise clustering to efficiently obtain high-quality solutions. Moreover, a hybrid categorical feature encoding (HCFE) is proposed for semantics-preserving clustering of heterogeneous discrete attributes. Extensive experiments on 10 real-world datasets demonstrate that C$^{2}$TC improves efficiency by at least 2 orders of magnitude over state-of-the-art baselines, while achieving superior downstream performance.

2602.21716 2026-02-26 cs.CV

TranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection

Wenbin Wang, Yuge Huang, Jianqing Xu, Yue Yu, Jiangtao Yan, Shouhong Ding, Pan Zhou, Yong Luo

Details
English abstract

Rapid advances in AI-generated image (AIGI) technology enable highly realistic synthesis, threatening public information integrity and security. Recent studies have demonstrated that incorporating texture-level artifact features alongside semantic features into multimodal large language models (MLLMs) can enhance their AIGI detection capability. However, our preliminary analyses reveal that artifact features exhibit high intra-feature similarity, leading to an almost uniform attention map after the softmax operation. This phenomenon causes attention dilution, thereby hindering effective fusion between semantic and artifact features. To overcome this limitation, we propose a lightweight fusion adapter, TranX-Adapter, which integrates a Task-aware Optimal-Transport Fusion that leverages the Jensen-Shannon divergence between artifact and semantic prediction probabilities as a cost matrix to transfer artifact information into semantic features, and an X-Fusion that employs cross-attention to transfer semantic information into artifact features. Experiments on standard AIGI detection benchmarks with several advanced MLLMs show that our TranX-Adapter brings consistent and significant improvements (up to +6% accuracy).
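For readers unfamiliar with the cost used here, the following sketch computes the Jensen-Shannon divergence between two probability vectors and arranges the pairwise values into a cost matrix. How the paper derives per-token prediction probabilities is not stated in the abstract, so the input vectors and the helper names below are purely illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


def js_cost_matrix(artifact_probs, semantic_probs):
    """cost[i][j] = JS divergence between artifact row i and semantic row j."""
    return [[js_divergence(a, s) for s in semantic_probs] for a in artifact_probs]


# Hypothetical (real, fake) prediction probabilities for a few tokens.
artifact_probs = [[0.9, 0.1], [0.2, 0.8]]
semantic_probs = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]
cost = js_cost_matrix(artifact_probs, semantic_probs)
```

Entries are zero where the two predictions agree exactly and grow as they disagree, so an optimal-transport plan built on this matrix would favour moving information between tokens whose predictions are similar.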

2602.21712 2026-02-26 cs.CV

Innovative Tooth Segmentation Using Hierarchical Features and Bidirectional Sequence Modeling

Xinxin Zhao, Jian Jiang, Yan Tian, Liqin Wu, Zhaocheng Xu, Teddy Yang, Yunuo Zou, Xun Wang

Comments Accepted by Pattern Recognition

Details
Journal ref
Xinxin Zhao, Jian Jiang, Yan Tian, Liqin Wu, Zhaocheng Xu, Wei-fa Yang, Yunuo Zou, Xun Wang. Innovative tooth segmentation using hierarchical features and bidirectional sequence modeling[J]. Pattern Recognition, 2026, 175:113045
English abstract

Tooth image segmentation is a cornerstone of dental digitization. However, traditional image encoders relying on fixed-resolution feature maps often lead to discontinuous segmentation and poor discrimination between target regions and background, due to insufficient modeling of environmental and global context. Moreover, transformer-based self-attention introduces substantial computational overhead because of its quadratic complexity (O(n^2)), making it inefficient for high-resolution dental images. To address these challenges, we introduce a three-stage encoder with hierarchical feature representation to capture scale-adaptive information in dental images. By jointly leveraging low-level details and high-level semantics through cross-scale feature fusion, the model effectively preserves fine structural information while maintaining strong contextual awareness. Furthermore, a bidirectional sequence modeling strategy is incorporated to enhance global spatial context understanding without incurring high computational cost. We validate our method on two dental datasets, with experimental results demonstrating its superiority over existing approaches. On the OralVision dataset, our model achieves a 1.1% improvement in mean intersection over union (mIoU).
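The mIoU metric reported above can be computed as follows. This is a minimal sketch over flattened label lists; skipping classes that appear in neither the prediction nor the ground truth is one common convention and is an assumption here, as is the function name.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union across classes for flat label sequences."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # assumed convention: ignore classes absent from both
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two classes (0 = background, 1 = tooth).
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

Here the background class has IoU 1/2 and the tooth class 2/3, so the mean is 7/12.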