arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 1920
2508.04559 2026-04-15 cs.CV

One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion

Jinxi Liu, Zijian He, Guangrun Wang, Guanbin Li, Liang Lin

Abstract

Recent diffusion-based approaches have made significant advances in image-based virtual try-on, enabling more realistic and end-to-end garment synthesis. However, most existing methods remain constrained by their reliance on exhibition garments and segmentation masks, as well as their limited ability to handle flexible pose variations. These limitations reduce their practicality in real-world scenarios; for instance, users cannot easily transfer garments worn by one person onto another, and the generated try-on results are typically restricted to the same pose as the reference image. In this paper, we introduce OMFA (One Model For All), a unified diffusion framework for both virtual try-on and try-off that operates without the need for exhibition garments and supports arbitrary poses. OMFA is inspired by the mask-based paradigm of discrete diffusion language models and unifies try-on and try-off within a bidirectional framework. It is built upon a Bidirectional Tweedie Diffusion process for target-selective denoising in latent space. Instead of imposing lower body constraints, OMFA is an entirely mask-free framework that requires only a single portrait and a target garment as inputs, and is designed to support flexible outfit combinations and cross-person garment transfer, making it better aligned with practical usage scenarios. Additionally, by leveraging SMPL-X-based pose conditioning, OMFA supports multi-view and arbitrary-pose try-on from just one image. Extensive experiments demonstrate that OMFA achieves state-of-the-art results on both try-on and try-off tasks, providing a practical and generalizable solution for virtual garment synthesis. Project page: https://onemodelforall.github.io

2508.04282 2026-04-15 cs.AI

Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling

Yongyi Wang, Lingfeng Li, Bozhou Chen, Ang Li, Hanyu Liu, Qirui Zheng, Xionghui Yang, Wenxin Li

Comments Accepted by Frontiers of Computer Science (FCS), DOI: 10.1007/s11704-026-52148-y

Abstract

Recent benchmarks for memory-augmented reinforcement learning (RL) have introduced partially observable Markov decision process (POMDP) environments in which agents must use historical observations to make decisions. However, these benchmarks often lack fine-grained control over the challenges posed to memory models. Synthetic environments offer a solution, enabling precise manipulation of environment dynamics for rigorous and interpretable evaluation of memory-augmented RL. This paper advances the design of such customizable POMDPs with three key contributions: (1) a theoretical framework for analyzing POMDPs based on Memory Demand Structure (MDS) and related concepts; (2) a methodology using linear dynamics, state aggregation, and reward redistribution to construct POMDPs with predefined MDS; and (3) a suite of lightweight, scalable POMDP environments with tunable difficulty, grounded in our theoretical insights. Overall, our work clarifies core challenges in partially observable RL, offers principled guidelines for POMDP design, and aids in selecting and developing suitable memory architectures for RL tasks.

2507.22359 2026-04-15 cs.AI cs.CL

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Qianhong Guo, Wei Xie, Xiaofang Cai, Enze Wang, Shuoyoucheng Ma, Xiaobing Sun, Tian Xia, Kai Chen, Xiaofeng Wang, Baosheng Wang

Abstract

Although large language models (LLMs) have shown exceptional capabilities across a wide range of tasks, reliable evaluation remains a critical challenge due to data contamination, opaque operation, and subjective preferences. To address these issues, we propose League of LLMs (LOL), a novel benchmark-free evaluation paradigm that organizes multiple LLMs into a self-governed league for multi-round mutual evaluation. LOL integrates four core criteria (dynamic, transparent, objective, and professional) to mitigate key limitations of existing paradigms. Experiments on eight mainstream LLMs in mathematics and programming demonstrate that LOL can effectively distinguish LLM capabilities while maintaining high internal ranking stability (Top-k consistency = 70.7%). Beyond ranking, LOL reveals empirical findings that are difficult for traditional paradigms to capture. For instance, "memorization-based answering" behaviors are observed in some models, and higher in-family scores are found in the OpenAI model family (Δ = 9, p < 0.05). Finally, we make our framework and code publicly available as a valuable complement to the current LLM evaluation ecosystem.

2507.08977 2026-04-15 cs.LG cs.AI stat.ML

Simulation as Supervision: Mechanistic Pretraining for Scientific Discovery

Carson Dudley, Reiden Magdaleno, Christopher Harding, Marisa Eisenberg

Abstract

Scientific modeling faces a tradeoff between the interpretability of mechanistic theory and the predictive power of machine learning. While existing hybrid approaches have made progress by incorporating domain knowledge into machine learning methods as functional constraints, they can be limited by a reliance on precise mathematical specifications. When the underlying equations are partially unknown or misspecified, enforcing rigid constraints can introduce bias and hinder a model's ability to learn from data. We introduce Simulation-Grounded Neural Networks (SGNNs), a framework that incorporates scientific theory by using mechanistic simulations as training data for neural networks. By pretraining on diverse synthetic corpora that span multiple model structures and realistic observational noise, SGNNs internalize the underlying dynamics of a system as a structural prior. We evaluated SGNNs across multiple disciplines, including epidemiology, ecology, social science, and chemistry. In forecasting tasks, SGNNs outperformed both standard data-driven baselines and physics-constrained hybrid models. They nearly tripled the forecasting skill of the average CDC models in COVID-19 mortality forecasts and accurately forecasted high-dimensional ecological systems. SGNNs demonstrated robustness to model misspecification, performing well even when trained on data with incorrect assumptions. Our framework also introduces back-to-simulation attribution, a method for mechanistic interpretability that explains real-world dynamics by identifying their most similar counterparts within the simulated corpus. By unifying these techniques into a single framework, we demonstrate that diverse mechanistic simulations can serve as effective training data for robust scientific inference.

2507.08458 2026-04-15 cs.CV cs.AI

A document is worth a structured record: Principled inductive bias design for document recognition

Benjamin Meyer, Lukas Tuggener, Sascha Hänzi, Daniel Schmid, Erdal Ayfer, Benjamin F. Grewe, Ahmed Abdulkadir, Thilo Stadelmann

Abstract

Many document types use intrinsic, convention-driven structures that serve to encode precise and structured information, such as the conventions governing engineering drawings. However, many state-of-the-art approaches treat document recognition as a mere computer vision problem, neglecting these underlying document-type-specific structural properties, making them dependent on sub-optimal heuristic post-processing and rendering many less frequent or more complicated document types inaccessible to modern document recognition. We suggest a novel perspective that frames document recognition as a transcription task from a document to a record. This implies a natural grouping of documents based on the intrinsic structure inherent in their transcription, where related document types can be treated (and learned) similarly. We propose a method to design structure-specific relational inductive biases for the underlying machine-learned end-to-end document recognition systems, and a respective base transformer architecture that we successfully adapt to different structures. We demonstrate the effectiveness of the so-found inductive biases in extensive experiments with progressively complex record structures from monophonic sheet music, shape drawings, and simplified engineering drawings. By integrating an inductive bias for unrestricted graph structures, we train the first-ever successful end-to-end model to transcribe mechanical engineering drawings to their inherently interlinked information. Our approach is relevant to inform the design of document recognition systems for document types that are less well understood than standard OCR, OMR, etc., and serves as a guide to unify the design of future document foundation models.

2507.06448 2026-04-15 cs.CL

Perception-Aware Policy Optimization for Multimodal Reasoning

Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term, which can be seamlessly plugged into mainstream RLVR algorithms such as GRPO and DAPO. Notably, PAPO does not rely on additional data curation, reward models, or stronger teacher models. To further enhance the training stability of PAPO, we introduce the Double Entropy Loss, which effectively regularizes the new KL objective without compromising performance. Despite its simplicity, PAPO yields significant overall improvements of 4.4%-17.5% on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%-19.1%, on tasks with high vision dependency. We also observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO. Overall, our work introduces a deeper integration of perception-aware supervision into core learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Code and data will be made publicly available for research purposes. Project page: https://mikewangwzhl.github.io/PAPO.
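
The abstract above says only that the Implicit Perception Loss takes the form of a KL divergence term. As a minimal sketch of that building block, the snippet below computes a KL divergence between two discrete distributions. The distributions `p` and `q` and the function name `kl_divergence` are hypothetical placeholders; which distributions PAPO actually compares, and how the term is weighted inside GRPO or DAPO, is not stated in the abstract and is not assumed here.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i must be > 0 wherever p_i > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical next-token distributions over a 3-token vocabulary.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

kl_same = kl_divergence(p, p)  # identical distributions give zero divergence
kl_diff = kl_divergence(p, q)  # differing distributions give a positive value
```

The divergence is zero only when the two distributions coincide, which is what makes a term of this kind usable as a training signal.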

2507.04017 2026-04-15 cs.CV

Habitat Classification from Ground-Level Imagery Using Deep Neural Networks

Hongrui Shi, Lisa Norton, Lucy Ridding, Simon Rolph, Tom August, Claire M Wood, Lan Qie, Petra Bosilj, James M Brown

Comments Accepted to Ecological Informatics. Main paper has 18 pages, 7 figures, 4 tables. Appendix has 10 pages, 8 figures, 2 tables

Abstract

Habitat assessment at local scales, critical for enhancing biodiversity and guiding conservation priorities, often relies on expert field surveys that can be costly, motivating the exploration of AI-driven tools to automate and refine this process. While most AI-driven habitat mapping depends on remote sensing, it is often constrained by sensor availability, weather, and coarse resolution. In contrast, ground-level imagery captures essential structural and compositional cues invisible from above and remains underexplored for robust, fine-grained habitat classification. This study addresses this gap by applying state-of-the-art deep neural network architectures to ground-level habitat imagery. Leveraging data from the UK Countryside Survey covering 18 broad habitat types, we evaluate two families of models, convolutional neural networks (CNNs) and vision transformers (ViTs), under both supervised and supervised contrastive learning paradigms. Our results demonstrate that ViTs consistently outperform state-of-the-art CNN baselines on key classification metrics (Top-3 accuracy = 91%, MCC = 0.66) and offer more interpretable scene understanding tailored to ground-level images. Moreover, supervised contrastive learning significantly reduces misclassification rates among visually similar habitats (e.g., Improved vs. Neutral Grassland), driven by a more discriminative embedding space. Finally, our best model performs on par with experienced ecological experts in habitat classification from images, underscoring the promise of expert-level automated assessment. By integrating advanced AI with ecological expertise, this research establishes a scalable, cost-effective framework for ground-level habitat monitoring to accelerate biodiversity conservation and inform land-use decisions at a national scale.

2505.23209 2026-04-15 cs.CV

Navigating the Accuracy-Size Trade-Off with Flexible Model Merging

Akash Dhasade, Divyansh Jhunjhunwala, Milos Vujasinovic, Gauri Joshi, Anne-Marie Kermarrec

Comments Accepted at ICLR 2026

Abstract

Model merging has emerged as an efficient method to combine multiple single-task fine-tuned models. The merged model can enjoy multi-task capabilities without expensive training. While promising, merging into a single model often suffers from an accuracy gap with respect to the fine-tuned models. On the other hand, deploying all individual fine-tuned models incurs high storage costs. We propose FlexMerge, a novel data-free model merging framework that: (a) flexibly generates merged models of varying sizes, spanning the full spectrum from a single merged model to retaining all fine-tuned models; and (b) supports multiple merging algorithms in a unified framework. Using FlexMerge, we systematically characterize the accuracy-size trade-off of different algorithms. Our study reveals two key findings: first, even modestly larger merged models can yield steep accuracy gains (up to 13.5% when just doubling the size); second, algorithm rankings are not consistent as size increases, with some methods overtaking others beyond the one-model regime. These results uncover a new design dimension for model merging: developing and comparing algorithms across the full spectrum of sizes rather than only at the single-model limit. Extensive experiments on vision and NLP benchmarks, with up to 30 tasks, confirm the generality and practicality of FlexMerge.

2505.19328 2026-04-15 cs.CV cs.LG

BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change

Manuela González-González, Soufiane Belharbi, Muhammad Osama Zeeshan, Masoumeh Sharafi, Muhammad Haseeb Aslam, Marco Pedersoli, Alessandro Lameiras Koerich, Simon L Bacon, Eric Granger

Comments 46 pages, 21 figures, ICLR 2026

Abstract

Ambivalence and hesitancy (A/H), closely related constructs, are the primary reasons why individuals delay, avoid, or abandon health behaviour changes. They are subtle and conflicting emotions that set a person in a state between positive and negative orientations, or between acceptance and refusal to do something. They manifest as a discord in affect between multiple modalities or within a modality, such as facial and vocal expressions, and body language. Although experts can be trained to recognize A/H as done for in-person interactions, integrating them into digital health interventions is costly and less effective. Automatic A/H recognition is therefore critical for the personalization and cost-effectiveness of digital behaviour change interventions. However, no datasets currently exist for the design of machine learning models to recognize A/H. This paper introduces the Behavioural Ambivalence/Hesitancy (BAH) dataset collected for multimodal recognition of A/H in videos. It contains 1,427 videos with a total duration of 10.60 hours, captured from 300 participants across Canada, answering predefined questions to elicit A/H. It is intended to mirror real-world digital behaviour change interventions delivered online. BAH is annotated by three experts to provide timestamps that indicate where A/H occurs, and frame- and video-level annotations with A/H cues. Video transcripts, cropped and aligned faces, and participant metadata are also provided. Since A and H manifest similarly in practice, we provide a binary annotation indicating the presence or absence of A/H. Additionally, this paper includes benchmarking results using baseline models on BAH for frame- and video-level recognition, and different learning setups. The limited performance highlights the need for adapted multimodal and spatio-temporal models for A/H recognition. The data and code are publicly available.

2505.17384 2026-04-15 cs.LG cs.CV stat.ML

Variational Autoencoding Discrete Diffusion with Enhanced Dimensional Correlations Modeling

Tianyu Xie, Shuchen Xue, Zijin Feng, Tianyang Hu, Jiacheng Sun, Zhenguo Li, Cheng Zhang

Comments ICLR 2026 Poster; 24 pages, 13 figures

Abstract

Discrete diffusion models have recently shown great promise for modeling complex discrete data, with masked diffusion models (MDMs) offering a compelling trade-off between quality and generation speed. MDMs denoise by progressively unmasking multiple dimensions from an all-masked input, but their performance can degrade when using few denoising steps due to limited modeling of inter-dimensional dependencies. In this paper, we propose Variational Autoencoding Discrete Diffusion (VADD), a novel framework that enhances discrete diffusion with latent variable modeling to implicitly capture correlations among dimensions. By introducing an auxiliary recognition model, VADD enables stable training via variational lower bounds maximization and amortized inference over the training set. Our approach retains the efficiency of traditional MDMs while significantly improving sample quality, especially when the number of denoising steps is small. Empirical results on 2D toy data, pixel-level image generation, and text generation demonstrate that VADD consistently outperforms MDM baselines in sample quality with few denoising steps.

2504.06983 2026-04-15 cs.LG math.PR stat.ML

Free Random Projection for In-Context Reinforcement Learning

Tomohiro Hayase, Benoît Collins, Nakamasa Inoue

Comments Accepted to AISTATS 2026. Code available at https://github.com/ThayaFluss/frp_rl

Abstract

Hierarchical inductive biases are hypothesized to promote generalizable policies in reinforcement learning, as demonstrated by explicit hyperbolic latent representations and architectures. Therefore, a more flexible approach is to have these biases emerge naturally from the algorithm. We introduce Free Random Projection, an input mapping grounded in free probability theory that constructs random orthogonal matrices where hierarchical structure arises inherently. The free random projection integrates seamlessly into existing in-context reinforcement learning frameworks by encoding hierarchical organization within the input space without requiring explicit architectural modifications. Empirical results on multi-environment benchmarks show that free random projection consistently outperforms the standard random projection, leading to improvements in generalization. Furthermore, analyses within linearly solvable Markov decision processes and investigations of the spectrum of kernel random matrices reveal the theoretical underpinnings of free random projection's enhanced performance, highlighting its capacity for effective adaptation in hierarchically structured state spaces.

2503.23178 2026-04-15 cs.CV

Intelligent bear deterrence system based on computer vision: Reducing human-bear conflicts in remote areas

Pengyu Chen, Teng Fei, John A. Kupfer, Yunyan Du, Jiawei Yi, Yi Li

Journal ref
Ursus 2026(37e6), 1-11 (2026)
Abstract

Human-bear conflicts on the Tibetan Plateau threaten both local livelihoods and the conservation of Tibetan brown bears (Ursus arctos pruinosus). To address this challenge, we developed a low-power, network-independent deterrence system that combines computer vision with Internet of Things (IoT) hardware. The system integrates a YOLOv5-MobileNet detection model deployed on a low-power edge artificial intelligence (AI) board with a solar-powered bear spray device. We compiled a data set of 1,243 wildlife images (including 795 bears with 100 infrared captures for nighttime detection, plus other common objects and animals such as mastiffs, yaks, humans, and vehicles), from which 80% were used for training and 20% for validation. Validation showed robust performance (mean average precision = 91.4%, recall = 93.6%). In 100 controlled activation tests involving simulated approaches by bears, humans, and other animals, the spray deployed within 0.2 seconds of detection with 97.2% accuracy, confirming timely and reliable responses. A 30-day field trial in Zadoi County, Qinghai Province, China, recorded 3 successful deterrence events without false activations. By using energy-efficient components and ensuring continuous and stable system operation, this solution provides a practical, sustainable, and scalable approach to mitigating human-bear conflicts, effectively enhancing human safety and bear conservation in remote areas without network or grid coverage.

2503.21708 2026-04-15 cs.LG cs.AI cs.CL

On the Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions

Felix Stollenwerk

Comments EACL 2026 (Main), see https://aclanthology.org/2026.eacl-short.48/

Abstract

Layer normalization (LN) is an essential component of modern neural networks. While many alternative techniques have been proposed, none of them have succeeded in replacing LN so far. The latest suggestion in this line of research is a dynamic activation function called Dynamic Tanh (DyT). Although it is empirically well-motivated and appealing from a practical point of view, it lacks a theoretical foundation. In this work, we shed light on the mathematical relationship between LN and dynamic activation functions. In particular, we derive DyT from the LN variant RMSNorm, and show that a well-defined decoupling in derivative space as well as an approximation are needed to do so. By applying the same decoupling procedure directly in function space, we are able to omit the approximation and obtain the exact element-wise counterpart of RMSNorm, which we call Dynamic Inverse Square Root Unit (DyISRU). We demonstrate numerically that DyISRU reproduces the normalization effect on outliers more accurately than DyT does.
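
To make the contrast in the abstract above concrete, here is a minimal sketch of the two operations it relates. The formulas are the commonly cited definitions of RMSNorm (divide by the root mean square of the vector) and Dynamic Tanh (an element-wise scaled tanh), not something taken from this abstract. The parameter names `gain`, `alpha`, `gamma`, `beta` and their default values are illustrative assumptions. DyISRU is left out because the abstract does not give its formula.

```python
import math

def rms_norm(x, gain=1.0, eps=1e-8):
    """RMSNorm: scale every element by the root mean square of the whole vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gain * v / rms for v in x]

def dyt(x, alpha=0.5, gamma=1.0, beta=0.0):
    """Dynamic Tanh: element-wise gamma * tanh(alpha * v) + beta, no vector statistics."""
    return [gamma * math.tanh(alpha * v) + beta for v in x]

x = [0.1, -0.2, 0.3, 10.0]  # the last entry plays the role of an outlier

normed = rms_norm(x)   # depends on all elements through the shared RMS
squashed = dyt(x)      # each element is transformed independently
```

The structural difference is visible in the code: `rms_norm` needs a statistic of the full vector, while `dyt` acts on each element on its own, which is why relating the two takes the decoupling step the abstract mentions.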

2503.09441 2026-04-15 cs.RO cs.SY eess.SY

Learned Incremental Nonlinear Dynamic Inversion for Quadrotors with and without Slung Payloads

Eckart Cobo-Briesewitz, Khaled Wahba, Wolfgang Hönig

Comments Accepted to L4DC 2026

Abstract

The increasing complexity of multirotor applications demands flight controllers that can accurately account for all forces acting on the vehicle. Conventional controllers model most aerodynamic and dynamic effects but often neglect higher-order forces, as their accurate estimation is computationally expensive. Incremental Nonlinear Dynamic Inversion (INDI) offers an alternative by estimating residual forces from differences in sensor measurements; however, its reliance on specialized and often noisy sensors limits its applicability. Recent work has demonstrated that residual forces can be predicted using learning-based methods. In this paper, we show that a neural network can generate smooth approximations of INDI outputs without requiring specialized rotor RPM sensor inputs. We further propose a hybrid approach that integrates learning-based predictions with INDI and demonstrate both methods for multirotors and multirotors carrying slung payloads. Experimental results on trajectory tracking errors demonstrate that the specialized sensor measurements required by INDI can be eliminated by replacing the residual computation with a neural network.

2503.05167 2026-04-15 cs.LG

FMASH: Advancing Traditional Chinese Medicine Formula Recommendation with Efficient Fusion of Multiscale Associations of Symptoms and Herbs

Xinhan Zheng, Xueting Wang, Ruotai Li, Huyu Wu, Haopeng Jin, Yehan Yang, Guodong Shan

Abstract

Traditional Chinese medicine (TCM) exhibits remarkable therapeutic efficacy in healthcare through patient-specific formulas. However, current AI-based TCM formula recommendation models and methods mainly focus on data-based textual associations between symptoms and herbs, and have not fully utilized their features and relations at different scales, especially at the molecular scale. To address these limitations, we propose the Fusion of Multiscale Associations of Symptoms and Herbs (FMASH), a novel framework that effectively incorporates the properties of herbs on different scales with clinical symptoms and provides refined embeddings of their multiscale associations. The framework integrates molecular-scale features and macroscopic properties of herbs and combines complex local and global relations in the heterogeneous graph of symptoms and herbs. Moreover, it provides effective representation embeddings of the multiscale features and associations of symptoms and herbs in a unified semantic space. Comprehensive experiments have been conducted on FMASH, and the results demonstrate that our FMASH-based model outperforms the state-of-the-art (SOTA) model on both datasets, confirming the effectiveness of FMASH in building the TCM formula recommendation model. In Dataset1, our model has achieved a significant improvement compared to the SOTA model, with increases of 3.38% in Precision@5, 3.89% in Recall@5, and 3.69% in F1-score@5. In Dataset2, Precision@5, Recall@5, and F1-score@5 increase by 2.64%, 1.92%, and 2.23%, respectively. This work facilitates the application of the AI-based TCM formula recommendation and promotes the innovative development of TCM diagnosis and treatment.

2501.13340 2026-04-15 cs.CV

Retrievals Can Be Detrimental: Unveiling the Backdoor Vulnerability of Retrieval-Augmented Diffusion Models

Hao Fang, Xiaohang Sui, Hongyao Yu, Kuofeng Gao, Jiawei Kong, Sijin Yu, Bin Chen, Shu-Tao Xia

Comments Accepted by ACL-2026

Abstract

Diffusion models (DMs) have recently demonstrated remarkable generation capability. However, their training generally requires huge computational resources and large-scale datasets. To solve these, recent studies empower DMs with the advanced Retrieval-Augmented Generation (RAG) technique and propose retrieval-augmented diffusion models (RDMs). By incorporating rich knowledge from an auxiliary database, RAG enhances diffusion models' generation and generalization ability while significantly reducing model parameters. Despite the great success, RAG may introduce novel security issues that warrant further investigation. In this paper, we reveal that the RDM is susceptible to backdoor attacks by proposing a multimodal contrastive attack approach named BadRDM. Our framework fully considers RAG's characteristics and is devised to manipulate the retrieved items for given text triggers, thereby further controlling the generated contents. Specifically, we first insert a tiny portion of images into the retrieval database as target toxicity surrogates. Subsequently, a malicious variant of contrastive learning is adopted to inject backdoors into the retriever, which builds shortcuts from triggers to the toxicity surrogates. Furthermore, we enhance the attacks through novel entropy-based selection and generative augmentation strategies that can derive better toxicity surrogates. Extensive experiments on two mainstream tasks demonstrate the proposed BadRDM achieves outstanding attack effects while preserving the model's benign utility.

2412.07238 2026-04-15 cs.CL q-bio.NC

Speaker effects in language comprehension: An integrative model of language and speaker processing

Hanlin Wu, Zhenguang G. Cai

Journal ref
Psychon Bull Rev 33, 138 (2026)
Abstract

The identity of a speaker influences language comprehension through modulating perception and expectation. This review explores speaker effects and proposes an integrative model of language and speaker processing that integrates distinct mechanistic perspectives. We argue that speaker effects arise from the interplay between bottom-up perception-based processes, driven by acoustic-episodic memory, and top-down expectation-based processes, driven by a speaker model. We show that language and speaker processing are functionally integrated through multi-level probabilistic processing: prior beliefs about a speaker modulate language processing at the phonetic, lexical, and semantic levels, while the unfolding speech and message continuously update the speaker model, refining broad demographic priors into precise individualized representations. Within this framework, we distinguish between speaker-idiosyncrasy effects arising from familiarity with an individual and speaker-demographics effects arising from social group expectations. We discuss how speaker effects serve as indices for assessing language development and social cognition, and we encourage future research to extend these findings to the emerging domain of artificial intelligence (AI) speakers, as AI agents represent a new class of social interlocutors that are transforming the way we engage in communication.

2410.23728 2026-04-15 cs.CL

GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization

Irina Tolstykh, Aleksandra Tsybina, Sergey Yakubson, Aleksandr Gordeev, Vladimir Dokholyan, Maksim Kuprashevich

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026

Abstract

With the increasing quality and spread of LLM assistants, the amount of generated content is growing rapidly. In many cases and tasks, such texts are already indistinguishable from those written by humans, and the quality of generation continues to increase. At the same time, detection methods are advancing more slowly than generation models, making it challenging to prevent misuse of generative AI technologies. We propose GigaCheck, a dual-strategy framework for AI-generated text detection. At the document level, we leverage the representation learning of fine-tuned LLMs to discern authorship with high data efficiency. At the span level, we introduce a novel structural adaptation that treats generated text segments as "objects." By integrating a DETR-like vision model with linguistic encoders, we achieve precise localization of AI intervals, effectively transferring the robustness of visual object detection to the textual domain. Experimental results across three classification and three localization benchmarks confirm the robustness of our approach. The shared fine-tuned backbone delivers strong accuracy in both scenarios, highlighting the generalization power of the learned embeddings. Moreover, we successfully demonstrate that visual detection architectures like DETR are not limited to pixel space, effectively generalizing to the localization of generated text spans. To ensure reproducibility and foster further research, we publicly release our source code.

2410.03000 2026-04-15 cs.LG cs.CR

Towards Generalized Certified Robustness with Multi-Norm Training

Enyi Jiang, David S. Cheung, Gagandeep Singh

Comments Accepted by TMLR 2026

Abstract

Existing certified training methods can only train models to be robust against a certain perturbation type (e.g. $l_\infty$ or $l_2$). However, an $l_\infty$ certifiably robust model may not be certifiably robust against $l_2$ perturbation (and vice versa) and also has low robustness against other perturbations (e.g. geometric and patch transformation). By constructing a theoretical framework to analyze and mitigate the tradeoff, we propose the first multi-norm certified training framework CURE, consisting of several multi-norm certified training methods, to attain better union robustness when training from scratch or fine-tuning a pre-trained certified model. Inspired by our theoretical findings, we devise bound alignment and connect natural training with certified training for better union robustness. Compared with SOTA-certified training, CURE improves union robustness to 32.0% on MNIST, 25.8% on CIFAR-10, and 10.6% on TinyImagenet across different epsilon values. It leads to better generalization on a diverse set of challenging unseen geometric and patch perturbations to 6.8% and 16.0% on CIFAR-10. Overall, our contributions pave a path towards generalized certified robustness.

2407.17182 2026-04-15 cs.LG

A DeepONet for inverting the Neumann-to-Dirichlet Operator in Electrical Impedance Tomography: An approximation theoretic perspective and numerical results

Anuj Abhishek, Thilo Strauss

Abstract

In this work, we consider the non-invasive medical imaging modality of Electrical Impedance Tomography (EIT), where the goal is to recover the conductivity in a medium from boundary current-to-voltage measurements, i.e., the Neumann-to-Dirichlet (N-t-D) operator. We formulate this inverse problem as an operator-learning task, where the aim is to approximate the implicitly defined map from N-t-D operators to admissible conductivities. To this end, we employ a Deep Operator Network (DeepONet) architecture, thereby extending operator learning beyond the classical function-to-function setting to the more challenging operator-to-function regime. We establish a universal approximation theorem that guarantees that such operator-to-function maps can be approximated arbitrarily well by DeepONets. Furthermore, we provide a computational implementation of our approach and compare it against the iteratively regularized Gauss-Newton (IRGN) method. Our results show that the proposed framework yields accurate and robust reconstructions, outperforms the baseline, and demonstrates strong generalization. To our knowledge, this is the first work that combines rigorous approximation-theoretic guarantees with DeepONet-based inversion for EIT, thereby opening a principled and interpretable pathway for the use of DeepONets in such inverse problems.

2405.18921 2026-04-15 cs.LG

GLANCE: Global Actions in a Nutshell for Counterfactual Explainability

Loukas Kavouras, Eleni Psaroudaki, Konstantinos Tsopelas, Dimitrios Rontogiannis, Nikolaos Theologitis, Dimitris Sacharidis, Giorgos Giannopoulos, Dimitrios Tomaras, Kleopatra Markou, Dimitrios Gunopulos, Dimitris Fotakis, Ioannis Emiris

Journal ref
2026 Proceedings of the 40th Annual AAAI Conference on Artificial Intelligence
Abstract

The widespread deployment of machine learning systems in critical real-world decision-making applications has highlighted the urgent need for counterfactual explainability methods that operate effectively. Global counterfactual explanations, expressed as actions to offer recourse, aim to provide succinct explanations and insights applicable to large population subgroups. High effectiveness, measured by the fraction of the population that is provided recourse, ensures that the actions benefit as many individuals as possible. Keeping the cost of actions low ensures the proposed recourse actions remain practical and actionable. Limiting the number of actions that provide global counterfactuals is essential to maximizing interpretability. The primary challenge, therefore, is to balance these trade-offs: maximizing effectiveness and minimizing cost while maintaining a small number of actions. We introduce GLANCE, a versatile and adaptive algorithm that employs a novel agglomerative approach, jointly considering both the feature space and the space of counterfactual actions, thereby accounting for the distribution of points in a way that aligns with the model's structure. This design enables the careful balancing of the trade-offs among the three key objectives, with the size objective functioning as a tunable parameter to keep the actions few and easy to interpret. Our extensive experimental evaluation demonstrates that GLANCE consistently shows greater robustness and performance compared to existing methods across various datasets and models.
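
The first two objectives named in the abstract can be made concrete with a small scoring routine. This is a minimal sketch, not the GLANCE algorithm: it only evaluates a given set of actions. It assumes additive actions on numeric features, an L1 action cost, and that each individual takes the cheapest action that flips the prediction. The names `evaluate_actions` and `toy_model` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def evaluate_actions(
    population: List[Vector],
    actions: List[Vector],
    predict: Callable[[Vector], int],
) -> Tuple[float, float]:
    """Return (effectiveness, mean cost) for a small set of global actions.

    Effectiveness is the fraction of the population that obtains the
    favourable outcome (label 1) with at least one action. Cost is assumed
    to be the L1 norm of the cheapest successful action per individual.
    """
    costs = []
    for x in population:
        best = None
        for action in actions:
            shifted = [xi + ai for xi, ai in zip(x, action)]
            if predict(shifted) == 1:
                cost = sum(abs(ai) for ai in action)
                if best is None or cost < best:
                    best = cost
        if best is not None:
            costs.append(best)
    effectiveness = len(costs) / len(population)
    mean_cost = sum(costs) / len(costs) if costs else 0.0
    return effectiveness, mean_cost


def toy_model(x: Vector) -> int:
    # Hypothetical classifier: favourable when the two features sum to 10 or more.
    return 1 if x[0] + x[1] >= 10 else 0


population = [[4.0, 4.0], [7.0, 1.0], [1.0, 1.0], [9.0, 0.0]]
actions = [[2.0, 0.0], [0.0, 3.0]]
print(evaluate_actions(population, actions, toy_model))  # (0.75, 2.0)
```

The third objective, the number of actions, is simply `len(actions)`; the abstract treats it as a tunable parameter rather than a quantity to be scored.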

2403.07818 2026-04-15 cs.CV cs.AI cs.LG

Label Dropout: Improved Deep Learning Echocardiography Segmentation Using Multiple Datasets With Domain Shift and Partial Labelling

Iman Islam, Esther Puyol-Antón, Bram Ruijsink, Andrew J. Reader, Andrew P. King

Comments 10 pages, 5 figures, ASMUS 2024, Held in Conjunction with MICCAI 2024

Abstract

Echocardiography (echo) is the first imaging modality used when assessing cardiac function. The measurement of functional biomarkers from echo relies upon the segmentation of cardiac structures, and deep learning models have been proposed to automate the segmentation process. However, in order to translate these tools to widespread clinical use, it is important that the segmentation models are robust to a wide variety of images (e.g. acquired from different scanners, by operators with different levels of expertise, etc.). To achieve this level of robustness, it is necessary that the models are trained with multiple diverse datasets. A significant challenge faced when training with multiple diverse datasets is the variation in label presence, i.e. the combined data are often partially labelled. Adaptations of the cross entropy loss function have been proposed to deal with partially labelled data. In this paper, we show that training naively with such a loss function and multiple diverse datasets can lead to a form of shortcut learning, where the model associates label presence with domain characteristics, leading to a drop in performance. To address this problem, we propose a novel label dropout scheme to break the link between domain characteristics and the presence or absence of labels. We demonstrate that label dropout improves echo segmentation Dice score by 62% and 25% on two cardiac structures when training using multiple diverse partially labelled datasets.

2309.13904 2026-04-15 cs.CV

Subspace-Guided Feature Reconstruction for Unsupervised Anomaly Localization

Katsuya Hotta, Chao Zhang, Yoshihiro Hagihara, Takuya Akashi

Abstract

Unsupervised anomaly localization aims to identify anomalous regions that deviate from normal sample patterns. Most recent methods perform feature matching or reconstruction for the target sample with pre-trained deep neural networks. However, they still struggle to address challenging anomalies because the deep embeddings stored in the memory bank can be less powerful and informative. Specifically, prior methods often overly rely on the finite resources stored in the memory bank, which leads to low robustness to unseen targets. In this paper, we propose a novel subspace-guided feature reconstruction framework to pursue adaptive feature approximation for anomaly localization. It first learns to construct low-dimensional subspaces from the given nominal samples, and then learns to reconstruct the given deep target embedding by linearly combining the subspace basis vectors using the self-expressive model. Our core idea is that, despite the limited resources in the memory bank, the out-of-bank features can be alternatively "mimicked" to adaptively model the target. Moreover, we propose a sampling method that leverages the sparsity of subspaces and allows the feature reconstruction to depend only on a small resource subset, contributing to less memory overhead. Extensive experiments on three benchmark datasets demonstrate that our approach generally achieves state-of-the-art anomaly localization performance.
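
The reconstruct-then-score idea can be illustrated with plain linear algebra. The sketch below is a simplified stand-in, not the paper's method: it assumes a single subspace estimated by SVD of the centred nominal features instead of the learned self-expressive model, and it uses the reconstruction residual as the anomaly score. All names and the synthetic data are hypothetical.

```python
import numpy as np


def fit_subspace(nominal: np.ndarray, rank: int):
    """Estimate a low-dimensional basis from nominal features (rows = samples)."""
    mean = nominal.mean(axis=0)
    # Rows of vt are orthonormal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(nominal - mean, full_matrices=False)
    return mean, vt[:rank]


def anomaly_score(feature: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> float:
    """Residual norm after reconstructing `feature` from the basis vectors."""
    centered = feature - mean
    coeffs = basis @ centered          # least-squares weights (basis is orthonormal)
    reconstruction = basis.T @ coeffs  # linear combination of basis vectors
    return float(np.linalg.norm(centered - reconstruction))


rng = np.random.default_rng(0)
plane = rng.normal(size=(2, 8))                  # nominal features live on a 2-D plane in 8-D
nominal = rng.normal(size=(200, 2)) @ plane
mean, basis = fit_subspace(nominal, rank=2)

normal_target = rng.normal(size=2) @ plane       # unseen, but on the same plane
anomalous_target = normal_target + 3.0 * rng.normal(size=8)

print(anomaly_score(normal_target, mean, basis))
print(anomaly_score(anomalous_target, mean, basis))
```

The unseen nominal target is not stored anywhere, yet it is reconstructed almost exactly because it lies in the span of the basis, which is the sense in which out-of-bank features can be "mimicked".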

2604.12551 2026-04-15 cs.CV

Cross-Attentive Multiview Fusion of Vision-Language Embeddings

Tomas Berriel Martins, Martin R. Oswald, Javier Civera

Abstract

Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative one, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, which we denote with its acronym CAMFusion, not only consistently outperforms naive averaging or single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.

2604.12545 2026-04-15 cs.AI cs.CY

Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents

Wanchun Ni, Jiugeng Sun, Yixian Liu, Mennatallah El-Assady

Comments To appear in the CHI 2026 Workshop on PoliSim

Abstract

Improving policymaking is a central concern in public administration. Prior human subject studies reveal substantial cross-cultural differences in citizens' emotional responses to red tape during policy implementation. While LLM agents offer opportunities to simulate human-like responses and reduce experimental costs, their ability to generate culturally appropriate emotional responses to red tape remains unverified. To address this gap, we propose an evaluation framework for assessing LLMs' emotional responses to red tape across diverse cultural contexts. As a pilot study, we apply this framework to a single red-tape scenario. Our results show that all models exhibit limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. Cultural prompting strategies prove largely ineffective in improving alignment. We further introduce RAMO, an interactive interface for simulating citizens' emotional responses to red tape and for collecting human data to improve models. The interface is publicly available at https://ramo-chi.ivia.ch.

2604.12543 2026-04-15 cs.AI

A Two-Stage LLM Framework for Accessible and Verified XAI Explanations

Georgios Mermigkis, Dimitris Metaxakis, Marios Tyrovolas, Argiris Sofotasios, Nikolaos Avgeris, Panagiotis Hadjidoukas, Chrysostomos Stylios

Comments 8 pages, 8 figures, Accepted for publication at the 2026 IEEE World Congress on Computational Intelligence (WCCI 2026)

Abstract

Large Language Models (LLMs) are increasingly used to translate the technical outputs of eXplainable Artificial Intelligence (XAI) methods into accessible natural-language explanations. However, existing approaches often lack guarantees of accuracy, faithfulness, and completeness. At the same time, current efforts to evaluate such narratives remain largely subjective or confined to post-hoc scoring, offering no safeguards to prevent flawed explanations from reaching end-users. To address these limitations, this paper proposes a Two-Stage LLM Meta-Verification Framework that consists of (i) an Explainer LLM that converts raw XAI outputs into natural-language narratives, (ii) a Verifier LLM that assesses them in terms of faithfulness, coherence, completeness, and hallucination risk, and (iii) an iterative refeed mechanism that uses the Verifier's feedback to refine and improve them. Experiments across five XAI techniques and datasets, using three families of open-weight LLMs, show that verification is crucial for filtering unreliable explanations while improving linguistic accessibility compared with raw XAI outputs. In addition, the analysis of the Entropy Production Rate (EPR) during the refinement process indicates that the Verifier's feedback progressively guides the Explainer toward more stable and coherent reasoning. Overall, the proposed framework provides an efficient pathway toward more trustworthy and democratized XAI systems.
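
The three components listed in the abstract suggest a simple control loop. The sketch below is an assumption about how such a loop could be wired, not the authors' implementation: the Explainer and Verifier are passed in as plain callables, a rejected narrative is withheld rather than returned, and the toy stand-ins replace real LLM calls. Every name here is hypothetical.

```python
from typing import Callable, Optional, Tuple

# (raw XAI output, verifier feedback or None) -> narrative
Explainer = Callable[[str, Optional[str]], str]
# (raw XAI output, narrative) -> (accepted?, feedback)
Verifier = Callable[[str, str], Tuple[bool, str]]


def explain_with_verification(
    xai_output: str,
    explainer: Explainer,
    verifier: Verifier,
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate, verify, and refine a narrative; return None if never accepted."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        narrative = explainer(xai_output, feedback)
        accepted, feedback = verifier(xai_output, narrative)
        if accepted:
            return narrative
    return None  # the flawed explanation is withheld from the end-user


def toy_explainer(xai_output: str, feedback: Optional[str]) -> str:
    # Stand-in for an LLM call: only becomes specific after receiving feedback.
    if feedback is None:
        return "The model relied on several features."
    return f"The model relied mostly on: {xai_output}."


def toy_verifier(xai_output: str, narrative: str) -> Tuple[bool, str]:
    # Stand-in faithfulness check: the narrative must quote the attributions.
    if xai_output in narrative:
        return True, ""
    return False, "Mention the attributed features explicitly."


result = explain_with_verification("age=0.6, income=0.3", toy_explainer, toy_verifier)
print(result)
```

Capping the number of rounds keeps the cost bounded; what to do with a narrative that is still rejected after the last round is a design choice the abstract leaves open.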

2604.12540 2026-04-15 cs.CL cs.AI

When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP

Mahounan Pericles Adjovi, Roald Eiselen, Prasenjit Mitra

Comments 13 pages, 6 tables; previously submitted to KDD 2026

Abstract

Data scarcity limits NLP development for low-resource African languages. We evaluate two data augmentation methods -- LLM-based generation (Gemini 2.5 Flash) and back-translation (NLLB-200) -- for Hausa and Fongbe, two West African languages that differ substantially in LLM generation quality. We assess augmentation on named entity recognition (NER) and part-of-speech (POS) tagging using MasakhaNER 2.0 and MasakhaPOS benchmarks. Our results reveal that augmentation effectiveness depends on task type rather than language or LLM quality alone. For NER, neither method improves over baseline for either language; LLM augmentation reduces Hausa NER by 0.24% F1 and Fongbe NER by 1.81% F1. For POS tagging, LLM augmentation improves Fongbe by 0.33% accuracy, while back-translation improves Hausa by 0.17%; back-translation reduces Fongbe POS by 0.35% and has negligible effect on Hausa POS. The same LLM-generated synthetic data produces opposite effects across tasks for Fongbe -- hurting NER while helping POS -- suggesting task structure governs augmentation outcomes more than synthetic data quality. These findings challenge the assumption that LLM generation quality predicts augmentation success, and provide actionable guidance: data augmentation should be treated as a task-specific intervention rather than a universally beneficial preprocessing step.

2604.12537 2026-04-15 cs.CV cs.AI

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Ruoxiang Huang, Zhen Yuan

Comments Accepted by CVPR 2026 (Highlight). 10 pages, 2 figures, 5 tables

Abstract

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.

2604.12534 2026-04-15 cs.AI cs.LO

Technical Report -- A Context-Sensitive Multi-Level Similarity Framework for First-Order Logic Arguments: An Axiomatic Study

Victor David, Jérôme Delobelle, Jean-Guy Mailly

Comments 19 pages, 6 figures

Abstract

Similarity in formal argumentation has recently gained attention due to its significance in problems such as argument aggregation in semantics and enthymeme decoding. While existing approaches focus on propositional logic, we address the richer setting of First-Order Logic (FOL), where similarity must account for structured content. We introduce a comprehensive framework for FOL argument similarity, built upon: (1) an extended axiomatic foundation; (2) a four-level parametric model covering predicates, literals, clauses, and formulae similarity; (3) two model families, one syntax-sensitive via language models, both integrating contextual weights for nuanced and explainable similarity; and (4) formal constraints enforcing desirable properties.

2604.12526 2026-04-15 cs.LG cs.AI

Orthogonal Subspace Projection for Continual Machine Unlearning via SVD-Based LoRA

Yogachandran Rahulamathavan, Nasir Iqbal, Juncheng Hu, Sangarapillai Lambotharan

Abstract

Continual machine unlearning aims to remove the influence of data that should no longer be retained, while preserving the usefulness of the model on everything else. This setting becomes especially difficult when deletion requests arrive sequentially, because the model must repeatedly adapt without erasing previously retained knowledge. Low-Rank Adaptation (LoRA) offers an efficient way to implement such updates, but naively combining many sequential LoRA modules leads to parameter collision, causing strong interference between tasks. We propose a static alternative based on Singular Value Decomposition (SVD)-guided orthogonal subspace projection. Our method constrains each new LoRA update during training so that it lies in the orthogonal complement of the subspaces used by earlier unlearning tasks. This preserves task isolation without requiring dynamic routing at deployment. Experiments on CIFAR-100 with ResNet-20 and on MNIST show stable behavior across long sequences of unlearning tasks. After thirty sequential unlearning tasks, state-of-the-art static fusion reduces retained accuracy from 60.39% to 12.70%, whereas the proposed in-training constrained optimization maintains baseline performance (about 58.1%) while preserving strong unlearning efficacy.
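
The projection step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's training procedure: it takes the earlier LoRA updates as dense low-rank matrices, builds an orthonormal basis for their combined column space with an SVD, and projects a new update onto the orthogonal complement once, whereas the abstract describes enforcing the constraint during training. The function names, shapes, and tolerance are hypothetical.

```python
import numpy as np


def orthonormal_basis(updates, tol: float = 1e-10) -> np.ndarray:
    """Orthonormal basis of the column space spanned by earlier updates (via SVD)."""
    stacked = np.concatenate(updates, axis=1)       # shape (d_out, total columns)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, s > tol * s.max()]                  # keep directions with non-zero energy


def project_to_complement(delta: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from `delta` every component lying in the span of `basis`."""
    return delta - basis @ (basis.T @ delta)


rng = np.random.default_rng(0)
d_out, d_in, rank = 16, 12, 2

# Three earlier rank-2 updates, each the product of two LoRA-style factors.
earlier = [
    rng.normal(size=(d_out, rank)) @ rng.normal(size=(rank, d_in)) for _ in range(3)
]
basis = orthonormal_basis(earlier)

new_delta = rng.normal(size=(d_out, rank)) @ rng.normal(size=(rank, d_in))
constrained = project_to_complement(new_delta, basis)

# Largest overlap between the constrained update and the earlier subspace.
print(np.abs(basis.T @ constrained).max())
```

In a training loop, the same projection could be applied to the update after every optimizer step so that the constraint holds throughout; that placement is an assumption rather than something the abstract specifies.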