arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.05652 2026-05-08 cs.CV

SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets

Manolis Mylonas, Charalampia Zerva, Evlampios Apostolidis, Vasileios Mezaris

Comments Under review

Abstract

In this work, we present a method and two large-scale datasets for Script-Driven Multimodal Video Summarization. The proposed method, SD-MVSum, builds on our earlier SD-VSum method for script-driven video summarization, which considered just the visual content of the video. In addition to the visual modality, SD-MVSum takes into account the relevance of the user-provided script to the spoken content (i.e., audio transcript) of the video. The dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This mechanism explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale datasets for script-driven (S-VideoXum) and generic (MrHiSum) video summarization, to make them suitable for training and evaluation of script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of the proposed SD-MVSum method against other SotA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.

2510.02339 2026-05-08 cs.CL cs.AI

Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models

Kevin Zhou, Adam Dejl, Gabriel Freedman, Lihu Chen, Antonio Rago, Francesca Toni

Comments Accepted at EMNLP Findings 2025

Journal ref
Findings of the Association for Computational Linguistics: EMNLP 2025, 21700-21711. 2025
Abstract

Research in uncertainty quantification (UQ) for large language models (LLMs) is increasingly important towards guaranteeing the reliability of this groundbreaking technology. We explore the integration of LLM UQ methods in argumentative LLMs (ArgLLMs), an explainable LLM framework for decision-making based on computational argumentation in which UQ plays a critical role. We conduct experiments to evaluate ArgLLMs' performance on claim verification tasks when using different LLM UQ methods, inherently performing an assessment of the UQ methods' effectiveness. Moreover, the experimental procedure itself is a novel way of evaluating the effectiveness of UQ methods, especially when intricate and potentially contentious statements are present. Our results demonstrate that, despite its simplicity, direct prompting is an effective UQ strategy in ArgLLMs, outperforming considerably more complex approaches.

2510.02312 2026-05-08 cs.LG

KaVa: Latent Reasoning via Compressed KV-Cache Distillation

Anna Kuzina, Maciej Pioro, Paul N. Whatmough, Babak Ehteshami Bejnordi

Comments ICLR 2026

Abstract

Large Language Models (LLMs) excel at multi-step reasoning problems with explicit chain-of-thought (CoT), but verbose traces incur significant computational costs and memory overhead, and often carry redundant, stylistic artifacts. Latent reasoning has emerged as an efficient alternative that internalizes the thought process, but it suffers from a critical lack of supervision, limiting its effectiveness on complex, natural-language reasoning traces. In this work we propose KaVa, the first framework that bridges this gap by distilling knowledge directly from a compressed KV-cache of the teacher into a latent-reasoning student via self-distillation, leveraging the representational flexibility of continuous latent tokens to align stepwise KV trajectories. We show that the abstract, unstructured knowledge within compressed KV-cache, which lacks direct token correspondence, can serve as a rich supervisory signal for a latent reasoning student. Empirically, the approach consistently outperforms strong latent baselines, exhibits markedly smaller degradation from equation-only to natural-language traces, and scales to larger backbones while preserving efficiency. These results establish compressed KV-cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of CoT-trained teachers with the efficiency and deployability of latent inference.

2509.24382 2026-05-08 cs.CV cs.AI

REMAP: Regularized Matching and Partial Alignment of Video Embeddings

Soumyadeep Chandra, Kaushik Roy

Comments 9 pages, 4 figures, 6 tables

Abstract

Real-world instructional videos are long, noisy, and often contain extended background segments, repeated actions, and execution variability that do not correspond to meaningful procedural steps. We propose REMAP, an unsupervised framework for procedure learning based on Regularized Fused Partial Gromov-Wasserstein Optimal Transport. REMAP relaxes balanced transport constraints, allowing non-informative or redundant frames to remain unmatched through partial transport. The formulation jointly models semantic similarity and temporal structure, while incorporating Laplacian-based smoothness and structural regularization to prevent degenerate alignments and reduce background interference. We evaluate REMAP on large-scale egocentric and third-person benchmarks. The method consistently outperforms state-of-the-art approaches, achieving up to 11.6% (+4.45pp) F1 and 19.6% (+4.73pp) IoU improvements on EgoProceL, and an average 41% (+17.15pp) F1 gain on ProceL and CrossTask. These results highlight the importance of partial alignment in handling real-world procedural variability and demonstrate that REMAP provides a robust and scalable approach for instructional video understanding.

2509.23765 2026-05-08 cs.CL cs.AI cs.LG

Knowledge-Level Consistency Reinforcement Learning: Dual-Fact Alignment for Long-Form Factuality

Junliang Li, Yucheng Wang, Yan Chen, Yu Ran, Ruiqing Zhang, Jing Liu, Hua Wu, Haifeng Wang

Comments 32 pages

Abstract

Hallucination in large language models (LLMs) during long-form generation remains difficult to address under existing reinforcement learning from human feedback (RLHF) frameworks, as their preference rewards often overlook the model's own knowledge boundaries. In this paper, we propose the Knowledge-Level Consistency Reinforcement Learning Framework (KLCF), which re-examines this problem from a distribution alignment perspective. KLCF formalizes long-form factuality as a bidirectional distribution matching objective between the policy model's expressed knowledge distribution and the base model's parametric knowledge distribution: under the constraint that generation must not exceed the support set of the base knowledge, the objective maximizes coverage of high-probability facts, thereby jointly optimizing precision and recall. To achieve this, we design a Dual-Fact Alignment mechanism that approximates the recall term using a factual checklist constructed by sampling from the base model, and constrains hallucinations with a lightweight truthfulness reward model. Both components are jointly optimized and require no external retrieval throughout training. Experimental results demonstrate that KLCF consistently improves factuality metrics across multiple long-form benchmarks and model scales, effectively alleviating hallucination and over-conservatism while maintaining efficiency and scalability.

2509.04112 2026-05-08 cs.LG cs.IT math.IT

Synthetic Counterfactual Labels for Efficient Conformal Counterfactual Inference

Amirmohammad Farzaneh, Matteo Zecchin, Osvaldo Simeone

Abstract

This work addresses the problem of constructing reliable prediction intervals for individual counterfactual outcomes. Existing conformal counterfactual inference (CCI) methods provide marginal coverage guarantees but often produce overly conservative intervals, particularly under treatment imbalance when counterfactual samples are scarce. We introduce synthetic data-powered CCI (SP-CCI), a new framework that augments the calibration set with synthetic counterfactual labels generated by a pre-trained counterfactual model. To ensure validity, SP-CCI incorporates synthetic samples into a conformal calibration procedure based on risk-controlling prediction sets (RCPS) with a debiasing step informed by prediction-powered inference (PPI). We prove that SP-CCI achieves tighter prediction intervals while preserving marginal coverage, with theoretical guarantees under both exact and approximate importance weighting. Empirical results on different datasets confirm that SP-CCI consistently reduces interval width compared to standard CCI across all settings.

2509.03238 2026-05-08 cs.RO cs.SY eess.SY

Vibration Damping in Underactuated Cable-suspended Artwork -- Flying Belt Motion Control

Martin Goubej, Lauria Clarke, Martin Hrabačka, David Tolar

Comments 10 pages, 10 figures

Abstract

This paper presents a comprehensive refurbishment of the interactive robotic art installation Standards and Double Standards by Rafael Lozano-Hemmer. The installation features an array of belts suspended from the ceiling, each actuated by stepper motors and dynamically oriented by a vision-based tracking system that follows the movements of exhibition visitors. The original system was limited by oscillatory dynamics, resulting in torsional and pendulum-like vibrations that constrained rotational speed and reduced interactive responsiveness. To address these challenges, the refurbishment involved significant upgrades to both hardware and motion control algorithms. A detailed mathematical model of the flying belt system was developed to accurately capture its dynamic behavior, providing a foundation for advanced control design. An input shaping method, formulated as a convex optimization problem, was implemented to effectively suppress vibrations, enabling smoother and faster belt movements. Experimental results demonstrate substantial improvements in system performance and audience interaction. This work exemplifies the integration of robotics, control engineering, and interactive art, offering new solutions to technical challenges in real-time motion control and vibration damping for large-scale kinetic installations.

2508.16745 2026-05-08 cs.LG cs.AI

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

Ivan Rodkin, Daniil Orel, Konstantin Smirnov, Arman Bolatov, Bilal Elbouardi, Besher Hassan, Yuri Kuratov, Aydar Bulatov, Preslav Nakov, Timothy Baldwin, Artem Shelmanov, Mikhail Burtsev

Abstract

Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled cellular-automata (1dCA) framework that excludes memorisation by using disjoint training and test rules. Given a short state sequence, the model is required to infer the hidden local rule and then chain it to predict multiple future steps. Our evaluation shows that LLMs largely fail to reliably solve a natural-language proxy of the proposed task. We find that most neural architectures trained from scratch can learn rule inference and achieve high next-step accuracy, but performance drops sharply as the required number of intermediate reasoning steps increases. Experiments show that increasing model depth is crucial, and extending effective depth via recurrence, memory, or test-time compute improves results but remains bounded. The code is available on GitHub: https://github.com/RodkinIvan/associative-recurrent-memory-transformer/tree/ACT
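
To make the task setup concrete, here is a minimal sketch of a one-dimensional cellular automaton rollout. It assumes the simplest variant (binary states, radius-1 neighborhood, periodic boundary, Wolfram rule numbering); the abstract does not state the neighborhood size or boundary handling, so these choices and the function names `ca_step` and `rollout` are illustrative only.

```python
import numpy as np

def ca_step(state, rule):
    """One step of an elementary (binary, radius-1) cellular automaton.

    Assumptions for illustration: periodic boundary, Wolfram rule number 0..255.
    """
    left = np.roll(state, 1)     # left neighbor of each cell
    right = np.roll(state, -1)   # right neighbor of each cell
    idx = 4 * left + 2 * state + right       # neighborhood as a 3-bit index
    table = (rule >> np.arange(8)) & 1       # bit i of the rule = output for index i
    return table[idx]

def rollout(state, rule, steps):
    """Chain the hidden rule for several steps, as the multi-step task requires."""
    out = []
    for _ in range(steps):
        state = ca_step(state, rule)
        out.append(state)
    return np.stack(out)

if __name__ == "__main__":
    start = np.array([0, 0, 0, 1, 0, 0, 0])
    print(rollout(start, rule=90, steps=3))
```

In this framing, a model sees a few consecutive states, must infer `rule` implicitly, and is scored on states several steps ahead, so errors in the inferred rule compound with depth.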

2508.15119 2026-05-08 cs.AI cs.CL cs.LG cs.RO

Flexible Agent Alignment with Goal Inference from Open-Ended Dialog

Rachel Ma, Jingyi Qu, Andreea Bobu, Dylan Hadfield-Menell

Comments Previous version of the paper was titled: Open-Universe Assistance Games

Abstract

We introduce Open-Universe Assistance Games (OU-AGs), a formal framework extending assistance games to LLM-based agents. Effective assistance requires reasoning over human preferences that are unbounded, underspecified, and evolving. Current LLM agents struggle in multi-turn interactions and with maintaining accurate models of user intent in collaborative settings. Existing assistance game formulations assume fixed, predefined preferences, an assumption that breaks down in open-ended dialogue where goals are revised incrementally and expressed in natural language. Grounded in cognitive science accounts of preference construction, we represent human preferences as a dynamically updated distribution over discrete natural-language goals. To operationalize OU-AGs, we introduce GOOD (GOals from Open-ended Dialogue), a data-efficient online method that extracts and ranks candidate goals during interaction, using LLM-simulated users to perform probabilistic inference over goal hypotheses. This allows for interpretable, uncertainty-aware preference representations without large offline datasets. We evaluate GOOD across three text-based domains: grocery shopping, household robotics (AI2-THOR), and coding. Compared to baselines without explicit goal tracking, GOOD produces semantically coherent goal representations and improves alignment with user intent across domains.

2508.14482 2026-05-08 cs.LG

On the notion of missingness for path attribution explainability methods in medical settings: Guiding the selection of medically meaningful baselines

Alexander Geiger, Lars Wagner, Daniel Rueckert, Dirk Wilhelm, Alissa Jell

Abstract

The explainability of deep learning models remains a significant challenge, particularly in the medical domain where interpretable outputs are essential for clinical trust and transparency. Path attribution methods such as Integrated Gradients rely on a baseline that represents the absence of informative features, a notion commonly referred to as missingness. Standard baselines, such as all-zero inputs, are often semantically meaningless in medical contexts, where intensity values carry clinical significance. In this work, we revisit the notion of missingness for medical imaging, expose the limitations of standard baselines in this setting, and formalize a stricter notion of missingness that we term semantic missingness: a baseline must not merely lack signal, but must represent a clinically plausible state in which the disease-related features are absent. This formulation motivates a counterfactual-guided approach to baseline selection, in which a synthetically generated counterfactual (i.e., a clinically normal variant of the pathological input) serves as a principled and semantically meaningful reference. We derive theoretical guarantees showing that counterfactual baselines yield more faithful attributions than standard alternatives, and empirically validate this with two complementary counterfactual generative models, a VAE and a diffusion model, though the concept is model-agnostic and compatible with any suitable counterfactual method. Across three diverse medical datasets, counterfactual baselines produce more faithful and medically relevant attributions, outperforming standard baseline choices as well as related methods. Notably, we also compare against using the counterfactual directly as an explanation (an established paradigm in its own right) and show that employing it as a baseline for Integrated Gradients yields superior results, thereby bridging two complementary explainability paradigms.
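
The role of the baseline can be illustrated with a small sketch of Integrated Gradients on a toy model. Everything here is an assumption for illustration: the model is a hypothetical logistic unit with a hand-written gradient, and the "counterfactual" baseline is simply a copy of the input with one feature reset, not the output of a VAE or diffusion model as in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w):
    # Hypothetical toy model: a single logistic unit.
    return sigmoid(np.dot(w, x))

def model_grad(x, w):
    # Analytic gradient of the toy model with respect to the input x.
    s = model(x, w)
    return s * (1.0 - s) * w

def integrated_gradients(x, baseline, w, steps=200):
    # Midpoint Riemann sum over the straight path from baseline to x.
    alphas = (np.arange(steps) + 0.5) / steps
    delta = x - baseline
    grads = np.stack([model_grad(baseline + a * delta, w) for a in alphas])
    return delta * grads.mean(axis=0)

if __name__ == "__main__":
    w = np.array([1.0, -2.0, 0.5])
    x = np.array([1.0, 0.5, 2.0])
    zero_baseline = np.zeros(3)
    # Illustrative "counterfactual": identical to x except the last feature.
    cf_baseline = np.array([1.0, 0.5, 0.0])
    print(integrated_gradients(x, zero_baseline, w))
    print(integrated_gradients(x, cf_baseline, w))
```

With the all-zero baseline every feature receives attribution, whereas with a baseline that differs from the input only in one feature, the attribution concentrates on that feature. In both cases the attributions sum (up to discretization error) to the difference in model output between the input and the baseline, which is the completeness property of Integrated Gradients.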

2507.22832 2026-05-08 cs.LG cs.CV cs.NE

Pulling Back the Curtain on Deep Networks

Maciej Satkiewicz, Roberto Corizzo, Marcin Pietroń

Comments Preprint; 9 pages, 23-page appendix, 12 figures, 6 Tables; v6 changes: slight reframing of the presentation

Abstract

In linear models, visualizing a weight vector naturally reveals the model's preferred input direction, but extending this intuition to deep networks via gradients or gradient ascent often yields brittle or adversarial-looking features. We argue that deep networks are better understood as input-conditioned affine operators, whose natural adjoint action pulls a neuron's preferred direction back to input space. We further refine this representation by backward-only softening and iterative enhancement to reconstruct coherent local structures encoded by the target neuron. This provides a unifying perspective on previously disparate ideas such as SmoothGrad, B-cos-style alignment, and Feature Accentuation. The resulting Semantic Pullbacks (SP) generate perceptually aligned, class-conditional post-hoc explanations that emphasize semantically meaningful features, facilitate coherent counterfactual perturbations, and remain theoretically grounded. Across convolutional architectures (ResNet50, VGG) and transformer-based models (PVT), Semantic Pullbacks achieve the best overall trade-off across faithfulness, stability, and target-sensitivity benchmarks, while remaining general, computationally efficient, and readily integrable into existing deep learning pipelines.

2507.02466 2026-05-08 cs.LG

Variational Kolmogorov-Arnold Network

Francesco Alesiani, Henrik Christiansen, Federico Errica

Comments Preprint

Abstract

Kolmogorov-Arnold Networks (KANs) offer a theoretically grounded alternative to multi-layer perceptrons by representing multivariate functions as compositions of univariate basis functions. However, a critical limitation of KANs is the need to manually specify the number of basis functions per layer -- a hyperparameter that directly controls model capacity and substantially impacts performance, yet whose optimal value varies unpredictably across tasks. We present InfinityKAN, a variational inference framework that eliminates this design choice by learning the number of basis functions during training. Our approach models the basis count as a latent variable with a truncated exponential prior, introducing a differentiable weighting function that enables gradient-based optimization. We establish the Lipschitz continuity of the variational objective, ensuring stable training dynamics. Experiments across 18 datasets spanning synthetic, image, tabular, and graph domains demonstrate that InfinityKAN matches or exceeds the performance of KANs while requiring no manual selection of the number of bases for each layer.

2507.01833 2026-05-08 cs.AI

Refining Gelfond Rationality Principle: Towards More Comprehensive Foundational Principles for Answer Set Semantics

Yi-Dong Shen, Thomas Eiter

Comments 76 pages. This article is a significantly extended version of a paper presented by the authors at IJCAI-2022

Abstract

Non-monotonic logic programming is the basis for a declarative problem solving paradigm known as answer set programming (ASP). Departing from the seminal definition by Gelfond and Lifschitz in 1988 for simple normal logic programs, various answer set semantics have been proposed for extensions. We consider two important questions: (1) Should the minimal model property, constraint monotonicity and foundedness as defined in the literature be mandatory conditions for an answer set semantics in general? (2) If not, what other properties could be considered as general principles for answer set semantics? We address the two questions. First, it seems that the three aforementioned conditions may sometimes be too strong, and we illustrate with examples that enforcing them may exclude expected answer sets. Second, we evolve the Gelfond answer set (GAS) principles for answer set construction by refining Gelfond's rationality principle to well-supportedness, minimality w.r.t. negation by default and minimality w.r.t. epistemic negation. The principle of well-supportedness guarantees that every answer set is constructible from if-then rules obeying a level mapping and is thus free of circular justification, while the two minimality principles ensure that the formalism minimizes knowledge both at the level of answer sets and of world views. Third, to embody the refined GAS principles, we extend the notion of well-supportedness substantially to answer sets and world views, respectively. Fourth, we define new answer set semantics in terms of the refined GAS principles. Fifth, we use the refined GAS principles as an alternative baseline to intuitively assess the existing answer set semantics. Finally, we analyze the computational complexity.

2506.18682 2026-05-08 cs.CV cs.AI

Multi-Scale Spectral Attention Module-based Hyperspectral Segmentation in Autonomous Driving Scenarios

Imad Ali Shah, Jiarong Li, Tim Brophy, Martin Glavin, Edward Jones, Enda Ward, Brian Deegan

Abstract

Recent advances in autonomous driving (AD) have highlighted the potential of hyperspectral imaging (HSI) for enhanced environmental perception, particularly in challenging weather and lighting conditions. However, efficiently processing high-dimensional spectral data remains a significant challenge. This paper presents an empirical investigation of a Multi-Scale Attention Mechanism (MSAM) for enhanced spectral feature extraction through three parallel 1D convolutions with varying kernel sizes (1-11) and adaptive feature aggregation. By integrating MSAM into UNet's skip connections, we evaluate performance improvements in semantic segmentation across multiple HSI datasets for urban driving scenarios. Comprehensive ablation studies demonstrate that MSAM consistently outperforms baseline UNet-SC, achieving average improvements of 2.32% in mIoU and 2.88% in mF1, while maintaining competitive GPU performance against established attention mechanisms. Our findings reveal that optimal kernel combinations are dataset-specific, with configurations such as (1;5;11) and (3;7;11) demonstrating particularly strong performance. This empirical investigation advances understanding of HSI processing capabilities for AD applications and establishes a foundation for adaptive multi-scale spectral feature extraction in automotive deployment.

2506.13727 2026-05-08 cs.LG cs.AI cs.CL

Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs

Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Alexander Binder, Sebastian Lapuschkin

Comments Work in progress (9 pages manuscript, 3 pages references, 16 pages appendix)

Abstract

Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mechanistic interpretability addresses this challenge by identifying circuits -- subsets of model components responsible for specific behaviors. However, discovering such circuits in LLMs remains difficult due to their scale and complexity. We frame circuit discovery as identifying parameters that contribute most to model outputs on task-specific inputs, and use Layer-wise Relevance Propagation (LRP) with reference samples to attribute and extract these components via pruning. Building on this, we introduce contrastive relevance to isolate circuits associated with undesired behaviors while preserving general capabilities, enabling targeted model correction. On OPT-125M, we show that pruning as little as ~0.3% of neurons substantially reduces toxic outputs, while pruning approximately 0.03% of weight elements mitigates repetitive text generation without degrading general performance. These results establish attribution-guided pruning as an effective mechanism for identifying and intervening on behavior-specific circuits in LLMs. We further validate our findings on additional small-scale language models, demonstrating that the proposed approach transfers across architectures. Our code is publicly available at https://github.com/erfanhatefi/SparC3.

2505.24437 2026-05-08 cs.SD eess.AS

SwitchCodec: A High-Fidelity Neural Audio Codec With Sparse Quantization

Jin Wang, Wenbin Jiang, Xiangbo Wang, Yubo You, Sheng Fang

Comments There is some technical error in this paper's method

Abstract

Neural audio compression has emerged as a promising technology for efficiently representing speech, music, and general audio. However, existing methods suffer from significant performance degradation at limited bitrates, where the available embedding space is sharply constrained. To address this, we propose a universal high-fidelity neural audio compression algorithm featuring Residual Experts Vector Quantization (REVQ), which substantially expands the embedding space with minimal impact on bandwidth. A gentle load-balancing strategy is introduced to ensure the full utilization of this expanded space. Furthermore, we develop a novel multi-tiered discriminator that periodically stratifies STFT spectra, guiding the generator to focus on critical spectral regions. To support multiple bitrates without quality loss at the lower end, we adopt an efficient post-training strategy. Our proposed model achieves impressive performance, with PESQ and ViSQOL scores of 2.87 and 4.27, respectively, at 2.67 kbps bandwidth. The approach effectively reduces spectral blur, decreasing the distance to the original mel-spectrogram by 13%. Notably, our post-training strategy achieves performance comparable to dedicated fixed-bitrate models while reducing the required training time by half. Extensive ablation studies confirm the superiority of our method over baselines.

2505.15064 2026-05-08 cs.LG math.DS stat.ML

Why and When Deep is Better than Shallow: Implementation-Agnostic State-Transition Model of Deep Learning

Sho Sonoda, Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda

Abstract

Why and when does depth improve generalization? We study this question in an implementation-agnostic state-transition model, where a depth-$k$ predictor is a readout class $H$ composed with the word ball $B(k,F)$ generated by hidden state transitions. Generalization bounds separate implementation error, approximation error, and statistical complexity, and upper bound the depth-dependent variance term by a Dudley entropy integral over $B(k,F)$, with a conditional lower-bound diagnostic under readout separation. We identify geometric and semigroup mechanisms that keep this entropy contribution saturated or polynomial, and contrast them with separation mechanisms that recover the classical exponential-growth obstruction. Coupling these variance upper bounds with approximation rates gives typical depth trade-off patterns, clarifying that depth is statistically favorable when approximation improves rapidly while the transition semigroup remains geometrically tame.

2505.13100 2026-05-08 cs.LG

Time series saliency maps: explaining models across multiple domains

Christodoulos Kechris, Jonathan Dan, David Atienza

Abstract

Traditional saliency map methods, popularized in computer vision, highlight individual points (pixels) of the input that contribute the most to the model's output. However, in time series, they offer limited insights, as semantically meaningful features are often found in other domains. We introduce Cross-domain Integrated Gradients, a generalization of Integrated Gradients. Our method enables feature attributions in any domain that can be formulated as an invertible, differentiable transformation of the time domain. Crucially, our derivation extends the original Integrated Gradients into the complex domain, enabling frequency-based attributions. We provide the necessary theoretical guarantees, namely, path independence and completeness. We validate our method via controlled experiments with mechanistic analysis, quantitative faithfulness tests, and real-world case studies. Our approach reveals interpretable, problem-specific attributions that time-domain methods cannot capture in three real-world tasks across a variety of model architectures, machine-learning tasks, and cross-domain transforms: frequency-based attribution for a regression task in wearable heart rate extraction, independent component analysis in a classification task for electroencephalography-based seizure detection, and seasonal-trend decomposition for a forecasting problem with a zero-shot time-series foundation model. We release an open-source TensorFlow/PyTorch library to enable plug-and-play cross-domain explainability for time-series models. These results demonstrate the ability of Cross-Domain Integrated Gradients to provide semantically meaningful insights into time-series models that are impossible to achieve with traditional saliency in the time domain.

2504.04202 2026-05-08 cs.LG

Local-Order Auxiliary Losses Can Improve Autoencoder Reconstruction

Harvey Dam, Martin Burtscher, Tripti Agarwal, Ganesh Gopalakrishnan

Abstract

Mean-squared error is the default objective for training autoencoders, yet compressed reconstructions often depend not only on pointwise accuracy but also on preserving local spatial order. We study whether structural auxiliary losses can improve, rather than trade off against, MSE in finite-capacity autoencoders. We introduce finite-difference sign error (FDSE), a local-order auxiliary objective that penalizes disagreements between the signs of neighboring finite differences in the target and reconstruction. FDSE is simple, architecture-agnostic, and differentiable through smooth sign surrogates. Across four tensor reconstruction tasks, we find that moderate mixtures of MSE and FDSE can substantially reduce validation MSE relative to pure MSE training. In coefficient sweeps, FDSE mixtures reduce validation MSE by 2.3$\times$--7.0$\times$ over pure MSE on these tasks, while comparisons with other auxiliary objectives show FDSE to be among the strongest structural objectives tested. The effect is not universal: pure FDSE performs poorly, and gains are largest for coherent spatial fields where local order carries information about the underlying signal. These results suggest that, in compressed-latent reconstruction, appropriately weighted local-structure supervision can guide optimization toward solutions with better pointwise accuracy, rather than merely improving perceptual or structural metrics at MSE's expense.
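
The idea of a local-order auxiliary loss can be sketched as follows. The abstract does not give the exact FDSE formula, so this is an assumed form: `tanh` stands in for the "smooth sign surrogate", the penalty is a squared difference of soft signs, and `sharpness` and `lam` are hypothetical parameters. The sketch uses NumPy on 1D signals and only evaluates the loss; a training setup would use an autodiff framework.

```python
import numpy as np

def soft_sign(d, sharpness=10.0):
    # Smooth, differentiable stand-in for sign(d); tanh is an assumed choice.
    return np.tanh(sharpness * d)

def fdse(target, recon, sharpness=10.0):
    # Compare soft signs of neighboring finite differences along the last axis.
    s_t = soft_sign(np.diff(target, axis=-1), sharpness)
    s_r = soft_sign(np.diff(recon, axis=-1), sharpness)
    # 0 when the signs agree, approaching 1 when they are fully opposite.
    return float(np.mean(0.25 * (s_t - s_r) ** 2))

def mixed_loss(target, recon, lam=0.5, sharpness=10.0):
    # Convex mixture of pointwise MSE and the local-order term.
    mse = float(np.mean((target - recon) ** 2))
    return (1.0 - lam) * mse + lam * fdse(target, recon, sharpness)

if __name__ == "__main__":
    t = np.array([0.0, 1.0, 2.0, 3.0])
    print(fdse(t, t[::-1]))          # opposite local order
    print(mixed_loss(t, t + 1.0))    # same local order, constant offset
```

Note that a constant offset leaves the local-order term at zero while MSE is nonzero, which is consistent with the abstract's observation that the local-order term alone is not a sufficient objective.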

2503.06624 2026-05-08 cs.CV

Chameleon: Benchmarking Detection and Backtracking on Commercial-Grade AI-Generated Videos

Xingming Liao, Meiyu Zeng, Canyu Chen, Nankai Lin, Zhuowei Wang, Aimin Yang

Comments Accepted by ICMR 2026

Abstract

The proliferation of AI-Generated Content (AIGC), especially deepfake videos, poses a severe threat to social trust by enabling fraud, privacy violations and disinformation. Existing AI-generated video detection (AGVD) benchmarks focus on open-source model generated videos, yet commercial closed-source models produce more realistic, temporally coherent videos that are underexplored in detection research. To fill this gap, we present Chameleon, a commercial-grade dataset with 1,700 AI-generated videos from 600 real-world sources across three key domains (News, Speech, Recommendation), featuring high resolution, rich annotations and 3D consistency metrics for dynamic scene spatial coherence, shifting detection from face-centric forgery to holistic scene forensics. This benchmark assesses models on two core tasks: accurate AI video detection in real-world conditions and forensic backtracking of original sources. Experimental results reveal critical limitations of existing methods in detecting and backtracking high-fidelity, spatiotemporally consistent videos from commercial closed-source models, highlighting current methods' flawed forensic reasoning and establishing Chameleon as a vital challenge for AIGC security research. The code and data are available at https://github.com/lxixim/Chameleon.

2503.02379 2026-05-08 cs.LG cs.CV

Teaching Metric Distance to Discrete Autoregressive Language Models

Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu

Details
English abstract

Large language models (LLMs) operate as autoregressive predictors over discrete token vocabularies, a formulation that has enabled their adaptation far beyond natural language to vision, robotics, and multimodal reasoning. However, training against one-hot targets disregards metric relationships between tokens and limits effectiveness on tasks where distance is meaningful, such as numerical values, spatial coordinates, or quantized embeddings. We introduce DIST2Loss, a distance-aware objective for discrete autoregressive models that replaces one-hot targets with reward-weighted distributions derived from predefined token distances. DIST2Loss can be interpreted as the closed-form solution to entropy-regularized policy optimization with known per-token rewards, retaining the core mechanism of reinforcement learning while avoiding sampling, rollouts, and instability. Our experiments show that DIST2Loss improves data efficiency and downstream performance across diverse domains. It yields tighter bounding boxes in visual grounding, accelerates robotic manipulation by improving action learning, enhances reward modeling for LLM alignment, and strengthens vector-quantized image generation. These results demonstrate that distance-aware supervision offers a simple and general alternative to one-hot supervision for discrete autoregressive models.
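The idea of replacing a one-hot target with a distance-derived distribution can be sketched as follows. This assumes the per-token reward is the negative absolute distance to the ground-truth token value and that the target takes the softmax form `exp(reward / temperature)`, which is the standard closed form for entropy-regularized reward maximization. The function names and the `temperature` parameter are hypothetical and not taken from the paper.

```python
import math


def distance_aware_target(target_index, vocab_values, temperature=1.0):
    """Soft target over tokens: softmax of negative distance to the true token."""
    true_value = vocab_values[target_index]
    rewards = [-abs(v - true_value) / temperature for v in vocab_values]
    top = max(rewards)
    exps = [math.exp(r - top) for r in rewards]  # shift for numerical stability
    norm = sum(exps)
    return [e / norm for e in exps]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    top = max(logits)
    lse = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - lse for l in logits]


def soft_cross_entropy(model_logits, target_probs):
    """Cross-entropy between the model's distribution and a soft target."""
    log_probs = log_softmax(model_logits)
    return -sum(p * lp for p, lp in zip(target_probs, log_probs))
```

Under this sketch, a prediction that puts its mass on a token adjacent to the true one incurs a smaller loss than one that puts the same mass on a distant token, which a one-hot target cannot express. As `temperature` approaches zero the target approaches the one-hot case.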

2502.19918 2026-05-08 cs.AI cs.LG

Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

Yuan Sui, Yufei He, Tri Cao, Simeng Han, Yulin Chen, Bryan Hooi

Comments Accepted by ACL'2026

Details
English abstract

Large Language Models (LLMs) often struggle with computational efficiency and error propagation in multi-step reasoning tasks. While recent advances in prompting and post-training have enabled LLMs to perform step-wise reasoning, they still tend to explore unproductive solution paths without effective backtracking or strategy adjustment. In this paper, we propose Meta-Reasoner, a new framework that empowers LLMs to "think about how to think". It optimizes the inference process by dynamically adapting reasoning strategies in real time. Our approach employs contextual multi-armed bandits (CMABs) to learn an adaptive policy. The policy learns to evaluate the current state of the LLM's reasoning and to determine the strategy most likely to lead to a successful outcome during inference, such as whether to backtrack, switch to a new approach, or restart the problem-solving process. This meta-guidance helps avoid exploring unproductive paths during inference and hence improves computational efficiency. We evaluate Meta-Reasoner on math problems (e.g., Game-of-24, TheoremQA) and scientific tasks (e.g., SciBench). Results show that our method outperforms previous SOTA methods by 9-12% in accuracy, while reducing inference time by 28-35% under the same compute budget. Additional experiments on creative writing demonstrate the generalizability of our approach to diverse reasoning-intensive tasks.

2502.03725 2026-05-08 cs.LG

Optimal Control of Fluid Restless Multi-armed Bandits: A Machine Learning Approach

Dimitris Bertsimas, Cheol Woo Kim, José Niño-Mora

Details
English abstract

We present a novel machine learning framework for the optimal control of fluid restless multi-armed bandit problems (FRMABPs) with state equations that are either affine or quadratic in the state variables. By establishing fundamental properties of FRMABPs, we develop an efficient numerical algorithm that generates a comprehensive training set by solving multiple instances with diverse initial states. We further enhance this training set by applying a nonlinear transformation to the feature vectors, leveraging structural properties of FRMABPs. A time-dependent state feedback policy is then learned using Optimal Classification Trees with hyperplane splits (OCT-H). We test our approach on machine maintenance, epidemic control, and fisheries control problems, demonstrating that our method yields high-quality state feedback policies. Furthermore, once a policy is learned, it achieves a speed-up of up to 26 million times compared to the direct numerical algorithm.

2501.09238 2026-05-08 cs.LG

Mono-Forward: Revisiting Forward-Forward through Objective-Locality Decomposition

James Gong, Bruce Li, Waleed Abdulla

Comments 26 pages

Details
English abstract

Backpropagation remains the dominant algorithm for training deep neural networks, but it incurs substantial memory overhead and relies on global error propagation, which is often regarded as biologically implausible. The Forward-Forward (FF) algorithm is an appealing local-learning alternative to backpropagation, yet it still lags behind backpropagation in accuracy. A central unresolved question is whether this gap arises from FF's locality or from the positive-negative double-pass goodness objective used to train each layer. In this work, we revisit FF under the supervised setting through a decomposition that separates these two design choices. Our analysis suggests that FF's performance limitations are not explained by locality alone, but are also likely influenced by its goodness objective. Motivated by this view, we introduce Mono-Forward (MF), a simplification of FF that preserves its locality while replacing the contrastive goodness objective with a standard multi-class cross-entropy objective applied locally at each layer, serving as a controlled baseline for evaluating local learning under a standard classification objective. Across MLPs and convolutional networks, MF outperforms vanilla FF and remains competitive with multiple FF variants. On MLP-Mixers, MF achieves stronger results on PathMNIST than backpropagation while requiring only 31% of backpropagation's memory.

2412.09125 2026-05-08 cs.AI cs.DB cs.LO

Goal-Driven Query Answering over First- and Second-Order Dependencies with Equality

Efthymia Tsamoura, Boris Motik

Comments 47 pages

Details
English abstract

In this paper we present the first goal-driven query answering technique for first- and second-order dependencies with equality. Our technique transforms the input dependencies so that applying the chase to the output avoids many inferences that are irrelevant to the query. The transformation proceeds in several steps, which comprise the following three novel techniques. First, we present a variant of the singularisation technique by Marnette [59] that can handle function variables and that corrects an incompleteness of a related formulation by ten Cate et al. [73]. Second, we present a relevance analysis technique that can eliminate dependencies that provably do not contribute to query answers. Third, we present a variant of the magic sets algorithm [19] that can handle second-order dependencies with equality. We also present the results of an extensive empirical evaluation, which show that goal-driven query answering can be orders of magnitude faster than computing the full universal model.

2412.08110 2026-05-08 cs.CV cs.CL cs.LG

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

Jiayun Luo, Mir Rayat Imtiaz Hossain, Pritam Sarkar, Boyang Li, Leonid Sigal

Details
English abstract

Vision-Language Models (VLMs) have achieved strong performance on implicit and explicit visual grounding and related tasks. However, such abilities are generally tested on simple, single-object phrases. We find that grounding performance degrades for complex, multi-object references. These limitations largely arise from training objectives that leverage image-caption alignment, where direct multi-object references are rare, the number of possible such references is theoretically large (exponential in the number of objects), and attribution is difficult. To address this, without requiring any additional annotations, we propose Compositional Attention-Regularized Training (CompART), which decomposes captions into object-centric phrases and constructs composite phrases by pairing them with conjunctions. We then introduce a composition loss that encourages the attention induced by a composite phrase to equal the sum of the attentions of its constituent phrases, promoting balanced multi-object localization. We evaluate CompART across four VLM architectures, spanning both contrastive-based and generative-based models, on four benchmarks for multi-object grounding and two VQA benchmarks for general visual understanding. CompART consistently improves grounding for both single- and multi-object references across diverse VLM architectures and datasets, and further demonstrates enhanced visual understanding, as evidenced by gains on VQA, despite not being explicitly trained for this task.
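The composition constraint described above, that a composite phrase's attention should equal the sum of its constituents' attentions, can be illustrated with a small sketch. The abstract does not say how the gap is measured, so the mean squared error used here is an assumption, as are the function name and the representation of attention maps as flat lists.

```python
def composition_loss(composite_attention, constituent_attentions):
    """Mean squared gap between a composite phrase's attention map and the
    element-wise sum of its constituent phrases' attention maps.

    All maps are flat lists of equal length (one weight per image patch).
    The squared-error form is an illustrative assumption.
    """
    summed = [sum(cells) for cells in zip(*constituent_attentions)]
    if len(summed) != len(composite_attention):
        raise ValueError("attention maps must have the same length")
    gaps = [(c - s) ** 2 for c, s in zip(composite_attention, summed)]
    return sum(gaps) / len(gaps)
```

With constituent maps `[0.5, 0, 0]` and `[0, 0, 0.5]`, a composite map of `[0.5, 0, 0.5]` gives zero loss, whereas `[1.0, 0, 0]`, which attends to only one object, is penalized.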

2411.18954 2026-05-08 cs.LG cs.AI

ReMAP: Neural Reparameterization for Scalable MAP Inference in Arbitrary-Order Markov Random Fields

Yaomin Wang, Chaolong Ying, Xiaodong Luo, Tianshu Yu

Details
English abstract

Scalable high-quality MAP inference in arbitrary-order Markov Random Fields (MRFs) remains challenging. Approximate message-passing methods are often efficient but can degrade on dense or high-order instances, while exact solvers such as Toulbar2 become increasingly expensive at scale. We present ReMAP, an instance-wise neural reparameterization framework that directly optimizes a differentiable relaxation of the original MRF energy. Instead of relying on supervised labels or amortized training, ReMAP treats each MRF as an independent optimization problem: a Graph Neural Network produces node-wise label distributions, and gradient-based optimization searches for a low-energy discrete solution in an over-parameterized continuous space. The method supports pairwise and arbitrary-order factors, heterogeneous label cardinalities, and efficient GPU execution, without requiring labeled solutions. We show that the relaxed objective is consistent with the discrete MAP problem and analyze how neural over-parameterization can expose low-energy optimization paths unavailable in the original discrete space. Empirically, on synthetic pairwise and high-order MRFs, UAI 2022 inference benchmarks, and real-world Physical Cell Identity (PCI) problems, ReMAP consistently outperforms approximate baselines and often finds lower-energy solutions than Toulbar2 on hard large-scale instances within practical time budgets.

2411.12220 2026-05-08 cs.LG cs.AI cs.CR

DeTrigger: A Gradient-Centric Approach to Backdoor Attack Mitigation in Federated Learning

Kichang Lee, Yujin Shin, Jonghyuk Yun, Songkuk Kim, Jun Han, JeongGil Ko

Comments 21 pages

Details
English abstract

Federated Learning (FL) enables collaborative model training across distributed devices while preserving local data privacy, making it ideal for mobile and embedded systems. However, the decentralized nature of FL also opens vulnerabilities to model poisoning attacks, particularly backdoor attacks, where adversaries implant trigger patterns to manipulate model predictions. In this paper, we propose DeTrigger, a scalable and efficient backdoor-robust federated learning framework that leverages insights from adversarial attack methodologies. By employing gradient analysis with temperature scaling, DeTrigger detects and isolates backdoor triggers, allowing for precise model weight pruning of backdoor activations without sacrificing benign model knowledge. Extensive evaluations across four widely used datasets demonstrate that DeTrigger achieves up to 251x faster detection than traditional methods and mitigates backdoor attacks by up to 98.9%, with minimal impact on global model accuracy. Our findings establish DeTrigger as a robust and scalable solution to protect federated learning environments against sophisticated backdoor threats.

2411.03962 2026-05-08 cs.CL cs.IR

How Does A Text Preprocessing Pipeline Affect Ontology Matching?

Zhangcheng Qiang, Kerry Taylor, Weiqing Wang

Comments 14 pages, 16 figures, 3 tables

Details
English abstract

The classical text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many systems for ontology matching (OM). However, the lack of standardisation in text preprocessing creates diversity in the mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on 8 Ontology Alignment Evaluation Initiative (OAEI) tracks with 49 distinct alignments. We find that Tokenisation and Normalisation (categorised as Phase 1 text preprocessing) are more effective than Stop Words Removal and Stemming/Lemmatisation (categorised as Phase 2 text preprocessing). We propose two novel approaches to repair unwanted false mappings that occur in Phase 2 text preprocessing. One is a pre hoc logic-based repair approach used before text preprocessing, employing an ontology-specific check to find common words that cause false mappings. The other repair approach is the post hoc large language model (LLM)-based approach, used after text preprocessing, which utilises the strong background knowledge provided by LLMs to repair non-existent and counter-intuitive false mappings. The experimental results indicate that these two approaches can significantly improve the matching correctness and the overall matching performance.

2411.02740 2026-05-08 cs.LG cond-mat.mtrl-sci physics.app-ph physics.comp-ph physics.data-an

An information-matching approach to optimal experimental design and active learning

Yonatan Kurniawan, Tracianne B. Neilsen, Benjamin L. Francis, Alex M. Stankovic, Mingjian Wen, Ilia Nikiforov, Ellad B. Tadmor, Vasily V. Bulatov, Vincenzo Lordi, Mark K. Transtrum

Details
Journal ref
Appl. Phys. Lett. 128, 064104 (2026)
English abstract

The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an Active Learning loop for material science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.