arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1597
2510.06002 2026-04-30 cs.AI cs.CL cs.IR

Deterministic Legal Agents: A Canonical Primitive API for Auditable Reasoning over Temporal Knowledge Graphs

Hudson de Martim

Comments Substantially revised version consolidating the paper as a formal SAT-Graph API specification: clarifies Probability Isolation and post-anchoring determinism, broadens semantic anchoring to open and thematic legal queries, refines the data models and temporal primitives, and strengthens the use cases, limitations, and bibliography

详情
英文摘要

In high-stakes legal domains, retrieval must preserve not only semantic relevance, but also the hierarchy, temporality, and causal provenance of legal norms. Standard Retrieval-Augmented Generation (RAG), based mainly on semantic similarity over text fragments, cannot reliably provide this level of control. Prior work on SAT-Graph RAG addressed the representation problem by modeling legal materials as structure-aware temporal knowledge graphs. This paper addresses the next problem: how an LLM-based reasoning agent can interact with such a graph without reintroducing unreliable retrieval behavior. We specify the SAT-Graph API, a canonical primitive interface for auditable reasoning over temporal knowledge graphs, developed and illustrated in the legal domain. The API exposes typed, atomic, and composable primitives that mediate between a probabilistic language model and a deterministic symbolic substrate. Its design follows Probability Isolation: uncertainty is confined to intent translation, semantic anchoring, and final narrative synthesis, while structural, temporal, and causal graph traversals are executed through deterministic operations over canonical anchors. The interface shifts legal RAG from single-shot Retrieve-then-Generate to active Reason-Act-Observe. An agent decomposes a legal question into an explicit execution plan, invokes primitives for point-in-time retrieval, context reconstruction, provenance tracing, and impact analysis, and produces an answer grounded in an auditable log of graph operations. The result is a formal architectural specification, not an empirical benchmark: a secure interaction protocol that decouples legal knowledge representation from agentic reasoning. Although illustrated in law, the primitive model is domain-portable to other temporally versioned, provenance-sensitive, and authority-governed knowledge bases.

2510.03093 2026-04-30 cs.CL cs.SD

Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

Oriol Pareras, Gerard I. Gállego, Federico Costa, Cristina España-Bonet, Javier Hernando

Comments To appear in Proc. ICASSP 2026, May 04-08, 2026, Barcelona, Spain

详情
英文摘要

Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.

2509.25236 2026-04-30 cs.AI cs.LG eess.SP

Networks of Causal Abstractions: A Sheaf-theoretic Framework

Gabriele D'Acunto, Paolo Di Lorenzo, Sergio Barbarossa

Comments Major changes: added convergence proof for CK diffusion dynamics; clarified relation with our prior work arXiv:2602.02623, removing substantial overlap; shortened background

详情
英文摘要

A core challenge in causal artificial intelligence is the principled coordination of multiple, imperfect, and subjective causal perspectives arising from distributed agents with limited and heterogeneous access to the environment. This problem has received little formal treatment, as the existing framework assumes a single shared global causal model. This work introduces the causal abstraction network (CAN), a general sheaf-theoretic framework for representing, learning, and reasoning across collections of mixture of causal models (MCMs) - a class that unifies several existing models of context-dependent causal mechanisms. Sheaf theory provides a natural foundation for this task, offering a rigorous framework to coherently align distributed causal knowledge without requiring explicit causal graphs, functional mechanisms, interventional data, or jointly sampled observations. At the theoretical level, we provide a categorical formulation of MCMs and characterize key properties of CANs, including consistency and smoothness. Under consistency, we establish necessary and sufficient conditions: (i) for the existence of global sections, linked to spectral properties of an associated connection Laplacian; and (ii) for the convergence of causal knowledge diffusion over the CAN to the space of global sections. At the methodological level, we exploit the compositionality of causal abstractions to decompose the learning of consistent CANs into local problems on network edges, extending our prior work on Gaussian variables to Gaussian mixtures via the proposed MIXTURE-CALSEP algorithm. We validate the framework on synthetic data and through a financial application involving a multi-agent trading system, demonstrating CAN recovery, CAN-based portfolio optimization, and counterfactual reasoning.

2509.23980 2026-04-30 cs.CV

Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution

Jinpei Guo, Yifei Ji, Shengwei Wang, Zheng Chen, Yufei Wang, Sizhuo Ma, Yong Guo, Baiang Li, Jusheng Zhang, Yulun Zhang, Jian Wang

详情
英文摘要

Diffusion models have recently shown promising results for video super-resolution (VSR). However, directly adapting generative diffusion models to VSR can result in redundancy, since low-quality videos already preserve substantial content information. Such redundancy leads to increased computational overhead and learning burden, as the model performs superfluous operations and must learn to filter out irrelevant information. To address this problem, we propose OASIS, an efficient $\textbf{o}$ne-step diffusion model with $\textbf{a}$ttention $\textbf{s}$pecialization for real-world v$\textbf{i}$deo $\textbf{s}$uper-resolution. OASIS incorporates an attention specialization routing that assigns attention heads to different patterns according to their intrinsic behaviors. This routing mitigates redundancy while effectively preserving pretrained knowledge, allowing diffusion models to better adapt to VSR and achieve stronger performance. Moreover, we propose a simple yet effective progressive training strategy, which starts with temporally consistent degradations and then shifts to inconsistent settings. This strategy facilitates learning under complex degradations. Extensive experiments demonstrate that OASIS achieves state-of-the-art performance on both synthetic and real-world datasets. OASIS also provides superior inference speed, offering a $\textbf{6.2$\times$}$ speedup over one-step diffusion baselines such as SeedVR2. The code will be available at \href{https://github.com/jp-guo/OASIS}{https://github.com/jp-guo/OASIS}.

2509.23410 2026-04-30 cs.LG cs.AI cs.PF

PATCH: Learnable Tile-level Hybrid Sparsity for LLMs

Younes Hourri, Mohammad Mozaffari, Maryam Mehri Dehnavi

详情
英文摘要

Large language models (LLMs) deliver impressive performance but incur prohibitive memory and compute costs at deployment. Model pruning is an effective way to reduce these overheads, yet existing approaches face challenges: unstructured sparsity, where nonzeros can appear anywhere, preserves accuracy but yields irregular access patterns that prevent GPU acceleration, while semi-structured 2:4 sparsity is hardware-friendly but enforces a rigid 50% pattern that degrades model quality. To bridge this gap, we introduce PATCH, a hybrid sparsity framework that enables a continuous sparsity ratio between 0% and 50%. PATCH partitions weight matrices into tiles, assigning each tile to be either dense or 2:4 sparse via a learnable mask selection mechanism. This design provides fine-grained control over accuracy-acceleration tradeoffs and supports non-uniform sparsity across layers, leading to superior overall quality. Across models from 0.5B to 13B parameters, PATCH consistently narrows the gap to dense accuracy while delivering practical speedups. For instance, on LLaMA-2 7B with an A6000 GPU, PATCH achieves 1.18x-1.38x end-to-end speedup over dense baselines while improving accuracy by 0.37%-2.96% compared to the state-of-the-art 2:4 pruning method, MaskLLM.

2509.21983 2026-04-30 cs.RO cs.AI

Hybrid Diffusion for Simultaneous Symbolic and Continuous Planning

Sigmund Hennum Høeg, Aksel Vaaler, Chaoqi Liu, Olav Egeland, Yilun Du

Comments 10 pages, 11 figures. This work has been submitted to the IEEE for possible publication. See https://sigmundhh.com/hybrid_diffusion/ for the project website

详情
Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 4, pp. 4489-4496, April 2026
英文摘要

Constructing robots to accomplish long-horizon tasks is a long-standing challenge within artificial intelligence. Approaches using generative methods, particularly Diffusion Models, have gained attention due to their ability to model continuous robotic trajectories for planning and control. However, we show that these models struggle with long-horizon tasks that involve complex decision-making and, in general, are prone to confusing different modes of behavior, leading to failure. To remedy this, we propose to augment continuous trajectory generation by simultaneously generating a high-level symbolic plan. We show that this requires a novel mix of discrete variable diffusion and continuous diffusion, which dramatically outperforms the baselines. In addition, we illustrate how this hybrid diffusion process enables flexible trajectory synthesis, allowing us to condition synthesized actions on partial and complete symbolic conditions.

2509.20265 2026-04-30 cs.LG cs.CL

Failure Modes of Maximum Entropy RLHF

Ömer Veysel Çağatan, Barış Akgün

Comments 23 pages, 12 figures

详情
英文摘要

In this paper, we show that Simple Preference Optimization (SimPO) can be derived as Maximum Entropy Reinforcement Learning, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments find that Maximum Entropy RL frequently exhibits overoptimization and unstable KL dynamics across model scales, with overoptimization persisting even at conservative learning rates for some configurations. Unlike KL-constrained methods that maintain stable training, entropy regularization fails to reliably prevent reward hacking and, in our experiments, correlates with the onset of overoptimization rather than guarding against it. Even in configurations where training remains stable, entropy regularization is not the stabilizing factor. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online scenarios. Our findings suggest that reference-free approaches may face distinct challenges when applied to online versus offline preference learning.

2509.17625 2026-04-30 cs.LG cs.CY physics.soc-ph stat.ME

Comparing Data Assimilation and Likelihood-Based Inference on Latent State Estimation in Agent-Based Models

Blas Kolic, Corrado Monti, Gianmarco De Francisci Morales, Marco Pangallo

详情
英文摘要

In this paper, we present the first systematic comparison of Data Assimilation (DA) and Likelihood-Based Inference (LBI) in the context of an Agent-Based Model (ABM). These models generate observable time series driven by evolving, partially-latent microstates. Latent states must be estimated to align simulations with real-world data, a task traditionally addressed by DA, particularly in continuous and equation-based models used in weather forecasting. However, the nature of ABMs poses challenges for standard DA methods. Solving such issues requires adapting previous DA techniques or using ad hoc alternatives such as LBI. DA approximates the likelihood in a model-agnostic way, making it broadly applicable but potentially less precise. In contrast, LBI provides more accurate state estimation by directly leveraging the model's likelihood, but at the cost of requiring a hand-crafted, model-specific likelihood function, which may be complex or infeasible to derive. We compare the two methods on the Bounded-Confidence Model, a well-known opinion dynamics ABM, where agents are affected only by others holding sufficiently similar opinions. We find that LBI better recovers latent agent-level opinions, even under model mis-specification, leading to improved individual-level forecasts. At the aggregate level, however, both methods perform comparably, and DA remains competitive across levels of aggregation under certain parameter settings. Our findings suggest that DA is well-suited for aggregate predictions, while LBI is preferable for agent-level inference.

2509.16591 2026-04-30 cs.CL

Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature

Zheng Liu, Mengjie Liu, Siwei Wen, Mengzhang Cai, Bin Cui, Conghui He, Wentao Zhang

详情
英文摘要

Using entropy as a measure of heterogeneity to guide optimization has emerged as a crucial research direction in Reinforcement Learning for LLMs. However, existing methods typically treat it as a discrete filter or post-hoc regulator rather than a core optimization driver. To fully leverage the potential of entropy and achieve fine-grained regulation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a token-aware algorithm that continuously adapts optimization dynamics based on token-level entropy throughout the entire training process. Our algorithm includes four key components: (1) Adaptive Temperature Sampling that adjusts sampling temperature in real time, promoting exploration at high-entropy tokens. (2) Token-Level Group Average Advantage Estimation that estimates advantages at token level, accounting for sequence-length effects while preserving non-biased treatment.(3) Differential Advantage Redistribution that leverages entropy and importance ratios to adjust advantages for tokens with clear signals. (4) Asymmetric Adaptive Clipping that adynamically adjusts clipping boundaries based on token-level entropy. Through systematic investigation of entropy, we embed token-level treatment into every stage. Extensive experiments on mathematical reasoning, code, and logic tasks across multiple models demonstrate HAPO's consistent superiority over DAPO. Our code can be found in https://github.com/starriver030515/HAPO.

2509.07523 2026-04-30 cs.LG

RoseCDL: Robust and Scalable Convolutional Dictionary Learning for Rare event and Anomaly Detection

Jad Yehya, Mansour Benbakoura, Cédric Allain, Benoît Malezieux, Matthieu Kowalski, Thomas Moreau

Comments Accepted to the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026

详情
英文摘要

Detecting rare events and anomalies in large-scale signals is essential in fields such as astronomy, physical simulations, and biomedical science. In many cases, this problem naturally decomposes into identifying common local patterns and detecting deviations that correspond to anomalies. Convolutional Dictionary Learning (CDL) is a powerful tool for modeling local structures, but its adoption for this task has been limited by computational demands and sensitivity to outliers. We introduce RoseCDL, a novel CDL algorithm designed for robust and scalable modeling of signal pattern distribution. RoseCDL leverages stochastic windowing for efficient training and incorporates inline outlier detection to enhance robustness. This enables unsupervised identification of anomalous and rare patterns in long signals based on the local reconstruction loss. Experiments on real-world datasets show that RoseCDL delivers improved detection accuracy and computational efficiency, making CDL practical for challenging detection tasks in large-scale signal analysis.

2508.19900 2026-04-30 cs.LG

Adaptive Scaling of Policy Constraints for Offline Reinforcement Learning

Tan Jing, Xiaorui Li, Chao Yao, Xiaojuan Ban, Yuetong Fang, Renjing Xu, Zhaolin Yuan

详情
英文摘要

Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered during offline RL training. However, because the scale of the constraints varies across tasks and datasets of differing quality, existing methods must meticulously tune hyperparameters to match each dataset, which is time-consuming and often impractical. We propose Adaptive Scaling of Policy Constraints (ASPC), a second-order differentiable framework that dynamically balances RL and behavior cloning (BC) during training. We theoretically analyze its performance improvement guarantee. In experiments on 39 datasets across four D4RL domains, ASPC using a single hyperparameter configuration outperforms other adaptive constraint methods and state-of-the-art offline RL algorithms that require per-dataset tuning while incurring only minimal computational overhead. The code will be released at https://github.com/Colin-Jing/ASPC.

2508.12672 2026-04-30 cs.LG cs.AI

Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering

Emmanouil Kritharakis, Dusan Jakovetic, Antonios Makris, Konstantinos Tserpes

Comments Accepted at the 3rd Workshop on Advancements in Federated Learning (WAFL@ECML-PKDD 2025)

详情
英文摘要

Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios wherein FL clients are subject to adversarial (Byzantine) attacks, while the FL server is trusted (honest) and has a trustworthy side dataset. This may correspond to, e.g., cases where the server possesses trusted data prior to federation, or to the presence of a trusted client that temporarily assumes the server role. Our approach requires only two honest participants, i.e., the server and one client, to function effectively, without prior knowledge of the number of malicious clients. Theoretical analysis demonstrates bounded optimality gaps even under strong Byzantine attacks. Experimental results show that our algorithm significantly outperforms standard and robust FL baselines such as Mean, Trimmed Mean, Median, Krum, and Multi-Krum under various attack strategies including label flipping, sign flipping, and Gaussian noise addition across MNIST, FMNIST, and CIFAR-10 benchmarks using the Flower framework.

2508.09547 2026-04-30 cs.CV cs.AI

GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning

Fengyi Wu, Yifei Dong, Yilong Dai, Guangyu Chen, Qifeng Wu, Huiting Huang, Hang Wang, Qi Dai, Alexander G. Hauptmann, Zhi-Qi Cheng

Comments Accepted to ACL 2026 Findings. 22 pages, 12 figures, Code: https://github.com/F1y1113/GoViG

详情
英文摘要

We introduce Goal-Conditioned Visual Navigation Instruction Generation (GoViG), a new task that aims to generate contextually coherent navigation instructions solely from egocentric visual observations of initial and goal states. Unlike prior work relying on structured inputs, such as semantic annotations or environmental maps, GoViG exclusively leverages raw egocentric visual data, improving adaptability to unseen and unstructured environments. Our method addresses this task by decomposing it into two interconnected subtasks: (1) navigation visualization, predicting intermediate visual states bridging the initial and goal views; and (2) instruction generation, synthesizing coherent instructions grounded in observed and anticipated visuals. Both subtasks are integrated within an autoregressive multimodal LLM trained with tailored objectives to ensure spatial accuracy and linguistic clarity. Furthermore, we introduce two multimodal reasoning strategies, one-pass and interleaved reasoning, to mimic incremental human navigation cognition. To comprehensively evaluate our method, we propose the R2R-Goal dataset, combining diverse synthetic and real-world trajectories. Empirical results demonstrate significant performance improvements over state-of-the-art methods in BLEU-4 and CIDEr scores along with robust cross-domain generalization.

2508.07220 2026-04-30 cs.LG cs.AI

Neural Bridge Processes

Jian Xu, Yican Liu, Delu Zeng, John Paisley, Qibin Zhao

详情
英文摘要

Learning stochastic functions from partially observed context-target pairs requires models that are expressive, uncertainty-aware, and strongly conditioned on inputs. Neural Diffusion Processes (NDPs) improve expressivity with denoising diffusion, but their forward process is input-independent; inputs only enter the reverse denoiser, so the noisy training states themselves do not encode the conditioning inputs. We propose Neural Bridge Processes (NBPs), which replace the unconditional forward kernel with an input-anchored bridge trajectory. When input and output dimensions differ, NBP learns an output-space anchor $a_ψ(x)=P_ψ(x)$, allowing coordinates or other inputs to guide the generative path without changing the denoising backbone. We show theoretically that process-level anchoring induces pathwise input distinguishability, injects information about x into noisy states, and creates a direct gradient pathway unavailable to NDPs. Experiments on synthetic regression, EEG, CylinderFlow, and image regression show consistent improvements. Additional ablations show that the gains come from the full bridge construction with learned alignment, and that the same input-anchored path principle transfers to Flow Matching Neural Processes. These results suggest that bridge-anchored generative paths provide a general mechanism for strengthening conditional stochastic function modeling.

2508.04325 2026-04-30 cs.CL cs.AI cs.CV cs.LG cs.MM

Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models

Wenting Chen, Guo Yu, Yiu-Fai Cheung, Meidan Ding, Jie Liu, Zizhan Ma, Wenxuan Wang, Linlin Shen

Comments Accepted by ACL 2026

详情
英文摘要

Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.

2507.21420 2026-04-30 cs.CV cs.CL

ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

Chaoyu Li, Yogesh Kulkarni, Pooyan Fazli

Comments ACL 2026. Project page: https://people-robots.github.io/regate

详情
英文摘要

The computational cost of training multimodal large language models (MLLMs) grows rapidly with the number of processed tokens. Existing efficiency methods mainly target inference via token reduction or merging, offering limited benefits during training. We introduce ReGATE (Reference-Guided Adaptive Token Elision), an adaptive token pruning method for accelerating MLLM training. ReGATE adopts a teacher-student framework, in which a frozen teacher LLM provides per-token guidance losses that are fused with an exponential moving average of the student's difficulty estimates. This adaptive scoring mechanism dynamically selects informative tokens while skipping redundant ones in the forward pass, substantially reducing computation without altering the model architecture. Across three representative MLLMs, ReGATE matches the peak accuracy of standard training on MVBench up to 2$\times$ faster, using only 38% of the tokens. With extended training, it even surpasses the baseline across multiple multimodal benchmarks, cutting total token usage by over 41%.

2507.12549 2026-04-30 cs.LG cs.CC stat.ML

The Serial Scaling Hypothesis

Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, Yutong Bai

Comments ICLR 2026. Equal contribution by the first two authors. Project page: https://serial-scaling-hypothesis.github.io

详情
英文摘要

While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These "inherently serial" problems-from mathematical reasoning to physical simulations to sequential decision-making-require sequentially dependent computational steps that cannot be efficiently parallelized. We formalize this distinction in complexity theory, and demonstrate that current parallel-centric architectures face fundamental limitations on such tasks. Then, we show for first time that diffusion models despite their sequential nature are incapable of solving inherently serial problems. We argue that recognizing the serial nature of computation holds profound implications on machine learning, model design, and hardware development.

2507.01544 2026-04-30 cs.LG

MARVIS: Modality Adaptive Reasoning over VISualizations

Benjamin Feuer, Lennart Purucker, Oussama Elachqar, Chinmay Hegde

详情
英文摘要

Predictive applications of machine learning often rely on small (sub 1 Bn parameter) specialized models tuned to particular domains or modalities. Such models often achieve excellent performance, but lack flexibility. LLMs and VLMs offer versatility, but typically underperform specialized predictors, especially on non-traditional modalities and long-tail domains. We propose MARVIS (Modality Adaptive Reasoning over VISualizations), a system that transforms latent embedding spaces into visual representations and then leverages the spatial and fine-grained reasoning skills of VLMs to interpret the visualizations and utilize them for predictions successfully. MARVIS achieves competitive performance across vision, audio, biological, and tabular domains using a single 3B parameter model, yielding results that beat Gemini 2.0 by 16% on average. MARVIS drastically reduces the gap between LLM/VLMs approaches and specialized domain-specific methods, without requiring any domain-specific training. Code and datasets are available at https://github.com/penfever/marvis.

2507.01449 2026-04-30 cs.CL

LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun, Xiaoyan Sun

详情
英文摘要

Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieval the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61 $\times$ speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.

2506.23323 2026-04-30 cs.CV

FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation

Huy Che, Vinh-Tiep Nguyen

详情
Journal ref
Neurocomputing 660 (2026) 131844
英文摘要

Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive learning based models enable zero-shot segmentation, they often lose fine spatial precision at pixel level, due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often face challenges in balancing the computation costs and the quality of the segmentation mask. In this work, we present FA-Seg, a Fast and Accurate training-free framework for open-vocabulary segmentation based on diffusion models. FA-Seg performs segmentation using only a (1+1)-step from a pretrained diffusion model. Moreover, instead of running multiple times for different classes, FA-Seg performs segmentation for all classes at once. To further enhance the segmentation quality, FA-Seg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances semantic precision via multi-resolution attention fusion, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FA-Seg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FA-Seg provides a strong foundation for extendability, bridging the gap between segmentation quality and inference efficiency. The source code is available at https://github.com/chequanghuy/FA-Seg.

2506.20876 2026-04-30 cs.CL

Decide less, communicate more: On the construct validity of end-to-end fact-checking in medicine

Sebastian Joseph, Lily Chen, Barry Wei, Michael Mackert, Iain J. Marshall, Paul Pu Liang, Ramez Kouzy, Byron C. Wallace, Junyi Jessy Li

Comments ACL 2026 Findings camera-ready

详情
英文摘要

Technological progress has led to concrete advancements in tasks that were regarded as challenging, such as automatic fact-checking. Interest in adopting these systems for public health and medicine has grown due to the high-stakes nature of medical decisions and challenges in critically appraising a vast and diverse medical literature. Evidence-based medicine connects to every individual, and yet the nature of it is highly technical, rendering the medical literacy of majority users inadequate to sufficiently navigate the domain. Such problems with medical communication ripen the ground for end-to-end fact-checking agents: check a claim against current medical literature and return with an evidence-backed verdict. And yet, such systems remain largely unused. In this position paper, developed with expert input, we present the first study examining how clinical experts verify real claims from social media by synthesizing medical evidence. In searching for this upper-bound, we reveal fundamental challenges in end-to-end fact-checking when applied to medicine: Difficulties connecting claims in the wild to scientific evidence in the form of clinical trials; ambiguities in underspecified claims mixed with mismatched intentions; and inherently subjective veracity labels. We argue that fact-checking should be approached as an interactive communication problem, rather than an end-to-end process.

2506.18499 2026-04-30 cs.LG cs.AI cs.DB

PuckTrick: A Library for Making Synthetic Data More Realistic

Alessandra Agostini, Andrea Maurino, Blerina Spahiu

Comments 17 pages, 3 figures

详情
英文摘要

The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance, offering a structured approach to evaluating ML model resilience under real-world data imperfections. Pucktrick provides two contamination modes: one for injecting errors into clean datasets and another for further corrupting already contaminated datasets. Through extensive experiments on real-world financial datasets, we evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data, particularly for tree-based and linear models such as SVMs and Extra Trees.

2506.16742 2026-04-30 cs.CV

Uncertainty-Aware Information Pursuit for Interpretable and Reliable Medical Image Analysis

Md Nahiduzzaman, Steven Korevaar, Zongyuan Ge, Feng Xia, Alireza Bab-Hadiashar, Ruwan Tennakoon

Comments Accepted to IEEE Transactions on Medical Imaging (IEEE TMI 2025)

详情
英文摘要

To be adopted in safety-critical domains like medical image analysis, AI systems must provide human-interpretable decisions. Variational Information Pursuit (V-IP) offers an interpretable-by-design framework by sequentially querying input images for human-understandable concepts, using their presence or absence to make predictions. However, existing V-IP methods overlook sample-specific uncertainty in concept predictions, which can arise from ambiguous features or model limitations, leading to suboptimal query selection and reduced robustness. In this paper, we propose an interpretable and uncertainty-aware framework for medical imaging that addresses these limitations by accounting for upstream uncertainties in concept-based, interpretable-by-design models. Specifically, we introduce two uncertainty-aware models, EUAV-IP and IUAV-IP, that integrate uncertainty estimates into the V-IP querying process to prioritize more reliable concepts per sample. EUAV-IP skips uncertain concepts via masking, while IUAV-IP incorporates uncertainty into query selection implicitly for more informed and clinically aligned decisions. Our approach allows models to make reliable decisions based on a subset of concepts tailored to each individual sample, without human intervention, while maintaining overall interpretability. We evaluate our methods on five medical imaging datasets across four modalities: dermoscopy, X-ray, ultrasound, and blood cell imaging. The proposed IUAV-IP model achieves state-of-the-art accuracy among interpretable-by-design approaches on four of the five datasets, and generates more concise explanations by selecting fewer yet more informative concepts. These advances enable more reliable and clinically meaningful outcomes, enhancing model trustworthiness and supporting safer AI deployment in healthcare.

2506.13116 2026-04-30 cs.LG cs.CL

Crime Hotspot Prediction Using Deep Graph Convolutional Networks

Tehreem Zubair, Syeda Kisaa Fatima, Noman Ahmed, Asifullah Khan

详情
英文摘要

Crime hotspot prediction is critical for ensuring urban safety and effective law enforcement, it remains challenging due to complex spatial dependencies that are inherent in criminal activities. The traditional approaches use classical algorithms such as the KDE and SVM to model data distributions and decision boundaries. The methods often fail to capture these spatial relationships, treating crime events as independent and ignoring geographical interactions. To address this, we propose a novel framework based on Graph Convolutional Networks (GCNs), which explicitly model all of spatial dependencies by representing crime data as a graph. In this graph, nodes represent discrete geographic grid cells and edges capture proximity relationships. The spatial features from Chicago Crime Dataset are used in this system, a multi-layer GCN model is trained to classify crime types and predict high-risk zones. Our approach significantly outperforms traditional approaches, achieving 78% classification accuracy. Moreover, the model generates interpretable heat maps of crime hotspots, demonstrating the usefulness of graph-based learning for predictive policing and spatial criminology.

2506.02494 2026-04-30 cs.CL cs.AI cs.CV

MINOS: A Multimodal Evaluation Model for Bidirectional Generation Between Image and Text

Junzhe Zhang, Huixuan Zhang, Xinyu Hu, Li Lin, Mingqi Gao, Shi Qiu, Xiaojun Wan

Comments Accepted to the Findings of ACL 2026

详情
英文摘要

Evaluation is important for multimodal generation tasks, while traditional multimodal evaluation metrics suffer from several limitations. With the rapid progress of MLLMs, there is growing interest in applying MLLMs to build general evaluation systems. However, existing researches often simply collect large-scale evaluation data for training, while overlooking the quality of evaluation data. What's more, current proposed evaluation models often struggle to achieve consistently strong performance across both image-to-text (I2T) and text-to-image (T2I) tasks. In this paper, through rigorous quality control strategies, we construct a comprehensive multimodal evaluation dataset, Minos-57K, with evaluation samples across 15 datasets, for developing the multimodal evaluation model Minos with SFT and preference alignment training strategies. Notably, despite using less than half the scale of the training data of prior work, our model achieves state-of-the-art evaluation performance across 16 out-of-domain datasets covering both I2T and T2I tasks among all open-source multimodal evaluation models and remain competitive with closed-source models. Extensive experiments demonstrate the importance of leveraging quality control process, jointly training on evaluation data from both I2T and T2I generation tasks and further preference alignment.

2505.24867 2026-04-30 cs.CV cs.AI

Time Blindness: Why Video-Language Models Can't See What Humans Can?

Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, Mohamed Elhoseiny

Comments Accepted at IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 Project page at https://timeblindness.github.io

详情
英文摘要

Recent advances in vision-language models (VLMs) have made impressive strides in understanding spatio-temporal relationships in videos. However, when spatial information is obscured, these models struggle to capture purely temporal patterns. We introduce $\textbf{SpookyBench}$, a benchmark where information is encoded solely in temporal sequences of noise-like frames, mirroring natural phenomena from biological signaling to covert communication. Interestingly, while humans can recognize shapes, text, and patterns in these sequences with over 98% accuracy, state-of-the-art VLMs achieve 0% accuracy. This performance gap highlights a critical limitation: an over-reliance on frame-level spatial features and an inability to extract meaning from temporal cues. Furthermore, when trained in data sets with low spatial signal-to-noise ratios (SNR), temporal understanding of models degrades more rapidly than human perception, especially in tasks requiring fine-grained temporal reasoning. Overcoming this limitation will require novel architectures or training paradigms that decouple spatial dependencies from temporal processing. Our systematic analysis shows that this issue persists across model scales and architectures. We release SpookyBench to catalyze research in temporal pattern recognition and bridge the gap between human and machine video understanding. Dataset and code has been made available on our project website: https://timeblindness.github.io/.

2505.22910 2026-04-30 cs.CL

Talent or Luck? Evaluating Attribution Bias in Large Language Models

Chahat Raj, Mahika Banerjee, Jinhao Pan, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu

Comments Accepted to ACL Findings 2026

详情
英文摘要

When a student fails an exam, do we tend to blame their effort or the test's difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal (e.g., effort, ability) or external (e.g., task difficulty, luck) factors. LLMs' attribution of event outcomes based on demographics carries important fairness implications. Most works exploring social biases in LLMs focus on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how models' reasoning disparities channelize biases toward demographic groups.

2505.22897 2026-04-30 cs.CL

VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

Chahat Raj, Bowen Wei, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu

Comments Accepted to ACL 2026

详情
英文摘要

While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies, through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.

2505.21190 2026-04-30 cs.CL cs.AI

Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation

Jong Hak Moon, Geon Choi, Paloma Rabaey, Min Gwan Kim, Jung-Oh Lee, Hyuk Gi Hong, Eun Woo Doe, Hangyul Yoon, Jiyoun Kim, Harshita Sharma, Daniel C. Castro, Javier Alvarez-Valle, Edward Choi

Comments CHIL (Conference on Health, Inference, and Learning) 2026

详情
英文摘要

Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE, a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 186 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage structuring framework that transforms generated reports into fine-grained, schema-aligned structured reports, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage

2505.20414 2026-04-30 cs.CV cs.AI cs.RO

RetroMotion: Retrocausal Motion Forecasting Models are Instructable

Royden Wagner, Omer Sahin Tas, Felix Hauser, Marlon Steiner, Dominik Strutz, Abhishek Vivekanandan, Jaime Villa, Yinzhe Shen, Carlos Fernandez, Christoph Stiller

Comments CVPRW26

详情
英文摘要

Motion forecasts of road users (i.e., agents) vary in complexity depending on the number of agents, scene constraints, and interactions. In particular, the output space of joint trajectory distributions grows exponentially with the number of agents. Therefore, we decompose multi-agent motion forecasts into (1) marginal distributions for all modeled agents and (2) joint distributions for interacting agents. Using a transformer model, we generate joint distributions by re-encoding marginal distributions followed by pairwise modeling. This incorporates a retrocausal flow of information from later points in marginal trajectories to earlier points in joint trajectories. For each time step, we model the positional uncertainty using compressed exponential power distributions. Notably, our method achieves strong results in the Waymo Interaction Prediction Challenge and generalizes well to the Argoverse 2 and V2X-Seq datasets. Additionally, our method provides an interface for issuing instructions. We show that standard motion forecasting training implicitly enables the model to follow instructions and adapt them to the scene context. GitHub repository: https://github.com/kit-mrt/future-motion