arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1864
2602.18887 2026-04-01 cs.CV

SafeDrive: Fine-Grained Safety Reasoning for End-to-End Driving in a Sparse World

Jungho Kim, Jiyong Oh, Seunghoon Yu, Hongjae Shin, Donghyuk Kwak, Jun Won Choi

Comments Accepted to CVPR 2026, 19 pages, 9 figures

详情
英文摘要

The end-to-end (E2E) paradigm, which maps sensor inputs directly to driving decisions, has recently attracted significant attention due to its unified modeling capability and scalability. However, ensuring safety in this unified framework remains one of the most critical challenges. In this work, we propose SafeDrive, an E2E planning framework designed to perform explicit and interpretable safety reasoning through a trajectory-conditioned Sparse World Model. SafeDrive comprises two complementary networks: the Sparse World Network (SWNet) and the Fine-grained Reasoning Network (FRNet). SWNet constructs trajectory-conditioned sparse worlds that simulate the future behaviors of critical dynamic agents and road entities, providing interaction-centric representations for downstream reasoning. FRNet then evaluates agent-specific collision risks and temporal adherence to drivable regions, enabling precise identification of safety-critical events across future timesteps. SafeDrive achieves state-of-the-art performance on both open-loop and closed-loop benchmarks. On NAVSIM, it records a PDMS of 91.6 and an EPDMS of 87.5, with only 61 collisions out of 12,146 scenarios (0.5%). On Bench2Drive, SafeDrive attains a 66.8% driving score.

2602.17525 2026-04-01 cs.LG math.ST stat.ML stat.TH

Variational inference via radial transport

Luca Ghafourpour, Sinho Chewi, Alessio Figalli, Aram-Alexandre Pooladian

详情
英文摘要

In variational inference (VI), the practitioner approximates a high-dimensional distribution $π$ with a simple surrogate one, often a (product) Gaussian distribution. However, in many cases of practical interest, Gaussian distributions might not capture the correct radial profile of $π$, resulting in poor coverage. In this work, we approach the VI problem from the perspective of optimizing over these radial profiles. Our algorithm radVI is a cheap, effective add-on to many existing VI schemes, such as Gaussian (mean-field) VI and Laplace approximation. We provide theoretical convergence guarantees for our algorithm, owing to recent developments in optimization over the Wasserstein space--the space of probability distributions endowed with the Wasserstein distance--and new regularity properties of radial transport maps in the style of Caffarelli (2000).

2602.15772 2026-04-01 cs.CV cs.AI

Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models

Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, Han Hu

Comments Accepted to ICLR2026

详情
英文摘要

Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyzed this trade-off and identify the primary cause might be the potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This innovative algorithm re-frames the single-step generation task into a multi-step process of "generate-understand-regenerate". By explicitly leveraging the model's understanding capability during generation, we successfully mitigate the optimization dilemma, achieved stronger generation results and improved understanding ability which are related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.

2602.06241 2026-04-01 cs.LG cs.CE

A Fast and Generalizable Fourier Neural Operator-Based Surrogate for Melt-Pool Prediction in Laser Processing

Alix Benoit, Toni Ivas, Mateusz Papierz, Asel Sagingalieva, Alexey Melnikov, Elia Iseli

Comments 29 pages, 12 figures, 6 tables

详情
英文摘要

High-fidelity simulations of laser welding capture complex thermo-fluid phenomena, including phase change, free-surface deformation, and keyhole dynamics, however their computational cost limits large-scale process exploration and real-time use. In this work we present the Laser Processing Fourier Neural Operator (LP-FNO), a Fourier Neural Operator (FNO) based surrogate model that learns the parametric solution operator of various laser processes from multiphysics simulations generated with FLOW-3D WELD (registered trademark). Through a novel approach of reformulating the transient problem in the moving laser frame and applying temporal averaging, the system results in a quasi-steady state setting suitable for operator learning, even in the keyhole welding regime. The proposed LP-FNO maps process parameters to three-dimensional temperature fields and melt-pool boundaries across a broad process window spanning conduction and keyhole regimes using the non-dimensional normalized enthalpy formulation. The model achieves temperature prediction errors on the order of 1% and intersection-over-union scores for melt-pool segmentation over 0.9. We demonstrate that a LP-FNO model trained on coarse-resolution data can be evaluated on finer grids, yielding accurate super-resolved predictions in mesh-converged conduction regimes, whereas discrepancies in keyhole regimes reflect unresolved dynamics in the coarse-mesh training data. These results indicate that the LP-FNO provides an efficient surrogate modeling framework for laser welding, enabling prediction of full three-dimensional fields and phase interfaces over wide parameter ranges in just tens of milliseconds, up to a hundred thousand times faster than traditional Finite Volume multi-physics software.

2602.03584 2026-04-01 cs.CL cs.AI cs.LG

$V_0$: A Generalist Value Model for Any Policy at State Zero

Yi-Kai Zhang, Zhiyuan Yao, Hongyan Hao, Yueqing Sun, Qi Gu, Hui Su, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye

详情
英文摘要

Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet, this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts without requiring parameter updates. We reframe value estimation by treating the policy's dynamic capability as an explicit context input; specifically, we leverage a history of instruction-performance pairs to dynamically profile the model, departing from the traditional paradigm that relies on parameter fitting to perceive capability shifts. Focusing on value estimation at State Zero (i.e., the initial prompt, hence $V_0$), our model serves as a critical resource scheduler. During GRPO training, $V_0$ predicts success rates prior to rollout, allowing for efficient sampling budget allocation; during deployment, it functions as a router, dispatching instructions to the most cost-effective and suitable model. Empirical results demonstrate that $V_0$ significantly outperforms heuristic budget allocation and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks.

2602.01639 2026-04-01 cs.CV

ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval

Tianyu Yang, Chenwei He, Xiangzhao Hao, Tianyue Wang, Jiarui Guo, Haiyun Guo, Leigang Qu, Jinqiao Wang, Tat-Seng Chua

Comments Accepted to CVPR 2026

详情
英文摘要

Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with cross-modality compositional reasoning required for this task. While adapting generative Multimodal Large Language Models (MLLMs) for retrieval offers a promising direction, we identify that this strategy overlooks a fundamental issue: compressing a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, which leads to Capability Degradation - the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline: First, we diagnose cognitive blind spots of the retriever via self-guided informative instance mining. Next, we generate corrective instructions and triplets by prompting the foundation MLLM and conduct quality control with VQA-based consistency filtering. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the discriminative embedding space of retriever with intrinsic compositional reasoning within the MLLM. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code is available at https://github.com/RemRico/Recall.

2601.21747 2026-04-01 cs.LG cs.AI

Temporal Sepsis Modeling: a Relational and Explainable-by-Design Framework

Vincent Lemaire, Nédra Meloulli, Pierre Jaquet

详情
英文摘要

Sepsis remains one of the most complex and heterogeneous syndromes in intensive care, characterized by diverse physiological trajectories and variable responses to treatment. While deep learning models perform well in the early prediction of sepsis, they often lack interpretability and ignore latent patient sub-phenotypes. In this work, we propose a machine learning framework by opening up a new avenue for addressing this issue: a relational approach. Temporal data from electronic medical records (EMRs) are viewed as multivariate patient logs and represented in a relational data schema. Then, a propositionalisation technique (based on classic aggregation/selection functions from the field of relational data) is applied to construct interpretable features to "flatten" the data. Finally, the flattened data is classified using a selective naive Bayesian classifier. Experimental validation demonstrates the relevance of the suggested approach as well as its extreme interpretability. The interpretation is fourfold: univariate, global, local, and counterfactual.

2601.13802 2026-04-01 cs.CL cs.SD eess.AS

Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis

Yushen Chen, Junzhe Liu, Yujie Tu, Zhikang Niu, Yuzhe Liang, Chunyu Qiang, Chen Zhang, Kai Yu, Xie Chen

详情
英文摘要

Arabic spans over 30 spoken varieties, yet no open-source text-to-speech system unifies them. Key barriers include substantial cross-dialect lexical and phonological divergence, scarce synthesis-grade data, and the absence of a standardized multi-dialect evaluation benchmark. We present Habibi, a unified-dialectal Arabic TTS framework that addresses all three. Through a multi-step curation pipeline, we repurpose open-source ASR corpora into TTS training data covering 12+ regional dialects. A linguistically-informed curriculum learning strategy - progressing from Modern Standard Arabic to dialectal data - enables robust zero-shot synthesis without text diacritization. We further release the first standardized multi-dialect Arabic TTS benchmark, comprising over 11,000 utterances across 7 dialect subsets with manually verified transcripts. On this benchmark, our unified model matches or surpasses per-dialect specialized models. Both automatic metrics and human evaluations confirm that Habibi is highly competitive with ElevenLabs' Eleven v3 (alpha) in intelligibility, speaker similarity, and naturalness. Extensive ablations (~8,000 H100 GPU hours, 30+ configurations) validate each design choice. We open-source all checkpoints, training and inference code, and benchmark data - the first such release for multi-dialect Arabic TTS - at https://SWivid.github.io/Habibi/ .

2601.04065 2026-04-01 cs.CV cs.LG

Unsupervised Modular Adaptive Region Growing and RegionMix Classification for Wind Turbine Segmentation

Raül Pérez-Gonzalo, Riccardo Magro, Andreas Espersen, Antonio Agudo

Comments Accepted to WACV 2026

详情
英文摘要

Reliable operation of wind turbines requires frequent inspections, as even minor surface damages can degrade aerodynamic performance, reduce energy output, and accelerate blade wear. Central to automating these inspections is the accurate segmentation of turbine blades from visual data. This task is traditionally addressed through dense, pixel-wise deep learning models. However, such methods demand extensive annotated datasets, posing scalability challenges. In this work, we introduce an annotation-efficient segmentation approach that reframes the pixel-level task into a binary region classification problem. Image regions are generated using a fully unsupervised, interpretable Modular Adaptive Region Growing technique, guided by image-specific Adaptive Thresholding and enhanced by a Region Merging process that consolidates fragmented areas into coherent segments. To improve generalization and classification robustness, we introduce RegionMix, an augmentation strategy that synthesizes new training samples by combining distinct regions. Our framework demonstrates state-of-the-art segmentation accuracy and strong cross-site generalization by consistently segmenting turbine blades across distinct windfarms.

2601.00728 2026-04-01 cs.LG cs.NA math.NA

Precision autotuning for linear solvers via contextual bandit-based RL

Erin Carson, Xinye Chen

详情
英文摘要

We propose a reinforcement learning (RL) framework for adaptive precision tuning for linear solvers, which can be extended to general algorithms. The framework is formulated as a contextual bandit problem and solved using incremental action-value estimation with a discretized state space to select optimal precision configurations for computational steps, balancing precision and computational efficiency. To verify its effectiveness, we apply the framework to iterative refinement for solving linear systems $Ax = b$. In this application, our approach dynamically chooses precisions based on calculated features from the system while maintaining acceptable accuracy and convergence. In detail, an action-value estimator takes discretized features (e.g., approximate condition number and matrix norm) as input and outputs estimated action values, from which a policy selects the actions (chosen precision configurations for specific steps), optimized via an $ε$-greedy strategy to maximize a multi-objective reward to balance accuracy and computational cost. Empirical results demonstrate effective precision selection, reducing computational cost while maintaining accuracy comparable to double-precision baselines. The framework generalizes to diverse out-of-sample data and provides insights into applying RL precision selection to other numerical algorithms, advancing mixed-precision numerical methods in scientific computing. To the best of our knowledge, this is the first work on precision autotuning with RL with verification on unseen datasets.

2512.21507 2026-04-01 cs.CV

SVBench: Evaluation of Video Generation Models on Social Reasoning

Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang

Comments 10pages

详情
英文摘要

Recent text-to-video generation models have made remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they still struggle to produce socially coherent behavior. Unlike humans, who readily infer intentions, beliefs, emotions, and social norms from brief visual cues, current models often generate literal scenes without capturing the underlying causal and psychological dynamics. To systematically assess this limitation, we introduce the first benchmark for social reasoning in video generation. Grounded in developmental and social psychology, the benchmark covers thirty classic social cognition paradigms spanning seven core dimensions: mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we build a fully training-free agent-based pipeline that distills the reasoning structure of each paradigm, synthesizes diverse video-ready scenarios, enforces conceptual neutrality and difficulty control through cue-based critique, and evaluates generated videos with a high-capacity VLM judge along five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale evaluation of seven state-of-the-art video generation systems. Results show a clear gap between surface-level plausibility and deeper social reasoning, suggesting that current models remain limited in their ability to generate socially grounded behavior. https://github.com/Gloria2tt/SVBench-Evaluation

2512.21226 2026-04-01 cs.RO cs.SY eess.SY

Relative Localization System Design for SnailBot: A Modular Self-reconfigurable Robot

Shuhan Zhang

Comments The paper contains factual error and logic flaws, which needs to be repaired before submitting

详情
英文摘要

This paper presents the design and implementation of a relative localization system for SnailBot, a modular self reconfigurable robot. The system integrates ArUco marker recognition, optical flow analysis, and IMU data processing into a unified fusion framework, enabling robust and accurate relative positioning for collaborative robotic tasks. Experimental validation demonstrates the effectiveness of the system in realtime operation, with a rule based fusion strategy ensuring reliability across dynamic scenarios. The results highlight the potential for scalable deployment in modular robotic systems.

2512.16908 2026-04-01 cs.CV

SceneDiff: A Benchmark and Method for Multiview Object Change Detection

Yuqun Wu, Chih-hao Lin, Henry Che, Aditi Tiwari, Chuhang Zou, Shenlong Wang, Derek Hoiem

详情
英文摘要

We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Accurately identifying verifiable changes is extremely challenging -- some objects may appear to be missing because they are occluded or out of frame, while others may appear different due to large viewpoint changes. To study this problem, we introduce the SceneDiff Benchmark, the first multiview change detection dataset for scenes captured along different camera trajectories, comprising 350 diverse video pairs with dense object instance-level annotations. We also introduce the SceneDiff algorithm, a training-free approach that solves for image poses, segments images into objects, and compares them using semantic and geometric features. By building on pretrained models, SceneDiff generalizes across domains without retraining and naturally improves as the underlying models advance. Experiments on multiview and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (53.0\% and 30.6\% relative AP improvements). Project page: https://yuqunw.github.io/SceneDiff

2512.15713 2026-04-01 cs.CV

DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang

Comments 12 pages, 4 figures, conference or other essential info

详情
英文摘要

Diffusion-based decoding has recently emerged as an appealing alternative to autoregressive (AR) generation, offering the potential to update multiple tokens in parallel and reduce latency. However, diffusion vision language models (dVLMs) still lag significantly behind mainstream autoregressive vision language models. This is due to the scarcity and weaker performance of base diffusion language models (dLLMs) compared with their autoregressive counterparts. This raises a natural question: Can we build high-performing dVLMs directly from existing powerful AR models, without relying on dLLMs? We propose DiffusionVL, a family of dVLMs obtained by translating pretrained AR models into the diffusion paradigm via an efficient diffusion finetuning procedure that changes the training objective and decoding process while keeping the backbone architecture intact. Through an efficient diffusion finetuning strategy, we successfully adapt AR pretrained models into the diffusion paradigm. This approach yields two key observations: (1) The paradigm shift from AR-based multimodal models to diffusion is remarkably effective. (2) Direct conversion of an AR language model to a dVLM is also feasible, achieving performance comparable to that of the same AR model finetuned with standard autoregressive visual instruction tuning. To enable practical open-ended generation, we further integrate block decoding, which supports arbitrary-length outputs and KV-cache reuse for faster inference. Our experiments demonstrate that despite training with less than 5% of the data required by prior methods, DiffusionVL achieves a comprehensive performance improvement, with a 34.4% gain on the MMMU-Pro (vision) benchmark and 37.5% gain on the MME (Cog.) benchmark, alongside a 2x inference speedup. The model and code are released at https://github.com/hustvl/DiffusionVL.

2512.09833 2026-04-01 cs.RO

Bridging the Basilisk Astrodynamics Framework with ROS 2 for Modular Spacecraft Simulation and Hardware Integration

Elias Krantz, Ngai Nam Chan, Gunnar Tibert, Huina Mao, Christer Fuglesang

Comments Presented at the International Conference on Space Robotics (iSpaRo) 2025

Journal ref 2025 International Conference on Space Robotics (iSpaRo), Sendai, Japan, 2025

详情
英文摘要

Integrating high-fidelity spacecraft simulators with modular robotics frameworks remains a challenge for autonomy development. This paper presents a lightweight, open-source communication bridge between the Basilisk astrodynamics simulator and the Robot Operating System 2 (ROS 2), enabling real-time, bidirectional data exchange for spacecraft control. The bridge requires no changes to Basilisk's core and integrates seamlessly with ROS 2 nodes. We demonstrate its use in a leader-follower formation flying scenario using nonlinear model predictive control, deployed identically in both simulation and on the ATMOS planar microgravity testbed. This setup supports rapid development, hardware-in-the-loop testing, and seamless transition from simulation to hardware. The bridge offers a flexible and scalable platform for modular spacecraft autonomy and reproducible research workflows.

2512.08829 2026-04-01 cs.CV cs.AI

InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang

Comments 20 pages, 8 figures, conference or other essential info

详情
英文摘要

Vision-Language Models (VLMs) are increasingly tasked with ultra-long multimodal understanding. While linear architectures offer constant computation and memory footprints, they often struggle with high-frequency visual perception compared to standard Transformers. To bridge this gap, we introduce \textbf{InfiniteVL}. We first develop a hybrid base model called \textbf{InfiniteVL-Base} that interleaves a small fraction of Full Attention layers with Gated DeltaNet. Empowered by a tailored distillation and fine-tuning strategy, InfiniteVL-Base matches the fundamental multimodal performance of equivalent Transformers while achieving a \textbf{1.7$\times$} decoding speedup. However, the quadratic complexity of the retained Full Attention inevitably becomes an efficiency bottleneck when scaling to ultra long context. To break this barrier, we propose a novel Long-Sequence Architectural Fine-Tuning strategy that seamlessly transforms the dense attention into vision-specific sparse mechanisms. This yields two specialized variants: \textbf{InfiniteVL-Offline} for offline retrieval and \textbf{InfiniteVL-Online} for online streaming. By eliminating the computation explosion of global attention without sacrificing high-frequency visual recall, InfiniteVL-Offline achieves Transformer-level length generalization with a \textbf{5x} prefill acceleration at 256K context. Concurrently, InfiniteVL-Online delivers robust streaming perception with a constant memory footprint and a real-time throughput of \textbf{25} FPS. Code and models are available at https://github.com/hustvl/InfiniteVL.

2512.05955 2026-04-01 cs.RO cs.CV

SIMPACT: Simulation-Enabled Action Planning using Vision-Language Models

Haowen Liu, Shaoxiong Yao, Haonan Chen, Jiawei Gao, Jiayuan Mao, Jia-Bin Huang, Yilun Du

Comments Accepted to CVPR 2026; camera-ready version

详情
英文摘要

Vision-Language Models (VLMs) exhibit remarkable common-sense and semantic reasoning capabilities. However, they lack a grounded understanding of physical dynamics. This limitation arises from training VLMs on static internet-scale visual-language data that contain no causal interactions or action-conditioned changes. Consequently, it remains challenging to leverage VLMs for fine-grained robotic manipulation tasks that require physical understanding, reasoning, and corresponding action planning. To overcome this, we present SIMPACT, a test-time, SIMulation-enabled ACTion Planning framework that equips VLMs with physical reasoning through simulation-in-the-loop world modeling, without requiring any additional training. From a single RGB-D observation, SIMPACT efficiently constructs physics simulations, enabling the VLM to propose informed actions, observe simulated rollouts, and iteratively refine its reasoning. By integrating language reasoning with physics prediction, our simulation-enabled VLM can understand contact dynamics and action outcomes in a physically grounded way. Our method demonstrates state-of-the-art performance on five challenging, real-world rigid-body and deformable manipulation tasks that require fine-grained physical reasoning, outperforming existing general-purpose robotic manipulation models. Our results demonstrate that embedding physics understanding via efficient simulation into VLM reasoning at test time offers a promising path towards generalizable embodied intelligence. Project webpage can be found at https://simpact-bot.github.io

2512.04821 2026-04-01 cs.CV

LatentFM: A Latent Flow Matching Approach for Generative Medical Image Segmentation

Huynh Trinh Ngoc, Hoang Anh Nguyen Kim, Toan Nguyen Hai, Long Tran Quoc

详情
英文摘要

Generative models have achieved remarkable progress with the emergence of flow matching (FM). It has demonstrated strong generative capabilities and attracted significant attention as a simulation-free flow-based framework capable of learning exact data densities. Motivated by these advances, we propose LatentFM, a flow-based model operating in the latent space for medical image segmentation. To model the data distribution, we first design two variational autoencoders (VAEs) to encode both medical images and their corresponding masks into a lower-dimensional latent space. We then estimate a conditional velocity field that guides the flow based on the input image. By sampling multiple latent representations, our method synthesizes diverse segmentation outputs whose pixel-wise variance reliably captures the underlying data distribution, enabling both highly accurate and uncertainty-aware predictions. Furthermore, we generate confidence maps that quantify the model certainty, providing clinicians with richer information for deeper analysis. We conduct experiments on two datasets, ISIC-2018 and CVC-Clinic, and compare our method with several prior baselines, including both deterministic and generative approach models. Through comprehensive evaluations, both qualitative and quantitative results show that our approach achieves superior segmentation accuracy while remaining highly efficient in the latent space.

2512.03768 2026-04-01 cs.LG eess.SP

Deep Unfolding: Recent Developments, Theory, and Design Guidelines

Nir Shlezinger, Santiago Segarra, Yi Zhang, Dvir Avrahami, Zohar Davidov, Tirza Routtenberg, Yonina C. Eldar

Comments under review for publication in the IEEE

详情
英文摘要

Optimization methods play a central role in signal processing, serving as the mathematical foundation for inference, estimation, and control. While classical iterative optimization algorithms provide interpretability and theoretical guarantees, they often rely on surrogate objectives, require careful hyperparameter tuning, and exhibit substantial computational latency. Conversely, machine learning (ML ) offers powerful data-driven modeling capabilities but lacks the structure, transparency, and efficiency needed for optimization-driven inference. Deep unfolding has recently emerged as a compelling framework that bridges these two paradigms by systematically transforming iterative optimization algorithms into structured, trainable ML architectures. This article provides a tutorial-style overview of deep unfolding, presenting a unified perspective of methodologies for converting optimization solvers into ML models and highlighting their conceptual, theoretical, and practical implications. We review the foundations of optimization for inference and for learning, introduce four representative design paradigms for deep unfolding, and discuss the distinctive training schemes that arise from their iterative nature. Furthermore, we survey recent theoretical advances that establish convergence and generalization guarantees for unfolded optimizers, and provide comparative qualitative and empirical studies illustrating their relative trade-offs in complexity, interpretability, and robustness.

2512.03639 2026-04-01 cs.RO

Context-Triggered Contingency Games for Strategic Multi-Agent Interaction

Kilian Schweppe, Anne-Kathrin Schmuck

详情
英文摘要

We address the challenge of reliable and efficient interaction in autonomous multi-agent systems, where agents must balance long-term strategic objectives with short-term dynamic adaptation. We propose context-triggered contingency games, a novel integration of strategic games derived from temporal logic specifications with dynamic contingency games solved in real time. Our two-layered architecture leverages strategy templates to guarantee satisfaction of high-level objectives, while a new factor-graph-based solver enables scalable, real-time model predictive control of dynamic interactions. The resulting framework ensures both safety and progress in uncertain, interactive environments. We validate our approach through simulations and hardware experiments in autonomous driving and robotic navigation, demonstrating efficient, reliable, and adaptive multi-agent interaction.

2512.02172 2026-04-01 cs.CV cs.GR cs.LG

SplatSuRe: Selective Super-Resolution for Multi-view Consistent 3D Gaussian Splatting

Pranav Asthana, Alex Hanson, Allen Tu, Tom Goldstein, Matthias Zwicker, Amitabh Varshney

Comments Project Page: https://splatsure.github.io/

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

详情
英文摘要

3D Gaussian Splatting (3DGS) enables high-quality novel view synthesis, motivating interest in generating higher-resolution renders than those available during training. A natural strategy is to apply super-resolution (SR) to low-resolution (LR) input views, but independently enhancing each image introduces multi-view inconsistencies, leading to blurry renders. Prior methods attempt to mitigate these inconsistencies through learned neural components, temporally consistent video priors, or joint optimization on LR and SR views, but all uniformly apply SR across every image. In contrast, our key insight is that close-up LR views may contain high-frequency information for regions also captured in more distant views and that we can use the camera pose relative to scene geometry to inform where to add SR content. Building on this insight, we propose SplatSuRe, a method that selectively applies SR content only in undersampled regions lacking high-frequency supervision, yielding sharper and more consistent results. Across Tanks & Temples, Deep Blending, and Mip-NeRF 360, our approach surpasses baselines in both fidelity and perceptual quality. Notably, our gains are most significant in localized foreground regions where higher detail is desired.

2511.17693 2026-04-01 cs.LG cs.CL cs.CV

DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams

Ginés Carreto Picón, Peng Yuan Zhou, Qi Zhang, Alexandros Iosifidis

Comments 15 pages, 5 figures

详情
英文摘要

Transformer-based models have dramatically increased their size and parameter count to tackle increasingly complex tasks. At the same time, there is a growing demand for high performance, low-latency inference on devices with limited resources. In particular, stream data inference is typically performed over a sliding temporal window, leading to highly redundant computations. While the recent Continual Transformers started addressing this issue, they can be effectively used only in shallow models, which limits their scope and generalization power. In this paper, we propose the Deep Continual Transformer (DeepCoT), a redundancy-free encoder attention mechanism that can be applied over existing deep encoder architectures with minimal changes. In our experiments over audio, video, and text streams, we show that DeepCoTs retain comparative performance to their non-continual baselines while offering a linear computational cost for all Transformer layers, which reduces up to two orders of magnitude in the running time compared to previous efficient models.

2510.17341 2026-04-01 cs.RO

Interactive Force-Impedance Control

Fan Shao, Satoshi Endo, Sandra Hirche, Fanny Ficuciello

详情
英文摘要

Human collaboration with robots requires flexible role adaptation, enabling the robot to switch between an active leader and a passive follower. Effective role switching depends on accurately estimating human intentions, which is typically achieved through external force analysis, nominal robot dynamics, or data-driven approaches. However, these methods are primarily effective in contact-sparse environments. When robots under hybrid or unified force-impedance control physically interact with active humans or non-passive environments, the robotic system may lose passivity and thus compromise safety. To address this challenge, this paper proposes a unified Interactive Force-Impedance Control (IFIC) framework that adapts to interaction power flow, ensuring safe and effortless interaction in contact-rich environments. The proposed control architecture is formulated within a port-Hamiltonian framework, incorporating both interaction and task control ports, thereby guaranteeing autonomous system passivity. Experiments in both rigid and soft contact scenarios demonstrate that IFIC ensures stable collaboration under active human interaction, reduces contact impact forces and interaction force oscillations.

2510.14538 2026-04-01 cs.AI cs.LG

Symbol Grounding in Neuro-Symbolic AI: A Gentle Introduction to Reasoning Shortcuts

Emanuele Marconato, Samuele Bortolotti, Emile van Krieken, Paolo Morettin, Elena Umili, Antonio Vergari, Efthymia Tsamoura, Andrea Passerini, Stefano Teso

详情
英文摘要

Neuro-symbolic (NeSy) AI aims to develop deep neural networks whose predictions comply with prior knowledge encoding, e.g. safety or structural constraints. As such, it represents one of the most promising avenues for reliable and trustworthy AI. The core idea behind NeSy AI is to combine neural and symbolic steps: neural networks are typically responsible for mapping low-level inputs into high-level symbolic concepts, while symbolic reasoning infers predictions compatible with the extracted concepts and the prior knowledge. Despite their promise, it was recently shown that - whenever the concepts are not supervised directly - NeSy models can be affected by Reasoning Shortcuts (RSs). That is, they can achieve high label accuracy by grounding the concepts incorrectly. RSs can compromise the interpretability of the model's explanations, performance in out-of-distribution scenarios, and therefore reliability. At the same time, RSs are difficult to detect and prevent unless concept supervision is available, which is typically not the case. However, the literature on RSs is scattered, making it difficult for researchers and practitioners to understand and tackle this challenging problem. This overview addresses this issue by providing a gentle introduction to RSs, discussing their causes and consequences in intuitive terms. It also reviews and elucidates existing theoretical characterizations of this phenomenon. Finally, it details methods for dealing with RSs, including mitigation and awareness strategies, and maps their benefits and limitations. By reformulating advanced material in a digestible form, this overview aims to provide a unifying perspective on RSs to lower the bar to entry for tackling them. Ultimately, we hope this overview contributes to the development of reliable NeSy and trustworthy AI models.

2510.07178 2026-04-01 cs.CL

Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible

Imry Ziv, Nur Lan, Emmanuel Chemla

Comments 15 pages, 4 figures

详情
英文摘要

Are large language models (LLMs) sensitive to the distinction between humanly possible and impossible languages? This question was recently used in a broader debate on whether LLMs and humans share the same innate learning biases. Previous work has answered it in the positive by comparing LLM learning curves on existing language datasets and on "impossible" datasets derived from them via various perturbation functions. Using the same methodology, we examine this claim on a wider set of languages and impossible perturbations. We find that in most cases, GPT-2 learns each language and its impossible counterpart equally easily, in contrast to previous findings. We also apply a more lenient condition by testing whether GPT-2 provides any kind of separation between the whole sets of natural vs. impossible languages, based on cross-linguistic variance in metrics derived from the learning curves. Taken together, these perspectives show that GPT-2 provides no systematic separation between the possible and the impossible.

2510.06353 2026-04-01 cs.CV cs.AI cs.LG

TransFIRA: Transfer Learning for Face Image Recognizability Assessment

Allen Tu, Kartik Narayan, Joshua Gleason, Jennifer Xu, Matthew Meyn, Tom Goldstein, Vishal M. Patel

Comments Project Page: https://transfira.github.io/

Journal ref Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2026

详情
英文摘要

Face recognition in unconstrained environments such as surveillance, video, and web imagery must contend with extreme variation in pose, blur, illumination, and occlusion, where conventional visual quality metrics fail to predict whether inputs are truly recognizable to the deployed encoder. Existing FIQA methods typically rely on visual heuristics, curated annotations, or computationally intensive generative pipelines, leaving their predictions detached from the encoder's decision geometry. We introduce TransFIRA (Transfer Learning for Face Image Recognizability Assessment), a lightweight and annotation-free framework that grounds recognizability directly in embedding space. TransFIRA delivers three advances: (i) a definition of recognizability via class-center similarity (CCS) and class-center angular separation (CCAS), yielding the first natural, decision-boundary-aligned criterion for filtering and weighting; (ii) a recognizability-informed aggregation strategy that achieves state-of-the-art verification accuracy on BRIAR and IJB-C while nearly doubling correlation with true recognizability, all without external labels, heuristics, or backbone-specific training; and (iii) new extensions beyond faces, including encoder-grounded explainability that reveals how degradations and subject-specific factors affect recognizability, and the first method for body recognizability assessment. Experiments confirm state-of-the-art results on faces, strong performance on body recognition, and robustness under cross-dataset shifts and out-of-distribution evaluation. Together, these contributions establish TransFIRA as a unified, geometry-driven framework for recognizability assessment that is encoder-specific, accurate, interpretable, and extensible across modalities, significantly advancing FIQA in accuracy, explainability, and scope.

2510.00027 2026-04-01 cs.LG cs.AI q-bio.BM q-bio.QM

Learning Inter-Atomic Potentials without Explicit Equivariance

Ahmed A. Elhag, Arun Raja, Alex Morehead, Samuel M. Blau, Hongtao Zhao, Christian Tyrchan, Eva Nittinger, Garrett M. Morris, Michael M. Bronstein

Comments 22 pages, 7 tables, 11 figures. Under review. Changes from v2 to v3: Added results for new experiments, training models for 80 epochs on OMol25

详情
英文摘要

Accurate and scalable machine-learned inter-atomic potentials (MLIPs) are essential for molecular simulations ranging from drug discovery to new material design. Current state-of-the-art models enforce roto-translational symmetries through equivariant neural network architectures, a hard-wired inductive bias that can often lead to reduced flexibility, computational efficiency, and scalability. In this work, we introduce TransIP: Transformer-based Inter-Atomic Potentials, a novel training paradigm for interatomic potentials achieving symmetry compliance without explicit architectural constraints. Our approach guides a generic non-equivariant Transformer-based model to learn SO(3)-equivariance by optimizing its representations in the embedding space. Trained on the recent Open Molecules (OMol25) collection, a large and diverse molecular dataset built specifically for MLIPs and covering different types of molecules (including small organics, biomolecular fragments, and electrolyte-like species), TransIP attains comparable performance in machine-learning force fields versus state-of-the-art equivariant baselines. Further, compared to a data augmentation baseline, TransIP achieves 40% to 60% improvement in performance across varying OMol25 dataset sizes. More broadly, our work shows that learned equivariance can be a powerful and efficient alternative to equivariant or augmentation-based MLIP models. Our code is available at: https://github.com/Ahmed-A-A-Elhag/TransIP.

2509.24956 2026-04-01 cs.RO cs.AI cs.LG

MSG: Multi-Stream Generative Policies for Sample-Efficient Robotic Manipulation

Jan Ole von Hartz, Lukas Schweizer, Joschka Boedecker, Abhinav Valada

详情
英文摘要

Generative robot policies such as Flow Matching offer flexible, multi-modal policy learning but are sample-inefficient. Although object-centric policies improve sample efficiency, it does not resolve this limitation. In this work, we propose Multi-Stream Generative Policy (MSG), an inference-time composition framework that trains multiple object-centric policies and combines them at inference to improve generalization and sample efficiency. MSG is model-agnostic and inference-only, hence widely applicable to various generative policies and training paradigms. We perform extensive experiments both in simulation and on a real robot, demonstrating that our approach learns high-quality generative policies from as few as five demonstrations, resulting in a 95% reduction in demonstrations, and improves policy performance by 89 percent compared to single-stream approaches. Furthermore, we present comprehensive ablation studies on various composition strategies and provide practical recommendations for deployment. Finally, MSG enables zero-shot object instance transfer. We make our code publicly available at https://msg.cs.uni-freiburg.de.

2509.24745 2026-04-01 cs.CL cs.LG

ProxyAttn: Guided Sparse Attention via Representative Heads

Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang, Qingfu Zhu, Wanxiang Che

Comments ICLR 2026 camera ready

详情
英文摘要

The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their coarse-grained estimation inevitably leads to performance degradation at high sparsity rates. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves more precise block estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads, we use the scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads. Leveraging a fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss. Our code is available at https://github.com/wyxstriker/ProxyAttn.

2509.10969 2026-04-01 cs.CV

Gaze Authentication: Factors Influencing Authentication Performance

Dillon Lohr, Michael J Proulx, Mehedi Hasan Raju, Oleg V Komogortsev

Comments 21 pages, 6 figures, 10 tables

详情
英文摘要

This paper examines the key factors that influence the performance of state-of-the-art gaze-based authentication. Experiments were conducted on a large-scale, in-house dataset comprising 8,849 subjects collected with Meta Quest Pro equivalent hardware running a video oculography-driven gaze estimation pipeline at 72~Hz. State of the neural network architecture was employed to study the influence of the following factors on authentication performance: eye tracking signal quality, various aspects of eye tracking calibration, and simple filtering on estimated raw gaze. This report provides performance results and their analysis.