arXivDaily arXiv每日学术速递 周一至周五更新
重置
2603.23501 2026-03-25 cs.CV cs.AI cs.CL

MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage

Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal, Yutong Xie, Mohammad Yaqub, Muhammad Haris Khan

Comments 11 Pages

详情
英文摘要

Vision Language Models (VLMs) are increasingly used for tasks like medical report generation and visual question answering. However, fluent diagnostic text does not guarantee safe visual understanding. In clinical practice, interpretation begins with pre-diagnostic sanity checks: verifying that the input is valid to read (correct modality and anatomy, plausible viewpoint and orientation, and no obvious integrity violations). Existing benchmarks largely assume this step is solved, and therefore miss a critical failure mode: a model can produce plausible narratives even when the input is inconsistent or invalid. We introduce MedObvious, a 1,880-task benchmark that isolates input validation as a set-level consistency capability over small multi-panel image sets: the model must identify whether any panel violates expected coherence. MedObvious spans five progressive tiers, from basic orientation/modality mismatches to clinically motivated anatomy/viewpoint verification and triage-style cues, and includes five evaluation formats to test robustness across interfaces. Evaluating 17 different VLMs, we find that sanity checking remains unreliable: several models hallucinate anomalies on normal (negative-control) inputs, performance degrades when scaling to larger image sets, and measured accuracy varies substantially between multiple-choice and open-ended settings. These results show that pre-diagnostic verification remains unsolved for medical VLMs and should be treated as a distinct, safety-critical capability before deployment.

2603.23500 2026-03-25 cs.CV

UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation

Jie Liu, Zilyu Ye, Linxiao Yuan, Shenhan Zhu, Yu Gao, Jie Wu, Kunchang Li, Xionghui Wang, Xiaonan Nie, Weilin Huang, Wanli Ouyang

详情
英文摘要

Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed by image synthesis. Formulating this multimodal generation process as a Markov Decision Process with sparse terminal rewards, we introduce UniGRPO to jointly optimize text and image generation policies using GRPO. Adopting a minimalist methodology to avoid over-design, we leverage established training recipes for both modalities by seamlessly integrating standard GRPO for reasoning and FlowGRPO for visual synthesis. To ensure scalability to multi-round interleaved generation, we introduce two critical modifications to the original FlowGRPO: (1) eliminating classifier-free guidance to maintain linear, unbranched rollouts, which is essential for scaling to complex scenarios involving multi-turn interactions and multi-condition generation (e.g., editing); and (2) replacing the standard latent KL penalty with an MSE penalty directly on the velocity fields, providing a more robust and direct regularization signal to mitigate reward hacking effectively. Our experiments demonstrate that this unified training recipe significantly enhances image generation quality through reasoning, providing a robust and scalable baseline for the future post-training of fully interleaved models.

2603.23499 2026-03-25 cs.CV

DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Jaewon Min, Jaeeun Lee, Yeji Choi, Paul Hyunbin Cho, Jin Hyeon Kim, Tae-Young Lee, Jongsik Ahn, Hwayeong Lee, Seonghyun Park, Seungryong Kim

Comments Project page: https://cvlab-kaist.github.io/DA-Flow

详情
英文摘要

Optical flow models trained on high-quality data often degrade severely when confronted with real-world corruptions such as blur, noise, and compression artifacts. To overcome this limitation, we formulate Degradation-Aware Optical Flow, a new task targeting accurate dense correspondence estimation from real-world corrupted videos. Our key insight is that the intermediate representations of image restoration diffusion models are inherently corruption-aware but lack temporal awareness. To address this limitation, we lift the model to attend across adjacent frames via full spatio-temporal attention, and empirically demonstrate that the resulting features exhibit zero-shot correspondence capabilities. Based on this finding, we present DA-Flow, a hybrid architecture that fuses these diffusion features with convolutional features within an iterative refinement framework. DA-Flow substantially outperforms existing optical flow methods under severe degradation across multiple benchmarks.

2603.23497 2026-03-25 cs.CV

WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG

Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang

详情
英文摘要

Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.

2603.23496 2026-03-25 cs.LG

Estimating Flow Velocity and Vehicle Angle-of-Attack from Non-invasive Piezoelectric Structural Measurements Using Deep Learning

Chandler B. Smith, S. Hales Swift, Andrew Steyer, Ihab El-Kady

详情
英文摘要

Accurate estimation of aerodynamic state variables such as freestream velocity and angle of attack (AoA) is important for aerodynamic load prediction, flight control, and model validation. This work presents a non-intrusive method for estimating vehicle velocity and AoA from structural vibration measurements rather than direct flow instrumentation such as pitot tubes. A dense array of piezoelectric sensors mounted on the interior skin of an aeroshell capture vibrations induced by turbulent boundary layer pressure fluctuations, and a convolutional neural network (CNN) is trained to invert these structural responses to recover velocity and AoA. Proof-of-concept is demonstrated through controlled experiments in Sandia's hypersonic wind tunnel spanning zero and nonzero AoA configurations, Mach~5 and Mach~8 conditions, and both constant and continuously varying tunnel operations. The CNN is trained and evaluated using data from 16 wind tunnel runs, with a temporally centered held-out interval within each run used to form training, validation, and test datasets and assess intra-run temporal generalization. Raw CNN predictions exhibit increased variance during continuously varying conditions; a short-window moving-median post-processing step suppresses this variance and improves robustness. After post-processing, the method achieves a mean velocity error relative to the low-pass filtered reference velocity below 2.27~m/s (0.21\%) and a mean AoA error of $0.44^{\circ} (8.25\%)$ on held-out test data from the same experimental campaign, demonstrating feasibility of vibration-based velocity and AoA estimation in a controlled laboratory environment.

2603.23495 2026-03-25 cs.CV cs.AI cs.LG

VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions

Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Yassine Ouali, Georgios Tzimiropoulos

Comments Accepted at CVPR 2026

详情
英文摘要

Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing the image, VISOR improves efficiency by sparsifying the interaction between image and text tokens. Specifically, the language model attends to the full set of high-resolution visual tokens through a small, strategically placed set of attention layers: general visual context is provided by efficient cross-attention between text-image, while a few well-placed and dynamically selected self-attention layers refine the visual representations themselves, enabling complex, high-resolution reasoning when needed. Based on this principle, we first train a single universal network on a range of computational budgets by varying the number of self-attention layers, and then introduce a lightweight policy mechanism that dynamically allocates visual computation based on per-sample complexity. Extensive experiments show that VISOR drastically reduces computational cost while matching or exceeding state-of-the-art results across a diverse suite of benchmarks, and excels in challenging tasks that require detailed visual understanding.

2603.23491 2026-03-25 cs.CV

Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

Brian Chao, Lior Yariv, Howard Xiao, Gordon Wetzstein

Comments Project website at https://bchao1.github.io/foveated-diffusion

详情
英文摘要

Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of the generation process in settings where the user's gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field. Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.

2603.23490 2026-03-25 cs.CG cs.DS

Dynamic Light Spanners in Doubling Metrics

Sujoy Bhore, Jonathan Conroy, Arnold Filtser

详情
英文摘要

A $t$-spanner of a point set $X$ in a metric space $(\mathcal{X}, δ)$ is a graph $G$ with vertex set $P$ such that, for any pair of points $u,v \in X$, the distance between $u$ and $v$ in $G$ is at most $t$ times $δ(u,v)$. We study the problem of maintaining a spanner for a dynamic point set $X$ -- that is, when $X$ undergoes a sequence of insertions and deletions -- in a metric space of constant doubling dimension. For any constant $\varepsilon>0$, we maintain a $(1+\varepsilon)$-spanner of $P$ whose total weight remains within a constant factor of the weight of the minimum spanning tree of $X$. Each update (insertion or deletion) can be performed in $\operatorname{poly}(\log Φ)$ time, where $Φ$ denotes the aspect ratio of $X$. Prior to our work, no efficient dynamic algorithm for maintaining a light-weight spanner was known even for point sets in low-dimensional Euclidean space.

2603.23489 2026-03-25 cs.CV

AgentRVOS: Reasoning over Object Tracks for Zero-Shot Referring Video Object Segmentation

Woojeong Jin, Jaeho Lee, Heeseong Shin, Seungho Jang, Junhwan Heo, Seungryong Kim

详情
英文摘要

Referring Video Object Segmentation (RVOS) aims to segment a target object throughout a video given a natural language query. Training-free methods for this task follow a common pipeline: a MLLM selects keyframes, grounds the referred object within those frames, and a video segmentation model propagates the results. While intuitive, this design asks the MLLM to make temporal decisions before any object-level evidence is available, limiting both reasoning quality and spatio-temporal coverage. To overcome this, we propose AgentRVOS, a training-free agentic pipeline built on the complementary strengths of SAM3 and a MLLM. Given a concept derived from the query, SAM3 provides reliable perception over the full spatio-temporal extent through generated mask tracks. The MLLM then identifies the target through query-grounded reasoning over this object-level evidence, iteratively pruning guided by SAM3's temporal existence information. Extensive experiments show that AgentRVOS achieves state-of-the-art performance among training-free methods across multiple benchmarks, with consistent results across diverse MLLM backbones. Our project page is available at: https://cvlab-kaist.github.io/AgentRVOS/.

2603.23487 2026-03-25 cs.CV

TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation

Jini Yang, Eunbeen Hong, Soowon Son, Hyunkoo Lee, Sunghwan Hong, Sunok Kim, Seungryong Kim

详情
英文摘要

Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only $\sim$25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.

2603.23483 2026-03-25 cs.CV cs.CL

SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji, Jiebo Luo

Comments Code: https://github.com/MAC-AutoML/SpecEyes

详情
英文摘要

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and seriously limits system-level concurrency. To this end, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.

2603.23482 2026-03-25 cs.SE cs.AI

ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains

Muhammad Khalid, Manuel Oriol, Yilmaz Uygun

Comments 17 pages, 6 figures, 7 tables. Accepted at VerifAI-2026 Workshop, co-located with ETAPS 2026

详情
英文摘要

Requirements engineering is a vital, yet labor-intensive, stage in the software development process. This article introduces ReqFusion: an AI-enhanced system that automates the extraction, classification, and analysis of software requirements utilizing multiple Large Language Model (LLM) providers. The architecture of ReqFusion integrates OpenAI GPT, Anthropic Claude, and Groq models to extract functional and non-functional requirements from various documentation formats (PDF, DOCX, and PPTX) in academic, industrial, and tender proposal contexts. The system uses a domain-independent extraction method and generates requirements following the Project, Environment, Goal, and System (PEGS) approach introduced by Bertrand Meyer. The main idea is that, because the PEGS format is detailed, LLMs have more information and cues about the requirements, producing better results than a simple generic request. An ablation study confirms this hypothesis: PEGS-guided prompting achieves an F1 score of 0.88, compared to 0.71 for generic prompting under the same multi-provider configuration. The evaluation used 18 real-world documents to generate 226 requirements through automated classification, with 54.9% functional and 45.1% nonfunctional across academic, business, and technical domains. An extended evaluation on five projects with 1,050 requirements demonstrated significant improvements in extraction accuracy and a 78% reduction in analysis time compared to manual methods. The multi-provider architecture enhances reliability through model consensus and fallback mechanisms, while the PEGS-based approach ensures comprehensive coverage of all requirement categories.

2603.23481 2026-03-25 cs.RO cs.AI cs.CV cs.LG

VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs

Haoran Yuan, Weigang Yi, Zhenyu Zhang, Wendi Chen, Yuchen Mo, Jiashi Yin, Xinzhuo Li, Xiangyu Zeng, Chuan Wen, Cewu Lu, Katherine Driggs-Campbell, Ismini Lourentzou

Comments https://plan-lab.github.io/projects/vtam/

详情
英文摘要

Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions are not reliably encoded in visual tokens, leading to unstable or imprecise behaviors. To bridge this gap, we introduce the Video-Tactile Action Model (VTAM), a multimodal world modeling framework that incorporates tactile perception as a complementary grounding signal. VTAM augments a pretrained video transformer with tactile streams via a lightweight modality transfer finetuning, enabling efficient cross-modal representation learning without tactile-language paired data or independent tactile pretraining. To stabilize multimodal fusion, we introduce a tactile regularization loss that enforces balanced cross-modal attention, preventing visual latent dominance in the action model. VTAM demonstrates superior performance in contact-rich manipulation, maintaining a robust success rate of 90 percent on average. In challenging scenarios such as potato chip pick-and-place requiring high-fidelity force awareness, VTAM outperforms the pi 0.5 baseline by 80 percent. Our findings demonstrate that integrating tactile feedback is essential for correcting visual estimation errors in world action models, providing a scalable approach to physically grounded embodied foundation models.

2603.23480 2026-03-25 cs.CE

Stablecoins as Dry Powder: A Copula-Based Risk Analysis of Cryptocurrency Markets

Elliot Jones, Toshiko Matsui, William Knottenbelt

详情
英文摘要

Stablecoins serve as the fundamental infrastructure for Decentralised Finance (DeFi), acting as the primary bridge between fiat currencies and the digital asset ecosystem. While peg stability is well-documented, the structural role stablecoins play in transmitting systemic risk to the broader market remains under-explored. This study uses copula-based approaches to quantify the transmission of volatility and activity from stablecoin to cryptocurrency markets. We demonstrate in-sample causality across daily, weekly, and monthly horizons. Furthermore, we show that incorporating stablecoin factors significantly reduces Mean Squared Error in cryptocurrency forecasting. Specifically, we link stablecoin volume and upside volatility to broader market volatility, indicating its role as dry powder. Finally, we establish economic value by demonstrating reduced risk in a cryptocurrency volatility targeting model when stablecoin factors are employed.

2603.23478 2026-03-25 cs.CV

UniFunc3D: Unified Active Spatial-Temporal Grounding for 3D Functionality Segmentation

Jiaying Lin, Dan Xu

详情
英文摘要

Functionality segmentation in 3D scenes requires an agent to ground implicit natural-language instructions into precise masks of fine-grained interactive elements. Existing methods rely on fragmented pipelines that suffer from visual blindness during initial task parsing. We observe that these methods are limited by single-scale, passive and heuristic frame selection. We present UniFunc3D, a unified and training-free framework that treats the multimodal large language model as an active observer. By consolidating semantic, temporal, and spatial reasoning into a single forward pass, UniFunc3D performs joint reasoning to ground task decomposition in direct visual evidence. Our approach introduces active spatial-temporal grounding with a coarse-to-fine strategy. This allows the model to select correct video frames adaptively and focus on high-detail interactive parts while preserving the global context necessary for disambiguation. On SceneFun3D, UniFunc3D achieves state-of-the-art performance, surpassing both training-free and training-based methods by a large margin with a relative 59.9\% mIoU improvement, without any task-specific training. Code will be released on our project page: https://jiaying.link/unifunc3d.

2603.23476 2026-03-25 cs.IT cs.NI cs.SY eess.SY math.IT

Index-Based Scheduling for a Resource-Constrained Quantum Switch

Subhankar Banerjee, Stavros Mitrolaris, Sennur Ulukus

详情
英文摘要

We consider a quantum switch with a finite number of quantum memory registers that aims to serve multipartite entanglement requests among $N$ users. We propose scheduling policies that aim to optimize the average number of requests served per unit time by efficiently utilizing the switch's available memory. To measure the performance of the scheduling policies, we employ the newly introduced metric of age of entanglement establishment (AoEE). We formulate the scheduling problem in a restless multi-armed bandit (RMAB) framework. We show that the scheduling of entanglement requests is indexable. Subsequently, we find a closed-form expression of the Whittle index for all possible request-age pairs. By modeling the Whittle index of each request as its reward and its cardinality as its cost, we formulate the memory-constrained scheduling problem as a $0$-$1$ knapsack problem and solve it via dynamic programming. Furthermore, we consider two low-complexity sequential greedy policies that leverage two different modified Whittle indices.

2603.23475 2026-03-25 eess.SY cs.SY physics.app-ph

Bridging the numerical-physical gap in acoustic holography via end-to-end differentiable structural optimization

Moon Hwan Lee, Mohd. Afzal Khan, Akm Ashiquzzaman, Eunbin Lee, Jonghun Lee, Euiheon Chung, Hyuk-Sang Kwon, Jae Youn Hwang

详情
英文摘要

Acoustic holography provides a practical means of flexibly controlling acoustic wavefronts. However, high-fidelity shaping of acoustic fields remains constrained by the numerical-physical gap inherent in conventional phase-only designs. These approaches realize a two-dimensional phase-delay profile as a three-dimensional thickness-varying lens, while neglecting wave-matter interactions arising from the lens structure. Here, we introduce an end-to-end, physics-aware differentiable structural optimization framework that directly incorporates three-dimensional lens geometries into the acoustic simulation and optimization loop. Using a novel differentiable relaxation, termed Differentiable Hologram Lens Approximation (DHLA), the lens geometry is treated as a differentiable design variable, ensuring intrinsic consistency between numerical design and physical realization. The resulting Thickness-Only Acoustic Holograms (TOAHs) significantly outperform state-of-the-art phase-only acoustic holograms (POAHs) in field reconstruction fidelity and precision under complex conditions. We further demonstrate the application of the framework to spatially selective neuromodulation in a neuropathic pain mouse model, highlighting its potential for non-invasive transcranial neuromodulation. In summary, by reconciling numerical design with physical realization, this work establishes a robust strategy for high-fidelity acoustic wavefront shaping in complex environments.

2603.23474 2026-03-25 cs.CY

Evidence of political bias in search engines and language models before major elections

Íris Damião, Paulo Almeida, João Franco, Nuno Santos, Pedro C. Magalhães, Joana Gonçalves-Sá

Comments 20 pages, 4 figures; Supplementary Information : Page 22 - 74

详情
英文摘要

Search engines (SEs) and large language models (LLMs) are central to political information access, yet their algorithmic decisions and potential underlying biases remain underexplored. We developed a standardized, privacy-preserving, bot-and-proxy methodology to audit four SEs and two LLMs before the 2024 European Parliament and US presidential elections. We collected answers to approximately 4,360 queries related to elections in five EU countries and 15 US counties, identified political entities and topics in those answers, and mapped them to ideological positions (EU) or issue associations (US). In Europe, SE results disproportionately mentioned far-right entities beyond levels expected from polls, past elections, or media salience. In the US, Google strongly favored topics more important to Republican voters, while other search engines favored issues more relevant to Democrats. LLMs responses were more balanced, although there is evidence of overrepresentation of far-right (and Green) entities. These results show evidence of bias and open important discussions on how even small skews in widely used platforms may influence democratic processes, calling for systematic audits of their outputs.

2603.23471 2026-03-25 cs.CY

Regulating AI Agents

Kathrin Gardhouse, Amin Oueslati, Noam Kolt

详情
英文摘要

AI agents -- systems that can independently take actions to pursue complex goals with only limited human oversight -- have entered the mainstream. These systems are now being widely used to produce software, conduct business activities, and automate everyday personal tasks. While AI agents implicate many areas of law, ranging from agency law and contracts to tort liability and labor law, they present particularly pressing questions for the most globally consequential AI regulation: the European Union's AI Act. Promulgated prior to the development and widespread use of AI agents, the EU AI Act faces significant obstacles in confronting the governance challenges arising from this transformative technology, such as performance failures in autonomous task execution, the risk of misuse of agents by malicious actors, and unequal access to the economic opportunities afforded by AI agents. We systematically analyze the EU AI Act's response to these challenges, focusing on both the substantive provisions of the regulation and, crucially, the institutional frameworks that aim to support its implementation. Our analysis of the Act's allocation of monitoring and enforcement responsibilities, reliance on industry self-regulation, and level of government resourcing illustrates how a regulatory framework designed for conventional AI systems can be ill-suited to AI agents. Taken together, our findings suggest that policymakers in the EU and beyond will need to change course, and soon, if they are to effectively govern the next generation of AI technology.

2603.23470 2026-03-25 cs.SE

ConceptCoder: Improve Code Reasoning via Concept Learning

Md Mahbubur Rahman, Hengbo Tong, Wei Le

详情
英文摘要

Large language models (LLMs) have shown promising results for software engineering applications, but still struggle with code reasoning tasks such as vulnerability detection (VD). We introduce ConceptCoder, a fine-tuning method that simulates human code inspection: models are trained to first recognize code concepts and then perform reasoning on top of these concepts. In prior work, concepts are extracted by multimodal models or LLMs to explain vision and natural language models. Our work is the first to formulate concepts for code. We define code concepts as human-understandable semantic properties of code and train models to learn such concepts. Our evaluation shows that this approach significantly improves VD accuracy, from 66.32 to 72.15 F1 on average over 9 open-source LLMs. ConceptCoder achieves the best VD performance compared to state-of-the-art (SOTA) baselines, including fine-tuned SOTA open-source LLMs and prompted proprietary models such as GPT-5.2 and Claude-Opus-4.5. Our approach also scales: concepts defined from four types of vulnerabilities benefit general vulnerability datasets with 134 CWEs. We further demonstrate that concept-based fine-tuning generalizes beyond VD and improves branch prediction. We release our code and datasets at https://figshare.com/s/1decab8232c653b44f71.

2603.23465 2026-03-25 eess.SY cs.SY

Statistical Efficiency of Single- and Multi-step Models for Forecasting and Control

Anne Somalwar, Bruce D. Lee, George J. Pappas, Nikolai Matni

Comments arXiv admin note: substantial text overlap with arXiv:2504.01766

详情
英文摘要

Compounding error, where small prediction mistakes accumulate over time, presents a major challenge in learning-based control. A common remedy is to train multi-step predictors directly instead of rolling out single-step models. However, it is unclear when the benefits of multi-step predictors outweigh the difficulty of learning a more complex model. We provide the first quantitative analysis of this trade-off for linear dynamical systems. We study three predictor classes: (i) single step models, (ii) multi-step models, and (iii) single step models trained with multi-step losses. We show that when the model class is well-specified and accurately captures the system dynamics, single-step models achieve the lowest asymptotic prediction error. On the other hand, when the model class is misspecified due to partial observability, direct multi-step predictors can significantly reduce bias and improve accuracy. We provide theoretical and empirical evidence that these trade-offs persist when predictors are used in closed-loop control.

2603.23463 2026-03-25 cs.CV cs.AI

InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting

Duc Vu, Kien Nguyen, Trong-Tung Nguyen, Ngan Nguyen, Phong Nguyen, Khoi Nguyen, Cuong Pham, Anh Tran

Comments Accepted to CVPR'26 (Main Conference)

详情
英文摘要

Recent diffusion-based models achieve photorealism in image inpainting but require many sampling steps, limiting practical use. Few-step text-to-image models offer faster generation, but naively applying them to inpainting yields poor harmonization and artifacts between the background and inpainted region. We trace this cause to random Gaussian noise initialization, which under low function evaluations causes semantic misalignment and reduced fidelity. To overcome this, we propose InverFill, a one-step inversion method tailored for inpainting that injects semantic information from the input masked image into the initial noise, enabling high-fidelity few-step inpainting. Instead of training inpainting models, InverFill leverages few-step text-to-image models in a blended sampling pipeline with semantically aligned noise as input, significantly improving vanilla blended sampling and even matching specialized inpainting models at low NFEs. Moreover, InverFill does not require real-image supervision and only adds minimal inference overhead. Extensive experiments show that InverFill consistently boosts baseline few-step models, improving image quality and text coherence without costly retraining or heavy iterative optimization.

2603.23462 2026-03-25 cs.CV

RealMaster: Lifting Rendered Scenes into Photorealistic Video

Dana Cohen-Bar, Ido Sobol, Raphael Bensadoun, Shelly Sheynin, Oran Gafni, Or Patashnik, Daniel Cohen-Or, Amit Zohar

Comments Project page: https://danacohen95.github.io/RealMaster/

详情
英文摘要

State-of-the-art video generation models produce remarkable photorealism, but they lack the precise control required to align generated content with specific scene requirements. Furthermore, without an underlying explicit geometry, these models cannot guarantee 3D consistency. Conversely, 3D engines offer granular control over every scene element and provide native 3D consistency by design, yet their output often remains trapped in the "uncanny valley". Bridging this sim-to-real gap requires both structural precision, where the output must exactly preserve the geometry and dynamics of the input, and global semantic transformation, where materials, lighting, and textures must be holistically transformed to achieve photorealism. We present RealMaster, a method that leverages video diffusion models to lift rendered video into photorealistic video while maintaining full alignment with the output of the 3D engine. To train this model, we generate a paired dataset via an anchor-based propagation strategy, where the first and last frames are enhanced for realism and propagated across the intermediate frames using geometric conditioning cues. We then train an IC-LoRA on these paired videos to distill the high-quality outputs of the pipeline into a model that generalizes beyond the pipeline's constraints, handling objects and characters that appear mid-sequence and enabling inference without requiring anchor frames. Evaluated on complex GTA-V sequences, RealMaster significantly outperforms existing video editing baselines, improving photorealism while preserving the geometry, dynamics, and identity specified by the original 3D control.

2603.23461 2026-03-25 cs.LG

End-to-End Efficient RL for Linear Bellman Complete MDPs with Deterministic Transitions

Zakaria Mhammedi, Alexander Rakhlin, Nneka Okolo

详情
英文摘要

We study reinforcement learning (RL) with linear function approximation in Markov Decision Processes (MDPs) satisfying \emph{linear Bellman completeness} -- a fundamental setting where the Bellman backup of any linear value function remains linear. While statistically tractable, prior computationally efficient algorithms are either limited to small action spaces or require strong oracle assumptions over the feature space. We provide a computationally efficient algorithm for linear Bellman complete MDPs with \emph{deterministic transitions}, stochastic initial states, and stochastic rewards. For finite action spaces, our algorithm is end-to-end efficient; for large or infinite action spaces, we require only a standard argmax oracle over actions. Our algorithm learns an $\varepsilon$-optimal policy with sample and computational complexity polynomial in the horizon, feature dimension, and $1/\varepsilon$.

2603.23458 2026-03-25 cs.GT cs.DC

SNARE: A TRAP for Rational Players to Solve Byzantine Consensus in the 5f+1 Model

Alejandro Ranchal-Pedrosa, Benjamin Marsh

Comments WIP

详情
英文摘要

The TRAP protocol solves rational agreement by combining accountable consensus with a one-shot BFTCR finalization phase. We present SNARE (Scalable Nash Agreement via Reward and Exclusion), the adaptation of TRAP to $n=5f{+}1$, and prove $ε$-$(k,t)$-robustness for rational agreement tolerating coalitions up to ${\approx}73\%$ with deposits under $0.5\%$ of the gain. A central finding is that appending a single all-to-all broadcast round with the $4f{+}1$ threshold after predecisions yields $ε$-$(k,t)$-robustness for coalitions up to $3f$ (${\approx}60\%$) without any deposit: we need not model or know the utility function of deviating players, only that they participate in the protocol. These players can be \emph{deceitful} (arbitrary unknown utility), not just rational, and the finalization structure prevents disagreement regardless of their motivation. This observation is protocol-agnostic, applies to any $5f{+}1$ protocol at the cost of one message delay that runs concurrently with the next view, and does not require commit-reveal mechanisms. Above $60\%$, the full baiting mechanism with deposits under $0.5\%$ extends tolerance to ${\approx}73\%$. A second finding is that valid-candidacy, the property preventing reward front-running, holds unconditionally regardless of the quorum threshold, removing both the $n>2(k{+}t)$ and $n>\frac{3}{2}k{+}3t$ constraints from the original TRAP. This retroactively extends the $3f{+}1$ bound from $C<n/2$ to $C<5n/9$. The binding constraint in both models is the winner consensus operating on $2f$ residual players after excluding $3f{+}1$ detected equivocators. We explore avenues for relaxing this limit.

2603.23455 2026-03-25 cs.CV

DetPO: In-Context Learning with Multi-Modal LLMs for Few-Shot Object Detection

Gautam Rajendrakumar Gare, Neehar Peri, Matvei Popov, Shruti Jain, John Galeotti, Deva Ramanan

Comments Project Page: https://ggare-cmu.github.io/DetPO/

详情
英文摘要

Multi-Modal LLMs (MLLMs) demonstrate strong visual grounding capabilities on popular object detection benchmarks like OdinW-13 and RefCOCO. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. While in-context prompting is a common strategy to improve performance across diverse tasks, we find that it often yields lower detection accuracy than prompting with class names alone. This suggests that current MLLMs cannot yet effectively leverage few-shot visual examples and rich textual descriptions for object detection. Since frontier MLLMs are typically only accessible via APIs, and state-of-the-art open-weights models are prohibitively expensive to fine-tune on consumer-grade hardware, we instead explore black-box prompt optimization for few-shot object detection. To this end, we propose Detection Prompt Optimization (DetPO), a gradient-free test-time optimization approach that refines text-only prompts by maximizing detection accuracy on few-shot visual training examples while calibrating prediction confidence. Our proposed approach yields consistent improvements across generalist MLLMs on Roboflow20-VL and LVIS, outperforming prior black-box approaches by up to 9.7%. Our code is available at https://github.com/ggare-cmu/DetPO

2603.23450 2026-03-25 eess.SY cs.SY

Information-Driven Active Perception for k-step Predictive Safety Monitoring

Sumukha Udupa, Jie Fu

Comments 6 pages, 6 figures, 1 table, submitted to IEEE L-CSS

详情
英文摘要

This work studies the synthesis of active perception policies for predictive safety monitoring in partially observable stochastic systems. Operating under strict sensing and communication budgets, the proposed monitor dynamically schedules sensor queries to maximize information gain about the safety of future states. The underlying stochastic dynamics are captured by a labeled hidden Markov model (HMM), with safety requirements defined by a deterministic finite automaton (DFA). To enable active information acquisition, we introduce minimizing k-step Shannon conditional entropy of the safety of future states as a planning objective, under the constraint of a limited sensor query budget. Using observable operators, we derive an efficient algorithm to compute the k-step conditional entropy and analyze key properties of the conditional entropy gradient with respect to policy parameters. We validate the effectiveness of the method for predictive safety monitoring through a dynamic congestion game example.

2603.23447 2026-03-25 cs.CV cs.AI

3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

Yiping Chen, Jinpeng Li, Wenyu Ke, Yang Luo, Jie Ouyang, Zhongjie He, Li Liu, Hongchao Fan, Hao Wu

Comments 24 pages, 11 figures, 12 tables

详情
英文摘要

While multi-modality large language models excel in object-centric or indoor scenarios, scaling them to 3D city-scale environments remains a formidable challenge. To bridge this gap, we propose 3DCity-LLM, a unified framework designed for 3D city-scale vision-language perception and understanding. 3DCity-LLM employs a coarse-to-fine feature encoding strategy comprising three parallel branches for target object, inter-object relationship, and global scene. To facilitate large-scale training, we introduce 3DCity-LLM-1.2M dataset that comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. This strictly quality-controlled dataset integrates explicit 3D numerical information and diverse user-oriented simulations, enriching the question-answering diversity and realism of urban scenarios. Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods. Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. The source code and dataset are available at https://github.com/SYSU-3DSTAILab/3D-City-LLM.

2603.23445 2026-03-25 cs.HC cs.MM

MRATTS: An MR-Based Acupoint Therapy Training System with Real-Time Acupoint Detection and Evaluation Standards

Jiacheng Liu, Bohan Chen, Qian Wang, Weichao Song, Fangfei Ye, Liang Zhou, Haibin Ling, Bingyao Huang

详情
英文摘要

Acupoint therapy is a core therapeutic method of Traditional Chinese Medicine (TCM), and it requires a high level of expertise and skills to detect acupoints and perform acupuncture and moxibustion. Existing mixed reality (MR)-based training methods often fall short in accurate real-time detection and visualization of acupoints on the hand, limb, or torso of a real person and do not support various techniques of acupuncture and moxibustion. Moreover, evaluation standards and visual guidance with fine details for each step during MR-based training are typically missing. To this end, we propose the MR-based TCM Acupoint Therapy Teaching System (MRATTS)--an MR-based acupoint therapy teaching and training framework. MRATTS is based on a real-time hand, limb, and torso acupoint detection method to accurately track and visualize acupoints on real patients through MR. On top of that, in collaboration with an experienced acupoint therapist, we design a practice method with interactive visual guidance for various acupoint therapy techniques that simulate acupressure, acupuncture (insertion, lifting-thrusting, and twisting), and moxibustion (mild, sparrow-pecking, and whirling). A set of TCM theory-based evaluation standards is formulated within MRATTS to enable the scoring and visualization of the accuracy and proficiency of acupoint therapy. The effectiveness and usefulness of MRATTS are evaluated through a controlled user study and expert feedback. Results of the study indicate that the MRATTS group shows clear improvements in understanding 3D locations of acupoints and proficiency in acupoint therapy compared to control groups.

2603.23443 2026-03-25 cs.SE cs.AI

Evaluating LLM-Based Test Generation Under Software Evolution

Sabaat Haroon, Mohammad Taha Khan, Muhammad Ali Gulzar

Comments 10 pages, 9 figures, 2 tables

详情
英文摘要

Large Language Models (LLMs) are increasingly used for automated unit test generation. However, it remains unclear whether these tests reflect genuine reasoning about program behavior or simply reproduce superficial patterns learned during training. If the latter dominates, LLM-generated tests may exhibit weaknesses such as reduced coverage, missed regressions, and undetected faults. Understanding how LLMs generate tests and how those tests respond to code evolution is therefore essential. We present a large-scale empirical study of LLM-based test generation under program changes. Using an automated mutation-driven framework, we analyze how generated tests react to semantic-altering changes (SAC) and semantic-preserving changes (SPC) across eight LLMs and 22,374 program variants. LLMs achieve strong baseline results, reaching 79% line coverage and 76% branch coverage with fully passing test suites on the original programs. However, performance degrades as programs evolve. Under SACs, the pass rate of newly generated tests drops to 66%, and branch coverage declines to 60%. More than 99% of failing SAC tests pass on the original program while executing the modified region, indicating residual alignment with the original behavior rather than adaptation to updated semantics. Performance also declines under SPCs despite unchanged functionality: pass rates fall to 79% and branch coverage to 69%. Although SPC edits preserve semantics, they often introduce larger syntactic changes, leading to instability in generated test suites. Models generate more new tests while discarding many baseline tests, suggesting sensitivity to lexical changes rather than true semantic impact. Overall, our results indicate that current LLM-based test generation relies heavily on surface-level cues and struggles to maintain regression awareness as programs evolve.