arXivDaily: arXiv daily academic digest, updated Monday through Friday
2507.11241 2026-04-01 cs.RO

Comparison of Localization Algorithms between Reduced-Scale and Real-Sized Vehicles Using Visual and Inertial Sensors

Tobias Kern, Leon Tolksdorf, Christian Birkner

Abstract

Physically reduced-scale vehicles are emerging to accelerate the development of advanced automated driving functions. In this paper, we investigate the effects of scaling on self-localization accuracy with visual and visual-inertial algorithms using cameras and an inertial measurement unit (IMU). For this purpose, ROS2-compatible visual and visual-inertial algorithms are selected, and datasets are chosen as a baseline for real-sized vehicles. A test drive is conducted to record data of reduced-scale vehicles. We compare the selected localization algorithms, OpenVINS, VINS-Fusion, and RTAB-Map, in terms of their pose accuracy against the ground truth and against data from real-sized vehicles. When comparing the implementation of the selected localization algorithms to real-sized vehicles, OpenVINS has the lowest average localization error. Although all selected localization algorithms have overlapping error ranges, OpenVINS also performs best when applied to a reduced-scale vehicle. When reduced-scale vehicles were compared to real-sized vehicles, minor differences were found in translational vehicle motion estimation accuracy. However, no significant differences were found when comparing the estimation accuracy of rotational vehicle motion, allowing reduced-scale vehicles to be used as testing platforms for self-localization algorithms.
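One common way to quantify pose accuracy against ground truth is the root-mean-square translational error over time-aligned positions. The sketch below is only an illustration of that idea: it assumes the two trajectories are already time-synchronized and expressed in the same frame, and the function name and sample positions are hypothetical rather than taken from the paper.

```python
import math

def translational_rmse(estimated, ground_truth):
    """RMSE of the Euclidean position error between two time-aligned trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and equally long")
    # Squared Euclidean distance between each estimated and true position.
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical 2D positions in meters (illustrative values only).
gt_positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
est_positions = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.4)]
print(translational_rmse(est_positions, gt_positions))
```

A rotational counterpart would compare orientation estimates in the same way, which requires an angle-difference measure instead of a Euclidean distance.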

2507.00552 2026-04-01 cs.RO

Generation of Indoor Open Street Maps for Robot Navigation from CAD Files

Jiajie Zhang, Shenrui Wu, Xu Ma, Sören Schwertfeger

Comments 8 pages, 8 figures

Abstract

The deployment of autonomous mobile robots is predicated on the availability of environmental maps, yet conventional generation via SLAM (Simultaneous Localization and Mapping) suffers from significant limitations in time, labor, and robustness, particularly in dynamic, large-scale indoor environments where map obsolescence can lead to critical localization failures. To address these challenges, this paper presents a complete and automated system for converting architectural Computer-Aided Design (CAD) files into a hierarchical topometric OpenStreetMap (OSM) representation, tailored for robust life-long robot navigation. Our core methodology involves a multi-stage pipeline that first isolates key structural layers from the raw CAD data and then employs an AreaGraph-based topological segmentation to partition the building layout into a hierarchical graph of navigable spaces. This process yields a comprehensive and semantically rich map, further enhanced by automatically associating textual labels from the CAD source and cohesively merging multiple building floors into a unified, topologically-correct model. By leveraging the permanent structural information inherent in CAD files, our system circumvents the inefficiencies and fragility of SLAM, offering a practical and scalable solution for deploying robots in complex indoor spaces. The software is encapsulated within an intuitive Graphical User Interface (GUI) to facilitate practical use. The code and dataset are available at https://github.com/jiajiezhang7/osmAG-from-cad.

2506.21458 2026-04-01 cs.AI cs.CL cs.CV

MindCube: Spatial Mental Modeling from Limited Views

Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Jiajun Wu, Li Fei-Fei, Manling Li

Comments The latest version includes an expanded discussion of scaffolding, along with updated data statistics and experimental results

Abstract

Can Vision-Language Models (VLMs) imagine the full scene from just a few views, like humans do? Humans naturally form spatial mental models, internal representations of unseen space, to reason about layout, perspective, and motion. Our MindCube benchmark, with 21,154 questions across 3,268 images, exposes this critical gap: existing VLMs exhibit near-random performance. Using MindCube, we systematically evaluate how well VLMs build robust spatial mental models through representing positions (cognitive mapping), orientations (perspective-taking), and dynamics (mental simulation for "what-if" movements). We then explore three approaches to help approximate spatial mental models in VLMs, focusing on incorporating unseen intermediate views, natural language reasoning chains, and cognitive maps. The most significant improvement comes from a synergistic approach, "map-then-reason", that jointly trains the model to first generate a cognitive map and then reason upon it. By training models to reason over these internal maps, we boosted accuracy from 37.8% to 57.8% (+20.0%). Adding reinforcement learning pushed performance even further to 61.3% (+23.5%). Our key insight is that such scaffolding of spatial mental models, actively constructing and utilizing internal structured spatial representations with flexible reasoning processes, significantly improves understanding of unobservable space.

2506.10848 2026-04-01 cs.CL cs.AI cs.LG

Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles

Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Puyu Zeng, Yuxuan Wang, Biqing Qi, Dongrui Liu, Linfeng Zhang

Comments 11 pages, 5 figures

Abstract

Diffusion-based language models (dLLMs) have emerged as a promising alternative to traditional autoregressive LLMs by enabling parallel token generation and significantly reducing inference latency. However, existing sampling strategies for dLLMs, such as confidence-based or semi-autoregressive decoding, often suffer from static behavior, leading to suboptimal efficiency and limited flexibility. In this paper, we propose SlowFast Sampling, a novel dynamic sampling strategy that adaptively alternates between exploratory and accelerated decoding stages. Our method is guided by three golden principles: certainty principle, convergence principle, and positional principle, which govern when and where tokens can be confidently and efficiently decoded. We further integrate our strategy with dLLM-Cache to reduce redundant computation. Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63$\times$ speedup on LLaDA with minimal accuracy drop, and up to 34.22$\times$ when combined with caching. Notably, our approach outperforms strong autoregressive baselines like LLaMA3 8B in throughput, demonstrating that well-designed sampling can unlock the full potential of dLLMs for fast and high-quality generation.

2506.09919 2026-04-01 cs.CV

MetricHMSR: Metric Human Mesh and Scene Recovery from Monocular Images

Chentao Song, He Zhang, Haolei Yuan, Haozhe Lin, Jianhua Tao, Hongwen Zhang, Tao Yu

Abstract

We introduce MetricHMSR, a novel framework for recovering metric human meshes and 3D scenes from a single monocular image. Existing methods struggle to recover metric scale due to monocular scale ambiguity and weak-perspective camera assumptions. Moreover, their fully coupled feature representations make it difficult to disentangle local pose from global translation, often requiring multi-stage pipelines that introduce accumulated errors. To address these challenges, we propose MetricHMR (Metric Human Mesh Recovery), which incorporates a bounding camera ray map representation to provide explicit metric cues for human reconstruction, together with a Human Mixture-of-Experts (HumanMoE) that dynamically routes image features to specialized experts, enabling the disentangled perception of local human pose and global metric position. Leveraging the recovered metric human as a geometric anchor, we further refine monocular metric depth estimation to achieve more accurate 3D alignment between humans and scenes. Comprehensive experiments demonstrate that our method achieves state-of-the-art performance on both human mesh recovery and metric human-scene reconstruction. Project Page: https://Metaverse-AI-Lab-THU.github.io/MetricHMSR.

2506.09082 2026-04-01 cs.CV cs.AI cs.LG

AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models

Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Wei-Lun Chao

Comments Accepted by CVPR 2026. The first two authors contribute equally

Abstract

The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields VFM rankings similar to those of a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.

2506.07578 2026-04-01 cs.LG cs.AI

Denoising the Future: Top-p Distributions for Moving Through Time

Florian Andreas Marwitz, Ralf Möller, Magnus Bender, Marcel Gehrke

Comments Extended version of paper accepted at ECSQARU 2025, extended version submitted to International Journal of Approximate Reasoning

Abstract

Inference in dynamic probabilistic models is a complex task involving expensive operations. In particular, for Hidden Markov Models, the whole state space has to be enumerated for advancing in time. Even states with negligible probabilities are considered, resulting in computational inefficiency and possibly increased noise due to the propagation of unlikely probability mass. We propose to denoise the future and speed up inference by using only the top-p transitions, i.e., the most probable transitions with accumulated probability p. We show that the error introduced by using only the top-p transitions is bounded by $p$ and the so-called minimal mixing rate of the underlying model. We also show that the same bound holds when using only the top-p states, the analogous restriction applied to states instead of transitions. Moreover, in our empirical evaluation, we show that we can, when using top-p transitions, expect speedups of at least an order of magnitude, while the error in terms of total variation distance is below 0.09. Using the top-p states is slower than using the top-p transitions, since we iterate over all states in each time step, and sometimes leads empirically to a higher error. With a more sophisticated implementation, the speed-up, if any, would be really small. While top-p transitions look really promising, we cannot recommend top-p states, and we discuss why they are slower while the error does not necessarily decrease.
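The top-p transition idea can be illustrated with a small prediction step for a discrete-state model. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names, the toy transition matrix, and the choice to renormalize the belief after truncation are all hypothetical.

```python
def top_p_row(row, p):
    """Keep the most probable entries of one transition row until their mass reaches p."""
    ranked = sorted(enumerate(row), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for state, prob in ranked:
        if mass >= p:
            break
        kept.append((state, prob))
        mass += prob
    return kept

def predict_step(belief, sparse_rows, n_states):
    """Advance a belief one time step over the given (sparse) transitions, then renormalize."""
    nxt = [0.0] * n_states
    for i, b in enumerate(belief):
        if b == 0.0:
            continue  # states without mass contribute nothing
        for j, prob in sparse_rows[i]:
            nxt[j] += b * prob
    total = sum(nxt)
    return [x / total for x in nxt]

def total_variation(a, b):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))

# Toy 3-state transition matrix (illustrative values only).
T = [[0.90, 0.08, 0.02],
     [0.05, 0.90, 0.05],
     [0.02, 0.08, 0.90]]
belief = [1.0, 0.0, 0.0]

exact = predict_step(belief, [list(enumerate(r)) for r in T], 3)
approx = predict_step(belief, [top_p_row(r, 0.95) for r in T], 3)
print(total_variation(exact, approx))
```

The saving comes from the inner loop: each state only visits its kept transitions, so a model whose rows concentrate most of their mass on a few successors does far less work per time step.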

2506.06858 2026-04-01 cs.LG cs.AI

FA-INR: Adaptive Implicit Neural Representations for Interpretable Exploration of Simulation Ensembles

Ziwei Li, Yuhan Duan, Tianyu Xiong, Yi-Tang Chen, Wei-Lun Chao, Han-Wei Shen

Abstract

Surrogate models are essential for efficient exploration of large-scale ensemble simulations. Implicit neural representations (INRs) provide a compact and continuous framework for modeling spatially structured data, but they often struggle with learning complex localized structures within the scientific fields. Recent INR-based surrogates address this by augmenting INRs with explicit feature structures, but at the cost of flexibility and substantial memory overhead. In this paper, we present Feature-Adaptive INR (FA-INR), an adaptive INR-based surrogate model for high-fidelity and interpretable exploration of ensemble simulations. Instead of relying on structured feature representations, FA-INR leverages cross-attention over a learnable key-value memory bank to allocate model capacity adaptively based on the data characteristics. To further improve scalability, we introduce a coordinate-guided mixture of experts (MoE) framework that enhances both efficiency and specialization of feature representations. More importantly, the learned experts produce an interpretable partition over the simulation domain, enabling scientists to identify complex structures and perform localized parameter-space exploration. Beyond quantitative and qualitative evaluations, we also demonstrate that our learned expert specialization can reveal meaningful scientific insights and support localized sensitivity analysis.

2505.11872 2026-04-01 cs.CV

PRS-Med: Position Reasoning Segmentation in Medical Imaging

Quoc-Huy Trinh, Minh-Van Nguyen, Jun Zeng, Debesh Jha, Ulas Bagci

Abstract

Prompt-based medical image segmentation has rapidly emerged, yet existing methods rely on explicit prompts like bounding boxes and struggle to reason about the spatial relationships essential for clinical diagnosis. While general-domain models attempt complex coordinate regression, these approaches often lack the structured reliability required for medical applications. In this work, we introduce PRS-Med, a unified framework that adopts an elegant, clinical-first approach to position reasoning segmentation. By utilizing a medical vision-language model integrated with a segmentation decoder, PRS-Med mimics the structured "search patterns" used by radiologists to identify pathologies within specific anatomical zones. To support this robust reasoning, we present the Medical Position Reasoning Segmentation (PosMed) dataset, comprising 116,000 expert-validated, spatially grounded question-answer pairs across six imaging modalities. Unlike previous brittle attempts at spatial reasoning, PosMed leverages a scalable, deterministic pipeline validated by board-certified radiologists to ensure clinical accuracy. Extensive experiments demonstrate that our zone-based reasoning not only improves segmentation accuracy (mean Dice improvements up to +31.2%) but also provides a high-confidence interpretability layer that outperforms state-of-the-art complex reasoning models. By prioritizing functional reliability over unnecessary technical complexity, PRS-Med offers a practical and scalable baseline for the next generation of intelligent medical assistants.
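For readers unfamiliar with the Dice score cited above, it measures the overlap between a predicted and a reference mask as 2|A ∩ B| / (|A| + |B|). The following sketch computes it for binary masks stored as nested lists; the function name, the convention for two empty masks, and the toy masks are assumptions made for illustration.

```python
def dice(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks (nested lists of 0/1)."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(flat_pred, flat_target) if a and b)
    size = sum(1 for a in flat_pred if a) + sum(1 for b in flat_target if b)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if size == 0 else 2.0 * intersection / size

# Toy 2x2 masks (illustrative values only).
predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
print(dice(predicted, reference))
```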

2505.07899 2026-04-01 cs.CL cs.AI

On the Superimposed Noise Accumulation Problem in Sequential Knowledge Editing of Large Language Models

Ding Cao, Yuchen Cai, Yuqing Huang, Xuesong He, Rongxi Guo, Guiquan Liu, Guangzhong Sun

Abstract

Sequential knowledge editing techniques aim to continuously update knowledge in large language models at low cost, preventing models from generating outdated or incorrect information. However, existing sequential editing methods suffer from a significant decline in editing success rates after long-term editing. Through theoretical analysis and experiments, our findings reveal that as the number of edits increases, the model's output increasingly deviates from the desired target, leading to a drop in editing success rates. We refer to this issue as the superimposed noise accumulation problem. Our further analysis demonstrates that the problem is related to the erroneous activation of irrelevant knowledge and conflicts between activated knowledge. Based on this analysis, a method named DeltaEdit is proposed that reduces conflicts between knowledge through dynamic orthogonal constraint strategies. Experiments show that DeltaEdit significantly reduces superimposed noise, achieving a 16.8% improvement in editing performance over the strongest baseline.

2505.06537 2026-04-01 cs.CV cs.AI

ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images

Xianghao Kong, Qiaosong Qi, Yuanbin Wang, Biaolong Chen, Aixi Zhang, Anyi Rao

Comments CVPRW 2026

Abstract

Fashion video generation aims to synthesize temporally consistent videos from reference images of a designated character. Despite significant progress, existing diffusion-based methods only support a single reference image as input, severely limiting their capability to generate view-consistent fashion videos, especially when there are different patterns on the clothes from different perspectives. Moreover, the widely adopted motion module does not sufficiently model human body movement, leading to sub-optimal spatiotemporal consistency. To address these issues, we propose ProFashion, a fashion video generation framework leveraging multiple reference images to achieve improved view consistency and temporal coherency. To effectively leverage features from multiple reference images while maintaining a reasonable computational cost, we devise a Pose-aware Prototype Aggregator, which selects and aggregates global and fine-grained reference features according to pose information to form frame-wise prototypes, which serve as guidance in the denoising process. To further enhance motion consistency, we introduce a Flow-enhanced Prototype Instantiator, which exploits the human keypoint motion flow to guide an extra spatiotemporal attention process in the denoiser. To demonstrate the effectiveness of ProFashion, we extensively evaluate our method on the MRFashion-7K dataset we collected from the Internet. ProFashion also outperforms previous methods on the UBC Fashion dataset.

2503.15149 2026-04-01 cs.LG physics.comp-ph

Machine learning surrogate models of many-body dispersion interactions in polymer melts

Zhaoxiang Shen, Raúl I. Sosa, Jakub Lengiewicz, Alexandre Tkatchenko, Stéphane P. A. Bordas

Abstract

Accurate prediction of many-body dispersion (MBD) interactions is essential for understanding the van der Waals forces that govern the behavior of many complex molecular systems. However, the high computational cost of MBD calculations limits their direct application in large-scale simulations. In this work, we introduce a machine learning surrogate model specifically designed to predict MBD forces in polymer melts, a system that demands accurate MBD description and offers structural advantages for machine learning approaches. Our model is based on a trimmed SchNet architecture that selectively retains the most relevant atomic connections and incorporates trainable radial basis functions for geometric encoding. We validate our surrogate model on datasets from polyethylene, polypropylene, and polyvinyl chloride melts, demonstrating high predictive accuracy and robust generalization across diverse polymer systems. In addition, the model captures key physical features, such as the characteristic decay behavior of MBD interactions, providing valuable insights for optimizing cutoff strategies. Characterized by high computational efficiency, our surrogate model enables practical incorporation of MBD effects into large-scale molecular simulations.

2502.21085 2026-04-01 cs.CV

BST: Badminton Stroke-type Transformer for Skeleton-based Action Recognition in Racket Sports

Jing-Yuan Chang

Comments Accepted by CVPRW 2026 - 12th CVsports

Abstract

Badminton, known for having the fastest ball speeds among all sports, presents significant challenges to the field of computer vision, including player identification, court line detection, shuttlecock trajectory tracking, and player stroke-type classification. In this paper, we introduce a novel video clipping strategy to extract frames of each player's racket swing in a badminton broadcast match. These clipped frames are then processed by three existing models: one for Human Pose Estimation to obtain human skeletal joints, another for shuttlecock trajectory tracking, and the other for court line detection to determine player positions on the court. Leveraging these data as inputs, we propose Badminton Stroke-type Transformer (BST) to classify player stroke-types in singles. To the best of our knowledge, experimental results demonstrate that our method outperforms the previous state-of-the-art on the largest publicly available badminton video dataset (ShuttleSet), another badminton dataset (BadmintonDB), and a tennis dataset (TenniSet). These results suggest that effectively leveraging ball trajectory is a promising direction for action recognition in racket sports.

2502.20663 2026-04-01 cs.CL

Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository

Radhika Kapoor, Sang T. Truong, Nick Haber, Maria Araceli Ruiz-Primo, Benjamin W. Domingue

Abstract

Prediction of item difficulty based on its text content is of substantial interest. In this paper, we focus on the related problem of recovering IRT-based difficulty when the data originally reported item p-values (percentage of correct responses). We model this item difficulty using a repository of reading passages and student data from US standardized tests from New York and Texas for grades 3-8 spanning the years 2018-23. This repository is annotated with meta-data on (1) linguistic features of the reading items, (2) test features of the passage, and (3) context features. A penalized regression prediction model with all these features can predict item difficulty with RMSE 0.59 compared to baseline RMSE of 0.92, and with a correlation of 0.77 between true and predicted difficulty. We supplement these features with embeddings from LLMs (ModernBERT, BERT, and LLaMA), which marginally improve item difficulty prediction. When models use only item linguistic features or LLM embeddings, prediction performance is similar, which suggests that only one of these feature categories may be required. This item difficulty prediction model can be used to filter and categorize reading items and will be made publicly available for use by other stakeholders.

2502.13280 2026-04-01 cs.LG

Value Gradient Sampler: Learning Invariant Value Functions for Equivariant Diffusion Sampling

Himchan Hwang, Hyeokju Jeong, Dong Kyu Shin, Che-Sang Park, Sehee Kweon, Sangwoong Yoon, Frank Chongwoo Park

Comments AISTATS 2026. Code: https://github.com/swyoon/value-gradient-sampler/

Abstract

We propose the Value Gradient Sampler (VGS), a diffusion sampler parameterized by value functions. VGS generates samples from an unnormalized target density (i.e., energy) by evolving randomly initialized particles along the gradient of the value function. In many sampling problems where the target density exhibits invariant symmetries, value functions provide a novel approach to leveraging invariant networks for sampling by inducing an equivariant gradient flow, without requiring more complex equivariant networks. The value networks are trained via temporal difference learning, which supports off-policy training and other established reinforcement learning (RL) techniques. By combining advanced RL methods with efficient invariant networks, VGS achieves both the highest sample quality and the fastest sampling speed among our baselines on the 55-particle Lennard-Jones system.
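The claim that an invariant value function induces an equivariant gradient flow can be checked numerically: if V(Rx) = V(x) for every rotation R, then the gradient satisfies ∇V(Rx) = R∇V(x). The sketch below verifies this for a toy value that depends only on pairwise distances. The toy value, the finite-difference gradient, and all names are hypothetical illustrations and not the VGS networks or training code.

```python
import math

def value(points):
    """A toy rotation-invariant value: it depends only on pairwise distances."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += (math.dist(points[i], points[j]) - 1.0) ** 2
    return total

def gradient(f, points, eps=1e-6):
    """Central finite-difference gradient of f with respect to every coordinate."""
    grad = []
    for i, p in enumerate(points):
        g = []
        for k in range(len(p)):
            plus = [list(q) for q in points]
            minus = [list(q) for q in points]
            plus[i][k] += eps
            minus[i][k] -= eps
            g.append((f(plus) - f(minus)) / (2 * eps))
        grad.append(tuple(g))
    return grad

def rotate(points, theta):
    """Rotate every 2D vector in the list by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

# Three particles in 2D (illustrative values only).
pts = [(0.0, 0.0), (1.5, 0.2), (0.3, 2.0)]
theta = 0.7

grad_of_rotated = gradient(value, rotate(pts, theta))  # gradient evaluated at R x
rotated_grad = rotate(gradient(value, pts), theta)     # R applied to gradient at x
max_gap = max(
    abs(a - b)
    for u, v in zip(grad_of_rotated, rotated_grad)
    for a, b in zip(u, v)
)
print(max_gap)  # small, limited by finite-difference accuracy
```

Because the two quantities agree, moving particles along the gradient of an invariant scalar function commutes with rotating the whole system, which is the property the abstract relies on.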

2410.08875 2026-04-01 cs.AI cs.SI physics.soc-ph

Online design of dynamic networks

Duo Wang, Andrea Araldo, Mounim El Yacoubi

Comments 14 pages

Abstract

Designing a network (e.g., a telecommunication or transport network) is mainly done offline, in a planning phase, prior to the operation of the network. On the other hand, a massive effort has been devoted to characterizing dynamic networks, i.e., those that evolve over time. The novelty of this paper is that we introduce a method for the online design of dynamic networks. The need to do so emerges when a network needs to operate in a dynamic and stochastic environment. In this case, one may wish to build a network over time, on the fly, in order to react to the changes of the environment and to keep certain performance targets. We tackle this online design problem with a rolling horizon optimization based on Monte Carlo Tree Search. The potential of online network design is showcased for the design of a futuristic dynamic public transport network, where bus lines are constructed on the fly to better adapt to a stochastic user demand. In such a scenario, we compare our results with state-of-the-art dynamic vehicle routing problem (VRP) resolution methods, simulating requests from a New York City taxi dataset. Unlike classic VRP methods, which extend vehicle trajectories in isolation, our method enables us to build a structured network of bus lines, where complex user journeys are possible, thus increasing system performance.

2410.05493 2026-04-01 cs.LG cs.IT math.IT

An Information-Theoretic Approach to Understanding Transformers' In-Context Learning of Variable-Order Markov Chains

Ruida Zhou, Chao Tian, Suhas Diggavi

Comments AISTATS 2026

Abstract

We study transformers' in-context learning of variable-order Markov chains (VOMCs), focusing on the finite-sample accuracy as the number of in-context examples increases. Compared to fixed-order Markov chains (FOMCs), learning VOMCs is substantially more challenging due to the additional structural learning component. The problem is naturally suited to a Bayesian formulation, where the context-tree weighting (CTW) algorithm, originally developed in the information theory community for universal data compression, provides an optimal solution. Empirically, we find that single-layer transformers fail to learn VOMCs in context, whereas transformers with two or more layers can succeed, with additional layers yielding modest but noticeable improvements. In contrast to prior results on FOMCs, attention-only networks appear insufficient for VOMCs. To explain these findings, we provide explicit transformer constructions: one with $D+2$ layers that can exactly implement CTW for VOMCs of maximum order $D$, and a simplified two-layer construction that uses partial information for approximate blending, shedding light on why two-layer transformers can perform well.

2407.17829 2026-04-01 cs.CV cs.LG

Image Segmentation via Divisive Normalization: dealing with environmental diversity

Pablo Hernández-Cámara, Jorge Vila-Tomás, Paula Dauden-Oliver, Nuria Alabau-Bosque, Valero Laparra, Jesús Malo

Abstract

Autonomous driving is a challenging scenario for image segmentation due to the presence of uncontrolled environmental conditions and the potentially catastrophic consequences of failures. Previous work suggested that a biologically motivated computation, the so-called Divisive Normalization, could be useful to deal with image variability, but its effects have not been systematically studied over different data sources and environmental factors. Here we put segmentation U-nets augmented with Divisive Normalization to work far from training conditions to find where this adaptation is more critical. We categorize the scenes according to their radiance level and dynamic range (day/night), and according to their achromatic/chromatic contrasts. We also consider video game (synthetic) images to broaden the range of environments. We check the performance in the extreme percentiles of such categorization. Then, we push the limits further by artificially modifying the images in perceptually/environmentally relevant dimensions: luminance, contrasts and spectral radiance. Results show that neural networks with Divisive Normalization get better results in all the scenarios and their performance remains more stable with regard to the considered environmental factors and nature of the source. Finally, we explain the improvements in segmentation performance in two ways: (1) by quantifying the invariance of the responses that incorporate Divisive Normalization, and (2) by illustrating the adaptive nonlinearity of the different layers that depends on the local activity.
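Divisive Normalization is usually written in a canonical form where each response is divided by a constant plus a weighted pool of neighboring activity, y_i = sign(x_i)|x_i|^γ / (β + Σ_j w_ij |x_j|^γ). The sketch below implements that textbook form on a plain list of responses. It is an assumption-laden illustration: the exact layer used in the paper may differ, and the uniform pooling weights, default parameters, and names are hypothetical.

```python
import math

def divisive_normalization(x, beta=1.0, gamma=1.0, weights=None):
    """Canonical divisive normalization: y_i = sign(x_i)|x_i|^g / (beta + sum_j w_ij |x_j|^g)."""
    n = len(x)
    if weights is None:
        # Assumed default: every response is normalized by the mean pooled activity.
        weights = [[1.0 / n] * n for _ in range(n)]
    energy = [abs(v) ** gamma for v in x]
    out = []
    for i, v in enumerate(x):
        pool = sum(w * e for w, e in zip(weights[i], energy))
        out.append(math.copysign(energy[i], v) / (beta + pool))
    return out

# Illustrative responses, and the same responses under a 10x stronger input.
responses = [1.0, 2.0, -1.0]
normalized = divisive_normalization(responses)
normalized_bright = divisive_normalization([10.0 * v for v in responses])
print(normalized)
print(normalized_bright)
```

Scaling the input by a factor of ten changes the normalized output by a much smaller factor, which gives an intuition for the adaptive nonlinearity that depends on local activity.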

2407.03004 2026-04-01 cs.CL cs.AI

SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy

Meghal Dani, Muthu Jeyanthi Prakash, Filip Rosa, Zeynep Akata, Stefanie Liebe

Abstract

Large Language Models (LLMs) have been shown to encode clinical knowledge. Many evaluations, however, rely on structured question-answer benchmarks, overlooking critical challenges of interpreting and reasoning about unstructured clinical narratives in real-world settings. In this study we task eight large language models, including two medical models (GPT-3.5, GPT-4, Mixtral-8x7B, Qwen-72B, LlaMa2, LlaMa3, OpenBioLLM, Med42), with a core diagnostic task in epilepsy: mapping seizure description phrases, after targeted filtering and standardization, to one of seven possible seizure onset zones using likelihood estimates. Most models yield results that often match the ground truth and even approach clinician-level performance after prompt engineering. Specifically, clinician-guided chain-of-thought reasoning led to the most consistent improvements. Performance was further strongly modulated by clinical in-context impersonation, narrative length and language context (13.7%, 32.7% and 14.2% performance variation, respectively). However, expert analysis of reasoning outputs revealed that correct prediction can be based on hallucinated knowledge and inaccurate source citation, underscoring the need to improve interpretability of LLMs in clinical use. Overall, SemioLLM provides a scalable, domain-adaptable framework for evaluating LLMs in clinical disciplines where unstructured verbal descriptions encode diagnostic information. By identifying both the strengths and limitations of LLMs, our work contributes to testing the applicability of foundational AI systems for healthcare.

2406.18615 2026-04-01 cs.AI

Improving Execution Concurrency in Partial-Order Plans via Block-Substitution

Sabah Binte Noor, Fazlul Hasan Siddiqui

Comments arXiv admin note: text overlap with arXiv:2406.03091

Journal ref Autonomous Agents and Multi-Agent Systems, vol. 39, April 2025

Abstract

Partial-order plans in AI planning facilitate execution flexibility and several other tasks, such as plan reuse, modification, and decomposition, due to their less constrained nature. A partial-order plan (POP) specifies a partial order over actions, providing the flexibility of executing unordered actions in different sequences. This flexibility can be further extended by enabling parallel execution of actions in the POP to reduce its overall execution time. While extensive studies exist on improving the flexibility of a POP by optimizing its action orderings through plan deordering and reordering, there has been limited focus on the flexibility of executing actions concurrently in a plan. Flexibility of executing actions concurrently, referred to as concurrency, in a POP can be achieved by incorporating action non-concurrency constraints, specifying which actions cannot be executed in parallel. This work establishes the necessary and sufficient conditions for non-concurrency constraints between two actions or two subplans with respect to a planning task. We also introduce an algorithm to improve a plan's concurrency by optimizing resource utilization through substitutions of the plan's subplans with respect to the corresponding planning task. Our algorithm employs block deordering that eliminates orderings in a POP by encapsulating coherent actions in blocks, and then exploits blocks as candidate subplans for substitutions. Experiments over the benchmark problems from International Planning Competitions (IPC) exhibit considerable improvement in plan concurrency.

2406.14753 2026-04-01 cs.LG stat.ME

A General Control-Theoretic Approach for Reinforcement Learning: Theory and Algorithms

Weiqin Chen, Mark S. Squillante, Chai Wah Wu, Santiago Paternain

Details
English abstract

We devise a control-theoretic reinforcement learning approach to support direct learning of the optimal policy. We establish various theoretical properties of our approach, such as convergence and optimality of our analog of the Bellman operator and Q-learning, a new control-policy-variable gradient theorem, and a specific gradient ascent algorithm based on this theorem within the context of a specific control-theoretic framework. We empirically evaluate the performance of our control-theoretic approach on several classical reinforcement learning tasks, demonstrating significant improvements in solution quality, sample complexity, and running time over state-of-the-art methods.

2406.03091 2026-04-01 cs.AI

Improving Plan Execution Flexibility using Block-Substitution

Sabah Binte Noor, Fazlul Hasan Siddiqui

Journal ref Improving Plan Execution Flexibility using Block-Substitution. Journal of Artificial Intelligence Research 85, Article 35 (March 2026)

Details
English abstract

Partial-order plans in AI planning facilitate execution flexibility due to their less-constrained nature. Maximizing plan flexibility has been studied through the notions of plan deordering and plan reordering. Plan deordering removes unnecessary action orderings within a plan, while plan reordering modifies them arbitrarily to minimize action orderings. This study, in contrast with traditional plan deordering and reordering strategies, improves a plan's flexibility by substituting its subplans with actions outside the plan for a planning problem. Our methodology builds on block deordering, which eliminates orderings in a partial-order plan (POP) by encapsulating coherent actions in blocks, yielding a hierarchically structured plan termed a Block Decomposed Partial-Order (BDPO) plan. We consider the action blocks in a BDPO plan as candidate subplans for substitutions, and ensure that each successful substitution produces a plan with strictly greater flexibility. In addition, this paper employs plan reduction strategies to eliminate redundant actions within a BDPO plan. We also evaluate our approach when combined with MaxSAT-based reorderings. Our experimental results demonstrate a significant improvement in plan execution flexibility on the benchmark problems from International Planning Competitions (IPC), while maintaining good coverage and execution time.
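
One measure of flexibility used in the deordering literature is the fraction of action pairs left unordered after taking the transitive closure of the orderings. The abstract does not state which measure the authors use, so the sketch below treats this definition, and the name `flex`, as an assumption made for illustration.

```python
from itertools import combinations


def transitive_closure(actions, orderings):
    """Return the set of all (a, b) pairs such that a must precede b."""
    reach = {a: set() for a in actions}
    for before, after in orderings:
        reach[before].add(after)
    # Warshall-style closure: k is the intermediate action.
    for k in actions:
        for a in actions:
            if k in reach[a]:
                reach[a] |= reach[k]
    return {(a, b) for a in actions for b in reach[a]}


def flex(actions, orderings):
    """Fraction of unordered action pairs (0 = total order, 1 = no orderings)."""
    pairs = list(combinations(actions, 2))
    if not pairs:
        return 0.0
    closed = transitive_closure(actions, orderings)
    unordered = sum(
        1 for a, b in pairs if (a, b) not in closed and (b, a) not in closed
    )
    return unordered / len(pairs)


actions = ["a1", "a2", "a3", "a4"]
total_order = [("a1", "a2"), ("a2", "a3"), ("a3", "a4")]
deordered = [("a1", "a2"), ("a3", "a4")]
print(flex(actions, total_order))  # 0.0
print(flex(actions, deordered))    # 4 of the 6 pairs are unordered
```

Under this definition, a substitution that yields "strictly greater flexibility" would be one that strictly increases the returned value.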

2404.15244 2026-04-01 cs.CV cs.LG

Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation

Manyi Yao, Abhishek Aich, Yumin Suh, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker

Comments Accepted at WACV 2026 WVAQ

Details
English abstract

Vision transformer based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge is by adapting the computation level to the specific needs of the input image rather than the current one-size-fits-all approach. To this end, we introduce ECO-M2F or EffiCient TransfOrmer Encoders for Mask2Former-style models. Noting that the encoder module of M2F-style models incurs highly resource-intensive computations, ECO-M2F provides a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image. To enable this self-selection ability for providing a balance between performance and computational efficiency, we present a three-step recipe. The first step is to train the parent architecture to enable early exiting from the encoder. The second step is to create a derived dataset of the ideal number of encoder layers required for each training example. The third step is to use the aforementioned derived dataset to train a gating network that predicts the number of encoder layers to be used, conditioned on the input image. Additionally, to change the computation-accuracy tradeoff, only steps two and three need to be repeated, which significantly reduces retraining time. Experiments on the public datasets show that the proposed approach reduces expected encoder computational cost while maintaining performance, adapts to various user compute resources, is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
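
Step two of the recipe can be pictured with a small sketch. The selection rule used here (pick the exit layer that maximizes quality minus a cost penalty `beta` per layer) is an assumption chosen for illustration; the abstract does not specify the authors' criterion, and `ideal_exit_layer` and the quality numbers are hypothetical.

```python
def ideal_exit_layer(quality_per_layer, beta):
    """Pick the 1-based exit layer maximizing quality minus beta * layers used.

    quality_per_layer[i] is the segmentation quality obtained when exiting
    after layer i + 1. Ties go to the shallower (cheaper) exit.
    """
    best_layer, best_score = 1, float("-inf")
    for layer, quality in enumerate(quality_per_layer, start=1):
        score = quality - beta * layer
        if score > best_score:
            best_layer, best_score = layer, score
    return best_layer


# Hypothetical per-image quality after each of six encoder layers.
train_quality = {
    "easy_image": [0.70, 0.78, 0.79, 0.79, 0.80, 0.80],
    "hard_image": [0.30, 0.42, 0.55, 0.66, 0.74, 0.79],
}

# Derived dataset: image -> target number of encoder layers for the gating network.
derived = {name: ideal_exit_layer(q, beta=0.02) for name, q in train_quality.items()}
print(derived)  # {'easy_image': 2, 'hard_image': 6}
```

In this toy setup, raising `beta` shifts the targets toward shallower exits without touching the trained encoder, which mirrors why only steps two and three would need to be repeated to change the tradeoff.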

2312.04466 2026-04-01 cs.CV

Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

Kiran Chhatre, Radek Daněček, Nikos Athanasiou, Giorgio Becherini, Christopher Peters, Michael J. Black, Timo Bolkart

Comments Conference on Computer Vision and Pattern Recognition (CVPR) 2024. Webpage: https://amuse.is.tue.mpg.de/

Details
English abstract

Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content, and better represent the emotion expressed by the input speech. Our code is available at amuse.is.tue.mpg.de.

2206.02454 2026-04-01 cs.CV

What do CNNs Learn in the First Layer and Why? A Linear Systems Perspective

Rhea Chowers, Yair Weiss

Journal ref Proceedings of the 40th International Conference on Machine Learning (2023), PMLR 202

Details
English abstract

It has previously been reported that the representation that is learned in the first layer of deep Convolutional Neural Networks (CNNs) is highly consistent across initializations and architectures. In this work, we quantify this consistency by considering the first layer as a filter bank and measuring its energy distribution. We find that the energy distribution is very different from that of the initial weights and is remarkably consistent across random initializations, datasets, architectures and even when the CNNs are trained with random labels. In order to explain this consistency, we derive an analytical formula for the energy profile of linear CNNs and show that this profile is mostly dictated by the second-order statistics of image patches in the training set and that it approaches a whitening transformation as the number of iterations goes to infinity. Finally, we show that this formula for linear CNNs also gives an excellent fit for the energy profiles learned by commonly used nonlinear CNNs such as ResNet and VGG, and that the first layer of these CNNs indeed performs approximate whitening of its inputs.
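
To make the term concrete: a whitening transformation maps data so that its covariance becomes the identity, and it is determined entirely by the second-order statistics of the data. The sketch below is a generic two-dimensional illustration using the inverse square root of a covariance matrix (ZCA whitening), not the paper's derivation; the helper names and the toy covariance are assumptions.

```python
import math


def inv_sqrt_2x2(m):
    """Inverse square root of a symmetric positive-definite 2x2 matrix.

    Uses the closed form sqrt(M) = (M + s*I) / t with s = sqrt(det M)
    and t = sqrt(trace M + 2*s), then inverts the result.
    """
    (a, b), (_, c) = m
    s = math.sqrt(a * c - b * b)
    t = math.sqrt(a + c + 2.0 * s)
    # Entries of sqrt(M), which is itself symmetric.
    ra, rb, rc = (a + s) / t, b / t, (c + s) / t
    det = ra * rc - rb * rb
    return [[rc / det, -rb / det], [-rb / det, ra / det]]


def matmul(x, y):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


# Toy covariance of two correlated "pixels" in an image patch.
cov = [[2.0, 1.2], [1.2, 1.0]]
w = inv_sqrt_2x2(cov)

# Covariance after whitening is W * cov * W^T; W is symmetric, so W^T = W.
whitened_cov = matmul(matmul(w, cov), w)
print(whitened_cov)  # approximately the identity matrix
```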

2203.02381 2026-04-01 cs.RO

Where to Look Next: Learning Viewpoint Recommendations for Informative Trajectory Planning

Max Lodel, Bruno Brito, Álvaro Serra-Gómez, Laura Ferranti, Robert Babuška, Javier Alonso-Mora

Comments accepted to ICRA2022

Journal ref 2022 International Conference on Robotics and Automation (ICRA)

Details
English abstract

Search missions require motion planning and navigation methods for information gathering that continuously replan based on new observations of the robot's surroundings. Current methods for information gathering, such as Monte Carlo Tree Search, are capable of reasoning over long horizons, but they are computationally expensive. An alternative for fast online execution is to train, offline, an information gathering policy, which indirectly reasons about the information value of new observations. However, these policies lack safety guarantees and do not account for the robot dynamics. To overcome these limitations, we train an information-aware policy via deep reinforcement learning that guides a receding-horizon trajectory optimization planner. In particular, the policy continuously recommends a reference viewpoint to the local planner, such that the resulting dynamically feasible and collision-free trajectories lead to observations that maximize the information gain and reduce the uncertainty about the environment. In simulation tests in previously unseen environments, our method consistently outperforms greedy next-best-view policies and achieves competitive performance compared to Monte Carlo Tree Search, in terms of information gains and coverage time, with a reduction in execution time by three orders of magnitude.

2112.02057 2026-04-01 cs.RO

Snake Robot Gait Decomposition and Gait Parameter Optimization

Bongsub Song, Insung Ju, Dongwon Yun

Comments Temporarily withdrawing the paper to replenish the evidence base

Details
English abstract

This paper proposes Gait Decomposition (G.D), a method for mathematically decomposing snake movements, and Gait Parameter Gradient (GPG), a method for optimizing the decomposed gait parameters. G.D expresses the gait of a snake robot mathematically and concisely, covering the whole process from generating movement with a curve function to issuing motor control commands. With this method, the gait of a snake robot can be intuitively classified into a matrix, and the parameters of the curve function required for gait generation can be adjusted flexibly. This addresses the difficulty of parameter tuning, which is the reason snake robots are hard to put to practical use. Applied to snake robots, G.D can therefore generate various gaits with only a few parameters, so that snake robots can be used in many fields. We also implemented the GPG algorithm to optimize the gait curve function, in addition to defining the gait of the snake robot through G.D.

2603.30040 2026-04-01 cs.SE cs.AI

Automatic Identification of Parallelizable Loops Using Transformer-Based Source Code Representations

Izavan dos S. Correia, Henrique C. T. Santos, Tiago A. E. Ferreira

Comments 28 pages, 12 figures

Details
English abstract

Automatic parallelization remains a challenging problem in software engineering, particularly in identifying code regions where loops can be safely executed in parallel on modern multi-core architectures. Traditional static analysis techniques, such as dependence analysis and polyhedral models, often struggle with irregular or dynamically structured code. In this work, we propose a Transformer-based approach to classify the parallelization potential of source code, focusing on distinguishing independent (parallelizable) loops from undefined ones. We adopt DistilBERT to process source code sequences using subword tokenization, enabling the model to capture contextual syntactic and semantic patterns without handcrafted features. The approach is evaluated on a balanced dataset combining synthetically generated loops and manually annotated real-world code, using 10-fold cross-validation and multiple performance metrics. Results show consistently high performance, with mean accuracy above 99% and low false positive rates, demonstrating robustness and reliability. Compared to prior token-based methods, the proposed approach simplifies preprocessing while improving generalization and maintaining computational efficiency. These findings highlight the potential of lightweight Transformer models for practical identification of parallelization opportunities at the loop level.
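
The reported metrics can be computed from per-fold predictions. The sketch below is a generic illustration with made-up labels and only two folds, not the authors' evaluation code; label 1 stands for a parallelizable loop, so a false positive is a loop wrongly flagged as parallelizable.

```python
def accuracy_and_fpr(y_true, y_pred):
    """Return (accuracy, false positive rate) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    # FPR is undefined without negatives; report 0.0 in that case.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Hypothetical (true labels, predicted labels) for two validation folds.
folds = [
    ([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 1, 0]),
    ([1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 1]),
]

results = [accuracy_and_fpr(t, p) for t, p in folds]
mean_accuracy = sum(a for a, _ in results) / len(results)
mean_fpr = sum(f for _, f in results) / len(results)
print(mean_accuracy, mean_fpr)
```

With 10-fold cross-validation the same averaging would run over ten such folds, each held out once while the model is trained on the other nine.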

2603.30016 2026-04-01 cs.CR cs.AI

Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks

Chong Xiang, Drew Zagieboylo, Shaona Ghosh, Sanjay Kariyappa, Kai Greshake, Hanshen Xiao, Chaowei Xiao, G. Edward Suh

Details
English abstract

AI agents, predominantly powered by large language models (LLMs), are vulnerable to indirect prompt injection, in which malicious instructions embedded in untrusted data can trigger dangerous agent actions. This position paper discusses our vision for system-level defenses against indirect prompt injection attacks. We articulate three positions: (1) dynamic replanning and security policy updates are often necessary for dynamic tasks and realistic environments; (2) certain context-dependent security decisions would still require LLMs (or other learned models), but should only be made within system designs that strictly constrain what the model can observe and decide; (3) in inherently ambiguous cases, personalization and human interaction should be treated as core design considerations. In addition to our main positions, we discuss limitations of existing benchmarks that can create a false sense of utility and security. We also highlight the value of system-level defenses, which serve as the skeleton of agentic systems by structuring and controlling agent behaviors, integrating rule-based and model-based security checks, and enabling more targeted research on model robustness and human interaction.

2603.30014 2026-04-01 cs.DC cs.AI

Scalable AI-assisted Workflow Management for Detector Design Optimization Using Distributed Computing

Derek Anderson, Amit Bashyal, Markus Diefenthaler, Cristiano Fanelli, Wen Guan, Tanja Horn, Alex Jentsch, Meifeng Lin, Tadashi Maeno, Kei Nagai, Hemalata Nayak, Connor Pecar, Karthik Suresh, Fang-Ying Tsai, Anselm Vossen, Tianle Wang, Torre Wenaus

Details
English abstract

The Production and Distributed Analysis (PanDA) system, originally developed for the ATLAS experiment at the CERN Large Hadron Collider (LHC), has evolved into a robust platform for orchestrating large-scale workflows across distributed computing resources. Coupled with its intelligent Distributed Dispatch and Scheduling (iDDS) component, PanDA supports AI/ML-driven workflows through a scalable and flexible workflow engine. We present an AI-assisted framework for detector design optimization that integrates multi-objective Bayesian optimization with the PanDA-iDDS workflow engine to coordinate iterative simulations across heterogeneous resources. The framework addresses the challenge of exploring high-dimensional parameter spaces inherent in modern detector design. We demonstrate the framework using benchmark problems and realistic studies of the ePIC and dRICH detectors for the Electron-Ion Collider (EIC). Results show improved automation, scalability, and efficiency in multi-objective optimization. This work establishes a flexible and extensible paradigm for AI-driven detector design and other computationally intensive scientific applications.
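
Multi-objective optimization returns a set of trade-off designs instead of a single optimum. The sketch below shows the generic non-dominated (Pareto) filter that such a loop relies on, assuming all objectives are minimized. It does not use the PanDA or iDDS APIs, and the design names and objective values are invented for illustration.

```python
def dominates(u, v):
    """True if u is no worse than v in every objective and better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def pareto_front(candidates):
    """Return names of candidates not dominated by any other (minimization)."""
    return [
        name
        for name, objs in candidates.items()
        if not any(
            dominates(other, objs)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical detector designs scored on (resolution error, material budget).
designs = {
    "design_a": (0.10, 5.0),
    "design_b": (0.08, 7.0),
    "design_c": (0.12, 6.0),  # worse than design_a on both objectives
    "design_d": (0.15, 4.0),
}
print(pareto_front(designs))  # ['design_a', 'design_b', 'design_d']
```

In a distributed setting, each candidate's objective values would come from a separate simulation job, and the filter would be reapplied as results arrive.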