arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1800
2603.08906 2026-03-11 cs.CV physics.med-ph

Multi-Kernel Gated Decoder Adapters for Robust Multi-Task Thyroid Ultrasound under Cross-Center Shift

Maziar Sabouri, Nourhan Bayasi, Arman Rahmim

详情
英文摘要

Thyroid ultrasound (US) automation couples two competing requirements: global, geometry-driven reasoning for nodule delineation and local, texture-driven reasoning for malignancy risk assessment. Under cross-center domain shift, these cues degrade asymmetrically, yet most multi-task pipelines rely on a single shared backbone, often inducing negative transfer. In this paper, we characterize this interference across CNN (ResNet34) and medical ViT (MedSAM) backbones, and observe a consistent trend: ViTs transfer geometric priors that benefit segmentation, whereas CNNs more reliably preserve texture cues for malignancy discrimination under strong shift and artifacts. Motivated by this failure mode, we propose a lightweight family of decoder-side adapters, the Multi-Kernel Gated Adapter (MKGA) and a residual variant (ResMKGA), which refine multi-scale skip features using complementary receptive fields and apply semantic, context-conditioned gating to suppress artifact-prone content before fusion. Across two US benchmarks, the proposed adapters improve cross-center robustness: they strengthen out-of-domain segmentation and, in the CNN setting, yield clear gains in clinical TI-RADS diagnostic accuracy compared to standard multi-task baselines. Code and models will be released.

2603.08905 2026-03-11 cs.RO

Proprioceptive Safe Active Navigation and Exploration for Planetary Environments

Matthew Y. Jiang, Feifei Qian, Shipeng Liu

Comments 9 pages, 7 figures

详情
英文摘要

Deformable granular terrains introduce significant locomotion and immobilization risks in planetary exploration and are difficult to detect via remote sensing (e.g., vision). Legged robots can sense terrain properties through leg-terrain interactions during locomotion, offering a direct means to assess traversability in deformable environments. How to systematically exploit this interaction-derived information for navigation planning, however, remains underexplored. We address this gap by presenting PSANE, a Proprioceptive Safe Active Navigation and Exploration framework that leverages leg-terrain interaction measurements for safe navigation and exploration in unknown deformable environments. PSANE learns a traversability model via Gaussian Process regression to estimate and certify safe regions and identify exploration frontiers online, and integrates these estimates with a reactive controller for real-time navigation. Frontier selection is formulated as a multi-objective optimization that balances safe-set expansion probability and goal-directed cost, with subgoals selected via scalarization over the Pareto-optimal frontier set. PSANE safely explores unknown granular terrain and reaches specified goals using only proprioceptively estimated traversability, while achieving performance improvements over baseline methods.

2603.08898 2026-03-11 cs.CV

Towards Visual Query Segmentation in the Wild

Bing Fan, Minghao Li, Hanzhi Zhang, Shaohua Dong, Naga Prudhvi Mareedu, Weishi Shi, Yunhe Feng, Yan Huang, Heng Fan

详情
英文摘要

In this paper, we introduce visual query segmentation (VQS), a new paradigm of visual query localization (VQL) that aims to segment all pixel-level occurrences of an object of interest in an untrimmed video, given an external visual query. Compared to existing VQL locating only the last appearance of a target using bounding boxes, VQS enables more comprehensive (i.e., all object occurrences) and precise (i.e., pixel-level masks) localization, making it more practical for real-world scenarios. To foster research on this task, we present VQS-4K, a large-scale benchmark dedicated to VQS. Specifically, VQS-4K contains 4,111 videos with more than 1.3 million frames and covers a diverse set of 222 object categories. Each video is paired with a visual query defined by a frame outside the search video and its target mask, and annotated with spatial-temporal masklets corresponding to the queried target. To ensure high quality, all videos in VQS-4K are manually labeled with meticulous inspection and iterative refinement. To the best of our knowledge, VQS-4K is the first benchmark specifically designed for VQS. Furthermore, to stimulate future research, we present a simple yet effective method, named VQ-SAM, which extends SAM 2 by leveraging target-specific and background distractor cues from the video to progressively evolve the memory through a novel multi-stage framework with an adaptive memory generation (AMG) module for VQS, significantly improving the performance. In our extensive experiments on VQS-4K, VQ-SAM achieves promising results and surpasses all existing approaches, demonstrating its effectiveness. With the proposed VQS-4K and VQ-SAM, we expect to go beyond the current VQL paradigm and inspire more future research and practical applications on VQS. Our benchmark, code, and results will be made publicly available.

2603.08897 2026-03-11 cs.CV

Comparative Analysis of Patch Attack on VLM-Based Autonomous Driving Architectures

David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Long Cheng, Abolfazl Razi, Mert D. Pesé

Comments Accepted at the 2025 IEEE Intelligent Vehicles Symposium (IV 2025)

详情
英文摘要

Vision-language models are emerging for autonomous driving, yet their robustness to physical adversarial attacks remains unexplored. This paper presents a systematic framework for comparative adversarial evaluation across three VLM architectures: Dolphins, OmniDrive (Omni-L), and LeapVAD. Using black-box optimization with semantic homogenization for fair comparison, we evaluate physically realizable patch attacks in CARLA simulation. Results reveal severe vulnerabilities across all architectures, sustained multi-frame failures, and critical object detection degradation. Our analysis exposes distinct architectural vulnerability patterns, demonstrating that current VLM designs inadequately address adversarial threats in safety-critical autonomous driving applications.

2603.08877 2026-03-11 cs.AI

Quantifying the Accuracy and Cost Impact of Design Decisions in Budget-Constrained Agentic LLM Search

Kyle McCleary, James Ghawaly

Comments Accepted in 2026 Language Resources and Evaluation Conference (LREC)

详情
英文摘要

Agentic Retrieval-Augmented Generation (RAG) systems combine iterative search, planning prompts, and retrieval backends, but deployed settings impose explicit budgets on tool calls and completion tokens. We present a controlled measurement study of how search depth, retrieval strategy, and completion budget affect accuracy and cost under fixed constraints. Using Budget-Constrained Agentic Search (BCAS), a model-agnostic evaluation harness that surfaces remaining budget and gates tool use, we run comparisons across six LLMs and three question-answering benchmarks. Across models and datasets, accuracy improves with additional searches up to a small cap, hybrid lexical and dense retrieval with lightweight re-ranking produces the largest average gains in our ablation grid, and larger completion budgets are most helpful on HotpotQA-style synthesis. These results provide practical guidance for configuring budgeted agentic retrieval pipelines and are accompanied by reproducible prompts and evaluation settings.

2603.08869 2026-03-11 cs.CL

One Language, Two Scripts: Probing Script-Invariance in LLM Concept Representations

Sripad Karne

Comments Accepted at the UCRL Workshop at ICLR 2026

详情
英文摘要

Do the features learned by Sparse Autoencoders (SAEs) represent abstract meaning, or are they tied to how text is written? We investigate this question using Serbian digraphia as a controlled testbed: Serbian is written interchangeably in Latin and Cyrillic scripts with a near-perfect character mapping between them, enabling us to vary orthography while holding meaning exactly constant. Crucially, these scripts are tokenized completely differently, sharing no tokens whatsoever. Analyzing SAE feature activations across the Gemma model family (270M-27B parameters), we find that identical sentences in different Serbian scripts activate highly overlapping features, far exceeding random baselines. Strikingly, changing script causes less representational divergence than paraphrasing within the same script, suggesting SAE features prioritize meaning over orthographic form. Cross-script cross-paraphrase comparisons provide evidence against memorization, as these combinations rarely co-occur in training data yet still exhibit substantial feature overlap. This script invariance strengthens with model scale. Taken together, our findings suggest that SAE features can capture semantics at a level of abstraction above surface tokenization, and we propose Serbian digraphia as a general evaluation paradigm for probing the abstractness of learned representations.

2603.08863 2026-03-11 cs.RO

Adaptive SINDy: Residual Force System Identification Based UAV Disturbance Rejection

Fawad Mehboob, Amir Atef Habel, Roohan Ahmed Khan, Mikhail Derevianchenko, Clement Fortin, Dzmitry Tsetserukou

详情
英文摘要

The stability and control of Unmanned Aerial Vehicles (UAVs) in a turbulent environment is a matter of great concern. Devising a robust control algorithm to reject disturbances is challenging due to the highly nonlinear nature of wind dynamics, and modeling the dynamics using analytical techniques is not straightforward. While traditional techniques using disturbance observers and classical adaptive control have shown some progress, they are mostly limited to relatively non-complex environments. On the other hand, learning based approaches are increasingly being used for modeling of residual forces and disturbance rejection; however, their generalization and interpretability is a factor of concern. To this end, we propose a novel integration of data-driven system identification using Sparse Identification of Non-Linear Dynamics (SINDy) with a Recursive Least Square (RLS) adaptive control to adapt and reject wind disturbances in a turbulent environment. We tested and validated our approach on Gazebo harmonic environment and on real flights with wind speeds of up to 2 m/s from four directions, creating a highly dynamic and turbulent environment. Adaptive SINDy outperformed the baseline PID and INDI controllers on several trajectory tracking error metrics without crashing. A root mean square error (RMSE) of up to 12.2 cm and 17.6 cm, and a mean absolute error (MAE) of 13.7 cm and 10.5 cm were achieved on circular and lemniscate trajectories, respectively. The validation was performed on a very lightweight Crazyflie drone under a highly dynamic environment for complex trajectory tracking.

2603.08862 2026-03-11 cs.RO cs.LG

APPLV: Adaptive Planner Parameter Learning from Vision-Language-Action Model

Yuanjie Lu, Beichen Wang, Zhengqi Wu, Yang Li, Xiaomin Lin, Chengzhi Mao, Xuesu Xiao

详情
英文摘要

Autonomous navigation in highly constrained environments remains challenging for mobile robots. Classical navigation approaches offer safety assurances but require environment-specific parameter tuning; end-to-end learning bypasses parameter tuning but struggles with precise control in constrained spaces. To this end, recent robot learning approaches automate parameter tuning while retaining classical systems' safety, yet still face challenges in generalizing to unseen environments. Recently, Vision-Language-Action (VLA) models have shown promise by leveraging foundation models' scene understanding capabilities, but still struggle with precise control and inference latency in navigation tasks. In this paper, we propose Adaptive Planner Parameter Learning from Vision-Language-Action Model (\textsc{applv}). Unlike traditional VLA models that directly output actions, \textsc{applv} leverages pre-trained vision-language models with a regression head to predict planner parameters that configure classical planners. We develop two training strategies: supervised learning fine-tuning from collected navigation trajectories and reinforcement learning fine-tuning to further optimize navigation performance. We evaluate \textsc{applv} across multiple motion planners on the simulated Benchmark Autonomous Robot Navigation (BARN) dataset and in physical robot experiments. Results demonstrate that \textsc{applv} outperforms existing methods in both navigation performance and generalization to unseen environments.

2603.08860 2026-03-11 cs.RO cs.SY eess.SY

SEP-NMPC: Safety Enhanced Passivity-Based Nonlinear Model Predictive Control for a UAV Slung Payload System

Seyedreza Rezaei, Junjie Kang, Amaldev Haridevan, Jinjun Shan

Comments Accepted at ICRA 2026

详情
英文摘要

Model Predictive Control (MPC) is widely adopted for agile multirotor vehicles, yet achieving both stability and obstacle-free flight is particularly challenging when a payload is suspended beneath the airframe. This paper introduces a Safety Enhanced Passivity-Based Nonlinear MPC (SEP-NMPC) that provides formal guarantees of stability and safety for a quadrotor transporting a slung payload through cluttered environments. Stability is enforced by embedding a strict passivity inequality, which is derived from a shaped energy storage function with adaptive damping, directly into the NMPC. This formulation dissipates excess energy and ensures asymptotic convergence despite payload swings. Safety is guaranteed through high-order control barrier functions (HOCBFs) that render user-defined clearance sets forward-invariant, obliging both the quadrotor and the swinging payload to maintain separation while interacting with static and dynamic obstacles. The optimization remains quadratic-program compatible and is solved online at each sampling time without gain scheduling or heuristic switching. Extensive simulations and real-world experiments confirm stable payload transport, collision-free trajectories, and real-time feasibility across all tested scenarios. The SEP-NMPC framework therefore unifies passivity-based closed-loop stability with HOCBF-based safety guarantees for UAV slung-payload transportation.

2603.08859 2026-03-11 cs.LG

Expressivity-Efficiency Tradeoffs for Hybrid Sequence Models

John Cooper, Ilias Diakonikolas, Mingchen Ma, Frederic Sala

详情
英文摘要

Hybrid sequence models--combining Transformer and state-space model layers--seek to gain the expressive versatility of attention as well as the computational efficiency of state-space model layers. Despite burgeoning interest in hybrid models, we lack a basic understanding of the settings where--and underlying mechanisms through which--they offer benefits over their constituent models. In this paper, we study this question, focusing on a broad family of core synthetic tasks. For this family of tasks, we prove the existence of fundamental limitations for non-hybrid models. Specifically, any Transformer or state-space model that solves the underlying task requires either a large number of parameters or a large working memory. On the other hand, for two prototypical tasks within this family--namely selective copying and associative recall--we construct hybrid models of small size and working memory that provably solve these tasks, thus achieving the best of both worlds. Our experimental evaluation empirically validates our theoretical findings. Importantly, going beyond the settings in our theoretical analysis, we empirically show that learned--rather than constructed--hybrids outperform non-hybrid models with up to 6x as many parameters. We additionally demonstrate that hybrid models exhibit stronger length generalization and out-of-distribution robustness than non-hybrids.

2603.08852 2026-03-11 cs.AI cs.MA cs.SE

LDP: An Identity-Aware Protocol for Multi-Agent LLM Systems

Sunil Prakash

Comments 16 pages, 9 figures, 8 tables, 4 appendices

详情
英文摘要

As multi-agent AI systems grow in complexity, the protocols connecting them constrain their capabilities. Current protocols such as A2A and MCP do not expose model-level properties as first-class primitives, ignoring properties fundamental to effective delegation: model identity, reasoning profile, quality calibration, and cost characteristics. We present the LLM Delegate Protocol (LDP), an AI-native communication protocol introducing five mechanisms: (1) rich delegate identity cards with quality hints and reasoning profiles; (2) progressive payload modes with negotiation and fallback; (3) governed sessions with persistent context; (4) structured provenance tracking confidence and verification status; (5) trust domains enforcing security boundaries at the protocol level. We implement LDP as a plugin for the JamJet agent runtime and evaluate against A2A and random baselines using local Ollama models and LLM-as-judge evaluation. Identity-aware routing achieves ~12x lower latency on easy tasks through delegate specialization, though it does not improve aggregate quality in our small delegate pool; semantic frame payloads reduce token count by 37% (p=0.031) with no observed quality loss; governed sessions eliminate 39% token overhead at 10 rounds; and noisy provenance degrades synthesis quality below the no-provenance baseline, arguing that confidence metadata is harmful without verification. Simulated analyses show architectural advantages in attack detection (96% vs. 6%) and failure recovery (100% vs. 35% completion). This paper contributes a protocol design, reference implementation, and initial evidence that AI-native protocol primitives enable more efficient and governable delegation.

2603.08850 2026-03-11 cs.CV

HECTOR: Hybrid Editable Compositional Object References for Video Generation

Guofeng Zhang, Angtian Wang, Jacob Zhiyuan Fang, Liming Jiang, Haotian Yang, Alan Yuille, Chongyang Ma

详情
英文摘要

Real-world videos naturally portray complex interactions among distinct physical objects, effectively forming dynamic compositions of visual elements. However, most current video generation models synthesize scenes holistically and therefore lack mechanisms for explicit compositional manipulation. To address this limitation, we propose HECTOR, a generative pipeline that enables fine-grained compositional control. In contrast to prior methods,HECTOR supports hybrid reference conditioning, allowing generation to be simultaneously guided by static images and/or dynamic videos. Moreover, users can explicitly specify the trajectory of each referenced element, precisely controlling its location, scale, and speed (see Figure1). This design allows the model to synthesize coherent videos that satisfy complex spatiotemporal constraints while preserving high-fidelity adherence to references. Extensive experiments demonstrate that HECTOR achieves superior visual quality, stronger reference preservation, and improved motion controllability compared with existing approaches.

2603.08844 2026-03-11 cs.CV cs.AI

A Lightweight Multi-Cancer Tumor Localization Framework for Deployable Digital Pathology

Brian Isett, Rebekah Dadey, Aofei Li, Ryan C. Augustin, Kate Smith, Aatur D. Singhi, Qiangqiang Gu, Riyue Bao

Comments 9 pages, 2 figures

详情
英文摘要

Accurate localization of tumor regions from hematoxylin and eosin-stained whole-slide images is fundamental for translational research including spatial analysis, molecular profiling, and tissue architecture investigation. However, deep learning-based tumor detection trained within specific cancers may exhibit reduced robustness when applied across different tumor types. We investigated whether balanced training across cancers at modest scale can achieve high performance and generalize to unseen tumor types. A multi-cancer tumor localization model (MuCTaL) was trained on 79,984 non-overlapping tiles from four cancers (melanoma, hepatocellular carcinoma, colorectal cancer, and non-small cell lung cancer) using transfer learning with DenseNet169. The model achieved a tile-level ROC-AUC of 0.97 in validation data from the four training cancers, and 0.71 on an independent pancreatic ductal adenocarcinoma cohort. A scalable inference workflow was built to generate spatial tumor probability heatmaps compatible with existing digital pathology tools. Code and models are publicly available at https://github.com/AivaraX-AI/MuCTaL.

2603.08835 2026-03-11 cs.AI cs.CL cs.LG

MASEval: Extending Multi-Agent Evaluation from Models to Systems

Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, Martin Gubri

详情
英文摘要

The rapid adoption of LLM-based agentic systems has produced a rich ecosystem of frameworks (smolagents, LangGraph, AutoGen, CAMEL, LlamaIndex, i.a.). Yet existing benchmarks are model-centric: they fix the agentic setup and do not compare other system components. We argue that implementation decisions substantially impact performance, including choices such as topology, orchestration logic, and error handling. MASEval addresses this evaluation gap with a framework-agnostic library that treats the entire system as the unit of analysis. Through a systematic system-level comparison across 3 benchmarks, 3 models, and 3 frameworks, we find that framework choice matters as much as model choice. MASEval allows researchers to explore all components of agentic systems, opening new avenues for principled system design, and practitioners to identify the best implementation for their use case. MASEval is available under the MIT licence https://github.com/parameterlab/MASEval.

2603.08831 2026-03-11 cs.RO cs.SY eess.SY

Predictive Control with Indirect Adaptive Laws for Payload Transportation by Quadrupedal Robots

Leila Amanzadeh, Taizoon Chunawala, Randall T. Fawcett, Alexander Leonessa, Kaveh Akbari Hamed

Comments 8 pages, 6 figures. Published in IEEE Robotics and Automation Letters

详情
Journal ref
IEEE Robotics and Automation Letters, vol. 9, no. 11, pp. 10359-10366, Nov. 2024
英文摘要

This paper formally develops a novel hierarchical planning and control framework for robust payload transportation by quadrupedal robots, integrating a model predictive control (MPC) algorithm with a gradient-descent-based adaptive updating law. At the framework's high level, an indirect adaptive law estimates the unknown parameters of the reduced-order (template) locomotion model under varying payloads. These estimated parameters feed into an MPC algorithm for real-time trajectory planning, incorporating a convex stability criterion within the MPC constraints to ensure the stability of the template model's estimation error. The optimal reduced-order trajectories generated by the high-level adaptive MPC (AMPC) are then passed to a low-level nonlinear whole-body controller (WBC) for tracking. Extensive numerical investigations validate the framework's capabilities, showcasing the robot's proficiency in transporting unmodeled, unknown static payloads up to 109% in experiments on flat terrains and 91% on rough experimental terrains. The robot also successfully manages dynamic payloads with 73% of its mass on rough terrains. Performance comparisons with a normal MPC and an L1 MPC indicate a significant improvement. Furthermore, comprehensive hardware experiments conducted in indoor and outdoor environments confirm the method's efficacy on rough terrains despite uncertainties such as payload variations, push disturbances, and obstacles.

2603.08827 2026-03-11 cs.CV

Computer Vision-Based Vehicle Allotment System using Perspective Mapping

Prachi Nandi, Sonakshi Satapathy, Suchismita Chinara

详情
英文摘要

Smart city research envisions a future in which data-driven solutions and sustainable infrastructure work together to define urban living at the crossroads of urbanization and technology. Within this framework, smart parking systems play an important role in reducing urban congestion and supporting sustainable transportation. Automating parking solutions have considerable benefits, such as increased efficiency and less reliance on human involvement, but obstacles such as sensor limitations and integration complications remain. To overcome them, a more sophisticated car allotment system is required, particularly in heavily populated urban areas. Computer vision, with its higher accuracy and adaptability, outperforms traditional sensor-based systems for recognizing vehicles and vacant parking spaces. Unlike fixed sensor technologies, computer vision can dynamically assess a wide range of visual inputs while adjusting to changing parking layouts. This research presents a cost-effective, easy-to-implement smart parking system utilizing computer vision and object detection models like YOLOv8. Using inverse perspective mapping (IPM) to merge images from four camera views, we extract data on vacant spaces. The system simulates a 3D parking environment, representing available spots with a 3D Cartesian plot to guide users.

2603.08825 2026-03-11 cs.LG cs.AI

Are Expressive Encoders Necessary for Discrete Graph Generation?

Jay Revolinsky, Harry Shomer, Jiliang Tang

Comments 25 pages, 15 figures, 10 tables

详情
英文摘要

Discrete graph generation has emerged as a powerful paradigm for modeling graph data, often relying on highly expressive neural backbones such as transformers or higher-order architectures. We revisit this design choice by introducing GenGNN, a modular message-passing framework for graph generation. Diffusion models with GenGNN achieve more than 90% validity on Tree and Planar datasets, within margins of graph transformers, at 2-5x faster inference speed. For molecule generation, DiGress with a GenGNN backbone achieves 99.49% Validity. A systematic ablation study shows the benefit provided by each GenGNN component, indicating the need for residual connections to mitigate oversmoothing on complicated graph-structure. Through scaling analyses, we apply a principled metric-space view to investigate learned diffusion representations and uncover whether GNNs can be expressive neural backbones for discrete diffusion.

2603.08824 2026-03-11 cs.LG

SoftJAX & SoftTorch: Empowering Automatic Differentiation Libraries with Informative Gradients

Anselm Paulus, A. René Geist, Vít Musil, Sebastian Hoffmann, Onur Beker, Georg Martius

详情
英文摘要

Automatic differentiation (AD) frameworks such as JAX and PyTorch have enabled gradient-based optimization for a wide range of scientific fields. Yet, many "hard" primitives in these libraries such as thresholding, Boolean logic, discrete indexing, and sorting operations yield zero or undefined gradients that are not useful for optimization. While numerous "soft" relaxations have been proposed that provide informative gradients, the respective implementations are fragmented across projects, making them difficult to combine and compare. This work introduces SoftJAX and SoftTorch, open-source, feature-complete libraries for soft differentiable programming. These libraries provide a variety of soft functions as drop-in replacements for their hard JAX and PyTorch counterparts. This includes (i) elementwise operators such as clip or abs, (ii) utility methods for manipulating Booleans and indices via fuzzy logic, (iii) axiswise operators such as sort or rank -- based on optimal transport or permutahedron projections, and (iv) offer full support for straight-through gradient estimation. Overall, SoftJAX and SoftTorch make the toolbox of soft relaxations easily accessible to differentiable programming, as demonstrated through benchmarking and a practical case study. Code is available at github.com/a-paulus/softjax and github.com/a-paulus/softtorch.

2603.08821 2026-03-11 cs.RO

Impact of Different Failures on a Robot's Perceived Reliability

Andrew Violette, Zhanxin Wu, Haruki Nishimura, Masha Itkina, Leticia Priebe Rocha, Mark Zolotas, Guy Hoffman, Hadas Kress-Gazit

Comments Accepted to ICRA 2026. 8 pages, 6 figures

详情
英文摘要

Robots fail, potentially leading to a loss in the robot's perceived reliability (PR), a measure correlated with trustworthiness. In this study we examine how various kinds of failures affect the PR of the robot differently, and how this measure recovers without explicit social repair actions by the robot. In a preregistered and controlled online video study, participants were asked to predict a robot's success in a pick-and-place task. We examined manipulation failures (slips), freezing (lapses), and three types of incorrect picked objects or place goals (mistakes). Participants were shown one of 11 videos -- one of five types of failure, one of five types of failure followed by a successful execution in the same video, or a successful execution video. This was followed by two additional successful execution videos. Participants bet money either on the robot or on a coin toss after each video. People's betting patterns along with a qualitative analysis of their survey responses highlight that mistakes are less damaging to PR than slips or lapses, and some mistakes are even perceived as successes. We also see that successes immediately following a failure have the same effect on PR as successes without a preceding failure. Finally, we show that successful executions recover PR after a failure. Our findings highlight which robot failures are in higher need of repair in a human-robot interaction, and how trust could be recovered by robot successes.

2603.08817 2026-03-11 cs.RO

HMR-1: Hierarchical Massage Robot with Vision-Language-Model for Embodied Healthcare

Rongtao Xu, Mingming Yu, Xiaofeng Han, Yu Zhang, Kaiyi Hu, Zhe Feng, Zenghuang Fu, Changwei Wang, Weiliang Meng, Xiaopeng Zhang

详情
英文摘要

The rapid advancement of Embodied Intelligence has opened transformative opportunities in healthcare, particularly in physical therapy and rehabilitation. However, critical challenges remain in developing robust embodied healthcare solutions, such as the lack of standardized evaluation benchmarks and the scarcity of open-source multimodal acupoint massage datasets. To address these gaps, we construct MedMassage-12K - a multimodal dataset containing 12,190 images with 174,177 QA pairs, covering diverse lighting conditions and backgrounds. Furthermore, we propose a hierarchical embodied massage framework, which includes a high-level acupoint grounding module and a low-level control module. The high-level acupoint grounding module uses multimodal large language models to understand human language and identify acupoint locations, while the low-level control module provides the planned trajectory. Based on this, we evaluate existing MLLMs and establish a benchmark for embodied massage tasks. Additionally, we fine-tune the Qwen-VL model, demonstrating the framework's effectiveness. Physical experiments further confirm the practical applicability of the framework.Our dataset and code are publicly available at https://github.com/Xiaofeng-Han-Res/HMR-1.

2603.08814 2026-03-11 cs.RO cs.AI cs.ET cs.MA

Scale-Plan: Scalable Language-Enabled Task Planning for Heterogeneous Multi-Robot Teams

Piyush Gupta, Sangjae Bae, Jiachen Li, David Isele

详情
英文摘要

Long-horizon task planning for heterogeneous multi-robot systems is essential for deploying collaborative teams in real-world environments; yet, it remains challenging due to the large volume of perceptual information, much of which is irrelevant to task objectives and burdens planning. Traditional symbolic planners rely on manually constructed problem specifications, limiting scalability and adaptability, while recent large language model (LLM)-based approaches often suffer from hallucinations and weak grounding-i.e., poor alignment between generated plans and actual environmental objects and constraints-in object-rich settings. We present Scale-Plan, a scalable LLM-assisted framework that generates compact, task-relevant problem representations from natural language instructions. Given a PDDL domain specification, Scale-Plan constructs an action graph capturing domain structure and uses shallow LLM reasoning to guide a structured graph search that identifies a minimal subset of relevant actions and objects. By filtering irrelevant information prior to planning, Scale-Plan enables efficient decomposition, allocation, and long-horizon plan generation. We evaluate our approach on complex multi-agent tasks and introduce MAT2-THOR, a cleaned benchmark built on AI2-THOR for reliable evaluation of multi-robot planning systems. Scale-Plan outperforms pure LLM and hybrid LLM-PDDL baselines across all metrics, improving scalability and reliability.

2603.08812 2026-03-11 cs.CV

VisionCreator-R1: A Reflection-Enhanced Native Visual-Generation Agentic Model

Jinxiang Lai, Wenzhe Zhao, Zexin Lu, Hualei Zhang, Qinyu Yang, Rongwei Quan, Zhimin Li, Shuai Shao, Song Guo, Qinglin Lu

详情
英文摘要

Visual content generation has advanced from single-image to multi-image workflows, yet existing agents remain largely plan-driven and lack systematic reflection mechanisms to correct mid-trajectory visual errors. To address this limitation, we propose VisionCreator-R1, a native visual generation agent with explicit reflection, together with a Reflection-Plan Co-Optimization (RPCO) training methodology. Through extensive experiments and trajectory-level analysis, we uncover reflection-plan optimization asymmetry in reinforcement learning (RL): planning can be reliably optimized via plan rewards, while reflection learning is hindered by noisy credit assignment. Guided by this insight, our RPCO first trains on the self-constructed VCR-SFT dataset with reflection-strong single-image trajectories and planning-strong multi-image trajectories, then co-optimization on VCR-RL dataset via RL. This yields our unified VisionCreator-R1 agent, which consistently outperforms Gemini2.5Pro on existing benchmarks and our VCR-bench covering single-image and multi-image tasks.

2603.08810 2026-03-11 cs.RO

Age-Related Differences in the Perception of Eye-Gaze from a Social Robot

Lucas Morillo-Mendez, Martien G. S. Schrooten, Oscar Martinez Mozos

Comments This is the pre-print version. Final publication available at https://doi.org/10.1007/978-3-030-90525-5_30

详情
Journal ref
International Conference on Social Robotics (ICSR), 2021. Lecture Notes in Computer Science, vol 13086
英文摘要

There is an increasing interest in social robots assisting older adults during daily life tasks. In this context, non-verbal cues such as deictic gaze are important in natural communication in human-robot interaction. However, the sensibility to deictic-gaze declines naturally with age and results in a reduction in social perception. Therefore, this work explores the benefits of deictic gaze from social robots assisting older adults during daily life tasks, and how age-related differences may influence their social perception in contrast to younger populations. This may help on the design of adaptive age-related non-verbal cues in the Human-Robot Interaction context.

2603.08803 2026-03-11 cs.LG stat.ML

The Temporal Markov Transition Field

Michael Leznik

Comments 13 pages, 2 figures

详情
英文摘要

The Markov Transition Field (MTF), introduced by Wang and Oates (2015), encodes a time series as a two-dimensional image by mapping each pair of time steps to the transition probability between their quantile states, estimated from a single global transition matrix. This construction is efficient when the transition dynamics are stationary, but produces a misleading representation when the process changes regime over time: the global matrix averages across regimes and the resulting image loses all information about \emph{when} each dynamical regime was active. In this paper we introduce the \emph{Temporal Markov Transition Field} (TMTF), an extension that partitions the series into $K$ contiguous temporal chunks, estimates a separate local transition matrix for each chunk, and assembles the image so that each row reflects the dynamics local to its chunk rather than the global average. The resulting $T \times T$ image has $K$ horizontal bands of distinct texture, each encoding the transition dynamics of one temporal segment. We develop the formal definition, establish the key structural properties of the representation, work through a complete numerical example that makes the distinction from the global MTF concrete, analyse the bias--variance trade-off introduced by temporal chunking, and discuss the geometric interpretation of the local transition matrices in terms of process properties such as persistence, mean reversion, and trending behaviour. The TMTF is amplitude-agnostic and order-preserving, making it suitable as an input channel for convolutional neural networks applied to time series characterisation tasks.

2603.08800 2026-03-11 cs.CV

Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM

Junyuan Mao, Qiankun Li, Linghao Meng, Zhicheng He, Xinliang Zhou, Kun Wang, Yang Liu, Yueming Jin

详情
英文摘要

Recent advances in multimodal large language models largely rely on CLIP-based visual encoders, which emphasize global semantic alignment but struggle with fine-grained visual understanding. In contrast, DINOv3 provides strong pixel-level perception yet lacks coarse-grained semantic abstraction, leading to limited multi-granularity reasoning. To address this gap, we propose Granulon, a novel DINOv3-based MLLM with adaptive granularity augmentation. Granulon introduces a text-conditioned granularity Controller that dynamically adjusts the visual abstraction level according to the semantic scope of the textual input, and an Adaptive Token Aggregation module that performs granularity-guided pooling and relation-aware clustering to produce compact, semantically rich visual tokens. This design enables unified "pixel-to-fine-to-coarse" reasoning within a single forward pass. Extensive and interpretable experiments demonstrate that Granulon improves accuracy by ~30% and reduces hallucination by ~20%, outperforming all visual encoders under identical settings.

2603.08773 2026-03-11 cs.LG cs.AI stat.ML

Multi-level meta-reinforcement learning with skill-based curriculum

Sichen Yang, Mauro Maggioni

Comments 78 pages, 12 figures

详情
英文摘要

We consider problems in sequential decision making with natural multi-level structure, where sub-tasks are assembled together to accomplish complex goals. Systematically inferring and leveraging hierarchical structure has remained a longstanding challenge; we describe an efficient multi-level procedure for repeatedly compressing Markov decision processes (MDPs), wherein a parametric family of policies at one level is treated as single actions in the compressed MDPs at higher levels, while preserving the semantic meanings and structure of the original MDP, and mimicking the natural logic to address a complex MDP. Higher-level MDPs are themselves independent MDPs with less stochasticity, and may be solved using existing algorithms. As a byproduct, spatial or temporal scales may be coarsened at higher levels, making it more efficient to find long-term optimal policies. The multi-level representation delivered by this procedure decouples sub-tasks from each other and usually greatly reduces unnecessary stochasticity and the policy search space, leading to fewer iterations and computations when solving the MDPs. A second fundamental aspect of this work is that these multi-level decompositions plus the factorization of policies into embeddings (problem-specific) and skills (including higher-order functions) yield new transfer opportunities of skills across different problems and different levels. This whole process is framed within curriculum learning, wherein a teacher organizes the student agent's learning process in a way that gradually increases the difficulty of tasks and and promotes transfer across MDPs and levels within and across curricula. The consistency of this framework and its benefits can be guaranteed under mild assumptions. We demonstrate abstraction, transferability, and curriculum learning in examples, including MazeBase+, a more complex variant of the MazeBase example.

2603.08763 2026-03-11 cs.LG cs.RO

SPREAD: Subspace Representation Distillation for Lifelong Imitation Learning

Kaushik Roy, Giovanni D'urso, Nicholas Lawrance, Brendan Tidd, Peyman Moghadam

Comments IEEE International Conference on Robotics & Automation (ICRA) 2026

详情
英文摘要

A key challenge in lifelong imitation learning (LIL) is enabling agents to acquire new skills from expert demonstrations while retaining prior knowledge. This requires preserving the low-dimensional manifolds and geometric structures that underlie task representations across sequential learning. Existing distillation methods, which rely on L2-norm feature matching in raw feature space, are sensitive to noise and high-dimensional variability, often failing to preserve intrinsic task manifolds. To address this, we introduce SPREAD, a geometry-preserving framework that employs singular value decomposition (SVD) to align policy representations across tasks within low-rank subspaces. This alignment maintains the underlying geometry of multimodal features, facilitating stable transfer, robustness, and generalization. Additionally, we propose a confidence-guided distillation strategy that applies a Kullback-Leibler divergence loss restricted to the top-M most confident action samples, emphasizing reliable modes and improving optimization stability. Experiments on the LIBERO, lifelong imitation learning benchmark, show that SPREAD substantially improves knowledge transfer, mitigates catastrophic forgetting, and achieves state-of-the-art performance.

2603.08759 2026-03-11 cs.SD cs.AI

EDMFormer: Genre-Specific Self-Supervised Learning for Music Structure Segmentation

Sahal Sajeer, Krish Patel, Oscar Chung, Joel Song Bae

Comments Published in CUCAI 2026 conference proceedings

详情
英文摘要

Music structure segmentation is a key task in audio analysis, but existing models perform poorly on Electronic Dance Music (EDM). This problem exists because most approaches rely on lyrical or harmonic similarity, which works well for pop music but not for EDM. EDM structure is instead defined by changes in energy, rhythm, and timbre, with different sections such as buildup, drop, and breakdown. We introduce EDMFormer, a transformer model that combines self-supervised audio embeddings using an EDM-specific dataset and taxonomy. We release this dataset as EDM-98: a group of 98 professionally annotated EDM tracks. EDMFormer improves boundary detection and section labelling compared to existing models, particularly for drops and buildups. The results suggest that combining learned representations with genre-specific data and structural priors is effective for EDM and could be applied to other specialized music genres or broader audio domains.

2603.08758 2026-03-11 cs.LG cs.AI

Generalized Reduction to the Isotropy for Flexible Equivariant Neural Fields

Alejandro García-Castellanos, Gijs Bellaard, Remco Duits, Daniel Pelt, Erik J Bekkers

详情
英文摘要

Many geometric learning problems require invariants on heterogeneous product spaces, i.e., products of distinct spaces carrying different group actions, where standard techniques do not directly apply. We show that, when a group $G$ acts transitively on a space $M$, any $G$-invariant function on a product space $X \times M$ can be reduced to an invariant of the isotropy subgroup $H$ of $M$ acting on $X$ alone. Our approach establishes an explicit orbit equivalence $(X \times M)/G \cong X/H$, yielding a principled reduction that preserves expressivity. We apply this characterization to Equivariant Neural Fields, extending them to arbitrary group actions and homogeneous conditioning spaces, and thereby removing the major structural constraints imposed by existing methods.

2603.08754 2026-03-11 cs.LG cs.AI

Hindsight Credit Assignment for Long-Horizon LLM Agents

Hui-Ze Tan, Xiao-Wen Yang, Hao Chen, Jie-Jing Shao, Yi Wen, Yuteng Shen, Weihong Luo, Xiku Du, Lan-Zhe Guo, Yu-Feng Li

详情
英文摘要

Large Language Model (LLM) agents often face significant credit assignment challenges in long-horizon, multi-step tasks due to sparse rewards. Existing value-free methods, such as Group Relative Policy Optimization (GRPO), encounter two fundamental bottlenecks: inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states. To address these limitations, we introduce HCAPO, the first framework to integrate hindsight credit assignment into LLM agents. HCAPO leverages the LLM itself as a post-hoc critic to refine step-level Q-values through hindsight reasoning. Furthermore, HCAPO's multi-scale advantage mechanism effectively supplements the inaccurate value baselines at critical decision states. Evaluations across three challenging benchmarks, including WebShop and ALFWorld, demonstrate that HCAPO consistently outperforms state-of-the-art RL methods. Notably, HCAPO achieves a 7.7% improvement in success rate on WebShop and a 13.8% on ALFWorld over GRPO using the Qwen2.5-7B-Instruct model. These results indicate that HCAPO significantly enhances exploration efficiency, promotes concise decision-making, and ensures scalability in complex, long-horizon tasks.