arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2510.17088 2026-03-10 cs.LG cs.AI cs.CE

Explainable Heterogeneous Anomaly Detection in Financial Networks via Adaptive Expert Routing

Zan Li, Rui Fan

Journal ref
XAI-FIN: International Joint Workshop on Explainable AI in Finance, ACM ICAIF 2025
Abstract

Financial anomalies arise from heterogeneous mechanisms -- price shocks, liquidity freezes, contagion cascades, and momentum reversals -- yet existing detectors produce uniform scores without revealing which mechanism is failing. This hinders targeted responses: liquidity freezes call for market-making support, whereas price shocks call for circuit breakers. Three key challenges remain: (1) static graphs cannot adapt when correlations shift across regimes; (2) uniform detectors overlook heterogeneous anomaly signatures; and (3) black-box scores provide no actionable guidance on driving mechanisms. We address these challenges with an adaptive graph learning framework that embeds interpretability architecturally rather than post hoc. The framework constructs stress-modulated graphs that adaptively interpolate between known sector and geographic relationships and data-driven correlations as market conditions evolve. Anomalies are decomposed via four mechanism-specific experts -- Price-Shock, Liquidity, Systemic-Contagion, and Momentum-Reversal -- each capturing a distinct anomaly channel documented in the financial economics literature. The resulting routing weights serve as interpretable proxies for mechanism attribution, with their relative values indicating each anomaly's primary driving mechanism. A hierarchical Market Pressure Index aggregates entity-level anomaly scores into graduated market-wide alerts. On 100 U.S. equities (2017-2024), the framework detects all six major stress events with a 3.7-day mean lead time, outperforming baselines by +33 percentage points, with AUC 0.888 and AP 0.626. Case studies on SVB (March 2023) and Japan carry-trade unwind (August 2024) demonstrate that routing weights automatically distinguish localized from systemic crises without labeled supervision.
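
The two mechanisms named in the abstract above, a stress-modulated graph and routing weights over four experts, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the linear blend, the direction of the interpolation (more stress means more weight on data-driven correlations), the softmax routing, and the toy numbers. The paper's actual formulation may differ.

```python
import numpy as np

EXPERTS = ["Price-Shock", "Liquidity", "Systemic-Contagion", "Momentum-Reversal"]


def blend_graph(prior_adj, corr_adj, stress):
    """Interpolate between a prior graph and a correlation graph.

    stress is clipped to [0, 1]; stress = 0 returns the prior graph and
    stress = 1 returns the correlation graph (assumed direction).
    """
    stress = float(np.clip(stress, 0.0, 1.0))
    return (1.0 - stress) * prior_adj + stress * corr_adj


def routing_weights(logits):
    """Numerically stable softmax over per-expert scores."""
    z = np.asarray(logits, dtype=float)
    w = np.exp(z - z.max())
    return w / w.sum()


# Toy two-asset example with hypothetical values.
prior = np.array([[0.0, 1.0], [1.0, 0.0]])  # same-sector link
corr = np.array([[0.0, 0.2], [0.2, 0.0]])   # rolling correlation
adj = blend_graph(prior, corr, stress=0.25)

weights = routing_weights([0.1, 2.0, 0.3, -1.0])
dominant = EXPERTS[int(np.argmax(weights))]
```

In this toy case the largest routing weight falls on the Liquidity expert, which is how the relative weights would be read as a mechanism attribution.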

2510.17057 2026-03-10 cs.LG cs.AI

The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLM CoTs

Nikolaus Howe, Micah Carroll

Comments 28 pages

Abstract

Chain-of-Thought (CoT) monitoring has emerged as a compelling method for detecting harmful behaviors such as reward hacking for reasoning models, under the assumption that models' reasoning processes are informative of such behaviors. In practice, LLM training often produces unintended behaviors due to imperfect reward signals, leading models to develop misaligned tendencies. A common corrective approach is to apply post-hoc instructions to avoid problematic behaviors, but what happens to the model's reasoning process when these instructions conflict with learned behaviors? We investigate this question in simple settings and find that models engage in systematic motivated reasoning -- generating plausible-sounding justifications for violating their instructions while downplaying potential harms or contradictions. Concerningly, we find that as motivated reasoning becomes more prevalent over the course of training, an 8B-parameter CoT monitor is increasingly fooled by the motivated reasoning, being persuaded to judge the answer as following the constitution, despite correctly identifying the answer as contradicting the constitution when not provided with the model's reasoning trace. While we find that large frontier reasoning models closely track human ability in detecting motivated reasoning, this should not give us too much solace, as frontier model developers rely on smaller models for monitoring due to their low latency and deployment costs. Our results underscore the necessity for further research into the emergence and detection of motivated reasoning in model evaluation and oversight. Code for this paper is available at https://github.com/nikihowe/motivated-reasoning. WARNING: some examples in this paper may be upsetting.

2510.14584 2026-03-10 cs.RO

A Robust Placeability Metric for Model-Free Unified Pick-and-Place Reasoning

Benno Wingender, Nils Dengler, Rohit Menon, Sicong Pan, Maren Bennewitz

Abstract

Reliable manipulation of previously unseen objects remains a fundamental challenge for autonomous robotic systems operating in unstructured environments. In particular, robust pick-and-place planning directly from noisy, partial real-world observations, where object surfaces are inherently incomplete due to occlusions (e.g., bottom faces on a tabletop), is difficult. As a result, many existing methods rely on strong object priors (e.g., CAD models) or assume placement on continuous, flat support surfaces such as planar tabletops, without explicitly accounting for edge proximity or inclined supports. In this work, we introduce a robust probabilistic placeability metric that evaluates 6D object placement poses from partial observations by jointly scoring object stability, graspability, and clearance from raw point cloud geometry. Using this metric, we generate diverse multi-orientation placement candidates and condition grasp scoring on these placements, enabling model-free unified pick-and-place reasoning. Simulation and real-robot experiments on unseen objects and challenging support geometries confirm that our metric yields accurate stability predictions and consistently improves end-to-end pick-and-place success by producing stable, collision-free grasp-place pairs directly from partial point clouds.

2510.14462 2026-03-10 cs.CV

Unsupervised Deep Generative Models for Anomaly Detection in Neuroimaging: A Systematic Scoping Review

Youwan Mahé, Elise Bannier, Stéphanie Leplaideur, Elisa Fromont, Francesca Galassi

Abstract

Unsupervised anomaly detection (UAD) based on deep generative modelling has been increasingly explored for identifying pathological brain abnormalities without requiring voxel-level annotations. By learning the distribution of healthy anatomy and generating pseudo-healthy reconstructions, these methods aim to localise deviations in a pathology-agnostic manner. Despite rapid methodological development - from autoencoders and variational autoencoders to generative adversarial networks and diffusion-based models - a structured synthesis of their application in structural neuroimaging is lacking. We conducted a PRISMA-ScR-guided scoping review of studies published between January 2018 and December 2025 that applied unsupervised deep generative models to anomaly detection in brain MRI (and, less frequently, CT). Thirty-three studies met inclusion criteria. Methods were categorised by architectural family, and reported performance was synthesised across major pathology groups, with segmentation (Dice) and detection metrics (AUROC, AUPRC) disaggregated by evaluation level (voxel, slice, subject). For transparency, we also summarised dataset characteristics, dimensionality (2D vs. 3D), and thresholding strategies. Overall, unsupervised generative approaches demonstrate potential for pathology-agnostic anomaly localisation, particularly in settings where annotated data are scarce. However, methodological heterogeneity, limited external validation, and sensitivity to dataset characteristics remain important challenges. Emerging paradigms - including anatomy-aware modelling, diffusion-based frameworks, and alternative normative evaluation metrics - seek to address these limitations and improve robustness and clinical relevance.
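
To make the reconstruction-based pipeline and the Dice metric mentioned above concrete, here is a minimal sketch on a toy 4x4 array. The residual map, the fixed threshold of 0.5, and the "perfect" pseudo-healthy reconstruction are illustrative assumptions, not taken from any of the reviewed studies.

```python
import numpy as np


def anomaly_map(image, pseudo_healthy):
    """Voxel-wise absolute residual between input and its reconstruction."""
    return np.abs(image - pseudo_healthy)


def dice(pred_mask, true_mask, eps=1e-8):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two boolean masks."""
    pred_mask = pred_mask.astype(bool)
    true_mask = true_mask.astype(bool)
    intersection = np.logical_and(pred_mask, true_mask).sum()
    return 2.0 * intersection / (pred_mask.sum() + true_mask.sum() + eps)


# Toy 4x4 "slice": the lesion occupies the top-left 2x2 block.
healthy = np.zeros((4, 4))
image = healthy.copy()
image[:2, :2] = 1.0
truth = image > 0

# Pretend the generative model reconstructs the healthy anatomy perfectly.
residual = anomaly_map(image, healthy)
pred = residual > 0.5  # hypothetical fixed threshold
score = dice(pred, truth)
```

With a real model the reconstruction is imperfect, so the choice of threshold directly affects the reported Dice, which is one reason the review tracks thresholding strategies separately.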

2510.14176 2026-03-10 cs.AI cs.LG

ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning

Roger Creus Castanyer, Faisal Mohamed, Pablo Samuel Castro, Cyrus Neary, Glen Berseth

Comments Published at ICLR 2026

Abstract

Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) -- an automata-based formalism for reward specification -- are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM's effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
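
As a rough illustration of the reward-machine formalism mentioned in the abstract above, the sketch below hand-codes a two-step task ("get the key, then open the door") as a small automaton. The state names, event labels, and reward values are hypothetical. In ARM-FM the machine would be generated by a foundation model from a natural-language specification, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class RewardMachine:
    """Finite automaton whose transitions emit rewards.

    transitions maps (state, event) -> (next_state, reward).
    Events not listed for a state leave the state unchanged with zero reward.
    """
    initial: str
    terminal: set
    transitions: dict
    state: str = field(init=False)

    def __post_init__(self):
        self.state = self.initial

    def reset(self):
        self.state = self.initial

    def step(self, event):
        """Advance on a high-level event and return the reward."""
        next_state, reward = self.transitions.get(
            (self.state, event), (self.state, 0.0)
        )
        self.state = next_state
        return reward

    @property
    def done(self):
        return self.state in self.terminal


# Hypothetical task: pick up a key, then open a door.
rm = RewardMachine(
    initial="u0",
    terminal={"u2"},
    transitions={
        ("u0", "got_key"): ("u1", 0.1),
        ("u1", "opened_door"): ("u2", 1.0),
    },
)

rewards = [rm.step(e) for e in ["opened_door", "got_key", "opened_door"]]
```

Opening the door before holding the key earns nothing, so the automaton state encodes the task decomposition that a flat reward function would have to infer from history.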

2510.13795 2026-03-10 cs.CV cs.AI

Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs

Yi Zhang, Bolin Ni, Xin-Sheng Chen, Heng-Rui Zhang, Yongming Rao, Houwen Peng, Qinglin Lu, Han Hu, Meng-Hao Guo, Shi-Min Hu

Comments homepage: https://open-bee.github.io/

Abstract

Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

2510.11682 2026-03-10 cs.RO cs.AI cs.SY eess.SY

Ego-Vision World Model for Humanoid Contact Planning

Hang Liu, Yuman Gao, Sangli Teng, Yufeng Chi, Yakun Sophia Shao, Zhongyu Li, Maani Ghaffari, Koushil Sreenath

Abstract

Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved sample efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images. Code and dataset are available at our website: https://ego-vcp.github.io/

2510.10002 2026-03-10 cs.AI

Deliberative Dynamics and Value Alignment in LLM Debates

Pratik S. Sachdeva, Tom van Nuenen

Abstract

As large language models (LLMs) are increasingly deployed in sensitive everyday contexts -- offering personal advice, mental health support, and moral guidance -- understanding their behavior in navigating complex moral reasoning is essential. Most evaluations study this sociotechnical alignment through single-turn prompts, but it is unclear if these findings extend to multi-turn settings, and even less clear how they depend on the interaction protocols used to coordinate agentic systems. We address this gap using LLM debate to examine deliberative dynamics and value alignment in multi-turn settings by prompting subsets of three models (GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.0 Flash) to collectively assign blame in 1,000 everyday dilemmas from Reddit's "Am I the Asshole" community. To test order effects and assess verdict revision, we use both synchronous (parallel responses) and round-robin (sequential responses) deliberation structures, mirroring how multi-agent systems are increasingly orchestrated in practice. Our findings show striking behavioral differences. In the synchronous setting, GPT-4.1 showed strong inertia (0.6-3.1% revision rates) while Claude 3.7 Sonnet and Gemini 2.0 Flash were far more flexible (28-41% revision rates). Value patterns also diverged: GPT-4.1 emphasized personal autonomy and direct communication (relative to its deliberation partners), while Claude 3.7 Sonnet and Gemini 2.0 Flash prioritized empathetic dialogue. We further find that deliberation format had a strong impact on model behavior: GPT-4.1 and Gemini 2.0 Flash stood out as highly conforming relative to Claude 3.7 Sonnet, with their verdict behavior strongly shaped by order effects. We provide additional results on open-source models (DeepSeek-V3.2 and Llama 3.1).

2510.08131 2026-03-10 cs.CV

Real-Time Motion-Controllable Autoregressive Video Diffusion

Kesen Zhao, Jiaxin Shi, Beier Zhu, Junbao Zhou, Xiaolong Shen, Yuan Zhou, Qianru Sun, Hanwang Zhang

Abstract

Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.

2510.06146 2026-03-10 cs.RO

Vision-Guided Targeted Grasping and Vibration for Robotic Pollination in Controlled Environments

Jaehwan Jeong, Tuan-Anh Vu, Radha Lahoti, Jiawen Wang, Vivek Alumootil, Sangpil Kim, M. Khalid Jawed

Comments YouTube: https://youtu.be/XHLA7pEXhZU; GitHub: https://github.com/StructuresComp/robotic-pollination

Abstract

Robotic pollination offers a promising alternative to manual labor and bumblebee-assisted methods in controlled agriculture, where wind-driven pollination is absent and regulatory restrictions limit the use of commercial pollinators. In this work, we present and validate a vision-guided robotic framework that uses data from an end-effector-mounted RGB-D sensor and combines 3D plant reconstruction, targeted grasp planning, and physics-based vibration modeling to enable precise pollination. First, the plant is reconstructed in 3D and registered to the robot coordinate frame to identify obstacle-free grasp poses along the main stem. Second, a discrete elastic rod model predicts the relationship between actuation parameters and flower dynamics, guiding the selection of optimal pollination strategies. Finally, a manipulator with soft grippers grasps the stem and applies controlled vibrations to induce pollen release. End-to-end experiments demonstrate a 92.5% main-stem grasping success rate, and simulation-guided optimization of vibration parameters further validates the feasibility of our approach, ensuring that the robot can safely and effectively perform pollination without damaging the flower. To our knowledge, this is the first robotic system to jointly integrate vision-based grasping and vibration modeling for automated precision pollination.

2510.01402 2026-03-10 cs.RO cs.SY eess.SY

Beyond Collision Cones: Dynamic Obstacle Avoidance for Nonholonomic Robots via Dynamic Parabolic Control Barrier Functions

Hun Kuk Park, Taekyung Kim, Dimitra Panagou

Comments The first two authors contributed equally to this work. 2026 IEEE International Conference on Robotics and Automation (ICRA). Project page: https://www.taekyung.me/dpcbf

Abstract

Control Barrier Functions (CBFs) are a powerful tool for ensuring the safety of autonomous systems, yet applying them to nonholonomic robots in cluttered, dynamic environments remains an open challenge. State-of-the-art methods often rely on collision-cone or velocity-obstacle constraints which, by only considering the angle of the relative velocity, are inherently conservative and can render the CBF-based quadratic program infeasible, particularly in dense scenarios. To address this issue, we propose a Dynamic Parabolic Control Barrier Function (DPCBF) that defines the safe set using a parabolic boundary. The parabola's vertex and curvature dynamically adapt based on both the distance to an obstacle and the magnitude of the relative velocity, creating a less restrictive safety constraint. We prove that the proposed DPCBF is valid for a kinematic bicycle model subject to input constraints. Extensive comparative simulations demonstrate that our DPCBF-based controller significantly enhances navigation success rates and QP feasibility compared to baseline methods. Our approach successfully navigates through dense environments with up to 100 dynamic obstacles, scenarios where collision cone-based methods fail due to infeasibility.

2510.01089 2026-03-10 cs.LG q-bio.QM

Double projection for reconstructing dynamical systems: between stochastic and deterministic regimes

Viktor Sip, Martin Breyton, Spase Petkoski, Viktor Jirsa

Abstract

Learning stochastic models of dynamical systems from observed data is of interest in many scientific fields. Here, we propose a new method for this task within the family of dynamical variational autoencoders. The proposed double projection method estimates both the system state trajectories and the noise time series from data. This approach naturally allows us to perform multi-step system evolution and to learn models with a comparatively low-dimensional state space. We evaluate the performance of the method on six benchmark problems, including both simulated and experimental data. We further illustrate the effects of the teacher forcing interval of the multi-step scheme on the nature of the internal dynamics and compare the resulting behavior to that of deterministic models of equivalent architecture.

2510.00657 2026-03-10 cs.SD

XPPG-PCA: Reference-free automatic speech severity evaluation with principal components

Bence Mark Halpern, Thomas B. Tienkamp, Teja Rebernik, Rob J. J. H. van Son, Sebastiaan A. H. J. de Visscher, Max J. H. Witjes, Defne Abur, Tomoki Toda

Comments 14 pages, 4 figures. Author Accepted Manuscript version of the IEEE Journal of Selected Topics in Signal Processing article with the same title

Journal ref
IEEE Journal of Selected Topics in Signal Processing 2026
Abstract

Reliably evaluating the severity of a speech pathology is crucial in healthcare. However, the current reliance on expert evaluations by speech-language pathologists presents several challenges: while their assessments are highly skilled, they are also subjective, time-consuming, and costly, which can limit the reproducibility of clinical studies and place a strain on healthcare resources. While automated methods exist, they have significant drawbacks. Reference-based approaches require transcriptions or healthy speech samples, restricting them to read speech and limiting their applicability. Existing reference-free methods are also flawed; supervised models often learn spurious shortcuts from data, while handcrafted features are often unreliable and restricted to specific speech tasks. This paper introduces XPPG-PCA (x-vector phonetic posteriorgram principal component analysis), a novel, unsupervised, reference-free method for speech severity evaluation. Using three Dutch oral cancer datasets, we demonstrate that XPPG-PCA performs comparably to, or exceeds, established reference-based methods. Our experiments confirm its robustness against data shortcuts and noise, showing its potential for real-world clinical use. Taken together, our results show that XPPG-PCA provides a robust, generalizable solution for the objective assessment of speech pathology, with the potential to significantly improve the efficiency and reliability of clinical evaluations across a range of disorders. An open-source implementation is available.

2510.00444 2026-03-10 cs.CL

TokMem: One-Token Procedural Memory for Large Language Models

Zijun Wu, Yongchang Hao, Lili Mou

Comments Accepted by ICLR 2026

Abstract

Large language models are typically controlled via prompts, which must be repeatedly re-processed for every new query and are difficult to reuse modularly. We introduce TokMem, a procedural memory framework that compiles each reusable task procedure into a single trainable memory token. Each token serves as both a procedure index and a control signal that steers generation, enabling targeted behaviors with constant-size overhead. TokMem keeps the backbone LLM frozen and stores procedural knowledge entirely in these dedicated units, so new procedures can be added continually without interfering with existing ones. We evaluate TokMem in two settings: atomic recall over 1,000 Super-Natural Instructions tasks and compositional recall on multi-step function-calling. Our results show that TokMem consistently outperforms retrieval-augmented prompting while avoiding repeated context overhead. Moreover, it matches or exceeds parameter-efficient fine-tuning with substantially fewer trainable parameters.

2509.26354 2026-03-10 cs.AI cs.CL cs.LG

Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents

Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, Jing Shao

Comments Published in ICLR 2026

Abstract

Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities. However, self-evolution also introduces novel risks overlooked by current safety research. In this work, we study the case where an agent's self-evolution deviates in unintended ways, leading to undesirable or even harmful outcomes. We refer to this as Misevolution. To provide a systematic investigation, we evaluate misevolution along four key evolutionary pathways: model, memory, tool, and workflow. Our empirical findings reveal that misevolution is a widespread risk, affecting agents built even on top-tier LLMs (e.g., Gemini-2.5-Pro). Different emergent risks are observed in the self-evolutionary process, such as the degradation of safety alignment after memory accumulation, or the unintended introduction of vulnerabilities in tool creation and reuse. To our knowledge, this is the first study to systematically conceptualize misevolution and provide empirical evidence of its occurrence, highlighting an urgent need for new safety paradigms for self-evolving agents. Finally, we discuss potential mitigation strategies to inspire further research on building safer and more trustworthy self-evolving agents. Our code and data are available at https://github.com/ShaoShuai0605/Misevolution . Warning: this paper includes examples that may be offensive or harmful in nature.

2509.25429 2026-03-10 cs.LG cs.GT

Feedback Control for Small Budget Pacing

Sreeja Apparaju, Yichuan Niu, Xixi Qi

Abstract

Budget pacing is critical in online advertising to align spend with campaign goals under dynamic auctions. Existing pacing methods often rely on ad-hoc parameter tuning, which can be unstable and inefficient. We propose a principled controller that combines bucketized hysteresis with proportional feedback to provide stable and adaptive spend control. Our method provides a framework and analysis for parameter selection that enables accurate tracking of desired spend rates across campaigns. Experiments in real-world auctions demonstrate significant improvements in pacing accuracy and delivery consistency, reducing pacing error by 13% and λ-volatility by 54% compared to the baseline method. By bridging control theory with advertising systems, our approach offers a scalable and reliable solution for budget pacing, with particular benefits for small-budget campaigns.

2509.24472 2026-03-10 cs.LG

FS-KAN: Permutation Equivariant Kolmogorov-Arnold Networks via Function Sharing

Ran Elbaz, Guy Bar-Shalom, Yam Eitan, Fabrizio Frasca, Haggai Maron

Abstract

Permutation equivariant neural networks employing parameter-sharing schemes have emerged as powerful models for leveraging a wide range of data symmetries, significantly enhancing the generalization and computational efficiency of the resulting models. Recently, Kolmogorov-Arnold Networks (KANs) have demonstrated promise through their improved interpretability and expressivity compared to traditional architectures based on MLPs. While equivariant KANs have been explored in recent literature for a few specific data types, a principled framework for applying them to data with permutation symmetries in a general context remains absent. This paper introduces Function Sharing KAN (FS-KAN), a principled approach to constructing equivariant and invariant KA layers for arbitrary permutation symmetry groups, unifying and significantly extending previous work in this domain. We derive the basic construction of these FS-KAN layers by generalizing parameter-sharing schemes to the Kolmogorov-Arnold setup and provide a theoretical analysis demonstrating that FS-KANs have the same expressive power as networks that use standard parameter-sharing layers, allowing us to transfer well-known and important expressivity results from parameter-sharing networks to FS-KANs. Empirical evaluations on multiple data types and symmetry groups show that FS-KANs exhibit superior data efficiency compared to standard parameter-sharing layers, by a wide margin in certain cases, while preserving the interpretability and adaptability of KANs, making them an excellent architecture choice in low-data regimes.
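
One way to picture "function sharing" is a layer on a set of n scalars in which every diagonal edge uses one univariate function and every off-diagonal edge uses another, mirroring the two-weight sharing pattern of standard permutation-equivariant linear layers. The sketch below is an illustrative reading of that idea, not the paper's construction: the two fixed functions stand in for learned splines, and only the full symmetric group acting on a single set is covered.

```python
import numpy as np


def shared_function_layer(x, phi_self, phi_other):
    """Map a set of n scalars to n scalars with two shared univariate functions.

    y_i = phi_self(x_i) + sum_{j != i} phi_other(x_j)
    Because the same two functions are reused for every index pair,
    permuting the inputs permutes the outputs in the same way.
    """
    other = phi_other(x)
    return phi_self(x) + (other.sum() - other)


# Hypothetical fixed functions standing in for learned splines.
phi_self = np.tanh
phi_other = np.square

rng = np.random.default_rng(0)
x = rng.normal(size=5)
perm = rng.permutation(5)

y = shared_function_layer(x, phi_self, phi_other)
y_perm = shared_function_layer(x[perm], phi_self, phi_other)

# Equivariance check: layer(permute(x)) equals permute(layer(x)).
is_equivariant = np.allclose(y_perm, y[perm])
```

Replacing the two functions with scalar multiplications recovers the familiar two-parameter equivariant linear layer, which is the sense in which function sharing generalizes parameter sharing here.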

2509.23626 2026-03-10 cs.CV

Efficient Domain-Adaptive Multi-Task Dense Prediction with Vision Foundation Models

Beomseok Kang, Niluthpol Chowdhury Mithun, Mikhail Sizintsev, Han-Pang Chiu, Supun Samarasekera

Abstract

Multi-task dense prediction, which aims to jointly solve tasks like semantic segmentation and depth estimation, is crucial for robotics applications but suffers from domain shift when deploying models in new environments. While unsupervised domain adaptation (UDA) addresses this challenge for single tasks, existing multi-task UDA methods primarily rely on adversarial learning approaches that are less effective than recent self-training techniques. In this paper, we introduce FAMDA, a simple yet effective UDA framework that addresses this limitation by leveraging Vision Foundation Models (VFMs) as powerful teachers within a self-training paradigm. Our approach integrates Segmentation and Depth foundation models into a self-training paradigm to generate high-quality pseudo-labels for the target domain, effectively distilling their robust generalization capabilities into a single, efficient student network. Extensive experiments show that FAMDA achieves state-of-the-art (SOTA) performance on standard synthetic-to-real UDA multi-task learning (MTL) benchmarks and a challenging new day-to-night adaptation task. Our framework enables the training of highly efficient models; a lightweight variant achieves SOTA accuracy while being more than 10X smaller than foundation models, highlighting FAMDA's suitability for creating domain-adaptive and efficient models for resource-constrained robotics applications.

2509.23488 2026-03-10 cs.AI cs.CL

Mapping Overlaps in Benchmarks through Perplexity in the Wild

Siyang Wu, Honglin Bao, Sida Li, Ari Holtzman, James A. Evans

Abstract

We introduce benchmark signatures to characterize the capacity demands of LLM benchmarks and their overlaps. Signatures are sets of salient tokens from in-the-wild corpora whose model token perplexity, reflecting training exposure, predicts benchmark performance. We extract them via stepwise forward selection with linear regression in a meta-evaluation spanning 32 LLMs and 89 benchmarks across diverse domains. We then analyze how these signatures relate to both the semantic similarity of benchmark questions and the correlation structure of model performance. While performance correlations are uniformly high and semantic overlaps stay in a narrow mid-range, benchmark signatures reveal more nuanced structure. For instance, they uncover substantial overlap between benchmarks in knowledge and reasoning tasks, whereas benchmarks in culture- and humanity-oriented domains show low similarity with each other. Unlike raw performance correlations, which are influenced by benchmark-orthogonal factors such as question formats, signatures are robust to such confounds. We further identify cross-functional overlaps between logic, math, language, instruction following, and cultural/world modeling, with coding emerging as the most isolated function, interacting only moderately with the ability to detect missing information. Qualitative analysis shows that only the knowledge signature aligns with actual knowledge, suggesting that LLM semantic organization may differ from human conceptual structure. Together, these findings offer insights into benchmark validity, LLM sensitivities, and the landscape of interconnected LLM capacities. We have open-sourced the code and data at https://github.com/siyangwu1/Benchmark-Signature-Repository.
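
The extraction step named in the abstract above, stepwise forward selection with linear regression, can be sketched generically as follows. This is a textbook greedy procedure on synthetic data, not the authors' code: the feature matrix stands in for per-model token perplexities, the target for benchmark scores, and the sizes (200 rows, 10 columns) and coefficients are made up for the demonstration.

```python
import numpy as np


def forward_select(X, y, k):
    """Greedy forward selection for least-squares regression.

    At each step, add the column of X that most reduces the residual
    sum of squares of an ordinary least-squares fit (with intercept).
    Returns the list of selected column indices in the order chosen.
    """
    n, d = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(d):
            if j in selected:
                continue
            cols = np.column_stack([np.ones(n), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = float(np.sum((y - cols @ coef) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected


# Synthetic check: only columns 2 and 7 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.01 * rng.normal(size=200)
selected = forward_select(X, y, k=2)
```

On this synthetic target the two informative columns are the ones selected; with real perplexity features the candidate set is far larger, so the stopping rule and any regularization matter much more than in this toy.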

2509.23223 2026-03-10 cs.RO cs.SY eess.SY

SAC-Loco: Safe and Adjustable Compliant Quadrupedal Locomotion

Aoqian Zhang, Zixuan Zhuang, Chunzheng Wang, Shuzhi Sam Ge, Fan Shi, Cheng Xiang

Abstract

Quadruped robots are designed to achieve agile and robust locomotion by drawing inspiration from legged animals. However, most existing control methods for quadruped robots lack a key capacity observed in animals: the ability to exhibit diverse compliance behaviors while ensuring stability when experiencing external forces. In particular, achieving adjustable compliance while maintaining robust safety under force disturbances remains a significant challenge. In this work, we propose a safety-aware compliant locomotion framework that integrates adjustable disturbance compliance with robust failure prevention. We first train a force-compliant policy with adjustable compliance levels using a teacher-student reinforcement learning framework, allowing deployment without explicit force sensing. To handle disturbances beyond the limits of compliant control, we develop a safety-oriented policy for rapid recovery and stabilization. Finally, we introduce a learned safety critic that monitors the robot's safety in real time and coordinates between compliant locomotion and recovery behaviors. Together, this framework enables quadruped robots to achieve smooth force compliance and robust safety under a wide range of external force disturbances.

2509.23184 2026-03-10 cs.CL

PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space

Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Zitong Wang, Ziwei He, Xinbing Wang, Zhouhan Lin

Abstract

The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation of each individual token? To address this, we propose a novel pretraining methodology: Pretraining Language Models with Latent Thoughts (PonderLM-2). Our approach pretrains a language model (LM) to first generate an intermediate latent thought (the last hidden state of the current position), which is then used as input to predict the actual subsequent token. This additional computational step enables the LM to refine its prediction within unconstrained continuous space. Our experiments demonstrate that, at an identical inference cost, an LM that generates one additional latent thought per token outperforms a standard model with double the parameters. For instance, our PonderLM-2-Pythia-1.4B, pretrained on 300B tokens from the Pile, significantly surpasses the vanilla Pythia-2.8B trained on the same data on both language modeling and a range of general downstream tasks. Furthermore, increasing the number of latent thoughts generated before each actual token, which forms a chain analogous to CoT, consistently improves the model's performance. The code is available at https://github.com/LUMIA-Group/PonderLM-2.

2509.23077 2026-03-10 cs.LG

CLAD-Net: Continual Activity Recognition in Multi-Sensor Wearable Systems

Reza Rahimi Azghan, Gautham Krishna Gudur, Mohit Malu, Edison Thomaz, Giulia Pedrielli, Pavan Turaga, Hassan Ghasemzadeh

Abstract

The rise of deep learning has greatly advanced human behavior monitoring using wearable sensors, particularly human activity recognition (HAR). While deep models have been widely studied, most assume stationary data distributions - an assumption often violated in real-world scenarios. For example, sensor data from one subject may differ significantly from another, leading to distribution shifts. In continual learning, this shift is framed as a sequence of tasks, each corresponding to a new subject. Such settings suffer from catastrophic forgetting, where prior knowledge deteriorates as new tasks are learned. This challenge is compounded by the scarcity and inconsistency of labeled data in human studies. To address these issues, we propose CLAD-Net (Continual Learning with Attention and Distillation), a framework enabling wearable-sensor models to be updated continuously without sacrificing performance on past tasks. CLAD-Net integrates a self-supervised transformer, acting as long-term memory, with a supervised Convolutional Neural Network (CNN) trained via knowledge distillation for activity classification. The transformer captures global activity patterns through cross-attention across body-mounted sensors, learning generalizable representations without labels. Meanwhile, the CNN leverages knowledge distillation to retain prior knowledge during subject-wise fine-tuning. On PAMAP2, CLAD-Net achieves 91.36 percent final accuracy with only 8.78 percent forgetting, surpassing memory-based and regularization-based baselines such as Experience Replay and Elastic Weight Consolidation. In semi-supervised settings with only 10-20 percent labeled data, CLAD-Net still delivers strong performance, demonstrating robustness to label scarcity. Ablation studies further validate each module's contribution.
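
As a rough illustration of the knowledge distillation this abstract refers to, the sketch below computes a temperature-softened KL divergence between teacher and student class distributions. This is the commonly used soft-target formulation and is only an assumption here; the abstract does not specify CLAD-Net's exact loss, and the function names and default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the old model
    q = softmax(student_logits, temperature)  # predictions of the updated model
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Identical logits give zero loss; diverging logits are penalized.
same = distillation_loss([1.0, 2.0, 0.5], [1.0, 2.0, 0.5])
different = distillation_loss([2.0, 1.0, 0.5], [1.0, 2.0, 0.5])
```

In a continual-learning setting, a term like this is typically added to the supervised loss so that the model fine-tuned on a new subject stays close to the outputs of the model trained on earlier subjects.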

2509.22295 2026-03-10 cs.LG

Aurora: Towards Universal Generative Multimodal Time Series Forecasting

Xingjian Wu, Jianxin Jin, Wanghui Qiu, Peng Chen, Yang Shu, Bin Yang, Chenjuan Guo

Abstract

Cross-domain generalization is crucial in time series forecasting because similar historical information may lead to distinct future trends owing to domain-specific characteristics. Recent work focuses on building unimodal time series foundation models and end-to-end multimodal supervised models. Because domain-specific knowledge is often contained in modalities such as text, the former do not explicitly exploit it, which hinders performance. The latter are tailored to end-to-end scenarios and do not support zero-shot inference in cross-domain scenarios. In this work, we introduce Aurora, a Multimodal Time Series Foundation Model, which supports multimodal inputs and zero-shot inference. Pretrained on a Cross-domain Multimodal Time Series Corpus, Aurora can adaptively extract and focus on key domain knowledge contained in corresponding text or image modalities, thus possessing strong cross-domain generalization capability. Through tokenization, encoding, and distillation, Aurora extracts multimodal domain knowledge as guidance and then uses Modality-Guided Multi-head Self-Attention to inject it into the modeling of temporal representations. In the decoding phase, the multimodal representations are used to generate the conditions and prototypes of future tokens, contributing to a novel Prototype-Guided Flow Matching for generative probabilistic forecasting. Comprehensive experiments on 5 well-recognized benchmarks, including TimeMMD, TSFM-Bench, ProbTS, TFB, and EPF, demonstrate the consistent state-of-the-art performance of Aurora in both unimodal and multimodal scenarios.

2509.21715 2026-03-10 cs.CV

Motion-Aware Transformer for Multi-Object Tracking

Xu Yang, Gady Agam

Abstract

Multi-object tracking (MOT) in videos remains challenging due to complex object motions and crowded scenes. Recent DETR-based frameworks offer end-to-end solutions but typically process detection and tracking queries jointly within a single Transformer Decoder layer, leading to conflicts and degraded association accuracy. We introduce the Motion-Aware Transformer (MATR), which explicitly predicts object movements across frames to update track queries in advance. By reducing query collisions, MATR enables more consistent training and improves both detection and association. Extensive experiments on DanceTrack, SportsMOT, and BDD100k show that MATR delivers significant gains across standard metrics. On DanceTrack, MATR improves HOTA by more than 9 points over MOTR without additional data and reaches a new state-of-the-art score of 71.3 with supplementary data. MATR also achieves state-of-the-art results on SportsMOT (72.2 HOTA) and BDD100k (54.7 mTETA, 41.6 mHOTA) without relying on external datasets. These results demonstrate that explicitly modeling motion within end-to-end Transformers offers a simple yet highly effective approach to advancing multi-object tracking.

2509.21571 2026-03-10 cs.RO

Autonomous UAV-Quadruped Docking in Complex Terrains via Active Posture Alignment and Constraint-Aware Control

Haozhe Xu, Cheng Cheng, Hongrui Sang, Zhipeng Wang, Qiyong He, Xiuxian Li, Bin He

Abstract

Autonomous docking between Unmanned Aerial Vehicles (UAVs) and ground robots is essential for heterogeneous systems, yet most existing approaches target wheeled platforms whose limited mobility constrains exploration in complex terrains. Quadruped robots offer superior adaptability but undergo frequent posture variations, making it difficult to provide a stable landing surface for UAVs. To address these challenges, we propose an autonomous UAV-quadruped docking framework for GPS-denied environments. On the quadruped side, a Hybrid Internal Model with Horizontal Alignment (HIM-HA), learned via deep reinforcement learning, actively stabilizes the torso to provide a level platform. On the UAV side, a three-phase strategy is adopted, consisting of long-range acquisition with a median-filtered YOLOv8 detector, close-range tracking with a constraint-aware controller that integrates a Nonsingular Fast Terminal Sliding Mode Controller (NFTSMC) and a logarithmic Barrier Function (BF) to guarantee finite-time error convergence under field-of-view (FOV) constraints, and terminal descent guided by a Safety Period (SP) mechanism that jointly verifies tracking accuracy and platform stability. The proposed framework is validated in both simulation and real-world scenarios, successfully achieving docking on outdoor staircases higher than 17 cm and rough slopes steeper than 30 degrees. Supplementary materials and videos are available at: https://uav-quadruped-docking.github.io.

2509.17287 2026-03-10 cs.RO cs.CV

Event-Based Visual Teach-and-Repeat via Fast Fourier-Domain Cross-Correlation

Gokul B. Nair, Alejandro Fontan, Michael Milford, Tobias Fischer

Comments 8 Pages, 5 Figures, Under Review

Abstract

Visual teach-and-repeat (VT&R) navigation enables robots to autonomously traverse previously demonstrated paths using visual feedback. We present a novel event-camera-based VT&R system. Our system formulates event-stream matching as frequency-domain cross-correlation, transforming spatial convolutions into efficient Fourier-space multiplications. By exploiting the binary structure of event frames and applying image compression techniques, we achieve a processing latency of just 2.88 ms, about 3.5 times faster than conventional camera-based baselines that are optimised for runtime efficiency. Experiments using a Prophesee EVK4 HD event camera mounted on an AgileX Scout Mini robot demonstrate successful autonomous navigation across 3000+ meters of indoor and outdoor trajectories in daytime and nighttime conditions. Our system maintains Cross-Track Errors (XTE) below 15 cm, demonstrating the practical viability of event-based perception for real-time VT&R navigation.
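
The frequency-domain cross-correlation described in this abstract rests on the cross-correlation theorem: correlating two frames in space is equivalent to multiplying one spectrum by the complex conjugate of the other. The sketch below shows that idea on synthetic binary frames using NumPy. The function name, the frame size, and the circular-shift model are illustrative assumptions; the authors' compression steps and actual pipeline are not reproduced here.

```python
import numpy as np


def fft_cross_correlation_shift(reference, query):
    """Return the circular (rows, cols) shift that aligns query with reference."""
    # Cross-correlation theorem: corr = IFFT(FFT(reference) * conj(FFT(query))).
    spectrum = np.fft.fft2(reference) * np.conj(np.fft.fft2(query))
    correlation = np.fft.ifft2(spectrum).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Peaks past the midpoint wrap around and correspond to negative shifts.
    shift = tuple(
        int(p) if p <= n // 2 else int(p) - n
        for p, n in zip(peak, correlation.shape)
    )
    return shift, correlation


# Synthetic sparse binary "event frame" and a shifted copy of it.
rng = np.random.default_rng(0)
query_frame = (rng.random((32, 32)) < 0.1).astype(float)
reference_frame = np.roll(query_frame, (3, -5), axis=(0, 1))
estimated_shift, corr_map = fft_cross_correlation_shift(reference_frame, query_frame)
```

Here `estimated_shift` is the roll that, applied to the query frame, reproduces the reference frame. In a teach-and-repeat setting, a horizontal offset of this kind is the sort of signal that could drive a steering correction.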

2509.16614 2026-03-10 cs.RO cs.LG cs.SY eess.SY

ORN-CBF: Learning Observation-conditioned Residual Neural Control Barrier Functions via Hypernetworks

Bojan Derajić, Sebastian Bernhard, Wolfgang Hönig

Abstract

Control barrier functions (CBFs) have been demonstrated as an effective method for safety-critical control of autonomous systems. Although CBFs are simple to deploy, their design remains challenging, motivating the development of learning-based approaches. Yet, issues such as suboptimal safe sets, applicability in partially observable environments, and lack of rigorous safety guarantees persist. In this work, we propose observation-conditioned neural CBFs based on Hamilton-Jacobi (HJ) reachability analysis, which approximately recover the maximal safe sets. We exploit certain mathematical properties of the HJ value function, ensuring that the predicted safe set never intersects with the observed failure set. Moreover, we leverage a hypernetwork-based architecture that is particularly suitable for the design of observation-conditioned safety filters. The proposed method is examined both in simulation and hardware experiments for a ground robot and a quadcopter. The results show improved success rates and generalization to out-of-domain environments compared to the baselines.
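
To make the role of a CBF as a safety filter concrete, the sketch below uses a deliberately simple, hand-written example rather than the learned, observation-conditioned CBFs of the paper. It assumes hypothetical single-integrator dynamics x' = u and the barrier h(x) = x_max - x. The CBF condition h'(x) + alpha * h(x) >= 0 then reduces to u <= alpha * (x_max - x), so the minimally invasive filter simply clips the nominal command.

```python
def cbf_safety_filter(x, u_nominal, x_max, alpha=1.0):
    """Return the control closest to u_nominal that satisfies the CBF condition.

    With x' = u and h(x) = x_max - x, the condition h' + alpha * h >= 0
    becomes u <= alpha * (x_max - x).
    """
    u_bound = alpha * (x_max - x)
    return min(u_nominal, u_bound)


def simulate(x0, u_nominal, x_max, dt=0.01, steps=1000):
    """Forward-Euler rollout of the filtered system."""
    x = x0
    for _ in range(steps):
        x += dt * cbf_safety_filter(x, u_nominal, x_max)
    return x


# The nominal controller pushes toward the boundary; the filter keeps x <= x_max.
final_state = simulate(x0=0.0, u_nominal=1.0, x_max=1.0)
```

The filter leaves safe commands untouched and only intervenes near the boundary, which is the behavior a safety filter is meant to provide.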

2509.16053 2026-03-10 cs.RO cs.AI

Compose by Focus: Scene Graph-based Atomic Skills

Han Qi, Changhe Chen, Heng Yang

Comments Accepted to ICRA 2026. Website: https://computationalrobotics.seas.harvard.edu/SkillComposition/

Abstract

A key requirement for generalist robots is compositional generalization - the ability to combine atomic skills to solve complex, long-horizon tasks. While prior work has primarily focused on synthesizing a planner that sequences pre-learned skills, robust execution of the individual skills themselves remains challenging, as visuomotor policies often fail under distribution shifts induced by scene composition. To address this, we introduce a scene graph-based representation that focuses on task-relevant objects and relations, thereby mitigating sensitivity to irrelevant variation. Building on this idea, we develop a scene-graph skill learning framework that integrates graph neural networks with diffusion-based imitation learning, and further combine "focused" scene-graph skills with a vision-language model (VLM) based task planner. Experiments in both simulation and real-world manipulation tasks demonstrate substantially higher success rates than state-of-the-art baselines, highlighting improved robustness and compositional generalization in long-horizon tasks.

2509.15254 2026-03-10 cs.RO

OIPP: Object-Adaptive Impact Point Predictor for Catching Diverse In-Flight Objects

Ngoc Huy Nguyen, Kazuki Shibata, Takamitsu Matsubara

Comments 9 pages, 9 figures

Abstract

In this study, we address the problem of in-flight object catching using a quadruped robot with a basket. Our objective is to accurately predict the impact point, defined as the object's landing position. This task poses two key challenges: the absence of public datasets capturing diverse objects under unsteady aerodynamics, which are essential for training reliable predictors; and the difficulty of accurate early-stage impact point prediction when trajectories appear similar across objects. To overcome these issues, we construct a real-world dataset of 8,000 trajectories from 20 objects, providing a foundation for advancing in-flight object catching under complex aerodynamics. We then propose the Object-Adaptive Impact Point Predictor (OIPP), consisting of two modules: (i) an Object-Adaptive Encoder (OAE) that extracts object-dependent representations from motion histories, and (ii) an Impact Point Predictor (IPP) that estimates the impact point from these representations. Two IPP variants are implemented: a Neural Acceleration Estimator (NAE)-based method that predicts trajectories and derives the impact point, and a Direct Point Estimator (DPE)-based method that directly outputs it. Experimental results show that our dataset is more diverse and complex than existing datasets, and that our method outperforms baselines on both 15 seen and 5 unseen objects. Furthermore, we show that improved early-stage prediction enhances catching success in simulation and demonstrate the effectiveness of our approach through real-robot experiments. The demonstration is available at https://sites.google.com/view/robot-catching-2025.

2509.15237 2026-03-10 cs.AI cs.CV cs.LG

MICA: Multi-Agent Industrial Coordination Assistant

Di Wen, Kunyu Peng, Junwei Zheng, Yufan Chen, Yitian Shi, Jiale Wei, Ruiping Liu, Kailun Yang, Rainer Stiefelhagen

Comments Accepted to ICRA 2026. The source code will be made publicly available at https://github.com/Kratos-Wen/MICA

Abstract

Industrial workflows demand adaptive and trustworthy assistance that can operate under limited computing, connectivity, and strict privacy constraints. In this work, we present MICA (Multi-Agent Industrial Coordination Assistant), a perception-grounded and speech-interactive system that delivers real-time guidance for assembly, troubleshooting, part queries, and maintenance. MICA coordinates five role-specialized language agents, audited by a safety checker, to ensure accurate and compliant support. To achieve robust step understanding, we introduce Adaptive Step Fusion (ASF), which dynamically blends expert reasoning with online adaptation from natural speech feedback. Furthermore, we establish a new multi-agent coordination benchmark across representative task categories and propose evaluation metrics tailored to industrial assistance, enabling systematic comparison of different coordination topologies. Our experiments demonstrate that MICA consistently improves task success, reliability, and responsiveness over baseline structures, while remaining deployable on practical offline hardware. Together, these contributions highlight MICA as a step toward deployable, privacy-preserving multi-agent assistants for dynamic factory environments. The source code will be made publicly available at https://github.com/Kratos-Wen/MICA.