ExoPredicator: Learning Abstract Models of Dynamic Worlds for Robot Planning
Comments ICLR 2026. The last two authors contributed equally in co-advising
Yichao Liang, Dat Nguyen, Cambridge Yang, Tianyang Li, Joshua B. Tenenbaum, Carl Edward Rasmussen, Adrian Weller, Zenna Tavares, Tom Silver, Kevin Ellis
Comments ICLR 2026. The last two authors contributed equally in co-advising
Long-horizon embodied planning is challenging because the world does not only change through an agent's actions: exogenous processes (e.g., water heating, dominoes cascading) unfold concurrently with the agent's actions. We propose a framework for abstract world models that jointly learns (i) symbolic state representations and (ii) causal processes for both endogenous actions and exogenous mechanisms. Each causal process models the time course of a stochastic cause-effect relation. We learn these world models from limited data via variational Bayesian inference combined with LLM proposals. Across five simulated tabletop robotics environments, the learned models enable fast planning that generalizes to held-out tasks with more objects and more complex goals, outperforming a range of baselines.
Yuan Zhao, Youwei Pang, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Huchuan Lu, Xiaoqi Zhao
Comments Accepted by CVPR 2026
Existing anomaly detection (AD) methods often treat the modality and class as independent factors. Although this paradigm has enriched the development of AD research branches and produced many specialized models, it has also led to fragmented solutions and excessive memory overhead. Moreover, reconstruction-based multi-class approaches typically rely on shared decoding paths, which struggle to handle large variations across domains, resulting in distorted normality boundaries, domain interference, and high false alarm rates. To address these limitations, we propose UniMMAD, a unified framework for multi-modal and multi-class anomaly detection. At the core of UniMMAD is a Mixture-of-Experts (MoE)-driven feature decompression mechanism, which enables adaptive and disentangled reconstruction tailored to specific domains. This process is guided by a ``general to specific'' paradigm. In the encoding stage, multi-modal inputs of varying combinations are compressed into compact, general-purpose features. The encoder incorporates a feature compression module to suppress latent anomalies, encourage cross-modal interaction, and avoid shortcut learning. In the decoding stage, the general features are decompressed into modality-specific and class-specific forms via a sparsely-gated cross MoE, which dynamically selects expert pathways based on input modality and class. To further improve efficiency, we design a grouped dynamic filtering mechanism and a MoE-in-MoE structure, reducing parameter usage by 75\% while maintaining sparse activation and fast inference. UniMMAD achieves state-of-the-art performance on 9 anomaly detection datasets, spanning 3 fields, 12 modalities, and 66 classes. The source code will be available at https://github.com/yuanzhao-CVLAB/UniMMAD.
Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei, Ke Xu, Minlie Huang, Han Qiu
Current answering paradigms for Large Reasoning Models (LRMs) often fail to account for the fact that some questions may lie beyond the model's operational capability boundary, leading to long but unproductive reasoning. In this paper, we study whether LRMs expose early signals predictive of such cases, and whether these signals can be used to mitigate unproductive reasoning. In black-box settings, we find that reasoning expressions contain failure-predictive signals. In white-box settings, we show that the hidden states of the last input token contain information that is predictive of whether a question will not be solved correctly under our evaluation setup. Building on these observations, we propose two test-time monitoring strategies: reasoning expression monitoring and hidden states monitoring, that reduce token usage by 62.7-93.6%, substantially improving efficiency and reliability while largely preserving accuracy.
Ziheng Cheng, Xin Guo, Yufei Zhang
The theory of continuous-time reinforcement learning (RL) has progressed rapidly in recent years. While the ultimate objective of RL is typically to learn deterministic control policies, most existing continuous-time RL methods rely on stochastic policies. Such approaches often require sampling actions at very high frequencies, and involve computationally expensive expectations over continuous action spaces, resulting in high-variance gradient estimates and slow convergence. In this paper, we introduce and develop deterministic policy gradient (DPG) methods for continuous-time RL. We derive a continuous-time policy gradient formula expressed as the expected gradient of an advantage rate function and establish a martingale characterization for both the value function and the advantage rate. These theoretical results provide tractable estimators for deterministic policy gradients in continuous-time RL. Building on this foundation, we propose a model-free continuous-time Deep Deterministic Policy Gradient (CT-DDPG) algorithm that enables stable learning for general reinforcement learning problems with continuous time-and-state. Numerical experiments show that CT-DDPG achieves superior stability and faster convergence compared to existing stochastic-policy methods, across a wide range of learning tasks with varying time discretizations and noise levels.
Yifan Zhang, Wei Zhang, Chuangxin He, Zhonghua Miao, Junhui Hou
Comments Accepted to TMM 2026
Unsupervised online 3D instance segmentation is a fundamental yet challenging task, as it requires maintaining consistent object identities across LiDAR scans without relying on annotated training data. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation, enabling greater diversity without relying on manual labels or simulation engines. To better capture temporal dynamics, our method incorporates a flexible sampling strategy that leverages both adjacent and non-adjacent frames, allowing the model to learn from long-range dependencies as well as short-term variations. In addition, a dynamic-weighting loss emphasizes confident and informative samples, guiding the network toward more robust representations. Through extensive experiments on SemanticKITTI, nuScenes, and PandaSet, our method consistently outperforms UNIT and other unsupervised baselines, achieving higher segmentation accuracy and more robust temporal associations. The code will be publicly available at github.com/Eaphan/SFT3D.
Zhenxing Xu, Yizhe Zhang, Weidong Bao, Hao Wang, Ming Chen, Haoran Ye, Wenzheng Jiang, Hui Yan, Ji Wang
Dynamically configuring algorithm hyperparameters is a fundamental challenge in computational intelligence. While learning-based methods offer automation, they suffer from prohibitive sample complexity and poor generalization. We introduce AutoEP, a novel framework that bypasses training entirely by leveraging Large Language Models (LLMs) as zero-shot reasoning engines for algorithm control. AutoEP's core innovation lies in a tight synergy between two components: (1) an online Exploratory Landscape Analysis (ELA) module that provides real-time, quantitative feedback on the search dynamics, and (2) a multi-LLM reasoning chain that interprets this feedback to generate adaptive hyperparameter strategies. This approach grounds high-level reasoning in empirical data, mitigating hallucination. Evaluated on three distinct metaheuristics across diverse combinatorial optimization benchmarks, AutoEP consistently outperforms state-of-the-art tuners, including neural evolution and other LLM-based methods. Notably, our framework enables open-source models like Qwen3-30B to match the performance of GPT-4, demonstrating a powerful and accessible new paradigm for automated hyperparameter design. Our code is available at https://github.com/YiZheZhang12/AutoEP.
Xiao Hu, Qi Yin, Yangming Shi, Yang Ye
Comments This submission was uploaded in error and has been withdrawn. A substantial revision will need to be completed
Data scarcity remains a fundamental challenge in robot learning. While human demonstrations benefit from abundant motion capture data and vast internet resources, robotic manipulation suffers from limited training examples. To bridge this gap between human and robot manipulation capabilities, we propose UniPrototype, a novel framework that enables effective knowledge transfer from human to robot domains via shared motion primitives. ur approach makes three key contributions: (1) We introduce a compositional prototype discovery mechanism with soft assignments, enabling multiple primitives to co-activate and thus capture blended and hierarchical skills; (2) We propose an adaptive prototype selection strategy that automatically adjusts the number of prototypes to match task complexity, ensuring scalable and efficient representation; (3) We demonstrate the effectiveness of our method through extensive experiments in both simulation environments and real-world robotic systems. Our results show that UniPrototype successfully transfers human manipulation knowledge to robots, significantly improving learning efficiency and task performance compared to existing approaches.The code and dataset will be released upon acceptance at an anonymous repository.
Mohammed Sabry, Anya Belz
Comments Published as a paper at 3rd DATA-FM workshop @ ICLR 2026, Brazil
Mechanism-targeted synthetic data is increasingly proposed as a way to steer pretraining toward desirable capabilities, but it remains unclear how such interventions should be evaluated. We study this question for in-context learning (ICL) under matched compute (iso-FLOPs) using Bi-Induct, a lightweight data rewrite that interleaves short directional copy snippets into a natural pretraining stream: forward-copy (induction), backward-copy (anti-induction, as a directional control), or a balanced mix. Across 0.13B-1B decoder-only models, we evaluate (i) few-shot performance on standard LM benchmarks and function-style ICL probes, (ii) head-level copy telemetry, and (iii) held-out perplexity as a guardrail. Bi-Induct reliably increases induction-head activity, but this does not translate into consistent improvements in few-shot generalization: on standard LM benchmarks, Bi-Induct is largely performance-neutral relative to natural-only training, while on function-style probes the 1B natural-only model performs best. Despite explicit backward-copy cues, anti-induction scores remain near zero across scales, revealing a strong forward/backward asymmetry. Targeted ablations show a sharper distinction: removing the top 2% induction heads per layer harms ICL more than matched random ablations, with the largest relative drop occurring in the natural-only models. This indicates that natural-only training produces more centralized, load-bearing induction circuitry, whereas Bi-Induct tends to create more distributed and redundant induction activity. Our main conclusion is that eliciting a mechanism is not the same as making it load-bearing. For data-centric foundation model design, this suggests that synthetic data interventions should be evaluated not only by signature amplification, but by whether they create causally necessary computation while preserving natural-data modeling quality.
Zihuan Qiu, Lei Wang, Yang Cao, Runtong Zhang, Bing Su, Yi Xu, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Comments Accepted by ICLR 2026
Data-free continual model merging (DFCMM) aims to fuse independently fine-tuned models into a single backbone that evolves with incoming tasks without accessing task data. This paper revisits two fundamental desiderata for DFCMM: stability, avoiding interference with earlier tasks, and plasticity, adapting faithfully to each new task. This poses a challenge that existing approaches fail to address: how to bridge data-level desiderata with parameter-space optimization to ensure stability and plasticity in the absence of task data. To this end, we propose NUFILT (NUll-space FILTering), a data-free framework that directly links these desiderata into parameter-space optimization. Our key observation is that task vectors approximately align with representation subspaces, providing structural surrogates for enforcing stability and plasticity. Accordingly, we design a null-space projector that preserves prior responses by filtering overlapping components of new task vectors, ensuring stability. We further introduce a lightweight LoRA adapter that injects complementary task-specific signals to enable plasticity. The adapter is trained with a projection-based surrogate loss that preserves consistency with prior knowledge while introducing novel directions. This joint filtering-adaptation process enables the backbone to absorb new knowledge while retaining existing behaviors, with updates fused back in a layer-wise linear fashion without extra parameters or inference cost. Theoretically, we establish approximate subspace alignment guarantees that justify null-space filtering. Empirically, NUFILT achieves state-of-the-art performance with minimal forgetting on both vision and NLP benchmarks, improving average accuracy by 4-7% over OPCM and WUDI-Merging, while narrowing the gap to fine-tuning and reducing computation overhead. The code is available at: https://github.com/zihuanqiu/NUFILT
Sowoo Lee, Dongyun Kang, Jaehyun Park, Hae-Won Park
Comments 8 pages
This paper introduces DynaFlow, a novel framework that embeds a differentiable simulator directly into a flow matching model. By generating trajectories in the action space and mapping them to dynamically feasible state trajectories via the simulator, DynaFlow ensures all outputs are physically consistent by construction. This end-to-end differentiable architecture enables training on state-only demonstrations, allowing the model to simultaneously generate physically consistent state trajectories while inferring the underlying action sequences required to produce them. We demonstrate the effectiveness of our approach through quantitative evaluations and showcase its real-world applicability by deploying the generated actions onto a physical Go1 quadruped robot. The robot successfully reproduces diverse gait present in the dataset, executes long-horizon motions in open-loop control and translates infeasible kinematic demonstrations into dynamically executable, stylistic behaviors. These hardware experiments validate that DynaFlow produces deployable, highly effective motions on real-world hardware from state-only demonstrations, effectively bridging the gap between kinematic data and real-world execution.
Hanqing Liu, Shouwei Ruan, Jiahuan Long, Junqi Wu, Jiacheng Hou, Huili Tang, Tingsong Jiang, Weien Zhou, Wen Yao
Vision-Language-Action (VLA) models have emerged as promising solutions for robotic manipulation, yet their robustness to real-world physical variations remains critically underexplored. To bridge this gap, we propose Eva-VLA, the first unified framework to systematically evaluate the robustness of VLA models by formulating uncontrollable physical variations as continuous optimization problems. Specifically, our framework addresses two fundamental challenges in VLA models' physical robustness evaluation: 1) how to systematically characterize diverse physical perturbations encountered in real-world deployment while maintaining reproducibility, and 2) how to efficiently discover worst-case scenarios without incurring prohibitive real-world data collection costs. To tackle the first challenge, we decouple real-world variations into three key dimensions: 3D object transformations that affect spatial reasoning, illumination changes that challenge visual perception, and adversarial regions that disrupt scene understanding. For the second challenge, we introduce a continuous black-box optimization mechanism that maps these perturbations into a continuous parameter space, enabling the systematic exploration of worst-case scenarios. Extensive experiments validate the effectiveness of our approach. Notably, OpenVLA exhibits an average failure rate of over 90% across three physical variations on the LIBERO-Long task, exposing critical systemic fragilities. Furthermore, applying the generated worst-case scenarios during adversarial training quantifiably increases model robustness, validating the effectiveness of this approach. Our evaluation exposes the gap between laboratory and real-world conditions, while the Eva-VLA framework can serve as an effective data augmentation method to enhance the resilience of robotic manipulation systems.
Weiyi Xiong, Bing Zhu, Zewei Zheng
Comments Accepted by IEEE Robotics and Automation Letters
4D automotive radars have gained increasing attention for autonomous driving due to their low cost, robustness, and inherent velocity measurement capability. However, existing 4D radar-based 3D detectors rely heavily on pillar encoders for BEV feature extraction, where each point contributes to only a single BEV grid, resulting in sparse feature maps and degraded representation quality. In addition, they also optimize bounding box attributes independently, leading to sub-optimal detection accuracy. Moreover, their inference speed, while sufficient for high-end GPUs, may fail to meet the real-time requirement on vehicle-mounted embedded devices. To overcome these limitations, an efficient and effective Gaussian-based 3D detector, namely RadarGaussianDet3D is introduced, leveraging Gaussian primitives and distributions as intermediate representations for radar points and bounding boxes. In RadarGaussianDet3D, a novel Point Gaussian Encoder (PGE) is designed to transform each point into a Gaussian primitive after feature aggregation and employs the 3D Gaussian Splatting (3DGS) technique for BEV rasterization, yielding denser feature maps. PGE exhibits exceptionally low latency, owing to the optimized algorithm for point feature aggregation and fast rendering of 3DGS. In addition, a new Box Gaussian Loss (BGL) is proposed, which converts bounding boxes into 3D Gaussian distributions and measures their distance to enable more comprehensive and consistent optimization. Extensive experiments on TJ4DRadSet and View-of-Delft demonstrate that RadarGaussianDet3D achieves high detection accuracy while delivering substantially faster inference, highlighting its potential for real-time deployment in autonomous driving.
Rongkun Zhu, Kangning Cui, Wei Tang, Rui-Feng Wang, Sarra Alqahtani, David Lutz, Fan Yang, Paul Fine, Jordan Karubian, Robert Plemmons, Jean-Michel Morel, Victor Pauca, Miles Silman
Comments 8 pages, 2 figures, 2 tables
Accurate mapping of individual trees is essential for ecological monitoring and forest management. Orthomosaic imagery from unmanned aerial vehicles (UAVs) is widely used, but stitching artifacts and heavy preprocessing limit its suitability for field deployment. This study explores the use of raw UAV imagery for palm detection and crown-center localization in tropical forests. Two research questions are addressed: (1) how detection performance varies across orthomosaic and raw imagery, including within-domain and cross-domain transfer, and (2) to what extent crown-center annotations improve localization accuracy beyond bounding-box centroids. Using state-of-the-art detectors and keypoint models, we show that raw imagery yields superior performance in deployment-relevant scenarios, while orthomosaics retain value for robust cross-domain generalization. Incorporating crown-center annotations in training further improves localization and provides precise tree positions for downstream ecological analyses. These findings offer practical guidance for UAV-based biodiversity and conservation monitoring.
BaiChen Fan, Yuanxi Cui, Jian Li, Qin Wang, Shibo Zhao, Muqing Cao, Sifan Zhou
Comments Acceptted in ICRA 2026
LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics and autonomous systems. Existing methods typically follow frame-wise motion estimation or a sequence-based paradigm. However, the two-frame methods are efficient but lack long-term temporal context, making them vulnerable in sparse or occluded scenes, while sequence-based methods that process multiple point clouds gain robustness at a significant computational cost. To resolve this dilemma, we propose a novel trajectory-based paradigm and its instantiation, TrajTrack. TrajTrack is a lightweight framework that enhances a base two-frame tracker by implicitly learning motion continuity from historical bounding box trajectories alone-without requiring additional, costly point cloud inputs. It first generates a fast, explicit motion proposal and then uses an implicit motion modeling module to predict the future trajectory, which in turn refines and corrects the initial proposal. Extensive experiments on the large-scale NuScenes benchmark show that TrajTrack achieves new state-of-the-art performance, dramatically improving tracking precision by 3.02% over a strong baseline while running at 55 FPS. Besides, we also demonstrate the strong generalizability of TrajTrack across different base trackers. Code is available at https://github.com/FiBonaCci225/TrajTrack.
Matthias Wüest, Francis Engelmann, Ondrej Miksik, Marc Pollefeys, Daniel Barath
We propose UnLoc, an efficient data-driven solution for sequential camera localization within floorplans. Floorplan data is readily available, long-term persistent, and robust to changes in visual appearance. We address key limitations of recent methods, such as the lack of uncertainty modeling in depth predictions and the necessity for custom depth networks trained for each environment. We introduce a novel probabilistic model that incorporates uncertainty estimation, modeling depth predictions as explicit probability distributions. By leveraging off-the-shelf pre-trained monocular depth models, we eliminate the need to rely on per-environment-trained depth networks, enhancing generalization to unseen spaces. We evaluate UnLoc on large-scale synthetic and real-world datasets, demonstrating significant improvements over existing methods in terms of accuracy and robustness. Notably, we achieve $2.7$ times higher localization recall on long sequences (100 frames) and $42.2$ times higher on short ones (15 frames) than the state of the art on the challenging LaMAR HGE dataset.
Maria Mannone, Mahathi Anand, Peppino Fazio, Abdalla Swikir
In a robotic swarm, parameters such as position and proximity to the target can be described in terms of probability amplitudes. This idea led to recent studies on a quantum approach to the definition of the swarm, including a block-matrix representation. However, the size of such matrix-based representation increases drastically with the swarm size, making them impractical for large swarms. Hence, in this work, we propose a new approach for modeling robotic swarms and robotic networks by considering them as mixed quantum states that can be represented mathematically via density matrices. The size of such an approach only depends on the available degrees of freedom of the robot, and not its swarm size and thus scales well to large swarms. Moreover, it also enables the extraction of local information of the robots from the global swarm information contained in the density matrices, facilitating decentralized behavior that aligns with the collective swarm behavior. Our approach is validated on several simulations including large-scale swarms of up to 1000 robots. Finally, we provide some directions for future research that could potentially widen the impact of our approach.
Andrew Ferguson, Marisa LaFleur, Lars Ruthotto, Jesse Thaler, Yuan-Sen Ting, Pratyush Tiwary, Soledad Villar, E. Paulo Alves, Jeremy Avigad, Simon Billinge, Camille Bilodeau, Keith Brown, Emmanuel Candes, Arghya Chattopadhyay, Bingqing Cheng, Jonathan Clausen, Connor Coley, Andrew Connolly, Fred Daum, Sijia Dong, Chrisy Xiyu Du, Cora Dvorkin, Cristiano Fanelli, Eric B. Ford, Luis Manuel Frutos, Nicolás García Trillos, Cecilia Garraffo, Robert Ghrist, Rafael Gomez-Bombarelli, Gianluca Guadagni, Sreelekha Guggilam, Sergei Gukov, Juan B. Gutiérrez, Salman Habib, Johannes Hachmann, Boris Hanin, Philip Harris, Murray Holland, Elizabeth Holm, Hsin-Yuan Huang, Shih-Chieh Hsu, Nick Jackson, Olexandr Isayev, Heng Ji, Aggelos Katsaggelos, Jeremy Kepner, Yannis Kevrekidis, Michelle Kuchera, J. Nathan Kutz, Branislava Lalic, Ann Lee, Matt LeBlanc, Josiah Lim, Rebecca Lindsey, Yongmin Liu, Peter Y. Lu, Sudhir Malik, Vuk Mandic, Vidya Manian, Emeka P. Mazi, Pankaj Mehta, Peter Melchior, Brice Ménard, Jennifer Ngadiuba, Stella Offner, Elsa Olivetti, Shyue Ping Ong, Christopher Rackauckas, Philippe Rigollet, Chad Risko, Philip Romero, Grant Rotskoff, Brett Savoie, Uros Seljak, David Shih, Gary Shiu, Dima Shlyakhtenko, Eva Silverstein, Taylor Sparks, Thomas Strohmer, Christopher Stubbs, Stephen Thomas, Suriyanarayanan Vaikuntanathan, Rene Vidal, Francisco Villaescusa-Navarro, Gregory Voth, Benjamin Wandelt, Rachel Ward, Melanie Weber, Risa Wechsler, Stephen Whitelam, Olaf Wiest, Mike Williams, Zhuoran Yang, Yaroslava G. Yingling, Bin Yu, Shuwen Yue, Ann Zabludoff, Huimin Zhao, Tong Zhang
Comments Community Paper from the NSF Future of AI+MPS Workshop, Cambridge, Massachusetts, March 24-26, 2025, supported by NSF Award Number 2512945; v2: minor clarifications; v3: approximate version to appear in MLST
This community paper developed out of the NSF Workshop on the Future of Artificial Intelligence (AI) and the Mathematical and Physics Sciences (MPS), which was held in March 2025 with the goal of understanding how the MPS domains (Astronomy, Chemistry, Materials Research, Mathematical Sciences, and Physics) can best capitalize on, and contribute to, the future of AI. We present here a summary and snapshot of the MPS community's perspective, as of Spring/Summer 2025, in a rapidly developing field. The link between AI and MPS is becoming increasingly inextricable; now is a crucial moment to strengthen the link between AI and Science by pursuing a strategy that proactively and thoughtfully leverages the potential of AI for scientific discovery and optimizes opportunities to impact the development of AI by applying concepts from fundamental science. To achieve this, we propose activities and strategic priorities that: (1) enable AI+MPS research in both directions; (2) build up an interdisciplinary community of AI+MPS researchers; and (3) foster education and workforce development in AI for MPS researchers and students. We conclude with a summary of suggested priorities for funding agencies, educational institutions, and individual researchers to help position the MPS community to be a leader in, and take full advantage of, the transformative potential of AI+MPS.
Zhijing Wu, Longguang Wang
Reconstructing dynamic 3D scenes from monocular video has broad applications in AR/VR, robotics, and autonomous navigation, but often fails due to severe motion blur caused by camera and object motion. Existing methods commonly follow a two-step pipeline, where camera poses are first estimated and then 3D Gaussians are optimized. Since blurring artifacts usually undermine pose estimation, pose errors could be accumulated to produce inferior reconstruction results. To address this issue, we introduce a unified optimization framework by incorporating camera poses as learnable parameters complementary to 3DGS attributes for end-to-end optimization. Specifically, we recast camera and object motion as per-primitive SE(3) affine transformations on 3D Gaussians and formulate a unified optimization objective. For stable optimization, we introduce a three-stage training schedule that optimizes camera poses and Gaussians alternatively. Particularly, 3D Gaussians are first trained with poses being fixed, and then poses are optimized with 3D Gaussians being untouched. Finally, all learnable parameters are optimized together. Extensive experiments on the Stereo Blur dataset and challenging real-world sequences demonstrate that our method achieves significant gains in reconstruction quality and pose estimation accuracy over prior dynamic deblurring methods.
Rohan Chandra, Shubham Singh, Wenhao Luo, Katia Sycara
Comments Accepted for publication in Autonomous Robots 2026
The "Last Mile Challenge" has long been considered an important, yet unsolved, challenge for autonomous vehicles, public service robots, and delivery robots. A central issue in this challenge is the ability of robots to navigate constrained and cluttered environments that have high agency (e.g., doorways, hallways, corridor intersections), often while competing for space with other robots and humans. We refer to these environments as "Social Mini-Games" (SMGs). Traditional navigation approaches designed for MRN do not perform well in SMGs, which has led to focused research on dedicated SMG solvers. However, publications on SMG navigation research make different assumptions, and have different objective functions (safety versus liveness). These assumptions and objectives are sometimes implicitly assumed or described informally. This makes it difficult to establish appropriate baselines for comparison in research papers, as well as making it difficult for practitioners to find the papers relevant to their concrete application. Such ad-hoc representation of the field also presents a barrier to new researchers wanting to start research in this area. SMG navigation research requires its own taxonomy, definitions, and evaluation protocols to guide effective research moving forward. This survey is the first to catalog SMG solvers using a well-defined and unified taxonomy and to classify existing methods accordingly. It also discusses the essential properties of SMG solvers, defines what SMGs are and how they appear in practice, outlines how to evaluate SMG solvers, and highlights the differences between SMG solvers and general navigation systems. The survey concludes with an overview of future directions and open challenges in the field. Our project is open-sourced at https://socialminigames.github.io/{https://socialminigames.github.io/.
Zhuohang Jiang, Pangjing Wu, Xu Yuan, Wenqi Fan, Qing Li
Comments The source code for our system is released in https://github.com/jzzzzh/QA-Dragon
Retrieval-Augmented Generation (RAG) has been introduced to mitigate hallucinations in Multimodal Large Language Models (MLLMs) by incorporating external knowledge into the generation process, and it has become a widely adopted approach for knowledge-intensive Visual Question Answering (VQA). However, existing RAG methods typically retrieve from either text or images in isolation, limiting their ability to address complex queries that require multi-hop reasoning or up-to-date factual knowledge. To address this limitation, we propose QA-Dragon, a Query-Aware Dynamic RAG System for Knowledge-Intensive VQA. Specifically, QA-Dragon introduces a domain router to identify the query's subject domain for domain-specific reasoning, along with a search router that dynamically selects optimal retrieval strategies. By orchestrating both text and image search agents in a hybrid setup, our system supports multimodal, multi-turn, and multi-hop reasoning, enabling it to tackle complex VQA tasks effectively. We evaluate our QA-Dragon on the Meta CRAG-MM Challenge at KDD Cup 2025, where it significantly enhances the reasoning performance of base models under challenging scenarios. Our framework achieves substantial improvements in both answer accuracy and knowledge overlap scores, outperforming baselines by 5.06% on the single-source task, 6.35% on the multi-source task, and 5.03% on the multi-turn task.
Suchisrit Gangopadhyay, Jung-Hee Kim, Xien Chen, Patrick Rim, Hyoungseob Park, Alex Wong
We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
Shengchao Chen, Guodong Long, Michael Blumenstein, Jing Jiang
Comments Accepted by ICLR 2026
Dataset-level heterogeneity introduces significant domain biases that fundamentally degrade generalization on general Time Series Foundation Models (TSFMs), yet this challenge remains underexplored. This paper rethinks the from-scratch training of TSFMs using the paradigm of federated learning. We propose a novel Federated Dataset Learning (FeDaL) approach to tackle heterogeneous time series by learning dataset-agnostic temporal representations. Specifically, the distributed architecture of federated learning is a nature solution to decompose heterogeneous TS datasets into shared generalized knowledge and preserved personalized knowledge. Moreover, based on the TSFM architecture, FeDaL explicitly mitigates both local and global biases by adding two complementary mechanisms: Domain Bias Elimination (DBE) and Global Bias Elimination (GBE). FeDaL`s cross-dataset generalization has been extensively evaluated in real-world datasets spanning eight tasks (including various regression and classification), against 54 baselines. We further analyze federated scaling behavior, showing how data volume, client count, and join rate affect model performance under decentralization. Our code is publicly available at https://github.com/shengchaochen82/FeDaL
Ivana Collado-Gonzalez, John McConnell, Paul Szenher, Brendan Englot
Comments To appear at IROS 2025 in Hangzhou, China
Scene reconstruction is an essential capability for underwater robots navigating in close proximity to structures. Monocular vision-based reconstruction methods are unreliable in turbid waters and lack depth scale information. Sonars are robust to turbid water and non-uniform lighting conditions, however, they have low resolution and elevation ambiguity. This work proposes a real-time opti-acoustic scene reconstruction method that is specially optimized to work in turbid water. Our strategy avoids having to identify point features in visual data and instead identifies regions of interest in the data. We then match relevant regions in the image to corresponding sonar data. A reconstruction is obtained by leveraging range data from the sonar and elevation data from the camera image. Experimental comparisons against other vision-based and sonar-based approaches at varying turbidity levels, and field tests conducted in marina environments, validate the effectiveness of the proposed approach. We have made our code open-source to facilitate reproducibility and encourage community engagement.
Manh Nguyen, Sunil Gupta, Hung Le
Comments updated experiments and presentation
Ensuring truthfulness in large language models (LLMs) remains a critical challenge for reliable text generation. While supervised fine-tuning and reinforcement learning with human feedback have shown promise, they require a substantial amount of annotated data and computational resources, limiting scalability. In contrast, decoding-time interventions offer lightweight alternatives without model retraining. However, existing decoding strategies often face issues like prompt sensitivity, limited generalization, or dependence on internal model states. We propose Retrieval-Augmented Decoding (RAD), a context-aware adaptive decoding method that leverages a compact reference grounding space built from as few as 10 annotated examples and comprising pairs of context embeddings and next-token logits from truthful responses, to enable retrieval-based logit shaping during inference. At each decoding step, RAD retrieves high-quality semantically similar contexts from this grounding space and aggregates their associated next token logits to modify the model's current logits. Across four open-ended generation benchmarks and four LLMs, our method consistently outperforms strong baselines and shows robust cross-task generalization, underscoring the promise of context-aware decoding for enhancing factual reliability.
Xinlin Zhuang, Feilong Tang, Haolin Yang, Xiwei Liu, Ming Hu, Huifa Li, Haochen Xue, Junjun He, Zongyuan Ge, Yichen Li, Ying Qian, Imran Razzak
Comments Accepted by CVPR 2026
Supervised Fine-Tuning (SFT) of the language backbone plays a pivotal role in adapting Vision-Language Models (VLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered textual datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance in complex clinical scenarios. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample's optimization utility reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex textual cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the "high-difficulty-high-influence" quadrant to balance complex clinical reasoning with substantial gradient influence. This enables efficient medical reasoning for VLMs with minimal fine-tuning data. Furthermore, Human and LLM-as-a-judge evaluations show that DIQ-selected subsets demonstrate higher data quality and generate clinical reasoning that is more aligned with expert practices in differential diagnosis, safety check, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables VLM backbones fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms baseline methods, highlighting the superiority of principled data selection over brute-force scaling. The code is available at https://github.com/mihara-bot/DIQ.
Zizhuo Zhang, Jianing Zhu, Xinmu Ge, Zihua Zhao, Zhanke Zhou, Xuan Li, Xiao Feng, Jiangchao Yao, Bo Han
Comments Accepted by ICLR 2026
While reinforcement learning with verifiable rewards (RLVR) is effective to improve the reasoning ability of large language models (LLMs), its reliance on human-annotated labels leads to the scaling up dilemma, especially for complex tasks. Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs, yet they frequently encounter the non-negligible training collapse issue, as the single-view supervision signal easily forms the self-consistent illusion, yielding the reward hacking. Inspired by the success of self-supervised learning, we propose \textit{Co-rewarding}, a novel self-supervised RL framework that improves training stability by seeking complementary supervision from another views. Specifically, we instantiate Co-rewarding in two ways: (1) \textit{Co-rewarding-I} is a data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions; and (2) \textit{Co-rewarding-II} is a model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation. Intuitively, such instantiations introduce different levels of discrepancy to increase the difficulty of training collapse on trivial reasoning solutions. We also explore their orthogonally combined version to further boost the performance. Empirically, Co-rewarding exhibits stable training across various setups, and outperforms other self-rewarding baselines by $+3.31\%$ improvements on average on multiple mathematical reasoning benchmarks, especially by $+7.49\%$ on Llama-3.2-3B-Instruct. Notably, Co-rewarding reaches or even surpasses RLVR with ground-truth (GT) label in several cases, such as a Pass@$1$ of $94.01\%$ on GSM8K with Qwen3-8B-Base remarkably higher than GT. Our code is released at https://github.com/tmlr-group/Co-rewarding.
Xuan Wu, Di Wang, Chunguo Wu, Kaifang Qi, Chunyan Miao, Yubin Xiao, Jian Zhang, You Zhou
Numerous Neural Combinatorial Optimization (NCO) solvers have been proposed to address Vehicle Routing Problems (VRPs). However, most of these solvers focus exclusively on single-vehicle VRP variants, overlooking the more realistic min-max Heterogeneous Capacitated Vehicle Routing Problem (MMHCVRP), which involves multiple vehicles. Existing MMHCVRP solvers typically select a vehicle and its next node to visit at each decoding step, but often make myopic decoding decisions and overlook key properties of MMHCVRP, including local topological relationships, vehicle permutation invariance, and node symmetry, resulting in suboptimal performance. To better address these limitations, we propose ECHO, an efficient NCO solver. First, ECHO exploits the proposed dual-modality node encoder to capture local topological relationships among nodes. Subsequently, to mitigate myopic decisions, ECHO employs the proposed Parameter-Free Cross-Attention mechanism to prioritize the vehicle selected in the preceding decoding step. Finally, leveraging vehicle permutation invariance and node symmetry, we introduce a tailored data augment strategy for MMHCVRP to stabilize the Reinforcement Learning training process. To assess the performance of ECHO, we conduct extensive experiments. The experimental results demonstrate that ECHO outperforms state-of-the-art NCO solvers across varying numbers of vehicles and nodes, and exhibits well-performing generalization across both scales and distribution patterns. Finally, ablation studies validate the effectiveness of all proposed methods.
Yiming Zhang, Zhuokai Zhao, Chengzhang Yu, Kun Wang, Zhendong Chu, Qiankun Li, Zihan Chen, Yang Liu, Zenghui Ding, Yining Sun, Qingsong Wen
Autoregressive large vision--language models (LVLMs) interface video and language by projecting video features into the LLM's embedding space as continuous visual token embeddings. However, it remains unclear where temporal evidence is represented and how it causally influences decoding. To address this gap, we present CircuitProbe, a circuit-level analysis framework that dissects the end-to-end video-language pathway through two stages: (i) Visual Auditing, which localizes object semantics within the projected video-token sequence and reveals their causal necessity via targeted ablations and controlled substitutions; and (ii) Semantic Tracing, which uses logit-lens probing to track the layer-wise emergence of object and temporal concepts, augmented with temporal frame interventions to assess sensitivity to temporal structure. Based on the resulting analysis, we design a targeted surgical intervention that strictly follows our observations: identifying temporally specialized attention heads and selectively amplifying them within the critical layer interval revealed by Semantic Tracing. This analysis-driven intervention yields consistent improvements (up to 2.4% absolute) on the temporal-heavy TempCompass benchmark, validating the correctness, effectiveness, and practical value of the proposed circuit-level analysis for temporal understanding in LVLMs.
Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Yufeng Zhong, Lin Ma
Comments technical report
Chart reasoning presents unique challenges due to its inherent complexity -- requiring precise numerical comprehension, multi-level visual understanding, and logical inference across interconnected data elements. Existing vision-language models often struggle with such reasoning tasks, particularly when handling multi-subchart scenarios and numerical sensitivity. To address these challenges, we introduce Chart-R1, a chart-domain vision-language model that leverages reinforcement fine-tuning for advanced chart reasoning. We first propose a programmatic data synthesis approach to generate high-quality step-by-step reasoning data with verifiable answer formats, covering diverse chart types and complexity levels. Our two-stage training strategy includes: (1) Chart-COT, which decomposes complex reasoning into interpretable subtasks through chain-of-thought supervision, and (2) Chart-RFT, which employs group relative policy optimization with numerically sensitive rewards tailored for chart-specific reasoning. Experiments on open-source benchmarks and our proposed ChartRQA dataset demonstrate that Chart-R1 significantly outperforms existing chart-domain methods and rivals large-scale open/closed-source models.
Shanshan Zhang, Siyue Wang, Mengzi Chen, Mengzhe Wang, Liqin Wu, Qi Zhang, Lingxiang Zheng
Inertial odometry (IO) is a widely used approach for localization on mobile devices; however, obtaining a lightweight IO model that also achieves high accuracy remains challenging. To address this issue, we propose TinyIO, a lightweight IO method. During training, we adopt a multi-branch architecture to extract diverse motion features more effectively. At inference time, the trained multi-branch model is converted into an equivalent single-path architecture to reduce computational complexity. We further propose a Dual-Path Adaptive Attention mechanism (DPAA), which enhances TinyIO's perception of contextual motion along both channel and temporal dimensions with negligible additional parameters. Extensive experiments on public datasets demonstrate that our method attains a favorable trade-off between accuracy and model size. On the RoNIN dataset, TinyIO reduces the ATE by 23.53% compared with R-ResNet and decreases the parameter count by 3.68%.
扫码添加微信好友,提出您的宝贵建议 👇
💡 备注请填写:网站反馈