arXivDaily: arXiv daily academic digest, updated Monday through Friday
2512.17053 2026-03-13 cs.CL cs.AI cs.DB

Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL

Khushboo Thaker, Yony Bresler

Comments Accepted at the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026). This is the extended version containing additional details and appendices omitted from the camera-ready proceedings due to space constraints

Abstract

Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance. Current solutions force enterprises to choose between expensive, proprietary Large Language Models (LLMs) and low-performing Small Language Models (SLMs). Efforts to improve SLMs often rely on distilling reasoning from large LLMs using unstructured Chain-of-Thought (CoT) traces, a process that remains inherently ambiguous. Instead, we hypothesize that a formal, structured reasoning representation provides a clearer, more reliable teaching signal, as the Text-to-SQL task requires explicit and precise logical steps. To evaluate this hypothesis, we propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful large LLM. Consequently, we adopt a query execution plan as a formal blueprint to derive this structured reasoning. Our SLM, distilled with structured CoT, achieves an absolute improvement of 8.1% over an unstructured CoT distillation baseline. A detailed error analysis reveals that a key factor in this gain is a marked reduction in syntactic errors. This demonstrates that teaching a model to reason using a structured logical blueprint is beneficial for reliable SQL generation in SLMs.
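
To make the idea of a query execution plan concrete, the sketch below asks SQLite for the plan of a small join. This is only an illustration: the abstract does not say which database engine or plan format Struct-SQL uses, and the schema, query, and the helper name `query_plan` are hypothetical.

```python
import sqlite3

# Hypothetical two-table schema, used only for illustration.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
"""

# Hypothetical query: total order value per customer name.
SQL = (
    "SELECT customers.name, SUM(orders.total) "
    "FROM orders JOIN customers ON customers.id = orders.customer_id "
    "GROUP BY customers.name"
)


def query_plan(sql: str, schema: str) -> list[str]:
    """Return SQLite's plan for `sql` as a list of human-readable steps."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    finally:
        conn.close()
    # The last column of each row holds the textual description of the step.
    return [row[-1] for row in rows]


if __name__ == "__main__":
    for step in query_plan(SQL, SCHEMA):
        print(step)
```

The exact wording of each step depends on the SQLite version, so the output is not reproduced here. The point is only that a plan lists scans, lookups, and grouping as discrete steps, which is the kind of structure an unstructured CoT trace lacks.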

2512.06002 2026-03-13 cs.RO cs.AI

POrTAL: Plan-Orchestrated Tree Assembly for Lookahead

Evan Conway, David Porfirio, David Chan, Mark Roberts, Laura M. Hiatt

Comments Submitted to IROS 26

Abstract

When tasking robots in partially observable environments, these robots must efficiently and robustly plan to achieve task goals under uncertainty. Although many probabilistic planning algorithms exist for this purpose, these algorithms can be inefficient if executed with the robot's limited computational resources, or may produce policies that take more steps than expected to achieve the goal. We therefore created a new, lightweight, probabilistic planning algorithm, Plan-Orchestrated Tree Assembly for Lookahead (POrTAL), that combines the strengths of two baseline planning algorithms, FF-Replan and POMCP. We demonstrate that POrTAL is an anytime algorithm that generally outperforms these baselines in terms of the final executed plan length given bounded computation time, especially for problems with only moderate levels of uncertainty.

2512.04034 2026-03-13 cs.LG

Domain Feature Collapse: Implications for Out-of-Distribution Detection and Solutions

Hong Yang, Devroop Kar, Qi Yu, Alex Ororbia, Travis Desell

Comments Error in theoretical assumptions

Abstract

Why do state-of-the-art OOD detection methods exhibit catastrophic failure when models are trained on single-domain datasets? We provide the first theoretical explanation for this phenomenon through the lens of information theory. We prove that supervised learning on single-domain data inevitably produces domain feature collapse -- representations where I(x_d; z) = 0, meaning domain-specific information is completely discarded. This is a fundamental consequence of information bottleneck optimization: models trained on single domains (e.g., medical images) learn to rely solely on class-specific features while discarding domain features, leading to catastrophic failure when detecting out-of-domain samples (e.g., achieving only 53% FPR@95 on MNIST). We extend our analysis using Fano's inequality to quantify partial collapse in practical scenarios. To validate our theory, we introduce Domain Bench, a benchmark of single-domain datasets, and demonstrate that preserving I(x_d; z) > 0 through domain filtering (using pretrained representations) resolves the failure mode. While domain filtering itself is conceptually straightforward, its effectiveness provides strong empirical evidence for our information-theoretic framework. Our work explains a puzzling empirical phenomenon, reveals fundamental limitations of supervised learning in narrow domains, and has broader implications for transfer learning and when to fine-tune versus freeze pretrained models.

2512.02421 2026-03-13 cs.CV

Generalizing Vision-Language Models with Dedicated Prompt Guidance

Xinyao Li, Yinjie Min, Hongbo Chen, Zhekai Du, Fengling Li, Jingjing Li

Comments Accepted to AAAI26

Abstract

Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.

2511.22433 2026-03-13 cs.CV

SkeletonAgent: An Agentic Interaction Framework for Skeleton-based Action Recognition

Hongda Liu, Yunfan Liu, Changlu Wang, Yunlong Wang, Zhenan Sun

Abstract

Recent advances in skeleton-based action recognition increasingly leverage semantic priors from Large Language Models (LLMs) to enrich skeletal representations. However, the LLM is typically queried in isolation from the recognition model and receives no performance feedback. As a result, it often fails to deliver the targeted discriminative cues critical to distinguish similar actions. To overcome these limitations, we propose SkeletonAgent, a novel framework that bridges the recognition model and the LLM through two cooperative agents, i.e., Questioner and Selector. Specifically, the Questioner identifies the most frequently confused classes and supplies them to the LLM as context for more targeted guidance. Conversely, the Selector parses the LLM's response to extract precise joint-level constraints and feeds them back to the recognizer, enabling finer-grained cross-modal alignment. Comprehensive evaluations on five benchmarks, including NTU RGB+D, NTU RGB+D 120, Kinetics-Skeleton, FineGYM, and UAV-Human, demonstrate that SkeletonAgent consistently outperforms state-of-the-art benchmark methods. The code is available at https://github.com/firework8/SkeletonAgent.

2511.22018 2026-03-13 cs.CV cs.AI

MedEyes: Learning Dynamic Visual Focus for Medical Progressive Diagnosis

Chunzheng Zhu, Yangfang Lin, Shen Chen, Yijun Wang, Jianxin Lin

Comments AAAI 2026, Medical Chain-of-Thought (CoT), Reinforcement Learning with Verifiable Rewards (RLVR), Multimodal Grounded Reasoning

Abstract

Accurate medical diagnosis often involves progressive visual focusing and iterative reasoning, characteristics commonly observed in clinical workflows. While recent vision-language models demonstrate promising chain-of-thought (CoT) reasoning capabilities via reinforcement learning with verifiable rewards (RLVR), their purely on-policy learning paradigm tends to reinforce superficially coherent but clinically inaccurate reasoning paths. We propose MedEyes, a novel reinforcement learning framework that dynamically models clinician-style diagnostic reasoning by progressively attending to and interpreting relevant medical image regions. By incorporating off-policy expert guidance, MedEyes converts expert visual search trajectories into structured external behavioral signals, guiding the model toward clinically aligned visual reasoning. We design the Gaze-guided Reasoning Navigator (GRN) to emulate the diagnostic process through a dual-mode exploration strategy, scanning for systematic abnormality localization and drilling for detailed regional analysis. To balance expert imitation and autonomous discovery, we introduce the Confidence Value Sampler (CVS), which employs nucleus sampling and adaptive termination to create diverse yet credible exploration paths. Finally, the dual-stream GRPO optimization framework decouples on-policy and off-policy learning signals, mitigating reward assimilation and entropy collapse. Experiments demonstrate that MedEyes achieves an average performance improvement of +8.5pp across multiple medical VQA benchmarks, validating MedEyes's potential in building trustworthy medical AI systems. Code is available at https://github.com/zhcz328/MedEyes.

2511.18685 2026-03-13 cs.CV cs.RO

Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents

Dayong Liu, Chao Xu, Weihong Chen, Suyu Zhang, Juncheng Wang, Jiankang Deng, Baigui Sun, Yang Liu

Abstract

Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 question-answer pairs spanning three evaluation paradigms targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents. Project page: https://cfg-bench.github.io/

2511.18463 2026-03-13 cs.CV

Decoupling Perception from Reasoning for Hallucination-Resistant Video Understanding

Bowei Pu, Chuanbin Liu, Yifan Ge, Peicheng Zhou, Yiwei Sun, Zhiying Lu, Zhangchi Hu, Hongtao Xie

Comments 17 pages, 8 figures

Abstract

Video Large Language Models improve reasoning over complex videos by generating intermediate reasoning text. However, reliable reasoning depends on accurate video perception. In existing approaches, perception evidence is intertwined with reasoning text, making it difficult to directly supervise the perception process. We argue that reliable supervision requires explicitly separating perception evidence from reasoning so that perception can be verified independently. To supervise perception directly, we propose Decoupled Perception and Logic (DPL), which represents perception as fixed-format evidence units containing timestamps and visual descriptions. This structured representation enables direct extraction of perception content and simplifies alignment between video segments and reward evaluation. Building on DPL, we introduce a perception reward that encourages both hallucination resistance and perception-based reasoning. A Factual-Aware Evaluator (FAE) provides anti-hallucination scores and achieves hallucination evaluation performance comparable to GPT-4o. In addition, we validate reasoning consistency by feeding perception results and questions into a reference model. Experiments show that, by providing reliable process rewards, Video-DPL consistently improves post-training performance at both 3B and 7B scales, while delivering higher data efficiency.

2511.16846 2026-03-13 cs.CL cs.AI

ConCISE: A Reference-Free Conciseness Evaluation Metric for LLM-Generated Answers

Seyed Mohssen Ghafari, Ronny Kol, Juan C. Quiroz, Nella Luan, Monika Patial, Chanaka Rupasinghe, Herman Wandabwa, Luiz Pizzato

Abstract

Large language models (LLMs) frequently generate responses that are lengthy and verbose, filled with redundant or unnecessary details. This diminishes clarity and user satisfaction, and it increases costs for model developers, especially with well-known proprietary models that charge based on the number of output tokens. In this paper, we introduce a novel reference-free metric for evaluating the conciseness of responses generated by LLMs. Our method quantifies non-essential content without relying on gold standard references and is computed as the average of three quantities: i) a compression ratio between the original response and an LLM abstractive summary; ii) a compression ratio between the original response and an LLM extractive summary; and iii) word-removal compression, where an LLM removes as many non-essential words as possible from the response while preserving its meaning, with the number of tokens removed indicating the conciseness score. Experimental results demonstrate that our proposed metric identifies redundancy in LLM outputs, offering a practical tool for automated evaluation of response brevity in conversational AI systems without the need for ground truth human annotations.
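
The three-part average described above might be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the LLM-produced summaries and pruned text are passed in as plain strings, tokens are approximated by whitespace splitting, and each component is expressed as the fraction of the response that survives compression (so a higher score means a more concise response). The function names are hypothetical.

```python
def retained_fraction(original_tokens: int, compressed_tokens: int) -> float:
    """Share of the original length that survives compression, capped at 1."""
    if original_tokens <= 0:
        raise ValueError("the response must contain at least one token")
    return min(compressed_tokens / original_tokens, 1.0)


def conciseness_score(response: str, abstractive: str,
                      extractive: str, pruned: str) -> float:
    """Average the three compression components for one response.

    `abstractive`, `extractive`, and `pruned` stand for the outputs of the
    three LLM steps; producing them is outside the scope of this sketch.
    """
    n = len(response.split())  # assumption: whitespace tokens
    parts = [
        retained_fraction(n, len(text.split()))
        for text in (abstractive, extractive, pruned)
    ]
    return sum(parts) / len(parts)
```

For a 10-token response whose three compressed versions keep 5, 6, and 7 tokens, the components are 0.5, 0.6, and 0.7, and the score is their mean, about 0.6.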

2511.12908 2026-03-13 cs.CV cs.AI

DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning

Junbo Zou, Haotian Xia, Zhen Ye, Shengjie Zhang, Christopher Lai, Vicente Ordonez, Weining Shen, Hanjie Chen

Abstract

Sports video understanding requires perceiving high-speed dynamics, complex rules, and long temporal contexts. Yet, current Multimodal Large Language Models (MLLMs) remain narrowly focused on single sports, specific tasks, or training-free paradigms. We introduce DeepSport, the first end-to-end trained MLLM for multi-task, multi-sport video understanding. DeepSport shifts from passive frame processing to active, iterative reasoning, dynamically extracting frames to "think with videos." To train our model, we curate a unified 78k-sample dataset via a rigorous three-step text-and-vision distillation pipeline. We then employ a progressive two-stage training strategy: a Sports Curriculum Supervised Fine-Tuning phase to build foundational perception, followed by Agentic Reinforcement Learning with a novel tool-use reward. Extensive experiments on a comprehensive 6.7k benchmark demonstrate that DeepSport achieves state-of-the-art performance, outperforming powerful proprietary and open-source models, while utilizing significantly fewer frames. Furthermore, it exhibits strong zero-shot transferability to unseen sports and broad motion recognition tasks, establishing a highly efficient and generalized foundation for complex video reasoning.

2511.12254 2026-03-13 cs.AI cs.IR

Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation

Yuxiang Zhou, Jichang Li, Yanhao Zhang, Haonan Lu, Guanbin Li

Abstract

Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents' excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UI). The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experiences, whereas operations necessitate low-level, precise instructions closely tied to specific app UIs. Motivated by these insights, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation. At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.

2511.11851 2026-03-13 cs.CV cs.CR

Defending Unauthorized Model Merging via Dual-Stage Weight Protection

Wei-Jia Chen, Min-Yen Tsai, Cheng-Yi Lee, Chia-Mu Yu

Comments Accepted at CVPR 2026, updated

Abstract

The rapid proliferation of pretrained models and open repositories has made model merging a convenient yet risky practice, allowing free-riders to combine fine-tuned models into a new multi-capability model without authorization. Such unauthorized model merging not only violates intellectual property rights but also undermines model ownership and accountability. To address this issue, we present MergeGuard, a proactive dual-stage weight protection framework that disrupts merging compatibility while maintaining task fidelity. In the first stage, we redistribute task-relevant information across layers via L2-regularized optimization, ensuring that important gradients are evenly dispersed. In the second stage, we inject structured perturbations to misalign task subspaces, breaking curvature compatibility in the loss landscape. Together, these stages reshape the model's parameter geometry such that merged models collapse into destructive interference while the protected model remains fully functional. Extensive experiments on both vision (ViT-L-14) and language (Llama2, Gemma2, Mistral) models demonstrate that MergeGuard reduces merged model accuracy by up to 90% with less than 1.5% performance loss on the protected model.

2511.09921 2026-03-13 cs.AI

Adaptive Hyperbolic Kernels: Modulated Embedding in de Branges-Rovnyak Spaces

Leping Si, Meimei Yang, Hui Xue, Shipeng Zhu, Pengfei Fang

Comments 13 pages, 3 figures, AAAI26 conference Camera-Ready

Abstract

Hierarchical data pervades diverse machine learning applications, including natural language processing, computer vision, and social network analysis. Hyperbolic space, characterized by its negative curvature, has demonstrated strong potential in such tasks due to its capacity to embed hierarchical structures with minimal distortion. Previous evidence indicates that the hyperbolic representation capacity can be further enhanced through kernel methods. However, existing hyperbolic kernels still suffer from mild geometric distortion or lack adaptability. This paper addresses these issues by introducing a curvature-aware de Branges-Rovnyak space, a reproducing kernel Hilbert space (RKHS) that is isometric to a Poincare ball. We design an adjustable multiplier to select the appropriate RKHS corresponding to the hyperbolic space with any curvature adaptively. Building on this foundation, we further construct a family of adaptive hyperbolic kernels, including the novel adaptive hyperbolic radial kernel, whose learnable parameters modulate hyperbolic features in a task-aware manner. Extensive experiments on visual and language benchmarks demonstrate that our proposed kernels outperform existing hyperbolic kernels in modeling hierarchical dependencies.

2511.07654 2026-03-13 cs.RO

Time as a Control Dimension in Robot Learning

Yinsen Jia, Boyuan Chen

Abstract

Temporal awareness plays a central role in intelligent behavior by shaping how actions are paced, coordinated, and adapted to changing goals and environments. In contrast, most robot learning algorithms treat time only as a fixed episode horizon or scheduling constraint. Here we introduce time-aware policy learning, a reinforcement learning framework that treats time as a control dimension of robot behavior. The approach augments policies with two temporal signals, the remaining time and a time ratio that modulates the policy's internal progression of time, allowing a single policy to regulate its execution strategy across temporal regimes. Across diverse manipulation tasks including long-horizon manipulation, granular-media pouring, articulated-object interaction, and multi-agent coordination, the resulting policies adapt their behavior continuously from dynamic execution under tight schedules to stable and deliberate interaction when more time is available. This temporal awareness improves efficiency, robustness under sim-to-real mismatch and disturbances, and controllability through human input without retraining. Treating time as a controllable variable provides a new framework for adaptive and human-aligned robot autonomy.

2511.06315 2026-03-13 cs.CV

PuzLM: Solving Jigsaw Puzzles with Sequence-to-Sequence Language Models

Gur Elkin, Ofir Itzhak Shahar, Ohad Ben-Shahar

Abstract

Square jigsaw puzzles are typically solved by visually matching piece images to recover the original layout. This work introduces PuzLM, an alternative perspective that recasts jigsaw reassembly as a discrete sequence-to-sequence (Seq2Seq) problem, inspired by natural language representations. We design an efficient puzzle quantization procedure that transforms each piece into a short sequence of discrete tokens, enabling the direct application of standard Seq2Seq language models as powerful jigsaw solvers. Our approach demonstrates that accurate puzzle reconstruction can be achieved through purely symbolic reasoning over discrete representations, improving state-of-the-art performance even on puzzles with eroded boundaries or missing pieces.

2511.04583 2026-03-13 cs.AI cs.CL cs.CV cs.LG

Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

Atsuyuki Miyai, Mashiro Toyooka, Takashi Otonari, Zaiying Zhao, Kiyoharu Aizawa

Comments TMLR2026. Issues, comments, and questions are all welcome in https://github.com/Agent4Science-UTokyo/Jr.AI-Scientist

Abstract

Understanding the current capabilities and risks of AI Scientist systems (autoresearch) is essential for ensuring trustworthy and sustainable AI-driven scientific progress while preserving the integrity of the academic ecosystem. To this end, we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system that mimics the core research workflow of a novice student researcher: Given the baseline paper from the human mentor, it analyzes its limitations, formulates novel hypotheses for improvement, iteratively experiments until improvements are achieved, and writes a paper with the results. Unlike previous approaches that assume full automation or operate on small-scale code, Jr. AI Scientist follows a well-defined research workflow and leverages modern coding agents to handle complex, multi-file implementations, leading to scientifically valuable contributions. Through our experiments, the Jr. AI Scientist successfully generated new research papers that build upon real NeurIPS, IJCV, and ICLR works by proposing and implementing novel methods. For evaluation, we conducted automated assessments using AI Reviewers, author-led evaluations, and submissions to Agents4Science, a venue dedicated to AI-driven contributions. The findings demonstrate that Jr. AI Scientist generates papers receiving higher review scores by DeepReviewer than existing fully automated systems. Nevertheless, we identify important limitations from the author evaluation and the Agents4Science reviews, indicating the potential risks of directly applying current AI Scientist systems and key challenges for future research. Finally, we comprehensively report various risks identified during development. We believe this study clarifies the current role and limitations of AI Scientist systems, offering insights into the areas that still require human expertise and the risks that may emerge as these systems evolve.

2511.03400 2026-03-13 cs.RO

GUIDES: Guidance Using Instructor-Distilled Embeddings for Pre-trained Robot Policy Enhancement

Minquan Gao, Xinyi Li, Qing Yan, Xiaojian Sun, Xiaopan Zhang, Chien-Ming Huang, Jiachen Li

Comments IEEE International Conference on Robotics and Automation (ICRA 2026)

Abstract

Pre-trained robot policies serve as the foundation of many validated robotic systems, which encapsulate extensive embodied knowledge. However, they often lack the semantic awareness characteristic of foundation models, and replacing them entirely is impractical in many situations due to high costs and the loss of accumulated knowledge. To address this gap, we introduce GUIDES, a lightweight framework that augments pre-trained policies with semantic guidance from foundation models without requiring architectural redesign. GUIDES employs a fine-tuned vision-language model (Instructor) to generate contextual instructions, which are encoded by an auxiliary module into guidance embeddings. These embeddings are injected into the policy's latent space, allowing the legacy model to adapt to this new semantic input through brief, targeted fine-tuning. For inference-time robustness, a large language model-based Reflector monitors the Instructor's confidence and, when confidence is low, initiates a reasoning loop that analyzes execution history, retrieves relevant examples, and augments the VLM's context to refine subsequent actions. Extensive validation in the RoboCasa simulation environment across diverse policy architectures shows consistent and substantial improvements in task success rates. Real-world deployment on a UR5 robot further demonstrates that GUIDES enhances motion precision for critical sub-tasks such as grasping. Overall, GUIDES offers a practical and resource-efficient pathway to upgrade, rather than replace, validated robot policies.

2511.00783 2026-03-13 cs.RO cs.SY eess.SY

When Semantics Connect the Swarm: LLM-Driven Fuzzy Control for Cooperative Multi-Robot Underwater Coverage

Jingzehua Xu, Weihang Zhang, Yangyang Li, Hongmiaoyi Zhang, Guanwen Xie, Jiwei Tang, Shuai Zhang, Yi Li

Comments Withdrawal for further improvement. The final version will be released in a few months

Abstract

Underwater multi-robot cooperative coverage remains challenging due to partial observability, limited communication, environmental uncertainty, and the lack of access to global localization. To address these issues, this paper presents a semantics-guided fuzzy control framework that couples Large Language Models (LLMs) with interpretable control and lightweight coordination. Raw multimodal observations are compressed by the LLM into compact, human-interpretable semantic tokens that summarize obstacles, unexplored regions, and Objects Of Interest (OOIs) under uncertain perception. A fuzzy inference system with pre-defined membership functions then maps these tokens into smooth and stable steering and gait commands, enabling reliable navigation without relying on global positioning. Then, we further coordinate multiple robots by introducing semantic communication that shares intent and local context in linguistic form, enabling agreement on who explores where while avoiding redundant revisits. Extensive simulations in unknown reef-like environments show that, under limited sensing and communication, the proposed framework achieves robust OOI-oriented navigation and cooperative coverage with improved efficiency and adaptability, narrowing the gap between semantic cognition and distributed underwater control in GPS-denied, map-free conditions.

2511.00617 2026-03-13 cs.LG cs.AI cs.CL stat.ML

Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering

Eric Bigelow, Daniel Wurgaft, YingQiao Wang, Noah Goodman, Tomer Ullman, Hidenori Tanaka, Ekdeep Singh Lubana

Abstract

Large language models (LLMs) can be controlled at inference time through prompts (in-context learning) and internal activations (activation steering). Different accounts have been proposed to explain these methods, yet their common goal of controlling model behavior raises the question of whether these seemingly disparate methodologies can be seen as specific instances of a broader framework. Motivated by this, we develop a unifying, predictive account of LLM control from a Bayesian perspective. Specifically, we posit that both context- and activation-based interventions impact model behavior by altering its belief in latent concepts: steering operates by changing concept priors, while in-context learning leads to an accumulation of evidence. This results in a closed-form Bayesian model that is highly predictive of LLM behavior across context- and activation-based interventions in a set of domains inspired by prior work on many-shot in-context learning. This model helps us explain prior empirical phenomena - e.g., sigmoidal learning curves as in-context evidence accumulates - while predicting novel ones - e.g., additivity of both interventions in log-belief space, which results in distinct phases such that sudden and dramatic behavioral shifts can be induced by slightly changing intervention controls. Taken together, this work offers a unified account of prompt-based and activation-based control of LLM behavior, and a methodology for empirically predicting the effects of these interventions.
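
The additivity claim can be illustrated with a generic two-hypothesis Bayesian update. This is a minimal sketch, not the authors' closed-form model: it assumes that steering adds a constant shift to the prior log-odds of a concept and that each in-context example contributes the same fixed amount of log-evidence. All parameter names and default values are hypothetical.

```python
import math


def belief(n_examples: int, log_prior_odds: float,
           steering_shift: float = 0.0,
           evidence_per_example: float = 0.5) -> float:
    """Posterior probability of a concept under a toy log-odds model.

    Steering is modeled as a shift in the prior, in-context examples as
    accumulated evidence; the two add in log-odds space.
    """
    log_odds = (log_prior_odds + steering_shift
                + n_examples * evidence_per_example)
    # The logistic function maps log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Under these assumptions, belief as a function of `n_examples` traces a sigmoid, and a steering shift moves that curve left or right: with prior log-odds of -4, eight examples reach a belief of 0.5 without steering, while four examples suffice once a shift of 2 is applied.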

2510.26796 2026-03-13 cs.CV cs.GR

See4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting

Dongyue Lu, Ao Liang, Tianxin Huang, Xiao Fu, Yuyang Zhao, Baorui Ma, Liang Pan, Wei Yin, Lingdong Kong, Wei Tsang Ooi, Ziwei Liu

Comments Eurographics2026; 26 pages; 21 figures; 3 tables; project page: https://see-4d.github.io/

Abstract

Immersive applications call for synthesizing spatiotemporal 4D content from casual videos without costly 3D supervision. Existing video-to-4D methods typically rely on manually annotated camera poses, which are labor-intensive and brittle for in-the-wild footage. Recent warp-then-inpaint approaches mitigate the need for pose labels by warping input frames along a novel camera trajectory and using an inpainting model to fill missing regions, thereby depicting the 4D scene from diverse viewpoints. However, this trajectory-to-trajectory formulation often entangles camera motion with scene dynamics and complicates both modeling and inference. We introduce See4D, a pose-free, trajectory-to-camera framework that replaces explicit trajectory prediction with rendering to a bank of fixed virtual cameras, thereby separating camera control from scene modeling. A view-conditional video inpainting model is trained to learn a robust geometry prior by denoising realistically synthesized warped images and to inpaint occluded or missing regions across virtual viewpoints, eliminating the need for explicit 3D annotations. Building on this inpainting core, we design a spatiotemporal autoregressive inference pipeline that traverses virtual-camera splines and extends videos with overlapping windows, enabling coherent generation at bounded per-step complexity. We validate See4D on cross-view video generation and sparse reconstruction benchmarks. Across quantitative metrics and qualitative assessments, our method achieves superior generalization and improved performance relative to pose- or trajectory-conditioned baselines, advancing practical 4D world modeling from casual videos.

2510.21019 2026-03-13 cs.LG cs.CV

More Than Memory Savings: Zeroth-Order Optimization Mitigates Forgetting in Continual Learning

Wanhao Yu, Zheng Wang, Shuteng Niu, Sen Lin, Li Yang

Abstract

Zeroth-order (ZO) optimization has gained attention as a memory-efficient alternative to first-order (FO) methods, particularly in settings where gradient computation is expensive or even impractical. Beyond its memory efficiency, in this work, we investigate ZO optimization for continual learning (CL) as a novel approach to address the plasticity-stability-efficiency trilemma. Through theoretical analysis and empirical evidence, we show that ZO optimization naturally leads to flatter loss landscapes, which in turn reduce forgetting in CL. However, this stability comes at the cost of plasticity: due to its imprecise gradient estimates and slower convergence, ZO optimization tends to be less effective than FO in acquiring new task-specific knowledge, particularly under constrained training budgets. To better understand this trade-off, we conduct a holistic evaluation of ZO optimization applied to various existing CL methods. Our findings reveal that ZO optimization enhances stability but often undermines plasticity, particularly when used with learnable classifiers. Motivated by this insight, we propose ZO-FC, a simple but effective approach that applies ZO optimization to a single adapter-based PEFT module with an FO-optimized classifier. This design leverages the stability benefits of ZO while preserving the adaptability of FO updates with negligible memory overhead. Experiments demonstrate that ZO-FC achieves an effective balance between stability and plasticity, offering a practical and memory-efficient solution for on-device CL.
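To make the "imprecise gradient estimates" of ZO optimization concrete, here is a minimal sketch of a generic two-point (symmetric finite-difference) zeroth-order estimator with a plain descent loop. It is not the ZO-FC method from the abstract; the function names `zo_gradient` and `zo_sgd`, the Gaussian perturbation, and all hyperparameters are illustrative assumptions.

```python
import random


def zo_gradient(f, x, eps=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate of f at x.

    Only function evaluations are used (no backpropagation).
    The perturbation direction u is standard Gaussian (an assumption).
    """
    rng = rng or random.Random(0)
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    # Directional derivative along u, estimated by a central difference.
    scale = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [scale * ui for ui in u]


def zo_sgd(f, x, lr=0.05, steps=500, seed=0):
    """Plain descent loop driven by the ZO estimate (hypothetical helper)."""
    rng = random.Random(seed)
    for _ in range(steps):
        g = zo_gradient(f, x, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    # Toy objective with its minimum at (1, 1).
    def quadratic(v):
        return sum((vi - 1.0) ** 2 for vi in v)

    print(zo_sgd(quadratic, [3.0, -2.0]))
```

Each step needs only two forward evaluations and no stored activations, which is where the memory savings come from; the price is a noisy, single-direction estimate of the gradient.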

2510.16439 2026-03-13 cs.CL

FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution

Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni

Abstract

Human communication heavily relies on laconism and inferential pragmatics, allowing listeners to successfully reconstruct rich meaning from sparse, telegraphic speech. In contrast, large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. This overhead stems from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. Inspired by the aforementioned cognitive psycholinguistic processes, we address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to retain the top-k% tokens, and obtain a sparse frugalized prompt. We establish the theoretical stability of our approach and provide strong empirical results across a suite of four NLP tasks to study the trade-off between the portion of retained tokens and performance. Experimental findings across retention settings reveal asymmetric performance patterns that suggest potential task contamination effects. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs and delineates the boundary between tasks tolerant of contextual sparsity and those requiring exhaustive context.
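The rank-and-retain step described above can be sketched as follows. This is a minimal illustration, not the FrugalPrompt implementation: the salience scores are supplied by hand rather than computed with GlobEnc or DecompX, the function name `frugalize` is hypothetical, and keeping the surviving tokens in their original order is an assumption.

```python
import math


def frugalize(tokens, scores, keep_percent):
    """Keep the top keep_percent% of tokens by salience score.

    Surviving tokens are returned in their original order (an assumption;
    the abstract does not say how the retained tokens are arranged).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep_percent <= 100:
        raise ValueError("keep_percent must be in (0, 100]")
    n_keep = max(1, math.ceil(len(tokens) * keep_percent / 100))
    # Rank token indices from most to least salient.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore document order among the retained indices.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


if __name__ == "__main__":
    toks = ["the", "cat", "sat", "on", "the", "mat"]
    sal = [0.05, 0.9, 0.6, 0.1, 0.05, 0.8]  # made-up salience values
    print(frugalize(toks, sal, 50))
```

Sweeping `keep_percent` is the knob behind the trade-off the abstract studies between the portion of retained tokens and task performance.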

2510.13108 2026-03-13 cs.CV cs.AI cs.RO

DriveCritic: Towards Context-Aware, Human-Aligned Evaluation for Autonomous Driving with Vision-Language Models

Jingyu Song, Zhenxin Li, Shiyi Lan, Xinglong Sun, Nadine Chang, Maying Shen, Joshua Chen, Katherine A. Skinner, Jose M. Alvarez

Comments Accepted at ICRA 2026; 8 pages, 3 figures

Abstract

Benchmarking autonomous driving planners to align with human judgment remains a critical challenge, as state-of-the-art metrics like the Extended Predictive Driver Model Score (EPDMS) lack context awareness in nuanced scenarios. To address this, we introduce DriveCritic, a novel framework featuring two key contributions: the DriveCritic dataset, a curated collection of challenging scenarios, annotated with pairwise human preferences, where context is critical for correct judgment; and the DriveCritic model, a Vision-Language Model (VLM) based evaluator. Fine-tuned using a two-stage supervised and reinforcement learning pipeline, the DriveCritic model learns to adjudicate between trajectory pairs by integrating visual and symbolic context. Experiments show DriveCritic significantly outperforms existing metrics and baselines in matching human preferences and demonstrates strong context awareness. Overall, our work provides a more reliable, human-aligned foundation for evaluating autonomous driving systems. The project page for DriveCritic is https://song-jingyu.github.io/DriveCritic

2510.11036 2026-03-13 cs.RO cs.AI

XGrasp: Gripper-Aware Grasp Detection with Multi-Gripper Data Generation

Yeonseo Lee, Jungwook Mun, Hyosup Shin, Guebin Hwang, Junhee Nam, Taeyeop Lee, Sungho Jo

Comments 9 pages, 10 figures

Abstract

Real-world robotic systems frequently require diverse end-effectors for different tasks; however, most existing grasp detection methods are optimized for a single gripper type, demanding retraining or optimization for each novel gripper configuration. This gripper-specific retraining paradigm is neither scalable nor practical. We propose XGrasp, a real-time gripper-aware grasp detection framework that generalizes to novel gripper configurations without additional training or optimization. To resolve data scarcity, we augment existing single-gripper datasets with multi-gripper annotations by incorporating the physical characteristics and closing trajectories of diverse grippers. Each gripper is represented as a two-channel 2D image encoding its static shape (Gripper Mask) and dynamic closing trajectory (Gripper Path). XGrasp employs a hierarchical two-stage architecture consisting of a Grasp Point Predictor (GPP) and an Angle-Width Predictor (AWP). In the AWP, contrastive learning with a quality-aware anchor builds a gripper-agnostic embedding space, enabling generalization to novel grippers without additional training. Experimental results demonstrate that XGrasp outperforms existing gripper-aware methods in both grasp success rate and inference speed across diverse gripper types. Project page: https://sites.google.com/view/xgrasp

2510.10489 2026-03-13 cs.CV

Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation

Jiaye Li, Baoyou Chen, Hui Li, Zilong Dong, Jingdong Wang, Siyu Zhu

Abstract

Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations in areas such as fine-grained spatial relation modeling, color cues, and object counting. This paper identifies key limitations of standard multi-dimensional RoPE (rigid frequency allocation, axis-wise independence, and uniform head treatment) in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation parameterized via singular value decomposition (SVD) before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE's relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.
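A short derivation shows why a linear map applied before the rotary mapping leaves the relative-position property intact. The notation is assumed for illustration and is not taken from the paper: $R_m$ is the block-diagonal RoPE rotation for a 1D position index $m$, $A_h$ is the learnable matrix of head $h$, and $q$, $k$ are a query and a key. Applying the same $A_h$ to both queries and keys is also an assumption.

```latex
% Standard RoPE identity: rotations compose additively in the position index.
R_m^{\top} R_n = R_{n-m}

% Attention logit with a head-specific linear map A_h applied before rotation:
\left( R_m A_h q \right)^{\top} \left( R_n A_h k \right)
  = q^{\top} A_h^{\top} R_m^{\top} R_n A_h k
  = q^{\top} A_h^{\top} R_{n-m} A_h k
```

The logit depends on the positions only through the offset $n - m$, for any choice of $A_h$, so the transformation can be learned freely without breaking relative-position behavior.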

2510.08724 2026-03-13 cs.LG

Counterfactually Fair Conformal Prediction

Ozgur Guldogan, Neeraj Sarna, Yuanyuan Li, Michael Berger

Comments Accepted at AISTATS 2026

Abstract

While counterfactual fairness of point predictors is well studied, its extension to prediction sets--central to fair decision-making under uncertainty--remains underexplored. On the other hand, conformal prediction (CP) provides efficient, distribution-free, finite-sample valid prediction sets, yet does not ensure counterfactual fairness. We close this gap by developing Counterfactually Fair Conformal Prediction (CF-CP) that produces counterfactually fair prediction sets. Through symmetrization of conformity scores across protected-attribute interventions, we prove that CF-CP results in counterfactually fair prediction sets while maintaining the marginal coverage property. Furthermore, we empirically demonstrate that on both synthetic and real datasets, across regression and classification tasks, CF-CP achieves the desired counterfactual fairness and meets the target coverage rate with minimal increase in prediction set size. CF-CP offers a simple, training-free route to counterfactually fair uncertainty quantification.
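For context, the sketch below shows standard split conformal prediction for regression, the baseline that provides the marginal coverage property mentioned above. It does not implement CF-CP: the symmetrization of conformity scores across protected-attribute interventions is not reproduced, and the function names and the absolute-residual score are illustrative assumptions.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected quantile of calibration conformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested coverage.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_interval(point_pred, q):
    """Interval for the absolute-residual score |y - f(x)| (an assumption)."""
    return (point_pred - q, point_pred + q)


if __name__ == "__main__":
    # Made-up absolute residuals from a held-out calibration set.
    residuals = [float(i) for i in range(1, 21)]
    q = conformal_quantile(residuals, alpha=0.1)
    print(q, prediction_interval(10.0, q))
```

Under exchangeability of calibration and test points, intervals built this way cover the true label with probability at least 1 - alpha; a fairness-aware variant would change how the scores are computed, not this quantile step.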

2510.07791 2026-03-13 cs.CV

GTR-Bench: Evaluating Geo-Temporal Reasoning in Vision-Language Models

Qinghongbing Xie, Zhaoyuan Xia, Feng Zhu, Lijun Gong, Ziyue Li, Rui Zhao, Long Zeng

Comments ICLR 2026, 31 pages, 20 figures

Abstract

Recently, the spatial-temporal intelligence of Vision-Language Models (VLMs) has attracted much attention due to its importance for autonomous driving, embodied AI and general AI. Existing spatial-temporal benchmarks mainly focus on egocentric (first-person) perspective reasoning using image/video contexts, or geographic reasoning with graphical context (e.g., maps), and thus fail to assess VLMs' geographic spatial-temporal intelligence, which requires integrating both image/video and graphical context and is crucial for real-world scenarios such as traffic management and emergency response. To address these gaps, we introduce the Geo-Temporal Reasoning benchmark (GTR-Bench), a novel challenge for geographic temporal reasoning of moving targets in a large-scale camera network. GTR-Bench is more challenging as it requires multiple perspective switches between maps and videos, joint reasoning across multiple videos with non-overlapping fields of view, and inference over spatial-temporal regions that are unobserved by any video context. Evaluations of more than 10 popular VLMs on GTR-Bench show that even the best proprietary model, Gemini-2.5-Pro (34.9%), significantly lags behind human performance (78.61%) on geo-temporal reasoning. Moreover, our comprehensive analysis on GTR-Bench reveals three major deficiencies of current models for geo-temporal reasoning: (1) VLMs exhibit imbalanced utilization of spatial and temporal context during reasoning; (2) they show weak temporal forecasting ability, leading to poorer performance on temporally focused tasks; and (3) they lack the capability to effectively align and integrate map data with multi-view video inputs. We believe GTR-Bench offers valuable insights and opens up new opportunities for research and applications in spatial-temporal intelligence. Benchmark and code will be released at https://github.com/X-Luffy/GTR-Bench.

2510.06754 2026-03-13 cs.RO cs.CV cs.LG

UniFField: A Generalizable Unified Neural Feature Field for Visual, Semantic, and Spatial Uncertainties in Any Scene

Christian Maurer, Snehal Jauhri, Sophie Lueth, Georgia Chalvatzaki

Comments ICRA 2026 Project website: https://sites.google.com/view/uniffield

Abstract

Comprehensive visual, geometric, and semantic understanding of a 3D scene is crucial for successful execution of robotic tasks, especially in unstructured and complex environments. Additionally, to make robust decisions, it is necessary for the robot to evaluate the reliability of perceived information. While recent advances in 3D neural feature fields have enabled robots to leverage features from pretrained foundation models for tasks such as language-guided manipulation and navigation, existing methods suffer from two critical limitations: (i) they are typically scene-specific, and (ii) they lack the ability to model uncertainty in their predictions. We present UniFField, a unified uncertainty-aware neural feature field that combines visual, semantic, and geometric features in a single generalizable representation while also predicting uncertainty in each modality. Our approach, which can be applied zero-shot to any new environment, incrementally integrates RGB-D images into our voxel-based feature representation as the robot explores the scene, simultaneously updating uncertainty estimation. Our evaluation shows that the uncertainty estimates accurately describe the model prediction errors in scene reconstruction and semantic feature prediction. Furthermore, we successfully leverage our feature predictions and their respective uncertainty for an active object search task using a mobile manipulator robot, demonstrating the capability for robust decision-making.

2510.04579 2026-03-13 cs.LG math.MG stat.ML

Busemann Functions in the Wasserstein Space: Existence, Closed-Forms, and Applications to Slicing

Clément Bonet, Elsa Cazelles, Lucas Drumetz, Nicolas Courty

Comments Published as a conference paper at AISTATS 2026

Abstract

The Busemann function has recently found much interest in a variety of geometric machine learning problems, as it naturally defines projections onto geodesic rays of Riemannian manifolds and generalizes the notion of hyperplanes. As several sources of data can be conveniently modeled as probability distributions, it is natural to study this function in the Wasserstein space, which carries a rich formal Riemannian structure induced by Optimal Transport metrics. In this work, we investigate the existence and computation of Busemann functions in Wasserstein space, which admits geodesic rays. We establish closed-form expressions in two important cases: one-dimensional distributions and Gaussian measures. These results enable explicit projection schemes for probability distributions on $\mathbb{R}$, which in turn allow us to define novel Sliced-Wasserstein distances over Gaussian mixtures and labeled datasets. We demonstrate the efficiency of those original schemes on synthetic datasets as well as transfer learning problems.

2509.26489 2026-03-13 cs.CV cs.LG eess.SP

Contrastive Diffusion Guidance for Spatial Inverse Problems

Sattwik Basu, Chaitanya Amballa, Zhongweiyang Xu, Jorge Vančo Sampedro, Srihari Nelakuditi, Romit Roy Choudhury

Abstract

We consider a class of inverse problems characterized by forward operators that are partially specified, non-smooth, and non-differentiable. Although generative inverse solvers have made significant progress, we find that these forward operators introduce a distinct set of challenges. As a concrete instance, we consider the problem of reconstructing spatial layouts, such as floorplans, from human movement trajectories, where the underlying path-generation process is inherently non-differentiable and only partially known. In such problems, direct likelihood-based guidance becomes unstable, since the underlying path-planning process does not provide reliable gradients. We break away from existing diffusion-based posterior samplers and reformulate likelihood-based guidance in a smoother embedding space. This embedding space is learned using a contrastive objective to bring compatible trajectory-floorplan pairs close together while pushing mismatched pairs apart. We show that this surrogate likelihood score in the embedding space provides a valid approximation to the true likelihood score, making it possible to steer the denoising process towards the posterior. Across extensive experiments, our model CoGuide produces more consistent reconstructions and is more robust than existing inverse solvers and guided diffusion. Beyond spatial mapping, we show that our method can be applied more broadly, suggesting a route toward solving generalized blind inverse problems using diffusion models.