arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 1115
2605.01427 2026-05-05 cs.RO

SixthSense: Task-Agnostic Proprioception-Only Whole-Body Wrench Estimation for Humanoids

Xingzhou Chen, Xiayan Xu, Yan Ning, Jiyu Yu, Yizheng Zhang, Siyi Qian, Lingzhu Xiang, Jiahao Chen, Yuquan Wang, Haodong Zhang, Ling Shi

详情
英文摘要

Humanoid robots are entering our physical world at scale, yet as oversized toys--good at singing and dancing, but short on force-interaction capabilities for practical tasks. Bridging this gap necessitates prioritizing reliable contact perception as a fundamental requirement. Estimating external wrenches in humanoids is complicated by floating-base dynamics and indeterminate contact locations. Existing analytical frameworks require idealistic assumptions and hard-to-obtain measurements, which are often unavailable in practice. To bridge this gap, we propose SixthSense, a task-agnostic approach that infers whole-body contact timing, location, and wrenches from proprioception and IMU data alone. To capture the multi-modal dynamics between unstructured contact inputs and the uncertain motion outputs, we employ conditional flow matching to tokenize proprioceptive histories and estimate a spatiotemporally sparse contact-event flow. SixthSense serves as a plug-and-play perception module for applications including collision detection, physical human-robot interaction, and force-feedback teleoperation. Experiments across standing, walking, and whole-body motion-tracking policies showcased unprecedented performance in diverse behaviors.

2605.01425 2026-05-05 cs.LG

Barriers to Counterfactual Credit Attribution for Autoregressive Models

Aloni Cohen, Chenhao Zhang

Comments ICML 2026

详情
英文摘要

Generative AI disrupts the practice of giving credit to work that came before. Ideally, a generative model would give credit to any work on which its output depends in a significant way. \emph{Counterfactual credit attribution} (CCA) is a technical condition formalizing this goal--a relaxation of differential privacy--recently introduced by Livni, Moran, Nissim, and Pabbaraju [2024] who studied it in the PAC learning setting. We initiate the study of CCA generative models. Specifically, we consider autoregressive models giving credit to a deployment-time dataset (e.g., a RAG database). We uncover barriers to two natural approaches to CCA autoregressive models. First, we show that imposing CCA on the underlying next-token predictor does not guarantee that the model is CCA: CCA does not compose autoregressively (unlike DP). Second, we consider a different approach to building CCA models which we call \emph{retrofitting}. Retrofitting takes a model that does not attribute credit, and adds credit onto it. We prove a lower bound for CCA retrofitting under a weak optimality requirement. Given black-box access to the starting model, retrofitting requires query complexity exponential in the length of the model's outputs.

2605.01424 2026-05-05 cs.LG cs.AI

Quantifying Multimodal Capabilities: Formal Generalization Guarantees in Pairwise Metric Learning

Richeng Zhou, Xuelin Zhang, Liyuan Liu

详情
英文摘要

Multimodal learning leverages the integration of diverse data modalities to enhance performance in complex tasks. Yet, it frequently encounters incomplete or redundant modality data in real-world scenarios. This paper presents a fine-grained theoretical analysis of the generalization properties of multimodal metric learning models, addressing critical gaps in understanding the relationship between modality selection and algorithmic performance. We establish hierarchical relationships between function classes corresponding to different modality subsets and quantify the discrepancy between learned mappings and ground truth. Through rigorous analysis of pairwise complexity within the multimodal learning framework, we derive novel generalization error bounds that reveal the joint impact of modality quantity and granularity on model performance. Our theoretical findings on both upper and lower bounds demonstrate that incorporating fine-grained modality features reduces the complexity of the hypothesis space by enhancing modality complementarity. This work offers both theoretical foundations and practical implications for improving convergence rates and accuracy in multimodal learning systems.

2605.01420 2026-05-05 cs.AI

Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance

Wesley Shu, Peng Wei

详情
英文摘要

Artificial Jagged Intelligence (AJI) denotes a recurring pattern in which large learning systems exhibit strong local capabilities while remaining weak or brittle in other domains. This paper develops a formal theory of AJI as uneven allocation of optimization pressure. We model training as a finite-budget process that distributes gradient-driven update energy across capability-relevant directions in parameter space. In this model, jagged capability profiles arise from anisotropic objective structure, data geometry, and representational coupling rather than from a single scalar quantity called intelligence. The paper defines capability gain, optimization energy share, and jaggedness, then proves that persistent concentration of cumulative update energy yields lower bounds on dispersion in capability gains. A finite-budget tradeoff theorem shows why prioritizing one capability can impose opportunity costs on others unless positive coupling or shared structure offsets the cost. The analysis also studies redistribution mechanisms, including energy-variance regularization and auxiliary structural objectives, as interventions that reshape the optimization field. The resulting framework links uneven emergence, training architecture, and optimization governance. It predicts that early concentration of update energy should forecast later capability jaggedness; that scaling under a narrow objective need not eliminate anisotropy; and that explicitly funded auxiliary objectives can revive neglected capabilities. AJI is therefore not merely a descriptive label for uneven model behavior, but a testable theory of how finite optimization resources produce concentrated, delayed, and structurally uneven capability formation.

2605.01418 2026-05-05 cs.AI

TimeTok: Granularity-Controllable Time-Series Generation via Hierarchical Tokenization

Seokhyun Lee, Jaeho Kim, Changjun Oh, Mihaela van der Schaar, Changhee Lee

详情
英文摘要

Time-series generative models often lack control over temporal granularity, forcing users to accept whatever granularity the model produces. To enable truly user-driven generation, we introduce TimeTok, a unified framework for Granularity-Controllable Time-Series Generation (GC-TSG), which generates time series at any target granularity from any coarser input (e.g., rough sketches) or from scratch. At the core of TimeTok is a hierarchical tokenization strategy that maps time series into an ordered sequence of tokens, from coarse to fine temporal granularity. Our autoregressive generation process operates across these granularity levels, producing token blocks that are decoded back into continuous time series. This design naturally enables GC-TSG - including standard generation - within a single framework, where controlling the number of token blocks provides explicit control over output detail. Experiments show that TimeTok excels at GC-TSG tasks while achieving state-of-the-art performance in standard generation. Furthermore, we showcase TimeTok's potential as a foundational tokenizer by training on multiple datasets with heterogeneous temporal granularities, verifying strong transferability that consistently outperforms models trained on individual datasets. To our knowledge, this is the first unified framework that covers the full generative spectrum for time series, offering a valuable foundation for models that benefit from diverse temporal granularities.

2605.01417 2026-05-05 cs.CL cs.AI

Medmarks: A Comprehensive Open-Source LLM Benchmark Suite for Medical Tasks

Benjamin Warner, Ratna Sagari Grandhi, Max Kieffer, Aymane Ouraq, Saurav Panigrahi, Geetu Ambwani, Kunal Bagga, Nikhil Khandekar, Arya Hariharan, Nishant Mishra, Manish Ram, Shamus Sim Zi Yang, Ahmed Essouaied, Adepoju Jeremiah Moyondafoluwa, Robert Scholz, Bofeng Huang, Molly Beavers, Srishti Gureja, Anish Mahishi, Sameed Khan, Maxime Griot, Hunar Batra, Jean-Benoit Delbrouck, Siddhant Bharadwaj, Ronald Clark, Ashish Vashist, Anas Zafar, Leema Krishna Murali, Harsh Deshpande, Ameen Patel, William Brown, Johannes Hagemann, Connor Lane, Paul Steven Scotti, Tanishq Mathew Abraham

Comments website: https://medmarks.ai

详情
英文摘要

Evaluating large language models (LLMs) for medical applications remains challenging due to benchmark saturation, limited data accessibility, and insufficient coverage of relevant tasks. Existing suites have either saturated, heavily depend on restricted datasets, or lack comprehensive model coverage. We introduce Medmarks, a fully open-source evaluation suite with 30 benchmarks spanning question answering, information extraction, medical calculations, and open-ended clinical reasoning. We perform a systematic evaluation of 61 models across 71 configurations using verifiable metrics and LLM-as-a-Judge. Our results show that frontier reasoning models (Gemini 3 Pro Preview, GPT-5.1, & GPT-5.2) achieve the highest performance across both benchmarks, most frontier proprietary models are significantly more token efficient than open-weight alternatives, medically fine-tuned models outperform their generalist counterparts, and that models are susceptible to answer-order bias (particularly smaller models and Grok 4). A subset of our evals (Medmarks-T) can be directly used as reinforcement learning environments to post-train LLMs for medical reasoning. Code is available at https://github.com/MedARC-AI/Medmarks

2605.01415 2026-05-05 cs.AI cs.CY

AI Safety as Control of Irreversibility: A Systems Framework for Decision-Energy and Sovereignty Boundaries

Wesley Shu, Peng Wei

详情
英文摘要

Recent AI systems compress the distance between capability growth and capability deployment. Earlier high-risk technologies were slowed by capital intensity, physical bottlenecks, organizational inertia, and specialized supply chains. By contrast, AI capabilities can be copied, invoked, embedded in workflows, and scaled across institutions at low marginal cost. This paper argues that declining deployment friction changes the safety problem at its root. Safety is not only local output correctness or preference alignment, but the control of irreversibility under rising decision density. The paper formalizes this claim through decision-energy density: the rate-weighted capacity of a node to generate, evaluate, select, and execute consequential decisions. It then identifies three sovereignty boundaries that determine whether AI remains an amplifier within a human-governed system or becomes a de facto control center: irreversible decision authority, physical resource mobilization authority, and self-expansion authority. The model shows how efficiency pressure, path dependence, scale feedback, and weak boundary constraints concentrate decision-energy in the most efficient node. This concentration can diffuse responsibility and raise the probability of irreversible system-level loss even when local per-action error rates remain low. The main result is a boundary stabilization theorem. It shows that safety need not require proving that advanced systems are always correct. Instead, it requires institutional and technical designs that prevent irreversible power from being released by a single high-efficiency node. The paper reframes AI safety as layered control, authorization, and externally reviewable limits, linking alignment, security engineering, organizational economics, and institutional design.

2605.01403 2026-05-05 cs.LG

Rethinking Multi-Label Node Classification: Do Tuned Classic GNNs Suffice?

Yuxuan Xiao, Shengzhong Zhang

详情
英文摘要

Multi-label node classification (MLNC) has recently been addressed by increasingly complex label-aware designs that explicitly model node-label interactions and inter-label dependencies.However, it remains unclear whether the advantages of these methods truly stem from their specialized designs, or simply from insufficiently optimized baselines. In this paper, we revisit MLNC from a strong-baseline perspective and investigate whether carefully tuned classic full-graph GNNs can already serve as strong solutions to this task. We systematically study several representative backbones, including GCN, SSGConv, and GCNII, and optimize them using standard yet effective techniques such as normalization, dropout, and residual connections. Experiments on five representative benchmark datasets show that our tuned baselines outperform representative specialized methods on four datasets and achieve state-of-the-art performance in multiple settings. These results indicate that careful tuning of classic backbones is a highly influential but often overlooked factor in MLNC, and highlight the need for more rigorous strong-baseline evaluation in future research on multi-label graph learning.

2605.01399 2026-05-05 cs.CL cs.AI cs.IR

Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

Sangkwon Park, Donghun Kang, Jisoo Mok, Sungroh Yoon

Comments ACL 2026 Main Conference

详情
英文摘要

The conventional Retrieval-Augmented Generation (RAG) paradigm of injecting raw retrieved texts into the Large Language Model (LLM)'s context often results in suboptimal integration of retrieved information. This paper proposes to bridge retrieval results and the LLM's reasoning ability through Verbal Annotations, analytic narratives that explicitly articulate the logical connection between a search query and retrieved contexts. Our empirical investigation reveals the potential of Verbal Annotations to substantially enhance the LLM's ability to generate accurate, contextually-grounded responses. Motivated by this finding, we introduce Verbal-R3, a novel agentic RAG framework that consists of a Generator and a Verbal Reranker. The Generator performs iterative retrieval and reasoning, while the Verbal Reranker returns relevance scores and Verbal Annotations to guide the reasoning and answering process of the Generator. The inference process of Verbal-R3 is further refined through relevance-guided test-time scaling, which efficiently allocates test-time compute for effective trajectory expansion. Verbal-R3 achieves state-of-the-art performance on complex Question Answering benchmarks, validating the effectiveness of the proposed framework.

2605.01393 2026-05-05 cs.CV

Recall to Predict: Grounding Motion Forecasting in Interpretable Motion Bank

Abhishek Vivekanandan, Ahmed Abouelazm, J. Marius Zöllner

Comments Sumitted for PeerReview

详情
英文摘要

Motion forecasting often requires trading interpretability for predictive accuracy. Standard anchor-based architectures rely on opaque latent queries that are highly prone to latent collapse, or naive trajectory sampling that limits multi-modal diversity. We propose an end-to-end differentiable framework that grounds predictions in a comprehensive "motion bank", a structured embedding space of physically realizable trajectories constructed via contrastive learning. Rather than regressing paths from a blank slate, our architecture dynamically retrieves explicit motion priors using a novel Anchor Retrieval Layer. This module adapts orthogonally initialized queries via a Dual-Level Gated Cross-Attention mechanism and executes discrete trajectory selection using a Straight-Through Gumbel-Softmax estimator to preserve continuous gradient flow. The retrieved semantically grounded anchors are then geometrically refined by a DETR-style decoder, optimized jointly with a Winner-Takes-All (WTA) kinematic Gaussian Mixture Model (GMM), a latent diversity penalty, and a soft-min weighted endpoint loss. By strictly conditioning the decoding phase on diverse, interpretable motion primitives, our approach eliminates the "black box" of standard latent queries while achieving competitive multi-modal accuracy on the Argoverse 2 and Waymo Open Motion datasets. Code is available at: https://github.com/abviv/recall2predict

2605.01383 2026-05-05 cs.LG cond-mat.dis-nn physics.comp-ph

Sequential Learning and Catastrophic Forgetting in Differentiable Resistor Networks

Maniru Ibrahim

详情
英文摘要

Differentiable physical networks provide a simple setting in which learning can be studied through the interaction between trainable parameters and physical equilibrium constraints. We investigate sequential learning in differentiable resistor networks governed by Kirchhoff's laws. Although individual input--output mappings can be learned by gradient-based adjustment of edge conductances, sequential training on conflicting tasks produces catastrophic forgetting. We show that forgetting is controlled by task conflict and by the degree of adaptation to the new task. Uniform anchoring and normalised gradient-weighted anchoring reduce forgetting only by increasing the final loss on the new task, giving a clear forgetting--adaptation trade-off. We also show that forgetting is associated with localised conductance changes on high-current edges, giving a physical interpretation as reconfiguration of dominant transport pathways. Broader random-task ensembles show that the strongest forgetting occurs when the second task reverses the output ordering imposed by the first task. Finally, comparisons across Erdős--Rényi, small-world, scale-free, and random-geometric graph ensembles show that topology changes the forgetting--adaptation balance. These results position differentiable resistor networks as compact, physically interpretable testbeds for studying continual learning in tunable matter.

2605.01382 2026-05-05 cs.CV cs.AI

Sparse Representation Learning for Vessels

Chinmay Prabhakar, Bastian Wittmann, Paul Büschl, Hongwei Bran Li, Bjoern Menze, Suprosanna Shit

详情
英文摘要

Analyzing human vasculature and vessel-like, tubular structures, such as airways, is crucial for disease diagnosis and treatment. Current methods often rely on small sub-regions or simplified tree-like structures, rendering analysis of entire organ-level networks at clinical resolution computationally challenging. To this end, we propose VAEsselSparse, an efficient encoder-decoder model to obtain a meaningful yet compact representation of the entire organ-level vascular network at sub-millimeter resolution. VAEsselSparse leverages the inherent sparsity of 3D vascular structures via sparse convolutions and attention mechanisms, achieving substantial spatial compression rates of 8 x 8 x 8. We demonstrate superior reconstruction performance compared to dense counterparts and previous methods. Importantly, the resulting latent space retains clinically relevant discriminative features readily usable for classification tasks, such as aneurysm/stenosis or subvariants of the circle of Willis. Moreover, the compact latent space of VAEsselSparse serves as an effective representation for learning vessel-specific priors through generative models, enabling the synthesis of realistic vasculature.

2605.01381 2026-05-05 cs.CL cs.LG

A framework for analyzing concept representations in neural models

Burin Naowarat, Hao Tang, Sharon Goldwater

Comments CoNLL 2026

详情
英文摘要

Understanding how neural models represent human-interpretable concepts is challenging. Prior work has explored linear concept subspaces from diverse perspectives, such as probing and concept erasure. We introduce a unified framework to study these subspaces along two axes: \textit{containment}, which tests if a concept is fully represented in a subspace but not outside it, and \textit{disentanglement}, which tests for isolation from other concepts. In experiments on both text and speech models, we first highlight that concept subspaces may not be uniquely determined, and discuss the implications for concept subspace analysis. Then, we compare properties of concept subspaces estimated using five estimators, proposed in different communities. We find that (1) the choice of estimator impacts the containment and disentanglement properties; (2) the state-of-the-art concept erasure method, LEACE, performs well on both testing axes, but still struggles to generalize to unseen data; and (3) in HuBERT speech representations, phone information is both contained and disentangled from speaker information, while speaker information is hard to contain in a compact subspace, despite being disentangled from phones.

2605.01376 2026-05-05 cs.AI

A Cellular Doctrine of Morality: Intrinsic Active Precision and the Mind-Reality Overload Dilemma

Ahsan Adeel

详情
英文摘要

Current AI systems, grounded in oversimplified neuroscience, risk eroding the distinction between truth and falsehood. They maximize reward by amplifying attention to information without intrinsic precision mechanisms to assess whether it is valid or worth attending to. This increases both the volume of information and the inherent biases in what the system attends to, whether true, false, or irrelevant. If not corrected, this trend will accelerate, threatening to overload systems and individuals with biased and dubious information and increasing the risk of confusion, poor judgment, and irrational or harmful decisions and behaviour, a condition I term the mind-reality overload dilemma. I argue that this threat may be mitigated by providing the public with access to more advanced AI tools built on the biophysical dynamics of pyramidal neurons underlying awake thought and higher-order cognition. These neurons support an intrinsic active precision mechanism that, rather than merely maximizing reward, uses locally and globally coherent predictions to evaluate the validity and contextual adequacy of evidence before it is attended to or propagated through hierarchies, prioritizing coherence and adequacy before attention.~While this approach does not derive or prescribe moral rules from biology, it may give rise to AI with more "real understanding", helping restore epistemic conditions by reducing information overload and amplifying reliable information, thereby supporting the formation of better-informed beliefs and more coherent judgments that benefit society at large-though no guarantees exist.

2605.01373 2026-05-05 cs.CL cs.AI

Focus on the Core: Empowering Diffusion Large Language Models by Self-Contrast

Jinyuan Feng, Xin Yu, Yiqun Chen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Zhiqiang Pu

详情
英文摘要

The iterative denoising paradigm of Diffusion Large Language Models (DLMs) endows them with a distinct advantage in global context modeling. However, current decoding strategies fail to leverage this capability, typically exhibiting a local preference that overlooks the heterogeneous information density within the context, ultimately degrading generation quality. To address this limitation, we systematically investigate high-information-density (HD) tokens and present two key findings: (1) explicitly conditioning on HD tokens substantially improves output quality; and (2) HD tokens exhibit an early-decoding tendency, converging earlier than surrounding tokens. Motivated by these findings, we propose Focus on the Core \textbf{(FoCore)}, a training-free decoding strategy that utilizes HD tokens in a self-contrast manner, wherein HD tokens are temporarily remasked as negative samples, to guide generation. We further introduce FoCore\_Accelerate \textbf{(FoCore\_A)}, an efficient variant that, upon detecting HD token convergence, performs parallel decoding over stable candidates within a local context window, substantially accelerating generation. Extensive experiments on math, code and logical reasoning benchmarks demonstrate that FoCore consistently improves generation quality and efficiency across both LLaDA and Dream backbones. For instance, on HumanEval, FoCore improves pass@1 from 39.02 to 42.68 over standard Classifier-Free Guidance, while FoCore-A reduces the number of decoding steps by 2.07x and per-sample latency from 20.76s to 8.64s (-58.4\%).

2605.01372 2026-05-05 cs.CL

Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

Ailiang Lin, Zhuoyun Li, Keyu Mao, Kotaro Funakoshi, Manabu Okumura

详情
英文摘要

Large language models (LLMs) have been widely explored for embedding generation. While recent studies show that in-context learning (ICL) effectively enhances the representational capability of LLMs by prepending a few task-related demonstrations, it causes substantial token overhead due to the increased sequence length. In this work, we propose EPIC, a novel embedding-based in-context prompt training strategy that leverages ICL to generate high-quality embeddings while reducing computational burden during both training and inference. This approach replaces discrete text demonstrations with their corresponding continuous embeddings, which not only encourages the LLM to align semantically-related text pairs during contrastive learning, but also requires the model to interpret demonstration embeddings as part of the in-context prompt. Consequently, EPIC-trained models achieve excellent embedding performance both with or without in-context prompts at inference time. Comprehensive experiments demonstrate that our method establishes new state-of-the-art results on the MTEB benchmark, surpassing frontier models trained solely on publicly available retrieval data. Extensive ablation studies further validate the effectiveness and necessity of our mechanism.

2605.01371 2026-05-05 cs.RO

ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

Daoxuan Zhang, Ping Chen, Jianyi Zhou, Shuo Yang

Comments 20 pages, 7 figures

详情
英文摘要

The rapid advancement of Multimodal Large Language Models (MLLMs) has empowered Unmanned Aerial Vehicle (UAV) with exceptional capabilities in spatial reasoning, semantic understanding, and complex decision-making, making them inherently suited for UAV Search and Rescue (SAR). However, existing UAV SAR research is dominated by traditional vision and path-planning methods and lacks a comprehensive and unified benchmark for embodied agents. To bridge this gap, we first propose the novel task of \textbf{Embodied Search and Rescue (ESAR)}, which requires aerial agents to autonomously explore complex environments, identify rescue clues, and reason about victim locations to execute informed decision-making. Additionally, we present \textbf{ESARBench}, the first comprehensive benchmark designed to evaluate MLLM-driven UAV agents in highly realistic SAR scenarios. Leveraging Unreal Engine 5 and AirSim, we construct four high-fidelity, large-scale open environments mapped directly from real-world Geographic Information System (GIS) data to ensure photorealistic landscapes. To rigorously simulate actual rescue operations, our benchmark incorporates dynamic variables including weather conditions, time of day, and stochastic clue placement. Furthermore, we create a dataset of 600 tasks modeled after real-world rescue cases and propose a robust set of evaluation metrics. We evaluate diverse baselines, ranging from traditional heuristics to advanced ground and aerial MLLM-based ObjectNav agents. Experimental results highlight the challenges in ESAR, revealing critical bottlenecks in spatial memory, aerial adaptation, and the trade-off between search efficiency and flight safety. We hope ESARBench serves as a valuable resource to advance research on Embodied Search and Rescue domain. Source code and project page: https://4amgodvzx.github.io/ESAR.github.io.

2605.01368 2026-05-05 cs.RO

Assistance Without Interruption: A Benchmark and LLM-based Framework for Non-Intrusive Human-Robot Assistance

Yuedi Zhang, Shuanghao Bai, Wanqi Zhou, Haoran Zhang, Qi Zhang, Zhirong Luan, Badong Chen

详情
英文摘要

Human-robot interaction (HRI) has long studied how agents and people coordinate to achieve shared goals. In this work, we formalize and benchmark the non-intrusive assistance as an independent paradigm of HRI, where a robot proactively supports a human's ongoing multi-step activities while strictly avoiding interruptions. Unlike conventional HRI tasks that rely on direct commands, explicit negotiation, or proactive interventions based on user habits and history, our task treats the human's plan as the primary process and formulates assistance as a joint decision over when to act and what to do. To systematically evaluate this problem, we establish a simulation benchmark, NIABench, along with new metrics tailored to the non-intrusive assistance task. We further propose a hybrid architecture that integrates an LLM with a scoring model. The scoring model first applies semantic retrieval to prune large candidate action sets, and then a ranker evaluates human-step and robot-action pairs, enabling reasoning over timing and cross-step dependencies. Comprehensive experiments on both NIABench and real-world scenarios demonstrate that our method achieves proactive, non-intrusive assistance that reduces human effort while preserving task effectiveness.

2605.01365 2026-05-05 cs.CV cs.RO

VoxAfford: Multi-Scale Voxel-Token Fusion for Open-Vocabulary 3D Affordance Detection

Haowen Sun, Shaolong Zhang, Mingyang Li, Chengzhong Ma, Xinzhe Chen, Qiongjie Cui, Xingyu Chen, Zeyang Liu, Xuguang Lan

详情
英文摘要

Open-vocabulary 3D affordance detection requires localizing interaction regions on point clouds given novel affordance descriptions. Recent methods extend multimodal large language models (MLLMs) with special output tokens that are decoded into segmentation masks. However, these tokens are produced through autoregressive generation, which models sequential dependencies rather than spatial neighborhood relations, leaving them semantically rich but spatially impoverished for 3D localization. We propose Voxel-enhanced Affordance detection (VoxAfford), which bypasses this bottleneck by injecting multi-scale geometric features from a frozen pre-trained 3D VQVAE encoder into the output tokens after generation. Each output token uses its affordance semantics as a query to retrieve relevant geometric patterns from its paired voxel scale via cross-attention, with a learned compatibility gate controlling the injection strength. The enhanced tokens are then aggregated into a spatially-aware affordance prompt through semantic-conditioned attention and propagated alongside per-point features to generate the final mask. Experiments on open-vocabulary affordance detection tasks show that VoxAfford achieves state-of-the-art performance with approximately an 8% improvement in mIoU, and real robot experiments confirm zero-shot transfer to novel objects.

2605.01364 2026-05-05 cs.LG cs.SY eess.SY

Toward a foundational thermal model for residential buildings

Ting-Yu Dai, Kingsley Nweye, Dev Niyogi, Zoltan Nagy

详情
英文摘要

The building energy community lacks a foundational thermal model, i.e., a single pretrained model capable of generalizing across diverse buildings, climates, and control strategies without building-specific calibration. Achieving this vision requires architectural principles that capture universal thermal dynamics rather than memorizing building-specific patterns. We take a step toward this goal by presenting a physics-informed transformer architecture that embeds domain knowledge, e.g., derivative enrichment and Euler-based numerical integration, into a decoder-only framework. We incorporate static building features extracted from simulation models and employ Rotary Position Embedding attention to capture temporal dependencies. Evaluated on the CityLearn dataset spanning 247 residential buildings across three climate zones, our model achieves one-step prediction accuracy (RMSE of 0.30°C in Texas, 0.29°C in Vermont) while outperforming both traditional baselines and fine-tuned Time-Series Foundation Models. We also demonstrate zero-shot transferability: models trained on as few as two buildings generalize to unseen buildings and climate zones without fine-tuning. Despite the limitation of simulated residential buildings, our results establish physics-informed architectural principles as a promising foundation for universal building thermal models.

2605.01359 2026-05-05 cs.AI

Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid

Alessio Donvito, Antonio Lieto

Comments 35 pages

详情
英文摘要

In this paper, we employ the Minimal Cognitive Grid (MCG), a framework created to evaluate the cognitive plausibility of artificial systems, to offer a systematic assessment of leading computational models of analogy and metaphor, including the Structure-Mapping Engine (SME), CogSketch, METCL, and Large Language Models (LLMs). We present a formal and quantitative operationalization of the MCG framework and, through the analysis of its three main dimensions (Functional/Structural Ratio, Generality, and Performance Match), examine how well each system aligns with standard cognitive theories of the modeled phenomena, thus allowing for comparison of the models with respect to their cognitive plausibility, according to consistent and generalizable mathematical criteria.

2605.01358 2026-05-05 cs.LG

PACE: Parameter Change for Unsupervised Environment Design

Fang Yuan, Quanjun Yin, Siqi Shen, Yuxiang Xie, Junqiang Yang, Long Qin, Junjie Zeng, Qinglun Li

详情
英文摘要

Unsupervised Environment Design (UED) offers a promising paradigm for improving reinforcement learning generalization by adaptively shaping training environments, but it requires reliable environment evaluation to remain effective. However, existing UED methods evaluate environments using indirect proxy signals such as regret, value-based errors, or Monte Carlo, which suffer from bias, high variance, or substantial computational overhead and fail to reflect agent realized learning progress. To address these limitations, we propose Parameter Change Environment Design (PACE), which evaluates an environment through the policy parameter change induced by training on that environment, directly grounding environment selection in realized learning progress. Specifically, PACE assigns environment value using a first-order approximation of the policy optimization objective, where the improvement induced by an environment is proportional to the squared L2 norm of the corresponding parameter update, enabling low-variance and computation-efficient evaluation without additional rollouts. Experiments on MiniGrid and Craftax show that PACE consistently outperforms established UED baselines, achieving higher IQM and smaller Optimality Gap on OOD evaluations, including an IQM of 96.4% and an Optimality Gap of 17.2% on MiniGrid.

2605.01357 2026-05-05 cs.CL

On Stable Long-Form Generation: Benchmarking and Mitigating Length Volatility

Zhitao He, Haolin Yang, Rui Min, Zeyu Qin, Yi R. Fung

详情
英文摘要

Large Language Models (LLMs) excel at long-context understanding but exhibit significant limitations in long-form generation. Existing studies primarily focus on single-generation quality, generally overlooking the volatility of the output. This volatility not only leads to significant computational costs but also severely impacts the models' reliable application. To address this gap, our work unfolds in three stages: benchmarking, probing, and mitigation. We first propose the VOlatility in Long-form Text Benchmark (VOLTBench), a novel heterogeneous-task benchmark designed to systematically quantify the length volatility of long-form generation. Subsequently, by analyzing attention traces, we conduct an in-depth probe to identify several common internal patterns that cause this volatility. Finally, to mitigate long-form output volatility, we propose Stable Generation via Logits Boosting (GLoBo), a lightweight decoding-stage optimization strategy, designed to significantly enhance both the length accuracy and stability of long-form generation without additional training. Extensive experiments on VOLTBench provide the first systematic confirmation of severe long-form output instability in mainstream models and validate that our proposed method successfully improves the mean output length of the base model by 148% and reduces the length volatility by 69%, while maintaining high generation quality.

2605.01356 2026-05-05 cs.LG cs.AI

Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data

Ruiqi Xue, Lei Yuan, Kainuo Cheng, Jing-Wen Yang, Yang Yu

详情
英文摘要

Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples. Under this limitation, existing approaches often treat all data as uniformly safe, overlooking safe-but-infeasible states - states that currently satisfy constraints but inevitably violate them within a few steps - leading to deployment failures. Drawing inspiration from the concept of knowledge-data integration, we leverage large language models (LLMs) to incorporate natural language knowledge into the policy to address this challenge. Specifically, we propose PROCO, a model-based offline safe reinforcement learning (RL) framework tailored to datasets largely free of violations. PROCO first learns a dynamics model from offline data and constructs a conservative cost function by grounding natural-language knowledge of unsafe states in LLMs, enabling risk estimation even without observed violations. Using the cost function and learned model, PROCO performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning. Across a range of Safety-Gymnasium tasks with exclusively safe or minimally risky training data, PROCO integrates seamlessly with a variety of offline safe RL algorithms and consistently demonstrates reduced constraint violations and improved safety performance compared to both the original methods and other behavior cloning baselines.

2605.01350 2026-05-05 cs.CL

LLM Output Detectability and Task Performance Can be Jointly Optimized

Koshiro Saito, Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki

Comments Preprint. Under review

详情
英文摘要

Detecting machine-generated text is essential for transparency and accountability when deploying large language models (LLMs). Among detection approaches, watermarking is a statistically reliable method by design -- it embeds detectable signals into LLM outputs by biasing their token distributions. However, it has been reported that watermarked LLMs often perform worse on downstream tasks. We propose PUPPET, a framework that fine-tunes an LLM via reinforcement learning to generate text that is both more detectable and better performing on downstream tasks. We use two reward functions: a detector that outputs a machine-class likelihood and an evaluator that measures a task-specific metric. Experiments on long-form QA, summarization, and essay writing show that LLMs trained with PUPPET achieve high detectability competitive with watermarking methods while outperforming them on downstream tasks. The analysis shows that this optimization can be performed efficiently with only a few thousand samples in 1--2 GPU hours. Moreover, these gains are consistent across out-of-domain tasks, different LLM families, and model sizes, and are even robust to paraphrasing attacks.

2605.01347 2026-05-05 cs.CL cs.AI cs.LG

MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate

Jianze Wang, Ying Liu, Jinlong Chen, Xuchun Hu, Qilong Zhang, Yu Cao, Jun Wang, Hua Yang, Yong Xie, Qianglong Chen

Comments Preprint. 9-page main paper + appendix. 8 figures, 7 tables. Code: https://github.com/chiefovoavicii/MAD-OPD

详情
英文摘要

On-policy distillation (OPD) trains a student on its own trajectories under token-level teacher supervision, but existing methods are capped by a single-teacher capability ceiling: when the teacher errs, the student inherits the error. OPD also remains largely unexplored in agentic tasks, where per-step errors compound across long trajectories and destabilize training. We propose MAD-OPD (Multi-Agent Debate-driven On-Policy Distillation), which breaks this ceiling by recasting the distillation teacher as a deliberative collective of teachers that debate over the student's on-policy state; the debate produces an emergent collective intelligence that supplies token-level supervision, with each teacher's contribution weighted by its post-debate confidence. To extend OPD to agentic tasks, we also introduce On-Policy Agentic Distillation (OPAD), which adds step-level sampling to stabilize training under multi-step error compounding. We additionally derive a task-adaptive divergence principle, selecting JSD (Jensen-Shannon divergence) for agentic stability and reverse KL (Kullback-Leibler) divergence for code generation, and verify it both theoretically and empirically. Across six teacher-student configurations (Qwen3 and Qwen3.5; 1.7B-14B students, 8B-32B teachers) and five agentic and code benchmarks, MAD-OPD ranks first across all six configurations; on the 14B+8B$\to$4B setting it lifts the agentic average by $+2.4\%$ and the code average by $+3.7\%$ over the stronger single-teacher OPD.

2605.01346 2026-05-05 cs.CV

CHASE: Competing Hypotheses for Ambiguity-Aware Selective Prediction

Kartik Jhawar, Yuhao Geng, Atul N. Parikh, Lipo Wang

详情
英文摘要

Standard selective prediction methods typically estimate uncertainty from the output of a single predictive branch. While effective for general uncertainty estimation, these approaches often struggle under partial observability, where local temporal evidence can be contradictory and standard confidence scores become misleading. We introduce CHASE (Competing Hypotheses for Ambiguity-Aware Selective Prediction), a selective prediction framework that explicitly compares structured temporal explanations to determine whether to commit to a decision or abstain. Because genuine ambiguity causes the score gap between competing hypotheses to collapse, CHASE optimizes a ranking-aware selector over these hypothesis margins to globally separate safe commitments from fundamentally uncertain ones. We evaluate this framework on the problem of hidden connectivity inference, utilizing a controlled, physically grounded simulator inspired by the dynamics of giant unilamellar vesicles (GUVs), alongside zero-shot qualitative transfer (without retraining or fine tuning) to representative real GUV videos. Our experiments demonstrate that explicitly reasoning over competing hypotheses provides a superior balance of metrics. Compared to canonical uncertainty baselines, CHASE achieves statistically significant gains in overall no-abstain accuracy, three-way accuracy, and overall ambiguity-aligned abstention (at 80% coverage). Specifically, it yields up to an 11.0% relative mean improvement in overall alignment, alongside up to an 8.8% relative boost in three-way accuracy in the very-high ambiguity regime. By maintaining a selective risk boundary strictly at par with the best baselines at 80% coverage, and reducing overall risk by 9.9% at 90% coverage, this framework offers a more reliable approach to decision-making under structured ambiguity.

2605.01340 2026-05-05 cs.RO eess.SP

Terrain Perception for Agricultural UAVs in Complex Farmland via Rotating mmWave Radar

Zhihao Zhan, Le Tao, Shaobin Li, Chenxin Fang, Xingrui Yang, Liang Li, Rui Fan, Yuhang Ming

详情
英文摘要

Accurate terrain perception is essential for terrain-following flight of agricultural unmanned aerial vehicles (UAVs), yet remains challenging in real-world farmland due to occlusions, complex terrain geometry, and environmental disturbances. Millimeter-wave (mmWave) radar is a promising sensing modality for this task due to its robustness to adverse conditions; however, existing UAV-mounted radar systems rely on fixed field of view (FoV) and terrain extraction methods designed for dense LiDAR data, leading to incomplete and unreliable terrain estimation. To address these limitations, we present a low-cost rotating mmWave radar-enabled terrain perception framework for agricultural UAVs operating in complex farmland environments. Specifically, a mechanically rotating sensing design is introduced to enlarge spatial coverage and improve terrain observability beyond the limitations of fixed-view radar under dynamic low-altitude flight. Building upon this sensing capability, we further design a pose-consistent terrain reconstruction pipeline tailored for sparse, noisy, and partially observable radar data, enabling reliable ground extraction and continuous terrain surface estimation in challenging agricultural scenarios. The complete system is deployed on a real agricultural UAV platform and comprehensively evaluated through extensive field experiments. Experimental results demonstrate improved terrain coverage and estimation accuracy, achieving an F1 score of 94.42 for ground segmentation, while the closest rival only achieves 90.48. Thus, leading to more robust terrain following flight.

2605.01339 2026-05-05 cs.LG

Robust Parameter Learning for Uncertain MDPs

Yannik Schnitzer, Alessandro Abate, David Parker

详情
英文摘要

Learning-based approaches to verifying unknown Markov decision processes (MDPs) often employ uncertain MDPs. These models use, for example, confidence intervals to capture transition uncertainty and allow synthesis of policies that are robust to this uncertainty. However, this approach typically quantifies uncertainty independently for individual transition probabilities, ignoring dependencies due to shared latent quantities. We propose to learn such models using parametric MDPs (pMDPs), where transition probabilities are expressions over a set of parameters. We project statistical uncertainty from empirical transition frequencies onto the pMDP's parameter space, yielding a probably approximately correct (PAC) uncertainty model for the underlying MDP that respects the algebraic dependencies between transitions. The resulting models are algorithmically challenging to solve, so we propose a hierarchy of sound polytopic outer approximations of the induced confidence set. We implement and evaluate our approach, demonstrating substantially tighter uncertainty estimates than classical interval-based uncertain MDP learning techniques.

2605.01338 2026-05-05 cs.AI

DiagramNet: An End-to-End Recognition Framework and Dataset for Non-Standard System-Level Diagrams

Jincheng Lou, Ruohan Xu, Jiapeng Li, Junyin Pi, Runzhe Tao, Weijian Fan, Xiao Tan, Guojie Luo, Yibo Lin

Comments 13 pages, 7 figures. Preprint

详情
英文摘要

System-level diagrams encode the architectural blueprint of chip design, specifying module functions, dataflows, and interface protocols. However, non-standardized symbols and the scarcity of structured training data hinder existing multimodal large language models (MLLMs) from recognizing these diagrams. To address this gap, we introduce DiagramNet, the first multimodal dataset for system-level diagrams, comprising 10,977 connection annotations and 15,515 chain-of-thought QA pairs across four tasks: Listing, Localization, Connection, and Circuit QA. Building on this dataset, we propose a progressive training pipeline together with a decoupled multi-agent workflow that decomposes complex visual reasoning into Perception, Reasoning, and Knowledge stages. On the DiagramNet benchmark, integrating our 3B-parameter model with the proposed workflow surpasses the 2025 EDA Elite Challenge winner and outperforms GPT-5, Claude-Sonnet-4, and Gemini-2.5-Pro by over 2x in end-to-end evaluation. Notably, the workflow generalizes beyond our model, boosting Task 1 performance by 128.7x for Gemini-2.5-Pro and 12.4x for GPT-5. Furthermore, with only 60 images for detector adaptation, the method transfers effectively to AMSBench, achieving zero-shot connectivity reasoning on par with GPT-5 and Claude-Sonnet-4 while surpassing the AMS state-of-the-art method Netlistify.